I came across an article by 数博思达, "A simple Shiny interface for fetching stock information", which mentions that the stock data served by Sina comes in a non-standard format, so downloading it calls for a lot of data cleaning. In fact, rvest can extract the relevant data quite easily.

    > library(rvest)
    > url = "http://biz.finance.sina.com.cn/stock/flash_hq/kline_data.php?symbol=sh600000&end_date=20121231&begin_date=20111231"
    > html_session(url)
    <session> http://biz.finance.sina.com.cn/stock/flash_hq/kline_data.php?symbol=sh600000&end_date=20121231&begin_date=20111231
      Status: 200
      Type:   text/html
      Size:   20574
    Warning message:
    Failed to parse headers:
    SINA-LB:aGEuOTEuZzEucXhnLmxiLnNpbmFub2RlLmNvbQ==
    SINA-TS:ZDllODk0Y2UgMCAwIDAgMTYgMzUK
    # Indeed, the response is not well-formed
    > dat = do.call(rbind, url %>% html() %>% html_nodes("content") %>% html_attrs()) %>% as.data.frame
    Warning message:
    Failed to parse headers:
    SINA-LB:aGEuOTEuZzEucXhnLmxiLnNpbmFub2RlLmNvbQ==
    SINA-TS:ZGJlODk0Y2UgMCAwIDAgMTUgMgo=
    > head(dat)
               d     o     h     c     l       v bl
    1 2012-01-04 8.540 8.560 8.410 8.390  342014
    2 2012-01-05 8.470 8.820 8.650 8.470 1321162
    3 2012-01-06 8.630 8.780 8.710 8.620  617787
    4 2012-01-09 8.720 8.990 8.950 8.680  801362
    5 2012-01-10 8.950 9.100 9.070 8.880  720046
    6 2012-01-11 9.050 9.100 9.000 8.980  492612

The response is malformed, but the information we need comes through: each record is a content node whose attributes hold the values, so collecting the attributes with html_attrs() and stacking them with rbind yields one row per trading day.
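The data frame built this way holds every column as text, so a type conversion step is usually still needed before any analysis. The following is a minimal sketch of that step, not the original author's code. It assumes rvest 1.0.0 or later, where `read_html()` and `html_elements()` replace the older `html()` and `html_nodes()` calls used above (and `session()` replaces `html_session()`). To stay runnable offline it parses a small hand-written sample instead of the live URL; the `sample_xml` string and its empty `bl` attribute are assumptions inferred from the printed output, not a captured response.

```r
library(rvest)

# Hypothetical sample that mimics the attribute layout seen in the output above;
# the real response from the server may differ in detail.
sample_xml <- '<control>
  <content d="2012-01-04" o="8.540" h="8.560" c="8.410" l="8.390" v="342014" bl="" />
  <content d="2012-01-05" o="8.470" h="8.820" c="8.650" l="8.470" v="1321162" bl="" />
</control>'

# Each <content> node becomes a named character vector of its attributes.
attrs <- sample_xml %>%
  read_html() %>%
  html_elements("content") %>%
  html_attrs()

# Stack the vectors into a character matrix, then into a data frame.
dat <- as.data.frame(do.call(rbind, attrs), stringsAsFactors = FALSE)

# Everything is still character at this point; convert the date and numeric columns.
dat$d <- as.Date(dat$d)
num_cols <- c("o", "h", "c", "l", "v")
dat[num_cols] <- lapply(dat[num_cols], as.numeric)

str(dat)
```

To run the same steps against the live service, replace `sample_xml` with the `url` variable defined earlier; whether that endpoint still responds in the same format is not something this sketch can guarantee.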