web crawler - R: Webscraping irregular blocks of values
I am attempting to webscrape a webpage that has irregular blocks of data, organized in a manner that is easy to spot by eye. Let's imagine we are looking at Wikipedia. If I scrape the text of the article at the following link, I end up with 33 entries. If I instead grab just the headers, I end up with only 7 (see code below). This result is no surprise, since we know that some sections of an article have multiple paragraphs while others have only one paragraph or no text at all.
My question, though, is how do I associate the headers with their texts? If there were the same number of paragraphs per header, or a fixed multiple, this would be trivial.
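    library(rvest)

    # read_html() is the current name for what older rvest called html()
    wiki <- read_html("https://en.wikipedia.org/wiki/web_scraping")

    # paragraphs, plus list items in a list that directly follows a paragraph
    wikitext <- wiki %>%
      html_nodes('p+ ul li , p') %>%
      html_text(trim = TRUE)

    # section headings
    wikiheading <- wiki %>%
      html_nodes('.mw-headline') %>%
      html_text(trim = TRUE)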
This will give you a list called content, whose elements are named according to the headings and contain the corresponding text.
    library(rvest)  # assumes version 0.2.0.9 is installed; not on CRAN
    library(xml2)   # provides xml_children(), xml_name(), xml_text()

    # read_html() is the current name for what older rvest called html()
    wiki <- read_html("https://en.wikipedia.org/wiki/web_scraping")

    # node set that contains both the headings and the text
    wikicontent <- wiki %>%
      html_nodes("div[id='mw-content-text']") %>%
      xml_children()

    # locate the positions of the headings
    headings <- sapply(wikicontent, xml_name)
    headings <- c(grep("h2", headings), length(headings) - 1)

    # loop through the headings, keeping the stuff in between them as content
    content <- list()
    for (i in 1:(length(headings) - 1)) {
      foo <- wikicontent[headings[i]:(headings[i + 1] - 1)]
      foo.title <- xml_text(foo[[1]])
      foo.content <- xml_text(foo[-c(1)])
      content[[i]] <- foo.content
      names(content)[i] <- foo.title
    }
The key is spotting that the mw-content-text node has all the things you want as children.
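Once the headings and the text are siblings in one ordered node set, the grouping step can also be done without an explicit loop. Below is a minimal base R sketch of that idea, not the code from the answer above: it swaps the loop for cumsum() and split(). The tags and texts vectors are made-up stand-ins for what sapply(wikicontent, xml_name) and xml_text(wikicontent) would return, so it runs without fetching the page.

```r
# Hypothetical stand-ins for the tag names and text of the child nodes;
# on the real page these would come from xml_name() and xml_text().
tags  <- c("p", "h2", "p", "ul", "h2", "h2", "p")
texts <- c("intro", "History", "para 1", "list 1",
           "Empty section", "Legal", "para 2")

is_heading <- tags == "h2"

# Every heading starts a new group; nodes before the first heading get group 0.
group <- cumsum(is_heading)

# Keep only the non-heading nodes that sit under some heading.
keep <- !is_heading & group > 0

# Using a factor with one level per heading keeps sections that have no text.
grouped <- split(
  texts[keep],
  factor(group[keep], levels = seq_len(sum(is_heading)))
)

# Name each group after its heading text.
names(grouped) <- texts[is_heading]
```

With this toy input, grouped[["History"]] holds both of its text entries, and the heading with nothing under it is kept as an empty character vector rather than dropped, which covers the "no paragraph text" case from the question.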