web crawler - R: Webscraping irregular blocks of values


So I am attempting to webscrape a webpage that has irregular blocks of data, organized in a manner that is easy to spot by eye. Let's imagine we are looking at Wikipedia. If I scrape the text of the article at the following link, I end up with 33 entries. If I instead grab just the headers, I end up with only 7 (see code below). This result is no surprise, since we know that some sections of an article have multiple paragraphs while others have only one paragraph or no text at all.

My question, though, is how do I associate the headers with the texts? If there were the same number of paragraphs per header, or a fixed multiple, this would be trivial.

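    library(rvest)

    # html() is the reader in the rvest version used here;
    # later releases deprecate it in favour of read_html()
    wiki <- html("https://en.wikipedia.org/wiki/web_scraping")

    # paragraphs, plus list items that directly follow a paragraph
    wikitext <- wiki %>%
      html_nodes('p+ ul li , p') %>%
      html_text(trim = TRUE)

    # section headings
    wikiheading <- wiki %>%
      html_nodes('.mw-headline') %>%
      html_text(trim = TRUE)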

This will give you a list called content, whose elements are named according to the headings and contain the corresponding text.

    library(rvest) # assumes version 0.2.0.9 is installed, which is not on CRAN
    library(xml2)  # provides xml_children(), xml_name() and xml_text()
    wiki <- html("https://en.wikipedia.org/wiki/web_scraping")

    # node set that contains both the headings and the text
    wikicontent <- wiki %>%
      html_nodes("div[id='mw-content-text']") %>%
      xml_children()

    # locate the positions of the headings
    headings <- sapply(wikicontent, xml_name)
    headings <- c(grep("h2", headings), length(headings) - 1)

    # loop through the headings, keeping the stuff in between them as content
    content <- list()
    for (i in 1:(length(headings) - 1)) {
      foo <- wikicontent[headings[i]:(headings[i + 1] - 1)]
      foo.title <- xml_text(foo[[1]])      # first node in the slice is the heading
      foo.content <- xml_text(foo[-c(1)])  # everything after it is that section's text
      content[[i]] <- foo.content
      names(content)[i] <- foo.title
    }

The key is spotting that the mw-content-text node has all the things you want as children.
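
The same grouping idea can also be sketched without a loop. The snippet below is only an illustration under stated assumptions: it runs on a small made-up HTML string (not the real Wikipedia page) so it works offline, it uses xml2's read_html() rather than html(), and the names page, kids, group and sections are hypothetical. It tags each child node with the index of the most recent h2 via cumsum() and then splits the text on that index.

```r
library(xml2)

# Hypothetical miniature page standing in for the article body (an assumption,
# not the real Wikipedia markup).
page <- read_html(
  "<div id='mw-content-text'>
     <p>Intro text.</p>
     <h2>First</h2>
     <p>Para A.</p>
     <p>Para B.</p>
     <h2>Second</h2>
     <h2>Third</h2>
     <p>Para C.</p>
   </div>"
)

# All element children of the content node, headings and text alike
kids <- xml_children(xml_find_first(page, "//div[@id='mw-content-text']"))
is_heading <- xml_name(kids) == "h2"

# Each node gets the index of the most recent heading (0 = before the first one)
group <- cumsum(is_heading)

# Keep only non-heading nodes that sit under some heading
keep <- !is_heading & group > 0
titles <- xml_text(kids[is_heading])

# Using a factor with one level per heading keeps sections that have no text
sections <- split(xml_text(kids[keep]),
                  factor(group[keep], levels = seq_along(titles)))
names(sections) <- titles
```

In this sketch a heading with nothing under it (here "Second") still appears in the result, as an empty character vector, so the list always has one element per heading regardless of how irregular the blocks are.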

