Extracting Href Attr Or Converting Node To Character List
I am trying to extract some information from a website:
library(rvest)
library(XML)
url <- 'http://wiadomosci.onet.pl/wybory-prezydenckie/xcnpc'
html <- html(url)
nodes <- htm
Solution 1:
Try searching inside nodes' children:
nodes <- html_nodes(html, ".listItemSolr")
sapply(html_children(nodes), function(x){
  html_attr(x$a, "href")
})
Update
Hadley suggested using elegant pipes:
html %>%
  html_nodes(".listItemSolr") %>%
  html_nodes(xpath = "./a") %>%
  html_attr("href")
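To try the piped version without hitting the live site, here is a minimal sketch on an inline HTML snippet. The markup and the example.com links are made up for illustration, and read_html() is used in place of html(), which newer rvest releases deprecate.

```r
library(rvest)

# Hypothetical markup imitating the structure the answers assume:
# each .listItemSolr element has a direct <a> child holding the link.
page <- read_html('
  <div class="listItemSolr"><a href="http://example.com/one">One</a></div>
  <div class="listItemSolr"><a href="http://example.com/two">Two</a>
    <span><a href="http://example.com/nested">Nested</a></span></div>
  <div class="other"><a href="http://example.com/skip">Skip</a></div>')

links <- page %>%
  html_nodes(".listItemSolr") %>%   # every element with that class
  html_nodes(xpath = "./a") %>%     # only the direct <a> children of each
  html_attr("href")                 # href values as a character vector

print(links)
```

Because "./a" is relative to each matched node, the link nested inside the span and the link under the other class are both left out.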
Solution 2:
The getHTMLLinks() function from the XML package can do virtually all the work for us; we just have to write the XPath query. Here we query all node attributes to find any whose value contains "listItemSolr", step up to the element that owns that attribute, and then take the href of its a children.
getHTMLLinks(url, xpQuery = "//@*[contains(., 'listItemSolr')]/../a/@href")
In xpQuery we are doing the following:
//@*[contains(., 'listItemSolr')] : query all node attributes for listItemSolr
/.. : select the parent node
/a/@href : get the href links
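The three steps of the query can be inspected one at a time. The sketch below is an illustration only: it uses xml2's xml_find_all() rather than getHTMLLinks(), and runs on a made-up inline document, so the element names and links are assumptions.

```r
library(xml2)

# Hypothetical document: two items carry the class, one does not.
doc <- read_html('
  <ul>
    <li class="listItemSolr"><a href="/a">A</a></li>
    <li class="listItemSolr"><a href="/b">B</a></li>
    <li class="plain"><a href="/c">C</a></li>
  </ul>')

base <- "//@*[contains(., 'listItemSolr')]"

# Step 1: every attribute whose value contains the string
attrs <- xml_find_all(doc, base)

# Step 2: the elements that own those attributes
parents <- xml_find_all(doc, paste0(base, "/.."))

# Step 3: href attributes of the <a> children of those elements
hrefs <- xml_text(xml_find_all(doc, paste0(base, "/../a/@href")))

print(xml_name(parents))
print(hrefs)
```

Note that contains() is a substring test over every attribute, so an unrelated attribute whose value happens to include "listItemSolr" would match as well; restricting the first step to @class is a stricter alternative.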