
Extracting Href Attr Or Converting Node To Character List

I am trying to extract some information from a website:

library(rvest)
library(XML)
url <- 'http://wiadomosci.onet.pl/wybory-prezydenckie/xcnpc'
html <- html(url)
nodes <- htm

Solution 1:

Try searching inside nodes' children:

nodes <- html_nodes(html, ".listItemSolr") 

sapply(html_children(nodes), function(x){
  html_attr( x$a, "href")
})

Update

Hadley suggested a more elegant version using pipes:

html %>%  
  html_nodes(".listItemSolr") %>% 
  html_nodes(xpath = "./a") %>% 
  html_attr("href")
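
To try the pipeline without depending on the live page, here is a minimal sketch that runs it against a small inline HTML fragment. The fragment, its URLs, and the variable names `page` and `links` are made up for illustration; only the `.listItemSolr` class comes from the question. It also uses `read_html()`, which replaced `html()` in newer rvest releases.

```r
library(rvest)

# Hypothetical fragment standing in for the real page.
page <- read_html(paste0(
  '<div class="listItemSolr"><a href="/news/1">First</a></div>',
  '<div class="listItemSolr"><a href="/news/2">Second</a></div>',
  '<div class="other"><a href="/ignored">Other</a></div>'
))

links <- page %>%
  html_nodes(".listItemSolr") %>%   # elements with the target class
  html_nodes(xpath = "./a") %>%     # their direct <a> children only
  html_attr("href")                 # character vector of href values

print(links)
```

The link inside the `other` div is skipped because the XPath step `./a` is evaluated relative to each `.listItemSolr` element.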

Solution 2:

The getHTMLLinks() function from the XML package can do virtually all the work for us; we just have to write the XPath query. Here we search every attribute in the document for one whose value contains "listItemSolr", step up to the element that owns that attribute, and then query the href of its child a elements.

getHTMLLinks(url, xpQuery = "//@*[contains(., 'listItemSolr')]/../a/@href")

In xpQuery we are doing the following:

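  • //@*[contains(., 'listItemSolr')] matches every attribute, on any node, whose value contains listItemSolr
  • /.. selects the parent of that attribute, i.e. the element that carries it
  • /a/@href gets the href attribute of that element's direct a children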
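
The same query can be checked offline. The sketch below parses a small inline fragment with htmlParse() and passes the parsed document to getHTMLLinks() instead of a URL. The fragment and the names `html_text`, `doc`, `query`, and `links` are assumptions made for illustration.

```r
library(XML)

# Hypothetical fragment standing in for the real page.
html_text <- paste0(
  '<html><body>',
  '<div class="listItemSolr"><a href="/news/1">First</a></div>',
  '<div class="listItemSolr"><a href="/news/2">Second</a></div>',
  '<div class="other"><a href="/ignored">Other</a></div>',
  '</body></html>'
)

# asText = TRUE tells htmlParse() the string is markup, not a file or URL.
doc <- htmlParse(html_text, asText = TRUE)

query <- "//@*[contains(., 'listItemSolr')]/../a/@href"

# unname() drops any attribute names carried over from the XPath result.
links <- unname(getHTMLLinks(doc, xpQuery = query))

print(links)
```

Because `//@*` inspects every attribute, an id or href that happens to contain the string would match as well. If only the class attribute matters, a narrower query such as `//*[contains(@class, 'listItemSolr')]/a/@href` should select the same links here.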
