How To Easily Parse HTML For Consumption As A Service Using Java?

Question

I want to parse an HTML page such as http://www.reddit.com/r/reddit.com/search?q=Microsoft&sort=top and extract only the text of the element which has

Solution 1:

Use Jsoup. It supports jQuery-like CSS selectors. Here's a kickoff example:

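import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

String url = "http://www.reddit.com/r/reddit.com/search?q=Microsoft&sort=top";
Document document = Jsoup.connect(url).get();
// Select every <a class="title"> and print its href resolved to an absolute URL.
for (Element link : document.select("a.title")) {
    System.out.println(link.absUrl("href"));
}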

Result:

http://news.cnet.com/8301-13579_3-10288022-37.html
http://dl.getdropbox.com/u/18264/mspoland.jpg
http://www.reddit.com/r/reddit.com/comments/ar5z1/verizon_stealthily_installed_a_bing_search_app_on/
http://www.grabup.com/uploads/240ccede5360b093dbf298f8946025a5.png
http://www.youtube.com/watch?v=7Ym0tZSWGMc&fmt=34
http://i42.tinypic.com/wv5qar.jpg
http://www.reddit.com/r/technology/comments/8hnya/apple_no_i_dont_want_to_make_quicktime_my_default/
http://cssferret.imgur.com/microsoft_wtf
http://imgur.com/8pct5.png
http://googleblog.blogspot.com/2011/02/microsofts-bing-uses-google-search.html
http://news.cnet.com/8301-27076_3-20011994-248.html?part=rss&subj=news&tag=2547-1_3-0-20
http://gizmodo.com/5383413/shady-microsoft-plugin-pokes-critical-hole-in-firefox-security
http://i.stack.imgur.com/sl1LY.png
http://imgur.com/T6BMs
http://www.nytimes.com/2010/09/14/world/europe/14raid.html
http://twitter.com/phil_nash/status/21159419598
http://online.wsj.com/article/SB10001424052748704415104576065641376054226.html?mod=WSJASIA_hpp_MIDDLESecondNews
http://www.reddit.com/r/reddit.com/comments/bqqxv/inside_the_chinese_factory_that_makes_microsofts/
http://i.min.us/iX0PA.png
http://imgur.com/m4nuz.gif
http://www.gamesforwindows.com/en-CA/Games/AgeofEmpiresIII/
http://foredecker.wordpress.com/2011/02/27/working-at-microsoft-day-to-day-coding/
http://homepage.mac.com/aleksivic/.Pictures/humor/spotTheBusey.jpg
http://www.bloomberg.com/apps/news?pid=20601087&sid=a7uOT0ro100U&refer=home
http://www.microsoft.com/windowsxp/eula/pro.mspx

Pretty concise, huh?

Solution 2:

Just an observation: Reddit serves XHTML, which means the markup is well-formed XML, so you can just use an XPath library. For example (shamelessly copied from http://www.ibm.com/developerworks/library/x-javaxpathapi.html with minor modifications):

import java.io.IOException;
import org.w3c.dom.*;
import org.xml.sax.SAXException;
import javax.xml.parsers.*;
import javax.xml.xpath.*;

public class XPathExample {

  public static void main(String[] args) throws ParserConfigurationException, SAXException,
          IOException, XPathExpressionException {

    DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
    domFactory.setNamespaceAware(true); // never forget this!
    DocumentBuilder builder = domFactory.newDocumentBuilder();
    // replace the following line with code to retrieve and parse the URL of your choice
    Document doc = builder.parse("books.xml");

    XPathFactory factory = XPathFactory.newInstance();
    XPath xpath = factory.newXPath();
    // @class selects the attribute; a bare "class" would look for a child element named class.
    // Note: in a namespace-aware parse an unprefixed "a" only matches elements in no namespace,
    // so pages that declare the XHTML namespace need a NamespaceContext or local-name().
    XPathExpression expr = xpath.compile("//a[@class='title']/text()");

    Object result = expr.evaluate(doc, XPathConstants.NODESET);
    NodeList nodes = (NodeList) result;
    for (int i = 0; i < nodes.getLength(); i++) {
        System.out.println(nodes.item(i).getNodeValue());
    }

  }

}

Obviously won't work on all websites, but will work for any that serve XHTML.
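
To make the XPath approach concrete without touching the network, here is a minimal sketch that parses an inline XHTML string and collects the text of every a element whose class is exactly "title". The class name TitleTextExtractor, the helper extractTitles, and the sample markup are all made up for illustration and are not from the IBM article. It matches on local-name() so the expression still works when the page declares the XHTML default namespace, which is one way around the namespace issue; registering a NamespaceContext would be another.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class TitleTextExtractor {

    // Hypothetical helper: returns the trimmed text of every <a class="title"> element.
    public static List<String> extractTitles(String xhtml) throws Exception {
        DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
        domFactory.setNamespaceAware(true);
        DocumentBuilder builder = domFactory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // local-name() ignores the element's namespace, so this matches <a>
        // whether or not the document declares the XHTML namespace.
        NodeList nodes = (NodeList) xpath.evaluate(
                "//*[local-name()='a'][@class='title']", doc, XPathConstants.NODESET);

        List<String> titles = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            titles.add(nodes.item(i).getTextContent().trim());
        }
        return titles;
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample markup, standing in for a fetched page.
        String sample =
            "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
            + "<a class=\"title\" href=\"/first\">First post</a>"
            + "<a class=\"other\" href=\"/skip\">Not a title</a>"
            + "<a class=\"title\" href=\"/second\">Second post</a>"
            + "</body></html>";
        for (String title : extractTitles(sample)) {
            System.out.println(title);
        }
    }
}
```

Note that @class='title' is an exact string comparison, so an element with several classes (class="title foo") would not match; contains(@class, 'title') is a looser alternative if that matters for your page.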
