Should I Use HTML::Parser Or XML::Parser To Extract And Replace Text?
Solution 1:
HTML::Parser takes a token-and-callback approach. I find it very convenient when the data you wish to extract or change occurs in a context with particularly complex conditions.
Otherwise I prefer a tree-based approach. HTML::TreeBuilder::XPath (ultimately based on HTML::Parser) lets you find nodes with XPath. It returns HTML::Element objects. The documentation is a little sparse (well, spread over a couple of modules), but it is still the quick way to dig into HTML.
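As a rough illustration of the tree-based approach, here is a minimal sketch that pulls the text of every link inside an h3 heading. The sample markup and the helper name h3_link_texts are made up for the example; only the HTML::TreeBuilder::XPath calls come from the module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample markup; any HTML string would do.
my $html = <<'HTML';
<html><body>
<h3><a href="/q/1">comparing PERL md5() and PHP md5()</a></h3>
<p>Unrelated paragraph.</p>
</body></html>
HTML

# Return the text of every <a> that is a direct child of an <h3>.
sub h3_link_texts {
    my ($source) = @_;
    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse($source);
    $tree->eof;                      # tell the builder the input is complete
    # findnodes returns HTML::Element objects in list context
    my @texts = map { $_->as_text } $tree->findnodes('//h3/a');
    $tree->delete;                   # release the tree explicitly
    return @texts;
}

print "$_\n" for h3_link_texts($html);
```

Because the whole document is held as a tree, the XPath expression can express the context ("a link inside a heading") in one line instead of tracking state across callbacks.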
If you deal with pure XML, XML::Twig is an outstanding parser: it has very good memory management and lets you combine the tree and stream approaches. The documentation is very good, too.
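For the XML case, a small sketch along these lines shows how an XML::Twig handler can rewrite text as each element is parsed. The document, the title element name, and the helper fix_titles are assumptions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical document; the element names are invented for this example.
my $xml = '<posts><post><title>PERL tips</title></post>'
        . '<post><title>PHP tips</title></post></posts>';

# Replace PERL with Perl in every <title> and return the resulting XML.
sub fix_titles {
    my ($source) = @_;
    my $twig = XML::Twig->new(
        twig_handlers => {
            # called each time a <title> element has been completely parsed
            title => sub {
                my ($t, $elt) = @_;
                (my $text = $elt->text) =~ s/\bPERL\b/Perl/g;
                # set_text replaces the whole content, so this suits
                # elements that contain text only
                $elt->set_text($text);
            },
        },
    );
    $twig->parse($source);
    return $twig->sprint;            # serialize the whole document
}

print fix_titles($xml), "\n";
```

This version keeps the full tree in memory and serializes it at the end. For large inputs, the handler could instead call flush or purge on the twig to release what has already been processed, which is where the combined tree and stream approach pays off.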
Solution 2:
Say in someone's StackOverflow user page you want to replace all instances of PERL with Perl. You could do so with
    #! /usr/bin/perl
    use warnings;
    use strict;
    use HTML::Parser;
    use LWP::Simple;

    my $html = get "http://stackoverflow.com/users/201469/phil-jackson";
    die "$0: get failed" unless defined $html;

    # $skipped is everything passed over since the last reported event
    # (the text sections); $markup is the tag itself, printed unchanged.
    sub replace_text {
        my($skipped,$markup) = @_;
        $skipped =~ s/\bPERL\b/Perl/g;
        print $skipped, $markup;
    }

    my $p = HTML::Parser->new(
        api_version     => 3,
        marked_sections => 1,
        case_sensitive  => 1,
        unbroken_text   => 1,
        xml_mode        => 1,
        start_h         => [ \&replace_text => "skipped_text, text" ],
        end_h           => [ \&replace_text => "skipped_text, text" ],
        # also emit whatever follows the last tag
        end_document_h  => [ \&replace_text => "skipped_text, text" ],
    );

    # your page may use a different encoding
    binmode STDOUT, ":utf8" or die "$0: binmode: $!";
    $p->parse($html);
    $p->eof;    # signal end of input so any pending text is flushed
The output is what we expect:
    $ wget -O phil-jackson.html http://stackoverflow.com/users/201469
    $ ./replace-text >out.html
    $ diff -ub phil-jackson.html out.html
    --- phil-jackson.html
    +++ out.html
    @@ -327,7 +327,7 @@
     PERL:
    -#$linkTrue = … ">comparing PERL md5() and PHP md5()</a></h3>
    +#$linkTrue = … ">comparing Perl md5() and PHP md5()</a></h3>
     <div class="tags t-php t-perl t-md5">
     <a href="/questions/tagged/php" class="post-tag" title="show questions tagged 'php'" rel="tag">php</a>
     <a href="/questions/tagged/perl" class="post-tag" title="show questions tagged 'perl'" rel="tag">perl</a>
     <a href="/questions/tagged/md5" class="post-tag" title="show questions tagged 'md5'" rel="tag">md5</a>
The "PERL:" that still sticks out like a sore thumb is part of an element attribute, not a text section, so the substitution leaves it alone.
Solution 3:
You should also look at Web::Scraper.
I find this module easier than the HTML::Parser modules, but it helps if you are familiar with XPath.
How well HTML parsing works is very unpredictable and depends on the actual pages: like PDF, HTML is oriented toward display rather than data.
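To give a feel for Web::Scraper, here is a minimal sketch that collects the text of links found under h3 headings with an XPath selector. The sample markup, the result key titles, and the helper scrape_titles are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical sample markup.
my $html = <<'HTML';
<html><body>
<h3><a href="/q/1">comparing PERL md5() and PHP md5()</a></h3>
<h3><a href="/q/2">another question</a></h3>
</body></html>
HTML

# Return a hash reference whose "titles" entry lists the link texts.
sub scrape_titles {
    my ($source) = @_;
    my $s = scraper {
        # a selector starting with "/" is treated as XPath;
        # the "[]" suffix collects every match into an array
        process '//h3/a', 'titles[]' => 'TEXT';
    };
    return $s->scrape(\$source);     # a scalar reference is taken as HTML content
}

my $result = scrape_titles($html);
print "$_\n" for @{ $result->{titles} || [] };
```

The scraper block declares what to extract rather than how to walk the document, which is why it tends to be shorter than the equivalent callback code.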
Solution 4:
Which module you should use depends on what you are trying to do. For starters, HTML::Parser comes with great examples which also include a script that extracts plain text from an HTML document.
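The following is not the script bundled with HTML::Parser, only a small sketch in the same spirit: it collects the text sections of a document and skips script and style content. The helper name html_to_text and the sample input are assumptions.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Return the plain text of an HTML string, with entities decoded.
sub html_to_text {
    my ($source) = @_;
    my $out = '';
    my $p = HTML::Parser->new(
        api_version   => 3,
        unbroken_text => 1,
        # "dtext" passes each text section with entities already decoded
        text_h => [ sub { $out .= shift }, 'dtext' ],
    );
    # suppress all events for these elements, including their content
    $p->ignore_elements(qw(script style));
    $p->parse($source);
    $p->eof;                         # flush any pending text
    return $out;
}

print html_to_text('<p>Tom &amp; Jerry</p><script>var x = 1;</script>'), "\n";
```

Only a text handler is registered, so tags, comments, and declarations are simply dropped.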
Do not try to parse HTML documents using an XML parser: You will find yourself in a world of pain as a lot of valid HTML constructs are not valid XML.
Do not try to parse XML documents using an HTML parser: You will lose all the advantages of the stricter requirement that an XML document be well formed before it can be parsed.
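The difference is easy to see with a small experiment. The sketch below, using a made-up helper is_well_formed, feeds two fragments to XML::Parser: an unclosed br element, which is ordinary HTML, and its XML-style equivalent.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# Return 1 if the string parses as well-formed XML, 0 otherwise.
sub is_well_formed {
    my ($source) = @_;
    # XML::Parser dies on the first well-formedness error
    return eval { XML::Parser->new->parse($source); 1 } ? 1 : 0;
}

for my $doc ('<p>one<br>two</p>', '<p>one<br/>two</p>') {
    printf "%-20s %s\n", $doc,
        is_well_formed($doc) ? 'well formed' : 'rejected';
}
```

The first fragment is rejected because the br element is never closed. An HTML parser would accept both, which is exactly why it cannot tell you that an XML document is broken.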