Parse h1 tag elements from html?

Cameron Leach Fri, 06 Jan 2012 16:58:07 -0800

I've searched for examples of using Tika to parse out specific
pieces of HTML, but have yet to find a good one, so I turn to the
list...


I simply want to parse out the contents of the <h1> elements of some
HTML pages. So far I've been able to use the HtmlParser and
BodyContentHandler to get the entire body contents, but I'm not sure
what to do next in order to extract only the h1 elements.

I've tried hacking around a little with the XPathParser/Matcher
classes, but I'm not sure whether that's the right approach, nor am
I clear on the syntax needed to make it all work together.

Any help would be great.

What I've done so far is:

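import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.XHTMLContentHandler;
import org.xml.sax.ContentHandler;

// Set up the parser ("writer" and "in" are declared earlier, not shown)
Parser parser = new HtmlParser();
ContentHandler handler = new BodyContentHandler(writer);
Metadata metadata = new Metadata();
XHTMLContentHandler xhch = new XHTMLContentHandler(handler, metadata);
ParseContext parseContext = new ParseContext();

// Parse: the body text is written to "writer"
parser.parse(in, xhch, metadata, parseContext);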

// Extract h1?
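
One direction that might work for the "extract h1" step, offered only as a
sketch: since the parser reports the document as SAX events, a custom
handler could collect just the text inside h1 elements. The class name
H1Collector and its methods are hypothetical (not part of Tika), and the
demo below feeds well-formed XHTML through the JDK's own SAX parser so
that it runs without Tika on the classpath. Whether HtmlParser passes h1
elements through unchanged in a given Tika version is an assumption that
would need checking.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: collects the text of every h1 element from SAX events. */
public class H1Collector extends DefaultHandler {
    private final List<String> headings = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();
    private boolean inH1 = false;

    // localName is empty when the event source is not namespace-aware,
    // so fall back to qName in that case.
    private static boolean isH1(String localName, String qName) {
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        return "h1".equalsIgnoreCase(name);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (isH1(localName, qName)) {
            inH1 = true;
            current.setLength(0); // start a fresh heading
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text of nested inline elements (e.g. <b>) is captured too.
        if (inH1) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (isH1(localName, qName)) {
            inH1 = false;
            headings.add(current.toString().trim());
        }
    }

    public List<String> getHeadings() {
        return headings;
    }

    // Demo only: parses a well-formed XHTML string with the JDK SAX parser.
    public static List<String> extract(String xhtml) throws Exception {
        H1Collector collector = new H1Collector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), collector);
        return collector.getHeadings();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><h1>First</h1><p>x</p>"
                + "<h1>Second <b>bold</b></h1></body></html>";
        System.out.println(extract(page)); // prints [First, Second bold]
    }
}
```

With Tika, the idea would be to pass an H1Collector instance as the
ContentHandler argument, i.e. parser.parse(in, collector, metadata,
parseContext), and then read collector.getHeadings(). The
XPathParser/Matcher route may be a tidier alternative if the right
expression can be worked out.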
