I've searched for examples of using Tika to parse out specific pieces of HTML, but have yet to find a good one, so I turn to the list...
I simply want to extract the contents of the <h1> elements of some HTML pages. So far I've been able to use HtmlParser and BodyContentHandler to get the entire body contents, but I'm not sure what to do next to extract only the h1 elements. I've tried hacking around a bit with the XPathParser/Matcher classes, but I'm not sure whether that's the right strategy, nor am I clear on the syntax needed to make it all work together. Any help would be great.

What I've done so far is:

    // Set up the parser
    Parser parser = new HtmlParser();
    ContentHandler handler = new BodyContentHandler(writer);
    Metadata metadata = new Metadata();
    XHTMLContentHandler xhch = new XHTMLContentHandler(handler, metadata);
    ParseContext parseContext = new ParseContext();

    // Parse
    parser.parse(in, xhch, metadata, parseContext);

    // Extract h1?
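To make the goal concrete, here is a minimal sketch of one direction this could take, not a confirmed Tika recipe. It assumes the parser delivers start/end events for h1 to whatever ContentHandler is handed to parse(), so a small SAX handler can keep only the text inside those elements. The class name H1Collector and its methods are hypothetical, and the demo feeds a well-formed snippet through the JDK's own SAX parser instead of Tika so that it runs without extra dependencies.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not part of Tika): collects the text of every <h1>.
public class H1Collector extends DefaultHandler {
    private final List<String> headings = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();
    private int depth = 0; // greater than 0 while inside an <h1>

    private static boolean isH1(String localName, String qName) {
        // localName is empty when the parser is not namespace-aware
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        return "h1".equalsIgnoreCase(name);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (isH1(localName, qName)) {
            if (depth == 0) {
                current.setLength(0); // start a fresh heading
            }
            depth++;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            current.append(ch, start, length); // keep text only inside <h1>
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (isH1(localName, qName) && depth > 0) {
            depth--;
            if (depth == 0) {
                headings.add(current.toString().trim());
            }
        }
    }

    public List<String> getHeadings() {
        return headings;
    }

    // Demo only: parses well-formed XHTML with the JDK SAX parser, not Tika.
    public static List<String> extract(String xhtml) throws Exception {
        H1Collector collector = new H1Collector();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), collector);
        return collector.getHeadings();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><h1>First</h1><p>skip</p>"
                + "<h1>Second <b>title</b></h1></body></html>";
        System.out.println(extract(page)); // prints [First, Second title]
    }
}
```

With the setup above, the idea would be to pass an instance in place of xhch, i.e. parser.parse(in, collector, metadata, parseContext), and then read collector.getHeadings(). Whether HtmlParser forwards h1 elements unchanged in every Tika version is an assumption here and worth checking.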
