On Tue, 19 May 2015, Andrea Asta wrote:
> I would implement the following scenario:
> - For HTML pages with a given URL Pattern, extract a part of the page
>   starting from an XPath
> - For other generic HTML pages I would use Boilerpipe
> - For different file formats, a simple BodyContentHandler is ok
> What's the best way to do this in Tika?
I would suggest pushing your switching logic outside of Tika. Check the
URL to see if it matches your pattern and, if so, parse with a special
XPath content handler. Otherwise, pass the document to Tika with a
BodyContentHandler, with Boilerpipe configured for HTML.
Having a single Tika config works well when you want the same behaviour
for all content of a type. If you need different behaviour for some URLs
of a given type, then pushing that switch before Tika is probably the
simplest way to handle it.
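A rough sketch of what that switch could look like, in plain Java with no
Tika calls. Everything named here is hypothetical: the ParseRouter class,
the Strategy enum and the example.com URL pattern are made up for
illustration. It also assumes you already know the content type before
parsing, for example from the HTTP Content-Type header. The comments only
name the Tika handlers each branch would probably hand off to.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical router: decides how a document should be parsed
// before anything is handed to Tika.
final class ParseRouter {

    /** The three behaviours described in the question. */
    enum Strategy {
        XPATH_EXTRACT, // e.g. a MatchingContentHandler built from an XPath matcher
        BOILERPIPE,    // e.g. a BoilerpipeContentHandler around a BodyContentHandler
        PLAIN_BODY     // a plain BodyContentHandler
    }

    // Assumed URL pattern; replace with the real one.
    private static final Pattern SPECIAL_URL =
            Pattern.compile("^https?://example\\.com/articles/.*");

    /** True for HTML/XHTML media types, ignoring parameters such as charset. */
    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false;
        }
        String base = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return base.equals("text/html") || base.equals("application/xhtml+xml");
    }

    /** Picks a strategy from the URL and the already-known content type. */
    static Strategy choose(String url, String contentType) {
        if (!isHtml(contentType)) {
            return Strategy.PLAIN_BODY;
        }
        if (url != null && SPECIAL_URL.matcher(url).matches()) {
            return Strategy.XPATH_EXTRACT;
        }
        return Strategy.BOILERPIPE;
    }
}
```

The caller would then switch on the result of ParseRouter.choose(url,
contentType) and build the matching content handler for that one parse,
so the Tika config itself stays the same for every document.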
Nick