On Tue, 19 May 2015, Andrea Asta wrote:
I would like to implement the following scenario:

- For HTML pages with a given URL Pattern, extract a part of the page
starting from an XPath
- For other generic HTML pages I would use Boilerpipe
- For different file formats, a simple BodyContentHandler is ok

What's the best way to do this in Tika?

I would suggest pushing your switching logic outside of Tika. Check the URL to see if it matches your pattern, and if it does, parse with a special XPath content handler. Otherwise, pass the document to Tika with a BodyContentHandler, with Boilerpipe configured for HTML.
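
As a rough illustration of that switch, here is a minimal sketch of the decision step only, written against the plain JDK so it runs without Tika on the classpath. The class name, the enum, and the URL pattern are all made up for the example. In real code each branch would build the matching Tika handler, probably a MatchingContentHandler from org.apache.tika.sax.xpath for the XPath case, a BoilerpipeContentHandler for generic HTML, and a plain BodyContentHandler for everything else.

```java
import java.util.Locale;
import java.util.regex.Pattern;

/** Sketch of the switch made before Tika is called (all names are hypothetical). */
class ExtractionSwitch {

    /** The three extraction paths from the question. */
    enum Strategy {
        XPATH_FRAGMENT, // special URLs: extract only the part selected by an XPath
        BOILERPIPE,     // other HTML pages: boilerplate removal
        BODY_TEXT       // any other format: plain body text
    }

    // Hypothetical pattern standing in for "a given URL pattern".
    static final Pattern SPECIAL_URLS =
            Pattern.compile("^https?://example\\.com/articles/.*$");

    /** Picks a strategy from the URL and the (possibly null) media type. */
    static Strategy choose(String url, String mediaType) {
        String type = mediaType == null ? "" : mediaType.toLowerCase(Locale.ROOT);
        boolean html = type.startsWith("text/html")
                || type.startsWith("application/xhtml+xml");

        // Only HTML pages under the special URL pattern get the XPath treatment.
        if (html && url != null && SPECIAL_URLS.matcher(url).matches()) {
            return Strategy.XPATH_FRAGMENT;
        }
        return html ? Strategy.BOILERPIPE : Strategy.BODY_TEXT;
    }

    public static void main(String[] args) {
        System.out.println(choose("http://example.com/articles/42", "text/html"));
        System.out.println(choose("http://example.com/about", "text/html"));
        System.out.println(choose("http://example.com/report.pdf", "application/pdf"));
    }
}
```

The caller would then create the handler for the chosen strategy and pass it to the parser, so Tika itself never needs to know about the URL rules.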

Having a single Tika config works well when you want the same behaviour for all content of a given type. If you need different behaviour for some URLs of that type, then doing the switch before calling Tika is probably the simplest way to handle it.

Nick
