Hello,
I would like to implement the following scenario:

- For HTML pages whose URL matches a given pattern, extract only the part of
the page starting from a given XPath
- For all other HTML pages, use Boilerpipe
- For other file formats, a simple BodyContentHandler is fine
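
To make the scenario concrete, here is a rough sketch of just the selection
logic, in plain Java with no Tika calls. The class name, the enum, and the
example.com URL pattern are all made up for illustration, and I am assuming
the content type has already been detected. Wiring each strategy to an actual
content handler is the part I am unsure about.

```java
import java.util.regex.Pattern;

public class ExtractionSelector {

    /** The three extraction strategies from the list above (names are hypothetical). */
    public enum Strategy { XPATH_FRAGMENT, BOILERPIPE, PLAIN_BODY }

    // Hypothetical URL pattern; the real one would come from configuration.
    private static final Pattern SPECIAL_URL =
            Pattern.compile("^https?://example\\.com/articles/.*");

    /** Picks a strategy from the document URL and its (already detected) content type. */
    public static Strategy choose(String url, String contentType) {
        boolean isHtml = contentType != null
                && (contentType.startsWith("text/html")
                    || contentType.startsWith("application/xhtml+xml"));
        if (!isHtml) {
            // Non-HTML formats: plain body text is enough.
            return Strategy.PLAIN_BODY;
        }
        if (url != null && SPECIAL_URL.matcher(url).matches()) {
            // HTML page matching the URL pattern: extract the XPath fragment.
            return Strategy.XPATH_FRAGMENT;
        }
        // Any other HTML page: boilerplate removal.
        return Strategy.BOILERPIPE;
    }

    public static void main(String[] args) {
        System.out.println(choose("http://example.com/articles/42", "text/html"));
        System.out.println(choose("http://example.org/index.html", "text/html; charset=UTF-8"));
        System.out.println(choose("http://example.org/report.pdf", "application/pdf"));
    }
}
```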

What's the best way to do this in Tika?

Thanks
Andrea
