Re: Dynamic content handler

Nick Burch Tue, 26 May 2015 15:28:09 -0700

On Tue, 19 May 2015, Andrea Asta wrote:

I would implement the following scenario:


- For HTML pages with a given URL Pattern, extract a part of the page
starting from an XPath
- For other generic HTML pages I would use Boilerpipe
- For different file formats, a simple BodyContentHandler is ok

What's the best way to do this in Tika?

I would suggest pushing your switching logic outside of Tika. Check the URL to see if it matches your pattern, then parse with a special XPath content handler if so. Otherwise, pass to Tika with BodyContentHandler but Boilerpipe configured for HTML.

Having a single Tika config works well when you want the same behaviour for all content of a type. If you need different behaviour for some URLs of a given type, then pushing that switch before Tika is probably the simplest way to handle it.
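
As a rough sketch of what that switch could look like before anything reaches Tika (plain Java, no Tika calls): the class name, the Strategy enum and the URL pattern below are all made up for illustration, so substitute your own. Each branch would then build the matching content handler (an XPath-based one, Boilerpipe, or BodyContentHandler) and hand it to the parser.

```java
import java.util.Locale;
import java.util.regex.Pattern;

public class HandlerSwitch {

    // Hypothetical names for the three behaviours from the question.
    enum Strategy { XPATH_EXTRACT, BOILERPIPE, BODY_TEXT }

    // Hypothetical URL pattern; replace with the pattern you actually need.
    private static final Pattern SPECIAL_URLS =
            Pattern.compile("^https?://example\\.com/articles/.*");

    // True for HTML/XHTML media types, ignoring parameters such as charset.
    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false;
        }
        String base = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return base.equals("text/html") || base.equals("application/xhtml+xml");
    }

    // Decide, outside of Tika, which content handler to set up for a document.
    static Strategy choose(String url, String contentType) {
        if (!isHtml(contentType)) {
            return Strategy.BODY_TEXT;      // other formats: plain body text
        }
        if (url != null && SPECIAL_URLS.matcher(url).matches()) {
            return Strategy.XPATH_EXTRACT;  // matching HTML pages: XPath extraction
        }
        return Strategy.BOILERPIPE;         // all other HTML: Boilerpipe
    }

    public static void main(String[] args) {
        System.out.println(choose("http://example.com/articles/42", "text/html; charset=UTF-8"));
        System.out.println(choose("http://example.org/index.html", "text/html"));
        System.out.println(choose("http://example.com/articles/report.pdf", "application/pdf"));
    }
}
```

Keeping the decision in one small method like this means the Tika config itself can stay the same for every document, and only the handler you pass in changes.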


Nick
