Hi Nick, thanks for your response. What do you mean by "pass to Tika with BodyContentHandler but Boilerpipe configured for html"? How can I do this? The parse method of AutoDetectParser only allows a single ContentHandler parameter.
Thanks
Andrea

2015-05-27 0:27 GMT+02:00 Nick Burch <[email protected]>:

> On Tue, 19 May 2015, Andrea Asta wrote:
>
>> I would implement the following scenario:
>>
>> - For HTML pages with a given URL Pattern, extract a part of the page
>>   starting from an XPath
>> - For other generic HTML pages I would use Boilerpipe
>> - For different file formats, a simple BodyContentHandler is ok
>>
>> What's the best way to do this in Tika?
>
> I would suggest pushing your switching logic outside of Tika. Check the
> URL to see if it matches your pattern, then parse with a special xpath
> content handler if so. Otherwise, pass to Tika with BodyContentHandler but
> Boilerpipe configured for html
>
> Having a single Tika config works well when you want the same behaviour
> for all content of a type. If you need different behaviour for some URLs of
> a given type, then pushing that switch before Tika is probably the simplest
> way to handle it
>
> Nick
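
The switching logic discussed in this thread can be sketched as a small dispatcher that runs before Tika is called. This is only a rough illustration under stated assumptions: the class `ExtractionSwitch`, the `Strategy` enum, and the `example.com` URL pattern are hypothetical names, not part of Tika. The sketch uses only the Java standard library; the comments note which Tika handler each branch would probably build. The point for the single-ContentHandler question is that handlers can wrap one another, so one (possibly wrapped) handler is all that parse needs.

```java
import java.util.regex.Pattern;

// Hypothetical helper: picks an extraction strategy BEFORE calling Tika.
// Nothing here is Tika API; it only illustrates the decision itself.
final class ExtractionSwitch {

    enum Strategy { XPATH, BOILERPIPE, BODY }

    // Assumed URL pattern, purely illustrative.
    private static final Pattern SPECIAL_URL =
            Pattern.compile("^https?://example\\.com/articles/.*");

    static Strategy choose(String url, String mimeType) {
        boolean html = mimeType != null
                && (mimeType.startsWith("text/html")
                    || mimeType.startsWith("application/xhtml+xml"));
        if (!html) {
            // Other formats: a plain BodyContentHandler would be passed to parse().
            return Strategy.BODY;
        }
        if (url != null && SPECIAL_URL.matcher(url).matches()) {
            // Special HTML pages: likely a MatchingContentHandler built from an
            // XPathParser expression, wrapping the handler that collects text.
            return Strategy.XPATH;
        }
        // Generic HTML: likely a BoilerpipeContentHandler wrapping a
        // BodyContentHandler, still passed to parse() as one handler.
        return Strategy.BOILERPIPE;
    }

    public static void main(String[] args) {
        System.out.println(choose("http://example.com/articles/1", "text/html"));
        System.out.println(choose("http://other.org/page", "text/html"));
        System.out.println(choose("http://other.org/doc.pdf", "application/pdf"));
    }
}
```

Once the strategy is chosen, the caller would construct the corresponding handler and hand it to AutoDetectParser in the usual way. How the MIME type is obtained (HTTP header, Tika detection, file extension) is left open here.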
