Hi Nick,
thanks for your response.
What do you mean by "pass to Tika with BodyContentHandler but Boilerpipe
configured for html"? How can I do this? The parse method of
AutoDetectParser only allows a single ContentHandler parameter.

Thanks
Andrea

2015-05-27 0:27 GMT+02:00 Nick Burch <[email protected]>:

> On Tue, 19 May 2015, Andrea Asta wrote:
>
>> I would implement the following scenario:
>>
>> - For HTML pages with a given URL Pattern, extract a part of the page
>> starting from an XPath
>> - For other generic HTML pages I would use Boilerpipe
>> - For different file formats, a simple BodyContentHandler is ok
>>
>> What's the best way to do this in Tika?
>>
>
> I would suggest pushing your switching logic outside of Tika. Check the
> URL to see if it matches your pattern, then parse with a special xpath
> content handler if so. Otherwise, pass to Tika with BodyContentHandler but
> Boilerpipe configured for html
>
> Having a single Tika config works well when you want the same behaviour
> for all content of a type. If you need different behaviour for some URLs of
> a given type, then pushing that switch before Tika is probably the simplest
> way to handle it
>
> Nick
>
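
For reference, the switching logic Nick describes (decide per URL before
calling Tika) could be sketched roughly as below. This is only a sketch:
the class name, the enum and the URL pattern are hypothetical, and it uses
plain Java only so the decision stays independent of Tika. On the single
ContentHandler question: as far as I know, in Tika 1.x
BoilerpipeContentHandler takes another ContentHandler as a delegate, so
new BoilerpipeContentHandler(new BodyContentHandler()) is still one
handler from the parser's point of view. Please check that against the
Tika version in use.

```java
import java.util.regex.Pattern;

/** Picks an extraction strategy before Tika is invoked. All names here are hypothetical. */
final class HandlerSwitch {

    /** The three cases from the original question. */
    enum Strategy { XPATH, BOILERPIPE, BODY }

    // Hypothetical URL pattern for the pages that need XPath-based extraction.
    private static final Pattern SPECIAL_URL =
            Pattern.compile("^https?://example\\.com/articles/.*");

    /**
     * Non-HTML content gets the plain body handler, HTML pages matching the
     * URL pattern get the XPath handler, every other HTML page gets Boilerpipe.
     */
    static Strategy choose(String url, String mediaType) {
        boolean html = mediaType != null
                && (mediaType.startsWith("text/html")
                    || mediaType.startsWith("application/xhtml+xml"));
        if (!html) {
            return Strategy.BODY;
        }
        if (url != null && SPECIAL_URL.matcher(url).matches()) {
            return Strategy.XPATH;
        }
        return Strategy.BOILERPIPE;
    }
}
```

Each strategy would then map to exactly one ContentHandler instance, built
by the caller and passed to AutoDetectParser.parse as usual, so no change
to the Tika config is needed for the per-URL case.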
