Simple: use only parse-tika and apply the patch from NUTCH-961.
https://issues.apache.org/jira/browse/NUTCH-961

Extractor algorithms are fixed; it is not possible to preanalyze a page and
select an extractor accordingly.
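
As a rough sketch of the first part, assuming the parse-plugins.xml layout quoted in the original message, the HTML mime types would point at parse-tika instead of parse-html:

```xml
<!-- Sketch only: route everything, including HTML, through parse-tika. -->
<mimeType name="*">
  <plugin id="parse-tika" />
</mimeType>

<mimeType name="text/html">
  <plugin id="parse-tika" />
</mimeType>

<mimeType name="application/xhtml+xml">
  <plugin id="parse-tika" />
</mimeType>
```

plugin.includes would then list parse-tika rather than parse-(html|tika). The boilerpipe settings themselves come from the patch, so take the property names from the patch attached to the issue rather than from this sketch.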
 
 
-----Original message-----
> From:imran khan <[email protected]>
> Sent: Monday 29th July 2013 11:25
> To: [email protected]
> Subject: Nutch HTML Parsers & tika-boilerpipe configuration
> 
> Greetings,
> 
> I am trying to understand the role/functionality of the different HTML parser
> plugins (parse-html and parse-tika) in Nutch 2.2.
> 
> My plugin.includes has "parse-(html|tika)" and my parse-plugins.xml has
> 
> <mimeType name="*">
>   <plugin id="parse-tika" />
> </mimeType>
> 
> <mimeType name="text/html">
>   <plugin id="parse-html" />
> </mimeType>
> 
> <mimeType name="application/xhtml+xml">
>   <plugin id="parse-html" />
> </mimeType>
> 
> So does this mean the "parse-html" plugin would be used for parsing HTML pages?
> And to use Tika for parsing my HTML pages, would I simply replace it with the
> "parse-tika" plugin?
> 
> And if I want to remove boilerplate text such as menus and ad text from
> my 'content' field in Nutch, then I guess I have to use Tika with boilerpipe?
> 
> Where can I configure Nutch to use boilerpipe with Tika and other
> extractors? And is there any configuration in Tika/boilerpipe that would
> automatically pick the right extractor for the current HTML page?
> 
> Regards,
> Imran
> 
