Simple: use only parse-tika and apply the patch from NUTCH-961.
https://issues.apache.org/jira/browse/NUTCH-961
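
For the parse-plugins.xml quoted below, a minimal sketch of what "only use parse-tika" could look like is to point the two HTML mime types at parse-tika instead of parse-html. This is only a fragment, not a complete file, and it assumes the rest of your parse-plugins.xml (including the "*" entry and the aliases section) stays as it is. The switch that turns boilerpipe on is added by the patch, so take its property names from the patch itself; they are not shown here.

```xml
<!-- parse-plugins.xml fragment (sketch): route HTML and XHTML to parse-tika -->
<mimeType name="text/html">
  <plugin id="parse-tika" />
</mimeType>

<mimeType name="application/xhtml+xml">
  <plugin id="parse-tika" />
</mimeType>
```

plugin.includes still has to contain parse-tika, which your "parse-(html|tika)" setting already does.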
Extractor algorithms are fixed; it is not possible to preanalyze a page and select an extractor accordingly.

-----Original message-----
> From:imran khan <[email protected]>
> Sent: Monday 29th July 2013 11:25
> To: [email protected]
> Subject: Nutch HTML Parsers & tika-boilerpipe configuration
>
> Greetings,
>
> I am trying to understand the role/functionality of different html parsers
> (parse-html and parse-tika) plugin in nutch 2.2.
>
> My plugin.includes has "parse-(html|tika) " and my parse-plugins.xml has
>
> <mimeType name="*">
> <plugin id="parse-tika" />
> </mimeType>
>
> <mimeType name="text/html">
> <plugin id="parse-html" />
> </mimeType>
>
> <mimeType name="application/xhtml+xml">
> <plugin id="parse-html" />
> </mimeType>
>
> So does it mean for parsing html pages "parse-html" plugin would be used ?
> And to use Tika for parsing my html pages I would simply replace it with
> "parse-tika" plugin ?
>
> And if I want to remove the boilerplate text like menu, ads text etc. from
> my 'content' field in nutch then I guess I have to use Tika with boilerpipe
> ?
>
> Where can I configure nutch to use boilerpipe with Tika and other
> extracters ? And is there any configuration in Tika/boilerpipe which would
> automatically pick the right extractor for Tika for current Html page ?
>
> Regards,
> Imran
>

