It does not seem to be working for me; I tried all three Boilerpipe algorithms.
I tried with apple.com <http://apple.com/>, but the content still contains header material. My header starts with this:

    <nav id="ac-globalnav"

I added the following to my nutch-site.xml, with the default plugins included:

    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>

    <property>
      <name>tika.extractor</name>
      <value>boilerpipe</value>
      <description>
        Which text extraction algorithm to use. Valid values are: boilerpipe or none.
      </description>
    </property>

    <property>
      <name>tika.extractor.boilerpipe.algorithm</name>
      <value>CanolaExtractor</value>
      <description>
        Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor, ArticleExtractor or CanolaExtractor.
      </description>
    </property>

Am I missing something here?

Regards,
Manish Verma
AML Search

> On Jun 29, 2016, at 3:06 AM, Markus Jelsma <[email protected]> wrote:
>
> Manish - you're in luck. Nutch 1.12 was released and has Boilerpipe support.
> Check:
> https://issues.apache.org/jira/browse/NUTCH-961
>
> Markus
>
>
>
> -----Original message-----
>> From:Manish Verma <[email protected]>
>> Sent: Tuesday 28th June 2016 23:46
>> To: [email protected]
>> Subject: Remove Header from content
>>
>> Hi,
>>
>> I don’t want to index header and footer of content , I know we can make
>> changes in HtmlParser.java but I don’t want to change nutch core code, is
>> there any other way(plugin) to eleminate Header div from content.
>>
>> Thanks MV
>>
>>

