It does not seem to be working for me; I tried all three Boilerpipe algorithms.

I tried it with apple.com <http://apple.com/>, but the extracted content still
contains the header text. My header starts with this: <nav id="ac-globalnav"

I added the properties below to my nutch-site.xml, with the default plugins included:
 
<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>


<property>
  <name>tika.extractor</name>
  <value>boilerpipe</value>
  <description>
  Which text extraction algorithm to use. Valid values are: boilerpipe or none.
  </description>
</property>
 
<property> 
  <name>tika.extractor.boilerpipe.algorithm</name>
  <value>CanolaExtractor</value>
  <description> 
  Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor,
  ArticleExtractor or CanolaExtractor.
  </description>
</property>
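
To rule out a typo or a misplaced element in the file, the properties can be read back and printed. This is only a minimal sketch in Python, assuming the file is a standard Hadoop-style <configuration> document. The function name read_properties and the inline SAMPLE text are illustrative and not part of Nutch. It checks only what the file contains, not which parser Nutch actually picks for a page.

```python
import xml.etree.ElementTree as ET


def read_properties(xml_text):
    """Return {name: value} for every <property> element in the document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            # Strip whitespace so values wrapped across lines compare cleanly.
            props[name.strip()] = (value or "").strip()
    return props


# Assumption: a cut-down stand-in for nutch-site.xml holding only the two
# properties quoted in this message.
SAMPLE = """<configuration>
<property>
  <name>tika.extractor</name>
  <value>boilerpipe</value>
</property>
<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <value>CanolaExtractor</value>
</property>
</configuration>"""

if __name__ == "__main__":
    for name, value in sorted(read_properties(SAMPLE).items()):
        print(f"{name} = {value}")
```

To check a real file instead of the sample, the text passed to read_properties can be read from conf/nutch-site.xml (the path is an assumption about the install layout).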

Am I missing something here?


Regards,
Manish Verma
AML Search

> On Jun 29, 2016, at 3:06 AM, Markus Jelsma <[email protected]> wrote:
> 
> Manish - you're in luck. Nutch 1.12 was released and has Boilerpipe support. 
> Check:
> https://issues.apache.org/jira/browse/NUTCH-961
> 
> Markus
> 
> 
> 
> -----Original message-----
>> From:Manish Verma <[email protected]>
>> Sent: Tuesday 28th June 2016 23:46
>> To: [email protected]
>> Subject: Remove Header from content
>> 
>> Hi,
>> 
>> I don’t want to index the header and footer of the content. I know we can
>> make changes in HtmlParser.java, but I don’t want to change Nutch core code.
>> Is there any other way (a plugin) to eliminate the header div from the content?
>> 
>> Thanks MV
>> 
>> 
