Nutch 1.11 | Ignoring content header and footer content while parsing HTML

Megha Bhandari Fri, 08 Jul 2016 07:29:19 -0700

Hi

I have read a couple of threads suggesting that Tika's boilerplate 
content handler can be used to ignore content such as headers and footers in Nutch.


We tried the configuration below in nutch-site.xml (Nutch 1.11). However, we can 
still see header and footer content being extracted.

<property>
  <name>plugin.includes</name>
  <value>protocol-(http|httpclient)|urlfilter-regex|headings|parse-(html|tika|metatags)|index-(basic|metadata)|indexer-solr|urlnormalizer-(pass|regex|basic)|language-identifier</value>
</property>

<property>
  <name>parser.html.NodesToExclude</name>
  <value>div;class;navigation-wrapper|footer;class;main-footer|div;class;header|div;id;uhc-top-nav-menu</value>
</property>
<property>
  <name>tika.use_boilerpipe</name>
  <value>true</value>
</property>
<property>
  <name>tika.boilerpipe.extractor</name>
  <value>ArticleExtractor</value>
</property>

Is there anything we are missing here?

Regards
Megha
