Hi
We read a couple of threads suggesting that Tika's Boilerpipe content
handler can be used to ignore content such as headers and footers in Nutch.
We tried the configuration below in nutch-site.xml (Nutch 1.11). However, we
still see header and footer content being extracted.
<property>
<name>plugin.includes</name>
<value>protocol-(http|httpclient)|urlfilter-regex|headings|parse-(html|tika|metatags)|index-(basic|metadata)|indexer-solr|urlnormalizer-(pass|regex|basic)|language-identifier</value>
</property>
<property>
<name>parser.html.NodesToExclude</name>
<value>div;class;navigation-wrapper|footer;class;main-footer|div;class;header|div;id;uhc-top-nav-menu</value>
</property>
<property>
<name>tika.use_boilerpipe</name>
<value>true</value>
</property>
<property>
<name>tika.boilerpipe.extractor</name>
<value>ArticleExtractor</value>
</property>
Is there anything we are missing here?
Regards
Megha