Greetings,
I am trying to understand the roles of the two HTML parser plugins
(parse-html and parse-tika) in Nutch 2.2.
My plugin.includes contains "parse-(html|tika)" and my parse-plugins.xml has:
<mimeType name="*">
  <plugin id="parse-tika" />
</mimeType>
<mimeType name="text/html">
  <plugin id="parse-html" />
</mimeType>
<mimeType name="application/xhtml+xml">
  <plugin id="parse-html" />
</mimeType>
Does this mean that the "parse-html" plugin is used for parsing HTML pages?
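To make that question concrete, here is a small sketch of the lookup I assume
Nutch performs: an exact MIME type match first, then a fallback to the "*"
entry. The class and method names are hypothetical and only mirror my
parse-plugins.xml; this is not Nutch's actual code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of my understanding, not Nutch's real parser selection code.
public class ParsePluginLookupSketch {
    // Mirrors the three <mimeType> entries from my parse-plugins.xml.
    static final Map<String, String> MAPPINGS = new HashMap<>();
    static {
        MAPPINGS.put("*", "parse-tika");
        MAPPINGS.put("text/html", "parse-html");
        MAPPINGS.put("application/xhtml+xml", "parse-html");
    }

    // Assumed rule: an exact MIME type match wins, otherwise the "*" entry is used.
    static String resolve(String mimeType) {
        String plugin = MAPPINGS.get(mimeType);
        return plugin != null ? plugin : MAPPINGS.get("*");
    }

    public static void main(String[] args) {
        String[] types = {"text/html", "application/xhtml+xml", "application/pdf"};
        for (String type : types) {
            System.out.println(type + " -> " + resolve(type));
        }
    }
}
```

If my assumption is right, both HTML types would go to parse-html and
everything else (for example application/pdf) would go to parse-tika.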
And to parse my HTML pages with Tika instead, would I simply replace
"parse-html" with "parse-tika" in the text/html and application/xhtml+xml
entries?
Also, if I want to remove boilerplate text such as menus and ads from the
'content' field in Nutch, I guess I have to use Tika with boilerpipe?
Where can I configure Nutch to use boilerpipe (and its different extractors)
with Tika? And is there any setting in Tika/boilerpipe that automatically
picks the right extractor for the current HTML page?
Regards,
Imran