Nutch HTML Parsers & tika-boilerpipe configuration

imran khan Mon, 29 Jul 2013 02:27:10 -0700

Greetings,

I am trying to understand the roles of the two HTML parser plugins
(parse-html and parse-tika) in Nutch 2.2.


My plugin.includes contains "parse-(html|tika)", and my parse-plugins.xml has:

<mimeType name="*">
  <plugin id="parse-tika" />
</mimeType>

<mimeType name="text/html">
  <plugin id="parse-html" />
</mimeType>

<mimeType name="application/xhtml+xml">
  <plugin id="parse-html" />
</mimeType>

So does this mean that the "parse-html" plugin is used for parsing HTML
pages? And to parse my HTML pages with Tika instead, would I simply replace
it with the "parse-tika" plugin?
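
That is, the HTML mappings in parse-plugins.xml would become something like
the following. This is only a sketch of the intended change, assuming that
nothing other than the plugin id needs to change:

<mimeType name="text/html">
  <!-- assumption: swapping the plugin id is enough to route HTML to Tika -->
  <plugin id="parse-tika" />
</mimeType>

<mimeType name="application/xhtml+xml">
  <plugin id="parse-tika" />
</mimeType>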

And if I want to remove boilerplate text (menus, ad text, etc.) from the
'content' field in Nutch, I guess I have to use Tika with boilerpipe?

Where can I configure Nutch to use boilerpipe with Tika and other
extractors? And is there any configuration in Tika/boilerpipe that would
automatically pick the right extractor for the current HTML page?

Regards,
Imran
