Hi all,
I am writing a custom HTML parse filter by implementing the
HtmlParseFilter interface, and I am using ParserChecker to test it.
From some System.out.println statements I added to the HtmlParseFilters
class, I can see that by default only
org.apache.nutch.parse.js.JSParseFilter is being used. If I want my
custom filter to be used, do I need to add some configuration anywhere?
One more thing to note: when I add the following lines to
nutch-site.xml,
<property>
<name>plugin.includes</name>
<value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
<description>Regular expression naming plugin id names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints
plugin. By default Nutch includes crawling just HTML and plain
text via HTTP, and basic indexing and search plugins.
</description>
</property>
I don't even see JSParseFilter being applied. The package that contains my
custom filter does not have any plugin configuration XML files of its own.
Do I have to add some, or configure the filter elsewhere? I am using
Nutch 1.2.
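To check my reading of the plugin.includes expression, here is a small sketch in plain Java (no Nutch classes involved). It assumes that Nutch loads a plugin only when the whole plugin id matches the expression, and that JSParseFilter ships in a plugin whose id is parse-js; my-parse-filter is just a hypothetical id for my own plugin.

```java
import java.util.regex.Pattern;

public class PluginIncludesCheck {
    // The plugin.includes value copied from the nutch-site.xml snippet above.
    static final Pattern INCLUDES = Pattern.compile(
        "nutch-extensionpoints|protocol-http|urlfilter-regex"
        + "|parse-(text|html)|index-basic|query-(basic|site|url)");

    // Assumption: a plugin is included only if its entire id matches the expression.
    static boolean isIncluded(String pluginId) {
        return INCLUDES.matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // "parse-js" is assumed to be the id of the plugin providing JSParseFilter;
        // "my-parse-filter" is a hypothetical id for a custom plugin.
        String[] ids = {"parse-html", "parse-js", "my-parse-filter"};
        for (String id : ids) {
            System.out.println(id + " included: " + isIncluded(id));
        }
    }
}
```

If those assumptions hold, neither parse-js nor my-parse-filter matches the value above, which would be consistent with JSParseFilter disappearing once I set the property.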
My knowledge of Nutch is growing considerably, thanks to all of you.
Cheers,
Abi