HtmlParseFilter is only one type of plugin; there are several other types
(URL filters, protocols, indexing filters, and so on). In the configuration
you have, it looks like JSParseFilter and TestPluginFilter are the only
classes that implement HtmlParseFilter, so the results make sense.
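
To make that concrete, here is a minimal, self-contained sketch in plain Java. It is not Nutch code: the class name PluginIncludesSketch, the method pluginsFor, and the table of which plugin implements which extension point are assumptions for illustration only (the real source of truth is each plugin's plugin.xml). It also assumes that plugin.includes is matched against the whole plugin id. The idea is that a plugin shows up as an HtmlParseFilter only if its id matches plugin.includes and it declares that extension point.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class PluginIncludesSketch {

    // ASSUMED, illustrative table: plugin id -> extension points it implements.
    // In a real install this information lives in each plugin's plugin.xml.
    static Map<String, List<String>> assumedExtensionPoints() {
        Map<String, List<String>> m = new LinkedHashMap<>();
        m.put("protocol-http", List.of("Protocol"));
        m.put("urlfilter-regex", List.of("URLFilter"));
        m.put("parse-text", List.of("Parser"));
        m.put("parse-html", List.of("Parser"));
        m.put("parse-js", List.of("Parser", "HtmlParseFilter"));
        m.put("index-basic", List.of("IndexingFilter"));
        m.put("query-basic", List.of("QueryFilter"));
        m.put("test-plugin", List.of("HtmlParseFilter"));
        return m;
    }

    // Returns the ids that (a) match the plugin.includes regex as a whole and
    // (b) implement the requested extension point.
    static List<String> pluginsFor(String includesRegex, String extensionPoint) {
        Pattern includes = Pattern.compile(includesRegex);
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : assumedExtensionPoints().entrySet()) {
            boolean included = includes.matcher(e.getKey()).matches();
            if (included && e.getValue().contains(extensionPoint)) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String includes = "nutch-extensionpoints|protocol-http|urlfilter-regex"
                + "|parse-(text|html|js)|index-basic|query-(basic|site|url)|test-plugin";
        // prints [parse-js, test-plugin]
        System.out.println(pluginsFor(includes, "HtmlParseFilter"));
    }
}
```

With the earlier plugin.includes value, which had parse-(text|html) and no test-plugin, the same call returns an empty list. That is consistent with JSParseFilter not being applied at all under that configuration.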
-MB
On Feb 2, 2011, at 12:09 AM, .: Abhishek :. wrote:
> Hi Mike et al.,
>
> Yes, adding plugin.xml made it work.
>
> However, the outstanding question is this: even though my plugin.includes
> lists a lot of plugin names, why do I see only JSParseFilter and my own
> custom filter in HtmlParseFilters?
>
> The following is my plugin.includes value,
> <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|test-plugin</value>
>
> Here test-plugin is my custom plugin. When I add the following line,
>
> for (HtmlParseFilter filter : htmlParseFilters) {
>   System.out.println("Filter Name : " + filter.getClass().getName());
> }
>
> below the last line of the constructor that takes the conf parameter, i.e.
> this.htmlParseFilters = (HtmlParseFilter[])
> objectCache.getObject(HtmlParseFilter.class.getName());
> in HtmlParseFilters, I see only:
>
> Filter Name : org.apache.nutch.parse.js.JSParseFilter
> Filter Name : com.test.nutch.TestPluginFilter
>
> I am just wondering why this is. Shouldn't I be seeing all the plugins
> listed in the value tag of plugin.includes?
>
>
>
>
> On Wed, Feb 2, 2011 at 11:29 AM, Mike Baranczak <[email protected]> wrote:
>
>> Yes, you do have to make a config file for your plugin to be seen by Nutch.
>>
>> If you built Nutch from source, you should have the directory
>> build/plugins. That's where the compiled plugins are. The names of the
>> directories under there are the names that get included in
>> 'plugin.includes'. Take a look at the existing plugin.xml files; you should
>> be able to figure it out by example.
>>
>> The standard way to package the plugin code is to put it in a jar in the
>> corresponding plugin directory. This ensures that it won't get loaded if
>> it's not used. (This is optional: if you KNOW that it's gonna get used every
>> time, you can put your code anywhere on the classpath.)
>>
>> Note that I'm using 1.1 - I can't guarantee that this information is still
>> current.
>>
>> -MB
>>
>>
>>
>> On Feb 1, 2011, at 9:49 PM, .: Abhishek :. wrote:
>>
>>> Hi all,
>>>
>>> I am writing a custom filter by implementing HtmlParseFilter, and I am
>>> using ParserChecker to test it.
>>>
>>> From some System.out.println calls in the HtmlParseFilters class, I can see
>>> that by default only org.apache.nutch.parse.js.JSParseFilter is being used.
>>> If I would like to use my custom filter, should I be adding some
>>> configuration anywhere?
>>>
>>> Also note that when I add the following lines to nutch-site.xml,
>>>
>>> <property>
>>>   <name>plugin.includes</name>
>>>   <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
>>>   <description>Regular expression naming plugin id names to
>>>   include. Any plugin not matching this expression is excluded.
>>>   In any case you need at least include the nutch-extensionpoints plugin. By
>>>   default Nutch includes crawling just HTML and plain text via HTTP,
>>>   and basic indexing and search plugins.
>>>   </description>
>>> </property>
>>>
>>> I don't even see JSParseFilter being applied. The package that has my
>>> custom filter does not have any special plugin configuration XML files. Do I
>>> have to add some, or configure it elsewhere? I am using Nutch 1.2.
>>>
>>> I see my knowledge of Nutch growing considerably, thanks to all of you.
>>>
>>> Cheers,
>>> Abi
>>
>>