Hi Markus,

It works. Thank you! I didn't know about parse-plugins.xml.

To summarize (for the Google bot):
For those who have written a Nutch plug-in at the parser extension point,
the steps required to make it work are:
1. Update the plugin.includes entry in nutch-site.xml.
2. Update parse-plugins.xml, and don't forget the aliases section.
3. Make sure the XPath
/plugin/extension/implementation/parameter[@name="contentType"] exists in
plugin.xml and has the value of the preferred MIME type.
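
For anyone following along, the three steps might look roughly like the
fragments below. This is only a sketch: the plug-in id parse-myparser and
the class org.example.nutch.parse.MyParser are made-up placeholders, and the
plugin.includes value is shortened for illustration, so keep whatever other
plug-ins your own configuration already lists.

```xml
<!-- nutch-site.xml: include the (hypothetical) parse-myparser plug-in -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-myparser|index-basic</value>
</property>
```

```xml
<!-- parse-plugins.xml: map the MIME type to the plug-in id, then alias the id -->
<parse-plugins>
  <mimeType name="text/html">
    <plugin id="parse-myparser" />
  </mimeType>

  <aliases>
    <!-- extension-id is assumed to match the implementation id in plugin.xml -->
    <alias name="parse-myparser"
           extension-id="org.example.nutch.parse.MyParser" />
  </aliases>
</parse-plugins>
```

```xml
<!-- plugin.xml: runtime and requires sections omitted for brevity -->
<plugin id="parse-myparser" name="My Parser" version="1.0.0">
  <extension id="org.example.nutch.parse.myparser"
             name="MyParser"
             point="org.apache.nutch.parse.Parser">
    <implementation id="org.example.nutch.parse.MyParser"
                    class="org.example.nutch.parse.MyParser">
      <parameter name="contentType" value="text/html|application/xhtml+xml"/>
      <parameter name="pathSuffix" value=""/>
    </implementation>
  </extension>
</plugin>
```

If the parser is still not picked up, the alias is probably the first thing
to check: as far as I understand, the alias name has to be the plug-in id
and the extension-id has to be the implementation id from plugin.xml.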

The community is more active than I expected. Cool! Thanks, Markus and
Parnab!


Regards,
Ake Tangkananond



On 6/25/12 10:11 PM, "Markus Jelsma" <[email protected]> wrote:

>Hello,
>
>Did you add your parser to parse-plugins.xml?
>
>Cheers
>
> 
> 
>-----Original message-----
>> From:Ake Tangkananond <[email protected]>
>> Sent: Mon 25-Jun-2012 16:56
>> To: [email protected]
>> Subject: Content type config on Parser plugin work improperly
>> 
>> Hi experts,
>> 
>> I am experimenting with adding a plug-in at the parser extension point.
>> I have successfully made plug-ins work at the indexing extension point,
>> but not at the parser extension point.
>> 
>> This is part of the source code of my class, which implements
>> org.apache.nutch.parse.Parser:
>>     public ParseResult getParse(Content content) {
>>         Metadata metadata = content.getMetadata();
>>         metadata.add("feature.enabled", "true");
>> 
>>         ParseData parseData = new ParseData(ParseStatus.STATUS_SUCCESS,
>>                 "aaa", new Outlink[0], metadata, metadata);
>>         return ParseResult.createParseResult(content.getUrl(),
>>                 new ParseImpl("bbb", parseData));
>>     }
>> 
>> I have added these parameters inside //plugin/extension/implementation
>> in plugin.xml:
>>             <parameter name="contentType"
>>                        value="text/html|application/xhtml+xml"/>
>>             <parameter name="pathSuffix" value=""/>
>> 
>> Then I added my plug-in to nutch-site.xml and, at the same time, disabled
>> the default parse-html to make sure that only my plug-in handles the
>> content type text/html. However, I got this error:
>> Error parsing: http://www.pantip.com/cafe/home/listerR.php:
>> org.apache.nutch.parse.ParseException: parser not found for
>> contentType=text/html url=http://www.pantip.com/cafe/home/listerR.php
>> at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:78)
>> at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
>> at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
>> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
>> 
>> Can anyone advise why my plug-in is being ignored? Thanks for all your
>> time.
>> 
>> 
>> Regards,
>> Ake Tangkananond
>> 
>> 
>> 

