Hi Julien,

How do I file a JIRA?

Thanks


On Fri, 2010-07-23 at 10:03 +0100, Julien Nioche wrote:
> > I have been trying to replace the Nutch 1.1 HTML parser with my own HTML
> > parser, but I have had no luck so far. Here are my findings:
> >
> > 1) In parse-plugins.xml, it doesn't matter whether you comment out or
> > uncomment the entries whose plugin id is "parse-html". The only HTML
> > parser that ever runs is HtmlParser.java, not the Tika HTML parser.
> 
> 
> I doubt it. The Tika parser is used - people even report problems with it
> :-)
> 
> 
> > Even if you remove the whole part below, HtmlParser.java is always
> > called.
> >
> >   <alias name="parse-html"
> >          extension-id="org.apache.nutch.parse.html.HtmlParser" />
> >
> 
> 
> >
> >
> > 2) The way I successfully replaced the Nutch 1.0 HTML parser (that is,
> > HtmlParser.java) with my own HTML parser does not work in Nutch 1.1.
> > The commentary in the Nutch 1.1 parse-plugins.xml, "You can uncomment
> > the associations below to override parse-tika and chose which plugin
> > should be used for a given content type", is not true. As I stated in 1),
> > HtmlParser.java is always called whether or not the following mimeType
> > is commented out.
> >
> > <mimeType name="text/html">
> >   <plugin id="parse-html" />
> > </mimeType>
> >
> 
> Are you sure that parse-html is called and not parse-tika? What did you
> specify in plugin.includes and what is listed in your logs?
> 
> The plugin.xml in parse-tika specifies
> 
> <parameter name="contentType" value="*"/>
> 
> i.e. parse-tika is used by default. This means that it will be used if no
> association is specified for a given mime-type in parse-plugins.xml *OR* if
> the parser specified fails.
> 
> Are you using Nutch in local or distributed mode? If you are in distributed
> mode then as you certainly know you need to rebuild the job file for the
> modifications to your local conf/ files to be taken into account.
> 
> 
> 
> > I suggest the Nutch team provide a working example and some details
> > showing how to replace the Nutch 1.1 html parser.
> >
> 
> As I already said in a previous email exchange, specifying a custom HTML
> parser is a matter of specifying it in plugin.includes and creating an
> association between a mime-type and the parser ID in parse-plugins.xml. If
> you think that this is not working properly then please file a JIRA and
> attach your parse-plugins.xml + nutch-site.xml and give details on how you
> are using Nutch (local, distributed, etc.)
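> 
> A minimal sketch of those two steps, assuming a hypothetical custom plugin
> with the ID "parse-myhtml" and the extension class
> org.example.parse.MyHtmlParser (both names are placeholders, not part of
> Nutch). The plugin.includes value shown is illustrative only; keep whatever
> other plugins your setup already lists. In nutch-site.xml, something like:
> 
> ```xml
> <!-- nutch-site.xml: "myhtml" is the hypothetical custom parser plugin -->
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(text|myhtml|tika)|index-(basic|anchor)|scoring-opic</value>
> </property>
> ```
> 
> and in parse-plugins.xml, an association for text/html plus an alias that
> maps the plugin ID to the extension class:
> 
> ```xml
> <!-- parse-plugins.xml: plugin ID and extension-id below are placeholders -->
> <parse-plugins>
>   <mimeType name="text/html">
>     <plugin id="parse-myhtml" />
>   </mimeType>
>
>   <aliases>
>     <alias name="parse-myhtml"
>            extension-id="org.example.parse.MyHtmlParser" />
>   </aliases>
> </parse-plugins>
> ```
> 
> With an association like this in place, parse-tika should only come into
> play for text/html if the custom parser fails.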
> 
> Thanks
> 
> Julien

