Hi Julien,

How do I file a JIRA?

Thanks

On Fri, 2010-07-23 at 10:03 +0100, Julien Nioche wrote:
> > I have been trying to replace the Nutch 1.1 html parser with my own html
> > parser, but I have had no luck so far. Here are my findings:
> >
> > 1) in parse-plugins.xml, it doesn't matter whether you comment out or
> > uncomment those properties with plugin id being "parse-html". The only
> > working html parser is HtmlParser.java, not the Tika html parser.
>
> I doubt it. The Tika parser is used - people even report problems with it
> :-)
>
> > Even if
> > you remove the whole part below, HtmlParser.java will always be
> > called.
> >
> > <alias name="parse-html"
> >        extension-id="org.apache.nutch.parse.html.HtmlParser" />
> >
> > 2) The way I successfully replaced the Nutch 1.0 html parser (indeed
> > HtmlParser.java) with my own html parser never works within Nutch 1.1.
> > The commentary in the Nutch 1.1 parse-plugins.xml, "You can uncomment
> > the associations below to override parse-tika and chose which plugin
> > should be used for a given content type", is not true. As I stated in 1),
> > HtmlParser.java is always called whether or not the following mimeType
> > is commented out.
> >
> > <mimeType name="text/html">
> >   <plugin id="parse-html" />
> > </mimeType>
>
> Are you sure that parse-html is called and not parse-tika? What did you
> specify in plugin.includes and what is listed in your logs?
>
> Plugin.xml in Parse-tika specifies
>
> <parameter name="contentType" value="*"/>
>
> i.e. parse-tika is used by default. This means that it will be used if no
> association is specified for a given mime-type in parse-plugins.xml *OR* if
> the parser specified fails.
>
> Are you using Nutch in local or distributed mode? If you are in distributed
> mode then, as you certainly know, you need to rebuild the job file for the
> modifications to your local conf/ files to be taken into account.
>
> > I suggest the Nutch team provide a working example and some details
> > showing how to replace the Nutch 1.1 html parser.
>
> As I already said in a previous email exchange, specifying a custom HTML
> parser is a matter of specifying it in plugin.includes and creating an
> association between a mime-type and the parser ID in parse-plugins.xml. If
> you think that this is not working properly then please file a JIRA and
> attach your parse-plugins.xml + nutch-site.xml and give details on how you
> are using Nutch (local - distributed etc...)
>
> Thanks
>
> Julien
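
For reference, the two steps Julien describes (listing the parser in plugin.includes and associating it with a mime-type in parse-plugins.xml) might look roughly like the sketch below. It is not a tested configuration. The plugin id "parse-myhtml" and the class "org.example.parse.MyHtmlParser" are hypothetical names made up for illustration, and the "..." parts of the plugin.includes value stand for whatever your existing value already lists.

```xml
<!-- conf/nutch-site.xml: make sure the custom plugin is actually loaded.
     "parse-myhtml" is a hypothetical plugin id; "..." stands for the rest
     of your existing plugin.includes value, which is not shown here. -->
<property>
  <name>plugin.includes</name>
  <value>...|parse-(myhtml|tika)|...</value>
</property>

<!-- conf/parse-plugins.xml: associate text/html with the custom parser.
     The extension-id below is a hypothetical class name. -->
<parse-plugins>
  <mimeType name="text/html">
    <plugin id="parse-myhtml" />
  </mimeType>

  <aliases>
    <alias name="parse-myhtml"
           extension-id="org.example.parse.MyHtmlParser" />
  </aliases>
</parse-plugins>
```

Because parse-tika declares contentType "*", it would still act as the fallback if no association matches or if the associated parser fails. In distributed mode the job file has to be rebuilt before changes to these conf/ files take effect.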

