> I have been trying to replace the Nutch 1.1 html parser with my own html
> parser, but I have had no luck so far. Here are my findings:
>
> 1) in parse-plugins.xml, it doesn't matter whether you comment out or
> uncomment those properties with plugin id being "parse-html". The only
> working html parser is HtmlParser.java not Tika html parser.

I doubt it. The Tika parser is used - people even report problems with it :-)

> Even though
> you remove the whole part below, the HtmlParser.java will always be
> called.
>
> <alias name="parse-html"
> extension-id="org.apache.nutch.parse.html.HtmlParser" />
>
>
>
> 2) The way I successfully replaced the Nutch 1.0 html parser (indeed
> HtmlParser.java) with my own html parser never works within Nutch 1.1.
> The commentary in the Nutch 1.1 parse-plugins.xml, "You can uncomment
> the associations below to override parse-tika and chose which plugin
> should be used for a given content type", is not true. As I stated in 1)
> HtmlParser.java is always called whether or not the following mimeType
> is commented out.
>
> <mimeType name="text/html">
> <plugin id="parse-html" />
> </mimeType>
>

Are you sure that parse-html is called and not parse-tika? What did you
specify in plugin.includes and what is listed in your logs?

The plugin.xml in parse-tika specifies

<parameter name="contentType" value="*"/>

i.e. parse-tika is used by default. This means that it will be used if no
association is specified for a given mime-type in parse-plugins.xml *OR* if
the specified parser fails.

Are you using Nutch in local or distributed mode? If you are in distributed
mode then, as you certainly know, you need to rebuild the job file for the
modifications to your local conf/ files to be taken into account.

> I suggest the Nutch team provide a working example and some details
> showing how to replace the Nutch 1.1 html parser.
>

As I already said in a previous email exchange, specifying a custom HTML
parser is a matter of listing it in plugin.includes and creating an
association between a mime-type and the parser ID in parse-plugins.xml. If
you think that this is not working properly then please file a JIRA, attach
your parse-plugins.xml and nutch-site.xml, and give details on how you are
using Nutch (local, distributed, etc.).
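
For illustration only, here is a rough sketch of what those two changes could look like. The plugin id "parse-myhtml" and the class com.example.parse.MyHtmlParser are hypothetical placeholders for your own parser, and the rest of the plugin.includes value is only an example; keep whatever your installation already lists and just make sure your plugin id matches the regular expression.

```xml
<!-- conf/nutch-site.xml: a sketch, not the shipped default value -->
<property>
  <name>plugin.includes</name>
  <!-- "parse-myhtml" is a hypothetical plugin id; the other entries are illustrative -->
  <value>protocol-http|urlfilter-regex|parse-(myhtml|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

```xml
<!-- conf/parse-plugins.xml: associate text/html with the custom parser -->
<parse-plugins>
  <mimeType name="text/html">
    <!-- hypothetical plugin id, must be resolvable through an alias below -->
    <plugin id="parse-myhtml" />
  </mimeType>

  <aliases>
    <!-- the extension-id is assumed to match the implementation id
         declared in your own plugin.xml -->
    <alias name="parse-myhtml"
           extension-id="com.example.parse.MyHtmlParser" />
  </aliases>
</parse-plugins>
```

With an association like this in place, parse-tika should only be used for text/html if the custom parser fails. In distributed mode the job file has to be rebuilt before either change is picked up.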

Thanks

Julien

-- 
DigitalPebble Ltd
Open Source Solutions for Text Engineering
http://www.digitalpebble.com

