Hi, 

I have been trying to replace the Nutch 1.1 html parser with my own html
parser, but I have had no luck so far. Here are my findings:

1) In parse-plugins.xml, it makes no difference whether you comment out or
uncomment the entries whose plugin id is "parse-html". The only html
parser that ever runs is HtmlParser.java, not the Tika html parser. Even
if you remove the whole alias below, HtmlParser.java is still called.

   <alias name="parse-html"
          extension-id="org.apache.nutch.parse.html.HtmlParser" />


2) The approach I used to successfully replace the Nutch 1.0 html parser
(that is, HtmlParser.java) with my own html parser does not work in
Nutch 1.1. The comment in the Nutch 1.1 parse-plugins.xml, "You can
uncomment the associations below to override parse-tika and chose which
plugin should be used for a given content type", is not true. As I stated
in 1), HtmlParser.java is always called whether or not the following
mimeType is commented out.

   <mimeType name="text/html">
           <plugin id="parse-html" />
   </mimeType>
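
For reference, here is roughly the kind of override I am trying to get
working in parse-plugins.xml. This is only a sketch: the plugin id
"parse-myhtml" and the class name org.example.parse.MyHtmlParser are
placeholders for my own plugin, not anything that ships with Nutch.

   <!-- "parse-myhtml" is a placeholder id for my own parser plugin -->
   <mimeType name="text/html">
           <plugin id="parse-myhtml" />
   </mimeType>

   <aliases>
           <!-- org.example.parse.MyHtmlParser is a placeholder class name -->
           <alias name="parse-myhtml"
                  extension-id="org.example.parse.MyHtmlParser" />
   </aliases>

My understanding, which may be wrong for 1.1, is that the plugin id also
has to match the plugin.includes pattern for the plugin to be loaded at
all.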


I suggest the Nutch team provide a working example and some details
showing how to replace the Nutch 1.1 html parser.

Thanks.
