> I have been trying to replace the Nutch 1.1 html parser with my own html
> parser, but I have had no luck so far. Here are my findings:
>
> 1) in parse-plugins.xml, it doesn't matter whether you comment out or
> uncomment those properties with plugin id being "parse-html". The only
> working html parser is HtmlParser.java, not the Tika html parser.


I doubt it. The Tika parser is used - people even report problems with it
:-)


> Even if
> you remove the whole part below, HtmlParser.java will always be
> called.
>
>   <alias name="parse-html"
>          extension-id="org.apache.nutch.parse.html.HtmlParser" />
>


>
>
> 2) The way I successfully replaced the Nutch 1.0 html parser (indeed
> HtmlParser.java) with my own html parser does not work in Nutch 1.1.
> The commentary in the Nutch 1.1 parse-plugins.xml, "You can uncomment
> the associations below to override parse-tika and chose which plugin
> should be used for a given content type", is not true. As I stated in 1),
> HtmlParser.java is always called whether or not the following mimeType
> is commented out.
>
>   <mimeType name="text/html">
>     <plugin id="parse-html" />
>   </mimeType>
>

Are you sure that parse-html is called and not parse-tika? What did you
specify in plugin.includes and what is listed in your logs?

The plugin.xml in parse-tika specifies

<parameter name="contentType" value="*"/>

i.e. parse-tika is used by default. This means that it will be used if no
association is specified for a given mime-type in parse-plugins.xml *OR* if
the specified parser fails.
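
For illustration, an explicit association that takes precedence over the
wildcard default could look like the fragment below. It is only a sketch
assembled from the two snippets quoted in this thread (the text/html
mimeType entry and the parse-html alias); the enclosing <parse-plugins> and
<aliases> elements follow the layout of the stock conf/parse-plugins.xml,
and the rest of the file is omitted.

```xml
<!-- conf/parse-plugins.xml (fragment, other entries omitted) -->
<parse-plugins>

  <!-- Explicit association: text/html is handed to parse-html.
       Without an entry like this, the contentType="*" parser
       (parse-tika) is the one that gets used. -->
  <mimeType name="text/html">
    <plugin id="parse-html" />
  </mimeType>

  <aliases>
    <!-- Maps the plugin id used above to the parser extension -->
    <alias name="parse-html"
           extension-id="org.apache.nutch.parse.html.HtmlParser" />
  </aliases>

</parse-plugins>
```

The association only has an effect if the plugin is also enabled through
plugin.includes.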

Are you using Nutch in local or distributed mode? If you are in distributed
mode then, as you certainly know, you need to rebuild the job file for the
modifications to your local conf/ files to be taken into account.



> I suggest the Nutch team provide a working example and some details
> showing how to replace the Nutch 1.1 html parser.
>

As I already said in a previous email exchange, specifying a custom HTML
parser is a matter of listing it in plugin.includes and creating an
association between a mime-type and the parser ID in parse-plugins.xml. If
you think that this is not working properly, then please file a JIRA issue,
attach your parse-plugins.xml and nutch-site.xml, and give details on how you
are using Nutch (local, distributed, etc.).
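
As a rough sketch of those two steps: the plugin id "parse-myhtml" and the
class name "org.example.nutch.parse.MyHtmlParser" below are hypothetical
placeholders for your own plugin, and the plugin.includes value is
deliberately shortened, so merge the parse-(...) part into the value you
already have rather than copying it as is.

```xml
<!-- conf/nutch-site.xml (fragment) -->
<property>
  <name>plugin.includes</name>
  <!-- Illustrative, shortened value: keep your existing entries and make
       sure the custom parser id (hypothetical "parse-myhtml") is listed -->
  <value>protocol-http|urlfilter-regex|parse-(myhtml|tika)|index-basic</value>
</property>
```

```xml
<!-- conf/parse-plugins.xml (fragment) -->
<parse-plugins>

  <!-- Associate text/html with the custom parser (hypothetical id) -->
  <mimeType name="text/html">
    <plugin id="parse-myhtml" />
  </mimeType>

  <aliases>
    <!-- Assumed to mirror the existing parse-html alias: the extension-id
         must match the id declared in your own plugin.xml -->
    <alias name="parse-myhtml"
           extension-id="org.example.nutch.parse.MyHtmlParser" />
  </aliases>

</parse-plugins>
```

The logs should then show which parser is actually invoked for text/html,
which is the quickest way to check that the association is picked up.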

Thanks

Julien
-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com
