I am trying to parse .xml web pages with Nutch 1.2, and I see errors like
this:

 org.apache.nutch.parse.ParseException: parser not found for contentType=application/xml

I've seen someone mention that we should add

<mimeType name="application/xml">
<plugin id="parse-html" />
</mimeType>

to the parse-plugins.xml file, but when I look at the plugin.xml file for
parse-html, all I see is

      <implementation id="org.apache.nutch.parse.html.HtmlParser"
                      class="org.apache.nutch.parse.html.HtmlParser">
        <parameter name="contentType" value="text/html"/>


So parse-html wouldn't be able to parse XML, would it?

I've also seen a parse-xml plugin mentioned, along with reports that it
doesn't work with Nutch 1.1 or 1.2 (I am using 1.2) and that Tika should
be used instead.

I tried setting

<mimeType name="application/xml">
<plugin id="parse-tika" />
</mimeType>

but that didn't work. Any suggestions?
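
In case it helps anyone reproduce this, here is a minimal sketch for checking which plugin ids a parse-plugins.xml file lists for a given content type. It only reads the mimeType/plugin layout quoted above and compares names as exact strings; that comparison, the parse-plugins root element, the sample contents, and the plugins_for helper are all assumptions made for illustration, not a description of how Nutch itself resolves parsers.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the <mimeType>/<plugin> layout quoted above.
# The <parse-plugins> root element is an assumption for this sketch.
SAMPLE = """
<parse-plugins>
  <mimeType name="text/html">
    <plugin id="parse-html" />
  </mimeType>
  <mimeType name="application/xml">
    <plugin id="parse-tika" />
  </mimeType>
</parse-plugins>
"""


def plugins_for(xml_text, content_type):
    """Return plugin ids listed under the <mimeType> whose name equals content_type.

    Exact string comparison is an assumption of this sketch; it does not
    claim to reproduce Nutch's own lookup rules.
    """
    root = ET.fromstring(xml_text)
    ids = []
    for mime in root.iter("mimeType"):
        if mime.get("name") == content_type:
            ids.extend(plugin.get("id") for plugin in mime.findall("plugin"))
    return ids


if __name__ == "__main__":
    print(plugins_for(SAMPLE, "application/xml"))
```

Running the same helper against the text of the real conf/parse-plugins.xml would at least confirm whether the entry is spelled exactly like the content type in the error message.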

Thanks,
Steve
