I am trying to parse .xml web pages with Nutch 1.2 and I see errors like
this:
org.apache.nutch.parse.ParseException: parser not found for
contentType=application/xml
I've seen someone mention that we should add
<mimeType name="application/xml">
<plugin id="parse-html" />
</mimeType>
to the parse-plugins.xml file, but when I look at the plugin.xml file for
parse-html, all I see is
<implementation id="org.apache.nutch.parse.html.HtmlParser"
class="org.apache.nutch.parse.html.HtmlParser">
<parameter name="contentType" value="text/html"/>
So it wouldn't be able to parse XML, would it?
I've also seen a mention of a parse-xml plugin, but reportedly it doesn't
work with Nutch 1.1 or 1.2 (I am using 1.2), and Tika should work instead.
I tried setting
<mimeType name="application/xml">
<plugin id="parse-tika" />
</mimeType>
but that didn't work either. Any suggestions?
Thanks,
Steve