I am using a Hadoop cluster, so I put my conf files into the Nutch source
conf directory and build a Nutch job file, which I then put on the
classpath. I thought this was working, since regex-urlfilter.txt seems to be
read from there. However, I am getting messages like this:
2011-01-04 07:41:55,259 WARN parse.ParserFactory - ParserFactory:Plugin:
org.apache.nutch.parse.html.HtmlParser mapped to contentType
application/xhtml+xml via parse-plugins.xml, but its plugin.xml file does
not claim to support contentType: application/xhtml+xml
But in my parse-plugins.xml file I have this:
<mimeType name="application/xhtml+xml">
  <plugin id="parse-tika" />
</mimeType>
Shouldn't Nutch be using parse-tika for this content type, given that
mapping? If Nutch is reading a different parse-plugins.xml, how do I find
out which one it is using?
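
One check I could try is sketched below. It assumes that Nutch resolves
parse-plugins.xml through the classpath, which I have not confirmed. It is a
small stand-alone Java class that lists every copy of a resource the class
loader can see. ResourceLocator and locate are names I made up for
illustration; they are not part of Nutch or Hadoop.

```java
import java.io.IOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

// Hypothetical diagnostic helper; not part of Nutch or Hadoop.
final class ResourceLocator {

    // Returns every classpath location that provides the named resource,
    // in the order the class loader reports them.
    static List<URL> locate(String resourceName) throws IOException {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            // Fall back to the system class loader if no context loader is set.
            loader = ClassLoader.getSystemClassLoader();
        }
        return Collections.list(loader.getResources(resourceName));
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the file is looked up under this name unless overridden.
        String name = args.length > 0 ? args[0] : "parse-plugins.xml";
        List<URL> hits = locate(name);
        if (hits.isEmpty()) {
            System.out.println(name + " is not on the classpath");
        }
        for (URL hit : hits) {
            System.out.println(hit);
        }
    }
}
```

If this is run with the same classpath as the crawl (job file and conf
directory included), more than one line of output would mean there are
several copies, and the first one listed is probably the one that gets
loaded.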
Thanks,
Steve Cohen