Hmm... my first guess would be to insert the line
<parameter name="contentType" value="application/xhtml+xml"/>
right after
<parameter name="contentType" value="text/html"/>
in plugins/parse-html/plugin.xml.
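As a rough, untested sketch, the HtmlParser entry in that file would then look something like the fragment below. The surrounding <implementation> element and its id/class attributes are my assumption, based on the class name in your warning, so compare it against your own copy rather than pasting it in. I have also not checked how Nutch treats a repeated contentType parameter.

```xml
<!-- Assumed shape of the HtmlParser entry in plugins/parse-html/plugin.xml.
     The id/class values are inferred from the class name in the warning;
     other parameters your file may contain are left out here. -->
<implementation id="org.apache.nutch.parse.html.HtmlParser"
                class="org.apache.nutch.parse.html.HtmlParser">
  <parameter name="contentType" value="text/html"/>
  <!-- Added line: declare XHTML so plugin.xml agrees with parse-plugins.xml -->
  <parameter name="contentType" value="application/xhtml+xml"/>
</implementation>
```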
Going by what the warning says, you probably have parse-html mapped to
this contentType in parse-plugins.xml, like this:
<mimeType name="application/xhtml+xml">
<plugin id="parse-html" />
</mimeType>
but the same contentType is not declared in the plugin's own plugin.xml.
Please let me know if that works.
Best Regards,
Emmanuel de Castro Santana
2010/10/13 Okke Klein <[email protected]>
> 2010-10-12 19:54:19,976 WARN parse.ParserFactory - ParserFactory:Plugin:
> org.apache.nutch.parse.html.HtmlParser mapped to contentType
> application/xhtml+xml via parse-plugins.xml, but its plugin.xml file does
> not claim to support contentType: application/xhtml+xml
>
> 2010-10-12 19:54:19,991 WARN parse.ParseUtil - Unable to successfully
> parse content http://www.lucidimagination.com/ of type
> application/xhtml+xml
>
> 2010-10-12 19:54:19,991 WARN fetcher.Fetcher - Error parsing:
> http://www.lucidimagination.com/: failed(2,200):
> org.apache.nutch.parse.ParseException: Unable to successfully parse content
>
> I am trying to crawl http://www.lucidimagination.com/ with Nutch 1.2. I
> tried both Tika and html parsers (above is html), but neither work.
>
> Any suggestions?
>