Hmm... my first guess would be to insert the line
<parameter name="contentType" value="application/xhtml+xml"/>
right after
<parameter name="contentType" value="text/html"/>
in plugins/parse-html/plugin.xml.

Going by what the warning says, you probably have parse-html mapped to
this contentType in parse-plugins.xml, like this:

<mimeType name="application/xhtml+xml">
    <plugin id="parse-html" />
</mimeType>

but plugin.xml does not declare the same contentType.
Please let me know if that works.
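
If it helps, here is a rough Python sketch for double-checking the mismatch
before and after the change. It lists the content types that
parse-plugins.xml maps to parse-html but that plugin.xml does not declare.
It relies only on the element shapes shown above (<mimeType>/<plugin> and
<parameter name="contentType">). The two XML strings are trimmed,
hypothetical stand-ins for your real files, the function names are my own,
and the comparison is a plain string match, which may not be exactly how
Nutch itself matches content types.

```python
import xml.etree.ElementTree as ET


def mapped_types(parse_plugins_xml, plugin_id):
    """Content types that parse-plugins.xml maps to the given plugin id."""
    root = ET.fromstring(parse_plugins_xml)
    return {
        mime.get("name")
        for mime in root.iter("mimeType")
        if any(p.get("id") == plugin_id for p in mime.iter("plugin"))
    }


def claimed_types(plugin_xml):
    """Content types declared as <parameter name="contentType" value="..."/>."""
    root = ET.fromstring(plugin_xml)
    return {
        param.get("value")
        for param in root.iter("parameter")
        if param.get("name") == "contentType"
    }


def unclaimed_types(parse_plugins_xml, plugin_xml, plugin_id="parse-html"):
    """Mapped content types that the plugin descriptor does not declare."""
    mapped = mapped_types(parse_plugins_xml, plugin_id)
    return sorted(mapped - claimed_types(plugin_xml))


# Hypothetical, heavily trimmed stand-ins for conf/parse-plugins.xml and
# plugins/parse-html/plugin.xml; the real files contain much more.
PARSE_PLUGINS = """
<parse-plugins>
  <mimeType name="text/html">
    <plugin id="parse-html" />
  </mimeType>
  <mimeType name="application/xhtml+xml">
    <plugin id="parse-html" />
  </mimeType>
</parse-plugins>
"""

PLUGIN = """
<plugin id="parse-html">
  <parameter name="contentType" value="text/html"/>
</plugin>
"""

if __name__ == "__main__":
    # Prints the content types that are mapped but not declared.
    print(unclaimed_types(PARSE_PLUGINS, PLUGIN))
```

To run it against a real installation, read the two files from disk and pass
their contents in place of the sample strings. An empty list means every
mapped type is also declared in plugin.xml.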

Best Regards,
Emmanuel de Castro Santana


2010/10/13 Okke Klein <[email protected]>

>  2010-10-12 19:54:19,976 WARN  parse.ParserFactory - ParserFactory:Plugin:
> org.apache.nutch.parse.html.HtmlParser mapped to contentType
> application/xhtml+xml via parse-plugins.xml, but its plugin.xml file does
> not claim to support contentType: application/xhtml+xml
>
> 2010-10-12 19:54:19,991 WARN  parse.ParseUtil - Unable to successfully
> parse content http://www.lucidimagination.com/ of type
> application/xhtml+xml
>
> 2010-10-12 19:54:19,991 WARN  fetcher.Fetcher - Error parsing:
> http://www.lucidimagination.com/: failed(2,200):
> org.apache.nutch.parse.ParseException: Unable to successfully parse content
>
> I am trying to crawl http://www.lucidimagination.com/ with Nutch 1.2. I
> tried both Tika and html parsers (above is html), but neither works.
>
> Any suggestions?
>



-- 
Emmanuel de Castro Santana
