Re: Error parsing html

Lewis John Mcgibbney Thu, 12 Jul 2012 11:11:14 -0700

For starters, there is no parse-xhtml plugin, unless of course this is a
custom one you've written yourself.

If it is not a custom plugin, remove xhtml from the parse-(html|xhtml|tika)
group in the plugin.includes property and re-run the crawl.
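
Assuming parse-xhtml is not a custom plugin, the property in nutch-site.xml
would look roughly like this. It is only a sketch: everything apart from the
dropped xhtml entry is copied from the value quoted below, and the
surrounding <property> and <name> elements are assumed, since only the
<value> line was posted.

```xml
<property>
  <name>plugin.includes</name>
  <!-- same value as in the original message, minus "xhtml" in the parse-(...) group -->
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```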

hth

On Thu, Jul 12, 2012 at 7:00 PM, Sudip Datta <[email protected]> wrote:
> Hi,
>
> I am using Nutch 1.4 and Solr. My crawls were working perfectly fine before
> I made some changes to the SolrWriter (which I believe has nothing to do
> with my problem). Since then, I am getting:
>
> WARN : org.apache.nutch.parse.ParseUtil - Unable to successfully parse
> content <webpage> of type text/html
> INFO : org.apache.nutch.parse.ParseSegment - Parsing: <webpage>
> WARN : org.apache.nutch.parse.ParseSegment - Error parsing: <webpage>:
> failed(2,200): org.apache.nutch.parse.ParseException: Unable to
> successfully parse content
>
> for any <webpage> that I try to crawl!
>
> My nutch-site.xml file reads:
> <value>protocol-httpclient|urlfilter-regex|parse-(html|xhtml|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>
> What could be going wrong?
>
> Thanks,
>
> --Sudip.



-- 
Lewis
