Error parsing html

Sudip Datta Thu, 12 Jul 2012 11:01:19 -0700

Hi,

I am using Nutch 1.4 and Solr. My crawls were working perfectly fine before
I made some changes to the SolrWriter (which I believe has nothing to do
with my problem). Since then, I am getting:


WARN : org.apache.nutch.parse.ParseUtil - Unable to successfully parse content <webpage> of type text/html
INFO : org.apache.nutch.parse.ParseSegment - Parsing: <webpage>
WARN : org.apache.nutch.parse.ParseSegment - Error parsing: <webpage>: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content

for any <webpage> that I try to crawl!

My nutch-site.xml file reads:
<value>protocol-httpclient|urlfilter-regex|parse-(html|xhtml|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
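
To rule out a typo in that pattern, here is a minimal sketch that checks which plugin ids it admits. It assumes Nutch treats the value as a regular expression that must match the whole plugin id; the list of ids to test is hypothetical and should be replaced with the directory names under plugins/ in the actual install.

```python
import re

# Value copied verbatim from the nutch-site.xml line above.
PLUGIN_INCLUDES = (
    r"protocol-httpclient|urlfilter-regex|parse-(html|xhtml|tika)|"
    r"index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)"
)

# Hypothetical ids to test; replace with the names found under plugins/.
CANDIDATE_IDS = ["parse-html", "parse-tika", "parse-js", "index-basic"]


def included(plugin_id, pattern=PLUGIN_INCLUDES):
    """Return True if the whole plugin id matches the pattern.

    Assumption: the id must match in full, not merely contain a match.
    """
    return re.fullmatch(pattern, plugin_id) is not None


def report(ids):
    """Map each plugin id to whether the pattern admits it."""
    return {pid: included(pid) for pid in ids}


if __name__ == "__main__":
    for pid, ok in report(CANDIDATE_IDS).items():
        print(f"{pid}: {'included' if ok else 'not matched'}")
```

This only confirms what the pattern selects. It does not show whether the selected parser plugins actually load or why the parse fails.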

What could be going wrong?

Thanks,

--Sudip.
