Re: Error parsing html

Sudip Datta Thu, 12 Jul 2012 11:35:33 -0700

Nope, that didn't help. In fact, I had added that entry only minutes before
sending mail to the group, after a couple of hours of frustration
trying to get the parser to work.


On Thu, Jul 12, 2012 at 11:40 PM, Lewis John Mcgibbney <
[email protected]> wrote:

> For starters, there is no parse-xhtml plugin, unless of course this is a
> custom one you've written yourself.
>
> Unless this is the case, remove this from the plugin.includes
> property and re-spin it.
>
> hth
>
> On Thu, Jul 12, 2012 at 7:00 PM, Sudip Datta <[email protected]> wrote:
> > Hi,
> >
> > I am using Nutch 1.4 and Solr. My crawls were working perfectly fine
> > before I made some changes to the SolrWriter (which I believe has
> > nothing to do with my problem). Since then, I am getting:
> >
> > WARN : org.apache.nutch.parse.ParseUtil - Unable to successfully parse
> > content <webpage> of type text/html
> > INFO : org.apache.nutch.parse.ParseSegment - Parsing: <webpage>
> > WARN : org.apache.nutch.parse.ParseSegment - Error parsing: <webpage>:
> > failed(2,200): org.apache.nutch.parse.ParseException: Unable to
> > successfully parse content
> >
> > for any <webpage> that I try to crawl!
> >
> > My nutch-site.xml file reads:
> >
> > <value>protocol-httpclient|urlfilter-regex|parse-(html|xhtml|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> >
> > What could be going wrong?
> >
> > Thanks,
> >
> > --Sudip.
>
>
>
> --
> Lewis
>
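
For reference, the change suggested in the quoted reply would look roughly like the fragment below. It is only a sketch: it assumes the quoted <value> line sits inside the usual plugin.includes property of nutch-site.xml (the <property> and <name> wrapper is not shown in the original mail), and it drops only the xhtml entry from the parse-(...) group. Note that the original poster reports above that this change alone did not make the parse error go away.

```xml
<!-- Assumed wrapper: the original mail quotes only the <value> line. -->
<property>
  <name>plugin.includes</name>
  <!-- Same value as in the mail, with "xhtml" removed from the parse-(...) group. -->
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```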
