RE: Error parsing html

Markus Jelsma Thu, 12 Jul 2012 11:48:55 -0700
Strange. Check whether text/html is mapped to parse-tika or parse-html in 
parse-plugins.xml. You may also want to check parse-tika's plugin.xml; there 
the parser must be mapped to * or to a regex of content types.
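
For reference, a minimal sketch of what the text/html mapping in
parse-plugins.xml might look like. It assumes the stock parse-html and
parse-tika plugins and their usual extension ids; the file in a given
install may differ, so treat this as an illustration rather than the exact
file shipped with Nutch 1.4:

```xml
<parse-plugins>
  <!-- Assumed mapping: try parse-html first for text/html -->
  <mimeType name="text/html">
    <plugin id="parse-html" />
  </mimeType>

  <!-- Assumed fallback: everything else goes to parse-tika -->
  <mimeType name="*">
    <plugin id="parse-tika" />
  </mimeType>

  <!-- Aliases tie the plugin ids above to parser extension ids -->
  <aliases>
    <alias name="parse-html"
           extension-id="org.apache.nutch.parse.html.HtmlParser" />
    <alias name="parse-tika"
           extension-id="org.apache.nutch.parse.tika.TikaParser" />
  </aliases>
</parse-plugins>
```

Any plugin id referenced here also has to be enabled through
plugin.includes, otherwise the mapping points at a parser that is never
loaded.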
 
 
-----Original message-----
> From:Sudip Datta <[email protected]>
> Sent: Thu 12-Jul-2012 20:36
> To: [email protected]
> Subject: Re: Error parsing html
> 
> Nope, that didn't help. In fact, I had added that entry only minutes before
> sending mail to the group, after a couple of hours of frustration
> trying to get the parser to work.
> 
> On Thu, Jul 12, 2012 at 11:40 PM, Lewis John Mcgibbney <
> [email protected]> wrote:
> 
> > For starters there is no parse-xhtml plugin unless of course this is a
> > custom one you've written yourself.
> >
> > Unless this is the case then remove this from the plugin.includes
> > property and re-spin it
> >
> > hth
> >
> > On Thu, Jul 12, 2012 at 7:00 PM, Sudip Datta <[email protected]> wrote:
> > > Hi,
> > >
> > > I am using Nutch 1.4 and Solr. My crawls were working perfectly fine
> > > before I made some changes to the SolrWriter (which I believe has
> > > nothing to do with my problem). Since then, I am getting:
> > >
> > > WARN : org.apache.nutch.parse.ParseUtil - Unable to successfully parse
> > > content <webpage> of type text/html
> > > INFO : org.apache.nutch.parse.ParseSegment - Parsing: <webpage>
> > > WARN : org.apache.nutch.parse.ParseSegment - Error parsing: <webpage>:
> > > failed(2,200): org.apache.nutch.parse.ParseException: Unable to
> > > successfully parse content
> > >
> > > for any <webpage> that I try to crawl!
> > >
> > > My nutch-site.xml file reads:
> > >
> > > <value>protocol-httpclient|urlfilter-regex|parse-(html|xhtml|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> > >
> > > What could be going wrong?
> > >
> > > Thanks,
> > >
> > > --Sudip.
> >
> >
> >
> > --
> > Lewis
> >
>