RE: Error parsing html

Markus Jelsma Thu, 12 Jul 2012 14:55:43 -0700
Please provide the whole log snippet. Is it an HTML file? Can the parser parse 
it? Is it large?
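
On the `task.get(MAX_PARSE_TIME, TimeUnit.SECONDS)` call mentioned in the message below: when the submitted task itself throws, `Future.get` reports that as an `ExecutionException`, and the underlying error is available from `getCause()`. Logging that cause is usually what makes a log snippet informative. The following is a minimal standalone sketch, not the actual ParseUtil code; the class name, the `runWithTimeout` helper and the 30 second value for `MAX_PARSE_TIME` are assumptions made for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ParseTimeoutSketch {
    // Hypothetical stand-in for the MAX_PARSE_TIME constant; the value is an assumption.
    static final int MAX_PARSE_TIME = 30;

    /** Runs the task with a time limit and returns a description of how it ended. */
    static String runWithTimeout(Callable<String> parseTask) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<String> task = executor.submit(parseTask);
        try {
            return "ok: " + task.get(MAX_PARSE_TIME, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            // The task did not finish in time.
            task.cancel(true);
            return "timeout";
        } catch (ExecutionException e) {
            // The task itself threw; the real failure is the wrapped cause.
            Throwable cause = e.getCause();
            return "failed: " + cause.getClass().getSimpleName() + ": " + cause.getMessage();
        } catch (InterruptedException e) {
            // Restore the interrupt flag for callers further up the stack.
            Thread.currentThread().interrupt();
            return "interrupted";
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(runWithTimeout(() -> "parsed"));
        System.out.println(runWithTimeout(() -> {
            throw new IllegalStateException("simulated parser failure");
        }));
    }
}
```

Catching `ExecutionException` separately from `TimeoutException` distinguishes "the parser was too slow" from "the parser threw", which is the distinction the question below turns on.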
 
 
-----Original message-----
> From:Sudip Datta <[email protected]>
> Sent: Thu 12-Jul-2012 23:47
> To: Markus Jelsma <[email protected]>
> Cc: [email protected]
> Subject: Re: Error parsing html
> 
> In ParseUtil.java, it gets an Exception (not a TimeoutException) while trying 
> to execute:
>      res = task.get(MAX_PARSE_TIME, TimeUnit.SECONDS);
> Wonder if that can help in getting closer to the solution.
> 
> Should I instead try Tika as well, which I believe also parses HTML? What 
> changes will be required for that?
> 
> Thanks again.
> 
> On Fri, Jul 13, 2012 at 1:40 AM, Markus Jelsma <[email protected]> wrote:
> Seems correct indeed. Please check the logs, they may tell some more.
> 
> 
> 
> -----Original message-----
> > From:Sudip Datta <[email protected]>
> > Sent: Thu 12-Jul-2012 21:51
> > To: Markus Jelsma <[email protected]>
> > Cc: [email protected]
> > Subject: Re: Error parsing html
> >
> > Hi Markus,
> >
> > Yes, they seem to be rightly mapped:
> >
> > parse-plugins.xml reads:
> >
> > <mimeType name="text/html">
> >     <plugin id="parse-html"/>
> > </mimeType>
> >
> > and tika's plugin.xml reads:
> >
> >   <extension point="org.apache.nutch.parse.Parser"
> > id="org.apache.nutch.parse.tika" name="TikaParser">
> >     <implementation id="org.apache.nutch.parse.tika.TikaParser"
> > class="org.apache.nutch.parse.tika.TikaParser">
> >       <parameter name="contentType" value="*"/>
> >     </implementation>
> >   </extension>
> >
> > This one
> > <http://stackoverflow.com/questions/8784656/nutch-unable-to-successfully-parse-content>
> > seems to have a similar problem but doesn't mention where in the code he has an
> > error.
> >
> > Thanks,
> >
> > --Sudip.
> >
> > On Fri, Jul 13, 2012 at 12:19 AM, Markus Jelsma
> > <[email protected]> wrote:
> >
> > > Strange. Check whether text/html is mapped to parse-tika or parse-html in
> > > parse-plugins.xml. You may also want to check Tika's plugin.xml; it must
> > > be mapped to * or a regex of content types.
> > >
> > >
> > > -----Original message-----
> > > > From:Sudip Datta <[email protected]>
> > > > Sent: Thu 12-Jul-2012 20:36
> > > > To: [email protected]
> > > > Subject: Re: Error parsing html
> > > >
> > > > Nope. That didn't help. In fact, I had added that entry minutes before
> > > > sending a mail to the group, after a couple of hours of frustration
> > > > trying to get the parser to work.
> > > >
> > > > On Thu, Jul 12, 2012 at 11:40 PM, Lewis John Mcgibbney
> > > > <[email protected]> wrote:
> > > >
> > > > > For starters, there is no parse-xhtml plugin, unless of course this is a
> > > > > custom one you've written yourself.
> > > > >
> > > > > If it is not, remove it from the plugin.includes property and
> > > > > re-spin it
> > > > >
> > > > > hth
> > > > >
> > > > > On Thu, Jul 12, 2012 at 7:00 PM, Sudip Datta <[email protected]> wrote:
> > > > > > Hi,
> > > > > >
> > > > > > I am using Nutch 1.4 and Solr. My crawls were working perfectly fine before
> > > > > > I made some changes to the SolrWriter (which I believe has nothing to do
> > > > > > with my problem). Since then, I am getting:
> > > > > >
> > > > > > WARN : org.apache.nutch.parse.ParseUtil - Unable to successfully parse content <webpage> of type text/html
> > > > > > INFO : org.apache.nutch.parse.ParseSegment - Parsing: <webpage>
> > > > > > WARN : org.apache.nutch.parse.ParseSegment - Error parsing: <webpage>: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content
> > > > > >
> > > > > > for any <webpage> that I try to crawl!
> > > > > >
> > > > > > My nutch-site.xml file reads:
> > > > > >
> > > > > > <value>protocol-httpclient|urlfilter-regex|parse-(html|xhtml|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> > > > > >
> > > > > > What could be going wrong?
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > --Sudip.
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Lewis
> > > > >
> > > >
> > >
> >
> 
>
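
Regarding the plugin.includes value in the original message above: the following is a small sketch of which plugin ids that expression selects, assuming the property is treated as a regular expression that must match a whole plugin id. The class name, the `INCLUDES` constant and the `isIncluded` helper are hypothetical and only illustrate the matching, not how Nutch loads plugins.

```java
import java.util.regex.Pattern;

public class PluginIncludesSketch {
    // The plugin.includes value quoted in the original message.
    static final String INCLUDES =
        "protocol-httpclient|urlfilter-regex|parse-(html|xhtml|tika)"
        + "|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)";

    /** True if the expression matches the plugin id in full (assumed semantics). */
    static boolean isIncluded(String pluginIncludes, String pluginId) {
        return Pattern.compile(pluginIncludes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String[] ids = {"parse-html", "parse-tika", "parse-xhtml", "parse-js"};
        for (String id : ids) {
            System.out.println(id + " -> " + isIncluded(INCLUDES, id));
        }
        // prints:
        // parse-html -> true
        // parse-tika -> true
        // parse-xhtml -> true
        // parse-js -> false
    }
}
```

Under that assumption, the xhtml alternative only names an id; it selects something only if a plugin with that id is actually installed, while parse-html and parse-tika are matched either way.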