RE: Nutch HTML Parsers & tika-boilerpipe configuration

Markus Jelsma Mon, 29 Jul 2013 06:03:40 -0700
Strange. Please check the logs, and perhaps restore the default settings and
config files. I'm very sure it works flawlessly on a vanilla Nutch.
 
-----Original message-----
> From:Saravanakumar Karunanithi <[email protected]>
> Sent: Monday 29th July 2013 13:35
> To: [email protected]
> Subject: Re: Nutch HTML Parsers & tika-boilerpipe configuration
> 
> After applying the patch, I tried the following command:
> 
> bin/nutch parsechecker -dumpText
> http://indiatoday.intoday.in/story/google-unveils-android-4.3-jelly-bean-operating-system/1/296208.html
> 
> This produced the expected results, but when I run the crawler, I get
> ~98% errors while parsing.
> 
> I get the following error:
> 
> "Unable to successfully parse content URL"
> 
> 
> 
> On Mon, Jul 29, 2013 at 4:53 PM, Markus Jelsma
> <[email protected]> wrote:
> 
> > Simple, only use parse-tika and patch with NUTCH-961.
> > https://issues.apache.org/jira/browse/NUTCH-961
> >
> > Extractor algorithms are fixed; it is not possible to pre-analyze a page
> > and select an extractor accordingly.
> >
> >
> > -----Original message-----
> > > From:imran khan <[email protected]>
> > > Sent: Monday 29th July 2013 11:25
> > > To: [email protected]
> > > Subject: Nutch HTML Parsers & tika-boilerpipe configuration
> > >
> > > Greetings,
> > >
> > > I am trying to understand the role/functionality of the different HTML
> > > parser plugins (parse-html and parse-tika) in Nutch 2.2.
> > >
> > > My plugin.includes has "parse-(html|tika)" and my parse-plugins.xml has:
> > >
> > > <mimeType name="*">
> > >   <plugin id="parse-tika" />
> > > </mimeType>
> > >
> > > <mimeType name="text/html">
> > >   <plugin id="parse-html" />
> > > </mimeType>
> > >
> > > <mimeType name="application/xhtml+xml">
> > >   <plugin id="parse-html" />
> > > </mimeType>
> > >
> > > So does this mean that the "parse-html" plugin would be used for parsing
> > > HTML pages? And to use Tika for parsing my HTML pages, would I simply
> > > replace it with the "parse-tika" plugin?
> > >
> > > And if I want to remove boilerplate text (menus, ad text, etc.) from my
> > > 'content' field in Nutch, then I guess I have to use Tika with boilerpipe?
> > >
> > > Where can I configure Nutch to use boilerpipe with Tika and other
> > > extractors? And is there any configuration in Tika/boilerpipe that would
> > > automatically pick the right extractor for the current HTML page?
> > >
> > > Regards,
> > > Imran
> > >
> >
> 
> 
> 
> -- 
> Thanks & Regards,
> Saravanakumar Karunanithi
>
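
For readers following the thread, here is a minimal sketch of the setup suggested above ("only use parse-tika and patch with NUTCH-961"). The parse-plugins.xml part mirrors the mapping quoted in the original question with only the plugin id swapped. The two property names in the nutch-site.xml part are an assumption: they are the names used in later Nutch 1.x releases, and a NUTCH-961 patch applied to Nutch 2.2 may name them differently, so check the patch itself before copying them. These are fragments, not complete files.

```xml
<!-- conf/parse-plugins.xml (fragment): send HTML and XHTML to parse-tika
     instead of parse-html. The catch-all "*" mapping is unchanged. -->
<mimeType name="*">
  <plugin id="parse-tika" />
</mimeType>

<mimeType name="text/html">
  <plugin id="parse-tika" />
</mimeType>

<mimeType name="application/xhtml+xml">
  <plugin id="parse-tika" />
</mimeType>

<!-- conf/nutch-site.xml (fragment): enable boilerpipe in parse-tika.
     ASSUMPTION: property names as found in later Nutch 1.x releases;
     the patch you applied may use other names. -->
<property>
  <name>tika.extractor</name>
  <value>boilerpipe</value>
</property>

<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <!-- One fixed algorithm for every page, e.g. ArticleExtractor,
       DefaultExtractor or CanolaExtractor. -->
  <value>ArticleExtractor</value>
</property>
```

With this in place, plugin.includes would list parse-tika rather than parse-(html|tika). As noted in the reply above, the algorithm is chosen once in configuration and is not selected per page.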