Re: [MASSMAIL]RE: Removing header,Footer and left menus while crawling

Eyeris Rodriguez Rueda Wed, 15 Nov 2017 05:59:43 -0800

Hello.

I am using Tika boilerpipe with good results on approximately 2000 websites.
Rushikesh, if Tika boilerpipe is not working for you, maybe it is because you
are not parsing documents with Tika. Please check this configuration
and tell us.


Make sure that the Tika plugin (parse-tika) is activated in the plugin.includes property, then check:

***********************************************
Use tika parser instead of parse-html.

parse-plugins.xml

<mimeType name="text/html">
  <plugin id="parse-tika" />
</mimeType>

<mimeType name="application/xhtml+xml">
  <plugin id="parse-tika" />
</mimeType>
***********************************************

***********************************************
nutch-site.xml
<property>
  <name>tika.extractor</name>
  <value>boilerpipe</value>
  <description>
  Which text extraction algorithm to use. Valid values are: boilerpipe or none.
  </description>
</property>

<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <value>ArticleExtractor</value>
  <description>
  Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor,
  ArticleExtractor or CanolaExtractor.
  </description>
</property>
****************************************
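
The plugin activation check mentioned above can be sketched as follows. This is only an illustrative nutch-site.xml fragment: the property name plugin.includes is the standard Nutch one, but the list of plugins in the value is an assumed example and not taken from this thread. The only part that matters here is that parse-tika is matched by the expression.

```xml
<!-- Illustrative sketch only: the plugin list below is an assumed example.
     Keep your own list and just make sure parse-tika is part of it. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

If parse-tika is missing from this value, the mapping in parse-plugins.xml has no effect, because the plugin is never loaded.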

----- Original message -----
From: "Markus Jelsma" <[email protected]>
To: [email protected]
Sent: Tuesday, 14 November 2017 17:40:08
Subject: [MASSMAIL]RE: Removing header,Footer and left menus while crawling

Hello Rushikesh - why is Boilerpipe not working for you? Are you having trouble
getting it configured? It is really just setting a single property value. Or
does it work, but not to your satisfaction?

The Bayan solution should work, theoretically, but only with a lot of tedious
manual per-site configuration.

Regards,
Markus

-----Original message-----
> From: Rushikesh K <[email protected]>
> Sent: Tuesday 14th November 2017 23:30
> To: [email protected]
> Cc: Sebastian Nagel <[email protected]>; [email protected]
> Subject: Re: Removing header,Footer and left menus while crawling
> 
> Hello,
> 
> *Jorge*
> Thanks for the response, and sorry for the confusion. I am using Nutch 1.13.
> I also tried configuring Tika boilerpipe with this version, but it doesn't
> work for me. You suggested writing our own parser, but I am not a Java
> developer.
> If you or anyone in the community has a working file, it would be great if
> you could share it, because many people are facing this issue (I found
> this out when I googled it).
> 
> Mark Vega
> We also tried the Bayan Group extractor plugin with Nutch 1.13, but this is
> also not working. We followed the same steps. I can share the changes if you
> want to take a look.
> 
> I appreciate your quick suggestions!
> 
> Thanks
> Rushikesh
> 
> On Tue, Nov 14, 2017 at 8:34 AM, Jorge Betancourt <
> [email protected]> wrote:
> 
> > Hello Rushikesh,
> >
> > Are you using Nutch 1.3 or Nutch 1.13? If you're using Nutch 1.13, then you
> > could use the Tika boilerpipe implementation. In nutch-site.xml you
> > need to enable this feature with:
> >
> > <property>
> >   <name>tika.extractor</name>
> >   <value>boilerpipe</value>
> >   <description>
> >   Which text extraction algorithm to use. Valid values are: boilerpipe or none.
> >   </description>
> > </property>
> >
> > And configure the proper extractor with
> > the tika.extractor.boilerpipe.algorithm setting.
> >
> > This is not a perfect solution, but I've used it successfully in the past.
> > Of course, your results will depend on the structure (markup) of the
> > website.
> >
> > Another option could be to implement your own parser if you need to have more
> > control over what to include/exclude from the HTML. You can take a look at
> > this issue https://issues.apache.org/jira/browse/NUTCH-585 which contains
> > some info and old patches.
> >
> > Best Regards,
> > Jorge
> >
> > On Mon, Nov 13, 2017 at 8:58 PM Rushikesh K <[email protected]>
> > wrote:
> >
> > > Hello Sebastian,
> > > we are excited to be using Nutch 1.3 (with Solr 6.4) for crawling
> > > our website and we are happy with the search results, but we have a
> > > requirement to skip the header, footer, left menus and some other parts
> > > of the page. Can you please guide us on how to exclude those parts? I have
> > > tried various approaches found on Google, but nothing works for me.
> > >
> > > Thanks in advance for your help!
> > > --
> > > Regards
> > > Rushikesh M
> > > .Net Developer
> > >
> >
> 
> 
> 
> -- 
> Regards
> Rushikesh M
> .Net Developer
> 