Hello.
I am using Tika boilerpipe with good results on approximately 2000 websites.
Rushikesh, if Tika boilerpipe is not working for you, maybe it is because you
are not parsing documents with Tika. Please check this configuration
and tell us.
Make sure that the Tika plugin (parse-tika) is activated in the
plugin.includes property, then check:
***********************************************
Use tika parser instead of parse-html.
parse-plugins.xml
<mimeType name="text/html">
<plugin id="parse-tika" />
</mimeType>
<mimeType name="application/xhtml+xml">
<plugin id="parse-tika" />
</mimeType>
***********************************************
***********************************************
nutch-site.xml
<property>
<name>tika.extractor</name>
<value>boilerpipe</value>
<description>
Which text extraction algorithm to use. Valid values are: boilerpipe or none.
</description>
</property>
<property>
<name>tika.extractor.boilerpipe.algorithm</name>
<value>ArticleExtractor</value>
<description>
Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor,
ArticleExtractor or CanolaExtractor.
</description>
</property>
****************************************
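For the plugin activation mentioned at the top, here is a rough sketch of what the entry in nutch-site.xml could look like. The property is named plugin.includes in nutch-default.xml. The plugin list in the value is only an assumption based on a typical Nutch 1.x setup, so start from the value in your own nutch-default.xml and just make sure parse-tika is matched by the expression.

```xml
<!-- nutch-site.xml: sketch only. The plugin list below is an assumption;
     copy the value from your own nutch-default.xml and adapt it. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>
  Regular expression naming the plugins to load. parse-tika has to be
  matched here, otherwise it cannot be selected by parse-plugins.xml.
  </description>
</property>
```

To check the result on a single page, and assuming your Nutch 1.x install provides the parsechecker command, you can run bin/nutch parsechecker -dumpText with a page URL and see whether the header, footer and menu text is gone from the extracted text.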
----- Original message -----
From: "Markus Jelsma" <[email protected]>
To: [email protected]
Sent: Tuesday, 14 November 2017 17:40:08
Subject: [MASSMAIL]RE: Removing header,Footer and left menus while crawling
Hello Rushikesh - why is Boilerpipe not working for you? Are you having trouble
getting it configured? It is really just setting a boolean value. Or does it
work, but not to your satisfaction?
The Bayan solution should work, theoretically, but just with a lot of tedious
manual per-site configuration.
Regards,
Markus
-----Original message-----
> From:Rushikesh K <[email protected]>
> Sent: Tuesday 14th November 2017 23:30
> To: [email protected]
> Cc: Sebastian Nagel <[email protected]>; [email protected]
> Subject: Re: Removing header,Footer and left menus while crawling
>
> Hello,
>
> *Jorge*
> Thanks for the response. Sorry for the confusion: I am using Nutch 1.13. I
> also tried configuring Tika boilerpipe with this version, but it doesn't
> work for me. You suggested writing my own parser, but I am not a Java
> developer.
> If you or anyone in the community has a working file, it would be
> great if you could share it, because many people are facing this
> issue (I found this out when I googled it).
>
> Mark Vega
> We also tried the Bayan Group extractor plugin with Nutch 1.13, but it is
> also not working. We followed the same steps. I can share the changes if
> you want to take a look.
>
> I appreciate your quick suggestions!
>
> Thanks
> Rushikesh
>
> On Tue, Nov 14, 2017 at 8:34 AM, Jorge Betancourt <
> [email protected]> wrote:
>
> > Hello Rushikesh,
> >
> > Are you using Nutch 1.3 or Nutch 1.13? If you're using Nutch 1.13, then you
> > could use the Tika boilerpipe implementation. In nutch-site.xml you
> > need to enable this feature with:
> >
> > <property>
> > <name>tika.extractor</name>
> > <value>boilerpipe</value>
> > <description>
> > Which text extraction algorithm to use. Valid values are: boilerpipe or
> > none.
> > </description>
> > </property>
> >
> > And configure the proper extractor with
> > the tika.extractor.boilerpipe.algorithm setting.
> >
> > This is not a perfect solution, but I've used it successfully in the past.
> > Of course, your results will depend on the structure (markup) of the
> > website.
> >
> > Another option could be to implement your own parser if you need to have more
> > control over what to include/exclude from the HTML. You can take a look at
> > this issue https://issues.apache.org/jira/browse/NUTCH-585 which contains
> > some info and old patches.
> >
> > Best Regards,
> > Jorge
> >
> > On Mon, Nov 13, 2017 at 8:58 PM Rushikesh K <[email protected]>
> > wrote:
> >
> > > Hello Sebastian,
> > > We are excited to be using Nutch 1.3 (with Solr 6.4) for crawling
> > > our website, and we are happy with the search results, but we have a
> > > requirement to skip the header, footer, left menus and some other parts
> > > of the page. Can you please guide us on how to exclude those parts? I
> > > tried various approaches found on Google, but nothing works for me.
> > >
> > > I appreciate your help in advance!
> > > --
> > > Regards
> > > Rushikesh M
> > > .Net Developer
> > >
> >
>
>
>
> --
> Regards
> Rushikesh M
> .Net Developer
>