Re: [MASSMAIL]RE: Removing header,Footer and left menus while crawling

Rushikesh K Wed, 15 Nov 2017 13:21:55 -0800

Hello,
Eyeris - Thanks for your response. I was able to get it working with Tika
boilerpipe, but now I have a different problem: some of the crawled pages
don't have the expected data.
For some pages it brings back only the *Title* and skips all the content. I
am not sure in which special cases it does this. In my case I have two
problems now:
1. When my page has an image and 1 or 2 lines of text, it doesn't get those
lines of data (the data is in the <p> tag).
2. Why is it adding the *Title* to the start of the *content*? Is there a way
not to include that?


For example, see the following image: for the first URL it came back without
any data.

[image: Inline image 1]

On Wed, Nov 15, 2017 at 8:57 AM, Eyeris Rodriguez Rueda <[email protected]>
wrote:

> Hello.
>
> I am using tika boilerpipe with good results on approximately 2000 websites.
> Rushikesh, if tika boilerpipe is not working for you, maybe it is because
> you are not parsing documents with tika. Please check this configuration
> and tell us.
>
> Make sure that the tika plugin is activated in the plugin.includes property,
> then check:
>
> ***********************************************
> Use tika parser instead of parse-html.
>
> parse-plugins.xml
>
> <mimeType name="text/html">
>   <plugin id="parse-tika" />
> </mimeType>
>
> <mimeType name="application/xhtml+xml">
>   <plugin id="parse-tika" />
> </mimeType>
> ***********************************************
>
> ***********************************************
> nutch-site.xml
> <property>
>   <name>tika.extractor</name>
>   <value>boilerpipe</value>
>   <description>
>   Which text extraction algorithm to use. Valid values are: boilerpipe or
>   none.
>   </description>
> </property>
>
> <property>
>   <name>tika.extractor.boilerpipe.algorithm</name>
>   <value>ArticleExtractor</value>
>   <description>
>   Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor,
>   ArticleExtractor or CanolaExtractor.
>   </description>
> </property>
> ****************************************
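>
> For the plugin activation check mentioned above, here is a minimal sketch of
> what the nutch-site.xml entry might look like. The value shown is an
> assumption based on a typical Nutch 1.x default and is not taken from this
> thread; keep whatever other plugins your own setup already lists. The
> relevant part is that the pattern matches parse-tika.
>
> ***********************************************
> nutch-site.xml
> <property>
>   <name>plugin.includes</name>
>   <!-- Assumed example value; only the parse-(html|tika) part matters here -->
>   <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> </property>
> ***********************************************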
>
> ----- Original message -----
> From: "Markus Jelsma" <[email protected]>
> To: [email protected]
> Sent: Tuesday, 14 November 2017 17:40:08
> Subject: [MASSMAIL]RE: Removing header,Footer and left menus while crawling
>
> Hello Rushikesh - why is Boilerpipe not working for you? Are you having
> trouble getting it configured? It is really just setting a boolean value.
> Or does it work, but not to your satisfaction?
>
> The Bayan solution should work, theoretically, but just with a lot of
> tedious manual per-site configuration.
>
> Regards,
> Markus
>
> -----Original message-----
> > From:Rushikesh K <[email protected]>
> > Sent: Tuesday 14th November 2017 23:30
> > To: [email protected]
> > Cc: Sebastian Nagel <[email protected]>; [email protected]
> > Subject: Re: Removing header,Footer and left menus while crawling
> >
> > Hello,
> >
> > *Jorge*
> > Thanks for the response, and sorry for the confusion. I am using Nutch
> > 1.13, and I tried configuring Tika boilerpipe with this version, but it
> > doesn't work for me. You suggested writing our own parser, but I am not a
> > Java developer.
> > If you or anyone in the community has a working file, it would be great
> > if you could share it, because many people are facing this issue (I found
> > this out when I googled).
> >
> > Mark Vega
> > We also tried the Bayan Group extractor plugin with Nutch 1.13, but this
> > is also not working. We followed the same steps. I can share the changes
> > if you want to take a look.
> >
> > I appreciate your quick suggestions!
> >
> > Thanks
> > Rushikesh
> >
> > On Tue, Nov 14, 2017 at 8:34 AM, Jorge Betancourt <
> > [email protected]> wrote:
> >
> > > Hello Rushikesh,
> > >
> > > Are you using Nutch 1.3 or Nutch 1.13? If you're using Nutch 1.13, then
> > > you could use the Tika boilerpipe implementation. In nutch-site.xml you
> > > need to enable this feature with:
> > >
> > > <property>
> > >   <name>tika.extractor</name>
> > >   <value>boilerpipe</value>
> > >   <description>
> > >   Which text extraction algorithm to use. Valid values are: boilerpipe
> > >   or none.
> > >   </description>
> > > </property>
> > >
> > > And configure the proper extractor with
> > > the tika.extractor.boilerpipe.algorithm setting.
> > >
> > > This is not a perfect solution, but I've used it successfully in the
> > > past. Of course, your results will depend on the structure (markup) of
> > > the website.
> > >
> > > Another option could be to implement your own parser if you need more
> > > control over what to include/exclude from the HTML. You can take a look
> > > at this issue https://issues.apache.org/jira/browse/NUTCH-585 which
> > > contains some info and old patches.
> > >
> > > Best Regards,
> > > Jorge
> > >
> > > On Mon, Nov 13, 2017 at 8:58 PM Rushikesh K <[email protected]>
> > > wrote:
> > >
> > > > Hello Sebastian,
> > > > We are very excited to be using Nutch 1.3 (with Solr 6.4) for crawling
> > > > our website, and we are happy with the search results, but we have a
> > > > requirement to skip the header, footer, left menus and some other
> > > > parts of the page. Can you please guide us on how we can exclude those
> > > > parts? I tried various approaches found on Google, but nothing works
> > > > for me.
> > > >
> > > > I appreciate your help in advance!
> > > > --
> > > > Regards
> > > > Rushikesh M
> > > > .Net Developer
> > > >
> > >
> >
> >
> >
> > --
> > Regards
> > Rushikesh M
> > .Net Developer
> >
>



-- 
Regards
Rushikesh M
.Net Developer
