RE: [MASSMAIL]RE: Removing header,Footer and left menus while crawling

Markus Jelsma Wed, 15 Nov 2017 13:38:36 -0800

Boilerpipe is a crude tool but cheap and effective enough for many sorts of 
websites. It does have a problem with pages containing little text, just as all 
extractors struggle to some degree with little text.


I believe Boilerpipe adds the title hardcoded, or it is TikaParser doing it. I 
am not sure, but I remember you can get rid of it by removing some lines of code. 
See TikaParser.java, I think it is there.
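
If patching TikaParser.java is not an option, a different workaround (not the change described above) is to strip the title from the extracted text after the crawl. This is only a minimal sketch: it assumes the title appears verbatim at the very start of the content, and the function and argument names are hypothetical, not part of Nutch or Tika.

```python
def strip_leading_title(title, content):
    """Remove `title` from the start of `content` if it appears there verbatim.

    Assumption: the extractor prepends the page title unchanged to the text.
    If that does not hold, the content is returned untouched.
    """
    title = title.strip()
    stripped = content.lstrip()
    if title and stripped.startswith(title):
        # Drop the title and any whitespace that separated it from the body.
        return stripped[len(title):].lstrip()
    return content


if __name__ == "__main__":
    print(strip_leading_title("My Page", "My Page Some body text."))
```

Returning the content unchanged when the title is not found keeps the step safe to run over documents where the title was never added.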

Regards,
Markus

> non-open source contribution, you could try our extractor if you want, there 
> is a (low speed) test available at 
> https://www.openindex.io/saas/data-extraction/ . It is not free or open 
> source but available and actively developed, and does much more than just 
> text extraction.


 
-----Original message-----
> From:Rushikesh K <[email protected]>
> Sent: Wednesday 15th November 2017 22:21
> To: [email protected]; [email protected]
> Subject: Re: [MASSMAIL]RE: Removing header,Footer and left menus while 
> crawling
> 
> Hello, 
> 
> 
> Eyeris - Thanks for your response. I was able to get it working with Tika 
> Boilerpipe, but now I have a different problem: some of the crawled pages 
> don't have the expected data. 
> For some pages it brings back only the title and skips all the content; I am 
> not sure in which special cases it does this. In my case I have two 
> problems now: 
> 1. When my page has an image and 1 or 2 lines of text, it doesn't get those 
> lines of data (the data is in the <p> tag). 
> 2. Why is it adding the title to the start of the content? Is there a way not 
> to include it? 
> 
> For example, see the following image: for the first URL it came back without 
> any data. 
> 
> 
> 
> On Wed, Nov 15, 2017 at 8:57 AM, Eyeris Rodriguez Rueda <[email protected] 
> <mailto:[email protected]>> wrote:
> Hello.
> 
> I am using Tika Boilerpipe with good results on approximately 2000 websites.
> Rushikesh, if Tika Boilerpipe is not working for you, maybe it is because you 
> are not parsing documents with Tika. Please check this configuration
> and tell us.
> 
> Make sure that the Tika plugin is activated in the plugin.includes property, then 
> check:
> 
> ***********************************************
> Use tika parser instead of parse-html.
> 
> parse-plugins.xml
> 
> <mimeType name="text/html">
>         <plugin id="parse-tika" />
> </mimeType>
> 
> <mimeType name="application/xhtml+xml">
>         <plugin id="parse-tika" />
> </mimeType>
> ***********************************************
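
The parse-plugins.xml mapping above can be checked with a short script. This is a rough sketch using only the Python standard library; the sample document is a trimmed-down, hypothetical parse-plugins.xml, the helper names are made up for illustration, and it assumes the first plugin listed under a mimeType is the one tried first.

```python
import xml.etree.ElementTree as ET


def plugins_by_mime_type(xml_text):
    """Map each <mimeType name="..."> to the list of <plugin id="..."> under it."""
    root = ET.fromstring(xml_text)
    mapping = {}
    for mime in root.iter("mimeType"):
        mapping[mime.get("name")] = [p.get("id") for p in mime.findall("plugin")]
    return mapping


def uses_tika_first(mapping, mime_type):
    """True if parse-tika is the first plugin listed for `mime_type`."""
    plugins = mapping.get(mime_type) or []
    return bool(plugins) and plugins[0] == "parse-tika"


# Hypothetical, trimmed-down parse-plugins.xml for illustration only.
SAMPLE_PARSE_PLUGINS = """<parse-plugins>
  <mimeType name="text/html">
    <plugin id="parse-tika" />
  </mimeType>
  <mimeType name="application/xhtml+xml">
    <plugin id="parse-tika" />
  </mimeType>
</parse-plugins>"""

if __name__ == "__main__":
    mapping = plugins_by_mime_type(SAMPLE_PARSE_PLUGINS)
    for mime_type in ("text/html", "application/xhtml+xml"):
        print(mime_type, uses_tika_first(mapping, mime_type))
```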
 
> 
 
> ***********************************************
> nutch-site.xml
> <property>
>   <name>tika.extractor</name>
>   <value>boilerpipe</value>
>   <description>
>   Which text extraction algorithm to use. Valid values are: boilerpipe or none.
>   </description>
> </property>
> 
> <property>
>   <name>tika.extractor.boilerpipe.algorithm</name>
>   <value>ArticleExtractor</value>
>   <description>
>   Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor, 
>   ArticleExtractor or CanolaExtractor.
>   </description>
> </property>
> ****************************************
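
A quick way to confirm that the two nutch-site.xml properties above are set consistently is to parse the file and compare the values against those named in the descriptions. The sketch below uses only the Python standard library; the sample document and the helper names are hypothetical, and the list of valid algorithms is taken from the property description quoted above rather than from the Nutch source.

```python
import xml.etree.ElementTree as ET

# Algorithm names as listed in the property description quoted above.
VALID_ALGORITHMS = {"DefaultExtractor", "ArticleExtractor", "CanolaExtractor"}


def read_properties(xml_text):
    """Return a {name: value} dict from a <configuration> document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def check_boilerpipe(props):
    """Return a list of problems; an empty list means the settings look consistent."""
    problems = []
    if props.get("tika.extractor") != "boilerpipe":
        problems.append("tika.extractor is not set to 'boilerpipe'")
    algorithm = props.get("tika.extractor.boilerpipe.algorithm")
    if algorithm is not None and algorithm not in VALID_ALGORITHMS:
        problems.append("unknown boilerpipe algorithm: " + algorithm)
    return problems


# Hypothetical minimal nutch-site.xml for illustration only.
SAMPLE_NUTCH_SITE = """<configuration>
  <property>
    <name>tika.extractor</name>
    <value>boilerpipe</value>
  </property>
  <property>
    <name>tika.extractor.boilerpipe.algorithm</name>
    <value>ArticleExtractor</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    print(check_boilerpipe(read_properties(SAMPLE_NUTCH_SITE)))
```

A missing algorithm property is not reported as a problem here, on the assumption that Nutch then falls back to its own default.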
 
> 
> ----- Original message -----
> From: "Markus Jelsma" <[email protected] 
> <mailto:[email protected]>>
> To: [email protected] <mailto:[email protected]>
> Sent: Tuesday, 14 November 2017 17:40:08
> Subject: [MASSMAIL]RE: Removing header,Footer and left menus while crawling
> 
> Hello Rushikesh - why is Boilerpipe not working for you? Are you having 
> trouble getting it configured? It is really just setting a boolean value. Or 
> does it work, but not to your satisfaction?
> 
> The Bayan solution should work, theoretically, but only with a lot of tedious 
> manual per-site configuration.
> 
> Regards,
> Markus
> 
> -----Original message-----
> > From:Rushikesh K <[email protected] 
> > <mailto:[email protected]>>
> > Sent: Tuesday 14th November 2017 23:30
> > To: [email protected] <mailto:[email protected]>
> > Cc: Sebastian Nagel <[email protected] 
> > <mailto:[email protected]>>; [email protected] 
> > <mailto:[email protected]>
> > Subject: Re: Removing header,Footer and left menus while crawling
> >
> > Hello,
> >
> > *Jorge*
> > Thanks for the response. Sorry for the confusion: I am using Nutch 1.13, and I
> > tried configuring Tika Boilerpipe with this version, but it doesn't work
> > for me. You suggested using my own parser, but I am not a Java developer.
> > If you or anyone in the community has a working file, it would be
> > great if you could share it, because many people are facing this
> > issue (I found this out when I googled).
> >
> > Mark Vega
> > We also tried the Bayan Group extractor plugin with Nutch 1.13, but this is also
> > not working. We followed the same steps. I can share the changes if you want
> > to take a look.
> >
> > I appreciate your quick suggestions!
> >
> > Thanks
> > Rushikesh
> >
> > On Tue, Nov 14, 2017 at 8:34 AM, Jorge Betancourt <
> > [email protected] <mailto:[email protected]>> wrote:
> >
> > > Hello Rushikesh,
> > >
> > > Are you using Nutch 1.3 or Nutch 1.13? If you're using Nutch 1.13, then you
> > > could use the Tika Boilerpipe implementation. In nutch-site.xml you
> > > need to enable this feature with:
> > >
> > > <property>
> > >   <name>tika.extractor</name>
> > >   <value>boilerpipe</value>
> > >   <description>
> > >   Which text extraction algorithm to use. Valid values are: boilerpipe or
> > >   none.
> > >   </description>
> > > </property>
> > >
> > > And configure the proper extractor with
> > > the tika.extractor.boilerpipe.algorithm setting.
> > >
> > > This is not a perfect solution, but I've used it successfully in the past.
> > > Of course, your results will depend on the structure (markup) of the
> > > website.
> > >
> > > Another option could be to implement your own parser if you need to have
> > > more control over what to include/exclude from the HTML. You can take a look at
> > > this issue https://issues.apache.org/jira/browse/NUTCH-585 
> > > <https://issues.apache.org/jira/browse/NUTCH-585> which contains
> > > some info and old patches.
> > >
> > > Best Regards,
> > > Jorge
> > >
> > > On Mon, Nov 13, 2017 at 8:58 PM Rushikesh K <[email protected] 
> > > <mailto:[email protected]>>
> > > wrote:
> > >
> > > > Hello Sebastian,
> > > > We are excited to be using Nutch 1.3 (with Solr 6.4) for crawling
> > > > our website, and we are happy with the search results, but we have a
> > > > requirement to skip the header, footer, left menus and some other parts
> > > > of the page. Can you please guide us on how to exclude those parts? I was
> > > > trying various approaches found on Google, but nothing works for me.
> > > >
> > > > I appreciate your help in advance!
> > > > --
> > > > Regards
> > > > Rushikesh M
> > > > .Net Developer
> > > >
> > >
> >
> > --
> > Regards
> > Rushikesh M
> > .Net Developer
> >
> -- 
> Regards
> Rushikesh M
> .Net Developer
