Re: [MASSMAIL]RE: Removing header,Footer and left menus while crawling

Michael Coffey Wed, 15 Nov 2017 14:00:39 -0800

I found a lot of detail about the boilerpipe algorithm in 
http://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf



Seems like very short paragraphs can be a problem, since one of the primary 
features used for determining boilerplate is the length of a given text block.

I would also look into the tika.extractor.boilerpipe.algorithm setting. It can 
be DefaultExtractor, ArticleExtractor or CanolaExtractor. I don't know what the 
differences are, but I bet ArticleExtractor (the default algorithm) inserts 
the Title.
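
For anyone wanting to try that, here is a minimal sketch of the override in 
nutch-site.xml. The property name and the three allowed values come from the 
configuration quoted further down in this thread. Picking DefaultExtractor is 
only an assumption for the sake of a comparison, not a recommendation, and I 
have not verified that it keeps short paragraphs or leaves out the title.

```xml
<!-- nutch-site.xml: hypothetical override to compare extractors.
     DefaultExtractor is an arbitrary pick for the experiment; the other
     values named in this thread are ArticleExtractor and CanolaExtractor.
     tika.extractor must still be set to boilerpipe for this to apply. -->
<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <value>DefaultExtractor</value>
</property>
```

Pages that were already parsed would need to be parsed again after the change 
before any difference shows up in the extracted text.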



________________________________
From: Markus Jelsma <[email protected]>
To: "[email protected]" <[email protected]> 
Sent: Wednesday, November 15, 2017 1:38 PM
Subject: RE: [MASSMAIL]RE: Removing header,Footer and left menus while crawling



Boilerpipe is a crude tool but cheap and effective enough for many sorts of 
websites. It does have a problem with pages with little text, just as all 
extractors have a degree of trouble with little text.


I believe Boilerpipe adds the title hardcoded, or it is TikaParser doing it. I 
am not sure, but remember you can get rid of it by removing some lines of code. 
See TikaParser.java, I think it is there.


Regards,

Markus


> non-open source contribution, you could try our extractor if you want, there 
> is a (low speed) test available at 
> https://www.openindex.io/saas/data-extraction/ . It is not free or open 
> source but available and actively developed, and does much more than just 
> text extraction.




-----Original message-----

> From: Rushikesh K <[email protected]>
> Sent: Wednesday 15th November 2017 22:21
> To: [email protected]; [email protected]
> Subject: Re: [MASSMAIL]RE: Removing header,Footer and left menus while crawling

>
> Hello,
>
> Eyeris - Thanks for your response. I was able to get it working with Tika
> boilerpipe, but now I have a different problem: some of the crawled pages
> don't have the expected data.
>
> For some pages it brings back only the Title and skips all the content; I am
> not sure in which special cases it does this. In my case I have two problems
> now:
>
> 1. When a page has an image and 1 or 2 lines of text, it doesn't get those
> lines of data (the data is in the <p> tag).
>
> 2. Why is it adding the Title to the start of the content? Is there a way
> not to include that?
>

> For example, see the following image: for the first URL it came back without
> any data.
>

> On Wed, Nov 15, 2017 at 8:57 AM, Eyeris Rodriguez Rueda <[email protected]> wrote:

> Hello.
>
> I am using Tika boilerpipe with good results on approximately 2000 websites.
> Rushikesh, if Tika boilerpipe is not working for you, maybe it is because you
> are not parsing documents with Tika. Please check this configuration and
> tell us.
>
> Make sure that the tika plugin is activated in the plugin.includes property,
> then check:


> 


> ***********************************************
> Use tika parser instead of parse-html.
>
> parse-plugins.xml
>
> <mimeType name="text/html">
>   <plugin id="parse-tika" />
> </mimeType>
>
> <mimeType name="application/xhtml+xml">
>   <plugin id="parse-tika" />
> </mimeType>
> ***********************************************


> 


> ***********************************************
> nutch-site.xml
> <property>
>   <name>tika.extractor</name>
>   <value>boilerpipe</value>
>   <description>
>   Which text extraction algorithm to use. Valid values are: boilerpipe or none.
>   </description>
> </property>
>
> <property>
>   <name>tika.extractor.boilerpipe.algorithm</name>
>   <value>ArticleExtractor</value>
>   <description>
>   Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor,
>   ArticleExtractor or CanolaExtractor.
>   </description>
> </property>
> ****************************************


> 



> ----- Original message -----
> From: "Markus Jelsma" <[email protected]>
> To: [email protected]
> Sent: Tuesday, 14 November 2017 17:40:08
> Subject: [MASSMAIL]RE: Removing header,Footer and left menus while crawling


> 


> Hello Rushikesh - why is Boilerpipe not working for you? Are you having
> trouble getting it configured? It is really just setting a boolean value. Or
> does it work, but not to your satisfaction?
>
> The Bayan solution should work, theoretically, but only with a lot of tedious
> manual per-site configuration.
>
> Regards,
> Markus


> 


> -----Original message-----
> > From: Rushikesh K <[email protected]>
> > Sent: Tuesday 14th November 2017 23:30
> > To: [email protected]
> > Cc: Sebastian Nagel <[email protected]>; [email protected]
> > Subject: Re: Removing header,Footer and left menus while crawling


> >
> > Hello,
> >
> > *Jorge*
> > Thanks for the response, and sorry for the confusion. I am using Nutch 1.13,
> > and I also tried configuring Tika boilerpipe with this version, but it
> > doesn't work for me. As for your suggestion to write our own parser: I am
> > not a Java developer.
> > If you or anyone in the community has a working file, it would be great if
> > you could share it, because many people are facing this issue (I found this
> > out when I googled).
> >
> > Mark Vega
> > We also tried the Bayan Group extractor plugin with Nutch 1.13, but this is
> > also not working. We followed the same steps. I can share the changes if you
> > want to take a look.
> >
> > I appreciate your quick suggestions!
> >
> > Thanks
> > Rushikesh


> >


> > On Tue, Nov 14, 2017 at 8:34 AM, Jorge Betancourt <[email protected]> wrote:


> >


> > > Hello Rushikesh,
> > >
> > > Are you using Nutch 1.3 or Nutch 1.13? If you're using Nutch 1.13, then you
> > > could use the Tika boilerpipe implementation. In nutch-site.xml you
> > > need to enable this feature with:
> > >
> > > <property>
> > >   <name>tika.extractor</name>
> > >   <value>boilerpipe</value>
> > >   <description>
> > >   Which text extraction algorithm to use. Valid values are: boilerpipe or none.
> > >   </description>
> > > </property>
> > >
> > > And configure the proper extractor with
> > > the tika.extractor.boilerpipe.algorithm setting.
> > >
> > > This is not a perfect solution, but I've used it successfully in the past.
> > > Of course, your results will depend on the structure (markup) of the
> > > website.


> > >


> > > Another option could be to implement your own parser if you need more
> > > control over what to include/exclude from the HTML. You can take a look at
> > > this issue, https://issues.apache.org/jira/browse/NUTCH-585, which contains
> > > some info and old patches.
> > >
> > > Best Regards,
> > > Jorge


> > >


> > > On Mon, Nov 13, 2017 at 8:58 PM Rushikesh K <[email protected]> wrote:


> > >


> > > > Hello Sebastian,
> > > > We are excited to be using Nutch 1.3 (with Solr 6.4) for crawling
> > > > our website, and we are happy with the search results, but we have a
> > > > requirement to skip the header, footer, left menus and some other parts
> > > > of the page. Can you please guide us on how to exclude those parts? I
> > > > tried various approaches found on Google, but nothing works for me.
> > > >
> > > > Thanks in advance for your help!
> > > > --
> > > > Regards
> > > > Rushikesh M
> > > > .Net Developer


> > > >
> > >
> >
> > --
> > Regards
> > Rushikesh M
> > .Net Developer
> >




> --
> Regards
> Rushikesh M
> .Net Developer
