Boilerpipe is a crude tool, but cheap and effective enough for many sorts of websites. It does have a problem with pages with little text, just as all extractors have some degree of trouble with little text.
I believe Boilerpipe adds the title hardcoded, or it is TikaParser doing it. I am not sure, but I remember you can get rid of it by removing some lines of code. See TikaParser.java, I think it is there.

A non-open source contribution: you could try our extractor if you want, there is a (low speed) test available at https://www.openindex.io/saas/data-extraction/ . It is not free or open source but available and actively developed, and does much more than just text extraction.

Regards,
Markus

-----Original message-----
> From: Rushikesh K <[email protected]>
> Sent: Wednesday 15th November 2017 22:21
> To: [email protected]; [email protected]
> Subject: Re: [MASSMAIL]RE: Removing header,Footer and left menus while crawling
>
> Hello,
>
> Eyeris - Thanks for your response. I was able to get it working with tika boilerpipe, but now I have a different problem: some of the crawled pages don't have the expected data.
> For some pages it brings back only the Title and skips all the content; I am not sure in what special cases it does this. But in my case I have two problems now:
> 1. When my page has an image and 1 or 2 lines of text, it doesn't get those lines of data (the data is in the <p> tag).
> 2. Why is it adding the Title to the start of the content? Is there a way not to include that?
>
> For example see the following image for the first URL it came back with out any date
>
> On Wed, Nov 15, 2017 at 8:57 AM, Eyeris Rodriguez Rueda <[email protected]> wrote:
> Hello.
>
> I am using tika boilerpipe with good results on approximately 2000 websites.
> Rushikesh, if tika boilerpipe is not working for you, maybe it is because you are not parsing documents with tika. Please check this configuration and tell us.
>
> Make sure that the tika plugin is activated in the plugin.includes property, then check:
>
> ***********************************************
> Use tika parser instead of parse-html.
>
> parse-plugins.xml
>
> <mimeType name="text/html">
>   <plugin id="parse-tika" />
> </mimeType>
>
> <mimeType name="application/xhtml+xml">
>   <plugin id="parse-tika" />
> </mimeType>
> ***********************************************
>
> ***********************************************
> nutch-site.xml
> <property>
>   <name>tika.extractor</name>
>   <value>boilerpipe</value>
>   <description>
>   Which text extraction algorithm to use. Valid values are: boilerpipe or none.
>   </description>
> </property>
>
> <property>
>   <name>tika.extractor.boilerpipe.algorithm</name>
>   <value>ArticleExtractor</value>
>   <description>
>   Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor, ArticleExtractor
>   or CanolaExtractor.
>   </description>
> </property>
> ****************************************
>
> ----- Original message -----
> From: "Markus Jelsma" <[email protected]>
> To: [email protected]
> Sent: Tuesday, 14 November 2017 17:40:08
> Subject: [MASSMAIL]RE: Removing header,Footer and left menus while crawling
>
> Hello Rushikesh - why is Boilerpipe not working for you? Are you having trouble getting it configured? It is really just setting a boolean value. Or does it work, but not to your satisfaction?
>
> The Bayan solution should work, theoretically, but just with a lot of tedious manual per-site configuration.
>
> Regards,
> Markus
>
> -----Original message-----
> > From: Rushikesh K <[email protected]>
> > Sent: Tuesday 14th November 2017 23:30
> > To: [email protected]
> > Cc: Sebastian Nagel <[email protected]>; [email protected]
> > Subject: Re: Removing header,Footer and left menus while crawling
> >
> > Hello,
> >
> > *Jorge*
> > Thanks for the response. Sorry for the confusion: I am using Nutch 1.13, and I tried configuring Tika boilerpipe with this version, but it doesn't work for me. As for your suggestion to write our own parser, I am not a Java developer.
> > If you or anyone in the community has a working file, it would be great if you could share it, because there are many people facing this issue (I came to know this when I googled).
> >
> > Mark Vega
> > We also tried the Bayan Group extractor plugin with Nutch 1.13, but this is also not working. We followed the same steps. I can share the changes if you want to take a look.
> >
> > I appreciate your quick suggestions!
> >
> > Thanks
> > Rushikesh
> >
> > On Tue, Nov 14, 2017 at 8:34 AM, Jorge Betancourt <[email protected]> wrote:
> >
> > > Hello Rushikesh,
> > >
> > > Are you using Nutch 1.3 or Nutch 1.13? If you're using Nutch 1.13, then you could use the Tika boilerpipe implementation; in nutch-site.xml you need to enable this feature with:
> > >
> > > <property>
> > >   <name>tika.extractor</name>
> > >   <value>boilerpipe</value>
> > >   <description>
> > >   Which text extraction algorithm to use. Valid values are: boilerpipe or none.
> > >   </description>
> > > </property>
> > >
> > > And configure the proper extractor with the tika.extractor.boilerpipe.algorithm setting.
> > >
> > > This is not a perfect solution, but I've used it successfully in the past. Of course, your results will depend on the structure (markup) of the website.
> > >
> > > Another option could be to implement your own parser if you need more control over what to include/exclude from the HTML. You can take a look at this issue, https://issues.apache.org/jira/browse/NUTCH-585 , which contains some info and old patches.
> > >
> > > Best Regards,
> > > Jorge
> > >
> > > On Mon, Nov 13, 2017 at 8:58 PM Rushikesh K <[email protected]> wrote:
> > >
> > > > Hello Sebastian,
> > > > We are most excited to be using Nutch 1.3 (with Solr 6.4) for crawling our website and we are happy with the search results, but we have a requirement to skip the header, footer, left menus and some other parts of the page. Can you please guide us on how we can exclude those parts? I was trying various ways found on Google, but nothing works for me.
> > > >
> > > > Appreciate your help in advance!
> > > > --
> > > > Regards
> > > > Rushikesh M
> > > > .Net Developer
> >
> > --
> > Regards
> > Rushikesh M
> > .Net Developer
>
> --
> Regards
> Rushikesh M
> .Net Developer
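
For reference, the plugin activation mentioned in the thread above can be sketched as a nutch-site.xml fragment. This is only a minimal sketch, assuming the property is named plugin.includes as in stock Nutch 1.x. The other plugin ids in the value are illustrative and should be replaced by whatever your installation already lists; the point is simply that parse-tika is among them.

```xml
<!-- nutch-site.xml: hypothetical plugin list, keep your own ids and make sure parse-tika is included -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-tika|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

With parse-tika active and text/html mapped to it in parse-plugins.xml, the tika.extractor settings quoted in the thread should take effect on the next parse of the segment.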

