Hello Rushikesh - why is Boilerpipe not working for you? Are you having trouble getting it configured? It is really just setting a single configuration value. Or does it work, but not to your satisfaction?
The Bayan solution should work, theoretically, but only with a lot of tedious manual per-site configuration.

Regards,
Markus

-----Original message-----
> From: Rushikesh K <[email protected]>
> Sent: Tuesday 14th November 2017 23:30
> To: [email protected]
> Cc: Sebastian Nagel <[email protected]>; [email protected]
> Subject: Re: Removing header,Footer and left menus while crawling
>
> Hello,
>
> *Jorge*
> Thanks for the response. Sorry for the confusion, I am using Nutch 1.13. I also
> tried configuring Tika boilerpipe with this version, but it doesn't work
> for me. You suggested writing my own parser, but I am not a Java developer.
> If you or anyone in the community has a working file, it would be
> great if you could share it, because many people are facing this
> issue (I came to know this when I googled).
>
> Mark Vega
> We also tried the Bayan Group extractor plugin with Nutch 1.13, but this is also
> not working. We followed the same steps. I can share the changes if you want
> to take a look.
>
> I appreciate your quick suggestions!
>
> Thanks
> Rushikesh
>
> On Tue, Nov 14, 2017 at 8:34 AM, Jorge Betancourt <[email protected]> wrote:
>
> > Hello Rushikesh,
> >
> > Are you using Nutch 1.3 or Nutch 1.13? If you're using Nutch 1.13, then you
> > could use the Tika boilerpipe implementation. In nutch-site.xml you
> > need to enable this feature with:
> >
> > <property>
> >   <name>tika.extractor</name>
> >   <value>boilerpipe</value>
> >   <description>
> >   Which text extraction algorithm to use. Valid values are: boilerpipe or
> >   none.
> >   </description>
> > </property>
> >
> > And configure the proper extractor with
> > the tika.extractor.boilerpipe.algorithm setting.
> >
> > This is not a perfect solution, but I've used it successfully in the past.
> > Of course, your results will depend on the structure (markup) of the
> > website.
> >
> > Another option could be to implement your own parser if you need to have more
> > control over what to include/exclude from the HTML. You can take a look at
> > this issue https://issues.apache.org/jira/browse/NUTCH-585 which contains
> > some info and old patches.
> >
> > Best Regards,
> > Jorge
> >
> > On Mon, Nov 13, 2017 at 8:58 PM Rushikesh K <[email protected]>
> > wrote:
> >
> > > Hello Sebastian,
> > > We are excited to be using Nutch 1.3 (with Solr 6.4) for crawling
> > > our website and we are happy with the search results, but we have a
> > > requirement to skip the header, footer, left menus and some other parts
> > > of the page. Can you please guide us on how to exclude those parts? I
> > > tried various approaches found on Google, but nothing works for me.
> > >
> > > Thanks in advance for your help!
> > > --
> > > Regards
> > > Rushikesh M
> > > .Net Developer
> > >
> > >
> >
> --
> Regards
> Rushikesh M
> .Net Developer
>
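
For reference, the two settings Jorge mentions could be combined in nutch-site.xml roughly as follows. This is only a sketch, assuming Nutch 1.13 with the parse-tika plugin active and HTML documents actually routed to it (if HTML is still handled by parse-html, these properties will likely have no effect). The algorithm value ArticleExtractor is an assumption taken from the values I believe nutch-default.xml lists (DefaultExtractor, ArticleExtractor, CanolaExtractor); check the nutch-default.xml shipped with your version before relying on it.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Enable Boilerpipe-based text extraction in the Tika parser. -->
  <property>
    <name>tika.extractor</name>
    <value>boilerpipe</value>
  </property>

  <!-- Assumed value: pick whichever extractor suits your markup.
       Verify the valid names against your own nutch-default.xml. -->
  <property>
    <name>tika.extractor.boilerpipe.algorithm</name>
    <value>ArticleExtractor</value>
  </property>
</configuration>
```

After changing these values, the pages need to be parsed again (and re-indexed into Solr) before any difference shows up in the search results.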

