In my opinion Boilerpipe is the most effective free and open source tool for the job :)
It does require some patching (see the linked issues) and a manual upgrade to Boilerpipe 1.2.0.

-----Original message-----
> From:Joe Zhang <[email protected]>
> Sent: Tue 11-Jun-2013 21:19
> To: user <[email protected]>
> Subject: Re: using Tika within Nutch to remove boiler plates?
>
> So what in your opinion is the most effective way of removing boilerplates
> in Nutch crawls?
>
> On Tue, Jun 11, 2013 at 12:12 PM, Markus Jelsma <[email protected]> wrote:
>
> > Yes, Boilerpipe is complex and difficult to adapt. It also requires you to
> > preset an extraction algorithm which is impossible for us. I've created an
> > extractor instead that works for most pages and ignores stuff like news
> > overviews and major parts of homepages. It's also tightly coupled with our
> > date extractor (based on [1]) and language detector (based on LangDetect)
> > and image extraction.
> >
> > In many cases boilerpipe's articleextractor will work very well but date
> > extraction such as NUTCH-141 won't do the trick as it only works on
> > extracted text as a whole and does not consider page semantics.
> >
> > [1]: https://issues.apache.org/jira/browse/NUTCH-1414
> >
> > -----Original message-----
> > > From:Joe Zhang <[email protected]>
> > > Sent: Tue 11-Jun-2013 18:06
> > > To: user <[email protected]>
> > > Subject: Re: using Tika within Nutch to remove boiler plates?
> > >
> > > Any particular reason why you don't use boilerpipe any more? So what do
> > > you suggest as an alternative?
> > >
> > > On Tue, Jun 11, 2013 at 5:41 AM, Markus Jelsma <[email protected]> wrote:
> > >
> > > > we don't use Boilerpipe anymore so no point in sharing. Just set those
> > > > two configuration options in nutch-site.xml as
> > > >
> > > > <property>
> > > >   <name>tika.use_boilerpipe</name>
> > > >   <value>true</value>
> > > > </property>
> > > > <property>
> > > >   <name>tika.boilerpipe.extractor</name>
> > > >   <value>ArticleExtractor</value>
> > > > </property>
> > > >
> > > > and it should work
> > > >
> > > > -----Original message-----
> > > > > From:Joe Zhang <[email protected]>
> > > > > Sent: Tue 11-Jun-2013 01:42
> > > > > To: user <[email protected]>
> > > > > Subject: Re: using Tika within Nutch to remove boiler plates?
> > > > >
> > > > > Markus, do you mind sharing a sample nutch-site.xml?
> > > > >
> > > > > On Mon, Jun 10, 2013 at 1:42 AM, Markus Jelsma <[email protected]> wrote:
> > > > >
> > > > > > Those settings belong to nutch-site. Enable BP and set the correct
> > > > > > extractor and it should work just fine.
> > > > > >
> > > > > > -----Original message-----
> > > > > > > From:Lewis John Mcgibbney <[email protected]>
> > > > > > > Sent: Sun 09-Jun-2013 20:47
> > > > > > > To: [email protected]
> > > > > > > Subject: Re: using Tika within Nutch to remove boiler plates?
> > > > > > >
> > > > > > > Hi Joe,
> > > > > > > I've not used this feature, it would be great if one of the others
> > > > > > > could chime in here.
> > > > > > > From what I can infer from the correspondence on the issue, and the
> > > > > > > available patches, you should be applying the most recent one
> > > > > > > uploaded by Markus [0] as your starting point. This is dated as
> > > > > > > 22/11/2011.
> > > > > > >
> > > > > > > On Sun, Jun 9, 2013 at 11:00 AM, Joe Zhang <[email protected]> wrote:
> > > > > > >
> > > > > > > > One of the comments mentioned the following:
> > > > > > > >
> > > > > > > > tika.use_boilerpipe=true
> > > > > > > > tika.boilerpipe.extractor=ArticleExtractor|CanolaExtractor
> > > > > > > >
> > > > > > > > which part of the code is it referring to?
> > > > > > >
> > > > > > > You will see this included in one of the earlier patches uploaded
> > > > > > > by Markus on 11/05/2011 [1]
> > > > > > >
> > > > > > > > Also, within the current Nutch config, should I focus on
> > > > > > > > parse-plugin.xml?
> > > > > > >
> > > > > > > Look at the other patches and also Gabriele's comments. You may most
> > > > > > > likely need to alter something but AFAICT the work has been done.
> > > > > > > It's just a case of pulling together several contributions.
> > > > > > >
> > > > > > > Maybe you should look at the patch for 2.x (uploaded most recently
> > > > > > > by Roland) and see what is going on there.
> > > > > > >
> > > > > > > hth
> > > > > > >
> > > > > > > [0] https://issues.apache.org/jira/secure/attachment/12504736/NUTCH-961-1.5-1.patch
> > > > > > > [1] https://issues.apache.org/jira/secure/attachment/12478927/NUTCH-961-1.3-tikaparser1.patch
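
Since a sample nutch-site.xml was asked for in the thread, here is a minimal sketch of what the whole file might look like with the two Boilerpipe properties in place. The property names and the ArticleExtractor value are the ones quoted in the thread; the XML declaration and the <configuration> root element are the usual Hadoop-style wrapper. It is an assumption that the NUTCH-961 patch is applied and that the parse-tika plugin is already enabled in your plugin.includes, since neither is shown in the thread.

```xml
<?xml version="1.0"?>
<!-- Sketch of conf/nutch-site.xml; assumes the NUTCH-961 patch is applied
     and parse-tika is enabled elsewhere (not shown here). -->
<configuration>

  <!-- Switch on Boilerpipe inside the Tika parser -->
  <property>
    <name>tika.use_boilerpipe</name>
    <value>true</value>
  </property>

  <!-- Extraction algorithm; the thread names ArticleExtractor and
       CanolaExtractor as possible values -->
  <property>
    <name>tika.boilerpipe.extractor</name>
    <value>ArticleExtractor</value>
  </property>

</configuration>
```

Any other site-specific overrides would go inside the same <configuration> element, next to these two properties.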

