Yes, Boilerpipe is complex and difficult to adapt. It also requires you to 
choose an extraction algorithm up front, which is impossible for us. Instead, 
I've created an extractor that works for most pages and ignores things like 
news overviews and major parts of homepages. It's also tightly coupled with our 
date extractor (based on [1]), language detector (based on LangDetect) and 
image extraction.

In many cases Boilerpipe's ArticleExtractor will work very well, but date 
extraction such as NUTCH-1414 won't do the trick on its output, as it only 
works on the extracted text as a whole and does not consider page semantics.

[1]: https://issues.apache.org/jira/browse/NUTCH-1414

-----Original message-----
> From:Joe Zhang <[email protected]>
> Sent: Tue 11-Jun-2013 18:06
> To: user <[email protected]>
> Subject: Re: using Tika within Nutch to remove boiler plates?
> 
> Any particular reason why you don't use boilerpipe any more? So what do you
> suggest as an alternative?
> 
> 
> On Tue, Jun 11, 2013 at 5:41 AM, Markus Jelsma
> <[email protected]>wrote:
> 
> > we don't use Boilerpipe anymore so no point in sharing. Just set those two
> > configuration options in nutch-site.xml as
> >
> >   <property>
> >     <name>tika.use_boilerpipe</name>
> >     <value>true</value>
> >   </property>
> >   <property>
> >     <name>tika.boilerpipe.extractor</name>
> >     <value>ArticleExtractor</value>
> >   </property>
> >
> > and it should work
> >
> > -----Original message-----
> > > From:Joe Zhang <[email protected]>
> > > Sent: Tue 11-Jun-2013 01:42
> > > To: user <[email protected]>
> > > Subject: Re: using Tika within Nutch to remove boiler plates?
> > >
> > > Markus, do you mind sharing a sample nutch-site.xml?
> > >
> > >
> > > On Mon, Jun 10, 2013 at 1:42 AM, Markus Jelsma
> > > <[email protected]>wrote:
> > >
> > > > Those settings belong to nutch-site. Enable BP and set the correct
> > > > extractor and it should work just fine.
> > > >
> > > >
> > > > -----Original message-----
> > > > > From:Lewis John Mcgibbney <[email protected]>
> > > > > Sent: Sun 09-Jun-2013 20:47
> > > > > To: [email protected]
> > > > > Subject: Re: using Tika within Nutch to remove boiler plates?
> > > > >
> > > > > Hi Joe,
> > > > > > I've not used this feature, it would be great if one of the others
> > > > > > could chime in here.
> > > > > > From what I can infer from the correspondence on the issue, and the
> > > > > > available patches, you should be applying the most recent one
> > > > > > uploaded by Markus [0] as your starting point. This is dated as
> > > > > > 22/11/2011.
> > > > >
> > > > > > On Sun, Jun 9, 2013 at 11:00 AM, Joe Zhang <[email protected]> wrote:
> > > > >
> > > > > >
> > > > > > One of the comments mentioned the following:
> > > > > >
> > > > > > tika.use_boilerpipe=true
> > > > > > tika.boilerpipe.extractor=ArticleExtractor|CanolaExtractor
> > > > > >
> > > > > > > which part of the code is it referring to?
> > > > > >
> > > > > >
> > > > > > You will see this included in one of the earlier patches uploaded by
> > > > > > Markus on 11/05/2011 [1]
> > > > >
> > > > >
> > > > > >
> > > > > > > Also, within the current Nutch config, should I focus on
> > > > > > > parse-plugin.xml?
> > > > > >
> > > > > >
> > > > > > Look at the other patches and also Gabriele's comments. You may most
> > > > > > likely need to alter something but AFAICT the work has been done. It's
> > > > > > just a case of pulling together several contributions.
> > > > >
> > > > > Maybe you should look at the patch for 2.x (uploaded most recently by
> > > > > Roland) and see what is going on there.
> > > > >
> > > > > hth
> > > > >
> > > > > [0]
> > > > >
> > > >
> > https://issues.apache.org/jira/secure/attachment/12504736/NUTCH-961-1.5-1.patch
> > > > > [1]
> > > > >
> > > >
> > https://issues.apache.org/jira/secure/attachment/12478927/NUTCH-961-1.3-tikaparser1.patch
> > > > >
> > > >
> > >
> >
> 
