Markus, do you mind sharing a sample nutch-site.xml?
On Mon, Jun 10, 2013 at 1:42 AM, Markus Jelsma <[email protected]> wrote:

> Those settings belong to nutch-site. Enable BP and set the correct
> extractor and it should work just fine.
>
> -----Original message-----
> > From: Lewis John Mcgibbney <[email protected]>
> > Sent: Sun 09-Jun-2013 20:47
> > To: [email protected]
> > Subject: Re: using Tika within Nutch to remove boiler plates?
> >
> > Hi Joe,
> > I've not used this feature, it would be great if one of the others could
> > chime in here.
> > From what I can infer from the correspondence on the issue, and the
> > available patches, you should be applying the most recent one uploaded by
> > Markus [0] as your starting point. This is dated as 22/11/2011.
> >
> > On Sun, Jun 9, 2013 at 11:00 AM, Joe Zhang <[email protected]> wrote:
> >
> > > One of the comments mentioned the following:
> > >
> > > tika.use_boilerpipe=true
> > > tika.boilerpipe.extractor=ArticleExtractor|CanolaExtractor
> > >
> > > Which part of the code is it referring to?
> >
> > You will see this included in one of the earlier patches uploaded by
> > Markus on 11/05/2011 [1]
> >
> > > Also, within the current Nutch config, should I focus on
> > > parse-plugin.xml?
> >
> > Look at the other patches and also Gabriele's comments. You may most
> > likely need to alter something but AFAICT the work has been done.. it's
> > just a case of pulling together several contributions.
> >
> > Maybe you should look at the patch for 2.x (uploaded most recently by
> > Roland) and see what is going on there.
> >
> > hth
> >
> > [0] https://issues.apache.org/jira/secure/attachment/12504736/NUTCH-961-1.5-1.patch
> > [1] https://issues.apache.org/jira/secure/attachment/12478927/NUTCH-961-1.3-tikaparser1.patch
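
For reference, here is a rough sketch of how the two settings quoted above might look once moved into nutch-site.xml. It assumes the NUTCH-961 patch is applied and that the property names are exactly the ones from the JIRA comment (tika.use_boilerpipe and tika.boilerpipe.extractor); the names may differ depending on which patch version is used. ArticleExtractor is picked here only as one of the two values listed, and the descriptions are my own wording rather than anything from the patch.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Assumption: property name as quoted in the NUTCH-961 comment. -->
  <property>
    <name>tika.use_boilerpipe</name>
    <value>true</value>
    <description>Enable Boilerpipe (BP) boilerplate removal in the Tika parser.</description>
  </property>
  <!-- Assumption: one of the two extractor values named in the comment. -->
  <property>
    <name>tika.boilerpipe.extractor</name>
    <value>ArticleExtractor</value>
    <description>Which Boilerpipe extractor to use: ArticleExtractor or CanolaExtractor.</description>
  </property>
</configuration>
```

If nutch-site.xml already has a configuration element, only the two property blocks would need to be added inside it.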

