Hi René,

It seems NUTCH-961-1.5-1.patch doesn't apply cleanly to the final 1.5 release; TikaParser.java changed a bit between the patch and the release. Did you resolve the failed hunks? If so, are you sure Tika is being used for (X)HTML pages? By default, Nutch uses the old parse-html plugin to parse those content types. Check your parse-plugins.xml configuration.
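
As a rough sketch of what to look for: the (X)HTML entries in conf/parse-plugins.xml would need to point at parse-tika rather than parse-html, roughly as below. This assumes the stock mimeType/plugin layout of that file and that parse-tika is also enabled in plugin.includes; adapt it to whatever your copy actually contains.

```xml
<!-- conf/parse-plugins.xml (fragment, not a complete file) -->
<!-- Assumption: route HTML and XHTML to parse-tika instead of the default parse-html -->
<mimeType name="text/html">
  <plugin id="parse-tika" />
</mimeType>

<mimeType name="application/xhtml+xml">
  <plugin id="parse-tika" />
</mimeType>
```

The two Boilerpipe settings would then normally go into conf/nutch-site.xml as overrides, rather than into nutch-default.xml. The property names here are the ones quoted in this thread and are assumed to be the ones the patched TikaParser reads; verify them against the patch itself.

```xml
<!-- conf/nutch-site.xml (fragment, placed inside the <configuration> element) -->
<!-- Assumption: property names are taken from this thread, not verified against the patch -->
<property>
  <name>tika.boilerpipe</name>
  <value>true</value>
</property>

<property>
  <name>tika.boilerpipe.extractor</name>
  <value>ArticleExtractor</value>
</property>
```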
Cheers,
Markus

-----Original message-----
> From: Rene Nederhand <[email protected]>
> Sent: Wed 27-Jun-2012 11:59
> To: [email protected]
> Subject: Using Nutch with Boilerpipe
>
> Hi,
>
> I'm trying to index only the main content (main article) of various websites.
> For this, I'd like to use Boilerpipe with Nutch.
>
> Markus has been developing a patch (NUTCH-961) that does exactly that.
> Although the patch installs without problems, I am not sure how to set
> the necessary settings. Is there anyone who can shed some light on this?
>
> As I understand it, two variables have to be set:
>
> tika.boilerpipe = true
> tika.boilerpipe.extractor = "ArticleExtractor"
>
> I have tried to do this in a file conf/tika.config.file (is this still being
> used?) and in conf/nutch-default.xml, as valid XML within a property
> field. Neither activated Boilerpipe. FYI: I am using Nutch 1.5.
>
> What should I do to get this thing going?
>
> Kind regards,
>
> René

