Using Nutch with Boilerpipe

Rene Nederhand Wed, 27 Jun 2012 06:51:29 -0700

Hi,

I'm trying to index only the main content (main article) of various websites. 
For this, I'd like to use Boilerpipe with Nutch.


Markus has been developing a patch (NUTCH-961) that does exactly that. 
Although the patch installs without problems, I am not sure how to configure 
the necessary settings. Is there anyone who can shed some light on this?

As I understand it, two properties have to be set:

tika.boilerpipe = true
tika.boilerpipe.extractor = "ArticleExtractor"

I have tried to do this in a file conf/tika.config.file (is this still being 
used?) and in conf/nutch-default.xml, as valid XML within a property field. 
Neither activated Boilerpipe. FYI: I am using Nutch 1.5.
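
For reference, written as Hadoop-style property entries, the two settings would look roughly like the sketch below. The names and values are the ones listed above (the value is given without the surrounding quotes). Whether the NUTCH-961 patch actually reads these exact property names is an assumption, not something confirmed here.

```xml
<!-- Sketch only: property names are copied from the list above and are
     not verified against the NUTCH-961 patch. -->
<configuration>
  <property>
    <name>tika.boilerpipe</name>
    <value>true</value>
  </property>
  <property>
    <name>tika.boilerpipe.extractor</name>
    <value>ArticleExtractor</value>
  </property>
</configuration>
```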

What should I do to get this thing going?

Kind regards,

René
