Hi,

I'm trying to index only the main content (main article) of various websites. 
For this, I'd like to use Boilerpipe with Nutch.

Markus has been developing a patch (NUTCH-961) that does exactly that. 
Although the patch applies without problems, I am not sure how to configure 
the necessary settings. Is there anyone who can shed some light on this?

As I understand it, two properties have to be set:

tika.boilerpipe = true
tika.boilerpipe.extractor = "ArticleExtractor"

I have tried setting these in a file conf/tika.config.file (is this still being 
used?) and in conf/nutch-default.xml, as valid XML within a property element. 
Neither attempt activated Boilerpipe. FYI: I am using Nutch 1.5.
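
For reference, written as Nutch-style property elements the two settings 
would look roughly like the sketch below. The property names are the ones I 
understood from the patch; whether the patch actually reads them from the 
Nutch configuration in this form is my assumption. Local overrides normally 
go in conf/nutch-site.xml rather than conf/nutch-default.xml.

```xml
<!-- Sketch only: property names as I understood them from NUTCH-961 -->
<property>
  <name>tika.boilerpipe</name>
  <value>true</value>
</property>
<property>
  <name>tika.boilerpipe.extractor</name>
  <!-- plain value, without the surrounding quotes -->
  <value>ArticleExtractor</value>
</property>
```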

What should I do to get this thing going?

Kind regards,

René

