RE: Using Nutch with Boilerpipe

Markus Jelsma Wed, 27 Jun 2012 03:32:52 -0700

Hi René,

It seems NUTCH-961-1.5-1.patch doesn't apply cleanly to the released 1.5 at 
all; TikaParser.java changed a bit between the patch and the 1.5 release. 
Did you resolve the failed hunks? If so, are you sure Tika is being used for 
(X)HTML pages? By default Nutch uses the old parse-html plugin to parse those 
content types. Check your parse-plugins.xml configuration.
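
As a rough sketch, assuming the stock layout of conf/parse-plugins.xml (check 
the element names against the file shipped with your release), routing the 
HTML content types to Tika instead of parse-html would look something like 
this. Only the mimeType entries are shown; the rest of the file, including 
its aliases section, is assumed to stay as it is.

```xml
<!-- conf/parse-plugins.xml (fragment, not a complete file) -->
<parse-plugins>

  <!-- Assumption: these entries point at parse-html by default;
       switching the plugin id hands (X)HTML to the Tika parser. -->
  <mimeType name="text/html">
    <plugin id="parse-tika" />
  </mimeType>

  <mimeType name="application/xhtml+xml">
    <plugin id="parse-tika" />
  </mimeType>

  <!-- existing entries and the <aliases> section remain unchanged -->

</parse-plugins>
```

The parse-tika plugin also has to be enabled through the plugin.includes 
property, otherwise this mapping has no effect.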


Cheers,
Markus
 
 
-----Original message-----
> From:Rene Nederhand <[email protected]>
> Sent: Wed 27-Jun-2012 11:59
> To: [email protected]
> Subject: Using Nutch with Boilerpipe
> 
> Hi,
> 
> I'm trying to index only the main content (main article) of various websites. 
> For this, I'd like to use Boilerpipe with Nutch.
> 
> Markus has been developing a patch (NUTCH-961) that does exactly that. 
> Although the patch installs without problems, I am not sure how to configure 
> the necessary settings. Is there anyone who can shed some light on this?
> 
> As I understand it, two properties have to be set:
> 
> tika.boilerpipe = true
> tika.boilerpipe.extractor = "ArticleExtractor"
> 
> I have tried to do this in a file conf/tika.config.file (is this still being 
> used?) and in conf/nutch-default.xml, as valid XML within a property 
> element. Neither activated Boilerpipe. FYI: I am using Nutch 1.5.
> 
> What should I do to get this thing going?
> 
> Kind regards,
> 
> René
> 
> 
> 
>
