Re: Using Nutch with Boilerpipe

Rene Nederhand Wed, 27 Jun 2012 04:36:23 -0700

Hi Markus,

The patch does work if you specify the -F3 parameter, like:


patch -p0 -ui NUTCH-961-1.5-1.patch -F3

I checked parse-plugins.xml and changed the text/html and 
application/xhtml+xml mimetypes like this:

<mimeType name="text/html">
        <plugin id="parse-tika" />
</mimeType>

<mimeType name="application/xhtml+xml">
        <plugin id="parse-tika" />
</mimeType>

Unfortunately, it doesn't work.

I also checked TikaParser.java and it refers to useBoilerpipe and 
boilerpipeExtractorName:

        boolean useBoilerpipe = getConf().getBoolean("tika.boilerpipe", true);
        boolean useBoilerpipeEstimator =
            getConf().getBoolean("tika.boilerpipe.estimator", false);
        String boilerpipeExtractorName =
            getConf().get("tika.boilerpipe.extractor", "ArticleExtractor");

Still, I am unsure where to specify these variables. Instead, I added the 
following lines to the Java code (and commented out the previous lines):

        boolean useBoilerpipe = true;
        String boilerpipeExtractorName = "ArticleExtractor";

Still, it is not working.
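
For reference, properties read through getConf() are normally supplied as 
Hadoop-style <property> entries, with site-level overrides going in 
conf/nutch-site.xml rather than nutch-default.xml. A minimal sketch, assuming 
the patched TikaParser reads the three keys exactly as quoted above (the 
values simply mirror the defaults in that code, and this has not been 
verified against the patch):

```xml
<!-- conf/nutch-site.xml: site-level overrides of nutch-default.xml -->
<configuration>
  <!-- Key names are taken from the TikaParser.java lines quoted above -->
  <property>
    <name>tika.boilerpipe</name>
    <value>true</value>
  </property>
  <property>
    <name>tika.boilerpipe.estimator</name>
    <value>false</value>
  </property>
  <property>
    <name>tika.boilerpipe.extractor</name>
    <value>ArticleExtractor</value>
  </property>
</configuration>
```

Since the code already defaults to true and "ArticleExtractor" when these 
keys are unset, neither this file nor the hard-coded values should change the 
behaviour, which suggests the problem may lie elsewhere.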

Any ideas?

Cheers,
René



On Jun 27, 2012, at 12:32 PM, Markus Jelsma wrote:

> Hi René,
> 
> It seems NUTCH-961-1.5-1.patch doesn't apply cleanly to the released 1.5 
> at all; TikaParser.java changed a bit between the patch and the release of 
> 1.5. Did you resolve the failed hunks? If so, are you sure Tika is being 
> used for (x)html pages? Nutch by default uses the old parse-html plugin to 
> parse those ContentTypes. Check your parse-plugins.xml configuration.
> 
> Cheers,
> Markus
> 
> 
> -----Original message-----
>> From:Rene Nederhand <[email protected]>
>> Sent: Wed 27-Jun-2012 11:59
>> To: [email protected]
>> Subject: Using Nutch with Boilerpipe
>> 
>> Hi,
>> 
>> I'm trying to index only the main content (main article) of various 
>> websites. For this, I'd like to use Boilerpipe with Nutch.
>> 
>> Markus has been developing a patch (NUTCH-961) that does exactly that. 
>> Although the patch installs without problems, I am not sure how to set 
>> the necessary settings. Is there anyone who can shed some light on this?
>> 
>> As I understand it, two variables have to be set:
>> 
>> tika.boilerpipe = true
>> tika.boilerpipe.extractor = "ArticleExtractor"
>> 
>> I have tried to do this in a file conf/tika.config.file (is this still being 
>> used?) and in conf/nutch-default.xml, as valid XML within a property 
>> field. Neither activated Boilerpipe. FYI: I am using Nutch 1.5.
>> 
>> What should I do to get this thing going?
>> 
>> Kind regards,
>> 
>> René
>> 
>> 
>> 
>>
