Hi Markus,

Thank you very much. It does work now!

The problem was that I had only changed the text/html and application/xhtml+xml 
entries in parse-plugins.xml to parse-tika. They have to be removed entirely.

So, for those who want to try Boilerpipe with Nutch 1.5:

1. Download Nutch source version 1.5

2. Apply the patch NUTCH-961 using the -F3 parameter: 
patch -p0 -ui NUTCH-961-1.5-1.patch -F3

3. Run ant

4. Delete the following lines from runtime/local/conf/parse-plugins.xml:
        <mimeType name="text/html">
                <plugin id="parse-tika" />
        </mimeType>

        <mimeType name="application/xhtml+xml">
                <plugin id="parse-tika" />
        </mimeType>

5. Add the following lines to runtime/local/conf/nutch-site.xml:
        <property>
                <name>tika.boilerpipe</name>
                <value>true</value>
        </property>
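
For anyone who wants to double-check steps 4 and 5 before crawling, here is a minimal sketch of a shell check. The function names are made up for illustration, and it assumes the property is written as a plain name/value pair with nothing (such as a description) in between, so treat it as a rough aid rather than a proper XML validation.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Nutch) to sanity-check steps 4 and 5.

# has_mime_mapping FILE MIME
# Succeeds if FILE still contains an explicit <mimeType> entry for MIME.
has_mime_mapping() {
    grep -q "<mimeType name=\"$2\"" "$1" 2>/dev/null
}

# boilerpipe_enabled FILE
# Succeeds if FILE sets tika.boilerpipe to true. Whitespace is stripped
# first so the check does not depend on indentation or line breaks.
# Assumption: <value> directly follows <name> inside the property.
boilerpipe_enabled() {
    tr -d ' \t\n\r' < "$1" 2>/dev/null |
        grep -q '<name>tika\.boilerpipe</name><value>true</value>'
}

# check_boilerpipe_setup CONF_DIR
# Prints what still looks wrong and returns non-zero if anything does.
check_boilerpipe_setup() {
    conf_dir=$1
    status=0
    for mime in text/html application/xhtml+xml; do
        if has_mime_mapping "$conf_dir/parse-plugins.xml" "$mime"; then
            echo "parse-plugins.xml still has an explicit mapping for $mime"
            status=1
        fi
    done
    if ! boilerpipe_enabled "$conf_dir/nutch-site.xml"; then
        echo "tika.boilerpipe is not set to true in nutch-site.xml"
        status=1
    fi
    return $status
}
```

After sourcing the functions, `check_boilerpipe_setup runtime/local/conf` should print nothing and return 0 once both steps are done.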

Thanks again!

Cheers,
René


On Jun 27, 2012, at 1:54 PM, Markus Jelsma wrote:

> Hi,
> 
> I took a clean 1.5, applied the patch with those parameters and built it with 
> ant. I then removed the text/html and application/xhtml+xml entries from 
> runtime/local/parse-plugins.xml, added tika.boilerpipe=true (as proper 
> XML) to the runtime/local/nutch-site.xml configuration and tested it with 
> bin/nutch parsechecker -dumpText <url>. It does work. Sometimes the output is 
> (almost) identical whether it is enabled or not. 
> 
> $ bin/nutch parsechecker -dumpText 
> http://www.nu.nl/buitenland/2845586/turkije-zal-syrie-niet-aanvallen.html
> ---------
> ParseText
> ---------
> 'Turkije zal Syrië niet aanvallen' | nu.nl/buitenland | Het laatste nieuws 
> het eerst op nu.nl Gepubliceerd:  Laatste update:  27 juni 2012 13:41 27 juni 
> 2012 13:41 Deel: FB 'Turkije zal Syrië niet aanvallen' ANKARA - Turkije is 
> niet van plan buurland Syrië aan te vallen, omdat dit land vorige week een 
> Turkse straaljager had neergeschoten. Foto:  AFP Dat heeft de Turkse premier 
> Recep Tayyip Erdogan woensdag gezegd, zo meldde het persbureau Anatolia. 
> ''Als Turkse natie hebben we geen intentie om aan te vallen'', aldus Erdogan. 
> Dinsdag zei hij nog dat de beschieting niet onbeantwoord blijft en Turkije 
> ''vastberaden'' zal terugslaan. Syrië haalde vrijdag een Turks toestel neer, 
> dat neerstortte in de Middellandse Zee. Erdogan spreekt van een ''schandalige 
> aanval'' en een ''vijandige daad'. De NAVO heeft de beschieting veroordeeld.
> 
> 
> $ bin/nutch parsechecker -dumpText 
> http://www.nu.nl/buitenland/2845586/turkije-zal-syrie-niet-aanvallen.html
> ---------
> ParseText
> ---------
> 'Turkije zal Syrië niet aanvallen' | nu.nl/buitenland | Het laatste nieuws 
> het eerst op nu.nl NU NUzakelijk NUsport NUfoto NUjij Zie NUtvgids NUwerk 
> Meer NUlive NUjournaal NUenToen NUreizen NUBijlage Voorpagina   Algemeen 
> Binnenland Buitenland Politiek   Economie Schuldencrisis Geldzaken Beurs 
> Sport EK 2012   Tech Internet Gadgets Games Achterklap Opmerkelijk   Cultuur 
> en Media Film Muziek Boek Media.... MORE TEXT
> 
> Cheers,
> Markus
> 
> 
> 
> -----Original message-----
>> From:Rene Nederhand <[email protected]>
>> Sent: Wed 27-Jun-2012 13:36
>> To: [email protected]
>> Subject: Re: Using Nutch with Boilerpipe
>> 
>> Hi Markus,
>> 
>> The patch does work if you specify the -F3 parameter, like:
>> 
>> patch -p0 -ui NUTCH-961-1.5-1.patch -F3
>> 
>> I checked parse-plugins.xml and changed the html and xml-html mimetypes like 
>> this:
>> 
>> <mimeType name="text/html">                                            
>>      <plugin id="parse-tika" />                                      
>> </mimeType>                                                             
>> 
>> <mimeType name="application/xhtml+xml">                                 
>>      <plugin id="parse-tika" />                                     
>> </mimeType>
>> 
>> Unfortunately, it doesn't work.
>> 
>> I also checked TikaParser.java and it refers to useBoilerpipe and 
>> boilerpipeExtractorName:
>> 
>>      boolean useBoilerpipe = getConf().getBoolean("tika.boilerpipe", true);
>>      boolean useBoilerpipeEstimator =
>>          getConf().getBoolean("tika.boilerpipe.estimator", false);
>>      String boilerpipeExtractorName =
>>          getConf().get("tika.boilerpipe.extractor", "ArticleExtractor");
>> 
>> Still, I am unsure where to specify these variables. Instead I added the 
>> following lines to the java code (and commented the previous lines):
>> 
>>      boolean useBoilerpipe = true;
>>      String boilerpipeExtractorName = "ArticleExtractor";
>> 
>> Still, it is not working….
>> 
>> Any ideas?
>> 
>> Cheers,
>> René
>> 
>> 
>> 
>> On Jun 27, 2012, at 12:32 PM, Markus Jelsma wrote:
>> 
>>> Hi René,
>>> 
>>> It seems NUTCH-961-1.5-1.patch doesn't apply cleanly to the finally 
>>> released 1.5 at all; TikaParser.java changed a bit between the patch 
>>> and the release of 1.5. Did you resolve the failed hunks? If so, are you 
>>> sure Tika is being used for (x)html pages? Nutch by default uses the old 
>>> parse-html plugin to parse those ContentTypes. Check your parse-plugins.xml 
>>> configuration.
>>> 
>>> Cheers,
>>> Markus
>>> 
>>> 
>>> -----Original message-----
>>>> From:Rene Nederhand <[email protected]>
>>>> Sent: Wed 27-Jun-2012 11:59
>>>> To: [email protected]
>>>> Subject: Using Nutch with Boilerpipe
>>>> 
>>>> Hi,
>>>> 
>>>> I'm trying to index only the main content (main article) of various 
>>>> websites. For this, I'd like to use Boilerpipe with Nutch.
>>>> 
>>>> Markus has been developing a patch (NUTCH-961) that does exactly that. 
>>>> Although the patch installs without problems, I am not sure how to 
>>>> configure the necessary settings. Is there anyone who can shed some light on 
>>>> this?
>>>> 
>>>> As I understand it, two variables have to be set:
>>>> 
>>>> tika.boilerpipe = true
>>>> tika.boilerpipe.extractor = "ArticleExtractor"
>>>> 
>>>> I have tried to do this in a file conf/tika.config.file (is this still 
>>>> being used?) and in conf/nutch-default.xml, as valid XML within a 
>>>> property field. Neither activated Boilerpipe. FYI: I am using Nutch 
>>>> 1.5.
>>>> 
>>>> What should I do to get this thing going?
>>>> 
>>>> Kind regards,
>>>> 
>>>> René
>>>> 
>>>> 
>>>> 
>>>> 
>> 
>> 
