Hi Markus,
Thank you very much. It does work now!
The problem was that I had to remove the text/html and application/xhtml+xml
entries from parse-plugins.xml. I had only changed these to parse-tika.
So, for those who want to try Boilerpipe with Nutch 1.5:
1. Download Nutch source version 1.5
2. Apply the patch NUTCH-961 using the -F3 parameter:
patch -p0 -ui NUTCH-961-1.5-1.patch -F3
3. Run ant
4. Delete the following lines from runtime/local/conf/parse-plugins.xml:
<mimeType name="text/html">
<plugin id="parse-tika" />
</mimeType>
<mimeType name="application/xhtml+xml">
<plugin id="parse-tika" />
</mimeType>
5. Add the following lines to runtime/local/conf/nutch-site.xml:
<property>
<name>tika.boilerpipe</name>
<value>true</value>
</property>
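To double-check steps 4 and 5, something like the following Python sketch could help. It is not part of Nutch or the patch: the function names (explicit_html_mappings, boilerpipe_enabled) are made up for illustration, and it assumes the <mimeType>/<plugin> and <property>/<name>/<value> layout shown above. It only reports what the two files say, not what Nutch falls back to when a property is missing.

```python
import sys
import xml.etree.ElementTree as ET

# MIME types that should no longer have their own entry after step 4.
HTML_TYPES = {"text/html", "application/xhtml+xml"}


def explicit_html_mappings(parse_plugins_xml):
    """Return {mime type: [plugin ids]} for HTML types that still have an entry."""
    root = ET.fromstring(parse_plugins_xml)
    found = {}
    for mime in root.iter("mimeType"):
        name = mime.get("name")
        if name in HTML_TYPES:
            found[name] = [plugin.get("id") for plugin in mime.iter("plugin")]
    return found


def boilerpipe_enabled(nutch_site_xml):
    """Return True if this XML sets tika.boilerpipe to 'true' (step 5)."""
    root = ET.fromstring(nutch_site_xml)
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == "tika.boilerpipe":
            return (prop.findtext("value") or "").strip().lower() == "true"
    return False  # the property is not present in this file


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python check_boilerpipe.py <parse-plugins.xml> <nutch-site.xml>
    with open(sys.argv[1], "rb") as f:
        leftovers = explicit_html_mappings(f.read())
    with open(sys.argv[2], "rb") as f:
        enabled = boilerpipe_enabled(f.read())
    print("HTML types still mapped explicitly:", leftovers or "none")
    print("tika.boilerpipe set to true:", enabled)
```

Pass the two paths from steps 4 and 5 as arguments. If the first line lists any HTML type, step 4 has not been applied to that file.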
Thanks again!
Cheers,
René
On Jun 27, 2012, at 1:54 PM, Markus Jelsma wrote:
> Hi,
>
> I took a clean 1.5, applied the patch with those parameters, and built with
> ant. I then removed the text/html and application/xhtml+xml entries from
> runtime/local/conf/parse-plugins.xml, added tika.boilerpipe=true (as proper
> XML) to runtime/local/conf/nutch-site.xml, and tested it with
> bin/nutch parsechecker -dumpText <url>. It does work. Sometimes the output is
> (almost) identical whether it is enabled or not.
>
> $ bin/nutch parsechecker -dumpText
> http://www.nu.nl/buitenland/2845586/turkije-zal-syrie-niet-aanvallen.html
> ---------
> ParseText
> ---------
> 'Turkije zal Syrië niet aanvallen' | nu.nl/buitenland | Het laatste nieuws
> het eerst op nu.nl Gepubliceerd: Laatste update: 27 juni 2012 13:41 27 juni
> 2012 13:41 Deel: FB 'Turkije zal Syrië niet aanvallen' ANKARA - Turkije is
> niet van plan buurland Syrië aan te vallen, omdat dit land vorige week een
> Turkse straaljager had neergeschoten. Foto: AFP Dat heeft de Turkse premier
> Recep Tayyip Erdogan woensdag gezegd, zo meldde het persbureau Anatolia.
> ''Als Turkse natie hebben we geen intentie om aan te vallen'', aldus Erdogan.
> Dinsdag zei hij nog dat de beschieting niet onbeantwoord blijft en Turkije
> ''vastberaden'' zal terugslaan. Syrië haalde vrijdag een Turks toestel neer,
> dat neerstortte in de Middellandse Zee. Erdogan spreekt van een ''schandalige
> aanval'' en een ''vijandige daad'. De NAVO heeft de beschieting veroordeeld.
>
>
> $ bin/nutch parsechecker -dumpText
> http://www.nu.nl/buitenland/2845586/turkije-zal-syrie-niet-aanvallen.html
> ---------
> ParseText
> ---------
> 'Turkije zal Syrië niet aanvallen' | nu.nl/buitenland | Het laatste nieuws
> het eerst op nu.nl NU NUzakelijk NUsport NUfoto NUjij Zie NUtvgids NUwerk
> Meer NUlive NUjournaal NUenToen NUreizen NUBijlage Voorpagina Algemeen
> Binnenland Buitenland Politiek Economie Schuldencrisis Geldzaken Beurs
> Sport EK 2012 Tech Internet Gadgets Games Achterklap Opmerkelijk Cultuur
> en Media Film Muziek Boek Media.... MORE TEXT
>
> Cheers,
> Markus
>
>
>
> -----Original message-----
>> From:Rene Nederhand <[email protected]>
>> Sent: Wed 27-Jun-2012 13:36
>> To: [email protected]
>> Subject: Re: Using Nutch with Boilerpipe
>>
>> Hi Markus,
>>
>> The patch does work if you specify the -F3 parameter, like:
>>
>> patch -p0 -ui NUTCH-961-1.5-1.patch -F3
>>
>> I checked parse-plugins.xml and changed the html and xhtml mimetypes like
>> this:
>>
>> <mimeType name="text/html">
>> <plugin id="parse-tika" />
>> </mimeType>
>>
>> <mimeType name="application/xhtml+xml">
>> <plugin id="parse-tika" />
>> </mimeType>
>>
>> Unfortunately, it doesn't work.
>>
>> I also checked TikaParser.java and it refers to useBoilerpipe and
>> boilerpipeExtractorName:
>>
>> boolean useBoilerpipe = getConf().getBoolean("tika.boilerpipe", true);
>> boolean useBoilerpipeEstimator =
>> getConf().getBoolean("tika.boilerpipe.estimator", false);
>> String boilerpipeExtractorName =
>> getConf().get("tika.boilerpipe.extractor", "ArticleExtractor");
>>
>> Still, I am unsure where to specify these variables. Instead, I added the
>> following lines to the Java code (and commented out the previous lines):
>>
>> boolean useBoilerpipe = true;
>> String boilerpipeExtractorName = "ArticleExtractor";
>>
>> Still, it is not working.
>>
>> Any ideas?
>>
>> Cheers,
>> René
>>
>>
>>
>> On Jun 27, 2012, at 12:32 PM, Markus Jelsma wrote:
>>
>>> Hi René,
>>>
>>> It seems NUTCH-961-1.5-1.patch doesn't apply cleanly to the final 1.5
>>> release at all; TikaParser.java changed a bit between the patch
>>> and the release of 1.5. Did you resolve the failed hunks? If so, are you
>>> sure Tika is being used for (x)html pages? Nutch by default uses the old
>>> parse-html plugin to parse those ContentTypes. Check your parse-plugins.xml
>>> configuration.
>>>
>>> Cheers,
>>> Markus
>>>
>>>
>>> -----Original message-----
>>>> From:Rene Nederhand <[email protected]>
>>>> Sent: Wed 27-Jun-2012 11:59
>>>> To: [email protected]
>>>> Subject: Using Nutch with Boilerpipe
>>>>
>>>> Hi,
>>>>
>>>> I'm trying to index only the main content (main article) of various
>>>> websites. For this, I'd like to use Boilerpipe with Nutch.
>>>>
>>>> Markus has been developing a patch (NUTCH-961) that does exactly that.
>>>> Although the patch installs without problems, I am not sure how to
>>>> set the necessary settings. Is there anyone who can shed some light on
>>>> this?
>>>>
>>>> As I understand it, two variables have to be set:
>>>>
>>>> tika.boilerpipe = true
>>>> tika.boilerpipe.extractor = "ArticleExtractor"
>>>>
>>>> I have tried to do this in a file conf/tika.config.file (is this still
>>>> being used?) and in conf/nutch-default.xml, as valid XML within a
>>>> property field. Neither activated Boilerpipe. FYI: I am using Nutch
>>>> 1.5.
>>>>
>>>> What should I do to get this thing going?
>>>>
>>>> Kind regards,
>>>>
>>>> René
>>>>
>>>>
>>>>
>>>>
>>
>>