Hi,

I took a clean 1.5, applied the patch with those parameters and built it with
ant. I then removed the text/html and application/xhtml+xml entries from
runtime/local/parse-plugins.xml, added tika.boilerpipe=true (as proper XML)
to the runtime/local/nutch-site.xml configuration and tested it with
bin/nutch parsechecker -dumpText <url>. It does work, although sometimes the
output is (almost) identical whether it is enabled or not.
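
As a rough sketch of what "as proper XML" means here, assuming the usual
<property> layout of Nutch configuration files: only tika.boilerpipe was
actually set in this test; the tika.boilerpipe.extractor entry is optional
and just repeats the default that the patched TikaParser.java appears to use.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Set in the test above: enables Boilerpipe in the patched parse-tika -->
  <property>
    <name>tika.boilerpipe</name>
    <value>true</value>
  </property>
  <!-- Assumption: optional, same value as the default in the patched code -->
  <property>
    <name>tika.boilerpipe.extractor</name>
    <value>ArticleExtractor</value>
  </property>
</configuration>
```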

$ bin/nutch parsechecker -dumpText 
http://www.nu.nl/buitenland/2845586/turkije-zal-syrie-niet-aanvallen.html
---------
ParseText
---------
'Turkije zal Syrië niet aanvallen' | nu.nl/buitenland | Het laatste nieuws het 
eerst op nu.nl Gepubliceerd:  Laatste update:  27 juni 2012 13:41 27 juni 2012 
13:41 Deel: FB 'Turkije zal Syrië niet aanvallen' ANKARA - Turkije is niet van 
plan buurland Syrië aan te vallen, omdat dit land vorige week een Turkse 
straaljager had neergeschoten. Foto:  AFP Dat heeft de Turkse premier Recep 
Tayyip Erdogan woensdag gezegd, zo meldde het persbureau Anatolia. ''Als Turkse 
natie hebben we geen intentie om aan te vallen'', aldus Erdogan. Dinsdag zei 
hij nog dat de beschieting niet onbeantwoord blijft en Turkije ''vastberaden'' 
zal terugslaan. Syrië haalde vrijdag een Turks toestel neer, dat neerstortte in 
de Middellandse Zee. Erdogan spreekt van een ''schandalige aanval'' en een 
''vijandige daad'. De NAVO heeft de beschieting veroordeeld.


$ bin/nutch parsechecker -dumpText 
http://www.nu.nl/buitenland/2845586/turkije-zal-syrie-niet-aanvallen.html
---------
ParseText
---------
'Turkije zal Syrië niet aanvallen' | nu.nl/buitenland | Het laatste nieuws het 
eerst op nu.nl NU NUzakelijk NUsport NUfoto NUjij Zie NUtvgids NUwerk Meer 
NUlive NUjournaal NUenToen NUreizen NUBijlage Voorpagina   Algemeen Binnenland 
Buitenland Politiek   Economie Schuldencrisis Geldzaken Beurs Sport EK 2012   
Tech Internet Gadgets Games Achterklap Opmerkelijk   Cultuur en Media Film 
Muziek Boek Media.... MORE TEXT

Cheers,
Markus

 
 
-----Original message-----
> From:Rene Nederhand <[email protected]>
> Sent: Wed 27-Jun-2012 13:36
> To: [email protected]
> Subject: Re: Using Nutch with Boilerpipe
> 
> Hi Markus,
> 
> The patch does work if you specify the -F3 parameter, like:
> 
> patch -p0 -ui NUTCH-961-1.5-1.patch -F3
> 
> I checked parse-plugins.xml and changed the html and xml-html mimetypes like 
> this:
> 
> <mimeType name="text/html">
>       <plugin id="parse-tika" />
> </mimeType>
>
> <mimeType name="application/xhtml+xml">
>       <plugin id="parse-tika" />
> </mimeType>
> 
> Unfortunately, it doesn't work.
> 
> I also checked TikaParser.java and it refers to useBoilerpipe and 
> boilerpipeExtractorName:
> 
>       boolean useBoilerpipe = getConf().getBoolean("tika.boilerpipe", true);
>       boolean useBoilerpipeEstimator =
>           getConf().getBoolean("tika.boilerpipe.estimator", false);
>       String boilerpipeExtractorName =
>           getConf().get("tika.boilerpipe.extractor", "ArticleExtractor");
> 
> Still, I am unsure where to specify these variables. Instead I added the 
> following lines to the java code (and commented the previous lines):
> 
>       boolean useBoilerpipe = true;
>       String boilerpipeExtractorName = "ArticleExtractor";
> 
> Still, it is not working….
> 
> Any ideas?
> 
> Cheers,
> René
> 
> 
> 
> On Jun 27, 2012, at 12:32 PM, Markus Jelsma wrote:
> 
> > Hi René,
> > 
> > It seems NUTCH-961-1.5-1.patch doesn't apply cleanly to the final 1.5
> > release at all; TikaParser.java changed a bit between the patch and the
> > release of 1.5. Did you resolve the failed hunks? If so, are you sure
> > Tika is being used for (x)html pages? Nutch by default uses the old
> > parse-html plugin to parse those ContentTypes. Check your
> > parse-plugins.xml configuration.
> > 
> > Cheers,
> > Markus
> > 
> > 
> > -----Original message-----
> >> From:Rene Nederhand <[email protected]>
> >> Sent: Wed 27-Jun-2012 11:59
> >> To: [email protected]
> >> Subject: Using Nutch with Boilerpipe
> >> 
> >> Hi,
> >> 
> >> I'm trying to index only the main content (main article) of various 
> >> websites. For this, I'd like to use Boilerpipe with Nutch.
> >> 
> >> Markus has been developing a patch (NUTCH-961) that does exactly that.
> >> Although the patch installs without problems, I am not sure how to
> >> configure the necessary settings. Is there anyone who can shed some
> >> light on this?
> >> 
> >> As I understand it, two variables have to be set:
> >> 
> >> tika.boilerpipe = true
> >> tika.boilerpipe.extractor = "ArticleExtractor"
> >> 
> >> I have tried to do this in a file conf/tika.config.file (is this still 
> >> being used?) and in conf/nutch-default.xml, as valid XML within a 
> >> property field. Neither activated Boilerpipe. FYI: I am using Nutch 
> >> 1.5.
> >> 
> >> What should I do to get this thing going?
> >> 
> >> Kind regards,
> >> 
> >> René
> >> 
> >> 
> >> 
> >> 
> 
> 
