Hello René,

Thanks for summing it up. Keep an eye on NUTCH-1233; it will fix the dirty code 
in the patch concerning the collection of outlinks. NUTCH-961 parses the 
document twice if boilerpipe is enabled; otherwise it would only collect 
outlinks from the returned text.

Cheers,
Markus

 
 
-----Original message-----
> From:Rene Nederhand <[email protected]>
> Sent: Wed 27-Jun-2012 14:28
> To: [email protected]
> Subject: Re: Using Nutch with Boilerpipe
> 
> Hi Markus,
> 
> Thank you very much. It does work now!
> 
> The problem was that I had to remove the text/html and application/xhtml+xml 
> entries from parse-plugins.xml. I had only changed these to parse-tika.
> 
> So, for those who want to try Boilerpipe with Nutch 1.5:
> 
> 1. Download Nutch source version 1.5
> 
> 2. Apply the patch NUTCH-961 using the -F3 parameter: 
> patch -p0 -ui NUTCH-961-1.5-1.patch -F3
> 
> 3. run ant
> 
> 4. delete the following lines from runtime/local/conf/parse-plugins.xml:
>       <mimeType name="text/html">
>               <plugin id="parse-tika" />
>       </mimeType>
> 
>       <mimeType name="application/xhtml+xml">
>               <plugin id="parse-tika" />
>       </mimeType>
> 
> 5. Add the following lines to runtime/local/conf/nutch-site.xml
>       <property>
>               <name>tika.boilerpipe</name>
>               <value>true</value>
>       </property>
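> 
> Optionally, and only as an untested sketch: the TikaParser.java excerpt 
> quoted further down in this thread also reads tika.boilerpipe.extractor 
> (default "ArticleExtractor"), so presumably the extractor can be set in 
> the same file. Both properties would go inside the existing 
> <configuration> element of nutch-site.xml:
> 
> ```xml
> <!-- Property names as read by the patched TikaParser.java; the extractor
>      value shown is simply the default from that excerpt. -->
> <property>
>   <name>tika.boilerpipe</name>
>   <value>true</value>
> </property>
> <property>
>   <name>tika.boilerpipe.extractor</name>
>   <value>ArticleExtractor</value>
> </property>
> ```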
> 
> Thanks again!
> 
> Cheers,
> René
> 
> 
> On Jun 27, 2012, at 1:54 PM, Markus Jelsma wrote:
> 
> > Hi,
> > 
> > I took a clean 1.5, applied the patch with those parameters and built 
> > with ant. I then removed the text/html and application/xhtml+xml entries 
> > from runtime/local/conf/parse-plugins.xml, added tika.boilerpipe=true (as 
> > proper XML) to runtime/local/conf/nutch-site.xml and tested it with 
> > parsechecker -dumpText <url>. It does work. Sometimes the output is 
> > (almost) identical whether it is enabled or not. 
> > 
> > $ bin/nutch parsechecker -dumpText 
> > http://www.nu.nl/buitenland/2845586/turkije-zal-syrie-niet-aanvallen.html
> > ---------
> > ParseText
> > ---------
> > 'Turkije zal Syrië niet aanvallen' | nu.nl/buitenland | Het laatste nieuws 
> > het eerst op nu.nl Gepubliceerd:  Laatste update:  27 juni 2012 13:41 27 
> > juni 2012 13:41 Deel: FB 'Turkije zal Syrië niet aanvallen' ANKARA - 
> > Turkije is niet van plan buurland Syrië aan te vallen, omdat dit land 
> > vorige week een Turkse straaljager had neergeschoten. Foto:  AFP Dat heeft 
> > de Turkse premier Recep Tayyip Erdogan woensdag gezegd, zo meldde het 
> > persbureau Anatolia. ''Als Turkse natie hebben we geen intentie om aan te 
> > vallen'', aldus Erdogan. Dinsdag zei hij nog dat de beschieting niet 
> > onbeantwoord blijft en Turkije ''vastberaden'' zal terugslaan. Syrië haalde 
> > vrijdag een Turks toestel neer, dat neerstortte in de Middellandse Zee. 
> > Erdogan spreekt van een ''schandalige aanval'' en een ''vijandige daad'. De 
> > NAVO heeft de beschieting veroordeeld.
> > 
> > 
> > $ bin/nutch parsechecker -dumpText 
> > http://www.nu.nl/buitenland/2845586/turkije-zal-syrie-niet-aanvallen.html
> > ---------
> > ParseText
> > ---------
> > 'Turkije zal Syrië niet aanvallen' | nu.nl/buitenland | Het laatste nieuws 
> > het eerst op nu.nl NU NUzakelijk NUsport NUfoto NUjij Zie NUtvgids NUwerk 
> > Meer NUlive NUjournaal NUenToen NUreizen NUBijlage Voorpagina   Algemeen 
> > Binnenland Buitenland Politiek   Economie Schuldencrisis Geldzaken Beurs 
> > Sport EK 2012   Tech Internet Gadgets Games Achterklap Opmerkelijk   
> > Cultuur en Media Film Muziek Boek Media.... MORE TEXT
> > 
> > Cheers,
> > Markus
> > 
> > 
> > 
> > -----Original message-----
> >> From:Rene Nederhand <[email protected]>
> >> Sent: Wed 27-Jun-2012 13:36
> >> To: [email protected]
> >> Subject: Re: Using Nutch with Boilerpipe
> >> 
> >> Hi Markus,
> >> 
> >> The patch does work if you specify the -F3 parameter, like:
> >> 
> >> patch -p0 -ui NUTCH-961-1.5-1.patch -F3
> >> 
> >> I checked parse-plugins.xml and changed the html and xml-html mimetypes 
> >> like this:
> >> 
> >> <mimeType name="text/html">                                            
> >>    <plugin id="parse-tika" />                                      
> >> </mimeType>                                                             
> >> 
> >> <mimeType name="application/xhtml+xml">                                 
> >>    <plugin id="parse-tika" />                                     
> >> </mimeType>
> >> 
> >> Unfortunately, it doesn't work.
> >> 
> >> I also checked TikaParser.java and it refers to useBoilerpipe and 
> >> boilerpipeExtractorName:
> >> 
> >>    boolean useBoilerpipe = getConf().getBoolean("tika.boilerpipe", true);
> >>    boolean useBoilerpipeEstimator =
> >>        getConf().getBoolean("tika.boilerpipe.estimator", false);
> >>    String boilerpipeExtractorName =
> >>        getConf().get("tika.boilerpipe.extractor", "ArticleExtractor");
> >> 
> >> Still, I am unsure where to specify these variables. Instead I added the 
> >> following lines to the java code (and commented the previous lines):
> >> 
> >>    boolean useBoilerpipe = true;
> >>    String boilerpipeExtractorName = "ArticleExtractor";
> >> 
> >> Still, it is not working….
> >> 
> >> Any ideas?
> >> 
> >> Cheers,
> >> René
> >> 
> >> 
> >> 
> >> On Jun 27, 2012, at 12:32 PM, Markus Jelsma wrote:
> >> 
> >>> Hi René,
> >>> 
> >>> It seems NUTCH-961-1.5-1.patch doesn't apply cleanly to the final 
> >>> 1.5 release at all; TikaParser.java changed a bit between the 
> >>> patch and the release of 1.5. Did you resolve the failed hunks? If so, 
> >>> are you sure Tika is being used for (x)html pages? Nutch by default uses 
> >>> the old parse-html plugin to parse those ContentTypes. Check your 
> >>> parse-plugins.xml configuration.
> >>> 
> >>> Cheers,
> >>> Markus
> >>> 
> >>> 
> >>> -----Original message-----
> >>>> From:Rene Nederhand <[email protected]>
> >>>> Sent: Wed 27-Jun-2012 11:59
> >>>> To: [email protected]
> >>>> Subject: Using Nutch with Boilerpipe
> >>>> 
> >>>> Hi,
> >>>> 
> >>>> I'm trying to index only the main content (main article) of various 
> >>>> websites. For this, I'd like to use Boilerpipe with Nutch.
> >>>> 
> >>>> Markus has been developing a patch (NUTCH-961) that does exactly that. 
> >>>> Although the patch installs without problems, I am not sure how to 
> >>>> configure the necessary settings. Is there anyone who can shed some 
> >>>> light on this?
> >>>> 
> >>>> As I understand it, two variables have to be set:
> >>>> 
> >>>> tika.boilerpipe = true
> >>>> tika.boilerpipe.extractor = "ArticleExtractor"
> >>>> 
> >>>> I have tried to do this in a file conf/tika.config.file (is this still 
> >>>> being used?) and in conf/nutch-default.xml, as valid XML within a 
> >>>> property field. Neither activated Boilerpipe. FYI: I am using Nutch 
> >>>> 1.5.
> >>>> 
> >>>> What should I do to get this thing going?
> >>>> 
> >>>> Kind regards,
> >>>> 
> >>>> René
> >>>> 
> >>>> 
> >>>> 
> >>>> 
> >> 
> >> 
> 
> 
