Hello René,

Thanks for summing it up. Keep an eye on NUTCH-1233; it will fix the dirty code in the patch concerning the collection of outlinks. NUTCH-961 parses the document twice if Boilerpipe is enabled; otherwise it will only collect outlinks from the returned text.
Cheers,
Markus

-----Original message-----
> From: Rene Nederhand <[email protected]>
> Sent: Wed 27-Jun-2012 14:28
> To: [email protected]
> Subject: Re: Using Nutch with Boilerpipe
>
> Hi Markus,
>
> Thank you very much. It does work now!
>
> The problem was the text/html and application/xhtml+xml entries in
> parse-plugins.xml. I just changed these to parse-tika.
>
> So, for those who want to try Boilerpipe with Nutch 1.5:
>
> 1. Download Nutch source version 1.5
>
> 2. Apply the patch NUTCH-961 using the -F3 parameter:
> patch -p0 -ui NUTCH-961-1.5-1.patch -F3
>
> 3. Run ant
>
> 4. Change the following entries in runtime/local/conf/parse-plugins.xml
> so that they point to parse-tika:
> <mimeType name="text/html">
> <plugin id="parse-tika" />
> </mimeType>
>
> <mimeType name="application/xhtml+xml">
> <plugin id="parse-tika" />
> </mimeType>
>
> 5. Add the following lines to runtime/local/conf/nutch-site.xml:
> <property>
> <name>tika.boilerpipe</name>
> <value>true</value>
> </property>
>
> Thanks again!
>
> Cheers,
> René
>
>
> On Jun 27, 2012, at 1:54 PM, Markus Jelsma wrote:
>
> > Hi,
> >
> > I took a clean 1.5 and applied the patch with those parameters and built
> > with ant. I then removed the text/html and application/xhtml+xml from the
> > runtime/local/parse-plugins.xml and added tika.boilerpipe=true (as proper
> > XML) to the runtime/local/nutch-site.xml configuration and tested it with
> > $parsechecker -dumpText <url>, it does work. Sometimes the output is
> > (almost) identical whether it is enabled or not.
> >
> > $ bin/nutch parsechecker -dumpText
> > http://www.nu.nl/buitenland/2845586/turkije-zal-syrie-niet-aanvallen.html
> > ---------
> > ParseText
> > ---------
> > 'Turkije zal Syrië niet aanvallen' | nu.nl/buitenland | Het laatste nieuws
> > het eerst op nu.nl Gepubliceerd: Laatste update: 27 juni 2012 13:41 27
> > juni 2012 13:41 Deel: FB 'Turkije zal Syrië niet aanvallen' ANKARA -
> > Turkije is niet van plan buurland Syrië aan te vallen, omdat dit land
> > vorige week een Turkse straaljager had neergeschoten. Foto: AFP Dat heeft
> > de Turkse premier Recep Tayyip Erdogan woensdag gezegd, zo meldde het
> > persbureau Anatolia. ''Als Turkse natie hebben we geen intentie om aan te
> > vallen'', aldus Erdogan. Dinsdag zei hij nog dat de beschieting niet
> > onbeantwoord blijft en Turkije ''vastberaden'' zal terugslaan. Syrië haalde
> > vrijdag een Turks toestel neer, dat neerstortte in de Middellandse Zee.
> > Erdogan spreekt van een ''schandalige aanval'' en een ''vijandige daad'. De
> > NAVO heeft de beschieting veroordeeld.
> >
> >
> > $ bin/nutch parsechecker -dumpText
> > http://www.nu.nl/buitenland/2845586/turkije-zal-syrie-niet-aanvallen.html
> > ---------
> > ParseText
> > ---------
> > 'Turkije zal Syrië niet aanvallen' | nu.nl/buitenland | Het laatste nieuws
> > het eerst op nu.nl NU NUzakelijk NUsport NUfoto NUjij Zie NUtvgids NUwerk
> > Meer NUlive NUjournaal NUenToen NUreizen NUBijlage Voorpagina Algemeen
> > Binnenland Buitenland Politiek Economie Schuldencrisis Geldzaken Beurs
> > Sport EK 2012 Tech Internet Gadgets Games Achterklap Opmerkelijk
> > Cultuur en Media Film Muziek Boek Media....
> > MORE TEXT
> >
> > Cheers,
> > Markus
> >
> >
> > -----Original message-----
> >> From: Rene Nederhand <[email protected]>
> >> Sent: Wed 27-Jun-2012 13:36
> >> To: [email protected]
> >> Subject: Re: Using Nutch with Boilerpipe
> >>
> >> Hi Markus,
> >>
> >> The patch does work if you specify the -F3 parameter, like:
> >>
> >> patch -p0 -ui NUTCH-961-1.5-1.patch -F3
> >>
> >> I checked parse-plugins.xml and changed the HTML and XHTML MIME types
> >> like this:
> >>
> >> <mimeType name="text/html">
> >> <plugin id="parse-tika" />
> >> </mimeType>
> >>
> >> <mimeType name="application/xhtml+xml">
> >> <plugin id="parse-tika" />
> >> </mimeType>
> >>
> >> Unfortunately, it doesn't work.
> >>
> >> I also checked TikaParser.java and it refers to useBoilerpipe and
> >> boilerpipeExtractorName:
> >>
> >> boolean useBoilerpipe = getConf().getBoolean("tika.boilerpipe", true);
> >> boolean useBoilerpipeEstimator =
> >> getConf().getBoolean("tika.boilerpipe.estimator", false);
> >> String boilerpipeExtractorName =
> >> getConf().get("tika.boilerpipe.extractor", "ArticleExtractor");
> >>
> >> Still, I am unsure where to specify these variables. Instead I added the
> >> following lines to the Java code (and commented out the previous lines):
> >>
> >> boolean useBoilerpipe = true;
> >> String boilerpipeExtractorName = "ArticleExtractor";
> >>
> >> Still, it is not working.
> >>
> >> Any ideas?
> >>
> >> Cheers,
> >> René
> >>
> >>
> >> On Jun 27, 2012, at 12:32 PM, Markus Jelsma wrote:
> >>
> >>> Hi René,
> >>>
> >>> It seems NUTCH-961-1.5-1.patch doesn't apply cleanly to the final 1.5
> >>> release at all; TikaParser.java has changed a bit between the patch and
> >>> the release of 1.5. Did you resolve the failed hunks? If so, are you
> >>> sure Tika is being used for (X)HTML pages? Nutch by default uses the old
> >>> parse-html plugin to parse those content types. Check your
> >>> parse-plugins.xml configuration.
> >>>
> >>> Cheers,
> >>> Markus
> >>>
> >>>
> >>> -----Original message-----
> >>>> From: Rene Nederhand <[email protected]>
> >>>> Sent: Wed 27-Jun-2012 11:59
> >>>> To: [email protected]
> >>>> Subject: Using Nutch with Boilerpipe
> >>>>
> >>>> Hi,
> >>>>
> >>>> I'm trying to index only the main content (main article) of various
> >>>> websites. For this, I'd like to use Boilerpipe with Nutch.
> >>>>
> >>>> Markus has been developing a patch (NUTCH-961) that does exactly that.
> >>>> Although the patch installs without problems, I am not sure how to
> >>>> set the necessary settings. Is there anyone who can shed some light on
> >>>> this?
> >>>>
> >>>> As I understand it, two variables have to be set:
> >>>>
> >>>> tika.boilerpipe = true
> >>>> tika.boilerpipe.extractor = "ArticleExtractor"
> >>>>
> >>>> I have tried to do this in a file conf/tika.config.file (is this still
> >>>> being used?) and in conf/nutch-default.xml as valid XML within a
> >>>> property field. Neither activated Boilerpipe. FYI: I am using Nutch
> >>>> 1.5.
> >>>>
> >>>> What should I do to get this thing going?
> >>>>
> >>>> Kind regards,
> >>>>
> >>>> René
> >>>>
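
For reference, the settings discussed in this thread can be sketched as a single nutch-site.xml fragment. This is only a minimal sketch, assuming a Nutch 1.5 build with the NUTCH-961 patch applied: the property names are the ones TikaParser.java reads in the quoted message, the extractor value is the one quoted there, and the surrounding <configuration> element is assumed to be the usual root of nutch-site.xml.

```xml
<!-- runtime/local/conf/nutch-site.xml (sketch; assumes the NUTCH-961 patch is applied) -->
<configuration>
  <!-- Switch on Boilerpipe extraction in the parse-tika plugin -->
  <property>
    <name>tika.boilerpipe</name>
    <value>true</value>
  </property>
  <!-- Extractor name as quoted from TikaParser.java in this thread -->
  <property>
    <name>tika.boilerpipe.extractor</name>
    <value>ArticleExtractor</value>
  </property>
</configuration>
```

As in the thread, this only takes effect if text/html and application/xhtml+xml are actually handled by parse-tika in parse-plugins.xml. The result can be compared with and without the setting using bin/nutch parsechecker -dumpText <url>.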

