Yep, that did not work, although the crawl process displays "URL normalizing: true". Also, bin/nutch plugin ... does not work!

On Thu, Jun 24, 2010 at 3:06 PM, Julien Nioche <[email protected]> wrote:

> tried ant clean job?
>
>> I'm using Nutch 1.1 and starting an intranet crawl (jobtracker is local).
>> When executing bin/nutch plugin ... I'm getting a "Plugin
>> 'urlnormalizer-regex' not present or inactive.". conf/nutch-site.xml
>> contains the property plugin.includes including urlnormalizer-regex.
>>
>> Starting the RegexURLNormalizer from within Eclipse is fine and it is
>> doing its job.
>>
>> Regards
>>
>> Hannes
>>
>> On Thu, Jun 24, 2010 at 12:46 PM, Julien Nioche <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> Have you tried using :
>>> *./nutch plugin urlnormalizer-regex
>>> org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
>>> http://www.myinputurl.com*
>>> that should help finding where the problem is coming from.
>>>
>>> Are you running in distributed mode? Did you generate a new job file?
>>>
>>> J.
>>>
>>> On 24 June 2010 11:18, Hannes Carl Meyer <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm trying to strip a parameter from URLs using the RegexURLNormalizer.
>>>> I added this to my nutch-site.xml:
>>>>
>>>> <property>
>>>>   <name>urlnormalizer.order</name>
>>>>   <value>org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer</value>
>>>> </property>
>>>>
>>>> <property>
>>>>   <name>urlnormalizer.regex.file</name>
>>>>   <value>regex-normalize.xml</value>
>>>> </property>
>>>>
>>>> And defined this expression rule:
>>>>
>>>> <regex>
>>>>   <pattern>(\?|&)([;_]?((?i)l|j|bv_|ps_)?((?i)s|sid|IFLBSERVERID)=.*?)(\?|&|#|$)</pattern>
>>>>   <substitution>$1$5</substitution>
>>>> </regex>
>>>>
>>>> (to strip the parameter IFLBSERVERID from the URL)
>>>>
>>>> The indexed documents are still containing the parameter and imho the
>>>> RegexURLNormalizer does not work. Is it something with:
>>>> https://issues.apache.org/jira/browse/NUTCH-706 ?
>>>>
>>>> Thanks and regards
>>>>
>>>> Hannes
>>>>
>>>> --
>>>>
>>>> https://www.xing.com/profile/HannesCarl_Meyer
>>>> http://de.linkedin.com/in/hannescarlmeyer
>>>> http://twitter.com/hannescarlmeyer
>>>
>>> --
>>> DigitalPebble Ltd
>>>
>>> Open Source Solutions for Text Engineering
>>> http://www.digitalpebble.com
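
One way to separate a plugin-loading problem from a problem with the rule itself is to run the quoted pattern and substitution through plain java.util.regex, outside of Nutch. The following is only a minimal sketch: the class name NormalizeRuleSketch and the example URLs are made up for illustration, and it assumes the normalizer applies each rule roughly like Matcher.replaceAll, which I have not verified against the Nutch 1.1 source here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical standalone check; this class is not part of Nutch.
class NormalizeRuleSketch {

    // Pattern and substitution copied from the rule quoted in this thread.
    static final Pattern RULE = Pattern.compile(
        "(\\?|&)([;_]?((?i)l|j|bv_|ps_)?((?i)s|sid|IFLBSERVERID)=.*?)(\\?|&|#|$)");
    static final String SUBSTITUTION = "$1$5";

    // Assumption: one replaceAll pass per rule, as a stand-in for the normalizer.
    static String normalize(String url) {
        Matcher matcher = RULE.matcher(url);
        return matcher.replaceAll(SUBSTITUTION);
    }

    public static void main(String[] args) {
        // Made-up intranet URLs, not taken from the original crawl.
        String[] urls = {
            "http://intranet.example/page?id=1&IFLBSERVERID=abc123&x=2",
            "http://intranet.example/doc?IFLBSERVERID=abc123",
            "http://intranet.example/page?id=1"
        };
        for (String url : urls) {
            System.out.println(url + " -> " + normalize(url));
        }
    }
}
```

With the rule as written, the parameter is removed but both delimiters are kept, so the first URL comes out as http://intranet.example/page?id=1&&x=2 and the second as http://intranet.example/doc? with a trailing question mark. If this sketch strips the parameter but the crawl does not, that would point towards the plugin not being loaded rather than towards the expression.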

