Hi,

Have you tried using:

*./nutch plugin urlnormalizer-regex org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer http://www.myinputurl.com*

That should help you find where the problem is coming from.
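If it helps to rule out the expression itself, the rule can also be tried outside Nutch with plain java.util.regex. This is only a rough sketch: the class name NormalizerRuleCheck, the method applyRule and the example.com URL are made up for illustration, and it does not go through RegexURLNormalizer or read regex-normalize.xml, so it says nothing about whether the plugin is actually picking up the file.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone check, not part of Nutch.
public class NormalizerRuleCheck {

    // Pattern and substitution copied from the rule quoted in this thread.
    static final Pattern RULE = Pattern.compile(
        "(\\?|&)([;_]?((?i)l|j|bv_|ps_)?((?i)s|sid|IFLBSERVERID)=.*?)(\\?|&|#|$)");
    static final String SUBSTITUTION = "$1$5";

    // Applies the single rule to a URL and returns the rewritten string.
    static String applyRule(String url) {
        return RULE.matcher(url).replaceAll(SUBSTITUTION);
    }

    public static void main(String[] args) {
        // Made-up sample URL; pass a real one as the first argument instead.
        String url = args.length > 0
            ? args[0]
            : "http://www.example.com/page?IFLBSERVERID=abc123&id=5";
        System.out.println(applyRule(url));
        // prints http://www.example.com/page?&id=5 for the sample URL
    }
}
```

With the sample URL the parameter is removed but the delimiters on either side are kept ("?&"), so a further rule would be needed to tidy that up. If the parameter disappears here but not in the index, the expression is probably fine and the configuration or job file is the more likely place to look.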
Are you running in distributed mode? Did you generate a new job file?

J.

On 24 June 2010 11:18, Hannes Carl Meyer <[email protected]> wrote:

> Hi,
>
> I'm trying to strip a parameter from URLs using the RegexURLNormalizer. I
> added this to my nutch-site.xml:
>
> <property>
>   <name>urlnormalizer.order</name>
>   <value>org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer</value>
> </property>
>
> <property>
>   <name>urlnormalizer.regex.file</name>
>   <value>regex-normalize.xml</value>
> </property>
>
> And defined this expression rule:
>
> <regex>
>   <pattern>(\?|&)([;_]?((?i)l|j|bv_|ps_)?((?i)s|sid|IFLBSERVERID)=.*?)(\?|&|#|$)</pattern>
>   <substitution>$1$5</substitution>
> </regex>
>
> (to strip the parameter IFLBSERVERID from the URL)
>
> The indexed documents are still containing the parameter and imho the
> RegexURLNormalizer does not work. Is it something with:
> https://issues.apache.org/jira/browse/NUTCH-706 ?
>
> Thanks and regards
>
> Hannes
>
> --
>
> https://www.xing.com/profile/HannesCarl_Meyer
> http://de.linkedin.com/in/hannescarlmeyer
> http://twitter.com/hannescarlmeyer

--
DigitalPebble Ltd
Open Source Solutions for Text Engineering
http://www.digitalpebble.com

