Hi,
I'm trying to strip a parameter from URLs using the RegexURLNormalizer. I
added this to my nutch-site.xml:
<property>
<name>urlnormalizer.order</name>
<value>org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer</value>
</property>
<property>
<name>urlnormalizer.regex.file</name>
<value>regex-normalize.xml</value>
</property>
I also defined this rule in regex-normalize.xml:
<regex>
<pattern>(\?|&amp;)([;_]?((?i)l|j|bv_|ps_)?((?i)s|sid|IFLBSERVERID)=.*?)(\?|&amp;|#|$)</pattern>
<substitution>$1$5</substitution>
</regex>
(the goal is to strip the IFLBSERVERID parameter from the URL)
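To check the rule in isolation, here is a minimal sketch that applies the same pattern and substitution outside Nutch. It assumes the normalizer hands the pattern to java.util.regex and uses replaceAll semantics, which is my assumption and not something I have verified in the Nutch source. The class name StripParamSketch and the example.com URLs are made up for illustration. The pattern string uses a literal &, which is what the regex engine sees once the XML file has been parsed.

```java
import java.util.regex.Pattern;

public class StripParamSketch {
    // The rule's pattern as a Java string literal (backslashes doubled for Java).
    private static final Pattern RULE = Pattern.compile(
        "(\\?|&)([;_]?((?i)l|j|bv_|ps_)?((?i)s|sid|IFLBSERVERID)=.*?)(\\?|&|#|$)");

    // Same substitution as in the rule: keep the delimiter before and after the parameter.
    private static final String SUBSTITUTION = "$1$5";

    // Applies the rule to a single URL string and returns the rewritten URL.
    static String strip(String url) {
        return RULE.matcher(url).replaceAll(SUBSTITUTION);
    }

    public static void main(String[] args) {
        // Hypothetical test URLs: parameter first, parameter last, parameter absent.
        String[] urls = {
            "http://example.com/page?IFLBSERVERID=abc123&foo=bar",
            "http://example.com/page?foo=bar&IFLBSERVERID=abc123",
            "http://example.com/page?foo=bar"
        };
        for (String url : urls) {
            System.out.println(url + " -> " + strip(url));
        }
    }
}
```

Run on its own, this removes the parameter but keeps both delimiters, so the first URL becomes http://example.com/page?&foo=bar and the second becomes http://example.com/page?foo=bar& (the third is unchanged). That suggests the expression itself matches, and the leftover ?& or trailing & would need a further cleanup rule.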
The indexed documents still contain the parameter, so it looks to me as if
the RegexURLNormalizer is not being applied. Could this be related to
https://issues.apache.org/jira/browse/NUTCH-706?
Thanks and regards
Hannes
--
https://www.xing.com/profile/HannesCarl_Meyer
http://de.linkedin.com/in/hannescarlmeyer
http://twitter.com/hannescarlmeyer