Hi,

I'm trying to strip a parameter from URLs using the RegexURLNormalizer. I
added this to my nutch-site.xml:

    <property>
        <name>urlnormalizer.order</name>
        <value>org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer</value>
    </property>

    <property>
        <name>urlnormalizer.regex.file</name>
        <value>regex-normalize.xml</value>
    </property>

And defined this expression rule:

    <regex>
        <pattern>(\?|&amp;)([;_]?((?i)l|j|bv_|ps_)?((?i)s|sid|IFLBSERVERID)=.*?)(\?|&amp;|#|$)</pattern>
        <substitution>$1$5</substitution>
    </regex>

(to strip the parameter IFLBSERVERID from the URL)
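
To check the rule outside of Nutch, here is a minimal standalone Java sketch. It assumes the normalizer applies the pattern with java.util.regex and a replaceAll-style substitution. The class name StripParamSketch, the method strip, and the example.com URLs are made up for illustration and are not part of Nutch:

```java
import java.util.regex.Pattern;

public class StripParamSketch {
    // The rule's pattern after XML unescaping (&amp; becomes &).
    private static final Pattern RULE = Pattern.compile(
        "(\\?|&)([;_]?((?i)l|j|bv_|ps_)?((?i)s|sid|IFLBSERVERID)=.*?)(\\?|&|#|$)");

    // Keep the leading separator ($1) and the trailing delimiter ($5),
    // dropping the parameter name and value in between.
    static String strip(String url) {
        return RULE.matcher(url).replaceAll("$1$5");
    }

    public static void main(String[] args) {
        // Hypothetical sample URLs, one with the parameter first, one with it last.
        System.out.println(strip("http://example.com/page?IFLBSERVERID=abc123&id=5"));
        System.out.println(strip("http://example.com/page?id=5&IFLBSERVERID=abc123"));
    }
}
```

With these inputs the sketch prints http://example.com/page?&id=5 and http://example.com/page?id=5&, so the pattern itself does remove the parameter. A leftover "?&" or trailing "&" would need a separate cleanup rule.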

However, the indexed documents still contain the parameter, so imho the
RegexURLNormalizer is not being applied. Could this be related to
https://issues.apache.org/jira/browse/NUTCH-706 ?

Thanks and regards

Hannes

-- 

https://www.xing.com/profile/HannesCarl_Meyer
http://de.linkedin.com/in/hannescarlmeyer
http://twitter.com/hannescarlmeyer
