Hi,

Have you tried running:

./nutch plugin urlnormalizer-regex org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer http://www.myinputurl.com

That should help you find where the problem is coming from.
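
To check the rule itself outside Nutch, something like the following could be used. It is only a sketch: the class name RegexRuleCheck and the example URLs are made up, and it assumes that `&amp;` in regex-normalize.xml reaches the normalizer as a plain `&` after XML parsing and that the rule is applied much like java.util.regex's replaceAll.

```java
import java.util.regex.Pattern;

// Hypothetical standalone check, not part of Nutch.
public class RegexRuleCheck {

    // Pattern from the quoted rule, with &amp; written as & (assumed XML decoding).
    private static final Pattern RULE = Pattern.compile(
        "(\\?|&)([;_]?((?i)l|j|bv_|ps_)?((?i)s|sid|IFLBSERVERID)=.*?)(\\?|&|#|$)");

    // Substitution from the quoted rule: keep the leading and trailing delimiters.
    private static final String SUBSTITUTION = "$1$5";

    // Applies the rule to every match in the URL.
    public static String normalize(String url) {
        return RULE.matcher(url).replaceAll(SUBSTITUTION);
    }

    public static void main(String[] args) {
        // Made-up URLs, used only to exercise the pattern.
        String[] urls = {
            "http://www.example.com/page?IFLBSERVERID=abc123&x=1",
            "http://www.example.com/page?x=1&IFLBSERVERID=abc123",
            "http://www.example.com/page?x=1"
        };
        for (String url : urls) {
            System.out.println(url + " -> " + normalize(url));
        }
    }
}
```

If the parameter is removed here but not in the crawl, the expression is probably fine and the problem is more likely in which configuration the job actually picks up. Note that with the substitution $1$5 a leftover such as "?&" stays in the URL and would need its own cleanup rule.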

Are you running in distributed mode? Did you generate a new job file?

J.

On 24 June 2010 11:18, Hannes Carl Meyer <[email protected]> wrote:

> Hi,
>
> I'm trying to strip a parameter from URLs using the RegexURLNormalizer. I
> added this to my nutch-site.xml:
>
>    <property>
>        <name>urlnormalizer.order</name>
>        <value>org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer</value>
>    </property>
>
>    <property>
>        <name>urlnormalizer.regex.file</name>
>        <value>regex-normalize.xml</value>
>    </property>
>
> And defined this expression rule:
>
> <regex>
>   <pattern>(\?|&amp;)([;_]?((?i)l|j|bv_|ps_)?((?i)s|sid|IFLBSERVERID)=.*?)(\?|&amp;|#|$)</pattern>
>   <substitution>$1$5</substitution>
> </regex>
>
> (to strip the parameter IFLBSERVERID from the URL)
>
> The indexed documents still contain the parameter, so imho the
> RegexURLNormalizer is not working. Is this related to
> https://issues.apache.org/jira/browse/NUTCH-706 ?
>
> Thanks and regards
>
> Hannes
>
> --
>
> https://www.xing.com/profile/HannesCarl_Meyer
> http://de.linkedin.com/in/hannescarlmeyer
> http://twitter.com/hannescarlmeyer
>



-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com
