Re: RegEx URL Normalizer

Dinçer Kavraal Wed, 07 Sep 2011 05:35:16 -0700

Hi Alexander,

Would this one work? (I don't have a Nutch installation at hand to test it.)
(?:[&?](?:Date|Sort|Page|pos|anz)=[^&?]+|([?&](?:Name|Art|Blank|nr)=[^&?]*))


Don't forget to write &amp; instead of & in the regex, since the pattern goes into an XML file.
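
For anyone who wants to try the pattern outside Nutch, here is a minimal Python sketch. It assumes the rule's substitution is the first capture group (written $1 in Java regex syntax), so kept parameters are written back and stripped ones vanish. The function name normalize and the substring check for "document.py" are hypothetical stand-ins for illustration, not Nutch configuration.

```python
import re

# Pattern from the message above, unchanged. The unwanted parameters are
# matched without a capture group; the wanted ones are captured as group 1.
PATTERN = re.compile(
    r"(?:[&?](?:Date|Sort|Page|pos|anz)=[^&?]+"
    r"|([?&](?:Name|Art|Blank|nr)=[^&?]*))"
)


def normalize(url: str) -> str:
    """Strip Date/Sort/Page/pos/anz, but only for document.py URLs."""
    # Hypothetical guard: a plain substring check stands in for whatever
    # mechanism restricts the rule to document.py pages.
    if "document.py" not in url:
        return url
    # Replace each match with group 1. Kept parameters are written back;
    # stripped parameters leave group 1 unset, so they become "".
    return PATTERN.sub(lambda m: m.group(1) or "", url)


if __name__ == "__main__":
    example = (
        "http://www.example.com/cgi-bin/category/document.py"
        "?Name=Alex&Art=en&Date=2000&Sort=1&Page=109"
        "&nr=16519&pos=1644&anz=1952&Blank=1.pdf"
    )
    print(normalize(example))
```

One caveat with this approach: if a stripped parameter comes first in the query string (for example ?Date=2000&Name=Alex), the leading ? is removed along with it and the result starts with &Name=Alex, so the pattern may need an extra rule for that case.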

Best,
Dinçer


2011/9/5 Alexander Fahlke <[email protected]>

> Hi!
>
> I have problems with the right setup of the RegExURLNormalizer. It should
> strip out some parameters for a specific script.
> Only pages where "document.py" is present should be normalized.
>
> Here is an example:
>
>  Input:
>
> http://www.example.com/cgi-bin/category/document.py?Name=Alex&Art=en&Date=2000&Sort=1&Page=109&nr=16519&pos=1644&anz=1952&Blank=1.pdf
>  Output:
>
> http://www.example.com/cgi-bin/category/document.py?Name=Alex&Art=en&nr=16519&Blank=1.pdf
>
> Date, Sort, Page, pos, anz are the parameters to be stripped out.
>
> I tried it with the following setup:
>
>  ([;_]?((?i)l|j|bv_)?((?i)date|sort|page|pos|anz)=.*?)(\?|&|#|$)
>
>
> How can I tell Nutch to use this regex only for pages with "document.py"?
>
>
> BR
>
> --
> Alexander Fahlke
> Software Development
> www.informera.de
>
