Hi Alexander,

Would this one work? (I am far away from a Nutch installation to test it.)

(?:[&?](?:Date|Sort|Page|pos|anz)=[^&?]+|([?&](?:Name|Art|Blank|nr)=[^&?]*))
Don't forget to use &amp; instead of & in the regex, since the pattern goes into an XML file.

Best,
Dinçer

2011/9/5 Alexander Fahlke <[email protected]>

> Hi!
>
> I have problems with the right setup of the RegExURLNormalizer. It should
> strip out some parameters for a specific script.
> Only pages where "document.py" is present should be normalized.
>
> Here is an example:
>
> Input:
>
> http://www.example.com/cgi-bin/category/document.py?Name=Alex&Art=en&Date=2000&Sort=1&Page=109&nr=16519&pos=1644&anz=1952&Blank=1.pdf
>
> Output:
>
> http://www.example.com/cgi-bin/category/document.py?Name=Alex&Art=en&nr=16519&Blank=1.pdf
>
> Date, Sort, Page, pos, anz are the parameters to be stripped out.
>
> I tried it with the following setup:
>
> ([;_]?((?i)l|j|bv_)?((?i)date|sort|page|pos|anz)=.*?)(\?|&|#|$)
>
>
> How to tell nutch to use this regex only for pages with "document.py"?
>
>
> BR
>
> --
> Alexander Fahlke
> Software Development
> www.informera.de
>
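
For anyone who wants to try the suggested regex without a Nutch installation, here is a minimal sketch in Python. It assumes that each match is replaced by capture group 1 (so the "keep" alternative is written back and the "strip" alternative is replaced by nothing), and that Python's `re` module behaves like Java's regex engine for this particular pattern. The `normalize` function and its "document.py" substring check are hypothetical helpers for illustration only; they are not part of Nutch.

```python
import re

# Regex from the reply: the first alternative matches parameters to drop,
# the second alternative captures parameters to keep in group 1.
PATTERN = re.compile(
    r"(?:[&?](?:Date|Sort|Page|pos|anz)=[^&?]+"
    r"|([?&](?:Name|Art|Blank|nr)=[^&?]*))"
)


def normalize(url: str) -> str:
    # Hypothetical guard (an assumption, not a Nutch feature):
    # leave every URL that does not contain "document.py" untouched.
    if "document.py" not in url:
        return url
    # Replace each match with group 1, or with nothing when group 1 is unset.
    return PATTERN.sub(lambda m: m.group(1) or "", url)


example = (
    "http://www.example.com/cgi-bin/category/document.py"
    "?Name=Alex&Art=en&Date=2000&Sort=1&Page=109&nr=16519"
    "&pos=1644&anz=1952&Blank=1.pdf"
)
print(normalize(example))
```

One caveat with this approach: if a stripped parameter comes first in the query string (for example "?Date=2000&Name=Alex"), its leading "?" is removed along with it, so the result would start with "&Name=Alex" and need a further fix-up.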

