Hi! I'm having trouble setting up the RegexURLNormalizer correctly. It should strip some query parameters, but only for one specific script: only URLs that contain "document.py" should be normalized.
Here is an example:

Input:
http://www.example.com/cgi-bin/category/document.py?Name=Alex&Art=en&Date=2000&Sort=1&Page=109&nr=16519&pos=1644&anz=1952&Blank=1.pdf

Output:
http://www.example.com/cgi-bin/category/document.py?Name=Alex&Art=en&nr=16519&Blank=1.pdf

Date, Sort, Page, pos and anz are the parameters to be stripped out.

I tried it with the following pattern:

([;_]?((?i)l|j|bv_)?((?i)date|sort|page|pos|anz)=.*?)(\?|&|#|$)

How can I tell Nutch to apply this regex only to URLs that contain "document.py"?

BR

-- 
Alexander Fahlke
Software Development
www.informera.de
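
One possible direction is to make "document.py" part of the pattern itself and capture everything up to the unwanted parameter, so that the rule cannot match URLs of other scripts. The sketch below tries that idea in Python rather than in Nutch configuration. The names PATTERN, DANGLING and strip_params are hypothetical, and the repeated application is an assumption of this sketch: a single substitution pass removes only one parameter here, so a Nutch rule built this way might need to be listed several times.

```python
import re

# Hypothetical rule: group 1 keeps "document.py?" plus any parameters that
# come before the one being dropped. URLs without "document.py" never match.
PATTERN = re.compile(
    r"(document\.py\?(?:[^#]*?&)?)(?:date|sort|page|pos|anz)=[^&#]*&?",
    re.IGNORECASE,
)

# Removes a "?" or "&" left dangling when the dropped parameter was the last one.
DANGLING = re.compile(r"(document\.py[^#]*?)[?&](?=#|$)", re.IGNORECASE)


def strip_params(url):
    """Drop Date/Sort/Page/pos/anz from document.py URLs only."""
    previous = None
    # Each pass removes at most one parameter, so repeat until stable.
    while previous != url:
        previous = url
        url = PATTERN.sub(r"\1", url)
    return DANGLING.sub(r"\1", url)


if __name__ == "__main__":
    print(strip_params(
        "http://www.example.com/cgi-bin/category/document.py"
        "?Name=Alex&Art=en&Date=2000&Sort=1&Page=109&nr=16519"
        "&pos=1644&anz=1952&Blank=1.pdf"
    ))
```

The regex syntax used here (non-capturing groups, lazy quantifiers, lookahead) also exists in java.util.regex, where the replacement would be written as $1 instead of \1.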

