RegEx URL Normalizer

Alexander Fahlke Mon, 05 Sep 2011 03:06:45 -0700

Hi!

I'm having trouble setting up the RegExURLNormalizer correctly. It should
strip certain query parameters from the URLs of one specific script:
only URLs that contain "document.py" should be normalized.


Here is an example:

  Input:
http://www.example.com/cgi-bin/category/document.py?Name=Alex&Art=en&Date=2000&Sort=1&Page=109&nr=16519&pos=1644&anz=1952&Blank=1.pdf
  Output:
http://www.example.com/cgi-bin/category/document.py?Name=Alex&Art=en&nr=16519&Blank=1.pdf

Date, Sort, Page, pos, anz are the parameters to be stripped out.

I tried it with the following setup:

  ([;_]?((?i)l|j|bv_)?((?i)date|sort|page|pos|anz)=.*?)(\?|&|#|$)


How can I tell Nutch to apply this regex only to URLs that contain "document.py"?
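
To make the intended behaviour concrete, here is a minimal sketch in plain
Java using java.util.regex. It is not Nutch configuration and uses no Nutch
API; the class name DocumentPyNormalizerSketch and the method normalize are
hypothetical, and it assumes that "present" means the string "document.py"
appears anywhere in the URL.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: it only illustrates the wanted result.
class DocumentPyNormalizerSketch {

    // One parameter from the strip list, directly after '?' or '&',
    // together with its value and the '&' that follows it (if any).
    private static final Pattern STRIP = Pattern.compile(
            "(?<=[?&])(?:date|sort|page|pos|anz)=[^&#]*&?",
            Pattern.CASE_INSENSITIVE);

    // A '?' or '&' left dangling at the end of the URL or before a fragment.
    private static final Pattern DANGLING = Pattern.compile("[?&](?=#|$)");

    static String normalize(String url) {
        // Assumption: the rule applies whenever "document.py" occurs in the URL.
        if (!url.contains("document.py")) {
            return url; // all other URLs are left untouched
        }
        String stripped = STRIP.matcher(url).replaceAll("");
        return DANGLING.matcher(stripped).replaceAll("");
    }
}
```

Calling normalize with the example input above should give the example
output, while a URL without "document.py" is returned unchanged.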


BR

-- 
Alexander Fahlke
Software Development
www.informera.de
