Re: RegEx URL Normalizer

Markus Jelsma Wed, 07 Sep 2011 04:48:35 -0700


On Monday 05 September 2011 12:06:06 Alexander Fahlke wrote:
> Hi!
> 
> I have problems with the right setup of the RegExURLNormalizer. It should
> strip out some parameters for a specific script.
> Only pages where "document.py" is present should be normalized.
> 
> Here is an example:
> 
>   Input:
> http://www.example.com/cgi-bin/category/document.py?Name=Alex&Art=en&Date=2000&Sort=1&Page=109&nr=16519&pos=1644&anz=1952&Blank=1.pdf
>
>   Output:
> http://www.example.com/cgi-bin/category/document.py?Name=Alex&Art=en&nr=16519&Blank=1.pdf
> 
> Date, Sort, Page, pos, anz are the parameters to be stripped out.
> 
> I tried it with the following setup:
> 
>   ([;_]?((?i)l|j|bv_)?((?i)date|sort|page|pos|anz)=.*?)(\?|&|#|$)
> 
> 
> How to tell nutch to use this regex only for pages with "document.py"?


You can modify the regex so that it only matches when document.py precedes
the parameter, using a look-behind assertion. Nutch 1.4-dev uses
java.util.regex instead of Apache ORO in the normalizer, so look-behind is
supported there.
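
A rough sketch of what such a pattern could look like with plain
java.util.regex, outside of Nutch. The class and method names are made up for
illustration, and the {0,2000} bound is an arbitrary assumption: Java needs
look-behind to have a bounded length, so an unbounded .* will not compile
there. This is not your original pattern. It drops the [;_] and l|j|bv_
prefixes to keep the idea visible.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: strips selected query parameters,
// but only when "document.py?" appears earlier in the same URL.
final class DocumentPyNormalizer {

    private static final Pattern STRIP = Pattern.compile(
        "(?i)"                                  // case-insensitive, as in the original attempt
        + "(?<=[?&])"                           // parameter must start right after '?' or '&'
        + "(?<=document\\.py\\?[^#]{0,2000})"   // bounded look-behind: document.py? precedes it
        + "(?:date|sort|page|pos|anz)=[^&#]*"   // parameter name and its value
        + "&?");                                // also consume the separator that follows, if any

    static String normalize(String url) {
        // Note: if a stripped parameter is the last one, a trailing '&' or '?'
        // is left behind; this sketch does not clean that up.
        return STRIP.matcher(url).replaceAll("");
    }

    public static void main(String[] args) {
        String in = "http://www.example.com/cgi-bin/category/document.py"
            + "?Name=Alex&Art=en&Date=2000&Sort=1&Page=109&nr=16519"
            + "&pos=1644&anz=1952&Blank=1.pdf";
        System.out.println(normalize(in));
    }
}
```

URLs without document.py are left untouched because the second look-behind
never succeeds for them. If you put a pattern like this into the normalizer's
XML configuration, remember that the & characters have to be escaped there.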

> 
> 
> BR

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
