On Monday 05 September 2011 12:06:06 Alexander Fahlke wrote:
> Hi!
>
> I have problems with the right setup of the RegExURLNormalizer. It should
> strip out some parameters for a specific script.
> Only pages where "document.py" is present should be normalized.
>
> Here is an example:
>
> Input:
> http://www.example.com/cgi-bin/category/document.py?Name=Alex&Art=en&Date=2000&Sort=1&Page=109&nr=16519&pos=1644&anz=1952&Blank=1.pdf
> Output:
> http://www.example.com/cgi-bin/category/document.py?Name=Alex&Art=en&nr=16519&Blank=1.pdf
>
> Date, Sort, Page, pos, anz are the parameters to be stripped out.
>
> I tried it with the following setup:
>
> ([;_]?((?i)l|j|bv_)?((?i)date|sort|page|pos|anz)=.*?)(\?|&|#|$)
>
> How to tell nutch to use this regex only for pages with "document.py"?

You can modify the regex so that it only matches when document.py precedes the parameter, using a look-behind assertion. Nutch 1.4-dev uses java.util.regex instead of Apache ORO in the normalizer, so look-behind is supported there.

> BR

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
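
The look-behind suggestion above could be sketched roughly as follows. This is only an illustration in plain java.util.regex, not the regex from the thread and not a tested Nutch configuration: the class name `DocumentPyNormalizer`, the method `normalize`, and the 512-character bound are all assumptions made for the example. The bound is there because java.util.regex expects a look-behind to have a limited maximum length.

```java
import java.util.regex.Pattern;

public class DocumentPyNormalizer {

    // Hypothetical pattern (an assumption, not taken from the thread).
    // The look-behind requires "document.py?" earlier in the URL and makes
    // sure the parameter name starts right after '?' or '&'. The {0,512}
    // bound keeps the look-behind finite, as java.util.regex requires.
    private static final Pattern STRIP = Pattern.compile(
        "(?<=document\\.py\\?(?:[^#]{0,512}&)?)"
        + "(?i:date|sort|page|pos|anz)=[^&#]*&?");

    // Removes the listed parameters, but only from document.py URLs.
    static String normalize(String url) {
        return STRIP.matcher(url).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "http://www.example.com/cgi-bin/category/document.py"
            + "?Name=Alex&Art=en&Date=2000&Sort=1&Page=109&nr=16519"
            + "&pos=1644&anz=1952&Blank=1.pdf";
        // Prints the "Output" URL from the question:
        // http://www.example.com/cgi-bin/category/document.py?Name=Alex&Art=en&nr=16519&Blank=1.pdf
        System.out.println(normalize(input));
    }
}
```

If a stripped parameter is the last one in the query string, this sketch leaves a trailing `&` behind, so a separate cleanup rule would still be needed. When a pattern like this is placed in an XML rule file such as regex-normalize.xml, each `&` has to be written as `&amp;` and the Java string escaping (the doubled backslashes) is dropped.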

