Hi, I am interested in doing the same thing, i.e. stripping parameters out of a URL only if some other string is also found in it; in my case that string will be a domain name. I am using Nutch 1.5.1, but I am unfamiliar with the look-behind operator.
Does anyone have a sample of how this is done?

Best regards,
Magnus

On Thu, Sep 8, 2011 at 12:14 PM, Alexander Fahlke <[email protected]> wrote:

> Thanks guys!
>
> @Dinçer: This does not check if the URL contains "document.py". :(
>
> @Markus: Unfortunately I have to use nutch-1.2 so I decided to customize
> RegexURLNormalizer. ;)
>
>   --> regexNormalize(String urlString, String scope) { ...
>
> It now simple stupid checks if urlString contains "document.py" and then
> cuts out the unwanted stuff.
> I made this is even configurable via nutch-site.xml.
>
> Nutch 1.4 would be better for this. Maybe in the next project.
>
> BR
>
> On Wed, Sep 7, 2011 at 2:34 PM, Dinçer Kavraal <[email protected]> wrote:
>
>> Hi Alexander,
>>
>> Would this one work? (I am far away from a Nutch installation to test)
>>
>> (?:[&?](?:Date|Sort|Page|pos|anz)=[^&?]+|([?&](?:Name|Art|Blank|nr)=[^&?]*))
>>
>> Don't forget to use &amp; instead of & in the regex.
>>
>> Best,
>> Dinçer
>>
>> 2011/9/5 Alexander Fahlke <[email protected]>
>>
>>> Hi!
>>>
>>> I have problems with the right setup of the RegExURLNormalizer. It should
>>> strip out some parameters for a specific script.
>>> Only pages where "document.py" is present should be normalized.
>>>
>>> Here is an example:
>>>
>>> Input:
>>>
>>> http://www.example.com/cgi-bin/category/document.py?Name=Alex&Art=en&Date=2000&Sort=1&Page=109&nr=16519&pos=1644&anz=1952&Blank=1.pdf
>>>
>>> Output:
>>>
>>> http://www.example.com/cgi-bin/category/document.py?Name=Alex&Art=en&nr=16519&Blank=1.pdf
>>>
>>> Date, Sort, Page, pos, anz are the parameters to be stripped out.
>>>
>>> I tried it with the following setup:
>>>
>>> ([;_]?((?i)l|j|bv_)?((?i)date|sort|page|pos|anz)=.*?)(\?|&|#|$)
>>>
>>> How to tell nutch to use this regex only for pages with "document.py"?
>>>
>>> BR
>>>
>>> --
>>> Alexander Fahlke
>>> Software Development
>>> www.informera.de
>>
>
> --
> Alexander Fahlke
> Software Development
> www.informera.de
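
For what it is worth, here is a minimal sketch of the "check for a marker string first, then strip" idea described in the thread, written as plain Java rather than as a Nutch plugin. The class name `ConditionalParamStripper`, the method `normalize`, and the `marker` argument are made up for illustration and are not part of Nutch; the parameter names and URLs come from Alexander's example. The marker could just as well be a domain name such as "www.example.com". The small look-behind `(?<=[?&])` only makes sure a parameter name starts right after `?` or `&`; the "does the URL contain the marker" test is done with an ordinary `contains` call, which may be easier to maintain than encoding it in the regex.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not a Nutch class: strips selected query parameters,
// but only from URLs that contain a given marker string.
final class ConditionalParamStripper {

    // A parameter to drop: its name must directly follow '?' or '&'
    // (fixed-length look-behind), then '=', then the value up to the next
    // '&' or '#'. A trailing '&' is consumed so no "&&" is left behind.
    private static final Pattern UNWANTED =
            Pattern.compile("(?<=[?&])(?:Date|Sort|Page|pos|anz)=[^&#]*&?");

    // A '?' or '&' left dangling at the end of the query string.
    private static final Pattern DANGLING = Pattern.compile("[?&]+(?=#|$)");

    static String normalize(String url, String marker) {
        // Leave the URL alone unless the marker (script name, domain, ...) is present.
        if (!url.contains(marker)) {
            return url;
        }
        String stripped = UNWANTED.matcher(url).replaceAll("");
        return DANGLING.matcher(stripped).replaceAll("");
    }

    public static void main(String[] args) {
        String in = "http://www.example.com/cgi-bin/category/document.py"
                + "?Name=Alex&Art=en&Date=2000&Sort=1&Page=109&nr=16519"
                + "&pos=1644&anz=1952&Blank=1.pdf";
        // Prints the "Output" URL from the original question.
        System.out.println(normalize(in, "document.py"));
    }
}
```

Inside a customized normalizer, a call like `normalize(urlString, "document.py")` would sit where the thread's `regexNormalize(String urlString, String scope)` does its check; how the marker is read from configuration is left open here.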

