Hi,

Check the bottom normalizer in the template linked below. It uses the lookbehind operator to collapse duplicate slashes, except for the two that follow the protocol.
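To make the idea concrete, here is a rough sketch in Python rather than in Nutch's XML configuration. The first pattern is only modelled on what the bottom rule of the template does; I have not copied it from the file, so treat it as an assumption. The second function covers the "only if another string is present" case from the quoted thread below with a plain substring check followed by a substitution. The names collapse_slashes and strip_params_if are hypothetical, and the parameter names come from the example URL in the thread. If you move a pattern like this into regex-normalize.xml, remember that `<` and `&` have to be written as `&lt;` and `&amp;` there.

```python
import re

# Assumed pattern: two or more slashes that are NOT directly preceded by ':'.
# The negative lookbehind (?<!:) leaves the "//" after the protocol alone.
DUPLICATE_SLASHES = re.compile(r"(?<!:)/{2,}")

# A listed parameter directly preceded by '?' or '&' (positive lookbehind),
# its value, and the '&' that follows it, if any.
STRIP_PARAMS = re.compile(r"(?<=[?&])(?:Date|Sort|Page|pos|anz)=[^&#]*&?")


def collapse_slashes(url: str) -> str:
    """Replace runs of slashes with one slash, keeping the protocol's '//'."""
    return DUPLICATE_SLASHES.sub("/", url)


def strip_params_if(url: str, marker: str = "document.py") -> str:
    """Strip the listed parameters, but only from URLs containing marker."""
    if marker not in url:
        return url
    stripped = STRIP_PARAMS.sub("", url)
    # Drop a '?' or '&' left dangling at the very end of the URL.
    # URLs with a '#fragment' are not handled in this sketch.
    return re.sub(r"[?&]$", "", stripped)
```

Calling strip_params_if on the example input from the thread should give the requested output, and a URL without "document.py" is returned unchanged.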
Cheers,

http://svn.apache.org/viewvc/nutch/trunk/conf/regex-normalize.xml.template?view=markup

-----Original message-----
> From:Magnús Skúlason <[email protected]>
> Sent: Mon 22-Oct-2012 00:34
> To: [email protected]
> Cc: [email protected]; Markus Jelsma <[email protected]>
> Subject: Re: RegEx URL Normalizer
>
> Hi,
>
> I am interested in doing this i.e. only strip out parameters from url
> if some other string is found as well, in my case it will be a domain
> name. I am using 1.5.1 but I am unfamiliar with the look-behind
> operator.
>
> Does anyone have a sample of how this is done?
>
> best regards,
> Magnus
>
> On Thu, Sep 8, 2011 at 12:14 PM, Alexander Fahlke
> <[email protected]> wrote:
> > Thanks guys!
> >
> > @Dinçer: This does not check if the URL contains "document.py". :(
> >
> > @Markus: Unfortunately I have to use nutch-1.2 so I decided to customize
> > RegexURLNormalizer. ;)
> >
> > --> regexNormalize(String urlString, String scope) { ...
> >
> > It now simple stupid checks if urlString contains "document.py" and then
> > cuts out the unwanted stuff.
> > I made this even configurable via nutch-site.xml.
> >
> >
> > Nutch 1.4 would be better for this. Maybe in the next project.
> >
> >
> > BR
> >
> > On Wed, Sep 7, 2011 at 2:34 PM, Dinçer Kavraal <[email protected]> wrote:
> >
> >> Hi Alexander,
> >>
> >> Would this one work? (I am far away from a Nutch installation to test)
> >>
> >> (?:[&?](?:Date|Sort|Page|pos|anz)=[^&?]+|([?&](?:Name|Art|Blank|nr)=[^&?]*))
> >>
> >> Don't forget to use &amp; instead of & in the regex.
> >>
> >> Best,
> >> Dinçer
> >>
> >>
> >> 2011/9/5 Alexander Fahlke <[email protected]>
> >>
> >>> Hi!
> >>>
> >>> I have problems with the right setup of the RegExURLNormalizer. It should
> >>> strip out some parameters for a specific script.
> >>> Only pages where "document.py" is present should be normalized.
> >>>
> >>> Here is an example:
> >>>
> >>> Input:
> >>>
> >>> http://www.example.com/cgi-bin/category/document.py?Name=Alex&Art=en&Date=2000&Sort=1&Page=109&nr=16519&pos=1644&anz=1952&Blank=1.pdf
> >>> Output:
> >>>
> >>> http://www.example.com/cgi-bin/category/document.py?Name=Alex&Art=en&nr=16519&Blank=1.pdf
> >>>
> >>> Date, Sort, Page, pos, anz are the parameters to be stripped out.
> >>>
> >>> I tried it with the following setup:
> >>>
> >>> ([;_]?((?i)l|j|bv_)?((?i)date|sort|page|pos|anz)=.*?)(\?|&|#|$)
> >>>
> >>>
> >>> How to tell nutch to use this regex only for pages with "document.py"?
> >>>
> >>>
> >>> BR
> >>>
> >>> --
> >>> Alexander Fahlke
> >>> Software Development
> >>> www.informera.de
> >>>
> >>
> >>
> >
> >
> >
> > --
> > Alexander Fahlke
> > Software Development
> > www.informera.de
>

