RE: RegEx URL Normalizer

Markus Jelsma Mon, 22 Oct 2012 00:07:19 -0700

Hi,

Check the bottom normalizer rule in the file linked below. It uses the
lookbehind operator to collapse double slashes, except the first two (the ones
right after the scheme).
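
To illustrate the lookbehind idea outside Nutch, here is a small Python sketch. The pattern `(?<!:)/{2,}` is an assumption about what such a rule looks like, not a verbatim copy of the template, and the helper name `collapse_slashes` is made up for this example. In the XML config file the `<` would have to be escaped as `&lt;`.

```python
import re

# Negative lookbehind: match a run of two or more slashes only when the
# character immediately before the run is not ":". This leaves the "//"
# of "http://" alone and collapses every later run to a single slash.
_DOUBLE_SLASH = re.compile(r"(?<!:)/{2,}")


def collapse_slashes(url):
    """Collapse repeated slashes, keeping the two after the scheme."""
    return _DOUBLE_SLASH.sub("/", url)


print(collapse_slashes("http://www.example.com//cgi-bin///document.py"))
# prints http://www.example.com/cgi-bin/document.py
```

The lookbehind consumes no characters, so the ":" itself is never part of the match and never gets replaced.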


Cheers,

http://svn.apache.org/viewvc/nutch/trunk/conf/regex-normalize.xml.template?view=markup
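
For the conditional case asked about in the quoted message below (strip parameters only when the URL also contains some marker string), a minimal Python sketch of the "check first, then cut" approach might look like this. It is not Nutch code: the function name `strip_params`, the `marker` argument, and the cleanup step are assumptions for illustration. The parameter names and the example URL are taken from the thread.

```python
import re

# Parameters to drop, as listed in the quoted example.
STRIP = ("Date", "Sort", "Page", "pos", "anz")

# Positive lookbehind: a listed name only counts when it directly follows
# "?" or "&", so a parameter such as "Homepage=" is not touched.
_PARAM = re.compile(r"(?<=[?&])(?:%s)=[^&#]*&?" % "|".join(STRIP))

# A "?" or "&" left dangling at the end of the URL or before a fragment.
_DANGLING = re.compile(r"[?&](?=#|$)")


def strip_params(url, marker="document.py"):
    """Remove the listed parameters, but only if `marker` occurs in the URL."""
    if marker not in url:
        return url
    return _DANGLING.sub("", _PARAM.sub("", url))


src = ("http://www.example.com/cgi-bin/category/document.py"
       "?Name=Alex&Art=en&Date=2000&Sort=1&Page=109"
       "&nr=16519&pos=1644&anz=1952&Blank=1.pdf")
print(strip_params(src))
```

Passing a domain name as `marker` would cover the case where the condition is a host rather than a script name.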
 
 
-----Original message-----
> From:Magnús Skúlason <[email protected]>
> Sent: Mon 22-Oct-2012 00:34
> To: [email protected]
> Cc: [email protected]; Markus Jelsma <[email protected]>
> Subject: Re: RegEx URL Normalizer
> 
> Hi,
> 
> I am interested in doing this as well, i.e. stripping parameters from a URL
> only if some other string is also found in it; in my case that string is a
> domain name. I am using Nutch 1.5.1, but I am unfamiliar with the look-behind
> operator.
> 
> Does anyone have a sample of how this is done?
> 
> best regards,
> Magnus
> 
> On Thu, Sep 8, 2011 at 12:14 PM, Alexander Fahlke
> <[email protected]> wrote:
> > Thanks guys!
> >
> > @Dinçer: This does not check if the URL contains "document.py". :(
> >
> > @Markus: Unfortunately I have to use nutch-1.2 so I decided to customize
> > RegexURLNormalizer. ;)
> >
> >   -->  regexNormalize(String urlString, String scope) { ...
> >
> >   It now does a simple, stupid check of whether urlString contains
> > "document.py" and then cuts out the unwanted parameters.
> >   I even made this configurable via nutch-site.xml.
> >
> >
> > Nutch 1.4 would be better for this. Maybe in the next project.
> >
> >
> > BR
> >
> > On Wed, Sep 7, 2011 at 2:34 PM, Dinçer Kavraal <[email protected]> wrote:
> >
> >> Hi Alexander,
> >>
> >> Would this one work? (I am far away from a Nutch installation to test)
> >>
> >> (?:[&?](?:Date|Sort|Page|pos|anz)=[^&?]+|([?&](?:Name|Art|Blank|nr)=[^&?]*))
> >>
> >> Don't forget to use &amp; instead of & in the regex.
> >>
> >> Best,
> >> Dinçer
> >>
> >>
> >> 2011/9/5 Alexander Fahlke <[email protected]>
> >>
> >>> Hi!
> >>>
> >>> I am having trouble setting up the RegExURLNormalizer correctly. It should
> >>> strip out some parameters for a specific script.
> >>> Only URLs that contain "document.py" should be normalized.
> >>>
> >>> Here is an example:
> >>>
> >>>  Input:
> >>>
> >>> http://www.example.com/cgi-bin/category/document.py?Name=Alex&Art=en&Date=2000&Sort=1&Page=109&nr=16519&pos=1644&anz=1952&Blank=1.pdf
> >>>  Output:
> >>>
> >>> http://www.example.com/cgi-bin/category/document.py?Name=Alex&Art=en&nr=16519&Blank=1.pdf
> >>>
> >>> Date, Sort, Page, pos, anz are the parameters to be stripped out.
> >>>
> >>> I tried it with the following setup:
> >>>
> >>>  ([;_]?((?i)l|j|bv_)?((?i)date|sort|page|pos|anz)=.*?)(\?|&|#|$)
> >>>
> >>>
> >>> How can I tell Nutch to use this regex only for pages with "document.py"?
> >>>
> >>>
> >>> BR
> >>>
> >>> --
> >>> Alexander Fahlke
> >>> Software Development
> >>> www.informera.de
> >>>
> >>
> >>
> >
> >
> > --
> > Alexander Fahlke
> > Software Development
> > www.informera.de
>
