Re: RegEx URL Normalizer

Alexander Fahlke Thu, 08 Sep 2011 05:15:22 -0700

Thanks guys!

@Dinçer: This does not check if the URL contains "document.py". :(


@Markus: Unfortunately I have to use nutch-1.2 so I decided to customize
RegexURLNormalizer. ;)

  -->  regexNormalize(String urlString, String scope) { ...

  It now simply checks whether urlString contains "document.py" and, if so,
cuts out the unwanted parameters.
  I even made this configurable via nutch-site.xml.
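
  Roughly, the idea looks like the sketch below. This is not the actual patch,
just a minimal standalone illustration: the class name DocumentPyParamStripper,
the method name strip, and the hard-coded marker and parameter list are
assumptions made for the example (in the real setup those values come from
nutch-site.xml). It also ignores URL fragments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Nutch: illustrates the
// "contains document.py, then drop some parameters" check.
class DocumentPyParamStripper {

    // Assumed defaults for the example; meant to be configurable.
    static final String MARKER = "document.py";
    static final Set<String> UNWANTED =
        new HashSet<>(Arrays.asList("Date", "Sort", "Page", "pos", "anz"));

    static String strip(String urlString) {
        int q = urlString.indexOf('?');
        // Leave URLs alone unless they hit the script and have a query string.
        if (!urlString.contains(MARKER) || q < 0) {
            return urlString;
        }
        String base = urlString.substring(0, q);
        String query = urlString.substring(q + 1);

        // Keep every name=value pair whose name is not in the unwanted set.
        List<String> kept = new ArrayList<>();
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            if (!UNWANTED.contains(name)) {
                kept.add(pair);
            }
        }
        return kept.isEmpty() ? base : base + "?" + String.join("&", kept);
    }

    public static void main(String[] args) {
        System.out.println(strip(
            "http://www.example.com/cgi-bin/category/document.py"
            + "?Name=Alex&Art=en&Date=2000&Sort=1&Page=109"
            + "&nr=16519&pos=1644&anz=1952&Blank=1.pdf"));
    }
}
```

  Splitting the query string instead of using one big regex keeps the "?" in
place even when the first parameter is one of the stripped ones.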


Nutch 1.4 would be better for this. Maybe in the next project.


BR

On Wed, Sep 7, 2011 at 2:34 PM, Dinçer Kavraal <[email protected]> wrote:

> Hi Alexander,
>
> Would this one work? (I am far away from a Nutch installation to test)
>
> (?:[&?](?:Date|Sort|Page|pos|anz)=[^&?]+|([?&](?:Name|Art|Blank|nr)=[^&?]*))
>
> Don't forget to use &amp; instead of & in the regex.
>
> Best,
> Dinçer
>
>
> 2011/9/5 Alexander Fahlke <[email protected]>
>
>> Hi!
>>
>> I have problems with the right setup of the RegExURLNormalizer. It should
>> strip out some parameters for a specific script.
>> Only pages where "document.py" is present should be normalized.
>>
>> Here is an example:
>>
>>  Input:
>>
>> http://www.example.com/cgi-bin/category/document.py?Name=Alex&Art=en&Date=2000&Sort=1&Page=109&nr=16519&pos=1644&anz=1952&Blank=1.pdf
>>  Output:
>>
>> http://www.example.com/cgi-bin/category/document.py?Name=Alex&Art=en&nr=16519&Blank=1.pdf
>>
>> Date, Sort, Page, pos, anz are the parameters to be stripped out.
>>
>> I tried it with the following setup:
>>
>>  ([;_]?((?i)l|j|bv_)?((?i)date|sort|page|pos|anz)=.*?)(\?|&|#|$)
>>
>>
>> How can I tell Nutch to use this regex only for pages with "document.py"?
>>
>>
>> BR
>>
>> --
>> Alexander Fahlke
>> Software Development
>> www.informera.de
>>
>
>


-- 
Alexander Fahlke
Software Development
www.informera.de
