Hi,

I am interested in doing the same thing, i.e. stripping parameters from a
URL only if some other string is also present; in my case that string will
be a domain name. I am using Nutch 1.5.1, but I am unfamiliar with the
look-behind operator.

Does anyone have a sample of how this is done?

best regards,
Magnus

On Thu, Sep 8, 2011 at 12:14 PM, Alexander Fahlke
<[email protected]> wrote:
> Thanks guys!
>
> @Dinçer: This does not check if the URL contains "document.py". :(
>
> @Markus: Unfortunately I have to use nutch-1.2 so I decided to customize
> RegexURLNormalizer. ;)
>
>   -->  regexNormalize(String urlString, String scope) { ...
>
>   It now simply (and stupidly) checks whether urlString contains
> "document.py" and then cuts out the unwanted stuff.
>   I even made this configurable via nutch-site.xml.
>
>
> Nutch 1.4 would be better for this. Maybe in the next project.
>
>
> BR
>
> On Wed, Sep 7, 2011 at 2:34 PM, Dinçer Kavraal <[email protected]> wrote:
>
>> Hi Alexander,
>>
>> Would this one work? (I am far away from a Nutch installation to test)
>>
>> (?:[&?](?:Date|Sort|Page|pos|anz)=[^&?]+|([?&](?:Name|Art|Blank|nr)=[^&?]*))
>>
>> Don't forget to write &amp; instead of & in the regex, since the rules
>> file is XML.
>>
>> Best,
>> Dinçer
>>
>>
>> 2011/9/5 Alexander Fahlke <[email protected]>
>>
>>> Hi!
>>>
>>> I am having trouble setting up the RegexURLNormalizer correctly. It should
>>> strip out some parameters for a specific script.
>>> Only pages where "document.py" is present should be normalized.
>>>
>>> Here is an example:
>>>
>>>  Input:
>>>
>>> http://www.example.com/cgi-bin/category/document.py?Name=Alex&Art=en&Date=2000&Sort=1&Page=109&nr=16519&pos=1644&anz=1952&Blank=1.pdf
>>>  Output:
>>>
>>> http://www.example.com/cgi-bin/category/document.py?Name=Alex&Art=en&nr=16519&Blank=1.pdf
>>>
>>> Date, Sort, Page, pos, anz are the parameters to be stripped out.
>>>
>>> I tried it with the following setup:
>>>
>>>  ([;_]?((?i)l|j|bv_)?((?i)date|sort|page|pos|anz)=.*?)(\?|&|#|$)
>>>
>>>
>>> How can I tell Nutch to use this regex only for pages with "document.py"?
>>>
>>>
>>> BR
>>>
>>> --
>>> Alexander Fahlke
>>> Software Development
>>> www.informera.de
>>>
>>
>>
>
>
> --
> Alexander Fahlke
> Software Development
> www.informera.de
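
For reference, here is a minimal sketch of the idea discussed in this thread:
check for a marker string first, then strip the unwanted parameters. It is
plain java.util.regex, not Nutch code and not the actual customization
described above. The class name ConditionalParamStripper, the MARKER constant
and the normalize() helper are hypothetical; the marker could just as well be
a domain name. The fixed-length look-behind (?<=[?&]) only asserts that a
parameter name directly follows "?" or "&" without consuming that separator,
so consecutive unwanted parameters are all matched.

```java
import java.util.regex.Pattern;

class ConditionalParamStripper {
    // Assumption: the string that must be present before anything is stripped.
    // For the domain-name case this could be e.g. "www.example.com".
    private static final String MARKER = "document.py";

    // An unwanted parameter directly after '?' or '&' (look-behind, not consumed),
    // its value, and the '&' that follows it, if any.
    private static final Pattern UNWANTED =
            Pattern.compile("(?<=[?&])(?:Date|Sort|Page|pos|anz)=[^&#]*&?");

    // A '?' or '&' left dangling at the end after stripping.
    private static final Pattern DANGLING = Pattern.compile("[?&]$");

    static String normalize(String url) {
        if (!url.contains(MARKER)) {
            return url; // leave all other URLs untouched
        }
        String stripped = UNWANTED.matcher(url).replaceAll("");
        return DANGLING.matcher(stripped).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "http://www.example.com/cgi-bin/category/document.py"
                + "?Name=Alex&Art=en&Date=2000&Sort=1&Page=109"
                + "&nr=16519&pos=1644&anz=1952&Blank=1.pdf";
        System.out.println(normalize(input));
    }
}
```

Run on the example input from the original question, this should print the
expected output URL with only Name, Art, nr and Blank left. The sketch does
not handle a "#fragment" after a stripped trailing parameter, and if the
pattern is moved into an XML rules file, each & has to be written as &amp;.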
