Hello Jose Marcio!

You mean there is absolutely no self-repeating pattern anywhere in the URL? If 
there really is none, you are in trouble! Nutch URL filters don't operate in the 
context of the page the URL was found on, nor do they operate on groups of URLs.
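
For reference, the backreference trick you tried only helps when a path segment 
literally repeats. If I remember the stock conf/regex-urlfilter.txt correctly 
(please check your copy), the shipped loop-breaking rule looks roughly like:

  # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
  -.*(/[^/]+)/[^/]+\1/[^/]+\1/

Since your xxx segments all differ, the \1 backreference has nothing to latch 
onto, so no variant of that rule will match.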

The easiest approach is to limit the URL length to 512 characters or whatever 
arbitrary number you prefer. You can also use the scoring-depth plugin to limit 
the crawl depth from the host's root.
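
A rough sketch of both, assuming the stock urlfilter-regex plugin (Java regular 
expressions, rules tried top to bottom, first match decides) and the 
scoring-depth plugin's scoring.depth.max property; I'm writing the property name 
and limits from memory, so verify them against your nutch-default.xml:

  # regex-urlfilter.txt: reject any URL longer than 512 characters
  -.{513,}

  <!-- nutch-site.xml: add scoring-depth to your existing plugin.includes
       value and cap how deep the crawl may go from the host's root -->
  <property>
    <name>scoring.depth.max</name>
    <value>10</value>
  </property>

If your regex-urlfilter.txt ends with the usual catch-all +. line, the length 
rule has to sit above it, otherwise it is never reached.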

If it is a spider trap, neither solution works well and you'd need a spider 
trap detector, which is hard to build.

Markus

-----Original message-----
> From: Jose Marcio Martins da Cruz <[email protected]>
> Sent: Friday 1st July 2016 11:25
> To: [email protected]
> Subject: Regular expressions in regex-urlfilter.txt
> 
> 
> Hello,
> 
> I'm trying to filter some URLs fetched from an "ugly" web server.
> 
> For some reason it falls into some infinite loop and generates URLs like:
> 
> http://server/something/xxx/xxx/xxx/xxx/.../xxx/xxx/toto
> 
> where "xxx" isn't static... And as the returned pages have some difference, 
> they're not detected as duplicate.
> 
> So, I tried to do something like what appears in regex-urlfilter.txt:
> 
> -/([^/]+)/\1/\1/
> 
> with many variants, but none works. Any hint on how to block this (other than 
> enumerating all bad URLs) is welcome.
> 
> I'm using Nutch 1.12 and Solr 6.1.0
> 
> Regards,
> 
> José-Marcio
> 
> -- 
> 
>   Sent from my typewriter.
>   ---------------------------------------------------------------
>    Spam : Classement statistique de messages électroniques -
>           Une approche pragmatique
>    At Amazon.fr: http://amzn.to/LEscRu or http://bit.ly/SpamJM
>   ---------------------------------------------------------------
>   Jose Marcio MARTINS DA CRUZ            http://www.j-chkmail.org
>   Ecole des Mines de Paris                   http://bit.ly/SpamJM
>   60, bd Saint Michel                      75272 - PARIS CEDEX 06
> 
> 
