Regular expressions in regex-urlfilter.txt

Jose Marcio Martins da Cruz Fri, 01 Jul 2016 02:26:05 -0700


Hello,


I'm trying to filter some URLs fetched from an "ugly" web server.

For some reason it falls into an infinite loop and generates URLs like:

http://server/something/xxx/xxx/xxx/xxx/.../xxx/xxx/toto

where "xxx" isn't static. And since the returned pages differ slightly,
they're not detected as duplicates.

So I tried something like the rule that appears in regex-urlfilter.txt:

-/([^/]+)/\1/\1/

with many variants, but none of them works. Any hint on how to block these
URLs (other than enumerating all the bad ones) is welcome.
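
To make the behaviour of the rule above concrete, here is a minimal sketch in
Python. It assumes that Python's re.search behaves like the find-style matching
Nutch applies to each rule, which should hold for these simple patterns. The
sample URLs, the is_excluded helper, the MAX_DEPTH threshold and the TOO_DEEP
pattern are all hypothetical and not part of my configuration. The sketch
suggests that the backreference rule only fires when the very same segment
repeats three times in a row, and that a depth limit might be a fallback when
the segments differ.

```python
import re

# Rule body from the post, without the leading "-" that marks exclusion.
REPEATED_SEGMENT = re.compile(r"/([^/]+)/\1/\1/")

# Hypothetical fallback (an assumption, not from the post): exclude any URL
# whose path has more than MAX_DEPTH segments, whatever they contain.
MAX_DEPTH = 8  # illustrative threshold only
TOO_DEEP = re.compile(r"^https?://[^/]+(/[^/]+){%d,}" % (MAX_DEPTH + 1))


def is_excluded(url, pattern):
    """Return True if the pattern is found anywhere in the URL."""
    return pattern.search(url) is not None


# Made-up sample URLs for illustration.
same_segment = "http://server/something/xxx/xxx/xxx/xxx/toto"
varying_segment = "http://server/something/a1/b2/c3/d4/toto"
deep_url = "http://server/" + "/".join("s%d" % i for i in range(10))

for url in (same_segment, varying_segment, deep_url):
    print(url,
          "repeat-rule:", is_excluded(url, REPEATED_SEGMENT),
          "depth-rule:", is_excluded(url, TOO_DEEP))
```

If the depth-based idea is usable, the corresponding line in
regex-urlfilter.txt would presumably be the same pattern prefixed with "-",
but I have not verified that against Nutch 1.12.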

I'm using Nutch 1.12 and Solr 6.1.0

Regards,

José-Marcio

--

 Sent from my typewriter.
 ---------------------------------------------------------------
  Spam : Classement statistique de messages électroniques -
         Une approche pragmatique
  At Amazon.fr: http://amzn.to/LEscRu or http://bit.ly/SpamJM
 ---------------------------------------------------------------
 Jose Marcio MARTINS DA CRUZ            http://www.j-chkmail.org
 Ecole des Mines de Paris                   http://bit.ly/SpamJM
 60, bd Saint Michel                      75272 - PARIS CEDEX 06
