Hello,
I'm trying to filter some URLs fetched from an "ugly" web server. For some reason it falls into an infinite loop and generates URLs like:

http://server/something/xxx/xxx/xxx/xxx/.../xxx/xxx/toto

where "xxx" isn't static. And as the returned pages differ somewhat, they're not detected as duplicates.

So I tried something like what appears in regex-urlfilter.txt:

-/([^/]+)/\1/\1/

with many variants, but none of them works. Any hint on how to block these URLs (other than enumerating all the bad ones) is welcome.

I'm using Nutch 1.12 and Solr 6.1.0.

Regards,
José-Marcio

--
Sent from my typewriter.
---------------------------------------------------------------
Spam : Classement statistique de messages électroniques - Une approche pragmatique
At Amazon.fr: http://amzn.to/LEscRu or http://bit.ly/SpamJM
---------------------------------------------------------------
Jose Marcio MARTINS DA CRUZ    http://www.j-chkmail.org
Ecole des Mines de Paris       http://bit.ly/SpamJM
60, bd Saint Michel            75272 - PARIS CEDEX 06
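
For reference, here is a minimal sketch of what the pattern -/([^/]+)/\1/\1/ matches when tested outside Nutch. It is only an illustration: Python's re module stands in for the Java regex engine, the leading "-" (the exclude marker of the filter file) is dropped, and the sample URLs and the is_looping helper are hypothetical. It shows that the pattern only catches a path segment repeated identically three times in a row, so URLs whose segments differ from one another would not be blocked by it.

```python
import re

# The pattern from the rule above, without the leading "-" exclude marker.
REPEATED_SEGMENT = re.compile(r"/([^/]+)/\1/\1/")


def is_looping(url: str) -> bool:
    """Return True if one path segment occurs three times in a row."""
    return REPEATED_SEGMENT.search(url) is not None


# Hypothetical URLs, invented for illustration only.
samples = [
    "http://server/something/abc/abc/abc/abc/toto",  # same segment repeated
    "http://server/something/abc/def/ghi/toto",      # all segments different
    "http://server/something/abc/abc/toto",          # only two repeats
]

for url in samples:
    print(is_looping(url), url)
```

Only the first sample is flagged. If the looping URLs look more like the second one, a rule based on backreferences alone cannot catch them, and a limit on path depth or URL length may be worth considering instead.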

