Hi Markus,

I was just answering my own question. Yes, it is a self-repeating pattern, and
the server does this for many URLs.

I've found an answer (I hope it won't cause other problems...):

I just added this rule to conf/regex-normalizer.xml:

<regex>
  <pattern>(/[^/]+){3,}</pattern>
  <substitution>$1</substitution>
</regex>
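
To sanity-check a rule like this before a crawl, here is a minimal Java sketch. It assumes the normalizer applies the pattern to the whole URL string with java.util.regex replace-all semantics; the class name NormalizerRuleCheck and the sample URLs are made up for illustration. Because the pattern is not anchored to the path, the matched run of segments can start at the host name, so it is worth printing the result for a few real URLs before relying on the rule.

```java
import java.util.regex.Pattern;

class NormalizerRuleCheck {
    // Pattern and substitution copied from the rule above.
    private static final Pattern RULE = Pattern.compile("(/[^/]+){3,}");
    private static final String SUBSTITUTION = "$1";

    // Assumption: the rule sees the full URL and is applied like replaceAll.
    static String apply(String url) {
        return RULE.matcher(url).replaceAll(SUBSTITUTION);
    }

    public static void main(String[] args) {
        String[] samples = {
            // Hypothetical looping URL: the run starts at "/server",
            // so under these assumptions this becomes http://toto
            "http://server/something/xxx/xxx/xxx/toto",
            // Fewer than three "/segment" groups in a row: left unchanged
            "http://server/a"
        };
        for (String url : samples) {
            System.out.println(url + " -> " + apply(url));
        }
    }
}
```

In a repeated group, $1 refers to the last repetition only, which is why everything before the final segment disappears.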


Nutch ... What a wonderful world... Wonderful software !!!

Regards

José-Marcio




On 07/01/2016 01:44 PM, Markus Jelsma wrote:
Hello Jose Marcio!

You mean there is absolutely no self-repeating pattern anywhere in the URL? If
not, you are in trouble! Nutch URL filters don't operate in the context of the
page the URL is located on, nor do they operate on groups of URLs.

The easiest approach is to limit the URL length to 512 characters or some other
arbitrary number. You can also use the scoring-depth plugin to limit depth from
the host's root.
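
As a rough illustration of these two limits, here is a minimal Java sketch. It is not Nutch code: the class UrlLimits, its method names, and the depth limit of 8 are hypothetical, and depth here simply counts non-empty path segments, which may differ from how the scoring-depth plugin measures depth.

```java
import java.net.URI;

class UrlLimits {
    static final int MAX_LENGTH = 512; // arbitrary, as suggested above
    static final int MAX_DEPTH = 8;    // hypothetical value

    // Counts non-empty path segments, e.g. /a/b/c has depth 3.
    // URI.create throws IllegalArgumentException on malformed input.
    static int pathDepth(String url) {
        String path = URI.create(url).getPath();
        if (path == null) {
            return 0;
        }
        int depth = 0;
        for (String segment : path.split("/")) {
            if (!segment.isEmpty()) {
                depth++;
            }
        }
        return depth;
    }

    // Accepts a URL only if it is within both limits.
    static boolean accept(String url) {
        return url.length() <= MAX_LENGTH && pathDepth(url) <= MAX_DEPTH;
    }

    public static void main(String[] args) {
        System.out.println(accept("http://server/a/b/c"));
        System.out.println(accept("http://server/1/2/3/4/5/6/7/8/9/10"));
    }
}
```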

If it is a spider trap, neither solution works well and you'd need a spider 
trap detector, which is hard to build.

Markus



-----Original message-----
From:Jose Marcio Martins da Cruz <[email protected]>
Sent: Friday 1st July 2016 11:25
To: [email protected]
Subject: Regular expressions in regex-urlfilter.txt


Hello,

I'm trying to filter some URLs fetched from an "ugly" web server.

For some reason it falls into an infinite loop and generates URLs like:

http://server/something/xxx/xxx/xxx/xxx/.../xxx/xxx/toto

where "xxx" isn't static. And since the returned pages differ slightly,
they're not detected as duplicates.

So I tried something like the rule that appears in regex-urlfilter.txt:

-/([^/]+)/\1/\1/

with many variants, but none of them works. Any hint on how to block this
(other than enumerating all the bad URLs) is welcome.
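
One way to see when a back-reference rule like this fires is to test it outside the crawler. The sketch below assumes the filter does an unanchored search (find()) with java.util.regex; the class name BackrefCheck and the sample URLs are hypothetical. A \1 back-reference only matches a segment that repeats exactly, so it cannot catch a loop whose segments differ from one another.

```java
import java.util.regex.Pattern;

class BackrefCheck {
    // The expression from the rule above, without the leading "-"
    // (the marker that tells regex-urlfilter.txt to reject a match).
    static final Pattern REPEAT = Pattern.compile("/([^/]+)/\\1/\\1/");

    // Assumption: the filter searches anywhere in the URL.
    static boolean matches(String url) {
        return REPEAT.matcher(url).find();
    }

    public static void main(String[] args) {
        // Identical segments repeated three times: the rule matches.
        System.out.println(matches("http://server/something/xxx/xxx/xxx/toto"));
        // Segments that differ from each other: the rule does not match.
        System.out.println(matches("http://server/something/a1/a2/a3/toto"));
    }
}
```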

I'm using Nutch 1.12 and Solr 6.1.0.

Regards,

José-Marcio

--

  Sent from my typewriter.
  ---------------------------------------------------------------
   Spam : Classement statistique de messages électroniques -
          Une approche pragmatique
   On Amazon.fr: http://amzn.to/LEscRu or http://bit.ly/SpamJM
  ---------------------------------------------------------------
  Jose Marcio MARTINS DA CRUZ            http://www.j-chkmail.org
  Ecole des Mines de Paris                   http://bit.ly/SpamJM
  60, bd Saint Michel                      75272 - PARIS CEDEX 06



