Hi Markus,
I was just answering my own question. Yes, the pattern is self-repeating, and
the server does this for many URLs.
I've found a fix (I hope it won't cause other problems...):
I just added this rule to conf/regex-normalize.xml:
<regex>
<pattern>(/[^/]+){3,}</pattern>
<substitution>$1</substitution>
</regex>
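A rule like this can be tried outside a crawl first. The following is only a minimal Java sketch: it assumes the regex normalizer applies each pattern/substitution pair the way Matcher.replaceAll does in java.util.regex, and the class and method names are invented for the illustration (they are not part of Nutch). It runs the rule above against a URL shaped like the one in the original question, and then a backreference variant that is purely an assumption of this sketch, not something proposed in this thread.

```java
import java.util.regex.Pattern;

public class NormalizeRuleCheck {

    // Hypothetical helper (not a Nutch API): applies one pattern/substitution
    // pair with replaceAll, which this sketch assumes mirrors the normalizer.
    static String applyRule(String pattern, String substitution, String url) {
        return Pattern.compile(pattern).matcher(url).replaceAll(substitution);
    }

    public static void main(String[] args) {
        String url = "http://server/something/xxx/xxx/xxx/xxx/toto";

        // Rule as posted: a run of 3+ path-like segments is replaced by the
        // last segment captured in group 1.
        System.out.println(applyRule("(/[^/]+){3,}", "$1", url));

        // Assumed variant: only collapse a segment that repeats identically
        // 3+ times in a row, and only on a segment boundary.
        System.out.println(applyRule("(/[^/]+)\\1{2,}(?=/|$)", "$1", url));
    }
}
```

With plain java.util.regex the first call prints http://toto, because "/server" is also matched as a segment, and the second prints http://server/something/xxx/toto. Whether the normalizer behaves the same way is the assumption to verify, so it is worth running a few normal URLs through any such rule too.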
Nutch ... What a wonderful world... Wonderful software !!!
Regards
José-Marcio
On 07/01/2016 01:44 PM, Markus Jelsma wrote:
Hello Jose Marcio!
You mean there is absolutely no self-repeating pattern anywhere in the URL? If
there is none, you are in trouble! Nutch URL filters don't operate in the
context of the page the URL is located on, nor do they operate on groups of URLs.
The easiest approach is to limit the URL length to 512 characters or some other
arbitrary number. You can also use the scoring-depth plugin to limit depth from
the host's root.
If it is a spider trap, neither solution works well and you'd need a spider
trap detector, which is hard to build.
Markus
-----Original message-----
From: Jose Marcio Martins da Cruz <[email protected]>
Sent: Friday 1st July 2016 11:25
To: [email protected]
Subject: Regular expressions in regex-urlfilter.txt
Hello,
I'm trying to filter some URLs fetched from an "ugly" web server.
For some reason it falls into an infinite loop and generates URLs like:
http://server/something/xxx/xxx/xxx/xxx/.../xxx/xxx/toto
where "xxx" isn't static... Since the returned pages differ slightly from one
another, they are not detected as duplicates.
So I tried something like the rule that appears in regex-urlfilter.txt:
-/([^/]+)/\1/\1/
in many variants, but none of them works. Any hint on how to block this (other
than enumerating all the bad URLs) is welcome.
I'm using Nutch 1.12 and Solr 6.1.0.
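To see whether a pattern like the one above matches at all, it can be tested in isolation. This is a minimal Java sketch, assuming the filter tests each rule with "find" semantics (a match anywhere in the URL); the class and method names are hypothetical and not part of Nutch, and the second URL is made up as a counter-example.

```java
import java.util.regex.Pattern;

public class FilterRuleCheck {

    // Hypothetical helper (not a Nutch API): true when the pattern is found
    // anywhere in the URL, the matching behaviour this sketch assumes.
    static boolean wouldExclude(String pattern, String url) {
        return Pattern.compile(pattern).matcher(url).find();
    }

    public static void main(String[] args) {
        // The leading "-" of the filter line marks an exclusion rule and is
        // not part of the regular expression itself.
        String rule = "/([^/]+)/\\1/\\1/";

        // URL with the same segment repeated several times in a row.
        System.out.println(wouldExclude(rule, "http://server/something/xxx/xxx/xxx/xxx/toto"));

        // URL with distinct segments only.
        System.out.println(wouldExclude(rule, "http://server/something/a/b/c/toto"));
    }
}
```

In plain java.util.regex this prints true and then false. If a rule matches in isolation but such URLs are still crawled, the position of the rule in the file may be worth checking, on the assumption that the first matching rule decides.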
Regards,
José-Marcio
--
Sent from my typewriter.
---------------------------------------------------------------
Spam : Classement statistique de messages électroniques -
Une approche pragmatique
On Amazon.fr: http://amzn.to/LEscRu or http://bit.ly/SpamJM
---------------------------------------------------------------
Jose Marcio MARTINS DA CRUZ http://www.j-chkmail.org
Ecole des Mines de Paris http://bit.ly/SpamJM
60, bd Saint Michel 75272 - PARIS CEDEX 06