On 10/20/2010 12:13 PM, Martin Koppenhoefer wrote:
> Maybe we could work around this by automatically changing the link for
> the stored tiles? This would also harm "friendly" projects with small
> tile-download-rates though. If it is technically possible to identify
> this application they could also be filtered out.
I used to work on a website where we were constantly waging war
against webcrawlers.
It's certainly useful to ban certain user agents, but it's very
easy for attackers to change their user agent to look like an ordinary
web browser.
We had a system called "robocop" that did a running tail -f of the
access_log, kept counts of how many hits we'd gotten from different IP
addresses in the last hour, and if somebody was downloading too much,
we'd drop a deny directive into our .htaccess file and that would be the
end of them. I'd even get a text message when this happened.
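The counting logic behind something like "robocop" can be sketched in a few
lines. This is a minimal illustration, not the original tool: the log format,
window size, and threshold are all assumptions, and a real version would tail
the live access_log and actually append a deny directive to .htaccess.

```python
import re
from collections import defaultdict, deque

WINDOW = 3600        # seconds: count hits over the last hour (assumption)
THRESHOLD = 1000     # hits per window before banning an IP (assumption)

# Matches the client IP and an epoch timestamp in a simplified log format;
# a real Apache combined log would need a fuller pattern.
LINE_RE = re.compile(r"^(\S+) (\d+) ")

def process(line, hits, banned, window=WINDOW, threshold=THRESHOLD):
    """Record one access_log line; return the IP if it just earned a ban."""
    m = LINE_RE.match(line)
    if not m:
        return None
    ip, now = m.group(1), int(m.group(2))
    q = hits[ip]
    q.append(now)
    # Drop hits that have aged out of the sliding window.
    while q and now - q[0] > window:
        q.popleft()
    if len(q) > threshold and ip not in banned:
        banned.add(ip)
        # A real robocop would append "deny from <ip>" to .htaccess here
        # and fire off the text message.
        return ip
    return None

# Demo with a threshold of 3 so the ban triggers quickly.
hits, banned = defaultdict(deque), set()
newly_banned = []
for t in range(5):
    ip = process(f"203.0.113.9 {1000 + t} GET /tile.png", hits, banned,
                 threshold=3)
    if ip:
        newly_banned.append(ip)
```

The deque-per-IP sliding window keeps memory proportional to recent traffic
rather than total history, which matters when the log is busy.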
I sketched out a design for a system called "robocop 2" that would
do this in a better way and would generally help us manage our traffic
in real time. I didn't get the go-ahead to build it.
Before I had that job, I had another "job" doing, uh, "difficult
information retrieval." I had a webcrawler called "Blackbird" that was
designed for low observability and built to understand the structure of
a website well enough that, rather than copying the site, it would copy
the database behind the site. With the right configuration,
Blackbird could have completely subverted the defenses of the site
mentioned above -- but I wasn't doing that kind of stuff anymore. I got
sick of being on mailing lists where I knew somebody was a spy but not
who...
_______________________________________________
talk mailing list
talk@openstreetmap.org
http://lists.openstreetmap.org/listinfo/talk