Re: stopping robots

2006-07-31 Thread Marc Espie
I've got a robots.txt and a script that loops to infinity.
Actually, it's a useful page on the server: there's a list that can be
ordered two ways, and switching from one ordering to the other increments
a parameter at the end of the invocation.

A robot has no business reading that specific page in the first place (in
fact, robots are disallowed from it), and after a small number of loops
(10 or 15) the webserver becomes very unresponsive, thus ensuring the
robot writer will lose a lot of time on that page.

Assuming reasonable technologies (e.g., Mason), the URL does not even have
to look like a script...
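A rough sketch of that kind of trap, written as a plain Python CGI rather
than Mason; the parameter name, item list and loop threshold are invented
for illustration:

#!/usr/bin/env python
# Illustrative sketch only.  robots.txt would carry a Disallow for this
# page; a human flips the ordering once or twice, while a robot that
# ignores robots.txt keeps following the ever-incrementing link.
import cgi
import time

ITEMS = ["alpha", "bravo", "charlie", "delta"]

form = cgi.FieldStorage()
try:
    depth = int(form.getfirst("order", "0"))
except ValueError:
    depth = 0

# Past a handful of loops, start wasting the robot's time.
if depth > 10:
    time.sleep(30)

print("Content-Type: text/html")
print()
print("<html><body><ul>")
for item in sorted(ITEMS, reverse=bool(depth % 2)):
    print("  <li>%s</li>" % item)
print("</ul>")
print('<a href="?order=%d">reverse order</a>' % (depth + 1))
print("</body></html>")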



Re: stopping robots

2006-07-26 Thread Nick Guenther

On 7/25/06, Mike Erdely [EMAIL PROTECTED] wrote:

prad wrote:
 what is the best way to stop those robots and spiders from getting in?

Someone on this list (who can reveal themselves if they want) has a
pretty good setup to block disrespectful robots.

They have a robots.txt file that specifies a Disallow: /somedir/.
Anyone that actually GOES into that directory gets blocked by PF.

It'd be pretty easy to parse your /var/www/logs/access_log for accesses
of /somedir/ and have them added to a table.

-ME



Arxiv dumps massive amounts of data at you and then blocks you if you
access a special robot-trap page. See
http://arxiv.org/RobotsBeware.html.

If you are using a CGI/template-based or frame-based site, it would not be
difficult to generate a new trap page every day and link it on all pages
for robots to fall into. You could even make the URL sound reasonable by
using a wordbank, so that statistical analysis of the characters can't
pick out the real links from the trap ones. You'd also have to make sure
to move the link around (e.g. just having the trap as the last link on
every page is obvious).
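A sketch of that rotating-trap idea, assuming the pages are generated
dynamically anyway; the word bank, salt and path layout are invented:

#!/usr/bin/env python
# Illustrative sketch only: derive today's trap URL from a word bank so
# the link reads like any other and changes every day.  Every generated
# page links it from a different spot; any client requesting it gets
# added to the block list.
import datetime
import hashlib
import random

WORDBANK = ["pricing", "archive", "summer", "notes", "mirror",
            "reports", "gallery", "updates", "catalog", "history"]

def trap_path(day=None):
    day = day or datetime.date.today()
    # Seed from the date plus a private salt so all pages agree on the
    # same trap URL for the whole day.
    seed = hashlib.sha256(("s3cret-" + day.isoformat()).encode()).hexdigest()
    rng = random.Random(seed)
    first, second = rng.sample(WORDBANK, 2)
    return "/%s/%s.html" % (first, second)

if __name__ == "__main__":
    print("today's trap:", trap_path())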

However, the above is probably excessive; robot authors really aren't
that diligent (laziness is the whole reason they are running a robot in
the first place).

-Nick



Re: stopping robots

2006-07-25 Thread Rogier Krieger

On 7/25/06, prad [EMAIL PROTECTED] wrote:

what is the best way to stop those robots and spiders from getting in?


The sure way to stop robots and spiders is to shut down your web
server. I don't suppose that's the answer you're looking for.

Treat malicious robots as malicious/unwelcome users. Whatever your
definition of malicious, do not expect to be able to easily discern
between regular human users and robots. User-agent strings and the like
are too easy to alter to be relied upon without precautions (as with
all client-generated input).



.htaccess?


That might help, but it won't solve your problem of discerning between
human and automated clients. Also, the usual problems/threats regarding
credentials will of course apply. Mind you, automated processes
(robots) can also use credentials.

Possibly you can also use a CAPTCHA. Various modules (PHP, Perl) exist
that make these easy to integrate. Whether (or when) robots will be
able to fool these tests is another matter.
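Not one of those modules, but a toy sketch of the idea in Python: pose a
small question, hand the client a signed copy of the expected answer, and
verify it on submission. The secret and question style are placeholders:

#!/usr/bin/env python
# Toy CAPTCHA sketch, for illustration only.  A real deployment would
# use one of the existing modules and a harder (e.g. image) challenge.
import hashlib
import hmac
import random

SECRET = b"change-me"

def make_challenge():
    a, b = random.randint(1, 9), random.randint(1, 9)
    question = "What is %d + %d?" % (a, b)
    token = hmac.new(SECRET, str(a + b).encode(), hashlib.sha256).hexdigest()
    # question goes in the form, token in a hidden field
    return question, token

def check_answer(answer, token):
    expected = hmac.new(SECRET, answer.strip().encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, token)

if __name__ == "__main__":
    q, t = make_challenge()
    print(q)
    print(check_answer("7", t))   # True only if 7 was the right sum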



robots.txt and apache directives?


Well-behaved robots will adhere to measures such as (X)HTML meta tags,
robots.txt files, etc. Other robots may not.
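For reference, the measures in question are a robots.txt rule and/or a
per-page meta tag, along these lines (the directory name is only an
example):

# robots.txt, served from the site root
User-agent: *
Disallow: /somedir/

<!-- per-page equivalent, inside the document head -->
<meta name="robots" content="noindex,nofollow">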



find them on the access_log and block with pf?


Using access_log means you're acting on information gathered after the fact.



which are good robots and which are bad?


Apart from robots/spiders potentially being an excellent friend,
allowing robots (e.g. Google) may also have undesirable side effects.
Such effects range from outdated information being displayed to search
engine users, to sensitive data being stored on servers outside your
influence. I'm sure there are many more.

I'd recommend you think about your threat model first and use that to
determine which information you deem sensitive and to what lengths you
will go to secure that information.

Cheers,

Rogier

--
If you don't know where you're going, any road will get you there.



Re: stopping robots

2006-07-25 Thread Mike Erdely

prad wrote:

what is the best way to stop those robots and spiders from getting in?


Someone on this list (who can reveal themselves if they want) has a 
pretty good setup to block disrespectful robots.


They have a robots.txt file that specifies a Disallow: /somedir/. 
Anyone that actually GOES into that directory gets blocked by PF.


It'd be pretty easy to parse your /var/www/logs/access_log for accesses 
of /somedir/ and have them added to a table.
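A rough sketch of that, in Python; it assumes pf.conf already defines the
table and a block rule, and the table name, log path and trap directory
are just examples:

#!/usr/bin/env python
# Illustrative sketch only.  Assumes pf.conf contains something like:
#     table <badbots> persist
#     block in quick from <badbots>
# Scans Apache's access_log for requests under the disallowed directory
# and feeds the offending addresses to the pf table via pfctl.
import re
import subprocess

LOG = "/var/www/logs/access_log"
TRAP = "/somedir/"

offenders = set()
with open(LOG) as log:
    for line in log:
        # Common Log Format: client address first, request in quotes.
        parts = line.split()
        if not parts:
            continue
        m = re.search(r'"[A-Z]+ (\S+)', line)
        if m and m.group(1).startswith(TRAP):
            offenders.add(parts[0])

for ip in sorted(offenders):
    subprocess.run(["pfctl", "-t", "badbots", "-T", "add", ip], check=False)

Run from cron every few minutes, that is roughly the setup described; if
the blocks shouldn't be permanent, old entries can be dropped again with
pfctl's -T expire.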


-ME



Re: stopping robots

2006-07-25 Thread Spruell, Darren-Perot
From: [EMAIL PROTECTED] 
 what is the best way to stop those robots and spiders from getting in?
 
 .htaccess?
 robots.txt and apache directives?
 find them on the access_log and block with pf?
 
 i should also ask whether it is a good idea to block robots 
 in the first place 
 since some do help to increase presence on the web.
 which are good robots and which are bad?

And here I've never considered them a threat. Do you have information in a
robots.txt that you shouldn't? What's your concern with them?

DS