[Robots] Re: leading whitespace in robots.txt files

2002-03-25 Thread Klaus Johannes Rusch


In [EMAIL PROTECTED], Sean M. Burke 
[EMAIL PROTECTED] writes:
 User-agent: *
  Disallow: /cgi-bin/
  Disallow: /~mojojojo/misc/
 
 So I've changed it to this, and was about to submit it as a patch for the
 next LWP release:
/^\s*Disallow:\s*(.*)/i
# Silently forgive leading whitespace.
 
 But first, I thought I'd ask the list here: does anyone think this'd break
 anything? 

The change should not break anything: files using leading whitespace for 
comments or some other obscure purpose do not comply with the specification 
anyway and will see varying results.

However, since the standard is sufficiently clear on the correct format, I 
would rather not support a non-standard format with leading whitespace. 
Developers will start relying on this feature and will then complain that 
other, standards-compliant robots libraries don't support it (the infamous 
"my page works in Internet Explorer, so it cannot be broken" attitude).

Rather than modifying the library, I would suggest that any application that 
wants to handle this content error gracefully strip leading whitespace prior 
to calling parse().
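
To illustrate the idea, here is a minimal sketch of such a pre-processing 
step. It is written in Python purely for illustration (the library under 
discussion is Perl's LWP), and the helper name strip_leading_whitespace is 
made up for this example. The cleaned text would then be handed to whatever 
parse() routine the robots library provides.

```python
def strip_leading_whitespace(robots_txt):
    """Return robots.txt text with leading whitespace removed from every line.

    Hypothetical helper: the application calls this before passing the
    text to the robots library, so the library itself stays strict.
    """
    # lstrip() removes leading spaces and tabs; a whitespace-only line
    # becomes an empty line.
    return "\n".join(line.lstrip() for line in robots_txt.splitlines())


sample = (
    "User-agent: *\n"
    "  Disallow: /cgi-bin/\n"
    "  Disallow: /~mojojojo/misc/\n"
)
cleaned = strip_leading_whitespace(sample)
print(cleaned)
```

Keeping the clean-up in the application means the tolerance is an explicit 
choice by that robot's author, not something site owners can come to expect 
from every parser.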

--
Klaus Johannes Rusch
[EMAIL PROTECTED]
http://www.atmedia.net/KlausRusch/




[Robots] Python timeouts

2002-03-25 Thread Nick Arnett


I've been hitting problems with a Python-based robot I'm working on and just
found out that there's a timeout module that will make it easy to implement
the kind of functionality that Tim Bray was suggesting here earlier.  It
apparently works for any TCP connection.  Here's the link:

http://www.timo-tasi.org/python/timeoutsocket.py
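
For anyone reading this later: the sketch below is not the timeoutsocket 
module linked above. It only shows the same general idea (giving up on a TCP 
connection that stops responding) using the timeout support in Python's 
standard socket module. The function name fetch_with_timeout and the default 
of 10 seconds are assumptions made for this example.

```python
import socket


def fetch_with_timeout(host, port, request, timeout=10.0):
    """Send raw request bytes and return the reply, or None on timeout.

    Hypothetical helper: the timeout covers the connect as well as each
    later send/recv on the socket.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(request)
            chunks = []
            while True:
                data = sock.recv(4096)
                if not data:  # server closed the connection
                    break
                chunks.append(data)
            return b"".join(chunks)
    except socket.timeout:
        # The server accepted the connection but stopped responding.
        return None
```

A robot could call this once per host and simply skip any host for which it 
returns None, rather than hanging on a dead connection.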

--
[EMAIL PROTECTED]
(408) 904-7198