RE: [Robots] Yahoo evolving robots.txt, finally

2004-03-13 Thread ogjunk-robots

--- Matthew Meadows [EMAIL PROTECTED] wrote:
 I agree with Walter.

So do I, partially. :)

 There are a lot of variables that should have been considered for this
 new value.  If nothing else, the specification should have called for
 the time in milliseconds, or otherwise allowed for fractional seconds.

I disagree that this level of granularity is needed.  See my earlier
email.

 In addition, it seems a bit presumptuous for Yahoo to think that they
 can force a de facto standard just by implementing it first.

That's how things work in real life.  Think of web browsers ten years
ago and the various Netscape, then IE, extensions.  Many of them are
now considered standard.

 With this line of thinking, webmasters would eventually be required to
 update their robots.txt file for dozens of individual bots.

In theory, yes.  In reality, I agree with Walter: this extension will
prove to be as useless as the blink tag, and will therefore not be
supported by any of the big crawlers.

 It's hard enough to get them to do it now for the general case; this
 additional fragmentation is not going to make anybody's job easier.
 Is Google going to implement their own extensions, then MSN,
 AltaVista, and AllTheWeb?

Not likely.  To remain competitive, they have to keep fetching web
pages at high rates, and robots.txt only limits them.  I can't think of
an extension to robots.txt that would let them do a better job.
Actually, I can. :)

 Finally, if we're going to start specifying the criteria for
 scheduling, let's consider some other alternatives, like preferred
 scanning windows.

Same problem as Crawl-delay: everyone would want crawlers to visit
their sites at night, which would saturate the crawlers' networks, so
search engines won't push that extension.  (Then again, big crawlers
run from multiple points around the planet, so maybe my statement is
flawed.)
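
To make the idea concrete, a scanning-window directive might look
something like this (syntax entirely hypothetical, mine, not anything
Yahoo or this list has specified):

    User-agent: *
    Visit-window: 0100-0500   # server's preferred local crawl hours

Everyone picking roughly the same window is exactly the saturation
problem I mean.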

Otis

 -----Original Message-----
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED]
 On Behalf Of Walter Underwood
 Sent: Friday, March 12, 2004 3:37 PM
 To: Internet robots, spiders, web-walkers, etc.
 Subject: Re: [Robots] Yahoo evolving robots.txt, finally
 
 
 --On Friday, March 12, 2004 6:46 AM -0800 [EMAIL PROTECTED]
 wrote:
 
  I am surprised that after all that talk about adding new semantic
  elements to robots.txt several years ago, nobody commented that the
  new Yahoo crawler (former Inktomi crawler) took a brave step in that
  direction by adding Crawl-delay: syntax.
  
  http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html
  
  Time to update your robots.txt parsers!
 
 No, time to tell Yahoo to go back and do a better job.
 
 Does crawl-delay allow decimals?  Negative numbers?  Could this spec
 be a bit better quality?  The words "positive integer" would improve
 things a lot.
 
 Sigh.  It would have been nice if they'd discussed this on the list
 first.  crawl-delay is a pretty dumb idea.  Any value over one second
 means it takes forever to index a site.  Ultraseek has had a spider
 throttle option to add this sort of delay, but it is almost never
 used, because Ultraseek reads 25 pages from one site, then moves to
 another.  There are many kinds of rate control.
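
 Concretely, the two policies look something like this (a toy Python
 sketch with made-up names and numbers, not Ultraseek's actual code):

     import time
     from collections import deque

     def crawl_fixed_delay(urls_by_host, fetch, delay=1.0):
         # Crawl-delay style: sleep after every fetch to a host.  At
         # delay=2, a 500,000-page site spends a million seconds, or
         # roughly eleven and a half days, just sleeping.
         for host, urls in urls_by_host.items():
             for url in urls:
                 fetch(url)
                 time.sleep(delay)

     def crawl_batched(urls_by_host, fetch, batch=25):
         # Batch style: read a batch of pages from one host, then move
         # to the next host, round-robin.  No single host is hammered
         # continuously, but the crawl as a whole never sits idle.
         queues = deque((h, deque(u)) for h, u in urls_by_host.items())
         while queues:
             host, urls = queues.popleft()
             for _ in range(min(batch, len(urls))):
                 fetch(urls.popleft())
             if urls:
                 queues.append((host, urls))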
 
 wunder
 --
 Walter Underwood
 Principal Architect
 Verity Ultraseek
 

___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


[Robots] Yahoo evolving robots.txt, finally

2004-03-12 Thread ogjunk-robots
I am surprised that after all that talk about adding new semantic
elements to robots.txt several years ago, nobody commented that the new
Yahoo crawler (former Inktomi crawler) took a brave step in that
direction by adding Crawl-delay: syntax.

http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html
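
The directive goes in the record for Yahoo's robot, with the value in
seconds, along these lines (values illustrative):

    User-agent: Slurp
    Crawl-delay: 5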

Time to update your robots.txt parsers!
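
If you do update a parser, the change is small.  A minimal sketch in
Python (my code, not anything official; it reads the value as seconds
and skips junk, since the spec doesn't say what to do with it):

    def crawl_delay(robots_txt, agent="slurp"):
        # Return the Crawl-delay in seconds for `agent`, or None.
        agents, delay, in_ua = set(), None, False
        for raw in robots_txt.splitlines():
            line = raw.split("#", 1)[0].strip()   # drop comments
            if ":" not in line:
                continue
            field, _, value = (p.strip() for p in line.partition(":"))
            field = field.lower()
            if field == "user-agent":
                if not in_ua:                     # a new record starts
                    agents = set()
                agents.add(value.lower())
                in_ua = True
            else:
                in_ua = False
                if field == "crawl-delay" and (agent in agents
                                               or "*" in agents):
                    try:
                        delay = float(value)      # decimals? unspecified
                    except ValueError:
                        pass                      # ignore garbage
        return delay

Feed it the body of a fetched robots.txt: crawl_delay(text, "slurp").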

Otis Gospodnetic

___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots