Re: [Robots] Yahoo evolving robots.txt, finally

2004-03-15 Thread Nick Arnett
Walter Underwood wrote:

 Nah, they would have e-mailed me directly by now. I used to work
 with them at Inktomi.

How about dropping them an e-mail to invite them here?

Yahoo limits crawler access to its own site.  I haven't tried in the 
last 9 or 10 months, but the way it was back then, if you crawled the 
message boards, the crawler's IP address would be blocked for 
increasingly long time periods -- a day, two days, etc.  I tried slowing 
down our gathering, but couldn't find a speed at which they wouldn't 
eventually block it.  And of course they never responded to any 
questions about what they'd consider acceptable.

And yet, their own servers don't seem to have a robots.txt that defines 
any limitations.  Sure would be nice if *they* would tell *us* what's 
acceptable when crawling Yahoo!

Nick

--
Nick Arnett
Director, Business Intelligence Services
LiveWorld Inc.
Phone/fax: (408) 551-0427
[EMAIL PROTECTED]
___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


RE: [Robots] Yahoo evolving robots.txt, finally

2004-03-13 Thread ogjunk-robots

--- Matthew Meadows [EMAIL PROTECTED] wrote:
 I agree with Walter.

So do I, partially. :)

 There are a lot of variables that should have been considered for
 this new value.  If nothing else the specification should have
 called for the time in milliseconds, or otherwise allowed for
 fractional seconds.

I don't think that level of granularity is needed.  See my earlier email.

  In addition, it seems a bit presumptuous for Yahoo
 to think that they can force a de facto standard just by implementing
 it first.

That's how things work in real life.  Think of web browsers 10 years
ago and the various Netscape, then IE, extensions.  Lots of them are
now considered standard.

 With this line of thinking, webmasters would eventually be
 required to update their robots.txt file for dozens of individual
 bots.

In theory, yes.  In reality, I agree with Walter: this extension will
prove to be as useless as <blink>, and will therefore not be
supported by any big crawlers.

 It's hard enough to get them to do it now for the general case;
 this additional fragmentation is not going to make anybody's job
 easier.  Is Google going to implement their own extensions, then
 MSN, AltaVista, and AllTheWeb?

Not likely.  In order for them to remain competitive, they have to keep
fetching web pages at high rates.  robots.txt only limits them.  I
can't think of an extension to robots.txt that would let them do a
better job.  Actually, I can. :)

 Finally, if we're going to start specifying the criteria for
 scheduling, let's consider some other alternatives, like preferred
 scanning windows.

Same problem as with crawl-delay: everyone would want crawlers to
visit their sites at night, which would saturate crawlers' networks,
so search engines won't push that extension.  (Then again, big
crawlers run from multiple points around the planet, so maybe my
statement is flawed.)

Otis


___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


Re: [Robots] Yahoo evolving robots.txt, finally

2004-03-13 Thread Walter Underwood
--On Saturday, March 13, 2004 2:22 AM -0800 [EMAIL PROTECTED] wrote:
 
  Does crawl-delay allow decimals?

 You think people really want to be able to tell a crawler to fetch a
 page at most every 5.6 seconds, and not 5?

0.5s would be useful. Ultraseek has used a float for the delay for
the past six years.

  Could this spec be a bit better quality?

 It's not a spec, it's an implementation, ...

  The words "positive integer" would improve things a lot.

 That's just common sense to me. :)

Well, different people's common sense leads to incompatible
implementations, which is why these things should be specified.
I think negative delays would be goofy, too, but we all know
that someone will try it.
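
As a sketch of what every implementor now has to decide for
themselves (Python; the lenient choices below are my assumptions,
not anything Yahoo has specified):

    import math

    def parse_crawl_delay(value):
        """Parse a Crawl-delay value defensively.

        The Yahoo page doesn't say whether decimals or negative
        numbers are legal, so every parser decides on its own.
        Here we accept any finite, non-negative number and
        ignore anything else.
        """
        try:
            delay = float(value.strip())
        except ValueError:
            return None   # unparseable -- ignore the line
        if not math.isfinite(delay) or delay < 0:
            return None   # inf, nan, negatives: meaningless
        return delay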

 I am sure their people are on the list; they are just being quiet, and
 will probably remain silent now that their idea has been called dumb.

Nah, they would have e-mailed me directly by now. I used to work
with them at Inktomi.

I called it a dumb idea because it has obvious problems. These
could have been solved by trying to learn from the rest of the
robot community. Crawl-delay isn't useful in our crawler, and
there have been better rate-limit approaches proposed as
far back as 1996. Most sites have a pages/day or bytes/day limit,
not instantaneous rate limits, so crawl-delay is controlling
the wrong thing.
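
The difference in sketch form (Python; the interfaces are
hypothetical, not Ultraseek's): a per-fetch delay controls the
instantaneous rate, while the limits sites actually ask for are
volume budgets.

    import time

    class CrawlDelayPolicy:
        """What Crawl-delay controls: seconds between fetches."""
        def __init__(self, delay_seconds):
            self.delay = delay_seconds
            self.last_fetch = 0.0

        def wait_for_slot(self):
            sleep_for = self.last_fetch + self.delay - time.time()
            if sleep_for > 0:
                time.sleep(sleep_for)
            self.last_fetch = time.time()

    class DailyBudgetPolicy:
        """What sites actually want: a pages/day (or bytes/day) cap."""
        def __init__(self, pages_per_day):
            self.budget = pages_per_day
            self.used = 0

        def allow_fetch(self):
            if self.used >= self.budget:
                return False   # budget spent -- come back tomorrow
            self.used += 1
            return True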

Note that Google has implemented Allow lines with a limited
wildcard syntax, so Yahoo isn't alone in being incompatible.
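
For example (the paths here are made up; Google documents its own
matching rules), a robots.txt using that extension might read:

    User-agent: Googlebot
    Disallow: /cgi-bin/
    Allow: /cgi-bin/public*.html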

wunder
--
Walter Underwood
Principal Architect
Verity Ultraseek

___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


RE: [Robots] Yahoo evolving robots.txt, finally

2004-03-13 Thread Matthew Meadows
I'm standing firm on my suggestions.  Adding a delay for crawlers is
a good idea in concept, and allowing fractional seconds is a way for
webmasters to request reasonable constraints.  Is it such a stretch
to allow a robot that you use to promote your business unmitigated
access to your site, but require other robots to throttle down to a
few pages per second?

As for preferred scanning windows, many organizations have a huge surge
of traffic from customers
during their normal operating hours, but are relatively calm otherwise.
Requesting that robots
only scan outside of peak hours is a nice compromise between keeping
them out entirely and keeping
them out when you're too busy serving pages to human readers.
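
Purely as an illustration -- no crawler supports anything like this
today, and the directive name is invented -- a scanning window might
be expressed as:

    User-agent: *
    # Hypothetical: only crawl between 01:00 and 05:00 server time
    Visit-window: 0100-0500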

I just read Walter's response to this thread, and he mentions
bytes-per-day and pages-per-day limits.  Those are fine in the
abstract and may be helpful.  But if a robot is limited to 100MB
a day and it decides to take it all in one draw during your peak
traffic hours, then volume limits alone are not sufficient.
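
In sketch form (Python; the policy object is hypothetical): a volume
budget says nothing about *when* the fetches happen, so a considerate
scheduler needs both checks.

    from datetime import datetime

    class CombinedPolicy:
        """Byte budget plus an off-peak window (hypothetical)."""
        def __init__(self, bytes_per_day, window=(1, 5)):
            self.budget = bytes_per_day
            self.used = 0
            self.window = window   # allowed hours, server-local time

        def allow_fetch(self, size_estimate):
            hour = datetime.now().hour
            in_window = self.window[0] <= hour < self.window[1]
            in_budget = self.used + size_estimate <= self.budget
            if in_window and in_budget:
                self.used += size_estimate
                return True
            return False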


___
Robots mailing list
[EMAIL PROTECTED] http://www.mccmedia.com/mailman/listinfo/robots

Re: [Robots] Yahoo evolving robots.txt, finally

2004-03-12 Thread Walter Underwood
--On Friday, March 12, 2004 6:46 AM -0800 [EMAIL PROTECTED] wrote:

 I am surprised that after all that talk about adding new semantic
 elements to robots.txt several years ago, nobody commented that the new
 Yahoo crawler (former Inktomi crawler) took a brave step in that
 direction by adding Crawl-delay: syntax.
 
 http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html
 
 Time to update your robots.txt parsers!

No, time to tell Yahoo to go back and do a better job.

Does crawl-delay allow decimals? Negative numbers? Could this spec
be a bit better quality? The words "positive integer" would
improve things a lot.

Sigh. It would have been nice if they'd discussed this on the
list first. crawl-delay is a pretty dumb idea. Any value over
one second means it takes forever to index a site. Ultraseek
has had a "spider throttle" option to add this sort of delay,
but it is almost never used, because Ultraseek reads 25 pages
from one site, then moves to another. There are many kinds of
rate control.
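
For reference, the usage shown on that page looks something like
this (what values are legal is exactly what's left unspecified):

    User-agent: Slurp
    Crawl-delay: 1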

wunder
--
Walter Underwood
Principal Architect
Verity Ultraseek

___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


RE: [Robots] Yahoo evolving robots.txt, finally

2004-03-12 Thread Matthew Meadows
I agree with Walter.  There are a lot of variables that should have
been considered for this new value.  If nothing else the specification
should have called for the time in milliseconds, or otherwise allowed
for fractional seconds.  In addition, it seems a bit presumptuous for
Yahoo to think that they can force a de facto standard just by
implementing it first.  With this line of thinking, webmasters would
eventually be required to update their robots.txt file for dozens of
individual bots.  It's hard enough to get them to do it now for the
general case; this additional fragmentation is not going to make
anybody's job easier.  Is Google going to implement their own
extensions, then MSN, AltaVista, and AllTheWeb?  Finally, if we're
going to start specifying the criteria for scheduling, let's consider
some other alternatives, like preferred scanning windows.

___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots