RE: [Robots] Yahoo evolving robots.txt, finally

2004-03-13 Thread Matthew Meadows
I'm standing firm on my suggestions.  Adding a delay for crawlers is a
good idea in concept, 
and allowing fractional seconds is a way for webmasters to request
reasonable constraints.  Is
it such a stretch to allow a robot that you use to promote your business
unmitigated access to
your site, but require other robots to throttle down to a few pages per
second?
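
In robots.txt terms that's easy to express.  Something along these lines
(the agent names are made up, and fractional delays are exactly the
extension I'm arguing for, not anything Yahoo's page promises):

# Our own bot gets the run of the site ...
User-agent: MyCompanyBot
Crawl-delay: 0

# ... everyone else is asked to slow down to a few pages per second
User-agent: *
Crawl-delay: 0.25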

As for preferred scanning windows, many organizations have a huge surge
of traffic from customers
during their normal operating hours, but are relatively calm otherwise.
Requesting that robots
only scan outside of peak hours is a nice compromise between keeping
them out entirely and letting them hit you when you're too busy serving
pages to human readers.

I just read Walter's response to this thread, and he mentions
bytes-per-day and pages-per-day
limits.  Those are fine in the abstract and may be helpful.  But if a
robot is limited to 100MB 
a day and it decides to take it all in one draw during your peak
traffic hours, then volume
limits alone are not sufficient.
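
Put together, the kind of hints I have in mind might look like this
(Visit-window and the per-day limits are hypothetical field names I just
made up, not part of anyone's implementation; Crawl-delay is the Yahoo
extension under discussion):

User-agent: *
Crawl-delay: 0.5           # seconds between requests
Visit-window: 0100-0500    # hypothetical: off-peak hours only, site local time
Request-limit: 50000/day   # hypothetical: pages-per-day cap
Volume-limit: 100MB/day    # hypothetical: bytes-per-day cap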

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
On Behalf Of [EMAIL PROTECTED]
Sent: Saturday, March 13, 2004 4:31 AM
To: Internet robots, spiders, web-walkers, etc.
Subject: RE: [Robots] Yahoo evolving robots.txt, finally



--- Matthew Meadows [EMAIL PROTECTED] wrote:
 I agree with Walter.

So do I, partially. :)

 There are a lot of variables that should have been considered for this
 new value.  If nothing else the specification should have called for
 the time in milliseconds, or otherwise allowed for fractional seconds.

I disagree that level of granularity is needed.  See my earlier email.

  In addition, it seems a bit presumptuous for Yahoo
 to think that they can force a de facto standard just by implementing 
 it first.

That's how things work in real life.  Think web browsers 10 years ago
and various Netscape, then IE extensions.  Now lots of them are
considered standard.

 With this line of thinking webmasters would eventually be required to 
 update their robots.txt file for dozens of individual bots.

In theory, yes.  In reality, I agree with Walter: this extension will
prove to be as useless as <blink>, and will therefore not be supported
by any big crawlers.

 It's hard enough to get them to do it now for the general case, this 
 additional fragmentation is not going to make anybody's job easier. Is
 Google going to implement their own extensions, then MSN, AltaVista,
 and AllTheWeb?

Not likely.  In order for them to remain competitive, they have to keep
fetching web pages at high rates.  robots.txt only limits them.  I can't
think of an extension to robots.txt that would let them do a better job.
Actually, I can. :)

 Finally, if we're going to start specifying the criteria for 
 scheduling, let's consider some other alternatives, like preferred 
 scanning windows.

Same as crawl-delay: everyone would want crawlers to visit their sites
at night, which would saturate crawlers' networks, so search engines
won't push that extension.  (Actually, big crawlers run from multiple
points around the planet, so maybe my statement is flawed.)

Otis

 -Original Message-
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
 On Behalf Of Walter Underwood
 Sent: Friday, March 12, 2004 3:37 PM
 To: Internet robots, spiders, web-walkers, etc.
 Subject: Re: [Robots] Yahoo evolving robots.txt, finally
 
 
 --On Friday, March 12, 2004 6:46 AM -0800 [EMAIL PROTECTED]
 wrote:
 
  I am surprised that after all that talk about adding new semantic
  elements to robots.txt several years ago, nobody commented that the
  new Yahoo crawler (former Inktomi crawler) took a brave step in that
  direction by adding Crawl-delay: syntax.
  
  http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html
  
  Time to update your robots.txt parsers!
 
 No, time to tell Yahoo to go back and do a better job.
 
 Does crawl-delay allow decimals? Negative numbers? Could this spec be
 a bit better quality? The words "positive integer" would improve things
 a lot.
 
 Sigh. It would have been nice if they'd discussed this on the list 
 first. crawl-delay is a pretty dumb idea. Any value over one second 
 means it takes forever to index a site. Ultraseek has had a spider 
 throttle option to add this sort of delay, but it is
 almost never used, because Ultraseek reads 25 pages from one site,
 then moves to another. There are many kinds of rate control.
 
 wunder
 --
 Walter Underwood
 Principal Architect
 Verity Ultraseek
 
 ___
 Robots mailing list
 [EMAIL PROTECTED] http://www.mccmedia.com/mailman/listinfo/robots

RE: [Robots] Yahoo evolving robots.txt, finally

2004-03-12 Thread Matthew Meadows
I agree with Walter.  There are a lot of variables that should have been
considered for this new value.  If nothing else, the specification should
have called for the time in milliseconds, or otherwise allowed for
fractional seconds.  In addition, it seems a bit presumptuous for Yahoo
to think that they can force a de facto standard just by implementing it
first.  With this line of thinking, webmasters would eventually be
required to update their robots.txt file for dozens of individual bots.
It's hard enough to get them to do it now for the general case; this
additional fragmentation is not going to make anybody's job easier.  Is
Google going to implement their own extensions, then MSN, AltaVista, and
AllTheWeb?  Finally, if we're going to start specifying the criteria for
scheduling, let's consider some other alternatives, like preferred
scanning windows.
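
For what it's worth, accepting fractional values costs a parser almost
nothing.  A rough sketch in Perl (record handling is deliberately
simplified, and only the Crawl-delay field name comes from Yahoo's page):

# Return the Crawl-delay requested for a given user-agent, in seconds
# (fractions allowed), or undef if none was requested.
sub crawl_delay_for {
    my ($robots_txt, $agent) = @_;
    my (%delay, $current);
    for my $line (split /\r?\n/, $robots_txt) {
        $line =~ s/#.*//;                         # strip comments
        if ($line =~ /^\s*User-agent:\s*(\S+)/i) {
            $current = lc $1;
        }
        elsif (defined $current
            and $line =~ /^\s*Crawl-delay:\s*(\d+(?:\.\d+)?)\s*$/i) {
            $delay{$current} = $1 + 0;            # fractional values accepted
        }
    }
    # prefer a record addressed to us, fall back to the wildcard record
    return defined $delay{lc $agent} ? $delay{lc $agent} : $delay{'*'};
}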

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
On Behalf Of Walter Underwood
Sent: Friday, March 12, 2004 3:37 PM
To: Internet robots, spiders, web-walkers, etc.
Subject: Re: [Robots] Yahoo evolving robots.txt, finally


--On Friday, March 12, 2004 6:46 AM -0800 [EMAIL PROTECTED] wrote:

 I am surprised that after all that talk about adding new semantic 
 elements to robots.txt several years ago, nobody commented that the 
 new Yahoo crawler (former Inktomi crawler) took a brave step in that 
 direction by adding Crawl-delay: syntax.
 
 http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html
 
 Time to update your robots.txt parsers!

No, time to tell Yahoo to go back and do a better job.

Does crawl-delay allow decimals? Negative numbers? Could this spec be a
bit better quality? The words "positive integer" would improve things a
lot.

Sigh. It would have been nice if they'd discussed this on the list
first. crawl-delay is a pretty dumb idea. Any value over one second
means it takes forever to index a site. Ultraseek 
has had a spider throttle option to add this sort of delay, but it is
almost never used, because Ultraseek reads 25 pages from one site, then
moves to another. There are many kinds of rate control.

wunder
--
Walter Underwood
Principal Architect
Verity Ultraseek

___
Robots mailing list
[EMAIL PROTECTED] http://www.mccmedia.com/mailman/listinfo/robots


RE: [Robots] Another approach

2004-01-11 Thread Matthew Meadows
I don't think the explicit names would be required; most robots simply
read the title tag, or infer it from the first portion of clear text,
the content meta tag, or other document attributes.  Anyway, this method
would become quite burdensome for very complicated sites.  I also suspect
the file would become stale rather quickly.
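
(As a crude illustration of how little help a robot needs to infer a
name, something like this is enough for many pages; real crawlers use a
proper HTML parser, but the idea is the same:)

# Naive name inference from a fetched page: the title tag first, then
# the description meta tag, then the first stretch of visible text.
sub infer_name {
    my ($html) = @_;
    return $1 if $html =~ m{<title[^>]*>\s*(.+?)\s*</title>}is;
    return $1 if $html =~
        m{<meta\s+name=["']?description["']?\s+content=["']([^"'>]+)}is;
    (my $text = $html) =~ s/<[^>]*>/ /g;          # crude tag strip
    $text =~ s/\s+/ /g;
    return substr($text, 0, 80);                  # first bit of clear text
}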

I do like the Interval attribute; that makes perfect sense to me.
There's a lot we could do with the same basic concept.  For instance, we
could add a touch date to the file to indicate when the site was last
updated, so that even if the interval has passed robots would not need
to scan the site if they had already done so after the touch date.  Keep
in mind that if robot developers surmise that the touch dates are being
artificially manipulated to keep them out, they'll ignore them.
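
The check a robot would make with those two values is trivial.  A sketch
in Perl (the Touched field is purely hypothetical, and I'm using epoch
seconds to keep it short):

# Decide whether a site is due for another crawl, given:
#   $interval     - requested revisit interval in seconds (from Interval:)
#   $touched      - when the site says it last changed (hypothetical Touched:)
#   $last_crawled - when we last finished crawling it (epoch seconds)
sub needs_recrawl {
    my ($interval, $touched, $last_crawled) = @_;
    return 0 if time() - $last_crawled < $interval;   # interval hasn't passed yet
    return 0 if defined $touched
        and $last_crawled >= $touched;                # nothing new since our last visit
    return 1;
}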

Anybody else interested in the Session attribute?

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On
Behalf Of Fred Atkinson
Sent: Sunday, January 11, 2004 4:38 PM
To: Robots
Subject: [Robots] Another approach


Another idea that has occurred to me is to simply code the
information to be indexed in the robots.txt file.  Then, the robot could
simply suck the information out of the file and be done.

Example:

User-agent: Scooter
Interval: 30d
Disallow: /
Name: Fred's Site
Index: /index.html
Name: My Article
Index: /article/index.html
Name: My Article's FAQs
Index: /article/faq.html

This would tell them to take this information to include in their
search database and move on.

Other ideas?



Fred

___
Robots mailing list
[EMAIL PROTECTED] http://www.mccmedia.com/mailman/listinfo/robots


RE: [Robots] Post

2002-11-08 Thread Matthew Meadows
Regarding this: 

What's there to invent after Google? 

Quite a lot, actually.  Google has built a magnificent search portal 
for the Internet, but there's still room in the market for companies 
like Inktomi, Verity, DTSearch, AltaVista, and dozens of others big 
and small.  The reason is that search is an extremely rich problem 
domain, and different users have different search needs.  Searching 
source code, tagged documents, databases, log files, archives, LDAP 
servers, Usenet, and the Internet is a lot to ask of any single product. 
Google, AllTheWeb, and other free search engines are optimized for 
one aspect of the IR problem domain: returning relevancy-scored results 
for queries against a massive index of web content.  Their business model 
is largely based on selling advertisements that correspond to keywords 
entered into a search page and providing a compelling portal for 
end users to link out to other sites, and the choices they've made in 
their indexing approach reflect that model.  However, many of these 
choices are not necessarily suitable for other aspects of the IR 
problem. 

For instance, most of these indexing algorithms for internet search are 
lossy, and the index administrators (or programmers) have determined the 
depth of the index.  The index relies on stop terms to keep it a manageable 
size, and the result sets include only a fraction of the candidate results 
out of billions, for good reason.  But these kinds of constraints are not 
suitable for source code, log file, or legal document analysis.  Further, 
the types of weightings used in the relevancy scoring are not necessarily the 
same across different document repositories.  For instance, popularity-based 
relevance has little bearing on corporate LANs full of ordinary business 
documents, and whereas keyword and metatag scoring have fallen out of 
favor with free public search engines, they may be very effective parameters 
in scoring a query against a more controlled document repository.  Truly 
creating the most effective index possible requires the index administrator 
or an automated query optimizer to adjust the weightings of a wide range 
of variables that impact the size, depth, and effectiveness of the index. 
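
As a toy illustration of the knob-turning I mean (the factors and
weights below are invented for the example, not anyone's actual ranking
formula):

# Combine per-document evidence into a score using weights the index
# administrator (or an automated optimizer) can tune per repository.
my %weights = (
    keyword_hits => 3.0,    # query terms found in the body text
    metatag_hits => 1.5,    # matches in keywords/description tags
    popularity   => 0.0,    # link popularity: useful on the web, not on a LAN
    recency      => 0.5,    # how fresh the document is
);

# $features is a hashref of factor name => normalized value (0..1).
sub score_document {
    my ($features) = @_;
    my $score = 0;
    for my $factor (keys %weights) {
        $score += $weights{$factor} * ($features->{$factor} || 0);
    }
    return $score;
}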

Consider also vertical searches, indexes optimized for a specific domain. 
A researcher in a particular discipline may benefit from having a clean index 
with a finely-honed affinity to that discipline.  Such indexes allow for 
a tremendous signal-to-noise ratio.  Imagine for example an index specific to 
Genetic Programming that contains daily traffic from message boards, 
Usenet messages and other online content intersected with information from 
your LAN, your inbox, your source code, and other proprietary sources.  
You can achieve an effective depth and breadth of content in such an index 
with far fewer resources than would be required in a less discriminating 
database. 

Finally, don't forget about cost.  The last time I checked, the enterprise 
versions of Google, AltaVista, and Inktomi (as far as I recall) all charged 
an escalating fee that corresponds to the number of documents indexed, a 
licensing model that may drastically increase the TCO of these solutions as the 
end user's business grows. 

I have built a discriminating filer that has most of these capabilities, and 
many more that I can't describe here.  That's why I never post; I've been busy 
working on the project on the side for over three years.  I can reveal more 
about it in the next couple of months after my management decides its level of 
interest in ownership of the code.  

It's good to see the activity on the mailing list today.  I suspect that a lot 
of people that would normally post are just busy working on their own robots, 
or just flat out lucky enough to be working. 


-Original Message- 
From: Paul Maddox [mailto:paulmdx;hotpop.com] 
Sent: Friday, November 08, 2002 3:42 AM 
To: [EMAIL PROTECTED] 
Subject: Re: [Robots] Post 


Hi, 

I'm sure even Google themselves would admit that there's scope for 
improvement.  With Answers, Catalogs, Image Search, News, etc, etc, 
they seem to be quite busy! :-) 

As an AI programmer specialising in NLP, personally I'd like to see 
web bots actually 'understanding' the content they review, rather 
than indexing by brute force.  How about the equivalent of Dmoz or 
Yahoo Directory, but generated by a web spider? 

Paul. 


On Fri, 08 Nov 2002 10:22:48 +0100, Harry Behrens wrote: 
Haven't seen traffic in ages. 
I guess the theme's pretty much dead. 
 
What's there to invent after Google? 
 
-h 
 


___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots



[Robots] Re: Perl and LWP robots

2002-03-07 Thread Matthew Meadows


That's a curious remark about readers and their misplaced desire for
recursive spiders.
A recursive spider allows its user to drill down into a particular
information domain and
ultimately exhaust it if the spider is capable enough.  This is of
enormous benefit to the 
information researcher looking for a complete and accurate view of the
information domain, 
as opposed to the relevancy-scored aggregate data provided by most
search engines.  It may
not be appropriate for all sites or all topics but can certainly provide
an abundant yield
given the proper parameters.
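
For what it's worth, the skeleton of such a spider is quite short.  A
rough sketch using LWP::RobotUA (which takes care of robots.txt and
per-host delays) and HTML::LinkExtor; the start URL, agent name, and
contact address are placeholders:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::RobotUA;
use HTML::LinkExtor;
use URI;

# Skeleton of a polite recursive spider confined to a single host.
my $start = URI->new('http://www.example.com/');
my $ua    = LWP::RobotUA->new('ExampleSpider/0.1', 'webmaster@example.com');
$ua->delay(1);                          # at most one request per minute per host

my (%seen, @queue);
push @queue, $start;

while (my $url = shift @queue) {
    next if $seen{$url}++;                            # already handled this URL
    next unless ($url->scheme || '') =~ /^https?$/;   # skip mailto:, ftp:, etc.
    next unless $url->host eq $start->host;           # don't span hosts

    my $resp = $ua->get($url);                        # honors robots.txt and the delay
    next unless $resp->is_success
        and $resp->content_type eq 'text/html';

    # Pull out absolute links for the next round.
    my $extor = HTML::LinkExtor->new(undef, $url);
    $extor->parse($resp->content);
    for my $link ($extor->links) {
        my ($tag, %attr) = @$link;
        push @queue, URI->new($attr{href}) if $tag eq 'a' and $attr{href};
    }

    # ... hand $resp off to the indexer here ...
}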

-Original Message-
From: Sean M. Burke [mailto:[EMAIL PROTECTED]]
Sent: Thursday, March 07, 2002 3:51 AM
To: [EMAIL PROTECTED]
Subject: [Robots] Perl and LWP robots



Hi all!
My name is Sean Burke, and I'm writing a book for O'Reilly, which is to 
basically replace Clinton Wong's now out-of-print /Web Client 
Programming with Perl/.  In my book draft so far, I haven't discussed 
actual recursive spiders (I've only discussed getting a given page, and 
then every page that it links to which is also on the same host), since 
I think that most readers who think they want a recursive spider really 
don't.
But it has been suggested that I cover recursive spiders, just for the 
sake of completeness.

Aside from basic concepts (don't hammer the server; always obey the 
robots.txt; don't span hosts unless you are really sure that you want 
to), are there any particular bits of wisdom that list members would 
want me to pass on to my readers?

--
Sean M. Burke    [EMAIL PROTECTED]    http://www.spinn.net/~sburke/


--
This message was sent by the Internet robots and spiders discussion list
([EMAIL PROTECTED]).  For list server commands, send help in the
body of a message to [EMAIL PROTECTED].
