RE: [Robots] Yahoo evolving robots.txt, finally
I'm standing firm on my suggestions. Adding a delay for crawlers is a good idea in concept, and allowing fractional seconds gives webmasters a way to request reasonable constraints. Is it such a stretch to allow a robot that you use to promote your business unmitigated access to your site, but require other robots to throttle down to a few pages per second?

As for preferred scanning windows, many organizations see a huge surge of traffic from customers during their normal operating hours but are relatively calm otherwise. Asking robots to scan only outside of peak hours is a nice compromise between keeping them out entirely and keeping them out when you're too busy serving pages to human readers.

I just read Walter's response to this thread, and he mentions bytes-per-day and pages-per-day limits. Those are fine in the abstract and may be helpful. But if a robot is limited to 100MB a day and decides to take it all in one burst during your peak traffic hours, then volume limits alone are not sufficient. (A sketch of what such directives might look like follows at the end of this message.)

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of [EMAIL PROTECTED]
Sent: Saturday, March 13, 2004 4:31 AM
To: Internet robots, spiders, web-walkers, etc.
Subject: RE: [Robots] Yahoo evolving robots.txt, finally

--- Matthew Meadows [EMAIL PROTECTED] wrote:

> I agree with Walter.

So do I, partially. :)

> There are a lot of variables that should have been considered for this
> new value. If nothing else, the specification should have called for
> the time in milliseconds, or otherwise allowed for fractional seconds.

I disagree that that level of granularity is needed. See my earlier email.

> In addition, it seems a bit presumptuous for Yahoo to think that they
> can force a de facto standard just by implementing it first.

That's how things work in real life. Think of web browsers 10 years ago and the various Netscape, then IE, extensions. Now lots of them are considered standard.

> With this line of thinking, webmasters would eventually be required to
> update their robots.txt file for dozens of individual bots.

In theory, yes. In reality, I agree with Walter: this extension will prove to be as useless as blink, and will therefore not be supported by any big crawlers.

> It's hard enough to get them to do it now for the general case; this
> additional fragmentation is not going to make anybody's job easier.
> Is Google going to implement their own extensions, then MSN,
> AltaVista, and AllTheWeb?

Not likely. In order to remain competitive, they have to keep fetching web pages at high rates. robots.txt only limits them. I can't think of an extension to robots.txt that would let them do a better job. Actually, I can. :)

> Finally, if we're going to start specifying the criteria for
> scheduling, let's consider some other alternatives, like preferred
> scanning windows.

Same as Crawl-delay: everyone would want crawlers to visit their sites at night, which would saturate crawlers' networks, so search engines won't push that extension. (Actually, big crawlers run from multiple points around the planet, so maybe my statement is flawed.)

Otis

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Walter Underwood
Sent: Friday, March 12, 2004 3:37 PM
To: Internet robots, spiders, web-walkers, etc.
Subject: Re: [Robots] Yahoo evolving robots.txt, finally

--On Friday, March 12, 2004 6:46 AM -0800 [EMAIL PROTECTED] wrote:

> I am surprised that after all that talk about adding new semantic
> elements to robots.txt several years ago, nobody commented that the
> new Yahoo crawler (the former Inktomi crawler) took a brave step in
> that direction by adding Crawl-delay: syntax.
>
> http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html
>
> Time to update your robots.txt parsers!

No, time to tell Yahoo to go back and do a better job. Does Crawl-delay allow decimals? Negative numbers? Could this spec be a bit better quality? The words "positive integer" would improve things a lot. Sigh. It would have been nice if they'd discussed this on the list first.

Crawl-delay is a pretty dumb idea. Any value over one second means it takes forever to index a site. Ultraseek has had a spider throttle option to add this sort of delay, but it is almost never used, because Ultraseek reads 25 pages from one site, then moves to another. There are many kinds of rate control.

wunder
--
Walter Underwood
Principal Architect
Verity Ultraseek
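To make the proposals above concrete, here is a minimal sketch in Python of a fetch loop that honors a fractional Crawl-delay and a hypothetical "Visit-time" window directive. Crawl-delay is Yahoo's actual extension; Visit-time, the field names, and the parsing are illustrative assumptions, not part of any standard.

# Sketch only: honors a fractional Crawl-delay plus a hypothetical
# "Visit-time: HHMM-HHMM" window. Visit-time is an invented directive
# used here to illustrate the scanning-window idea; it is not standard.
import time
from datetime import datetime, timezone

ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 0.5
Visit-time: 0100-0500
"""

def parse_fields(text):
    """Collect 'Field: value' pairs, ignoring blank and comment lines."""
    fields = {}
    for line in text.splitlines():
        if ":" in line and not line.lstrip().startswith("#"):
            name, value = line.split(":", 1)
            fields[name.strip().lower()] = value.strip()
    return fields

def inside_window(fields, now=None):
    """True when the current UTC time falls inside the Visit-time window."""
    window = fields.get("visit-time")
    if window is None:
        return True                      # no window requested
    start, end = window.split("-")
    hhmm = (now or datetime.now(timezone.utc)).strftime("%H%M")
    return start <= hhmm <= end

fields = parse_fields(ROBOTS_TXT)
delay = float(fields.get("crawl-delay", "0"))    # fractional seconds allowed
for url in ["/index.html", "/article/index.html"]:
    if not inside_window(fields):
        break                            # outside the preferred window
    print("fetching", url)
    time.sleep(delay)                    # throttle between requests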
RE: [Robots] Yahoo evolving robots.txt, finally
I agree with Walter. There are a lot of variables that should have been considered for this new value. If nothing else, the specification should have called for the time in milliseconds, or otherwise allowed for fractional seconds.

In addition, it seems a bit presumptuous for Yahoo to think that they can force a de facto standard just by implementing it first. With this line of thinking, webmasters would eventually be required to update their robots.txt file for dozens of individual bots. It's hard enough to get them to do it now for the general case; this additional fragmentation is not going to make anybody's job easier. Is Google going to implement their own extensions, then MSN, AltaVista, and AllTheWeb?

Finally, if we're going to start specifying the criteria for scheduling, let's consider some other alternatives, like preferred scanning windows. (See the parser example following this message.)

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Walter Underwood
Sent: Friday, March 12, 2004 3:37 PM
To: Internet robots, spiders, web-walkers, etc.
Subject: Re: [Robots] Yahoo evolving robots.txt, finally

--On Friday, March 12, 2004 6:46 AM -0800 [EMAIL PROTECTED] wrote:

> I am surprised that after all that talk about adding new semantic
> elements to robots.txt several years ago, nobody commented that the
> new Yahoo crawler (the former Inktomi crawler) took a brave step in
> that direction by adding Crawl-delay: syntax.
>
> http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html
>
> Time to update your robots.txt parsers!

No, time to tell Yahoo to go back and do a better job. Does Crawl-delay allow decimals? Negative numbers? Could this spec be a bit better quality? The words "positive integer" would improve things a lot. Sigh. It would have been nice if they'd discussed this on the list first.

Crawl-delay is a pretty dumb idea. Any value over one second means it takes forever to index a site. Ultraseek has had a spider throttle option to add this sort of delay, but it is almost never used, because Ultraseek reads 25 pages from one site, then moves to another. There are many kinds of rate control.

wunder
--
Walter Underwood
Principal Architect
Verity Ultraseek
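As it happens, Walter's "positive integer" complaint anticipates how at least one stock parser later came to behave: Python's standard-library urllib.robotparser (3.6 and later) exposes the directive through crawl_delay(), and CPython's implementation accepts only integer values, so a fractional delay is silently dropped. A small demonstration, for reference:

# How a stock parser surfaces Crawl-delay: Python's stdlib
# urllib.robotparser (3.6+). Note that CPython's parser accepts only
# integer values; a fractional delay would simply be ignored, which is
# exactly the kind of under-specification being debated here.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse("""\
User-agent: Slurp
Crawl-delay: 2
Disallow: /cgi-bin/
""".splitlines())
rp.modified()   # mark the rules as freshly loaded so queries are answered

print(rp.crawl_delay("Slurp"))                 # -> 2
print(rp.can_fetch("Slurp", "/cgi-bin/env"))   # -> False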
RE: [Robots] Another approach
I don't think the explicit names would be required; most robots simply read the title tag, or infer it from the first portion of clear text, the content meta tag, or other document attributes. Anyway, this method would become quite burdensome for very complicated sites, and I suspect the file would become stale rather quickly.

I do like the Interval attribute; that makes perfect sense to me. There's a lot we could do with the same basic concept. For instance, we could add a touch date to the file to indicate when the site was last updated, so that even if the interval has passed, robots would not need to scan the site if they had already done so after the touch date. (A sketch of this check follows the quoted message below.) Keep in mind that if robot developers surmise that the touch dates are being artificially manipulated to keep them out, they'll ignore them.

Anybody else interested in the Session attribute?

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Fred Atkinson
Sent: Sunday, January 11, 2004 4:38 PM
To: Robots
Subject: [Robots] Another approach

Another idea that has occurred to me is to simply code the information to be indexed in the robots.txt file. Then the robot could simply suck the information out of the file and be done. Example:

User-agent: Scooter
Interval: 30d
Disallow: /

Name: Fred's Site
Index: /index.html

Name: My Article
Index: /article/index.html

Name: My Article's FAQs
Index: /article/faq.html

This would tell them to take this information, include it in their search database, and move on.

Other ideas?

Fred
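Here is the touch-date idea from the reply above expressed as code. "Touch-date" is a hypothetical field, and the rule itself is an illustrative assumption, not part of any robots.txt standard: re-scan only when Fred's Interval has elapsed and the site has been touched since the last visit.

# Sketch of the proposed Interval + touch-date logic. "Touch-date" is a
# hypothetical field, used here only to illustrate the freshness check.
from datetime import datetime, timedelta

def should_recrawl(last_crawl, touch_date, interval, now=None):
    """Re-scan only when the interval has elapsed AND the site has been
    touched since our last visit; otherwise the crawl would be wasted."""
    now = now or datetime.utcnow()
    if now - last_crawl < interval:
        return False                  # honor the requested revisit interval
    return touch_date > last_crawl    # unchanged since last visit: skip

# Interval elapsed, but the site was last touched before our last crawl,
# so a polite robot stays away.
print(should_recrawl(last_crawl=datetime(2004, 1, 10),
                     touch_date=datetime(2004, 1, 5),
                     interval=timedelta(days=30),
                     now=datetime(2004, 3, 1)))   # -> False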
RE: [Robots] Post
Regarding this:

> What's there to invent after Google?

Quite a lot, actually. Google has built a magnificent search portal for the Internet, but there's still room in the market for companies like Inktomi, Verity, DTSearch, AltaVista, and dozens of others big and small. The reason is that search is an extremely rich problem domain, and different users have different search needs. Searching source code, tagged documents, databases, log files, archives, LDAP servers, Usenet, and the Internet is a lot to ask of any single product.

Google, AllTheWeb, and other free search engines are optimized for one aspect of the IR problem domain: returning relevancy-scored results for queries against a massive index of web content. Their business model is largely based on selling advertisements that correspond to keywords entered into a search page and on providing a compelling portal for end users to link out to other sites, and the choices they've made in their indexing approach reflect that model.

However, many of these choices are not necessarily suitable for other aspects of the IR problem. For instance, most of these indexing algorithms for Internet search are lossy, and the index administrators (or programmers) have determined the depth of the index. The index relies on stop terms to keep it a manageable size, and the result sets return only a fraction of the matches from an index of billions of documents, for good reason. But these kinds of constraints are not suitable for source code, log file, or legal document analysis.

Further, the types of weightings used in relevancy scoring are not necessarily the same across different document repositories. For instance, popularity-based relevance has little bearing on corporate LANs full of ordinary business documents, and whereas keyword and metatag scoring have fallen out of favor with free public search engines, they may be very effective parameters in scoring a query against a more controlled document repository. To truly create the most effective index possible requires the index administrator, or an automated query optimizer, to adjust the weightings of a wide range of variables that impact the size, depth, and effectiveness of the index. (A toy illustration of this point appears below.)

Consider also vertical searches: indexes optimized for a specific domain. A researcher in a particular discipline may benefit from having a clean index with a finely honed affinity to that discipline. Such indexes allow for a tremendous signal-to-noise ratio. Imagine, for example, an index specific to Genetic Programming that contains daily traffic from message boards, Usenet messages, and other online content, intersected with information from your LAN, your inbox, your source code, and other proprietary sources. You can achieve an effective depth and breadth of content in such an index with far fewer resources than would be required for a less discriminating database.

Finally, don't forget about cost. Last time I checked, the enterprise versions of Google, AltaVista, and Inktomi all charged (as far as I recall) an escalating fee that corresponds to the number of documents indexed, a licensing model that may drastically increase the TCO of these solutions as the end user's business grows.

I have built a discriminating filter that has most of these capabilities, and many more that I can't describe here. That's why I never post: I've been busy working on the project on the side for over three years. I can reveal more about it in the next couple of months, after my management decides its level of interest in ownership of the code.
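To put a number on the weighting argument: the same per-document signals produce very different rankings depending on the weights an administrator (or an automated optimizer) assigns. The signal names and weights below are invented purely for illustration.

# Toy illustration: the same per-document signals, combined under two
# different weightings. All names and numbers are invented.
def score(signals, weights):
    return sum(weights[name] * value for name, value in signals.items())

doc = {"keyword_hits": 0.9, "metatag_match": 0.7, "link_popularity": 0.1}

# Public web engine: popularity dominates, metatags distrusted.
web = {"keyword_hits": 0.2, "metatag_match": 0.0, "link_popularity": 0.8}
# Controlled corporate repository: keywords and metatags carry weight.
intranet = {"keyword_hits": 0.6, "metatag_match": 0.3, "link_popularity": 0.1}

print(round(score(doc, web), 2))        # -> 0.26: ranks poorly on the web
print(round(score(doc, intranet), 2))   # -> 0.76: ranks well on an intranet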
It's good to see the activity on the mailing list today. I suspect that a lot of the people who would normally post are just busy working on their own robots, or just flat-out lucky enough to be working.

-----Original Message-----
From: Paul Maddox [mailto:paulmdx;hotpop.com]
Sent: Friday, November 08, 2002 3:42 AM
To: [EMAIL PROTECTED]
Subject: Re: [Robots] Post

Hi,

I'm sure even Google themselves would admit that there's scope for improvement. With Answers, Catalogs, Image Search, News, etc., they seem to be quite busy! :-)

As an AI programmer specialising in NLP, personally I'd like to see web bots actually 'understanding' the content they review, rather than indexing by brute force. How about the equivalent of Dmoz or Yahoo Directory, but generated by a web spider?

Paul.

On Fri, 08 Nov 2002 10:22:48 +0100, Harry Behrens wrote:

> Haven't seen traffic in ages. I guess the theme's pretty much dead.
> What's there to invent after Google?
>
> -h
[Robots] Re: Perl and LWP robots
That's a curious remark about readers and their misplaced desire for recursive spiders. A recursive spider allows its user to drill down into a particular information domain and ultimately exhaust it, if the spider is capable enough. This is of enormous benefit to the information researcher looking for a complete and accurate view of the information domain, as opposed to the relevancy-scored aggregate data provided by most search engines. It may not be appropriate for all sites or all topics, but it can certainly provide an abundant yield given the proper parameters. (A minimal example follows the quoted message below.)

-----Original Message-----
From: Sean M. Burke [mailto:[EMAIL PROTECTED]]
Sent: Thursday, March 07, 2002 3:51 AM
To: [EMAIL PROTECTED]
Subject: [Robots] Perl and LWP robots

Hi all! My name is Sean Burke, and I'm writing a book for O'Reilly, which is basically to replace Clinton Wong's now out-of-print /Web Client Programming with Perl/.

In my book draft so far, I haven't discussed actual recursive spiders (I've only discussed getting a given page, and then every page that it links to which is also on the same host), since I think that most readers who think they want a recursive spider really don't. But it has been suggested that I cover recursive spiders, just for the sake of completeness.

Aside from the basic concepts (don't hammer the server; always obey robots.txt; don't span hosts unless you are really sure that you want to), are there any particular bits of wisdom that list members would want me to pass on to my readers?

--
Sean M. Burke    [EMAIL PROTECTED]    http://www.spinn.net/~sburke/
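Since Sean asked for bits of wisdom, here is his politeness checklist expressed as code: a minimal same-host recursive spider, sketched in Python rather than Perl/LWP. It obeys robots.txt, stays on one host, caps the page count, and sleeps between requests. The start URL, delay, and page limit are placeholders, and the sketch deliberately omits the error handling and URL canonicalization a production spider would need.

# Minimal same-host recursive spider: obey robots.txt, don't span hosts,
# don't hammer the server. START, DELAY, and MAX_PAGES are placeholders.
import time
import urllib.parse
import urllib.request
import urllib.robotparser
from html.parser import HTMLParser

START = "http://example.com/"   # placeholder start page
DELAY = 1.0                     # seconds between requests: don't hammer
MAX_PAGES = 50                  # hard stop for the sketch

class LinkParser(HTMLParser):
    """Collect href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

host = urllib.parse.urlsplit(START).netloc
rp = urllib.robotparser.RobotFileParser(
    urllib.parse.urljoin(START, "/robots.txt"))
rp.read()                       # always fetch and obey robots.txt first

seen, queue = set(), [START]
while queue and len(seen) < MAX_PAGES:
    url = queue.pop()
    if url in seen or not rp.can_fetch("*", url):
        continue
    seen.add(url)
    try:
        with urllib.request.urlopen(url) as resp:
            if resp.headers.get_content_type() != "text/html":
                continue        # only parse HTML pages for links
            html = resp.read().decode("utf-8", "replace")
    except OSError:
        continue                # skip unreachable pages
    parser = LinkParser()
    parser.feed(html)
    for link in parser.links:
        absolute = urllib.parse.urljoin(url, link)
        if urllib.parse.urlsplit(absolute).netloc == host:  # don't span hosts
            queue.append(absolute.split("#")[0])
    time.sleep(DELAY)           # politeness delay between fetches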