RE: [Robots] Robots.txt Evolution?
> This would need to be a separate file, probably "robots2.txt".

People who are oblivious to the robots.txt standard already complain about robots.txt being fetched, and would no doubt complain a little more about a second file being fetched as well. Besides, who wants to maintain two files that do the same job? In my view, if it were in another file, an extended standard would never be used, and then yes, there would be no point.

/pt

___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots
[Robots] This email address is no longer in use.
This email address is no longer in use. If you need to contact me, please call (07973) 172650
Re: [Robots] Robots.txt Evolution?
I'm inclined to agree that a second file would probably get overlooked by bots. I imagine it was difficult enough getting those who run them to respect the first one.

I was unaware of the 'Allow' command. Is there a URL that documents it?

Also, the use of wildcards when giving paths should be incorporated. That would greatly reduce the number of path lines you'd have to type into the robots.txt file, and that a robot would have to read out of it. And wildcards shouldn't be limited to just the end of the path; you should be able to use them in the middle as well. Perhaps the UNIX brackets could be incorporated ([0-9], [aeiou]) as matching characters.

As far as the number of sites that actually use robots.txt goes, that would grow as the strength of the robots.txt coding improved.

Fred

----- Original Message -----
From: "Paul Trippett" <[EMAIL PROTECTED]>
To: "'Internet robots, spiders, web-walkers, etc.'" <[EMAIL PROTECTED]>
Sent: Sunday, January 11, 2004 8:57 AM
Subject: RE: [Robots] Robots.txt Evolution?

> People who are oblivious to the robots.txt standard already complain
> about grabbing robots.txt and would no doubt complain a little more
> about getting another file.
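As a sketch of Fred's idea: nothing like this exists in the robots.txt standard, but Python's fnmatch module already understands exactly the shell-style '*' wildcards and '[0-9]'-style bracket classes he describes, so a robot could translate such an extended path pattern into a regular expression. `path_matches` is a hypothetical helper invented here for illustration.

```python
# Hypothetical sketch only: extended wildcard paths are NOT part of the
# robots.txt standard. fnmatch.translate turns a shell-style pattern
# (including mid-path '*' and '[0-9]' classes) into a regex.
import fnmatch
import re

def path_matches(pattern: str, path: str) -> bool:
    """True if the shell-style pattern matches the URL path.

    Unlike filename globbing, '*' here is allowed to cross '/',
    which is what fnmatch.translate (using '.*') produces anyway.
    """
    return re.match(fnmatch.translate(pattern), path) is not None

# One pattern covers what would otherwise be many Disallow lines:
assert path_matches("/archive/[0-9]*/draft*", "/archive/2004/draft-01.html")
assert not path_matches("/archive/[0-9]*/draft*", "/archive/old/draft.html")
```

A robot could apply the same translation to every path in an extended Disallow line, so site authors would write one pattern instead of enumerating every matching directory.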
RE: [Robots] Robots.txt Evolution?
You brought up an interesting point, the lack of sites using robots.txt, but I see two related problems:

1) Most sites do not deploy or properly maintain a robots.txt file.
2) Many robots (nascent, malicious, stealthy, or otherwise) do not honor the robots.txt file.

If there is to be a successor to robots.txt, it should be sophisticated enough to improve both of these problems. It's only in the best interest of the robot developer to honor the file if it mitigates link funneling or undesirable files. It's only in the best interest of the site developer if they want portions of their site indexed or omitted. Here's a proposal that I think helps solve both aspects of the problem.

By link funneling I'm referring to links that contain random session identifiers, causing the same pages to be served up perpetually with different anchor tags. Most robots could benefit from a line that identified these types of URLs by their session identifier, for example: session=jsessionid. This simple enhancement would benefit both the robot developers and the site developers. The robots would no longer need to identify these URLs by manual or automated profiling; they could simply extract the session identifier from links that matched the mask. The site developers would prevent the useless traffic that's presently involved in inferring the random session identifiers.

Extending this idea, perhaps the specification could allow robots to substitute their own agent names for the session identifiers. This would allow for a loose type of referral tracking. As a side effect, it would also cause robots that spoof their agent names to implicate competing robots.

-----Original Message-----
From: [EMAIL PROTECTED] On Behalf Of Fred Atkinson
Sent: Sunday, January 11, 2004 10:45 AM
To: Internet robots, spiders, web-walkers, etc.
Subject: Re: [Robots] Robots.txt Evolution?

> I'm inclined to agree that a second file would probably get overlooked
> by bots.
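The session-identifier idea above can be sketched in a few lines. This is purely hypothetical: no "session" mask exists in any robots.txt standard, and `strip_session` is an invented helper; only the jsessionid example comes from the email.

```python
# Hypothetical sketch of the proposed session-identifier mask.
# Given the declared parameter name (e.g. "jsessionid"), a robot can
# canonicalize URLs so pages differing only in session id collapse.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def strip_session(url: str, session_param: str = "jsessionid") -> str:
    """Remove the named session parameter from path and query."""
    parts = urlsplit(url)
    # jsessionid commonly rides on the path (";jsessionid=...") as
    # well as in the query string; handle both forms.
    path = parts.path.split(";" + session_param + "=")[0]
    query = urlencode([(k, v) for k, v in parse_qsl(parts.query)
                       if k.lower() != session_param])
    return urlunsplit((parts.scheme, parts.netloc, path, query,
                       parts.fragment))

a = strip_session("http://example.com/page;jsessionid=ABC123?x=1")
b = strip_session("http://example.com/page;jsessionid=XYZ999?x=1")
assert a == b == "http://example.com/page?x=1"
```

With URLs canonicalized this way, the robot fetches each page once instead of once per session id, which is exactly the "link funneling" traffic the proposal aims to eliminate.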
Re: [Robots] Robots.txt Evolution?
--On Sunday, January 11, 2004 11:44 AM -0500 Fred Atkinson <[EMAIL PROTECTED]> wrote:

> I was unaware of the 'Allow' command. Is there a URL that documents it?

The Allow directive is non-standard. Don't use it.

wunder
--
Walter Underwood
Principal Architect
Verity Ultraseek
[Robots] Another approach
Another idea that has occurred to me is to simply code the information to be indexed in the robots.txt file. Then the robot could simply suck the information out of the file and be done. Example:

    User-agent: Scooter
    Interval: 30d
    Disallow: /
    Name: Fred's Site
    Index: /index.html
    Name: My Article
    Index: /article/index.html
    Name: My Article's FAQs
    Index: /article/faq.html

This would tell them to take this information to include in their search database and move on. Other ideas?

Fred
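A robot-side reading of Fred's hypothetical record might look like the sketch below. Every field here (Interval, Name, Index) is a proposal from the email above, not part of the real standard, and `parse_extended` is an invented illustration.

```python
# Sketch of a parser for the proposed extended record; all field
# names beyond User-agent/Disallow are hypothetical extensions.
EXAMPLE = """\
User-agent: Scooter
Interval: 30d
Disallow: /
Name: Fred's Site
Index: /index.html
Name: My Article
Index: /article/index.html
"""

def parse_extended(text):
    record = {"disallow": [], "pages": []}
    name = None
    for line in text.splitlines():
        if ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            record["user-agent"] = value
        elif field == "interval":
            record["interval"] = value
        elif field == "disallow":
            record["disallow"].append(value)
        elif field == "name":
            name = value                     # held for the next Index:
        elif field == "index":
            record["pages"].append((name, value))
        # anything else is ignored, per the original spec's rule
    return record

rec = parse_extended(EXAMPLE)
assert rec["interval"] == "30d"
assert ("My Article", "/article/index.html") in rec["pages"]
```

Note the pairing convention assumed here: each Name: applies to the Index: line that follows it, which is one plausible reading of the example.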
RE: [Robots] Another approach
I don't think the explicit names would be required; most robots simply read the title tag, or infer it from the first portion of clear text, the content meta tag, or other document attributes. Anyway, this method would become quite burdensome for very complicated sites, and I suspect the file would become stale rather quickly.

I do like the Interval attribute; that makes perfect sense to me. There's a lot we could do with the same basic concept. For instance, we could add a touch date to the file to indicate when the site was last updated, so that even if the interval has passed, robots would not need to scan the site if they had already done so after the touch date. Keep in mind that if robot developers surmise that the touch dates are being artificially manipulated to keep them out, they'll ignore them.

Anybody else interested in the Session attribute?

-----Original Message-----
From: [EMAIL PROTECTED] On Behalf Of Fred Atkinson
Sent: Sunday, January 11, 2004 4:38 PM
To: Robots
Subject: [Robots] Another approach

> Another idea that has occurred to me is to simply code the information
> to be indexed in the robots.txt file.
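The interval-plus-touch-date decision described above reduces to two comparisons. This is a sketch under the assumption of hypothetical "Interval:" and touch-date fields; neither exists in the real standard, and `should_revisit` is invented for illustration.

```python
# Sketch of the revisit rule: wait out the interval, and even then
# only re-scan if the site was touched after our last crawl.
from datetime import datetime, timedelta

def should_revisit(last_crawl: datetime, now: datetime,
                   interval: timedelta, touched: datetime) -> bool:
    if now - last_crawl < interval:
        return False              # the site asked us to wait this long
    return touched > last_crawl   # untouched since our crawl: skip it

now = datetime(2004, 2, 15)
last = datetime(2004, 1, 11)
# Interval passed, but the touch date predates our last crawl: skip.
assert not should_revisit(last, now, timedelta(days=30),
                          touched=datetime(2004, 1, 1))
# Interval passed and the site changed since: revisit.
assert should_revisit(last, now, timedelta(days=30),
                      touched=datetime(2004, 2, 1))
```

This also makes the manipulation concern concrete: a site that artificially back-dates or freezes the touch date simply never satisfies the second test, which is why robots would start ignoring suspect values.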
Re: [Robots] Another approach
It was thus said that the Great Matthew Meadows once stated:

> I do like the Interval attribute, that makes perfect sense to me.
> There's a lot we could do with the same basic concept. For instance, we
> could add a touch date to the file to indicate when the site was last
> updated, so that even if the interval has passed robots would not need

Then there is the issue of making sure the robots.txt file is updated with the new timestamp each time the site is updated, and I suspect this step may be ignored or forgotten unless it's automated.

> Anybody else interested in the Session attribute?

What's the Session attribute?

-spc (http://www.conman.org/people/spc/robots2.html)
RE: [Robots] Another approach
You're right about the tendency to ignore the touch date, and in a more general sense, the tendency for robots.txt to become stale without automation.

As for the Session attribute, I made reference to it in a prior email (albeit without the capitalization):

> By link funneling I'm referring to links that contain random session
> identifiers, causing the same pages to be served up perpetually with
> different anchor tags. Most robots could benefit from a line that
> identified these types of URLs by their session identifier, for
> example: session=jsessionid. This simple enhancement would benefit both
> the robot developers and the site developers. The robots would no
> longer need to identify these URLs by manual or automated profiling;
> they could simply extract the session identifier from links that
> matched the mask. The site developers would prevent the useless traffic
> that's presently involved in inferring the random session identifiers.

-----Original Message-----
From: [EMAIL PROTECTED] On Behalf Of Sean 'Captain Napalm' Conner
Sent: Sunday, January 11, 2004 6:05 PM
To: [EMAIL PROTECTED]
Subject: Re: [Robots] Another approach

> Anybody else interested in the Session attribute?

What's the Session attribute?

-spc (http://www.conman.org/people/spc/robots2.html)
Re: [Robots] Another approach
It was thus said that the Great Matthew Meadows once stated:

> By link funneling I'm referring to links that contain random session
> identifiers, causing the same pages to be served up perpetually with
> different anchor tags. Most robots could benefit from a line that
> identified these types of URLs by their session identifier, for
> example: session=jsessionid.

I'm beginning to think that to really extend the robots exclusion protocol, two new directives need to be defined, Ignore: and Index:, that work similarly to Disallow: and Allow: but allow regular expressions to be used. An up-to-date robot could internally convert:

    Disallow: /sooperceecret/

to the equivalent:

    Ignore: ^/sooperceecret/.*

So, to avoid session ids, one could just do:

    Ignore: .*sessionid=.*

And there you go. Using the different directives makes it backwards compatible with the original robots.txt (where an older robot will ignore the new directives) and without overloading the meaning of existing directives (one of the downsides of my own proposed extension).

So for my hypothetical robots.txt site I mentioned in a previous post, I could do:

    User-agent: *
    Index: ^/$
    Disallow: /

or even:

    User-agent: *
    Index: ^/$
    Ignore: .*

-spc (Likes it ...)
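Sean's proposed directives can be sketched as follows. The Ignore:/Index: names come from his email, but the evaluation order (an Index: match wins over Ignore:) is an assumption made here for illustration, and both helper functions are invented.

```python
# Sketch of the proposed Ignore:/Index: regex directives (hypothetical
# extensions, not part of the robots.txt standard).
import re

def disallow_to_ignore(prefix: str) -> str:
    """Convert a classic Disallow: prefix into an equivalent Ignore: regex."""
    return "^" + re.escape(prefix) + ".*"

# The converted pattern matches everything under the old prefix:
assert re.match(disallow_to_ignore("/sooperceecret/"),
                "/sooperceecret/page.html")

def allowed(path, index_patterns, ignore_patterns) -> bool:
    """Assumed precedence: Index: overrides Ignore:."""
    if any(re.match(p, path) for p in index_patterns):
        return True
    return not any(re.match(p, path) for p in ignore_patterns)

# Index the front page only; ignore session-id URLs and everything else:
index = [r"^/$"]
ignore = [r".*sessionid=.*", r".*"]
assert allowed("/", index, ignore)
assert not allowed("/page?jsessionid=ABC", index, ignore)
```

The backward-compatibility claim rests on the original spec's "unrecognised headers are ignored" rule: an older robot that honors that rule sees only the User-agent: and Disallow: lines it already understands.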
Re: [Robots] Another approach
--On Sunday, January 11, 2004 8:13 PM -0500 Sean 'Captain Napalm' Conner <[EMAIL PROTECTED]> wrote:

> And there you go. Using the different directives makes it backwards
> compatible with the original robots.txt (where an older robot will
> ignore the new directives) and without overloading the meaning of
> existing directives (one of the downsides of my own proposed extension).

No, it does not make it backwards compatible. It makes it an illegal robots.txt file. Parsers built to ignore unknown directives would still be able to use it. Parsers not built that way would not be able to parse the file, and would probably miss all the legal directives as well as the non-standard ones.

I mentioned the Internet robustness principle before, but folks seem to have missed it. It is:

    Be conservative in what you send, liberal in what you accept.

In our case, the contents of the robots.txt file are "sent". By the robustness principle, we must not add extra stuff on the assumption that the parsers can deal with it. Because the original format does not have a version number, there is no way to change the format safely.

wunder
--
Walter Underwood
Principal Architect
Verity Ultraseek
RE: [Robots] Another approach
So what you're saying is that if you come across a robots.txt file where someone has made a spelling mistake, such as...

    Disallow: /intranet
    Dislalow: /private

...you render the robots.txt file corrupt and start indexing even the correctly defined Disallow directives? If not, what's the difference between that misspelt Disallow: directive and one spelt Ignore:?

/pt

-----Original Message-----
From: [EMAIL PROTECTED] On Behalf Of Walter Underwood
Sent: 12 January 2004 03:48
To: Internet robots, spiders, web-walkers, etc.
Subject: Re: [Robots] Another approach

> No, it does not make it backwards compatible. It makes it an illegal
> robots.txt file.
Re: [Robots] Another approach
It was thus said that the Great Walter Underwood once stated:

> No, it does not make it backwards compatible. It makes it an illegal
> robots.txt file. Parsers built to ignore unknown directives would still
> be able to use it. Parsers not built that way would not be able to
> parse the file, and would probably miss all the legal directives as
> well as the non-standard ones.

That would be a pretty poor parser, and besides, from the spec itself (http://www.robotstxt.org/wc/norobots.html):

    The file consists of one or more records separated by one or more
    blank lines (terminated by CR, CR/NL, or NL). Each record contains
    lines of the form "<field>:<optionalspace><value><optionalspace>".
    The field name is case insensitive.

    Comments can be included in file using UNIX bourne shell conventions:
    the '#' character is used to indicate that preceding space (if any)
    and the remainder of the line up to the line termination is
    discarded. Lines containing only a comment are discarded completely,
    and therefore do not indicate a record boundary.

    The record starts with one or more User-agent lines, followed by one
    or more Disallow lines, as detailed below. Unrecognised headers are
    ignored.

Right there---last line---"Unrecognised headers are ignored."

Besides, it's a bit more work to *not* ignore unrecognized directives than it is to ignore them:

    while (fgets(line, sizeof(line), fprobots) != NULL)
    {
      if (strncasecmp(line, "user-agent:", 11) == 0)
        ;  /* we have a user-agent */
      else if (strncasecmp(line, "disallow:", 9) == 0)
        ;  /* we have a disallow */
      else
        ;  /* a comment or an unrecognized directive: ignore it */
    }

> I mentioned the Internet robustness principle before, but folks seem to
> have missed it. It is:
>
>     Be conservative in what you send, liberal in what you accept.
>
> In our case, the contents of the robots.txt file are "sent". By the
> robustness principle, we must not add extra stuff on the assumption
> that the parsers can deal with it.

By the same token, it is the robots.txt parser that "accepts" the robots.txt file, so by the robustness principle, you need to ignore directives you don't understand.

-spc (And be thankful Martijn didn't decide to use RFC-822 style header lines ...)