RE: [Robots] Robots.txt Evolution?

2004-01-11 Thread Paul Trippett

> This would need to be a separate file, probably "robots2.txt".

People who are oblivious to the robots.txt standard already complain
about robots grabbing robots.txt, and I would think that fetching another
file would make them complain even more. Besides, who wants to maintain
two files that do the same job? In my view, if it were in another file, an
extended standard would never be used, and then yes, there would be no
point.

/pt


___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


[Robots] This email address is no longer in use.

2004-01-11 Thread paul
This email address is no longer in use.  

If you need to contact me, please call (07973) 172650

___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


Re: [Robots] Robots.txt Evolution?

2004-01-11 Thread Fred Atkinson
I'm inclined to agree that a second file would probably get overlooked
by bots.  I would imagine it is difficult enough trying to get those who run
them to respect the first one.

I was unaware of the 'Allow' command.  Is there a URL that documents it?

Also, the use of wildcards when giving paths should be incorporated.
That would greatly reduce the number of path lines that you'd have to type
into the robots.txt file and that a robot would have to read out of it.  And
wildcards shouldn't be limited to just the end of the path; you should be
able to use them in the middle as well.  Perhaps UNIX-style bracket
expressions could be incorporated ([0-9], [aeiou]) as matching characters.
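
    To illustrate, patterns like these (hypothetical syntax, not part of the
current standard) could stand in for dozens of literal Disallow lines:

Disallow: /*.cgi
Disallow: /archive/199[0-9]/draft-*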

As far as the number of sites that actually use robots.txt goes, I would
expect that to grow as the robots.txt syntax became more capable.


Fred

- Original Message - 
From: "Paul Trippett" <[EMAIL PROTECTED]>
To: "'Internet robots, spiders, web-walkers, etc.'" <[EMAIL PROTECTED]>
Sent: Sunday, January 11, 2004 8:57 AM
Subject: RE: [Robots] Robots.txt Evolution?


>
> > This would need to be a separate file, probably "robots2.txt".
>
> People who are oblivious to the robots.txt standard already complain
> about robots grabbing robots.txt, and I would think that fetching another
> file would make them complain even more. Besides, who wants to maintain
> two files that do the same job? In my view, if it were in another file, an
> extended standard would never be used, and then yes, there would be no
> point.
>
> /pt
>
>
> ___
> Robots mailing list
> [EMAIL PROTECTED]
> http://www.mccmedia.com/mailman/listinfo/robots

___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


RE: [Robots] Robots.txt Evolution?

2004-01-11 Thread Matthew Meadows
You brought up an interesting point, the lack of sites using robots.txt,
but I see two related problems:

1) Most sites do not deploy or properly maintain a robots.txt file.
2) Many robots (nascent, malicious, stealthy or otherwise) do not honor
the robots.txt file.

If there is to be a successor to robots.txt, it should be sophisticated
enough to improve on both of these problems.  It is only in the robot
developer's interest to honor the file if doing so helps the robot avoid
link funnels and undesirable files.  It is only in the site developer's
interest to maintain the file if they actually want portions of their site
indexed or omitted.

Here's a proposal that I think helps solve both aspects of the problem:
By link funneling I'm referring to links that contain random session
identifiers, causing the same pages to be served up perpetually with
different anchor tags.  Most robots could benefit from a line that
identified these types of urls by their session identifier, for example:
session=jsessionid.  This simple enhancement would benefit both the
robot developers and the site developers.  The robots would no longer
need to identify these urls by manual or automated profiling, they could
simply extract the session identifier from links that matched the mask.
The site developers would prevent the useless traffic that's presently
involved in inferring the random session identifiers.  Extending this
idea, perhaps the specification could allow robots to substitute their
own agent names for the session identifiers.  This would allow for a
loose type of referral tracking.  As a side effect it would also cause
robots that spoof their agent names to implicate competing robots.
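
As a rough illustration of the robot side of this, here is a sketch of
stripping a named session parameter out of a discovered URL so that
duplicate links collapse into one (the helper function is hypothetical, and
the parameter name would come from whatever directive we settle on):

#include <stdio.h>
#include <string.h>

/* Remove "<name>=<value>" from the query string of url, in place. */
void strip_session_param(char *url, const char *name)
{
  size_t nlen = strlen(name);
  char  *p    = strchr(url, '?');

  while (p != NULL)
  {
    char *param = p + 1;                 /* start of a candidate parameter */
    if (strncmp(param, name, nlen) == 0 && param[nlen] == '=')
    {
      char *end = strchr(param, '&');    /* end of the matching parameter */
      if (end != NULL)
        memmove(param, end + 1, strlen(end + 1) + 1);
      else
        *p = '\0';                       /* it was the last parameter */
      return;
    }
    p = strchr(param, '&');
  }
}

int main(void)
{
  char url[] = "http://example.com/page?jsessionid=ABC123&x=1";
  strip_session_param(url, "jsessionid");
  printf("%s\n", url);                   /* http://example.com/page?x=1 */
  return 0;
}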

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
On Behalf Of Fred Atkinson
Sent: Sunday, January 11, 2004 10:45 AM
To: Internet robots, spiders, web-walkers, etc.
Subject: Re: [Robots] Robots.txt Evolution?


I'm inclined to agree that a second file would probably get
overlooked by bots.  I would imagine it is difficult enough trying to get
those who run them to respect the first one.

I was unaware of the 'Allow' command.  Is there a URL that documents
it?

Also, the use of wildcards when giving paths should be incorporated.
That would greatly reduce the number of path lines that you'd have to
type into the robots.txt file and that a robot would have to read out of it.
And wildcards shouldn't be limited to just the end of the path; you
should be able to use them in the middle as well.  Perhaps UNIX-style
bracket expressions could be incorporated ([0-9], [aeiou]) as matching
characters.

As far as the number of sites that actually use robots.txt goes, I
would expect that to grow as the robots.txt syntax became more capable.


Fred

- Original Message - 
From: "Paul Trippett" <[EMAIL PROTECTED]>
To: "'Internet robots, spiders, web-walkers, etc.'"
<[EMAIL PROTECTED]>
Sent: Sunday, January 11, 2004 8:57 AM
Subject: RE: [Robots] Robots.txt Evolution?


>
> > This would need to be a separate file, probably "robots2.txt".
>
> People who are oblivious to the robots.txt standard already complain
> about robots grabbing robots.txt, and I would think that fetching another
> file would make them complain even more. Besides, who wants to maintain
> two files that do the same job? In my view, if it were in another file,
> an extended standard would never be used, and then yes, there would be
> no point.
>
> /pt
>
>
> ___
> Robots mailing list
> [EMAIL PROTECTED] http://www.mccmedia.com/mailman/listinfo/robots

___
Robots mailing list
[EMAIL PROTECTED] http://www.mccmedia.com/mailman/listinfo/robots
___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


Re: [Robots] Robots.txt Evolution?

2004-01-11 Thread Walter Underwood
--On Sunday, January 11, 2004 11:44 AM -0500 Fred Atkinson <[EMAIL PROTECTED]> wrote:
> 
> I was unaware of the 'Allow' command.  Is there a URL that documents it?

The Allow directive is non-standard. Don't use it.

wunder
--
Walter Underwood
Principal Architect
Verity Ultraseek

___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


[Robots] Another approach

2004-01-11 Thread Fred Atkinson
Another idea that has occurred to me is simply to put the information to
be indexed in the robots.txt file itself.  Then the robot could simply pull
the information out of the file and be done.

Example:

User-agent: Scooter
Interval: 30d
Disallow: /
Name: Fred's Site
Index: /index.html
Name: My Article
Index: /article/index.html
Name: My Article's FAQs
Index: /article/faq.html

This would tell them to take this information, include it in their search
database, and move on.
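
A robot could turn the Interval value above into something usable with a
few lines of code.  This is only a sketch, and the "30d" notation is just
the convention used in my example:

#include <stdio.h>
#include <stdlib.h>

/* Convert an interval such as "30d", "12h" or "45m" into seconds.
   Returns -1 if the value cannot be parsed. */
long interval_seconds(const char *value)
{
  char *unit;
  long  n = strtol(value, &unit, 10);

  if (unit == value || n < 0)
    return -1;

  switch (*unit)
  {
    case 'd': return n * 86400;
    case 'h': return n * 3600;
    case 'm': return n * 60;
    case 's': case '\0': return n;
    default:  return -1;
  }
}

int main(void)
{
  printf("%ld\n", interval_seconds("30d"));   /* 2592000 */
  return 0;
}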

Other ideas?



Fred

___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


RE: [Robots] Another approach

2004-01-11 Thread Matthew Meadows
I don't think the explicit names would be required; most robots simply
read the title tag, or infer a title from the first portion of clear text,
the content meta tag, or other document attributes.  In any case, this
method would become quite burdensome for very complicated sites, and I
suspect the file would become stale rather quickly.

I do like the Interval attribute; that makes perfect sense to me.
There's a lot we could do with the same basic concept.  For instance, we
could add a touch date to the file to indicate when the site was last
updated, so that even if the interval has passed robots would not need
to scan the site if they had already done so after the touch date.  Keep
in mind that if robot developers surmise that the touch dates are being
artificially manipulated to keep them out, they'll ignore them.
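
Purely as a sketch of how that might look (the Interval and Touched
directive names here are made up, not part of any standard):

User-agent: *
Interval: 7d
Touched: 2004-01-10
Disallow: /private/

A robot that had already crawled the site after the Touched date could skip
it entirely, even once the seven-day interval had elapsed.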

Anybody else interested in the Session attribute?

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On
Behalf Of Fred Atkinson
Sent: Sunday, January 11, 2004 4:38 PM
To: Robots
Subject: [Robots] Another approach


Another idea that has occurred to me is simply to put the
information to be indexed in the robots.txt file itself.  Then the robot
could simply pull the information out of the file and be done.

Example:

User-agent: Scooter
Interval: 30d
Disallow: /
Name: Fred's Site
Index: /index.html
Name: My Article
Index: /article/index.html
Name: My Article's FAQs
Index: /article/faq.html

This would tell them to take this information, include it in their
search database, and move on.

Other ideas?



Fred

___
Robots mailing list
[EMAIL PROTECTED] http://www.mccmedia.com/mailman/listinfo/robots
___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


Re: [Robots] Another approach

2004-01-11 Thread Sean 'Captain Napalm' Conner
It was thus said that the Great Matthew Meadows once stated:
> 
> I do like the Interval attribute, that makes perfect sense to me.
> There's a lot we could do with the same basic concept.  For instance, we
> could add a touch date to the file to indicate when the site was last
> updated, so that even if the interval has passed robots would not need

  Then there is the issue of making sure the robots.txt file is updated
with the new timestamp each time the site is updated, and I suspect this
step may be ignored or forgotten unless it's automated.

> Anybody else interested in the Session attribute?

  What's the session attribute?

  -spc (http://www.conman.org/people/spc/robots2.html)



___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


RE: [Robots] Another approach

2004-01-11 Thread Matthew Meadows
You're right about the tendency to ignore the touch date, and in a more
general sense the tendency for robots.txt to become stale without
automation.

As for the Session attribute, I made reference to it in a prior email
(albeit without the capitalization):

By link funneling I'm referring to links that contain random session
identifiers, causing the same pages to be served up perpetually with
different anchor tags.  Most robots could benefit from a line that
identified these types of urls by their session identifier, for example:
session=jsessionid.  This simple enhancement would benefit both the
robot developers and the site developers.  The robots would no longer
need to identify these urls by manual or automated profiling, they could
simply extract the session identifier from links that matched the mask.
The site developers would prevent the useless traffic that's presently
involved in inferring the random session identifiers.  Extending this
idea, perhaps the specification could allow robots to substitute their
own agent names for the session identifiers.  This would allow for a
loose type of referral tracking.  As a side effect it would also cause
robots that spoof their agent names to implicate competing robots.

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
On Behalf Of Sean 'Captain Napalm' Conner
Sent: Sunday, January 11, 2004 6:05 PM
To: [EMAIL PROTECTED]
Subject: Re: [Robots] Another approach


It was thus said that the Great Matthew Meadows once stated:
> 
> I do like the Interval attribute, that makes perfect sense to me. 
> There's a lot we could do with the same basic concept.  For instance, 
> we could add a touch date to the file to indicate when the site was 
> last updated, so that even if the interval has passed robots would not need

  Then there is the issue of making sure the robots.txt file is
updated with the new timestamp each time the site is updated, and I
suspect this step may be ignored or forgotten unless it's automated.

> Anybody else interested in the Session attribute?

  What's the session attribute?

  -spc (http://www.conman.org/people/spc/robots2.html)



___
Robots mailing list
[EMAIL PROTECTED] http://www.mccmedia.com/mailman/listinfo/robots
___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


Re: [Robots] Another approach

2004-01-11 Thread Sean 'Captain Napalm' Conner
It was thus said that the Great Matthew Meadows once stated:
> 
> By link funneling I'm referring to links that contain random session
> identifiers, causing the same pages to be served up perpetually with
> different anchor tags.  Most robots could benefit from a line that
> identified these types of urls by their session identifier, for example:
> session=jsessionid.  This simple enhancement would benefit both the
> robot developers and the site developers.  The robots would no longer
> need to identify these urls by manual or automated profiling, they could
> simply extract the session identifier from links that matched the mask.
> The site developers would prevent the useless traffic that's presently
> involved in inferring the random session identifiers.  Extending this
> idea, perhaps the specification could allow robots to substitute their
> own agent names for the session identifiers.  This would allow for a
> loose type of referral tracking.  As a side effect it would also cause
> robots that spoof their agent names to implicate competing robots.

  I'm beginning to think that to really extend the robots exclusion
protocol, two new directives need to be defined, Ignore: and Index:, which
work like Disallow: and Allow: but accept regular expressions.  An
up-to-date robot could internally convert:

Disallow: /sooperceecret/

to the equivalent:

Ignore: ^/sooperceecret/.*

So, to avoid session ids, one could just do:

Ignore: .*sessionid=.*

And there you go.  Using the different directives makes it backwards
compatible with the original robots.txt (where an older robot will ignore
the new directives) and without overloading the meaning of existing
directives (one of the downsides of my own proposed extension).

  So for my hypothetical robots.txt site I mentioned in a previous post, I
could do:

User-agent: *
Index: ^/$
Disallow: /

  or even:

User-agent: *
Index: ^/$
Ignore: .*
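
  On the robot side, checking a URL path against an Ignore: pattern could be
as simple as compiling it with POSIX regcomp() and testing candidates with
regexec().  Just a sketch, with error handling kept to a minimum:

#include <sys/types.h>
#include <regex.h>
#include <stdio.h>

/* Return 1 if path matches the Ignore: pattern (a POSIX extended regular
   expression), 0 otherwise.  A pattern that fails to compile is skipped. */
int ignore_match(const char *pattern, const char *path)
{
  regex_t re;
  int     match;

  if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return 0;

  match = (regexec(&re, path, 0, NULL, 0) == 0);
  regfree(&re);
  return match;
}

int main(void)
{
  printf("%d\n", ignore_match(".*sessionid=.*", "/cart?jsessionid=1A2B3C")); /* 1 */
  printf("%d\n", ignore_match("^/sooperceecret/.*", "/public/index.html"));  /* 0 */
  return 0;
}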

  -spc (Likes it ... )

___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


Re: [Robots] Another approach

2004-01-11 Thread Walter Underwood
--On Sunday, January 11, 2004 8:13 PM -0500 Sean 'Captain Napalm' Conner <[EMAIL PROTECTED]> wrote:
> 
> And there you go.  Using the different directives makes it backwards
> compatible with the original robots.txt (where an older robot will ignore
> the new directives) and without overloading the meaning of existing
> directives (one of the downsides of my own proposed extension).

No, it does not make it backwards compatible. It makes it an
illegal robots.txt file. Parsers built to ignore unknown directives
would still be able to use it. Parsers not built that way would
not be able to parse the file, and would probably miss all the
legal directives as well as the non-standard ones.

I mentioned the internet robustness principle before, but folks
seem to have missed that. It is:

  Be conservative in what you send, liberal in what you accept.

In our case, the contents of the robots.txt file are what is "sent".
By the robustness principle, we must not add extra stuff on the
assumption that the parsers can deal with it.

Because the original format does not have a version number,
there is no way to change the format safely.

wunder
--
Walter Underwood
Principal Architect
Verity Ultraseek

___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


RE: [Robots] Another approach

2004-01-11 Thread Paul Trippett
So what you're saying is that if you come across a robots.txt file where
someone has made a spelling mistake, say...

Disallow: /intranet
Dislalow: /private

...you treat the whole robots.txt file as corrupt and start indexing even
the paths covered by the correctly defined Disallow directives? If not,
what's the difference between that misspelt directive and a Disallow: that
has been spelt Ignore:?

/pt

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
On Behalf Of Walter Underwood
Sent: 12 January 2004 03:48
To: Internet robots, spiders, web-walkers, etc.
Subject: Re: [Robots] Another approach

--On Sunday, January 11, 2004 8:13 PM -0500 Sean 'Captain Napalm' Conner
<[EMAIL PROTECTED]> wrote:
> 
> And there you go.  Using the different directives makes it backwards
> compatible with the original robots.txt (where an older robot will ignore
> the new directives) and without overloading the meaning of existing
> directives (one of the downsides of my own proposed extension).

No it does not make it backwards compatible. It makes it an
illegal robots.txt file. Parsers built to ignore unknown directives
would still be able to use it. Parsers not built that way would
not be able to parse the file, and would probably miss all the
legal directives as well as the non-standard ones.

I mentioned the internet robustness principle before, but folks
seem to have missed that. It is:

  Be conservative in what you send, liberal in what you accept.

In our case, the contents of the robots.txt file is "sent".
By the robustness principle, we must not add extra stuff on
the assumption that the parsers can deal with it.

Because the original format does not have a version number
there is no way to change the format safely.

wunder
--
Walter Underwood
Principal Architect
Verity Ultraseek

___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


Re: [Robots] Another approach

2004-01-11 Thread Sean 'Captain Napalm' Conner
It was thus said that the Great Walter Underwood once stated:
> 
> --On Sunday, January 11, 2004 8:13 PM -0500 Sean 'Captain Napalm' Conner <[EMAIL PROTECTED]> wrote:
> > 
> > And there you go.  Using the different directives makes it backwards
> > compatible with the original robots.txt (where an older robot will ignore
> > the new directives) and without overloading the meaning of existing
> > directives (one of the downsides of my own proposed extension).
> 
> No it does not make it backwards compatible. It makes it an
> illegal robots.txt file. Parsers built to ignore unknown directives
> would still be able to use it. Parsers not built that way would
> not be able to parse the file, and would probably miss all the
> legal directives as well as the non-standard ones.

  That would be a pretty poor parser, and besides, from the spec itself
(http://www.robotstxt.org/wc/norobots.html):

The file consists of one or more records separated by one or more blank
lines (terminated by CR,CR/NL, or NL). Each record contains lines of the
form ":". The field name is case
insensitive.

Comments can be included in file using UNIX bourne shell conventions: the
'#' character is used to indicate that preceding space (if any) and the
remainder of the line up to the line termination is discarded. Lines
containing only a comment are discarded completely, and therefore do not
indicate a record boundary.

The record starts with one or more User-agent lines, followed by one or more
Disallow lines, as detailed below. Unrecognised headers are ignored.

  Right there---last line---"Unrecognised headers are ignored."  Besides,
it's a bit more work to *not* ignore unrecognized directives than it is to
ignore them:

while (fgets(line, sizeof(line), fprobots) != NULL)
{
  if (strncasecmp(line, "user-agent:", 11) == 0)
  {
    /* we have a user-agent */
  }
  else if (strncasecmp(line, "disallow:", 9) == 0)
  {
    /* we have a disallow */
  }
  else
  {
    /* we have a comment, or an unrecognized directive: ignore it */
  }
}

> I mentioned the internet robustness principle before, but folks
> seem to have missed that. It is:
> 
>   Be conservative in what you send, liberal in what you accept.
> 
> In our case, the contents of the robots.txt file is "sent".
> By the robustness principle, we must not add extra stuff on
> the assumption that the parsers can deal with it.

  By the same token, it is the robots.txt parser that "accepts" the
robots.txt file, so by the robustness principle, you need to ignore
directives you don't understand.

  -spc (And be thankful Martijn didn't decide to use RFC-822 style
header lines ... )

___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots

