Re: Stemming and Wildcards in robots.txt files

2000-03-16 Thread Toivio Tuomas

Klaus Johannes Rusch:
>No, you will need to use
>   <META NAME="ROBOTS" CONTENT="NOINDEX">

IIRC, this thread started with "can I avoid meta-tagging (with FrontPage
litter)?".
Case closed, I guess! ;)

Tuomas




Re: Stemming and Wildcards in robots.txt files

2000-03-15 Thread Toivio Tuomas

>OK, then is there a way to create an internal wildcard?
>
>User-agent: *
>Disallow: /*/97
>Disallow: /*/98
>
>Something like this?

Unfortunately... from Web Server Administrator's Guide to the REP:
" Note also that regular expression are not supported in either the
User-agent or Disallow lines. The '*' in the User-agent field is a special
value meaning "any robot". Specifically, you cannot have lines like
"Disallow: /tmp/*" or "Disallow: *.gif". "
(http://info.webcrawler.com/mak/projects/robots/exclusion-admin.html)

So it looks like "Disallow: /*/97" (and so on) would be meaningless as well.
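
To spell that out with a made-up example of my own: under the plain prefix
rule the "*" would be taken literally, so

**
User-agent: *
Disallow: /*/97
**

would only keep robots away from URL paths that literally begin with the five
characters "/*/97", while a real path like "/archive/97/index.html" (invented
for this example) would still be crawled.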

Regards, Tuomas




Re: FW: Stemming and Wildcards in robots.txt files

2000-03-15 Thread Klaus Johannes Rusch

In <[EMAIL PROTECTED]>, Jonathan Knoll 
<[EMAIL PROTECTED]> writes:
> OK, then is there a way to create an internal wildcard?
>
> User-agent: *
> Disallow: /*/97
> Disallow: /*/98

No, you will need to use

   <META NAME="ROBOTS" CONTENT="NOINDEX">

in the HTML source for those robots that honor the meta tag. robots.txt does
not provide a mechanism for this.
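
For example, near the top of each page you want to keep out (just a sketch,
the title is made up):

<head>
<title>Archive 1997</title>
<META NAME="ROBOTS" CONTENT="NOINDEX">
</head>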

Klaus Johannes Rusch
--
[EMAIL PROTECTED]
http://www.atmedia.net/KlausRusch/




Re: FW: Stemming and Wildcards in robots.txt files

2000-03-14 Thread Martin Beet

> OK, then is there a way to create an internal wildcard?
>
> User-agent: *
> Disallow: /*/97
> Disallow: /*/98

No, not in the current specification (draft).

Regards, Martin

--
Sent through GMX FreeMail - http://www.gmx.net




FW: Stemming and Wildcards in robots.txt files

2000-03-14 Thread Jonathan Knoll

OK, then is there a way to create an internal wildcard?

User-agent: *
Disallow: /*/97
Disallow: /*/98

Something like this?





Re: Stemming and Wildcards in robots.txt files

2000-03-14 Thread Toivio Tuomas

>> Jonathan Knoll:
>> >> User-agent: *
>> >> Disallow: /cgi-bin
>> >> Disallow: /site
>>
>> Klaus Johannes Rusch:
>> > /cgi-bin/test.cgi
>> > /siteindex.html
>> > would be excluded.
>>
(Me:)
>> But what about these paths (in the same root dir):
>>
>>/foo/cgi-bin/test.cgi
>>/bar/user1/cgi-bin/test.sgi
>>/bar/user2/cgi-bin/test.cgi
>>
>> Does the wildcard function recognize specified strings elsewhere (later)
>> than in the immediate beginning of a path?
>
>Martin Beet:
>The draft specification is quite clear on this: the strings are compared
>octet by octet until the Allow / Disallow string ends, in which case this
>rule matches, or until a mismatch is found. From the spec:
>
>" The matching process compares every octet in the path portion of
>   the URL and the path from the record. [...]  The match
>   evaluates positively if and only if the end of the path from the
>   record is reached before a difference in octets is encountered."

Thanks, Martin!

To briefly paraphrase this:
A robot never compares the URL beyond the length of the Disallow line. Thus
a Disallow string cannot function as a *free* wildcard element ("Disallow:
/foo" would apply to "/foo/bar" but not to "/bar/foo").

Regards, Tuomas




Re: Stemming and Wildcards in robots.txt files

2000-03-13 Thread Martin Beet

> Jonathan Knoll:
> >> User-agent: *
> >> Disallow: /cgi-bin
> >> Disallow: /site
>
> Klaus Johannes Rusch:
> > /cgi-bin/test.cgi
> > /siteindex.html
> > would be excluded.
>
> But what about these paths (in the same root dir):
>
>/foo/cgi-bin/test.cgi
>/bar/user1/cgi-bin/test.sgi
>/bar/user2/cgi-bin/test.cgi

>
> Does the wildcard function recognize specified strings elsewhere (later)
> than in the immediate beginning of a path?


The draft specification is quite clear on this: the strings are compared
octet by octet until the Allow / Disallow string ends, in which case this
rule matches, or until a mismatch is found. From the spec:


" The matching process compares every octet in the path portion of
   the URL and the path from the record. [...]  The match
   evaluates positively if and only if the end of the path from the
   record is reached before a difference in octets is encountered."

Regards, Martin

--
Sent through GMX FreeMail - http://www.gmx.net




Re: Stemming and Wildcards in robots.txt files

2000-03-13 Thread Toivio Tuomas

Jonathan Knoll:
>> User-agent: *
>> Disallow: /cgi-bin
>> Disallow: /site

Klaus Johannes Rusch:
> /cgi-bin/test.cgi
> /siteindex.html
> would be excluded.

But what about these paths (in the same root dir):

   /foo/cgi-bin/test.cgi
   /bar/user1/cgi-bin/test.sgi
   /bar/user2/cgi-bin/test.cgi

Does the wildcard function recognize specified strings elsewhere (later)
than in the immediate beginning of a path?




Re: Stemming and Wildcards in robots.txt files

2000-03-10 Thread Klaus Johannes Rusch

In <[EMAIL PROTECTED]>, Jonathan Knoll 
<[EMAIL PROTECTED]> writes:
> Do robots.txt files have stemming/wildcard abilities?
>
> **
> User-agent: *
> Disallow: /cgi-bin
> Disallow: /site
> **

Yes, robots are instructed to not visit any URLs that start with the given
string, so in your example

  /cgi-bin/test.cgi
  /siteindex.html

would be excluded.

Klaus Johannes Rusch
--
[EMAIL PROTECTED]
http://www.atmedia.net/KlausRusch/




Stemming and Wildcards in robots.txt files

2000-03-09 Thread Jonathan Knoll

I'm sorry, I am a bit new to this, so this may be a bit simplistic.
Nonetheless, I must ask.

Do robots.txt files have stemming/wildcard abilities?

**
User-agent: *
Disallow: /cgi-bin
Disallow: /site
**

Will this only disallow the cgi-bin and site directories, or will it also
disallow anything that begins with "site" off the root of the domain?  If it
does not, is there any way to disallow file names in this manner?

Thanks.

Jonathan.