Re: Stemming and Wildcards in robots.txt files
Klaus Johannes Rusch:
> No, you will need to use a robots <meta> tag

IIRC, this thread started with "can I avoid meta-tagging (with FrontPage
litter)?". Case closed, I guess! ;)

Tuomas
Re: Stemming and Wildcards in robots.txt files
> OK, then is there a way to create an internal wildcard?
>
> User-agent: *
> Disallow: /*/97
> Disallow: /*/98
>
> Something like this?

Unfortunately... from the Web Server Administrator's Guide to the REP:

" Note also that regular expression are not supported in either the
  User-agent or Disallow lines. The '*' in the User-agent field is a
  special value meaning "any robot". Specifically, you cannot have lines
  like "Disallow: /tmp/*" or "Disallow: *.gif". "

(http://info.webcrawler.com/mak/projects/robots/exclusion-admin.html)

So it looks like "Disallow: /*/97" (and so on) would be meaningless as well.

Regards, Tuomas
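Since the classic REP quoted above supports no regular expressions, a '*' inside a Disallow path would be just another character. A minimal Python sketch (the helper name is mine, not from any spec) of why "Disallow: /*/97" is effectively meaningless:

```python
def is_disallowed(rule: str, url_path: str) -> bool:
    # Classic REP matching is a plain prefix comparison: no regular
    # expressions, so '*' in the rule is treated as a literal character.
    return url_path.startswith(rule)

# "Disallow: /*/97" only blocks paths that literally begin with "/*/97":
print(is_disallowed("/*/97", "/archive/97/index.html"))  # False
print(is_disallowed("/*/97", "/*/97/index.html"))        # True
```

Only a path containing a literal "*" directory would ever match, which is almost certainly not what the rule's author intended.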
Re: FW: Stemming and Wildcards in robots.txt files
In <[EMAIL PROTECTED]>, Jonathan Knoll <[EMAIL PROTECTED]> writes:
> OK, then is there a way to create an internal wildcard?
>
> User-agent: *
> Disallow: /*/97
> Disallow: /*/98

No, you will need to use a robots <meta> tag in the HTML source for those
robots that honor the meta tag. robots.txt does not provide a mechanism
for this.

Klaus Johannes Rusch
--
[EMAIL PROTECTED]
http://www.atmedia.net/KlausRusch/
Re: FW: Stemming and Wildcards in robots.txt files
> OK, then is there a way to create an internal wildcard?
>
> User-agent: *
> Disallow: /*/97
> Disallow: /*/98

No, not in the current specification (draft).

Regards, Martin
--
Sent through GMX FreeMail - http://www.gmx.net
FW: Stemming and Wildcards in robots.txt files
OK, then is there a way to create an internal wildcard?

User-agent: *
Disallow: /*/97
Disallow: /*/98

Something like this?
Re: Stemming and Wildcards in robots.txt files
>> Jonathan Knoll:
>> >> User-agent: *
>> >> Disallow: /cgi-bin
>> >> Disallow: /site
>>
>> Klaus Johannes Rusch:
>> > /cgi-bin/test.cgi
>> > /siteindex.html
>> > would be excluded.
>>
>> (Me:)
>> But what about these paths (in the same root dir):
>>
>> /foo/cgi-bin/test.cgi
>> /bar/user1/cgi-bin/test.sgi
>> /bar/user2/cgi-bin/test.cgi
>>
>> Does the wildcard function recognize specified strings elsewhere (later)
>> than in the immediate beginning of a path?
>
> Martin Beet:
> The draft specification is quite clear on this: the strings are compared
> octet by octet until the Allow / Disallow string ends, in which case this
> rule matches, or until a mismatch is found. From the spec:
>
> " The matching process compares every octet in the path portion of
>   the URL and the path from the record. [...] The match
>   evaluates positively if and only if the end of the path from the
>   record is reached before a difference in octets is encountered."

Thanks, Martin! To briefly paraphrase this: a robot never traverses the URL
beyond the length of the Disallow line. Thus a Disallow string cannot
function as a *free* wildcard element ("Disallow: /foo" would apply to
"/foo/bar" but not to "/bar/foo").

Regards, Tuomas
Re: Stemming and Wildcards in robots.txt files
> Jonathan Knoll:
> >> User-agent: *
> >> Disallow: /cgi-bin
> >> Disallow: /site
>
> Klaus Johannes Rusch:
> > /cgi-bin/test.cgi
> > /siteindex.html
> > would be excluded.
>
> But what about these paths (in the same root dir):
>
> /foo/cgi-bin/test.cgi
> /bar/user1/cgi-bin/test.sgi
> /bar/user2/cgi-bin/test.cgi
>
> Does the wildcard function recognize specified strings elsewhere (later)
> than in the immediate beginning of a path?

The draft specification is quite clear on this: the strings are compared
octet by octet until the Allow / Disallow string ends, in which case this
rule matches, or until a mismatch is found. From the spec:

" The matching process compares every octet in the path portion of
  the URL and the path from the record. [...] The match
  evaluates positively if and only if the end of the path from the
  record is reached before a difference in octets is encountered."

Regards, Martin
--
Sent through GMX FreeMail - http://www.gmx.net
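The octet-by-octet rule quoted from the draft amounts to a plain prefix comparison. A minimal Python sketch (the function name is mine, not from the spec; paths are assumed to be simple ASCII, so character and octet comparison coincide):

```python
def rule_matches(rule_path: str, url_path: str) -> bool:
    # Draft-spec matching: the rule matches iff the end of the rule's
    # path is reached before a differing octet is encountered, i.e.
    # the rule path is a literal prefix of the URL path.
    return url_path.startswith(rule_path)

# "Disallow: /cgi-bin" anchors at the start of the path:
print(rule_matches("/cgi-bin", "/cgi-bin/test.cgi"))      # True
print(rule_matches("/cgi-bin", "/foo/cgi-bin/test.cgi"))  # False
```

So none of the /foo/... or /bar/... paths in the question would be caught by "Disallow: /cgi-bin".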
Re: Stemming and Wildcards in robots.txt files
Jonathan Knoll:
>> User-agent: *
>> Disallow: /cgi-bin
>> Disallow: /site

Klaus Johannes Rusch:
> /cgi-bin/test.cgi
> /siteindex.html
> would be excluded.

But what about these paths (in the same root dir):

/foo/cgi-bin/test.cgi
/bar/user1/cgi-bin/test.sgi
/bar/user2/cgi-bin/test.cgi

Does the wildcard function recognize specified strings elsewhere (later)
than in the immediate beginning of a path?
Re: Stemming and Wildcards in robots.txt files
In <[EMAIL PROTECTED]>, Jonathan Knoll <[EMAIL PROTECTED]> writes:
> Do robots.txt files have stemming/wildcard abilities?
>
> **
> User-agent: *
> Disallow: /cgi-bin
> Disallow: /site
> **

Yes, robots are instructed to not visit any URLs that start with the
given string, so in your example

/cgi-bin/test.cgi
/siteindex.html

would be excluded.

Klaus Johannes Rusch
--
[EMAIL PROTECTED]
http://www.atmedia.net/KlausRusch/
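The "start with the given string" behavior can be sketched in a few lines of Python (names are mine; this assumes simple ASCII paths, not a full robots.txt parser):

```python
disallow = ["/cgi-bin", "/site"]

def is_excluded(url_path: str) -> bool:
    # Each Disallow value acts as a literal prefix ("stem") of the path;
    # a URL is excluded if any rule is a prefix of its path.
    return any(url_path.startswith(rule) for rule in disallow)

print(is_excluded("/cgi-bin/test.cgi"))  # True
print(is_excluded("/siteindex.html"))    # True: "/site" stems into it
print(is_excluded("/other/page.html"))   # False
```

Note that the stemming stops at nothing: "/site" also excludes "/site.html", "/sitemap.xml", and anything else sharing that prefix.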
Stemming and Wildcards in robots.txt files
I'm sorry, I am a bit new to this, so this may be a bit simplistic.
Nonetheless, I must ask. Do robots.txt files have stemming/wildcard
abilities?

**
User-agent: *
Disallow: /cgi-bin
Disallow: /site
**

Will this only disallow the cgi-bin and site directories, or will it also
disallow anything that begins with "site" off the root of the domain? If it
does not, is there any way to disallow file names in this manner?

Thanks. Jonathan.