Re: Stemming and Wildcards in robots.txt files
Klaus Johannes Rusch:
> No, you will need to use a robots <meta> tag

IIRC, this thread started with "can I avoid meta-tagging (with FrontPage
litter)?". Case closed, I guess! ;)

Tuomas
Re: Stemming and Wildcards in robots.txt files
> OK, then is there a way to create an internal wildcard?
>
> User-agent: *
> Disallow: /*/97
> Disallow: /*/98
>
> Something like this?

Unfortunately... from the Web Server Administrator's Guide to the REP:

" Note also that regular expression are not supported in either the
  User-agent or Disallow lines. The '*' in the User-agent field is a
  special value meaning "any robot". Specifically, you cannot have lines
  like "Disallow: /tmp/*" or "Disallow: *.gif". "

(http://info.webcrawler.com/mak/projects/robots/exclusion-admin.html)

So it looks like "Disallow: /*/97" (and so on) would be meaningless as well.

Regards, Tuomas
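Since the classic REP quoted above supports no regular expressions, a '*' inside a Disallow path would be just another character. A minimal Python sketch (the helper name is mine, not from any spec) of why "Disallow: /*/97" is effectively meaningless:

```python
def is_disallowed(rule: str, url_path: str) -> bool:
    # Classic REP matching is a plain prefix comparison: no regular
    # expressions, so '*' in the rule is treated as a literal character.
    return url_path.startswith(rule)

# "Disallow: /*/97" only blocks paths that literally begin with "/*/97":
print(is_disallowed("/*/97", "/archive/97/index.html"))  # False
print(is_disallowed("/*/97", "/*/97/index.html"))        # True
```

Only a path containing a literal "*" directory would ever match, which is almost certainly not what the rule's author intended.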
Re: FW: Stemming and Wildcards in robots.txt files
In <[EMAIL PROTECTED]>, Jonathan Knoll <[EMAIL PROTECTED]> writes:
> OK, then is there a way to create an internal wildcard?
>
> User-agent: *
> Disallow: /*/97
> Disallow: /*/98

No, you will need to use a robots <meta> tag in the HTML source for those
robots that honor the meta tag. robots.txt does not provide a mechanism
for this.

Klaus Johannes Rusch
--
[EMAIL PROTECTED]
http://www.atmedia.net/KlausRusch/
Re: FW: Stemming and Wildcards in robots.txt files
> OK, then is there a way to create an internal wildcard?
>
> User-agent: *
> Disallow: /*/97
> Disallow: /*/98

No, not in the current specification (draft).

Regards, Martin
--
Sent through GMX FreeMail - http://www.gmx.net
FW: Stemming and Wildcards in robots.txt files
OK, then is there a way to create an internal wildcard?

User-agent: *
Disallow: /*/97
Disallow: /*/98

Something like this?
Re: Stemming and Wildcards in robots.txt files
>> Jonathan Knoll:
>> >> User-agent: *
>> >> Disallow: /cgi-bin
>> >> Disallow: /site
>>
>> Klaus Johannes Rusch:
>> > /cgi-bin/test.cgi
>> > /siteindex.html
>> > would be excluded.
>>
>> (Me:)
>> But what about these paths (in the same root dir):
>>
>> /foo/cgi-bin/test.cgi
>> /bar/user1/cgi-bin/test.sgi
>> /bar/user2/cgi-bin/test.cgi
>>
>> Does the wildcard function recognize specified strings elsewhere (later)
>> than in the immediate beginning of a path?
>
> Martin Beet:
> The draft specification is quite clear on this: the strings are compared
> octet by octet until the Allow / Disallow string ends, in which case this
> rule matches, or until a mismatch is found. From the spec:
>
> " The matching process compares every octet in the path portion of
>   the URL and the path from the record. [...] The match
>   evaluates positively if and only if the end of the path from the
>   record is reached before a difference in octets is encountered."

Thanks, Martin! To briefly paraphrase this: a robot never traverses the URL
beyond the length of the Disallow line. Thus a Disallow string cannot
function as a *free* wildcard element ("Disallow: /foo" would apply to
"/foo/bar" but not to "/bar/foo").

Regards, Tuomas
Re: Stemming and Wildcards in robots.txt files
> Jonathan Knoll:
> >> User-agent: *
> >> Disallow: /cgi-bin
> >> Disallow: /site
>
> Klaus Johannes Rusch:
> > /cgi-bin/test.cgi
> > /siteindex.html
> > would be excluded.
>
> But what about these paths (in the same root dir):
>
> /foo/cgi-bin/test.cgi
> /bar/user1/cgi-bin/test.sgi
> /bar/user2/cgi-bin/test.cgi
>
> Does the wildcard function recognize specified strings elsewhere (later)
> than in the immediate beginning of a path?

The draft specification is quite clear on this: the strings are compared
octet by octet until the Allow / Disallow string ends, in which case this
rule matches, or until a mismatch is found. From the spec:

" The matching process compares every octet in the path portion of
  the URL and the path from the record. [...] The match
  evaluates positively if and only if the end of the path from the
  record is reached before a difference in octets is encountered."

Regards, Martin
--
Sent through GMX FreeMail - http://www.gmx.net
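The octet-by-octet rule quoted from the draft amounts to a plain prefix comparison. A minimal Python sketch (the function name is mine, not from the spec; paths are assumed to be simple ASCII, so character and octet comparison coincide):

```python
def rule_matches(rule_path: str, url_path: str) -> bool:
    # Draft-spec matching: the rule matches iff the end of the rule's
    # path is reached before a differing octet is encountered, i.e.
    # the rule path is a literal prefix of the URL path.
    return url_path.startswith(rule_path)

# "Disallow: /cgi-bin" anchors at the start of the path:
print(rule_matches("/cgi-bin", "/cgi-bin/test.cgi"))      # True
print(rule_matches("/cgi-bin", "/foo/cgi-bin/test.cgi"))  # False
```

So none of the /foo/... or /bar/... paths in the question would be caught by "Disallow: /cgi-bin".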
Re: Stemming and Wildcards in robots.txt files
Jonathan Knoll:
>> User-agent: *
>> Disallow: /cgi-bin
>> Disallow: /site

Klaus Johannes Rusch:
> /cgi-bin/test.cgi
> /siteindex.html
> would be excluded.

But what about these paths (in the same root dir):

/foo/cgi-bin/test.cgi
/bar/user1/cgi-bin/test.sgi
/bar/user2/cgi-bin/test.cgi

Does the wildcard function recognize specified strings elsewhere (later)
than in the immediate beginning of a path?
Re: Stemming and Wildcards in robots.txt files
In <[EMAIL PROTECTED]>, Jonathan Knoll <[EMAIL PROTECTED]> writes:
> Do robots.txt files have stemming/wildcard abilities?
>
> **
> User-agent: *
> Disallow: /cgi-bin
> Disallow: /site
> **

Yes, robots are instructed to not visit any URLs that start with the
given string, so in your example

/cgi-bin/test.cgi
/siteindex.html

would be excluded.

Klaus Johannes Rusch
--
[EMAIL PROTECTED]
http://www.atmedia.net/KlausRusch/
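The "start with the given string" behavior can be sketched in a few lines of Python (names are mine; this assumes simple ASCII paths, not a full robots.txt parser):

```python
disallow = ["/cgi-bin", "/site"]

def is_excluded(url_path: str) -> bool:
    # Each Disallow value acts as a literal prefix ("stem") of the path;
    # a URL is excluded if any rule is a prefix of its path.
    return any(url_path.startswith(rule) for rule in disallow)

print(is_excluded("/cgi-bin/test.cgi"))  # True
print(is_excluded("/siteindex.html"))    # True: "/site" stems into it
print(is_excluded("/other/page.html"))   # False
```

Note that the stemming stops at nothing: "/site" also excludes "/site.html", "/sitemap.xml", and anything else sharing that prefix.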
Stemming and Wildcards in robots.txt files
I'm sorry, I am a bit new to this, so this may be a bit simplistic.
Nonetheless, I must ask. Do robots.txt files have stemming/wildcard
abilities?

**
User-agent: *
Disallow: /cgi-bin
Disallow: /site
**

Will this only disallow the cgi-bin and site directories, or will it also
disallow anything that begins with "site" off the root of the domain? If it
does not, is there any way to disallow file names in this manner?

Thanks. Jonathan.