Re: URLs with "?"s in them

2000-03-13 Thread Mark Bennett

For the record, our Searchbutton spider DOES follow CGI ? links by default.

This behavior can be overridden by Support if needed, but by default we do
follow them.

Mark

-----Original Message-----
From: Avi Rappoport [mailto:[EMAIL PROTECTED]]
Sent: Monday, March 13, 2000 11:24 AM
To: [EMAIL PROTECTED]
Subject: Re: URLs with "?"s in them

At 9:30 PM -0800 3/10/2000, Andrew Daviel wrote:
>[snip]
>On my Apache server it's trivial to change some CGI url foo?id to foo/id -
>the parameters appear in path_info instead of query_string. I'd like
>to  see a 404 status when there's no match, though.

Munging the CGI URL is the only way to get pages indexed by the major
search engines.  Amazon does that and it works pretty well.  I know
that there were rumors of Inktomi indexing pages with ? in them, but
the others surely don't.  A guy from Excite told me that they found
most dynamic pages didn't have much value.  I understand but disagree
-- sometimes you just want to find kids' blue socks, you know?

Avi
--

The Complete Guide to Site Indexing and Local Search Engines
  




Re: Robot code

2000-03-13 Thread Avi Rappoport

At 12:38 AM -0800 3/8/2000, proyecto_ii wrote:
>Hello everyone
>I've just started studying web robots, and I would like to know exactly
>how they work. Does anyone know where I can find the code for one?
>If you have documentation on how to program one, please send me

I have put all the info I have up on my site at
.

By the way, robot-listies, there is just one person listed in the
Consultants section so far -- there are jobs out there, so please let
me know if you're interested, and I'll list you as well.

Avi
  





Re: URLs with "?"s in them

2000-03-13 Thread Avi Rappoport

At 9:30 PM -0800 3/10/2000, Andrew Daviel wrote:
>[snip]
>On my Apache server it's trivial to change some CGI url foo?id to foo/id -
>the parameters appear in path_info instead of query_string. I'd like
>to  see a 404 status when there's no match, though.

Munging the CGI URL is the only way to get pages indexed by the major
search engines.  Amazon does that and it works pretty well.  I know
that there were rumors of Inktomi indexing pages with ? in them, but
the others surely don't.  A guy from Excite told me that they found
most dynamic pages didn't have much value.  I understand but disagree
-- sometimes you just want to find kids' blue socks, you know?
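For anyone who wants to try the foo?id to foo/id change Andrew describes, here is a rough sketch in Python of a handler that reads the id from PATH_INFO and answers 404 when nothing matches. The function name, the KNOWN_IDS set and the sample ids are made up for illustration; a real CGI script would look the id up in its own data.

```python
def lookup_by_path_info(environ, known_ids):
    """Return (status, item_id) for a request such as /cgi-bin/foo/42."""
    # For /cgi-bin/foo/42 the server passes "/42" in PATH_INFO and leaves
    # QUERY_STRING empty, so the URL a spider sees has no "?" in it.
    item_id = environ.get("PATH_INFO", "").strip("/")
    if item_id in known_ids:
        return "200 OK", item_id
    # No match: report 404 so a spider does not index an empty page.
    return "404 Not Found", None


KNOWN_IDS = {"42", "blue-socks"}  # hypothetical catalogue of valid ids

print(lookup_by_path_info({"PATH_INFO": "/42"}, KNOWN_IDS))
print(lookup_by_path_info({"PATH_INFO": "/no-such-id"}, KNOWN_IDS))
```

The first call reports 200 with the id, the second reports 404.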

Avi
  





Re: URLs with "?"s in them

2000-03-13 Thread George Phillips

Marc Slemko wrote:
> Except it would also require the changing of a whole bunch of other pages
> on the site to use the new form of URLs, and there are a good thousand or
> so other pages.  I'm quite familiar with all the options, and for a lot of
> reasons, making all the older message pages into static HTML pages just
> isn't an option.
>
> As I said before, there is nothing technically stopping me from changing
> the URLs to not have a "?" in them without changing the underlying way
> they are generated.  It is simply a very significant effort, and it is a
> bit archaic that there is no way to tell a search engine to include such
> pages.

Perhaps you've thought of this, but one way to ease the effort is to
either (a) support both "?" and ordinary forms of the URLs or (b) change
to ordinary forms but leave the "?" forms as redirects to the ordinary
URLs.  Season to taste.
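Option (b) might look roughly like the sketch below, written in Python. It assumes the old URLs carry the message number in a query parameter called "id" and that the new form simply appends it to the path; both are guesses for illustration, not a description of Marc's site.

```python
from urllib.parse import urlsplit, parse_qs


def redirect_for(url):
    """Return (301, new_url) for an old "?" URL, or None for an ordinary one."""
    parts = urlsplit(url)
    ids = parse_qs(parts.query).get("id")  # "id" is an assumed parameter name
    if not ids:
        return None  # already in the ordinary form, serve it directly
    # Permanent redirect, so spiders and old links end up at the new URL.
    return 301, parts.path.rstrip("/") + "/" + ids[0]


print(redirect_for("/messages?id=123"))
print(redirect_for("/messages/123"))
```

Old links keep working through the redirect, and the thousand or so pages that point at them can be updated gradually.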

-- George




Re: Stemming and Wildcards in robots.txt files

2000-03-13 Thread Martin Beet

> Jonathan Knoll:
> >> User-agent: *
> >> Disallow: /cgi-bin
> >> Disallow: /site
>
> Klaus Johannes Rusch:
> > /cgi-bin/test.cgi
> > /siteindex.html
> > would be excluded.
>
> But what about these paths (in the same root dir):
>
>    /foo/cgi-bin/test.cgi
>    /bar/user1/cgi-bin/test.sgi
>    /bar/user2/cgi-bin/test.cgi
>
> Does the wildcard function recognize specified strings elsewhere (later)
> than in the immediate beginning of a path?


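The draft specification is quite clear on this: the strings are compared
octet by octet from the start of the path until the Allow / Disallow string
ends, in which case the rule matches, or until a mismatch is found, in which
case it does not. So a rule only matches at the beginning of a path, and the
three paths above are not excluded. From the spec: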


   "The matching process compares every octet in the path portion of
   the URL and the path from the record. [...] The match evaluates
   positively if and only if the end of the path from the record is
   reached before a difference in octets is encountered."
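
In other words the rule is a plain prefix test anchored at the start of the path. A minimal sketch in Python, assuming plain ASCII paths and leaving out the %xx decoding covered by the elided part of the quote (the function name is just for illustration):

```python
def is_disallowed(path, disallow_rules):
    """True if any Disallow value is a prefix of the URL path."""
    # A rule matches only when the end of the rule is reached before any
    # differing character, i.e. when the path starts with the rule.
    # An empty Disallow value excludes nothing, so it is skipped.
    return any(rule and path.startswith(rule) for rule in disallow_rules)


RULES = ["/cgi-bin", "/site"]  # the two Disallow lines from the example

for path in ("/cgi-bin/test.cgi", "/siteindex.html",
             "/foo/cgi-bin/test.cgi", "/bar/user1/cgi-bin/test.sgi"):
    print(path, is_disallowed(path, RULES))
```

The first two paths come out as excluded; the last two do not, because "/cgi-bin" does not occur at the beginning of the path.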

Regards, Martin

--
Sent through GMX FreeMail - http://www.gmx.net




Re: Stemming and Wildcards in robots.txt files

2000-03-13 Thread Toivio Tuomas

Jonathan Knoll:
>> User-agent: *
>> Disallow: /cgi-bin
>> Disallow: /site

Klaus Johannes Rusch:
> /cgi-bin/test.cgi
> /siteindex.html
> would be excluded.

But what about these paths (in the same root dir):

   /foo/cgi-bin/test.cgi
   /bar/user1/cgi-bin/test.sgi
   /bar/user2/cgi-bin/test.cgi

Does the wildcard function recognize specified strings elsewhere (later)
than in the immediate beginning of a path?