On 2009-12-01, at 11:04, Yaar Schnitman wrote:

> Robots.txt can exclude most of the trac site, and then include the
> sitemap.xml. This way you block most of the junk and only give permission
> to the important files. All major search engines support sitemap.xml, and
> those that don't will be blocked by robots.txt.
>
> A script could generate sitemap.xml from a local svn checkout of trunk. It
> would produce one URL for each source file (frequency=daily) and one URL
> for every revision (frequency=yearly). That would cover most of the search
> requirements.
Forgive me, but this doesn't seem to address the issues that I raised in my
previous message. To reiterate:

We need to allow only an explicit set of URLs to be crawled. Sitemaps do not
provide this ability: they expose a set of URLs to a crawler, but they do not
limit the set of URLs that it can crawl.

A robots.txt file does provide the ability to limit the set of URLs that can
be crawled. However, the semantics of robots.txt seem to make it incredibly
unwieldy, if not impossible, to expose only the content of interest. For
instance, we would want to expose
<http://trac.webkit.org/changeset/#{revision}> while preventing
<http://trac.webkit.org/changeset/#{revision}/#{path}> and
<http://trac.webkit.org/changeset/#{revision}?format=zip&new=#{revision}>
from being crawled. Another example would be exposing
<http://trac.webkit.org/browser/#{path}> while preventing
<http://trac.webkit.org/browser/#{path}?rev=#{revision}> from being crawled.

Is there something that I'm missing?

- Mark
_______________________________________________
webkit-dev mailing list
[email protected]
http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev

