On 2009-11-30, at 23:38, Yaar Schnitman wrote:

> A sitemap.xml file is a more modern way of telling Google how to crawl a site 
> and the traffic can be throttled in Google's webmaster tools 
> (http://www.google.com/webmasters/tools/).
> 
> Creating a daily script that generates sitemap.xml for webkit's SVN repo 
> should be trivial. There are probably trac plugins that do that already. If done 
> right, google crawler shouldn't produce much more load than an average 
> developer doing a daily svn sync.

Google isn't the only search engine we're concerned about.  We need to prevent 
all search engines from hammering the repository, even those that don't support 
this technology.  I can't find any information about whether exclusions in 
robots.txt take precedence over a sitemap, so it's not clear whether that can 
be achieved without having to explicitly whitelist individual crawlers.
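
For reference, explicit whitelisting in robots.txt would look roughly like the 
rules in the sketch below, which checks them with Python's standard 
urllib.robotparser.  The host svn.example.org and the crawler name 
SomeOtherBot are made up for illustration, and real crawlers may not interpret 
Allow lines exactly the way this parser does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one whitelisted crawler, everyone else excluded.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://svn.example.org/trunk/"  # assumed URL, not the real repository
    for agent in ("Googlebot", "SomeOtherBot"):
        print(agent, allowed(agent, url))
```

Every crawler that should get access needs its own User-agent record here, 
which is the per-crawler maintenance burden described above.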

If it is possible to use a sitemap without having to whitelist individual 
crawlers, then we should investigate doing so.  Calling it trivial is rather 
optimistic, though.  You'd need to dramatically restrict the set of content 
that is exposed for indexing to make it feasible.  For instance: allow indexing 
only the content of files on trunk (no branches, tags, or non-HEAD revisions).  
You'd also want to expose individual changesets to ensure that commit messages 
are indexed.
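
A restricted sitemap along those lines could be generated by something like 
the sketch below.  It is only an illustration: the base URL 
trac.example.org, the /browser/ and /changeset/ URL layout, and the function 
name build_sitemap are assumptions, and the lists of trunk files and revision 
numbers would have to come from the repository by some means not shown here.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(base_url, paths, revisions):
    """Return sitemap XML listing trunk files and individual changesets.

    Paths outside trunk/ (branches, tags) are skipped.  No revision is put
    in the file URLs, so they are assumed to show the HEAD revision.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)

    for path in paths:
        if not path.startswith("trunk/"):
            continue  # no branches or tags
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = f"{base_url}/browser/{path}"

    # One entry per changeset so that commit messages get indexed.
    for rev in revisions:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = f"{base_url}/changeset/{rev}"

    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(build_sitemap(
        "http://trac.example.org",  # hypothetical host
        ["trunk/WebCore/dom/Node.cpp", "branches/old/README"],
        [51500],
    ))
```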

But from what I can see, a sitemap only points at content that is available; it 
doesn't restrict what can be indexed.  While we'd want individual changeset 
pages to be indexed, we'd certainly not want a crawler to follow every 
individual "view diff" link on such a page, nor would we want it to follow the 
numerous other links within the content back to previous revisions, other 
branches, tags, etc.

Maybe there's something that I'm missing that makes sitemaps usable for this 
purpose though.

- Mark


_______________________________________________
webkit-dev mailing list
[email protected]
http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev
