On 2009-12-01, at 11:04, Yaar Schnitman wrote:

> Robots.txt can exclude most of the trac site, and then include the
> sitemap.xml. This way you block most of the junk and only give permission
> to the important files. All major search engines support sitemap.xml, and
> those that don't will be blocked by robots.txt.
>
> A script could generate sitemap.xml from a local svn checkout of trunk. It
> would produce one URL for each source file (frequency=daily) and one URL
> for every revision (frequency=yearly). That would cover most of the search
> requirements.
Forgive me, but this doesn't seem to address the issues that I raised in my
previous message. To reiterate:

We need to allow only an explicit set of URLs to be crawled. Sitemaps do not
provide this ability: they expose a set of URLs to a crawler, but they do not
limit the set of URLs that it can crawl.

A robots.txt file does provide the ability to limit the set of URLs that can
be crawled. However, the semantics of robots.txt seem to make it incredibly
unwieldy, if not impossible, to expose only the content of interest. For
instance, we would want to expose
<http://trac.webkit.org/changeset/#{revision}> while preventing
<http://trac.webkit.org/changeset/#{revision}/#{path}> and
<http://trac.webkit.org/changeset/#{revision}?format=zip&new=#{revision}>
from being crawled. Another example would be exposing
<http://trac.webkit.org/browser/#{path}> while preventing
<http://trac.webkit.org/browser/#{path}?rev=#{revision}> from being crawled.

Is there something that I'm missing?

- Mark
_______________________________________________
webkit-dev mailing list
[email protected]
http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev

