The URLs in sitemap.xml are not patterns; they are the exact URLs the search engine will retrieve.
So, you would blacklist most URLs with blanket rules in robots.txt and whitelist explicit URLs in sitemap.xml. E.g. in robots.txt, blacklist /changeset/*, and in sitemap.xml whitelist everything from http://trac.webkit.org/changeset/1 to http://trac.webkit.org/changeset/60000. (It's going to be a big file, all right.)

On Tue, Dec 1, 2009 at 11:33 AM, Mark Rowe <[email protected]> wrote:

> On 2009-12-01, at 11:04, Yaar Schnitman wrote:
>
>> Robots.txt can exclude most of the trac site, and then include the
>> sitemap.xml. This way you block most of the junk and only give
>> permission to the important files. All major search engines support
>> sitemap.xml, and those that don't will be blocked by robots.txt.
>>
>> A script could generate sitemap.xml from a local svn checkout of trunk.
>> It will produce one URL for each source file (frequency=daily) and one
>> URL for every revision (frequency=year). That will cover most of the
>> search requirements.
>
> Forgive me, but this doesn't seem to address the issues that I raised in
> my previous message.
>
> To reiterate: We need to allow only an explicit set of URLs to be
> crawled.
>
> Sitemaps *do not* provide this ability. They expose information about a
> set of URLs to a crawler; they do not limit the set of URLs that it can
> crawl. A robots.txt file *does* provide the ability to limit the set of
> URLs that can be crawled.
>
> However, the semantics of robots.txt seem to make it incredibly unwieldy
> to expose *only* the content of interest, if it is possible at all. For
> instance, to expose <http://trac.webkit.org/changeset/#{revision}>
> while preventing <http://trac.webkit.org/changeset/#{revision}/#{path}>
> or <http://trac.webkit.org/changeset/#{revision}?format=zip&new=#{revision}>
> from being crawled.
>
> Another example would be exposing <http://trac.webkit.org/browser/#{path}>
> while preventing <http://trac.webkit.org/browser/#{path}?rev=#{revision}>
> from being crawled.
>
> Is there something that I'm missing?
>
> - Mark
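The generator script described above could be sketched roughly as follows. This is only an illustration of the idea, not an actual WebKit tool: the function name, the revision range, and the changefreq values are assumptions, and a real version would walk an svn checkout instead of taking a file list. It would be paired with blanket Disallow rules in robots.txt, with the sitemap listing the explicit URLs to crawl.

```python
# Hypothetical sketch: emit sitemap.xml entries for Trac source files and
# changesets. Inputs (file list, max revision) are assumed for illustration.
from xml.sax.saxutils import escape

def build_sitemap(base_url, source_files, max_revision):
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    # One entry per source file in trunk (re-crawled daily).
    for path in source_files:
        lines.append('  <url><loc>%s/browser/%s</loc>'
                     '<changefreq>daily</changefreq></url>'
                     % (base_url, escape(path)))
    # One entry per changeset (a committed revision rarely changes).
    for rev in range(1, max_revision + 1):
        lines.append('  <url><loc>%s/changeset/%d</loc>'
                     '<changefreq>yearly</changefreq></url>'
                     % (base_url, rev))
    lines.append('</urlset>')
    return '\n'.join(lines)
```

With ~60,000 revisions plus every file in trunk this does produce a very large file; the sitemap protocol caps a single file at 50,000 URLs, so a real script would split the output into multiple sitemaps under a sitemap index.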
_______________________________________________
webkit-dev mailing list
[email protected]
http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev

