The URLs in sitemap.xml are not patterns; they are the exact URLs the search engine will retrieve.
So, you would blacklist most URLs with blanket rules in robots.txt and whitelist explicit URLs in sitemap.xml. E.g. in robots.txt, blacklist /changeset/*, and in sitemap.xml whitelist everything from http://trac.webkit.org/changeset/1 to http://trac.webkit.org/changeset/60000. (It's going to be a big file, all right.)

On Tue, Dec 1, 2009 at 11:33 AM, Mark Rowe <[email protected]> wrote:

> On 2009-12-01, at 11:04, Yaar Schnitman wrote:
>
>> Robots.txt can exclude most of the trac site, and then include the
>> sitemap.xml. This way you block most of the junk and only give
>> permission to the important files. All major search engines support
>> sitemap.xml, and those that don't will be blocked by robots.txt.
>>
>> A script could generate sitemap.xml from a local svn checkout of trunk.
>> It will produce one URL for each source file (frequency=daily) and one
>> URL for every revision (frequency=year). That will cover most of the
>> search requirements.
>
> Forgive me, but this doesn't seem to address the issues that I raised in
> my previous message.
>
> To reiterate: We need to allow only an explicit set of URLs to be
> crawled.
>
> Sitemaps *do not* provide this ability. They expose information about a
> set of URLs to a crawler; they do not limit the set of URLs that it can
> crawl. A robots.txt file *does* provide the ability to limit the set of
> URLs that can be crawled.
>
> However, the semantics of robots.txt seem to make it incredibly unwieldy
> to expose *only* the content of interest, if it is possible at all. For
> instance, to expose <http://trac.webkit.org/changeset/#{revision}>
> while preventing <http://trac.webkit.org/changeset/#{revision}/#{path}>
> or <http://trac.webkit.org/changeset/#{revision}?format=zip&new=#{revision}>
> from being crawled.
>
> Another example would be exposing <http://trac.webkit.org/browser/#{path}>
> while preventing <http://trac.webkit.org/browser/#{path}?rev=#{revision}>
> from being crawled.
>
> Is there something that I'm missing?
>
> - Mark
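The generator script described above could be sketched roughly as follows. This is only an illustration of the idea, not an actual WebKit tool: the function name, the revision range, and the changefreq values are assumptions, and a real version would walk an svn checkout instead of taking a file list. It would be paired with blanket Disallow rules in robots.txt, with the sitemap listing the explicit URLs to crawl.

```python
# Hypothetical sketch: emit sitemap.xml entries for Trac source files and
# changesets. Inputs (file list, max revision) are assumed for illustration.
from xml.sax.saxutils import escape

def build_sitemap(base_url, source_files, max_revision):
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    # One entry per source file in trunk (re-crawled daily).
    for path in source_files:
        lines.append('  <url><loc>%s/browser/%s</loc>'
                     '<changefreq>daily</changefreq></url>'
                     % (base_url, escape(path)))
    # One entry per changeset (a committed revision rarely changes).
    for rev in range(1, max_revision + 1):
        lines.append('  <url><loc>%s/changeset/%d</loc>'
                     '<changefreq>yearly</changefreq></url>'
                     % (base_url, rev))
    lines.append('</urlset>')
    return '\n'.join(lines)
```

With ~60,000 revisions plus every file in trunk this does produce a very large file; the sitemap protocol caps a single file at 50,000 URLs, so a real script would split the output into multiple sitemaps under a sitemap index.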
_______________________________________________
webkit-dev mailing list
[email protected]
http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev

