---------- Forwarded message ----------
From: dvanhorn <[EMAIL PROTECTED]>
To: list for the ballistic helmet heads <[EMAIL PROTECTED]>
Date: Mon, 03 Nov 2003 14:13:30 -0500
Subject: Re: [ballistichelmet] White House site prevents Iraq material
    being archived
Reply-To: list for the ballistic helmet heads
        <[EMAIL PROTECTED]:4000>
User-Agent: Mozilla/5.0 (X11; U; SunOS sun4u; en-US; rv:1.1) Gecko/20020827
Precedence: list
Return-Path: <[EMAIL PROTECTED]>

On Thu, 30 Oct 2003, Aaron S. Hawley wrote:

> [for those robots.txt fans]
>
> White House site prevents Iraq material being archived
> http://www.theage.com.au/articles/2003/10/28/1067233141495.html
> By Sam Varghese
> October 28, 2003

The robots.txt convention for preventing material from being archived is
merely a courtesy and is not legally binding.  The citizens of the US have
a right to the information that the administration hosts on
whitehouse.gov, so I've downloaded a full mirror of the whitehouse.gov
site, complete with all the directories that are denied in robots.txt.
Our copy of the site will be fully indexed by search engines.

    http://ballistichelmet.org/pigstate/

To facilitate search engine indexing I've made a file that includes
links to every file within the wh.gov site.  Now that a link to that
file appears here, search engines should pick up the material shortly.

    http://ballistichelmet.org/pigstate/www.whitehouse.gov.ls.html

(This file doesn't exist yet, but will soon.)
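
When it does, something along these lines should generate it (a sketch
only; it assumes the mirror sits in a www.whitehouse.gov directory under
the web root, and that the file names need no HTML escaping):

    {
      echo '<html><body>'
      # one link per mirrored file, pointing into /pigstate/
      find www.whitehouse.gov -type f |
        sed 's|.*|<a href="/pigstate/&">&</a><br>|'
      echo '</body></html>'
    } > www.whitehouse.gov.ls.html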

NB This copy of the site is not meant for human viewing: many of the
links will be broken and images won't be shown.  It exists solely so
that search engines can index the material from wh.gov.

I would like to fetch a mirror of the site on a daily or weekly basis,
and I have plans to provide "diff" files that show, on a line-for-line
basis, the changes that have occurred with each new version of wh.gov.
We have unlimited disk space on our new web host, so we could
potentially do this with many more government sites.
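
Roughly, the update script would look something like this (a sketch
only; the dated-directory layout is assumed, nothing is running yet):

    #!/bin/sh
    # Fetch today's snapshot into a dated directory, then diff it
    # against the most recent previous snapshot.
    today=`date +%Y%m%d`
    wget --mirror --relative -e robots=off \
         --directory-prefix=$today www.whitehouse.gov
    # second-to-last dated directory is the previous snapshot
    prev=`ls -d [0-9]* | sort | tail -2 | head -1`
    if [ "$prev" != "$today" ]; then
        diff -r -u $prev/www.whitehouse.gov $today/www.whitehouse.gov \
            > wh.gov-$prev-$today.diff
    fi

Dropped into cron, say

    0 4 * * * /home/heads/bin/mirror-wh.sh

it would run unattended every morning (that path is made up).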

If you are a site administrator and would like to mirror wh.gov, the
command is:

    wget --mirror --relative -e robots=off www.whitehouse.gov
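
(For the record: --mirror turns on recursion with infinite depth and
time-stamping, --relative keeps wget from wandering off-site through
absolute links, and -e robots=off tells it to ignore robots.txt
entirely.  Add --convert-links if you want the local copy to be
browsable offline.)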

I'd like to thank the GNU Project for providing high quality software to
the public, especially GNU Wget.

(The robots.txt convention was never meant to be an access control
mechanism.  It is intended to mark directories that site administrators
don't want web-crawling programs to descend into because of the burden
crawling puts on the site.  Since we host a copy of wh.gov, and we don't
mind the burden, we can drop robots.txt from our copy.)
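
For anyone who hasn't looked at one, a robots.txt is just a list of
path prefixes per user agent.  The entries below are invented, but they
show the shape of the thing:

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /some/big/directory/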

-d

_______________________________________________
heads mailing list
[EMAIL PROTECTED]:4000
http://ballistichelmet.org/mailman/listinfo/heads
