---------- Forwarded message ----------
From: dvanhorn <[EMAIL PROTECTED]>
To: list for the ballistic helmet heads <[EMAIL PROTECTED]>
Date: Mon, 03 Nov 2003 14:13:30 -0500
Subject: Re: [ballistichelmet] White House site prevents Iraq material being archived
Reply-To: list for the ballistic helmet heads <[EMAIL PROTECTED]:4000>
User-Agent: Mozilla/5.0 (X11; U; SunOS sun4u; en-US; rv:1.1) Gecko/20020827
Precedence: list
Return-Path: <[EMAIL PROTECTED]>
On Thu, 30 Oct 2003, Aaron S. Hawley wrote:

> [for those robots.txt fans]
>
> White House site prevents Iraq material being archived
> http://www.theage.com.au/articles/2003/10/28/1067233141495.html
> By Sam Varghese
> October 28, 2003

The robots.txt convention for preventing material from being archived is merely a courtesy and is not legally binding. The citizens of the US have a right to the information that the administration hosts on whitehouse.gov, so I have downloaded a complete mirror of the whitehouse.gov site, including all of the directories that are disallowed in robots.txt. Our copy of the site will be fully indexed by search engines.

http://ballistichelmet.org/pigstate/

To facilitate search-engine indexing, I have made a file that contains links to every file within the wh.gov site. By including the link to that file here, search engines should pick up the material shortly.

http://ballistichelmet.org/pigstate/www.whitehouse.gov.ls.html

(This file doesn't exist yet, but will soon.)

NB: This copy of the site is not really meant for human viewing, since many of the links will be broken and images won't be shown. It exists only so that search engines can index the material from wh.gov.

I would like to fetch a mirror of the site on a daily or weekly basis, and I plan to provide "diff" files that show, line by line, the changes that have occurred with each new version of wh.gov. We have unlimited disk space on our new web host, so we could potentially do this with many more government sites.

If you are a site administrator and would like to mirror wh.gov, the command is:

wget --mirror www.whitehouse.gov --relative -e robots=off

I'd like to thank the GNU Project for providing high-quality software to the public, especially GNU Wget.

(The robots.txt convention was never meant to be an access control mechanism.
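For anyone curious how the link-index file and the promised "diff" files might be produced, here is a minimal sketch run against a tiny stand-in directory. The directory name "example.com", the file names, and the output paths are assumptions for illustration only, not the actual layout of our mirror:

```shell
# Stand-in for a wget --mirror tree (hypothetical names, for illustration).
mirror_dir=example.com
mkdir -p "$mirror_dir/sub"
echo '<h1>old</h1>' > "$mirror_dir/index.html"
echo '<p>page</p>' > "$mirror_dir/sub/page.html"

# Build one HTML page that links every mirrored file, so search engines
# can reach the whole tree from a single URL.
{
  echo '<html><body>'
  find "$mirror_dir" -type f | sort | while read -r f; do
    printf '<a href="%s">%s</a><br>\n' "$f" "$f"
  done
  echo '</body></html>'
} > "$mirror_dir.ls.html"

# Snapshot the tree, change a file, and record the line-by-line
# differences, as proposed for tracking changes between fetches.
cp -r "$mirror_dir" previous
echo '<h1>new</h1>' > "$mirror_dir/index.html"
diff -r previous "$mirror_dir" > changes.diff || true
```

In practice the snapshot-and-diff step would run from cron after each fresh wget fetch; the `|| true` is there because diff exits nonzero whenever it finds differences.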
It is intended to mark directories that site administrators don't want web-crawling programs to descend into, because of the burden crawling puts on the site. Since we host a copy of wh.gov, and we don't mind the burden, we can get rid of robots.txt.)

-d

_______________________________________________
heads mailing list
[EMAIL PROTECTED]:4000
http://ballistichelmet.org/mailman/listinfo/heads