Hi Hrvoje (or whoever's reading this),

Thanks for all your work on wget! It looks like just what I'll need. (I'm helping a colleague in the Political Science department, and he wants to archive the web sites of political candidates.) However, I did some preliminary experiments and ran into some difficulties. I hope you can help.
The experiment was to download the NY Times. I did it via cron, using the following incantation:

    0 2 * * * wget --mirror -D nytimes.com --directory-prefix=/home/anderson/public_html/times/ http://www.nytimes.com/ -o /home/anderson/wgetlog

I ran into two problems, one minor and the other somewhat bigger.

The bigger problem is that, in viewing the downloaded site, I got a JavaScript error. I believe the problem is caused by the following, which is in the www.nytimes.com/index.html page:

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
    <html><head><script src="/js/csssniff.js"></script>

The problem is that /js/csssniff.js doesn't exist on my web server. The file did get correctly downloaded, but it's in

    /home/anderson/public_html/times/www.nytimes.com/js/csssniff.js

That's just where it should be, I suppose, given the --directory-prefix that I specified. However, it means that the site is a little bit broken. I don't care about the js file per se, but I'm worried that other such absolute filename references will be broken, too. I looked through all the info pages and didn't see anything that would help. Maybe I just missed it.

The other problem, a minor one, is that despite the domain specification -D nytimes.com, I got a bunch of downloads from other sites. Here's an "ls" of the download directory:

    ad.doubleclick.net        www10.americanexpress.com  www.michaelpage.com
    email.nytimes.com         www.brodskyorg.com         www.microsoft.com
    homedelivery.nytimes.com  www.eberhartbros.com       www.nsu.newschool.edu
    index.html.old            www.frenchculinary.com     www.nytbroadway.com
    jobs.nytimes.com          www.manhattanmortgage.com  www.nytimes.com
    nyc.gov                   www.match.com              www.resume.com
    query.nytimes.com         www.merck.com              www.tiffany.com

Okay, I know why I got files from jobs.nytimes.com and so forth; that's fine. But why ad.doubleclick.net and the others? Now, to be fair, only one file was downloaded from each site, but I really expected none. Is this correct behavior for the -D argument?
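(Going back to the bigger problem for a moment: here is the same invocation written out with long options, plus wget's --convert-links flag, which the manual describes as rewriting links in downloaded HTML to refer to the local copies. Whether it actually fixes the absolute /js/csssniff.js reference is my untested guess, not something I've verified. Since crontab entries must be a single line, this multi-line form would have to live in a wrapper script that cron invokes.)

```shell
#!/bin/sh
# Same mirror job as the crontab entry above, for a wrapper script.
# --convert-links is an untested guess at fixing absolute references
# like /js/csssniff.js; everything else is the original incantation.
wget --mirror --convert-links -D nytimes.com \
     --directory-prefix=/home/anderson/public_html/times/ \
     -o /home/anderson/wgetlog \
     http://www.nytimes.com/
```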
Is there something else I should do? One file from each of a number of sites doesn't hurt me a lot, but it does slow down the process and eat up some disk space.

Thanks for your help!

Scott

-- 
Scott D. Anderson
Wellesley College, Wellesley, Massachusetts
[EMAIL PROTECTED]
http://cs.wellesley.edu/~anderson/
