Hi Hrvoje (or whoever's reading this),

Thanks for all your work on wget!  It looks like just what I'll need.
(I'm helping a colleague in the Political Science department and he wants
to archive the web sites of political candidates.)  However, I did some
preliminary experiments and ran into some difficulties.  I hope you can
help.

The experiment was to download the NY Times.  I did it via cron, using the
following incantation:

0 2 * * * wget --mirror -D nytimes.com --directory-prefix=/home/anderson/public_html/times/ http://www.nytimes.com/ -o /home/anderson/wgetlog

I ran into two problems, one minor and the other somewhat bigger.  The
bigger problem is that when viewing the downloaded site, I got a JavaScript
error.  I believe the problem is caused by the following, which appears in
the www.nytimes.com/index.html page:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> <html><head><script src="/js/csssniff.js"></script>

The problem is that /js/csssniff.js doesn't exist on my web server.  The
file did get correctly downloaded, but it's in

/home/anderson/public_html/times/www.nytimes.com/js/csssniff.js

That's just where it should be, I suppose, given the --directory-prefix
that I specified.  However, it means that the site is a little bit broken.
I don't care about the js file per se, but I'm worried that other such
absolute filename references will be broken, too.  I looked through all
the info pages and didn't see anything that would help.  Maybe I just
missed it.
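For what it's worth, I believe the page would work from my mirror if the
reference were relative to the page rather than to the server root, i.e.
something like this (my own hand-edited guess, not anything wget produced):

```html
<script src="js/csssniff.js"></script>
```

With the leading slash dropped, the browser would resolve it to
/home/anderson/public_html/times/www.nytimes.com/js/csssniff.js, which is
where the file actually is.  I don't know whether wget has an option to
rewrite links that way.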

Another problem, a minor one, is that despite the domain specification -D
nytimes.com, I got a bunch of downloads from other sites.  Here's an "ls"
of the download directory:

ad.doubleclick.net        www10.americanexpress.com  www.michaelpage.com
email.nytimes.com         www.brodskyorg.com         www.microsoft.com
homedelivery.nytimes.com  www.eberhartbros.com       www.nsu.newschool.edu
index.html.old            www.frenchculinary.com     www.nytbroadway.com
jobs.nytimes.com          www.manhattanmortgage.com  www.nytimes.com
nyc.gov                   www.match.com              www.resume.com
query.nytimes.com         www.merck.com              www.tiffany.com

Okay, I know why I got files from jobs.nytimes.com and so forth; that's
fine.  But why ad.doubleclick.net and the others?  Now, to be fair, only
one file was downloaded from each site, but I really expected none.  Is
this the correct behavior for the -D option?  Is there something else I
should do?  One file from each of a number of sites doesn't hurt me a lot,
but it does slow down the process and eat up some disk space.

Thanks for your help!

Scott

-- 
Scott D. Anderson
Wellesley College, Wellesley, Massachusetts
[EMAIL PROTECTED]
http://cs.wellesley.edu/~anderson/
