I can't make any sense of what's happening, but when I try to use wget
to mirror a particular pair of URLs, it doesn't download everything.
I'm doing this:
wget -nv -m -nH -np \
     http://www.dnalounge.com/flyers/ \
     http://www.dnalounge.com/gallery/
It's downloading about every 4th subdirectory under gallery/2001/;
if you look at the index.html file there, you'll see that all the
links use identical syntax, so I don't see why it's downloading
07-13/ but skipping 07-14/.
And then, strangely, if I leave off the flyers/ URL on the command
line, it downloads more of the gallery/ directories -- but not all
of them.
It's acting as if there's some maximum number of URLs it's willing
to try, or something like that?
I tried this on Linux with both wget 1.5.3 and 1.7. I've tried it
on two different machines. With the above command line, I always
get this set of directories:
FINISHED --01:20:47--
Downloaded: 23,344,299 bytes in 720 files
% find flyers gallery -type d | sort
flyers
flyers/2001
flyers/2001/07
flyers/2001/08
flyers/2001/09
flyers/2001/10
flyers/2001/11
flyers/2001/12
gallery
gallery/2001
gallery/2001/07-13
gallery/2001/08-17
gallery/2001/09-16
gallery/2001/09-20
gallery/2001/10-05
If it were working properly, I'd get this set of directories:
flyers
flyers/2001
flyers/2001/07
flyers/2001/08
flyers/2001/09
flyers/2001/10
flyers/2001/11
flyers/2001/12
gallery
gallery/2001
gallery/2001/07-13
gallery/2001/07-14
gallery/2001/07-28
gallery/2001/08-01
gallery/2001/08-04
gallery/2001/08-10
gallery/2001/08-17
gallery/2001/08-31
gallery/2001/09-01
gallery/2001/09-16
gallery/2001/09-20
gallery/2001/09-23
gallery/2001/10-05
gallery/2001/10-14
gallery/2001/10-31
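The gap between the two listings can be checked mechanically. A minimal sketch, assuming the expected directories are saved one per line in a file and that it is run from the directory wget mirrored into; the function name missing_dirs and the file name expected.txt are hypothetical, not part of the original run:

```shell
# Hypothetical helper: print every path listed in the file "$1"
# that does not exist as a directory under the current directory.
missing_dirs() {
    while IFS= read -r dir; do
        [ -d "$dir" ] || printf '%s\n' "$dir"
    done < "$1"
}
```

Called as "missing_dirs expected.txt" against the two listings above, this would be expected to print the ten gallery/2001/ directories that are absent from the first listing, beginning with gallery/2001/07-14.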
I added "-d" to the command line, and saved the output to a file,
in case you're interested. Here are the lines matching one of the
directories it chose to ignore:
% grep 10-31 LOG
flyers/2001/10/31-halloween.html: merge("http://www.dnalounge.com/flyers/2001/10/31-halloween.html", "../../../gallery/2001/10-31/") -> http://www.dnalounge.com/flyers/2001/10/../../../gallery/2001/10-31/
parseurl ("http://www.dnalounge.com/flyers/2001/10/../../../gallery/2001/10-31/") -> host www.dnalounge.com -> opath flyers/2001/10/../../../gallery/2001/10-31/ -> dir flyers/2001/10/../../../gallery/2001/10-31 -> file -> ndir gallery/2001/10-31
newpath: /gallery/2001/10-31/
http://www.dnalounge.com/gallery/2001/10-31/ already in list, so we don't load.
gallery/2001/index.html: merge("http://www.dnalounge.com/gallery/2001/", "10-31/") -> http://www.dnalounge.com/gallery/2001/10-31/
parseurl ("http://www.dnalounge.com/gallery/2001/10-31/") -> host www.dnalounge.com -> opath gallery/2001/10-31/ -> dir gallery/2001/10-31 -> file -> ndir gallery/2001/10-31
newpath: /gallery/2001/10-31/
http://www.dnalounge.com/gallery/2001/10-31/ already in list, so we don't load.
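To see how widespread this is, the debug log can be summarized by URL. A rough sketch, assuming the log is the file LOG from above and that each "already in list" message sits on a single line beginning with the URL; skipped_urls is a hypothetical helper name:

```shell
# Hypothetical helper: count how many times each URL in the log "$1"
# was reported as "already in list", most frequent first.
# Assumes the URL is the first whitespace-separated field on the line.
skipped_urls() {
    grep -F "already in list, so we don't load" "$1" \
        | awk '{ print $1 }' \
        | sort \
        | uniq -c \
        | sort -rn \
        | awk '{ print $1, $2 }'
}
```

Running "skipped_urls LOG" and comparing the result with the directories that never appeared on disk would show which URLs were marked as already queued without ever being fetched.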
The only "ERROR" in the log is about the nonexistent robots.txt.
Any suggestions?
--
Jamie Zawinski
[EMAIL PROTECTED] http://www.jwz.org/
[EMAIL PROTECTED] http://www.dnalounge.com/