Hello, I have noticed very unpredictable behavior from wget 1.8.2. Specifically, two things:
a) sometimes it does not follow all of the links it should
b) sometimes wget will follow links to other sites and URLs - when the command line used should not allow it to do that.

Here are the details.

First, sometimes when you attempt to download a site with -k -m (--convert-links and --mirror) wget will not follow all of the links and will skip some of the files! I have no idea why it does this with some sites and doesn't do it with other sites. Here is an example that I have reproduced on several systems - all with 1.8.2:

    # wget -k -m http://www.zorg.org/vsound/
    --17:09:32-- http://www.zorg.org/vsound/
               => `www.zorg.org/vsound/index.html'
    Resolving www.zorg.org... done.
    Connecting to www.zorg.org[213.232.100.31]:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: unspecified [text/html]

        [ <=> ] 12,235 53.82K/s

    Last-modified header missing -- time-stamps turned off.
    17:09:32 (53.82 KB/s) - `www.zorg.org/vsound/index.html' saved [12235]

    FINISHED --17:09:32--
    Downloaded: 12,235 bytes in 1 files
    Converting www.zorg.org/vsound/index.html... 2-6
    Converted 1 files in 0.03 seconds.

What is the problem here? When I run the exact same command line with wget 1.6, I get this:

    # wget -k -m http://www.zorg.org/vsound/
    --11:10:06-- http://www.zorg.org/vsound/
               => `www.zorg.org/vsound/index.html'
    Connecting to www.zorg.org:80... connected!
    HTTP request sent, awaiting response... 200 OK
    Length: unspecified [text/html]

        0K -> .......... .

    Last-modified header missing -- time-stamps turned off.
    11:10:07 (71.12 KB/s) - `www.zorg.org/vsound/index.html' saved [12235]

    Loading robots.txt; please ignore errors.
    --11:10:07-- http://www.zorg.org/robots.txt
               => `www.zorg.org/robots.txt'
    Connecting to www.zorg.org:80... connected!
    HTTP request sent, awaiting response... 404 Not Found
    11:10:07 ERROR 404: Not Found.

    --11:10:07-- http://www.zorg.org/vsound/vsound.jpg
               => `www.zorg.org/vsound/vsound.jpg'
    Connecting to www.zorg.org:80... connected!
    HTTP request sent, awaiting response... 200 OK
    Length: 27,629 [image/jpeg]

        0K -> .......... .......... ...... [100%]

    11:10:08 (51.49 KB/s) - `www.zorg.org/vsound/vsound.jpg' saved [27629/27629]

    --11:10:09-- http://www.zorg.org/vsound/vsound-0.2.tar.gz
               => `www.zorg.org/vsound/vsound-0.2.tar.gz'
    Connecting to www.zorg.org:80... connected!
    HTTP request sent, awaiting response... 200 OK
    Length: 108,987 [application/x-tar]

        0K -> .......... .......... .......... .......... .......... [ 46%]
       50K -> .......... .......... .......... .......... .......... [ 93%]
      100K -> ...... [100%]

    11:10:12 (46.60 KB/s) - `www.zorg.org/vsound/vsound-0.2.tar.gz' saved [108987/108987]

    --11:10:12-- http://www.zorg.org/vsound/vsound-0.5.tar.gz
               => `www.zorg.org/vsound/vsound-0.5.tar.gz'
    Connecting to www.zorg.org:80... connected!
    HTTP request sent, awaiting response... 200 OK
    Length: 116,904 [application/x-tar]

        0K -> .......... .......... .......... .......... .......... [ 43%]
       50K -> .......... .......... .......... .......... .......... [ 87%]
      100K -> .......... .... [100%]

    11:10:14 (60.44 KB/s) - `www.zorg.org/vsound/vsound-0.5.tar.gz' saved [116904/116904]

    --11:10:14-- http://www.zorg.org/vsound/vsound
               => `www.zorg.org/vsound/vsound'
    Connecting to www.zorg.org:80... connected!
    HTTP request sent, awaiting response... 200 OK
    Length: 3,365 [text/plain]

        0K -> ... [100%]

    11:10:14 (3.21 MB/s) - `www.zorg.org/vsound/vsound' saved [3365/3365]

    Converting www.zorg.org/vsound/index.html... done.

    FINISHED --11:10:14--
    Downloaded: 269,120 bytes in 5 files
    Converting www.zorg.org/vsound/index.html... done.

See? It gets the links inside of index.html, and mirrors those links, and converts them - just like it should.

Why does 1.8.2 have a problem with this site? Other sites are handled just fine by 1.8.2 with the same command line ... it makes no sense that wget 1.8.2 has problems with particular web sites.
This is incorrect behavior, and if you try the same URL with 1.8.2 you can reproduce the same results.

------------

The second problem (I cannot currently give you an example to try yourself, but _it does happen_) occurs with this command line:

    wget --tries=inf -nH --no-parent --directory-prefix=/usr/data/www.explodingdog.com --random-wait -r -l inf --convert-links --html-extension --user-agent="Mozilla/4.0 (compatible; MSIE 6.0; AOL 7.0; Windows NT 5.1)" www.example.com

At first it acts normally, just going over the site in question, but sometimes you come back to the terminal and see it grabbing all sorts of pages from totally different sites (!). I have seen this happen numerous times with different web sites: it starts getting the site it is supposed to, and then somewhere in the middle just starts following links to other domains and sites.

This is a problem, because the above command line should not allow it to get anything but the site in question.

Has anyone else seen wget do this?

Thanks!
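
P.S. Since I cannot yet give a reproducible example for the second problem, one way to capture evidence the next time it happens might be to write the run to a log with wget's -o option and then list which hosts were actually requested. The following is only a rough sketch: it assumes the log keeps the `--HH:MM:SS-- URL` request lines seen in the transcripts above, and `log_hosts` and `wget.log` are names I made up for illustration, not anything that ships with wget.

```shell
# log_hosts (hypothetical helper): print each distinct host that appears
# in the "--HH:MM:SS-- URL" request lines of a wget log file.
log_hosts() {
  # Keep only lines that start with a timestamp marker; this skips
  # lines such as "FINISHED --17:09:32--", which do not start with "--".
  grep -E '^--[0-9]{2}:[0-9]{2}:[0-9]{2}--' "$1" \
    | sed -E 's#^--[0-9:]+-- +[a-z]+://([^/:]+).*#\1#' \
    | sort -u
}
```

Usage would be something like running the same command with `-o wget.log` added and then calling `log_hosts wget.log`; any host other than the one being mirrored shows where wget left the site, and the log lines just before its first appearance should show which page it was fetched after.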