wget from SVN: Issue with recursive downloading from http:// sites
Hello, Everyone

I am running wget from SVN and I have come upon a problem that I have never had before. I promise that I checked the documentation on your website to see if I needed to change how I use wget. I even joined this list and perused the archives, with no results. Not that they aren't there, but I didn't find any :)

I will use my own domain as an example. In the past, I would run:

    wget -kr -nc http://www.afolkey2.net

and the result would be a mirror of my domain, with the links converted for local viewing. (In this case, wget is the SVN version, which is located at /usr/local/bin/wget.) Now, if I run that same command, I get the following output:

    [EMAIL PROTECTED] Archives]$ wget -kr -nc http://www.afolkey2.net
    --07:55:48--  http://www.afolkey2.net/
    Resolving www.afolkey2.net... 12.203.241.111
    Connecting to www.afolkey2.net|12.203.241.111|:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 1339 (1.3K) [text/html]
    100%[==] 1,339 --.-K/s in 0.01s
    07:55:48 (136 KB/s) - `www.afolkey2.net/index.html' saved [1339/1339]
    FINISHED --07:55:48--
    Downloaded: 1 files, 1.3K in 0.01s (136 KB/s)
    [EMAIL PROTECTED] Archives]$

As you can see, it downloaded ONLY http://www.afolkey2.net/index.html and exited without error. If I try adding a sub-directory to the above example, the result is the same: wget downloads index.html in the directory that I point it to, and then exits without error.

But if I run:

    /usr/bin/wget -kr -nc http://www.afolkey2.net

(/usr/bin/wget is the version of wget that ships with the distro that I run, Fedora Core 3), the result is as it should be: www.afolkey2.net is downloaded in its entirety, and the links are converted for local viewing.

I understand that maybe something has changed about wget's options, but I was not able to locate that information on my own. If this is a bug (Fedora Core 3 specific or not), I would be glad to report it as soon as you tell me that it is a bug.
If you need any more information, let me know and I will be glad to oblige. Right now, I'm going to uninstall /usr/local/bin/wget, reinstall from a fresh download from SVN, and see what happens.

Have a Great Day,
Steven P. Ulrick

P.S.: For clarification, I only used afolkey2.net as an example. Every website that I attempt wget -kr -nc on behaves the same way. But I JUST discovered that recursive downloading from FTP sites seems to work perfectly. I am now downloading ftp.crosswire.org, and it looks like it would happily continue until there was no more to download.
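For readers unfamiliar with the short options, the command in question can be spelled out with wget's long option names; the two forms are equivalent. (The command is echoed here rather than executed, since the download itself needs network access.)

```shell
# Equivalent long-option spelling of "wget -kr -nc <url>":
cmd='wget --convert-links --recursive --no-clobber http://www.afolkey2.net'
echo "$cmd"
# -k  = --convert-links  rewrite links in the mirror for local viewing
# -r  = --recursive      follow links and download them too
# -nc = --no-clobber     skip files that already exist locally
```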
Re: wget from SVN: Issue with recursive downloading from http:// sites
> [...] wget is the SVN version, which is located at /usr/local/bin/wget [...]
> [...] (/usr/bin/wget is the version of wget that ships with the distro
> that I run, Fedora Core 3) [...]

Results from "wget -V" would be much more informative than knowing the path(s) to the executable(s). (Should I know what SVN is?) Adding -d to your wget commands could also help in finding a diagnosis.

If one program works and one doesn't, why use the one which doesn't?

Steven M. Schweda               (+1) 651-699-9818
382 South Warwick Street        [EMAIL PROTECTED]
Saint Paul MN 55105-2547
Re: wget from SVN: Issue with recursive downloading from http:// sites
On Thu, 5 Jan 2006 08:55:53 -0600 (CST) [EMAIL PROTECTED] (Steven M. Schweda) wrote:

> > [...] wget is the SVN version, which is located at /usr/local/bin/wget [...]
> > [...] (/usr/bin/wget is the version of wget that ships with the distro
> > that I run, Fedora Core 3) [...]
>
> Results from wget -V would be much more informative than knowing the
> path(s) to the executable(s). (Should I know what SVN is?) Adding -d to
> your wget commands could also be more helpful in finding a diagnosis.

Hello, Steven

That's fair enough. Default Fedora Core 3 version:

    [EMAIL PROTECTED] ~]$ /usr/bin/wget -V
    GNU Wget 1.10.2 (Red Hat modified)

SVN version:

    [EMAIL PROTECTED] ~]$ /usr/local/bin/wget -V
    GNU Wget 1.10+devel

> If one program works and one doesn't, why use the one which doesn't?

Well, that's simple: I am not a programmer, I am just a user. But if I use the development versions of programs that I like and use a lot, then I can make what little contribution I can back to the community by reporting any issues that I have with them. If I would just stop using stuff that does not work, then I would not be doing my part. But for real, your question is a good one :)

Steven P. Ulrick

P.S.: Again, I do apologize for not putting the version numbers in my original email. I had intended to do that, but I forgot :(
Re: wget from SVN: Issue with recursive downloading from http:// sites
[EMAIL PROTECTED] (Steven M. Schweda) writes:

> Results from wget -V would be much more informative than knowing the
> path(s) to the executable(s). (Should I know what SVN is?)

I believe SVN stands for Subversion, the version control software that runs the repository.
Re: wget from SVN: Issue with recursive downloading from http:// sites
> Adding -d to your wget commands could also be more helpful in finding a
> diagnosis.

Still true. GNU Wget 1.10.2b built on VMS Alpha V7.3-2 (the original wget 1.10.2 with my VMS-related and other changes) seems to work just fine on that site. You might try starting with a less up-to-the-minute source kit to see if that helps. (Although you'd like to think that such a gross problem would be detected before any such problem code had been checked in. And with that site's content, I might prefer any program which sucked down less of it, but that's neither here nor there.)

Steven M. Schweda               (+1) 651-699-9818
382 South Warwick Street        [EMAIL PROTECTED]
Saint Paul MN 55105-2547
Re: wget from SVN: Issue with recursive downloading from http:// sites
On Thu, 5 Jan 2006 09:56:35 -0600 (CST) [EMAIL PROTECTED] (Steven M. Schweda) wrote:

> Adding -d to your wget commands could also be more helpful in finding a
> diagnosis.

That's fair enough:

    [EMAIL PROTECTED] ~]$ wget -d -kr -nc http://www.afolkey2.net
    Setting --convert-links (convertlinks) to 1
    Setting --recursive (recursive) to 1
    Setting --no (noclobber) to 1
    DEBUG output created by Wget 1.10+devel on linux-gnu.

    Enqueuing http://www.afolkey2.net/ at depth 0
    Queue count 1, maxcount 1.
    Dequeuing http://www.afolkey2.net/ at depth 0
    Queue count 0, maxcount 1.
    in http_loop
    in http_loop LOOP
    --10:16:43--  http://www.afolkey2.net/
    in gethttp 1
    in gethttp 2
    in gethttp 3
    Resolving www.afolkey2.net... 12.203.241.111
    Caching www.afolkey2.net = 12.203.241.111
    Connecting to www.afolkey2.net|12.203.241.111|:80... connected.
    Created socket 4.
    Releasing 0x09be9240 (new refcount 1).
    ---request begin---
    GET / HTTP/1.0
    User-Agent: Wget/1.10+devel
    Accept: */*
    Host: www.afolkey2.net
    Connection: Keep-Alive
    ---request end---
    HTTP request sent, awaiting response...
    ---response begin---
    HTTP/1.1 200 OK
    Date: Thu, 05 Jan 2006 16:16:44 GMT
    Server: Apache/2.0.53 (Fedora)
    Last-Modified: Sun, 18 Dec 2005 13:27:25 GMT
    ETag: 192c08-53b-6522a940
    Accept-Ranges: bytes
    Content-Length: 1339
    Connection: close
    Content-Type: text/html; charset=UTF-8
    ---response end---
    200 OK
    in gethttp 4
    in gethttp 5
    Length: 1339 (1.3K) [text/html]
    100%[==] 1,339 --.-K/s in 0s
    Closed fd 4
    10:16:44 (33.6 MB/s) - `www.afolkey2.net/index.html' saved [1339/1339]
    FINISHED --10:16:44--
    Downloaded: 1 files, 1.3K in 0s (33.6 MB/s)
    You have new mail in /var/spool/mail/steve

> GNU Wget 1.10.2b built on VMS Alpha V7.3-2 (the original wget 1.10.2 with
> my VMS-related and other changes) seems to work just fine on that site.
> You might try starting with a less up-to-the-minute source kit to see if
> that helps.

Please forgive my ignorance, but what exactly does that mean?
In my original message on this thread, I mentioned and showed that I tried the same command with the version of wget that ships with Fedora Core 3, though it is true that I did not mention the exact version number. If there is a different version (other than the SVN version, of course) that you are referring to, please let me know.

> (Although you'd like to think that such a gross problem would be detected
> before any such problem code had been checked in. And with that site's
> content, I might prefer any program which sucked down less of it, but
> that's neither here nor there.)

What exactly does that mean? If you are referring to afolkey2.net, that is only my playground for learning how to run a web server and a mail server. As I also said in my original message on this subject, I was only using afolkey2.net as an example. I do sincerely apologize if you did not approve of my example, but I did mention that that was all it was. But, to clarify that last statement, absolutely no offense was taken, as I am sure that none was intended. I was just asking what was meant, that's all :)

Have a Great Day,
Steven P. Ulrick
Re: wget from SVN: Issue with recursive downloading from http:// sites
Your -d output suggests a defective Wget (probably because Wget/1.10+devel was still in development). A working one spews much more stuff (as it downloads much more stuff). I'd try starting with the last released source kit:

    http://www.gnu.org/software/wget/
    http://www.gnu.org/software/wget/index.html#downloading
    http://ftp.gnu.org/pub/gnu/wget/
    http://ftp.gnu.org/pub/gnu/wget/wget-1.10.2.tar.gz

> [...] What exactly does that mean?

I was just complaining about the content at afolkey2.net, but, as I said, that's neither here nor there.

Steven M. Schweda               (+1) 651-699-9818
382 South Warwick Street        [EMAIL PROTECTED]
Saint Paul MN 55105-2547
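The "last released source kit" suggestion amounts to a standard GNU-style build of the 1.10.2 tarball listed above. A minimal sketch, with the usual caveats: the --prefix choice and build steps are conventional assumptions, not from this thread, and the actual fetch/build lines are left commented out since they need network access and a compiler.

```shell
# Sketch of building the last released wget from the tarball listed above.
version=1.10.2
tarball="wget-${version}.tar.gz"
url="http://ftp.gnu.org/pub/gnu/wget/${tarball}"
echo "source kit: $url"
# The actual build (uncomment to run):
#   wget "$url"
#   tar xzf "$tarball"
#   cd "wget-${version}"
#   ./configure --prefix=/usr/local   # keeps the distro's /usr/bin/wget intact
#   make && make install              # install step typically as root
```

Installing under /usr/local mirrors the setup described earlier in the thread, where the self-built wget lives at /usr/local/bin/wget next to the distro's /usr/bin/wget.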
Case insensitive enhancement
I'd like to suggest an enhancement that would help people who are downloading web sites housed on a Windows server. (I couldn't find any discussion of this in the email list archive or any mention in the on-line documentation.)

Since Windows has a case-insensitive file system, Apache and IIS running on a Windows box will treat the following URLs as references to the same resource:

    http://foo.org/bar.html
    http://foo.org/BAR.html

Apache on a *nix box treats these URLs as references to two different resources. Wget 1.10 running on *nix currently treats the two URLs as referring to different resources regardless of the operating system housing the web server. Therefore wget will create two files when only one file actually exists on the Windows web server. I ran into this problem when using wget with http://www.harding.edu/hr/.

I'd like to suggest a new parameter, --ignore-case, that would tell wget to convert all URLs to lowercase when retrieving them. This would allow a more accurate download of the files residing on a Windows file system and would require fewer files to be downloaded. Of course, this would not be as useful for mirroring a site on a *nix box, since URLs referring to BAR.html would then break. A script could also be used to manually go through and delete redundant files (as was suggested in http://www.mail-archive.com/wget@sunsite.dk/msg08373.html to remove the index.html?BLAH files), but it would be nice to save the user this effort.

Regards,
Frank
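A hypothetical sketch of what the proposed --ignore-case behavior could do internally: lowercase each URL before comparing or fetching, so the two spellings above collapse to a single resource. (The function name and approach here are illustrative assumptions, not an actual wget patch.)

```shell
# Lowercase a URL so case variants collapse to a single key.
lower_url() {
    printf '%s' "$1" | tr '[:upper:]' '[:lower:]'
}

u1="$(lower_url 'http://foo.org/bar.html')"
u2="$(lower_url 'http://foo.org/BAR.html')"

if [ "$u1" = "$u2" ]; then
    echo "same resource: $u1"   # only one file would be fetched and saved
fi
```

Note that naively lowercasing the whole URL also lowercases the query string, which can be significant even on Windows-hosted servers; a careful implementation would probably restrict the transformation to the path component.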