Re: 2 Gb limitation
On 10 Jan 2002 at 17:09, Matt Butt wrote:

> I've just tried to download a 3Gb+ file (over a network using HTTP) with
> Wget and it died at exactly 2Gb. Can this limitation be removed?

In principle, changes could be made to allow wget to be configured for large file support, by using the appropriate data types (i.e. 'off_t' instead of 'long'). The logging code would become more complicated, as there is no portable way to handle that data type in a printf-style function; the values would have to be converted to strings by a bespoke routine and the converted strings passed to the printf-style function. This would also slow down the operation of wget a little.

A version of wget configured for large file support would also be somewhat slower in general than a version without it, at least on a 32-bit machine.

Large file support should probably be added to the TODO list at least. Quite a few people use wget to download .iso images of CD-ROMs at the moment; in the future, those same people are likely to want to use wget to download DVD-ROM images!
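To make the conversion issue concrete, here is a minimal sketch (purely illustrative, not wget code) of the kind of bespoke routine that turns an off_t into a string so it can be handed to a printf-style logger without relying on a non-portable length modifier. It assumes a 64-bit off_t, e.g. a build with -D_FILE_OFFSET_BITS=64:

  /* Illustration only -- not wget's actual code.  Formats a non-negative
     off_t into a caller-supplied buffer so the result can be logged as an
     ordinary string via "%s". */

  #include <stdio.h>
  #include <sys/types.h>

  static const char *
  offt_to_string (off_t value, char *buf, size_t bufsize)
  {
    char tmp[32];
    size_t i = 0, j = 0;

    /* Collect decimal digits in reverse order. */
    do
      {
        tmp[i++] = (char) ('0' + value % 10);
        value /= 10;
      }
    while (value != 0 && i < sizeof tmp);

    /* Copy them back out in the right order. */
    while (i > 0 && j + 1 < bufsize)
      buf[j++] = tmp[--i];
    buf[j] = '\0';

    return buf;
  }

  int
  main (void)
  {
    /* Assumes a 64-bit off_t, e.g. compiled with -D_FILE_OFFSET_BITS=64. */
    char buf[32];
    off_t size = (off_t) 3221225472u;   /* 3 GiB, just past the 2 GB limit */
    printf ("Length: %s bytes\n", offt_to_string (size, buf, sizeof buf));
    return 0;
  }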
Re: Using -pk, getting wrong behavior for frameset pages...Suggestions?
Thanks for your response. I tried the same command, using your URL, and it worked fine. So I took a look at the site I was retrieving for the failed test. It's an SSL site (I didn't think about it before) and I noticed two things: the frame source pages were not downloaded (they were for www.mev.co.uk), and the links were converted to full URLs, i.e. FRAME src="menulayer.cgi" became FRAME src="https://www.someframed.page/menulayer.cgi". So the content was still reachable, but not really local (this is the original problem). I tried it without --convert-links, and the frame source remained defined as menulayer.cgi, but menulayer.cgi was not downloaded.

Do you think this might be an issue with framesets and SSL sites? Or an issue with framesets and CGI source files?

Thanks again, and I will try --no-http-keep-alive at some point.

Picot

Ian Abbott wrote:

> On 10 Jan 2002 at 12:39, Picot Chappell wrote:
>
> > Has anyone solved this issue? I am downloading a single HTML page,
> > without recursion, and not getting the 'one hop further' that should
> > occur for framesets. I'm using wget 1.8.1, on Solaris 8. According to
> > the documentation, options -p and -k should work to download everything,
> > and from previous postings I see mention that -p should go at least one
> > more hop (also confirmed in the News items on GNU Wget news).
>
> Well, it seems to work as advertised on my employer's web site
> (www.mev.co.uk), at least on my machine. Can you provide an example which
> fails on your machine?
>
> > Below is the gist of my call:
> >
> >   ./wget --ignore-length --html-extension --tries=3 --timeout=60
> >     --cookies=off --page-requisites --convert-links -- www.someframed.page
>
> That looks okay. I substituted in www.mev.co.uk and got the index frameset
> page, two frames and the images on those frames, as expected.
>
> The '--ignore-length' switch slows things down rather a lot though, due to
> keep-alive connections. Adding '--no-http-keep-alive' to the above will
> speed it up.
wget does not parse .netrc properly
Hello everyone,

I'm using wget compiled from the latest CVS sources (GNU Wget 1.8.1+cvs). I use it to mirror several FTP sites. I keep my FTP accounts in a .netrc file which looks like this:

  # My ftp accounts
  machine host1 login user1 password pwd1

  machine host2 login user2 password pwd2
  macdef init
  # quote site dirstyle
  prompt
  binary
  cd database

  machine host3 login user3 password pwd3
  macdef init
  prompt
  binary
  cd download

The problem is that when I try to get data from machine host3, wget tries to log in as anonymous. It looks like it doesn't find host3 in .netrc, while it works fine with host1 and host2. If I put machine host3 in the first position in .netrc, then it works, but in that case it doesn't work with either host1 or host2. I guess the macdef directive confuses wget. The trouble is that Wget 1.6.1 used to work with this .netrc.

Best regards,
Alexis
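As an aside on the format itself: a macdef body conventionally runs up to the first blank line, and a netrc reader only resumes looking for further machine entries once that blank line has been seen. The following is a rough line-based sketch of that rule (illustration only - not wget's actual parser, which works token by token, and it assumes keywords start at the beginning of a line as in the example above):

  /* Illustration of the macdef-skipping rule a netrc reader is expected
     to follow: skip macro body lines until a blank line, then resume
     scanning for "machine" entries. */

  #include <stdio.h>
  #include <string.h>

  static void
  scan_netrc (FILE *fp)
  {
    char line[1024];
    int in_macro = 0;

    while (fgets (line, sizeof line, fp))
      {
        if (in_macro)
          {
            /* A blank line terminates the macro body. */
            if (line[0] == '\n' || line[0] == '\r')
              in_macro = 0;
            continue;           /* macro body lines are not netrc tokens */
          }

        if (strncmp (line, "macdef", 6) == 0)
          {
            in_macro = 1;       /* body follows until a blank line */
            continue;
          }

        if (strncmp (line, "machine", 7) == 0)
          printf ("found entry: %s", line);
      }
  }

  int
  main (void)
  {
    FILE *fp = fopen (".netrc", "r");
    if (fp)
      {
        scan_netrc (fp);
        fclose (fp);
      }
    return 0;
  }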
Re: Using -pk, getting wrong behavior for frameset pages...Suggestions?
> Do you think this might be an issue with framesets and SSL sites? Or an
> issue with framesets and CGI source files?

This is not a problem with frames - it IS a problem with SSL. wget, while it appears to have SSL support, didn't quite get it right. The internal schemes being used don't treat https: as an HTTP protocol, and thus don't recurse down into sub-pages. (wget specifically avoids recursing into unknown protocols, and https was treated as one of these.)

A previous post on a patch for this exists. The patch is as follows:

--- src/recur.c	Wed Dec 19 09:27:29 2001
+++ ../wget-1.8.1.esoft/src/recur.c	Sat Dec 29 16:17:40 2001
@@ -437,7 +437,7 @@
 	 the list.  */
 
       /* 1. Schemes other than HTTP are normally not recursed into. */
-      if (u->scheme != SCHEME_HTTP
+      if (u->scheme != SCHEME_HTTP && u->scheme != SCHEME_HTTPS
 	  && !(u->scheme == SCHEME_FTP && opt.follow_ftp))
 	{
 	  DEBUGP (("Not following non-HTTP schemes.\n"));
@@ -446,7 +446,7 @@
       /* 2. If it is an absolute link and they are not followed, throw it
 	 out.  */
-      if (u->scheme == SCHEME_HTTP)
+      if (u->scheme == SCHEME_HTTP || u->scheme == SCHEME_HTTPS)
 	if (opt.relative_only && !upos->link_relative_p)
 	  {
 	    DEBUGP (("It doesn't really look like a relative link.\n"));
@@ -534,7 +534,7 @@
 	}
 
       /* 8. */
-      if (opt.use_robots && u->scheme == SCHEME_HTTP)
+      if (opt.use_robots && (u->scheme == SCHEME_HTTP || u->scheme == SCHEME_HTTPS))
 	{
 	  struct robot_specs *specs = res_get_specs (u->host, u->port);
 	  if (!specs)

OR, alternatively, simply edit recur.c according to the following instructions:

Line 440: change to
  if (u->scheme != SCHEME_HTTP && u->scheme != SCHEME_HTTPS

Line 449: change to
  if (u->scheme == SCHEME_HTTP || u->scheme == SCHEME_HTTPS)

Line 537: change to
  if (opt.use_robots && (u->scheme == SCHEME_HTTP || u->scheme == SCHEME_HTTPS))

and that should work better.

Thomas

-- 
E-Soft Inc.                     http://www.e-softinc.com
Publishers of SecuritySpace     http://www.securityspace.com
Tel: 1-905-331-2260  Fax: 1-905-331-2504
Tollfree in North America: 1-800-799-4831
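For anyone who wants to try the patch above, a typical sequence would be something like this (the patch file name is illustrative, and --with-ssl assumes OpenSSL is installed; adjust paths to your own wget 1.8.1 source tree):

  cd wget-1.8.1
  patch -p0 < https-recurse.patch
  ./configure --with-ssl
  make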
Re: Using -pk, getting wrong behavior for frameset pages...Suggestions?
On 11 Jan 2002 at 10:51, Picot Chappell wrote:

> Thanks for your response. I tried the same command, using your URL, and it
> worked fine. So I took a look at the site I was retrieving for the failed
> test. It's an SSL site (I didn't think about it before) and I noticed two
> things: the frame source pages were not downloaded (they were for
> www.mev.co.uk), and the links were converted to full URLs, i.e.
> FRAME src="menulayer.cgi" became
> FRAME src="https://www.someframed.page/menulayer.cgi". So the content was
> still reachable, but not really local (this is the original problem). I
> tried it without --convert-links, and the frame source remained defined as
> menulayer.cgi, but menulayer.cgi was not downloaded.
>
> Do you think this might be an issue with framesets and SSL sites? Or an
> issue with framesets and CGI source files?

Do you have SSL support compiled in?

Also it is possible that the .cgi script on the server is checking HTTP request headers and cookies, doesn't like what it sees, and is returning an error. It is sometimes useful to lie to the server about the HTTP user agent using the -U option, e.g.:

  -U "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 4.0)"

or include something similar in the wgetrc file:

  useragent = Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 4.0)

Some log entries would be useful, particularly with the -d option. You can mask any sensitive bits of the log if you want.
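Putting that together with the original command, a debugging run might look something like the following (URL and user-agent string are illustrative; -d enables debug output and -o sends the log to a file):

  ./wget -d -o wget-debug.log --page-requisites --convert-links \
      -U "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 4.0)" \
      https://www.someframed.page/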
-H suggestion
WGET suggestion

The -H switch/option sets host-spanning. Please provide a way to specify a different limit on recursion levels for files retrieved from foreign hosts. For example, -r -l0 -H2 would allow unlimited recursion levels on the target host, but only 2 [additional] levels when a file is being retrieved from a foreign host.

Second suggestion: the -i switch provides for a file listing the URLs to be downloaded. Please provide for a list file of URLs to be avoided when -H is enabled.

Thanks for listening. And thanks for a marvelous product.

Fred Holmes
[EMAIL PROTECTED]
Suggestion on job size
It would be nice to have some way to limit the total size of any job, and have it exit gracefully upon reaching that size, completing the -k -K process upon termination so that what one has downloaded is useful. A switch setting the total size of all downloads, e.g. --total-size=600MB, would terminate the run when the total bytes downloaded reached 600 MB, and then process the -k -K. What one had already downloaded would then be properly linked for viewing.

Probably more difficult would be a way of terminating the run manually (Ctrl-Break??), but then being able to run the -k -K process on the already-downloaded files.

Fred Holmes
Re: Suggestion on job size
Hi Fred!

First, I think this would rather belong on the normal wget list, as I cannot see a bug here. Sorry to the bug tracers; I am posting to the normal wget list and cc-ing Fred, hope that is ok.

To your first request: -Q (quota) should do precisely what you want. I used it with -k and it worked very well. Or am I missing your point here?

Your second wish is, AFAIK, not possible now. Maybe in the future wget could write the record of downloaded files to the appropriate directory; after wget exits, that file could then be used to process all the files mentioned in it. Just an idea - I would not normally expect this option to be requested often. HOWEVER: -K works (if I understand it correctly) on the fly, deciding during the run whether the server file is newer, whether a previously converted file exists, and what to do. So only -k would work after the download, right?

CU
Jens
http://www.JensRoesner.de/wgetgui/
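For reference, a quota-limited recursive run along the lines Fred described could look something like this (URL and size are purely illustrative; -Q accepts k and m suffixes, and once the quota is exceeded no further files are retrieved):

  wget -r -k -K -Q600m http://www.example.com/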