RE: wget re-download fully downloaded files
Micah Cowan wrote: Actually, I'll have to confirm this, but I think that current Wget will re-download it, but not overwrite the current content, until it arrives at some content corresponding to bytes beyond the current content. I need to investigate further to see if this change was somehow intentional (though I can't imagine what the reasoning would be); if I don't find a good reason not to, I'll revert this behavior. One reason to keep the current behavior is to retain all of the existing content in the event of another partial download that is shorter than the previous one. However, I think that only makes sense if wget is comparing the new content with what is already on disk. Tony
RE: [PATCH] Enable wget to download from given offset and just a given amount of bytes
Juan Manuel wrote: OK, you are right, I`ll try to make it better on my free time. I supposed that it would have been more polite with one option, but thought it was easier with two (and since this is my first approach to C I took the easy way) because one option would have to deal with two parameters. It's clearly easier to deal with options that wget is already programmed to support. For a primer on wget options, take a look at this page on the wiki: http://wget.addictivecode.org/OptionsHowto I suspect you will need to add support for a new action (perhaps cmd_range). Tony
RE: A/R matching against query strings
Micah Cowan wrote: Would hash really be useful, ever? Probably not as long as we strip off the hash before we do the comparison. Tony
RE: A/R matching against query strings
Micah Cowan wrote: On expanding current URI acc/rej matches to allow matching against query strings, I've been considering how we might enable/disable this functionality, with an eye toward backwards compatibility. What about something like --match-type=TYPE (with accepted values of all, hash, path, search)? For the URL http://www.domain.com/path/to/name.html?a=true#content:
- all would match against the entire string
- hash would match against content
- path would match against path/to/name.html
- search would match against a=true
For backward compatibility the default should be --match-type=path. I thought about having host as an option, but that duplicates another option. Tony
RE: Big files
Cristián Serpell wrote: I would like to know if there is a reason for using a signed int for the length of the files to download. I would like to know why people still complain about bugs that were fixed three years ago. (More accurately, it was a design flaw that originated from a time when no computer OS supported files that big, but regardless of what you call it, the change to wget was made to version 1.10 in 2005.) Tony
RE: Big files
Cristián Serpell wrote: Maybe I should have started with this (I had to change the name of the file shown): [snip]
---response begin---
HTTP/1.1 200 OK
Date: Tue, 16 Sep 2008 19:37:46 GMT
Server: Apache
Last-Modified: Tue, 08 Apr 2008 20:17:51 GMT
ETag: 7f710a-8a8e1bf7-47fbd2ef
Accept-Ranges: bytes
Content-Length: -1970398217
The problem is not with wget. It's with the Apache server, which told wget that the file had a negative length. Tony
RE: Wget and Yahoo login?
Micah Cowan wrote: The easiest way to do what you want may be to log in using your browser, and then tell Wget to use the cookies from your browser (e.g., with --load-cookies). Given the frequency of the "log in and then download a file" use case, it should probably be documented on the wiki. (Perhaps it already is. :-) Also, it would probably be helpful to have a shell script to automate this. Tony
RE: Wget 1.11.3 - case sensitivity and URLs
Coombe, Allan David (DPS) wrote: However, the case of the files on disk is still mixed - so I assume that wget is not using the URL it originally requested (harvested from the HTML?) to create directories and files on disk. So what is it using? An HTTP header (if so, which one?). I think wget uses the case from the HTML page(s) for the file name; your proxy would need to change the URLs in the HTML pages to lower case too. Tony
RE: Wget 1.11.3 - case sensitivity and URLs
mm w wrote: a simple URL-rewriting conf should fix the problem, without touching the file system; everything can be done server side. Why do you assume the user of wget has any control over the server from which content is being downloaded?
RE: Wget 1.11.3 - case sensitivity and URLs
mm w wrote: Hi, after all it's only my point of view :D anyway: /dir/file, standard; dir/File, non-standard; Dir/file, non-standard; and /Dir/File, non-standard. According to RFC 2396: "The path component contains data, specific to the authority (or the scheme if there is no authority component), identifying the resource within the scope of that scheme and authority." In other words, those names are well within the standard when the server understands them. As far as I know, there is nothing in Internet standards restricting mixed-case paths. that's it, if the server manages non-standard URLs, it's not my concern; for me it doesn't exist. Oh. I see. You're writing to say that wget should only implement features that are meaningful to you. Thanks for your narcissistic input. Tony
RE: Wget 1.11.3 - case sensitivity and URLs
Micah Cowan wrote: Unfortunately, nothing really comes to mind. If you'd like, you could file a feature request at https://savannah.gnu.org/bugs/?func=additem&group=wget, for an option asking Wget to treat URLs case-insensitively. To have the effect that Allan seeks, I think the option would have to convert all URIs to lower case at an appropriate point in the process. I think you probably want to send the original case to the server (just in case it really does matter to the server). If you're going to treat different-case URIs as matching, then the lower-case version will have to be stored in the hash. The most important part (from the perspective that Allan voices) is that the versions written to disk use lower-case characters. Tony
RE: Wget 1.11.3 - case sensitivity and URLs
mm w wrote: standard: the URLs are case-insensitive; you can adapt your software because some people don't respect the standard; we are not in the 90's anymore; let people doing crappy things deal with their crappy world. You obviously missed the point of the original posting: how can one conveniently mirror a site whose server uses case-insensitive names onto a server that uses case-sensitive names? If the original site has the URI strings /dir/file, dir/File, Dir/file, and /Dir/File, the same local file will be returned for each. However, wget will treat those as unique directories and files and you wind up with four copies. Allan asked if there is a way to have wget just create one copy and proposed one way that might accomplish that goal. Tony
RE: Wget 1.11.3 - case sensitivity and URLs
Steven M. Schweda wrote: From Tony Lewis: To have the effect that Allan seeks, I think the option would have to convert all URIs to lower case at an appropriate point in the process. I think that that's the wrong way to look at it. Implementation details like name hashing may also need to be adjusted, but this shouldn't be too hard. OK. How would you normalize the names? Tony
RE: retrieval of data from a database
Saint Xavier wrote: Well, you'd better escape the '&' in your shell (\&). It's probably easier to just put quotes around the entire URL than to try to find all the special characters and put backslashes in front of them. Tony
RE: Not all files downloaded for a web site
Matthias Vill wrote: Alexandru Tudor Constantinescu wrote: I have the feeling wget is not really able to figure out which files to download from some web sites, when CSS files are used. That's right. Up to and including wget 1.11 (released yesterday) there is no support for parsing links out of CSS files. Therefore wget will download the CSS file, but not any file referenced only there. According to Micah's "Future of Wget" email, CSS support is planned for 1.12. He wrote: 1.12 - Support for parsing links from CSS. [snip] The really big deal here, to me, is CSS. I want to have CSS support for Wget ASAP. It's an essential part of the Web, and users definitely suffer for the lack of support for it. Tony
RE: Skip certain includes
Wayne Connolly wrote: Thanks mate- i know we chatted on IRC but just thought someone else may be able to provide some insight. OK. Here's some insight: wget is essentially a web browser. If the URL starts with http, then wget sees the exact same content as Internet Explorer, Firefox, and Opera (except in cases where the server customizes its content to the user agent - in those cases you may have to tweak the user agent to see the same content). If the files are visible to FTP, then try using wget with a URL starting with ftp instead. Otherwise, if you want to mirror the files as they appear on the server, you will have to use something like scp to transfer the files directly from Server A to Server B. Tony
RE: .1, .2 before suffix rather than after
Hrvoje Niksic wrote: And how is .tar.gz renamed? .tar-1.gz? Ouch. OK. I'm responding to the chain and not Hrvoje's expression of pain. :-) What if we changed the semantics of --no-clobber so the user could specify the behavior? I'm thinking it could accept the following strings:
- after: append a number after the file name (current behavior)
- before: insert a number before the suffix
- new: change the name of the new file (current behavior)
- old: change the name of the old file
With this scheme --no-clobber becomes equivalent to --no-clobber=after,new. If I want to change where the number appears in the file name or have the old file renamed, then I can specify the behavior I want on the command line (or in .wgetrc). I think I would change my default to --no-clobber=before,old. I think it would be useful to have semantics in .wgetrc where I specify what I want my --no-clobber default to be without that meaning I want --no-clobber processing on each invocation. It would be nice if I could say that I want my default to be before,old, but to only have that apply when I specify --no-clobber on the command line. Back to the painful point at the start of this note: I think we treat .tar.gz as a suffix, so with --no-clobber=before a file name ending in .tar.gz becomes one ending in .1.tar.gz. Tony
RE: Thoughts on Wget 1.x, 2.0 (*LONG!*)
Micah Cowan wrote: Keeping a single Wget and using runtime libraries (which we were terming plugins) was actually the original concept (there's mention of this in the first post of this thread, actually); the issue is that there are core bits of functionality (such as the multi-stream support) that are too intrinsic to separate into loadable modules, and that, to be done properly (and with a minimum of maintenance commitment) would also depend on other libraries (that is, doing asynchronous I/O wouldn't technically require the use of other libraries, but it can be a lot of work to do efficiently and portably across OSes, and there are already Free libraries to do that for us). Perhaps both versions can include multi-threaded support in their core version, but the lite version would never invoke multi-threading. Tony
RE: Recursive downloading and post
Micah Cowan wrote: Stuart Moore wrote: Is there any way to get wget to only use the post data for the first file downloaded? Unfortunately, I'm not sure I can offer much help. AFAICT, --post-file and --post-data weren't really designed for use with recursive downloading. Perhaps not, but I can't imagine that there is any scenario where the POST data should legitimately be sent for anything other than the URL(s) on the command line. I'd vote for this being flagged as a bug. Tony
RE: working on patch to limit to percent of bandwidth
Hrvoje Niksic wrote: Measuring initial bandwidth is simply insufficient to decide what bandwidth is really appropriate for Wget; only the user can know that, and that's what --limit-rate does. The user might be able to make a reasonable guess as to the download rate if wget reported its average rate at the end of a session. That way the user can collect rates over time and try to give --limit-rate a reasonable value. Tony
RE: wget + dowbloading AV signature files
Gerard Seibert wrote: Is it possible for wget to compare the file named 'AV.hdb' located in one directory, and if it is older than the AV.hdb.gz file located on the remote server, to download the AV.hdb.gz file to the temporary directory? No, you can only get wget to compare a file of the same name between your local system and the remote server. The only option I have come up with is to keep a copy of the gz file in the temporary directory and run wget from there. You will need to keep the original gz file with a timestamp matching the server in order for wget to know that the file you have is the same as the one on the server. Unfortunately, at least as far as I can tell, wget does not issue an exit code if it has downloaded a newer file. Better exit codes are on the wish list. It would really be nice though if wget simply issued an exit code if an updated file were downloaded. Yes, it would. Therefore, I am unable to craft a script that will unpack the file, test and install it if a newer version has been downloaded. Keep one directory that matches the server and another one (or perhaps two) where you process new files. Before and after wget runs, you can check the dates on the directory that matches the server. You only need to process files that changed. Hope that helps. Tony
RE: wget url with hash # issue
Micah Cowan wrote: If you mean that you want Wget to find any file that matches that wildcard, well no: Wget can do that for FTP, which supports directory listings; it can't do that for HTTP, which has no means for listing files in a directory (unless it has been extended, for example with WebDAV, to do so). Seems to me that is a big "unless" because we've all seen lots of websites that have HTTP directory listings. Apache will do it out of the box (and by default) if there is no index.htm[l] file in the directory. Perhaps we could have a feature to grab all or some of the files in an HTTP directory listing. Maybe something like this could be made to work: wget http://www.exelana.com/images/mc*.gif Perhaps we would need an option such as --http-directory (the first thing that came to mind, but not necessarily the most intuitive name for the option) to explicitly tell wget how it is expected to behave. Or perhaps it can just try stripping the filename when doing an HTTP request and wildcards are specified. At any rate (with or without the command line option), wget would retrieve http://www.exelana.com/images/ and then retrieve any links where the target matches mc*.gif. If wget is going to explicitly support HTTP directory listings, it probably needs to be intelligent enough to ignore the sorting options. In the case of Apache, that would be things like <A HREF="?N=D">Name</A>. Anyone have any idea how many different HTTP directory listing formats are out there? Tony
RE: Overview of the wget source code (command line options)
Himanshu Gupta wrote: Thanks Josh and Micah for your inputs. In addition to whatever Josh and Micah told you, let me add the information that follows. More than once I have had to relearn how wget deals with command line options. The last time I did so, I created the HOWTO that appears below (comments about this information from those in the know on this list are welcome). I'm happy to collect any other topics that people want to submit and add them to the file. Perhaps Micah will even be willing to add it to the repository. :-) By the way, if your mail reader throws away line breaks, you will want to restore them. --Tony

To find out what a command line option does:

Look in src/main.c in the option_data array for the string that corresponds to the command line option; the entries are of the form:
  { "option", 'O', TYPE, "data", argtype },
where you're searching for "option".

If TYPE is OPT_BOOLEAN or OPT_VALUE: Note the value of "data". Then look in init.c at the commands array for an entry that starts with the same "data". These lines are of the form:
  { "data", &opt.variable, cmd_TYPE },
The corresponding line will tell you what variable gets set when that option is selected. Now use grep or some other search tool to find out where the variable is referenced. For example, the --accept option sets the value of opt.accepts, which is referenced in ftp.c and utils.c.

If the TYPE is anything else: Look to see how main.c handles that TYPE. For example, OPT__APPEND_OUTPUT sets the option named "logfile" and then sets the variable append_to_log to true. Searching for append_to_log shows that it is only used in main.c. Checking init.c (as described above) for the option "logfile" shows that it sets the value of opt.lfilename, which is referenced in mswindows.c, progress.c, and utils.c.

To add a new command line option:

The simplest approach is to find an existing option that is close to what you want to accomplish and mirror it. You will need to edit the following files as described.
src/main.c
Add a line to the option_data array in the following format:
  { "option", 'O', TYPE, "data", argtype },
where:
  option is the long name to be accepted from the command line.
  O is the short name (one character) to be accepted from the command line, or '\0' if there is no short name; a short name must only be assigned to one option. Also, there are very few short names available and the maintainers are not inclined to give them out unless the option is likely to be used frequently.
  TYPE is one of the following standard options:
    OPT_VALUE - on the command line, the option must be followed by a value that will be stored ?somewhere?
    OPT_BOOLEAN - the option is a boolean value that may appear on the command line as --option for true or --no-option for false
    OPT_FUNCALL - an internal function will be invoked if the option is selected on the command line
    Note: If one of these choices won't work for your option, you can add a new OPT__XXX value to the enum list and add special code to handle it in src/main.c.
  data - for OPT_VALUE and OPT_BOOLEAN, the name assigned to the option in the commands array defined in src/init.c (see below); for OPT_FUNCALL, a pointer to the function to be invoked.
  argtype - for OPT_VALUE and OPT_BOOLEAN, use -1; for OPT_FUNCALL, use no_argument.
NOTE: The options *must* appear in alphabetical order because a binary search is used on the list.

src/main.c
Add the help string to function print_help as follows:
  N_("\
  -O,  --option              does something nifty.\n"),
If there is no short name, put spaces in place of "-O,". Select a reasonable place to add the text into the help output in one of the existing groups of options: Startup, Logging and input file, Download, Directories, HTTP options, HTTPS (SSL/TLS) options, FTP options, Recursive download, or Recursive accept/reject.

src/options.h
Define the variable to receive the value of the option in the options structure.
src/init.c
Add a line to the commands array in the following format:
  { "data", &opt.variable, cmd_TYPE },
where:
  data matches the "data" string you entered above in the option_data array in src/main.c
  variable is the variable you defined in src/options.h
RE: Problem with combinations of the -O , -p, and -k parameters in wget
Michiel de Boer wrote: Is there another way though to achieve the same thing? You can always run wget and then rename the file afterward. If this happens often, you might want to write a shell script to handle it. Of course, if you want all the references to the file to be converted, the script will be a little more complicated. :-) Tony
RE: ignoring robots.txt
Micah Cowan wrote: The manpage doesn't need to give as detailed explanations as the info manual (though, as it's auto-generated from the info manual, this could be hard to avoid); but it should fully describe essential features. I can't see any good reason for one set of documentation to be different than another. Let the user choose whatever is comfortable. Some users may not even know they have a choice between man and info. While we're on the subject: should we explicitly warn about using such features as robots=off and --user-agent? And what should those warnings be? Something like, "Use of this feature may help you download files from which wget would otherwise be blocked, but it's kind of sneaky, and web site administrators may get upset and block your IP address if they discover you using it"? No, I don't think we should, nor do I think use of those features is sneaky. With regard to robots.txt, people use it when they don't want *automated* spiders crawling through their sites. A well-crafted wget command that downloads selected information from a site without regard to the robots.txt restrictions is a very different situation. It's true that someone could --mirror the site while ignoring robots.txt, but even that is legitimate in many cases. With regard to user agent, many websites customize their output based on the browser that is displaying the page. If one does not set the user agent to match their browser, the retrieved content may be very different than what was displayed in the browser. All that being said, it wouldn't hurt to have a section in the documentation on wget etiquette: think carefully about ignoring robots.txt, use --wait to throttle the download if it will be lengthy, etc. Perhaps we can even add a --be-nice option similar to --mirror that adjusts options to match the etiquette suggestions. Tony
RE: ignoring robots.txt
Micah Cowan wrote: Don't we already follow typical etiquette by default? Or do you mean that to override non-default settings in the rcfile or whatnot? We don't automatically use a --wait time between requests. I'm not sure what other nice options we'd want to make easily available, but there are probably more. Tony
RE: Maximum 20 Redirections HELP!!!
Josh Williams wrote: Hmm. .org, maybe? LOL. Do you know how many kewl domain names I had to go through before I found one that didn't actually exist? Close to a dozen. Tony
RE: 1.11 Release Date: 15 Sept
Noèl Köthe wrote: A switch to the new GPL v3 is a not so small change and like samba (3.0.x - 3.2) would imho be a good reason for wget 1.2 so everybody sees something bigger changed. There already was a version 1.2 (although the program was called geturl at that time). The number scheme could probably use a facelift. Perhaps when we transition to 2.0, we can add a third digit. Tony
wget on gnu.org: error on Development page
On http://www.gnu.org/software/wget/wgetdev.html, step 1 of the summary is: 1. Change to the topmost GNU Wget directory: % cd wget But you need to cd to either wget/trunk or the appropriate version subdirectory of wget/branches.
RE: wget on gnu.org: Report a Bug
Micah Cowan wrote: This information is currently in the bug submitting form at Savannah: That looks good. I think perhaps such things as the wget version and operating system ought to be emitted by default anyway (except when -q is given). I'm not convinced that wget should ordinarily emit the operating system. It's really only useful to someone other than the person running the command. Other than that, what kinds of things would --bug provide above and beyond --debug? It should echo the command line and the contents of .wgetrc to the bug output, which even the --debug option does not do. Perhaps we will think of other things to include in the output if this option gets added. However, the big difference would be where the output was directed. When invoked as: wget ... --bug bug_report all interesting (but sanitized) information would be written to the file bug_report whether or not the command included --debug, which would also direct the debugging output to STDOUT. The main reason I had for suggesting this option is that it would be easy to tell newbies with problems to run the exact same command with --bug bug_report and send the file bug_report to the list (or to whomever is working on the problem). The user wouldn't see the command behave any differently, but we'd have the information we need to investigate the report. It might even be that most of us would choose to run with --bug most of the time relying on the normal wget output except when something appears to have gone wrong and then checking the file when it does. Tony
RE: wget on gnu.org: error on Development page
Micah Cowan wrote: Done. Lemme know if that works for you. Looks good.
RE: bug and patch: blank spaces in filenames causes looping
There is a buffer overflow in the following line of the proposed code:
  sprintf(filecopy, "\"%.2047s\"", file);
It should be:
  sprintf(filecopy, "\"%.2045s\"", file);
in order to leave room for the two quotes. Tony
-----Original Message-----
From: Rich Cook [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 04, 2007 10:18 AM
To: [EMAIL PROTECTED]
Subject: bug and patch: blank spaces in filenames causes looping
On OS X, if a filename on the FTP server contains spaces, and the remote copy of the file is newer than the local, then wget gets thrown into a loop of "No such file or directory" endlessly. I have changed the following in ftp-simple.c, and this fixes the error. Sorry, I don't know how to use the proper patch formatting, but it should be clear.
The beginning of ftp_retr:
/* Sends RETR command to the FTP server. */
uerr_t
ftp_retr (int csock, const char *file)
{
  char *request, *respline;
  int nwritten;
  uerr_t err;
  /* Send RETR request. */
  request = ftp_request ("RETR", file);
becomes:
/* Sends RETR command to the FTP server. */
uerr_t
ftp_retr (int csock, const char *file)
{
  char *request, *respline;
  int nwritten;
  uerr_t err;
  char filecopy[2048];
  if (file[0] != '"')
    sprintf (filecopy, "\"%.2047s\"", file);
  else
    strncpy (filecopy, file, 2047);
  /* Send RETR request. */
  request = ftp_request ("RETR", filecopy);
--
Rich "wealthychef" Cook 925-784-3077 -- it takes many small steps to climb a mountain, but the view gets better all the time.
RE: Suppressing DNS lookups when using wget, forcing specific IP address
Try:
wget http://ip.of.new.sitename --header="Host: sitename.com" --mirror
For example:
wget http://66.233.187.99 --header="Host: google.com" --mirror
Tony
-----Original Message-----
From: Kelly Jones [mailto:[EMAIL PROTECTED]
Sent: Sunday, June 17, 2007 6:10 PM
To: wget@sunsite.dk
Subject: Suppressing DNS lookups when using wget, forcing specific IP address
I'm moving a site from one server to another, and want to use wget -m combined w/ diff -auwr to help make sure the site looks the same on both servers. My problem: wget -m sitename.com always downloads the site at its *current* IP address. Can I tell wget: "download sitename.com, but pretend the IP address of sitename.com is ip.address.of.new.server instead of ip.address.of.old.server"? In other words, suppress the DNS lookup for sitename.com and force it to use a given IP address. I've considered kludges like using old.sitename.com vs new.sitename.com, editing /etc/hosts, using a proxy server, etc, but I'm wondering if there's a clean solution here? -- We're just a Bunch Of Regular Guys, a collective group that's trying to understand and assimilate technology. We feel that resistance to new ideas and technology is unwise and ultimately futile.
RE: Question on wget upload/dload usage
Joe Kopra wrote: The wget statement looks like: wget --post-file=serverdata.mup -o postlog -O survey.html http://www14.software.ibm.com/webapp/set2/mds/mds
--post-file does not work the way you want it to; it expects a text file that contains something like this:
a=1&b=2
and it sends that raw text to the server in a POST request using a Content-Type of application/x-www-form-urlencoded. If you run it with -d, you will see something like this:
POST /someurl HTTP/1.0
User-Agent: Wget/1.10
Accept: */*
Host: www.exelana.com
Connection: Keep-Alive
Content-Type: application/x-www-form-urlencoded
Content-Length: 7
---request end---
[writing POST file data ... done]
To post a file as an argument, you need a Content-Type of multipart/form-data, which wget does not currently support. Tony
RE: wget bug
Highlord Ares wrote: it tries to download web pages named similar to http://site.com?variable=yes&mode=awesome
Since & is a reserved character in many command shells, you need to quote the URL on the command line:
wget "http://site.com?variable=yes&mode=awesome"
Tony
RE: sending Post Data and files
Lara Röpnack wrote: 1.) How can I send Post Data with Line Breaks? I can not press enter and \n or \r or \r\n don't work... You don't need a line break because parameters are separated by ampersands: a=1&b=2
2.) I don't understand the post File. I can send one file - but I can't give the name. Normally I have a form with a form element <input type="file" name="xy">. Is it possible to send a file with a name? Is it possible to send two files?
On the command line you can use --post-data="a=1&b=2" or you can put the data into a file. For example, if the file foo contains the string a=1&b=2, you would use --post-file=foo. Currently, it is not possible to send files with wget. It does not support multipart/form-data. Tony
RE: FW: think you have a bug in CSS processing
J.F.Groff wrote: Amazingly I found this feature request in a 2003 message to this very mailing list. Are there only a few lunatics like me who think this should be included? Wget is written and maintained by volunteers. What you need to find is a lunatic willing to volunteer to write the code to support this feature request. Tony
RE: Suggesting Feature: Download anything newer than...
I don't think there is such a feature, but if you're going to add --not-before, you might as well add --not-after too. Tony
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Saturday, April 07, 2007 6:27 PM
To: wget@sunsite.dk
Subject: Suggesting Feature: Download anything newer than...
I'm a very frequent user of wget, but must admit I haven't dived too deep into various options - but as far as I can tell, what I'm about to suggest is not a current feature. If it is, can somebody tell me how to access it? 0:-) What I'm suggesting is something similar to -N (check timestamp and download newer) and may perhaps be used more as a modifier to -N than a separate option. I occasionally make a mirror of a certain site with wget, and then throw it into an archive. Unfortunately, a few months (a year) later when I want to catch up with any updates, I either have to mirror the whole thing again or locate the old archive and unpack it (and I haven't necessarily preserved the whole directory structure). What I would love is the ability to specify (through an option) an arbitrary timestamp (a date... and perhaps time), and for only files created/modified after this time to be downloaded (e.g. the approximate time of the creation of my latest archive). I envision it as based on the -N option; except that rather than looking at the timestamp - or the size or even the existence - of a local file, it would only compare the remote file's timestamp to the supplied timestamp - and download if the remote file was newer. Of course, it would probably be h*** of a lot worse to program than just rewriting the -N option. :-) It would have to parse links in HTML files (HTML) or traverse directories (FTP). Usually it would be used when no local mirror existed, and then it would create a mirror of just files made after a certain time (it would of course have to create a dir structure containing directories also older than the specified time, but no older files).
However, being able to use it (a specified time) together with the -N or --mirror option may also be useful when updating a local mirror (though I can't actually see when); so perhaps it should be an option to be used in *conjunction* with -N (rather than instead of -N)... or at least let it be *possible* to use it together with -N and --mirror as well as by itself. -Koppe
RE: Cannot write to auto-generated file name
Vitaly Lomov wrote: It's a file system issue on windows: file path length is limited to 259 chars. In which case, wget should do something reasonable (generate an error message, truncate the file name, etc.). It shouldn't be left as an exercise for the user to figure out that the automatically generated name cannot be used by the OS. (My vote is to truncate the name, but it's a lot easier to generate an error message.) Tony
RE: Huh?...NXDOMAINS
Bruce [EMAIL PROTECTED] wrote: the hostname 'ga13.gamesarena.com.au' resolves back to an NX domain NXDOMAIN is shorthand for non-existent domain. It means the domain name system doesn't know the IP address of the domain. (It would be like me having a non-published telephone number; if you know my number, you can call me, but it won't do you any good to call directory assistance because they can't tell you my number.) If your web browser is able to find the site then it should be possible for wget to find it too. But, since it's not a straightforward DNS lookup, you'll have to figure out how your browser is pulling off the magic. One way to do that is to run with a local proxy (such as Achilles) and study what happens between your browser and the server. If you compare that with the debug output of wget, you'll have an idea of where the flow is different and what wget might do to make it work. I'm sure someone can point out open-source options for the proxy. :-) Have fun exploring. Tony
RE: wget help on file download
The server told wget that it was going to return 6K: Content-Length: 6720 Tony

_ From: Smith, Dewayne R. [mailto:[EMAIL PROTECTED] Sent: Thursday, March 01, 2007 8:05 AM To: [EMAIL PROTECTED] Subject: wget help on file download

Trying to download a 4mb file. It only retrieves 6k of it. I've tried without the added --options and it doesn't work. Can you see any issues below? Thx!

C:\Backup_CD\WGET> wget -dv -S --no-http-keep-alive --ignore-length --secure-protocol=auto --no-check-certificate https://server2.csci-va.com/siap/siap.nsf/297c783b5c8fa51985256cd700546846/65dc9ed713ae030f85256f31006eb413/$FILE/TR%202004.018%20AEGIS%20TEST%20PLAN..pdf
Setting --verbose (verbose) to 1
Setting --server-response (serverresponse) to 1
Setting --http-keep-alive (httpkeepalive) to 0
Setting --ignore-length (ignorelength) to 1
Setting --secure-protocol (secureprotocol) to auto
Setting --check-certificate (checkcertificate) to 0
DEBUG output created by Wget 1.10.2 on Windows.
--11:01:08-- https://server2.csci-va.com/siap/siap.nsf/297c783b5c8fa51985256cd700546846/65dc9ed713ae030f85256f31006eb413/$FILE/TR%202004.018%20AEGIS%20TEST%20PLAN..pdf
=> `TR 2004.018 AEGIS TEST PLAN..pdf.4'
Resolving server2.csci-va.com... seconds 0.00, 65.207.33.26
Caching server2.csci-va.com => 65.207.33.26
Connecting to server2.csci-va.com|65.207.33.26|:443... seconds 0.00, connected.
Created socket 1932.
Releasing 0x00395228 (new refcount 1).
Initiating SSL handshake.
Handshake successful; connected socket 1932 to SSL handle 0x009318c8
certificate:
subject: /C=US/O=U.S. Government/OU=ECA/OU=ORC/OU=CSCI/CN=server2.csci-va.com
issuer: /C=US/O=U.S. Government/OU=ECA/OU=Certification Authorities/CN=ORC ECA
WARNING: Certificate verification error for server2.csci-va.com: self signed certificate in certificate chain
---request begin---
GET /siap/siap.nsf/297c783b5c8fa51985256cd700546846/65dc9ed713ae030f85256f31006eb413/$FILE/TR%202004.018%20AEGIS%20TEST%20PLAN..pdf HTTP/1.0
User-Agent: Wget/1.10.2
Accept: */*
Host: server2.csci-va.com
---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 200 OK
Server: Lotus-Domino
Date: Thu, 01 Mar 2007 15:57:55 GMT
Connection: close
Expires: Tue, 01 Jan 1980 06:00:00 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 6720
Pragma: no-cache
---response end---
HTTP/1.1 200 OK
Server: Lotus-Domino
Date: Thu, 01 Mar 2007 15:57:55 GMT
Connection: close
Expires: Tue, 01 Jan 1980 06:00:00 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 6720
Pragma: no-cache
Length: ignored [text/html]
[ <=> ] 6,720 --.--K/s
Closed 1932/SSL 0x9318c8
11:01:08 (309.48 KB/s) - `TR 2004.018 AEGIS TEST PLAN..pdf.4' saved [6720]

C:\Backup_CD\WGET>

Dewayne R. Smith SPAWAR Systems Center Charleston Code 613, Special Projects Branch Office (843) 218-4393 Mobile (843) 696-9472
RE: how to get images into a new directory/filename hierarchy? [GishPuppy]
If it were me, I'd grab all the files to my local drive and then write scripts to do the moving and renaming. Tony -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Sent: Friday, February 23, 2007 1:33 AM To: wget@sunsite.dk Subject: how to get images into a new directory/filename hierarchy? [GishPuppy] Hi, I'm trying to use wget to download 100s of JPGs into a cache server with a different directory/filename hierarchy. What I tried to do was to create a text or html file with 1 line for each download (e.g. URL -nd -P [new-path] -O [new-filename]) and use the --input-file= switch. However, I discovered that I cannot rename the path/filename of the file inside the input file. Also, the JPGs will not all come from the same domain, but they need to be placed in a flattened directory tree with different filenames. Can anyone offer me advice on how to best accomplish this? I'm using the windows platform. m. Gishpuppy | To reply to this email, click here: http://www.gishpuppy.com/cgi-bin/[EMAIL PROTECTED]
RE: php form
The table stuff just affects what's shown on the user's screen. It's the input field that affects what goes to the server; in this case, that's <input ... name=country ...> so you want to post country=US. If there were multiple fields, you would separate them with ampersands, such as country=US&state=CA. Tony

_ From: Alan Thomas [mailto:[EMAIL PROTECTED] Sent: Thursday, February 22, 2007 5:27 PM To: Tony Lewis; wget@sunsite.dk Subject: Re: php form

Tony, Thanks. I have to log in with username/password, and I think I know how to do that with wget using POST. For the actual search page, the HTML source says it's:

<form action="full_search.php" method="POST">

However, I'm not clear on how to convey the data for the search. The search form has a table defined. One of the entries, for example, is:

<tr>
<td><b><font face="Arial">Search by Country:</font></b></td>
<td><input type="text" name="country" size="50" maxlength="100"></td>
</tr>

If I want to use wget to search for entries in the U.S. (US), then how do I convey this when I post to the php? Thanks, Alan

- Original Message - From: Tony Lewis To: 'Alan Thomas'; wget@sunsite.dk Sent: Thursday, February 22, 2007 12:53 AM Subject: RE: php form

Look for <form action="some-web-page" method="XXX" ...> action tells you where the form fields are sent. method tells you if the server is expecting the data to be sent using a GET or POST command; GET is the default. In the case of GET, the arguments go into the URL. If method is POST, follow the instructions in the manual. Hope that helps. Tony

_ From: Alan Thomas [mailto:[EMAIL PROTECTED] Sent: Wednesday, February 21, 2007 4:39 PM To: wget@sunsite.dk Subject: php form

There is a database on a web server (to which I have access) that is accessible via username/password. The only way for users to access the database is to use a form with search criteria and then press a button that starts a php script that produces a web page with the results of the search.
I have a couple of questions: 1. Is there any easy way to know exactly what commands are behind the button, to duplicate them? 2. If so, do I just use the POST command as described in the manual, after logging in (per the manual), to get the data it provides? I have used wget just a little, but I am completely new to php. Thanks, Alan
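Composing the --post-data string by hand gets error-prone once a form has several fields. A small sketch of building it programmatically; the country/state field names are just the examples from this thread, and the search URL is a placeholder:

```python
from urllib.parse import urlencode

# Hypothetical form fields; the real names come from the page's <input> tags.
fields = {"country": "US", "state": "CA"}

body = urlencode(fields)  # joins fields with '&' and URL-encodes the values
print(body)               # country=US&state=CA

command = f'wget --post-data="{body}" http://www.example.com/full_search.php'
print(command)
```

urlencode also takes care of escaping spaces and punctuation in the values, which matters as soon as a search term isn't a plain word.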
RE: SI units
Lars Hamren wrote: Download speeds are reported as K/s, where, I assume, K is short for kilobytes. The correct SI prefix for thousand is k, not K: http://physics.nist.gov/cuu/Units/prefixes.html SI units are for decimal-based numbers (that is powers of 10) whereas computer programs typically use binary-based numbers (powers of 2). It's convenient for humans to equate 10^3 (1,000) with 2^10 (1,024) but with large numbers, these values quickly diverge: 999k or 999 * 10^3 = 999,000, but 999K or 999 * 2^10 = 1,022,976. For what it's worth, according to Wikipedia either k or K is acceptable for 1024: http://en.wikipedia.org/wiki/Binary_prefix
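The divergence between the decimal and binary senses of the prefix is easy to check directly:

```python
decimal_k = 999 * 10**3   # SI 'k' (kilo, powers of 10)
binary_K = 999 * 2**10    # traditional 'K' (1024, powers of 2)

print(decimal_k)              # 999000
print(binary_K)               # 1022976
print(binary_K - decimal_k)   # 23976
```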
RE: SI units
Christoph Anton Mitterer wrote: I don't agree with that,.. SI units like K/M/G etc. are specified by international standards and those specify them as 10^x. The IEC defined in IEC 60027 symbols for the use with base 2 (e.g. Ki, Mi, Gi) All of this is described in the Wikipedia article I referenced. It's true that International Electrotechnical Commission prefers the term kibibytes and the prefix Ki for 1,024, but it's still not a term commonly used in computer standards. Searching ietf.org there are 1,880 matches for kilobytes and only 2 for kibibytes and those are both feedback from one individual arguing for the use of kibibytes instead of kilobytes. Searching gnu.org there are 452 matches for kilobytes and only 5 for kibibytes and even then, the following appears: `KiB' kibibyte: 2^10 = 1024. `K' is special: the SI prefix is `k' and the IEC 60027-2 prefix is `Ki', but tradition and POSIX use `k' to mean `KiB'. It seems odd to me that one would suggest that wget is the place to start changing the long-established trend of using 'k' for 1,024.
RE: wget question (connect multiple times)
A) This is the list for reporting bugs. Questions should go to wget@sunsite.dk B) wget does not support downloading the same file over multiple simultaneous connections C) The decreased per-file download time you're seeing is (probably) because wget is reusing its connection to the server to download the second file. It takes some time to set up a connection to the server regardless of whether you're downloading one byte or one gigabyte of data. For small files, the setup time can be a significant part of the overall download time. Hope that helps! Tony -Original Message- From: t u [mailto:[EMAIL PROTECTED] Sent: Tuesday, October 17, 2006 3:50 PM To: [EMAIL PROTECTED] Subject: wget question (connect multiple times) hi, I hope it is okay to drop a question here. I recently found that if wget downloads one file, my download speed will be Y, but if wget downloads two separate files (from the same server, doesn't matter), the download speed for each of the files will be Y (so my network speed will go up to 2 x Y). So my question is, can I make wget download the same file multiple times simultaneously? In a way, it would run as multiple processes and download parts of the file at the same time, speeding up the download. Hope I could explain my question, sorry about the bad english. Thanks PS. Please consider this as an enhancement request if wget cannot get a file by downloading parts of it simultaneously.
RE: I got one bug on Mac OS X
I don't think that's valid HTML. According to RFC 1866: An HTML user agent should treat end of line in any of its variations as a word space in all contexts except preformatted text. I don't see any provision for end of line within the HREF attribute of an A tag. Tony

From: HUAZHANG GUO [mailto:[EMAIL PROTECTED] Sent: Tuesday, July 11, 2006 7:48 AM To: [EMAIL PROTECTED] Subject: I got one bug on Mac OS X

Dear Sir/Madam, while I was trying to download using the command: wget -k -np -r -l inf -E http://dasher.wustl.edu/bio5476/ I got most of the files, but lost some of them. I think I know where the problem is: the link is broken into two lines in the index.html:

<P>Lecture 1 (Jan 17): Exploring Conformational Space for Biomolecules <A HREF="http://dasher.wustl.edu/bio5476/lectures
/lecture-01.pdf">[PDF]</A></P>

I will get the following error message:

--09:13:16-- http://dasher.wustl.edu/bio5476/lectures%0A/lecture-01.pdf
=> `/Users/hguo/mywww//dasher.wustl.edu/bio5476/lectures%0A/lecture-01.pdf'
Connecting to dasher.wustl.edu[128.252.208.48]:80... connected.
HTTP request sent, awaiting response... 404 Not Found
09:13:16 ERROR 404: Not Found.

Please note that wget adds a special character '%0A' to the URL. Maybe the Windows newline has one more character which is not recognized by Mac wget. I am using Mac OS X, Tiger Darwin. Thanks!
RE: wget - Returning URL/Links
Mauro Tortonesi wrote: perhaps we should modify wget in order to print the list of touched URLs as well? maybe only in case -v is given? what do you think? On June 28, 2005, I submitted a patch to write unfollowed links to a file. It would be pretty simple to have a similar --followed-links option. Tony
RE: BUG
Run the command with -d and post the output here. Tony _ From: Junior + Suporte [mailto:[EMAIL PROTECTED]] Sent: Monday, July 03, 2006 2:00 PM To: [EMAIL PROTECTED] Subject: BUG Dear, I'm using wget to send a login request to a site. When wget is saving the cookies, the following error message appears: Error in Set-Cookie, field `Path': Syntax error in Set-Cookie: tu=661541|802400391 @TERRA.COM.BR; Expires=Thu, 14-Oct-2055 20:52:46 GMT; Path= at position 78. Location: http://www.tramauniversitario.com.br/servlet/login.jsp?username=802400391%40terra.com.br&pass=123qwe&rd=http%3A%2F%2Fwww.tramauniversitario.com.br%2Ftuv2%2Fenquete%2Fcb%2Fsul%2Farte.jsp [following] I am trying to access the URL http://www.tramauniversitario.com.br/tuv2/participe/login.jsp?rd=http://www.tramauniversitario.com.br/tuv2/enquete/cb/sul/arte.jsp&[EMAIL PROTECTED]&pass=123qwe&Submit.x=6&Submit.y=1 In Internet Explorer, this URL works correctly and the cookie is saved on the local machine, but in WGET, this cookie returns an error. Thanks, Luiz Carlos Zancanella Junior
RE: wget - tracking urls/web crawling
Bruce wrote: any idea as to who's working on this feature? Mauro Tortonesi sent out a request for comments to the mailing list on March 29. I don't know whether he has started working on the feature or not. Tony
RE: Batch files in DOS
I think there is a limit to the number of characters that DOS will accept on the command line (perhaps around 256). Try putting echo in front of the command in your batch file and see how much of it gets echoed back to you. As Tobias suggested, you can try moving some of your command line options into the .wgetrc file. Tony -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Sent: Saturday, June 03, 2006 2:46 PM To: wget@sunsite.dk Subject: Batch files in DOS I'm trying to mirror about 100 servers (small fanfic sites) using wget --recursive --level=inf -Dblah.com,blah.com,blah.com some_address However, when I run the batch file, it stops reading after a while; apparently my command has too many characters. Is there some other way I should be doing this, or a workaround? GNU Wget 1.10.1 running on Windows 98 -- http://www.aericanempire.com/
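One way to shorten the command line, as Tobias suggested, is to move the fixed options into .wgetrc and keep only the per-site parts in the batch file. A sketch (the option names are from the wget manual; the domain values are illustrative):

```
# ~/.wgetrc (or the file named by the WGETRC environment variable)
recursive = on
reclevel = inf
domains = blah.com,example.com
```

The batch file then shrinks to one short `wget some_address` line per site.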
RE: I cannot get the images
The problem is your accept list; -A*.* says to accept any file that contains at least one dot in the file name, and GetFile?id=DBJOHNUNZIOCSBMOMKRU&convert=image%2Fgif&scale=3 doesn't contain any dots. I think you want to accept all files, so just delete -A*.* from your argument list because the default behavior is to accept everything. Tony -Original Message- From: matis [mailto:[EMAIL PROTECTED] Sent: Monday, May 15, 2006 6:09 AM To: wget@sunsite.dk Subject: I cannot get the images Hi, I'm trying to get a whole directory, but images from the database are ignored. If you paste the address below this post into the browser (or even flashget), it will download the image and open it with the default extension .gif. But wget reports that the file should be removed and then removes it :/ . As a result, when there's a picture on every page (with an address as below), only empty htmls are downloaded. Does anybody know what to do? The address (with the wget command used by me): wget --cache=off -p -m -erobots=off -t10 -v -A*.* "http://alo.uibk.ac.at:80/filestore/servlet/GetFile?id=DBJOHNUNZIOCSBMOMKRU&convert=image/gif&scale=3" whole html address (broken): http://www.literature.at/webinterface/library/ALO-BOOK_V01?objid=13017&page=3&zoom=3 regards matis
RE: wget www.openbc.com post-data/cookie problem
Erich Steinboeck wrote: Is there a way to trace the browser traffic and compare that to the wget traffic, to see where they differ. You can use a web proxy. I like Achilles: http://www.mavensecurity.com/achilles Tony
RE: Defining url in .wgetrc
ks wrote: Just one more question. Something like this inside somefile.txt http://fly.srk.fer.hr/ -r http://www.gnu.org/ -o gnulog -S http://www.lycos.com/ Why not use a batch file or command script (depending on what OS you're using) containing something like: wget http://fly.srk.fer.hr wget -r http://www.gnu.org -o gnulog wget -S http://www.lycos.com Tony
RE: Windows Title Bar
Hrvoje Niksic wrote: Anyway, adding further customizations to an already questionnable feature is IMHO not a very good idea. Perhaps Derek would be happy if there were a way to turn off this questionable feature. Tony
RE: dose wget auto-convert the downloaded text file?
18 mao [EMAIL PROTECTED] wrote: then save the page as 2.html with the FireFox browser You should not assume that the file saved by any browser is the same as the file delivered to the browser by the server. The browser is probably manipulating line endings to match the conventions on your operating system when it saves files so that CR, CR-LF, or LF, all become CR-LF (or whatever your OS uses for line endings). Tony
RE: download of images linked in css does not work
It's not a bug; it's a (missing) feature. -Original Message- From: Detlef Girke [mailto:[EMAIL PROTECTED] Sent: Thursday, April 13, 2006 3:17 AM To: [EMAIL PROTECTED] Subject: download of images linked in css does not work Hello, I tried everything, but images built in via CSS are neither downloaded nor relinked by wget. Example (inline style): CSS terms like <div style="background-image: url(/files/inc/image/pjpeg/hintergrund_startseite.jpg);" id="pfad"> do not have any effect on the downloaded web page. The same thing happens when you write {background-image: url(/files/inc/image/pjpeg/hintergrund_startseite.jpg);} into a css file. Maybe other references in CSS do not work either. Perhaps you can prove this. If you could fix this problem, wget would be the best tool for me. Thank you and best regards Detlef -- Detlef Girke, BIK Hamburg, Beratung, Tests und Workshops c/o DIAS GmbH, Neuer Pferdemarkt 1, 20359 Hamburg [EMAIL PROTECTED], www.bik-online.info, 040 43187513, Fax 040 43187519
RE: regex support RFC
Mauro Tortonesi wrote: no. i was talking about regexps. they are more expressive and powerful than simple globs. i don't see what's the point in supporting both. The problem is that users who are expecting globs will try things like --filter=-file:*.pdf rather than --filter:-file:.*\.pdf. In many cases their expressions will simply work, which will result in significant confusion when some expression doesn't work, such as --filter:-domain:www-*.yoyodyne.com. :-) It is pretty easy to programmatically convert a glob into a regular expression. One possibility is to make glob the default input and allow regular expressions. For example, the following could be equivalent: --filter:-domain:www-*.yoyodyne.com --filter:-domain,r:www-.*\.yoyodyne\.com Internally, wget would convert the first into the second and then treat it as a regular expression. For the vast majority of cases, glob will work just fine. One might argue that it's a lot of work to implement regular expressions if the default input format is a glob, but I think we should aim for both lack of confusion and robust functionality. Using ,r means people get regular expressions when they want them and know what they're doing. The universe of wget users who know what they're doing are mostly subscribed to this mailing list; the rest of them send us mail saying please CC me as I'm not on the list. :-) If we go this route, I'm wondering if the appropriate conversion from glob to regular expression should take directory separators into account, such as: --filter:-path:path/to/* becoming the same as: --filter:-path,r:path/to/[^/]* or even: --filter:-path,r:path[/\\]to[/\\][^/\\]* Should the glob match path/to/sub/dir? (I suspect it shouldn't.) Tony
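The glob-to-regex conversion described above as "pretty easy to programmatically convert" can be sketched in a few lines. Python here is just for illustration (wget itself is C, and the --filter syntax is still a proposal); the keep_dir_sep flag implements the question of whether `*` should cross directory separators:

```python
import re

def glob_to_regex(glob: str, keep_dir_sep: bool = True) -> str:
    """Convert a shell-style glob into an anchored regular expression.

    With keep_dir_sep=True, '*' and '?' stop at '/', so 'path/to/*'
    does not match 'path/to/sub/dir'."""
    star = "[^/]*" if keep_dir_sep else ".*"
    qmark = "[^/]" if keep_dir_sep else "."
    parts = []
    for ch in glob:
        if ch == "*":
            parts.append(star)
        elif ch == "?":
            parts.append(qmark)
        else:
            parts.append(re.escape(ch))  # escape regex metacharacters like '.'
    return "^" + "".join(parts) + "$"

# --filter:-domain:www-*.yoyodyne.com becomes a regex like www\-.*\.yoyodyne\.com
print(glob_to_regex("www-*.yoyodyne.com", keep_dir_sep=False))
print(bool(re.match(glob_to_regex("path/to/*"), "path/to/sub/dir")))  # False
```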
RE: regex support RFC
Hrvoje Niksic wrote: But that misses the point, which is that we *want* to make the more expressive language, already used elsewhere on Unix, the default. I didn't miss the point at all. I'm trying to make a completely different one, which is that regular expressions will confuse most users (even if you tell them that the argument to --filter is a regular expression). This mailing list will get a huge number of bug reports when users try to use globs that fail. Yes, regular expressions are used elsewhere on Unix, but not everywhere. The shell is the most obvious comparison for user input dealing with expressions that select multiple objects; the shell uses globs. Personally, I will be quite happy if --filter only supports regular expressions because I've been using them quite effectively for years. I just don't think the same thing can be said for the typical wget user. We've already had disagreements in this chain about what would match a particular regular expression; I suspect everyone involved in the conversation could have correctly predicted what the equivalent glob would do. I don't think ,r complicates the command that much. Internally, the only additional work for supporting both globs and regular expressions is a function that converts a glob into a regexp when ,r is not requested. That's a straightforward transformation. Tony
RE: regex support RFC
Hrvoje Niksic wrote: I don't see a clear line that connects --filter to glob patterns as used by the shell. I want to list all PDFs in the shell, ls -l *.pdf I want a filter to keep all PDFs, --filter=+file:*.pdf Note that *.pdf is not a valid regular expression even though it's what most people will try naturally. Perl complains: /*.pdf/: ?+*{} follows nothing in regexp I predict that the vast majority of bug reports and support requests will be for users who are trying a glob rather than a regular expression. Tony
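Tony's point is easy to demonstrate: *.pdf is a perfectly good glob but an invalid regular expression, so a user typing it where a regexp is expected gets an error rather than a match. A quick check:

```python
import fnmatch
import re

# As a glob, '*.pdf' matches PDF file names:
print(fnmatch.fnmatch("report.pdf", "*.pdf"))  # True

# As a regular expression it is invalid: the leading '*' follows nothing.
try:
    re.compile("*.pdf")
except re.error as exc:
    print("invalid regex:", exc)
```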
RE: regex support RFC
How many keywords do we need to provide maximum flexibility on the components of the URI? (I'm thinking we need five.) Consider http://www.example.com/path/to/script.cgi?foo=bar

--filter=uri:regex could match against any part of the URI
--filter=domain:regex could match against www.example.com
--filter=path:regex could match against /path/to/script.cgi
--filter=file:regex could match against script.cgi
--filter=query:regex could match against foo=bar

I think there are good arguments for and against matching against the file name in path. Tony
RE: regex support RFC
Curtis Hatter wrote: Also any way to add modifiers to the regexs? Perhaps --filter=path,i:/path/to/krs would work. Tony
RE: Bug in ETA code on x64
Hrvoje Niksic wrote: The cast to int looks like someone was trying to remove a warning and botched operator precedence in the process. I can't see any good reason to use the comma operator here. Why not write the line as: eta_hrs = eta / 3600; eta %= 3600; This makes it much less likely that someone will make a coding error while editing that section of code. Tony
RE: wget option (idea for recursive ftp/globbing)
Mauro Tortonesi wrote: i would like to read other users' opinion before deciding which course of action to take, though. Other users have suggested adding a command line option for -a two or three times in the past:

- 2002-11-24: Steve Friedl [EMAIL PROTECTED] submitted a patch
- 2002-12-24: Maaged Mazyek [EMAIL PROTECTED] submitted a patch
- 2005-05-09: B Wooster [EMAIL PROTECTED] asked if the fix was ever going to be implemented
- 2005-08-19: Carl G. Ponder [EMAIL PROTECTED] asked if the patches were going to be applied
- 2005-08-20: Hrvoje responded by posting his own patch for --list-options

(and that's just what I can find in my local archive searching for "list -a") There is clearly a need among the user community for a feature like this and lots of ideas about how to implement it. I'd say you should pick one and implement it. If you need copies of any of the patches mentioned in the list above, let me know. Tony
RE: wget 1.10.x fixed recursive ftp download over proxy
I believe the following simplified code would have the same effect:

if ((opt.recursive || opt.page_requisites || opt.use_proxy)
    && url_scheme (*t) != SCHEME_FTP)
  status = retrieve_tree (*t);
else
  status = retrieve_url (*t, &filename, &redirected_URL, NULL, &dt);

Tony

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of CHEN Peng Sent: Monday, January 09, 2006 12:38 AM To: [EMAIL PROTECTED] Subject: wget 1.10.x fixed recursive ftp download over proxy

Hi, We once encountered an annoying problem recursively downloading FTP data using wget through an ftp-over-http proxy. Previously it was the proxy firmware that did not support recursive downloads, but even after upgrading we realized there is a problem with wget itself as well. We found that with the new proxy firmware, the older wget 1.7.x can download an FTP database recursively, but the newer versions (1.9.x and 1.10.x) cannot. That means there must be something wrong with the code. I also confirmed this is a known bug for wget since 2003 and it is strange that it has not been fixed for a long time. To fix this problem, I took some time to analyze the code, and it happens that wget uses different methods to get the list of files for a destination folder when trying to do a recursive download. For normal FTP, it uses the FTP command "LIST" to get the file listing. For normal HTTP, it uses its internal method retrieve_tree() to generate the list. In main.c, it does not use the retrieve_tree() function to generate the list if the traffic is FTP. However, when we use an ftp-over-http proxy, the actual request to the server is an HTTP request, where the "LIST" FTP command won't work, so we only get one "index.html" file.

if ((opt.recursive || opt.page_requisites)
    && url_scheme (*t) != SCHEME_FTP)
  status = retrieve_tree (*t);
else
  status = retrieve_url (*t, &filename, &redirected_URL, NULL, &dt);

In this scenario, we need to modify the code to force wget to call the retrieve_tree function for FTP traffic if the proxy is involved:

if ((opt.recursive || opt.page_requisites)
    /* && url_scheme (*t) != SCHEME_FTP) */
    && ((url_scheme (*t) != SCHEME_FTP)
        || (opt.use_proxy && url_scheme (*t) == SCHEME_FTP)))
  status = retrieve_tree (*t);
else
  status = retrieve_url (*t, &filename, &redirected_URL, NULL, &dt);

After patching main.c, the new wget works perfectly for FTP recursive downloading, both with a proxy and without a proxy. This patch works for 1.9.x and 1.10.x up to the latest version so far (1.10.2). -- CHEN Peng [EMAIL PROTECTED]
RE: spaces in pathnames using --directory-prefix=prefix
Jonathan DeGumbia wrote: I'm trying to use the --directory-prefix=prefix option for wget on a Windows system. My prefix has spaces in the path directories. Wget appears to terminate the path at the first space encountered. In other words, if my prefix is c:/my prefix/ then wget copies files to c:/my/. Is there a work-around for this? wget is not terminating the path at the space; the command processor is. In the same way that you have to enter: dir "c:\my prefix" to list the contents of the directory, you have to enter: wget --directory-prefix="c:/my prefix" or the command processor will split the directory path at the space before passing it to wget. Tony
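The splitting is easy to see without running wget at all; a sketch using a stand-in function that prints each argument it receives on its own line:

```shell
# Stand-in for wget: print each argument on its own line.
show_args() { printf '[%s]\n' "$@"; }

show_args --directory-prefix=c:/my prefix     # splits into two arguments
show_args --directory-prefix="c:/my prefix"   # one argument, space preserved
```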
RE: Error connecting to target server
[EMAIL PROTECTED] wrote: Thanks for your reply. Only ping works for bbc.com and not wget. When I issue the command wget www.bbc.com, it successfully downloads the following file:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Refresh" content="0; URL=http://www.bbc.co.uk/?ok">
<TITLE>British Broadcasting Corporation </TITLE>
</HEAD>
<BODY BGCOLOR="white">
</BODY>
</HTML>

You might want to try wget "http://www.bbc.co.uk". I think http://www.gnu.org/software/wget/faq.html should have another question: Why did my download fail and how can I get it to work? In the answer to that question we should mention all the common failure modes: disallowed by robots.txt, need to set user agent to look like a browser, META refresh (as above), etc., along with the command line options to resolve the failure. Also, perhaps the next version of wget can handle META refresh. Tony
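Until wget learns to follow META refresh, the redirect target can be pulled out of the downloaded page by hand. A rough sketch; the regex only handles the simple `content="0; URL=..."` form seen in this thread:

```python
import re

html = '<META HTTP-EQUIV="Refresh" content="0; URL=http://www.bbc.co.uk/?ok">'

# Grab the URL= portion of the Refresh content attribute.
m = re.search(r'content="?\s*\d+\s*;\s*URL=([^">]+)', html, re.IGNORECASE)
if m:
    print(m.group(1))  # http://www.bbc.co.uk/?ok
```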
RE: wget can't handle large files
Eberhard Wolff wrote: Apparently wget can't handle large files. [snip] wget --version GNU Wget 1.8.2 This bug was fixed in version 1.10 of wget. You should obtain a copy of the latest version, 1.10.2. Tony
RE: Wget patches for .files
Mauro Tortonesi wrote: this is a very interesting point, but the patch you mentioned above uses the LIST -a FTP command, which AFAIK is not supported by all FTP servers. As I recall, that's why the patch was not accepted. However, it would be useful if there were some command line option to affect the LIST parameters. Perhaps something like: wget ftp://ftp.somesite.com --ftp-list=-a Tony
RE: wget a file with long path on Windows XP
PoWah Wong wrote: The login page is: http://safari.informit.com/?FPI=&uicode= How to figure out the login command? These two commands do not work: wget --save-cookies cookies.txt http://safari.informit.com/?FPI= [snip] wget --save-cookies cookies.txt http://safari.informit.com/?FPI=&uicode=/login.php? [snip] When trying to recreate a form in wget, you have to send the data the server is expecting to receive to the location the server is expecting to receive it. You have to look at the login page for the login form and recreate it. In your browser, view the source of http://safari.informit.com/?FPI=&uicode= and you will find the form that appears below. Note that I stripped out the formatting information for the table that contains the form and reformatted what was left to make it readable.

<form action="JVXSL.asp" method="post">
<input type="hidden" name="s" value="1">
<input type="hidden" name="o" value="1">
<input type="hidden" name="b" value="1">
<input type="hidden" name="t" value="1">
<input type="hidden" name="f" value="1">
<input type="hidden" name="c" value="1">
<input type="hidden" name="u" value="1">
<input type="hidden" name="r" value="">
<input type="hidden" name="l" value="1">
<input type="hidden" name="g" value="">
<input type="hidden" name="n" value="1">
<input type="hidden" name="d" value="1">
<input type="hidden" name="a" value="0">
<input tabindex="1" name="usr" id="usr" type="text" value="" size="12">
<input name="pwd" id="pwd" tabindex="1" type="password" value="" size="12">
<input type="checkbox" tabindex="1" name="savepwd" id="savepwd" value="1">
<input type="image" name="Login" src="images/btn_login.gif" alt="Login" width="40" height="16" border="0" tabindex="1" align="absmiddle">
</form>

Note that the server expects the data to be posted to JVXSL.asp and that there are a bunch of fields that must be supplied in order for the server to process the login request. In addition, the two fields you supply are called usr and pwd.
So your first wget command line will look something like this: wget --save-cookies cookies.txt "http://safari.informit.com/JVXSL.asp" --post-data="s=1&o=1&b=1&t=1&f=1&c=1&u=1&r=&l=1&g=&n=1&d=1&a=0&usr=wong_powa[EMAIL PROTECTED]&pwd=123&savepwd=1" Hope that helps! Tony
RE: connect to server/request multiple pages
Pat Malatack wrote: is there a way to stay connected, because it seems to me that this takes a decent amount of time that could be minimized The following command will do what you want: wget "google.com/news" "google.com/froogle" Tony
RE: Invalid directory names created by wget
Larry Jones wrote: Of course it's directly accessible -- you just have to quote it to keep the shell from processing the parentheses: cd 'title.Die-Struck+(Gold+on+Gold)+Lapel+Pins' You can also make the individual characters into literals: cd title.Die-Struck+\(Gold+on+Gold\)+Lapel+Pins Tony
Name or service not known error
I got a "Name or service not known" error from wget 1.10 running on Linux. When I installed an earlier version of wget, it worked just fine. It also works just fine on version 1.10 running on Windows. Any ideas? Here's the output on Linux:

$ wget --version
GNU Wget 1.9-beta1

$ wget http://www.calottery.com/Games/MegaMillions/
--17:29:59-- http://www.calottery.com/Games/MegaMillions/
=> `index.html.8'
Resolving www.calottery.com... 64.164.108.164, 64.164.108.202
Connecting to www.calottery.com[64.164.108.164]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 45,166 [text/html]
100%[==>] 45,166 166.21K/s
17:30:01 (166.17 KB/s) - `index.html.8' saved [45166/45166]

<snip output from make install of 1.10 here>

$ wget --version
GNU Wget 1.10

$ wget http://www.calottery.com/Games/MegaMillions/
--17:30:17-- http://www.calottery.com/Games/MegaMillions/
=> `index.html.9'
Resolving www.calottery.com... failed: Name or service not known.
RE: Removing thousand separators from file size output
Hrvoje Niksic wrote: In fact, I know of no application that accepts numbers as Wget prints them. Microsoft Calculator does. Tony
RE: Is it just that the -m (mirror) option an impossible task [Was: wget 1.91 skips most files]
Maurice Volaski wrote: wget's -m option seems to be able to ignore most of the files it should download from a site. Is this simply because wget can download only the files it can see? That is, if the web server's directory indexing option is off and a page on the site is present on the server, but it isn't referenced by any publicly viewable page, wget simply can't see it. I've been thinking about coding a --extra-sensory-perception option that would cause wget to read the mind of the server so that it can download files that it cannot see. As soon as I get the algorithm worked out, I'll be submitting the patch. So far I've figured out how to download index.html without being able to see it, but I'm sure that if I keep working at it that wget will be able to detect the rest of the files it cannot see. Of course, I could just be taking the wrong approach; it may work better if I try to implement the --psychic option instead. Tony
RE: links conversion; non-existent index.html
Andrzej wrote: Two problems: There is no index.html under this link: http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/ [snip] it creates a non existing link: http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/index.html When you specify a directory, it is up to the web server to determine what resource gets returned. Some web servers will return a directory listing, some will return some file (such as index.html), and others will return an error. For example, Apache might return (in this order): index.html, index.htm, a directory listing (or a 403 Forbidden response if the configuration disallows directory listings). The actual list of files that Apache will search for and the order in which they are selected is determined by the configuration. If the web server returns any information, wget has to save the information that is returned in *some* local file. It chooses to name that local file index.html since it has no way of knowing where the information might have actually been stored on the server. Hope that helps, Tony
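For example, Apache's file-search behavior is driven by configuration directives like the following (an illustrative snippet, not taken from any particular server's configuration):

```apache
# The server tries these files, in order, when a client requests a directory.
DirectoryIndex index.html index.htm

# Without the Indexes option, a directory with no index file yields
# 403 Forbidden instead of a generated directory listing.
Options -Indexes
```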
RE: SSL options
Hrvoje Niksic wrote: The question is what should we do for 1.10? Document the unreadable names and cryptic values, and have to support them until eternity? My vote is to change them to more reasonable syntax (as you suggested earlier in the note) for 1.10 and include the new syntax in the documentation. However, I think wget should continue to support the old options and syntax as alternatives in case people have included them in scripts. Tony
RE: newbie question
Alan Thomas wrote: I am having trouble getting the files I want using a wildcard specifier... There are no options on the command line for what you're attempting to do. Neither wget nor the server you're contacting understands *.pdf in a URI. In the case of wget, it is designed to read web pages (HTML files) and then collect a list of resources that are referenced in those pages, which it then retrieves. In the case of the web server, it is designed to return individual objects on request (X.pdf or Y.pdf, but not *.pdf). Some web servers will return a list of files if you specify a directory, but you already tried that in your first use case. Try coming at this from a different direction. If you were going to manually download every PDF from that directory, how would YOU figure out the names of each one? Is there a web page that contains a list somewhere? If so, point wget there; a recursive retrieval with an accept list (for example, wget -r -l1 -A pdf followed by that page's URL) will fetch just the PDFs it links to. Hope that helps. Tony PS) Jens was mistaken when he said that https requires you to log into the server. Some servers may require authentication before returning information over a secure (https) channel, but that is not a given.
RE: File rejection is not working
Jens Rösner wrote: AFAIK, RegExp for (HTML?) file rejection was requested a few times, but is not implemented at the moment. It seems all the examples people are sending are just attempting to get a match that is not case sensitive. A switch to ignore case in the file name match would be a lot easier to implement than regular expressions and solve the most pressing need. Just a thought. Tony
RE: help!!!
The --post-data option was added in version 1.9. You need to upgrade your version of wget. Tony

-Original Message-
From: Richard Emanilov [mailto:[EMAIL PROTECTED]
Sent: Monday, March 21, 2005 8:49 AM
To: Tony Lewis; [EMAIL PROTECTED]
Cc: wget@sunsite.dk
Subject: RE: help!!!

wget --http-user=login --http-passwd=passwd --post-data='login=login&password=passwd' https://site
wget: unrecognized option `--post-data=login=login&password=password'
Usage: wget [OPTION]... [URL]...

wget --http-user=login --http-passwd=passwd --http-post='login=login&password=password' https://site
wget: unrecognized option `--http-post=login=login&password=passwd'
Usage: wget [OPTION]... [URL]...
Try `wget --help' for more options.

wget -V
GNU Wget 1.8.2

Richard Emanilov [EMAIL PROTECTED]

-Original Message-
From: Tony Lewis [mailto:[EMAIL PROTECTED]
Sent: Monday, March 21, 2005 10:26 AM
To: wget@sunsite.dk
Cc: Richard Emanilov
Subject: RE: help!!!

Richard Emanilov wrote: Below is what I have tried with no success

wget --http-user=login --http-passwd=passwd --http-post='login=login&password=passwd'

That should be:

wget --http-user=login --http-passwd=passwd --post-data='login=login&password=passwd'

Tony
RE: Curb maximum size of headers
Hrvoje Niksic wrote: I don't see how and why a web site would generate headers (not bodies, to be sure) larger than 64k. To be honest, I'm less concerned about the 64K header limit than I am about limiting a header line to 4096 bytes. I don't know any sites that send back header lines that long, but they could. Who's to say some site doesn't have a 4K cookie? Since you are already proposing to limit the entire header to 64K, what is gained by adding this second limit? Tony
RE: one bug?
Jesus Legido wrote: I'm getting a file from https://mfi-assets.ecb.int/dla/EA/ea_all_050303.txt: The problem is not with wget. The file on the server starts with 0xFF 0xFE. Put the following into an HTML file (say temp.html) on your hard drive, open it in your web browser, right click on the link and do a "Save As..." to your hard drive. You will get the same thing as wget downloaded.

<html><body><a href="https://mfi-assets.ecb.int/dla/EA/ea_all_050303.txt">ea.txt</a></body></html>
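For what it's worth, the two leading bytes 0xFF 0xFE are the UTF-16 little-endian byte-order mark, which is why the file looks like binary garbage to tools expecting plain ASCII. A quick way to check a downloaded file is sketched below (`has_utf16le_bom` is an illustrative helper of my own naming, not part of wget):

```c
#include <stddef.h>
#include <string.h>

/* Returns nonzero when the buffer begins with the UTF-16 LE
   byte-order mark (0xFF 0xFE), as the downloaded file does. */
static int has_utf16le_bom(const unsigned char *buf, size_t len)
{
    return len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE;
}
```

Reading the first two bytes of the saved file and passing them to this helper would confirm that the server, not wget, produced the content.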
RE: wget: question about tag
Normand Savard wrote: I have a question about wget. Is it possible to download attribute values other than the hardcoded ones? No, at least not in the existing versions of wget. I have not heard that anyone is working on such an enhancement.
RE: new string module
Mauro Tortonesi wrote: Alle 18:28, mercoledì 5 gennaio 2005, Dražen Kačar ha scritto: Jan Minar wrote: What's wrong with mbrtowc(3) and friends? The mysterious solution is probably to use wprintf(3) instead of printf(3). A couple of questions on #c on freenode would give you that answer. Historically, wget source was written in a way which allowed one to compile it on really old systems. That would rule out C95 functions. (I'm not advocating this approach, just answering the question.) as long as i am the maintainer of wget, backward compatibility on very old or legacy systems will NOT be broken. I don't think it has to be an either/or situation. With well-selected #if statements, you should be able to have something that works on legacy systems while still providing wide character support on more modern operating systems. I'm not volunteering to determine what those #if statements might be :-) ... just pointing out the possibility. Tony
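A rough sketch of the idea (HAVE_WCHAR_H is a hypothetical configure-style macro of my own invention, not an actual wget symbol; the real conditionals would have to be chosen carefully):

```c
#include <stdio.h>

/* Hypothetical configure-style probe: the build system would define
   HAVE_WCHAR_H only after confirming C95 wide-character support. */
#ifdef HAVE_WCHAR_H
# include <wchar.h>
/* Modern systems get the wide-character code path. */
static const char *output_mode(void) { return "wide"; }
#else
/* Legacy systems fall back to the plain byte-oriented functions,
   so the old build targets keep compiling unchanged. */
static const char *output_mode(void) { return "narrow"; }
#endif
```

Either branch compiles to the same interface, so callers never need to know which path was selected at build time.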
RE: Metric units
John J Foerch wrote: It seems that the system of using the metric prefixes for numbers 2^n is a simple accident of history. Any thoughts on this? I would say that the practice of using powers of 10 for K and M is a response to people who cannot think in binary. Tony
RE: Metric units
Carlos Villegas snidely wrote: I would say that the original poster understands what he is saying, and you clearly don't... I'll put my computer science degree up against your business administration and accounting degree any day. A kilobyte has always been 1024 bytes and the choice was not accidental. Computer memory is laid out in bits, which are always powers of two. There are 10 kinds of people in the world; those who understand binary and those who don't. Tony
RE: Metric units
Mark Post wrote: While we're at it, why don't we just round off the value of pi to be 3.0 Do you live in Indiana? Actually, Dr. Edwin Goodwin wanted to round off pi to any of several values including 3.2. http://www.agecon.purdue.edu/crd/Localgov/Second%20Level%20pages/Indiana_Pi_Story.htm Tony
RE: date based retrieval
Anthony Caetano wrote: I am looking for a way to stay current without mirroring an entire site. [snip] Does anyone else see a use for this? Yes. Here's my non-wget solution. I truncate all the files in the directories that I don't want, but maintain the date/time accessed and modified. The Perl script I use follows. --Tony

#!/usr/bin/perl
$dir = shift @ARGV;
die "No such directory: $dir\n" unless -d $dir;
$nFiles = 0;
my @dirs = ($dir);
while (scalar(@dirs)) { processDir(shift @dirs); }
print "Truncated $nFiles files.\n";

sub processDir {
    my ($dir) = @_;
    opendir DIR, $dir or die "Cannot read directory: $dir\n";
    foreach $file (readdir DIR) {
        next if $file eq '.' || $file eq '..';
        $path = "$dir/$file";
        if (-d $path) {
            push @dirs, $path;
        } else {
            my @stat = stat($path);
            $nFiles++;
            open F, ">$path" or next;   # opening for write truncates the file
            close F;
            utime $stat[8], $stat[9], $path;   # restore access/modify times
        }
    }
    closedir DIR;
}
Re: wput mailing list
Justin Gombos wrote: Since I feel that computers serve man, not the reverse, I don't intend to change my file organization to be web page centric. Looking around the web, I was quite surprised to find that I'm the only one with this problem. I was very relieved to find that there was a wput - then disappointed to find that wput doesn't reverse one of the most important capabilities of wget; that is, the ability to selectively mirror only files that are needed. Sounds like you've got a legitimate feature request for wput. You should submit it to *their* mailing list. Even better, borrow the relevant code from wget and upgrade wput yourself. Tony
Re: Stratus VOS support
Jonathan Grubb wrote: Any thoughts of adding support for Stratus VOS file structures? Your question is a little too vague -- even for me (I used to work for Stratus and actually know what VOS is :-)) What file structures are you needing supported that wget does not currently support? Are you needing support when wget is running on Stratus and saving files that it downloads from somewhere else? Or are you needing support for VOS file structures when wget is retrieving a file from a VOS system? If the latter, does this support only apply if both systems are VOS? Tony
Re: Stratus VOS support
Jonathan Grubb wrote: Um. I'm using wget on Win2000 to ftp to a VOS machine. I'm finding that the usual '>' sign for directories isn't supported by wget and that '/' doesn't work either, I think because the ftp server itself is expecting '>'. The problem may be that Win 2000 grabs the '>' (as output redirection) before wget ever sees it. Try putting the path in quotes. Tony
Re: retrieve a whole website with image embedded in css
Ploc wrote: The result is a website very different from the original one as it lacks backgrounds. Can you please confirm whether what I think is true (or not), whether it is registered as a bug, and whether there is a planned date to correct it. It is true. wget only retrieves objects that appear in the HTML. It does not parse the CSS or JavaScript used by a site. Tony
Re: retrieve a whole website with image embedded in css
Ploc wrote: Is it already registered as a bug or in a wishlist? It's not a bug. This feature has been on the wishlist for a long time. Tony
Re: question on wget via http proxy
Malte Schünemann wrote: Since wget is able to obtain directory listings / retrieve data from there it should be possible to also upload data Then it would be wput. :-) What is so special about wget that it is able to perform this task? You can learn a LOT about how wget is communicating with the target site by using the --debug argument. Hope that helps a little. Tony
Re: Escaping semicolons (actually Ampersands)
Phil Endecott wrote:
Tony> The stuff between the quotes following HREF is not HTML; it
Tony> is a URL. Hence, it must follow URL rules not HTML rules.
No, it's both a URL and HTML. It must follow both rules. Please see the page that I cited in my previous message: http://www.htmlhelp.com/tools/validator/problems.html#amp

I've looked at hundreds of web pages and I've never seen anyone put &amp; into HREF in place of an ampersand. Tony
Re: Escaping semicolons
Phil Endecott wrote: There is not much to go on in terms of specifications. The closest is RFC 1738, which includes BNF for a file: URI. However it is ten years old, so whether it reflects current practice I do not know. But it does not allow ; in file: URIs. I conclude from this that wget should be replacing ; with its %3B escape sequence.

I think you're confusing what wget is required to do with URLs entered on the command line and what it chooses to do with the resulting files that it saves. If the unencoded name of a retrieved resource cannot be stored on the local file system, wget encodes it to create a valid name.

Tony Lewis wrote: I use semicolons in CGI URIs to separate parameters. (Ampersand is more often used for this, but semicolon is also allowed and has the advantage that there is no need to escape it in HTML.) There is no need to escape ampersands either.

Tony, are you suggesting that this is legal HTML? <a href="http://foo.foo/foo.cgi?p1=v1&p2=v2">Foo</a> I'm fairly confident that you need to escape the & to make it valid, i.e. <a href="http://foo.foo/foo.cgi?p1=v1&amp;p2=v2">Foo</a>

Just out of curiosity, did you try to implement your theory and see what happens? If you did, you would find that the first version works and the second does not. By the way, the correct URI encoding of ampersand is %26, not &amp;. The latter encoding is used for ampersands in HTML markup. With regard to whether ampersand needs to be encoded, you're misreading the RFC: Many URL schemes reserve certain characters for a special meaning: their appearance in the scheme-specific part of the URL has a designated semantics. If the character corresponding to an octet is reserved in a scheme, the octet must be encoded. The characters ";", "/", "?", ":", "@", "=" and "&" are the characters which may be reserved for special meaning within a scheme. No other characters may be reserved within a scheme. Usually a URL has the same interpretation when an octet is represented by a character and when it is encoded.
However, this is not true for reserved characters: encoding a character reserved for a particular scheme may change the semantics of a URL. The RFC says that you have to escape a reserved character if that character appears in the name of the resource you're trying to retrieve. That is, if you're trying to retrieve a file named a&b.txt, you refer to that file as a%26b.txt in the URL because you're using the ampersand for a non-reserved purpose. If you're using a reserved character for the purpose that it has been reserved (in this case, separating parameters), you do NOT want to encode it. The URL you proposed (after correcting the encoding of the ampersand) is requesting a resource (probably a file) whose name is foo.cgi?p1=v1&p2=v2. It is NOT requesting that the script foo.cgi be executed with argument p1 having a value of v1 and p2 having a value of v2. Hope that helps. Tony
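To make the distinction concrete, here is a minimal sketch (the helper name is mine, not from any RFC or from wget) of how a single literal reserved character gets percent-encoded when it is part of a name rather than acting as a delimiter:

```c
#include <stdio.h>
#include <string.h>

/* Percent-encode one octet, e.g. '&' -> "%26", ';' -> "%3B".
   You apply this to a reserved character that is part of the
   resource name, never to one serving as a scheme delimiter. */
static void percent_encode(unsigned char c, char out[4])
{
    sprintf(out, "%%%02X", c);
}
```

Encoding a&b.txt this way yields a%26b.txt, matching the example above; leaving the & bare would instead split the name at the parameter separator.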
Re: Escaping semicolons
Phil Endecott wrote: I am using wget to build a downloadable zip file for offline viewing of a CGI-intensive web site that I am building. Essentially it works, but I am encountering difficulties with semicolons. I use semicolons in CGI URIs to separate parameters. (Ampersand is more often used for this, but semicolon is also allowed and has the advantage that there is no need to escape it in HTML.) There is no need to escape ampersands either. Tony
Re: file name problem
henry luo wrote: I find a problem in GNU Wget 1.9.1, but I don't know if it is a new function or a bug. The old version (1.8.2) downloads a link, for example: wget 'http://www.expekt.com/odds/eventsodds.jsp?range=100&sortby=date&active=betting&betcategoryId=SOC%25' and the saved file name is eventsodds.jsp?range=100&sortby=date&active=betting&betcategoryId=SOC%25 but the new version (1.9.1) saves the name as eventsodds.jsp?range=100&sortby=date&active=betting&betcategoryId=SOC%

It is a feature. The latest version of wget converts %nn to the appropriate character *if* that character is valid in a filename on the target system. In this case, %25 converts to %, which can appear in a filename. The --restrict-file-names option gives you some control over which characters are escaped, but it does not appear to provide the functionality you're looking for:

`--restrict-file-names=MODE' Change which characters found in remote URLs may show up in local file names generated from those URLs. Characters that are restricted by this option are escaped, i.e. replaced with `%HH', where `HH' is the hexadecimal number that corresponds to the restricted character. By default, Wget escapes the characters that are not valid as part of file names on your operating system, as well as control characters that are typically unprintable. This option is useful for changing these defaults, either because you are downloading to a non-native partition, or because you want to disable escaping of the control characters. When mode is set to unix, Wget escapes the character `/' and the control characters in the ranges 0-31 and 128-159. This is the default on Unix-like OS'es. When mode is set to windows, Wget escapes the characters `\', `|', `/', `:', `?', `"', `*', `<', `>', and the control characters in the ranges 0-31 and 128-159. In addition to this, Wget in Windows mode uses `+' instead of `:' to separate host and port in local file names, and uses `@' instead of `?'
to separate the query portion of the file name from the rest. Therefore, a URL that would be saved as `www.xemacs.org:4300/search.pl?input=blah' in Unix mode would be saved as `www.xemacs.org+4300/search.pl@input=blah' in Windows mode. This mode is the default on Windows. If you append `,nocontrol' to the mode, as in `unix,nocontrol', escaping of the control characters is also switched off. You can use `--restrict-file-names=nocontrol' to turn off escaping of control characters without affecting the choice of the OS to use as file name restriction mode.
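The Windows-mode rule quoted above boils down to a simple character test; the sketch below is illustrative only (wget's actual implementation differs):

```c
#include <string.h>

/* Nonzero if `c` would be escaped in Windows mode per the documentation
   above: \ | / : ? " * < >, plus control ranges 0-31 and 128-159. */
static int windows_restricted(unsigned char c)
{
    return (c != '\0' && strchr("\\|/:?\"*<>", c) != NULL)
        || c < 32
        || (c >= 128 && c <= 159);
}
```

A filename builder would call this per character and emit `%HH' for any character the test flags.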
Re: OpenVMS URL
Hrvoje Niksic wrote: Wget could always support a URL parameter, such as: wget 'ftp://server/dir1/dir2/file;disk=foo' Assuming you can detect a VMS connection, why not simply ftp://server/foo:[dir1.dir2]? Tony
Re: OpenVMS URL
How do you enter the path in your web browser? - Original Message - From: Bufford, Benjamin (AGRE) [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Wednesday, May 26, 2004 7:32 AM Subject: OpenVMS URL I am trying to use wget to retrieve a file from an OpenVMS server but have been unable to make wget process a path with a volume name in it. For example: disk:[directory.subdirectory]filename How would I go about entering this type of path in a way that wget can understand?