Re: Thoughts on Wget 1.x, 2.0 (*LONG!*)
On 10/26/07, Micah Cowan <[EMAIL PROTECTED]> wrote:
> And, of course, when I say "there would be two Wgets", what I really
> mean by that is that the more exotic-featured one would be something
> else entirely than a Wget, and would have a separate name.

I think the idea of having two Wgets is good. I too have been concerned about the resources required in creating the all-out version 2.0. The current code for Wget is a bit mangled, but I think the basic concepts surrounding it are very good ones. Although the code might suck for those trying to read it, I think it could be very great with a little regular maintenance.

There still remains the question, though, of whether version 2 will require a complete rewrite. Considering how fundamental these changes are, I don't think we would have much of a choice. You mentioned that they could share code for recursion, but I don't see how. IIRC, the code for recursion in the current version is very dependent on the current methods of operation. It would probably have to be rewritten to be shared.

As for libcurl, I see no reason why not. Also, would these be two separate GNU projects? Would they be packaged in the same source code, like finch and pidgin?

I do believe the next question at hand is what version 2's official mascot will be. I propose Lenny the tortoise ;)

_ .. Lenny -> (_\/ \_, 'uuuu~'
Thoughts on Wget 1.x, 2.0 (*LONG!*)
With talk of supporting multiple simultaneous connections in a next-generation version of Wget, various things have been tumbling around in my mind.

First off is that I would not wish to do such a thing with threads. Threads introduce too many problems of their own, including portability and debuggability. I'd much prefer to do asynchronous I/O.

With the use of asynchronous I/O, a (possibly) better way to do --timeout presents itself: we can do the appropriate timeouts in our calls to select(). The main advantage to this is that we don't have to muck around with signals, signal handling, various portability issues, etc. We can do one --timeout and be done.

The primary downside to this is that potentially blocking operations that aren't directly I/O don't get timed out anymore. The only thing that currently comes to mind is gethostbyname(), which obviously can block, but can't be select()ed or set to some sort of non-blocking mode. Also, even aside from --timeout, having all other traffic sit around and wait until a name is resolved is not really desirable.

The obvious solution to that is to use c-ares, which does exactly that: handle DNS queries asynchronously. Actually, I didn't know this until just now, but c-ares was split off from ares to meet the needs of the curl developers. :)

Of course, if we're doing asynchronous net I/O stuff, rather than reinvent the wheel and try to maintain portability for new stuff, we're better off using a prepackaged deal, if one exists. Luckily, one does; a friend of mine (William Ahern) wrote a package called libevnet that handles all of that; it wraps libevent (by Niels Provos, for handling async I/O very portably and using the best available interfaces on the given system) with higher-level socket and buffer I/O facilities, and provides a wrapper around c-ares that makes it convenient to use with liblookup.
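The select()-based timeout described above could look roughly like the following. This is only a minimal sketch assuming a POSIX system; the helper name read_with_timeout and its error convention are made up for illustration and are not Wget's API.

```c
#include <errno.h>
#include <sys/select.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not Wget's API): read up to len bytes from fd,
   giving up after timeout_seconds if no data arrives.
   Returns the byte count from read(), or -1 with errno set
   (ETIMEDOUT when the wait expired). fd must be below FD_SETSIZE. */
ssize_t read_with_timeout(int fd, void *buf, size_t len, double timeout_seconds)
{
    fd_set readfds;
    struct timeval tv;
    int ready;

    do {
        /* select() may modify both arguments, so rebuild them each time.
           Note: restarting after EINTR restarts the full timeout. */
        FD_ZERO(&readfds);
        FD_SET(fd, &readfds);
        tv.tv_sec = (time_t) timeout_seconds;
        tv.tv_usec = (suseconds_t) ((timeout_seconds - (double) tv.tv_sec) * 1e6);
        ready = select(fd + 1, &readfds, NULL, NULL, &tv);
    } while (ready < 0 && errno == EINTR);

    if (ready < 0)
        return -1;              /* select() failed; errno is already set */
    if (ready == 0) {
        errno = ETIMEDOUT;      /* nothing arrived in time */
        return -1;
    }
    return read(fd, buf, len);
}
```

No signal handler is involved here, which is the advantage claimed above; the cost is that calls which cannot be passed to select(), such as gethostbyname(), are not covered.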
If we're going to do async I/O, using libevent and c-ares, or something very like them, is far too convenient not to do, and after that decision is made, libevnet becomes a clear win too.

So, the obvious win is that using libevnet, libevent and c-ares gives us a "shortest path" to using async I/O, having multiple simultaneous connections and async DNS queries, and a potentially better way to manage timeouts.

The obvious loss, and one which I'm positive many of you are already screaming at me about, is that we just added 3 library dependencies to Wget in one go. Not freaking cool. Not freaking cool AT ALL.

-= Wget's Strongest Points =-

I absolutely do not want to require a bunch of libraries in order for people to build Wget. AFAICT, the vast majority of Wget's user base, which is probably system packagers and distributors, use it for just the following reasons:

1. It's pretty small. Only dependency is OpenSSL, which isn't even required, but of course in general nobody really doesn't want SSL. (Ooh looky! Double negatives!)
2. It's robust. Connection dropped? No prob, try again.
3. It avoids mucking with preexisting files. Downloading a file named "foo", but you already _have_ a "foo"? No prob, let's call it "foo.1".

To my mind, these are the core values that have led to so many different distributions and large software packages relying on Wget. Messing with any one of these is likely to lose Wget "customers", and in our largest "target market".

(DISCLAIMER: naturally I have nothing whatsoever to back these claims up. It's conjecture. But it seems pretty credible to me.)

Another major "market" for Wget is the typical command-line "power user", who uses Wget not only to grab off a quick file, but also to grab whole sections of sites recursively, and perhaps with occasional quirky needs like only-visit-these-domains or only-download-these-file-types.
For these people, point #1 above probably holds relatively little value, being replaced primarily by Wget's HTML-crawling functionality. In addition to these, points that I believe are highly desirable to such users are:

- Being able to tell Wget precisely which files to download and which to skip. The more expressive power we have to accomplish this the better. Wget already has remarkable flexibility in this area; but there are many more things that are desirable, and some of the existing interface is not up to the task of really powerful expression in this area.
- Being able to parse and "recursively descend" CSS is really, really important.
- Being able to do multiple connections, potentially accelerating the total download time (mainly for multi-host sessions), would be a win.
- Being able to extend Wget, to grok new filetypes for recursive descent (such as non-HTML XML files, or JavaScript), or extend the power of expression of "what to grab" even further.

-= The Two Wgets =-

It seems to me, then, that what's really required may in fact be two different "Wgets". One that is lightweight but packs a punch: basically Wget as it a
Re: More portability stuff [Re: gettext configuration]
Hrvoje Niksic wrote:
> Micah Cowan <[EMAIL PROTECTED]> writes:
>
>> Note that curl provides the additional check for a macro version in
>> the configure script, rather than in the source; we should probably
>> do it that way as well. I'm not sure how that helps for this,
>> though: if the above test is failing, then either it's a function
>> (no macro) and configure isn't picking it up; or else it's not
>> defined in .
>
> Or getting the definition requires defining a magic preprocessor
> symbol such as _XOPEN_SOURCE. The man page I found claims that the
> function is defined by XPG4 and links to standards(5), which
> explicitly documents _XOPEN_SOURCE.

Right. But we set that unconditionally in , so that shouldn't be the problem... right?

Of course, we'd probably do well to upgrade the value we're setting it to (to 600).

-- 
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
Re: More portability stuff [Re: gettext configuration]
Micah Cowan <[EMAIL PROTECTED]> writes:

> Note that curl provides the additional check for a macro version in
> the configure script, rather than in the source; we should probably
> do it that way as well. I'm not sure how that helps for this,
> though: if the above test is failing, then either it's a function
> (no macro) and configure isn't picking it up; or else it's not
> defined in .

Or getting the definition requires defining a magic preprocessor symbol such as _XOPEN_SOURCE. The man page I found claims that the function is defined by XPG4 and links to standards(5), which explicitly documents _XOPEN_SOURCE.
Re: More portability stuff [Re: gettext configuration]
Daniel Stenberg wrote:
> On Sat, 27 Oct 2007, Hrvoje Niksic wrote:
>
>>>> Do you say that Tru64 lacks both sigsetjmp and siggetmask? Are you
>>>> sure about that?
>>>
>>> That is the only system we are currently talking about.
>>
>> I find it hard to believe that Tru64 lacks both of those functions;
>> for example, see
>> http://h30097.www3.hp.com/docs/base_doc/DOCUMENTATION/V51_HTML/MAN/MAN3/0707.HTM
>>
>> It is quite possible that the Autoconf test for sigsetjmp yields a
>> false negative.
>
> I very much doubt it does, since we check for it in the curl configure
> script, and I can see the output from it running on Tru64 clearly state:
>
> checking for sigsetjmp... yes

Thanks, Daniel.

Looking at my own config.h (on GNU/Linux), I see:

    /* Define to 1 if you have the `sigsetjmp' function. */
    /* #undef HAVE_SIGSETJMP */

In utils.c, I see this workaround:

    #ifndef HAVE_SIGSETJMP
    /* If sigsetjmp is a macro, configure won't pick it up. */
    # ifdef sigsetjmp
    #  define HAVE_SIGSETJMP
    # endif
    #endif

(on my system, this results in HAVE_SIGSETJMP being set.)

I'm not sure how Steven's environment managed not to get HAVE_SIGSETJMP set, then. Steven?

Note that curl provides the additional check for a macro version in the configure script, rather than in the source; we should probably do it that way as well. I'm not sure how that helps for this, though: if the above test is failing, then either it's a function (no macro) and configure isn't picking it up; or else it's not defined in .

-- 
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
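For illustration, a configure-time check that catches both the function and the macro form could compile and link a tiny probe that includes the header and actually calls sigsetjmp, instead of only looking for a linkable symbol. The sketch below is an assumption about what such a probe might look like, not the text of curl's or Wget's configure script; the name probe_sigsetjmp is made up, and the _XOPEN_SOURCE value of 600 is the one suggested elsewhere in this thread. On POSIX systems sigsetjmp is declared in <setjmp.h>.

```c
#ifndef _XOPEN_SOURCE
# define _XOPEN_SOURCE 600   /* feature-test macro; must precede the first #include */
#endif
#include <setjmp.h>

/* Hypothetical probe: if this compiles and links, sigsetjmp is usable,
   whether the C library provides it as a real function or as a macro
   that expands to some internal symbol. */
int probe_sigsetjmp(void)
{
    sigjmp_buf env;

    /* A direct call returns 0; a nonzero value would only appear after
       a siglongjmp(), which this probe never performs. */
    if (sigsetjmp(env, 1) != 0)
        return 1;
    return 0;
}
```

A link-only test that declares its own dummy prototype never sees the macro, which matches the config.h output quoted above on GNU/Linux.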
Re: More portability stuff [Re: gettext configuration]
Daniel Stenberg <[EMAIL PROTECTED]> writes:

>> It is quite possible that the Autoconf test for sigsetjmp yields a
>> false negative.
>
> I very much doubt it does, since we check for it in the curl
> configure script,

Note that I didn't mean "in general". Such bugs can sometimes show in one program or test system, but not in another, depending on previously run tests (which influence headers included by test programs), version of Autoconf, or issues with the tester's installation.

> and I can see the output from it running on Tru64 clearly state:
>
> checking for sigsetjmp... yes

It is my understanding that Steven got an error stating that siggetmask is nonexistent, and siggetmask is used only if not HAVE_SIGSETJMP. Since, according to your test, Tru64 indeed does have sigsetjmp, it only confirms my suspicion that Autoconf gets it wrong, at least for that particular combination of Wget and the tester's Tru64 installation.
Re: More portability stuff [Re: gettext configuration]
On Sat, 27 Oct 2007, Hrvoje Niksic wrote:

>>> Do you say that Tru64 lacks both sigsetjmp and siggetmask? Are you
>>> sure about that?
>>
>> That is the only system we are currently talking about.
>
> I find it hard to believe that Tru64 lacks both of those functions;
> for example, see
> http://h30097.www3.hp.com/docs/base_doc/DOCUMENTATION/V51_HTML/MAN/MAN3/0707.HTM
>
> It is quite possible that the Autoconf test for sigsetjmp yields a
> false negative.

I very much doubt it does, since we check for it in the curl configure script, and I can see the output from it running on Tru64 clearly state:

checking for sigsetjmp... yes

(available here for another ten days or so: http://curl.haxx.se/auto/log.cgi?id=20071026080956-25212)
Re: More portability stuff [Re: gettext configuration]
Micah Cowan <[EMAIL PROTECTED]> writes:

>> I know nothing of VMS. If it's sufficiently different from Unix that
>> it has wildly different alarm/signal facilities, or no alarm/signal at
>> all (as is the case with Windows), then it certainly makes sense for
>> Wget to provide a VMS-specific run_with_timeout and use it on VMS.
>> Exactly as it's now done with Windows.
>
> Not when we can use a more portable facility to make both systems
> happy.

That's why I said "*if* it's sufficiently different from Unix ...". It obviously isn't if it only differs in the way that signal masks need to be restored after longjmping from a signal handler.

> "Doesn't have siggetmask() nor sigsetjmp()" != "wildly different
> alarm/signal facilities".

Of course. I simply wasn't aware of such a case when I was writing the code. I'm not claiming the current code is perfect, I'm just trying to explain the logic behind it.

>>> because it lacks an unportable facility doesn't make sense--besides
>>> which, we're talking about a Unix here (Tru64), not VMS (yet).
>>
>> Do you say that Tru64 lacks both sigsetjmp and siggetmask? Are you
>> sure about that?
>
> That is the only system we are currently talking about.

I find it hard to believe that Tru64 lacks both of those functions; for example, see http://h30097.www3.hp.com/docs/base_doc/DOCUMENTATION/V51_HTML/MAN/MAN3/0707.HTM

It is quite possible that the Autoconf test for sigsetjmp yields a false negative.
Re: %20 and spaces in a URL
Alan Wehmann wrote:
> Fred Holmes cpcug.org> writes:
>
>> If I have a URL that has %20 in place of spaces, and I use the URL
>> directly as the argument of WGET, it seems that the file is always
>> "not found". I've discovered that if I replace each %20 with a space,
>> and put quotation marks around the entire URL, it works.
>> . . .
>
> This topic is of interest to me, since I am using wget in a Windows XP
> command shell, to fetch files from an HTTP server. A number of the file
> names have spaces in them and in the url these were replaced by "%20";
> these files did not successfully download. What I realized is that the
> "%" character is not protected by using double quotes surrounding the
> url. I could see this by having "echo on" as the first line of my
> command file. The "escape" character "^" that protects other special
> characters in the command shell doesn't help in the case of "%". What
> does seem to work is to replace "%20" by "%%20".
>
> I am not a subscriber to
>
> wget@sunsite.dk
>
> so please include my email address in replies.

I'm not sure what sort of replies you are looking for, as you haven't asked a question. :)

It is, of course, the responsibility of the user to ensure that he properly escapes characters that he wants to pass literally to Wget. Note that it's not necessary to convert space characters to %20; simply putting quotes around the whole URL to protect the spaces from becoming field separators for the shell is quite enough.

Also, when you quote from a three-year-old comment, it's usually advisable to mention a little more about the context of the message, and where we can find the original thread.

But, yeah, if your shell treats % specially, then obviously you need to escape them. This has nothing in particular to do with Wget, but rather with using whatever particular command shell you have.

-- 
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
Re: %20 and spaces in a URL
Fred Holmes cpcug.org> writes:

> If I have a URL that has %20 in place of spaces, and I use the URL
> directly as the argument of WGET, it seems that the file is always
> "not found". I've discovered that if I replace each %20 with a space,
> and put quotation marks around the entire URL, it works.
> . . .

This topic is of interest to me, since I am using wget in a Windows XP command shell, to fetch files from an HTTP server. A number of the file names have spaces in them and in the url these were replaced by "%20"; these files did not successfully download. What I realized is that the "%" character is not protected by using double quotes surrounding the url. I could see this by having "echo on" as the first line of my command file. The "escape" character "^" that protects other special characters in the command shell doesn't help in the case of "%". What does seem to work is to replace "%20" by "%%20".

I am not a subscriber to

wget@sunsite.dk

so please include my email address in replies.
Re: More portability stuff [Re: gettext configuration]
Hrvoje Niksic wrote:
> Micah Cowan <[EMAIL PROTECTED]> writes:
>
>> Okay... but I don't see the logic of:
>>
>> 1. If the system has POSIX's sigsetjmp, use that.
>> 2. Otherwise, just assume it has the completely unportable, and not
>> even BSDish, siggetmask.
>
> Are you sure siggetmask isn't BSD-ish? When I tested that code on
> various Unix systems, the only one without sigsetjmp was Ultrix, and
> it had siggetmask. Linux man page claims siggetmask to belong to the
> "BSD signal API" and the headers expose it when _BSD_SOURCE is
> defined.

My Linux man page claims that all the functions in there, _except_ siggetmask, are from BSD, and that siggetmask is of unclear origin.

>> At least sigblock(0) is more portable,
>
> What makes you say that?

Because that one _is_ a BSD-ism.

>> And saying that VMS should implement its own completely separate
>> run_with_timeout just
>
> I know nothing of VMS. If it's sufficiently different from Unix that
> it has wildly different alarm/signal facilities, or no alarm/signal at
> all (as is the case with Windows), then it certainly makes sense for
> Wget to provide a VMS-specific run_with_timeout and use it on VMS.
> Exactly as it's now done with Windows.

Not when we can use a more portable facility to make both systems happy. "Doesn't have siggetmask() nor sigsetjmp()" != "wildly different alarm/signal facilities".

>> because it lacks an unportable facility doesn't make sense--besides
>> which, we're talking about a Unix here (Tru64), not VMS (yet).
>
> Do you say that Tru64 lacks both sigsetjmp and siggetmask? Are you
> sure about that?

That is the only system we are currently talking about. Steven's been testing on that as a stepping-stone to VMS, as it's the most similar Unix to VMS. He has also run some tests on Solaris, more recently.
Sorry if there was some confusion; one of the earlier threads was entitled "VMS and Wget", because the message that spawned it was me prodding him to get his VMS kit up-to-date for inclusion. :)

-- 
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
Re: Using wget through FTP proxy server
Jochen Roderburg wrote:
> Quoting Alan Watt <[EMAIL PROTECTED]>:
>> you connect to TCP port 21 on the proxy server (in the example above,
>> 169.254.1.1)
>>
>> send "USER phred", followed by "PASS xyzzy"
>> send "USER [EMAIL PROTECTED]" followed by "PASS holmes"
>>
>> After this point, everything looks like a real FTP session with the
>> remote server.
>>
> I think this is a scenario which is not possible with the current wget.
> As far as I understand, it sends *one* USER/PASS pair to the ftp-server
> and there is no way to get the needed second pair sent.

Right; from what I'm seeing, we connect to port 21 on the proxy server, and send that _second_ set ("USER [EMAIL PROTECTED]" followed by "PASS holmes"). We don't support sending the first set. Which we should, so a bug is filed.

https://savannah.gnu.org/bugs/index.php?21439

-- 
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
Re: More portability stuff [Re: gettext configuration]
Micah Cowan <[EMAIL PROTECTED]> writes:

> Okay... but I don't see the logic of:
>
> 1. If the system has POSIX's sigsetjmp, use that.
> 2. Otherwise, just assume it has the completely unportable, and not
> even BSDish, siggetmask.

Are you sure siggetmask isn't BSD-ish? When I tested that code on various Unix systems, the only one without sigsetjmp was Ultrix, and it had siggetmask. Linux man page claims siggetmask to belong to the "BSD signal API" and the headers expose it when _BSD_SOURCE is defined.

> AFAIK, _no_ system supports POSIX 100%,

In case it's not obvious, I was trying to make the code portable to real Unix and Unix-like systems. So, the logic you don't see happened to cover both POSIX and all non-POSIX systems I laid my hands on, or heard of. Wget was ported to *very* strange systems, and I don't remember problems with run_with_timeout.

> At least sigblock(0) is more portable,

What makes you say that?

> And saying that VMS should implement its own completely separate
> run_with_timeout just

I know nothing of VMS. If it's sufficiently different from Unix that it has wildly different alarm/signal facilities, or no alarm/signal at all (as is the case with Windows), then it certainly makes sense for Wget to provide a VMS-specific run_with_timeout and use it on VMS. Exactly as it's now done with Windows.

> because it lacks an unportable facility doesn't make sense--besides
> which, we're talking about a Unix here (Tru64), not VMS (yet).

Do you say that Tru64 lacks both sigsetjmp and siggetmask? Are you sure about that?
Re: More portability stuff [Re: gettext configuration]
Hrvoje Niksic wrote:
> Micah Cowan <[EMAIL PROTECTED]> writes:
>
>>> We ain't got no siggetmask(). None on VMS (out as far as V8.3),
>>> either, should I ever get so far.
>> siggetmask is an obsolete BSDism; POSIX has the sigprocmask function,
>> which we should prefer.
>
> We do prefer the POSIX way, which is to use sigsetjmp/siglongjmp, in
> which case we need no explicit unblocking of signals. It is only on
> non-POSIX systems without sigsetjmp that we use siggetmask.
>
> Non-Unix systems, such as VMS, should be handled like Windows is
> currently handled: by providing their own native implementation of
> highly non-portable routines such as run_with_timeout. That's the
> whole point of having an abstract run_with_timeout function.

Okay... but I don't see the logic of:

1. If the system has POSIX's sigsetjmp, use that.
2. Otherwise, just assume it has the completely unportable, and not even BSDish, siggetmask.

AFAIK, _no_ system supports POSIX 100%, so just because it lacks one POSIX facility doesn't mean we should assume we don't have another. Much better is to use something with a bit of a better guarantee, or at least not just leap to an assumption like that.

At least sigblock(0) is more portable, but we shouldn't assume we have that either.

And saying that VMS should implement its own completely separate run_with_timeout just because it lacks an unportable facility doesn't make sense--besides which, we're talking about a Unix here (Tru64), not VMS (yet).

-- 
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
Re: Using wget through FTP proxy server
Quoting Alan Watt <[EMAIL PROTECTED]>:

> The way this particular FTP proxy works, assuming the following:
>
> proxy user name: phred
> proxy user passwd: xyzzy
> proxy server IP: 169.254.1.1
> remote FTP user: sherlock
> remote FTP passwd: holmes
> remote FTP server IP: 30.1.1.1
>
> you connect to TCP port 21 on the proxy server (in the example above,
> 169.254.1.1)
>
> send "USER phred", followed by "PASS xyzzy"
> send "USER [EMAIL PROTECTED]" followed by "PASS holmes"
>
> After this point, everything looks like a real FTP session with the
> remote server.
>

I think this is a scenario which is not possible with the current wget. As far as I understand, it sends *one* USER/PASS pair to the ftp-server and there is no way to get the needed second pair sent.

J.Roderburg
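To make the two-step login concrete, here is a rough sketch that only formats the four commands in the order described above. It is not Wget code: the function name format_proxy_login is hypothetical, and the argument of the second USER command is passed through as an opaque string because its exact form is redacted in the archived message. FTP command lines end in CRLF.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of Wget: write the proxy USER/PASS pair
   followed by the remote USER/PASS pair into out.
   Returns the number of bytes written (excluding the terminating NUL),
   or -1 if an argument contains CR or LF or the buffer is too small.
   A real client would send one command at a time and wait for the
   server's reply code before sending the next. */
int format_proxy_login(char *out, size_t out_size,
                       const char *proxy_user, const char *proxy_pass,
                       const char *remote_login, const char *remote_pass)
{
    const char *fields[4];
    int i, n;

    fields[0] = proxy_user;
    fields[1] = proxy_pass;
    fields[2] = remote_login;
    fields[3] = remote_pass;

    /* Refuse embedded line breaks so a credential cannot smuggle in an
       extra FTP command. */
    for (i = 0; i < 4; i++)
        if (strpbrk(fields[i], "\r\n") != NULL)
            return -1;

    n = snprintf(out, out_size,
                 "USER %s\r\nPASS %s\r\nUSER %s\r\nPASS %s\r\n",
                 proxy_user, proxy_pass, remote_login, remote_pass);
    if (n < 0 || (size_t) n >= out_size)
        return -1;
    return n;
}
```

With the example values from the message, the first pair would be "phred" and "xyzzy" and the final password "holmes".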
Re: More portability stuff [Re: gettext configuration]
Micah Cowan <[EMAIL PROTECTED]> writes:

> I wasn't really expecting VMS to have sigprocmask(); but I expect
> future systems may conceivably have it and lack the BSD ones (and
> perhaps such systems are already in the wild). Anyway, we'll use
> what's available.

I think you're misunderstanding the logic of run_with_timeout. It doesn't use non-POSIX code unless it has to (explanation in the other mail in this thread). It could be improved to support more non-POSIX systems, but POSIX systems should run it as currently written.
Re: More portability stuff [Re: gettext configuration]
Micah Cowan <[EMAIL PROTECTED]> writes:

>> We ain't got no siggetmask(). None on VMS (out as far as V8.3),
>> either, should I ever get so far.
>
> siggetmask is an obsolete BSDism; POSIX has the sigprocmask function,
> which we should prefer.

We do prefer the POSIX way, which is to use sigsetjmp/siglongjmp, in which case we need no explicit unblocking of signals. It is only on non-POSIX systems without sigsetjmp that we use siggetmask.

Non-Unix systems, such as VMS, should be handled like Windows is currently handled: by providing their own native implementation of highly non-portable routines such as run_with_timeout. That's the whole point of having an abstract run_with_timeout function.
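The POSIX approach described here can be sketched as follows. This is an illustration under stated assumptions, not Wget's run_with_timeout: the names sketch_run_with_timeout, slow_call and fast_call are invented, the timer uses setitimer(), and the sketch is neither reentrant nor thread-safe. The point it shows is that sigsetjmp(env, 1) saves the signal mask, so siglongjmp() out of the SIGALRM handler restores it and no explicit unblocking is needed.

```c
#ifndef _XOPEN_SOURCE
# define _XOPEN_SOURCE 600   /* feature-test macro; must precede the first #include */
#endif
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static sigjmp_buf timeout_env;

static void on_alarm(int sig)
{
    (void) sig;
    siglongjmp(timeout_env, 1);   /* abandon the interrupted call */
}

/* Sketch: run fun(arg), abandoning it after timeout_seconds.
   Returns 1 on timeout, 0 if fun returned normally.
   A timeout of 0 disarms the timer, so fun runs without a limit. */
int sketch_run_with_timeout(double timeout_seconds,
                            void (*fun)(void *), void *arg)
{
    struct sigaction sa, old_sa;
    struct itimerval timer;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGALRM, &sa, &old_sa);

    /* savemask = 1: the mask saved here is restored by siglongjmp(),
       so SIGALRM is not left blocked after the handler jumps out. */
    if (sigsetjmp(timeout_env, 1) != 0) {
        sigaction(SIGALRM, &old_sa, NULL);
        return 1;
    }

    memset(&timer, 0, sizeof timer);
    timer.it_value.tv_sec = (time_t) timeout_seconds;
    timer.it_value.tv_usec = (suseconds_t)
        ((timeout_seconds - (double) timer.it_value.tv_sec) * 1e6);
    setitimer(ITIMER_REAL, &timer, NULL);

    fun(arg);

    memset(&timer, 0, sizeof timer);          /* cancel the pending alarm */
    setitimer(ITIMER_REAL, &timer, NULL);
    sigaction(SIGALRM, &old_sa, NULL);
    return 0;
}

/* Example callbacks: one that never returns, one that returns at once. */
void slow_call(void *arg) { (void) arg; for (;;) pause(); }
void fast_call(void *arg) { *(int *) arg = 42; }
```

On a system without sigsetjmp, the same structure with plain setjmp/longjmp would leave SIGALRM blocked after the jump, which is where the siggetmask() fallback discussed in this thread comes in.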