With talk of supporting multiple simultaneous connections in a
next-generation version of Wget, various things have been tumbling
around in my mind.

First off, I would not wish to do such a thing with threads. Threads
introduce too many problems of their own, including portability and
debuggability issues. I'd much prefer to do asynchronous I/O.

With the use of asynchronous I/O, a (possibly) better way to do
--timeout presents itself: we can do the appropriate timeouts in our
calls to select(). The main advantage to this is that we don't have to
muck around with signals, signal handling, various portability issues,
etc. We can do one --timeout and be done.

The primary downside to this is that potentially blocking operations
that aren't directly I/O don't get timed out anymore. The only thing that currently
comes to mind is gethostbyname(), which obviously can block, but can't
be select()ed or set to some sort of non-blocking mode. Also, even aside
from --timeout, having all other traffic sit around and wait until a
name is resolved is not really desirable.

The obvious solution to that is to use c-ares, which does exactly that:
handle DNS queries asynchronously. Actually, I didn't know this until
just now, but c-ares was split off from ares to meet the needs of the
curl developers. :)
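For the curious, the shape of the c-ares API (as I understand it; this
is just a sketch, with error handling elided, and it isn't Wget code) is
roughly: you hand it a callback, then feed its file descriptors through
your own select() loop alongside everything else.

```c
#include <ares.h>
#include <netdb.h>
#include <stdio.h>
#include <sys/select.h>

/* Sketch only: resolve a name with c-ares while the select() loop stays
 * free to service other traffic at the same time. */
static void
resolved (void *arg, int status, int timeouts, struct hostent *host)
{
  (void) arg;
  (void) timeouts;
  if (status == ARES_SUCCESS)
    printf ("resolved %s\n", host->h_name);
}

static void
resolve_async (const char *name)
{
  ares_channel channel;
  fd_set rfds, wfds;
  struct timeval tv, *tvp;
  int nfds;

  ares_init (&channel);
  ares_gethostbyname (channel, name, AF_INET, resolved, NULL);

  /* In Wget, this same loop would also watch the data connections. */
  for (;;)
    {
      FD_ZERO (&rfds);
      FD_ZERO (&wfds);
      nfds = ares_fds (channel, &rfds, &wfds);
      if (nfds == 0)
        break;                        /* no pending queries */
      tvp = ares_timeout (channel, NULL, &tv);
      select (nfds, &rfds, &wfds, NULL, tvp);
      ares_process (channel, &rfds, &wfds);
    }
  ares_destroy (channel);
}
```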

Of course, if we're doing asynchronous net I/O stuff, rather than
reinvent the wheel and try to maintain portability for new stuff, we're
better off using a prepackaged deal, if one exists. Luckily, one does; a
friend of mine (William Ahern) wrote a package called libevnet that
handles all of that: it wraps libevent (by Niels Provos, which handles
async I/O very portably, using the best available interfaces on the
given system) with higher-level socket and buffer I/O facilities, and
provides a convenient wrapper around c-ares via liblookup. If we're
going to do async I/O, using libevent and c-ares, or something very like
them, is far too convenient not to do; and after that decision is made,
libevnet becomes a clear win too.
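In libevent terms (the 1.x API; again just a sketch under my own
assumptions, not a design), each connection becomes an event with its
own callback, and the timeout argument to event_add() gives us a
per-connection --timeout almost for free:

```c
#include <event.h>

/* Sketch: one read event per connection; the struct timeval passed to
 * event_add() doubles as the --timeout for that connection. */
static void
on_readable (int fd, short what, void *arg)
{
  (void) fd;
  (void) arg;
  if (what & EV_TIMEOUT)
    {
      /* connection timed out: retry/abort logic would go here */
      return;
    }
  /* otherwise: read from fd, parse, maybe re-add the event ... */
}

static void
watch_connection (int fd, long timeout_secs)
{
  static struct event ev;        /* would be per-connection in real code */
  struct timeval tv = { timeout_secs, 0 };

  event_set (&ev, fd, EV_READ, on_readable, NULL);
  event_add (&ev, &tv);
}

/* somewhere in main: event_init (); ... event_dispatch (); */
```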

So, the obvious win is that using libevnet, libevent and c-ares gives us
a "shortest path" to using async I/O, having multiple simultaneous
connections and async DNS queries, and a potentially better way to
manage timeouts.

The obvious loss, and one which I'm positive many of you are already
screaming at me about, is that we just added 3 library dependencies to
Wget in one go. Not freaking cool. Not freaking cool AT ALL.

-= Wget's Strongest Points =-

I absolutely do not want to require a bunch of libraries in order for
people to build Wget. AFAICT, the vast majority of Wget's user base,
which is probably system packagers and distributors, use it for just the
following reasons:

  1. It's pretty small. Its only dependency is OpenSSL, which isn't
even required; though of course, in general, hardly anybody wants to go
without SSL.
  2. It's robust. Connection dropped? No prob, try again.
  3. It avoids mucking with preexisting files. Downloading a file named
"foo", but you already _have_ a "foo"? No prob, let's call it "foo.1".

To my mind, these are the core values that have led to so many different
distributions and large software packages relying on Wget. Messing with
any one of these is likely to lose Wget "customers", and in our largest
"target market". (DISCLAIMER: naturally I have nothing whatsoever to
back these claims up. It's conjecture. But it seems pretty credible to me.)

Another major "market" for Wget is the typical command-line "power
user", who uses Wget not only to grab off a quick file, but also to grab
whole sections of sites recursively, and perhaps with occasional quirky
needs like only-visit-these-domains or only-download-these-file-types.
For these people, point #1 above probably holds relatively little
value, its place being taken primarily by Wget's HTML-crawling
functionality. In addition, points that I believe are highly desirable
to such users are:

  - Being able to tell Wget precisely which files to download and which
to skip. The more expressive power we have to accomplish this, the
better. Wget already has remarkable flexibility in this area, but there
are many more things that are desirable, and some of the existing
interface is not up to the task of really powerful expression.
  - Being able to parse and "recursively descend" CSS is really, really
desirable.
  - Being able to do multiple connections, potentially accelerating the
total download time (mainly for multi-host sessions), would be a win.
  - Being able to extend Wget, to grok new filetypes for recursive
descent (such as non-HTML XML files, or JavaScript), or extend the power
of expression of "what to grab" even further.

-= The Two Wgets =-

It seems to me, then, that what's really required may in fact be two
different "Wgets".

One that is lightweight but packs a punch: basically Wget as it
currently is. Making it DTRT where it doesn't, such as with its
expectations of FTP servers, or how it handles HTTP authentication, and
adding CSS support, would be _really_ helpful. In order to _keep_ it
lightweight, it would be necessary to keep a tight throttle on what new
functionality is accepted; it would be primarily _maintained_, and not
_developed_, though it would of course be kept up-to-date with evolving
definitions of what the World Wide Web is (CSS being an excellent
example). It would support recursive web fetching, but wouldn't bend
over backwards to handle the more exotic needs.

The other Wget would get all the "cool" stuff: pretty much everything
that has been planned for the "next-gen", "2.0" version of Wget. Its
focus would be on users that want it to be their "everything" tool, and
damn the hard-disk-space requirements and library dependencies (not
getting _too_ crazy, of course--that's what the plugin architecture is
for).

This would certainly allay a growing fear I've had: that a lot of what
people were getting excited about in discussions of "Wget 2.0" just
plain doesn't _feel_ right for Wget, even though those features are
unquestionably useful additions. I initially quelled this concern with
the thought that I could simply sequester the really exotic features
into plugins, allowing people the freedom to choose what they want their
Wget to be.

But asynchronous I/O can't be simply partitioned away like that: it
requires intrinsic and pervasive changes to Wget's architecture. While
the two Wgets could share some logic for recursion, file naming,
timestamping, etc, the actual I/O wouldn't easily be sharable. Well:
code written for an async I/O platform can easily just be used
synchronously, but async code comes at a significant cost to legibility
and flexibility that IMO wouldn't be worth paying in the "synchronous"
version.

Plus, there is the following thought. While I've talked about not
reinventing the wheel, using existing packages to save us the trouble of
having to maintain portable async code, higher-level buffered-IO and
network comm code, etc, I've been neglecting one more package choice.
There is, after all, already a Free Software package that goes beyond
handling asynchronous network operations, to specifically handle
asynchronous _web_ operations; I'm speaking, of course, of libcurl.
There would seem to be some obvious motivation for simply using libcurl
to handle all asynchronous web traffic, and wrapping it with the logic
we need to handle retries, recursion, timestamping, traversing,
selecting which files to download, etc. Besides async web code, of
course, we'd also automatically get support for a number of various
protocols (SFTP, for example) that have been requested in Wget.
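To make the libcurl option concrete: its "multi" interface already
drives any number of transfers from one loop, and Wget's
recursion/retry/naming logic would sit on top, deciding which URLs to
add. A rough sketch of the shape of it (placeholder URLs, error
handling elided; not a design):

```c
#include <curl/curl.h>

/* Sketch: two simultaneous transfers driven by the libcurl multi
 * interface. */
static void
fetch_two (const char *url_a, const char *url_b)
{
  CURL *a = curl_easy_init ();
  CURL *b = curl_easy_init ();
  CURLM *multi = curl_multi_init ();
  int running = 0;

  curl_easy_setopt (a, CURLOPT_URL, url_a);
  curl_easy_setopt (b, CURLOPT_URL, url_b);
  curl_multi_add_handle (multi, a);
  curl_multi_add_handle (multi, b);

  /* Drive both transfers; a real loop would select() on the
   * descriptors from curl_multi_fdset() instead of busy-looping. */
  do
    curl_multi_perform (multi, &running);
  while (running > 0);

  curl_multi_remove_handle (multi, a);
  curl_multi_remove_handle (multi, b);
  curl_easy_cleanup (a);
  curl_easy_cleanup (b);
  curl_multi_cleanup (multi);
}
```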

PLEASE NOTE: these are ramblings. They are ideas. They are what's
currently rattling around in my brain. Note, too, that there are a
couple of leaps in the given logic for having a completely separate
"Wget 2.0", the biggest one probably being that multiple connections do
not automatically imply asynchronous I/O; that's just my preference.

Not going for async I/O destroys the depends-on-myriad-libraries
argument, along with the whole Wget-2.0-needs-to-be-separate and
maybe-we-should-use-libcurl arguments. OTOH, the other options I can
think of--using threads, or using multiple processes--have their own
strong downsides, especially in terms of portability and maintenance cost.

I expect this to be controversial thinking, and am hereby officially
begging for feedback, and for alternative thoughts and viewpoints.

And, of course, when I say "there would be two Wgets", what I really
mean is that the more exotic-featured one would be something else
entirely, not a Wget, and would have a separate name.

--
Micah J. Cowan