Re: Thoughts on Wget 1.x, 2.0 (*LONG!*)
Micah Cowan wrote:
> Keeping a single Wget and using runtime libraries (which we were terming
> plugins) was actually the original concept (there's mention of this in the
> first post of this thread, actually); the issue is that there are core bits
> of functionality (such as the multi-stream support) that are too intrinsic
> to separate into loadable modules, and that, to be done properly (and with
> a minimum of maintenance commitment) would also depend on other libraries
> (that is, doing asynchronous I/O wouldn't technically require the use of
> other libraries, but it can be a lot of work to do efficiently and portably
> across OSes, and there are already Free libraries to do that for us).

Perhaps both versions can include multi-threaded support in their core version, but the lite version would never invoke multi-threading.

Tony
Re: Thoughts on Wget 1.x, 2.0 (*LONG!*)
Tony Lewis wrote:
> Micah Cowan wrote:
>> Keeping a single Wget and using runtime libraries (which we were terming
>> plugins) was actually the original concept (there's mention of this in
>> the first post of this thread, actually); [...] there are already Free
>> libraries to do that for us.
>
> Perhaps both versions can include multi-threaded support in their core
> version, but the lite version would never invoke multi-threading.

I mentioned this in the first post as well. The main problem I offered for this was that async I/O tends to make for much more complicated, hard-to-follow code, which will make the lite Wget (even more) difficult to read, without reaping the actual benefits gained from such complications. Of course, whether this is sufficient justification to maintain two different versions of Wget is another question...

There's also the fact that libcurl starts looking _very_ attractive to handle the async I/O web comm stuff, so that ideally we don't actually have to rewrite any of the I/O and HTTP logic, but just replace it wholesale. If we decide to use that for the async stuff, then it seems to me that having two separate programs suddenly becomes more or less a foregone conclusion, as I don't really want to introduce a dependency on libcurl for the lite Wget (though Hrvoje's response on the thread that Daniel Stenberg posted suggests I'd have an excuse to do so).

Note that in any case, having two separate command-line interfaces is pretty much unavoidable IMO, as the current CLI is fast becoming unwieldy, and certain aspects are fairly confusing, so I don't really want to use it as the basis on which to build some of the newer configuration features; at the same time, I want to keep the current interface around for current Wget usage, so I don't break people's scripts, etc.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
Re: Thoughts on Wget 1.x, 2.0 (*LONG!*)
Micah Cowan wrote:
> Tony Lewis wrote:
>> Perhaps both versions can include multi-threaded support in their core
>> version, but the lite version would never invoke multi-threading.
>
> I mentioned this in the first post as well. The main problem I offered
> for this was that async I/O tends to make for much more [...]

I should point out, too, that I'm talking about asynchronous I/O support, and not multithreaded support, as I'm not really keen on introducing threads to Wget. Especially since, AFAICT, threads sort of suck on Linux, which happens to be the kernel I actively use. This may be somewhat unfortunate, as multithreading tends not to introduce the code complexity that async I/O does (though IMO it introduces complexities of a different sort).

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
Re: Thoughts on Wget 1.x, 2.0 (*LONG!*)
Micah Cowan wrote:
> I'm not sure what you mean about the linux thing; there are many
> instances of runtime loadable modules on Linux. dlopen() and friends are
> the standard way of doing this on any Unix kernel flavor.

I _thought_ so, but when I asked a distro why they didn't use this, they said it would require rewriting nearly all currently existing applications. My specific complaint was against a SuSE distro: in order to load one.rpm, it depended on two.rpm, which depended on three.rpm, and that on four.rpm, etc. The functionality in two.rpm was to load a library to handle Active Directory which, in my non-MS, small setup, I didn't need -- and I didn't want to load the 5-7 supporting packages for AD, since I didn't use them. BUT, because of static run-time loading, one.rpm would fail if two.rpm wasn't loaded... and so on and so forth. AFAIK, the same problem exists on nearly every distro -- because no one bothers to think that they might not want to load every package on the CD, just to support local host lookup using... say, nscd. G.

> Keeping a single Wget and using runtime libraries (which we were terming
> plugins) was actually the original concept (there's mention of this in
> the first post of this thread, actually);

Sounds good to me! :-)

> the issue is that there are core bits of functionality (such as the
> multi-stream support) that are too intrinsic to separate into loadable
> modules, and that, to be done properly (and with a minimum of
> maintenance commitment) would also depend on other libraries (that is,
> doing asynchronous I/O wouldn't technically require the use of other
> libraries, but it can be a lot of work to do efficiently and portably
> across OSes, and there are already Free libraries to do that for us).

And perhaps that is the problem. In order to re-use existing parts of code, rather than adapting them to a load-if-necessary type structure, everyone prefers to just use them as is; thus one lib references another, and another... and so on.

Like I think you pull in cat, and you get all of the GNU language libs and tools, which pulls in alternate character set support, which requires certain font rendering packages -- and of course, if you are displaying alternate characters, let's not forget the corresponding foreign input methods, and the Asian-character-specific terminal emulators... etc. Can I jump off a cliff yet?... ARG!

I hack around such problems, at times, by extracting the one run-time library I need, and not the rest of the package, but then my rpm-verify checks turn up supposed errors because I'm missing package dependencies. Sigh...

If one wanted to add multi-stream support, couldn't the small wget have a check to see if the multi-stream support lib was present (or not), and if not, set max-streams to one -- which might yield the basic behavior one would want for the small wget?

Not pushing a particular solution -- I, like you, am just throwing out ideas to consider... if they've already covered the points I've raised, feel free to just ignore my ramblings and carry on... :-)

Linda
Re: Thoughts on Wget 1.x, 2.0 (*LONG!*)
L Walsh wrote:
> Micah Cowan wrote:
>> I'm not sure what you mean about the linux thing; there are many
>> instances of runtime loadable modules on Linux. dlopen() and friends
>> are the standard way of doing this on any Unix kernel flavor.
>
> I _thought_ so, but when I asked a distro why they didn't use this, they
> said it would require rewriting nearly all currently existing
> applications. My specific complaint was against a SuSE distro: in order
> to load one.rpm, it depended on two.rpm, which depended on three.rpm,
> and that on four.rpm, etc. [...] BUT, because of static run-time
> loading, one.rpm would fail if two.rpm wasn't loaded... and so on and
> so forth.

Ah, well, that's a different situation. In order to decide at runtime whether to load a runtime library, dlopen() is the standard way to handle that. However, if the application wasn't designed to make the decision at runtime, but rather at build time, then it does require code rewriting. In this case, though, we're specifically talking about loadable modules. We might choose to allow some of them to be linked at build time, but we'd definitely have to at least support conditional linking at runtime.

>> Keeping a single Wget and using runtime libraries (which we were
>> terming plugins) was actually the original concept (there's mention of
>> this in the first post of this thread, actually);
>
> Sounds good to me! :-)
>
>> the issue is that there are core bits of functionality (such as the
>> multi-stream support) that are too intrinsic to separate into loadable
>> modules, and that, to be done properly (and with a minimum of
>> maintenance commitment) would also depend on other libraries [...]
>
> And perhaps that is the problem. In order to re-use existing parts of
> code, rather than adapting them to a load-if-necessary type structure,
> everyone prefers to just use them as is; thus one lib references
> another, and another... and so on.
>
> Like I think you pull in cat, and you get all of the GNU language libs
> and tools, which pulls in alternate character set support, which
> requires certain font rendering packages -- and of course, if you are
> displaying alternate characters, let's not forget the corresponding
> foreign input methods, and the Asian-character-specific terminal
> emulators... etc.

That's absurd. Native Language Support for a terminal program shouldn't pull in font-rendering packages: displaying the characters properly is the terminal's responsibility. I have some trouble believing that any packagers would actually have such dependencies, but if they do, it's broken. A program like cat should depend only on the system library and (if NLS is supported) gettext (which shouldn't depend on anything else).

> Can I jump off a cliff yet?... ARG! I hack around such problems, at
> times, by extracting the one run-time library I need, and not the rest
> of the package, but then my rpm-verify checks turn up supposed errors
> because I'm missing package dependencies. Sigh...

Frustrating experiences with Red Hat's package management are why I'm now a Debian/Ubuntu user. :)

> If one wanted to add multi-stream support, couldn't the small wget have
> a check to see if the multi-stream support lib was present (or not),
> and if not, set max-streams to one -- which might yield the basic
> behavior one would want for the small wget?

Well, but the actual support for having any sort of multi-stream is a major rewrite of the entire I/O code. Much better to use a separate library for that, if we can get it. In that case, it stops being something we can simply check for and use if it's available, and becomes something the code would absolutely require.

> Not pushing a particular solution -- I, like you, am just throwing out
> ideas to consider... if they've already covered the points I've raised,
> feel free to just ignore my ramblings and carry on... :-)

Well, and fortunately we've got plenty of time to talk about these things: my focus right now is on getting 1.11 out the door, after which there are _plenty_ of things to keep me busy for 1.12 (still a lite release) for quite some time.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
Re: Thoughts on Wget 1.x, 2.0 (*LONG!*)
On 10/31/07, Micah Cowan [EMAIL PROTECTED] wrote:
> Tony Godshall wrote:
>> On 10/30/07, Micah Cowan [EMAIL PROTECTED] wrote:
>>> Tony Godshall wrote:
>>>> Perhaps the little wget could be called wg. A quick google and
>>>> wikipedia search shows no real namespace collisions.
>>>
>>> To reduce confusion/upgrade problems, I would think we would want to
>>> ensure that the traditional/little Wget keeps the current name, and
>>> any snazzified version gets a new one.
>>
>> Please not another -ng. How about wget2 (since we're on 1.x)? And the
>> current one remains in 1.x.
>
> I agree that -ng would not be appropriate. But since we're really
> talking about two separate beasts, I'd prefer not to limit what we can
> do with Wget (original)'s versioning. Who's to say a 2.0 release of the
> light version will not be warranted someday? At any rate, the snazzy one
> looks to be diverging from classic Wget in some rather significant ways,
> in which case, I'd kind of prefer to part names a bit more severely than
> just wget-ng or wget2. Reget, perhaps: that name could be both Recursive
> Get (describing what's still its primary feature), or
> Revised/Re-envisioned Wget. :)
>
> I think, too, that names such as wget2 are more often things that
> packagers (say, Debian) do, when they want to include
> backwards-incompatible, significantly new versions of software, but
> don't want to break people's usage of older stuff. Or, when they just
> want to offer both versions. Cf. apache2 in Debian.
>
>> And then eventually everyone's gotten used to and can't live without
>> the new bittorrent-like almost-multithreaded features. ;-)
>
> :)

Pget. Parallel get. Tget. Torrent-like get. Bget. Bigger get. BBWget. Bigger Better wget. OK, ok, sorry.
Re: Thoughts on Wget 1.x, 2.0 (*LONG!*)
L Walsh wrote:
> Honest -- I hadn't read all the threads before my post... Great ideas,
> Micah! :-)
>
> On the idea of two wgets -- there is a clever way to get by with one:
> put the optional functionality into separate run-time loadable files.
> [...] Too bad linux didn't take this route with its libraries.

I'm not sure what you mean about the linux thing; there are many instances of runtime loadable modules on Linux. dlopen() and friends are the standard way of doing this on any Unix kernel flavor.

Keeping a single Wget and using runtime libraries (which we were terming plugins) was actually the original concept (there's mention of this in the first post of this thread, actually); the issue is that there are core bits of functionality (such as the multi-stream support) that are too intrinsic to separate into loadable modules, and that, to be done properly (and with a minimum of maintenance commitment) would also depend on other libraries (that is, doing asynchronous I/O wouldn't technically require the use of other libraries, but it can be a lot of work to do efficiently and portably across OSes, and there are already Free libraries to do that for us).

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
Re: Thoughts on Wget 1.x, 2.0 (*LONG!*)
Honest -- I hadn't read all the threads before my post... Great ideas, Micah! :-)

On the idea of two wgets -- there is a clever way to get by with one: put the optional functionality into separate run-time loadable files. SGI's Unix (and MS Windows) do this. The small wget then checks to see which libraries are accessible -- those that aren't simply mean the features for those libs are disabled. In a way, it's like how 'vim' can optionally load the perl or python library at runtime (at least under Windows) if they are present. If they are not present, those features are disabled. Too bad linux didn't take this route with its libraries (I have asked; it is possible, but there's no framework for it, and that might need work as well).

My 2 cents,
Linda
Re: Thoughts on Wget 1.x, 2.0 (*LONG!*)
On 10/30/07, Micah Cowan [EMAIL PROTECTED] wrote:
> Tony Godshall wrote:
>> Perhaps the little wget could be called wg. A quick google and
>> wikipedia search shows no real namespace collisions.
>
> To reduce confusion/upgrade problems, I would think we would want to
> ensure that the traditional/little Wget keeps the current name, and any
> snazzified version gets a new one.

Please not another -ng. How about wget2 (since we're on 1.x)? And the current one remains in 1.x. And then eventually everyone's gotten used to and can't live without the new bittorrent-like almost-multithreaded features. ;-)

Tony
Re: Thoughts on Wget 1.x, 2.0 (*LONG!*)
On 10/26/07, Josh Williams [EMAIL PROTECTED] wrote:
> On 10/26/07, Micah Cowan [EMAIL PROTECTED] wrote:
>> And, of course, when I say there would be two Wgets, what I really mean
>> by that is that the more exotic-featured one would be something else
>> entirely than a Wget, and would have a separate name.
>
> I think the idea of having two Wgets is good. I too have been concerned
> about the resources required in creating the all-out version 2.0. [...]
> Although the code might suck for those trying to read it, I think it
> could be very great with a little regular maintenance.

Perhaps the little wget could be called wg. A quick google and wikipedia search shows no real namespace collisions.

> There still remains the question, though, of whether version 2 will
> require a complete rewrite. [...] As for libcurl, I see no reason why
> not. [...]
>
> I do believe the next question at hand is what version 2's official
> mascot will be. I propose Lenny the tortoise ;)

Oooh -- confusion with Debian testing!

_ .. Lenny - (_\/ \_, 'uuuu~'

--
Best Regards. Please keep in touch.
Re: Thoughts on Wget 1.x, 2.0 (*LONG!*)
Tony Godshall wrote:
> Perhaps the little wget could be called wg. A quick google and wikipedia
> search shows no real namespace collisions.

To reduce confusion/upgrade problems, I would think we would want to ensure that the traditional/little Wget keeps the current name, and any snazzified version gets a new one.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
Re: Thoughts on Wget 1.x, 2.0 (*LONG!*)
Daniel Stenberg wrote:
> I guess I'm not the man to ask nor comment this a lot, but look what I
> found: http://www.mail-archive.com/wget@sunsite.dk/msg01129.html
>
> I've always thought and I still believe that wget's power and most
> appreciated abilities are in the features it adds on top of the
> transfer, like HTML parsing, ftp list parsing and the other things you
> mentioned.

Of course, in this case, we'd be talking more about linking with libcurl for Wget2, rather than incorporating it, so we wouldn't have to worry about copyright disclaimers. Besides which, according to the maintainers document, we only need to get those for files that do not include a license statement.

> Of course, going with one single unified transfer library is perhaps not
> the best thing from a software ecosystem perspective, as competition
> tends to drive innovation and development, but the more users of a free
> software/open source project we get, the better it will become.

Well, in the first place, ours isn't a library, so for the most part it isn't really usable by other folks. :) And there's still libwww from the W3C, at least (and probably others). Besides, the great thing about the _free_ software ecosystem is that even when there is only a single, unified library, as long as it is free it can easily be forked to move in a new direction to meet differing requirements. :)

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
Re: Thoughts on Wget 1.x, 2.0 (*LONG!*)
Josh Williams wrote:
> Although the code might suck for those trying to read it, I think it
> could be very great with a little regular maintenance.

Oh, I think it's probably already earned a reputation for greatness at this point. But yeah, it needs some maintenance work. Which is, of course, what I volunteered for in the first place :)

> There still remains the question, though, of whether version 2 will
> require a complete rewrite. Considering how fundamental these changes
> are, I don't think we would have much of a choice.

Right. The idea I... thought I had settled on was to refactor what we have until it is sufficiently pliable to start adding some of the version 2 features. If, OTOH, we're going to have two separate projects, there's less motivation to try to slowly rework everything under the sun; though there are obviously still sections that would benefit from refactoring (gethttp and http_loop are currently still right in my crosshairs).

> You mentioned that they could share code for recursion, but I don't see
> how. IIRC, the code for recursion in the current version is very
> dependent on the current methods of operation. It would probably have to
> be rewritten to be shared.

Yeah, the shared codebase would probably be pretty small. But the actual logic about how to parse HTML, or whether or not to descend, or comparing Web timestamps to local ones, should be sharable. But yes, after a rewrite of the relevant code. I don't think we'd have to make it happen in particular; as we discover common logic that can be factored, we'll just... do it.

> As for libcurl, I see no reason why not. Also, would these be two
> separate GNU projects? Would they be packaged in the same source code,
> like finch and pidgin?

Probably not packaged together. People who want the traditional Wget are not gonna want to download the JavaScript and Metalink support code. :\ We should keep it as tight as possible.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
Re: Thoughts on Wget 1.x, 2.0 (*LONG!*)
On Fri, 26 Oct 2007, Micah Cowan wrote:
> The obvious solution to that is to use c-ares, which does exactly that:
> handle DNS queries asynchronously. Actually, I didn't know this until
> just now, but c-ares was split off from ares to meet the needs of the
> curl developers. :)

We needed an async name resolver for libcurl, so c-ares started out that way, but perhaps mostly because the original author didn't care much for our improvements and bug fixes. ADNS is a known alternative, but we couldn't use that due to license restrictions. You (wget) don't have that same problem with it. I'm not able to compare them, though, as I never used ADNS...

> Of course, if we're doing asynchronous net I/O stuff, rather than
> reinvent the wheel and try to maintain portability for new stuff, we're
> better off using a prepackaged deal, if one exists. Luckily, one does: a
> friend of mine (William Ahern) wrote a package called libevnet that
> handles all of that.

When I made libcurl grok a vast number of simultaneous connections, I went straight with libevent for my test and example code. It's solid and fairly easy to use... Perhaps libevnet makes it even easier, I don't know.

> Plus, there is the following thought. While I've talked about not
> reinventing the wheel, using existing packages to save us the trouble of
> having to maintain portable async code, higher-level buffered-I/O and
> network comm code, etc., I've been neglecting one more package choice.
> There is, after all, already a Free Software package that goes beyond
> handling asynchronous network operations, to specifically handle
> asynchronous _web_ operations; I'm speaking, of course, of libcurl.

I guess I'm not the man to ask nor comment this a lot, but look what I found: http://www.mail-archive.com/wget@sunsite.dk/msg01129.html

I've always thought and I still believe that wget's power and most appreciated abilities are in the features it adds on top of the transfer, like HTML parsing, ftp list parsing and the other things you mentioned.

Of course, going with one single unified transfer library is perhaps not the best thing from a software ecosystem perspective, as competition tends to drive innovation and development, but the more users of a free software/open source project we get, the better it will become.
Re: Thoughts on Wget 1.x, 2.0 (*LONG!*)
On 10/26/07, Micah Cowan [EMAIL PROTECTED] wrote:
> And, of course, when I say there would be two Wgets, what I really mean
> by that is that the more exotic-featured one would be something else
> entirely than a Wget, and would have a separate name.

I think the idea of having two Wgets is good. I too have been concerned about the resources required in creating the all-out version 2.0. The current code for Wget is a bit mangled, but I think the basic concepts surrounding it are very good ones. Although the code might suck for those trying to read it, I think it could be very great with a little regular maintenance.

There still remains the question, though, of whether version 2 will require a complete rewrite. Considering how fundamental these changes are, I don't think we would have much of a choice. You mentioned that they could share code for recursion, but I don't see how. IIRC, the code for recursion in the current version is very dependent on the current methods of operation. It would probably have to be rewritten to be shared.

As for libcurl, I see no reason why not. Also, would these be two separate GNU projects? Would they be packaged in the same source code, like finch and pidgin?

I do believe the next question at hand is what version 2's official mascot will be. I propose Lenny the tortoise ;)

_ .. Lenny - (_\/ \_, 'uuuu~'