Dražen Kačar wrote:
> Micah Cowan wrote:
>
>> Okay, so there's been a lot of thought in the past regarding better
>> extensibility features for Wget. Things like hooks for adding support
>> for traversal of new Content-Types besides text/html, or adding some
>> form of JavaScript support, or support for MetaLink. Also, support for
>> being able to filter results pre- and post-processing by Wget: for
>> example, being able to do some filtering on the HTML to change how Wget
>> sees it before parsing for links, but without affecting the actual
>> downloaded version; or filtering the links themselves to alter what
>> Wget fetches.
>>
>> However, another thing that's been vaguely itching at me lately is the
>> fact that Wget's design is not particularly unix-y. Instead of doing
>> one thing and doing it well, it does a lot of things, some well, some
>> not.
>
> It does what various people needed. It wasn't an exercise in writing a
> unixy utility. It was a program that solved real problems for real
> people.
>> But the thing everyone loves about Unix and GNU (and certainly the
>> thing that drew me to them) is the bunch-of-tools-on-a-crazy-pipeline
>> paradigm,
>
> I have always hated that. With a passion.

A surprising position from a user of Mutt, whose excellence is due in no
small part to its ability to integrate well with other command-line
utilities (that is, to pipeline). The power and flexibility of pipelines
is extremely well-established in the Unix world; I feel no need
whatsoever to waste breath arguing for it, particularly when you haven't
provided the reasons you hate it. For my part, I'm not exaggerating when
I say it's single-handedly responsible for why I'm a Unix/GNU user at
all, and why I continue to highly enjoy developing on it.

  find . -name '*.html' -exec sed -i \
      's#http://oldhost/#http://newhost/#g' '{}' \;

  ( cat message; echo; echo '-- '; cat ~/.signature ) | \
      gpg --clearsign | mail -s 'Report' [EMAIL PROTECTED]

  pic | tbl | eqn | eff-ing | troff -ms

Each one of these demonstrates the enormously powerful technique of
using distinct tools, each with its own distinct feature domain,
together to form a cohesive solution for the need at hand. The best part
is that (with the possible exception of the troff pipeline) each of
these components is immediately available for use in some other pipeline
that serves some other, completely different function.

Note, though, that I don't intend that using "Piped-Wget" would actually
mean the user types in a special pipeline each time he wants to do
something with it. The primary driver would read in some config file
that would tell wget how it should do the piping. You just tweak the
config file when you want to add new functionality.

>> - The tools themselves, as much as possible, should be written in an
>>   easily-hackable scripting language. Python makes a good candidate.
>>   Where we want efficiency, we can implement modules in C to do the
>>   work.
>
> At the time Wget was conceived, that was Tcl's mantra. It failed
> miserably.
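To make the config-file idea concrete, here's a purely hypothetical
sketch of what such a file might look like. None of these directives
exist anywhere today; the names, syntax, and the stdin/stdout contract
for the helper commands are all invented for illustration:

```
# Hypothetical ~/.pipes-wget.conf -- nothing here is a real Wget feature.

# Map a Content-Type to a command that reads the document on stdin and
# writes one discovered URL per line on stdout:
links text/html       = /usr/lib/pipes-wget/html-links
links application/pdf = pdf-links

# Swap in an alternative getter for a URL scheme:
getter scp = scp-getter
```

The point is only that adding a new Content-Type or protocol becomes an
edit to this file plus a small script, rather than a patch to Wget's C
source.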
:-)

Are you claiming that Tcl's failure was due to the ability to integrate
it with C, rather than to its abysmal inadequacy as a programming
language (which changed it from an ability to integrate with C into an
absolute requirement to do so in order to get anything accomplished)?

> How about concentrating on the problems listed in your first paragraph
> (which is why I quoted it)? Could you show us how a bunch of shell
> tools would solve them? Or how a library-ized Wget would solve them?
> Or how any other paradigm or architecture or whatever would solve
> them?

It should be trivially obvious: you plug them in, rather than "wait for
the Wget developers to get around to implementing it".

The thing that both a library-ized Wget and a pipeline-ized Wget would
offer is the same: extreme flexibility. It puts the users in control of
what Wget does, rather than leaving them to perpetually hear, "sorry,
Wget can't do it: you could hack the source, though." :p The difference
between the two is that a pipelined Wget offers this flexibility to a
wider range of users, whereas a library Wget offers it to C programmers.

And how would you expect to do these things without (at least) a
library-ized Wget? Implementing them in the core app (at least by
default) is clearly wrong (scope bloat). Giving Wget a plugin
architecture is good, but then there's only as much flexibility as there
are hooks. Library-izing Wget is equivalent to providing everything as
hooks, and puts the program using it in the driver's seat (and,
naturally, there'd be a wrapper implementation, like curl for libcurl).
A suite of interconnected utilities does the same, but is more
accessible to greater numbers of people. Generally at some expense to
efficiency (aren't all flexible architectures?); but Wget isn't
CPU-bound, it's network-bound.

As mentioned in my original post, this would be a separate project from
Wget.
Wget would not be going away (though it seems likely to me that it would
quickly reach a primarily bugfix-and-essential-maintenance stage). It
would be an alternative offering.

Another thing this probably comes at the expense of is Windows support;
Windows isn't a really great system for trying to do pipelining things
(though it wouldn't be impossible). Which is another reason why Wget
proper wouldn't be disappearing.

Since you seem to struggle with understanding how this works with the
list of features I presented at the start, let's examine them more
closely.

- hooks for adding support for traversal of new Content-Types besides
  text/html

  Current Wget? Means hacking the source in C. Pluggable/Library Wget?
  Writing the module in C. Pipelines Wget? Chances are pretty good
  you've already got a program that can be adapted to spew links for
  further traversal. Just tell Pipes-Wget to run it and parse the
  output. If not, write the "module" in your favorite programming
  language (not just forced to C).

- some form of JavaScript support

  Exactly the same pattern as above: with a Pipes-Wget you just tell
  Wget the name of the command that takes in javascripted HTML and
  spews out links. This command in turn would likely often need to
  invoke further fetching, which is easily handled through the
  invocation of "getter"* commands provided as part of the Pipes-Wget
  suite.

- support for MetaLink

  Current Wget? I think someone's actually working on this. But, given
  Wget's current single-connection support, it couldn't be much more
  than falling back on one URL when another is broken.
  Pluggable/Library Wget (with multiple connections)? Doable, with some
  level of difficulty. Pipelines Wget? Use a MetaLink "getter" rather
  than the stock Pipes-Wget "getter". The MetaLink "getter" itself
  would probably manage the use of several invocations of the stock
  Pipes-Wget "getter".
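To illustrate the kind of "module" this means, here is a minimal sketch
of a traversal module for text/html. The interface (document on stdin,
one URL per line on stdout) is my invention, not anything Pipes-Wget
actually defines; and a real module would want a proper HTML parser
rather than this sed one-liner, which only catches one double-quoted
href per input line:

```shell
#!/bin/sh
# Hypothetical Pipes-Wget traversal module: read an HTML document on
# stdin, emit one candidate URL per line on stdout. Quick-and-dirty
# sketch only -- sed's greedy match means at most one href per line.
sed -n 's/.*href="\([^"]*\)".*/\1/p'
```

Usage would be as simple as:

  printf '<a href="http://example.com/">x</a>\n' | ./html-links
  # prints: http://example.com/

The point is that the whole barrier to entry for a new Content-Type is
"write a filter in whatever language you like", not "hack C inside Wget".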
- Also, support for being able to filter results pre- and
  post-processing by Wget

  Piped-Wget is an utterly obvious shoo-in for this. Patches have
  already been submitted against the current Wget to make it pipe
  through arbitrary commands.

* A "getter" command is mentioned more than once in the above. Note
that this is not mutually exclusive with the concept of letting a
single process govern connection persistence, which would handle the
real work; the "getter" would probably be a tool for communicating with
the main driver.

And now, a few things that could only be done with a Piped-Wget, and
not by the other possible mechanisms.

- Using existing tools to implement protocols Wget doesn't understand
  (want scp support? Just register it as an scp:// scheme handler), and
  instantly adding support to Wget for the latest, greatest protocols
  without hacking Wget or waiting until we get around to implementing
  them.

- Don't like the way Piped-Wget handles things, perhaps such as parsing
  HTML? Swap in your own handler instead.

- Want to add a new feature to Wget, such as emailing someone when
  there's a problem spidering the site? Drop it in. (Note, that's
  already possible. You guessed it! It involves pipelining.)

- Want to use a piece of Wget's functionality by itself (the HTML
  parser, perhaps? Just want a list of all the links from a file?)?
  Just use the command.

This is precisely why pipelines have made Unix so powerful: they are
the only truly successful example of that ever-elusive (and now often
nearly abandoned) holy grail of Code Reuse.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/