Dražen Kačar wrote:
> Micah Cowan wrote:
> 
>> Okay, so there's been a lot of thought in the past, regarding better
>> extensibility features for Wget. Things like hooks for adding support
>> for traversal of new Content-Types besides text/html, or adding some
>> form of JavaScript support, or support for MetaLink. Also, support for
>> being able to filter results pre- and post-processing by Wget: for
>> example, being able to do some filtering on the HTML to change how Wget
>> sees it before parsing for links, but without affecting the actual
>> downloaded version; or filtering the links themselves to alter what Wget
>> fetches.
> 
>> However, another thing that's been vaguely itching at me lately, is the
>> fact that Wget's design is not particularly unix-y. Instead of doing one
>> thing, and doing it well, it does a lot of things, some well, some not.
> 
> It does what various people needed. It wasn't an exercise in writing a
> unixy utility. It was a program that solved real problems for real
> people.

>> But the thing everyone loves about Unix and GNU (and certainly the thing
>> that drew me to them), is the bunch-of-tools-on-a-crazy-pipeline
>> paradigm,
> 
> I have always hated that. With a passion.

A surprising position from a user of Mutt, whose excellence is due in no
small part to its ability to integrate well with other command-line
utilities (that is, to pipeline). The power and flexibility of pipelines
are extremely well-established in the Unix world; I feel no need
whatsoever to waste breath arguing for it, particularly when you haven't
given your reasons for hating it.

For my part, I'm not exaggerating when I say that it's single-handedly
responsible for my being a Unix/GNU user at all, and for why I continue
to enjoy developing on it so much.

  # Rewrite links in every HTML file under the current directory:
  find . -name '*.html' -exec sed -i \
    's#http://oldhost/#http://newhost/#g' {} \;

  # Append a signature to a message, clearsign it, and mail it:
  ( cat message; echo; echo '-- '; cat ~/.signature ) | \
    gpg --clearsign | mail -s 'Report' [EMAIL PROTECTED]

  # Typeset a troff document, preprocessing pictures, tables and equations:
  pic | tbl | eqn | eff-ing | troff -ms

Each of these demonstrates the enormously powerful technique of
combining distinct tools, each with its own feature domain, into a
cohesive solution for the task at hand. The best part is that (with the
possible exception of the troff pipeline) each of these components is
immediately available for use in some other pipeline that does something
completely different.

Note, though, that I don't intend that using "Piped-Wget" would mean the
user types in a special pipeline each time he wants to do something with
it. A primary driver would read in a config file telling wget how to do
the piping; you'd just tweak the config file when you want to add new
functionality.
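To make that concrete, such a config file might look something like the
following. The syntax and every name in it are invented purely for
illustration; none of this exists yet:

  # ~/.pwgetrc -- hypothetical syntax
  # Handlers read a document on stdin and print one URL per line.
  handler text/html        = pwget-html-links
  handler application/pdf  = pwget-pdf-links
  # Filter applied before link extraction; the saved copy is untouched.
  prefilter text/html      = sed -e 's/<!--.*-->//g'
  # The command that performs the actual fetches.
  getter                   = pwget-fetch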

>>  - The tools themselves, as much as possible, should be written in an
>> easily-hackable scripting language. Python makes a good candidate. Where
>> we want efficiency, we can implement modules in C to do the work.
> 
> At the time Wget was conceived, that was Tcl's mantra. It failed
> miserably. :-)

Are you claiming that Tcl's failure was due to its ability to integrate
with C, rather than to its abysmal inadequacy as a programming language
(which turned the ability to integrate with C into an absolute
requirement for getting anything accomplished)?

> How about concentrating on the problems listed in your first paragraph
> (which is why I quoted it)? Could you show us how would a bunch of shell
> tools solve them? Or how would a librarized Wget solve them? Or how
> would any other paradigm or architecture or whatever solve them?

It should be trivially obvious: you plug them in, rather than "wait for
the Wget developers to get around to implementing it".

The thing that both library-ized Wget and pipeline-ized Wget would offer
is the same: extreme flexibility. It puts the users in control of what
Wget does, rather than just perpetually hearing, "sorry, Wget can't do
it: you could hack the source, though." :p

The difference between the two is that a pipelined Wget offers this
flexibility to a wider range of users, whereas a library Wget offers it
to C programmers.

Or how would you expect to do these things without a library-ized (at
least) Wget? Implementing them in the core app (at least by default) is
clearly wrong (scope bloat). Giving Wget a plugin architecture is good,
but then there's only as much flexibility as there are hooks.
Library-izing Wget is equivalent to providing everything as hooks, and
puts the program using it in the driver's seat (and, naturally, there'd
be a wrapper implementation, like curl for libcurl). A suite of
interconnected utilities does the same, but is accessible to a greater
number of people. That generally comes at some expense in efficiency (as
most flexible architectures do); but Wget isn't CPU-bound, it's
network-bound.

As mentioned in my original post, this would be a separate project from
Wget. Wget would not be going away (though it seems likely to me that it
would quickly reach a primarily bugfix and essential maintenance stage).
It would be an alternative offering.

Another thing it would probably come at the expense of is Windows
support: Windows isn't a great system for doing this sort of pipelining
(though it wouldn't be impossible). Which is another reason why Wget
proper wouldn't be disappearing.

Since you seem to struggle with understanding how this works with the
list of features I presented at the start, let's examine them more closely.

- hooks for adding support for traversal of new Content-Types besides
text/html

Current Wget? Means hacking the source in C.
Pluggable/Library Wget? Writing the module in C.
Pipelines Wget? Chances are pretty good you've already got a program
that can be adapted to spew links for further traversal. Just tell
Pipes-Wget to run it and parse the output. If not, write the "module"
in your favorite programming language (you're not forced to use C).
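For instance, a handler for application/pdf could be little more than a
script like this (purely a sketch: the handler registration is
hypothetical, though pdftotext and grep are real tools):

  #!/bin/sh
  # Hypothetical application/pdf handler: read a PDF on stdin,
  # print one discovered URL per line on stdout.
  pdftotext - - | grep -Eo 'https?://[^[:space:]")>]+' | sort -u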

- some form of JavaScript support

Exactly the same pattern as above: with a Pipes-Wget you just tell Wget
the name of the command that takes in javascripted HTML and spews out
links. This command in turn would likely often need to invoke further
fetching, which is easily handled through the invocation of "getter"*
commands provided as part of the Pipes-Wget suite.
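A sketch of the shape this could take (every name here is invented): the
registered handler simply delegates to a JavaScript-capable link dumper,
telling it which command to use for any further fetching it needs:

  #!/bin/sh
  # Hypothetical JavaScript-aware text/html handler: reads the page on
  # stdin, prints discovered links on stdout. "js-link-dump" and
  # "pwget-fetch" do not exist; they stand in for a headless JS engine
  # and the stock Pipes-Wget "getter".
  exec js-link-dump --fetch-cmd pwget-fetch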

- support for MetaLink

Current Wget? I think someone's actually working on this. But, given
Wget's current single-connection support, it couldn't be much more than
falling back on one URL when another is broken.
Pluggable/Library Wget (with multiple connections)? Doable, with some
level of difficulty.
Pipelines Wget? Use a Metalink "getter" rather than the stock Pipes-Wget
"getter". The Metalink "getter" itself would probably manage the use of
several invocations of stock Pipes-Wget "getter".
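A crude sketch of such a Metalink "getter" (the stock getter's name,
pwget-fetch, is invented; and since real Metalink files are XML, a real
version would parse them properly instead of grepping):

  #!/bin/sh
  # Hypothetical Metalink getter: try each mirror listed in the
  # .metalink file given as $1 until one fetch succeeds.
  for url in $(grep -Eo 'https?://[^<"[:space:]]+' "$1"); do
      pwget-fetch "$url" && exit 0
  done
  exit 1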

- Also, support for being able to filter results pre- and
post-processing by Wget

Piped-Wget is an utterly obvious shoo-in for this. Patches against the
current Wget have already been submitted to make it pipe content through
arbitrary commands.
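With those patches (or with a Pipes-Wget), using such a filter could
look something like this; the --html-filter option is invented here, and
the submitted patches may spell it differently, but tidy is a real
program:

  # Clean up a site's broken markup before link extraction, without
  # altering the copies that get saved to disk.
  wget --html-filter 'tidy -q -asxhtml' -r http://oldhost/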

* A "getter" command is mentioned more than once in the above. Note that
this is not mutually exclusive with the concept of letting a single
process govern connection persistence, which would handle the real work;
the "getter" would probaby be a tool for communicating with the main driver.

And now, a few things that could only be done with a Piped-Wget, and not
by the other possible mechanisms.

- Using existing tools to implement protocols Wget doesn't understand
(want scp support? Just register it as an scp:// scheme handler; see the
sketch after this list), and instantly add support to Wget for the
latest, greatest protocols without hacking Wget or waiting until we get
around to implementing it.

- Don't like the way Piped-Wget handles things, perhaps such as parsing
html? Swap in your own handler instead.

- Want to add a new feature to Wget, such as emailing someone when
there's a problem spidering the site? Drop it in. (Note, that's already
possible. You guessed it! It involves pipelining.)

- Want to use a piece of Wget's functionality by itself (html parser,
perhaps? Just want a list of all the links from a file?)? Just use the
command.
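As promised above, here's a sketch for the scp case (the registration
syntax and the name scp-getter are invented; scp itself is real and does
all the actual work):

  #!/bin/sh
  # Hypothetical scp:// scheme handler, registered with something like
  #   scheme scp = scp-getter
  # Takes a URL of the form scp://user@host/path and hands it to scp.
  url=${1#scp://}      # strip the scheme
  host=${url%%/*}      # the user@host part
  path=${url#*/}       # the remote path
  exec scp "$host:/$path" .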


This is precisely why pipelines have made Unix so powerful: they are the
only truly successful example of that ever-elusive (and now often all
but abandoned) holy grail, Code Reuse.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/