Okay, so there's been a lot of thought in the past regarding better
extensibility features for Wget: things like hooks for adding traversal
support for Content-Types besides text/html, some form of JavaScript
support, or support for MetaLink. Also, support for filtering results
both before and after Wget processes them: for example, filtering the
HTML to change how Wget sees it before it parses for links, without
affecting the actual downloaded version; or filtering the links
themselves to alter what Wget fetches.
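
Just to make that last idea concrete (this is purely a sketch of mine;
the notion that a filter would speak plain URLs over stdin/stdout is an
assumption, not anything Wget currently does), a link filter could be as
small as:

    #!/usr/bin/env python3
    # Hypothetical post-parse link filter: read one candidate URL per
    # line on stdin, drop anything we don't want fetched, and echo the
    # rest on stdout for the downloader to pick up.
    import sys
    from urllib.parse import urlsplit

    for line in sys.stdin:
        url = line.strip()
        if not url:
            continue
        path = urlsplit(url).path
        # Example policy: skip bulky ISO images and anything under /private/.
        if path.endswith(".iso") or path.startswith("/private/"):
            continue
        print(url)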

The original concept, before I came on board, was plugin modules. After
some thought, I decided I didn't like that overly much, and have mainly
been leaning toward the idea of a next-gen Wget-as-a-library thing,
probably wrapping libcurl (and with a command-line client version, like
curl). That obviously wouldn't have been Wget any more, so it would have
been a separate project, with a different name.

However, another thing that's been vaguely itching at me lately is the
fact that Wget's design is not particularly Unix-y. Instead of doing one
thing and doing it well, it does a lot of things, some well, some not.

So the last couple of days I've been thinking: maybe "wget-ng" should be
a suite of interoperating shell utilities, rather than a library or a
single app. This could have some really huge advantages: users could
choose their own HTML parser, plug in parsers for whatever filetypes
they desire, and people who want to implement exotic features could do
that...
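
To give a feel for what I mean (the tool names and the convention of
passing plain text over stdin/stdout are just assumptions for the sake
of illustration), a swappable link-extraction stage might be nothing
more than:

    #!/usr/bin/env python3
    # Sketch of a pluggable link-extraction stage: HTML in on stdin,
    # one discovered URL per line out on stdout. Any program honoring
    # the same contract could replace it.
    import sys
    from html.parser import HTMLParser

    class LinkExtractor(HTMLParser):
        def handle_starttag(self, tag, attrs):
            for name, value in attrs:
                if name in ("href", "src") and value:
                    print(value)

    LinkExtractor().feed(sys.stdin.read())

A recursive fetch could then be a pipeline along the lines of
"fetch | extract-links | filter | fetch", with each stage replaceable
independently.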

Of course, at this point we're talking about something that's
fundamentally different from "Wget". Just as we were when we were
considering making a next-gen library version. It'd be a completely
separate project. And I'm still not going to start it right away (though
I think some preliminary requirements and design discussions would be a
good idea). Wget's not going to die, nor is everyone going to want to
switch to some new-fangled re-envisioning of it.

But the thing everyone loves about Unix and GNU (and certainly the thing
that drew me to them) is the bunch-of-tools-on-a-crazy-pipeline
paradigm, which is what lets you mix and match different tools to cover
different areas of functionality. Wget doesn't fit very well into that
scheme, and I think it could become even more powerful than it already
is by being broken into smaller, more discrete projects. Or, to be more
precise, by offering an alternative that does the equivalent.

So far, the following principles have struck me as advisable for a
project such as this:

 - The tools themselves, as much as possible, should be written in an
easily-hackable scripting language. Python makes a good candidate. Where
we want efficiency, we can implement modules in C to do the work.

 - While efficiency won't be the highest priority (else we'd just stick
with the monolith), it's still important. Spawning off a separate
process to fetch each page, initiating a new connection each time, would
be a lousy idea. So the architectural model should center around a
"URL-getter" driver that manages connections and such, reusing
persistent ones as much as possible (a rough sketch of what I mean
follows below). Of course, there might be distinct commands to handle
different types of URLs (or alternative methods for handling them, such
as MetaLink), and perhaps not all of these would be able to do
persistence (a dead-simple way to add support for scp, etc., might be to
simply call the command-line program).
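
To illustrate roughly what I have in mind for that driver (again, purely
a sketch under my own assumptions; none of these names or conventions
are settled), something like:

    #!/usr/bin/env python3
    # Sketch of a "URL-getter" driver: read URLs on stdin, keep one
    # persistent HTTP connection open per host, and write each response
    # body to a local file. Error handling, HTTPS, redirects, retries,
    # and connection expiry are all omitted here.
    import sys
    from http.client import HTTPConnection
    from urllib.parse import urlsplit

    connections = {}    # host[:port] -> open HTTPConnection

    for line in sys.stdin:
        url = line.strip()
        if not url:
            continue
        parts = urlsplit(url)
        conn = connections.get(parts.netloc)
        if conn is None:
            conn = connections[parts.netloc] = HTTPConnection(parts.netloc)
        target = parts.path or "/"
        if parts.query:
            target += "?" + parts.query
        conn.request("GET", target)
        body = conn.getresponse().read()
        # Crude output naming: last path component, or index.html.
        name = parts.path.rsplit("/", 1)[-1] or "index.html"
        with open(name, "wb") as f:
            f.write(body)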

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/