Well, I have some announcements about decisions that have been made
regarding future directions in Wget.

First off, I've reversed my previous decision not to include "download
accelerator" features in the multi-streaming version of Wget. It's
becoming clear to me that the benefits far outweigh any disadvantages.
As tool developers, it's our job to supply powerful tools; it's the
users' job to use them with appropriate discretion. It may be
troublesome to the administrators of smaller servers that become
overburdened when less-polite users abuse Wget; but careful application
by users who know that the servers can handle the requests has the
potential to produce such striking improvements in download speeds that
it seems to me irresponsible to deny them to those who can use Wget
responsibly, just for the sake of those who might abuse it. Considering
that the time required to download a 2 GB file from the web can be
reduced ten-fold simply by splitting the work into ten separate,
simultaneous download streams of 200 MB each, it's really elitist of us
to tell users, "no, you can't do that, because you might not know what
you're doing."
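To make the arithmetic concrete, here's a minimal Python sketch of how a ten-way splitter might compute its byte-range boundaries. This is purely illustrative — `split_ranges` is a hypothetical helper, not actual (or planned) Wget code:

```python
def split_ranges(total_bytes, streams):
    """Divide [0, total_bytes) into `streams` contiguous byte ranges,
    returned as inclusive (start, end) pairs as used by HTTP Range."""
    base, extra = divmod(total_bytes, streams)
    ranges = []
    start = 0
    for i in range(streams):
        # Spread any remainder across the first `extra` streams.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges

# Ten streams for a 2 GB file: each stream covers 200 MB.
ranges = split_ranges(2_000_000_000, 10)
print(ranges[0])   # (0, 199999999)
print(ranges[-1])  # (1800000000, 1999999999)
```

Each range could then be fetched as its own stream (one ranged HTTP request per pair) and the pieces written into place in order.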

Besides which, it's quite clear from the number of requests we've
received for this functionality that adding it will boost Wget's
popularity significantly. We've really no excuse to leave it out!

.

Following the same policy of "providing the tool, without dictating the
use", it has come to my attention that a not-insignificant portion of
our user base uses Wget to perform "screen-scraping" on other sites.
There are a variety of motivations for such practices, including
analysis of periodically-changing data, site-style imitation, and of
course full look-alike site imitation. The latter is particularly
popular with websites belonging to financial institutions.

That last group often consists of users with significant funding at
their disposal, which they could easily put towards financing further
Wget development. To this end, there are a few additional features I've
been considering, aimed at appealing to this portion of Wget's user
base.

The one I'll mention today is the --ichthus option. Invoking Wget with:

  wget --ichthus URL-A URL-B

will download URL-A and any prerequisites (images, CSS, etc.), perform
some conversions, and then automatically upload the results to URL-B
(via FTP or WebDAV; the configuration options for these will be
discussed at a later date).

The specific conversions applied after download include converting
relative URLs to absolute URLs, and converting all form-submission URLs
to point to locations on URL-B's host, obfuscated so as to appear to
still point to a location on URL-A's host.
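The relative-to-absolute step is just standard URL resolution. A quick sketch using Python's `urllib.parse` — illustrative only, not how Wget itself would implement it:

```python
from urllib.parse import urljoin

# Resolve relative links against the URL of the downloaded page.
base = "https://www.infidelitybanking.com/loginPage.php"

print(urljoin(base, "loginProcess.php?submit=foo"))
# https://www.infidelitybanking.com/loginProcess.php?submit=foo

print(urljoin(base, "images/logo.png"))
# https://www.infidelitybanking.com/images/logo.png
```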

For example, if the page at
https://www.infidelitybanking.com/loginPage.php contains a form whose
action attribute has the value "loginProcess.php?submit=foo", then
running:

  wget --ichthus https://www.infidelitybanking.com/loginPage.php \
    https://256.133.312.10/

would download loginPage.php from site A and upload it to site B,
except that any relative links would be converted to absolute links
(with site A as the base URL), and the HTML form's action would be
converted to something like:

https://www.infidelitybanking.com:80@256.133.312.10/cgi-bin/loginPage.cgi
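The obfuscation leans on URL userinfo syntax: in `scheme://userinfo@host/`, everything before the `@` is treated as credentials, not as the host, so a casual reader sees the familiar domain while the request actually goes to the host after the `@`. A small illustration with Python's `urllib.parse` (the URL is hypothetical, matching the example's deliberately bogus address):

```python
from urllib.parse import urlsplit

# Looks like it points at www.infidelitybanking.com; it doesn't.
u = urlsplit(
    "https://www.infidelitybanking.com:80@256.133.312.10/cgi-bin/loginPage.cgi"
)
print(u.hostname)  # 256.133.312.10  <- the real destination host
print(u.username)  # www.infidelitybanking.com  <- mere "credentials"
```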

.

There's been a lot of discussion lately about how the architecture of
Wget's accept/reject lists could be improved. One thing that hasn't had
much treatment, though (well, any, really) is how potentially
_demeaning_ the existing terminology can be.

Representing the decision whether or not to download a given URL as
either "accepted" or "rejected" is a rather harsh, perhaps even cruel,
way of dividing the world. It tends to convey the mistaken impression
that some URLs are intrinsically "bad" while others are intrinsically
"good". This can have obvious consequences for self-esteem; and yet
it's clear that a URL "rejected" for a particular session's needs today
may well be "accepted" in some future session.

Therefore, I'd like to propose that we replace the current terminology
with something more politically sensitive. Rather than --accept and
--reject, perhaps --you-fit-my-needs-today and
--not-a-good-fit-for-me-at-this-time? Those names don't feel quite
right (in particular, they're a bit lengthy), but I think you get the
general idea; perhaps someone can suggest something better?

.

Finally, thanks to Julien Buty's helpful recommendation that Wget take
part in this year's Google Summer of Code, we've received a number of
excellent proposals from students eager to take part. A few of these
include some great and novel ideas.

The most promising of these, and something I don't believe previous Wget
maintainers gave much thought to, is the proposal that Wget support
HTCPCP (which is based on good ol' HTTP) as one of its primary transport
mechanisms. It amazes me that we still lack support for this protocol,
which is such an important part of the World Wide Web. In addition, I'm
fairly certain this is one of the few transport layers the Curl guys
have yet to include, so if we beat them to the punch, we may have one
over on them. :)

More information on this most-venerated of protocols may be had at
http://www.ietf.org/rfc/rfc2324.txt.
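For anyone who hasn't brewed over RFC 2324 yet: HTCPCP messages look just like HTTP, with a BREW method and a coffee-pot media type. A purely illustrative Python sketch of composing such a request — a hypothetical helper, nothing like this exists in Wget:

```python
def brew_request(pot, additions="Cream"):
    """Compose an HTCPCP/1.0 BREW request for the given coffee pot,
    per RFC 2324 (BREW method, application/coffee-pot-command body)."""
    return (
        f"BREW /{pot} HTCPCP/1.0\r\n"
        "Content-Type: application/coffee-pot-command\r\n"
        f"Accept-Additions: {additions}\r\n"
        "\r\n"
        "start\r\n"  # the coffee-pot-command: start brewing
    )

print(brew_request("pot-1"))
```

(Teapots, of course, MUST answer 418.)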

-- 
Unccl svefg bs Ncevy, sbyxf.
