A bug report made to Savannah
(https://savannah.gnu.org/bugs/index.php?20496) detailed an example
where wget would perform a recursive fetch normally, but then, when run
again (with -c), would eat up vast (_vast_) amounts of memory until it
finally gave up after running out. Some of the files involved were >1GB
video files, and may have been updated with slightly smaller versions
between the two wget runs.

It turned out the problem was that, if the >1GB file had been downloaded
completely (or was larger on disk than the one on the server), wget
would believe, on continuation, that it was an HTML file rather than a
video file, and would attempt to slurp the entire file contents into
memory for parsing.

There are a couple of issues with this, AFAICT: (1) Wget considers the
file to be HTML when it's not, and (2) Wget has to slurp HTML files
entirely into memory in order to parse them.

As to the second of these, slurping everything into memory, I'm hoping
there will be a fairly straightforward, though tedious, solution. I
imagine the parser could be retrofitted with something more general than
slurping an entire string: perhaps a get(size) function that returns the
next |size| bytes of the HTML file, so the parsing routines could
discard each chunk once they're done with it.
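
To make that concrete, here's a minimal sketch of the kind of interface
I have in mind. All the names here are invented for illustration; none
of this is existing wget code:

    #include <stdio.h>

    /* Hypothetical streaming source for the HTML parser: rather than
       handing link extraction one big malloc'd string, the parser pulls
       input a chunk at a time and can discard each chunk once it's done
       with it. */
    struct html_source {
      FILE *fp;          /* underlying local file */
      char buf[16384];   /* the only window the parser ever sees */
    };

    /* Fill SRC->buf with up to SIZE bytes of fresh input; return the
       number of bytes actually read, or 0 at end-of-file. */
    static size_t
    html_source_get (struct html_source *src, size_t size)
    {
      if (size > sizeof src->buf)
        size = sizeof src->buf;
      return fread (src->buf, 1, size, src->fp);
    }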

As a stopgap, though, and perhaps even a sufficient fix, we should
probably consider setting a hard limit on the size of HTML files, and
refuse to parse files that exceed it.
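
Something along these lines, say. The limit, the macro name, and where
the check would live are all just placeholders for the sake of argument:

    #include <sys/stat.h>

    /* Hypothetical cap on what we're willing to parse as HTML;
       10 MB is an arbitrary figure for illustration. */
    #define MAX_HTML_PARSE_SIZE (10 * 1024 * 1024)

    /* Return nonzero if FNAME is small enough to hand to the HTML
       parser, zero if it should be skipped. */
    static int
    small_enough_to_parse (const char *fname)
    {
      struct stat st;
      if (stat (fname, &st) != 0)
        return 0;
      return st.st_size <= MAX_HTML_PARSE_SIZE;
    }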

However, I'm not sure how we could solve issue #1. Wget first issues a
HEAD to check the timestamp/Content-Disposition, and then a GET with a
Range header set. The server in this case responds to the HEAD without a
Content-Type; and to the GET, it responds appropriately with a 416
Requested Range Not Satisfiable, whose body has a Content-Type of
text/html.
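
Roughly, the exchange looks like this (the URL, host, and offset are
made up; the shape of it is paraphrased from the bug report):

    HEAD /videos/big.avi HTTP/1.1
    Host: example.com
      -> (presumably) 200 OK, with no Content-Type header

    GET /videos/big.avi HTTP/1.1
    Host: example.com
    Range: bytes=1073741824-
      -> 416 Requested Range Not Satisfiable
         Content-Type: text/html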

Now, Wget is using the Content-Type of the 416 response, which isn't
really appropriate. However, it has nothing else to go by, since the
HEAD had no Content-Type information either, and since unspecified
Content-Types default to text/html... I suppose we could fetch the first
byte of the file to get the true Content-Type, though that wouldn't help
with zero-length files (which, on the other hand, wouldn't be an issue
to slurp in :D ).
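
For the sake of illustration, that probe would look something like this:

    GET /videos/big.avi HTTP/1.1
    Host: example.com
    Range: bytes=0-0
      -> a 206 Partial Content response here would, presumably, carry
         the file's real Content-Type (e.g. video/x-msvideo)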

Anyway, as I said, limiting the file size allowed for parsing is a good
temporary solution; but as to more permanent resolutions, I'm unsure.
Ideas, anyone?

The original bug reporter has been Cc'd; I'm assuming they're not
subscribed, so please keep the Cc in your follow-ups, thanks.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
