A bug report made to Savannah (https://savannah.gnu.org/bugs/index.php?20496) describes a case where wget would perform a recursive fetch normally, but when run again (with -c) it would eat up vast (_vast_) amounts of memory, until it finally gave up after running out. Some of the files involved were >1GB video files, and they may have been replaced with slightly smaller versions between the two wget runs.
It turned out that if the >1GB file had been downloaded completely (or was larger on disk than the copy on the server), wget would believe, on continuation, that it was an HTML file rather than a video file, and would attempt to slurp the entire file contents into memory for parsing. AFAICT there are a couple of issues here: (1) wget considers the file to be HTML when it's not, and (2) wget has to slurp HTML files into memory in their entirety in order to parse them.

For the second issue, slurping everything into memory, I'm hoping there will be a fairly straightforward, though tedious, solution. The parser could be retrofitted with something more general than a single slurped string: perhaps a get(size) function that returns the next |size| bytes of the HTML file, so the parsing routines can throw chunks away once they're done with them. As a stopgap, though, and perhaps even as a sufficient fix, we should probably consider setting a hard limit on the size of HTML files, and refuse to parse files that exceed that limit.

However, I'm not sure how we could solve issue (1). Wget first issues a HEAD to check the timestamp/Content-Disposition, and then a GET with the Range header set. The server in this case responds to the HEAD without a Content-Type; to the GET, it responds, appropriately, with a 416 Requested Range Not Satisfiable whose body has a Content-Type of text/html. Wget is currently using the Content-Type of the 416 response, which isn't really appropriate: that header describes the error body, not the file. But it has nothing else to go by, since the HEAD had no Content-Type information either, and unspecified Content-Types default to text/html. I suppose we could fetch the first byte of the file to get its true Content-Type, though that wouldn't address zero-length files (which, on the other hand, wouldn't be an issue to slurp in :D ).

Anyway, as I said, limiting the file size allowed for parsing is a good temporary solution; but as to more permanent resolutions, I'm unsure.
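To make the get(size) idea concrete, here's a minimal sketch in Python (wget itself is C; the class and method names here are mine, not anything in wget's source). The parser would pull bytes on demand and can discard each chunk once it has extracted any links from it, so memory use stays bounded by the chunk size rather than the file size:

```python
class ChunkedSource:
    """Sketch of the proposed get(size) interface: hand the parser the
    next |size| bytes on demand, instead of slurping the whole document
    into one string, so consumed chunks can be dropped as soon as the
    parser is done with them."""

    def __init__(self, path):
        # Open in binary mode; the parser decides on decoding.
        self._fp = open(path, "rb")

    def get(self, size):
        """Return up to |size| further bytes; b"" once the file is exhausted."""
        return self._fp.read(size)

    def close(self):
        self._fp.close()
```

The tedious part, of course, is reworking every parsing routine that currently assumes the whole document is addressable at once (e.g. anything that backtracks across chunk boundaries).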
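The stopgap size limit is even simpler to sketch. This is illustrative Python, not wget code, and the 10 MiB cap is an assumed value picked for the example, not a number anyone has agreed on:

```python
import os

# Assumed cap for illustration only; the actual limit (and whether it
# should be user-configurable) would need to be decided.
MAX_HTML_PARSE_SIZE = 10 * 1024 * 1024  # 10 MiB

def should_parse_as_html(path, content_type):
    """Refuse to slurp-and-parse files too large to be plausible HTML,
    even when the (possibly bogus) Content-Type claims they are."""
    if content_type is not None and not content_type.startswith("text/html"):
        return False
    if os.path.getsize(path) > MAX_HTML_PARSE_SIZE:
        # Stopgap: skip link extraction rather than exhaust memory.
        return False
    return True
```

This wouldn't fix the misidentification itself; the >1GB video simply wouldn't be parsed, which is the behavior we want in this bug anyway.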
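And here's what the "fetch the first byte" probe might look like, again as a Python sketch of the idea rather than a patch: issue a one-byte ranged GET, and only trust the Content-Type when the server actually satisfies the range. A 416's Content-Type is ignored, since it labels the error page, not the file:

```python
import http.client

def probe_content_type(host, path, port=80):
    """Fetch the first byte of a resource to learn its real Content-Type.

    The HEAD in the bug carried no Content-Type, and the Content-Type on
    a 416 describes the error body, not the file; a Range: bytes=0-0 GET
    makes the server label the actual resource.  Zero-length files still
    dodge this (the range is unsatisfiable, so we'd get a 416 again).
    """
    conn = http.client.HTTPConnection(host, port)
    try:
        conn.request("GET", path, headers={"Range": "bytes=0-0"})
        resp = conn.getresponse()
        resp.read()  # drain the (at most one-byte, or error-page) body
        if resp.status in (200, 206):
            return resp.getheader("Content-Type")
        return None  # 416 etc.: this response's Content-Type isn't the file's
    finally:
        conn.close()
```

The cost is an extra round trip per file, which may be why it's only worth doing when the HEAD response left us guessing.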
Ideas, anyone? The original bug reporter has been Cc'd; I'm assuming they're not subscribed, so please keep the Cc in your follow-ups. Thanks.

-- 
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/