Hrvoje Niksic wrote:
> Micah Cowan <[EMAIL PROTECTED]> writes:
>
>> I agree that it's probably a good idea to move HTML parsing to a
>> model that doesn't require slurping everything into memory;
>
> Note that Wget mmaps the file whenever possible, so it's not actually
> allocated on the heap (slurped). You need some memory to store the
> URLs found in the file, but that's not really avoidable. I agree that
> it would be better to completely avoid the memory-based model, as it
> would allow links to be extracted on-the-fly, without saving the file
> at all. It would be an interesting exercise to write or integrate a
> parser that works like that.
Yes, but when mmap()ing with MAP_PRIVATE, once you actually start _using_ the mapped space, is there much of a difference? (I'm not certain MAP_SHARED would improve the situation, though it might be worth checking.) Also, if mmap() fails (say, with ENOMEM), it falls back to good old realloc() loops (though it should probably be seeding those with the file size, rather than starting from a hard-coded value and resizing until it's right; a rough sketch of what I mean is at the end of this message). In practice mmap() isn't failing; but wget's memory space gets huge through the simple use of memchr() (on '<', for instance) on the mapped address space.

> Regarding limits to file size, I don't think they are a good idea.
> Whichever limit one chooses, someone will find a valid use case
> broken by the limit. Even an arbitrary limit I thought entirely
> reasonable, such as the maximum redirection count, recently turned
> out to be broken by design.

Well, that may be too harsh. I think a depth limit of 20 was more than appropriate; I'm not sure, but I suspect that several interactive user agents also have redirection limits, with much lower values. Arguably, my response to the situation that led to making that value configurable could reasonably have been "you're Doing The Wrong Thing"; but at any rate, a configurable redirection limit seemed potentially useful, so the change was made. But you're right: at the least, an arbitrary, hard-coded limit is going to be a mistake. Your arguments are less strong against a configurable limit, though.

Still, perhaps a better way to approach this would be to use some sort of heuristic to determine whether the file looks to be HTML. Doing this reliably without breaking real HTML files will be something of a challenge, but perhaps requiring that we find something that looks like a familiar HTML tag within the first 1k or so would be appropriate. We can't expect well-formed HTML, of course, so even requiring an <HTML> tag is not reasonable; but finding any tag whatsoever would be something to start with (see the sketch below).
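Something along these lines is the sort of thing I have in mind -- purely a sketch, not actual wget code; the function name and the 1k window are made up:

    /* Rough "does this look like HTML?" check: anything resembling a
       tag -- '<' followed by a letter, '!' or '/' -- within the first
       1k of the buffer.  Illustrative only.  */
    #include <ctype.h>
    #include <stddef.h>

    static int
    looks_like_html (const char *buf, size_t len)
    {
      size_t window = len < 1024 ? len : 1024;
      size_t i;

      for (i = 0; i + 1 < window; i++)
        if (buf[i] == '<'
            && (isalpha ((unsigned char) buf[i + 1])
                || buf[i + 1] == '!' || buf[i + 1] == '/'))
          return 1;

      return 0;
    }

That will obviously false-positive on some non-HTML text (anything with a '<' followed by a letter), so matching against a short list of familiar tag names might be the next refinement.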
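And back on the earlier point: by "seeding the realloc() loop with the file size" I mean roughly the following (again just a sketch, with invented names and minimal error handling -- not what's actually in the source):

    #include <stdlib.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static char *
    read_whole_file (int fd, size_t *len)
    {
      struct stat st;
      size_t alloc, used = 0;
      ssize_t n;
      char *buf, *tmp;

      /* Start from the size fstat() reports, rather than a hard-coded
         initial allocation.  */
      if (fstat (fd, &st) < 0 || st.st_size <= 0)
        alloc = 512;
      else
        alloc = (size_t) st.st_size + 1;

      buf = malloc (alloc);
      if (!buf)
        return NULL;

      while ((n = read (fd, buf + used, alloc - used)) > 0)
        {
          used += n;
          if (used == alloc)
            {
              /* Only reached if the file turned out bigger than
                 fstat() said (pipe, growing file, ...).  */
              tmp = realloc (buf, alloc *= 2);
              if (!tmp)
                {
                  free (buf);
                  return NULL;
                }
              buf = tmp;
            }
        }

      *len = used;
      return buf;
    }

The doubling only kicks in when fstat() undershoots, so the common case is a single allocation of the right size.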
--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/