Hrvoje Niksic wrote:
>> mmap() isn't failing; but wget's memory space gets huge through the
>> simple use of memchr() (on '<', for instance) on the mapped address
>> space.
>
> Wget's virtual memory footprint does get huge, but the resident
> memory needn't.
Sorry, I should've been clearer: specifically, it's the resident memory
that grows enormously. It seems, though, that if I suspend the process,
the resident memory creeps back down while it's not being used. I
haven't reproduced the actual "out of memory" part of the bug report,
and perhaps the resident-memory growth I was seeing was some sort of
temporary caching; I don't know nearly enough about Unix or GNU/Linux
memory models to say. However, if I just let it run, it creeps up to
1GB of resident memory for the 1GB file (I've no idea how it would
behave on a system with less memory/swap), all within a single
memchr(). (I suspect the OP's case didn't involve just a single
memchr(); my simulation uses a file whose contents were copied from
/dev/zero.) If I stop it for a while in gdb while I poke around, the
resident memory seems to creep down slowly, and it doesn't reach the
full 1GB before wget frees the address space (which instantly causes
the resident memory to drop drastically).

Actually, I was wrong, though: sometimes mmap() _is_ failing for me
(it did just now), which of course means that everything ends up in
resident memory. So we've probably been chasing a red herring.

> memchr only accesses memory sequentially, so the above swap
> out scenario applies. More importantly, in this case the report
> documents "failing to allocate -2147483648 bytes", which is a malloc
> or realloc error, completely unrelated to mapped files.

Good point, and this is consistent with mmap() failure. Your comment
about memchr() and sequential access is consistent with my observations
about memory "dropping" while idle; though I'm surprised it keeps so
much resident in the first place.

>> Still, perhaps a better way to approach this would be to use some
>> sort of heuristic to determine whether the file looks to be HTML.
>> Doing this reliably without breaking real HTML files will be
>> something of a challenge, but perhaps requiring that we find
>> something that looks like a familiar HTML tag within the first 1k or
>> so would be appropriate. We can't expect well-formed HTML, of
>> course, so even requiring an <HTML> tag is not reasonable: but
>> finding any tag whatsoever would be something to start with.
>
> I agree in principle, but I'd still like to know exactly what went
> wrong in the reported case. I suspect it's not just a case of
> mmapping a huge file, but a case of misparsing it, for example by
> attempting to extract a "URL" hundreds of megabytes long.

In all the debug sessions I've run, it never even gets that far. When
mmap() succeeds, it does of course get into the beginning of parsing,
but it fails to find a '<' (since the file is all zeroes) and exits
pretty quickly. I suspect there are only really issues when mmap()
fails and wget falls back to malloc() and friends.

-- 
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
