Hrvoje Niksic wrote:
>> mmap() isn't failing; but wget's memory space gets huge through the
>> simple use of memchr() (on '<', for instance) on the mapped address
>> space.
>
> Wget's virtual memory footprint does get huge, but the resident
> memory needn't.
Sorry, I should've been clearer: specifically, it's the resident memory
that grows enormously. It seems, though, that if I suspend the process,
the resident memory creeps back down while it's not being used. I
haven't reproduced the actual "out of memory" part of the bug report,
and perhaps the resident-memory growth I was seeing was some sort of
temporary caching; I don't know nearly enough about Unix or GNU/Linux
memory models to say. However, if I just let it run, it creeps up to
1GB of resident memory for the 1GB file (I've no idea how it would
behave on a system with less memory/swap), all within a single
memchr(). (I suspect the OP's case didn't involve just a single
memchr(); my simulation uses a file whose contents were copied from
/dev/zero.) If I stop it for a while in gdb while I poke around, the
resident memory seems to creep down slowly, and it doesn't reach the
full 1GB before wget frees the address space (which instantly causes
the resident memory to drop drastically).

Actually, I was wrong, though: sometimes mmap() _is_ failing for me
(it did just now), which of course means that everything ends up in
resident memory. So we've probably been chasing a red herring.

> memchr only accesses memory sequentially, so the above swap
> out scenario applies. More importantly, in this case the report
> documents "failing to allocate -2147483648 bytes", which is a malloc
> or realloc error, completely unrelated to mapped files.

Good point, and this is consistent with mmap() failure. Your comment
about memchr() and sequential access is consistent with my observations
about memory "dropping" while idle; though I'm surprised it keeps so
much resident in the first place.

>> Still, perhaps a better way to approach this would be to use some
>> sort of heuristic to determine whether the file looks to be HTML.
>> Doing this reliably without breaking real HTML files will be
>> something of a challenge, but perhaps requiring that we find
>> something that looks like a familiar HTML tag within the first 1k or
>> so would be appropriate. We can't expect well-formed HTML, of
>> course, so even requiring an <HTML> tag is not reasonable: but
>> finding any tag whatsoever would be something to start with.
>
> I agree in principle, but I'd still like to know exactly what went
> wrong in the reported case. I suspect it's not just a case of
> mmapping a huge file, but a case of misparsing it, for example by
> attempting to extract a "URL" hundreds of megabytes long.

In all the debug sessions I've run, it never even gets that far. When
mmap() succeeds, it does of course get into the beginning of parsing,
but it fails to find a '<' (since the file is all zeroes) and exits
pretty quickly. I suspect there are only really issues when mmap()
fails and wget falls back to malloc() and friends.

-- 
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
