Re: text/html assumptions, and slurping huge files
Micah Cowan [EMAIL PROTECTED] writes:

> I agree that it's probably a good idea to move HTML parsing to a model
> that doesn't require slurping everything into memory;

Note that Wget mmaps the file whenever possible, so it's not actually allocated on the heap (slurped). You need some memory to store the URLs found in the file, but that's not really avoidable.

I agree that it would be better to completely avoid the memory-based model, as it would allow links to be extracted on the fly, without saving the file at all. It would be an interesting exercise to write or integrate a parser that works like that.

Regarding limits to file size, I don't think they are a good idea. Whichever limit one chooses, someone will find a valid use case broken by it. Even an arbitrary limit I thought entirely reasonable, such as the maximum redirection count, recently turned out to be broken by design.

In this case it might make sense to investigate exactly where and why the HTML parser spends the memory; perhaps the parser saw something it thought was valid HTML and tried to extract a huge link from it? Maybe the parser simply needs to be taught to perform sanity checks on the URLs it encounters.
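[Editor's note: the sanity check suggested above could be as simple as a length cap on candidate URLs before any allocation happens. A minimal sketch, assuming a hypothetical helper name and an illustrative 2 KB cap; neither is from Wget's sources.]

```c
#include <stddef.h>

#define URL_SANE_MAX 2048   /* illustrative cap, not a Wget constant */

/* Reject candidate URLs of implausible length before allocating
 * anything for them; [start, end) delimits the candidate inside the
 * parse buffer. */
static int url_len_sane(const char *start, const char *end)
{
    return end >= start && (size_t)(end - start) <= URL_SANE_MAX;
}
```

A check like this would make the "huge link" failure mode cost at most a comparison, instead of an enormous malloc.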
Re: text/html assumptions, and slurping huge files
Hrvoje Niksic wrote:
> Note that Wget mmaps the file whenever possible, so it's not actually
> allocated on the heap (slurped). You need some memory to store the URLs
> found in the file, but that's not really avoidable.
>
> I agree that it would be better to completely avoid the memory-based
> model, as it would allow links to be extracted on the fly, without
> saving the file at all. It would be an interesting exercise to write or
> integrate a parser that works like that.

Yes, but when mmap()ping with MAP_PRIVATE, once you actually start _using_ the mapped space, is there much of a difference? (I'm not certain MAP_SHARED would improve the situation, though it might be worth checking.) Also, if mmap() fails (say, with ENOMEM), it falls back to good old realloc() loops (though it should probably seed that with the file size, rather than starting with a hard-coded value and resizing until it's right). mmap() isn't failing; but wget's memory space gets huge through the simple use of memchr() (on '<', for instance) on the mapped address space.

> Regarding limits to file size, I don't think they are a good idea.
> Whichever limit one chooses, someone will find a valid use case broken
> by the limit. Even an arbitrary limit I thought entirely reasonable,
> such as the maximum redirection count, recently turned out to be broken
> by design.

Well, that may be too harsh. I think a depth limit of 20 was more than appropriate; I'm not sure, but I suspect that several interactive user agents also have redirection limits, with much lower values. Arguably, my response to the situation that led to making that value configurable could reasonably have been "you're Doing The Wrong Thing"; but at any rate, a configurable redirection limit seemed potentially useful, so the change was made.

But you're right: at the least, an arbitrary, hard-coded limit is going to be a mistake. Your arguments are less strong against a configurable limit, though.

Still, perhaps a better way to approach this would be to use some sort of heuristic to determine whether the file looks to be HTML. Doing this reliably without breaking real HTML files will be something of a challenge, but perhaps requiring that we find something that looks like a familiar HTML tag within the first 1k or so would be appropriate. We can't expect well-formed HTML, of course, so even requiring an <html> tag is not reasonable: but finding any tag whatsoever would be something to start with.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
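[Editor's note: the "any tag in the first 1k" heuristic proposed above could look something like the following sketch. The function name and the exact tag-start rule ('<' followed by a letter, '/', or '!') are illustrative assumptions, not code from Wget.]

```c
#include <ctype.h>
#include <stddef.h>

/* Heuristic: scan the first 1 KB of the buffer for anything that looks
 * like the start of a tag.  Deliberately loose -- we can't expect
 * well-formed HTML, only that a real HTML file opens a tag early. */
static int looks_like_markup(const char *buf, size_t len)
{
    size_t limit = len < 1024 ? len : 1024;
    for (size_t i = 0; i + 1 < limit; i++) {
        if (buf[i] == '<'
            && (isalpha((unsigned char)buf[i + 1])
                || buf[i + 1] == '/' || buf[i + 1] == '!'))
            return 1;
    }
    return 0;
}
```

On the /dev/zero-filled file from the bug report, a check like this would bail out before the parser ever walks the full mapping.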
Re: text/html assumptions, and slurping huge files
Micah Cowan [EMAIL PROTECTED] writes:

> Yes, but when mmap()ping with MAP_PRIVATE, once you actually start
> _using_ the mapped space, is there much of a difference?

As long as you don't write to the mapped region, there should be no difference between shared and private mappings -- that's what copy-on-write (explicitly documented for MAP_PRIVATE in both the Linux and Solaris mmap man pages) is supposed to accomplish. I could have used MAP_SHARED, but at the time I believe there was still code that relied on being able to write to the buffer. That code was subsequently removed, but MAP_PRIVATE stayed because I saw no point in removing it. Given the semantics of copy-on-write, I figured there would be no difference between MAP_SHARED and unwritten-to MAP_PRIVATE.

As for the memory footprint getting large: sure, Wget reads through it all, but that is no different from what, say, "grep --mmap" does. As long as we don't jump backwards in the file, the OS can swap out the unused parts. Another difference between mmap and malloc is that mmap'ed space can be reliably returned to the system. Using mmap pretty much guarantees that Wget's footprint won't increase to 1GB unless you're actually reading a 1GB file, and even then much less will be resident.

> mmap() isn't failing; but wget's memory space gets huge through the
> simple use of memchr() (on '<', for instance) on the mapped address
> space.

Wget's virtual memory footprint does get huge, but the resident memory needn't. memchr only accesses memory sequentially, so the above swap-out scenario applies. More importantly, in this case the report documents failing to allocate -2147483648 bytes, which is a malloc or realloc error, completely unrelated to mapped files.

> Still, perhaps a better way to approach this would be to use some sort
> of heuristic to determine whether the file looks to be HTML. Doing this
> reliably without breaking real HTML files will be something of a
> challenge, but perhaps requiring that we find something that looks like
> a familiar HTML tag within the first 1k or so would be appropriate.

I agree in principle, but I'd still like to know exactly what went wrong in the reported case. I suspect it's not just a case of mmapping a huge file, but a case of misparsing it, for example by attempting to extract a URL hundreds of megabytes long.
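[Editor's note: the read-only MAP_PRIVATE scan being discussed can be sketched as below. Error handling is abbreviated and the function name is illustrative; this is not Wget's actual html-parse code.]

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a file read-only with MAP_PRIVATE (no writes, so copy-on-write
 * never actually copies anything) and scan it sequentially with memchr.
 * Sequential access lets the OS reclaim pages behind us; munmap returns
 * the whole mapping to the system at once. */
static long count_tag_opens(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {
        close(fd);
        return 0;
    }

    char *buf = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    if (buf == MAP_FAILED)
        return -1;

    long count = 0;
    const char *p = buf, *end = buf + st.st_size;
    while ((p = memchr(p, '<', end - p)) != NULL) {
        count++;
        p++;
    }
    munmap(buf, st.st_size);   /* mapped space goes straight back to the OS */
    return count;
}
```

Run against a large all-zeroes file, this never writes to the mapping, so resident memory growth comes only from the kernel faulting pages in during the scan, not from any heap allocation.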
Re: text/html assumptions, and slurping huge files
Hrvoje Niksic wrote:
>> mmap() isn't failing; but wget's memory space gets huge through the
>> simple use of memchr() (on '<', for instance) on the mapped address
>> space.
>
> Wget's virtual memory footprint does get huge, but the resident memory
> needn't.

Sorry, I should've been clearer: specifically, the resident memory grows enormously. It seems, though, that if I suspend the process, the memory can creep back down while it's not being used. I haven't reproduced the actual out-of-memory part of the bug report; perhaps the resident-memory growth I was seeing was some sort of temporary caching. I really don't know nearly enough about the Unix or GNU/Linux memory models to say.

However, if I let it just run, it creeps up to 1GB of resident memory for the 1GB file (I've no idea how it would behave on a system with less memory/swap), all within a single memchr() (I suspect the OP didn't have just a single memchr(): my simulation uses a file whose contents were copied from /dev/zero). If I stop it for a while in gdb while I fish around at things, it seems to creep down slowly, and doesn't reach that full 1GB before freeing the address space (which instantly causes the resident memory to drop drastically).

Actually, I was wrong, though: sometimes mmap() _is_ failing for me (it did just now), which of course means that everything is in resident memory. So we've probably been chasing a red herring.

> memchr only accesses memory sequentially, so the above swap-out
> scenario applies. More importantly, in this case the report documents
> failing to allocate -2147483648 bytes, which is a malloc or realloc
> error, completely unrelated to mapped files.

Good point, and this is consistent with mmap() failure. Your comment about memchr() and sequential access is consistent with my observations about memory dropping while idle. Though I'm surprised it keeps so much resident in the first place.

> I agree in principle, but I'd still like to know exactly what went
> wrong in the reported case. I suspect it's not just a case of mmapping
> a huge file, but a case of misparsing it, for example by attempting to
> extract a URL hundreds of megabytes long.

In all the debug sessions I've been in, it never even gets that far. When mmap() succeeds, it does of course get into the beginning of parsing, but it fails to find its '<' (since the file is all zeroes), and exits pretty quickly. I suspect there are only really issues when mmap() fails and wget falls back to malloc() and friends.
Re: text/html assumptions, and slurping huge files
Micah Cowan [EMAIL PROTECTED] writes:

> Actually, I was wrong, though: sometimes mmap() _is_ failing for me (it
> did just now), which of course means that everything is in resident
> memory.

I don't understand why mmapping a regular file would fail on Linux. What error code are you getting? (Wget tries to handle mmap failure gracefully because the GNU coding standards require it, and also because we support mmap-unaware operating systems anyway.)
Re: text/html assumptions, and slurping huge files
Hrvoje Niksic wrote:
> I don't understand why mmapping a regular file would fail on Linux.
> What error code are you getting?

ENOMEM, based on my recollection of the strerror message; I'm not currently writing from the machine on which I produced this failure, and haven't reproduced it here. I had been running a version with a debug_logprintf() to complain about mapping failures; I believe I may check that code in, as it could be potentially useful information.

Which is a bit odd, since I didn't get that the first time or two I ran it. I suspect there may have been a bit of kernel oddness or something. In any case, we should assume mmap() could fail. It's certainly possible for a file to be large enough that it cannot be mapped into memory.
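[Editor's note: "assume mmap() could fail" plus the earlier remark about seeding the fallback with the file size suggests a loader shaped roughly like this. Names are illustrative, not from Wget.]

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct file_buf {
    char  *data;
    size_t size;
    int    mapped;   /* 1 if data came from mmap, 0 if from malloc */
};

/* Try to map the file; on failure (e.g. ENOMEM) fall back to reading
 * it into a heap buffer sized from the file's length up front, rather
 * than growing a small buffer with realloc loops. */
static int load_file(const char *path, struct file_buf *out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    out->size = (size_t)st.st_size;

    char *p = mmap(NULL, out->size ? out->size : 1, PROT_READ,
                   MAP_PRIVATE, fd, 0);
    if (p != MAP_FAILED) {
        out->data = p;
        out->mapped = 1;
        close(fd);
        return 0;
    }

    /* mmap failed: seed the heap buffer with the known file size. */
    out->data = malloc(out->size ? out->size : 1);
    if (!out->data || read(fd, out->data, out->size) != (ssize_t)out->size) {
        free(out->data);
        close(fd);
        return -1;
    }
    out->mapped = 0;
    close(fd);
    return 0;
}
```

The caller would free with munmap() or free() depending on the `mapped` flag; the point is that the parser sees the same (data, size) pair either way.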
Re: text/html assumptions, and slurping huge files
Micah Cowan wrote:
> A bug report made to Savannah
> (https://savannah.gnu.org/bugs/index.php?20496) detailed an example
> where wget would download a recursive fetch normally, but then when run
> again (with -c), it would eat up vast (_vast_) amounts of memory, until
> finally it would give up due to running out of memory. Some of the
> files involved were 1GB video files, and may have been updated with
> slightly smaller versions between calls to wget.

I've now split this bug report apart, to deal with the separate issues I mentioned in my initial post for this thread. See the original bug report (link above) for details.

Right now, I think the most important thing to determine is: what would be an appropriate size limit for HTML files to be read in for parsing? I'd rather err on the side of permissiveness rather than restrictiveness: I don't want to risk breaking wget for real-world, large HTML files. Is 15MB a decent value?

I'm expecting that, when a file of such size or greater is encountered, it would simply be left alone and not parsed, rather than read up to the limit and parsed up to that point; but if anyone would like to argue for the latter behavior, I'm listening. The main reason I think it's not worthwhile is that such files seem less likely to actually be HTML files.
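[Editor's note: the proposed behavior -- skip parsing entirely above the limit, rather than parsing a truncated prefix -- amounts to a one-line gate on the file size. A trivial sketch; the 15 MB value mirrors the number floated above, and the function name is hypothetical.]

```c
#define PARSE_LIMIT (15L * 1024 * 1024)   /* the 15MB value proposed above */

/* Decide whether a downloaded file is small enough to hand to the HTML
 * parser at all; over-limit files are left alone and not parsed. */
static int should_parse_html(long file_size)
{
    return file_size >= 0 && file_size <= PARSE_LIMIT;
}
```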
Re: text/html assumptions, and slurping huge files
Hi List,

Micah Cowan wrote:
> I'm expecting that, when a file of such size or greater is encountered,
> it would simply be left alone and not parsed, rather than read up to
> the limit and parsed up to that point; but if anyone would like to
> argue for the latter behavior, I'm listening. The main reason I think
> it's not worthwhile is that such files seem less likely to actually be
> HTML files.

I just converted some project, and a 6MB HTML log was created. I agree with you that this is a rare case, and opening such a file in a browser is no fun, but I still don't like hardcoded sizes. Maybe there will be an all-on-one-page manual for some software that exceeds this value, or someone has a single-page picture-database export.

To me it seems cleaner to provide some --parse-limit=xxBytes option, and to implement the parser in a way that can't exceed memory. I also suspect that loading the whole 15M of a maximum-sized HTML file at once looks ugly in memory and can lead to problems when you have multiple wget processes running at once. Maybe wget should be optimized for HTML files with a maximum size of 4M, and parse larger ones in chunks.

Greetings
Matthias Vill
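[Editor's note: the chunked parsing suggested above can be sketched as follows. A real parser would need to carry state for tags split across chunk boundaries; this sketch only counts '<' bytes, which never straddle one. The function name and the 4 KB chunk size are illustrative assumptions.]

```c
#include <stdio.h>
#include <string.h>

#define CHUNK_SIZE 4096   /* illustrative window size */

/* Read the stream through a fixed-size window and scan each chunk as
 * it arrives, so memory use stays O(CHUNK_SIZE) regardless of how
 * large the document is. */
static long count_tag_opens_chunked(FILE *fp)
{
    char chunk[CHUNK_SIZE];
    long count = 0;
    size_t n;

    while ((n = fread(chunk, 1, sizeof chunk, fp)) > 0) {
        const char *p = chunk, *end = chunk + n;
        while ((p = memchr(p, '<', end - p)) != NULL) {
            count++;
            p++;
        }
    }
    return count;
}
```

This is essentially the "not-memory-tied model" discussed later in the thread: the limit option becomes unnecessary for memory reasons, though it might survive for efficiency ones.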
Re: text/html assumptions, and slurping huge files
Matthias Vill wrote:
> To me it seems cleaner to provide some --parse-limit=xxBytes option,
> and to implement the parser in a way that can't exceed memory. I also
> suspect that loading the whole 15M of a maximum-sized HTML file at
> once looks ugly in memory and can lead to problems when you have
> multiple wget processes running at once. Maybe wget should be
> optimized for HTML files with a maximum size of 4M, and parse larger
> ones in chunks.

I agree that it's probably a good idea to move HTML parsing to a model that doesn't require slurping everything into memory; but in the meantime I'd like to put some sort of stop-gap solution in place, and limiting the maximum size seems like a reasonable solution.

I'd kind of like to use some large hard-coded limit, even if it looks more like 50MB; but there's a good case to be made for a configurable one, and if everyone is thinking this way, then I'll go with that instead (we still need to answer the question of what a good default setting for it would be, though). The only reason I'm hoping to keep it hard-coded is that I wanted to get rid of the option when we actually do switch to a not-memory-tied model; but then, perhaps we will want this option for efficiency reasons as well as memory reasons...