Hrvoje Niksic wrote:
> Micah Cowan <[EMAIL PROTECTED]> writes:
>
>> I agree that it's probably a good idea to move HTML parsing to a
>> model that doesn't require slurping everything into memory;
>
> Note that Wget mmaps the file whenever possible, so it's not actually
> allocated on the heap (slurped). You need some memory to store the
> URLs found in the file, but that's not really avoidable. I agree that
> it would be better to completely avoid the memory-based model, as it
> would allow links to be extracted on-the-fly, without saving the file
> at all. It would be an interesting exercise to write or integrate a
> parser that works like that.
Yes, but when mmap()ing with MAP_PRIVATE, once you actually start _using_ the mapped space, is there much of a difference? (I'm not certain MAP_SHARED would improve the situation, though it might be worth checking.) Also, if mmap() fails (say, with ENOMEM), it falls back to good old realloc() loops (though it should probably be seeding those with the file size, rather than starting from a hard-coded value and resizing until it's right; a rough sketch of what I mean is at the end of this message). In practice mmap() isn't failing; but wget's memory space gets huge through the simple use of memchr() (on '<', for instance) on the mapped address space.

> Regarding limits to file size, I don't think they are a good idea.
> Whichever limit one chooses, someone will find a valid use case
> broken by the limit. Even an arbitrary limit I thought entirely
> reasonable, such as the maximum redirection count, recently turned
> out to be broken by design.

Well, that may be too harsh. I think a depth limit of 20 was more than appropriate; I'm not sure, but I suspect that several interactive user agents also have redirection limits, with much lower values. Arguably, my response to the situation that led to making that value configurable could reasonably have been "you're Doing The Wrong Thing"; but at any rate, a configurable redirection limit seemed potentially useful, so the change was made. But you're right: at the least, an arbitrary, hard-coded limit is going to be a mistake. Your arguments are less strong against a configurable limit, though.

Still, perhaps a better way to approach this would be to use some sort of heuristic to determine whether the file looks to be HTML. Doing this reliably without breaking real HTML files will be something of a challenge, but perhaps requiring that we find something that looks like a familiar HTML tag within the first 1k or so would be appropriate. We can't expect well-formed HTML, of course, so even requiring an <HTML> tag is not reasonable; but finding any tag whatsoever would be something to start with (see the sketch below).
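Something along these lines is the sort of thing I have in mind -- purely a sketch, not actual wget code; the function name and the 1k window are made up:

    /* Rough "does this look like HTML?" check: anything resembling a
       tag -- '<' followed by a letter, '!' or '/' -- within the first
       1k of the buffer.  Illustrative only.  */
    #include <ctype.h>
    #include <stddef.h>

    static int
    looks_like_html (const char *buf, size_t len)
    {
      size_t window = len < 1024 ? len : 1024;
      size_t i;

      for (i = 0; i + 1 < window; i++)
        if (buf[i] == '<'
            && (isalpha ((unsigned char) buf[i + 1])
                || buf[i + 1] == '!' || buf[i + 1] == '/'))
          return 1;

      return 0;
    }

That will obviously false-positive on some non-HTML text (anything with a '<' followed by a letter), so matching against a short list of familiar tag names might be the next refinement.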
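And back on the earlier point: by "seeding the realloc() loop with the file size" I mean roughly the following (again just a sketch, with invented names and minimal error handling -- not what's actually in the source):

    #include <stdlib.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static char *
    read_whole_file (int fd, size_t *len)
    {
      struct stat st;
      size_t alloc, used = 0;
      ssize_t n;
      char *buf, *tmp;

      /* Start from the size fstat() reports, rather than a hard-coded
         initial allocation.  */
      if (fstat (fd, &st) < 0 || st.st_size <= 0)
        alloc = 512;
      else
        alloc = (size_t) st.st_size + 1;

      buf = malloc (alloc);
      if (!buf)
        return NULL;

      while ((n = read (fd, buf + used, alloc - used)) > 0)
        {
          used += n;
          if (used == alloc)
            {
              /* Only reached if the file turned out bigger than
                 fstat() said (pipe, growing file, ...).  */
              tmp = realloc (buf, alloc *= 2);
              if (!tmp)
                {
                  free (buf);
                  return NULL;
                }
              buf = tmp;
            }
        }

      *len = used;
      return buf;
    }

The doubling only kicks in when fstat() undershoots, so the common case is a single allocation of the right size.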
--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/