Re: text/html assumptions, and slurping huge files

2007-08-01 Thread Hrvoje Niksic
Micah Cowan writes: I agree that it's probably a good idea to move HTML parsing to a model that doesn't require slurping everything into memory; Note that Wget mmaps the file whenever possible, so it's not actually allocated on the heap (slurped). You need some memory to

Re: text/html assumptions, and slurping huge files

2007-08-01 Thread Micah Cowan
Hrvoje Niksic wrote: Micah Cowan writes: I agree that it's probably a good idea to move HTML parsing to a model that doesn't require slurping everything into memory; Note that Wget mmaps the file whenever possible, so it's

Re: text/html assumptions, and slurping huge files

2007-08-01 Thread Hrvoje Niksic
Micah Cowan writes: Yes, but when mmap()ping with MAP_PRIVATE, once you actually start _using_ the mapped space, is there much of a difference? As long as you don't write to the mapped region, there should be no difference between shared and private mapped space -- that's
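To make the point about read-only mappings concrete, here is a minimal sketch, assuming a POSIX system. It maps a file with PROT_READ and MAP_PRIVATE and scans it with memchr() for a caller-chosen byte. The helper name count_byte_mapped is hypothetical and is not taken from Wget's source; it only illustrates that nothing is ever written to the mapping, so the private/shared distinction should not come into play.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper (not from Wget): count occurrences of byte `c`
   in the file at `path` through a read-only private mapping.
   Returns the count, or -1 on error. */
long count_byte_mapped(const char *path, int c)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {      /* mmap() rejects zero-length mappings */
        close(fd);
        return 0;
    }

    size_t len = (size_t) st.st_size;

    /* PROT_READ only: the pages are never written, so no private
       copies of them are ever created. */
    char *base = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                  /* the mapping stays valid after close() */
    if (base == MAP_FAILED)
        return -1;

    long count = 0;
    const char *p = base;
    const char *end = base + len;
    while (p < end && (p = memchr(p, c, (size_t) (end - p))) != NULL) {
        count++;
        p++;                    /* continue just past the match */
    }

    munmap(base, len);
    return count;
}
```

Scanning does fault pages in as they are touched, but they remain clean, file-backed pages that the kernel is free to drop again, which is different from holding the whole file in heap memory.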

Re: text/html assumptions, and slurping huge files

2007-08-01 Thread Micah Cowan
Hrvoje Niksic wrote: mmap() isn't failing; but wget's memory space gets huge through the simple use of memchr() (on '<', for instance) on the mapped address space. Wget's virtual memory footprint does get huge, but the resident memory needn't. Sorry, I should've been clearer: specifically,

Re: text/html assumptions, and slurping huge files

2007-08-01 Thread Hrvoje Niksic
Micah Cowan writes: Actually, I was wrong though: sometimes mmap() _is_ failing for me (did just now), which of course means that everything is in resident memory. I don't understand why mmapping a regular file would fail on Linux. What error code are you getting? (Wget tries
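For anyone wanting to see which error code a failing mmap() reports, a rough sketch of the try-mmap-then-fall-back pattern follows, assuming POSIX. The names file_contents, load_file and free_file are made up for illustration and are not Wget's actual functions or types. When the mapping fails, the sketch prints errno and reads the whole file into a heap buffer instead, which is the case where everything really does end up in resident memory.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical type for this sketch; not Wget's own structure. */
struct file_contents {
    char *data;
    size_t length;
    int mapped;         /* 1: data came from mmap(); 0: from malloc() */
};

/* Try to mmap() the file; on failure report errno and fall back to
   reading it into the heap. Returns 0 on success, -1 on error. */
int load_file(const char *path, struct file_contents *out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    size_t len = (size_t) st.st_size;

    if (len > 0) {
        void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            close(fd);
            out->data = p;
            out->length = len;
            out->mapped = 1;
            return 0;
        }
        int err = errno;        /* save before any other call changes it */
        fprintf(stderr, "mmap(%s) failed: %s (errno %d)\n",
                path, strerror(err), err);
    }

    /* Fallback: the whole file is copied into heap memory. */
    char *buf = malloc(len ? len : 1);
    if (buf == NULL) {
        close(fd);
        return -1;
    }
    size_t got = 0;
    while (got < len) {
        ssize_t n = read(fd, buf + got, len - got);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            free(buf);
            close(fd);
            return -1;
        }
        if (n == 0)             /* file shrank while reading */
            break;
        got += (size_t) n;
    }
    close(fd);
    out->data = buf;
    out->length = got;
    out->mapped = 0;
    return 0;
}

/* Release whatever load_file() handed out. */
void free_file(struct file_contents *fc)
{
    if (fc->mapped)
        munmap(fc->data, fc->length);
    else
        free(fc->data);
    fc->data = NULL;
    fc->length = 0;
}
```

The mapped flag lets the caller tell afterwards which path was taken, which is the piece of information being asked for in this exchange.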

Re: text/html assumptions, and slurping huge files

2007-08-01 Thread Micah Cowan
Hrvoje Niksic wrote: Micah Cowan writes: Actually, I was wrong though: sometimes mmap() _is_ failing for me (did just now), which of course means that everything is in resident memory. I don't understand why mmapping a

Re: text/html assumptions, and slurping huge files

2007-07-31 Thread Micah Cowan
Micah Cowan wrote: A bug report made to Savannah (https://savannah.gnu.org/bugs/index.php?20496) detailed an example where wget would download a recursive fetch normally, but then when run again (with -c), it would eat up vast (_vast_) amounts

Re: text/html assumptions, and slurping huge files

2007-07-31 Thread Matthias Vill
Hi List, Micah Cowan wrote: Micah Cowan wrote: I'm expecting that, when a file of such size or greater is encountered, it would simply be left alone and not parsed, rather than read up to the limit, and parse up to that point, but if anyone would like to argue for the latter behavior, I'm

Re: text/html assumptions, and slurping huge files

2007-07-31 Thread Micah Cowan
Matthias Vill wrote: I just converted some project and a log-html was created with 6MB in size and I agree with you that this is a rare case and opening this file with a browser is no fun, but still I don't like hardcoded sizes. Maybe there will