Re: text/html assumptions, and slurping huge files

2007-08-01 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 I agree that it's probably a good idea to move HTML parsing to a model
 that doesn't require slurping everything into memory;

Note that Wget mmaps the file whenever possible, so it's not actually
allocated on the heap (slurped).  You need some memory to store the
URLs found in the file, but that's not really avoidable.  I agree that
it would be better to completely avoid the memory-based model, as it
would allow links to be extracted on-the-fly, without saving the file
at all.  It would be an interesting exercise to write or integrate a
parser that works like that.
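
For the curious, the model is roughly the following -- a simplified
sketch, not Wget's actual code; the function and struct names are made
up: map the file read-only if we can, otherwise slurp it into a heap
buffer.

#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct file_buf {
  char *data;
  size_t size;
  int mapped;               /* nonzero if data came from mmap */
};

/* Map PATH read-only if possible; otherwise read it into a heap
   buffer.  Returns 0 on success, -1 on failure. */
int
load_for_parsing (const char *path, struct file_buf *fb)
{
  struct stat st;
  int fd = open (path, O_RDONLY);
  if (fd < 0 || fstat (fd, &st) < 0)
    {
      if (fd >= 0) close (fd);
      return -1;
    }
  fb->size = st.st_size;

  void *map = mmap (NULL, fb->size, PROT_READ, MAP_PRIVATE, fd, 0);
  if (map != MAP_FAILED)
    {
      fb->data = map;
      fb->mapped = 1;
      close (fd);
      return 0;
    }

  /* Fallback: slurp the file into malloc'd memory. */
  fb->data = malloc (fb->size ? fb->size : 1);
  if (!fb->data)
    { close (fd); return -1; }
  size_t got = 0;
  while (got < fb->size)
    {
      ssize_t n = read (fd, fb->data + got, fb->size - got);
      if (n <= 0)
        { free (fb->data); close (fd); return -1; }
      got += n;
    }
  fb->mapped = 0;
  close (fd);
  return 0;
}

Cleanup is then munmap (fb->data, fb->size) or free (fb->data),
depending on fb->mapped.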

Regarding limits to file size, I don't think they are a good idea.
Whichever limit one chooses, someone will find a valid use case broken
by the limit.  Even an arbitrary limit I thought entirely reasonable,
such as the maximum redirection count, recently turned out to be
broken by design.  In this case it might make sense to investigate
exactly where and why the HTML parser spends the memory; perhaps the
parser saw something it thought was valid HTML and tried to extract a
huge link from it?  Maybe the parser simply needs to be taught to
perform sanity checks on URLs it encounters.


Re: text/html assumptions, and slurping huge files

2007-08-01 Thread Micah Cowan

Hrvoje Niksic wrote:
 Micah Cowan [EMAIL PROTECTED] writes:
 
 I agree that it's probably a good idea to move HTML parsing to a model
 that doesn't require slurping everything into memory;
 
 Note that Wget mmaps the file whenever possible, so it's not actually
 allocated on the heap (slurped).  You need some memory to store the
 URLs found in the file, but that's not really avoidable.  I agree that
 it would be better to completely avoid the memory-based model, as it
 would allow links to be extracted on-the-fly, without saving the file
 at all.  It would be an interesting exercise to write or integrate a
 parser that works like that.

Yes, but when mmap()ping with MAP_PRIVATE, once you actually start
_using_ the mapped space, is there much of a difference? (I'm not
certain MAP_SHARED would improve the situation, though it might be worth
checking.) Also, if mmap() fails (say, with ENOMEM), it falls back to
good old realloc() loops (though it should probably seed those with
the file's size, rather than just starting with a hard-coded value and
resizing until it's right).
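
To illustrate what I mean by seeding -- a rough sketch, with a made-up
helper name, of starting from fstat()'s idea of the size instead of a
hard-coded constant:

#include <stddef.h>
#include <sys/stat.h>

/* Made-up helper: pick the fallback buffer's starting size from
   fstat() rather than a hard-coded constant, so the realloc() loop
   normally never has to grow it. */
static size_t
initial_buffer_size (int fd)
{
  struct stat st;
  if (fstat (fd, &st) == 0 && st.st_size > 0)
    return (size_t) st.st_size;
  return 8192;          /* arbitrary seed when the size is unknown */
}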

mmap() isn't failing; but wget's memory space gets huge through the
simple use of memchr() (on '<', for instance) on the mapped address space.

 Regarding limits to file size, I don't think they are a good idea.
 Whichever limit one chooses, someone will find a valid use case broken
 by the limit.  Even an arbitrary limit I thought entirely reasonable,
 such as the maximum redirection count, recently turned out to be
 broken by design.

Well, that may be too harsh. I think a depth limit of 20 was more than
appropriate; I'm not sure, but I suspect that several interactive user
agents also have redirection limits, and with much lower values.
Arguably, my response to the situation that led to making that value
configurable could reasonably have been "you're Doing The Wrong Thing";
but at any rate, a configurable redirection limit seemed potentially
useful, so the change was made.

But you're right: at least, an arbitrary, hard-coded limit is going to be
a mistake. Your arguments are less strong against a configurable limit,
though.

Still, perhaps a better way to approach this would be to use some sort
of heuristic to determine whether the file looks to be HTML. Doing this
reliably without breaking real HTML files will be something of a
challenge, but perhaps requiring that we find something that looks like
a familiar HTML tag within the first 1k or so would be appropriate. We
can't expect well-formed HTML, of course, so even requiring an <html>
tag is not reasonable: but finding any tag whatsoever would be something
to start with.
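
Something roughly like the following is what I have in mind -- a sketch
only, with an arbitrary 1k threshold and a made-up function name:

#include <ctype.h>
#include <string.h>

/* Sketch of the heuristic: does the first 1k of BUF contain something
   that looks like the start of a tag, i.e. '<' followed by a letter,
   '/' or '!'?  The threshold and the name are arbitrary. */
static int
looks_like_html (const char *buf, size_t len)
{
  const char *p = buf;
  const char *end = buf + (len < 1024 ? len : 1024);

  while ((p = memchr (p, '<', end - p)) != NULL)
    {
      if (p + 1 < end
          && (isalpha ((unsigned char) p[1]) || p[1] == '/' || p[1] == '!'))
        return 1;
      ++p;
    }
  return 0;
}

A non-HTML file that merely happens to contain an early '<' would still
slip through, of course, but it would be something to start with.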

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/


Re: text/html assumptions, and slurping huge files

2007-08-01 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 Yes, but when mmap()ping with MAP_PRIVATE, once you actually start
 _using_ the mapped space, is there much of a difference?

As long as you don't write to the mapped region, there should be no
difference between shared and private mapped space -- that's what copy
on write (explicitly documented for MAP_PRIVATE in both Linux and
Solaris mmap man pages) is supposed to accomplish.  I could have used
MAP_SHARED, but at the time I believe there was still code that relied
on being able to write to the buffer.  That code was subsequently
removed, but MAP_PRIVATE stayed because I saw no point in removing it.
Given the semantics of copy on write, I figured there would be no
difference between MAP_SHARED and unwritten-to MAP_PRIVATE.

As for the memory footprint getting large, sure, Wget reads through it
all, but that is no different from what, say, grep --mmap does.  As
long as we don't jump backwards in the file, the OS can swap out the
unused parts.  Another difference between mmap and malloc is that
mmap'ed space can be reliably returned to the system.  Using mmap
pretty much guarantees that Wget's footprint won't increase to 1GB
unless you're actually reading a 1GB file, and even then much less
will be resident.
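
(Not something Wget does -- just to make the sequential-access
assumption explicit, something like the following could be used; the
function name is made up:)

#include <stddef.h>
#include <sys/mman.h>

/* Illustration only, not Wget code: map read-only, tell the kernel we
   will scan sequentially (so it can read ahead and reclaim the pages
   behind us), and note that munmap() hands the whole region back to
   the system at once, which free() does not guarantee for heap memory. */
static void *
map_for_sequential_scan (int fd, size_t size)
{
  void *map = mmap (NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
  if (map == MAP_FAILED)
    return NULL;
  madvise (map, size, MADV_SEQUENTIAL);   /* advisory; errors can be ignored */
  return map;
}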

 mmap() isn't failing; but wget's memory space gets huge through the
 simple use of memchr() (on '<', for instance) on the mapped address
 space.

Wget's virtual memory footprint does get huge, but the resident memory
needn't.  memchr only accesses memory sequentially, so the above swap
out scenario applies.  More importantly, in this case the report
documents failing to allocate -2147483648 bytes, which is a malloc
or realloc error, completely unrelated to mapped files.
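
As an aside, -2147483648 is exactly INT_MIN, so one guess -- and it is
only a guess -- is that a signed 32-bit size somewhere doubled its way
past 2GB.  A hypothetical illustration, not Wget code:

#include <limits.h>
#include <stdio.h>

/* Hypothetical illustration: doubling a 1GB size held in a 32-bit
   signed int wraps around to INT_MIN, i.e. -2147483648 -- the same
   figure as in the report.  (Signed overflow is formally undefined in
   C; the unsigned arithmetic below just shows the two's-complement
   wraparound explicitly.) */
int
main (void)
{
  int size = 1 << 30;                            /* 1GB */
  int doubled = (int) ((unsigned int) size * 2u);
  printf ("doubled: %d, INT_MIN: %d\n", doubled, INT_MIN);
  return 0;
}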

 Still, perhaps a better way to approach this would be to use some
 sort of heuristic to determine whether the file looks to be
 HTML. Doing this reliably without breaking real HTML files will be
 something of a challenge, but perhaps requiring that we find
 something that looks like a familiar HTML tag within the first 1k or
 so would be appropriate. We can't expect well-formed HTML, of
 course, so even requiring an <html> tag is not reasonable: but
 finding any tag whatsoever would be something to start with.

I agree in principle, but I'd still like to know exactly what went
wrong in the reported case.  I suspect it's not just a case of
mmapping a huge file, but a case of misparsing it, for example by
attempting to extract a URL hundreds of megabytes long.


Re: text/html assumptions, and slurping huge files

2007-08-01 Thread Micah Cowan
Hrvoje Niksic wrote:

 mmap() isn't failing; but wget's memory space gets huge through the
 simple use of memchr() (on '<', for instance) on the mapped address
 space.
 
 Wget's virtual memory footprint does get huge, but the resident memory
 needn't.

Sorry, I should've been clearer: specifically, the resident memory grows
enormously. It seems, though, that if I suspend the process, the memory
can creep back down while it's not being used.

I haven't reproduced the actual out of memory part of the bug report;
and perhaps the resident memory thing I was seeing was some sort of
temporary caching thing. I really don't know nearly enough about Unix or
GNU/Linux memory models to know. However, if I let it just run, it
creeps up to 1GB of resident memory for the 1GB file (I've no idea how
it would behave on a system with less memory/swap), all within a single
memchr() (I suspect the OP didn't have just a single memchr(): my
simulation uses a file whose contents were copied from /dev/zero). If I
stop it for a while in gdb while I fish around at things, it seems to
creep down slowly, and doesn't reach that full 1GB before freeing the
address space (which instantly causes the resident memory to drop
drastically).

Actually, I was wrong though: sometimes mmap() _is_ failing for me (did
just now), which of course means that everything is in resident memory.
So we've probably been chasing a red herring.

 memchr only accesses memory sequentially, so the above swap
 out scenario applies.  More importantly, in this case the report
 documents failing to allocate -2147483648 bytes, which is a malloc
 or realloc error, completely unrelated to mapped files.

Good point, and this is consistent with mmap() failure. Your comment
about memchr() and sequential access is consistent with my observations
about memory dropping while idle. Though I'm surprised it keeps so
much resident in the first place.

 Still, perhaps a better way to approach this would be to use some
 sort of heuristic to determine whether the file looks to be
 HTML. Doing this reliably without breaking real HTML files will be
 something of a challenge, but perhaps requiring that we find
 something that looks like a familiar HTML tag within the first 1k or
 so would be appropriate. We can't expect well-formed HTML, of
 course, so even requiring an <html> tag is not reasonable: but
 finding any tag whatsoever would be something to start with.
 
 I agree in principle, but I'd still like to know exactly what went
 wrong in the reported case.  I suspect it's not just a case of
 mmapping a huge file, but a case of misparsing it, for example by
 attempting to extract a URL hundreds of megabytes long.

In all the debug sessions I've been in, it never even gets that far.
When mmap() succeeds, it does of course get into the beginning of
parsing, but fails to find its '<' (since it's all zeroes), and exits
pretty quickly. I suspect there are only really issues when mmap() fails
and wget falls back to malloc() and friends.

-- 
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/


Re: text/html assumptions, and slurping huge files

2007-08-01 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 Actually, I was wrong though: sometimes mmap() _is_ failing for me
 (did just now), which of course means that everything is in resident
 memory.

I don't understand why mmapping a regular file would fail on Linux.  What
error code are you getting?

(Wget tries to handle mmap failing gracefully because the GNU coding
standards require it, and also because we support mmap-unaware
operating systems anyway.)


Re: text/html assumptions, and slurping huge files

2007-08-01 Thread Micah Cowan

Hrvoje Niksic wrote:
 Micah Cowan [EMAIL PROTECTED] writes:
 
 Actually, I was wrong though: sometimes mmap() _is_ failing for me
 (did just now), which of course means that everything is in resident
 memory.
 
 I don't understand why mmapping a regular file would fail on Linux.  What
 error code are you getting?

ENOMEM, based on my recollection of the strerror message; I'm not
currently writing from the machine from which I produced this failure,
and haven't reproduced it here. I had been running a version with a
debug_logprintf() to complain about mapping failures; I believe I may
check that code in, as it could be useful information.

Which is a bit odd, since I didn't get that the first time or two I ran
it. I suspect there may have been a bit of kernel oddness or something.

In any case, though, we should assume mmap() could fail. It's certainly
possible for a file to be large enough that it cannot be mapped into memory.
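
For reference, the kind of diagnostic I mentioned above looks roughly
like this -- a sketch, not the actual patch; a plain fprintf stands in
for Wget's debug_logprintf() so the snippet compiles on its own:

#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Sketch only: say why the mapping attempt failed before falling back
   to the read()/realloc() path. */
static void *
try_map (int fd, size_t size, const char *filename)
{
  void *map = mmap (NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
  if (map == MAP_FAILED)
    {
      fprintf (stderr, "mmap(%s) failed: %s; falling back to read()\n",
               filename, strerror (errno));
      return NULL;      /* caller falls back to the read()/realloc() path */
    }
  return map;
}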

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/



Re: text/html assumptions, and slurping huge files

2007-07-31 Thread Micah Cowan

Micah Cowan wrote:
 A bug report made to Savannah
 (https://savannah.gnu.org/bugs/index.php?20496) detailed an example
 where wget would download a recursive fetch normally, but then when run
 again (with -c), it would eat up vast (_vast_) amounts of memory, until
 finally it would give up due to running out of memory. Some of the files
 involved were 1GB video files, and may have been updated with slightly
 smaller versions between calls to wget.

I've now split this bug report apart, to deal with the separate issues I
mentioned in my initial post for this thread. See the original bug
report (linked above) for details.

Right now, I think the most important thing to determine is: what would
be an appropriate size limit for HTML files to be read in for parsing?
I'd rather err on the side of permissiveness than restrictiveness: I
don't want to risk breaking wget for real-world,
large HTML files. Is 15MB a decent value?

I'm expecting that, when a file of such size or greater is encountered,
it would simply be left alone and not parsed, rather than read up to the
limit and parsed up to that point; but if anyone would like to argue for
the latter behavior, I'm listening. The main reason I think it's not
worthwhile is that such files seem less likely to actually be HTML files.
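
Concretely, I'm picturing nothing more elaborate than a size check
before the file is handed to the parser -- the names below are
placeholders, not anything that exists in the code yet:

#include <sys/stat.h>

/* Placeholder names throughout.  Files at or above the limit are
   simply not handed to the HTML parser, rather than parsed up to the
   cutoff. */
#define DEFAULT_MAX_HTML_SIZE (15 * 1024 * 1024)   /* the 15MB under discussion */

static int
should_parse_html (const char *path, long long max_html_size)
{
  struct stat st;
  if (stat (path, &st) != 0)
    return 0;                           /* cannot stat; skip parsing */
  return st.st_size < max_html_size;
}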

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/



Re: text/html assumptions, and slurping huge files

2007-07-31 Thread Matthias Vill
Hi List,

Micah Cowan wrote:
 Micah Cowan wrote:
 I'm expecting that, when a file of such size or greater is encountered,
 it would simply be left alone and not parsed, rather than read up to the
 limit, and parse up to that point, but if anyone would like to argue for
 the latter behavior, I'm listening. The main reason I think it's not
 worthwhile is that such files seem less likely to actually be HTML files.
 

I just converted a project and it produced an HTML log file 6MB in
size. I agree with you that this is a rare case, and opening such a file
in a browser is no fun, but I still don't like hard-coded sizes.
Maybe there will be an all-on-one-page manual for some software that
exceeds this value, or someone will have a single-page picture-database
export.

To me it seems cleaner to provide some --parse-limit=xxBytes option and
to implement the parser in a way that it can't exceed that memory.

I also suspect that loading the whole 15MB of a maximum-sized HTML file
at once looks ugly in memory and can lead to problems when you have
multiple wget processes running at once.

Maybe wget should be optimized for HTML files with a maximum size of
4MB, and parse in chunks beyond that.

Greetings

Matthias Vill


Re: text/html assumptions, and slurping huge files

2007-07-31 Thread Micah Cowan

Matthias Vill wrote:
 I just converted a project and it produced an HTML log file 6MB in
 size. I agree with you that this is a rare case, and opening such a file
 in a browser is no fun, but I still don't like hard-coded sizes.
 Maybe there will be an all-on-one-page manual for some software that
 exceeds this value, or someone will have a single-page picture-database
 export.

 To me it seems cleaner to provide some --parse-limit=xxBytes option and
 to implement the parser in a way that it can't exceed that memory.

 I also suspect that loading the whole 15MB of a maximum-sized HTML file
 at once looks ugly in memory and can lead to problems when you have
 multiple wget processes running at once.

 Maybe wget should be optimized for HTML files with a maximum size of
 4MB, and parse in chunks beyond that.

I agree that it's probably a good idea to move HTML parsing to a model
that doesn't require slurping everything into memory; but in the
meantime I'd like to put some sort of stop-gap solution in place, and
limiting the maximum size seems like a reasonable solution.

I'd kind of like to use a large hard-coded limit, even if it looks
more like 50MB; but there's a good case to be made for a configurable
one, and if everyone is thinking this way, then I'll go with that
instead (we still need to answer the question of what a good default
setting for it would be, though). The only reason I'm hoping to keep it
hard-coded is that I wanted to get rid of this option when we actually
do switch to a non-memory-tied model; but then, perhaps we will want
this option for efficiency reasons as well as memory reasons...

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
