Re: [PATCH] Fix in Git.pm cat_blob crashes on large files
running git svn fetch on a remote repository (yes, I know there are a lot of
possible outside variables, including network latency):

Code with 1024 reads and 64k writes:
real	75m19.906s
user	16m43.919s
sys	29m16.326s

Code with 1024 reads and 1024 writes:
real	71m21.006s
user	12m36.275s
sys	24m26.112s

...so the simpler code wins the trivial test. I would say go with it.
Should I resubmit?

On Thu, Feb 21, 2013 at 3:24 PM, Jeff King <p...@peff.net> wrote:
> On Thu, Feb 21, 2013 at 03:18:40PM -0800, Joshua Clayton wrote:
>>> By having the read and flush size be the same, it's much simpler.
>>
>> My original bugfix did just read 1024, and write 1024. That works fine
>> and, yes, is simpler. I changed it to be more similar to the original
>> code in case there were performance reasons for doing it that way.
>> That was the only reason I could think of for the design, and adding
>> the $flushSize variable means that some motivated person could easily
>> optimize it. So far I have been too lazy to profile the two versions.
>> I guess I'll try a trivial git svn init; git svn fetch and check back
>> in. It's running now.
>
> I doubt it will make much of a difference; we are already writing to a
> perl filehandle, so it will be buffered there (which I assume is 4K,
> but I haven't checked). And your version retains the 1024-byte read. I
> do think 1024 is quite low for this sort of descriptor-to-descriptor
> copying. I'd be tempted to just bump that 1024 to 64K.
>
>> In git svn fetch (which is how I discovered it) the file being passed
>> to cat_blob is a temporary file, which is checksummed before putting
>> it into place.
>
> Great. There may be other callers outside of our tree, of course, but I
> think it's pretty clear that the responsibility is on the caller to
> make sure the function succeeded. We are changing what gets put on the
> output stream for various error conditions, but ultimately that is an
> implementation detail that the caller should not be depending on.
-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Fix in Git.pm cat_blob crashes on large files
On Fri, Feb 22, 2013 at 07:11:54AM -0800, Joshua Clayton wrote:
> running git svn fetch on a remote repository (yes, I know there are a
> lot of possible outside variables, including network latency):
>
> Code with 1024 reads and 64k writes:
> real	75m19.906s
> user	16m43.919s
> sys	29m16.326s
>
> Code with 1024 reads and 1024 writes:
> real	71m21.006s
> user	12m36.275s
> sys	24m26.112s
>
> ...so the simpler code wins the trivial test.

Interesting; I'd have expected no change or a slight win for your
version, which makes me wonder if the outside variables are dominating.
I wonder what 64K/64K would look like.

> I would say go with it. Should I resubmit?

Yes, please.

-Peff
Re: [PATCH] Fix in Git.pm cat_blob crashes on large files (reviewed by Jeff King)
On Fri, Feb 22, 2013 at 6:03 PM, Joshua Clayton <stillcompil...@gmail.com> wrote:
> Read and write each 1024 byte buffer, rather than trying to buffer the
> entire content of the file.
>
> Previous code would crash on all files > 2 GiB, when the offset
> variable became negative (perhaps below the level of perl), resulting
> in a crash. On a 32 bit system, or a system with low memory it might
> crash before reaching 2 GiB due to memory exhaustion.
>
> Signed-off-by: Joshua Clayton <stillcompil...@gmail.com>

We usually put "Reviewed-by: Name <email>" right below the SoB line,
rather than in the subject. And then we CC the people who were involved
in the previous round :)
Re: [PATCH] Fix in Git.pm cat_blob crashes on large files (resubmit with reviewed-by)
On Fri, Feb 22, 2013 at 09:30:57AM -0800, Joshua Clayton wrote:
> Read and write each 1024 byte buffer, rather than trying to buffer the
> entire content of the file.

OK. Did you ever repeat your timing with a larger symmetric buffer? That
should probably be a separate patch on top, but it might be worth doing
while we are thinking about it.

> Previous code would crash on all files > 2 GiB, when the offset
> variable became negative (perhaps below the level of perl), resulting
> in a crash.

I'm still slightly dubious of this, just because it doesn't match my
knowledge of perl (which is admittedly imperfect). I'm curious how you
diagnosed it?

> On a 32 bit system, or a system with low memory it might crash before
> reaching 2 GiB due to memory exhaustion.
>
> Signed-off-by: Joshua Clayton <stillcompil...@gmail.com>
> Reviewed-by: Jeff King <p...@peff.net>

The commit message is a good place to mention any side effects, and why
they are not a problem. Something like:

  The previous code buffered the whole blob before writing, so any
  error reading from cat-file would result in zero bytes being written
  to the output stream. After this change, the output may be left in a
  partially written state (or even fully written, if we fail when
  parsing the final newline from cat-file). However, it's not
  reasonable for callers to expect anything about the state of the
  output when we return an error (after all, even with full buffering,
  we might fail during the writing process). So any caller which cares
  about this is broken already, and we do not have to worry about them.

> ---
>  perl/Git.pm | 12 +++++-------
>  1 file changed, 5 insertions(+), 7 deletions(-)

The patch itself looks fine to me.

-Peff
Re: [PATCH] Fix in Git.pm cat_blob crashes on large files (resubmit with reviewed-by)
On Fri, Feb 22, 2013 at 09:30:57AM -0800, Joshua Clayton wrote:
> Subject: Re: [PATCH] Fix in Git.pm cat_blob crashes on large files
>  (resubmit with reviewed-by)

One thing I forgot to note in my other response: the subject is part of
the commit message, so information like "resubmit with..." should go
with other cover letter material after the "---".

It's also customary to say something like [PATCHv2] in the brackets
(which does get stripped off), and mention what has changed since v1
after the "---". That can help reviewers remember what happened, and it
can help the maintainer understand where in the process of review the
patch is.

-Peff
Re: [PATCH] Fix in Git.pm cat_blob crashes on large files (resubmit with reviewed-by)
Joshua Clayton <stillcompil...@gmail.com> writes:

> Read and write each 1024 byte buffer, rather than trying to buffer the
> entire content of the file.
>
> Previous code would crash on all files > 2 GiB, when the offset
> variable became negative (perhaps below the level of perl), resulting
> in a crash. On a 32 bit system, or a system with low memory it might
> crash before reaching 2 GiB due to memory exhaustion.
>
> Signed-off-by: Joshua Clayton <stillcompil...@gmail.com>
> Reviewed-by: Jeff King <p...@peff.net>
> ---

Thanks.

> Subject: Re: [PATCH] Fix in Git.pm cat_blob crashes on large files
>  (resubmit with reviewed-by)

Please drop the "(resubmit with reviewed-by)" part. A rule of thumb is
to make "git show" output understandable by people who read it 6 months
from now. They do not care if the commit is a re-submission.

It seems that this issue has been with us since the very beginning of
this sub, which was introduced at 7182530d8cad (Git.pm: Add
hash_and_insert_object and cat_blob, 2008-05-23).

>  perl/Git.pm | 12 +++++-------
>  1 file changed, 5 insertions(+), 7 deletions(-)
>
> diff --git a/perl/Git.pm b/perl/Git.pm
> index 931047c..cc91288 100644
> --- a/perl/Git.pm
> +++ b/perl/Git.pm
> @@ -949,13 +949,16 @@ sub cat_blob {
>  		last unless $bytesLeft;
>  
>  		my $bytesToRead = $bytesLeft < 1024 ? $bytesLeft : 1024;
> -		my $read = read($in, $blob, $bytesToRead, $bytesRead);
> +		my $read = read($in, $blob, $bytesToRead);
>  		unless (defined($read)) {
>  			$self->_close_cat_blob();
>  			throw Error::Simple("in pipe went bad");
>  		}
>  
> -		$bytesRead += $read;
> +		unless (print $fh $blob) {
> +			$self->_close_cat_blob();
> +			throw Error::Simple("couldn't write to passed in filehandle");
> +		}

Corrupt patch, line-wrapped by your MUA.

I wonder if we still need the $bytesRead variable. You have $size that
is the size of the whole blob, so

	my $bytesLeft = $size;
	while (1) {
		my $bytesToRead = $bytesLeft < 1024 ? $bytesLeft : 1024;
		my $bytesRead = read($in, $blob, $bytesToRead);
		... check errors and use the $blob ...
		$bytesLeft -= $bytesRead;
	}

may be simpler and easier to read, no?
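Junio's sketch above can be fleshed out into a small runnable form. This is a hedged illustration, not the code that landed in Git.pm: plain die stands in for Error::Simple/throw, in-memory filehandles stand in for the cat-file pipe and the output stream, and copy_chunked is a name invented here.

```perl
use strict;
use warnings;

# Sketch of the simplified loop: track only $bytesLeft, and use the
# 3-arg read() so each chunk overwrites $blob from offset 0 -- no
# growing offset, no buffer holding the whole blob.
sub copy_chunked {
	my ($in, $fh, $size) = @_;
	my $bytesLeft = $size;
	while ($bytesLeft > 0) {
		my $bytesToRead = $bytesLeft < 1024 ? $bytesLeft : 1024;
		my $bytesRead = read($in, my $blob, $bytesToRead);
		die "in pipe went bad"
			unless defined $bytesRead && $bytesRead > 0;
		print $fh $blob
			or die "couldn't write to passed in filehandle";
		$bytesLeft -= $bytesRead;
	}
	return $size - $bytesLeft;
}

# Exercise it with in-memory filehandles standing in for the real pipes.
my $data = "x" x 5000;
my $copy;
open(my $in,  "<", \$data) or die $!;
open(my $out, ">", \$copy) or die $!;
copy_chunked($in, $out, length $data);
close($out);
print length($copy), " bytes copied\n";
```

Because the buffer is at most 1024 bytes at any moment, memory use is constant regardless of blob size, which is the whole point of the fix.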
Re: [PATCH] Fix in Git.pm cat_blob crashes on large files (resubmit with reviewed-by)
Thanks for all the input and patience.

On Fri, Feb 22, 2013 at 10:34 AM, Jeff King <p...@peff.net> wrote:
> On Fri, Feb 22, 2013 at 09:30:57AM -0800, Joshua Clayton wrote:
>> Read and write each 1024 byte buffer, rather than trying to buffer the
>> entire content of the file.
>
> OK. Did you ever repeat your timing with a larger symmetric buffer?
> That should probably be a separate patch on top, but it might be worth
> doing while we are thinking about it.
>
>> Previous code would crash on all files > 2 GiB, when the offset
>> variable became negative (perhaps below the level of perl), resulting
>> in a crash.
>
> I'm still slightly dubious of this, just because it doesn't match my
> knowledge of perl (which is admittedly imperfect). I'm curious how you
> diagnosed it?

I first had the memory exhaustion problem running my git repo on a
32-bit VM. After bumping the memory from 512 to 4 GiB, and that failing
to fix it, I moved to my workstation with 16 GiB... reproduced.

After the initial crash, I added

	print $size, " ", $bytesToRead, " ", $bytesRead, "\n";

right before the read command, and it does indeed crash right after the
$bytesRead variable crosses LONG_MAX:

	...
	2567089913 1024 2147482624
	2567089913 1024 2147483648
	2567089913 1024 2147484672
	Offset outside string at /usr/share/perl5/Git.pm line 901, <GEN36> line 2604.

Note that $bytesRead is still positive. I know very little perl, but
that symptom seems pretty clear.

>> On a 32 bit system, or a system with low memory it might crash before
>> reaching 2 GiB due to memory exhaustion.
>>
>> Signed-off-by: Joshua Clayton <stillcompil...@gmail.com>
>> Reviewed-by: Jeff King <p...@peff.net>
>
> The commit message is a good place to mention any side effects, and why
> they are not a problem. Something like:
>
>   The previous code buffered the whole blob before writing, so any
>   error reading from cat-file would result in zero bytes being written
>   to the output stream. After this change, the output may be left in a
>   partially written state (or even fully written, if we fail when
>   parsing the final newline from cat-file). However, it's not
>   reasonable for callers to expect anything about the state of the
>   output when we return an error (after all, even with full buffering,
>   we might fail during the writing process). So any caller which cares
>   about this is broken already, and we do not have to worry about them.
>
>> ---
>>  perl/Git.pm | 12 +++++-------
>>  1 file changed, 5 insertions(+), 7 deletions(-)
>
> The patch itself looks fine to me.
>
> -Peff
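The failure mode Joshua diagnosed can be reproduced in miniature. The four-argument form of read() places each chunk at OFFSET within the buffer, so the pre-fix loop accumulated the entire blob in one string while $bytesRead climbed without bound. A toy demonstration (assumption: this mirrors the shape of the old Git.pm loop, scaled down to a 3000-byte string instead of a multi-gigabyte blob):

```perl
use strict;
use warnings;

# Mimic the pre-fix loop: read 1024-byte chunks, but append each chunk
# at offset $bytesRead, so $blob keeps growing. On a >2 GiB blob this
# is where "Offset outside string" appeared once the offset arithmetic
# broke past LONG_MAX.
my $data = "a" x 3000;
open(my $in, "<", \$data) or die $!;

my ($blob, $bytesRead) = ("", 0);
while ($bytesRead < length $data) {
	# 4-arg read(): the chunk lands at offset $bytesRead within $blob.
	my $read = read($in, $blob, 1024, $bytesRead);
	die "read failed" unless defined $read;
	last if $read == 0;
	$bytesRead += $read;
}
print "buffered ", length($blob), " of ", length($data),
      " bytes in a single string\n";
```

With the small input this prints that all 3000 bytes sit in one string, which is harmless here but is exactly the whole-blob buffering the patch removes.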
Re: [PATCH] Fix in Git.pm cat_blob crashes on large files
On Thu, Feb 21, 2013 at 02:13:32PM -0800, Joshua Clayton wrote:
> Greetings. This is my first patch here.
> Hopefully I get the stylistic political details right... :)
> Patch applies against maint and master

I have some comments. :)

The body of your email should contain the commit message (i.e., whatever
people reading "git log" a year from now would see). Cover letter bits
like this should go after the "---". That way "git am" knows which part
is which.

> Developer's Certificate of Origin 1.1

You don't need to include the DCO. Your Signed-off-by is an indication
that you agree to it.

> Affects git svn clone/fetch
> Original code loaded entire file contents into a variable before
> writing to disk. If the offset within the variable passed 2 GiB, it
> became negative, resulting in a crash.

Interesting. I didn't think perl had signed wrap-around issues like
this, as its numeric variables are not strictly integers. But I don't
have a 32-bit machine to test on (and numbers larger than 2G obviously
work on 64-bit machines). At any rate, though:

> On a 32 bit system, or a system with low memory it may crash before
> reaching 2 GiB due to memory exhaustion.

Yeah, it is stupid to read the whole thing into memory if we are just
going to dump it to another filehandle.

> @@ -949,13 +951,21 @@ sub cat_blob {
>  		last unless $bytesLeft;
>  
>  		my $bytesToRead = $bytesLeft < 1024 ? $bytesLeft : 1024;
> -		my $read = read($in, $blob, $bytesToRead, $bytesRead);
> +		my $read = read($in, $blob, $bytesToRead, $blobSize);
>  		unless (defined($read)) {
>  			$self->_close_cat_blob();
>  			throw Error::Simple("in pipe went bad");
>  		}

Hmph. The existing code already reads in 1024-byte chunks. For no
reason, as far as I can tell, since we are just loading the blob buffer
incrementally into memory, only to then flush it all out at once. Why do
you read at the $blobSize offset? If we are just reading in chunks, we
should be able to just keep writing to the start of our small buffer, as
we flush each chunk out before trying to read more.

IOW, shouldn't the final code look like this:

	my $bytesLeft = $size;
	while ($bytesLeft > 0) {
		my $buf;
		my $bytesToRead = $bytesLeft < 1024 ? $bytesLeft : 1024;
		my $read = read($in, $buf, $bytesToRead);
		unless (defined($read)) {
			$self->_close_cat_blob();
			throw Error::Simple("unable to read cat-blob pipe");
		}
		unless (print $fh $buf) {
			$self->_close_cat_blob();
			throw Error::Simple("unable to write blob output");
		}
		$bytesLeft -= $read;
	}

By having the read and flush size be the same, it's much simpler.

Your change (and my proposed code) do mean that an error during the read
operation will result in a truncated output file, rather than an empty
one. I think that is OK, though. That can happen anyway in the original
due to a failure in the print step. Any caller who wants to be careful
that they leave only a full file in place must either:

  1. Check the return value of cat_blob and verify that the result has
     $size bytes, and otherwise delete it.

  2. Write to a temporary file, then once success has been returned from
     cat_blob, rename the result into place.

Neither of which is affected by this change.

-Peff
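Option 2 in Peff's list (write to a temp file, rename on success) is the discipline git svn effectively follows. A hedged caller-side sketch; the careful_write helper and its coderef interface are invented here for illustration and are not part of Git.pm:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write via a temporary file and rename into place only on success, so
# a mid-copy failure never leaves a truncated file at the destination.
# $writer stands in for a call into Git.pm's cat_blob (hypothetical
# interface: a coderef that writes the blob to the given filehandle).
sub careful_write {
	my ($writer, $dest) = @_;
	my ($fh, $tmp) = tempfile(DIR => ".", UNLINK => 0);
	my $ok = eval { $writer->($fh); close($fh) or die $!; 1 };
	unless ($ok) {
		unlink($tmp);          # discard the partial file
		die "write failed: $@";
	}
	rename($tmp, $dest) or die "rename failed: $!";
}

# Usage: the writer succeeds, so the rename is performed.
careful_write(sub { print { $_[0] } "blob contents\n" or die $! },
              "out.bin");
print "wrote out.bin\n";
```

The rename is atomic on POSIX filesystems within one device, which is why the pattern leaves either the old state or the complete new file, never a partial one.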
Re: [PATCH] Fix in Git.pm cat_blob crashes on large files
On Thu, Feb 21, 2013 at 2:43 PM, Jeff King <p...@peff.net> wrote:
> On Thu, Feb 21, 2013 at 02:13:32PM -0800, Joshua Clayton wrote:
>> Greetings. This is my first patch here.
>> Hopefully I get the stylistic political details right... :)
>> Patch applies against maint and master
>
> I have some comments. :)
>
> The body of your email should contain the commit message (i.e.,
> whatever people reading "git log" a year from now would see). Cover
> letter bits like this should go after the "---". That way "git am"
> knows which part is which.
>
>> Developer's Certificate of Origin 1.1
>
> You don't need to include the DCO. Your Signed-off-by is an indication
> that you agree to it.
>
>> Affects git svn clone/fetch
>> Original code loaded entire file contents into a variable before
>> writing to disk. If the offset within the variable passed 2 GiB, it
>> became negative, resulting in a crash.
>
> Interesting. I didn't think perl had signed wrap-around issues like
> this, as its numeric variables are not strictly integers. But I don't
> have a 32-bit machine to test on (and numbers larger than 2G obviously
> work on 64-bit machines). At any rate, though:
>
>> On a 32 bit system, or a system with low memory it may crash before
>> reaching 2 GiB due to memory exhaustion.
>
> Yeah, it is stupid to read the whole thing into memory if we are just
> going to dump it to another filehandle.
>
>> @@ -949,13 +951,21 @@ sub cat_blob {
>>  		last unless $bytesLeft;
>>  
>>  		my $bytesToRead = $bytesLeft < 1024 ? $bytesLeft : 1024;
>> -		my $read = read($in, $blob, $bytesToRead, $bytesRead);
>> +		my $read = read($in, $blob, $bytesToRead, $blobSize);
>>  		unless (defined($read)) {
>>  			$self->_close_cat_blob();
>>  			throw Error::Simple("in pipe went bad");
>>  		}
>
> Hmph. The existing code already reads in 1024-byte chunks. For no
> reason, as far as I can tell, since we are just loading the blob buffer
> incrementally into memory, only to then flush it all out at once. Why
> do you read at the $blobSize offset? If we are just reading in chunks,
> we should be able to just keep writing to the start of our small
> buffer, as we flush each chunk out before trying to read more.
>
> IOW, shouldn't the final code look like this:
>
> 	my $bytesLeft = $size;
> 	while ($bytesLeft > 0) {
> 		my $buf;
> 		my $bytesToRead = $bytesLeft < 1024 ? $bytesLeft : 1024;
> 		my $read = read($in, $buf, $bytesToRead);
> 		unless (defined($read)) {
> 			$self->_close_cat_blob();
> 			throw Error::Simple("unable to read cat-blob pipe");
> 		}
> 		unless (print $fh $buf) {
> 			$self->_close_cat_blob();
> 			throw Error::Simple("unable to write blob output");
> 		}
> 		$bytesLeft -= $read;
> 	}
>
> By having the read and flush size be the same, it's much simpler.

My original bugfix did just read 1024, and write 1024. That works fine
and, yes, is simpler. I changed it to be more similar to the original
code in case there were performance reasons for doing it that way. That
was the only reason I could think of for the design, and adding the
$flushSize variable means that some motivated person could easily
optimize it. So far I have been too lazy to profile the two versions. I
guess I'll try a trivial git svn init; git svn fetch and check back in.
It's running now.

> Your change (and my proposed code) do mean that an error during the
> read operation will result in a truncated output file, rather than an
> empty one. I think that is OK, though. That can happen anyway in the
> original due to a failure in the print step. Any caller who wants to be
> careful that they leave only a full file in place must either:
>
>   1. Check the return value of cat_blob and verify that the result has
>      $size bytes, and otherwise delete it.
>
>   2. Write to a temporary file, then once success has been returned
>      from cat_blob, rename the result into place.
>
> Neither of which is affected by this change.
>
> -Peff

In git svn fetch (which is how I discovered it) the file being passed
to cat_blob is a temporary file, which is checksummed before putting it
into place.
Re: [PATCH] Fix in Git.pm cat_blob crashes on large files
On Thu, Feb 21, 2013 at 03:18:40PM -0800, Joshua Clayton wrote:
>> By having the read and flush size be the same, it's much simpler.
>
> My original bugfix did just read 1024, and write 1024. That works fine
> and, yes, is simpler. I changed it to be more similar to the original
> code in case there were performance reasons for doing it that way.
> That was the only reason I could think of for the design, and adding
> the $flushSize variable means that some motivated person could easily
> optimize it. So far I have been too lazy to profile the two versions.
> I guess I'll try a trivial git svn init; git svn fetch and check back
> in. It's running now.

I doubt it will make much of a difference; we are already writing to a
perl filehandle, so it will be buffered there (which I assume is 4K, but
I haven't checked). And your version retains the 1024-byte read. I do
think 1024 is quite low for this sort of descriptor-to-descriptor
copying. I'd be tempted to just bump that 1024 to 64K.

> In git svn fetch (which is how I discovered it) the file being passed
> to cat_blob is a temporary file, which is checksummed before putting
> it into place.

Great. There may be other callers outside of our tree, of course, but I
think it's pretty clear that the responsibility is on the caller to make
sure the function succeeded. We are changing what gets put on the output
stream for various error conditions, but ultimately that is an
implementation detail that the caller should not be depending on.

-Peff