Re: [PATCH] Fix in Git.pm cat_blob crashes on large files

2013-02-22 Thread Joshua Clayton
running git svn fetch on a remote repository (yes I know there are a
lot of possible outside variables, including network latency)
Code with 1024 reads and 64k writes:

real    75m19.906s
user    16m43.919s
sys     29m16.326s

Code with 1024 reads and 1024 writes:

real    71m21.006s
user    12m36.275s
sys     24m26.112s

...so the simpler code wins the trivial test.
I would say go with it.
Should I resubmit?

On Thu, Feb 21, 2013 at 3:24 PM, Jeff King p...@peff.net wrote:
 On Thu, Feb 21, 2013 at 03:18:40PM -0800, Joshua Clayton wrote:

  By having the read and flush size be the same, it's much simpler.

 My original bugfix did just read 1024, and write 1024. That works fine
 and, yes, is simpler.
 I changed it to be more similar to the original code in case there
 were performance reasons for doing it that way.
 That was the only reason I could think of for the design, and adding
 the $flushSize variable means that
 some motivated person could easily optimize it.

 So far I have been too lazy to profile the two versions
 I guess I'll try a trivial git svn init; git svn fetch and check back in.
 It's running now.

 I doubt it will make much of a difference; we are already writing to a
 perl filehandle, so it will be buffered there (which I assume is 4K, but
 I haven't checked). And your version retains the 1024-byte read. I do
 think 1024 is quite low for this sort of descriptor-to-descriptor
 copying. I'd be tempted to just bump that 1024 to 64K.

 In git svn fetch (which is how I discovered it) the file being passed
 to cat_blob is a temporary file, which is checksummed before putting
 it into place.

 Great. There may be other callers outside of our tree, of course, but I
 think it's pretty clear that the responsibility is on the caller to make
 sure the function succeeded. We are changing what gets put on the output
 stream for various error conditions, but ultimately that is an
 implementation detail that the caller should not be depending on.

 -Peff


Re: [PATCH] Fix in Git.pm cat_blob crashes on large files

2013-02-22 Thread Jeff King
On Fri, Feb 22, 2013 at 07:11:54AM -0800, Joshua Clayton wrote:

 running git svn fetch on a remote repository (yes I know there are a
 lot of possible outside variables, including network latency)
 Code with 1024 reads and 64k writes:
 
 real    75m19.906s
 user    16m43.919s
 sys     29m16.326s
 
 Code with 1024 reads and 1024 writes:
 
 real    71m21.006s
 user    12m36.275s
 sys     24m26.112s
 
 ...so the simpler code wins the trivial test.

Interesting; I'd have expected no change or a slight win for your
version, which makes me wonder if the outside variables are dominating.
I wonder what 64K/64K would look like.

 I would say go with it.
 Should I resubmit?

Yes, please.

-Peff


Re: [PATCH] Fix in Git.pm cat_blob crashes on large files (reviewed by Jeff King)

2013-02-22 Thread Erik Faye-Lund
On Fri, Feb 22, 2013 at 6:03 PM, Joshua Clayton
stillcompil...@gmail.com wrote:
 Read and write each 1024 byte buffer, rather than trying to buffer
 the entire content of the file.
 Previous code would crash on all files > 2 GiB, when the offset variable
 became negative (perhaps below the level of perl), resulting in a crash.
 On a 32 bit system, or a system with low memory it might crash before
 reaching 2 GiB due to memory exhaustion.

 Signed-off-by: Joshua Clayton stillcompil...@gmail.com

We usually put "Reviewed-by: Name <email>" right below the SoB-line,
rather than in the subject. And then we CC the people who were
involved in the previous round :)


Re: [PATCH] Fix in Git.pm cat_blob crashes on large files (resubmit with reviewed-by)

2013-02-22 Thread Jeff King
On Fri, Feb 22, 2013 at 09:30:57AM -0800, Joshua Clayton wrote:

 Read and write each 1024 byte buffer, rather than trying to buffer
 the entire content of the file.

OK. Did you ever repeat your timing with a larger symmetric buffer? That
should probably be a separate patch on top, but it might be worth doing
while we are thinking about it.

 Previous code would crash on all files > 2 GiB, when the offset variable
 became negative (perhaps below the level of perl), resulting in a crash.

I'm still slightly dubious of this, just because it doesn't match my
knowledge of perl (which is admittedly imperfect). I'm curious how you
diagnosed it?

 On a 32 bit system, or a system with low memory it might crash before
 reaching 2 GiB due to memory exhaustion.
 
 Signed-off-by: Joshua Clayton stillcompil...@gmail.com
 Reviewed-by: Jeff King p...@peff.net

The commit message is a good place to mention any side effects, and why
they are not a problem. Something like:

  The previous code buffered the whole blob before writing, so any error
  reading from cat-file would result in zero bytes being written to the
  output stream.  After this change, the output may be left in a
  partially written state (or even fully written, if we fail when
  parsing the final newline from cat-file). However, it's not reasonable
  for callers to expect anything about the state of the output when we
  return an error (after all, even with full buffering, we might fail
  during the writing process).  So any caller which cares about this is
  broken already, and we do not have to worry about them.

 ---
  perl/Git.pm |   12 +---
  1 file changed, 5 insertions(+), 7 deletions(-)

The patch itself looks fine to me.

-Peff


Re: [PATCH] Fix in Git.pm cat_blob crashes on large files (resubmit with reviewed-by)

2013-02-22 Thread Jeff King
On Fri, Feb 22, 2013 at 09:30:57AM -0800, Joshua Clayton wrote:

 Subject: Re: [PATCH] Fix in Git.pm cat_blob crashes on large files
   (resubmit with reviewed-by)

One thing I forgot to note in my other response: the subject is part of
the commit message, so information like resubmit with... should go
with other cover letter material after the ---.

It's also customary to say something like PATCHv2 in the brackets
(which does get stripped off), and mention what is changed since v1
after the ---. That can help reviewers remember what happened, and it
can help the maintainer understand where in the process of review the
patch is.
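
For example, a v2 might be laid out roughly like this (the subject wording
and the notes here are illustrative only, not a suggested v2):

  Subject: [PATCHv2] Git.pm: read and write cat_blob data in small chunks

  <commit message, exactly as it should appear in git log>

  Signed-off-by: ...
  ---
  Changes since v1: added the Reviewed-by trailer, reworded the subject.

  <diffstat and patch follow>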

-Peff


Re: [PATCH] Fix in Git.pm cat_blob crashes on large files (resubmit with reviewed-by)

2013-02-22 Thread Junio C Hamano
Joshua Clayton stillcompil...@gmail.com writes:

 Read and write each 1024 byte buffer, rather than trying to buffer
 the entire content of the file.
 Previous code would crash on all files > 2 GiB, when the offset variable
 became negative (perhaps below the level of perl), resulting in a crash.
 On a 32 bit system, or a system with low memory it might crash before
 reaching 2 GiB due to memory exhaustion.

 Signed-off-by: Joshua Clayton stillcompil...@gmail.com
 Reviewed-by: Jeff King p...@peff.net
 ---

Thanks.

 Subject: Re: [PATCH] Fix in Git.pm cat_blob crashes on large files (resubmit 
 with reviewed-by)

Please drop the () part.  A rule of thumb is to make git show
output understandable by people who read it 6 months from now.  They
do not care if the commit is a re-submission.

It seems that this issue was with us since the very beginning of
this sub since it was introduced at 7182530d8cad (Git.pm: Add
hash_and_insert_object and cat_blob, 2008-05-23).

  perl/Git.pm |   12 +---
  1 file changed, 5 insertions(+), 7 deletions(-)

 diff --git a/perl/Git.pm b/perl/Git.pm
 index 931047c..cc91288 100644
 --- a/perl/Git.pm
 +++ b/perl/Git.pm
 @@ -949,13 +949,16 @@ sub cat_blob {
 last unless $bytesLeft;

 my $bytesToRead = $bytesLeft < 1024 ? $bytesLeft : 1024;
 -   my $read = read($in, $blob, $bytesToRead, $bytesRead);
 +   my $read = read($in, $blob, $bytesToRead);
 unless (defined($read)) {
 $self->_close_cat_blob();
 throw Error::Simple("in pipe went bad");
 }
 -
 $bytesRead += $read;
 +   unless (print $fh $blob) {
 +   $self->_close_cat_blob();
 +   throw Error::Simple("couldn't write to passed
 in filehandle");
 +   }

Corrupt patch, line-wrapped by your MUA.

I wonder if we still need $bytesRead variable.  You have $size that
is the size of the whole blob, so

    my $bytesLeft = $size;
    while (1) {
            my $bytesToRead = $bytesLeft < 1024 ? $bytesLeft : 1024;
            my $bytesRead = read($in, $blob, $bytesToRead);
            ... check errors and use the $blob ...
            $bytesLeft -= $bytesRead;
    }

may be simpler and easier to read, no?
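
For concreteness, a filled-in version of that sketch (just an illustration,
reusing the error handling from the hunk quoted above; like the rest of the
loop it assumes cat-file delivers exactly $size bytes):

    my $bytesLeft = $size;
    while ($bytesLeft > 0) {
            my $bytesToRead = $bytesLeft < 1024 ? $bytesLeft : 1024;
            my $bytesRead = read($in, $blob, $bytesToRead);
            unless (defined($bytesRead)) {
                    $self->_close_cat_blob();
                    throw Error::Simple("in pipe went bad");
            }
            unless (print $fh $blob) {
                    $self->_close_cat_blob();
                    throw Error::Simple("couldn't write to passed in filehandle");
            }
            $bytesLeft -= $bytesRead;
    }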


Re: [PATCH] Fix in Git.pm cat_blob crashes on large files (resubmit with reviewed-by)

2013-02-22 Thread Joshua Clayton
Thanks for all the input and patience.

On Fri, Feb 22, 2013 at 10:34 AM, Jeff King p...@peff.net wrote:
 On Fri, Feb 22, 2013 at 09:30:57AM -0800, Joshua Clayton wrote:

 Read and write each 1024 byte buffer, rather than trying to buffer
 the entire content of the file.

 OK. Did you ever repeat your timing with a larger symmetric buffer? That
 should probably be a separate patch on top, but it might be worth doing
 while we are thinking about it.

 Previous code would crash on all files > 2 GiB, when the offset variable
 became negative (perhaps below the level of perl), resulting in a crash.

 I'm still slightly dubious of this, just because it doesn't match my
 knowledge of perl (which is admittedly imperfect). I'm curious how you
 diagnosed it?

I first had the memory exhaustion problem running my git repo on a 32 bit vm.
After bumping the memory from 512 MiB to 4 GiB, and that failing to fix it,
I moved to my workstation with 16 GiB.
...reproduced
After the initial crash, I added

print $size, " ", $bytesToRead, " ", $bytesRead, "\n";

right before the read command, and it does indeed crash right after
the $bytesRead variable crosses LONG_MAX
...
2567089913 1024 2147482624
2567089913 1024 2147483648
2567089913 1024 2147484672
Offset outside string at /usr/share/perl5/Git.pm line 901, <GEN36> line 2604.

Note that $bytesRead is still positive.
I know very little perl, but that symptom seems pretty clear
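
In other words, the instrumentation sat just above the old four-argument
read, roughly like this (a reconstruction, not the exact lines):

    my $bytesToRead = $bytesLeft < 1024 ? $bytesLeft : 1024;
    print $size, " ", $bytesToRead, " ", $bytesRead, "\n";  # temporary debug output
    my $read = read($in, $blob, $bytesToRead, $bytesRead);  # per the output above, this is what
                                                            # fails once $bytesRead passes 2**31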


 On a 32 bit system, or a system with low memory it might crash before
 reaching 2 GiB due to memory exhaustion.

 Signed-off-by: Joshua Clayton stillcompil...@gmail.com
 Reviewed-by: Jeff King p...@peff.net

 The commit message is a good place to mention any side effects, and why
 they are not a problem. Something like:

   The previous code buffered the whole blob before writing, so any error
   reading from cat-file would result in zero bytes being written to the
   output stream.  After this change, the output may be left in a
   partially written state (or even fully written, if we fail when
   parsing the final newline from cat-file). However, it's not reasonable
   for callers to expect anything about the state of the output when we
   return an error (after all, even with full buffering, we might fail
   during the writing process).  So any caller which cares about this is
   broken already, and we do not have to worry about them.

 ---
  perl/Git.pm |   12 +---
  1 file changed, 5 insertions(+), 7 deletions(-)

 The patch itself looks fine to me.

 -Peff


Re: [PATCH] Fix in Git.pm cat_blob crashes on large files

2013-02-21 Thread Jeff King
On Thu, Feb 21, 2013 at 02:13:32PM -0800, Joshua Clayton wrote:

 Greetings.
 This is my first patch here. Hopefully I get the stylistic & political
 details right... :)
 Patch applies against maint and master

I have some comments. :)

The body of your email should contain the commit message (i.e., whatever
people reading git log a year from now would see). Cover letter bits
like this should go after the ---. That way git am knows which part
is which.

 Developer's Certificate of Origin 1.1

You don't need to include the DCO. Your Signed-off-by is an indication
that you agree to it.

 Affects git svn clone/fetch
 Original code loaded entire file contents into a variable
 before writing to disk. If the offset within the variable passed
 2 GiB, it became negative, resulting in a crash.

Interesting. I didn't think perl had signed wrap-around issues like
this, as its numeric variables are not strictly integers. But I don't
have a 32-bit machine to test on (and numbers larger than 2G obviously
work on 64-bit machines). At any rate, though:

 On a 32 bit system, or a system with low memory it may crash before
 reaching 2 GiB due to memory exhaustion.

Yeah, it is stupid to read the whole thing into memory if we are just
going to dump it to another filehandle.

 @@ -949,13 +951,21 @@ sub cat_blob {
   last unless $bytesLeft;
 
   my $bytesToRead = $bytesLeft < 1024 ? $bytesLeft : 1024;
 - my $read = read($in, $blob, $bytesToRead, $bytesRead);
 + my $read = read($in, $blob, $bytesToRead, $blobSize);
   unless (defined($read)) {
   $self->_close_cat_blob();
   throw Error::Simple("in pipe went bad");
   }

Hmph. The existing code already reads in 1024-byte chunks. For no
reason, as far as I can tell, since we are just loading the blob buffer
incrementally into memory, only to then flush it all out at once.
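
Background, for readers less familiar with perl: the four-argument form of
read() places the incoming data at OFFSET within the target scalar, which is
why a growing offset makes the scalar accumulate everything read so far. A
small self-contained illustration, using an in-memory filehandle as a
stand-in for the cat-file pipe:

    use strict; use warnings;
    open my $in, '<', \"abcdefghij" or die $!;  # stand-in for the cat-file pipe
    my ($buf, $off) = ('', 0);
    while (my $n = read($in, $buf, 4, $off)) {  # 4-arg read: data lands at offset $off in $buf
            $off += $n;                         # same bookkeeping as $bytesRead in cat_blob
    }
    print "$buf\n";                             # prints "abcdefghij": the whole input accumulated

The three-argument form overwrites the scalar each time instead.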

Why do you read at the $blobSize offset? If we are just reading in
chunks, we should be able to just keep writing to the start of our small
buffer, as we flush each chunk out before trying to read more.

IOW, shouldn't the final code look like this:

  my $bytesLeft = $size;
  while ($bytesLeft > 0) {
          my $buf;
          my $bytesToRead = $bytesLeft < 1024 ? $bytesLeft : 1024;
          my $read = read($in, $buf, $bytesToRead);
          unless (defined($read)) {
                  $self->_close_cat_blob();
                  throw Error::Simple("unable to read cat-blob pipe");
          }
          unless (print $fh $buf) {
                  $self->_close_cat_blob();
                  throw Error::Simple("unable to write blob output");
          }

          $bytesLeft -= $read;
  }

By having the read and flush size be the same, it's much simpler.

Your change (and my proposed code) do mean that an error during the read
operation will result in a truncated output file, rather than an empty
one. I think that is OK, though. That can happen anyway in the original
due to a failure in the print step. Any caller who wants to be careful
that they leave only a full file in place must either:

  1. Check the return value of cat_blob and verify that the result has
 $size bytes, and otherwise delete it.

  2. Write to a temporary file, then once success has been returned from
 cat_blob, rename the result into place.

Neither of which is affected by this change.
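
A sketch of option 2, combined with the size check from option 1 ($repo,
$sha1, $size, $dir and $target are placeholders; cat_blob's documented
contract is to print the blob to the filehandle and return the byte count):

    use File::Temp qw(tempfile);

    my ($fh, $tmp) = tempfile(DIR => $dir, UNLINK => 0);  # same filesystem as $target; we clean up ourselves
    my $written = $repo->cat_blob($sha1, $fh);            # may also throw Error::Simple on pipe/write errors
    close($fh) or die "close $tmp: $!";
    if ($written == $size) {
            rename($tmp, $target) or die "rename: $!";    # only a fully written file is moved into place
    } else {
            unlink($tmp);                                 # partial output never becomes visible
    }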

-Peff


Re: [PATCH] Fix in Git.pm cat_blob crashes on large files

2013-02-21 Thread Joshua Clayton
On Thu, Feb 21, 2013 at 2:43 PM, Jeff King p...@peff.net wrote:
 On Thu, Feb 21, 2013 at 02:13:32PM -0800, Joshua Clayton wrote:

 Greetings.
 This is my first patch here. Hopefully I get the stylistic & political
 details right... :)
 Patch applies against maint and master

 I have some comments. :)

 The body of your email should contain the commit message (i.e., whatever
 people reading git log a year from now would see). Cover letter bits
 like this should go after the ---. That way git am knows which part
 is which.

 Developer's Certificate of Origin 1.1

 You don't need to include the DCO. Your Signed-off-by is an indication
 that you agree to it.

 Affects git svn clone/fetch
 Original code loaded entire file contents into a variable
 before writing to disk. If the offset within the variable passed
 2 GiB, it became negative, resulting in a crash.

 Interesting. I didn't think perl had signed wrap-around issues like
 this, as its numeric variables are not strictly integers. But I don't
 have a 32-bit machine to test on (and numbers larger than 2G obviously
 work on 64-bit machines). At any rate, though:

 On a 32 bit system, or a system with low memory it may crash before
 reaching 2 GiB due to memory exhaustion.

 Yeah, it is stupid to read the whole thing into memory if we are just
 going to dump it to another filehandle.

 @@ -949,13 +951,21 @@ sub cat_blob {
   last unless $bytesLeft;

   my $bytesToRead = $bytesLeft < 1024 ? $bytesLeft : 1024;
 - my $read = read($in, $blob, $bytesToRead, $bytesRead);
 + my $read = read($in, $blob, $bytesToRead, $blobSize);
   unless (defined($read)) {
   $self->_close_cat_blob();
   throw Error::Simple("in pipe went bad");
   }

 Hmph. The existing code already reads in 1024-byte chunks. For no
 reason, as far as I can tell, since we are just loading the blob buffer
 incrementally into memory, only to then flush it all out at once.

 Why do you read at the $blobSize offset? If we are just reading in
 chunks, we should be able to just keep writing to the start of our small
 buffer, as we flush each chunk out before trying to read more.

 IOW, shouldn't the final code look like this:

   my $bytesLeft = $size;
   while ($bytesLeft > 0) {
           my $buf;
           my $bytesToRead = $bytesLeft < 1024 ? $bytesLeft : 1024;
           my $read = read($in, $buf, $bytesToRead);
           unless (defined($read)) {
                   $self->_close_cat_blob();
                   throw Error::Simple("unable to read cat-blob pipe");
           }
           unless (print $fh $buf) {
                   $self->_close_cat_blob();
                   throw Error::Simple("unable to write blob output");
           }

           $bytesLeft -= $read;
   }

 By having the read and flush size be the same, it's much simpler.

My original bugfix did just read 1024, and write 1024. That works fine
and, yes, is simpler.
I changed it to be more similar to the original code in case there
were performance reasons for doing it that way.
That was the only reason I could think of for the design, and adding
the $flushSize variable means that
some motivated person could easily optimize it.
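
Roughly, the submitted version read 1024 bytes at a time but only flushed
once the accumulated buffer reached $flushSize. A guess at the shape, from
the description above rather than the actual patch:

    my $flushSize = 65536;                  # 64K writes, 1024-byte reads
    my $bytesLeft = $size;
    my $pending = '';
    while ($bytesLeft > 0) {                # assumes cat-file delivers exactly $size bytes
            my $bytesToRead = $bytesLeft < 1024 ? $bytesLeft : 1024;
            my $read = read($in, $pending, $bytesToRead, length($pending));
            unless (defined($read)) {
                    $self->_close_cat_blob();
                    throw Error::Simple("in pipe went bad");
            }
            $bytesLeft -= $read;
            if (length($pending) >= $flushSize || $bytesLeft == 0) {
                    unless (print $fh $pending) {
                            $self->_close_cat_blob();
                            throw Error::Simple("couldn't write to passed in filehandle");
                    }
                    $pending = '';
            }
    }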

So far I have been too lazy to profile the two versions
I guess I'll try a trivial git svn init; git svn fetch and check back in.
It's running now.


 Your change (and my proposed code) do mean that an error during the read
 operation will result in a truncated output file, rather than an empty
 one. I think that is OK, though. That can happen anyway in the original
 due to a failure in the print step. Any caller who wants to be careful
 that they leave only a full file in place must either:

   1. Check the return value of cat_blob and verify that the result has
  $size bytes, and otherwise delete it.

   2. Write to a temporary file, then once success has been returned from
  cat_blob, rename the result into place.

 Neither of which is affected by this change.

 -Peff

In git svn fetch (which is how I discovered it) the file being passed
to cat_blob is a temporary file, which is checksummed before putting
it into place.


Re: [PATCH] Fix in Git.pm cat_blob crashes on large files

2013-02-21 Thread Jeff King
On Thu, Feb 21, 2013 at 03:18:40PM -0800, Joshua Clayton wrote:

  By having the read and flush size be the same, it's much simpler.
 
 My original bugfix did just read 1024, and write 1024. That works fine
 and, yes, is simpler.
 I changed it to be more similar to the original code in case there
 were performance reasons for doing it that way.
 That was the only reason I could think of for the design, and adding
 the $flushSize variable means that
 some motivated person could easily optimize it.
 
 So far I have been too lazy to profile the two versions
 I guess I'll try a trivial git svn init; git svn fetch and check back in.
 It's running now.

I doubt it will make much of a difference; we are already writing to a
perl filehandle, so it will be buffered there (which I assume is 4K, but
I haven't checked). And your version retains the 1024-byte read. I do
think 1024 is quite low for this sort of descriptor-to-descriptor
copying. I'd be tempted to just bump that 1024 to 64K.

 In git svn fetch (which is how I discovered it) the file being passed
 to cat_blob is a temporary file, which is checksummed before putting
 it into place.

Great. There may be other callers outside of our tree, of course, but I
think it's pretty clear that the responsibility is on the caller to make
sure the function succeeded. We are changing what gets put on the output
stream for various error conditions, but ultimately that is an
implementation detail that the caller should not be depending on.

-Peff