Re: [PATCH] drop support for experimental loose objects

2013-11-27 Thread Jeff King
On Mon, Nov 25, 2013 at 10:35:19AM -0800, Junio C Hamano wrote:

> > 	if (type == OBJ_BLOB) {
> > 		if (stream_blob_to_fd(fd, sha1, NULL, 0) < 0)
> > 			die("unable to stream %s to stdout", sha1_to_hex(sha1));
> > +	if (check_sha1_signature(sha1, NULL, 0, NULL) < 0)
> > +		die("object %s sha1 mismatch", sha1_to_hex(sha1));
>
> check_sha1_signature() opens the object again and streams the data.
> Essentially the read side is doing twice the work with that patch,
> isn't it?

Yes. I considered that, but I also got ~20% slow-down when just doing
commits/trees, which are in-core and can re-hash the same buffer. So
since my with-blobs numbers backed that up, I didn't think too much
further on it.

But there is something curious about the numbers I posted. It takes 12s
without the check, and 15s with the check. So the extra hashing adds 3s.
But if we are reading each blob twice, and we would expect blob reading
to be a significant chunk of that 12s, then shouldn't we expect much
more than 3s increase?

The answer must be that either we are not streaming as much as I think,
or re-reading the data is much cheaper than I expect. And I think it is
the latter.

The vast majority of blobs in git.git will be stored as packed deltas.
That means the streaming code will fall back to doing the regular
in-core access. We _could_ therefore use that in-core copy to do our
sha1 check rather than streaming; but of course we never get access to
it outside of stream_blob_to_fd, and it is discarded. However, we do
keep a copy in the delta base cache. When we immediately ask to unpack
the exact same entry for check_sha1_signature, we can pull the copy
straight out of the cache without having to re-inflate the object.
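
To make the caching effect concrete, here is a minimal sketch of a
delta base cache keyed by (pack, offset). The names and structure are
hypothetical, for illustration only; git's actual implementation in
sha1_file.c differs:

/*
 * Map a (pack, offset) pair to an already-inflated buffer, so that a
 * second unpack of the same entry can skip inflation entirely.
 */
#include <stdlib.h>

struct base_cache_entry {
	const void *pack;      /* which packfile the entry came from */
	unsigned long offset;  /* offset of the object in that pack */
	void *data;            /* inflated contents */
	unsigned long size;
};

#define CACHE_SLOTS 256
static struct base_cache_entry base_cache[CACHE_SLOTS];

static unsigned slot_for(const void *pack, unsigned long offset)
{
	/* trivial hash; a real cache would do better */
	return ((unsigned long)pack ^ offset) % CACHE_SLOTS;
}

/* Return cached data if present, else NULL (caller then inflates). */
void *base_cache_lookup(const void *pack, unsigned long offset,
			unsigned long *size)
{
	struct base_cache_entry *e = &base_cache[slot_for(pack, offset)];
	if (e->data && e->pack == pack && e->offset == offset) {
		*size = e->size;
		return e->data;
	}
	return NULL;
}

void base_cache_store(const void *pack, unsigned long offset,
		      void *data, unsigned long size)
{
	struct base_cache_entry *e = &base_cache[slot_for(pack, offset)];
	free(e->data);         /* evict whatever occupied the slot */
	e->pack = pack;
	e->offset = offset;
	e->data = data;
	e->size = size;
}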

After applying the patch below on top of yours, my numbers remain the
same:

diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index b2ca775..e3ff677 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -197,7 +197,7 @@ static void print_object_or_die(int fd, const unsigned char *sha1,
 			       enum object_type type, unsigned long size)
 {
 	if (type == OBJ_BLOB) {
-		if (stream_blob_to_fd(fd, sha1, NULL, 0) < 0)
+		if (stream_blob_to_fd(fd, sha1, NULL, STREAMING_VERIFY_OBJECT_NAME) < 0)
 			die("unable to stream %s to stdout", sha1_to_hex(sha1));
 	}
 	else {
@@ -208,6 +208,8 @@ static void print_object_or_die(int fd, const unsigned char *sha1,
 		contents = read_sha1_file(sha1, &rtype, &rsize);
 		if (!contents)
 			die("object %s disappeared", sha1_to_hex(sha1));
+		if (check_sha1_signature(sha1, contents, rsize, typename(rtype)) < 0)
+			die("object %s sha1 mismatch", sha1_to_hex(sha1));
 		if (rtype != type)
 			die("object %s changed type!?", sha1_to_hex(sha1));
 		if (rsize != size)

> I wonder if we want to extend the stream_blob_to_fd() API to
> optionally allow the caller to ask to validate that the returned
> data is consistent with the object name the caller asked the data
> for.  Something along the lines of the attached weatherballoon patch?

Yes, I think it is a reasonable addition to the streaming API. However,
I do not think there are any callsites which would currently want it.
All of the current users of stream_blob_to_fd use read_sha1_file as
their alternative, and not parse_object. So we are not verifying the
sha1 in either case (we may want to change that, of course, but that is
a bigger decision than just trying to bring streaming and non-streaming
code-paths into parity).

I also wondered if parse_object itself had problems with double-reading
or failing to verify. But its use goes the opposite direction; it wants
to verify the sha1 of the blob object, but it knows that it does not
actually need the data. So it streams (as of 090ea12) to check the
signature, but then discards each buffer-full after hashing it.
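
For reference, the hash-and-discard loop looks roughly like this. This
is a sketch assembled from the interfaces used elsewhere in this thread
(read_istream, git_SHA1_*, typename, hashcmp), not the exact code added
in 090ea12:

static int stream_and_verify(struct git_istream *st,
			     const unsigned char *sha1,
			     enum object_type type, unsigned long size)
{
	git_SHA_CTX ctx;
	unsigned char real_sha1[20];
	char hdr[32];
	int hdrlen = sprintf(hdr, "%s %lu", typename(type), size) + 1;

	git_SHA1_Init(&ctx);
	git_SHA1_Update(&ctx, hdr, hdrlen);  /* "<type> <size>\0" header */
	for (;;) {
		char buf[1024 * 16];
		ssize_t readlen = read_istream(st, buf, sizeof(buf));
		if (readlen < 0)
			return -1;
		if (!readlen)
			break;
		/* hash the chunk, then simply let it go out of scope */
		git_SHA1_Update(&ctx, buf, readlen);
	}
	git_SHA1_Final(real_sha1, &ctx);
	return hashcmp(sha1, real_sha1) ? -1 : 0;
}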

-Peff


Re: [PATCH] drop support for experimental loose objects

2013-11-27 Thread Junio C Hamano
Jeff King <p...@peff.net> writes:

> The vast majority of blobs in git.git will be stored as packed deltas.
> That means the streaming code will fall back to doing the regular
> in-core access. We _could_ therefore use that in-core copy to do our
> sha1 check rather than streaming; but of course we never get access to
> it outside of stream_blob_to_fd, and it is discarded. However, we do
> keep a copy in the delta base cache. When we immediately ask to unpack
> the exact same entry for check_sha1_signature, we can pull the copy
> straight out of the cache without having to re-inflate the object.

OK, that explains the overhead of 20% that is lower than one would
naïvely expect.  Thanks.

> Yes, I think it is a reasonable addition to the streaming API. However,
> I do not think there are any callsites which would currently want it.
> All of the current users of stream_blob_to_fd use read_sha1_file as
> their alternative, and not parse_object. So we are not verifying the
> sha1 in either case (we may want to change that, of course, but that is
> a bigger decision than just trying to bring streaming and non-streaming
> code-paths into parity).

True. I am not offhand sure if we want to make read_sha1_file() rehash,
but I agree that it is a question different from what we are asking in
this discussion.

> I also wondered if parse_object itself had problems with double-reading
> or failing to verify. But its use goes the opposite direction; it wants
> to verify the sha1 of the blob object, but it knows that it does not
> actually need the data. So it streams (as of 090ea12) to check the
> signature, but then discards each buffer-full after hashing it.
>
> -Peff


Re: [PATCH] drop support for experimental loose objects

2013-11-27 Thread Jeff King
On Wed, Nov 27, 2013 at 10:57:14AM -0800, Junio C Hamano wrote:

> > Yes, I think it is a reasonable addition to the streaming API. However,
> > I do not think there are any callsites which would currently want it.
> > All of the current users of stream_blob_to_fd use read_sha1_file as
> > their alternative, and not parse_object. So we are not verifying the
> > sha1 in either case (we may want to change that, of course, but that is
> > a bigger decision than just trying to bring streaming and non-streaming
> > code-paths into parity).
>
> True. I am not offhand sure if we want to make read_sha1_file() rehash,
> but I agree that it is a question different from what we are asking in
> this discussion.

I'm torn on that. Having git verify everything all the time is kind of
cool. But it _does_ have a performance impact, and the vast majority of
the time nothing got corrupted since the last time we looked at the
object. It seems like periodically running "git fsck" is a smarter way
to do such checks.

We already are careful when objects are coming into the repository, and
I think that is a sensible boundary (and I am increasingly of the
opinion that running with transfer.fsckobjects off is not a good idea).

The checks in parse_object seem hack-ish to me, because they catch some
random subset of the times we access objects (e.g., calling parse_object
on a commit sha1 will check, but calling parse_commit on an unparsed
commit struct will not). If anything, I'd suggest moving the checking
down to read_sha1_file, which would add it fairly consistently
everywhere, and then tying it to a config option (off for high
performance, on for slower-but-meticulous).
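
A config-gated check at that level might look something like the sketch
below; "verify_objects" stands in for a hypothetical configuration
knob, not an existing option:

static int verify_objects; /* would be set from a hypothetical config key */

void *read_sha1_file_checked(const unsigned char *sha1,
			     enum object_type *type, unsigned long *size)
{
	void *data = read_sha1_file(sha1, type, size);

	if (data && verify_objects &&
	    check_sha1_signature(sha1, data, *size, typename(*type)) < 0)
		die("object %s sha1 mismatch", sha1_to_hex(sha1));
	return data;
}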

-Peff


Re: [PATCH] drop support for experimental loose objects

2013-11-25 Thread Junio C Hamano
Jeff King <p...@peff.net> writes:

> On Sun, Nov 24, 2013 at 03:44:44AM -0500, Jeff King wrote:
>
> > In any code path where we call parse_object, we double-check that the
> > result matches the sha1 we asked for. But low-level commands like
> > cat-file just call read_sha1_file directly, and do not have such a
> > check. We could add it, but I suspect the processing cost would be
> > noticeable.
>
> Curious, I tested this. It is noticeable. Here's the best-of-five
> timings when running a --batch cat-file on every object in my git.git
> repo, using the patch below:
>
>   [before]
>   real    0m12.941s
>   user    0m12.700s
>   sys     0m0.244s
>
>   [after]
>   real    0m15.800s
>   user    0m15.472s
>   sys     0m0.344s
>
> So it's about 20% slower. I don't know what the right tradeoff is. It's
> cool to check the data each time we look at it, but it does carry a
> performance penalty.
>
> diff --git a/builtin/cat-file.c b/builtin/cat-file.c
> index b2ca775..2b09773 100644
> --- a/builtin/cat-file.c
> +++ b/builtin/cat-file.c
> @@ -199,6 +199,8 @@ static void print_object_or_die(int fd, const unsigned char *sha1,
>  	if (type == OBJ_BLOB) {
>  		if (stream_blob_to_fd(fd, sha1, NULL, 0) < 0)
>  			die("unable to stream %s to stdout", sha1_to_hex(sha1));
> +		if (check_sha1_signature(sha1, NULL, 0, NULL) < 0)
> +			die("object %s sha1 mismatch", sha1_to_hex(sha1));

check_sha1_signature() opens the object again and streams the data.
Essentially the read side is doing twice the work with that patch,
isn't it?

I wonder if we want to extend the stream_blob_to_fd() API to
optionally allow the caller to ask to validate that the returned
data is consistent with the object name the caller asked the data
for.  Something along the lines of the attached weatherballoon patch?

 builtin/fsck.c |  3 ++-
 entry.c        |  2 +-
 streaming.c    | 29 ++++++++++++++++++++++++++++-
 streaming.h    |  4 +++-
 4 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index 97ce678..f42ed9c 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -237,7 +237,8 @@ static void check_unreachable_object(struct object *obj)
 			if (!(f = fopen(filename, "w")))
 				die_errno("Could not open '%s'", filename);
 			if (obj->type == OBJ_BLOB) {
-				if (stream_blob_to_fd(fileno(f), obj->sha1, NULL, 1))
+				if (stream_blob_to_fd(fileno(f), obj->sha1, NULL,
+						      STREAMING_OUTPUT_CAN_SEEK))
 					die_errno("Could not write '%s'", filename);
 			} else
 				fprintf(f, "%s\n", sha1_to_hex(obj->sha1));
diff --git a/entry.c b/entry.c
index 7b7aa81..b3bc827 100644
--- a/entry.c
+++ b/entry.c
@@ -127,7 +127,7 @@ static int streaming_write_entry(const struct cache_entry *ce, char *path,
 	if (fd < 0)
 		return -1;
 
-	result |= stream_blob_to_fd(fd, ce->sha1, filter, 1);
+	result |= stream_blob_to_fd(fd, ce->sha1, filter, STREAMING_OUTPUT_CAN_SEEK);
 	*fstat_done = fstat_output(fd, state, statbuf);
 	result |= close(fd);
 
diff --git a/streaming.c b/streaming.c
index debe904..50599df 100644
--- a/streaming.c
+++ b/streaming.c
@@ -2,6 +2,7 @@
  * Copyright (c) 2011, Google Inc.
  */
 #include "cache.h"
+#include "object.h"
 #include "streaming.h"
 
 enum input_source {
@@ -496,19 +497,33 @@ static open_method_decl(incore)
 /****************************************************************/
 
 int stream_blob_to_fd(int fd, unsigned const char *sha1, struct stream_filter *filter,
-		      int can_seek)
+		      unsigned flags)
 {
 	struct git_istream *st;
 	enum object_type type;
 	unsigned long sz;
 	ssize_t kept = 0;
 	int result = -1;
+	int can_seek = flags & STREAMING_OUTPUT_CAN_SEEK;
+
+	int want_verify = flags & STREAMING_VERIFY_OBJECT_NAME;
+	git_SHA_CTX ctx;
 
 	st = open_istream(sha1, &type, &sz, filter);
 	if (!st)
 		return result;
 	if (type != OBJ_BLOB)
 		goto close_and_exit;
+
+	if (want_verify) {
+		char hdr[32];
+		int hdrlen;
+
+		git_SHA1_Init(&ctx);
+		hdrlen = sprintf(hdr, "%s %lu", typename(type), sz) + 1;
+		git_SHA1_Update(&ctx, hdr, hdrlen);
+	}
+
 	for (;;) {
 		char buf[1024 * 16];
 		ssize_t wrote, holeto;
@@ -518,6 +533,10 @@ int stream_blob_to_fd(int fd, unsigned const char *sha1, struct stream_filter *f
 			goto close_and_exit;
 		if (!readlen)
 			break;
+
+		if (want_verify)
+			git_SHA1_Update(&ctx, buf, readlen);
+
 		if (can_seek && sizeof(buf) == readlen) {

Re: [PATCH] drop support for experimental loose objects

2013-11-24 Thread Jeff King
On Fri, Nov 22, 2013 at 01:28:59PM -0400, Joey Hess wrote:

> > Hrm. For --batch, I'd think we would open the whole object and notice
> > the corruption, even with the current code. But for --batch-check, we
> > use sha1_object_info, and for an experimental object, we do not need
> > to de-zlib the object at all.  So we end up reporting whatever crap we
> > decipher from the garbage bytes.  My patch would fix that, as we would
> > not incorrectly guess an object is experimental anymore.
> >
> > If you have specific cases that trigger even after my patch, I'd be
> > interested to see them.
>
> I was seeing it with --batch, not --batch-check. Probably only with the
> old experimental loose object format. In one case, --batch reported a
> size of 20k, and only output 1k of data. With the object file I sent
> earlier, --batch reports a huge size, and fails trying to allocate the
> memory for it before it can output anything.

Ah, yeah, that makes sense. We report the size via sha1_object_info,
whether we are going to output the object itself or not. So we might
report the bogus size, not noticing the corruption, and then hit an
error and bail when sending the object itself.

My patch makes that better in some cases, because we'll notice more
corruption when looking at the header of the object for
sha1_object_info. But fundamentally, we may still hit an error while
outputting the bytes. Reading the cat-file code, it looks like we should
always die if we hit an error, so at least a reader will get premature
EOF (and not the beginning of another object).

I can believe there is some specific corruption that yields a valid zlib
stream that is a different size than the object advertises. Since
v1.8.4, we double-check that the size we advertised matches what we are
about to write. But the streaming-blob code path does not include that
check, so it might still be affected. It would be pretty easy and cheap
to detect that case.
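
That check could be as cheap as counting what we stream and comparing
it against the advertised size; a hypothetical helper, just to sketch
the idea:

static int stream_blob_counted(int fd, struct git_istream *st,
			       unsigned long advertised)
{
	unsigned long total = 0;

	for (;;) {
		char buf[1024 * 16];
		ssize_t readlen = read_istream(st, buf, sizeof(buf));
		if (readlen < 0)
			return -1;
		if (!readlen)
			break;
		if (write_in_full(fd, buf, readlen) < 0)
			return -1;
		total += readlen;
	}
	if (total != advertised)
		return error("object advertised %lu bytes but streamed %lu",
			     advertised, total);
	return 0;
}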

In any code path where we call parse_object, we double-check that the
result matches the sha1 we asked for. But low-level commands like
cat-file just call read_sha1_file directly, and do not have such a
check. We could add it, but I suspect the processing cost would be
noticeable.

> I also have seen at least once a corrupt pack file that caused git to try
> and allocate an absurd quantity of memory.

I'm not surprised by that. The packfiles contain size information
outside of the checksummed zlib data, and we pre-allocate the buffer
before reading the zlib data. We could try to detect it, but then we are
hard-coding the definition of "absurd". The current definition is "we
asked the OS for memory, and it did not give it to us". :)
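
For what it's worth, hard-coding "absurd" would look something like
this; both the helper and the limit are made up for illustration:

static void *xmalloc_capped(unsigned long size, unsigned long limit)
{
	if (limit && size > limit)
		die("refusing to allocate %lu bytes for an object "
		    "(limit is %lu); likely corruption", size, limit);
	return xmalloc(size);
}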

-Peff


Re: [PATCH] drop support for experimental loose objects

2013-11-24 Thread Jeff King
On Sun, Nov 24, 2013 at 03:44:44AM -0500, Jeff King wrote:

> In any code path where we call parse_object, we double-check that the
> result matches the sha1 we asked for. But low-level commands like
> cat-file just call read_sha1_file directly, and do not have such a
> check. We could add it, but I suspect the processing cost would be
> noticeable.

Curious, I tested this. It is noticeable. Here's the best-of-five
timings when running a --batch cat-file on every object in my git.git
repo, using the patch below:

  [before]
  real    0m12.941s
  user    0m12.700s
  sys     0m0.244s

  [after]
  real    0m15.800s
  user    0m15.472s
  sys     0m0.344s

So it's about 20% slower. I don't know what the right tradeoff is. It's
cool to check the data each time we look at it, but it does carry a
performance penalty. I notice elsewhere in git we are inconsistent. If
you call parse_object() on an object, you get the sha1 check. But if you
just call parse_commit() on something you know to be a commit (e.g.,
because you are traversing the history and looked it up as a parent
pointer), you do not. I don't know if that is oversight, or an
intentional performance decision.
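
To spell out that inconsistency, the two call paths look like this (a
sketch against the object.h/commit.h API of the time; error handling
trimmed):

void show_inconsistency(const unsigned char *sha1,
			const unsigned char *parent_sha1)
{
	/* parse_object re-hashes the contents and verifies them */
	struct object *obj = parse_object(sha1);

	/* lookup_commit just allocates; parse_commit then reads the
	   contents without any sha1 verification */
	struct commit *c = lookup_commit(parent_sha1);

	if (!obj || parse_commit(c))
		die("bad object");
}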

-Peff

---
diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index b2ca775..2b09773 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -199,6 +199,8 @@ static void print_object_or_die(int fd, const unsigned char *sha1,
 	if (type == OBJ_BLOB) {
 		if (stream_blob_to_fd(fd, sha1, NULL, 0) < 0)
 			die("unable to stream %s to stdout", sha1_to_hex(sha1));
+		if (check_sha1_signature(sha1, NULL, 0, NULL) < 0)
+			die("object %s sha1 mismatch", sha1_to_hex(sha1));
 	}
 	else {
 		enum object_type rtype;
@@ -208,6 +210,8 @@ static void print_object_or_die(int fd, const unsigned char *sha1,
 		contents = read_sha1_file(sha1, &rtype, &rsize);
 		if (!contents)
 			die("object %s disappeared", sha1_to_hex(sha1));
+		if (check_sha1_signature(sha1, contents, rsize, typename(rtype)) < 0)
+			die("object %s sha1 mismatch", sha1_to_hex(sha1));
 		if (rtype != type)
 			die("object %s changed type!?", sha1_to_hex(sha1));
 		if (rsize != size)


Re: [PATCH] drop support for experimental loose objects

2013-11-22 Thread Jeff King
On Thu, Nov 21, 2013 at 09:19:25PM +0100, Christian Couder wrote:

> Yeah, I think it might report wrong size in case of replaced objects
> for example.
> I looked at that following Junio's comment about the
> sha1_object_info() API, which,
> unlike read_sha1_file() API, does not interact with the replace mechanism:
>
> http://thread.gmane.org/gmane.comp.version-control.git/234023/
>
> I started to work on a patch about this but didn't take the time to
> finish and post it.

That seems kind of crazy. Would the fix be as simple as this:

diff --git a/sha1_file.c b/sha1_file.c
index 10676ba..a051d6c 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2529,6 +2529,8 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi)
 	struct pack_entry e;
 	int rtype;
 
+	sha1 = lookup_replace_object(sha1);
+
 	co = find_cached_object(sha1);
 	if (co) {
 		if (oi->typep)

or do we need some way for callers to turn off replacement? I notice
that read_sha1_file has such a feature, but it is only used in one
place. I guess we would need to audit all the sha1_object_info callers.

-Peff


Re: [PATCH] drop support for experimental loose objects

2013-11-22 Thread Christian Couder
On Fri, Nov 22, 2013 at 10:58 AM, Jeff King <p...@peff.net> wrote:
> On Thu, Nov 21, 2013 at 09:19:25PM +0100, Christian Couder wrote:
>
> > Yeah, I think it might report wrong size in case of replaced objects
> > for example.
> > I looked at that following Junio's comment about the
> > sha1_object_info() API, which,
> > unlike read_sha1_file() API, does not interact with the replace mechanism:
> >
> > http://thread.gmane.org/gmane.comp.version-control.git/234023/
> >
> > I started to work on a patch about this but didn't take the time to
> > finish and post it.
>
> That seems kind of crazy. Would the fix be as simple as this:
>
> diff --git a/sha1_file.c b/sha1_file.c
> index 10676ba..a051d6c 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -2529,6 +2529,8 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi)
>  	struct pack_entry e;
>  	int rtype;
>
> +	sha1 = lookup_replace_object(sha1);
> +
>  	co = find_cached_object(sha1);
>  	if (co) {
>  		if (oi->typep)
>
> or do we need some way for callers to turn off replacement? I notice
> that read_sha1_file has such a feature, but it is only used in one
> place.

Yeah, indeed, I asked myself such a question and that's why it is not
so simple unfortunately.

In sha1_file.c, there is:

void *read_sha1_file_extended(const unsigned char *sha1,
			      enum object_type *type,
			      unsigned long *size,
			      unsigned flag)
{
	void *data;
	char *path;
	const struct packed_git *p;
	const unsigned char *repl = (flag & READ_SHA1_FILE_REPLACE)
		? lookup_replace_object(sha1) : sha1;

	errno = 0;
	data = read_object(repl, type, size);
	...

And in cache.h, there is:

#define READ_SHA1_FILE_REPLACE 1
static inline void *read_sha1_file(const unsigned char *sha1, enum
object_type *type, unsigned long *size)
{
	return read_sha1_file_extended(sha1, type, size,
				       READ_SHA1_FILE_REPLACE);
}

So the READ_SHA1_FILE_REPLACE is a way to disable replacement at compile time.

But in my opinion if we want such a knob, we should use it when we set
the read_replace_refs global variable.
For example with something like this:

diff --git a/environment.c b/environment.c
index 0a15349..7c99af8 100644
--- a/environment.c
+++ b/environment.c
@@ -44,7 +44,7 @@ const char *editor_program;
 const char *askpass_program;
 const char *excludes_file;
 enum auto_crlf auto_crlf = AUTO_CRLF_FALSE;
-int read_replace_refs = 1; /* NEEDSWORK: rename to use_replace_refs */
+int read_replace_refs = READ_SHA1_FILE_REPLACE; /* NEEDSWORK: rename to use_replace_refs */
 enum eol core_eol = EOL_UNSET;
 enum safe_crlf safe_crlf = SAFE_CRLF_WARN;
 unsigned whitespace_rule_cfg = WS_DEFAULT_RULE;

@Junio what would you think about such a change?

> I guess we would need to audit all the sha1_object_info callers.

Yeah, but when I looked at them, there were not many that looked dangerous.

Thanks,
Christian.


Re: [PATCH] drop support for experimental loose objects

2013-11-22 Thread Jeff King
On Fri, Nov 22, 2013 at 12:04:01PM +0100, Christian Couder wrote:

> In sha1_file.c, there is:
>
> void *read_sha1_file_extended(const unsigned char *sha1,
> 			      enum object_type *type,
> 			      unsigned long *size,
> 			      unsigned flag)
> {
> 	void *data;
> 	char *path;
> 	const struct packed_git *p;
> 	const unsigned char *repl = (flag & READ_SHA1_FILE_REPLACE)
> 		? lookup_replace_object(sha1) : sha1;
>
> 	errno = 0;
> 	data = read_object(repl, type, size);
> 	...
>
> And in cache.h, there is:
>
> #define READ_SHA1_FILE_REPLACE 1
> static inline void *read_sha1_file(const unsigned char *sha1, enum
> object_type *type, unsigned long *size)
> {
> 	return read_sha1_file_extended(sha1, type, size,
> 				       READ_SHA1_FILE_REPLACE);
> }
>
> So the READ_SHA1_FILE_REPLACE is a way to disable replacement at compile time.

Is it? I did not have the impression anyone would ever redefine
READ_SHA1_FILE_REPLACE at compile time, but that it was a flag that
individual callsites would pass to read_sha1_file_extended to tell it
whether they were interested in replacements. And the obvious reasons to
not be are:

  1. You are doing some operation which needs real objects, like fsck or
 generating a packfile.

  2. You have already resolved any replacements, and want to make sure
 you are getting the same object used elsewhere (e.g., because you
 already printed its size :) ).

The only site which calls read_sha1_file_extended directly and does not
pass the REPLACE flag is in streaming.c. And that looks to be a case of
(2), since we resolve the replacement at the start in open_istream().
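
To illustrate the per-call choice (the wrapper is made up; the two
calls are the real ones quoted above):

static void compare_views(const unsigned char *sha1)
{
	enum object_type type;
	unsigned long size;
	void *replaced = read_sha1_file(sha1, &type, &size);
	void *raw = read_sha1_file_extended(sha1, &type, &size, 0);

	/* the two buffers differ iff refs/replace rewrites sha1 */
	free(replaced);
	free(raw);
}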

I am kind of surprised we do not need to do so for (1) in places like
pack-objects.c. Most of that code does not use read_sha1_file,
preferring instead to find the individual pack entries (for reuse). But
there are some calls to read_sha1_file, and I wonder if there is a bug
lurking there.

-Peff


Re: [PATCH] drop support for experimental loose objects

2013-11-22 Thread Christian Couder
On Fri, Nov 22, 2013 at 12:24 PM, Jeff King <p...@peff.net> wrote:
> On Fri, Nov 22, 2013 at 12:04:01PM +0100, Christian Couder wrote:
>
> > In sha1_file.c, there is:
> >
> > void *read_sha1_file_extended(const unsigned char *sha1,
> > 			      enum object_type *type,
> > 			      unsigned long *size,
> > 			      unsigned flag)
> > {
> > 	void *data;
> > 	char *path;
> > 	const struct packed_git *p;
> > 	const unsigned char *repl = (flag & READ_SHA1_FILE_REPLACE)
> > 		? lookup_replace_object(sha1) : sha1;
> >
> > 	errno = 0;
> > 	data = read_object(repl, type, size);
> > 	...
> >
> > And in cache.h, there is:
> >
> > #define READ_SHA1_FILE_REPLACE 1
> > static inline void *read_sha1_file(const unsigned char *sha1, enum
> > object_type *type, unsigned long *size)
> > {
> > 	return read_sha1_file_extended(sha1, type, size,
> > 				       READ_SHA1_FILE_REPLACE);
> > }
> >
> > So the READ_SHA1_FILE_REPLACE is a way to disable replacement at compile
> > time.
>
> Is it? I did not have the impression anyone would ever redefine
> READ_SHA1_FILE_REPLACE at compile time, but that it was a flag that
> individual callsites would pass to read_sha1_file_extended to tell it
> whether they were interested in replacements. And the obvious reasons to
> not be are:
>
>   1. You are doing some operation which needs real objects, like fsck or
>      generating a packfile.
>
>   2. You have already resolved any replacements, and want to make sure
>      you are getting the same object used elsewhere (e.g., because you
>      already printed its size :) ).
>
> The only site which calls read_sha1_file_extended directly and does not
> pass the REPLACE flag is in streaming.c. And that looks to be a case of
> (2), since we resolve the replacement at the start in open_istream().

Yeah, you are right. Sorry for overlooking this.

But anyway it looks redundant to me to have both this REPLACE flag and
the read_replace_refs global variable, so I think a proper solution
would involve some significant refactoring.

And if we decide to keep a REPLACE flag we might need to add one to
sha1_object_info_extended() too.

Thanks,
Christian.


Re: [PATCH] drop support for experimental loose objects

2013-11-22 Thread Jeff King
On Fri, Nov 22, 2013 at 03:23:31PM +0100, Christian Couder wrote:

> > The only site which calls read_sha1_file_extended directly and does not
> > pass the REPLACE flag is in streaming.c. And that looks to be a case of
> > (2), since we resolve the replacement at the start in open_istream().
>
> Yeah, you are right. Sorry for overlooking this.
>
> But anyway it looks redundant to me to have both this REPLACE flag and
> the read_replace_refs global variable, so I think a proper solution
> would involve some significant refactoring.

I don't think it is redundant. The global variable is about "does the
whole operation want the replace feature turned on?" and the flag is
about "does this particular callsite want the replace feature turned
on?" We use the feature iff both are true.

We could implement the callsite flag by tweaking the global right before
the call to read_sha1_file, but then we would have to remember to turn
it back on afterwards. If this were a language with dynamic scopes like
Perl, that would be easy. But in C you have to remember to reset it in
all code paths. :)
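
That is, the alternative looks like this sketch, where every exit path
has to carry the restore (read_replace_refs is the real global; the
wrapper is made up):

extern int read_replace_refs;

static void *read_object_no_replace(const unsigned char *sha1,
				    enum object_type *type,
				    unsigned long *size)
{
	int saved = read_replace_refs;
	void *data;

	read_replace_refs = 0;
	data = read_sha1_file(sha1, type, size);
	read_replace_refs = saved; /* must happen on *every* exit path */
	return data;
}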

In some cases it does make sense to turn the feature off for a whole
command (like pack-objects); using the global makes sense there. And
indeed, we seem to do it already in things like fsck, index-pack, etc.
So that answers my question of why I did not see more of case (1) in my
previous email: they do not need per-callsite disabling, because they do
it for the whole command.

> And if we decide to keep a REPLACE flag we might need to add one to
> sha1_object_info_extended() too.

Yes, but somebody needs to look at all of the callsites and decide which
form they want. :)

I did a brief skim, and the ones I noticed were:

  - several spots in index-pack, pack-objects, etc. But these are
already covered by unsetting read_replace_refs.

  - replace_object looks at both the original and new object to compare
their types (due to your recent patches); it would obviously want to
get the true type of the original object

  - When creating tags and trees, we care about the type of the object
(the former for the type line of the tag, the latter to set the
mode). What should they do with replace objects? As above, it is
probably insane to switch types, so it may not matter for practical
purposes.

  - istream_source in streaming.c would probably want to turn it off for
the same reason it uses read_sha1_file_extended

So I think most sites would be unaffected, but due to the second and
fourth item in my list above, we would need a flag for
sha1_object_info_extended.

-Peff


Re: [PATCH] drop support for experimental loose objects

2013-11-22 Thread Junio C Hamano
Jeff King <p...@peff.net> writes:

> I guess we would need to audit all the sha1_object_info callers.

Yup; I agree that was the conclusion of Christian's thread.




Re: [PATCH] drop support for experimental loose objects

2013-11-22 Thread Joey Hess
Jeff King wrote:
> > BTW, I've also seen git cat-file --batch report wrong sizes for objects,
>
> Hrm. For --batch, I'd think we would open the whole object and notice
> the corruption, even with the current code. But for --batch-check, we
> use sha1_object_info, and for an experimental object, we do not need
> to de-zlib the object at all.  So we end up reporting whatever crap we
> decipher from the garbage bytes.  My patch would fix that, as we would
> not incorrectly guess an object is experimental anymore.
>
> If you have specific cases that trigger even after my patch, I'd be
> interested to see them.

I was seeing it with --batch, not --batch-check. Probably only with the
old experimental loose object format. In one case, --batch reported a
size of 20k, and only output 1k of data. With the object file I sent
earlier, --batch reports a huge size, and fails trying to allocate the
memory for it before it can output anything.

I also have seen at least once a corrupt pack file that caused git to try
and allocate an absurd quantity of memory.

Unfortunately I do not currently have exemplars for these, although I
should be able to run a less robust version of my code and find them
again. ;) Will try to find time to do that.

BTW, the fuzzing code is here:
http://source.git-repair.branchable.com/?p=source.git;a=blob;f=Git/Destroyer.hs

-- 
see shy jo




Re: [PATCH] drop support for experimental loose objects

2013-11-22 Thread Jonathan Nieder
Jeff King wrote:

>  sha1_file.c                    |  74 -

Yay!

>  t/t1013-loose-object-format.sh |  66 --

Hmm, not all of these tests are about the experimental format.  Do
we really want to remove them all?

Thanks,
Jonathan


Re: [PATCH] drop support for experimental loose objects

2013-11-22 Thread Jeff King
On Fri, Nov 22, 2013 at 04:24:05PM -0800, Jonathan Nieder wrote:

> >  t/t1013-loose-object-format.sh |  66 --
>
> Hmm, not all of these tests are about the experimental format.  Do
> we really want to remove them all?

I think so. They were not all testing the experimental format, but they
were about making sure the "is-it-experimental" heuristic triggered
properly with various zlib settings.

Now that we do not apply that heuristic, there is nothing (in git) to
test. We feed the contents straight to zlib. We could keep the objects
with small window size as a test, but we are not really testing git; we
are testing zlib at that point.

-Peff


Re: [PATCH] drop support for experimental loose objects

2013-11-22 Thread Jonathan Nieder
Jeff King wrote:
> On Fri, Nov 22, 2013 at 04:24:05PM -0800, Jonathan Nieder wrote:
>
> > >  t/t1013-loose-object-format.sh |  66 --
> >
> > Hmm, not all of these tests are about the experimental format.  Do
> > we really want to remove them all?
>
> I think so. They were not all testing the experimental format, but they
> were about making sure the "is-it-experimental" heuristic triggered
> properly with various zlib settings.
>
> Now that we do not apply that heuristic, there is nothing (in git) to
> test. We feed the contents straight to zlib.

Ok, makes sense.

In principle the tests are still useful as futureproofing in case git
starts to sanity-check the objects as a way to notice corruption
earlier or something.  But in practice, that kind of futureproofing is
probably not worth the extra tests to maintain.

For what it's worth,
Reviewed-by: Jonathan Nieder <jrnie...@gmail.com>


Re: [PATCH] drop support for experimental loose objects

2013-11-21 Thread Jeff King
On Thu, Nov 21, 2013 at 06:41:58AM -0500, Jeff King wrote:

> The test objects removed are all binary. Git seems to guess a few as
> non-binary, though, because they don't contain any NULs, and includes
> gross binary bytes in the patch below. In theory the mail's transfer
> encoding will take care of this. We'll see, I guess. :)

Nope; something munged it along the way (I sent it as C-T-E: 8bit, but
vger rewrites to QP before re-mailing, so that is the likely spot).

Here's the same patch, but with this stuck into .git/info/attributes:

  /t/t1013/objects/*/* binary

which should work.

-- >8 --
Subject: drop support for experimental loose objects

In git v1.4.3, we introduced a new loose object format that
encoded some object information outside of the zlib stream.
Ultimately the format was dropped in v1.5.3, but we kept the
reading side around to help people migrate objects. Each
time we open a loose object, we use a heuristic to check
whether it is in the normal loose format, or the
experimental one.

This heuristic is robust in the face of valid data, but it
tends to treat corrupted or garbage data as an experimental
object. With the regular format, we would notice quickly
that zlib's crc does not check out and complain. With the
experimental object, we are likely to extract a nonsensical
object size and try to allocate a huge buffer, resulting in
xmalloc calling die.

This latter behavior is much worse, for two reasons. One,
git reports an allocation error when the real error is
corruption. And two, the program dies unconditionally, so
you cannot even run fsck (which would otherwise ignore the
broken object and keep going).

We could try to improve the heuristic to err on the side of
normal objects in the face of corruption, but there is
really little point. The experimental format is long-dead,
and was never enabled by default to begin with. We can
instead simply remove it. The only affected repository would
be one that explicitly set core.legacyheaders in 2007, and
then never repacked in the intervening 6 years.

Signed-off-by: Jeff King <p...@peff.net>
---
 sha1_file.c                                    |  74 -
 t/t1013-loose-object-format.sh                 |  66 --
 .../14/9cedb5c46929d18e0f118e9fa31927487af3b6  | Bin 117 -> 0 bytes
 .../16/56f9233d999f61ef23ef390b9c71d75399f435  | Bin 17 -> 0 bytes
 .../1e/72a6b2c4a577ab0338860fa9fe87f761fc9bbd  | Bin 18 -> 0 bytes
 .../25/7cc5642cb1a054f08cc83f2d943e56fd3ebe99  | Bin 19 -> 0 bytes
 .../2e/65efe2a145dda7ee51d1741299f848e5bf752e  | Bin 10 -> 0 bytes
 .../6b/aee0540ea990d9761a3eb9ab183003a71c3696  | Bin 181 -> 0 bytes
 .../70/e6a83d8dcb26fc8bc0cf702e2ddeb6adca18fd  | Bin 26 -> 0 bytes
 .../76/e7fa9941f4d5f97f64fea65a2cba436bc79cbb  | Bin 155 -> 0 bytes
 .../78/75c6237d3fcdd0ac2f0decc7d3fa6a50b66c09  | Bin 139 -> 0 bytes
 .../7a/37b887a73791d12d26c0d3e39568a8fb0fa6e8  | Bin 54 -> 0 bytes
 .../85/df50785d62d3b05ab03d9cbf7e4a0b49449730  | Bin 13 -> 0 bytes
 .../8d/4e360d6c70fbd72411991c02a09c442cf7a9fa  | Bin 156 -> 0 bytes
 .../95/b1625de3ba8b2214d1e0d0591138aea733f64f  | Bin 252 -> 0 bytes
 .../9a/e9e86b7bd6cb1472d9373702d8249973da0832  | Bin 11 -> 0 bytes
 .../bd/15045f6ce8ff75747562173640456a394412c8  | Bin 34 -> 0 bytes
 .../e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391  | Bin 9 -> 0 bytes
 .../f8/16d5255855ac160652ee5253b06cd8ee14165a  | Bin 116 -> 0 bytes
 19 files changed, 140 deletions(-)
 delete mode 100755 t/t1013-loose-object-format.sh
 delete mode 100644 t/t1013/objects/14/9cedb5c46929d18e0f118e9fa31927487af3b6
 delete mode 100644 t/t1013/objects/16/56f9233d999f61ef23ef390b9c71d75399f435
 delete mode 100644 t/t1013/objects/1e/72a6b2c4a577ab0338860fa9fe87f761fc9bbd
 delete mode 100644 t/t1013/objects/25/7cc5642cb1a054f08cc83f2d943e56fd3ebe99
 delete mode 100644 t/t1013/objects/2e/65efe2a145dda7ee51d1741299f848e5bf752e
 delete mode 100644 t/t1013/objects/6b/aee0540ea990d9761a3eb9ab183003a71c3696
 delete mode 100644 t/t1013/objects/70/e6a83d8dcb26fc8bc0cf702e2ddeb6adca18fd
 delete mode 100644 t/t1013/objects/76/e7fa9941f4d5f97f64fea65a2cba436bc79cbb
 delete mode 100644 t/t1013/objects/78/75c6237d3fcdd0ac2f0decc7d3fa6a50b66c09
 delete mode 100644 t/t1013/objects/7a/37b887a73791d12d26c0d3e39568a8fb0fa6e8
 delete mode 100644 t/t1013/objects/85/df50785d62d3b05ab03d9cbf7e4a0b49449730
 delete mode 100644 t/t1013/objects/8d/4e360d6c70fbd72411991c02a09c442cf7a9fa
 delete mode 100644 t/t1013/objects/95/b1625de3ba8b2214d1e0d0591138aea733f64f
 delete mode 100644 t/t1013/objects/9a/e9e86b7bd6cb1472d9373702d8249973da0832
 delete mode 100644 t/t1013/objects/bd/15045f6ce8ff75747562173640456a394412c8
 delete mode 100644 t/t1013/objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391
 delete mode 100644 t/t1013/objects/f8/16d5255855ac160652ee5253b06cd8ee14165a

diff --git a/sha1_file.c b/sha1_file.c
index 7dadd04..a72fcb6 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1442,51 

Re: [PATCH] drop support for experimental loose objects

2013-11-21 Thread Jeff King
On Thu, Nov 21, 2013 at 07:43:03PM +0700, Duy Nguyen wrote:

> > -	if (experimental_loose_object(map)) {
>
> Perhaps keep this..
>
> > -		/*
> > -		 * The old experimental format we no longer produce;
> > -		 * we can still read it.
> > -		 */
> > -		used = unpack_object_header_buffer(map, mapsize, &type, &size);
> > -		if (!used || !valid_loose_object_type[type])
> > -			return -1;
> > -		map += used;
> > -		mapsize -= used;
> > -
> > -		/* Set up the stream for the rest.. */
> > -		stream->next_in = map;
> > -		stream->avail_in = mapsize;
> > -		git_inflate_init(stream);
> > -
> > -		/* And generate the fake traditional header */
> > -		stream->total_out = 1 + snprintf(buffer, bufsiz, "%s %lu",
> > -						 typename(type), size);
> > -		return 0;
>
> and replace all this with
>
> die("detected an object in obsolete format, please repack the
> repository using a version before XXX");

That would eliminate the second part of my purpose, which is to not
die() on a corrupted object because we incorrectly guess that it is
experimental.

If we think these objects are in the wild, the right thing to do would
be to warn() and continue. But I really find it hard to believe any such
objects exist at this point.

-Peff


Re: [PATCH] drop support for experimental loose objects

2013-11-21 Thread Junio C Hamano
Jeff King <p...@peff.net> writes:

> We could try to improve the heuristic to err on the side of
> normal objects in the face of corruption, but there is
> really little point. The experimental format is long-dead,
> and was never enabled by default to begin with. We can
> instead simply remove it. The only affected repository would
> be one that explicitly set core.legacyheaders in 2007, and
> then never repacked in the intervening 6 years.

Sounds sensible.  Thanks.


Re: [PATCH] drop support for experimental loose objects

2013-11-21 Thread Christian Couder
On Thu, Nov 21, 2013 at 5:04 PM, Joey Hess <j...@kitenet.net> wrote:

> BTW, I've also seen git cat-file --batch report wrong sizes for objects,
> sometimes without crashing. This is particularly problematic because if
> the object size is wrong, it's very hard to detect the actual end of the
> object output in the batch mode stream.

Yeah, I think it might report wrong size in case of replaced objects
for example.
I looked at that following Junio's comment about the
sha1_object_info() API, which,
unlike read_sha1_file() API, does not interact with the replace mechanism:

http://thread.gmane.org/gmane.comp.version-control.git/234023/

I started to work on a patch about this but didn't take the time to
finish and post it.

Thanks,
Christian.


Re: [PATCH] drop support for experimental loose objects

2013-11-21 Thread Duy Nguyen
On Thu, Nov 21, 2013 at 6:48 PM, Jeff King <p...@peff.net> wrote:
> @@ -1514,14 +1469,6 @@ unsigned long unpack_object_header_buffer(const unsigned char *buf,
>
>  int unpack_sha1_header(git_zstream *stream, unsigned char *map, unsigned long mapsize, void *buffer, unsigned long bufsiz)
>  {
> -	unsigned long size, used;
> -	static const char valid_loose_object_type[8] = {
> -		0, /* OBJ_EXT */
> -		1, 1, 1, 1, /* commit, tree, blob, tag */
> -		0, /* delta and others are invalid in a loose object */
> -	};
> -	enum object_type type;
> -
>  	/* Get the data stream */
>  	memset(stream, 0, sizeof(*stream));
>  	stream->next_in = map;
> @@ -1529,27 +1476,6 @@ int unpack_sha1_header(git_zstream *stream, unsigned char *map, unsigned long ma
>  	stream->next_out = buffer;
>  	stream->avail_out = bufsiz;
>
> -	if (experimental_loose_object(map)) {

Perhaps keep this..

> -		/*
> -		 * The old experimental format we no longer produce;
> -		 * we can still read it.
> -		 */
> -		used = unpack_object_header_buffer(map, mapsize, &type, &size);
> -		if (!used || !valid_loose_object_type[type])
> -			return -1;
> -		map += used;
> -		mapsize -= used;
> -
> -		/* Set up the stream for the rest.. */
> -		stream->next_in = map;
> -		stream->avail_in = mapsize;
> -		git_inflate_init(stream);
> -
> -		/* And generate the fake traditional header */
> -		stream->total_out = 1 + snprintf(buffer, bufsiz, "%s %lu",
> -						 typename(type), size);
> -		return 0;

and replace all this with

die("detected an object in obsolete format, please repack the
repository using a version before XXX");

?

> -	}
>  	git_inflate_init(stream);
>  	return git_inflate(stream, 0);
>  }
-- 
Duy


Re: [PATCH] drop support for experimental loose objects

2013-11-21 Thread Keshav Kini
Duy Nguyen <pclo...@gmail.com> writes:

> On Thu, Nov 21, 2013 at 6:48 PM, Jeff King <p...@peff.net> wrote:
> > @@ -1514,14 +1469,6 @@ unsigned long unpack_object_header_buffer(const unsigned char *buf,
> >
> >  int unpack_sha1_header(git_zstream *stream, unsigned char *map, unsigned long mapsize, void *buffer, unsigned long bufsiz)
> >  {
> > -	unsigned long size, used;
> > -	static const char valid_loose_object_type[8] = {
> > -		0, /* OBJ_EXT */
> > -		1, 1, 1, 1, /* commit, tree, blob, tag */
> > -		0, /* delta and others are invalid in a loose object */
> > -	};
> > -	enum object_type type;
> > -
> >  	/* Get the data stream */
> >  	memset(stream, 0, sizeof(*stream));
> >  	stream->next_in = map;
> > @@ -1529,27 +1476,6 @@ int unpack_sha1_header(git_zstream *stream, unsigned char *map, unsigned long ma
> >  	stream->next_out = buffer;
> >  	stream->avail_out = bufsiz;
> >
> > -	if (experimental_loose_object(map)) {
>
> Perhaps keep this..
>
> > -		/*
> > -		 * The old experimental format we no longer produce;
> > -		 * we can still read it.
> > -		 */
> > -		used = unpack_object_header_buffer(map, mapsize, &type, &size);
> > -		if (!used || !valid_loose_object_type[type])
> > -			return -1;
> > -		map += used;
> > -		mapsize -= used;
> > -
> > -		/* Set up the stream for the rest.. */
> > -		stream->next_in = map;
> > -		stream->avail_in = mapsize;
> > -		git_inflate_init(stream);
> > -
> > -		/* And generate the fake traditional header */
> > -		stream->total_out = 1 + snprintf(buffer, bufsiz, "%s %lu",
> > -						 typename(type), size);
> > -		return 0;
>
> and replace all this with
>
> die("detected an object in obsolete format, please repack the
> repository using a version before XXX");
>
> ?

Wouldn't that fail to solve the issue of `git fsck` dying on corrupt
data?  experimental_loose_object() would need to be rewritten to be more
conservative in deciding that an object was in the experimental loose
object format.

-Keshav



Re: [PATCH] drop support for experimental loose objects

2013-11-21 Thread Jeff King
On Thu, Nov 21, 2013 at 12:04:26PM -0400, Joey Hess wrote:

> Jeff King wrote:
> > This latter behavior is much worse for two reasons. One,
> > git reports an allocation error when the real error is
> > corruption. And two, the program dies unconditionally, so
> > you cannot even run fsck (which would otherwise ignore the
> > broken object and keep going).
>
> BTW, I've also seen git cat-file --batch report wrong sizes for objects,
> sometimes without crashing. This is particularly problematic because if
> the object size is wrong, it's very hard to detect the actual end of the
> object output in the batch mode stream.

Hrm. For --batch, I'd think we would open the whole object and notice
the corruption, even with the current code. But for --batch-check, we
use sha1_object_info, and for an experimental object, we do not need
to de-zlib the object at all.  So we end up reporting whatever crap we
decipher from the garbage bytes.  My patch would fix that, as we would
not incorrectly guess an object is experimental anymore.

If you have specific cases that trigger even after my patch, I'd be
interested to see them.

-Peff


Re: [PATCH] drop support for experimental loose objects

2013-11-21 Thread Joey Hess
Jeff King wrote:
> This latter behavior is much worse for two reasons. One,
> git reports an allocation error when the real error is
> corruption. And two, the program dies unconditionally, so
> you cannot even run fsck (which would otherwise ignore the
> broken object and keep going).

BTW, I've also seen git cat-file --batch report wrong sizes for objects,
sometimes without crashing. This is particularly problematic because if
the object size is wrong, it's very hard to detect the actual end of the
object output in the batch mode stream.

-- 
see shy jo

