Re: [PATCH] drop support for experimental loose objects
On Mon, Nov 25, 2013 at 10:35:19AM -0800, Junio C Hamano wrote:

> >  	if (type == OBJ_BLOB) {
> >  		if (stream_blob_to_fd(fd, sha1, NULL, 0) < 0)
> >  			die("unable to stream %s to stdout", sha1_to_hex(sha1));
> > +		if (check_sha1_signature(sha1, NULL, 0, NULL) < 0)
> > +			die("object %s sha1 mismatch", sha1_to_hex(sha1));
>
> check_sha1_signature() opens the object again and streams the data.
> Essentially the read side is doing twice the work with that patch,
> isn't it?

Yes. I considered that, but I also got ~20% slow-down when just doing
commits/trees, which are in-core and can re-hash the same buffer. So
since my with-blobs numbers backed that up, I didn't think too much
further on it.

But there is something curious about the numbers I posted. It takes 12s
without the check, and 15s with the check. So the extra hashing adds 3s.
But if we are reading each blob twice, and we would expect blob reading
to be a significant chunk of that 12s, then shouldn't we expect much
more than a 3s increase?

The answer must be that either we are not streaming as much as I think,
or re-reading the data is much cheaper than I expect. And I think it is
the latter. The vast majority of blobs in git.git will be stored as
packed deltas. That means the streaming code will fall back to doing the
regular in-core access. We _could_ therefore use that in-core copy to do
our sha1 check rather than streaming; but of course we never get access
to it outside of stream_blob_to_fd, and it is discarded. However, we do
keep a copy in the delta base cache. When we immediately ask to unpack
the exact same entry for check_sha1_signature, we can pull the copy
straight out of the cache without having to re-inflate the object.
After applying the patch below on top of yours, my numbers remain the
same:

diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index b2ca775..e3ff677 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -197,7 +197,7 @@ static void print_object_or_die(int fd, const unsigned char *sha1,
 		       enum object_type type, unsigned long size)
 {
 	if (type == OBJ_BLOB) {
-		if (stream_blob_to_fd(fd, sha1, NULL, 0) < 0)
+		if (stream_blob_to_fd(fd, sha1, NULL, STREAMING_VERIFY_OBJECT_NAME) < 0)
 			die("unable to stream %s to stdout", sha1_to_hex(sha1));
 	}
 	else {
@@ -208,6 +208,8 @@ static void print_object_or_die(int fd, const unsigned char *sha1,
 		contents = read_sha1_file(sha1, &rtype, &rsize);
 		if (!contents)
 			die("object %s disappeared", sha1_to_hex(sha1));
+		if (check_sha1_signature(sha1, contents, rsize, typename(rtype)) < 0)
+			die("object %s sha1 mismatch", sha1_to_hex(sha1));
 		if (rtype != type)
 			die("object %s changed type!?", sha1_to_hex(sha1));
 		if (rsize != size)

> I wonder if we want to extend the stream_blob_to_fd() API to optionally
> allow the caller to ask to validate that the returned data is
> consistent with the object name the caller asked the data for.
> Something along the lines of the attached weatherballoon patch?

Yes, I think it is a reasonable addition to the streaming API. However,
I do not think there are any callsites which would currently want it.
All of the current users of stream_blob_to_fd use read_sha1_file as
their alternative, and not parse_object. So we are not verifying the
sha1 in either case (we may want to change that, of course, but that is
a bigger decision than just trying to bring streaming and non-streaming
code paths into parity).

I also wondered if parse_object itself had problems with double-reading
or failing to verify. But its use goes the opposite direction; it wants
to verify the sha1 of the blob object, but it knows that it does not
actually need the data.
So it streams (as of 090ea12) to check the signature, but then discards
each buffer-full after hashing it.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: [PATCH] drop support for experimental loose objects
Jeff King <p...@peff.net> writes:

> The vast majority of blobs in git.git will be stored as packed deltas.
> That means the streaming code will fall back to doing the regular
> in-core access. We _could_ therefore use that in-core copy to do our
> sha1 check rather than streaming; but of course we never get access to
> it outside of stream_blob_to_fd, and it is discarded. However, we do
> keep a copy in the delta base cache. When we immediately ask to unpack
> the exact same entry for check_sha1_signature, we can pull the copy
> straight out of the cache without having to re-inflate the object.

OK, that explains the overhead of 20% that is lower than one would
naïvely expect. Thanks.

> Yes, I think it is a reasonable addition to the streaming API. However,
> I do not think there are any callsites which would currently want it.
> All of the current users of stream_blob_to_fd use read_sha1_file as
> their alternative, and not parse_object. So we are not verifying the
> sha1 in either case (we may want to change that, of course, but that is
> a bigger decision than just trying to bring streaming and non-streaming
> code paths into parity).

True. I am not offhand sure if we want to make read_sha1_file() rehash,
but I agree that it is a question different from what we are asking in
this discussion.

> I also wondered if parse_object itself had problems with double-reading
> or failing to verify. But its use goes the opposite direction; it wants
> to verify the sha1 of the blob object, but it knows that it does not
> actually need the data. So it streams (as of 090ea12) to check the
> signature, but then discards each buffer-full after hashing it.
Re: [PATCH] drop support for experimental loose objects
On Wed, Nov 27, 2013 at 10:57:14AM -0800, Junio C Hamano wrote:

> > Yes, I think it is a reasonable addition to the streaming API.
> > However, I do not think there are any callsites which would currently
> > want it. All of the current users of stream_blob_to_fd use
> > read_sha1_file as their alternative, and not parse_object. So we are
> > not verifying the sha1 in either case (we may want to change that, of
> > course, but that is a bigger decision than just trying to bring
> > streaming and non-streaming code paths into parity).
>
> True. I am not offhand sure if we want to make read_sha1_file() rehash,
> but I agree that it is a question different from what we are asking in
> this discussion.

I'm torn on that. Having git verify everything all the time is kind of
cool. But it _does_ have a performance impact, and the vast majority of
the time nothing got corrupted since the last time we looked at the
object. It seems like periodically running "git fsck" is a smarter way
of doing periodic checks. We already are careful when objects are coming
into the repository, and I think that is a sensible boundary (and I am
increasingly of the opinion that running with transfer.fsckobjects off
is not a good idea).

The checks in parse_object seem hack-ish to me, because they catch some
random subset of the times we access objects (e.g., calling parse_object
on a commit sha1 will check, but calling parse_commit on an unparsed
commit struct will not). If anything, I'd suggest moving the checking
down to read_sha1_file, which would add it fairly consistently
everywhere, and then tying it to a config option (off for high
performance, on for slower-but-meticulous).

-Peff
Re: [PATCH] drop support for experimental loose objects
Jeff King <p...@peff.net> writes:

> On Sun, Nov 24, 2013 at 03:44:44AM -0500, Jeff King wrote:
>
> > In any code path where we call parse_object, we double-check that the
> > result matches the sha1 we asked for. But low-level commands like
> > cat-file just call read_sha1_file directly, and do not have such a
> > check. We could add it, but I suspect the processing cost would be
> > noticeable.
>
> Curious, I tested this. It is noticeable. Here's the best-of-five
> timings for the patch below, when running a --batch cat-file on every
> object in my git.git repo:
>
>   [before]
>   real    0m12.941s
>   user    0m12.700s
>   sys     0m0.244s
>
>   [after]
>   real    0m15.800s
>   user    0m15.472s
>   sys     0m0.344s
>
> So it's about 20% slower. I don't know what the right tradeoff is. It's
> cool to check the data each time we look at it, but it does carry a
> performance penalty.
>
> diff --git a/builtin/cat-file.c b/builtin/cat-file.c
> index b2ca775..2b09773 100644
> --- a/builtin/cat-file.c
> +++ b/builtin/cat-file.c
> @@ -199,6 +199,8 @@ static void print_object_or_die(int fd, const unsigned char *sha1,
>  	if (type == OBJ_BLOB) {
>  		if (stream_blob_to_fd(fd, sha1, NULL, 0) < 0)
>  			die("unable to stream %s to stdout", sha1_to_hex(sha1));
> +		if (check_sha1_signature(sha1, NULL, 0, NULL) < 0)
> +			die("object %s sha1 mismatch", sha1_to_hex(sha1));

check_sha1_signature() opens the object again and streams the data.
Essentially the read side is doing twice the work with that patch,
isn't it?

I wonder if we want to extend the stream_blob_to_fd() API to optionally
allow the caller to ask to validate that the returned data is consistent
with the object name the caller asked the data for. Something along the
lines of the attached weatherballoon patch?
 builtin/fsck.c |  3 ++-
 entry.c        |  2 +-
 streaming.c    | 29 ++++++++++++++++++++++++++++-
 streaming.h    |  4 +++-
 4 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index 97ce678..f42ed9c 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -237,7 +237,8 @@ static void check_unreachable_object(struct object *obj)
 			if (!(f = fopen(filename, "w")))
 				die_errno("Could not open '%s'", filename);
 			if (obj->type == OBJ_BLOB) {
-				if (stream_blob_to_fd(fileno(f), obj->sha1, NULL, 1))
+				if (stream_blob_to_fd(fileno(f), obj->sha1, NULL,
+						      STREAMING_OUTPUT_CAN_SEEK))
 					die_errno("Could not write '%s'", filename);
 			} else
 				fprintf(f, "%s\n", sha1_to_hex(obj->sha1));
diff --git a/entry.c b/entry.c
index 7b7aa81..b3bc827 100644
--- a/entry.c
+++ b/entry.c
@@ -127,7 +127,7 @@ static int streaming_write_entry(const struct cache_entry *ce, char *path,
 	if (fd < 0)
 		return -1;
-	result |= stream_blob_to_fd(fd, ce->sha1, filter, 1);
+	result |= stream_blob_to_fd(fd, ce->sha1, filter, STREAMING_OUTPUT_CAN_SEEK);
 	*fstat_done = fstat_output(fd, state, statbuf);
 	result |= close(fd);
diff --git a/streaming.c b/streaming.c
index debe904..50599df 100644
--- a/streaming.c
+++ b/streaming.c
@@ -2,6 +2,7 @@
  * Copyright (c) 2011, Google Inc.
  */
 #include "cache.h"
+#include "object.h"
 #include "streaming.h"
 
 enum input_source {
@@ -496,19 +497,33 @@ static open_method_decl(incore)
 
 int stream_blob_to_fd(int fd, unsigned const char *sha1, struct stream_filter *filter,
-		      int can_seek)
+		      unsigned flags)
 {
 	struct git_istream *st;
 	enum object_type type;
 	unsigned long sz;
 	ssize_t kept = 0;
 	int result = -1;
+	int can_seek = flags & STREAMING_OUTPUT_CAN_SEEK;
+	int want_verify = flags & STREAMING_VERIFY_OBJECT_NAME;
+	git_SHA_CTX ctx;
 
 	st = open_istream(sha1, &type, &sz, filter);
 	if (!st)
 		return result;
 	if (type != OBJ_BLOB)
 		goto close_and_exit;
+
+	if (want_verify) {
+		char hdr[32];
+		int hdrlen;
+
+		git_SHA1_Init(&ctx);
+		hdrlen = sprintf(hdr, "%s %lu", typename(type), sz) + 1;
+		git_SHA1_Update(&ctx, hdr, hdrlen);
+	}
+
 	for (;;) {
 		char buf[1024 * 16];
 		ssize_t wrote, holeto;
@@ -518,6 +533,10 @@ int stream_blob_to_fd(int fd, unsigned const char *sha1, struct stream_filter *f
 			goto close_and_exit;
 		if (!readlen)
 			break;
+
+		if (want_verify)
+			git_SHA1_Update(&ctx, buf, readlen);
+
 		if (can_seek && sizeof(buf) == readlen) {
Re: [PATCH] drop support for experimental loose objects
On Fri, Nov 22, 2013 at 01:28:59PM -0400, Joey Hess wrote:

> > Hrm. For --batch, I'd think we would open the whole object and notice
> > the corruption, even with the current code. But for --batch-check, we
> > use sha1_object_info, and for an experimental object, we do not need
> > to de-zlib the object at all. So we end up reporting whatever crap we
> > decipher from the garbage bytes. My patch would fix that, as we would
> > not incorrectly guess an object is experimental anymore. If you have
> > specific cases that trigger even after my patch, I'd be interested to
> > see them.
>
> I was seeing it with --batch, not --batch-check. Probably only with the
> old experimental loose object format. In one case, --batch reported a
> size of 20k, and only output 1k of data. With the object file I sent
> earlier, --batch reports a huge size, and fails trying to allocate the
> memory for it before it can output anything.

Ah, yeah, that makes sense. We report the size via sha1_object_info,
whether we are going to output the object itself or not. So we might
report the bogus size, not noticing the corruption, and then hit an
error and bail when sending the object itself.

My patch makes that better in some cases, because we'll notice more
corruption when looking at the header of the object for
sha1_object_info. But fundamentally, we may still hit an error while
outputting the bytes. Reading the cat-file code, it looks like we should
always die if we hit an error, so at least a reader will get premature
EOF (and not the beginning of another object).

I can believe there is some specific corruption that yields a valid zlib
stream that is a different size than the object advertises. Since
v1.8.4, we double-check that the size we advertised matches what we are
about to write. But the streaming-blob code path does not include that
check, so it might still be affected. It would be pretty easy and cheap
to detect that case.
In any code path where we call parse_object, we double-check that the
result matches the sha1 we asked for. But low-level commands like
cat-file just call read_sha1_file directly, and do not have such a
check. We could add it, but I suspect the processing cost would be
noticeable.

> I also have seen at least once a corrupt pack file that caused git to
> try and allocate an absurd quantity of memory.

I'm not surprised by that. The packfiles contain size information
outside of the checksummed zlib data, and we pre-allocate the buffer
before reading the zlib data. We could try to detect it, but then we are
hard-coding the definition of "absurd". The current definition is "we
asked the OS for memory, and it did not give it to us". :)

-Peff
Re: [PATCH] drop support for experimental loose objects
On Sun, Nov 24, 2013 at 03:44:44AM -0500, Jeff King wrote:

> In any code path where we call parse_object, we double-check that the
> result matches the sha1 we asked for. But low-level commands like
> cat-file just call read_sha1_file directly, and do not have such a
> check. We could add it, but I suspect the processing cost would be
> noticeable.

Curious, I tested this. It is noticeable. Here's the best-of-five
timings for the patch below, when running a --batch cat-file on every
object in my git.git repo:

  [before]
  real    0m12.941s
  user    0m12.700s
  sys     0m0.244s

  [after]
  real    0m15.800s
  user    0m15.472s
  sys     0m0.344s

So it's about 20% slower. I don't know what the right tradeoff is. It's
cool to check the data each time we look at it, but it does carry a
performance penalty.

I notice elsewhere in git we are inconsistent. If you call
parse_object() on an object, you get the sha1 check. But if you just
call parse_commit() on something you know to be a commit (e.g., because
you are traversing the history and looked it up as a parent pointer),
you do not. I don't know if that is oversight, or an intentional
performance decision.
-Peff

---
diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index b2ca775..2b09773 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -199,6 +199,8 @@ static void print_object_or_die(int fd, const unsigned char *sha1,
 	if (type == OBJ_BLOB) {
 		if (stream_blob_to_fd(fd, sha1, NULL, 0) < 0)
 			die("unable to stream %s to stdout", sha1_to_hex(sha1));
+		if (check_sha1_signature(sha1, NULL, 0, NULL) < 0)
+			die("object %s sha1 mismatch", sha1_to_hex(sha1));
 	}
 	else {
 		enum object_type rtype;
@@ -208,6 +210,8 @@ static void print_object_or_die(int fd, const unsigned char *sha1,
 		contents = read_sha1_file(sha1, &rtype, &rsize);
 		if (!contents)
 			die("object %s disappeared", sha1_to_hex(sha1));
+		if (check_sha1_signature(sha1, contents, rsize, typename(rtype)) < 0)
+			die("object %s sha1 mismatch", sha1_to_hex(sha1));
 		if (rtype != type)
 			die("object %s changed type!?", sha1_to_hex(sha1));
 		if (rsize != size)
Re: [PATCH] drop support for experimental loose objects
On Thu, Nov 21, 2013 at 09:19:25PM +0100, Christian Couder wrote:

> Yeah, I think it might report a wrong size in case of replaced objects,
> for example. I looked at that following Junio's comment about the
> sha1_object_info() API, which, unlike the read_sha1_file() API, does
> not interact with the replace mechanism:
>
> http://thread.gmane.org/gmane.comp.version-control.git/234023/
>
> I started to work on a patch about this but didn't take the time to
> finish and post it.

That seems kind of crazy. Would the fix be as simple as this:

diff --git a/sha1_file.c b/sha1_file.c
index 10676ba..a051d6c 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2529,6 +2529,8 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi)
 	struct pack_entry e;
 	int rtype;
 
+	sha1 = lookup_replace_object(sha1);
+
 	co = find_cached_object(sha1);
 	if (co) {
 		if (oi->typep)

or do we need some way for callers to turn off replacement? I notice
that read_sha1_file has such a feature, but it is only used in one
place. I guess we would need to audit all the sha1_object_info callers.

-Peff
Re: [PATCH] drop support for experimental loose objects
On Fri, Nov 22, 2013 at 10:58 AM, Jeff King <p...@peff.net> wrote:
> On Thu, Nov 21, 2013 at 09:19:25PM +0100, Christian Couder wrote:
>
> > Yeah, I think it might report a wrong size in case of replaced
> > objects, for example. I looked at that following Junio's comment
> > about the sha1_object_info() API, which, unlike the read_sha1_file()
> > API, does not interact with the replace mechanism:
> >
> > http://thread.gmane.org/gmane.comp.version-control.git/234023/
> >
> > I started to work on a patch about this but didn't take the time to
> > finish and post it.
>
> That seems kind of crazy. Would the fix be as simple as this:
>
> diff --git a/sha1_file.c b/sha1_file.c
> index 10676ba..a051d6c 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -2529,6 +2529,8 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi)
>  	struct pack_entry e;
>  	int rtype;
>  
> +	sha1 = lookup_replace_object(sha1);
> +
>  	co = find_cached_object(sha1);
>  	if (co) {
>  		if (oi->typep)
>
> or do we need some way for callers to turn off replacement? I notice
> that read_sha1_file has such a feature, but it is only used in one
> place.

Yeah, indeed, I asked myself such a question and that's why it is not so
simple, unfortunately.

In sha1_file.c, there is:

void *read_sha1_file_extended(const unsigned char *sha1,
			      enum object_type *type,
			      unsigned long *size,
			      unsigned flag)
{
	void *data;
	char *path;
	const struct packed_git *p;
	const unsigned char *repl = (flag & READ_SHA1_FILE_REPLACE)
		? lookup_replace_object(sha1) : sha1;

	errno = 0;
	data = read_object(repl, type, size);
	...

And in cache.h, there is:

#define READ_SHA1_FILE_REPLACE 1

static inline void *read_sha1_file(const unsigned char *sha1,
				   enum object_type *type,
				   unsigned long *size)
{
	return read_sha1_file_extended(sha1, type, size,
				       READ_SHA1_FILE_REPLACE);
}

So the READ_SHA1_FILE_REPLACE is a way to disable replacement at compile
time. But in my opinion, if we want such a knob, we should use it when
we set the read_replace_refs global variable.
For example with something like this:

diff --git a/environment.c b/environment.c
index 0a15349..7c99af8 100644
--- a/environment.c
+++ b/environment.c
@@ -44,7 +44,7 @@ const char *editor_program;
 const char *askpass_program;
 const char *excludes_file;
 enum auto_crlf auto_crlf = AUTO_CRLF_FALSE;
-int read_replace_refs = 1; /* NEEDSWORK: rename to use_replace_refs */
+int read_replace_refs = READ_SHA1_FILE_REPLACE; /* NEEDSWORK: rename to use_replace_refs */
 enum eol core_eol = EOL_UNSET;
 enum safe_crlf safe_crlf = SAFE_CRLF_WARN;
 unsigned whitespace_rule_cfg = WS_DEFAULT_RULE;

@Junio: what would you think about such a change?

> I guess we would need to audit all the sha1_object_info callers.

Yeah, but when I looked at them, there were not many that looked
dangerous.

Thanks,
Christian.
Re: [PATCH] drop support for experimental loose objects
On Fri, Nov 22, 2013 at 12:04:01PM +0100, Christian Couder wrote:

> In sha1_file.c, there is:
>
> void *read_sha1_file_extended(const unsigned char *sha1,
> 			      enum object_type *type,
> 			      unsigned long *size,
> 			      unsigned flag)
> {
> 	void *data;
> 	char *path;
> 	const struct packed_git *p;
> 	const unsigned char *repl = (flag & READ_SHA1_FILE_REPLACE)
> 		? lookup_replace_object(sha1) : sha1;
>
> 	errno = 0;
> 	data = read_object(repl, type, size);
> 	...
>
> And in cache.h, there is:
>
> #define READ_SHA1_FILE_REPLACE 1
>
> static inline void *read_sha1_file(const unsigned char *sha1,
> 				   enum object_type *type,
> 				   unsigned long *size)
> {
> 	return read_sha1_file_extended(sha1, type, size,
> 				       READ_SHA1_FILE_REPLACE);
> }
>
> So the READ_SHA1_FILE_REPLACE is a way to disable replacement at
> compile time.

Is it? I did not have the impression anyone would ever redefine
READ_SHA1_FILE_REPLACE at compile time, but that it was a flag that
individual callsites would pass to read_sha1_file_extended to tell them
whether they were interested in replacements. And the obvious reasons
not to be are:

  1. You are doing some operation which needs real objects, like fsck
     or generating a packfile.

  2. You have already resolved any replacements, and want to make sure
     you are getting the same object used elsewhere (e.g., because you
     already printed its size :) ).

The only site which calls read_sha1_file_extended directly and does not
pass the REPLACE flag is in streaming.c. And that looks to be a case of
(2), since we resolve the replacement at the start in open_istream().

I am kind of surprised we do not need to do so for (1) in places like
pack-objects.c. Most of that code does not use read_sha1_file,
preferring instead to find the individual pack entries (for reuse). But
there are some calls to read_sha1_file, and I wonder if there is a bug
lurking there.

-Peff
Re: [PATCH] drop support for experimental loose objects
On Fri, Nov 22, 2013 at 12:24 PM, Jeff King <p...@peff.net> wrote:
> On Fri, Nov 22, 2013 at 12:04:01PM +0100, Christian Couder wrote:
>
> > So the READ_SHA1_FILE_REPLACE is a way to disable replacement at
> > compile time.
>
> Is it? I did not have the impression anyone would ever redefine
> READ_SHA1_FILE_REPLACE at compile time, but that it was a flag that
> individual callsites would pass to read_sha1_file_extended to tell them
> whether they were interested in replacements. And the obvious reasons
> not to be are:
>
>   1. You are doing some operation which needs real objects, like fsck
>      or generating a packfile.
>
>   2. You have already resolved any replacements, and want to make sure
>      you are getting the same object used elsewhere (e.g., because you
>      already printed its size :) ).
>
> The only site which calls read_sha1_file_extended directly and does not
> pass the REPLACE flag is in streaming.c. And that looks to be a case of
> (2), since we resolve the replacement at the start in open_istream().

Yeah, you are right. Sorry for overlooking this. But anyway it looks
redundant to me to have both this REPLACE flag and the read_replace_refs
global variable, so I think a proper solution would involve some
significant refactoring. And if we decide to keep a REPLACE flag, we
might need to add one to sha1_object_info_extended() too.

Thanks,
Christian.
Re: [PATCH] drop support for experimental loose objects
On Fri, Nov 22, 2013 at 03:23:31PM +0100, Christian Couder wrote:

> > The only site which calls read_sha1_file_extended directly and does
> > not pass the REPLACE flag is in streaming.c. And that looks to be a
> > case of (2), since we resolve the replacement at the start in
> > open_istream().
>
> Yeah, you are right. Sorry for overlooking this. But anyway it looks
> redundant to me to have both this REPLACE flag and the
> read_replace_refs global variable, so I think a proper solution would
> involve some significant refactoring.

I don't think it is redundant. The global variable is about "does the
whole operation want the replace feature turned on", and the flag is
about "does this particular callsite want the replace feature turned
on". We use the feature iff both are true.

We could implement the callsite flag by tweaking the global right before
the call to read_sha1_file, but then we would have to remember to turn
it back on afterwards. If this were a language with dynamic scopes like
Perl, that would be easy. But in C you have to remember to reset it in
all code paths. :)

In some cases it does make sense to turn the feature off for a whole
command (like pack-objects); using the global makes sense there. And
indeed, we seem to do it already in things like fsck, index-pack, etc.
So that answers my question of why I did not see more of case (1) in my
previous email: they do not need per-callsite disabling, because they do
it for the whole command.

> And if we decide to keep a REPLACE flag we might need to add one to
> sha1_object_info_extended() too.

Yes, but somebody needs to look at all of the callsites and decide which
form they want. :) I did a brief skim, and the ones I noticed were:

  - several spots in index-pack, pack-objects, etc. But these are
    already covered by unsetting read_replace_refs.
  - replace_object looks at both the original and new object to compare
    their types (due to your recent patches); it would obviously want to
    get the true type of the original object.

  - When creating tags and trees, we care about the type of the object
    (the former for the "type" line of the tag, the latter to set the
    mode). What should they do with replace objects? As above, it is
    probably insane to switch types, so it may not matter for practical
    purposes.

  - istream_source in streaming.c would probably want to turn it off,
    for the same reason it uses read_sha1_file_extended.

So I think most sites would be unaffected, but due to the second and
fourth items in my list above, we would need a flag for
sha1_object_info_extended.

-Peff
Re: [PATCH] drop support for experimental loose objects
Jeff King <p...@peff.net> writes:

> I guess we would need to audit all the sha1_object_info callers.

Yup; I agree that was the conclusion of Christian's thread.
Re: [PATCH] drop support for experimental loose objects
Jeff King wrote:
> > BTW, I've also seen git cat-file --batch report wrong sizes for
> > objects,
>
> Hrm. For --batch, I'd think we would open the whole object and notice
> the corruption, even with the current code. But for --batch-check, we
> use sha1_object_info, and for an experimental object, we do not need to
> de-zlib the object at all. So we end up reporting whatever crap we
> decipher from the garbage bytes. My patch would fix that, as we would
> not incorrectly guess an object is experimental anymore. If you have
> specific cases that trigger even after my patch, I'd be interested to
> see them.

I was seeing it with --batch, not --batch-check. Probably only with the
old experimental loose object format. In one case, --batch reported a
size of 20k, and only output 1k of data. With the object file I sent
earlier, --batch reports a huge size, and fails trying to allocate the
memory for it before it can output anything.

I also have seen at least once a corrupt pack file that caused git to
try and allocate an absurd quantity of memory.

Unfortunately I do not currently have exemplars for these, although I
should be able to run a less robust version of my code and find them
again. ;) Will try to find time to do that.

BTW, the fuzzing code is here:
http://source.git-repair.branchable.com/?p=source.git;a=blob;f=Git/Destroyer.hs

--
see shy jo
Re: [PATCH] drop support for experimental loose objects
Jeff King wrote:

>  sha1_file.c                    | 74 -

Yay!

>  t/t1013-loose-object-format.sh | 66 --

Hmm, not all of these tests are about the experimental format. Do we
really want to remove them all?

Thanks,
Jonathan
Re: [PATCH] drop support for experimental loose objects
On Fri, Nov 22, 2013 at 04:24:05PM -0800, Jonathan Nieder wrote:

> >  t/t1013-loose-object-format.sh | 66 --
>
> Hmm, not all of these tests are about the experimental format. Do we
> really want to remove them all?

I think so. They were not all testing the experimental format, but they
were about making sure the is-it-experimental heuristic triggered
properly with various zlib settings. Now that we do not apply that
heuristic, there is nothing (in git) to test. We feed the contents
straight to zlib.

We could keep the objects with small window size as a test, but we are
not really testing git; we are testing zlib at that point.

-Peff
Re: [PATCH] drop support for experimental loose objects
Jeff King wrote:
> On Fri, Nov 22, 2013 at 04:24:05PM -0800, Jonathan Nieder wrote:
>
> > >  t/t1013-loose-object-format.sh | 66 --
> >
> > Hmm, not all of these tests are about the experimental format. Do we
> > really want to remove them all?
>
> I think so. They were not all testing the experimental format, but they
> were about making sure the is-it-experimental heuristic triggered
> properly with various zlib settings. Now that we do not apply that
> heuristic, there is nothing (in git) to test. We feed the contents
> straight to zlib.

Ok, makes sense. In principle the tests are still useful as
futureproofing, in case git starts to sanity-check the objects as a way
to notice corruption earlier or something. But in practice, that kind of
futureproofing is probably not worth the extra tests to maintain.

For what it's worth,
Reviewed-by: Jonathan Nieder <jrnie...@gmail.com>
Re: [PATCH] drop support for experimental loose objects
On Thu, Nov 21, 2013 at 06:41:58AM -0500, Jeff King wrote: The test objects removed are all binary. Git seems to guess a few as non-binary, though, because they don't contain any NULs, and includes gross binary bytes in the patch below. In theory the mail's transfer encoding will take care of this. We'll see, I guess. :) Nope; something munged it along the way (I sent it as C-T-E: 8bit, but vger rewrites to QP before re-mailing, so that is the likely spot). Here's the same patch, but with this stuck into .git/info/attributes: /t/t1013/objects/*/* binary which should work. -- >8 -- Subject: drop support for experimental loose objects In git v1.4.3, we introduced a new loose object format that encoded some object information outside of the zlib stream. Ultimately the format was dropped in v1.5.3, but we kept the reading side around to help people migrate objects. Each time we open a loose object, we use a heuristic to check whether it is in the normal loose format, or the experimental one. This heuristic is robust in the face of valid data, but it tends to treat corrupted or garbage data as an experimental object. With the regular format, we would notice quickly that zlib's crc does not check out and complain. With the experimental object, we are likely to extract a nonsensical object size and try to allocate a huge buffer, resulting in xmalloc calling die. This latter behavior is much worse, for two reasons. One, git reports an allocation error when the real error is corruption. And two, the program dies unconditionally, so you cannot even run fsck (which would otherwise ignore the broken object and keep going). We could try to improve the heuristic to err on the side of normal objects in the face of corruption, but there is really little point. The experimental format is long-dead, and was never enabled by default to begin with. We can instead simply remove it.
The only affected repository would be one that explicitly set core.legacyheaders in 2007, and then never repacked in the intervening 6 years. Signed-off-by: Jeff King p...@peff.net --- sha1_file.c | 74 - t/t1013-loose-object-format.sh | 66 -- .../14/9cedb5c46929d18e0f118e9fa31927487af3b6 | Bin 117 -> 0 bytes .../16/56f9233d999f61ef23ef390b9c71d75399f435 | Bin 17 -> 0 bytes .../1e/72a6b2c4a577ab0338860fa9fe87f761fc9bbd | Bin 18 -> 0 bytes .../25/7cc5642cb1a054f08cc83f2d943e56fd3ebe99 | Bin 19 -> 0 bytes .../2e/65efe2a145dda7ee51d1741299f848e5bf752e | Bin 10 -> 0 bytes .../6b/aee0540ea990d9761a3eb9ab183003a71c3696 | Bin 181 -> 0 bytes .../70/e6a83d8dcb26fc8bc0cf702e2ddeb6adca18fd | Bin 26 -> 0 bytes .../76/e7fa9941f4d5f97f64fea65a2cba436bc79cbb | Bin 155 -> 0 bytes .../78/75c6237d3fcdd0ac2f0decc7d3fa6a50b66c09 | Bin 139 -> 0 bytes .../7a/37b887a73791d12d26c0d3e39568a8fb0fa6e8 | Bin 54 -> 0 bytes .../85/df50785d62d3b05ab03d9cbf7e4a0b49449730 | Bin 13 -> 0 bytes .../8d/4e360d6c70fbd72411991c02a09c442cf7a9fa | Bin 156 -> 0 bytes .../95/b1625de3ba8b2214d1e0d0591138aea733f64f | Bin 252 -> 0 bytes .../9a/e9e86b7bd6cb1472d9373702d8249973da0832 | Bin 11 -> 0 bytes .../bd/15045f6ce8ff75747562173640456a394412c8 | Bin 34 -> 0 bytes .../e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391 | Bin 9 -> 0 bytes .../f8/16d5255855ac160652ee5253b06cd8ee14165a | Bin 116 -> 0 bytes 19 files changed, 140 deletions(-) delete mode 100755 t/t1013-loose-object-format.sh delete mode 100644 t/t1013/objects/14/9cedb5c46929d18e0f118e9fa31927487af3b6 delete mode 100644 t/t1013/objects/16/56f9233d999f61ef23ef390b9c71d75399f435 delete mode 100644 t/t1013/objects/1e/72a6b2c4a577ab0338860fa9fe87f761fc9bbd delete mode 100644 t/t1013/objects/25/7cc5642cb1a054f08cc83f2d943e56fd3ebe99 delete mode 100644 t/t1013/objects/2e/65efe2a145dda7ee51d1741299f848e5bf752e delete mode 100644 t/t1013/objects/6b/aee0540ea990d9761a3eb9ab183003a71c3696 delete mode 100644 t/t1013/objects/70/e6a83d8dcb26fc8bc0cf702e2ddeb6adca18fd delete mode 100644
t/t1013/objects/76/e7fa9941f4d5f97f64fea65a2cba436bc79cbb delete mode 100644 t/t1013/objects/78/75c6237d3fcdd0ac2f0decc7d3fa6a50b66c09 delete mode 100644 t/t1013/objects/7a/37b887a73791d12d26c0d3e39568a8fb0fa6e8 delete mode 100644 t/t1013/objects/85/df50785d62d3b05ab03d9cbf7e4a0b49449730 delete mode 100644 t/t1013/objects/8d/4e360d6c70fbd72411991c02a09c442cf7a9fa delete mode 100644 t/t1013/objects/95/b1625de3ba8b2214d1e0d0591138aea733f64f delete mode 100644 t/t1013/objects/9a/e9e86b7bd6cb1472d9373702d8249973da0832 delete mode 100644 t/t1013/objects/bd/15045f6ce8ff75747562173640456a394412c8 delete mode 100644 t/t1013/objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391 delete mode 100644 t/t1013/objects/f8/16d5255855ac160652ee5253b06cd8ee14165a diff --git a/sha1_file.c b/sha1_file.c index 7dadd04..a72fcb6 100644 --- a/sha1_file.c +++ b/sha1_file.c @@ -1442,51
Re: [PATCH] drop support for experimental loose objects
On Thu, Nov 21, 2013 at 07:43:03PM +0700, Duy Nguyen wrote: - if (experimental_loose_object(map)) { Perhaps keep this.. - /* -* The old experimental format we no longer produce; -* we can still read it. -*/ - used = unpack_object_header_buffer(map, mapsize, &type, &size); - if (!used || !valid_loose_object_type[type]) - return -1; - map += used; - mapsize -= used; - - /* Set up the stream for the rest.. */ - stream->next_in = map; - stream->avail_in = mapsize; - git_inflate_init(stream); - - /* And generate the fake traditional header */ - stream->total_out = 1 + snprintf(buffer, bufsiz, "%s %lu", -typename(type), size); - return 0; and replace all this with die("detected an object in obsolete format, please repack the repository using a version before XXX"); That would eliminate the second part of my purpose, which is to not die() on a corrupted object because we incorrectly guess that it is experimental. If we think these objects are in the wild, the right thing to do would be to warn() and continue. But I really find it hard to believe any such objects exist at this point. -Peff
Re: [PATCH] drop support for experimental loose objects
Jeff King p...@peff.net writes: We could try to improve the heuristic to err on the side of normal objects in the face of corruption, but there is really little point. The experimental format is long-dead, and was never enabled by default to begin with. We can instead simply remove it. The only affected repository would be one that explicitly set core.legacyheaders in 2007, and then never repacked in the intervening 6 years. Sounds sensible. Thanks.
Re: [PATCH] drop support for experimental loose objects
On Thu, Nov 21, 2013 at 5:04 PM, Joey Hess j...@kitenet.net wrote: BTW, I've also seen git cat-file --batch report wrong sizes for objects, sometimes without crashing. This is particularly problematic because if the object size is wrong, it's very hard to detect the actual end of the object output in the batch mode stream. Yeah, I think it might report a wrong size in the case of replaced objects, for example. I looked at that following Junio's comment about the sha1_object_info() API, which, unlike the read_sha1_file() API, does not interact with the replace mechanism: http://thread.gmane.org/gmane.comp.version-control.git/234023/ I started to work on a patch about this but didn't take the time to finish and post it. Thanks, Christian.
Re: [PATCH] drop support for experimental loose objects
On Thu, Nov 21, 2013 at 6:48 PM, Jeff King p...@peff.net wrote: @@ -1514,14 +1469,6 @@ unsigned long unpack_object_header_buffer(const unsigned char *buf, int unpack_sha1_header(git_zstream *stream, unsigned char *map, unsigned long mapsize, void *buffer, unsigned long bufsiz) { - unsigned long size, used; - static const char valid_loose_object_type[8] = { - 0, /* OBJ_EXT */ - 1, 1, 1, 1, /* commit, tree, blob, tag */ - 0, /* delta and others are invalid in a loose object */ - }; - enum object_type type; - /* Get the data stream */ memset(stream, 0, sizeof(*stream)); stream->next_in = map; @@ -1529,27 +1476,6 @@ int unpack_sha1_header(git_zstream *stream, unsigned char *map, unsigned long ma stream->next_out = buffer; stream->avail_out = bufsiz; - if (experimental_loose_object(map)) { Perhaps keep this.. - /* -* The old experimental format we no longer produce; -* we can still read it. -*/ - used = unpack_object_header_buffer(map, mapsize, &type, &size); - if (!used || !valid_loose_object_type[type]) - return -1; - map += used; - mapsize -= used; - - /* Set up the stream for the rest.. */ - stream->next_in = map; - stream->avail_in = mapsize; - git_inflate_init(stream); - - /* And generate the fake traditional header */ - stream->total_out = 1 + snprintf(buffer, bufsiz, "%s %lu", -typename(type), size); - return 0; and replace all this with die("detected an object in obsolete format, please repack the repository using a version before XXX"); ? - } git_inflate_init(stream); return git_inflate(stream, 0); } -- Duy
Re: [PATCH] drop support for experimental loose objects
Duy Nguyen pclo...@gmail.com writes: On Thu, Nov 21, 2013 at 6:48 PM, Jeff King p...@peff.net wrote: @@ -1514,14 +1469,6 @@ unsigned long unpack_object_header_buffer(const unsigned char *buf, int unpack_sha1_header(git_zstream *stream, unsigned char *map, unsigned long mapsize, void *buffer, unsigned long bufsiz) { - unsigned long size, used; - static const char valid_loose_object_type[8] = { - 0, /* OBJ_EXT */ - 1, 1, 1, 1, /* commit, tree, blob, tag */ - 0, /* delta and others are invalid in a loose object */ - }; - enum object_type type; - /* Get the data stream */ memset(stream, 0, sizeof(*stream)); stream->next_in = map; @@ -1529,27 +1476,6 @@ int unpack_sha1_header(git_zstream *stream, unsigned char *map, unsigned long ma stream->next_out = buffer; stream->avail_out = bufsiz; - if (experimental_loose_object(map)) { Perhaps keep this.. - /* -* The old experimental format we no longer produce; -* we can still read it. -*/ - used = unpack_object_header_buffer(map, mapsize, &type, &size); - if (!used || !valid_loose_object_type[type]) - return -1; - map += used; - mapsize -= used; - - /* Set up the stream for the rest.. */ - stream->next_in = map; - stream->avail_in = mapsize; - git_inflate_init(stream); - - /* And generate the fake traditional header */ - stream->total_out = 1 + snprintf(buffer, bufsiz, "%s %lu", -typename(type), size); - return 0; and replace all this with die("detected an object in obsolete format, please repack the repository using a version before XXX"); ? Wouldn't that fail to solve the issue of `git fsck` dying on corrupt data? experimental_loose_object() would need to be rewritten to be more conservative in deciding that an object was in the experimental loose object format. -Keshav
Re: [PATCH] drop support for experimental loose objects
On Thu, Nov 21, 2013 at 12:04:26PM -0400, Joey Hess wrote: Jeff King wrote: This latter behavior is much worse for two reasons. One, git reports an allocation error when the real error is corruption. And two, the program dies unconditionally, so you cannot even run fsck (which would otherwise ignore the broken object and keep going). BTW, I've also seen git cat-file --batch report wrong sizes for objects, sometimes without crashing. This is particularly problematic because if the object size is wrong, it's very hard to detect the actual end of the object output in the batch mode stream. Hrm. For --batch, I'd think we would open the whole object and notice the corruption, even with the current code. But for --batch-check, we use sha1_object_info, and for an experimental object, we do not need to de-zlib the object at all. So we end up reporting whatever crap we decipher from the garbage bytes. My patch would fix that, as we would not incorrectly guess an object is experimental anymore. If you have specific cases that trigger even after my patch, I'd be interested to see them. -Peff
Re: [PATCH] drop support for experimental loose objects
Jeff King wrote: This latter behavior is much worse for two reasons. One, git reports an allocation error when the real error is corruption. And two, the program dies unconditionally, so you cannot even run fsck (which would otherwise ignore the broken object and keep going). BTW, I've also seen git cat-file --batch report wrong sizes for objects, sometimes without crashing. This is particularly problematic because if the object size is wrong, it's very hard to detect the actual end of the object output in the batch mode stream. -- see shy jo