Re: [PATCH 09/23] pack v4: commit object encoding
On Tue, 3 Sep 2013, Duy Nguyen wrote: On Tue, Sep 3, 2013 at 1:30 PM, Nicolas Pitre n...@fluxnic.net wrote: On Tue, 3 Sep 2013, Duy Nguyen wrote: On Tue, Aug 27, 2013 at 11:25 AM, Nicolas Pitre n...@fluxnic.net wrote: This goes as follows: - Tree reference: either variable length encoding of the index into the SHA1 table or the literal SHA1 prefixed by 0 (see add_sha1_ref()). - Parent count: variable length encoding of the number of parents. This is normally going to occupy a single byte but doesn't have to. - List of parent references: a list of add_sha1_ref() encoded references, or nothing if the parent count was zero. With .pack v3 it's impossible to create delta cycles (3b910d0 add tests for indexing packs with delta cycles - 2013-08-23) but it is possible with .pack v4 (and therefore at least index-pack needs to detect and reject them), correct? Some malicious user can create commit A with parent ref 1, then make the SHA-1 table so that ref 1 is A. The same with the new tree representation. pack-index should validate the SHA1 of the object being pointed at. In that case I doubt you'll be able to actually construct an object which contains a SHA1 parent reference and make the SHA1 of this very object resolve to the same value. We could do that for commits. For trees, we need to look at the base's content to construct the current tree and cycles could happen. I think we could make a rule that base trees must appear in the pack before the tree being constructed (similar to delta-ofs). The exception is objects appended for fixing thin pack. I don't really like such artificial constraints. You never know when such constraints are going to prevent future legitimate pack layout optimizations or the like. It is simple enough to detect cycles during delta validation in index-pack anyway. Nicolas -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 09/23] pack v4: commit object encoding
On Tue, 3 Sep 2013, Duy Nguyen wrote: On Tue, Aug 27, 2013 at 11:25 AM, Nicolas Pitre n...@fluxnic.net wrote: This goes as follows: - Tree reference: either variable length encoding of the index into the SHA1 table or the literal SHA1 prefixed by 0 (see add_sha1_ref()). - Parent count: variable length encoding of the number of parents. This is normally going to occupy a single byte but doesn't have to. - List of parent references: a list of add_sha1_ref() encoded references, or nothing if the parent count was zero. With .pack v3 it's impossible to create delta cycles (3b910d0 add tests for indexing packs with delta cycles - 2013-08-23) but it is possible with .pack v4 (and therefore at least index-pack needs to detect and reject them), correct? Some malicious user can create commit A with parent ref 1, then make the SHA-1 table so that ref 1 is A. The same with the new tree representation. pack-index should validate the SHA1 of the object being pointed at. In that case I doubt you'll be able to actually construct an object which contains a SHA1 parent reference and make the SHA1 of this very object resolve to the same value. Nicolas -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 09/23] pack v4: commit object encoding
On Tue, Sep 3, 2013 at 1:30 PM, Nicolas Pitre n...@fluxnic.net wrote: On Tue, 3 Sep 2013, Duy Nguyen wrote: On Tue, Aug 27, 2013 at 11:25 AM, Nicolas Pitre n...@fluxnic.net wrote: This goes as follows: - Tree reference: either variable length encoding of the index into the SHA1 table or the literal SHA1 prefixed by 0 (see add_sha1_ref()). - Parent count: variable length encoding of the number of parents. This is normally going to occupy a single byte but doesn't have to. - List of parent references: a list of add_sha1_ref() encoded references, or nothing if the parent count was zero. With .pack v3 it's impossible to create delta cycles (3b910d0 add tests for indexing packs with delta cycles - 2013-08-23) but it is possible with .pack v4 (and therefore at least index-pack needs to detect and reject them), correct? Some malicious user can create commit A with parent ref 1, then make the SHA-1 table so that ref 1 is A. The same with the new tree representation. pack-index should validate the SHA1 of the object being pointed at. In that case I doubt you'll be able to actually construct an object which contains a SHA1 parent reference and make the SHA1 of this very object resolve to the same value. We could do that for commits. For trees, we need to look at the base's content to construct the current tree and cycles could happen. I think we could make a rule that base trees must appear in the pack before the tree being constructed (similar to delta-ofs). The exception is objects appended for fixing thin pack. -- Duy -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 09/23] pack v4: commit object encoding
On Tue, Aug 27, 2013 at 11:25 AM, Nicolas Pitre n...@fluxnic.net wrote: This goes as follows: - Tree reference: either variable length encoding of the index into the SHA1 table or the literal SHA1 prefixed by 0 (see add_sha1_ref()). - Parent count: variable length encoding of the number of parents. This is normally going to occupy a single byte but doesn't have to. - List of parent references: a list of add_sha1_ref() encoded references, or nothing if the parent count was zero. With .pack v3 it's impossible to create delta cycles (3b910d0 add tests for indexing packs with delta cycles - 2013-08-23) but it is possible with .pack v4 (and therefore at least index-pack needs to detect and reject them), correct? Some malicious user can create commit A with parent ref 1, then make the SHA-1 table so that ref 1 is A. The same with the new tree representation. -- Duy -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 09/23] pack v4: commit object encoding
Nicolas Pitre n...@fluxnic.net writes: This goes as follows: - Tree reference: either variable length encoding of the index into the SHA1 table or the literal SHA1 prefixed by 0 (see add_sha1_ref()). - Parent count: variable length encoding of the number of parents. This is normally going to occupy a single byte but doesn't have to. - List of parent references: a list of add_sha1_ref() encoded references, or nothing if the parent count was zero. - Author reference: variable length encoding of an index into the author string dictionary table which also covers the time zone. To make the overall encoding efficient, the author table is already sorted by usage frequency so the most used names are first and require the shortest index encoding. - Author time stamp: variable length encoded. Year 2038 ready! - Committer reference: same as author reference. - Committer time stamp: same as author time stamp. The remainder of the canonical commit object content is then zlib compressed and appended to the above. Rationale: The most important commit object data is densely encoded while requiring no zlib inflate processing, and all SHA1 references are most likely to be direct indices into the pack index file requiring no SHA1 search into the pack index file. Signed-off-by: Nicolas Pitre n...@fluxnic.net --- packv4-create.c | 119 1 file changed, 119 insertions(+) diff --git a/packv4-create.c b/packv4-create.c index bf33d15..cedbbd9 100644 --- a/packv4-create.c +++ b/packv4-create.c @@ -13,6 +13,9 @@ #include tree-walk.h #include pack.h + +static int pack_compression_level = Z_DEFAULT_COMPRESSION; + struct data_entry { unsigned offset; unsigned size; @@ -289,6 +292,122 @@ static unsigned char *add_sha1_ref(unsigned char *dst, const unsigned char *sha1 return dst + 20; } +/* + * This converts a canonical commit object buffer into its + * tightly packed representation using the already populated + * and sorted commit_name_table dictionary. The parsing is + * strict so to ensure the canonical version may always be + * regenerated and produce the same hash. + */ +void * conv_to_dict_commit(void *buffer, unsigned long *psize) Drop SP between asterisk and conv_? +{ + unsigned long size = *psize; + char *in, *tail, *end; + unsigned char *out; + unsigned char sha1[20]; + int nb_parents, index, tz_val; + unsigned long time; + z_stream stream; + int status; + + /* + * It is guaranteed that the output is always going to be smaller + * than the input. We could even do this conversion in place. + */ + in = buffer; + tail = in + size; + buffer = xmalloc(size); + out = buffer; + + /* parse the tree line */ + if (in + 46 = tail || memcmp(in, tree , 5) || in[45] != '\n') + goto bad_data; + if (get_sha1_hex(in + 5, sha1) 0) + goto bad_data; Is this strict enough to guarantee roundtrip hash identity? Because get_sha1_hex() accepts hexadecimal represented with uppercase A-F, you need to reject such a broken commit object, no? Same for parent commit object names below that are parsed with the same helper. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 09/23] pack v4: commit object encoding
On Tue, 27 Aug 2013, Junio C Hamano wrote: Nicolas Pitre n...@fluxnic.net writes: This goes as follows: - Tree reference: either variable length encoding of the index into the SHA1 table or the literal SHA1 prefixed by 0 (see add_sha1_ref()). - Parent count: variable length encoding of the number of parents. This is normally going to occupy a single byte but doesn't have to. - List of parent references: a list of add_sha1_ref() encoded references, or nothing if the parent count was zero. - Author reference: variable length encoding of an index into the author string dictionary table which also covers the time zone. To make the overall encoding efficient, the author table is already sorted by usage frequency so the most used names are first and require the shortest index encoding. - Author time stamp: variable length encoded. Year 2038 ready! - Committer reference: same as author reference. - Committer time stamp: same as author time stamp. The remainder of the canonical commit object content is then zlib compressed and appended to the above. Rationale: The most important commit object data is densely encoded while requiring no zlib inflate processing, and all SHA1 references are most likely to be direct indices into the pack index file requiring no SHA1 search into the pack index file. Signed-off-by: Nicolas Pitre n...@fluxnic.net --- packv4-create.c | 119 1 file changed, 119 insertions(+) diff --git a/packv4-create.c b/packv4-create.c index bf33d15..cedbbd9 100644 --- a/packv4-create.c +++ b/packv4-create.c @@ -13,6 +13,9 @@ #include tree-walk.h #include pack.h + +static int pack_compression_level = Z_DEFAULT_COMPRESSION; + struct data_entry { unsigned offset; unsigned size; @@ -289,6 +292,122 @@ static unsigned char *add_sha1_ref(unsigned char *dst, const unsigned char *sha1 return dst + 20; } +/* + * This converts a canonical commit object buffer into its + * tightly packed representation using the already populated + * and sorted commit_name_table dictionary. The parsing is + * strict so to ensure the canonical version may always be + * regenerated and produce the same hash. + */ +void * conv_to_dict_commit(void *buffer, unsigned long *psize) Drop SP between asterisk and conv_? +{ + unsigned long size = *psize; + char *in, *tail, *end; + unsigned char *out; + unsigned char sha1[20]; + int nb_parents, index, tz_val; + unsigned long time; + z_stream stream; + int status; + + /* +* It is guaranteed that the output is always going to be smaller +* than the input. We could even do this conversion in place. +*/ + in = buffer; + tail = in + size; + buffer = xmalloc(size); + out = buffer; + + /* parse the tree line */ + if (in + 46 = tail || memcmp(in, tree , 5) || in[45] != '\n') + goto bad_data; + if (get_sha1_hex(in + 5, sha1) 0) + goto bad_data; Is this strict enough to guarantee roundtrip hash identity? Because get_sha1_hex() accepts hexadecimal represented with uppercase A-F, you need to reject such a broken commit object, no? Indeed, yes. Same concern as with split_ident_line() I mentioned before. Same for parent commit object names below that are parsed with the same helper. Exact. Nicolas -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 09/23] pack v4: commit object encoding
On Tue, 27 Aug 2013, Junio C Hamano wrote: Nicolas Pitre n...@fluxnic.net writes: + /* parse the tree line */ + if (in + 46 = tail || memcmp(in, tree , 5) || in[45] != '\n') + goto bad_data; + if (get_sha1_hex(in + 5, sha1) 0) + goto bad_data; Is this strict enough to guarantee roundtrip hash identity? Because get_sha1_hex() accepts hexadecimal represented with uppercase A-F, you need to reject such a broken commit object, no? BTW, is there any such objects in existence where sha1 ascii strings are represented using uppercase letters? Because there would be a simple way to encode that fact in the pack v4 sha1 reference... but that change has to happen now. I'm already claiming we won't support mixed case though. Nicolas -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 09/23] pack v4: commit object encoding
Nicolas Pitre n...@fluxnic.net writes: On Tue, 27 Aug 2013, Junio C Hamano wrote: Nicolas Pitre n...@fluxnic.net writes: + /* parse the tree line */ + if (in + 46 = tail || memcmp(in, tree , 5) || in[45] != '\n') + goto bad_data; + if (get_sha1_hex(in + 5, sha1) 0) + goto bad_data; Is this strict enough to guarantee roundtrip hash identity? Because get_sha1_hex() accepts hexadecimal represented with uppercase A-F, you need to reject such a broken commit object, no? BTW, is there any such objects in existence where sha1 ascii strings are represented using uppercase letters? Any commit or tag object that refers to another object with SHA-1 using uppercase letters is broken and invalid. get_sha1_hex() is not limited to reading these (i.e. it also is used to read object name given on the command line) so it is lenient, but the above codepath should care so that the result of hashing will stay the same. Because there would be a simple way to encode that fact in the pack v4 sha1 reference... but that change has to happen now. Hence, I do not think we care. I'm already claiming we won't support mixed case though. Yeah, I am already claiming we won't support any uppercase ;-). -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 09/23] pack v4: commit object encoding
On Tue, 27 Aug 2013, Junio C Hamano wrote: Nicolas Pitre n...@fluxnic.net writes: On Tue, 27 Aug 2013, Junio C Hamano wrote: Nicolas Pitre n...@fluxnic.net writes: +/* parse the tree line */ +if (in + 46 = tail || memcmp(in, tree , 5) || in[45] != '\n') +goto bad_data; +if (get_sha1_hex(in + 5, sha1) 0) +goto bad_data; Is this strict enough to guarantee roundtrip hash identity? Because get_sha1_hex() accepts hexadecimal represented with uppercase A-F, you need to reject such a broken commit object, no? BTW, is there any such objects in existence where sha1 ascii strings are represented using uppercase letters? Any commit or tag object that refers to another object with SHA-1 using uppercase letters is broken and invalid. get_sha1_hex() is not limited to reading these (i.e. it also is used to read object name given on the command line) so it is lenient, but the above codepath should care so that the result of hashing will stay the same. Indeed, hence my concern about encoding the original case. Because there would be a simple way to encode that fact in the pack v4 sha1 reference... but that change has to happen now. Hence, I do not think we care. I'm already claiming we won't support mixed case though. Yeah, I am already claiming we won't support any uppercase ;-). Perfect! I've added a get_sha1_lowhex() to my tree. Nicolas -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 09/23] pack v4: commit object encoding
This goes as follows: - Tree reference: either variable length encoding of the index into the SHA1 table or the literal SHA1 prefixed by 0 (see add_sha1_ref()). - Parent count: variable length encoding of the number of parents. This is normally going to occupy a single byte but doesn't have to. - List of parent references: a list of add_sha1_ref() encoded references, or nothing if the parent count was zero. - Author reference: variable length encoding of an index into the author string dictionary table which also covers the time zone. To make the overall encoding efficient, the author table is already sorted by usage frequency so the most used names are first and require the shortest index encoding. - Author time stamp: variable length encoded. Year 2038 ready! - Committer reference: same as author reference. - Committer time stamp: same as author time stamp. The remainder of the canonical commit object content is then zlib compressed and appended to the above. Rationale: The most important commit object data is densely encoded while requiring no zlib inflate processing, and all SHA1 references are most likely to be direct indices into the pack index file requiring no SHA1 search into the pack index file. Signed-off-by: Nicolas Pitre n...@fluxnic.net --- packv4-create.c | 119 1 file changed, 119 insertions(+) diff --git a/packv4-create.c b/packv4-create.c index bf33d15..cedbbd9 100644 --- a/packv4-create.c +++ b/packv4-create.c @@ -13,6 +13,9 @@ #include tree-walk.h #include pack.h + +static int pack_compression_level = Z_DEFAULT_COMPRESSION; + struct data_entry { unsigned offset; unsigned size; @@ -289,6 +292,122 @@ static unsigned char *add_sha1_ref(unsigned char *dst, const unsigned char *sha1 return dst + 20; } +/* + * This converts a canonical commit object buffer into its + * tightly packed representation using the already populated + * and sorted commit_name_table dictionary. The parsing is + * strict so to ensure the canonical version may always be + * regenerated and produce the same hash. + */ +void * conv_to_dict_commit(void *buffer, unsigned long *psize) +{ + unsigned long size = *psize; + char *in, *tail, *end; + unsigned char *out; + unsigned char sha1[20]; + int nb_parents, index, tz_val; + unsigned long time; + z_stream stream; + int status; + + /* +* It is guaranteed that the output is always going to be smaller +* than the input. We could even do this conversion in place. +*/ + in = buffer; + tail = in + size; + buffer = xmalloc(size); + out = buffer; + + /* parse the tree line */ + if (in + 46 = tail || memcmp(in, tree , 5) || in[45] != '\n') + goto bad_data; + if (get_sha1_hex(in + 5, sha1) 0) + goto bad_data; + in += 46; + out = add_sha1_ref(out, sha1); + + /* count how many parent lines */ + nb_parents = 0; + while (in + 48 tail !memcmp(in, parent , 7) in[47] == '\n') { + nb_parents++; + in += 48; + } + out = add_number(out, nb_parents); + + /* rewind and parse the parent lines */ + in -= 48 * nb_parents; + while (nb_parents--) { + if (get_sha1_hex(in + 7, sha1)) + goto bad_data; + out = add_sha1_ref(out, sha1); + in += 48; + } + + /* parse the author line */ + /* it must be at least author x x 0 +\n i.e. 21 chars */ + if (in + 21 = tail || memcmp(in, author , 7)) + goto bad_data; + in += 7; + end = get_nameend_and_tz(in, tz_val); + if (!end) + goto bad_data; + index = dict_add_entry(commit_name_table, tz_val, in, end - in); + if (index 0) + goto bad_dict; + out = add_number(out, index); + time = strtoul(end, end, 10); + if (!end || end[0] != ' ' || end[6] != '\n') + goto bad_data; + out = add_number(out, time); + in = end + 7; + + /* parse the committer line */ + /* it must be at least committer x x 0 +\n i.e. 24 chars */ + if (in + 24 = tail || memcmp(in, committer , 7)) + goto bad_data; + in += 10; + end = get_nameend_and_tz(in, tz_val); + if (!end) + goto bad_data; + index = dict_add_entry(commit_name_table, tz_val, in, end - in); + if (index 0) + goto bad_dict; + out = add_number(out, index); + time = strtoul(end, end, 10); + if (!end || end[0] != ' ' || end[6] != '\n') + goto bad_data; + out = add_number(out, time); + in = end + 7; + + /* finally, deflate the remaining data */ + memset(stream, 0, sizeof(stream)); + deflateInit(stream, pack_compression_level); +