Re: [PATCH 09/23] pack v4: commit object encoding

2013-09-04 Thread Nicolas Pitre
On Tue, 3 Sep 2013, Duy Nguyen wrote:

 On Tue, Sep 3, 2013 at 1:30 PM, Nicolas Pitre n...@fluxnic.net wrote:
  On Tue, 3 Sep 2013, Duy Nguyen wrote:
 
  On Tue, Aug 27, 2013 at 11:25 AM, Nicolas Pitre n...@fluxnic.net wrote:
   This goes as follows:
  
   - Tree reference: either variable length encoding of the index
 into the SHA1 table or the literal SHA1 prefixed by 0 (see
 add_sha1_ref()).
  
   - Parent count: variable length encoding of the number of parents.
 This is normally going to occupy a single byte but doesn't have to.
  
   - List of parent references: a list of add_sha1_ref() encoded references,
 or nothing if the parent count was zero.
 
  With .pack v3 it's impossible to create delta cycles (3b910d0 add
  tests for indexing packs with delta cycles - 2013-08-23) but it is
  possible with .pack v4 (and therefore at least index-pack needs to
  detect and reject them), correct? Some malicious user can create
  commit A with parent ref 1, then make the SHA-1 table so that ref 1 is
  A. The same with the new tree representation.
 
  pack-index should validate the SHA1 of the object being pointed at.
 
  In that case I doubt you'll be able to actually construct an object
  which contains a SHA1 parent reference and make the SHA1 of this very
  object resolve to the same value.
 
 We could do that for commits. For trees, we need to look at the base's
 content to construct the current tree and cycles could happen. I think
 we could make a rule that base trees must appear in the pack before
 the tree being constructed (similar to delta-ofs). The exception is
 objects appended for fixing thin pack.

I don't really like such artificial constraints.  You never know when 
such constraints are going to prevent future legitimate pack layout 
optimizations or the like.

It is simple enough to detect cycles during delta validation in 
index-pack anyway.


Nicolas
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 09/23] pack v4: commit object encoding

2013-09-03 Thread Nicolas Pitre
On Tue, 3 Sep 2013, Duy Nguyen wrote:

 On Tue, Aug 27, 2013 at 11:25 AM, Nicolas Pitre n...@fluxnic.net wrote:
  This goes as follows:
 
  - Tree reference: either variable length encoding of the index
into the SHA1 table or the literal SHA1 prefixed by 0 (see
add_sha1_ref()).
 
  - Parent count: variable length encoding of the number of parents.
This is normally going to occupy a single byte but doesn't have to.
 
  - List of parent references: a list of add_sha1_ref() encoded references,
or nothing if the parent count was zero.
 
 With .pack v3 it's impossible to create delta cycles (3b910d0 add
 tests for indexing packs with delta cycles - 2013-08-23) but it is
 possible with .pack v4 (and therefore at least index-pack needs to
 detect and reject them), correct? Some malicious user can create
 commit A with parent ref 1, then make the SHA-1 table so that ref 1 is
 A. The same with the new tree representation.

pack-index should validate the SHA1 of the object being pointed at.

In that case I doubt you'll be able to actually construct an object 
which contains a SHA1 parent reference and make the SHA1 of this very 
object resolve to the same value.


Nicolas
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 09/23] pack v4: commit object encoding

2013-09-03 Thread Duy Nguyen
On Tue, Sep 3, 2013 at 1:30 PM, Nicolas Pitre n...@fluxnic.net wrote:
 On Tue, 3 Sep 2013, Duy Nguyen wrote:

 On Tue, Aug 27, 2013 at 11:25 AM, Nicolas Pitre n...@fluxnic.net wrote:
  This goes as follows:
 
  - Tree reference: either variable length encoding of the index
into the SHA1 table or the literal SHA1 prefixed by 0 (see
add_sha1_ref()).
 
  - Parent count: variable length encoding of the number of parents.
This is normally going to occupy a single byte but doesn't have to.
 
  - List of parent references: a list of add_sha1_ref() encoded references,
or nothing if the parent count was zero.

 With .pack v3 it's impossible to create delta cycles (3b910d0 add
 tests for indexing packs with delta cycles - 2013-08-23) but it is
 possible with .pack v4 (and therefore at least index-pack needs to
 detect and reject them), correct? Some malicious user can create
 commit A with parent ref 1, then make the SHA-1 table so that ref 1 is
 A. The same with the new tree representation.

 pack-index should validate the SHA1 of the object being pointed at.

 In that case I doubt you'll be able to actually construct an object
 which contains a SHA1 parent reference and make the SHA1 of this very
 object resolve to the same value.

We could do that for commits. For trees, we need to look at the base's
content to construct the current tree and cycles could happen. I think
we could make a rule that base trees must appear in the pack before
the tree being constructed (similar to delta-ofs). The exception is
objects appended for fixing thin pack.
-- 
Duy
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 09/23] pack v4: commit object encoding

2013-09-02 Thread Duy Nguyen
On Tue, Aug 27, 2013 at 11:25 AM, Nicolas Pitre n...@fluxnic.net wrote:
 This goes as follows:

 - Tree reference: either variable length encoding of the index
   into the SHA1 table or the literal SHA1 prefixed by 0 (see
   add_sha1_ref()).

 - Parent count: variable length encoding of the number of parents.
   This is normally going to occupy a single byte but doesn't have to.

 - List of parent references: a list of add_sha1_ref() encoded references,
   or nothing if the parent count was zero.

With .pack v3 it's impossible to create delta cycles (3b910d0 add
tests for indexing packs with delta cycles - 2013-08-23) but it is
possible with .pack v4 (and therefore at least index-pack needs to
detect and reject them), correct? Some malicious user can create
commit A with parent ref 1, then make the SHA-1 table so that ref 1 is
A. The same with the new tree representation.
-- 
Duy
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 09/23] pack v4: commit object encoding

2013-08-27 Thread Junio C Hamano
Nicolas Pitre n...@fluxnic.net writes:

 This goes as follows:

 - Tree reference: either variable length encoding of the index
   into the SHA1 table or the literal SHA1 prefixed by 0 (see
   add_sha1_ref()).

 - Parent count: variable length encoding of the number of parents.
   This is normally going to occupy a single byte but doesn't have to.

 - List of parent references: a list of add_sha1_ref() encoded references,
   or nothing if the parent count was zero.

 - Author reference: variable length encoding of an index into the author
   string dictionary table which also covers the time zone.  To make the
   overall encoding efficient, the author table is already sorted by usage
   frequency so the most used names are first and require the shortest
   index encoding.

 - Author time stamp: variable length encoded.  Year 2038 ready!

 - Committer reference: same as author reference.

 - Committer time stamp: same as author time stamp.

 The remainder of the canonical commit object content is then zlib
 compressed and appended to the above.

 Rationale: The most important commit object data is densely encoded while
 requiring no zlib inflate processing, and all SHA1 references are most
 likely to be direct indices into the pack index file requiring no SHA1
 search into the pack index file.

 Signed-off-by: Nicolas Pitre n...@fluxnic.net
 ---
  packv4-create.c | 119 
 
  1 file changed, 119 insertions(+)

 diff --git a/packv4-create.c b/packv4-create.c
 index bf33d15..cedbbd9 100644
 --- a/packv4-create.c
 +++ b/packv4-create.c
 @@ -13,6 +13,9 @@
  #include tree-walk.h
  #include pack.h
  
 +
 +static int pack_compression_level = Z_DEFAULT_COMPRESSION;
 +
  struct data_entry {
   unsigned offset;
   unsigned size;
 @@ -289,6 +292,122 @@ static unsigned char *add_sha1_ref(unsigned char *dst, 
 const unsigned char *sha1
   return dst + 20;
  }
  
 +/*
 + * This converts a canonical commit object buffer into its
 + * tightly packed representation using the already populated
 + * and sorted commit_name_table dictionary.  The parsing is
 + * strict so to ensure the canonical version may always be
 + * regenerated and produce the same hash.
 + */
 +void * conv_to_dict_commit(void *buffer, unsigned long *psize)

Drop SP between asterisk and conv_?

 +{
 + unsigned long size = *psize;
 + char *in, *tail, *end;
 + unsigned char *out;
 + unsigned char sha1[20];
 + int nb_parents, index, tz_val;
 + unsigned long time;
 + z_stream stream;
 + int status;
 +
 + /*
 +  * It is guaranteed that the output is always going to be smaller
 +  * than the input.  We could even do this conversion in place.
 +  */
 + in = buffer;
 + tail = in + size;
 + buffer = xmalloc(size);
 + out = buffer;
 +
 + /* parse the tree line */
 + if (in + 46 = tail || memcmp(in, tree , 5) || in[45] != '\n')
 + goto bad_data;
 + if (get_sha1_hex(in + 5, sha1)  0)
 + goto bad_data;

Is this strict enough to guarantee roundtrip hash identity?  Because
get_sha1_hex() accepts hexadecimal represented with uppercase A-F,
you need to reject such a broken commit object, no?

Same for parent commit object names below that are parsed with the
same helper.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 09/23] pack v4: commit object encoding

2013-08-27 Thread Nicolas Pitre
On Tue, 27 Aug 2013, Junio C Hamano wrote:

 Nicolas Pitre n...@fluxnic.net writes:
 
  This goes as follows:
 
  - Tree reference: either variable length encoding of the index
into the SHA1 table or the literal SHA1 prefixed by 0 (see
add_sha1_ref()).
 
  - Parent count: variable length encoding of the number of parents.
This is normally going to occupy a single byte but doesn't have to.
 
  - List of parent references: a list of add_sha1_ref() encoded references,
or nothing if the parent count was zero.
 
  - Author reference: variable length encoding of an index into the author
string dictionary table which also covers the time zone.  To make the
overall encoding efficient, the author table is already sorted by usage
frequency so the most used names are first and require the shortest
index encoding.
 
  - Author time stamp: variable length encoded.  Year 2038 ready!
 
  - Committer reference: same as author reference.
 
  - Committer time stamp: same as author time stamp.
 
  The remainder of the canonical commit object content is then zlib
  compressed and appended to the above.
 
  Rationale: The most important commit object data is densely encoded while
  requiring no zlib inflate processing, and all SHA1 references are most
  likely to be direct indices into the pack index file requiring no SHA1
  search into the pack index file.
 
  Signed-off-by: Nicolas Pitre n...@fluxnic.net
  ---
   packv4-create.c | 119 
  
   1 file changed, 119 insertions(+)
 
  diff --git a/packv4-create.c b/packv4-create.c
  index bf33d15..cedbbd9 100644
  --- a/packv4-create.c
  +++ b/packv4-create.c
  @@ -13,6 +13,9 @@
   #include tree-walk.h
   #include pack.h
   
  +
  +static int pack_compression_level = Z_DEFAULT_COMPRESSION;
  +
   struct data_entry {
  unsigned offset;
  unsigned size;
  @@ -289,6 +292,122 @@ static unsigned char *add_sha1_ref(unsigned char 
  *dst, const unsigned char *sha1
  return dst + 20;
   }
   
  +/*
  + * This converts a canonical commit object buffer into its
  + * tightly packed representation using the already populated
  + * and sorted commit_name_table dictionary.  The parsing is
  + * strict so to ensure the canonical version may always be
  + * regenerated and produce the same hash.
  + */
  +void * conv_to_dict_commit(void *buffer, unsigned long *psize)
 
 Drop SP between asterisk and conv_?
 
  +{
  +   unsigned long size = *psize;
  +   char *in, *tail, *end;
  +   unsigned char *out;
  +   unsigned char sha1[20];
  +   int nb_parents, index, tz_val;
  +   unsigned long time;
  +   z_stream stream;
  +   int status;
  +
  +   /*
  +* It is guaranteed that the output is always going to be smaller
  +* than the input.  We could even do this conversion in place.
  +*/
  +   in = buffer;
  +   tail = in + size;
  +   buffer = xmalloc(size);
  +   out = buffer;
  +
  +   /* parse the tree line */
  +   if (in + 46 = tail || memcmp(in, tree , 5) || in[45] != '\n')
  +   goto bad_data;
  +   if (get_sha1_hex(in + 5, sha1)  0)
  +   goto bad_data;
 
 Is this strict enough to guarantee roundtrip hash identity?  Because
 get_sha1_hex() accepts hexadecimal represented with uppercase A-F,
 you need to reject such a broken commit object, no?

Indeed, yes.  Same concern as with split_ident_line() I mentioned 
before.

 Same for parent commit object names below that are parsed with the
 same helper.

Exact.


Nicolas
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 09/23] pack v4: commit object encoding

2013-08-27 Thread Nicolas Pitre
On Tue, 27 Aug 2013, Junio C Hamano wrote:

 Nicolas Pitre n...@fluxnic.net writes:
 
  +   /* parse the tree line */
  +   if (in + 46 = tail || memcmp(in, tree , 5) || in[45] != '\n')
  +   goto bad_data;
  +   if (get_sha1_hex(in + 5, sha1)  0)
  +   goto bad_data;
 
 Is this strict enough to guarantee roundtrip hash identity?  Because
 get_sha1_hex() accepts hexadecimal represented with uppercase A-F,
 you need to reject such a broken commit object, no?

BTW, is there any such objects in existence where sha1 ascii strings are 
represented using uppercase letters?  Because there would be a simple 
way to encode that fact in the pack v4 sha1 reference... but that change 
has to happen now.

I'm already claiming we won't support mixed case though.


Nicolas
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 09/23] pack v4: commit object encoding

2013-08-27 Thread Junio C Hamano
Nicolas Pitre n...@fluxnic.net writes:

 On Tue, 27 Aug 2013, Junio C Hamano wrote:

 Nicolas Pitre n...@fluxnic.net writes:
 
  +  /* parse the tree line */
  +  if (in + 46 = tail || memcmp(in, tree , 5) || in[45] != '\n')
  +  goto bad_data;
  +  if (get_sha1_hex(in + 5, sha1)  0)
  +  goto bad_data;
 
 Is this strict enough to guarantee roundtrip hash identity?  Because
 get_sha1_hex() accepts hexadecimal represented with uppercase A-F,
 you need to reject such a broken commit object, no?

 BTW, is there any such objects in existence where sha1 ascii strings are 
 represented using uppercase letters?

Any commit or tag object that refers to another object with SHA-1
using uppercase letters is broken and invalid.  get_sha1_hex() is
not limited to reading these (i.e. it also is used to read object
name given on the command line) so it is lenient, but the above
codepath should care so that the result of hashing will stay the
same.

 Because there would be a simple 
 way to encode that fact in the pack v4 sha1 reference... but that change 
 has to happen now.

Hence, I do not think we care.

 I'm already claiming we won't support mixed case though.

Yeah, I am already claiming we won't support any uppercase ;-).
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 09/23] pack v4: commit object encoding

2013-08-27 Thread Nicolas Pitre
On Tue, 27 Aug 2013, Junio C Hamano wrote:

 Nicolas Pitre n...@fluxnic.net writes:
 
  On Tue, 27 Aug 2013, Junio C Hamano wrote:
 
  Nicolas Pitre n...@fluxnic.net writes:
  
   +/* parse the tree line */
   +if (in + 46 = tail || memcmp(in, tree , 5) || in[45] != '\n')
   +goto bad_data;
   +if (get_sha1_hex(in + 5, sha1)  0)
   +goto bad_data;
  
  Is this strict enough to guarantee roundtrip hash identity?  Because
  get_sha1_hex() accepts hexadecimal represented with uppercase A-F,
  you need to reject such a broken commit object, no?
 
  BTW, is there any such objects in existence where sha1 ascii strings are 
  represented using uppercase letters?
 
 Any commit or tag object that refers to another object with SHA-1
 using uppercase letters is broken and invalid.  get_sha1_hex() is
 not limited to reading these (i.e. it also is used to read object
 name given on the command line) so it is lenient, but the above
 codepath should care so that the result of hashing will stay the
 same.

Indeed, hence my concern about encoding the original case.

  Because there would be a simple 
  way to encode that fact in the pack v4 sha1 reference... but that change 
  has to happen now.
 
 Hence, I do not think we care.
 
  I'm already claiming we won't support mixed case though.
 
 Yeah, I am already claiming we won't support any uppercase ;-).

Perfect!

I've added a get_sha1_lowhex() to my tree.


Nicolas
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 09/23] pack v4: commit object encoding

2013-08-26 Thread Nicolas Pitre
This goes as follows:

- Tree reference: either variable length encoding of the index
  into the SHA1 table or the literal SHA1 prefixed by 0 (see
  add_sha1_ref()).

- Parent count: variable length encoding of the number of parents.
  This is normally going to occupy a single byte but doesn't have to.

- List of parent references: a list of add_sha1_ref() encoded references,
  or nothing if the parent count was zero.

- Author reference: variable length encoding of an index into the author
  string dictionary table which also covers the time zone.  To make the
  overall encoding efficient, the author table is already sorted by usage
  frequency so the most used names are first and require the shortest
  index encoding.

- Author time stamp: variable length encoded.  Year 2038 ready!

- Committer reference: same as author reference.

- Committer time stamp: same as author time stamp.

The remainder of the canonical commit object content is then zlib
compressed and appended to the above.

Rationale: The most important commit object data is densely encoded while
requiring no zlib inflate processing, and all SHA1 references are most
likely to be direct indices into the pack index file requiring no SHA1
search into the pack index file.

Signed-off-by: Nicolas Pitre n...@fluxnic.net
---
 packv4-create.c | 119 
 1 file changed, 119 insertions(+)

diff --git a/packv4-create.c b/packv4-create.c
index bf33d15..cedbbd9 100644
--- a/packv4-create.c
+++ b/packv4-create.c
@@ -13,6 +13,9 @@
 #include tree-walk.h
 #include pack.h
 
+
+static int pack_compression_level = Z_DEFAULT_COMPRESSION;
+
 struct data_entry {
unsigned offset;
unsigned size;
@@ -289,6 +292,122 @@ static unsigned char *add_sha1_ref(unsigned char *dst, 
const unsigned char *sha1
return dst + 20;
 }
 
+/*
+ * This converts a canonical commit object buffer into its
+ * tightly packed representation using the already populated
+ * and sorted commit_name_table dictionary.  The parsing is
+ * strict so to ensure the canonical version may always be
+ * regenerated and produce the same hash.
+ */
+void * conv_to_dict_commit(void *buffer, unsigned long *psize)
+{
+   unsigned long size = *psize;
+   char *in, *tail, *end;
+   unsigned char *out;
+   unsigned char sha1[20];
+   int nb_parents, index, tz_val;
+   unsigned long time;
+   z_stream stream;
+   int status;
+
+   /*
+* It is guaranteed that the output is always going to be smaller
+* than the input.  We could even do this conversion in place.
+*/
+   in = buffer;
+   tail = in + size;
+   buffer = xmalloc(size);
+   out = buffer;
+
+   /* parse the tree line */
+   if (in + 46 = tail || memcmp(in, tree , 5) || in[45] != '\n')
+   goto bad_data;
+   if (get_sha1_hex(in + 5, sha1)  0)
+   goto bad_data;
+   in += 46;
+   out = add_sha1_ref(out, sha1);
+
+   /* count how many parent lines */
+   nb_parents = 0;
+   while (in + 48  tail  !memcmp(in, parent , 7)  in[47] == '\n') {
+   nb_parents++;
+   in += 48;
+   }
+   out = add_number(out, nb_parents);
+
+   /* rewind and parse the parent lines */
+   in -= 48 * nb_parents;
+   while (nb_parents--) {
+   if (get_sha1_hex(in + 7, sha1))
+   goto bad_data;
+   out = add_sha1_ref(out, sha1);
+   in += 48;
+   }
+
+   /* parse the author line */
+   /* it must be at least author x x 0 +\n i.e. 21 chars */
+   if (in + 21 = tail || memcmp(in, author , 7))
+   goto bad_data;
+   in += 7;
+   end = get_nameend_and_tz(in, tz_val);
+   if (!end)
+   goto bad_data;
+   index = dict_add_entry(commit_name_table, tz_val, in, end - in);
+   if (index  0)
+   goto bad_dict;
+   out = add_number(out, index);
+   time = strtoul(end, end, 10);
+   if (!end || end[0] != ' ' || end[6] != '\n')
+   goto bad_data;
+   out = add_number(out, time);
+   in = end + 7;
+
+   /* parse the committer line */
+   /* it must be at least committer x x 0 +\n i.e. 24 chars */
+   if (in + 24 = tail || memcmp(in, committer , 7))
+   goto bad_data;
+   in += 10;
+   end = get_nameend_and_tz(in, tz_val);
+   if (!end)
+   goto bad_data;
+   index = dict_add_entry(commit_name_table, tz_val, in, end - in);
+   if (index  0)
+   goto bad_dict;
+   out = add_number(out, index);
+   time = strtoul(end, end, 10);
+   if (!end || end[0] != ' ' || end[6] != '\n')
+   goto bad_data;
+   out = add_number(out, time);
+   in = end + 7;
+
+   /* finally, deflate the remaining data */
+   memset(stream, 0, sizeof(stream));
+   deflateInit(stream, pack_compression_level);
+