Re: [PATCH v2] Document pack v4 format

2013-09-06 Thread Nicolas Pitre
On Fri, 6 Sep 2013, Duy Nguyen wrote:

> On Thu, Sep 5, 2013 at 11:52 PM, Duy Nguyen  wrote:
> > On Thu, Sep 5, 2013 at 12:39 PM, Nicolas Pitre  wrote:
> >> Now the pack index v3 probably needs to be improved a little, again to
> >> accommodate completion of thin packs.  Given that the main SHA1 table is
> >> now in the main pack file, it should be possible to still carry a small
> >> SHA1 table in the index file that corresponds to the appended objects
> >> only. This means that a SHA1 search will have to first use the main SHA1
> >> table in the pack file as it is done now, and if not found then use the
> >> SHA1 table in the index file if it exists.  And of course
> >> nth_packed_object_sha1() will have to be adjusted accordingly.
> >
> > What if the sender prepares the sha-1 table to contain missing objects
> > in advance? The sender should know what base objects are missing. Then
> > we only need to append objects at the receiving end and verify that
> > all new objects are also present in the sha-1 table.
> 
> One minor detail to sort out: the size of sha-1 table. Previously it's
> the number of objects in the pack. Now it's not true because the table
> may have more entries. So how should we record the table size? We
> could use null sha-1 as the end of table marker. Or we could make
> pack-objects write nr_objects as the number of entries _after_ pack
> completion, not the true number of objects in thin pack. I like the
> latter (no more rehashing the entire pack after completion) but then
> we need a cue to know that we have reached the end of the pack.

See the amendment I made to your documentation patch.  I opted for the 
latter.  To mark the end of the transmitted objects a zero byte (object 
type 0) is used.


Nicolas
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] Document pack v4 format

2013-09-05 Thread Duy Nguyen
On Thu, Sep 5, 2013 at 11:52 PM, Duy Nguyen  wrote:
> On Thu, Sep 5, 2013 at 12:39 PM, Nicolas Pitre  wrote:
>> Now the pack index v3 probably needs to be improved a little, again to
>> accommodate completion of thin packs.  Given that the main SHA1 table is
>> now in the main pack file, it should be possible to still carry a small
>> SHA1 table in the index file that corresponds to the appended objects
>> only. This means that a SHA1 search will have to first use the main SHA1
>> table in the pack file as it is done now, and if not found then use the
>> SHA1 table in the index file if it exists.  And of course
>> nth_packed_object_sha1() will have to be adjusted accordingly.
>
> What if the sender prepares the sha-1 table to contain missing objects
> in advance? The sender should know what base objects are missing. Then
> we only need to append objects at the receiving end and verify that
> all new objects are also present in the sha-1 table.

One minor detail to sort out: the size of sha-1 table. Previously it's
the number of objects in the pack. Now it's not true because the table
may have more entries. So how should we record the table size? We
could use null sha-1 as the end of table marker. Or we could make
pack-objects write nr_objects as the number of entries _after_ pack
completion, not the true number of objects in thin pack. I like the
latter (no more rehashing the entire pack after completion) but then
we need a cue to know that we have reached the end of the pack.
-- 
Duy


Re: [PATCH v2] Document pack v4 format

2013-09-05 Thread Nicolas Pitre
On Thu, 5 Sep 2013, Junio C Hamano wrote:

> Nicolas Pitre  writes:
> 
> > On Thu, 5 Sep 2013, Duy Nguyen wrote:
> >
> >> On Thu, Sep 5, 2013 at 12:39 PM, Nicolas Pitre  wrote:
> >> > Now the pack index v3 probably needs to be improved a little, again to
> >> > accommodate completion of thin packs.  Given that the main SHA1 table is
> >> > now in the main pack file, it should be possible to still carry a small
> >> > SHA1 table in the index file that corresponds to the appended objects
> >> > only. This means that a SHA1 search will have to first use the main SHA1
> >> > table in the pack file as it is done now, and if not found then use the
> >> > SHA1 table in the index file if it exists.  And of course
> >> > nth_packed_object_sha1() will have to be adjusted accordingly.
> >> 
> >> What if the sender prepares the sha-1 table to contain missing objects
> >> in advance? The sender should know what base objects are missing. Then
> >> we only need to append objects at the receiving end and verify that
> >> all new objects are also present in the sha-1 table.
> >
> > I do like this idea very much.  And that doesn't increase the thin pack 
> > size as the larger SHA1 table will be compensated by a smaller sha1ref 
> > encoding in those objects referring to the missing ones.
> 
> Let me see if I understand the proposal correctly.  Compared to a
> normal pack-v4 stream, a thin pack-v4 stream:
> 
>  - has all the SHA-1 object names involved in the stream in its main
>object name table---most importantly, names of objects that
>"thin" optimization omits from the pack data body are included;
> 
>  - uses the SHA-1 object name table offset to refer to other
>objects, even to ones that thin stream will not transfer in the
>pack data body;
> 
>  - is completed at the receiving end by appending the data for the
>objects that were not transferred due to the "thin" optimization.
> 
> So the invariant "all objects contained in the pack" in:
> 
>  - A table of sorted SHA-1 object names for all objects contained in
>the pack.
> 
> that appears in Documentation/technical/pack-format.txt is still
> kept at the end, and more importantly, any object that is mentioned
> in this table can be reconstructed by using pack data in the same
> packfile without referencing anything else.  Most importantly, if we
> were to build a v2 .idx file for the resulting .pack, the list of
> object names in the .idx file would be identical to the object names
> in this table in the .pack file.

That is right.

> If that is the case, I too like this.
> 
> I briefly wondered if it makes sense to mention objects that are
> often referred to that do not exist in the pack in this table
> (e.g. new commits included in this pack refer to a tree object that
> has not changed for ages---their trees mention this subtree using a
> "SHA-1 reference encoding" and being able to name the old,
> unchanging tree with an index to the object table may save space),
> but that would break the above invariant in a big way---some objects
> mentioned in the table may not exist in the packfile itself---and it
> probably is not a good idea.

Yet, if an old tree that doesn't change is often referred to, it should 
be possible to have only one such reference in the whole pack and all 
the other trees can use a delta copy sequence to refer to it.  At this 
point whether or not the tree being referred to is listed inline or in 
the SHA1 table doesn't make a big difference.

> Unlike that broken idea, "include
> names of the objects that will be appended anyway" approach to help
> fattening a thin-pack makes very good sense to me.


Re: [PATCH v2] Document pack v4 format

2013-09-05 Thread Junio C Hamano
Nicolas Pitre  writes:

> On Thu, 5 Sep 2013, Duy Nguyen wrote:
>
>> On Thu, Sep 5, 2013 at 12:39 PM, Nicolas Pitre  wrote:
>> > Now the pack index v3 probably needs to be improved a little, again to
>> > accommodate completion of thin packs.  Given that the main SHA1 table is
>> > now in the main pack file, it should be possible to still carry a small
>> > SHA1 table in the index file that corresponds to the appended objects
>> > only. This means that a SHA1 search will have to first use the main SHA1
>> > table in the pack file as it is done now, and if not found then use the
>> > SHA1 table in the index file if it exists.  And of course
>> > nth_packed_object_sha1() will have to be adjusted accordingly.
>> 
>> What if the sender prepares the sha-1 table to contain missing objects
>> in advance? The sender should know what base objects are missing. Then
>> we only need to append objects at the receiving end and verify that
>> all new objects are also present in the sha-1 table.
>
> I do like this idea very much.  And that doesn't increase the thin pack 
> size as the larger SHA1 table will be compensated by a smaller sha1ref 
> encoding in those objects referring to the missing ones.

Let me see if I understand the proposal correctly.  Compared to a
normal pack-v4 stream, a thin pack-v4 stream:

 - has all the SHA-1 object names involved in the stream in its main
   object name table---most importantly, names of objects that
   "thin" optimization omits from the pack data body are included;

 - uses the SHA-1 object name table offset to refer to other
   objects, even to ones that thin stream will not transfer in the
   pack data body;

 - is completed at the receiving end by appending the data for the
   objects that were not transferred due to the "thin" optimization.

So the invariant "all objects contained in the pack" in:

 - A table of sorted SHA-1 object names for all objects contained in
   the pack.

that appears in Documentation/technical/pack-format.txt is still
kept at the end, and more importantly, any object that is mentioned
in this table can be reconstructed by using pack data in the same
packfile without referencing anything else.  Most importantly, if we
were to build a v2 .idx file for the resulting .pack, the list of
object names in the .idx file would be identical to the object names
in this table in the .pack file.

If that is the case, I too like this.

I briefly wondered if it makes sense to mention objects that are
often referred to that do not exist in the pack in this table
(e.g. new commits included in this pack refer to a tree object that
has not changed for ages---their trees mention this subtree using a
"SHA-1 reference encoding" and being able to name the old,
unchanging tree with an index to the object table may save space),
but that would break the above invariant in a big way---some objects
mentioned in the table may not exist in the packfile itself---and it
probably is not a good idea.  Unlike that broken idea, "include
names of the objects that will be appended anyway" approach to help
fattening a thin-pack makes very good sense to me.




Re: [PATCH v2] Document pack v4 format

2013-09-05 Thread Duy Nguyen
On Thu, Sep 5, 2013 at 12:39 PM, Nicolas Pitre  wrote:
> Now the pack index v3 probably needs to be improved a little, again to
> accommodate completion of thin packs.  Given that the main SHA1 table is
> now in the main pack file, it should be possible to still carry a small
> SHA1 table in the index file that corresponds to the appended objects
> only. This means that a SHA1 search will have to first use the main SHA1
> table in the pack file as it is done now, and if not found then use the
> SHA1 table in the index file if it exists.  And of course
> nth_packed_object_sha1() will have to be adjusted accordingly.

What if the sender prepares the sha-1 table to contain missing objects
in advance? The sender should know what base objects are missing. Then
we only need to append objects at the receiving end and verify that
all new objects are also present in the sha-1 table.
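
The receiving-end check could look roughly like the sketch below. It
assumes the sorted sha-1 table is available as a flat array of 20-byte
names and that a per-entry "present" flag is kept while the pack is
read; table_pos(), mark_appended() and pack_is_complete() are
hypothetical helpers for illustration, not existing git functions.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define SHA1_LEN 20

/* Comparator for bsearch(): raw byte order of two 20-byte names. */
static int cmp_sha1(const void *a, const void *b)
{
	return memcmp(a, b, SHA1_LEN);
}

/* Position of sha1 in a sorted table of nr names, or -1 if absent. */
static long table_pos(const unsigned char *table, size_t nr,
		      const unsigned char *sha1)
{
	const unsigned char *hit = bsearch(sha1, table, nr, SHA1_LEN, cmp_sha1);

	return hit ? (long)((size_t)(hit - table) / SHA1_LEN) : -1;
}

/*
 * Record that the object named sha1 was appended.  Returns 0 on success,
 * -1 when the name is not in the table or its data is already present.
 */
static int mark_appended(const unsigned char *table, size_t nr,
			 unsigned char *present, const unsigned char *sha1)
{
	long pos = table_pos(table, nr, sha1);

	if (pos < 0 || present[pos])
		return -1;
	present[pos] = 1;
	return 0;
}

/* The pack is complete once every table entry has object data. */
static int pack_is_complete(const unsigned char *present, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++)
		if (!present[i])
			return 0;
	return 1;
}
```

With this shape, an appended object whose name is missing from the
table is rejected immediately, and the final pass only has to confirm
that no table entry is left without data.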
-- 
Duy


Re: [PATCH v2] Document pack v4 format

2013-09-05 Thread Nicolas Pitre
On Thu, 5 Sep 2013, Duy Nguyen wrote:

> On Thu, Sep 5, 2013 at 12:39 PM, Nicolas Pitre  wrote:
> > Now the pack index v3 probably needs to be improved a little, again to
> > accommodate completion of thin packs.  Given that the main SHA1 table is
> > now in the main pack file, it should be possible to still carry a small
> > SHA1 table in the index file that corresponds to the appended objects
> > only. This means that a SHA1 search will have to first use the main SHA1
> > table in the pack file as it is done now, and if not found then use the
> > SHA1 table in the index file if it exists.  And of course
> > nth_packed_object_sha1() will have to be adjusted accordingly.
> 
> What if the sender prepares the sha-1 table to contain missing objects
> in advance? The sender should know what base objects are missing. Then
> we only need to append objects at the receiving end and verify that
> all new objects are also present in the sha-1 table.

I do like this idea very much.  And that doesn't increase the thin pack 
size as the larger SHA1 table will be compensated by a smaller sha1ref 
encoding in those objects referring to the missing ones.



Nicolas


Re: [PATCH v2] Document pack v4 format

2013-09-04 Thread Nicolas Pitre
On Thu, 5 Sep 2013, Duy Nguyen wrote:

> On Thu, Sep 5, 2013 at 11:40 AM, Nicolas Pitre  wrote:
> > On Thu, 5 Sep 2013, Duy Nguyen wrote:
> >
> >> On Thu, Sep 5, 2013 at 11:12 AM, Nicolas Pitre  wrote:
> >> > Many other bugs have now been fixed.  A git.git repository with packs
> >> > version 4 appears to be functional and passes git-fsck --full --strict.
> >>
> >> Yeah I was looking at the diff some minutes ago, saw changes in
> >> pack-check.c and wondering if fsck was working. I'll add v4 support to
> >> index-pack.
> >
> > Beware that the tree delta encoding has changed a little.  This saved up
> > to 2% on some repos.
> 
> Thanks for the heads up.
> 
> > I'll probably change the encoding to incorporate the escape hatch
> > for path and name references as discussed previously.

This is now committed.  I don't think there should be any more pack 
format changes at this point.

Now the pack index v3 probably needs to be improved a little, again to 
accommodate completion of thin packs.  Given that the main SHA1 table is 
now in the main pack file, it should be possible to still carry a small 
SHA1 table in the index file that corresponds to the appended objects 
only. This means that a SHA1 search will have to first use the main SHA1 
table in the pack file as it is done now, and if not found then use the 
SHA1 table in the index file if it exists.  And of course 
nth_packed_object_sha1() will have to be adjusted accordingly.
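
A minimal sketch of that two-step lookup, assuming both tables are
sorted flat arrays of 20-byte names and that appended objects are
numbered after the ones in the main table (that numbering is an
assumption made for illustration).  find_in_table(), find_object() and
nth_object_name() are hypothetical names, not the actual functions in
the tree.

```c
#include <stddef.h>
#include <string.h>

#define SHA1_LEN 20

/* Binary search in a sorted table of nr names; position or -1. */
static long find_in_table(const unsigned char *table, size_t nr,
			  const unsigned char *sha1)
{
	size_t lo = 0, hi = nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		int cmp = memcmp(table + mid * SHA1_LEN, sha1, SHA1_LEN);

		if (!cmp)
			return (long)mid;
		if (cmp < 0)
			lo = mid + 1;
		else
			hi = mid;
	}
	return -1;
}

/* Try the main table in the pack first, then the small one in the index. */
static long find_object(const unsigned char *pack_table, size_t pack_nr,
			const unsigned char *idx_table, size_t idx_nr,
			const unsigned char *sha1)
{
	long pos = find_in_table(pack_table, pack_nr, sha1);

	if (pos >= 0)
		return pos;
	if (!idx_table)
		return -1;
	pos = find_in_table(idx_table, idx_nr, sha1);
	return pos < 0 ? -1 : (long)pack_nr + pos;
}

/* Inverse mapping: object number back to its name, or NULL if out of range. */
static const unsigned char *nth_object_name(const unsigned char *pack_table,
					    size_t pack_nr,
					    const unsigned char *idx_table,
					    size_t idx_nr, size_t n)
{
	if (n < pack_nr)
		return pack_table + n * SHA1_LEN;
	n -= pack_nr;
	if (idx_table && n < idx_nr)
		return idx_table + n * SHA1_LEN;
	return NULL;
}
```

The cost of the fallback is one extra binary search, paid only for
names that are not in the main table.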



Nicolas


Re: [PATCH v2] Document pack v4 format

2013-09-04 Thread Duy Nguyen
On Thu, Sep 5, 2013 at 11:40 AM, Nicolas Pitre  wrote:
> On Thu, 5 Sep 2013, Duy Nguyen wrote:
>
>> On Thu, Sep 5, 2013 at 11:12 AM, Nicolas Pitre  wrote:
>> > Many other bugs have now been fixed.  A git.git repository with packs
>> > version 4 appears to be functional and passes git-fsck --full --strict.
>>
>> Yeah I was looking at the diff some minutes ago, saw changes in
>> pack-check.c and wondering if fsck was working. I'll add v4 support to
>> index-pack.
>
> Beware that the tree delta encoding has changed a little.  This saved up
> to 2% on some repos.

Thanks for the heads up.

> I'll probably change the encoding to incorporate the escape hatch
> for path and name references as discussed previously.
>
>> Waiting to see the new, v4-aware tree walker interface
>> with good "rev-list --all --objects" numbers from you.
>
> Well, unfortunately I've put more time than I really had available into
> this project lately.  I'm about to call for other people to take it over
> and pursue this work further.
>
> I really wanted to set the pack format direction since I've been toying
> with this for so many years.  Now the tool to convert a pack is there,
> and the read side is also there, proving that the format does work and
> the encoding and decoding code is functional and may serve as reference.
> So that's about the extent of what I can contribute at this point.
>
> I'll be happy to provide design assistance and code review comments of
> course.  But I won't be able to put in the time to do the actual coding
> myself much longer.

You've done a great job in designing v4 and getting basic support in
place. I think you'll need to post your series again so Junio can pick
it up. Then we (at least I) will try to continue from there. I have
high hopes that this will not drop out like the split-blob series.
-- 
Duy


Re: [PATCH v2] Document pack v4 format

2013-09-04 Thread Nicolas Pitre
On Thu, 5 Sep 2013, Duy Nguyen wrote:

> On Thu, Sep 5, 2013 at 11:12 AM, Nicolas Pitre  wrote:
> > Many other bugs have now been fixed.  A git.git repository with packs
> > version 4 appears to be functional and passes git-fsck --full --strict.
> 
> Yeah I was looking at the diff some minutes ago, saw changes in
> pack-check.c and wondering if fsck was working. I'll add v4 support to
> index-pack.

Beware that the tree delta encoding has changed a little.  This saved up 
to 2% on some repos.

I'll probably change the encoding to incorporate the escape hatch 
for path and name references as discussed previously.

> Waiting to see the new, v4-aware tree walker interface
> with good "rev-list --all --objects" numbers from you.

Well, unfortunately I've put more time than I really had available into 
this project lately.  I'm about to call for other people to take it over 
and pursue this work further.

I really wanted to set the pack format direction since I've been toying 
with this for so many years.  Now the tool to convert a pack is there, 
and the read side is also there, proving that the format does work and 
the encoding and decoding code is functional and may serve as reference.  
So that's about the extent of what I can contribute at this point.

I'll be happy to provide design assistance and code review comments of 
course.  But I won't be able to put in the time to do the actual coding 
myself much longer.


Nicolas


Re: [PATCH v2] Document pack v4 format

2013-09-04 Thread Duy Nguyen
On Thu, Sep 5, 2013 at 11:12 AM, Nicolas Pitre  wrote:
> Many other bugs have now been fixed.  A git.git repository with packs
> version 4 appears to be functional and passes git-fsck --full --strict.

Yeah I was looking at the diff some minutes ago, saw changes in
pack-check.c and wondering if fsck was working. I'll add v4 support to
index-pack. Waiting to see the new, v4-aware tree walker interface
with good "rev-list --all --objects" numbers from you.
-- 
Duy


Re: [PATCH v2] Document pack v4 format

2013-09-04 Thread Nicolas Pitre
On Tue, 3 Sep 2013, Duy Nguyen wrote:

> On Tue, Sep 3, 2013 at 6:49 PM, Duy Nguyen  wrote:
> > On Tue, Sep 3, 2013 at 1:46 PM, Nicolas Pitre  wrote:
> >> So... looks like pack v4 is now "functional".
> >>
> >> However something is still wrong as it operates about 6 times slower
> >> than pack v3.
> >>
> >> Anyone wishes to investigate?
> >
> > You recurse in decode_entries too deep. I check the first 1
> > decode_entries() calls in pv4_get_tree(). The deepest level is 3491.
> 
> And I was wrong, the call depth is not that deep, but the number of
> decode_entries calls triggered by one pv4_get_tree() is that many.
> This is on git.git and the tree being processed is "t", which has 672
> entries.. There are funny access patterns. This is the output of
> 
>fprintf(stderr, "[%d] %d - %d %u\n", call_depth, copy_start,
> copy_count, copy_objoffset);
> 
> [1] 0 - 1 48838573
> [2] 0 - 1 48826699
> [3] 0 - 1 48820760
> [4] 0 - 1 48814812
> [5] 0 - 1 48805904
> [6] 0 - 1 48797000
> [7] 0 - 1 48794034
> [8] 0 - 1 48791067
> [9] 0 - 1 48788100
> [10] 0 - 1 48785134
> [11] 0 - 1 48776221
> [12] 0 - 1 48764321
> [13] 0 - 1 48503227
> [14] 0 - 1 48485415
> [15] 0 - 1 48473512
> [16] 0 - 1 48443621
> [17] 0 - 1 48401788
> [18] 0 - 1 48377834
> [19] 0 - 1 48371841
> [20] 0 - 1 48341809
> [21] 0 - 1 48260734
> [22] 0 - 1 48236635
> [23] 0 - 1 46845105
> [24] 0 - 1 14603061
> [25] 2 - 1 48838573
> [2] 0 - 1 48826699
> 
> It goes through 20+ base trees just to get one tree entry, I think..

Yeah... that's true.  The encoding should refer to the deepest tree 
directly in that case.  Better delta heuristics will have to be worked 
out here.  The code as it is now can't do that.

There was also a bug that prevented larger copy sequences from being 
created, which is now fixed.

I added to packv4-create the ability to specify the minimum range of 
consecutive entries that can be represented by a copy sequence to allow 
experiments.  However, even when the tree deltas are completely disabled 
(using --min-tree-copy=0 achieves that) the CPU usage is still much 
higher which is rather unexpected.  In theory this shouldn't be the 
case.

Many other bugs have now been fixed.  A git.git repository with packs 
version 4 appears to be functional and passes git-fsck --full --strict.


Nicolas


Re: [PATCH v2] Document pack v4 format

2013-09-03 Thread Duy Nguyen
On Tue, Sep 3, 2013 at 6:49 PM, Duy Nguyen  wrote:
> On Tue, Sep 3, 2013 at 1:46 PM, Nicolas Pitre  wrote:
>> So... looks like pack v4 is now "functional".
>>
>> However something is still wrong as it operates about 6 times slower
>> than pack v3.
>>
>> Anyone wishes to investigate?
>
> You recurse in decode_entries too deep. I check the first 1
> decode_entries() calls in pv4_get_tree(). The deepest level is 3491.

And I was wrong, the call depth is not that deep, but the number of
decode_entries calls triggered by one pv4_get_tree() is that many.
This is on git.git and the tree being processed is "t", which has 672
entries.. There are funny access patterns. This is the output of

   fprintf(stderr, "[%d] %d - %d %u\n", call_depth, copy_start,
copy_count, copy_objoffset);

[1] 0 - 1 48838573
[2] 0 - 1 48826699
[3] 0 - 1 48820760
[4] 0 - 1 48814812
[5] 0 - 1 48805904
[6] 0 - 1 48797000
[7] 0 - 1 48794034
[8] 0 - 1 48791067
[9] 0 - 1 48788100
[10] 0 - 1 48785134
[11] 0 - 1 48776221
[12] 0 - 1 48764321
[13] 0 - 1 48503227
[14] 0 - 1 48485415
[15] 0 - 1 48473512
[16] 0 - 1 48443621
[17] 0 - 1 48401788
[18] 0 - 1 48377834
[19] 0 - 1 48371841
[20] 0 - 1 48341809
[21] 0 - 1 48260734
[22] 0 - 1 48236635
[23] 0 - 1 46845105
[24] 0 - 1 14603061
[25] 2 - 1 48838573
[2] 0 - 1 48826699

It goes through 20+ base trees just to get one tree entry, I think..
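
To make the cost concrete, here is a toy model of that access pattern.
It is not the pv4 decoder: struct toy_tree and resolve_entry() are
made up for illustration, and each tree either stores its entries
literally or copies them all from a single base tree.

```c
#include <stddef.h>

struct toy_tree {
	const struct toy_tree *base;	/* NULL when entries are stored literally */
	int entry[4];			/* literal values, unused when base is set */
};

/*
 * Resolve entry i of tree t by following the copy chain down to the
 * tree that actually stores it.  *visited receives the number of trees
 * that had to be looked at.
 */
static int resolve_entry(const struct toy_tree *t, size_t i, int *visited)
{
	*visited = 1;
	while (t->base) {
		t = t->base;
		(*visited)++;
	}
	return t->entry[i];
}
```

With a chain of 24 trees, fetching one entry from the newest tree
visits all 24 of them.  If the newest tree pointed at the one holding
the literal entry directly, the same lookup would visit only 2.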
-- 
Duy


Re: [PATCH v2] Document pack v4 format

2013-09-03 Thread Duy Nguyen
On Tue, Sep 3, 2013 at 1:46 PM, Nicolas Pitre  wrote:
> So... looks like pack v4 is now "functional".
>
> However something is still wrong as it operates about 6 times slower
> than pack v3.
>
> Anyone wishes to investigate?

You recurse in decode_entries too deep. I check the first 1
decode_entries() calls in pv4_get_tree(). The deepest level is 3491.
-- 
Duy


Re: [PATCH v2] Document pack v4 format

2013-09-02 Thread Nicolas Pitre
On Tue, 3 Sep 2013, Nicolas Pitre wrote:

> On Sat, 31 Aug 2013, Nguyễn Thái Ngọc Duy wrote:
> 
> > 
> > Signed-off-by: Nguyễn Thái Ngọc Duy 
> > ---
> >  Incorporated suggestions by Nico and Junio. I went ahead and added
> >  escape hatches for converting thin packs to full ones so the document
> >  does not really match the code (I've been watching Nico's repository,
> >  commit reading is added, good stuff!)
> 
> Now tree reading is added.  Multiple encoding bug fixes trickled down to 
> their originating commits as well.
> 
> Something is still wrong with deltas though.

Deltas fixed now.

So... looks like pack v4 is now "functional".

However something is still wrong as it operates about 6 times slower 
than pack v3.

Anyone wishes to investigate?


Nicolas


Re: [PATCH v2] Document pack v4 format

2013-09-02 Thread Nicolas Pitre
On Sat, 31 Aug 2013, Nguyễn Thái Ngọc Duy wrote:

> 
> Signed-off-by: Nguyễn Thái Ngọc Duy 
> ---
>  Incorporated suggestions by Nico and Junio. I went ahead and added
>  escape hatches for converting thin packs to full ones so the document
>  does not really match the code (I've been watching Nico's repository,
>  commit reading is added, good stuff!)

Now tree reading is added.  Multiple encoding bug fixes trickled down to 
their originating commits as well.

Something is still wrong with deltas though.


Nicolas


[PATCH v2] Document pack v4 format

2013-08-30 Thread Nguyễn Thái Ngọc Duy

Signed-off-by: Nguyễn Thái Ngọc Duy 
---
 Incorporated suggestions by Nico and Junio. I went ahead and added
 escape hatches for converting thin packs to full ones so the document
 does not really match the code (I've been watching Nico's repository,
 commit reading is added, good stuff!)

 The proposal is, value 0 in the index to ident table is reserved,
 followed by the ident string. The real index to ident table is idx-1.

 Similarly, the value 1 in the index to path name table is reserved 
 (value 0 is already used for referring back to base tree) so the
 actual index is idx-2.
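
 As a rough illustration of the two reserved-value rules above, the
 sketch below classifies an already-decoded reference value.  It does
 not read the variable length encoding itself, and the enum, struct and
 function names are hypothetical, not taken from the code.

```c
enum v4_ref_kind {
	REF_INLINE,	/* the entry follows inline in the object data */
	REF_BASE_TREE,	/* tree entries only: refer back to the base tree */
	REF_TABLE	/* index holds a position in the dictionary table */
};

struct v4_ref {
	enum v4_ref_kind kind;
	unsigned long index;	/* meaningful only when kind is REF_TABLE */
};

/* Ident reference: 0 means an inline ident string, else table index idx-1. */
static struct v4_ref decode_ident_ref(unsigned long value)
{
	struct v4_ref r = { REF_INLINE, 0 };

	if (value) {
		r.kind = REF_TABLE;
		r.index = value - 1;
	}
	return r;
}

/*
 * Path reference: 0 refers back to the base tree, 1 means an inline
 * mode and path name, anything larger is table index idx-2.
 */
static struct v4_ref decode_path_ref(unsigned long value)
{
	struct v4_ref r = { REF_BASE_TREE, 0 };

	if (value == 1) {
		r.kind = REF_INLINE;
	} else if (value > 1) {
		r.kind = REF_TABLE;
		r.index = value - 2;
	}
	return r;
}
```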

 Documentation/technical/pack-format.txt | 128 +++-
 1 file changed, 127 insertions(+), 1 deletion(-)

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index 8e5bf60..c866287 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -1,7 +1,7 @@
 Git pack format
 ===
 
-== pack-*.pack files have the following format:
+== pack-*.pack files version 2 and 3 have the following format:
 
- A header appears at the beginning and consists of the following:
 
@@ -36,6 +36,127 @@ Git pack format
 
   - The trailer records 20-byte SHA-1 checksum of all of the above.
 
+== pack-*.pack files version 4 have the following format:
+
+   - A header appears at the beginning and consists of the following:
+
+ 4-byte signature:
+   The signature is: {'P', 'A', 'C', 'K'}
+
+ 4-byte version number (network byte order): must be 4
+
+ 4-byte number of objects contained in the pack (network byte order)
+
+   - A series of tables, described separately.
+
+   - The tables are followed by a number of object entries, each of
+ which looks like below:
+
+ (undeltified representation)
+ n-byte type and length (4-bit type, (n-1)*7+4-bit length)
+ data
+
+ (deltified representation)
+ n-byte type and length (4-bit type, (n-1)*7+4-bit length)
+ base object name in SHA-1 reference encoding
+ compressed delta data
+
+ In undeltified format, blobs and tags are compressed. Trees are
+ not compressed at all. Some headers in commits are stored
+ uncompressed, the rest is compressed. Tree and commit
+ representations are described in detail separately.
+
+ Blobs and tags are deltified and compressed the same way as in
+ v3. Commits are not deltified. Trees are deltified using
+ undeltified representation.
+
+  - The trailer records 20-byte SHA-1 checksum of all of the above.
+
+=== Pack v4 tables
+
+ - A table of sorted SHA-1 object names for all objects contained in
+   the pack.
+
+   This table can be referred to using "SHA-1 reference encoding":
+   It's an index number in variable length encoding. If it's
+   non-zero, its value minus one is the index in this table. If it's
+   zero, 20 bytes of SHA-1 follow.
+
+ - Ident table: the uncompressed length in variable encoding,
+   followed by zlib-compressed dictionary. Each entry consists of
+   two prefix bytes storing timezone followed by a NUL-terminated
+   string.
+
+   Entries should be sorted by frequency so that the most frequent
+   entry has the smallest index, thus most efficient variable
+   encoding.
+
+   The table can be referred to using "ident reference encoding":
+   It's an index number in variable length encoding. If it's
+   non-zero, its value minus one is the index in this table. If it's
+   zero, a new entry in the same format follows: two prefix
+   bytes and a NUL-terminated string.
+
+ - Tree path table: the same format as the ident table. Each entry
+   consists of two prefix bytes storing tree entry mode, then a
+   NUL-terminated path name. Same sort order recommendation applies.
+
+=== Commit representation
+
+  - n-byte type and length (4-bit type, (n-1)*7+4-bit length)
+
+  - Tree SHA-1 in SHA-1 reference encoding
+
+  - Parent count in variable length encoding
+
+  - Parent SHA-1s in SHA-1 reference encoding
+
+  - Author reference in ident reference encoding
+
+  - Author timestamp in variable length encoding
+
+  - Committer reference in ident reference encoding
+
+  - Committer timestamp in variable length encoding
+
+  - Compressed data of remaining header and the body
+
+=== Tree representation
+
+  - n-byte type and length (4-bit type, (n-1)*7+4-bit length)
+
+  - Number of tree entries in variable length encoding
+
+  - A number of entries, each starting with path component reference:
+a number, in variable length encoding.
+
+If the path component reference is greater than 1, its value minus
+two is the index in tree path table. The path component reference
+is followed by the tree entry SHA-1 in SHA-1 reference encoding.
+
+If the path component reference is 1, it's followed by
+
+- two prefix bytes representing tree entry mode
+
+- NUL-terminated path name
+
+- tree entry SHA-1 in SHA-1 reference encoding
+
+If the path compone