Re: [PATCH/RFC] index-pack: produce pack index version 3

2012-08-16 Thread Shawn Pearce
On Wed, Aug 15, 2012 at 10:42 PM, Junio C Hamano gits...@pobox.com wrote:
 Shawn Pearce spea...@spearce.org writes:

 ... But I think its worth giving
 him a few weeks to finish getting the code ready, vs. rushing
 something in that someone else thinks might help. We have waited more
 than 6 years or whatever to improve packing. Colby's experiments are
 showing massive improvements (40s enumeration cut to 100ms) with low
 disk usage increase (10%) and no pack file format changes.

 No matter what you do, you would be saving the bitmap somewhere
 outside the *.pack file, yes?  Will it be in some extended section
 of the new *.idx file?

Yes, because it is derived from the pack stream its stored external
from it. We have chosen to append it as a new section below the CRC-32
table in the *.idx file. It could also go into a new file stream
(*.bdx?) but I don't think this is worthwhile. Its a bikeshed that can
be painted once the algorithm has proven its value. If it can't do
that, its irrelevant where its stored.

 With the bitmap, your object enumeration phase may go very fast, and
 you would be able to figure out, in response to a fetch request I
 have these tips of refs; please update me with these refs of yours,
 the set of objects you would need to pack (i.e. the ones that are
 reachable from your refs that were asked for, but that are not
 reachable from the refs the requestor has).

Yes. Not only that, but we are finding that the number of objects to
send is fewer, resulting in a much smaller data stream. The bitmap
gives us near perfect knowledge of everything the client has, not just
the objects on the edge of the graph cut expressed by the ACKed have
lines. Objects that existed earlier than the cut don't have to be
reset.

So the object enumeration phase goes very fast. The compressing
objects phase is also time reduced, as fewer objects are needed to be
considered, as fewer objects are being sent. With more complete
knowledge of the have side, more deltas can be reused as-is, rather
than re-encoded on the fly. This reduces the compressing objects phase
time even further. And with fewer objects to send, we have easily seen
trivial cases of a Linux kernel fetch that is 1 week behind Linus' tip
save like 200K on an 800K transfer. Its hard to hate having all of the
phases go faster, and transmit less data to the client.

 Among these objects, there will be ones that are expressed as a
 delta against what you are going to send, or as a delta against what
 you know the recipient must already have (if you are using thin pack
 transfer) in the packfiles you have, and you can send these deltas
 as-is without recomputation.

Yes. I point that out above before even reading this far in your message. :-(

 But there will be ones that are either expressed as a base in your
 packfile, or as a delta against something you are not going to send
 and you know that the recipient does not have.  In order to turn
 these objects into deltas, it may be necessary to have a way to
 record which delta chain each object belongs to

Excellent observation. Instead of performing a hacky delta base search
by looking at identical path names in the have set, finding another
candidate in the same cluster of deltas that the object is currently
encoded against would yield a pretty efficient delta on the wire. It
doesn't have to be another object in the same chain, it could also be
a sibling object that shares a similar transitive delta base.

, and if you are
 introducing the mechanism to have extended sections in the new *.idx
 file, that may be a good place to do so.

Thus far we haven't done an extended sections modification to the
*.idx format. Its just a 'E003' version that requires the bitmaps to
be present below the CRC-32 table. But I can see how framing this as a
group of optional extensions would be worthwhile.

  When you need to express
 an object that your bitmap told you to send (as opposed to rev-list
 walking told you with the paths to the objects), you can find other
 objects that belong to the same delta chain and that you know are
 available to the recipient when it starts to fixing the thin pack
 using that extra piece of information, in order to find the optimal
 delta base to encode such an object against.

Yes.

 Just for fun, I applied the attached patch and repacked the history
 leading to v1.7.12-rc3 with the default depth/window:

   git rev-list --objects --all |
   git pack-objects \
   --no-reuse-delta --no-reuse-object --delta-base-offset \
   [--no-namehash] pack

 with and without the experimental --no-namehash option.  The result
 is staggering.  With name-hash to group objects from the same path
 close together in the delta window, the resulting pack is 33M.
 Without the name-hash hint, the same pack is 163M!  Needless to say,
 keeping the objects that should be hashed together inside a delta
 window is really important.

Cute experiment. Yes that name hash works as a decent predictor of

Re: [PATCH/RFC] index-pack: produce pack index version 3

2012-08-15 Thread Junio C Hamano
Shawn Pearce spea...@spearce.org writes:

 ... But I think its worth giving
 him a few weeks to finish getting the code ready, vs. rushing
 something in that someone else thinks might help. We have waited more
 than 6 years or whatever to improve packing. Colby's experiments are
 showing massive improvements (40s enumeration cut to 100ms) with low
 disk usage increase (10%) and no pack file format changes.

No matter what you do, you would be saving the bitmap somewhere
outside the *.pack file, yes?  Will it be in some extended section
of the new *.idx file?

With the bitmap, your object enumeration phase may go very fast, and
you would be able to figure out, in response to a fetch request I
have these tips of refs; please update me with these refs of yours,
the set of objects you would need to pack (i.e. the ones that are
reachable from your refs that were asked for, but that are not
reachable from the refs the requestor has).  

Among these objects, there will be ones that are expressed as a
delta against what you are going to send, or as a delta against what
you know the recipient must already have (if you are using thin pack
transfer) in the packfiles you have, and you can send these deltas
as-is without recomputation.

But there will be ones that are either expressed as a base in your
packfile, or as a delta against something you are not going to send
and you know that the recipient does not have.  In order to turn
these objects into deltas, it may be necessary to have a way to
record which delta chain each object belongs to, and if you are
introducing the mechanism to have extended sections in the new *.idx
file, that may be a good place to do so.  When you need to express
an object that your bitmap told you to send (as opposed to rev-list
walking told you with the paths to the objects), you can find other
objects that belong to the same delta chain and that you know are
available to the recipient when it starts to fixing the thin pack
using that extra piece of information, in order to find the optimal
delta base to encode such an object against.

Just for fun, I applied the attached patch and repacked the history
leading to v1.7.12-rc3 with the default depth/window:

  git rev-list --objects --all |
  git pack-objects \
  --no-reuse-delta --no-reuse-object --delta-base-offset \
  [--no-namehash] pack

with and without the experimental --no-namehash option.  The result
is staggering.  With name-hash to group objects from the same path
close together in the delta window, the resulting pack is 33M.
Without the name-hash hint, the same pack is 163M!  Needless to say,
keeping the objects that should be hashed together inside a delta
window is really important.

An obvious way to record the delta chain is to simply keep the
name_hash of each object in the pack, which would need 2 bytes per
object in the pack, that would bloat pack_idx_entry size from 32
bytes to 34 bytes per entry.  That way, after your bitmap discovers
an object that cannot reuse existing deltas, you can throw it, other
such objects with the same name-hash, and then objects that you know
will be available to the recipient (you mark the last category of
objects as preferred base), into the delta_list so that they are
close together in the delta window.

And here is a patch to experiment the name-hash based grouping
heuristics.

 builtin/pack-objects.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git c/builtin/pack-objects.c w/builtin/pack-objects.c
index 782e7d0..a2dd67b 100644
--- c/builtin/pack-objects.c
+++ w/builtin/pack-objects.c
@@ -67,6 +67,7 @@ static int keep_unreachable, unpack_unreachable, include_tag;
 static unsigned long unpack_unreachable_expiration;
 static int local;
 static int incremental;
+static int use_name_hash = 1;
 static int ignore_packed_keep;
 static int allow_ofs_delta;
 static struct pack_idx_option pack_idx_opts;
@@ -863,7 +864,7 @@ static unsigned name_hash(const char *name)
 {
unsigned c, hash = 0;
 
-   if (!name)
+   if (!name || !use_name_hash)
return 0;
 
/*
@@ -2468,6 +2469,8 @@ int cmd_pack_objects(int argc, const char **argv, const 
char *prefix)
  limit pack window by memory in addition to object 
limit),
OPT_INTEGER(0, depth, depth,
maximum length of delta chain allowed in the 
resulting pack),
+   OPT_BOOL(0, namehash, use_name_hash,
+assign name hint to each path),
OPT_BOOL(0, reuse-delta, reuse_delta,
 reuse existing deltas),
OPT_BOOL(0, reuse-object, reuse_object,

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC] index-pack: produce pack index version 3

2012-08-13 Thread Junio C Hamano
Nguyen Thai Ngoc Duy pclo...@gmail.com writes:

 On Mon, Aug 13, 2012 at 2:49 AM, Junio C Hamano gits...@pobox.com wrote:
 For example, the reachability bitmap would want to say something
 like Traversing from commit A, these objects in this pack are
 reachable.  The bitmap for one commit A would logically consist of
 N bits for a packfile that stores N objects (the resulting bitmap
 needs to be compressed before going to disk, perhaps with RLE or
 something).  With the single sorted by SHA-1 table, we can use the
 index in that single table to enumerate all reachable objects of any
 type in one go.  With four separate tables, on the other hand, we
 would need four bitmaps per commit.

 No we still need one per commit. The n-th bit is in the order of the
 object in the pack, not the index. How sha-1 is sorted does not
 matter.

Then what is the problem of using that same N that counts the
object using the order of the objects in the pack as a short-hand to
the object name instead, if your objective is to avoid having to
repeat the full 20-byte object name in your secondary tables?  That
does not call for any split in the list of SHA-1s, no?

 Either way is _possible_, but I think the former is simpler, and the
 latter makes it harder to introduce new types of objects in the
 future, which I do not think we have examined possible use cases
 well enough to make that decision to say four types is enough
 forever.

 New types can be put in one of those four tables, depending on its
 purpose. The reason I split because I care particularly about commits
 and trees. If the new type serves the same purpose as tree, for
 example, then it's better put in tree table...

And when there are multiple uses, one that wants to treat trees and
commits more or less the same way, and another that wants to treat
them in a distinct way, even without considering a new type, how
would your we have four separate tables help?  The former user
wants three commits+trees, tags and blobs table while the
latter wants four, no?

 In either way, we would have such bitmap (or a set of four bitmaps
 in your case) for more than one commit (it is not necessary or
 desirable to add the reachability bitmap to all commits), and such a
 reachability extension would need to store a sequence of the
 commit object name the bitmap (or a set of four bitmaps) is about,
 and the bitmap (or set of four bitmaps).  That object name does not
 have to be 20-byte but would be a varint representation of the
 offset into the sorted by SHA-1 table.

 How do you reach the bitmap, given its commit sha-1?

Look at the ordered list of SHA-1 in the pack idx file as an array
of 20-byte entries, and find the commit SHA-1 in that array by
Newton-Raphson/bisection.  You have an index into the array that
holds the object name.  You can use it as the local object number
in the packfile.  Then you can find that object number (that is
shorter than 20-byte object name) in your secondary table, no?

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC] index-pack: produce pack index version 3

2012-08-13 Thread Shawn Pearce
Let me start by echoing Junio's remark... lacks sufficient
justification. You don't give enough evidence to support even why it
is worth looking at this commit, let alone why it should be included
and cause a format change in the idx file format.

At some point you start to hand-wave about how it may benefit my
bitmap reachability index idea, but you don't seem to fully understand
that idea. And some of what you do here sounds like it actively hurts
doing the bitmap reachability project. So again this doesn't really
help.

Colby is nearly done prototyping the bitmap reachability
implementation in JGit and will release the code under the BSD license
there soon. I can't promise when yet because Colby will soon be
heading out for some (much deserved) vacation time, and doesn't want
to publish something until he is ready to stand behind the code. So it
might not show up until mid-September. But I think its worth giving
him a few weeks to finish getting the code ready, vs. rushing
something in that someone else thinks might help. We have waited more
than 6 years or whatever to improve packing. Colby's experiments are
showing massive improvements (40s enumeration cut to 100ms) with low
disk usage increase (10%) and no pack file format changes. Its OK to
wait 4 more weeks.

On Mon, Aug 13, 2012 at 12:40 AM, Junio C Hamano gits...@pobox.com wrote:
 Nguyen Thai Ngoc Duy pclo...@gmail.com writes:

 On Mon, Aug 13, 2012 at 2:49 AM, Junio C Hamano gits...@pobox.com wrote:
 For example, the reachability bitmap would want to say something
 like Traversing from commit A, these objects in this pack are
 reachable.  The bitmap for one commit A would logically consist of
 N bits for a packfile that stores N objects (the resulting bitmap
 needs to be compressed before going to disk, perhaps with RLE or
 something).  With the single sorted by SHA-1 table, we can use the
 index in that single table to enumerate all reachable objects of any
 type in one go.  With four separate tables, on the other hand, we
 would need four bitmaps per commit.

 No we still need one per commit. The n-th bit is in the order of the
 object in the pack, not the index. How sha-1 is sorted does not
 matter.

Yes, for RLE encoding to work well it is important that the bits are
assigned using pack offset ordering, not SHA-1 ordering. The object at
bit 0 is the object at offset 12 in the pack file, not the object that
has the lowest SHA-1. The RLE encoding is relying on the packer to
store objects closely reachable by time near each other in the pack
file. Which the packer already does because we have roughly proven
though the years that this ordering works sufficiently well enough for
nobody to propose another ordering.  :-)

 Then what is the problem of using that same N that counts the
 object using the order of the objects in the pack as a short-hand to
 the object name instead, if your objective is to avoid having to
 repeat the full 20-byte object name in your secondary tables?  That
 does not call for any split in the list of SHA-1s, no?

Exactly. There is no need to store a second copy of the SHA-1s
anywhere. We can discover information about object identified by bit N
by creating the reverse pack index in memory. The N-th entry in the
reverse index will tell us the Y-th position in the normal idx file,
which gives us the SHA-1. So SHA-1 can be obtained from a bitmap in
constant time, once the reverse index is built. The reverse index
build costs O(N log N) time, but its already required by the packer to
do efficient reuse of packed objects. We already pay this O(N log N)
penalty. So its currently free to us.

The bitmap approach Colby and I are collaborating on also contains 4
RLE bitmaps indicating what type each object is. Its needed so the
packer can avoid looking up type data in the base pack file, while
also avoiding parsing of trees where the mode information normally
provides the type hint. We could instead get this by splitting the
SHA-1 table into 4 tables as this patch proposes, but it doesn't help.
Type information can be obtained fast enough from these 4 RLE bitmaps,
without muddying the read code path with needing to know what table to
search, or needing to search 4 tables rather than 1 table.

 Either way is _possible_, but I think the former is simpler, and the
 latter makes it harder to introduce new types of objects in the
 future, which I do not think we have examined possible use cases
 well enough to make that decision to say four types is enough
 forever.

 New types can be put in one of those four tables, depending on its
 purpose.

No. A new type should get a new table. We should never cram a new type
into an existing type table just because it sounds like a good idea.
Its not. We are unlikely to add a new type anytime soon. If we do its
going to take more work than just worrying about where it fits into an
idx file. But it shouldn't be combined into a bucket with an existing
type.

 The reason I split because I care 

Re: [PATCH/RFC] index-pack: produce pack index version 3

2012-08-13 Thread Nguyen Thai Ngoc Duy
On Tue, Aug 14, 2012 at 7:46 AM, Shawn Pearce spea...@spearce.org wrote:
 Colby is nearly done prototyping the bitmap reachability
 implementation in JGit and will release the code under the BSD license
 there soon. I can't promise when yet because Colby will soon be
 heading out for some (much deserved) vacation time, and doesn't want
 to publish something until he is ready to stand behind the code. So it
 might not show up until mid-September. But I think its worth giving
 him a few weeks to finish getting the code ready, vs. rushing
 something in that someone else thinks might help. We have waited more
 than 6 years or whatever to improve packing. Colby's experiments are
 showing massive improvements (40s enumeration cut to 100ms) with low
 disk usage increase (10%) and no pack file format changes. Its OK to
 wait 4 more weeks.

Thanks. I did not know it's being worked on (your last email did not
indicate that, I think). I just wanted to try out and see how it might
work. And sure I can wait :)
--
Duy
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC] index-pack: produce pack index version 3

2012-08-12 Thread Junio C Hamano
Nguyễn Thái Ngọc Duy  pclo...@gmail.com writes:

 The main reason to group objects by type is to make it possible to
 create another sha1-something mapping for a particular object type,
 without wasting space for storing sha-1 keys again. For example, we
 can store commit caches, tree caches... at the end of the index as
 extensions.

Why can't you do the same with a single sorted by SHA-1 table?

Not impressed yet.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC] index-pack: produce pack index version 3

2012-08-12 Thread Nguyen Thai Ngoc Duy
On Mon, Aug 13, 2012 at 2:49 AM, Junio C Hamano gits...@pobox.com wrote:
 For example, the reachability bitmap would want to say something
 like Traversing from commit A, these objects in this pack are
 reachable.  The bitmap for one commit A would logically consist of
 N bits for a packfile that stores N objects (the resulting bitmap
 needs to be compressed before going to disk, perhaps with RLE or
 something).  With the single sorted by SHA-1 table, we can use the
 index in that single table to enumerate all reachable objects of any
 type in one go.  With four separate tables, on the other hand, we
 would need four bitmaps per commit.

No we still need one per commit. The n-th bit is in the order of the
object in the pack, not the index. How sha-1 is sorted does not
matter.

 Either way is _possible_, but I think the former is simpler, and the
 latter makes it harder to introduce new types of objects in the
 future, which I do not think we have examined possible use cases
 well enough to make that decision to say four types is enough
 forever.

New types can be put in one of those four tables, depending on its
purpose. The reason I split because I care particularly about commits
and trees. If the new type serves the same purpose as tree, for
example, then it's better put in tree table...

 In either way, we would have such bitmap (or a set of four bitmaps
 in your case) for more than one commit (it is not necessary or
 desirable to add the reachability bitmap to all commits), and such a
 reachability extension would need to store a sequence of the
 commit object name the bitmap (or a set of four bitmaps) is about,
 and the bitmap (or set of four bitmaps).  That object name does not
 have to be 20-byte but would be a varint representation of the
 offset into the sorted by SHA-1 table.

How do you reach the bitmap, given its commit sha-1?

 That varint representation
 would be smaller by about 3.5 bits if you have a separate commit
 only, sorted by SHA-1 table (as the number of all objects tend to
 be 10x larger than the number of all commits that need them).  For
 the particular case of we want to only annotate the commits, never
 other kinds of objects use case, it would be a win.  But without
 knowing what other use cases we will want to use the object
 annotation in the pack index file mechanism for, it feels like a
 premature optimization to me to have four tables to shave 3.5 bits
 per object.

caching trees for faster traversal in general case (sort of pack v4
but it comes as a cache instead of replacing the real pack).
-- 
Duy
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html