Re: [PATCH/RFC] index-pack: produce pack index version 3
On Wed, Aug 15, 2012 at 10:42 PM, Junio C Hamano gits...@pobox.com wrote: Shawn Pearce spea...@spearce.org writes: ... But I think its worth giving him a few weeks to finish getting the code ready, vs. rushing something in that someone else thinks might help. We have waited more than 6 years or whatever to improve packing. Colby's experiments are showing massive improvements (40s enumeration cut to 100ms) with low disk usage increase (10%) and no pack file format changes. No matter what you do, you would be saving the bitmap somewhere outside the *.pack file, yes? Will it be in some extended section of the new *.idx file? Yes, because it is derived from the pack stream its stored external from it. We have chosen to append it as a new section below the CRC-32 table in the *.idx file. It could also go into a new file stream (*.bdx?) but I don't think this is worthwhile. Its a bikeshed that can be painted once the algorithm has proven its value. If it can't do that, its irrelevant where its stored. With the bitmap, your object enumeration phase may go very fast, and you would be able to figure out, in response to a fetch request I have these tips of refs; please update me with these refs of yours, the set of objects you would need to pack (i.e. the ones that are reachable from your refs that were asked for, but that are not reachable from the refs the requestor has). Yes. Not only that, but we are finding that the number of objects to send is fewer, resulting in a much smaller data stream. The bitmap gives us near perfect knowledge of everything the client has, not just the objects on the edge of the graph cut expressed by the ACKed have lines. Objects that existed earlier than the cut don't have to be reset. So the object enumeration phase goes very fast. The compressing objects phase is also time reduced, as fewer objects are needed to be considered, as fewer objects are being sent. With more complete knowledge of the have side, more deltas can be reused as-is, rather than re-encoded on the fly. This reduces the compressing objects phase time even further. And with fewer objects to send, we have easily seen trivial cases of a Linux kernel fetch that is 1 week behind Linus' tip save like 200K on an 800K transfer. Its hard to hate having all of the phases go faster, and transmit less data to the client. Among these objects, there will be ones that are expressed as a delta against what you are going to send, or as a delta against what you know the recipient must already have (if you are using thin pack transfer) in the packfiles you have, and you can send these deltas as-is without recomputation. Yes. I point that out above before even reading this far in your message. :-( But there will be ones that are either expressed as a base in your packfile, or as a delta against something you are not going to send and you know that the recipient does not have. In order to turn these objects into deltas, it may be necessary to have a way to record which delta chain each object belongs to Excellent observation. Instead of performing a hacky delta base search by looking at identical path names in the have set, finding another candidate in the same cluster of deltas that the object is currently encoded against would yield a pretty efficient delta on the wire. It doesn't have to be another object in the same chain, it could also be a sibling object that shares a similar transitive delta base. , and if you are introducing the mechanism to have extended sections in the new *.idx file, that may be a good place to do so. Thus far we haven't done an extended sections modification to the *.idx format. Its just a 'E003' version that requires the bitmaps to be present below the CRC-32 table. But I can see how framing this as a group of optional extensions would be worthwhile. When you need to express an object that your bitmap told you to send (as opposed to rev-list walking told you with the paths to the objects), you can find other objects that belong to the same delta chain and that you know are available to the recipient when it starts to fixing the thin pack using that extra piece of information, in order to find the optimal delta base to encode such an object against. Yes. Just for fun, I applied the attached patch and repacked the history leading to v1.7.12-rc3 with the default depth/window: git rev-list --objects --all | git pack-objects \ --no-reuse-delta --no-reuse-object --delta-base-offset \ [--no-namehash] pack with and without the experimental --no-namehash option. The result is staggering. With name-hash to group objects from the same path close together in the delta window, the resulting pack is 33M. Without the name-hash hint, the same pack is 163M! Needless to say, keeping the objects that should be hashed together inside a delta window is really important. Cute experiment. Yes that name hash works as a decent predictor of
Re: [PATCH/RFC] index-pack: produce pack index version 3
Shawn Pearce spea...@spearce.org writes: ... But I think its worth giving him a few weeks to finish getting the code ready, vs. rushing something in that someone else thinks might help. We have waited more than 6 years or whatever to improve packing. Colby's experiments are showing massive improvements (40s enumeration cut to 100ms) with low disk usage increase (10%) and no pack file format changes. No matter what you do, you would be saving the bitmap somewhere outside the *.pack file, yes? Will it be in some extended section of the new *.idx file? With the bitmap, your object enumeration phase may go very fast, and you would be able to figure out, in response to a fetch request I have these tips of refs; please update me with these refs of yours, the set of objects you would need to pack (i.e. the ones that are reachable from your refs that were asked for, but that are not reachable from the refs the requestor has). Among these objects, there will be ones that are expressed as a delta against what you are going to send, or as a delta against what you know the recipient must already have (if you are using thin pack transfer) in the packfiles you have, and you can send these deltas as-is without recomputation. But there will be ones that are either expressed as a base in your packfile, or as a delta against something you are not going to send and you know that the recipient does not have. In order to turn these objects into deltas, it may be necessary to have a way to record which delta chain each object belongs to, and if you are introducing the mechanism to have extended sections in the new *.idx file, that may be a good place to do so. When you need to express an object that your bitmap told you to send (as opposed to rev-list walking told you with the paths to the objects), you can find other objects that belong to the same delta chain and that you know are available to the recipient when it starts to fixing the thin pack using that extra piece of information, in order to find the optimal delta base to encode such an object against. Just for fun, I applied the attached patch and repacked the history leading to v1.7.12-rc3 with the default depth/window: git rev-list --objects --all | git pack-objects \ --no-reuse-delta --no-reuse-object --delta-base-offset \ [--no-namehash] pack with and without the experimental --no-namehash option. The result is staggering. With name-hash to group objects from the same path close together in the delta window, the resulting pack is 33M. Without the name-hash hint, the same pack is 163M! Needless to say, keeping the objects that should be hashed together inside a delta window is really important. An obvious way to record the delta chain is to simply keep the name_hash of each object in the pack, which would need 2 bytes per object in the pack, that would bloat pack_idx_entry size from 32 bytes to 34 bytes per entry. That way, after your bitmap discovers an object that cannot reuse existing deltas, you can throw it, other such objects with the same name-hash, and then objects that you know will be available to the recipient (you mark the last category of objects as preferred base), into the delta_list so that they are close together in the delta window. And here is a patch to experiment the name-hash based grouping heuristics. builtin/pack-objects.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git c/builtin/pack-objects.c w/builtin/pack-objects.c index 782e7d0..a2dd67b 100644 --- c/builtin/pack-objects.c +++ w/builtin/pack-objects.c @@ -67,6 +67,7 @@ static int keep_unreachable, unpack_unreachable, include_tag; static unsigned long unpack_unreachable_expiration; static int local; static int incremental; +static int use_name_hash = 1; static int ignore_packed_keep; static int allow_ofs_delta; static struct pack_idx_option pack_idx_opts; @@ -863,7 +864,7 @@ static unsigned name_hash(const char *name) { unsigned c, hash = 0; - if (!name) + if (!name || !use_name_hash) return 0; /* @@ -2468,6 +2469,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) limit pack window by memory in addition to object limit), OPT_INTEGER(0, depth, depth, maximum length of delta chain allowed in the resulting pack), + OPT_BOOL(0, namehash, use_name_hash, +assign name hint to each path), OPT_BOOL(0, reuse-delta, reuse_delta, reuse existing deltas), OPT_BOOL(0, reuse-object, reuse_object, -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH/RFC] index-pack: produce pack index version 3
Nguyen Thai Ngoc Duy pclo...@gmail.com writes: On Mon, Aug 13, 2012 at 2:49 AM, Junio C Hamano gits...@pobox.com wrote: For example, the reachability bitmap would want to say something like Traversing from commit A, these objects in this pack are reachable. The bitmap for one commit A would logically consist of N bits for a packfile that stores N objects (the resulting bitmap needs to be compressed before going to disk, perhaps with RLE or something). With the single sorted by SHA-1 table, we can use the index in that single table to enumerate all reachable objects of any type in one go. With four separate tables, on the other hand, we would need four bitmaps per commit. No we still need one per commit. The n-th bit is in the order of the object in the pack, not the index. How sha-1 is sorted does not matter. Then what is the problem of using that same N that counts the object using the order of the objects in the pack as a short-hand to the object name instead, if your objective is to avoid having to repeat the full 20-byte object name in your secondary tables? That does not call for any split in the list of SHA-1s, no? Either way is _possible_, but I think the former is simpler, and the latter makes it harder to introduce new types of objects in the future, which I do not think we have examined possible use cases well enough to make that decision to say four types is enough forever. New types can be put in one of those four tables, depending on its purpose. The reason I split because I care particularly about commits and trees. If the new type serves the same purpose as tree, for example, then it's better put in tree table... And when there are multiple uses, one that wants to treat trees and commits more or less the same way, and another that wants to treat them in a distinct way, even without considering a new type, how would your we have four separate tables help? The former user wants three commits+trees, tags and blobs table while the latter wants four, no? In either way, we would have such bitmap (or a set of four bitmaps in your case) for more than one commit (it is not necessary or desirable to add the reachability bitmap to all commits), and such a reachability extension would need to store a sequence of the commit object name the bitmap (or a set of four bitmaps) is about, and the bitmap (or set of four bitmaps). That object name does not have to be 20-byte but would be a varint representation of the offset into the sorted by SHA-1 table. How do you reach the bitmap, given its commit sha-1? Look at the ordered list of SHA-1 in the pack idx file as an array of 20-byte entries, and find the commit SHA-1 in that array by Newton-Raphson/bisection. You have an index into the array that holds the object name. You can use it as the local object number in the packfile. Then you can find that object number (that is shorter than 20-byte object name) in your secondary table, no? -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH/RFC] index-pack: produce pack index version 3
Let me start by echoing Junio's remark... lacks sufficient justification. You don't give enough evidence to support even why it is worth looking at this commit, let alone why it should be included and cause a format change in the idx file format. At some point you start to hand-wave about how it may benefit my bitmap reachability index idea, but you don't seem to fully understand that idea. And some of what you do here sounds like it actively hurts doing the bitmap reachability project. So again this doesn't really help. Colby is nearly done prototyping the bitmap reachability implementation in JGit and will release the code under the BSD license there soon. I can't promise when yet because Colby will soon be heading out for some (much deserved) vacation time, and doesn't want to publish something until he is ready to stand behind the code. So it might not show up until mid-September. But I think its worth giving him a few weeks to finish getting the code ready, vs. rushing something in that someone else thinks might help. We have waited more than 6 years or whatever to improve packing. Colby's experiments are showing massive improvements (40s enumeration cut to 100ms) with low disk usage increase (10%) and no pack file format changes. Its OK to wait 4 more weeks. On Mon, Aug 13, 2012 at 12:40 AM, Junio C Hamano gits...@pobox.com wrote: Nguyen Thai Ngoc Duy pclo...@gmail.com writes: On Mon, Aug 13, 2012 at 2:49 AM, Junio C Hamano gits...@pobox.com wrote: For example, the reachability bitmap would want to say something like Traversing from commit A, these objects in this pack are reachable. The bitmap for one commit A would logically consist of N bits for a packfile that stores N objects (the resulting bitmap needs to be compressed before going to disk, perhaps with RLE or something). With the single sorted by SHA-1 table, we can use the index in that single table to enumerate all reachable objects of any type in one go. With four separate tables, on the other hand, we would need four bitmaps per commit. No we still need one per commit. The n-th bit is in the order of the object in the pack, not the index. How sha-1 is sorted does not matter. Yes, for RLE encoding to work well it is important that the bits are assigned using pack offset ordering, not SHA-1 ordering. The object at bit 0 is the object at offset 12 in the pack file, not the object that has the lowest SHA-1. The RLE encoding is relying on the packer to store objects closely reachable by time near each other in the pack file. Which the packer already does because we have roughly proven though the years that this ordering works sufficiently well enough for nobody to propose another ordering. :-) Then what is the problem of using that same N that counts the object using the order of the objects in the pack as a short-hand to the object name instead, if your objective is to avoid having to repeat the full 20-byte object name in your secondary tables? That does not call for any split in the list of SHA-1s, no? Exactly. There is no need to store a second copy of the SHA-1s anywhere. We can discover information about object identified by bit N by creating the reverse pack index in memory. The N-th entry in the reverse index will tell us the Y-th position in the normal idx file, which gives us the SHA-1. So SHA-1 can be obtained from a bitmap in constant time, once the reverse index is built. The reverse index build costs O(N log N) time, but its already required by the packer to do efficient reuse of packed objects. We already pay this O(N log N) penalty. So its currently free to us. The bitmap approach Colby and I are collaborating on also contains 4 RLE bitmaps indicating what type each object is. Its needed so the packer can avoid looking up type data in the base pack file, while also avoiding parsing of trees where the mode information normally provides the type hint. We could instead get this by splitting the SHA-1 table into 4 tables as this patch proposes, but it doesn't help. Type information can be obtained fast enough from these 4 RLE bitmaps, without muddying the read code path with needing to know what table to search, or needing to search 4 tables rather than 1 table. Either way is _possible_, but I think the former is simpler, and the latter makes it harder to introduce new types of objects in the future, which I do not think we have examined possible use cases well enough to make that decision to say four types is enough forever. New types can be put in one of those four tables, depending on its purpose. No. A new type should get a new table. We should never cram a new type into an existing type table just because it sounds like a good idea. Its not. We are unlikely to add a new type anytime soon. If we do its going to take more work than just worrying about where it fits into an idx file. But it shouldn't be combined into a bucket with an existing type. The reason I split because I care
Re: [PATCH/RFC] index-pack: produce pack index version 3
On Tue, Aug 14, 2012 at 7:46 AM, Shawn Pearce spea...@spearce.org wrote: Colby is nearly done prototyping the bitmap reachability implementation in JGit and will release the code under the BSD license there soon. I can't promise when yet because Colby will soon be heading out for some (much deserved) vacation time, and doesn't want to publish something until he is ready to stand behind the code. So it might not show up until mid-September. But I think its worth giving him a few weeks to finish getting the code ready, vs. rushing something in that someone else thinks might help. We have waited more than 6 years or whatever to improve packing. Colby's experiments are showing massive improvements (40s enumeration cut to 100ms) with low disk usage increase (10%) and no pack file format changes. Its OK to wait 4 more weeks. Thanks. I did not know it's being worked on (your last email did not indicate that, I think). I just wanted to try out and see how it might work. And sure I can wait :) -- Duy -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH/RFC] index-pack: produce pack index version 3
Nguyễn Thái Ngọc Duy pclo...@gmail.com writes: The main reason to group objects by type is to make it possible to create another sha1-something mapping for a particular object type, without wasting space for storing sha-1 keys again. For example, we can store commit caches, tree caches... at the end of the index as extensions. Why can't you do the same with a single sorted by SHA-1 table? Not impressed yet. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH/RFC] index-pack: produce pack index version 3
On Mon, Aug 13, 2012 at 2:49 AM, Junio C Hamano gits...@pobox.com wrote: For example, the reachability bitmap would want to say something like Traversing from commit A, these objects in this pack are reachable. The bitmap for one commit A would logically consist of N bits for a packfile that stores N objects (the resulting bitmap needs to be compressed before going to disk, perhaps with RLE or something). With the single sorted by SHA-1 table, we can use the index in that single table to enumerate all reachable objects of any type in one go. With four separate tables, on the other hand, we would need four bitmaps per commit. No we still need one per commit. The n-th bit is in the order of the object in the pack, not the index. How sha-1 is sorted does not matter. Either way is _possible_, but I think the former is simpler, and the latter makes it harder to introduce new types of objects in the future, which I do not think we have examined possible use cases well enough to make that decision to say four types is enough forever. New types can be put in one of those four tables, depending on its purpose. The reason I split because I care particularly about commits and trees. If the new type serves the same purpose as tree, for example, then it's better put in tree table... In either way, we would have such bitmap (or a set of four bitmaps in your case) for more than one commit (it is not necessary or desirable to add the reachability bitmap to all commits), and such a reachability extension would need to store a sequence of the commit object name the bitmap (or a set of four bitmaps) is about, and the bitmap (or set of four bitmaps). That object name does not have to be 20-byte but would be a varint representation of the offset into the sorted by SHA-1 table. How do you reach the bitmap, given its commit sha-1? That varint representation would be smaller by about 3.5 bits if you have a separate commit only, sorted by SHA-1 table (as the number of all objects tend to be 10x larger than the number of all commits that need them). For the particular case of we want to only annotate the commits, never other kinds of objects use case, it would be a win. But without knowing what other use cases we will want to use the object annotation in the pack index file mechanism for, it feels like a premature optimization to me to have four tables to shave 3.5 bits per object. caching trees for faster traversal in general case (sort of pack v4 but it comes as a cache instead of replacing the real pack). -- Duy -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html