Re: [PATCH 4/6] introduce a commit metapack
On Sun, Mar 17, 2013 at 08:21:13PM +0700, Nguyen Thai Ngoc Duy wrote: > On Thu, Jan 31, 2013 at 6:06 PM, Duy Nguyen wrote: > > On Wed, Jan 30, 2013 at 09:16:29PM +0700, Duy Nguyen wrote: > >> Perhaps we could store abbrev sha-1 instead of full sha-1. Nice > >> space/time trade-off. > > > > Following the on-disk format experiment yesterday, I changed the > > format to: > > > > - a list a _short_ SHA-1 of cached commits > > .. > > > > The length of SHA-1 is chosen to be able to unambiguously identify any > > cached commits. Full SHA-1 check is done after to catch false > > positives. For linux-2.6, SHA-1 length is 6 bytes, git and many > > moderate-sized projects are 4 bytes. > > And if we are going to create index v3, the same trick could be used > for the sha-1 table in the index. We use the short sha-1 table for > binary search and put the rest of sha-1 in a following table (just > like file offset table). The advantage is a denser search space, about > 1/4-1/3 the size of full sha-1 table. You can make it even smaller at some (potential) run-time cost. Keep in mind you are just repeating information that is in the full sha1 list in the index. So you could store a fixed-size offset into that list (e.g., 32-bit), and then instead of comparing sha1s during a binary search, you would dereference the offset to the real sha1s and compare those. The run-time cost is not any worse in a big-O sense, but your cache locality is much worse (you hit a second random page for each sha1 comparison), which might be noticeable. You'd have to benchmark to see how big an impact. -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] introduce a commit metapack
On Thu, Jan 31, 2013 at 6:06 PM, Duy Nguyen wrote: > On Wed, Jan 30, 2013 at 09:16:29PM +0700, Duy Nguyen wrote: >> Perhaps we could store abbrev sha-1 instead of full sha-1. Nice >> space/time trade-off. > > Following the on-disk format experiment yesterday, I changed the > format to: > > - a list a _short_ SHA-1 of cached commits > .. > > The length of SHA-1 is chosen to be able to unambiguously identify any > cached commits. Full SHA-1 check is done after to catch false > positives. For linux-2.6, SHA-1 length is 6 bytes, git and many > moderate-sized projects are 4 bytes. And if we are going to create index v3, the same trick could be used for the sha-1 table in the index. We use the short sha-1 table for binary search and put the rest of sha-1 in a following table (just like file offset table). The advantage is a denser search space, about 1/4-1/3 the size of full sha-1 table. -- Duy -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] introduce a commit metapack
Jeff King writes: > On Thu, Jan 31, 2013 at 09:03:26AM -0800, Shawn O. Pearce wrote: > ... >> If we are going to change the index to support extension sections and >> I have to modify JGit to grok this new format, it needs to be index v3 >> not index v2. If we are making index v3 we should just put index v3 on >> the end of the pack file. > > I'm not sure what you mean by your last sentence here. I am not Shawn, but here is a summary of what I think I discussed with him in person, lest I forget. You could imagine that a new pack system (from pack-objects, index-pack down to read_packed_sha1() call) that works with a packfile that * is a single file, whose name is pack-$SHA1.$sfx (where $sfx is something other than 'pack', perhaps); * has the pack data stream, including the concluding checksum of the stream contents at the end, at the beginning of the file; and * has the index v3 data blob appended to the pack data stream. The pack data is streamed over the wire exactly the same way, interoperating with existing software. When receiving, the new index-pack can read such a pack stream and add index at the end. When re-indexing an existing pack (think: upgrading existing packfiles from the current system), the index-pack can read from the packfile and do what it does currently (notably, it knows where the pack stream ends and can stop reading at that point, so even if you feed the new pack to it, it will stop at the end of the pack data, ignoring the index v3 already at the end of the input). One potential advantage of using a single file, instead of the primary .pack file with 3 (or 47) auxiliary files, is that it lets you repack without having to deal with this sequence, which happens currently when you repack: * create a new .pack file and the corresponding auxiliary files under temporary filename; * move existing pack files that describe the same set of objects away; * rename these new files, one at a time, to their final name, making sure that you rename .idx the last, because that happens to be the key to the pack aware programs. Instead you can rename only one thing (the new one) to the final name (possibly atomically replacing the existing one). With the current system, when you need to replace a pack with a new pack with the same packname (e.g. you repack everything with a better pack parameter in a repository that has everything packed into one), there is a very small window other concurrent users will not find the object data between the time when you rename the old ones away and the time when you move the new ones in. The hairly logic between "Ok we have prepared all new packfiles" and "End of pack replacement" can be done with a single rename(2) of the new packfile (which contains everything) to the final name, which atomically replaces the old one. This will become even safer if we picked $SHA1 (the name of the packfile) to represent the hash of the whole thing, not the hash of the sorted object names in the pack, as that will let us know there is no need to even "replace" the files. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] introduce a commit metapack
On Fri, Feb 1, 2013 at 5:15 PM, Jeff King wrote: > The short-sha1 is a clever idea. Looks like it saves us on the order of > 4MB for linux-2.6 (versus the full 20-byte sha1). Not as big as the > savings we get from dropping the other 3 sha1's to uint32_t, but still > not bad. We could save another 4 bytes per commit by using 3 bytes for storing .idx offsets. linux-2.6 only has 3M objects. It'll take many years for big projects to reach 16M objects and need the fourth byte in uint32_t. > I guess the next steps in iterating on this would be: > > 1. splitting out the refactoring here into separate patches > > 2. squashing the cleaned-up bits into my patch 4/6 > > 3. deciding whether this should go into a separate file or as part of > index v3. Your offsets depend on the .idx file having a sorted sha1 > list. That is not likely to change, but it would still be nice to > make sure they cannot get out of sync. I'm still curious what the > performance impact is for mmap-ing N versus N+8MB. 4. Print some cache statistics in "count-objects -v" >> The length of SHA-1 is chosen to be able to unambiguously identify any >> cached commits. Full SHA-1 check is done after to catch false >> positives. > > Just to be clear, these false positives come because the abbreviation is > unambiguous within the packfile, but we might be looking for a commit > that is not even in our pack, right? It may even be ambiguous within the pack, say an octopus (i.e. not cached) commit that shares the same sha-1 prefix with one of the cached commits. -- Duy -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] introduce a commit metapack
On Thu, Jan 31, 2013 at 06:06:56PM +0700, Nguyen Thai Ngoc Duy wrote: > On Wed, Jan 30, 2013 at 09:16:29PM +0700, Duy Nguyen wrote: > > Perhaps we could store abbrev sha-1 instead of full sha-1. Nice > > space/time trade-off. > > Following the on-disk format experiment yesterday, I changed the > format to: > > - a list a _short_ SHA-1 of cached commits > - a list of cache entries, each (5 uint32_t) consists of: >- uint32_t for the index in .idx sha-1 table to get full SHA-1 of > the commit >- uint32_t for timestamp >- uint32_t for tree, 1st and 2nd parents for the index in .idx > table BTW, I needed the minor fixups below to silence some warnings from your patch. Here are the cold and warm cache timings I got, as compared to stock git and my implementation: Pack | Cold Revs | Warm Revs ---+--+ stock | 12.54| 4.14 me | 4.76 (-62%) | 0.66 (-84%) duy | 4.36 (-65%) | 0.55 (-86%) Not surprising; yours is just a bit faster in terms of CPU, and even gains a little more in the cold cache case. Nice. Of course that is just gravy on top of the smaller disk usage, too. :) --- diff --git a/commit-metapack.c b/commit-metapack.c index c984b8e..78fd961 100644 --- a/commit-metapack.c +++ b/commit-metapack.c @@ -106,7 +106,7 @@ int commit_metapack(unsigned char *sha1, for (p = commit_metapacks; p; p = p->next) { struct commit_entry *data; uint32_t p1, p2; - unsigned lo, hi, mi; + unsigned lo, hi; int pos; /* sha1_entry_pos does not work with abbreviated sha-1 */ @@ -161,7 +161,7 @@ static void get_commits(struct metapack_writer *mw, struct write_cb *write_cb = (struct write_cb *)data; enum object_type type = sha1_object_info(sha1, NULL); struct commit *c; - int p1, p2; + int p1 = -1, p2 = -1; if (type != OBJ_COMMIT) return; diff --git a/commit.c b/commit.c index b326201..5b776f8 100644 --- a/commit.c +++ b/commit.c @@ -309,7 +309,7 @@ static int parse_commit_metapack(struct commit *item) static int parse_commit_metapack(struct commit *item) { - unsigned char *tree, *p1, *p2; + const unsigned char *tree, *p1, *p2; uint32_t ts; if (commit_metapack(item->object.sha1, &ts, &tree, &p1, &p2) < 0) -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] introduce a commit metapack
On Thu, Jan 31, 2013 at 06:06:56PM +0700, Nguyen Thai Ngoc Duy wrote: > On Wed, Jan 30, 2013 at 09:16:29PM +0700, Duy Nguyen wrote: > > Perhaps we could store abbrev sha-1 instead of full sha-1. Nice > > space/time trade-off. > > Following the on-disk format experiment yesterday, I changed the > format to: > > - a list a _short_ SHA-1 of cached commits > - a list of cache entries, each (5 uint32_t) consists of: >- uint32_t for the index in .idx sha-1 table to get full SHA-1 of > the commit >- uint32_t for timestamp >- uint32_t for tree, 1st and 2nd parents for the index in .idx > table Thanks for working on this, as it was the next step I was going to take. :) The short-sha1 is a clever idea. Looks like it saves us on the order of 4MB for linux-2.6 (versus the full 20-byte sha1). Not as big as the savings we get from dropping the other 3 sha1's to uint32_t, but still not bad. I guess the next steps in iterating on this would be: 1. splitting out the refactoring here into separate patches 2. squashing the cleaned-up bits into my patch 4/6 3. deciding whether this should go into a separate file or as part of index v3. Your offsets depend on the .idx file having a sorted sha1 list. That is not likely to change, but it would still be nice to make sure they cannot get out of sync. I'm still curious what the performance impact is for mmap-ing N versus N+8MB. > The length of SHA-1 is chosen to be able to unambiguously identify any > cached commits. Full SHA-1 check is done after to catch false > positives. Just to be clear, these false positives come because the abbreviation is unambiguous within the packfile, but we might be looking for a commit that is not even in our pack, right? -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] introduce a commit metapack
On Wed, Jan 30, 2013 at 08:56:07PM +0700, Nguyen Thai Ngoc Duy wrote: > Another point, but not really important at this stage, I think we have > memory leak somewhere (lookup_commit??). It used up to 800 MB RES on > linux-2.6.git while generating the cache. We generate (and then leak!) the linked list in commit_metapack_write. That may be the culprit. -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] introduce a commit metapack
On Thu, Jan 31, 2013 at 09:03:26AM -0800, Shawn O. Pearce wrote: > > Of course, it is more convenient to store this kind of things in a > > separate file while experimenting and improving the mechanism, but I > > do not think we want to see each packfile in a repository comes with > > 47 auxiliary files with different suffixes 5 years down the road. > > Agggh. > > Right now we are in the middle of refactoring the JGit reachability > bitmap implementation to store it into a separate file and get it out > of the .idx file. This work is nearly completed. So this thread is > great timing. :-) > > I think Junio is right about not wanting 47 different tiny auxiliary > files for a single pack. We are unlikely to create that many, but > right now there are proposals floating around for at least 2 new > auxiliary files (commit cache and reachability bitmaps). So its not > impossible that another will be discovered in the future. Why don't we want 47 files (or even 3)? If it makes the implementation more flexible or simple without sacrificing performance, I don't see a problem. The filesystem is there to organize our data; we do not cram all of our files into one just to save a few inodes. We _do_ cram our data into packfiles and packed-refs when we can save O(n) inodes. But if we are talking about a handful of extra files that we must readdir() over while preparing the objects/pack directory, I don't think that is the same thing. The data dependency issues (can the files get out of sync? How common is it? Can we detect it?) and performance issues are more interesting to me. With respect to the latter, here's specifically what I'm thinking of. Let's imagine we have an extension for reachability bitmaps that takes up an extra 10MB. When we mmap the .idx file, do we always mmap the extra bytes? What's the performance impact of the extra mmap? I recall on some non-Linux systems it can be quite expensive. For most git commands which are not going to do a reachability analysis, this is a loss. I don't know if this is an issue big enough to care about or not. But I think it's worth benchmarking and including in the decision. > Junio may be right about the hole in the index file for git-core. I > haven't checked the JGit implementation, but I suspect it does not > have this hole. IIRC JGit consumes the index sections and then expects > the pack trailer SHA-1 to be present immediately after the last table. > This happens because JGit doesn't use mmap() to load the index, it > streams the file into memory and does some reformatting on the tables > to make search faster. Yeah, looking at the PackIndexV2 implementation, it counts the 64-bit offsets from the first table, calculates the size of the large offset table, reads past it, and then expects the pack checksum. So let's assume we would have to bump to v3 to implement it inside the .idx file. > If we are going to change the index to support extension sections and > I have to modify JGit to grok this new format, it needs to be index v3 > not index v2. If we are making index v3 we should just put index v3 on > the end of the pack file. I'm not sure what you mean by your last sentence here. -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] introduce a commit metapack
On Tue, Jan 29, 2013 at 11:17:41PM -0800, Junio C Hamano wrote: > > True, but it is even less headache if the file is totally separate and > > optional. > > Once you start thinking about using an offset to some list of SHA-1, > perhaps? A section inside the same file can never go out of sync. Yes, having a data dependency is important. It is unavoidable to have a dependency on the packfile, though (and that is why the index and the metapacks embed the sha1 of the packfile). If the offsets used are packfile offsets, then that is sufficient. If the offsets are from the index, then yes, putting it in the same file is one way to keep them tied together. Another way is to do the same sha1 verification, except to embed the sha1 of the index in the metapack. So I certainly consider putting the dependency-handling to be a point in favor of the same file, but I'd weight it against other points (headache of bumping index version, performance of both types, etc). > Also a longer-term advantage is that you can teach index-pack to do > this. I think that is roughly the same amount of difficulty to do whether they are in the same file or not. -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] introduce a commit metapack
On Wed, Jan 30, 2013 at 7:56 AM, Junio C Hamano wrote: > Jeff King writes: > >>>From this: >> >>> Then it will be very natural for the extension data that store the >>> commit metainfo to name objects in the pack the .idx file describes >>> by the offset in the SHA-1 table. >> >> I guess your argument is that putting it all in the same file makes it >> more natural for there to be a data dependency. > > It is more about the "I am torn on this one" I mentioned earlier. > > It would be more "logical" if this weren't tied to a particular > pack, as the properties of a commit you record in this series do not > depend on which pack the commit is in, and such a repository-global > file by definition cannot be inside anybody's .idx. > > But if we split the information into separate pieces and store one > piece per .idx for implementation reasons, it is crazy not to at > least consider it a longer term goal to put it inside .idx file. > > Of course, it is more convenient to store this kind of things in a > separate file while experimenting and improving the mechanism, but I > do not think we want to see each packfile in a repository comes with > 47 auxiliary files with different suffixes 5 years down the road. Agggh. Right now we are in the middle of refactoring the JGit reachability bitmap implementation to store it into a separate file and get it out of the .idx file. This work is nearly completed. So this thread is great timing. :-) I think Junio is right about not wanting 47 different tiny auxiliary files for a single pack. We are unlikely to create that many, but right now there are proposals floating around for at least 2 new auxiliary files (commit cache and reachability bitmaps). So its not impossible that another will be discovered in the future. Junio may be right about the hole in the index file for git-core. I haven't checked the JGit implementation, but I suspect it does not have this hole. IIRC JGit consumes the index sections and then expects the pack trailer SHA-1 to be present immediately after the last table. This happens because JGit doesn't use mmap() to load the index, it streams the file into memory and does some reformatting on the tables to make search faster. If we are going to change the index to support extension sections and I have to modify JGit to grok this new format, it needs to be index v3 not index v2. If we are making index v3 we should just put index v3 on the end of the pack file. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] introduce a commit metapack
On Wed, Jan 30, 2013 at 09:16:29PM +0700, Duy Nguyen wrote: > Perhaps we could store abbrev sha-1 instead of full sha-1. Nice > space/time trade-off. Following the on-disk format experiment yesterday, I changed the format to: - a list a _short_ SHA-1 of cached commits - a list of cache entries, each (5 uint32_t) consists of: - uint32_t for the index in .idx sha-1 table to get full SHA-1 of the commit - uint32_t for timestamp - uint32_t for tree, 1st and 2nd parents for the index in .idx table The length of SHA-1 is chosen to be able to unambiguously identify any cached commits. Full SHA-1 check is done after to catch false positives. For linux-2.6, SHA-1 length is 6 bytes, git and many moderate-sized projects are 4 bytes. So it's 26 bytes per commit for linux-2.6.git, or 8MB .commits file. Not as good as revindex approach (5.5MB) but way better than the current one (27MB). And still good enough for caching 2 parents unconditionally. Performance seems to improve a tiny bit, probably because of more compact search space: 0.6s with my patch vs 0.7s without (vs 3.4s without cache). -- 8< -- diff --git a/cache.h b/cache.h index 1f96f65..8048d5b 100644 --- a/cache.h +++ b/cache.h @@ -1069,6 +1069,7 @@ extern struct packed_git *add_packed_git(const char *, int, int); extern const unsigned char *nth_packed_object_sha1(struct packed_git *, uint32_t); extern off_t nth_packed_object_offset(const struct packed_git *, uint32_t); extern off_t find_pack_entry_one(const unsigned char *, struct packed_git *); +extern int find_pack_entry_pos(const unsigned char *, struct packed_git *); extern int is_pack_valid(struct packed_git *); extern void *unpack_entry(struct packed_git *, off_t, enum object_type *, unsigned long *); extern unsigned long unpack_object_header_buffer(const unsigned char *buf, unsigned long len, enum object_type *type, unsigned long *sizep); diff --git a/commit-metapack.c b/commit-metapack.c index 2c19f48..c984b8e 100644 --- a/commit-metapack.c +++ b/commit-metapack.c @@ -4,11 +4,21 @@ #include "commit.h" #include "sha1-lookup.h" +struct commit_entry { + uint32_t commit;/* nth_packed_object_sha1 to get own SHA-1 */ + uint32_t timestamp; + uint32_t tree; /* nth_packed_object_sha1 to get tree SHA-1 */ + uint32_t parent1; /* nth_packed_object_sha1 to get 1st parent SHA-1 */ + uint32_t parent2; /* nth_packed_object_sha1 to get 2nd parent SHA-1 */ +}; + struct commit_metapack { struct metapack mp; uint32_t nr; + uint32_t abbrev_len; + struct packed_git *pack; unsigned char *index; - unsigned char *data; + struct commit_entry *data; struct commit_metapack *next; }; static struct commit_metapack *commit_metapacks; @@ -41,20 +51,23 @@ static struct commit_metapack *alloc_commit_metapack(struct packed_git *pack) } memcpy(&it->nr, it->mp.data, 4); it->nr = ntohl(it->nr); + memcpy(&it->abbrev_len, it->mp.data + 4, 4); + it->abbrev_len = ntohl(it->abbrev_len); + it->pack = pack; /* -* We need 84 bytes for each entry: sha1(20), date(4), tree(20), -* parents(40). +* We need 20+abbrev_len bytes for each entry: abbrev sha-1, +* date(4), tree index(4), parent indexes(8). */ - if (it->mp.len < (84 * it->nr + 4)) { + if (it->mp.len < ((sizeof(*it->data) + it->abbrev_len) * it->nr + 8)) { warning("commit metapack for '%s' is truncated", pack->pack_name); metapack_close(&it->mp); free(it); return NULL; } - it->index = it->mp.data + 4; - it->data = it->index + 20 * it->nr; + it->index = it->mp.data + 8; + it->data = (struct commit_entry*)(it->index + it->abbrev_len * it->nr); return it; } @@ -83,29 +96,51 @@ static void prepare_commit_metapacks(void) int commit_metapack(unsigned char *sha1, uint32_t *timestamp, - unsigned char **tree, - unsigned char **parent1, - unsigned char **parent2) + const unsigned char **tree, + const unsigned char **parent1, + const unsigned char **parent2) { struct commit_metapack *p; prepare_commit_metapacks(); for (p = commit_metapacks; p; p = p->next) { - unsigned char *data; - int pos = sha1_entry_pos(p->index, 20, 0, 0, p->nr, p->nr, sha1); + struct commit_entry *data; + uint32_t p1, p2; + unsigned lo, hi, mi; + int pos; + + /* sha1_entry_pos does not work with abbreviated sha-1 */ + lo = 0; + hi = p->nr; + pos = -1; + do { + unsigned mi = (lo + hi) / 2; + int cmp = memcmp(p->index + mi * p-
Re: [PATCH 4/6] introduce a commit metapack
Jeff King writes: >>From this: > >> Then it will be very natural for the extension data that store the >> commit metainfo to name objects in the pack the .idx file describes >> by the offset in the SHA-1 table. > > I guess your argument is that putting it all in the same file makes it > more natural for there to be a data dependency. It is more about the "I am torn on this one" I mentioned earlier. It would be more "logical" if this weren't tied to a particular pack, as the properties of a commit you record in this series do not depend on which pack the commit is in, and such a repository-global file by definition cannot be inside anybody's .idx. But if we split the information into separate pieces and store one piece per .idx for implementation reasons, it is crazy not to at least consider it a longer term goal to put it inside .idx file. Of course, it is more convenient to store this kind of things in a separate file while experimenting and improving the mechanism, but I do not think we want to see each packfile in a repository comes with 47 auxiliary files with different suffixes 5 years down the road. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] introduce a commit metapack
On Wed, Jan 30, 2013 at 8:56 PM, Duy Nguyen wrote: > However, performance seems to suffer too. Maybe I do more lookups than > necessary, I don't know. Yes, I should have stored the position in the sha-1 <-> offset map instead of the position of the object in .pack file. Even so, performance does not improve. > I should probably measure the cost of revindex separately. And the cost of create_pack_revindex() is 0.6 sec :-( Perhaps we could store abbrev sha-1 instead of full sha-1. Nice space/time trade-off. -- Duy -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] introduce a commit metapack
On Tue, Jan 29, 2013 at 04:16:11AM -0500, Jeff King wrote: > When we are doing a commit traversal that does not need to > look at the commit messages themselves (e.g., rev-list, > merge-base, etc), we spend a lot of time accessing, > decompressing, and parsing the commit objects just to find > the parent and timestamp information. We can make a > space-time tradeoff by caching that information on disk in a > compact, uncompressed format. And this is a (messy) patch on top that avoids storing SHA-1 directly. On my linux-2.6.git (575 MB pack, 73 MB index), .commits file is 5.2 MB and 27 MB with and without my patch respectively. Nice shrinkage. However, performance seems to suffer too. Maybe I do more lookups than necessary, I don't know. I should probably measure the cost of revindex separately. git rev-list --all --quiet on vanilla git: real0m3.645s user0m3.556s sys 0m0.080s commit cache without my patch: real0m0.723s user0m0.677s sys 0m0.045s and with my patch: real0m1.338s user0m1.259s sys 0m0.075s Another point, but not really important at this stage, I think we have memory leak somewhere (lookup_commit??). It used up to 800 MB RES on linux-2.6.git while generating the cache. -- 8< -- diff --git a/cache.h b/cache.h index 1f96f65..8048d5b 100644 --- a/cache.h +++ b/cache.h @@ -1069,6 +1069,7 @@ extern struct packed_git *add_packed_git(const char *, int, int); extern const unsigned char *nth_packed_object_sha1(struct packed_git *, uint32_t); extern off_t nth_packed_object_offset(const struct packed_git *, uint32_t); extern off_t find_pack_entry_one(const unsigned char *, struct packed_git *); +extern int find_pack_entry_pos(const unsigned char *, struct packed_git *); extern int is_pack_valid(struct packed_git *); extern void *unpack_entry(struct packed_git *, off_t, enum object_type *, unsigned long *); extern unsigned long unpack_object_header_buffer(const unsigned char *buf, unsigned long len, enum object_type *type, unsigned long *sizep); diff --git a/commit-metapack.c b/commit-metapack.c index 2c19f48..55f7ea9 100644 --- a/commit-metapack.c +++ b/commit-metapack.c @@ -3,11 +3,12 @@ #include "metapack.h" #include "commit.h" #include "sha1-lookup.h" +#include "pack-revindex.h" struct commit_metapack { struct metapack mp; - uint32_t nr; - unsigned char *index; + struct packed_git *pack; + uint32_t first, last; unsigned char *data; struct commit_metapack *next; }; @@ -16,7 +17,7 @@ static struct commit_metapack *commit_metapacks; static struct commit_metapack *alloc_commit_metapack(struct packed_git *pack) { struct commit_metapack *it = xcalloc(1, sizeof(*it)); - uint32_t version; + uint32_t version, nr; if (metapack_init(&it->mp, pack, "commits", &version) < 0) { free(it); @@ -39,22 +40,25 @@ static struct commit_metapack *alloc_commit_metapack(struct packed_git *pack) free(it); return NULL; } - memcpy(&it->nr, it->mp.data, 4); - it->nr = ntohl(it->nr); + memcpy(&it->first, it->mp.data, 4); + it->first = ntohl(it->first); + memcpy(&it->last, it->mp.data + 4, 4); + it->last = ntohl(it->last); + nr = it->last - it->first + 1; + it->pack = pack; /* -* We need 84 bytes for each entry: sha1(20), date(4), tree(20), -* parents(40). +* We need 16 bytes for each entry: date(4), tree index(4), +* parent indexes(8). */ - if (it->mp.len < (84 * it->nr + 4)) { + if (it->mp.len < (16 * nr + 8)) { warning("commit metapack for '%s' is truncated", pack->pack_name); metapack_close(&it->mp); free(it); return NULL; } - it->index = it->mp.data + 4; - it->data = it->index + 20 * it->nr; + it->data = it->mp.data + 8; return it; } @@ -81,31 +85,61 @@ static void prepare_commit_metapacks(void) initialized = 1; } +static const unsigned char *idx_to_sha1(struct packed_git *p, + uint32_t nth) +{ + struct revindex_entry *revindex = get_revindex(p); + if (!revindex) + return NULL; + return nth_packed_object_sha1(p, revindex[nth].nr); +} + int commit_metapack(unsigned char *sha1, uint32_t *timestamp, - unsigned char **tree, - unsigned char **parent1, - unsigned char **parent2) + const unsigned char **tree, + const unsigned char **parent1, + const unsigned char **parent2) { struct commit_metapack *p; prepare_commit_metapacks(); for (p = commit_metapacks; p; p = p->next) { unsigned char *data; - int pos = sha1_entry_pos(p->index, 20, 0, 0, p->nr, p->nr, sha1); -
Re: [PATCH 4/6] introduce a commit metapack
> True, but it is even less headache if the file is totally separate and > optional. Once you start thinking about using an offset to some list of SHA-1, perhaps? A section inside the same file can never go out of sync. Also a longer-term advantage is that you can teach index-pack to do this. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] introduce a commit metapack
On Wed, Jan 30, 2013 at 10:36:10AM +0700, Nguyen Thai Ngoc Duy wrote: > On Tue, Jan 29, 2013 at 4:16 PM, Jeff King wrote: > > +int commit_metapack(unsigned char *sha1, > > + uint32_t *timestamp, > > + unsigned char **tree, > > + unsigned char **parent1, > > + unsigned char **parent2) > > +{ > > Nit picking. tree, parent1 and parent2 can/should be "const unsigned char **". Thanks, I'll fix it in the next version. -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] introduce a commit metapack
On Tue, Jan 29, 2013 at 10:08:08AM -0800, Junio C Hamano wrote: > > In order to reduce the disk footprint and I/O cost, the future > > direction for this mechanism may want to point into an existing > > store of SHA-1 hashes with a shorter file offset, and the .idx file > > could be such a store, and in order to move in that direction, you > > cannot avoid tying a metapack to a pack. > > Have you considered if we can extend the .idx format version 2 > without actually having to bump the version number? My quick > reading of check_packed_git_idx() tells me that we have a gap after > the "large offset table" that we can place extensions, but I may be > mistaken. If we indeed can, then perhaps adding a series of > > 4-byte "id" header > 4-byte extension length (or 8-byte) > ... N-byte worth of extension data ... > > followed by > > 20-byte SHA-1 checksum of all the extension sections > 8-byte file offset to the first extension section I considered it, but didn't look into it closely. My feeling was that it added extra complexity, without adding any real advantage that a separate file would not. >From this: > Then it will be very natural for the extension data that store the > commit metainfo to name objects in the pack the .idx file describes > by the offset in the SHA-1 table. I guess your argument is that putting it all in the same file makes it more natural for there to be a data dependency. > As we always say, .idx is a local cache and bumping its version is > not a huge headache compared to other longer term storage items. True, but it is even less headache if the file is totally separate and optional. -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] introduce a commit metapack
On Tue, Jan 29, 2013 at 09:38:10AM -0800, Junio C Hamano wrote: > Jeff King writes: > > > +int commit_metapack(unsigned char *sha1, > > + uint32_t *timestamp, > > + unsigned char **tree, > > + unsigned char **parent1, > > + unsigned char **parent2) > > +{ > > + struct commit_metapack *p; > > + > > + prepare_commit_metapacks(); > > + for (p = commit_metapacks; p; p = p->next) { > > + unsigned char *data; > > + int pos = sha1_entry_pos(p->index, 20, 0, 0, p->nr, p->nr, > > sha1); > > This is a tangent, but isn't it about time to rip out the check for > GIT_USE_LOOKUP in find_pack_entry_one(), I wonder. Rip it out and always use sha1_entry_pos, or rip it out and never use it? I have never been able to measure any speedup from it; I just use it here to avoid rewriting the binary search myself. I do not think there is any disadvantage to using it, so I'd be in favor of just standardizing on it for any sha1 binary searches. > These cached properties of a single commit will not change no matter > which pack it appears in, and it feels logically wrong, especially > when you record these object names in the full SHA-1 form, to tie a > "commit metapack" to a pack. Logically there needs only one commit > metapack that describes all the commits known to the repository when > the metapack was created. True. I had originally envisioned it as tied to the packfile to help manage the lifecycle. You know you need to generate a metapack for some objects when they get packed, and you know you can throw it away when the associated pack goes away. And it means you can verify the integrity of the metapack by matching it to a particular packfile. However, if you just have a commit cache, you can always blow the whole thing away and regenerate it for all objects in the repo. It does not have tied to a pack (you do end up doing some extra work at regeneration when you are not doing a full repack, but it is really not enough to worry about). > In order to reduce the disk footprint and I/O cost, the future > direction for this mechanism may want to point into an existing > store of SHA-1 hashes with a shorter file offset, and the .idx file > could be such a store, and in order to move in that direction, you > cannot avoid tying a metapack to a pack. Yes. That was not one of my original goals for the commit cache, but I do think it's a useful direction to go in. And reachability bitmaps (which would eventually be their own metapacks) would generally want to be per-pack, too, for the same reason. > > + *tail = &commit_list_insert(c, *tail)->next; > > +} > > It feels somewhat wasteful to: > > - use commit_list for this, rather than an array of commit >objects. If you have a rough estimate of the number of commits >in the pack, you could just preallocate a single array and use >ALLOC_GROW() on it, no? We don't have a rough estimate, but yes, we could just use an array and trust ALLOC_GROW to be reasonable. The use of commit_list did not have a particular reason other than that it was simple (an array means stuffing the array pointer and the nr and alloc counts into a struct to get to the callback). The performance of writing the cache is dominated by accessing the objects themselves, anyway. I don't mind changing it, though, if you think it's clearer as an array. > - iterate over the .idx file and run sha1_object_info() and >parse_commit() on many objects in the SHA-1 order. Iterating in >the way builtin/pack-objects.c::get_object_details() does avoids >jumping around in existing packfiles, which may be more >efficient, no? Probably, though generating the complete commit cache for linux-2.6.git takes only about 7 seconds on my machine. I wasn't too concerned with optimizing generation, since it will typically be dwarfed by repacking costs. It might make more of a difference for doing a metapack on all objects (e.g., reachability bitmaps). The reason I do it in .idx order is that it feeds the callback in sorted order, so a writer could in theory just use that output as-is (and my initial version did that, as it wrote separate metapacks for each element). This version now puts all data elements together (for cache locality), and builds the in-memory list so we do not have to re-do sha1_object_info repeatedly. So it could very easily just generate the list in pack order and sort it at the end. I'll look into that for the next version. -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] introduce a commit metapack
On Tue, Jan 29, 2013 at 4:16 PM, Jeff King wrote: > +int commit_metapack(unsigned char *sha1, > + uint32_t *timestamp, > + unsigned char **tree, > + unsigned char **parent1, > + unsigned char **parent2) > +{ Nit picking. tree, parent1 and parent2 can/should be "const unsigned char **". -- Duy -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] introduce a commit metapack
Junio C Hamano writes: > I am torn on this one. > > These cached properties of a single commit will not change no matter > which pack it appears in, and it feels logically wrong, especially > when you record these object names in the full SHA-1 form, to tie a > "commit metapack" to a pack. Logically there needs only one commit > metapack that describes all the commits known to the repository when > the metapack was created. > > In order to reduce the disk footprint and I/O cost, the future > direction for this mechanism may want to point into an existing > store of SHA-1 hashes with a shorter file offset, and the .idx file > could be such a store, and in order to move in that direction, you > cannot avoid tying a metapack to a pack. Have you considered if we can extend the .idx format version 2 without actually having to bump the version number? My quick reading of check_packed_git_idx() tells me that we have a gap after the "large offset table" that we can place extensions, but I may be mistaken. If we indeed can, then perhaps adding a series of 4-byte "id" header 4-byte extension length (or 8-byte) ... N-byte worth of extension data ... followed by 20-byte SHA-1 checksum of all the extension sections 8-byte file offset to the first extension section at that gap, immediately before the trailer of the .idx file written by git_SHA1_Final(), in a way similar to the index extension is done may give us a natural way to deal with this. The reader can first read the offset recorded at the tail, checking if the offset is plausible because it may not exist, the check the extension section checksum, to see if the file has extension, or the file just ends with the large offset table and the 28 bytes it tried to were from the entries near the end of the large offset table. If that is not worth attempting, perhaps we may still want to store this as an optional extended section and bump the .idx format version. Then it will be very natural for the extension data that store the commit metainfo to name objects in the pack the .idx file describes by the offset in the SHA-1 table. As we always say, .idx is a local cache and bumping its version is not a huge headache compared to other longer term storage items. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] introduce a commit metapack
Jeff King writes: > +int commit_metapack(unsigned char *sha1, > + uint32_t *timestamp, > + unsigned char **tree, > + unsigned char **parent1, > + unsigned char **parent2) > +{ > + struct commit_metapack *p; > + > + prepare_commit_metapacks(); > + for (p = commit_metapacks; p; p = p->next) { > + unsigned char *data; > + int pos = sha1_entry_pos(p->index, 20, 0, 0, p->nr, p->nr, > sha1); This is a tangent, but isn't it about time to rip out the check for GIT_USE_LOOKUP in find_pack_entry_one(), I wonder. > + prepare_commit_metapacks(); > + for (p = commit_metapacks; p; p = p->next) { > + unsigned char *data; > + int pos = sha1_entry_pos(p->index, 20, 0, 0, p->nr, p->nr, > sha1); > + if (pos < 0) > + continue; > + > + /* timestamp(4) + tree(20) + parents(40) */ > + data = p->data + 64 * pos; > + *timestamp = *(uint32_t *)data; > + *timestamp = ntohl(*timestamp); > + data += 4; > + *tree = data; > + data += 20; > + *parent1 = data; > + data += 20; > + *parent2 = data; > + > + return 0; > + } > + > + return -1; > +} I am torn on this one. These cached properties of a single commit will not change no matter which pack it appears in, and it feels logically wrong, especially when you record these object names in the full SHA-1 form, to tie a "commit metapack" to a pack. Logically there needs only one commit metapack that describes all the commits known to the repository when the metapack was created. In order to reduce the disk footprint and I/O cost, the future direction for this mechanism may want to point into an existing store of SHA-1 hashes with a shorter file offset, and the .idx file could be such a store, and in order to move in that direction, you cannot avoid tying a metapack to a pack. > +static void get_commits(struct metapack_writer *mw, > + const unsigned char *sha1, > + void *data) > +{ > + struct commit_list ***tail = data; > + enum object_type type = sha1_object_info(sha1, NULL); > + struct commit *c; > + > + if (type != OBJ_COMMIT) > + return; > + > + c = lookup_commit(sha1); > + if (!c || parse_commit(c)) > + die("unable to read commit %s", sha1_to_hex(sha1)); > + > + /* > + * Our fixed-size parent list cannot represent root commits, nor > + * octopus merges. Just skip those commits, as we can fallback > + * in those rare cases to reading the actual commit object. > + */ > + if (!c->parents || > + (c->parents && c->parents->next && c->parents->next->next)) > + return; > + > + *tail = &commit_list_insert(c, *tail)->next; > +} It feels somewhat wasteful to: - use commit_list for this, rather than an array of commit objects. If you have a rough estimate of the number of commits in the pack, you could just preallocate a single array and use ALLOC_GROW() on it, no? - iterate over the .idx file and run sha1_object_info() and parse_commit() on many objects in the SHA-1 order. Iterating in the way builtin/pack-objects.c::get_object_details() does avoids jumping around in existing packfiles, which may be more efficient, no? > +void commit_metapack_write(const char *idx) > +{ > + struct metapack_writer mw; > + struct commit_list *commits = NULL, *p; > + struct commit_list **tail = &commits; > + uint32_t nr = 0; > + > + metapack_writer_init(&mw, idx, "commits", 1); > + > + /* Figure out how many eligible commits we've got in this pack. */ > + metapack_writer_foreach(&mw, get_commits, &tail); > + for (p = commits; p; p = p->next) > + nr++; -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] introduce a commit metapack
On Tue, Jan 29, 2013 at 11:24:45AM +0100, Michael Haggerty wrote: > On 01/29/2013 10:16 AM, Jeff King wrote: > > When we are doing a commit traversal that does not need to > > look at the commit messages themselves (e.g., rev-list, > > merge-base, etc), we spend a lot of time accessing, > > decompressing, and parsing the commit objects just to find > > the parent and timestamp information. We can make a > > space-time tradeoff by caching that information on disk in a > > compact, uncompressed format. > > > > TODO: document on-disk format in Documentation/technical > > TODO: document API > > Would this be a good place to add the commit generation number that is > so enthusiastically discussed on the mailing list from time to time? Yes, that is one of my goals. We may even be able to just replace the timestamp field in the cache with a generation number. When it gets pretty-printed we pull it out of the commit message again anyway, so in theory the only use inside "struct commit" is for ordering. But I haven't looked at all of the use sites yet to be sure nobody is depending on it being an actual date stamp. -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] introduce a commit metapack
On 01/29/2013 10:16 AM, Jeff King wrote: > When we are doing a commit traversal that does not need to > look at the commit messages themselves (e.g., rev-list, > merge-base, etc), we spend a lot of time accessing, > decompressing, and parsing the commit objects just to find > the parent and timestamp information. We can make a > space-time tradeoff by caching that information on disk in a > compact, uncompressed format. > > TODO: document on-disk format in Documentation/technical > TODO: document API Would this be a good place to add the commit generation number that is so enthusiastically discussed on the mailing list from time to time? Michael -- Michael Haggerty mhag...@alum.mit.edu http://softwareswirl.blogspot.com/ -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/6] introduce a commit metapack
When we are doing a commit traversal that does not need to look at the commit messages themselves (e.g., rev-list, merge-base, etc), we spend a lot of time accessing, decompressing, and parsing the commit objects just to find the parent and timestamp information. We can make a space-time tradeoff by caching that information on disk in a compact, uncompressed format. TODO: document on-disk format in Documentation/technical TODO: document API Signed-off-by: Jeff King --- Makefile | 2 + commit-metapack.c | 175 ++ commit-metapack.h | 12 3 files changed, 189 insertions(+) create mode 100644 commit-metapack.c create mode 100644 commit-metapack.h diff --git a/Makefile b/Makefile index 3e4ae1b..6ca5320 100644 --- a/Makefile +++ b/Makefile @@ -619,6 +619,7 @@ LIB_H += column.h LIB_H += cache.h LIB_H += color.h LIB_H += column.h +LIB_H += commit-metapack.h LIB_H += commit.h LIB_H += compat/bswap.h LIB_H += compat/cygwin.h @@ -730,6 +731,7 @@ LIB_OBJS += combine-diff.o LIB_OBJS += color.o LIB_OBJS += column.o LIB_OBJS += combine-diff.o +LIB_OBJS += commit-metapack.o LIB_OBJS += commit.o LIB_OBJS += compat/obstack.o LIB_OBJS += compat/terminal.o diff --git a/commit-metapack.c b/commit-metapack.c new file mode 100644 index 000..2c19f48 --- /dev/null +++ b/commit-metapack.c @@ -0,0 +1,175 @@ +#include "cache.h" +#include "commit-metapack.h" +#include "metapack.h" +#include "commit.h" +#include "sha1-lookup.h" + +struct commit_metapack { + struct metapack mp; + uint32_t nr; + unsigned char *index; + unsigned char *data; + struct commit_metapack *next; +}; +static struct commit_metapack *commit_metapacks; + +static struct commit_metapack *alloc_commit_metapack(struct packed_git *pack) +{ + struct commit_metapack *it = xcalloc(1, sizeof(*it)); + uint32_t version; + + if (metapack_init(&it->mp, pack, "commits", &version) < 0) { + free(it); + return NULL; + } + if (version != 1) { + /* +* This file comes from a more recent git version. Don't bother +* warning the user, as we'll just fallback to reading the +* commits. +*/ + metapack_close(&it->mp); + free(it); + return NULL; + } + + if (it->mp.len < 4) { + warning("commit metapack for '%s' is truncated", pack->pack_name); + metapack_close(&it->mp); + free(it); + return NULL; + } + memcpy(&it->nr, it->mp.data, 4); + it->nr = ntohl(it->nr); + + /* +* We need 84 bytes for each entry: sha1(20), date(4), tree(20), +* parents(40). +*/ + if (it->mp.len < (84 * it->nr + 4)) { + warning("commit metapack for '%s' is truncated", pack->pack_name); + metapack_close(&it->mp); + free(it); + return NULL; + } + + it->index = it->mp.data + 4; + it->data = it->index + 20 * it->nr; + + return it; +} + +static void prepare_commit_metapacks(void) +{ + static int initialized; + struct commit_metapack **tail = &commit_metapacks; + struct packed_git *p; + + if (initialized) + return; + + prepare_packed_git(); + for (p = packed_git; p; p = p->next) { + struct commit_metapack *it = alloc_commit_metapack(p); + + if (it) { + *tail = it; + tail = &it->next; + } + } + + initialized = 1; +} + +int commit_metapack(unsigned char *sha1, + uint32_t *timestamp, + unsigned char **tree, + unsigned char **parent1, + unsigned char **parent2) +{ + struct commit_metapack *p; + + prepare_commit_metapacks(); + for (p = commit_metapacks; p; p = p->next) { + unsigned char *data; + int pos = sha1_entry_pos(p->index, 20, 0, 0, p->nr, p->nr, sha1); + if (pos < 0) + continue; + + /* timestamp(4) + tree(20) + parents(40) */ + data = p->data + 64 * pos; + *timestamp = *(uint32_t *)data; + *timestamp = ntohl(*timestamp); + data += 4; + *tree = data; + data += 20; + *parent1 = data; + data += 20; + *parent2 = data; + + return 0; + } + + return -1; +} + +static void get_commits(struct metapack_writer *mw, + const unsigned char *sha1, + void *data) +{ + struct commit_list ***tail = data; + enum object_type type = sha1_object_info(sha1, NULL); + struct commit *c; + + if (type != OBJ_COMMIT) + r