subject:"\[PATCH 4\/6\] introduce a commit metapack"

Re: [PATCH 4/6] introduce a commit metapack

2013-03-18 Thread Jeff King

On Sun, Mar 17, 2013 at 08:21:13PM +0700, Nguyen Thai Ngoc Duy wrote:

> On Thu, Jan 31, 2013 at 6:06 PM, Duy Nguyen  wrote:
> > On Wed, Jan 30, 2013 at 09:16:29PM +0700, Duy Nguyen wrote:
> >> Perhaps we could store abbrev sha-1 instead of full sha-1. Nice
> >> space/time trade-off.
> >
> > Following the on-disk format experiment yesterday, I changed the
> > format to:
> >
> >  - a list a _short_ SHA-1 of cached commits
> > ..
> >
> > The length of SHA-1 is chosen to be able to unambiguously identify any
> > cached commits. Full SHA-1 check is done after to catch false
> > positives. For linux-2.6, SHA-1 length is 6 bytes, git and many
> > moderate-sized projects are 4 bytes.
> 
> And if we are going to create index v3, the same trick could be used
> for the sha-1 table in the index. We use the short sha-1 table for
> binary search and put the rest of sha-1 in a following table (just
> like file offset table). The advantage is a denser search space, about
> 1/4-1/3 the size of full sha-1 table.

You can make it even smaller at some (potential) run-time cost.

Keep in mind you are just repeating information that is in the full sha1
list in the index. So you could store a fixed-size offset into that list
(e.g., 32-bit), and then instead of comparing sha1s during a binary
search, you would dereference the offset to the real sha1s and compare
those.

The run-time cost is not any worse in a big-O sense, but your cache
locality is much worse (you hit a second random page for each sha1
comparison), which might be noticeable. You'd have to benchmark to see
how big an impact.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/6] introduce a commit metapack

2013-03-17 Thread Duy Nguyen

On Thu, Jan 31, 2013 at 6:06 PM, Duy Nguyen  wrote:
> On Wed, Jan 30, 2013 at 09:16:29PM +0700, Duy Nguyen wrote:
>> Perhaps we could store abbrev sha-1 instead of full sha-1. Nice
>> space/time trade-off.
>
> Following the on-disk format experiment yesterday, I changed the
> format to:
>
>  - a list a _short_ SHA-1 of cached commits
> ..
>
> The length of SHA-1 is chosen to be able to unambiguously identify any
> cached commits. Full SHA-1 check is done after to catch false
> positives. For linux-2.6, SHA-1 length is 6 bytes, git and many
> moderate-sized projects are 4 bytes.

And if we are going to create index v3, the same trick could be used
for the sha-1 table in the index. We use the short sha-1 table for
binary search and put the rest of sha-1 in a following table (just
like file offset table). The advantage is a denser search space, about
1/4-1/3 the size of full sha-1 table.
-- 
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/6] introduce a commit metapack

2013-02-02 Thread Junio C Hamano

Jeff King  writes:

> On Thu, Jan 31, 2013 at 09:03:26AM -0800, Shawn O. Pearce wrote:
> ...
>> If we are going to change the index to support extension sections and
>> I have to modify JGit to grok this new format, it needs to be index v3
>> not index v2. If we are making index v3 we should just put index v3 on
>> the end of the pack file.
>
> I'm not sure what you mean by your last sentence here.

I am not Shawn, but here is a summary of what I think I discussed
with him in person, lest I forget.

You could imagine that a new pack system (from pack-objects,
index-pack down to read_packed_sha1() call) that works with a
packfile that

 * is a single file, whose name is pack-$SHA1.$sfx (where $sfx
   is something other than 'pack', perhaps);

 * has the pack data stream, including the concluding checksum of
   the stream contents at the end, at the beginning of the file; and

 * has the index v3 data blob appended to the pack data stream.

The pack data is streamed over the wire exactly the same way,
interoperating with existing software.  When receiving, the new
index-pack can read such a pack stream and add index at the end.
When re-indexing an existing pack (think: upgrading existing
packfiles from the current system), the index-pack can read from the
packfile and do what it does currently (notably, it knows where the
pack stream ends and can stop reading at that point, so even if you
feed the new pack to it, it will stop at the end of the pack data,
ignoring the index v3 already at the end of the input).

One potential advantage of using a single file, instead of the
primary .pack file with 3 (or 47) auxiliary files, is that it lets
you repack without having to deal with this sequence, which happens
currently when you repack:

 * create a new .pack file and the corresponding auxiliary files
   under temporary filename;

 * move existing pack files that describe the same set of objects
   away;

 * rename these new files, one at a time, to their final name,
   making sure that you rename .idx the last, because that happens
   to be the key to the pack aware programs.

Instead you can rename only one thing (the new one) to the final
name (possibly atomically replacing the existing one).  With the
current system, when you need to replace a pack with a new pack with
the same packname (e.g. you repack everything with a better pack
parameter in a repository that has everything packed into one),
there is a very small window other concurrent users will not find
the object data between the time when you rename the old ones away
and the time when you move the new ones in.  The hairly logic
between "Ok we have prepared all new packfiles" and "End of pack
replacement" can be done with a single rename(2) of the new packfile
(which contains everything) to the final name, which atomically
replaces the old one.

This will become even safer if we picked $SHA1 (the name of the
packfile) to represent the hash of the whole thing, not the hash of
the sorted object names in the pack, as that will let us know there
is no need to even "replace" the files.

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/6] introduce a commit metapack

2013-02-02 Thread Duy Nguyen

On Fri, Feb 1, 2013 at 5:15 PM, Jeff King  wrote:
> The short-sha1 is a clever idea. Looks like it saves us on the order of
> 4MB for linux-2.6 (versus the full 20-byte sha1). Not as big as the
> savings we get from dropping the other 3 sha1's to uint32_t, but still
> not bad.

We could save another 4 bytes per commit by using 3 bytes for storing
.idx offsets. linux-2.6 only has 3M objects. It'll take many years for
big projects to reach 16M objects and need the fourth byte in
uint32_t.


> I guess the next steps in iterating on this would be:
>
>   1. splitting out the refactoring here into separate patches
>
>   2. squashing the cleaned-up bits into my patch 4/6
>
>   3. deciding whether this should go into a separate file or as part of
>  index v3. Your offsets depend on the .idx file having a sorted sha1
>  list. That is not likely to change, but it would still be nice to
>  make sure they cannot get out of sync. I'm still curious what the
>  performance impact is for mmap-ing N versus N+8MB.

4. Print some cache statistics in "count-objects -v"


>> The length of SHA-1 is chosen to be able to unambiguously identify any
>> cached commits. Full SHA-1 check is done after to catch false
>> positives.
>
> Just to be clear, these false positives come because the abbreviation is
> unambiguous within the packfile, but we might be looking for a commit
> that is not even in our pack, right?

It may even be ambiguous within the pack, say an octopus (i.e. not
cached) commit that shares the same sha-1 prefix with one of the
cached commits.
-- 
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/6] introduce a commit metapack

2013-02-01 Thread Jeff King

On Thu, Jan 31, 2013 at 06:06:56PM +0700, Nguyen Thai Ngoc Duy wrote:

> On Wed, Jan 30, 2013 at 09:16:29PM +0700, Duy Nguyen wrote:
> > Perhaps we could store abbrev sha-1 instead of full sha-1. Nice
> > space/time trade-off.
> 
> Following the on-disk format experiment yesterday, I changed the
> format to:
> 
>  - a list a _short_ SHA-1 of cached commits
>  - a list of cache entries, each (5 uint32_t) consists of:
>- uint32_t for the index in .idx sha-1 table to get full SHA-1 of
>  the commit
>- uint32_t for timestamp
>- uint32_t for tree, 1st and 2nd parents for the index in .idx
>  table

BTW, I needed the minor fixups below to silence some warnings from your
patch. Here are the cold and warm cache timings I got, as compared to
stock git and my implementation:

 Pack  |  Cold Revs   |  Warm Revs
---+--+
 stock | 12.54| 4.14
me |  4.76 (-62%) | 0.66 (-84%)
   duy |  4.36 (-65%) | 0.55 (-86%)

Not surprising; yours is just a bit faster in terms of CPU, and even
gains a little more in the cold cache case. Nice. Of course that is just
gravy on top of the smaller disk usage, too. :)

---
diff --git a/commit-metapack.c b/commit-metapack.c
index c984b8e..78fd961 100644
--- a/commit-metapack.c
+++ b/commit-metapack.c
@@ -106,7 +106,7 @@ int commit_metapack(unsigned char *sha1,
for (p = commit_metapacks; p; p = p->next) {
struct commit_entry *data;
uint32_t p1, p2;
-   unsigned lo, hi, mi;
+   unsigned lo, hi;
int pos;
 
/* sha1_entry_pos does not work with abbreviated sha-1 */
@@ -161,7 +161,7 @@ static void get_commits(struct metapack_writer *mw,
struct write_cb *write_cb = (struct write_cb *)data;
enum object_type type = sha1_object_info(sha1, NULL);
struct commit *c;
-   int p1, p2;
+   int p1 = -1, p2 = -1;
 
if (type != OBJ_COMMIT)
return;
diff --git a/commit.c b/commit.c
index b326201..5b776f8 100644
--- a/commit.c
+++ b/commit.c
@@ -309,7 +309,7 @@ static int parse_commit_metapack(struct commit *item)
 
 static int parse_commit_metapack(struct commit *item)
 {
-   unsigned char *tree, *p1, *p2;
+   const unsigned char *tree, *p1, *p2;
uint32_t ts;
 
if (commit_metapack(item->object.sha1, &ts, &tree, &p1, &p2) < 0)
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/6] introduce a commit metapack

2013-02-01 Thread Jeff King

On Thu, Jan 31, 2013 at 06:06:56PM +0700, Nguyen Thai Ngoc Duy wrote:

> On Wed, Jan 30, 2013 at 09:16:29PM +0700, Duy Nguyen wrote:
> > Perhaps we could store abbrev sha-1 instead of full sha-1. Nice
> > space/time trade-off.
> 
> Following the on-disk format experiment yesterday, I changed the
> format to:
> 
>  - a list a _short_ SHA-1 of cached commits
>  - a list of cache entries, each (5 uint32_t) consists of:
>- uint32_t for the index in .idx sha-1 table to get full SHA-1 of
>  the commit
>- uint32_t for timestamp
>- uint32_t for tree, 1st and 2nd parents for the index in .idx
>  table

Thanks for working on this, as it was the next step I was going to take. :)

The short-sha1 is a clever idea. Looks like it saves us on the order of
4MB for linux-2.6 (versus the full 20-byte sha1). Not as big as the
savings we get from dropping the other 3 sha1's to uint32_t, but still
not bad.

I guess the next steps in iterating on this would be:

  1. splitting out the refactoring here into separate patches

  2. squashing the cleaned-up bits into my patch 4/6

  3. deciding whether this should go into a separate file or as part of
 index v3. Your offsets depend on the .idx file having a sorted sha1
 list. That is not likely to change, but it would still be nice to
 make sure they cannot get out of sync. I'm still curious what the
 performance impact is for mmap-ing N versus N+8MB.

> The length of SHA-1 is chosen to be able to unambiguously identify any
> cached commits. Full SHA-1 check is done after to catch false
> positives.

Just to be clear, these false positives come because the abbreviation is
unambiguous within the packfile, but we might be looking for a commit
that is not even in our pack, right?

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/6] introduce a commit metapack

2013-02-01 Thread Jeff King

On Wed, Jan 30, 2013 at 08:56:07PM +0700, Nguyen Thai Ngoc Duy wrote:

> Another point, but not really important at this stage, I think we have
> memory leak somewhere (lookup_commit??). It used up to 800 MB RES on
> linux-2.6.git while generating the cache.

We generate (and then leak!) the linked list in commit_metapack_write.
That may be the culprit.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/6] introduce a commit metapack

2013-02-01 Thread Jeff King

On Thu, Jan 31, 2013 at 09:03:26AM -0800, Shawn O. Pearce wrote:

> > Of course, it is more convenient to store this kind of things in a
> > separate file while experimenting and improving the mechanism, but I
> > do not think we want to see each packfile in a repository comes with
> > 47 auxiliary files with different suffixes 5 years down the road.
> 
> Agggh.
> 
> Right now we are in the middle of refactoring the JGit reachability
> bitmap implementation to store it into a separate file and get it out
> of the .idx file. This work is nearly completed. So this thread is
> great timing. :-)
> 
> I think Junio is right about not wanting 47 different tiny auxiliary
> files for a single pack. We are unlikely to create that many, but
> right now there are proposals floating around for at least 2 new
> auxiliary files (commit cache and reachability bitmaps). So its not
> impossible that another will be discovered in the future.

Why don't we want 47 files (or even 3)? If it makes the implementation
more flexible or simple without sacrificing performance, I don't see a
problem. The filesystem is there to organize our data; we do not cram
all of our files into one just to save a few inodes.

We _do_ cram our data into packfiles and packed-refs when we can save
O(n) inodes. But if we are talking about a handful of extra files that
we must readdir() over while preparing the objects/pack directory, I
don't think that is the same thing.

The data dependency issues (can the files get out of sync? How common is
it? Can we detect it?) and performance issues are more interesting to
me. With respect to the latter, here's specifically what I'm thinking
of. Let's imagine we have an extension for reachability bitmaps that
takes up an extra 10MB. When we mmap the .idx file, do we always mmap
the extra bytes? What's the performance impact of the extra mmap? I
recall on some non-Linux systems it can be quite expensive. For most git
commands which are not going to do a reachability analysis, this is a
loss.

I don't know if this is an issue big enough to care about or not. But I
think it's worth benchmarking and including in the decision.

> Junio may be right about the hole in the index file for git-core. I
> haven't checked the JGit implementation, but I suspect it does not
> have this hole. IIRC JGit consumes the index sections and then expects
> the pack trailer SHA-1 to be present immediately after the last table.
> This happens because JGit doesn't use mmap() to load the index, it
> streams the file into memory and does some reformatting on the tables
> to make search faster.

Yeah, looking at the PackIndexV2 implementation, it counts the 64-bit
offsets from the first table, calculates the size of the large offset
table, reads past it, and then expects the pack checksum. So let's
assume we would have to bump to v3 to implement it inside the .idx file.

> If we are going to change the index to support extension sections and
> I have to modify JGit to grok this new format, it needs to be index v3
> not index v2. If we are making index v3 we should just put index v3 on
> the end of the pack file.

I'm not sure what you mean by your last sentence here.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/6] introduce a commit metapack

2013-02-01 Thread Jeff King

On Tue, Jan 29, 2013 at 11:17:41PM -0800, Junio C Hamano wrote:

> > True, but it is even less headache if the file is totally separate and
> > optional.
> 
> Once you start thinking about using an offset to some list of SHA-1,
> perhaps?  A section inside the same file can never go out of sync.

Yes, having a data dependency is important. It is unavoidable to have a
dependency on the packfile, though (and that is why the index and the
metapacks embed the sha1 of the packfile). If the offsets used are
packfile offsets, then that is sufficient.

If the offsets are from the index, then yes, putting it in the same file
is one way to keep them tied together. Another way is to do the same
sha1 verification, except to embed the sha1 of the index in the
metapack.

So I certainly consider putting the dependency-handling to be a point in
favor of the same file, but I'd weight it against other points (headache
of bumping index version, performance of both types, etc).

> Also a longer-term advantage is that you can teach index-pack to do
> this.

I think that is roughly the same amount of difficulty to do whether they
are in the same file or not.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/6] introduce a commit metapack

2013-01-31 Thread Shawn Pearce

On Wed, Jan 30, 2013 at 7:56 AM, Junio C Hamano  wrote:
> Jeff King  writes:
>
>>>From this:
>>
>>> Then it will be very natural for the extension data that store the
>>> commit metainfo to name objects in the pack the .idx file describes
>>> by the offset in the SHA-1 table.
>>
>> I guess your argument is that putting it all in the same file makes it
>> more natural for there to be a data dependency.
>
> It is more about the "I am torn on this one" I mentioned earlier.
>
> It would be more "logical" if this weren't tied to a particular
> pack, as the properties of a commit you record in this series do not
> depend on which pack the commit is in, and such a repository-global
> file by definition cannot be inside anybody's .idx.
>
> But if we split the information into separate pieces and store one
> piece per .idx for implementation reasons, it is crazy not to at
> least consider it a longer term goal to put it inside .idx file.
>
> Of course, it is more convenient to store this kind of things in a
> separate file while experimenting and improving the mechanism, but I
> do not think we want to see each packfile in a repository comes with
> 47 auxiliary files with different suffixes 5 years down the road.

Agggh.

Right now we are in the middle of refactoring the JGit reachability
bitmap implementation to store it into a separate file and get it out
of the .idx file. This work is nearly completed. So this thread is
great timing. :-)

I think Junio is right about not wanting 47 different tiny auxiliary
files for a single pack. We are unlikely to create that many, but
right now there are proposals floating around for at least 2 new
auxiliary files (commit cache and reachability bitmaps). So its not
impossible that another will be discovered in the future.

Junio may be right about the hole in the index file for git-core. I
haven't checked the JGit implementation, but I suspect it does not
have this hole. IIRC JGit consumes the index sections and then expects
the pack trailer SHA-1 to be present immediately after the last table.
This happens because JGit doesn't use mmap() to load the index, it
streams the file into memory and does some reformatting on the tables
to make search faster.

If we are going to change the index to support extension sections and
I have to modify JGit to grok this new format, it needs to be index v3
not index v2. If we are making index v3 we should just put index v3 on
the end of the pack file.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/6] introduce a commit metapack

2013-01-31 Thread Duy Nguyen

On Wed, Jan 30, 2013 at 09:16:29PM +0700, Duy Nguyen wrote:
> Perhaps we could store abbrev sha-1 instead of full sha-1. Nice
> space/time trade-off.

Following the on-disk format experiment yesterday, I changed the
format to:

 - a list a _short_ SHA-1 of cached commits
 - a list of cache entries, each (5 uint32_t) consists of:
   - uint32_t for the index in .idx sha-1 table to get full SHA-1 of
 the commit
   - uint32_t for timestamp
   - uint32_t for tree, 1st and 2nd parents for the index in .idx
 table

The length of SHA-1 is chosen to be able to unambiguously identify any
cached commits. Full SHA-1 check is done after to catch false
positives. For linux-2.6, SHA-1 length is 6 bytes, git and many
moderate-sized projects are 4 bytes. So it's 26 bytes per commit for
linux-2.6.git, or 8MB .commits file. Not as good as revindex approach
(5.5MB) but way better than the current one (27MB). And still good
enough for caching 2 parents unconditionally.

Performance seems to improve a tiny bit, probably because of more
compact search space: 0.6s with my patch vs 0.7s without (vs 3.4s
without cache).

-- 8< --
diff --git a/cache.h b/cache.h
index 1f96f65..8048d5b 100644
--- a/cache.h
+++ b/cache.h
@@ -1069,6 +1069,7 @@ extern struct packed_git *add_packed_git(const char *, 
int, int);
 extern const unsigned char *nth_packed_object_sha1(struct packed_git *, 
uint32_t);
 extern off_t nth_packed_object_offset(const struct packed_git *, uint32_t);
 extern off_t find_pack_entry_one(const unsigned char *, struct packed_git *);
+extern int find_pack_entry_pos(const unsigned char *, struct packed_git *);
 extern int is_pack_valid(struct packed_git *);
 extern void *unpack_entry(struct packed_git *, off_t, enum object_type *, 
unsigned long *);
 extern unsigned long unpack_object_header_buffer(const unsigned char *buf, 
unsigned long len, enum object_type *type, unsigned long *sizep);
diff --git a/commit-metapack.c b/commit-metapack.c
index 2c19f48..c984b8e 100644
--- a/commit-metapack.c
+++ b/commit-metapack.c
@@ -4,11 +4,21 @@
 #include "commit.h"
 #include "sha1-lookup.h"
 
+struct commit_entry {
+   uint32_t commit;/* nth_packed_object_sha1 to get own SHA-1 */
+   uint32_t timestamp;
+   uint32_t tree;  /* nth_packed_object_sha1 to get tree SHA-1 */
+   uint32_t parent1; /* nth_packed_object_sha1 to get 1st parent SHA-1 */
+   uint32_t parent2; /* nth_packed_object_sha1 to get 2nd parent SHA-1 */
+};
+
 struct commit_metapack {
struct metapack mp;
uint32_t nr;
+   uint32_t abbrev_len;
+   struct packed_git *pack;
unsigned char *index;
-   unsigned char *data;
+   struct commit_entry *data;
struct commit_metapack *next;
 };
 static struct commit_metapack *commit_metapacks;
@@ -41,20 +51,23 @@ static struct commit_metapack *alloc_commit_metapack(struct 
packed_git *pack)
}
memcpy(&it->nr, it->mp.data, 4);
it->nr = ntohl(it->nr);
+   memcpy(&it->abbrev_len, it->mp.data + 4, 4);
+   it->abbrev_len = ntohl(it->abbrev_len);
+   it->pack = pack;
 
/*
-* We need 84 bytes for each entry: sha1(20), date(4), tree(20),
-* parents(40).
+* We need 20+abbrev_len bytes for each entry: abbrev sha-1,
+* date(4), tree index(4), parent indexes(8).
 */
-   if (it->mp.len < (84 * it->nr + 4)) {
+   if (it->mp.len < ((sizeof(*it->data) + it->abbrev_len) * it->nr + 8)) {
warning("commit metapack for '%s' is truncated", 
pack->pack_name);
metapack_close(&it->mp);
free(it);
return NULL;
}
 
-   it->index = it->mp.data + 4;
-   it->data = it->index + 20 * it->nr;
+   it->index = it->mp.data + 8;
+   it->data = (struct commit_entry*)(it->index + it->abbrev_len * it->nr);
 
return it;
 }
@@ -83,29 +96,51 @@ static void prepare_commit_metapacks(void)
 
 int commit_metapack(unsigned char *sha1,
uint32_t *timestamp,
-   unsigned char **tree,
-   unsigned char **parent1,
-   unsigned char **parent2)
+   const unsigned char **tree,
+   const unsigned char **parent1,
+   const unsigned char **parent2)
 {
struct commit_metapack *p;
 
prepare_commit_metapacks();
for (p = commit_metapacks; p; p = p->next) {
-   unsigned char *data;
-   int pos = sha1_entry_pos(p->index, 20, 0, 0, p->nr, p->nr, 
sha1);
+   struct commit_entry *data;
+   uint32_t p1, p2;
+   unsigned lo, hi, mi;
+   int pos;
+
+   /* sha1_entry_pos does not work with abbreviated sha-1 */
+   lo = 0;
+   hi = p->nr;
+   pos = -1;
+   do {
+   unsigned mi = (lo + hi) / 2;
+   int cmp = memcmp(p->index + mi * p-

Re: [PATCH 4/6] introduce a commit metapack

2013-01-30 Thread Junio C Hamano

Jeff King  writes:

>>From this:
>
>> Then it will be very natural for the extension data that store the
>> commit metainfo to name objects in the pack the .idx file describes
>> by the offset in the SHA-1 table.
>
> I guess your argument is that putting it all in the same file makes it
> more natural for there to be a data dependency.

It is more about the "I am torn on this one" I mentioned earlier.

It would be more "logical" if this weren't tied to a particular
pack, as the properties of a commit you record in this series do not
depend on which pack the commit is in, and such a repository-global
file by definition cannot be inside anybody's .idx.

But if we split the information into separate pieces and store one
piece per .idx for implementation reasons, it is crazy not to at
least consider it a longer term goal to put it inside .idx file.

Of course, it is more convenient to store this kind of things in a
separate file while experimenting and improving the mechanism, but I
do not think we want to see each packfile in a repository comes with
47 auxiliary files with different suffixes 5 years down the road.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/6] introduce a commit metapack

2013-01-30 Thread Duy Nguyen

On Wed, Jan 30, 2013 at 8:56 PM, Duy Nguyen  wrote:
> However, performance seems to suffer too. Maybe I do more lookups than
> necessary, I don't know.

Yes, I should have stored the position in the sha-1 <-> offset map
instead of the position of the object in .pack file. Even so,
performance does not improve.

> I should probably measure the cost of revindex separately.

And the cost of create_pack_revindex() is 0.6 sec :-(

Perhaps we could store abbrev sha-1 instead of full sha-1. Nice
space/time trade-off.
-- 
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/6] introduce a commit metapack

2013-01-30 Thread Duy Nguyen

On Tue, Jan 29, 2013 at 04:16:11AM -0500, Jeff King wrote:
> When we are doing a commit traversal that does not need to
> look at the commit messages themselves (e.g., rev-list,
> merge-base, etc), we spend a lot of time accessing,
> decompressing, and parsing the commit objects just to find
> the parent and timestamp information. We can make a
> space-time tradeoff by caching that information on disk in a
> compact, uncompressed format.

And this is a (messy) patch on top that avoids storing SHA-1
directly. On my linux-2.6.git (575 MB pack, 73 MB index), .commits
file is 5.2 MB and 27 MB with and without my patch respectively. Nice
shrinkage.

However, performance seems to suffer too. Maybe I do more lookups than
necessary, I don't know. I should probably measure the cost of
revindex separately.

git rev-list --all --quiet on vanilla git:

real0m3.645s
user0m3.556s
sys 0m0.080s

commit cache without my patch:

real0m0.723s
user0m0.677s
sys 0m0.045s

and with my patch:

real0m1.338s
user0m1.259s
sys 0m0.075s


Another point, but not really important at this stage, I think we have
memory leak somewhere (lookup_commit??). It used up to 800 MB RES on
linux-2.6.git while generating the cache.

-- 8< --
diff --git a/cache.h b/cache.h
index 1f96f65..8048d5b 100644
--- a/cache.h
+++ b/cache.h
@@ -1069,6 +1069,7 @@ extern struct packed_git *add_packed_git(const char *, 
int, int);
 extern const unsigned char *nth_packed_object_sha1(struct packed_git *, 
uint32_t);
 extern off_t nth_packed_object_offset(const struct packed_git *, uint32_t);
 extern off_t find_pack_entry_one(const unsigned char *, struct packed_git *);
+extern int find_pack_entry_pos(const unsigned char *, struct packed_git *);
 extern int is_pack_valid(struct packed_git *);
 extern void *unpack_entry(struct packed_git *, off_t, enum object_type *, 
unsigned long *);
 extern unsigned long unpack_object_header_buffer(const unsigned char *buf, 
unsigned long len, enum object_type *type, unsigned long *sizep);
diff --git a/commit-metapack.c b/commit-metapack.c
index 2c19f48..55f7ea9 100644
--- a/commit-metapack.c
+++ b/commit-metapack.c
@@ -3,11 +3,12 @@
 #include "metapack.h"
 #include "commit.h"
 #include "sha1-lookup.h"
+#include "pack-revindex.h"
 
 struct commit_metapack {
struct metapack mp;
-   uint32_t nr;
-   unsigned char *index;
+   struct packed_git *pack;
+   uint32_t first, last;
unsigned char *data;
struct commit_metapack *next;
 };
@@ -16,7 +17,7 @@ static struct commit_metapack *commit_metapacks;
 static struct commit_metapack *alloc_commit_metapack(struct packed_git *pack)
 {
struct commit_metapack *it = xcalloc(1, sizeof(*it));
-   uint32_t version;
+   uint32_t version, nr;
 
if (metapack_init(&it->mp, pack, "commits", &version) < 0) {
free(it);
@@ -39,22 +40,25 @@ static struct commit_metapack *alloc_commit_metapack(struct 
packed_git *pack)
free(it);
return NULL;
}
-   memcpy(&it->nr, it->mp.data, 4);
-   it->nr = ntohl(it->nr);
+   memcpy(&it->first, it->mp.data, 4);
+   it->first = ntohl(it->first);
+   memcpy(&it->last, it->mp.data + 4, 4);
+   it->last = ntohl(it->last);
+   nr = it->last - it->first + 1;
+   it->pack = pack;
 
/*
-* We need 84 bytes for each entry: sha1(20), date(4), tree(20),
-* parents(40).
+* We need 16 bytes for each entry: date(4), tree index(4),
+* parent indexes(8).
 */
-   if (it->mp.len < (84 * it->nr + 4)) {
+   if (it->mp.len < (16 * nr + 8)) {
warning("commit metapack for '%s' is truncated", 
pack->pack_name);
metapack_close(&it->mp);
free(it);
return NULL;
}
 
-   it->index = it->mp.data + 4;
-   it->data = it->index + 20 * it->nr;
+   it->data = it->mp.data + 8;
 
return it;
 }
@@ -81,31 +85,61 @@ static void prepare_commit_metapacks(void)
initialized = 1;
 }
 
+static const unsigned char *idx_to_sha1(struct packed_git *p,
+   uint32_t nth)
+{
+   struct revindex_entry *revindex = get_revindex(p);
+   if (!revindex)
+   return NULL;
+   return nth_packed_object_sha1(p, revindex[nth].nr);
+}
+
 int commit_metapack(unsigned char *sha1,
uint32_t *timestamp,
-   unsigned char **tree,
-   unsigned char **parent1,
-   unsigned char **parent2)
+   const unsigned char **tree,
+   const unsigned char **parent1,
+   const unsigned char **parent2)
 {
struct commit_metapack *p;
 
prepare_commit_metapacks();
for (p = commit_metapacks; p; p = p->next) {
unsigned char *data;
-   int pos = sha1_entry_pos(p->index, 20, 0, 0, p->nr, p->nr, 
sha1);
-

Re: [PATCH 4/6] introduce a commit metapack

2013-01-29 Thread Junio C Hamano

> True, but it is even less headache if the file is totally separate and
> optional.

Once you start thinking about using an offset to some list of SHA-1,
perhaps?  A section inside the same file can never go out of sync.
Also a longer-term advantage is that you can teach index-pack to do
this.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/6] introduce a commit metapack

2013-01-29 Thread Jeff King

On Wed, Jan 30, 2013 at 10:36:10AM +0700, Nguyen Thai Ngoc Duy wrote:

> On Tue, Jan 29, 2013 at 4:16 PM, Jeff King  wrote:
> > +int commit_metapack(unsigned char *sha1,
> > +   uint32_t *timestamp,
> > +   unsigned char **tree,
> > +   unsigned char **parent1,
> > +   unsigned char **parent2)
> > +{
> 
> Nit picking. tree, parent1 and parent2 can/should be "const unsigned char **".

Thanks, I'll fix it in the next version.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/6] introduce a commit metapack

2013-01-29 Thread Jeff King

On Tue, Jan 29, 2013 at 10:08:08AM -0800, Junio C Hamano wrote:

> > In order to reduce the disk footprint and I/O cost, the future
> > direction for this mechanism may want to point into an existing
> > store of SHA-1 hashes with a shorter file offset, and the .idx file
> > could be such a store, and in order to move in that direction, you
> > cannot avoid tying a metapack to a pack.
> 
> Have you considered if we can extend the .idx format version 2
> without actually having to bump the version number?  My quick
> reading of check_packed_git_idx() tells me that we have a gap after
> the "large offset table" that we can place extensions, but I may be
> mistaken.  If we indeed can, then perhaps adding a series of
> 
>   4-byte "id" header
> 4-byte extension length (or 8-byte)
> ... N-byte worth of extension data ...
> 
> followed by
> 
>   20-byte SHA-1 checksum of all the extension sections
>   8-byte file offset to the first extension section

I considered it, but didn't look into it closely. My feeling was that it
added extra complexity, without adding any real advantage that a
separate file would not.

>From this:

> Then it will be very natural for the extension data that store the
> commit metainfo to name objects in the pack the .idx file describes
> by the offset in the SHA-1 table.

I guess your argument is that putting it all in the same file makes it
more natural for there to be a data dependency.

> As we always say, .idx is a local cache and bumping its version is
> not a huge headache compared to other longer term storage items.

True, but it is even less headache if the file is totally separate and
optional.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/6] introduce a commit metapack

2013-01-29 Thread Jeff King

On Tue, Jan 29, 2013 at 09:38:10AM -0800, Junio C Hamano wrote:

> Jeff King  writes:
> 
> > +int commit_metapack(unsigned char *sha1,
> > +   uint32_t *timestamp,
> > +   unsigned char **tree,
> > +   unsigned char **parent1,
> > +   unsigned char **parent2)
> > +{
> > +   struct commit_metapack *p;
> > +
> > +   prepare_commit_metapacks();
> > +   for (p = commit_metapacks; p; p = p->next) {
> > +   unsigned char *data;
> > +   int pos = sha1_entry_pos(p->index, 20, 0, 0, p->nr, p->nr, 
> > sha1);
> 
> This is a tangent, but isn't it about time to rip out the check for
> GIT_USE_LOOKUP in find_pack_entry_one(), I wonder.

Rip it out and always use sha1_entry_pos, or rip it out and never use
it? I have never been able to measure any speedup from it; I just use it
here to avoid rewriting the binary search myself. I do not think there
is any disadvantage to using it, so I'd be in favor of just
standardizing on it for any sha1 binary searches.

> These cached properties of a single commit will not change no matter
> which pack it appears in, and it feels logically wrong, especially
> when you record these object names in the full SHA-1 form, to tie a
> "commit metapack" to a pack.  Logically there needs only one commit
> metapack that describes all the commits known to the repository when
> the metapack was created.

True. I had originally envisioned it as tied to the packfile to help
manage the lifecycle. You know you need to generate a metapack for some
objects when they get packed, and you know you can throw it away when
the associated pack goes away. And it means you can verify the integrity
of the metapack by matching it to a particular packfile.

However, if you just have a commit cache, you can always blow the whole
thing away and regenerate it for all objects in the repo. It does not
have tied to a pack (you do end up doing some extra work at regeneration
when you are not doing a full repack, but it is really not enough to
worry about).

> In order to reduce the disk footprint and I/O cost, the future
> direction for this mechanism may want to point into an existing
> store of SHA-1 hashes with a shorter file offset, and the .idx file
> could be such a store, and in order to move in that direction, you
> cannot avoid tying a metapack to a pack.

Yes. That was not one of my original goals for the commit cache, but I
do think it's a useful direction to go in. And reachability bitmaps
(which would eventually be their own metapacks) would generally want to
be per-pack, too, for the same reason.

> > +   *tail = &commit_list_insert(c, *tail)->next;
> > +}
> 
> It feels somewhat wasteful to:
> 
>  - use commit_list for this, rather than an array of commit
>objects.  If you have a rough estimate of the number of commits
>in the pack, you could just preallocate a single array and use
>ALLOC_GROW() on it, no?

We don't have a rough estimate, but yes, we could just use an array and
trust ALLOC_GROW to be reasonable. The use of commit_list did not have a
particular reason other than that it was simple (an array means stuffing
the array pointer and the nr and alloc counts into a struct to get to
the callback). The performance of writing the cache is dominated by
accessing the objects themselves, anyway. I don't mind changing it,
though, if you think it's clearer as an array.

>  - iterate over the .idx file and run sha1_object_info() and
>parse_commit() on many objects in the SHA-1 order.  Iterating in
>the way builtin/pack-objects.c::get_object_details() does avoids
>jumping around in existing packfiles, which may be more
>efficient, no?

Probably, though generating the complete commit cache for linux-2.6.git
takes only about 7 seconds on my machine. I wasn't too concerned with
optimizing generation, since it will typically be dwarfed by repacking
costs. It might make more of a difference for doing a metapack on all
objects (e.g., reachability bitmaps).

The reason I do it in .idx order is that it feeds the callback in sorted
order, so a writer could in theory just use that output as-is (and my
initial version did that, as it wrote separate metapacks for each
element). This version now puts all data elements together (for cache
locality), and builds the in-memory list so we do not have to re-do
sha1_object_info repeatedly. So it could very easily just generate the
list in pack order and sort it at the end.

I'll look into that for the next version.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/6] introduce a commit metapack

2013-01-29 Thread Duy Nguyen

On Tue, Jan 29, 2013 at 4:16 PM, Jeff King  wrote:
> +int commit_metapack(unsigned char *sha1,
> +   uint32_t *timestamp,
> +   unsigned char **tree,
> +   unsigned char **parent1,
> +   unsigned char **parent2)
> +{

Nit picking. tree, parent1 and parent2 can/should be "const unsigned char **".
-- 
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/6] introduce a commit metapack

2013-01-29 Thread Junio C Hamano

Junio C Hamano  writes:

> I am torn on this one.
>
> These cached properties of a single commit will not change no matter
> which pack it appears in, and it feels logically wrong, especially
> when you record these object names in the full SHA-1 form, to tie a
> "commit metapack" to a pack.  Logically there needs only one commit
> metapack that describes all the commits known to the repository when
> the metapack was created.
>
> In order to reduce the disk footprint and I/O cost, the future
> direction for this mechanism may want to point into an existing
> store of SHA-1 hashes with a shorter file offset, and the .idx file
> could be such a store, and in order to move in that direction, you
> cannot avoid tying a metapack to a pack.

Have you considered if we can extend the .idx format version 2
without actually having to bump the version number?  My quick
reading of check_packed_git_idx() tells me that we have a gap after
the "large offset table" that we can place extensions, but I may be
mistaken.  If we indeed can, then perhaps adding a series of

4-byte "id" header
4-byte extension length (or 8-byte)
... N-byte worth of extension data ...

followed by

20-byte SHA-1 checksum of all the extension sections
8-byte file offset to the first extension section

at that gap, immediately before the trailer of the .idx file written
by git_SHA1_Final(), in a way similar to the index extension is done
may give us a natural way to deal with this.  The reader can first
read the offset recorded at the tail, checking if the offset is
plausible because it may not exist, the check the extension section
checksum, to see if the file has extension, or the file just ends
with the large offset table and the 28 bytes it tried to were from
the entries near the end of the large offset table.

If that is not worth attempting, perhaps we may still want to store
this as an optional extended section and bump the .idx format
version.

Then it will be very natural for the extension data that store the
commit metainfo to name objects in the pack the .idx file describes
by the offset in the SHA-1 table.

As we always say, .idx is a local cache and bumping its version is
not a huge headache compared to other longer term storage items.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/6] introduce a commit metapack

2013-01-29 Thread Junio C Hamano

Jeff King  writes:

> +int commit_metapack(unsigned char *sha1,
> + uint32_t *timestamp,
> + unsigned char **tree,
> + unsigned char **parent1,
> + unsigned char **parent2)
> +{
> + struct commit_metapack *p;
> +
> + prepare_commit_metapacks();
> + for (p = commit_metapacks; p; p = p->next) {
> + unsigned char *data;
> + int pos = sha1_entry_pos(p->index, 20, 0, 0, p->nr, p->nr, 
> sha1);

This is a tangent, but isn't it about time to rip out the check for
GIT_USE_LOOKUP in find_pack_entry_one(), I wonder.

> + prepare_commit_metapacks();
> + for (p = commit_metapacks; p; p = p->next) {
> + unsigned char *data;
> + int pos = sha1_entry_pos(p->index, 20, 0, 0, p->nr, p->nr, 
> sha1);
> + if (pos < 0)
> + continue;
> +
> + /* timestamp(4) + tree(20) + parents(40) */
> + data = p->data + 64 * pos;
> + *timestamp = *(uint32_t *)data;
> + *timestamp = ntohl(*timestamp);
> + data += 4;
> + *tree = data;
> + data += 20;
> + *parent1 = data;
> + data += 20;
> + *parent2 = data;
> +
> + return 0;
> + }
> +
> + return -1;
> +}

I am torn on this one.

These cached properties of a single commit will not change no matter
which pack it appears in, and it feels logically wrong, especially
when you record these object names in the full SHA-1 form, to tie a
"commit metapack" to a pack.  Logically there needs only one commit
metapack that describes all the commits known to the repository when
the metapack was created.

In order to reduce the disk footprint and I/O cost, the future
direction for this mechanism may want to point into an existing
store of SHA-1 hashes with a shorter file offset, and the .idx file
could be such a store, and in order to move in that direction, you
cannot avoid tying a metapack to a pack.

> +static void get_commits(struct metapack_writer *mw,
> + const unsigned char *sha1,
> + void *data)
> +{
> + struct commit_list ***tail = data;
> + enum object_type type = sha1_object_info(sha1, NULL);
> + struct commit *c;
> +
> + if (type != OBJ_COMMIT)
> + return;
> +
> + c = lookup_commit(sha1);
> + if (!c || parse_commit(c))
> + die("unable to read commit %s", sha1_to_hex(sha1));
> +
> + /*
> +  * Our fixed-size parent list cannot represent root commits, nor
> +  * octopus merges. Just skip those commits, as we can fallback
> +  * in those rare cases to reading the actual commit object.
> +  */
> + if (!c->parents ||
> + (c->parents && c->parents->next && c->parents->next->next))
> + return;
> +
> + *tail = &commit_list_insert(c, *tail)->next;
> +}

It feels somewhat wasteful to:

 - use commit_list for this, rather than an array of commit
   objects.  If you have a rough estimate of the number of commits
   in the pack, you could just preallocate a single array and use
   ALLOC_GROW() on it, no?

 - iterate over the .idx file and run sha1_object_info() and
   parse_commit() on many objects in the SHA-1 order.  Iterating in
   the way builtin/pack-objects.c::get_object_details() does avoids
   jumping around in existing packfiles, which may be more
   efficient, no?

> +void commit_metapack_write(const char *idx)
> +{
> + struct metapack_writer mw;
> + struct commit_list *commits = NULL, *p;
> + struct commit_list **tail = &commits;
> + uint32_t nr = 0;
> +
> + metapack_writer_init(&mw, idx, "commits", 1);
> +
> + /* Figure out how many eligible commits we've got in this pack. */
> + metapack_writer_foreach(&mw, get_commits, &tail);
> + for (p = commits; p; p = p->next)
> + nr++;

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/6] introduce a commit metapack

2013-01-29 Thread Jeff King

On Tue, Jan 29, 2013 at 11:24:45AM +0100, Michael Haggerty wrote:

> On 01/29/2013 10:16 AM, Jeff King wrote:
> > When we are doing a commit traversal that does not need to
> > look at the commit messages themselves (e.g., rev-list,
> > merge-base, etc), we spend a lot of time accessing,
> > decompressing, and parsing the commit objects just to find
> > the parent and timestamp information. We can make a
> > space-time tradeoff by caching that information on disk in a
> > compact, uncompressed format.
> > 
> > TODO: document on-disk format in Documentation/technical
> > TODO: document API
> 
> Would this be a good place to add the commit generation number that is
> so enthusiastically discussed on the mailing list from time to time?

Yes, that is one of my goals. We may even be able to just replace the
timestamp field in the cache with a generation number. When it gets
pretty-printed we pull it out of the commit message again anyway, so in
theory the only use inside "struct commit" is for ordering. But I
haven't looked at all of the use sites yet to be sure nobody is
depending on it being an actual date stamp.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/6] introduce a commit metapack

2013-01-29 Thread Michael Haggerty

On 01/29/2013 10:16 AM, Jeff King wrote:
> When we are doing a commit traversal that does not need to
> look at the commit messages themselves (e.g., rev-list,
> merge-base, etc), we spend a lot of time accessing,
> decompressing, and parsing the commit objects just to find
> the parent and timestamp information. We can make a
> space-time tradeoff by caching that information on disk in a
> compact, uncompressed format.
> 
> TODO: document on-disk format in Documentation/technical
> TODO: document API

Would this be a good place to add the commit generation number that is
so enthusiastically discussed on the mailing list from time to time?

Michael

-- 
Michael Haggerty
mhag...@alum.mit.edu
http://softwareswirl.blogspot.com/
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 4/6] introduce a commit metapack

2013-01-29 Thread Jeff King

When we are doing a commit traversal that does not need to
look at the commit messages themselves (e.g., rev-list,
merge-base, etc), we spend a lot of time accessing,
decompressing, and parsing the commit objects just to find
the parent and timestamp information. We can make a
space-time tradeoff by caching that information on disk in a
compact, uncompressed format.

TODO: document on-disk format in Documentation/technical
TODO: document API

Signed-off-by: Jeff King 
---
 Makefile  |   2 +
 commit-metapack.c | 175 ++
 commit-metapack.h |  12 
 3 files changed, 189 insertions(+)
 create mode 100644 commit-metapack.c
 create mode 100644 commit-metapack.h

diff --git a/Makefile b/Makefile
index 3e4ae1b..6ca5320 100644
--- a/Makefile
+++ b/Makefile
@@ -619,6 +619,7 @@ LIB_H += column.h
 LIB_H += cache.h
 LIB_H += color.h
 LIB_H += column.h
+LIB_H += commit-metapack.h
 LIB_H += commit.h
 LIB_H += compat/bswap.h
 LIB_H += compat/cygwin.h
@@ -730,6 +731,7 @@ LIB_OBJS += combine-diff.o
 LIB_OBJS += color.o
 LIB_OBJS += column.o
 LIB_OBJS += combine-diff.o
+LIB_OBJS += commit-metapack.o
 LIB_OBJS += commit.o
 LIB_OBJS += compat/obstack.o
 LIB_OBJS += compat/terminal.o
diff --git a/commit-metapack.c b/commit-metapack.c
new file mode 100644
index 000..2c19f48
--- /dev/null
+++ b/commit-metapack.c
@@ -0,0 +1,175 @@
+#include "cache.h"
+#include "commit-metapack.h"
+#include "metapack.h"
+#include "commit.h"
+#include "sha1-lookup.h"
+
+struct commit_metapack {
+   struct metapack mp;
+   uint32_t nr;
+   unsigned char *index;
+   unsigned char *data;
+   struct commit_metapack *next;
+};
+static struct commit_metapack *commit_metapacks;
+
+static struct commit_metapack *alloc_commit_metapack(struct packed_git *pack)
+{
+   struct commit_metapack *it = xcalloc(1, sizeof(*it));
+   uint32_t version;
+
+   if (metapack_init(&it->mp, pack, "commits", &version) < 0) {
+   free(it);
+   return NULL;
+   }
+   if (version != 1) {
+   /*
+* This file comes from a more recent git version. Don't bother
+* warning the user, as we'll just fallback to reading the
+* commits.
+*/
+   metapack_close(&it->mp);
+   free(it);
+   return NULL;
+   }
+
+   if (it->mp.len < 4) {
+   warning("commit metapack for '%s' is truncated", 
pack->pack_name);
+   metapack_close(&it->mp);
+   free(it);
+   return NULL;
+   }
+   memcpy(&it->nr, it->mp.data, 4);
+   it->nr = ntohl(it->nr);
+
+   /*
+* We need 84 bytes for each entry: sha1(20), date(4), tree(20),
+* parents(40).
+*/
+   if (it->mp.len < (84 * it->nr + 4)) {
+   warning("commit metapack for '%s' is truncated", 
pack->pack_name);
+   metapack_close(&it->mp);
+   free(it);
+   return NULL;
+   }
+
+   it->index = it->mp.data + 4;
+   it->data = it->index + 20 * it->nr;
+
+   return it;
+}
+
+static void prepare_commit_metapacks(void)
+{
+   static int initialized;
+   struct commit_metapack **tail = &commit_metapacks;
+   struct packed_git *p;
+
+   if (initialized)
+   return;
+
+   prepare_packed_git();
+   for (p = packed_git; p; p = p->next) {
+   struct commit_metapack *it = alloc_commit_metapack(p);
+
+   if (it) {
+   *tail = it;
+   tail = &it->next;
+   }
+   }
+
+   initialized = 1;
+}
+
+int commit_metapack(unsigned char *sha1,
+   uint32_t *timestamp,
+   unsigned char **tree,
+   unsigned char **parent1,
+   unsigned char **parent2)
+{
+   struct commit_metapack *p;
+
+   prepare_commit_metapacks();
+   for (p = commit_metapacks; p; p = p->next) {
+   unsigned char *data;
+   int pos = sha1_entry_pos(p->index, 20, 0, 0, p->nr, p->nr, 
sha1);
+   if (pos < 0)
+   continue;
+
+   /* timestamp(4) + tree(20) + parents(40) */
+   data = p->data + 64 * pos;
+   *timestamp = *(uint32_t *)data;
+   *timestamp = ntohl(*timestamp);
+   data += 4;
+   *tree = data;
+   data += 20;
+   *parent1 = data;
+   data += 20;
+   *parent2 = data;
+
+   return 0;
+   }
+
+   return -1;
+}
+
+static void get_commits(struct metapack_writer *mw,
+   const unsigned char *sha1,
+   void *data)
+{
+   struct commit_list ***tail = data;
+   enum object_type type = sha1_object_info(sha1, NULL);
+   struct commit *c;
+
+   if (type != OBJ_COMMIT)
+   r

Re: [PATCH 4/6] introduce a commit metapack

Re: [PATCH 4/6] introduce a commit metapack

Re: [PATCH 4/6] introduce a commit metapack

Re: [PATCH 4/6] introduce a commit metapack

Re: [PATCH 4/6] introduce a commit metapack

Re: [PATCH 4/6] introduce a commit metapack

Re: [PATCH 4/6] introduce a commit metapack

Re: [PATCH 4/6] introduce a commit metapack

Re: [PATCH 4/6] introduce a commit metapack

Re: [PATCH 4/6] introduce a commit metapack

Re: [PATCH 4/6] introduce a commit metapack

Re: [PATCH 4/6] introduce a commit metapack

Re: [PATCH 4/6] introduce a commit metapack

Re: [PATCH 4/6] introduce a commit metapack

Re: [PATCH 4/6] introduce a commit metapack

Re: [PATCH 4/6] introduce a commit metapack

Re: [PATCH 4/6] introduce a commit metapack

Re: [PATCH 4/6] introduce a commit metapack

Re: [PATCH 4/6] introduce a commit metapack

Re: [PATCH 4/6] introduce a commit metapack

Re: [PATCH 4/6] introduce a commit metapack

Re: [PATCH 4/6] introduce a commit metapack

Re: [PATCH 4/6] introduce a commit metapack

[PATCH 4/6] introduce a commit metapack

24 matches

Site Navigation

Mail list logo

Footer information