Re: [PATCH 4/6] introduce a commit metapack

2013-03-18 Thread Jeff King
On Sun, Mar 17, 2013 at 08:21:13PM +0700, Nguyen Thai Ngoc Duy wrote:

 On Thu, Jan 31, 2013 at 6:06 PM, Duy Nguyen pclo...@gmail.com wrote:
  On Wed, Jan 30, 2013 at 09:16:29PM +0700, Duy Nguyen wrote:
  Perhaps we could store abbrev sha-1 instead of full sha-1. Nice
  space/time trade-off.
 
  Following the on-disk format experiment yesterday, I changed the
  format to:
 
   - a list a _short_ SHA-1 of cached commits
  ..
 
  The length of SHA-1 is chosen to be able to unambiguously identify any
  cached commits. Full SHA-1 check is done after to catch false
  positives. For linux-2.6, SHA-1 length is 6 bytes, git and many
  moderate-sized projects are 4 bytes.
 
 And if we are going to create index v3, the same trick could be used
 for the sha-1 table in the index. We use the short sha-1 table for
 binary search and put the rest of sha-1 in a following table (just
 like file offset table). The advantage is a denser search space, about
 1/4-1/3 the size of full sha-1 table.

You can make it even smaller at some (potential) run-time cost.

Keep in mind you are just repeating information that is in the full sha1
list in the index. So you could store a fixed-size offset into that list
(e.g., 32-bit), and then instead of comparing sha1s during a binary
search, you would dereference the offset to the real sha1s and compare
those.

The run-time cost is not any worse in a big-O sense, but your cache
locality is much worse (you hit a second random page for each sha1
comparison), which might be noticeable. You'd have to benchmark to see
how big an impact.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/6] introduce a commit metapack

2013-03-17 Thread Duy Nguyen
On Thu, Jan 31, 2013 at 6:06 PM, Duy Nguyen pclo...@gmail.com wrote:
 On Wed, Jan 30, 2013 at 09:16:29PM +0700, Duy Nguyen wrote:
 Perhaps we could store abbrev sha-1 instead of full sha-1. Nice
 space/time trade-off.

 Following the on-disk format experiment yesterday, I changed the
 format to:

  - a list a _short_ SHA-1 of cached commits
 ..

 The length of SHA-1 is chosen to be able to unambiguously identify any
 cached commits. Full SHA-1 check is done after to catch false
 positives. For linux-2.6, SHA-1 length is 6 bytes, git and many
 moderate-sized projects are 4 bytes.

And if we are going to create index v3, the same trick could be used
for the sha-1 table in the index. We use the short sha-1 table for
binary search and put the rest of sha-1 in a following table (just
like file offset table). The advantage is a denser search space, about
1/4-1/3 the size of full sha-1 table.
-- 
Duy
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/6] introduce a commit metapack

2013-02-02 Thread Duy Nguyen
On Fri, Feb 1, 2013 at 5:15 PM, Jeff King p...@peff.net wrote:
 The short-sha1 is a clever idea. Looks like it saves us on the order of
 4MB for linux-2.6 (versus the full 20-byte sha1). Not as big as the
 savings we get from dropping the other 3 sha1's to uint32_t, but still
 not bad.

We could save another 4 bytes per commit by using 3 bytes for storing
.idx offsets. linux-2.6 only has 3M objects. It'll take many years for
big projects to reach 16M objects and need the fourth byte in
uint32_t.


 I guess the next steps in iterating on this would be:

   1. splitting out the refactoring here into separate patches

   2. squashing the cleaned-up bits into my patch 4/6

   3. deciding whether this should go into a separate file or as part of
  index v3. Your offsets depend on the .idx file having a sorted sha1
  list. That is not likely to change, but it would still be nice to
  make sure they cannot get out of sync. I'm still curious what the
  performance impact is for mmap-ing N versus N+8MB.

4. Print some cache statistics in count-objects -v


 The length of SHA-1 is chosen to be able to unambiguously identify any
 cached commits. Full SHA-1 check is done after to catch false
 positives.

 Just to be clear, these false positives come because the abbreviation is
 unambiguous within the packfile, but we might be looking for a commit
 that is not even in our pack, right?

It may even be ambiguous within the pack, say an octopus (i.e. not
cached) commit that shares the same sha-1 prefix with one of the
cached commits.
-- 
Duy
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/6] introduce a commit metapack

2013-02-02 Thread Junio C Hamano
Jeff King p...@peff.net writes:

 On Thu, Jan 31, 2013 at 09:03:26AM -0800, Shawn O. Pearce wrote:
 ...
 If we are going to change the index to support extension sections and
 I have to modify JGit to grok this new format, it needs to be index v3
 not index v2. If we are making index v3 we should just put index v3 on
 the end of the pack file.

 I'm not sure what you mean by your last sentence here.

I am not Shawn, but here is a summary of what I think I discussed
with him in person, lest I forget.

You could imagine that a new pack system (from pack-objects,
index-pack down to read_packed_sha1() call) that works with a
packfile that

 * is a single file, whose name is pack-$SHA1.$sfx (where $sfx
   is something other than 'pack', perhaps);

 * has the pack data stream, including the concluding checksum of
   the stream contents at the end, at the beginning of the file; and

 * has the index v3 data blob appended to the pack data stream.

The pack data is streamed over the wire exactly the same way,
interoperating with existing software.  When receiving, the new
index-pack can read such a pack stream and add index at the end.
When re-indexing an existing pack (think: upgrading existing
packfiles from the current system), the index-pack can read from the
packfile and do what it does currently (notably, it knows where the
pack stream ends and can stop reading at that point, so even if you
feed the new pack to it, it will stop at the end of the pack data,
ignoring the index v3 already at the end of the input).

One potential advantage of using a single file, instead of the
primary .pack file with 3 (or 47) auxiliary files, is that it lets
you repack without having to deal with this sequence, which happens
currently when you repack:

 * create a new .pack file and the corresponding auxiliary files
   under temporary filename;

 * move existing pack files that describe the same set of objects
   away;

 * rename these new files, one at a time, to their final name,
   making sure that you rename .idx the last, because that happens
   to be the key to the pack aware programs.

Instead you can rename only one thing (the new one) to the final
name (possibly atomically replacing the existing one).  With the
current system, when you need to replace a pack with a new pack with
the same packname (e.g. you repack everything with a better pack
parameter in a repository that has everything packed into one),
there is a very small window other concurrent users will not find
the object data between the time when you rename the old ones away
and the time when you move the new ones in.  The hairly logic
between Ok we have prepared all new packfiles and End of pack
replacement can be done with a single rename(2) of the new packfile
(which contains everything) to the final name, which atomically
replaces the old one.

This will become even safer if we picked $SHA1 (the name of the
packfile) to represent the hash of the whole thing, not the hash of
the sorted object names in the pack, as that will let us know there
is no need to even replace the files.



--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/6] introduce a commit metapack

2013-02-01 Thread Jeff King
On Tue, Jan 29, 2013 at 11:17:41PM -0800, Junio C Hamano wrote:

  True, but it is even less headache if the file is totally separate and
  optional.
 
 Once you start thinking about using an offset to some list of SHA-1,
 perhaps?  A section inside the same file can never go out of sync.

Yes, having a data dependency is important. It is unavoidable to have a
dependency on the packfile, though (and that is why the index and the
metapacks embed the sha1 of the packfile). If the offsets used are
packfile offsets, then that is sufficient.

If the offsets are from the index, then yes, putting it in the same file
is one way to keep them tied together. Another way is to do the same
sha1 verification, except to embed the sha1 of the index in the
metapack.

So I certainly consider putting the dependency-handling to be a point in
favor of the same file, but I'd weight it against other points (headache
of bumping index version, performance of both types, etc).

 Also a longer-term advantage is that you can teach index-pack to do
 this.

I think that is roughly the same amount of difficulty to do whether they
are in the same file or not.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/6] introduce a commit metapack

2013-02-01 Thread Jeff King
On Thu, Jan 31, 2013 at 09:03:26AM -0800, Shawn O. Pearce wrote:

  Of course, it is more convenient to store this kind of things in a
  separate file while experimenting and improving the mechanism, but I
  do not think we want to see each packfile in a repository comes with
  47 auxiliary files with different suffixes 5 years down the road.
 
 Agggh.
 
 Right now we are in the middle of refactoring the JGit reachability
 bitmap implementation to store it into a separate file and get it out
 of the .idx file. This work is nearly completed. So this thread is
 great timing. :-)
 
 I think Junio is right about not wanting 47 different tiny auxiliary
 files for a single pack. We are unlikely to create that many, but
 right now there are proposals floating around for at least 2 new
 auxiliary files (commit cache and reachability bitmaps). So its not
 impossible that another will be discovered in the future.

Why don't we want 47 files (or even 3)? If it makes the implementation
more flexible or simple without sacrificing performance, I don't see a
problem. The filesystem is there to organize our data; we do not cram
all of our files into one just to save a few inodes.

We _do_ cram our data into packfiles and packed-refs when we can save
O(n) inodes. But if we are talking about a handful of extra files that
we must readdir() over while preparing the objects/pack directory, I
don't think that is the same thing.

The data dependency issues (can the files get out of sync? How common is
it? Can we detect it?) and performance issues are more interesting to
me. With respect to the latter, here's specifically what I'm thinking
of. Let's imagine we have an extension for reachability bitmaps that
takes up an extra 10MB. When we mmap the .idx file, do we always mmap
the extra bytes? What's the performance impact of the extra mmap? I
recall on some non-Linux systems it can be quite expensive. For most git
commands which are not going to do a reachability analysis, this is a
loss.

I don't know if this is an issue big enough to care about or not. But I
think it's worth benchmarking and including in the decision.

 Junio may be right about the hole in the index file for git-core. I
 haven't checked the JGit implementation, but I suspect it does not
 have this hole. IIRC JGit consumes the index sections and then expects
 the pack trailer SHA-1 to be present immediately after the last table.
 This happens because JGit doesn't use mmap() to load the index, it
 streams the file into memory and does some reformatting on the tables
 to make search faster.

Yeah, looking at the PackIndexV2 implementation, it counts the 64-bit
offsets from the first table, calculates the size of the large offset
table, reads past it, and then expects the pack checksum. So let's
assume we would have to bump to v3 to implement it inside the .idx file.

 If we are going to change the index to support extension sections and
 I have to modify JGit to grok this new format, it needs to be index v3
 not index v2. If we are making index v3 we should just put index v3 on
 the end of the pack file.

I'm not sure what you mean by your last sentence here.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/6] introduce a commit metapack

2013-02-01 Thread Jeff King
On Wed, Jan 30, 2013 at 08:56:07PM +0700, Nguyen Thai Ngoc Duy wrote:

 Another point, but not really important at this stage, I think we have
 memory leak somewhere (lookup_commit??). It used up to 800 MB RES on
 linux-2.6.git while generating the cache.

We generate (and then leak!) the linked list in commit_metapack_write.
That may be the culprit.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/6] introduce a commit metapack

2013-02-01 Thread Jeff King
On Thu, Jan 31, 2013 at 06:06:56PM +0700, Nguyen Thai Ngoc Duy wrote:

 On Wed, Jan 30, 2013 at 09:16:29PM +0700, Duy Nguyen wrote:
  Perhaps we could store abbrev sha-1 instead of full sha-1. Nice
  space/time trade-off.
 
 Following the on-disk format experiment yesterday, I changed the
 format to:
 
  - a list a _short_ SHA-1 of cached commits
  - a list of cache entries, each (5 uint32_t) consists of:
- uint32_t for the index in .idx sha-1 table to get full SHA-1 of
  the commit
- uint32_t for timestamp
- uint32_t for tree, 1st and 2nd parents for the index in .idx
  table

Thanks for working on this, as it was the next step I was going to take. :)

The short-sha1 is a clever idea. Looks like it saves us on the order of
4MB for linux-2.6 (versus the full 20-byte sha1). Not as big as the
savings we get from dropping the other 3 sha1's to uint32_t, but still
not bad.

I guess the next steps in iterating on this would be:

  1. splitting out the refactoring here into separate patches

  2. squashing the cleaned-up bits into my patch 4/6

  3. deciding whether this should go into a separate file or as part of
 index v3. Your offsets depend on the .idx file having a sorted sha1
 list. That is not likely to change, but it would still be nice to
 make sure they cannot get out of sync. I'm still curious what the
 performance impact is for mmap-ing N versus N+8MB.

 The length of SHA-1 is chosen to be able to unambiguously identify any
 cached commits. Full SHA-1 check is done after to catch false
 positives.

Just to be clear, these false positives come because the abbreviation is
unambiguous within the packfile, but we might be looking for a commit
that is not even in our pack, right?

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/6] introduce a commit metapack

2013-02-01 Thread Jeff King
On Thu, Jan 31, 2013 at 06:06:56PM +0700, Nguyen Thai Ngoc Duy wrote:

 On Wed, Jan 30, 2013 at 09:16:29PM +0700, Duy Nguyen wrote:
  Perhaps we could store abbrev sha-1 instead of full sha-1. Nice
  space/time trade-off.
 
 Following the on-disk format experiment yesterday, I changed the
 format to:
 
  - a list a _short_ SHA-1 of cached commits
  - a list of cache entries, each (5 uint32_t) consists of:
- uint32_t for the index in .idx sha-1 table to get full SHA-1 of
  the commit
- uint32_t for timestamp
- uint32_t for tree, 1st and 2nd parents for the index in .idx
  table

BTW, I needed the minor fixups below to silence some warnings from your
patch. Here are the cold and warm cache timings I got, as compared to
stock git and my implementation:

 Pack  |  Cold Revs   |  Warm Revs
---+--+
 stock | 12.54| 4.14
me |  4.76 (-62%) | 0.66 (-84%)
   duy |  4.36 (-65%) | 0.55 (-86%)

Not surprising; yours is just a bit faster in terms of CPU, and even
gains a little more in the cold cache case. Nice. Of course that is just
gravy on top of the smaller disk usage, too. :)

---
diff --git a/commit-metapack.c b/commit-metapack.c
index c984b8e..78fd961 100644
--- a/commit-metapack.c
+++ b/commit-metapack.c
@@ -106,7 +106,7 @@ int commit_metapack(unsigned char *sha1,
for (p = commit_metapacks; p; p = p-next) {
struct commit_entry *data;
uint32_t p1, p2;
-   unsigned lo, hi, mi;
+   unsigned lo, hi;
int pos;
 
/* sha1_entry_pos does not work with abbreviated sha-1 */
@@ -161,7 +161,7 @@ static void get_commits(struct metapack_writer *mw,
struct write_cb *write_cb = (struct write_cb *)data;
enum object_type type = sha1_object_info(sha1, NULL);
struct commit *c;
-   int p1, p2;
+   int p1 = -1, p2 = -1;
 
if (type != OBJ_COMMIT)
return;
diff --git a/commit.c b/commit.c
index b326201..5b776f8 100644
--- a/commit.c
+++ b/commit.c
@@ -309,7 +309,7 @@ static int parse_commit_metapack(struct commit *item)
 
 static int parse_commit_metapack(struct commit *item)
 {
-   unsigned char *tree, *p1, *p2;
+   const unsigned char *tree, *p1, *p2;
uint32_t ts;
 
if (commit_metapack(item-object.sha1, ts, tree, p1, p2)  0)
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/6] introduce a commit metapack

2013-01-31 Thread Duy Nguyen
On Wed, Jan 30, 2013 at 09:16:29PM +0700, Duy Nguyen wrote:
 Perhaps we could store abbrev sha-1 instead of full sha-1. Nice
 space/time trade-off.

Following the on-disk format experiment yesterday, I changed the
format to:

 - a list a _short_ SHA-1 of cached commits
 - a list of cache entries, each (5 uint32_t) consists of:
   - uint32_t for the index in .idx sha-1 table to get full SHA-1 of
 the commit
   - uint32_t for timestamp
   - uint32_t for tree, 1st and 2nd parents for the index in .idx
 table

The length of SHA-1 is chosen to be able to unambiguously identify any
cached commits. Full SHA-1 check is done after to catch false
positives. For linux-2.6, SHA-1 length is 6 bytes, git and many
moderate-sized projects are 4 bytes. So it's 26 bytes per commit for
linux-2.6.git, or 8MB .commits file. Not as good as revindex approach
(5.5MB) but way better than the current one (27MB). And still good
enough for caching 2 parents unconditionally.

Performance seems to improve a tiny bit, probably because of more
compact search space: 0.6s with my patch vs 0.7s without (vs 3.4s
without cache).

-- 8 --
diff --git a/cache.h b/cache.h
index 1f96f65..8048d5b 100644
--- a/cache.h
+++ b/cache.h
@@ -1069,6 +1069,7 @@ extern struct packed_git *add_packed_git(const char *, 
int, int);
 extern const unsigned char *nth_packed_object_sha1(struct packed_git *, 
uint32_t);
 extern off_t nth_packed_object_offset(const struct packed_git *, uint32_t);
 extern off_t find_pack_entry_one(const unsigned char *, struct packed_git *);
+extern int find_pack_entry_pos(const unsigned char *, struct packed_git *);
 extern int is_pack_valid(struct packed_git *);
 extern void *unpack_entry(struct packed_git *, off_t, enum object_type *, 
unsigned long *);
 extern unsigned long unpack_object_header_buffer(const unsigned char *buf, 
unsigned long len, enum object_type *type, unsigned long *sizep);
diff --git a/commit-metapack.c b/commit-metapack.c
index 2c19f48..c984b8e 100644
--- a/commit-metapack.c
+++ b/commit-metapack.c
@@ -4,11 +4,21 @@
 #include commit.h
 #include sha1-lookup.h
 
+struct commit_entry {
+   uint32_t commit;/* nth_packed_object_sha1 to get own SHA-1 */
+   uint32_t timestamp;
+   uint32_t tree;  /* nth_packed_object_sha1 to get tree SHA-1 */
+   uint32_t parent1; /* nth_packed_object_sha1 to get 1st parent SHA-1 */
+   uint32_t parent2; /* nth_packed_object_sha1 to get 2nd parent SHA-1 */
+};
+
 struct commit_metapack {
struct metapack mp;
uint32_t nr;
+   uint32_t abbrev_len;
+   struct packed_git *pack;
unsigned char *index;
-   unsigned char *data;
+   struct commit_entry *data;
struct commit_metapack *next;
 };
 static struct commit_metapack *commit_metapacks;
@@ -41,20 +51,23 @@ static struct commit_metapack *alloc_commit_metapack(struct 
packed_git *pack)
}
memcpy(it-nr, it-mp.data, 4);
it-nr = ntohl(it-nr);
+   memcpy(it-abbrev_len, it-mp.data + 4, 4);
+   it-abbrev_len = ntohl(it-abbrev_len);
+   it-pack = pack;
 
/*
-* We need 84 bytes for each entry: sha1(20), date(4), tree(20),
-* parents(40).
+* We need 20+abbrev_len bytes for each entry: abbrev sha-1,
+* date(4), tree index(4), parent indexes(8).
 */
-   if (it-mp.len  (84 * it-nr + 4)) {
+   if (it-mp.len  ((sizeof(*it-data) + it-abbrev_len) * it-nr + 8)) {
warning(commit metapack for '%s' is truncated, 
pack-pack_name);
metapack_close(it-mp);
free(it);
return NULL;
}
 
-   it-index = it-mp.data + 4;
-   it-data = it-index + 20 * it-nr;
+   it-index = it-mp.data + 8;
+   it-data = (struct commit_entry*)(it-index + it-abbrev_len * it-nr);
 
return it;
 }
@@ -83,29 +96,51 @@ static void prepare_commit_metapacks(void)
 
 int commit_metapack(unsigned char *sha1,
uint32_t *timestamp,
-   unsigned char **tree,
-   unsigned char **parent1,
-   unsigned char **parent2)
+   const unsigned char **tree,
+   const unsigned char **parent1,
+   const unsigned char **parent2)
 {
struct commit_metapack *p;
 
prepare_commit_metapacks();
for (p = commit_metapacks; p; p = p-next) {
-   unsigned char *data;
-   int pos = sha1_entry_pos(p-index, 20, 0, 0, p-nr, p-nr, 
sha1);
+   struct commit_entry *data;
+   uint32_t p1, p2;
+   unsigned lo, hi, mi;
+   int pos;
+
+   /* sha1_entry_pos does not work with abbreviated sha-1 */
+   lo = 0;
+   hi = p-nr;
+   pos = -1;
+   do {
+   unsigned mi = (lo + hi) / 2;
+   int cmp = memcmp(p-index + mi * p-abbrev_len, sha1, 
p-abbrev_len);
+
+   

Re: [PATCH 4/6] introduce a commit metapack

2013-01-31 Thread Shawn Pearce
On Wed, Jan 30, 2013 at 7:56 AM, Junio C Hamano gits...@pobox.com wrote:
 Jeff King p...@peff.net writes:

From this:

 Then it will be very natural for the extension data that store the
 commit metainfo to name objects in the pack the .idx file describes
 by the offset in the SHA-1 table.

 I guess your argument is that putting it all in the same file makes it
 more natural for there to be a data dependency.

 It is more about the I am torn on this one I mentioned earlier.

 It would be more logical if this weren't tied to a particular
 pack, as the properties of a commit you record in this series do not
 depend on which pack the commit is in, and such a repository-global
 file by definition cannot be inside anybody's .idx.

 But if we split the information into separate pieces and store one
 piece per .idx for implementation reasons, it is crazy not to at
 least consider it a longer term goal to put it inside .idx file.

 Of course, it is more convenient to store this kind of things in a
 separate file while experimenting and improving the mechanism, but I
 do not think we want to see each packfile in a repository comes with
 47 auxiliary files with different suffixes 5 years down the road.

Agggh.

Right now we are in the middle of refactoring the JGit reachability
bitmap implementation to store it into a separate file and get it out
of the .idx file. This work is nearly completed. So this thread is
great timing. :-)

I think Junio is right about not wanting 47 different tiny auxiliary
files for a single pack. We are unlikely to create that many, but
right now there are proposals floating around for at least 2 new
auxiliary files (commit cache and reachability bitmaps). So its not
impossible that another will be discovered in the future.

Junio may be right about the hole in the index file for git-core. I
haven't checked the JGit implementation, but I suspect it does not
have this hole. IIRC JGit consumes the index sections and then expects
the pack trailer SHA-1 to be present immediately after the last table.
This happens because JGit doesn't use mmap() to load the index, it
streams the file into memory and does some reformatting on the tables
to make search faster.

If we are going to change the index to support extension sections and
I have to modify JGit to grok this new format, it needs to be index v3
not index v2. If we are making index v3 we should just put index v3 on
the end of the pack file.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/6] introduce a commit metapack

2013-01-30 Thread Duy Nguyen
On Tue, Jan 29, 2013 at 04:16:11AM -0500, Jeff King wrote:
 When we are doing a commit traversal that does not need to
 look at the commit messages themselves (e.g., rev-list,
 merge-base, etc), we spend a lot of time accessing,
 decompressing, and parsing the commit objects just to find
 the parent and timestamp information. We can make a
 space-time tradeoff by caching that information on disk in a
 compact, uncompressed format.

And this is a (messy) patch on top that avoids storing SHA-1
directly. On my linux-2.6.git (575 MB pack, 73 MB index), .commits
file is 5.2 MB and 27 MB with and without my patch respectively. Nice
shrinkage.

However, performance seems to suffer too. Maybe I do more lookups than
necessary, I don't know. I should probably measure the cost of
revindex separately.

git rev-list --all --quiet on vanilla git:

real0m3.645s
user0m3.556s
sys 0m0.080s

commit cache without my patch:

real0m0.723s
user0m0.677s
sys 0m0.045s

and with my patch:

real0m1.338s
user0m1.259s
sys 0m0.075s


Another point, but not really important at this stage, I think we have
memory leak somewhere (lookup_commit??). It used up to 800 MB RES on
linux-2.6.git while generating the cache.

-- 8 --
diff --git a/cache.h b/cache.h
index 1f96f65..8048d5b 100644
--- a/cache.h
+++ b/cache.h
@@ -1069,6 +1069,7 @@ extern struct packed_git *add_packed_git(const char *, 
int, int);
 extern const unsigned char *nth_packed_object_sha1(struct packed_git *, 
uint32_t);
 extern off_t nth_packed_object_offset(const struct packed_git *, uint32_t);
 extern off_t find_pack_entry_one(const unsigned char *, struct packed_git *);
+extern int find_pack_entry_pos(const unsigned char *, struct packed_git *);
 extern int is_pack_valid(struct packed_git *);
 extern void *unpack_entry(struct packed_git *, off_t, enum object_type *, 
unsigned long *);
 extern unsigned long unpack_object_header_buffer(const unsigned char *buf, 
unsigned long len, enum object_type *type, unsigned long *sizep);
diff --git a/commit-metapack.c b/commit-metapack.c
index 2c19f48..55f7ea9 100644
--- a/commit-metapack.c
+++ b/commit-metapack.c
@@ -3,11 +3,12 @@
 #include metapack.h
 #include commit.h
 #include sha1-lookup.h
+#include pack-revindex.h
 
 struct commit_metapack {
struct metapack mp;
-   uint32_t nr;
-   unsigned char *index;
+   struct packed_git *pack;
+   uint32_t first, last;
unsigned char *data;
struct commit_metapack *next;
 };
@@ -16,7 +17,7 @@ static struct commit_metapack *commit_metapacks;
 static struct commit_metapack *alloc_commit_metapack(struct packed_git *pack)
 {
struct commit_metapack *it = xcalloc(1, sizeof(*it));
-   uint32_t version;
+   uint32_t version, nr;
 
if (metapack_init(it-mp, pack, commits, version)  0) {
free(it);
@@ -39,22 +40,25 @@ static struct commit_metapack *alloc_commit_metapack(struct 
packed_git *pack)
free(it);
return NULL;
}
-   memcpy(it-nr, it-mp.data, 4);
-   it-nr = ntohl(it-nr);
+   memcpy(it-first, it-mp.data, 4);
+   it-first = ntohl(it-first);
+   memcpy(it-last, it-mp.data + 4, 4);
+   it-last = ntohl(it-last);
+   nr = it-last - it-first + 1;
+   it-pack = pack;
 
/*
-* We need 84 bytes for each entry: sha1(20), date(4), tree(20),
-* parents(40).
+* We need 16 bytes for each entry: date(4), tree index(4),
+* parent indexes(8).
 */
-   if (it-mp.len  (84 * it-nr + 4)) {
+   if (it-mp.len  (16 * nr + 8)) {
warning(commit metapack for '%s' is truncated, 
pack-pack_name);
metapack_close(it-mp);
free(it);
return NULL;
}
 
-   it-index = it-mp.data + 4;
-   it-data = it-index + 20 * it-nr;
+   it-data = it-mp.data + 8;
 
return it;
 }
@@ -81,31 +85,61 @@ static void prepare_commit_metapacks(void)
initialized = 1;
 }
 
+static const unsigned char *idx_to_sha1(struct packed_git *p,
+   uint32_t nth)
+{
+   struct revindex_entry *revindex = get_revindex(p);
+   if (!revindex)
+   return NULL;
+   return nth_packed_object_sha1(p, revindex[nth].nr);
+}
+
 int commit_metapack(unsigned char *sha1,
uint32_t *timestamp,
-   unsigned char **tree,
-   unsigned char **parent1,
-   unsigned char **parent2)
+   const unsigned char **tree,
+   const unsigned char **parent1,
+   const unsigned char **parent2)
 {
struct commit_metapack *p;
 
prepare_commit_metapacks();
for (p = commit_metapacks; p; p = p-next) {
unsigned char *data;
-   int pos = sha1_entry_pos(p-index, 20, 0, 0, p-nr, p-nr, 
sha1);
-   if (pos  0)
+   uint32_t p1, p2;
+   

Re: [PATCH 4/6] introduce a commit metapack

2013-01-30 Thread Duy Nguyen
On Wed, Jan 30, 2013 at 8:56 PM, Duy Nguyen pclo...@gmail.com wrote:
 However, performance seems to suffer too. Maybe I do more lookups than
 necessary, I don't know.

Yes, I should have stored the position in the sha-1 - offset map
instead of the position of the object in .pack file. Even so,
performance does not improve.

 I should probably measure the cost of revindex separately.

And the cost of create_pack_revindex() is 0.6 sec :-(

Perhaps we could store abbrev sha-1 instead of full sha-1. Nice
space/time trade-off.
-- 
Duy
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/6] introduce a commit metapack

2013-01-30 Thread Junio C Hamano
Jeff King p...@peff.net writes:

From this:

 Then it will be very natural for the extension data that store the
 commit metainfo to name objects in the pack the .idx file describes
 by the offset in the SHA-1 table.

 I guess your argument is that putting it all in the same file makes it
 more natural for there to be a data dependency.

It is more about the I am torn on this one I mentioned earlier.

It would be more logical if this weren't tied to a particular
pack, as the properties of a commit you record in this series do not
depend on which pack the commit is in, and such a repository-global
file by definition cannot be inside anybody's .idx.

But if we split the information into separate pieces and store one
piece per .idx for implementation reasons, it is crazy not to at
least consider it a longer term goal to put it inside .idx file.

Of course, it is more convenient to store this kind of things in a
separate file while experimenting and improving the mechanism, but I
do not think we want to see each packfile in a repository comes with
47 auxiliary files with different suffixes 5 years down the road.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/6] introduce a commit metapack

2013-01-29 Thread Michael Haggerty
On 01/29/2013 10:16 AM, Jeff King wrote:
 When we are doing a commit traversal that does not need to
 look at the commit messages themselves (e.g., rev-list,
 merge-base, etc), we spend a lot of time accessing,
 decompressing, and parsing the commit objects just to find
 the parent and timestamp information. We can make a
 space-time tradeoff by caching that information on disk in a
 compact, uncompressed format.
 
 TODO: document on-disk format in Documentation/technical
 TODO: document API

Would this be a good place to add the commit generation number that is
so enthusiastically discussed on the mailing list from time to time?

Michael

-- 
Michael Haggerty
mhag...@alum.mit.edu
http://softwareswirl.blogspot.com/
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/6] introduce a commit metapack

2013-01-29 Thread Jeff King
On Tue, Jan 29, 2013 at 11:24:45AM +0100, Michael Haggerty wrote:

 On 01/29/2013 10:16 AM, Jeff King wrote:
  When we are doing a commit traversal that does not need to
  look at the commit messages themselves (e.g., rev-list,
  merge-base, etc), we spend a lot of time accessing,
  decompressing, and parsing the commit objects just to find
  the parent and timestamp information. We can make a
  space-time tradeoff by caching that information on disk in a
  compact, uncompressed format.
  
  TODO: document on-disk format in Documentation/technical
  TODO: document API
 
 Would this be a good place to add the commit generation number that is
 so enthusiastically discussed on the mailing list from time to time?

Yes, that is one of my goals. We may even be able to just replace the
timestamp field in the cache with a generation number. When it gets
pretty-printed we pull it out of the commit message again anyway, so in
theory the only use inside struct commit is for ordering. But I
haven't looked at all of the use sites yet to be sure nobody is
depending on it being an actual date stamp.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/6] introduce a commit metapack

2013-01-29 Thread Junio C Hamano
Jeff King p...@peff.net writes:

 +int commit_metapack(unsigned char *sha1,
 + uint32_t *timestamp,
 + unsigned char **tree,
 + unsigned char **parent1,
 + unsigned char **parent2)
 +{
 + struct commit_metapack *p;
 +
 + prepare_commit_metapacks();
 + for (p = commit_metapacks; p; p = p-next) {
 + unsigned char *data;
 + int pos = sha1_entry_pos(p-index, 20, 0, 0, p-nr, p-nr, 
 sha1);

This is a tangent, but isn't it about time to rip out the check for
GIT_USE_LOOKUP in find_pack_entry_one(), I wonder.

 + prepare_commit_metapacks();
 + for (p = commit_metapacks; p; p = p-next) {
 + unsigned char *data;
 + int pos = sha1_entry_pos(p-index, 20, 0, 0, p-nr, p-nr, 
 sha1);
 + if (pos  0)
 + continue;
 +
 + /* timestamp(4) + tree(20) + parents(40) */
 + data = p-data + 64 * pos;
 + *timestamp = *(uint32_t *)data;
 + *timestamp = ntohl(*timestamp);
 + data += 4;
 + *tree = data;
 + data += 20;
 + *parent1 = data;
 + data += 20;
 + *parent2 = data;
 +
 + return 0;
 + }
 +
 + return -1;
 +}

I am torn on this one.

These cached properties of a single commit will not change no matter
which pack it appears in, and it feels logically wrong, especially
when you record these object names in the full SHA-1 form, to tie a
commit metapack to a pack.  Logically there needs only one commit
metapack that describes all the commits known to the repository when
the metapack was created.

In order to reduce the disk footprint and I/O cost, the future
direction for this mechanism may want to point into an existing
store of SHA-1 hashes with a shorter file offset, and the .idx file
could be such a store, and in order to move in that direction, you
cannot avoid tying a metapack to a pack.

 +static void get_commits(struct metapack_writer *mw,
 + const unsigned char *sha1,
 + void *data)
 +{
 + struct commit_list ***tail = data;
 + enum object_type type = sha1_object_info(sha1, NULL);
 + struct commit *c;
 +
 + if (type != OBJ_COMMIT)
 + return;
 +
 + c = lookup_commit(sha1);
 + if (!c || parse_commit(c))
 + die(unable to read commit %s, sha1_to_hex(sha1));
 +
 + /*
 +  * Our fixed-size parent list cannot represent root commits, nor
 +  * octopus merges. Just skip those commits, as we can fallback
 +  * in those rare cases to reading the actual commit object.
 +  */
 + if (!c-parents ||
 + (c-parents  c-parents-next  c-parents-next-next))
 + return;
 +
 + *tail = commit_list_insert(c, *tail)-next;
 +}

It feels somewhat wasteful to:

 - use commit_list for this, rather than an array of commit
   objects.  If you have a rough estimate of the number of commits
   in the pack, you could just preallocate a single array and use
   ALLOC_GROW() on it, no?

 - iterate over the .idx file and run sha1_object_info() and
   parse_commit() on many objects in the SHA-1 order.  Iterating in
   the way builtin/pack-objects.c::get_object_details() does avoids
   jumping around in existing packfiles, which may be more
   efficient, no?

 +void commit_metapack_write(const char *idx)
 +{
 + struct metapack_writer mw;
 + struct commit_list *commits = NULL, *p;
 + struct commit_list **tail = commits;
 + uint32_t nr = 0;
 +
 + metapack_writer_init(mw, idx, commits, 1);
 +
 + /* Figure out how many eligible commits we've got in this pack. */
 + metapack_writer_foreach(mw, get_commits, tail);
 + for (p = commits; p; p = p-next)
 + nr++;

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/6] introduce a commit metapack

2013-01-29 Thread Duy Nguyen
On Tue, Jan 29, 2013 at 4:16 PM, Jeff King p...@peff.net wrote:
 +int commit_metapack(unsigned char *sha1,
 +   uint32_t *timestamp,
 +   unsigned char **tree,
 +   unsigned char **parent1,
 +   unsigned char **parent2)
 +{

Nit picking. tree, parent1 and parent2 can/should be const unsigned char **.
-- 
Duy
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/6] introduce a commit metapack

2013-01-29 Thread Jeff King
On Tue, Jan 29, 2013 at 09:38:10AM -0800, Junio C Hamano wrote:

 Jeff King p...@peff.net writes:
 
  +int commit_metapack(unsigned char *sha1,
  +   uint32_t *timestamp,
  +   unsigned char **tree,
  +   unsigned char **parent1,
  +   unsigned char **parent2)
  +{
  +   struct commit_metapack *p;
  +
  +   prepare_commit_metapacks();
  +   for (p = commit_metapacks; p; p = p-next) {
  +   unsigned char *data;
  +   int pos = sha1_entry_pos(p-index, 20, 0, 0, p-nr, p-nr, 
  sha1);
 
 This is a tangent, but isn't it about time to rip out the check for
 GIT_USE_LOOKUP in find_pack_entry_one(), I wonder.

Rip it out and always use sha1_entry_pos, or rip it out and never use
it? I have never been able to measure any speedup from it; I just use it
here to avoid rewriting the binary search myself. I do not think there
is any disadvantage to using it, so I'd be in favor of just
standardizing on it for any sha1 binary searches.

 These cached properties of a single commit will not change no matter
 which pack it appears in, and it feels logically wrong, especially
 when you record these object names in the full SHA-1 form, to tie a
 commit metapack to a pack.  Logically there needs only one commit
 metapack that describes all the commits known to the repository when
 the metapack was created.

True. I had originally envisioned it as tied to the packfile to help
manage the lifecycle. You know you need to generate a metapack for some
objects when they get packed, and you know you can throw it away when
the associated pack goes away. And it means you can verify the integrity
of the metapack by matching it to a particular packfile.

However, if you just have a commit cache, you can always blow the whole
thing away and regenerate it for all objects in the repo. It does not
have tied to a pack (you do end up doing some extra work at regeneration
when you are not doing a full repack, but it is really not enough to
worry about).

 In order to reduce the disk footprint and I/O cost, the future
 direction for this mechanism may want to point into an existing
 store of SHA-1 hashes with a shorter file offset, and the .idx file
 could be such a store, and in order to move in that direction, you
 cannot avoid tying a metapack to a pack.

Yes. That was not one of my original goals for the commit cache, but I
do think it's a useful direction to go in. And reachability bitmaps
(which would eventually be their own metapacks) would generally want to
be per-pack, too, for the same reason.

  +   *tail = commit_list_insert(c, *tail)-next;
  +}
 
 It feels somewhat wasteful to:
 
  - use commit_list for this, rather than an array of commit
objects.  If you have a rough estimate of the number of commits
in the pack, you could just preallocate a single array and use
ALLOC_GROW() on it, no?

We don't have a rough estimate, but yes, we could just use an array and
trust ALLOC_GROW to be reasonable. The use of commit_list did not have a
particular reason other than that it was simple (an array means stuffing
the array pointer and the nr and alloc counts into a struct to get to
the callback). The performance of writing the cache is dominated by
accessing the objects themselves, anyway. I don't mind changing it,
though, if you think it's clearer as an array.

  - iterate over the .idx file and run sha1_object_info() and
parse_commit() on many objects in the SHA-1 order.  Iterating in
the way builtin/pack-objects.c::get_object_details() does avoids
jumping around in existing packfiles, which may be more
efficient, no?

Probably, though generating the complete commit cache for linux-2.6.git
takes only about 7 seconds on my machine. I wasn't too concerned with
optimizing generation, since it will typically be dwarfed by repacking
costs. It might make more of a difference for doing a metapack on all
objects (e.g., reachability bitmaps).

The reason I do it in .idx order is that it feeds the callback in sorted
order, so a writer could in theory just use that output as-is (and my
initial version did that, as it wrote separate metapacks for each
element). This version now puts all data elements together (for cache
locality), and builds the in-memory list so we do not have to re-do
sha1_object_info repeatedly. So it could very easily just generate the
list in pack order and sort it at the end.

I'll look into that for the next version.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/6] introduce a commit metapack

2013-01-29 Thread Jeff King
On Wed, Jan 30, 2013 at 10:36:10AM +0700, Nguyen Thai Ngoc Duy wrote:

 On Tue, Jan 29, 2013 at 4:16 PM, Jeff King p...@peff.net wrote:
  +int commit_metapack(unsigned char *sha1,
  +   uint32_t *timestamp,
  +   unsigned char **tree,
  +   unsigned char **parent1,
  +   unsigned char **parent2)
  +{
 
 Nit picking. tree, parent1 and parent2 can/should be const unsigned char **.

Thanks, I'll fix it in the next version.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html