Re: [PATCH 4/4] pack-revindex: radix-sort the revindex

2013-07-10 Thread Jeff King
On Mon, Jul 08, 2013 at 02:35:10PM -0700, Brandon Casey wrote:

  If 'n' is the number of objects in the pack, shouldn't it be unsigned?
 
  The data type for struct packed_git.num_objects is uint32_t.  Looks
  like create_pack_revindex uses the wrong datatype when it captures
  num_objects in the int num_ent and passes it to sort_revindex.  So, it
  looks like that function needs to be updated too.
 
 Hmm.  It seems more than just create_pack_revindex holds num_objects
 in a signed int.  Don't we support 4G objects in packs?
 
 find_pack_revindex uses a signed int for the index variables in its
binary search which means it will fail when num_objects >= 2^31.

Yes, I didn't make a test case, but I suspect it is just completely
broken at that scale. But nobody hits it because having 2^31 objects is
somewhat insane (we are still only in the 2^22 range for the kernel, and
most large repos I've seen have large objects, but not necessarily more
of them). So if development keeps up, the kernel should hit the bug in
another few thousand years. :)

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4] pack-revindex: radix-sort the revindex

2013-07-10 Thread Jeff King
On Mon, Jul 08, 2013 at 01:50:41PM -0700, Brandon Casey wrote:

  +static void sort_revindex(struct revindex_entry *entries, int n, off_t max)
 
 If 'n' is the number of objects in the pack, shouldn't it be unsigned?

Yes. I inherited that bug from the rest of the revindex code.

 The data type for struct packed_git.num_objects is uint32_t.  Looks
 like create_pack_revindex uses the wrong datatype when it captures
 num_objects in the int num_ent and passes it to sort_revindex.  So, it
 looks like that function needs to be updated too.

Yep. And the binary search in find_pack_revindex, too (which even has an
integer overflow!). I'll add a patch to my series to fix it (and switch
mine).
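For reference, the usual shape of the fix is to widen the index variables to match `num_objects` (uint32_t) and compute the midpoint as `lo + (hi - lo) / 2`, which cannot overflow even near 2^32 entries. A standalone sketch (hypothetical helper, not the actual patch; `uint64_t` stands in for `off_t`):

```c
#include <stdint.h>

/*
 * Hypothetical sketch of an overflow-safe binary search over sorted
 * pack offsets. Index variables are uint32_t to match num_objects,
 * and the midpoint is lo + (hi - lo) / 2, so lo + hi is never
 * computed and cannot overflow. Returns the index of `needle`,
 * or -1 if it is not present.
 */
static int64_t find_offset(const uint64_t *offsets, uint32_t n,
			   uint64_t needle)
{
	uint32_t lo = 0, hi = n;	/* search window is [lo, hi) */

	while (lo < hi) {
		uint32_t mi = lo + (hi - lo) / 2;	/* no overflow */

		if (offsets[mi] == needle)
			return mi;
		else if (offsets[mi] < needle)
			lo = mi + 1;
		else
			hi = mi;
	}
	return -1;
}
```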

  +   while (max / (((off_t)1) << digits)) {
 
 Is there any reason this shouldn't be simplified to just:
 
while (max >> digits) {

No, yours is much more readable. In case you are wondering how I ended
up with that monstrosity, I originally did not keep the digits field
as a number of bits, but rather a running total that bit-shifted 16 bits
each time through the loop. I'll change it in the re-roll.

 I glanced briefly at the assembly and it appears that gcc does
 actually emit a divide instruction to accomplish this, which I think
 we can avoid by just rearranging the operation.

Yep, although it almost certainly doesn't matter. We hit that loop
condition check at most 5 times for a 64-bit integer. I'm more concerned
with readability.
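For the curious, a quick standalone sketch showing the two loop conditions agree (entirely hypothetical code, with `uint64_t` standing in for `off_t`, and a `digits < 64` guard added to sidestep the full-64-bit shift problem flagged elsewhere in the thread):

```c
#include <stdint.h>

/*
 * Count how many 16-bit radix passes each loop condition performs.
 * Both count the 16-bit digits needed to represent `max`, so they
 * agree for any non-negative value. Shifting a 64-bit value by 64
 * is undefined in C, hence the digits < 64 guard in both.
 */
static int passes_by_division(uint64_t max)
{
	int digits = 0, passes = 0;

	while (digits < 64 && max / (((uint64_t)1) << digits)) {
		passes++;
		digits += 16;
	}
	return passes;
}

static int passes_by_shift(uint64_t max)
{
	int digits = 0, passes = 0;

	while (digits < 64 && (max >> digits)) {
		passes++;
		digits += 16;
	}
	return passes;
}
```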

  +   if (a != entries) {
  +   int i;
  +   for (i = 0; i < n; i++)
  +   entries[i] = tmp[i];
 
 I think I recall that somebody investigated whether a for loop like
 you have above was faster for copying structures than memcpy.  I
 forget whether it was conclusive.  Did you happen to compare them?

It was me, actually, but the comparison was for memcmp rather than an
open-coded loop. And the conclusion was that memcmp is way faster on
glibc 2.13 and higher.

I think memcpy probably is going to be faster (especially in recent
versions of glibc), given the size of the array (the other memcmp
discussion was for 20-byte hashes, where the function call and setup
time was much more relevant).

But I don't think this was even timed at all in my tests. Since we go
back-and-forth between the original array and the tmp storage, we have a
50% chance of not needing to swap back anyway. So for packfiles up to
64K, we do the swap (but they are not that interesting to measure), and
then from 64K to 4G, we do not.

Note that we also use struct assignment in the sort itself to drop
elements into their buckets. That could potentially use memcpy, though I
would expect the compiler to generate pretty decent instructions for
such a small struct.
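Side by side, the two copy styles look like this (a hypothetical sketch; the struct layout here is a stand-in, not the real `revindex_entry`):

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for struct revindex_entry (hypothetical layout). */
struct entry {
	uint64_t offset;
	uint32_t nr;
};

/* Open-coded struct-assignment loop, as in the patch. */
static void copy_by_assignment(struct entry *dst, const struct entry *src,
			       size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		dst[i] = src[i];
}

/*
 * Bulk memcpy of the whole array. Likely faster on a large array
 * with a recent glibc, though for a struct this small the compiler
 * often generates comparable code for the loop above anyway.
 */
static void copy_by_memcpy(struct entry *dst, const struct entry *src,
			   size_t n)
{
	memcpy(dst, src, n * sizeof(*dst));
}
```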

-Peff


Re: [PATCH 4/4] pack-revindex: radix-sort the revindex

2013-07-08 Thread Jeff King
On Sun, Jul 07, 2013 at 04:52:23PM -0700, Shawn O. Pearce wrote:

 On Sun, Jul 7, 2013 at 3:14 AM, Jeff King p...@peff.net wrote:
  The pack revindex stores the offsets of the objects in the
  pack in sorted order, allowing us to easily find the on-disk
  size of each object. To compute it, we populate an array
  with the offsets from the sha1-sorted idx file, and then use
  qsort to order it by offsets.
 
  That does O(n log n) offset comparisons, and profiling shows
  that we spend most of our time in cmp_offset. However, since
  we are sorting on a simple off_t, we can use numeric sorts
  that perform better. A radix sort can run in O(k*n), where k
  is the number of digits in our number. For a 64-bit off_t,
  using 16-bit digits gives us k=4.
 
 Did you try the simple bucket sort Colby now uses in JGit?
 
 The sort is pretty simple:
 
   bucket_size = pack_length / object_count;
   buckets[] = malloc(object_count * sizeof(int));
 
   foreach obj in idx:
 push_chain(buckets[obj.offset / bucket_size], obj.idx_nth);
 
   foreach bucket:
 insertion sort by offset

I did do something similar (though I flattened my buckets into a single
list afterwards), but I ended up closer to 700ms (down from 830ms, but
with the radix sort around 200ms). It's entirely possible I screwed up
something in the implementation (the bucket insertion can be done in a
lot of different ways, many of which are terrible), but I didn't keep a
copy of that attempt. If you try it and have better numbers, I'd be
happy to see them.

 We observed on linux.git that most buckets have an average number of
 objects. IIRC the bucket_size was ~201 bytes and most buckets had very
 few objects each. For lookups we keep the bucket_size parameter and a
 bucket index table. This arrangement uses 8 bytes per object in the
 reverse index, making it very memory efficient. Searches are typically
 below O(log N) time because each bucket has log N entries.

I didn't measure lookups at all; I was focused on time to build the
index. So if there were benefits there that make up for a longer setup
time, I wouldn't have measured them (of course, we also care about the
case with few lookups, so it would be a tradeoff). You could also leave
each bucket unsorted and only lazily sort it when a lookup hits the
bucket, which might help that case (I didn't look to see if you do that
in JGit).
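The lazy idea could look something like this (entirely hypothetical sketch, invented names; each bucket carries a "sorted" flag and gets insertion-sorted only on first lookup):

```c
#include <stdint.h>

/*
 * Hypothetical lazily-sorted buckets: items[] holds all offsets
 * grouped by bucket, start[b]..start[b+1] delimits bucket b, and
 * sorted[b] records whether that bucket has been sorted yet.
 */
struct lazy_buckets {
	uint64_t *items;
	uint32_t *start;
	uint8_t *sorted;
	uint32_t nr_buckets;
};

static void sort_bucket(struct lazy_buckets *lb, uint32_t b)
{
	uint32_t i, j, lo = lb->start[b], hi = lb->start[b + 1];

	if (lb->sorted[b])
		return;
	/* buckets are small, so insertion sort is fine */
	for (i = lo + 1; i < hi; i++) {
		uint64_t v = lb->items[i];

		for (j = i; j > lo && lb->items[j - 1] > v; j--)
			lb->items[j] = lb->items[j - 1];
		lb->items[j] = v;
	}
	lb->sorted[b] = 1;
}

/*
 * Look up `offset` in bucket b, sorting the bucket on first touch.
 * Returns an index into items[], or -1 if absent.
 */
static int64_t lookup(struct lazy_buckets *lb, uint32_t b, uint64_t offset)
{
	uint32_t i;

	sort_bucket(lb, b);
	for (i = lb->start[b]; i < lb->start[b + 1]; i++)
		if (lb->items[i] == offset)
			return i;
	return -1;
}
```

The win, if any, comes from workloads that touch only a handful of buckets, where most of the sorting cost is never paid.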

-Peff


Re: [PATCH 4/4] pack-revindex: radix-sort the revindex

2013-07-08 Thread Shawn Pearce
On Mon, Jul 8, 2013 at 12:57 AM, Jeff King p...@peff.net wrote:
 On Sun, Jul 07, 2013 at 04:52:23PM -0700, Shawn O. Pearce wrote:

 On Sun, Jul 7, 2013 at 3:14 AM, Jeff King p...@peff.net wrote:
  The pack revindex stores the offsets of the objects in the
  pack in sorted order, allowing us to easily find the on-disk
  size of each object. To compute it, we populate an array
  with the offsets from the sha1-sorted idx file, and then use
  qsort to order it by offsets.
 
  That does O(n log n) offset comparisons, and profiling shows
  that we spend most of our time in cmp_offset. However, since
  we are sorting on a simple off_t, we can use numeric sorts
  that perform better. A radix sort can run in O(k*n), where k
  is the number of digits in our number. For a 64-bit off_t,
  using 16-bit digits gives us k=4.

 Did you try the simple bucket sort Colby now uses in JGit?

 The sort is pretty simple:

   bucket_size = pack_length / object_count;
   buckets[] = malloc(object_count * sizeof(int));

   foreach obj in idx:
 push_chain(buckets[obj.offset / bucket_size], obj.idx_nth);

   foreach bucket:
 insertion sort by offset

 I did do something similar (though I flattened my buckets into a single
 list afterwards), but I ended up closer to 700ms (down from 830ms, but
 with the radix sort around 200ms). It's entirely possible I screwed up
 something in the implementation (the bucket insertion can be done in a
 lot of different ways, many of which are terrible), but I didn't keep a
 copy of that attempt. If you try it and have better numbers, I'd be
 happy to see them.

Colby's sort in Java is coming in around 450ms for linux.git, so
sounds like your implementation was doing something suboptimal.

But as I thought about it this morning, a radix sort for most pack
files should run with k=2 and take only O(2*N) time. It is a very
efficient sort for the data. Colby and I didn't even try a radix sort,
and I suspect it would out-perform the bucket sort we do now.

 We observed on linux.git that most buckets have an average number of
 objects. IIRC the bucket_size was ~201 bytes and most buckets had very
 few objects each. For lookups we keep the bucket_size parameter and a
 bucket index table. This arrangement uses 8 bytes per object in the
 reverse index, making it very memory efficient. Searches are typically
 below O(log N) time because each bucket has log N entries.

 I didn't measure lookups at all; I was focused on time to build the
 index. So if there were benefits there that make up for a longer setup
 time, I wouldn't have measured them (of course, we also care about the
 case with few lookups, so it would be a tradeoff).

We didn't measure lookup times either. Colby did compute a histogram
of bucket sizes and showed nearly all buckets were significantly
smaller than log N, so lookups are log N time even though they are a
simple iteration through the elements. Colby considered doing binary
search within a bucket but didn't bother given how small the buckets
are.

So our lookup time benefit is theoretical. The way JGit implements
clones, we tend not to perform N lookups in the revindex; it's usually
sub-1000 lookups. That makes it harder to have any noticeable benefit
from decreased lookup time.

 You could also leave
 each bucket unsorted and only lazily sort it when a lookup hits the
 bucket, which might help that case (I didn't look to see if you do that
 in JGit).

We didn't do that in JGit; the sort is done at initialization. But
given the remark I just made about clones doing only a few lookups we
may want to defer the sort. IIRC the sort is about half of our
initialization cost.


Re: [PATCH 4/4] pack-revindex: radix-sort the revindex

2013-07-08 Thread Brandon Casey
On Sun, Jul 7, 2013 at 3:14 AM, Jeff King p...@peff.net wrote:
 The pack revindex stores the offsets of the objects in the
 pack in sorted order, allowing us to easily find the on-disk
 size of each object. To compute it, we populate an array
 with the offsets from the sha1-sorted idx file, and then use
 qsort to order it by offsets.

 That does O(n log n) offset comparisons, and profiling shows
 that we spend most of our time in cmp_offset. However, since
 we are sorting on a simple off_t, we can use numeric sorts
 that perform better. A radix sort can run in O(k*n), where k
 is the number of digits in our number. For a 64-bit off_t,
 using 16-bit digits gives us k=4.

 On the linux.git repo, with about 3M objects to sort, this
 yields a 400% speedup. Here are the best-of-five numbers for
 running echo HEAD | git cat-file --batch-disk-size, which
 is dominated by time spent building the pack revindex:

   before after
   real0m0.834s   0m0.204s
   user0m0.788s   0m0.164s
   sys 0m0.040s   0m0.036s

 On a smaller repo, the radix sort would not be
 as impressive (and could even be worse), as we are trading
 the log(n) factor for the k=4 of the radix sort. However,
 even on git.git, with 173K objects, it shows some
 improvement:

   before after
   real0m0.046s   0m0.017s
   user0m0.036s   0m0.012s
   sys 0m0.008s   0m0.000s

 Signed-off-by: Jeff King p...@peff.net
 ---
 I think there are probably still two potential issues here:

   1. My while() loop termination probably has issues when we have to use
  all 64 bits to represent the pack offset (not likely, but...)

   2. We put int pos[65536] on the stack. This is a little big, but is
  probably OK, as I think the usual small stack problems we have seen
  are always in threaded code. But it would not be a big deal to heap
  allocate it (it would happen once per radix step, which is only 4
  times for the whole sort).

  pack-revindex.c | 77 +++++++++++++++++++++++++++++++++++++++++++++++++-----
  1 file changed, 72 insertions(+), 5 deletions(-)

 diff --git a/pack-revindex.c b/pack-revindex.c
 index 77a0465..d2adf36 100644
 --- a/pack-revindex.c
 +++ b/pack-revindex.c
 @@ -59,11 +59,78 @@ static int cmp_offset(const void *a_, const void *b_)
 /* revindex elements are lazily initialized */
  }

 -static int cmp_offset(const void *a_, const void *b_)
 +/*
 + * This is a least-significant-digit radix sort using a 16-bit digit.
 + */
 +static void sort_revindex(struct revindex_entry *entries, int n, off_t max)

If 'n' is the number of objects in the pack, shouldn't it be unsigned?

The data type for struct packed_git.num_objects is uint32_t.  Looks
like create_pack_revindex uses the wrong datatype when it captures
num_objects in the int num_ent and passes it to sort_revindex.  So, it
looks like that function needs to be updated too.

  {
 -   const struct revindex_entry *a = a_;
 -   const struct revindex_entry *b = b_;
  -   return (a->offset < b->offset) ? -1 : (a->offset > b->offset) ? 1 : 0;
 +   /*
 +* We need O(n) temporary storage, so we sort back and forth between
  +* the real array and our tmp storage. To keep them straight, we always
 +* sort from a into buckets in b.
 +*/
 +   struct revindex_entry *tmp = xcalloc(n, sizeof(*tmp));
 +   struct revindex_entry *a = entries, *b = tmp;
 +   int digits = 0;
 +
 +   /*
 +* We want to know the bucket that a[i] will go into when we are using
 +* the digit that is N bits from the (least significant) end.
 +*/
  +#define BUCKET_FOR(a, i, digits) ((a[i].offset >> digits) & 0xffff)
 +
  +   while (max / (((off_t)1) << digits)) {

Is there any reason this shouldn't be simplified to just:

   while (max >> digits) {

I glanced briefly at the assembly and it appears that gcc does
actually emit a divide instruction to accomplish this, which I think
we can avoid by just rearranging the operation.

 +   struct revindex_entry *swap;
 +   int i;
 +   int pos[65536] = {0};
 +
 +   /*
 +* We want pos[i] to store the index of the last element that
 +* will go in bucket i (actually one past the last element).
 +* To do this, we first count the items that will go in each
 +* bucket, which gives us a relative offset from the last
 +* bucket. We can then cumulatively add the index from the
 +* previous bucket to get the true index.
 +*/
  +   for (i = 0; i < n; i++)
 +   pos[BUCKET_FOR(a, i, digits)]++;
  +   for (i = 1; i < ARRAY_SIZE(pos); i++)
 +   pos[i] += pos[i-1];
 +
 +   /*
 +* Now we can drop the elements into their correct buckets (in
 +* our temporary array).  We iterate the pos counter backwards

Re: [PATCH 4/4] pack-revindex: radix-sort the revindex

2013-07-08 Thread Brandon Casey
On Mon, Jul 8, 2013 at 1:50 PM, Brandon Casey draf...@gmail.com wrote:
 On Sun, Jul 7, 2013 at 3:14 AM, Jeff King p...@peff.net wrote:

 diff --git a/pack-revindex.c b/pack-revindex.c
 index 77a0465..d2adf36 100644
 --- a/pack-revindex.c
 +++ b/pack-revindex.c
 @@ -59,11 +59,78 @@ static int cmp_offset(const void *a_, const void *b_)
 /* revindex elements are lazily initialized */
  }

 -static int cmp_offset(const void *a_, const void *b_)
 +/*
 + * This is a least-significant-digit radix sort using a 16-bit digit.
 + */
 +static void sort_revindex(struct revindex_entry *entries, int n, off_t max)

 If 'n' is the number of objects in the pack, shouldn't it be unsigned?

 The data type for struct packed_git.num_objects is uint32_t.  Looks
 like create_pack_revindex uses the wrong datatype when it captures
 num_objects in the int num_ent and passes it to sort_revindex.  So, it
 looks like that function needs to be updated too.

Hmm.  It seems more than just create_pack_revindex holds num_objects
in a signed int.  Don't we support 4G objects in packs?

find_pack_revindex uses a signed int for the index variables in its
binary search which means it will fail when num_objects >= 2^31.

-Brandon


Re: [PATCH 4/4] pack-revindex: radix-sort the revindex

2013-07-07 Thread Shawn Pearce
On Sun, Jul 7, 2013 at 3:14 AM, Jeff King p...@peff.net wrote:
 The pack revindex stores the offsets of the objects in the
 pack in sorted order, allowing us to easily find the on-disk
 size of each object. To compute it, we populate an array
 with the offsets from the sha1-sorted idx file, and then use
 qsort to order it by offsets.

 That does O(n log n) offset comparisons, and profiling shows
 that we spend most of our time in cmp_offset. However, since
 we are sorting on a simple off_t, we can use numeric sorts
 that perform better. A radix sort can run in O(k*n), where k
 is the number of digits in our number. For a 64-bit off_t,
 using 16-bit digits gives us k=4.

Did you try the simple bucket sort Colby now uses in JGit?

The sort is pretty simple:

  bucket_size = pack_length / object_count;
  buckets[] = malloc(object_count * sizeof(int));

  foreach obj in idx:
push_chain(buckets[obj.offset / bucket_size], obj.idx_nth);

  foreach bucket:
insertion sort by offset

https://eclipse.googlesource.com/jgit/jgit/+/master/org.eclipse.jgit/src/org/eclipse/jgit/internal/storage/file/PackReverseIndex.java
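
A rough C translation of the pseudocode above (my sketch, not the JGit code; all names are hypothetical). It chains objects into offset/bucket_size bins, flattens the bins in order, then finishes with one insertion-sort pass, which is cheap because every element is at most a bucket-width out of place:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical bucket sort of pack offsets. Bucket count equals the
 * object count, as in the pseudocode; returns 0 on success, -1 on
 * allocation failure.
 */
static int bucket_sort_offsets(uint64_t *offsets, uint32_t n,
			       uint64_t pack_length)
{
	uint64_t bucket_size;
	uint32_t *head, *next, i, b, filled = 0;
	uint64_t *out;

	if (!n)
		return 0;
	bucket_size = pack_length / n;
	if (!bucket_size)
		bucket_size = 1;	/* avoid division by zero */

	head = malloc(n * sizeof(*head));
	next = malloc(n * sizeof(*next));
	out = malloc(n * sizeof(*out));
	if (!head || !next || !out)
		return -1;
	for (i = 0; i < n; i++)
		head[i] = UINT32_MAX;	/* empty-bucket sentinel */

	/* chain each object into its bucket */
	for (i = 0; i < n; i++) {
		b = (uint32_t)(offsets[i] / bucket_size);
		if (b >= n)
			b = n - 1;	/* clamp the last bucket */
		next[i] = head[b];
		head[b] = i;
	}

	/* flatten buckets in order: sorted across buckets, not within */
	for (b = 0; b < n; b++)
		for (i = head[b]; i != UINT32_MAX; i = next[i])
			out[filled++] = offsets[i];

	/* one insertion-sort pass fixes the small per-bucket disorder */
	for (i = 1; i < n; i++) {
		uint64_t v = out[i];
		uint32_t j = i;

		while (j && out[j - 1] > v) {
			out[j] = out[j - 1];
			j--;
		}
		out[j] = v;
	}

	memcpy(offsets, out, n * sizeof(*out));
	free(head);
	free(next);
	free(out);
	return 0;
}
```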

We observed on linux.git that most buckets have an average number of
objects. IIRC the bucket_size was ~201 bytes and most buckets had very
few objects each. For lookups we keep the bucket_size parameter and a
bucket index table. This arrangement uses 8 bytes per object in the
reverse index, making it very memory efficient. Searches are typically
below O(log N) time because each bucket has log N entries.