Re: [PATCH 11/19] pack-objects: use bitmaps when packing objects

2013-10-30 Thread Jeff King
On Fri, Oct 25, 2013 at 02:14:11PM +, Shawn O. Pearce wrote:

 On Thu, Oct 24, 2013 at 6:04 PM, Jeff King p...@peff.net wrote:
  For bitmaps to be used, the following must be true:
 
1. We must be packing to stdout (as a normal `pack-objects` from
   `upload-pack` would do).
 
2. There must be a .bitmap index containing at least one of the
   have objects that the client is asking for.
 
 The client must explicitly have a commit that has a bitmap? In JGit
 we allow the client to have anything, and walk backwards using
 traditional graph traversal until a bitmap is found.

If the bitmaps contain the full set of reachable objects and the client
does not have any haves that are bitmapped , then we know that either:

  1. Their haves are not reachable from the wants

 or

  2. Their wants are not bitmapped, and so the slice of haves..wants
 has no bitmaps

Since (1) is relatively rare, I think we are using this as a proxy for
(2), so that we can do a regular walk rather than looking around for
bitmaps that don't exist. But I may be misremembering the reasoning.
Vicent?

  @@ -704,6 +759,18 @@ static void write_pack_file(void)
  offset = write_pack_header(f, nr_remaining);
  if (!offset)
  die_errno(unable to write pack header);
  +
  +   if (reuse_packfile) {
  +   off_t packfile_size;
  +   assert(pack_to_stdout);
  +
  +   packfile_size = write_reused_pack(f);
  +   if (!packfile_size)
  +   die_errno(failed to re-use existing pack);
  +
  +   offset += packfile_size;
  +   }
  +
  nr_written = 0;
  for (; i  to_pack.nr_objects; i++) {
  struct object_entry *e = write_order[i];
 
 Can reuse_packfile be true at the same time as to_pack.nr_objects  0?

Yes, if there are new, non-bitmapped objects to send on top of the
reused packfile.

 In JGit we write the to_pack list first, then the reuse pack. Our
 rationale was the to_pack list is recent objects that are newer and
 would appear first in a traditional traversal, so they should go at
 the front of the stream. This does mean if they delta compress against
 an object in that reuse_packfile slice they have to use REF_DELTA
 instead of OFS_DELTA.

That's a good point. In our case I think we do not delta against the
reused packfile objects at all, as we simply send out the whole slice of
packfile without making an entry for each object.

 Is this series running on github.com/torvalds/linux? Last Saturday I
 ran a live demo clone comparing github.com/torvalds/linux to a JGit
 bitmap clone and some guy heckled me because GitHub was only a few
 seconds slower. :-)

It is. Use kernel.org if you want to make fun of someone. :)

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 11/19] pack-objects: use bitmaps when packing objects

2013-10-30 Thread Shawn Pearce
On Wed, Oct 30, 2013 at 8:21 AM, Jeff King p...@peff.net wrote:
 On Fri, Oct 25, 2013 at 02:14:11PM +, Shawn O. Pearce wrote:
 On Thu, Oct 24, 2013 at 6:04 PM, Jeff King p...@peff.net wrote:
  For bitmaps to be used, the following must be true:
 
1. We must be packing to stdout (as a normal `pack-objects` from
   `upload-pack` would do).
 
2. There must be a .bitmap index containing at least one of the
   have objects that the client is asking for.

 The client must explicitly have a commit that has a bitmap? In JGit
 we allow the client to have anything, and walk backwards using
 traditional graph traversal until a bitmap is found.

 If the bitmaps contain the full set of reachable objects and the client
 does not have any haves that are bitmapped , then we know that either:

   1. Their haves are not reachable from the wants

  or

   2. Their wants are not bitmapped, and so the slice of haves..wants
  has no bitmaps

 Since (1) is relatively rare, I think we are using this as a proxy for
 (2), so that we can do a regular walk rather than looking around for
 bitmaps that don't exist. But I may be misremembering the reasoning.
 Vicent?

Ah. I am not sure if we do this in JGit. I think JGit's approach is to
look if the have appears in a pack with bitmaps, this is a simple
lookup in the .idx file and does not require expanding any data from
the .bitmap file.


But it wasn't my question. :-)

Client sends want B ; have E. What if E appears in the bitmapped
pack, but does not itself have a bitmap? Do you walk backwards from B
and switch to the bitmap algorithm when you find a commit that has a
bitmap and that bitmap contains E?

 In JGit we write the to_pack list first, then the reuse pack. Our
 rationale was the to_pack list is recent objects that are newer and
 would appear first in a traditional traversal, so they should go at
 the front of the stream. This does mean if they delta compress against
 an object in that reuse_packfile slice they have to use REF_DELTA
 instead of OFS_DELTA.

 That's a good point. In our case I think we do not delta against the
 reused packfile objects at all, as we simply send out the whole slice of
 packfile without making an entry for each object.

JGit also doesn't use the reused packfile as delta bases, because
there are no object entries to shove through the delta window. So
there is never any risk of a reference to a base later in the file. It
also means that thin pack at the front of the stream is less
optimally compressed. At Google we side step that by doing GC at the
server very often, to try and keep the number of objects in that pack
low.

It might make sense to use a commit that covers the majority of the
reused pack as the edge base candidate case during delta compression
here, as though the client had sent us a have for that commit. I
don't think we have tried this in JGit. It would make deltas use
REF_DELTA, but the delta size has to be smaller than the REF_DELTA
header to be used in the stream so its still a smaller overall
transfer.

 Is this series running on github.com/torvalds/linux? Last Saturday I
 ran a live demo clone comparing github.com/torvalds/linux to a JGit
 bitmap clone and some guy heckled me because GitHub was only a few
 seconds slower. :-)

 It is. Use kernel.org if you want to make fun of someone. :)

Hah. OK, so GitHub was only a few seconds slower only because my
desktop is better connected to our data centers than to yours. Nicely
done, this patch series really works. :)
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 11/19] pack-objects: use bitmaps when packing objects

2013-10-30 Thread Vicent Marti
On Wed, Oct 30, 2013 at 11:38 AM, Shawn Pearce spea...@spearce.org wrote:
 Since (1) is relatively rare, I think we are using this as a proxy for
 (2), so that we can do a regular walk rather than looking around for
 bitmaps that don't exist. But I may be misremembering the reasoning.
 Vicent?

 Ah. I am not sure if we do this in JGit. I think JGit's approach is to
 look if the have appears in a pack with bitmaps, this is a simple
 lookup in the .idx file and does not require expanding any data from
 the .bitmap file.

No, you don't do this in JGit right now. This is a small optimization
we implemented to prevent *loading* the bitmap file (and hence
building the reverse index, which can be expensive) unless we're sure
we can at least *attempt* a bitmap walk.

 But it wasn't my question. :-)

 Client sends want B ; have E. What if E appears in the bitmapped
 pack, but does not itself have a bitmap? Do you walk backwards from B
 and switch to the bitmap algorithm when you find a commit that has a
 bitmap and that bitmap contains E?

This is correct, we use the same heuristics as JGit. Once we have
loaded a bitmap file we know we can attempt a bitmap walk; if E is on
the bitmapped pack, we'll build a temporary bitmap using an extended
index (with bits for commits that are not in the packfile) as we walk
backwards until E. Once we find a commit with a bitmap, we'll OR that
with the temporary bitmap to skip the full walk.

 In JGit we write the to_pack list first, then the reuse pack. Our
 rationale was the to_pack list is recent objects that are newer and
 would appear first in a traditional traversal, so they should go at
 the front of the stream. This does mean if they delta compress against
 an object in that reuse_packfile slice they have to use REF_DELTA
 instead of OFS_DELTA.

 That's a good point. In our case I think we do not delta against the
 reused packfile objects at all, as we simply send out the whole slice of
 packfile without making an entry for each object.

 JGit also doesn't use the reused packfile as delta bases, because
 there are no object entries to shove through the delta window. So
 there is never any risk of a reference to a base later in the file. It
 also means that thin pack at the front of the stream is less
 optimally compressed. At Google we side step that by doing GC at the
 server very often, to try and keep the number of objects in that pack
 low.

 It might make sense to use a commit that covers the majority of the
 reused pack as the edge base candidate case during delta compression
 here, as though the client had sent us a have for that commit. I
 don't think we have tried this in JGit. It would make deltas use
 REF_DELTA, but the delta size has to be smaller than the REF_DELTA
 header to be used in the stream so its still a smaller overall
 transfer.

 Is this series running on github.com/torvalds/linux? Last Saturday I
 ran a live demo clone comparing github.com/torvalds/linux to a JGit
 bitmap clone and some guy heckled me because GitHub was only a few
 seconds slower. :-)

 It is. Use kernel.org if you want to make fun of someone. :)

 Hah. OK, so GitHub was only a few seconds slower only because my
 desktop is better connected to our data centers than to yours. Nicely
 done, this patch series really works. :)

Thanks Shawn, your feedback was very helpful.

Re. cloning speeds: we are actually bottlenecked by our routing layer.
The CPUs in our new fileservers can clone `torvalds/linux` to
/dev/null in 20s, but we're slowing down when routing the actual
traffic back to the client. We're in the process of rewriting our Git
reverse proxys so let's see what the future looks like.

Love,
vmg
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 11/19] pack-objects: use bitmaps when packing objects

2013-10-25 Thread Shawn Pearce
On Thu, Oct 24, 2013 at 6:04 PM, Jeff King p...@peff.net wrote:
 For bitmaps to be used, the following must be true:

   1. We must be packing to stdout (as a normal `pack-objects` from
  `upload-pack` would do).

   2. There must be a .bitmap index containing at least one of the
  have objects that the client is asking for.

The client must explicitly have a commit that has a bitmap? In JGit
we allow the client to have anything, and walk backwards using
traditional graph traversal until a bitmap is found.

 @@ -704,6 +759,18 @@ static void write_pack_file(void)
 offset = write_pack_header(f, nr_remaining);
 if (!offset)
 die_errno(unable to write pack header);
 +
 +   if (reuse_packfile) {
 +   off_t packfile_size;
 +   assert(pack_to_stdout);
 +
 +   packfile_size = write_reused_pack(f);
 +   if (!packfile_size)
 +   die_errno(failed to re-use existing pack);
 +
 +   offset += packfile_size;
 +   }
 +
 nr_written = 0;
 for (; i  to_pack.nr_objects; i++) {
 struct object_entry *e = write_order[i];

Can reuse_packfile be true at the same time as to_pack.nr_objects  0?

In JGit we write the to_pack list first, then the reuse pack. Our
rationale was the to_pack list is recent objects that are newer and
would appear first in a traditional traversal, so they should go at
the front of the stream. This does mean if they delta compress against
an object in that reuse_packfile slice they have to use REF_DELTA
instead of OFS_DELTA.


Is this series running on github.com/torvalds/linux? Last Saturday I
ran a live demo clone comparing github.com/torvalds/linux to a JGit
bitmap clone and some guy heckled me because GitHub was only a few
seconds slower. :-)
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html