Re: [PATCH 10/19] pack-bitmap: add support for bitmap indexes

2013-10-30 Thread Jeff King
On Fri, Oct 25, 2013 at 01:55:13PM +, Shawn O. Pearce wrote:

  As an extra optimization, when `prepare_bitmap_walk` succeeds, the
  `reuse_partial_packfile_from_bitmap` call can be attempted: it will find
  the amount of objects at the beginning of the on-disk packfile that can
  be reused as-is, and return an offset into the packfile. The source
  packfile can then be loaded and the bytes up to `offset` can be written
  directly to the result without having to consider the entires inside the
  packfile individually.
 
 Yay! This is similar to the optimization we use in JGit to send the
 entire pack, but the part about sending a leading prefix is new. Do
 you have any data showing how well this works in practice for cases
 where offset is before than length-20?

Actually, I don't think it kicks in very much due to packfile ordering.
You have all of the commits at the front of the pack, then all of the
trees, then all of the blobs. So if you want the whole thing, it is easy
to reuse a big chunk. But if you want only the most recent slice, we can
reuse the early bit with the new commits, but we stop partway through
the commit list. You still have to handle all of the trees and blobs
separately.

So in practice, I think this really only kicks in for clones anyway.

In theory, you could find islands of ones in the bitmap and send whole
slices of packfile at once. But you need to be careful not to send a
delta without its base. Which I think means you end up having to
generate the whole sha1 list anyway, and check that the other side has
each base before reusing a delta (i.e., the normal code path).

In fact, I'm not quite sure that even a partial reuse up to an offset is
100% safe. In a newly packed git repo it is, because we always put bases
before deltas (and OFS_DELTA objects need this). But if you had a bitmap
generated from a fixed thin pack, we would have REF_DELTA objects early
on that depend on bases appended to the end of the pack. So I really
wonder if we should scrap this partial reuse and either just have full
reuse, or go through the regular object_entry construction.

Vicent, you've thought about the reuse code a lot more than I have. Any
thoughts?

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 10/19] pack-bitmap: add support for bitmap indexes

2013-10-30 Thread Shawn Pearce
On Wed, Oct 30, 2013 at 8:10 AM, Jeff King p...@peff.net wrote:
 On Fri, Oct 25, 2013 at 01:55:13PM +, Shawn O. Pearce wrote:

 Yay! This is similar to the optimization we use in JGit to send the
 entire pack, but the part about sending a leading prefix is new. Do
 you have any data showing how well this works in practice for cases
 where offset is before than length-20?

 Actually, I don't think it kicks in very much due to packfile ordering.
 You have all of the commits at the front of the pack, then all of the
 trees, then all of the blobs. So if you want the whole thing, it is easy
 to reuse a big chunk. But if you want only the most recent slice, we can
 reuse the early bit with the new commits, but we stop partway through
 the commit list. You still have to handle all of the trees and blobs
 separately.

 So in practice, I think this really only kicks in for clones anyway.

 In theory, you could find islands of ones in the bitmap and send whole
 slices of packfile at once. But you need to be careful not to send a
 delta without its base. Which I think means you end up having to
 generate the whole sha1 list anyway, and check that the other side has
 each base before reusing a delta (i.e., the normal code path).

 In fact, I'm not quite sure that even a partial reuse up to an offset is
 100% safe. In a newly packed git repo it is, because we always put bases
 before deltas (and OFS_DELTA objects need this). But if you had a bitmap
 generated from a fixed thin pack, we would have REF_DELTA objects early
 on that depend on bases appended to the end of the pack. So I really
 wonder if we should scrap this partial reuse and either just have full
 reuse, or go through the regular object_entry construction.

Yes. This is why JGit only does whole pack file reuse (minus the 12
byte header and 20 byte SHA-1 trailer). Any other case has so many
corner cases that we just punt into the object entry construction path
and rely on individual object reuse.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 10/19] pack-bitmap: add support for bitmap indexes

2013-10-30 Thread Vicent Marti
On Wed, Oct 30, 2013 at 9:10 AM, Jeff King p...@peff.net wrote:

 In fact, I'm not quite sure that even a partial reuse up to an offset is
 100% safe. In a newly packed git repo it is, because we always put bases
 before deltas (and OFS_DELTA objects need this). But if you had a bitmap
 generated from a fixed thin pack, we would have REF_DELTA objects early
 on that depend on bases appended to the end of the pack. So I really
 wonder if we should scrap this partial reuse and either just have full
 reuse, or go through the regular object_entry construction.

 Vicent, you've thought about the reuse code a lot more than I have. Any
 thoughts?

Yes, our pack writing and bitmap code takes enough precautions to
arrange the objects in the packfile in a way that can be partially
reused, so for any given bitmap file written from Git, I'd say we're
safe to always reuse the leader of the pack if this is possible.

For bitmaps generated from JGit, however, we cannot make this
assumption. I mean, we can right now (from my understanding of the
current implementation for pack-objects on JGit), but they are free to
change this in the future.

Obviously I intend to keep the pack reuse on production because the
CPU savings are noticeable, but we can drop it from the public
patchset. Ideally, we'd have full pack reuse like JGit, but we cannot
reasonably do that in GitHub because splitting a pack for the network
root would double our disk usage for all the forks.

luv,
vmg
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 10/19] pack-bitmap: add support for bitmap indexes

2013-10-30 Thread Shawn Pearce
On Wed, Oct 30, 2013 at 3:47 PM, Vicent Marti vic...@github.com wrote:
 On Wed, Oct 30, 2013 at 9:10 AM, Jeff King p...@peff.net wrote:

 In fact, I'm not quite sure that even a partial reuse up to an offset is
 100% safe. In a newly packed git repo it is, because we always put bases
 before deltas (and OFS_DELTA objects need this). But if you had a bitmap
 generated from a fixed thin pack, we would have REF_DELTA objects early
 on that depend on bases appended to the end of the pack. So I really
 wonder if we should scrap this partial reuse and either just have full
 reuse, or go through the regular object_entry construction.

 Vicent, you've thought about the reuse code a lot more than I have. Any
 thoughts?

 Yes, our pack writing and bitmap code takes enough precautions to
 arrange the objects in the packfile in a way that can be partially
 reused, so for any given bitmap file written from Git, I'd say we're
 safe to always reuse the leader of the pack if this is possible.

 For bitmaps generated from JGit, however, we cannot make this
 assumption. I mean, we can right now (from my understanding of the
 current implementation for pack-objects on JGit), but they are free to
 change this in the future.

JGit certainly doesn't promise the ordering behavior, so the fact that
its happening is just luck. The code could change in the future to
invalidate this.

 Obviously I intend to keep the pack reuse on production because the
 CPU savings are noticeable, but we can drop it from the public
 patchset.

I think you should keep it in, its a significant improvement.

 Ideally, we'd have full pack reuse like JGit, but we cannot
 reasonably do that in GitHub because splitting a pack for the network
 root would double our disk usage for all the forks.

I gave a talk the week before about Git bitmaps and why we sometimes
have to slice pack files by object.

Some guy in the audience kept yelling that since its Git its all open
source and `git clone` is just a file transfer problem. So maybe for
his GitHub repositories and forks its OK to include the entire fork
network when someone clones?  :-)
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 10/19] pack-bitmap: add support for bitmap indexes

2013-10-30 Thread Jeff King
On Wed, Oct 30, 2013 at 04:47:57PM +0100, Vicent Martí wrote:

 Yes, our pack writing and bitmap code takes enough precautions to
 arrange the objects in the packfile in a way that can be partially
 reused, so for any given bitmap file written from Git, I'd say we're
 safe to always reuse the leader of the pack if this is possible.

Perhaps it is worth adding a flag to the bitmap index to say my bases
always come before deltas.

 Obviously I intend to keep the pack reuse on production because the
 CPU savings are noticeable, but we can drop it from the public
 patchset. Ideally, we'd have full pack reuse like JGit, but we cannot
 reasonably do that in GitHub because splitting a pack for the network
 root would double our disk usage for all the forks.

Ah, right, I was forgetting about the fork-network repositories in my
analysis.  This doesn't work out of the box with this series, because we
will have all of the commits for all of the forks, followed by all of
the trees, and so forth.

What the list hasn't seen yet is that we have another patch series that
organizes the objects based on refs, so you can put all of
refs/remotes/X/* in one island, refs/remotes/Y/* in another
island, and so forth (in our scheme, each remote represents a single
fork, but it's configurable).

This lets us avoid deltas to objects that are not in the same island,
because such deltas would need to be thrown out when cloning a single
fork. And it lets us tweak the write order so that we can put all of one
fork's objects at the beginning, making it available for pack-reuse (so
obviously you would want to pick the root fork to put at the
beginning).

I'm planning to send those patches once the bitmap code is merged. One
is not strictly dependent on the other, but they touch some of the same
areas, and rely on some of the same infrastructure, so I think it makes
sense to do them in series.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 10/19] pack-bitmap: add support for bitmap indexes

2013-10-25 Thread Shawn Pearce
On Thu, Oct 24, 2013 at 6:03 PM, Jeff King p...@peff.net wrote:
 If `prepare_bitmap_walk` runs successfully, the resulting bitmap is
 stored and the equivalent of a `traverse_commit_list` call can be
 performed by using `traverse_bitmap_commit_list`; the bitmap version
 of this call yields the objects straight from the packfile index
 (without having to look them up or parse them) and hence is several
 orders of magnitude faster.

This is interesting, we didn't attempt it in JGit, but I can see how
it could be very useful.

 As an extra optimization, when `prepare_bitmap_walk` succeeds, the
 `reuse_partial_packfile_from_bitmap` call can be attempted: it will find
 the amount of objects at the beginning of the on-disk packfile that can
 be reused as-is, and return an offset into the packfile. The source
 packfile can then be loaded and the bytes up to `offset` can be written
 directly to the result without having to consider the entires inside the
 packfile individually.

Yay! This is similar to the optimization we use in JGit to send the
entire pack, but the part about sending a leading prefix is new. Do
you have any data showing how well this works in practice for cases
where offset is before than length-20?
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html