Re: [PATCH 10/19] pack-bitmap: add support for bitmap indexes
On Fri, Oct 25, 2013 at 01:55:13PM +, Shawn O. Pearce wrote: As an extra optimization, when `prepare_bitmap_walk` succeeds, the `reuse_partial_packfile_from_bitmap` call can be attempted: it will find the amount of objects at the beginning of the on-disk packfile that can be reused as-is, and return an offset into the packfile. The source packfile can then be loaded and the bytes up to `offset` can be written directly to the result without having to consider the entires inside the packfile individually. Yay! This is similar to the optimization we use in JGit to send the entire pack, but the part about sending a leading prefix is new. Do you have any data showing how well this works in practice for cases where offset is before than length-20? Actually, I don't think it kicks in very much due to packfile ordering. You have all of the commits at the front of the pack, then all of the trees, then all of the blobs. So if you want the whole thing, it is easy to reuse a big chunk. But if you want only the most recent slice, we can reuse the early bit with the new commits, but we stop partway through the commit list. You still have to handle all of the trees and blobs separately. So in practice, I think this really only kicks in for clones anyway. In theory, you could find islands of ones in the bitmap and send whole slices of packfile at once. But you need to be careful not to send a delta without its base. Which I think means you end up having to generate the whole sha1 list anyway, and check that the other side has each base before reusing a delta (i.e., the normal code path). In fact, I'm not quite sure that even a partial reuse up to an offset is 100% safe. In a newly packed git repo it is, because we always put bases before deltas (and OFS_DELTA objects need this). But if you had a bitmap generated from a fixed thin pack, we would have REF_DELTA objects early on that depend on bases appended to the end of the pack. So I really wonder if we should scrap this partial reuse and either just have full reuse, or go through the regular object_entry construction. Vicent, you've thought about the reuse code a lot more than I have. Any thoughts? -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 10/19] pack-bitmap: add support for bitmap indexes
On Wed, Oct 30, 2013 at 8:10 AM, Jeff King p...@peff.net wrote: On Fri, Oct 25, 2013 at 01:55:13PM +, Shawn O. Pearce wrote: Yay! This is similar to the optimization we use in JGit to send the entire pack, but the part about sending a leading prefix is new. Do you have any data showing how well this works in practice for cases where offset is before than length-20? Actually, I don't think it kicks in very much due to packfile ordering. You have all of the commits at the front of the pack, then all of the trees, then all of the blobs. So if you want the whole thing, it is easy to reuse a big chunk. But if you want only the most recent slice, we can reuse the early bit with the new commits, but we stop partway through the commit list. You still have to handle all of the trees and blobs separately. So in practice, I think this really only kicks in for clones anyway. In theory, you could find islands of ones in the bitmap and send whole slices of packfile at once. But you need to be careful not to send a delta without its base. Which I think means you end up having to generate the whole sha1 list anyway, and check that the other side has each base before reusing a delta (i.e., the normal code path). In fact, I'm not quite sure that even a partial reuse up to an offset is 100% safe. In a newly packed git repo it is, because we always put bases before deltas (and OFS_DELTA objects need this). But if you had a bitmap generated from a fixed thin pack, we would have REF_DELTA objects early on that depend on bases appended to the end of the pack. So I really wonder if we should scrap this partial reuse and either just have full reuse, or go through the regular object_entry construction. Yes. This is why JGit only does whole pack file reuse (minus the 12 byte header and 20 byte SHA-1 trailer). Any other case has so many corner cases that we just punt into the object entry construction path and rely on individual object reuse. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 10/19] pack-bitmap: add support for bitmap indexes
On Wed, Oct 30, 2013 at 9:10 AM, Jeff King p...@peff.net wrote: In fact, I'm not quite sure that even a partial reuse up to an offset is 100% safe. In a newly packed git repo it is, because we always put bases before deltas (and OFS_DELTA objects need this). But if you had a bitmap generated from a fixed thin pack, we would have REF_DELTA objects early on that depend on bases appended to the end of the pack. So I really wonder if we should scrap this partial reuse and either just have full reuse, or go through the regular object_entry construction. Vicent, you've thought about the reuse code a lot more than I have. Any thoughts? Yes, our pack writing and bitmap code takes enough precautions to arrange the objects in the packfile in a way that can be partially reused, so for any given bitmap file written from Git, I'd say we're safe to always reuse the leader of the pack if this is possible. For bitmaps generated from JGit, however, we cannot make this assumption. I mean, we can right now (from my understanding of the current implementation for pack-objects on JGit), but they are free to change this in the future. Obviously I intend to keep the pack reuse on production because the CPU savings are noticeable, but we can drop it from the public patchset. Ideally, we'd have full pack reuse like JGit, but we cannot reasonably do that in GitHub because splitting a pack for the network root would double our disk usage for all the forks. luv, vmg -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 10/19] pack-bitmap: add support for bitmap indexes
On Wed, Oct 30, 2013 at 3:47 PM, Vicent Marti vic...@github.com wrote: On Wed, Oct 30, 2013 at 9:10 AM, Jeff King p...@peff.net wrote: In fact, I'm not quite sure that even a partial reuse up to an offset is 100% safe. In a newly packed git repo it is, because we always put bases before deltas (and OFS_DELTA objects need this). But if you had a bitmap generated from a fixed thin pack, we would have REF_DELTA objects early on that depend on bases appended to the end of the pack. So I really wonder if we should scrap this partial reuse and either just have full reuse, or go through the regular object_entry construction. Vicent, you've thought about the reuse code a lot more than I have. Any thoughts? Yes, our pack writing and bitmap code takes enough precautions to arrange the objects in the packfile in a way that can be partially reused, so for any given bitmap file written from Git, I'd say we're safe to always reuse the leader of the pack if this is possible. For bitmaps generated from JGit, however, we cannot make this assumption. I mean, we can right now (from my understanding of the current implementation for pack-objects on JGit), but they are free to change this in the future. JGit certainly doesn't promise the ordering behavior, so the fact that its happening is just luck. The code could change in the future to invalidate this. Obviously I intend to keep the pack reuse on production because the CPU savings are noticeable, but we can drop it from the public patchset. I think you should keep it in, its a significant improvement. Ideally, we'd have full pack reuse like JGit, but we cannot reasonably do that in GitHub because splitting a pack for the network root would double our disk usage for all the forks. I gave a talk the week before about Git bitmaps and why we sometimes have to slice pack files by object. Some guy in the audience kept yelling that since its Git its all open source and `git clone` is just a file transfer problem. So maybe for his GitHub repositories and forks its OK to include the entire fork network when someone clones? :-) -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 10/19] pack-bitmap: add support for bitmap indexes
On Wed, Oct 30, 2013 at 04:47:57PM +0100, Vicent Martà wrote: Yes, our pack writing and bitmap code takes enough precautions to arrange the objects in the packfile in a way that can be partially reused, so for any given bitmap file written from Git, I'd say we're safe to always reuse the leader of the pack if this is possible. Perhaps it is worth adding a flag to the bitmap index to say my bases always come before deltas. Obviously I intend to keep the pack reuse on production because the CPU savings are noticeable, but we can drop it from the public patchset. Ideally, we'd have full pack reuse like JGit, but we cannot reasonably do that in GitHub because splitting a pack for the network root would double our disk usage for all the forks. Ah, right, I was forgetting about the fork-network repositories in my analysis. This doesn't work out of the box with this series, because we will have all of the commits for all of the forks, followed by all of the trees, and so forth. What the list hasn't seen yet is that we have another patch series that organizes the objects based on refs, so you can put all of refs/remotes/X/* in one island, refs/remotes/Y/* in another island, and so forth (in our scheme, each remote represents a single fork, but it's configurable). This lets us avoid deltas to objects that are not in the same island, because such deltas would need to be thrown out when cloning a single fork. And it lets us tweak the write order so that we can put all of one fork's objects at the beginning, making it available for pack-reuse (so obviously you would want to pick the root fork to put at the beginning). I'm planning to send those patches once the bitmap code is merged. One is not strictly dependent on the other, but they touch some of the same areas, and rely on some of the same infrastructure, so I think it makes sense to do them in series. -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 10/19] pack-bitmap: add support for bitmap indexes
On Thu, Oct 24, 2013 at 6:03 PM, Jeff King p...@peff.net wrote: If `prepare_bitmap_walk` runs successfully, the resulting bitmap is stored and the equivalent of a `traverse_commit_list` call can be performed by using `traverse_bitmap_commit_list`; the bitmap version of this call yields the objects straight from the packfile index (without having to look them up or parse them) and hence is several orders of magnitude faster. This is interesting, we didn't attempt it in JGit, but I can see how it could be very useful. As an extra optimization, when `prepare_bitmap_walk` succeeds, the `reuse_partial_packfile_from_bitmap` call can be attempted: it will find the amount of objects at the beginning of the on-disk packfile that can be reused as-is, and return an offset into the packfile. The source packfile can then be loaded and the bytes up to `offset` can be written directly to the result without having to consider the entires inside the packfile individually. Yay! This is similar to the optimization we use in JGit to send the entire pack, but the part about sending a leading prefix is new. Do you have any data showing how well this works in practice for cases where offset is before than length-20? -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html