Re: [PATCH 0/2] optimizing pack access on read only fetch repos
On Fri, Feb 1, 2013 at 1:14 AM, Jeff King <p...@peff.net> wrote:
> On Thu, Jan 31, 2013 at 08:47:37AM -0800, Shawn O. Pearce wrote:
> > > > - System resource cost we incur by having to keep 50 file
> > > >   descriptors open and maintaining 50 mmap windows will reduce
> > > >   by 50 fold.
> > >
> > > I wonder how measurable that is (and if it matters on Linux versus
> > > less efficient platforms).
> >
> > It does matter. We know it has a negative impact on JGit even on
> > Linux, for example. You don't want 300 packs in a repository. 50
> > might be tolerable. 300 is not.
>
> I'd love to see numbers if you have them. It's not that I don't
> believe it is slower, but knowing _how much_ is important when
> thinking about what kind of performance increase we are looking to
> get (which in turn impacts how much effort to put into the repacking).

Never done a formal experiment; I'm just working from memory. Four
years and three desks ago I got called in because one of our Git
servers was struggling to keep up with user git fetch commands. It
turned out the repository had O(200) pack files. A git gc that normally
took only about 5 minutes took a whole lot longer, an hour or more.

The problem happened because the server was saving every push to a pack
and never exploding incoming packs to loose objects. That meant the
thin packs from the wire got delta bases appended to them. pack-objects
was looking at O(50) different alternatives for most objects when
trying to decide which one it should copy into the output pack... for
either a fetch/clone client, or the gc I was trying to run to fix the
repository.
Re: [PATCH 0/2] optimizing pack access on read only fetch repos
On Thu, Jan 31, 2013 at 08:47:37AM -0800, Shawn O. Pearce wrote:
> > > - System resource cost we incur by having to keep 50 file
> > >   descriptors open and maintaining 50 mmap windows will reduce by
> > >   50 fold.
> >
> > I wonder how measurable that is (and if it matters on Linux versus
> > less efficient platforms).
>
> It does matter. We know it has a negative impact on JGit even on
> Linux, for example. You don't want 300 packs in a repository. 50
> might be tolerable. 300 is not.

I'd love to see numbers if you have them. It's not that I don't believe
it is slower, but knowing _how much_ is important when thinking about
what kind of performance increase we are looking to get (which in turn
impacts how much effort to put into the repacking).

-Peff
Re: [PATCH 0/2] optimizing pack access on read only fetch repos
On Tue, Jan 29, 2013 at 1:19 PM, Jeff King <p...@peff.net> wrote:
> On Tue, Jan 29, 2013 at 07:58:01AM -0800, Junio C Hamano wrote:
> > The point is not about space. Disk is cheap, and it is not making it
> > any worse than what happens to your target audience, that is, a
> > fetch-only repository with only gc --auto in it, where nobody passes
> > -f to repack to cause recomputation of deltas. What I was trying to
> > seek was a way to reduce the runtime penalty we pay every time we
> > run git in such a repository.
> >
> >  - Object look-up cost will become log2(50*n) from 50*log2(n), which
> >    is about 50/log2(50) improvement;
>
> Yes and no. Our heuristic is to look at the last-used pack for an
> object. So assuming we have locality of requests, we should quite
> often get lucky and find the object in the first log2 search. Even if
> we don't assume locality, a situation with one large pack and a few
> small packs will have the large one as last used more often than the
> others, and it will also have the looked-for object more often than
> the others.

Opening all of those files does impact performance. It depends on how
slow your open(2) syscall is. I know that on Mac OS X it's not the
fastest function we get from the C library. Performing ~40 opens to
look through the most recent pack files and finally find the real pack
that contains the tag you asked git show for isn't that quick. Some of
us also use Git on filesystems that are network based and slow compared
to local-disk Linux ext2/3/4 with gobs of free RAM.

> So I can see how it is something we could potentially optimize, but I
> could also see it being surprisingly not a big deal. I'd be very
> interested to see real measurements, even of something as simple as a
> master index which can reference multiple packfiles.

I actually tried this many, many years ago. There are threads in the
archive about it. It's slower. We ruled it out.

> > - System resource cost we incur by having to keep 50 file
> >   descriptors open and maintaining 50 mmap windows will reduce by
> >   50 fold.
>
> I wonder how measurable that is (and if it matters on Linux versus
> less efficient platforms).

It does matter. We know it has a negative impact on JGit even on Linux,
for example. You don't want 300 packs in a repository. 50 might be
tolerable. 300 is not.
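The last-used-pack heuristic and the per-pack open(2)/mmap cost
discussed above can be pictured with a small sketch: remember the pack
that answered the previous lookup, try it first, and only then fall
back to binary-searching every pack index. This is illustrative Python,
not git's C implementation; the Pack and ObjectStore classes and
find_object() are names invented for the example.

  import bisect

  class Pack:
      """One packfile: a sorted list of object ids plus their offsets."""
      def __init__(self, name, entries):
          # entries: {object_id: offset}.  Opening the real file costs an
          # open(2) and an mmap window, the per-pack overhead discussed above.
          self.name = name
          self.ids = sorted(entries)               # like the .idx object table
          self.offsets = [entries[i] for i in self.ids]

      def lookup(self, oid):
          """One binary search -- the log2(n) step."""
          i = bisect.bisect_left(self.ids, oid)
          if i < len(self.ids) and self.ids[i] == oid:
              return self.offsets[i]
          return None

  class ObjectStore:
      def __init__(self, packs):
          self.packs = list(packs)
          self.last_used = None                    # "look here first" heuristic

      def find_object(self, oid):
          # With locality of requests this usually succeeds after a single
          # binary search in the pack that served the previous lookup.
          if self.last_used is not None:
              off = self.last_used.lookup(oid)
              if off is not None:
                  return self.last_used, off
          # Otherwise fall back to scanning every pack: up to 50 binary
          # searches (and 50 open packs) when the repository has 50 packs.
          for pack in self.packs:
              off = pack.lookup(oid)
              if off is not None:
                  self.last_used = pack
                  return pack, off
          return None

  store = ObjectStore([Pack("pack-a", {"a1": 12, "c3": 40}),
                       Pack("pack-b", {"b2": 12})])
  print(store.find_object("b2"))   # scans both packs, remembers pack-b
  print(store.find_object("b2"))   # served by the last-used pack immediately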
Re: [PATCH 0/2] optimizing pack access on read only fetch repos
On Sat, Jan 26, 2013 at 10:32 PM, Junio C Hamano <gits...@pobox.com> wrote:
> Jeff King <p...@peff.net> writes:
>
> > This is a repost from here:
> >
> >   http://thread.gmane.org/gmane.comp.version-control.git/211176
> >
> > which got no response initially. Basically the issue is that
> > read-only repos (e.g., a CI server) whose workflow is something
> > like:
> >
> >   git fetch $some_branch
> >   git checkout -f $some_branch
> >   make test
> >
> > will never run git-gc, and will accumulate a bunch of small packs
> > and loose objects, leading to poor performance.
> > ...
>
> I also wonder if we would be helped by another repack mode that
> coalesces small packs into a single one with minimum overhead, and
> run that often from gc --auto, so that we do not end up having to
> have 50 packfiles.

Yes. This does help.

> When we have 2 or more small and young packs, we could:
>
>  - iterate over idx files for these packs to enumerate the objects
>    to be packed, replacing the read_object_list_from_stdin() step;
>
>  - always choose to copy the data we have in these existing packs,
>    instead of doing a full prepare_pack(); and
>
>  - use the order the objects appear in the original packs, bypassing
>    compute_write_order().

Hmm, sounds familiar. Seems like it's what we do in JGit for Android.
:-)
Re: [PATCH 0/2] optimizing pack access on read only fetch repos
On Sat, Jan 26, 2013 at 10:32:42PM -0800, Junio C Hamano wrote:
> Both make sense to me.
>
> I also wonder if we would be helped by another repack mode that
> coalesces small packs into a single one with minimum overhead, and
> run that often from gc --auto, so that we do not end up having to
> have 50 packfiles.
>
> When we have 2 or more small and young packs, we could:
>
>  - iterate over idx files for these packs to enumerate the objects
>    to be packed, replacing the read_object_list_from_stdin() step;
>
>  - always choose to copy the data we have in these existing packs,
>    instead of doing a full prepare_pack(); and
>
>  - use the order the objects appear in the original packs, bypassing
>    compute_write_order().

I'm not sure. If I understand you correctly, it would basically just be
concatenating packs without trying to do delta compression between the
objects which are ending up in the same pack. So it would save us from
having to do (up to) 50 binary searches to find an object in a pack,
but would not actually save us much space.

I would be interested to see the timing on how quick it is compared to
a real repack, as the I/O that happens during a repack is non-trivial
(although if you are leaving aside the big main pack, then it is
probably not bad).

But how do these somewhat mediocre concatenated packs get turned into
real packs? Pack-objects does not consider deltas between objects in
the same pack. And when would you decide to make a real pack? How do
you know you have 50 young and small packs, and not 50 mediocre
coalesced packs?

-Peff
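The worry that concatenated packs never become "real" packs comes down
to a shortcut in the delta search: with delta reuse enabled (the
default, i.e. no -f), pack-objects skips trying a delta between two
objects that already came out of the same source pack, on the
assumption that whoever created that pack already tried and rejected
the pair. The sketch below is a hedged illustration of that filter, not
git's exact rule; the function name and the dict-shaped inputs are
invented for the example.

  def should_try_delta(target, source, reuse_delta=True):
      """Decide whether the delta search should even attempt target-vs-source.

      target/source describe candidate objects, e.g.
      {"type": "blob", "source_pack": "pack-123"}.
      """
      # Deltas only make sense between objects of the same type.
      if target["type"] != source["type"]:
          return False
      # With delta reuse on (no -f), a pair coming from the same existing
      # pack is assumed to have been tried when that pack was written, so
      # objects concatenated into one big pack are never re-deltified
      # against each other until someone repacks with -f.
      if reuse_delta and target["source_pack"] == source["source_pack"]:
          return False
      return True

  a = {"type": "blob", "source_pack": "coalesced.pack"}
  b = {"type": "blob", "source_pack": "coalesced.pack"}
  print(should_try_delta(a, b))                     # False: pair is skipped
  print(should_try_delta(a, b, reuse_delta=False))  # True: repack -f retries it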
Re: [PATCH 0/2] optimizing pack access on read only fetch repos
On Sun, Jan 27, 2013 at 1:32 PM, Junio C Hamano <gits...@pobox.com> wrote:
> I also wonder if we would be helped by another repack mode that
> coalesces small packs into a single one with minimum overhead, and
> run that often from gc --auto, so that we do not end up having to
> have 50 packfiles.
>
> When we have 2 or more small and young packs, we could:
>
>  - iterate over idx files for these packs to enumerate the objects
>    to be packed, replacing the read_object_list_from_stdin() step;
>
>  - always choose to copy the data we have in these existing packs,
>    instead of doing a full prepare_pack(); and
>
>  - use the order the objects appear in the original packs, bypassing
>    compute_write_order().

Isn't it easier and cheaper to create the master index, something like
bup does?
--
Duy
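For reference, a "master index" in the sense mentioned here (bup's midx
files are one example) is a single sorted table that maps every object
id to the pack that holds it, so a lookup is one binary search no
matter how many packs exist. Below is a rough sketch of the idea, using
plain Python dicts in place of real .idx files; the function names are
invented for the example.

  import bisect

  def build_master_index(pack_indexes):
      """pack_indexes: {pack_name: {object_id: offset}} -> one merged table.

      Returns parallel sorted lists so that a single binary search replaces
      one search per pack.  Objects stored in several packs keep the first
      location seen.
      """
      merged = {}
      for pack_name, entries in pack_indexes.items():
          for oid, offset in entries.items():
              merged.setdefault(oid, (pack_name, offset))
      ids = sorted(merged)
      locations = [merged[i] for i in ids]
      return ids, locations

  def master_lookup(ids, locations, oid):
      i = bisect.bisect_left(ids, oid)
      if i < len(ids) and ids[i] == oid:
          return locations[i]          # (pack_name, offset)
      return None

  ids, locs = build_master_index({
      "pack-1": {"a1": 12, "c3": 40},
      "pack-2": {"b2": 12},
  })
  print(master_lookup(ids, locs, "b2"))   # ('pack-2', 12)

Note that Shawn reports elsewhere in the thread that an experiment
along exactly these lines turned out slower in practice, so the
constant factors (and the cost of keeping such an index current) matter
at least as much as the asymptotics.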
Re: [PATCH 0/2] optimizing pack access on read only fetch repos
Jeff King <p...@peff.net> wrote:
> On Sat, Jan 26, 2013 at 10:32:42PM -0800, Junio C Hamano wrote:
> > Both make sense to me.
> >
> > I also wonder if we would be helped by another repack mode that
> > coalesces small packs into a single one with minimum overhead, and
> > run that often from gc --auto, so that we do not end up having to
> > have 50 packfiles.
> >
> > When we have 2 or more small and young packs, we could:
> >
> >  - iterate over idx files for these packs to enumerate the objects
> >    to be packed, replacing the read_object_list_from_stdin() step;
> >
> >  - always choose to copy the data we have in these existing packs,
> >    instead of doing a full prepare_pack(); and
> >
> >  - use the order the objects appear in the original packs,
> >    bypassing compute_write_order().
>
> I'm not sure. If I understand you correctly, it would basically just
> be concatenating packs without trying to do delta compression between
> the objects which are ending up in the same pack. So it would save us
> from having to do (up to) 50 binary searches to find an object in a
> pack, but would not actually save us much space.
>
> I would be interested to see the timing on how quick it is compared
> to a real repack, as the I/O that happens during a repack is
> non-trivial (although if you are leaving aside the big main pack,
> then it is probably not bad).
>
> But how do these somewhat mediocre concatenated packs get turned into
> real packs? Pack-objects does not consider deltas between objects in
> the same pack. And when would you decide to make a real pack? How do
> you know you have 50 young and small packs, and not 50 mediocre
> coalesced packs?

If we are reconsidering repacking strategies, I would like to propose
an approach that might be a more general improvement to repacking and
would help in more situations. You could roll together any packs which
are close in size, say within 50% of each other. With this strategy you
end up with packfiles whose sizes are spread out exponentially.

I implemented this strategy on top of the current gc script using keep
files, and it works fairly well:

  https://gerrit-review.googlesource.com/#/c/35215/3/contrib/git-exproll.sh

This saves some time, but mostly it saves I/O when repacking regularly.
I suspect that if this strategy were used in core git, further
optimizations could be made to also reduce the repack time, but I don't
know enough about repacking to say.

We run it nightly on our servers, both write and read-only mirrors. We
currently use a ratio of 5 to drastically reduce large repack file
rollovers.

-Martin
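The rollup described here can be pictured as: sort the packs by size
and consolidate any run of packs whose sizes stay within the chosen
ratio of each other, so the surviving packs end up spread out roughly
exponentially and the largest packs are rarely rewritten. The following
is a simplified sketch of just that grouping step, not the actual logic
of git-exproll.sh; the function name, the ratio default, and the sample
sizes are invented for the example.

  def plan_rollups(pack_sizes, ratio=2.0):
      """Group packs whose sizes are within `ratio` of each other.

      pack_sizes: {pack_name: size_in_bytes}.  Each returned group with
      more than one member is a candidate for consolidation into a single
      pack; run repeatedly (e.g. nightly), this keeps pack sizes spread
      out so the big packs are left alone most of the time.
      """
      groups = []
      current = []
      for name, size in sorted(pack_sizes.items(), key=lambda kv: kv[1]):
          if current and size > current[0][1] * ratio:
              groups.append(current)     # too big a jump: start a new group
              current = []
          current.append((name, size))
      if current:
          groups.append(current)
      return [g for g in groups if len(g) > 1]

  sizes = {"pack-a": 1_000, "pack-b": 1_200, "pack-c": 1_500,
           "pack-d": 40_000, "pack-e": 900_000}
  print(plan_rollups(sizes))
  # [[('pack-a', 1000), ('pack-b', 1200), ('pack-c', 1500)]]
  # pack-d and pack-e are left untouched (the exproll approach protects
  # such packs with .keep files), so they are not rewritten every run.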
Re: [PATCH 0/2] optimizing pack access on read only fetch repos
Jeff King <p...@peff.net> writes:

> > I also wonder if we would be helped by another repack mode that
> > coalesces small packs into a single one with minimum overhead, and
> > run that often from gc --auto, so that we do not end up having to
> > have 50 packfiles.
> > ...
>
> I'm not sure. If I understand you correctly, it would basically just
> be concatenating packs without trying to do delta compression between
> the objects which are ending up in the same pack. So it would save us
> from having to do (up to) 50 binary searches to find an object in a
> pack, but would not actually save us much space.

The point is not about space. Disk is cheap, and it is not making it
any worse than what happens to your target audience, that is, a
fetch-only repository with only gc --auto in it, where nobody passes -f
to repack to cause recomputation of deltas. What I was trying to seek
was a way to reduce the runtime penalty we pay every time we run git in
such a repository.

 - Object look-up cost will become log2(50*n) from 50*log2(n), which is
   about 50/log2(50) improvement;

 - System resource cost we incur by having to keep 50 file descriptors
   open and maintaining 50 mmap windows will reduce by 50 fold.

 - Anything else I missed?

> I would be interested to see the timing on how quick it is compared
> to a real repack,...

Yes, that is what I meant by "wonder if we would be helped by" ;-)

> But how do these somewhat mediocre concatenated packs get turned into
> real packs?

How do they get processed in fetch-only repositories that sometimes run
gc --auto today? By running repack -a -d -f occasionally, perhaps?

At some point, you would need to run a repack that involves a real
object-graph traversal that feeds you the paths for objects, to obtain
a reasonably compressed pack. We can inspect existing packs and guess a
rough traversal order for commits, but for trees and blobs, you cannot
unify existing delta chains from multiple packs effectively with the
data in the packs alone.
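The last paragraph is the crux of why a cheap coalesce cannot replace
an occasional repack -f: good delta candidates are found by sorting
objects by type, by a hash of the path at which a revision walk found
them, and by size, and then trying deltas only within a sliding window
of that ordering. Without a traversal you simply do not have the paths.
Below is a loose Python sketch of that ordering only (the window search
itself is omitted); name_hash is a toy stand-in, and the object dicts
are invented for the example.

  def name_hash(path):
      # Toy stand-in for the real path hash: objects that live at the same
      # (or similar) paths should sort next to each other.
      return hash(path.rsplit("/", 1)[-1])

  def delta_sort_key(obj):
      # obj: {"type": ..., "path": ..., "size": ...}.  The path is only
      # known if a revision walk handed it to us -- which is exactly the
      # step a concatenate-only repack skips.
      return (obj["type"], name_hash(obj["path"]), -obj["size"])

  objects = [
      {"id": "Makefile@v1", "type": "blob", "path": "Makefile",    "size": 900},
      {"id": "README",      "type": "blob", "path": "docs/README", "size": 300},
      {"id": "Makefile@v2", "type": "blob", "path": "Makefile",    "size": 950},
  ]
  for o in sorted(objects, key=delta_sort_key):
      print(o["id"])
  # The two versions of Makefile sort next to each other, so the sliding
  # delta window would try them against one another.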
Re: [PATCH 0/2] optimizing pack access on read only fetch repos
On Tue, Jan 29, 2013 at 07:58:01AM -0800, Junio C Hamano wrote:
> The point is not about space. Disk is cheap, and it is not making it
> any worse than what happens to your target audience, that is, a
> fetch-only repository with only gc --auto in it, where nobody passes
> -f to repack to cause recomputation of deltas. What I was trying to
> seek was a way to reduce the runtime penalty we pay every time we run
> git in such a repository.
>
>  - Object look-up cost will become log2(50*n) from 50*log2(n), which
>    is about 50/log2(50) improvement;

Yes and no. Our heuristic is to look at the last-used pack for an
object. So assuming we have locality of requests, we should quite often
get lucky and find the object in the first log2 search. Even if we
don't assume locality, a situation with one large pack and a few small
packs will have the large one as last used more often than the others,
and it will also have the looked-for object more often than the others.

So I can see how it is something we could potentially optimize, but I
could also see it being surprisingly not a big deal. I'd be very
interested to see real measurements, even of something as simple as a
master index which can reference multiple packfiles.

>  - System resource cost we incur by having to keep 50 file
>    descriptors open and maintaining 50 mmap windows will reduce by 50
>    fold.

I wonder how measurable that is (and if it matters on Linux versus less
efficient platforms).

> > I would be interested to see the timing on how quick it is compared
> > to a real repack,...
>
> Yes, that is what I meant by "wonder if we would be helped by" ;-)

There is only one way to find out... :) Maybe I am blessed with nice
machines, but I have mostly found the repack process not to be that big
a deal these days (especially with threaded delta compression).

> > But how do these somewhat mediocre concatenated packs get turned
> > into real packs?
>
> How do they get processed in fetch-only repositories that sometimes
> run gc --auto today? By running repack -a -d -f occasionally, perhaps?

Do we run repack -adf regularly? The usual git gc procedure will not
use -f, and without that, we will not even consider making deltas
between objects that were formerly in different packs (but now are in
the same pack).

So you are avoiding doing medium-effort packs (repack -ad) in favor of
doing potentially quick packs, but occasionally doing a big-effort pack
(repack -adf). It may be reasonable advice to repack -adf occasionally,
but I suspect most people are not doing it regularly (if only because
git gc does not do it by default).

-Peff
Re: [PATCH 0/2] optimizing pack access on read only fetch repos
Jeff King <p...@peff.net> writes:

> > > But how do these somewhat mediocre concatenated packs get turned
> > > into real packs?
> >
> > How do they get processed in fetch-only repositories that sometimes
> > run gc --auto today? By running repack -a -d -f occasionally,
> > perhaps?
>
> Do we run repack -adf regularly? The usual git gc procedure will not
> use -f, and without that, we will not even consider making deltas
> between objects that were formerly in different packs (but now are in
> the same pack).

Correct. It is not a new problem, and while I think it would need some
solution, the "coalesce 50 small young packs into one" idea is not an
attempt to address it.

> So you are avoiding doing medium-effort packs (repack -ad) in favor
> of doing potentially quick packs, but occasionally doing a big-effort
> pack (repack -adf).

So I think the "but occasionally" part is not correct. Either way, the
packs use suboptimal deltas, and you have to eventually pack with -f,
whether you coalesce 50 packs into 1 often or keep paying the cost of
having 50 packs for longer. The trade-off is purely between one
potentially quick coalescing per fetch in a fetch-only repository vs.
any use of the data in the fetch-only repository (what do they do?
build? serve gitweb locally?) having to deal with 25 packs on average,
and we still need to pay the medium repack cost every 50 fetches.
Re: [PATCH 0/2] optimizing pack access on read only fetch repos
Jeff King <p...@peff.net> writes:

> This is a repost from here:
>
>   http://thread.gmane.org/gmane.comp.version-control.git/211176
>
> which got no response initially. Basically the issue is that
> read-only repos (e.g., a CI server) whose workflow is something like:
>
>   git fetch $some_branch
>   git checkout -f $some_branch
>   make test
>
> will never run git-gc, and will accumulate a bunch of small packs and
> loose objects, leading to poor performance.
>
> Patch 1 runs gc --auto on fetch, which I think is sane to do. Patch 2
> optimizes our pack dir re-scanning for fetch-pack (which, unlike the
> rest of git, should expect to be missing lots of objects, since we
> are deciding what to fetch).
>
> I think 1 is a no-brainer. If your repo is packed, patch 2 matters
> less, but it still seems like a sensible optimization to me.
>
>   [1/2]: fetch: run gc --auto after fetching
>   [2/2]: fetch-pack: avoid repeatedly re-scanning pack directory
>
> -Peff

Both make sense to me.

I also wonder if we would be helped by another repack mode that
coalesces small packs into a single one with minimum overhead, and run
that often from gc --auto, so that we do not end up having to have 50
packfiles.

When we have 2 or more small and young packs, we could:

 - iterate over idx files for these packs to enumerate the objects to
   be packed, replacing the read_object_list_from_stdin() step;

 - always choose to copy the data we have in these existing packs,
   instead of doing a full prepare_pack(); and

 - use the order the objects appear in the original packs, bypassing
   compute_write_order().

The procedure cannot be a straight byte-for-byte copy, because some
objects may appear in multiple packs, and extra copies of the same
object have to be excised from the result. OFS_DELTA offsets need to be
adjusted for objects that appear later in the output, and for objects
that were deltified against such an object and recorded their base in
the OFS_DELTA format. But other than such OFS_DELTA adjustments, it
feels like such an "only coalesce multiple packs into one" mode should
be fairly quick.
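As an illustration of such an "only coalesce multiple packs into one"
mode: the sketch below is simplified Python, not the real pack format.
Pack entries are reduced to plain dicts, and the OFS_DELTA adjustment
is reduced to recomputing a relative position for each delta's base;
the function name and entry fields are invented for the example.

  def coalesce_packs(packs):
      """Concatenate several simplified 'packs' into one.

      Each pack is a list of entries in their original on-disk order:
        {"id": ..., "data": ..., "delta_base": None or an object id}
      Extra copies of an object that appears in more than one pack are
      excised (the first copy seen wins).  Real OFS_DELTA entries store a
      byte distance back to their base, so once duplicates are dropped and
      offsets change those distances must be recomputed; here we just keep
      the base id and rebuild positions at the end.
      """
      seen = set()
      out = []
      for pack in packs:                  # keep the original per-pack order
          for entry in pack:
              if entry["id"] in seen:
                  continue                # excise the duplicate copy
              seen.add(entry["id"])
              out.append(entry)

      position = {e["id"]: i for i, e in enumerate(out)}
      for i, e in enumerate(out):
          if e["delta_base"] is not None:
              # stand-in for adjusting the OFS_DELTA offset to the base's
              # new location in the combined pack
              e["base_distance"] = i - position[e["delta_base"]]
      return out

  pack1 = [{"id": "A", "data": b"...", "delta_base": None},
           {"id": "B", "data": b"...", "delta_base": "A"}]
  pack2 = [{"id": "A", "data": b"...", "delta_base": None},   # duplicate of A
           {"id": "C", "data": b"...", "delta_base": "A"}]
  print([e["id"] for e in coalesce_packs([pack1, pack2])])    # ['A', 'B', 'C']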