Re: Reducing CPU load on git server
On Mon, Aug 29, 2016 at 03:41:59PM -0700, W. David Jarvis wrote:

> We have an open support thread with the GHE support folks, but the only feedback we've gotten so far on this subject is that they believe our CPU load is being driven by the quantity of fetch operations (an average of 16 fetch requests per second during normal business hours, so 4 requests per second per subscriber box). About 4,000 fetch requests on our main repository per day.

Hmm. That might well be, but I'd have to see the numbers to say more. At this point I think we're exhausting what is useful to talk about on the Git list. I'll point GHE Support at this thread, which might help your conversation with them (and they might pull me in behind the scenes). I'll try to answer any Git-specific questions here, though.

> > None of those operations is triggered by client fetches. You'll see "rev-list" for a variety of operations, so that's hard to pinpoint. But I'm surprised that "prune" is a common one for you. It is run as part of the post-push, but I happen to know that the version that ships on GHE is optimized to use bitmaps, and to avoid doing any work when there are no loose objects that need pruning in the first place.
>
> Would regular force-pushing trigger prune operations? Our engineering body loves to force-push.

No, it shouldn't make a difference. For stock git, "prune" will only be run occasionally as part of "git gc". On GitHub Enterprise, every push kicks off a "sync" job that moves objects from a specific fork into storage shared by all of the related forks. So GHE will run prune more often than stock git would, but force-pushing wouldn't have any effect on that. There are also some custom patches to optimize prune on GHE, so it shouldn't generally be very expensive -- unless perhaps the reachability bitmaps on your repository aren't performing well.
You could try something like comparing:

  time git rev-list --objects --all >/dev/null

and

  time git rev-list --objects --all --use-bitmap-index >/dev/null

on your server. The second should be a lot faster. If it's not, that may be an indication that Git could be doing a better job of selecting bitmap commits (that code is not GitHub-specific at all).

> >> I might be misunderstanding this, but if the subscriber is already "up to date" modulo a single updated ref tip, then this problem shouldn't occur, right? Concretely: if ref A is built off of ref B, and the subscriber already has B when it requests A, that shouldn't cause this behavior, but it would cause this behavior if the subscriber didn't have B when it requested A.
> >
> > Correct. So this shouldn't be a thing you are running into now, but it's something that might be made worse if you switch to fetching only single refs.
>
> But in theory if we were always up-to-date (since we'd always fetch any updated ref) we wouldn't run into this problem? We could have a cron job to ensure that we run a full git fetch every once in a while, but I'd hope that if this was written properly we'd almost always have the most recent commit for any dependency ref.

It's a little more complicated than that. What you're really going for is letting git reuse on-disk deltas when serving fetches. But depending on when the last repack was run, we might be cobbling together the fetch from multiple packs on disk, in which case there will still be some delta search. In my experience that's not _generally_ a big deal, though. Small fetches don't have that many deltas to find.

> Our primary repository is fairly busy. It has about 1/3 the commits of Linux and about 1/3 the refs, but seems otherwise to be on the same scale.
> And, of course, it both hasn't been around for as long as Linux has and has been experiencing exponential growth, which means its current activity is higher than it has ever been before -- might put it on a similar scale to Linux's current activity.

Most of the work for repacking scales with the number of total objects (not quite linearly, though). For torvalds/linux (and its forks), that's around 16 million objects. You might try "git count-objects -v" on your server for comparison (but do it in the "network.git" directory, as that's the shared object storage).

-Peff
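Peff's timing comparison can be tried end-to-end. The sketch below sets up a throwaway repository so it is runnable anywhere; on a real server you would run just the two rev-list lines (and count-objects) inside the bare repository, which on GHE means the shared network.git directory:

```shell
#!/bin/sh
# Sketch of the bitmap comparison above, in a throwaway repo.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.email=you@example.com -c user.name=you \
    commit -q --allow-empty -m "seed commit"
# Repack with -b so a reachability bitmap is written alongside the pack.
git repack -adbq

time git rev-list --objects --all >/dev/null
time git rev-list --objects --all --use-bitmap-index >/dev/null

# The object-count comparison Peff mentions (torvalds/linux is ~16M objects).
git count-objects -v
```

On a toy repo the two timings won't differ; the gap only shows up at scale, which is exactly what makes it a useful diagnostic on the server.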
Re: Reducing CPU load on git server
> Pegging CPU for a few seconds doesn't sound out-of-place for pack-objects serving a fetch or clone on a large repository. And I can certainly believe "minutes", especially if it was not serving a fetch, but doing repository maintenance on a large repository.
>
> Talk to GitHub Enterprise support folks about what kind of process monitoring and accounting is available. Recent versions of GHE can easily tell things like which repositories and processes are using the most CPU, RAM, I/O, and network, which ones are seeing a lot of parallelism, etc.

We have an open support thread with the GHE support folks, but the only feedback we've gotten so far on this subject is that they believe our CPU load is being driven by the quantity of fetch operations (an average of 16 fetch requests per second during normal business hours, so 4 requests per second per subscriber box). About 4,000 fetch requests on our main repository per day.

> None of those operations is triggered by client fetches. You'll see "rev-list" for a variety of operations, so that's hard to pinpoint. But I'm surprised that "prune" is a common one for you. It is run as part of the post-push, but I happen to know that the version that ships on GHE is optimized to use bitmaps, and to avoid doing any work when there are no loose objects that need pruning in the first place.

Would regular force-pushing trigger prune operations? Our engineering body loves to force-push.

>> I might be misunderstanding this, but if the subscriber is already "up to date" modulo a single updated ref tip, then this problem shouldn't occur, right? Concretely: if ref A is built off of ref B, and the subscriber already has B when it requests A, that shouldn't cause this behavior, but it would cause this behavior if the subscriber didn't have B when it requested A.
>
> Correct.
> So this shouldn't be a thing you are running into now, but it's something that might be made worse if you switch to fetching only single refs.

But in theory if we were always up-to-date (since we'd always fetch any updated ref) we wouldn't run into this problem? We could have a cron job to ensure that we run a full git fetch every once in a while, but I'd hope that if this was written properly we'd almost always have the most recent commit for any dependency ref.

> That really sounds like repository maintenance. Repacks of torvalds/linux (including all of its forks) on github.com take ~15 minutes of CPU. There may be some optimization opportunities there (I have a few things I'd like to explore in the next few months), but most of it is pretty fundamental. It literally takes a few minutes just to walk the entire object graph for that repo (that's one of the more extreme cases, of course, but presumably you are hosting some large repositories).
>
> Maintenance like that should be a very occasional operation, but it's possible that you have a very busy repo.

Our primary repository is fairly busy. It has about 1/3 the commits of Linux and about 1/3 the refs, but seems otherwise to be on the same scale. And, of course, it both hasn't been around for as long as Linux has and has been experiencing exponential growth, which means its current activity is higher than it has ever been before -- might put it on a similar scale to Linux's current activity.

> OK, I double-checked, and your version should be coalescing identical fetches.
>
> Given that, and that a lot of the load you mentioned above is coming from non-fetch sources, it sounds like switching anything with your replica fetch strategy isn't likely to help much. And a multi-tiered architecture won't help if the load is being generated by requests that are serving the web-views directly on the box.
> I'd really encourage you to talk with GitHub Support about performance and clustering. It sounds like there may be some GitHub-specific things to tweak. And it may be that the load is just too much for a single machine, and would benefit from spreading the load across multiple git servers.

What surprised us is that we had been running this on an r3.4xlarge (16 vCPU) on AWS for two years without too much issue. Then in a span of months we started experiencing massive CPU load, which forced us to upgrade the box to one with 32 vCPU (and better CPUs). We just don't understand what the precise driver of load is here.

As noted above, we are talking with GitHub about performance -- we've also been pushing them to start working on a clustering plan, but my impression has been that they're reluctant to go down that path. I suspect that we use GHE much more aggressively than the typical GHE client, but I could be wrong about that.

- V

--
venanti.us
203.918.2328
Re: Reducing CPU load on git server
On ma, 2016-08-29 at 13:57 -0700, W. David Jarvis wrote:

> > * If you do need branches consider archiving stale tags/branches after some time. I implemented this where I work, we just have a $REPO-archive.git with every tag/branch ever created for a given $REPO.git, and delete refs after a certain time.
>
> This is something else that we're actively considering. Why did your company implement this -- was it to reduce load, or just to clean up your repositories? Did you notice any change in server load?

At 50k refs, ref negotiation gets expensive, especially over HTTP.

D.
Re: Reducing CPU load on git server
On Mon, Aug 29, 2016 at 12:16:20PM -0700, W. David Jarvis wrote:

> > Do you know which processes are generating the load? git-upload-pack does the negotiation, and then pack-objects does the actual packing.
>
> When I look at expensive operations (ones that I can see consuming 90%+ of a CPU for more than a second), there are often pack-objects processes running that will consume an entire core for multiple seconds (I also saw one pack-objects counting process run for several minutes while using up a full core).

Pegging CPU for a few seconds doesn't sound out-of-place for pack-objects serving a fetch or clone on a large repository. And I can certainly believe "minutes", especially if it was not serving a fetch, but doing repository maintenance on a large repository.

Talk to GitHub Enterprise support folks about what kind of process monitoring and accounting is available. Recent versions of GHE can easily tell things like which repositories and processes are using the most CPU, RAM, I/O, and network, which ones are seeing a lot of parallelism, etc.

> rev-list shows up as a pretty active CPU consumer, as do prune and blame-tree.
>
> I'd say overall that in terms of high-CPU consumption activities, `prune` and `rev-list` show up the most frequently.

None of those operations is triggered by client fetches. You'll see "rev-list" for a variety of operations, so that's hard to pinpoint. But I'm surprised that "prune" is a common one for you. It is run as part of the post-push, but I happen to know that the version that ships on GHE is optimized to use bitmaps, and to avoid doing any work when there are no loose objects that need pruning in the first place.

Blame-tree is a GitHub-specific command (it feeds the main repository view page), and is a known CPU hog. There's more clever caching for that coming down the pipe, but it's not shipped yet.
> On the subject of prune - I forgot to mention that the `git fetch` calls from the subscribers are running `git fetch --prune`. I'm not sure if that changes the projected load profile.

That shouldn't change anything; the pruning is purely a client-side thing.

> > Maybe. If pack-objects is where your load is coming from, then counter-intuitively things sometimes get _worse_ as you fetch less. The problem is that git will generally re-use deltas it has on disk when sending to the clients. But if the clients are missing some of the objects (because they don't fetch all of the branches), then we cannot use those deltas and may need to recompute new ones. So you might see some parts of the fetch get cheaper (negotiation, pack-object's "Counting objects" phase), but "Compressing objects" gets more expensive.
>
> I might be misunderstanding this, but if the subscriber is already "up to date" modulo a single updated ref tip, then this problem shouldn't occur, right? Concretely: if ref A is built off of ref B, and the subscriber already has B when it requests A, that shouldn't cause this behavior, but it would cause this behavior if the subscriber didn't have B when it requested A.

Correct. So this shouldn't be a thing you are running into now, but it's something that might be made worse if you switch to fetching only single refs.

> See comment above about a long-running counting objects process. I couldn't tell which of our repositories it was counting, but it went for about 3 minutes with full core utilization. I didn't see us counting pack-objects frequently but it's an expensive operation.

That really sounds like repository maintenance. Repacks of torvalds/linux (including all of its forks) on github.com take ~15 minutes of CPU. There may be some optimization opportunities there (I have a few things I'd like to explore in the next few months), but most of it is pretty fundamental.
It literally takes a few minutes just to walk the entire object graph for that repo (that's one of the more extreme cases, of course, but presumably you are hosting some large repositories).

Maintenance like that should be a very occasional operation, but it's possible that you have a very busy repo.

> > There's nothing in upstream git to help smooth these loads, but since you mentioned GitHub Enterprise, I happen to know that it does have a system for coalescing multiple fetches into a single pack-objects. I _think_ it's in GHE 2.5, so you might check which version you're running (and possibly also talk to GitHub Support, who might have more advice; there are also tools for finding out which git processes are generating the most load, etc).
>
> We're on 2.6.4 at the moment.

OK, I double-checked, and your version should be coalescing identical fetches.

Given that, and that a lot of the load you mentioned above is coming from non-fetch sources, it sounds like switching anything with your replica fetch strategy isn't likely to help much. And a multi-tiered architecture won't help if the load is being generated by requests that are serving the web-views directly on the box.
Re: Reducing CPU load on git server
> * Consider having that queue of yours just send the pushed payload instead of "pull this", see git-bundle. This can turn this entire sync thing into a static file distribution problem.

As far as I know, GHE doesn't support this out of the box. We've asked them for essentially this, though. Due to the nature of our license we may not be able to configure something like this on the server instance ourselves.

> * It's not clear from your post why you have to worry about all these branches, surely your Chef instances just need the "master" branch, just push that around.

We allow deployments from non-master branches, so we do need multiple branches. We also use the replication fleet as the target for our build system, which needs to be able to build essentially any branch on any repository.

> * If you do need branches consider archiving stale tags/branches after some time. I implemented this where I work, we just have a $REPO-archive.git with every tag/branch ever created for a given $REPO.git, and delete refs after a certain time.

This is something else that we're actively considering. Why did your company implement this -- was it to reduce load, or just to clean up your repositories? Did you notice any change in server load?

> * If your problem is that you're CPU bound on the master have you considered maybe solving this with something like NFS, i.e. replace your ad-hoc replication with just a bunch of "slave" boxes that mount the remote filesystem.

This is definitely an interesting idea. It'd be a significant architectural change, though, and not one I'm sure we'd be able to get support for.

> * Or, if you're willing to deal with occasional transitory repo corruption (just retry?): rsync.

I think this is a cost we're not necessarily okay with having to deal with.

> * There's no reason why your replication chain needs to be single-level if master CPU is really the issue.
> You could have master -> N slaves -> N*X slaves, or some combination thereof.

This was discussed above - if the primary driver of load is the first fetch, then moving to a multi-tiered architecture will not solve our problems.

> * Does it really even matter that your "slave" machines are all up-to-date? We have something similar at work but it's just a minutely cronjob that does "git fetch" on some repos, since the downstream thing (e.g. the chef run) doesn't run more than once every 30m or whatever anyway.

It does, because we use the replication fleet for our build server.

- V

--
venanti.us
203.918.2328
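The git-bundle idea discussed at the top of this exchange can be sketched in a few commands. The layout below (a "primary" repo, a bundle file, a "replica") is made up for illustration; in practice the bundle file could be served as a static file over plain HTTP:

```shell
#!/bin/sh
# Sketch: turn replication into static file distribution via git-bundle.
# The primary writes a bundle on push; subscribers fetch from the file.
set -e
work=$(mktemp -d)

# Stand-in for the primary repository.
git init -q "$work/primary"
git -C "$work/primary" -c user.email=you@example.com -c user.name=you \
    commit -q --allow-empty -m "pushed commit"

# Primary side: dump all refs (plus HEAD) into a static bundle file.
git -C "$work/primary" bundle create "$work/push.bundle" HEAD --all

# Subscriber side: fetch from the bundle instead of hitting the primary.
git init -q "$work/replica"
git -C "$work/replica" fetch -q "$work/push.bundle" \
    'refs/heads/*:refs/remotes/origin/*'
git -C "$work/replica" for-each-ref refs/remotes/origin
```

The appeal is that the primary pays the pack-objects cost once per push, no matter how many subscribers pick the file up.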
Re: Reducing CPU load on git server
On Sun, Aug 28, 2016 at 9:42 PM, W. David Jarvis wrote:

> I've run into a problem that I'm looking for some help with. Let me describe the situation, and then some thoughts.

Just a few points that you may not have considered, and I didn't see mentioned in this thread:

* Consider having that queue of yours just send the pushed payload instead of "pull this", see git-bundle. This can turn this entire sync thing into a static file distribution problem.

* It's not clear from your post why you have to worry about all these branches, surely your Chef instances just need the "master" branch, just push that around.

* If you do need branches consider archiving stale tags/branches after some time. I implemented this where I work, we just have a $REPO-archive.git with every tag/branch ever created for a given $REPO.git, and delete refs after a certain time.

* If your problem is that you're CPU bound on the master have you considered maybe solving this with something like NFS, i.e. replace your ad-hoc replication with just a bunch of "slave" boxes that mount the remote filesystem.

* Or, if you're willing to deal with occasional transitory repo corruption (just retry?): rsync.

* There's no reason why your replication chain needs to be single-level if master CPU is really the issue. You could have master -> N slaves -> N*X slaves, or some combination thereof.

* Does it really even matter that your "slave" machines are all up-to-date? We have something similar at work but it's just a minutely cronjob that does "git fetch" on some repos, since the downstream thing (e.g. the chef run) doesn't run more than once every 30m or whatever anyway.
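The $REPO-archive.git scheme from the third bullet might look roughly like the sketch below. The cutoff, the repo layout, and the "protect master" rule are my assumptions for illustration, not necessarily how Ævar's setup works; the demo uses throwaway repositories so it is self-contained:

```shell
#!/bin/sh
# Sketch: copy every ref into $REPO-archive.git, then delete stale
# branches from the live $REPO.git. Cutoff/layout are assumptions.
set -e
work=$(mktemp -d)
git init -q --bare "$work/repo.git"
git init -q --bare "$work/repo-archive.git"

# Seed the live repo with a branch (test fixture only).
src=$(mktemp -d)
git -C "$src" init -q
git -C "$src" -c user.email=you@example.com -c user.name=you \
    commit -q --allow-empty -m "old work"
git -C "$src" push -q "$work/repo.git" HEAD:refs/heads/old-topic

# 1. Archive: copy refs; never delete anything on the archive side.
git --git-dir="$work/repo.git" push -q "$work/repo-archive.git" 'refs/*:refs/*'

# 2. Expire: drop branches whose tip commit is older than the cutoff
#    (the cutoff is "now" here, so the demo branch qualifies).
cutoff=$(date +%s)
git --git-dir="$work/repo.git" for-each-ref \
    --format='%(refname) %(committerdate:unix)' refs/heads |
while read -r ref ts; do
    [ "$ref" = "refs/heads/master" ] && continue   # keep protected branches
    [ "$ts" -le "$cutoff" ] && git --git-dir="$work/repo.git" update-ref -d "$ref"
done

git --git-dir="$work/repo-archive.git" for-each-ref   # old-topic survives here
```

Because the archive push never uses force or prune, the archive only accumulates refs; the live repo's ref advertisement (the part that hurts at 50k refs) shrinks.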
Re: Reducing CPU load on git server
> So your load is probably really spiky, as you get thundering herds of fetchers after every push (the spikes may have a long flatline at the top, as it takes time to process the whole herd).

It is quite spiky, yes. At the moment, however, the replication fleet is relatively small (just 4 machines). We had 6 machines earlier this month and we had hoped that terminating two of them would lead to a material drop in CPU usage, but we didn't see a really significant reduction.

> Yes, though I'd be surprised if this negotiation is that expensive in practice. In my experience it's not generally, and even if we ended up traversing every commit in the repository, that's on the order of a few seconds even for large, active repositories.
>
> In my experience, the problem in a mass-fetch like this ends up being pack-objects preparing the packfile. It has to do a similar traversal, but _also_ look at all of the trees and blobs reachable from there, and then search for likely delta-compression candidates.
>
> Do you know which processes are generating the load? git-upload-pack does the negotiation, and then pack-objects does the actual packing.

When I look at expensive operations (ones that I can see consuming 90%+ of a CPU for more than a second), there are often pack-objects processes running that will consume an entire core for multiple seconds (I also saw one pack-objects counting process run for several minutes while using up a full core).

rev-list shows up as a pretty active CPU consumer, as do prune and blame-tree. I'd say overall that in terms of high-CPU consumption activities, `prune` and `rev-list` show up the most frequently.

On the subject of prune - I forgot to mention that the `git fetch` calls from the subscribers are running `git fetch --prune`. I'm not sure if that changes the projected load profile.

> Maybe.
> If pack-objects is where your load is coming from, then counter-intuitively things sometimes get _worse_ as you fetch less. The problem is that git will generally re-use deltas it has on disk when sending to the clients. But if the clients are missing some of the objects (because they don't fetch all of the branches), then we cannot use those deltas and may need to recompute new ones. So you might see some parts of the fetch get cheaper (negotiation, pack-object's "Counting objects" phase), but "Compressing objects" gets more expensive.

I might be misunderstanding this, but if the subscriber is already "up to date" modulo a single updated ref tip, then this problem shouldn't occur, right? Concretely: if ref A is built off of ref B, and the subscriber already has B when it requests A, that shouldn't cause this behavior, but it would cause this behavior if the subscriber didn't have B when it requested A.

> This is particularly noticeable with shallow fetches, which in my experience are much more expensive to serve.

I don't think we're doing shallow fetches anywhere in this system.

> Jakub mentioned bitmaps, and if you are using GitHub Enterprise, they are enabled. But they won't really help here. They are essentially cached information that git generates at repack time. But if we _just_ got a push, then the new objects to fetch won't be part of the cache, and we'll fall back to traversing them as normal. On the other hand, this should be a relatively small bit of history to traverse, so I'd doubt that "Counting objects" is that expensive in your case (but you should be able to get a rough sense by watching the progress meter during a fetch).

See comment above about a long-running counting objects process. I couldn't tell which of our repositories it was counting, but it went for about 3 minutes with full core utilization. I didn't see us counting pack-objects frequently but it's an expensive operation.
> I'd suspect more that delta compression is expensive (we know we just got some new objects, but we don't know if we can make good deltas against the objects the client already has). That's a gut feeling, though.
>
> If the fetch is small, that _also_ shouldn't be too expensive. But things add up when you have a large number of machines all making the same request at once. So it's entirely possible that the machine just gets hit with a lot of 5s CPU tasks all at once. If you only have a couple cores, that takes many multiples of 5s to clear out.

I think this would show up if I was sitting around running `top` on the machine, but that doesn't end up being what I see. That might just be a function of there being a relatively small number of replication machines, I'm not sure. But I'm not noticing 4 of the same tasks get spawned simultaneously, which says to me that we're either utilizing a cache or there's some locking behavior involved.

> There's nothing in upstream git to help smooth these loads, but since you mentioned GitHub Enterprise, I happen to know that it does have a system for coalescing multiple fetches into a single pack-objects. I _think_ it's in GHE 2.5, so you might check which version you're running (and possibly also talk to GitHub Support, who might have more advice; there are also tools for finding out which git processes are generating the most load, etc).
Re: Reducing CPU load on git server
On Mon, Aug 29, 2016 at 12:46:27PM +0200, Jakub Narębski wrote:

> > So your load is probably really spiky, as you get thundering herds of fetchers after every push (the spikes may have a long flatline at the top, as it takes time to process the whole herd).
>
> One solution I have heard about, in the context of web caches, to reduce the thundering herd problem (there caused by caches expiring at the same time in many clients) was to add some random or quasi-random distribution to the expiration time. In your situation adding a random delay with some specified deviation could help.

That smooths the spikes, but you still have to serve all of the requests eventually. So if your problem is that the load spikes and the system slows to a crawl as a result (or runs out of RAM, etc), then distributing the load helps. But if you have enough load that your system is constantly busy, queueing the load in a different order just shifts it around.

GHE will also introduce delays into starting git when load spikes, but that's separate from the system that coalesces identical requests.

> I wonder if this system for coalescing multiple fetches is something generic, or is it something specific to the GitHub / GitHub Enterprise architecture? If it is the former, would it be considered for upstreaming, and if so, when would it be in Git itself?

I've already sent upstream the patch for a "hook" that sits between upload-pack and pack-objects (and it will be in v2.10). So that can call an arbitrary script which can then make scheduling policy for pack-objects, coalesce similar requests, etc.

GHE has a generic tool for coalescing program invocations that is not Git-specific at all (it compares its stdin and command line arguments to decide when two requests are identical, runs the command on its arguments, and then passes the output to all members of the herd). That _might_ be open-sourced in the future, but I don't have a specific timeline.
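For what it's worth, the behavior Peff describes (key a request on argv + stdin, run the command once, share the output) can be mimicked in a few lines of shell. This is a toy sketch of the idea using flock, not GitHub's actual tool:

```shell
#!/bin/sh
# Toy coalescer: identical requests (same argv + same stdin) share one
# run of the command. First arrival computes; the herd reads its output.
coalesce() {
    stdin=$(cat)
    # Hash argv and stdin together to decide what counts as "identical".
    key=$({ printf '%s\0' "$@"; printf '%s' "$stdin"; } | sha256sum | cut -d' ' -f1)
    out=${TMPDIR:-/tmp}/coalesce-$key
    (
        flock 9                     # serialize the herd on this key
        [ -f "$out" ] || printf '%s' "$stdin" | "$@" >"$out"
    ) 9>"$out.lock"
    cat "$out"
}

printf 'hello\n' | coalesce tr a-z A-Z    # first caller actually runs tr
printf 'hello\n' | coalesce tr a-z A-Z    # second is served from the first run
```

A real version would need cache expiry and error handling, but the core trick is just the content-addressed output file plus a lock per key.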
> One thing to note: if you have repositories which are to have the same contents, you can distribute the pack-file to them and update references without going through Git. It can be done on push (push to master, distribute to mirrors), or as part of fetch (master fetches from central repository, distributes to mirrors). I think; I have never managed a large set of replicated Git repositories.

Doing it naively has some gotchas, because you want to make sure you have all of the necessary objects. But if you are going this route, probably distributing a git-bundle is the simplest way.

> > Generally no, they should not conflict. Writes into the object database can happen simultaneously. Ref updates take a per-ref lock, so you should generally be able to write two unrelated refs at once. The big exception is that ref deletion requires taking a repo-wide lock, but that presumably wouldn't be a problem for your case.
>
> Doesn't Git avoid taking locks, and use lockless synchronization mechanisms (though possibly equivalent to locks)? I think it takes a lockfile to update the reflog together with the reference, but if reflogs are turned off (and I think they are off for bare repositories by default), ref update uses an "atomic file write" (write + rename) and a compare-and-swap primitive. Updating a repository is lock-free: first update the repository object database, then the reference.

There is a lockfile to make the compare-and-swap atomic, but yes, it's fundamentally based around the compare-and-swap. I don't think that matters to the end user, though. Fundamentally they will see "I hoped to move from X to Y, but somebody else wrote Z, aborting", which is the same as "I did not win the lock race, aborting". The point is that updating two different refs is generally independent, and updating the same ref is not.
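The compare-and-swap is visible from plumbing: `git update-ref` takes an optional old value and refuses the update unless the ref still points there. A self-contained sketch in a throwaway repo:

```shell
#!/bin/sh
# Demonstrate the ref-update compare-and-swap via update-ref's <oldvalue>.
set -e
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" -c user.email=you@example.com -c user.name=you \
    commit -q --allow-empty -m "one"
old=$(git -C "$repo" rev-parse HEAD)
git -C "$repo" -c user.email=you@example.com -c user.name=you \
    commit -q --allow-empty -m "two"
new=$(git -C "$repo" rev-parse HEAD)

git -C "$repo" update-ref refs/heads/topic "$old"       # topic -> "one"

# Wrong expected old value: the swap is refused, like losing a lock race.
if ! git -C "$repo" update-ref refs/heads/topic "$new" "$new" 2>/dev/null; then
    echo "update refused: topic was not at the expected tip"
fi

# Correct expected old value: the swap succeeds.
git -C "$repo" update-ref refs/heads/topic "$new" "$old"
git -C "$repo" rev-parse refs/heads/topic
```

This is exactly the "I hoped to move from X to Y, but somebody else wrote Z" failure mode described above, surfaced as an exit code.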
> I guess that trying to replicate the DGit approach that GitHub uses, see "Introducing DGit" (http://githubengineering.com/introducing-dgit), is currently out of the question?

Minor nitpick (that you don't even have any way of knowing about, so maybe more of a public service announcement): GitHub will stop using the "DGit" name because it's too confusingly similar to "Git" (and "Git" is trademarked by the project). There's a new blog post coming that mentions the name change, and that historic one will have a note added retroactively. The new name is "GitHub Spokes" (get it, Hub, Spokes?).

But in response to your question, I'll caution that replicating it is a lot of work. :)

Since the original problem report mentions GHE, I'll note that newer versions of GHE do support clustering and can share the git load across multiple Spokes servers. So in theory that could make the replica layer go away entirely, because it all happens behind the scenes.

-Peff

PS Sorry, I generally try to avoid hawking GitHub wares on the list, but since the OP mentioned GHE specifically, and because there aren't really generic solutions to most of these things, I do think it'
Re: Reducing CPU load on git server
On 29.08.2016 at 07:47, Jeff King wrote:

> On Sun, Aug 28, 2016 at 12:42:52PM -0700, W. David Jarvis wrote:
>
>> The actual replication process works as follows:
>>
>> 1. The primary git server receives a push and sends a webhook with the details of the push (repo, ref, sha, some metadata) to a "publisher" box
>>
>> 2. The publisher enqueues the details of the webhook into a queue
>>
>> 3. A fleet of "subscriber" (replica) boxes each reads the payload of the enqueued message. Each of these then tries to either clone the repository if they don't already have it, or they run `git fetch`.
>
> So your load is probably really spiky, as you get thundering herds of fetchers after every push (the spikes may have a long flatline at the top, as it takes time to process the whole herd).

One solution I have heard about, in the context of web caches, to reduce the thundering herd problem (there caused by caches expiring at the same time in many clients) was to add some random or quasi-random distribution to the expiration time. In your situation adding a random delay with some specified deviation could help.

Note however that it is, I think, incompatible (to some extent) with the "caching" solution, where the 'thundering herd' gets served the same packfile. Or at least one solution can reduce the positive effect of the other.

>> 1. We currently run a blanket `git fetch` rather than specifically fetching the ref that was pushed. My understanding from poking around the git source code is that this causes the replication server to send a list of all of its ref tips to the primary server, and the primary server then has to verify and compare each of these tips to the ref tips residing on the server.

[...]

> There's nothing in upstream git to help smooth these loads, but since you mentioned GitHub Enterprise, I happen to know that it does have a system for coalescing multiple fetches into a single pack-objects.
> I _think_ it's in GHE 2.5, so you might check which version you're running (and possibly also talk to GitHub Support, who might have more advice; there are also tools for finding out which git processes are generating the most load, etc).

I wonder if this system for coalescing multiple fetches is something generic, or is it something specific to the GitHub / GitHub Enterprise architecture? If it is the former, would it be considered for upstreaming, and if so, when would it be in Git itself?

One thing to note: if you have repositories which are to have the same contents, you can distribute the pack-file to them and update references without going through Git. It can be done on push (push to master, distribute to mirrors), or as part of fetch (master fetches from central repository, distributes to mirrors). I think; I have never managed a large set of replicated Git repositories.

If mirrors can get out of sync, you would need to ensure that the repository doing the actual fetch / receiving the actual push is the least common denominator, that is, it looks like it is lagging behind all the other mirrors in the set. There is no problem if a repository gets a packfile with more objects than it needs.

>> In other words, let's imagine a world in which we ditch our current repo-level locking mechanism entirely. Let's also presume we move to fetching specific refs rather than using blanket fetches. Does that mean that if a fetch for ref A and a fetch for ref B are issued at roughly the exact same time, the two will be able to be executed at once without running into some git-internal locking mechanism on a granularity coarser than the ref? i.e. are fetch A and fetch B going to be blocked on the other's completion in any way? (let's presume that ref A and ref B are not parents of each other).
>
> Generally no, they should not conflict. Writes into the object database can happen simultaneously.
> Ref updates take a per-ref lock, so you
> should generally be able to write two unrelated refs at once. The big
> exception is that ref deletion required taking a repo-wide lock, but
> that presumably wouldn't be a problem for your case.

Doesn't Git avoid taking locks, and use lockless synchronization mechanisms (though possibly equivalent to locks)? I think it takes a lockfile to update the reflog together with the reference, but if reflogs are turned off (and I think they are off by default for bare repositories), a ref update uses an "atomic file write" (write + rename) and a compare-and-swap primitive. Updating a repository is lock-free: first update the repository's object database, then the reference.

That said, it might be that the per-repository global lock that you use is beneficial, limiting the amount of concurrent access; but it could also be detrimental, in that global-lock contention is the cause of stalls and latency.

>> The ultimate goal for us is just figuring out how we can best reduce
>> the CPU load on the primary instance so that we don't find ourselves
>> in a situation where we're not able to run basic git operations
>> anymore.
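For what it's worth, the random-delay idea could look something like the snippet below on each subscriber box. This is only a sketch; the jitter window and the remote name are assumptions for illustration, not anything from this thread:

```shell
# Sketch of the random-delay ("jitter") idea for a subscriber box.
# MAX_JITTER is an assumed tuning knob (a real deployment might use
# 30s or more; kept small here so the demo is quick).
MAX_JITTER=5
# Portable random integer in [0, MAX_JITTER): read two bytes from
# /dev/urandom and reduce them modulo the window.
delay=$(( $(od -An -N2 -tu2 /dev/urandom) % MAX_JITTER ))
sleep "$delay"
git fetch origin    # then fetch as usual (or a single-ref fetch)
```

Each subscriber picks its own delay, so the herd arrives spread over the window instead of all at once.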
Re: Reducing CPU load on git server
On Sun, Aug 28, 2016 at 12:42:52PM -0700, W. David Jarvis wrote:

> The actual replication process works as follows:
>
> 1. The primary git server receives a push and sends a webhook with the
> details of the push (repo, ref, sha, some metadata) to a "publisher"
> box
>
> 2. The publisher enqueues the details of the webhook into a queue
>
> 3. A fleet of "subscriber" (replica) boxes each reads the payload of
> the enqueued message. Each of these then tries to either clone the
> repository if they don't already have it, or they run `git fetch`.

So your load is probably really spiky, as you get thundering herds of fetchers after every push (the spikes may have a long flatline at the top, as it takes time to process the whole herd).

> 1. We currently run a blanket `git fetch` rather than specifically
> fetching the ref that was pushed. My understanding from poking around
> the git source code is that this causes the replication server to send
> a list of all of its ref tips to the primary server, and the primary
> server then has to verify and compare each of these tips to the ref
> tips residing on the server.

Yes, though I'd be surprised if this negotiation is that expensive in practice. In my experience it's generally not, and even if we ended up traversing every commit in the repository, that's on the order of a few seconds even for large, active repositories.

In my experience, the problem in a mass-fetch like this ends up being pack-objects preparing the packfile. It has to do a similar traversal, but _also_ look at all of the trees and blobs reachable from there, and then search for likely delta-compression candidates.

Do you know which processes are generating the load? git-upload-pack does the negotiation, and then pack-objects does the actual packing.

> My hypothesis is that moving to fetching the specific branch rather
> than doing a blanket fetch would have a significant and material
> impact on server load.

Maybe.
If pack-objects is where your load is coming from, then counter-intuitively things sometimes get _worse_ as you fetch less. The problem is that git will generally re-use deltas it has on disk when sending to the clients. But if the clients are missing some of the objects (because they don't fetch all of the branches), then we cannot use those deltas and may need to recompute new ones. So you might see some parts of the fetch get cheaper (negotiation, pack-objects' "Counting objects" phase), but "Compressing objects" gets more expensive. This is particularly noticeable with shallow fetches, which in my experience are much more expensive to serve.

Jakub mentioned bitmaps, and if you are using GitHub Enterprise, they are enabled. But they won't really help here. They are essentially cached information that git generates at repack time. But if we _just_ got a push, then the new objects to fetch won't be part of the cache, and we'll fall back to traversing them as normal. On the other hand, this should be a relatively small bit of history to traverse, so I'd doubt that "Counting objects" is that expensive in your case (but you should be able to get a rough sense by watching the progress meter during a fetch).

I'd suspect more that delta compression is expensive (we know we just got some new objects, but we don't know if we can make good deltas against the objects the client already has). That's a gut feeling, though.

If the fetch is small, that _also_ shouldn't be too expensive. But things add up when you have a large number of machines all making the same request at once. So it's entirely possible that the machine just gets hit with a lot of 5s CPU tasks all at once. If you only have a couple cores, that takes many multiples of 5s to clear out.

There's nothing in upstream git to help smooth these loads, but since you mentioned GitHub Enterprise, I happen to know that it does have a system for coalescing multiple fetches into a single pack-objects.
I _think_ it's in GHE 2.5, so you might check which version you're running (and possibly also talk to GitHub Support, who might have more advice; there are also tools for finding out which git processes are generating the most load, etc).

> In other words, let's imagine a world in which we ditch our current
> repo-level locking mechanism entirely. Let's also presume we move to
> fetching specific refs rather than using blanket fetches. Does that
> mean that if a fetch for ref A and a fetch for ref B are issued at
> roughly the exact same time, the two will be able to be executed at
> once without running into some git-internal locking mechanism on a
> granularity coarser than the ref? i.e. are fetch A and fetch B going
> to be blocked on the other's completion in any way? (let's presume
> that ref A and ref B are not parents of each other).

Generally no, they should not conflict. Writes into the object database can happen simultaneously. Ref updates take a per-ref lock, so you should generally be able to write two unrelated refs at once. The big exception is that ref deletion required taking a repo-wide lock, but that presumably wouldn't be a problem for your case.
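As an aside, here is a tiny self-contained sketch of the single-ref fetch being discussed. All names ("primary", "mirror", "topic") are made up for illustration; in the real setup the ref name would come from the webhook payload:

```shell
# Build a throwaway "primary" and a bare "mirror", land a new commit
# on one branch of the primary, then update just that one ref on the
# mirror instead of running a blanket "git fetch".
tmp=$(mktemp -d)
cd "$tmp"
git init -q primary
git -C primary -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m base
git -C primary branch topic
git clone -q --bare primary mirror
# Simulate a push: a new commit arrives on "topic" on the primary.
git -C primary checkout -q topic
git -C primary -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m update
# Subscriber-style single-ref fetch ("+" forces the ref update,
# matching force-push semantics):
git -C mirror fetch -q ../primary "+refs/heads/topic:refs/heads/topic"
git -C mirror rev-parse --verify refs/heads/topic
```

Only the one ref is negotiated and updated; the other refs on the mirror are left untouched.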
Re: Reducing CPU load on git server
My assumption is that pack bitmaps are enabled since the primary server is a GitHub Enterprise instance, but I'll have to confirm.

On Sun, Aug 28, 2016 at 2:20 PM, Jakub Narębski wrote:
> On 28.08.2016 at 21:42, W. David Jarvis wrote:
>
>> The ultimate goal for us is just figuring out how we can best reduce
>> the CPU load on the primary instance so that we don't find ourselves
>> in a situation where we're not able to run basic git operations
>> anymore.
>
> I assume that you have turned on pack bitmaps? See for example
> the "Counting Objects" blog post on the GitHub Engineering blog
> http://githubengineering.com/counting-objects/
>
> There are a few other articles there worth reading in your
> situation.
> --
> Jakub Narębski

--
venanti.us
203.918.2328

--
To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
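One quick way to confirm whether bitmaps are present (a sketch: repack writes the bitmap as a `*.bitmap` file next to the packfile, so its absence means bitmaps are not in use):

```shell
# Look for a reachability bitmap alongside the packfiles; prints the
# bitmap file if one exists, or a note if none is found.
ls .git/objects/pack/*.bitmap 2>/dev/null || echo "no bitmap found"
```

On a bare server-side repository the path would be `objects/pack/*.bitmap` rather than `.git/objects/pack/*.bitmap`.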
Re: Reducing CPU load on git server
On 28.08.2016 at 21:42, W. David Jarvis wrote:
> The ultimate goal for us is just figuring out how we can best reduce
> the CPU load on the primary instance so that we don't find ourselves
> in a situation where we're not able to run basic git operations
> anymore.

I assume that you have turned on pack bitmaps? See for example the "Counting Objects" blog post on the GitHub Engineering blog:
http://githubengineering.com/counting-objects/

There are a few other articles there worth reading in your situation.
--
Jakub Narębski
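For stock git (GHE manages this for you), bitmaps are written at repack time once `repack.writeBitmaps` is set. A self-contained sketch against a throwaway repository (the repo name is made up for illustration):

```shell
# Create a throwaway repo, enable bitmap writing, repack everything,
# and show the resulting .bitmap file next to the packfile.
tmp=$(mktemp -d)
git init -q "$tmp/demo"
git -C "$tmp/demo" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m one
git -C "$tmp/demo" config repack.writeBitmaps true
git -C "$tmp/demo" repack -adq    # -a: repack all objects into one pack
ls "$tmp/demo/.git/objects/pack/"*.bitmap
```

Bitmaps only speed up traversals of history that has already been repacked, which is why (as noted earlier in the thread) they don't help with objects that arrived in the most recent push.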