Re: Reducing CPU load on git server

2016-08-30 Thread Jeff King
On Mon, Aug 29, 2016 at 03:41:59PM -0700, W. David Jarvis wrote:

> We have an open support thread with the GHE support folks, but the
> only feedback we've gotten so far on this subject is that they believe
> our CPU load is being driven by the quantity of fetch operations (an
> average of 16 fetch requests per second during normal business hours,
> so 4 requests per second per subscriber box). About 4,000 fetch
> requests on our main repository per day.

Hmm. That might well be, but I'd have to see the numbers to say more. At
this point I think we're exhausting what is useful to talk about on the
Git list.  I'll point GHE Support at this thread, which might help your
conversation with them (and they might pull me in behind the scenes).

I'll try to answer any Git-specific questions here, though.

> > None of those operations is triggered by client fetches. You'll see
> > "rev-list" for a variety of operations, so that's hard to pinpoint. But
> > I'm surprised that "prune" is a common one for you. It is run as part of
> > the post-push, but I happen to know that the version that ships on GHE
> > is optimized to use bitmaps, and to avoid doing any work when there are
> > no loose objects that need pruning in the first place.
> 
> Would regular force-pushing trigger prune operations? Our engineering
> body loves to force-push.

No, it shouldn't make a difference. For stock git, "prune" will only be
run occasionally as part of "git gc". On GitHub Enterprise, every push
kicks off a "sync" job that moves objects from a specific fork into
storage shared by all of the related forks. So GHE will run prune more
often than stock git would, but force-pushing wouldn't have any effect
on that.

There are also some custom patches to optimize prune on GHE, so it
shouldn't generally be very expensive. Unless perhaps for some reason
the reachability bitmaps on your repository aren't performing very well.

You could try something like comparing:

  time git rev-list --objects --all >/dev/null

and

  time git rev-list --objects --all --use-bitmap-index >/dev/null

on your server. The second should be a lot faster. If it's not, that may
be an indication that Git could be doing a better job of selecting
bitmap commits (that code is not GitHub-specific at all).

> >> I might be misunderstanding this, but if the subscriber is already "up
> >> to date" modulo a single updated ref tip, then this problem shouldn't
> >> occur, right? Concretely: if ref A is built off of ref B, and the
> >> subscriber already has B when it requests A, that shouldn't cause this
> >> behavior, but it would cause this behavior if the subscriber didn't
> >> have B when it requested A.
> >
> > Correct. So this shouldn't be a thing you are running into now, but it's
> > something that might be made worse if you switch to fetching only single
> > refs.
> 
> But in theory if we were always up-to-date (since we'd always fetch
> any updated ref) we wouldn't run into this problem? We could have a
> cron job to ensure that we run a full git fetch every once in a while
> but I'd hope that if this was written properly we'd almost always have
> the most recent commit for any dependency ref.

It's a little more complicated than that. What you're really going for
is letting git reuse on-disk deltas when serving fetches. But depending
on when the last repack was run, we might be cobbling together the fetch
from multiple packs on disk, in which case there will still be some
delta search. In my experience that's not _generally_ a big deal,
though. Small fetches don't have that many deltas to find.

> Our primary repository is fairly busy. It has about 1/3 the commits of
> Linux and about 1/3 the refs, but seems otherwise to be on the same
> scale. And, of course, it both hasn't been around for as long as Linux
> has and has been experiencing exponential growth, which means its
> current activity is higher than it has ever been before -- which might
> put it on a similar scale to Linux's current activity.

Most of the work for repacking scales with the number of total objects
(not quite linearly, though).  For torvalds/linux (and its forks),
that's around 16 million objects. You might try "git count-objects -v"
on your server for comparison (but do it in the "network.git" directory,
as that's the shared object storage).
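
For example (the path is just a placeholder for wherever your appliance
keeps the shared network directory):

  cd /path/to/<repo>/network.git
  git count-objects -v
  # "count" is loose objects; "in-pack" is how many objects live in
  # packfiles, which is the number to compare against linux's ~16 million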

-Peff


Re: Reducing CPU load on git server

2016-08-29 Thread W. David Jarvis
> Pegging CPU for a few seconds doesn't sound out-of-place for
> pack-objects serving a fetch or clone on a large repository. And I can
> certainly believe "minutes", especially if it was not serving a fetch,
> but doing repository maintenance on a large repository.
>
> Talk to GitHub Enterprise support folks about what kind of process
> monitoring and accounting is available. Recent versions of GHE can
> easily tell things like which repositories and processes are using the
> most CPU, RAM, I/O, and network, which ones are seeing a lot of
> parallelism, etc.

We have an open support thread with the GHE support folks, but the
only feedback we've gotten so far on this subject is that they believe
our CPU load is being driven by the quantity of fetch operations (an
average of 16 fetch requests per second during normal business hours,
so 4 requests per second per subscriber box). About 4,000 fetch
requests on our main repository per day.

> None of those operations is triggered by client fetches. You'll see
> "rev-list" for a variety of operations, so that's hard to pinpoint. But
> I'm surprised that "prune" is a common one for you. It is run as part of
> the post-push, but I happen to know that the version that ships on GHE
> is optimized to use bitmaps, and to avoid doing any work when there are
> no loose objects that need pruning in the first place.

Would regular force-pushing trigger prune operations? Our engineering
body loves to force-push.

>> I might be misunderstanding this, but if the subscriber is already "up
>> to date" modulo a single updated ref tip, then this problem shouldn't
>> occur, right? Concretely: if ref A is built off of ref B, and the
>> subscriber already has B when it requests A, that shouldn't cause this
>> behavior, but it would cause this behavior if the subscriber didn't
>> have B when it requested A.
>
> Correct. So this shouldn't be a thing you are running into now, but it's
> something that might be made worse if you switch to fetching only single
> refs.

But in theory if we were always up-to-date (since we'd always fetch
any updated ref) we wouldn't run into this problem? We could have a
cron job to ensure that we run a full git fetch every once in a while
but I'd hope that if this was written properly we'd almost always have
the most recent commit for any dependency ref.
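
Something like this on each subscriber would presumably be enough as a
safety net (the mirror path is just a placeholder):

  # hourly full sync as a backstop to the webhook-driven per-ref fetches
  0 * * * *  git -C /data/mirrors/primary.git fetch --prune origin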

> That really sounds like repository maintenance. Repacks of
> torvalds/linux (including all of its forks) on github.com take ~15
> minutes of CPU. There may be some optimization opportunities there (I
> have a few things I'd like to explore in the next few months), but most
> of it is pretty fundamental. It literally takes a few minutes just to
> walk the entire object graph for that repo (that's one of the more
> extreme cases, of course, but presumably you are hosting some large
> repositories).
>
> Maintenance like that should be a very occasional operation, but it's
> possible that you have a very busy repo.

Our primary repository is fairly busy. It has about 1/3 the commits of
Linux and about 1/3 the refs, but seems otherwise to be on the same
scale. And, of course, it both hasn't been around for as long as Linux
has and has been experiencing exponential growth, which means its
current activity is higher than it has ever been before -- which might
put it on a similar scale to Linux's current activity.

> OK, I double-checked, and your version should be coalescing identical
> fetches.
>
> Given that, and that a lot of the load you mentioned above is coming
> from non-fetch sources, it sounds like switching anything with your
> replica fetch strategy isn't likely to help much. And a multi-tiered
> architecture won't help if the load is being generated by requests that
> are serving the web-views directly on the box.
>
> I'd really encourage you to talk with GitHub Support about performance
> and clustering. It sounds like there may be some GitHub-specific things
> to tweak. And it may be that the load is just too much for a single
> machine, and would benefit from spreading the load across multiple git
> servers.

What surprised us is that we had been running this on an r3.4xlarge
(16 vCPU) on AWS for two years without too much issue. Then in a span
of months we started experiencing massive CPU load, which forced us to
upgrade the box to one with 32 vCPU (and better CPUs). We just don't
understand what the precise driver of load is here.

As noted above, we are talking with GitHub about performance -- we've
also been pushing them to start working on a clustering plan, but my
impression has been that they're reluctant to go down that path. I
suspect that we use GHE much more aggressively than the typical GHE
client, but I could be wrong about that.

 - V

-- 

venanti.us
203.918.2328



Re: Reducing CPU load on git server

2016-08-29 Thread Dennis Kaarsemaker
On Mon, 2016-08-29 at 13:57 -0700, W. David Jarvis wrote:

> >  * If you do need branches consider archiving stale tags/branches
> > after some time. I implemented this where I work, we just have a
> > $REPO-archive.git with every tag/branch ever created for a given
> > $REPO.git, and delete refs after a certain time.
> 
> This is something else that we're actively considering. Why did your
> company implement this -- was it to reduce load, or just to clean up
> your repositories? Did you notice any change in server load?

At 50k refs, ref negotiation gets expensive, especially over HTTP.

D. 


Re: Reducing CPU load on git server

2016-08-29 Thread Jeff King
On Mon, Aug 29, 2016 at 12:16:20PM -0700, W. David Jarvis wrote:

> > Do you know which processes are generating the load? git-upload-pack
> > does the negotiation, and then pack-objects does the actual packing.
> 
> When I look at expensive operations (ones that I can see consuming
> 90%+ of a CPU for more than a second), there are often pack-objects
> processes running that will consume an entire core for multiple
> seconds (I also saw one pack-object counting process run for several
> minutes while using up a full core).

Pegging CPU for a few seconds doesn't sound out-of-place for
pack-objects serving a fetch or clone on a large repository. And I can
certainly believe "minutes", especially if it was not serving a fetch,
but doing repository maintenance on a large repository.

Talk to GitHub Enterprise support folks about what kind of process
monitoring and accounting is available. Recent versions of GHE can
easily tell things like which repositories and processes are using the
most CPU, RAM, I/O, and network, which ones are seeing a lot of
parallelism, etc.

> rev-list shows up as a pretty active CPU consumer, as do prune and
> blame-tree.
> 
> I'd say overall that in terms of high-CPU consumption activities,
> `prune` and `rev-list` show up the most frequently.

None of those operations is triggered by client fetches. You'll see
"rev-list" for a variety of operations, so that's hard to pinpoint. But
I'm surprised that "prune" is a common one for you. It is run as part of
the post-push, but I happen to know that the version that ships on GHE
is optimized to use bitmaps, and to avoid doing any work when there are
no loose objects that need pruning in the first place.

Blame-tree is a GitHub-specific command (it feeds the main repository
view page), and is a known CPU hog. There's more clever caching for that
coming down the pipe, but it's not shipped yet.

> On the subject of prune - I forgot to mention that the `git fetch`
> calls from the subscribers are running `git fetch --prune`. I'm not
> sure if that changes the projected load profile.

That shouldn't change anything; the pruning is purely a client side
thing.

> > Maybe. If pack-objects is where your load is coming from, then
> > counter-intuitively things sometimes get _worse_ as you fetch less. The
> > problem is that git will generally re-use deltas it has on disk when
> > sending to the clients. But if the clients are missing some of the
> > objects (because they don't fetch all of the branches), then we cannot
> > use those deltas and may need to recompute new ones. So you might see
> > some parts of the fetch get cheaper (negotiation, pack-object's
> > "Counting objects" phase), but "Compressing objects" gets more
> > expensive.
> 
> I might be misunderstanding this, but if the subscriber is already "up
> to date" modulo a single updated ref tip, then this problem shouldn't
> occur, right? Concretely: if ref A is built off of ref B, and the
> subscriber already has B when it requests A, that shouldn't cause this
> behavior, but it would cause this behavior if the subscriber didn't
> have B when it requested A.

Correct. So this shouldn't be a thing you are running into now, but it's
something that might be made worse if you switch to fetching only single
refs.

> See comment above about a long-running counting objects process. I
> couldn't tell which of our repositories it was counting, but it went
> for about 3 minutes with full core utilization. I didn't see
> pack-objects counting frequently, but it's an expensive operation.

That really sounds like repository maintenance. Repacks of
torvalds/linux (including all of its forks) on github.com take ~15
minutes of CPU. There may be some optimization opportunities there (I
have a few things I'd like to explore in the next few months), but most
of it is pretty fundamental. It literally takes a few minutes just to
walk the entire object graph for that repo (that's one of the more
extreme cases, of course, but presumably you are hosting some large
repositories).

Maintenance like that should be a very occasional operation, but it's
possible that you have a very busy repo.

> > There's nothing in upstream git to help smooth these loads, but since
> > you mentioned GitHub Enterprise, I happen to know that it does have a
> > system for coalescing multiple fetches into a single pack-objects. I
> > _think_ it's in GHE 2.5, so you might check which version you're
> > running (and possibly also talk to GitHub Support, who might have more
> > advice; there are also tools for finding out which git processes are
> > generating the most load, etc).
> 
> We're on 2.6.4 at the moment.

OK, I double-checked, and your version should be coalescing identical
fetches.

Given that, and that a lot of the load you mentioned above is coming
from non-fetch sources, it sounds like switching anything with your
replica fetch strategy isn't likely to help much. And a multi-tiered
architecture won't help if the load is being generated by requests that
are serving the web-views directly on the box.

Re: Reducing CPU load on git server

2016-08-29 Thread W. David Jarvis
>  * Consider having that queue of yours just send the pushed payload
> instead of "pull this", see git-bundle. This can turn this sync entire
> thing into a static file distribution problem.

As far as I know, GHE doesn't support this out of the box. We've asked
them for essentially this, though. Due to the nature of our license we
may not be able to configure something like this on the server
instance ourselves.

>  * It's not clear from your post why you have to worry about all these
> branches, surely your Chef instances just need the "master" branch,
> just push that around.

We allow deployments from non-master branches, so we do need multiple
branches. We also use the replication fleet as the target for our
build system, which needs to be able to build essentially any branch
on any repository.

>  * If you do need branches consider archiving stale tags/branches
> after some time. I implemented this where I work, we just have a
> $REPO-archive.git with every tag/branch ever created for a given
> $REPO.git, and delete refs after a certain time.

This is something else that we're actively considering. Why did your
company implement this -- was it to reduce load, or just to clean up
your repositories? Did you notice any change in server load?

>  * If your problem is that you're CPU bound on the master have you
> considered maybe solving this with something like NFS, i.e. replace
> your ad-hoc replication with just a bunch of "slave" boxes that mount
> the remote filesystem.

This is definitely an interesting idea. It'd be a significant
architectural change, though, and not one I'm sure we'd be able to get
support for.

>  * Or, if you're willing to deal with occasional transitory repo
> corruption (just retry?): rsync.

I think this is a cost we're not necessarily okay with having to deal with.

>  * There's no reason why your replication chain needs to be
> single-level if master CPU is really the issue. You could have master
> -> N slaves -> N*X slaves, or some combination thereof.

This was discussed above - if the primary driver of load is the first
fetch, then moving to a multi-tiered architecture will not solve our
problems.

>  * Does it really even matter that your "slave" machines are all
> up-to-date? We have something similar at work but it's just a minutely
> cronjob that does "git fetch" on some repos, since the downstream
> thing (e.g. the chef run) doesn't run more than once every 30m or
> whatever anyway.

It does, because we use the replication fleet for our build server.

 - V

-- 

venanti.us
203.918.2328



Re: Reducing CPU load on git server

2016-08-29 Thread Ævar Arnfjörð Bjarmason
On Sun, Aug 28, 2016 at 9:42 PM, W. David Jarvis wrote:
> I've run into a problem that I'm looking for some help with. Let me
> describe the situation, and then some thoughts.

Just a few points that you may not have considered, and I didn't see
mentioned in this thread:

 * Consider having that queue of yours just send the pushed payload
instead of "pull this", see git-bundle. This can turn this sync entire
thing into a static file distribution problem.

 * It's not clear from your post why you have to worry about all these
branches, surely your Chef instances just need the "master" branch,
just push that around.

 * If you do need branches consider archiving stale tags/branches
after some time. I implemented this where I work, we just have a
$REPO-archive.git with every tag/branch ever created for a given
$REPO.git, and delete refs after a certain time.

 * If your problem is that you're CPU bound on the master have you
considered maybe solving this with something like NFS, i.e. replace
your ad-hoc replication with just a bunch of "slave" boxes that mount
the remote filesystem.

 * Or, if you're willing to deal with occasional transitory repo
corruption (just retry?): rsync.

 * There's no reason why your replication chain needs to be
single-level if master CPU is really the issue. You could have master
-> N slaves -> N*X slaves, or some combination thereof.

 * Does it really even matter that your "slave" machines are all
up-to-date? We have something similar at work but it's just a minutely
cronjob that does "git fetch" on some repos, since the downstream
thing (e.g. the chef run) doesn't run more than once every 30m or
whatever anyway.
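
On the archiving point above, roughly (the "archive" remote points at
$REPO-archive.git; the names and the cutoff are whatever you pick):

  # mirror every branch and tag into the archive repository first...
  git push archive '+refs/heads/*:refs/heads/*' '+refs/tags/*:refs/tags/*'
  # ...then delete the refs that have gone stale in $REPO.git
  git push origin --delete some-stale-branch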


Re: Reducing CPU load on git server

2016-08-29 Thread W. David Jarvis
> So your load is probably really spiky, as you get thundering herds of
> fetchers after every push (the spikes may have a long flatline at the
> top, as it takes time to process the whole herd).

It is quite spiky, yes. At the moment, however, the replication fleet
is relatively small (at the moment it's just 4 machines). We had 6
machines earlier this month and we had hoped that terminating two of
them would lead to a material drop in CPU usage, but we didn't see a
really significant reduction.

> Yes, though I'd be surprised if this negotiation is that expensive in
> practice. In my experience it's not generally, and even if we ended up
> traversing every commit in the repository, that's on the order of a few
> seconds even for large, active repositories.
>
> In my experience, the problem in a mass-fetch like this ends up being
> pack-objects preparing the packfile. It has to do a similar traversal,
> but _also_ look at all of the trees and blobs reachable from there, and
> then search for likely delta-compression candidates.
>
> Do you know which processes are generating the load? git-upload-pack
> does the negotiation, and then pack-objects does the actual packing.

When I look at expensive operations (ones that I can see consuming
90%+ of a CPU for more than a second), there are often pack-objects
processes running that will consume an entire core for multiple
seconds (I also saw one pack-objects counting process run for several
minutes while using up a full core). rev-list shows up as a pretty
active CPU consumer, as do prune and blame-tree.

I'd say overall that in terms of high-CPU consumption activities,
`prune` and `rev-list` show up the most frequently.

On the subject of prune - I forgot to mention that the `git fetch`
calls from the subscribers are running `git fetch --prune`. I'm not
sure if that changes the projected load profile.

> Maybe. If pack-objects is where your load is coming from, then
> counter-intuitively things sometimes get _worse_ as you fetch less. The
> problem is that git will generally re-use deltas it has on disk when
> sending to the clients. But if the clients are missing some of the
> objects (because they don't fetch all of the branches), then we cannot
> use those deltas and may need to recompute new ones. So you might see
> some parts of the fetch get cheaper (negotiation, pack-object's
> "Counting objects" phase), but "Compressing objects" gets more
> expensive.

I might be misunderstanding this, but if the subscriber is already "up
to date" modulo a single updated ref tip, then this problem shouldn't
occur, right? Concretely: if ref A is built off of ref B, and the
subscriber already has B when it requests A, that shouldn't cause this
behavior, but it would cause this behavior if the subscriber didn't
have B when it requested A.

> This is particularly noticeable with shallow fetches, which in my
> experience are much more expensive to serve.

I don't think we're doing shallow fetches anywhere in this system.

> Jakub mentioned bitmaps, and if you are using GitHub Enterprise, they
> are enabled. But they won't really help here. They are essentially
> cached information that git generates at repack time. But if we _just_
> got a push, then the new objects to fetch won't be part of the cache,
> and we'll fall back to traversing them as normal.  On the other hand,
> this should be a relatively small bit of history to traverse, so I'd
> doubt that "Counting objects" is that expensive in your case (but you
> should be able to get a rough sense by watching the progress meter
> during a fetch).

See comment above about a long-running counting objects process. I
couldn't tell which of our repositories it was counting, but it went
for about 3 minutes with full core utilization. I didn't see
pack-objects counting frequently, but it's an expensive operation.

> I'd suspect more that delta compression is expensive (we know we just
> got some new objects, but we don't know if we can make good deltas
> against the objects the client already has). That's a gut feeling,
> though.
>
> If the fetch is small, that _also_ shouldn't be too expensive. But
> things add up when you have a large number of machines all making the
> same request at once. So it's entirely possible that the machine just
> gets hit with a lot of 5s CPU tasks all at once. If you only have a
> couple cores, that takes many multiples of 5s to clear out.

I think this would show up if I was sitting around running `top` on
the machine, but that doesn't end up being what I see. That might just
be a function of there being a relatively small number of replication
machines, I'm not sure. But I'm not noticing 4 of the same tasks get
spawned simultaneously, which says to me that we're either utilizing a
cache or there's some locking behavior involved.

> There's nothing in upstream git to help smooth these loads, but since
> you mentioned GitHub Enterprise, I happen to know that it does have a
> system for coalescing multiple fetches into a single pack-objects.

Re: Reducing CPU load on git server

2016-08-29 Thread Jeff King
On Mon, Aug 29, 2016 at 12:46:27PM +0200, Jakub Narębski wrote:

> > So your load is probably really spiky, as you get thundering herds of
> > fetchers after every push (the spikes may have a long flatline at the
> > top, as it takes time to process the whole herd).
> 
> One solution I have heard about, in the context of web cache, to reduce
> the thundering herd problem (there caused by cache expiring at the same
> time in many clients) was to add some random or quasi-random distribution
> to expiration time.  In your situation adding a random delay with some
> specified deviation could help.

That smooths the spikes, but you still have to serve all of the requests
eventually. So if your problem is that the load spikes and the system
slows to a crawl as a result (or runs out of RAM, etc), then
distributing the load helps. But if you have enough load that your
system is constantly busy, queueing the load in a different order just
shifts it around.

GHE will also introduce delays into starting git when load spikes, but
that's a separate system from the one that coalesces identical requests.

> I wonder if this system for coalescing multiple fetches is something
> generic, or is it something specific to GitHub / GitHub Enterprise
> architecture?  If it is the former, would it be considered for
> upstreaming, and if so, when would it be in Git itself?

I've already sent upstream the patch for a "hook" that sits between
upload-pack and pack-objects (and it will be in v2.10). So that can call
an arbitrary script which can then make scheduling policy for
pack-objects, coalesce similar requests, etc.
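
The config is uploadpack.packObjectsHook (it has to be set in system or
global config; it's ignored in repository-level config as a safety
measure). A minimal sketch of a wrapper, with made-up paths:

  git config --system uploadpack.packObjectsHook /usr/local/bin/pack-objects-wrap

and then something like:

  #!/bin/sh
  # upload-pack appends the pack-objects command and its arguments to the
  # hook, so after logging (or queueing, coalescing, etc.) we just run the
  # real command, passing stdin and stdout straight through.
  echo "$(date) $*" >>/var/log/git-pack-objects.log
  exec "$@"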

GHE has a generic tool for coalescing program invocations that is not
Git-specific at all (it compares its stdin and command line arguments to
decide when two requests are identical, runs the command on its
arguments, and then passes the output to all members of the herd). That
_might_ be open-sourced in the future, but I don't have a specific
timeline.

> One thing to note: if you have repositories which are to have the
> same contents, you can distribute the pack-file to them and update
> references without going through Git.  It can be done on push
> (push to master, distribute to mirrors), or as part of fetch
> (master fetches from central repository, distributes to mirrors).
> I think; I have never managed a large set of replicated Git repositories.

Doing it naively has some gotchas, because you want to make sure you
have all of the necessary objects. But if you are going this route,
probably distributing a git-bundle is the simplest way.
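
Roughly, with placeholder refs and paths:

  # on the primary, after a push: bundle everything since the old tip
  git bundle create /tmp/push-1234.bundle <old-tip>..refs/heads/master

  # on each subscriber: check prerequisites, then fetch from the file
  git bundle verify /tmp/push-1234.bundle
  git fetch /tmp/push-1234.bundle refs/heads/master:refs/heads/master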

> > Generally no, they should not conflict. Writes into the object database
> > can happen simultaneously. Ref updates take a per-ref lock, so you
> > should generally be able to write two unrelated refs at once. The big
> > exception is that ref deletion required taking a repo-wide lock, but
> > that presumably wouldn't be a problem for your case.
> 
> Doesn't Git avoid taking locks, and use lockless synchronization
> mechanisms (though possibly equivalent to locks)?  I think it takes
> a lockfile to update the reflog together with the reference, but if
> reflogs are turned off (and I think they are off for bare repositories
> by default), a ref update uses "atomic file write" (write + rename)
> and a compare-and-swap primitive.  Updating the repository is lock-free:
> first update the repository's object database, then the reference.

There is a lockfile to make the compare-and-swap atomic, but yes, it's
fundamentally based around the compare-and-swap. I don't think that
matters to the end user though. Fundamentally they will see "I hoped to
move from X to Y, but somebody else wrote Z, aborting", which is the
same as "I did not win the lock race, aborting".

The point is that updating two different refs is generally independent,
and updating the same ref is not.
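
For illustration, that compare-and-swap is what "git update-ref" does when
you give it an expected old value (the shas below are placeholders):

  # succeeds only if refs/heads/topic-a currently points at <old-sha>;
  # otherwise it aborts -- the "did not win the race" case above
  git update-ref refs/heads/topic-a <new-sha> <old-sha>

  # an update to a different ref is independent of the one above
  git update-ref refs/heads/topic-b <new-sha2> <old-sha2>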

> I guess that trying to replicate DGit approach that GitHub uses, see
> "Introducing DGit" (http://githubengineering.com/introducing-dgit)
> is currently out of question?

Minor nitpick (that you don't even have any way of knowing about, so
maybe more of a public service announcement). GitHub will stop using the
"DGit" name because it's too confusingly similar to "Git" (and "Git" is
trademarked by the project). There's a new blog post coming that
mentions the name change, and that historic one will have a note added
retroactively. The new name is "GitHub Spokes" (get it, Hub, Spokes?).

But in response to your question, I'll caution that replicating it is a
lot of work. :)

Since the original problem report mentions GHE, I'll note that newer
versions of GHE do support clustering and can share the git load across
multiple Spokes servers. So in theory that could make the replica layer
go away entirely, because it all happens behind the scenes.

-Peff

PS Sorry, I generally try to avoid hawking GitHub wares on the list, but
   since the OP mentioned GHE specifically, and because there aren't
   really generic solutions to most of these things, I do think it'

Re: Reducing CPU load on git server

2016-08-29 Thread Jakub Narębski
On 29.08.2016 at 07:47, Jeff King wrote:
> On Sun, Aug 28, 2016 at 12:42:52PM -0700, W. David Jarvis wrote:
> 
>> The actual replication process works as follows:
>>
>> 1. The primary git server receives a push and sends a webhook with the
>> details of the push (repo, ref, sha, some metadata) to a "publisher"
>> box
>>
>> 2. The publisher enqueues the details of the webhook into a queue
>>
>> 3. A fleet of "subscriber" (replica) boxes each reads the payload of
>> the enqueued message. Each of these then tries to either clone the
>> repository if they don't already have it, or they run `git fetch`.
> 
> So your load is probably really spiky, as you get thundering herds of
> fetchers after every push (the spikes may have a long flatline at the
> top, as it takes time to process the whole herd).

One solution I have heard about, in the context of web cache, to reduce
the thundering herd problem (there caused by cache expiring at the same
time in many clients) was to add some random or quasi-random distribution
to expiration time.  In your situation adding a random delay with some
specified deviation could help.
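
Even something as simple as this on each subscriber, before the fetch,
would spread the herd out (just a sketch, using bash's $RANDOM):

  # wait a random 0-29 seconds so the fetches don't all hit at once
  sleep $(( RANDOM % 30 ))
  git fetch --prune origin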

Note, however, that it is, I think, incompatible (to some extent) with
the "caching" solution, where the 'thundering herd' gets served the same
packfile.  Or at least one solution can reduce the positive effect
of the other.

>> 1. We currently run a blanket `git fetch` rather than specifically
>> fetching the ref that was pushed. My understanding from poking around
>> the git source code is that this causes the replication server to send
>> a list of all of its ref tips to the primary server, and the primary
>> server then has to verify and compare each of these tips to the ref
>> tips residing on the server.

[...]
> There's nothing in upstream git to help smooth these loads, but since
> you mentioned GitHub Enterprise, I happen to know that it does have a
> system for coalescing multiple fetches into a single pack-objects. I
> _think_ it's in GHE 2.5, so you might check which version you're
> running (and possibly also talk to GitHub Support, who might have more
> advice; there are also tools for finding out which git processes are
> generating the most load, etc).

I wonder if this system for coalescing multiple fetches is something
generic, or is it something specific to GitHub / GitHub Enterprise
architecture?  If it is the former, would it be considered for
upstreaming, and if so, when would it be in Git itself?


One thing to note: if you have repositories which are to have the
same contents, you can distribute the pack-file to them and update
references without going through Git.  It can be done on push
(push to master, distribute to mirrors), or as part of fetch
(master fetches from central repository, distributes to mirrors).
I think; I have never managed a large set of replicated Git repositories.

If mirrors can get out of sync, you would need to ensure that the
repository doing the actual fetch / receiving the actual push is
a least common denominator, that is, it looks like it is lagging behind
all other mirrors in the set.  There is no problem if a repository gets
a packfile with more objects than it needs.

>> In other words, let's imagine a world in which we ditch our current
>> repo-level locking mechanism entirely. Let's also presume we move to
>> fetching specific refs rather than using blanket fetches. Does that
>> mean that if a fetch for ref A and a fetch for ref B are issued at
>> roughly the exact same time, the two will be able to be executed at
>> once without running into some git-internal locking mechanism on a
>> granularity coarser than the ref? i.e. are fetch A and fetch B going
>> to be blocked on the other's completion in any way? (let's presume
>> that ref A and ref B are not parents of each other).
> 
> Generally no, they should not conflict. Writes into the object database
> can happen simultaneously. Ref updates take a per-ref lock, so you
> should generally be able to write two unrelated refs at once. The big
> exception is that ref deletion required taking a repo-wide lock, but
> that presumably wouldn't be a problem for your case.

Doesn't Git avoid taking locks, and use lockless synchronization
mechanisms (though possibly equivalent to locks)?  I think it takes
a lockfile to update the reflog together with the reference, but if
reflogs are turned off (and I think they are off for bare repositories
by default), a ref update uses "atomic file write" (write + rename)
and a compare-and-swap primitive.  Updating the repository is lock-free:
first update the repository's object database, then the reference.

That said, it might be that the per-repository global lock that you
use is beneficial, limiting the amount of concurrent access; but
it could also be detrimental, if global-lock contention is the cause
of stalls and latency.

>> The ultimate goal for us is just figuring out how we can best reduce
>> the CPU load on the primary instance so that we don't find ourselves
>> in a situation where we're not able to run basic git operations
>> anymore.

Re: Reducing CPU load on git server

2016-08-28 Thread Jeff King
On Sun, Aug 28, 2016 at 12:42:52PM -0700, W. David Jarvis wrote:

> The actual replication process works as follows:
> 
> 1. The primary git server receives a push and sends a webhook with the
> details of the push (repo, ref, sha, some metadata) to a "publisher"
> box
> 
> 2. The publisher enqueues the details of the webhook into a queue
> 
> 3. A fleet of "subscriber" (replica) boxes each reads the payload of
> the enqueued message. Each of these then tries to either clone the
> repository if they don't already have it, or they run `git fetch`.

So your load is probably really spiky, as you get thundering herds of
fetchers after every push (the spikes may have a long flatline at the
top, as it takes time to process the whole herd).

> 1. We currently run a blanket `git fetch` rather than specifically
> fetching the ref that was pushed. My understanding from poking around
> the git source code is that this causes the replication server to send
> a list of all of its ref tips to the primary server, and the primary
> server then has to verify and compare each of these tips to the ref
> tips residing on the server.

Yes, though I'd be surprised if this negotiation is that expensive in
practice. In my experience it's not generally, and even if we ended up
traversing every commit in the repository, that's on the order of a few
seconds even for large, active repositories.

In my experience, the problem in a mass-fetch like this ends up being
pack-objects preparing the packfile. It has to do a similar traversal,
but _also_ look at all of the trees and blobs reachable from there, and
then search for likely delta-compression candidates.

Do you know which processes are generating the load? git-upload-pack
does the negotiation, and then pack-objects does the actual packing.

> My hypothesis is that moving to fetching the specific branch rather
> than doing a blanket fetch would have a significant and material
> impact on server load.

Maybe. If pack-objects is where your load is coming from, then
counter-intuitively things sometimes get _worse_ as you fetch less. The
problem is that git will generally re-use deltas it has on disk when
sending to the clients. But if the clients are missing some of the
objects (because they don't fetch all of the branches), then we cannot
use those deltas and may need to recompute new ones. So you might see
some parts of the fetch get cheaper (negotiation, pack-object's
"Counting objects" phase), but "Compressing objects" gets more
expensive.

This is particularly noticeable with shallow fetches, which in my
experience are much more expensive to serve.

Jakub mentioned bitmaps, and if you are using GitHub Enterprise, they
are enabled. But they won't really help here. They are essentially
cached information that git generates at repack time. But if we _just_
got a push, then the new objects to fetch won't be part of the cache,
and we'll fall back to traversing them as normal.  On the other hand,
this should be a relatively small bit of history to traverse, so I'd
doubt that "Counting objects" is that expensive in your case (but you
should be able to get a rough sense by watching the progress meter
during a fetch).

I'd suspect more that delta compression is expensive (we know we just
got some new objects, but we don't know if we can make good deltas
against the objects the client already has). That's a gut feeling,
though.

If the fetch is small, that _also_ shouldn't be too expensive. But
things add up when you have a large number of machines all making the
same request at once. So it's entirely possible that the machine just
gets hit with a lot of 5s CPU tasks all at once. If you only have a
couple cores, that takes many multiples of 5s to clear out.

There's nothing in upstream git to help smooth these loads, but since
you mentioned GitHub Enterprise, I happen to know that it does have a
system for coalescing multiple fetches into a single pack-objects. I
_think_ it's in GHE 2.5, so you might check which version you're
running (and possibly also talk to GitHub Support, who might have more
advice; there are also tools for finding out which git processes are
generating the most load, etc).

> In other words, let's imagine a world in which we ditch our current
> repo-level locking mechanism entirely. Let's also presume we move to
> fetching specific refs rather than using blanket fetches. Does that
> mean that if a fetch for ref A and a fetch for ref B are issued at
> roughly the exact same time, the two will be able to be executed at
> once without running into some git-internal locking mechanism on a
> granularity coarser than the ref? i.e. are fetch A and fetch B going
> to be blocked on the other's completion in any way? (let's presume
> that ref A and ref B are not parents of each other).

Generally no, they should not conflict. Writes into the object database
can happen simultaneously. Ref updates take a per-ref lock, so you
should generally be able to write two unrelated refs at once. The big
exception is that ref deletion required taking a repo-wide lock, but
that presumably wouldn't be a problem for your case.

Re: Reducing CPU load on git server

2016-08-28 Thread W. David Jarvis
My assumption is that pack bitmaps are enabled since the primary
server is a GitHub Enterprise instance, but I'll have to confirm.

On Sun, Aug 28, 2016 at 2:20 PM, Jakub Narębski wrote:
> On 28.08.2016 at 21:42, W. David Jarvis wrote:
>
>> The ultimate goal for us is just figuring out how we can best reduce
>> the CPU load on the primary instance so that we don't find ourselves
>> in a situation where we're not able to run basic git operations
>> anymore.
>
> I assume that you have turned on pack bitmaps?  See for example
> "Counting Objects" blog post on GitHub Engineering blog
> http://githubengineering.com/counting-objects/
>
> There are a few other articles there worth reading in your
> situation.
> --
> Jakub Narębski



-- 

venanti.us
203.918.2328



Re: Reducing CPU load on git server

2016-08-28 Thread Jakub Narębski
On 28.08.2016 at 21:42, W. David Jarvis wrote:

> The ultimate goal for us is just figuring out how we can best reduce
> the CPU load on the primary instance so that we don't find ourselves
> in a situation where we're not able to run basic git operations
> anymore.

I assume that you have turned on pack bitmaps?  See for example
"Counting Objects" blog post on GitHub Engineering blog
http://githubengineering.com/counting-objects/

There are a few other articles there worth reading in your
situation.
-- 
Jakub Narębski