On Mon, Oct 02, 2023 at 12:20:31PM +0100, George Dunlap wrote:
> On Sun, Oct 1, 2023 at 12:28 AM Demi Marie Obenour
> <[email protected]> wrote:
> >
> > On Fri, Sep 29, 2023 at 05:42:16PM +0100, George Dunlap wrote:
> > > The basic credit2 algorithm goes something like this:
> > >
> > > 1. All vcpus start with the same number of credits; about 10ms worth
> > > if everyone has the same weight
> > >
> > > 2. vcpus burn credits as they consume cpu, based on the relative
> > > weights: higher weights burn slower, lower weights burn faster
> > >
> > > 3. At any given point in time, the runnable vcpu with the highest
> > > credit is allowed to run
> > >
> > > 4. When the "next runnable vcpu" on a runqueue is negative, credit is
> > > reset: everyone gets another 10ms, and can carry over at most 2ms of
> > > credit over the reset.
> >
> > One relevant aspect of Qubes OS is that it is very, very heavily
> > oversubscribed: having more VMs running than physical CPUs is (at least
> > in my usage) not uncommon, and each of those VMs will typically have at
> > least two vCPUs.  With a credit of 10ms and 36 vCPUs, I could easily see
> > a vCPU not being allowed to execute for 200ms or more.  For audio or
> > video workloads, this is a disaster.
> >
> > 10ms is a LOT for desktop workloads or for anyone who cares about
> > latency.  At 60Hz it is 3/5 of a frame, and with a 120Hz monitor and a
> > heavily contended system frame drops are guaranteed.
>
> You'd probably benefit from understanding better how the various
> algorithms actually work.  I'm sorry I don't have any really good
> "virtualization scheduling for dummies" resources; the best I have is
> a few talks I gave on the subject; e.g.:
>
> https://www.youtube.com/watch?v=C3jjvkr6fgQ
>
> For one, when I say "oversubscribed", I don't mean "vcpus / pcpus"; I
> mean "requested vcpu execution time / pcpus".  If you have 18 vcpus on
> a single pcpu, and all of them *on an empty system* would have run at
> 5%, you're totally fine.  If you have 18 vcpus on a single pcpu, and
> all of them on an empty system would have averaged 100%, there's only
> so much the scheduler can do to avoid problems.
If each vCPU would have spent 4% of its time doing realtime tasks, it should
be possible to give all of the realtime tasks all the time they need, while
the remaining 100 - 4 * 18 = 28% of the pCPU's time is available to non-realtime
tasks. That’s not awesome, but it might be enough to prevent audio from
glitching.
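
Spelling that arithmetic out (nothing scheduler-specific, just the numbers
above; percentages are of the single pCPU in George's example):

/* Non-realtime headroom left on one pCPU after the realtime work. */
static int non_realtime_headroom_pct(int n_vcpus, int rt_pct_per_vcpu)
{
    int rt_total = n_vcpus * rt_pct_per_vcpu;   /* 18 * 4 = 72 */
    return 100 - rt_total;                      /* 100 - 72 = 28 */
}

non_realtime_headroom_pct(18, 4) gives the 28% figure above.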
> Secondly, while on credit1 a vcpu is allowed to run for 10ms without
> stopping (and then must wait for 18x that time to get the same credit
> back, if there are 18 other vcpus running on that same pcpu), this is
> not the case for credit2. The exact calculation can be found in
> xen/common/sched/credit2.c:csched2_runtime(), but here's the general
> algorithm from the comment:
>
> /* General algorithm:
> * 1) Run until snext's credit will be 0.
> * 2) But if someone is waiting, run until snext's credit is equal
> * to his.
> * 3) But, if we are capped, never run more than our budget.
> * 4) And never run longer than MAX_TIMER or shorter than MIN_TIMER or
> * the ratelimit time.
> */
>
> Default MIN_TIMER is 500us, and is configurable via sysctl; default
> MAX_TIMER is... hmm, I'm pretty sure this started out as 2ms, but now
> it seems to be 10ms. Looks like this was changed in da92ec5bd1 ("xen:
> credit2: "relax" CSCHED2_MAX_TIMER") in 2016. (MAX_TIMER isn't
> configurable, but arguably it should be; and making it configurable
> should just be a matter of duplicating the logic around MIN_TIMER.)
Maybe MAX_TIMER should be lowered to e.g. 1ms?
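
For my own understanding, here is how I read that algorithm.  This is only
a sketch of the comment above, not the actual csched2_runtime() code: the
names are made up, and Xen's s_time_t plus the CSCHED2_MIN_TIMER /
CSCHED2_MAX_TIMER constants are assumed.

/* Sketch of the quoted algorithm; illustrative only. */
static s_time_t runtime_sketch(s_time_t snext_credit, s_time_t waiter_credit,
                               bool someone_waiting, bool capped,
                               s_time_t budget, s_time_t ratelimit)
{
    /* 1) Run until snext's credit would reach 0. */
    s_time_t time = snext_credit;

    /* 2) But if someone is waiting, only run until our credit drops to
     *    theirs. */
    if ( someone_waiting && snext_credit - waiter_credit < time )
        time = snext_credit - waiter_credit;

    /* 3) If capped, never run past the remaining budget. */
    if ( capped && budget < time )
        time = budget;

    /* 4) Never shorter than MIN_TIMER or the ratelimit, never longer
     *    than MAX_TIMER. */
    if ( time < CSCHED2_MIN_TIMER )
        time = CSCHED2_MIN_TIMER;
    if ( time < ratelimit )
        time = ratelimit;
    if ( time > CSCHED2_MAX_TIMER )
        time = CSCHED2_MAX_TIMER;

    return time;
}

If that reading is right, step 4 is the only place MAX_TIMER enters, so
lowering it would only shorten slices that steps 1-3 left longer than the
new cap.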
> That's not yet the last word though: If a VM that was asleep wakes
> up, and it has more credit than the running vcpu, then it will
> generally preempt that vcpu.
>
> All that to say, it should be very rare for a vcpu to run for a
> full 10ms under credit2.
That’s good.
> > > Other ways we could consider putting a vcpu into a boosted state (some
> > > discussed on Matrix or emails linked from Matrix):
> > > * Xen is about to preempt, but finds that the vcpu interrupts are
> > > blocked (this sort of overlaps with the "when we deliver an interrupt"
> > > one)
> >
> > This is also a good heuristic for "vCPU owns a spinlock", which is
> > definitely a bad time to preempt.
>
> Not all spinlocks disable IRQs, but certainly some do.
>
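True.  As a guest-side (Linux) illustration of the difference (nothing here
is Xen code or Xen-specific):

/* A lock taken with spin_lock_irqsave() keeps interrupts off for the
 * whole critical section, so a "vcpu has interrupts blocked" check
 * would catch its holder; a plain spin_lock() section would not. */
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(demo_lock);

static void critical_section_irqs_off(void)
{
    unsigned long flags;

    spin_lock_irqsave(&demo_lock, flags);      /* IRQs disabled here */
    /* ... caught by the "interrupts blocked" heuristic ... */
    spin_unlock_irqrestore(&demo_lock, flags);
}

static void critical_section_irqs_on(void)
{
    spin_lock(&demo_lock);                     /* IRQs may stay enabled */
    /* ... not caught by that heuristic ... */
    spin_unlock(&demo_lock);
}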
> > > Getting the defaults right might take some thinking. If you set the
> > > default "boost credit ratio" to 25% and the "default boost interval"
> > > to 500ms, then you'd basically have five "boosts" per scheduling
> > > window. The window depends on how active other vcpus are, but if it's
> > > longer than 20ms your system is too overloaded.
> >
> > An interval of 500ms seems rather long to me. Did you mean 500μs?
>
> Yes, I did mean 500us, sorry.
>
> I'll respond to the other suggestions later.
>
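For the record, the arithmetic behind "five boosts per window" as I
understand it, using the 10ms credit figure from earlier (none of these are
existing Xen parameters, just the numbers from the proposal):

#include <stdio.h>

int main(void)
{
    int credit_us = 10000;          /* ~10ms of credit per reset */
    int boost_ratio_pct = 25;       /* proposed "boost credit ratio" */
    int boost_interval_us = 500;    /* proposed "boost interval" */

    int boost_budget_us = credit_us * boost_ratio_pct / 100;     /* 2500us */
    int boosts_per_window = boost_budget_us / boost_interval_us; /* 5 */

    printf("boost budget %dus -> %d boosts per window\n",
           boost_budget_us, boosts_per_window);
    return 0;
}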
> > > Demi, what kinds of interrupt counts are you getting for your VM?
> >
> > I didn't measure it, but I can check the next time I am on a video call
> > or doing audio recording.
>
> Running xentrace would be really interesting too; those are another
> good way to nerd-snipe me. :-)
>
> -George
That would certainly be a good idea!
--
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab
