On Mon, Oct 02, 2023 at 12:20:31PM +0100, George Dunlap wrote:
> On Sun, Oct 1, 2023 at 12:28 AM Demi Marie Obenour
> <[email protected]> wrote:
> >
> > On Fri, Sep 29, 2023 at 05:42:16PM +0100, George Dunlap wrote:
> > > The basic credit2 algorithm goes something like this:
> > >
> > > 1. All vcpus start with the same number of credits; about 10ms worth
> > > if everyone has the same weight
> > >
> > > 2. vcpus burn credits as they consume cpu, based on the relative
> > > weights: higher weights burn slower, lower weights burn faster
> > >
> > > 3. At any given point in time, the runnable vcpu with the highest
> > > credit is allowed to run
> > >
> > > 4. When the "next runnable vcpu" on a runqueue is negative, credit is
> > > reset: everyone gets another 10ms, and can carry over at most 2ms of
> > > credit over the reset.
> >
> > One relevant aspect of Qubes OS is that it is very, very heavily
> > oversubscribed: having more VMs running than physical CPUs is (at least
> > in my usage) not uncommon, and each of those VMs will typically have at
> > least two vCPUs.  With a credit of 10ms and 36 vCPUs, I could easily see
> > a vCPU not being allowed to execute for 200ms or more.  For audio or
> > video workloads, this is a disaster.
> >
> > 10ms is a LOT for desktop workloads or for anyone who cares about
> > latency.  At 60Hz it is 3/5 of a frame, and with a 120Hz monitor and a
> > heavily contended system frame drops are guaranteed.
>
> You'd probably benefit from understanding better how the various
> algorithms actually work.  I'm sorry I don't have any really good
> "virtualization scheduling for dummies" resources; the best I have is
> a few talks I gave on the subject; e.g.:
>
> https://www.youtube.com/watch?v=C3jjvkr6fgQ
>
> For one, when I say "oversubscribed", I don't mean "vcpus / pcpus"; I
> mean "requested vcpu execution time / pcpus".  If you have 18 vcpus on
> a single pcpu, and all of them *on an empty system* would have run at
> 5%, you're totally fine.  If you have 18 vcpus on a single pcpu, and
> all of them on an empty system would have averaged 100%, there's only
> so much the scheduler can do to avoid problems.
If each vCPU would have spent 4% of its time doing realtime tasks, it should
be possible to give all of the realtime tasks all the time they need, while
the remaining 100 - 4 * 18 = 28% of the pCPU's time is available to non-realtime
tasks. That’s not awesome, but it might be enough to prevent audio from
glitching.
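
Spelling that arithmetic out (nothing scheduler-specific, just the numbers
above; percentages are of the single pCPU in George's example):

/* Non-realtime headroom left on one pCPU after the realtime work. */
static int non_realtime_headroom_pct(int n_vcpus, int rt_pct_per_vcpu)
{
    int rt_total = n_vcpus * rt_pct_per_vcpu;   /* 18 * 4 = 72 */
    return 100 - rt_total;                      /* 100 - 72 = 28 */
}

non_realtime_headroom_pct(18, 4) gives the 28% figure above.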
> Secondly, while on credit1 a vcpu is allowed to run for 10ms without
> stopping (and then must wait for 18x that time to get the same credit
> back, if there are 18 other vcpus running on that same pcpu), this is
> not the case for credit2. The exact calculation can be found in
> xen/common/sched/credit2.c:csched2_runtime(), but here's the general
> algorithm from the comment:
>
> /* General algorithm:
> * 1) Run until snext's credit will be 0.
> * 2) But if someone is waiting, run until snext's credit is equal
> * to his.
> * 3) But, if we are capped, never run more than our budget.
> * 4) And never run longer than MAX_TIMER or shorter than MIN_TIMER or
> * the ratelimit time.
> */
>
> Default MIN_TIMER is 500us, and is configurable via sysctl; default
> MAX_TIMER is... hmm, I'm pretty sure this started out as 2ms, but now
> it seems to be 10ms. Looks like this was changed in da92ec5bd1 ("xen:
> credit2: "relax" CSCHED2_MAX_TIMER") in 2016. (MAX_TIMER isn't
> configurable, but arguably it should be; and making it configurable
> should just be a matter of duplicating the logic around MIN_TIMER.)
Maybe MAX_TIMER should be lowered to e.g. 1ms?
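
For my own understanding, here is how I read that algorithm.  This is only
a sketch of the comment above, not the actual csched2_runtime() code: the
names are made up, and Xen's s_time_t plus the CSCHED2_MIN_TIMER /
CSCHED2_MAX_TIMER constants are assumed.

/* Sketch of the quoted algorithm; illustrative only. */
static s_time_t runtime_sketch(s_time_t snext_credit, s_time_t waiter_credit,
                               bool someone_waiting, bool capped,
                               s_time_t budget, s_time_t ratelimit)
{
    /* 1) Run until snext's credit would reach 0. */
    s_time_t time = snext_credit;

    /* 2) But if someone is waiting, only run until our credit drops to
     *    theirs. */
    if ( someone_waiting && snext_credit - waiter_credit < time )
        time = snext_credit - waiter_credit;

    /* 3) If capped, never run past the remaining budget. */
    if ( capped && budget < time )
        time = budget;

    /* 4) Never shorter than MIN_TIMER or the ratelimit, never longer
     *    than MAX_TIMER. */
    if ( time < CSCHED2_MIN_TIMER )
        time = CSCHED2_MIN_TIMER;
    if ( time < ratelimit )
        time = ratelimit;
    if ( time > CSCHED2_MAX_TIMER )
        time = CSCHED2_MAX_TIMER;

    return time;
}

If that reading is right, step 4 is the only place MAX_TIMER enters, so
lowering it would only shorten slices that steps 1-3 left longer than the
new cap.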
> That's not yet the last word though: If a VM that was asleep wakes
> up, and it has more credit than the running vcpu, then it will
> generally preempt that vcpu.
>
> All that to say, it should be very rare for a vcpu to run for a
> full 10ms under credit2.
That’s good.
> > > Other ways we could consider putting a vcpu into a boosted state (some
> > > discussed on Matrix or emails linked from Matrix):
> > > * Xen is about to preempt, but finds that the vcpu interrupts are
> > > blocked (this sort of overlaps with the "when we deliver an interrupt"
> > > one)
> >
> > This is also a good heuristic for "vCPU owns a spinlock", which is
> > definitely a bad time to preempt.
>
> Not all spinlocks disable IRQs, but certainly some do.
>
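True.  As a guest-side (Linux) illustration of the difference (nothing here
is Xen code or Xen-specific):

/* A lock taken with spin_lock_irqsave() keeps interrupts off for the
 * whole critical section, so a "vcpu has interrupts blocked" check
 * would catch its holder; a plain spin_lock() section would not. */
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(demo_lock);

static void critical_section_irqs_off(void)
{
    unsigned long flags;

    spin_lock_irqsave(&demo_lock, flags);      /* IRQs disabled here */
    /* ... caught by the "interrupts blocked" heuristic ... */
    spin_unlock_irqrestore(&demo_lock, flags);
}

static void critical_section_irqs_on(void)
{
    spin_lock(&demo_lock);                     /* IRQs may stay enabled */
    /* ... not caught by that heuristic ... */
    spin_unlock(&demo_lock);
}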
> > > Getting the defaults right might take some thinking. If you set the
> > > default "boost credit ratio" to 25% and the "default boost interval"
> > > to 500ms, then you'd basically have five "boosts" per scheduling
> > > window. The window depends on how active other vcpus are, but if it's
> > > longer than 20ms your system is too overloaded.
> >
> > An interval of 500ms seems rather long to me. Did you mean 500μs?
>
> Yes, I did mean 500us, sorry.
>
> I'll respond to the other suggestions later.
>
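For the record, the arithmetic behind "five boosts per window" as I
understand it, using the 10ms credit figure from earlier (none of these are
existing Xen parameters, just the numbers from the proposal):

#include <stdio.h>

int main(void)
{
    int credit_us = 10000;          /* ~10ms of credit per reset */
    int boost_ratio_pct = 25;       /* proposed "boost credit ratio" */
    int boost_interval_us = 500;    /* proposed "boost interval" */

    int boost_budget_us = credit_us * boost_ratio_pct / 100;     /* 2500us */
    int boosts_per_window = boost_budget_us / boost_interval_us; /* 5 */

    printf("boost budget %dus -> %d boosts per window\n",
           boost_budget_us, boosts_per_window);
    return 0;
}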
> > > Demi, what kinds of interrupt counts are you getting for your VM?
> >
> > I didn't measure it, but I can check the next time I am on a video call
> > or doing audio recording.
>
> Running xentrace would be really interesting too; those are another
> good way to nerd-snipe me. :-)
>
> -George
That would certainly be a good idea!
--
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab
