I have 2 Xeon 5540s (4 physical and 4 logical cores per CPU). Currently one entire CPU is dedicated to the VM (basically everything listed under NUMA node 0 in lscpu). I didn't quite get the guide; what would be the best setup to get the most out of the VM for gaming? Or is the configuration I have at the moment already the best one?
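
(For reference, a quick way to check which logical CPUs are HT siblings and which belong to each NUMA node before deciding on a pinning layout; this is just a generic sketch, and the exact columns vary a bit between util-linux versions:)

    lscpu -e=CPU,NODE,SOCKET,CORE   # logical CPUs that share a CORE value are HT siblings
    numactl --hardware              # CPUs and memory attached to each NUMA node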

On Feb 2, 2017 12:35 PM, "Jan Wiele" <[email protected]> wrote:
> Hi Thomas,
>
> awesome work! I've changed my (gaming) setup (2x Xeon E5-2670, 8 real cores per CPU) to the following:
>
> VM1 and VM2: each gets 4 real cores on CPU0; the emulator thread is pinned to the respective hyper-threading cores.
>
> VM3: 6 real cores on CPU1; the emulator thread is pinned to the respective hyper-threading cores.
>
> Host: 2 real cores on CPU1; 2 hyper-threaded cores.
>
> I've chosen this layout ("Low latency setup") since it fits my setup best. Alternatively I could have pinned the emulator threads to the host ("Balanced setup, emulator with host"), but this would result in some cross-node traffic, which I wanted to prevent. Additionally, some benchmarks show that hyper-threading does not improve gaming performance by much [1].
>
> With my new setup I ran DPC Latency Checker [2] and saw timings around 1000us on all three VMs. However, LatencyMon [3] showed much lower values (<100us) most of the time. Can they be compared?
>
> LatencyMon also showed me that the USB2 driver has a long ISR. Changing this to USB3 in libvirt fixed that.
>
> Cheers,
> Jan
>
> [1] https://www.techpowerup.com/forums/threads/gaming-benchmarks-core-i7-6700k-hyperthreading-test.219417/
> [2] http://www.thesycon.de/eng/latency_check.shtml
> [3] http://www.resplendence.com/latencymon
>
> On 01.02.2017 16:46, Thomas Lindroth wrote:
>
>> A while ago there was a conversation on the #vfio-users irc channel about how to use cpuset/pinning to get the best latency and performance. I said I would run some tests and eventually did. Writing up the results took a lot of time and there are some more tests I want to run to verify them, but I don't have time for that now, so I'll just post what I've concluded instead. First some theory.
>>
>> Latency in a virtual environment has many different causes:
>> * There is latency in the hardware/bios, like system management interrupts.
>> * The host operating system introduces some latency. This is often because the host won't schedule the VM when it wants to run.
>> * The emulator adds some latency because of things like nested page tables and handling of virtual hardware.
>> * The guest OS introduces its own latency when the workload wants to run but the guest scheduler won't schedule it.
>>
>> Points 1 and 4 are latencies you get even on bare metal, but points 2 and 3 are extra latency caused by the virtualisation. This post is mostly about reducing the latency of point 2.
>>
>> I assume you are already familiar with how this is usually done. By using cpuset you can reserve some cores for exclusive use by the VM and put all system processes on a separate housekeeping core. This allows the VM to run whenever it wants, which is good for latency, but the downside is that the VM can't use the housekeeping core, so performance is reduced.
>>
>> By running pstree -p while the VM is running you get output like this:
>> ...
>> ─qemu-system-x86(4995)─┬─{CPU 0/KVM}(5004)
>>                        ├─{CPU 1/KVM}(5005)
>>                        ├─{CPU 2/KVM}(5006)
>>                        ├─{CPU 3/KVM}(5007)
>>                        ├─{CPU 4/KVM}(5008)
>>                        ├─{CPU 5/KVM}(5009)
>>                        ├─{qemu-system-x86}(4996)
>>                        ├─{qemu-system-x86}(5012)
>>                        ├─{qemu-system-x86}(5013)
>>                        ├─{worker}(5765)
>>                        └─{worker}(5766)
>>
>> Qemu spawns a bunch of threads for different things. The "CPU #/KVM" threads run the actual guest code and there is one for each virtual cpu. I call them "VM threads" from here on.
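>>
>> (As an aside, the same thread names, TIDs and their current CPU affinity can also be listed with ps and taskset; a minimal sketch using the qemu PID from the pstree output above, so adjust the PID for your own system:)
>>
>> ps -T -p 4995 -o spid,comm                                    # per-thread IDs and names
>> for t in $(ps -T -p 4995 -o spid=); do taskset -cp "$t"; done # current affinity of each thread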
>>
>> The qemu-system-x86 threads are used to emulate virtual hardware and are called the emulator in libvirt terminology. I call them "emulator threads". The worker threads are probably what libvirt calls iothreads, but I treat them the same as the emulator threads and refer to them both as "emulator threads".
>>
>> My cpu is an i7-4790K with 4 hyper-threaded cores for a total of 8 logical cores. A lot of people here probably have something similar. Take a look in /proc/cpuinfo to see how it's laid out. I number my cores like cpuinfo, where I have physical cores 0-3 and logical cores 0-7. pcore 0 corresponds to lcore 0,4, pcore 1 is lcore 1,5 and so on.
>>
>> The goal is to partition the system processes, VM threads and emulator threads on these 8 lcores to get good latency and acceptable performance, but to do that I need a way to measure latency. Mainline kernel 4.9 got a new latency tracer called hwlat. It's designed to measure hardware latencies like SMI, but if you run it in a VM you get all latencies below the guest (points 1-3 above). Hwlat bypasses the normal cpu scheduler so it won't measure any latency from the guest scheduler (point 4). It basically makes it possible to focus on just the VM-related latencies. https://lwn.net/Articles/703129/
>>
>> We should perhaps also discuss how much latency is too much. That's up for debate, but the windows DPC latency checker lists 500us as green, 1000us as yellow and 2000us as red. If a game runs at 60fps it has a deadline of 16.7ms to render a frame. I'll just decide that 1ms (1000us) is the upper limit for what I can tolerate.
>>
>> One of the consequences of how hwlat works is that it also fails to notice a lot of the point 3 types of latencies. Most of the latency in point 3 is caused by vm-exits. That's when the guest does something the hardware virtualisation can't handle and has to rely on kvm or qemu to emulate the behaviour. This is a lot slower than real hardware, but it mostly only happens when the guest tries to access hardware resources, so I'll call it IO latency. The hwlat tracer only sits and spins in kernel space and never touches any hardware by itself. Since hwlat doesn't trigger vm-exits it also can't measure latencies from them, so it would be good to have something else that could. The way I rigged things up is to set the virtual disk controller to ahci, which I know has to be emulated by qemu. I then added a ram block device from /dev/ram* to the VM as a virtual disk. I can then run the fio disk benchmark in the VM on that disk to trigger vm-exits and get a report on the latency from fio. It's not a good solution but it's the best I could come up with. http://freecode.com/projects/fio
>>
>> === Low latency setup ===
>>
>> Let's finally get down to business. The first setup I tried is configured for minimum latency at the expense of performance.
>>
>> The virtual cpu in this setup has 3 cores and no HT. The VM threads are pinned to lcore 1,2,3. The emulator threads are pinned to lcore 5,6,7. That leaves pcore 0, which is dedicated to the host using cpuset.
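>>
>> (As an illustration, one way the host side of such a split can be done by hand with a cgroup v1 cpuset; a rough sketch assuming the cpuset controller is mounted at /sys/fs/cgroup/cpuset, not the exact commands used for these tests:)
>>
>> mkdir /sys/fs/cgroup/cpuset/housekeeping
>> echo 0,4 > /sys/fs/cgroup/cpuset/housekeeping/cpuset.cpus   # pcore 0 = lcore 0 and 4
>> echo 0 > /sys/fs/cgroup/cpuset/housekeeping/cpuset.mems
>> # move every task that will let itself be moved; bound kthreads will refuse
>> for t in $(cat /sys/fs/cgroup/cpuset/tasks); do
>>     echo $t > /sys/fs/cgroup/cpuset/housekeeping/tasks 2>/dev/null
>> done
>>
>> (The cset tool's "cset shield -c 1-3,5-7 -k on" sets up roughly the same split in one command.)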
>>
>> Here is the layout in libvirt xml:
>> <vcpupin vcpu='0' cpuset='1'/>
>> <vcpupin vcpu='1' cpuset='2'/>
>> <vcpupin vcpu='2' cpuset='3'/>
>> <emulatorpin cpuset='5-7'/>
>> <topology sockets='1' cores='3' threads='1'/>
>>
>> And here are the results of hwlat (all hwlat tests ran for 30 min each). I used a synthetic load to test how the latencies changed under load. I use the program stress as the synthetic load on both guest and host (stress --vm 1 --io 1 --cpu 8 --hdd 1).
>>
>>                      mean     stdev    max(us)
>> host idle, VM idle:  17.2778  15.6788  70
>> host load, VM idle:  21.4856  20.1409  72
>> host idle, VM load:  19.7144  18.9321  103
>> host load, VM load:  21.8189  21.2839  139
>>
>> As you can see, the load on the host makes little difference for the latency. The cpuset isolation works well. The slight increase of the mean might be because of reduced memory bandwidth. Putting the VM under load will increase the latency a bit. This might seem odd since the idea of using hwlat was to bypass the guest scheduler, thereby making the latency independent of what is running in the guest. What is probably happening is that the "--hdd" part of the stress accesses the disk and this makes the emulator threads run. They are pinned to the HT siblings of the VM threads and thereby slightly impact their latency. Overall the latency is very good in this setup.
>>
>> fio (us) min=40, max=1306, avg=52.81, stdev=12.60 iops=18454
>> Here is the result of the IO latency test with fio. Since the emulator threads run mostly isolated on their own siblings this result must be considered good.
>>
>> === Low latency setup, with realtime ===
>>
>> In an older post to the mailing list I said "The NO_HZ_FULL scheduler mode only works if a single process wants to run on a core. When the VM thread runs as realtime priority it can starve the kernel threads for long period of time and the scheduler will turn off NO_HZ_FULL when that happens since several processes wants to run. To get the full advantage of NO_HZ_FULL don't use realtime priority."
>>
>> Let's see how much impact this really has. The idea behind realtime pri is to always give your preferred workload priority over unimportant workloads. But to make any difference there has to be an unimportant workload to preempt. Cpuset is a great way to move unimportant processes to a housekeeping cpu, but unfortunately the kernel has some pesky kthreads that refuse to migrate. By using realtime pri on the VM threads I should be able to out-preempt the kernel threads and get lower latency. In this test I used the same setup as above but used schedtool to set round-robin pri 1 on all VM-related threads.
>>
>>                      mean     stdev    max(us)
>> host idle, VM idle:  17.6511  15.3028  61
>> host load, VM idle:  20.2400  19.6558  57
>> host idle, VM load:  18.9244  18.8119  108
>> host load, VM load:  20.4228  21.0749  122
>>
>> The result is mostly the same. Those few remaining kthreads that I can't disable or migrate apparently don't make much difference to latency.
>>
>> === Balanced setup, emulator with VM threads ===
>>
>> 3 cores isn't a lot these days and some games like Mad Max and Rise of the Tomb Raider max out the cpu in the low latency setup, which results in big frame drops when it happens.
>> The setup below, with a virtual 2-core HT cpu, would probably give ok latency, but the addition of hyper-threading usually only gives 25-50% extra performance for real-world workloads, so this setup would generally be slower than the low latency setup. I didn't bother to test it.
>> <vcpupin vcpu='0' cpuset='2'/>
>> <vcpupin vcpu='1' cpuset='6'/>
>> <vcpupin vcpu='2' cpuset='3'/>
>> <vcpupin vcpu='3' cpuset='7'/>
>> <emulatorpin cpuset='1,5'/>
>> <topology sockets='1' cores='2' threads='2'/>
>>
>> To get better performance I need at least a virtual 3-core HT cpu, but if the host uses pcore 0 and the VM threads use pcores 1-3, where will the emulator threads run? I could overallocate the system by having the emulator threads compete with the VM threads, or I could overallocate the system by having the emulator threads compete with the host processes. Let's try running the emulator with the VM threads first.
>>
>> <vcpupin vcpu='0' cpuset='1'/>
>> <vcpupin vcpu='1' cpuset='5'/>
>> <vcpupin vcpu='2' cpuset='2'/>
>> <vcpupin vcpu='3' cpuset='6'/>
>> <vcpupin vcpu='4' cpuset='3'/>
>> <vcpupin vcpu='5' cpuset='7'/>
>> <emulatorpin cpuset='1-3,5-7'/>
>> <topology sockets='1' cores='3' threads='2'/>
>>
>> The odd ordering for vcpupin is done because Intel cpus lay out HT siblings as lcore[01234567] = pcore[01230123] but qemu lays out the virtual cpu as lcore[012345] = pcore[001122]. To get a 1:1 mapping I have to order them like that.
>>
>>                      mean      stdev      max(us)
>> host idle, VM idle:  17.4906   15.1180    89
>> host load, VM idle:  22.7317   19.5327    95
>> host idle, VM load:  82.3694   329.6875   9458
>> host load, VM load:  141.2461  1170.5207  20757
>>
>> The result is really bad. It works ok as long as the VM is idle, but as soon as it's under load I get bad latencies. The reason is likely that the stressor accesses the disk, which activates the emulator, and in this setup the emulator can preempt the VM threads. We can check if this is the case by running the stress without "--hdd".
>>
>>                                     mean     stdev     max(us)
>> host load, VM load (but no --hdd):  57.4728  138.8211  1345
>>
>> The latency is reduced quite a bit but it's still high. It's likely still the emulator threads preempting the VM threads. Accessing the disk is just one of many things the VM can do to activate the emulator.
>>
>> fio (us) min=41, max=7348, avg=62.17, stdev=14.99 iops=15715
>> IO latency is also a lot worse compared to the low latency setup. The reason is that the VM threads can preempt the emulator threads while they are emulating the disk drive.
>>
>> === Balanced setup, emulator with host ===
>>
>> Pairing up the emulator threads and VM threads was a bad idea, so let's try running the emulator on the core reserved for the host. Since the VM threads run by themselves in this setup we would expect good hwlat latency, but the emulator threads can be preempted by host processes, so IO latency might suffer. Let's start by looking at the IO latency.
>>
>> fio (us) min=40, max=46852, avg=61.55, stdev=250.90 iops=15893
>>
>> Yup, massive IO latency. Here is a situation where realtime pri could help. If the emulator threads get realtime pri they can out-preempt the host processes. Let's try that.
>>
>> fio (us) min=38, max=2640, avg=53.72, stdev=13.61 iops=18140
>>
>> That's better, but it's not as good as the low latency setup where the emulator threads got their own lcore.
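>>
>> (For reference, a minimal sketch of how that realtime priority can be applied to just the emulator threads, using the example TIDs from the pstree output earlier; substitute the emulator/worker TIDs of your own qemu process:)
>>
>> schedtool -R -p 1 4996 5012 5013 5765 5766       # SCHED_RR, priority 1
>> # or the same thing with chrt:
>> for t in 4996 5012 5013 5765 5766; do chrt -r -p 1 $t; done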
>>
>> To reduce the latency even more we could try to split pcore 0 in two and run host processes on lcore 0 and the emulator threads on lcore 4. But this doesn't leave much cpu for the emulator (or the host).
>>
>> fio (us) min=44, max=1192, avg=56.07, stdev=8.52 iops=17377
>>
>> The max IO latency now decreased to the same level as the low latency setup. Unfortunately the number of iops also decreased a bit (down 5.8% compared to the low latency setup). I'm guessing this is because the emulator threads don't get as much cpu power in this setup.
>>
>>                      mean     stdev    max(us)
>> host idle, VM idle:  18.3933  15.5901  106
>> host load, VM idle:  20.2006  18.8932  77
>> host idle, VM load:  23.1694  22.4301  110
>> host load, VM load:  23.2572  23.7288  120
>>
>> Hwlat latency is comparable to the low latency setup, so this setup gives a good latency/performance trade-off.
>>
>> === Max performance setup ===
>>
>> If 3 cores with HT isn't enough I suggest you give up, but for comparison let's see what happens if we mirror the host cpu in the VM. Now we have no room at all for the emulator or the host processes, so I let them schedule freely.
>> <vcpupin vcpu='0' cpuset='0'/>
>> <vcpupin vcpu='1' cpuset='4'/>
>> <vcpupin vcpu='2' cpuset='1'/>
>> <vcpupin vcpu='3' cpuset='5'/>
>> <vcpupin vcpu='4' cpuset='2'/>
>> <vcpupin vcpu='5' cpuset='6'/>
>> <vcpupin vcpu='6' cpuset='3'/>
>> <vcpupin vcpu='7' cpuset='7'/>
>> <emulatorpin cpuset='0-7'/>
>> <topology sockets='1' cores='4' threads='2'/>
>>
>>                      mean       stdev      max(us)
>> host idle, VM idle:  185.4200   839.7908   6311
>> host load, VM idle:  3835.9333  7836.5902  97234
>> host idle, VM load:  1891.4300  3873.9165  31015
>> host load, VM load:  8459.2550  6437.6621  51665
>>
>> fio (us) min=48, max=112484, avg=90.41, stdev=355.10 iops=10845
>>
>> I only ran these tests for 10 min each. That's all that was needed. As you can see it's terrible. I'm afraid that many people probably run a setup similar to this. I ran like this myself for a while until I switched to libvirt and started looking into pinning. Realtime pri would probably help a lot here, but realtime in this configuration is potentially dangerous. Workloads in the guest could starve the host, and depending on how the guest gets its input, a reset using the hardware reset button could be needed to get the system back.
>>
>> === Testing with games ===
>>
>> I want low latency for gaming, so it would make sense to test the setups with games. This turns out to be kind of tricky. Games are complicated and interpreting the results can be hard. As an example, here is a percentile plot of the frametimes in the built-in benchmark of Rise of the Tomb Raider, taken with fraps: https://i.imgur.com/NIrXnkt.png
>> The performance and balanced setups look about the same at lower percentiles, but the low latency setup is a lot lower. This means that the low latency setup, which is the weakest in terms of cpu power, got a higher frame rate for some parts of the benchmark. This doesn't make sense at first. It only starts to make sense if I pay attention to the benchmark while it's running. Rise of the Tomb Raider loads in a lot of geometry dynamically and the low latency setup can't keep up. It has bad pop-in of textures and objects, so the scene the gpu renders is less complicated than in the other setups. A less complicated scene results in a higher frame rate. An odd, counter-intuitive result.
>>
>> Overall the performance and balanced setups have the same percentile curve at lower percentiles in every game I tested. This tells me that the balanced setup has enough cpu power for all the games I've tried. They only differ at higher percentiles due to latency-induced frame drops. The performance setup always has the worst max frametime in every game, so there is no reason to use it over the balanced setup. The performance setup also has crackling sound in several games over hdmi audio, even with MSI enabled. Which setup gets the lowest max frametime depends on the workload. If the game maxes out the cpu of the low latency setup, the max frametime will be worse than the balanced setup; if not, the low latency setup has the best latency.
>>
>> === Conclusion ===
>>
>> The balanced setup (emulator with host) doesn't have the best latency in every workload, but I haven't found any workload where it performs poorly in regard to max latency, IO latency or available cpu power. Even in those workloads where another setup performed better, the balanced setup was always close. If you are too lazy to switch setups depending on the workload, use the balanced setup as the default configuration. If your cpu isn't a 4-core with HT, finding the best setup for your cpu is left as an exercise for the reader.
>>
>> === Future work ===
>>
>> https://vfio.blogspot.se/2016/10/how-to-improve-performance-in-windows-7.html
>> This was a nice trick for forcing win7 to use TSC. Just one problem: it turns out it doesn't work if hyper-threading is enabled. Any time I use a virtual cpu with threads='2', win7 will revert to using the acpi_pm. I've spent a lot of time trying to work around the problem but failed. I don't even know why hyper-threading would make a difference for TSC. Microsoft's documentation is amazingly unhelpful. Even when the guest is hammering the acpi_pm timer, the balanced setup gives better performance than the low latency setup, but I'm afraid the reduced resolution and extra indeterminism of the acpi_pm timer might result in other problems. This is only a problem in win7 because modern versions of windows should use hypervclock. I've read somewhere that it might be possible to modify OVMF to work around the bug in win7 that prevents hyperv from working. With that modification it might be possible to use hypervclock in win7. Perhaps I'll look into that in the future. In the meantime I'll stick with the balanced setup despite the use of acpi_pm.
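>>
>> (For what it's worth, on Windows guests newer than win7 the hyperv enlightenments and the hypervclock timer can simply be enabled in the libvirt domain XML; a minimal sketch of the relevant bits, not the exact config used in these tests:)
>>
>> <features>
>>   <hyperv>
>>     <relaxed state='on'/>
>>     <vapic state='on'/>
>>     <spinlocks state='on' retries='8191'/>
>>   </hyperv>
>> </features>
>> <clock offset='localtime'>
>>   <timer name='hypervclock' present='yes'/>
>>   <timer name='rtc' tickpolicy='catchup'/>
>>   <timer name='hpet' present='no'/>
>> </clock>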

_______________________________________________
vfio-users mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/vfio-users
