I have 2 Xeon 5540s (4 physical and 4 logical cores per CPU). Currently one entire CPU is dedicated to the VM (basically everything listed under NUMA node 0 in lscpu). I didn't quite get the guide; what would be the best setup to get the most out of the VM for gaming? Or is the configuration I have at the moment already the best one?
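
(For reference, a quick way to check which logical CPUs are HT siblings and which belong to each NUMA node before deciding on a pinning layout; this is just a generic sketch, and the exact columns vary a bit between util-linux versions:)

    lscpu -e=CPU,NODE,SOCKET,CORE   # logical CPUs that share a CORE value are HT siblings
    numactl --hardware              # CPUs and memory attached to each NUMA node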

On Feb 2, 2017 12:35 PM, "Jan Wiele" <[email protected]> wrote:
> Hi Thomas,
>
> awesome work! I've changed my (gaming) setup (2x Xeon E5-2670, 8 real cores per CPU) to the following:
>
> VM1 and VM2: each gets 4 real cores on CPU0; the emulator thread is pinned to the respective hyper-threading cores.
>
> VM3: 6 real cores on CPU1; the emulator thread is pinned to the respective hyper-threading cores.
>
> Host: 2 real cores on CPU1; 2 hyper-threaded cores.
>
> I've chosen this layout ("Low latency setup") since it fits my setup best. Alternatively I could have pinned the emulator threads to the host ("Balanced setup, emulator with host"), but this would result in some cross-node traffic, which I wanted to prevent. Additionally, some benchmarks show that hyper-threading does not improve gaming performance by much [1].
>
> With my new setup I ran DPC Latency Checker [2] and saw timings around 1000us on all three VMs. However, LatencyMon [3] showed much lower values (<100us) most of the time. Can they be compared?
>
> LatencyMon also showed me that the USB2 driver has a long ISR. Changing this to USB3 in libvirt fixed that.
>
> Cheers,
> Jan
>
> [1] https://www.techpowerup.com/forums/threads/gaming-benchmarks-core-i7-6700k-hyperthreading-test.219417/
> [2] http://www.thesycon.de/eng/latency_check.shtml
> [3] http://www.resplendence.com/latencymon
>
> On 01.02.2017 16:46, Thomas Lindroth wrote:
>
>> A while ago there was a conversation on the #vfio-users irc channel about how to use cpuset/pinning to get the best latency and performance. I said I would run some tests and eventually did. Writing up the results took a lot of time and there are some more tests I want to run to verify them, but I don't have time for that now, so I'll just post what I've concluded instead. First some theory.
>>
>> Latency in a virtual environment has many different causes:
>> * There is latency in the hardware/bios, like system management interrupts.
>> * The host operating system introduces some latency. This is often because the host won't schedule the VM when it wants to run.
>> * The emulator adds some latency because of things like nested page tables and handling of virtual hardware.
>> * The guest OS introduces its own latency when the workload wants to run but the guest scheduler won't schedule it.
>>
>> Points 1 and 4 are latencies you get even on bare metal, but points 2 and 3 are extra latency caused by the virtualisation. This post is mostly about reducing the latency of point 2.
>>
>> I assume you are already familiar with how this is usually done. By using cpuset you can reserve some cores for exclusive use by the VM and put all system processes on a separate housekeeping core. This allows the VM to run whenever it wants, which is good for latency, but the downside is that the VM can't use the housekeeping core, so performance is reduced.
>>
>> By running pstree -p while the VM is running you get output like this:
>> ...
>> ─qemu-system-x86(4995)─┬─{CPU 0/KVM}(5004)
>>                        ├─{CPU 1/KVM}(5005)
>>                        ├─{CPU 2/KVM}(5006)
>>                        ├─{CPU 3/KVM}(5007)
>>                        ├─{CPU 4/KVM}(5008)
>>                        ├─{CPU 5/KVM}(5009)
>>                        ├─{qemu-system-x86}(4996)
>>                        ├─{qemu-system-x86}(5012)
>>                        ├─{qemu-system-x86}(5013)
>>                        ├─{worker}(5765)
>>                        └─{worker}(5766)
>>
>> Qemu spawns a bunch of threads for different things. The "CPU #/KVM" threads run the actual guest code and there is one for each virtual cpu. I call them "VM threads" from here on.
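>>
>> (As an aside, the same thread names, TIDs and their current CPU affinity can also be listed with ps and taskset; a minimal sketch using the qemu PID from the pstree output above, so adjust the PID for your own system:)
>>
>> ps -T -p 4995 -o spid,comm                                    # per-thread IDs and names
>> for t in $(ps -T -p 4995 -o spid=); do taskset -cp "$t"; done # current affinity of each thread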
>>
>> The qemu-system-x86 threads are used to emulate virtual hardware and are called the emulator in libvirt terminology. I call them "emulator threads". The worker threads are probably what libvirt calls iothreads, but I treat them the same as the emulator threads and refer to them both as "emulator threads".
>>
>> My cpu is an i7-4790K with 4 hyper-threaded cores for a total of 8 logical cores. A lot of people here probably have something similar. Take a look in /proc/cpuinfo to see how it's laid out. I number my cores like cpuinfo, where I have physical cores 0-3 and logical cores 0-7. pcore 0 corresponds to lcore 0,4, pcore 1 is lcore 1,5 and so on.
>>
>> The goal is to partition the system processes, VM threads and emulator threads on these 8 lcores to get good latency and acceptable performance, but to do that I need a way to measure latency. Mainline kernel 4.9 got a new latency tracer called hwlat. It's designed to measure hardware latencies like SMI, but if you run it in a VM you get all latencies below the guest (points 1-3 above). Hwlat bypasses the normal cpu scheduler so it won't measure any latency from the guest scheduler (point 4). It basically makes it possible to focus on just the VM-related latencies. https://lwn.net/Articles/703129/
>>
>> We should perhaps also discuss how much latency is too much. That's up for debate, but the windows DPC latency checker lists 500us as green, 1000us as yellow and 2000us as red. If a game runs at 60fps it has a deadline of 16.7ms to render a frame. I'll just decide that 1ms (1000us) is the upper limit for what I can tolerate.
>>
>> One of the consequences of how hwlat works is that it also fails to notice a lot of the point 3 types of latencies. Most of the latency in point 3 is caused by vm-exits. That's when the guest does something the hardware virtualisation can't handle and has to rely on kvm or qemu to emulate the behaviour. This is a lot slower than real hardware, but it mostly only happens when the guest tries to access hardware resources, so I'll call it IO latency. The hwlat tracer only sits and spins in kernel space and never touches any hardware by itself. Since hwlat doesn't trigger vm-exits it also can't measure latencies from them, so it would be good to have something else that could. The way I rigged things up is to set the virtual disk controller to ahci, which I know has to be emulated by qemu. I then added a ram block device from /dev/ram* to the VM as a virtual disk. I can then run the fio disk benchmark in the VM on that disk to trigger vm-exits and get a report on the latency from fio. It's not a good solution but it's the best I could come up with. http://freecode.com/projects/fio
>>
>> === Low latency setup ===
>>
>> Let's finally get down to business. The first setup I tried is configured for minimum latency at the expense of performance.
>>
>> The virtual cpu in this setup has 3 cores and no HT. The VM threads are pinned to lcore 1,2,3. The emulator threads are pinned to lcore 5,6,7. That leaves pcore 0, which is dedicated to the host using cpuset.
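>>
>> (As an illustration, one way the host side of such a split can be done by hand with a cgroup v1 cpuset; a rough sketch assuming the cpuset controller is mounted at /sys/fs/cgroup/cpuset, not the exact commands used for these tests:)
>>
>> mkdir /sys/fs/cgroup/cpuset/housekeeping
>> echo 0,4 > /sys/fs/cgroup/cpuset/housekeeping/cpuset.cpus   # pcore 0 = lcore 0 and 4
>> echo 0 > /sys/fs/cgroup/cpuset/housekeeping/cpuset.mems
>> # move every task that will let itself be moved; bound kthreads will refuse
>> for t in $(cat /sys/fs/cgroup/cpuset/tasks); do
>>     echo $t > /sys/fs/cgroup/cpuset/housekeeping/tasks 2>/dev/null
>> done
>>
>> (The cset tool's "cset shield -c 1-3,5-7 -k on" sets up roughly the same split in one command.)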
>>
>> Here is the layout in libvirt xml:
>> <vcpupin vcpu='0' cpuset='1'/>
>> <vcpupin vcpu='1' cpuset='2'/>
>> <vcpupin vcpu='2' cpuset='3'/>
>> <emulatorpin cpuset='5-7'/>
>> <topology sockets='1' cores='3' threads='1'/>
>>
>> And here are the results of hwlat (all hwlat tests ran for 30 min each). I used a synthetic load to test how the latencies changed under load. I use the program stress as the synthetic load on both guest and host (stress --vm 1 --io 1 --cpu 8 --hdd 1).
>>
>>                      mean     stdev    max(us)
>> host idle, VM idle:  17.2778  15.6788  70
>> host load, VM idle:  21.4856  20.1409  72
>> host idle, VM load:  19.7144  18.9321  103
>> host load, VM load:  21.8189  21.2839  139
>>
>> As you can see, the load on the host makes little difference for the latency. The cpuset isolation works well. The slight increase of the mean might be because of reduced memory bandwidth. Putting the VM under load will increase the latency a bit. This might seem odd since the idea of using hwlat was to bypass the guest scheduler, thereby making the latency independent of what is running in the guest. What is probably happening is that the "--hdd" part of the stress accesses the disk and this makes the emulator threads run. They are pinned to the HT siblings of the VM threads and thereby slightly impact their latency. Overall the latency is very good in this setup.
>>
>> fio (us) min=40, max=1306, avg=52.81, stdev=12.60 iops=18454
>> Here is the result of the IO latency test with fio. Since the emulator threads run mostly isolated on their own siblings this result must be considered good.
>>
>> === Low latency setup, with realtime ===
>>
>> In an older post to the mailing list I said "The NO_HZ_FULL scheduler mode only works if a single process wants to run on a core. When the VM thread runs as realtime priority it can starve the kernel threads for long period of time and the scheduler will turn off NO_HZ_FULL when that happens since several processes wants to run. To get the full advantage of NO_HZ_FULL don't use realtime priority."
>>
>> Let's see how much impact this really has. The idea behind realtime pri is to always give your preferred workload priority over unimportant workloads. But to make any difference there has to be an unimportant workload to preempt. Cpuset is a great way to move unimportant processes to a housekeeping cpu, but unfortunately the kernel has some pesky kthreads that refuse to migrate. By using realtime pri on the VM threads I should be able to out-preempt the kernel threads and get lower latency. In this test I used the same setup as above but used schedtool to set round-robin pri 1 on all VM-related threads.
>>
>>                      mean     stdev    max(us)
>> host idle, VM idle:  17.6511  15.3028  61
>> host load, VM idle:  20.2400  19.6558  57
>> host idle, VM load:  18.9244  18.8119  108
>> host load, VM load:  20.4228  21.0749  122
>>
>> The result is mostly the same. Those few remaining kthreads that I can't disable or migrate apparently don't make much difference to latency.
>>
>> === Balanced setup, emulator with VM threads ===
>>
>> 3 cores isn't a lot these days and some games like Mad Max and Rise of the Tomb Raider max out the cpu in the low latency setup, which results in big frame drops when it happens.
>> The setup below, with a virtual 2-core HT cpu, would probably give ok latency, but the addition of hyper-threading usually only gives 25-50% extra performance for real-world workloads, so this setup would generally be slower than the low latency setup. I didn't bother to test it.
>> <vcpupin vcpu='0' cpuset='2'/>
>> <vcpupin vcpu='1' cpuset='6'/>
>> <vcpupin vcpu='2' cpuset='3'/>
>> <vcpupin vcpu='3' cpuset='7'/>
>> <emulatorpin cpuset='1,5'/>
>> <topology sockets='1' cores='2' threads='2'/>
>>
>> To get better performance I need at least a virtual 3-core HT cpu, but if the host uses pcore 0 and the VM threads use pcores 1-3, where will the emulator threads run? I could overallocate the system by having the emulator threads compete with the VM threads, or I could overallocate the system by having the emulator threads compete with the host processes. Let's try running the emulator with the VM threads first.
>>
>> <vcpupin vcpu='0' cpuset='1'/>
>> <vcpupin vcpu='1' cpuset='5'/>
>> <vcpupin vcpu='2' cpuset='2'/>
>> <vcpupin vcpu='3' cpuset='6'/>
>> <vcpupin vcpu='4' cpuset='3'/>
>> <vcpupin vcpu='5' cpuset='7'/>
>> <emulatorpin cpuset='1-3,5-7'/>
>> <topology sockets='1' cores='3' threads='2'/>
>>
>> The odd ordering for vcpupin is done because Intel cpus lay out HT siblings as lcore[01234567] = pcore[01230123] but qemu lays out the virtual cpu as lcore[012345] = pcore[001122]. To get a 1:1 mapping I have to order them like that.
>>
>>                      mean      stdev      max(us)
>> host idle, VM idle:  17.4906   15.1180    89
>> host load, VM idle:  22.7317   19.5327    95
>> host idle, VM load:  82.3694   329.6875   9458
>> host load, VM load:  141.2461  1170.5207  20757
>>
>> The result is really bad. It works ok as long as the VM is idle, but as soon as it's under load I get bad latencies. The reason is likely that the stressor accesses the disk, which activates the emulator, and in this setup the emulator can preempt the VM threads. We can check if this is the case by running the stress without "--hdd".
>>
>>                                     mean     stdev     max(us)
>> host load, VM load (but no --hdd):  57.4728  138.8211  1345
>>
>> The latency is reduced quite a bit but it's still high. It's likely still the emulator threads preempting the VM threads. Accessing the disk is just one of many things the VM can do to activate the emulator.
>>
>> fio (us) min=41, max=7348, avg=62.17, stdev=14.99 iops=15715
>> IO latency is also a lot worse compared to the low latency setup. The reason is that the VM threads can preempt the emulator threads while they are emulating the disk drive.
>>
>> === Balanced setup, emulator with host ===
>>
>> Pairing up the emulator threads and VM threads was a bad idea, so let's try running the emulator on the core reserved for the host. Since the VM threads run by themselves in this setup we would expect good hwlat latency, but the emulator threads can be preempted by host processes, so IO latency might suffer. Let's start by looking at the IO latency.
>>
>> fio (us) min=40, max=46852, avg=61.55, stdev=250.90 iops=15893
>>
>> Yup, massive IO latency. Here is a situation where realtime pri could help. If the emulator threads get realtime pri they can out-preempt the host processes. Let's try that.
>>
>> fio (us) min=38, max=2640, avg=53.72, stdev=13.61 iops=18140
>>
>> That's better, but it's not as good as the low latency setup where the emulator threads got their own lcore.
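>>
>> (For reference, a minimal sketch of how that realtime priority can be applied to just the emulator threads, using the example TIDs from the pstree output earlier; substitute the emulator/worker TIDs of your own qemu process:)
>>
>> schedtool -R -p 1 4996 5012 5013 5765 5766       # SCHED_RR, priority 1
>> # or the same thing with chrt:
>> for t in 4996 5012 5013 5765 5766; do chrt -r -p 1 $t; done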
>>
>> To reduce the latency even more we could try to split pcore 0 in two and run host processes on lcore 0 and the emulator threads on lcore 4. But this doesn't leave much cpu for the emulator (or the host).
>>
>> fio (us) min=44, max=1192, avg=56.07, stdev=8.52 iops=17377
>>
>> The max IO latency now decreased to the same level as the low latency setup. Unfortunately the number of iops also decreased a bit (down 5.8% compared to the low latency setup). I'm guessing this is because the emulator threads don't get as much cpu power in this setup.
>>
>>                      mean     stdev    max(us)
>> host idle, VM idle:  18.3933  15.5901  106
>> host load, VM idle:  20.2006  18.8932  77
>> host idle, VM load:  23.1694  22.4301  110
>> host load, VM load:  23.2572  23.7288  120
>>
>> Hwlat latency is comparable to the low latency setup, so this setup gives a good latency/performance trade-off.
>>
>> === Max performance setup ===
>>
>> If 3 cores with HT isn't enough I suggest you give up, but for comparison let's see what happens if we mirror the host cpu in the VM. Now we have no room at all for the emulator or the host processes, so I let them schedule freely.
>> <vcpupin vcpu='0' cpuset='0'/>
>> <vcpupin vcpu='1' cpuset='4'/>
>> <vcpupin vcpu='2' cpuset='1'/>
>> <vcpupin vcpu='3' cpuset='5'/>
>> <vcpupin vcpu='4' cpuset='2'/>
>> <vcpupin vcpu='5' cpuset='6'/>
>> <vcpupin vcpu='6' cpuset='3'/>
>> <vcpupin vcpu='7' cpuset='7'/>
>> <emulatorpin cpuset='0-7'/>
>> <topology sockets='1' cores='4' threads='2'/>
>>
>>                      mean       stdev      max(us)
>> host idle, VM idle:  185.4200   839.7908   6311
>> host load, VM idle:  3835.9333  7836.5902  97234
>> host idle, VM load:  1891.4300  3873.9165  31015
>> host load, VM load:  8459.2550  6437.6621  51665
>>
>> fio (us) min=48, max=112484, avg=90.41, stdev=355.10 iops=10845
>>
>> I only ran these tests for 10 min each. That's all that was needed. As you can see it's terrible. I'm afraid that many people probably run a setup similar to this. I ran like this myself for a while until I switched to libvirt and started looking into pinning. Realtime pri would probably help a lot here, but realtime in this configuration is potentially dangerous. Workloads in the guest could starve the host, and depending on how the guest gets its input, a reset using the hardware reset button could be needed to get the system back.
>>
>> === Testing with games ===
>>
>> I want low latency for gaming, so it would make sense to test the setups with games. This turns out to be kind of tricky. Games are complicated and interpreting the results can be hard. As an example, here is a percentile plot of the frametimes in the built-in benchmark of Rise of the Tomb Raider, taken with fraps: https://i.imgur.com/NIrXnkt.png
>> The performance and balanced setups look about the same at lower percentiles, but the low latency setup is a lot lower. This means that the low latency setup, which is the weakest in terms of cpu power, got a higher frame rate for some parts of the benchmark. This doesn't make sense at first. It only starts to make sense if I pay attention to the benchmark while it's running. Rise of the Tomb Raider loads in a lot of geometry dynamically and the low latency setup can't keep up. It has bad pop-in of textures and objects, so the scene the gpu renders is less complicated than in the other setups. A less complicated scene results in a higher frame rate. An odd, counter-intuitive result.
>>
>> Overall the performance and balanced setups have the same percentile curve at lower percentiles in every game I tested. This tells me that the balanced setup has enough cpu power for all the games I've tried. They only differ at higher percentiles due to latency-induced frame drops. The performance setup always has the worst max frametime in every game, so there is no reason to use it over the balanced setup. The performance setup also has crackling sound in several games over hdmi audio, even with MSI enabled. Which setup gets the lowest max frametime depends on the workload. If the game maxes out the cpu of the low latency setup, the max frametime will be worse than the balanced setup; if not, the low latency setup has the best latency.
>>
>> === Conclusion ===
>>
>> The balanced setup (emulator with host) doesn't have the best latency in every workload, but I haven't found any workload where it performs poorly in regard to max latency, IO latency or available cpu power. Even in those workloads where another setup performed better, the balanced setup was always close. If you are too lazy to switch setups depending on the workload, use the balanced setup as the default configuration. If your cpu isn't a 4-core with HT, finding the best setup for your cpu is left as an exercise for the reader.
>>
>> === Future work ===
>>
>> https://vfio.blogspot.se/2016/10/how-to-improve-performance-in-windows-7.html
>> This was a nice trick for forcing win7 to use TSC. Just one problem: it turns out it doesn't work if hyper-threading is enabled. Any time I use a virtual cpu with threads='2', win7 will revert to using the acpi_pm. I've spent a lot of time trying to work around the problem but failed. I don't even know why hyper-threading would make a difference for TSC. Microsoft's documentation is amazingly unhelpful. Even when the guest is hammering the acpi_pm timer, the balanced setup gives better performance than the low latency setup, but I'm afraid the reduced resolution and extra indeterminism of the acpi_pm timer might result in other problems. This is only a problem in win7 because modern versions of windows should use hypervclock. I've read somewhere that it might be possible to modify OVMF to work around the bug in win7 that prevents hyperv from working. With that modification it might be possible to use hypervclock in win7. Perhaps I'll look into that in the future. In the meantime I'll stick with the balanced setup despite the use of acpi_pm.
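>>
>> (For what it's worth, on Windows guests newer than win7 the hyperv enlightenments and the hypervclock timer can simply be enabled in the libvirt domain XML; a minimal sketch of the relevant bits, not the exact config used in these tests:)
>>
>> <features>
>>   <hyperv>
>>     <relaxed state='on'/>
>>     <vapic state='on'/>
>>     <spinlocks state='on' retries='8191'/>
>>   </hyperv>
>> </features>
>> <clock offset='localtime'>
>>   <timer name='hypervclock' present='yes'/>
>>   <timer name='rtc' tickpolicy='catchup'/>
>>   <timer name='hpet' present='no'/>
>> </clock>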

_______________________________________________
vfio-users mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/vfio-users
