On 02/01/2017 04:28 AM, Alyx wrote:
> If I was to boot up the VM into Linux are there any tests I can do in a
> Linux VM environment to help figure out what the issue is? Since no options
> seem to reveal the problem and I presume Linux has more tools to deduce the
> specifics of this issue.
I have no experience with AMD hardware, but some ideas come to mind. It looks like you have already tried all the usual techniques to improve performance and latency, so there might be some source of latency in your actual hardware. You can test this by booting a 4.9 kernel and using the new hwlat tracer. A Linux live CD is probably good enough; some of the Exton live CDs ship a 4.9 kernel, e.g. http://linux.exton.net/?p=820

Boot a system with 4.9, make sure debugfs is mounted and run "echo hwlat > /sys/kernel/debug/tracing/current_tracer". You'll get lines like this in /sys/kernel/debug/tracing/trace:

    <...>-3728 [000] dn.. 668.336945: #1 inner/outer(us): 31/9 ts:1484487305.141866647

The interesting part is 31/9. The two numbers tell you the maximum hardware latency in microseconds, so in this example it's 31 us. If you get high numbers like 1000 us, the latency might be affecting your system. The system will behave unusually while the hwlat tracer is running; that's normal. Try exercising various hardware features while the tracer is running. If you have an integrated GPU, try using that. I only get hwlat spikes when my Intel iGPU uses hardware video decoding for movies; I assume it steals memory bandwidth or something. Actually installing 4.9 on your real system would be preferable, because then you can test the latency under your normal workload. https://lwn.net/Articles/703129/

Hardware latency on the host probably isn't the answer, because you would notice it while using the host; if I understand your problem correctly, you only get stalls in the VM. Something you could test is running hwlat in the guest. Set up the guest with pinning and core reservation for good latency and boot that live CD. Run the test the same way and see what latencies you get. If you get high latency in the guest but not on the host, then the problem is specific to the VM environment. When you boot that live CD you might want to append "tsc=reliable" to the kernel command line in grub before booting. That will force the guest to use the TSC for timing, which might reveal problems with that timer.

If you do see high latency in the guest, a possible explanation is that some kernel thread is hogging the CPU on the host. You can test this by running "perf record -e "sched:sched_switch" -C 1,2,3" on the host while the VM is running. The numbers are the CPU cores where you pinned the VM. This will record every time a process is scheduled to run on those cores. Use the guest until you observe the stall, stop perf and run "perf report --fields=sample,overhead,cpu,comm". This will show all processes that ran on those cores. You would expect to see things like "CPU #/KVM", "kvm-pit", "swapper" and perhaps qemu, but not much else. Swapper is the kernel's idle task, so seeing it just means the core had nothing to do. If you see things like kworker or ksoftirqd, that means kernel threads ran on the same cores as your VM, which could cause the problems you describe.

If there are kworker threads meddling with the VM, finding out what they do can be tricky. Try running "echo "workqueue:workqueue_queue_work" > /sys/kernel/debug/tracing/set_event" on the host while the VM is running. This will trace every time work is queued to a kworker. Then look in the file /sys/kernel/debug/tracing/per_cpu/cpu1/trace (assuming your VM runs on cpu 1).
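To save some copy-pasting, here is a rough consolidation of the host-side commands above into one sketch. It assumes debugfs is mounted at /sys/kernel/debug and the VM is pinned to cores 1-3 as in the perf example; adjust the core list and the per_cpu directory to your own pinning, and run the three checks one at a time rather than all at once.

    # 1) Hardware latency check (4.9+ kernel; the system feels sluggish while active).
    echo hwlat > /sys/kernel/debug/tracing/current_tracer
    cat /sys/kernel/debug/tracing/trace            # look at the inner/outer(us) numbers
    echo nop > /sys/kernel/debug/tracing/current_tracer   # back to the default tracer

    # 2) See what gets scheduled on the VM's cores (assumed here to be 1,2,3).
    perf record -e "sched:sched_switch" -C 1,2,3
    # ... reproduce the stall in the guest, then stop perf with ctrl-c ...
    perf report --fields=sample,overhead,cpu,comm

    # 3) Trace work queued to kworkers while the VM is running.
    echo "workqueue:workqueue_queue_work" > /sys/kernel/debug/tracing/set_event
    cat /sys/kernel/debug/tracing/per_cpu/cpu1/trace      # one directory per pinned core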
You will get lines like this:

    kworker/u16:6-5206 [001] d..2 5351.981584: workqueue_queue_work: work struct=ffff88036a44b0d0 function=do_worker workqueue=ffff88041ba25c00 req_cpu=8 cpu=4294967295

The interesting part is function=do_worker. This tells us that the kernel function "do_worker" has been running in a kworker on that CPU. "do_worker" is not the most descriptive name, so to find out what it is you have to grep the kernel source. If you do, you'll see that do_worker lives in drivers/md/dm-thin.c and is part of the device mapper thin target code used by LVM thin partitions. What work runs in kworker threads depends on your setup and hardware, so you'll have to draw your own conclusions from what you find. I had problems with dm-thin, but for you it could be anything.

Since you are using Windows 10 you probably use the Hyper-V clock in the guest, and there shouldn't be any problems with that, but since your guest stalls and then catches up I thought the problem might be related to the timer used in the guest. I know the TSC can sometimes run backwards in a guest, which can confuse the guest software. I don't know how to reliably set the timer source in Windows, but on Linux you can check /sys/devices/system/clocksource/clocksource0/available_clocksource to see which timers exist (you have to boot the guest with tsc=reliable for tsc to show up). You can select a timer with "echo acpi_pm > /sys/devices/system/clocksource/clocksource0/current_clocksource" and then check whether you still get stalls in the guest. Perhaps a video like this could be helpful for spotting the stalls: https://www.youtube.com/watch?v=cuXsupMuik4

The only other thing I can think of is that there are some differences in how hardware virtualisation works between Intel and AMD. I think I read somewhere that disabling the hardware-accelerated nested page tables on AMD gives a performance boost, but that doesn't make sense to me. The kvm_intel module has various parameters for controlling things like this; I can list them under /sys/module/kvm_intel/parameters/. Check what parameters the kvm_amd module has. Perhaps you'll find something interesting there.
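If it helps, here is a small sketch of those last two checks, guest commands first and host commands after. The "npt" parameter mentioned in the comment is an assumption on my part; check the actual parameter names your kvm_amd module exposes.

    # In the Linux guest (booted with tsc=reliable so tsc shows up in the list):
    cat /sys/devices/system/clocksource/clocksource0/available_clocksource
    echo acpi_pm > /sys/devices/system/clocksource/clocksource0/current_clocksource

    # On the host: list the kvm_amd parameters and their current values.
    grep . /sys/module/kvm_amd/parameters/*

    # Assumption: nested page tables are usually controlled by the "npt" parameter,
    # which can only be changed when reloading the module with all VMs shut down:
    #   modprobe -r kvm_amd && modprobe kvm_amd npt=0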
