[KVM-AUTOTEST][PATCH] Win2003-64.steps: fix a barrier at step 17 that sometimes fails due to popups.
From: Michael Goldish <mgold...@redhat.com>

Signed-off-by: Michael Goldish <mgold...@redhat.com>

diff --git a/client/tests/kvm_runtest_2/steps/Win2003-64.steps b/client/tests/kvm_runtest_2/steps/Win2003-64.steps
index 2ce794f..9a4d3e7 100644
--- a/client/tests/kvm_runtest_2/steps/Win2003-64.steps
+++ b/client/tests/kvm_runtest_2/steps/Win2003-64.steps
@@ -124,7 +124,7 @@ key 0xdc
 step 1081.26
 screendump 20080101_17_defe197bc5b379b442c5bbbf9fe649f1.ppm
 # Windows 2003 Start Menu Opened
-barrier_2 29 19 321 429 2c7df6a8da99f7a33e37b94d165129f4 19
+barrier_2 118 73 209 68 7c9eca05f228fe133007bffb22e14a0c 19
 # Sending keys: u
 key u
 #
--
To unsubscribe from this list: send the line "unsubscribe kvm-commits" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[KVM-AUTOTEST][PATCH] kvm_guest_wizard.py, stepmaker.py: do not explicitly print failure messages.
From: Michael Goldish <mgold...@redhat.com>

This is now taken care of by kvm_runtest_2.py.

Signed-off-by: Michael Goldish <mgold...@redhat.com>

diff --git a/client/tests/kvm_runtest_2/kvm_guest_wizard.py b/client/tests/kvm_runtest_2/kvm_guest_wizard.py
index 2f08c78..afe6b02 100644
--- a/client/tests/kvm_runtest_2/kvm_guest_wizard.py
+++ b/client/tests/kvm_runtest_2/kvm_guest_wizard.py
@@ -139,24 +139,16 @@ def barrier_2(vm, words, fail_if_stuck_for, stuck_detection_history, output_dir,
 def run_steps(test, params, env):
     vm = kvm_utils.env_get_vm(env, params.get("main_vm"))
     if not vm:
-        message = "VM object not found in environment"
-        kvm_log.error(message)
-        raise error.TestError, message
+        raise error.TestError, "VM object not found in environment"
     if not vm.is_alive():
-        message = "VM seems to be dead; Guestwizard requires a living VM"
-        kvm_log.error(message)
-        raise error.TestError, message
+        raise error.TestError, "VM seems to be dead; Guestwizard requires a living VM"
     steps_filename = params.get("steps")
     if not steps_filename:
-        message = "Steps filename not specified"
-        kvm_log.error(message)
-        raise error.TestError, message
+        raise error.TestError, "Steps filename not specified"
     steps_filename = os.path.join(test.bindir, "steps", steps_filename)
     if not os.path.exists(steps_filename):
-        message = "Steps file not found: %s" % steps_filename
-        kvm_log.error(message)
-        raise error.TestError, message
+        raise error.TestError, "Steps file not found: %s" % steps_filename

     fail_if_stuck_for = params.get("fail_if_stuck_for")
     if fail_if_stuck_for:
diff --git a/client/tests/kvm_runtest_2/stepmaker.py b/client/tests/kvm_runtest_2/stepmaker.py
index 476c145..54a1a4a 100644
--- a/client/tests/kvm_runtest_2/stepmaker.py
+++ b/client/tests/kvm_runtest_2/stepmaker.py
@@ -323,24 +323,16 @@ class StepMaker(stepeditor.StepMakerWindow):
 def run_stepmaker(test, params, env):
     vm = kvm_utils.env_get_vm(env, params.get("main_vm"))
     if not vm:
-        message = "VM object not found in environment"
-        kvm_log.error(message)
-        raise error.TestError, message
+        raise error.TestError, "VM object not found in environment"
     if not vm.is_alive():
-        message = "VM seems to be dead; Step Maker requires a living VM"
-        kvm_log.error(message)
-        raise error.TestError, message
+        raise error.TestError, "VM seems to be dead; Step Maker requires a living VM"
     steps_filename = params.get("steps")
     if not steps_filename:
-        message = "Steps filename not specified"
-        kvm_log.error(message)
-        raise error.TestError, message
+        raise error.TestError, "Steps filename not specified"
     steps_filename = os.path.join(test.bindir, "steps", steps_filename)
     if os.path.exists(steps_filename):
-        message = "Steps file %s already exists" % steps_filename
-        kvm_log.error(message)
-        raise error.TestError, message
+        raise error.TestError, "Steps file %s already exists" % steps_filename

     StepMaker(vm, steps_filename, test.debugdir, params)
     gtk.main()
[KVM-AUTOTEST][PATCH] kvm_guest_wizard.py: barrier_2: sleep at the end of each iteration.
From: Michael Goldish <mgold...@redhat.com>

Sleep at the end of the iteration, after checking the md5sum for the
first time. This enables us to pass the barrier very quickly in some
cases. (Currently we sleep first and check the md5sum later.)

Signed-off-by: Michael Goldish <mgold...@redhat.com>

diff --git a/client/tests/kvm_runtest_2/kvm_guest_wizard.py b/client/tests/kvm_runtest_2/kvm_guest_wizard.py
index afe6b02..c89e12e 100644
--- a/client/tests/kvm_runtest_2/kvm_guest_wizard.py
+++ b/client/tests/kvm_runtest_2/kvm_guest_wizard.py
@@ -51,8 +51,6 @@ def barrier_2(vm, words, fail_if_stuck_for, stuck_detection_history, output_dir,
     failure_message = None

-    iteration = 0
-
     # Main loop
     while True:
         # Check for timeouts
@@ -63,13 +61,6 @@ def barrier_2(vm, words, fail_if_stuck_for, stuck_detection_history, output_dir,
             failure_message = "guest is stuck"
             break

-        # Sleep for a while
-        if iteration == 0 and sleep_duration == 1.0:
-            time.sleep(0.5)
-        else:
-            time.sleep(sleep_duration)
-        iteration += 1
-
         # Make sure vm is alive
         if not vm.is_alive():
             failure_message = "VM is dead"
@@ -108,6 +99,9 @@ def barrier_2(vm, words, fail_if_stuck_for, stuck_detection_history, output_dir,
         # Limit queue length to stuck_detection_history
         prev_whole_image_md5sums = prev_whole_image_md5sums[:stuck_detection_history]

+        # Sleep for a while
+        time.sleep(sleep_duration)
+
     # Failure
     message = "Barrier failed at step %s after %.2f seconds (%s)" % \
               (current_step_num, time.time() - start_time, failure_message)
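The effect of the reordering can be distilled into a small sketch (the function and parameter names below are illustrative, not the actual barrier_2 signature): the screendump check now runs before the sleep, so a barrier whose condition is already satisfied passes on the first iteration with no delay.

```python
import time

def barrier_sketch(check_screen, timeout, sleep_duration):
    # Check first, sleep only after a failed check -- an already-satisfied
    # barrier returns immediately instead of sleeping one interval first.
    start = time.time()
    while time.time() - start <= timeout:
        if check_screen():          # e.g. compare md5sum of the watched region
            return True             # barrier passed
        time.sleep(sleep_duration)  # sleep at the end of the iteration
    return False                    # barrier timed out
```

With the old sleep-then-check order, even a trivially satisfied barrier paid at least one sleep interval per step.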
[KVM-AUTOTEST][PATCH] RHEL-3.9 step files: fix the first barriers, which are a little risky.
From: Michael Goldish <mgold...@redhat.com>

In these step files, the regions picked for the barriers are updated
before the regions that actually indicate readiness for keyboard input
(the "boot:" prompt). Therefore, when the step files run, keyboard input
(the Enter key) is sent before the VM is ready, and thus has no effect.

Signed-off-by: Michael Goldish <mgold...@redhat.com>

diff --git a/client/tests/kvm_runtest_2/steps/RHEL-3.9-i386.steps b/client/tests/kvm_runtest_2/steps/RHEL-3.9-i386.steps
index 0b449d0..7e053f8 100644
--- a/client/tests/kvm_runtest_2/steps/RHEL-3.9-i386.steps
+++ b/client/tests/kvm_runtest_2/steps/RHEL-3.9-i386.steps
@@ -5,14 +5,14 @@
 # step unknown
 screendump 20090404_143150_210be5cd4155aa4fdbd89494106b0f4d.ppm
-barrier_2 176 37 27 270 a3ea4e66598c86efeecc5926e244a8f5 20 optional
+barrier_2 42 171 0 265 1e64135a6da7be7e77f9de86f9eff237 20 optional
 # Sending keys: ret
 key ret
 #
 step 33.24
 screendump 20080101_01_9e427763ab38cb59e66287f190585eb2.ppm
 # boot options
-barrier_2 272 19 30 26 4559684d52ebb532b4ccae01c190e45f 15 optional
+barrier_2 47 170 0 27 bb2399341a2c56e29213445c375779a6 15 optional
 # Sending keys: ret
 key ret
 #
diff --git a/client/tests/kvm_runtest_2/steps/RHEL-3.9-x86_64.steps b/client/tests/kvm_runtest_2/steps/RHEL-3.9-x86_64.steps
index 9ef221a..6b3e8bf 100644
--- a/client/tests/kvm_runtest_2/steps/RHEL-3.9-x86_64.steps
+++ b/client/tests/kvm_runtest_2/steps/RHEL-3.9-x86_64.steps
@@ -7,7 +7,7 @@ step 33.24
 screendump 20080101_01_9e427763ab38cb59e66287f190585eb2.ppm
 # boot options
 sleep 5
-barrier_2 272 19 30 26 4559684d52ebb532b4ccae01c190e45f 166 optional
+barrier_2 47 167 0 27 a4964917d318bfb8b9094569e19b4d60 166 optional
 # Sending keys: ret
 key ret
 #
[KVM-AUTOTEST][PATCH] make_html_report.py: don't use head -n 1 when getting the KVM version.
From: Michael Goldish <mgold...@redhat.com>

Use Python's splitlines() instead of "head -n 1" because the latter
causes "broken pipe" messages in some configurations.

Signed-off-by: Michael Goldish <mgold...@redhat.com>

diff --git a/client/tests/kvm_runtest_2/make_html_report.py b/client/tests/kvm_runtest_2/make_html_report.py
index 677da78..002062c 100755
--- a/client/tests/kvm_runtest_2/make_html_report.py
+++ b/client/tests/kvm_runtest_2/make_html_report.py
@@ -1636,12 +1636,14 @@ def get_keyval_value(result_dir, key):
     If no appropriate line is found, return 'Unknown'.
     """
     keyval_pattern = os.path.join(result_dir, "kvm_runtest_2.*", "keyval")
-    keyval_line = commands.getoutput("grep -h %s %s | head -n 1" % (key, keyval_pattern))
+    keyval_lines = commands.getoutput(r"grep -h '\b%s\b.*=' %s" % (key, keyval_pattern))
+    if not keyval_lines:
+        return "Unknown"
+    keyval_line = keyval_lines.splitlines()[0]
     if key in keyval_line and "=" in keyval_line:
-        value = keyval_line.split("=")[1].strip()
+        return keyval_line.split("=")[1].strip()
     else:
-        value = "Unknown"
-    return value
+        return "Unknown"


 def get_kvm_version(result_dir):
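The pattern the patch adopts can be shown in isolation (the sample output string below is made up): `head -n 1` exits after one line, which can leave grep writing to a closed pipe and emitting "broken pipe" noise, whereas taking the first line in Python avoids the second process entirely.

```python
# Take only the first line of multi-line command output in Python,
# instead of piping the producer through `head -n 1`.
output = "kvm_version = kvm-85\nkvm_version = kvm-84\n"  # stand-in for grep output

first_line = output.splitlines()[0] if output else ""
value = first_line.split("=")[1].strip() if "=" in first_line else "Unknown"
```

The `if output` guard matters because `"".splitlines()` is an empty list, so indexing `[0]` on empty output would raise IndexError — exactly the case the patch handles with an early `return "Unknown"`.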
[KVM-AUTOTEST][PATCH] Fedora-8-i386.steps: fix a barrier at step 26 that sometimes fails.
From: Alexey Eromenko <aerom...@redhat.com>

Signed-off-by: Alexey Eromenko <aerom...@redhat.com>
Signed-off-by: Michael Goldish <mgold...@redhat.com>

diff --git a/client/tests/kvm_runtest_2/steps/Fedora-8-i386.steps b/client/tests/kvm_runtest_2/steps/Fedora-8-i386.steps
index b23878a..732d705 100644
--- a/client/tests/kvm_runtest_2/steps/Fedora-8-i386.steps
+++ b/client/tests/kvm_runtest_2/steps/Fedora-8-i386.steps
@@ -190,7 +190,7 @@ key alt-y
 step 1438.85
 screendump 20080101_26_ecf306a3f7ace4cb936f009c80afab21.ppm
 # date and time
-barrier_2 280 292 520 308 5b1d9fd42d7025be66355a94c4c93dbe 21
+barrier_2 221 290 579 310 4502dc2f76ed6aba1a2229c8385a8eae 21
 # Sending keys: alt-f
 key alt-f
 #
[KVM-AUTOTEST][PATCH] WinXP-32.steps: add an optional barrier to deal with closed start menu.
From: Michael Goldish <mgold...@redhat.com>

Usually the barrier at step 24 should find an open start menu. Sometimes
the start menu is closed, so an additional step is needed to open it in
that case.

Signed-off-by: Michael Goldish <mgold...@redhat.com>

diff --git a/client/tests/kvm_runtest_2/steps/WinXP-32.steps b/client/tests/kvm_runtest_2/steps/WinXP-32.steps
index ca3cef8..b0c6e35 100644
--- a/client/tests/kvm_runtest_2/steps/WinXP-32.steps
+++ b/client/tests/kvm_runtest_2/steps/WinXP-32.steps
@@ -148,8 +148,20 @@ key alt-f
 #
 step 2279.61
 screendump 20080101_24_51acae003d73d0f4a4c58fa0a053b471.ppm
-# Win XP Start Menu (opened)
-barrier_2 104 24 0 576 d14fcd1ec7ffcd0fa3e942b9caa4d4e5 240
+# Win XP desktop
+barrier_2 48 51 391 288 bbac8a522510d7c8d6e515f6a3fbd4c3 240
+#
+step 2279.61
+screendump 20090416_150641_b72ad5c48ec2dbc9814d569e38cbb4cc.ppm
+# Win XP Start Menu (closed)
+barrier_2 104 41 0 559 a7cc02cecff2cb495f300aefbb99d9ae 5 optional
+# Sending keys: ctrl-esc
+key ctrl-esc
+#
+step 2279.61
+screendump 20080101_24_51acae003d73d0f4a4c58fa0a053b471.ppm
+# Win XP Start Menu (open)
+barrier_2 189 67 0 533 69a5571c58bdf9e0acee8337883bec43 20
 # Sending keys: u
 key u
 #
[KVM-AUTOTEST][PATCH] kvm_tests.cfg.sample: fix typo in RHEL.4.7.x86_64 ISO filename.
From: Alexey Eromenko <aerom...@redhat.com>

Signed-off-by: Alexey Eromenko <aerom...@redhat.com>
Signed-off-by: Michael Goldish <mgold...@redhat.com>
Signed-off-by: Uri Lublin <u...@redhat.com>

diff --git a/client/tests/kvm_runtest_2/kvm_tests.cfg.sample b/client/tests/kvm_runtest_2/kvm_tests.cfg.sample
index f621abe..68df834 100644
--- a/client/tests/kvm_runtest_2/kvm_tests.cfg.sample
+++ b/client/tests/kvm_runtest_2/kvm_tests.cfg.sample
@@ -236,7 +236,7 @@ variants:
         image_name = rhel4-64
         install:
             steps=RHEL-4.7-x86_64.steps
-            cdrom=linux/RHEL-4-x86_64-DVD.iso
+            cdrom=linux/RHEL-4.7-x86_64-DVD.iso
             md5sum=ea9dae16dd86f7d94092d0e672333292
             md5sum_1m=58fa63eaee68e269f4cb1d2edf479792
[KVM-AUTOTEST][PATCH] kvm_tests.cfg.sample: Windows: change migration_test_command to ver vol.
From: Michael Goldish <mgold...@redhat.com>

The current migration_test_command is "help", which for some reason
returns a non-zero %errorlevel%. This makes get_command_status_output
report the output as if an error had occurred. "ver vol" returns 0, and
is also probably more informative.

Signed-off-by: Michael Goldish <mgold...@redhat.com>

diff --git a/client/tests/kvm_runtest_2/kvm_tests.cfg.sample b/client/tests/kvm_runtest_2/kvm_tests.cfg.sample
index 68df834..cc0907b 100644
--- a/client/tests/kvm_runtest_2/kvm_tests.cfg.sample
+++ b/client/tests/kvm_runtest_2/kvm_tests.cfg.sample
@@ -269,6 +269,8 @@ variants:
         ssh_status_test_command = echo %errorlevel%
         username = Administrator
         password = 123456
+        migrate:
+            migration_test_command = ver vol

         variants:
             - Win2000:
[KVM-AUTOTEST][PATCH] kvm_tests.cfg.sample: remove -snapshot from boot and reboot tests.
From: Michael Goldish <mgold...@redhat.com>

It shouldn't be too risky to run these tests without -snapshot. It also
saves time: since the following tests do not use -snapshot, the framework
would otherwise have to shut down the VM after boot/reboot and restart it
without -snapshot.

Signed-off-by: Michael Goldish <mgold...@redhat.com>
Signed-off-by: Uri Lublin <u...@redhat.com>

diff --git a/client/tests/kvm_runtest_2/kvm_tests.cfg.sample b/client/tests/kvm_runtest_2/kvm_tests.cfg.sample
index cc0907b..9798c83 100644
--- a/client/tests/kvm_runtest_2/kvm_tests.cfg.sample
+++ b/client/tests/kvm_runtest_2/kvm_tests.cfg.sample
@@ -42,13 +42,11 @@ variants:
     - boot:         install setup
         type = boot
-        extra_params += " -snapshot"
         kill_vm_on_error = yes
     - reboot:       install setup
         type = boot
         reboot = yes
-        extra_params += " -snapshot"
         kill_vm_on_error = yes
     - migrate:      install setup
[KVM-AUTOTEST][PATCH] kvm_install: make builddir (where kvm is installed) next to srcdir
From: Uri Lublin <u...@redhat.com>

If the user specifies a directory for the source (srcdir), it makes sense
for the build directory to reside in the same directory. This is useful
for server-dispatched client tests, as without this patch one would have
to re-install kvm for every job.

Note: this works for the most commonly used modes (kvm sources), including
git, snapshot and release. It may not work (meaning the build directory
will not be next to srcdir) for e.g. localsrc.

Signed-off-by: Uri Lublin <u...@redhat.com>

diff --git a/client/tests/kvm_runtest_2/kvm_install.py b/client/tests/kvm_runtest_2/kvm_install.py
index 843a621..6d8fcb2 100755
--- a/client/tests/kvm_runtest_2/kvm_install.py
+++ b/client/tests/kvm_runtest_2/kvm_install.py
@@ -228,7 +228,10 @@ def __load_kvm_modules():
 def __install_kvm(test, srcdir):
     # create destination dir
-    kvm_build_dir = os.path.join(test.bindir, "build")
+
+    kvm_build_dir = os.path.join(srcdir, '..', '..', 'build')
+    kvm_build_dir = os.path.abspath(kvm_build_dir)
+
     if not os.path.exists(kvm_build_dir):
         os.mkdir(kvm_build_dir)
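The path arithmetic in the patch is compact enough to be easy to misread, so here is the same logic in isolation (the sample srcdir below is hypothetical): joining `srcdir` with `'..', '..', 'build'` and normalizing places the build directory two levels above the source checkout.

```python
import os

def build_dir_next_to_srcdir(srcdir):
    # Mirror the patch: srcdir/../../build, normalized to an absolute path,
    # so the build lives beside the source tree rather than under test.bindir.
    return os.path.abspath(os.path.join(srcdir, "..", "..", "build"))
```

For example, a srcdir of `/home/autotest/kvm/src/kvm-85` yields `/home/autotest/kvm/build`, which persists across jobs as the commit message intends.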
Re: Split kvm source tarballs
Christoph Hellwig wrote:
> Any idea when the split will happen?

I'm working on it, but hit a snag when rewriting upstream qemu commits to
match the official qemu.git:

- my repo and qemu.git used different -k options ($keyword$ squashing), so
  I can't match tree objects
- some non-UTF-8 characters got corrupted, so I can't match log messages
  (some are dups anyway)
- I can't match the git-svn-id line since I ripped it off in
  kvm-userspace.git
- I can't match the ordinal position of the commit in the changelog since
  some bogus commits made their way into kvm-userspace.git

I expect I'll work around these issues, but it isn't like I have my script
ready to roll.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] deal with interrupt shadow state for emulated instruction
H. Peter Anvin wrote:
> Avi Kivity wrote:
>>> We don't emulate guest user mode.
>>
>> Well, if guest userspace can convince its kernel to give it access to
>> some memory mapped I/O register, I guess it can execute repeated
>> 'mov ss, mmio' and starve the guest kernel.
>
> It doesn't need an MMIO register to do that, even.

Can you explain how?

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [PATCH] Add MCE support to KVM
Anthony Liguori wrote:
> Avi Kivity wrote:
>>> Then we would need to tell which read-only MSRs are setup writeable
>>> and which aren't... I'm okay with an ioctl to setup MCE, but just make
>>> sure userspace has all the information to know what the kernel can do
>>> rather than the try-and-see-if-it-works approach.
>>
>> We can publish this information via KVM_CAP things, or via another
>> ioctl (see KVM_GET_SUPPORTED_CPUID2 for an example).
>
> Why not introduce a new exit type for MSR reads/writes that aren't
> handled by the kernel? You just need a bit on the return that indicates
> whether to GPF because of an invalid MSR access. KVM_SET_MSRs should be
> reserved for MSRs that are performance sensitive.

Not all of them will be.

Right now everything in the vcpu is emulated in the kernel. Everything
else is emulated either in the kernel (irqchip) or in userspace. This
makes things easier to understand, and is more future friendly if more
cpu features become virtualized by hardware. While these are not
compelling reasons, they at least lean the balance in favour of a kernel
implementation.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [PATCH -v2] Add MCE support to KVM
Anthony Liguori wrote:
> Huang Ying wrote:
>> The related MSRs are emulated. MCE capability is exported via extension
>> KVM_CAP_MCE and ioctl KVM_X86_GET_MCE_CAP_SUPPORTED. A new vcpu ioctl
>> command KVM_X86_SETUP_MCE is used to setup MCE emulation such as the
>> mcg_cap. MCE is injected via vcpu ioctl command KVM_X86_SET_MCE.
>> Extended machine-check state (MCG_EXT_P) and CMCI are not simulated.
>
> Maybe I'm missing something, but couldn't this be implemented entirely
> within userspace? There's nothing VT/SVM specific about this. If the
> issue is that setting these MSRs from userspace via KVM_SET_MSRS isn't
> enough, perhaps we should add userspace MSR handling.

You also need to inject the MCE.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [PATCH] return default values for apic probe functions.
Glauber Costa wrote:
> As KVM cpus run on threads, it is possible that we call
> kvm_load_registers() from a cpu thread while the apic has not yet fully
> initialized. kvm_load_registers() is called from ap_main_loop. This is
> not a problem when we're starting the whole machine together, but it is
> a problem for hotplug, since we don't have the protection of the locks
> that protect machine initialization. Currently, some executions of cpu
> hotplug on rainy sundays fail with a segfault.
>
> Moving apic initialization to before kvm_init_vcpu proved fruitful, as
> there are some dependencies involved (the kvm irqchip would fail to
> initialize).

I presume you mean unfruitful (or perhaps a nasty kind of fruit).

> This patch provides default values to be used for tpr and apic_base,
> which will be returned when the apic is not yet properly initialized. It
> is aimed at kvm, where the problem exists, but it could equally be used
> for qemu too, if there is agreement.

Seems like a hack... can you try not to make the vcpu visible until it is
completely initialized? (and what is the problem exactly - someone
accessing the registers from a different thread? that shouldn't happen)

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [PATCH 05/15] Coalesce userspace/kernel irqchip interrupt injection logic.
Jan Kiszka wrote:
> -
> -	if (!npt_enabled && event_injection)
> -		kvm_mmu_unprotect_page_virt(&svm->vcpu, fault_address);
> +	else {
> +		if (svm->vcpu.arch.interrupt.pending ||
> +		    svm->vcpu.arch.exception.pending)
> +			kvm_mmu_unprotect_page_virt(&svm->vcpu, fault_address);
> +	}
>
> Without understanding yet why kvm_mmu_unprotect_page_virt is required
> here,

That is needed in case interrupt injection failed because the process of
injecting the interrupt (usually pushing data on the stack) writes to a
write-protected page table.

I guess we need nmi_pending here as well.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [PATCH 13/15] Add NMI injection support to SVM.
Jan Kiszka wrote:
> First, this must return 1 (or set an exit reason, but there is no reason
> to escape to user space here). And second, I think a corner case is not
> handled the same way as on real iron: If there is already the next NMI
> waiting, we will inject it before iret, not after its execution as it
> should be.

Yes, good catch.

> No easy solution for this yet. Maybe emulating iret, but there is no
> implementation, specifically for protected mode.

That will be a disaster, IRET is horribly complicated.

> Maybe setting a breakpoint. Or maybe enforcing a single step exception.

Single step looks good, except that it may conflict with a guest debugging
itself (guest debugging an NMI handler?!).

> Nothing trivial in this list. On the other hand, this may only be a
> slight imprecision of the virtualization. Need to think about it.

It may cause a stack overflow if we have a huge stream of NMIs (if an NMI
triggers an NMI in the handler). Of course that's a totally unrealistic
scenario.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [PATCH 13/15] Add NMI injection support to SVM.
Avi Kivity wrote:
> Jan Kiszka wrote:
>> First, this must return 1 (or set an exit reason, but there is no
>> reason to escape to user space here). And second, I think a corner case
>> is not handled the same way as on real iron: If there is already the
>> next NMI waiting, we will inject it before iret, not after its
>> execution as it should be.
>
> Yes, good catch.
>
>> No easy solution for this yet. Maybe emulating iret, but there is no
>> implementation, specifically for protected mode.
>
> That will be a disaster, IRET is horribly complicated.

Yeah, I know...

>> Maybe setting a breakpoint. Or maybe enforcing a single step exception.
>
> Single step looks good, except that it may conflict with a guest
> debugging itself (guest debugging an NMI handler?!).

But that should be solvable without too much effort. BTW, guest
single-stepping in and out of interrupt handlers is not properly working
anyway. We only set TF in the current eflags but do not bother about the
CPU state that will get loaded next. Given some rainy days (or a paying
customer) I think I'll look into this once again. Same goes for
suppressing IRQ injection while single-stepping, just as QEMU does, which
may even come earlier as someone already asked for it.

>> Nothing trivial in this list. On the other hand, this may only be a
>> slight imprecision of the virtualization. Need to think about it.
>
> It may cause a stack overflow if we have a huge stream of NMIs (if an
> NMI triggers an NMI in the handler). Of course that's a totally
> unrealistic scenario.

Good point. But as it is a corner case, I think we can fly without a
complete solution first, fixing it in a second step.

Jan
Re: [PATCH] Add shared memory PCI device that shares a memory object betweens VMs
Cameron Macdonell wrote:
> Hi Avi and Anthony,
>
> Sorry for the top-reply, but we haven't discussed this aspect here
> before. I've been thinking about how to implement interrupts. As far as
> I can tell, unix domain sockets in Qemu/KVM are used point-to-point,
> with one VM being the server by specifying "server" along with the
> "unix:" option. This works simply for two VMs, but I'm unsure how this
> can extend to multiple VMs. How would a server VM know how many clients
> to wait for? How can messages then be multicast or broadcast? Is a
> separate interrupt server necessary?

I don't think unix provides a reliable multicast RPC. So yes, an
interrupt server seems necessary. You could expand its role and make it a
shared memory PCI card server, and have it also be responsible for
providing the backing file using an SCM_RIGHTS fd. That would reduce
setup headaches for users (setting up a file for which all VMs have
permissions).

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
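The SCM_RIGHTS mechanism mentioned above — a server process handing an already-open file descriptor to each client over a unix domain socket — can be sketched as follows. This is an illustrative standalone example, not qemu code; the helper names are made up.

```python
import array
import os
import socket

def send_fd(sock, fd):
    # Pass an open file descriptor as SCM_RIGHTS ancillary data; the
    # kernel duplicates it into the receiver's fd table.
    fds = array.array("i", [fd])
    sock.sendmsg([b"F"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])

def recv_fd(sock):
    # Receive one descriptor alongside a 1-byte payload.
    bufsize = socket.CMSG_SPACE(array.array("i").itemsize)
    msg, ancdata, flags, addr = sock.recvmsg(1, bufsize)
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            return array.array("i", data)[0]
    raise RuntimeError("no file descriptor in ancillary data")
```

A shared-memory server built this way would open the backing file once and hand the same descriptor to every connecting VM, so no per-VM filesystem permissions need to be arranged.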
Re: Luvalley-2 has been released: running KVM below any operating system
Xiaodong Yi wrote:
> Hi,
>
> I've tested the guest Linux using UnixBench 5.1.2. The platform is:
>
> * Intel Core Duo CPU with 2 cores, 2GB RAM
> * CentOS 5.2 as the dom0 Linux, i.e., the host Linux for KVM
> * CentOS 5.2 as the guest Linux, i.e., the Linux running on the virtual
>   machine provided by Qemu
>
> The first set of results is for Luvalley, and the second one is for KVM.
> As the result shows, Luvalley's guest Linux is 20% ~ 30% faster than
> KVM's guest! It is very surprising to me. I had thought Luvalley's guest
> should have the same performance as KVM's.

Yes, it is surprising.

> Double-Precision Whetstone    12287.7 MWIPS (10.0 s, 2 samples)
> Double-Precision Whetstone     2166.3 MWIPS (10.2 s, 2 samples)

That's by far the biggest difference. Can you confirm it isn't a typo? If
not, then it looks like we have a bug in floating point handling. I don't
think this benchmark uses sse.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: compiling git -85rc
Michael Tokarev wrote:
> For quite some time the compilation of userspace has been broken in my
> environment. I'm not sure if it's new or it was this way before -- at
> least kvm-84 builds fine. The problem happens when the running kernel is
> 64-bit and the userspace is 32-bit. In the source tree, the qemu/
> directory correctly configures itself to use the -m32 compiler flag (the
> default anyway), but other directories -- libkvm/ and user/ -- use
> x86_64 for $(ARCH) and hence -m64 and /lib64. And sure thing, the link
> stage does not work because -lkvm is not compatible (32 vs 64 bits).
>
> I had to change "include config-$(ARCH).mak" into "include
> config-i386.mak" in libkvm/Makefile and in user/Makefile to build it.
> Maybe, just maybe, it's sufficient to do a "make ARCH=i386" step instead
> -- I'll try that too.

That's a bug, I'll fix it up.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: Luvalley-2 has been released: running KVM below any operating system
Avi Kivity wrote:
> Xiaodong Yi wrote:
>> Hi,
>>
>> I've tested the guest Linux using UnixBench 5.1.2. The platform is:
>>
>> * Intel Core Duo CPU with 2 cores, 2GB RAM
>> * CentOS 5.2 as the dom0 Linux, i.e., the host Linux for KVM
>> * CentOS 5.2 as the guest Linux, i.e., the Linux running on the virtual
>>   machine provided by Qemu
>>
>> The first set of results is for Luvalley, and the second one is for
>> KVM. As the result shows, Luvalley's guest Linux is 20% ~ 30% faster
>> than KVM's guest! It is very surprising to me. I had thought Luvalley's
>> guest should have the same performance as KVM's.
>
> Yes, it is surprising.
>
>> Double-Precision Whetstone    12287.7 MWIPS (10.0 s, 2 samples)
>> Double-Precision Whetstone     2166.3 MWIPS (10.2 s, 2 samples)
>
> That's by far the biggest difference. Can you confirm it isn't a typo?
> If not, then it looks like we have a bug in floating point handling. I
> don't think this benchmark uses sse.

Even the native Linux numbers are not that high, rather comparable to
KVM. I suspect Luvalley is fooling the benchmark here...

Jan
Re: Luvalley-2 has been released: running KVM below any operating system
Jan Kiszka wrote:
> Avi Kivity wrote:
>> Xiaodong Yi wrote:
>>> I've tested the guest Linux using UnixBench 5.1.2. The first set of
>>> results is for Luvalley, and the second one is for KVM. As the result
>>> shows, Luvalley's guest Linux is 20% ~ 30% faster than KVM's guest!
>>> It is very surprising to me. I had thought Luvalley's guest should
>>> have the same performance as KVM's.
>>
>> Yes, it is surprising.
>>
>>> Double-Precision Whetstone    12287.7 MWIPS (10.0 s, 2 samples)
>>> Double-Precision Whetstone     2166.3 MWIPS (10.2 s, 2 samples)
>>
>> That's by far the biggest difference. Can you confirm it isn't a typo?
>> If not, then it looks like we have a bug in floating point handling. I
>> don't think this benchmark uses sse.
>
> Even the native Linux numbers are not that high, rather comparable to
> KVM. I suspect Luvalley is fooling the benchmark here...

Most likely timing. Likely Luvalley guests' time runs too slowly, so they
achieve more loops per unit of guest time.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
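The slow-guest-clock theory is easy to check arithmetically (this is a hypothetical illustration, not a measurement): a benchmark that executes a fixed workload against real time but divides by elapsed *guest* time reports `true_score / clock_rate`.

```python
# If the guest clock advances at clock_rate (< 1.0 means slow) relative to
# real time, a loops-per-second benchmark inflates its score by 1/clock_rate.
def reported_score(true_score, clock_rate):
    wall_seconds = 10.0                         # real benchmark duration
    work = true_score * wall_seconds            # loops actually executed
    guest_seconds = wall_seconds * clock_rate   # what the guest clock shows
    return work / guest_seconds

# A guest clock running at ~17.6% of real speed would inflate the KVM-like
# 2166.3 MWIPS figure to the 12287.7 MWIPS reported for Luvalley.
```

So the two Whetstone numbers in the thread differ by exactly the factor a clock running at roughly one-sixth speed would produce, which is consistent with the timing explanation.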
Re: port kvm to arm
(copying kvm@vger.kernel.org, where most kvm development happens)

sen wang wrote:
> I have some ideas about ARM KVM. I have registered a project, named
> arm-kvm, to port KVM to ARM. I'll kick off the ARM KVM porting next
> week. This is my plan:
>
> Step 1: Make a guest Linux run on the hypervisor Linux. From the
> hypervisor Linux's viewpoint, the guest is a thread of it, which is like
> the x86 KVM. But in this step the guest Linux has no IO; it is based on
> a rootfs. The physical memory is allocated to the guest OS at creation
> time. The guest Linux still runs in privileged mode, handling user-mode
> memory exceptions, SWI system calls, etc., but the guest Linux will not
> touch the hardware devices. The physical interrupts should be controlled
> by the hypervisor Linux, so the hypervisor Linux can switch the guest
> off when an interrupt happens. And the hypervisor Linux will provide
> some virtual system devices, like a system timer and an interrupt
> controller, for the guest. Because of the lack of hardware support, the
> guest Linux should know when to exit; for example at idle time or on
> kernel panic the guest should voluntarily exit the VM -- which is like a
> thread switch. So there is another job: modification of the hypervisor
> Linux to support ARM, and modification of qemu for ARM KVM.

You can hook read[bwl]() and similar functions to use hypercalls, which
will allow you to use emulated devices in qemu, even without full mmu
virtualization.

> Step 2: Make guest IO work. The guest should have IO functions; of
> course, that is virtio. We should make virtio work on ARM KVM. ARM has
> no hardware virtualization extension like Intel VT, but in the guest
> virtio driver, when IO needs to start, the guest should voluntarily exit
> the VM -- like the IO-exit of Intel VT.

If you make emulated devices work, then you can defer this to the last
step. virtio is only an optimization.

> Step 3: Add memory virtualization for the guest: make the guest run in
> user mode, porting the x86 KVM memory virtualization functions to ARM
> KVM, and so on. But I think we should keep the work mode of step 1 as an
> option for the user. Sometimes, I think, letting the guest handle
> physical memory directly is not a bad thing.
>
> Step 4: Use TrustZone. From ARM v6 there is a hardware extension called
> TrustZone; we can make use of it. Let the hypervisor Linux run in the
> monitor state and the guest run in the non-secure state, with the guest
> OS running as normal -- the Linux kernel in privileged mode, the user
> programs in user mode. So the architecture is more elegant.

Does the monitor state have its own userspace? If not, how are you going
to run qemu?

> Step 5: More guest OSes. The work mode of step 1 has an advantage: we
> don't need to modify the guest much, which makes porting WinCE or
> Symbian possible. For, you know, we don't have so much source code for
> those OSes.
>
> Welcome to join the arm-kvm team. Thanks!

Good luck :)

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [PATCH 13/15] Add NMI injection support to SVM.
On Fri, Apr 17, 2009 at 03:12:57PM, Dmitry Eremin-Solenikov wrote: This patch does expose some problems on real HW. The first NMI completes w/o problems. However, if I try to boot the kernel w/ nmi_watchdog=1 or to trigger two NMIs from the monitor, the kernel is stuck somewhere.

Can you try this patch instead of patch 13:

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8b6f6e9..057a612 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -766,6 +766,7 @@ enum {
 #define HF_GIF_MASK		(1 << 0)
 #define HF_HIF_MASK		(1 << 1)
 #define HF_VINTR_MASK		(1 << 2)
+#define HF_NMI_MASK		(1 << 3)
 
 /*
  * Hardware virtualization extension instructions may fault if a
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index c605477..0a2b3f1 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1834,6 +1834,13 @@ static int cpuid_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
 	return 1;
 }
 
+static int iret_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
+{
+	svm->vmcb->control.intercept &= ~(1UL << INTERCEPT_IRET);
+	svm->vcpu.arch.hflags &= ~HF_NMI_MASK;
+	return 1;
+}
+
 static int invlpg_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
 {
 	if (emulate_instruction(svm->vcpu, kvm_run, 0, 0, 0) != EMULATE_DONE)
@@ -2111,6 +2118,7 @@ static int (*svm_exit_handlers[])(struct vcpu_svm *svm,
 	[SVM_EXIT_VINTR]			= interrupt_window_interception,
 	/* [SVM_EXIT_CR0_SEL_WRITE]		= emulate_on_interception, */
 	[SVM_EXIT_CPUID]			= cpuid_interception,
+	[SVM_EXIT_IRET]				= iret_interception,
 	[SVM_EXIT_INVD]				= emulate_on_interception,
 	[SVM_EXIT_HLT]				= halt_interception,
 	[SVM_EXIT_INVLPG]			= invlpg_interception,
@@ -2218,6 +2226,12 @@ static void pre_svm_run(struct vcpu_svm *svm)
 		new_asid(svm, svm_data);
 }
 
+static void svm_inject_nmi(struct vcpu_svm *svm)
+{
+	svm->vmcb->control.event_inj = SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_NMI;
+	svm->vcpu.arch.hflags |= HF_NMI_MASK;
+	svm->vmcb->control.intercept |= (1UL << INTERCEPT_IRET);
+}
+
 static inline void svm_inject_irq(struct vcpu_svm *svm, int irq)
 {
@@ -2269,6 +2283,14 @@ static void update_cr8_intercept(struct kvm_vcpu *vcpu)
 		vmcb->control.intercept_cr_write |= INTERCEPT_CR8_MASK;
 }
 
+static int svm_nmi_allowed(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_svm *svm = to_svm(vcpu);
+	struct vmcb *vmcb = svm->vmcb;
+	return !(vmcb->control.int_state & SVM_INTERRUPT_SHADOW_MASK) &&
+		!(svm->vcpu.arch.hflags & HF_NMI_MASK);
+}
+
 static int svm_interrupt_allowed(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -2284,16 +2306,35 @@ static void enable_irq_window(struct kvm_vcpu *vcpu)
 	svm_inject_irq(to_svm(vcpu), 0x0);
 }
 
+static void enable_nmi_window(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_svm *svm = to_svm(vcpu);
+
+	if (svm->vmcb->control.int_state & SVM_INTERRUPT_SHADOW_MASK)
+		enable_irq_window(vcpu);
+}
+
 static void svm_intr_inject(struct kvm_vcpu *vcpu)
 {
 	/* try to reinject previous events if any */
+	if (vcpu->arch.nmi_injected) {
+		svm_inject_nmi(to_svm(vcpu));
+		return;
+	}
+
 	if (vcpu->arch.interrupt.pending) {
 		svm_queue_irq(to_svm(vcpu), vcpu->arch.interrupt.nr);
 		return;
 	}
 
 	/* try to inject new event if pending */
-	if (kvm_cpu_has_interrupt(vcpu)) {
+	if (vcpu->arch.nmi_pending) {
+		if (svm_nmi_allowed(vcpu)) {
+			vcpu->arch.nmi_pending = false;
+			vcpu->arch.nmi_injected = true;
+			svm_inject_nmi(to_svm(vcpu));
+		}
+	} else if (kvm_cpu_has_interrupt(vcpu)) {
 		if (svm_interrupt_allowed(vcpu)) {
 			kvm_queue_interrupt(vcpu, kvm_cpu_get_interrupt(vcpu));
 			svm_queue_irq(to_svm(vcpu), vcpu->arch.interrupt.nr);
@@ -2312,7 +2353,10 @@ static void svm_intr_assist(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 	svm_intr_inject(vcpu);
 
-	if (kvm_cpu_has_interrupt(vcpu) || req_int_win)
+	/* enable NMI/IRQ window open exits if needed */
+	if (vcpu->arch.nmi_pending)
+		enable_nmi_window(vcpu);
+	else if (kvm_cpu_has_interrupt(vcpu) || req_int_win)
 		enable_irq_window(vcpu);
 
 out:

-- Gleb.
Re: [PATCH 13/15] Add NMI injection support to SVM.
On Fri, Apr 17, 2009 at 09:55:45PM +0200, Jan Kiszka wrote: Gleb Natapov wrote: Signed-off-by: Gleb Natapov g...@redhat.com
[...]
+static int iret_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
+{
+	svm->vmcb->control.intercept &= ~(1UL << INTERCEPT_IRET);
+	svm->vcpu.arch.hflags &= ~HF_NMI_MASK;
+	return 0;
+}
[...]

First, this must return 1 (or set an exit reason, but there is no reason to escape to user space here). And second, I think a corner case is not handled the same way as on real iron: if there is already the next NMI waiting, we will inject it before the iret, not after its execution as it should be. No easy solution for this yet. Maybe emulating iret, but there is no implementation, specifically for protected mode. Maybe setting a breakpoint. Or maybe enforcing a single-step exception. Nothing trivial in this list. On the other hand, this may only be a slight imprecision of the virtualization. Need to think about it.

What about this: instead of clearing HF_NMI_MASK in iret_interception() we can set another flag (HF_IRET) and on guest entry clear HF_NMI_MASK (and HF_IRET) if HF_IRET is set, but do that after checking for NMI injection. The pending NMI will be injected on the next entry. Also not how real HW works, but it may be better than the current situation. -- Gleb.
Re: [PATCH 13/15] Add NMI injection support to SVM.
Gleb Natapov wrote: On Fri, Apr 17, 2009 at 09:55:45PM +0200, Jan Kiszka wrote:
[...] And second, I think a corner case is not handled the same way as on real iron: if there is already the next NMI waiting, we will inject it before the iret, not after its execution as it should be. [...]
What about this: instead of clearing HF_NMI_MASK in iret_interception() we can set another flag (HF_IRET) and on guest entry clear HF_NMI_MASK (and HF_IRET) if HF_IRET is set, but do that after checking for NMI injection. The pending NMI will be injected on the next entry. Also not how real HW works, but it may be better than the current situation.

There may not be a next entry if the guest is in a tight loop. Given NMIs are used for watchdogs, that's not good. Btw, injection before the IRET is executed is broken if interrupt stack tables are used, since the injection will reset rsp instead of nesting.
Re: [PATCH 13/15] Add NMI injection support to SVM.
On Sun, Apr 19, 2009 at 04:21:29PM +0300, Avi Kivity wrote: Gleb Natapov wrote:
[...] What about this: instead of clearing HF_NMI_MASK in iret_interception() we can set another flag (HF_IRET) and on guest entry clear HF_NMI_MASK (and HF_IRET) if HF_IRET is set, but do that after checking for NMI injection. The pending NMI will be injected on the next entry. Also not how real HW works, but it may be better than the current situation.
There may not be a next entry if the guest is in a tight loop. Given NMIs are used for watchdogs, that's not good.

We don't exit a guest after the KVM time slice ends? -- Gleb.
Re: [PATCH 13/15] Add NMI injection support to SVM.
Gleb Natapov wrote: On Fri, Apr 17, 2009 at 09:55:45PM +0200, Jan Kiszka wrote:
[...] What about this: instead of clearing HF_NMI_MASK in iret_interception() we can set another flag (HF_IRET) and on guest entry clear HF_NMI_MASK (and HF_IRET) if HF_IRET is set, but do that after checking for NMI injection. The pending NMI will be injected on the next entry. Also not how real HW works, but it may be better than the current situation.

It's OK as a first step towards correct NMI emulation. Additionally, you could enable the IRQ window interception in case there is an NMI pending. The resulting behavior should then be much like the VNMI mask emulation for vmx. The next step should then be setting TF in the eflags stored on the guest's stack before returning *if* there is already the next NMI pending. But I wonder how much additional effort this will actually mean (compared to the band-aid work)... :) Jan
Re: [PATCH 13/15] Add NMI injection support to SVM.
Gleb Natapov wrote: There may not be a next entry if the guest is in a tight loop. Given NMIs are used for watchdogs, that's not good. We don't exit a guest after the KVM time slice ends?

There are no time slices any more. If there's only one thread for a vcpu, you might have no exits at all with a tickless kernel.
Re: [PATCH 13/15] Add NMI injection support to SVM.
On Sun, Apr 19, 2009 at 03:27:22PM +0200, Jan Kiszka wrote: Gleb Natapov wrote:
[...] It's OK as a first step towards correct NMI emulation. Additionally, you could enable the IRQ window interception in case there is an NMI pending. The resulting behavior should then be much like the VNMI mask emulation for vmx.

Yeah, but the question is: if the IRQ window is already open, will the exit happen before or after the IRET? -- Gleb.
Re: [PATCH 13/15] Add NMI injection support to SVM.
On Sun, Apr 19, 2009 at 04:28:09PM +0300, Avi Kivity wrote: Gleb Natapov wrote: There may not be a next entry if the guest is in a tight loop. Given NMIs are used for watchdogs, that's not good. We don't exit a guest after the KVM time slice ends? There are no time slices any more. If there's only one thread for a vcpu, you might have no exits at all with a tickless kernel.

Well, KVM may request some kind of event (a timer) that will cause the VCPU to exit. This looks hacky, though. -- Gleb.
Re: [PATCH 13/15] Add NMI injection support to SVM.
Gleb Natapov wrote: On Sun, Apr 19, 2009 at 03:27:22PM +0200, Jan Kiszka wrote:
[...] It's OK as a first step towards correct NMI emulation. Additionally, you could enable the IRQ window interception in case there is an NMI pending. The resulting behavior should then be much like the VNMI mask emulation for vmx.
Yeah, but the question is: if the IRQ window is already open, will the exit happen before or after the IRET?

Hey, this is a band-aid, it won't heal all cases. ;) Jan
Re: [PATCH 13/15] Add NMI injection support to SVM.
Gleb Natapov wrote: It's OK as a first step towards correct NMI emulation. Additionally, you could enable the IRQ window interception in case there is an NMI pending. The resulting behavior should then be much like the VNMI mask emulation for vmx. Yeah, but the question is: if the IRQ window is already open, will the exit happen before or after the IRET?

You mean if the NMI handler enabled interrupts?
Re: [PATCH 13/15] Add NMI injection support to SVM.
On Sun, Apr 19, 2009 at 04:40:51PM +0300, Avi Kivity wrote: Gleb Natapov wrote: [...] Yeah, but the question is: if the IRQ window is already open, will the exit happen before or after the IRET? You mean if the NMI handler enabled interrupts?

Yes. -- Gleb.
Re: [PATCH 13/15] Add NMI injection support to SVM.
Gleb Natapov wrote: On Sun, Apr 19, 2009 at 04:40:51PM +0300, Avi Kivity wrote: [...] You mean if the NMI handler enabled interrupts? Yes.

Then the guest deserves whatever it gets...
Re: [PATCH 13/15] Add NMI injection support to SVM.
Gleb Natapov wrote: On Sun, Apr 19, 2009 at 04:28:09PM +0300, Avi Kivity wrote: [...] There are no time slices any more. If there's only one thread for a vcpu, you might have no exits at all with a tickless kernel. Well, KVM may request some kind of event (a timer) that will cause the VCPU to exit. This looks hacky, though.

We already spent too many electrons and brain cycles on possibly much simpler workarounds. I think injecting and handling a single-step, even while there is guest debugging going on, or the guest itself single-steps, or both, will not be more complicated - but more correct. Jan
Re: [PATCH 13/15] Add NMI injection support to SVM.
On Sun, Apr 19, 2009 at 04:43:12PM +0300, Avi Kivity wrote: Gleb Natapov wrote: [...] You mean if the NMI handler enabled interrupts? Yes. Then the guest deserves whatever it gets...

I suspect Windows may do this, since it uses NMI for task switching. -- Gleb.
Re: [PATCH 13/15] Add NMI injection support to SVM.
Jan Kiszka wrote: We already spent too many electrons and brain cycles on possibly much simpler workarounds. I think injecting and handling a single-step, even while there is guest debugging going on, or the guest itself single-steps, or both, will not be more complicated - but more correct.

I agree. I'm still worried about interactions between the IRET single-stepping code and other things which use the debug registers.
Re: [PATCH 13/15] Add NMI injection support to SVM.
On Sun, Apr 19, 2009 at 04:49:18PM +0300, Avi Kivity wrote: [...] I agree. I'm still worried about interactions between the IRET single-stepping code and other things which use the debug registers.

I don't disagree either. Just throwing out other ideas :) -- Gleb.
Re: [PATCH 05/15] Coalesce userspace/kernel irqchip interrupt injection logic.
On Sat, Apr 18, 2009 at 07:28:20PM +0300, Gleb Natapov wrote: So this patch may either expose a bug in the svm emulation of qemu or come with a subtle regression that only triggers due to qemu's timing. This needs to be understood. Gleb, any progress on reproducing it on your side? I reproduced it and I am debugging it. In my case the boot hangs on an sti;hlt sequence. Instrumentation thus far shows that at this point interrupts are no longer injected because the ppr value is too big. Need to see why, but tpr handling is not complete in qemu svm. Maybe this is the reason. Will know more tomorrow.

I've looked into this, and my conclusion is that if you are not going to develop SVM in qemu, don't use it just yet. QEMU doesn't handle exceptions during event injection properly. Actually it does not handle them at all, so if a #PF happens during interrupt injection, the interrupt is lost and, what's worse, is never acked. If the interrupt was high priority, it blocks all other interrupts. The patch below adds exception handling during event injection: the valid flag is removed from EVENTINJ only after successful injection, and EVENTINJ is copied to EXITINTINFO on exit. Can you give it a try? And this is not the only problem I saw, but it is the one that caused my guest to hang.

diff --git a/target-i386/op_helper.c b/target-i386/op_helper.c
index be09263..9264afd 100644
--- a/target-i386/op_helper.c
+++ b/target-i386/op_helper.c
@@ -1249,6 +1249,10 @@ void do_interrupt(int intno, int is_int, int error_code,
     } else {
         do_interrupt_real(intno, is_int, error_code, next_eip);
     }
+    if (env->hflags & HF_SVMI_MASK) {
+        uint32_t event_inj = ldl_phys(env->vm_vmcb + offsetof(struct vmcb, control.event_inj));
+        stl_phys(env->vm_vmcb + offsetof(struct vmcb, control.event_inj), event_inj & ~SVM_EVTINJ_VALID);
+    }
 }
 
 /* This should come from sysemu.h - if we could include it here... */
@@ -4994,7 +4998,6 @@ void helper_vmrun(int aflag, int next_eip_addend)
         uint8_t vector = event_inj & SVM_EVTINJ_VEC_MASK;
         uint16_t valid_err = event_inj & SVM_EVTINJ_VALID_ERR;
         uint32_t event_inj_err = ldl_phys(env->vm_vmcb + offsetof(struct vmcb, control.event_inj_err));
-        stl_phys(env->vm_vmcb + offsetof(struct vmcb, control.event_inj), event_inj & ~SVM_EVTINJ_VALID);
 
         qemu_log_mask(CPU_LOG_TB_IN_ASM, "Injecting(%#hx): ", valid_err);
         /* FIXME: need to implement valid_err */
@@ -5331,6 +5334,8 @@ void helper_vmexit(uint32_t exit_code, uint64_t exit_info_1)
     cpu_x86_set_cpl(env, 0);
     stq_phys(env->vm_vmcb + offsetof(struct vmcb, control.exit_code), exit_code);
     stq_phys(env->vm_vmcb + offsetof(struct vmcb, control.exit_info_1), exit_info_1);
+    stl_phys(env->vm_vmcb + offsetof(struct vmcb, control.exit_int_info), ldl_phys(env->vm_vmcb + offsetof(struct vmcb, control.event_inj)));
+    stl_phys(env->vm_vmcb + offsetof(struct vmcb, control.exit_int_info_err), ldl_phys(env->vm_vmcb + offsetof(struct vmcb, control.event_inj_err)));
 
     env->hflags2 &= ~HF2_GIF_MASK;
 
     /* FIXME: Resets the current ASID register to zero (host ASID). */

-- Gleb.
Re: [PATCH 13/15] Add NMI injection support to SVM.
Avi Kivity wrote: Jan Kiszka wrote: [...] I agree. I'm still worried about interactions between the IRET single-stepping code and other things which use the debug registers.

The interaction is inside KVM, so under our control. We may simply save the related states, hook into the guest-debugging parts that evaluate single-step exceptions, and handle this case with highest priority, transparently to users of lower priority. In theory - you never really know until you've tried to implement it... Jan
Re: [PATCH 05/15] Coalesce userspace/kernel irqchip interrupt injection logic.
Gleb Natapov wrote: On Sat, Apr 18, 2009 at 07:28:20PM +0300, Gleb Natapov wrote: So this patch may either expose a bug in the svm emulation of qemu or comes with a subtle regression that only triggers due to qemu's timing. This needs to be understood. Gleb, any progress on reproducing it on your side? I reproduced it and I am debugging it. In my case the boot hangs on sti;hlt sequence. Instrumentation thus far shows that at this point interrupts no longer injected because ppr value is too big. Need to see why, but tpr handling is not complete in qemu svm. May be this is the reason. Will know more tomorrow. I've looked into this and my conclusion is that if you are not going to develop SVM in qemu don't use it just yet. We had a resource conflict regarding SVM capable AMD boxes and a tight schedule, so we decided to pick qemu as initial development platform. Turns out that this has was a bit too optimistic. :) QEMU doesn't handle exceptions during event injection properly. Actually it does not handle it at all, so if PF happens during interrupt injection interrupt is lost and, what worse, is never acked. If interrupt was high prio it blocks all other interrupts. The patch below adds exception handling during event injection. Valid flag removed from EVENTINJ only after successful injection and EVENTINJ is copied to EXITINTINFO on exit. Can you give it a try? Ah, great, thanks. Will test. And this is not the only problem I saw, but the one that caused my guest to hang. OK, good to know. I added Alex (though he's said to be on vacation ATM) and qemu to CC. Maybe you can quickly list the other issues you've stumbled over, for the records and for motivating contributors... 
diff --git a/target-i386/op_helper.c b/target-i386/op_helper.c
index be09263..9264afd 100644
--- a/target-i386/op_helper.c
+++ b/target-i386/op_helper.c
@@ -1249,6 +1249,10 @@ void do_interrupt(int intno, int is_int, int error_code,
     } else {
         do_interrupt_real(intno, is_int, error_code, next_eip);
     }
+    if (env->hflags & HF_SVMI_MASK) {
+        uint32_t event_inj = ldl_phys(env->vm_vmcb + offsetof(struct vmcb, control.event_inj));
+        stl_phys(env->vm_vmcb + offsetof(struct vmcb, control.event_inj), event_inj & ~SVM_EVTINJ_VALID);
+    }
 }

 /* This should come from sysemu.h - if we could include it here... */
@@ -4994,7 +4998,6 @@ void helper_vmrun(int aflag, int next_eip_addend)
         uint8_t vector = event_inj & SVM_EVTINJ_VEC_MASK;
         uint16_t valid_err = event_inj & SVM_EVTINJ_VALID_ERR;
         uint32_t event_inj_err = ldl_phys(env->vm_vmcb + offsetof(struct vmcb, control.event_inj_err));
-        stl_phys(env->vm_vmcb + offsetof(struct vmcb, control.event_inj), event_inj & ~SVM_EVTINJ_VALID);
         qemu_log_mask(CPU_LOG_TB_IN_ASM, "Injecting(%#hx): ", valid_err);
         /* FIXME: need to implement valid_err */
@@ -5331,6 +5334,8 @@ void helper_vmexit(uint32_t exit_code, uint64_t exit_info_1)
     cpu_x86_set_cpl(env, 0);
     stq_phys(env->vm_vmcb + offsetof(struct vmcb, control.exit_code), exit_code);
     stq_phys(env->vm_vmcb + offsetof(struct vmcb, control.exit_info_1), exit_info_1);
+    stl_phys(env->vm_vmcb + offsetof(struct vmcb, control.exit_int_info), ldl_phys(env->vm_vmcb + offsetof(struct vmcb, control.event_inj)));
+    stl_phys(env->vm_vmcb + offsetof(struct vmcb, control.exit_int_info_err), ldl_phys(env->vm_vmcb + offsetof(struct vmcb, control.event_inj_err)));
     env->hflags2 &= ~HF2_GIF_MASK;
     /* FIXME: Resets the current ASID register to zero (host ASID). */
--
Gleb.

Thanks again, Jan
Re: [PATCH 13/15] Add NMI injection support to SVM.
Gleb Natapov g...@redhat.com writes: On Sun, Apr 19, 2009 at 04:43:12PM +0300, Avi Kivity wrote: Gleb Natapov wrote: On Sun, Apr 19, 2009 at 04:40:51PM +0300, Avi Kivity wrote: Gleb Natapov wrote: It's OK as a first step towards correct NMI emulation. Additionally, you could enable the IRQ window interception in case there is an NMI pending. The resulting behavior should then be much like the VNMI mask emulation for vmx. Yeah, but the question is, if the IRQ window is already open, will the exit happen before or after IRET. You mean if the NMI handler enabled interrupts? Yes. Then the guest deserves whatever it gets... I suspect windows may do this since it uses NMI for task switching. Could you elaborate on that? How/why does it use NMIs for task switching? Regards, -- Julian Stecklina The day Microsoft makes something that doesn't suck is probably the day they start making vacuum cleaners - Ernst Jan Plugge -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 13/15] Add NMI injection support to SVM.
On Sun, Apr 19, 2009 at 04:07:52PM +0200, Julian Stecklina wrote: Gleb Natapov g...@redhat.com writes: On Sun, Apr 19, 2009 at 04:43:12PM +0300, Avi Kivity wrote: Gleb Natapov wrote: On Sun, Apr 19, 2009 at 04:40:51PM +0300, Avi Kivity wrote: Gleb Natapov wrote: It's OK as a first step towards correct NMI emulation. Additionally, you could enable the IRQ window interception in case there is an NMI pending. The resulting behavior should then be much like the VNMI mask emulation for vmx. Yeah, but the question is, if the IRQ window is already open, will the exit happen before or after IRET. You mean if the NMI handler enabled interrupts? Yes. Then the guest deserves whatever it gets... I suspect windows may do this since it uses NMI for task switching. Could you elaborate on that? How/why does it use NMIs for task switching? During WHQL testing (or if you just enable verifier on windows 2003) windows changes hibernate to not power down the PC, but resume immediately. During this immediate resume it sends an NMI to non-boot CPUs while the IDT entry for NMI is configured as a task gate. I am not sure it actually calls IRET after that. -- Gleb.
Re: [PATCH 13/15] Add NMI injection support to SVM.
Gleb Natapov wrote: Could you elaborate on that? How/why does it use NMIs for task switching? During WHQL testing (or if you just enable verifier on windows 2003) windows changes hibernate to not power down a PC, but resume immediately. During this immediate resume it sends NMI to non-boot CPUs while IDT for nmi is configured as a task gate. I am not sure it actually calls IRET after that. If it doesn't call IRET, it will never see another NMI. But of course it will execute IRET, as part of normal execution. You can't do anything without it. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
[ kvm-Bugs-2768533 ] SCSI/VirtIO: VM unable to reboot from non-IDE controller
Bugs item #2768533, was opened at 2009-04-16 16:01 Message generated for change (Comment added) made by technologov You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2768533&group_id=180599 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: None Group: None Status: Open Resolution: None Priority: 7 Private: No Submitted By: Technologov (technologov) Assigned to: Nobody/Anonymous (nobody) Summary: SCSI/VirtIO: VM unable to reboot from non-IDE controller Initial Comment: All Red Hat-based systems (RHEL, Fedora) have failed during automated tests on SCSI disks on KVM. It turned out to be a problem of Qemu/KVM, where after a software reboot (i.e. initiated by the guest OS), the system is unable to boot from a SCSI controller. To reboot successfully you need either soft reboot+IDE or hard reboot (Qemu cold boot)+SCSI. The command sent to Qemu/KVM: /usr/local/bin/qemu-system-x86_64 -m 512 -monitor tcp:localhost:4602,server,nowait -cdrom /isos/linux/Fedora-8-i386-DVD.iso -drive file=/vm/fedora8-32.qcow2,if=scsi,boot=on -name fedora8-32 Host: RHEL 5/x64, KVM-85rc6. (tried both Intel and AMD) Guest: RHEL 5, Fedora 8, Fedora 9 (tried both 32 and 64-bit). -Alexey, 16.4.2009. -- Comment By: Technologov (technologov) Date: 2009-04-19 17:25 Message: Bug was also opened in Ubuntu Launchpad. https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/363743 Two reasons: 1. Qemu project does not have a bugzilla - so Ubuntu Launchpad became the main bugzilla for Qemu 2. This bug affects Ubuntu-8.10 -- Comment By: Technologov (technologov) Date: 2009-04-16 17:12 Message: This problem also exists for the VirtIO block device.
-- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2768533&group_id=180599
Re: [PATCH 05/15] Coalesce userspace/kernel irqchip interrupt injection logic.
On Sun, Apr 19, 2009 at 04:05:21PM +0200, Jan Kiszka wrote: And this is not the only problem I saw, but the one that caused my guest to hang. OK, good to know. I added Alex (though he's said to be on vacation ATM) and qemu to CC. Maybe you can quickly list the other issues you've stumbled over, for the records and for motivating contributors... Another one that I remember (because this was my first suspect) is interrupt shadow handling. HF_INHIBIT_IRQ_MASK is cleared on exit when the shadow bit is set in int_state, and is not set on entry if the hypervisor set the shadow bit by itself. I am not sure how real HW actually handles this, but the patch below demonstrates how I think it does it :) And of course comments like /* FIXME: this should respect TPR */ don't look promising.

diff --git a/target-i386/op_helper.c b/target-i386/op_helper.c
index be09263..691a7f0 100644
--- a/target-i386/op_helper.c
+++ b/target-i386/op_helper.c
@@ -4971,6 +4997,15 @@ void helper_vmrun(int aflag, int next_eip_addend)
     env->dr[6] = ldq_phys(env->vm_vmcb + offsetof(struct vmcb, save.dr6));
     cpu_x86_set_cpl(env, ldub_phys(env->vm_vmcb + offsetof(struct vmcb, save.cpl)));
+    {
+        uint32_t aaa;
+        aaa = ldl_phys(env->vm_vmcb + offsetof(struct vmcb, control.int_state));
+        if (aaa & SVM_INTERRUPT_SHADOW_MASK)
+            helper_set_inhibit_irq();
+        else
+            helper_reset_inhibit_irq();
+    }
+
     /* FIXME: guest state consistency checks */
     switch(ldub_phys(env->vm_vmcb + offsetof(struct vmcb, control.tlb_ctl))) {
@@ -5243,7 +5280,6 @@ void helper_vmexit(uint32_t exit_code, uint64_t exit_info_1)
     if (env->hflags & HF_INHIBIT_IRQ_MASK) {
         stl_phys(env->vm_vmcb + offsetof(struct vmcb, control.int_state), SVM_INTERRUPT_SHADOW_MASK);
-        env->hflags &= ~HF_INHIBIT_IRQ_MASK;
     } else {
         stl_phys(env->vm_vmcb + offsetof(struct vmcb, control.int_state), 0);
     }
--
Gleb.
Re: [PATCH 13/15] Add NMI injection support to SVM.
On Sun, Apr 19, 2009 at 05:20:37PM +0300, Avi Kivity wrote: Gleb Natapov wrote: Could you elaborate on that? How/why does it use NMIs for task switching? During WHQL testing (or if you just enable verifier on windows 2003) windows changes hibernate to not power down a PC, but resume immediately. During this immediate resume it sends NMI to non-boot CPUs while IDT for nmi is configured as a task gate. I am not sure it actually calls IRET after that. If it doesn't call IRET, it will never see another NMI. But of course it will execute IRET, as part of normal execution. You can't do anything without it. Boot CPU can send INIT after task switch (and I think this is what happens). -- Gleb.
Re: [PATCH 13/15] Add NMI injection support to SVM.
Gleb Natapov wrote: On Sun, Apr 19, 2009 at 05:20:37PM +0300, Avi Kivity wrote: Gleb Natapov wrote: Could you elaborate on that? How/why does it use NMIs for task switching? During WHQL testing (or if you just enable verifier on windows 2003) windows changes hibernate to not power down a PC, but resume immediately. During this immediate resume it sends NMI to non-boot CPUs while IDT for nmi is configured as a task gate. I am not sure it actually calls IRET after that. If it doesn't call IRET, it will never see another NMI. But of course it will execute IRET, as part of normal execution. You can't do anything without it. Boot CPU can send INIT after task switch (and I think this is what happens). But eventually it will execute IRET. (We need to fix INIT to clear the NMI blocking flag, not that it matters so much) -- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [PATCH 05/15] Coalesce userspace/kernel irqchip interrupt injection logic.
Jan Kiszka wrote: Gleb Natapov wrote: On Sat, Apr 18, 2009 at 07:28:20PM +0300, Gleb Natapov wrote: So this patch may either expose a bug in the svm emulation of qemu or come with a subtle regression that only triggers due to qemu's timing. This needs to be understood. Gleb, any progress on reproducing it on your side? I reproduced it and I am debugging it. In my case the boot hangs on an sti;hlt sequence. Instrumentation thus far shows that at this point interrupts are no longer injected because the ppr value is too big. Need to see why, but tpr handling is not complete in qemu svm. Maybe this is the reason. Will know more tomorrow. I've looked into this and my conclusion is that if you are not going to develop SVM in qemu, don't use it just yet. We had a resource conflict regarding SVM-capable AMD boxes and a tight schedule, so we decided to pick qemu as the initial development platform. Turns out that this was a bit too optimistic. :) QEMU doesn't handle exceptions during event injection properly. Actually it does not handle them at all, so if a PF happens during interrupt injection the interrupt is lost and, what's worse, is never acked. If the interrupt was high prio it blocks all other interrupts. The patch below adds exception handling during event injection. The valid flag is removed from EVENTINJ only after successful injection, and EVENTINJ is copied to EXITINTINFO on exit. Can you give it a try? Ah, great, thanks. Will test. I can confirm: your patch makes my kvm-in-qemu test case happy, too. Maybe you want to post this with changelog and signed-off to qemu-devel.
Jan
Re: [PATCH 05/15] Coalesce userspace/kernel irqchip interrupt injection logic.
On Sun, Apr 19, 2009 at 05:06:29PM +0200, Jan Kiszka wrote: Jan Kiszka wrote: Gleb Natapov wrote: On Sat, Apr 18, 2009 at 07:28:20PM +0300, Gleb Natapov wrote: So this patch may either expose a bug in the svm emulation of qemu or come with a subtle regression that only triggers due to qemu's timing. This needs to be understood. Gleb, any progress on reproducing it on your side? I reproduced it and I am debugging it. In my case the boot hangs on an sti;hlt sequence. Instrumentation thus far shows that at this point interrupts are no longer injected because the ppr value is too big. Need to see why, but tpr handling is not complete in qemu svm. Maybe this is the reason. Will know more tomorrow. I've looked into this and my conclusion is that if you are not going to develop SVM in qemu, don't use it just yet. We had a resource conflict regarding SVM-capable AMD boxes and a tight schedule, so we decided to pick qemu as the initial development platform. Turns out that this was a bit too optimistic. :) QEMU doesn't handle exceptions during event injection properly. Actually it does not handle them at all, so if a PF happens during interrupt injection the interrupt is lost and, what's worse, is never acked. If the interrupt was high prio it blocks all other interrupts. The patch below adds exception handling during event injection. The valid flag is removed from EVENTINJ only after successful injection, and EVENTINJ is copied to EXITINTINFO on exit. Can you give it a try? Ah, great, thanks. Will test. I can confirm: your patch makes my kvm-in-qemu test case happy, too. Maybe you want to post this with changelog and signed-off to qemu-devel. Yeah, I'll reformat and submit. -- Gleb.
Re: [PATCH 13/15] Add NMI injection support to SVM.
On Sun, Apr 19, 2009 at 05:57:56PM +0300, Avi Kivity wrote: Gleb Natapov wrote: On Sun, Apr 19, 2009 at 05:20:37PM +0300, Avi Kivity wrote: Gleb Natapov wrote: Could you elaborate on that? How/why does it use NMIs for task switching? During WHQL testing (or if you just enable verifier on windows 2003) windows changes hibernate to not power down a PC, but resume immediately. During this immediate resume it sends NMI to non-boot CPUs while IDT for nmi is configured as a task gate. I am not sure it actually calls IRET after that. If it doesn't call IRET, it will never see another NMI. But of course it will execute IRET, as part of normal execution. You can't do anything without it. Boot CPU can send INIT after task switch (and I think this is what happens). But eventually it will execute IRET. Yes :) But I strongly suspect that NMI window will be opened after SIPI even before first IRET. (We need to fix INIT to clear the NMI blocking flag, not that it matters so much) If we reset intercept mask on INIT, but don't clear NMI blocking flag we will never receive NMIs on the vcpu. -- Gleb.
Re: [PATCH] KVM: Defer remote tlb flushes on invlpg (v4)
On Sat, Apr 18, 2009 at 05:34:27PM +0200, Andrea Arcangeli wrote: On Sun, Apr 12, 2009 at 07:31:58PM -0300, Marcelo Tosatti wrote: mmu_sync_children needs to find any _reachable_ sptes that are unsync, read the guest pagetable, and sync the sptes. Before it can inspect the guest pagetable, it needs to write-protect it, with rmap_write_protect. So far so good. In theory it wouldn't be necessary to call kvm_flush_remote_tlbs_cond(protected) here, but only kvm_flush_remote_tlbs(), since the kvm->remote_tlbs_dirty information is not pertinent to mmu_sync_children. Hmm, I'm not sure I fully understand how it is not pertinent. When we have a cr3 write in a remote vcpu, mmu_sync_children runs and it resyncs all sptes reachable from that remote vcpu context. But to resync the sptes, it also has to get rid of any old writable tlb entry for its remote vcpu where the cr3 write is running. Checking only sptes to find writable ones, updating the sptes that are mapped by the writable sptes, and marking them write-protected, isn't enough, as old spte contents may still be cached in the tlb if remote_tlbs_dirty is true! Think of the case where the invlpg'd spte that got nuked by rmap_remove in the invlpg handler running in the current vcpu has a writable tlb entry active in the vcpu that later does the cr3 overwrite. mmu_sync_children running in the remote vcpu will find no writable spte in the rmap chain representing that spte (because that spte, still cached in the remote tlb, has already been zapped by the current vcpu), but it is still cached and writable in the remote vcpu TLB cache when the cr3 overwrite runs. Right, so you have to cope with the fact that invlpg can skip a TLB flush. OK. But this is done here to clear remote_tlbs_dirty (after a kvm_flush_remote_tlbs, remote_tlbs_dirty is clean), i.e. as an optimization. The whole point of remote_tlbs_dirty is to defer any smp tlb flush to the latest possible time, so how can it be an optimization to run a kvm_flush_remote_tlbs earlier than necessary?
The only way this can be an optimization is to never run kvm_flush_remote_tlbs unless absolutely necessary, and to leave remote_tlbs_dirty true instead of calling kvm_flush_remote_tlbs. Calling kvm_flush_remote_tlbs_cond instead of kvm_flush_remote_tlbs cannot ever be an optimization. Right.

@@ -465,7 +464,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 			rmap_remove(vcpu->kvm, sptep);
 			if (is_large_pte(*sptep))
 				--vcpu->kvm->stat.lpages;
-			need_flush = 1;
+			vcpu->kvm->remote_tlbs_dirty = true;
 		}
 		set_shadow_pte(sptep, shadow_trap_nonpresent_pte);
 		break;
@@ -475,8 +474,6 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 		break;
 	}
-	if (need_flush)
-		kvm_flush_remote_tlbs(vcpu->kvm);
 	spin_unlock(&vcpu->kvm->mmu_lock);

AFAIK, to be compliant with lowlevel archs (without ASN it doesn't matter, I think, as vmx always flushes on exit), we have to flush the local tlb here, with set_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests). I don't see why it's missing. Or am I wrong? Caller does it:

void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva)
{
	vcpu->arch.mmu.invlpg(vcpu, gva);
	kvm_mmu_flush_tlb(vcpu);
	++vcpu->stat.invlpg;
}

Ah great! Avi also mentioned it, I recall, but I didn't figure out it was after FNAME(invlpg) returns. But isn't it always more efficient to set_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests) instead, like I'm doing? Sure, that works too. See my version of kvm_flush_local_tlb; that's a bit different from kvm_mmu_flush_tlb, and I made the old no-mmu-notifier kernel safe too, which I think is worth it. If you want to keep my version of kvm_flush_local_tlb, then I simply have to remove the kvm_mmu_flush_tlb from kvm_mmu_invlpg and move it inside the arch.mmu.invlpg handler so each shadow implementation does it its way. Comments? I'm fine with your kvm_flush_local_tlb. Just one minor nit:

+	/* get new asid before returning to guest mode */
+	if (!test_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests))
+		set_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests);

What's the test_bit for?
If you disagree, and you want to run kvm_mmu_flush_tlb in kvm_mmu_invlpg instead of set_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests), then I can expand the kvm_flush_local_tlb with the smp_wmb() in the FNAME(invlpg) code. Like:

	if (need_flush) {
		smp_wmb();
		remote_tlbs_dirty = true;
	}
	spin_unlock(&mmu_lock);

Then the local tlb flush will run when we return from FNAME(invlpg) and remote_tlbs_dirty is set _after_
Re: [PATCH -v2] Add MCE support to KVM
On Sat, 2009-04-18 at 23:54 +0800, Anthony Liguori wrote: Huang Ying wrote: The related MSRs are emulated. MCE capability is exported via extension KVM_CAP_MCE and ioctl KVM_X86_GET_MCE_CAP_SUPPORTED. A new vcpu ioctl command KVM_X86_SETUP_MCE is used to set up MCE emulation such as the mcg_cap. MCE is injected via vcpu ioctl command KVM_X86_SET_MCE. Extended machine-check state (MCG_EXT_P) and CMCI are not simulated. Maybe I'm missing something, but couldn't this be implemented entirely within userspace? There's nothing VT/SVM specific about this. If the issue is that setting these MSRs from userspace via KVM_SET_MSRS isn't enough, perhaps we should add userspace MSR handling. Also, if you implement the MSR logic in userspace, it's pretty simple to make it work in the non-TCG case, which will be a requirement for upstream merging. There is more logic than just KVM_SET_MSRS, such as BANK reporting disabling, overwriting rules, and triple fault for UC MCE during MCIP. Although this logic could be implemented in user space, I think putting it in kernel space is easier to understand. And the code is pretty short. Best Regards, Huang Ying
Re: [PATCH -v2 2/2] kvm userspace: Add MCE simulation to kvm
On Fri, 2009-04-17 at 22:49 +0800, Marcelo Tosatti wrote: Hi Huang, On Fri, Apr 17, 2009 at 03:38:45PM +0800, Huang Ying wrote: - MCE features are initialized when the VCPU is initialized, according to CPUID. - A monitor command mce is added to inject an MCE. ChangeLog: v2: - Use the new kernel MCE capability export interface. Signed-off-by: Huang Ying ying.hu...@intel.com
---
 libkvm/libkvm-x86.c    |   33 +
 libkvm/libkvm.h        |    4
 qemu/monitor.c         |   26 ++
 qemu/qemu-kvm-x86.c    |   33 +
 qemu/qemu-kvm.c        |   26 ++
 qemu/qemu-kvm.h        |    4
 qemu/target-i386/cpu.h |    3 +++
 7 files changed, 129 insertions(+)

--- a/qemu/monitor.c
+++ b/qemu/monitor.c
@@ -1557,6 +1557,31 @@ static void do_info_status(Monitor *mon)
 }

+#if defined(TARGET_I386) || defined(TARGET_X86_64)
+static void do_inject_mce(Monitor *mon,
+                          int cpu_index, int bank,
+                          unsigned status_hi, unsigned status_lo,
+                          unsigned mcg_status_hi, unsigned mcg_status_lo,
+                          unsigned addr_hi, unsigned addr_lo,
+                          unsigned misc_hi, unsigned misc_lo)
+{
+    CPUState *env;
+    struct kvm_x86_mce mce = {
+        .bank = bank,
+        .status = ((uint64_t)status_hi << 32) | status_lo,
+        .mcg_status = ((uint64_t)mcg_status_hi << 32) | mcg_status_lo,
+        .addr = ((uint64_t)addr_hi << 32) | addr_lo,
+        .misc = ((uint64_t)misc_hi << 32) | misc_lo,
+    };
+
+    for (env = first_cpu; env != NULL; env = env->next_cpu)
+        if (env->cpu_index == cpu_index && env->mcg_cap) {
+            kvm_inject_x86_mce(env, &mce);
+            break;
+        }
+}
+#endif

This will break compilation on hosts with older kernel headers. To ease the merge with QEMU, do_inject_mce should call qemu_inject_x86_mce, which does nothing for the case where KVM is disabled, or calls a KVM-specific function otherwise (knowledge about struct kvm_x86_mce should be contained within KVM code). Yes. I will fix this.
+
 static void do_balloon(Monitor *mon, int value)
 {
     ram_addr_t target = value;
@@ -1758,6 +1783,7 @@ static const mon_cmd_t mon_cmds[] = {
       "[tap,user,socket,vde] options", "add host VLAN client" },
     { "host_net_remove", "is", net_host_device_remove,
       "vlan_id name", "remove host VLAN client" },
+    { "mce", "ii", do_inject_mce, "cpu bank status mcgstatus addr misc", "inject a MCE on the given CPU" },
 #endif
     { "balloon", "i", do_balloon,
       "target", "request VM to change it's memory allocation (in MB)" },
--- a/libkvm/libkvm-x86.c
+++ b/libkvm/libkvm-x86.c
@@ -379,6 +379,39 @@ int kvm_set_msrs(kvm_context_t kvm, int
     return r;
 }

+int kvm_get_mce_cap_supported(kvm_context_t kvm, uint64_t *mce_cap,
+                              int *max_banks)
+{
+    int r;
+
+    r = ioctl(kvm->fd, KVM_CHECK_EXTENSION, KVM_CAP_MCE);
+    if (r > 0) {
+        *max_banks = r;
+        return ioctl(kvm->fd, KVM_X86_GET_MCE_CAP_SUPPORTED, mce_cap);
+    }
+    return -ENOSYS;
+}
+
+int kvm_setup_mce(kvm_context_t kvm, int vcpu, uint64_t *mcg_cap)
+{
+    int r;
+
+    r = ioctl(kvm->fd, KVM_CHECK_EXTENSION, KVM_CAP_MCE);
+    if (r > 0)
+        return ioctl(kvm->vcpu_fd[vcpu], KVM_X86_SETUP_MCE, mcg_cap);
+    return -ENOSYS;
+}
+
+int kvm_set_mce(kvm_context_t kvm, int vcpu, struct kvm_x86_mce *m)
+{
+    int r;
+
+    r = ioctl(kvm->fd, KVM_CHECK_EXTENSION, KVM_CAP_MCE);
+    if (r > 0)
+        return ioctl(kvm->vcpu_fd[vcpu], KVM_X86_SET_MCE, m);
+    return -ENOSYS;
+}

Have to #ifdef KVM_CAP_MCE where appropriate, otherwise compilation on hosts with older kernel headers fails. Yes. I will fix this. Best Regards, Huang Ying
[PATCH 0/5] ksm - dynamic page sharing driver for linux v4
Introduction: KSM is a linux driver that allows dynamically sharing identical memory pages between one or more processes. Unlike traditional page sharing, which is set up at the allocation of the memory, ksm does it dynamically after the memory was created. Memory is periodically scanned; identical pages are identified and merged. The sharing is done in a way that is transparent to the processes that use it. Ksm is highly important for hypervisors (kvm), where in production environments there might be many copies of the same data among the host memory. This kind of data can be: similar kernels, libraries, caches, and so on. Even though ksm was written for kvm, any userspace application that wants to use it to share its data can try it. Ksm may be useful for any application that might have similar (page-aligned) data structures in memory; ksm will find this data and merge it into one copy, and even if it is changed and therefore copy-on-written, ksm will merge it again as soon as it becomes identical again. Another reason to consider using ksm is the fact that it might greatly simplify the userspace code of an application that wants to use shared private data: instead of the application managing a shared area, ksm will do this for the application, and even writes to this data will be allowed without any synchronization on the application's part. Ksm was designed to be a loadable module that doesn't change the VM code of linux. From v3 to v4: 1) Mostly coding-style fixes, and a few bugs that Andrew found. * get_user_pages return value check - we now check == 1. * protecting the vma under the mmap_sem() while checking its fields. * kthread_* renaming into ksm_thread_* * const for the file_operations structures. (The only thing I didn't change from your comments is the number of pages we are allocating for the hash table; this is performance critical for ksm) 2) Changed get_pte() to be a generic linux function available to other users.
Description of the ksm interface: The ksm interface is split into two areas. Administration sysfs interface - an interface, available via sysfs, to control ksm cpu timing, the maximum allocation of kernel pages, and statistics. This interface is available at /sys/kernel/mm/ksm/ and its fields are:

kernel_pages_allocated - information about how many kernel pages ksm has allocated; these pages are not swappable, and each such page is used by ksm to share pages with identical content.
pages_shared - how many pages were shared by ksm
run - set to 1 when you want ksm to run, 0 when not
max_kernel_pages - set the maximum amount of kernel pages to be allocated by ksm; set 0 for unlimited.
pages_to_scan - how many pages to scan before ksm will sleep
sleep - how many usecs ksm will sleep.

The interface for applications that want their memory to be scanned by ksm: This interface is built around ioctls; when an application wants its memory to be scanned it will do something like this:

static int ksm_register_memory(unsigned long phys_ram_size,
                               unsigned long phys_ram_base)
{
    int fd;
    int ksm_fd;
    int r = 1;
    struct ksm_memory_region ksm_region;

    fd = open("/dev/ksm", O_RDWR | O_TRUNC, (mode_t)0600);
    if (fd == -1)
        goto out;

    ksm_fd = ioctl(fd, KSM_CREATE_SHARED_MEMORY_AREA);
    if (ksm_fd == -1)
        goto out_free;

    ksm_region.npages = phys_ram_size / TARGET_PAGE_SIZE;
    ksm_region.addr = phys_ram_base;
    r = ioctl(ksm_fd, KSM_REGISTER_MEMORY_REGION, &ksm_region);
    if (r)
        goto out_free1;

    return r;

out_free1:
    close(ksm_fd);
out_free:
    close(fd);
out:
    return r;
}

These ioctls are: KSM_GET_API_VERSION: Give the userspace the api version of the module. KSM_CREATE_SHARED_MEMORY_AREA: Create a shared memory region fd, which later allows the user to register the memory region to scan, by using KSM_REGISTER_MEMORY_REGION and KSM_REMOVE_MEMORY_REGION. KSM_REGISTER_MEMORY_REGION: Register a userspace virtual address range to be scanned by ksm.
This ioctl uses the ksm_memory_region structure:

ksm_memory_region:
__u32 npages; number of pages to share inside this memory region.
__u32 pad;
__u64 addr: the beginning of the virtual address of this region.
__u64 reserved_bits; reserved bits for future usage.

KSM_REMOVE_MEMORY_REGION: Remove a memory region from ksm.

Testing ksm: Considering the fact that I got some mails asking me how to use this, I guess it wasn't explained well in the last posts; I will try to improve on that. The following steps should allow you to test ksm and play with it: 1) First patch your kernel with these patches. 2) Patch avi's kvm-git tree (git.kernel.org/pub/scm/linux/kernel/git/avi/kvm.git) with the patches from: http://lkml.org/lkml/2009/3/30/534 3) Patch the kvm-userspace tree (git.kernel.org/pub/scm/virt/kvm/kvm-userspace.git) with the patches from: http://lkml.org/lkml/2009/3/30/538 4) try to do: echo 300
[PATCH 1/5] MMU_NOTIFIERS: add set_pte_at_notify()
This macro allows setting the pte in the shadow page tables directly, instead of flushing the shadow page table entry and then taking a vmexit in order to set it. This function is an optimization for kvm/users of mmu_notifiers for COW pages. It is useful for kvm when ksm is used, because it allows kvm not to have to receive a VMEXIT and only then map the shared page into the mmu shadow pages; instead the page is mapped directly at the same time Linux maps it into the host page table. This mmu notifier macro works by calling a callback that maps the physical page directly into the shadow page tables. (Users of mmu_notifiers that did not implement the set_pte_at_notify() callback will just receive the mmu_notifier_invalidate_page callback.) Signed-off-by: Izik Eidus iei...@redhat.com --- include/linux/mmu_notifier.h | 34 ++ mm/memory.c | 10 -- mm/mmu_notifier.c| 20 3 files changed, 62 insertions(+), 2 deletions(-) diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index b77486d..8bb245f 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -61,6 +61,15 @@ struct mmu_notifier_ops { struct mm_struct *mm, unsigned long address); + /* + * change_pte is called in cases that pte mapping into page is changed + * for example when ksm mapped pte to point into a new shared page. 
+ */ + void (*change_pte)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long address, + pte_t pte); + /* * Before this is invoked any secondary MMU is still ok to * read/write to the page previously pointed to by the Linux @@ -154,6 +163,8 @@ extern void __mmu_notifier_mm_destroy(struct mm_struct *mm); extern void __mmu_notifier_release(struct mm_struct *mm); extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm, unsigned long address); +extern void __mmu_notifier_change_pte(struct mm_struct *mm, + unsigned long address, pte_t pte); extern void __mmu_notifier_invalidate_page(struct mm_struct *mm, unsigned long address); extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm, @@ -175,6 +186,13 @@ static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm, return 0; } +static inline void mmu_notifier_change_pte(struct mm_struct *mm, + unsigned long address, pte_t pte) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_change_pte(mm, address, pte); +} + static inline void mmu_notifier_invalidate_page(struct mm_struct *mm, unsigned long address) { @@ -236,6 +254,16 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm) __young;\ }) +#define set_pte_at_notify(__mm, __address, __ptep, __pte) \ +({ \ + struct mm_struct *___mm = __mm; \ + unsigned long ___address = __address; \ + pte_t ___pte = __pte; \ + \ + set_pte_at(__mm, __address, __ptep, ___pte);\ + mmu_notifier_change_pte(___mm, ___address, ___pte); \ +}) + #else /* CONFIG_MMU_NOTIFIER */ static inline void mmu_notifier_release(struct mm_struct *mm) @@ -248,6 +276,11 @@ static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm, return 0; } +static inline void mmu_notifier_change_pte(struct mm_struct *mm, + unsigned long address, pte_t pte) +{ +} + static inline void mmu_notifier_invalidate_page(struct mm_struct *mm, unsigned long address) { @@ -273,6 +306,7 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm) #define 
ptep_clear_flush_young_notify ptep_clear_flush_young #define ptep_clear_flush_notify ptep_clear_flush +#define set_pte_at_notify set_pte_at #endif /* CONFIG_MMU_NOTIFIER */ diff --git a/mm/memory.c b/mm/memory.c index cf6873e..1e1a14b 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2051,9 +2051,15 @@ gotten: * seen in the presence of one thread doing SMC and another * thread doing COW. */ - ptep_clear_flush_notify(vma, address, page_table); + ptep_clear_flush(vma, address, page_table); page_add_new_anon_rmap(new_page, vma, address); -
[PATCH 2/5] add get_pte(): helper function: fetching pte for va
get_pte() receives the mm_struct of a task and a virtual address, and returns the pte corresponding to it. The function returns NULL in case it could not fetch the pte. Signed-off-by: Izik Eidus iei...@redhat.com --- include/linux/mm.h | 24 1 files changed, 24 insertions(+), 0 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index bff1f0d..9a34109 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -894,6 +894,30 @@ int vma_wants_writenotify(struct vm_area_struct *vma); extern pte_t *get_locked_pte(struct mm_struct *mm, unsigned long addr, spinlock_t **ptl); +static inline pte_t *get_pte(struct mm_struct *mm, unsigned long addr) +{ + pgd_t *pgd; + pud_t *pud; + pmd_t *pmd; + pte_t *ptep = NULL; + + pgd = pgd_offset(mm, addr); + if (!pgd_present(*pgd)) + goto out; + + pud = pud_offset(pgd, addr); + if (!pud_present(*pud)) + goto out; + + pmd = pmd_offset(pud, addr); + if (!pmd_present(*pmd)) + goto out; + + ptep = pte_offset_map(pmd, addr); +out: + return ptep; +} + #ifdef __PAGETABLE_PUD_FOLDED static inline int __pud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address) -- 1.5.6.5 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/5] add replace_page(): change the page pte is pointing to.
replace_page() allows changing the mapping of a pte from one physical page to a different physical page. It works by removing oldpage from the rmap and calling put_page() on it, and by setting the pte to point to newpage and inserting it into the rmap using page_add_file_rmap(). Note: newpage must be a non-anonymous page. The reason is that replace_page() is built to allow mapping one page into more than one virtual address; the mapping of this page can happen at different offsets inside each vma, and therefore we cannot trust page->index anymore. The side effect of this is that newpage cannot be anything but a kernel-allocated page that is not swappable. Signed-off-by: Izik Eidus iei...@redhat.com --- include/linux/mm.h |5 +++ mm/memory.c| 80 2 files changed, 85 insertions(+), 0 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 9a34109..a0ddfb5 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1264,6 +1264,11 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr, int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn); +#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE) +int replace_page(struct vm_area_struct *vma, struct page *oldpage, +struct page *newpage, pte_t orig_pte, pgprot_t prot); +#endif + struct page *follow_page(struct vm_area_struct *, unsigned long address, unsigned int foll_flags); #define FOLL_WRITE 0x01/* check pte is writable */ diff --git a/mm/memory.c b/mm/memory.c index 1e1a14b..d6e53c2 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1567,6 +1567,86 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, } EXPORT_SYMBOL(vm_insert_mixed); +#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE) + +/** + * replace_page - replace page in vma with new page + * @vma: vma that hold the pte oldpage is pointed by. 
+ * @oldpage: the page we are replacing with newpage + * @newpage: the page we replace oldpage with + * @orig_pte: the original value of the pte + * @prot: page protection bits + * + * Returns 0 on success, -EFAULT on failure. + * + * Note: @newpage must not be an anonymous page because replace_page() does + * not change the mapping of @newpage to have the same values as @oldpage. + * @newpage can be mapped in several vmas at different offsets (page-index). + */ +int replace_page(struct vm_area_struct *vma, struct page *oldpage, +struct page *newpage, pte_t orig_pte, pgprot_t prot) +{ + struct mm_struct *mm = vma-vm_mm; + pgd_t *pgd; + pud_t *pud; + pmd_t *pmd; + pte_t *ptep; + spinlock_t *ptl; + unsigned long addr; + int ret; + + BUG_ON(PageAnon(newpage)); + + ret = -EFAULT; + addr = page_address_in_vma(oldpage, vma); + if (addr == -EFAULT) + goto out; + + pgd = pgd_offset(mm, addr); + if (!pgd_present(*pgd)) + goto out; + + pud = pud_offset(pgd, addr); + if (!pud_present(*pud)) + goto out; + + pmd = pmd_offset(pud, addr); + if (!pmd_present(*pmd)) + goto out; + + ptep = pte_offset_map_lock(mm, pmd, addr, ptl); + if (!ptep) + goto out; + + if (!pte_same(*ptep, orig_pte)) { + pte_unmap_unlock(ptep, ptl); + goto out; + } + + ret = 0; + get_page(newpage); + page_add_file_rmap(newpage); + + flush_cache_page(vma, addr, pte_pfn(*ptep)); + ptep_clear_flush(vma, addr, ptep); + set_pte_at_notify(mm, addr, ptep, mk_pte(newpage, prot)); + + page_remove_rmap(oldpage); + if (PageAnon(oldpage)) { + dec_mm_counter(mm, anon_rss); + inc_mm_counter(mm, file_rss); + } + put_page(oldpage); + + pte_unmap_unlock(ptep, ptl); + +out: + return ret; +} +EXPORT_SYMBOL_GPL(replace_page); + +#endif + /* * maps a range of physical memory into the requested pages. the old * mappings are removed. 
any references to nonexistent pages results -- 1.5.6.5
[PATCH 5/5] add ksm kernel shared memory driver.
Ksm is a driver that allows merging identical pages between one or more applications, in a way invisible to the applications that use it. Pages that are merged are marked as read-only and are COWed when any application tries to change them. Ksm is used for cases where using fork() is not suitable; one such case is where the pages of the application keep changing dynamically and the application cannot know in advance which pages are going to be identical. Ksm works by walking over the memory pages of the applications it scans, in order to find identical pages. It uses two sorted data structures, called the stable and unstable trees, to find the identical pages in an effective way. When ksm finds two identical pages, it marks them as read-only and merges them into a single page; after that, Linux will treat these pages as normal copy-on-write pages and will copy them again when a write access happens to them. Ksm scans just the memory areas that were registered to be scanned by it. 
Signed-off-by: Izik Eidus iei...@redhat.com Signed-off-by: Chris Wright chr...@redhat.com Signed-off-by: Andrea Arcangeli aarca...@redhat.com --- include/linux/ksm.h| 48 ++ include/linux/miscdevice.h |1 + mm/Kconfig |6 + mm/Makefile|1 + mm/ksm.c | 1675 5 files changed, 1731 insertions(+), 0 deletions(-) create mode 100644 include/linux/ksm.h create mode 100644 mm/ksm.c diff --git a/include/linux/ksm.h b/include/linux/ksm.h new file mode 100644 index 000..2c11e9a --- /dev/null +++ b/include/linux/ksm.h @@ -0,0 +1,48 @@ +#ifndef __LINUX_KSM_H +#define __LINUX_KSM_H + +/* + * Userspace interface for /dev/ksm - kvm shared memory + */ + +#include linux/types.h +#include linux/ioctl.h + +#include asm/types.h + +#define KSM_API_VERSION 1 + +#define ksm_control_flags_run 1 + +/* for KSM_REGISTER_MEMORY_REGION */ +struct ksm_memory_region { + __u32 npages; /* number of pages to share */ + __u32 pad; + __u64 addr; /* the begining of the virtual address */ +__u64 reserved_bits; +}; + +#define KSMIO 0xAB + +/* ioctls for /dev/ksm */ + +#define KSM_GET_API_VERSION _IO(KSMIO, 0x00) +/* + * KSM_CREATE_SHARED_MEMORY_AREA - create the shared memory reagion fd + */ +#define KSM_CREATE_SHARED_MEMORY_AREA_IO(KSMIO, 0x01) /* return SMA fd */ + +/* ioctls for SMA fds */ + +/* + * KSM_REGISTER_MEMORY_REGION - register virtual address memory area to be + * scanned by kvm. + */ +#define KSM_REGISTER_MEMORY_REGION _IOW(KSMIO, 0x20,\ + struct ksm_memory_region) +/* + * KSM_REMOVE_MEMORY_REGION - remove virtual address memory area from ksm. 
+ */ +#define KSM_REMOVE_MEMORY_REGION _IO(KSMIO, 0x21) + +#endif diff --git a/include/linux/miscdevice.h b/include/linux/miscdevice.h index beb6ec9..297c0bb 100644 --- a/include/linux/miscdevice.h +++ b/include/linux/miscdevice.h @@ -30,6 +30,7 @@ #define HPET_MINOR 228 #define FUSE_MINOR 229 #define KVM_MINOR 232 +#define KSM_MINOR 233 #define MISC_DYNAMIC_MINOR 255 struct device; diff --git a/mm/Kconfig b/mm/Kconfig index 57971d2..fb8ac63 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -225,3 +225,9 @@ config HAVE_MLOCKED_PAGE_BIT config MMU_NOTIFIER bool + +config KSM + tristate Enable KSM for page sharing + help + Enable the KSM kernel module to allow page sharing of equal pages + among different tasks. diff --git a/mm/Makefile b/mm/Makefile index ec73c68..b885513 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -24,6 +24,7 @@ obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o obj-$(CONFIG_TMPFS_POSIX_ACL) += shmem_acl.o obj-$(CONFIG_SLOB) += slob.o obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o +obj-$(CONFIG_KSM) += ksm.o obj-$(CONFIG_PAGE_POISONING) += debug-pagealloc.o obj-$(CONFIG_SLAB) += slab.o obj-$(CONFIG_SLUB) += slub.o diff --git a/mm/ksm.c b/mm/ksm.c new file mode 100644 index 000..7fd4158 --- /dev/null +++ b/mm/ksm.c @@ -0,0 +1,1675 @@ +/* + * Memory merging driver for Linux + * + * This module enables dynamic sharing of identical pages found in different + * memory areas, even if they are not shared by fork() + * + * Copyright (C) 2008 Red Hat, Inc. + * Authors: + * Izik Eidus + * Andrea Arcangeli + * Chris Wright + * + * This work is licensed under the terms of the GNU GPL, version 2. 
+ */
+
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/mm.h>
+#include <linux/fs.h>
+#include <linux/miscdevice.h>
+#include <linux/vmalloc.h>
+#include <linux/file.h>
+#include <linux/mman.h>
+#include <linux/sched.h>
+#include <linux/rwsem.h>
+#include <linux/pagemap.h>
+#include <linux/sched.h>
+#include <linux/rmap.h>
+#include <linux/spinlock.h>
+#include <linux/jhash.h>
+#include <linux/delay.h>
+#include <linux/kthread.h>
+#include <linux/wait.h>
Re: Luvalley-2 has been released: running KVM below any operating system
It is not a typo. I copied it from the UnixBench output directly. However, it must be a bug in Luvalley, because even the native Linux benchmark on Double-Precision Whetstone is not that high. I also noticed that the other benchmarks are all lower than native Linux. About timing, Luvalley does nothing more than KVM, except that Luvalley implements the VLAPIC timer using the TSC while KVM uses the services of the Linux kernel. The other timers of both Luvalley and KVM are, I think, all implemented in Qemu. Moreover, I cannot explain why Luvalley's benchmarks on process creation, execl throughput, file copy and shell script are 20% ~ 40% higher than KVM's, while the other benchmarks on pipe throughput, context switching and syscall overhead are almost the same as KVM's. Thanks for your attention, Xiaodong 2009/4/19 Avi Kivity a...@redhat.com: Jan Kiszka wrote: Avi Kivity wrote: Xiaodong Yi wrote: Hi, I've tested the guest Linux using UnixBench 5.1.2. The platform is: * Intel Core Duo CPU with 2 cores, 2GB RAM * CentOS 5.2 as the dom0 Linux, i.e., the host Linux for KVM * CentOS 5.2 as the guest Linux, i.e., the Linux running on the virtual machine provided by Qemu The first set of results is for Luvalley, and the second one is for KVM. As the result, Luvalley's guest Linux is 20% ~ 30% faster than KVM's guest! It is very surprising to me. I had thought Luvalley's guest should have the same performance as KVM's. Yes, it is surprising. Double-Precision Whetstone 12287.7 MWIPS (10.0 s, 2 samples) Double-Precision Whetstone 2166.3 MWIPS (10.2 s, 2 samples) That's by far the biggest difference. Can you confirm it isn't a typo? If not, then it looks like we have a bug in floating point handling. I don't think this benchmark uses SSE. Even the native Linux numbers are not that high, rather comparable to KVM. I suspect Luvalley is fooling the benchmark here... Most likely timing. Likely Luvalley guests' time runs too slowly, so they achieve more loops per unit of guest time. 
-- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [PATCH -v2] Add MCE support to KVM
On Sun, Apr 19, 2009 at 9:19 PM, Huang Ying ying.hu...@intel.com wrote: On Sat, 2009-04-18 at 23:54 +0800, Anthony Liguori wrote: Huang Ying wrote: The related MSRs are emulated. MCE capability is exported via extension KVM_CAP_MCE and ioctl KVM_X86_GET_MCE_CAP_SUPPORTED. A new vcpu ioctl command KVM_X86_SETUP_MCE is used to setup MCE emulation such as the mcg_cap. MCE is injected via vcpu ioctl command KVM_X86_SET_MCE. Extended machine-check state (MCG_EXT_P) and CMCI are not simulated. Maybe I'm missing something, but couldn't this be implemented entirely within userspace? There's nothing VT/SVM specific about this. If the issue is setting these MSRs from userspace via KVM_SET_MSRS isn't enough, perhaps we should add userspace MSR handling. Also, if you implement the MSR logic in userspace, it's pretty simple to make it work in the non-TCG case which will be a requirement for upstream merging. There is more logic than just KVM_SET_MSRS, such as BANK reporting disabling, overwriting rules, triple fault for UC MCE during MCIP. Although these logic can be implemented in user space, I think put them in kernel space is easy to be understood. And the code is pretty short. IMO the main reason to put this in kernel-space would be to make it possible to automatically forward some MCE errors generated by the real hardware (RAM ECC errors for example) down into the VM. Right now I suppose you could do that with the patches to forward RAM-based hard MCEs to userspace using SIGSEGV and handling the SIGSEGV in userspace, but that seems more fragile to me. Cheers, Kyle Moffett
Re: [PATCH -v2] Add MCE support to KVM
On Mon, 2009-04-20 at 11:57 +0800, Kyle Moffett wrote: On Sun, Apr 19, 2009 at 9:19 PM, Huang Ying ying.hu...@intel.com wrote: On Sat, 2009-04-18 at 23:54 +0800, Anthony Liguori wrote: Huang Ying wrote: The related MSRs are emulated. MCE capability is exported via extension KVM_CAP_MCE and ioctl KVM_X86_GET_MCE_CAP_SUPPORTED. A new vcpu ioctl command KVM_X86_SETUP_MCE is used to setup MCE emulation such as the mcg_cap. MCE is injected via vcpu ioctl command KVM_X86_SET_MCE. Extended machine-check state (MCG_EXT_P) and CMCI are not simulated. Maybe I'm missing something, but couldn't this be implemented entirely within userspace? There's nothing VT/SVM specific about this. If the issue is setting these MSRs from userspace via KVM_SET_MSRS isn't enough, perhaps we should add userspace MSR handling. Also, if you implement the MSR logic in userspace, it's pretty simple to make it work in the non-TCG case which will be a requirement for upstream merging. There is more logic than just KVM_SET_MSRS, such as BANK reporting disabling, overwriting rules, triple fault for UC MCE during MCIP. Although these logic can be implemented in user space, I think put them in kernel space is easy to be understood. And the code is pretty short. IMO the main reason to put this in kernel-space would be to make it possible to automatically forward some MCE errors generated by the real hardware (RAM ECC errors for example) down into the VM. Right now I suppose you could do that with the patches to forward RAM-based hard MCEs to userspace using SIGSEGV and handling the SIGSEGV in userspace, but that seems more fragile to me. Yes. Putting this in kernel-space would make it more reliable to forward the MCE to the KVM guest. Best Regards, Huang Ying