RE: [PATCH] whpx: Added support for saving/restoring VM state
Hi Paolo,

Thanks for pointing out the relevant fields of the XSAVE state - it explains what is going on. I did some quick experiments with modifying the buffer manually, so:

* Hyper-V doesn't care about the value of MXCSR_MASK. Both 0x and 0x work.
* Setting XCOMP_BV to 0 does trigger the error.

My best guess is that Hyper-V is emulating XSAVE/XRSTOR programmatically and only implemented the compacted format. Do you think it would be a reasonable workaround to handle it like this:

1. Pass the XSAVE data received from Hyper-V to x86_cpu_xrstor_all_areas().
2. Also save it into the snapshot area like we do now.
3. When restoring, first try to pass the data from x86_cpu_xsave_all_areas() to Hyper-V.
4. If it rejects it, pass the original block saved in step 2.

If either Microsoft adds support for the regular format, or someone on the QEMU side implements the compacted one, this logic will start properly synchronizing the QEMU-level XSAVE structure with Hyper-V. Until then (or if it breaks again) it will fall back to saving/restoring the data "as is", which will be sufficient for snapshots.

Best,
Ivan

-Original Message-
From: Qemu-devel On Behalf Of Paolo Bonzini
Sent: Tuesday, May 17, 2022 7:12 AM
To: Ivan Shcherbakov ; qemu-devel@nongnu.org
Subject: Re: [PATCH] whpx: Added support for saving/restoring VM state

On 5/16/22 20:44, Ivan Shcherbakov wrote:
> Passing it to x86_cpu_xrstor_all_areas()/x86_cpu_xsave_all_areas() changed
> the following values:
>
> 0x001C: ff ff -> 00 00
> 0x0208: 07 -> 00
> 0x020F: 80 -> 00

0x1C-0x1F is MXCSR_MASK. There's already a field in the x86 CPUState, but it was forgotten in x86_cpu_xsave_all_areas()/x86_cpu_xrstor_all_areas(). The field should also be initialized to 0x in the CPU reset function.

0x208...0x20F is XCOMP_BV and bit 63 in there is indeed signaling compacted format.
First of all I'd start with your patch and hack it to check if Hyper-V accepts zero at 0x208..0x20F; in this specific case, where 0x208...0x20F has all low consecutive bits set plus bit 63 set, it's fine to do just that. If so, x86_cpu_xrstor_all_areas() needs no support for compacted format. I would be somewhat surprised if Hyper-V needs support in XRSTOR too.

For XSAVE, the algorithm to compute the offset (instead of just using x->offset) is given in the Intel manual:

If XCOMP_BV[i] = 0, state component i is not in the XSAVE area at all.

If XCOMP_BV[i] = 1, state component i is located at a byte offset from the base address of the XSAVE area, which is determined by the following steps:

- If i is the first bit set in bits 62:2 of XCOMP_BV, state component i starts at offset 576
- Otherwise, take CPUID[EAX=0DH,ECX=i].ECX[1]:
  - If it is 0, state component i starts right after the preceding state component whose bit is set in XCOMP_BV (where the size of component j is enumerated in CPUID[EAX=0DH,ECX=j].EAX).
  - If it is 1, state component i starts after the preceding state component whose bit is set in XCOMP_BV, but on a 64-byte aligned offset relative to the beginning of the XSAVE area.

Paolo
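[Editorial note: the offset algorithm Paolo quotes above can be sketched in C. This is an illustrative sketch only, not QEMU code: the comp_size/comp_align64 tables stand in for the values that CPUID[EAX=0DH,ECX=i].EAX (component size) and CPUID[EAX=0DH,ECX=i].ECX[1] (64-byte alignment flag) would report, and the numbers in them are made up for the example.]

```c
#include <stdint.h>

/*
 * Hypothetical per-component tables; real values come from
 * CPUID[EAX=0DH,ECX=i].EAX and CPUID[EAX=0DH,ECX=i].ECX bit 1.
 * The sizes below are invented for illustration.
 */
static const uint32_t comp_size[]    = { 512, 64, 256, 48, 64 };
static const int      comp_align64[] = { 0,   0,  0,   0,  1  };

/*
 * Byte offset of state component i (i >= 2) in a compacted XSAVE area,
 * following the algorithm quoted from the Intel manual. Returns 0 if
 * XCOMP_BV says the component is absent. Components 0 and 1 live in the
 * fixed legacy region; the extended region starts at offset 576.
 */
static uint32_t xsave_compacted_offset(uint64_t xcomp_bv, int i)
{
    if (!(xcomp_bv & (1ULL << i))) {
        return 0; /* component not present in the area */
    }
    uint32_t offset = 576; /* first extended component starts here */
    for (int j = 2; j <= i; j++) {
        if (!(xcomp_bv & (1ULL << j))) {
            continue; /* absent components occupy no space */
        }
        if (comp_align64[j]) {
            offset = (offset + 63) & ~63u; /* round up to 64 bytes */
        }
        if (j == i) {
            break;
        }
        offset += comp_size[j];
    }
    return offset;
}
```

For example, with XCOMP_BV having bit 63 and bits 2..4 set, component 2 lands at 576, component 3 right after it at 832, and component 4 at 880 rounded up to 896 by its alignment bit.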
RE: [PATCH] whpx: Added support for saving/restoring VM state
Hi Paolo, >What are the differences? Is it using the XSAVEC/XSAVES ("compacted") format? I am not very familiar with the format internals, so I briefly checked whether I could reuse the general logic from the HVF port. Here's what I got on a booted Linux VM: WHvGetVirtualProcessorXsaveState() returned this block (844 bytes): 7f 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 a0 1f 00 00 ff ff 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 58 65 c8 34 7f 00 00 04 7f 65 c8 34 7f 00 00 44 e1 68 c8 34 7f 00 00 f8 35 65 c8 34 7f 00 00 ec 4f 65 c8 34 7f 00 00 5c 4f 6a c8 34 7f 00 00 d0 7d 65 c8 34 7f 00 00 3c 6a 67 c8 34 7f 00 00 64 90 67 c8 34 7f 00 00 9c 0c 78 c8 34 7f 00 00 3c 6a 67 c8 34 7f 00 00 40 60 65 c8 34 7f 00 00 5c 4f 6a c8 34 7f 00 00 d0 7d 65 c8 34 7f 00 00 f8 35 65 c8 34 7f 00 00 ec 4f 65 c8 34 7f 00 00 70 3d 66 cb 34 7f 00 00 00 00 00 00 00 00 00 00 69 00 65 00 6f 00 20 00 00 00 00 00 00 00 00 00 ff 00 ff 00 ff 00 ff 00 ff 00 ff 00 ff 00 ff 00 ff 00 00 00 ff ff 00 00 00 00 00 00 00 00 ff ff 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 02 00 00 00 0c 00 00 00 05 00 00 00 22 00 00 00 05 00 00 00 10 00 00 00 05 00 00 00 22 00 00 00 10 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 03 00 00 00 00 00 00 00 07 00 00 00 00 00 00 80 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Passing it to x86_cpu_xrstor_all_areas()/x86_cpu_xsave_all_areas() changed the following values: 0x001C: ff ff -> 00 00 0x0208: 07 -> 00 0x020F: 80 -> 00 Trying to pass the updated block to WHvSetVirtualProcessorXsaveState() just made it reject the entire block with error c0350005 (generic "invalid parameter"). I haven't looked too deep into it, since just saving the state "as is" is sufficient to get the snapshots working. If there is something obvious I missed, I can give it another brief try. Best, Ivan
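[Editorial note: one way to stay robust against the rejection described above is the try-then-fallback restore order later proposed in this thread. Below is an illustrative sketch, not QEMU code: the function pointer stands in for WHvSetVirtualProcessorXsaveState(), and all names are invented for the example.]

```c
#include <stdint.h>

/* Stand-in for WHvSetVirtualProcessorXsaveState(); returns 0 on success. */
typedef int (*set_xsave_fn)(const void *buf, uint32_t len);

/*
 * Restore order that tolerates the hypervisor rejecting QEMU's XSAVE
 * layout: first offer the block regenerated by QEMU, and if that fails,
 * fall back to the raw block that was captured verbatim at save time.
 */
static int restore_xsave_with_fallback(set_xsave_fn set_xsave,
                                       const void *qemu_buf,
                                       const void *raw_buf, uint32_t len)
{
    if (set_xsave(qemu_buf, len) == 0) {
        return 0; /* hypervisor accepted QEMU's layout */
    }
    return set_xsave(raw_buf, len); /* fall back to the saved block */
}

/*
 * A stub "hypervisor" that only accepts buffers starting with 'R',
 * to exercise the fallback path.
 */
static int stub_set_xsave(const void *buf, uint32_t len)
{
    (void)len;
    return *(const unsigned char *)buf == 'R' ? 0 : -1;
}
```

If Hyper-V ever starts accepting the standard-format block, the first call succeeds and the fallback path simply stops being taken.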
[PATCH] whpx: Added support for saving/restoring VM state
This patch adds support for the savevm/loadvm commands when using WHPX. It saves the XSAVE state and the relevant internal registers that were not otherwise captured into a separate "whpx/cpustate" object in the snapshot, and restores them when the snapshot is loaded. Note that due to the XSAVE format differences between the WHPX API and QEMU, the XSAVE state captured from WHPX is not reflected in the X86CPU structure, and is instead saved to the snapshots "as is". Signed-off-by: Ivan Shcherbakov --- target/i386/cpu.h| 2 +- target/i386/whpx/whpx-all.c | 205 +-- target/i386/whpx/whpx-internal.h | 7 +- 3 files changed, 201 insertions(+), 13 deletions(-) diff --git a/target/i386/cpu.h b/target/i386/cpu.h index 9661f9fbd1..9c16199679 100644 --- a/target/i386/cpu.h +++ b/target/i386/cpu.h @@ -1698,7 +1698,7 @@ typedef struct CPUArchState { int64_t user_tsc_khz; /* for sanity check only */ uint64_t apic_bus_freq; uint64_t tsc; -#if defined(CONFIG_KVM) || defined(CONFIG_HVF) +#if defined(CONFIG_KVM) || defined(CONFIG_HVF) || defined(CONFIG_WHPX) void *xsave_buf; uint32_t xsave_buf_len; #endif diff --git a/target/i386/whpx/whpx-all.c b/target/i386/whpx/whpx-all.c index b22a3314b4..6f95e9c780 100644 --- a/target/i386/whpx/whpx-all.c +++ b/target/i386/whpx/whpx-all.c @@ -26,6 +26,9 @@ #include "qapi/qapi-types-common.h" #include "qapi/qapi-visit-common.h" #include "migration/blocker.h" +#include "migration/register.h" +#include "migration/qemu-file-types.h" +#include "qemu/memalign.h" #include #include "whpx-internal.h" @@ -242,6 +245,10 @@ struct whpx_vcpu { WHV_RUN_VP_EXIT_CONTEXT exit_ctx; }; +enum { +WHPX_XSAVE_AREA_ALIGNMENT = 4096, +}; + static bool whpx_allowed; static bool whp_dispatch_initialized; static HMODULE hWinHvPlatform, hWinHvEmulation; @@ -251,6 +258,29 @@ static WHV_PROCESSOR_XSAVE_FEATURES whpx_xsave_cap; struct whpx_state whpx_global; struct WHPDispatch whp_dispatch; +static void whpx_xsave_save(QEMUFile *f, void *opaque); +static int 
whpx_xsave_load(QEMUFile *f, void *opaque, int version_id); + +/* + * As of Windows 10 21H1, the layout of the XSAVE data returned by the WHPX API + * does not match the layout used by QEMU. + * + * Specifically, trying to pass the state returned by x86_cpu_xsave_all_areas() + * to WHvSetVirtualProcessorXsaveState() causes it to return an error. + * + * As a result, we do not reflect the captured XSAVE state in the X86CPU + * structure, and instead manually save it to the snapshots via the + * whpx_xsave_() callbacks. + * + * Note that unlike the device drivers that can use the new VMStateDescription + * mechanism via 'DeviceClass::vmsd' field, AccelClass objects cannot easily + * do it. Hence, we rely on the legacy state management API. + */ +static SaveVMHandlers savevm_whpx = { +.save_state = whpx_xsave_save, +.load_state = whpx_xsave_load, +}; + static bool whpx_has_xsave(void) { return whpx_xsave_cap.XsaveSupport; @@ -307,18 +337,35 @@ static SegmentCache whpx_seg_h2q(const WHV_X64_SEGMENT_REGISTER *hs) } /* X64 Extended Control Registers */ -static void whpx_set_xcrs(CPUState *cpu) +static void whpx_set_xsave(CPUState *cpu) { CPUX86State *env = cpu->env_ptr; HRESULT hr; struct whpx_state *whpx = _global; WHV_REGISTER_VALUE xcr0; WHV_REGISTER_NAME xcr0_name = WHvX64RegisterXCr0; +void *xsave = X86_CPU(cpu)->env.xsave_buf; +uint32_t xsave_len = X86_CPU(cpu)->env.xsave_buf_len; if (!whpx_has_xsave()) { return; } +if (xsave) { +/* + * See the comment on 'savevm_whpx' for an explanation why + * we cannot do this: + * x86_cpu_xsave_all_areas(X86_CPU(cpu), xsave, xsave_len); + */ + +hr = whp_dispatch.WHvSetVirtualProcessorXsaveState( +whpx->partition, cpu->cpu_index, xsave, xsave_len); + +if (FAILED(hr)) { +error_report("WHPX: Failed to set xsave state, hr=%08lx", hr); +} +} + /* Only xcr0 is supported by the hypervisor currently */ xcr0.Reg64 = env->xcr0; hr = whp_dispatch.WHvSetVirtualProcessorRegisters( @@ -326,6 +373,7 @@ static void whpx_set_xcrs(CPUState *cpu) 
if (FAILED(hr)) { error_report("WHPX: Failed to set register xcr0, hr=%08lx", hr); } + } static int whpx_set_tsc(CPUState *cpu) @@ -474,7 +522,7 @@ static void whpx_set_registers(CPUState *cpu, int level) * Extended control registers needs to be handled separately depending * on whether xsave is supported/enabled or not. */ -whpx_set_xcrs(cpu); +whpx_set_xsave(cpu); /* 16 XMM registers */ assert(whpx_register_names[idx] == WHvX64RegisterXmm0); @@ -582,28 +630,57 @@ static int whpx_get_tsc(CPUState *cpu) r
[PATCH v2] whpx: Added support for breakpoints and stepping
Below is the updated version of the patch adding debugging support to WHPX. It incorporates feedback from Alex Bennée and Peter Maydell regarding not changing the emulation logic depending on the gdb connection status. Instead of checking for an active gdb connection to determine whether QEMU should intercept the INT1 exceptions, it now checks whether any breakpoints have been set, or whether gdb has explicitly requested one or more CPUs to do single-stepping. Having none of these condition present now has the same effect as not using gdb at all. --- This adds support for breakpoints and stepping when debugging WHPX-accelerated guests with gdb. It enables reliable debugging of the Linux kernel in both single-CPU and SMP modes. Signed-off-by: Ivan Shcherbakov --- gdbstub.c | 10 +- include/sysemu/accel-ops.h| 1 + include/sysemu/runstate.h | 8 +- softmmu/cpus.c| 12 +- target/i386/whpx/whpx-accel-ops.c | 1 + target/i386/whpx/whpx-accel-ops.h | 1 + target/i386/whpx/whpx-all.c | 770 +- target/i386/whpx/whpx-internal.h | 30 ++ 8 files changed, 815 insertions(+), 18 deletions(-) diff --git a/gdbstub.c b/gdbstub.c index 3c14c6a038..ab3cd44f72 100644 --- a/gdbstub.c +++ b/gdbstub.c @@ -519,7 +519,15 @@ static int gdb_continue_partial(char *newstates) int flag = 0; if (!runstate_needs_reset()) { -if (vm_prepare_start()) { +bool step_requested = false; +CPU_FOREACH(cpu) { +if (newstates[cpu->cpu_index] == 's') { +step_requested = true; +break; +} +} + +if (vm_prepare_start(step_requested)) { return 0; } diff --git a/include/sysemu/accel-ops.h b/include/sysemu/accel-ops.h index 032f6979d7..f40ca4c033 100644 --- a/include/sysemu/accel-ops.h +++ b/include/sysemu/accel-ops.h @@ -35,6 +35,7 @@ struct AccelOpsClass { void (*synchronize_post_init)(CPUState *cpu); void (*synchronize_state)(CPUState *cpu); void (*synchronize_pre_loadvm)(CPUState *cpu); +void (*synchronize_pre_resume)(bool step_pending); void (*handle_interrupt)(CPUState *cpu, int mask); diff --git 
a/include/sysemu/runstate.h b/include/sysemu/runstate.h index a535691573..798ba13ecb 100644 --- a/include/sysemu/runstate.h +++ b/include/sysemu/runstate.h @@ -34,7 +34,13 @@ static inline bool shutdown_caused_by_guest(ShutdownCause cause) } void vm_start(void); -int vm_prepare_start(void); + +/** + * vm_prepare_start: Prepare for starting/resuming the VM + * + * @step_pending: whether any of the CPUs is about to be single-stepped by gdb + */ +int vm_prepare_start(bool step_pending); int vm_stop(RunState state); int vm_stop_force_state(RunState state); int vm_shutdown(void); diff --git a/softmmu/cpus.c b/softmmu/cpus.c index 035395ae13..acd695a229 100644 --- a/softmmu/cpus.c +++ b/softmmu/cpus.c @@ -662,7 +662,7 @@ int vm_stop(RunState state) * Returns -1 if the vCPUs are not to be restarted (e.g. if they are already * running or in case of an error condition), 0 otherwise. */ -int vm_prepare_start(void) +int vm_prepare_start(bool step_pending) { RunState requested; @@ -682,6 +682,14 @@ int vm_prepare_start(void) return -1; } +/* + * WHPX accelerator needs to know whether we are going to step + * any CPUs, before starting the first one. 
+ */ +if (cpus_accel->synchronize_pre_resume) { +cpus_accel->synchronize_pre_resume(step_pending); +} + /* We are sending this now, but the CPUs will be resumed shortly later */ qapi_event_send_resume(); @@ -693,7 +701,7 @@ int vm_prepare_start(void) void vm_start(void) { -if (!vm_prepare_start()) { +if (!vm_prepare_start(false)) { resume_all_vcpus(); } } diff --git a/target/i386/whpx/whpx-accel-ops.c b/target/i386/whpx/whpx-accel-ops.c index 6bc47c5309..1a859a084d 100644 --- a/target/i386/whpx/whpx-accel-ops.c +++ b/target/i386/whpx/whpx-accel-ops.c @@ -94,6 +94,7 @@ static void whpx_accel_ops_class_init(ObjectClass *oc, void *data) ops->synchronize_post_init = whpx_cpu_synchronize_post_init; ops->synchronize_state = whpx_cpu_synchronize_state; ops->synchronize_pre_loadvm = whpx_cpu_synchronize_pre_loadvm; +ops->synchronize_pre_resume = whpx_cpu_synchronize_pre_resume; } static const TypeInfo whpx_accel_ops_type = { diff --git a/target/i386/whpx/whpx-accel-ops.h b/target/i386/whpx/whpx-accel-ops.h index 2dee6d61ea..b5102dd1ee 100644 --- a/target/i386/whpx/whpx-accel-ops.h +++ b/target/i386/whpx/whpx-accel-ops.h @@ -21,6 +21,7 @@ void whpx_cpu_synchronize_state(CPUState *cpu); void whpx_cpu_synchronize_post_reset(CPUState *cpu); void whpx_cpu_synchronize_post_init(CPUState *cpu); void whpx_cpu_synchronize_pre_loadvm(CPUState *cpu); +void whpx_cpu_synchronize_pre_resume(bool step_pending); /* stat
RE: [PATCH 3/3] whpx: Added support for breakpoints and stepping
Hi Alex,

Is there anything I could do to get the WHPX debugging support accepted into QEMU? Would the proposed AccelOpsClass callback work for you, or would you prefer another approach?

Best,
Ivan

-Original Message-
From: Qemu-devel On Behalf Of Ivan Shcherbakov
Sent: Monday, February 28, 2022 6:09 PM
To: 'Alex Bennée'
Cc: 'Peter Maydell' ; m...@redhat.com; qemu-devel@nongnu.org; arm...@redhat.com; 'Paolo Bonzini'
Subject: RE: [PATCH 3/3] whpx: Added support for breakpoints and stepping

Hi Alex,

Thanks for getting back to me. It is definitely the latter case (i.e. it is possible to change it while the target is stopped at a breakpoint before resuming any VCPUs).

vm_state_notify() does look like it's intended for this type of notification, but unfortunately, it doesn't make a distinction between stepping and running normally. Below is the relevant code from gdbstub.c:

>static int gdb_continue_partial(char *newstates)
>{
>    int flag = 0;
>
>    /* Various corner cases omitted for brevity */
>    if (vm_prepare_start()) {
>        return 0;
>    }
>    CPU_FOREACH(cpu) {
>        switch (newstates[cpu->cpu_index]) {
>        case 's':
>            trace_gdbstub_op_stepping(cpu->cpu_index);
>            cpu_single_step(cpu, gdbserver_state.sstep_flags);
>            cpu_resume(cpu);
>            flag = 1;
>            break;
>        case 'c':
>            trace_gdbstub_op_continue_cpu(cpu->cpu_index);
>            cpu_resume(cpu);
>            flag = 1;
>            break;
>        default:
>            res = -1;
>            break;
>        }
>    }
>}

Also:

>int vm_prepare_start(void)
>{
>    runstate_set(RUN_STATE_RUNNING);
>    vm_state_notify(1, RUN_STATE_RUNNING);
>    return 0;
>}

and:

>void vm_state_notify(bool running, RunState state);

So, currently, vm_prepare_start() has no way of distinguishing between single-stepping and normal running unless gdb_continue_partial() scans the 'newstates' array before calling it, and passes some extra argument to vm_prepare_start(), indicating whether a step request was present anywhere in the array.

I couldn't find any existing run state that would match single-stepping, and adding another run state looks like a very non-trivial change that can easily backfire. Adding another argument to vm_state_notify() could be easier, but I am still afraid it could break some other part of QEMU, so I thought adding a new member to AccelOpsClass would be a safer bet. But again, if you think another way to do it is better, I am very open to it.

Best regards,
Ivan

-Original Message-
From: Alex Bennée
Sent: Monday, February 28, 2022 2:28 AM
To: Ivan Shcherbakov
Cc: 'Peter Maydell' ; 'Paolo Bonzini' ; qemu-devel@nongnu.org; arm...@redhat.com; m...@redhat.com
Subject: Re: [PATCH 3/3] whpx: Added support for breakpoints and stepping

Is the limitation that whpx_set_exception_exit_bitmap needs to be set before any vCPU can be run or that it cannot be set if any vCPUs are currently running? If it is the latter wouldn't adding a hook into the vm_change_state_head callbacks be enough?
RE: [PATCH 3/3] whpx: Added support for breakpoints and stepping
Hi Alex,

Thanks for getting back to me. It is definitely the latter case (i.e. it is possible to change it while the target is stopped at a breakpoint before resuming any VCPUs).

vm_state_notify() does look like it's intended for this type of notification, but unfortunately, it doesn't make a distinction between stepping and running normally. Below is the relevant code from gdbstub.c:

>static int gdb_continue_partial(char *newstates)
>{
>    int flag = 0;
>
>    /* Various corner cases omitted for brevity */
>    if (vm_prepare_start()) {
>        return 0;
>    }
>    CPU_FOREACH(cpu) {
>        switch (newstates[cpu->cpu_index]) {
>        case 's':
>            trace_gdbstub_op_stepping(cpu->cpu_index);
>            cpu_single_step(cpu, gdbserver_state.sstep_flags);
>            cpu_resume(cpu);
>            flag = 1;
>            break;
>        case 'c':
>            trace_gdbstub_op_continue_cpu(cpu->cpu_index);
>            cpu_resume(cpu);
>            flag = 1;
>            break;
>        default:
>            res = -1;
>            break;
>        }
>    }
>}

Also:

>int vm_prepare_start(void)
>{
>    runstate_set(RUN_STATE_RUNNING);
>    vm_state_notify(1, RUN_STATE_RUNNING);
>    return 0;
>}

and:

>void vm_state_notify(bool running, RunState state);

So, currently, vm_prepare_start() has no way of distinguishing between single-stepping and normal running unless gdb_continue_partial() scans the 'newstates' array before calling it, and passes some extra argument to vm_prepare_start(), indicating whether a step request was present anywhere in the array.

I couldn't find any existing run state that would match single-stepping, and adding another run state looks like a very non-trivial change that can easily backfire. Adding another argument to vm_state_notify() could be easier, but I am still afraid it could break some other part of QEMU, so I thought adding a new member to AccelOpsClass would be a safer bet. But again, if you think another way to do it is better, I am very open to it.

Best regards,
Ivan

-Original Message-
From: Alex Bennée
Sent: Monday, February 28, 2022 2:28 AM
To: Ivan Shcherbakov
Cc: 'Peter Maydell' ; 'Paolo Bonzini' ; qemu-devel@nongnu.org; arm...@redhat.com; m...@redhat.com
Subject: Re: [PATCH 3/3] whpx: Added support for breakpoints and stepping

Is the limitation that whpx_set_exception_exit_bitmap needs to be set before any vCPU can be run or that it cannot be set if any vCPUs are currently running? If it is the latter wouldn't adding a hook into the vm_change_state_head callbacks be enough?
RE: [PATCH 3/3] whpx: Added support for breakpoints and stepping
Hi Alex and Peter,

It would be great to hear your feedback on handling the WHPX stepping via an added argument to vm_prepare_start(). It is only called from 2 places, so the changes will be minimal. If this works for you, I will gladly send an updated patch.

I am also fully open to other suggestions. You obviously know the QEMU codebase much better, so I'm happy to refactor it so that it blends in well with the rest of the code.

Best,
Ivan

-Original Message-
From: Qemu-devel On Behalf Of Ivan Shcherbakov
Sent: Thursday, February 24, 2022 7:54 AM
To: 'Alex Bennée' ; 'Peter Maydell'
Cc: 'Paolo Bonzini' ; qemu-devel@nongnu.org; arm...@redhat.com; m...@redhat.com
Subject: RE: [PATCH 3/3] whpx: Added support for breakpoints and stepping

> I haven't looked at the rest of the patch -- but can you explain where
> whpx is different from how other accelerators handle debug such that
> it needs to know whether gdb is connected or not ?

This mainly comes from the way single-stepping is handled. WHPX needs to know whether you want to trap INT1 before starting the first VCPU. The current gdbstub implementation doesn't make it easy. Assume the scenario:

1. You have 2 VCPUs. You run the first one and step the second one.
2. gdb_continue_partial() calls cpu_resume(0).
3. gdb_continue_partial() calls cpu_single_step(1).

WHPX needs to know whether to trap INT1 at step #2 (starting the first CPU), but it won't know it until step #3. So, the current logic simply checks if gdb is connected at all in step #2.

>Just the fact you have connected to the gdbserver shouldn't affect how you run
>WHPX up until the point there are things you need to trap - i.e.
>handling installed breakpoints.

This can be addressed by adding a "bool stepping_expected" argument to vm_prepare_start(). It will be set to true if gdb_continue_partial() expects ANY thread to be stepped, and will be false otherwise. It will also require a new callback in AccelOpsClass (e.g. on_vm_starting(bool stepping_expected)) that will be called from vm_prepare_start(). The WHPX implementation will then check if any breakpoints are set, and if the last call to this function expected stepping, and use it to decide whether to trap INT1.

Let me know if this will work better.

Best,
Ivan
RE: [PATCH 3/3] whpx: Added support for breakpoints and stepping
> I haven't looked at the rest of the patch -- but can you explain where
> whpx is different from how other accelerators handle debug such that
> it needs to know whether gdb is connected or not ?

This mainly comes from the way single-stepping is handled. WHPX needs to know whether you want to trap INT1 before starting the first VCPU. The current gdbstub implementation doesn't make it easy. Assume the scenario:

1. You have 2 VCPUs. You run the first one and step the second one.
2. gdb_continue_partial() calls cpu_resume(0).
3. gdb_continue_partial() calls cpu_single_step(1).

WHPX needs to know whether to trap INT1 at step #2 (starting the first CPU), but it won't know it until step #3. So, the current logic simply checks if gdb is connected at all in step #2.

>Just the fact you have connected to the gdbserver shouldn't affect how you run
>WHPX up until the point there are things you need to trap - i.e.
>handling installed breakpoints.

This can be addressed by adding a "bool stepping_expected" argument to vm_prepare_start(). It will be set to true if gdb_continue_partial() expects ANY thread to be stepped, and will be false otherwise. It will also require a new callback in AccelOpsClass (e.g. on_vm_starting(bool stepping_expected)) that will be called from vm_prepare_start(). The WHPX implementation will then check if any breakpoints are set, and if the last call to this function expected stepping, and use it to decide whether to trap INT1.

Let me know if this will work better.

Best,
Ivan
RE: [PATCH 3/3] whpx: Added support for breakpoints and stepping
Hi Paolo,

Thanks for getting back to me. Please see my comments below:

>Please use WhpxStepMode and likewise for WhpxBreakpointState.

No problem, I have updated the patch.

>(In the case of WhpxStepMode I would also consider simply a "bool exclusive" in whpx_cpu_run).

This is a leftover from prior experiments with stepping into the exception handlers that involves reading the IDT/GDT, writing INT3 into the original handler, disabling INT1 interception and instead intercepting INT3. It would be implemented via a third step mode. I ended up removing it due to the unbelievable complexity of properly handling all corner cases, but I thought it would make sense to keep it as an enum in case someone else decides to add anything similar later.

>Please leave the empty line.

Oops, a leftover from other experiments. Thanks for pointing it out.

>Out of curiosity, does the guest see TF=1 if it single steps through a PUSHF (and then break horribly on POPF :))?

This is a very good point. It would indeed get pushed onto the stack and popped back with POPF, raising another INT1. It shouldn't be too horrible though: it would report another stop to gdb, and doing a step again would clear it.

A somewhat bigger issue would be that stepping through POPF or LAHF could reset TF to 0, effectively ending the single-stepping mode. It could be addressed by emulating PUSHF/POPF, but there are a few corner cases (alignment checks, page faults on RSP, flag translation for virtual 8086 mode) that could make things more complicated. Also, PUSHF/POPF is not the only special case. Stepping over IRET, into page fault handlers, and when the guest itself is running a debugger that wants to do single-stepping would also require rather non-trivial workarounds. I have added a detailed comment outlining the limitations of the current stepping logic and possible workarounds above the definition of WhpxStepMode.

The current implementation works like a charm for debugging real-world kernel modules on x64 Ubuntu, and if some of these limitations end up breaking other debug scenarios, I would much rather address specific narrow issues than add another layer of complexity proactively, if that's OK with you. I have taken special care to make sure the new functionality won't cause any regressions when not debugging. The intercepted exception mask is set to 0 when gdb is not connected, 100% matching the behavior prior to the patch.

>Why separate the original addresses in a different array

This accommodates the case when different parts of QEMU set multiple breakpoints at the same location, or when a breakpoint removed on the QEMU level could not be removed from RAM (e.g. the page got swapped out).

Synchronizing the QEMU breakpoint list with the RAM breakpoint list currently has O(N^2) complexity. However, in many cases (e.g. stopping to handle timers), the QEMU breakpoints won't change between the invocations, so we can just do a quick O(N) comparison of the old breakpoint list vs. the new breakpoint list, and avoid the O(N^2) part. You can find this logic by searching for the 'update_pending' variable.

This could be avoided if cpu_breakpoint_insert() or gdb_breakpoint_insert() could invoke a WHPX-specific handler, similar to what is currently done with kvm_insert_breakpoint(), or a generic callback via AccelOpsClass, that could just mark the RAM breakpoint list dirty. But I didn't want to introduce unnecessary changes outside the WHPX module.

>(and why the different logic, with used/allocated for one array and an exact size for the other)

When we are rebuilding the memory breakpoint list, we don't know how many of the CPU breakpoints will refer to the same memory addresses, or overlap with the breakpoints that could not be removed. So, we allocate for the worst case, and keep track of the actually created entries in the 'used' field.

Updated patch below.
Let me know if you have any further concerns and I will be happy to address them.

This adds support for breakpoints and stepping when debugging
WHPX-accelerated guests with gdb. It enables reliable debugging of the
Linux kernel in both single-CPU and SMP modes.

Signed-off-by: Ivan Shcherbakov
---
 gdbstub.c                        |  10 +
 include/exec/gdbstub.h           |   8 +
 target/i386/whpx/whpx-all.c      | 763 ++-
 target/i386/whpx/whpx-internal.h |  29 ++
 4 files changed, 796 insertions(+), 14 deletions(-)

diff --git a/gdbstub.c b/gdbstub.c
index 3c14c6a038..d30cbfa478 100644
--- a/gdbstub.c
+++ b/gdbstub.c
@@ -373,6 +373,12 @@ typedef struct GDBState {
 } GDBState;
 
 static GDBState gdbserver_state;
+static bool gdbserver_is_connected;
+
+bool gdb_is_connected(void)
+{
+    return gdbserver_is_connected;
+}
 
 static void init_gdbserver_state(void)
 {
@@ -3410,6 +3416,10 @@ static void gdb_chr_event(void *opaque, QEMUChrEvent event)
         vm_stop(RUN_STATE_
[PATCH 3/3] whpx: Added support for breakpoints and stepping
This adds support for breakpoints and stepping when debugging
WHPX-accelerated guests with gdb. It enables reliable debugging of the
Linux kernel in both single-CPU and SMP modes.

Signed-off-by: Ivan Shcherbakov
---
 gdbstub.c                        |  10 +
 include/exec/gdbstub.h           |   8 +
 target/i386/whpx/whpx-all.c      | 689 ++-
 target/i386/whpx/whpx-internal.h |  29 ++
 4 files changed, 721 insertions(+), 15 deletions(-)

diff --git a/gdbstub.c b/gdbstub.c
index 3c14c6a038..d30cbfa478 100644
--- a/gdbstub.c
+++ b/gdbstub.c
@@ -373,6 +373,12 @@ typedef struct GDBState {
 } GDBState;
 
 static GDBState gdbserver_state;
+static bool gdbserver_is_connected;
+
+bool gdb_is_connected(void)
+{
+    return gdbserver_is_connected;
+}
 
 static void init_gdbserver_state(void)
 {
@@ -3410,6 +3416,10 @@ static void gdb_chr_event(void *opaque, QEMUChrEvent event)
         vm_stop(RUN_STATE_PAUSED);
         replay_gdb_attached();
         gdb_has_xml = false;
+        gdbserver_is_connected = true;
+        break;
+    case CHR_EVENT_CLOSED:
+        gdbserver_is_connected = false;
         break;
     default:
         break;
diff --git a/include/exec/gdbstub.h b/include/exec/gdbstub.h
index a024a0350d..0ef54cdeb5 100644
--- a/include/exec/gdbstub.h
+++ b/include/exec/gdbstub.h
@@ -188,4 +188,12 @@ extern bool gdb_has_xml;
 /* in gdbstub-xml.c, generated by scripts/feature_to_c.sh */
 extern const char *const xml_builtin[][2];
 
+/**
+ * gdb_is_connected: Check whether gdb is currently connected.
+ * This function is used to determine if gdb is currently connected to qemu.
+ * It is used by the WHPX engine to enable interception of debug-related
+ * exceptions, when debugging with gdb, and pass them to the guest otherwise.
+ */
+bool gdb_is_connected(void);
+
 #endif
diff --git a/target/i386/whpx/whpx-all.c b/target/i386/whpx/whpx-all.c
index 8a8b5d55d1..030988fa51 100644
--- a/target/i386/whpx/whpx-all.c
+++ b/target/i386/whpx/whpx-all.c
@@ -12,6 +12,7 @@
 #include "cpu.h"
 #include "exec/address-spaces.h"
 #include "exec/ioport.h"
+#include "exec/gdbstub.h"
 #include "qemu-common.h"
 #include "qemu/accel.h"
 #include "sysemu/whpx.h"
@@ -148,6 +149,12 @@ struct whpx_register_set {
     WHV_REGISTER_VALUE values[RTL_NUMBER_OF(whpx_register_names)];
 };
 
+enum whpx_step_mode {
+    whpx_step_none = 0,
+    /* Halt other VCPUs */
+    whpx_step_exclusive,
+};
+
 struct whpx_vcpu {
     WHV_EMULATOR_HANDLE emulator;
     bool window_registered;
@@ -156,7 +163,6 @@ struct whpx_vcpu {
     uint64_t tpr;
     uint64_t apic_base;
     bool interruption_pending;
-    /* Must be the last field as it may have a tail */
     WHV_RUN_VP_EXIT_CONTEXT exit_ctx;
 };
 
@@ -793,6 +799,515 @@ static int whpx_handle_portio(CPUState *cpu,
     return 0;
 }
 
+/*
+ * Controls whether we should intercept various exceptions on the guest,
+ * namely breakpoint/single-step events.
+ *
+ * The 'exceptions' argument accepts a bitmask, e.g:
+ * (1 << WHvX64ExceptionTypeDebugTrapOrFault) | (...)
+ */
+static HRESULT whpx_set_exception_exit_bitmap(UINT64 exceptions)
+{
+    struct whpx_state *whpx = &whpx_global;
+    WHV_PARTITION_PROPERTY prop = { 0, };
+    HRESULT hr;
+
+    if (exceptions == whpx->exception_exit_bitmap) {
+        return S_OK;
+    }
+
+    prop.ExceptionExitBitmap = exceptions;
+
+    hr = whp_dispatch.WHvSetPartitionProperty(
+        whpx->partition,
+        WHvPartitionPropertyCodeExceptionExitBitmap,
+        &prop,
+        sizeof(WHV_PARTITION_PROPERTY));
+
+    if (SUCCEEDED(hr)) {
+        whpx->exception_exit_bitmap = exceptions;
+    }
+
+    return hr;
+}
+
+
+/*
+ * This function is called before/after stepping over a single instruction.
+ * It will update the CPU registers to arm/disarm the instruction stepping
+ * accordingly.
+ */
+static HRESULT whpx_vcpu_configure_single_stepping(CPUState *cpu,
+    bool set,
+    uint64_t *exit_context_rflags)
+{
+    WHV_REGISTER_NAME reg_name;
+    WHV_REGISTER_VALUE reg_value;
+    HRESULT hr;
+    struct whpx_state *whpx = &whpx_global;
+
+    /*
+     * If we are trying to step over a single instruction, we need to set the
+     * TF bit in rflags. Otherwise, clear it.
+     */
+    reg_name = WHvX64RegisterRflags;
+    hr = whp_dispatch.WHvGetVirtualProcessorRegisters(
+        whpx->partition,
+        cpu->cpu_index,
+        &reg_name,
+        1,
+        &reg_value);
+
+    if (FAILED(hr)) {
+        error_report("WHPX: Failed to get rflags, hr=%08lx", hr);
+        return hr;
+    }
+
+    if (exit_context_rflags) {
+        assert(*exit_context_rflags == reg_value.Reg64);
+    }
+
+    if (set) {
+        /* Raise WHvX64ExceptionTypeDebugTrapOrFault after each instruction */
+        reg_value.Reg64 |= TF_MASK;
+    } else {
+        reg_value.Reg64 &= ~TF_MASK;
+    }
+
+    if (exit_context_rflags) {
+        *exit_context
[PATCH 2/3] whpx: Fixed incorrect CR8/TPR synchronization
This fixes the following error triggered when stopping and resuming a
64-bit Linux kernel via gdb:

qemu-system-x86_64.exe: WHPX: Failed to set virtual processor context, hr=c0350005

The previous logic for synchronizing the values did not take into account
that the lower 4 bits of the CR8 register, containing the priority level,
map to bits 7:4 of the APIC.TPR register (see section 10.8.6.1 of Volume 3
of the Intel 64 and IA-32 Architectures Software Developer's Manual).

This caused WHvSetVirtualProcessorRegisters() to fail with an error,
effectively preventing GDB from changing the guest context.

Signed-off-by: Ivan Shcherbakov
---
 target/i386/whpx/whpx-all.c | 49 +++++--
 1 file changed, 41 insertions(+), 8 deletions(-)

diff --git a/target/i386/whpx/whpx-all.c b/target/i386/whpx/whpx-all.c
index edd4fafbdf..8a8b5d55d1 100644
--- a/target/i386/whpx/whpx-all.c
+++ b/target/i386/whpx/whpx-all.c
@@ -256,6 +256,28 @@ static int whpx_set_tsc(CPUState *cpu)
     return 0;
 }
 
+/*
+ * The CR8 register in the CPU is mapped to the TPR register of the APIC,
+ * however, they use a slightly different encoding. Specifically:
+ *
+ * APIC.TPR[bits 7:4] = CR8[bits 3:0]
+ *
+ * This mechanism is described in section 10.8.6.1 of Volume 3 of Intel 64
+ * and IA-32 Architectures Software Developer's Manual.
+ *
+ * The functions below translate the value of CR8 to TPR and vice versa.
+ */
+
+static uint64_t whpx_apic_tpr_to_cr8(uint64_t tpr)
+{
+    return tpr >> 4;
+}
+
+static uint64_t whpx_cr8_to_apic_tpr(uint64_t cr8)
+{
+    return cr8 << 4;
+}
+
 static void whpx_set_registers(CPUState *cpu, int level)
 {
     struct whpx_state *whpx = &whpx_global;
@@ -284,7 +306,7 @@ static void whpx_set_registers(CPUState *cpu, int level)
     v86 = (env->eflags & VM_MASK);
     r86 = !(env->cr[0] & CR0_PE_MASK);
 
-    vcpu->tpr = cpu_get_apic_tpr(x86_cpu->apic_state);
+    vcpu->tpr = whpx_apic_tpr_to_cr8(cpu_get_apic_tpr(x86_cpu->apic_state));
     vcpu->apic_base = cpu_get_apic_base(x86_cpu->apic_state);
 
     idx = 0;
@@ -475,6 +497,17 @@ static void whpx_get_registers(CPUState *cpu)
                      hr);
     }
 
+    if (whpx_apic_in_platform()) {
+        /*
+         * Fetch the TPR value from the emulated APIC. It may get overwritten
+         * below with the value from CR8 returned by
+         * WHvGetVirtualProcessorRegisters().
+         */
+        whpx_apic_get(x86_cpu->apic_state);
+        vcpu->tpr = whpx_apic_tpr_to_cr8(
+            cpu_get_apic_tpr(x86_cpu->apic_state));
+    }
+
     idx = 0;
 
     /* Indexes for first 16 registers match between HV and QEMU definitions */
@@ -521,8 +554,12 @@ static void whpx_get_registers(CPUState *cpu)
     assert(whpx_register_names[idx] == WHvX64RegisterCr8);
     tpr = vcxt.values[idx++].Reg64;
     if (tpr != vcpu->tpr) {
+        /*
+         * TPR value stored in the CR8 register doesn't match the one fetched
+         * from the emulated APIC. Override the latter with the former.
+         */
         vcpu->tpr = tpr;
-        cpu_set_apic_tpr(x86_cpu->apic_state, tpr);
+        cpu_set_apic_tpr(x86_cpu->apic_state, whpx_cr8_to_apic_tpr(tpr));
     }
 
     /* 8 Debug Registers - Skipped */
@@ -600,10 +637,6 @@ static void whpx_get_registers(CPUState *cpu)
 
     assert(idx == RTL_NUMBER_OF(whpx_register_names));
 
-    if (whpx_apic_in_platform()) {
-        whpx_apic_get(x86_cpu->apic_state);
-    }
-
     x86_update_hflags(env);
 
     return;
@@ -865,7 +898,7 @@ static void whpx_vcpu_pre_run(CPUState *cpu)
     }
 
     /* Sync the TPR to the CR8 if was modified during the intercept */
-    tpr = cpu_get_apic_tpr(x86_cpu->apic_state);
+    tpr = whpx_apic_tpr_to_cr8(cpu_get_apic_tpr(x86_cpu->apic_state));
     if (tpr != vcpu->tpr) {
         vcpu->tpr = tpr;
         reg_values[reg_count].Reg64 = tpr;
@@ -914,7 +947,7 @@ static void whpx_vcpu_post_run(CPUState *cpu)
     if (vcpu->tpr != tpr) {
         vcpu->tpr = tpr;
         qemu_mutex_lock_iothread();
-        cpu_set_apic_tpr(x86_cpu->apic_state, vcpu->tpr);
+        cpu_set_apic_tpr(x86_cpu->apic_state, whpx_cr8_to_apic_tpr(vcpu->tpr));
         qemu_mutex_unlock_iothread();
     }
-- 
2.29.2.windows.3
[PATCH 1/3] whpx: Fixed reporting of the CPU context to GDB for 64-bit
Hi All,

We have been looking into kernel-debugging Linux VMs running on Windows
with Hyper-V enabled (which forces the virtualization software to use
WHPX), and it turned out that none of the major virtualization tools
supports it properly. I've added the missing parts to QEMU and it looks
pretty solid: setting breakpoints in the kernel, running, and stepping
in/over work reliably and fast.

The changes involve 3 parts:
1. Fixing the x64 register reporting to gdb (this patch)
2. Fixing the synchronization of CR8 <=> APIC.TPR, which was preventing
   WHvSetVirtualProcessorRegisters() from working
3. Implementing software breakpoints

It would be great if the changes could be integrated into the QEMU
repository, allowing other Windows users to debug their VMs. Below is the
description of the first patch.

This change makes sure that stopping in 64-bit mode will set the
HF_CS64_MASK flag in env->hflags (see x86_update_hflags() in
target/i386/cpu.c). Without it, the code in gdbstub.c would only use the
32-bit register values when debugging 64-bit targets, making debugging
effectively impossible.

Signed-off-by: Ivan Shcherbakov
---
 target/i386/whpx/whpx-all.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/target/i386/whpx/whpx-all.c b/target/i386/whpx/whpx-all.c
index ef896da0a2..edd4fafbdf 100644
--- a/target/i386/whpx/whpx-all.c
+++ b/target/i386/whpx/whpx-all.c
@@ -604,6 +604,8 @@ static void whpx_get_registers(CPUState *cpu)
         whpx_apic_get(x86_cpu->apic_state);
     }
 
+    x86_update_hflags(env);
+
     return;
 }
 
--