Hi,
The code destroys each vcpu held by the domain in turn:
complete_domain_destroy->hvm_vcpu_destroy() for each vcpu.
vmx_vcpu_destroy calls vmx_vcpu_disable_pml, which leads
to a race when foreign vcpus are destroyed while other domains are
rebooting on the same vcpus.
With a single domain reboot, no race is observed.
Commit e18d4274772e52ac81fda1acb246d11ef666e5fe causes this race condition.
Anshul
On 07/02/17 17:58, anshul makkar wrote:
Hi, sorry, I forgot to include you in the cc list.
Anshul
On 07/02/17 17:26, anshul makkar wrote:
Hi,
I am facing an issue where a bootstorm of guests leads to a host crash. I
debugged it and found that enabling PML introduces a race
condition during the guest teardown stage, between disabling PML on one
vcpu and a context switch happening for another vcpu.
The crash happens only on Broadwell processors, as PML was introduced in
this generation.
Here is my analysis:
Race condition:
vmcs.c: vmx_vcpu_disable_pml(vcpu) { vmx_vmcs_enter(); vm_write(
disable_PML); vmx_vmcs_exit(); }
In between, a call path on another pcpu executes a context
switch -> vmx_fpu_leave() and crashes on the vmwrite:
    if ( !(v->arch.hvm_vmx.host_cr0 & X86_CR0_TS) )
    {
        v->arch.hvm_vmx.host_cr0 |= X86_CR0_TS;
        __vmwrite(HOST_CR0, v->arch.hvm_vmx.host_cr0); /* crash */
    }
Debug logs
(XEN) [221256.749928] VMWRITE VMCS Invalid !!!!!
(XEN) [221256.754870] **[00] { now 0000c93b4341df1d, hw
00000035fffea000, op 00000035fffea000 } vmclear
(XEN) [221256.765052] ** frames [ ffff82d080134652
smp_call_function_interrupt+0x92/0xa0 ]
(XEN) [221256.773969] **[01] { now 0000c93b4341e099, hw
ffffffffffffffff, op 00000035fffea000 } vmptrld
(XEN) [221256.784150] ** frames [ ffff82d0801f0765
vmx_vmcs_try_enter+0x95/0xb0 ]
(XEN) [221256.792197] **[02] { now 0000c93b4341e1f1, hw
00000035fffea000, op 00000035fffea000 } vmclear
(XEN) [221256.802378] ** frames [ ffff82d080134652
smp_call_function_interrupt+0x92/0xa0 ]
(XEN) [221256.811298] **[03] { now 0000c93b5784dd0a, hw
ffffffffffffffff, op 00000039d7074000 } vmptrld
(XEN) [221256.821478] ** frames [ ffff82d0801f4c31
vmx_do_resume+0x51/0x150 ]
(XEN) [221256.829139] **[04] { now 0000c93b59d67b5b, hw
00000039d7074000, op 0000002b9a575000 } vmptrld
(XEN) [221256.839320] ** frames [ ffff82d0801f4c31
vmx_do_resume+0x51/0x150 ]
(XEN) [221256.882850] **[07] { now 0000c93b59e71e48, hw
0000002b9a575000, op 00000039d7074000 } vmptrld
(XEN) [221256.893034] ** frames [ ffff82d0801f4d13
vmx_do_resume+0x133/0x150 ]
(XEN) [221256.900790] **[08] { now 0000c93b59e78675, hw
00000039d7074000, op 00000040077ae000 } vmptrld
(XEN) [221256.910968] ** frames [ ffff82d0801f0765
vmx_vmcs_try_enter+0x95/0xb0 ]
(XEN) [221256.919015] **[09] { now 0000c93b59e78ac8, hw
00000040077ae000, op 00000040077ae000 } vmclear
(XEN) [221256.929196] ** frames [ ffff82d080134652
smp_call_function_interrupt+0x92/0xa0 ]
(XEN) [221256.938117] **[10] { now 0000c93b59e78d72, hw
ffffffffffffffff, op 00000040077ae000 } vmptrld
(XEN) [221256.948297] ** frames [ ffff82d0801f0765
vmx_vmcs_try_enter+0x95/0xb0 ]
(XEN) [221256.956345] **[11] { now 0000c93b59e78ff0, hw
00000040077ae000, op 00000040077ae000 } vmclear
(XEN) [221256.966525] ** frames [ ffff82d080134652
smp_call_function_interrupt+0x92/0xa0 ]
(XEN) [221256.975445] **[12] { now 0000c93b59e7deda, hw
ffffffffffffffff, op 00000040077b3000 } vmptrld
(XEN) [221256.985626] ** frames [ ffff82d0801f0765
vmx_vmcs_try_enter+0x95/0xb0 ]
(XEN) [221256.993672] **[13] { now 0000c93b59e9fe00, hw
00000040077b3000, op 00000040077b3000 } vmclear
(XEN) [221257.003852] ** frames [ ffff82d080134652
smp_call_function_interrupt+0x92/0xa0 ]
(XEN) [221257.012772] **[14] { now 0000c93b59ea007e, hw
ffffffffffffffff, op 00000040077b3000 } vmptrld
(XEN) [221257.022952] ** frames [ ffff82d0801f0765
vmx_vmcs_try_enter+0x95/0xb0 ]
(XEN) [221257.031000] **[15] { now 0000c93b59ea02ba, hw
00000040077b3000, op 00000040077b3000 } vmclear
(XEN) [221257.041180] ** frames [ ffff82d080134652
smp_call_function_interrupt+0x92/0xa0 ]
(XEN) [221257.050101] ....
(XEN) [221257.053008] vmcs_ptr:0xffffffffffffffff,
vcpu->vmcs:0x2b9a575000
The vmcs is loaded, and before the next call to vm_write the vmcs is
cleared as a result of vmx_vcpu_disable_pml.
The log above shows that an IPI clears the vmcs between the vmptrld
and the vmwrite, but I also verified that interrupts are disabled during
the context switch and during the vm_write in vmx_fpu_leave. This has
got me confused.
Also, I am not sure I understand the handling of foreign_vmcs
correctly, which could also be the cause of the race.
Please share any suggestions you have.
Thanks
Anshul Makkar
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel