> For the last couple years we have received numerous reports from users of
> monitor vm_events of spurious guest crashes when using events. In particular,
> it has observed that the problem occurs when vm_events are being disabled. The
> nature of the guest crash varied widely and has only occured occasionally. 
> This
> made debugging the issue particularly hard. We had discussions about this 
> issue
> even here on the xen-devel mailinglist with no luck figuring it out.
> 
> The bug has now been identified as a race-condition between register event
> handling and disabling the vm_event interface.
> 
> Patch 96760e2fba100d694300a81baddb5740e0f8c0ee, "vm_event: deny register 
> writes
> if refused by  vm_event reply" is the patch that introduced the error. In this
> patch emulation of register write events can be postponed until the
> corresponding vm_event handler decides whether to allow such write to take
> place. Unfortunately this can only be implemented by performing the deny/allow
> step when the vCPU gets scheduled. Due to that postponed emulation of the 
> event
> if the user decides to pause the VM in the vm_event handler and then disable
> events, the entire emulation step is skipped the next time the vCPU is 
> resumed.
> Even if the user doesn't pause during the vm_event handling but exits
> immediately and disables vm_event, the situation becomes racey as disabling
> vm_event may succeed before the guest's vCPUs get scheduled with the pending
> emulation task. This has been particularly the case with VMs that have several
> vCPUs as after the VM is unpaused it may actually take a long time before all
?vCPUs get scheduled.
> 
> The only solution currently is to poll each vCPU before vm_events are disabled
> to verify they had been scheduled. The following patches resolve this issue in
> a much nicer way.
> 
> Patch 1 adds an option to the monitor_op domctl that needs to be specified if
>     the user wants to actually use the postponed register-write handling
>     mechanism. If that option is not specified then handling is performed the
>     same way as before patch 96760e2fba100d694300a81baddb5740e0f8c0ee.
>     
> Patch 2 performs sanity checking when disabling vm_events to determine whether
>     its safe to free all vm_event structures. The vCPUs still get unpaused to
>     allow them to get scheduled and perform any of their pending operations,
>     but otherwise an -EAGAIN error is returned signaling to the user that they
>     need to wait and try again disabling the interface.
>     
> Patch 3 adds a vm_event specifically to signal to the user when it is safe to
>     continue disabling the interface.
>     
> Shout out to our friends at CERT.pl for stumbling upon a crucial piece of
> information that lead to finally squashing this nasty bug.
> 
> Tamas K Lengyel (3):
>   xen/monitor: Control register values
>   xen/vm_event: add vm_event_check_pending_op
>   xen/vm_event: Add safe to disable vm_event
> 
>  xen/arch/x86/hvm/hvm.c            | 69 +++++++++++++++++++++++--------
>  xen/arch/x86/hvm/monitor.c        | 14 +++++++
>  xen/arch/x86/monitor.c            | 23 ++++++++++-
>  xen/arch/x86/vm_event.c           | 23 +++++++++++
>  xen/common/vm_event.c             | 17 ++++++--
>  xen/include/asm-arm/vm_event.h    |  7 ++++
>  xen/include/asm-x86/domain.h      |  2 +
>  xen/include/asm-x86/hvm/monitor.h |  1 +
>  xen/include/asm-x86/vm_event.h    |  2 +
>  xen/include/public/domctl.h       |  3 ++
>  xen/include/public/vm_event.h     |  8 ++++
>  11 files changed, 147 insertions(+), 22 deletions(-)
> 
> -- 
> 2.26.1

Hi,

I have reproduced the mentioned race-condition between register event handling 
and disabling the vm_event interface. With Xen 4.13.0 and Windows 7 x64 DomU (2 
VCPUs), my test program causes random BSODs on DomU once vm_event interface is 
disabled. I can confirm that after applying Patch 1 to Xen 4.13.0 the test 
program doesn't crash the DomU anymore, so it would actually resolve the 
mentioned bug.

Best regards,
Michał Leszczyński
CERT Poland

Reply via email to