Re: [Qemu-devel] [PATCH v2 0/3] AioContext: ctx->dispatching is dead, all hail ctx->notify_me

2015-07-20 Thread Stefan Hajnoczi
On Sat, Jul 18, 2015 at 10:21:28PM +0200, Paolo Bonzini wrote:
 The change from v2 is very small: notify_me is decremented as soon
 as ppoll returns.
 
 Paolo
 
 v1->v2
 Split some changes to the tests to a separate patch
 Fix commit message [Laszlo]
 Clarify do...while loop in aio-win32.c [Kevin]
 
 v2->v3
 Decrement notify_me a little earlier in aio-posix.c.
 
 Paolo Bonzini (3):
   tests: remove irrelevant assertions from test-aio
   aio-win32: reorganize polling loop
    AioContext: fix broken ctx->dispatching optimization
 
  aio-posix.c | 18 +---
  aio-win32.c | 41 ++
  async.c | 21 +-
 docs/aio_notify.promela | 77 +
  include/block/aio.h | 29 +++
  tests/test-aio.c| 26 +++--
  6 files changed, 97 insertions(+), 115 deletions(-)

Thanks, applied to my block tree:
https://github.com/stefanha/qemu/commits/block

Stefan




Re: [Qemu-devel] [PATCH v2 0/3] AioContext: ctx->dispatching is dead, all hail ctx->notify_me

2015-07-19 Thread Richard W.M. Jones

I tested this patch series, plus 'AioContext: fix missing wakeups due
to event_notifier_test_and_clear' overnight, and there were no
failures after something like 2500 iterations.  So:

  Tested-by: Richard W.M. Jones rjo...@redhat.com

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/



[Qemu-devel] [PATCH v2 0/3] AioContext: ctx->dispatching is dead, all hail ctx->notify_me

2015-07-18 Thread Paolo Bonzini
The change from v2 is very small: notify_me is decremented as soon
as ppoll returns.

Paolo

v1->v2
Split some changes to the tests to a separate patch
Fix commit message [Laszlo]
Clarify do...while loop in aio-win32.c [Kevin]

v2->v3
Decrement notify_me a little earlier in aio-posix.c.

Paolo Bonzini (3):
  tests: remove irrelevant assertions from test-aio
  aio-win32: reorganize polling loop
  AioContext: fix broken ctx->dispatching optimization

 aio-posix.c | 18 +---
 aio-win32.c | 41 ++
 async.c | 21 +-
 docs/aio_notify.promela | 77 +
 include/block/aio.h | 29 +++
 tests/test-aio.c| 26 +++--
 6 files changed, 97 insertions(+), 115 deletions(-)

-- 
2.4.3
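
For readers coming to the series cold, the notify_me handshake that the cover
letter refers to can be sketched in a few lines of standalone C.  This is a
simplified illustration only (a pipe stands in for the EventNotifier, C11
seq_cst atomics stand in for QEMU's explicit barriers, and the names are made
up); it is not the QEMU code itself:

    /* ctx_notify() plays the role of aio_notify(), ctx_poll() of aio_poll(). */
    #include <poll.h>
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <unistd.h>

    typedef struct Ctx {
        atomic_int notify_me;    /* incremented by 2 around every blocking poll */
        atomic_bool scheduled;   /* stand-in for bh->scheduled */
        int wake_fd[2];          /* stand-in for ctx->notifier */
    } Ctx;

    static void ctx_notify(Ctx *ctx)
    {
        atomic_store(&ctx->scheduled, true);        /* publish the work first... */
        if (atomic_load(&ctx->notify_me)) {         /* ...then check for a poller */
            (void)write(ctx->wake_fd[1], "x", 1);   /* event_notifier_set() */
        }
    }

    static bool ctx_poll(Ctx *ctx, bool blocking)
    {
        if (blocking) {
            atomic_fetch_add(&ctx->notify_me, 2);   /* "I am about to block" */
        }
        struct pollfd pfd = { .fd = ctx->wake_fd[0], .events = POLLIN };
        int ret = poll(&pfd, 1, blocking ? -1 : 0);
        if (blocking) {
            atomic_fetch_sub(&ctx->notify_me, 2);   /* v3: drop it right after the poll */
        }
        if (ret > 0 && (pfd.revents & POLLIN)) {
            char buf[16];
            (void)read(ctx->wake_fd[0], buf, sizeof(buf));  /* test_and_clear */
        }
        return atomic_exchange(&ctx->scheduled, false);     /* "dispatch" */
    }

    int main(void)
    {
        Ctx ctx = { 0 };
        if (pipe(ctx.wake_fd) != 0) {
            return 2;
        }
        ctx_notify(&ctx);                    /* a worker schedules a BH */
        return ctx_poll(&ctx, true) ? 0 : 1;
    }

The point of the counter is that a notifier only pays for the wakeup write
while some thread has actually advertised that it is (or is about to be)
blocked in poll; the patches in this series are about keeping that counter
and the notifier state consistent.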




Re: [Qemu-devel] [PATCH v2 0/3] AioContext: ctx->dispatching is dead, all hail ctx->notify_me

2015-07-17 Thread Paolo Bonzini


On 17/07/2015 15:28, Marc Zyngier wrote:
  
  Marc, does it ring any bell?
 Well, this is an example of a guest accessing non-memory using an
 instruction that we cannot safely emulate - not an IO accessor (load
 multiple, for example).
 
 In this case, we kill the guest (we could as well inject a fault).

Yup, I later found this in the dmesg:

[1875219.903969] kvm [14843]: load/store instruction decoding not implemented

 This vcpu seems to be accessing 0x918 (the mmio structure points
 there), but I can't immediately relate it to the content of the
 registers.

0x918 is the pl011, which makes some sense.

Have you ever seen a corrupted register dump?  The guest RAM goes
from 0x4000 to 0xbfff, so SP is pointing outside memory.

  PC=bf671210  SP=c1f0
  X00=0a003e70 X01= X02=bf680548 
  X03=0030
  X04=bbb5fc18 X05=004b7770 X06=bf721930 
  X07=009a
  X08=bf716858 X09=0090 X10= 
  X11=0046
  X12=a007e03a X13= X14= 
  X15=
  X16=bf716df0 X17= X18= 
  X19=bf6f5f18
  X20= X21= X22= 
  X23=
  X24= X25= X26= 
  X27=
  X28= X29= X30= 
  PSTATE=6305 (flags -ZC-)

If the register dump is not corrupted, execution went in the weeds...
I don't have the guest anymore but it's just firmware so the memory
contents are well reproducible:

0xbf671200:  a97d6ffa  ldmdbge  sp!, {r1, r3, r4, r5, r6, r7, r8, r9, sl, fp, sp, lr}^
0xbf671204:  a97e77fc  ldmdbge  lr!, {r2, r3, r4, r5, r6, r7, r8, r9, sl, ip, sp, lr}^
0xbf671208:  f85f03fe  undefined instruction 0xf85f03fe
0xbf67120c:  910803ff  strdls   r0, [r8, -pc]
0xbf671210:  ad7007e0  ldclge   7, cr0, [r0, #-896]!
0xbf671214:  ad710fe2  ldclge   15, cr0, [r1, #-904]!
0xbf671218:  ad7217e4  ldclge   7, cr1, [r2, #-912]!
0xbf67121c:  ad731fe6  ldclge   15, cr1, [r3, #-920]!
0xbf671220:  ad7427e8  ldclge   7, cr2, [r4, #-928]!
0xbf671224:  ad752fea  ldclge   15, cr2, [r5, #-936]!

 What looks a bit weird is that all the registers are clamped to 32bit,
 but the PSTATE indicates it is running in 64bit (EL1h, which makes the
 PC utterly wrong).

The PC can be okay since UEFI code doesn't really use virtual addressing,
but the other registers are weird indeed.

 What are you running there?

Just firmware.  It's a UEFI reboot loop (as soon as you get to the
UEFI shell QEMU exits and the script starts a new one).

The kernel is an old 3.19 one, I'll upgrade and retest.

Paolo



Re: [Qemu-devel] [PATCH v2 0/3] AioContext: ctx->dispatching is dead, all hail ctx->notify_me

2015-07-17 Thread Marc Zyngier
On Fri, 17 Jul 2015 15:04:27 +0100
Paolo Bonzini pbonz...@redhat.com wrote:

 
 
 On 17/07/2015 15:28, Marc Zyngier wrote:
   
   Marc, does it ring any bell?
  Well, this is an example of a guest accessing non-memory using an
  instruction that we cannot safely emulate - not an IO accessor (load
  multiple, for example).
  
  In this case, we kill the guest (we could as well inject a fault).
 
 Yup, I later found this in the dmesg:
 
 [1875219.903969] kvm [14843]: load/store instruction decoding not implemented
 
  This vcpu seems to be accessing 0x918 (the mmio structure points
  there), but I can't immediately relate it to the content of the
  registers.
 
 0x918 is the pl011, which makes some sense.
 
 Have you ever seen a corrupted register dump?  The guest RAM goes
 from 0x4000 to 0xbfff, so SP is pointing outside memory.

I've never seen such a corruption - so far.

   PC=bf671210  SP=c1f0
   X00=0a003e70 X01= X02=bf680548 
   X03=0030
   X04=bbb5fc18 X05=004b7770 X06=bf721930 
   X07=009a
   X08=bf716858 X09=0090 X10= 
   X11=0046
   X12=a007e03a X13= X14= 
   X15=
   X16=bf716df0 X17= X18= 
   X19=bf6f5f18
   X20= X21= X22= 
   X23=
   X24= X25= X26= 
   X27=
   X28= X29= X30= 
   PSTATE=6305 (flags -ZC-)
 
 If the register dump is not corrupted, execution went in the weeds...
 I don't have the guest anymore but it's just firmware so the memory
 contents are well reproducible:
 
 0xbf671200:  a97d6ffa  ldmdbge  sp!, {r1, r3, r4, r5, r6, r7, r8, r9, sl, fp, sp, lr}^
 0xbf671204:  a97e77fc  ldmdbge  lr!, {r2, r3, r4, r5, r6, r7, r8, r9, sl, ip, sp, lr}^
 0xbf671208:  f85f03fe  undefined instruction 0xf85f03fe
 0xbf67120c:  910803ff  strdls r0, [r8, -pc]
 0xbf671210:  ad7007e0  ldclge 7, cr0, [r0, #-896]!
 0xbf671214:  ad710fe2  ldclge 15, cr0, [r1, #-904]!
 0xbf671218:  ad7217e4  ldclge 7, cr1, [r2, #-912]!
 0xbf67121c:  ad731fe6  ldclge 15, cr1, [r3, #-920]!
 0xbf671220:  ad7427e8  ldclge 7, cr2, [r4, #-928]!
 0xbf671224:  ad752fea  ldclge 15, cr2, [r5, #-936]!

But that's all 32bit code, and your guest was running in 64bit. What
does it look like as A64?
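One quick way to check is to re-disassemble the same words as A64.  Purely as a
hypothetical illustration (the Capstone library is an arbitrary tool choice, not
something used in the thread; the words are transcribed from the dump above):

    #include <capstone/capstone.h>
    #include <inttypes.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* First few instruction words from the 0xbf671200 dump. */
        uint32_t words[] = { 0xa97d6ffa, 0xa97e77fc, 0xf85f03fe,
                             0x910803ff, 0xad7007e0, 0xad710fe2 };
        uint8_t code[sizeof(words)];
        memcpy(code, words, sizeof(words));   /* assumes a little-endian host */

        csh handle;
        cs_insn *insn;
        if (cs_open(CS_ARCH_ARM64, CS_MODE_LITTLE_ENDIAN, &handle) != CS_ERR_OK) {
            return 1;
        }
        size_t n = cs_disasm(handle, code, sizeof(code), 0xbf671200, 0, &insn);
        for (size_t i = 0; i < n; i++) {
            printf("0x%" PRIx64 ":\t%s\t%s\n",
                   insn[i].address, insn[i].mnemonic, insn[i].op_str);
        }
        if (n > 0) {
            cs_free(insn, n);
        }
        cs_close(&handle);
        return 0;
    }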

  What looks a bit weird is that all the registers are clamped to 32bit,
  but the PSTATE indicates it is running in 64bit (EL1h, which makes the
  PC utterly wrong).
 
 The PC can be okay since UEFI code doesn't really use virtual addressing,
 but the other registers are weird indeed.

It definitely looks like something trampled on your register file. KVM
doesn't do that at all (we use the whole AArch64 register file
irrespective of the execution state).

Is your UEFI guest 32 or 64bit?

M.
-- 
Jazz is not dead. It just smells funny.



Re: [Qemu-devel] [PATCH v2 0/3] AioContext: ctx->dispatching is dead, all hail ctx->notify_me

2015-07-17 Thread Laszlo Ersek
On 07/17/15 15:28, Marc Zyngier wrote:
 On Fri, 17 Jul 2015 10:30:38 +0100
 Paolo Bonzini pbonz...@redhat.com wrote:
 


 On 17/07/2015 06:44, Paolo Bonzini wrote:


 On 16/07/2015 21:05, Richard W.M. Jones wrote:

 Sorry to spoil things, but I'm still seeing this bug, although it is
 now a lot less frequent with your patch.  I would estimate it happens
 more often than 1 in 5 runs with qemu.git, and probably 1 in 200 runs
 with qemu.git + the v2 patch series.

 It's the exact same hang in both cases.

 Is it possible that this patch doesn't completely close any race?

 Still, it is an improvement, so there is that.

 Would seem at first glance like a different bug.

 Interestingly, adding some tracing (qemu_clock_get_ns) makes the bug
 more likely: now it reproduces in about 10 tries.  Of course :) adding
 other kinds of tracing instead make it go away again (50 tries).

 Perhaps this:

i/o thread vcpu thread   worker thread
-
lock_iothread
notify_me = 1
...
unlock_iothread
   lock_iothread
   notify_me = 3
   ppoll
   notify_me = 1
  bh->scheduled = 1
  event_notifier_set
   event_notifier_test_and_clear
ppoll
 ^^ hang
 
 In the exact shape above, it doesn't seem too likely to happen, but
 perhaps there's another simpler case.  Still, the bug exists.

 The above is not really related to notify_me.  Here the notification is
 not being optimized away!  So I wonder if this one has been there forever.

 Fam suggested putting the event_notifier_test_and_clear before
 aio_bh_poll(), but it does not work.  I'll look more close

 However, an unconditional event_notifier_test_and_clear is pretty
 expensive.  On one hand, obviously correctness comes first.  On the
 other hand, an expensive operation at the wrong place can mask the race
 very easily; I'll let the fix run for a while, but I'm not sure if a
 successful test really says anything useful.

 So it may not be useful, but still successful test is successful. :)
 The following patch, which also includes the delta between v2 and v3
 of this series, survived 674 reboots before hitting a definitely
 unrelated problem:

 error: kvm run failed Function not implemented
 PC=bf671210  SP=c1f0
 X00=0a003e70 X01= X02=bf680548 
 X03=0030
 X04=bbb5fc18 X05=004b7770 X06=bf721930 
 X07=009a
 X08=bf716858 X09=0090 X10= 
 X11=0046
 X12=a007e03a X13= X14= 
 X15=
 X16=bf716df0 X17= X18= 
 X19=bf6f5f18
 X20= X21= X22= 
 X23=
 X24= X25= X26= 
 X27=
 X28= X29= X30= 
 PSTATE=6305 (flags -ZC-)

 For the record, this is the kvm_run struct:

 $6 = {request_interrupt_window = 0 '\000', padding1 = 
 \000\000\000\000\000\000, exit_reason = 0, 
   ready_for_interrupt_injection = 0 '\000', if_flag = 0 '\000', flags = 0, 
 cr8 = 0, apic_base = 0, {hw = {
   hardware_exit_reason = 150994968}, fail_entry = 
 {hardware_entry_failure_reason = 150994968}, ex = {
   exception = 150994968, error_code = 0}, io = {direction = 24 '\030', 
 size = 0 '\000', port = 2304, 
   count = 0, data_offset = 144}, debug = {arch = {No data fields}}, 
 mmio = {phys_addr = 150994968, 
   data = \220\000\000\000\000\000\000, len = 4, is_write = 0 '\000'}, 
 hypercall = {nr = 150994968, 
   args = {144, 4, 0, 0, 0, 0}, ret = 0, longmode = 0, pad = 0}, 
 tpr_access = {rip = 150994968, 
   is_write = 144, pad = 0}, s390_sieic = {icptcode = 24 '\030', ipa = 
 2304, ipb = 0}, 
 s390_reset_flags = 150994968, s390_ucontrol = {trans_exc_code = 
 150994968, pgm_code = 144}, dcr = {
   dcrn = 150994968, data = 0, is_write = 144 '\220'}, internal = 
 {suberror = 150994968, ndata = 0, 
   data = {144, 4, 0 repeats 14 times}}, osi = {gprs = {150994968, 144, 
 4, 0 repeats 29 times}}, 
 papr_hcall = {nr = 150994968, ret = 144, args = {4, 0, 0, 0, 0, 0, 0, 0, 
 0}}, s390_tsch = {
   subchannel_id = 24, subchannel_nr = 2304, io_int_parm = 0, io_int_word 
 = 144, ipb = 0, 
   dequeued = 4 '\004'}, epr = {epr = 150994968}, system_event = {type = 
 150994968, flags = 144}, 
 s390_stsi = {addr = 150994968, ar = 144 '\220', reserved = 0 '\000', fc 
 = 0 '\000', sel1 = 0 '\000', 
   sel2 = 0}, 
 padding = 
 \030\000\000\t\000\000\000\000\220\000\000\000\000\000\000\000\004, '\000' 
 repeats 238 times}, kvm_valid_regs = 0, 

Re: [Qemu-devel] [PATCH v2 0/3] AioContext: ctx->dispatching is dead, all hail ctx->notify_me

2015-07-17 Thread Marc Zyngier
On Fri, 17 Jul 2015 14:39:55 +0100
Laszlo Ersek ler...@redhat.com wrote:

 On 07/17/15 15:28, Marc Zyngier wrote:
  On Fri, 17 Jul 2015 10:30:38 +0100
  Paolo Bonzini pbonz...@redhat.com wrote:
  
 
 
  On 17/07/2015 06:44, Paolo Bonzini wrote:
 
 
  On 16/07/2015 21:05, Richard W.M. Jones wrote:
 
  Sorry to spoil things, but I'm still seeing this bug, although it is
  now a lot less frequent with your patch.  I would estimate it happens
  more often than 1 in 5 runs with qemu.git, and probably 1 in 200 runs
  with qemu.git + the v2 patch series.
 
  It's the exact same hang in both cases.
 
  Is it possible that this patch doesn't completely close any race?
 
  Still, it is an improvement, so there is that.
 
  Would seem at first glance like a different bug.
 
  Interestingly, adding some tracing (qemu_clock_get_ns) makes the bug
  more likely: now it reproduces in about 10 tries.  Of course :) adding
  other kinds of tracing instead make it go away again (50 tries).
 
  Perhaps this:
 
 i/o thread vcpu thread   worker thread
 -
 lock_iothread
 notify_me = 1
 ...
 unlock_iothread
lock_iothread
notify_me = 3
ppoll
notify_me = 1
   bh->scheduled = 1
   event_notifier_set
event_notifier_test_and_clear
 ppoll
  ^^ hang

  In the exact shape above, it doesn't seem too likely to happen, but
  perhaps there's another simpler case.  Still, the bug exists.
 
  The above is not really related to notify_me.  Here the notification is
  not being optimized away!  So I wonder if this one has been there forever.
 
  Fam suggested putting the event_notifier_test_and_clear before
  aio_bh_poll(), but it does not work.  I'll look more close
 
  However, an unconditional event_notifier_test_and_clear is pretty
  expensive.  On one hand, obviously correctness comes first.  On the
  other hand, an expensive operation at the wrong place can mask the race
  very easily; I'll let the fix run for a while, but I'm not sure if a
  successful test really says anything useful.
 
  So it may not be useful, but still successful test is successful. :)
  The following patch, which also includes the delta between v2 and v3
  of this series, survived 674 reboots before hitting a definitely
  unrelated problem:
 
  error: kvm run failed Function not implemented
  PC=bf671210  SP=c1f0
  X00=0a003e70 X01= X02=bf680548 
  X03=0030
  X04=bbb5fc18 X05=004b7770 X06=bf721930 
  X07=009a
  X08=bf716858 X09=0090 X10= 
  X11=0046
  X12=a007e03a X13= X14= 
  X15=
  X16=bf716df0 X17= X18= 
  X19=bf6f5f18
  X20= X21= X22= 
  X23=
  X24= X25= X26= 
  X27=
  X28= X29= X30= 
  PSTATE=6305 (flags -ZC-)
 
  For the record, this is the kvm_run struct:
 
  $6 = {request_interrupt_window = 0 '\000', padding1 = 
  \000\000\000\000\000\000, exit_reason = 0, 
ready_for_interrupt_injection = 0 '\000', if_flag = 0 '\000', flags = 0, 
  cr8 = 0, apic_base = 0, {hw = {
hardware_exit_reason = 150994968}, fail_entry = 
  {hardware_entry_failure_reason = 150994968}, ex = {
exception = 150994968, error_code = 0}, io = {direction = 24 '\030', 
  size = 0 '\000', port = 2304, 
count = 0, data_offset = 144}, debug = {arch = {No data fields}}, 
  mmio = {phys_addr = 150994968, 
data = \220\000\000\000\000\000\000, len = 4, is_write = 0 
  '\000'}, hypercall = {nr = 150994968, 
args = {144, 4, 0, 0, 0, 0}, ret = 0, longmode = 0, pad = 0}, 
  tpr_access = {rip = 150994968, 
is_write = 144, pad = 0}, s390_sieic = {icptcode = 24 '\030', ipa = 
  2304, ipb = 0}, 
  s390_reset_flags = 150994968, s390_ucontrol = {trans_exc_code = 
  150994968, pgm_code = 144}, dcr = {
dcrn = 150994968, data = 0, is_write = 144 '\220'}, internal = 
  {suberror = 150994968, ndata = 0, 
data = {144, 4, 0 repeats 14 times}}, osi = {gprs = {150994968, 
  144, 4, 0 repeats 29 times}}, 
  papr_hcall = {nr = 150994968, ret = 144, args = {4, 0, 0, 0, 0, 0, 0, 
  0, 0}}, s390_tsch = {
subchannel_id = 24, subchannel_nr = 2304, io_int_parm = 0, 
  io_int_word = 144, ipb = 0, 
dequeued = 4 '\004'}, epr = {epr = 150994968}, system_event = {type 
  = 150994968, flags = 144}, 
  s390_stsi = {addr = 150994968, ar = 144 '\220', reserved = 0 '\000', 
  

Re: [Qemu-devel] [PATCH v2 0/3] AioContext: ctx->dispatching is dead, all hail ctx->notify_me

2015-07-17 Thread Richard W.M. Jones
On Fri, Jul 17, 2015 at 02:48:40PM +0100, Marc Zyngier wrote:
 Still: there is nothing in the registers that remotely points to that
 area. X0 is the closest, but it'd take a big negative offset to get
 there.
 
 Is that a Linux kernel? or something else?

You're sure it's not this one?
https://bugzilla.redhat.com/show_bug.cgi?id=1194366
That was caused by ftrace screwing up guest memory, so it was
effectively running random code.  It is also fixed (by you in fact).

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top



Re: [Qemu-devel] [PATCH v2 0/3] AioContext: ctx->dispatching is dead, all hail ctx->notify_me

2015-07-17 Thread Laszlo Ersek
On 07/17/15 15:48, Marc Zyngier wrote:
 On Fri, 17 Jul 2015 14:39:55 +0100
 Laszlo Ersek ler...@redhat.com wrote:
 
 On 07/17/15 15:28, Marc Zyngier wrote:
 On Fri, 17 Jul 2015 10:30:38 +0100
 Paolo Bonzini pbonz...@redhat.com wrote:



 On 17/07/2015 06:44, Paolo Bonzini wrote:


 On 16/07/2015 21:05, Richard W.M. Jones wrote:

 Sorry to spoil things, but I'm still seeing this bug, although it is
 now a lot less frequent with your patch.  I would estimate it happens
 more often than 1 in 5 runs with qemu.git, and probably 1 in 200 runs
 with qemu.git + the v2 patch series.

 It's the exact same hang in both cases.

 Is it possible that this patch doesn't completely close any race?

 Still, it is an improvement, so there is that.

 Would seem at first glance like a different bug.

 Interestingly, adding some tracing (qemu_clock_get_ns) makes the bug
 more likely: now it reproduces in about 10 tries.  Of course :) adding
 other kinds of tracing instead make it go away again (50 tries).

 Perhaps this:

i/o thread vcpu thread   worker thread
-
lock_iothread
notify_me = 1
...
unlock_iothread
   lock_iothread
   notify_me = 3
   ppoll
   notify_me = 1
  bh->scheduled = 1
  event_notifier_set
   event_notifier_test_and_clear
ppoll
 ^^ hang
   
 In the exact shape above, it doesn't seem too likely to happen, but
 perhaps there's another simpler case.  Still, the bug exists.

 The above is not really related to notify_me.  Here the notification is
 not being optimized away!  So I wonder if this one has been there forever.

 Fam suggested putting the event_notifier_test_and_clear before
 aio_bh_poll(), but it does not work.  I'll look more close

 However, an unconditional event_notifier_test_and_clear is pretty
 expensive.  On one hand, obviously correctness comes first.  On the
 other hand, an expensive operation at the wrong place can mask the race
 very easily; I'll let the fix run for a while, but I'm not sure if a
 successful test really says anything useful.

 So it may not be useful, but still successful test is successful. :)
 The following patch, which also includes the delta between v2 and v3
 of this series, survived 674 reboots before hitting a definitely
 unrelated problem:

 error: kvm run failed Function not implemented
 PC=bf671210  SP=c1f0
 X00=0a003e70 X01= X02=bf680548 
 X03=0030
 X04=bbb5fc18 X05=004b7770 X06=bf721930 
 X07=009a
 X08=bf716858 X09=0090 X10= 
 X11=0046
 X12=a007e03a X13= X14= 
 X15=
 X16=bf716df0 X17= X18= 
 X19=bf6f5f18
 X20= X21= X22= 
 X23=
 X24= X25= X26= 
 X27=
 X28= X29= X30= 
 PSTATE=6305 (flags -ZC-)

 For the record, this is the kvm_run struct:

 $6 = {request_interrupt_window = 0 '\000', padding1 = 
 \000\000\000\000\000\000, exit_reason = 0, 
   ready_for_interrupt_injection = 0 '\000', if_flag = 0 '\000', flags = 0, 
 cr8 = 0, apic_base = 0, {hw = {
   hardware_exit_reason = 150994968}, fail_entry = 
 {hardware_entry_failure_reason = 150994968}, ex = {
   exception = 150994968, error_code = 0}, io = {direction = 24 '\030', 
 size = 0 '\000', port = 2304, 
   count = 0, data_offset = 144}, debug = {arch = {No data fields}}, 
 mmio = {phys_addr = 150994968, 
   data = \220\000\000\000\000\000\000, len = 4, is_write = 0 
 '\000'}, hypercall = {nr = 150994968, 
   args = {144, 4, 0, 0, 0, 0}, ret = 0, longmode = 0, pad = 0}, 
 tpr_access = {rip = 150994968, 
   is_write = 144, pad = 0}, s390_sieic = {icptcode = 24 '\030', ipa = 
 2304, ipb = 0}, 
 s390_reset_flags = 150994968, s390_ucontrol = {trans_exc_code = 
 150994968, pgm_code = 144}, dcr = {
   dcrn = 150994968, data = 0, is_write = 144 '\220'}, internal = 
 {suberror = 150994968, ndata = 0, 
   data = {144, 4, 0 repeats 14 times}}, osi = {gprs = {150994968, 
 144, 4, 0 repeats 29 times}}, 
 papr_hcall = {nr = 150994968, ret = 144, args = {4, 0, 0, 0, 0, 0, 0, 
 0, 0}}, s390_tsch = {
   subchannel_id = 24, subchannel_nr = 2304, io_int_parm = 0, 
 io_int_word = 144, ipb = 0, 
   dequeued = 4 '\004'}, epr = {epr = 150994968}, system_event = {type 
 = 150994968, flags = 144}, 
 s390_stsi = {addr = 150994968, ar = 144 '\220', reserved = 0 '\000', 
 fc = 0 '\000', sel1 = 0 '\000', 
   sel2 = 0}, 
 padding = 
 

Re: [Qemu-devel] [PATCH v2 0/3] AioContext: ctx->dispatching is dead, all hail ctx->notify_me

2015-07-17 Thread Marc Zyngier
On Fri, 17 Jul 2015 14:53:06 +0100
Richard W.M. Jones rjo...@redhat.com wrote:

 On Fri, Jul 17, 2015 at 02:48:40PM +0100, Marc Zyngier wrote:
  Still: there is nothing in the registers that remotely points to that
  area. X0 is the closest, but it'd take a big negative offset to get
  there.
  
  Is that a Linux kernel? or something else?
 
 You're sure it's not this one?
 https://bugzilla.redhat.com/show_bug.cgi?id=1194366
 That was caused by ftrace screwing up guest memory, so it was
 effectively running random code.  It is also fixed (by you in fact).

Don't think so. The bug you quote was the guest kernel being buggy, and
touching non-memory space.

This new issue seems different - this is not a Linux kernel, by the
look of it.

M.
-- 
Jazz is not dead. It just smells funny.



Re: [Qemu-devel] [PATCH v2 0/3] AioContext: ctx->dispatching is dead, all hail ctx->notify_me

2015-07-17 Thread Richard W.M. Jones
On Fri, Jul 17, 2015 at 11:30:38AM +0200, Paolo Bonzini wrote:
 error: kvm run failed Function not implemented
 PC=bf671210  SP=c1f0
 X00=0a003e70 X01= X02=bf680548 
 X03=0030
 X04=bbb5fc18 X05=004b7770 X06=bf721930 
 X07=009a
 X08=bf716858 X09=0090 X10= 
 X11=0046
 X12=a007e03a X13= X14= 
 X15=
 X16=bf716df0 X17= X18= 
 X19=bf6f5f18
 X20= X21= X22= 
 X23=
 X24= X25= X26= 
 X27=
 X28= X29= X30= 
 PSTATE=6305 (flags -ZC-)

Vaguely reminiscent of this bug:

https://bugzilla.redhat.com/show_bug.cgi?id=1194366

(See comment 7 in particular)

 diff --git a/aio-posix.c b/aio-posix.c
 index 268d14d..d2011d0 100644
 --- a/aio-posix.c
 +++ b/aio-posix.c
 @@ -273,6 +273,13 @@ bool aio_poll(AioContext *ctx, bool blocking)
          aio_context_acquire(ctx);
      }
  
 +    /* This should be optimized... */
 +    event_notifier_test_and_clear(&ctx->notifier);
 +
 +    if (blocking) {
 +        atomic_sub(&ctx->notify_me, 2);
 +    }
 +
      /* if we have any readable fds, dispatch event */
      if (ret > 0) {
          for (i = 0; i < npfd; i++) {
 @@ -283,10 +290,6 @@ bool aio_poll(AioContext *ctx, bool blocking)
      npfd = 0;
      ctx->walking_handlers--;
  
 -    if (blocking) {
 -        atomic_sub(&ctx->notify_me, 2);
 -    }
 -
      /* Run dispatch even if there were no readable fds to run timers */
      if (aio_dispatch(ctx)) {
          progress = true;
 diff --git a/aio-win32.c b/aio-win32.c
 index 2bfd5f8..33809fd 100644
 --- a/aio-win32.c
 +++ b/aio-win32.c
 @@ -326,6 +326,10 @@ bool aio_poll(AioContext *ctx, bool blocking)
      if (timeout) {
          aio_context_acquire(ctx);
      }
 +
 +    /* This should be optimized... */
 +    event_notifier_test_and_clear(&ctx->notifier);
 +
      if (blocking) {
          assert(first);
          atomic_sub(&ctx->notify_me, 2);
 diff --git a/async.c b/async.c
 index 9204907..120e183 100644
 --- a/async.c
 +++ b/async.c
 @@ -202,6 +202,9 @@ aio_ctx_check(GSource *source)
      AioContext *ctx = (AioContext *) source;
      QEMUBH *bh;
  
 +    /* This should be optimized... */
 +    event_notifier_test_and_clear(&ctx->notifier);
 +
      atomic_and(&ctx->notify_me, ~1);
      for (bh = ctx->first_bh; bh; bh = bh->next) {
          if (!bh->deleted && bh->scheduled) {
 @@ -280,6 +280,10 @@ static void aio_rfifolock_cb(void *opaque)
      aio_notify(opaque);
  }
  
 +static void event_notifier_dummy_cb(EventNotifier *e)
 +{
 +}
 +
  AioContext *aio_context_new(Error **errp)
  {
      int ret;
 @@ -292,7 +296,7 @@ AioContext *aio_context_new(Error **errp)
          return NULL;
      }
      g_source_set_can_recurse(&ctx->source, true);
 -    aio_set_event_notifier(ctx, &ctx->notifier, NULL);
 +    aio_set_event_notifier(ctx, &ctx->notifier, event_notifier_dummy_cb);
      ctx->thread_pool = NULL;
      qemu_mutex_init(&ctx->bh_lock);
      rfifolock_init(&ctx->lock, aio_rfifolock_cb, ctx);

With this patch, I've got to 500 iterations without seeing the error.

Still testing it however ...

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/



Re: [Qemu-devel] [PATCH v2 0/3] AioContext: ctx->dispatching is dead, all hail ctx->notify_me

2015-07-17 Thread Marc Zyngier
On Fri, 17 Jul 2015 10:30:38 +0100
Paolo Bonzini pbonz...@redhat.com wrote:

 
 
 On 17/07/2015 06:44, Paolo Bonzini wrote:
  
  
  On 16/07/2015 21:05, Richard W.M. Jones wrote:
 
  Sorry to spoil things, but I'm still seeing this bug, although it is
  now a lot less frequent with your patch.  I would estimate it happens
  more often than 1 in 5 runs with qemu.git, and probably 1 in 200 runs
  with qemu.git + the v2 patch series.
 
  It's the exact same hang in both cases.
 
  Is it possible that this patch doesn't completely close any race?
 
  Still, it is an improvement, so there is that.
  
  Would seem at first glance like a different bug.
  
  Interestingly, adding some tracing (qemu_clock_get_ns) makes the bug
  more likely: now it reproduces in about 10 tries.  Of course :) adding
  other kinds of tracing instead make it go away again (50 tries).
  
  Perhaps this:
  
 i/o thread vcpu thread   worker thread
 -
 lock_iothread
 notify_me = 1
 ...
 unlock_iothread
lock_iothread
notify_me = 3
ppoll
notify_me = 1
   bh->scheduled = 1
   event_notifier_set
event_notifier_test_and_clear
 ppoll
  ^^ hang
  
  In the exact shape above, it doesn't seem too likely to happen, but
  perhaps there's another simpler case.  Still, the bug exists.
  
  The above is not really related to notify_me.  Here the notification is
  not being optimized away!  So I wonder if this one has been there forever.
  
  Fam suggested putting the event_notifier_test_and_clear before
  aio_bh_poll(), but it does not work.  I'll look more close
  
  However, an unconditional event_notifier_test_and_clear is pretty
  expensive.  On one hand, obviously correctness comes first.  On the
  other hand, an expensive operation at the wrong place can mask the race
  very easily; I'll let the fix run for a while, but I'm not sure if a
  successful test really says anything useful.
 
 So it may not be useful, but still successful test is successful. :)
 The following patch, which also includes the delta between v2 and v3
 of this series, survived 674 reboots before hitting a definitely
 unrelated problem:
 
 error: kvm run failed Function not implemented
 PC=bf671210  SP=c1f0
 X00=0a003e70 X01= X02=bf680548 
 X03=0030
 X04=bbb5fc18 X05=004b7770 X06=bf721930 
 X07=009a
 X08=bf716858 X09=0090 X10= 
 X11=0046
 X12=a007e03a X13= X14= 
 X15=
 X16=bf716df0 X17= X18= 
 X19=bf6f5f18
 X20= X21= X22= 
 X23=
 X24= X25= X26= 
 X27=
 X28= X29= X30= 
 PSTATE=6305 (flags -ZC-)
 
 For the record, this is the kvm_run struct:
 
 $6 = {request_interrupt_window = 0 '\000', padding1 = 
 \000\000\000\000\000\000, exit_reason = 0, 
   ready_for_interrupt_injection = 0 '\000', if_flag = 0 '\000', flags = 0, 
 cr8 = 0, apic_base = 0, {hw = {
   hardware_exit_reason = 150994968}, fail_entry = 
 {hardware_entry_failure_reason = 150994968}, ex = {
   exception = 150994968, error_code = 0}, io = {direction = 24 '\030', 
 size = 0 '\000', port = 2304, 
   count = 0, data_offset = 144}, debug = {arch = {No data fields}}, 
 mmio = {phys_addr = 150994968, 
   data = \220\000\000\000\000\000\000, len = 4, is_write = 0 '\000'}, 
 hypercall = {nr = 150994968, 
   args = {144, 4, 0, 0, 0, 0}, ret = 0, longmode = 0, pad = 0}, 
 tpr_access = {rip = 150994968, 
   is_write = 144, pad = 0}, s390_sieic = {icptcode = 24 '\030', ipa = 
 2304, ipb = 0}, 
 s390_reset_flags = 150994968, s390_ucontrol = {trans_exc_code = 
 150994968, pgm_code = 144}, dcr = {
   dcrn = 150994968, data = 0, is_write = 144 '\220'}, internal = 
 {suberror = 150994968, ndata = 0, 
   data = {144, 4, 0 repeats 14 times}}, osi = {gprs = {150994968, 144, 
 4, 0 repeats 29 times}}, 
 papr_hcall = {nr = 150994968, ret = 144, args = {4, 0, 0, 0, 0, 0, 0, 0, 
 0}}, s390_tsch = {
   subchannel_id = 24, subchannel_nr = 2304, io_int_parm = 0, io_int_word 
 = 144, ipb = 0, 
   dequeued = 4 '\004'}, epr = {epr = 150994968}, system_event = {type = 
 150994968, flags = 144}, 
 s390_stsi = {addr = 150994968, ar = 144 '\220', reserved = 0 '\000', fc = 
 0 '\000', sel1 = 0 '\000', 
   sel2 = 0}, 
 padding = 
 \030\000\000\t\000\000\000\000\220\000\000\000\000\000\000\000\004, '\000' 
 repeats 238 

Re: [Qemu-devel] [PATCH v2 0/3] AioContext: ctx->dispatching is dead, all hail ctx->notify_me

2015-07-17 Thread Paolo Bonzini


On 17/07/2015 14:58, Richard W.M. Jones wrote:
 On Fri, Jul 17, 2015 at 11:30:38AM +0200, Paolo Bonzini wrote:
 error: kvm run failed Function not implemented
 PC=bf671210  SP=c1f0
 X00=0a003e70 X01= X02=bf680548 
 X03=0030
 X04=bbb5fc18 X05=004b7770 X06=bf721930 
 X07=009a
 X08=bf716858 X09=0090 X10= 
 X11=0046
 X12=a007e03a X13= X14= 
 X15=
 X16=bf716df0 X17= X18= 
 X19=bf6f5f18
 X20= X21= X22= 
 X23=
 X24= X25= X26= 
 X27=
 X28= X29= X30= 
 PSTATE=6305 (flags -ZC-)
 
 Vaguely reminiscent of this bug:
 
 https://bugzilla.redhat.com/show_bug.cgi?id=1194366
 
 (See comment 7 in particular)

Must be it, I was using an old kernel.

Thanks!

Paolo

 diff --git a/aio-posix.c b/aio-posix.c
 index 268d14d..d2011d0 100644
 --- a/aio-posix.c
 +++ b/aio-posix.c
 @@ -273,6 +273,13 @@ bool aio_poll(AioContext *ctx, bool blocking)
          aio_context_acquire(ctx);
      }
  
 +    /* This should be optimized... */
 +    event_notifier_test_and_clear(&ctx->notifier);
 +
 +    if (blocking) {
 +        atomic_sub(&ctx->notify_me, 2);
 +    }
 +
      /* if we have any readable fds, dispatch event */
      if (ret > 0) {
          for (i = 0; i < npfd; i++) {
 @@ -283,10 +290,6 @@ bool aio_poll(AioContext *ctx, bool blocking)
      npfd = 0;
      ctx->walking_handlers--;
  
 -    if (blocking) {
 -        atomic_sub(&ctx->notify_me, 2);
 -    }
 -
      /* Run dispatch even if there were no readable fds to run timers */
      if (aio_dispatch(ctx)) {
          progress = true;
 diff --git a/aio-win32.c b/aio-win32.c
 index 2bfd5f8..33809fd 100644
 --- a/aio-win32.c
 +++ b/aio-win32.c
 @@ -326,6 +326,10 @@ bool aio_poll(AioContext *ctx, bool blocking)
      if (timeout) {
          aio_context_acquire(ctx);
      }
 +
 +    /* This should be optimized... */
 +    event_notifier_test_and_clear(&ctx->notifier);
 +
      if (blocking) {
          assert(first);
          atomic_sub(&ctx->notify_me, 2);
 diff --git a/async.c b/async.c
 index 9204907..120e183 100644
 --- a/async.c
 +++ b/async.c
 @@ -202,6 +202,9 @@ aio_ctx_check(GSource *source)
      AioContext *ctx = (AioContext *) source;
      QEMUBH *bh;
  
 +    /* This should be optimized... */
 +    event_notifier_test_and_clear(&ctx->notifier);
 +
      atomic_and(&ctx->notify_me, ~1);
      for (bh = ctx->first_bh; bh; bh = bh->next) {
          if (!bh->deleted && bh->scheduled) {
 @@ -280,6 +280,10 @@ static void aio_rfifolock_cb(void *opaque)
      aio_notify(opaque);
  }
  
 +static void event_notifier_dummy_cb(EventNotifier *e)
 +{
 +}
 +
  AioContext *aio_context_new(Error **errp)
  {
      int ret;
 @@ -292,7 +296,7 @@ AioContext *aio_context_new(Error **errp)
          return NULL;
      }
      g_source_set_can_recurse(&ctx->source, true);
 -    aio_set_event_notifier(ctx, &ctx->notifier, NULL);
 +    aio_set_event_notifier(ctx, &ctx->notifier, event_notifier_dummy_cb);
      ctx->thread_pool = NULL;
      qemu_mutex_init(&ctx->bh_lock);
      rfifolock_init(&ctx->lock, aio_rfifolock_cb, ctx);
 
 With this patch, I've got to 500 iterations without seeing the error.
 
 Still testing it however ...
 
 Rich.
 



Re: [Qemu-devel] [PATCH v2 0/3] AioContext: ctx->dispatching is dead, all hail ctx->notify_me

2015-07-17 Thread Paolo Bonzini


On 17/07/2015 06:44, Paolo Bonzini wrote:
 
 
 On 16/07/2015 21:05, Richard W.M. Jones wrote:

 Sorry to spoil things, but I'm still seeing this bug, although it is
 now a lot less frequent with your patch.  I would estimate it happens
 more often than 1 in 5 runs with qemu.git, and probably 1 in 200 runs
 with qemu.git + the v2 patch series.

 It's the exact same hang in both cases.

 Is it possible that this patch doesn't completely close any race?

 Still, it is an improvement, so there is that.
 
 Would seem at first glance like a different bug.
 
 Interestingly, adding some tracing (qemu_clock_get_ns) makes the bug
 more likely: now it reproduces in about 10 tries.  Of course :) adding
 other kinds of tracing instead make it go away again (50 tries).
 
 Perhaps this:
 
i/o thread vcpu thread   worker thread
-
lock_iothread
notify_me = 1
...
unlock_iothread
   lock_iothread
   notify_me = 3
   ppoll
   notify_me = 1
  bh->scheduled = 1
  event_notifier_set
   event_notifier_test_and_clear
ppoll
 ^^ hang
   
 In the exact shape above, it doesn't seem too likely to happen, but
 perhaps there's another simpler case.  Still, the bug exists.
 
 The above is not really related to notify_me.  Here the notification is
 not being optimized away!  So I wonder if this one has been there forever.
 
 Fam suggested putting the event_notifier_test_and_clear before
 aio_bh_poll(), but it does not work.  I'll look more close
 
 However, an unconditional event_notifier_test_and_clear is pretty
 expensive.  On one hand, obviously correctness comes first.  On the
 other hand, an expensive operation at the wrong place can mask the race
 very easily; I'll let the fix run for a while, but I'm not sure if a
 successful test really says anything useful.

So it may not be useful, but still successful test is successful. :)
The following patch, which also includes the delta between v2 and v3
of this series, survived 674 reboots before hitting a definitely
unrelated problem:

error: kvm run failed Function not implemented
PC=bf671210  SP=c1f0
X00=0a003e70 X01= X02=bf680548 
X03=0030
X04=bbb5fc18 X05=004b7770 X06=bf721930 
X07=009a
X08=bf716858 X09=0090 X10= 
X11=0046
X12=a007e03a X13= X14= 
X15=
X16=bf716df0 X17= X18= 
X19=bf6f5f18
X20= X21= X22= 
X23=
X24= X25= X26= 
X27=
X28= X29= X30= PSTATE=6305 
(flags -ZC-)

For the record, this is the kvm_run struct:

$6 = {request_interrupt_window = 0 '\000', padding1 = "\000\000\000\000\000\000", exit_reason = 0,
  ready_for_interrupt_injection = 0 '\000', if_flag = 0 '\000', flags = 0, cr8 = 0, apic_base = 0, {hw = {
      hardware_exit_reason = 150994968}, fail_entry = {hardware_entry_failure_reason = 150994968}, ex = {
      exception = 150994968, error_code = 0}, io = {direction = 24 '\030', size = 0 '\000', port = 2304,
      count = 0, data_offset = 144}, debug = {arch = {<No data fields>}}, mmio = {phys_addr = 150994968,
      data = "\220\000\000\000\000\000\000", len = 4, is_write = 0 '\000'}, hypercall = {nr = 150994968,
      args = {144, 4, 0, 0, 0, 0}, ret = 0, longmode = 0, pad = 0}, tpr_access = {rip = 150994968,
      is_write = 144, pad = 0}, s390_sieic = {icptcode = 24 '\030', ipa = 2304, ipb = 0},
    s390_reset_flags = 150994968, s390_ucontrol = {trans_exc_code = 150994968, pgm_code = 144}, dcr = {
      dcrn = 150994968, data = 0, is_write = 144 '\220'}, internal = {suberror = 150994968, ndata = 0,
      data = {144, 4, 0 <repeats 14 times>}}, osi = {gprs = {150994968, 144, 4, 0 <repeats 29 times>}},
    papr_hcall = {nr = 150994968, ret = 144, args = {4, 0, 0, 0, 0, 0, 0, 0, 0}}, s390_tsch = {
      subchannel_id = 24, subchannel_nr = 2304, io_int_parm = 0, io_int_word = 144, ipb = 0,
      dequeued = 4 '\004'}, epr = {epr = 150994968}, system_event = {type = 150994968, flags = 144},
    s390_stsi = {addr = 150994968, ar = 144 '\220', reserved = 0 '\000', fc = 0 '\000', sel1 = 0 '\000',
      sel2 = 0},
    padding = "\030\000\000\t\000\000\000\000\220\000\000\000\000\000\000\000\004", '\000' <repeats 238 times>},
  kvm_valid_regs = 0, kvm_dirty_regs = 0, s = {regs = {<No data fields>},
    padding = '\000' <repeats 2047 times>}}

Marc, does it ring any bell?

Paolo

diff --git a/aio-posix.c 

Re: [Qemu-devel] [PATCH v2 0/3] AioContext: ctx->dispatching is dead, all hail ctx->notify_me

2015-07-16 Thread Paolo Bonzini


On 16/07/2015 21:05, Richard W.M. Jones wrote:
 Sorry to spoil things, but I'm still seeing this bug, although it is
 now a lot less frequent with your patch.  I would estimate it happens
 more often than 1 in 5 runs with qemu.git, and probably 1 in 200 runs
 with qemu.git + the v2 patch series.
 
 It's the exact same hang in both cases.
 
 Is it possible that this patch doesn't completely close any race?
 
 Still, it is an improvement, so there is that.

I would guess instead that there are two separate bugs, but it's not
impossible that it's still there.

Paolo



Re: [Qemu-devel] [PATCH v2 0/3] AioContext: ctx->dispatching is dead, all hail ctx->notify_me

2015-07-16 Thread Paolo Bonzini


On 17/07/2015 00:06, Paolo Bonzini wrote:
 
 
 On 16/07/2015 21:05, Richard W.M. Jones wrote:
 Sorry to spoil things, but I'm still seeing this bug, although it is
 now a lot less frequent with your patch.  I would estimate it happens
 more often than 1 in 5 runs with qemu.git, and probably 1 in 200 runs
 with qemu.git + the v2 patch series.

 It's the exact same hang in both cases.

 Is it possible that this patch doesn't completely close any race?

 Still, it is an improvement, so there is that.
 
 I would guess instead that there are two separate bugs, but it's not
 impossible that it's still there.

Reproduced after ~80 runs...

Paolo



Re: [Qemu-devel] [PATCH v2 0/3] AioContext: ctx->dispatching is dead, all hail ctx->notify_me

2015-07-16 Thread Kevin Wolf
On 16.07.2015 at 11:56, Paolo Bonzini wrote:
 Apart from an additional assertion this is exactly the same code as v1,
 but split across three patches so that the important one focuses on the
 optimization.
 
 Paolo
 
 v1->v2
 Split some changes to the tests to a separate patch
 Fix commit message [Laszlo]
 Clarify do...while loop in aio-win32.c [Kevin]
 
 Paolo Bonzini (3):
   tests: remove irrelevant assertions from test-aio
   aio-win32: reorganize polling loop
   AioContext: fix broken ctx->dispatching optimization

Reviewed-by: Kevin Wolf kw...@redhat.com



[Qemu-devel] [PATCH v2 0/3] AioContext: ctx->dispatching is dead, all hail ctx->notify_me

2015-07-16 Thread Paolo Bonzini
Apart from an additional assertion this is exactly the same code as v1,
but split across three patches so that the important one focuses on the
optimization.

Paolo

v1->v2
Split some changes to the tests to a separate patch
Fix commit message [Laszlo]
Clarify do...while loop in aio-win32.c [Kevin]

Paolo Bonzini (3):
  tests: remove irrelevant assertions from test-aio
  aio-win32: reorganize polling loop
  AioContext: fix broken ctx->dispatching optimization

 aio-posix.c | 19 ++--
 aio-win32.c | 41 ++
 async.c | 21 +-
 docs/aio_notify.promela | 77 +
 include/block/aio.h | 29 +++
 tests/test-aio.c| 26 +++--
 6 files changed, 98 insertions(+), 115 deletions(-)

-- 
2.4.3

diff from v1:
diff --git a/aio-win32.c b/aio-win32.c
index ae7c6cf..9d6c12f 100644
--- a/aio-win32.c
+++ b/aio-win32.c
@@ -312,7 +312,13 @@ bool aio_poll(AioContext *ctx, bool blocking)
     ctx->walking_handlers--;
     first = true;
 
-    /* wait until next event */
+    /* ctx->notifier is always registered.  */
+    assert(count > 0);
+
+    /* Multiple iterations, all of them non-blocking except the first,
+     * may be necessary to process all pending events.  After the first
+     * WaitForMultipleObjects call ctx->notify_me will be decremented.
+     */
     do {
         HANDLE event;
         int ret;





Re: [Qemu-devel] [PATCH v2 0/3] AioContext: ctx->dispatching is dead, all hail ctx->notify_me

2015-07-16 Thread Richard W.M. Jones
Sorry for the very long delay in replying to this.  I wanted to be
absolutely sure I was reproducing the bug.

Unfortunately I'm only able to reproduce the bug with qemu 2.3.0
(both the version in Fedora Rawhide, or the tagged v2.3.0 from git).

I cannot currently reproduce it at all with upstream qemu from git.

The patches understandably only apply to upstream qemu from git, and
have quite a few tricky conflicts with v2.3.0.

I'll keep trying on this one.  It may be that the window for the
bug to reproduce with qemu.git has got smaller.

Rich.

--

Notes on how I try to reproduce this:

(1) Using Fedora Rawhide aarch64

(2) libguestfs checked out and compiled from git

(3) kraxel's edk2.git-aarch64-0-20150713.b1115.g2ad9cf3.noarch

(4) heisenscsi.pl (attached).

$ export LIBGUESTFS_HV=/path/to/qemu/aarch64-softmmu/qemu-system-aarch64
$ while true; do echo .; ./run ./heisenscsi.pl ; done
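
(The attached heisenscsi.pl is not reproduced here.  Purely as a rough,
hypothetical illustration of the kind of launch/shutdown loop involved, a C
equivalent using the libguestfs C API could look like the sketch below; the
exact calls and sizes are an assumption, not the real reproducer.)

    #include <guestfs.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        for (int i = 0; i < 100; i++) {
            guestfs_h *g = guestfs_create();
            if (g == NULL) {
                exit(EXIT_FAILURE);
            }
            /* Throwaway scratch disk, boot the appliance, shut it down again;
             * the hang reported in this thread shows up during one of these
             * appliance runs. */
            if (guestfs_add_drive_scratch(g, 1024 * 1024, -1) == -1 ||
                guestfs_launch(g) == -1 ||
                guestfs_shutdown(g) == -1) {
                guestfs_close(g);
                exit(EXIT_FAILURE);
            }
            guestfs_close(g);
            printf(".");
            fflush(stdout);
        }
        return 0;
    }

Built with something like "gcc repro.c -lguestfs" and driven by LIBGUESTFS_HV
as above.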



-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org


heisenscsi.pl
Description: Perl program


Re: [Qemu-devel] [PATCH v2 0/3] AioContext: ctx->dispatching is dead, all hail ctx->notify_me

2015-07-16 Thread Richard W.M. Jones

Sorry to spoil things, but I'm still seeing this bug, although it is
now a lot less frequent with your patch.  I would estimate it happens
more often than 1 in 5 runs with qemu.git, and probably 1 in 200 runs
with qemu.git + the v2 patch series.

It's the exact same hang in both cases.

Is it possible that this patch doesn't completely close any race?

Still, it is an improvement, so there is that.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org



Re: [Qemu-devel] [PATCH v2 0/3] AioContext: ctx->dispatching is dead, all hail ctx->notify_me

2015-07-16 Thread Paolo Bonzini


On 16/07/2015 21:05, Richard W.M. Jones wrote:
 
 Sorry to spoil things, but I'm still seeing this bug, although it is
 now a lot less frequent with your patch.  I would estimate it happens
 more often than 1 in 5 runs with qemu.git, and probably 1 in 200 runs
 with qemu.git + the v2 patch series.
 
 It's the exact same hang in both cases.
 
 Is it possible that this patch doesn't completely close any race?
 
 Still, it is an improvement, so there is that.

Would seem at first glance like a different bug.

Interestingly, adding some tracing (qemu_clock_get_ns) makes the bug
more likely: now it reproduces in about 10 tries.  Of course :) adding
other kinds of tracing instead makes it go away again (50 tries).

Perhaps this:

   i/o thread           vcpu thread                    worker thread
   ------------------------------------------------------------------
   lock_iothread
   notify_me = 1
   ...
   unlock_iothread
                        lock_iothread
                        notify_me = 3
                        ppoll
                        notify_me = 1
                                                       bh->scheduled = 1
                                                       event_notifier_set
                        event_notifier_test_and_clear
   ppoll
     ^^ hang

In the exact shape above, it doesn't seem too likely to happen, but
perhaps there's another simpler case.  Still, the bug exists.

The above is not really related to notify_me.  Here the notification is
not being optimized away!  So I wonder if this one has been there forever.

Fam suggested putting the event_notifier_test_and_clear before
aio_bh_poll(), but it does not work.  I'll look more close

However, an unconditional event_notifier_test_and_clear is pretty
expensive.  On one hand, obviously correctness comes first.  On the
other hand, an expensive operation at the wrong place can mask the race
very easily; I'll let the fix run for a while, but I'm not sure if a
successful test really says anything useful.
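
For context on why the placement of that clear matters so much: on Linux an
EventNotifier is typically an eventfd, so whichever thread reads it consumes
the wakeup, and a later ppoll on the same descriptor sees nothing.  A minimal
standalone sketch of those semantics (simplified, Linux-only, not the QEMU
implementation):

    #include <stdbool.h>
    #include <stdint.h>
    #include <sys/eventfd.h>
    #include <unistd.h>

    static void notifier_set(int efd)
    {
        uint64_t one = 1;
        (void)write(efd, &one, sizeof(one));     /* make the fd readable */
    }

    static bool notifier_test_and_clear(int efd)
    {
        uint64_t count;
        /* With EFD_NONBLOCK the read fails if there is nothing pending,
         * so this both tests and resets the notifier. */
        return read(efd, &count, sizeof(count)) == sizeof(count);
    }

    int main(void)
    {
        int efd = eventfd(0, EFD_NONBLOCK);
        notifier_set(efd);
        bool first  = notifier_test_and_clear(efd);   /* true */
        bool second = notifier_test_and_clear(efd);   /* false: the wakeup was
                                                       * consumed, so a later
                                                       * ppoll would block too */
        close(efd);
        return (first && !second) ? 0 : 1;
    }

Whichever thread performs that read eats the wakeup, which is why where
event_notifier_test_and_clear sits relative to ppoll and dispatch decides
whether a notification can be lost.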

Paolo