------- Comment From [email protected] 2018-05-04 09:09 EDT-------
(In reply to comment #15)
> This is not the same as the original bug, but I suspect they are part of a
> class of issues we're hitting while running under very particular
> circumstances which might not generally be seen during normal operation and
> triggering various corner cases. As such I think it makes sense to group
> them under this bug for the time being.
>
> The general workload is running IO-heavy disk workloads on large guests
> (20GB memory, 16 vcpus) with SAN-based storage, and then performing
> migration during the workload. During migration we begin to see a high
> occurrence of rcu_sched stall warnings, and after 1-3  hours of operations
> we hit filesystem-related crashes like the ones posted. We've seen this with
> 2 separate FC cards, emulex and qlogic, where we invoke QEMU through libvirt
> as:

We've been gathering additional traces while running under this scenario,
and while most of the traces so far have been filesystem-related, we now
have a couple suggesting that the common thread is failures related to
RCU-managed data structures. I'll attach the xmon summaries for these;
they include the full dmesg log since guest start, timestamps in dmesg
noting where migration has started/stopped, and "WATCHDOG" messages
noting any large jumps in wall-clock time. For example (from
boslcp3g1-migtest-fail-on-lcp5):

[ 5757.347542] migration iteration 7: started at Thu May 3 05:59:14 CDT 2018
<guest will get stopped by libvirt after 60 seconds; after migration
finishes it is resumed and we immediately see the below>
[ 5935.727884] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 5935.728567]  1-...!: (670 GPs behind) idle=486/140000000000000/0 softirq=218179/218180 fqs=0
[ 5935.730091]  2-...!: (3750 GPs behind) idle=006/140000000000000/0 softirq=203335/203335 fqs=0
[ 5935.731076]  4-...!: (96 GPs behind) idle=c2e/140000000000000/0 softirq=168607/168608 fqs=0
[ 5935.731783]  5-...!: (2270 GPs behind) idle=e16/140000000000000/0 softirq=152608/152608 fqs=1
[ 5935.732959]  6-...!: (322 GPs behind) idle=3ca/140000000000001/0 softirq=169452/169453 fqs=1
[ 5935.735061]  8-...!: (6 GPs behind) idle=c36/140000000000001/0 softirq=280514/280516 fqs=1
[ 5935.736638]  9-...!: (5 GPs behind) idle=c1e/140000000000001/0 softirq=248247/248249 fqs=1
[ 5935.738112]  10-...!: (4 GPs behind) idle=62a/1/0 softirq=228207/228208 fqs=1
[ 5935.738868]  11-...!: (32 GPs behind) idle=afe/140000000000000/0 softirq=228817/228818 fqs=1
[ 5935.739122]  12-...!: (3 GPs behind) idle=426/1/0 softirq=192716/192717 fqs=1
[ 5935.739295]  14-...!: (5 GPs behind) idle=e56/140000000000000/0 softirq=133888/133892 fqs=1
[ 5935.739486]  15-...!: (7 GPs behind) idle=36e/140000000000000/0 softirq=161010/161013 fqs=1
...
[ 5935.740031] Unable to handle kernel paging request for data at address 0x00000008
[ 5935.740128] Faulting instruction address: 0xc000000000403d04

For the prior iterations where we didn't crash, we'd see messages like:

[ 2997.413561] WATCHDOG (Thu May  3 05:13:18 CDT 2018): time jump of 114 seconds
[ 3023.759629] migration iteration 1: completed at Thu May 3 05:13:25 CDT 2018
[ 3239.678964] migration iteration 2: started at Thu May 3 05:16:45 CDT 2018

The WATCHDOG message notes how much wall-clock time the guest sees jump
after it resumes execution. These are generally on the order of 1-2
minutes here, since we're doing the migration via virsh migrate ...
--timeout 60, which manually stops the guest if it hasn't finished
migration within 60s.
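
The WATCHDOG markers are produced inside the guest; a minimal sketch of
a loop that would generate equivalent dmesg messages looks like the
following (this is not the exact script from our harness; the 10s poll
interval and 30s threshold are illustrative):

#!/bin/sh
# Log a WATCHDOG message to dmesg whenever wall-clock time jumps well
# beyond the expected sleep interval (e.g. while the guest was stopped).
prev=$(date +%s)
while true; do
    sleep 10
    now=$(date +%s)
    jump=$((now - prev - 10))
    if [ "$jump" -gt 30 ]; then
        echo "WATCHDOG ($(date)): time jump of $jump seconds" > /dev/kmsg
    fi
    prev=$now
done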

We now know that the skip in time actually originates from behavior on
the source side of migration, due to handling within QEMU, and that the
guest is simply reacting to it after it wakes up from migration. A patch
has been sent which changes the behavior so that the guest doesn't see a
jump in time after resuming:

http://lists.nongnu.org/archive/html/qemu-devel/2018-05/msg00928.html

The patch is still under discussion and it's not yet clear whether this
is actually a QEMU bug or intended behavior. I'm still testing the patch
in conjunction with the original workload and would like to see it run
over the weekend before I can say anything with certainty, but so far it
has run overnight, whereas prior to the change it would crash after an
hour or two (though we have seen runs that survived as long as 8 hours,
so that's not definitive).

If that survives, it would suggest the RCU-related crashes occur as a
result of a jump in the guest VCPU's timebase register. One interesting
thing I've noticed is that with a QEMU that *doesn't have the patch
above*, disabling RCU stall warning messages via:

echo 1 >/sys/module/rcupdate/parameters/rcu_cpu_stall_suppress

allowed the workload to run for 16 hours without crashing. This may
suggest the warning messages themselves, in conjunction with
rcu_cpu_stall_timeout being exceeded due to the jump in the timebase
register, are triggering issues with RCU. What I plan to try next is
raising rcu_cpu_stall_timeout to a much higher value (currently 21 on
Ubuntu 18.04, it seems) and seeing if that can survive a run over the
weekend.
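
Concretely, that amounts to something like the following inside the
guest (the value 300 is just an arbitrary large choice for illustration;
the same parameter should also be settable at boot time via
rcupdate.rcu_cpu_stall_timeout=):

cat /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout    # currently 21
echo 300 > /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout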

All this also suggests it may be possible to trigger the failures
synthetically by running a high IO workload, pausing the guest,
advancing the VCPU's tb_offset, and then resuming. This is something I
plan to try soon to help narrow things down.
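
A rough way to approximate that without a full migration (and without
explicitly poking tb_offset) would be to let the IO workload run inside
the guest and then, from the host, do something like the following; the
guest name is a placeholder, and this assumes the unpatched QEMU
behavior where the time the guest spends stopped shows up as a timebase
jump on resume:

virsh suspend boslcp3g1    # stop all vcpus, as the --timeout 60 path does
sleep 120                  # simulate the 1-2 minute stopped period
virsh resume boslcp3g1     # guest wakes up and sees the jump in time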
