------- Comment From [email protected] 2018-05-07 14:48 EDT-------
The RCU connection is possibly a red herring. I tested the above theory about
RCU timeouts/warnings being a trigger by modifying QEMU so that the guest
timebase can be advanced artificially, which triggers RCU timeouts/warnings in
rapid succession, and ran this for 8 hours using the same workload without
seeing a crash. It seems migration is a necessary component to reproduce this.

I did further tests to capture the guest memory state before/after
migration to see if there's possibly an issue with dirty-page tracking
or something similar that could explain the crashes, and have data from
2 crashes that show a difference of roughly 24 bytes between source/target
after migration within the first 100MB of the guest physical address range.
I have more details in the summaries/logs I'm attaching, but one example
is below (from "migtest" log):

root@boslcp5:~/dumps-cmp# xxd -s 0x013e0000 -l 128 0-2.boslcp5
013e0000: 0000 0000 0000 0000 0000 0000 0000 0000  ................
013e0010: 0000 0000 0000 0000 0000 0000 0000 0000  ................
013e0020: 0000 0000 0000 0000 0000 0000 0000 0000  ................
013e0030: 0000 0000 0000 0000 0000 0000 0000 0000  ................
013e0040: 0000 0000 0000 0000 0000 0000 0000 0000  ................
013e0050: 0000 0000 0000 0000 0000 0000 0000 0000  ................
013e0060: 0000 0000 0000 0000 0000 0000 0000 0000  ................
013e0070: 0000 0000 0000 0000 0000 0000 0000 0000  ................
root@boslcp5:~/dumps-cmp# xxd -s 0x013e0000 -l 128 0-2.boslcp6
013e0000: 3403 0100 0002 0000 2f62 6c61 7374 2f76  4......./blast/v
013e0010: 6463 3400 0000 0000 0000 0000 0000 0000  dc4.............
013e0020: 0000 0000 0000 0000 0000 0000 0000 0000  ................
013e0030: 0000 0000 0000 0000 0000 0000 0000 0000  ................
013e0040: 0000 0000 0000 0000 0000 0000 0000 0000  ................
013e0050: 0000 0000 0000 0000 0000 0000 0000 0000  ................
013e0060: 0000 0000 0000 0000 0000 0000 0000 0000  ................
013e0070: 0000 0000 0000 0000 0000 0000 0000 0000  ................
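
For anyone wanting to repeat this kind of comparison, the before/after check
can be sketched roughly as below. This is only an illustration, not the
tooling used for the attached logs: the function name diff_ranges, the chunk
size, and the 16-byte merge gap are all assumptions, and the demo uses small
synthetic files built from the bytes in the dump above rather than real guest
memory dumps.

```python
import os
import tempfile


def diff_ranges(path_a, path_b, limit=None, chunk=1 << 20, gap=16):
    """Return (start, end) byte ranges where two files differ; end is exclusive.

    Differing bytes separated by fewer than `gap` identical bytes are merged
    into one range. `limit` optionally restricts the scan to the first N bytes.
    """
    ranges = []
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while limit is None or offset < limit:
            want = chunk if limit is None else min(chunk, limit - offset)
            a, b = fa.read(want), fb.read(want)
            if not a and not b:
                break  # both files exhausted
            if a != b:
                # Slicing (a[i:i + 1]) tolerates one file being shorter.
                for i in range(max(len(a), len(b))):
                    if a[i:i + 1] != b[i:i + 1]:
                        pos = offset + i
                        if ranges and pos - ranges[-1][1] < gap:
                            ranges[-1][1] = pos + 1
                        else:
                            ranges.append([pos, pos + 1])
            offset += max(len(a), len(b))
    return [(start, end) for start, end in ranges]


def demo():
    # Synthetic stand-ins for the two dumps: one all zeros, one holding the
    # 8-byte value plus path string seen in the xxd output above.
    record = bytes.fromhex("3403010000020000") + b"/blast/vdc4"
    zeros = bytes(256)
    filled = zeros[:64] + record + zeros[64 + len(record):]
    with tempfile.TemporaryDirectory() as tmp:
        side_a = os.path.join(tmp, "a.bin")
        side_b = os.path.join(tmp, "b.bin")
        with open(side_a, "wb") as f:
            f.write(zeros)
        with open(side_b, "wb") as f:
            f.write(filled)
        # Scan only the first 100MB, mirroring the range mentioned above.
        return diff_ranges(side_a, side_b, limit=100 * 1024 * 1024)


if __name__ == "__main__":
    for start, end in demo():
        print(f"0x{start:08x}-0x{end:08x} ({end - start} bytes)")
```

Merging nearby differences keeps one damaged structure from being reported as
several tiny ranges when some of its bytes happen to be zero on both sides.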

"blast" is part of the LTP IO test suite running in the guest. It seems
some data structure related to it is present in the source guest memory,
but not on the target side. Part of the structure seems to be a string
buffer, but the preceding value might be a pointer or something else and
may explain the crashes if that ends up being zeroed on the target side.
The other summary/log I'm attaching has an almost identical inconsistency
between source/target from another guest using the same workload and hitting
a crash:

root@boslcp5:~/dumps-cmp-migtest2# xxd -s 38273024 -l 128 0-2.boslcp5
02480000: c000 0100 0000 0000 2f62 6c61 7374 2f76  ......../blast/v
02480010: 6462 3400 0000 0000 0000 0000 0000 0000  db4.............
02480020: 0000 0000 0000 0000 0000 0000 0000 0000  ................
02480030: 0000 0000 0000 0000 0000 0000 0000 0000  ................
02480040: 0000 0000 0000 0000 0000 0000 0000 0000  ................
02480050: 0000 0000 0000 0000 0000 0000 0000 0000  ................
02480060: 0000 0000 0000 0000 0000 0000 0000 0000  ................
02480070: 0000 0000 0000 0000 0000 0000 0000 0000  ................
root@boslcp5:~/dumps-cmp-migtest2# xxd -s 38273024 -l 128 0-2.boslcp6
02480000: 0000 0000 0000 0000 0000 0000 0000 0000  ................
02480010: 0000 0000 0000 0000 0000 0000 0000 0000  ................
02480020: 0000 0000 0000 0000 0000 0000 0000 0000  ................
02480030: 0000 0000 0000 0000 0000 0000 0000 0000  ................
02480040: 0000 0000 0000 0000 0000 0000 0000 0000  ................
02480050: 0000 0000 0000 0000 0000 0000 0000 0000  ................
02480060: 0000 0000 0000 0000 0000 0000 0000 0000  ................
02480070: 0000 0000 0000 0000 0000 0000 0000 0000  ................
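
To help judge whether the 8 bytes in front of the path could be a pointer,
the two records can be decoded under a few candidate layouts. The sketch
below is only an aid for reading the dumps: the helper name describe is
hypothetical, and the byte order and the split into 64-bit or 32-bit fields
are assumptions, since the actual structure is not known from the logs.

```python
import struct


def describe(record: bytes) -> dict:
    """Decode an 8-byte header plus NUL-terminated string under assumed layouts."""
    head, tail = record[:8], record[8:]
    return {
        "u64_le": struct.unpack("<Q", head)[0],   # one little-endian 64-bit value
        "u32_le": struct.unpack("<II", head),     # two little-endian 32-bit values
        "u64_be": struct.unpack(">Q", head)[0],   # one big-endian 64-bit value
        "string": tail.split(b"\x00", 1)[0].decode("ascii"),
    }


# Bytes copied from the two xxd dumps (offsets 0x013e0000 and 0x02480000).
RECORDS = {
    "migtest": bytes.fromhex("3403010000020000") + b"/blast/vdc4\x00",
    "migtest2": bytes.fromhex("c000010000000000") + b"/blast/vdb4\x00",
}

if __name__ == "__main__":
    for name, raw in RECORDS.items():
        info = describe(raw)
        print(name, info["string"])
        print("  u64 LE: 0x%016x" % info["u64_le"])
        print("  u32 LE: 0x%08x 0x%08x" % info["u32_le"])
        print("  u64 BE: 0x%016x" % info["u64_be"])
```

Printing several interpretations side by side avoids committing to one layout
before the owning structure has been identified.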

It seems highly likely there's an issue related to dirty bitmap tracking
at play here. Could use some help from kernel folks on figuring out
where that might lie. Crashed guests are still live ATM so let me know
if there's anything I should try to gather.

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1768115

Title:
  ISST-LTE:KVM:Ubuntu1804:BostonLC:boslcp3g1: Migration guest running
  with IO stress crashed@security_file_permission+0xf4/0x160.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1768115/+subscriptions

-- 
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
