------- Comment From [email protected] 2018-05-07 14:48 EDT-------
The RCU connection is possibly a red herring. I tested the above theory about RCU timeouts/warnings being a trigger by modifying QEMU to allow the guest timebase to be advanced artificially, which triggers RCU timeouts/warnings in rapid succession. I ran this for 8 hours using the same workload without seeing a crash. It seems migration is a necessary component to reproduce this.

I did further tests to capture the guest memory state before/after migration to see if there's possibly an issue with dirty-page tracking or something similar that could explain the crashes, and have data from 2 crashes that show a difference of roughly 24 bytes between source/target after migration within the first 100MB of the guest physical address range. I have more details in the summaries/logs I'm attaching, but one example is below (from "migtest" log):

root@boslcp5:~/dumps-cmp# xxd -s 0x013e0000 -l 128 0-2.boslcp5
013e0000: 0000 0000 0000 0000 0000 0000 0000 0000  ................
013e0010: 0000 0000 0000 0000 0000 0000 0000 0000  ................
013e0020: 0000 0000 0000 0000 0000 0000 0000 0000  ................
013e0030: 0000 0000 0000 0000 0000 0000 0000 0000  ................
013e0040: 0000 0000 0000 0000 0000 0000 0000 0000  ................
013e0050: 0000 0000 0000 0000 0000 0000 0000 0000  ................
013e0060: 0000 0000 0000 0000 0000 0000 0000 0000  ................
013e0070: 0000 0000 0000 0000 0000 0000 0000 0000  ................

root@boslcp5:~/dumps-cmp# xxd -s 0x013e0000 -l 128 0-2.boslcp6
013e0000: 3403 0100 0002 0000 2f62 6c61 7374 2f76  4......./blast/v
013e0010: 6463 3400 0000 0000 0000 0000 0000 0000  dc4.............
013e0020: 0000 0000 0000 0000 0000 0000 0000 0000  ................
013e0030: 0000 0000 0000 0000 0000 0000 0000 0000  ................
013e0040: 0000 0000 0000 0000 0000 0000 0000 0000  ................
013e0050: 0000 0000 0000 0000 0000 0000 0000 0000  ................
013e0060: 0000 0000 0000 0000 0000 0000 0000 0000  ................
013e0070: 0000 0000 0000 0000 0000 0000 0000 0000  ................

"blast" is part of the LTP IO test suite running in the guest. It seems some data structure related to it is present in the source guest memory, but not on the target side.
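
For anyone wanting to repeat the source/target comparison on their own dumps, here is a rough sketch in Python. It is not the tooling used for the attached logs. It assumes both dumps are raw, flat images of guest physical memory where a file offset equals a guest physical address (the dump format is not stated above), and the function name diff_ranges and the 64 KiB chunk size are my own choices for illustration.

```python
CHUNK_SIZE = 64 * 1024  # assumption: read granularity only, not tied to guest page size


def diff_ranges(src_path, dst_path, limit=None, chunk=CHUNK_SIZE):
    """Yield (offset, length) for each contiguous run of differing bytes.

    Only the range covered by both files (and by `limit`, if given) is
    compared; a difference in file length is not reported.
    """
    run_start = None  # start offset of the run currently being tracked
    offset = 0
    with open(src_path, "rb") as src, open(dst_path, "rb") as dst:
        while limit is None or offset < limit:
            size = chunk if limit is None else min(chunk, limit - offset)
            a = src.read(size)
            b = dst.read(size)
            n = min(len(a), len(b))
            if n == 0:
                break
            if a[:n] == b[:n]:
                # Identical chunk: close any run left open by the previous chunk.
                if run_start is not None:
                    yield run_start, offset - run_start
                    run_start = None
            else:
                for i in range(n):
                    if a[i] != b[i]:
                        if run_start is None:
                            run_start = offset + i
                    elif run_start is not None:
                        yield run_start, offset + i - run_start
                        run_start = None
            offset += n
    if run_start is not None:
        yield run_start, offset - run_start
```

Something like `for off, length in diff_ranges("0-2.boslcp5", "0-2.boslcp6", limit=100 * 1024 * 1024): print(hex(off), length)` would then list candidate offsets to inspect with xxd -s. Note that bytes which happen to match inside a structure (for example embedded zeros) split it into several short runs, so nearby offsets should be read together.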

Part of the structure seems to be a string buffer, but the preceding value might be a pointer or something else, and may explain the crashes if that ends up being zeroed on the target side. The other summary/log I'm attaching has an almost identical inconsistency between source/target from another guest using the same workload and hitting a crash:

root@boslcp5:~/dumps-cmp-migtest2# xxd -s 38273024 -l 128 0-2.boslcp5
02480000: c000 0100 0000 0000 2f62 6c61 7374 2f76  ......../blast/v
02480010: 6462 3400 0000 0000 0000 0000 0000 0000  db4.............
02480020: 0000 0000 0000 0000 0000 0000 0000 0000  ................
02480030: 0000 0000 0000 0000 0000 0000 0000 0000  ................
02480040: 0000 0000 0000 0000 0000 0000 0000 0000  ................
02480050: 0000 0000 0000 0000 0000 0000 0000 0000  ................
02480060: 0000 0000 0000 0000 0000 0000 0000 0000  ................
02480070: 0000 0000 0000 0000 0000 0000 0000 0000  ................

root@boslcp5:~/dumps-cmp-migtest2# xxd -s 38273024 -l 128 0-2.boslcp6
02480000: 0000 0000 0000 0000 0000 0000 0000 0000  ................
02480010: 0000 0000 0000 0000 0000 0000 0000 0000  ................
02480020: 0000 0000 0000 0000 0000 0000 0000 0000  ................
02480030: 0000 0000 0000 0000 0000 0000 0000 0000  ................
02480040: 0000 0000 0000 0000 0000 0000 0000 0000  ................
02480050: 0000 0000 0000 0000 0000 0000 0000 0000  ................
02480060: 0000 0000 0000 0000 0000 0000 0000 0000  ................
02480070: 0000 0000 0000 0000 0000 0000 0000 0000  ................

It seems highly likely there's an issue related to dirty bitmap tracking at play here. I could use some help from kernel folks on figuring out where that might lie. The crashed guests are still live at the moment, so let me know if there's anything I should try to gather.

-- 
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1768115

Title:
  ISST-LTE:KVM:Ubuntu1804:BostonLC:boslcp3g1: Migration guest running with IO stress crashed@security_file_permission+0xf4/0x160.
