------- Comment From [email protected] 2018-05-08 16:37 EDT-------

Hit another instance of the RAM inconsistencies prior to resuming the guest on the target side (this one is migrating from boslcp6 to boslcp5 and crashing after it resumes execution on boslcp5). The signature is eerily similar to the ones above... the workload is blast from LTP, but it's strange that 3 out of 3 so far have hit the same data structure. Maybe there's a relationship between something the process is doing and dirty syncing?
root@boslcp5:~/vm_logs/1525768538/dumps# xxd -s 20250624 -l 128 0-2.vm0.iteration2a
01350000: 0000 0000 0000 0000 0000 0000 0000 0000  ................
01350010: 0000 0000 0000 0000 0000 0000 0000 0000  ................
01350020: 0000 0000 0000 0000 0000 0000 0000 0000  ................
01350030: 0000 0000 0000 0000 0000 0000 0000 0000  ................
01350040: 0000 0000 0000 0000 0000 0000 0000 0000  ................
01350050: 0000 0000 0000 0000 0000 0000 0000 0000  ................
01350060: 0000 0000 0000 0000 0000 0000 0000 0000  ................
01350070: 0000 0000 0000 0000 0000 0000 0000 0000  ................

root@boslcp5:~/vm_logs/1525768538/dumps# xxd -s 20250624 -l 128 0-2.vm0.iteration2a.boslcp6
01350000: d603 0100 0000 0000 2f62 6c61 7374 2f76  ......../blast/v
01350010: 6463 3400 0000 0000 0000 0000 0000 0000  dc4.............
01350020: 0000 0000 0000 0000 0000 0000 0000 0000  ................
01350030: 0000 0000 0000 0000 0000 0000 0000 0000  ................
01350040: 0000 0000 0000 0000 0000 0000 0000 0000  ................
01350050: 0000 0000 0000 0000 0000 0000 0000 0000  ................
01350060: 0000 0000 0000 0000 0000 0000 0000 0000  ................
01350070: 0000 0000 0000 0000 0000 0000 0000 0000  ................

For this run I included traces of the various stages of memory migration on the QEMU side relative to dirty bitmap sync (attached above). The traced phases are:

"ram_save_setup": enables dirty logging, sets up the data structures used for tracking dirty pages, and does the initial bitmap sync. QEMU keeps its own copy of the dirty bitmap, which gets OR'd with the one provided by KVM on each bitmap sync. There are two blocks (ram-node0/ram-node1), each with its own bitmap/KVM memslot, since the guest was defined with two NUMA nodes. Only ram-node0 would be relevant here since it has offset 0 in the guest physical memory address space.

"ram_save_pending": called before each iteration to see if there are pages still pending. When the number of dirty pages in the QEMU bitmap drops below a certain value it does another sync with KVM's bitmap.

"ram_save_iterate": walks the QEMU dirty bitmap and sends the corresponding pages until there are none left or some other limit (e.g. bandwidth throttling or max-pages-per-iteration) is hit. "ram_save_pending"/"ram_save_iterate" keep repeating until no more pages are left.

"ram_save_complete": does a final sync with the KVM bitmap, sends the final set of pages, then disables dirty logging and completes the migration.

"vm_stop" denotes when the guest VCPUs have all exited and stopped execution.

There are two migrations reflected in the posted traces; the first one (everything between the first ram_save_setup and the first ram_save_complete) can be ignored, it's just a backup of the VM. After the VM is backed up it resumes execution, and that's the state we're migrating here and seeing a crash with on the other end.

The sequence of events in this run is comparable to previous successful runs: no strange orderings or missed calls to sync with the KVM dirty bitmap, etc. The condensed version of the trace is below. It looks like there's a sync prior to vm_stop and a sync afterward, and given that these syncs are OR'd into a persistent bitmap maintained by QEMU, there shouldn't be any loss of dirty page information with this particular ordering of events.
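To make the "OR'd into a persistent bitmap" argument concrete, here is a minimal, self-contained sketch of that merge step. This is not the actual QEMU code; the names (migration_bitmap, sync_dirty_bitmap, find_and_clear_dirty) and the toy block size are made up for illustration. The point is just that each KVM sync is OR'd in and bits are only cleared when the corresponding page is actually sent, so interleaving syncs with vm_stop cannot by itself drop dirty-page information:

    #include <stddef.h>
    #include <stdio.h>

    #define PAGE_SHIFT    12                /* 4 KiB pages, as in the traces above */
    #define NPAGES        64                /* toy RAM block size for illustration */
    #define BITS_PER_LONG (8 * sizeof(unsigned long))
    #define BITMAP_LONGS  ((NPAGES + BITS_PER_LONG - 1) / BITS_PER_LONG)

    /* Persistent migration bitmap kept by "QEMU" for one RAM block. */
    static unsigned long migration_bitmap[BITMAP_LONGS];

    /*
     * Merge a bitmap handed back by a "KVM" dirty-log sync into the persistent
     * migration bitmap.  Because this is a bitwise OR, a page marked dirty by an
     * earlier sync stays dirty until the sender explicitly clears it after
     * transmitting the page.
     */
    static unsigned long sync_dirty_bitmap(const unsigned long *kvm_bitmap)
    {
        unsigned long newly_dirty = 0;

        for (size_t i = 0; i < BITMAP_LONGS; i++) {
            unsigned long added = kvm_bitmap[i] & ~migration_bitmap[i];
            migration_bitmap[i] |= kvm_bitmap[i];
            newly_dirty += __builtin_popcountl(added);
        }
        return newly_dirty;
    }

    /* Sender side: find a dirty page, clear its bit, return its page index. */
    static long find_and_clear_dirty(void)
    {
        for (size_t i = 0; i < BITMAP_LONGS; i++) {
            if (migration_bitmap[i]) {
                int bit = __builtin_ctzl(migration_bitmap[i]);
                migration_bitmap[i] &= ~(1UL << bit);
                return (long)(i * BITS_PER_LONG + bit);
            }
        }
        return -1;                          /* nothing left to send */
    }

    int main(void)
    {
        unsigned long kvm_bitmap[BITMAP_LONGS] = { 0 };
        long page;

        kvm_bitmap[0] = 0x5;                /* pages 0 and 2 dirtied before vm_stop */
        printf("newly dirty: %lu\n", sync_dirty_bitmap(kvm_bitmap));

        kvm_bitmap[0] = 0x6;                /* pages 1 and 2 dirtied again afterward */
        printf("newly dirty: %lu\n", sync_dirty_bitmap(kvm_bitmap));

        while ((page = find_and_clear_dirty()) >= 0) {
            printf("send page %ld (guest addr 0x%lx)\n",
                   page, (unsigned long)page << PAGE_SHIFT);
        }
        return 0;
    }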
[email protected]: >ram_save_setup
[email protected]: migration_bitmap_sync, count: 4
[email protected]: qemu_global_log_sync
[email protected]: qemu_global_log_sync, name: ram-node0, addr: 0
[email protected]: kvm_log_sync, addr: 0, size: 280000000
[email protected]: qemu_global_log_sync, name: ram-node1, addr: 280000000
[email protected]: kvm_log_sync, addr: 280000000, size: 280000000
[email protected]: qemu_global_log_sync, name: vga.vram, addr: 80000000
[email protected]: kvm_log_sync, addr: 200080000000, size: 1000000
[email protected]: qemu_global_log_sync, name: ram-node0, addr: 0
[email protected]: qemu_global_log_sync, name: ram-node1, addr: 280000000
[email protected]: qemu_global_log_sync, name: vga.vram, addr: 80000000
[email protected]: migration_bitmap_sync, id: ram-node0, block->mr->name: ram-node0, block->used_length: 280000000h
[email protected]: migration_bitmap_sync, id: ram-node1, block->mr->name: ram-node1, block->used_length: 280000000h
[email protected]: <ram_save_setup
[email protected]: >ram_save_pending, dirty pages remaining: 5247120, page size: 4096
[email protected]: <ram_save_pending, non_postcopiable_pending: 21492203520 bytes
[email protected]: >ram_save_iterate, iteration: 0
[email protected]: <ram_save_iterate, pages sent: 848, done: 0
...
[email protected]: >ram_save_pending, dirty pages remaining: 4362032, page size: 4096
[email protected]: <ram_save_pending, non_postcopiable_pending: 17866883072 bytes
[email protected]: >ram_save_iterate, iteration: 55318
[email protected]: <ram_save_iterate, pages sent: 976, done: 0
[email protected]: vm_stop
[email protected]: >ram_save_pending, dirty pages remaining: 4361056, page size: 4096
[email protected]: <ram_save_pending, non_postcopiable_pending: 17862885376 bytes
[email protected]: >ram_save_iterate, iteration: 55379
[email protected]: <ram_save_iterate, pages sent: 832, done: 0
...
[email protected]: >ram_save_iterate, iteration: 325307
[email protected]: <ram_save_iterate, pages sent: 832, done: 0
[email protected]: >ram_save_pending, dirty pages remaining: 41376, page size: 4096
[email protected]: migration_bitmap_sync, count: 5
[email protected]: qemu_global_log_sync
[email protected]: qemu_global_log_sync, name: ram-node0, addr: 0
[email protected]: kvm_log_sync, addr: 0, size: 280000000
[email protected]: qemu_global_log_sync, name: ram-node1, addr: 280000000
[email protected]: kvm_log_sync, addr: 280000000, size: 280000000
[email protected]: qemu_global_log_sync, name: vga.vram, addr: 80000000
[email protected]: kvm_log_sync, addr: 200080000000, size: 1000000
[email protected]: qemu_global_log_sync, name: ram-node0, addr: 0
[email protected]: qemu_global_log_sync, name: ram-node1, addr: 280000000
[email protected]: qemu_global_log_sync, name: vga.vram, addr: 80000000
[email protected]: migration_bitmap_sync, id: ram-node0, block->mr->name: ram-node0, block->used_length: 280000000h
[email protected]: migration_bitmap_sync, id: ram-node1, block->mr->name: ram-node1, block->used_length: 280000000h
[email protected]: <ram_save_pending, non_postcopiable_pending: 7058665472 bytes
[email protected]: >ram_save_iterate, iteration: 325359
[email protected]: <ram_save_iterate, pages sent: 1120, done: 0
[email protected]: >ram_save_pending, dirty pages remaining: 1722187, page size: 4096
[email protected]: <ram_save_pending, non_postcopiable_pending: 7054077952 bytes
...
[email protected]: >ram_save_iterate, iteration: 430453
[email protected]: <ram_save_iterate, pages sent: 832, done: 0
[email protected]: >ram_save_pending, dirty pages remaining: 41136, page size: 4096
[email protected]: migration_bitmap_sync, count: 6
[email protected]: qemu_global_log_sync
[email protected]: qemu_global_log_sync, name: ram-node0, addr: 0
[email protected]: kvm_log_sync, addr: 0, size: 280000000
[email protected]: qemu_global_log_sync, name: ram-node1, addr: 280000000
[email protected]: kvm_log_sync, addr: 280000000, size: 280000000
[email protected]: qemu_global_log_sync, name: vga.vram, addr: 80000000
[email protected]: kvm_log_sync, addr: 200080000000, size: 1000000
[email protected]: qemu_global_log_sync, name: ram-node0, addr: 0
[email protected]: qemu_global_log_sync, name: ram-node1, addr: 280000000
[email protected]: qemu_global_log_sync, name: vga.vram, addr: 80000000
[email protected]: migration_bitmap_sync, id: ram-node0, block->mr->name: ram-node0, block->used_length: 280000000h
[email protected]: migration_bitmap_sync, id: ram-node1, block->mr->name: ram-node1, block->used_length: 280000000h
[email protected]: <ram_save_pending, non_postcopiable_pending: 168493056 bytes
[email protected]: >ram_save_complete
[email protected]: migration_bitmap_sync, count: 7
[email protected]: qemu_global_log_sync
[email protected]: qemu_global_log_sync, name: ram-node0, addr: 0
[email protected]: kvm_log_sync, addr: 0, size: 280000000
[email protected]: qemu_global_log_sync, name: ram-node1, addr: 280000000
[email protected]: kvm_log_sync, addr: 280000000, size: 280000000
[email protected]: qemu_global_log_sync, name: vga.vram, addr: 80000000
[email protected]: kvm_log_sync, addr: 200080000000, size: 1000000
[email protected]: qemu_global_log_sync, name: ram-node0, addr: 0
[email protected]: qemu_global_log_sync, name: ram-node1, addr: 280000000
[email protected]: qemu_global_log_sync, name: vga.vram, addr: 80000000
[email protected]: migration_bitmap_sync, id: ram-node0, block->mr->name: ram-node0, block->used_length: 280000000h
[email protected]: migration_bitmap_sync, id: ram-node1, block->mr->name: ram-node1, block->used_length: 280000000h
[email protected]: migration_bitmap_sync, id: vga.vram, block->mr->name: vga.vram, block->used_length: 1000000h
[email protected]: migration_bitmap_sync, id: pci@800000020000000:01.0/virtio-net-pci.rom, block->mr->name: virtio-net-pci.rom, block->used_length: 40000h
[email protected]: migration_bitmap_sync, id: pci@800000020000000:03.0/virtio-net-pci.rom, block->mr->name: virtio-net-pci.rom, block->used_length: 40000h
[email protected]: migration_bitmap_sync, id: pci@800000020000000:00.0/vga.rom, block->mr->name: vga.rom, block->used_length: 10000h
[email protected]: ram_save_complete, dirty pages remaining: 41136
[email protected]: <ram_save_complete, pages sent: 41136
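As an aside, the spot checks above were done by hand with xxd at a known offset. For future runs, a small helper along these lines (hypothetical, not part of the existing tooling) could enumerate every 4 KiB page that differs between the source and target dumps, which would make it easier to see whether the mismatches keep landing on the same data structure:

    #include <stdio.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    /*
     * Compare two memory dumps page by page and report the offset of every
     * 4 KiB page that differs.  Example invocation with the dumps from this
     * run:  ./cmp_pages 0-2.vm0.iteration2a 0-2.vm0.iteration2a.boslcp6
     */
    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s <dump-a> <dump-b>\n", argv[0]);
            return 1;
        }

        FILE *a = fopen(argv[1], "rb");
        FILE *b = fopen(argv[2], "rb");
        if (!a || !b) {
            perror("fopen");
            return 1;
        }

        static unsigned char buf_a[PAGE_SIZE], buf_b[PAGE_SIZE];
        unsigned long long offset = 0, mismatches = 0;

        for (;;) {
            size_t ra = fread(buf_a, 1, PAGE_SIZE, a);
            size_t rb = fread(buf_b, 1, PAGE_SIZE, b);
            if (ra == 0 && rb == 0) {
                break;                      /* both files exhausted */
            }
            if (ra != rb || memcmp(buf_a, buf_b, ra) != 0) {
                printf("page differs at offset 0x%llx\n", offset);
                mismatches++;
            }
            offset += PAGE_SIZE;
        }

        printf("%llu differing page(s)\n", mismatches);
        fclose(a);
        fclose(b);
        return mismatches ? 2 : 0;
    }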
