** Description changed:

+ SRU Justification:
+
+ [ Impact ]
+
+  * While doing ISST testing it turned out that a 2nd level (KVM)
+    guest (aka VM) continuously dumped when running an NFS
+    guest migration.
+
+ [ Test Plan ]
+
+  * Set up two IBM Power 10 systems (with firmware 1060, which offers
+    support for KVM) with Ubuntu Server 24.04 for ppc64el.
+
+  * Set up qemu/KVM on both of these systems to allow guest migration.
+
+  * Set up a KVM guest and place its disk on an NFS volume.
+
+  * Now initiate a guest migration.
+
+  * Without the two patches the initiator system will start to dump.
+
+  * Since this setup requires a special firmware level,
+    the verification will be done by the IBM Power team.
+
+ [ Where problems could occur ]
+
+  * Although the patch set looks huge,
+    the patches themselves are relatively small and not very invasive,
+    and I would consider them mainly as fixes.
+
+  * kvmppc_set_one_reg_hv() wrongly does a get() of the value
+    instead of a set() for MMCR3.
+
+  * And kvmppc_get_one_reg_hv() for SDAR is wrongly getting
+    the SIAR instead of the SDAR - which is quite traceable.
+
+  * Then a one-reg interface for the DEXCR register, KVM_REG_PPC_DEXCR,
+    is introduced. Here issues can happen if the initialization
+    or the case statement is done wrong.
+    A fix was added to keep nested guest DEXCR in sync.
+    The guest state element defined for DEXCR was already there,
+    but not really considered - this is fixed now (DEXCR GSID).
+    If the initialization or the code in the case statement is wrong,
+    this can harm the guest state.
+    Guest state may get out of sync.
+
+  * Another one-reg register identifier was introduced
+    that is used to read and set the virtual HASHKEYR
+    for the guest during enter/exit with KVM_REG_PPC_HASHKEYR.
+    Again initialization and the case code are critical.
+    Code was added to keep nested guest HASHKEYR in sync.
+    Again the state element defined for HASHKEYR was there,
+    but not considered, which is fixed now (HASHKEYR GSID).
+    If the initialization or the code in the case statement is wrong,
+    this can harm the guest state.
+    This can harm the L2 guest during enter or exit.
+
+  * Again another one-reg identifier was introduced
+    that is used to read and set the virtual HASHPKEYR
+    for the guest during enter/exit with KVM_REG_PPC_HASHPKEYR.
+    And again the guest state element defined for HASHPKEYR
+    was there but ignored, which is now fixed (HASHPKEYR GSID).
+    If the initialization or the code in the case statement is wrong,
+    this can harm the guest state.
+    This can harm the L2 guest during enter or exit.
+
+ [ Other Info ]
+
+  * Since (nested) KVM support is new on P10,
+    this does not affect older Power generations
+    (P9 is the only other hw generation that is supported by 24.04,
+    but it only supports native virtualization).
+
+  * Both patches are upstream accepted since v6.11(-rc1),
+    hence will be in oracular
+    and are also upstream tagged as stable updates.
+
+  * Since the required firmware FW1060 is relatively new,
+    we can assume that not many users have run into this issue yet.
+ __________
+
  == Comment: #0 - SEETEENA THOUFEEK <[email protected]> - 2024-08-09 03:50:24 ==
  +++ This bug was initially created as a clone of Bug #206737 +++

  ---Problem Description---
  L2 Guest migration: evelp2g4[L2]: while running NFS guest migration continuously dumping smp_call_function_many_cond+0x500/0x738 (unreliable) and watchdog: BUG: soft lockup - CPU#14 stuck for 223s! [systemd-homed}
-
+
  ---uname output---
  NA
-
- Machine Type = NA
-
+
+ Machine Type = NA
+
  Contact Information = NA

  [79205.163691] Hardware name: IBM pSeries (emulated by qemu) POWER10 (raw) 0x800200 0xf000006 of:SLOF,HEAD hv:linux,kvm pSeries
  [79205.163834] NIP: c0000000002bb7a4 LR: c0000000002bb750 CTR: c0000000000d192c
- [79205.163929] REGS: c0000003871cf1b0 TRAP: 0900 Tainted: G L
+ [79205.163929] REGS: c0000003871cf1b0 TRAP: 0900 Tainted: G L
  [79205.165041] MSR: 800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 44042222 XER: 20040004
  [79205.165266] CFAR: 0000000000000000 IRQMASK: 0
- GPR00: c0000000002bbc58 c0000003871cf450 c0000000020ded00 0000000000000009
- GPR04: 0000000000000009 0000000000000009 0000000000000080 0000000000000200
- GPR08: 00000000000001ff 0000000000000001 c000000740f57ee0 0000000044048222
- GPR12: c0000000000d192c c000000743ddc980 0000000000000000 0000000000000000
- GPR16: 0000000000000000 c00000000d86e200 0000000000000001 0000000000000001
- GPR20: 000000000000000c c000000003d06188 c0000000000ac4d0 c00000000a374e00
- GPR24: c000000003d06840 0000000000000000 c000000741193188 c000000741193188
- GPR28: c000000741193180 c000000003d06840 0000000000000048 0000000000000009
+ GPR00: c0000000002bbc58 c0000003871cf450 c0000000020ded00 0000000000000009
+ GPR04: 0000000000000009 0000000000000009 0000000000000080 0000000000000200
+ GPR08: 00000000000001ff 0000000000000001 c000000740f57ee0 0000000044048222
+ GPR12: c0000000000d192c c000000743ddc980 0000000000000000 0000000000000000
+ GPR16: 0000000000000000 c00000000d86e200 0000000000000001 0000000000000001
+ GPR20: 000000000000000c c000000003d06188 c0000000000ac4d0 c00000000a374e00
+ GPR24: c000000003d06840 0000000000000000 c000000741193188 c000000741193188
+ GPR28: c000000741193180 c000000003d06840 0000000000000048 0000000000000009
  [79205.171660] NIP [c0000000002bb7a4] smp_call_function_many_cond+0x1e0/0x738
  [79205.171752] LR [c0000000002bb750] smp_call_function_many_cond+0x18c/0x738
  [79205.171835] Call Trace:
  [79205.171869] [c0000003871cf450] [c0000000002bbc58] smp_call_function_many_cond+0x694/0x738 (unreliable)
  [79205.171986] [c0000003871cf520] [c0000000000ac4d0] radix__tlb_flush+0x4c/0x140
  [79205.173636] [c0000003871cf560] [c00000000052e900] tlb_finish_mmu+0x130/0x1f0
  [79205.173754] [c0000003871cf590] [c00000000052a280] exit_mmap+0x1cc/0x574
  [79205.173848] [c0000003871cf6c0] [c00000000016ec9c] __mmput+0x54/0x1d4
  [79205.173939] [c0000003871cf6f0] [c0000000006385c4] begin_new_exec+0x6dc/0xefc
  [79205.174037] [c0000003871cf780] [c0000000006edea8] load_elf_binary+0x4c8/0x1a50
  [79205.174136] [c0000003871cf880] [c0000000006361c8] bprm_execve+0x2b4/0x7a0
  [79205.174219] [c0000003871cf950] [c000000000637988] do_execveat_common+0x1c0/0x2d8
  [79205.174316] [c0000003871cf9f0] [c000000000638e38] sys_execve+0x54/0x6c
  [79205.174399] [c0000003871cfa20] [c00000000002fec8] system_call_exception+0x168/0x310
  [79205.174497] [c0000003871cfe50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec
  [79205.176245] --- interrupt: 3000 at 0x7fff95b10b08
  [79205.176326] NIP: 00007fff95b10b08 LR: 00007fff95b10b08 CTR: 0000000000000000
  [79205.176438] REGS: c0000003871cfe80 TRAP: 3000 Tainted: G L (
  [79205.176558] MSR: 800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE> CR: 48044424 XER: 00000000
  [79205.176686] IRQMASK: 0
- GPR00: 000000000000000b 00007fffe6919aa0 00007fff95c47c00 0000000152598c80
- GPR04: 00007fffe6919bf8 00000001525db6e0 ffffffffffffffff 00007fffe6919a20
- GPR08: 0000000152598c88 0000000000000000 0000000000000000 0000000000000000
- GPR12: 0000000000000000 00007fff969a4220 0000000152585570 0000000000000000
- GPR16: 00007fffe6919c48 0000000000000570 0000000152598c80 0000000000000000
- GPR20: 0000000000000000 0000000000009998 000000015259a450 0000000152586460
- GPR24: 00000001525bca90 00007fffe6919e48 0000000000000000 00000001525db6e0
- GPR28: 0000000117e98448 00000001525d0b00 0000000000000000 0000000000100000
+ GPR00: 000000000000000b 00007fffe6919aa0 00007fff95c47c00 0000000152598c80
+ GPR04: 00007fffe6919bf8 00000001525db6e0 ffffffffffffffff 00007fffe6919a20
+ GPR08: 0000000152598c88 0000000000000000 0000000000000000 0000000000000000
+ GPR12: 0000000000000000 00007fff969a4220 0000000152585570 0000000000000000
+ GPR16: 00007fffe6919c48 0000000000000570 0000000152598c80 0000000000000000
+ GPR20: 0000000000000000 0000000000009998 000000015259a450 0000000152586460
+ GPR24: 00000001525bca90 00007fffe6919e48 0000000000000000 00000001525db6e0
+ GPR28: 0000000117e98448 00000001525d0b00 0000000000000000 0000000000100000
  [79205.177505] NIP [00007fff95b10b08] 0x7fff95b10b08
  [79205.177578] LR [00007fff95b10b08] 0x7fff95b10b08
  [79205.177649] --- interrupt: 3000
-
- Steps to reproduce: Install the build on NFS storage guest kernel 6.8.10-300
+
+ Steps to reproduce: Install the build on NFS storage guest kernel
+ 6.8.10-300
  Start the HTX workload - mdt.less
  Start the NFS guest migration between the L2 hosts.
- Source L2 host : evelp2
+ Source L2 host : evelp2
  Target L2 host : rinlp1
  migration command : virsh migrate --live --domain $vm_name qemu+ssh://$target_host/system --verbose --undefinesource --persistent --timeout 120
- Share the same NFS storage between two hosts [here /kvm_pool]
+ Share the same NFS storage between two hosts [here /kvm_pool]
  10.33.4.52:/kvm_pool nfs4 650G 304G 347G 47% /kvm_pool
  Test running : HTX
  Guest state : up
- -------------------------------------------------------------------------------------
+ -------------------------------------------------------------------------------------
  --------------------------------------
  L2 guest Config:
  (1) Problem on Guest: evelp2g4
  (2) PHYP/ Processor Type: KVM/P10/Everest
  (3) Rootvg Filesystem: EXT4
-
  (5) Network Bridge: Macvtap
  (6) IO Disk Type/Driver: qemu-img/ qcow2
  (7) Install Disk Type: Single
- -------------------------------------------------------------------------------------
+ -------------------------------------------------------------------------------------
  --------------------------------------
  L1 host details :
  MDC mode : off
  (1) PHYP/ Processor Type: KVM/P10/Everest
  (2) CEC Name: evelp2
  (3) Rootvg Filesystem: xfs
-
  (5) Network Interface: Dedicated Network
  (6) IO Type: NVME
-
  (8) Multipath Enabled: no
  (9) Install Disk Type: Single
  (10) MMU: RPT
-
  The kernel patches are at https://lore.kernel.org/kvm/[email protected]/T/#t
  Qemu patches are at https://lore.kernel.org/qemu-devel/171760304518.1127.12881297254648658843.stgit@ad1b393f0e09/

  powerpc/topic/ppc-kvm.

  [1/8] KVM: PPC: Book3S HV: Fix the set_one_reg for MMCR3
        https://git.kernel.org/powerpc/c/f9ca6a10be20479d526f27316cc32cfd1785ed39
  [2/8] KVM: PPC: Book3S HV: Fix the get_one_reg of SDAR
        https://git.kernel.org/powerpc/c/009f6f42c67e9de737d6d3d199f92b21a8cb9622
  [3/8] KVM: PPC: Book3S HV: Add one-reg interface for DEXCR register
        https://git.kernel.org/powerpc/c/1a1e6865f516696adcf6e94f286c7a0f84d78df3
  [4/8] KVM: PPC: Book3S HV nestedv2: Keep nested guest DEXCR in sync
        https://git.kernel.org/powerpc/c/2d6be3ca3276ab30fb14f285d400461a718d45e7
  [5/8] KVM: PPC: Book3S HV: Add one-reg interface for HASHKEYR register
        https://git.kernel.org/powerpc/c/e9eb790b25577a15d3f450ed585c59048e4e6c44
  [6/8] KVM: PPC: Book3S HV nestedv2: Keep nested guest HASHKEYR in sync
        https://git.kernel.org/powerpc/c/1e97c1eb785fe2dc863c2bd570030d6fcf4b5e5b
  [7/8] KVM: PPC: Book3S HV: Add one-reg interface for HASHPKEYR register
        https://git.kernel.org/powerpc/c/9a0d2f4995ddde3022c54e43f9ece4f71f76f6e8
  [8/8] KVM: PPC: Book3S HV nestedv2: Keep nested guest HASHPKEYR in sync
        https://git.kernel.org/powerpc/c/0b65365f3fa95c2c5e2094739151a05cabb3c48a
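
For readers who want to see why [1/8] and [2/8] matter for migration, here is a minimal toy sketch in C. It is not the kernel code: the struct, the enum IDs and every toy_* function are hypothetical names made up for illustration. It only assumes that migration copies vCPU state by doing a get on the source and a set on the destination, and it models the two mistakes named above (the MMCR3 setter doing a get, the SDAR getter returning SIAR).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register IDs for this sketch; not the kernel's KVM_REG_PPC_* values. */
enum toy_reg { TOY_REG_MMCR3, TOY_REG_SIAR, TOY_REG_SDAR, TOY_REG_COUNT };

/* Toy stand-in for the vCPU fields that one-reg accessors expose. */
struct toy_vcpu {
    uint64_t mmcr3;
    uint64_t siar;
    uint64_t sdar;
};

typedef int (*toy_get_fn)(const struct toy_vcpu *, enum toy_reg, uint64_t *);
typedef int (*toy_set_fn)(struct toy_vcpu *, enum toy_reg, uint64_t *);

/* Fixed getter: every ID reads its own field. */
static int toy_get_fixed(const struct toy_vcpu *v, enum toy_reg id, uint64_t *val)
{
    switch (id) {
    case TOY_REG_MMCR3: *val = v->mmcr3; return 0;
    case TOY_REG_SIAR:  *val = v->siar;  return 0;
    case TOY_REG_SDAR:  *val = v->sdar;  return 0;
    default:            return -1;
    }
}

/* Buggy getter: the SDAR case returns SIAR (the mistake described for [2/8]). */
static int toy_get_buggy(const struct toy_vcpu *v, enum toy_reg id, uint64_t *val)
{
    if (id == TOY_REG_SDAR) {
        *val = v->siar; /* bug: wrong source field */
        return 0;
    }
    return toy_get_fixed(v, id, val);
}

/* Fixed setter: every ID stores the caller's value. */
static int toy_set_fixed(struct toy_vcpu *v, enum toy_reg id, uint64_t *val)
{
    switch (id) {
    case TOY_REG_MMCR3: v->mmcr3 = *val; return 0;
    case TOY_REG_SIAR:  v->siar = *val;  return 0;
    case TOY_REG_SDAR:  v->sdar = *val;  return 0;
    default:            return -1;
    }
}

/* Buggy setter: the MMCR3 case reads the old value back (the mistake described for [1/8]). */
static int toy_set_buggy(struct toy_vcpu *v, enum toy_reg id, uint64_t *val)
{
    if (id == TOY_REG_MMCR3) {
        *val = v->mmcr3; /* bug: get instead of set, nothing is stored */
        return 0;
    }
    return toy_set_fixed(v, id, val);
}

/* Copy all registers: get on the source, set on the destination. */
static void toy_migrate(const struct toy_vcpu *src, struct toy_vcpu *dst,
                        toy_get_fn get, toy_set_fn set)
{
    for (int id = 0; id < TOY_REG_COUNT; id++) {
        uint64_t val = 0;
        int rc = get(src, (enum toy_reg)id, &val);
        assert(rc == 0);
        rc = set(dst, (enum toy_reg)id, &val);
        assert(rc == 0);
        (void)rc;
    }
}

/* Returns 1 if the destination matches the source after the copy, 0 otherwise. */
int toy_roundtrip_ok(int use_buggy)
{
    const struct toy_vcpu src = { .mmcr3 = 0x11, .siar = 0x22, .sdar = 0x33 };
    struct toy_vcpu dst = { 0 };

    toy_migrate(&src, &dst,
                use_buggy ? toy_get_buggy : toy_get_fixed,
                use_buggy ? toy_set_buggy : toy_set_fixed);
    return dst.mmcr3 == src.mmcr3 && dst.siar == src.siar && dst.sdar == src.sdar;
}
```

Calling toy_roundtrip_ok(0) exercises the fixed accessors and toy_roundtrip_ok(1) the buggy ones. Both buggy accessors report success, so in this toy model the destination ends up with stale or wrong register values without any error being raised.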
-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2076406

Title:
  L2 Guest migration: continuously dumping while running NFS guest
  migration

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/2076406/+subscriptions

-- 
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
