Stefan,

This looks like a separate bug (as we discussed). Please file another bug for this when you have time.
** Description changed:

[Impact]

Certain workloads that need to execute functions on a non-local CPU using smp_call_function_* can result in soft lockups with the following backtrace:

PID: 22262  TASK: ffff8804274bb000  CPU: 1  COMMAND: "qemu-system-x86"
 #0 [ffff88043fd03d18] machine_kexec at ffffffff8104ac02
 #1 [ffff88043fd03d68] crash_kexec at ffffffff810e7203
 #2 [ffff88043fd03e30] panic at ffffffff81719ff4
 #3 [ffff88043fd03ea8] watchdog_timer_fn at ffffffff8110d7c5
 #4 [ffff88043fd03ed8] __run_hrtimer at ffffffff8108e787
 #5 [ffff88043fd03f18] hrtimer_interrupt at ffffffff8108ef4f
 #6 [ffff88043fd03f80] local_apic_timer_interrupt at ffffffff81043537
 #7 [ffff88043fd03f98] smp_apic_timer_interrupt at ffffffff81733d4f
 #8 [ffff88043fd03fb0] apic_timer_interrupt at ffffffff817326dd
--- <IRQ stack> ---
 #9 [ffff880426f0d958] apic_timer_interrupt at ffffffff817326dd
    [exception RIP: generic_exec_single+130]
    RIP: ffffffff810dbe62  RSP: ffff880426f0da00  RFLAGS: 00000202
    RAX: 0000000000000002  RBX: ffff880426f0d9d0  RCX: 0000000000000001
    RDX: ffffffff8180ad60  RSI: 0000000000000000  RDI: 0000000000000286
    RBP: ffff880426f0da30  R8:  ffffffff8180ad48  R9:  ffff88042713bc68
    R10: 00007fe7d1f2dbd0  R11: 0000000000000206  R12: ffff8804274bb000
    R13: 0000000000000000  R14: ffff880407670280  R15: 0000000000000000
    ORIG_RAX: ffffffffffffff10  CS: 0010  SS: 0018
#10 [ffff880426f0da38] smp_call_function_single at ffffffff810dbf75
#11 [ffff880426f0dab0] smp_call_function_many at ffffffff810dc3a6
#12 [ffff880426f0db10] native_flush_tlb_others at ffffffff8105c8f7
#13 [ffff880426f0db38] flush_tlb_mm_range at ffffffff8105c9cb
#14 [ffff880426f0db68] pmdp_splitting_flush at ffffffff8105b80d
#15 [ffff880426f0db88] __split_huge_page at ffffffff811ac90b
#16 [ffff880426f0dc20] split_huge_page_to_list at ffffffff811acfb8
#17 [ffff880426f0dc48] __split_huge_page_pmd at ffffffff811ad956
#18 [ffff880426f0dcc8] unmap_page_range at ffffffff8117728d
#19 [ffff880426f0dda0] unmap_single_vma at ffffffff81177341
#20 [ffff880426f0ddd8] zap_page_range at ffffffff811784cd
#21 [ffff880426f0de90] sys_madvise at ffffffff81174fbf
#22 [ffff880426f0df80] system_call_fastpath at ffffffff8173196d
    RIP: 00007fe7ca2cc647  RSP: 00007fe7be9febf0  RFLAGS: 00000293
    RAX: 000000000000001c  RBX: ffffffff8173196d  RCX: ffffffffffffffff
    RDX: 0000000000000004  RSI: 00000000007fb000  RDI: 00007fe7be1ff000
    RBP: 0000000000000000  R8:  0000000000000000  R9:  00007fe7d1cd2738
    R10: 00007fe7d1f2dbd0  R11: 0000000000000206  R12: 00007fe7be9ff700
    R13: 00007fe7be9ff9c0  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: 000000000000001c  CS: 0033  SS: 002b

[Workaround]

To avoid this issue, the workload needs to be pinned to CPUs such that the function always executes locally. For the nested VM case, this means the L1 VM needs to have each vCPU pinned to a unique CPU. This can be accomplished with the following (for 2 vCPUs):

virsh vcpupin <domain> 0 0
virsh vcpupin <domain> 1 1

[Test Case]

- Deploy openstack on openstack
- Run tempest on the L1 cloud
- Check the kernel log of the L1 nova-compute nodes

(Although this may not necessarily be related to nested KVM.)

Potentially related: https://lkml.org/lkml/2014/11/14/656

--

Original Description:

When installing qemu-kvm on a VM, KSM is enabled. I have encountered this problem in trusty:

$ lsb_release -a
Distributor ID: Ubuntu
Description:    Ubuntu 14.04.1 LTS
Release:        14.04
Codename:       trusty

$ uname -a
Linux juju-gema-machine-2 3.13.0-40-generic #69-Ubuntu SMP Thu Nov 13 17:53:56 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

The way to see the behaviour:

1) $ more /sys/kernel/mm/ksm/run
   0
2) $ sudo apt-get install qemu-kvm
3) $ more /sys/kernel/mm/ksm/run
   1

To see the soft lockups, deploy a cloud on a virtualised env like ctsstack and run tempest on it at least twice; the compute nodes of the virtualised deployment will eventually stop responding with:

[24096.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791]
[24124.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791]
[24152.072002] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]
[24180.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]
[24208.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]
[24236.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]
[24264.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]

I am not sure whether the problem is that we are enabling KSM on a VM, or that nested KSM is not behaving properly. Either way I can easily reproduce it; please contact me if you need further details.

--
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1413540

Title:
  Trusty soft lockup issues with nested KVM

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1413540/+subscriptions

--
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
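[Editor's note] The workaround above pins each vCPU to its own host CPU by hand. For guests with more vCPUs, the same 1:1 mapping can be generated with a short shell loop. This is a sketch, not part of the original report: the function below only prints the virsh commands it would run (drop the echo-style indirection to apply them), and the domain name is a placeholder.

```shell
# gen_vcpupin DOMAIN NVCPUS
# Print one "virsh vcpupin" command per vCPU, mapping vCPU i to host
# CPU i (1:1), as in the [Workaround] section. Printing rather than
# executing lets you review the pinning before applying it.
gen_vcpupin() {
    domain="$1"
    nvcpus="$2"
    i=0
    while [ "$i" -lt "$nvcpus" ]; do
        echo "virsh vcpupin $domain $i $i"
        i=$((i + 1))
    done
}

# Example for a 2-vCPU guest ("<domain>" is a placeholder):
gen_vcpupin "<domain>" 2
```

To apply the pinning instead of previewing it, pipe the output to sh (e.g. `gen_vcpupin mydomain 4 | sh`) on a host where the libvirt domain exists.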
