Stefan,
This looks like a separate bug (as we discussed). Please file another bug for 
this when you have time.

** Description changed:

  [Impact]
- Users of nested KVM for testing openstack have soft lockups as follows:
+ Certain workloads that need to execute functions on a non-local CPU using 
smp_call_function_* can result in soft lockups with the following backtrace:
  
  PID: 22262  TASK: ffff8804274bb000  CPU: 1   COMMAND: "qemu-system-x86"
-  #0 [ffff88043fd03d18] machine_kexec at ffffffff8104ac02
-  #1 [ffff88043fd03d68] crash_kexec at ffffffff810e7203
-  #2 [ffff88043fd03e30] panic at ffffffff81719ff4
-  #3 [ffff88043fd03ea8] watchdog_timer_fn at ffffffff8110d7c5
-  #4 [ffff88043fd03ed8] __run_hrtimer at ffffffff8108e787
-  #5 [ffff88043fd03f18] hrtimer_interrupt at ffffffff8108ef4f
-  #6 [ffff88043fd03f80] local_apic_timer_interrupt at ffffffff81043537
-  #7 [ffff88043fd03f98] smp_apic_timer_interrupt at ffffffff81733d4f
-  #8 [ffff88043fd03fb0] apic_timer_interrupt at ffffffff817326dd
+  #0 [ffff88043fd03d18] machine_kexec at ffffffff8104ac02
+  #1 [ffff88043fd03d68] crash_kexec at ffffffff810e7203
+  #2 [ffff88043fd03e30] panic at ffffffff81719ff4
+  #3 [ffff88043fd03ea8] watchdog_timer_fn at ffffffff8110d7c5
+  #4 [ffff88043fd03ed8] __run_hrtimer at ffffffff8108e787
+  #5 [ffff88043fd03f18] hrtimer_interrupt at ffffffff8108ef4f
+  #6 [ffff88043fd03f80] local_apic_timer_interrupt at ffffffff81043537
+  #7 [ffff88043fd03f98] smp_apic_timer_interrupt at ffffffff81733d4f
+  #8 [ffff88043fd03fb0] apic_timer_interrupt at ffffffff817326dd
  --- <IRQ stack> ---
-  #9 [ffff880426f0d958] apic_timer_interrupt at ffffffff817326dd
-     [exception RIP: generic_exec_single+130]
-     RIP: ffffffff810dbe62  RSP: ffff880426f0da00  RFLAGS: 00000202
-     RAX: 0000000000000002  RBX: ffff880426f0d9d0  RCX: 0000000000000001
-     RDX: ffffffff8180ad60  RSI: 0000000000000000  RDI: 0000000000000286
-     RBP: ffff880426f0da30   R8: ffffffff8180ad48   R9: ffff88042713bc68
-     R10: 00007fe7d1f2dbd0  R11: 0000000000000206  R12: ffff8804274bb000
-     R13: 0000000000000000  R14: ffff880407670280  R15: 0000000000000000
-     ORIG_RAX: ffffffffffffff10  CS: 0010  SS: 0018
+  #9 [ffff880426f0d958] apic_timer_interrupt at ffffffff817326dd
+     [exception RIP: generic_exec_single+130]
+     RIP: ffffffff810dbe62  RSP: ffff880426f0da00  RFLAGS: 00000202
+     RAX: 0000000000000002  RBX: ffff880426f0d9d0  RCX: 0000000000000001
+     RDX: ffffffff8180ad60  RSI: 0000000000000000  RDI: 0000000000000286
+     RBP: ffff880426f0da30   R8: ffffffff8180ad48   R9: ffff88042713bc68
+     R10: 00007fe7d1f2dbd0  R11: 0000000000000206  R12: ffff8804274bb000
+     R13: 0000000000000000  R14: ffff880407670280  R15: 0000000000000000
+     ORIG_RAX: ffffffffffffff10  CS: 0010  SS: 0018
  #10 [ffff880426f0da38] smp_call_function_single at ffffffff810dbf75
  #11 [ffff880426f0dab0] smp_call_function_many at ffffffff810dc3a6
  #12 [ffff880426f0db10] native_flush_tlb_others at ffffffff8105c8f7
  #13 [ffff880426f0db38] flush_tlb_mm_range at ffffffff8105c9cb
  #14 [ffff880426f0db68] pmdp_splitting_flush at ffffffff8105b80d
  #15 [ffff880426f0db88] __split_huge_page at ffffffff811ac90b
  #16 [ffff880426f0dc20] split_huge_page_to_list at ffffffff811acfb8
  #17 [ffff880426f0dc48] __split_huge_page_pmd at ffffffff811ad956
  #18 [ffff880426f0dcc8] unmap_page_range at ffffffff8117728d
  #19 [ffff880426f0dda0] unmap_single_vma at ffffffff81177341
  #20 [ffff880426f0ddd8] zap_page_range at ffffffff811784cd
  #21 [ffff880426f0de90] sys_madvise at ffffffff81174fbf
  #22 [ffff880426f0df80] system_call_fastpath at ffffffff8173196d
-     RIP: 00007fe7ca2cc647  RSP: 00007fe7be9febf0  RFLAGS: 00000293
-     RAX: 000000000000001c  RBX: ffffffff8173196d  RCX: ffffffffffffffff
-     RDX: 0000000000000004  RSI: 00000000007fb000  RDI: 00007fe7be1ff000
-     RBP: 0000000000000000   R8: 0000000000000000   R9: 00007fe7d1cd2738
-     R10: 00007fe7d1f2dbd0  R11: 0000000000000206  R12: 00007fe7be9ff700
-     R13: 00007fe7be9ff9c0  R14: 0000000000000000  R15: 0000000000000000
-     ORIG_RAX: 000000000000001c  CS: 0033  SS: 002b
+     RIP: 00007fe7ca2cc647  RSP: 00007fe7be9febf0  RFLAGS: 00000293
+     RAX: 000000000000001c  RBX: ffffffff8173196d  RCX: ffffffffffffffff
+     RDX: 0000000000000004  RSI: 00000000007fb000  RDI: 00007fe7be1ff000
+     RBP: 0000000000000000   R8: 0000000000000000   R9: 00007fe7d1cd2738
+     R10: 00007fe7d1f2dbd0  R11: 0000000000000206  R12: 00007fe7be9ff700
+     R13: 00007fe7be9ff9c0  R14: 0000000000000000  R15: 0000000000000000
+     ORIG_RAX: 000000000000001c  CS: 0033  SS: 002b
+ 
+ [Workaround]
+ 
+ To avoid this issue, pin the workload to CPUs such that the function
+ always executes locally. For the nested VM case, this means that each
+ vCPU of the L1 VM needs to be pinned to a unique CPU. This can be
+ accomplished with the following (for 2 vCPUs):
+ 
+ virsh vcpupin <domain> 0 0
+ virsh vcpupin <domain> 1 1
  
  
  [Test Case]
  - Deploy openstack on openstack
  - Run tempest on L1 cloud
  - Check kernel log of L1 nova-compute nodes
  
  (Although this may not necessarily be related to nested KVM)
  Potentially related: https://lkml.org/lkml/2014/11/14/656
  
  --
  
  Original Description:
  
  When installing qemu-kvm on a VM, KSM is enabled.
  
  I have encountered this problem in trusty:
  $ lsb_release -a
  Distributor ID: Ubuntu
  Description:    Ubuntu 14.04.1 LTS
  Release:        14.04
  Codename:       trusty
  $ uname -a
  Linux juju-gema-machine-2 3.13.0-40-generic #69-Ubuntu SMP Thu Nov 13 
17:53:56 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
  
  The way to see the behaviour:
  1) $ more /sys/kernel/mm/ksm/run
  0
  2) $ sudo apt-get install qemu-kvm
  3) $ more /sys/kernel/mm/ksm/run
  1
  
  To see the soft lockups, deploy a cloud on a virtualised env like ctsstack
  and run tempest on it at least twice. The compute nodes of the virtualised
  deployment will eventually stop responding with:
  [24096.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791]
  [24124.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791]
  [24152.072002] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]
  [24180.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]
  [24208.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]
  [24236.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]
  [24264.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]
  
  I am not sure whether the problem is that we are enabling KSM on a VM or
  that nested KSM is not behaving properly. Either way, I can easily
  reproduce it; please contact me if you need further details.
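
The [Workaround] in the description above pins two vCPUs by hand. For guests with more vCPUs the same commands can be generated in a loop. This is only a minimal sketch: the function name `pin_vcpus`, the `DRY_RUN` switch and the domain name `l1-compute` are assumptions made for illustration, and the 1:1 mapping (vCPU N to host CPU N) assumes the host has at least as many CPUs as the guest has vCPUs.

```shell
#!/bin/sh
# Hypothetical helper: pin vCPU N of a libvirt domain to host CPU N.
# With DRY_RUN=1 (the default here) the commands are only printed;
# set DRY_RUN=0 to actually invoke virsh.
pin_vcpus() {
    domain=$1
    nvcpus=$2
    i=0
    while [ "$i" -lt "$nvcpus" ]; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "virsh vcpupin $domain $i $i"
        else
            virsh vcpupin "$domain" "$i" "$i"
        fi
        i=$((i + 1))
    done
}

# Dry run for a 2-vCPU L1 guest; "l1-compute" is an assumed domain name.
pin_vcpus l1-compute 2
```

Running the dry run first makes it easy to review the pinning before applying it with `DRY_RUN=0`.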

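For the "Check kernel log" step of the test case above, the soft lockup messages can be counted rather than read by eye. The sketch below assumes the log is fed on standard input (for example from `dmesg`); the function name `count_soft_lockups` is hypothetical, and the sample lines are copied from the original description.

```shell
#!/bin/sh
# Hypothetical helper: count "BUG: soft lockup" lines read from stdin.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the
# function's exit status clean in that case.
count_soft_lockups() {
    grep -c 'BUG: soft lockup' || true
}

# Example with two lines taken from the report; prints 2.
count_soft_lockups <<'EOF'
[24124.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791]
[24152.072002] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]
EOF
```

On an L1 nova-compute node this would typically be used as `dmesg | count_soft_lockups`.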
-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1413540

Title:
  Trusty soft lockup issues with nested KVM

