Public bug reported:

Problem Description:
--------------------
The PowerKVM guest actg1 hangs while running IO and TCP tests. It is not
possible to ssh or virsh console into the guest.

While running the IO and TCP tests, the guest actg1 reports the following
CPU stall trace:

[16091.824841] INFO: rcu_sched self-detected stall on CPU { 1}  (t=670218 
jiffies g=122793 c=122792 q=6344067)
[16154.853532] INFO: rcu_sched self-detected stall on CPU { 1}  (t=676521 
jiffies g=122793 c=122792 q=6345885)
[16217.882223] INFO: rcu_sched self-detected stall on CPU { 1}  (t=682824 
jiffies g=122793 c=122792 q=6347609)
[16280.910913] INFO: rcu_sched self-detected stall on CPU { 1}  (t=689127 
jiffies g=122793 c=122792 q=6349994)
[16343.939603] INFO: rcu_sched self-detected stall on CPU { 1}  (t=695430 
jiffies g=122793 c=122792 q=6351789)
[16406.968293] INFO: rcu_sched self-detected stall on CPU { 1}  (t=701733 
jiffies g=122793 c=122792 q=6353519)
[16469.996983] INFO: rcu_sched self-detected stall on CPU { 1}  (t=708036 
jiffies g=122793 c=122792 q=6355275)
[16533.025674] INFO: rcu_sched self-detected stall on CPU { 1}  (t=714339 
jiffies g=122793 c=122792 q=6357064)
[16596.054364] INFO: rcu_sched self-detected stall on CPU { 1}  (t=720642 
jiffies g=122793 c=122792 q=6358808)
[16659.083056] INFO: rcu_sched self-detected stall on CPU { 1}  (t=726945 
jiffies g=122793 c=122792 q=6360538)
[16722.111749] INFO: rcu_sched self-detected stall on CPU { 1}  (t=733248 
jiffies g=122793 c=122792 q=6362273)


XML file:
---------

<domain type='kvm' id='2'>
  <name>actg1</name>
  <uuid>01a4e83f-2275-43b2-a4b3-31ada581141d</uuid>
  <memory unit='KiB'>4194304</memory>
  <currentMemory unit='KiB'>4194304</currentMemory>
  <vcpu placement='static'>2</vcpu>
  <resource>
    <partition>/machine</partition>
  </resource>
  <os>
    <type arch='ppc64' machine='pseries'>hvm</type>
    <boot dev='hd'/>
    <boot dev='cdrom'/>
    <boot dev='network'/>
    <bootmenu enable='yes'/>
  </os>
  <features>
    <acpi/>
    <apic/>
  </features>
  <cpu mode='custom' match='exact'>
    <model fallback='allow'>power8</model>
    <topology sockets='2' cores='2' threads='2'/>
  </cpu>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/bin/qemu-system-ppc64</emulator>
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <source dev='/dev/disk/by-id/wwn-0x6005076802810c1fa0000000000000a1'/>  <!-- SAN (coho) -->
      <target dev='hda' bus='scsi'/>
      <alias name='scsi0-0-0-0'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <source dev='/dev/disk/by-id/scsi-1IBM_IPR-0_5D81B900000010C0'/>  <!-- LocalCEC -->
      <target dev='hdb' bus='scsi'/>
      <alias name='scsi0-0-0-1'/>
      <address type='drive' controller='0' bus='0' target='0' unit='1'/>
    </disk>
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw'/>
      <source dev='/dev/disk/by-id/scsi-1IBM_IPR-0_5D81B900000010A0'/>  <!-- LocalCEC -->
      <target dev='hdc' bus='scsi'/>
      <alias name='scsi0-0-0-2'/>
      <address type='drive' controller='0' bus='0' target='0' unit='2'/>
    </disk>
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <source dev='/dev/disk/by-id/wwn-0x6005076802810c1fa0000000000000cd'/>  <!-- iscsi -->
      <target dev='hdd' bus='scsi'/>
      <alias name='scsi0-0-0-3'/>
      <address type='drive' controller='0' bus='0' target='0' unit='3'/>
    </disk>
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw'/>
      <source dev='/dev/disk/by-id/wwn-0x6005076802810c1fa0000000000000cc'/>  <!-- iscsi -->
      <target dev='hde' bus='scsi'/>
      <alias name='scsi0-0-0-4'/>
      <address type='drive' controller='0' bus='0' target='0' unit='4'/>
    </disk>
    <controller type='scsi' index='0'>
      <alias name='scsi0'/>
      <address type='spapr-vio' reg='0x2000'/>
    </controller>
    <controller type='scsi' index='1' model='virtio-scsi'>
      <alias name='scsi1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </controller>
    <controller type='usb' index='0' model='nec-xhci'>
      <alias name='usb0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </controller>
    <controller type='pci' index='0' model='pci-root'>
      <alias name='pci.0'/>
    </controller>
    <interface type='bridge'>
      <mac address='52:54:00:37:24:fe'/>
      <source bridge='brenP6p1s0f1'/>
      <target dev='vnet0'/>
      <model type='spapr-vlan'/>
      <alias name='net0'/>
      <address type='spapr-vio' reg='0x1000'/>
    </interface>
    <serial type='pty'>
      <source path='/dev/pts/1'/>
      <target port='0'/>
      <alias name='serial0'/>
      <address type='spapr-vio' reg='0x30000000'/>
    </serial>
    <console type='pty' tty='/dev/pts/1'>
      <source path='/dev/pts/1'/>
      <target type='serial' port='0'/>
      <alias name='serial0'/>
      <address type='spapr-vio' reg='0x30000000'/>
    </console>
    <video>
      <model type='vga' vram='9216' heads='1'/>
      <alias name='video0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
    </video>
  </devices>
  <seclabel type='dynamic' model='selinux' relabel='yes'>
    <label>system_u:system_r:svirt_t:s0:c405,c741</label>
    <imagelabel>system_u:object_r:svirt_image_t:s0:c405,c741</imagelabel>
  </seclabel>
</domain>

Host configuration:
-------------------

[root@actkvm home]# cat /etc/issue
IBM_PowerKVM release 2.1.0 build 27 service (pkvm2_1)
Kernel \r on a \m (\l)

[root@actkvm home]# virsh version
Compiled against library: libvirt 1.1.3
Using library: libvirt 1.1.3
Using API: QEMU 1.1.3
Running hypervisor: QEMU 1.6.0

Dropped into xmon: much the same, soft lockups between raw_spin_lock and
shrink_dentry_list.

The guest manages to get a fair bit into its syslog before locking up; I
think the two most interesting bits are:

Sep 18 05:44:46 actg1 kernel: [ 8349.373013] ------------[ cut here 
]------------
Sep 18 05:44:46 actg1 kernel: [ 8349.373019] WARNING: at 
/build/buildd/linux-3.13.0/fs/dcache.c:362
Sep 18 05:44:46 actg1 kernel: [ 8349.373020] Modules linked in: rpcsec_gss_krb5 
nfsv4 dm_round_robin dm_multipath scsi_dh pseries_rng nfsd auth_rpcgss nfs_acl 
nfs lockd sunrpc fscache rtc_generic virtio_scsi
Sep 18 05:44:46 actg1 kernel: [ 8349.373036] CPU: 1 PID: 22289 Comm: host01 Not 
tainted 3.13.0-35-generic #62-Ubuntu
Sep 18 05:44:46 actg1 kernel: [ 8349.373038] task: c0000000fcb40000 ti: 
c0000000faec0000 task.ti: c0000000faec0000
Sep 18 05:44:46 actg1 kernel: [ 8349.373040] NIP: c000000000274980 LR: 
c0000000002757f8 CTR: c0000000002e2f90
Sep 18 05:44:46 actg1 kernel: [ 8349.373042] REGS: c0000000faec3590 TRAP: 0700  
 Not tainted  (3.13.0-35-generic)
Sep 18 05:44:46 actg1 kernel: [ 8349.373042] MSR: 8000000000029033 
<SF,EE,ME,IR,DR,RI,LE>  CR: 22222848  XER: 20000000
Sep 18 05:44:46 actg1 kernel: [ 8349.373048] CFAR: c000000000274900 SOFTE: 1 
Sep 18 05:44:46 actg1 kernel: [ 8349.373048] GPR00: c0000000002757f8 
c0000000faec3810 c00000000164f6a0 c0000000f9b1c840 
Sep 18 05:44:46 actg1 kernel: [ 8349.373048] GPR04: c0000000faec39b8 
0000000001af2460 0000000000000000 0000000000000000 
Sep 18 05:44:46 actg1 kernel: [ 8349.373048] GPR08: c000000001708ff0 
0000000000000001 c00000000150f6a0 00000000000000b4 
Sep 18 05:44:46 actg1 kernel: [ 8349.373048] GPR12: 0000000022222842 
c00000000fe40380 0000010016693908 00000100166939c8 
Sep 18 05:44:46 actg1 kernel: [ 8349.373048] GPR16: 000000001016f5a8 
000000000000000a 0000000000000000 0000000000200000 
Sep 18 05:44:46 actg1 kernel: [ 8349.373048] GPR20: 0000000000000000 
c000000000c222a8 0000000000000004 c0000000fbc04280 
Sep 18 05:44:46 actg1 kernel: [ 8349.373048] GPR24: 0000000000001143 
c0000000f9b1ed00 0000000000000000 0000000000000000 
Sep 18 05:44:46 actg1 kernel: [ 8349.373048] GPR28: c0000000f9b1c840 
c0000000f9b1c840 c0000000faec39b8 c0000000f9b1c840 
Sep 18 05:44:46 actg1 kernel: [ 8349.373070] NIP [c000000000274980] 
.d_shrink_del+0xb0/0xd0
Sep 18 05:44:46 actg1 kernel: [ 8349.373072] LR [c0000000002757f8] 
.shrink_dentry_list+0xc8/0x160
Sep 18 05:44:46 actg1 kernel: [ 8349.373073] Call Trace:
Sep 18 05:44:46 actg1 kernel: [ 8349.373077] [c0000000faec3810] 
[c00000000095c140] ._raw_spin_lock+0xb0/0xc0 (unreliable)
Sep 18 05:44:46 actg1 kernel: [ 8349.373080] [c0000000faec3890] 
[c0000000002757f8] .shrink_dentry_list+0xc8/0x160
Sep 18 05:44:46 actg1 kernel: [ 8349.373081] [c0000000faec3940] 
[c000000000275a54] .shrink_dcache_parent+0x44/0xa0
Sep 18 05:44:46 actg1 kernel: [ 8349.373085] [c0000000faec39f0] 
[c0000000002ea3fc] .proc_flush_task+0xcc/0x200
Sep 18 05:44:46 actg1 kernel: [ 8349.373088] [c0000000faec3ae0] 
[c00000000009e16c] .release_task+0x9c/0x540
Sep 18 05:44:46 actg1 kernel: [ 8349.373090] [c0000000faec3bc0] 
[c00000000009ebe8] .wait_task_zombie+0x5d8/0x760
Sep 18 05:44:46 actg1 kernel: [ 8349.373092] [c0000000faec3c80] 
[c00000000009f6c0] .do_wait+0x120/0x2c0
Sep 18 05:44:46 actg1 kernel: [ 8349.373095] [c0000000faec3d40] 
[c0000000000a0d40] .SyS_wait4+0xa0/0x140
Sep 18 05:44:46 actg1 kernel: [ 8349.373097] [c0000000faec3e30] 
[c00000000000a158] syscall_exit+0x0/0x98
Sep 18 05:44:46 actg1 kernel: [ 8349.373098] Instruction dump:
Sep 18 05:44:46 actg1 kernel: [ 8349.373099] 3908ffff 7d09512a 4bd9c529 
60000000 38210080 e8010010 ebe1fff8 7c0803a6 
Sep 18 05:44:46 actg1 kernel: [ 8349.373102] 4e800020 3d42ffec 892ae58d 
69290001 <0b090000> 2fa90000 41feff7c 39200001 


and then

Sep 18 05:47:32 actg1 kernel: [ 8514.856141] JIT Sampler invoked oom-killer: 
gfp_mask=0x201da, order=0, oom_score_adj=0
Sep 18 05:47:32 actg1 kernel: [ 8514.856150] JIT Sampler cpuset=/ mems_allowed=0
Sep 18 05:47:32 actg1 kernel: [ 8514.856156] CPU: 0 PID: 3467 Comm: JIT Sampler 
Tainted: G        W    3.13.0-35-generic #62-Ubuntu
Sep 18 05:47:32 actg1 kernel: [ 8514.856159] Call Trace:
Sep 18 05:47:32 actg1 kernel: [ 8514.856170] [c0000000f58a2cf0] 
[c000000000016b00] .show_stack+0x170/0x290 (unreliable)
Sep 18 05:47:32 actg1 kernel: [ 8514.856179] [c0000000f58a2de0] 
[c000000000970b5c] .dump_stack+0x88/0xb4
Sep 18 05:47:32 actg1 kernel: [ 8514.856183] [c0000000f58a2e60] 
[c00000000096a43c] .dump_header+0xc0/0x248
Sep 18 05:47:32 actg1 kernel: [ 8514.856187] [c0000000f58a2f40] 
[c0000000001c6bf8] .oom_kill_process+0x318/0x4c0
Sep 18 05:47:32 actg1 kernel: [ 8514.856191] [c0000000f58a3020] 
[c0000000001c7598] .out_of_memory+0x4f8/0x530
Sep 18 05:47:32 actg1 kernel: [ 8514.856194] [c0000000f58a3100] 
[c0000000001cea60] .__alloc_pages_nodemask+0xbb0/0xc70
Sep 18 05:47:32 actg1 kernel: [ 8514.856199] [c0000000f58a32f0] 
[c000000000224958] .alloc_pages_current+0xb8/0x1a0
Sep 18 05:47:32 actg1 kernel: [ 8514.856203] [c0000000f58a3390] 
[c0000000001c2efc] .__page_cache_alloc+0xcc/0x110
Sep 18 05:47:32 actg1 kernel: [ 8514.856206] [c0000000f58a3420] 
[c0000000001c4884] .filemap_fault+0x204/0x4b0
Sep 18 05:47:32 actg1 kernel: [ 8514.856210] [c0000000f58a3500] 
[c0000000001f6d30] .__do_fault+0xc0/0x6f0
Sep 18 05:47:32 actg1 kernel: [ 8514.856214] [c0000000f58a35f0] 
[c0000000001fc004] .handle_mm_fault+0x214/0xde0
Sep 18 05:47:32 actg1 kernel: [ 8514.856218] [c0000000f58a36e0] 
[c00000000095e8f8] .do_page_fault+0x388/0x8c0
Sep 18 05:47:32 actg1 kernel: [ 8514.856221] [c0000000f58a3910] 
[c000000000009568] handle_page_fault+0x10/0x30
Sep 18 05:47:32 actg1 kernel: [ 8514.856227] --- Exception: 301 at 
.do_page_fault+0x468/0x8c0
Sep 18 05:47:32 actg1 kernel: [ 8514.856227]     LR = .do_page_fault+0x348/0x8c0
Sep 18 05:47:32 actg1 kernel: [ 8514.856231] [c0000000f58a3e30] 
[c000000000009568] handle_page_fault+0x10/0x30
Sep 18 05:47:32 actg1 kernel: [ 8514.856233] Mem-Info:
Sep 18 05:47:32 actg1 kernel: [ 8514.856236] Node 0 DMA per-cpu:
Sep 18 05:47:32 actg1 kernel: [ 8514.856239] CPU    0: hi:    6, btch:   1 usd: 
  0
Sep 18 05:47:32 actg1 kernel: [ 8514.856241] CPU    1: hi:    6, btch:   1 usd: 
  0
Sep 18 05:47:32 actg1 kernel: [ 8514.856247] active_anon:274 inactive_anon:317 
isolated_anon:0
Sep 18 05:47:32 actg1 kernel: [ 8514.856247]  active_file:24 inactive_file:58 
isolated_file:0
Sep 18 05:47:32 actg1 kernel: [ 8514.856247]  unevictable:283 dirty:0 
writeback:317 unstable:0
Sep 18 05:47:32 actg1 kernel: [ 8514.856247]  free:2001 slab_reclaimable:8289 
slab_unreclaimable:53322
Sep 18 05:47:32 actg1 kernel: [ 8514.856247]  mapped:159 shmem:0 pagetables:92 
bounce:0
Sep 18 05:47:32 actg1 kernel: [ 8514.856247]  free_cma:0
Sep 18 05:47:32 actg1 kernel: [ 8514.856253] Node 0 DMA free:128064kB min:131072kB low:163840kB high:196608kB active_anon:17536kB inactive_anon:20288kB active_file:1536kB inactive_file:3712kB unevictable:18112kB isolated(anon):0kB isolated(file):0kB present:4194304kB managed:4171392kB mlocked:18112kB dirty:0kB writeback:20288kB mapped:10176kB shmem:0kB slab_reclaimable:530496kB slab_unreclaimable:3412608kB kernel_stack:1014816kB pagetables:5888kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:4216 all_unreclaimable? yes
Sep 18 05:47:32 actg1 kernel: [ 8514.856263] lowmem_reserve[]: 0 0 0
Sep 18 05:47:32 actg1 kernel: [ 8514.856267] Node 0 DMA: 771*64kB (EM) 
460*128kB (M) 0*256kB 1*512kB (R) 0*1024kB 1*2048kB (R) 2*4096kB (R) 2*8192kB 
(R) 0*16384kB = 135360kB
Sep 18 05:47:32 actg1 kernel: [ 8514.856280] 626 total pagecache pages
Sep 18 05:47:32 actg1 kernel: [ 8514.856283] 446 pages in swap cache
Sep 18 05:47:32 actg1 kernel: [ 8514.856285] Swap cache stats: add 21120, 
delete 20674, find 8129/11084
Sep 18 05:47:32 actg1 kernel: [ 8514.856287] Free swap  = 727424kB
Sep 18 05:47:32 actg1 kernel: [ 8514.856288] Total swap = 917440kB
Sep 18 05:47:32 actg1 kernel: [ 8514.856290] 65536 pages RAM
Sep 18 05:47:32 actg1 kernel: [ 8514.856291] 0 pages HighMem/MovableOnly
Sep 18 05:47:32 actg1 kernel: [ 8514.856293] 358 pages reserved
...etc/etc


For what it's worth, my guest reproduces this problem under 14.04.1 but does
not reproduce it under 14.10.

There are a number of patches affecting dcache code between 14.04.1 and
14.10 - in particular:

commit 046b961b45f93a92e4c70525a12f3d378bced130
Author: Al Viro <v...@zeniv.linux.org.uk>
Date:   Thu May 29 08:54:52 2014 -0400

    shrink_dentry_list(): take parent's ->d_lock earlier
    
    The cause of livelocks there is that we are taking ->d_lock on
    dentry and its parent in the wrong order, forcing us to use
    trylock on the parent's one.  d_walk() takes them in the right
    order, and unfortunately it's not hard to create a situation
    when shrink_dentry_list() can't make progress since trylock
    keeps failing, and shrink_dcache_parent() or check_submounts_and_drop()
    keeps calling d_walk() disrupting the very shrink_dentry_list() it's
    waiting for.
...
...
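
To make the locking issue more concrete, here is a simplified, hypothetical
sketch of the pattern the commit message describes, written with userspace
pthread mutexes and an invented struct rather than the actual fs/dcache.c
code: the old shrink path locks the child dentry first and can only trylock
the parent, while a d_walk()-style traversal locks the parent before the
child, so the trylock can keep failing indefinitely.

/*
 * Simplified, hypothetical illustration of the livelock described in the
 * commit message above; userspace pthreads and an invented struct, not
 * the real fs/dcache.c code.
 */
#include <pthread.h>
#include <stdbool.h>

struct dentry {
    pthread_mutex_t d_lock;
    struct dentry *d_parent;
};

/*
 * Pre-patch shrink-side pattern: lock the child first, then trylock the
 * parent.  A concurrent d_walk()-style traversal takes the parent's lock
 * before the child's, so under pressure it can keep this trylock failing
 * and force the caller to retry over and over.
 */
static bool shrink_one_old(struct dentry *d)
{
    pthread_mutex_lock(&d->d_lock);
    if (pthread_mutex_trylock(&d->d_parent->d_lock) != 0) {
        pthread_mutex_unlock(&d->d_lock);
        return false;               /* caller loops and retries: livelock */
    }
    /* ... unhash and free the dentry ... */
    pthread_mutex_unlock(&d->d_parent->d_lock);
    pthread_mutex_unlock(&d->d_lock);
    return true;
}

/*
 * Post-patch pattern: take the parent's lock first, in the same order
 * d_walk() uses, so no trylock (and no retry loop) is needed.
 */
static void shrink_one_fixed(struct dentry *d)
{
    pthread_mutex_lock(&d->d_parent->d_lock);
    pthread_mutex_lock(&d->d_lock);
    /* ... unhash and free the dentry ... */
    pthread_mutex_unlock(&d->d_lock);
    pthread_mutex_unlock(&d->d_parent->d_lock);
}

The patch quoted above moves shrink_dentry_list() towards the second
ordering, which removes the trylock/retry loop that the stalled guest
appears to be spinning in.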


3.13.0-35.62, which 14.04.1 is running, does not include this patch, but a tag
3.13.0-36.63 exists in the trusty kernel tree which does, probably for a future
update. I've built a kernel from this tag and installed it in the guest; please
retest.

Hi Samuel,

The run completed 26 hours successfully; the issue seems to be solved.
As discussed, please let me know when a .deb is available for this patch.

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: Confirmed


** Tags: architecture-ppc64 bugnameltc-115314 severity-critical 
targetmilestone-inin1404


-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1372975

Title:
  ISST-KVM : Ubuntu 14.04 guest actg1 running TCP and IO hanging
