[Kernel-packages] [Bug 1768115] Comment bridged from LTC Bugzilla

2018-05-09 Thread bugproxy
--- Comment From mdr...@us.ibm.com 2018-05-09 08:47 EDT---
(In reply to comment #26)
> Is it essential to have two NUMA nodes for the guest memory to see this bug?
> Can we reproduce it without the NUMA node stuff in the xml?

I haven't attempted it on my end. Can give it a try. But we suspect
https://bugzilla.linux.ibm.com/show_bug.cgi?id=167036
may be the same issue (but with Pegas), since we see they're doing IO tests
and hitting various IO-related failures after migration. In that particular
config there were no additional NUMA nodes in the guest.

I am hoping to get the dump-bitmap-on-demand test you suggested going
today, and hopefully that can reproduce at a high enough frequency that I
can try the kernel patches, disabling THP, and the NUMA configurations
within a reasonable timeframe. The test I kicked off yesterday to
capture the first 128MB of the dirty bitmap ran all night without triggering...

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1768115

Title:
  ISST-LTE:KVM:Ubuntu1804:BostonLC:boslcp3g1: Migration guest running
  with IO stress crashed@security_file_permission+0xf4/0x160.

Status in The Ubuntu-power-systems project:
  Triaged
Status in linux package in Ubuntu:
  New

Bug description:
  Problem Description: Migration guest running with IO stress
  crashed@security_file_permission+0xf4/0x160 after a couple of
  migrations.

  Steps to re-create:

  Source host - boslcp3
  Destination host - boslcp4

  1. boslcp3 & boslcp4 installed with the latest kernel
  root@boslcp3:~# uname -a
  Linux boslcp3 4.15.0-20-generic #21+bug166588 SMP Thu Apr 26 15:05:59 CDT 2018 ppc64le ppc64le ppc64le GNU/Linux
  root@boslcp3:~#

  root@boslcp4:~# uname -a
  Linux boslcp4 4.15.0-20-generic #21+bug166588 SMP Thu Apr 26 15:05:59 CDT 2018 ppc64le ppc64le ppc64le GNU/Linux
  root@boslcp3:~#

  2. Installed guest boslcp3g1 with kernel and started LTP run from
  boslcp3 host

  root@boslcp3g1:~# uname -a
  Linux boslcp3g1 4.15.0-15-generic #16+bug166877 SMP Wed Apr 18 14:47:30 CDT 2018 ppc64le ppc64le ppc64le GNU/Linux

  3. Started migrating the boslcp3g1 guest from source to destination & vice versa.
  4. After a couple of migrations it crashed on boslcp4 & entered xmon

  8:mon> t
  [c004f8a23d20] c05a7674 security_file_permission+0xf4/0x160
  [c004f8a23d60] c03d1d30 rw_verify_area+0x70/0x120
  [c004f8a23d90] c03d375c vfs_read+0x8c/0x1b0
  [c004f8a23de0] c03d3d88 SyS_read+0x68/0x110
  [c004f8a23e30] c000b184 system_call+0x58/0x6c
  --- Exception: c01 (System Call) at 71f1779fe280
  SP (7fffe99ece50) is in userspace
  8:mon> S
  msr= 80001033  sprg0 = 
  pvr= 004e1202  sprg1 = c7a85800
  dec= 591e3e03  sprg2 = c7a85800
  sp = c004f8a234a0  sprg3 = 00010008
  toc= c16eae00  dar   = 023c
  srr0   = c00c355c  srr1  = 80001033 dsisr  = 4000
  dscr   =   ppr   = 0010 pir= 0011
  amr=   uamor = 
  dpdes  =   tir   =  cir= 
  fscr   = 05000180  tar   =  pspb   = 
  mmcr0  = 8000  mmcr1 =  mmcr2  = 
  pmc1   =  pmc2 =   pmc3 =   pmc4   = 
  mmcra  =    siar =  pmc5   = 026c
  sdar   =    sier =  pmc6   = 0861
  ebbhr  =   ebbrr =  bescr  = 
  iamr   = 4000
  pidr   = 0034  tidr  = 
  cpu 0x8: Vector: 700 (Program Check) at [c004f8a23220]
  pc: c00e4854: xmon_core+0x1f24/0x3520
  lr: c00e4850: xmon_core+0x1f20/0x3520
  sp: c004f8a234a0
     msr: 80041033
    current = 0xc004f89faf00
    paca= 0xc7a85800   softe: 0irq_happened: 0x01
  pid   = 24028, comm = top
  Linux version 4.15.0-20-generic (buildd@bos02-ppc64el-002) (gcc version 7.3.0 (Ubuntu 7.3.0-16ubuntu3)) #21-Ubuntu SMP Tue Apr 24 06:14:44 UTC 2018 (Ubuntu 4.15.0-20.21-generic 4.15.17)
  cpu 0x8: Exception 700 (Program Check) in xmon, returning to main loop
  [c004f8a23d20] c05a7674 security_file_permission+0xf4/0x160
  [c004f8a23d60] c03d1d30 rw_verify_area+0x70/0x120
  [c004f8a23d90] c03d375c vfs_read+0x8c/0x1b0
  [c004f8a23de0] c03d3d88 SyS_read+0x68/0x110
  [c004f8a23e30] c000b184 system_call+0x58/0x6c
  --- Exception: c01 (System Call) at 71f1779fe280
  SP (7fffe99ece50) is in userspace
  8:mon> r
  R00 = c043b7fc   R16 = 
  R01 = c004f8a23c90   R17 = ff70
  R02 = c16eae00   R18 = 0a51b4bebfc8
  R03 = c00279557200   R19 = 7fffe99edbb0
  R04 = c003242499c0   R20 = 0a51b4c04db0
  R05 = 0002   R21 = 0a51b4c20e90
  R06 = 0004   R22 = 00040f00
  R07 = ff81   R23 = 0a51b4c06560
  R08 = ff80   R24 = ff80
  R09 =    R25 = 0a51b4bec2b8
  R10 =    R26 = 71f177bb0b20
  R11 =    R27 = 
  R12 = 2000   R28 = c00279557200
  R13 = c7a85800   R29 = c004c7734210
  R14 =    R30 = 
  R15 =    R31 = c003242499c0
  pc  = c043b808 __fsnotify_parent+0x88/0x1a0
  cfar= 

[Kernel-packages] [Bug 1768115] Comment bridged from LTC Bugzilla

2018-05-08 Thread bugproxy
--- Comment From p...@au1.ibm.com 2018-05-09 00:25 EDT---
Is it essential to have two NUMA nodes for the guest memory to see this bug? 
Can we reproduce it without the NUMA node stuff in the xml?


[Kernel-packages] [Bug 1768115] Comment bridged from LTC Bugzilla

2018-05-08 Thread bugproxy
--- Comment From mdr...@us.ibm.com 2018-05-08 16:37 EDT---
Hit another instance of the RAM inconsistencies prior to resuming the guest on
the target side (this one is migrating from boslcp6 to boslcp5 and crashing after
it resumes execution on boslcp5). The signature is eerily similar to the ones
above... the workload is blast from LTP, but it's strange that 3 out of 3 so far
have been the same data structure. Maybe there's a relationship between
something the process is doing and dirty syncing?

root@boslcp5:~/vm_logs/1525768538/dumps# xxd -s 20250624 -l 128 0-2.vm0.iteration2a
01350000: 0000 0000 0000 0000 0000 0000 0000 0000  ................
01350010: 0000 0000 0000 0000 0000 0000 0000 0000  ................
01350020: 0000 0000 0000 0000 0000 0000 0000 0000  ................
01350030: 0000 0000 0000 0000 0000 0000 0000 0000  ................
01350040: 0000 0000 0000 0000 0000 0000 0000 0000  ................
01350050: 0000 0000 0000 0000 0000 0000 0000 0000  ................
01350060: 0000 0000 0000 0000 0000 0000 0000 0000  ................
01350070: 0000 0000 0000 0000 0000 0000 0000 0000  ................
root@boslcp5:~/vm_logs/1525768538/dumps# xxd -s 20250624 -l 128 0-2.vm0.iteration2a.boslcp6
01350000: d603 0100 0000 0000 2f62 6c61 7374 2f76  ......../blast/v
01350010: 6463 3400 0000 0000 0000 0000 0000 0000  dc4.............
01350020: 0000 0000 0000 0000 0000 0000 0000 0000  ................
01350030: 0000 0000 0000 0000 0000 0000 0000 0000  ................
01350040: 0000 0000 0000 0000 0000 0000 0000 0000  ................
01350050: 0000 0000 0000 0000 0000 0000 0000 0000  ................
01350060: 0000 0000 0000 0000 0000 0000 0000 0000  ................
01350070: 0000 0000 0000 0000 0000 0000 0000 0000  ................

For this run I included traces of the various stages of memory migration
on the QEMU side relative to dirty bitmap sync (attached above). The
phases are:

"ram_save_setup": enables dirty logging, sets up the data structures used
for tracking dirty pages, and does the initial bitmap sync. QEMU keeps its
own copy of the dirty bitmap, which gets OR'd with the one provided by KVM
on each bitmap sync. There are 2 blocks (ram-node0/ram-node1), each with
its own bitmap / KVM memslot, since the guest was defined with 2 NUMA
nodes. Only ram-node0 is relevant here since it has offset 0 in the guest
physical memory address space.

"ram_save_pending": called before each iteration to see if there are
pages still pending. When the number of dirty pages in the QEMU bitmap
drops below a certain value, it does another sync with KVM's bitmap.

"ram_save_iterate": walks the QEMU dirty bitmap and sends the
corresponding pages until there are none left or some other limit (e.g.
bandwidth throttling or max-pages-per-iteration) is hit.

"ram_save_pending"/"ram_save_iterate" keep repeating until no more
pages are left.

"ram_save_complete": does a final sync with the KVM bitmap, sends the
final set of pages, then disables dirty logging and completes the
migration.

"vm_stop" denotes when the guest VCPUs have all exited and stopped
execution.
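
As a rough illustration of that merge step (a simplified, hypothetical
sketch of the idea, not QEMU's actual code; the names qemu_bitmap and
kvm_bitmap are made up here):

#include <stddef.h>

/* Hypothetical sketch: on each sync, the bitmap handed back by KVM is
 * OR'd into QEMU's persistent per-RAMBlock bitmap, so dirty information
 * from earlier syncs is accumulated rather than lost. Bits are only
 * cleared later, as the iterate phase sends each page. */
static void merge_kvm_dirty_bitmap(unsigned long *qemu_bitmap,
                                   const unsigned long *kvm_bitmap,
                                   size_t nr_words)
{
    for (size_t i = 0; i < nr_words; i++) {
        qemu_bitmap[i] |= kvm_bitmap[i];
    }
}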

There are 2 migrations reflected in the posted traces; the first one can
be ignored (everything between the first ram_save_setup and the first
ram_save_complete), it's just a backup of the VM. After the VM is backed
up it resumes execution, and that's the state we're migrating here and
seeing a crash with on the other end.

The sequence of events in this run is comparable to previous successful
runs: no strange orderings or missed calls to sync with the KVM dirty
bitmap, etc. A condensed version of the trace is below, but it looks like
there's a sync prior to vm_stop and a sync afterward, and given that
these syncs are OR'd into a persistent bitmap maintained by QEMU, there
shouldn't be any loss of dirty page information with this particular
ordering of events.

117401@1525770831.423435: >ram_save_setup
117401@1525770831.424386: migration_bitmap_sync, count: 4
117401@1525770831.424400: qemu_global_log_sync
117401@1525770831.424410: qemu_global_log_sync, name: ram-node0, addr: 0
117401@1525770831.424419: kvm_log_sync, addr: 0, size: 28000
117401@1525770831.445270: qemu_global_log_sync, name: ram-node1, addr: 28000
117401@1525770831.445279: kvm_log_sync, addr: 28000, size: 28000
117401@1525770831.545805: qemu_global_log_sync, name: vga.vram, addr: 8000
117401@1525770831.545814: kvm_log_sync, addr: 20008000, size: 100
117401@1525770831.545831: qemu_global_log_sync, name: ram-node0, addr: 0
117401@1525770831.545905: qemu_global_log_sync, name: ram-node1, addr: 28000
117401@1525770831.545959: qemu_global_log_sync, name: vga.vram, addr: 8000
117401@1525770831.545965: migration_bitmap_sync, id: ram-node0, block->mr->name: ram-node0, block->used_length: 28000h
117401@1525770831.547606: migration_bitmap_sync, id: ram-node1, block->mr->name: ram-node1, block->used_length: 28000h
117401@1525770831.548986: ram_save_pending, dirty pages remaining: 5247120, page size: 4096
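
(For scale: 5,247,120 pending pages * 4096 bytes is roughly 20 GiB, i.e.
at the first ram_save_pending essentially all of the guest's RAM is still
marked dirty, which is what you'd expect right after the initial bitmap
sync for the ~20GB guests used in these tests.)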

[Kernel-packages] [Bug 1768115] Comment bridged from LTC Bugzilla

2018-05-07 Thread bugproxy
--- Comment From mdr...@us.ibm.com 2018-05-07 14:48 EDT---
The RCU connection is possibly a red herring. I tested the above theory about
RCU timeouts/warnings being a trigger by modifying QEMU to allow the guest
timebase to be advanced artificially to trigger RCU timeouts/warnings in rapid
succession, and ran this for 8 hours using the same workload without seeing a
crash. It seems migration is a necessary component to reproduce this.

I did further tests to capture the guest memory state before/after
migration to see if there's possibly an issue with dirty-page tracking
or something similar that could explain the crashes, and have data from
2 crashes that show a roughly 24-byte difference between source/target
after migration within the first 100MB of the guest physical address
range. I have more details in the summaries/logs I'm attaching, but one
example is below (from the "migtest" log):

root@boslcp5:~/dumps-cmp# xxd -s 0x013e0000 -l 128 0-2.boslcp5
013e0000: 0000 0000 0000 0000 0000 0000 0000 0000  ................
013e0010: 0000 0000 0000 0000 0000 0000 0000 0000  ................
013e0020: 0000 0000 0000 0000 0000 0000 0000 0000  ................
013e0030: 0000 0000 0000 0000 0000 0000 0000 0000  ................
013e0040: 0000 0000 0000 0000 0000 0000 0000 0000  ................
013e0050: 0000 0000 0000 0000 0000 0000 0000 0000  ................
013e0060: 0000 0000 0000 0000 0000 0000 0000 0000  ................
013e0070: 0000 0000 0000 0000 0000 0000 0000 0000  ................
root@boslcp5:~/dumps-cmp# xxd -s 0x013e0000 -l 128 0-2.boslcp6
013e0000: 3403 0100 0002 0000 2f62 6c61 7374 2f76  4......./blast/v
013e0010: 6463 3400 0000 0000 0000 0000 0000 0000  dc4.............
013e0020: 0000 0000 0000 0000 0000 0000 0000 0000  ................
013e0030: 0000 0000 0000 0000 0000 0000 0000 0000  ................
013e0040: 0000 0000 0000 0000 0000 0000 0000 0000  ................
013e0050: 0000 0000 0000 0000 0000 0000 0000 0000  ................
013e0060: 0000 0000 0000 0000 0000 0000 0000 0000  ................
013e0070: 0000 0000 0000 0000 0000 0000 0000 0000  ................

"blast" is part of the LTP IO test suite running in the guest. It seems
some data structure related to it is present in the source guest memory,
but not on the target side. Part of the structure seems to be a trigger
buffer, but the preceding value might be a point or something else and
may explain the crashes if that ends up being zero'd on the target side.
The other summary/log I'm attaching has almost an identical inconsistent
between source/target from another guest using same workload and hitting
a crash:

root@boslcp5:~/dumps-cmp-migtest2# xxd -s 38273024 -l 128 0-2.boslcp5
02480000: c000 0100 0000 0000 2f62 6c61 7374 2f76  ......../blast/v
02480010: 6462 3400 0000 0000 0000 0000 0000 0000  db4.............
02480020: 0000 0000 0000 0000 0000 0000 0000 0000  ................
02480030: 0000 0000 0000 0000 0000 0000 0000 0000  ................
02480040: 0000 0000 0000 0000 0000 0000 0000 0000  ................
02480050: 0000 0000 0000 0000 0000 0000 0000 0000  ................
02480060: 0000 0000 0000 0000 0000 0000 0000 0000  ................
02480070: 0000 0000 0000 0000 0000 0000 0000 0000  ................
root@boslcp5:~/dumps-cmp-migtest2# xxd -s 38273024 -l 128 0-2.boslcp6
02480000: 0000 0000 0000 0000 0000 0000 0000 0000  ................
02480010: 0000 0000 0000 0000 0000 0000 0000 0000  ................
02480020: 0000 0000 0000 0000 0000 0000 0000 0000  ................
02480030: 0000 0000 0000 0000 0000 0000 0000 0000  ................
02480040: 0000 0000 0000 0000 0000 0000 0000 0000  ................
02480050: 0000 0000 0000 0000 0000 0000 0000 0000  ................
02480060: 0000 0000 0000 0000 0000 0000 0000 0000  ................
02480070: 0000 0000 0000 0000 0000 0000 0000 0000  ................
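
(For reference, a minimal sketch of how two such raw memory dumps can be
scanned for differing regions; the 16-byte granularity, file names, and
output format here are illustrative, not the exact tooling used above.
It assumes both dumps are the same size.)

#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <dump-a> <dump-b>\n", argv[0]);
        return 1;
    }
    FILE *a = fopen(argv[1], "rb"), *b = fopen(argv[2], "rb");
    if (!a || !b) {
        perror("fopen");
        return 1;
    }
    unsigned char la[16], lb[16];
    unsigned long long off = 0;
    size_t ra, rb;
    /* Walk both files 16 bytes at a time and report any line that differs. */
    while ((ra = fread(la, 1, sizeof la, a)) > 0 &&
           (rb = fread(lb, 1, sizeof lb, b)) > 0) {
        size_t n = ra < rb ? ra : rb;
        if (memcmp(la, lb, n) != 0) {
            printf("difference at offset 0x%llx\n", off);
        }
        off += n;
    }
    fclose(a);
    fclose(b);
    return 0;
}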

It seems highly likely there's an issue related to dirty bitmap tracking
at play here. We could use some help from kernel folks on figuring out
where that might lie. The crashed guests are still live ATM, so let me
know if there's anything I should try to gather.


[Kernel-packages] [Bug 1768115] Comment bridged from LTC Bugzilla

2018-05-04 Thread bugproxy
--- Comment From mdr...@us.ibm.com 2018-05-04 09:09 EDT---
(In reply to comment #15)
> This is not the same as the original bug, but I suspect they are part of a
> class of issues we're hitting while running under very particular
> circumstances which might not generally be seen during normal operation and
> triggering various corner cases. As such I think it makes sense to group
> them under this bug for the time being.
>
> The general workload is running IO-heavy disk workloads on large guests
> (20GB memory, 16 vcpus) with SAN-based storage, and then performing
> migration during the workload. During migration we begin to see a high
> occurrence of rcu_sched stall warnings, and after 1-3  hours of operations
> we hit filesystem-related crashes like the ones posted. We've seen this with
> 2 separate FC cards, emulex and qlogic, where we invoke QEMU through libvirt
> as:

We've been gathering additional traces while running under this scenario,
and while so far most of the traces have been filesystem-related, we now
have a couple that suggest the common thread between all of these
failures is RCU-managed data structures. I'll attach the summaries for
these from xmon; they have the full dmesg log since guest start, as well
as timestamps in dmesg noting where migration has started/stopped, and
"WATCHDOG" messages to note any large jumps in wall-clock time. For
example (from boslcp3g1-migtest-fail-on-lcp5):

[ 5757.347542] migration iteration 7: started at Thu May 3 05:59:14 CDT 2018

[ 5935.727884] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 5935.728567]  1-...!: (670 GPs behind) idle=486/140/0 softirq=218179/218180 fqs=0
[ 5935.730091]  2-...!: (3750 GPs behind) idle=006/140/0 softirq=203335/203335 fqs=0
[ 5935.731076]  4-...!: (96 GPs behind) idle=c2e/140/0 softirq=168607/168608 fqs=0
[ 5935.731783]  5-...!: (2270 GPs behind) idle=e16/140/0 softirq=152608/152608 fqs=1
[ 5935.732959]  6-...!: (322 GPs behind) idle=3ca/141/0 softirq=169452/169453 fqs=1
[ 5935.735061]  8-...!: (6 GPs behind) idle=c36/141/0 softirq=280514/280516 fqs=1
[ 5935.736638]  9-...!: (5 GPs behind) idle=c1e/141/0 softirq=248247/248249 fqs=1
[ 5935.738112]  10-...!: (4 GPs behind) idle=62a/1/0 softirq=228207/228208 fqs=1
[ 5935.738868]  11-...!: (32 GPs behind) idle=afe/140/0 softirq=228817/228818 fqs=1
[ 5935.739122]  12-...!: (3 GPs behind) idle=426/1/0 softirq=192716/192717 fqs=1
[ 5935.739295]  14-...!: (5 GPs behind) idle=e56/140/0 softirq=133888/133892 fqs=1
[ 5935.739486]  15-...!: (7 GPs behind) idle=36e/140/0 softirq=161010/161013 fqs=1
...
[ 5935.740031] Unable to handle kernel paging request for data at address 0x0008
[ 5935.740128] Faulting instruction address: 0xc0403d04

For the prior iterations where we don't crash we'd have messages like:

[ 2997.413561] WATCHDOG (Thu May  3 05:13:18 CDT 2018): time jump of 114 seconds
[ 3023.759629] migration iteration 1: completed at Thu May 3 05:13:25 CDT 2018
[ 3239.678964] migration iteration 2: started at Thu May 3 05:16:45 CDT 2018

The WATCHDOG is noting the amount of time the guest has seen jump after
it resumes execution. These are generally on the order of 1-2 minutes
here, where we're doing migration via virsh migrate ... --timeout 60,
which forcibly suspends the guest if it hasn't finished migrating within
60s.
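
(The WATCHDOG helper itself is just a loop that wakes periodically and
reports unexpectedly large wall-clock deltas, which in the guest show up
right after it resumes from migration. A hypothetical sketch of the idea
is below; the 30-second threshold and message format are made up, not the
actual helper used in these runs.)

#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    time_t prev = time(NULL);
    for (;;) {
        sleep(1);
        time_t now = time(NULL);
        double delta = difftime(now, prev);
        if (delta > 30) {   /* far beyond the 1s sleep: guest likely just resumed */
            char buf[64];
            struct tm tmv;
            localtime_r(&now, &tmv);
            strftime(buf, sizeof buf, "%a %b %e %H:%M:%S %Z %Y", &tmv);
            printf("WATCHDOG (%s): time jump of %.0f seconds\n", buf, delta);
        }
        prev = now;
    }
    return 0;
}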

We now know that the source of the skip in time actually originates from
behavior on the source side of migration, due to handling within QEMU,
and that the guest is reacting to it after it wakes up from migration. A
patch has been sent which changes the behavior so that the guest doesn't
see a jump in time after resuming:

http://lists.nongnu.org/archive/html/qemu-devel/2018-05/msg00928.html

The patch is still under discussion and it's not clear yet whether this
is actually a QEMU bug or intended behavior. I'm still testing the fix in
conjunction with the original workload and would like to see it run over
the weekend before I can say anything with certainty, but so far it has
run overnight, whereas prior to the change it would crash after an hour
or 2 (though we have seen runs that survived as long as 8 hours, so
that's not definitive).

If that survives, it would suggest that the RCU-related crashes occur as
a result of a jump in the guest VCPU's timebase register. One interesting
thing I've noticed is that with a QEMU that *doesn't have the patch
above*, disabling RCU stall warning messages via:

echo 1 >/sys/module/rcupdate/parameters/rcu_cpu_stall_suppress

allowed the workload to run for 16 hours without crashing. This may
suggest the warning messages, in conjunction with rcu_cpu_stall_timeout
being exceeded due to the jump in the timebase register, are triggering
issues with RCU. What I plan to try next is raising rcu_cpu_stall_timeout
to a much higher value (currently 21 on Ubuntu 18.04, it seems) and 

[Kernel-packages] [Bug 1768115] Comment bridged from LTC Bugzilla

2018-04-30 Thread bugproxy
--- Comment From dougm...@us.ibm.com 2018-04-30 15:31 EDT---
Both logs show that the dmesg buffer has been overrun, so by the time you get 
to xmon and run "dl" you've lost the messages that show what happened before 
things went wrong. You will need to be collecting console output from the 
beginning in order to show what happened.
