Re: crash after NX error

2019-06-10 Thread Haren Myneni
On 06/05/2019 04:06 AM, Michael Ellerman wrote:
> Stewart Smith  writes:
>> On my two socket POWER9 system (powernv) with 842 zswap set up, I
>> recently got a crash with the Ubuntu kernel (I haven't tried with
>> upstream, and this is the first time the system has died like this, so
>> I'm not sure how repeatable it is).
>>
>> [2.891463] zswap: loaded using pool 842-nx/zbud
>> ...
>> [15626.124646] nx_compress_powernv: ERROR: CSB still not valid after 500 
>> us, giving up : 00 00 00 00 
>> [16868.932913] Unable to handle kernel paging request for data at address 
>> 0x6655f67da816cdb8
>> [16868.933726] Faulting instruction address: 0xc0391600
>>
>>
>> cpu 0x68: Vector: 380 (Data Access Out of Range) at [c01c9d98b9a0]
>> pc: c0391600: kmem_cache_alloc+0x2e0/0x340
>> lr: c03915ec: kmem_cache_alloc+0x2cc/0x340
>> sp: c01c9d98bc20
>>msr: 9280b033
>>dar: 6655f67da816cdb8
>>   current = 0xc01ad43cb400
>>   paca= 0xcfac7800   softe: 0irq_happened: 0x01
>> pid   = 8319, comm = make
>> Linux version 4.15.0-50-generic (buildd@bos02-ppc64el-006) (gcc version 
>> 7.3.0 (Ubuntu 7.3.0-16ubuntu3)) #54-Ubuntu SMP Mon May 6 18:55:18 UTC 2019 
>> (Ubuntu 4.15.0-50.54-generic 4.15.18)
>>
>> 68:mon> t
>> [c01c9d98bc20] c03914d4 kmem_cache_alloc+0x1b4/0x340 (unreliable)
>> [c01c9d98bc80] c03b1e14 __khugepaged_enter+0x54/0x220
>> [c01c9d98bcc0] c010f0ec copy_process.isra.5.part.6+0xebc/0x1a10
>> [c01c9d98bda0] c010fe4c _do_fork+0xec/0x510
>> [c01c9d98be30] c000b584 ppc_clone+0x8/0xc
>> --- Exception: c00 (System Call) at 7afe9daf87f4
>> SP (7fffca606880) is in userspace
>>
>> So, it looks like there could be a problem in the error path, plausibly
>> fixed by this patch:
>>
>> commit 656ecc16e8fc2ab44b3d70e3fcc197a7020d0ca5
>> Author: Haren Myneni 
>> Date:   Wed Jun 13 00:32:40 2018 -0700
>>
>> crypto/nx: Initialize 842 high and normal RxFIFO control registers
>> 
>> NX increments readOffset by FIFO size in receive FIFO control register
>> when CRB is read. But the index in RxFIFO has to match with the
>> corresponding entry in FIFO maintained by VAS in kernel. Otherwise NX
>> may be processing incorrect CRBs and can cause CRB timeout.
>> 
>> VAS FIFO offset is 0 when the receive window is opened during
>> initialization. When the module is reloaded or in kexec boot, readOffset
>> in FIFO control register may not match with VAS entry. This patch adds
>> nx_coproc_init OPAL call to reset readOffset and queued entries in FIFO
>> control register for both high and normal FIFOs.
>> 
>> Signed-off-by: Haren Myneni 
>> [mpe: Fixup uninitialized variable warning]
>> Signed-off-by: Michael Ellerman 
>>
>> $ git describe --contains 656ecc16e8fc2ab44b3d70e3fcc197a7020d0ca5
>> v4.19-rc1~24^2~50
>>
>>
>> Which was never backported to any stable release, so probably needs to
>> be for v4.14 through v4.18.
> 
> Yeah the P9 NX support went in in:
>   b0d6c9bab5e4 ("crypto/nx: Add P9 NX support for 842 compression engine")
> 
> Which was: v4.14-rc1~119^2~21, so first released in v4.14.
> 
> 
> I'm actually less interested in that and more interested in the
> subsequent crash. The time stamps are miles apart though, did we just
> leave some corrupted memory after the NX failed and then hit it later?
> Or did we not correctly signal to the upper level APIs that the request
> failed.
> 
> I think we need to do some testing with errors injected into the
> wait_for_csb() path, to ensure that failures there are not causing
> corruption in zswap. Haren, have you done any testing of error injection?

The code path properly returns the error code from wait_for_csb() to the upper
level APIs. In the decompression case, the request falls back to SW 842 upon
failure.
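
The fallback described above can be sketched in user space as follows. This
is only an illustration, not the driver code: hw_decompress(),
sw_decompress(), decompress_with_fallback() and the hw_should_timeout knob
are hypothetical names, and "decompression" is faked with a plain copy so the
control flow is the only thing being shown.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical knob: when set, the fake hardware path times out. */
int hw_should_timeout;

/* Hypothetical counter so a test can see that the software path ran. */
int sw_calls;

/* Stand-in for the hardware path; a copy replaces real decompression. */
int hw_decompress(const unsigned char *in, size_t inlen,
                  unsigned char *out, size_t *outlen)
{
    if (hw_should_timeout)
        return -ETIMEDOUT;      /* models "CSB still not valid" */
    if (*outlen < inlen)
        return -ENOSPC;
    memcpy(out, in, inlen);
    *outlen = inlen;
    return 0;
}

/* Stand-in for the software implementation. */
int sw_decompress(const unsigned char *in, size_t inlen,
                  unsigned char *out, size_t *outlen)
{
    sw_calls++;
    if (*outlen < inlen)
        return -ENOSPC;
    memcpy(out, in, inlen);
    *outlen = inlen;
    return 0;
}

/* Try the hardware path first; on any error, retry in software. */
int decompress_with_fallback(const unsigned char *in, size_t inlen,
                             unsigned char *out, size_t *outlen)
{
    size_t avail = *outlen;     /* remember the caller's buffer size */
    int ret = hw_decompress(in, inlen, out, outlen);

    if (ret == 0)
        return 0;

    *outlen = avail;            /* the failed attempt must not shrink it */
    return sw_decompress(in, inlen, out, outlen);
}
```

In this sketch the caller only sees an error if the software path fails as
well, which is the behaviour the paragraph above describes for decompression.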

If NX is involved in this crash, the compression request may have succeeded with
an invalid CRB (mismatched FIFO entries in NX and VAS). SW 842 may then
decompress invalid data, which might cause corruption later when it is accessed.

I will try to reproduce the issue with a 4.14 kernel.

Thanks
Haren
  
> 
> cheers
> 



Re: crash after NX error

2019-06-05 Thread Stewart Smith
Michael Ellerman  writes:
> Stewart Smith  writes:
>> On my two socket POWER9 system (powernv) with 842 zswap set up, I
>> recently got a crash with the Ubuntu kernel (I haven't tried with
>> upstream, and this is the first time the system has died like this, so
>> I'm not sure how repeatable it is).
>>
>> [2.891463] zswap: loaded using pool 842-nx/zbud
>> ...
>> [15626.124646] nx_compress_powernv: ERROR: CSB still not valid after 500 
>> us, giving up : 00 00 00 00 
>> [16868.932913] Unable to handle kernel paging request for data at address 
>> 0x6655f67da816cdb8
>> [16868.933726] Faulting instruction address: 0xc0391600
>>
>>
>> cpu 0x68: Vector: 380 (Data Access Out of Range) at [c01c9d98b9a0]
>> pc: c0391600: kmem_cache_alloc+0x2e0/0x340
>> lr: c03915ec: kmem_cache_alloc+0x2cc/0x340
>> sp: c01c9d98bc20
>>msr: 9280b033
>>dar: 6655f67da816cdb8
>>   current = 0xc01ad43cb400
>>   paca= 0xcfac7800   softe: 0irq_happened: 0x01
>> pid   = 8319, comm = make
>> Linux version 4.15.0-50-generic (buildd@bos02-ppc64el-006) (gcc version 
>> 7.3.0 (Ubuntu 7.3.0-16ubuntu3)) #54-Ubuntu SMP Mon May 6 18:55:18 UTC 2019 
>> (Ubuntu 4.15.0-50.54-generic 4.15.18)
>>
>> 68:mon> t
>> [c01c9d98bc20] c03914d4 kmem_cache_alloc+0x1b4/0x340 (unreliable)
>> [c01c9d98bc80] c03b1e14 __khugepaged_enter+0x54/0x220
>> [c01c9d98bcc0] c010f0ec copy_process.isra.5.part.6+0xebc/0x1a10
>> [c01c9d98bda0] c010fe4c _do_fork+0xec/0x510
>> [c01c9d98be30] c000b584 ppc_clone+0x8/0xc
>> --- Exception: c00 (System Call) at 7afe9daf87f4
>> SP (7fffca606880) is in userspace
>>
>> So, it looks like there could be a problem in the error path, plausibly
>> fixed by this patch:
>>
>> commit 656ecc16e8fc2ab44b3d70e3fcc197a7020d0ca5
>> Author: Haren Myneni 
>> Date:   Wed Jun 13 00:32:40 2018 -0700
>>
>> crypto/nx: Initialize 842 high and normal RxFIFO control registers
>> 
>> NX increments readOffset by FIFO size in receive FIFO control register
>> when CRB is read. But the index in RxFIFO has to match with the
>> corresponding entry in FIFO maintained by VAS in kernel. Otherwise NX
>> may be processing incorrect CRBs and can cause CRB timeout.
>> 
>> VAS FIFO offset is 0 when the receive window is opened during
>> initialization. When the module is reloaded or in kexec boot, readOffset
>> in FIFO control register may not match with VAS entry. This patch adds
>> nx_coproc_init OPAL call to reset readOffset and queued entries in FIFO
>> control register for both high and normal FIFOs.
>> 
>> Signed-off-by: Haren Myneni 
>> [mpe: Fixup uninitialized variable warning]
>> Signed-off-by: Michael Ellerman 
>>
>> $ git describe --contains 656ecc16e8fc2ab44b3d70e3fcc197a7020d0ca5
>> v4.19-rc1~24^2~50
>>
>>
>> Which was never backported to any stable release, so probably needs to
>> be for v4.14 through v4.18.
>
> Yeah the P9 NX support went in in:
>   b0d6c9bab5e4 ("crypto/nx: Add P9 NX support for 842 compression engine")
>
> Which was: v4.14-rc1~119^2~21, so first released in v4.14.
>
>
> I'm actually less interested in that and more interested in the
> subsequent crash. The time stamps are miles apart though, did we just
> leave some corrupted memory after the NX failed and then hit it later?
> Or did we not correctly signal to the upper level APIs that the request
> failed.
>
> I think we need to do some testing with errors injected into the
> wait_for_csb() path, to ensure that failures there are not causing
> corruption in zswap. Haren, have you done any testing of error
> injection?

So, things died pretty heavily overnight (requiring e2fsck) with a *lot*
of those wait_for_csb() errors in the log.

It certainly *looks* like there's corruption around, as one of the CI
jobs that failed around that time got "internal compiler error" which is
usually a good sign that things have gone poorly somewhere.

-- 
Stewart Smith
OPAL Architect, IBM.



Re: crash after NX error

2019-06-05 Thread Michael Ellerman
Stewart Smith  writes:
> On my two socket POWER9 system (powernv) with 842 zswap set up, I
> recently got a crash with the Ubuntu kernel (I haven't tried with
> upstream, and this is the first time the system has died like this, so
> I'm not sure how repeatable it is).
>
> [2.891463] zswap: loaded using pool 842-nx/zbud
> ...
> [15626.124646] nx_compress_powernv: ERROR: CSB still not valid after 500 
> us, giving up : 00 00 00 00 
> [16868.932913] Unable to handle kernel paging request for data at address 
> 0x6655f67da816cdb8
> [16868.933726] Faulting instruction address: 0xc0391600
>
>
> cpu 0x68: Vector: 380 (Data Access Out of Range) at [c01c9d98b9a0]
> pc: c0391600: kmem_cache_alloc+0x2e0/0x340
> lr: c03915ec: kmem_cache_alloc+0x2cc/0x340
> sp: c01c9d98bc20
>msr: 9280b033
>dar: 6655f67da816cdb8
>   current = 0xc01ad43cb400
>   paca= 0xcfac7800   softe: 0irq_happened: 0x01
> pid   = 8319, comm = make
> Linux version 4.15.0-50-generic (buildd@bos02-ppc64el-006) (gcc version 7.3.0 
> (Ubuntu 7.3.0-16ubuntu3)) #54-Ubuntu SMP Mon May 6 18:55:18 UTC 2019 (Ubuntu 
> 4.15.0-50.54-generic 4.15.18)
>
> 68:mon> t
> [c01c9d98bc20] c03914d4 kmem_cache_alloc+0x1b4/0x340 (unreliable)
> [c01c9d98bc80] c03b1e14 __khugepaged_enter+0x54/0x220
> [c01c9d98bcc0] c010f0ec copy_process.isra.5.part.6+0xebc/0x1a10
> [c01c9d98bda0] c010fe4c _do_fork+0xec/0x510
> [c01c9d98be30] c000b584 ppc_clone+0x8/0xc
> --- Exception: c00 (System Call) at 7afe9daf87f4
> SP (7fffca606880) is in userspace
>
> So, it looks like there could be a problem in the error path, plausibly
> fixed by this patch:
>
> commit 656ecc16e8fc2ab44b3d70e3fcc197a7020d0ca5
> Author: Haren Myneni 
> Date:   Wed Jun 13 00:32:40 2018 -0700
>
> crypto/nx: Initialize 842 high and normal RxFIFO control registers
> 
> NX increments readOffset by FIFO size in receive FIFO control register
> when CRB is read. But the index in RxFIFO has to match with the
> corresponding entry in FIFO maintained by VAS in kernel. Otherwise NX
> may be processing incorrect CRBs and can cause CRB timeout.
> 
> VAS FIFO offset is 0 when the receive window is opened during
> initialization. When the module is reloaded or in kexec boot, readOffset
> in FIFO control register may not match with VAS entry. This patch adds
> nx_coproc_init OPAL call to reset readOffset and queued entries in FIFO
> control register for both high and normal FIFOs.
> 
> Signed-off-by: Haren Myneni 
> [mpe: Fixup uninitialized variable warning]
> Signed-off-by: Michael Ellerman 
>
> $ git describe --contains 656ecc16e8fc2ab44b3d70e3fcc197a7020d0ca5
> v4.19-rc1~24^2~50
>
>
> Which was never backported to any stable release, so probably needs to
> be for v4.14 through v4.18.

Yeah the P9 NX support went in in:
  b0d6c9bab5e4 ("crypto/nx: Add P9 NX support for 842 compression engine")

Which was: v4.14-rc1~119^2~21, so first released in v4.14.


I'm actually less interested in that and more interested in the
subsequent crash. The timestamps are miles apart though: did we just
leave some corrupted memory after the NX failed and then hit it later?
Or did we not correctly signal to the upper-level APIs that the request
failed?

I think we need to do some testing with errors injected into the
wait_for_csb() path, to ensure that failures there are not causing
corruption in zswap. Haren, have you done any testing of error injection?
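
As a rough illustration of the kind of error-injection test meant here, the
following user-space model forces a polling wait to time out and checks two
things: that the error reaches the caller, and that the destination buffer is
left untouched. Everything in it is hypothetical (struct fake_csb,
wait_for_csb_model(), compress_model(), the inject_csb_timeout knob and the
MAX_POLLS budget); it is a sketch of the test idea, not the driver's
wait_for_csb().

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define MAX_POLLS 500           /* hypothetical poll budget */

/* Hypothetical status block: valid becomes 1 once the engine is done. */
struct fake_csb {
    int valid;
};

/* Fault-injection knob: when set, the status block never becomes valid. */
int inject_csb_timeout;

/* Models the engine completing after a few polls unless injection is on. */
void fake_engine_tick(struct fake_csb *csb, int poll)
{
    if (!inject_csb_timeout && poll >= 3)
        csb->valid = 1;
}

/* Poll until the status block is valid or the budget runs out. */
int wait_for_csb_model(struct fake_csb *csb)
{
    for (int poll = 0; poll < MAX_POLLS; poll++) {
        fake_engine_tick(csb, poll);
        if (csb->valid)
            return 0;
    }
    return -ETIMEDOUT;
}

/* A request that must not write to dst when the wait fails. */
int compress_model(const char *src, char *dst, size_t len)
{
    struct fake_csb csb = { 0 };
    int ret = wait_for_csb_model(&csb);

    if (ret)
        return ret;             /* propagate the error, leave dst alone */
    memcpy(dst, src, len);      /* a copy stands in for real output */
    return 0;
}

/* Returns 0 if both the failing and the succeeding case behave. */
int run_injection_test(void)
{
    char src[8] = "payload";
    char dst[8];
    char canary[8];

    memset(dst, 0xAA, sizeof(dst));
    memset(canary, 0xAA, sizeof(canary));

    inject_csb_timeout = 1;
    if (compress_model(src, dst, sizeof(src)) != -ETIMEDOUT)
        return 1;               /* error was not signalled */
    if (memcmp(dst, canary, sizeof(dst)) != 0)
        return 2;               /* dst was modified on failure */

    inject_csb_timeout = 0;
    if (compress_model(src, dst, sizeof(src)) != 0)
        return 3;
    if (memcmp(dst, src, sizeof(src)) != 0)
        return 4;
    return 0;
}
```

A real test would need the injection point inside the kernel path and would
have to check the zswap side as well; the model only shows the two properties
worth asserting.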

cheers


Re: crash after NX error

2019-06-04 Thread Haren Myneni
On 06/03/2019 08:23 PM, Stewart Smith wrote:
> On my two socket POWER9 system (powernv) with 842 zswap set up, I
> recently got a crash with the Ubuntu kernel (I haven't tried with
> upstream, and this is the first time the system has died like this, so
> I'm not sure how repeatable it is).
> 
> [2.891463] zswap: loaded using pool 842-nx/zbud
> ...
> [15626.124646] nx_compress_powernv: ERROR: CSB still not valid after 500 
> us, giving up : 00 00 00 00 
> [16868.932913] Unable to handle kernel paging request for data at address 
> 0x6655f67da816cdb8
> [16868.933726] Faulting instruction address: 0xc0391600
> 
> 
> cpu 0x68: Vector: 380 (Data Access Out of Range) at [c01c9d98b9a0]
> pc: c0391600: kmem_cache_alloc+0x2e0/0x340
> lr: c03915ec: kmem_cache_alloc+0x2cc/0x340
> sp: c01c9d98bc20
>msr: 9280b033
>dar: 6655f67da816cdb8
>   current = 0xc01ad43cb400
>   paca= 0xcfac7800   softe: 0irq_happened: 0x01
> pid   = 8319, comm = make
> Linux version 4.15.0-50-generic (buildd@bos02-ppc64el-006) (gcc version 7.3.0 
> (Ubuntu 7.3.0-16ubuntu3)) #54-Ubuntu SMP Mon May 6 18:55:18 UTC 2019 (Ubuntu 
> 4.15.0-50.54-generic 4.15.18)
> 
> 68:mon> t
> [c01c9d98bc20] c03914d4 kmem_cache_alloc+0x1b4/0x340 (unreliable)
> [c01c9d98bc80] c03b1e14 __khugepaged_enter+0x54/0x220
> [c01c9d98bcc0] c010f0ec copy_process.isra.5.part.6+0xebc/0x1a10
> [c01c9d98bda0] c010fe4c _do_fork+0xec/0x510
> [c01c9d98be30] c000b584 ppc_clone+0x8/0xc
> --- Exception: c00 (System Call) at 7afe9daf87f4
> SP (7fffca606880) is in userspace
> 
> So, it looks like there could be a problem in the error path, plausibly
> fixed by this patch:
> 
> commit 656ecc16e8fc2ab44b3d70e3fcc197a7020d0ca5
> Author: Haren Myneni 
> Date:   Wed Jun 13 00:32:40 2018 -0700
> 
> crypto/nx: Initialize 842 high and normal RxFIFO control registers
> 
> NX increments readOffset by FIFO size in receive FIFO control register
> when CRB is read. But the index in RxFIFO has to match with the
> corresponding entry in FIFO maintained by VAS in kernel. Otherwise NX
> may be processing incorrect CRBs and can cause CRB timeout.
> 
> VAS FIFO offset is 0 when the receive window is opened during
> initialization. When the module is reloaded or in kexec boot, readOffset
> in FIFO control register may not match with VAS entry. This patch adds
> nx_coproc_init OPAL call to reset readOffset and queued entries in FIFO
> control register for both high and normal FIFOs.
> 
> Signed-off-by: Haren Myneni 
> [mpe: Fixup uninitialized variable warning]
> Signed-off-by: Michael Ellerman 
> 
> $ git describe --contains 656ecc16e8fc2ab44b3d70e3fcc197a7020d0ca5
> v4.19-rc1~24^2~50
> 
> 
> Which was never backported to any stable release, so probably needs to
> be for v4.14 through v4.18. Notably, Ubuntu is on v4.15 and it doesn't
> seem to have picked up the patch. I'm opening an Ubuntu bug for this.
> 
> Haren, is this something you can drive through the stable process
> (assuming my above crash looks like this failure)?
> 

Thanks Stewart. I missed this for the stable releases and will work on it. It
was merged into the Ubuntu 18.04.x kernel recently and will be in the next update.

The backport also needs:
 
commit 6e708000ec2c93c2bde6a46aa2d6c3e80d4eaeb9
Author: Haren Myneni 
Date:   Wed Jun 13 00:28:57 2018 -0700

powerpc/powernv: Export opal_check_token symbol

Export opal_check_token symbol for modules to check the availability
of OPAL calls before using them.

Signed-off-by: Haren Myneni 
Signed-off-by: Michael Ellerman 

 






crash after NX error

2019-06-03 Thread Stewart Smith
On my two socket POWER9 system (powernv) with 842 zswap set up, I
recently got a crash with the Ubuntu kernel (I haven't tried with
upstream, and this is the first time the system has died like this, so
I'm not sure how repeatable it is).

[2.891463] zswap: loaded using pool 842-nx/zbud
...
[15626.124646] nx_compress_powernv: ERROR: CSB still not valid after 500 us, giving up : 00 00 00 00
[16868.932913] Unable to handle kernel paging request for data at address 0x6655f67da816cdb8
[16868.933726] Faulting instruction address: 0xc0391600


cpu 0x68: Vector: 380 (Data Access Out of Range) at [c01c9d98b9a0]
pc: c0391600: kmem_cache_alloc+0x2e0/0x340
lr: c03915ec: kmem_cache_alloc+0x2cc/0x340
sp: c01c9d98bc20
   msr: 9280b033
   dar: 6655f67da816cdb8
  current = 0xc01ad43cb400
  paca= 0xcfac7800   softe: 0   irq_happened: 0x01
pid   = 8319, comm = make
Linux version 4.15.0-50-generic (buildd@bos02-ppc64el-006) (gcc version 7.3.0 
(Ubuntu 7.3.0-16ubuntu3)) #54-Ubuntu SMP Mon May 6 18:55:18 UTC 2019 (Ubuntu 
4.15.0-50.54-generic 4.15.18)

68:mon> t
[c01c9d98bc20] c03914d4 kmem_cache_alloc+0x1b4/0x340 (unreliable)
[c01c9d98bc80] c03b1e14 __khugepaged_enter+0x54/0x220
[c01c9d98bcc0] c010f0ec copy_process.isra.5.part.6+0xebc/0x1a10
[c01c9d98bda0] c010fe4c _do_fork+0xec/0x510
[c01c9d98be30] c000b584 ppc_clone+0x8/0xc
--- Exception: c00 (System Call) at 7afe9daf87f4
SP (7fffca606880) is in userspace

So, it looks like there could be a problem in the error path, plausibly
fixed by this patch:

commit 656ecc16e8fc2ab44b3d70e3fcc197a7020d0ca5
Author: Haren Myneni 
Date:   Wed Jun 13 00:32:40 2018 -0700

crypto/nx: Initialize 842 high and normal RxFIFO control registers

NX increments readOffset by FIFO size in receive FIFO control register
when CRB is read. But the index in RxFIFO has to match with the
corresponding entry in FIFO maintained by VAS in kernel. Otherwise NX
may be processing incorrect CRBs and can cause CRB timeout.

VAS FIFO offset is 0 when the receive window is opened during
initialization. When the module is reloaded or in kexec boot, readOffset
in FIFO control register may not match with VAS entry. This patch adds
nx_coproc_init OPAL call to reset readOffset and queued entries in FIFO
control register for both high and normal FIFOs.

Signed-off-by: Haren Myneni 
[mpe: Fixup uninitialized variable warning]
Signed-off-by: Michael Ellerman 

$ git describe --contains 656ecc16e8fc2ab44b3d70e3fcc197a7020d0ca5
v4.19-rc1~24^2~50
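
To illustrate the readOffset mismatch described in the commit message above,
here is a toy ring buffer in which the producer restarts at offset 0 after a
"reload" while the consumer may or may not be reset. It is a deliberately
simplified model: struct toy_fifo, its helpers and FIFO_ENTRIES are all
hypothetical, the indices advance by one entry at a time, and nothing here
reflects the real register layout.

```c
#include <assert.h>

#define FIFO_ENTRIES 8          /* hypothetical FIFO depth */

struct toy_fifo {
    int slots[FIFO_ENTRIES];
    unsigned int write_idx;     /* producer side, models the VAS offset */
    unsigned int read_idx;      /* consumer side, models NX readOffset */
};

/* Producer queues one request id. */
void toy_push(struct toy_fifo *f, int crb_id)
{
    f->slots[f->write_idx] = crb_id;
    f->write_idx = (f->write_idx + 1) % FIFO_ENTRIES;
}

/* Consumer takes whatever is in the slot its own index points at. */
int toy_pop(struct toy_fifo *f)
{
    int crb_id = f->slots[f->read_idx];

    f->read_idx = (f->read_idx + 1) % FIFO_ENTRIES;
    return crb_id;
}

/*
 * Models a module reload or kexec boot: the producer starts again at 0.
 * The consumer index is reset only when reset_consumer is non-zero.
 */
void toy_reopen(struct toy_fifo *f, int reset_consumer)
{
    f->write_idx = 0;
    if (reset_consumer)
        f->read_idx = 0;
}

/* Returns the id the consumer sees for the first request after reopen. */
int toy_mismatch_demo(int reset_consumer)
{
    struct toy_fifo f = { { 0 }, 0, 0 };

    for (int id = 1; id <= 3; id++) {
        toy_push(&f, id);
        (void)toy_pop(&f);
    }
    toy_reopen(&f, reset_consumer);
    toy_push(&f, 100);          /* first request after the reload */
    return toy_pop(&f);
}
```

With the consumer reset, toy_mismatch_demo(1) returns the freshly queued id
100. Without the reset, the consumer reads a slot the producer has not
written since the reload and gets something other than 100, which is the toy
equivalent of processing an incorrect CRB.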


Which was never backported to any stable release, so probably needs to
be for v4.14 through v4.18. Notably, Ubuntu is on v4.15 and it doesn't
seem to have picked up the patch. I'm opening an Ubuntu bug for this.

Haren, is this something you can drive through the stable process
(assuming my above crash looks like this failure)?

-- 
Stewart Smith
OPAL Architect, IBM.