Re: [PATCH] crypto: arm64 - revert NEON yield for fast AEAD implementations
On Sun, Jul 29, 2018 at 04:52:30PM +0200, Ard Biesheuvel wrote:
> As it turns out, checking the TIF_NEED_RESCHED flag after each
> iteration results in a significant performance regression (~10%)
> when running fast algorithms (i.e., ones that use special instructions
> and operate in the < 4 cycles per byte range) on in-order cores with
> comparatively slow memory accesses such as the Cortex-A53.
>
> Given the speed of these ciphers, and the fact that the page based
> nature of the AEAD scatterwalk API guarantees that the core NEON
> transform is never invoked with more than a single page's worth of
> input, we can estimate the worst case duration of any resulting
> scheduling blackout: on a 1 GHz Cortex-A53 running with 64k pages,
> processing a page's worth of input at 4 cycles per byte results in
> a delay of ~250 us, which is a reasonable upper bound.
>
> So let's remove the yield checks from the fused AES-CCM and AES-GCM
> routines entirely.
>
> This reverts commit 7b67ae4d5ce8e2f912377f5fbccb95811a92097f and
> partially reverts commit 7c50136a8aba8784f07fb66a950cc61a7f3d2ee3.
>
> Fixes: 7c50136a8aba ("crypto: arm64/aes-ghash - yield NEON after every ...")
> Fixes: 7b67ae4d5ce8 ("crypto: arm64/aes-ccm - yield NEON after every ...")
> Signed-off-by: Ard Biesheuvel

Patch applied.  Thanks.
-- 
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH] crypto: arm64 - revert NEON yield for fast AEAD implementations
On 3 August 2018 at 10:17, Herbert Xu wrote:
> On Fri, Aug 03, 2018 at 09:10:08AM +0200, Ard Biesheuvel wrote:
>> But I think it's too late now to take this into v4.18. Could you
>> please queue this (and my other two pending arm64/aes-gcm patches, if
>> possible) for v4.19 instead?
>
> OK I'll do that.
>

Thanks. But, actually, those two pending patches are a 3-piece series
now ... (v2 of 'crypto/arm64: aes-ce-gcm - switch to 2-way aggregation')

Thanks,
Ard.
Re: [PATCH] crypto: arm64 - revert NEON yield for fast AEAD implementations
On Fri, Aug 03, 2018 at 09:10:08AM +0200, Ard Biesheuvel wrote:
> But I think it's too late now to take this into v4.18. Could you
> please queue this (and my other two pending arm64/aes-gcm patches, if
> possible) for v4.19 instead?

OK I'll do that.

Cheers,
-- 
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH] crypto: arm64 - revert NEON yield for fast AEAD implementations
On 3 August 2018 at 08:14, Herbert Xu wrote:
> On Sun, Jul 29, 2018 at 04:52:30PM +0200, Ard Biesheuvel wrote:
>> As it turns out, checking the TIF_NEED_RESCHED flag after each
>> iteration results in a significant performance regression (~10%)
>> when running fast algorithms (i.e., ones that use special instructions
>> and operate in the < 4 cycles per byte range) on in-order cores with
>> comparatively slow memory accesses such as the Cortex-A53.
>>
>> Given the speed of these ciphers, and the fact that the page based
>> nature of the AEAD scatterwalk API guarantees that the core NEON
>> transform is never invoked with more than a single page's worth of
>> input, we can estimate the worst case duration of any resulting
>> scheduling blackout: on a 1 GHz Cortex-A53 running with 64k pages,
>> processing a page's worth of input at 4 cycles per byte results in
>> a delay of ~250 us, which is a reasonable upper bound.
>>
>> So let's remove the yield checks from the fused AES-CCM and AES-GCM
>> routines entirely.
>>
>> This reverts commit 7b67ae4d5ce8e2f912377f5fbccb95811a92097f and
>> partially reverts commit 7c50136a8aba8784f07fb66a950cc61a7f3d2ee3.
>>
>> Fixes: 7c50136a8aba ("crypto: arm64/aes-ghash - yield NEON after every ...")
>> Fixes: 7b67ae4d5ce8 ("crypto: arm64/aes-ccm - yield NEON after every ...")
>> Signed-off-by: Ard Biesheuvel
>> ---
>> This supersedes my series 'crypto/arm64: reduce impact of NEON yield checks'
>> sent out on the 24th of July.
>>
>> Given the significant performance regression, we may want to treat this as
>> a fix (the patches in question went into v4.18)
>>
>> This patch applies onto my patch 'crypto/arm64: aes-ce-gcm - add missing
>> kernel_neon_begin/end pair' which I sent out on the 27th of July, which
>> fixes a data corruption bug, which should also be applied as a fix.
>
> Acked-by: Herbert Xu
>

Thanks Herbert. But I think it's too late now to take this into v4.18.
Could you please queue this (and my other two pending arm64/aes-gcm
patches, if possible) for v4.19 instead?

Thanks,
Re: [PATCH] crypto: arm64 - revert NEON yield for fast AEAD implementations
On Sun, Jul 29, 2018 at 04:52:30PM +0200, Ard Biesheuvel wrote:
> As it turns out, checking the TIF_NEED_RESCHED flag after each
> iteration results in a significant performance regression (~10%)
> when running fast algorithms (i.e., ones that use special instructions
> and operate in the < 4 cycles per byte range) on in-order cores with
> comparatively slow memory accesses such as the Cortex-A53.
>
> Given the speed of these ciphers, and the fact that the page based
> nature of the AEAD scatterwalk API guarantees that the core NEON
> transform is never invoked with more than a single page's worth of
> input, we can estimate the worst case duration of any resulting
> scheduling blackout: on a 1 GHz Cortex-A53 running with 64k pages,
> processing a page's worth of input at 4 cycles per byte results in
> a delay of ~250 us, which is a reasonable upper bound.
>
> So let's remove the yield checks from the fused AES-CCM and AES-GCM
> routines entirely.
>
> This reverts commit 7b67ae4d5ce8e2f912377f5fbccb95811a92097f and
> partially reverts commit 7c50136a8aba8784f07fb66a950cc61a7f3d2ee3.
>
> Fixes: 7c50136a8aba ("crypto: arm64/aes-ghash - yield NEON after every ...")
> Fixes: 7b67ae4d5ce8 ("crypto: arm64/aes-ccm - yield NEON after every ...")
> Signed-off-by: Ard Biesheuvel
> ---
> This supersedes my series 'crypto/arm64: reduce impact of NEON yield checks'
> sent out on the 24th of July.
>
> Given the significant performance regression, we may want to treat this as
> a fix (the patches in question went into v4.18)
>
> This patch applies onto my patch 'crypto/arm64: aes-ce-gcm - add missing
> kernel_neon_begin/end pair' which I sent out on the 27th of July, which
> fixes a data corruption bug, which should also be applied as a fix.

Acked-by: Herbert Xu

Thanks,
-- 
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
[PATCH] crypto: arm64 - revert NEON yield for fast AEAD implementations
As it turns out, checking the TIF_NEED_RESCHED flag after each
iteration results in a significant performance regression (~10%)
when running fast algorithms (i.e., ones that use special instructions
and operate in the < 4 cycles per byte range) on in-order cores with
comparatively slow memory accesses such as the Cortex-A53.

Given the speed of these ciphers, and the fact that the page based
nature of the AEAD scatterwalk API guarantees that the core NEON
transform is never invoked with more than a single page's worth of
input, we can estimate the worst case duration of any resulting
scheduling blackout: on a 1 GHz Cortex-A53 running with 64k pages,
processing a page's worth of input at 4 cycles per byte results in
a delay of ~250 us, which is a reasonable upper bound.

So let's remove the yield checks from the fused AES-CCM and AES-GCM
routines entirely.

This reverts commit 7b67ae4d5ce8e2f912377f5fbccb95811a92097f and
partially reverts commit 7c50136a8aba8784f07fb66a950cc61a7f3d2ee3.

Fixes: 7c50136a8aba ("crypto: arm64/aes-ghash - yield NEON after every ...")
Fixes: 7b67ae4d5ce8 ("crypto: arm64/aes-ccm - yield NEON after every ...")
Signed-off-by: Ard Biesheuvel
---
This supersedes my series 'crypto/arm64: reduce impact of NEON yield
checks' sent out on the 24th of July.

Given the significant performance regression, we may want to treat this
as a fix (the patches in question went into v4.18).

This patch applies onto my patch 'crypto/arm64: aes-ce-gcm - add missing
kernel_neon_begin/end pair' which I sent out on the 27th of July, which
fixes a data corruption bug, which should also be applied as a fix.
 arch/arm64/crypto/aes-ce-ccm-core.S | 150 +++-
 arch/arm64/crypto/ghash-ce-core.S   |  76 --
 2 files changed, 80 insertions(+), 146 deletions(-)

diff --git a/arch/arm64/crypto/aes-ce-ccm-core.S b/arch/arm64/crypto/aes-ce-ccm-core.S
index 88f5aef7934c..e3a375c4cb83 100644
--- a/arch/arm64/crypto/aes-ce-ccm-core.S
+++ b/arch/arm64/crypto/aes-ce-ccm-core.S
@@ -19,33 +19,24 @@
  *			 u32 *macp, u8 const rk[], u32 rounds);
  */
 ENTRY(ce_aes_ccm_auth_data)
-	frame_push	7
-
-	mov	x19, x0
-	mov	x20, x1
-	mov	x21, x2
-	mov	x22, x3
-	mov	x23, x4
-	mov	x24, x5
-
-	ldr	w25, [x22]			/* leftover from prev round? */
+	ldr	w8, [x3]			/* leftover from prev round? */
 	ld1	{v0.16b}, [x0]			/* load mac */
-	cbz	w25, 1f
-	sub	w25, w25, #16
+	cbz	w8, 1f
+	sub	w8, w8, #16
 	eor	v1.16b, v1.16b, v1.16b
-0:	ldrb	w7, [x20], #1			/* get 1 byte of input */
-	subs	w21, w21, #1
-	add	w25, w25, #1
+0:	ldrb	w7, [x1], #1			/* get 1 byte of input */
+	subs	w2, w2, #1
+	add	w8, w8, #1
 	ins	v1.b[0], w7
 	ext	v1.16b, v1.16b, v1.16b, #1	/* rotate in the input bytes */
 	beq	8f				/* out of input? */
-	cbnz	w25, 0b
+	cbnz	w8, 0b
 	eor	v0.16b, v0.16b, v1.16b
-1:	ld1	{v3.4s}, [x23]			/* load first round key */
-	prfm	pldl1strm, [x20]
-	cmp	w24, #12			/* which key size? */
-	add	x6, x23, #16
-	sub	w7, w24, #2			/* modified # of rounds */
+1:	ld1	{v3.4s}, [x4]			/* load first round key */
+	prfm	pldl1strm, [x1]
+	cmp	w5, #12				/* which key size? */
+	add	x6, x4, #16
+	sub	w7, w5, #2			/* modified # of rounds */
 	bmi	2f
 	bne	5f
 	mov	v5.16b, v3.16b
@@ -64,43 +55,33 @@ ENTRY(ce_aes_ccm_auth_data)
 	ld1	{v5.4s}, [x6], #16		/* load next round key */
 	bpl	3b
 	aese	v0.16b, v4.16b
-	subs	w21, w21, #16			/* last data? */
+	subs	w2, w2, #16			/* last data? */
 	eor	v0.16b, v0.16b, v5.16b		/* final round */
 	bmi	6f
-	ld1	{v1.16b}, [x20], #16		/* load next input block */
+	ld1	{v1.16b}, [x1], #16		/* load next input block */
 	eor	v0.16b, v0.16b, v1.16b		/* xor with mac */
-	beq	6f
-
-	if_will_cond_yield_neon
-	st1	{v0.16b}, [x19]			/* store mac */
-	do_cond_yield_neon
-	ld1	{v0.16b}, [x19]			/* reload mac */
-	endif_yield_neon
-
-	b	1b
+	bne	1b
-6:	st1	{v0.16b}, [x19]			/* store mac */
+6:	st1	{v0.16b}, [x0]			/* store mac */
 	beq	10f
-	adds	w21, w21, #16
+	adds	w2, w2, #16
 	beq	10f
-	mov	w25, w21
-7:	ldrb
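[Editorial note: the ~250 us worst-case "scheduling blackout" quoted throughout this thread follows directly from the figures stated in the commit message (1 GHz Cortex-A53, 64k pages, 4 cycles per byte). As a sanity check, the arithmetic can be reproduced with a few lines of Python; the variable names below are illustrative only and do not appear in the patch.]

```python
# Reproduce the worst-case NEON blackout estimate from the commit message.
# Inputs (all taken from the text above): a 64 KiB page is the most data
# the AEAD scatterwalk hands to one NEON invocation, the fused ciphers run
# at <= 4 cycles per byte, and the core clock is 1 GHz.

PAGE_SIZE = 64 * 1024         # bytes per page with a 64k-page kernel
CYCLES_PER_BYTE = 4           # upper bound for the fused AES-CCM/AES-GCM code
CLOCK_HZ = 1_000_000_000      # 1 GHz Cortex-A53

cycles = PAGE_SIZE * CYCLES_PER_BYTE          # cycles spent on one full page
blackout_us = cycles / CLOCK_HZ * 1e6         # time with preemption disabled

print(f"worst case: {cycles} cycles = {blackout_us:.0f} us")
```

This comes out to 262,144 cycles, i.e. about 262 us, consistent with the "~250 us" upper bound cited in the commit message.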