Re: [PATCH] crypto: arm64 - revert NEON yield for fast AEAD implementations

2018-08-03 Thread Herbert Xu
On Sun, Jul 29, 2018 at 04:52:30PM +0200, Ard Biesheuvel wrote:
> [...]

Patch applied.  Thanks.
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH] crypto: arm64 - revert NEON yield for fast AEAD implementations

2018-08-03 Thread Ard Biesheuvel
On 3 August 2018 at 10:17, Herbert Xu  wrote:
> On Fri, Aug 03, 2018 at 09:10:08AM +0200, Ard Biesheuvel wrote:
>> But I think it's too late now to take this into v4.18. Could you
>> please queue this (and my other two pending arm64/aes-gcm patches, if
>> possible) for v4.19 instead?
>
> OK I'll do that.
>

Thanks.

But, actually, those two pending patches are a 3-piece series now ...
(v2 of 'crypto/arm64: aes-ce-gcm - switch to 2-way aggregation')

Thanks,
Ard.


Re: [PATCH] crypto: arm64 - revert NEON yield for fast AEAD implementations

2018-08-03 Thread Herbert Xu
On Fri, Aug 03, 2018 at 09:10:08AM +0200, Ard Biesheuvel wrote:
> But I think it's too late now to take this into v4.18. Could you
> please queue this (and my other two pending arm64/aes-gcm patches, if
> possible) for v4.19 instead?

OK I'll do that.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH] crypto: arm64 - revert NEON yield for fast AEAD implementations

2018-08-03 Thread Ard Biesheuvel
On 3 August 2018 at 08:14, Herbert Xu  wrote:
> On Sun, Jul 29, 2018 at 04:52:30PM +0200, Ard Biesheuvel wrote:
>> [...]
>
> Acked-by: Herbert Xu 
>

Thanks Herbert.

But I think it's too late now to take this into v4.18. Could you
please queue this (and my other two pending arm64/aes-gcm patches, if
possible) for v4.19 instead?

Thanks,


Re: [PATCH] crypto: arm64 - revert NEON yield for fast AEAD implementations

2018-08-03 Thread Herbert Xu
On Sun, Jul 29, 2018 at 04:52:30PM +0200, Ard Biesheuvel wrote:
> [...]

Acked-by: Herbert Xu 

Thanks,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


[PATCH] crypto: arm64 - revert NEON yield for fast AEAD implementations

2018-07-29 Thread Ard Biesheuvel
As it turns out, checking the TIF_NEED_RESCHED flag after each
iteration results in a significant performance regression (~10%)
when running fast algorithms (i.e., ones that use special instructions
and operate in the < 4 cycles per byte range) on in-order cores with
comparatively slow memory accesses such as the Cortex-A53.
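
In rough C terms, the pattern at issue amounts to something like the
sketch below. The helper names are hypothetical stand-ins for the real
if_will_cond_yield_neon/do_cond_yield_neon assembler macros; the point
is the per-block flag poll, which is what hurts on in-order cores:

  /* Illustrative sketch only -- all helpers are hypothetical stubs,
   * not the real kernel symbols. */
  #include <stdbool.h>
  #include <stddef.h>

  void process_block_neon(unsigned char *dst, const unsigned char *src);
  bool resched_flag_set(void);  /* models the TIF_NEED_RESCHED test   */
  void yield_neon(void);        /* save NEON state, schedule, restore */

  void transform(unsigned char *dst, const unsigned char *src, size_t blocks)
  {
          while (blocks--) {
                  process_block_neon(dst, src);  /* < 4 cycles per byte */
                  dst += 16;
                  src += 16;
                  if (resched_flag_set())  /* per-iteration poll: the */
                          yield_neon();    /* source of the ~10% hit  */
          }
  }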

Given the speed of these ciphers, and the fact that the page-based
nature of the AEAD scatterwalk API guarantees that the core NEON
transform is never invoked with more than a single page's worth of
input, we can estimate the worst-case duration of any resulting
scheduling blackout: on a 1 GHz Cortex-A53 running with 64k pages,
processing a page's worth of input at 4 cycles per byte results in
a delay of ~250 us, which is a reasonable upper bound.
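
Spelled out, with those numbers taken at face value (64 KiB page,
4 cycles per byte, 1 GHz clock), the bound works out as follows; this
tiny standalone program just re-does the arithmetic:

  /* Back-of-the-envelope check of the ~250 us figure quoted above. */
  #include <stdio.h>

  int main(void)
  {
          const unsigned long page_bytes = 64 * 1024;    /* 64k pages     */
          const unsigned long cycles = page_bytes * 4;   /* 4 cycles/byte */
          const double us = cycles / 1000.0;  /* 1 GHz -> 1000 cycles/us  */

          printf("worst-case NEON blackout: ~%.0f us\n", us);  /* ~262 us */
          return 0;
  }

262 us is the same order of magnitude as the ~250 us quoted above.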

So let's remove the yield checks from the fused AES-CCM and AES-GCM
routines entirely.
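
With the yield checks gone, the bound comes purely from the calling
convention: each kernel_neon_begin/end section covers at most one
scatterwalk step. A loose C sketch of that structure is below;
kernel_neon_begin/kernel_neon_end are the real arm64 API, while the
walk/transform helpers are hypothetical stand-ins:

  /* Loose sketch; only kernel_neon_begin/end name real kernel symbols. */
  void kernel_neon_begin(void);
  void kernel_neon_end(void);

  struct walk_state { unsigned int nbytes; };            /* hypothetical */
  void fused_aead_transform(void *dst, const void *src,
                            unsigned int len);           /* hypothetical */
  void walk_next_page(struct walk_state *w);  /* advances walk, updates
                                                 nbytes (hypothetical)  */

  void do_crypt(struct walk_state *w, void *dst, const void *src)
  {
          while (w->nbytes) {           /* nbytes <= PAGE_SIZE per step   */
                  kernel_neon_begin();
                  fused_aead_transform(dst, src, w->nbytes);
                  kernel_neon_end();    /* preemption point between pages */
                  walk_next_page(w);
          }
  }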

This reverts commit 7b67ae4d5ce8e2f912377f5fbccb95811a92097f and
partially reverts commit 7c50136a8aba8784f07fb66a950cc61a7f3d2ee3.

Fixes: 7c50136a8aba ("crypto: arm64/aes-ghash - yield NEON after every ...")
Fixes: 7b67ae4d5ce8 ("crypto: arm64/aes-ccm - yield NEON after every ...")
Signed-off-by: Ard Biesheuvel 
---
This supersedes my series 'crypto/arm64: reduce impact of NEON yield checks'
sent out on the 24th of July.

Given the significant performance regression, we may want to treat this as
a fix (the patches in question went into v4.18).

This patch applies on top of my patch 'crypto/arm64: aes-ce-gcm - add missing
kernel_neon_begin/end pair', sent out on the 27th of July, which fixes a data
corruption bug and should also be applied as a fix.

 arch/arm64/crypto/aes-ce-ccm-core.S | 150 +++-
 arch/arm64/crypto/ghash-ce-core.S   |  76 --
 2 files changed, 80 insertions(+), 146 deletions(-)

diff --git a/arch/arm64/crypto/aes-ce-ccm-core.S b/arch/arm64/crypto/aes-ce-ccm-core.S
index 88f5aef7934c..e3a375c4cb83 100644
--- a/arch/arm64/crypto/aes-ce-ccm-core.S
+++ b/arch/arm64/crypto/aes-ce-ccm-core.S
@@ -19,33 +19,24 @@
 *   u32 *macp, u8 const rk[], u32 rounds);
 */
 ENTRY(ce_aes_ccm_auth_data)
-   frame_push  7
-
-   mov x19, x0
-   mov x20, x1
-   mov x21, x2
-   mov x22, x3
-   mov x23, x4
-   mov x24, x5
-
-   ldr w25, [x22]  /* leftover from prev round? */
+   ldr w8, [x3]   /* leftover from prev round? */
ld1 {v0.16b}, [x0]  /* load mac */
-   cbz w25, 1f
-   sub w25, w25, #16
+   cbz w8, 1f
+   sub w8, w8, #16
eor v1.16b, v1.16b, v1.16b
-0: ldrb w7, [x20], #1  /* get 1 byte of input */
-   subs w21, w21, #1
-   add w25, w25, #1
+0: ldrb w7, [x1], #1   /* get 1 byte of input */
+   subs w2, w2, #1
+   add w8, w8, #1
ins v1.b[0], w7
ext v1.16b, v1.16b, v1.16b, #1  /* rotate in the input bytes */
beq 8f  /* out of input? */
-   cbnzw25, 0b
+   cbnzw8, 0b
eor v0.16b, v0.16b, v1.16b
-1: ld1 {v3.4s}, [x23]  /* load first round key */
-   prfm pldl1strm, [x20]
-   cmp w24, #12/* which key size? */
-   add x6, x23, #16
-   sub w7, w24, #2 /* modified # of rounds */
+1: ld1 {v3.4s}, [x4]   /* load first round key */
+   prfm pldl1strm, [x1]
+   cmp w5, #12 /* which key size? */
+   add x6, x4, #16
+   sub w7, w5, #2  /* modified # of rounds */
bmi 2f
bne 5f
mov v5.16b, v3.16b
@@ -64,43 +55,33 @@ ENTRY(ce_aes_ccm_auth_data)
ld1 {v5.4s}, [x6], #16  /* load next round key */
bpl 3b
aese v0.16b, v4.16b
-   subs w21, w21, #16  /* last data? */
+   subs w2, w2, #16    /* last data? */
eor v0.16b, v0.16b, v5.16b  /* final round */
bmi 6f
-   ld1 {v1.16b}, [x20], #16/* load next input block */
+   ld1 {v1.16b}, [x1], #16 /* load next input block */
eor v0.16b, v0.16b, v1.16b  /* xor with mac */
-   beq 6f
-
-   if_will_cond_yield_neon
-   st1 {v0.16b}, [x19] /* store mac */
-   do_cond_yield_neon
-   ld1 {v0.16b}, [x19] /* reload mac */
-   endif_yield_neon
-
-   b   1b
-6: st1 {v0.16b}, [x19] /* store mac */
+   bne 1b
+6: st1 {v0.16b}, [x0]  /* store mac */
beq 10f
-   adds w21, w21, #16
+   adds w2, w2, #16
beq 10f
-   mov w25, w21
-7: ldrb