Re: [PATCH v2 2/2] crypto: arm64/ghash - add NEON accelerated fallback for 64-bit PMULL

2017-07-18 Thread Ard Biesheuvel
On 18 July 2017 at 10:56, Ard Biesheuvel  wrote:
> On 18 July 2017 at 10:49, Herbert Xu  wrote:
>> On Wed, Jul 05, 2017 at 12:43:19AM +0100, Ard Biesheuvel wrote:
>>> Implement a NEON fallback for systems that do support NEON but have
>>> no support for the optional 64x64->128 polynomial multiplication
>>> instruction that is part of the ARMv8 Crypto Extensions. It is based
>>> on the paper "Fast Software Polynomial Multiplication on ARM Processors
>>> Using the NEON Engine" by Danilo Camara, Conrado Gouvea, Julio Lopez and
>>> Ricardo Dahab (https://hal.inria.fr/hal-01506572), but has been reworked
>>> extensively for the AArch64 ISA.
>>>
>>> On a low-end core such as the Cortex-A53 found in the Raspberry Pi3, the
>>> NEON based implementation is 4x faster than the table based one, and
>>> is time invariant as well, making it less vulnerable to timing attacks.
>>> When combined with the bit-sliced NEON implementation of AES-CTR, the
>>> AES-GCM performance increases by ~2x (from 58 to 30 cycles per byte).
>>>
>>> Signed-off-by: Ard Biesheuvel 
>>
>> This patch does not apply against cryptodev.
>>
>
> Yeah, it implements a non-SIMD fallback which depends on the AES
> refactor series.

FYI I have pushed everything I have queued up locally here:
https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=crypto-arm-for-v4.14

Once the crypto_xor() and AES refactor stuff looks satisfactory to
you, I will repost the remaining bits, including these GCM and GHASH
changes.

Thanks,
Ard.
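
[Editor's note: for readers following the thread, the operation being accelerated
here is GHASH, i.e. multiplication in GF(2^128). The plain C reference below is
NOT the patch's code; it is the textbook bit-at-a-time algorithm from the GCM
specification (NIST SP 800-38D), shown only to illustrate what the PMULL/NEON
paths compute much faster. The be128 struct layout is an assumption for
illustration.]

```c
#include <stdint.h>

typedef struct { uint64_t hi, lo; } be128;  /* hi holds the first 8 bytes */

/* Multiply x * y in GF(2^128) using GCM's bit ordering, where bit 0 of
 * the first byte is the x^0 coefficient. Bit-at-a-time reference only. */
static be128 gf128_mul(be128 x, be128 y)
{
	be128 z = { 0, 0 };
	be128 v = y;

	for (int i = 0; i < 128; i++) {
		/* bit i of x in GCM order: MSB of x.hi first */
		uint64_t bit = (i < 64) ? (x.hi >> (63 - i)) & 1
					: (x.lo >> (127 - i)) & 1;
		if (bit) {
			z.hi ^= v.hi;
			z.lo ^= v.lo;
		}
		/* v = v * x: a right shift in GCM bit order, reduced by
		 * R = x^128 + x^7 + x^2 + x + 1 (0xE1 in the leading byte) */
		uint64_t lsb = v.lo & 1;
		v.lo = (v.lo >> 1) | (v.hi << 63);
		v.hi >>= 1;
		if (lsb)
			v.hi ^= 0xe100000000000000ULL;
	}
	return z;
}
```

In this representation the multiplicative identity is the byte 0x80 followed by
fifteen zero bytes (be128 { 1ULL << 63, 0 }), so gf128_mul() of any element with
that value returns the element unchanged, which makes a convenient sanity check.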


Re: [PATCH v2 2/2] crypto: arm64/ghash - add NEON accelerated fallback for 64-bit PMULL

2017-07-18 Thread Ard Biesheuvel
On 18 July 2017 at 10:49, Herbert Xu  wrote:
> On Wed, Jul 05, 2017 at 12:43:19AM +0100, Ard Biesheuvel wrote:
>> Implement a NEON fallback for systems that do support NEON but have
>> no support for the optional 64x64->128 polynomial multiplication
>> instruction that is part of the ARMv8 Crypto Extensions. It is based
>> on the paper "Fast Software Polynomial Multiplication on ARM Processors
>> Using the NEON Engine" by Danilo Camara, Conrado Gouvea, Julio Lopez and
>> Ricardo Dahab (https://hal.inria.fr/hal-01506572), but has been reworked
>> extensively for the AArch64 ISA.
>>
>> On a low-end core such as the Cortex-A53 found in the Raspberry Pi3, the
>> NEON based implementation is 4x faster than the table based one, and
>> is time invariant as well, making it less vulnerable to timing attacks.
>> When combined with the bit-sliced NEON implementation of AES-CTR, the
>> AES-GCM performance increases by ~2x (from 58 to 30 cycles per byte).
>>
>> Signed-off-by: Ard Biesheuvel 
>
> This patch does not apply against cryptodev.
>

Yeah, it implements a non-SIMD fallback which depends on the AES
refactor series.


Re: [PATCH v2 2/2] crypto: arm64/ghash - add NEON accelerated fallback for 64-bit PMULL

2017-07-18 Thread Herbert Xu
On Wed, Jul 05, 2017 at 12:43:19AM +0100, Ard Biesheuvel wrote:
> Implement a NEON fallback for systems that do support NEON but have
> no support for the optional 64x64->128 polynomial multiplication
> instruction that is part of the ARMv8 Crypto Extensions. It is based
> on the paper "Fast Software Polynomial Multiplication on ARM Processors
> Using the NEON Engine" by Danilo Camara, Conrado Gouvea, Julio Lopez and
> Ricardo Dahab (https://hal.inria.fr/hal-01506572), but has been reworked
> extensively for the AArch64 ISA.
> 
> On a low-end core such as the Cortex-A53 found in the Raspberry Pi3, the
> NEON based implementation is 4x faster than the table based one, and
> is time invariant as well, making it less vulnerable to timing attacks.
> When combined with the bit-sliced NEON implementation of AES-CTR, the
> AES-GCM performance increases by ~2x (from 58 to 30 cycles per byte).
> 
> Signed-off-by: Ard Biesheuvel 

This patch does not apply against cryptodev.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


[PATCH v2 2/2] crypto: arm64/ghash - add NEON accelerated fallback for 64-bit PMULL

2017-07-04 Thread Ard Biesheuvel
Implement a NEON fallback for systems that do support NEON but have
no support for the optional 64x64->128 polynomial multiplication
instruction that is part of the ARMv8 Crypto Extensions. It is based
on the paper "Fast Software Polynomial Multiplication on ARM Processors
Using the NEON Engine" by Danilo Camara, Conrado Gouvea, Julio Lopez and
Ricardo Dahab (https://hal.inria.fr/hal-01506572), but has been reworked
extensively for the AArch64 ISA.

On a low-end core such as the Cortex-A53 found in the Raspberry Pi3, the
NEON based implementation is 4x faster than the table based one, and
is time invariant as well, making it less vulnerable to timing attacks.
When combined with the bit-sliced NEON implementation of AES-CTR, the
AES-GCM performance increases by ~2x (from 58 to 30 cycles per byte).

Signed-off-by: Ard Biesheuvel 
---
v2:
- use alternative reduction and precomputed coefficients for loop invariants
- refactor asm macros for better legibility
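
[Editor's note: the __pmull_p8 fallback below composes a 64x64->128-bit
carryless multiply out of the 8x8->16-bit polynomial multiplies (PMULL on
.8b lanes) that every NEON implementation provides. The scalar C model below
is a hedged sketch of that decomposition only; the names clmul8/clmul64 are
illustrative and do not appear in the patch, and the NEON code computes whole
vectors of these partial products at once via the shifted/permuted operand
copies (A1, A2, A3) set up with ext/tbl.]

```c
#include <stdint.h>

/* One lane of the 8-bit PMULL: multiply two degree-7 polynomials over GF(2). */
static uint16_t clmul8(uint8_t a, uint8_t b)
{
	uint16_t r = 0;

	for (int i = 0; i < 8; i++)
		if (b & (1 << i))
			r ^= (uint16_t)a << i;
	return r;
}

/* Schoolbook composition of the 64 byte-wise partial products into a
 * 128-bit carryless product; partial products are XORed, not added,
 * since coefficients live in GF(2). */
static void clmul64(uint64_t a, uint64_t b, uint64_t *lo, uint64_t *hi)
{
	*lo = 0;
	*hi = 0;

	for (int i = 0; i < 8; i++) {
		for (int j = 0; j < 8; j++) {
			uint16_t p = clmul8(a >> (8 * i), b >> (8 * j));
			int s = 8 * (i + j);	/* shift: 0 .. 112 */

			if (s < 64) {
				*lo ^= (uint64_t)p << s;
				if (s == 56)	/* top byte spills into hi */
					*hi ^= (uint64_t)p >> 8;
			} else {
				*hi ^= (uint64_t)p << (s - 64);
			}
		}
	}
}
```

For example, clmul64(3, 3, ...) computes (x + 1)^2 = x^2 + 1, i.e. lo = 5,
hi = 0: squaring over GF(2) has no cross terms, unlike integer multiplication.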

 arch/arm64/crypto/ghash-ce-core.S | 251 +---
 arch/arm64/crypto/ghash-ce-glue.c |  40 +++-
 2 files changed, 255 insertions(+), 36 deletions(-)

diff --git a/arch/arm64/crypto/ghash-ce-core.S b/arch/arm64/crypto/ghash-ce-core.S
index cb22459eba85..5d40b99ca3ac 100644
--- a/arch/arm64/crypto/ghash-ce-core.S
+++ b/arch/arm64/crypto/ghash-ce-core.S
@@ -1,7 +1,7 @@
 /*
  * Accelerated GHASH implementation with ARMv8 PMULL instructions.
  *
- * Copyright (C) 2014 Linaro Ltd. 
+ * Copyright (C) 2014 - 2017 Linaro Ltd. 
  *
  * This program is free software; you can redistribute it and/or modify it
  * under the terms of the GNU General Public License version 2 as published
@@ -11,31 +11,219 @@
 #include 
 #include 
 
-   SHASH   .req    v0
-   SHASH2  .req    v1
-   T1      .req    v2
-   T2      .req    v3
-   MASK    .req    v4
-   XL      .req    v5
-   XM      .req    v6
-   XH      .req    v7
-   IN1     .req    v7
+   SHASH   .req    v0
+   SHASH2  .req    v1
+   T1      .req    v2
+   T2      .req    v3
+   MASK    .req    v4
+   XL      .req    v5
+   XM      .req    v6
+   XH      .req    v7
+   IN1     .req    v7
+
+   k00_16  .req    v8
+   k32_48  .req    v9
+
+   t3      .req    v10
+   t4      .req    v11
+   t5      .req    v12
+   t6      .req    v13
+   t7      .req    v14
+   t8      .req    v15
+   t9      .req    v16
+
+   perm1   .req    v17
+   perm2   .req    v18
+   perm3   .req    v19
+
+   sh1     .req    v20
+   sh2     .req    v21
+   sh3     .req    v22
+   sh4     .req    v23
+
+   ss1     .req    v24
+   ss2     .req    v25
+   ss3     .req    v26
+   ss4     .req    v27
+
+   VZR     .req    v28
 
.text
.arch   armv8-a+crypto
 
-   /*
-* void pmull_ghash_update(int blocks, u64 dg[], const char *src,
-* struct ghash_key const *k, const char *head)
-*/
-ENTRY(pmull_ghash_update)
+   .macro  __pmull_p64, rd, rn, rm
+   pmull   \rd\().1q, \rn\().1d, \rm\().1d
+   .endm
+
+   .macro  __pmull2_p64, rd, rn, rm
+   pmull2  \rd\().1q, \rn\().2d, \rm\().2d
+   .endm
+
+   .macro  __pmull_p8, rq, ad, bd
+   ext t3.8b, \ad\().8b, \ad\().8b, #1 // A1
+   ext t5.8b, \ad\().8b, \ad\().8b, #2 // A2
+   ext t7.8b, \ad\().8b, \ad\().8b, #3 // A3
+
+   __pmull_p8_\bd  \rq, \ad
+   .endm
+
+   .macro  __pmull2_p8, rq, ad, bd
+   tbl t3.16b, {\ad\().16b}, perm1.16b // A1
+   tbl t5.16b, {\ad\().16b}, perm2.16b // A2
+   tbl t7.16b, {\ad\().16b}, perm3.16b // A3
+
+   __pmull2_p8_\bd \rq, \ad
+   .endm
+
+   .macro  __pmull_p8_SHASH, rq, ad
+   __pmull_p8_tail \rq, \ad\().8b, SHASH.8b, 8b,, sh1, sh2, sh3, sh4
+   .endm
+
+   .macro  __pmull_p8_SHASH2, rq, ad
+   __pmull_p8_tail \rq, \ad\().8b, SHASH2.8b, 8b,, ss1, ss2, ss3, ss4
+   .endm
+
+   .macro  __pmull2_p8_SHASH, rq, ad
+   __pmull_p8_tail \rq, \ad\().16b, SHASH.16b, 16b, 2, sh1, sh2, sh3, sh4
+   .endm
+
+   .macro  __pmull_p8_tail, rq, ad, bd, nb, t, b1, b2, b3, b4
+   pmull\t t3.8h, t3.\nb, \bd  // F = A1*B
+   pmull\t t4.8h, \ad, \b1\().\nb  // E = A*B1
+   pmull\t t5.8h, t5.\nb, \bd  // H = A2*B
+   pmull\t t6.8h, \ad,