Re: [PATCH v2 2/2] crypto: arm64/ghash - add NEON accelerated fallback for 64-bit PMULL
On 18 July 2017 at 10:56, Ard Biesheuvel wrote:
> On 18 July 2017 at 10:49, Herbert Xu wrote:
>> On Wed, Jul 05, 2017 at 12:43:19AM +0100, Ard Biesheuvel wrote:
>>> Implement a NEON fallback for systems that do support NEON but have
>>> no support for the optional 64x64->128 polynomial multiplication
>>> instruction that is part of the ARMv8 Crypto Extensions. It is based
>>> on the paper "Fast Software Polynomial Multiplication on ARM Processors
>>> Using the NEON Engine" by Danilo Camara, Conrado Gouvea, Julio Lopez and
>>> Ricardo Dahab (https://hal.inria.fr/hal-01506572), but has been reworked
>>> extensively for the AArch64 ISA.
>>>
>>> On a low-end core such as the Cortex-A53 found in the Raspberry Pi3, the
>>> NEON based implementation is 4x faster than the table based one, and
>>> is time invariant as well, making it less vulnerable to timing attacks.
>>> When combined with the bit-sliced NEON implementation of AES-CTR, the
>>> AES-GCM performance increases by ~2x (from 58 to 30 cycles per byte).
>>>
>>> Signed-off-by: Ard Biesheuvel
>>
>> This patch does not apply against cryptodev.
>>
>
> Yeah, it implements a non-SIMD fallback which depends on the AES
> refactor series.

FYI I have pushed everything I have queued up locally here:

https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=crypto-arm-for-v4.14

Once the crypto_xor() and AES refactor stuff looks satisfactory to you,
I will repost the remaining bits, including these GCM and GHASH changes.

Thanks,
Ard.
Re: [PATCH v2 2/2] crypto: arm64/ghash - add NEON accelerated fallback for 64-bit PMULL
On 18 July 2017 at 10:49, Herbert Xu wrote:
> On Wed, Jul 05, 2017 at 12:43:19AM +0100, Ard Biesheuvel wrote:
>> Implement a NEON fallback for systems that do support NEON but have
>> no support for the optional 64x64->128 polynomial multiplication
>> instruction that is part of the ARMv8 Crypto Extensions. It is based
>> on the paper "Fast Software Polynomial Multiplication on ARM Processors
>> Using the NEON Engine" by Danilo Camara, Conrado Gouvea, Julio Lopez and
>> Ricardo Dahab (https://hal.inria.fr/hal-01506572), but has been reworked
>> extensively for the AArch64 ISA.
>>
>> On a low-end core such as the Cortex-A53 found in the Raspberry Pi3, the
>> NEON based implementation is 4x faster than the table based one, and
>> is time invariant as well, making it less vulnerable to timing attacks.
>> When combined with the bit-sliced NEON implementation of AES-CTR, the
>> AES-GCM performance increases by ~2x (from 58 to 30 cycles per byte).
>>
>> Signed-off-by: Ard Biesheuvel
>
> This patch does not apply against cryptodev.
>

Yeah, it implements a non-SIMD fallback which depends on the AES
refactor series.
Re: [PATCH v2 2/2] crypto: arm64/ghash - add NEON accelerated fallback for 64-bit PMULL
On Wed, Jul 05, 2017 at 12:43:19AM +0100, Ard Biesheuvel wrote:
> Implement a NEON fallback for systems that do support NEON but have
> no support for the optional 64x64->128 polynomial multiplication
> instruction that is part of the ARMv8 Crypto Extensions. It is based
> on the paper "Fast Software Polynomial Multiplication on ARM Processors
> Using the NEON Engine" by Danilo Camara, Conrado Gouvea, Julio Lopez and
> Ricardo Dahab (https://hal.inria.fr/hal-01506572), but has been reworked
> extensively for the AArch64 ISA.
>
> On a low-end core such as the Cortex-A53 found in the Raspberry Pi3, the
> NEON based implementation is 4x faster than the table based one, and
> is time invariant as well, making it less vulnerable to timing attacks.
> When combined with the bit-sliced NEON implementation of AES-CTR, the
> AES-GCM performance increases by ~2x (from 58 to 30 cycles per byte).
>
> Signed-off-by: Ard Biesheuvel

This patch does not apply against cryptodev.

Cheers,
-- 
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
[PATCH v2 2/2] crypto: arm64/ghash - add NEON accelerated fallback for 64-bit PMULL
Implement a NEON fallback for systems that do support NEON but have
no support for the optional 64x64->128 polynomial multiplication
instruction that is part of the ARMv8 Crypto Extensions. It is based
on the paper "Fast Software Polynomial Multiplication on ARM Processors
Using the NEON Engine" by Danilo Camara, Conrado Gouvea, Julio Lopez and
Ricardo Dahab (https://hal.inria.fr/hal-01506572), but has been reworked
extensively for the AArch64 ISA.

On a low-end core such as the Cortex-A53 found in the Raspberry Pi3, the
NEON based implementation is 4x faster than the table based one, and
is time invariant as well, making it less vulnerable to timing attacks.
When combined with the bit-sliced NEON implementation of AES-CTR, the
AES-GCM performance increases by ~2x (from 58 to 30 cycles per byte).

Signed-off-by: Ard Biesheuvel
---
v2: - use alternative reduction and precomputed coefficients for loop
      invariants
    - refactor asm macros for better legibility

 arch/arm64/crypto/ghash-ce-core.S | 251 +---
 arch/arm64/crypto/ghash-ce-glue.c |  40 +++-
 2 files changed, 255 insertions(+), 36 deletions(-)

diff --git a/arch/arm64/crypto/ghash-ce-core.S b/arch/arm64/crypto/ghash-ce-core.S
index cb22459eba85..5d40b99ca3ac 100644
--- a/arch/arm64/crypto/ghash-ce-core.S
+++ b/arch/arm64/crypto/ghash-ce-core.S
@@ -1,7 +1,7 @@
 /*
  * Accelerated GHASH implementation with ARMv8 PMULL instructions.
  *
- * Copyright (C) 2014 Linaro Ltd.
+ * Copyright (C) 2014 - 2017 Linaro Ltd.
  *
  * This program is free software; you can redistribute it and/or modify it
  * under the terms of the GNU General Public License version 2 as published
@@ -11,31 +11,219 @@

 #include
 #include

-	SHASH		.req	v0
-	SHASH2		.req	v1
-	T1		.req	v2
-	T2		.req	v3
-	MASK		.req	v4
-	XL		.req	v5
-	XM		.req	v6
-	XH		.req	v7
-	IN1		.req	v7
+	SHASH		.req	v0
+	SHASH2		.req	v1
+	T1		.req	v2
+	T2		.req	v3
+	MASK		.req	v4
+	XL		.req	v5
+	XM		.req	v6
+	XH		.req	v7
+	IN1		.req	v7
+
+	k00_16		.req	v8
+	k32_48		.req	v9
+
+	t3		.req	v10
+	t4		.req	v11
+	t5		.req	v12
+	t6		.req	v13
+	t7		.req	v14
+	t8		.req	v15
+	t9		.req	v16
+
+	perm1		.req	v17
+	perm2		.req	v18
+	perm3		.req	v19
+
+	sh1		.req	v20
+	sh2		.req	v21
+	sh3		.req	v22
+	sh4		.req	v23
+
+	ss1		.req	v24
+	ss2		.req	v25
+	ss3		.req	v26
+	ss4		.req	v27
+
+	VZR		.req	v28

 	.text
 	.arch		armv8-a+crypto

-	/*
-	 * void pmull_ghash_update(int blocks, u64 dg[], const char *src,
-	 *			   struct ghash_key const *k, const char *head)
-	 */
-ENTRY(pmull_ghash_update)
+	.macro		__pmull_p64, rd, rn, rm
+	pmull		\rd\().1q, \rn\().1d, \rm\().1d
+	.endm
+
+	.macro		__pmull2_p64, rd, rn, rm
+	pmull2		\rd\().1q, \rn\().2d, \rm\().2d
+	.endm
+
+	.macro		__pmull_p8, rq, ad, bd
+	ext		t3.8b, \ad\().8b, \ad\().8b, #1		// A1
+	ext		t5.8b, \ad\().8b, \ad\().8b, #2		// A2
+	ext		t7.8b, \ad\().8b, \ad\().8b, #3		// A3
+
+	__pmull_p8_\bd	\rq, \ad
+	.endm
+
+	.macro		__pmull2_p8, rq, ad, bd
+	tbl		t3.16b, {\ad\().16b}, perm1.16b		// A1
+	tbl		t5.16b, {\ad\().16b}, perm2.16b		// A2
+	tbl		t7.16b, {\ad\().16b}, perm3.16b		// A3
+
+	__pmull2_p8_\bd	\rq, \ad
+	.endm
+
+	.macro		__pmull_p8_SHASH, rq, ad
+	__pmull_p8_tail	\rq, \ad\().8b, SHASH.8b, 8b,, sh1, sh2, sh3, sh4
+	.endm
+
+	.macro		__pmull_p8_SHASH2, rq, ad
+	__pmull_p8_tail	\rq, \ad\().8b, SHASH2.8b, 8b,, ss1, ss2, ss3, ss4
+	.endm
+
+	.macro		__pmull2_p8_SHASH, rq, ad
+	__pmull_p8_tail	\rq, \ad\().16b, SHASH.16b, 16b, 2, sh1, sh2, sh3, sh4
+	.endm
+
+	.macro		__pmull_p8_tail, rq, ad, bd, nb, t, b1, b2, b3, b4
+	pmull\t		t3.8h, t3.\nb, \bd			// F = A1*B
+	pmull\t		t4.8h, \ad, \b1\().\nb			// E = A*B1
+	pmull\t		t5.8h, t5.\nb, \bd			// H = A2*B
+	pmull\t		t6.8h, \ad,