[PATCH v2 0/3] crypto: arm64/chacha - performance improvements

2018-12-04 Thread Ard Biesheuvel
Improve the performance of NEON based ChaCha:

Patch #1 adds a block size of 1472 to the tcrypt test template so we have
something that reflects the VPN case.

Patch #2 improves performance for arbitrary length inputs: on deep pipelines,
throughput increases by ~30% when running on input blocks whose size is drawn
randomly from the interval [64, 1024).

Patch #3 adopts the OpenSSL approach of using the ALU in parallel with the
SIMD unit, processing a fifth block on the ALU while the SIMD unit operates
on four blocks.

Performance on Cortex-A57:

BEFORE:
===
testing speed of async chacha20 (chacha20-neon) encryption
tcrypt: test 0 (256 bit key, 16 byte blocks): 2528223 operations in 1 seconds (40451568 bytes)
tcrypt: test 1 (256 bit key, 64 byte blocks): 2518155 operations in 1 seconds (161161920 bytes)
tcrypt: test 2 (256 bit key, 256 byte blocks): 1207948 operations in 1 seconds (309234688 bytes)
tcrypt: test 3 (256 bit key, 1024 byte blocks): 332194 operations in 1 seconds (340166656 bytes)
tcrypt: test 4 (256 bit key, 1472 byte blocks): 185659 operations in 1 seconds (273290048 bytes)
tcrypt: test 5 (256 bit key, 8192 byte blocks): 41829 operations in 1 seconds (342663168 bytes)

AFTER:
==
testing speed of async chacha20 (chacha20-neon) encryption
tcrypt: test 0 (256 bit key, 16 byte blocks): 2530018 operations in 1 seconds (40480288 bytes)
tcrypt: test 1 (256 bit key, 64 byte blocks): 2518270 operations in 1 seconds (161169280 bytes)
tcrypt: test 2 (256 bit key, 256 byte blocks): 1187760 operations in 1 seconds (304066560 bytes)
tcrypt: test 3 (256 bit key, 1024 byte blocks): 361652 operations in 1 seconds (370331648 bytes)
tcrypt: test 4 (256 bit key, 1472 byte blocks): 280971 operations in 1 seconds (413589312 bytes)
tcrypt: test 5 (256 bit key, 8192 byte blocks): 53654 operations in 1 seconds (439533568 bytes)

Zinc:
=
testing speed of async chacha20 (chacha20-software) encryption
tcrypt: test 0 (256 bit key, 16 byte blocks): 2510300 operations in 1 seconds (40164800 bytes)
tcrypt: test 1 (256 bit key, 64 byte blocks): 2663794 operations in 1 seconds (170482816 bytes)
tcrypt: test 2 (256 bit key, 256 byte blocks): 1237617 operations in 1 seconds (316829952 bytes)
tcrypt: test 3 (256 bit key, 1024 byte blocks): 364645 operations in 1 seconds (373396480 bytes)
tcrypt: test 4 (256 bit key, 1472 byte blocks): 251548 operations in 1 seconds (370278656 bytes)
tcrypt: test 5 (256 bit key, 8192 byte blocks): 47650 operations in 1 seconds (390348800 bytes)
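
For reference, at 1472-byte blocks the figures above work out to roughly
261 MiB/s before, 394 MiB/s after the series and 353 MiB/s for Zinc. A
trivial standalone helper (not part of the series) to do the conversion:

#include <stdio.h>

int main(void)
{
	/* bytes processed per second at 1472-byte blocks, from the runs above */
	unsigned long bytes[] = { 273290048UL, 413589312UL, 370278656UL };
	const char *label[] = { "before", "after", "zinc" };

	for (int i = 0; i < 3; i++)
		printf("%-6s: %.1f MiB/s\n", label[i],
		       bytes[i] / (1024.0 * 1024.0));
	return 0;
}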

Cc: Eric Biggers 
Cc: Martin Willi 

Ard Biesheuvel (3):
  crypto: tcrypt - add block size of 1472 to skcipher template
  crypto: arm64/chacha - optimize for arbitrary length inputs
  crypto: arm64/chacha - use combined SIMD/ALU routine for more speed

 arch/arm64/crypto/chacha-neon-core.S | 396 +++-
 arch/arm64/crypto/chacha-neon-glue.c |  59 ++-
 crypto/tcrypt.c  |   2 +-
 3 files changed, 404 insertions(+), 53 deletions(-)

-- 
2.19.2



[PATCH v2 1/3] crypto: tcrypt - add block size of 1472 to skcipher template

2018-12-04 Thread Ard Biesheuvel
In order to have better coverage of algorithms operating on block
sizes that are in the ballpark of a VPN packet, add 1472 to the
block_sizes array.

Signed-off-by: Ard Biesheuvel 
---
 crypto/tcrypt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/crypto/tcrypt.c b/crypto/tcrypt.c
index 0590a9204562..e7fb87e114a5 100644
--- a/crypto/tcrypt.c
+++ b/crypto/tcrypt.c
@@ -81,7 +81,7 @@ static char *check[] = {
NULL
 };
 
-static u32 block_sizes[] = { 16, 64, 256, 1024, 8192, 0 };
+static u32 block_sizes[] = { 16, 64, 256, 1024, 1472, 8192, 0 };
 static u32 aead_sizes[] = { 16, 64, 256, 512, 1024, 2048, 4096, 8192, 0 };
 
 #define XBUFSIZE 8
-- 
2.19.2



[PATCH v2 3/3] crypto: arm64/chacha - use combined SIMD/ALU routine for more speed

2018-12-04 Thread Ard Biesheuvel
To some degree, most known AArch64 micro-architectures appear to be
able to issue ALU instructions in parallel with SIMD instructions
without affecting the SIMD throughput. This means we can use the ALU
to process a fifth ChaCha block while the SIMD is processing four
blocks in parallel.
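
As a reminder, a ChaCha quarter-round needs nothing beyond 32-bit adds,
XORs and rotates, which is what makes running an extra block on the
scalar ALU essentially free. A plain C rendering (for illustration, not
part of the patch; rol32() is the kernel helper from <linux/bitops.h>):

#include <linux/bitops.h>
#include <linux/types.h>

static void chacha_quarter_round(u32 *a, u32 *b, u32 *c, u32 *d)
{
	*a += *b; *d = rol32(*d ^ *a, 16);	/* the scalar ror #16 below */
	*c += *d; *b = rol32(*b ^ *c, 12);
	*a += *b; *d = rol32(*d ^ *a, 8);
	*c += *d; *b = rol32(*b ^ *c, 7);
}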

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/chacha-neon-core.S | 235 ++--
 arch/arm64/crypto/chacha-neon-glue.c |  39 ++--
 2 files changed, 239 insertions(+), 35 deletions(-)

diff --git a/arch/arm64/crypto/chacha-neon-core.S b/arch/arm64/crypto/chacha-neon-core.S
index 32086709e6b3..534e0a3fafa4 100644
--- a/arch/arm64/crypto/chacha-neon-core.S
+++ b/arch/arm64/crypto/chacha-neon-core.S
@@ -1,13 +1,13 @@
 /*
  * ChaCha/XChaCha NEON helper functions
  *
- * Copyright (C) 2016 Linaro, Ltd. 
+ * Copyright (C) 2016-2018 Linaro, Ltd. 
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License version 2 as
  * published by the Free Software Foundation.
  *
- * Based on:
+ * Originally based on:
  * ChaCha20 256-bit cipher algorithm, RFC7539, x64 SSSE3 functions
  *
  * Copyright (C) 2015 Martin Willi
@@ -160,8 +160,27 @@ ENTRY(hchacha_block_neon)
ret x9
 ENDPROC(hchacha_block_neon)
 
+   a0  .reqw12
+   a1  .reqw13
+   a2  .reqw14
+   a3  .reqw15
+   a4  .reqw16
+   a5  .reqw17
+   a6  .reqw19
+   a7  .reqw20
+   a8  .reqw21
+   a9  .reqw22
+   a10 .reqw23
+   a11 .reqw24
+   a12 .reqw25
+   a13 .reqw26
+   a14 .reqw27
+   a15 .reqw28
+
.align  6
 ENTRY(chacha_4block_xor_neon)
+   frame_push  10
+
// x0: Input state matrix, s
// x1: 4 data blocks output, o
// x2: 4 data blocks input, i
@@ -181,6 +200,9 @@ ENTRY(chacha_4block_xor_neon)
// matrix by interleaving 32- and then 64-bit words, which allows us to
// do XOR in NEON registers.
//
+   // At the same time, a fifth block is encrypted in parallel using
+   // scalar registers
+   //
adr_l   x9, CTRINC  // ... and ROT8
ld1 {v30.4s-v31.4s}, [x9]
 
@@ -191,7 +213,24 @@ ENTRY(chacha_4block_xor_neon)
ld4r{ v8.4s-v11.4s}, [x8], #16
ld4r{v12.4s-v15.4s}, [x8]
 
-   // x12 += counter values 0-3
+   mov a0, v0.s[0]
+   mov a1, v1.s[0]
+   mov a2, v2.s[0]
+   mov a3, v3.s[0]
+   mov a4, v4.s[0]
+   mov a5, v5.s[0]
+   mov a6, v6.s[0]
+   mov a7, v7.s[0]
+   mov a8, v8.s[0]
+   mov a9, v9.s[0]
+   mov a10, v10.s[0]
+   mov a11, v11.s[0]
+   mov a12, v12.s[0]
+   mov a13, v13.s[0]
+   mov a14, v14.s[0]
+   mov a15, v15.s[0]
+
+   // x12 += counter values 1-4
add v12.4s, v12.4s, v30.4s
 
 .Ldoubleround4:
@@ -200,33 +239,53 @@ ENTRY(chacha_4block_xor_neon)
// x2 += x6, x14 = rotl32(x14 ^ x2, 16)
// x3 += x7, x15 = rotl32(x15 ^ x3, 16)
add v0.4s, v0.4s, v4.4s
+ add   a0, a0, a4
add v1.4s, v1.4s, v5.4s
+ add   a1, a1, a5
add v2.4s, v2.4s, v6.4s
+ add   a2, a2, a6
add v3.4s, v3.4s, v7.4s
+ add   a3, a3, a7
 
eor v12.16b, v12.16b, v0.16b
+ eor   a12, a12, a0
eor v13.16b, v13.16b, v1.16b
+ eor   a13, a13, a1
eor v14.16b, v14.16b, v2.16b
+ eor   a14, a14, a2
eor v15.16b, v15.16b, v3.16b
+ eor   a15, a15, a3
 
rev32   v12.8h, v12.8h
+ ror   a12, a12, #16
rev32   v13.8h, v13.8h
+ ror   a13, a13, #16
rev32   v14.8h, v14.8h
+ ror   a14, a14, #16
rev32   v15.8h, v15.8h
+ ror   a15, a15, #16
 
// x8 += x12, x4 = rotl32(x4 ^ x8, 12)
// x9 += x13, x5 = rotl32(x5 ^ x9, 12)
// x10 += x14, x6 = rotl32(x6 ^ x10, 12)
// x11 += x15, x7 = rotl32(x7 ^ x11, 12)
add v8.4s, v8.4s, v12.4s
+ add   a8, a8, a12
add v9.4s, v9.4s, v13.4s
+ add   a9, a9, a13
add v10.4s, v10.4s, v14.4s
+ add   a10, a10, a14
add v11.4s, v11.4s, v15.4s
+ add

[PATCH v2 2/3] crypto: arm64/chacha - optimize for arbitrary length inputs

2018-12-04 Thread Ard Biesheuvel
Update the 4-way NEON ChaCha routine so it can handle input of any
length >64 bytes in its entirety, rather than having to call into the
1-way routine and/or use memcpy() via temp buffers to handle the tail
of a ChaCha invocation whose length is not a multiple of 256 bytes.

On inputs that are a multiple of 256 bytes (and thus in the tcrypt
benchmarks), performance drops by around 1% on Cortex-A57, while
performance for inputs drawn randomly from the range [64, 1024)
increases by around 30%.
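
The glue code can then simply pass the byte count through to the 4-way
routine; a rough C sketch of the driving loop (for illustration only, not
the actual chacha-neon-glue.c change; the prototype mirrors the routine's
new byte count argument):

#include <linux/kernel.h>
#include <linux/linkage.h>
#include <linux/types.h>

#define CHACHA_BLOCK_SIZE	64

asmlinkage void chacha_4block_xor_neon(u32 *state, u8 *dst, const u8 *src,
				       int nrounds, int bytes);

static void chacha_doneon_sketch(u32 *state, u8 *dst, const u8 *src,
				 int bytes, int nrounds)
{
	while (bytes > 0) {
		int todo = min(bytes, 4 * CHACHA_BLOCK_SIZE);

		chacha_4block_xor_neon(state, dst, src, nrounds, todo);
		/* advance the block counter by the number of blocks consumed,
		 * rounding up for a partial final block */
		state[12] += DIV_ROUND_UP(todo, CHACHA_BLOCK_SIZE);
		dst   += todo;
		src   += todo;
		bytes -= todo;
	}
}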

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/chacha-neon-core.S | 183 ++--
 arch/arm64/crypto/chacha-neon-glue.c |  38 ++--
 2 files changed, 184 insertions(+), 37 deletions(-)

diff --git a/arch/arm64/crypto/chacha-neon-core.S b/arch/arm64/crypto/chacha-neon-core.S
index 75b4e06cee79..32086709e6b3 100644
--- a/arch/arm64/crypto/chacha-neon-core.S
+++ b/arch/arm64/crypto/chacha-neon-core.S
@@ -19,6 +19,8 @@
  */
 
 #include 
+#include 
+#include 
 
.text
.align  6
@@ -36,7 +38,7 @@
  */
 chacha_permute:
 
-   adr x10, ROT8
+   adr_l   x10, ROT8
ld1 {v12.4s}, [x10]
 
 .Ldoubleround:
@@ -164,6 +166,12 @@ ENTRY(chacha_4block_xor_neon)
// x1: 4 data blocks output, o
// x2: 4 data blocks input, i
// w3: nrounds
+   // x4: byte count
+
+   adr_l   x10, .Lpermute
+   and x5, x4, #63
+   add x10, x10, x5
+   add x11, x10, #64
 
//
// This function encrypts four consecutive ChaCha blocks by loading
@@ -173,15 +181,15 @@ ENTRY(chacha_4block_xor_neon)
// matrix by interleaving 32- and then 64-bit words, which allows us to
// do XOR in NEON registers.
//
-   adr x9, CTRINC  // ... and ROT8
+   adr_l   x9, CTRINC  // ... and ROT8
ld1 {v30.4s-v31.4s}, [x9]
 
// x0..15[0-3] = s0..3[0..3]
-   mov x4, x0
-   ld4r{ v0.4s- v3.4s}, [x4], #16
-   ld4r{ v4.4s- v7.4s}, [x4], #16
-   ld4r{ v8.4s-v11.4s}, [x4], #16
-   ld4r{v12.4s-v15.4s}, [x4]
+   add x8, x0, #16
+   ld4r{ v0.4s- v3.4s}, [x0]
+   ld4r{ v4.4s- v7.4s}, [x8], #16
+   ld4r{ v8.4s-v11.4s}, [x8], #16
+   ld4r{v12.4s-v15.4s}, [x8]
 
// x12 += counter values 0-3
add v12.4s, v12.4s, v30.4s
@@ -425,24 +433,47 @@ ENTRY(chacha_4block_xor_neon)
zip1v30.4s, v14.4s, v15.4s
zip2v31.4s, v14.4s, v15.4s
 
+   mov x3, #64
+   subsx5, x4, #64
+   add x6, x5, x2
+   cselx3, x3, xzr, ge
+   cselx2, x2, x6, ge
+
// interleave 64-bit words in state n, n+2
zip1v0.2d, v16.2d, v18.2d
zip2v4.2d, v16.2d, v18.2d
zip1v8.2d, v17.2d, v19.2d
zip2v12.2d, v17.2d, v19.2d
-   ld1 {v16.16b-v19.16b}, [x2], #64
+   ld1 {v16.16b-v19.16b}, [x2], x3
+
+   subsx6, x4, #128
+   ccmpx3, xzr, #4, lt
+   add x7, x6, x2
+   cselx3, x3, xzr, eq
+   cselx2, x2, x7, eq
 
zip1v1.2d, v20.2d, v22.2d
zip2v5.2d, v20.2d, v22.2d
zip1v9.2d, v21.2d, v23.2d
zip2v13.2d, v21.2d, v23.2d
-   ld1 {v20.16b-v23.16b}, [x2], #64
+   ld1 {v20.16b-v23.16b}, [x2], x3
+
+   subsx7, x4, #192
+   ccmpx3, xzr, #4, lt
+   add x8, x7, x2
+   cselx3, x3, xzr, eq
+   cselx2, x2, x8, eq
 
zip1v2.2d, v24.2d, v26.2d
zip2v6.2d, v24.2d, v26.2d
zip1v10.2d, v25.2d, v27.2d
zip2v14.2d, v25.2d, v27.2d
-   ld1 {v24.16b-v27.16b}, [x2], #64
+   ld1 {v24.16b-v27.16b}, [x2], x3
+
+   subsx8, x4, #256
+   ccmpx3, xzr, #4, lt
+   add x9, x8, x2
+   cselx2, x2, x9, eq
 
zip1v3.2d, v28.2d, v30.2d
zip2v7.2d, v28.2d, v30.2d
@@ -451,29 +482,155 @@ ENTRY(chacha_4block_xor_neon)
ld1 {v28.16b-v31.16b}, [x2]
 
// xor with corresponding input, write to output
+   tbnzx5, #63, 0f
eor v16.16b, v16.16b, v0.16b
eor v17.16b, v17.16b, v1.16b
eor v18.16b, v18.16b, v2.16b
eor v19.16b, v19.16b, v3.16b
+   st1 {v16.16b-v19.16b}, [x1], #64
+
+   tbnzx6, #63, 1f
eor v20.16b, v20.16b, v4.16b
eor v21.16b, v21.16b, v5.

Re: [PATCH] crypto/simd: correctly take reqsize of wrapped skcipher into account

2018-11-09 Thread Ard Biesheuvel
On 9 November 2018 at 10:45, Herbert Xu  wrote:
> On Fri, Nov 09, 2018 at 05:44:47PM +0800, Herbert Xu wrote:
>> On Fri, Nov 09, 2018 at 12:33:23AM +0100, Ard Biesheuvel wrote:
>> >
>> > This should be
>> >
>> > reqsize += max(crypto_skcipher_reqsize(&cryptd_tfm->base),
>> >crypto_skcipher_reqsize(cryptd_skcipher_child(cryptd_tfm)));
>> >
>> > since the cryptd path in simd still needs some space in the subreq for
>> > the completion.
>>
>> OK this is what I applied to the cryptodev tree, please double-check
>> to see if I did anything silly:
>
> I meant the crypto tree rather than cryptodev.
>

That looks fine. Thanks Herbert.


Re: .S_shipped unnecessary?

2018-11-08 Thread Ard Biesheuvel
(+ Masahiro, kbuild ml)

On 8 November 2018 at 21:37, Jason A. Donenfeld  wrote:
> Hi Ard, Eric, and others,
>
> As promised, the next Zinc patchset will have less generated code! After a
> bit of work with Andy and Samuel, I'll be bundling the perlasm.
>

Wonderful! Any problems doing that for x86_64?

> One thing I'm wondering about, though, is the wisdom behind the current
> .S_shipped pattern. Usually the _shipped is for big firmware blobs that are
> hard (or impossible) to build independently. But in this case, the .S is
> generated from the .pl significantly faster than gcc even compiles a basic
> C file. And, since perl is needed to build the kernel anyway, it's not like
> it will be impossible to find the right tools. Rather than clutter up commits
> with the .pl _and_ the .S_shipped, what would you think if I just generated
> the .S each time as an ordinary build artifact. AFAICT, this is fairly usual,
> and it's hard to see downsides. Hence, why I'm writing this email: are there
> any downsides to that?
>

I agree 100%. When I added this the first time, it was at the request
of the ARM maintainer, who was reluctant to rely on Perl for some
reason.

Recently, we have had to add a kludge to prevent spurious rebuilds of
the .S_shipped files as well.

I'd be perfectly happy to get rid of this entirely, and always
generate the .S from the .pl, which to me is kind of the point of
carrying these files in the first place.

Masahiro: do you see any problems with this?


Re: [PATCH] crypto/simd: correctly take reqsize of wrapped skcipher into account

2018-11-08 Thread Ard Biesheuvel
On 8 November 2018 at 23:55, Ard Biesheuvel  wrote:
> The simd wrapper's skcipher request context structure consists
> of a single subrequest whose size is taken from the subordinate
> skcipher. However, in simd_skcipher_init(), the reqsize that is
> retrieved is not from the subordinate skcipher but from the
> cryptd request structure, whose size is completely unrelated to
> the actual wrapped skcipher.
>
> Reported-by: Qian Cai 
> Signed-off-by: Ard Biesheuvel 
> ---
>  crypto/simd.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/crypto/simd.c b/crypto/simd.c
> index ea7240be3001..2f3d6e897afc 100644
> --- a/crypto/simd.c
> +++ b/crypto/simd.c
> @@ -125,7 +125,7 @@ static int simd_skcipher_init(struct crypto_skcipher *tfm)
> ctx->cryptd_tfm = cryptd_tfm;
>
> reqsize = sizeof(struct skcipher_request);
> -   reqsize += crypto_skcipher_reqsize(&cryptd_tfm->base);
> +   reqsize += crypto_skcipher_reqsize(cryptd_skcipher_child(cryptd_tfm));
>

This should be

reqsize += max(crypto_skcipher_reqsize(&cryptd_tfm->base),
   crypto_skcipher_reqsize(cryptd_skcipher_child(cryptd_tfm)));

since the cryptd path in simd still needs some space in the subreq for
the completion.
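
In other words, something along these lines (a sketch of the intended
sizing, not the final patch):

	reqsize = crypto_skcipher_reqsize(cryptd_skcipher_child(cryptd_tfm));
	reqsize = max(reqsize, crypto_skcipher_reqsize(&cryptd_tfm->base));
	reqsize += sizeof(struct skcipher_request);

	crypto_skcipher_set_reqsize(tfm, reqsize);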


[PATCH] crypto/simd: correctly take reqsize of wrapped skcipher into account

2018-11-08 Thread Ard Biesheuvel
The simd wrapper's skcipher request context structure consists
of a single subrequest whose size is taken from the subordinate
skcipher. However, in simd_skcipher_init(), the reqsize that is
retrieved is not from the subordinate skcipher but from the
cryptd request structure, whose size is completely unrelated to
the actual wrapped skcipher.

Reported-by: Qian Cai 
Signed-off-by: Ard Biesheuvel 
---
 crypto/simd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/crypto/simd.c b/crypto/simd.c
index ea7240be3001..2f3d6e897afc 100644
--- a/crypto/simd.c
+++ b/crypto/simd.c
@@ -125,7 +125,7 @@ static int simd_skcipher_init(struct crypto_skcipher *tfm)
ctx->cryptd_tfm = cryptd_tfm;
 
reqsize = sizeof(struct skcipher_request);
-   reqsize += crypto_skcipher_reqsize(&cryptd_tfm->base);
+   reqsize += crypto_skcipher_reqsize(cryptd_skcipher_child(cryptd_tfm));
 
crypto_skcipher_set_reqsize(tfm, reqsize);
 
-- 
2.19.1



Re: [PATCH v4 2/7] tpm2-sessions: Add full HMAC and encrypt/decrypt session handling

2018-10-23 Thread Ard Biesheuvel
On 23 October 2018 at 04:01, James Bottomley
 wrote:
> On Mon, 2018-10-22 at 19:19 -0300, Ard Biesheuvel wrote:
> [...]
>> > +static void hmac_init(struct shash_desc *desc, u8 *key, int
>> > keylen)
>> > +{
>> > +   u8 pad[SHA256_BLOCK_SIZE];
>> > +   int i;
>> > +
>> > +   desc->tfm = sha256_hash;
>> > +   desc->flags = CRYPTO_TFM_REQ_MAY_SLEEP;
>>
>> I don't think this actually does anything in the shash API
>> implementation, so you can drop this.
>
> OK, I find crypto somewhat hard to follow.  There were bits I had to
> understand, like when I wrote the CFB implementation or when I fixed
> the ECDH scatterlist handling, but I've got to confess, in time
> honoured tradition I simply copied this from EVM crypto without
> actually digging into the code to understand why.
>

Yeah, it is notoriously hard to use, and we should try to improve that.

>>  However, I take it this means that hmac_init() is never called in
>> contexts where sleeping is not allowed? For the relevance of that,
>> please see below.
>
> Right, these routines are always called as an adjunct to a TPM
> operation and TPM operations can sleep, so we must at least have kernel
> thread context.
>
> [...]
>> > +   /* encrypt before HMAC */
>> > +   if (auth->attrs & TPM2_SA_DECRYPT) {
>> > +   struct scatterlist sg[1];
>> > +   u16 len;
>> > +   SKCIPHER_REQUEST_ON_STACK(req, auth->aes);
>> > +
>> > +   skcipher_request_set_tfm(req, auth->aes);
>> > +   skcipher_request_set_callback(req,
>> > CRYPTO_TFM_REQ_MAY_SLEEP,
>> > + NULL, NULL);
>> > +
>>
>> Your crypto_alloc_skcipher() call [further down] does not mask out
>> CRYPTO_ALG_ASYNC, and so it permits asynchronous implementations to
>> be selected. Your generic cfb template only permits synchronous
>> implementations, since it wraps the cipher directly (which is always
>> synchronous). However, we have support in the tree for some
>> accelerators that implement cfb(aes), and those will return
>> -EINPROGRESS when invoking crypto_skcipher_en/decrypt(req), which you
>> are not set up to handle.
>>
>> So the simple solution is to call 'crypto_alloc_skcipher("cfb(aes)",
>> 0, CRYPTO_ALG_ASYNC)' below instead.
>>
>> However, I would prefer it if we permit asynchronous skciphers here.
>> The reason is that, on a given platform, the accelerator may be the
>> only truly time invariant AES implementation that is available, and
>> since we are dealing with the TPM, a bit of paranoia does not hurt.
>> It also makes it easier to implement it as a [time invariant] SIMD
>> based asynchronous skcipher, which are simpler than synchronous ones
>> since they don't require a non-SIMD fallback path for calls from
>> contexts where the SIMD unit may not be used.
>>
>> I have already implemented cfb(aes) for arm64/NEON. I will post the
>> patch after the merge window closes.
>>
>> > +   /* need key and IV */
>> > +   KDFa(auth->session_key, SHA256_DIGEST_SIZE
>> > ++ auth->passphraselen, "CFB", auth->our_nonce,
>> > +auth->tpm_nonce, AES_KEYBYTES +
>> > AES_BLOCK_SIZE,
>> > +auth->scratch);
>> > +   crypto_skcipher_setkey(auth->aes, auth->scratch,
>> > AES_KEYBYTES);
>> > +   len = tpm_get_inc_u16();
>> > +   sg_init_one(sg, p, len);
>> > +   skcipher_request_set_crypt(req, sg, sg, len,
>> > +  auth->scratch +
>> > AES_KEYBYTES);
>> > +   crypto_skcipher_encrypt(req);
>>
>> So please consider replacing this with something like.
>>
>> DECLARE_CRYPTO_WAIT(wait); [further up]
>> skcipher_request_set_callback(req, CRYPTO_TFM_REQ_MAY_SLEEP,
>>       crypto_req_done, &wait);
>> crypto_wait_req(crypto_skcipher_encrypt(req), &wait);
>
> Sure, I can do this ... the crypto skcipher handling was also cut and
> paste, but I forget where from this time.  So what I think you're
> asking for is below as the incremental diff?  I've tested this out and
> it all works fine in my session testing environment (and on my real
> laptop) ... although since I'm fully sync, I won't really have tested
> the -EINPROGRESS do the wait case.
>

Yes that looks

Re: [PATCH v4 2/7] tpm2-sessions: Add full HMAC and encrypt/decrypt session handling

2018-10-22 Thread Ard Biesheuvel
Hi James,

Some comments below on how you are using the crypto API.

On 22 October 2018 at 04:36, James Bottomley
 wrote:
> This code adds true session based HMAC authentication plus parameter
> decryption and response encryption using AES.
>
> The basic design of this code is to segregate all the nasty crypto,
> hash and hmac code into tpm2-sessions.c and export a usable API.
>
> The API first of all starts off by gaining a session with
>
> tpm2_start_auth_session()
>
> Which initiates a session with the TPM and allocates an opaque
> tpm2_auth structure to handle the session parameters.  Then the use is
> simply:
>
> * tpm_buf_append_name() in place of the tpm_buf_append_u32 for the
>   handles
>
> * tpm_buf_append_hmac_session() where tpm2_append_auth() would go
>
> * tpm_buf_fill_hmac_session() called after the entire command buffer
>   is finished but before tpm_transmit_cmd() is called which computes
>   the correct HMAC and places it in the command at the correct
>   location.
>
> Finally, after tpm_transmit_cmd() is called,
> tpm_buf_check_hmac_response() is called to check that the returned
> HMAC matched and collect the new state for the next use of the
> session, if any.
>
> The features of the session is controlled by the session attributes
> set in tpm_buf_append_hmac_session().  If TPM2_SA_CONTINUE_SESSION is
> not specified, the session will be flushed and the tpm2_auth structure
> freed in tpm_buf_check_hmac_response(); otherwise the session may be
> used again.  Parameter encryption is specified by or'ing the flag
> TPM2_SA_DECRYPT and response encryption by or'ing the flag
> TPM2_SA_ENCRYPT.  the various encryptions will be taken care of by
> tpm_buf_fill_hmac_session() and tpm_buf_check_hmac_response()
> respectively.
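
Sketched as calling code, that flow would look roughly like this (the
function names are taken from the description above; the argument lists
are illustrative guesses, not the actual API):

	struct tpm2_auth *auth = tpm2_start_auth_session(chip);

	tpm_buf_append_name(&buf, auth, handle);  /* instead of tpm_buf_append_u32() */
	tpm_buf_append_hmac_session(&buf, auth, TPM2_SA_CONTINUE_SESSION |
				    TPM2_SA_DECRYPT, passphrase, passphraselen);
	/* ... append the remaining command parameters ... */
	tpm_buf_fill_hmac_session(&buf, auth);
	rc = tpm_transmit_cmd(chip, &buf, 0, "example command");
	rc = tpm_buf_check_hmac_response(chip, &buf, auth, rc);
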
>
> To get all of this to work securely, the Kernel now needs a primary
> key to encrypt the session salt to, so we derive an EC key from the
> NULL seed and store it in the tpm_chip structure.  We also make sure
> that this seed remains for the kernel by using a kernel space to take
> it out of the TPM when userspace wants to use it.
>
> Signed-off-by: James Bottomley 
>
> ---
>
> v2: Added docbook and improved response check API
> v3: Add readpublic, fix hmac length, add API for close on error
> allow for the hmac session not being first in the sessions
> v4: Make NULL seed template exactly match the SRK ECC template.
> Also check the NULL primary key name is what getpublic returns
> to prevent spoofing.  Also parametrise the name size for reuse
> ---
>  drivers/char/tpm/Kconfig |3 +
>  drivers/char/tpm/Makefile|2 +-
>  drivers/char/tpm/tpm.h   |   32 +
>  drivers/char/tpm/tpm2-cmd.c  |   34 +-
>  drivers/char/tpm/tpm2-sessions.c | 1188 
> ++
>  drivers/char/tpm/tpm2-sessions.h |   57 ++
>  6 files changed, 1300 insertions(+), 16 deletions(-)
>  create mode 100644 drivers/char/tpm/tpm2-sessions.c
>  create mode 100644 drivers/char/tpm/tpm2-sessions.h
>
...
> diff --git a/drivers/char/tpm/tpm2-sessions.c 
> b/drivers/char/tpm/tpm2-sessions.c
> new file mode 100644
> index ..422c3ec64f8c
> --- /dev/null
> +++ b/drivers/char/tpm/tpm2-sessions.c
> @@ -0,0 +1,1188 @@
...
> +/*
> + * this is our static crypto shash.  This is possible because the hash
> + * is multi-threaded and all the state stored in the desc
> + */
> +static struct crypto_shash *sha256_hash;
> +
> +/*
> + * It turns out the crypto hmac(sha256) is hard for us to consume
> + * because it assumes a fixed key and the TPM seems to change the key
> + * on every operation, so we weld the hmac init and final functions in
> + * here to give it the same usage characteristics as a regular hash
> + */
> +static void hmac_init(struct shash_desc *desc, u8 *key, int keylen)
> +{
> +   u8 pad[SHA256_BLOCK_SIZE];
> +   int i;
> +
> +   desc->tfm = sha256_hash;
> +   desc->flags = CRYPTO_TFM_REQ_MAY_SLEEP;

I don't think this actually does anything in the shash API
implementation, so you can drop this. However, I take it this means
that hmac_init() is never called in contexts where sleeping is not
allowed? For the relevance of that, please see below.

> +   crypto_shash_init(desc);
> +   for (i = 0; i < sizeof(pad); i++) {
> +   if (i < keylen)
> +   pad[i] = key[i];
> +   else
> +   pad[i] = 0;
> +   pad[i] ^= HMAC_IPAD_VALUE;
> +   }
> +   crypto_shash_update(desc, pad, sizeof(pad));
> +}
> +
> +static void hmac_final(struct shash_desc *desc, u8 *key, int keylen, u8 *out)
> +{
> +   u8 pad[SHA256_BLOCK_SIZE];
> +   int i;
> +
> +   for (i = 0; i < sizeof(pad); i++) {
> +   if (i < keylen)
> +   pad[i] = key[i];
> +   else
> +   pad[i] = 0;
> +   pad[i] ^= HMAC_OPAD_VALUE;
> +   }
> +
> +   /* collect the final hash;  use out as 

Re: [PATCH 1/2] crypto: fix cfb mode decryption

2018-10-21 Thread Ard Biesheuvel
On 21 October 2018 at 11:00, James Bottomley
 wrote:
> On October 21, 2018 9:58:04 AM GMT, Ard Biesheuvel 
>  wrote:
>>On 21 October 2018 at 10:07, James Bottomley
>> wrote:
>>> On Sun, 2018-10-21 at 09:05 +0200, Ard Biesheuvel wrote:
>>>> (+ James)
>>>
>>> Thanks!
>>>
>>>> On 20 October 2018 at 01:01, Dmitry Eremin-Solenikov
>>>>  wrote:
>>>> > crypto_cfb_decrypt_segment() incorrectly XOR'ed generated
>>keystream
>>>> > with
>>>> > IV, rather than with data stream, resulting in incorrect
>>>> > decryption.
>>>> > Test vectors will be added in the next patch.
>>>> >
>>>> > Signed-off-by: Dmitry Eremin-Solenikov 
>>>> > Cc: sta...@vger.kernel.org
>>>> > ---
>>>> >  crypto/cfb.c | 2 +-
>>>> >  1 file changed, 1 insertion(+), 1 deletion(-)
>>>> >
>>>> > diff --git a/crypto/cfb.c b/crypto/cfb.c
>>>> > index a0d68c09e1b9..fd4e8500e121 100644
>>>> > --- a/crypto/cfb.c
>>>> > +++ b/crypto/cfb.c
>>>> > @@ -144,7 +144,7 @@ static int crypto_cfb_decrypt_segment(struct
>>>> > skcipher_walk *walk,
>>>> >
>>>> > do {
>>>> > crypto_cfb_encrypt_one(tfm, iv, dst);
>>>> > -   crypto_xor(dst, iv, bsize);
>>>> > +   crypto_xor(dst, src, bsize);
>>>
>>> This does look right.  I think the reason the TPM code works is that
>>it
>>> always does encrypt/decrypt in-place, which is a separate piece of
>>the
>>> code which appears to be correct.
>>>
>>
>>Yeah I figured that.
>>
>>So where is the TPM code that actually uses this code?
>
> It was posted to the integrity list a while ago.  I'm planning a repost  
> shortly.
>

OK, found it. Mind cc'ing me on that repost?


Re: [PATCH 1/2] crypto: fix cfb mode decryption

2018-10-21 Thread Ard Biesheuvel
On 21 October 2018 at 10:07, James Bottomley
 wrote:
> On Sun, 2018-10-21 at 09:05 +0200, Ard Biesheuvel wrote:
>> (+ James)
>
> Thanks!
>
>> On 20 October 2018 at 01:01, Dmitry Eremin-Solenikov
>>  wrote:
>> > crypto_cfb_decrypt_segment() incorrectly XOR'ed generated keystream
>> > with
>> > IV, rather than with data stream, resulting in incorrect
>> > decryption.
>> > Test vectors will be added in the next patch.
>> >
>> > Signed-off-by: Dmitry Eremin-Solenikov 
>> > Cc: sta...@vger.kernel.org
>> > ---
>> >  crypto/cfb.c | 2 +-
>> >  1 file changed, 1 insertion(+), 1 deletion(-)
>> >
>> > diff --git a/crypto/cfb.c b/crypto/cfb.c
>> > index a0d68c09e1b9..fd4e8500e121 100644
>> > --- a/crypto/cfb.c
>> > +++ b/crypto/cfb.c
>> > @@ -144,7 +144,7 @@ static int crypto_cfb_decrypt_segment(struct
>> > skcipher_walk *walk,
>> >
>> > do {
>> > crypto_cfb_encrypt_one(tfm, iv, dst);
>> > -   crypto_xor(dst, iv, bsize);
>> > +   crypto_xor(dst, src, bsize);
>
> This does look right.  I think the reason the TPM code works is that it
> always does encrypt/decrypt in-place, which is a separate piece of the
> code which appears to be correct.
>

Yeah I figured that.

So where is the TPM code that actually uses this code?


Re: [PATCH 2/2] crypto: testmgr: add AES-CFB tests

2018-10-21 Thread Ard Biesheuvel
(+ James)

On 20 October 2018 at 01:01, Dmitry Eremin-Solenikov
 wrote:
> Add AES128/192/256-CFB testvectors from NIST SP800-38A.
>
> Signed-off-by: Dmitry Eremin-Solenikov 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Dmitry Eremin-Solenikov 
> ---
>  crypto/tcrypt.c  |  5 
>  crypto/testmgr.c |  7 +
>  crypto/testmgr.h | 76 
>  3 files changed, 88 insertions(+)
>
> diff --git a/crypto/tcrypt.c b/crypto/tcrypt.c
> index bdde95e8d369..a6315827d240 100644
> --- a/crypto/tcrypt.c
> +++ b/crypto/tcrypt.c
> @@ -1733,6 +1733,7 @@ static int do_test(const char *alg, u32 type, u32 mask, 
> int m, u32 num_mb)
> ret += tcrypt_test("xts(aes)");
> ret += tcrypt_test("ctr(aes)");
> ret += tcrypt_test("rfc3686(ctr(aes))");
> +   ret += tcrypt_test("cfb(aes)");
> break;
>
> case 11:
> @@ -2059,6 +2060,10 @@ static int do_test(const char *alg, u32 type, u32 
> mask, int m, u32 num_mb)
> speed_template_16_24_32);
> test_cipher_speed("ctr(aes)", DECRYPT, sec, NULL, 0,
> speed_template_16_24_32);
> +   test_cipher_speed("cfb(aes)", ENCRYPT, sec, NULL, 0,
> +   speed_template_16_24_32);
> +   test_cipher_speed("cfb(aes)", DECRYPT, sec, NULL, 0,
> +   speed_template_16_24_32);
> break;
>
> case 201:
> diff --git a/crypto/testmgr.c b/crypto/testmgr.c
> index a1d42245082a..016d61c419fc 100644
> --- a/crypto/testmgr.c
> +++ b/crypto/testmgr.c
> @@ -2684,6 +2684,13 @@ static const struct alg_test_desc alg_test_descs[] = {
> .dec = __VECS(aes_ccm_dec_tv_template)
> }
> }
> +   }, {
> +   .alg = "cfb(aes)",
> +   .test = alg_test_skcipher,
> +   .fips_allowed = 1,
> +   .suite = {
> +   .cipher = __VECS(aes_cfb_tv_template)
> +   },
> }, {
> .alg = "chacha20",
> .test = alg_test_skcipher,
> diff --git a/crypto/testmgr.h b/crypto/testmgr.h
> index 173111c70746..19b6d184c8fb 100644
> --- a/crypto/testmgr.h
> +++ b/crypto/testmgr.h
> @@ -12081,6 +12081,82 @@ static const struct cipher_testvec 
> aes_cbc_tv_template[] = {
> },
>  };
>
> +static const struct cipher_testvec aes_cfb_tv_template[] = {
> +   { /* From NIST SP800-38A */
> +   .key= "\x2b\x7e\x15\x16\x28\xae\xd2\xa6"
> + "\xab\xf7\x15\x88\x09\xcf\x4f\x3c",
> +   .klen   = 16,
> +   .iv = "\x00\x01\x02\x03\x04\x05\x06\x07"
> + "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
> +   .ptext  = "\x6b\xc1\xbe\xe2\x2e\x40\x9f\x96"
> + "\xe9\x3d\x7e\x11\x73\x93\x17\x2a"
> + "\xae\x2d\x8a\x57\x1e\x03\xac\x9c"
> + "\x9e\xb7\x6f\xac\x45\xaf\x8e\x51"
> + "\x30\xc8\x1c\x46\xa3\x5c\xe4\x11"
> + "\xe5\xfb\xc1\x19\x1a\x0a\x52\xef"
> + "\xf6\x9f\x24\x45\xdf\x4f\x9b\x17"
> + "\xad\x2b\x41\x7b\xe6\x6c\x37\x10",
> +   .ctext  = "\x3b\x3f\xd9\x2e\xb7\x2d\xad\x20"
> + "\x33\x34\x49\xf8\xe8\x3c\xfb\x4a"
> + "\xc8\xa6\x45\x37\xa0\xb3\xa9\x3f"
> + "\xcd\xe3\xcd\xad\x9f\x1c\xe5\x8b"
> + "\x26\x75\x1f\x67\xa3\xcb\xb1\x40"
> + "\xb1\x80\x8c\xf1\x87\xa4\xf4\xdf"
> + "\xc0\x4b\x05\x35\x7c\x5d\x1c\x0e"
> + "\xea\xc4\xc6\x6f\x9f\xf7\xf2\xe6",
> +   .len= 64,
> +   }, {
> +   .key= "\x8e\x73\xb0\xf7\xda\x0e\x64\x52"
> + "\xc8\x10\xf3\x2b\x80\x90\x79\xe5"
> + "\x62\xf8\xea\xd2\x52\x2c\x6b\x7b",
> +   .klen   = 24,
> +   .iv = "\x00\x01\x02\x03\x04\x05\x06\x07"
> + "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
> +   .ptext  = "\x6b\xc1\xbe\xe2\x2e\x40\x9f\x96"
> + "\xe9\x3d\x7e\x11\x73\x93\x17\x2a"
> + "\xae\x2d\x8a\x57\x1e\x03\xac\x9c"
> + "\x9e\xb7\x6f\xac\x45\xaf\x8e\x51"
> + "\x30\xc8\x1c\x46\xa3\x5c\xe4\x11"
> + "\xe5\xfb\xc1\x19\x1a\x0a\x52\xef"
> + "\xf6\x9f\x24\x45\xdf\x4f\x9b\x17"
> + "\xad\x2b\x41\x7b\xe6\x6c\x37\x10",
> +   .ctext  = "\xcd\xc8\x0d\x6f\xdd\xf1\x8c\xab"
> + "\x34\xc2\x59\x09\xc9\x9a\x41\x74"
> + "\x67\xce\x7f\x7f\x81\x17\x36\x21"
> + 

Re: [PATCH 1/2] crypto: fix cfb mode decryption

2018-10-21 Thread Ard Biesheuvel
(+ James)

On 20 October 2018 at 01:01, Dmitry Eremin-Solenikov
 wrote:
> crypto_cfb_decrypt_segment() incorrectly XOR'ed generated keystream with
> IV, rather than with data stream, resulting in incorrect decryption.
> Test vectors will be added in the next patch.
>
> Signed-off-by: Dmitry Eremin-Solenikov 
> Cc: sta...@vger.kernel.org
> ---
>  crypto/cfb.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/crypto/cfb.c b/crypto/cfb.c
> index a0d68c09e1b9..fd4e8500e121 100644
> --- a/crypto/cfb.c
> +++ b/crypto/cfb.c
> @@ -144,7 +144,7 @@ static int crypto_cfb_decrypt_segment(struct 
> skcipher_walk *walk,
>
> do {
> crypto_cfb_encrypt_one(tfm, iv, dst);
> -   crypto_xor(dst, iv, bsize);
> +   crypto_xor(dst, src, bsize);
> iv = src;
>
> src += bsize;
> --
> 2.19.1
>
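
For reference, a minimal C model of CFB decryption (illustrative names,
not the kernel code): the keystream is the block cipher applied to the IV,
i.e. to the previous ciphertext block, and it must be XORed with the
ciphertext itself to recover the plaintext, which is exactly what the
one-line fix does:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BS 16	/* block size of the underlying cipher */

/* stand-in for the block cipher encryption primitive */
void block_encrypt(const void *key, uint8_t out[BS], const uint8_t in[BS]);

static void cfb_decrypt(const void *key, uint8_t *dst, const uint8_t *src,
			size_t nblocks, uint8_t iv[BS])
{
	uint8_t ks[BS];
	size_t i;

	while (nblocks--) {
		block_encrypt(key, ks, iv);	/* keystream = E_k(IV) */
		memcpy(iv, src, BS);		/* next IV = this ciphertext block */
		for (i = 0; i < BS; i++)
			dst[i] = src[i] ^ ks[i];	/* P = C ^ keystream */
		src += BS;
		dst += BS;
	}
}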


Re: [PATCH v3 2/2] crypto: arm/aes - add some hardening against cache-timing attacks

2018-10-19 Thread Ard Biesheuvel
On 20 October 2018 at 04:39, Eric Biggers  wrote:
> On Fri, Oct 19, 2018 at 05:54:12PM +0800, Ard Biesheuvel wrote:
>> On 19 October 2018 at 13:41, Ard Biesheuvel  
>> wrote:
>> > On 18 October 2018 at 12:37, Eric Biggers  wrote:
>> >> From: Eric Biggers 
>> >>
>> >> Make the ARM scalar AES implementation closer to constant-time by
>> >> disabling interrupts and prefetching the tables into L1 cache.  This is
>> >> feasible because due to ARM's "free" rotations, the main tables are only
>> >> 1024 bytes instead of the usual 4096 used by most AES implementations.
>> >>
>> >> On ARM Cortex-A7, the speed loss is only about 5%.  The resulting code
>> >> is still over twice as fast as aes_ti.c.  Responsiveness is potentially
>> >> a concern, but interrupts are only disabled for a single AES block.
>> >>
>> >
>> > So that would be in the order of 700 cycles, based on the numbers you
>> > shared in v1 of the aes_ti.c patch. Does that sound about right? So
>> > that would be around 1 microsecond, which is really not a number to
>> > obsess about imo.
>> >
>> > I considered another option, which is to detect whether an interrupt
>> > has been taken (by writing some canary value below that stack pointer
>> > in the location where the exception handler will preserve the value of
>> > sp, and checking at the end whether it has been modified) and doing a
>> > usleep_range(x, y) if that is the case.
>> >
>> > But this is much simpler so let's only go there if we must.
>> >
>>
>> I played around a bit and implemented it for discussion purposes, but
>> restarting the operation if it gets interrupted, as suggested in the
>> paper (whitespace corruption courtesy of Gmail)
>>
>>
>> diff --git a/arch/arm/crypto/aes-cipher-core.S
>> b/arch/arm/crypto/aes-cipher-core.S
>> index 184d6c2d15d5..2e8a84a47784 100644
>> --- a/arch/arm/crypto/aes-cipher-core.S
>> +++ b/arch/arm/crypto/aes-cipher-core.S
>> @@ -10,6 +10,7 @@
>>   */
>>
>>  #include 
>> +#include 
>>  #include 
>>
>>   .text
>> @@ -139,6 +140,34 @@
>>
>>   __adrl ttab, \ttab
>>
>> + /*
>> + * Set a canary that will allow us to tell whether any
>> + * interrupts were taken while this function was executing.
>> + * The zero value will be overwritten with the program counter
>> + * value at the point where the IRQ exception is taken.
>> + */
>> + mov t0, #0
>> + str t0, [sp, #-(SVC_REGS_SIZE - S_PC)]
>> +
>> + /*
>> + * Prefetch the 1024-byte 'ft' or 'it' table into L1 cache,
>> + * assuming cacheline size >= 32.  This is a hardening measure
>> + * intended to make cache-timing attacks more difficult.
>> + * They may not be fully prevented, however; see the paper
>> + * https://cr.yp.to/antiforgery/cachetiming-20050414.pdf
>> + * ("Cache-timing attacks on AES") for a discussion of the many
>> + * difficulties involved in writing truly constant-time AES
>> + * software.
>> + */
>> + .set i, 0
>> + .rept 1024 / 128
>> + ldr r8, [ttab, #i + 0]
>> + ldr r9, [ttab, #i + 32]
>> + ldr r10, [ttab, #i + 64]
>> + ldr r11, [ttab, #i + 96]
>> + .set i, i + 128
>> + .endr
>> +
>>   tst rounds, #2
>>   bne 1f
>>
>> @@ -154,6 +183,8 @@
>>  2: __adrl ttab, \ltab
>>   \round r4, r5, r6, r7, r8, r9, r10, r11, \bsz, b
>>
>> + ldr r0, [sp, #-(SVC_REGS_SIZE - S_PC)] // check canary
>> +
>>  #ifdef CONFIG_CPU_BIG_ENDIAN
>>   __rev r4, r4
>>   __rev r5, r5
>> diff --git a/arch/arm/crypto/aes-cipher-glue.c
>> b/arch/arm/crypto/aes-cipher-glue.c
>> index c222f6e072ad..de8f32121511 100644
>> --- a/arch/arm/crypto/aes-cipher-glue.c
>> +++ b/arch/arm/crypto/aes-cipher-glue.c
>> @@ -11,28 +11,39 @@
>>
>>  #include 
>>  #include 
>> +#include 
>>  #include 
>>
>> -asmlinkage void __aes_arm_encrypt(u32 *rk, int rounds, const u8 *in, u8 
>> *out);
>> +asmlinkage int __aes_arm_encrypt(u32 *rk, int rounds, const u8 *in, u8 
>> *out);
>>  EXPORT_SYMBOL(__aes_arm_encrypt);
>>
>> -asmlinkage void __aes_arm_decrypt(u32 *rk, int rounds, const u8 *in, u8 
>> *out);
>> +asmlinkage int __aes_arm_decrypt(u32 *rk, int rounds, const u8 *in, u8 
>> *out);
>>  EXPORT_SYMBOL(__aes_arm_decrypt);
>>
>>  static void aes_encrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
>>  {
>>   struct 

Re: [PATCH v3 2/2] crypto: arm/aes - add some hardening against cache-timing attacks

2018-10-19 Thread Ard Biesheuvel
On 19 October 2018 at 13:41, Ard Biesheuvel  wrote:
> On 18 October 2018 at 12:37, Eric Biggers  wrote:
>> From: Eric Biggers 
>>
>> Make the ARM scalar AES implementation closer to constant-time by
>> disabling interrupts and prefetching the tables into L1 cache.  This is
>> feasible because due to ARM's "free" rotations, the main tables are only
>> 1024 bytes instead of the usual 4096 used by most AES implementations.
>>
>> On ARM Cortex-A7, the speed loss is only about 5%.  The resulting code
>> is still over twice as fast as aes_ti.c.  Responsiveness is potentially
>> a concern, but interrupts are only disabled for a single AES block.
>>
>
> So that would be in the order of 700 cycles, based on the numbers you
> shared in v1 of the aes_ti.c patch. Does that sound about right? So
> that would be around 1 microsecond, which is really not a number to
> obsess about imo.
>
> I considered another option, which is to detect whether an interrupt
> has been taken (by writing some canary value below that stack pointer
> in the location where the exception handler will preserve the value of
> sp, and checking at the end whether it has been modified) and doing a
> usleep_range(x, y) if that is the case.
>
> But this is much simpler so let's only go there if we must.
>

I played around a bit and implemented it for discussion purposes, but
restarting the operation if it gets interrupted, as suggested in the
paper (whitespace corruption courtesy of Gmail)


diff --git a/arch/arm/crypto/aes-cipher-core.S b/arch/arm/crypto/aes-cipher-core.S
index 184d6c2d15d5..2e8a84a47784 100644
--- a/arch/arm/crypto/aes-cipher-core.S
+++ b/arch/arm/crypto/aes-cipher-core.S
@@ -10,6 +10,7 @@
  */

 #include 
+#include 
 #include 

  .text
@@ -139,6 +140,34 @@

  __adrl ttab, \ttab

+ /*
+ * Set a canary that will allow us to tell whether any
+ * interrupts were taken while this function was executing.
+ * The zero value will be overwritten with the program counter
+ * value at the point where the IRQ exception is taken.
+ */
+ mov t0, #0
+ str t0, [sp, #-(SVC_REGS_SIZE - S_PC)]
+
+ /*
+ * Prefetch the 1024-byte 'ft' or 'it' table into L1 cache,
+ * assuming cacheline size >= 32.  This is a hardening measure
+ * intended to make cache-timing attacks more difficult.
+ * They may not be fully prevented, however; see the paper
+ * https://cr.yp.to/antiforgery/cachetiming-20050414.pdf
+ * ("Cache-timing attacks on AES") for a discussion of the many
+ * difficulties involved in writing truly constant-time AES
+ * software.
+ */
+ .set i, 0
+ .rept 1024 / 128
+ ldr r8, [ttab, #i + 0]
+ ldr r9, [ttab, #i + 32]
+ ldr r10, [ttab, #i + 64]
+ ldr r11, [ttab, #i + 96]
+ .set i, i + 128
+ .endr
+
  tst rounds, #2
  bne 1f

@@ -154,6 +183,8 @@
 2: __adrl ttab, \ltab
  \round r4, r5, r6, r7, r8, r9, r10, r11, \bsz, b

+ ldr r0, [sp, #-(SVC_REGS_SIZE - S_PC)] // check canary
+
 #ifdef CONFIG_CPU_BIG_ENDIAN
  __rev r4, r4
  __rev r5, r5
diff --git a/arch/arm/crypto/aes-cipher-glue.c b/arch/arm/crypto/aes-cipher-glue.c
index c222f6e072ad..de8f32121511 100644
--- a/arch/arm/crypto/aes-cipher-glue.c
+++ b/arch/arm/crypto/aes-cipher-glue.c
@@ -11,28 +11,39 @@

 #include 
 #include 
+#include 
 #include 

-asmlinkage void __aes_arm_encrypt(u32 *rk, int rounds, const u8 *in, u8 *out);
+asmlinkage int __aes_arm_encrypt(u32 *rk, int rounds, const u8 *in, u8 *out);
 EXPORT_SYMBOL(__aes_arm_encrypt);

-asmlinkage void __aes_arm_decrypt(u32 *rk, int rounds, const u8 *in, u8 *out);
+asmlinkage int __aes_arm_decrypt(u32 *rk, int rounds, const u8 *in, u8 *out);
 EXPORT_SYMBOL(__aes_arm_decrypt);

 static void aes_encrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
 {
  struct crypto_aes_ctx *ctx = crypto_tfm_ctx(tfm);
  int rounds = 6 + ctx->key_length / 4;
+ u8 buf[AES_BLOCK_SIZE];

- __aes_arm_encrypt(ctx->key_enc, rounds, in, out);
+ if (out == in)
+   in = memcpy(buf, in, AES_BLOCK_SIZE);
+
+ while (unlikely(__aes_arm_encrypt(ctx->key_enc, rounds, in, out)))
+   cpu_relax();
 }

 static void aes_decrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
 {
  struct crypto_aes_ctx *ctx = crypto_tfm_ctx(tfm);
  int rounds = 6 + ctx->key_length / 4;
+ u8 buf[AES_BLOCK_SIZE];
+
+ if (out == in)
+   in = memcpy(buf, in, AES_BLOCK_SIZE);

- __aes_arm_decrypt(ctx->key_dec, rounds, in, out);
+ while (unlikely(__aes_arm_decrypt(ctx->key_dec, rounds, in, out)))
+   cpu_relax();
 }

 static struct crypto_alg aes_alg = {


Re: [PATCH v3 2/2] crypto: arm/aes - add some hardening against cache-timing attacks

2018-10-18 Thread Ard Biesheuvel
On 18 October 2018 at 12:37, Eric Biggers  wrote:
> From: Eric Biggers 
>
> Make the ARM scalar AES implementation closer to constant-time by
> disabling interrupts and prefetching the tables into L1 cache.  This is
> feasible because due to ARM's "free" rotations, the main tables are only
> 1024 bytes instead of the usual 4096 used by most AES implementations.
>
> On ARM Cortex-A7, the speed loss is only about 5%.  The resulting code
> is still over twice as fast as aes_ti.c.  Responsiveness is potentially
> a concern, but interrupts are only disabled for a single AES block.
>

So that would be in the order of 700 cycles, based on the numbers you
shared in v1 of the aes_ti.c patch. Does that sound about right? So
that would be around 1 microsecond, which is really not a number to
obsess about imo.

I considered another option, which is to detect whether an interrupt
has been taken (by writing some canary value below that stack pointer
in the location where the exception handler will preserve the value of
sp, and checking at the end whether it has been modified) and doing a
usleep_range(x, y) if that is the case.

But this is much simpler so let's only go there if we must.

> Note that even after these changes, the implementation still isn't
> necessarily guaranteed to be constant-time; see
> https://cr.yp.to/antiforgery/cachetiming-20050414.pdf for a discussion
> of the many difficulties involved in writing truly constant-time AES
> software.  But it's valuable to make such attacks more difficult.
>
> Much of this patch is based on patches suggested by Ard Biesheuvel.
>
> Suggested-by: Ard Biesheuvel 
> Signed-off-by: Eric Biggers 

Reviewed-by: Ard Biesheuvel 

> ---
>  arch/arm/crypto/Kconfig   |  9 +
>  arch/arm/crypto/aes-cipher-core.S | 62 ++-
>  crypto/aes_generic.c  |  9 +++--
>  3 files changed, 66 insertions(+), 14 deletions(-)
>
> diff --git a/arch/arm/crypto/Kconfig b/arch/arm/crypto/Kconfig
> index ef0c7feea6e29..0473a8f683896 100644
> --- a/arch/arm/crypto/Kconfig
> +++ b/arch/arm/crypto/Kconfig
> @@ -69,6 +69,15 @@ config CRYPTO_AES_ARM
> help
>   Use optimized AES assembler routines for ARM platforms.
>
> + On ARM processors without the Crypto Extensions, this is the
> + fastest AES implementation for single blocks.  For multiple
> + blocks, the NEON bit-sliced implementation is usually faster.
> +
> + This implementation may be vulnerable to cache timing attacks,
> + since it uses lookup tables.  However, as countermeasures it
> + disables IRQs and preloads the tables; it is hoped this makes
> + such attacks very difficult.
> +
>  config CRYPTO_AES_ARM_BS
> tristate "Bit sliced AES using NEON instructions"
> depends on KERNEL_MODE_NEON
> diff --git a/arch/arm/crypto/aes-cipher-core.S 
> b/arch/arm/crypto/aes-cipher-core.S
> index 184d6c2d15d5e..f2d67c095e596 100644
> --- a/arch/arm/crypto/aes-cipher-core.S
> +++ b/arch/arm/crypto/aes-cipher-core.S
> @@ -10,6 +10,7 @@
>   */
>
>  #include 
> +#include 
>  #include 
>
> .text
> @@ -41,7 +42,7 @@
> .endif
> .endm
>
> -   .macro  __hround, out0, out1, in0, in1, in2, in3, t3, t4, 
> enc, sz, op
> +   .macro  __hround, out0, out1, in0, in1, in2, in3, t3, t4, 
> enc, sz, op, oldcpsr
> __select\out0, \in0, 0
> __selectt0, \in1, 1
> __load  \out0, \out0, 0, \sz, \op
> @@ -73,6 +74,14 @@
> __load  t0, t0, 3, \sz, \op
> __load  \t4, \t4, 3, \sz, \op
>
> +   .ifnb   \oldcpsr
> +   /*
> +* This is the final round and we're done with all data-dependent 
> table
> +* lookups, so we can safely re-enable interrupts.
> +*/
> +   restore_irqs\oldcpsr
> +   .endif
> +
> eor \out1, \out1, t1, ror #24
> eor \out0, \out0, t2, ror #16
> ldm rk!, {t1, t2}
> @@ -83,14 +92,14 @@
> eor \out1, \out1, t2
> .endm
>
> -   .macro  fround, out0, out1, out2, out3, in0, in1, in2, in3, 
> sz=2, op
> +   .macro  fround, out0, out1, out2, out3, in0, in1, in2, in3, 
> sz=2, op, oldcpsr
> __hround\out0, \out1, \in0, \in1, \in2, \in3, \out2, \out3, 
> 1, \sz, \op
> -   __hround\out2, \out3, \in2, \in3, \in0, \in1, \in1, \in2, 1, 
> \sz, \op
> +   __hround\out2, \out3, \in2, \in3, \in0, \in1, \in1, \in2, 1, 
> \sz, \op, \oldcpsr
> .endm
>
> -   .macro  iround, out0, 

Re: [PATCH v2 1/2] crypto: aes_ti - disable interrupts while accessing S-box

2018-10-17 Thread Ard Biesheuvel
Hi Eric,

On 17 October 2018 at 14:18, Eric Biggers  wrote:
> From: Eric Biggers 
>
> In the "aes-fixed-time" AES implementation, disable interrupts while
> accessing the S-box, in order to make cache-timing attacks more
> difficult.  Previously it was possible for the CPU to be interrupted
> while the S-box was loaded into L1 cache, potentially evicting the
> cachelines and causing later table lookups to be time-variant.
>
> In tests I did on x86 and ARM, this doesn't affect performance
> significantly.  Responsiveness is potentially a concern, but interrupts
> are only disabled for a single AES block.
>
> Note that even after this change, the implementation still isn't
> necessarily guaranteed to be constant-time; see
> https://cr.yp.to/antiforgery/cachetiming-20050414.pdf for a discussion
> of the many difficulties involved in writing truly constant-time AES
> software.  But it's valuable to make such attacks more difficult.
>
> Signed-off-by: Eric Biggers 

Thanks for taking a look. Could we add something to the Kconfig blurb
that mentions that it runs the algorithm with interrupts disabled? In
any case,

Reviewed-by: Ard Biesheuvel 


> ---
>  crypto/aes_ti.c | 18 ++
>  1 file changed, 18 insertions(+)
>
> diff --git a/crypto/aes_ti.c b/crypto/aes_ti.c
> index 03023b2290e8..1ff9785b30f5 100644
> --- a/crypto/aes_ti.c
> +++ b/crypto/aes_ti.c
> @@ -269,6 +269,7 @@ static void aesti_encrypt(struct crypto_tfm *tfm, u8 
> *out, const u8 *in)
> const u32 *rkp = ctx->key_enc + 4;
> int rounds = 6 + ctx->key_length / 4;
> u32 st0[4], st1[4];
> +   unsigned long flags;
> int round;
>
> st0[0] = ctx->key_enc[0] ^ get_unaligned_le32(in);
> @@ -276,6 +277,12 @@ static void aesti_encrypt(struct crypto_tfm *tfm, u8 
> *out, const u8 *in)
> st0[2] = ctx->key_enc[2] ^ get_unaligned_le32(in + 8);
> st0[3] = ctx->key_enc[3] ^ get_unaligned_le32(in + 12);
>
> +   /*
> +* Temporarily disable interrupts to avoid races where cachelines are
> +* evicted when the CPU is interrupted to do something else.
> +*/
> +   local_irq_save(flags);
> +
> st0[0] ^= __aesti_sbox[ 0] ^ __aesti_sbox[128];
> st0[1] ^= __aesti_sbox[32] ^ __aesti_sbox[160];
> st0[2] ^= __aesti_sbox[64] ^ __aesti_sbox[192];
> @@ -300,6 +307,8 @@ static void aesti_encrypt(struct crypto_tfm *tfm, u8 
> *out, const u8 *in)
> put_unaligned_le32(subshift(st1, 1) ^ rkp[5], out + 4);
> put_unaligned_le32(subshift(st1, 2) ^ rkp[6], out + 8);
> put_unaligned_le32(subshift(st1, 3) ^ rkp[7], out + 12);
> +
> +   local_irq_restore(flags);
>  }
>
>  static void aesti_decrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
> @@ -308,6 +317,7 @@ static void aesti_decrypt(struct crypto_tfm *tfm, u8 
> *out, const u8 *in)
> const u32 *rkp = ctx->key_dec + 4;
> int rounds = 6 + ctx->key_length / 4;
> u32 st0[4], st1[4];
> +   unsigned long flags;
> int round;
>
> st0[0] = ctx->key_dec[0] ^ get_unaligned_le32(in);
> @@ -315,6 +325,12 @@ static void aesti_decrypt(struct crypto_tfm *tfm, u8 
> *out, const u8 *in)
> st0[2] = ctx->key_dec[2] ^ get_unaligned_le32(in + 8);
> st0[3] = ctx->key_dec[3] ^ get_unaligned_le32(in + 12);
>
> +   /*
> +* Temporarily disable interrupts to avoid races where cachelines are
> +* evicted when the CPU is interrupted to do something else.
> +*/
> +   local_irq_save(flags);
> +
> st0[0] ^= __aesti_inv_sbox[ 0] ^ __aesti_inv_sbox[128];
> st0[1] ^= __aesti_inv_sbox[32] ^ __aesti_inv_sbox[160];
> st0[2] ^= __aesti_inv_sbox[64] ^ __aesti_inv_sbox[192];
> @@ -339,6 +355,8 @@ static void aesti_decrypt(struct crypto_tfm *tfm, u8 
> *out, const u8 *in)
> put_unaligned_le32(inv_subshift(st1, 1) ^ rkp[5], out + 4);
> put_unaligned_le32(inv_subshift(st1, 2) ^ rkp[6], out + 8);
> put_unaligned_le32(inv_subshift(st1, 3) ^ rkp[7], out + 12);
> +
> +   local_irq_restore(flags);
>  }
>
>  static struct crypto_alg aes_alg = {
> --
> 2.19.1
>


Re: [PATCH v2 2/2] crypto: arm/aes - add some hardening against cache-timing attacks

2018-10-17 Thread Ard Biesheuvel
Hi Eric,

Thanks for looking into this.

On 17 October 2018 at 14:18, Eric Biggers  wrote:
> From: Eric Biggers 
>
> Make the ARM scalar AES implementation closer to constant-time by
> disabling interrupts and prefetching the tables into L1 cache.  This is
> feasible because due to ARM's "free" rotations, the main tables are only
> 1024 bytes instead of the usual 4096 used by most AES implementations.
>
> On ARM Cortex-A7, the speed loss is only about 5%.  The resulting
> implementation is still over twice as fast as aes_ti.c.
>
> Note that even after these changes, the implementation still isn't
> necessarily guaranteed to be constant-time; see
> https://cr.yp.to/antiforgery/cachetiming-20050414.pdf for a discussion
> of the many difficulties involved in writing truly constant-time AES
> software.  But it's valuable to make such attacks more difficult.
>
> Suggested-by: Ard Biesheuvel 
> Signed-off-by: Eric Biggers 
> ---
>  arch/arm/crypto/aes-cipher-core.S | 26 ++
>  arch/arm/crypto/aes-cipher-glue.c | 13 +
>  crypto/aes_generic.c  |  9 +
>  3 files changed, 44 insertions(+), 4 deletions(-)
>
> diff --git a/arch/arm/crypto/aes-cipher-core.S 
> b/arch/arm/crypto/aes-cipher-core.S
> index 184d6c2d15d5..ba9d4aefe585 100644
> --- a/arch/arm/crypto/aes-cipher-core.S
> +++ b/arch/arm/crypto/aes-cipher-core.S
> @@ -138,6 +138,23 @@
> eor r7, r7, r11
>
> __adrl  ttab, \ttab
> +   /*
> +* Prefetch the 1024-byte 'ft' or 'it' table into L1 cache, assuming
> +* cacheline size >= 32.  This, along with the caller disabling
> +* interrupts, is a hardening measure intended to make cache-timing
> +* attacks more difficult.  They may not be fully prevented, however;
> +* see the paper https://cr.yp.to/antiforgery/cachetiming-20050414.pdf
> +* ("Cache-timing attacks on AES") for a discussion of the many
> +* difficulties involved in writing truly constant-time AES software.
> +*/
> +   .set i, 0
> +.rept 1024 / 128
> +   ldr r8, [ttab, #i + 0]
> +   ldr r9, [ttab, #i + 32]
> +   ldr r10, [ttab, #i + 64]
> +   ldr r11, [ttab, #i + 96]
> +   .set i, i + 128
> +.endr
>

Mind sticking a bit more to the coding style of the file? I.e., indent
the gas directives with the code, and put two tabs after them.

> tst rounds, #2
> bne 1f
> @@ -152,6 +169,15 @@
> b   0b
>
>  2: __adrl  ttab, \ltab
> +.if \bsz == 0
> +   /* Prefetch the 256-byte inverse S-box; see explanation above */
> +   .set i, 0
> +.rept 256 / 64
> +   ldr t0, [ttab, #i + 0]
> +   ldr t1, [ttab, #i + 32]
> +   .set i, i + 64
> +.endr
> +.endif
> \round  r4, r5, r6, r7, r8, r9, r10, r11, \bsz, b
>
>  #ifdef CONFIG_CPU_BIG_ENDIAN
> diff --git a/arch/arm/crypto/aes-cipher-glue.c 
> b/arch/arm/crypto/aes-cipher-glue.c
> index c222f6e072ad..f40e35eb22e4 100644
> --- a/arch/arm/crypto/aes-cipher-glue.c
> +++ b/arch/arm/crypto/aes-cipher-glue.c
> @@ -23,16 +23,29 @@ static void aes_encrypt(struct crypto_tfm *tfm, u8 *out, 
> const u8 *in)
>  {
> struct crypto_aes_ctx *ctx = crypto_tfm_ctx(tfm);
> int rounds = 6 + ctx->key_length / 4;
> +   unsigned long flags;
>
> +   /*
> +* This AES implementation prefetches the lookup table into L1 cache 
> to
> +* try to make timing attacks on the table lookups more difficult.
> +* Temporarily disable interrupts to avoid races where cachelines are
> +* evicted when the CPU is interrupted to do something else.
> +*/
> +   local_irq_save(flags);
> __aes_arm_encrypt(ctx->key_enc, rounds, in, out);
> +   local_irq_restore(flags);
>  }
>

Disabling interrupts like that is going to raise some eyebrows, so I'd
prefer it if we can reduce the scope of the IRQ blackout as much as we
can. This means we should move the IRQ en/disable into the .S file,
and only let it cover the parts of the code where we are actually
doing table lookups that are indexed by the input.

>  static void aes_decrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
>  {
> struct crypto_aes_ctx *ctx = crypto_tfm_ctx(tfm);
> int rounds = 6 + ctx->key_length / 4;
> +   unsigned long flags;
>
> +   /* Disable interrupts to help mitigate timing attacks, see above */
> +   local_irq_save(flags);
> __aes_arm_decrypt(ctx->key_dec, rounds, in, out);
> +   local_irq

Re: [PATCH 2/3] crypto: crypto_xor - use unaligned accessors for aligned fast path

2018-10-09 Thread Ard Biesheuvel
On 9 October 2018 at 05:47, Eric Biggers  wrote:
> Hi Ard,
>
> On Mon, Oct 08, 2018 at 11:15:53PM +0200, Ard Biesheuvel wrote:
>> On ARM v6 and later, we define CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
>> because the ordinary load/store instructions (ldr, ldrh, ldrb) can
>> tolerate any misalignment of the memory address. However, load/store
>> double and load/store multiple instructions (ldrd, ldm) may still only
>> be used on memory addresses that are 32-bit aligned, and so we have to
>> use the CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS macro with care, or we
>> may end up with a severe performance hit due to alignment traps that
>> require fixups by the kernel.
>>
>> Fortunately, the get_unaligned() accessors do the right thing: when
>> building for ARMv6 or later, the compiler will emit unaligned accesses
>> using the ordinary load/store instructions (but avoid the ones that
>> require 32-bit alignment). When building for older ARM, those accessors
>> will emit the appropriate sequence of ldrb/mov/orr instructions. And on
>> architectures that can truly tolerate any kind of misalignment, the
>> get_unaligned() accessors resolve to the leXX_to_cpup accessors that
>> operate on aligned addresses.
>>
>> So switch to the unaligned accessors for the aligned fast path. This
>> will create the exact same code on architectures that can really
>> tolerate any kind of misalignment, and generate code for ARMv6+ that
>> avoids load/store instructions that trigger alignment faults.
>>
>> Signed-off-by: Ard Biesheuvel 
>> ---
>>  crypto/algapi.c |  7 +++
>>  include/crypto/algapi.h | 11 +--
>>  2 files changed, 12 insertions(+), 6 deletions(-)
>>
>> diff --git a/crypto/algapi.c b/crypto/algapi.c
>> index 2545c5f89c4c..52ce3c5a0499 100644
>> --- a/crypto/algapi.c
>> +++ b/crypto/algapi.c
>> @@ -988,11 +988,10 @@ void crypto_inc(u8 *a, unsigned int size)
>>   __be32 *b = (__be32 *)(a + size);
>>   u32 c;
>>
>> - if (IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) ||
>> - IS_ALIGNED((unsigned long)b, __alignof__(*b)))
>> + if (IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS))
>>   for (; size >= 4; size -= 4) {
>> - c = be32_to_cpu(*--b) + 1;
>> - *b = cpu_to_be32(c);
>> + c = get_unaligned_be32(--b) + 1;
>> + put_unaligned_be32(c, b);
>>   if (likely(c))
>>   return;
>>   }
>> diff --git a/include/crypto/algapi.h b/include/crypto/algapi.h
>> index 4a5ad10e75f0..86267c232f34 100644
>> --- a/include/crypto/algapi.h
>> +++ b/include/crypto/algapi.h
>> @@ -17,6 +17,8 @@
>>  #include 
>>  #include 
>>
>> +#include 
>> +
>>  /*
>>   * Maximum values for blocksize and alignmask, used to allocate
>>   * static buffers that are big enough for any combination of
>> @@ -212,7 +214,9 @@ static inline void crypto_xor(u8 *dst, const u8 *src, 
>> unsigned int size)
>>   unsigned long *s = (unsigned long *)src;
>>
>>   while (size > 0) {
>> - *d++ ^= *s++;
>> + put_unaligned(get_unaligned(d) ^ get_unaligned(s), d);
>> + d++;
>> + s++;
>>   size -= sizeof(unsigned long);
>>   }
>>   } else {
>> @@ -231,7 +235,10 @@ static inline void crypto_xor_cpy(u8 *dst, const u8 
>> *src1, const u8 *src2,
>>   unsigned long *s2 = (unsigned long *)src2;
>>
>>   while (size > 0) {
>> - *d++ = *s1++ ^ *s2++;
>> + put_unaligned(get_unaligned(s1) ^ get_unaligned(s2), 
>> d);
>> + d++;
>> + s1++;
>> + s2++;
>>   size -= sizeof(unsigned long);
>>   }
>>   } else {
>> --
>> 2.11.0
>>
>
> Doesn't __crypto_xor() have the same problem too?
>

More or less, and I was wondering what to do about it.

To fix __crypto_xor() correctly, we'd have to duplicate the code path
that operates on the u64[], u32[] and u16[] chunks, or we'll end up
with suboptimal code that uses the accessors even if the alignment
routine has executed first. This is the same issue Jason points out in
siphash.

Perhaps the answer is to add 'fast' unaligned accessors that may be
used on unaligned quantities only if
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is set?

E.g.,

#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
#define get_unaligned_fast  get_unaligned
#else
#define get_unaligned_fast(x)  (*(x))
#endif

Arnd?
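
For the sake of discussion, a chunk of __crypto_xor() rewritten with such
helpers might look roughly like the sketch below (illustrative only; it
assumes a put_unaligned_fast() counterpart defined the same way as the
get_unaligned_fast() macro above):

/* Sketch of the 64-bit leg of a __crypto_xor()-style loop; not a patch. */
static void xor_u64_chunks(u8 *dst, const u8 *src1, const u8 *src2,
			   unsigned int len)
{
	while (len >= sizeof(u64)) {
		u64 l = get_unaligned_fast((const u64 *)src1) ^
			get_unaligned_fast((const u64 *)src2);

		put_unaligned_fast(l, (u64 *)dst);
		dst += sizeof(u64);
		src1 += sizeof(u64);
		src2 += sizeof(u64);
		len -= sizeof(u64);
	}
}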


Re: [PATCH 1/3] crypto: memneq - use unaligned accessors for aligned fast path

2018-10-09 Thread Ard Biesheuvel
On 9 October 2018 at 05:34, Eric Biggers  wrote:
> Hi Ard,
>
> On Mon, Oct 08, 2018 at 11:15:52PM +0200, Ard Biesheuvel wrote:
>> On ARM v6 and later, we define CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
>> because the ordinary load/store instructions (ldr, ldrh, ldrb) can
>> tolerate any misalignment of the memory address. However, load/store
>> double and load/store multiple instructions (ldrd, ldm) may still only
>> be used on memory addresses that are 32-bit aligned, and so we have to
>> use the CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS macro with care, or we
>> may end up with a severe performance hit due to alignment traps that
>> require fixups by the kernel.
>>
>> Fortunately, the get_unaligned() accessors do the right thing: when
>> building for ARMv6 or later, the compiler will emit unaligned accesses
>> using the ordinary load/store instructions (but avoid the ones that
>> require 32-bit alignment). When building for older ARM, those accessors
>> will emit the appropriate sequence of ldrb/mov/orr instructions. And on
>> architectures that can truly tolerate any kind of misalignment, the
>> get_unaligned() accessors resolve to the leXX_to_cpup accessors that
>> operate on aligned addresses.
>>
>> So switch to the unaligned accessors for the aligned fast path. This
>> will create the exact same code on architectures that can really
>> tolerate any kind of misalignment, and generate code for ARMv6+ that
>> avoids load/store instructions that trigger alignment faults.
>>
>> Signed-off-by: Ard Biesheuvel 
>> ---
>>  crypto/memneq.c | 24 ++--
>>  1 file changed, 17 insertions(+), 7 deletions(-)
>>
>> diff --git a/crypto/memneq.c b/crypto/memneq.c
>> index afed1bd16aee..0f46a6150f22 100644
>> --- a/crypto/memneq.c
>> +++ b/crypto/memneq.c
>> @@ -60,6 +60,7 @@
>>   */
>>
>>  #include 
>> +#include 
>>
>>  #ifndef __HAVE_ARCH_CRYPTO_MEMNEQ
>>
>> @@ -71,7 +72,10 @@ __crypto_memneq_generic(const void *a, const void *b, 
>> size_t size)
>>
>>  #if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)
>>   while (size >= sizeof(unsigned long)) {
>> - neq |= *(unsigned long *)a ^ *(unsigned long *)b;
>> + unsigned long const *p = a;
>> + unsigned long const *q = b;
>> +
>> + neq |= get_unaligned(p) ^ get_unaligned(q);
>>   OPTIMIZER_HIDE_VAR(neq);
>>   a += sizeof(unsigned long);
>>   b += sizeof(unsigned long);
>> @@ -95,18 +99,24 @@ static inline unsigned long __crypto_memneq_16(const 
>> void *a, const void *b)
>>
>>  #ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
>>   if (sizeof(unsigned long) == 8) {
>> - neq |= *(unsigned long *)(a)   ^ *(unsigned long *)(b);
>> + unsigned long const *p = a;
>> + unsigned long const *q = b;
>> +
>> + neq |= get_unaligned(p++) ^ get_unaligned(q++);
>>   OPTIMIZER_HIDE_VAR(neq);
>> - neq |= *(unsigned long *)(a+8) ^ *(unsigned long *)(b+8);
>> + neq |= get_unaligned(p) ^ get_unaligned(q);
>>   OPTIMIZER_HIDE_VAR(neq);
>>   } else if (sizeof(unsigned int) == 4) {
>> - neq |= *(unsigned int *)(a)^ *(unsigned int *)(b);
>> + unsigned int const *p = a;
>> + unsigned int const *q = b;
>> +
>> + neq |= get_unaligned(p++) ^ get_unaligned(q++);
>>   OPTIMIZER_HIDE_VAR(neq);
>> - neq |= *(unsigned int *)(a+4)  ^ *(unsigned int *)(b+4);
>> + neq |= get_unaligned(p++) ^ get_unaligned(q++);
>>   OPTIMIZER_HIDE_VAR(neq);
>> - neq |= *(unsigned int *)(a+8)  ^ *(unsigned int *)(b+8);
>> + neq |= get_unaligned(p++) ^ get_unaligned(q++);
>>   OPTIMIZER_HIDE_VAR(neq);
>> - neq |= *(unsigned int *)(a+12) ^ *(unsigned int *)(b+12);
>> + neq |= get_unaligned(p) ^ get_unaligned(q);
>>   OPTIMIZER_HIDE_VAR(neq);
>>   } else
>>  #endif /* CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS */
>
> This looks good, but maybe now we should get rid of the
> !CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS path too?
> At least for the 16-byte case:
>
> static inline unsigned long __crypto_memneq_16(const void *a, const void *b)
> {
> const unsigned long *p = a, *q = b;
> unsigned long neq = 0;
>
> BUILD_BUG_ON(sizeof(*p) != 4 && sizeof(*p) != 8);
> neq |= get_unaligned(p++) ^ get_unaligned(q++);
> OPTIMIZER_HIDE_VAR(neq);
> neq |= get_unaligned(p++) ^ get_unaligned(q++);
> OPTIMIZER_HIDE_VAR(neq);
> if (sizeof(*p) == 4) {
> neq |= get_unaligned(p++) ^ get_unaligned(q++);
> OPTIMIZER_HIDE_VAR(neq);
> neq |= get_unaligned(p++) ^ get_unaligned(q++);
> OPTIMIZER_HIDE_VAR(neq);
> }
> return neq;
> }

Yes that makes sense.


[PATCH 0/3] crypto: use unaligned accessors in aligned fast paths

2018-10-08 Thread Ard Biesheuvel
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS behaves a bit counterintuitively
on ARM: we set it for architecture revisions v6 and up, which support
any alignment for load/store instructions that operate on bytes, half
words or words. However, load/store double word and load store multiple
instructions still require 32-bit alignment, and using them on unaligned
quantities results in costly alignment traps that have to be fixed up by
the kernel's fixup code.

Fortunately, the unaligned accessors do the right thing here: on
architectures that really tolerate any misalignment, they simply resolve
to the aligned accessors, while on ARMv6+ (which uses the packed struct
wrappers for unaligned accesses), they result in load/store sequences
that avoid the instructions that require 32-bit alignment.

Since there is not really a downside to using the unaligned accessors on
aligned paths for architectures other than ARM that define
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS, let's switch to them in a couple
of places in the crypto code.

Note that all patches are against code that has been observed to be emitted
with ldm or ldrd instructions when building ARM's multi_v7_defconfig.
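
(For reference, the packed-struct wrappers referred to above boil down to
something like the snippet below; this is a simplified illustration of the
shape of include/linux/unaligned/packed_struct.h, not code added by this
series.)

struct __una_u32 { u32 x; } __packed;

static inline u32 __get_unaligned_cpu32(const void *p)
{
	const struct __una_u32 *ptr = (const struct __una_u32 *)p;

	/* the compiler emits byte-wise or ordinary ldr accesses here,
	 * but never ldrd/ldm, so no alignment traps are possible */
	return ptr->x;
}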

Ard Biesheuvel (3):
  crypto: memneq - use unaligned accessors for aligned fast path
  crypto: crypto_xor - use unaligned accessors for aligned fast path
  crypto: siphash - drop _aligned variants

 crypto/algapi.c |   7 +-
 crypto/memneq.c |  24 +++--
 include/crypto/algapi.h |  11 +-
 include/linux/siphash.h | 106 +---
 lib/siphash.c   | 103 ++-
 5 files changed, 83 insertions(+), 168 deletions(-)

-- 
2.11.0



[PATCH 1/3] crypto: memneq - use unaligned accessors for aligned fast path

2018-10-08 Thread Ard Biesheuvel
On ARM v6 and later, we define CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
because the ordinary load/store instructions (ldr, ldrh, ldrb) can
tolerate any misalignment of the memory address. However, load/store
double and load/store multiple instructions (ldrd, ldm) may still only
be used on memory addresses that are 32-bit aligned, and so we have to
use the CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS macro with care, or we
may end up with a severe performance hit due to alignment traps that
require fixups by the kernel.

Fortunately, the get_unaligned() accessors do the right thing: when
building for ARMv6 or later, the compiler will emit unaligned accesses
using the ordinary load/store instructions (but avoid the ones that
require 32-bit alignment). When building for older ARM, those accessors
will emit the appropriate sequence of ldrb/mov/orr instructions. And on
architectures that can truly tolerate any kind of misalignment, the
get_unaligned() accessors resolve to the leXX_to_cpup accessors that
operate on aligned addresses.

So switch to the unaligned accessors for the aligned fast path. This
will create the exact same code on architectures that can really
tolerate any kind of misalignment, and generate code for ARMv6+ that
avoids load/store instructions that trigger alignment faults.

Signed-off-by: Ard Biesheuvel 
---
 crypto/memneq.c | 24 ++--
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/crypto/memneq.c b/crypto/memneq.c
index afed1bd16aee..0f46a6150f22 100644
--- a/crypto/memneq.c
+++ b/crypto/memneq.c
@@ -60,6 +60,7 @@
  */
 
 #include 
+#include 
 
 #ifndef __HAVE_ARCH_CRYPTO_MEMNEQ
 
@@ -71,7 +72,10 @@ __crypto_memneq_generic(const void *a, const void *b, size_t 
size)
 
 #if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)
while (size >= sizeof(unsigned long)) {
-   neq |= *(unsigned long *)a ^ *(unsigned long *)b;
+   unsigned long const *p = a;
+   unsigned long const *q = b;
+
+   neq |= get_unaligned(p) ^ get_unaligned(q);
OPTIMIZER_HIDE_VAR(neq);
a += sizeof(unsigned long);
b += sizeof(unsigned long);
@@ -95,18 +99,24 @@ static inline unsigned long __crypto_memneq_16(const void 
*a, const void *b)
 
 #ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
if (sizeof(unsigned long) == 8) {
-   neq |= *(unsigned long *)(a)   ^ *(unsigned long *)(b);
+   unsigned long const *p = a;
+   unsigned long const *q = b;
+
+   neq |= get_unaligned(p++) ^ get_unaligned(q++);
OPTIMIZER_HIDE_VAR(neq);
-   neq |= *(unsigned long *)(a+8) ^ *(unsigned long *)(b+8);
+   neq |= get_unaligned(p) ^ get_unaligned(q);
OPTIMIZER_HIDE_VAR(neq);
} else if (sizeof(unsigned int) == 4) {
-   neq |= *(unsigned int *)(a)^ *(unsigned int *)(b);
+   unsigned int const *p = a;
+   unsigned int const *q = b;
+
+   neq |= get_unaligned(p++) ^ get_unaligned(q++);
OPTIMIZER_HIDE_VAR(neq);
-   neq |= *(unsigned int *)(a+4)  ^ *(unsigned int *)(b+4);
+   neq |= get_unaligned(p++) ^ get_unaligned(q++);
OPTIMIZER_HIDE_VAR(neq);
-   neq |= *(unsigned int *)(a+8)  ^ *(unsigned int *)(b+8);
+   neq |= get_unaligned(p++) ^ get_unaligned(q++);
OPTIMIZER_HIDE_VAR(neq);
-   neq |= *(unsigned int *)(a+12) ^ *(unsigned int *)(b+12);
+   neq |= get_unaligned(p) ^ get_unaligned(q);
OPTIMIZER_HIDE_VAR(neq);
} else
 #endif /* CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS */
-- 
2.11.0



[PATCH 2/3] crypto: crypto_xor - use unaligned accessors for aligned fast path

2018-10-08 Thread Ard Biesheuvel
On ARM v6 and later, we define CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
because the ordinary load/store instructions (ldr, ldrh, ldrb) can
tolerate any misalignment of the memory address. However, load/store
double and load/store multiple instructions (ldrd, ldm) may still only
be used on memory addresses that are 32-bit aligned, and so we have to
use the CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS macro with care, or we
may end up with a severe performance hit due to alignment traps that
require fixups by the kernel.

Fortunately, the get_unaligned() accessors do the right thing: when
building for ARMv6 or later, the compiler will emit unaligned accesses
using the ordinary load/store instructions (but avoid the ones that
require 32-bit alignment). When building for older ARM, those accessors
will emit the appropriate sequence of ldrb/mov/orr instructions. And on
architectures that can truly tolerate any kind of misalignment, the
get_unaligned() accessors resolve to the leXX_to_cpup accessors that
operate on aligned addresses.

So switch to the unaligned accessors for the aligned fast path. This
will create the exact same code on architectures that can really
tolerate any kind of misalignment, and generate code for ARMv6+ that
avoids load/store instructions that trigger alignment faults.

Signed-off-by: Ard Biesheuvel 
---
 crypto/algapi.c |  7 +++
 include/crypto/algapi.h | 11 +--
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/crypto/algapi.c b/crypto/algapi.c
index 2545c5f89c4c..52ce3c5a0499 100644
--- a/crypto/algapi.c
+++ b/crypto/algapi.c
@@ -988,11 +988,10 @@ void crypto_inc(u8 *a, unsigned int size)
__be32 *b = (__be32 *)(a + size);
u32 c;
 
-   if (IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) ||
-   IS_ALIGNED((unsigned long)b, __alignof__(*b)))
+   if (IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS))
for (; size >= 4; size -= 4) {
-   c = be32_to_cpu(*--b) + 1;
-   *b = cpu_to_be32(c);
+   c = get_unaligned_be32(--b) + 1;
+   put_unaligned_be32(c, b);
if (likely(c))
return;
}
diff --git a/include/crypto/algapi.h b/include/crypto/algapi.h
index 4a5ad10e75f0..86267c232f34 100644
--- a/include/crypto/algapi.h
+++ b/include/crypto/algapi.h
@@ -17,6 +17,8 @@
 #include 
 #include 
 
+#include 
+
 /*
  * Maximum values for blocksize and alignmask, used to allocate
  * static buffers that are big enough for any combination of
@@ -212,7 +214,9 @@ static inline void crypto_xor(u8 *dst, const u8 *src, 
unsigned int size)
unsigned long *s = (unsigned long *)src;
 
while (size > 0) {
-   *d++ ^= *s++;
+   put_unaligned(get_unaligned(d) ^ get_unaligned(s), d);
+   d++;
+   s++;
size -= sizeof(unsigned long);
}
} else {
@@ -231,7 +235,10 @@ static inline void crypto_xor_cpy(u8 *dst, const u8 *src1, 
const u8 *src2,
unsigned long *s2 = (unsigned long *)src2;
 
while (size > 0) {
-   *d++ = *s1++ ^ *s2++;
+   put_unaligned(get_unaligned(s1) ^ get_unaligned(s2), d);
+   d++;
+   s1++;
+   s2++;
size -= sizeof(unsigned long);
}
} else {
-- 
2.11.0



[PATCH 3/3] crypto: siphash - drop _aligned variants

2018-10-08 Thread Ard Biesheuvel
On ARM v6 and later, we define CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
because the ordinary load/store instructions (ldr, ldrh, ldrb) can
tolerate any misalignment of the memory address. However, load/store
double and load/store multiple instructions (ldrd, ldm) may still only
be used on memory addresses that are 32-bit aligned, and so we have to
use the CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS macro with care, or we
may end up with a severe performance hit due to alignment traps that
require fixups by the kernel.

Fortunately, the get_unaligned() accessors do the right thing: when
building for ARMv6 or later, the compiler will emit unaligned accesses
using the ordinary load/store instructions (but avoid the ones that
require 32-bit alignment). When building for older ARM, those accessors
will emit the appropriate sequence of ldrb/mov/orr instructions. And on
architectures that can truly tolerate any kind of misalignment, the
get_unaligned() accessors resolve to the leXX_to_cpup accessors that
operate on aligned addresses.

Since the compiler will in fact emit ldrd or ldm instructions when
building this code for ARM v6 or later, the solution is to use the
unaligned accessors on the aligned code paths. Given the above, this
either produces the same code, or better in the ARMv6+ case. However,
since that removes the only difference between the aligned and unaligned
variants, we can drop the aligned variant entirely.

Signed-off-by: Ard Biesheuvel 
---
 include/linux/siphash.h | 106 +---
 lib/siphash.c   | 103 ++-
 2 files changed, 54 insertions(+), 155 deletions(-)

diff --git a/include/linux/siphash.h b/include/linux/siphash.h
index fa7a6b9cedbf..ef3c36b0ae0f 100644
--- a/include/linux/siphash.h
+++ b/include/linux/siphash.h
@@ -15,16 +15,14 @@
 
 #include 
 #include 
+#include 
 
 #define SIPHASH_ALIGNMENT __alignof__(u64)
 typedef struct {
u64 key[2];
 } siphash_key_t;
 
-u64 __siphash_aligned(const void *data, size_t len, const siphash_key_t *key);
-#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
-u64 __siphash_unaligned(const void *data, size_t len, const siphash_key_t 
*key);
-#endif
+u64 __siphash(const void *data, size_t len, const siphash_key_t *key);
 
 u64 siphash_1u64(const u64 a, const siphash_key_t *key);
 u64 siphash_2u64(const u64 a, const u64 b, const siphash_key_t *key);
@@ -48,26 +46,6 @@ static inline u64 siphash_4u32(const u32 a, const u32 b, 
const u32 c,
 }
 
 
-static inline u64 ___siphash_aligned(const __le64 *data, size_t len,
-const siphash_key_t *key)
-{
-   if (__builtin_constant_p(len) && len == 4)
-   return siphash_1u32(le32_to_cpup((const __le32 *)data), key);
-   if (__builtin_constant_p(len) && len == 8)
-   return siphash_1u64(le64_to_cpu(data[0]), key);
-   if (__builtin_constant_p(len) && len == 16)
-   return siphash_2u64(le64_to_cpu(data[0]), le64_to_cpu(data[1]),
-   key);
-   if (__builtin_constant_p(len) && len == 24)
-   return siphash_3u64(le64_to_cpu(data[0]), le64_to_cpu(data[1]),
-   le64_to_cpu(data[2]), key);
-   if (__builtin_constant_p(len) && len == 32)
-   return siphash_4u64(le64_to_cpu(data[0]), le64_to_cpu(data[1]),
-   le64_to_cpu(data[2]), le64_to_cpu(data[3]),
-   key);
-   return __siphash_aligned(data, len, key);
-}
-
 /**
  * siphash - compute 64-bit siphash PRF value
  * @data: buffer to hash
@@ -77,11 +55,30 @@ static inline u64 ___siphash_aligned(const __le64 *data, 
size_t len,
 static inline u64 siphash(const void *data, size_t len,
  const siphash_key_t *key)
 {
-#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
-   if (!IS_ALIGNED((unsigned long)data, SIPHASH_ALIGNMENT))
-   return __siphash_unaligned(data, len, key);
-#endif
-   return ___siphash_aligned(data, len, key);
+   if (IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)) {
+   if (__builtin_constant_p(len) && len == 4)
+   return siphash_1u32(get_unaligned_le32(data),
+   key);
+   if (__builtin_constant_p(len) && len == 8)
+   return siphash_1u64(get_unaligned_le64(data),
+   key);
+   if (__builtin_constant_p(len) && len == 16)
+   return siphash_2u64(get_unaligned_le64(data),
+   get_unaligned_le64(data + 8),
+   key);
+   if (__builtin_constant_p(len) && len == 24)
+   return siphash_3u64(get_unaligned_le64(data),
+   get_unaligned_le64(data + 8),

[PATCH] crypto: arm64/aes-blk - ensure XTS mask is always loaded

2018-10-08 Thread Ard Biesheuvel
Commit 2e5d2f33d1db ("crypto: arm64/aes-blk - improve XTS mask handling")
optimized away some reloads of the XTS mask vector, but failed to take
into account that calls into the XTS en/decrypt routines will take a
slightly different code path if a single block of input is split across
different buffers. So let's ensure that the first load occurs
unconditionally, and move the reload to the end so it doesn't occur
needlessly.

Fixes: 2e5d2f33d1db ("crypto: arm64/aes-blk - improve XTS mask handling")
Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/aes-modes.S | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
index 039738ae23f6..67700045a0e0 100644
--- a/arch/arm64/crypto/aes-modes.S
+++ b/arch/arm64/crypto/aes-modes.S
@@ -359,18 +359,17 @@ AES_ENTRY(aes_xts_encrypt)
mov x29, sp
 
ld1 {v4.16b}, [x6]
+   xts_load_mask   v8
cbz w7, .Lxtsencnotfirst
 
enc_prepare w3, x5, x8
encrypt_block   v4, w3, x5, x8, w7  /* first tweak */
enc_switch_key  w3, x2, x8
-   xts_load_mask   v8
b   .LxtsencNx
 
 .Lxtsencnotfirst:
enc_prepare w3, x2, x8
 .LxtsencloopNx:
-   xts_reload_mask v8
next_tweak  v4, v4, v8
 .LxtsencNx:
	subs	w4, w4, #4
@@ -391,6 +390,7 @@ AES_ENTRY(aes_xts_encrypt)
st1 {v0.16b-v3.16b}, [x0], #64
mov v4.16b, v7.16b
cbz w4, .Lxtsencout
+   xts_reload_mask v8
b   .LxtsencloopNx
 .Lxtsenc1x:
	adds	w4, w4, #4
@@ -417,18 +417,17 @@ AES_ENTRY(aes_xts_decrypt)
mov x29, sp
 
ld1 {v4.16b}, [x6]
+   xts_load_mask   v8
cbz w7, .Lxtsdecnotfirst
 
enc_prepare w3, x5, x8
encrypt_block   v4, w3, x5, x8, w7  /* first tweak */
dec_prepare w3, x2, x8
-   xts_load_mask   v8
b   .LxtsdecNx
 
 .Lxtsdecnotfirst:
dec_prepare w3, x2, x8
 .LxtsdecloopNx:
-   xts_reload_mask v8
next_tweak  v4, v4, v8
 .LxtsdecNx:
	subs	w4, w4, #4
@@ -449,6 +448,7 @@ AES_ENTRY(aes_xts_decrypt)
st1 {v0.16b-v3.16b}, [x0], #64
mov v4.16b, v7.16b
cbz w4, .Lxtsdecout
+   xts_reload_mask v8
b   .LxtsdecloopNx
 .Lxtsdec1x:
	adds	w4, w4, #4
-- 
2.11.0



Re: [PATCH] crypto: x86/aes-ni - fix build error following fpu template removal

2018-10-05 Thread Ard Biesheuvel
On 5 October 2018 at 19:13, Eric Biggers  wrote:
> From: Eric Biggers 
>
> aesni-intel_glue.c still calls crypto_fpu_init() and crypto_fpu_exit()
> to register/unregister the "fpu" template.  But these functions don't
> exist anymore, causing a build error.  Remove the calls to them.
>
> Fixes: 944585a64f5e ("crypto: x86/aes-ni - remove special handling of AES in 
> PCBC mode")
> Signed-off-by: Eric Biggers 

Thanks for spotting that.

I had actually noticed myself, but wasn't really expecting this RFC
patch to be picked up without discussion.


> ---
>  arch/x86/crypto/aesni-intel_glue.c | 13 +
>  1 file changed, 1 insertion(+), 12 deletions(-)
>
> diff --git a/arch/x86/crypto/aesni-intel_glue.c 
> b/arch/x86/crypto/aesni-intel_glue.c
> index 89bae64eef4f9..661f7daf43da9 100644
> --- a/arch/x86/crypto/aesni-intel_glue.c
> +++ b/arch/x86/crypto/aesni-intel_glue.c
> @@ -102,9 +102,6 @@ asmlinkage void aesni_cbc_enc(struct crypto_aes_ctx *ctx, 
> u8 *out,
>  asmlinkage void aesni_cbc_dec(struct crypto_aes_ctx *ctx, u8 *out,
>   const u8 *in, unsigned int len, u8 *iv);
>
> -int crypto_fpu_init(void);
> -void crypto_fpu_exit(void);
> -
>  #define AVX_GEN2_OPTSIZE 640
>  #define AVX_GEN4_OPTSIZE 4096
>
> @@ -1449,13 +1446,9 @@ static int __init aesni_init(void)
>  #endif
>  #endif
>
> -   err = crypto_fpu_init();
> -   if (err)
> -   return err;
> -
> err = crypto_register_algs(aesni_algs, ARRAY_SIZE(aesni_algs));
> if (err)
> -   goto fpu_exit;
> +   return err;
>
> err = crypto_register_skciphers(aesni_skciphers,
> ARRAY_SIZE(aesni_skciphers));
> @@ -1489,8 +1482,6 @@ static int __init aesni_init(void)
> ARRAY_SIZE(aesni_skciphers));
>  unregister_algs:
> crypto_unregister_algs(aesni_algs, ARRAY_SIZE(aesni_algs));
> -fpu_exit:
> -   crypto_fpu_exit();
> return err;
>  }
>
> @@ -1501,8 +1492,6 @@ static void __exit aesni_exit(void)
> crypto_unregister_skciphers(aesni_skciphers,
> ARRAY_SIZE(aesni_skciphers));
> crypto_unregister_algs(aesni_algs, ARRAY_SIZE(aesni_algs));
> -
> -   crypto_fpu_exit();
>  }
>
>  late_initcall(aesni_init);
> --
> 2.19.0.605.g01d371f741-goog
>


Re: [PATCH] crypto: qat - move temp buffers off the stack

2018-10-05 Thread Ard Biesheuvel
On 5 October 2018 at 04:29, Herbert Xu  wrote:
> On Wed, Sep 26, 2018 at 11:51:59AM +0200, Ard Biesheuvel wrote:
>> Arnd reports that with Kees's latest VLA patches applied, the HMAC
>> handling in the QAT driver uses a worst case estimate of 160 bytes
>> for the SHA blocksize, allowing the compiler to determine the size
>> of the stack frame at runtime and throw a warning:
>>
>>   drivers/crypto/qat/qat_common/qat_algs.c: In function 
>> 'qat_alg_do_precomputes':
>>   drivers/crypto/qat/qat_common/qat_algs.c:257:1: error: the frame size
>>   of 1112 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
>>
>> Given that this worst case estimate is only 32 bytes larger than the
>> actual block size of SHA-512, the use of a VLA here was hiding the
>> excessive size of the stack frame from the compiler, and so we should
>> try to move these buffers off the stack.
>>
>> So move the ipad/opad buffers and the various SHA state descriptors
>> into the tfm context struct. Since qat_alg_do_precomputes() is only
>> called in the context of a setkey() operation, this should be safe.
>> Using SHA512_BLOCK_SIZE for the size of the ipad/opad buffers allows
>> them to be used by SHA-1/SHA-256 as well.
>>
>> Reported-by: Arnd Bergmann 
>> Signed-off-by: Ard Biesheuvel 
>> ---
>> This applies against v4.19-rc while Arnd's report was about -next. However,
>> since Kees's VLA change results in a warning about a pre-existing condition,
>> we may decide to apply it as a fix, and handle the conflict with Kees's
>> patch in cryptodev. Otherwise, I can respin it to apply onto cryptodev
>> directly. The patch was build tested only - I don't have the hardware.
>>
>> Thoughts anyone?
>
> I applied it against cryptodev only.  I don't think it's bad enough
> to warrant a backport to stable though.  But if you guys disagree we
> could always send the backport to stable after this goes upstream.
>

Works for me.


Re: [PATCH] crypto: aes_ti - disable interrupts while accessing sbox

2018-10-04 Thread Ard Biesheuvel
Hi Eric,

On 4 October 2018 at 06:07, Eric Biggers  wrote:
> From: Eric Biggers 
>
> The generic constant-time AES implementation is supposed to preload the
> AES S-box into the CPU's L1 data cache.  But, an interrupt handler can
> run on the CPU and muck with the cache.  Worse, on preemptible kernels
> the process can even be preempted and moved to a different CPU.  So the
> implementation may actually still be vulnerable to cache-timing attacks.
>
> Make it more robust by disabling interrupts while the sbox is used.
>
> In some quick tests on x86 and ARM, this doesn't affect performance
> significantly.  Responsiveness is also a concern, but interrupts are
> only disabled for a single AES block which even on ARM Cortex-A7 is
> "only" ~1500 cycles to encrypt or ~2600 cycles to decrypt.
>

I share your concern, but that is quite a big hammer.

Also, does it really take ~100 cycles per byte? That is terrible :-)

Given that the full lookup table is only 1024 bytes (or 1024+256 bytes
for decryption), I wonder if something like the below would be a
better option for A7 (apologies for the mangled whitespace)

diff --git a/arch/arm/crypto/aes-cipher-core.S
b/arch/arm/crypto/aes-cipher-core.S
index 184d6c2d15d5..83e893f7e581 100644
--- a/arch/arm/crypto/aes-cipher-core.S
+++ b/arch/arm/crypto/aes-cipher-core.S
@@ -139,6 +139,13 @@

  __adrl ttab, \ttab

+ .irpc r, 01234567
+ ldr r8, [ttab, #(32 * \r)]
+ ldr r9, [ttab, #(32 * \r) + 256]
+ ldr r10, [ttab, #(32 * \r) + 512]
+ ldr r11, [ttab, #(32 * \r) + 768]
+ .endr
+
  tst rounds, #2
  bne 1f

@@ -180,6 +187,12 @@ ENDPROC(__aes_arm_encrypt)

  .align 5
 ENTRY(__aes_arm_decrypt)
+ __adrl ttab, __aes_arm_inverse_sbox
+
+ .irpc r, 01234567
+ ldr r8, [ttab, #(32 * \r)]
+ .endr
+
  do_crypt iround, crypto_it_tab, __aes_arm_inverse_sbox, 0
 ENDPROC(__aes_arm_decrypt)

diff --git a/arch/arm/crypto/aes-cipher-glue.c
b/arch/arm/crypto/aes-cipher-glue.c
index c222f6e072ad..630e1a436f1d 100644
--- a/arch/arm/crypto/aes-cipher-glue.c
+++ b/arch/arm/crypto/aes-cipher-glue.c
@@ -23,16 +23,22 @@ static void aes_encrypt(struct crypto_tfm *tfm, u8
*out, const u8 *in)
 {
  struct crypto_aes_ctx *ctx = crypto_tfm_ctx(tfm);
  int rounds = 6 + ctx->key_length / 4;
+ unsigned long flags;

+ local_irq_save(flags);
  __aes_arm_encrypt(ctx->key_enc, rounds, in, out);
+ local_irq_restore(flags);
 }

 static void aes_decrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
 {
  struct crypto_aes_ctx *ctx = crypto_tfm_ctx(tfm);
  int rounds = 6 + ctx->key_length / 4;
+ unsigned long flags;

+ local_irq_save(flags);
  __aes_arm_decrypt(ctx->key_dec, rounds, in, out);
+ local_irq_restore(flags);
 }

 static struct crypto_alg aes_alg = {
diff --git a/crypto/aes_generic.c b/crypto/aes_generic.c
index ca554d57d01e..82fa860c9cb9 100644
--- a/crypto/aes_generic.c
+++ b/crypto/aes_generic.c
@@ -63,7 +63,7 @@ static inline u8 byte(const u32 x, const unsigned n)

 static const u32 rco_tab[10] = { 1, 2, 4, 8, 16, 32, 64, 128, 27, 54 };

-__visible const u32 crypto_ft_tab[4][256] = {
+__visible const u32 crypto_ft_tab[4][256] __cacheline_aligned = {
  {
  0xa56363c6, 0x847c7cf8, 0x997777ee, 0x8d7b7bf6,
  0x0df2f2ff, 0xbd6b6bd6, 0xb16f6fde, 0x54c5c591,
@@ -327,7 +327,7 @@ __visible const u32 crypto_ft_tab[4][256] = {
  }
 };

-__visible const u32 crypto_fl_tab[4][256] = {
+__visible const u32 crypto_fl_tab[4][256] __cacheline_aligned = {
  {
  0x00000063, 0x0000007c, 0x00000077, 0x0000007b,
  0x000000f2, 0x0000006b, 0x0000006f, 0x000000c5,
@@ -591,7 +591,7 @@ __visible const u32 crypto_fl_tab[4][256] = {
  }
 };

-__visible const u32 crypto_it_tab[4][256] = {
+__visible const u32 crypto_it_tab[4][256] __cacheline_aligned = {
  {
  0x50a7f451, 0x5365417e, 0xc3a4171a, 0x965e273a,
  0xcb6bab3b, 0xf1459d1f, 0xab58faac, 0x9303e34b,
@@ -855,7 +855,7 @@ __visible const u32 crypto_it_tab[4][256] = {
  }
 };

-__visible const u32 crypto_il_tab[4][256] = {
+__visible const u32 crypto_il_tab[4][256] __cacheline_aligned = {
  {
  0x00000052, 0x00000009, 0x0000006a, 0x000000d5,
  0x00000030, 0x00000036, 0x000000a5, 0x00000038,
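
(In C terms, the .irpc preload above amounts to something like the helper
below; illustrative only, not part of the proposed diff.)

static inline void preload_aes_table(const u32 *tab)
{
	u32 tmp = 0;
	int i;

	/* one load every 32 bytes touches each cache line of the 1 KiB table */
	for (i = 0; i < 256; i += 8)
		tmp ^= tab[i];

	/* keep the compiler from optimizing the loads away */
	OPTIMIZER_HIDE_VAR(tmp);
}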






> Fixes: b5e0b032b6c3 ("crypto: aes - add generic time invariant AES cipher")
> Signed-off-by: Eric Biggers 
> ---
>  crypto/aes_ti.c | 18 ++
>  1 file changed, 18 insertions(+)
>
> diff --git a/crypto/aes_ti.c b/crypto/aes_ti.c
> index 03023b2290e8e..81b604419ee0e 100644
> --- a/crypto/aes_ti.c
> +++ b/crypto/aes_ti.c
> @@ -269,6 +269,7 @@ static void aesti_encrypt(struct crypto_tfm *tfm, u8 
> *out, const u8 *in)
> const u32 *rkp = ctx->key_enc + 4;
> int rounds = 6 + ctx->key_length / 4;
> u32 st0[4], st1[4];
> +   unsigned long flags;
> int round;
>
> st0[0] = ctx->key_enc[0] ^ get_unaligned_le32(in);
> @@ -276,6 +277,12 @@ static void aesti_encrypt(struct crypto_tfm *tfm, u8 
> *out, const u8 *in)
> st0[2] = ctx->key_enc[2] ^ get_unaligned_le32(in + 8);
> st0[3] = ctx->key_enc[3] ^ get_unaligned_le32(in + 12);

Re: [PATCH] crypto: arm64/aes - fix handling sub-block CTS-CBC inputs

2018-10-03 Thread Ard Biesheuvel
On 3 October 2018 at 07:22, Eric Biggers  wrote:
> From: Eric Biggers 
>
> In the new arm64 CTS-CBC implementation, return an error code rather
> than crashing on inputs shorter than AES_BLOCK_SIZE bytes.  Also set
> cra_blocksize to AES_BLOCK_SIZE (like is done in the cts template) to
> indicate the minimum input size.
>
> Fixes: dd597fb33ff0 ("crypto: arm64/aes-blk - add support for CTS-CBC mode")
> Signed-off-by: Eric Biggers 

Thanks Eric

Reviewed-by: Ard Biesheuvel 

> ---
>  arch/arm64/crypto/aes-glue.c | 13 +
>  1 file changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/arch/arm64/crypto/aes-glue.c b/arch/arm64/crypto/aes-glue.c
> index 26d2b0263ba63..1e676625ef33f 100644
> --- a/arch/arm64/crypto/aes-glue.c
> +++ b/arch/arm64/crypto/aes-glue.c
> @@ -243,8 +243,11 @@ static int cts_cbc_encrypt(struct skcipher_request *req)
>
> skcipher_request_set_tfm(&rctx->subreq, tfm);
>
> -   if (req->cryptlen == AES_BLOCK_SIZE)
> +   if (req->cryptlen <= AES_BLOCK_SIZE) {
> +   if (req->cryptlen < AES_BLOCK_SIZE)
> +   return -EINVAL;
> cbc_blocks = 1;
> +   }
>
> if (cbc_blocks > 0) {
> unsigned int blocks;
> @@ -305,8 +308,11 @@ static int cts_cbc_decrypt(struct skcipher_request *req)
>
> skcipher_request_set_tfm(&rctx->subreq, tfm);
>
> -   if (req->cryptlen == AES_BLOCK_SIZE)
> +   if (req->cryptlen <= AES_BLOCK_SIZE) {
> +   if (req->cryptlen < AES_BLOCK_SIZE)
> +   return -EINVAL;
> cbc_blocks = 1;
> +   }
>
> if (cbc_blocks > 0) {
> unsigned int blocks;
> @@ -486,14 +492,13 @@ static struct skcipher_alg aes_algs[] = { {
> .cra_driver_name= "__cts-cbc-aes-" MODE,
> .cra_priority   = PRIO,
> .cra_flags  = CRYPTO_ALG_INTERNAL,
> -   .cra_blocksize  = 1,
> +   .cra_blocksize  = AES_BLOCK_SIZE,
> .cra_ctxsize= sizeof(struct crypto_aes_ctx),
> .cra_module = THIS_MODULE,
> },
> .min_keysize= AES_MIN_KEY_SIZE,
> .max_keysize= AES_MAX_KEY_SIZE,
> .ivsize = AES_BLOCK_SIZE,
> -   .chunksize  = AES_BLOCK_SIZE,
> .walksize   = 2 * AES_BLOCK_SIZE,
> .setkey = skcipher_aes_setkey,
> .encrypt= cts_cbc_encrypt,
> --
> 2.19.0
>


[PATCH v2 1/2] crypto: morus/generic - fix for big endian systems

2018-10-01 Thread Ard Biesheuvel
Omit the endian swabbing when folding the lengths of the assoc and
crypt input buffers into the state to finalize the tag. This is not
necessary given that the memory representation of the state is in
machine native endianness already.

This fixes an error reported by tcrypt running on a big endian system:

  alg: aead: Test 2 failed on encryption for morus640-generic
  00000000: a8 30 ef fb e6 26 eb 23 b0 87 dd 98 57 f3 e1 4b
  00000010: 21
  alg: aead: Test 2 failed on encryption for morus1280-generic
  00000000: 88 19 1b fb 1c 29 49 0e ee 82 2f cb 97 a6 a5 ee
  00000010: 5f

Fixes: 396be41f16fd ("crypto: morus - Add generic MORUS AEAD implementations")
Cc:  # v4.18+
Reviewed-by: Ondrej Mosnacek 
Signed-off-by: Ard Biesheuvel 
---
 crypto/morus1280.c |  7 ++-
 crypto/morus640.c  | 16 
 2 files changed, 6 insertions(+), 17 deletions(-)

diff --git a/crypto/morus1280.c b/crypto/morus1280.c
index d057cf5ac4a8..3889c188f266 100644
--- a/crypto/morus1280.c
+++ b/crypto/morus1280.c
@@ -385,14 +385,11 @@ static void crypto_morus1280_final(struct morus1280_state 
*state,
   struct morus1280_block *tag_xor,
   u64 assoclen, u64 cryptlen)
 {
-   u64 assocbits = assoclen * 8;
-   u64 cryptbits = cryptlen * 8;
-
struct morus1280_block tmp;
unsigned int i;
 
-   tmp.words[0] = cpu_to_le64(assocbits);
-   tmp.words[1] = cpu_to_le64(cryptbits);
+   tmp.words[0] = assoclen * 8;
+   tmp.words[1] = cryptlen * 8;
tmp.words[2] = 0;
tmp.words[3] = 0;
 
diff --git a/crypto/morus640.c b/crypto/morus640.c
index 1ca76e54281b..da06ec2f6a80 100644
--- a/crypto/morus640.c
+++ b/crypto/morus640.c
@@ -384,21 +384,13 @@ static void crypto_morus640_final(struct morus640_state 
*state,
  struct morus640_block *tag_xor,
  u64 assoclen, u64 cryptlen)
 {
-   u64 assocbits = assoclen * 8;
-   u64 cryptbits = cryptlen * 8;
-
-   u32 assocbits_lo = (u32)assocbits;
-   u32 assocbits_hi = (u32)(assocbits >> 32);
-   u32 cryptbits_lo = (u32)cryptbits;
-   u32 cryptbits_hi = (u32)(cryptbits >> 32);
-
struct morus640_block tmp;
unsigned int i;
 
-   tmp.words[0] = cpu_to_le32(assocbits_lo);
-   tmp.words[1] = cpu_to_le32(assocbits_hi);
-   tmp.words[2] = cpu_to_le32(cryptbits_lo);
-   tmp.words[3] = cpu_to_le32(cryptbits_hi);
+   tmp.words[0] = lower_32_bits(assoclen * 8);
+   tmp.words[1] = upper_32_bits(assoclen * 8);
+   tmp.words[2] = lower_32_bits(cryptlen * 8);
+   tmp.words[3] = upper_32_bits(cryptlen * 8);
 
for (i = 0; i < MORUS_BLOCK_WORDS; i++)
state->s[4].words[i] ^= state->s[0].words[i];
-- 
2.17.1



[PATCH v2 2/2] crypto: aegis/generic - fix for big endian systems

2018-10-01 Thread Ard Biesheuvel
Use the correct __le32 annotation and accessors to perform the
single round of AES encryption performed inside the AEGIS transform.
Otherwise, tcrypt reports:

  alg: aead: Test 1 failed on encryption for aegis128-generic
  00000000: 6c 25 25 4a 3c 10 1d 27 2b c1 d4 84 9a ef 7f 6e
  alg: aead: Test 1 failed on encryption for aegis128l-generic
  00000000: cd c6 e3 b8 a0 70 9d 8e c2 4f 6f fe 71 42 df 28
  alg: aead: Test 1 failed on encryption for aegis256-generic
  00000000: aa ed 07 b1 96 1d e9 e6 f2 ed b5 8e 1c 5f dc 1c

Fixes: f606a88e5823 ("crypto: aegis - Add generic AEGIS AEAD implementations")
Cc:  # v4.18+
Signed-off-by: Ard Biesheuvel 
---
 crypto/aegis.h | 20 +---
 1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/crypto/aegis.h b/crypto/aegis.h
index f1c6900ddb80..405e025fc906 100644
--- a/crypto/aegis.h
+++ b/crypto/aegis.h
@@ -21,7 +21,7 @@
 
 union aegis_block {
__le64 words64[AEGIS_BLOCK_SIZE / sizeof(__le64)];
-   u32 words32[AEGIS_BLOCK_SIZE / sizeof(u32)];
+   __le32 words32[AEGIS_BLOCK_SIZE / sizeof(__le32)];
u8 bytes[AEGIS_BLOCK_SIZE];
 };
 
@@ -57,24 +57,22 @@ static void crypto_aegis_aesenc(union aegis_block *dst,
const union aegis_block *src,
const union aegis_block *key)
 {
-   u32 *d = dst->words32;
const u8  *s  = src->bytes;
-   const u32 *k  = key->words32;
const u32 *t0 = crypto_ft_tab[0];
const u32 *t1 = crypto_ft_tab[1];
const u32 *t2 = crypto_ft_tab[2];
const u32 *t3 = crypto_ft_tab[3];
u32 d0, d1, d2, d3;
 
-   d0 = t0[s[ 0]] ^ t1[s[ 5]] ^ t2[s[10]] ^ t3[s[15]] ^ k[0];
-   d1 = t0[s[ 4]] ^ t1[s[ 9]] ^ t2[s[14]] ^ t3[s[ 3]] ^ k[1];
-   d2 = t0[s[ 8]] ^ t1[s[13]] ^ t2[s[ 2]] ^ t3[s[ 7]] ^ k[2];
-   d3 = t0[s[12]] ^ t1[s[ 1]] ^ t2[s[ 6]] ^ t3[s[11]] ^ k[3];
+   d0 = t0[s[ 0]] ^ t1[s[ 5]] ^ t2[s[10]] ^ t3[s[15]];
+   d1 = t0[s[ 4]] ^ t1[s[ 9]] ^ t2[s[14]] ^ t3[s[ 3]];
+   d2 = t0[s[ 8]] ^ t1[s[13]] ^ t2[s[ 2]] ^ t3[s[ 7]];
+   d3 = t0[s[12]] ^ t1[s[ 1]] ^ t2[s[ 6]] ^ t3[s[11]];
 
-   d[0] = d0;
-   d[1] = d1;
-   d[2] = d2;
-   d[3] = d3;
+   dst->words32[0] = cpu_to_le32(d0) ^ key->words32[0];
+   dst->words32[1] = cpu_to_le32(d1) ^ key->words32[1];
+   dst->words32[2] = cpu_to_le32(d2) ^ key->words32[2];
+   dst->words32[3] = cpu_to_le32(d3) ^ key->words32[3];
 }
 
 #endif /* _CRYPTO_AEGIS_H */
-- 
2.17.1



[PATCH v2 0/2] crypto - fix aegis/morus for big endian systems

2018-10-01 Thread Ard Biesheuvel
Some bug fixes for issues that I stumbled upon while working on other
stuff.

Changes since v1:
- add Ondrej's ack to #1
- simplify #2 and drop unrelated performance tweak

Ard Biesheuvel (2):
  crypto: morus/generic - fix for big endian systems
  crypto: aegis/generic - fix for big endian systems

 crypto/aegis.h | 20 +---
 crypto/morus1280.c |  7 ++-
 crypto/morus640.c  | 16 
 3 files changed, 15 insertions(+), 28 deletions(-)

-- 
2.17.1



Re: [PATCH 2/2] crypto: aegis/generic - fix for big endian systems

2018-10-01 Thread Ard Biesheuvel
On 1 October 2018 at 10:00, Ondrej Mosnacek  wrote:
> On Sun, Sep 30, 2018 at 1:14 PM Ard Biesheuvel
>  wrote:
>> On 30 September 2018 at 10:58, Ard Biesheuvel  
>> wrote:
>> > Use the correct __le32 annotation and accessors to perform the
>> > single round of AES encryption performed inside the AEGIS transform.
>> > Otherwise, tcrypt reports:
>> >
>> >   alg: aead: Test 1 failed on encryption for aegis128-generic
>> >   00000000: 6c 25 25 4a 3c 10 1d 27 2b c1 d4 84 9a ef 7f 6e
>> >   alg: aead: Test 1 failed on encryption for aegis128l-generic
>> >   00000000: cd c6 e3 b8 a0 70 9d 8e c2 4f 6f fe 71 42 df 28
>> >   alg: aead: Test 1 failed on encryption for aegis256-generic
>> >   00000000: aa ed 07 b1 96 1d e9 e6 f2 ed b5 8e 1c 5f dc 1c
>> >
>> > While at it, let's refer to the first precomputed table only, and
>> > derive the other ones by rotation. This reduces the D-cache footprint
>> > by 75%, and shouldn't be too costly or free on load/store architectures
>> > (and X86 has its own AES-NI based implementation)
>> >
>> > Fixes: f606a88e5823 ("crypto: aegis - Add generic AEGIS AEAD 
>> > implementations")
>> > Cc:  # v4.18+
>> > Signed-off-by: Ard Biesheuvel 
>> > ---
>> >  crypto/aegis.h | 23 +---
>> >  1 file changed, 10 insertions(+), 13 deletions(-)
>> >
>> > diff --git a/crypto/aegis.h b/crypto/aegis.h
>> > index f1c6900ddb80..84d3e07a3c33 100644
>> > --- a/crypto/aegis.h
>> > +++ b/crypto/aegis.h
>> > @@ -21,7 +21,7 @@
>> >
>> >  union aegis_block {
>> > __le64 words64[AEGIS_BLOCK_SIZE / sizeof(__le64)];
>> > -   u32 words32[AEGIS_BLOCK_SIZE / sizeof(u32)];
>> > +   __le32 words32[AEGIS_BLOCK_SIZE / sizeof(__le32)];
>> > u8 bytes[AEGIS_BLOCK_SIZE];
>> >  };
>> >
>> > @@ -59,22 +59,19 @@ static void crypto_aegis_aesenc(union aegis_block *dst,
>> >  {
>> > u32 *d = dst->words32;
>> > const u8  *s  = src->bytes;
>> > -   const u32 *k  = key->words32;
>> > +   const __le32 *k  = key->words32;
>> > const u32 *t0 = crypto_ft_tab[0];
>> > -   const u32 *t1 = crypto_ft_tab[1];
>> > -   const u32 *t2 = crypto_ft_tab[2];
>> > -   const u32 *t3 = crypto_ft_tab[3];
>> > u32 d0, d1, d2, d3;
>> >
>> > -   d0 = t0[s[ 0]] ^ t1[s[ 5]] ^ t2[s[10]] ^ t3[s[15]] ^ k[0];
>> > -   d1 = t0[s[ 4]] ^ t1[s[ 9]] ^ t2[s[14]] ^ t3[s[ 3]] ^ k[1];
>> > -   d2 = t0[s[ 8]] ^ t1[s[13]] ^ t2[s[ 2]] ^ t3[s[ 7]] ^ k[2];
>> > -   d3 = t0[s[12]] ^ t1[s[ 1]] ^ t2[s[ 6]] ^ t3[s[11]] ^ k[3];
>> > +   d0 = t0[s[ 0]] ^ rol32(t0[s[ 5]], 8) ^ rol32(t0[s[10]], 16) ^ 
>> > rol32(t0[s[15]], 24);
>> > +   d1 = t0[s[ 4]] ^ rol32(t0[s[ 9]], 8) ^ rol32(t0[s[14]], 16) ^ 
>> > rol32(t0[s[ 3]], 24);
>> > +   d2 = t0[s[ 8]] ^ rol32(t0[s[13]], 8) ^ rol32(t0[s[ 2]], 16) ^ 
>> > rol32(t0[s[ 7]], 24);
>> > +   d3 = t0[s[12]] ^ rol32(t0[s[ 1]], 8) ^ rol32(t0[s[ 6]], 16) ^ 
>> > rol32(t0[s[11]], 24);
>> >
>> > -   d[0] = d0;
>> > -   d[1] = d1;
>> > -   d[2] = d2;
>> > -   d[3] = d3;
>> > +   d[0] = cpu_to_le32(d0 ^ le32_to_cpu(k[0]));
>> > +   d[1] = cpu_to_le32(d1 ^ le32_to_cpu(k[1]));
>> > +   d[2] = cpu_to_le32(d2 ^ le32_to_cpu(k[2]));
>> > +   d[3] = cpu_to_le32(d3 ^ le32_to_cpu(k[3]));
>>
>>
>> I suppose this
>>
>> > +   d[0] = cpu_to_le32(d0) ^ k[0];
>> > +   d[1] = cpu_to_le32(d1) ^ k[1];
>> > +   d[2] = cpu_to_le32(d2) ^ k[2];
>> > +   d[3] = cpu_to_le32(d3) ^ k[3];
>>
>> should work fine as well
>
> Yeah, that looks nicer, but I'm not sure if it is completely OK to do
> bitwise/arithmetic operations directly on the __[lb]e* types...  Maybe
> yes, but the code I've seen that used them usually seemed to treat
> them as opaque types.
>

No, xor is fine with __le/__be types
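
(Illustration only, not from the patch: sparse treats bitwise operators as
preserving the endian annotation, so a helper like the one below needs no
le32_to_cpu()/cpu_to_le32() round trip.)

static inline __le32 le32_xor(__le32 a, __le32 b)
{
	/* XOR keeps the __le32 annotation; arithmetic (e.g. +) would not */
	return a ^ b;
}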


Re: [PATCH 2/2] crypto: aegis/generic - fix for big endian systems

2018-10-01 Thread Ard Biesheuvel
On 1 October 2018 at 09:50, Ondrej Mosnacek  wrote:
> Hi Ard,
>
> On Sun, Sep 30, 2018 at 10:59 AM Ard Biesheuvel
>  wrote:
>> Use the correct __le32 annotation and accessors to perform the
>> single round of AES encryption performed inside the AEGIS transform.
>> Otherwise, tcrypt reports:
>>
>>   alg: aead: Test 1 failed on encryption for aegis128-generic
>>   00000000: 6c 25 25 4a 3c 10 1d 27 2b c1 d4 84 9a ef 7f 6e
>>   alg: aead: Test 1 failed on encryption for aegis128l-generic
>>   00000000: cd c6 e3 b8 a0 70 9d 8e c2 4f 6f fe 71 42 df 28
>>   alg: aead: Test 1 failed on encryption for aegis256-generic
>>   00000000: aa ed 07 b1 96 1d e9 e6 f2 ed b5 8e 1c 5f dc 1c
>
> Hm...  I think the reason I made a mistake here is that I first had a
> version with the AES table hard-coded and I had an #ifdef <big endian>
> #else #endif there with values for little-endian and
> big-endian variants.  Then I realized the aes_generic module exports
> the crypto_ft_tab table and rewrote the code to use that.  Somewhere along
> the way I forgot to check if the aes_generic table uses the same trick
> and correct the code...
>
> It would be nice to apply the same optimization to aes_generic.c, but
> unfortunately the current tables are exported so changing the
> convention would break external modules that use them :/
>

Indeed. I am doing some refactoring work on the AES code, which is how
I ran into this in the first place.

https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=for-kernelci

>>
>> While at it, let's refer to the first precomputed table only, and
>> derive the other ones by rotation. This reduces the D-cache footprint
>> by 75%, and shouldn't be too costly or free on load/store architectures
>> (and X86 has its own AES-NI based implementation)
>
> Could you maybe extract this into a separate patch?  I don't think we
> should mix functional and performance fixes together.
>

Yeah, good point. I will do that and fold in the simplification.

>>
>> Fixes: f606a88e5823 ("crypto: aegis - Add generic AEGIS AEAD 
>> implementations")
>> Cc:  # v4.18+
>> Signed-off-by: Ard Biesheuvel 
>> ---
>>  crypto/aegis.h | 23 +---
>>  1 file changed, 10 insertions(+), 13 deletions(-)
>>
>> diff --git a/crypto/aegis.h b/crypto/aegis.h
>> index f1c6900ddb80..84d3e07a3c33 100644
>> --- a/crypto/aegis.h
>> +++ b/crypto/aegis.h
>> @@ -21,7 +21,7 @@
>>
>>  union aegis_block {
>> __le64 words64[AEGIS_BLOCK_SIZE / sizeof(__le64)];
>> -   u32 words32[AEGIS_BLOCK_SIZE / sizeof(u32)];
>> +   __le32 words32[AEGIS_BLOCK_SIZE / sizeof(__le32)];
>> u8 bytes[AEGIS_BLOCK_SIZE];
>>  };
>>
>> @@ -59,22 +59,19 @@ static void crypto_aegis_aesenc(union aegis_block *dst,
>>  {
>> u32 *d = dst->words32;
>> const u8  *s  = src->bytes;
>> -   const u32 *k  = key->words32;
>> +   const __le32 *k  = key->words32;
>> const u32 *t0 = crypto_ft_tab[0];
>> -   const u32 *t1 = crypto_ft_tab[1];
>> -   const u32 *t2 = crypto_ft_tab[2];
>> -   const u32 *t3 = crypto_ft_tab[3];
>> u32 d0, d1, d2, d3;
>>
>> -   d0 = t0[s[ 0]] ^ t1[s[ 5]] ^ t2[s[10]] ^ t3[s[15]] ^ k[0];
>> -   d1 = t0[s[ 4]] ^ t1[s[ 9]] ^ t2[s[14]] ^ t3[s[ 3]] ^ k[1];
>> -   d2 = t0[s[ 8]] ^ t1[s[13]] ^ t2[s[ 2]] ^ t3[s[ 7]] ^ k[2];
>> -   d3 = t0[s[12]] ^ t1[s[ 1]] ^ t2[s[ 6]] ^ t3[s[11]] ^ k[3];
>> +   d0 = t0[s[ 0]] ^ rol32(t0[s[ 5]], 8) ^ rol32(t0[s[10]], 16) ^ 
>> rol32(t0[s[15]], 24);
>> +   d1 = t0[s[ 4]] ^ rol32(t0[s[ 9]], 8) ^ rol32(t0[s[14]], 16) ^ 
>> rol32(t0[s[ 3]], 24);
>> +   d2 = t0[s[ 8]] ^ rol32(t0[s[13]], 8) ^ rol32(t0[s[ 2]], 16) ^ 
>> rol32(t0[s[ 7]], 24);
>> +   d3 = t0[s[12]] ^ rol32(t0[s[ 1]], 8) ^ rol32(t0[s[ 6]], 16) ^ 
>> rol32(t0[s[11]], 24);
>>
>> -   d[0] = d0;
>> -   d[1] = d1;
>> -   d[2] = d2;
>> -   d[3] = d3;
>> +   d[0] = cpu_to_le32(d0 ^ le32_to_cpu(k[0]));
>> +   d[1] = cpu_to_le32(d1 ^ le32_to_cpu(k[1]));
>> +   d[2] = cpu_to_le32(d2 ^ le32_to_cpu(k[2]));
>> +   d[3] = cpu_to_le32(d3 ^ le32_to_cpu(k[3]));
>>  }
>>
>>  #endif /* _CRYPTO_AEGIS_H */
>> --
>> 2.19.0
>>
>
> Thanks,
>
> --
> Ondrej Mosnacek 
> Associate Software Engineer, Security Technologies
> Red Hat, Inc.


Re: [PATCH 1/2] crypto: morus/generic - fix for big endian systems

2018-10-01 Thread Ard Biesheuvel
On 1 October 2018 at 09:26, Ondrej Mosnacek  wrote:
> On Sun, Sep 30, 2018 at 10:59 AM Ard Biesheuvel
>  wrote:
>> Omit the endian swabbing when folding the lengths of the assoc and
>> crypt input buffers into the state to finalize the tag. This is not
>> necessary given that the memory representation of the state is in
>> machine native endianness already.
>>
>> This fixes an error reported by tcrypt running on a big endian system:
>>
>>   alg: aead: Test 2 failed on encryption for morus640-generic
>>   00000000: a8 30 ef fb e6 26 eb 23 b0 87 dd 98 57 f3 e1 4b
>>   00000010: 21
>>   alg: aead: Test 2 failed on encryption for morus1280-generic
>>   00000000: 88 19 1b fb 1c 29 49 0e ee 82 2f cb 97 a6 a5 ee
>>   00000010: 5f
>
> Yikes, I never really got around to testing MORUS and AEGIS on a BE
> machine...  My mistake, sorry :/
>

No worries - this is brand new code so this is not entirely unexpected.

>>
>> Fixes: 396be41f16fd ("crypto: morus - Add generic MORUS AEAD 
>> implementations")
>> Cc:  # v4.18+
>> Signed-off-by: Ard Biesheuvel 
>
> Reviewed-by: Ondrej Mosnacek 
>

Thanks!

>> ---
>>  crypto/morus1280.c |  7 ++-
>>  crypto/morus640.c  | 16 
>>  2 files changed, 6 insertions(+), 17 deletions(-)
>>
>> diff --git a/crypto/morus1280.c b/crypto/morus1280.c
>> index d057cf5ac4a8..3889c188f266 100644
>> --- a/crypto/morus1280.c
>> +++ b/crypto/morus1280.c
>> @@ -385,14 +385,11 @@ static void crypto_morus1280_final(struct 
>> morus1280_state *state,
>>struct morus1280_block *tag_xor,
>>u64 assoclen, u64 cryptlen)
>>  {
>> -   u64 assocbits = assoclen * 8;
>> -   u64 cryptbits = cryptlen * 8;
>> -
>> struct morus1280_block tmp;
>> unsigned int i;
>>
>> -   tmp.words[0] = cpu_to_le64(assocbits);
>> -   tmp.words[1] = cpu_to_le64(cryptbits);
>> +   tmp.words[0] = assoclen * 8;
>> +   tmp.words[1] = cryptlen * 8;
>> tmp.words[2] = 0;
>> tmp.words[3] = 0;
>>
>> diff --git a/crypto/morus640.c b/crypto/morus640.c
>> index 1ca76e54281b..da06ec2f6a80 100644
>> --- a/crypto/morus640.c
>> +++ b/crypto/morus640.c
>> @@ -384,21 +384,13 @@ static void crypto_morus640_final(struct 
>> morus640_state *state,
>>   struct morus640_block *tag_xor,
>>   u64 assoclen, u64 cryptlen)
>>  {
>> -   u64 assocbits = assoclen * 8;
>> -   u64 cryptbits = cryptlen * 8;
>> -
>> -   u32 assocbits_lo = (u32)assocbits;
>> -   u32 assocbits_hi = (u32)(assocbits >> 32);
>> -   u32 cryptbits_lo = (u32)cryptbits;
>> -   u32 cryptbits_hi = (u32)(cryptbits >> 32);
>> -
>> struct morus640_block tmp;
>> unsigned int i;
>>
>> -   tmp.words[0] = cpu_to_le32(assocbits_lo);
>> -   tmp.words[1] = cpu_to_le32(assocbits_hi);
>> -   tmp.words[2] = cpu_to_le32(cryptbits_lo);
>> -   tmp.words[3] = cpu_to_le32(cryptbits_hi);
>> +   tmp.words[0] = lower_32_bits(assoclen * 8);
>> +   tmp.words[1] = upper_32_bits(assoclen * 8);
>> +   tmp.words[2] = lower_32_bits(cryptlen * 8);
>> +   tmp.words[3] = upper_32_bits(cryptlen * 8);
>>
>> for (i = 0; i < MORUS_BLOCK_WORDS; i++)
>> state->s[4].words[i] ^= state->s[0].words[i];
>> --
>> 2.19.0
>>
>
> Thanks,
>
> --
> Ondrej Mosnacek 
> Associate Software Engineer, Security Technologies
> Red Hat, Inc.


[PATCH] crypto: lrw - fix rebase error after out of bounds fix

2018-09-30 Thread Ard Biesheuvel
Due to an unfortunate interaction between commit fbe1a850b3b1
("crypto: lrw - Fix out-of bounds access on counter overflow") and
commit c778f96bf347 ("crypto: lrw - Optimize tweak computation"),
we ended up with a version of next_index() that always returns 127.

Fixes: c778f96bf347 ("crypto: lrw - Optimize tweak computation")
Signed-off-by: Ard Biesheuvel 
---
 crypto/lrw.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/crypto/lrw.c b/crypto/lrw.c
index 6fcf0d431185..0430ccd08728 100644
--- a/crypto/lrw.c
+++ b/crypto/lrw.c
@@ -122,10 +122,9 @@ static int next_index(u32 *counter)
int i, res = 0;
 
for (i = 0; i < 4; i++) {
-   if (counter[i] + 1 != 0) {
-   res += ffz(counter[i]++);
-   break;
-   }
+   if (counter[i] + 1 != 0)
+   return res + ffz(counter[i]++);
+
counter[i] = 0;
res += 32;
}
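
For illustration, a worked example of what the fixed routine computes
(hypothetical counter value, not from the patch):

/*
 *	u32 counter[4] = { 0xffffffff, 0xffffffff, 0x3, 0x0 };
 *	int idx = next_index(counter);
 *
 * Limbs 0 and 1 wrap to zero (adding 32 + 32), limb 2 has its lowest clear
 * bit at position 2 and is post-incremented to 0x4, so idx == 66 and the
 * counter becomes { 0, 0, 0x4, 0x0 }: 66 is exactly the bit that flipped
 * from 0 to 1 when the 128-bit counter was incremented. The broken version
 * returned 127 on every call instead.
 */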
-- 
2.17.1



Re: [PATCH 2/2] crypto: aegis/generic - fix for big endian systems

2018-09-30 Thread Ard Biesheuvel
On 30 September 2018 at 10:58, Ard Biesheuvel  wrote:
> Use the correct __le32 annotation and accessors to perform the
> single round of AES encryption performed inside the AEGIS transform.
> Otherwise, tcrypt reports:
>
>   alg: aead: Test 1 failed on encryption for aegis128-generic
>   00000000: 6c 25 25 4a 3c 10 1d 27 2b c1 d4 84 9a ef 7f 6e
>   alg: aead: Test 1 failed on encryption for aegis128l-generic
>   00000000: cd c6 e3 b8 a0 70 9d 8e c2 4f 6f fe 71 42 df 28
>   alg: aead: Test 1 failed on encryption for aegis256-generic
>   00000000: aa ed 07 b1 96 1d e9 e6 f2 ed b5 8e 1c 5f dc 1c
>
> While at it, let's refer to the first precomputed table only, and
> derive the other ones by rotation. This reduces the D-cache footprint
> by 75%, and shouldn't be too costly or free on load/store architectures
> (and X86 has its own AES-NI based implementation)
>
> Fixes: f606a88e5823 ("crypto: aegis - Add generic AEGIS AEAD implementations")
> Cc:  # v4.18+
> Signed-off-by: Ard Biesheuvel 
> ---
>  crypto/aegis.h | 23 +---
>  1 file changed, 10 insertions(+), 13 deletions(-)
>
> diff --git a/crypto/aegis.h b/crypto/aegis.h
> index f1c6900ddb80..84d3e07a3c33 100644
> --- a/crypto/aegis.h
> +++ b/crypto/aegis.h
> @@ -21,7 +21,7 @@
>
>  union aegis_block {
> __le64 words64[AEGIS_BLOCK_SIZE / sizeof(__le64)];
> -   u32 words32[AEGIS_BLOCK_SIZE / sizeof(u32)];
> +   __le32 words32[AEGIS_BLOCK_SIZE / sizeof(__le32)];
> u8 bytes[AEGIS_BLOCK_SIZE];
>  };
>
> @@ -59,22 +59,19 @@ static void crypto_aegis_aesenc(union aegis_block *dst,
>  {
> u32 *d = dst->words32;
> const u8  *s  = src->bytes;
> -   const u32 *k  = key->words32;
> +   const __le32 *k  = key->words32;
> const u32 *t0 = crypto_ft_tab[0];
> -   const u32 *t1 = crypto_ft_tab[1];
> -   const u32 *t2 = crypto_ft_tab[2];
> -   const u32 *t3 = crypto_ft_tab[3];
> u32 d0, d1, d2, d3;
>
> -   d0 = t0[s[ 0]] ^ t1[s[ 5]] ^ t2[s[10]] ^ t3[s[15]] ^ k[0];
> -   d1 = t0[s[ 4]] ^ t1[s[ 9]] ^ t2[s[14]] ^ t3[s[ 3]] ^ k[1];
> -   d2 = t0[s[ 8]] ^ t1[s[13]] ^ t2[s[ 2]] ^ t3[s[ 7]] ^ k[2];
> -   d3 = t0[s[12]] ^ t1[s[ 1]] ^ t2[s[ 6]] ^ t3[s[11]] ^ k[3];
> +   d0 = t0[s[ 0]] ^ rol32(t0[s[ 5]], 8) ^ rol32(t0[s[10]], 16) ^ 
> rol32(t0[s[15]], 24);
> +   d1 = t0[s[ 4]] ^ rol32(t0[s[ 9]], 8) ^ rol32(t0[s[14]], 16) ^ 
> rol32(t0[s[ 3]], 24);
> +   d2 = t0[s[ 8]] ^ rol32(t0[s[13]], 8) ^ rol32(t0[s[ 2]], 16) ^ 
> rol32(t0[s[ 7]], 24);
> +   d3 = t0[s[12]] ^ rol32(t0[s[ 1]], 8) ^ rol32(t0[s[ 6]], 16) ^ 
> rol32(t0[s[11]], 24);
>
> -   d[0] = d0;
> -   d[1] = d1;
> -   d[2] = d2;
> -   d[3] = d3;
> +   d[0] = cpu_to_le32(d0 ^ le32_to_cpu(k[0]));
> +   d[1] = cpu_to_le32(d1 ^ le32_to_cpu(k[1]));
> +   d[2] = cpu_to_le32(d2 ^ le32_to_cpu(k[2]));
> +   d[3] = cpu_to_le32(d3 ^ le32_to_cpu(k[3]));


I suppose this

> +   d[0] = cpu_to_le32(d0) ^ k[0];
> +   d[1] = cpu_to_le32(d1) ^ k[1];
> +   d[2] = cpu_to_le32(d2) ^ k[2];
> +   d[3] = cpu_to_le32(d3) ^ k[3];

should work fine as well

>  }
>
>  #endif /* _CRYPTO_AEGIS_H */
> --
> 2.19.0
>


[PATCH 2/2] crypto: aegis/generic - fix for big endian systems

2018-09-30 Thread Ard Biesheuvel
Use the correct __le32 annotation and accessors to perform the
single round of AES encryption performed inside the AEGIS transform.
Otherwise, tcrypt reports:

  alg: aead: Test 1 failed on encryption for aegis128-generic
  00000000: 6c 25 25 4a 3c 10 1d 27 2b c1 d4 84 9a ef 7f 6e
  alg: aead: Test 1 failed on encryption for aegis128l-generic
  00000000: cd c6 e3 b8 a0 70 9d 8e c2 4f 6f fe 71 42 df 28
  alg: aead: Test 1 failed on encryption for aegis256-generic
  00000000: aa ed 07 b1 96 1d e9 e6 f2 ed b5 8e 1c 5f dc 1c

While at it, let's refer to the first precomputed table only, and
derive the other ones by rotation. This reduces the D-cache footprint
by 75%, and shouldn't be too costly or free on load/store architectures
(and X86 has its own AES-NI based implementation)

Fixes: f606a88e5823 ("crypto: aegis - Add generic AEGIS AEAD implementations")
Cc:  # v4.18+
Signed-off-by: Ard Biesheuvel 
---
 crypto/aegis.h | 23 +---
 1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/crypto/aegis.h b/crypto/aegis.h
index f1c6900ddb80..84d3e07a3c33 100644
--- a/crypto/aegis.h
+++ b/crypto/aegis.h
@@ -21,7 +21,7 @@
 
 union aegis_block {
__le64 words64[AEGIS_BLOCK_SIZE / sizeof(__le64)];
-   u32 words32[AEGIS_BLOCK_SIZE / sizeof(u32)];
+   __le32 words32[AEGIS_BLOCK_SIZE / sizeof(__le32)];
u8 bytes[AEGIS_BLOCK_SIZE];
 };
 
@@ -59,22 +59,19 @@ static void crypto_aegis_aesenc(union aegis_block *dst,
 {
u32 *d = dst->words32;
const u8  *s  = src->bytes;
-   const u32 *k  = key->words32;
+   const __le32 *k  = key->words32;
const u32 *t0 = crypto_ft_tab[0];
-   const u32 *t1 = crypto_ft_tab[1];
-   const u32 *t2 = crypto_ft_tab[2];
-   const u32 *t3 = crypto_ft_tab[3];
u32 d0, d1, d2, d3;
 
-   d0 = t0[s[ 0]] ^ t1[s[ 5]] ^ t2[s[10]] ^ t3[s[15]] ^ k[0];
-   d1 = t0[s[ 4]] ^ t1[s[ 9]] ^ t2[s[14]] ^ t3[s[ 3]] ^ k[1];
-   d2 = t0[s[ 8]] ^ t1[s[13]] ^ t2[s[ 2]] ^ t3[s[ 7]] ^ k[2];
-   d3 = t0[s[12]] ^ t1[s[ 1]] ^ t2[s[ 6]] ^ t3[s[11]] ^ k[3];
+   d0 = t0[s[ 0]] ^ rol32(t0[s[ 5]], 8) ^ rol32(t0[s[10]], 16) ^ 
rol32(t0[s[15]], 24);
+   d1 = t0[s[ 4]] ^ rol32(t0[s[ 9]], 8) ^ rol32(t0[s[14]], 16) ^ 
rol32(t0[s[ 3]], 24);
+   d2 = t0[s[ 8]] ^ rol32(t0[s[13]], 8) ^ rol32(t0[s[ 2]], 16) ^ 
rol32(t0[s[ 7]], 24);
+   d3 = t0[s[12]] ^ rol32(t0[s[ 1]], 8) ^ rol32(t0[s[ 6]], 16) ^ 
rol32(t0[s[11]], 24);
 
-   d[0] = d0;
-   d[1] = d1;
-   d[2] = d2;
-   d[3] = d3;
+   d[0] = cpu_to_le32(d0 ^ le32_to_cpu(k[0]));
+   d[1] = cpu_to_le32(d1 ^ le32_to_cpu(k[1]));
+   d[2] = cpu_to_le32(d2 ^ le32_to_cpu(k[2]));
+   d[3] = cpu_to_le32(d3 ^ le32_to_cpu(k[3]));
 }
 
 #endif /* _CRYPTO_AEGIS_H */
-- 
2.19.0



[PATCH 1/2] crypto: morus/generic - fix for big endian systems

2018-09-30 Thread Ard Biesheuvel
Omit the endian swabbing when folding the lengths of the assoc and
crypt input buffers into the state to finalize the tag. This is not
necessary given that the memory representation of the state is in
machine native endianness already.

This fixes an error reported by tcrypt running on a big endian system:

  alg: aead: Test 2 failed on encryption for morus640-generic
  00000000: a8 30 ef fb e6 26 eb 23 b0 87 dd 98 57 f3 e1 4b
  00000010: 21
  alg: aead: Test 2 failed on encryption for morus1280-generic
  00000000: 88 19 1b fb 1c 29 49 0e ee 82 2f cb 97 a6 a5 ee
  00000010: 5f

Fixes: 396be41f16fd ("crypto: morus - Add generic MORUS AEAD implementations")
Cc:  # v4.18+
Signed-off-by: Ard Biesheuvel 
---
 crypto/morus1280.c |  7 ++-
 crypto/morus640.c  | 16 
 2 files changed, 6 insertions(+), 17 deletions(-)

diff --git a/crypto/morus1280.c b/crypto/morus1280.c
index d057cf5ac4a8..3889c188f266 100644
--- a/crypto/morus1280.c
+++ b/crypto/morus1280.c
@@ -385,14 +385,11 @@ static void crypto_morus1280_final(struct morus1280_state 
*state,
   struct morus1280_block *tag_xor,
   u64 assoclen, u64 cryptlen)
 {
-   u64 assocbits = assoclen * 8;
-   u64 cryptbits = cryptlen * 8;
-
struct morus1280_block tmp;
unsigned int i;
 
-   tmp.words[0] = cpu_to_le64(assocbits);
-   tmp.words[1] = cpu_to_le64(cryptbits);
+   tmp.words[0] = assoclen * 8;
+   tmp.words[1] = cryptlen * 8;
tmp.words[2] = 0;
tmp.words[3] = 0;
 
diff --git a/crypto/morus640.c b/crypto/morus640.c
index 1ca76e54281b..da06ec2f6a80 100644
--- a/crypto/morus640.c
+++ b/crypto/morus640.c
@@ -384,21 +384,13 @@ static void crypto_morus640_final(struct morus640_state 
*state,
  struct morus640_block *tag_xor,
  u64 assoclen, u64 cryptlen)
 {
-   u64 assocbits = assoclen * 8;
-   u64 cryptbits = cryptlen * 8;
-
-   u32 assocbits_lo = (u32)assocbits;
-   u32 assocbits_hi = (u32)(assocbits >> 32);
-   u32 cryptbits_lo = (u32)cryptbits;
-   u32 cryptbits_hi = (u32)(cryptbits >> 32);
-
struct morus640_block tmp;
unsigned int i;
 
-   tmp.words[0] = cpu_to_le32(assocbits_lo);
-   tmp.words[1] = cpu_to_le32(assocbits_hi);
-   tmp.words[2] = cpu_to_le32(cryptbits_lo);
-   tmp.words[3] = cpu_to_le32(cryptbits_hi);
+   tmp.words[0] = lower_32_bits(assoclen * 8);
+   tmp.words[1] = upper_32_bits(assoclen * 8);
+   tmp.words[2] = lower_32_bits(cryptlen * 8);
+   tmp.words[3] = upper_32_bits(cryptlen * 8);
 
for (i = 0; i < MORUS_BLOCK_WORDS; i++)
state->s[4].words[i] ^= state->s[0].words[i];
-- 
2.19.0



[PATCH 0/2] crypto - fix aegis/morus for big endian systems

2018-09-30 Thread Ard Biesheuvel
Some bug fixes for issues that I stumbled upon while working on other
stuff.

Ard Biesheuvel (2):
  crypto: morus/generic - fix for big endian systems
  crypto: aegis/generic - fix for big endian systems

 crypto/aegis.h | 23 +---
 crypto/morus1280.c |  7 ++
 crypto/morus640.c  | 16 --
 3 files changed, 16 insertions(+), 30 deletions(-)

-- 
2.19.0



Re: [PATCH] crypto: qat - move temp buffers off the stack

2018-09-26 Thread Ard Biesheuvel
On Wed, 26 Sep 2018 at 11:53, Ard Biesheuvel  wrote:
>
> Arnd reports that with Kees's latest VLA patches applied, the HMAC
> handling in the QAT driver uses a worst case estimate of 160 bytes
> for the SHA blocksize, allowing the compiler to determine the size
> of the stack frame at runtime and throw a warning:
>

s/runtime/compile time/

>   drivers/crypto/qat/qat_common/qat_algs.c: In function 
> 'qat_alg_do_precomputes':
>   drivers/crypto/qat/qat_common/qat_algs.c:257:1: error: the frame size
>   of 1112 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
>
> Given that this worst case estimate is only 32 bytes larger than the
> actual block size of SHA-512, the use of a VLA here was hiding the
> excessive size of the stack frame from the compiler, and so we should
> try to move these buffers off the stack.
>
> So move the ipad/opad buffers and the various SHA state descriptors
> into the tfm context struct. Since qat_alg_do_precomputes() is only
> called in the context of a setkey() operation, this should be safe.
> Using SHA512_BLOCK_SIZE for the size of the ipad/opad buffers allows
> them to be used by SHA-1/SHA-256 as well.
>
> Reported-by: Arnd Bergmann 
> Signed-off-by: Ard Biesheuvel 
> ---
> This applies against v4.19-rc while Arnd's report was about -next. However,
> since Kees's VLA change results in a warning about a pre-existing condition,
> we may decide to apply it as a fix, and handle the conflict with Kees's
> patch in cryptodev. Otherwise, I can respin it to apply onto cryptodev
> directly. The patch was build tested only - I don't have the hardware.
>
> Thoughts anyone?
>
>  drivers/crypto/qat/qat_common/qat_algs.c | 60 ++--
>  1 file changed, 31 insertions(+), 29 deletions(-)
>
> diff --git a/drivers/crypto/qat/qat_common/qat_algs.c 
> b/drivers/crypto/qat/qat_common/qat_algs.c
> index 1138e41d6805..d2698299896f 100644
> --- a/drivers/crypto/qat/qat_common/qat_algs.c
> +++ b/drivers/crypto/qat/qat_common/qat_algs.c
> @@ -113,6 +113,13 @@ struct qat_alg_aead_ctx {
> struct crypto_shash *hash_tfm;
> enum icp_qat_hw_auth_algo qat_hash_alg;
> struct qat_crypto_instance *inst;
> +   union {
> +   struct sha1_state sha1;
> +   struct sha256_state sha256;
> +   struct sha512_state sha512;
> +   };
> +   char ipad[SHA512_BLOCK_SIZE]; /* sufficient for SHA-1/SHA-256 as well 
> */
> +   char opad[SHA512_BLOCK_SIZE];
>  };
>
>  struct qat_alg_ablkcipher_ctx {
> @@ -148,37 +155,32 @@ static int qat_alg_do_precomputes(struct 
> icp_qat_hw_auth_algo_blk *hash,
>   unsigned int auth_keylen)
>  {
> SHASH_DESC_ON_STACK(shash, ctx->hash_tfm);
> -   struct sha1_state sha1;
> -   struct sha256_state sha256;
> -   struct sha512_state sha512;
> int block_size = crypto_shash_blocksize(ctx->hash_tfm);
> int digest_size = crypto_shash_digestsize(ctx->hash_tfm);
> -   char ipad[block_size];
> -   char opad[block_size];
> __be32 *hash_state_out;
> __be64 *hash512_state_out;
> int i, offset;
>
> -   memset(ipad, 0, block_size);
> -   memset(opad, 0, block_size);
> +   memset(ctx->ipad, 0, block_size);
> +   memset(ctx->opad, 0, block_size);
> shash->tfm = ctx->hash_tfm;
> shash->flags = 0x0;
>
> if (auth_keylen > block_size) {
> int ret = crypto_shash_digest(shash, auth_key,
> - auth_keylen, ipad);
> + auth_keylen, ctx->ipad);
> if (ret)
> return ret;
>
> -   memcpy(opad, ipad, digest_size);
> +   memcpy(ctx->opad, ctx->ipad, digest_size);
> } else {
> -   memcpy(ipad, auth_key, auth_keylen);
> -   memcpy(opad, auth_key, auth_keylen);
> +   memcpy(ctx->ipad, auth_key, auth_keylen);
> +   memcpy(ctx->opad, auth_key, auth_keylen);
> }
>
> for (i = 0; i < block_size; i++) {
> -   char *ipad_ptr = ipad + i;
> -   char *opad_ptr = opad + i;
> +   char *ipad_ptr = ctx->ipad + i;
> +   char *opad_ptr = ctx->opad + i;
> *ipad_ptr ^= HMAC_IPAD_VALUE;
> *opad_ptr ^= HMAC_OPAD_VALUE;
> }
> @@ -186,7 +188,7 @@ static int qat_alg_do_precomputes(struct 
> icp_qat_hw_auth_algo_blk *hash,
> if (crypto_shash_init(shash))
> return -EFAULT;
>
> -   if (crypt

[RFC PATCH] crypto: x86/aes-ni - remove special handling of AES in PCBC mode

2018-09-24 Thread Ard Biesheuvel
For historical reasons, the AES-NI based implementation of the PCBC
chaining mode uses a special FPU chaining mode wrapper template to
amortize the FPU start/stop overhead over multiple blocks.

When this FPU wrapper was introduced, it supported widely used
chaining modes such as XTS and CTR (as well as LRW), but currently,
PCBC is the only remaining user.

Since there are no known users of pcbc(aes) in the kernel, let's remove
this special driver, and rely on the generic pcbc driver to encapsulate
the AES-NI core cipher.
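
For reference, pcbc(aes) remains available after this change; the crypto
API simply composes it from the generic pcbc template and the AES cipher.
A minimal sketch of a kernel-side user, assuming the usual skcipher API
and with error handling trimmed:

#include <linux/err.h>
#include <crypto/skcipher.h>

int toy_use_pcbc_aes(void)
{
	struct crypto_skcipher *tfm;

	/* resolved via the generic pcbc template around the AES cipher */
	tfm = crypto_alloc_skcipher("pcbc(aes)", 0, 0);
	if (IS_ERR(tfm))
		return PTR_ERR(tfm);

	/* ... crypto_skcipher_setkey() and requests as usual ... */

	crypto_free_skcipher(tfm);
	return 0;
}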

Signed-off-by: Ard Biesheuvel 
---
 arch/x86/crypto/Makefile   |   2 +-
 arch/x86/crypto/aesni-intel_glue.c |  32 ---
 arch/x86/crypto/fpu.c  | 207 
 crypto/Kconfig |   2 +-
 4 files changed, 2 insertions(+), 241 deletions(-)

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index a450ad573dcb..42d22005764c 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -106,7 +106,7 @@ ifeq ($(avx2_supported),yes)
morus1280-avx2-y := morus1280-avx2-asm.o morus1280-avx2-glue.o
 endif
 
-aesni-intel-y := aesni-intel_asm.o aesni-intel_glue.o fpu.o
+aesni-intel-y := aesni-intel_asm.o aesni-intel_glue.o
 aesni-intel-$(CONFIG_64BIT) += aesni-intel_avx-x86_64.o aes_ctrby8_avx-x86_64.o
 ghash-clmulni-intel-y := ghash-clmulni-intel_asm.o ghash-clmulni-intel_glue.o
 sha1-ssse3-y := sha1_ssse3_asm.o sha1_ssse3_glue.o
diff --git a/arch/x86/crypto/aesni-intel_glue.c 
b/arch/x86/crypto/aesni-intel_glue.c
index acbe7e8336d8..d90770c43b40 100644
--- a/arch/x86/crypto/aesni-intel_glue.c
+++ b/arch/x86/crypto/aesni-intel_glue.c
@@ -1253,22 +1253,6 @@ static struct skcipher_alg aesni_skciphers[] = {
 static
 struct simd_skcipher_alg *aesni_simd_skciphers[ARRAY_SIZE(aesni_skciphers)];
 
-static struct {
-   const char *algname;
-   const char *drvname;
-   const char *basename;
-   struct simd_skcipher_alg *simd;
-} aesni_simd_skciphers2[] = {
-#if (defined(MODULE) && IS_ENABLED(CONFIG_CRYPTO_PCBC)) || \
-IS_BUILTIN(CONFIG_CRYPTO_PCBC)
-   {
-   .algname= "pcbc(aes)",
-   .drvname= "pcbc-aes-aesni",
-   .basename   = "fpu(pcbc(__aes-aesni))",
-   },
-#endif
-};
-
 #ifdef CONFIG_X86_64
 static int generic_gcmaes_set_key(struct crypto_aead *aead, const u8 *key,
  unsigned int key_len)
@@ -1422,10 +1406,6 @@ static void aesni_free_simds(void)
for (i = 0; i < ARRAY_SIZE(aesni_simd_skciphers) &&
aesni_simd_skciphers[i]; i++)
simd_skcipher_free(aesni_simd_skciphers[i]);
-
-   for (i = 0; i < ARRAY_SIZE(aesni_simd_skciphers2); i++)
-   if (aesni_simd_skciphers2[i].simd)
-   simd_skcipher_free(aesni_simd_skciphers2[i].simd);
 }
 
 static int __init aesni_init(void)
@@ -1499,18 +1479,6 @@ static int __init aesni_init(void)
aesni_simd_skciphers[i] = simd;
}
 
-   for (i = 0; i < ARRAY_SIZE(aesni_simd_skciphers2); i++) {
-   algname = aesni_simd_skciphers2[i].algname;
-   drvname = aesni_simd_skciphers2[i].drvname;
-   basename = aesni_simd_skciphers2[i].basename;
-   simd = simd_skcipher_create_compat(algname, drvname, basename);
-   err = PTR_ERR(simd);
-   if (IS_ERR(simd))
-   continue;
-
-   aesni_simd_skciphers2[i].simd = simd;
-   }
-
return 0;
 
 unregister_simds:
diff --git a/arch/x86/crypto/fpu.c b/arch/x86/crypto/fpu.c
deleted file mode 100644
index 406680476c52..
--- a/arch/x86/crypto/fpu.c
+++ /dev/null
@@ -1,207 +0,0 @@
-/*
- * FPU: Wrapper for blkcipher touching fpu
- *
- * Copyright (c) Intel Corp.
- *   Author: Huang Ying 
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms of the GNU General Public License as published by the Free
- * Software Foundation; either version 2 of the License, or (at your option)
- * any later version.
- *
- */
-
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-struct crypto_fpu_ctx {
-   struct crypto_skcipher *child;
-};
-
-static int crypto_fpu_setkey(struct crypto_skcipher *parent, const u8 *key,
-unsigned int keylen)
-{
-   struct crypto_fpu_ctx *ctx = crypto_skcipher_ctx(parent);
-   struct crypto_skcipher *child = ctx->child;
-   int err;
-
-   crypto_skcipher_clear_flags(child, CRYPTO_TFM_REQ_MASK);
-   crypto_skcipher_set_flags(child, crypto_skcipher_get_flags(parent) &
-CRYPTO_TFM_REQ_MASK);
-   err = crypto_skcipher_setkey(child, key, keylen);
-   crypto_skcipher_set_flags(parent, crypto_skcipher_get_flags(child) &
- CRYPTO_TFM_RES_MAS

Re: [PATCH 0/4] crypto: arm64/aes-blk - cleanups and optimizations for XTS/CTS-CBC

2018-09-20 Thread Ard Biesheuvel
On 10 September 2018 at 07:41, Ard Biesheuvel  wrote:
> Some cleanups and optimizations for the arm64 AES skcipher routines.
>
> Patch #1 fixes the peculiar use of u8 arrays to refer to AES round keys,
> which are natively arrays of u32.
>
> Patch #2 partially reverts the use of NEON yield calls, which is not
> needed for skciphers.
>
> Patch #3 adds support for cts(cbc(aes)) in the NEON chaining mode handling.
>
> Patch #4 tweaks the XTS handling to remove a literal load from the inner
> loop.
>
> Cc: Eric Biggers 
> Cc: Theodore Ts'o 
> Cc: Steve Capper 
>
> Ard Biesheuvel (4):
>   crypto: arm64/aes-blk - remove pointless (u8 *) casts
>   crypto: arm64/aes-blk - revert NEON yield for skciphers
>   crypto: arm64/aes-blk - add support for CTS-CBC mode
>   crypto: aes/arm64-blk - improve XTS mask handling
>
>  arch/arm64/crypto/aes-ce.S|   5 +
>  arch/arm64/crypto/aes-glue.c  | 212 +--
>  arch/arm64/crypto/aes-modes.S | 400 ++--
>  arch/arm64/crypto/aes-neon.S  |   6 +
>  4 files changed, 406 insertions(+), 217 deletions(-)
>

Eric, any thoughts on this?


Re: [PATCH 1/5] crypto: arm/aes-ce - enable module autoloading based on CPU feature bits

2018-09-13 Thread Ard Biesheuvel
On 13 September 2018 at 08:24, Stefan Agner  wrote:
> On 10.09.2018 00:01, Ard Biesheuvel wrote:
>> On 10 September 2018 at 08:21, Stefan Agner  wrote:
>>> Hi Ard,
>>>
>>> On 21.05.2017 03:23, Ard Biesheuvel wrote:
>>>> Make the module autoloadable by tying it to the CPU feature bit that
>>>> describes whether the optional instructions it relies on are implemented
>>>> by the current CPU.
>>>>
>>>
>>> This leads to a compiler warning when compiling multi_v7_defconfig/ARM32
>>> using Clang 6.0.1:
>>>
>>> arch/arm/crypto/aes-ce-glue.c:450:1: warning: variable
>>> 'cpu_feature_match_AES' is not needed and will not
>>>   be emitted [-Wunneeded-internal-declaration]
>>> module_cpu_feature_match(AES, aes_init);
>>>
>>> ./include/linux/cpufeature.h:48:33: note: expanded from macro
>>> 'module_cpu_feature_match'
>>> static struct cpu_feature const cpu_feature_match_ ## x[] = \
>>>
>>> :83:1: note: expanded from here
>>> cpu_feature_match_AES
>>> ^
>>> 1 warning generated.
>>>
>>> Do you happen to have an idea how to alleviate?
>>>
>>
>> I guess this only happens for modules that are selected as builtin,
>> and so MODULE_DEVICE_TABLE() resolves to nothing?
>> Does this only occur for CPU features?
>
> So in the above case CONFIG_ARM_CRYPTO=y, CONFIG_CRYPTO_AES_ARM_CE=m...
>
> Right now I only saw it with CPU features... I remember seen similar issues, 
> which got resolved. Digging in the git history I found 1f318a8bafcf 
> ("modules: mark __inittest/__exittest as __maybe_unused"),
>
> This seems to resolve it:
>
> --- a/include/linux/cpufeature.h
> +++ b/include/linux/cpufeature.h
> @@ -45,7 +45,7 @@
>   * 'asm/cpufeature.h' of your favorite architecture.
>   */
>  #define module_cpu_feature_match(x, __initfunc)\
> -static struct cpu_feature const cpu_feature_match_ ## x[] =\
> +static struct cpu_feature const __maybe_unused cpu_feature_match_ ## x[] = \
> { { .feature = cpu_feature(x) }, { } }; \
>  MODULE_DEVICE_TABLE(cpu, cpu_feature_match_ ## x); \
> \
>
> Also arch/arm/crypto/crc32-ce-glue.c needs an extra __maybe_unused.
>

Yes, that looks like the right approach to me. The difference between
other uses of MODULE_DEVICE_TABLE() is that the second argument
usually gets referenced in some way in the driver struct. In the CPU
feature case, that does not happen, and so the struct ends up being
unreferenced when building the code into the kernel.
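
A hedged sketch of the contrast, using a made-up platform driver (not
real code): an of_device_id table is also referenced from the driver
struct, so it never becomes unused even when MODULE_DEVICE_TABLE()
expands to nothing in a built-in configuration, whereas
cpu_feature_match_AES has no such second reference.

#include <linux/module.h>
#include <linux/of.h>
#include <linux/platform_device.h>

static const struct of_device_id toy_of_match[] = {
	{ .compatible = "vendor,toy" },
	{ }
};
MODULE_DEVICE_TABLE(of, toy_of_match);

static struct platform_driver toy_driver = {
	.driver = {
		.name		= "toy",
		.of_match_table	= toy_of_match,	/* the extra reference */
	},
};
module_platform_driver(toy_driver);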


Re: [PATCH] crypto: tcrypt - fix ghash-generic speed test

2018-09-12 Thread Ard Biesheuvel
On 12 September 2018 at 15:20, Horia Geantă  wrote:
> ghash is a keyed hash algorithm, thus setkey needs to be called.
> Otherwise the following error occurs:
> $ modprobe tcrypt mode=318 sec=1
> testing speed of async ghash-generic (ghash-generic)
> tcrypt: test  0 (   16 byte blocks,   16 bytes per update,   1 updates):
> tcrypt: hashing failed ret=-126
>
> Cc:  # 4.6+
> Fixes: 0660511c0bee ("crypto: tcrypt - Use ahash")
> Tested-by: Franck Lenormand 
> Signed-off-by: Horia Geantă 

Acked-by: Ard Biesheuvel 

> ---
>  crypto/tcrypt.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/crypto/tcrypt.c b/crypto/tcrypt.c
> index bdde95e8d369..6e0a054bb61d 100644
> --- a/crypto/tcrypt.c
> +++ b/crypto/tcrypt.c
> @@ -1103,6 +1103,9 @@ static void test_ahash_speed_common(const char *algo, 
> unsigned int secs,
> break;
> }
>
> +   if (speed[i].klen)
> +   crypto_ahash_setkey(tfm, tvmem[0], speed[i].klen);
> +
> pr_info("test%3u "
> "(%5u byte blocks,%5u bytes per update,%4u updates): 
> ",
> i, speed[i].blen, speed[i].plen, speed[i].blen / 
> speed[i].plen);
> --
> 2.16.2
>


[PATCH 1/4] crypto: arm64/aes-blk - remove pointless (u8 *) casts

2018-09-10 Thread Ard Biesheuvel
For some reason, the asmlinkage prototypes of the NEON routines take
u8[] arguments for the round key arrays, while the actual round keys
are arrays of u32, and so passing them into those routines requires
u8* casts at each occurrence. Fix that.

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/aes-glue.c | 47 ++--
 1 file changed, 23 insertions(+), 24 deletions(-)

diff --git a/arch/arm64/crypto/aes-glue.c b/arch/arm64/crypto/aes-glue.c
index adcb83eb683c..1c6934544c1f 100644
--- a/arch/arm64/crypto/aes-glue.c
+++ b/arch/arm64/crypto/aes-glue.c
@@ -63,24 +63,24 @@ MODULE_AUTHOR("Ard Biesheuvel ");
 MODULE_LICENSE("GPL v2");
 
 /* defined in aes-modes.S */
-asmlinkage void aes_ecb_encrypt(u8 out[], u8 const in[], u8 const rk[],
+asmlinkage void aes_ecb_encrypt(u8 out[], u8 const in[], u32 const rk[],
int rounds, int blocks);
-asmlinkage void aes_ecb_decrypt(u8 out[], u8 const in[], u8 const rk[],
+asmlinkage void aes_ecb_decrypt(u8 out[], u8 const in[], u32 const rk[],
int rounds, int blocks);
 
-asmlinkage void aes_cbc_encrypt(u8 out[], u8 const in[], u8 const rk[],
+asmlinkage void aes_cbc_encrypt(u8 out[], u8 const in[], u32 const rk[],
int rounds, int blocks, u8 iv[]);
-asmlinkage void aes_cbc_decrypt(u8 out[], u8 const in[], u8 const rk[],
+asmlinkage void aes_cbc_decrypt(u8 out[], u8 const in[], u32 const rk[],
int rounds, int blocks, u8 iv[]);
 
-asmlinkage void aes_ctr_encrypt(u8 out[], u8 const in[], u8 const rk[],
+asmlinkage void aes_ctr_encrypt(u8 out[], u8 const in[], u32 const rk[],
int rounds, int blocks, u8 ctr[]);
 
-asmlinkage void aes_xts_encrypt(u8 out[], u8 const in[], u8 const rk1[],
-   int rounds, int blocks, u8 const rk2[], u8 iv[],
+asmlinkage void aes_xts_encrypt(u8 out[], u8 const in[], u32 const rk1[],
+   int rounds, int blocks, u32 const rk2[], u8 iv[],
int first);
-asmlinkage void aes_xts_decrypt(u8 out[], u8 const in[], u8 const rk1[],
-   int rounds, int blocks, u8 const rk2[], u8 iv[],
+asmlinkage void aes_xts_decrypt(u8 out[], u8 const in[], u32 const rk1[],
+   int rounds, int blocks, u32 const rk2[], u8 iv[],
int first);
 
 asmlinkage void aes_mac_update(u8 const in[], u32 const rk[], int rounds,
@@ -142,7 +142,7 @@ static int ecb_encrypt(struct skcipher_request *req)
while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
kernel_neon_begin();
aes_ecb_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
-   (u8 *)ctx->key_enc, rounds, blocks);
+   ctx->key_enc, rounds, blocks);
kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
@@ -162,7 +162,7 @@ static int ecb_decrypt(struct skcipher_request *req)
while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
kernel_neon_begin();
aes_ecb_decrypt(walk.dst.virt.addr, walk.src.virt.addr,
-   (u8 *)ctx->key_dec, rounds, blocks);
+   ctx->key_dec, rounds, blocks);
kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
@@ -182,7 +182,7 @@ static int cbc_encrypt(struct skcipher_request *req)
while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
kernel_neon_begin();
aes_cbc_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
-   (u8 *)ctx->key_enc, rounds, blocks, walk.iv);
+   ctx->key_enc, rounds, blocks, walk.iv);
kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
@@ -202,7 +202,7 @@ static int cbc_decrypt(struct skcipher_request *req)
while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
kernel_neon_begin();
aes_cbc_decrypt(walk.dst.virt.addr, walk.src.virt.addr,
-   (u8 *)ctx->key_dec, rounds, blocks, walk.iv);
+   ctx->key_dec, rounds, blocks, walk.iv);
kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
@@ -222,7 +222,7 @@ static int ctr_encrypt(struct skcipher_request *req)
while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
kernel_neon_begin();
aes_ctr_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
-   (u8 *)ctx->key_enc, rounds, blocks, walk.iv);
+   ctx->key_enc, rounds, blocks, walk.iv);

[PATCH 2/4] crypto: arm64/aes-blk - revert NEON yield for skciphers

2018-09-10 Thread Ard Biesheuvel
The reasoning of commit f10dc56c64bb ("crypto: arm64 - revert NEON yield
for fast AEAD implementations") applies equally to skciphers: the walk
API already guarantees that the input size of each call into the NEON
code is bounded by the size of a page, and so there is no need for an
additional TIF_NEED_RESCHED flag check inside the inner loop. So revert
the skcipher changes to aes-modes.S (but retain the mac ones).

This partially reverts commit 0c8f838a52fe9fd82761861a934f16ef9896b4e5.
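
For reference, the glue-level pattern this relies on, paraphrased from
the ecb_encrypt() routine elsewhere in this series (a sketch, not the
exact driver code): each call into the NEON routine covers at most
walk.nbytes, which the skcipher walk bounds by a page, and the
kernel_neon_end() between iterations is already a preemption point.

#include <asm/neon.h>
#include <crypto/aes.h>
#include <crypto/internal/skcipher.h>

asmlinkage void aes_ecb_encrypt(u8 out[], u8 const in[], u32 const rk[],
				int rounds, int blocks);

static int toy_ecb_encrypt(struct skcipher_request *req)
{
	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
	struct crypto_aes_ctx *ctx = crypto_skcipher_ctx(tfm);
	int rounds = 6 + ctx->key_length / 4;
	struct skcipher_walk walk;
	unsigned int blocks;
	int err;

	err = skcipher_walk_virt(&walk, req, false);

	while ((blocks = walk.nbytes / AES_BLOCK_SIZE)) {
		kernel_neon_begin();
		/* input bounded by one page per iteration */
		aes_ecb_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
				ctx->key_enc, rounds, blocks);
		kernel_neon_end();
		err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
	}
	return err;
}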

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/aes-modes.S | 281 
 1 file changed, 108 insertions(+), 173 deletions(-)

diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
index 496c243de4ac..35632d11200f 100644
--- a/arch/arm64/crypto/aes-modes.S
+++ b/arch/arm64/crypto/aes-modes.S
@@ -14,12 +14,12 @@
.align  4
 
 aes_encrypt_block4x:
-   encrypt_block4x v0, v1, v2, v3, w22, x21, x8, w7
+   encrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
ret
 ENDPROC(aes_encrypt_block4x)
 
 aes_decrypt_block4x:
-   decrypt_block4x v0, v1, v2, v3, w22, x21, x8, w7
+   decrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
ret
 ENDPROC(aes_decrypt_block4x)
 
@@ -31,71 +31,57 @@ ENDPROC(aes_decrypt_block4x)
 */
 
 AES_ENTRY(aes_ecb_encrypt)
-   frame_push  5
+   stp x29, x30, [sp, #-16]!
+   mov x29, sp
 
-   mov x19, x0
-   mov x20, x1
-   mov x21, x2
-   mov x22, x3
-   mov x23, x4
-
-.Lecbencrestart:
-   enc_prepare w22, x21, x5
+   enc_prepare w3, x2, x5
 
 .LecbencloopNx:
-   subsw23, w23, #4
+   subsw4, w4, #4
bmi .Lecbenc1x
-   ld1 {v0.16b-v3.16b}, [x20], #64 /* get 4 pt blocks */
+   ld1 {v0.16b-v3.16b}, [x1], #64  /* get 4 pt blocks */
bl  aes_encrypt_block4x
-   st1 {v0.16b-v3.16b}, [x19], #64
-   cond_yield_neon .Lecbencrestart
+   st1 {v0.16b-v3.16b}, [x0], #64
b   .LecbencloopNx
 .Lecbenc1x:
-   addsw23, w23, #4
+   addsw4, w4, #4
beq .Lecbencout
 .Lecbencloop:
-   ld1 {v0.16b}, [x20], #16/* get next pt block */
-   encrypt_block   v0, w22, x21, x5, w6
-   st1 {v0.16b}, [x19], #16
-   subsw23, w23, #1
+   ld1 {v0.16b}, [x1], #16 /* get next pt block */
+   encrypt_block   v0, w3, x2, x5, w6
+   st1 {v0.16b}, [x0], #16
+   subsw4, w4, #1
bne .Lecbencloop
 .Lecbencout:
-   frame_pop
+   ldp x29, x30, [sp], #16
ret
 AES_ENDPROC(aes_ecb_encrypt)
 
 
 AES_ENTRY(aes_ecb_decrypt)
-   frame_push  5
+   stp x29, x30, [sp, #-16]!
+   mov x29, sp
 
-   mov x19, x0
-   mov x20, x1
-   mov x21, x2
-   mov x22, x3
-   mov x23, x4
-
-.Lecbdecrestart:
-   dec_prepare w22, x21, x5
+   dec_prepare w3, x2, x5
 
 .LecbdecloopNx:
-   subsw23, w23, #4
+   subsw4, w4, #4
bmi .Lecbdec1x
-   ld1 {v0.16b-v3.16b}, [x20], #64 /* get 4 ct blocks */
+   ld1 {v0.16b-v3.16b}, [x1], #64  /* get 4 ct blocks */
bl  aes_decrypt_block4x
-   st1 {v0.16b-v3.16b}, [x19], #64
-   cond_yield_neon .Lecbdecrestart
+   st1 {v0.16b-v3.16b}, [x0], #64
b   .LecbdecloopNx
 .Lecbdec1x:
-   addsw23, w23, #4
+   addsw4, w4, #4
beq .Lecbdecout
 .Lecbdecloop:
-   ld1 {v0.16b}, [x20], #16/* get next ct block */
-   decrypt_block   v0, w22, x21, x5, w6
-   st1 {v0.16b}, [x19], #16
-   subsw23, w23, #1
+   ld1 {v0.16b}, [x1], #16 /* get next ct block */
+   decrypt_block   v0, w3, x2, x5, w6
+   st1 {v0.16b}, [x0], #16
+   subsw4, w4, #1
bne .Lecbdecloop
 .Lecbdecout:
-   frame_pop
+   ldp x29, x30, [sp], #16
ret
 AES_ENDPROC(aes_ecb_decrypt)
 
@@ -108,100 +94,78 @@ AES_ENDPROC(aes_ecb_decrypt)
 */
 
 AES_ENTRY(aes_cbc_encrypt)
-   frame_push  6
-
-   mov x19, x0
-   mov x20, x1
-   mov x21, x2
-   mov x22, x3
-   mov x23, x4
-   mov x24, x5
-
-.Lcbcencrestart:
-   ld1 {v4.16b}, [x24] /* get iv */
-   enc_prepare w22, x21, x6
+   ld1 {v4.16b}, [x5] 

[PATCH 4/4] crypto: arm64/aes-blk - improve XTS mask handling

2018-09-10 Thread Ard Biesheuvel
The Crypto Extension instantiation of the aes-modes.S collection of
skciphers uses only 15 NEON registers for the round key array, whereas
the pure NEON flavor uses 16 NEON registers for the AES S-box.

This means we have a spare register available that we can use to hold
the XTS mask vector, removing the need to reload it at every iteration
of the inner loop.

Since the pure NEON version does not permit this optimization, tweak
the macros so we can factor out this functionality. Also, replace the
literal load with a short sequence to compose the mask vector.

On Cortex-A53, this results in a ~4% speedup.
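
For anyone not familiar with the XTS tweak schedule: next_tweak implements
a GF(2^128) multiplication by x, and the mask vector only supplies the
carry/reduction constants (1 and 0x87). A toy C model of the same
operation (illustrative only, not the kernel code):

#include <stdint.h>

/* t[0] is the low 64 bits, t[1] the high 64 bits of the tweak */
void next_tweak(uint64_t t[2])
{
	uint64_t carry = t[1] >> 63;		/* bit shifted out at the top */

	t[1] = (t[1] << 1) | (t[0] >> 63);
	t[0] = (t[0] << 1) ^ (carry ? 0x87 : 0);	/* reduction constant */
}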

Signed-off-by: Ard Biesheuvel 
---
Raw performance numbers after the patch.

 arch/arm64/crypto/aes-ce.S|  5 +++
 arch/arm64/crypto/aes-modes.S | 40 ++--
 arch/arm64/crypto/aes-neon.S  |  6 +++
 3 files changed, 32 insertions(+), 19 deletions(-)

diff --git a/arch/arm64/crypto/aes-ce.S b/arch/arm64/crypto/aes-ce.S
index 623e74ed1c67..143070510809 100644
--- a/arch/arm64/crypto/aes-ce.S
+++ b/arch/arm64/crypto/aes-ce.S
@@ -17,6 +17,11 @@
 
.arch   armv8-a+crypto
 
+   xtsmask .reqv16
+
+   .macro  xts_reload_mask, tmp
+   .endm
+
/* preload all round keys */
.macro  load_round_keys, rounds, rk
cmp \rounds, #12
diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
index 82931fba53d2..5c0fa7905d24 100644
--- a/arch/arm64/crypto/aes-modes.S
+++ b/arch/arm64/crypto/aes-modes.S
@@ -340,17 +340,19 @@ AES_ENDPROC(aes_ctr_encrypt)
 * int blocks, u8 const rk2[], u8 iv[], int first)
 */
 
-   .macro  next_tweak, out, in, const, tmp
+   .macro  next_tweak, out, in, tmp
sshr\tmp\().2d,  \in\().2d,   #63
-   and \tmp\().16b, \tmp\().16b, \const\().16b
+   and \tmp\().16b, \tmp\().16b, xtsmask.16b
add \out\().2d,  \in\().2d,   \in\().2d
ext \tmp\().16b, \tmp\().16b, \tmp\().16b, #8
eor \out\().16b, \out\().16b, \tmp\().16b
.endm
 
-.Lxts_mul_x:
-CPU_LE(.quad   1, 0x87 )
-CPU_BE(.quad   0x87, 1 )
+   .macro  xts_load_mask, tmp
+   movixtsmask.2s, #0x1
+   movi\tmp\().2s, #0x87
+   uzp1xtsmask.4s, xtsmask.4s, \tmp\().4s
+   .endm
 
 AES_ENTRY(aes_xts_encrypt)
stp x29, x30, [sp, #-16]!
@@ -362,24 +364,24 @@ AES_ENTRY(aes_xts_encrypt)
enc_prepare w3, x5, x8
encrypt_block   v4, w3, x5, x8, w7  /* first tweak */
enc_switch_key  w3, x2, x8
-   ldr q7, .Lxts_mul_x
+   xts_load_mask   v8
b   .LxtsencNx
 
 .Lxtsencnotfirst:
enc_prepare w3, x2, x8
 .LxtsencloopNx:
-   ldr q7, .Lxts_mul_x
-   next_tweak  v4, v4, v7, v8
+   xts_reload_mask v8
+   next_tweak  v4, v4, v8
 .LxtsencNx:
subsw4, w4, #4
bmi .Lxtsenc1x
ld1 {v0.16b-v3.16b}, [x1], #64  /* get 4 pt blocks */
-   next_tweak  v5, v4, v7, v8
+   next_tweak  v5, v4, v8
eor v0.16b, v0.16b, v4.16b
-   next_tweak  v6, v5, v7, v8
+   next_tweak  v6, v5, v8
eor v1.16b, v1.16b, v5.16b
eor v2.16b, v2.16b, v6.16b
-   next_tweak  v7, v6, v7, v8
+   next_tweak  v7, v6, v8
eor v3.16b, v3.16b, v7.16b
bl  aes_encrypt_block4x
eor v3.16b, v3.16b, v7.16b
@@ -401,7 +403,7 @@ AES_ENTRY(aes_xts_encrypt)
st1 {v0.16b}, [x0], #16
subsw4, w4, #1
beq .Lxtsencout
-   next_tweak  v4, v4, v7, v8
+   next_tweak  v4, v4, v8
b   .Lxtsencloop
 .Lxtsencout:
st1 {v4.16b}, [x6]
@@ -420,24 +422,24 @@ AES_ENTRY(aes_xts_decrypt)
enc_prepare w3, x5, x8
encrypt_block   v4, w3, x5, x8, w7  /* first tweak */
dec_prepare w3, x2, x8
-   ldr q7, .Lxts_mul_x
+   xts_load_mask   v8
b   .LxtsdecNx
 
 .Lxtsdecnotfirst:
dec_prepare w3, x2, x8
 .LxtsdecloopNx:
-   ldr q7, .Lxts_mul_x
-   next_tweak  v4, v4, v7, v8
+   xts_reload_mask v8
+   next_tweak  v4, v4, v8
 .LxtsdecNx:
subsw4, w4, #4
bmi .Lxtsdec1x
ld1 {v0.16b-v3.16b}, [x1], #64  /* get 4 ct blocks */
-   next_tweak  v5, v4, v7, v8
+   next_tweak  v5, v4, v8
eor v0.16b, v0.16b, v4.16b
-   next_tweak  v6, v5, v7, v8
+   next_tweak  v6, v5, v8
eor v1.16b, v1.16b, v5.16b
eor v2.16b, v2.16b

[PATCH 0/4] crypto: arm64/aes-blk - cleanups and optimizations for XTS/CTS-CBC

2018-09-10 Thread Ard Biesheuvel
Some cleanups and optimizations for the arm64 AES skcipher routines.

Patch #1 fixes the peculiar use of u8 arrays to refer to AES round keys,
which are natively arrays of u32.

Patch #2 partially reverts the use of NEON yield calls, which is not
needed for skciphers.

Patch #3 adds support for cts(cbc(aes)) in the NEON chaining mode handling.

Patch #4 tweaks the XTS handling to remove a literal load from the inner
loop.

Cc: Eric Biggers 
Cc: Theodore Ts'o 
Cc: Steve Capper 

Ard Biesheuvel (4):
  crypto: arm64/aes-blk - remove pointless (u8 *) casts
  crypto: arm64/aes-blk - revert NEON yield for skciphers
  crypto: arm64/aes-blk - add support for CTS-CBC mode
  crypto: aes/arm64-blk - improve XTS mask handling

 arch/arm64/crypto/aes-ce.S|   5 +
 arch/arm64/crypto/aes-glue.c  | 212 +--
 arch/arm64/crypto/aes-modes.S | 400 ++--
 arch/arm64/crypto/aes-neon.S  |   6 +
 4 files changed, 406 insertions(+), 217 deletions(-)

-- 
2.18.0



[PATCH 3/4] crypto: arm64/aes-blk - add support for CTS-CBC mode

2018-09-10 Thread Ard Biesheuvel
Currently, we rely on the generic CTS chaining mode wrapper to
instantiate the cts(cbc(aes)) skcipher. Due to the high performance
of the ARMv8 Crypto Extensions AES instructions (~1 cycle per byte),
any overhead in the chaining mode layers is amplified, and so it pays
off considerably to fold the CTS handling into the SIMD routines.

On Cortex-A53, this results in a ~50% speedup for smaller input sizes.
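
To make the block accounting concrete, here is a small standalone C toy
(userspace, illustrative only) that applies the same split the glue code
uses: everything except the last two (possibly partial) blocks goes down
the ordinary CBC path, and the remaining tail is handed to the CTS
routine in one go (a single-block request degenerates to plain CBC).

#include <stdio.h>

#define AES_BLOCK_SIZE	16
#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

int main(void)
{
	unsigned int lens[] = { 16, 31, 32, 47, 64, 100 };

	for (unsigned int i = 0; i < sizeof(lens) / sizeof(lens[0]); i++) {
		unsigned int cryptlen = lens[i];
		int cbc_blocks = DIV_ROUND_UP(cryptlen, AES_BLOCK_SIZE) - 2;

		if (cryptlen == AES_BLOCK_SIZE)
			cbc_blocks = 1;

		printf("cryptlen %3u: %d CBC block(s), %u byte CTS tail\n",
		       cryptlen, cbc_blocks > 0 ? cbc_blocks : 0,
		       cbc_blocks > 0 ? cryptlen - cbc_blocks * AES_BLOCK_SIZE
				      : cryptlen);
	}
	return 0;
}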

Signed-off-by: Ard Biesheuvel 
---
This patch supersedes '[RFC/RFT PATCH] crypto: arm64/aes-ce - add support
for CTS-CBC mode' sent out last Saturday.

Changes:
- keep subreq and scatterlist in request ctx structure
- optimize away second scatterwalk_ffwd() invocation when encrypting in-place
- keep permute table in .rodata section
- polish asm code (drop literal + offset reference, reorder insns)

Raw performance numbers after the patch.

 arch/arm64/crypto/aes-glue.c  | 165 
 arch/arm64/crypto/aes-modes.S |  79 +-
 2 files changed, 243 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/crypto/aes-glue.c b/arch/arm64/crypto/aes-glue.c
index 1c6934544c1f..26d2b0263ba6 100644
--- a/arch/arm64/crypto/aes-glue.c
+++ b/arch/arm64/crypto/aes-glue.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -31,6 +32,8 @@
 #define aes_ecb_decryptce_aes_ecb_decrypt
 #define aes_cbc_encryptce_aes_cbc_encrypt
 #define aes_cbc_decryptce_aes_cbc_decrypt
+#define aes_cbc_cts_encryptce_aes_cbc_cts_encrypt
+#define aes_cbc_cts_decryptce_aes_cbc_cts_decrypt
 #define aes_ctr_encryptce_aes_ctr_encrypt
 #define aes_xts_encryptce_aes_xts_encrypt
 #define aes_xts_decryptce_aes_xts_decrypt
@@ -45,6 +48,8 @@ MODULE_DESCRIPTION("AES-ECB/CBC/CTR/XTS using ARMv8 Crypto Extensions");
 #define aes_ecb_decryptneon_aes_ecb_decrypt
 #define aes_cbc_encryptneon_aes_cbc_encrypt
 #define aes_cbc_decryptneon_aes_cbc_decrypt
+#define aes_cbc_cts_encryptneon_aes_cbc_cts_encrypt
+#define aes_cbc_cts_decryptneon_aes_cbc_cts_decrypt
 #define aes_ctr_encryptneon_aes_ctr_encrypt
 #define aes_xts_encryptneon_aes_xts_encrypt
 #define aes_xts_decryptneon_aes_xts_decrypt
@@ -73,6 +78,11 @@ asmlinkage void aes_cbc_encrypt(u8 out[], u8 const in[], u32 const rk[],
 asmlinkage void aes_cbc_decrypt(u8 out[], u8 const in[], u32 const rk[],
int rounds, int blocks, u8 iv[]);
 
+asmlinkage void aes_cbc_cts_encrypt(u8 out[], u8 const in[], u32 const rk[],
+   int rounds, int bytes, u8 const iv[]);
+asmlinkage void aes_cbc_cts_decrypt(u8 out[], u8 const in[], u32 const rk[],
+   int rounds, int bytes, u8 const iv[]);
+
 asmlinkage void aes_ctr_encrypt(u8 out[], u8 const in[], u32 const rk[],
int rounds, int blocks, u8 ctr[]);
 
@@ -87,6 +97,12 @@ asmlinkage void aes_mac_update(u8 const in[], u32 const rk[], int rounds,
   int blocks, u8 dg[], int enc_before,
   int enc_after);
 
+struct cts_cbc_req_ctx {
+   struct scatterlist sg_src[2];
+   struct scatterlist sg_dst[2];
+   struct skcipher_request subreq;
+};
+
 struct crypto_aes_xts_ctx {
struct crypto_aes_ctx key1;
struct crypto_aes_ctx __aligned(8) key2;
@@ -209,6 +225,136 @@ static int cbc_decrypt(struct skcipher_request *req)
return err;
 }
 
+static int cts_cbc_init_tfm(struct crypto_skcipher *tfm)
+{
+   crypto_skcipher_set_reqsize(tfm, sizeof(struct cts_cbc_req_ctx));
+   return 0;
+}
+
+static int cts_cbc_encrypt(struct skcipher_request *req)
+{
+   struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+   struct crypto_aes_ctx *ctx = crypto_skcipher_ctx(tfm);
+   struct cts_cbc_req_ctx *rctx = skcipher_request_ctx(req);
+   int err, rounds = 6 + ctx->key_length / 4;
+   int cbc_blocks = DIV_ROUND_UP(req->cryptlen, AES_BLOCK_SIZE) - 2;
+   struct scatterlist *src = req->src, *dst = req->dst;
+   struct skcipher_walk walk;
+
+   skcipher_request_set_tfm(&rctx->subreq, tfm);
+
+   if (req->cryptlen == AES_BLOCK_SIZE)
+   cbc_blocks = 1;
+
+   if (cbc_blocks > 0) {
+   unsigned int blocks;
+
+   skcipher_request_set_crypt(&rctx->subreq, req->src, req->dst,
+  cbc_blocks * AES_BLOCK_SIZE,
+  req->iv);
+
+   err = skcipher_walk_virt(&walk, &rctx->subreq, false);
+
+   while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
+   kernel_neon_begin();
+   aes_cbc_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
+  

Re: [PATCH 1/5] crypto: arm/aes-ce - enable module autoloading based on CPU feature bits

2018-09-10 Thread Ard Biesheuvel
On 10 September 2018 at 08:21, Stefan Agner  wrote:
> Hi Ard,
>
> On 21.05.2017 03:23, Ard Biesheuvel wrote:
>> Make the module autoloadable by tying it to the CPU feature bit that
>> describes whether the optional instructions it relies on are implemented
>> by the current CPU.
>>
>
> This leads to a compiler warning when compiling multi_v7_defconfig/ARM32
> using Clang 6.0.1:
>
> arch/arm/crypto/aes-ce-glue.c:450:1: warning: variable
> 'cpu_feature_match_AES' is not needed and will not
>   be emitted [-Wunneeded-internal-declaration]
> module_cpu_feature_match(AES, aes_init);
>
> ./include/linux/cpufeature.h:48:33: note: expanded from macro
> 'module_cpu_feature_match'
> static struct cpu_feature const cpu_feature_match_ ## x[] = \
>
> :83:1: note: expanded from here
> cpu_feature_match_AES
> ^
> 1 warning generated.
>
> Do you happen to have an idea how to alleviate?
>

I guess this only happens for modules that are selected as builtin,
and so MODULE_DEVICE_TABLE() resolves to nothing?
Does this only occur for CPU features?


[RFC/RFT PATCH] crypto: arm64/aes-ce - add support for CTS-CBC mode

2018-09-08 Thread Ard Biesheuvel
Currently, we rely on the generic CTS chaining mode wrapper to
instantiate the cts(cbc(aes)) skcipher. Due to the high performance
of the ARMv8 Crypto Extensions AES instructions (~1 cycle per byte),
any overhead in the chaining mode layers is amplified, and so it pays
off considerably to fold the CTS handling into the core algorithm.

On Cortex-A53, this results in a ~50% speedup for smaller block sizes.

Signed-off-by: Ard Biesheuvel 
---
Raw performance numbers after the patch.

 arch/arm64/crypto/aes-glue.c  | 142 
 arch/arm64/crypto/aes-modes.S |  73 ++
 2 files changed, 215 insertions(+)

diff --git a/arch/arm64/crypto/aes-glue.c b/arch/arm64/crypto/aes-glue.c
index adcb83eb683c..0860feedbafe 100644
--- a/arch/arm64/crypto/aes-glue.c
+++ b/arch/arm64/crypto/aes-glue.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -31,6 +32,8 @@
 #define aes_ecb_decryptce_aes_ecb_decrypt
 #define aes_cbc_encryptce_aes_cbc_encrypt
 #define aes_cbc_decryptce_aes_cbc_decrypt
+#define aes_cbc_cts_encryptce_aes_cbc_cts_encrypt
+#define aes_cbc_cts_decryptce_aes_cbc_cts_decrypt
 #define aes_ctr_encryptce_aes_ctr_encrypt
 #define aes_xts_encryptce_aes_xts_encrypt
 #define aes_xts_decryptce_aes_xts_decrypt
@@ -45,6 +48,8 @@ MODULE_DESCRIPTION("AES-ECB/CBC/CTR/XTS using ARMv8 Crypto Extensions");
 #define aes_ecb_decryptneon_aes_ecb_decrypt
 #define aes_cbc_encryptneon_aes_cbc_encrypt
 #define aes_cbc_decryptneon_aes_cbc_decrypt
+#define aes_cbc_cts_encryptneon_aes_cbc_cts_encrypt
+#define aes_cbc_cts_decryptneon_aes_cbc_cts_decrypt
 #define aes_ctr_encryptneon_aes_ctr_encrypt
 #define aes_xts_encryptneon_aes_xts_encrypt
 #define aes_xts_decryptneon_aes_xts_decrypt
@@ -73,6 +78,11 @@ asmlinkage void aes_cbc_encrypt(u8 out[], u8 const in[], u8 const rk[],
 asmlinkage void aes_cbc_decrypt(u8 out[], u8 const in[], u8 const rk[],
int rounds, int blocks, u8 iv[]);
 
+asmlinkage void aes_cbc_cts_encrypt(u8 out[], u8 const in[], u8 const rk[],
+   int rounds, int bytes, u8 iv[]);
+asmlinkage void aes_cbc_cts_decrypt(u8 out[], u8 const in[], u8 const rk[],
+   int rounds, int bytes, u8 iv[]);
+
 asmlinkage void aes_ctr_encrypt(u8 out[], u8 const in[], u8 const rk[],
int rounds, int blocks, u8 ctr[]);
 
@@ -209,6 +219,120 @@ static int cbc_decrypt(struct skcipher_request *req)
return err;
 }
 
+static int cts_cbc_encrypt(struct skcipher_request *req)
+{
+   struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+   struct crypto_aes_ctx *ctx = crypto_skcipher_ctx(tfm);
+   int err, rounds = 6 + ctx->key_length / 4;
+   int cbc_blocks = DIV_ROUND_UP(req->cryptlen, AES_BLOCK_SIZE) - 2;
+   struct skcipher_request subreq = *req;
+   struct scatterlist sg_src[2], sg_dst[2];
+   struct scatterlist *src = req->src, *dst = req->dst;
+   struct skcipher_walk walk;
+   unsigned int blocks;
+
+   if (req->cryptlen == AES_BLOCK_SIZE)
+   cbc_blocks = 1;
+
+   if (cbc_blocks > 0) {
+   skcipher_request_set_crypt(&subreq, req->src, req->dst,
+  cbc_blocks * AES_BLOCK_SIZE,
+  req->iv);
+   err = skcipher_walk_virt(&walk, &subreq, false);
+
+   while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
+   kernel_neon_begin();
+   aes_cbc_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
+   (u8 *)ctx->key_enc, rounds, blocks,
+   walk.iv);
+   kernel_neon_end();
+   err = skcipher_walk_done(&walk,
+walk.nbytes % AES_BLOCK_SIZE);
+   }
+   if (err)
+   return err;
+
+   if (req->cryptlen == AES_BLOCK_SIZE)
+   return 0;
+
+   src = scatterwalk_ffwd(sg_src, req->src, subreq.cryptlen);
+   dst = scatterwalk_ffwd(sg_dst, req->dst, subreq.cryptlen);
+   }
+
+   /* handle ciphertext stealing */
+   skcipher_request_set_crypt(&subreq, src, dst,
+  req->cryptlen - cbc_blocks * AES_BLOCK_SIZE,
+  req->iv);
+
+   err = skcipher_walk_virt(&walk, &subreq, false);
+   if (err)
+   return err;
+
+   kernel_neon_begin();
+   aes_cbc_cts_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
+   (u8 *)ctx->key_enc, rounds, walk.nbytes, walk.iv);
+   

Re: [PATCH] fscrypt: remove CRYPTO_CTR dependency

2018-09-06 Thread Ard Biesheuvel
On 5 September 2018 at 21:24, Eric Biggers  wrote:
> From: Eric Biggers 
>
> fscrypt doesn't use the CTR mode of operation for anything, so there's
> no need to select CRYPTO_CTR.  It was added by commit 71dea01ea2ed
> ("ext4 crypto: require CONFIG_CRYPTO_CTR if ext4 encryption is
> enabled").  But, I've been unable to identify the arm64 crypto bug it
> was supposedly working around.
>
> I suspect the issue was seen only on some old Android device kernel
> (circa 3.10?).  So if the fix wasn't mistaken, the real bug is probably
> already fixed.  Or maybe it was actually a bug in a non-upstream crypto
> driver.
>
> So, remove the dependency.  If it turns out there's actually still a
> bug, we'll fix it properly.
>
> Signed-off-by: Eric Biggers 

Acked-by: Ard Biesheuvel 

This may be related to

11e3b725cfc2 crypto: arm64/aes-blk - honour iv_out requirement in CBC
and CTR modes

given that the commit in question mentions CTS. How it actually works
around the issue is unclear to me, though.




> ---
>  fs/crypto/Kconfig | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/fs/crypto/Kconfig b/fs/crypto/Kconfig
> index 02b7d91c92310..284b589b4774d 100644
> --- a/fs/crypto/Kconfig
> +++ b/fs/crypto/Kconfig
> @@ -6,7 +6,6 @@ config FS_ENCRYPTION
> select CRYPTO_ECB
> select CRYPTO_XTS
> select CRYPTO_CTS
> -   select CRYPTO_CTR
> select CRYPTO_SHA256
> select KEYS
> help
> --
> 2.19.0.rc2.392.g5ba43deb5a-goog
>


Re: [PATCH] crypto: arm/chacha20 - faster 8-bit rotations and other optimizations

2018-08-31 Thread Ard Biesheuvel
On 31 August 2018 at 17:56, Ard Biesheuvel  wrote:
> Hi Eric,
>
> On 31 August 2018 at 10:01, Eric Biggers  wrote:
>> From: Eric Biggers 
>>
>> Optimize ChaCha20 NEON performance by:
>>
>> - Implementing the 8-bit rotations using the 'vtbl.8' instruction.
>> - Streamlining the part that adds the original state and XORs the data.
>> - Making some other small tweaks.
>>
>> On ARM Cortex-A7, these optimizations improve ChaCha20 performance from
>> about 11.9 cycles per byte to 11.3.
>>
>> There is a tradeoff involved with the 'vtbl.8' rotation method since
>> there is at least one CPU where it's not fastest.  But it seems to be a
>> better default; see the added comment.
>>
>> Signed-off-by: Eric Biggers 
>> ---
>>  arch/arm/crypto/chacha20-neon-core.S | 289 ++-
>>  1 file changed, 147 insertions(+), 142 deletions(-)
>>
>> diff --git a/arch/arm/crypto/chacha20-neon-core.S 
>> b/arch/arm/crypto/chacha20-neon-core.S
>> index 3fecb2124c35a..d381cebaba31d 100644
>> --- a/arch/arm/crypto/chacha20-neon-core.S
>> +++ b/arch/arm/crypto/chacha20-neon-core.S
>> @@ -18,6 +18,33 @@
>>   * (at your option) any later version.
>>   */
>>
>> + /*
>> +  * NEON doesn't have a rotate instruction.  The alternatives are, more or 
>> less:
>> +  *
>> +  * (a)  vshl.u32 + vsri.u32   (needs temporary register)
>> +  * (b)  vshl.u32 + vshr.u32 + vorr(needs temporary register)
>> +  * (c)  vrev32.16 (16-bit rotations only)
>> +  * (d)  vtbl.8 + vtbl.8   (multiple of 8 bits rotations only,
>> +  * needs index vector)
>> +  *
>> +  * ChaCha20 has 16, 12, 8, and 7-bit rotations.  For the 12 and 7-bit
>> +  * rotations, the only choices are (a) and (b).  We use (a) since it takes
>> +  * two-thirds the cycles of (b) on both Cortex-A7 and Cortex-A53.
>> +  *
>> +  * For the 16-bit rotation, we use vrev32.16 since it's consistently 
>> fastest
>> +  * and doesn't need a temporary register.
>> +  *
>> +  * For the 8-bit rotation, we use vtbl.8 + vtbl.8.  On Cortex-A7, this 
>> sequence
>> +  * is twice as fast as (a), even when doing (a) on multiple registers
>> +  * simultaneously to eliminate the stall between vshl and vsri.  Also, it
>> +  * parallelizes better when temporary registers are scarce.
>> +  *
>> +  * A disadvantage is that on Cortex-A53, the vtbl sequence is the same 
>> speed as
>> +  * (a), so the need to load the rotation table actually makes the vtbl 
>> method
>> +  * slightly slower overall on that CPU.  Still, it seems to be a good
>> +  * compromise to get a significant speed boost on some CPUs.
>> +  */
>> +
>
> Thanks for sharing these results. I have been working on 32-bit ARM
> code under the assumption that the A53 pipeline more or less resembles
> the A7 one, but this is obviously not the case looking at your
> results. My contributions to arch/arm/crypto mainly involved Crypto
> Extensions code, which the A7 does not support in the first place, so
> it does not really matter, but I will keep this in mind going forward.
>
>>  #include 
>>
>> .text
>> @@ -46,6 +73,9 @@ ENTRY(chacha20_block_xor_neon)
>> vmovq10, q2
>> vmovq11, q3
>>
>> +   ldr ip, =.Lrol8_table
>> +   vld1.8  {d10}, [ip, :64]
>> +
>
> I usually try to avoid the =literal ldr notation, because it involves
> an additional load via the D-cache. Could you use a 64-bit literal
> instead of a byte array and use vldr instead? Or switch to adr? (and
> move the literal in range, I suppose)
>
>> mov r3, #10
>>
>>  .Ldoubleround:
>> @@ -63,9 +93,9 @@ ENTRY(chacha20_block_xor_neon)
>>
>> // x0 += x1, x3 = rotl32(x3 ^ x0, 8)
>> vadd.i32q0, q0, q1
>> -   veorq4, q3, q0
>> -   vshl.u32q3, q4, #8
>> -   vsri.u32q3, q4, #24
>> +   veorq3, q3, q0
>> +   vtbl.8  d6, {d6}, d10
>> +   vtbl.8  d7, {d7}, d10
>>
>> // x2 += x3, x1 = rotl32(x1 ^ x2, 7)
>> vadd.i32q2, q2, q3
>> @@ -94,9 +124,9 @@ ENTRY(chacha20_block_xor_neon)
>>
>> // x0 += x1, x3 = rotl32(x3 ^ x0, 8)
>> vadd.i32q0, q0, q1
>> -   veorq4, q3, q0
>> -   vshl.u32q3, q4, #8
>> -   vsri.u32q3, q4, #24
>>

Re: [PATCH] crypto: arm/chacha20 - faster 8-bit rotations and other optimizations

2018-08-31 Thread Ard Biesheuvel
Hi Eric,

On 31 August 2018 at 10:01, Eric Biggers  wrote:
> From: Eric Biggers 
>
> Optimize ChaCha20 NEON performance by:
>
> - Implementing the 8-bit rotations using the 'vtbl.8' instruction.
> - Streamlining the part that adds the original state and XORs the data.
> - Making some other small tweaks.
>
> On ARM Cortex-A7, these optimizations improve ChaCha20 performance from
> about 11.9 cycles per byte to 11.3.
>
> There is a tradeoff involved with the 'vtbl.8' rotation method since
> there is at least one CPU where it's not fastest.  But it seems to be a
> better default; see the added comment.
>
> Signed-off-by: Eric Biggers 
> ---
>  arch/arm/crypto/chacha20-neon-core.S | 289 ++-
>  1 file changed, 147 insertions(+), 142 deletions(-)
>
> diff --git a/arch/arm/crypto/chacha20-neon-core.S 
> b/arch/arm/crypto/chacha20-neon-core.S
> index 3fecb2124c35a..d381cebaba31d 100644
> --- a/arch/arm/crypto/chacha20-neon-core.S
> +++ b/arch/arm/crypto/chacha20-neon-core.S
> @@ -18,6 +18,33 @@
>   * (at your option) any later version.
>   */
>
> + /*
> +  * NEON doesn't have a rotate instruction.  The alternatives are, more or 
> less:
> +  *
> +  * (a)  vshl.u32 + vsri.u32   (needs temporary register)
> +  * (b)  vshl.u32 + vshr.u32 + vorr(needs temporary register)
> +  * (c)  vrev32.16 (16-bit rotations only)
> +  * (d)  vtbl.8 + vtbl.8   (multiple of 8 bits rotations only,
> +  * needs index vector)
> +  *
> +  * ChaCha20 has 16, 12, 8, and 7-bit rotations.  For the 12 and 7-bit
> +  * rotations, the only choices are (a) and (b).  We use (a) since it takes
> +  * two-thirds the cycles of (b) on both Cortex-A7 and Cortex-A53.
> +  *
> +  * For the 16-bit rotation, we use vrev32.16 since it's consistently fastest
> +  * and doesn't need a temporary register.
> +  *
> +  * For the 8-bit rotation, we use vtbl.8 + vtbl.8.  On Cortex-A7, this 
> sequence
> +  * is twice as fast as (a), even when doing (a) on multiple registers
> +  * simultaneously to eliminate the stall between vshl and vsri.  Also, it
> +  * parallelizes better when temporary registers are scarce.
> +  *
> +  * A disadvantage is that on Cortex-A53, the vtbl sequence is the same 
> speed as
> +  * (a), so the need to load the rotation table actually makes the vtbl 
> method
> +  * slightly slower overall on that CPU.  Still, it seems to be a good
> +  * compromise to get a significant speed boost on some CPUs.
> +  */
> +

Thanks for sharing these results. I have been working on 32-bit ARM
code under the assumption that the A53 pipeline more or less resembles
the A7 one, but this is obviously not the case looking at your
results. My contributions to arch/arm/crypto mainly involved Crypto
Extensions code, which the A7 does not support in the first place, so
it does not really matter, but I will keep this in mind going forward.

>  #include 
>
> .text
> @@ -46,6 +73,9 @@ ENTRY(chacha20_block_xor_neon)
> vmovq10, q2
> vmovq11, q3
>
> +   ldr ip, =.Lrol8_table
> +   vld1.8  {d10}, [ip, :64]
> +

I usually try to avoid the =literal ldr notation, because it involves
an additional load via the D-cache. Could you use a 64-bit literal
instead of a byte array and use vldr instead? Or switch to adr? (and
move the literal in range, I suppose)

> mov r3, #10
>
>  .Ldoubleround:
> @@ -63,9 +93,9 @@ ENTRY(chacha20_block_xor_neon)
>
> // x0 += x1, x3 = rotl32(x3 ^ x0, 8)
> vadd.i32q0, q0, q1
> -   veorq4, q3, q0
> -   vshl.u32q3, q4, #8
> -   vsri.u32q3, q4, #24
> +   veorq3, q3, q0
> +   vtbl.8  d6, {d6}, d10
> +   vtbl.8  d7, {d7}, d10
>
> // x2 += x3, x1 = rotl32(x1 ^ x2, 7)
> vadd.i32q2, q2, q3
> @@ -94,9 +124,9 @@ ENTRY(chacha20_block_xor_neon)
>
> // x0 += x1, x3 = rotl32(x3 ^ x0, 8)
> vadd.i32q0, q0, q1
> -   veorq4, q3, q0
> -   vshl.u32q3, q4, #8
> -   vsri.u32q3, q4, #24
> +   veorq3, q3, q0
> +   vtbl.8  d6, {d6}, d10
> +   vtbl.8  d7, {d7}, d10
>
> // x2 += x3, x1 = rotl32(x1 ^ x2, 7)
> vadd.i32q2, q2, q3
> @@ -143,11 +173,11 @@ ENDPROC(chacha20_block_xor_neon)
>
> .align  5
>  ENTRY(chacha20_4block_xor_neon)
> -   push{r4-r6, lr}
> -   mov ip, sp  // preserve the stack pointer
> -   sub r3, sp, #0x20   // allocate a 32 byte buffer
> -   bic r3, r3, #0x1f   // aligned to 32 bytes
> -   mov sp, r3
> +   push{r4}

The ARM EABI mandates 8 byte stack alignment, and if you take an
interrupt right at this point, you will enter the interrupt handler
with a misaligned 

Re: [PATCH v2] crypto: arm64/aes-modes - get rid of literal load of addend vector

2018-08-23 Thread Ard Biesheuvel
On 23 August 2018 at 21:04, Nick Desaulniers  wrote:
> On Thu, Aug 23, 2018 at 9:48 AM Ard Biesheuvel
>  wrote:
>>
>> Replace the literal load of the addend vector with a sequence that
>> performs each add individually. This sequence is only 2 instructions
>> longer than the original, and 2% faster on Cortex-A53.
>>
>> This is an improvement by itself, but also works around a Clang issue,
>> whose integrated assembler does not implement the GNU ARM asm syntax
>> completely, and does not support the =literal notation for FP registers
>> (more info at https://bugs.llvm.org/show_bug.cgi?id=38642)
>>
>> Cc: Nick Desaulniers 
>> Signed-off-by: Ard Biesheuvel 
>> ---
>> v2: replace convoluted code involving a SIMD add to increment four BE 
>> counters
>> at once with individual add/rev/mov instructions
>>
>>  arch/arm64/crypto/aes-modes.S | 16 +---
>>  1 file changed, 9 insertions(+), 7 deletions(-)
>>
>> diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
>> index 483a7130cf0e..496c243de4ac 100644
>> --- a/arch/arm64/crypto/aes-modes.S
>> +++ b/arch/arm64/crypto/aes-modes.S
>> @@ -232,17 +232,19 @@ AES_ENTRY(aes_ctr_encrypt)
>> bmi .Lctr1x
>> cmn w6, #4  /* 32 bit overflow? */
>> bcs .Lctr1x
>> -   ldr q8, =0x300020001/* addends 1,2,3[,0] 
>> */
>> -   dup v7.4s, w6
>> +   add w7, w6, #1
>> mov v0.16b, v4.16b
>> -   add v7.4s, v7.4s, v8.4s
>> +   add w8, w6, #2
>> mov v1.16b, v4.16b
>> -   rev32   v8.16b, v7.16b
>> +   add w9, w6, #3
>> mov v2.16b, v4.16b
>> +   rev w7, w7
>> mov v3.16b, v4.16b
>> -   mov v1.s[3], v8.s[0]
>> -   mov v2.s[3], v8.s[1]
>> -   mov v3.s[3], v8.s[2]
>> +   rev w8, w8
>> +   mov v1.s[3], w7
>> +   rev w9, w9
>> +   mov v2.s[3], w8
>> +   mov v3.s[3], w9
>
> Just curious about the order of movs and revs here, is this some kind
> of manual scheduling?
>

Yes. Interleaving ALU and SIMD instructions gives a speedup on some
cores, and doesn't hurt others. Beyond that, it's just about putting as
much space as possible between the write of a register and the
subsequent read.

> Regardless,
> Reviewed-by: Nick Desaulniers 
>

Thanks!

>> ld1 {v5.16b-v7.16b}, [x20], #48 /* get 3 input 
>> blocks */
>> bl  aes_encrypt_block4x
>> eor v0.16b, v5.16b, v0.16b
>> --
>> 2.18.0
>>
>
>
> --
> Thanks,
> ~Nick Desaulniers


[PATCH v2] crypto: arm64/aes-modes - get rid of literal load of addend vector

2018-08-23 Thread Ard Biesheuvel
Replace the literal load of the addend vector with a sequence that
performs each add individually. This sequence is only 2 instructions
longer than the original, and 2% faster on Cortex-A53.

This is an improvement by itself, but it also works around a Clang issue:
its integrated assembler does not implement the GNU ARM asm syntax
completely, and does not support the =literal notation for FP registers
(more info at https://bugs.llvm.org/show_bug.cgi?id=38642).
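
For readers less familiar with the CTR layout: the low 32 bits of the
big-endian block counter live in the last word of the counter block,
which is why each incremented value is byte-reversed before being
inserted. A toy C model of the no-overflow fast path (userspace sketch,
made-up names):

#include <stdint.h>
#include <arpa/inet.h>	/* htonl()/ntohl(), assuming a POSIX host */

void next_counters(const uint32_t ctr[4], uint32_t out[3][4])
{
	uint32_t base = ntohl(ctr[3]);	/* low 32 bits of the BE counter */

	for (int i = 0; i < 3; i++) {
		out[i][0] = ctr[0];
		out[i][1] = ctr[1];
		out[i][2] = ctr[2];
		out[i][3] = htonl(base + 1 + i);	/* add, then 'rev' */
	}
}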

Cc: Nick Desaulniers 
Signed-off-by: Ard Biesheuvel 
---
v2: replace convoluted code involving a SIMD add to increment four BE counters
at once with individual add/rev/mov instructions

 arch/arm64/crypto/aes-modes.S | 16 +---
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
index 483a7130cf0e..496c243de4ac 100644
--- a/arch/arm64/crypto/aes-modes.S
+++ b/arch/arm64/crypto/aes-modes.S
@@ -232,17 +232,19 @@ AES_ENTRY(aes_ctr_encrypt)
bmi .Lctr1x
cmn w6, #4  /* 32 bit overflow? */
bcs .Lctr1x
-   ldr q8, =0x300020001/* addends 1,2,3[,0] */
-   dup v7.4s, w6
+   add w7, w6, #1
mov v0.16b, v4.16b
-   add v7.4s, v7.4s, v8.4s
+   add w8, w6, #2
mov v1.16b, v4.16b
-   rev32   v8.16b, v7.16b
+   add w9, w6, #3
mov v2.16b, v4.16b
+   rev w7, w7
mov v3.16b, v4.16b
-   mov v1.s[3], v8.s[0]
-   mov v2.s[3], v8.s[1]
-   mov v3.s[3], v8.s[2]
+   rev w8, w8
+   mov v1.s[3], w7
+   rev w9, w9
+   mov v2.s[3], w8
+   mov v3.s[3], w9
ld1 {v5.16b-v7.16b}, [x20], #48 /* get 3 input blocks */
bl  aes_encrypt_block4x
eor v0.16b, v5.16b, v0.16b
-- 
2.18.0



[PATCH v2] crypto: arm/ghash-ce - implement support for 4-way aggregation

2018-08-23 Thread Ard Biesheuvel
Speed up the GHASH algorithm based on 64-bit polynomial multiplication
by adding support for 4-way aggregation. This improves throughput by
~85% on Cortex-A53, from 1.7 cycles per byte to 0.9 cycles per byte.

When combined with AES into GCM, throughput improves by ~25%, from
3.8 cycles per byte to 3.0 cycles per byte.
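
The aggregation itself is just Horner's rule turned inside out: with
precomputed powers of H, four input blocks can be folded using
independent multiplications and a single reduction. A hedged sketch with
hypothetical gf128_mul()/gf128_add() helpers (declared here for
illustration only; the real code uses vmull.p64, with the gf128mul
library computing the key powers):

#include <stdint.h>

typedef struct { uint64_t hi, lo; } be128_t;	/* illustrative 128-bit type */

be128_t gf128_mul(be128_t a, be128_t b);	/* hypothetical, not implemented */
static inline be128_t gf128_add(be128_t a, be128_t b)
{
	return (be128_t){ a.hi ^ b.hi, a.lo ^ b.lo };
}

/*
 * (((((X + C0)*H + C1)*H + C2)*H + C3)*H ==
 *  ((X + C0)*H^4) + (C1*H^3) + (C2*H^2) + (C3*H)
 */
be128_t ghash_4way(be128_t X, const be128_t C[4],
		   be128_t H, be128_t H2, be128_t H3, be128_t H4)
{
	be128_t acc = gf128_mul(gf128_add(X, C[0]), H4);

	acc = gf128_add(acc, gf128_mul(C[1], H3));
	acc = gf128_add(acc, gf128_mul(C[2], H2));
	acc = gf128_add(acc, gf128_mul(C[3], H));
	return acc;
}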

Signed-off-by: Ard Biesheuvel 
---
v2: modulo schedule the loads of the input
add AES/GCM performance numbers to commit log

 arch/arm/crypto/Kconfig |   1 +
 arch/arm/crypto/ghash-ce-core.S | 108 +++-
 arch/arm/crypto/ghash-ce-glue.c |  38 +--
 3 files changed, 131 insertions(+), 16 deletions(-)

diff --git a/arch/arm/crypto/Kconfig b/arch/arm/crypto/Kconfig
index 925d1364727a..07dd12efeea4 100644
--- a/arch/arm/crypto/Kconfig
+++ b/arch/arm/crypto/Kconfig
@@ -99,6 +99,7 @@ config CRYPTO_GHASH_ARM_CE
depends on KERNEL_MODE_NEON
select CRYPTO_HASH
select CRYPTO_CRYPTD
+   select CRYPTO_GF128MUL
help
  Use an implementation of GHASH (used by the GCM AEAD chaining mode)
  that uses the 64x64 to 128 bit polynomial multiplication (vmull.p64)
diff --git a/arch/arm/crypto/ghash-ce-core.S b/arch/arm/crypto/ghash-ce-core.S
index 2f78c10b1881..406009afa9cf 100644
--- a/arch/arm/crypto/ghash-ce-core.S
+++ b/arch/arm/crypto/ghash-ce-core.S
@@ -63,6 +63,33 @@
k48 .reqd31
SHASH2_p64  .reqd31
 
+   HH  .reqq10
+   HH3 .reqq11
+   HH4 .reqq12
+   HH34.reqq13
+
+   HH_L.reqd20
+   HH_H.reqd21
+   HH3_L   .reqd22
+   HH3_H   .reqd23
+   HH4_L   .reqd24
+   HH4_H   .reqd25
+   HH34_L  .reqd26
+   HH34_H  .reqd27
+   SHASH2_H.reqd29
+
+   XL2 .reqq5
+   XM2 .reqq6
+   XH2 .reqq7
+   T3  .reqq8
+
+   XL2_L   .reqd10
+   XL2_H   .reqd11
+   XM2_L   .reqd12
+   XM2_H   .reqd13
+   T3_L.reqd16
+   T3_H.reqd17
+
.text
.fpucrypto-neon-fp-armv8
 
@@ -175,12 +202,77 @@
beq 0f
vld1.64 {T1}, [ip]
teq r0, #0
-   b   1f
+   b   3f
+
+0: .ifc\pn, p64
+   tst r0, #3  // skip until #blocks is a
+   bne 2f  // round multiple of 4
+
+   vld1.8  {XL2-XM2}, [r2]!
+1: vld1.8  {T3-T2}, [r2]!
+   vrev64.8XL2, XL2
+   vrev64.8XM2, XM2
+
+   subsr0, r0, #4
+
+   vext.8  T1, XL2, XL2, #8
+   veorXL2_H, XL2_H, XL_L
+   veorXL, XL, T1
+
+   vrev64.8T3, T3
+   vrev64.8T1, T2
+
+   vmull.p64   XH, HH4_H, XL_H // a1 * b1
+   veorXL2_H, XL2_H, XL_H
+   vmull.p64   XL, HH4_L, XL_L // a0 * b0
+   vmull.p64   XM, HH34_H, XL2_H   // (a1 + a0)(b1 + b0)
+
+   vmull.p64   XH2, HH3_H, XM2_L   // a1 * b1
+   veorXM2_L, XM2_L, XM2_H
+   vmull.p64   XL2, HH3_L, XM2_H   // a0 * b0
+   vmull.p64   XM2, HH34_L, XM2_L  // (a1 + a0)(b1 + b0)
+
+   veorXH, XH, XH2
+   veorXL, XL, XL2
+   veorXM, XM, XM2
+
+   vmull.p64   XH2, HH_H, T3_L // a1 * b1
+   veorT3_L, T3_L, T3_H
+   vmull.p64   XL2, HH_L, T3_H // a0 * b0
+   vmull.p64   XM2, SHASH2_H, T3_L // (a1 + a0)(b1 + b0)
+
+   veorXH, XH, XH2
+   veorXL, XL, XL2
+   veorXM, XM, XM2
+
+   vmull.p64   XH2, SHASH_H, T1_L  // a1 * b1
+   veorT1_L, T1_L, T1_H
+   vmull.p64   XL2, SHASH_L, T1_H  // a0 * b0
+   vmull.p64   XM2, SHASH2_p64, T1_L   // (a1 + a0)(b1 + b0)
+
+   veorXH, XH, XH2
+   veorXL, XL, XL2
+   veorXM, XM, XM2
 
-0: vld1.64 {T1}, [r2]!
+   beq 4f
+
+   vld1.8  {XL2-XM2}, [r2]!
+
+   veorT1, XL, XH
+   veorXM, XM, T1
+
+   __pmull_reduce_p64
+
+   veorT1, T1, XH
+   veorXL, XL, T1
+
+   b   1b
+   .endif
+
+2: vld1.64 {T1}, [r2]!
subsr0, r0, #1
 
-1: /* multiply XL by SHASH in GF(2^128) */
+3: /* multiply XL by SHASH in GF(2^128) */
 #ifndef CONFIG_CPU_BIG_ENDIAN
vrev64.8T1, T1
 #endif
@@ -193,7 +285,7

[PATCH] crypto: arm/ghash-ce - implement support for 4-way aggregation

2018-08-22 Thread Ard Biesheuvel
Speed up the GHASH algorithm based on 64-bit polynomial multiplication
by adding support for 4-way aggregation. This improves throughput by
~60% on Cortex-A53, from 1.70 cycles per byte to 1.05 cycles per byte.

Signed-off-by: Ard Biesheuvel 
---
 arch/arm/crypto/Kconfig |   1 +
 arch/arm/crypto/ghash-ce-core.S | 101 ++--
 arch/arm/crypto/ghash-ce-glue.c |  38 
 3 files changed, 124 insertions(+), 16 deletions(-)

diff --git a/arch/arm/crypto/Kconfig b/arch/arm/crypto/Kconfig
index 925d1364727a..07dd12efeea4 100644
--- a/arch/arm/crypto/Kconfig
+++ b/arch/arm/crypto/Kconfig
@@ -99,6 +99,7 @@ config CRYPTO_GHASH_ARM_CE
depends on KERNEL_MODE_NEON
select CRYPTO_HASH
select CRYPTO_CRYPTD
+   select CRYPTO_GF128MUL
help
  Use an implementation of GHASH (used by the GCM AEAD chaining mode)
  that uses the 64x64 to 128 bit polynomial multiplication (vmull.p64)
diff --git a/arch/arm/crypto/ghash-ce-core.S b/arch/arm/crypto/ghash-ce-core.S
index 2f78c10b1881..c982c63877a6 100644
--- a/arch/arm/crypto/ghash-ce-core.S
+++ b/arch/arm/crypto/ghash-ce-core.S
@@ -63,6 +63,27 @@
k48 .reqd31
SHASH2_p64  .reqd31
 
+   HH  .reqq10
+   HH3 .reqq11
+   HH4 .reqq12
+   HH34.reqq13
+
+   HH_L.reqd20
+   HH_H.reqd21
+   HH3_L   .reqd22
+   HH3_H   .reqd23
+   HH4_L   .reqd24
+   HH4_H   .reqd25
+   HH34_L  .reqd26
+   HH34_H  .reqd27
+   SHASH2_H.reqd29
+
+   XL2 .reqq5
+   XM2 .reqq6
+   XH2 .reqq7
+   XL3 .reqq8
+   XM3 .reqq9
+
.text
.fpucrypto-neon-fp-armv8
 
@@ -175,12 +196,76 @@
beq 0f
vld1.64 {T1}, [ip]
teq r0, #0
-   b   1f
+   b   3f
+
+0: .ifc\pn, p64
+   tst r0, #3  // skip until #blocks is a
+   bne 2f  // round multiple of 4
+
+1: vld1.8  {XL2-XM2}, [r2]!
+   vld1.8  {XL3}, [r2]!
+   vrev64.8T1, XL2
+
+   subsr0, r0, #4
+
+   vext.8  T2, T1, T1, #8
+   veorT1_H, T1_H, XL_L
+   veorXL, XL, T2
+
+   vmull.p64   XH, HH4_H, XL_H // a1 * b1
+   veorT1_H, T1_H, XL_H
+   vmull.p64   XL, HH4_L, XL_L // a0 * b0
+   vmull.p64   XM, HH34_H, T1_H// (a1 + a0)(b1 + b0)
+
+   vrev64.8T1, XM2
+
+   vmull.p64   XH2, HH3_H, T1_L// a1 * b1
+   veorT1_L, T1_L, T1_H
+   vmull.p64   XL2, HH3_L, T1_H// a0 * b0
+   vmull.p64   XM2, HH34_L, T1_L   // (a1 + a0)(b1 + b0)
+
+   vrev64.8T1, XL3
+
+   vmull.p64   XL3, HH_H, T1_L // a1 * b1
+   veorT1_L, T1_L, T1_H
+   veorXH2, XH2, XL3
+   vmull.p64   XL3, HH_L, T1_H // a0 * b0
+   vmull.p64   XM3, SHASH2_H, T1_L // (a1 + a0)(b1 + b0)
+
+   vld1.8  {T1}, [r2]!
+   veorXL2, XL2, XL3
+   vrev64.8T1, T1
+   veorXM2, XM2, XM3
+
+   vmull.p64   XL3, SHASH_H, T1_L  // a1 * b1
+   veorT1_L, T1_L, T1_H
+   veorXH2, XH2, XL3
+   vmull.p64   XL3, SHASH_L, T1_H  // a0 * b0
+   vmull.p64   XM3, SHASH2_p64, T1_L   // (a1 + a0)(b1 + b0)
 
-0: vld1.64 {T1}, [r2]!
+   veorXL2, XL2, XL3
+   veorXM2, XM2, XM3
+
+   veorXL, XL, XL2
+   veorXH, XH, XH2
+   veorXM, XM, XM2
+
+   veorT1, XL, XH
+   veorXM, XM, T1
+
+   __pmull_reduce_p64
+
+   veorT1, T1, XH
+   veorXL, XL, T1
+
+   beq 4f
+   b   1b
+   .endif
+
+2: vld1.64 {T1}, [r2]!
subsr0, r0, #1
 
-1: /* multiply XL by SHASH in GF(2^128) */
+3: /* multiply XL by SHASH in GF(2^128) */
 #ifndef CONFIG_CPU_BIG_ENDIAN
vrev64.8T1, T1
 #endif
@@ -203,7 +288,7 @@
 
bne 0b
 
-   vst1.64 {XL}, [r1]
+4: vst1.64 {XL}, [r1]
bx  lr
.endm
 
@@ -212,8 +297,14 @@
 * struct ghash_key const *k, const char *head)
 */
 ENTRY(pmull_ghash_update_p64)
-   vld1.64 {SHASH}, [r3]
+   vld1.64 {SHASH}, [r3

Re: [PATCH] crypto: arm64/aes-modes - get rid of literal load of addend vector

2018-08-21 Thread Ard Biesheuvel
On 21 August 2018 at 20:34, Nick Desaulniers  wrote:
> On Tue, Aug 21, 2018 at 11:19 AM Ard Biesheuvel
>  wrote:
>>
>> On 21 August 2018 at 20:04, Nick Desaulniers  wrote:
>> > On Tue, Aug 21, 2018 at 9:46 AM Ard Biesheuvel
>> >  wrote:
>> >>
>> >> Replace the literal load of the addend vector with a sequence that
>> >> composes it using immediates. While at it, tweak the code that refers
>> >> to it so it does not clobber the register, so we can take the load
>> >> out of the loop as well.
>> >>
>> >> This results in generally better code, but also works around a Clang
>> >> issue, whose integrated assembler does not implement the GNU ARM asm
>> >> syntax completely, and does not support the =literal notation for
>> >> FP registers.
>> >
>> > Would you mind linking to the issue tracker for:
>> > https://bugs.llvm.org/show_bug.cgi?id=38642
>> >
>> > And maybe the comment from the binutils source? (or arm32 reference
>> > manual you mentioned in https://lkml.org/lkml/2018/8/21/589)
>> > https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=blob;f=gas/testsuite/gas/aarch64/programmer-friendly.s;h=6254c6476efdc848648b05068be0574e7addc85d;hb=HEAD#l11
>> >
>> > They can help provide more context to future travelers.
>> >
>>
>> Sure, if it helps.
>
> Robin linked to the arm documentation and the gas documentation, maybe
> those would be better than the source level comment? Or simply the
> llvm bug since I've posted those links there, too?
>
>> To clarify, these are the consecutive values of each of the registers,
>> using 16-bit elements:
>>
>> v7 := { 1, 1, 1, 1, 0, 0, 0, 0 }
>> v8 := { 2, 2, 2, 2, 0, 0, 0, 0 }
>> v6 := { 3, 0, 3, 0, 3, 0, 3, 0 }
>> v8 := { 1, 2, 1, 2, 1, 2, 1, 2 }
>> v8 := { 1, 2, 3, 0, 1, 2, 3, 0 }
>> v8 := { 1, 0, 2, 0, 3, 0, 0, 0 }
>
> Beautiful, thank you for this.  Can this go in the patch as a comment/ascii 
> art?
>

Sure, although I realized the following works just as well, and is
also 6 instructions.

mov x0, #1
mov x1, #2
mov x2, #3
ins v8.s[0], w0
ins v8.s[1], w1
ins v8.d[1], x2

I generally try to stay away from the element accessors if I can, but
this is not on a hot path anyway, so there is no need for code that
requires comments to understand.
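
For clarity, here is the same sequence annotated with the contents of v8
after each ins (viewed as .4s lanes); the annotations are inferred from
the instruction semantics and are not part of the original mail:

mov x0, #1
mov x1, #2
mov x2, #3
ins v8.s[0], w0     // v8.4s = { 1, ?, ?, ? }
ins v8.s[1], w1     // v8.4s = { 1, 2, ?, ? }
ins v8.d[1], x2     // v8.4s = { 1, 2, 3, 0 }  (x2 fills d[1], so lane 3 is 0)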


> With that...
>
> Reviewed-by: Nick Desaulniers 

Thanks,


Re: [PATCH] crypto: arm64/aes-modes - get rid of literal load of addend vector

2018-08-21 Thread Ard Biesheuvel
On 21 August 2018 at 20:04, Nick Desaulniers  wrote:
> On Tue, Aug 21, 2018 at 9:46 AM Ard Biesheuvel
>  wrote:
>>
>> Replace the literal load of the addend vector with a sequence that
>> composes it using immediates. While at it, tweak the code that refers
>> to it so it does not clobber the register, so we can take the load
>> out of the loop as well.
>>
>> This results in generally better code, but also works around a Clang
>> issue, whose integrated assembler does not implement the GNU ARM asm
>> syntax completely, and does not support the =literal notation for
>> FP registers.
>
> Would you mind linking to the issue tracker for:
> https://bugs.llvm.org/show_bug.cgi?id=38642
>
> And maybe the comment from the binutils source? (or arm32 reference
> manual you mentioned in https://lkml.org/lkml/2018/8/21/589)
> https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=blob;f=gas/testsuite/gas/aarch64/programmer-friendly.s;h=6254c6476efdc848648b05068be0574e7addc85d;hb=HEAD#l11
>
> They can help provide more context to future travelers.
>

Sure, if it helps.

>>
>> Cc: Nick Desaulniers 
>> Signed-off-by: Ard Biesheuvel 
>> ---
>>  arch/arm64/crypto/aes-modes.S | 18 --
>>  1 file changed, 12 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
>> index 483a7130cf0e..e966620ee230 100644
>> --- a/arch/arm64/crypto/aes-modes.S
>> +++ b/arch/arm64/crypto/aes-modes.S
>> @@ -225,6 +225,14 @@ AES_ENTRY(aes_ctr_encrypt)
>> enc_prepare w22, x21, x6
>> ld1 {v4.16b}, [x24]
>>
>> +   /* compose addend vector { 1, 2, 3, 0 } in v8.4s */
>> +   moviv7.4h, #1
>> +   moviv8.4h, #2
>> +   uaddl   v6.4s, v7.4h, v8.4h
>
> Clever; how come you didn't `movi v6.4h, #3` or `saddl`?  Shorter
> encoding?  Or does it simplify the zips later?

Encodings are always 32 bits on AArch64, so that does not make a difference.

> Since the destination
> is larger, does this give us the 0?
>

Yes.

> The following zip1/zip2 block is a little tricky. Can you help me
> understand it better?
>

Please see below.

>> +   zip1v8.8h, v7.8h, v8.8h
>
> If zip1 takes the upper half, wouldn't that be zeros, since we moved
> small constants into them, above?
>
>> +   zip1v8.4s, v8.4s, v6.4s
>> +   zip2v8.8h, v8.8h, v7.8h
>
> From the docs [0], it looks like zip1/zip2 is used for interleaving
> two vectors, IIUC?  In our case we're interleaving three vectors; v6,
> v7, and v8 into v8?
>

No, the first register is only the destination register in this case,
not a source register.

> And we don't need a second zip2 because...?  Don't we need (or are
> more interested in) the bottom half of v6?
>

To clarify, these are the consecutive values of each of the registers,
using 16-bit elements:

v7 := { 1, 1, 1, 1, 0, 0, 0, 0 }
v8 := { 2, 2, 2, 2, 0, 0, 0, 0 }
v6 := { 3, 0, 3, 0, 3, 0, 3, 0 }
v8 := { 1, 2, 1, 2, 1, 2, 1, 2 }
v8 := { 1, 2, 3, 0, 1, 2, 3, 0 }
v8 := { 1, 0, 2, 0, 3, 0, 0, 0 }
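
Pairing each of those values with the instruction from the patch that
produces it (annotations inferred, not part of the original mail):

movi    v7.4h, #1               // v7 = { 1, 1, 1, 1, 0, 0, 0, 0 }
movi    v8.4h, #2               // v8 = { 2, 2, 2, 2, 0, 0, 0, 0 }
uaddl   v6.4s, v7.4h, v8.4h     // v6 = { 3, 0, 3, 0, 3, 0, 3, 0 }
zip1    v8.8h, v7.8h, v8.8h     // v8 = { 1, 2, 1, 2, 1, 2, 1, 2 }
zip1    v8.4s, v8.4s, v6.4s     // v8 = { 1, 2, 3, 0, 1, 2, 3, 0 }
zip2    v8.8h, v8.8h, v7.8h     // v8 = { 1, 0, 2, 0, 3, 0, 0, 0 }, i.e. v8.4s = { 1, 2, 3, 0 }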


> Note to self: Page 6 [1] seems like a useful doc on arrangement specifiers.
>
> To get { 1, 2, 3, 0 }, does ARM have something like iota/lane
> id+swizzle instructions, ie:
>
> iota -> { 0, 1, 2, 3 }
> swizzle -> { 1, 2, 3, 0 }
>

AArch64 has the ext instruction, which concatenates n trailing bytes
of one source register with 16-n leading bytes of another. This can be
used, e.g., to implement a byte sized rotate when using the same
register for both inputs.
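
A minimal sketch of that rotate use (not from the original mail): with
the same register supplied for both inputs,

    ext v0.16b, v1.16b, v1.16b, #3

sets byte i of v0 to byte (i + 3) mod 16 of v1, i.e. it rotates the 16
byte elements of v1 down by three positions.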


> [0] 
> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0802a/UADDL_advsimd_vector.html
> [1] 
> https://www.uio.no/studier/emner/matnat/ifi/INF5063/h17/timeplan/armv8-neon.pdf
>
>> +
>> umovx6, v4.d[1] /* keep swabbed ctr in reg */
>> rev x6, x6
>>  .LctrloopNx:
>> @@ -232,17 +240,16 @@ AES_ENTRY(aes_ctr_encrypt)
>> bmi .Lctr1x
>> cmn w6, #4  /* 32 bit overflow? */
>> bcs .Lctr1x
>> -   ldr q8, =0x30000000200000001/* addends 1,2,3[,0] */
>> dup v7.4s, w6
>> mov v0.16b, v4.16b
>> add v7.4s, v7.4s, v8.4s
>> mov v1.16b, v4.16b
>> -   rev32   v8.16b, v7.16b
>> +   rev32   v7.16b, v7.16b
>> mov v2.16b, v4.16b
>> mov v3.16b, v4.16b
>> -   mov v1.s

[PATCH] crypto: arm64/aes-modes - get rid of literal load of addend vector

2018-08-21 Thread Ard Biesheuvel
Replace the literal load of the addend vector with a sequence that
composes it using immediates. While at it, tweak the code that refers
to it so it does not clobber the register, so we can take the load
out of the loop as well.

This results in generally better code, but also works around a Clang
issue, whose integrated assembler does not implement the GNU ARM asm
syntax completely, and does not support the =literal notation for
FP registers.

Cc: Nick Desaulniers 
Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/aes-modes.S | 18 --
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
index 483a7130cf0e..e966620ee230 100644
--- a/arch/arm64/crypto/aes-modes.S
+++ b/arch/arm64/crypto/aes-modes.S
@@ -225,6 +225,14 @@ AES_ENTRY(aes_ctr_encrypt)
enc_prepare w22, x21, x6
ld1 {v4.16b}, [x24]
 
+   /* compose addend vector { 1, 2, 3, 0 } in v8.4s */
+   moviv7.4h, #1
+   moviv8.4h, #2
+   uaddl   v6.4s, v7.4h, v8.4h
+   zip1v8.8h, v7.8h, v8.8h
+   zip1v8.4s, v8.4s, v6.4s
+   zip2v8.8h, v8.8h, v7.8h
+
umovx6, v4.d[1] /* keep swabbed ctr in reg */
rev x6, x6
 .LctrloopNx:
@@ -232,17 +240,16 @@ AES_ENTRY(aes_ctr_encrypt)
bmi .Lctr1x
cmn w6, #4  /* 32 bit overflow? */
bcs .Lctr1x
-   ldr q8, =0x30000000200000001/* addends 1,2,3[,0] */
dup v7.4s, w6
mov v0.16b, v4.16b
add v7.4s, v7.4s, v8.4s
mov v1.16b, v4.16b
-   rev32   v8.16b, v7.16b
+   rev32   v7.16b, v7.16b
mov v2.16b, v4.16b
mov v3.16b, v4.16b
-   mov v1.s[3], v8.s[0]
-   mov v2.s[3], v8.s[1]
-   mov v3.s[3], v8.s[2]
+   mov v1.s[3], v7.s[0]
+   mov v2.s[3], v7.s[1]
+   mov v3.s[3], v7.s[2]
ld1 {v5.16b-v7.16b}, [x20], #48 /* get 3 input blocks */
bl  aes_encrypt_block4x
eor v0.16b, v5.16b, v0.16b
@@ -296,7 +303,6 @@ AES_ENTRY(aes_ctr_encrypt)
ins v4.d[0], x7
b   .Lctrcarrydone
 AES_ENDPROC(aes_ctr_encrypt)
-   .ltorg
 
 
/*
-- 
2.17.1



[PATCH] crypto: arm64/aes-gcm-ce - fix scatterwalk API violation

2018-08-20 Thread Ard Biesheuvel
Commit 71e52c278c54 ("crypto: arm64/aes-ce-gcm - operate on
two input blocks at a time") modified the granularity at which
the AES/GCM code processes its input to allow subsequent changes
to be applied that improve performance by using aggregation to
process multiple input blocks at once.

For this reason, it doubled the algorithm's 'chunksize' property
to 2 x AES_BLOCK_SIZE, but retained the non-SIMD fallback path that
processes a single block at a time. In some cases, this violates the
skcipher scatterwalk API, by calling skcipher_walk_done() with a
non-zero residue value for a chunk that is expected to be handled
in its entirety. This results in a WARN_ON() being hit by the TLS
self test code, but is likely to break other use cases as well.
Unfortunately, none of the current test cases exercises this exact
code path at the moment.

Fixes: 71e52c278c54 ("crypto: arm64/aes-ce-gcm - operate on two ...")
Reported-by: Vakul Garg 
Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/ghash-ce-glue.c | 29 +++--
 1 file changed, 23 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/crypto/ghash-ce-glue.c b/arch/arm64/crypto/ghash-ce-glue.c
index 6e9f33d14930..067d8937d5af 100644
--- a/arch/arm64/crypto/ghash-ce-glue.c
+++ b/arch/arm64/crypto/ghash-ce-glue.c
@@ -417,7 +417,7 @@ static int gcm_encrypt(struct aead_request *req)
__aes_arm64_encrypt(ctx->aes_key.key_enc, tag, iv, nrounds);
put_unaligned_be32(2, iv + GCM_IV_SIZE);
 
-   while (walk.nbytes >= AES_BLOCK_SIZE) {
+   while (walk.nbytes >= (2 * AES_BLOCK_SIZE)) {
int blocks = walk.nbytes / AES_BLOCK_SIZE;
u8 *dst = walk.dst.virt.addr;
u8 *src = walk.src.virt.addr;
@@ -437,11 +437,18 @@ static int gcm_encrypt(struct aead_request *req)
NULL);
 
err = skcipher_walk_done(&walk,
-walk.nbytes % AES_BLOCK_SIZE);
+walk.nbytes % (2 * AES_BLOCK_SIZE));
}
-   if (walk.nbytes)
+   if (walk.nbytes) {
__aes_arm64_encrypt(ctx->aes_key.key_enc, ks, iv,
nrounds);
+   if (walk.nbytes > AES_BLOCK_SIZE) {
+   crypto_inc(iv, AES_BLOCK_SIZE);
+   __aes_arm64_encrypt(ctx->aes_key.key_enc,
+   ks + AES_BLOCK_SIZE, iv,
+   nrounds);
+   }
+   }
}
 
/* handle the tail */
@@ -545,7 +552,7 @@ static int gcm_decrypt(struct aead_request *req)
__aes_arm64_encrypt(ctx->aes_key.key_enc, tag, iv, nrounds);
put_unaligned_be32(2, iv + GCM_IV_SIZE);
 
-   while (walk.nbytes >= AES_BLOCK_SIZE) {
+   while (walk.nbytes >= (2 * AES_BLOCK_SIZE)) {
int blocks = walk.nbytes / AES_BLOCK_SIZE;
u8 *dst = walk.dst.virt.addr;
u8 *src = walk.src.virt.addr;
@@ -564,11 +571,21 @@ static int gcm_decrypt(struct aead_request *req)
} while (--blocks > 0);
 
err = skcipher_walk_done(&walk,
-walk.nbytes % AES_BLOCK_SIZE);
+walk.nbytes % (2 * AES_BLOCK_SIZE));
}
-   if (walk.nbytes)
+   if (walk.nbytes) {
+   if (walk.nbytes > AES_BLOCK_SIZE) {
+   u8 *iv2 = iv + AES_BLOCK_SIZE;
+
+   memcpy(iv2, iv, AES_BLOCK_SIZE);
+   crypto_inc(iv2, AES_BLOCK_SIZE);
+
+   __aes_arm64_encrypt(ctx->aes_key.key_enc, iv2,
+   iv2, nrounds);
+   }
__aes_arm64_encrypt(ctx->aes_key.key_enc, iv, iv,
nrounds);
+   }
}
 
/* handle the tail */
-- 
2.11.0



[PATCH] crypto: arm64/sm4-ce - check for the right CPU feature bit

2018-08-07 Thread Ard Biesheuvel
ARMv8.2 specifies special instructions for the SM3 cryptographic hash
and the SM4 symmetric cipher. While it is unlikely that a core would
implement one and not the other, we should only use SM4 instructions
if the SM4 CPU feature bit is set, and we currently check the SM3
feature bit instead. So fix that.

Signed-off-by: Ard Biesheuvel 
---
It would be good to get this backported to -stable but there is no
need to merge this as a fix at -rc8

 arch/arm64/crypto/sm4-ce-glue.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm64/crypto/sm4-ce-glue.c b/arch/arm64/crypto/sm4-ce-glue.c
index b7fb5274b250..0c4fc223f225 100644
--- a/arch/arm64/crypto/sm4-ce-glue.c
+++ b/arch/arm64/crypto/sm4-ce-glue.c
@@ -69,5 +69,5 @@ static void __exit sm4_ce_mod_fini(void)
crypto_unregister_alg(&sm4_ce_alg);
 }
 
-module_cpu_feature_match(SM3, sm4_ce_mod_init);
+module_cpu_feature_match(SM4, sm4_ce_mod_init);
 module_exit(sm4_ce_mod_fini);
-- 
2.18.0



[PATCH 0/2] crypto: arm64/ghash-ce - performance improvements

2018-08-04 Thread Ard Biesheuvel
Another bit of performance work on the GHASH driver: this time it is not
the combined AES/GCM algorithm but the bare GHASH driver that gets updated.

Even though ARM cores that implement the polynomial multiplication
instructions that these routines depend on are guaranteed to also support
the AES instructions, and can thus use the AES/GCM driver, there could
be reasons to use the accelerated GHASH in isolation, e.g., with another
symmetric blockcipher, with a faster h/w accelerator, or potentially with
an accelerator that does not expose the AES key to the OS.

The resulting code runs at 1.1 cycles per byte on Cortex-A53 (down from
2.4 cycles per byte)

Ard Biesheuvel (2):
  crypto: arm64/ghash-ce - replace NEON yield check with block limit
  crypto: arm64/ghash-ce - implement 4-way aggregation

 arch/arm64/crypto/ghash-ce-core.S | 153 ++--
 arch/arm64/crypto/ghash-ce-glue.c |  87 ++-
 2 files changed, 161 insertions(+), 79 deletions(-)

-- 
2.18.0



[PATCH 2/2] crypto: arm64/ghash-ce - implement 4-way aggregation

2018-08-04 Thread Ard Biesheuvel
Enhance the GHASH implementation that uses 64-bit polynomial
multiplication by adding support for 4-way aggregation. This
more than doubles the performance, from 2.4 cycles per byte
to 1.1 cpb on Cortex-A53.
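
For reference (not part of the original commit message), the aggregated
form being computed is, with '+' denoting addition in GF(2^128), i.e.
XOR, and X the running digest:

    X' = (X + B0)*H^4 + B1*H^3 + B2*H^2 + B3*H

This folds four input blocks B0..B3 into the digest with a single
reduction modulo the characteristic polynomial, which is why the higher
powers of H are precomputed (HH, HH3 and HH4 in the assembler
corresponding to H^2, H^3 and H^4).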

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/ghash-ce-core.S | 122 +---
 arch/arm64/crypto/ghash-ce-glue.c |  71 ++--
 2 files changed, 142 insertions(+), 51 deletions(-)

diff --git a/arch/arm64/crypto/ghash-ce-core.S b/arch/arm64/crypto/ghash-ce-core.S
index 344811c6a0ca..1b319b716d5e 100644
--- a/arch/arm64/crypto/ghash-ce-core.S
+++ b/arch/arm64/crypto/ghash-ce-core.S
@@ -46,6 +46,19 @@
ss3 .reqv26
ss4 .reqv27
 
+   XL2 .reqv8
+   XM2 .reqv9
+   XH2 .reqv10
+   XL3 .reqv11
+   XM3 .reqv12
+   XH3 .reqv13
+   TT3 .reqv14
+   TT4 .reqv15
+   HH  .reqv16
+   HH3 .reqv17
+   HH4 .reqv18
+   HH34.reqv19
+
.text
.arch   armv8-a+crypto
 
@@ -134,11 +147,25 @@
.endm
 
.macro  __pmull_pre_p64
+   add x8, x3, #16
+   ld1 {HH.2d-HH4.2d}, [x8]
+
+   trn1SHASH2.2d, SHASH.2d, HH.2d
+   trn2T1.2d, SHASH.2d, HH.2d
+   eor SHASH2.16b, SHASH2.16b, T1.16b
+
+   trn1HH34.2d, HH3.2d, HH4.2d
+   trn2T1.2d, HH3.2d, HH4.2d
+   eor HH34.16b, HH34.16b, T1.16b
+
moviMASK.16b, #0xe1
shl MASK.2d, MASK.2d, #57
.endm
 
.macro  __pmull_pre_p8
+   ext SHASH2.16b, SHASH.16b, SHASH.16b, #8
+   eor SHASH2.16b, SHASH2.16b, SHASH.16b
+
// k00_16 := 0x_
// k32_48 := 0x_
movik32_48.2d, #0x
@@ -215,8 +242,6 @@
.macro  __pmull_ghash, pn
ld1 {SHASH.2d}, [x3]
ld1 {XL.2d}, [x1]
-   ext SHASH2.16b, SHASH.16b, SHASH.16b, #8
-   eor SHASH2.16b, SHASH2.16b, SHASH.16b
 
__pmull_pre_\pn
 
@@ -224,12 +249,79 @@
cbz x4, 0f
ld1 {T1.2d}, [x4]
mov x4, xzr
-   b   1f
+   b   3f
+
+0: .ifc\pn, p64
+   tbnzw0, #0, 2f  // skip until #blocks is a
+   tbnzw0, #1, 2f  // round multiple of 4
+
+1: ld1 {XM3.16b-TT4.16b}, [x2], #64
+
+   sub w0, w0, #4
+
+   rev64   T1.16b, XM3.16b
+   rev64   T2.16b, XH3.16b
+   rev64   TT4.16b, TT4.16b
+   rev64   TT3.16b, TT3.16b
+
+   ext IN1.16b, TT4.16b, TT4.16b, #8
+   ext XL3.16b, TT3.16b, TT3.16b, #8
+
+   eor TT4.16b, TT4.16b, IN1.16b
+   pmull2  XH2.1q, SHASH.2d, IN1.2d// a1 * b1
+   pmull   XL2.1q, SHASH.1d, IN1.1d// a0 * b0
+   pmull   XM2.1q, SHASH2.1d, TT4.1d   // (a1 + a0)(b1 + b0)
+
+   eor TT3.16b, TT3.16b, XL3.16b
+   pmull2  XH3.1q, HH.2d, XL3.2d   // a1 * b1
+   pmull   XL3.1q, HH.1d, XL3.1d   // a0 * b0
+   pmull2  XM3.1q, SHASH2.2d, TT3.2d   // (a1 + a0)(b1 + b0)
+
+   ext IN1.16b, T2.16b, T2.16b, #8
+   eor XL2.16b, XL2.16b, XL3.16b
+   eor XH2.16b, XH2.16b, XH3.16b
+   eor XM2.16b, XM2.16b, XM3.16b
+
+   eor T2.16b, T2.16b, IN1.16b
+   pmull2  XH3.1q, HH3.2d, IN1.2d  // a1 * b1
+   pmull   XL3.1q, HH3.1d, IN1.1d  // a0 * b0
+   pmull   XM3.1q, HH34.1d, T2.1d  // (a1 + a0)(b1 + b0)
 
-0: ld1 {T1.2d}, [x2], #16
+   eor XL2.16b, XL2.16b, XL3.16b
+   eor XH2.16b, XH2.16b, XH3.16b
+   eor XM2.16b, XM2.16b, XM3.16b
+
+   ext IN1.16b, T1.16b, T1.16b, #8
+   ext TT3.16b, XL.16b, XL.16b, #8
+   eor XL.16b, XL.16b, IN1.16b
+   eor T1.16b, T1.16b, TT3.16b
+
+   pmull2  XH.1q, HH4.2d, XL.2d// a1 * b1
+   eor T1.16b, T1.16b, XL.16b
+   pmull   XL.1q, HH4.1d, XL.1d// a0 * b0
+   pmull2  XM.1q, HH34.2d, T1.2d   // (a1 + a0)(b1 + b0)
+
+   eor XL.16b, XL.16b, XL2.16b
+   eor XH.16b, XH.16b, XH2.16b
+   eor XM.16b, XM.16b, XM2.16b
+
+   eor T2.16b, XL.16b, XH.16b
+   ext T1.16b, XL.16b, XH.16b, #8

[PATCH 1/2] crypto: arm64/ghash-ce - replace NEON yield check with block limit

2018-08-04 Thread Ard Biesheuvel
Checking the TIF_NEED_RESCHED flag is disproportionately costly on cores
with fast crypto instructions and comparatively slow memory accesses.

On algorithms such as GHASH, which executes at ~1 cycle per byte on
cores that implement support for 64 bit polynomial multiplication,
there is really no need to check the TIF_NEED_RESCHED flag particularly
often, and so we can remove the NEON yield check from the assembler
routines.

However, unlike the AEAD or skcipher APIs, the shash/ahash APIs take
arbitrary input lengths, and so there needs to be some sanity check
to ensure that we don't hog the CPU for excessive amounts of time.

So let's simply cap the maximum input size that is processed in one go
to 64 KB.
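
(As a rough sanity check, using the cycle counts quoted elsewhere in
this series: 64 KiB at 1.1 to 2.4 cycles per byte comes to roughly
70k-160k cycles, i.e. in the order of 70-160 us on a 1 GHz Cortex-A53,
so the worst case preemption latency remains bounded.)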

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/ghash-ce-core.S | 39 ++--
 arch/arm64/crypto/ghash-ce-glue.c | 16 ++--
 2 files changed, 23 insertions(+), 32 deletions(-)

diff --git a/arch/arm64/crypto/ghash-ce-core.S b/arch/arm64/crypto/ghash-ce-core.S
index 913e49932ae6..344811c6a0ca 100644
--- a/arch/arm64/crypto/ghash-ce-core.S
+++ b/arch/arm64/crypto/ghash-ce-core.S
@@ -213,31 +213,23 @@
.endm
 
.macro  __pmull_ghash, pn
-   frame_push  5
-
-   mov x19, x0
-   mov x20, x1
-   mov x21, x2
-   mov x22, x3
-   mov x23, x4
-
-0: ld1 {SHASH.2d}, [x22]
-   ld1 {XL.2d}, [x20]
+   ld1 {SHASH.2d}, [x3]
+   ld1 {XL.2d}, [x1]
ext SHASH2.16b, SHASH.16b, SHASH.16b, #8
eor SHASH2.16b, SHASH2.16b, SHASH.16b
 
__pmull_pre_\pn
 
/* do the head block first, if supplied */
-   cbz x23, 1f
-   ld1 {T1.2d}, [x23]
-   mov x23, xzr
-   b   2f
+   cbz x4, 0f
+   ld1 {T1.2d}, [x4]
+   mov x4, xzr
+   b   1f
 
-1: ld1 {T1.2d}, [x21], #16
-   sub w19, w19, #1
+0: ld1 {T1.2d}, [x2], #16
+   sub w0, w0, #1
 
-2: /* multiply XL by SHASH in GF(2^128) */
+1: /* multiply XL by SHASH in GF(2^128) */
 CPU_LE(rev64   T1.16b, T1.16b  )
 
ext T2.16b, XL.16b, XL.16b, #8
@@ -259,18 +251,9 @@ CPU_LE(rev64   T1.16b, T1.16b  )
eor T2.16b, T2.16b, XH.16b
eor XL.16b, XL.16b, T2.16b
 
-   cbz w19, 3f
-
-   if_will_cond_yield_neon
-   st1 {XL.2d}, [x20]
-   do_cond_yield_neon
-   b   0b
-   endif_yield_neon
-
-   b   1b
+   cbnzw0, 0b
 
-3: st1 {XL.2d}, [x20]
-   frame_pop
+   st1 {XL.2d}, [x1]
ret
.endm
 
diff --git a/arch/arm64/crypto/ghash-ce-glue.c b/arch/arm64/crypto/ghash-ce-glue.c
index 88e3d93fa7c7..03ce71ea81a2 100644
--- a/arch/arm64/crypto/ghash-ce-glue.c
+++ b/arch/arm64/crypto/ghash-ce-glue.c
@@ -113,6 +113,9 @@ static void ghash_do_update(int blocks, u64 dg[], const char *src,
}
 }
 
+/* avoid hogging the CPU for too long */
+#define MAX_BLOCKS (SZ_64K / GHASH_BLOCK_SIZE)
+
 static int ghash_update(struct shash_desc *desc, const u8 *src,
unsigned int len)
 {
@@ -136,11 +139,16 @@ static int ghash_update(struct shash_desc *desc, const u8 *src,
blocks = len / GHASH_BLOCK_SIZE;
len %= GHASH_BLOCK_SIZE;
 
-   ghash_do_update(blocks, ctx->digest, src, key,
-   partial ? ctx->buf : NULL);
+   do {
+   int chunk = min(blocks, MAX_BLOCKS);
+
+   ghash_do_update(chunk, ctx->digest, src, key,
+   partial ? ctx->buf : NULL);
 
-   src += blocks * GHASH_BLOCK_SIZE;
-   partial = 0;
+   blocks -= chunk;
+   src += chunk * GHASH_BLOCK_SIZE;
+   partial = 0;
+   } while (unlikely(blocks > 0));
}
if (len)
memcpy(ctx->buf + partial, src, len);
-- 
2.18.0



Re: [PATCH v2 0/3] crypto/arm64: aes-ce-gcm - switch to 2-way aggregation

2018-08-03 Thread Ard Biesheuvel
On 3 August 2018 at 17:47, Herbert Xu  wrote:
> On Mon, Jul 30, 2018 at 11:06:39PM +0200, Ard Biesheuvel wrote:
>> Update the combined AES-GCM AEAD implementation to process two blocks
>> at a time, allowing us to switch to a faster version of the GHASH
>> implementation.
>>
>> Note that this does not update the core GHASH transform, only the
>> combined AES-GCM AEAD mode. GHASH is mostly used with AES anyway, and
>> the ARMv8 architecture mandates support for AES instructions if
>> 64-bit polynomial multiplication instructions are implemented. This
>> means that most users of the pmull.p64 based GHASH routines are better
>> off using the combined AES-GCM code anyway. Users of the pmull.p8 based
>> GHASH implementation are unlikely to benefit substantially from aggregation,
>> given that the multiplication phase is much more dominant in this case
>> (and it is only the reduction phase that is amortized over multiple
>> blocks)
>>
>> Performance numbers for Cortex-A53 can be found after patches #2 and #3.
>>
>> Changes since v1:
>> - rebase to take the changes in patch 'crypto: arm64 - revert NEON yield for
>>   fast AEAD implementations' which I sent out on July 29th
>> - add a patch to reduce the number of invocations of kernel_neon_begin()
>>   and kernel_neon_end() on the common path
>>
>> Ard Biesheuvel (3):
>>   crypto/arm64: aes-ce-gcm - operate on two input blocks at a time
>>   crypto/arm64: aes-ce-gcm - implement 2-way aggregation
>>   crypto: arm64/aes-ce-gcm - don't reload key schedule if avoidable
>>
>>  arch/arm64/crypto/ghash-ce-core.S | 136 +--
>>  arch/arm64/crypto/ghash-ce-glue.c | 176 
>>  2 files changed, 198 insertions(+), 114 deletions(-)
>
> All applied.  Thanks.

Thanks Herbert.


Re: [PATCH] crypto: arm64 - revert NEON yield for fast AEAD implementations

2018-08-03 Thread Ard Biesheuvel
On 3 August 2018 at 10:17, Herbert Xu  wrote:
> On Fri, Aug 03, 2018 at 09:10:08AM +0200, Ard Biesheuvel wrote:
>> But I think it's too late now to take this into v4.18. Could you
>> please queue this (and my other two pending arm64/aes-gcm patches, if
>> possible) for v4.19 instead?
>
> OK I'll do that.
>

Thanks.

But, actually, those two pending patches are a 3-piece series now ...
(v2 of 'crypto/arm64: aes-ce-gcm - switch to 2-way aggregation')

Thanks,
Ard.


Re: [PATCH] crypto: arm64 - revert NEON yield for fast AEAD implementations

2018-08-03 Thread Ard Biesheuvel
On 3 August 2018 at 08:14, Herbert Xu  wrote:
> On Sun, Jul 29, 2018 at 04:52:30PM +0200, Ard Biesheuvel wrote:
>> As it turns out, checking the TIF_NEED_RESCHED flag after each
>> iteration results in a significant performance regression (~10%)
>> when running fast algorithms (i.e., ones that use special instructions
>> and operate in the < 4 cycles per byte range) on in-order cores with
>> comparatively slow memory accesses such as the Cortex-A53.
>>
>> Given the speed of these ciphers, and the fact that the page based
>> nature of the AEAD scatterwalk API guarantees that the core NEON
>> transform is never invoked with more than a single page's worth of
>> input, we can estimate the worst case duration of any resulting
>> scheduling blackout: on a 1 GHz Cortex-A53 running with 64k pages,
>> processing a page's worth of input at 4 cycles per byte results in
>> a delay of ~250 us, which is a reasonable upper bound.
>>
>> So let's remove the yield checks from the fused AES-CCM and AES-GCM
>> routines entirely.
>>
>> This reverts commit 7b67ae4d5ce8e2f912377f5fbccb95811a92097f and
>> partially reverts commit 7c50136a8aba8784f07fb66a950cc61a7f3d2ee3.
>>
>> Fixes: 7c50136a8aba ("crypto: arm64/aes-ghash - yield NEON after every ...")
>> Fixes: 7b67ae4d5ce8 ("crypto: arm64/aes-ccm - yield NEON after every ...")
>> Signed-off-by: Ard Biesheuvel 
>> ---
>> This supersedes my series 'crypto/arm64: reduce impact of NEON yield checks'
>> sent out on the 24th of July.
>>
>> Given the significant performance regression, we may want to treat this as
>> a fix (the patches in question went into v4.18)
>>
>> This patch applies onto my patch 'crypto/arm64: aes-ce-gcm - add missing
>> kernel_neon_begin/end pair' which I sent out on the 27th of July, which
>> fixes a data corruption bug, which should also be applied as a fix.
>
> Acked-by: Herbert Xu 
>

Thanks Herbert.

But I think it's too late now to take this into v4.18. Could you
please queue this (and my other two pending arm64/aes-gcm patches, if
possible) for v4.19 instead?

Thanks,


Re: [PATCH] crypto/arm64: aes-ce-gcm - add missing kernel_neon_begin/end pair

2018-07-31 Thread Ard Biesheuvel
(+ Catalin, Will)

On 27 July 2018 at 14:59, Ard Biesheuvel  wrote:
> Calling pmull_gcm_encrypt_block() requires kernel_neon_begin() and
> kernel_neon_end() to be used since the routine touches the NEON
> register file. Add the missing calls.
>
> Also, since NEON register contents are not preserved outside of
> a kernel mode NEON region, pass the key schedule array again.
>
> Fixes: 7c50136a8aba ("crypto: arm64/aes-ghash - yield NEON after every ...")
> Signed-off-by: Ard Biesheuvel 
> ---
>  arch/arm64/crypto/ghash-ce-glue.c | 8 ++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/crypto/ghash-ce-glue.c 
> b/arch/arm64/crypto/ghash-ce-glue.c
> index 7cf0b1aa6ea8..8a10f1d7199a 100644
> --- a/arch/arm64/crypto/ghash-ce-glue.c
> +++ b/arch/arm64/crypto/ghash-ce-glue.c
> @@ -488,9 +488,13 @@ static int gcm_decrypt(struct aead_request *req)
> err = skcipher_walk_done(&walk,
>  walk.nbytes % AES_BLOCK_SIZE);
> }
> -   if (walk.nbytes)
> -   pmull_gcm_encrypt_block(iv, iv, NULL,
> +   if (walk.nbytes) {
> +   kernel_neon_begin();
> +   pmull_gcm_encrypt_block(iv, iv, ctx->aes_key.key_enc,
> num_rounds(&ctx->aes_key));
> +   kernel_neon_end();
> +   }
> +
> } else {
> __aes_arm64_encrypt(ctx->aes_key.key_enc, tag, iv,
> num_rounds(&ctx->aes_key));
> --
> 2.18.0
>

This fixes a rather nasty bug in the AES-GCM code: failing to call
kernel_neon_begin()/_end() may clobber the NEON register state of
unrelated userland processes.

Could we please get this queued before v4.18 is released? Thanks.


[PATCH v2 1/3] crypto/arm64: aes-ce-gcm - operate on two input blocks at a time

2018-07-30 Thread Ard Biesheuvel
Update the core AES/GCM transform and the associated plumbing to operate
on 2 AES/GHASH blocks at a time. By itself, this is not expected to
result in a noticeable speedup, but it paves the way for reimplementing
the GHASH component using 2-way aggregation.

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/ghash-ce-core.S | 127 +++-
 arch/arm64/crypto/ghash-ce-glue.c | 103 ++--
 2 files changed, 161 insertions(+), 69 deletions(-)

diff --git a/arch/arm64/crypto/ghash-ce-core.S b/arch/arm64/crypto/ghash-ce-core.S
index c723647b37db..dac0df29d194 100644
--- a/arch/arm64/crypto/ghash-ce-core.S
+++ b/arch/arm64/crypto/ghash-ce-core.S
@@ -286,9 +286,10 @@ ENTRY(pmull_ghash_update_p8)
__pmull_ghash   p8
 ENDPROC(pmull_ghash_update_p8)
 
-   KS  .reqv8
-   CTR .reqv9
-   INP .reqv10
+   KS0 .reqv8
+   KS1 .reqv9
+   INP0.reqv10
+   INP1.reqv11
 
.macro  load_round_keys, rounds, rk
cmp \rounds, #12
@@ -336,84 +337,146 @@ CPU_LE(  rev x8, x8  )
 
.if \enc == 1
ldr x10, [sp]
-   ld1 {KS.16b}, [x10]
+   ld1 {KS0.16b-KS1.16b}, [x10]
.endif
 
-0: ld1 {CTR.8b}, [x5]  // load upper counter
-   ld1 {INP.16b}, [x3], #16
+0: ld1 {INP0.16b-INP1.16b}, [x3], #32
+
rev x9, x8
-   add x8, x8, #1
-   sub w0, w0, #1
-   ins CTR.d[1], x9// set lower counter
+   add x11, x8, #1
+   add x8, x8, #2
 
.if \enc == 1
-   eor INP.16b, INP.16b, KS.16b// encrypt input
-   st1 {INP.16b}, [x2], #16
+   eor INP0.16b, INP0.16b, KS0.16b // encrypt input
+   eor INP1.16b, INP1.16b, KS1.16b
.endif
 
-   rev64   T1.16b, INP.16b
+   ld1 {KS0.8b}, [x5]  // load upper counter
+   rev x11, x11
+   sub w0, w0, #2
+   mov KS1.8b, KS0.8b
+   ins KS0.d[1], x9// set lower counter
+   ins KS1.d[1], x11
+
+   rev64   T1.16b, INP0.16b
 
cmp w7, #12
b.ge2f  // AES-192/256?
 
-1: enc_round   CTR, v21
+1: enc_round   KS0, v21
 
ext T2.16b, XL.16b, XL.16b, #8
ext IN1.16b, T1.16b, T1.16b, #8
 
-   enc_round   CTR, v22
+   enc_round   KS1, v21
 
eor T1.16b, T1.16b, T2.16b
eor XL.16b, XL.16b, IN1.16b
 
-   enc_round   CTR, v23
+   enc_round   KS0, v22
 
pmull2  XH.1q, SHASH.2d, XL.2d  // a1 * b1
eor T1.16b, T1.16b, XL.16b
 
-   enc_round   CTR, v24
+   enc_round   KS1, v22
 
pmull   XL.1q, SHASH.1d, XL.1d  // a0 * b0
pmull   XM.1q, SHASH2.1d, T1.1d // (a1 + a0)(b1 + b0)
 
-   enc_round   CTR, v25
+   enc_round   KS0, v23
 
ext T1.16b, XL.16b, XH.16b, #8
eor T2.16b, XL.16b, XH.16b
eor XM.16b, XM.16b, T1.16b
 
-   enc_round   CTR, v26
+   enc_round   KS1, v23
 
eor XM.16b, XM.16b, T2.16b
pmull   T2.1q, XL.1d, MASK.1d
 
-   enc_round   CTR, v27
+   enc_round   KS0, v24
 
mov XH.d[0], XM.d[1]
mov XM.d[1], XL.d[0]
 
-   enc_round   CTR, v28
+   enc_round   KS1, v24
 
eor XL.16b, XM.16b, T2.16b
 
-   enc_round   CTR, v29
+   enc_round   KS0, v25
 
ext T2.16b, XL.16b, XL.16b, #8
 
-   aeseCTR.16b, v30.16b
+   enc_round   KS1, v25
 
pmull   XL.1q, XL.1d, MASK.1d
eor T2.16b, T2.16b, XH.16b
 
-   eor KS.16b, CTR.16b, v31.16b
+   enc_round   KS0, v26
+
+   eor XL.16b, XL.16b, T2.16b
+   rev64   T1.16b, INP1.16b
+
+   enc_round   KS1, v26
+
+   ext T2.16b, XL.16b, XL.16b, #8
+   ext IN1.16b, T1.16b, T1.16b, #8
+
+   enc_round   KS0, v27
+
+   eor T1.16b, T1.16b, T2.16b
+   eor XL.16b, XL.16b, IN1.16b
+
+   enc_round   KS1, v27
+
+   pmull2  XH.1q, SHASH.2d, XL.2d  // a1 * b1
+   eor T1.16b, T1.16b, XL.16b
+
+   enc_round   KS0, v28
+
+   pmull   XL.1q, SHASH.1d, XL.1d  // a0 * b0
+   pmull   XM.1q, SHASH2.1d

[PATCH v2 3/3] crypto: arm64/aes-ce-gcm - don't reload key schedule if avoidable

2018-07-30 Thread Ard Biesheuvel
Squeeze out another 5% of performance by minimizing the number
of invocations of kernel_neon_begin()/kernel_neon_end() on the
common path, which also allows some reloads of the key schedule
to be optimized away.

The resulting code runs at 2.3 cycles per byte on a Cortex-A53.

Signed-off-by: Ard Biesheuvel 
---
Raw numbers after the patch.

 arch/arm64/crypto/ghash-ce-core.S |  9 ++-
 arch/arm64/crypto/ghash-ce-glue.c | 81 +++-
 2 files changed, 49 insertions(+), 41 deletions(-)

diff --git a/arch/arm64/crypto/ghash-ce-core.S b/arch/arm64/crypto/ghash-ce-core.S
index f7281e7a592f..913e49932ae6 100644
--- a/arch/arm64/crypto/ghash-ce-core.S
+++ b/arch/arm64/crypto/ghash-ce-core.S
@@ -1,7 +1,7 @@
 /*
  * Accelerated GHASH implementation with ARMv8 PMULL instructions.
  *
- * Copyright (C) 2014 - 2017 Linaro Ltd. 
+ * Copyright (C) 2014 - 2018 Linaro Ltd. 
  *
  * This program is free software; you can redistribute it and/or modify it
  * under the terms of the GNU General Public License version 2 as published
@@ -332,8 +332,6 @@ ENDPROC(pmull_ghash_update_p8)
ld1 {XL.2d}, [x1]
ldr x8, [x5, #8]// load lower counter
 
-   load_round_keys w7, x6
-
moviMASK.16b, #0xe1
trn1SHASH2.2d, SHASH.2d, HH.2d
trn2T1.2d, SHASH.2d, HH.2d
@@ -346,6 +344,8 @@ CPU_LE( rev x8, x8  )
ld1 {KS0.16b-KS1.16b}, [x10]
.endif
 
+   cbnzx6, 4f
+
 0: ld1 {INP0.16b-INP1.16b}, [x3], #32
 
rev x9, x8
@@ -471,6 +471,9 @@ CPU_LE( rev x8, x8  )
enc_round   KS0, v20
enc_round   KS1, v20
b   1b
+
+4: load_round_keys w7, x6
+   b   0b
.endm
 
/*
diff --git a/arch/arm64/crypto/ghash-ce-glue.c b/arch/arm64/crypto/ghash-ce-glue.c
index c41ac62c90e9..88e3d93fa7c7 100644
--- a/arch/arm64/crypto/ghash-ce-glue.c
+++ b/arch/arm64/crypto/ghash-ce-glue.c
@@ -1,7 +1,7 @@
 /*
  * Accelerated GHASH implementation with ARMv8 PMULL instructions.
  *
- * Copyright (C) 2014 - 2017 Linaro Ltd. 
+ * Copyright (C) 2014 - 2018 Linaro Ltd. 
  *
  * This program is free software; you can redistribute it and/or modify it
  * under the terms of the GNU General Public License version 2 as published
@@ -374,37 +374,39 @@ static int gcm_encrypt(struct aead_request *req)
memcpy(iv, req->iv, GCM_IV_SIZE);
put_unaligned_be32(1, iv + GCM_IV_SIZE);
 
-   if (likely(may_use_simd())) {
-   kernel_neon_begin();
+   err = skcipher_walk_aead_encrypt(&walk, req, false);
 
+   if (likely(may_use_simd() && walk.total >= 2 * AES_BLOCK_SIZE)) {
+   u32 const *rk = NULL;
+
+   kernel_neon_begin();
pmull_gcm_encrypt_block(tag, iv, ctx->aes_key.key_enc, nrounds);
put_unaligned_be32(2, iv + GCM_IV_SIZE);
pmull_gcm_encrypt_block(ks, iv, NULL, nrounds);
put_unaligned_be32(3, iv + GCM_IV_SIZE);
pmull_gcm_encrypt_block(ks + AES_BLOCK_SIZE, iv, NULL, nrounds);
put_unaligned_be32(4, iv + GCM_IV_SIZE);
-   kernel_neon_end();
-
-   err = skcipher_walk_aead_encrypt(&walk, req, false);
 
-   while (walk.nbytes >= 2 * AES_BLOCK_SIZE) {
+   do {
int blocks = walk.nbytes / (2 * AES_BLOCK_SIZE) * 2;
 
-   kernel_neon_begin();
+   if (rk)
+   kernel_neon_begin();
+
pmull_gcm_encrypt(blocks, dg, walk.dst.virt.addr,
  walk.src.virt.addr, ctx->h2, iv,
- ctx->aes_key.key_enc, nrounds, ks);
+ rk, nrounds, ks);
kernel_neon_end();
 
err = skcipher_walk_done(&walk,
walk.nbytes % (2 * AES_BLOCK_SIZE));
-   }
+
+   rk = ctx->aes_key.key_enc;
+   } while (walk.nbytes >= 2 * AES_BLOCK_SIZE);
} else {
__aes_arm64_encrypt(ctx->aes_key.key_enc, tag, iv, nrounds);
put_unaligned_be32(2, iv + GCM_IV_SIZE);
 
-   err = skcipher_walk_aead_encrypt(&walk, req, false);
-
while (walk.nbytes >= AES_BLOCK_SIZE) {
int blocks = walk.nbytes / AES_BLOCK_SIZE;
u8 *dst = walk.dst.virt.addr;
@@ -486,50 +488,53 @@ static int gcm_decrypt(struct aead_request *req)
memcpy(iv, req->iv, GCM_IV_SIZE);
put_unaligned_be32(1, iv + GCM_IV_SIZE);
 
-   if (likely(may_use_simd())) {
+   err = skcipher_walk_aead_decrypt(&walk, req, false);
+
+   if (likely(may_use_s

[PATCH v2 2/3] crypto/arm64: aes-ce-gcm - implement 2-way aggregation

2018-07-30 Thread Ard Biesheuvel
Implement a faster version of the GHASH transform which amortizes
the reduction modulo the characteristic polynomial across two
input blocks at a time.

On a Cortex-A53, the gcm(aes) performance increases 24%, from
3.0 cycles per byte to 2.4 cpb for large input sizes.

Signed-off-by: Ard Biesheuvel 
---
Raw numbers after the patch.

 arch/arm64/crypto/ghash-ce-core.S | 86 +++-
 arch/arm64/crypto/ghash-ce-glue.c | 34 +---
 2 files changed, 52 insertions(+), 68 deletions(-)

diff --git a/arch/arm64/crypto/ghash-ce-core.S b/arch/arm64/crypto/ghash-ce-core.S
index dac0df29d194..f7281e7a592f 100644
--- a/arch/arm64/crypto/ghash-ce-core.S
+++ b/arch/arm64/crypto/ghash-ce-core.S
@@ -290,6 +290,10 @@ ENDPROC(pmull_ghash_update_p8)
KS1 .reqv9
INP0.reqv10
INP1.reqv11
+   HH  .reqv12
+   XL2 .reqv13
+   XM2 .reqv14
+   XH2 .reqv15
 
.macro  load_round_keys, rounds, rk
cmp \rounds, #12
@@ -323,6 +327,7 @@ ENDPROC(pmull_ghash_update_p8)
.endm
 
.macro  pmull_gcm_do_crypt, enc
+   ld1 {HH.2d}, [x4], #16
ld1 {SHASH.2d}, [x4]
ld1 {XL.2d}, [x1]
ldr x8, [x5, #8]// load lower counter
@@ -330,10 +335,11 @@ ENDPROC(pmull_ghash_update_p8)
load_round_keys w7, x6
 
moviMASK.16b, #0xe1
-   ext SHASH2.16b, SHASH.16b, SHASH.16b, #8
+   trn1SHASH2.2d, SHASH.2d, HH.2d
+   trn2T1.2d, SHASH.2d, HH.2d
 CPU_LE(rev x8, x8  )
shl MASK.2d, MASK.2d, #57
-   eor SHASH2.16b, SHASH2.16b, SHASH.16b
+   eor SHASH2.16b, SHASH2.16b, T1.16b
 
.if \enc == 1
ldr x10, [sp]
@@ -358,116 +364,82 @@ CPU_LE(  rev x8, x8  )
ins KS0.d[1], x9// set lower counter
ins KS1.d[1], x11
 
-   rev64   T1.16b, INP0.16b
+   rev64   T1.16b, INP1.16b
 
cmp w7, #12
b.ge2f  // AES-192/256?
 
 1: enc_round   KS0, v21
-
-   ext T2.16b, XL.16b, XL.16b, #8
ext IN1.16b, T1.16b, T1.16b, #8
 
enc_round   KS1, v21
-
-   eor T1.16b, T1.16b, T2.16b
-   eor XL.16b, XL.16b, IN1.16b
+   pmull2  XH2.1q, SHASH.2d, IN1.2d// a1 * b1
 
enc_round   KS0, v22
-
-   pmull2  XH.1q, SHASH.2d, XL.2d  // a1 * b1
-   eor T1.16b, T1.16b, XL.16b
+   eor T1.16b, T1.16b, IN1.16b
 
enc_round   KS1, v22
-
-   pmull   XL.1q, SHASH.1d, XL.1d  // a0 * b0
-   pmull   XM.1q, SHASH2.1d, T1.1d // (a1 + a0)(b1 + b0)
+   pmull   XL2.1q, SHASH.1d, IN1.1d// a0 * b0
 
enc_round   KS0, v23
-
-   ext T1.16b, XL.16b, XH.16b, #8
-   eor T2.16b, XL.16b, XH.16b
-   eor XM.16b, XM.16b, T1.16b
+   pmull   XM2.1q, SHASH2.1d, T1.1d// (a1 + a0)(b1 + b0)
 
enc_round   KS1, v23
-
-   eor XM.16b, XM.16b, T2.16b
-   pmull   T2.1q, XL.1d, MASK.1d
+   rev64   T1.16b, INP0.16b
+   ext T2.16b, XL.16b, XL.16b, #8
 
enc_round   KS0, v24
-
-   mov XH.d[0], XM.d[1]
-   mov XM.d[1], XL.d[0]
+   ext IN1.16b, T1.16b, T1.16b, #8
+   eor T1.16b, T1.16b, T2.16b
 
enc_round   KS1, v24
-
-   eor XL.16b, XM.16b, T2.16b
+   eor XL.16b, XL.16b, IN1.16b
 
enc_round   KS0, v25
-
-   ext T2.16b, XL.16b, XL.16b, #8
+   eor T1.16b, T1.16b, XL.16b
 
enc_round   KS1, v25
-
-   pmull   XL.1q, XL.1d, MASK.1d
-   eor T2.16b, T2.16b, XH.16b
+   pmull2  XH.1q, HH.2d, XL.2d // a1 * b1
 
enc_round   KS0, v26
-
-   eor XL.16b, XL.16b, T2.16b
-   rev64   T1.16b, INP1.16b
+   pmull   XL.1q, HH.1d, XL.1d // a0 * b0
 
enc_round   KS1, v26
-
-   ext T2.16b, XL.16b, XL.16b, #8
-   ext IN1.16b, T1.16b, T1.16b, #8
+   pmull2  XM.1q, SHASH2.2d, T1.2d // (a1 + a0)(b1 + b0)
 
enc_round   KS0, v27
-
-   eor T1.16b, T1.16b, T2.16b
-   eor XL.16b, XL.16b, IN1.16b
+   eor XL.16b, XL.16b, XL2.16b
+   eor XH.16b, XH.16b, XH2.16b
 
enc_round   KS1, v27
-
-   pmull2  XH

[PATCH v2 0/3] crypto/arm64: aes-ce-gcm - switch to 2-way aggregation

2018-07-30 Thread Ard Biesheuvel
Update the combined AES-GCM AEAD implementation to process two blocks
at a time, allowing us to switch to a faster version of the GHASH
implementation.

Note that this does not update the core GHASH transform, only the
combined AES-GCM AEAD mode. GHASH is mostly used with AES anyway, and
the ARMv8 architecture mandates support for AES instructions if
64-bit polynomial multiplication instructions are implemented. This
means that most users of the pmull.p64 based GHASH routines are better
off using the combined AES-GCM code anyway. Users of the pmull.p8 based
GHASH implementation are unlikely to benefit substantially from aggregation,
given that the multiplication phase is much more dominant in this case
(and it is only the reduction phase that is amortized over multiple
blocks)

Performance numbers for Cortex-A53 can be found after patches #2 and #3.

Changes since v1:
- rebase to take the changes in patch 'crypto: arm64 - revert NEON yield for
  fast AEAD implementations' which I sent out on July 29th
- add a patch to reduce the number of invocations of kernel_neon_begin()
  and kernel_neon_end() on the common path

Ard Biesheuvel (3):
  crypto/arm64: aes-ce-gcm - operate on two input blocks at a time
  crypto/arm64: aes-ce-gcm - implement 2-way aggregation
  crypto: arm64/aes-ce-gcm - don't reload key schedule if avoidable

 arch/arm64/crypto/ghash-ce-core.S | 136 +--
 arch/arm64/crypto/ghash-ce-glue.c | 176 
 2 files changed, 198 insertions(+), 114 deletions(-)

-- 
2.18.0



[PATCH] crypto: arm64 - revert NEON yield for fast AEAD implementations

2018-07-29 Thread Ard Biesheuvel
As it turns out, checking the TIF_NEED_RESCHED flag after each
iteration results in a significant performance regression (~10%)
when running fast algorithms (i.e., ones that use special instructions
and operate in the < 4 cycles per byte range) on in-order cores with
comparatively slow memory accesses such as the Cortex-A53.

Given the speed of these ciphers, and the fact that the page based
nature of the AEAD scatterwalk API guarantees that the core NEON
transform is never invoked with more than a single page's worth of
input, we can estimate the worst case duration of any resulting
scheduling blackout: on a 1 GHz Cortex-A53 running with 64k pages,
processing a page's worth of input at 4 cycles per byte results in
a delay of ~250 us, which is a reasonable upper bound.
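
(As a sanity check on that figure: 65536 bytes x 4 cycles/byte =
262144 cycles, which at 1 GHz comes to ~262 us, in line with the
~250 us estimate above.)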

So let's remove the yield checks from the fused AES-CCM and AES-GCM
routines entirely.

This reverts commit 7b67ae4d5ce8e2f912377f5fbccb95811a92097f and
partially reverts commit 7c50136a8aba8784f07fb66a950cc61a7f3d2ee3.

Fixes: 7c50136a8aba ("crypto: arm64/aes-ghash - yield NEON after every ...")
Fixes: 7b67ae4d5ce8 ("crypto: arm64/aes-ccm - yield NEON after every ...")
Signed-off-by: Ard Biesheuvel 
---
This supersedes my series 'crypto/arm64: reduce impact of NEON yield checks'
sent out on the 24th of July.

Given the significant performance regression, we may want to treat this as
a fix (the patches in question went into v4.18)

This patch applies onto my patch 'crypto/arm64: aes-ce-gcm - add missing
kernel_neon_begin/end pair' which I sent out on the 27th of July, which
fixes a data corruption bug, which should also be applied as a fix.

 arch/arm64/crypto/aes-ce-ccm-core.S | 150 +++-
 arch/arm64/crypto/ghash-ce-core.S   |  76 --
 2 files changed, 80 insertions(+), 146 deletions(-)

diff --git a/arch/arm64/crypto/aes-ce-ccm-core.S b/arch/arm64/crypto/aes-ce-ccm-core.S
index 88f5aef7934c..e3a375c4cb83 100644
--- a/arch/arm64/crypto/aes-ce-ccm-core.S
+++ b/arch/arm64/crypto/aes-ce-ccm-core.S
@@ -19,33 +19,24 @@
 *   u32 *macp, u8 const rk[], u32 rounds);
 */
 ENTRY(ce_aes_ccm_auth_data)
-   frame_push  7
-
-   mov x19, x0
-   mov x20, x1
-   mov x21, x2
-   mov x22, x3
-   mov x23, x4
-   mov x24, x5
-
-   ldr w25, [x22]  /* leftover from prev round? */
+   ldr w8, [x3]/* leftover from prev round? */
ld1 {v0.16b}, [x0]  /* load mac */
-   cbz w25, 1f
-   sub w25, w25, #16
+   cbz w8, 1f
+   sub w8, w8, #16
eor v1.16b, v1.16b, v1.16b
-0: ldrbw7, [x20], #1   /* get 1 byte of input */
-   subsw21, w21, #1
-   add w25, w25, #1
+0: ldrbw7, [x1], #1/* get 1 byte of input */
+   subsw2, w2, #1
+   add w8, w8, #1
ins v1.b[0], w7
ext v1.16b, v1.16b, v1.16b, #1  /* rotate in the input bytes */
beq 8f  /* out of input? */
-   cbnzw25, 0b
+   cbnzw8, 0b
eor v0.16b, v0.16b, v1.16b
-1: ld1 {v3.4s}, [x23]  /* load first round key */
-   prfmpldl1strm, [x20]
-   cmp w24, #12/* which key size? */
-   add x6, x23, #16
-   sub w7, w24, #2 /* modified # of rounds */
+1: ld1 {v3.4s}, [x4]   /* load first round key */
+   prfmpldl1strm, [x1]
+   cmp w5, #12 /* which key size? */
+   add x6, x4, #16
+   sub w7, w5, #2  /* modified # of rounds */
bmi 2f
bne 5f
mov v5.16b, v3.16b
@@ -64,43 +55,33 @@ ENTRY(ce_aes_ccm_auth_data)
ld1 {v5.4s}, [x6], #16  /* load next round key */
bpl 3b
aesev0.16b, v4.16b
-   subsw21, w21, #16   /* last data? */
+   subsw2, w2, #16 /* last data? */
eor v0.16b, v0.16b, v5.16b  /* final round */
bmi 6f
-   ld1 {v1.16b}, [x20], #16/* load next input block */
+   ld1 {v1.16b}, [x1], #16 /* load next input block */
eor v0.16b, v0.16b, v1.16b  /* xor with mac */
-   beq 6f
-
-   if_will_cond_yield_neon
-   st1 {v0.16b}, [x19] /* store mac */
-   do_cond_yield_neon
-   ld1 {v0.16b}, [x19] /* reload mac */
-   endif_yield_neon
-
-   b   1b
-6: st1 {v0.16b}, [x19] /* store mac */
+   bne 1b
+6: st1 {v0.16b}, [x0]  /* store mac */
beq 10f
-   addsw21, w21, #16
+   addsw2, w2, #16
beq 10f
-   mov w25,

[PATCH 0/2] crypto/arm64: aes-ce-gcm - switch to 2-way aggregation

2018-07-28 Thread Ard Biesheuvel
Update the combined AES-GCM AEAD implementation to process two blocks
at a time, allowing us to switch to a faster version of the GHASH
implementation.

Note that this does not update the core GHASH transform, only the
combined AES-GCM AEAD mode. GHASH is mostly used with AES anyway, and
the ARMv8 architecture mandates support for AES instructions if
64-bit polynomial multiplication instructions are implemented. This
means that most users of the pmull.p64 based GHASH routines are better
off using the combined AES-GCM code anyway. Users of the pmull.p8 based
GHASH implementation are unlikely to benefit substantially from aggregation,
given that the multiplication phase is much more dominant in this case
(and it is only the reduction phase that is amortized over multiple
blocks)

Performance numbers for Cortex-A53 can be found after patch #2.

Ard Biesheuvel (2):
  crypto/arm64: aes-ce-gcm - operate on two input blocks at a time
  crypto/arm64: aes-ce-gcm - implement 2-way aggregation

 arch/arm64/crypto/ghash-ce-core.S | 128 +---
 arch/arm64/crypto/ghash-ce-glue.c | 117 --
 2 files changed, 165 insertions(+), 80 deletions(-)

-- 
2.18.0



[PATCH 1/2] crypto/arm64: aes-ce-gcm - operate on two input blocks at a time

2018-07-28 Thread Ard Biesheuvel
Update the core AES/GCM transform and the associated plumbing to operate
on 2 AES/GHASH blocks at a time. By itself, this is not expected to
result in a noticeable speedup, but it paves the way for reimplementing
the GHASH component using 2-way aggregation.

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/ghash-ce-core.S | 129 +++-
 arch/arm64/crypto/ghash-ce-glue.c |  84 +
 2 files changed, 155 insertions(+), 58 deletions(-)

diff --git a/arch/arm64/crypto/ghash-ce-core.S b/arch/arm64/crypto/ghash-ce-core.S
index dcffb9e77589..437a2fb0f7f9 100644
--- a/arch/arm64/crypto/ghash-ce-core.S
+++ b/arch/arm64/crypto/ghash-ce-core.S
@@ -286,9 +286,10 @@ ENTRY(pmull_ghash_update_p8)
__pmull_ghash   p8
 ENDPROC(pmull_ghash_update_p8)
 
-   KS  .reqv8
-   CTR .reqv9
-   INP .reqv10
+   KS0 .reqv8
+   KS1 .reqv9
+   INP0.reqv10
+   INP1.reqv11
 
.macro  load_round_keys, rounds, rk
cmp \rounds, #12
@@ -350,90 +351,152 @@ CPU_LE(  rev x28, x28)
eor SHASH2.16b, SHASH2.16b, SHASH.16b
 
.if \enc == 1
-   ld1 {KS.16b}, [x27]
+   ld1 {KS0.16b-KS1.16b}, [x27]
.endif
 
-1: ld1 {CTR.8b}, [x24] // load upper counter
-   ld1 {INP.16b}, [x22], #16
+1: ld1 {INP0.16b-INP1.16b}, [x22], #32
+
rev x9, x28
-   add x28, x28, #1
-   sub w19, w19, #1
-   ins CTR.d[1], x9// set lower counter
+   add x10, x28, #1
+   add x28, x28, #2
 
.if \enc == 1
-   eor INP.16b, INP.16b, KS.16b// encrypt input
-   st1 {INP.16b}, [x21], #16
+   eor INP0.16b, INP0.16b, KS0.16b // encrypt input
+   eor INP1.16b, INP1.16b, KS1.16b
.endif
 
-   rev64   T1.16b, INP.16b
+   ld1 {KS0.8b}, [x24] // load upper counter
+   rev x10, x10
+   sub w19, w19, #2
+   mov KS1.8b, KS0.8b
+   ins KS0.d[1], x9// set lower counter
+   ins KS1.d[1], x10
+
+   rev64   T1.16b, INP0.16b
 
cmp w26, #12
b.ge4f  // AES-192/256?
 
-2: enc_round   CTR, v21
+2: enc_round   KS0, v21
+
+   ext T2.16b, XL.16b, XL.16b, #8
+   ext IN1.16b, T1.16b, T1.16b, #8
+
+   enc_round   KS1, v21
+
+   eor T1.16b, T1.16b, T2.16b
+   eor XL.16b, XL.16b, IN1.16b
+
+   enc_round   KS0, v22
+
+   pmull2  XH.1q, SHASH.2d, XL.2d  // a1 * b1
+   eor T1.16b, T1.16b, XL.16b
+
+   enc_round   KS1, v22
+
+   pmull   XL.1q, SHASH.1d, XL.1d  // a0 * b0
+   pmull   XM.1q, SHASH2.1d, T1.1d // (a1 + a0)(b1 + b0)
+
+   enc_round   KS0, v23
+
+   ext T1.16b, XL.16b, XH.16b, #8
+   eor T2.16b, XL.16b, XH.16b
+   eor XM.16b, XM.16b, T1.16b
+
+   enc_round   KS1, v23
+
+   eor XM.16b, XM.16b, T2.16b
+   pmull   T2.1q, XL.1d, MASK.1d
+
+   enc_round   KS0, v24
+
+   mov XH.d[0], XM.d[1]
+   mov XM.d[1], XL.d[0]
+
+   enc_round   KS1, v24
+
+   eor XL.16b, XM.16b, T2.16b
+
+   enc_round   KS0, v25
+
+   ext T2.16b, XL.16b, XL.16b, #8
+
+   enc_round   KS1, v25
+
+   pmull   XL.1q, XL.1d, MASK.1d
+   eor T2.16b, T2.16b, XH.16b
+
+   enc_round   KS0, v26
+
+   eor XL.16b, XL.16b, T2.16b
+   rev64   T1.16b, INP1.16b
+
+   enc_round   KS1, v26
 
ext T2.16b, XL.16b, XL.16b, #8
ext IN1.16b, T1.16b, T1.16b, #8
 
-   enc_round   CTR, v22
+   enc_round   KS0, v27
 
eor T1.16b, T1.16b, T2.16b
eor XL.16b, XL.16b, IN1.16b
 
-   enc_round   CTR, v23
+   enc_round   KS1, v27
 
pmull2  XH.1q, SHASH.2d, XL.2d  // a1 * b1
eor T1.16b, T1.16b, XL.16b
 
-   enc_round   CTR, v24
+   enc_round   KS0, v28
 
pmull   XL.1q, SHASH.1d, XL.1d  // a0 * b0
pmull   XM.1q, SHASH2.1d, T1.1d // (a1 + a0)(b1 + b0)
 
-   enc_round   CTR, v25
+   enc_round   KS1, v28
 
ext T1.16b, XL.16b, XH.16b, #8
eor T2.16b, XL.16b, XH.16b
eor

[PATCH 2/2] crypto/arm64: aes-ce-gcm - implement 2-way aggregation

2018-07-28 Thread Ard Biesheuvel
Implement a faster version of the GHASH transform which amortizes the
reduction modulo the characteristic polynomial across two input blocks at
a time. This is based on the Intel white paper "Carry-Less Multiplication
Instruction and its Usage for Computing the GCM Mode"

On a Cortex-A53, the gcm(aes) performance increases 24%, from 3.0 cycles per
byte to 2.4 cpb for large input sizes.

Signed-off-by: Ard Biesheuvel 
---
Raw numbers after the patch

 arch/arm64/crypto/ghash-ce-core.S | 87 +++-
 arch/arm64/crypto/ghash-ce-glue.c | 33 ++--
 2 files changed, 54 insertions(+), 66 deletions(-)

diff --git a/arch/arm64/crypto/ghash-ce-core.S b/arch/arm64/crypto/ghash-ce-core.S
index 437a2fb0f7f9..c144b526abe6 100644
--- a/arch/arm64/crypto/ghash-ce-core.S
+++ b/arch/arm64/crypto/ghash-ce-core.S
@@ -290,6 +290,11 @@ ENDPROC(pmull_ghash_update_p8)
KS1 .reqv9
INP0.reqv10
INP1.reqv11
+   HH  .reqv12
+   Hhl .reqv13
+   XLn .reqv14
+   XMn .reqv15
+   XHn .reqv16
 
.macro  load_round_keys, rounds, rk
cmp \rounds, #12
@@ -342,13 +347,13 @@ CPU_LE(   rev x28, x28)
 
 0: mov x0, x25
load_round_keys w26, x0
-   ld1 {SHASH.2d}, [x23]
+   add x1, x23, #32
+   ld1 {HH.2d-Hhl.2d}, [x23]
+   ld1 {SHASH.2d}, [x1]
ld1 {XL.2d}, [x20]
 
moviMASK.16b, #0xe1
-   ext SHASH2.16b, SHASH.16b, SHASH.16b, #8
shl MASK.2d, MASK.2d, #57
-   eor SHASH2.16b, SHASH2.16b, SHASH.16b
 
.if \enc == 1
ld1 {KS0.16b-KS1.16b}, [x27]
@@ -372,116 +377,82 @@ CPU_LE(  rev x28, x28)
ins KS0.d[1], x9// set lower counter
ins KS1.d[1], x10
 
-   rev64   T1.16b, INP0.16b
+   rev64   T1.16b, INP1.16b
 
cmp w26, #12
b.ge4f  // AES-192/256?
 
 2: enc_round   KS0, v21
-
-   ext T2.16b, XL.16b, XL.16b, #8
ext IN1.16b, T1.16b, T1.16b, #8
 
enc_round   KS1, v21
-
-   eor T1.16b, T1.16b, T2.16b
-   eor XL.16b, XL.16b, IN1.16b
+   pmull2  XHn.1q, SHASH.2d, IN1.2d// a1 * b1
 
enc_round   KS0, v22
-
-   pmull2  XH.1q, SHASH.2d, XL.2d  // a1 * b1
-   eor T1.16b, T1.16b, XL.16b
+   eor T1.16b, T1.16b, IN1.16b
 
enc_round   KS1, v22
-
-   pmull   XL.1q, SHASH.1d, XL.1d  // a0 * b0
-   pmull   XM.1q, SHASH2.1d, T1.1d // (a1 + a0)(b1 + b0)
+   pmull   XLn.1q, SHASH.1d, IN1.1d// a0 * b0
 
enc_round   KS0, v23
-
-   ext T1.16b, XL.16b, XH.16b, #8
-   eor T2.16b, XL.16b, XH.16b
-   eor XM.16b, XM.16b, T1.16b
+   pmull   XMn.1q, Hhl.1d, T1.1d   // (a1 + a0)(b1 + b0)
 
enc_round   KS1, v23
-
-   eor XM.16b, XM.16b, T2.16b
-   pmull   T2.1q, XL.1d, MASK.1d
+   rev64   T1.16b, INP0.16b
+   ext T2.16b, XL.16b, XL.16b, #8
 
enc_round   KS0, v24
-
-   mov XH.d[0], XM.d[1]
-   mov XM.d[1], XL.d[0]
+   ext IN1.16b, T1.16b, T1.16b, #8
+   eor T1.16b, T1.16b, T2.16b
 
enc_round   KS1, v24
-
-   eor XL.16b, XM.16b, T2.16b
+   eor XL.16b, XL.16b, IN1.16b
 
enc_round   KS0, v25
-
-   ext T2.16b, XL.16b, XL.16b, #8
+   pmull2  XH.1q, HH.2d, XL.2d // a1 * b1
 
enc_round   KS1, v25
-
-   pmull   XL.1q, XL.1d, MASK.1d
-   eor T2.16b, T2.16b, XH.16b
+   eor T1.16b, T1.16b, XL.16b
 
enc_round   KS0, v26
-
-   eor XL.16b, XL.16b, T2.16b
-   rev64   T1.16b, INP1.16b
+   pmull   XL.1q, HH.1d, XL.1d // a0 * b0
 
enc_round   KS1, v26
-
-   ext T2.16b, XL.16b, XL.16b, #8
-   ext IN1.16b, T1.16b, T1.16b, #8
+   pmull2  XM.1q, Hhl.2d, T1.2d// (a1 + a0)(b1 + b0)
 
enc_round   KS0, v27
-
-   eor T1.16b, T1.16b, T2.16b
-   eor XL.16b, XL.16b, IN1.16b
+   eor XH.16b, XH.16b, XHn.16b
+   eor XM.16b, XM.16b, XMn.16b
 
enc_round   KS1, v27
-
-   pmull2  XH.1q, SHASH.2d, XL.2d  // a1 * b1
-   eor T1.16b, T1.1

[PATCH] crypto/arm64: aes-ce-gcm - add missing kernel_neon_begin/end pair

2018-07-27 Thread Ard Biesheuvel
Calling pmull_gcm_encrypt_block() requires kernel_neon_begin() and
kernel_neon_end() to be used since the routine touches the NEON
register file. Add the missing calls.

Also, since NEON register contents are not preserved outside of
a kernel mode NEON region, pass the key schedule array again.

Fixes: 7c50136a8aba ("crypto: arm64/aes-ghash - yield NEON after every ...")
Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/ghash-ce-glue.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/crypto/ghash-ce-glue.c b/arch/arm64/crypto/ghash-ce-glue.c
index 7cf0b1aa6ea8..8a10f1d7199a 100644
--- a/arch/arm64/crypto/ghash-ce-glue.c
+++ b/arch/arm64/crypto/ghash-ce-glue.c
@@ -488,9 +488,13 @@ static int gcm_decrypt(struct aead_request *req)
err = skcipher_walk_done(&walk,
 walk.nbytes % AES_BLOCK_SIZE);
}
-   if (walk.nbytes)
-   pmull_gcm_encrypt_block(iv, iv, NULL,
+   if (walk.nbytes) {
+   kernel_neon_begin();
+   pmull_gcm_encrypt_block(iv, iv, ctx->aes_key.key_enc,
num_rounds(&ctx->aes_key));
+   kernel_neon_end();
+   }
+
} else {
__aes_arm64_encrypt(ctx->aes_key.key_enc, tag, iv,
num_rounds(&ctx->aes_key));
-- 
2.18.0



Re: [PATCH 0/4] crypto/arm64: reduce impact of NEON yield checks

2018-07-26 Thread Ard Biesheuvel
On 25 July 2018 at 18:50, bige...@linutronix.de  wrote:
> On 2018-07-25 11:54:53 [+0200], Ard Biesheuvel wrote:
>> Indeed. OTOH, if the -rt people (Sebastian?) turn up and say that a
>> 1000 cycle limit to the quantum of work performed with preemption
>> disabled is unreasonably low, we can increase the yield block counts
>> and approach the optimal numbers a bit closer. But with diminishing
>> returns.
>
> So I tested on SoftIron Overdrive 1000 which has A57 cores. I added this
> series and didn't notice any spikes. This means cyclictest reported a
> max value of like ~20us (which means the crypto code was not
> noticeable).
> I played a little with it and tcrypt tests for aes/sha1 and also no huge
> spikes. So at this point this looks fantastic. I also setup cryptsetup /
> dm-crypt with the usual xts(aes) mode and saw no spikes.
> At this point, on this hardware if you want to raise the block count, I
> wouldn't mind.
>
> I remember on x86 the SIMD accelerated ciphers led to ~1ms+ spikes once
> dm-crypt started its jobs.
>

Thanks a lot.

So 20 us ~= 20,000 cycles on my 1 GHz Cortex-A53, and if I am
understanding you correctly, you wouldn't mind the quantum of work to
be in the order 16,000 cycles or even substantially more?

That is good news, but it is also rather interesting, given that these
algorithms run at ~4 cycles per byte, meaning that you'd manage an
entire 4 KB page without ever yielding. (GCM is used on network
packets, XTS on disk sectors which are all smaller than that)
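
(For reference: 4096 bytes x 4 cycles/byte = 16384 cycles, i.e. ~16 us
at 1 GHz, which fits comfortably within the ~20 us figure reported
above.)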

Do you remember how you found out NEON use is a problem for -rt on
arm64 in the first place? Which algorithm did you test at the time to
arrive at this conclusion?

Note that AES-GCM using ordinary SIMD instructions runs at 29 cpb, and
plain AES at ~20 (on A53), so perhaps it would make sense to
distinguish between algos using crypto instructions and ones using
plain SIMD.


Re: [PATCH 0/4] crypto/arm64: reduce impact of NEON yield checks

2018-07-25 Thread Ard Biesheuvel
On 25 July 2018 at 11:45, Dave Martin  wrote:
> On Wed, Jul 25, 2018 at 10:23:00AM +0100, Ard Biesheuvel wrote:
>> On 25 July 2018 at 11:09, Dave Martin  wrote:
>> > On Tue, Jul 24, 2018 at 06:12:20PM +0100, Ard Biesheuvel wrote:
>> >> Vakul reports a considerable performance hit when running the accelerated
>> >> arm64 crypto routines with CONFIG_PREEMPT=y configured, now that thay have
>> >> been updated to take the TIF_NEED_RESCHED flag into account.
>> >>
>> >> The issue appears to be caused by the fact that Cortex-A53, the core in
>> >> question, has a high end implementation of the Crypto Extensions, and
>> >> has a shallow pipeline, which means even sequential algorithms that may be
>> >> held back by pipeline stalls on high end out of order cores run at maximum
>> >> speed. This means SHA-1, SHA-2, GHASH and AES in GCM and CCM modes run
>> >> at a speed in the order of 2 to 4 cycles per byte, and are currently
>> >> implemented to check the TIF_NEED_RESCHED after each iteration, which
>> >> may process as little as 16 bytes (for GHASH).
>> >>
>> >> Obviously, every cycle of overhead hurts in this context, and given that
>> >> the A53's load/store unit is not quite high end, any delays caused by
>> >> memory accesses that occur in the inner loop of the algorithms are going
>> >> to be quite significant, hence the performance regression.
>> >>
>> >> So reduce the frequency at which the NEON yield checks are performed, so
>> >> that they occur roughly once every 1000 cycles, which is hopefully a
>> >> reasonable tradeoff between throughput and worst case scheduling latency.
>> >
>> > Is there any way to tune this automatically, or is that likely to be more
>> > trouble than it's worth?
>> >
>>
>> Good question. I think A53 is a reasonable worst case, and these
>> changes reduce the impact to the same ballpark as the impact of
>> enabling CONFIG_PREEMPT in the first place.
>>
>> > Also, how did you come up with 1000 cycles?  At what point does
>> > preemption latency become more/less important than throughput?
>> >
>>
>> Another good question. I was hoping Sebastian or the other -rt folks
>> would be triggered by this. Given the above, I ended up with a ~1000
>> cycles quantum, and hopefully this is considered to be small enough.
>>
>> > Maybe someone already invented a similar framework somewhere else in the
>> > kernel.  I seem to remember some automatic selection of memcpy
>> > implementation based on a boot-time benchmark, but I may have
>> > misremembered.
>> >
>>
>> We have crypto benchmarking code in the kernel, and at some point, I
>> even did some work on selecting the best algo based on performance.
>>
>> But to be honest, I think this is a bit overkill. If you need those
>> final 5% of throughput at any cost, you're better off running with
>> CONFIG_PREEMPT=n anyway.
>
> Can't really argue with any of that -- I was just wondering whether
> there was precedent.
>
> Hopefully the ~1000 cycles ballpark will satisfy most people.  For
> the rest, it's too bad: if somebody is relying on the last 1-2% of
> performance, they probably have a broken use case.
>

Indeed. OTOH, if the -rt people (Sebastian?) turn up and say that a
1000 cycle limit to the quantum of work performed with preemption
disabled is unreasonably low, we can increase the yield block counts
and approach the optimal numbers a bit closer. But with diminishing
returns.

Also, if the cost of enabling CONFIG_PREEMPT by itself is
significantly reduced, e.g., by moving the per-CPU offset into a GPR,
we can always revisit this of course.


Re: [PATCH 1/4] crypto/arm64: ghash - reduce performance impact of NEON yield checks

2018-07-25 Thread Ard Biesheuvel
On 25 July 2018 at 11:48, Dave Martin  wrote:
> On Wed, Jul 25, 2018 at 10:11:42AM +0100, Ard Biesheuvel wrote:
>> On 25 July 2018 at 11:05, Dave Martin  wrote:
>> > On Tue, Jul 24, 2018 at 06:12:21PM +0100, Ard Biesheuvel wrote:
>> >> As reported by Vakul, checking the TIF_NEED_RESCHED flag after every
>> >> iteration of the GHASH and AES-GCM core routines is having a considerable
>> >> performance impact on cores such as the Cortex-A53 with Crypto Extensions
>> >> implemented.
>> >>
>> >> GHASH performance is down by 22% for large block sizes, and AES-GCM is
>> >> down by 16% for large block sizes and 128 bit keys. This appears to be
>> >> a result of the high performance of the crypto instructions on the one
>> >> hand (2.0 cycles per byte for GHASH, 3.0 cpb for AES-GCM), combined with
>> >> the relatively poor load/store performance of this simple core.
>> >>
>> >> So let's reduce this performance impact by only doing the yield check
>> >> once every 32 blocks for GHASH (or 4 when using the version based on
>> >> 8-bit polynomial multiplication), and once every 16 blocks for AES-GCM.
>> >> This way, we recover most of the performance while still limiting the
>> >> duration of scheduling blackouts due to disabling preemption to ~1000
>> >> cycles.
>> >>
>> >> Cc: Vakul Garg 
>> >> Signed-off-by: Ard Biesheuvel 
>> >> ---
>> >>  arch/arm64/crypto/ghash-ce-core.S | 12 +---
>> >>  1 file changed, 9 insertions(+), 3 deletions(-)
>> >>
>> >> diff --git a/arch/arm64/crypto/ghash-ce-core.S b/arch/arm64/crypto/ghash-ce-core.S
>> >> index dcffb9e77589..9c14beaabeee 100644
>> >> --- a/arch/arm64/crypto/ghash-ce-core.S
>> >> +++ b/arch/arm64/crypto/ghash-ce-core.S
>> >> @@ -212,7 +212,7 @@
>> >>   ushr    XL.2d, XL.2d, #1
>> >>   .endm
>> >>
>> >> - .macro  __pmull_ghash, pn
>> >> + .macro  __pmull_ghash, pn, yield_count
>> >>   frame_push  5
>> >>
>> >>   mov x19, x0
>> >> @@ -259,6 +259,9 @@ CPU_LE(   rev64   T1.16b, T1.16b  )
>> >>   eor T2.16b, T2.16b, XH.16b
>> >>   eor XL.16b, XL.16b, T2.16b
>> >>
>> >> + tst w19, #(\yield_count - 1)
>> >
>> > This should perhaps be (\yield_count) - 1.
>> >
>> > It would be a bit silly to pass a non-trivial expression for yield_count
>> > though.
>> >
>> >> + b.ne    1b
>> >> +
>> >
>> > Is it worth having a build-time check that \yield_count is a power of two?
>> > (i.e., (\yield_count) & ((\yield_count) - 1) != 0).  We could have a
>> > generic macro for that.
>> >
>> > Otherwise this code may poll more often than expected.  Not the end of
>> > the world, though.
>> >
>>
>> Thanks for taking a look.
>>
>> Given that the macro in question is used in exactly two places in the
>> same file, and is unlikely to be reused unless the architecture gains
>> support for another optional instruction that can be used as a drop-in
>> replacement, I don't think any of this truly matters tbh.
>
> Fair enough.  A comment on the macro definition might help, but beyond
> that it is probably overkill.
>

Sure. I'll add that in v2.
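
For reference, a minimal sketch of such a power-of-two guard, written here as a C analogue rather than the actual GNU assembler macro (the ASSERT_POW2 name is made up for illustration):

#include <linux/build_bug.h>

/* Evaluates to 0; fails the build if n is not a power of two. */
#define ASSERT_POW2(n)	BUILD_BUG_ON_ZERO(((n) & ((n) - 1)) != 0)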


Re: [PATCH 0/4] crypto/arm64: reduce impact of NEON yield checks

2018-07-25 Thread Ard Biesheuvel
On 25 July 2018 at 11:09, Dave Martin  wrote:
> On Tue, Jul 24, 2018 at 06:12:20PM +0100, Ard Biesheuvel wrote:
>> Vakul reports a considerable performance hit when running the accelerated
>> arm64 crypto routines with CONFIG_PREEMPT=y configured, now that they have
>> been updated to take the TIF_NEED_RESCHED flag into account.
>>
>> The issue appears to be caused by the fact that Cortex-A53, the core in
>> question, has a high end implementation of the Crypto Extensions, and
>> has a shallow pipeline, which means even sequential algorithms that may be
>> held back by pipeline stalls on high end out of order cores run at maximum
>> speed. This means SHA-1, SHA-2, GHASH and AES in GCM and CCM modes run at a
>> speed in the order of 2 to 4 cycles per byte, and are currently implemented
>> to check the TIF_NEED_RESCHED after each iteration, which may process as
>> little as 16 bytes (for GHASH).
>>
>> Obviously, every cycle of overhead hurts in this context, and given that
>> the A53's load/store unit is not quite high end, any delays caused by
>> memory accesses that occur in the inner loop of the algorithms are going
>> to be quite significant, hence the performance regression.
>>
>> So reduce the frequency at which the NEON yield checks are performed, so
>> that they occur roughly once every 1000 cycles, which is hopefully a
>> reasonable tradeoff between throughput and worst case scheduling latency.
>
> Is there any way to tune this automatically, or is that likely to be more
> trouble than it's worth?
>

Good question. I think A53 is a reasonable worst case, and these
changes reduce the impact to the same ballpark as the impact of
enabling CONFIG_PREEMPT in the first place.

> Also, how did you come up with 1000 cycles?  At what point does
> preemption latency become more/less important than throughput?
>

Another good question. I was hoping Sebastian or the other -rt folks
would be triggered by this. Given the above, I ended up with a ~1000
cycles quantum, and hopefully this is considered to be small enough.

> Maybe someone already invented a similar framework somewhere else in the
> kernel.  I seem to remember some automatic selection of memcpy
> implementation based on a boot-time benchmark, but I may have
> misremembered.
>

We have crypto benchmarking code in the kernel, and at some point, I
even did some work on selecting the best algo based on performance.

But to be honest, I think this is a bit overkill. If you need those
final 5% of throughput at any cost, you're better off running with
CONFIG_PREEMPT=n anyway.


Re: [PATCH 1/4] crypto/arm64: ghash - reduce performance impact of NEON yield checks

2018-07-25 Thread Ard Biesheuvel
On 25 July 2018 at 11:05, Dave Martin  wrote:
> On Tue, Jul 24, 2018 at 06:12:21PM +0100, Ard Biesheuvel wrote:
>> As reported by Vakul, checking the TIF_NEED_RESCHED flag after every
>> iteration of the GHASH and AES-GCM core routines is having a considerable
>> performance impact on cores such as the Cortex-A53 with Crypto Extensions
>> implemented.
>>
>> GHASH performance is down by 22% for large block sizes, and AES-GCM is
>> down by 16% for large block sizes and 128 bit keys. This appears to be
>> a result of the high performance of the crypto instructions on the one
>> hand (2.0 cycles per byte for GHASH, 3.0 cpb for AES-GCM), combined with
>> the relatively poor load/store performance of this simple core.
>>
>> So let's reduce this performance impact by only doing the yield check
>> once every 32 blocks for GHASH (or 4 when using the version based on
>> 8-bit polynomial multiplication), and once every 16 blocks for AES-GCM.
>> This way, we recover most of the performance while still limiting the
>> duration of scheduling blackouts due to disabling preemption to ~1000
>> cycles.
>>
>> Cc: Vakul Garg 
>> Signed-off-by: Ard Biesheuvel 
>> ---
>>  arch/arm64/crypto/ghash-ce-core.S | 12 +---
>>  1 file changed, 9 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/arm64/crypto/ghash-ce-core.S b/arch/arm64/crypto/ghash-ce-core.S
>> index dcffb9e77589..9c14beaabeee 100644
>> --- a/arch/arm64/crypto/ghash-ce-core.S
>> +++ b/arch/arm64/crypto/ghash-ce-core.S
>> @@ -212,7 +212,7 @@
>>   ushr    XL.2d, XL.2d, #1
>>   .endm
>>
>> - .macro  __pmull_ghash, pn
>> + .macro  __pmull_ghash, pn, yield_count
>>   frame_push  5
>>
>>   mov x19, x0
>> @@ -259,6 +259,9 @@ CPU_LE(   rev64   T1.16b, T1.16b  )
>>   eor T2.16b, T2.16b, XH.16b
>>   eor XL.16b, XL.16b, T2.16b
>>
>> + tst w19, #(\yield_count - 1)
>
> This should perhaps be (\yield_count) - 1.
>
> It would be a bit silly to pass a non-trivial expression for yield_count
> though.
>
>> + b.ne    1b
>> +
>
> Is it worth having a build-time check that \yield_count is a power of two?
> (i.e., (\yield_count) & ((\yield_count) - 1) != 0).  We could have a
> generic macro for that.
>
> Otherwise this code may poll more often than expected.  Not the end of
> the world, though.
>

Thanks for taking a look.

Given that the macro in question is used in exactly two places in the
same file, and is unlikely to be reused unless the architecture gains
support for another optional instruction that can be used as a drop-in
replacement, I don't think any of this truly matters tbh.


Re: [PATCH 1/4] crypto/arm64: ghash - reduce performance impact of NEON yield checks

2018-07-25 Thread Ard Biesheuvel
On 25 July 2018 at 09:27, Ard Biesheuvel  wrote:
> (+ Mark)
>
> On 25 July 2018 at 08:57, Vakul Garg  wrote:
>>
>>
>>> -Original Message-
>>> From: Ard Biesheuvel [mailto:ard.biesheu...@linaro.org]
>>> Sent: Tuesday, July 24, 2018 10:42 PM
>>> To: linux-crypto@vger.kernel.org
>>> Cc: herb...@gondor.apana.org.au; will.dea...@arm.com;
>>> dave.mar...@arm.com; Vakul Garg ;
>>> bige...@linutronix.de; Ard Biesheuvel 
>>> Subject: [PATCH 1/4] crypto/arm64: ghash - reduce performance impact of
>>> NEON yield checks
>>>
>>> As reported by Vakul, checking the TIF_NEED_RESCHED flag after every
>>> iteration of the GHASH and AES-GCM core routines is having a considerable
>>> performance impact on cores such as the Cortex-A53 with Crypto Extensions
>>> implemented.
>>>
>>> GHASH performance is down by 22% for large block sizes, and AES-GCM is
>>> down by 16% for large block sizes and 128 bit keys. This appears to be a
>>> result of the high performance of the crypto instructions on the one hand
>>> (2.0 cycles per byte for GHASH, 3.0 cpb for AES-GCM), combined with the
>>> relatively poor load/store performance of this simple core.
>>>
>>> So let's reduce this performance impact by only doing the yield check once
>>> every 32 blocks for GHASH (or 4 when using the version based on 8-bit
>>> polynomial multiplication), and once every 16 blocks for AES-GCM.
>>> This way, we recover most of the performance while still limiting the
>>> duration of scheduling blackouts due to disabling preemption to ~1000
>>> cycles.
>>
>> I tested this patch. It helped, but didn't bring performance back to the
>> previous level. Are there more files remaining to be fixed? (Your original
>> patch series adding the preemptability checks touched a lot more files than
>> the 4 in this series.)
>>
>> Instead of using a hardcoded 32-block/16-block limit, should it be
>> controlled via Kconfig? I believe different cores could require different
>> values.
>>
>
> Simply enabling CONFIG_PREEMPT already causes an 8% performance hit on
> my 24xA53 system, probably because each per-CPU variable access
> involves disabling and re-enabling preemption, turning every per-CPU
> load into 2 loads and a store,

Actually, more like

load/store
load
load/store

so 3 loads and 2 stores.



> which hurts on this particular core.
> Mark and I have played around a bit with using a GPR to record the
> per-CPU offset, which would make this unnecessary, but this has its
> own set of problems so that is not expected to land any time soon.
>
> So if you care that much about squeezing the last drop of throughput
> out of your system without regard for worst case scheduling latency,
> disabling CONFIG_PREEMPT is a much better idea than playing around
> with tunables to tweak the maximum quantum of work that is executed
> with preemption disabled, especially since distro kernels will pick
> the default anyway.


Re: [PATCH 1/4] crypto/arm64: ghash - reduce performance impact of NEON yield checks

2018-07-25 Thread Ard Biesheuvel
(+ Mark)

On 25 July 2018 at 08:57, Vakul Garg  wrote:
>
>
>> -Original Message-----
>> From: Ard Biesheuvel [mailto:ard.biesheu...@linaro.org]
>> Sent: Tuesday, July 24, 2018 10:42 PM
>> To: linux-crypto@vger.kernel.org
>> Cc: herb...@gondor.apana.org.au; will.dea...@arm.com;
>> dave.mar...@arm.com; Vakul Garg ;
>> bige...@linutronix.de; Ard Biesheuvel 
>> Subject: [PATCH 1/4] crypto/arm64: ghash - reduce performance impact of
>> NEON yield checks
>>
>> As reported by Vakul, checking the TIF_NEED_RESCHED flag after every
>> iteration of the GHASH and AES-GCM core routines is having a considerable
>> performance impact on cores such as the Cortex-A53 with Crypto Extensions
>> implemented.
>>
>> GHASH performance is down by 22% for large block sizes, and AES-GCM is
>> down by 16% for large block sizes and 128 bit keys. This appears to be a
>> result of the high performance of the crypto instructions on the one hand
>> (2.0 cycles per byte for GHASH, 3.0 cpb for AES-GCM), combined with the
>> relatively poor load/store performance of this simple core.
>>
>> So let's reduce this performance impact by only doing the yield check once
>> every 32 blocks for GHASH (or 4 when using the version based on 8-bit
>> polynomial multiplication), and once every 16 blocks for AES-GCM.
>> This way, we recover most of the performance while still limiting the
>> duration of scheduling blackouts due to disabling preemption to ~1000
>> cycles.
>
> I tested this patch. It helped, but didn't bring performance back to the
> previous level. Are there more files remaining to be fixed? (Your original
> patch series adding the preemptability checks touched a lot more files than
> the 4 in this series.)
>
> Instead of using a hardcoded 32-block/16-block limit, should it be
> controlled via Kconfig? I believe different cores could require different
> values.
>

Simply enabling CONFIG_PREEMPT already causes an 8% performance hit on
my 24xA53 system, probably because each per-CPU variable access
involves disabling and re-enabling preemption, turning every per-CPU
load into 2 loads and a store, which hurts on this particular core.
Mark and I have played around a bit with using a GPR to record the
per-CPU offset, which would make this unnecessary, but this has its
own set of problems so that is not expected to land any time soon.

So if you care that much about squeezing the last drop of throughput
out of your system without regard for worst case scheduling latency,
disabling CONFIG_PREEMPT is a much better idea than playing around
with tunables to tweak the maximum quantum of work that is executed
with preemption disabled, especially since distro kernels will pick
the default anyway.
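
To illustrate where those extra memory accesses come from, here is a rough sketch of the pattern a preemptible per-CPU read expands to (the per-CPU variable and function names are purely illustrative, and the exact load/store count depends on the configuration):

#include <linux/percpu.h>
#include <linux/preempt.h>

static DEFINE_PER_CPU(unsigned int, example_counter);

static unsigned int read_example_counter(void)
{
	unsigned int v;

	preempt_disable();			/* load + store of the preempt count */
	v = __this_cpu_read(example_counter);	/* the per-CPU load itself */
	preempt_enable();			/* load + store, plus a resched check */

	return v;
}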


Re: [PATCH] crypto: arm/chacha20 - always use vrev for 16-bit rotates

2018-07-25 Thread Ard Biesheuvel
On 25 July 2018 at 03:29, Eric Biggers  wrote:
> From: Eric Biggers 
>
> The 4-way ChaCha20 NEON code implements 16-bit rotates with vrev32.16,
> but the one-way code (used on remainder blocks) implements it with
> vshl + vsri, which is slower.  Switch the one-way code to vrev32.16 too.
>
> Signed-off-by: Eric Biggers 

Acked-by: Ard Biesheuvel 
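
The equivalence is easy to see in scalar C: rotating a 32-bit word left by 16 simply swaps its two 16-bit halves, which is exactly what vrev32.16 does within each 32-bit lane (a minimal illustration, not kernel code):

#include <stdint.h>

/* rotl32(x, 16): the two halfwords of x trade places. */
static inline uint32_t rotl32_by_16(uint32_t x)
{
	return (x << 16) | (x >> 16);
}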

> ---
>  arch/arm/crypto/chacha20-neon-core.S | 10 --
>  1 file changed, 4 insertions(+), 6 deletions(-)
>
> diff --git a/arch/arm/crypto/chacha20-neon-core.S b/arch/arm/crypto/chacha20-neon-core.S
> index 3fecb2124c35..451a849ad518 100644
> --- a/arch/arm/crypto/chacha20-neon-core.S
> +++ b/arch/arm/crypto/chacha20-neon-core.S
> @@ -51,9 +51,8 @@ ENTRY(chacha20_block_xor_neon)
>  .Ldoubleround:
> // x0 += x1, x3 = rotl32(x3 ^ x0, 16)
> vadd.i32    q0, q0, q1
> -   veor        q4, q3, q0
> -   vshl.u32    q3, q4, #16
> -   vsri.u32    q3, q4, #16
> +   veor        q3, q3, q0
> +   vrev32.16   q3, q3
>
> // x2 += x3, x1 = rotl32(x1 ^ x2, 12)
> vadd.i32    q2, q2, q3
> @@ -82,9 +81,8 @@ ENTRY(chacha20_block_xor_neon)
>
> // x0 += x1, x3 = rotl32(x3 ^ x0, 16)
> vadd.i32    q0, q0, q1
> -   veor        q4, q3, q0
> -   vshl.u32    q3, q4, #16
> -   vsri.u32    q3, q4, #16
> +   veor        q3, q3, q0
> +   vrev32.16   q3, q3
>
> // x2 += x3, x1 = rotl32(x1 ^ x2, 12)
> vadd.i32    q2, q2, q3
> --
> 2.18.0
>


[PATCH 1/4] crypto/arm64: ghash - reduce performance impact of NEON yield checks

2018-07-24 Thread Ard Biesheuvel
As reported by Vakul, checking the TIF_NEED_RESCHED flag after every
iteration of the GHASH and AES-GCM core routines is having a considerable
performance impact on cores such as the Cortex-A53 with Crypto Extensions
implemented.

GHASH performance is down by 22% for large block sizes, and AES-GCM is
down by 16% for large block sizes and 128 bit keys. This appears to be
a result of the high performance of the crypto instructions on the one
hand (2.0 cycles per byte for GHASH, 3.0 cpb for AES-GCM), combined with
the relatively poor load/store performance of this simple core.

So let's reduce this performance impact by only doing the yield check
once every 32 blocks for GHASH (or 4 when using the version based on
8-bit polynomial multiplication), and once every 16 blocks for AES-GCM.
This way, we recover most of the performance while still limiting the
duration of scheduling blackouts due to disabling preemption to ~1000
cycles.
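
As a back-of-the-envelope check of those intervals, using the cycles-per-byte figures quoted above (a stand-alone illustration, not kernel code):

#include <stdio.h>

int main(void)
{
	const int block = 16;	/* bytes per GHASH/AES block */

	/* ~1024 cycles between yield checks for the PMULL-based GHASH */
	printf("GHASH (p64): %.0f cycles per 32-block quantum\n", 32 * block * 2.0);
	/* ~768 cycles for AES-GCM at ~3 cycles per byte */
	printf("AES-GCM:     %.0f cycles per 16-block quantum\n", 16 * block * 3.0);
	return 0;
}

The 8-bit polynomial (p8) fallback is far slower per byte, which is why it only runs 4 blocks between checks.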

Cc: Vakul Garg 
Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/ghash-ce-core.S | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/crypto/ghash-ce-core.S b/arch/arm64/crypto/ghash-ce-core.S
index dcffb9e77589..9c14beaabeee 100644
--- a/arch/arm64/crypto/ghash-ce-core.S
+++ b/arch/arm64/crypto/ghash-ce-core.S
@@ -212,7 +212,7 @@
ushr    XL.2d, XL.2d, #1
.endm
 
-   .macro  __pmull_ghash, pn
+   .macro  __pmull_ghash, pn, yield_count
frame_push  5
 
mov x19, x0
@@ -259,6 +259,9 @@ CPU_LE( rev64   T1.16b, T1.16b  )
eor T2.16b, T2.16b, XH.16b
eor XL.16b, XL.16b, T2.16b
 
+   tst w19, #(\yield_count - 1)
+   b.ne    1b
+
cbz w19, 3f
 
if_will_cond_yield_neon
@@ -279,11 +282,11 @@ CPU_LE(   rev64   T1.16b, T1.16b  )
 * struct ghash_key const *k, const char *head)
 */
 ENTRY(pmull_ghash_update_p64)
-   __pmull_ghash   p64
+   __pmull_ghash   p64, 32
 ENDPROC(pmull_ghash_update_p64)
 
 ENTRY(pmull_ghash_update_p8)
-   __pmull_ghash   p8
+   __pmull_ghash   p8, 4
 ENDPROC(pmull_ghash_update_p8)
 
KS      .req    v8
@@ -428,6 +431,9 @@ CPU_LE( rev x28, x28)
st1 {INP.16b}, [x21], #16
.endif
 
+   tst w19, #0xf   // do yield check only
+   b.ne    1b  // once every 16 blocks
+
cbz w19, 3f
 
if_will_cond_yield_neon
-- 
2.11.0



[PATCH 0/4] crypto/arm64: reduce impact of NEON yield checks

2018-07-24 Thread Ard Biesheuvel
Vakul reports a considerable performance hit when running the accelerated
arm64 crypto routines with CONFIG_PREEMPT=y configured, now that they have
been updated to take the TIF_NEED_RESCHED flag into account.

The issue appears to be caused by the fact that Cortex-A53, the core in
question, has a high end implementation of the Crypto Extensions, and
has a shallow pipeline, which means even sequential algorithms that may be
held back by pipeline stalls on high end out of order cores run at maximum
speed. This means SHA-1, SHA-2, GHASH and AES in GCM and CCM modes run at a
speed in the order of 2 to 4 cycles per byte, and are currently implemented
to check the TIF_NEED_RESCHED after each iteration, which may process as
little as 16 bytes (for GHASH).

Obviously, every cycle of overhead hurts in this context, and given that
the A53's load/store unit is not quite high end, any delays caused by
memory accesses that occur in the inner loop of the algorithms are going
to be quite significant, hence the performance regression.

So reduce the frequency at which the NEON yield checks are performed, so
that they occur roughly once every 1000 cycles, which is hopefully a
reasonable tradeoff between throughput and worst case scheduling latency.

Ard Biesheuvel (4):
  crypto/arm64: ghash - reduce performance impact of NEON yield checks
  crypto/arm64: aes-ccm - reduce performance impact of NEON yield checks
  crypto/arm64: sha1 - reduce performance impact of NEON yield checks
  crypto/arm64: sha2 - reduce performance impact of NEON yield checks

 arch/arm64/crypto/aes-ce-ccm-core.S |  3 +++
 arch/arm64/crypto/ghash-ce-core.S   | 12 +---
 arch/arm64/crypto/sha1-ce-core.S|  3 +++
 arch/arm64/crypto/sha2-ce-core.S|  3 +++
 4 files changed, 18 insertions(+), 3 deletions(-)

-- 
2.11.0



[PATCH 3/4] crypto/arm64: sha1 - reduce performance impact of NEON yield checks

2018-07-24 Thread Ard Biesheuvel
Only perform the NEON yield check for every 4 blocks of input, to
prevent taking a considerable performance hit on cores with very
fast crypto instructions and comparatively slow memory accesses,
such as the Cortex-A53.

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/sha1-ce-core.S | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/arm64/crypto/sha1-ce-core.S b/arch/arm64/crypto/sha1-ce-core.S
index 78eb35fb5056..f592c55218d0 100644
--- a/arch/arm64/crypto/sha1-ce-core.S
+++ b/arch/arm64/crypto/sha1-ce-core.S
@@ -129,6 +129,9 @@ CPU_LE( rev32   v11.16b, v11.16b)
add dgbv.2s, dgbv.2s, dg1v.2s
add dgav.4s, dgav.4s, dg0v.4s
 
+   tst w21, #0x3   // yield only every 4 blocks
+   b.ne    1b
+
cbz w21, 3f
 
if_will_cond_yield_neon
-- 
2.11.0



[PATCH 4/4] crypto/arm64: sha2 - reduce performance impact of NEON yield checks

2018-07-24 Thread Ard Biesheuvel
Only perform the NEON yield check for every 4 blocks of input, to
prevent taking a considerable performance hit on cores with very
fast crypto instructions and comparatively slow memory accesses,
such as the Cortex-A53.

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/sha2-ce-core.S | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/arm64/crypto/sha2-ce-core.S b/arch/arm64/crypto/sha2-ce-core.S
index cd8b36412469..201a33ff6830 100644
--- a/arch/arm64/crypto/sha2-ce-core.S
+++ b/arch/arm64/crypto/sha2-ce-core.S
@@ -136,6 +136,9 @@ CPU_LE( rev32   v19.16b, v19.16b)
add dgav.4s, dgav.4s, dg0v.4s
add dgbv.4s, dgbv.4s, dg1v.4s
 
+   tst w21, #0x3   // yield only every 4 blocks
+   b.ne    1b
+
/* handled all input blocks? */
cbz w21, 3f
 
-- 
2.11.0



[PATCH 2/4] crypto/arm64: aes-ccm - reduce performance impact of NEON yield checks

2018-07-24 Thread Ard Biesheuvel
Only perform the NEON yield check for every 8 blocks of input, to
prevent taking a considerable performance hit on cores with very
fast crypto instructions and comparatively slow memory accesses,
such as the Cortex-A53.

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/aes-ce-ccm-core.S | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/arm64/crypto/aes-ce-ccm-core.S b/arch/arm64/crypto/aes-ce-ccm-core.S
index 88f5aef7934c..627710cdc220 100644
--- a/arch/arm64/crypto/aes-ce-ccm-core.S
+++ b/arch/arm64/crypto/aes-ce-ccm-core.S
@@ -208,6 +208,9 @@ CPU_LE( rev x26, x26)   /* keep swabbed ctr in reg */
st1 {v1.16b}, [x19], #16    /* write output block */
beq 5f
 
+   tst w21, #(0x7 * 16)    /* yield every 8 blocks */
+   b.ne    0b
+
if_will_cond_yield_neon
st1 {v0.16b}, [x24] /* store mac */
do_cond_yield_neon
-- 
2.11.0



Re: [PATCH] crypto: arm/speck - fix building in Thumb2 mode

2018-06-19 Thread Ard Biesheuvel
On 19 June 2018 at 00:33, Eric Biggers  wrote:
> Building the kernel with CONFIG_THUMB2_KERNEL=y and
> CONFIG_CRYPTO_SPECK_NEON set fails with the following errors:
>
> arch/arm/crypto/speck-neon-core.S: Assembler messages:
>
> arch/arm/crypto/speck-neon-core.S:419: Error: r13 not allowed here -- `bic sp,#0xf'
> arch/arm/crypto/speck-neon-core.S:423: Error: r13 not allowed here -- `bic sp,#0xf'
> arch/arm/crypto/speck-neon-core.S:427: Error: r13 not allowed here -- `bic sp,#0xf'
> arch/arm/crypto/speck-neon-core.S:431: Error: r13 not allowed here -- `bic sp,#0xf'
>
> The problem is that the 'bic' instruction can't operate on the 'sp'
> register in Thumb2 mode.  Fix it by using a temporary register.  This
> isn't in the main loop, so the performance difference is negligible.
> This also matches what aes-neonbs-core.S does.
>
> Reported-by: Stefan Agner 
> Fixes: ede9622162fa ("crypto: arm/speck - add NEON-accelerated implementation of Speck-XTS")
> Signed-off-by: Eric Biggers 
> ---
>  arch/arm/crypto/speck-neon-core.S | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm/crypto/speck-neon-core.S b/arch/arm/crypto/speck-neon-core.S
> index 3c1e203e53b9..57caa742016e 100644
> --- a/arch/arm/crypto/speck-neon-core.S
> +++ b/arch/arm/crypto/speck-neon-core.S
> @@ -272,9 +272,11 @@
>  * Allocate stack space to store 128 bytes worth of tweaks.  For
>  * performance, this space is aligned to a 16-byte boundary so that we
>  * can use the load/store instructions that declare 16-byte alignment.
> +* For Thumb2 compatibility, don't do the 'bic' directly on 'sp'.
>  */
> -   sub sp, #128
> -   bic sp, #0xf
> +   sub r12, sp, #128
> +   bic r12, #0xf
> +   mov sp, r12
>
>  .if \n == 64
> // Load first tweak

Acked-by: Ard Biesheuvel 


Re: [PATCH v3 3/5] crypto: arm/speck - add NEON-accelerated implementation of Speck-XTS

2018-06-18 Thread Ard Biesheuvel
On 18 June 2018 at 23:56, Eric Biggers  wrote:
> On Sun, Jun 17, 2018 at 01:10:41PM +0200, Ard Biesheuvel wrote:
>> >>>>> +
>> >>>>> + // One-time XTS preparation
>> >>>>> +
>> >>>>> + /*
>> >>>>> +  * Allocate stack space to store 128 bytes worth of tweaks.  For
>> >>>>> +  * performance, this space is aligned to a 16-byte boundary so 
>> >>>>> that we
>> >>>>> +  * can use the load/store instructions that declare 16-byte 
>> >>>>> alignment.
>> >>>>> +  */
>> >>>>> + sub sp, #128
>> >>>>> + bic sp, #0xf
>> >>>>
>> >>>>
>> >>>> This fails here when building with CONFIG_THUMB2_KERNEL=y
>> >>>>
>> >>>>   AS  arch/arm/crypto/speck-neon-core.o
>> >>>>
>> >>>> arch/arm/crypto/speck-neon-core.S: Assembler messages:
>> >>>>
>> >>>> arch/arm/crypto/speck-neon-core.S:419: Error: r13 not allowed here --
>> >>>> `bic sp,#0xf'
>> >>>> arch/arm/crypto/speck-neon-core.S:423: Error: r13 not allowed here --
>> >>>> `bic sp,#0xf'
>> >>>> arch/arm/crypto/speck-neon-core.S:427: Error: r13 not allowed here --
>> >>>> `bic sp,#0xf'
>> >>>> arch/arm/crypto/speck-neon-core.S:431: Error: r13 not allowed here --
>> >>>> `bic sp,#0xf'
>> >>>>
>> >>>> In a quick hack this change seems to address it:
>> >>>>
>> >>>>
>> >>>> -   sub sp, #128
>> >>>> -   bic sp, #0xf
>> >>>> +   mov r6, sp
>> >>>> +   sub r6, #128
>> >>>> +   bic r6, #0xf
>> >>>> +   mov sp, r6
>> >>>>
>> >>>> But there is probably a better solution to address this.
>> >>>>
>> >>>
>> >>> Given that there is no NEON on M class cores, I recommend we put 
>> >>> something like
>> >>>
>> >>> THUMB(bx pc)
>> >>> THUMB(nop.w)
>> >>> THUMB(.arm)
>> >>>
>> >>> at the beginning and be done with it.
>> >>
>> >> I mean nop.n or just nop, of course, and we may need a '.align 2' at
>> >> the beginning as well.
>> >
>> > Wouldn't it be preferable to have it assemble it in Thumb2 too? It seems
>> > that bic sp,#0xf is the only issue...
>> >
>>
>> Well, in general, yes. In the case of NEON code, not really, since the
>> resulting code will not be smaller anyway, because the Thumb2 NEON
>> opcodes are all 4 bytes. Also, Thumb2-only cores don't have NEON
>> units, so all cores that this code can run on will be able to run in
>> ARM mode.
>>
>> So from a maintainability pov, having code that only assembles in one
>> way is better than having code that must compile both to ARM and to
>> Thumb2 opcodes.
>>
>> Just my 2 cents, anyway.
>
> I don't have too much of a preference, though Stefan's suggested 4 
> instructions
> can be reduced to 3, which also matches what aes-neonbs-core.S does:
>
> sub r12, sp, #128
> bic r12, #0xf
> mov sp, r12
>
> Ard, is the following what you're suggesting instead?
>

Yes, but after looking at the actual code, I prefer the change above.
The access occurs only once, not in the loop so the additional
instructions should not affect performance.

Apologies for the noise.

> diff --git a/arch/arm/crypto/speck-neon-core.S b/arch/arm/crypto/speck-neon-core.S
> index 3c1e203e53b9..c989ce3dc057 100644
> --- a/arch/arm/crypto/speck-neon-core.S
> +++ b/arch/arm/crypto/speck-neon-core.S
> @@ -8,6 +8,7 @@
>   */
>
>  #include 
> +#include 
>
> .text
> .fpu    neon
> @@ -233,6 +234,12 @@
>   * nonzero multiple of 128.
>   */
>  .macro _speck_xts_cryptn, decrypting
> +
> +   .align  2
> +   THUMB(bx pc)
> +   THUMB(nop)
> +   THUMB(.arm)
> +
> push{r4-r7}
> mov r7, sp
>
> @@ -413,6 +420,8 @@
> mov sp, r7
> pop {r4-r7}
> bx  lr
> +
> +   THUMB(.thumb)
>  .endm
>
>  ENTRY(speck128_xts_encrypt_neon)


Re: [PATCH v3 3/5] crypto: arm/speck - add NEON-accelerated implementation of Speck-XTS

2018-06-17 Thread Ard Biesheuvel
On 17 June 2018 at 12:41, Stefan Agner  wrote:
> On 17.06.2018 11:40, Ard Biesheuvel wrote:
>> On 17 June 2018 at 11:30, Ard Biesheuvel  wrote:
>>> On 17 June 2018 at 00:40, Stefan Agner  wrote:
>>>> Hi Eric,
>>>>
>>>> On 14.02.2018 19:42, Eric Biggers wrote:
>>>>> Add an ARM NEON-accelerated implementation of Speck-XTS.  It operates on
>>>>> 128-byte chunks at a time, i.e. 8 blocks for Speck128 or 16 blocks for
>>>>> Speck64.  Each 128-byte chunk goes through XTS preprocessing, then is
>>>>> encrypted/decrypted (doing one cipher round for all the blocks, then the
>>>>> next round, etc.), then goes through XTS postprocessing.
>>>>>
>>>>> The performance depends on the processor but can be about 3 times faster
>>>>> than the generic code.  For example, on an ARMv7 processor we observe
>>>>> the following performance with Speck128/256-XTS:
>>>>>
>>>>> xts-speck128-neon: Encryption 107.9 MB/s, Decryption 108.1 MB/s
>>>>> xts(speck128-generic): Encryption  32.1 MB/s, Decryption  36.6 MB/s
>>>>>
>>>>> In comparison to AES-256-XTS without the Cryptography Extensions:
>>>>>
>>>>> xts-aes-neonbs:Encryption  41.2 MB/s, Decryption  36.7 MB/s
>>>>> xts(aes-asm):  Encryption  31.7 MB/s, Decryption  30.8 MB/s
>>>>> xts(aes-generic):  Encryption  21.2 MB/s, Decryption  20.9 MB/s
>>>>>
>>>>> Speck64/128-XTS is even faster:
>>>>>
>>>>> xts-speck64-neon:  Encryption 138.6 MB/s, Decryption 139.1 MB/s
>>>>>
>>>>> Note that as with the generic code, only the Speck128 and Speck64
>>>>> variants are supported.  Also, for now only the XTS mode of operation is
>>>>> supported, to target the disk and file encryption use cases.  The NEON
>>>>> code also only handles the portion of the data that is evenly divisible
>>>>> into 128-byte chunks, with any remainder handled by a C fallback.  Of
>>>>> course, other modes of operation could be added later if needed, and/or
>>>>> the NEON code could be updated to handle other buffer sizes.
>>>>>
>>>>> The XTS specification is only defined for AES which has a 128-bit block
>>>>> size, so for the GF(2^64) math needed for Speck64-XTS we use the
>>>>> reducing polynomial 'x^64 + x^4 + x^3 + x + 1' given by the original XEX
>>>>> paper.  Of course, when possible users should use Speck128-XTS, but even
>>>>> that may be too slow on some processors; Speck64-XTS can be faster.
>>>>>
>>>>> Signed-off-by: Eric Biggers 
>>>>> ---
>>>>>  arch/arm/crypto/Kconfig   |   6 +
>>>>>  arch/arm/crypto/Makefile  |   2 +
>>>>>  arch/arm/crypto/speck-neon-core.S | 432 ++
>>>>>  arch/arm/crypto/speck-neon-glue.c | 288 
>>>>>  4 files changed, 728 insertions(+)
>>>>>  create mode 100644 arch/arm/crypto/speck-neon-core.S
>>>>>  create mode 100644 arch/arm/crypto/speck-neon-glue.c
>>>>>
>>>>> diff --git a/arch/arm/crypto/Kconfig b/arch/arm/crypto/Kconfig
>>>>> index b8e69fe282b8..925d1364727a 100644
>>>>> --- a/arch/arm/crypto/Kconfig
>>>>> +++ b/arch/arm/crypto/Kconfig
>>>>> @@ -121,4 +121,10 @@ config CRYPTO_CHACHA20_NEON
>>>>>   select CRYPTO_BLKCIPHER
>>>>>   select CRYPTO_CHACHA20
>>>>>
>>>>> +config CRYPTO_SPECK_NEON
>>>>> + tristate "NEON accelerated Speck cipher algorithms"
>>>>> + depends on KERNEL_MODE_NEON
>>>>> + select CRYPTO_BLKCIPHER
>>>>> + select CRYPTO_SPECK
>>>>> +
>>>>>  endif
>>>>> diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile
>>>>> index 30ef8e291271..a758107c5525 100644
>>>>> --- a/arch/arm/crypto/Makefile
>>>>> +++ b/arch/arm/crypto/Makefile
>>>>> @@ -10,6 +10,7 @@ obj-$(CONFIG_CRYPTO_SHA1_ARM_NEON) += sha1-arm-neon.o
>>>>>  obj-$(CONFIG_CRYPTO_SHA256_ARM) += sha256-arm.o
>>>>>  obj-$(CONFIG_CRYPTO_SHA512_ARM) += sha512-arm.o
>>>>>  obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha20-neon.o
>>>>> +obj-$(CONFIG_CRYPTO_SPECK_

Re: [PATCH v3 3/5] crypto: arm/speck - add NEON-accelerated implementation of Speck-XTS

2018-06-17 Thread Ard Biesheuvel
On 17 June 2018 at 11:30, Ard Biesheuvel  wrote:
> On 17 June 2018 at 00:40, Stefan Agner  wrote:
>> Hi Eric,
>>
>> On 14.02.2018 19:42, Eric Biggers wrote:
>>> Add an ARM NEON-accelerated implementation of Speck-XTS.  It operates on
>>> 128-byte chunks at a time, i.e. 8 blocks for Speck128 or 16 blocks for
>>> Speck64.  Each 128-byte chunk goes through XTS preprocessing, then is
>>> encrypted/decrypted (doing one cipher round for all the blocks, then the
>>> next round, etc.), then goes through XTS postprocessing.
>>>
>>> The performance depends on the processor but can be about 3 times faster
>>> than the generic code.  For example, on an ARMv7 processor we observe
>>> the following performance with Speck128/256-XTS:
>>>
>>> xts-speck128-neon: Encryption 107.9 MB/s, Decryption 108.1 MB/s
>>> xts(speck128-generic): Encryption  32.1 MB/s, Decryption  36.6 MB/s
>>>
>>> In comparison to AES-256-XTS without the Cryptography Extensions:
>>>
>>> xts-aes-neonbs:Encryption  41.2 MB/s, Decryption  36.7 MB/s
>>> xts(aes-asm):  Encryption  31.7 MB/s, Decryption  30.8 MB/s
>>> xts(aes-generic):  Encryption  21.2 MB/s, Decryption  20.9 MB/s
>>>
>>> Speck64/128-XTS is even faster:
>>>
>>> xts-speck64-neon:  Encryption 138.6 MB/s, Decryption 139.1 MB/s
>>>
>>> Note that as with the generic code, only the Speck128 and Speck64
>>> variants are supported.  Also, for now only the XTS mode of operation is
>>> supported, to target the disk and file encryption use cases.  The NEON
>>> code also only handles the portion of the data that is evenly divisible
>>> into 128-byte chunks, with any remainder handled by a C fallback.  Of
>>> course, other modes of operation could be added later if needed, and/or
>>> the NEON code could be updated to handle other buffer sizes.
>>>
>>> The XTS specification is only defined for AES which has a 128-bit block
>>> size, so for the GF(2^64) math needed for Speck64-XTS we use the
>>> reducing polynomial 'x^64 + x^4 + x^3 + x + 1' given by the original XEX
>>> paper.  Of course, when possible users should use Speck128-XTS, but even
>>> that may be too slow on some processors; Speck64-XTS can be faster.
>>>
>>> Signed-off-by: Eric Biggers 
>>> ---
>>>  arch/arm/crypto/Kconfig   |   6 +
>>>  arch/arm/crypto/Makefile  |   2 +
>>>  arch/arm/crypto/speck-neon-core.S | 432 ++
>>>  arch/arm/crypto/speck-neon-glue.c | 288 
>>>  4 files changed, 728 insertions(+)
>>>  create mode 100644 arch/arm/crypto/speck-neon-core.S
>>>  create mode 100644 arch/arm/crypto/speck-neon-glue.c
>>>
>>> diff --git a/arch/arm/crypto/Kconfig b/arch/arm/crypto/Kconfig
>>> index b8e69fe282b8..925d1364727a 100644
>>> --- a/arch/arm/crypto/Kconfig
>>> +++ b/arch/arm/crypto/Kconfig
>>> @@ -121,4 +121,10 @@ config CRYPTO_CHACHA20_NEON
>>>   select CRYPTO_BLKCIPHER
>>>   select CRYPTO_CHACHA20
>>>
>>> +config CRYPTO_SPECK_NEON
>>> + tristate "NEON accelerated Speck cipher algorithms"
>>> + depends on KERNEL_MODE_NEON
>>> + select CRYPTO_BLKCIPHER
>>> + select CRYPTO_SPECK
>>> +
>>>  endif
>>> diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile
>>> index 30ef8e291271..a758107c5525 100644
>>> --- a/arch/arm/crypto/Makefile
>>> +++ b/arch/arm/crypto/Makefile
>>> @@ -10,6 +10,7 @@ obj-$(CONFIG_CRYPTO_SHA1_ARM_NEON) += sha1-arm-neon.o
>>>  obj-$(CONFIG_CRYPTO_SHA256_ARM) += sha256-arm.o
>>>  obj-$(CONFIG_CRYPTO_SHA512_ARM) += sha512-arm.o
>>>  obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha20-neon.o
>>> +obj-$(CONFIG_CRYPTO_SPECK_NEON) += speck-neon.o
>>>
>>>  ce-obj-$(CONFIG_CRYPTO_AES_ARM_CE) += aes-arm-ce.o
>>>  ce-obj-$(CONFIG_CRYPTO_SHA1_ARM_CE) += sha1-arm-ce.o
>>> @@ -53,6 +54,7 @@ ghash-arm-ce-y  := ghash-ce-core.o ghash-ce-glue.o
>>>  crct10dif-arm-ce-y   := crct10dif-ce-core.o crct10dif-ce-glue.o
>>>  crc32-arm-ce-y:= crc32-ce-core.o crc32-ce-glue.o
>>>  chacha20-neon-y := chacha20-neon-core.o chacha20-neon-glue.o
>>> +speck-neon-y := speck-neon-core.o speck-neon-glue.o
>>>
>>>  quiet_cmd_perl = PERL$@
>>>cmd_perl = $(PERL) $(<) > $(@)
>>> diff --git a/arch/arm/crypto/speck-neon-core.S

Re: [PATCH v3 3/5] crypto: arm/speck - add NEON-accelerated implementation of Speck-XTS

2018-06-17 Thread Ard Biesheuvel
On 17 June 2018 at 00:40, Stefan Agner  wrote:
> Hi Eric,
>
> On 14.02.2018 19:42, Eric Biggers wrote:
>> Add an ARM NEON-accelerated implementation of Speck-XTS.  It operates on
>> 128-byte chunks at a time, i.e. 8 blocks for Speck128 or 16 blocks for
>> Speck64.  Each 128-byte chunk goes through XTS preprocessing, then is
>> encrypted/decrypted (doing one cipher round for all the blocks, then the
>> next round, etc.), then goes through XTS postprocessing.
>>
>> The performance depends on the processor but can be about 3 times faster
>> than the generic code.  For example, on an ARMv7 processor we observe
>> the following performance with Speck128/256-XTS:
>>
>> xts-speck128-neon: Encryption 107.9 MB/s, Decryption 108.1 MB/s
>> xts(speck128-generic): Encryption  32.1 MB/s, Decryption  36.6 MB/s
>>
>> In comparison to AES-256-XTS without the Cryptography Extensions:
>>
>> xts-aes-neonbs:Encryption  41.2 MB/s, Decryption  36.7 MB/s
>> xts(aes-asm):  Encryption  31.7 MB/s, Decryption  30.8 MB/s
>> xts(aes-generic):  Encryption  21.2 MB/s, Decryption  20.9 MB/s
>>
>> Speck64/128-XTS is even faster:
>>
>> xts-speck64-neon:  Encryption 138.6 MB/s, Decryption 139.1 MB/s
>>
>> Note that as with the generic code, only the Speck128 and Speck64
>> variants are supported.  Also, for now only the XTS mode of operation is
>> supported, to target the disk and file encryption use cases.  The NEON
>> code also only handles the portion of the data that is evenly divisible
>> into 128-byte chunks, with any remainder handled by a C fallback.  Of
>> course, other modes of operation could be added later if needed, and/or
>> the NEON code could be updated to handle other buffer sizes.
>>
>> The XTS specification is only defined for AES which has a 128-bit block
>> size, so for the GF(2^64) math needed for Speck64-XTS we use the
>> reducing polynomial 'x^64 + x^4 + x^3 + x + 1' given by the original XEX
>> paper.  Of course, when possible users should use Speck128-XTS, but even
>> that may be too slow on some processors; Speck64-XTS can be faster.
>>
>> Signed-off-by: Eric Biggers 
>> ---
>>  arch/arm/crypto/Kconfig   |   6 +
>>  arch/arm/crypto/Makefile  |   2 +
>>  arch/arm/crypto/speck-neon-core.S | 432 ++
>>  arch/arm/crypto/speck-neon-glue.c | 288 
>>  4 files changed, 728 insertions(+)
>>  create mode 100644 arch/arm/crypto/speck-neon-core.S
>>  create mode 100644 arch/arm/crypto/speck-neon-glue.c
>>
>> diff --git a/arch/arm/crypto/Kconfig b/arch/arm/crypto/Kconfig
>> index b8e69fe282b8..925d1364727a 100644
>> --- a/arch/arm/crypto/Kconfig
>> +++ b/arch/arm/crypto/Kconfig
>> @@ -121,4 +121,10 @@ config CRYPTO_CHACHA20_NEON
>>   select CRYPTO_BLKCIPHER
>>   select CRYPTO_CHACHA20
>>
>> +config CRYPTO_SPECK_NEON
>> + tristate "NEON accelerated Speck cipher algorithms"
>> + depends on KERNEL_MODE_NEON
>> + select CRYPTO_BLKCIPHER
>> + select CRYPTO_SPECK
>> +
>>  endif
>> diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile
>> index 30ef8e291271..a758107c5525 100644
>> --- a/arch/arm/crypto/Makefile
>> +++ b/arch/arm/crypto/Makefile
>> @@ -10,6 +10,7 @@ obj-$(CONFIG_CRYPTO_SHA1_ARM_NEON) += sha1-arm-neon.o
>>  obj-$(CONFIG_CRYPTO_SHA256_ARM) += sha256-arm.o
>>  obj-$(CONFIG_CRYPTO_SHA512_ARM) += sha512-arm.o
>>  obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha20-neon.o
>> +obj-$(CONFIG_CRYPTO_SPECK_NEON) += speck-neon.o
>>
>>  ce-obj-$(CONFIG_CRYPTO_AES_ARM_CE) += aes-arm-ce.o
>>  ce-obj-$(CONFIG_CRYPTO_SHA1_ARM_CE) += sha1-arm-ce.o
>> @@ -53,6 +54,7 @@ ghash-arm-ce-y  := ghash-ce-core.o ghash-ce-glue.o
>>  crct10dif-arm-ce-y   := crct10dif-ce-core.o crct10dif-ce-glue.o
>>  crc32-arm-ce-y:= crc32-ce-core.o crc32-ce-glue.o
>>  chacha20-neon-y := chacha20-neon-core.o chacha20-neon-glue.o
>> +speck-neon-y := speck-neon-core.o speck-neon-glue.o
>>
>>  quiet_cmd_perl = PERL$@
>>cmd_perl = $(PERL) $(<) > $(@)
>> diff --git a/arch/arm/crypto/speck-neon-core.S b/arch/arm/crypto/speck-neon-core.S
>> new file mode 100644
>> index ..3c1e203e53b9
>> --- /dev/null
>> +++ b/arch/arm/crypto/speck-neon-core.S
>> @@ -0,0 +1,432 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * NEON-accelerated implementation of Speck128-XTS and Speck64-XTS
>> + *
>> + * Copyright (c) 2018 Google, Inc
>> + *
>> + * Author: Eric Biggers 
>> + */
>> +
>> +#include <linux/linkage.h>
>> +
>> + .text
>> + .fpu    neon
>> +
>> + // arguments
>> + ROUND_KEYS      .req    r0      // const {u64,u32} *round_keys
>> + NROUNDS         .req    r1      // int nrounds
>> + DST             .req    r2      // void *dst
>> + SRC             .req    r3      // const void *src
>> + NBYTES          .req    r4      // unsigned int nbytes
>> + TWEAK           .req    r5      // void *tweak
>> +
>> + // registers which hold the data being encrypted/decrypted
>> + X0 

Re: [PATCH] crypto: don't optimize keccakf()

2018-06-08 Thread Ard Biesheuvel
On 8 June 2018 at 11:53, Dmitry Vyukov  wrote:
> keccakf() is the only function in kernel that uses __optimize() macro.
> __optimize() breaks frame pointer unwinder as optimized code uses RBP,
> and amusingly this always lead to degraded performance as gcc does not
> inline across different optimizations levels, so keccakf() wasn't inlined
> into its callers and keccakf_round() wasn't inlined into keccakf().
>
> Drop __optimize() to resolve both problems.
>
> Signed-off-by: Dmitry Vyukov 
> Fixes: 83dee2ce1ae7 ("crypto: sha3-generic - rewrite KECCAK transform to help the compiler optimize")
> Reported-by: syzbot+37035ccfa9a0a017f...@syzkaller.appspotmail.com
> Reported-by: syzbot+e073e4740cfbb3ae2...@syzkaller.appspotmail.com
> Cc: linux-crypto@vger.kernel.org
> Cc: "David S. Miller" 
> Cc: Herbert Xu 
> Cc: Ard Biesheuvel 

Acked-by: Ard Biesheuvel 

> ---
>  crypto/sha3_generic.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/crypto/sha3_generic.c b/crypto/sha3_generic.c
> index 264ec12c0b9c..7f6735d9003f 100644
> --- a/crypto/sha3_generic.c
> +++ b/crypto/sha3_generic.c
> @@ -152,7 +152,7 @@ static SHA3_INLINE void keccakf_round(u64 st[25])
> st[24] ^= bc[ 4];
>  }
>
> -static void __optimize("O3") keccakf(u64 st[25])
> +static void keccakf(u64 st[25])
>  {
> int round;
>
> --
> 2.18.0.rc1.242.g61856ae69a-goog
>

