Re: [PATCH 01/10] staging: ccree: remove inline qualifiers

2017-12-06 Thread Gilad Ben-Yossef
On Mon, Dec 4, 2017 at 11:36 AM, Dan Carpenter  wrote:
> On Sun, Dec 03, 2017 at 01:58:12PM +, Gilad Ben-Yossef wrote:
>> The ccree driver was marking a lot of big functions in C files as
>> static inline for no good reason. Remove the inline qualifier from
>> all but the few truly single-line functions.
>>
>
> The compiler is free to ignore inline hints...  It probably would make
> single line functions inline anyway.
>

Yes. I think of it more as a note to the reader: "don't add stuff to
this function; it is meant to be short and simple".
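
As a minimal illustration of that convention (a standalone C sketch with
hypothetical buf/buf_len/buf_count_nonzero names, not code taken from the
ccree driver): keep 'inline' only on the genuinely one-line helper, as a
hint to readers, and let the compiler decide for anything larger.

#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

struct buf {
	const uint8_t *data;
	size_t len;
};

/* Truly one-line helper: 'inline' here is a note to the reader that it
 * must stay trivial; the compiler would likely inline it regardless. */
static inline size_t buf_len(const struct buf *b)
{
	return b->len;
}

/* Bigger worker: no 'inline' qualifier, let the compiler decide. */
static size_t buf_count_nonzero(const struct buf *b)
{
	size_t i, n = 0;

	for (i = 0; i < b->len; i++)
		if (b->data[i])
			n++;
	return n;
}

int main(void)
{
	uint8_t raw[] = { 0, 1, 2, 0, 3 };
	struct buf b = { raw, sizeof(raw) };

	printf("%zu of %zu bytes are non-zero\n",
	       buf_count_nonzero(&b), buf_len(&b));
	return 0;
}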


-- 
Gilad Ben-Yossef
Chief Coffee Drinker

"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
 -- Jean-Baptiste Queru


Re: [PATCH] staging: ccree: ssi_aead: fixed all coding style warnings.

2017-12-06 Thread Gilad Ben-Yossef
Hi Sunil,

On Thu, Dec 7, 2017 at 12:08 AM, SUNIL KALLUR RAMEGOWDA
 wrote:
> My first kernel patch, fixed warnings.
>

Congratulations and may there be many more! :-)

Please note that, in addition to what Greg's patch-bot already mentioned,
I'm pretty sure you are looking at Linus's tree rather than Greg's
staging-next tree, which already contains fixes for many of the issues
your patch is addressing.

Thanks,
Gilad

-- 
Gilad Ben-Yossef
Chief Coffee Drinker

"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
 -- Jean-Baptiste Queru


Re: [PATCH] staging: ccree: ssi_aead: fixed all coding style warnings.

2017-12-06 Thread Greg KH
On Wed, Dec 06, 2017 at 11:08:09PM +0100, SUNIL KALLUR RAMEGOWDA wrote:
> My first kernel patch, fixed warnings.
> 
> Signed-off-by: SUNIL KALLUR RAMEGOWDA 
> ---
>  drivers/staging/ccree/ssi_aead.c | 179 +++
>  1 file changed, 123 insertions(+), 56 deletions(-)

Hi,

This is the friendly patch-bot of Greg Kroah-Hartman.  You have sent him
a patch that has triggered this response.  He used to manually respond
to these common problems, but in order to save his sanity (he kept
writing the same thing over and over, yet to different people), I was
created.  Hopefully you will not take offence and will fix the problem
in your patch and resubmit it so that it can be accepted into the Linux
kernel tree.

You are receiving this message because of the following common error(s)
as indicated below:

- Your patch did many different things all at once, making it difficult
  to review.  All Linux kernel patches need to only do one thing at a
  time.  If you need to do multiple things (such as clean up all coding
  style issues in a file/driver), do it in a sequence of patches, each
  one doing only one thing.  This will make it easier to review the
  patches to ensure that they are correct, and to help alleviate any
  merge issues that larger patches can cause.

- You did not specify a description of why the patch is needed, or
  possibly, any description at all, in the email body.  Please read the
  section entitled "The canonical patch format" in the kernel file,
  Documentation/SubmittingPatches for what is needed in order to
  properly describe the change.

- You did not write a descriptive Subject: for the patch, allowing Greg,
  and everyone else, to know what this patch is all about.  Please read
  the section entitled "The canonical patch format" in the kernel file,
  Documentation/SubmittingPatches for what a proper Subject: line should
  look like.

If you wish to discuss this problem further, or you have questions about
how to resolve this issue, please feel free to respond to this email and
Greg will reply once he has dug out from the pending patches received
from other developers.

thanks,

greg k-h's patch email bot


[PATCH] staging: ccree: ssi_aead: fixed all coding style warnings.

2017-12-06 Thread SUNIL KALLUR RAMEGOWDA
My first kernel patch, fixed warnings.

Signed-off-by: SUNIL KALLUR RAMEGOWDA 
---
 drivers/staging/ccree/ssi_aead.c | 179 +++
 1 file changed, 123 insertions(+), 56 deletions(-)

diff --git a/drivers/staging/ccree/ssi_aead.c b/drivers/staging/ccree/ssi_aead.c
index ba0954e..afb0036 100644
--- a/drivers/staging/ccree/ssi_aead.c
+++ b/drivers/staging/ccree/ssi_aead.c
@@ -100,7 +100,8 @@ static void ssi_aead_exit(struct crypto_aead *tfm)
 
/* Unmap enckey buffer */
if (ctx->enckey) {
-   dma_free_coherent(dev, AES_MAX_KEY_SIZE, ctx->enckey, ctx->enckey_dma_addr);
+   dma_free_coherent(dev, AES_MAX_KEY_SIZE,
+ ctx->enckey, ctx->enckey_dma_addr);
dev_dbg(dev, "Freed enckey DMA buffer enckey_dma_addr=%pad\n",
&ctx->enckey_dma_addr);
ctx->enckey_dma_addr = 0;
@@ -225,7 +226,8 @@ static int ssi_aead_init(struct crypto_aead *tfm)
return -ENOMEM;
 }
 
-static void ssi_aead_complete(struct device *dev, void *ssi_req, void __iomem *cc_base)
+static void ssi_aead_complete(struct device *dev, void *ssi_req,
+ void __iomem *cc_base)
 {
struct aead_request *areq = (struct aead_request *)ssi_req;
struct aead_req_ctx *areq_ctx = aead_request_ctx(areq);
@@ -258,12 +260,20 @@ static void ssi_aead_complete(struct device *dev, void *ssi_req, void __iomem *c
 ctx->authsize),
SSI_SG_FROM_BUF);
 
-   /* If an IV was generated, copy it back to the user provided buffer. */
+   /* If an IV was generated,
+* copy it back to the user provided buffer.
+*/
if (areq_ctx->backup_giv) {
if (ctx->cipher_mode == DRV_CIPHER_CTR)
-   memcpy(areq_ctx->backup_giv, areq_ctx->ctr_iv + CTR_RFC3686_NONCE_SIZE, CTR_RFC3686_IV_SIZE);
+   memcpy(areq_ctx->backup_giv,
+  areq_ctx->ctr_iv +
+   CTR_RFC3686_NONCE_SIZE,
+   CTR_RFC3686_IV_SIZE);
else if (ctx->cipher_mode == DRV_CIPHER_CCM)
-   memcpy(areq_ctx->backup_giv, areq_ctx->ctr_iv + CCM_BLOCK_IV_OFFSET, CCM_BLOCK_IV_SIZE);
+   memcpy(areq_ctx->backup_giv,
+  areq_ctx->ctr_iv +
+   CCM_BLOCK_IV_OFFSET,
+   CCM_BLOCK_IV_SIZE);
}
}
 
@@ -274,8 +284,8 @@ static int xcbc_setkey(struct cc_hw_desc *desc, struct ssi_aead_ctx *ctx)
 {
/* Load the AES key */
hw_desc_init([0]);
-   /* We are using for the source/user key the same buffer as for the output keys,
-* because after this key loading it is not needed anymore
+   /* We are using for the source/user key the same buffer as for the
+* output keys, because after this key loading it is not needed anymore
 */
set_din_type([0], DMA_DLLI,
 ctx->auth_state.xcbc.xcbc_keys_dma_addr, ctx->auth_keylen,
@@ -427,7 +437,8 @@ static int validate_keys_sizes(struct ssi_aead_ctx *ctx)
  * (copy to intenral buffer or hash in case of key longer than block
  */
 static int
-ssi_get_plain_hmac_key(struct crypto_aead *tfm, const u8 *key, unsigned int keylen)
+ssi_get_plain_hmac_key(struct crypto_aead *tfm, const u8 *key,
+  unsigned int keylen)
 {
dma_addr_t key_dma_addr = 0;
struct ssi_aead_ctx *ctx = crypto_aead_ctx(tfm);
@@ -458,9 +469,11 @@ ssi_get_plain_hmac_key(struct crypto_aead *tfm, const u8 *key, unsigned int keyl
}
 
if (likely(keylen != 0)) {
-   key_dma_addr = dma_map_single(dev, (void *)key, keylen, DMA_TO_DEVICE);
+   key_dma_addr = dma_map_single(dev, (void *)key,
+ keylen, DMA_TO_DEVICE);
if (unlikely(dma_mapping_error(dev, key_dma_addr))) {
-   dev_err(dev, "Mapping key va=0x%p len=%u for DMA failed\n",
+   dev_err(dev,
+   "Mapping key va=0x%p len=%u for DMA failed\n",
key, keylen);
return -ENOMEM;
}
@@ -586,7 +599,8 @@ ssi_aead_setkey(struct crypto_aead *tfm, const u8 *key, unsigned int keylen)
/* Copy nonce from last 4 bytes in CTR key to
 *  first 4 bytes in CTR IV
 */
-   memcpy(ctx->ctr_nonce, key + ctx->auth_keylen + ctx->enc_keylen -
+   memcpy(ctx->ctr_nonce, key +
+   ctx->auth_keylen + 

Re: [Part2 PATCH v9 12/38] crypto: ccp: Add Platform Security Processor (PSP) device support

2017-12-06 Thread Philippe Ombredanne
On Tue, Dec 5, 2017 at 2:04 AM, Brijesh Singh  wrote:
> The Platform Security Processor (PSP) is part of the AMD Secure
> Processor (AMD-SP) functionality. The PSP is a dedicated processor
> that provides support for key management commands in Secure Encrypted
> Virtualization (SEV) mode, along with software-based Trusted Execution
> Environment (TEE) to enable third-party trusted applications.
>
> Note that the key management functionality provided by the SEV firmware
> can be used outside of the kvm-amd driver hence it doesn't need to
> depend on CONFIG_KVM_AMD.
>
> Cc: Paolo Bonzini 
> Cc: "Radim Krčmář" 
> Cc: Borislav Petkov 
> Cc: Herbert Xu 
> Cc: Gary Hook 
> Cc: Tom Lendacky 
> Cc: linux-crypto@vger.kernel.org
> Cc: k...@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Improvements-by: Borislav Petkov 
> Signed-off-by: Brijesh Singh 
> Reviewed-by: Borislav Petkov 
> ---
>  drivers/crypto/ccp/Kconfig   |  11 +
>  drivers/crypto/ccp/Makefile  |   1 +
>  drivers/crypto/ccp/psp-dev.c | 105 +++
>  drivers/crypto/ccp/psp-dev.h |  59 
>  drivers/crypto/ccp/sp-dev.c  |  26 +++
>  drivers/crypto/ccp/sp-dev.h  |  24 +-
>  drivers/crypto/ccp/sp-pci.c  |  52 +
>  7 files changed, 277 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/crypto/ccp/psp-dev.c
>  create mode 100644 drivers/crypto/ccp/psp-dev.h
>
> diff --git a/drivers/crypto/ccp/Kconfig b/drivers/crypto/ccp/Kconfig
> index 9c84f9838931..b9dfae47aefd 100644
> --- a/drivers/crypto/ccp/Kconfig
> +++ b/drivers/crypto/ccp/Kconfig
> @@ -33,3 +33,14 @@ config CRYPTO_DEV_CCP_CRYPTO
>   Support for using the cryptographic API with the AMD Cryptographic
>   Coprocessor. This module supports offload of SHA and AES algorithms.
>   If you choose 'M' here, this module will be called ccp_crypto.
> +
> +config CRYPTO_DEV_SP_PSP
> +   bool "Platform Security Processor (PSP) device"
> +   default y
> +   depends on CRYPTO_DEV_CCP_DD && X86_64
> +   help
> +Provide support for the AMD Platform Security Processor (PSP).
> +The PSP is a dedicated processor that provides support for key
> +management commands in Secure Encrypted Virtualization (SEV) mode,
> +along with software-based Trusted Execution Environment (TEE) to
> +enable third-party trusted applications.
> diff --git a/drivers/crypto/ccp/Makefile b/drivers/crypto/ccp/Makefile
> index c4ce726b931e..51d1c0cf66c7 100644
> --- a/drivers/crypto/ccp/Makefile
> +++ b/drivers/crypto/ccp/Makefile
> @@ -8,6 +8,7 @@ ccp-$(CONFIG_CRYPTO_DEV_SP_CCP) += ccp-dev.o \
> ccp-dmaengine.o \
> ccp-debugfs.o
>  ccp-$(CONFIG_PCI) += sp-pci.o
> +ccp-$(CONFIG_CRYPTO_DEV_SP_PSP) += psp-dev.o
>
>  obj-$(CONFIG_CRYPTO_DEV_CCP_CRYPTO) += ccp-crypto.o
>  ccp-crypto-objs := ccp-crypto-main.o \
> diff --git a/drivers/crypto/ccp/psp-dev.c b/drivers/crypto/ccp/psp-dev.c
> new file mode 100644
> index ..b5789f878560
> --- /dev/null
> +++ b/drivers/crypto/ccp/psp-dev.c
> @@ -0,0 +1,105 @@
> +/*
> + * AMD Platform Security Processor (PSP) interface
> + *
> + * Copyright (C) 2016-2017 Advanced Micro Devices, Inc.
> + *
> + * Author: Brijesh Singh 
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */

Dear Brijesh,

Have you considered using the new SPDX license ids instead?

This would come out this way:
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * AMD Platform Security Processor (PSP) interface
> + *
> + * Copyright (C) 2016-2017 Advanced Micro Devices, Inc.
> + *
> + * Author: Brijesh Singh 
> + */

It is much cleaner and simpler, right?

For the C++ comment style and first-line placement, please see Thomas
(tglx)'s doc patches and Linus's posts explaining his rationale for why
he wants it this way.
It would be awesome if this could be applied to all AMD contributions, btw!

-- 
Cordially
Philippe Ombredanne


[PATCH v3 19/20] crypto: arm64/crct10dif-ce - yield NEON after every block of input

2017-12-06 Thread Ard Biesheuvel
Avoid excessive scheduling delays under a preemptible kernel by
yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/crct10dif-ce-core.S | 32 +---
 1 file changed, 28 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/crypto/crct10dif-ce-core.S b/arch/arm64/crypto/crct10dif-ce-core.S
index d5b5a8c038c8..111675f7bad5 100644
--- a/arch/arm64/crypto/crct10dif-ce-core.S
+++ b/arch/arm64/crypto/crct10dif-ce-core.S
@@ -74,13 +74,19 @@
.text
.cpugeneric+crypto
 
-   arg1_low32  .reqw0
-   arg2.reqx1
-   arg3.reqx2
+   arg1_low32  .reqw19
+   arg2.reqx20
+   arg3.reqx21
 
vzr .reqv13
 
 ENTRY(crc_t10dif_pmull)
+   frame_push  3, 128
+
+   mov arg1_low32, w0
+   mov arg2, x1
+   mov arg3, x2
+
movivzr.16b, #0 // init zero register
 
// adjust the 16-bit initial_crc value, scale it to 32 bits
@@ -175,8 +181,25 @@ CPU_LE(ext v12.16b, v12.16b, v12.16b, #8   )
subsarg3, arg3, #128
 
// check if there is another 64B in the buffer to be able to fold
-   b.ge_fold_64_B_loop
+   b.lt_fold_64_B_end
+
+   if_will_cond_yield_neon
+   stp q0, q1, [sp, #48]
+   stp q2, q3, [sp, #80]
+   stp q4, q5, [sp, #112]
+   stp q6, q7, [sp, #144]
+   do_cond_yield_neon
+   ldp q0, q1, [sp, #48]
+   ldp q2, q3, [sp, #80]
+   ldp q4, q5, [sp, #112]
+   ldp q6, q7, [sp, #144]
+   ldr q10, rk3
+   movivzr.16b, #0 // init zero register
+   endif_yield_neon
+
+   b   _fold_64_B_loop
 
+_fold_64_B_end:
// at this point, the buffer pointer is pointing at the last y Bytes
// of the buffer the 64B of folded data is in 4 of the vector
// registers: v0, v1, v2, v3
@@ -304,6 +327,7 @@ _barrett:
 _cleanup:
// scale the result back to 16 bits
lsr x0, x0, #16
+   frame_pop   3, 128
ret
 
 _less_than_128:
-- 
2.11.0



[PATCH v3 20/20] DO NOT MERGE

2017-12-06 Thread Ard Biesheuvel
Test code to force a kernel_neon_end+begin sequence at every yield point,
and wipe the entire NEON state before resuming the algorithm.
---
 arch/arm64/include/asm/assembler.h | 33 
 1 file changed, 33 insertions(+)

diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
index c54e408fd5a7..7072c29b4e83 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -607,6 +607,7 @@ alternative_else_nop_endif
cmp w1, #1 // == PREEMPT_OFFSET
cselx0, x0, xzr, eq
tbnzx0, #TIF_NEED_RESCHED, f// needs rescheduling?
+   b   f
 #endif
.subsection 1
 :
@@ -615,6 +616,38 @@ alternative_else_nop_endif
.macro  do_cond_yield_neon
bl  kernel_neon_end
bl  kernel_neon_begin
+   moviv0.16b, #0x55
+   moviv1.16b, #0x55
+   moviv2.16b, #0x55
+   moviv3.16b, #0x55
+   moviv4.16b, #0x55
+   moviv5.16b, #0x55
+   moviv6.16b, #0x55
+   moviv7.16b, #0x55
+   moviv8.16b, #0x55
+   moviv9.16b, #0x55
+   moviv10.16b, #0x55
+   moviv11.16b, #0x55
+   moviv12.16b, #0x55
+   moviv13.16b, #0x55
+   moviv14.16b, #0x55
+   moviv15.16b, #0x55
+   moviv16.16b, #0x55
+   moviv17.16b, #0x55
+   moviv18.16b, #0x55
+   moviv19.16b, #0x55
+   moviv20.16b, #0x55
+   moviv21.16b, #0x55
+   moviv22.16b, #0x55
+   moviv23.16b, #0x55
+   moviv24.16b, #0x55
+   moviv25.16b, #0x55
+   moviv26.16b, #0x55
+   moviv27.16b, #0x55
+   moviv28.16b, #0x55
+   moviv29.16b, #0x55
+   moviv30.16b, #0x55
+   moviv31.16b, #0x55
.endm
 
.macro  endif_yield_neon, lbl=f
-- 
2.11.0



[PATCH v3 18/20] crypto: arm64/crc32-ce - yield NEON after every block of input

2017-12-06 Thread Ard Biesheuvel
Avoid excessive scheduling delays under a preemptible kernel by
yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/crc32-ce-core.S | 44 ++--
 1 file changed, 32 insertions(+), 12 deletions(-)

diff --git a/arch/arm64/crypto/crc32-ce-core.S b/arch/arm64/crypto/crc32-ce-core.S
index 18f5a8442276..b4ddbb2027e5 100644
--- a/arch/arm64/crypto/crc32-ce-core.S
+++ b/arch/arm64/crypto/crc32-ce-core.S
@@ -100,9 +100,9 @@
dCONSTANT   .reqd0
qCONSTANT   .reqq0
 
-   BUF .reqx0
-   LEN .reqx1
-   CRC .reqx2
+   BUF .reqx19
+   LEN .reqx20
+   CRC .reqx21
 
vzr .reqv9
 
@@ -116,13 +116,21 @@
 * size_t len, uint crc32)
 */
 ENTRY(crc32_pmull_le)
-   adr x3, .Lcrc32_constants
+   frame_push  4, 64
+
+   adr x22, .Lcrc32_constants
b   0f
 
 ENTRY(crc32c_pmull_le)
-   adr x3, .Lcrc32c_constants
+   frame_push  4, 64
+
+   adr x22, .Lcrc32c_constants
+
+0: mov BUF, x0
+   mov LEN, x1
+   mov CRC, x2
 
-0: bic LEN, LEN, #15
+   bic LEN, LEN, #15
ld1 {v1.16b-v4.16b}, [BUF], #0x40
movivzr.16b, #0
fmovdCONSTANT, CRC
@@ -131,7 +139,7 @@ ENTRY(crc32c_pmull_le)
cmp LEN, #0x40
b.ltless_64
 
-   ldr qCONSTANT, [x3]
+   ldr qCONSTANT, [x22]
 
 loop_64:   /* 64 bytes Full cache line folding */
sub LEN, LEN, #0x40
@@ -161,10 +169,21 @@ loop_64:  /* 64 bytes Full cache line folding */
eor v4.16b, v4.16b, v8.16b
 
cmp LEN, #0x40
-   b.geloop_64
+   b.ltless_64
+
+   if_will_cond_yield_neon
+   stp q1, q2, [sp, #48]
+   stp q3, q4, [sp, #80]
+   do_cond_yield_neon
+   ldp q1, q2, [sp, #48]
+   ldp q3, q4, [sp, #80]
+   ldr qCONSTANT, [x22]
+   movivzr.16b, #0
+   endif_yield_neon
+   b   loop_64
 
 less_64:   /* Folding cache line into 128bit */
-   ldr qCONSTANT, [x3, #16]
+   ldr qCONSTANT, [x22, #16]
 
pmull2  v5.1q, v1.2d, vCONSTANT.2d
pmull   v1.1q, v1.1d, vCONSTANT.1d
@@ -203,8 +222,8 @@ fold_64:
eor v1.16b, v1.16b, v2.16b
 
/* final 32-bit fold */
-   ldr dCONSTANT, [x3, #32]
-   ldr d3, [x3, #40]
+   ldr dCONSTANT, [x22, #32]
+   ldr d3, [x22, #40]
 
ext v2.16b, v1.16b, vzr.16b, #4
and v1.16b, v1.16b, v3.16b
@@ -212,7 +231,7 @@ fold_64:
eor v1.16b, v1.16b, v2.16b
 
/* Finish up with the bit-reversed barrett reduction 64 ==> 32 bits */
-   ldr qCONSTANT, [x3, #48]
+   ldr qCONSTANT, [x22, #48]
 
and v2.16b, v1.16b, v3.16b
ext v2.16b, vzr.16b, v2.16b, #8
@@ -222,6 +241,7 @@ fold_64:
eor v1.16b, v1.16b, v2.16b
mov w0, v1.s[1]
 
+   frame_pop   4, 64
ret
 ENDPROC(crc32_pmull_le)
 ENDPROC(crc32c_pmull_le)
-- 
2.11.0



[PATCH v3 17/20] crypto: arm64/aes-ghash - yield NEON after every block of input

2017-12-06 Thread Ard Biesheuvel
Avoid excessive scheduling delays under a preemptible kernel by
yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/ghash-ce-core.S | 113 ++--
 arch/arm64/crypto/ghash-ce-glue.c |  28 +++--
 2 files changed, 97 insertions(+), 44 deletions(-)

diff --git a/arch/arm64/crypto/ghash-ce-core.S b/arch/arm64/crypto/ghash-ce-core.S
index 11ebf1ae248a..8da87cfcce66 100644
--- a/arch/arm64/crypto/ghash-ce-core.S
+++ b/arch/arm64/crypto/ghash-ce-core.S
@@ -213,22 +213,31 @@
.endm
 
.macro  __pmull_ghash, pn
-   ld1 {SHASH.2d}, [x3]
-   ld1 {XL.2d}, [x1]
+   frame_push  5
+
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+   mov x22, x3
+   mov x23, x4
+
+0: ld1 {SHASH.2d}, [x22]
+   ld1 {XL.2d}, [x20]
ext SHASH2.16b, SHASH.16b, SHASH.16b, #8
eor SHASH2.16b, SHASH2.16b, SHASH.16b
 
__pmull_pre_\pn
 
/* do the head block first, if supplied */
-   cbz x4, 0f
-   ld1 {T1.2d}, [x4]
-   b   1f
+   cbz x23, 1f
+   ld1 {T1.2d}, [x23]
+   mov x23, xzr
+   b   2f
 
-0: ld1 {T1.2d}, [x2], #16
-   sub w0, w0, #1
+1: ld1 {T1.2d}, [x21], #16
+   sub w19, w19, #1
 
-1: /* multiply XL by SHASH in GF(2^128) */
+2: /* multiply XL by SHASH in GF(2^128) */
 CPU_LE(rev64   T1.16b, T1.16b  )
 
ext T2.16b, XL.16b, XL.16b, #8
@@ -250,9 +259,18 @@ CPU_LE(rev64   T1.16b, T1.16b  )
eor T2.16b, T2.16b, XH.16b
eor XL.16b, XL.16b, T2.16b
 
-   cbnzw0, 0b
+   cbz w19, 3f
+
+   if_will_cond_yield_neon
+   st1 {XL.2d}, [x20]
+   do_cond_yield_neon
+   b   0b
+   endif_yield_neon
+
+   b   1b
 
-   st1 {XL.2d}, [x1]
+3: st1 {XL.2d}, [x20]
+   frame_pop   5
ret
.endm
 
@@ -304,38 +322,55 @@ ENDPROC(pmull_ghash_update_p8)
.endm
 
.macro  pmull_gcm_do_crypt, enc
-   ld1 {SHASH.2d}, [x4]
-   ld1 {XL.2d}, [x1]
-   ldr x8, [x5, #8]// load lower counter
+   frame_push  10
+
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+   mov x22, x3
+   mov x23, x4
+   mov x24, x5
+   mov x25, x6
+   mov x26, x7
+   .if \enc == 1
+   ldr x27, [sp, #96]  // first stacked arg
+   .endif
+
+   ldr x28, [x24, #8]  // load lower counter
+CPU_LE(rev x28, x28)
+
+0: mov x0, x25
+   load_round_keys w26, x0
+   ld1 {SHASH.2d}, [x23]
+   ld1 {XL.2d}, [x20]
 
moviMASK.16b, #0xe1
ext SHASH2.16b, SHASH.16b, SHASH.16b, #8
-CPU_LE(rev x8, x8  )
shl MASK.2d, MASK.2d, #57
eor SHASH2.16b, SHASH2.16b, SHASH.16b
 
.if \enc == 1
-   ld1 {KS.16b}, [x7]
+   ld1 {KS.16b}, [x27]
.endif
 
-0: ld1 {CTR.8b}, [x5]  // load upper counter
-   ld1 {INP.16b}, [x3], #16
-   rev x9, x8
-   add x8, x8, #1
-   sub w0, w0, #1
+1: ld1 {CTR.8b}, [x24] // load upper counter
+   ld1 {INP.16b}, [x22], #16
+   rev x9, x28
+   add x28, x28, #1
+   sub w19, w19, #1
ins CTR.d[1], x9// set lower counter
 
.if \enc == 1
eor INP.16b, INP.16b, KS.16b// encrypt input
-   st1 {INP.16b}, [x2], #16
+   st1 {INP.16b}, [x21], #16
.endif
 
rev64   T1.16b, INP.16b
 
-   cmp w6, #12
-   b.ge2f  // AES-192/256?
+   cmp w26, #12
+   b.ge4f  // AES-192/256?
 
-1: enc_round   CTR, v21
+2: enc_round   CTR, v21
 
ext T2.16b, XL.16b, XL.16b, #8
ext IN1.16b, T1.16b, T1.16b, #8
@@ -390,27 +425,39 @@ CPU_LE(   rev x8, x8  )
 
.if \enc == 0
eor INP.16b, INP.16b, KS.16b
-   st1 

[PATCH v3 13/20] crypto: arm64/sha2-ce - yield NEON after every block of input

2017-12-06 Thread Ard Biesheuvel
Avoid excessive scheduling delays under a preemptible kernel by
yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/sha2-ce-core.S | 37 ++--
 1 file changed, 26 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/crypto/sha2-ce-core.S b/arch/arm64/crypto/sha2-ce-core.S
index 679c6c002f4f..7709455dae92 100644
--- a/arch/arm64/crypto/sha2-ce-core.S
+++ b/arch/arm64/crypto/sha2-ce-core.S
@@ -77,30 +77,36 @@
 *int blocks)
 */
 ENTRY(sha2_ce_transform)
+   frame_push  3
+
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+
/* load round constants */
-   adr x8, .Lsha2_rcon
+0: adr x8, .Lsha2_rcon
ld1 { v0.4s- v3.4s}, [x8], #64
ld1 { v4.4s- v7.4s}, [x8], #64
ld1 { v8.4s-v11.4s}, [x8], #64
ld1 {v12.4s-v15.4s}, [x8]
 
/* load state */
-   ld1 {dgav.4s, dgbv.4s}, [x0]
+   ld1 {dgav.4s, dgbv.4s}, [x19]
 
/* load sha256_ce_state::finalize */
ldr_l   w4, sha256_ce_offsetof_finalize, x4
-   ldr w4, [x0, x4]
+   ldr w4, [x19, x4]
 
/* load input */
-0: ld1 {v16.4s-v19.4s}, [x1], #64
-   sub w2, w2, #1
+1: ld1 {v16.4s-v19.4s}, [x20], #64
+   sub w21, w21, #1
 
 CPU_LE(rev32   v16.16b, v16.16b)
 CPU_LE(rev32   v17.16b, v17.16b)
 CPU_LE(rev32   v18.16b, v18.16b)
 CPU_LE(rev32   v19.16b, v19.16b)
 
-1: add t0.4s, v16.4s, v0.4s
+2: add t0.4s, v16.4s, v0.4s
mov dg0v.16b, dgav.16b
mov dg1v.16b, dgbv.16b
 
@@ -129,16 +135,24 @@ CPU_LE(   rev32   v19.16b, v19.16b)
add dgbv.4s, dgbv.4s, dg1v.4s
 
/* handled all input blocks? */
-   cbnzw2, 0b
+   cbz w21, 3f
+
+   if_will_cond_yield_neon
+   st1 {dgav.4s, dgbv.4s}, [x19]
+   do_cond_yield_neon
+   b   0b
+   endif_yield_neon
+
+   b   1b
 
/*
 * Final block: add padding and total bit count.
 * Skip if the input size was not a round multiple of the block size,
 * the padding is handled by the C code in that case.
 */
-   cbz x4, 3f
+3: cbz x4, 4f
ldr_l   w4, sha256_ce_offsetof_count, x4
-   ldr x4, [x0, x4]
+   ldr x4, [x19, x4]
moviv17.2d, #0
mov x8, #0x8000
moviv18.2d, #0
@@ -147,9 +161,10 @@ CPU_LE(rev32   v19.16b, v19.16b)
mov x4, #0
mov v19.d[0], xzr
mov v19.d[1], x7
-   b   1b
+   b   2b
 
/* store new state */
-3: st1 {dgav.4s, dgbv.4s}, [x0]
+4: st1 {dgav.4s, dgbv.4s}, [x19]
+   frame_pop   3
ret
 ENDPROC(sha2_ce_transform)
-- 
2.11.0



[PATCH v3 11/20] arm64: assembler: add macros to conditionally yield the NEON under PREEMPT

2017-12-06 Thread Ard Biesheuvel
Add support macros to conditionally yield the NEON (and thus the CPU)
that may be called from the assembler code.

In some cases, yielding the NEON involves saving and restoring a
non-trivial amount of context (especially in the CRC folding algorithms),
and so the macro is split into three, and the code in between is only
executed when the yield path is taken, allowing the context to be preserved.
The third macro takes an optional label argument that marks the resume
path after a yield has been performed.

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/include/asm/assembler.h | 51 
 1 file changed, 51 insertions(+)

diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
index 5f61487e9f93..c54e408fd5a7 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -572,4 +572,55 @@ alternative_else_nop_endif
 #endif
.endm
 
+/*
+ * Check whether to yield to another runnable task from kernel mode NEON code
+ * (which runs with preemption disabled).
+ *
+ * if_will_cond_yield_neon
+ *// pre-yield patchup code
+ * do_cond_yield_neon
+ *// post-yield patchup code
+ * endif_yield_neon
+ *
+ * - Check whether the preempt count is exactly 1, in which case disabling
+ *   preemption once will make the task preemptible. If this is not the case,
+ *   yielding is pointless.
+ * - Check whether TIF_NEED_RESCHED is set, and if so, disable and re-enable
+ *   kernel mode NEON (which will trigger a reschedule), and branch to the
+ *   yield fixup code.
+ *
+ * This macro sequence clobbers x0, x1 and the flags register unconditionally,
+ * and may clobber x2 .. x18 if the yield path is taken.
+ */
+
+   .macro  cond_yield_neon, lbl
+   if_will_cond_yield_neon
+   do_cond_yield_neon
+   endif_yield_neon\lbl
+   .endm
+
+   .macro  if_will_cond_yield_neon
+#ifdef CONFIG_PREEMPT
+   get_thread_info x0
+   ldr w1, [x0, #TSK_TI_PREEMPT]
+   ldr x0, [x0, #TSK_TI_FLAGS]
+   cmp w1, #1 // == PREEMPT_OFFSET
+   cselx0, x0, xzr, eq
+   tbnzx0, #TIF_NEED_RESCHED, f// needs rescheduling?
+#endif
+   .subsection 1
+:
+   .endm
+
+   .macro  do_cond_yield_neon
+   bl  kernel_neon_end
+   bl  kernel_neon_begin
+   .endm
+
+   .macro  endif_yield_neon, lbl=f
+   b   \lbl
+   .previous
+:
+   .endm
+
 #endif /* __ASM_ASSEMBLER_H */
-- 
2.11.0



[PATCH v3 12/20] crypto: arm64/sha1-ce - yield NEON after every block of input

2017-12-06 Thread Ard Biesheuvel
Avoid excessive scheduling delays under a preemptible kernel by
yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/sha1-ce-core.S | 42 ++--
 1 file changed, 29 insertions(+), 13 deletions(-)

diff --git a/arch/arm64/crypto/sha1-ce-core.S b/arch/arm64/crypto/sha1-ce-core.S
index 8550408735a0..3139206e8787 100644
--- a/arch/arm64/crypto/sha1-ce-core.S
+++ b/arch/arm64/crypto/sha1-ce-core.S
@@ -70,31 +70,37 @@
 *int blocks)
 */
 ENTRY(sha1_ce_transform)
+   frame_push  3
+
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+
/* load round constants */
-   adr x6, .Lsha1_rcon
+0: adr x6, .Lsha1_rcon
ld1r{k0.4s}, [x6], #4
ld1r{k1.4s}, [x6], #4
ld1r{k2.4s}, [x6], #4
ld1r{k3.4s}, [x6]
 
/* load state */
-   ld1 {dgav.4s}, [x0]
-   ldr dgb, [x0, #16]
+   ld1 {dgav.4s}, [x19]
+   ldr dgb, [x19, #16]
 
/* load sha1_ce_state::finalize */
ldr_l   w4, sha1_ce_offsetof_finalize, x4
-   ldr w4, [x0, x4]
+   ldr w4, [x19, x4]
 
/* load input */
-0: ld1 {v8.4s-v11.4s}, [x1], #64
-   sub w2, w2, #1
+1: ld1 {v8.4s-v11.4s}, [x20], #64
+   sub w21, w21, #1
 
 CPU_LE(rev32   v8.16b, v8.16b  )
 CPU_LE(rev32   v9.16b, v9.16b  )
 CPU_LE(rev32   v10.16b, v10.16b)
 CPU_LE(rev32   v11.16b, v11.16b)
 
-1: add t0.4s, v8.4s, k0.4s
+2: add t0.4s, v8.4s, k0.4s
mov dg0v.16b, dgav.16b
 
add_update  c, ev, k0,  8,  9, 10, 11, dgb
@@ -125,16 +131,25 @@ CPU_LE(   rev32   v11.16b, v11.16b)
add dgbv.2s, dgbv.2s, dg1v.2s
add dgav.4s, dgav.4s, dg0v.4s
 
-   cbnzw2, 0b
+   cbz w21, 3f
+
+   if_will_cond_yield_neon
+   st1 {dgav.4s}, [x19]
+   str dgb, [x19, #16]
+   do_cond_yield_neon
+   b   0b
+   endif_yield_neon
+
+   b   1b
 
/*
 * Final block: add padding and total bit count.
 * Skip if the input size was not a round multiple of the block size,
 * the padding is handled by the C code in that case.
 */
-   cbz x4, 3f
+3: cbz x4, 4f
ldr_l   w4, sha1_ce_offsetof_count, x4
-   ldr x4, [x0, x4]
+   ldr x4, [x19, x4]
moviv9.2d, #0
mov x8, #0x8000
moviv10.2d, #0
@@ -143,10 +158,11 @@ CPU_LE(   rev32   v11.16b, v11.16b)
mov x4, #0
mov v11.d[0], xzr
mov v11.d[1], x7
-   b   1b
+   b   2b
 
/* store new state */
-3: st1 {dgav.4s}, [x0]
-   str dgb, [x0, #16]
+4: st1 {dgav.4s}, [x19]
+   str dgb, [x19, #16]
+   frame_pop   3
ret
 ENDPROC(sha1_ce_transform)
-- 
2.11.0



[PATCH v3 15/20] crypto: arm64/aes-blk - yield NEON after every block of input

2017-12-06 Thread Ard Biesheuvel
Avoid excessive scheduling delays under a preemptible kernel by
yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/aes-ce.S|  15 +-
 arch/arm64/crypto/aes-modes.S | 331 
 2 files changed, 216 insertions(+), 130 deletions(-)

diff --git a/arch/arm64/crypto/aes-ce.S b/arch/arm64/crypto/aes-ce.S
index 50330f5c3adc..623e74ed1c67 100644
--- a/arch/arm64/crypto/aes-ce.S
+++ b/arch/arm64/crypto/aes-ce.S
@@ -30,18 +30,21 @@
.endm
 
/* prepare for encryption with key in rk[] */
-   .macro  enc_prepare, rounds, rk, ignore
-   load_round_keys \rounds, \rk
+   .macro  enc_prepare, rounds, rk, temp
+   mov \temp, \rk
+   load_round_keys \rounds, \temp
.endm
 
/* prepare for encryption (again) but with new key in rk[] */
-   .macro  enc_switch_key, rounds, rk, ignore
-   load_round_keys \rounds, \rk
+   .macro  enc_switch_key, rounds, rk, temp
+   mov \temp, \rk
+   load_round_keys \rounds, \temp
.endm
 
/* prepare for decryption with key in rk[] */
-   .macro  dec_prepare, rounds, rk, ignore
-   load_round_keys \rounds, \rk
+   .macro  dec_prepare, rounds, rk, temp
+   mov \temp, \rk
+   load_round_keys \rounds, \temp
.endm
 
.macro  do_enc_Nx, de, mc, k, i0, i1, i2, i3
diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
index a68412e1e3a4..ab05772ce385 100644
--- a/arch/arm64/crypto/aes-modes.S
+++ b/arch/arm64/crypto/aes-modes.S
@@ -14,12 +14,12 @@
.align  4
 
 aes_encrypt_block4x:
-   encrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
+   encrypt_block4x v0, v1, v2, v3, w22, x21, x8, w7
ret
 ENDPROC(aes_encrypt_block4x)
 
 aes_decrypt_block4x:
-   decrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
+   decrypt_block4x v0, v1, v2, v3, w22, x21, x8, w7
ret
 ENDPROC(aes_decrypt_block4x)
 
@@ -31,57 +31,71 @@ ENDPROC(aes_decrypt_block4x)
 */
 
 AES_ENTRY(aes_ecb_encrypt)
-   stp x29, x30, [sp, #-16]!
-   mov x29, sp
+   frame_push  5
 
-   enc_prepare w3, x2, x5
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+   mov x22, x3
+   mov x23, x4
+
+.Lecbencrestart:
+   enc_prepare w22, x21, x5
 
 .LecbencloopNx:
-   subsw4, w4, #4
+   subsw23, w23, #4
bmi .Lecbenc1x
-   ld1 {v0.16b-v3.16b}, [x1], #64  /* get 4 pt blocks */
+   ld1 {v0.16b-v3.16b}, [x20], #64 /* get 4 pt blocks */
bl  aes_encrypt_block4x
-   st1 {v0.16b-v3.16b}, [x0], #64
+   st1 {v0.16b-v3.16b}, [x19], #64
+   cond_yield_neon .Lecbencrestart
b   .LecbencloopNx
 .Lecbenc1x:
-   addsw4, w4, #4
+   addsw23, w23, #4
beq .Lecbencout
 .Lecbencloop:
-   ld1 {v0.16b}, [x1], #16 /* get next pt block */
-   encrypt_block   v0, w3, x2, x5, w6
-   st1 {v0.16b}, [x0], #16
-   subsw4, w4, #1
+   ld1 {v0.16b}, [x20], #16/* get next pt block */
+   encrypt_block   v0, w22, x21, x5, w6
+   st1 {v0.16b}, [x19], #16
+   subsw23, w23, #1
bne .Lecbencloop
 .Lecbencout:
-   ldp x29, x30, [sp], #16
+   frame_pop   5
ret
 AES_ENDPROC(aes_ecb_encrypt)
 
 
 AES_ENTRY(aes_ecb_decrypt)
-   stp x29, x30, [sp, #-16]!
-   mov x29, sp
+   frame_push  5
+
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+   mov x22, x3
+   mov x23, x4
 
-   dec_prepare w3, x2, x5
+.Lecbdecrestart:
+   dec_prepare w22, x21, x5
 
 .LecbdecloopNx:
-   subsw4, w4, #4
+   subsw23, w23, #4
bmi .Lecbdec1x
-   ld1 {v0.16b-v3.16b}, [x1], #64  /* get 4 ct blocks */
+   ld1 {v0.16b-v3.16b}, [x20], #64 /* get 4 ct blocks */
bl  aes_decrypt_block4x
-   st1 {v0.16b-v3.16b}, [x0], #64
+   st1 {v0.16b-v3.16b}, [x19], #64
+   cond_yield_neon .Lecbdecrestart
b   .LecbdecloopNx
 .Lecbdec1x:
-   addsw4, w4, #4
+   addsw23, w23, #4
beq .Lecbdecout
 .Lecbdecloop:
-   ld1 {v0.16b}, [x1], #16 /* get next ct block */
-   decrypt_block   v0, w3, x2, x5, w6
-   st1 {v0.16b}, [x0], #16
-   subsw4, w4, #1
+   ld1   

[PATCH v3 14/20] crypto: arm64/aes-ccm - yield NEON after every block of input

2017-12-06 Thread Ard Biesheuvel
Avoid excessive scheduling delays under a preemptible kernel by
yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/aes-ce-ccm-core.S | 150 +---
 1 file changed, 95 insertions(+), 55 deletions(-)

diff --git a/arch/arm64/crypto/aes-ce-ccm-core.S b/arch/arm64/crypto/aes-ce-ccm-core.S
index e3a375c4cb83..22ee196cae00 100644
--- a/arch/arm64/crypto/aes-ce-ccm-core.S
+++ b/arch/arm64/crypto/aes-ce-ccm-core.S
@@ -19,24 +19,33 @@
 *   u32 *macp, u8 const rk[], u32 rounds);
 */
 ENTRY(ce_aes_ccm_auth_data)
-   ldr w8, [x3]/* leftover from prev round? */
+   frame_push  7
+
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+   mov x22, x3
+   mov x23, x4
+   mov x24, x5
+
+   ldr w25, [x22]  /* leftover from prev round? */
ld1 {v0.16b}, [x0]  /* load mac */
-   cbz w8, 1f
-   sub w8, w8, #16
+   cbz w25, 1f
+   sub w25, w25, #16
eor v1.16b, v1.16b, v1.16b
-0: ldrbw7, [x1], #1/* get 1 byte of input */
-   subsw2, w2, #1
-   add w8, w8, #1
+0: ldrbw7, [x20], #1   /* get 1 byte of input */
+   subsw21, w21, #1
+   add w25, w25, #1
ins v1.b[0], w7
ext v1.16b, v1.16b, v1.16b, #1  /* rotate in the input bytes */
beq 8f  /* out of input? */
-   cbnzw8, 0b
+   cbnzw25, 0b
eor v0.16b, v0.16b, v1.16b
-1: ld1 {v3.4s}, [x4]   /* load first round key */
-   prfmpldl1strm, [x1]
-   cmp w5, #12 /* which key size? */
-   add x6, x4, #16
-   sub w7, w5, #2  /* modified # of rounds */
+1: ld1 {v3.4s}, [x23]  /* load first round key */
+   prfmpldl1strm, [x20]
+   cmp w24, #12/* which key size? */
+   add x6, x23, #16
+   sub w7, w24, #2 /* modified # of rounds */
bmi 2f
bne 5f
mov v5.16b, v3.16b
@@ -55,33 +64,43 @@ ENTRY(ce_aes_ccm_auth_data)
ld1 {v5.4s}, [x6], #16  /* load next round key */
bpl 3b
aesev0.16b, v4.16b
-   subsw2, w2, #16 /* last data? */
+   subsw21, w21, #16   /* last data? */
eor v0.16b, v0.16b, v5.16b  /* final round */
bmi 6f
-   ld1 {v1.16b}, [x1], #16 /* load next input block */
+   ld1 {v1.16b}, [x20], #16/* load next input block */
eor v0.16b, v0.16b, v1.16b  /* xor with mac */
-   bne 1b
-6: st1 {v0.16b}, [x0]  /* store mac */
+   beq 6f
+
+   if_will_cond_yield_neon
+   st1 {v0.16b}, [x19] /* store mac */
+   do_cond_yield_neon
+   ld1 {v0.16b}, [x19] /* reload mac */
+   endif_yield_neon
+
+   b   1b
+6: st1 {v0.16b}, [x19] /* store mac */
beq 10f
-   addsw2, w2, #16
+   addsw21, w21, #16
beq 10f
-   mov w8, w2
-7: ldrbw7, [x1], #1
+   mov w25, w21
+7: ldrbw7, [x20], #1
umovw6, v0.b[0]
eor w6, w6, w7
-   strbw6, [x0], #1
-   subsw2, w2, #1
+   strbw6, [x19], #1
+   subsw21, w21, #1
beq 10f
ext v0.16b, v0.16b, v0.16b, #1  /* rotate out the mac bytes */
b   7b
-8: mov w7, w8
-   add w8, w8, #16
+8: mov w7, w25
+   add w25, w25, #16
 9: ext v1.16b, v1.16b, v1.16b, #1
addsw7, w7, #1
bne 9b
eor v0.16b, v0.16b, v1.16b
-   st1 {v0.16b}, [x0]
-10:str w8, [x3]
+   st1 {v0.16b}, [x19]
+10:str w25, [x22]
+
+   frame_pop   7
ret
 ENDPROC(ce_aes_ccm_auth_data)
 
@@ -126,19 +145,29 @@ ENTRY(ce_aes_ccm_final)
 ENDPROC(ce_aes_ccm_final)
 
.macro  aes_ccm_do_crypt,enc
-   ldr x8, [x6, #8]/* load lower ctr */
-   ld1 {v0.16b}, [x5]  /* load mac */
-CPU_LE(rev x8, x8  )   /* keep swabbed ctr in reg */
+   frame_push  8
+
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+   mov x22, x3
+   mov x23, x4
+   mov x24, x5
+   mov x25, x6
+
+   ldr x26, [x25, #8]  /* load lower ctr */
+   ld1 {v0.16b}, [x24] /* load mac */
+CPU_LE(rev x26, x26)   /* keep swabbed ctr in 
reg 

[PATCH v3 16/20] crypto: arm64/aes-bs - yield NEON after every block of input

2017-12-06 Thread Ard Biesheuvel
Avoid excessive scheduling delays under a preemptible kernel by
yielding the NEON after every block of input.

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/aes-neonbs-core.S | 305 +++-
 1 file changed, 170 insertions(+), 135 deletions(-)

diff --git a/arch/arm64/crypto/aes-neonbs-core.S b/arch/arm64/crypto/aes-neonbs-core.S
index ca0472500433..23659369da78 100644
--- a/arch/arm64/crypto/aes-neonbs-core.S
+++ b/arch/arm64/crypto/aes-neonbs-core.S
@@ -565,54 +565,61 @@ ENDPROC(aesbs_decrypt8)
 *   int blocks)
 */
.macro  __ecb_crypt, do8, o0, o1, o2, o3, o4, o5, o6, o7
-   stp x29, x30, [sp, #-16]!
-   mov x29, sp
+   frame_push  5
+
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+   mov x22, x3
+   mov x23, x4
 
 99:mov x5, #1
-   lsl x5, x5, x4
-   subsw4, w4, #8
-   cselx4, x4, xzr, pl
+   lsl x5, x5, x23
+   subsw23, w23, #8
+   cselx23, x23, xzr, pl
cselx5, x5, xzr, mi
 
-   ld1 {v0.16b}, [x1], #16
+   ld1 {v0.16b}, [x20], #16
tbnzx5, #1, 0f
-   ld1 {v1.16b}, [x1], #16
+   ld1 {v1.16b}, [x20], #16
tbnzx5, #2, 0f
-   ld1 {v2.16b}, [x1], #16
+   ld1 {v2.16b}, [x20], #16
tbnzx5, #3, 0f
-   ld1 {v3.16b}, [x1], #16
+   ld1 {v3.16b}, [x20], #16
tbnzx5, #4, 0f
-   ld1 {v4.16b}, [x1], #16
+   ld1 {v4.16b}, [x20], #16
tbnzx5, #5, 0f
-   ld1 {v5.16b}, [x1], #16
+   ld1 {v5.16b}, [x20], #16
tbnzx5, #6, 0f
-   ld1 {v6.16b}, [x1], #16
+   ld1 {v6.16b}, [x20], #16
tbnzx5, #7, 0f
-   ld1 {v7.16b}, [x1], #16
+   ld1 {v7.16b}, [x20], #16
 
-0: mov bskey, x2
-   mov rounds, x3
+0: mov bskey, x21
+   mov rounds, x22
bl  \do8
 
-   st1 {\o0\().16b}, [x0], #16
+   st1 {\o0\().16b}, [x19], #16
tbnzx5, #1, 1f
-   st1 {\o1\().16b}, [x0], #16
+   st1 {\o1\().16b}, [x19], #16
tbnzx5, #2, 1f
-   st1 {\o2\().16b}, [x0], #16
+   st1 {\o2\().16b}, [x19], #16
tbnzx5, #3, 1f
-   st1 {\o3\().16b}, [x0], #16
+   st1 {\o3\().16b}, [x19], #16
tbnzx5, #4, 1f
-   st1 {\o4\().16b}, [x0], #16
+   st1 {\o4\().16b}, [x19], #16
tbnzx5, #5, 1f
-   st1 {\o5\().16b}, [x0], #16
+   st1 {\o5\().16b}, [x19], #16
tbnzx5, #6, 1f
-   st1 {\o6\().16b}, [x0], #16
+   st1 {\o6\().16b}, [x19], #16
tbnzx5, #7, 1f
-   st1 {\o7\().16b}, [x0], #16
+   st1 {\o7\().16b}, [x19], #16
 
-   cbnzx4, 99b
+   cbz x23, 1f
+   cond_yield_neon
+   b   99b
 
-1: ldp x29, x30, [sp], #16
+1: frame_pop   5
ret
.endm
 
@@ -632,43 +639,49 @@ ENDPROC(aesbs_ecb_decrypt)
 */
.align  4
 ENTRY(aesbs_cbc_decrypt)
-   stp x29, x30, [sp, #-16]!
-   mov x29, sp
+   frame_push  6
+
+   mov x19, x0
+   mov x20, x1
+   mov x21, x2
+   mov x22, x3
+   mov x23, x4
+   mov x24, x5
 
 99:mov x6, #1
-   lsl x6, x6, x4
-   subsw4, w4, #8
-   cselx4, x4, xzr, pl
+   lsl x6, x6, x23
+   subsw23, w23, #8
+   cselx23, x23, xzr, pl
cselx6, x6, xzr, mi
 
-   ld1 {v0.16b}, [x1], #16
+   ld1 {v0.16b}, [x20], #16
mov v25.16b, v0.16b
tbnzx6, #1, 0f
-   ld1 {v1.16b}, [x1], #16
+   ld1 {v1.16b}, [x20], #16
mov v26.16b, v1.16b
tbnzx6, #2, 0f
-   ld1 {v2.16b}, [x1], #16
+   ld1 {v2.16b}, [x20], #16
mov v27.16b, v2.16b
tbnzx6, #3, 0f
-   ld1 {v3.16b}, [x1], #16
+   ld1 {v3.16b}, [x20], #16
mov v28.16b, v3.16b
tbnzx6, #4, 0f
-   ld1   

[PATCH v3 10/20] arm64: assembler: add utility macros to push/pop stack frames

2017-12-06 Thread Ard Biesheuvel
We are going to add code to all the NEON crypto routines that will
turn them into non-leaf functions, so we need to manage the stack
frames. To make this less tedious and error prone, add some macros
that take the number of callee saved registers to preserve and the
extra size to allocate in the stack frame (for locals) and emit
the ldp/stp sequences.
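
To illustrate the layout these macros emit, here is a plain C sketch of the
size arithmetic only (frame_size() is a hypothetical helper, not kernel
code): x29/x30 plus @regcount registers starting at x19 are stored as
16-byte pairs, so the frame occupies ((regcount + 3) / 2) * 16 bytes plus
the @extra bytes reserved for locals.

#include <stdio.h>

/* Bytes pushed by frame_push(regcount, extra): x29/x30 plus regcount
 * callee-saved registers (x19..), stored as 16-byte pairs, plus 'extra'
 * bytes for locals (which must be a multiple of 16). */
static int frame_size(int regcount, int extra)
{
	return ((regcount + 3) / 2) * 16 + extra;
}

int main(void)
{
	for (int n = 0; n <= 10; n++)
		printf("frame_push %2d, 0   -> %3d bytes\n", n, frame_size(n, 0));
	printf("frame_push  3, 128 -> %3d bytes\n", frame_size(3, 128));
	return 0;
}

For example, crc_t10dif_pmull later in this series uses frame_push 3, 128,
which works out to a 176-byte frame: 48 bytes of saved registers followed
by the 128 bytes of locals it addresses starting at [sp, #48].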

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/include/asm/assembler.h | 60 
 1 file changed, 60 insertions(+)

diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
index aef72d886677..5f61487e9f93 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -499,6 +499,66 @@ alternative_else_nop_endif
 #endif
.endm
 
+   /*
+* frame_push - Push @regcount callee saved registers to the stack,
+*  starting at x19, as well as x29/x30, and set x29 to
+*  the new value of sp. Add @extra bytes of stack space
+*  for locals.
+*/
+   .macro  frame_push, regcount:req, extra
+   __frame st, \regcount, \extra
+   .endm
+
+   /*
+* frame_pop  - Pop @regcount callee saved registers from the stack,
+*  starting at x19, as well as x29/x30. Also pop @extra
+*  bytes of stack space for locals.
+*/
+   .macro  frame_pop, regcount:req, extra
+   __frame ld, \regcount, \extra
+   .endm
+
+   .macro  __frame, op, regcount:req, extra=0
+   .ifc\op, st
+   stp x29, x30, [sp, #-((\regcount + 3) / 2) * 16 - \extra]!
+   mov x29, sp
+   .endif
+   .if \regcount < 0 || \regcount > 10
+   .error  "regcount should be in the range [0 ... 10]"
+   .endif
+   .if (\extra % 16) != 0
+   .error  "extra should be a multiple of 16 bytes"
+   .endif
+   .if \regcount > 1
+   \op\()p x19, x20, [sp, #16]
+   .if \regcount > 3
+   \op\()p x21, x22, [sp, #32]
+   .if \regcount > 5
+   \op\()p x23, x24, [sp, #48]
+   .if \regcount > 7
+   \op\()p x25, x26, [sp, #64]
+   .if \regcount > 9
+   \op\()p x27, x28, [sp, #80]
+   .elseif \regcount == 9
+   \op\()r x27, [sp, #80]
+   .endif
+   .elseif \regcount == 7
+   \op\()r x25, [sp, #64]
+   .endif
+   .elseif \regcount == 5
+   \op\()r x23, [sp, #48]
+   .endif
+   .elseif \regcount == 3
+   \op\()r x21, [sp, #32]
+   .endif
+   .elseif \regcount == 1
+   \op\()r x19, [sp, #16]
+   .endif
+   .ifc\op, ld
+   ldp x29, x30, [sp], #((\regcount + 3) / 2) * 16 + \extra
+   .endif
+   .endm
+
 /*
  * Errata workaround post TTBR0_EL1 update.
  */
-- 
2.11.0



[PATCH v3 07/20] crypto: arm64/aes-blk - add 4 way interleave to CBC encrypt path

2017-12-06 Thread Ard Biesheuvel
CBC encryption is strictly sequential, and so the current AES code
simply processes the input one block at a time. However, we are
about to add yield support, which adds a bit of overhead, and which
we prefer to align with other modes in terms of granularity (i.e.,
it is better to have all routines yield every 64 bytes and not have
an exception for CBC encrypt which yields every 16 bytes)

So unroll the loop by 4. We still cannot perform the AES algorithm in
parallel, but we can at least merge the loads and stores.
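
To make the data dependency explicit, here is a hedged C sketch of the same
idea (encrypt_one() is a trivial placeholder so the example runs, not AES,
and cbc_encrypt_4x() only mirrors the structure of the assembly loop): each
ciphertext block is the IV for the next block, so the cipher invocations
stay sequential, but the four loads and four stores per iteration are merged.

#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define BLK 16

/* Placeholder block cipher - NOT AES, just enough to keep the sketch
 * compilable and runnable (assumption: shape of the real primitive). */
static void encrypt_one(uint8_t blk[BLK], const uint8_t key[BLK])
{
	for (int i = 0; i < BLK; i++)
		blk[i] = (uint8_t)(blk[i] ^ key[i] ^ 0xA5);
}

/* CBC encrypt, unrolled by four: the cipher calls remain strictly
 * sequential, but the plaintext loads and ciphertext stores are merged. */
static void cbc_encrypt_4x(uint8_t *dst, const uint8_t *src, size_t nblocks,
			   uint8_t iv[BLK], const uint8_t key[BLK])
{
	uint8_t b[4][BLK];

	while (nblocks >= 4) {
		memcpy(b, src, sizeof(b));		/* merged load          */
		for (int i = 0; i < 4; i++) {
			for (int j = 0; j < BLK; j++)
				b[i][j] ^= iv[j];	/* xor with prev ct/iv  */
			encrypt_one(b[i], key);
			memcpy(iv, b[i], BLK);		/* chain the ciphertext */
		}
		memcpy(dst, b, sizeof(b));		/* merged store         */
		src += sizeof(b);
		dst += sizeof(b);
		nblocks -= 4;
	}
	while (nblocks) {				/* 0..3 leftover blocks */
		memcpy(b[0], src, BLK);
		for (int j = 0; j < BLK; j++)
			b[0][j] ^= iv[j];
		encrypt_one(b[0], key);
		memcpy(iv, b[0], BLK);
		memcpy(dst, b[0], BLK);
		src += BLK;
		dst += BLK;
		nblocks--;
	}
}

int main(void)
{
	uint8_t key[BLK] = { 1 }, iv[BLK] = { 2 };
	uint8_t in[7 * BLK] = { 3 }, out[7 * BLK];

	cbc_encrypt_4x(out, in, 7, iv, key);
	printf("first ciphertext byte: %02x\n", out[0]);
	return 0;
}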

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/aes-modes.S | 31 
 1 file changed, 25 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
index 27a235b2ddee..e86535a1329d 100644
--- a/arch/arm64/crypto/aes-modes.S
+++ b/arch/arm64/crypto/aes-modes.S
@@ -94,17 +94,36 @@ AES_ENDPROC(aes_ecb_decrypt)
 */
 
 AES_ENTRY(aes_cbc_encrypt)
-   ld1 {v0.16b}, [x5]  /* get iv */
+   ld1 {v4.16b}, [x5]  /* get iv */
enc_prepare w3, x2, x6
 
-.Lcbcencloop:
-   ld1 {v1.16b}, [x1], #16 /* get next pt block */
-   eor v0.16b, v0.16b, v1.16b  /* ..and xor with iv */
+.Lcbcencloop4x:
+   subsw4, w4, #4
+   bmi .Lcbcenc1x
+   ld1 {v0.16b-v3.16b}, [x1], #64  /* get 4 pt blocks */
+   eor v0.16b, v0.16b, v4.16b  /* ..and xor with iv */
encrypt_block   v0, w3, x2, x6, w7
-   st1 {v0.16b}, [x0], #16
+   eor v1.16b, v1.16b, v0.16b
+   encrypt_block   v1, w3, x2, x6, w7
+   eor v2.16b, v2.16b, v1.16b
+   encrypt_block   v2, w3, x2, x6, w7
+   eor v3.16b, v3.16b, v2.16b
+   encrypt_block   v3, w3, x2, x6, w7
+   st1 {v0.16b-v3.16b}, [x0], #64
+   mov v4.16b, v3.16b
+   b   .Lcbcencloop4x
+.Lcbcenc1x:
+   addsw4, w4, #4
+   beq .Lcbcencout
+.Lcbcencloop:
+   ld1 {v0.16b}, [x1], #16 /* get next pt block */
+   eor v4.16b, v4.16b, v0.16b  /* ..and xor with iv */
+   encrypt_block   v4, w3, x2, x6, w7
+   st1 {v4.16b}, [x0], #16
subsw4, w4, #1
bne .Lcbcencloop
-   st1 {v0.16b}, [x5]  /* return iv */
+.Lcbcencout:
+   st1 {v4.16b}, [x5]  /* return iv */
ret
 AES_ENDPROC(aes_cbc_encrypt)
 
-- 
2.11.0



[PATCH v3 06/20] crypto: arm64/aes-blk - remove configurable interleave

2017-12-06 Thread Ard Biesheuvel
The AES block mode implementation using Crypto Extensions or plain NEON
was written before real hardware existed, and so its interleave factor
was made build time configurable (as well as an option to instantiate
all interleaved sequences inline rather than as subroutines)

We ended up using INTERLEAVE=4 with inlining disabled for both flavors
of the core AES routines, so let's stick with that, and remove the option
to configure this at build time. This makes the code easier to modify,
which is nice now that we're adding yield support.

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/Makefile|   3 -
 arch/arm64/crypto/aes-modes.S | 237 
 2 files changed, 40 insertions(+), 200 deletions(-)

diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
index b5edc5918c28..aaf4e9afd750 100644
--- a/arch/arm64/crypto/Makefile
+++ b/arch/arm64/crypto/Makefile
@@ -50,9 +50,6 @@ aes-arm64-y := aes-cipher-core.o aes-cipher-glue.o
 obj-$(CONFIG_CRYPTO_AES_ARM64_BS) += aes-neon-bs.o
 aes-neon-bs-y := aes-neonbs-core.o aes-neonbs-glue.o
 
-AFLAGS_aes-ce.o:= -DINTERLEAVE=4
-AFLAGS_aes-neon.o  := -DINTERLEAVE=4
-
 CFLAGS_aes-glue-ce.o   := -DUSE_V8_CRYPTO_EXTENSIONS
 
 $(obj)/aes-glue-%.o: $(src)/aes-glue.c FORCE
diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
index 65b273667b34..27a235b2ddee 100644
--- a/arch/arm64/crypto/aes-modes.S
+++ b/arch/arm64/crypto/aes-modes.S
@@ -13,44 +13,6 @@
.text
.align  4
 
-/*
- * There are several ways to instantiate this code:
- * - no interleave, all inline
- * - 2-way interleave, 2x calls out of line (-DINTERLEAVE=2)
- * - 2-way interleave, all inline (-DINTERLEAVE=2 -DINTERLEAVE_INLINE)
- * - 4-way interleave, 4x calls out of line (-DINTERLEAVE=4)
- * - 4-way interleave, all inline (-DINTERLEAVE=4 -DINTERLEAVE_INLINE)
- *
- * Macros imported by this code:
- * - enc_prepare   - setup NEON registers for encryption
- * - dec_prepare   - setup NEON registers for decryption
- * - enc_switch_key- change to new key after having prepared for encryption
- * - encrypt_block - encrypt a single block
- * - decrypt block - decrypt a single block
- * - encrypt_block2x   - encrypt 2 blocks in parallel (if INTERLEAVE == 2)
- * - decrypt_block2x   - decrypt 2 blocks in parallel (if INTERLEAVE == 2)
- * - encrypt_block4x   - encrypt 4 blocks in parallel (if INTERLEAVE == 4)
- * - decrypt_block4x   - decrypt 4 blocks in parallel (if INTERLEAVE == 4)
- */
-
-#if defined(INTERLEAVE) && !defined(INTERLEAVE_INLINE)
-#define FRAME_PUSH stp x29, x30, [sp,#-16]! ; mov x29, sp
-#define FRAME_POP  ldp x29, x30, [sp],#16
-
-#if INTERLEAVE == 2
-
-aes_encrypt_block2x:
-   encrypt_block2x v0, v1, w3, x2, x8, w7
-   ret
-ENDPROC(aes_encrypt_block2x)
-
-aes_decrypt_block2x:
-   decrypt_block2x v0, v1, w3, x2, x8, w7
-   ret
-ENDPROC(aes_decrypt_block2x)
-
-#elif INTERLEAVE == 4
-
 aes_encrypt_block4x:
encrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
ret
@@ -61,48 +23,6 @@ aes_decrypt_block4x:
ret
 ENDPROC(aes_decrypt_block4x)
 
-#else
-#error INTERLEAVE should equal 2 or 4
-#endif
-
-   .macro  do_encrypt_block2x
-   bl  aes_encrypt_block2x
-   .endm
-
-   .macro  do_decrypt_block2x
-   bl  aes_decrypt_block2x
-   .endm
-
-   .macro  do_encrypt_block4x
-   bl  aes_encrypt_block4x
-   .endm
-
-   .macro  do_decrypt_block4x
-   bl  aes_decrypt_block4x
-   .endm
-
-#else
-#define FRAME_PUSH
-#define FRAME_POP
-
-   .macro  do_encrypt_block2x
-   encrypt_block2x v0, v1, w3, x2, x8, w7
-   .endm
-
-   .macro  do_decrypt_block2x
-   decrypt_block2x v0, v1, w3, x2, x8, w7
-   .endm
-
-   .macro  do_encrypt_block4x
-   encrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
-   .endm
-
-   .macro  do_decrypt_block4x
-   decrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
-   .endm
-
-#endif
-
/*
 * aes_ecb_encrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
 * int blocks)
@@ -111,28 +31,21 @@ ENDPROC(aes_decrypt_block4x)
 */
 
 AES_ENTRY(aes_ecb_encrypt)
-   FRAME_PUSH
+   stp x29, x30, [sp, #-16]!
+   mov x29, sp
 
enc_prepare w3, x2, x5
 
 .LecbencloopNx:
-#if INTERLEAVE >= 2
-   subsw4, w4, #INTERLEAVE
+   subsw4, w4, #4
bmi .Lecbenc1x
-#if INTERLEAVE == 2
-   ld1 {v0.16b-v1.16b}, [x1], #32  /* get 2 pt blocks */
-   do_encrypt_block2x
-   st1 {v0.16b-v1.16b}, [x0], #32
-#else
ld1 {v0.16b-v3.16b}, [x1], #64  /* get 4 pt blocks */
-   do_encrypt_block4x
+   bl  aes_encrypt_block4x
st1

[PATCH v3 09/20] crypto: arm64/sha256-neon - play nice with CONFIG_PREEMPT kernels

2017-12-06 Thread Ard Biesheuvel
Tweak the SHA256 update routines to invoke the SHA256 block transform
block by block, to avoid excessive scheduling delays caused by the
NEON algorithm running with preemption disabled.

Also, remove a stale comment which no longer applies now that kernel
mode NEON is actually disallowed in some contexts.

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/sha256-glue.c | 36 +---
 1 file changed, 23 insertions(+), 13 deletions(-)

diff --git a/arch/arm64/crypto/sha256-glue.c b/arch/arm64/crypto/sha256-glue.c
index b064d925fe2a..e8880ccdc71f 100644
--- a/arch/arm64/crypto/sha256-glue.c
+++ b/arch/arm64/crypto/sha256-glue.c
@@ -89,21 +89,32 @@ static struct shash_alg algs[] = { {
 static int sha256_update_neon(struct shash_desc *desc, const u8 *data,
  unsigned int len)
 {
-   /*
-* Stacking and unstacking a substantial slice of the NEON register
-* file may significantly affect performance for small updates when
-* executing in interrupt context, so fall back to the scalar code
-* in that case.
-*/
+   struct sha256_state *sctx = shash_desc_ctx(desc);
+
if (!may_use_simd())
return sha256_base_do_update(desc, data, len,
(sha256_block_fn *)sha256_block_data_order);
 
-   kernel_neon_begin();
-   sha256_base_do_update(desc, data, len,
-   (sha256_block_fn *)sha256_block_neon);
-   kernel_neon_end();
+   while (len > 0) {
+   unsigned int chunk = len;
+
+   /*
+* Don't hog the CPU for the entire time it takes to process all
+* input when running on a preemptible kernel, but process the
+* data block by block instead.
+*/
+   if (IS_ENABLED(CONFIG_PREEMPT) &&
+   chunk + sctx->count % SHA256_BLOCK_SIZE > SHA256_BLOCK_SIZE)
+   chunk = SHA256_BLOCK_SIZE -
+   sctx->count % SHA256_BLOCK_SIZE;
 
+   kernel_neon_begin();
+   sha256_base_do_update(desc, data, chunk,
+ (sha256_block_fn *)sha256_block_neon);
+   kernel_neon_end();
+   data += chunk;
+   len -= chunk;
+   }
return 0;
 }
 
@@ -117,10 +128,9 @@ static int sha256_finup_neon(struct shash_desc *desc, const u8 *data,
sha256_base_do_finalize(desc,
(sha256_block_fn *)sha256_block_data_order);
} else {
-   kernel_neon_begin();
if (len)
-   sha256_base_do_update(desc, data, len,
-   (sha256_block_fn *)sha256_block_neon);
+   sha256_update_neon(desc, data, len);
+   kernel_neon_begin();
sha256_base_do_finalize(desc,
(sha256_block_fn *)sha256_block_neon);
kernel_neon_end();
-- 
2.11.0



[PATCH v3 08/20] crypto: arm64/aes-blk - add 4 way interleave to CBC-MAC encrypt path

2017-12-06 Thread Ard Biesheuvel
CBC MAC is strictly sequential, and so the current AES code simply
processes the input one block at a time. However, we are about to add
yield support, which adds a bit of overhead, and which we prefer to
align with other modes in terms of granularity (i.e., it is better to
have all routines yield every 64 bytes and not have an exception for
CBC MAC which yields every 16 bytes)

So unroll the loop by 4. We still cannot perform the AES algorithm in
parallel, but we can at least merge the loads and stores.

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/aes-modes.S | 23 ++--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
index e86535a1329d..a68412e1e3a4 100644
--- a/arch/arm64/crypto/aes-modes.S
+++ b/arch/arm64/crypto/aes-modes.S
@@ -395,8 +395,28 @@ AES_ENDPROC(aes_xts_decrypt)
 AES_ENTRY(aes_mac_update)
ld1 {v0.16b}, [x4]  /* get dg */
enc_prepare w2, x1, x7
-   cbnzw5, .Lmacenc
+   cbz w5, .Lmacloop4x
 
+   encrypt_block   v0, w2, x1, x7, w8
+
+.Lmacloop4x:
+   subsw3, w3, #4
+   bmi .Lmac1x
+   ld1 {v1.16b-v4.16b}, [x0], #64  /* get next pt block */
+   eor v0.16b, v0.16b, v1.16b  /* ..and xor with dg */
+   encrypt_block   v0, w2, x1, x7, w8
+   eor v0.16b, v0.16b, v2.16b
+   encrypt_block   v0, w2, x1, x7, w8
+   eor v0.16b, v0.16b, v3.16b
+   encrypt_block   v0, w2, x1, x7, w8
+   eor v0.16b, v0.16b, v4.16b
+   cmp w3, wzr
+   csinv   x5, x6, xzr, eq
+   cbz w5, .Lmacout
+   encrypt_block   v0, w2, x1, x7, w8
+   b   .Lmacloop4x
+.Lmac1x:
+   add w3, w3, #4
 .Lmacloop:
cbz w3, .Lmacout
ld1 {v1.16b}, [x0], #16 /* get next pt block */
@@ -406,7 +426,6 @@ AES_ENTRY(aes_mac_update)
csinv   x5, x6, xzr, eq
cbz w5, .Lmacout
 
-.Lmacenc:
encrypt_block   v0, w2, x1, x7, w8
b   .Lmacloop
 
-- 
2.11.0



[PATCH v3 03/20] crypto: arm64/aes-blk - move kernel mode neon en/disable into loop

2017-12-06 Thread Ard Biesheuvel
When kernel mode NEON was first introduced on arm64, the preserve and
restore of the userland NEON state was completely unoptimized, and
involved saving all registers on each call to kernel_neon_begin(),
and restoring them on each call to kernel_neon_end(). For this reason,
the NEON crypto code that was introduced at the time keeps the NEON
enabled throughout the execution of the crypto API methods, which may
include calls back into the crypto API that could result in memory
allocation or other actions that we should avoid when running with
preemption disabled.

Since then, we have optimized the kernel mode NEON handling, which now
restores lazily (upon return to userland), and so the preserve action
is only costly the first time it is called after entering the kernel.

So let's put the kernel_neon_begin() and kernel_neon_end() calls around
the actual invocations of the NEON crypto code, and run the remainder of
the code with kernel mode NEON disabled (and preemption enabled)

Note that this requires some reshuffling of the registers in the asm
code, because the XTS routines can no longer rely on the registers to
retain their contents between invocations.

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/aes-glue.c| 95 ++--
 arch/arm64/crypto/aes-modes.S   | 90 +--
 arch/arm64/crypto/aes-neonbs-glue.c | 14 ++-
 3 files changed, 97 insertions(+), 102 deletions(-)

diff --git a/arch/arm64/crypto/aes-glue.c b/arch/arm64/crypto/aes-glue.c
index 998ba519a026..00a3e2fd6a48 100644
--- a/arch/arm64/crypto/aes-glue.c
+++ b/arch/arm64/crypto/aes-glue.c
@@ -64,17 +64,17 @@ MODULE_LICENSE("GPL v2");
 
 /* defined in aes-modes.S */
 asmlinkage void aes_ecb_encrypt(u8 out[], u8 const in[], u8 const rk[],
-   int rounds, int blocks, int first);
+   int rounds, int blocks);
 asmlinkage void aes_ecb_decrypt(u8 out[], u8 const in[], u8 const rk[],
-   int rounds, int blocks, int first);
+   int rounds, int blocks);
 
 asmlinkage void aes_cbc_encrypt(u8 out[], u8 const in[], u8 const rk[],
-   int rounds, int blocks, u8 iv[], int first);
+   int rounds, int blocks, u8 iv[]);
 asmlinkage void aes_cbc_decrypt(u8 out[], u8 const in[], u8 const rk[],
-   int rounds, int blocks, u8 iv[], int first);
+   int rounds, int blocks, u8 iv[]);
 
 asmlinkage void aes_ctr_encrypt(u8 out[], u8 const in[], u8 const rk[],
-   int rounds, int blocks, u8 ctr[], int first);
+   int rounds, int blocks, u8 ctr[]);
 
 asmlinkage void aes_xts_encrypt(u8 out[], u8 const in[], u8 const rk1[],
int rounds, int blocks, u8 const rk2[], u8 iv[],
@@ -133,19 +133,19 @@ static int ecb_encrypt(struct skcipher_request *req)
 {
struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
struct crypto_aes_ctx *ctx = crypto_skcipher_ctx(tfm);
-   int err, first, rounds = 6 + ctx->key_length / 4;
+   int err, rounds = 6 + ctx->key_length / 4;
struct skcipher_walk walk;
unsigned int blocks;
 
-   err = skcipher_walk_virt(&walk, req, true);
+   err = skcipher_walk_virt(&walk, req, false);
 
-   kernel_neon_begin();
-   for (first = 1; (blocks = (walk.nbytes / AES_BLOCK_SIZE)); first = 0) {
+   while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
+   kernel_neon_begin();
aes_ecb_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
-   (u8 *)ctx->key_enc, rounds, blocks, first);
+   (u8 *)ctx->key_enc, rounds, blocks);
+   kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
-   kernel_neon_end();
return err;
 }
 
@@ -153,19 +153,19 @@ static int ecb_decrypt(struct skcipher_request *req)
 {
struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
struct crypto_aes_ctx *ctx = crypto_skcipher_ctx(tfm);
-   int err, first, rounds = 6 + ctx->key_length / 4;
+   int err, rounds = 6 + ctx->key_length / 4;
struct skcipher_walk walk;
unsigned int blocks;
 
-   err = skcipher_walk_virt(&walk, req, true);
+   err = skcipher_walk_virt(&walk, req, false);
 
-   kernel_neon_begin();
-   for (first = 1; (blocks = (walk.nbytes / AES_BLOCK_SIZE)); first = 0) {
+   while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
+   kernel_neon_begin();
aes_ecb_decrypt(walk.dst.virt.addr, walk.src.virt.addr,
-   (u8 *)ctx->key_dec, rounds, blocks, first);
+   (u8 *)ctx->key_dec, rounds, blocks);
+   kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % 

[PATCH v3 00/20] crypto: arm64 - play nice with CONFIG_PREEMPT

2017-12-06 Thread Ard Biesheuvel
This is the second follow-up to 'crypto: arm64 - disable NEON across
scatterwalk API calls', sent out last Friday.

As reported by Sebastian, the way the arm64 NEON crypto code currently
keeps kernel mode NEON enabled across calls into skcipher_walk_xxx() is
causing problems with RT builds, given that the skcipher walk API may
allocate and free temporary buffers it uses to present the input and
output arrays to the crypto algorithm in blocksize sized chunks (where
blocksize is the natural blocksize of the crypto algorithm), and doing
so with NEON enabled means we're alloc/free'ing memory with preemption
disabled.

This was deliberate: when this code was introduced, each kernel_neon_begin()
and kernel_neon_end() call incurred a fixed penalty of storing, respectively
loading, the contents of all NEON registers to/from memory, so doing it
less often had an obvious performance benefit. However, in the meantime,
we have refactored the core kernel mode NEON code, and now kernel_neon_begin()
only incurs this penalty the first time it is called after entering the kernel,
and the NEON register restore is deferred until returning to userland. This
means pulling those calls into the loops that iterate over the input/output
of the crypto algorithm is no longer a big deal (although there are some
places in the code where we relied on the NEON registers retaining their
values between calls).

So let's clean this up for arm64: update the NEON based skcipher drivers to
no longer keep the NEON enabled when calling into the skcipher walk API.
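
Schematically, in the C glue code this means going from one
kernel_neon_begin()/kernel_neon_end() pair around the whole walk to one
pair around each invocation of the asm routine, e.g. (simplified sketch
of the ECB encrypt path from this series, with setup and declarations
omitted):

	err = skcipher_walk_virt(&walk, req, false);

	while ((blocks = walk.nbytes / AES_BLOCK_SIZE)) {
		kernel_neon_begin();
		aes_ecb_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
				(u8 *)ctx->key_enc, rounds, blocks);
		kernel_neon_end();
		err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
	}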

As pointed out by Peter, this only solves part of the problem. So let's
tackle it more thoroughly, and update the algorithms to test the NEED_RESCHED
flag each time after processing a fixed chunk of input.

Changes since v2:
- Drop logic to yield only after so many blocks - as it turns out, the
  throughput of the algorithms that are most likely to be affected by the
  overhead (GHASH and AES-CE) only drops by ~1% (on Cortex-A57), and if that
  is unacceptable, you are probably not using CONFIG_PREEMPT in the first
  place. (Speed comparison at the end of this cover letter)
- Add yield support to the AES-CCM driver
- Clean up macros based on feedback from Dave
- Given that I had to add stack frame logic to many of these functions, factor
  it out and wrap it in a couple of macros
- Merge the changes to the core asm driver and glue code of the GHASH/GCM
  driver. The latter was not correct without the former.

Changes since v1:
- add CRC-T10DIF test vector (#1)
- stop using GFP_ATOMIC in scatterwalk API calls, now that they are executed
  with preemption enabled (#2 - #6)
- do some preparatory refactoring on the AES block mode code (#7 - #9)
- add yield patches (#10 - #18)
- add test patch (#19) - DO NOT MERGE

Cc: Dave Martin 
Cc: Russell King - ARM Linux 
Cc: Sebastian Andrzej Siewior 
Cc: Mark Rutland 
Cc: linux-rt-us...@vger.kernel.org
Cc: Peter Zijlstra 
Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Steven Rostedt 
Cc: Thomas Gleixner 

Ard Biesheuvel (20):
  crypto: testmgr - add a new test case for CRC-T10DIF
  crypto: arm64/aes-ce-ccm - move kernel mode neon en/disable into loop
  crypto: arm64/aes-blk - move kernel mode neon en/disable into loop
  crypto: arm64/aes-bs - move kernel mode neon en/disable into loop
  crypto: arm64/chacha20 - move kernel mode neon en/disable into loop
  crypto: arm64/aes-blk - remove configurable interleave
  crypto: arm64/aes-blk - add 4 way interleave to CBC encrypt path
  crypto: arm64/aes-blk - add 4 way interleave to CBC-MAC encrypt path
  crypto: arm64/sha256-neon - play nice with CONFIG_PREEMPT kernels
  arm64: assembler: add utility macros to push/pop stack frames
  arm64: assembler: add macros to conditionally yield the NEON under
PREEMPT
  crypto: arm64/sha1-ce - yield NEON after every block of input
  crypto: arm64/sha2-ce - yield NEON after every block of input
  crypto: arm64/aes-ccm - yield NEON after every block of input
  crypto: arm64/aes-blk - yield NEON after every block of input
  crypto: arm64/aes-bs - yield NEON after every block of input
  crypto: arm64/aes-ghash - yield NEON after every block of input
  crypto: arm64/crc32-ce - yield NEON after every block of input
  crypto: arm64/crct10dif-ce - yield NEON after every block of input
  DO NOT MERGE

 arch/arm64/crypto/Makefile |   3 -
 arch/arm64/crypto/aes-ce-ccm-core.S| 150 --
 arch/arm64/crypto/aes-ce-ccm-glue.c|  47 +-
 arch/arm64/crypto/aes-ce.S |  15 +-
 arch/arm64/crypto/aes-glue.c   |  95 ++--
 arch/arm64/crypto/aes-modes.S  | 562 +---
 arch/arm64/crypto/aes-neonbs-core.S| 305 ++-
 arch/arm64/crypto/aes-neonbs-glue.c|  48 +-
 arch/arm64/crypto/chacha20-neon-glue.c |  12 +-
 

[PATCH v3 01/20] crypto: testmgr - add a new test case for CRC-T10DIF

2017-12-06 Thread Ard Biesheuvel
In order to be able to test yield support under preempt, add a test
vector for CRC-T10DIF that is long enough to take multiple iterations
(and thus possible preemption between them) of the primary loop of the
accelerated x86 and arm64 implementations.

Signed-off-by: Ard Biesheuvel 
---
 crypto/testmgr.h | 259 
 1 file changed, 259 insertions(+)

diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index a714b6293959..0c849aec161d 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -1494,6 +1494,265 @@ static const struct hash_testvec 
crct10dif_tv_template[] = {
.digest = (u8 *)(u16 []){ 0x44c6 },
.np = 4,
.tap= { 1, 255, 57, 6 },
+   }, {
+   .plaintext ="\x6e\x05\x79\x10\xa7\x1b\xb2\x49"
+   "\xe0\x54\xeb\x82\x19\x8d\x24\xbb"
+   "\x2f\xc6\x5d\xf4\x68\xff\x96\x0a"
+   "\xa1\x38\xcf\x43\xda\x71\x08\x7c"
+   "\x13\xaa\x1e\xb5\x4c\xe3\x57\xee"
+   "\x85\x1c\x90\x27\xbe\x32\xc9\x60"
+   "\xf7\x6b\x02\x99\x0d\xa4\x3b\xd2"
+   "\x46\xdd\x74\x0b\x7f\x16\xad\x21"
+   "\xb8\x4f\xe6\x5a\xf1\x88\x1f\x93"
+   "\x2a\xc1\x35\xcc\x63\xfa\x6e\x05"
+   "\x9c\x10\xa7\x3e\xd5\x49\xe0\x77"
+   "\x0e\x82\x19\xb0\x24\xbb\x52\xe9"
+   "\x5d\xf4\x8b\x22\x96\x2d\xc4\x38"
+   "\xcf\x66\xfd\x71\x08\x9f\x13\xaa"
+   "\x41\xd8\x4c\xe3\x7a\x11\x85\x1c"
+   "\xb3\x27\xbe\x55\xec\x60\xf7\x8e"
+   "\x02\x99\x30\xc7\x3b\xd2\x69\x00"
+   "\x74\x0b\xa2\x16\xad\x44\xdb\x4f"
+   "\xe6\x7d\x14\x88\x1f\xb6\x2a\xc1"
+   "\x58\xef\x63\xfa\x91\x05\x9c\x33"
+   "\xca\x3e\xd5\x6c\x03\x77\x0e\xa5"
+   "\x19\xb0\x47\xde\x52\xe9\x80\x17"
+   "\x8b\x22\xb9\x2d\xc4\x5b\xf2\x66"
+   "\xfd\x94\x08\x9f\x36\xcd\x41\xd8"
+   "\x6f\x06\x7a\x11\xa8\x1c\xb3\x4a"
+   "\xe1\x55\xec\x83\x1a\x8e\x25\xbc"
+   "\x30\xc7\x5e\xf5\x69\x00\x97\x0b"
+   "\xa2\x39\xd0\x44\xdb\x72\x09\x7d"
+   "\x14\xab\x1f\xb6\x4d\xe4\x58\xef"
+   "\x86\x1d\x91\x28\xbf\x33\xca\x61"
+   "\xf8\x6c\x03\x9a\x0e\xa5\x3c\xd3"
+   "\x47\xde\x75\x0c\x80\x17\xae\x22"
+   "\xb9\x50\xe7\x5b\xf2\x89\x20\x94"
+   "\x2b\xc2\x36\xcd\x64\xfb\x6f\x06"
+   "\x9d\x11\xa8\x3f\xd6\x4a\xe1\x78"
+   "\x0f\x83\x1a\xb1\x25\xbc\x53\xea"
+   "\x5e\xf5\x8c\x00\x97\x2e\xc5\x39"
+   "\xd0\x67\xfe\x72\x09\xa0\x14\xab"
+   "\x42\xd9\x4d\xe4\x7b\x12\x86\x1d"
+   "\xb4\x28\xbf\x56\xed\x61\xf8\x8f"
+   "\x03\x9a\x31\xc8\x3c\xd3\x6a\x01"
+   "\x75\x0c\xa3\x17\xae\x45\xdc\x50"
+   "\xe7\x7e\x15\x89\x20\xb7\x2b\xc2"
+   "\x59\xf0\x64\xfb\x92\x06\x9d\x34"
+   "\xcb\x3f\xd6\x6d\x04\x78\x0f\xa6"
+   "\x1a\xb1\x48\xdf\x53\xea\x81\x18"
+   "\x8c\x23\xba\x2e\xc5\x5c\xf3\x67"
+   "\xfe\x95\x09\xa0\x37\xce\x42\xd9"
+   "\x70\x07\x7b\x12\xa9\x1d\xb4\x4b"
+   "\xe2\x56\xed\x84\x1b\x8f\x26\xbd"
+   "\x31\xc8\x5f\xf6\x6a\x01\x98\x0c"
+   "\xa3\x3a\xd1\x45\xdc\x73\x0a\x7e"
+   "\x15\xac\x20\xb7\x4e\xe5\x59\xf0"
+   "\x87\x1e\x92\x29\xc0\x34\xcb\x62"
+   "\xf9\x6d\x04\x9b\x0f\xa6\x3d\xd4"
+   "\x48\xdf\x76\x0d\x81\x18\xaf\x23"
+   "\xba\x51\xe8\x5c\xf3\x8a\x21\x95"
+   "\x2c\xc3\x37\xce\x65\xfc\x70\x07"
+   "\x9e\x12\xa9\x40\xd7\x4b\xe2\x79"
+   "\x10\x84\x1b\xb2\x26\xbd\x54\xeb"
+   "\x5f\xf6\x8d\x01\x98\x2f\xc6\x3a"
+   "\xd1\x68\xff\x73\x0a\xa1\x15\xac"
+   

[PATCH v3 04/20] crypto: arm64/aes-bs - move kernel mode neon en/disable into loop

2017-12-06 Thread Ard Biesheuvel
When kernel mode NEON was first introduced on arm64, the preserve and
restore of the userland NEON state was completely unoptimized, and
involved saving all registers on each call to kernel_neon_begin(),
and restoring them on each call to kernel_neon_end(). For this reason,
the NEON crypto code that was introduced at the time keeps the NEON
enabled throughout the execution of the crypto API methods, which may
include calls back into the crypto API that could result in memory
allocation or other actions that we should avoid when running with
preemption disabled.

Since then, we have optimized the kernel mode NEON handling, which now
restores lazily (upon return to userland), and so the preserve action
is only costly the first time it is called after entering the kernel.

So let's put the kernel_neon_begin() and kernel_neon_end() calls around
the actual invocations of the NEON crypto code, and run the remainder of
the code with kernel mode NEON disabled (and preemption enabled).

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/aes-neonbs-glue.c | 36 +---
 1 file changed, 17 insertions(+), 19 deletions(-)

diff --git a/arch/arm64/crypto/aes-neonbs-glue.c 
b/arch/arm64/crypto/aes-neonbs-glue.c
index 9d823c77ec84..e7a95a566462 100644
--- a/arch/arm64/crypto/aes-neonbs-glue.c
+++ b/arch/arm64/crypto/aes-neonbs-glue.c
@@ -99,9 +99,8 @@ static int __ecb_crypt(struct skcipher_request *req,
struct skcipher_walk walk;
int err;
 
-   err = skcipher_walk_virt(&walk, req, true);
+   err = skcipher_walk_virt(&walk, req, false);
 
-   kernel_neon_begin();
while (walk.nbytes >= AES_BLOCK_SIZE) {
unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
 
@@ -109,12 +108,13 @@ static int __ecb_crypt(struct skcipher_request *req,
blocks = round_down(blocks,
walk.stride / AES_BLOCK_SIZE);
 
+   kernel_neon_begin();
fn(walk.dst.virt.addr, walk.src.virt.addr, ctx->rk,
   ctx->rounds, blocks);
+   kernel_neon_end();
err = skcipher_walk_done(&walk,
 walk.nbytes - blocks * AES_BLOCK_SIZE);
}
-   kernel_neon_end();
 
return err;
 }
@@ -158,19 +158,19 @@ static int cbc_encrypt(struct skcipher_request *req)
struct skcipher_walk walk;
int err;
 
-   err = skcipher_walk_virt(&walk, req, true);
+   err = skcipher_walk_virt(&walk, req, false);
 
-   kernel_neon_begin();
while (walk.nbytes >= AES_BLOCK_SIZE) {
unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
 
/* fall back to the non-bitsliced NEON implementation */
+   kernel_neon_begin();
neon_aes_cbc_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
 ctx->enc, ctx->key.rounds, blocks,
 walk.iv);
+   kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
-   kernel_neon_end();
return err;
 }
 
@@ -181,9 +181,8 @@ static int cbc_decrypt(struct skcipher_request *req)
struct skcipher_walk walk;
int err;
 
-   err = skcipher_walk_virt(&walk, req, true);
+   err = skcipher_walk_virt(&walk, req, false);
 
-   kernel_neon_begin();
while (walk.nbytes >= AES_BLOCK_SIZE) {
unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
 
@@ -191,13 +190,14 @@ static int cbc_decrypt(struct skcipher_request *req)
blocks = round_down(blocks,
walk.stride / AES_BLOCK_SIZE);
 
+   kernel_neon_begin();
aesbs_cbc_decrypt(walk.dst.virt.addr, walk.src.virt.addr,
  ctx->key.rk, ctx->key.rounds, blocks,
  walk.iv);
+   kernel_neon_end();
err = skcipher_walk_done(&walk,
 walk.nbytes - blocks * AES_BLOCK_SIZE);
}
-   kernel_neon_end();
 
return err;
 }
@@ -229,9 +229,8 @@ static int ctr_encrypt(struct skcipher_request *req)
u8 buf[AES_BLOCK_SIZE];
int err;
 
-   err = skcipher_walk_virt(&walk, req, true);
+   err = skcipher_walk_virt(&walk, req, false);
 
-   kernel_neon_begin();
while (walk.nbytes > 0) {
unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
u8 *final = (walk.total % AES_BLOCK_SIZE) ? buf : NULL;
@@ -242,8 +241,10 @@ static int ctr_encrypt(struct skcipher_request *req)
final = NULL;
}
 
+   kernel_neon_begin();
aesbs_ctr_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
  ctx->rk, ctx->rounds, blocks, walk.iv, final);
+   kernel_neon_end();
 
if 

[PATCH v3 02/20] crypto: arm64/aes-ce-ccm - move kernel mode neon en/disable into loop

2017-12-06 Thread Ard Biesheuvel
When kernel mode NEON was first introduced on arm64, the preserve and
restore of the userland NEON state was completely unoptimized, and
involved saving all registers on each call to kernel_neon_begin(),
and restoring them on each call to kernel_neon_end(). For this reason,
the NEON crypto code that was introduced at the time keeps the NEON
enabled throughout the execution of the crypto API methods, which may
include calls back into the crypto API that could result in memory
allocation or other actions that we should avoid when running with
preemption disabled.

Since then, we have optimized the kernel mode NEON handling, which now
restores lazily (upon return to userland), and so the preserve action
is only costly the first time it is called after entering the kernel.

So let's put the kernel_neon_begin() and kernel_neon_end() calls around
the actual invocations of the NEON crypto code, and run the remainder of
the code with kernel mode NEON disabled (and preemption enabled).

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/aes-ce-ccm-glue.c | 47 ++--
 1 file changed, 23 insertions(+), 24 deletions(-)

diff --git a/arch/arm64/crypto/aes-ce-ccm-glue.c 
b/arch/arm64/crypto/aes-ce-ccm-glue.c
index a1254036f2b1..68b11aa690e4 100644
--- a/arch/arm64/crypto/aes-ce-ccm-glue.c
+++ b/arch/arm64/crypto/aes-ce-ccm-glue.c
@@ -107,11 +107,13 @@ static int ccm_init_mac(struct aead_request *req, u8 
maciv[], u32 msglen)
 }
 
 static void ccm_update_mac(struct crypto_aes_ctx *key, u8 mac[], u8 const in[],
-  u32 abytes, u32 *macp, bool use_neon)
+  u32 abytes, u32 *macp)
 {
-   if (likely(use_neon)) {
+   if (may_use_simd()) {
+   kernel_neon_begin();
ce_aes_ccm_auth_data(mac, in, abytes, macp, key->key_enc,
 num_rounds(key));
+   kernel_neon_end();
} else {
if (*macp > 0 && *macp < AES_BLOCK_SIZE) {
int added = min(abytes, AES_BLOCK_SIZE - *macp);
@@ -143,8 +145,7 @@ static void ccm_update_mac(struct crypto_aes_ctx *key, u8 
mac[], u8 const in[],
}
 }
 
-static void ccm_calculate_auth_mac(struct aead_request *req, u8 mac[],
-  bool use_neon)
+static void ccm_calculate_auth_mac(struct aead_request *req, u8 mac[])
 {
struct crypto_aead *aead = crypto_aead_reqtfm(req);
struct crypto_aes_ctx *ctx = crypto_aead_ctx(aead);
@@ -163,7 +164,7 @@ static void ccm_calculate_auth_mac(struct aead_request 
*req, u8 mac[],
ltag.len = 6;
}
 
-   ccm_update_mac(ctx, mac, (u8 *)&ltag, ltag.len, &macp, use_neon);
+   ccm_update_mac(ctx, mac, (u8 *)&ltag, ltag.len, &macp);
scatterwalk_start(&walk, req->src);
 
do {
@@ -175,7 +176,7 @@ static void ccm_calculate_auth_mac(struct aead_request 
*req, u8 mac[],
n = scatterwalk_clamp(&walk, len);
}
p = scatterwalk_map(&walk);
-   ccm_update_mac(ctx, mac, p, n, &macp, use_neon);
+   ccm_update_mac(ctx, mac, p, n, &macp);
len -= n;
 
scatterwalk_unmap(p);
@@ -242,43 +243,42 @@ static int ccm_encrypt(struct aead_request *req)
u8 __aligned(8) mac[AES_BLOCK_SIZE];
u8 buf[AES_BLOCK_SIZE];
u32 len = req->cryptlen;
-   bool use_neon = may_use_simd();
int err;
 
err = ccm_init_mac(req, mac, len);
if (err)
return err;
 
-   if (likely(use_neon))
-   kernel_neon_begin();
-
if (req->assoclen)
-   ccm_calculate_auth_mac(req, mac, use_neon);
+   ccm_calculate_auth_mac(req, mac);
 
/* preserve the original iv for the final round */
memcpy(buf, req->iv, AES_BLOCK_SIZE);
 
err = skcipher_walk_aead_encrypt(&walk, req, true);
 
-   if (likely(use_neon)) {
+   if (may_use_simd()) {
while (walk.nbytes) {
u32 tail = walk.nbytes % AES_BLOCK_SIZE;
 
if (walk.nbytes == walk.total)
tail = 0;
 
+   kernel_neon_begin();
ce_aes_ccm_encrypt(walk.dst.virt.addr,
   walk.src.virt.addr,
   walk.nbytes - tail, ctx->key_enc,
   num_rounds(ctx), mac, walk.iv);
+   kernel_neon_end();
 
err = skcipher_walk_done(&walk, tail);
}
-   if (!err)
+   if (!err) {
+   kernel_neon_begin();
ce_aes_ccm_final(mac, buf, ctx->key_enc,
 num_rounds(ctx));
-
-   kernel_neon_end();
+   kernel_neon_end();
+   }
} else {
err = 

[PATCH v3 05/20] crypto: arm64/chacha20 - move kernel mode neon en/disable into loop

2017-12-06 Thread Ard Biesheuvel
When kernel mode NEON was first introduced on arm64, the preserve and
restore of the userland NEON state was completely unoptimized, and
involved saving all registers on each call to kernel_neon_begin(),
and restoring them on each call to kernel_neon_end(). For this reason,
the NEON crypto code that was introduced at the time keeps the NEON
enabled throughout the execution of the crypto API methods, which may
include calls back into the crypto API that could result in memory
allocation or other actions that we should avoid when running with
preemption disabled.

Since then, we have optimized the kernel mode NEON handling, which now
restores lazily (upon return to userland), and so the preserve action
is only costly the first time it is called after entering the kernel.

So let's put the kernel_neon_begin() and kernel_neon_end() calls around
the actual invocations of the NEON crypto code, and run the remainder of
the code with kernel mode NEON disabled (and preemption enabled).

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/chacha20-neon-glue.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/crypto/chacha20-neon-glue.c 
b/arch/arm64/crypto/chacha20-neon-glue.c
index cbdb75d15cd0..727579c93ded 100644
--- a/arch/arm64/crypto/chacha20-neon-glue.c
+++ b/arch/arm64/crypto/chacha20-neon-glue.c
@@ -37,12 +37,19 @@ static void chacha20_doneon(u32 *state, u8 *dst, const u8 
*src,
u8 buf[CHACHA20_BLOCK_SIZE];
 
while (bytes >= CHACHA20_BLOCK_SIZE * 4) {
+   kernel_neon_begin();
chacha20_4block_xor_neon(state, dst, src);
+   kernel_neon_end();
bytes -= CHACHA20_BLOCK_SIZE * 4;
src += CHACHA20_BLOCK_SIZE * 4;
dst += CHACHA20_BLOCK_SIZE * 4;
state[12] += 4;
}
+
+   if (!bytes)
+   return;
+
+   kernel_neon_begin();
while (bytes >= CHACHA20_BLOCK_SIZE) {
chacha20_block_xor_neon(state, dst, src);
bytes -= CHACHA20_BLOCK_SIZE;
@@ -55,6 +62,7 @@ static void chacha20_doneon(u32 *state, u8 *dst, const u8 
*src,
chacha20_block_xor_neon(state, buf, buf);
memcpy(dst, buf, bytes);
}
+   kernel_neon_end();
 }
 
 static int chacha20_neon(struct skcipher_request *req)
@@ -68,11 +76,10 @@ static int chacha20_neon(struct skcipher_request *req)
if (!may_use_simd() || req->cryptlen <= CHACHA20_BLOCK_SIZE)
return crypto_chacha20_crypt(req);
 
-   err = skcipher_walk_virt(&walk, req, true);
+   err = skcipher_walk_virt(&walk, req, false);
 
crypto_chacha20_init(state, ctx, walk.iv);
 
-   kernel_neon_begin();
while (walk.nbytes > 0) {
unsigned int nbytes = walk.nbytes;
 
@@ -83,7 +90,6 @@ static int chacha20_neon(struct skcipher_request *req)
nbytes);
err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
}
-   kernel_neon_end();
 
return err;
 }
-- 
2.11.0



Re: [PATCH 1/3] crypto: exynos - Support Exynos5250+ SoCs

2017-12-06 Thread Joe Perches
On Wed, 2017-12-06 at 15:05 +0100, Krzysztof Kozlowski wrote:
> On Wed, Dec 6, 2017 at 2:42 PM, Łukasz Stelmach  
> wrote:
> > It was <2017-12-05 wto 14:34>, when Krzysztof Kozlowski wrote:
> > > On Tue, Dec 5, 2017 at 1:35 PM, Łukasz Stelmach  
> > > wrote:
> > > > Add support for PRNG in Exynos5250+ SoCs.
[]
> > > > diff --git a/drivers/crypto/exynos-rng.c b/drivers/crypto/exynos-rng.c
[]
> > > > @@ -300,7 +323,10 @@ static int exynos_rng_probe(struct platform_device 
> > > > *pdev)
> > > > dev_err(>dev,
> > > > "Couldn't register rng crypto alg: %d\n", ret);
> > > > exynos_rng_dev = NULL;
> > > > -   }
> > > > +   } else
> > > 
> > > Missing {} around else clause. Probably checkpatch should point it.
> > 
> > It doesn't. Fixed.

checkpatch does report this if using --strict

$ ./scripts/checkpatch.pl --strict -
CHECK: Unbalanced braces around else statement
#119: FILE: drivers/crypto/exynos-rng.c:326:
+   } else

Arguably, this should always be reported.
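
For completeness, the balanced form that satisfies the check would look
something like this (sketch only; the body of the else branch here is
purely illustrative):

	if (ret) {
		dev_err(&pdev->dev,
			"Couldn't register rng crypto alg: %d\n", ret);
		exynos_rng_dev = NULL;
	} else {
		dev_info(&pdev->dev, "Exynos PRNG registered\n");
	}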


Re: [PATCH 1/3] crypto: exynos - Support Exynos5250+ SoCs

2017-12-06 Thread Krzysztof Kozlowski
On Wed, Dec 6, 2017 at 3:53 PM, Łukasz Stelmach  wrote:
> It was <2017-12-06 śro 15:05>, when Krzysztof Kozlowski wrote:
>> On Wed, Dec 6, 2017 at 2:42 PM, Łukasz Stelmach  
>> wrote:
>>> It was <2017-12-05 wto 14:34>, when Krzysztof Kozlowski wrote:
 On Tue, Dec 5, 2017 at 1:35 PM, Łukasz Stelmach  
 wrote:
> Add support for PRNG in Exynos5250+ SoCs.
>
> Signed-off-by: Łukasz Stelmach 
> ---
>  .../bindings/crypto/samsung,exynos-rng4.txt|  4 ++-
>  drivers/crypto/exynos-rng.c| 36 
> --
>  2 files changed, 36 insertions(+), 4 deletions(-)
>
> diff --git
> a/Documentation/devicetree/bindings/crypto/samsung,exynos-rng4.txt
> b/Documentation/devicetree/bindings/crypto/samsung,exynos-rng4.txt
> index 4ca8dd4d7e66..a13fbdb4bd88 100644
> --- a/Documentation/devicetree/bindings/crypto/samsung,exynos-rng4.txt
> +++ b/Documentation/devicetree/bindings/crypto/samsung,exynos-rng4.txt
> @@ -2,7 +2,9 @@ Exynos Pseudo Random Number Generator
>
>  Required properties:
>
> -- compatible  : Should be "samsung,exynos4-rng".
> +- compatible  : One of:
> +- "samsung,exynos4-rng" for Exynos4210 and Exynos4412
> +- "samsung,exynos5250-prng" for Exynos5250+
>  - reg : Specifies base physical address and size of the 
> registers map.
>  - clocks  : Phandle to clock-controller plus clock-specifier pair.
>  - clock-names : "secss" as a clock name.
> diff --git a/drivers/crypto/exynos-rng.c b/drivers/crypto/exynos-rng.c
> index 451620b475a0..894ef93ef5ec 100644
> --- a/drivers/crypto/exynos-rng.c
> +++ b/drivers/crypto/exynos-rng.c
> @@ -22,12 +22,17 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>
>  #include 
>
>  #define EXYNOS_RNG_CONTROL 0x0
>  #define EXYNOS_RNG_STATUS  0x10
> +
> +#define EXYNOS_RNG_SEED_CONF   0x14
> +#define EXYNOS_RNG_GEN_PRNG0x02

 Use BIT(1) instead.
>
> Done.
>
> +
>  #define EXYNOS_RNG_SEED_BASE   0x140
>  #define EXYNOS_RNG_SEED(n) (EXYNOS_RNG_SEED_BASE + (n * 0x4))
>  #define EXYNOS_RNG_OUT_BASE0x160
> @@ -43,6 +48,11 @@
>  #define EXYNOS_RNG_SEED_REGS   5
>  #define EXYNOS_RNG_SEED_SIZE   (EXYNOS_RNG_SEED_REGS * 4)
>
> +enum exynos_prng_type {
> +   EXYNOS_PRNG_TYPE4 = 4,
> +   EXYNOS_PRNG_TYPE5 = 5,

 That's unusual numbering and naming, so just:
 enum exynos_prng_type {
   EXYNOS_PRNG_EXYNOS4,
   EXYNOS_PRNG_EXYNOS5,
 };

 Especially that TYPE4 and TYPE5 suggest so kind of sub-type (like
 versions of some IP blocks, e.g. MFC) but it is just the family of
 Exynos.
>>>
>>> Half done. I've changed TYPE to EXYNOS.
>>>
>>> I used explicit numbering in the enum because I want both values to act
>>> same true-false-wise. If one is 0 this condition is not met.
>>
>> First of all - that condition cannot happen. It is not possible from
>> the device-matching code.
>
> Let me explain what I didn't want. With the enum:
>
> enum exynos_prng_type {
> EXYNOS_PRNG_EXYNOS4,
> EXYNOS_PRNG_EXYNOS5,
> };
>
> and a code like this
>
> if (rng->type) {
>
> }
>
> EXYNOS_PRNG_EXYNOS4 is identical to false while EXYNOS_PRNG_EXYNPOS5
> evaluates as true. I think this is a bad idea. I don't want it ever to
> happen.
> Because chips have their own numbers I thought using those
> numbers would be OK.

It does not scale because we do not know what chips could be added...
and we might, for example, discover minor differences between 4210 and
4412, so you would end up with:
EXYNOS4 = 4
EXYNOS4412 = 5
EXYNOS5 = 6
or some other combination that confuses the reader in the future. Instead,
just do not open-code the numbers, because you do not need them (I mean
you do not need the exact values).
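
To illustrate (sketch only; the compatibles come from the binding quoted
earlier in the thread, and the .data wiring assumes the
EXYNOS_PRNG_EXYNOS4/5 naming suggested above), the enum values never need
specific numbers because they only travel through the OF match data:

	static const struct of_device_id exynos_rng_dt_match[] = {
		{
			.compatible = "samsung,exynos4-rng",
			.data = (const void *)EXYNOS_PRNG_EXYNOS4,
		}, {
			.compatible = "samsung,exynos5250-prng",
			.data = (const void *)EXYNOS_PRNG_EXYNOS5,
		},
		{ },
	};

	/* in probe(): rng->type can only ever be one of the values above */
	rng->type = (enum exynos_prng_type)of_device_get_match_data(&pdev->dev);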

>
>> But if you want to indicate it explicitly
>> (for code reviewing?) then how about:
>> enum exynos_prng_type {
>>EXYNOS_PRNG_UNKNOWN = 0,
>>EXYNOS_PRNG_EXYNOS4,
>>EXYNOS_PRNG_EXYNOS5,
>> };
>>
>> In such case you have the same effect but your intentions are clear
>> (you expect possibility of =0... which is not possible :) ).
>
> Fair enough.

Great!

>
> +   dev_info(>dev,
> +"Exynos Pseudo Random Number Generator 
> (type:%d)\n",

 dev_dbg, this is not that important information to affect the boot time.
>>>
>>> Quite many devices report their presence during boot with such
>>> messages. For example:
>>>
>>> [3.390247] exynos-ehci 1211.usb: EHCI Host Controller
>>> [3.395493] exynos-ehci 1211.usb: new USB bus registered, 

Re: [PATCH 1/3] crypto: exynos - Support Exynos5250+ SoCs

2017-12-06 Thread Łukasz Stelmach
It was <2017-12-06 śro 15:05>, when Krzysztof Kozlowski wrote:
> On Wed, Dec 6, 2017 at 2:42 PM, Łukasz Stelmach  
> wrote:
>> It was <2017-12-05 wto 14:34>, when Krzysztof Kozlowski wrote:
>>> On Tue, Dec 5, 2017 at 1:35 PM, Łukasz Stelmach  
>>> wrote:
 Add support for PRNG in Exynos5250+ SoCs.

 Signed-off-by: Łukasz Stelmach 
 ---
  .../bindings/crypto/samsung,exynos-rng4.txt|  4 ++-
  drivers/crypto/exynos-rng.c| 36 
 --
  2 files changed, 36 insertions(+), 4 deletions(-)

 diff --git
 a/Documentation/devicetree/bindings/crypto/samsung,exynos-rng4.txt
 b/Documentation/devicetree/bindings/crypto/samsung,exynos-rng4.txt
 index 4ca8dd4d7e66..a13fbdb4bd88 100644
 --- a/Documentation/devicetree/bindings/crypto/samsung,exynos-rng4.txt
 +++ b/Documentation/devicetree/bindings/crypto/samsung,exynos-rng4.txt
 @@ -2,7 +2,9 @@ Exynos Pseudo Random Number Generator

  Required properties:

 -- compatible  : Should be "samsung,exynos4-rng".
 +- compatible  : One of:
 +- "samsung,exynos4-rng" for Exynos4210 and Exynos4412
 +- "samsung,exynos5250-prng" for Exynos5250+
  - reg : Specifies base physical address and size of the registers 
 map.
  - clocks  : Phandle to clock-controller plus clock-specifier pair.
  - clock-names : "secss" as a clock name.
 diff --git a/drivers/crypto/exynos-rng.c b/drivers/crypto/exynos-rng.c
 index 451620b475a0..894ef93ef5ec 100644
 --- a/drivers/crypto/exynos-rng.c
 +++ b/drivers/crypto/exynos-rng.c
 @@ -22,12 +22,17 @@
  #include 
  #include 
  #include 
 +#include 
  #include 

  #include 

  #define EXYNOS_RNG_CONTROL 0x0
  #define EXYNOS_RNG_STATUS  0x10
 +
 +#define EXYNOS_RNG_SEED_CONF   0x14
 +#define EXYNOS_RNG_GEN_PRNG0x02
>>>
>>> Use BIT(1) instead.

Done.

 +
  #define EXYNOS_RNG_SEED_BASE   0x140
  #define EXYNOS_RNG_SEED(n) (EXYNOS_RNG_SEED_BASE + (n * 0x4))
  #define EXYNOS_RNG_OUT_BASE0x160
 @@ -43,6 +48,11 @@
  #define EXYNOS_RNG_SEED_REGS   5
  #define EXYNOS_RNG_SEED_SIZE   (EXYNOS_RNG_SEED_REGS * 4)

 +enum exynos_prng_type {
 +   EXYNOS_PRNG_TYPE4 = 4,
 +   EXYNOS_PRNG_TYPE5 = 5,
>>>
>>> That's unusual numbering and naming, so just:
>>> enum exynos_prng_type {
>>>   EXYNOS_PRNG_EXYNOS4,
>>>   EXYNOS_PRNG_EXYNOS5,
>>> };
>>>
>>> Especially that TYPE4 and TYPE5 suggest so kind of sub-type (like
>>> versions of some IP blocks, e.g. MFC) but it is just the family of
>>> Exynos.
>>
>> Half done. I've changed TYPE to EXYNOS.
>>
>> I used explicit numbering in the enum because I want both values to act
>> same true-false-wise. If one is 0 this condition is not met.
>
> First of all - that condition cannot happen. It is not possible from
> the device-matching code.

Let me explain what I didn't want. With the enum:

enum exynos_prng_type {
EXYNOS_PRNG_EXYNOS4,
EXYNOS_PRNG_EXYNOS5,
};

and a code like this

if (rng->type) {

}

EXYNOS_PRNG_EXYNOS4 is identical to false, while EXYNOS_PRNG_EXYNOS5
evaluates as true. I think this is a bad idea. I don't want it ever to
happen. Because chips have their own numbers I thought using those
numbers would be OK.

> But if you want to indicate it explicitly
> (for code reviewing?) then how about:
> enum exynos_prng_type {
>EXYNOS_PRNG_UNKNOWN = 0,
>EXYNOS_PRNG_EXYNOS4,
>EXYNOS_PRNG_EXYNOS5,
> };
>
> In such case you have the same effect but your intentions are clear
> (you expect possibility of =0... which is not possible :) ).

Fair enough.

 +   dev_info(>dev,
 +"Exynos Pseudo Random Number Generator 
 (type:%d)\n",
>>>
>>> dev_dbg, this is not that important information to affect the boot time.
>>
>> Quite many devices report their presence during boot with such
>> messages. For example:
>>
>> [3.390247] exynos-ehci 1211.usb: EHCI Host Controller
>> [3.395493] exynos-ehci 1211.usb: new USB bus registered, assigned 
>> bus number 1
>> [3.403702] exynos-ehci 1211.usb: irq 80, io mem 0x1211
>> [3.431793] exynos-ehci 1211.usb: USB 2.0 started, EHCI 1.00
>>
>> From my experience it isn't printk() itself that slows down boot but the
>> serial console.
>
> True, the console is bottleneck (not necessarily serial) [1] but that
> does not change the fact there is no need to print the type of RNG.

With the enum values not being meaningful in themselves, printing the
type does not make much sense to me either. Is it OK to just report the
presence of the device instead?
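
I.e. something along these lines (just a sketch):

	dev_info(&pdev->dev, "Exynos Pseudo Random Number Generator\n");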

-- 
Łukasz Stelmach
Samsung R&D Institute 

Re: [PATCH v2 11/19] arm64: assembler: add macro to conditionally yield the NEON under PREEMPT

2017-12-06 Thread Dave Martin
On Wed, Dec 06, 2017 at 12:25:44PM +, Ard Biesheuvel wrote:
> On 6 December 2017 at 12:12, Dave P Martin  wrote:
> > On Wed, Dec 06, 2017 at 11:57:12AM +, Ard Biesheuvel wrote:
> >> On 6 December 2017 at 11:51, Dave Martin  wrote:
> >> > On Tue, Dec 05, 2017 at 06:04:34PM +, Ard Biesheuvel wrote:
> >> >> On 5 December 2017 at 12:45, Ard Biesheuvel  
> >> >> wrote:
> >> >> >
> >> >> >
> >> >> >> On 5 Dec 2017, at 12:28, Dave Martin  wrote:

[...]

> >> >> >> If we had
> >> >> >>
> >> >> >>if_cond_yield_neon
> >> >> >>// patchup code
> >> >> >>endif_yield_neon
> >> >> >>
> >> >>
> >> >> I like this, but ...
> >> >>
> >> >> >> then the caller is free to conditionally branch over that as 
> >> >> >> appropriate
> >> >> >> like
> >> >> >>
> >> >> >> loop:
> >> >> >>// crypto stuff
> >> >> >>tst x0, #0xf
> >> >> >>b.neloop
> >> >> >>
> >> >> >>if_cond_yield_neon
> >> >> >>// patchup code
> >> >> >>endif_cond_yield_neon
> >> >> >>
> >> >>
> >> >> I need to put the post patchup code somewhere too. Please refer to the
> >> >> CRC patches for the best examples of this.
> >> >
> >> > I'm not sure I follow.  If you need to do something different after
> >> > patchup, can't you either pull that code into the if...endif too, or
> >> > otherwise put a branch before the endif?
> >> >
> >>
> >> No, not really.
> >>
> >> What I currently have is
> >>
> >> >if_cond_yield_neon
> >> >// patchup code
> >> >endif_cond_yield_neon
> >>
> >> which is being turned into
> >>
> >> get_thread_info x0
> >> ldr w1, [x0, #TSK_TI_PREEMPT]
> >> ldr x0, [x0, #TSK_TI_FLAGS]
> >> cmp w1, #1 // == PREEMPT_OFFSET
> >> csel x0, x0, xzr, eq
> >> tbnz x0, #TIF_NEED_RESCHED, f // needs rescheduling?
> >> :
> >>
> >> .subsection 1
> >> :
> >> // patchup code
> >> bl kernel_neon_end
> >> bl kernel_neon_begin
> >> // post patchup code goes here
> >> b b
> >> .subsection
> >>
> >> so what I basically need is a place to put the post patchup code,
> >> unless I open code it and branch to a different label right after
> >> kernel_neon_begin
> >
> > Ah, I get you.  I had convinced myself that there was only post-
> > patchup code.
> >
> > So you conceptually want something like
> >
> > if_will_cond_yield_neon
> > // pre-yield patchup code
> > do_cond_yield_neon
> > // post-yield patchup code
> > endif_yield_neon
> >
> > .macro if_will_cond_yield_neon
> > // do preempt_count == 1 && TIF_NEED_RESCHED check
> > b.wont_yield 3432f
> > .endm
> >
> > .macro do_cond_yield_neon
> > bl  kernel_neon_end
> > bl  kernel_neon_begin
> > .endm
> >
> > .macro endif_yield_neon
> > 3432:
> > .endm
> >
> > The main difference I see is that this doesn't get the slow-case
> > code out of line.
> >
> > Or have I still missed something?
> >
> 
> No, that seems accurate. I think the idiom looks much better, but I'd
> still like to get the code out of line, so that everything disappears
> completely for !CONFIG_PREEMPT.

Sure, out-of-lining should still be possible by spilling into another
section or subsection similarly to what you were already doing -- and
it may still make sense to have (optional?) label arguments to allow
some branch elision.  I didn't want to complicate things initially.

> I'll try to rework my stuff using this as a starting point.

Cool, let me know.

Cheers
---Dave


Re: [PATCH 1/3] crypto: exynos - Support Exynos5250+ SoCs

2017-12-06 Thread Krzysztof Kozlowski
On Wed, Dec 6, 2017 at 2:42 PM, Łukasz Stelmach  wrote:
> It was <2017-12-05 wto 14:34>, when Krzysztof Kozlowski wrote:
>> On Tue, Dec 5, 2017 at 1:35 PM, Łukasz Stelmach  
>> wrote:
>>> Add support for PRNG in Exynos5250+ SoCs.
>>>
>>> Signed-off-by: Łukasz Stelmach 
>>> ---
>>>  .../bindings/crypto/samsung,exynos-rng4.txt|  4 ++-
>>>  drivers/crypto/exynos-rng.c| 36 
>>> --
>>>  2 files changed, 36 insertions(+), 4 deletions(-)
>>>
>>> diff --git
>>> a/Documentation/devicetree/bindings/crypto/samsung,exynos-rng4.txt
>>> b/Documentation/devicetree/bindings/crypto/samsung,exynos-rng4.txt
>>> index 4ca8dd4d7e66..a13fbdb4bd88 100644
>>> --- a/Documentation/devicetree/bindings/crypto/samsung,exynos-rng4.txt
>>> +++ b/Documentation/devicetree/bindings/crypto/samsung,exynos-rng4.txt
>>> @@ -2,7 +2,9 @@ Exynos Pseudo Random Number Generator
>>>
>>>  Required properties:
>>>
>>> -- compatible  : Should be "samsung,exynos4-rng".
>>> +- compatible  : One of:
>>> +- "samsung,exynos4-rng" for Exynos4210 and Exynos4412
>>> +- "samsung,exynos5250-prng" for Exynos5250+
>>>  - reg : Specifies base physical address and size of the registers 
>>> map.
>>>  - clocks  : Phandle to clock-controller plus clock-specifier pair.
>>>  - clock-names : "secss" as a clock name.
>>> diff --git a/drivers/crypto/exynos-rng.c b/drivers/crypto/exynos-rng.c
>>> index 451620b475a0..894ef93ef5ec 100644
>>> --- a/drivers/crypto/exynos-rng.c
>>> +++ b/drivers/crypto/exynos-rng.c
>>> @@ -22,12 +22,17 @@
>>>  #include 
>>>  #include 
>>>  #include 
>>> +#include 
>>>  #include 
>>>
>>>  #include 
>>>
>>>  #define EXYNOS_RNG_CONTROL 0x0
>>>  #define EXYNOS_RNG_STATUS  0x10
>>> +
>>> +#define EXYNOS_RNG_SEED_CONF   0x14
>>> +#define EXYNOS_RNG_GEN_PRNG0x02
>>
>> Use BIT(1) instead.
>>
>>> +
>>>  #define EXYNOS_RNG_SEED_BASE   0x140
>>>  #define EXYNOS_RNG_SEED(n) (EXYNOS_RNG_SEED_BASE + (n * 0x4))
>>>  #define EXYNOS_RNG_OUT_BASE0x160
>>> @@ -43,6 +48,11 @@
>>>  #define EXYNOS_RNG_SEED_REGS   5
>>>  #define EXYNOS_RNG_SEED_SIZE   (EXYNOS_RNG_SEED_REGS * 4)
>>>
>>> +enum exynos_prng_type {
>>> +   EXYNOS_PRNG_TYPE4 = 4,
>>> +   EXYNOS_PRNG_TYPE5 = 5,
>>
>> That's unusual numbering and naming, so just:
>> enum exynos_prng_type {
>>   EXYNOS_PRNG_EXYNOS4,
>>   EXYNOS_PRNG_EXYNOS5,
>> };
>>
>> Especially that TYPE4 and TYPE5 suggest so kind of sub-type (like
>> versions of some IP blocks, e.g. MFC) but it is just the family of
>> Exynos.
>
> Half done. I've changed TYPE to EXYNOS.
>
> I used explicit numbering in the enum because I want both values to act
> same true-false-wise. If one is 0 this condition is not met.

First of all - that condition cannot happen. It is not possible from
the device-matching code. But if you want to indicate it explicitly
(for code reviewing?) then how about:
enum exynos_prng_type {
   EXYNOS_PRNG_UNKNOWN = 0,
   EXYNOS_PRNG_EXYNOS4,
   EXYNOS_PRNG_EXYNOS5,
};

In such case you have the same effect but your intentions are clear
(you expect possibility of =0... which is not possible :) ).

>>> +};
>>> +
>>>  /*
>>>   * Driver re-seeds itself with generated random numbers to increase
>>>   * the randomness.
>>> @@ -63,6 +73,7 @@ struct exynos_rng_ctx {
>>>  /* Device associated memory */
>>>  struct exynos_rng_dev {
>>> struct device   *dev;
>>> +   enum exynos_prng_type   type;
>>> void __iomem*mem;
>>> struct clk  *clk;
>>> /* Generated numbers stored for seeding during resume */
>>> @@ -160,8 +171,13 @@ static int exynos_rng_get_random(struct exynos_rng_dev 
>>> *rng,
>>>  {
>>> int retry = EXYNOS_RNG_WAIT_RETRIES;
>>>
>>> -   exynos_rng_writel(rng, EXYNOS_RNG_CONTROL_START,
>>> - EXYNOS_RNG_CONTROL);
>>> +   if (rng->type == EXYNOS_PRNG_TYPE4) {
>>> +   exynos_rng_writel(rng, EXYNOS_RNG_CONTROL_START,
>>> + EXYNOS_RNG_CONTROL);
>>> +   } else if (rng->type == EXYNOS_PRNG_TYPE5) {
>>> +   exynos_rng_writel(rng, EXYNOS_RNG_GEN_PRNG,
>>> + EXYNOS_RNG_SEED_CONF);
>>> +   }
>>>
>>> while (!(exynos_rng_readl(rng,
>>> EXYNOS_RNG_STATUS) & EXYNOS_RNG_STATUS_RNG_DONE) && 
>>> --retry)
>>> @@ -279,6 +295,13 @@ static int exynos_rng_probe(struct platform_device 
>>> *pdev)
>>> if (!rng)
>>> return -ENOMEM;
>>>
>>> +   rng->type = (enum 
>>> exynos_prng_type)of_device_get_match_data(>dev);
>>> +   if (rng->type != EXYNOS_PRNG_TYPE4 &&
>>> +   rng->type != EXYNOS_PRNG_TYPE5) {
>>> +   dev_err(>dev, "Unsupported PRNG type: %d", 

Re: [PATCH 1/3] crypto: exynos - Support Exynos5250+ SoCs

2017-12-06 Thread Łukasz Stelmach
It was <2017-12-05 wto 14:34>, when Krzysztof Kozlowski wrote:
> On Tue, Dec 5, 2017 at 1:35 PM, Łukasz Stelmach  
> wrote:
>> Add support for PRNG in Exynos5250+ SoCs.
>>
>> Signed-off-by: Łukasz Stelmach 
>> ---
>>  .../bindings/crypto/samsung,exynos-rng4.txt|  4 ++-
>>  drivers/crypto/exynos-rng.c| 36 
>> --
>>  2 files changed, 36 insertions(+), 4 deletions(-)
>>
>> diff --git
>> a/Documentation/devicetree/bindings/crypto/samsung,exynos-rng4.txt
>> b/Documentation/devicetree/bindings/crypto/samsung,exynos-rng4.txt
>> index 4ca8dd4d7e66..a13fbdb4bd88 100644
>> --- a/Documentation/devicetree/bindings/crypto/samsung,exynos-rng4.txt
>> +++ b/Documentation/devicetree/bindings/crypto/samsung,exynos-rng4.txt
>> @@ -2,7 +2,9 @@ Exynos Pseudo Random Number Generator
>>
>>  Required properties:
>>
>> -- compatible  : Should be "samsung,exynos4-rng".
>> +- compatible  : One of:
>> +- "samsung,exynos4-rng" for Exynos4210 and Exynos4412
>> +- "samsung,exynos5250-prng" for Exynos5250+
>>  - reg : Specifies base physical address and size of the registers 
>> map.
>>  - clocks  : Phandle to clock-controller plus clock-specifier pair.
>>  - clock-names : "secss" as a clock name.
>> diff --git a/drivers/crypto/exynos-rng.c b/drivers/crypto/exynos-rng.c
>> index 451620b475a0..894ef93ef5ec 100644
>> --- a/drivers/crypto/exynos-rng.c
>> +++ b/drivers/crypto/exynos-rng.c
>> @@ -22,12 +22,17 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  #include 
>>
>>  #include 
>>
>>  #define EXYNOS_RNG_CONTROL 0x0
>>  #define EXYNOS_RNG_STATUS  0x10
>> +
>> +#define EXYNOS_RNG_SEED_CONF   0x14
>> +#define EXYNOS_RNG_GEN_PRNG0x02
>
> Use BIT(1) instead.
>
>> +
>>  #define EXYNOS_RNG_SEED_BASE   0x140
>>  #define EXYNOS_RNG_SEED(n) (EXYNOS_RNG_SEED_BASE + (n * 0x4))
>>  #define EXYNOS_RNG_OUT_BASE0x160
>> @@ -43,6 +48,11 @@
>>  #define EXYNOS_RNG_SEED_REGS   5
>>  #define EXYNOS_RNG_SEED_SIZE   (EXYNOS_RNG_SEED_REGS * 4)
>>
>> +enum exynos_prng_type {
>> +   EXYNOS_PRNG_TYPE4 = 4,
>> +   EXYNOS_PRNG_TYPE5 = 5,
>
> That's unusual numbering and naming, so just:
> enum exynos_prng_type {
>   EXYNOS_PRNG_EXYNOS4,
>   EXYNOS_PRNG_EXYNOS5,
> };
>
> Especially that TYPE4 and TYPE5 suggest so kind of sub-type (like
> versions of some IP blocks, e.g. MFC) but it is just the family of
> Exynos.

Half done. I've changed TYPE to EXYNOS.

I used explicit numbering in the enum because I want both values to
behave the same when evaluated as booleans. If one of them is 0, this
condition is not met.

>> +};
>> +
>>  /*
>>   * Driver re-seeds itself with generated random numbers to increase
>>   * the randomness.
>> @@ -63,6 +73,7 @@ struct exynos_rng_ctx {
>>  /* Device associated memory */
>>  struct exynos_rng_dev {
>> struct device   *dev;
>> +   enum exynos_prng_type   type;
>> void __iomem*mem;
>> struct clk  *clk;
>> /* Generated numbers stored for seeding during resume */
>> @@ -160,8 +171,13 @@ static int exynos_rng_get_random(struct exynos_rng_dev 
>> *rng,
>>  {
>> int retry = EXYNOS_RNG_WAIT_RETRIES;
>>
>> -   exynos_rng_writel(rng, EXYNOS_RNG_CONTROL_START,
>> - EXYNOS_RNG_CONTROL);
>> +   if (rng->type == EXYNOS_PRNG_TYPE4) {
>> +   exynos_rng_writel(rng, EXYNOS_RNG_CONTROL_START,
>> + EXYNOS_RNG_CONTROL);
>> +   } else if (rng->type == EXYNOS_PRNG_TYPE5) {
>> +   exynos_rng_writel(rng, EXYNOS_RNG_GEN_PRNG,
>> + EXYNOS_RNG_SEED_CONF);
>> +   }
>>
>> while (!(exynos_rng_readl(rng,
>> EXYNOS_RNG_STATUS) & EXYNOS_RNG_STATUS_RNG_DONE) && 
>> --retry)
>> @@ -279,6 +295,13 @@ static int exynos_rng_probe(struct platform_device 
>> *pdev)
>> if (!rng)
>> return -ENOMEM;
>>
>> +   rng->type = (enum 
>> exynos_prng_type)of_device_get_match_data(>dev);
>> +   if (rng->type != EXYNOS_PRNG_TYPE4 &&
>> +   rng->type != EXYNOS_PRNG_TYPE5) {
>> +   dev_err(>dev, "Unsupported PRNG type: %d", rng->type);
>> +   return -ENOTSUPP;
>> +   }
>> +
>> rng->dev = >dev;
>> rng->clk = devm_clk_get(>dev, "secss");
>> if (IS_ERR(rng->clk)) {
>> @@ -300,7 +323,10 @@ static int exynos_rng_probe(struct platform_device 
>> *pdev)
>> dev_err(>dev,
>> "Couldn't register rng crypto alg: %d\n", ret);
>> exynos_rng_dev = NULL;
>> -   }
>> +   } else
>
> Missing {} around else clause. Probably checkpatch should point it.

It doesn't. Fixed.

>> +   dev_info(>dev,
>> +"Exynos Pseudo Random 

Re: [PATCH 2/3] crypto: exynos - Improve performance of PRNG

2017-12-06 Thread Łukasz Stelmach
It was <2017-12-06 śro 12:37>, when Krzysztof Kozlowski wrote:
> On Wed, Dec 6, 2017 at 12:32 PM, Łukasz Stelmach  
> wrote:
>> It was <2017-12-05 wto 19:06>, when Krzysztof Kozlowski wrote:
>>> On Tue, Dec 5, 2017 at 6:53 PM, Krzysztof Kozlowski  wrote:
 On Tue, Dec 05, 2017 at 05:43:10PM +0100, Łukasz Stelmach wrote:
> It was <2017-12-05 wto 14:54>, when Stephan Mueller wrote:
>> Am Dienstag, 5. Dezember 2017, 13:35:57 CET schrieb Łukasz Stelmach:
>>> Use memcpy_fromio() instead of custom exynos_rng_copy_random() function
>>> to retrieve generated numbers from the registers of PRNG.
>>>
>>> Remove unnecessary invocation of cpu_relax().
>>>
>>> Signed-off-by: Łukasz Stelmach 
>>> ---
>>>  drivers/crypto/exynos-rng.c | 36 +---
>>>  1 file changed, 5 insertions(+), 31 deletions(-)
>>>
>>> diff --git a/drivers/crypto/exynos-rng.c b/drivers/crypto/exynos-rng.c
>>> index 894ef93ef5ec..002e9d2a83cc 100644
>>> --- a/drivers/crypto/exynos-rng.c
>>> +++ b/drivers/crypto/exynos-rng.c
>
> [...]
>
>>> @@ -171,6 +143,8 @@ static int exynos_rng_get_random(struct 
>>> exynos_rng_dev
>>> *rng, {
>>>int retry = EXYNOS_RNG_WAIT_RETRIES;
>>>
>>> +  *read = min_t(size_t, dlen, EXYNOS_RNG_SEED_SIZE);
>>> +
>>>if (rng->type == EXYNOS_PRNG_TYPE4) {
>>>exynos_rng_writel(rng, EXYNOS_RNG_CONTROL_START,
>>>  EXYNOS_RNG_CONTROL);
>>> @@ -180,8 +154,8 @@ static int exynos_rng_get_random(struct 
>>> exynos_rng_dev
>>> *rng, }
>>>
>>>while (!(exynos_rng_readl(rng,
>>> -  EXYNOS_RNG_STATUS) & EXYNOS_RNG_STATUS_RNG_DONE) && 
>>> --retry)
>>> -  cpu_relax();
>>> +  EXYNOS_RNG_STATUS) & EXYNOS_RNG_STATUS_RNG_DONE) &&
>>> + --retry);
>>
>> [...]
>>
> The busy loop is not very busy. Every time I checked the loop (w/o
> cpu_relax()) was executed twice (retry was 98) and the operation was
> reliable. I don't see why do we need a memory barrier here. On the other
> hand, I am not sure the whole exynos_rng_get_random() shouldn't be ran
> under a mutex or a spinlock (I don't see anything like this in the upper
> layers of the crypto framework).
>
> The *_relaxed() I/O operations do not enforce memory

 The cpu_relax() is a common pattern for busy-loop. If you want to break
 this pattern - please explain why only this part of kernel should not
 follow it (and rest of kernel should).

 The other part - this code is already using relaxed versions which might
 get you into difficult to debug issues. You mentioned that loop works
 reliable after removing the cpu_relax... yeah, it might for 99.999% but
 that's not the argument. I remember few emails from Arnd Bergmann
 mentioning explicitly to avoid using relaxed versions "just because",
 unless it is necessary or really understood.

 The code first writes to control register, then checks for status so you
 should have these operations strictly ordered. Therefore I think
 cpu_relax() should not be removed.
>>>
>>> ... or just convert it to readl_poll_timeout() because it makes code
>>> more readable, takes care of timeout and you do not have care about
>>> specific implementation (whether there should or should not be
>>> cpu_relax).
>>
>> OK. This appears to perform reasonably.
>>
>> do {
>> cpu_relax();
>> } while (!(exynos_rng_readl(rng, EXYNOS_RNG_STATUS) &
>>EXYNOS_RNG_STATUS_RNG_DONE) && --retry);
>
> You mean that:
>   while (readl_relaxed()) cpu_relax();
> is slower than
>   do cpu_relax() while (readl_relaxed())
> ?
>
> Hmm... Interesting. I would be happy to learn more about it why it
> behaves so differently. Maybe the execution of cpu_relax() before
> readl_relaxed() reduces the amount of loops to actually one read?
>
> Indeed some parts of kernel code for ARM prefers this approach,
> although still the most popular pattern is existing one (while()
> cpu_relax).

Without cpu_relax(), retry is decremented twice, so there are three
reads. It appears that a single call to cpu_relax() gives the hardware
enough time to be ready when the first read occurs (retry is 99 after
the loop).
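
For reference, the readl_poll_timeout() variant would look roughly like
this (just a sketch: the delay/timeout values are placeholders, and it
uses plain readl() rather than the driver's relaxed accessor):

	#include <linux/iopoll.h>

	u32 status;
	int ret;

	/* poll STATUS until RNG_DONE is set, or give up after 1 ms */
	ret = readl_poll_timeout(rng->mem + EXYNOS_RNG_STATUS, status,
				 status & EXYNOS_RNG_STATUS_RNG_DONE,
				 0, 1000);
	if (ret)
		return ret;	/* -ETIMEDOUT */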

-- 
Łukasz Stelmach
Samsung R Institute Poland
Samsung Electronics




locking in prng

2017-12-06 Thread Łukasz Stelmach
Hi,

I am preparing some improvements for the exynos-rng PRNG driver. It is a
driver for the hardware PRNG in Exynos SoCs. The hardware should not be
accessed simultaneously by more than one thread/process, or it may return
the same data to all of them.

Are there any locks between kernel consumers or userspace and the
driver, or should the driver have its own lock to serialize hardware
access?
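
For the sake of discussion, the simplest option I can think of is a
driver-local mutex around the hardware access, roughly like this (sketch
only; the field name and where exactly the lock is taken are made up):

	/* in struct exynos_rng_dev */
	struct mutex lock;

	/* in exynos_rng_probe() */
	mutex_init(&rng->lock);

	/* around every hardware access, e.g. in the generate path */
	mutex_lock(&rng->lock);
	ret = exynos_rng_get_random(rng, dst, dlen, &read);
	mutex_unlock(&rng->lock);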

Kind regards,
-- 
Łukasz Stelmach
Samsung R Institute Poland
Samsung Electronics




Re: [PATCH v2 11/19] arm64: assembler: add macro to conditionally yield the NEON under PREEMPT

2017-12-06 Thread Ard Biesheuvel
On 6 December 2017 at 12:12, Dave P Martin  wrote:
> On Wed, Dec 06, 2017 at 11:57:12AM +, Ard Biesheuvel wrote:
>> On 6 December 2017 at 11:51, Dave Martin  wrote:
>> > On Tue, Dec 05, 2017 at 06:04:34PM +, Ard Biesheuvel wrote:
>> >> On 5 December 2017 at 12:45, Ard Biesheuvel  
>> >> wrote:
>> >> >
>> >> >
>> >> >> On 5 Dec 2017, at 12:28, Dave Martin  wrote:
>> >> >>
>> >> >>> On Mon, Dec 04, 2017 at 12:26:37PM +, Ard Biesheuvel wrote:
>> >> >>> Add a support macro to conditionally yield the NEON (and thus the CPU)
>> >> >>> that may be called from the assembler code. Given that especially the
>> >> >>> instruction based accelerated crypto code may use very tight loops, 
>> >> >>> add
>> >> >>> some parametrization so that the TIF_NEED_RESCHED flag test is only
>> >> >>> executed every so many loop iterations.
>> >> >>>
>> >> >>> In some cases, yielding the NEON involves saving and restoring a non
>> >> >>> trivial amount of context (especially in the CRC folding algorithms),
>> >> >>> and so the macro is split into two, and the code in between is only
>> >> >>> executed when the yield path is taken, allowing the contex to be 
>> >> >>> preserved.
>> >> >>> The second macro takes a label argument that marks the 
>> >> >>> resume-from-yield
>> >> >>> path, which should restore the preserved context again.
>> >> >>>
>> >> >>> Signed-off-by: Ard Biesheuvel 
>> >> >>> ---
>> >> >>> arch/arm64/include/asm/assembler.h | 50 
>> >> >>> 1 file changed, 50 insertions(+)
>> >> >>>
>> >> >>> diff --git a/arch/arm64/include/asm/assembler.h 
>> >> >>> b/arch/arm64/include/asm/assembler.h
>> >> >>> index aef72d886677..917b026d3e00 100644
>> >> >>> --- a/arch/arm64/include/asm/assembler.h
>> >> >>> +++ b/arch/arm64/include/asm/assembler.h
>> >> >>> @@ -512,4 +512,54 @@ alternative_else_nop_endif
>> >> >>> #endif
>> >> >>>.endm
>> >> >>>
>> >> >>> +/*
>> >> >>> + * yield_neon - check whether to yield to another runnable task from
>> >> >>> + *kernel mode NEON code (running with preemption disabled)
>> >> >>> + *
>> >> >>> + * - Check whether the preempt count is exactly 1, in which case 
>> >> >>> disabling
>> >> >>> + *   preemption once will make the task preemptible. If this is not 
>> >> >>> the case,
>> >> >>> + *   yielding is pointless.
>> >> >>> + * - Check whether TIF_NEED_RESCHED is set, and if so, disable and 
>> >> >>> re-enable
>> >> >>> + *   kernel mode NEON (which will trigger a reschedule), and branch 
>> >> >>> to the
>> >> >>> + *   yield fixup code at @lbl.
>> >> >>> + */
>> >> >>> +.macroyield_neon, lbl:req, ctr, order, stride, loop
>> >> >>> +yield_neon_pre\ctr, \order, \stride, \loop
>> >> >>> +yield_neon_post\lbl
>> >> >>> +.endm
>> >> >>> +
>> >> >>> +.macroyield_neon_pre, ctr, order=0, stride, loop=f
>> >> >>> +#ifdef CONFIG_PREEMPT
>> >> >>> +/*
>> >> >>> + * With some algorithms, it makes little sense to poll the
>> >> >>> + * TIF_NEED_RESCHED flag after every iteration, so only perform
>> >> >>> + * the check every 2^order strides.
>> >> >>> + */
>> >> >>> +.if\order > 1
>> >> >>> +.if(\stride & (\stride - 1)) != 0
>> >> >>> +.error"stride should be a power of 2"
>> >> >>> +.endif
>> >> >>> +tst\ctr, #((1 << \order) * \stride - 1) & ~(\stride - 1)
>> >> >>> +b.ne\loop
>> >> >>> +.endif
>> >> >>
>> >> >> I'm not sure what baking in this check gives us, and this seems to
>> >> >> conflate two rather different things: yielding and defining a
>> >> >> "heartbeat" frequency for the calling code.
>> >> >>
>> >> >> Can we separate out the crypto-loop-helper semantics from the yield-
>> >> >> neon stuff?
>> >> >>
>> >> >
>> >> > Fair enough. I incorporated the check here so it disappears from the 
>> >> > code entirely when !CONFIG_PREEMPT, because otherwise, you end up with 
>> >> > a sequence that is mispredicted every # iterations without any benefit.
>> >> > I guess i could macroise that separately though.
>> >> >
>> >> >> If we had
>> >> >>
>> >> >>if_cond_yield_neon
>> >> >>// patchup code
>> >> >>endif_yield_neon
>> >> >>
>> >>
>> >> I like this, but ...
>> >>
>> >> >> then the caller is free to conditionally branch over that as 
>> >> >> appropriate
>> >> >> like
>> >> >>
>> >> >> loop:
>> >> >>// crypto stuff
>> >> >>tst x0, #0xf
>> >> >>b.neloop
>> >> >>
>> >> >>if_cond_yield_neon
>> >> >>// patchup code
>> >> >>endif_cond_yield_neon
>> >> >>
>> >>
>> >> I need to put the post patchup code somewhere too. Please refer to the
>> >> CRC patches for the best examples of this.
>> >
>> > I'm not sure I follow.  If you need to do something different after
>> > patchup, can't you either pull that code into the if...endif too, or
>> > otherwise put a branch before 

Re: [PATCH v2 11/19] arm64: assembler: add macro to conditionally yield the NEON under PREEMPT

2017-12-06 Thread Dave P Martin
On Wed, Dec 06, 2017 at 11:57:12AM +, Ard Biesheuvel wrote:
> On 6 December 2017 at 11:51, Dave Martin  wrote:
> > On Tue, Dec 05, 2017 at 06:04:34PM +, Ard Biesheuvel wrote:
> >> On 5 December 2017 at 12:45, Ard Biesheuvel  
> >> wrote:
> >> >
> >> >
> >> >> On 5 Dec 2017, at 12:28, Dave Martin  wrote:
> >> >>
> >> >>> On Mon, Dec 04, 2017 at 12:26:37PM +, Ard Biesheuvel wrote:
> >> >>> Add a support macro to conditionally yield the NEON (and thus the CPU)
> >> >>> that may be called from the assembler code. Given that especially the
> >> >>> instruction based accelerated crypto code may use very tight loops, add
> >> >>> some parametrization so that the TIF_NEED_RESCHED flag test is only
> >> >>> executed every so many loop iterations.
> >> >>>
> >> >>> In some cases, yielding the NEON involves saving and restoring a non
> >> >>> trivial amount of context (especially in the CRC folding algorithms),
> >> >>> and so the macro is split into two, and the code in between is only
> >> >>> executed when the yield path is taken, allowing the contex to be 
> >> >>> preserved.
> >> >>> The second macro takes a label argument that marks the 
> >> >>> resume-from-yield
> >> >>> path, which should restore the preserved context again.
> >> >>>
> >> >>> Signed-off-by: Ard Biesheuvel 
> >> >>> ---
> >> >>> arch/arm64/include/asm/assembler.h | 50 
> >> >>> 1 file changed, 50 insertions(+)
> >> >>>
> >> >>> diff --git a/arch/arm64/include/asm/assembler.h 
> >> >>> b/arch/arm64/include/asm/assembler.h
> >> >>> index aef72d886677..917b026d3e00 100644
> >> >>> --- a/arch/arm64/include/asm/assembler.h
> >> >>> +++ b/arch/arm64/include/asm/assembler.h
> >> >>> @@ -512,4 +512,54 @@ alternative_else_nop_endif
> >> >>> #endif
> >> >>>.endm
> >> >>>
> >> >>> +/*
> >> >>> + * yield_neon - check whether to yield to another runnable task from
> >> >>> + *kernel mode NEON code (running with preemption disabled)
> >> >>> + *
> >> >>> + * - Check whether the preempt count is exactly 1, in which case 
> >> >>> disabling
> >> >>> + *   preemption once will make the task preemptible. If this is not 
> >> >>> the case,
> >> >>> + *   yielding is pointless.
> >> >>> + * - Check whether TIF_NEED_RESCHED is set, and if so, disable and 
> >> >>> re-enable
> >> >>> + *   kernel mode NEON (which will trigger a reschedule), and branch 
> >> >>> to the
> >> >>> + *   yield fixup code at @lbl.
> >> >>> + */
> >> >>> +.macroyield_neon, lbl:req, ctr, order, stride, loop
> >> >>> +yield_neon_pre\ctr, \order, \stride, \loop
> >> >>> +yield_neon_post\lbl
> >> >>> +.endm
> >> >>> +
> >> >>> +.macroyield_neon_pre, ctr, order=0, stride, loop=f
> >> >>> +#ifdef CONFIG_PREEMPT
> >> >>> +/*
> >> >>> + * With some algorithms, it makes little sense to poll the
> >> >>> + * TIF_NEED_RESCHED flag after every iteration, so only perform
> >> >>> + * the check every 2^order strides.
> >> >>> + */
> >> >>> +.if\order > 1
> >> >>> +.if(\stride & (\stride - 1)) != 0
> >> >>> +.error"stride should be a power of 2"
> >> >>> +.endif
> >> >>> +tst\ctr, #((1 << \order) * \stride - 1) & ~(\stride - 1)
> >> >>> +b.ne\loop
> >> >>> +.endif
> >> >>
> >> >> I'm not sure what baking in this check gives us, and this seems to
> >> >> conflate two rather different things: yielding and defining a
> >> >> "heartbeat" frequency for the calling code.
> >> >>
> >> >> Can we separate out the crypto-loop-helper semantics from the yield-
> >> >> neon stuff?
> >> >>
> >> >
> >> > Fair enough. I incorporated the check here so it disappears from the 
> >> > code entirely when !CONFIG_PREEMPT, because otherwise, you end up with a 
> >> > sequence that is mispredicted every # iterations without any benefit.
> >> > I guess i could macroise that separately though.
> >> >
> >> >> If we had
> >> >>
> >> >>if_cond_yield_neon
> >> >>// patchup code
> >> >>endif_yield_neon
> >> >>
> >>
> >> I like this, but ...
> >>
> >> >> then the caller is free to conditionally branch over that as appropriate
> >> >> like
> >> >>
> >> >> loop:
> >> >>// crypto stuff
> >> >>tst x0, #0xf
> >> >>b.neloop
> >> >>
> >> >>if_cond_yield_neon
> >> >>// patchup code
> >> >>endif_cond_yield_neon
> >> >>
> >>
> >> I need to put the post patchup code somewhere too. Please refer to the
> >> CRC patches for the best examples of this.
> >
> > I'm not sure I follow.  If you need to do something different after
> > patchup, can't you either pull that code into the if...endif too, or
> > otherwise put a branch before the endif?
> >
>
> No, not really.
>
> What I currently have is
>
> >if_cond_yield_neon
> >// patchup code
> >endif_cond_yield_neon
>
> which is being turned into
>
> get_thread_info x0
> 

Re: [PATCH v2 11/19] arm64: assembler: add macro to conditionally yield the NEON under PREEMPT

2017-12-06 Thread Ard Biesheuvel
On 6 December 2017 at 11:51, Dave Martin  wrote:
> On Tue, Dec 05, 2017 at 06:04:34PM +, Ard Biesheuvel wrote:
>> On 5 December 2017 at 12:45, Ard Biesheuvel  
>> wrote:
>> >
>> >
>> >> On 5 Dec 2017, at 12:28, Dave Martin  wrote:
>> >>
>> >>> On Mon, Dec 04, 2017 at 12:26:37PM +, Ard Biesheuvel wrote:
>> >>> Add a support macro to conditionally yield the NEON (and thus the CPU)
>> >>> that may be called from the assembler code. Given that especially the
>> >>> instruction based accelerated crypto code may use very tight loops, add
>> >>> some parametrization so that the TIF_NEED_RESCHED flag test is only
>> >>> executed every so many loop iterations.
>> >>>
>> >>> In some cases, yielding the NEON involves saving and restoring a non
>> >>> trivial amount of context (especially in the CRC folding algorithms),
>> >>> and so the macro is split into two, and the code in between is only
>> >>> executed when the yield path is taken, allowing the contex to be 
>> >>> preserved.
>> >>> The second macro takes a label argument that marks the resume-from-yield
>> >>> path, which should restore the preserved context again.
>> >>>
>> >>> Signed-off-by: Ard Biesheuvel 
>> >>> ---
>> >>> arch/arm64/include/asm/assembler.h | 50 
>> >>> 1 file changed, 50 insertions(+)
>> >>>
>> >>> diff --git a/arch/arm64/include/asm/assembler.h 
>> >>> b/arch/arm64/include/asm/assembler.h
>> >>> index aef72d886677..917b026d3e00 100644
>> >>> --- a/arch/arm64/include/asm/assembler.h
>> >>> +++ b/arch/arm64/include/asm/assembler.h
>> >>> @@ -512,4 +512,54 @@ alternative_else_nop_endif
>> >>> #endif
>> >>>.endm
>> >>>
>> >>> +/*
>> >>> + * yield_neon - check whether to yield to another runnable task from
>> >>> + *kernel mode NEON code (running with preemption disabled)
>> >>> + *
>> >>> + * - Check whether the preempt count is exactly 1, in which case 
>> >>> disabling
>> >>> + *   preemption once will make the task preemptible. If this is not the 
>> >>> case,
>> >>> + *   yielding is pointless.
>> >>> + * - Check whether TIF_NEED_RESCHED is set, and if so, disable and 
>> >>> re-enable
>> >>> + *   kernel mode NEON (which will trigger a reschedule), and branch to 
>> >>> the
>> >>> + *   yield fixup code at @lbl.
>> >>> + */
>> >>> +.macroyield_neon, lbl:req, ctr, order, stride, loop
>> >>> +yield_neon_pre\ctr, \order, \stride, \loop
>> >>> +yield_neon_post\lbl
>> >>> +.endm
>> >>> +
>> >>> +.macroyield_neon_pre, ctr, order=0, stride, loop=f
>> >>> +#ifdef CONFIG_PREEMPT
>> >>> +/*
>> >>> + * With some algorithms, it makes little sense to poll the
>> >>> + * TIF_NEED_RESCHED flag after every iteration, so only perform
>> >>> + * the check every 2^order strides.
>> >>> + */
>> >>> +.if\order > 1
>> >>> +.if(\stride & (\stride - 1)) != 0
>> >>> +.error"stride should be a power of 2"
>> >>> +.endif
>> >>> +tst\ctr, #((1 << \order) * \stride - 1) & ~(\stride - 1)
>> >>> +b.ne\loop
>> >>> +.endif
>> >>
>> >> I'm not sure what baking in this check gives us, and this seems to
>> >> conflate two rather different things: yielding and defining a
>> >> "heartbeat" frequency for the calling code.
>> >>
>> >> Can we separate out the crypto-loop-helper semantics from the yield-
>> >> neon stuff?
>> >>
>> >
>> > Fair enough. I incorporated the check here so it disappears from the code 
>> > entirely when !CONFIG_PREEMPT, because otherwise, you end up with a 
>> > sequence that is mispredicted every # iterations without any benefit.
>> > I guess i could macroise that separately though.
>> >
>> >> If we had
>> >>
>> >>if_cond_yield_neon
>> >>// patchup code
>> >>endif_yield_neon
>> >>
>>
>> I like this, but ...
>>
>> >> then the caller is free to conditionally branch over that as appropriate
>> >> like
>> >>
>> >> loop:
>> >>// crypto stuff
>> >>tst x0, #0xf
>> >>b.neloop
>> >>
>> >>if_cond_yield_neon
>> >>// patchup code
>> >>endif_cond_yield_neon
>> >>
>>
>> I need to put the post patchup code somewhere too. Please refer to the
>> CRC patches for the best examples of this.
>
> I'm not sure I follow.  If you need to do something different after
> patchup, can't you either pull that code into the if...endif too, or
> otherwise put a branch before the endif?
>

No, not really.

What I currently have is

>if_cond_yield_neon
>// patchup code
>endif_cond_yield_neon

which is being turned into

get_thread_info x0
ldr w1, [x0, #TSK_TI_PREEMPT]
ldr x0, [x0, #TSK_TI_FLAGS]
cmp w1, #1 // == PREEMPT_OFFSET
csel x0, x0, xzr, eq
tbnz x0, #TIF_NEED_RESCHED, f // needs rescheduling?
:

.subsection 1
:
// patchup code
bl kernel_neon_end
bl kernel_neon_begin
// post patchup code goes here
b 
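
To make the shape of that expansion easier to follow, here is a rough
C-level paraphrase of the control flow it generates. This is only an
illustration of where the "patchup" and "post patchup" steps sit around
the preemption point; the real code is the assembler above, and the
function name is invented.

#include <linux/preempt.h>
#include <linux/thread_info.h>
#include <asm/neon.h>

/* Illustrative only: called from inside a kernel-mode NEON loop. */
static void check_and_yield_neon(void)
{
        /* preempt count must be exactly 1 and a reschedule pending */
        if (preempt_count() == 1 && test_thread_flag(TIF_NEED_RESCHED)) {
                /* patchup: spill the extra NEON state the loop keeps live */
                kernel_neon_end();      /* preemption point: may reschedule */
                kernel_neon_begin();
                /* post patchup: reload the spilled state */
        }
        /* fall through and resume the tight loop */
}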

Re: [PATCH v2 11/19] arm64: assembler: add macro to conditionally yield the NEON under PREEMPT

2017-12-06 Thread Dave Martin
On Tue, Dec 05, 2017 at 06:04:34PM +, Ard Biesheuvel wrote:
> On 5 December 2017 at 12:45, Ard Biesheuvel  wrote:
> >
> >
> >> On 5 Dec 2017, at 12:28, Dave Martin  wrote:
> >>
> >>> On Mon, Dec 04, 2017 at 12:26:37PM +, Ard Biesheuvel wrote:
> >>> Add a support macro to conditionally yield the NEON (and thus the CPU)
> >>> that may be called from the assembler code. Given that especially the
> >>> instruction based accelerated crypto code may use very tight loops, add
> >>> some parametrization so that the TIF_NEED_RESCHED flag test is only
> >>> executed every so many loop iterations.
> >>>
> >>> In some cases, yielding the NEON involves saving and restoring a non
> >>> trivial amount of context (especially in the CRC folding algorithms),
> >>> and so the macro is split into two, and the code in between is only
> >>> executed when the yield path is taken, allowing the context to be
> >>> preserved.
> >>> The second macro takes a label argument that marks the resume-from-yield
> >>> path, which should restore the preserved context again.
> >>>
> >>> Signed-off-by: Ard Biesheuvel 
> >>> ---
> >>> arch/arm64/include/asm/assembler.h | 50 
> >>> 1 file changed, 50 insertions(+)
> >>>
> >>> diff --git a/arch/arm64/include/asm/assembler.h 
> >>> b/arch/arm64/include/asm/assembler.h
> >>> index aef72d886677..917b026d3e00 100644
> >>> --- a/arch/arm64/include/asm/assembler.h
> >>> +++ b/arch/arm64/include/asm/assembler.h
> >>> @@ -512,4 +512,54 @@ alternative_else_nop_endif
> >>> #endif
> >>>.endm
> >>>
> >>> +/*
> >>> + * yield_neon - check whether to yield to another runnable task from
> >>> + *kernel mode NEON code (running with preemption disabled)
> >>> + *
> >>> + * - Check whether the preempt count is exactly 1, in which case 
> >>> disabling
> >>> + *   preemption once will make the task preemptible. If this is not the 
> >>> case,
> >>> + *   yielding is pointless.
> >>> + * - Check whether TIF_NEED_RESCHED is set, and if so, disable and 
> >>> re-enable
> >>> + *   kernel mode NEON (which will trigger a reschedule), and branch to 
> >>> the
> >>> + *   yield fixup code at @lbl.
> >>> + */
> >>> +.macroyield_neon, lbl:req, ctr, order, stride, loop
> >>> +yield_neon_pre\ctr, \order, \stride, \loop
> >>> +yield_neon_post\lbl
> >>> +.endm
> >>> +
> >>> +.macroyield_neon_pre, ctr, order=0, stride, loop=f
> >>> +#ifdef CONFIG_PREEMPT
> >>> +/*
> >>> + * With some algorithms, it makes little sense to poll the
> >>> + * TIF_NEED_RESCHED flag after every iteration, so only perform
> >>> + * the check every 2^order strides.
> >>> + */
> >>> +.if\order > 1
> >>> +.if(\stride & (\stride - 1)) != 0
> >>> +.error"stride should be a power of 2"
> >>> +.endif
> >>> +tst\ctr, #((1 << \order) * \stride - 1) & ~(\stride - 1)
> >>> +b.ne\loop
> >>> +.endif
> >>
> >> I'm not sure what baking in this check gives us, and this seems to
> >> conflate two rather different things: yielding and defining a
> >> "heartbeat" frequency for the calling code.
> >>
> >> Can we separate out the crypto-loop-helper semantics from the yield-
> >> neon stuff?
> >>
> >
> > Fair enough. I incorporated the check here so it disappears from the code 
> > entirely when !CONFIG_PREEMPT, because otherwise, you end up with a 
> > sequence that is mispredicted every # iterations without any benefit.
> > I guess i could macroise that separately though.
> >
> >> If we had
> >>
> >>if_cond_yield_neon
> >>// patchup code
> >>endif_yield_neon
> >>
> 
> I like this, but ...
> 
> >> then the caller is free to conditionally branch over that as appropriate
> >> like
> >>
> >> loop:
> >>// crypto stuff
> >>tst x0, #0xf
> >>b.neloop
> >>
> >>if_cond_yield_neon
> >>// patchup code
> >>endif_cond_yield_neon
> >>
> 
> I need to put the post patchup code somewhere too. Please refer to the
> CRC patches for the best examples of this.

I'm not sure I follow.  If you need to do something different after
patchup, can't you either pull that code into the if...endif too, or
otherwise put a branch before the endif?

[...]

Cheers
---Dave
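
As an aside on the quoted mask, a small stand-alone C model of the
"tst \ctr, #((1 << \order) * \stride - 1) & ~(\stride - 1)" arithmetic is
sketched below. The counter direction and the concrete stride/order
values are assumptions chosen for illustration; the point is simply that
the TIF_NEED_RESCHED poll fires once every 2^order strides, and that the
sub-stride bits of the counter are ignored.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
        const uint64_t stride = 16;     /* bytes consumed per iteration */
        const unsigned int order = 3;   /* poll every 2^3 = 8 strides */
        const uint64_t mask = ((1ULL << order) * stride - 1) & ~(stride - 1);

        /* the same constraint the .error directive enforces */
        assert((stride & (stride - 1)) == 0);

        for (uint64_t ctr = 32 * stride; ctr > 0; ctr -= stride) {
                if (ctr & mask)
                        continue;       /* corresponds to "b.ne \loop" */
                printf("ctr=%llu: poll TIF_NEED_RESCHED\n",
                       (unsigned long long)ctr);
        }
        return 0;
}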


Re: [PATCH 2/3] crypto: exynos - Improve performance of PRNG

2017-12-06 Thread Krzysztof Kozlowski
On Wed, Dec 6, 2017 at 12:32 PM, Łukasz Stelmach  wrote:
> It was <2017-12-05 wto 19:06>, when Krzysztof Kozlowski wrote:
>> On Tue, Dec 5, 2017 at 6:53 PM, Krzysztof Kozlowski  wrote:
>>> On Tue, Dec 05, 2017 at 05:43:10PM +0100, Łukasz Stelmach wrote:
 It was <2017-12-05 wto 14:54>, when Stephan Mueller wrote:
> Am Dienstag, 5. Dezember 2017, 13:35:57 CET schrieb Łukasz Stelmach:
>> Use memcpy_fromio() instead of custom exynos_rng_copy_random() function
>> to retrieve generated numbers from the registers of PRNG.
>>
>> Remove unnecessary invocation of cpu_relax().
>>
>> Signed-off-by: Łukasz Stelmach 
>> ---
>>  drivers/crypto/exynos-rng.c | 36 +---
>>  1 file changed, 5 insertions(+), 31 deletions(-)
>>
>> diff --git a/drivers/crypto/exynos-rng.c b/drivers/crypto/exynos-rng.c
>> index 894ef93ef5ec..002e9d2a83cc 100644
>> --- a/drivers/crypto/exynos-rng.c
>> +++ b/drivers/crypto/exynos-rng.c

 [...]

>> @@ -171,6 +143,8 @@ static int exynos_rng_get_random(struct 
>> exynos_rng_dev
>> *rng, {
>>int retry = EXYNOS_RNG_WAIT_RETRIES;
>>
>> +  *read = min_t(size_t, dlen, EXYNOS_RNG_SEED_SIZE);
>> +
>>if (rng->type == EXYNOS_PRNG_TYPE4) {
>>exynos_rng_writel(rng, EXYNOS_RNG_CONTROL_START,
>>  EXYNOS_RNG_CONTROL);
>> @@ -180,8 +154,8 @@ static int exynos_rng_get_random(struct 
>> exynos_rng_dev
>> *rng, }
>>
>>while (!(exynos_rng_readl(rng,
>> -  EXYNOS_RNG_STATUS) & EXYNOS_RNG_STATUS_RNG_DONE) && 
>> --retry)
>> -  cpu_relax();
>> +  EXYNOS_RNG_STATUS) & EXYNOS_RNG_STATUS_RNG_DONE) &&
>> + --retry);
>
> [...]
>
 The busy loop is not very busy. Every time I checked the loop (w/o
 cpu_relax()) was executed twice (retry was 98) and the operation was
 reliable. I don't see why do we need a memory barrier here. On the other
 hand, I am not sure the whole exynos_rng_get_random() shouldn't be ran
 under a mutex or a spinlock (I don't see anything like this in the upper
 layers of the crypto framework).

 The *_relaxed() I/O operations do not enforce memory
>>>
>>> The cpu_relax() is a common pattern for busy-loop. If you want to break
>>> this pattern - please explain why only this part of kernel should not
>>> follow it (and rest of kernel should).
>>>
>>> The other part - this code is already using relaxed versions which might
>>> get you into difficult to debug issues. You mentioned that loop works
>>> reliable after removing the cpu_relax... yeah, it might for 99.999% but
>>> that's not the argument. I remember few emails from Arnd Bergmann
>>> mentioning explicitly to avoid using relaxed versions "just because",
>>> unless it is necessary or really understood.
>>>
>>> The code first writes to control register, then checks for status so you
>>> should have these operations strictly ordered. Therefore I think
>>> cpu_relax() should not be removed.
>>
>> ... or just convert it to readl_poll_timeout() because it makes code
>> more readable, takes care of timeout and you do not have care about
>> specific implementation (whether there should or should not be
>> cpu_relax).
>
> OK. This appears to perform reasonably.
>
> do {
> cpu_relax();
> } while (!(exynos_rng_readl(rng, EXYNOS_RNG_STATUS) &
>EXYNOS_RNG_STATUS_RNG_DONE) && --retry);

You mean that:
  while (readl_relaxed()) cpu_relax();
is slower than
  do cpu_relax() while (readl_relaxed())
?

Hmm... Interesting. I would be happy to learn more about why it
behaves so differently. Maybe executing cpu_relax() before
readl_relaxed() reduces the loop to effectively a single read?

Indeed, some parts of the ARM kernel code prefer this approach,
although the most popular pattern is still the existing one (while()
cpu_relax()).

Best regards,
Krzysztof
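
For reference, a hedged sketch of what that readl_poll_timeout()
conversion could look like is below. It assumes the mapped registers are
reachable as rng->mem (field name taken on trust from the driver), the
helper name is invented, and the ~10 us poll interval and 1 ms budget are
made-up numbers. Since readl_poll_timeout() may sleep, this only works
from process context; readl_poll_timeout_atomic() would be the
alternative otherwise.

#include <linux/iopoll.h>

static int exynos_rng_wait_done(struct exynos_rng_dev *rng)
{
        u32 status;

        /* poll STATUS until RNG_DONE is set, or give up after 1 ms */
        return readl_poll_timeout(rng->mem + EXYNOS_RNG_STATUS, status,
                                  status & EXYNOS_RNG_STATUS_RNG_DONE,
                                  10, 1000);
}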


Re: [PATCH 2/3] crypto: exynos - Improve performance of PRNG

2017-12-06 Thread Łukasz Stelmach
It was <2017-12-05 wto 19:06>, when Krzysztof Kozlowski wrote:
> On Tue, Dec 5, 2017 at 6:53 PM, Krzysztof Kozlowski  wrote:
>> On Tue, Dec 05, 2017 at 05:43:10PM +0100, Łukasz Stelmach wrote:
>>> It was <2017-12-05 wto 14:54>, when Stephan Mueller wrote:
 Am Dienstag, 5. Dezember 2017, 13:35:57 CET schrieb Łukasz Stelmach:
> Use memcpy_fromio() instead of custom exynos_rng_copy_random() function
> to retrieve generated numbers from the registers of PRNG.
>
> Remove unnecessary invocation of cpu_relax().
>
> Signed-off-by: Łukasz Stelmach 
> ---
>  drivers/crypto/exynos-rng.c | 36 +---
>  1 file changed, 5 insertions(+), 31 deletions(-)
>
> diff --git a/drivers/crypto/exynos-rng.c b/drivers/crypto/exynos-rng.c
> index 894ef93ef5ec..002e9d2a83cc 100644
> --- a/drivers/crypto/exynos-rng.c
> +++ b/drivers/crypto/exynos-rng.c
>>>
>>> [...]
>>>
> @@ -171,6 +143,8 @@ static int exynos_rng_get_random(struct exynos_rng_dev
> *rng, {
>int retry = EXYNOS_RNG_WAIT_RETRIES;
>
> +  *read = min_t(size_t, dlen, EXYNOS_RNG_SEED_SIZE);
> +
>if (rng->type == EXYNOS_PRNG_TYPE4) {
>exynos_rng_writel(rng, EXYNOS_RNG_CONTROL_START,
>  EXYNOS_RNG_CONTROL);
> @@ -180,8 +154,8 @@ static int exynos_rng_get_random(struct exynos_rng_dev
> *rng, }
>
>while (!(exynos_rng_readl(rng,
> -  EXYNOS_RNG_STATUS) & EXYNOS_RNG_STATUS_RNG_DONE) && 
> --retry)
> -  cpu_relax();
> +  EXYNOS_RNG_STATUS) & EXYNOS_RNG_STATUS_RNG_DONE) &&
> + --retry);

[...]

>>> The busy loop is not very busy. Every time I checked the loop (w/o
>>> cpu_relax()) was executed twice (retry was 98) and the operation was
>>> reliable. I don't see why do we need a memory barrier here. On the other
>>> hand, I am not sure the whole exynos_rng_get_random() shouldn't be ran
>>> under a mutex or a spinlock (I don't see anything like this in the upper
>>> layers of the crypto framework).
>>>
>>> The *_relaxed() I/O operations do not enforce memory
>>
>> The cpu_relax() is a common pattern for busy-loop. If you want to break
>> this pattern - please explain why only this part of kernel should not
>> follow it (and rest of kernel should).
>>
>> The other part - this code is already using relaxed versions which might
>> get you into difficult to debug issues. You mentioned that loop works
>> reliable after removing the cpu_relax... yeah, it might for 99.999% but
>> that's not the argument. I remember few emails from Arnd Bergmann
>> mentioning explicitly to avoid using relaxed versions "just because",
>> unless it is necessary or really understood.
>>
>> The code first writes to control register, then checks for status so you
>> should have these operations strictly ordered. Therefore I think
>> cpu_relax() should not be removed.
>
> ... or just convert it to readl_poll_timeout() because it makes code
> more readable, takes care of timeout and you do not have care about
> specific implementation (whether there should or should not be
> cpu_relax).

OK. This appears to perform reasonably.

do {
cpu_relax();
} while (!(exynos_rng_readl(rng, EXYNOS_RNG_STATUS) &
   EXYNOS_RNG_STATUS_RNG_DONE) && --retry);

-- 
Łukasz Stelmach
Samsung R&D Institute Poland
Samsung Electronics


Re: [PATCH RFC 4/4] crypto: stm32: convert to the new crypto engine API

2017-12-06 Thread Fabien DESSENNE
Hi,


On 29/11/17 09:41, Corentin Labbe wrote:
> This patch converts the driver to the new crypto engine API.
>
> Signed-off-by: Corentin Labbe 

Tested-by: Fabien Dessenne 

> ---
>   drivers/crypto/stm32/stm32-hash.c | 22 +++---
>   1 file changed, 15 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/crypto/stm32/stm32-hash.c 
> b/drivers/crypto/stm32/stm32-hash.c
> index 4ca4a264a833..e3f9f7b04ce2 100644
> --- a/drivers/crypto/stm32/stm32-hash.c
> +++ b/drivers/crypto/stm32/stm32-hash.c
> @@ -122,6 +122,7 @@ enum stm32_hash_data_format {
>   #define HASH_DMA_THRESHOLD  50
>   
>   struct stm32_hash_ctx {
> + struct crypto_engine_reqctx enginectx;
>   struct stm32_hash_dev   *hdev;
>   unsigned long   flags;
>   
> @@ -811,7 +812,7 @@ static void stm32_hash_finish_req(struct ahash_request 
> *req, int err)
>   rctx->flags |= HASH_FLAGS_ERRORS;
>   }
>   
> - crypto_finalize_hash_request(hdev->engine, req, err);
> + crypto_finalize_request(hdev->engine, &req->base, err);
>   }
>   
>   static int stm32_hash_hw_init(struct stm32_hash_dev *hdev,
> @@ -828,15 +829,21 @@ static int stm32_hash_hw_init(struct stm32_hash_dev 
> *hdev,
>   return 0;
>   }
>   
> +static int stm32_hash_one_request(struct crypto_engine *engine,
> +   struct crypto_async_request *areq);
> +static int stm32_hash_prepare_req(struct crypto_engine *engine,
> +   struct crypto_async_request *areq);
> +
>   static int stm32_hash_handle_queue(struct stm32_hash_dev *hdev,
>  struct ahash_request *req)
>   {
> - return crypto_transfer_hash_request_to_engine(hdev->engine, req);
> + return crypto_transfer_request_to_engine(hdev->engine, &req->base);
>   }
>   
>   static int stm32_hash_prepare_req(struct crypto_engine *engine,
> -   struct ahash_request *req)
> +   struct crypto_async_request *areq)
>   {
> + struct ahash_request *req = ahash_request_cast(areq);
>   struct stm32_hash_ctx *ctx = crypto_ahash_ctx(crypto_ahash_reqtfm(req));
>   struct stm32_hash_dev *hdev = stm32_hash_find_dev(ctx);
>   struct stm32_hash_request_ctx *rctx;
> @@ -855,8 +862,9 @@ static int stm32_hash_prepare_req(struct crypto_engine 
> *engine,
>   }
>   
>   static int stm32_hash_one_request(struct crypto_engine *engine,
> -   struct ahash_request *req)
> +   struct crypto_async_request *areq)
>   {
> + struct ahash_request *req = ahash_request_cast(areq);
>   struct stm32_hash_ctx *ctx = crypto_ahash_ctx(crypto_ahash_reqtfm(req));
>   struct stm32_hash_dev *hdev = stm32_hash_find_dev(ctx);
>   struct stm32_hash_request_ctx *rctx;
> @@ -1033,6 +1041,9 @@ static int stm32_hash_cra_init_algs(struct crypto_tfm 
> *tfm,
>   if (algs_hmac_name)
>   ctx->flags |= HASH_FLAGS_HMAC;
>   
> + ctx->enginectx.op.do_one_request = stm32_hash_one_request;
> + ctx->enginectx.op.prepare_request = stm32_hash_prepare_req;
> + ctx->enginectx.op.unprepare_request = NULL;
>   return 0;
>   }
>   
> @@ -1493,9 +1504,6 @@ static int stm32_hash_probe(struct platform_device 
> *pdev)
>   goto err_engine;
>   }
>   
> - hdev->engine->prepare_hash_request = stm32_hash_prepare_req;
> - hdev->engine->hash_one_request = stm32_hash_one_request;
> -
>   ret = crypto_engine_start(hdev->engine);
>   if (ret)
>   goto err_engine_start;


Re: [PATCH RFC 1/4] crypto: engine - Permit to enqueue all async requests

2017-12-06 Thread Fabien DESSENNE


On 29/11/17 09:41, Corentin Labbe wrote:
> The crypto engine could actually only enqueue hash and ablkcipher requests.
> This patch permits it to enqueue any type of crypto_async_request.
>
> Signed-off-by: Corentin Labbe 
> ---
>   crypto/crypto_engine.c  | 188 
> +++-
>   include/crypto/engine.h |  46 +---
>   2 files changed, 60 insertions(+), 174 deletions(-)
>
> diff --git a/crypto/crypto_engine.c b/crypto/crypto_engine.c
> index 61e7c4e02fd2..f7c4c4c1f41b 100644
> --- a/crypto/crypto_engine.c
> +++ b/crypto/crypto_engine.c
> @@ -34,11 +34,10 @@ static void crypto_pump_requests(struct crypto_engine 
> *engine,
>bool in_kthread)
>   {
>   struct crypto_async_request *async_req, *backlog;
> - struct ahash_request *hreq;
> - struct ablkcipher_request *breq;
>   unsigned long flags;
>   bool was_busy = false;
> - int ret, rtype;
> + int ret;
> + struct crypto_engine_reqctx *enginectx;
>   
>   spin_lock_irqsave(&engine->queue_lock, flags);
>   
> @@ -94,7 +93,6 @@ static void crypto_pump_requests(struct crypto_engine 
> *engine,
>   
>   spin_unlock_irqrestore(&engine->queue_lock, flags);
>   
> - rtype = crypto_tfm_alg_type(engine->cur_req->tfm);
>   /* Until here we get the request need to be encrypted successfully */
>   if (!was_busy && engine->prepare_crypt_hardware) {
>   ret = engine->prepare_crypt_hardware(engine);
> @@ -104,57 +102,31 @@ static void crypto_pump_requests(struct crypto_engine 
> *engine,
>   }
>   }
>   
> - switch (rtype) {
> - case CRYPTO_ALG_TYPE_AHASH:
> - hreq = ahash_request_cast(engine->cur_req);
> - if (engine->prepare_hash_request) {
> - ret = engine->prepare_hash_request(engine, hreq);
> - if (ret) {
> - dev_err(engine->dev, "failed to prepare 
> request: %d\n",
> - ret);
> - goto req_err;
> - }
> - engine->cur_req_prepared = true;
> - }
> - ret = engine->hash_one_request(engine, hreq);
> - if (ret) {
> - dev_err(engine->dev, "failed to hash one request from 
> queue\n");
> - goto req_err;
> - }
> - return;
> - case CRYPTO_ALG_TYPE_ABLKCIPHER:
> - breq = ablkcipher_request_cast(engine->cur_req);
> - if (engine->prepare_cipher_request) {
> - ret = engine->prepare_cipher_request(engine, breq);
> - if (ret) {
> - dev_err(engine->dev, "failed to prepare 
> request: %d\n",
> - ret);
> - goto req_err;
> - }
> - engine->cur_req_prepared = true;
> - }
> - ret = engine->cipher_one_request(engine, breq);
> + enginectx = crypto_tfm_ctx(async_req->tfm);
> +
> + if (enginectx->op.prepare_request) {
> + ret = enginectx->op.prepare_request(engine, async_req);
>   if (ret) {
> - dev_err(engine->dev, "failed to cipher one request from 
> queue\n");
> + dev_err(engine->dev, "failed to prepare request: %d\n",
> + ret);
>   goto req_err;
>   }
> - return;
> - default:
> - dev_err(engine->dev, "failed to prepare request of unknown 
> type\n");
> - return;
> + engine->cur_req_prepared = true;
> + }
> + if (!enginectx->op.do_one_request) {
> + dev_err(engine->dev, "failed to do request\n");
> + ret = -EINVAL;
> + goto req_err;
> + }
> + ret = enginectx->op.do_one_request(engine, async_req);
> + if (ret) {
> + dev_err(engine->dev, "failed to hash one request from queue\n");
> + goto req_err;
>   }
> + return;
>   
>   req_err:
> - switch (rtype) {
> - case CRYPTO_ALG_TYPE_AHASH:
> - hreq = ahash_request_cast(engine->cur_req);
> - crypto_finalize_hash_request(engine, hreq, ret);
> - break;
> - case CRYPTO_ALG_TYPE_ABLKCIPHER:
> - breq = ablkcipher_request_cast(engine->cur_req);
> - crypto_finalize_cipher_request(engine, breq, ret);
> - break;
> - }
> + crypto_finalize_request(engine, async_req, ret);
>   return;
>   
>   out:
> @@ -170,59 +142,16 @@ static void crypto_pump_work(struct kthread_work *work)
>   }
>   
>   /**
> - * crypto_transfer_cipher_request - transfer the new request into the
> - * enginequeue
> + * crypto_transfer_request - transfer the new request into the engine queue
>* @engine: the hardware engine
>* @req: the 

Re: [PATCH RFC 0/4] crypto: engine - Permit to enqueue all async requests

2017-12-06 Thread Fabien DESSENNE
Hi Corentin,


I am fine with this proposal: it is generic enough, and I have been able
to test and run the crypto engine with aead_request without changing a
single line of code.

This is what I need to be able to send the AEAD extension of the 
stm32-cryp driver (successfully tested with your engine upgrade proposal).


I have also tested the stm32-hash patch.

Note that stm32-cryp (a new driver applied by Herbert recently,
[https://git.kernel.org/pub/scm/linux/kernel/git/herbert/cryptodev-2.6.git/commit/?id=9e054ec21ef8344345b28603fb272fe999f735db])
would also need to be converted to the new crypto engine API: this is a
trivial patch.

Thank you for your proposal; I hope it is fine for Herbert too.

BR


Fabien


On 29/11/17 09:41, Corentin Labbe wrote:
> Hello
>
> The current crypto_engine supports only ahash and ablkcipher.
> My first patch, which tried to add skcipher, was nacked: it would add too
> many functions, and adding other algs (aead, asymmetric_key) would make
> the situation worse.
>
> This patchset removes all alg-specific stuff and now only processes generic
> crypto_async_request.
>
> The request handler function pointers are moved out of struct engine and
> are now stored directly in a crypto_engine_reqctx.
>
> Herbert's original proposal [1] cannot be implemented completely, since the
> crypto_engine can only dequeue a crypto_async_request and it is impossible
> to access any request_ctx without knowing the underlying request type.
>
> So I do something close to what was requested: adding a crypto_engine_reqctx
> to the TFM context.
> Note that the current implementation expects crypto_engine_reqctx to be
> the first member of the context.
>
> The first patch converts the crypto engine to the new way,
> while the following patches convert the 3 existing users of crypto_engine.
> Note that this split breaks bisection, so in the end it will probably all
> be merged into a single commit.
>
> The last 3 patches were compile-tested only, but the first has been tested
> successfully with my new sun8i-ce driver.
>
> Regards
>
> [1] https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1474434.html
>
> Corentin Labbe (4):
>crypto: engine - Permit to enqueue all async requests
>crypto: omap: convert to new crypto engine API
>crypto: virtio: convert to new crypto engine API
>crypto: stm32: convert to the new crypto engine API
>
>   crypto/crypto_engine.c   | 188 
> ++-
>   drivers/crypto/omap-aes.c|  21 ++-
>   drivers/crypto/omap-aes.h|   3 +
>   drivers/crypto/omap-des.c|  24 +++-
>   drivers/crypto/stm32/stm32-hash.c|  22 +++-
>   drivers/crypto/virtio/virtio_crypto_algs.c   |  15 ++-
>   drivers/crypto/virtio/virtio_crypto_common.h |   2 +-
>   drivers/crypto/virtio/virtio_crypto_core.c   |   3 -
>   include/crypto/engine.h  |  46 +++
>   9 files changed, 122 insertions(+), 202 deletions(-)
>
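
For readers new to the engine, a rough driver-side sketch of the pattern
described above follows, using the RFC's names (crypto_engine_reqctx,
crypto_transfer_request_to_engine, crypto_finalize_request). All my_drv_*
identifiers are invented, error handling is omitted, and the hardware
programming is only hinted at in comments; this is a sketch of the
intended usage, not code from the series.

struct my_drv_tfm_ctx {
        struct crypto_engine_reqctx enginectx; /* must stay the first member */
        /* driver-private per-tfm state follows */
};

static int my_drv_do_one_request(struct crypto_engine *engine,
                                 struct crypto_async_request *areq)
{
        /*
         * Cast back to the concrete request type here, e.g. with
         * ahash_request_cast() as stm32-hash does, then start the hardware
         * and return 0.  The completion path (an interrupt handler, say)
         * later reports the result with:
         *
         *      crypto_finalize_request(engine, areq, err);
         */
        return 0;
}

static int my_drv_cra_init(struct crypto_tfm *tfm)
{
        struct my_drv_tfm_ctx *ctx = crypto_tfm_ctx(tfm);

        ctx->enginectx.op.prepare_request = NULL;
        ctx->enginectx.op.unprepare_request = NULL;
        ctx->enginectx.op.do_one_request = my_drv_do_one_request;
        return 0;
}

/* enqueue side: any request type goes in through its common base */
static int my_drv_queue_ahash(struct crypto_engine *engine,
                              struct ahash_request *req)
{
        return crypto_transfer_request_to_engine(engine, &req->base);
}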