Re: [PATCH v3 3/5] crypto: arm/speck - add NEON-accelerated implementation of Speck-XTS

2018-06-18 Thread Ard Biesheuvel
On 18 June 2018 at 23:56, Eric Biggers  wrote:
> On Sun, Jun 17, 2018 at 01:10:41PM +0200, Ard Biesheuvel wrote:
>> > +
>> > + // One-time XTS preparation
>> > +
>> > + /*
>> > +  * Allocate stack space to store 128 bytes worth of tweaks.  For
>> > +  * performance, this space is aligned to a 16-byte boundary so that we
>> > +  * can use the load/store instructions that declare 16-byte alignment.
>> > +  */
>> > + sub sp, #128
>> > + bic sp, #0xf
>> 
>> 
>>  This fails here when building with CONFIG_THUMB2_KERNEL=y
>> 
>>    AS  arch/arm/crypto/speck-neon-core.o
>> 
>>  arch/arm/crypto/speck-neon-core.S: Assembler messages:
>> 
>>  arch/arm/crypto/speck-neon-core.S:419: Error: r13 not allowed here --
>>  `bic sp,#0xf'
>>  arch/arm/crypto/speck-neon-core.S:423: Error: r13 not allowed here --
>>  `bic sp,#0xf'
>>  arch/arm/crypto/speck-neon-core.S:427: Error: r13 not allowed here --
>>  `bic sp,#0xf'
>>  arch/arm/crypto/speck-neon-core.S:431: Error: r13 not allowed here --
>>  `bic sp,#0xf'
>> 
>>  In a quick hack this change seems to address it:
>> 
>> 
>>  -   sub sp, #128
>>  -   bic sp, #0xf
>>  +   mov r6, sp
>>  +   sub r6, #128
>>  +   bic r6, #0xf
>>  +   mov sp, r6
>> 
>>  But there is probably a better solution to address this.
>> 
>> >>>
>> >>> Given that there is no NEON on M class cores, I recommend we put 
>> >>> something like
>> >>>
>> >>> THUMB(bx pc)
>> >>> THUMB(nop.w)
>> >>> THUMB(.arm)
>> >>>
>> >>> at the beginning and be done with it.
>> >>
>> >> I mean nop.n or just nop, of course, and we may need a '.align 2' at
>> >> the beginning as well.
>> >
>> > Wouldn't it be preferable to have it assemble it in Thumb2 too? It seems
>> > that bic sp,#0xf is the only issue...
>> >
>>
>> Well, in general, yes. In the case of NEON code, not really, since the
>> resulting code will not be smaller anyway, because the Thumb2 NEON
>> opcodes are all 4 bytes. Also, Thumb2-only cores don't have NEON
>> units, so all cores that this code can run on will be able to run in
>> ARM mode.
>>
>> So from a maintainability pov, having code that only assembles in one
>> way is better than having code that must compile both to ARM and to
>> Thumb2 opcodes.
>>
>> Just my 2 cents, anyway.
>
> I don't have too much of a preference, though Stefan's suggested 4 instructions
> can be reduced to 3, which also matches what aes-neonbs-core.S does:
>
> sub r12, sp, #128
> bic r12, #0xf
> mov sp, r12
>
> Ard, is the following what you're suggesting instead?
>

Yes, but after looking at the actual code, I prefer the change above.
The access occurs only once, not in the loop so the additional
instructions should not affect performance.
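
To spell out what that three-instruction sequence computes, here is a rough C
model of the arithmetic. It is only a sketch: align_tweak_buffer() is a
hypothetical name that does not appear in the patch, and the stack pointer is
modeled as a plain uintptr_t.

```c
#include <stdint.h>

/*
 * Model of the scratch-register sequence:
 *   sub r12, sp, #128   reserve 128 bytes below the current stack pointer
 *   bic r12, #0xf       clear the low four bits (round down to 16 bytes)
 *   mov sp, r12         install the result as the new stack pointer
 * align_tweak_buffer() is a hypothetical helper used only for illustration.
 */
static inline uintptr_t align_tweak_buffer(uintptr_t sp)
{
	uintptr_t tmp = sp - 128;	/* sub r12, sp, #128 */

	tmp &= ~(uintptr_t)0xf;		/* bic r12, #0xf */
	return tmp;			/* mov sp, r12 */
}
```

The result is always 16-byte aligned and at least 128 bytes below the incoming
value. The scratch register is needed only because the Thumb2 build rejects sp
as the operand of bic.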

Apologies for the noise.

> diff --git a/arch/arm/crypto/speck-neon-core.S b/arch/arm/crypto/speck-neon-core.S
> index 3c1e203e53b9..c989ce3dc057 100644
> --- a/arch/arm/crypto/speck-neon-core.S
> +++ b/arch/arm/crypto/speck-neon-core.S
> @@ -8,6 +8,7 @@
>   */
>
>  #include 
> +#include 
>
>         .text
>         .fpu            neon
> @@ -233,6 +234,12 @@
>   * nonzero multiple of 128.
>   */
>  .macro _speck_xts_cryptn, decrypting
> +
> +   .align  2
> +   THUMB(bx pc)
> +   THUMB(nop)
> +   THUMB(.arm)
> +
>         push            {r4-r7}
>         mov             r7, sp
>
> @@ -413,6 +420,8 @@
> mov sp, r7
> pop {r4-r7}
> bx  lr
> +
> +   THUMB(.thumb)
>  .endm
>
>  ENTRY(speck128_xts_encrypt_neon)


Re: [PATCH v3 3/5] crypto: arm/speck - add NEON-accelerated implementation of Speck-XTS

2018-06-18 Thread Eric Biggers
On Sun, Jun 17, 2018 at 01:10:41PM +0200, Ard Biesheuvel wrote:
> > +
> > + // One-time XTS preparation
> > +
> > + /*
> > +  * Allocate stack space to store 128 bytes worth of tweaks.  For
> > +  * performance, this space is aligned to a 16-byte boundary so that we
> > +  * can use the load/store instructions that declare 16-byte alignment.
> > +  */
> > + sub sp, #128
> > + bic sp, #0xf
> 
> 
>  This fails here when building with CONFIG_THUMB2_KERNEL=y
> 
>    AS  arch/arm/crypto/speck-neon-core.o
> 
>  arch/arm/crypto/speck-neon-core.S: Assembler messages:
> 
>  arch/arm/crypto/speck-neon-core.S:419: Error: r13 not allowed here --
>  `bic sp,#0xf'
>  arch/arm/crypto/speck-neon-core.S:423: Error: r13 not allowed here --
>  `bic sp,#0xf'
>  arch/arm/crypto/speck-neon-core.S:427: Error: r13 not allowed here --
>  `bic sp,#0xf'
>  arch/arm/crypto/speck-neon-core.S:431: Error: r13 not allowed here --
>  `bic sp,#0xf'
> 
>  In a quick hack this change seems to address it:
> 
> 
>  -   sub sp, #128
>  -   bic sp, #0xf
>  +   mov r6, sp
>  +   sub r6, #128
>  +   bic r6, #0xf
>  +   mov sp, r6
> 
>  But there is probably a better solution to address this.
> 
> >>>
> >>> Given that there is no NEON on M class cores, I recommend we put 
> >>> something like
> >>>
> >>> THUMB(bx pc)
> >>> THUMB(nop.w)
> >>> THUMB(.arm)
> >>>
> >>> at the beginning and be done with it.
> >>
> >> I mean nop.n or just nop, of course, and we may need a '.align 2' at
> >> the beginning as well.
> >
> > Wouldn't it be preferable to have it assemble it in Thumb2 too? It seems
> > that bic sp,#0xf is the only issue...
> >
> 
> Well, in general, yes. In the case of NEON code, not really, since the
> resulting code will not be smaller anyway, because the Thumb2 NEON
> opcodes are all 4 bytes. Also, Thumb2-only cores don't have NEON
> units, so all cores that this code can run on will be able to run in
> ARM mode.
> 
> So from a maintainability pov, having code that only assembles in one
> way is better than having code that must compile both to ARM and to
> Thumb2 opcodes.
> 
> Just my 2 cents, anyway.

I don't have too much of a preference, though Stefan's suggested 4 instructions
can be reduced to 3, which also matches what aes-neonbs-core.S does:

sub r12, sp, #128
bic r12, #0xf
mov sp, r12

Ard, is the following what you're suggesting instead?

diff --git a/arch/arm/crypto/speck-neon-core.S b/arch/arm/crypto/speck-neon-core.S
index 3c1e203e53b9..c989ce3dc057 100644
--- a/arch/arm/crypto/speck-neon-core.S
+++ b/arch/arm/crypto/speck-neon-core.S
@@ -8,6 +8,7 @@
  */
 
 #include 
+#include 
 
        .text
        .fpu            neon
@@ -233,6 +234,12 @@
  * nonzero multiple of 128.
  */
 .macro _speck_xts_cryptn, decrypting
+
+   .align  2
+   THUMB(bx pc)
+   THUMB(nop)
+   THUMB(.arm)
+
        push            {r4-r7}
        mov             r7, sp
 
@@ -413,6 +420,8 @@
mov sp, r7
pop {r4-r7}
bx  lr
+
+   THUMB(.thumb)
 .endm
 
 ENTRY(speck128_xts_encrypt_neon)


Re: [PATCH v3 3/5] crypto: arm/speck - add NEON-accelerated implementation of Speck-XTS

2018-06-17 Thread Ard Biesheuvel
On 17 June 2018 at 12:41, Stefan Agner  wrote:

Re: [PATCH v3 3/5] crypto: arm/speck - add NEON-accelerated implementation of Speck-XTS

2018-06-17 Thread Stefan Agner
On 17.06.2018 11:40, Ard Biesheuvel wrote:

Re: [PATCH v3 3/5] crypto: arm/speck - add NEON-accelerated implementation of Speck-XTS

2018-06-17 Thread Ard Biesheuvel
On 17 June 2018 at 11:30, Ard Biesheuvel  wrote:

Re: [PATCH v3 3/5] crypto: arm/speck - add NEON-accelerated implementation of Speck-XTS

2018-06-17 Thread Ard Biesheuvel
On 17 June 2018 at 00:40, Stefan Agner  wrote:

Re: [PATCH v3 3/5] crypto: arm/speck - add NEON-accelerated implementation of Speck-XTS

2018-06-16 Thread Stefan Agner
Hi Eric,

On 14.02.2018 19:42, Eric Biggers wrote:
> Add an ARM NEON-accelerated implementation of Speck-XTS.  It operates on
> 128-byte chunks at a time, i.e. 8 blocks for Speck128 or 16 blocks for
> Speck64.  Each 128-byte chunk goes through XTS preprocessing, then is
> encrypted/decrypted (doing one cipher round for all the blocks, then the
> next round, etc.), then goes through XTS postprocessing.
> 
> The performance depends on the processor but can be about 3 times faster
> than the generic code.  For example, on an ARMv7 processor we observe
> the following performance with Speck128/256-XTS:
> 
> xts-speck128-neon: Encryption 107.9 MB/s, Decryption 108.1 MB/s
> xts(speck128-generic): Encryption  32.1 MB/s, Decryption  36.6 MB/s
> 
> In comparison to AES-256-XTS without the Cryptography Extensions:
> 
> xts-aes-neonbs:        Encryption  41.2 MB/s, Decryption  36.7 MB/s
> xts(aes-asm):          Encryption  31.7 MB/s, Decryption  30.8 MB/s
> xts(aes-generic):      Encryption  21.2 MB/s, Decryption  20.9 MB/s
> 
> Speck64/128-XTS is even faster:
> 
> xts-speck64-neon:  Encryption 138.6 MB/s, Decryption 139.1 MB/s
> 
> Note that as with the generic code, only the Speck128 and Speck64
> variants are supported.  Also, for now only the XTS mode of operation is
> supported, to target the disk and file encryption use cases.  The NEON
> code also only handles the portion of the data that is evenly divisible
> into 128-byte chunks, with any remainder handled by a C fallback.  Of
> course, other modes of operation could be added later if needed, and/or
> the NEON code could be updated to handle other buffer sizes.
> 
> The XTS specification is only defined for AES which has a 128-bit block
> size, so for the GF(2^64) math needed for Speck64-XTS we use the
> reducing polynomial 'x^64 + x^4 + x^3 + x + 1' given by the original XEX
> paper.  Of course, when possible users should use Speck128-XTS, but even
> that may be too slow on some processors; Speck64-XTS can be faster.
> 
> Signed-off-by: Eric Biggers 
> ---
>  arch/arm/crypto/Kconfig   |   6 +
>  arch/arm/crypto/Makefile  |   2 +
>  arch/arm/crypto/speck-neon-core.S | 432 ++
>  arch/arm/crypto/speck-neon-glue.c | 288 
>  4 files changed, 728 insertions(+)
>  create mode 100644 arch/arm/crypto/speck-neon-core.S
>  create mode 100644 arch/arm/crypto/speck-neon-glue.c
> 
> diff --git a/arch/arm/crypto/Kconfig b/arch/arm/crypto/Kconfig
> index b8e69fe282b8..925d1364727a 100644
> --- a/arch/arm/crypto/Kconfig
> +++ b/arch/arm/crypto/Kconfig
> @@ -121,4 +121,10 @@ config CRYPTO_CHACHA20_NEON
>   select CRYPTO_BLKCIPHER
>   select CRYPTO_CHACHA20
>  
> +config CRYPTO_SPECK_NEON
> + tristate "NEON accelerated Speck cipher algorithms"
> + depends on KERNEL_MODE_NEON
> + select CRYPTO_BLKCIPHER
> + select CRYPTO_SPECK
> +
>  endif
> diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile
> index 30ef8e291271..a758107c5525 100644
> --- a/arch/arm/crypto/Makefile
> +++ b/arch/arm/crypto/Makefile
> @@ -10,6 +10,7 @@ obj-$(CONFIG_CRYPTO_SHA1_ARM_NEON) += sha1-arm-neon.o
>  obj-$(CONFIG_CRYPTO_SHA256_ARM) += sha256-arm.o
>  obj-$(CONFIG_CRYPTO_SHA512_ARM) += sha512-arm.o
>  obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha20-neon.o
> +obj-$(CONFIG_CRYPTO_SPECK_NEON) += speck-neon.o
>  
>  ce-obj-$(CONFIG_CRYPTO_AES_ARM_CE) += aes-arm-ce.o
>  ce-obj-$(CONFIG_CRYPTO_SHA1_ARM_CE) += sha1-arm-ce.o
> @@ -53,6 +54,7 @@ ghash-arm-ce-y  := ghash-ce-core.o ghash-ce-glue.o
>  crct10dif-arm-ce-y   := crct10dif-ce-core.o crct10dif-ce-glue.o
>  crc32-arm-ce-y       := crc32-ce-core.o crc32-ce-glue.o
>  chacha20-neon-y := chacha20-neon-core.o chacha20-neon-glue.o
> +speck-neon-y := speck-neon-core.o speck-neon-glue.o
>  
>  quiet_cmd_perl = PERL    $@
>        cmd_perl = $(PERL) $(<) > $(@)
> diff --git a/arch/arm/crypto/speck-neon-core.S b/arch/arm/crypto/speck-neon-core.S
> new file mode 100644
> index ..3c1e203e53b9
> --- /dev/null
> +++ b/arch/arm/crypto/speck-neon-core.S
> @@ -0,0 +1,432 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * NEON-accelerated implementation of Speck128-XTS and Speck64-XTS
> + *
> + * Copyright (c) 2018 Google, Inc
> + *
> + * Author: Eric Biggers 
> + */
> +
> +#include 
> +
> +       .text
> +       .fpu            neon
> +
> +       // arguments
> +       ROUND_KEYS      .req    r0      // const {u64,u32} *round_keys
> +       NROUNDS         .req    r1      // int nrounds
> +       DST             .req    r2      // void *dst
> +       SRC             .req    r3      // const void *src
> +       NBYTES          .req    r4      // unsigned int nbytes
> +       TWEAK           .req    r5      // void *tweak
> +
> +       // registers which hold the data being encrypted/decrypted
> +       X0              .req    q0
> +       X0_L            .req    d0
> +       X0_H            .req    d1
> +       Y0              .req    q1
> + Y0_H

[PATCH v3 3/5] crypto: arm/speck - add NEON-accelerated implementation of Speck-XTS

2018-02-14 Thread Eric Biggers
Add an ARM NEON-accelerated implementation of Speck-XTS.  It operates on
128-byte chunks at a time, i.e. 8 blocks for Speck128 or 16 blocks for
Speck64.  Each 128-byte chunk goes through XTS preprocessing, then is
encrypted/decrypted (doing one cipher round for all the blocks, then the
next round, etc.), then goes through XTS postprocessing.

The performance depends on the processor but can be about 3 times faster
than the generic code.  For example, on an ARMv7 processor we observe
the following performance with Speck128/256-XTS:

xts-speck128-neon: Encryption 107.9 MB/s, Decryption 108.1 MB/s
xts(speck128-generic): Encryption  32.1 MB/s, Decryption  36.6 MB/s

In comparison to AES-256-XTS without the Cryptography Extensions:

xts-aes-neonbs:        Encryption  41.2 MB/s, Decryption  36.7 MB/s
xts(aes-asm):          Encryption  31.7 MB/s, Decryption  30.8 MB/s
xts(aes-generic):      Encryption  21.2 MB/s, Decryption  20.9 MB/s

Speck64/128-XTS is even faster:

xts-speck64-neon:  Encryption 138.6 MB/s, Decryption 139.1 MB/s

Note that as with the generic code, only the Speck128 and Speck64
variants are supported.  Also, for now only the XTS mode of operation is
supported, to target the disk and file encryption use cases.  The NEON
code also only handles the portion of the data that is evenly divisible
into 128-byte chunks, with any remainder handled by a C fallback.  Of
course, other modes of operation could be added later if needed, and/or
the NEON code could be updated to handle other buffer sizes.
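
As a rough illustration of the chunking described above, the following C sketch
splits a request length into the part the NEON code would take and the
remainder left for the C fallback. All names here (SPECK_NEON_CHUNK_SIZE,
struct speck_len_split, speck_split_len(), speck_blocks_per_chunk()) are
illustrative assumptions and are not necessarily what speck-neon-glue.c uses.

```c
#include <stddef.h>

/* Bytes processed per NEON invocation, per the description above. */
#define SPECK_NEON_CHUNK_SIZE	128

/* Hypothetical result type: how a request length is divided. */
struct speck_len_split {
	size_t neon_bytes;	/* largest multiple of 128 that fits */
	size_t fallback_bytes;	/* remainder handled by generic C code */
};

static inline struct speck_len_split speck_split_len(size_t nbytes)
{
	struct speck_len_split s;

	s.neon_bytes = nbytes - (nbytes % SPECK_NEON_CHUNK_SIZE);
	s.fallback_bytes = nbytes - s.neon_bytes;
	return s;
}

/* Cipher blocks per 128-byte chunk: 8 for Speck128, 16 for Speck64. */
static inline size_t speck_blocks_per_chunk(size_t block_size)
{
	return SPECK_NEON_CHUNK_SIZE / block_size;
}
```

For a 300-byte request this gives 256 bytes to the NEON path and 44 bytes to
the fallback; a 4096-byte request is handled entirely by the NEON path.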

The XTS specification is only defined for AES which has a 128-bit block
size, so for the GF(2^64) math needed for Speck64-XTS we use the
reducing polynomial 'x^64 + x^4 + x^3 + x + 1' given by the original XEX
paper.  Of course, when possible users should use Speck128-XTS, but even
that may be too slow on some processors; Speck64-XTS can be faster.
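
For readers unfamiliar with the tweak update, here is a minimal C sketch of
multiplying a 64-bit tweak by x in GF(2^64) with that reducing polynomial. It
assumes the tweak has already been loaded into a native uint64_t; gf64_mul_x()
is a hypothetical name, and this scalar form is not the NEON code from the
patch.

```c
#include <stdint.h>

/*
 * Multiply t by x in GF(2^64) modulo x^64 + x^4 + x^3 + x + 1.
 * Shifting left multiplies by x; if the x^63 term was set, the x^64 term
 * that falls off is replaced by x^4 + x^3 + x + 1, i.e. 0x1b.
 * gf64_mul_x() is a hypothetical helper used only for illustration.
 */
static inline uint64_t gf64_mul_x(uint64_t t)
{
	uint64_t carry = t >> 63;

	return (t << 1) ^ (carry ? 0x1b : 0);
}
```

The constant 0x1b is the bit pattern 11011, one bit per term of
x^4 + x^3 + x + 1.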

Signed-off-by: Eric Biggers 
---
 arch/arm/crypto/Kconfig   |   6 +
 arch/arm/crypto/Makefile  |   2 +
 arch/arm/crypto/speck-neon-core.S | 432 ++
 arch/arm/crypto/speck-neon-glue.c | 288 
 4 files changed, 728 insertions(+)
 create mode 100644 arch/arm/crypto/speck-neon-core.S
 create mode 100644 arch/arm/crypto/speck-neon-glue.c

diff --git a/arch/arm/crypto/Kconfig b/arch/arm/crypto/Kconfig
index b8e69fe282b8..925d1364727a 100644
--- a/arch/arm/crypto/Kconfig
+++ b/arch/arm/crypto/Kconfig
@@ -121,4 +121,10 @@ config CRYPTO_CHACHA20_NEON
select CRYPTO_BLKCIPHER
select CRYPTO_CHACHA20
 
+config CRYPTO_SPECK_NEON
+   tristate "NEON accelerated Speck cipher algorithms"
+   depends on KERNEL_MODE_NEON
+   select CRYPTO_BLKCIPHER
+   select CRYPTO_SPECK
+
 endif
diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile
index 30ef8e291271..a758107c5525 100644
--- a/arch/arm/crypto/Makefile
+++ b/arch/arm/crypto/Makefile
@@ -10,6 +10,7 @@ obj-$(CONFIG_CRYPTO_SHA1_ARM_NEON) += sha1-arm-neon.o
 obj-$(CONFIG_CRYPTO_SHA256_ARM) += sha256-arm.o
 obj-$(CONFIG_CRYPTO_SHA512_ARM) += sha512-arm.o
 obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha20-neon.o
+obj-$(CONFIG_CRYPTO_SPECK_NEON) += speck-neon.o
 
 ce-obj-$(CONFIG_CRYPTO_AES_ARM_CE) += aes-arm-ce.o
 ce-obj-$(CONFIG_CRYPTO_SHA1_ARM_CE) += sha1-arm-ce.o
@@ -53,6 +54,7 @@ ghash-arm-ce-y  := ghash-ce-core.o ghash-ce-glue.o
 crct10dif-arm-ce-y   := crct10dif-ce-core.o crct10dif-ce-glue.o
 crc32-arm-ce-y       := crc32-ce-core.o crc32-ce-glue.o
 chacha20-neon-y := chacha20-neon-core.o chacha20-neon-glue.o
+speck-neon-y := speck-neon-core.o speck-neon-glue.o
 
 quiet_cmd_perl = PERL    $@
       cmd_perl = $(PERL) $(<) > $(@)
diff --git a/arch/arm/crypto/speck-neon-core.S b/arch/arm/crypto/speck-neon-core.S
new file mode 100644
index ..3c1e203e53b9
--- /dev/null
+++ b/arch/arm/crypto/speck-neon-core.S
@@ -0,0 +1,432 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * NEON-accelerated implementation of Speck128-XTS and Speck64-XTS
+ *
+ * Copyright (c) 2018 Google, Inc
+ *
+ * Author: Eric Biggers 
+ */
+
+#include 
+
+       .text
+       .fpu            neon
+
+       // arguments
+       ROUND_KEYS      .req    r0      // const {u64,u32} *round_keys
+       NROUNDS         .req    r1      // int nrounds
+       DST             .req    r2      // void *dst
+       SRC             .req    r3      // const void *src
+       NBYTES          .req    r4      // unsigned int nbytes
+       TWEAK           .req    r5      // void *tweak
+
+       // registers which hold the data being encrypted/decrypted
+       X0              .req    q0
+       X0_L            .req    d0
+       X0_H            .req    d1
+       Y0              .req    q1
+       Y0_H            .req    d3
+       X1              .req    q2
+       X1_L            .req    d4
+       X1_H            .req    d5
+       Y1              .req    q3
+       Y1_H            .req    d7
+       X2              .req    q4
+   X2_L