Re: [PATCH] crypto: arm/chacha20 - faster 8-bit rotations and other optimizations

2018-09-01 Thread Eric Biggers
On Fri, Aug 31, 2018 at 06:51:34PM +0200, Ard Biesheuvel wrote:
> >>
> >> +   adr ip, .Lrol8_table
> >> mov r3, #10
> >>
> >>  .Ldoubleround4:
> >> @@ -238,24 +268,25 @@ ENTRY(chacha20_4block_xor_neon)
> >> // x1 += x5, x13 = rotl32(x13 ^ x1, 8)
> >> // x2 += x6, x14 = rotl32(x14 ^ x2, 8)
> >> // x3 += x7, x15 = rotl32(x15 ^ x3, 8)
> >> +   vld1.8  {d16}, [ip, :64]
> 
> Also, would it perhaps be more efficient to keep the rotation vector
> in a pair of GPRs, and use something like
> 
> vmov d16, r4, r5
> 
> here?
> 

I tried that, but it doesn't help on either Cortex-A7 or Cortex-A53.
In fact it's very slightly worse.
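(For reference, the GPR-based variant under discussion looks roughly like the sketch below; the movw/movt constants assume the rol8 index table is {3, 0, 1, 2, 7, 4, 5, 6}, which is not shown in the quoted hunks.)

        movw    r4, #0x0003             // low word:  bytes 3, 0, 1, 2 (little-endian)
        movt    r4, #0x0201
        movw    r5, #0x0407             // high word: bytes 7, 4, 5, 6
        movt    r5, #0x0605
        vmov    d16, r4, r5             // build the vtbl index vector without a D-cache load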

- Eric


Re: [PATCH] crypto: arm/chacha20 - faster 8-bit rotations and other optimizations

2018-09-01 Thread Eric Biggers
Hi Ard,

On Fri, Aug 31, 2018 at 05:56:24PM +0200, Ard Biesheuvel wrote:
> Hi Eric,
> 
> On 31 August 2018 at 10:01, Eric Biggers  wrote:
> > From: Eric Biggers 
> >
> > Optimize ChaCha20 NEON performance by:
> >
> > - Implementing the 8-bit rotations using the 'vtbl.8' instruction.
> > - Streamlining the part that adds the original state and XORs the data.
> > - Making some other small tweaks.
> >
> > On ARM Cortex-A7, these optimizations improve ChaCha20 performance from
> > about 11.9 cycles per byte to 11.3.
> >
> > There is a tradeoff involved with the 'vtbl.8' rotation method since
> > there is at least one CPU where it's not fastest.  But it seems to be a
> > better default; see the added comment.
> >
> > Signed-off-by: Eric Biggers 
> > ---
> >  arch/arm/crypto/chacha20-neon-core.S | 289 ++-
> >  1 file changed, 147 insertions(+), 142 deletions(-)
> >
> > diff --git a/arch/arm/crypto/chacha20-neon-core.S b/arch/arm/crypto/chacha20-neon-core.S
> > index 3fecb2124c35a..d381cebaba31d 100644
> > --- a/arch/arm/crypto/chacha20-neon-core.S
> > +++ b/arch/arm/crypto/chacha20-neon-core.S
> > @@ -18,6 +18,33 @@
> >   * (at your option) any later version.
> >   */
> >
> > + /*
> > +  * NEON doesn't have a rotate instruction.  The alternatives are, more or less:
> > +  *
> > +  * (a)  vshl.u32 + vsri.u32   (needs temporary register)
> > +  * (b)  vshl.u32 + vshr.u32 + vorr(needs temporary register)
> > +  * (c)  vrev32.16 (16-bit rotations only)
> > +  * (d)  vtbl.8 + vtbl.8   (rotations by a multiple of 8 bits only,
> > +  * needs index vector)
> > +  *
> > +  * ChaCha20 has 16, 12, 8, and 7-bit rotations.  For the 12 and 7-bit
> > +  * rotations, the only choices are (a) and (b).  We use (a) since it takes
> > +  * two-thirds the cycles of (b) on both Cortex-A7 and Cortex-A53.
> > +  *
> > +  * For the 16-bit rotation, we use vrev32.16 since it's consistently fastest
> > +  * and doesn't need a temporary register.
> > +  *
> > +  * For the 8-bit rotation, we use vtbl.8 + vtbl.8.  On Cortex-A7, this sequence
> > +  * is twice as fast as (a), even when doing (a) on multiple registers
> > +  * simultaneously to eliminate the stall between vshl and vsri.  Also, it
> > +  * parallelizes better when temporary registers are scarce.
> > +  *
> > +  * A disadvantage is that on Cortex-A53, the vtbl sequence is the same speed as
> > +  * (a), so the need to load the rotation table actually makes the vtbl method
> > +  * slightly slower overall on that CPU.  Still, it seems to be a good
> > +  * compromise to get a significant speed boost on some CPUs.
> > +  */
> > +
> 
> Thanks for sharing these results. I have been working on 32-bit ARM
> code under the assumption that the A53 pipeline more or less resembles
> the A7 one, but this is obviously not the case looking at your
> results. My contributions to arch/arm/crypto mainly involved Crypto
> Extensions code, which the A7 does not support in the first place, so
> it does not really matter, but I will keep this in mind going forward.
> 
> >  #include 
> >
> > .text
> > @@ -46,6 +73,9 @@ ENTRY(chacha20_block_xor_neon)
> > vmov    q10, q2
> > vmov    q11, q3
> >
> > +   ldr ip, =.Lrol8_table
> > +   vld1.8  {d10}, [ip, :64]
> > +
> 
> I usually try to avoid the =literal ldr notation, because it involves
> an additional load via the D-cache. Could you use a 64-bit literal
> instead of a byte array and use vldr instead? Or switch to adr? (and
> move the literal in range, I suppose)

'adr' works if I move rol8_table to between chacha20_block_xor_neon() and
chacha20_4block_xor_neon().
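(Roughly, that arrangement would look like the sketch below; the table contents, label placement, and alignment are assumptions, not taken from the posted diff.)

        .align  3
.Lrol8_table:
        .byte   3, 0, 1, 2, 7, 4, 5, 6  // rotl32(x, 8) expressed as a per-lane byte permutation

        ...

        adr     ip, .Lrol8_table        // PC-relative add; no literal-pool load through the D-cache
        vld1.8  {d10}, [ip, :64]        // index vector consumed by the two vtbl.8 instructions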

> >  ENTRY(chacha20_4block_xor_neon)
> > -   push    {r4-r6, lr}
> > -   mov ip, sp  // preserve the stack pointer
> > -   sub r3, sp, #0x20   // allocate a 32 byte buffer
> > -   bic r3, r3, #0x1f   // aligned to 32 bytes
> > -   mov sp, r3
> > +   push    {r4}
> 
> The ARM EABI mandates 8 byte stack alignment, and if you take an
> interrupt right at this point, you will enter the interrupt handler
> with a misaligned stack. Whether this could actually cause any
> problems is a different question, but it is better to keep it 8-byte
> aligned to be sure.
> 
> I'd suggest you just push r4 and lr, so you can return from the
> function by popping r4 and pc, as is customary on ARM.

Are you sure that's an actual constraint?  That would imply you could never
push/pop an odd number of ARM registers to/from the stack... but if I
disassemble an ARM kernel, such pushes/pops occur frequently in generated code.
So 8-byte alignment must only be required when a subroutine is called.
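(For concreteness, the 8-byte-aligned prologue/epilogue Ard is suggesting would be something like this sketch:)

        push    {r4, lr}                // an even number of registers keeps sp 8-byte aligned
        mov     r4, sp                  // preserve the original stack pointer
        sub     ip, sp, #0x20           // carve out a 32-byte scratch buffer
        bic     ip, ip, #0x1f           // align it to 32 bytes
        mov     sp, ip
        ...
        mov     sp, r4                  // restore the caller's stack pointer
        pop     {r4, pc}                // return by popping lr into pc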

If I understand the interrupt handling code correctly, the kernel stack is being
aligned in the svc_entry 

Re: [PATCH] crypto: arm/chacha20 - faster 8-bit rotations and other optimizations

2018-08-31 Thread Ard Biesheuvel
On 31 August 2018 at 17:56, Ard Biesheuvel  wrote:
> Hi Eric,
>
> On 31 August 2018 at 10:01, Eric Biggers  wrote:
>> From: Eric Biggers 
>>
>> Optimize ChaCha20 NEON performance by:
>>
>> - Implementing the 8-bit rotations using the 'vtbl.8' instruction.
>> - Streamlining the part that adds the original state and XORs the data.
>> - Making some other small tweaks.
>>
>> On ARM Cortex-A7, these optimizations improve ChaCha20 performance from
>> about 11.9 cycles per byte to 11.3.
>>
>> There is a tradeoff involved with the 'vtbl.8' rotation method since
>> there is at least one CPU where it's not fastest.  But it seems to be a
>> better default; see the added comment.
>>
>> Signed-off-by: Eric Biggers 
>> ---
>>  arch/arm/crypto/chacha20-neon-core.S | 289 ++-
>>  1 file changed, 147 insertions(+), 142 deletions(-)
>>
>> diff --git a/arch/arm/crypto/chacha20-neon-core.S b/arch/arm/crypto/chacha20-neon-core.S
>> index 3fecb2124c35a..d381cebaba31d 100644
>> --- a/arch/arm/crypto/chacha20-neon-core.S
>> +++ b/arch/arm/crypto/chacha20-neon-core.S
>> @@ -18,6 +18,33 @@
>>   * (at your option) any later version.
>>   */
>>
>> + /*
>> +  * NEON doesn't have a rotate instruction.  The alternatives are, more or less:
>> +  *
>> +  * (a)  vshl.u32 + vsri.u32   (needs temporary register)
>> +  * (b)  vshl.u32 + vshr.u32 + vorr(needs temporary register)
>> +  * (c)  vrev32.16 (16-bit rotations only)
>> +  * (d)  vtbl.8 + vtbl.8   (rotations by a multiple of 8 bits only,
>> +  * needs index vector)
>> +  *
>> +  * ChaCha20 has 16, 12, 8, and 7-bit rotations.  For the 12 and 7-bit
>> +  * rotations, the only choices are (a) and (b).  We use (a) since it takes
>> +  * two-thirds the cycles of (b) on both Cortex-A7 and Cortex-A53.
>> +  *
>> +  * For the 16-bit rotation, we use vrev32.16 since it's consistently fastest
>> +  * and doesn't need a temporary register.
>> +  *
>> +  * For the 8-bit rotation, we use vtbl.8 + vtbl.8.  On Cortex-A7, this sequence
>> +  * is twice as fast as (a), even when doing (a) on multiple registers
>> +  * simultaneously to eliminate the stall between vshl and vsri.  Also, it
>> +  * parallelizes better when temporary registers are scarce.
>> +  *
>> +  * A disadvantage is that on Cortex-A53, the vtbl sequence is the same speed as
>> +  * (a), so the need to load the rotation table actually makes the vtbl method
>> +  * slightly slower overall on that CPU.  Still, it seems to be a good
>> +  * compromise to get a significant speed boost on some CPUs.
>> +  */
>> +
>
> Thanks for sharing these results. I have been working on 32-bit ARM
> code under the assumption that the A53 pipeline more or less resembles
> the A7 one, but this is obviously not the case looking at your
> results. My contributions to arch/arm/crypto mainly involved Crypto
> Extensions code, which the A7 does not support in the first place, so
> it does not really matter, but I will keep this in mind going forward.
>
>>  #include 
>>
>> .text
>> @@ -46,6 +73,9 @@ ENTRY(chacha20_block_xor_neon)
>> vmov    q10, q2
>> vmov    q11, q3
>>
>> +   ldr ip, =.Lrol8_table
>> +   vld1.8  {d10}, [ip, :64]
>> +
>
> I usually try to avoid the =literal ldr notation, because it involves
> an additional load via the D-cache. Could you use a 64-bit literal
> instead of a byte array and use vldr instead? Or switch to adr? (and
> move the literal in range, I suppose)
>
>> mov r3, #10
>>
>>  .Ldoubleround:
>> @@ -63,9 +93,9 @@ ENTRY(chacha20_block_xor_neon)
>>
>> // x0 += x1, x3 = rotl32(x3 ^ x0, 8)
>> vadd.i32        q0, q0, q1
>> -   veor        q4, q3, q0
>> -   vshl.u32    q3, q4, #8
>> -   vsri.u32    q3, q4, #24
>> +   veor        q3, q3, q0
>> +   vtbl.8  d6, {d6}, d10
>> +   vtbl.8  d7, {d7}, d10
>>
>> // x2 += x3, x1 = rotl32(x1 ^ x2, 7)
>> vadd.i32        q2, q2, q3
>> @@ -94,9 +124,9 @@ ENTRY(chacha20_block_xor_neon)
>>
>> // x0 += x1, x3 = rotl32(x3 ^ x0, 8)
>> vadd.i32        q0, q0, q1
>> -   veor        q4, q3, q0
>> -   vshl.u32    q3, q4, #8
>> -   vsri.u32    q3, q4, #24
>> +   veor        q3, q3, q0
>> +   vtbl.8  d6, {d6}, d10
>> +   vtbl.8  d7, {d7}, d10
>>
>> // x2 += x3, x1 = rotl32(x1 ^ x2, 7)
>> vadd.i32        q2, q2, q3
>> @@ -143,11 +173,11 @@ ENDPROC(chacha20_block_xor_neon)
>>
>> .align  5
>>  ENTRY(chacha20_4block_xor_neon)
>> -   push    {r4-r6, lr}
>> -   mov ip, sp  // preserve the stack pointer
>> -   sub r3, sp, #0x20   // allocate a 32 byte buffer
>> -   bic r3, r3, #0x1f   // aligned to 32 bytes
>> -   mov 

Re: [PATCH] crypto: arm/chacha20 - faster 8-bit rotations and other optimizations

2018-08-31 Thread Ard Biesheuvel
Hi Eric,

On 31 August 2018 at 10:01, Eric Biggers  wrote:
> From: Eric Biggers 
>
> Optimize ChaCha20 NEON performance by:
>
> - Implementing the 8-bit rotations using the 'vtbl.8' instruction.
> - Streamlining the part that adds the original state and XORs the data.
> - Making some other small tweaks.
>
> On ARM Cortex-A7, these optimizations improve ChaCha20 performance from
> about 11.9 cycles per byte to 11.3.
>
> There is a tradeoff involved with the 'vtbl.8' rotation method since
> there is at least one CPU where it's not fastest.  But it seems to be a
> better default; see the added comment.
>
> Signed-off-by: Eric Biggers 
> ---
>  arch/arm/crypto/chacha20-neon-core.S | 289 ++-
>  1 file changed, 147 insertions(+), 142 deletions(-)
>
> diff --git a/arch/arm/crypto/chacha20-neon-core.S b/arch/arm/crypto/chacha20-neon-core.S
> index 3fecb2124c35a..d381cebaba31d 100644
> --- a/arch/arm/crypto/chacha20-neon-core.S
> +++ b/arch/arm/crypto/chacha20-neon-core.S
> @@ -18,6 +18,33 @@
>   * (at your option) any later version.
>   */
>
> + /*
> +  * NEON doesn't have a rotate instruction.  The alternatives are, more or less:
> +  *
> +  * (a)  vshl.u32 + vsri.u32   (needs temporary register)
> +  * (b)  vshl.u32 + vshr.u32 + vorr(needs temporary register)
> +  * (c)  vrev32.16 (16-bit rotations only)
> +  * (d)  vtbl.8 + vtbl.8   (rotations by a multiple of 8 bits only,
> +  * needs index vector)
> +  *
> +  * ChaCha20 has 16, 12, 8, and 7-bit rotations.  For the 12 and 7-bit
> +  * rotations, the only choices are (a) and (b).  We use (a) since it takes
> +  * two-thirds the cycles of (b) on both Cortex-A7 and Cortex-A53.
> +  *
> +  * For the 16-bit rotation, we use vrev32.16 since it's consistently fastest
> +  * and doesn't need a temporary register.
> +  *
> +  * For the 8-bit rotation, we use vtbl.8 + vtbl.8.  On Cortex-A7, this sequence
> +  * is twice as fast as (a), even when doing (a) on multiple registers
> +  * simultaneously to eliminate the stall between vshl and vsri.  Also, it
> +  * parallelizes better when temporary registers are scarce.
> +  *
> +  * A disadvantage is that on Cortex-A53, the vtbl sequence is the same speed as
> +  * (a), so the need to load the rotation table actually makes the vtbl method
> +  * slightly slower overall on that CPU.  Still, it seems to be a good
> +  * compromise to get a significant speed boost on some CPUs.
> +  */
> +

Thanks for sharing these results. I have been working on 32-bit ARM
code under the assumption that the A53 pipeline more or less resembles
the A7 one, but this is obviously not the case looking at your
results. My contributions to arch/arm/crypto mainly involved Crypto
Extensions code, which the A7 does not support in the first place, so
it does not really matter, but I will keep this in mind going forward.

>  #include 
>
> .text
> @@ -46,6 +73,9 @@ ENTRY(chacha20_block_xor_neon)
> vmov    q10, q2
> vmov    q11, q3
>
> +   ldr ip, =.Lrol8_table
> +   vld1.8  {d10}, [ip, :64]
> +

I usually try to avoid the =literal ldr notation, because it involves
an additional load via the D-cache. Could you use a 64-bit literal
instead of a byte array and use vldr instead? Or switch to adr? (and
move the literal in range, I suppose)
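(Something along these lines, assuming the rol8 index bytes are {3, 0, 1, 2, 7, 4, 5, 6}; the .Lrol8_literal label is illustrative, and the literal must sit within vldr's PC-relative range.)

        vldr    d10, .Lrol8_literal     // one PC-relative load, no separate address setup

        ...

        .align  3
.Lrol8_literal:
        .quad   0x0605040702010003      // little-endian encoding of bytes 3, 0, 1, 2, 7, 4, 5, 6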

> mov r3, #10
>
>  .Ldoubleround:
> @@ -63,9 +93,9 @@ ENTRY(chacha20_block_xor_neon)
>
> // x0 += x1, x3 = rotl32(x3 ^ x0, 8)
> vadd.i32        q0, q0, q1
> -   veor        q4, q3, q0
> -   vshl.u32    q3, q4, #8
> -   vsri.u32    q3, q4, #24
> +   veor        q3, q3, q0
> +   vtbl.8  d6, {d6}, d10
> +   vtbl.8  d7, {d7}, d10
>
> // x2 += x3, x1 = rotl32(x1 ^ x2, 7)
> vadd.i32        q2, q2, q3
> @@ -94,9 +124,9 @@ ENTRY(chacha20_block_xor_neon)
>
> // x0 += x1, x3 = rotl32(x3 ^ x0, 8)
> vadd.i32        q0, q0, q1
> -   veor        q4, q3, q0
> -   vshl.u32    q3, q4, #8
> -   vsri.u32    q3, q4, #24
> +   veor        q3, q3, q0
> +   vtbl.8  d6, {d6}, d10
> +   vtbl.8  d7, {d7}, d10
>
> // x2 += x3, x1 = rotl32(x1 ^ x2, 7)
> vadd.i32        q2, q2, q3
> @@ -143,11 +173,11 @@ ENDPROC(chacha20_block_xor_neon)
>
> .align  5
>  ENTRY(chacha20_4block_xor_neon)
> -   push    {r4-r6, lr}
> -   mov ip, sp  // preserve the stack pointer
> -   sub r3, sp, #0x20   // allocate a 32 byte buffer
> -   bic r3, r3, #0x1f   // aligned to 32 bytes
> -   mov sp, r3
> +   push    {r4}

The ARM EABI mandates 8 byte stack alignment, and if you take an
interrupt right at this point, you will enter the interrupt handler
with a misaligned 

[PATCH] crypto: arm/chacha20 - faster 8-bit rotations and other optimizations

2018-08-31 Thread Eric Biggers
From: Eric Biggers 

Optimize ChaCha20 NEON performance by:

- Implementing the 8-bit rotations using the 'vtbl.8' instruction.
- Streamlining the part that adds the original state and XORs the data.
- Making some other small tweaks.

On ARM Cortex-A7, these optimizations improve ChaCha20 performance from
about 11.9 cycles per byte to 11.3.

There is a tradeoff involved with the 'vtbl.8' rotation method since
there is at least one CPU where it's not fastest.  But it seems to be a
better default; see the added comment.

Signed-off-by: Eric Biggers 
---
 arch/arm/crypto/chacha20-neon-core.S | 289 ++-
 1 file changed, 147 insertions(+), 142 deletions(-)

diff --git a/arch/arm/crypto/chacha20-neon-core.S b/arch/arm/crypto/chacha20-neon-core.S
index 3fecb2124c35a..d381cebaba31d 100644
--- a/arch/arm/crypto/chacha20-neon-core.S
+++ b/arch/arm/crypto/chacha20-neon-core.S
@@ -18,6 +18,33 @@
  * (at your option) any later version.
  */
 
+ /*
+  * NEON doesn't have a rotate instruction.  The alternatives are, more or less:
+  *
+  * (a)  vshl.u32 + vsri.u32   (needs temporary register)
+  * (b)  vshl.u32 + vshr.u32 + vorr(needs temporary register)
+  * (c)  vrev32.16 (16-bit rotations only)
+  * (d)  vtbl.8 + vtbl.8   (rotations by a multiple of 8 bits only,
+  * needs index vector)
+  *
+  * ChaCha20 has 16, 12, 8, and 7-bit rotations.  For the 12 and 7-bit
+  * rotations, the only choices are (a) and (b).  We use (a) since it takes
+  * two-thirds the cycles of (b) on both Cortex-A7 and Cortex-A53.
+  *
+  * For the 16-bit rotation, we use vrev32.16 since it's consistently fastest
+  * and doesn't need a temporary register.
+  *
+  * For the 8-bit rotation, we use vtbl.8 + vtbl.8.  On Cortex-A7, this sequence
+  * is twice as fast as (a), even when doing (a) on multiple registers
+  * simultaneously to eliminate the stall between vshl and vsri.  Also, it
+  * parallelizes better when temporary registers are scarce.
+  *
+  * A disadvantage is that on Cortex-A53, the vtbl sequence is the same speed as
+  * (a), so the need to load the rotation table actually makes the vtbl method
+  * slightly slower overall on that CPU.  Still, it seems to be a good
+  * compromise to get a significant speed boost on some CPUs.
+  */
+
 #include 
 
.text
@@ -46,6 +73,9 @@ ENTRY(chacha20_block_xor_neon)
vmov    q10, q2
vmov    q11, q3
 
+   ldr ip, =.Lrol8_table
+   vld1.8  {d10}, [ip, :64]
+
mov r3, #10
 
 .Ldoubleround:
@@ -63,9 +93,9 @@ ENTRY(chacha20_block_xor_neon)
 
// x0 += x1, x3 = rotl32(x3 ^ x0, 8)
vadd.i32        q0, q0, q1
-   veor        q4, q3, q0
-   vshl.u32    q3, q4, #8
-   vsri.u32    q3, q4, #24
+   veor        q3, q3, q0
+   vtbl.8  d6, {d6}, d10
+   vtbl.8  d7, {d7}, d10
 
// x2 += x3, x1 = rotl32(x1 ^ x2, 7)
vadd.i32        q2, q2, q3
@@ -94,9 +124,9 @@ ENTRY(chacha20_block_xor_neon)
 
// x0 += x1, x3 = rotl32(x3 ^ x0, 8)
vadd.i32        q0, q0, q1
-   veor        q4, q3, q0
-   vshl.u32    q3, q4, #8
-   vsri.u32    q3, q4, #24
+   veor        q3, q3, q0
+   vtbl.8  d6, {d6}, d10
+   vtbl.8  d7, {d7}, d10
 
// x2 += x3, x1 = rotl32(x1 ^ x2, 7)
vadd.i32        q2, q2, q3
@@ -143,11 +173,11 @@ ENDPROC(chacha20_block_xor_neon)
 
.align  5
 ENTRY(chacha20_4block_xor_neon)
-   push    {r4-r6, lr}
-   mov ip, sp  // preserve the stack pointer
-   sub r3, sp, #0x20   // allocate a 32 byte buffer
-   bic r3, r3, #0x1f   // aligned to 32 bytes
-   mov sp, r3
+   push    {r4}
+   mov r4, sp  // preserve the stack pointer
+   sub ip, sp, #0x20   // allocate a 32 byte buffer
+   bic ip, ip, #0x1f   // aligned to 32 bytes
+   mov sp, ip
 
// r0: Input state matrix, s
// r1: 4 data blocks output, o
@@ -157,25 +187,24 @@ ENTRY(chacha20_4block_xor_neon)
// This function encrypts four consecutive ChaCha20 blocks by loading
// the state matrix in NEON registers four times. The algorithm performs
// each operation on the corresponding word of each state matrix, hence
-   // requires no word shuffling. For final XORing step we transpose the
-   // matrix by interleaving 32- and then 64-bit words, which allows us to
-   // do XOR in NEON registers.
+   // requires no word shuffling. The words are re-interleaved before the
+   // final addition of the original state and the XORing step.
//
 
-   // x0..15[0-3] = s0..3[0..3]
-   add r3, r0, #0x20
+