On Fri, Aug 31, 2018 at 06:51:34PM +0200, Ard Biesheuvel wrote:
> >>
> >> + adr ip, .Lrol8_table
> >> mov r3, #10
> >>
> >> .Ldoubleround4:
> >> @@ -238,24 +268,25 @@ ENTRY(chacha20_4block_xor_neon)
> >> // x1 += x5, x13 = rotl32(x13 ^ x1, 8)
> >>
Hi Ard,
On Fri, Aug 31, 2018 at 05:56:24PM +0200, Ard Biesheuvel wrote:
> Hi Eric,
>
> On 31 August 2018 at 10:01, Eric Biggers wrote:
> > From: Eric Biggers
> >
> > Optimize ChaCha20 NEON performance by:
> >
> > - Implementing the 8-bit rotations using the 'vtbl.8' instruction.
> > -
On 31 August 2018 at 17:56, Ard Biesheuvel wrote:
> Hi Eric,
>
> On 31 August 2018 at 10:01, Eric Biggers wrote:
>> From: Eric Biggers
>>
>> Optimize ChaCha20 NEON performance by:
>>
>> - Implementing the 8-bit rotations using the 'vtbl.8' instruction.
>> - Streamlining the part that adds the
Hi Eric,
On 31 August 2018 at 10:01, Eric Biggers wrote:
> From: Eric Biggers
>
> Optimize ChaCha20 NEON performance by:
>
> - Implementing the 8-bit rotations using the 'vtbl.8' instruction.
> - Streamlining the part that adds the original state and XORs the data.
> - Making some other small
From: Eric Biggers

Optimize ChaCha20 NEON performance by:

- Implementing the 8-bit rotations using the 'vtbl.8' instruction.
- Streamlining the part that adds the original state and XORs the data.
- Making some other small tweaks.
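
To illustrate the first item: a rotation by 8 bits moves whole bytes, so it can be done as a byte permutation rather than a shift/shift/or sequence. The scalar C sketch below models a byte table lookup over one 8-byte vector holding two little-endian 32-bit lanes. The names rol8_idx, tbl8 and rotl8_pair_via_table, and the index values, are assumptions made for this illustration; they are derived from the definition of rotl32 and are not taken from the patch.

```c
#include <stdint.h>

/* Reference: rotate a 32-bit word left by n bits (0 < n < 32). */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/*
 * Hypothetical index table for two little-endian 32-bit lanes.
 * After rotl32(x, 8), output byte i of a lane is input byte (i + 3) % 4.
 */
static const uint8_t rol8_idx[8] = { 3, 0, 1, 2, 7, 4, 5, 6 };

/*
 * Scalar model of a byte table lookup on an 8-byte vector:
 * out[i] = src[idx[i]], with out-of-range indices producing 0.
 */
static void tbl8(uint8_t out[8], const uint8_t src[8], const uint8_t idx[8])
{
    for (int i = 0; i < 8; i++)
        out[i] = idx[i] < 8 ? src[idx[i]] : 0;
}

/* Rotate two 32-bit lanes left by 8 using only the byte permutation. */
static void rotl8_pair_via_table(uint32_t w[2])
{
    uint8_t src[8], dst[8];

    /* Split both lanes into bytes, least significant byte first. */
    for (int i = 0; i < 8; i++)
        src[i] = (uint8_t)(w[i / 4] >> (8 * (i % 4)));

    tbl8(dst, src, rol8_idx);

    /* Reassemble each lane from its four permuted bytes. */
    for (int lane = 0; lane < 2; lane++)
        w[lane] = (uint32_t)dst[4 * lane]
                | (uint32_t)dst[4 * lane + 1] << 8
                | (uint32_t)dst[4 * lane + 2] << 16
                | (uint32_t)dst[4 * lane + 3] << 24;
}
```

Calling rotl8_pair_via_table() on a pair of words should give the same result as applying rotl32(x, 8) to each word. This models the data movement only and says nothing about instruction timing on any particular core.
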
On ARM Cortex-A7, these optimizations improve ChaCha20