[PATCH] "PowerPC64" chacha-core big-endian support "Shorter version"

2020-09-25 Thread Maamoun TK
The last patch follows the C implementation but I just figured out a decent way to do it. --- powerpc64/p7/chacha-core-internal.asm | 22 +- 1 file changed, 21 insertions(+), 1 deletion(-) diff --git a/powerpc64/p7/chacha-core-internal.asm

[PATCH] "PowerPC64" chacha-core big-endian support

2020-09-25 Thread Maamoun TK
--- powerpc64/p7/chacha-core-internal.asm | 55 ++- 1 file changed, 54 insertions(+), 1 deletion(-) diff --git a/powerpc64/p7/chacha-core-internal.asm b/powerpc64/p7/chacha-core-internal.asm index 33c721c1..922050ff 100644 ---

Re: PPC chacha

2020-09-25 Thread Maamoun TK
Writing .align explicitly instead of defining FUNC_ALIGN has no negative effects except the function won't get alignment for big-endian mode. It looks like some additional operations are needed for big-endian mode before storing the results to the 'dst' buffer, in chacha-core-internal.c:
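For illustration only, the extra big-endian work mentioned above amounts to making sure the output words land in memory in little-endian byte order. The sketch below is a portable C model of that idea, not the code from chacha-core-internal.c or from the patch; the helper name store_le32_words is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of nettle): write n 32-bit words to dst
   as little-endian bytes, independent of the host byte order. On a
   big-endian host this is where the extra byte swap is effectively done. */
static void
store_le32_words(uint8_t *dst, const uint32_t *src, size_t n)
{
  size_t i;
  for (i = 0; i < n; i++)
    {
      dst[4*i + 0] = (uint8_t) (src[i]);        /* least significant byte first */
      dst[4*i + 1] = (uint8_t) (src[i] >> 8);
      dst[4*i + 2] = (uint8_t) (src[i] >> 16);
      dst[4*i + 3] = (uint8_t) (src[i] >> 24);  /* most significant byte last */
    }
}
```

On a little-endian host this is equivalent to a plain copy of the words, which is why the issue only shows up in big-endian mode.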

Re: PPC chacha

2020-09-25 Thread Niels Möller
Maamoun TK writes: > Great work. The implementation looks fine, I like the idea of using -16 > instead of 16 for rotating because vspltisw is limited to (-16 to 15) > and vrlw picks the low-order 5 bits which is the same for both -16 and > 16. I picked up that trick from Torbjörn Granlund's
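The -16 trick quoted above can be checked with a small C model of a single 32-bit lane. This is only a sketch of the arithmetic, assuming (as the message says) that the rotate uses just the low-order 5 bits of the count; rotl32_lane is a hypothetical name, not an instruction wrapper or a nettle function.

```c
#include <stdint.h>

/* Hypothetical model of one 32-bit lane of a vector rotate-left:
   only the low-order 5 bits of the count are used. */
static uint32_t
rotl32_lane(uint32_t x, int32_t count)
{
  unsigned n = (uint32_t) count & 31u;  /* -16 and 16 both reduce to 16 */
  return n ? (x << n) | (x >> (32u - n)) : x;
}
```

Since -16 is 0xfffffff0 in two's complement, its low 5 bits are 10000 (16), so a splat of -16, which fits the -16 to 15 immediate range, yields the same rotation as a count of 16.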

Re: PPC chacha

2020-09-25 Thread Niels Möller
Jeffrey Walton writes: > I hope I'm not crossing my wires, but doesn't ChaCha core require a > counter addition? Sure, but nettle's _chacha_core function (what I've implemented so far for ppc) does a single block, and doesn't modify the counter. Variants like _chacha_3core (currently

Re: PPC chacha

2020-09-25 Thread Jeffrey Walton
On Fri, Sep 25, 2020 at 11:04 AM Jeffrey Walton wrote: > > On Fri, Sep 25, 2020 at 10:25 AM Niels Möller wrote: > > > > Jeffrey Walton writes: > > ... > It should be easy enough to test. Start with a counter of 0xfff8 > and encrypt a couple of [64-byte] blocks. You can use Bernstein's >

Re: PPC chacha

2020-09-25 Thread Jeffrey Walton
On Fri, Sep 25, 2020 at 10:25 AM Niels Möller wrote: > > Jeffrey Walton writes: > > > I believe the 64-bit adds (addudm) and subtracts (subudm) require > > POWER8. > > I don't think there are any 64-bit adds in my chacha code, only 32-bit, > vadduwm. The chacha state is fundamentally 16 32-bit

Re: PPC chacha

2020-09-25 Thread Maamoun TK
Yes, it would make sense. On Fri, Sep 25, 2020 at 5:25 PM Niels Möller wrote: > Jeffrey Walton writes: > > > I believe the 64-bit adds (addudm) and subtracts (subudm) require > > POWER8. > > I don't think there are any 64-bit adds in my chacha code, only 32-bit, > vadduwm. The chacha state is

Re: PPC chacha

2020-09-25 Thread Niels Möller
Jeffrey Walton writes: > I believe the 64-bit adds (addudm) and subtracts (subudm) require > POWER8. I don't think there are any 64-bit adds in my chacha code, only 32-bit, vadduwm. The chacha state is fundamentally 16 32-bit words, with operations very friendly to 4-way simd. Using 64-bit
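As a reference for the point about 32-bit words, here is a scalar C sketch of the standard ChaCha quarter round. It is not the assembly under discussion; the names ROTL32 and chacha_quarter_round are chosen for this example. Every addition is a 32-bit modular add, which is the operation a 4-way SIMD implementation applies to four columns at once.

```c
#include <stdint.h>

#define ROTL32(n, x) (((x) << (n)) | ((x) >> (32 - (n))))

/* Scalar ChaCha quarter round on four 32-bit words. All additions wrap
   modulo 2^32; no 64-bit arithmetic is involved. */
static void
chacha_quarter_round(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d)
{
  *a += *b; *d ^= *a; *d = ROTL32(16, *d);
  *c += *d; *b ^= *c; *b = ROTL32(12, *b);
  *a += *b; *d ^= *a; *d = ROTL32(8, *d);
  *c += *d; *b ^= *c; *b = ROTL32(7, *b);
}
```

A vectorized version would hold one row of the 4x4 state per vector register and perform the same steps on all four columns in parallel.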

Re: PPC chacha

2020-09-25 Thread Jeffrey Walton
On Fri, Sep 25, 2020 at 7:43 AM Maamoun TK wrote: > ... > > I'm not sure where it fits under powerpc64. The code doesn't need any > > cryptographic extensions, but it depends on vector instructions as well > > as VSX registers (for the unaligned load and store instructions). So I'd > > need

Re: PPC chacha

2020-09-25 Thread Maamoun TK
> > > I'm trying to learn a bit of ppc assembly. Below is an implementation of > _chacha_core. Seems to work, when tested on gcc112.fsffrance.org (just > put the file in the powerpc64 directory and reconfigure). This machine > is little-endian, I haven't yet tested on big-endian. > Great work.

Re: [PATCH] "PowerPC64" GCM support

2020-09-25 Thread Maamoun TK
It's gotten better with this patch, now it takes 0.49 seconds to execute under the same circumstances. On Fri, Sep 25, 2020 at 9:59 AM Niels Möller wrote: > Maamoun TK writes: > > >> What's the speedup you get from assembly gcm_fill? I see the C > >> implementation uses memcpy and

Re: [PATCH] "PowerPC64" GCM support

2020-09-25 Thread Niels Möller
Maamoun TK writes: >> What's the speedup you get from assembly gcm_fill? I see the C >> implementation uses memcpy and WRITE_UINT32, and is likely significantly >> slower than the ctr_fill16 in ctr.c. But it could be improved using >> portable means. If done well, it should be a very small