Re: ppc64 micro optimization

2024-04-14 Thread Niels Möller
Niels Möller  writes:

> I've added tests that set the initial counter so that the four counter
> bytes wrap around 2^32, and I've verified that these instructions
> should be changed to vadduwm to get output that agrees with nettle's
> other gcm implementations.

I've committed those fixes, and a fix for big-endian support, on the
branch ppc64-gcm-aes-rebased. I think that's now ready for merging.

I see some opportunities for further improvement, but that can be done
after the merge, for consistency with related fixes to the other ppc64
assembly files.

> Another question on powerpc64 assembly: For the byte swapping, currently
> done using the vperm instruction and a mask word, is there any reason to
> not use the xxbrd instruction (VSX Vector Byte-Reverse Doubleword)
> instead? That applies to more functions than the new gcm-aes code.

A closer look at the spec indicated that xxbrd is only available from
power9 (i.e., if the processor supports VSX, *and* supports ISA 3.0, if
I've understood it correctly).

I think it would be a good idea to consistently use pseudo-ops like

  .machine "power8"

in the ppc assembly files, if that would let the assembler catch
accidental use of unavailable instructions.

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.


Re: ppc64 micro optimization

2024-03-24 Thread Niels Möller
Niels Möller  writes:

> One other question: In the counter updates,
>
>> C increase ctr value as input to aes_encrypt
>> vaddudm S1, S0, CNT1
>> vaddudm S2, S1, CNT1
>> vaddudm S3, S2, CNT1
>> vaddudm S4, S3, CNT1
>> vaddudm S5, S4, CNT1
>> vaddudm S6, S5, CNT1
>> vaddudm S7, S6, CNT1
>
> shouldn't that be vadduwm (32-bit word addition, rather than 64-bit
> dword addition)? As I understand it, gcm uses a 32-bit counter, which
> should wrap around without any carry to higher bits if the initial value
> is just below 2^32.

I've added tests that set the initial counter so that the four counter
bytes wrap around 2^32, and I've verified that these instructions
should be changed to vadduwm to get output that agrees with nettle's
other gcm implementations.

Another question on powerpc64 assembly: For the byte swapping, currently
done using the vperm instruction and a mask word, is there any reason to
not use the xxbrd instruction (VSX Vector Byte-Reverse Doubleword)
instead? That applies to more functions than the new gcm-aes code.

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.


Re: ppc64 micro optimization

2024-03-20 Thread Niels Möller
Niels Möller  writes:

> Below is an updated version of gcm-aes-encrypt.asm; it seems to work for
> me, and uses fewer of the regular registers. Some comments and
> questions:
>
> 1. What about the vsrX registers, 0 <= X < 32? They are used to copy
>values from and to the v registers (aka vsrX, 32 <= X < 64), e.g.,
>
> xxlor vs1, VSR(S0), VSR(S0)
>
>Can those registers be used freely, and how?

I've asked in a different forum, and as far as I understand, registers
vs0-vs13 are free to use ("volatile"), because half of each corresponds
to a volatile floating point register (fpr0-fpr13), while registers
vs14-vs31 need to be saved and restored if used (the halves
corresponding to fpr14-fpr31 are non-volatile, so in principle, it would
be sufficient to save and restore only those halves).

> 2. From my reading of the ELF v2 ABI spec, there's a "protected zone"
>    below the stack pointer that can be used freely for storage. Is that
>    right? Or maybe that's only for the ELFv2 ABI?

That appears to be the same in the ELFv1 ABI, see
https://refspecs.linuxfoundation.org/ELF/ppc64/PPC-elf64abi.html#STACK

One other question: In the counter updates,

> C increase ctr value as input to aes_encrypt
> vaddudm S1, S0, CNT1
> vaddudm S2, S1, CNT1
> vaddudm S3, S2, CNT1
> vaddudm S4, S3, CNT1
> vaddudm S5, S4, CNT1
> vaddudm S6, S5, CNT1
> vaddudm S7, S6, CNT1

shouldn't that be vadduwm (32-bit word addition, rather than 64-bit
dword addition)? As I understand it, gcm uses a 32-bit counter, which
should wrap around without any carry to higher bits if the initial value
is just below 2^32.
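
In C, the behavior I'd expect is roughly the following sketch (the
32-bit counter is stored big-endian in the last four bytes of the
16-byte counter block; the function name is just made up for
illustration):

  #include <stdint.h>

  /* Increment only the last 32-bit big-endian word of the counter
     block, wrapping mod 2^32, with no carry into the upper 96 bits. */
  static void
  gcm_ctr32_increment (uint8_t ctr[16])
  {
    uint32_t c = (uint32_t) ctr[12] << 24 | (uint32_t) ctr[13] << 16
      | (uint32_t) ctr[14] << 8 | (uint32_t) ctr[15];
    c++;  /* wraps around at 2^32 */
    ctr[12] = c >> 24; ctr[13] = (c >> 16) & 0xff;
    ctr[14] = (c >> 8) & 0xff; ctr[15] = c & 0xff;
  }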

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.


Re: ppc64 micro optimization

2024-03-17 Thread Niels Möller
Niels Möller  writes:

> Next, I'll have a look at register usage in the assembly code.

Below is an updated version of gcm-aes-encrypt.asm; it seems to work for
me, and uses fewer of the regular registers. Some comments and
questions:

1. What about the vsrX registers, 0 <= X < 32? They are used to copy
   values from and to the v registers (aka vsrX, 32 <= X < 64), e.g.,

xxlor vs1, VSR(S0), VSR(S0)

   Can those registers be used freely, and how? If we can use them, we
   shouldn't need to save and restore any vector registers. They're not
   mentioned in powerpc64/README. Looking at the ELF v2 ABI spec, it
   seems to say that the low halves (which are used as floating point
   registers) are "volatile", but that's not quite enough?

2. From my reading of the ELF v2 ABI spec, there's a "protected zone"
   below the stack pointer that can be used freely for storage. Is that
   right? Or maybe that's only for the ELFv2 ABI?

3. Nit: In the copyright line, I'd like to delete the "All rights
   reserved" phrase. That's not in the copyright header of any other
   Nettle file, including those previously contributed by IBM.

Regards,
/Niels

-8<---
C powerpc64/p8/gcm-aes-encrypt.asm

ifelse(`
   Copyright (C) 2023- IBM Inc. All rights reserved
   This file is part of GNU Nettle.

   GNU Nettle is free software: you can redistribute it and/or
   modify it under the terms of either:

 * the GNU Lesser General Public License as published by the Free
   Software Foundation; either version 3 of the License, or (at your
   option) any later version.

   or

 * the GNU General Public License as published by the Free
   Software Foundation; either version 2 of the License, or (at your
   option) any later version.

   or both in parallel, as here.

   GNU Nettle is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   General Public License for more details.

   You should have received copies of the GNU General Public License and
   the GNU Lesser General Public License along with this program.  If
   not, see http://www.gnu.org/licenses/.
')

C Register usage:

define(`SP', `r1')
define(`TOCP', `r2')

define(`HT', `r3')
define(`SRND', `r4')
define(`SLEN', `r5')
define(`SDST', `r6')
define(`SSRC', `r7')
define(`RK', `r8')
C r9-r11 used as constant indices.
define(`LOOP', `r12')

C
C vectors used in aes encrypt output
C

define(`K', `v1')
define(`S0', `v2')
define(`S1', `v3')
define(`S2', `v4')
define(`S3', `v5')
define(`S4', `v6')
define(`S5', `v7')
define(`S6', `v8')
define(`S7', `v9')

C
C ghash assigned registers and vectors
C

define(`ZERO', `v21')
define(`POLY', `v22')
define(`POLY_L', `v0')

define(`D', `v10')
define(`H1M', `v11')
define(`H1L', `v12')
define(`H2M', `v13')
define(`H2L', `v14')
define(`H3M', `v15')
define(`H3L', `v16')
define(`H4M', `v17')
define(`H4L', `v18')
define(`R', `v19')
define(`F', `v20')
define(`R2', `v21')
define(`F2', `v22')

define(`LE_TEMP', `v30')
define(`LE_MASK', `v31')

define(`CNT1', `v28')
define(`LASTCNT', `v29')

.file "gcm-aes-encrypt.asm"

.text

 C size_t
 C _gcm_aes_encrypt(struct gcm_key *key, size_t rounds,
 C  size_t len, uint8_t *dst, const uint8_t *src)
 C

define(`FUNC_ALIGN', `5')
PROLOGUE(_nettle_gcm_aes_encrypt)
srdi. LOOP, SLEN, 7 C loop n 8 blocks
beq No_encrypt_out

C 288 byte "protected zone" is sufficient for storage.
stxv VSR(v20), -16(SP)
stxv VSR(v21), -32(SP)
stxv VSR(v22), -48(SP)
stxv VSR(v28), -64(SP)
stxv VSR(v29), -80(SP)
stxv VSR(v30), -96(SP)
stxv VSR(v31), -112(SP)

vxor ZERO,ZERO,ZERO
vspltisb CNT1, 1
vsldoi CNT1, ZERO, CNT1, 1  C counter 1

DATA_LOAD_VEC(POLY,.polynomial,r9)
IF_LE(`
li r9,0
lvsl   LE_MASK,0,r9
vspltisb   LE_TEMP,0x07
vxor   LE_MASK,LE_MASK,LE_TEMP
')
xxmrghd VSR(POLY_L),VSR(ZERO),VSR(POLY)

C load table elements
li r9,1*16
li r10,2*16
li r11,3*16
lxvd2x VSR(H1M),0,HT
lxvd2x VSR(H1L),r9,HT
lxvd2x VSR(H2M),r10,HT
lxvd2x VSR(H2L),r11,HT
addi HT, HT, 64
lxvd2x VSR(H3M),0,HT
lxvd2x VSR(H3L),r9,HT
lxvd2x VSR(H4M),r10,HT
lxvd2x VSR(H4L),r11,HT

addi HT, HT,  4048  C Advance to point to the 'CTR' field in the context
 
lxvd2x VSR(D),r9,HT C load 'X' pointer
C byte-reverse of each doubleword permuting on little-endian mode
IF_LE(`
vperm  D,D,D,LE_MASK
')

lxvb16x VSR(S0), 0, HT  C Load 'CTR'

sldi SLEN, LOOP, 7

addi LOOP, LOOP, -1

lxvd2x VSR(K),r11,HT    C First subkey
vperm   K,K,K,LE_MASK

.align 5
C increase ctr value as input to aes_encrypt
vaddudm S1, S0, CNT1
vaddudm S2, 

Re: ppc64 micro optimization

2024-03-15 Thread Niels Möller
Niels Möller  writes:

> Danny Tsen  writes:
>
> My fault. I did not include gcm-aes-crypt.c in the patch. Here is
>> the updated patch. Please apply this one and we can work from there.
>
> Thanks, now pushed onto a new branch ppc64-gcm-aes.

I've now pushed some more changes to that branch: added gcm-internal.h,
fixed the non-fat case, and moved around the nop definitions of the new
functions.

Next, I'll have a look at register usage in the assembly code.

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.


Re: ppc64 micro optimization

2024-03-06 Thread Niels Möller
Danny Tsen  writes:

> My fault. I did not include gcm-aes-crypt.c in the patch. Here is
> the updated patch. Please apply this one and we can work from there.

Thanks, now pushed onto a new branch ppc64-gcm-aes.

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.


Re: ppc64 micro optimization

2024-03-05 Thread Danny Tsen
Hi Niels,

My fault.  I did not include gcm-aes-crypt.c in the patch.  Here is the 
updated patch.  Please apply this one and we can work from there.

Thanks.
-Danny


> On Mar 5, 2024, at 1:08 PM, Niels Möller  wrote:
>
> Danny Tsen  writes:
>
>> Please let me know when you merge the code and we can work from there.
>
> Hi, I tried to apply and build with the v5 patch, and noticed some problems.
>
> Declarations of _gcm_aes_encrypt / _gcm_aes_decrypt are missing. They can
> go in gcm-internal.h, like on this branch,
> https://git.lysator.liu.se/nettle/nettle/-/blob/x86_64-gcm-aes/gcm-internal.h?ref_type=heads
> Corresponding name mangling defines should also be in gcm-internal.h,
> not in the installed gcm.h header.
>
> The file gcm-aes.c was missing in the patch. If the dummy C versions of
> _gcm_aes_*crypt are needed only for fat builds, maybe simplest to put the
> definitions in fat-ppc.c (maybe one can even use the same "return 0" dummy
> function for both encrypt and decrypt).
>
> It would also be nice if you could check that the new code is used
> and working in a non-fat build, configured with --disable-fat
> --enable-power-crypto-ext.
>
> Regards,
> /Niels
>
> --
> Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
> Internet email is subject to wholesale government surveillance.



Re: ppc64 micro optimization

2024-03-05 Thread Niels Möller
Danny Tsen  writes:

> Please let me know when you merge the code and we can work from there.

Hi, I tried to apply and build with the v5 patch, and noticed some problems.

Declarations of _gcm_aes_encrypt / _gcm_aes_decrypt are missing. They can
go in gcm-internal.h, like on this branch,
https://git.lysator.liu.se/nettle/nettle/-/blob/x86_64-gcm-aes/gcm-internal.h?ref_type=heads
Corresponding name mangling defines should also be in gcm-internal.h,
not in the installed gcm.h header.
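
For concreteness, a sketch of what I have in mind for gcm-internal.h
(following the usual name-mangling convention for internal symbols, and
the prototype shape used by the v5 patch; not necessarily the final
form):

  #define _gcm_aes_encrypt _nettle_gcm_aes_encrypt
  #define _gcm_aes_decrypt _nettle_gcm_aes_decrypt

  size_t
  _gcm_aes_encrypt (struct gcm_key *key, unsigned rounds,
                    size_t len, uint8_t *dst, const uint8_t *src);
  size_t
  _gcm_aes_decrypt (struct gcm_key *key, unsigned rounds,
                    size_t len, uint8_t *dst, const uint8_t *src);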

The file gcm-aes.c was missing in the patch. If the dummy C versions of
_gcm_aes_*crypt are needed only for fat builds, maybe simplest to put the
definitions in fat-ppc.c (maybe one can even use the same "return 0" dummy
function for both encrypt and decrypt).
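
That dummy could be as simple as the following sketch (the name is just
for illustration); returning 0 means "no blocks processed", so the plain
C code path then does all the work:

  /* Generic fallback, usable for both _gcm_aes_encrypt and
     _gcm_aes_decrypt. */
  static size_t
  _gcm_aes_crypt_none (struct gcm_key *key, unsigned rounds,
                       size_t len, uint8_t *dst, const uint8_t *src)
  {
    return 0;
  }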

It would also be nice if you could check that the new code is used
and working in a non-fat build, configured with --disable-fat
--enable-power-crypto-ext.

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.


RE: ppc64 micro optimization

2024-02-26 Thread Danny Tsen
Hi Niels,

Please let me know when you merge the code and we can work from there.

Thanks.
-Danny

From: Niels Möller 
Sent: Friday, February 23, 2024 1:07 AM
To: Danny Tsen 
Cc: nettle-bugs@lists.lysator.liu.se ; George 
Wilson 
Subject: [EXTERNAL] Re: ppc64 micro optimization

Danny Tsen  writes:

> Here is the v5 patch from your comments.  Please review.

Thanks. I think this looks pretty good. Maybe I should commit it on a
branch and we can iterate from there. I'll be on vacation and mostly
offline next week, though.

> --- a/gcm-aes128.c
> +++ b/gcm-aes128.c
> @@ -63,6 +63,11 @@ void
>  gcm_aes128_encrypt(struct gcm_aes128_ctx *ctx,
>size_t length, uint8_t *dst, const uint8_t *src)
>  {
> +  size_t done = _gcm_aes_encrypt ((struct gcm_key *)ctx, _AES128_ROUNDS, 
> length, dst, src);
> +  ctx->gcm.data_size += done;
> +  length -= done;
> +  src += done;
> +  dst += done;
>GCM_ENCRYPT(ctx, aes128_encrypt, length, dst, src);
>  }

We should come up with some preprocessor things to completely omit the
new code on architectures that don't have _gcm_aes_encrypt (possibly
with some macro to reduce duplication). I think that's the main thing
I'd like to have before merge. Otherwise, looks nice and clean.

Ah, and I think you could write &ctx->key instead of the explicit
cast.

> +C load table elements
> +li r9,1*16
> +li r10,2*16
> +li r11,3*16
> +lxvd2x VSR(H1M),0,HT
> +lxvd2x VSR(H1L),r9,HT
> +lxvd2x VSR(H2M),r10,HT
> +lxvd2x VSR(H2L),r11,HT
> +addi HT, HT, 64
> +lxvd2x VSR(H3M),0,HT
> +lxvd2x VSR(H3L),r9,HT
> +lxvd2x VSR(H4M),r10,HT
> +lxvd2x VSR(H4L),r11,HT
> +
> +li r25,0x10
> +li r26,0x20
> +li r27,0x30
> +li r28,0x40
> +li r29,0x50
> +li r30,0x60
> +li r31,0x70

I still think there's an opportunity to reduce the number of registers
(and the corresponding loads and stores of callee-save registers).
E.g., here r9-r11 are used for the same thing as r25-r27.

> +.align 5
> +C increase ctr value as input to aes_encrypt
> +vaddudm S1, S0, CNT1
> +vaddudm S2, S1, CNT1
> +vaddudm S3, S2, CNT1
> +vaddudm S4, S3, CNT1
> +vaddudm S5, S4, CNT1
> +vaddudm S6, S5, CNT1
> +vaddudm S7, S6, CNT1

This is a rather long dependency chain; I wonder if you could make a
measurable saving of a cycle or two by using additional CNT2 or CNT4
registers (if not, it's preferable to keep the current simple chain).

Regards,
/Niels

--
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.


Re: ppc64 micro optimization

2024-02-22 Thread Niels Möller
Danny Tsen  writes:

> Here is the v5 patch from your comments.  Please review.

Thanks. I think this looks pretty good. Maybe I should commit it on a
branch and we can iterate from there. I'll be on vacation and mostly
offline next week, though.

> --- a/gcm-aes128.c
> +++ b/gcm-aes128.c
> @@ -63,6 +63,11 @@ void
>  gcm_aes128_encrypt(struct gcm_aes128_ctx *ctx,
>   size_t length, uint8_t *dst, const uint8_t *src)
>  {
> +  size_t done = _gcm_aes_encrypt ((struct gcm_key *)ctx, _AES128_ROUNDS, 
> length, dst, src);
> +  ctx->gcm.data_size += done;
> +  length -= done;
> +  src += done;
> +  dst += done;
>GCM_ENCRYPT(ctx, aes128_encrypt, length, dst, src);
>  }

We should come up with some preprocessor things to completely omit the
new code on architectures that don't have _gcm_aes_encrypt (possibly
with some macro to reduce duplication). I think that's the main thing
I'd like to have before merge. Otherwise, looks nice and clean.
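
To make that concrete, something like the following sketch is what I
have in mind; HAVE_NATIVE_gcm_aes_encrypt and the macro name are just
placeholders, nothing that exists yet:

  #if HAVE_NATIVE_gcm_aes_encrypt
  /* Let the assembly routine process as many whole blocks as it wants,
     then advance length/src/dst past the processed part. */
  # define TRY_GCM_AES_ENCRYPT(ctx, rounds, length, dst, src) do {          \
      size_t _done = _gcm_aes_encrypt ((struct gcm_key *) (ctx), (rounds),  \
                                       (length), (dst), (src));             \
      (ctx)->gcm.data_size += _done;                                        \
      (length) -= _done; (src) += _done; (dst) += _done;                    \
    } while (0)
  #else
  # define TRY_GCM_AES_ENCRYPT(ctx, rounds, length, dst, src)
  #endif

With that, gcm_aes128_encrypt would reduce to
TRY_GCM_AES_ENCRYPT (ctx, _AES128_ROUNDS, length, dst, src) followed by
the usual GCM_ENCRYPT call.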

Ah, and I think you could write &ctx->key instead of the explicit
cast.

> +C load table elements
> +li r9,1*16
> +li r10,2*16
> +li r11,3*16
> +lxvd2x VSR(H1M),0,HT
> +lxvd2x VSR(H1L),r9,HT
> +lxvd2x VSR(H2M),r10,HT
> +lxvd2x VSR(H2L),r11,HT
> +addi HT, HT, 64
> +lxvd2x VSR(H3M),0,HT
> +lxvd2x VSR(H3L),r9,HT
> +lxvd2x VSR(H4M),r10,HT
> +lxvd2x VSR(H4L),r11,HT
> +
> +li r25,0x10
> +li r26,0x20
> +li r27,0x30
> +li r28,0x40
> +li r29,0x50
> +li r30,0x60
> +li r31,0x70

I still think there's an opportunity to reduce the number of registers
(and the corresponding loads and stores of callee-save registers).
E.g., here r9-r11 are used for the same thing as r25-r27.

> +.align 5
> +C increase ctr value as input to aes_encrypt
> +vaddudm S1, S0, CNT1
> +vaddudm S2, S1, CNT1
> +vaddudm S3, S2, CNT1
> +vaddudm S4, S3, CNT1
> +vaddudm S5, S4, CNT1
> +vaddudm S6, S5, CNT1
> +vaddudm S7, S6, CNT1

This is a rather long dependency chain; I wonder if you could make a
measurable saving of a cycle or two by using additional CNT2 or CNT4
registers (if not, it's preferable to keep the current simple chain).

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.


Re: ppc64 micro optimization

2024-02-20 Thread Danny Tsen
Hi Niels,

Here is the v5 patch from your comments.  Please review.

Thanks.
-Danny


> On Feb 14, 2024, at 8:46 AM, Niels Möller  wrote:
>
> Danny Tsen  writes:
>
>> Here is the new patch v4 for AES/GCM stitched implementation and
>> benchmark based on the current repo.
>
> Thanks. I'm not able to read it all carefully at the moment, but I have
> a few comments, see below.
>
> In the meantime, I've also tried to implement something similar for
> x86_64, see branch x86_64-gcm-aes. Unfortunately, I get no speedup; on
> the contrary, my stitched implementation seems considerably slower...
> But at least that helped me understand the higher-level issues better.
>
>> --- a/gcm-aes128.c
>> +++ b/gcm-aes128.c
>> @@ -63,14 +63,30 @@ void
>> gcm_aes128_encrypt(struct gcm_aes128_ctx *ctx,
>>   size_t length, uint8_t *dst, const uint8_t *src)
>> {
>> -  GCM_ENCRYPT(ctx, aes128_encrypt, length, dst, src);
>> +  size_t done = _gcm_aes_encrypt (&ctx->key, &ctx->gcm.x.b, &ctx->gcm.ctr.b,
>> +  _AES128_ROUNDS, &ctx->cipher.keys, length, dst, src);
>
> I know I asked you to explode the context into many separate arguments
> to _gcm_aes_encrypt. I'm now backpedalling a bit on that. For one, it's
> not so nice to have so many arguments that they can't be passed in
> registers. Second, when running a fat build on a machine where the
> needed instructions are unavailable, it's a bit of a waste to have to
> spend lots of instructions on preparing those arguments for calling a
> nop function. So to reduce overhead, I'm now leaning towards an
> interface like
>
>  /* To reduce the number of arguments (e.g., maximum of 6 register
>     arguments on x86_64), pass a pointer to gcm_key, which really is a
>     pointer to the first member of the appropriate gcm_aes*_ctx
>     struct. */
>  size_t
>  _gcm_aes_encrypt (struct gcm_key *CTX,
>                    unsigned rounds,
>                    size_t size, uint8_t *dst, const uint8_t *src);
>
> That's not so pretty, but I think that is workable and efficient, and
> since it is an internal function, the interface can be changed if this
> is implemented on other architectures and we find out that it needs some
> tweaks. See
> https://git.lysator.liu.se/nettle/nettle/-/blob/x86_64-gcm-aes/x86_64/aesni_pclmul/gcm-aes-encrypt.asm?ref_type=heads
> for the code I wrote to accept that ctx argument.
>
> It would also be nice to have a #define around the code calling
> _gcm_aes_encrypt, so that it is compiled only if (i) we have a
> non-trivial implementation of _gcm_aes_encrypt, or (ii) we're a fat
> build, which may select a non-trivial implementation of _gcm_aes_encrypt
> at run time.
>
>> +  ctx->gcm.data_size += done;
>> +  length -= done;
>> +  if (length > 0) {
>
> Not sure if the check for length > 0 is needed. It is fine to call
> gcm_encrypt/GCM_ENCRYPT with length 0. There will be some overhead for a
> call with length 0, though, which may be a more common case when
> _gcm_aes_encrypt is used?
>
>> +define(`SAVE_GPR', `std $1, $2(SP)')
>> +define(`RESTORE_GPR', `ld $1, $2(SP)')
>
> I think the above two macros are unneeded; it's easier to read if
> std and ld are used directly.
>
>> +define(`SAVE_VR',
>> +  `li r11, $2
>> +   stvx $1, r11, $3')
>> +define(`RESTORE_VR',
>> +  `li r11, $2
>> +   lvx $1, r11, $3')
>
> It would be nice if we could trim the use of vector registers so we
> don't need to save and restore lots of them. And if we need two
> instructions anyway, then maybe it would be clearer with PUSH_VR/POP_VR
> that also adjusts the stack pointer, and doesn't need to use an additional
> register for indexing?
>
>> +C load table elements
>> +li r9,1*16
>> +li r10,2*16
>> +li r11,3*16
>> +lxvd2x VSR(H1M),0,HT
>> +lxvd2x VSR(H1L),r9,HT
>> +lxvd2x VSR(H2M),r10,HT
>> +lxvd2x VSR(H2L),r11,HT
>> +li r9,4*16
>> +li r10,5*16
>> +li r11,6*16
>> +li r12,7*16
>> +lxvd2x VSR(H3M),r9,HT
>> +lxvd2x VSR(H3L),r10,HT
>> +lxvd2x VSR(H4M),r11,HT
>> +lxvd2x VSR(H4L),r12,HT
>
> I think it would be nicer to follow the style I tried to implement in my
> recent updates, using some registers (e.g., r9-r11) as offsets,
> initializing them only once, and using them everywhere. E.g., in this case,
> the loading could be
>
>lxvd2x VSR(H1M),0,HT
>lxvd2x VSR(H1L),r9,HT
>lxvd2x VSR(H2M),r10,HT
>lxvd2x VSR(H2L),r11,HT
>addi  HT, HT, 64
>lxvd2x VSR(H3M),0,HT
>lxvd2x VSR(H3L),r9,HT
>lxvd2x VSR(H4M),r10,HT
>lxvd2x VSR(H4L),r11,HT
>
>> +C do two 4x ghash
>> +
>> +C previous digest combining
>> +xxlor vs0, VSR(S0), VSR(S0)
>> +vxor S0,S0,D
>> +
>> +GF_MUL(F2, R2, H3L, H3M, S1)
>> +GF_MUL(F, R, H4L, H4M, S0)
>> +vxor   F,F,F2
>> +vxor   

Re: ppc64 micro optimization

2024-02-14 Thread Niels Möller
Danny Tsen  writes:

> Here is the new patch v4 for AES/GCM stitched implementation and
> benchmark based on the current repo.

Thanks. I'm not able to read it all carefully at the moment, but I have
a few comments, see below. 

In the meantime, I've also tried to implement something similar for
x86_64, see branch x86_64-gcm-aes. Unfortunately, I get no speedup; on
the contrary, my stitched implementation seems considerably slower...
But at least that helped me understand the higher-level issues better.

> --- a/gcm-aes128.c
> +++ b/gcm-aes128.c
> @@ -63,14 +63,30 @@ void
>  gcm_aes128_encrypt(struct gcm_aes128_ctx *ctx,
>   size_t length, uint8_t *dst, const uint8_t *src)
>  {
> -  GCM_ENCRYPT(ctx, aes128_encrypt, length, dst, src);
> +  size_t done = _gcm_aes_encrypt (&ctx->key, &ctx->gcm.x.b, &ctx->gcm.ctr.b,
> +   _AES128_ROUNDS, &ctx->cipher.keys, length, dst, src);

I know I asked you to explode the context into many separate arguments
to _gcm_aes_encrypt. I'm now backpedalling a bit on that. For one, it's
not so nice to have so many arguments that they can't be passed in
registers. Second, when running a fat build on a machine where the
needed instructions are unavailable, it's a bit of a waste to have to
spend lots of instructions on preparing those arguments for calling a
nop function. So to reduce overhead, I'm now leaning towards an
interface like

  /* To reduce the number of arguments (e.g., maximum of 6 register
     arguments on x86_64), pass a pointer to gcm_key, which really is a
     pointer to the first member of the appropriate gcm_aes*_ctx
     struct. */
  size_t
  _gcm_aes_encrypt (struct gcm_key *CTX,
                    unsigned rounds,
                    size_t size, uint8_t *dst, const uint8_t *src);

That's not so pretty, but I think that is workable and efficient, and
since it is an internal function, the interface can be changed if this
is implemented on other architectures and we find out that it needs some
tweaks. See
https://git.lysator.liu.se/nettle/nettle/-/blob/x86_64-gcm-aes/x86_64/aesni_pclmul/gcm-aes-encrypt.asm?ref_type=heads
for the code I wrote to accept that ctx argument.
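
The trick depends on the gcm_key being the first member of every
gcm_aes*_ctx struct; a compile-time check along the lines of this
sketch (C11 static_assert, shown here only as illustration) could
document that layout assumption:

  #include <stddef.h>
  #include <assert.h>
  #include "gcm.h"

  /* (struct gcm_key *) ctx is valid, and equal to &ctx->key, only
     because GCM_CTX lays out the key as the first member. */
  static_assert (offsetof (struct gcm_aes128_ctx, key) == 0,
                 "gcm_key must be the first member of gcm_aes128_ctx");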

It would also be nice to have a #define around the code calling
_gcm_aes_encrypt, so that it is compiled only if (i) we have a
non-trivial implementation of _gcm_aes_encrypt, or (ii) we're a fat
build, which may select a non-trivial implementation of _gcm_aes_encrypt
at run time.

> +  ctx->gcm.data_size += done;
> +  length -= done;
> +  if (length > 0) {

Not sure if the check for length > 0 is needed. It is fine to call
gcm_encrypt/GCM_ENCRYPT with length 0. There will be some overhead for a
call with length 0, though, which may be a more common case when
_gcm_aes_encrypt is used?
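
I.e., the wrapper could do something like this sketch, with the same
calling convention as in the patch and no special case for zero length:

  size_t done = _gcm_aes_encrypt (&ctx->key, &ctx->gcm.x.b, &ctx->gcm.ctr.b,
                                  _AES128_ROUNDS, &ctx->cipher.keys,
                                  length, dst, src);
  ctx->gcm.data_size += done;
  /* GCM_ENCRYPT copes fine with a zero remaining length. */
  GCM_ENCRYPT (ctx, aes128_encrypt, length - done, dst + done, src + done);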

> +define(`SAVE_GPR', `std $1, $2(SP)')
> +define(`RESTORE_GPR', `ld $1, $2(SP)')

I think the above two macros are unneeded; it's easier to read if
std and ld are used directly.

> +define(`SAVE_VR',
> +  `li r11, $2
> +   stvx $1, r11, $3')
> +define(`RESTORE_VR',
> +  `li r11, $2
> +   lvx $1, r11, $3')

It would be nice if we could trim the use of vector registers so we
don't need to save and restore lots of them. And if we need two
instructions anyway, then maybe it would be clearer with PUSH_VR/POP_VR
that also adjusts the stack pointer, and doesn't need to use an additional
register for indexing?

> +C load table elements
> +li r9,1*16
> +li r10,2*16
> +li r11,3*16
> +lxvd2x VSR(H1M),0,HT
> +lxvd2x VSR(H1L),r9,HT
> +lxvd2x VSR(H2M),r10,HT
> +lxvd2x VSR(H2L),r11,HT
> +li r9,4*16
> +li r10,5*16
> +li r11,6*16
> +li r12,7*16
> +lxvd2x VSR(H3M),r9,HT
> +lxvd2x VSR(H3L),r10,HT
> +lxvd2x VSR(H4M),r11,HT
> +lxvd2x VSR(H4L),r12,HT

I think it would be nicer to follow the style I tried to implement in my
recent updates, using some registers (e.g., r9-r11) as offsets,
initializing them only once, and using them everywhere. E.g., in this case,
the loading could be

lxvd2x VSR(H1M),0,HT
lxvd2x VSR(H1L),r9,HT
lxvd2x VSR(H2M),r10,HT
lxvd2x VSR(H2L),r11,HT
addi  HT, HT, 64
lxvd2x VSR(H3M),0,HT
lxvd2x VSR(H3L),r9,HT
lxvd2x VSR(H4M),r10,HT
lxvd2x VSR(H4L),r11,HT

> +C do two 4x ghash
> +
> +C previous digest combining
> +xxlor vs0, VSR(S0), VSR(S0)
> +vxor S0,S0,D
> +
> +GF_MUL(F2, R2, H3L, H3M, S1)
> +GF_MUL(F, R, H4L, H4M, S0)
> +vxor   F,F,F2
> +vxor   R,R,R2
> +
> +xxlor VSR(S0), vs0, vs0
> +
> +GF_MUL(F3, R3, H2L, H2M, S2)
> +GF_MUL(F4, R4, H1L, H1M, S3)
> +vxor   F3,F3,F4
> +vxor   R3,R3,R4
> +
> +vxor   F,F,F3
> +vxor   D,R,R3
> +GHASH_REDUCE(D, F, POLY_L, R2, F2)  C R2, F2 used as 

Re: ppc64 micro optimization

2024-02-03 Thread Danny Tsen
Hi Niels,

Here is the new patch v4 for AES/GCM stitched implementation and benchmark 
based on the current repo.

Thanks.
-Danny



> On Jan 31, 2024, at 4:35 AM, Niels Möller  wrote:
>
> Niels Möller  writes:
>
>> While the powerpc64 vncipher instruction really wants the original
>> subkeys, not transformed. So on power, it would be better to have a
>> _nettle_aes_invert that is essentially a memcpy, and then the aes
>> decrypt assembly code could be reworked without the xors, and run at exactly
>> the same speed as encryption.
>
> I've tried this out, see branch
> https://git.lysator.liu.se/nettle/nettle/-/tree/ppc64-aes-invert . It
> appears to give the desired improvement in aes decrypt speed, making it
> run at the same speed as aes encrypt. Which is a speedup of about 80%
> when benchmarked on power10 (the cfarm120 machine).
>
>> Current _nettle_aes_invert also changes the order of the subkeys, with
>> a FIXME comment suggesting that it would be better to update the order
>> keys are accessed in the aes decryption functions.
>
> I've merged the changes to keep subkey order the same for encrypt and
> decrypt (so that the decrypt round loop uses subkeys starting at the end
> of the array), which affects all aes implementations except s390x, which
> doesn't need any subkey expansion. But I've deleted the sparc32 assembly
> rather than updating it.
>
> Regards,
> /Niels
>
> --
> Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
> Internet email is subject to wholesale government surveillance.
> ___
> nettle-bugs mailing list -- nettle-bugs@lists.lysator.liu.se
> To unsubscribe send an email to nettle-bugs-le...@lists.lysator.liu.se



Re: ppc64 micro optimization

2024-01-30 Thread Niels Möller
Niels Möller  writes:

> While the powerpc64 vncipher instruction really wants the original
> subkeys, not transformed. So on power, it would be better to have a
> _nettle_aes_invert that is essentially a memcpy, and then the aes
> decrypt assembly code could be reworked without the xors, and run at exactly
> the same speed as encryption. 

I've tried this out, see branch
https://git.lysator.liu.se/nettle/nettle/-/tree/ppc64-aes-invert. It
appears to give the desired improvement in aes decrypt speed, making it
run at the same speed as aes encrypt. Which is a speedup of about 80%
when benchmarked on power10 (the cfarm120 machine).

> Current _nettle_aes_invert also changes the order of the subkeys, with
> a FIXME comment suggesting that it would be better to update the order
> keys are accessed in the aes decryption functions.

I've merged the changes to keep subkey order the same for encrypt and
decrypt (so that the decrypt round loop uses subkeys starting at the end
of the array), which affects all aes implementations except s390x, which
doesn't need any subkey expansion. But I've deleted the sparc32 assembly
rather than updating it.

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.


RE: ppc64 micro optimization

2024-01-24 Thread Danny Tsen
Hi Niels,

This may take a while before I digest the changes and redo them.  It
could be another 2 weeks.  It may take longer because I will be taking
time off and will be out of the country.  Will do my best.

Thanks.
-Danny

From: Niels Möller 
Sent: Thursday, January 25, 2024 3:58 AM
To: Danny Tsen 
Cc: nettle-bugs@lists.lysator.liu.se ; George 
Wilson 
Subject: [EXTERNAL] Re: ppc64 micro optimization

Danny Tsen  writes:

> Thanks for merging the stitched implementation for PPC64 with your
> detailed information and efforts

We're not quite there yet, though. Do you think you could rebase your
work on top of recent changes? Sorry about conflicts, but I think new
macros should fit well with what you need (feel free to have additional
macros, where you find that useful).

Regards,
/Niels

--
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.


Re: ppc64 micro optimization

2024-01-24 Thread Niels Möller
Danny Tsen  writes:

> Thanks for merging the stitched implementation for PPC64 with your
> detailed information and efforts

We're not quite there yet, though. Do you think you could rebase your
work on top of recent changes? Sorry about conflicts, but I think new
macros should fit well with what you need (feel free to have additional
macros, where you find that useful).

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.


Re: ppc64 micro optimization

2024-01-22 Thread Danny Tsen
Hi Niels,

Thanks for merging the stitched implementation for PPC64 with your detailed
information and efforts.

Thanks.
-Danny

> On Jan 21, 2024, at 11:27 PM, Niels Möller  wrote:
> 
> In preparing for merging the gcm-aes "stitched" implementation, I'm
> reviewing the existing ghash code. WIP branch "ppc-ghash-macros".
> 
> I've introduced a macro GHASH_REDUCE, for the reduction logic. Besides
> that, I've been able to improve scheduling of the reduction instructions
> (adding in the result of vpmsumd last seems to improve parallelism, some
> 3% speedup of gcm_update on power10, benchmarked on cfarm120). I've also
> streamlined the way load offsets are used, and trimmed the number of
> needed vector registers slightly.
> 
> For the AES code, I've merged the new macros (I settled on the names
> OPN_XXY and OPN_XXXY), no change in speed expected from that change.
> 
> I've also tried to understand the difference between AES encrypt and
> decrypt, where decrypt is much slower, and uses an extra xor instruction
> in the round loop. I think the reason for that is that other AES
> implementations (including x86_64 and arm64 instructions, and Nettle's C
> implementation) expect the decryption subkeys to be transformed via the
> AES "MIX_COLUMN" operation, see
> https://gitlab.com/gnutls/nettle/-/blob/master/aes-invert-internal.c?ref_type=heads#L163
>  
> 
> While the powerpc64 vncipher instruction really wants the original
> subkeys, not transformed. So on power, it would be better to have a
> _nettle_aes_invert that is essentially a memcpy, and then the aes
> decrypt assembly code could be reworked without the xors, and run at exactly
> the same speed as encryption. Current _nettle_aes_invert also changes
> the order of the subkeys, with a FIXME comment suggesting that it would
> be better to update the order keys are accessed in the aes decryption
> functions.
> 
> Regards,
> /Niels
> 
> -- 
> Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
> Internet email is subject to wholesale government surveillance.
> 
> ___
> nettle-bugs mailing list -- nettle-bugs@lists.lysator.liu.se
> To unsubscribe send an email to nettle-bugs-le...@lists.lysator.liu.se



ppc64 micro optimization

2024-01-21 Thread Niels Möller
In preparing for merging the gcm-aes "stitched" implementation, I'm
reviewing the existing ghash code. WIP branch "ppc-ghash-macros".

I've introduced a macro GHASH_REDUCE, for the reduction logic. Besides
that, I've been able to improve scheduling of the reduction instructions
(adding in the result of vpmsumd last seems to improve parallelism, some
3% speedup of gcm_update on power10, benchmarked on cfarm120). I've also
streamlined the way load offsets are used, and trimmed the number of
needed vector registers slightly.

For the AES code, I've merged the new macros (I settled on the names
OPN_XXY and OPN_XXXY), no change in speed expected from that change.

I've also tried to understand the difference between AES encrypt and
decrypt, where decrypt is much slower, and uses an extra xor instruction
in the round loop. I think the reason for that is that other AES
implementations (including x86_64 and arm64 instructions, and Nettle's C
implementation) expect the decryption subkeys to be transformed via the
AES "MIX_COLUMN" operation, see
https://gitlab.com/gnutls/nettle/-/blob/master/aes-invert-internal.c?ref_type=heads#L163

While the powerpc64 vncipher instruction really wants the original
subkeys, not transformed. So on power, it would be better to have a
_nettle_aes_invert that is essentially a memcpy, and then the aes
decrypt assembly code could be reworked without the xors, and run at exactly
the same speed as encryption. Current _nettle_aes_invert also changes
the order of the subkeys, with a FIXME comment suggesting that it would
be better to update the order keys are accessed in the aes decryption
functions.
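
On power, that would be something like the following sketch (assuming
the current argument convention of (rounds, dst, src) and a subkey
array of 4*(rounds + 1) 32-bit words):

  #include <string.h>
  #include <stdint.h>

  /* Since vncipher wants the original round keys, "inverting" the key
     schedule reduces to a plain copy of the subkey array. */
  void
  _nettle_aes_invert (unsigned rounds, uint32_t *dst, const uint32_t *src)
  {
    if (dst != src)
      memcpy (dst, src, 4 * (rounds + 1) * sizeof (*dst));
  }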

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.
