Re: [PATCH v2 1/2] powerpc64: Add optimized assembly for sha256-compress-n

2024-05-07 Thread Eric Richter
On Sun, 2024-05-05 at 16:10 +0200, Niels Möller wrote:
> Eric Richter  writes:
> 
> > This patch introduces an optimized powerpc64 assembly
> > implementation for
> > sha256-compress-n. This takes advantage of the vshasigma
> > instruction, as
> > well as unrolling loops to best take advantage of running
> > instructions
> > in parallel.
> 
> Thanks. I'm now having a closer read of the assembly code. Comments
> below.
> 
> > +C ROUND(A B C D E F G H R EXT)
> > +define(`ROUND', `
> > +
> > +   vadduwm VT1, VK, IV($9)       C VT1: k+W
> > +   vadduwm VT4, $8, VT1          C VT4: H+k+W
> > +
> > +   lxvw4x  VSR(VK), TK, K        C Load Key
> > +   addi    TK, TK, 4             C Increment Pointer to next key
> > +
> > +   vadduwm VT2, $4, $8           C VT2: H+D
> > +   vadduwm VT2, VT2, VT1         C VT2: H+D+k+W
> > +
> > +   vshasigmaw  SIGE, $5, 1, 0b   C Sigma(E)  Se
> > +   vshasigmaw  SIGA, $1, 1, 0    C Sigma(A)  Sa
> > +
> > +   vxor    VT3, $2, $3           C VT3: b^c
> > +   vsel    VT0, $7, $6, $5       C VT0: Ch.
> > +   vsel    VT3, $3, $1, VT3      C VT3: Maj(a,b,c)
> > +
> > +   vadduwm VT4, VT4, VT0         C VT4: Hkw + Ch.
> > +   vadduwm VT3, VT3, VT4         C VT3: HkW + Ch. + Maj.
> > +
> > +   vadduwm VT0, VT0, VT2         C VT0: Ch. + DHKW
> > +   vadduwm $8, SIGE, SIGA        C Anext: Se + Sa
> > +   vadduwm $4, VT0, SIGE         C Dnext: Ch. + DHKW + Se
> > +   vadduwm $8, $8, VT3           C Anext: Se+Sa+HkW+Ch.+Maj.
> > +
> > +
> > +   C Schedule (data) for 16th round in future
> > +   C Extend W[i]
> > +   ifelse(`$10', `1', `
> > +   vshasigmaw  SIGE, IV($9 + 14), 0, 0b
> > +   vshasigmaw  SIGA, IV($9 + 1), 0, 0b
> > +   vadduwm IV($9), IV($9), SIGE
> > +   vadduwm IV($9), IV($9), SIGA
> > +   vadduwm IV($9), IV($9), IV($9 + 9)
> > +   ')
> > +')
> 
> I think it would be a bit simpler to take out the extend logic to its
> own macro. 
> 
> > +define(`EXTENDROUND', `ROUND($1, $2, $3, $4, $5, $6, $7, $8, $9, 1)')
> 
> If you do that, then you would define
>   
>   define(`EXTENDROUND', `ROUND($1, $2, $3, $4, $5, $6, $7, $8, $9) EXTEND($9)')
> 
> (In other related code, input expansion is done at the beginning of a
> round iteration rather than at the end, but doing it at the end like
> you do may be better scheduling).
> 

Makes sense, I'll move that extend logic into its own macro.

You are correct, the expansion logic was moved to the end of the round
to improve instruction scheduling on the CPU. The vshasigma instructions
take more cycles and are scheduled on a different unit than the other
arithmetic operations. This allows them to run in parallel with the
beginning of the next round, as there are no dependent registers until
the next in-round vshasigma instructions.
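
Roughly, the split would look like the following untested sketch -- the
body is lifted from the existing ifelse block, combined with the
EXTENDROUND form you suggest (ROUND itself would then drop the trailing
ifelse and the EXT parameter):

C EXTEND(R)
define(`EXTEND', `
   vshasigmaw  SIGE, IV($1 + 14), 0, 0b
   vshasigmaw  SIGA, IV($1 + 1), 0, 0b
   vadduwm IV($1), IV($1), SIGE
   vadduwm IV($1), IV($1), SIGA
   vadduwm IV($1), IV($1), IV($1 + 9)
')

define(`EXTENDROUND', `ROUND($1, $2, $3, $4, $5, $6, $7, $8, $9) EXTEND($9)')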

> > +define(`LOAD', `
> > +   IF_BE(`lxvw4x   VSR(IV($1)), 0, INPUT')
> > +   IF_LE(`
> > +   lxvd2x  VSR(IV($1)), 0, INPUT
> > +   vperm   IV($1), IV($1), IV($1), VT0
> > +   ')
> > +   addi    INPUT, INPUT, 4
> > +')
> > +
> > +define(`DOLOADS', `
> > +   IF_LE(`DATA_LOAD_VEC(VT0, .load_swap, T1)')
> 
> Could you have a dedicated register for the permutation constant, and
> load it only once at function entry? If you have general registers to
> spare, it could also make sense to use, e.g., three registers for the
> constant values 16, 32, 48, and use them for indexing. Then you don't
> need to update the INPUT pointer as often, and you can use the same
> constants for other load/store sequences as well.
> 

There are plenty of GPRs to spare, I will test and bench a few options
for using more GPRs as indexes.
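
For example, the loads could become something like this untested
sketch, where TC4/TC8/TC12 are placeholder names for the constant index
registers and VSWAP is a placeholder for a dedicated register holding
the byte-swap permutation loaded once at entry (a VR would have to be
freed up for it, see below). The 4-byte stride matches what the current
LOAD macro does, with one INPUT update per four loads:

   li      TC4, 4
   li      TC8, 8
   li      TC12, 12
   C LE path shown
   lxvd2x  VSR(IV(0)), 0, INPUT
   lxvd2x  VSR(IV(1)), TC4, INPUT
   lxvd2x  VSR(IV(2)), TC8, INPUT
   lxvd2x  VSR(IV(3)), TC12, INPUT
   vperm   IV(0), IV(0), IV(0), VSWAP
   vperm   IV(1), IV(1), IV(1), VSWAP
   vperm   IV(2), IV(2), IV(2), VSWAP
   vperm   IV(3), IV(3), IV(3), VSWAP
   addi    INPUT, INPUT, 16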

As for VRs, unfortunately the current implementation uses all 32 VRs:
 - 16 for W[i]
 - 8 for state
 - 7 for round arithmetic (two of these specifically for sigma, to
   avoid a dependency bubble)
 - 1 for storing the key constant K

That said, I'm going to experiment with some VSX instructions to see if
it is possible to spill over certain operations into VSRs, without
needing an explicit copy back from VSR to VR.

> > +   LOAD(0)
> > +   LOAD(1)
> > +   LOAD(2)
> > +   LOAD(3)
> 
> > +PROLOGUE(_nettle_sha256_compress_n)
> > +   cmpwi   0, NUMBLOCKS, 0
> > +   ble 0, .done
> > +   mtctr   NUMBLOCKS
> > +
> > +   C Store non-volatile registers
> > +   subi    SP, SP, 64+(12*16)
> > +   std T0, 24(SP)
> > +   std T1, 16(SP)
> > +   std COUNT,  8(SP)
> 
> For save/restore of registers, I prefer to use the register names, not
> the defined symbols. And T0, T1, COUNT are defined to use r7, r8, r10,
> which *are* volatile, right?

Ah yep, good catch!

> Does the data stored fit in the 288 byte "protected zone"? If so,
> probably best to not modify the stack pointer.
> 

At the moment it should as I'm currently moving the 

Re: [PATCH v2 1/2] powerpc64: Add optimized assembly for sha256-compress-n

2024-05-05 Thread Niels Möller
Eric Richter  writes:

> This patch introduces an optimized powerpc64 assembly implementation for
> sha256-compress-n. This takes advantage of the vshasigma instruction, as
> well as unrolling loops to best take advantage of running instructions
> in parallel.

Thanks. I'm now having a closer read of the assembly code. Comments below.

> +C ROUND(A B C D E F G H R EXT)
> +define(`ROUND', `
> +
> + vadduwm VT1, VK, IV($9)       C VT1: k+W
> + vadduwm VT4, $8, VT1          C VT4: H+k+W
> +
> + lxvw4x  VSR(VK), TK, K        C Load Key
> + addi    TK, TK, 4             C Increment Pointer to next key
> +
> + vadduwm VT2, $4, $8           C VT2: H+D
> + vadduwm VT2, VT2, VT1         C VT2: H+D+k+W
> +
> + vshasigmaw  SIGE, $5, 1, 0b   C Sigma(E)  Se
> + vshasigmaw  SIGA, $1, 1, 0    C Sigma(A)  Sa
> +
> + vxor    VT3, $2, $3           C VT3: b^c
> + vsel    VT0, $7, $6, $5       C VT0: Ch.
> + vsel    VT3, $3, $1, VT3      C VT3: Maj(a,b,c)
> +
> + vadduwm VT4, VT4, VT0         C VT4: Hkw + Ch.
> + vadduwm VT3, VT3, VT4         C VT3: HkW + Ch. + Maj.
> +
> + vadduwm VT0, VT0, VT2         C VT0: Ch. + DHKW
> + vadduwm $8, SIGE, SIGA        C Anext: Se + Sa
> + vadduwm $4, VT0, SIGE         C Dnext: Ch. + DHKW + Se
> + vadduwm $8, $8, VT3           C Anext: Se+Sa+HkW+Ch.+Maj.
> +
> +
> + C Schedule (data) for 16th round in future
> + C Extend W[i]
> + ifelse(`$10', `1', `
> + vshasigmaw  SIGE, IV($9 + 14), 0, 0b
> + vshasigmaw  SIGA, IV($9 + 1), 0, 0b
> + vadduwm IV($9), IV($9), SIGE
> + vadduwm IV($9), IV($9), SIGA
> + vadduwm IV($9), IV($9), IV($9 + 9)
> + ')
> +')

I think it would be a bit simpler to take out the extend logic to its
own macro. 

> +define(`EXTENDROUND', `ROUND($1, $2, $3, $4, $5, $6, $7, $8, $9, 1)')

If you do that, then you would define
  
  define(`EXTENDROUND', `ROUND($1, $2, $3, $4, $5, $6, $7, $8, $9) EXTEND($9)')

(In other related code, input expansion is done at the beginning of a
round iteration rather than at the end, but doing it at the end like you
do may be better scheduling).

> +define(`LOAD', `
> + IF_BE(`lxvw4x   VSR(IV($1)), 0, INPUT')
> + IF_LE(`
> + lxvd2x  VSR(IV($1)), 0, INPUT
> + vperm   IV($1), IV($1), IV($1), VT0
> + ')
> + addi    INPUT, INPUT, 4
> +')
> +
> +define(`DOLOADS', `
> + IF_LE(`DATA_LOAD_VEC(VT0, .load_swap, T1)')

Could you have a dedicated register for the permutation constant, and
load it only once at function entry? If you have general registers to
spare, it could also make sense to use, e.g., three registers for the
constant values 16, 32, 48, and use them for indexing. Then you don't need to
update the INPUT pointer as often, and you can use the same constants
for other load/store sequences as well.

> + LOAD(0)
> + LOAD(1)
> + LOAD(2)
> + LOAD(3)

> +PROLOGUE(_nettle_sha256_compress_n)
> + cmpwi   0, NUMBLOCKS, 0
> + ble 0, .done
> + mtctr   NUMBLOCKS
> +
> + C Store non-volatile registers
> + subi    SP, SP, 64+(12*16)
> + std T0, 24(SP)
> + std T1, 16(SP)
> + std COUNT,  8(SP)

For save/restore of registers, I prefer to use the register names, not
the defined symbols. And T0, T1, COUNT are defined to use r7, r8, r10,
which *are* volatile, right?

Does the data stored fit in the 288 byte "protected zone"? If so,
probably best to not modify the stack pointer.

> + li      T0, 32
> + stvx    v20, 0, SP
> + subi    T0, T0, 16
> + stvx    v21, T0, SP

Here it would also help a bit to allocate constants 16, 32, 48 in registers.
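
Something like this untested sketch is what I have in mind (TP is a
placeholder name for a scratch GPR, TC48 for a register holding 48).
Since 12 x 16 = 192 bytes fits in the protected zone below SP, it also
avoids touching the stack pointer:

   li      T0, 16
   li      T1, 32
   li      TC48, 48
   addi    TP, SP, -192          C save area in the protected zone
   stvx    v20, 0, TP
   stvx    v21, T0, TP
   stvx    v22, T1, TP
   stvx    v23, TC48, TP
   addi    TP, TP, 64
   stvx    v24, 0, TP
   stvx    v25, T0, TP
   stvx    v26, T1, TP
   stvx    v27, TC48, TP
   addi    TP, TP, 64
   stvx    v28, 0, TP
   stvx    v29, T0, TP
   stvx    v30, T1, TP
   stvx    v31, TC48, TP

The same 16/32/48 constants could then be reused for the restores in
the epilogue and for other vector load/store sequences.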

> + subi    T0, T0, 16
> + stvx    v22, T0, SP
> + subi    T0, T0, 16
> + stvx    v23, T0, SP
> + subi    T0, T0, 16
> + stvx    v24, T0, SP
> + subi    T0, T0, 16
> + stvx    v25, T0, SP
> + subi    T0, T0, 16
> + stvx    v26, T0, SP
> + subi    T0, T0, 16
> + stvx    v27, T0, SP
> + subi    T0, T0, 16
> + stvx    v28, T0, SP
> + subi    T0, T0, 16
> + stvx    v29, T0, SP
> + subi    T0, T0, 16
> + stvx    v30, T0, SP
> + subi    T0, T0, 16
> + stvx    v31, T0, SP
> +
> + C Load state values
> + li  T0, 16
> + lxvw4x  VSR(VSA), 0, STATE  C VSA contains A,B,C,D
> + lxvw4x  VSR(VSE), T0, STATE C VSE contains E,F,G,H
> +
> +.loop:
> + li  TK, 0
> + lxvw4x  VSR(VK), TK, K
> + addi    TK, TK, 4
> +
> + DOLOADS
> +
> + C "permute" state from VSA containing A,B,C,D into VSA,VSB,VSC,VSD

Can you give a bit more detail on this permutation? Does the main round
operations only use 32 bits each from the state registers? There's no
reasonable way to use a more 

[PATCH v2 1/2] powerpc64: Add optimized assembly for sha256-compress-n

2024-04-18 Thread Eric Richter
This patch introduces an optimized powerpc64 assembly implementation for
sha256-compress-n. This takes advantage of the vshasigma instruction, as
well as unrolling loops to best take advantage of running instructions
in parallel.
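
For reference, the vshasigmaw instruction computes the four SHA-256
sigma functions from FIPS 180-4 on each 32-bit word lane, with the
immediate operands selecting which function; roughly:

  s0(x) = ROTR(x,7)  XOR ROTR(x,18) XOR SHR(x,3)
  s1(x) = ROTR(x,17) XOR ROTR(x,19) XOR SHR(x,10)
  S0(x) = ROTR(x,2)  XOR ROTR(x,13) XOR ROTR(x,22)
  S1(x) = ROTR(x,6)  XOR ROTR(x,11) XOR ROTR(x,25)

(s = lowercase sigma, used for message expansion; S = uppercase Sigma,
used in the round function.)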

The following data was captured on a POWER 10 LPAR @ ~3.896GHz

Current C implementation:
 Algorithm      mode        Mbyte/s
 sha256         update       280.97
 hmac-sha256    64 bytes      80.81
 hmac-sha256    256 bytes    170.50
 hmac-sha256    1024 bytes   241.92
 hmac-sha256    4096 bytes   268.54
 hmac-sha256    single msg   276.16

With optimized assembly:
 Algorithm      mode        Mbyte/s
 sha256         update       446.42
 hmac-sha256    64 bytes     124.89
 hmac-sha256    256 bytes    268.90
 hmac-sha256    1024 bytes   382.06
 hmac-sha256    4096 bytes   425.38
 hmac-sha256    single msg   439.75

Signed-off-by: Eric Richter 
---
 fat-ppc.c                             |  12 +
 powerpc64/fat/sha256-compress-n-2.asm |  36 +++
 powerpc64/p8/sha256-compress-n.asm    | 323 ++
 3 files changed, 371 insertions(+)
 create mode 100644 powerpc64/fat/sha256-compress-n-2.asm
 create mode 100644 powerpc64/p8/sha256-compress-n.asm

diff --git a/fat-ppc.c b/fat-ppc.c
index cd76f7a1..efbeb2ec 100644
--- a/fat-ppc.c
+++ b/fat-ppc.c
@@ -203,6 +203,10 @@ DECLARE_FAT_FUNC(_nettle_poly1305_blocks, poly1305_blocks_func)
 DECLARE_FAT_FUNC_VAR(poly1305_blocks, poly1305_blocks_func, c)
 DECLARE_FAT_FUNC_VAR(poly1305_blocks, poly1305_blocks_func, ppc64)
 
+DECLARE_FAT_FUNC(_nettle_sha256_compress_n, sha256_compress_n_func)
+DECLARE_FAT_FUNC_VAR(sha256_compress_n, sha256_compress_n_func, c)
+DECLARE_FAT_FUNC_VAR(sha256_compress_n, sha256_compress_n_func, ppc64)
+
 
 static void CONSTRUCTOR
 fat_init (void)
@@ -231,6 +235,8 @@ fat_init (void)
  _nettle_ghash_update_arm64() */
   _nettle_ghash_set_key_vec = _nettle_ghash_set_key_ppc64;
   _nettle_ghash_update_vec = _nettle_ghash_update_ppc64;
+
+  _nettle_sha256_compress_n_vec = _nettle_sha256_compress_n_ppc64;
 }
   else
 {
@@ -239,6 +245,7 @@ fat_init (void)
   _nettle_aes_invert_vec = _nettle_aes_invert_c;
   _nettle_ghash_set_key_vec = _nettle_ghash_set_key_c;
   _nettle_ghash_update_vec = _nettle_ghash_update_c;
+  _nettle_sha256_compress_n_vec = _nettle_sha256_compress_n_c;
 }
   if (features.have_altivec)
 {
@@ -338,3 +345,8 @@ DEFINE_FAT_FUNC(_nettle_poly1305_blocks, const uint8_t *,
  size_t blocks,
 const uint8_t *m),
(ctx, blocks, m))
+
+DEFINE_FAT_FUNC(_nettle_sha256_compress_n, const uint8_t *,
+   (uint32_t *state, const uint32_t *k,
+size_t blocks, const uint8_t *input),
+   (state, k, blocks, input))
diff --git a/powerpc64/fat/sha256-compress-n-2.asm b/powerpc64/fat/sha256-compress-n-2.asm
new file mode 100644
index ..4f4eee9d
--- /dev/null
+++ b/powerpc64/fat/sha256-compress-n-2.asm
@@ -0,0 +1,36 @@
+C powerpc64/fat/sha256-compress-n-2.asm
+
+ifelse(`
+   Copyright (C) 2024 Eric Richter, IBM Corporation
+
+   This file is part of GNU Nettle.
+
+   GNU Nettle is free software: you can redistribute it and/or
+   modify it under the terms of either:
+
+ * the GNU Lesser General Public License as published by the Free
+   Software Foundation; either version 3 of the License, or (at your
+   option) any later version.
+
+   or
+
+ * the GNU General Public License as published by the Free
+   Software Foundation; either version 2 of the License, or (at your
+   option) any later version.
+
+   or both in parallel, as here.
+
+   GNU Nettle is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   You should have received copies of the GNU General Public License and
+   the GNU Lesser General Public License along with this program.  If
+   not, see http://www.gnu.org/licenses/.
+')
+
+dnl PROLOGUE(_nettle_sha256_compress_n) picked up by configure
+
+define(`fat_transform', `$1_ppc64')
+include_src(`powerpc64/p8/sha256-compress-n.asm')
diff --git a/powerpc64/p8/sha256-compress-n.asm b/powerpc64/p8/sha256-compress-n.asm
new file mode 100644
index ..d76f337e
--- /dev/null
+++ b/powerpc64/p8/sha256-compress-n.asm
@@ -0,0 +1,323 @@
+C powerpc64/p8/sha256-compress-n.asm
+
+ifelse(`
+   Copyright (C) 2024 Eric Richter, IBM Corporation
+
+   This file is part of GNU Nettle.
+
+   GNU Nettle is free software: you can redistribute it and/or
+   modify it under the terms of either:
+
+ * the GNU Lesser General Public License as published by the Free
+   Software Foundation; either version 3 of the License, or (at your
+   option) any later version.
+
+   or
+
+ * the GNU General Public License as published by