Re: ppc64: AES/GCM Performance improvement with stitched implementation

2023-11-21 Thread Danny Tsen
Hi Niels,

More comments.  Please see inline.

> On Nov 21, 2023, at 1:46 PM, Danny Tsen  wrote:
> 
> Hi Niels,
> 
> Thanks for the quick response.
> 
> I'll think more through your comments here, and it may take some more time to get
> an update.  And just quick answers to your four questions.
> 
> 
>  1.  It depends on some special registers from the caller.  This is so that I
> don't need to change the registers used in the aes_internal_encrypt and
> gf_mul_4x functions, and it is a way to avoid too much change in the existing
> code.  But I can change that for sure.  An m4 macro could be helpful here.
>  2.  The reason to use gcm_encrypt is to minimize duplicate code in
> gcm_aes128..., but I can change that.
>  3.  Yes, 4x blocks won't provide the same performance as 8x.
>  4.  Yes, the function calls did introduce quite a lot of overhead in the loop.
> We can call gf_mul_4x from _ghash_update, but the stack handling has to be
> changed, and I tried not to change anything in _ghash_update since my code
> doesn't call _ghash_update.  But I guess I can use an m4 macro instead.
> 
> Thanks.
> -Danny
> 
> From: Niels Möller 
> Sent: Tuesday, November 21, 2023 1:07 PM
> To: Danny Tsen 
> Cc: nettle-bugs@lists.lysator.liu.se ; 
> George Wilson 
> Subject: [EXTERNAL] Re: Fw: ppc64: AES/GCM Performance improvement with 
> stitched implementation
> 
> Danny Tsen  writes:
> 
>> This patch provides a performance improvement for AES/GCM with a stitched
>> implementation for ppc64.  The code is an assembly wrapper that handles
>> multiples of 8 blocks and both big- and little-endian.
>> 
>> The overall improvement, measured with nettle-benchmark, is ~80% for
>> AES/GCM encrypt and ~86% for decrypt over the current baseline.  The
>> benchmark was run on a P10 machine with a 3.896 GHz CPU.
> 
> That's a pretty nice performance improvement. A first round of comments
> below, mainly structural.
> 
> (And I think attachments didn't make it to the list, possibly because
> some of them had Content-type: application/octet-stream rather than
> text/plain).
> 
>> +#if defined(__powerpc64__) || defined(__powerpc__)
>> +#define HAVE_AES_GCM_STITCH 1
>> +#endif
> 
> If the C code needs to know about optional assembly functions, the
> HAVE_NATIVE tests are intended for that.
> 
>> void
>> gcm_encrypt (struct gcm_ctx *ctx, const struct gcm_key *key,
>>const void *cipher, nettle_cipher_func *f,
>> @@ -209,6 +228,35 @@ gcm_encrypt (struct gcm_ctx *ctx, const struct gcm_key 
>> *key,
>> {
>>   assert(ctx->data_size % GCM_BLOCK_SIZE == 0);
>> 
>> +#if defined(HAVE_AES_GCM_STITCH)
>> +  size_t rem_len = 0;
>> +
>> +  if (length >= 128) {
>> +int rounds = 0;
>> +if (f == (nettle_cipher_func *) aes128_encrypt) {
>> +  rounds = _AES128_ROUNDS;
>> +} else if (f == (nettle_cipher_func *) aes192_encrypt) {
>> +  rounds = _AES192_ROUNDS;
>> +} else if (f == (nettle_cipher_func *) aes256_encrypt) {
>> +  rounds = _AES256_ROUNDS;
>> +}
>> +if (rounds) {
>> +  struct gcm_aes_context c;
>> +  get_ctx(&c, ctx, key, cipher);
>> +  _nettle_ppc_gcm_aes_encrypt_ppc64(&c, rounds, ctx->ctr.b, length,
>> dst, src);
> 
> I think this is the wrong place for this dispatch; it should go
> in gcm-aes128.c, gcm-aes192.c, etc.
> 
>> --- a/powerpc64/p8/aes-encrypt-internal.asm
>> +++ b/powerpc64/p8/aes-encrypt-internal.asm
>> @@ -52,6 +52,16 @@ define(`S5', `v7')
>> define(`S6', `v8')
>> define(`S7', `v9')
>> 
>> +C re-define SRC if from _gcm_aes
>> +define(`S10', `v10')
>> +define(`S11', `v11')
>> +define(`S12', `v12')
>> +define(`S13', `v13')
>> +define(`S14', `v14')
>> +define(`S15', `v15')
>> +define(`S16', `v16')
>> +define(`S17', `v17')
>> +
>> .file "aes-encrypt-internal.asm"
>> 
>> .text
>> @@ -66,6 +76,10 @@ PROLOGUE(_nettle_aes_encrypt)
>>  DATA_LOAD_VEC(SWAP_MASK,.swap_mask,r5)
>> 
>>  subi ROUNDS,ROUNDS,1
>> +
>> + cmpdi r23, 0x5f C call from _gcm_aes
>> + beq Lx8_loop
>> +
>>  srdi LENGTH,LENGTH,4
>> 
>>  srdi r5,LENGTH,3 #8x loop count
>> @@ -93,6 +107,9 @@ Lx8_loop:
>>  lxvd2x VSR(K),0,KEYS
>>  vperm   K,K,K,SWAP_MASK
>> 
>> + cmpdi r23, 0x5f
>> + beq Skip_load
> 
> It's a little messy to have branches depending on a special register set
> by some callers. I think it would be simpler to either move the round
> loop (i.e., the loop starting at the L8x_round_loop: label) into a
> subroutine with all-register arguments, and call that from both
> _nettle_aes_encrypt and _nettle_gcm_aes_encrypt, or define an m4 macro
> expanding to the body of that loop, and use that macro in both places.
> 
>> --- /dev/null
>> +++ b/powerpc64/p8/gcm-aes-decrypt.asm
>> @@ -0,0 +1,425 @@
>> +C powerpc64/p8/gcm-aes-decrypt.asm
> 
>> +.macro SAVE_REGS
>> + mflr 0
>> + std 0,16(1)
>> + stdu  SP,-464(SP)
> 
> If macros are needed, please use m4 macros, like other nettle assembly code.
> 
>> +.align 5
>> +Loop8x_de:
> [...]
>> +bl _nettle_aes_encrypt_ppc64

RE: Fw: ppc64: AES/GCM Performance improvement with stitched implementation

2023-11-21 Thread Danny Tsen
Hi Niels,

Thanks for the quick response.

I'll think more through your comments here, and it may take some more time to get
an update.  And just quick answers to your four questions.


  1.  It depends on some special registers from the caller.  This is so that I
don't need to change the registers used in the aes_internal_encrypt and
gf_mul_4x functions, and it is a way to avoid too much change in the existing
code.  But I can change that for sure.  An m4 macro could be helpful here.
  2.  The reason to use gcm_encrypt is to minimize duplicate code in
gcm_aes128..., but I can change that.
  3.  Yes, 4x blocks won't provide the same performance as 8x.
  4.  Yes, the function calls did introduce quite a lot of overhead in the loop.
We can call gf_mul_4x from _ghash_update, but the stack handling has to be
changed, and I tried not to change anything in _ghash_update since my code
doesn't call _ghash_update.  But I guess I can use an m4 macro instead.

Thanks.
-Danny

From: Niels Möller 
Sent: Tuesday, November 21, 2023 1:07 PM
To: Danny Tsen 
Cc: nettle-bugs@lists.lysator.liu.se ; George 
Wilson 
Subject: [EXTERNAL] Re: Fw: ppc64: AES/GCM Performance improvement with 
stitched implementation

Danny Tsen  writes:

> This patch provides a performance improvement for AES/GCM with a stitched
> implementation for ppc64.  The code is an assembly wrapper that handles
> multiples of 8 blocks and both big- and little-endian.
>
> The overall improvement, measured with nettle-benchmark, is ~80% for
> AES/GCM encrypt and ~86% for decrypt over the current baseline.  The
> benchmark was run on a P10 machine with a 3.896 GHz CPU.

That's a pretty nice performance improvement. A first round of comments
below, mainly structural.

(And I think attachments didn't make it to the list, possibly because
some of them had Content-type: application/octet-stream rather than
text/plain).

> +#if defined(__powerpc64__) || defined(__powerpc__)
> +#define HAVE_AES_GCM_STITCH 1
> +#endif

If the C code needs to know about optional assembly functions, the
HAVE_NATIVE tests are intended for that.

>  void
>  gcm_encrypt (struct gcm_ctx *ctx, const struct gcm_key *key,
> const void *cipher, nettle_cipher_func *f,
> @@ -209,6 +228,35 @@ gcm_encrypt (struct gcm_ctx *ctx, const struct gcm_key 
> *key,
>  {
>assert(ctx->data_size % GCM_BLOCK_SIZE == 0);
>
> +#if defined(HAVE_AES_GCM_STITCH)
> +  size_t rem_len = 0;
> +
> +  if (length >= 128) {
> +int rounds = 0;
> +if (f == (nettle_cipher_func *) aes128_encrypt) {
> +  rounds = _AES128_ROUNDS;
> +} else if (f == (nettle_cipher_func *) aes192_encrypt) {
> +  rounds = _AES192_ROUNDS;
> +} else if (f == (nettle_cipher_func *) aes256_encrypt) {
> +  rounds = _AES256_ROUNDS;
> +}
> +if (rounds) {
> +  struct gcm_aes_context c;
> +  get_ctx(&c, ctx, key, cipher);
> +  _nettle_ppc_gcm_aes_encrypt_ppc64(&c, rounds, ctx->ctr.b, length, dst,
> src);

I think this is the wrong place for this dispatch; it should go
in gcm-aes128.c, gcm-aes192.c, etc.

> --- a/powerpc64/p8/aes-encrypt-internal.asm
> +++ b/powerpc64/p8/aes-encrypt-internal.asm
> @@ -52,6 +52,16 @@ define(`S5', `v7')
>  define(`S6', `v8')
>  define(`S7', `v9')
>
> +C re-define SRC if from _gcm_aes
> +define(`S10', `v10')
> +define(`S11', `v11')
> +define(`S12', `v12')
> +define(`S13', `v13')
> +define(`S14', `v14')
> +define(`S15', `v15')
> +define(`S16', `v16')
> +define(`S17', `v17')
> +
>  .file "aes-encrypt-internal.asm"
>
>  .text
> @@ -66,6 +76,10 @@ PROLOGUE(_nettle_aes_encrypt)
>   DATA_LOAD_VEC(SWAP_MASK,.swap_mask,r5)
>
>   subi ROUNDS,ROUNDS,1
> +
> + cmpdi r23, 0x5f C call from _gcm_aes
> + beq Lx8_loop
> +
>   srdi LENGTH,LENGTH,4
>
>   srdi r5,LENGTH,3 #8x loop count
> @@ -93,6 +107,9 @@ Lx8_loop:
>   lxvd2x VSR(K),0,KEYS
>   vperm   K,K,K,SWAP_MASK
>
> + cmpdi r23, 0x5f
> + beq Skip_load

It's a little messy to have branches depending on a special register set
by some callers. I think it would be simpler to either move the round
loop (i.e., the loop starting at the L8x_round_loop: label) into a
subroutine with all-register arguments, and call that from both
_nettle_aes_encrypt and _nettle_gcm_aes_encrypt, or define an m4 macro
expanding to the body of that loop, and use that macro in both places.

> --- /dev/null
> +++ b/powerpc64/p8/gcm-aes-decrypt.asm
> @@ -0,0 +1,425 @@
> +C powerpc64/p8/gcm-aes-decrypt.asm

> +.macro SAVE_REGS
> + mflr 0
> + std 0,16(1)
> + stdu  SP,-464(SP)

If macros are needed, please use m4 macros, like other nettle assembly code.

> +.align 5
> +Loop8x_de:
[...]
> +bl _nettle_aes_encrypt_ppc64

I suspect this reference will break in non-fat builds?

> +nop
> +
> +C do two 4x ghash
[...]
> +bl _nettle_gf_mul_4x_ppc64
> +nop
> +
> +bl _nettle_gf_mul_4x_ppc64
> +nop

So the body of the main loop is one subroutine call to do 8 aes blocks,
and two subroutine calls to do the corresponding ghash.

Re: Fw: ppc64: AES/GCM Performance improvement with stitched implementation

2023-11-21 Thread Niels Möller
Danny Tsen  writes:

> This patch provides a performance improvement for AES/GCM with a stitched
> implementation for ppc64.  The code is an assembly wrapper that handles
> multiples of 8 blocks and both big- and little-endian.
>
> The overall improvement, measured with nettle-benchmark, is ~80% for
> AES/GCM encrypt and ~86% for decrypt over the current baseline.  The
> benchmark was run on a P10 machine with a 3.896 GHz CPU.

That's a pretty nice performance improvement. A first round of comments
below, mainly structural. 

(And I think attachments didn't make it to the list, possibly because
some of them had Content-type: application/octet-stream rather than
text/plain).

> +#if defined(__powerpc64__) || defined(__powerpc__)
> +#define HAVE_AES_GCM_STITCH 1
> +#endif

If the C code needs to know about optional assembly functions, the
HAVE_NATIVE tests are intended for that.

>  void
>  gcm_encrypt (struct gcm_ctx *ctx, const struct gcm_key *key,
>const void *cipher, nettle_cipher_func *f,
> @@ -209,6 +228,35 @@ gcm_encrypt (struct gcm_ctx *ctx, const struct gcm_key 
> *key,
>  {
>assert(ctx->data_size % GCM_BLOCK_SIZE == 0);
>  
> +#if defined(HAVE_AES_GCM_STITCH)
> +  size_t rem_len = 0;
> +
> +  if (length >= 128) {
> +int rounds = 0;
> +if (f == (nettle_cipher_func *) aes128_encrypt) {
> +  rounds = _AES128_ROUNDS;
> +} else if (f == (nettle_cipher_func *) aes192_encrypt) {
> +  rounds = _AES192_ROUNDS;
> +} else if (f == (nettle_cipher_func *) aes256_encrypt) {
> +  rounds = _AES256_ROUNDS;
> +}
> +if (rounds) {
> +  struct gcm_aes_context c;
> +  get_ctx(&c, ctx, key, cipher);
> +  _nettle_ppc_gcm_aes_encrypt_ppc64(&c, rounds, ctx->ctr.b, length, dst,
> src);

I think this is the wrong place for this dispatch; it should go
in gcm-aes128.c, gcm-aes192.c, etc.
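
To make the suggestion concrete, here is a rough sketch of how the dispatch
could sit in gcm-aes128.c, so that gcm_encrypt itself stays free of
cipher-specific checks. The _gcm_aes_encrypt name, its argument list, and the
HAVE_NATIVE_gcm_aes_encrypt define are assumptions for illustration only, not
existing interfaces:

#if HAVE_NATIVE_gcm_aes_encrypt
/* Hypothetical assembly entry point: encrypts and ghashes some number of
   whole blocks (e.g., in groups of 8) and returns how many blocks it did. */
size_t
_gcm_aes_encrypt (struct gcm_key *key, union nettle_block16 *x,
                  union nettle_block16 *ctr, unsigned rounds,
                  const uint32_t *subkeys, size_t blocks,
                  uint8_t *dst, const uint8_t *src);
#endif

void
gcm_aes128_encrypt (struct gcm_aes128_ctx *ctx,
                    size_t length, uint8_t *dst, const uint8_t *src)
{
#if HAVE_NATIVE_gcm_aes_encrypt
  /* Hand whole blocks to the stitched routine, then fall through. */
  size_t done = GCM_BLOCK_SIZE
    * _gcm_aes_encrypt (&ctx->key, &ctx->gcm.x, &ctx->gcm.ctr,
                        _AES128_ROUNDS, ctx->cipher.keys,
                        length / GCM_BLOCK_SIZE, dst, src);
  ctx->gcm.data_size += done;
  length -= done;
  dst += done;
  src += done;
#endif
  /* Whatever remains (possibly everything) goes through the generic path. */
  GCM_ENCRYPT (ctx, aes128_encrypt, length, dst, src);
}

With that arrangement, gcm-aes192.c and gcm-aes256.c would do the same with
their own round counts, and the function-pointer comparisons in gcm_encrypt
go away.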

> --- a/powerpc64/p8/aes-encrypt-internal.asm
> +++ b/powerpc64/p8/aes-encrypt-internal.asm
> @@ -52,6 +52,16 @@ define(`S5', `v7')
>  define(`S6', `v8')
>  define(`S7', `v9')
>  
> +C re-define SRC if from _gcm_aes
> +define(`S10', `v10')
> +define(`S11', `v11')
> +define(`S12', `v12')
> +define(`S13', `v13')
> +define(`S14', `v14')
> +define(`S15', `v15')
> +define(`S16', `v16')
> +define(`S17', `v17')
> +
>  .file "aes-encrypt-internal.asm"
>  
>  .text
> @@ -66,6 +76,10 @@ PROLOGUE(_nettle_aes_encrypt)
>   DATA_LOAD_VEC(SWAP_MASK,.swap_mask,r5)
>  
>   subi ROUNDS,ROUNDS,1
> +
> + cmpdi r23, 0x5f C call from _gcm_aes
> + beq Lx8_loop
> +
>   srdi LENGTH,LENGTH,4
>  
>   srdi r5,LENGTH,3 #8x loop count
> @@ -93,6 +107,9 @@ Lx8_loop:
>   lxvd2x VSR(K),0,KEYS
>   vperm   K,K,K,SWAP_MASK
>  
> + cmpdi r23, 0x5f
> + beq Skip_load

It's a little messy to have branches depending on a special register set
by some callers. I think it would be simpler to either move the round
loop (i.e., the loop starting at the L8x_round_loop: label) into a
subroutine with all-register arguments, and call that from both
_nettle_aes_encrypt and _nettle_gcm_aes_encrypt, or define an m4 macro
expanding to the body of that loop, and use that macro in both places.

> --- /dev/null
> +++ b/powerpc64/p8/gcm-aes-decrypt.asm
> @@ -0,0 +1,425 @@
> +C powerpc64/p8/gcm-aes-decrypt.asm

> +.macro SAVE_REGS
> + mflr 0
> + std 0,16(1)
> + stdu  SP,-464(SP)

If macros are needed, please use m4 macros, like other nettle assembly code.

> +.align 5
> +Loop8x_de:
[...]
> +bl _nettle_aes_encrypt_ppc64

I suspect this reference will break in non-fat builds?

> +nop
> +
> +C do two 4x ghash
[...]
> +bl _nettle_gf_mul_4x_ppc64
> +nop
> +
> +bl _nettle_gf_mul_4x_ppc64
> +nop

So the body of the main loop is one subroutine call to do 8 aes blocks,
and two subroutine calls to do the corresponding ghash. I had expected some
more instruction-level interleaving of the two operations; do you think
that could be beneficial, or is the out-of-order machinery so powerful that
instruction scheduling is not so important?

I think this could be simpler if you define subroutines (or maybe
macros) tailored for use from this loop, which can then be reused by the
code that does aes and ghash separately.

I would also be curious if you get something noticeably slower if you do
only 4 blocks per loop (but if the bottleneck is the dependencies in the
aes loop, it may be that doing 8 blocks is important also in this
setting).

For the interface between C and assembly, one could consider an
interface that can be passed an arbitrary number of blocks, similar to
_ghash_update. If it's too much complexity to actually do an arbitrary
number of blocks, it could return the number of blocks done, and leave it
to the caller (the C code) to handle the left-over.
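
For example (a sketch only, reusing the context setup from the patch; the
modified argument list and return value are assumptions, not the patch's
actual interface), if the assembly routine took a block count and reported
how much it actually did, the C caller could look like:

  struct gcm_aes_context c;
  get_ctx (&c, ctx, key, cipher);

  /* Hypothetical variant of the patch's entry point: processes 0 or more
     whole blocks (say, multiples of 8) and returns the blocks done. */
  size_t done = _nettle_ppc_gcm_aes_encrypt_ppc64 (&c, rounds, ctx->ctr.b,
                                                   length / GCM_BLOCK_SIZE,
                                                   dst, src);
  done *= GCM_BLOCK_SIZE;
  length -= done;
  dst += done;
  src += done;
  /* Any left-over bytes (possibly all of them, if the routine declined)
     fall through to the existing generic per-block loop. */

That would keep the assembly free to pick its preferred unrolling (8 blocks,
4 blocks, or nothing at all) without the C side having to know about it.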

> --- a/powerpc64/p8/ghash-update.asm
> +++ b/powerpc64/p8/ghash-update.asm
> @@ -281,6 +281,48 @@ IF_LE(`
>  blr
>  EPILOGUE(_nettle_ghash_update)
>  
> +C
> +C GCM multiplication and reduction
> +C   All inputs depend on definitions
> +C
> +C 

Fw: ppc64: AES/GCM Performance improvement with stitched implementation

2023-11-21 Thread Danny Tsen



To Whom It May Concern,

This patch provides a performance improvement for AES/GCM with a stitched
implementation for ppc64.  The code is an assembly wrapper that handles
multiples of 8 blocks and both big- and little-endian.

The overall improvement, measured with nettle-benchmark, is ~80% for AES/GCM
encrypt and ~86% for decrypt over the current baseline.  The benchmark was run
on a P10 machine with a 3.896 GHz CPU.

Please find the attached patch and benchmarks.

Thanks.
-Danny