Re: ppc64: AES/GCM Performance improvement with stitched implementation
On Wed, Nov 22, 2023 at 1:50 PM Niels Möller wrote:
> David Edelsohn writes:
>
> > Calls impose a lot of overhead on Power.
>
> Thanks, that's good to know.
>
> > And both the efficient loop instruction and the preferred indirect call
> > instruction use the CTR register.
>
> That's one thing I wonder after having a closer look at the AES loops.
>
> One rather common pattern in GMP and Nettle assembly loops is to use
> the same register as both index register and loop counter. A loop that
> in C would conventionally be written as
>
>   for (i = 0; i < n; i++)
>     dst[i] = f(src[i]);
>
> is written in assembly closer to
>
>   dst += n; src += n;  // Base registers point at end of arrays
>   n = -n;              // Use negative index register
>   for (; n != 0; n++)
>     dst[n] = f(src[n]);
>
> This saves one register (and eliminates the corresponding update
> instructions), and the loop branch is based on the carry flag (or zero
> flag) from the index register update n++. (If the items processed by
> the loop are larger than a byte, n would also be scaled by the size,
> and one would do n += size rather than n++, and it still works just
> fine.)
>
> Would that pattern work well on Power, or is it always preferable to
> use the special counter register, e.g., if it provides better branch
> prediction? I'm not so familiar with Power assembly, but from the AES
> code it looks like the relevant instructions are mtctr to initialize
> the counter and bdnz to decrement and branch.

Calls on Power have a high overhead in general, not because of jump or
return prediction, but because of the frame setup and teardown in the
midst of a highly speculating, out-of-order core. One thinks of the
processor executing the program instructions linearly, but in reality
lots of instructions are in flight, with lots of register renaming and
lots of speculation.
The setup and teardown of the frames (saving and restoring registers in
the prologue and epilogue, including the link register) and the
confirmation that the predictions were correct before committing the
results can cause unexpected load and store conflicts in flight.

MTCTR moves a GPR to the count (CTR) register. The CTR register is
optimized for zero-cost countable loops with the bdnz (branch and
decrement counter non-zero), etc. instructions. The CTR register also is
used for indirect calls (mtctr -> bctr, bcctr - branch to counter,
branch conditional to counter). For indirect branches, one also can
branch indirectly through the link register (mtlr -> blr), but that can
corrupt the link stack internal to the processor that is used to predict
return addresses. So one mainly has the CTR register for both loops and
indirect calls. However, if one uses the count register for an indirect
call, for all practical purposes it is not available as the count
register for the loop -- spilling and restoring the count register
introduces too many stalls.

A call inside a loop is bad. An indirect call inside a loop is doubly
bad, because of the call itself and because it prevents the loop from
utilizing the optimal count register idiom.

Thanks, David

___
nettle-bugs mailing list -- nettle-bugs@lists.lysator.liu.se
To unsubscribe send an email to nettle-bugs-le...@lists.lysator.liu.se
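[For readers unfamiliar with the idiom: a minimal sketch of the CTR
countable loop David describes, in nettle's ppc64 assembly style. The
register choice and label are hypothetical, not code from the patch.]

```asm
C Hypothetical ppc64 countable loop using the CTR register (sketch).
C mtctr loads the iteration count from a GPR into CTR; bdnz then
C decrements CTR and branches while it is non-zero, so the loop
C needs no separate compare or counter-update instructions.
	mtctr	r5		C r5 = number of blocks to process
.Lblock_loop:
	C ... per-block work; no calls here, so CTR stays live ...
	bdnz	.Lblock_loop	C decrement CTR, branch if non-zero
```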
Re: ppc64: AES/GCM Performance improvement with stitched implementation
David Edelsohn writes:

> Calls impose a lot of overhead on Power.

Thanks, that's good to know.

> And both the efficient loop instruction and the preferred indirect call
> instruction use the CTR register.

That's one thing I wonder after having a closer look at the AES loops.

One rather common pattern in GMP and Nettle assembly loops is to use the
same register as both index register and loop counter. A loop that in C
would conventionally be written as

  for (i = 0; i < n; i++)
    dst[i] = f(src[i]);

is written in assembly closer to

  dst += n; src += n;  // Base registers point at end of arrays
  n = -n;              // Use negative index register
  for (; n != 0; n++)
    dst[n] = f(src[n]);

This saves one register (and eliminates the corresponding update
instructions), and the loop branch is based on the carry flag (or zero
flag) from the index register update n++. (If the items processed by the
loop are larger than a byte, n would also be scaled by the size, and one
would do n += size rather than n++, and it still works just fine.)

Would that pattern work well on Power, or is it always preferable to use
the special counter register, e.g., if it provides better branch
prediction? I'm not so familiar with Power assembly, but from the AES
code it looks like the relevant instructions are mtctr to initialize the
counter and bdnz to decrement and branch.

Regards,
/Niels

--
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.
Re: ppc64: AES/GCM Performance improvement with stitched implementation
On Wed, Nov 22, 2023 at 10:37 AM Danny Tsen wrote:
>
> > On Nov 22, 2023, at 2:27 AM, Niels Möller wrote:
> >
> > Danny Tsen writes:
> >
> >> Interleaving at the instructions level may be a good option but due
> >> to PPC instruction pipeline this may need to have sufficient
> >> registers/vectors. Using the same vectors to change contents in
> >> successive instructions may require more cycles. In that case, more
> >> vectors/scalars will get involved and all vector assignments may
> >> have to change. That's the reason I avoided it in this case.
> >
> > To investigate the potential, I would suggest some experiments with
> > software pipelining.
> >
> > Write a loop to do 4 blocks of ctr-aes128 at a time, fully unrolling
> > the round loop. I think that should be 44 instructions of aes
> > mangling, plus instructions to set up the counter input, and do the
> > final xor and endianness things with the message. Arrange so that it
> > loads the AES state in a set of registers we can call A, operating
> > in-place on these registers. But at the end, arrange the XORing so
> > that the final cryptotext is located in a different set of
> > registers, B.
> >
> > Then, write the instructions to do ghash using the B registers as
> > input; I think that should be about 20-25 instructions. Interleave
> > those as well as possible with the AES instructions (say, two aes
> > instructions, one ghash instruction, etc).
> >
> > Software pipelining means that each iteration of the loop does
> > aes-ctr on four blocks, + ghash on the output for the four
> > *previous* blocks (so one needs extra code outside of the loop to
> > deal with the first and last 4 blocks). Decrypt processing should be
> > simpler.
> >
> > Then you can benchmark that loop in isolation. It doesn't need to be
> > the complete function, the handling of first and last blocks can be
> > omitted, and it doesn't even have to be completely correct, as long
> > as it's the right instruction mix and the right data dependencies.
> > The benchmark should give a good idea for the potential speedup, if
> > any, from instruction-level interleaving.
>
> This is a very ideal condition. Too much interleaving may not produce
> the best results, and different architectures may have different
> results. I had tried various ways when I implemented the AES/GCM
> stitching functions for OpenSSL. I'll give it a try since your ghash
> function is different.
>
> > I would hope 4-way is doable with available vector registers (and
> > this inner loop should be less than 100 instructions, so not too
> > unmanageable). Going up to 8-way (like the current AES code) would
> > also be interesting, but as you say, you might have a shortage of
> > registers. If you have to copy state between registers and memory in
> > each iteration of an 8-way loop (which it looks like you also have
> > to do in your current patch), that overhead cost may outweigh the
> > gains you have from more independence in the AES rounds.
>
> 4x unrolling may not produce the best performance. I did that when I
> implemented this stitching function in OpenSSL, and it's in one
> assembly file with no function calls outside the function. Once again,
> calling a function within a loop introduces a lot of overhead. Here
> are my past results for your reference. The first one is the original
> performance from OpenSSL, the second one was the 4x unrolling, and the
> third one was the 8x. But I can try again.
>
> (This was run on a p10 with 3.5 GHz machine)
>
> AES-128-GCM 382128.50k 1023073.64k 2621489.41k 3604979.37k 4018642.94k 4032080.55k
> AES-128-GCM 347370.13k 1236054.06k 2778748.59k 3900567.21k 4527158.61k 4579759.45k (4x AES and 4x ghash)
> AES-128-GCM 356520.19k 989983.06k 2902907.56k 4379016.19k 5180981.25k 5249717.59k (8x AES and 2 4x ghash combined)

Calls impose a lot of overhead on Power. And both the efficient loop
instruction and the preferred indirect call instruction use the CTR
register.

Thanks, David

> Thanks.
> -Danny
>
> > Regards,
> > /Niels
> >
> > --
> > Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
> > Internet email is subject to wholesale government surveillance.
Re: ppc64: AES/GCM Performance improvement with stitched implementation
> On Nov 22, 2023, at 2:27 AM, Niels Möller wrote:
>
> Danny Tsen writes:
>
>> Interleaving at the instructions level may be a good option but due to
>> PPC instruction pipeline this may need to have sufficient
>> registers/vectors. Using the same vectors to change contents in
>> successive instructions may require more cycles. In that case, more
>> vectors/scalars will get involved and all vector assignments may have
>> to change. That's the reason I avoided it in this case.
>
> To investigate the potential, I would suggest some experiments with
> software pipelining.
>
> Write a loop to do 4 blocks of ctr-aes128 at a time, fully unrolling
> the round loop. I think that should be 44 instructions of aes mangling,
> plus instructions to set up the counter input, and do the final xor and
> endianness things with the message. Arrange so that it loads the AES
> state in a set of registers we can call A, operating in-place on these
> registers. But at the end, arrange the XORing so that the final
> cryptotext is located in a different set of registers, B.
>
> Then, write the instructions to do ghash using the B registers as
> input; I think that should be about 20-25 instructions. Interleave
> those as well as possible with the AES instructions (say, two aes
> instructions, one ghash instruction, etc).
>
> Software pipelining means that each iteration of the loop does aes-ctr
> on four blocks, + ghash on the output for the four *previous* blocks
> (so one needs extra code outside of the loop to deal with the first and
> last 4 blocks). Decrypt processing should be simpler.
>
> Then you can benchmark that loop in isolation. It doesn't need to be
> the complete function, the handling of first and last blocks can be
> omitted, and it doesn't even have to be completely correct, as long as
> it's the right instruction mix and the right data dependencies. The
> benchmark should give a good idea for the potential speedup, if any,
> from instruction-level interleaving.
This is a very ideal condition. Too much interleaving may not produce
the best results, and different architectures may have different
results. I had tried various ways when I implemented the AES/GCM
stitching functions for OpenSSL. I'll give it a try since your ghash
function is different.

> I would hope 4-way is doable with available vector registers (and this
> inner loop should be less than 100 instructions, so not too
> unmanageable). Going up to 8-way (like the current AES code) would also
> be interesting, but as you say, you might have a shortage of registers.
> If you have to copy state between registers and memory in each
> iteration of an 8-way loop (which it looks like you also have to do in
> your current patch), that overhead cost may outweigh the gains you have
> from more independence in the AES rounds.

4x unrolling may not produce the best performance. I did that when I
implemented this stitching function in OpenSSL, and it's in one assembly
file with no function calls outside the function. Once again, calling a
function within a loop introduces a lot of overhead. Here are my past
results for your reference. The first one is the original performance
from OpenSSL, the second one was the 4x unrolling, and the third one was
the 8x. But I can try again.

(This was run on a p10 with 3.5 GHz machine)

AES-128-GCM 382128.50k 1023073.64k 2621489.41k 3604979.37k 4018642.94k 4032080.55k
AES-128-GCM 347370.13k 1236054.06k 2778748.59k 3900567.21k 4527158.61k 4579759.45k (4x AES and 4x ghash)
AES-128-GCM 356520.19k 989983.06k 2902907.56k 4379016.19k 5180981.25k 5249717.59k (8x AES and 2 4x ghash combined)

Thanks.
-Danny

> Regards,
> /Niels
>
> --
> Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
> Internet email is subject to wholesale government surveillance.
Re: ppc64: AES/GCM Performance improvement with stitched implementation
Danny Tsen writes:

> Interleaving at the instructions level may be a good option but due to
> PPC instruction pipeline this may need to have sufficient
> registers/vectors. Using the same vectors to change contents in
> successive instructions may require more cycles. In that case, more
> vectors/scalars will get involved and all vector assignments may have
> to change. That's the reason I avoided it in this case.

To investigate the potential, I would suggest some experiments with
software pipelining.

Write a loop to do 4 blocks of ctr-aes128 at a time, fully unrolling the
round loop. I think that should be 44 instructions of aes mangling, plus
instructions to set up the counter input, and do the final xor and
endianness things with the message. Arrange so that it loads the AES
state in a set of registers we can call A, operating in-place on these
registers. But at the end, arrange the XORing so that the final
cryptotext is located in a different set of registers, B.

Then, write the instructions to do ghash using the B registers as input;
I think that should be about 20-25 instructions. Interleave those as
well as possible with the AES instructions (say, two aes instructions,
one ghash instruction, etc).

Software pipelining means that each iteration of the loop does aes-ctr
on four blocks, + ghash on the output for the four *previous* blocks (so
one needs extra code outside of the loop to deal with the first and last
4 blocks). Decrypt processing should be simpler.

Then you can benchmark that loop in isolation. It doesn't need to be the
complete function, the handling of first and last blocks can be omitted,
and it doesn't even have to be completely correct, as long as it's the
right instruction mix and the right data dependencies. The benchmark
should give a good idea for the potential speedup, if any, from
instruction-level interleaving.
I would hope 4-way is doable with available vector registers (and this
inner loop should be less than 100 instructions, so not too
unmanageable). Going up to 8-way (like the current AES code) would also
be interesting, but as you say, you might have a shortage of registers.
If you have to copy state between registers and memory in each iteration
of an 8-way loop (which it looks like you also have to do in your
current patch), that overhead cost may outweigh the gains you have from
more independence in the AES rounds.

Regards,
/Niels

--
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.
Re: ppc64: AES/GCM Performance improvement with stitched implementation
Hi Niels,

More comments. Please see inline.

> On Nov 21, 2023, at 1:46 PM, Danny Tsen wrote:
>
> Hi Niels,
>
> Thanks for the quick response.
>
> I'll think more thru your comments here and it may take some more time
> to get an update. And just a quick answer to 4 of your questions.
>
> 1. Depends on some special registers from the caller. This is so that I
>    don't need to change the registers used in the aes_internal_encrypt
>    and gf_mul_4x functions. This is a way to minimize too much change
>    in the existing code. But I can change that for sure. An m4 macro
>    could be helpful here.
> 2. The reason to use gcm_encrypt is to minimize duplicate code in
>    gcm_aes128..., but I can change that.
> 3. Yes, 4x blocks won't provide the same performance as 8x.
> 4. Yes, a function call did introduce quite a lot of overhead in a
>    loop. We can call gf_mul_4x from _ghash_update, but the stack
>    handling has to be changed, and I tried not to change anything in
>    _ghash_update since my code doesn't call _ghash_update. But I guess
>    I can use an m4 macro instead.
>
> Thanks.
> -Danny
>
> From: Niels Möller
> Sent: Tuesday, November 21, 2023 1:07 PM
> To: Danny Tsen
> Cc: nettle-bugs@lists.lysator.liu.se; George Wilson
> Subject: [EXTERNAL] Re: Fw: ppc64: AES/GCM Performance improvement
> with stitched implementation
>
> Danny Tsen writes:
>
>> This patch provides a performance improvement over AES/GCM with
>> stitched implementation for ppc64. The code is a wrapper in assembly
>> to handle multiple 8 blocks and handle big and little endian.
>>
>> The overall improvement is based on the nettle-benchmark with ~80%
>> improvement for AES/GCM encrypt and ~86% improvement for decrypt over
>> the current baseline. The benchmark was run on a P10 machine with
>> 3.896GHz CPU.
>
> That's a pretty nice performance improvement. A first round of comments
> below, mainly structural.
> (And I think attachments didn't make it to the list, possibly because
> some of them had Content-type: application/octet-stream rather than
> text/plain.)
>
>> +#if defined(__powerpc64__) || defined(__powerpc__)
>> +#define HAVE_AES_GCM_STITCH 1
>> +#endif
>
> If the C code needs to know about optional assembly functions, the
> HAVE_NATIVE tests are intended for that.
>
>> void
>> gcm_encrypt (struct gcm_ctx *ctx, const struct gcm_key *key,
>>    const void *cipher, nettle_cipher_func *f,
>> @@ -209,6 +228,35 @@ gcm_encrypt (struct gcm_ctx *ctx, const struct gcm_key *key,
>> {
>>   assert(ctx->data_size % GCM_BLOCK_SIZE == 0);
>>
>> +#if defined(HAVE_AES_GCM_STITCH)
>> +  size_t rem_len = 0;
>> +
>> +  if (length >= 128) {
>> +    int rounds = 0;
>> +    if (f == (nettle_cipher_func *) aes128_encrypt) {
>> +      rounds = _AES128_ROUNDS;
>> +    } else if (f == (nettle_cipher_func *) aes192_encrypt) {
>> +      rounds = _AES192_ROUNDS;
>> +    } else if (f == (nettle_cipher_func *) aes256_encrypt) {
>> +      rounds = _AES256_ROUNDS;
>> +    }
>> +    if (rounds) {
>> +      struct gcm_aes_context c;
>> +      get_ctx(&c, ctx, key, cipher);
>> +      _nettle_ppc_gcm_aes_encrypt_ppc64(&c, rounds, ctx->ctr.b, length, dst, src);
>
> I think this is the wrong place for this dispatch; I think it should go
> in gcm-aes128.c, gcm-aes192.c, etc.
>> --- a/powerpc64/p8/aes-encrypt-internal.asm
>> +++ b/powerpc64/p8/aes-encrypt-internal.asm
>> @@ -52,6 +52,16 @@ define(`S5', `v7')
>> define(`S6', `v8')
>> define(`S7', `v9')
>>
>> +C re-define SRC if from _gcm_aes
>> +define(`S10', `v10')
>> +define(`S11', `v11')
>> +define(`S12', `v12')
>> +define(`S13', `v13')
>> +define(`S14', `v14')
>> +define(`S15', `v15')
>> +define(`S16', `v16')
>> +define(`S17', `v17')
>> +
>> .file "aes-encrypt-internal.asm"
>>
>> .text
>> @@ -66,6 +76,10 @@ PROLOGUE(_nettle_aes_encrypt)
>> DATA_LOAD_VEC(SWAP_MASK,.swap_mask,r5)
>>
>> subi ROUNDS,ROUNDS,1
>> +
>> + cmpdi r23, 0x5f C call from _gcm_aes
>> + beq Lx8_loop
>> +
>> srdi LENGTH,LENGTH,4
>>
>> srdi r5,LENGTH,3 #8x loop count
>> @@ -93,6 +107,9 @@ Lx8_loop:
>> lxvd2x VSR(K),0,KEYS
>> vperm K,K,K,SWAP_MASK
>>
>> + cmpdi r23, 0x5f
>> + beq Skip_load
>
> It's a little messy to have branches depending on a special register
> set by some callers. I think it would be simpler to either move the
> round loop (i.e., the loop with the label L8x_round_loop:) into a
> subroutine with all-register arguments, and call that from both
> _nettle_aes_encrypt and _nettle_gcm_aes_encrypt, or define an m4 macro
> expanding to the body of that loop and use that macro in both places.
RE: Fw: ppc64: AES/GCM Performance improvement with stitched implementation
Hi Niels,

Thanks for the quick response.

I'll think more thru your comments here and it may take some more time
to get an update. And just a quick answer to 4 of your questions.

1. Depends on some special registers from the caller. This is so that I
   don't need to change the registers used in the aes_internal_encrypt
   and gf_mul_4x functions. This is a way to minimize too much change in
   the existing code. But I can change that for sure. An m4 macro could
   be helpful here.
2. The reason to use gcm_encrypt is to minimize duplicate code in
   gcm_aes128..., but I can change that.
3. Yes, 4x blocks won't provide the same performance as 8x.
4. Yes, a function call did introduce quite a lot of overhead in a loop.
   We can call gf_mul_4x from _ghash_update, but the stack handling has
   to be changed, and I tried not to change anything in _ghash_update
   since my code doesn't call _ghash_update. But I guess I can use an m4
   macro instead.

Thanks.
-Danny

From: Niels Möller
Sent: Tuesday, November 21, 2023 1:07 PM
To: Danny Tsen
Cc: nettle-bugs@lists.lysator.liu.se; George Wilson
Subject: [EXTERNAL] Re: Fw: ppc64: AES/GCM Performance improvement with
stitched implementation

Danny Tsen writes:

> This patch provides a performance improvement over AES/GCM with
> stitched implementation for ppc64. The code is a wrapper in assembly to
> handle multiple 8 blocks and handle big and little endian.
>
> The overall improvement is based on the nettle-benchmark with ~80%
> improvement for AES/GCM encrypt and ~86% improvement for decrypt over
> the current baseline. The benchmark was run on a P10 machine with
> 3.896GHz CPU.

That's a pretty nice performance improvement. A first round of comments
below, mainly structural.

(And I think attachments didn't make it to the list, possibly because
some of them had Content-type: application/octet-stream rather than
text/plain.)
> +#if defined(__powerpc64__) || defined(__powerpc__)
> +#define HAVE_AES_GCM_STITCH 1
> +#endif

If the C code needs to know about optional assembly functions, the
HAVE_NATIVE tests are intended for that.

> void
> gcm_encrypt (struct gcm_ctx *ctx, const struct gcm_key *key,
>   const void *cipher, nettle_cipher_func *f,
> @@ -209,6 +228,35 @@ gcm_encrypt (struct gcm_ctx *ctx, const struct gcm_key *key,
> {
>   assert(ctx->data_size % GCM_BLOCK_SIZE == 0);
>
> +#if defined(HAVE_AES_GCM_STITCH)
> +  size_t rem_len = 0;
> +
> +  if (length >= 128) {
> +    int rounds = 0;
> +    if (f == (nettle_cipher_func *) aes128_encrypt) {
> +      rounds = _AES128_ROUNDS;
> +    } else if (f == (nettle_cipher_func *) aes192_encrypt) {
> +      rounds = _AES192_ROUNDS;
> +    } else if (f == (nettle_cipher_func *) aes256_encrypt) {
> +      rounds = _AES256_ROUNDS;
> +    }
> +    if (rounds) {
> +      struct gcm_aes_context c;
> +      get_ctx(&c, ctx, key, cipher);
> +      _nettle_ppc_gcm_aes_encrypt_ppc64(&c, rounds, ctx->ctr.b, length, dst, src);

I think this is the wrong place for this dispatch; I think it should go
in gcm-aes128.c, gcm-aes192.c, etc.
> --- a/powerpc64/p8/aes-encrypt-internal.asm
> +++ b/powerpc64/p8/aes-encrypt-internal.asm
> @@ -52,6 +52,16 @@ define(`S5', `v7')
> define(`S6', `v8')
> define(`S7', `v9')
>
> +C re-define SRC if from _gcm_aes
> +define(`S10', `v10')
> +define(`S11', `v11')
> +define(`S12', `v12')
> +define(`S13', `v13')
> +define(`S14', `v14')
> +define(`S15', `v15')
> +define(`S16', `v16')
> +define(`S17', `v17')
> +
> .file "aes-encrypt-internal.asm"
>
> .text
> @@ -66,6 +76,10 @@ PROLOGUE(_nettle_aes_encrypt)
> DATA_LOAD_VEC(SWAP_MASK,.swap_mask,r5)
>
> subi ROUNDS,ROUNDS,1
> +
> + cmpdi r23, 0x5f C call from _gcm_aes
> + beq Lx8_loop
> +
> srdi LENGTH,LENGTH,4
>
> srdi r5,LENGTH,3 #8x loop count
> @@ -93,6 +107,9 @@ Lx8_loop:
> lxvd2x VSR(K),0,KEYS
> vperm K,K,K,SWAP_MASK
>
> + cmpdi r23, 0x5f
> + beq Skip_load

It's a little messy to have branches depending on a special register set
by some callers. I think it would be simpler to either move the round
loop (i.e., the loop with the label L8x_round_loop:) into a subroutine
with all-register arguments, and call that from both _nettle_aes_encrypt
and _nettle_gcm_aes_encrypt, or define an m4 macro expanding to the body
of that loop and use that macro in both places.

> --- /dev/null
> +++ b/powerpc64/p8/gcm-aes-decrypt.asm
> @@ -0,0 +1,425 @@
> +C powerpc64/p8/gcm-aes-decrypt.asm
> +.macro SAVE_REGS
> + mflr 0
> + std 0,16(1)
> + stdu SP,-464(SP)

If macros are needed, please use m4 macros, like other nettle assembly
code.

> +.align 5
> +Loop8x_de:
[...]
> +bl _nettle_aes_encrypt_ppc64

I suspect this reference will break in non-fat builds?
Re: Fw: ppc64: AES/GCM Performance improvement with stitched implementation
Danny Tsen writes:

> This patch provides a performance improvement over AES/GCM with
> stitched implementation for ppc64. The code is a wrapper in assembly to
> handle multiple 8 blocks and handle big and little endian.
>
> The overall improvement is based on the nettle-benchmark with ~80%
> improvement for AES/GCM encrypt and ~86% improvement for decrypt over
> the current baseline. The benchmark was run on a P10 machine with
> 3.896GHz CPU.

That's a pretty nice performance improvement. A first round of comments
below, mainly structural.

(And I think attachments didn't make it to the list, possibly because
some of them had Content-type: application/octet-stream rather than
text/plain.)

> +#if defined(__powerpc64__) || defined(__powerpc__)
> +#define HAVE_AES_GCM_STITCH 1
> +#endif

If the C code needs to know about optional assembly functions, the
HAVE_NATIVE tests are intended for that.

> void
> gcm_encrypt (struct gcm_ctx *ctx, const struct gcm_key *key,
>   const void *cipher, nettle_cipher_func *f,
> @@ -209,6 +228,35 @@ gcm_encrypt (struct gcm_ctx *ctx, const struct gcm_key *key,
> {
>   assert(ctx->data_size % GCM_BLOCK_SIZE == 0);
>
> +#if defined(HAVE_AES_GCM_STITCH)
> +  size_t rem_len = 0;
> +
> +  if (length >= 128) {
> +    int rounds = 0;
> +    if (f == (nettle_cipher_func *) aes128_encrypt) {
> +      rounds = _AES128_ROUNDS;
> +    } else if (f == (nettle_cipher_func *) aes192_encrypt) {
> +      rounds = _AES192_ROUNDS;
> +    } else if (f == (nettle_cipher_func *) aes256_encrypt) {
> +      rounds = _AES256_ROUNDS;
> +    }
> +    if (rounds) {
> +      struct gcm_aes_context c;
> +      get_ctx(&c, ctx, key, cipher);
> +      _nettle_ppc_gcm_aes_encrypt_ppc64(&c, rounds, ctx->ctr.b, length, dst, src);

I think this is the wrong place for this dispatch; I think it should go
in gcm-aes128.c, gcm-aes192.c, etc.
> --- a/powerpc64/p8/aes-encrypt-internal.asm
> +++ b/powerpc64/p8/aes-encrypt-internal.asm
> @@ -52,6 +52,16 @@ define(`S5', `v7')
> define(`S6', `v8')
> define(`S7', `v9')
>
> +C re-define SRC if from _gcm_aes
> +define(`S10', `v10')
> +define(`S11', `v11')
> +define(`S12', `v12')
> +define(`S13', `v13')
> +define(`S14', `v14')
> +define(`S15', `v15')
> +define(`S16', `v16')
> +define(`S17', `v17')
> +
> .file "aes-encrypt-internal.asm"
>
> .text
> @@ -66,6 +76,10 @@ PROLOGUE(_nettle_aes_encrypt)
> DATA_LOAD_VEC(SWAP_MASK,.swap_mask,r5)
>
> subi ROUNDS,ROUNDS,1
> +
> + cmpdi r23, 0x5f C call from _gcm_aes
> + beq Lx8_loop
> +
> srdi LENGTH,LENGTH,4
>
> srdi r5,LENGTH,3 #8x loop count
> @@ -93,6 +107,9 @@ Lx8_loop:
> lxvd2x VSR(K),0,KEYS
> vperm K,K,K,SWAP_MASK
>
> + cmpdi r23, 0x5f
> + beq Skip_load

It's a little messy to have branches depending on a special register set
by some callers. I think it would be simpler to either move the round
loop (i.e., the loop with the label L8x_round_loop:) into a subroutine
with all-register arguments, and call that from both _nettle_aes_encrypt
and _nettle_gcm_aes_encrypt, or define an m4 macro expanding to the body
of that loop and use that macro in both places.

> --- /dev/null
> +++ b/powerpc64/p8/gcm-aes-decrypt.asm
> @@ -0,0 +1,425 @@
> +C powerpc64/p8/gcm-aes-decrypt.asm
> +.macro SAVE_REGS
> + mflr 0
> + std 0,16(1)
> + stdu SP,-464(SP)

If macros are needed, please use m4 macros, like other nettle assembly
code.

> +.align 5
> +Loop8x_de:
[...]
> +bl _nettle_aes_encrypt_ppc64

I suspect this reference will break in non-fat builds?

> +nop
> +
> +C do two 4x ghash
[...]
> +bl _nettle_gf_mul_4x_ppc64
> +nop
> +
> +bl _nettle_gf_mul_4x_ppc64
> +nop

So the body of the main loop is one subroutine call to do 8 aes blocks,
and two subroutine calls to do the corresponding ghash.
I had expected some more instruction-level interleaving of the two
operations. Do you think that could be beneficial, or is the
out-of-order machinery so powerful that instruction scheduling is not so
important?

I think this could be simpler if you define subroutines (or maybe
macros) tailored for use from this loop, which can be reused by the code
to do aes and ghash separately.

I would also be curious if you get something noticeably slower if you do
only 4 blocks per loop (but if the bottleneck is the dependencies in the
aes loop, it may be that doing 8 blocks is important also in this
setting).

For the interface between C and assembly, one could consider an
interface that can be passed an arbitrary number of blocks, similar to
_ghash_update. If it's too much complexity to actually do an arbitrary
number of blocks, it could return the number of blocks done, and leave
it to the caller (the C code) to handle the left-over.

> --- a/powerpc64/p8/ghash-update.asm
> +++ b/powerpc64/p8/ghash-update.asm
> @@ -281,6 +281,48 @@ IF_LE(`
> blr
> EPILOGUE(_nettle_ghash_update)
>
> +C
> +C GCM multiplication and reduction
> +C All inputs depend on definitions
> +C
Fw: ppc64: AES/GCM Performance improvement with stitched implementation
To Whom It May Concern,

This patch provides a performance improvement over AES/GCM with stitched
implementation for ppc64. The code is a wrapper in assembly to handle
multiple 8 blocks and handle big and little endian.

The overall improvement is based on the nettle-benchmark with ~80%
improvement for AES/GCM encrypt and ~86% improvement for decrypt over
the current baseline. The benchmark was run on a P10 machine with
3.896GHz CPU.

Please find the attached patch and benchmarks.

Thanks.
-Danny