Re: ppc64: AES/GCM Performance improvement with stitched implementation

2023-11-22 Thread David Edelsohn
On Wed, Nov 22, 2023 at 1:50 PM Niels Möller  wrote:

> David Edelsohn  writes:
>
> > Calls impose a lot of overhead on Power.
>
> Thanks, that's good to know.
>
> > And both the efficient loop instruction and the preferred indirect call
> > instruction use the CTR register.
>
> That's one thing I wonder after having a closer look at the AES loops.
>
> One rather common pattern in GMP and Nettle assembly loops is to use
> the same register as both index register and loop counter. A loop that
> in C would conventionally be written as
>
>   for (i = 0; i < n; i++)
> dst[i] = f(src[i]);
>
> is written in assembly closer to
>
>   dst += n; src += n; // Base registers point at end of arrays
>   n = -n; // Use negative index register
>   for (; n != 0; n++)
> dst[n] = f(src[n]);
>
> This saves one register (and eliminates corresponding update
> instructions), and the loop branch is based on carry flag (or zero flag)
> from the index register update n++. (If the items processed by the loop
> are larger than a byte, n would also be scaled by the size, and one
> would do n += size rather than n++, and it still works just fine).
>
> Would that pattern work well on Power, or is it always preferable to use
> the special counter register, e.g., if it provides better branch
> prediction? I'm not so familiar with Power assembly, but from the AES
> code it looks like the relevant instructions are mtctr to initialize the
> counter, and bdnz to decrement and branch.
>

Calls on Power have a high overhead in general, not because of jump or
return prediction, but because of the frame setup and teardown in the midst
of a highly speculative, out-of-order core.  One thinks of the processor
executing the program instructions linearly, but in reality many
instructions are in flight, with extensive register renaming and
speculation.  The setup and teardown of the frames (saving and restoring
registers in the prologue and epilogue, including the link register) and
confirmation that the predictions were correct before committing the results
can cause unexpected load and store conflicts in flight.

MTCTR moves a GPR to the count (CTR) register.  The CTR register is
optimized for zero-overhead counted loops via bdnz (decrement counter and
branch if non-zero) and related instructions.

The CTR register also is used for indirect calls (mtctr followed by bctr or
bcctr, branch [conditional] to counter).  For indirect branches, one also
can branch through the link register (mtlr -> blr), but that can corrupt
the link stack the processor uses internally to predict return addresses.
So one mainly has the CTR register for both loops and indirect calls.
However, if one uses the count register for an indirect call, for all
practical purposes it is not available as the count register for the
loop -- spilling and restoring the count register introduces too many
stalls.

A call inside a loop is bad.  An indirect call inside a loop is doubly bad
because of the call itself and because it prevents the loop from utilizing
the optimal count register idiom.

Thanks, David
___
nettle-bugs mailing list -- nettle-bugs@lists.lysator.liu.se
To unsubscribe send an email to nettle-bugs-le...@lists.lysator.liu.se


Re: ppc64: AES/GCM Performance improvement with stitched implementation

2023-11-22 Thread Niels Möller
David Edelsohn  writes:

> Calls impose a lot of overhead on Power.

Thanks, that's good to know.

> And both the efficient loop instruction and the preferred indirect call
> instruction use the CTR register.

That's one thing I wonder after having a closer look at the AES loops.

One rather common pattern in GMP and Nettle assembly loops is to use
the same register as both index register and loop counter. A loop that
in C would conventionally be written as

  for (i = 0; i < n; i++)
dst[i] = f(src[i]);

is written in assembly closer to

  dst += n; src += n; // Base registers point at end of arrays
  n = -n; // Use negative index register
  for (; n != 0; n++)
dst[n] = f(src[n]);

This saves one register (and eliminates corresponding update
instructions), and the loop branch is based on carry flag (or zero flag)
from the index register update n++. (If the items processed by the loop
are larger than a byte, n would also be scaled by the size, and one
would do n += size rather than n++, and it still works just fine).

Would that pattern work well on Power, or is it always preferable to use
the special counter register, e.g., if it provides better branch
prediction? I'm not so familiar with Power assembly, but from the AES
code it looks like the relevant instructions are mtctr to initialize the
counter, and bdnz to decrement and branch.

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.


Re: ppc64: AES/GCM Performance improvement with stitched implementation

2023-11-22 Thread David Edelsohn
On Wed, Nov 22, 2023 at 10:37 AM Danny Tsen  wrote:

>
>
> > On Nov 22, 2023, at 2:27 AM, Niels Möller  wrote:
> >
> > Danny Tsen  writes:
> >
> >> Interleaving at the instruction level may be a good option, but due
> >> to the PPC instruction pipeline it needs sufficient
> >> registers/vectors. Reusing the same vectors in successive
> >> instructions may require more cycles. In that case, more
> >> vector/scalar registers get involved and all the vector assignments
> >> may have to change. That’s the reason I avoided it in this case.
> >
> > To investigate the potential, I would suggest some experiments with
> > software pipelining.
> >
> > Write a loop to do 4 blocks of ctr-aes128 at a time, fully unrolling the
> > round loop. I think that should be 44 instructions of aes mangling, plus
> > instructions to setup the counter input, and do the final xor and
> > endianness things with the message. Arrange so that it loads the AES
> > state in a set of registers we can call A, operating in-place on these
> > registers. But at the end, arrange the XORing so that the final
> > cryptotext is located in a different set of registers, B.
> >
> > Then, write the instructions to do ghash using the B registers as input,
> > I think that should be about 20-25 instructions. Interleave those as
> > well as possible with the AES instructions (say, two aes instructions,
> > one ghash instruction, etc).
> >
> > Software pipelining means that each iteration of the loop does aes-ctr
> > on four blocks, + ghash on the output for the four *previous* blocks (so
> > one needs extra code outside of the loop to deal with first and last 4
> > blocks). Decrypt processing should be simpler.
> >
> > Then you can benchmark that loop in isolation. It doesn't need to be the
> > complete function, the handling of first and last blocks can be omitted,
> > and it doesn't even have to be completely correct, as long as it's the
> > right instruction mix and the right data dependencies. The benchmark
> > should give a good idea for the potential speedup, if any, from
> > instruction-level interleaving.
> This is an idealized condition.  Too much interleaving may not produce the
> best results, and different architectures may behave differently.  I tried
> various approaches when I implemented the AES/GCM stitching functions for
> OpenSSL.  I’ll give it a try since your ghash function is different.
>
> >
> > I would hope 4-way is doable with available vector registers (and this
> > inner loop should be less than 100 instructions, so not too
> > unmanageable). Going up to 8-way (like the current AES code) would also
> > be interesting, but as you say, you might have a shortage of registers.
> > If you have to copy state between registers and memory in each iteration
> > of an 8-way loop (which it looks like you also have to do in your
> > current patch), that overhead cost may outweight the gains you have from
> > more independence in the AES rounds.
> 4x unrolling may not produce the best performance.  I did that when I
> implemented this stitching function in OpenSSL; it’s in one assembly
> file with no function calls outside the function.  Once again, calling a
> function within a loop introduces a lot of overhead.  Here are my past
> results for your reference.  The first line is the original OpenSSL
> performance, the second is the 4x unrolling, and the third is the 8x.
> But I can try again.
>
> (This was run on a 3.5 GHz Power10 machine; the six columns are
> throughput at increasing message sizes, as reported by OpenSSL's speed
> benchmark.)
>
> AES-128-GCM  382128.50k  1023073.64k  2621489.41k  3604979.37k  4018642.94k  4032080.55k  (original)
> AES-128-GCM  347370.13k  1236054.06k  2778748.59k  3900567.21k  4527158.61k  4579759.45k  (4x AES and 4x ghash)
> AES-128-GCM  356520.19k   989983.06k  2902907.56k  4379016.19k  5180981.25k  5249717.59k  (8x AES and two 4x ghash combined)
>

Calls impose a lot of overhead on Power.

And both the efficient loop instruction and the preferred indirect call
instruction use the CTR register.

Thanks, David


>
> Thanks.
> -Danny
>
> >
> > Regards,
> > /Niels
> >
> > --
> > Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
> > Internet email is subject to wholesale government surveillance.
>
>


Re: ppc64: AES/GCM Performance improvement with stitched implementation

2023-11-22 Thread Danny Tsen


> On Nov 22, 2023, at 2:27 AM, Niels Möller  wrote:
> 
> Danny Tsen  writes:
> 
>> Interleaving at the instruction level may be a good option, but due
>> to the PPC instruction pipeline it needs sufficient
>> registers/vectors. Reusing the same vectors in successive
>> instructions may require more cycles. In that case, more
>> vector/scalar registers get involved and all the vector assignments
>> may have to change. That’s the reason I avoided it in this case.
> 
> To investigate the potential, I would suggest some experiments with
> software pipelining.
> 
> Write a loop to do 4 blocks of ctr-aes128 at a time, fully unrolling the
> round loop. I think that should be 44 instructions of aes mangling, plus
> instructions to setup the counter input, and do the final xor and
> endianness things with the message. Arrange so that it loads the AES
> state in a set of registers we can call A, operating in-place on these
> registers. But at the end, arrange the XORing so that the final
> cryptotext is located in a different set of registers, B.
> 
> Then, write the instructions to do ghash using the B registers as input,
> I think that should be about 20-25 instructions. Interleave those as
> well as possible with the AES instructions (say, two aes instructions,
> one ghash instruction, etc).
> 
> Software pipelining means that each iteration of the loop does aes-ctr
> on four blocks, + ghash on the output for the four *previous* blocks (so
> one needs extra code outside of the loop to deal with first and last 4
> blocks). Decrypt processing should be simpler.
> 
> Then you can benchmark that loop in isolation. It doesn't need to be the
> complete function, the handling of first and last blocks can be omitted,
> and it doesn't even have to be completely correct, as long as it's the
> right instruction mix and the right data dependencies. The benchmark
> should give a good idea for the potential speedup, if any, from
> instruction-level interleaving.
This is an idealized condition.  Too much interleaving may not produce the
best results, and different architectures may behave differently.  I tried
various approaches when I implemented the AES/GCM stitching functions for
OpenSSL.  I’ll give it a try since your ghash function is different.

> 
> I would hope 4-way is doable with available vector registers (and this
> inner loop should be less than 100 instructions, so not too
> unmanageable). Going up to 8-way (like the current AES code) would also
> be interesting, but as you say, you might have a shortage of registers.
> If you have to copy state between registers and memory in each iteration
> of an 8-way loop (which it looks like you also have to do in your
> current patch), that overhead cost may outweigh the gains you have from
> more independence in the AES rounds.
4x unrolling may not produce the best performance.  I did that when I
implemented this stitching function in OpenSSL; it’s in one assembly file
with no function calls outside the function.  Once again, calling a function
within a loop introduces a lot of overhead.  Here are my past results for
your reference.  The first line is the original OpenSSL performance, the
second is the 4x unrolling, and the third is the 8x.  But I can try again.

(This was run on a 3.5 GHz Power10 machine; the six columns are throughput
at increasing message sizes, as reported by OpenSSL's speed benchmark.)

AES-128-GCM  382128.50k  1023073.64k  2621489.41k  3604979.37k  4018642.94k  4032080.55k  (original)
AES-128-GCM  347370.13k  1236054.06k  2778748.59k  3900567.21k  4527158.61k  4579759.45k  (4x AES and 4x ghash)
AES-128-GCM  356520.19k   989983.06k  2902907.56k  4379016.19k  5180981.25k  5249717.59k  (8x AES and two 4x ghash combined)

Thanks.
-Danny

> 
> Regards,
> /Niels
> 
> -- 
> Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
> Internet email is subject to wholesale government surveillance.



Re: ppc64: AES/GCM Performance improvement with stitched implementation

2023-11-22 Thread Niels Möller
Danny Tsen  writes:

> Interleaving at the instruction level may be a good option, but due
> to the PPC instruction pipeline it needs sufficient
> registers/vectors. Reusing the same vectors in successive
> instructions may require more cycles. In that case, more
> vector/scalar registers get involved and all the vector assignments
> may have to change. That’s the reason I avoided it in this case.

To investigate the potential, I would suggest some experiments with
software pipelining.

Write a loop to do 4 blocks of ctr-aes128 at a time, fully unrolling the
round loop. I think that should be 44 instructions of aes mangling, plus
instructions to setup the counter input, and do the final xor and
endianness things with the message. Arrange so that it loads the AES
state in a set of registers we can call A, operating in-place on these
registers. But at the end, arrange the XORing so that the final
cryptotext is located in a different set of registers, B.

Then, write the instructions to do ghash using the B registers as input,
I think that should be about 20-25 instructions. Interleave those as
well as possible with the AES instructions (say, two aes instructions,
one ghash instruction, etc).

Software pipelining means that each iteration of the loop does aes-ctr
on four blocks, + ghash on the output for the four *previous* blocks (so
one needs extra code outside of the loop to deal with first and last 4
blocks). Decrypt processing should be simpler.

Then you can benchmark that loop in isolation. It doesn't need to be the
complete function, the handling of first and last blocks can be omitted,
and it doesn't even have to be completely correct, as long as it's the
right instruction mix and the right data dependencies. The benchmark
should give a good idea for the potential speedup, if any, from
instruction-level interleaving.

I would hope 4-way is doable with available vector registers (and this
inner loop should be less than 100 instructions, so not too
unmanageable). Going up to 8-way (like the current AES code) would also
be interesting, but as you say, you might have a shortage of registers.
If you have to copy state between registers and memory in each iteration
of an 8-way loop (which it looks like you also have to do in your
current patch), that overhead cost may outweigh the gains you have from
more independence in the AES rounds.

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.