On Fri, 31 Jan 2025 16:48:09 GMT, Jamil Nimeh <jni...@openjdk.org> wrote:

> This enhancement makes a change to the ChaCha20 block function intrinsic on 
> aarch64, moving away from the block parallel implementation and to the 
> quarter-round parallel implementation that was done on x86_64.  Assembly 
> language profiling yielded an 11% improvement in throughput.  When put 
> together as an intrinsic and hooked into the JCE ChaCha20 cipher, the gains 
> are more modest, somewhere in the 2-4% range depending on job size, but still 
> an improvement.

This looks very nice, and I'm tempted to just approve it as it is. My only 
concern is that the algorithm changes aren't really explained, but I guess what 
you have done here is the _128-Bit Vectorization_ in 
`https://eprint.iacr.org/2013/759.pdf`. Is that right?
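A rough model of the quarter-round parallel layout may help here. It is written in plain Java for illustration only, and is not the aarch64 intrinsic or any JDK class; the class and method names are hypothetical.

- Each `int[4]` row stands in for one 128-bit vector register.
- A single `quarterRound` call therefore performs four quarter rounds at once, one per lane.
- Lane rotations of rows b, c and d switch between the column round and the diagonal round.
- A block-parallel layout would instead place the same state word from several independent blocks in one register.

`blockScalar` is a one-quarter-round-at-a-time reference in RFC 7539 order. It is included only to check the lane ordering.

```java
import java.util.Arrays;

/**
 * Hypothetical sketch, not JDK code: ChaCha20 block function using a
 * "quarter-round parallel" layout. The 16-word state is held as four rows of
 * four 32-bit lanes; each row models one 128-bit vector register.
 */
public final class ChaCha20QuarterRoundParallel {

    // Lane-wise x += y, modelling a 4 x 32-bit vector add.
    private static void add(int[] x, int[] y) {
        for (int i = 0; i < 4; i++) x[i] += y[i];
    }

    // Lane-wise x = rotl(x ^ y, n), modelling a vector XOR followed by a rotate.
    private static void xorRotl(int[] x, int[] y, int n) {
        for (int i = 0; i < 4; i++) x[i] = Integer.rotateLeft(x[i] ^ y[i], n);
    }

    // Rotate the lanes of a row left by n positions (a lane shuffle).
    private static int[] shuffle(int[] x, int n) {
        int[] r = new int[4];
        for (int i = 0; i < 4; i++) r[i] = x[(i + n) & 3];
        return r;
    }

    /** One ChaCha quarter round applied to all four lanes at once. */
    public static void quarterRound(int[] a, int[] b, int[] c, int[] d) {
        add(a, b); xorRotl(d, a, 16);
        add(c, d); xorRotl(b, c, 12);
        add(a, b); xorRotl(d, a, 8);
        add(c, d); xorRotl(b, c, 7);
    }

    /** 20-round block function on a 16-word input state, four QRs per call. */
    public static int[] block(int[] state) {
        int[] a = Arrays.copyOfRange(state, 0, 4);
        int[] b = Arrays.copyOfRange(state, 4, 8);
        int[] c = Arrays.copyOfRange(state, 8, 12);
        int[] d = Arrays.copyOfRange(state, 12, 16);
        for (int i = 0; i < 10; i++) {
            quarterRound(a, b, c, d);                                // columns (0,4,8,12) ... (3,7,11,15)
            b = shuffle(b, 1); c = shuffle(c, 2); d = shuffle(d, 3); // line up the diagonals
            quarterRound(a, b, c, d);                                // diagonals (0,5,10,15) ... (3,4,9,14)
            b = shuffle(b, 3); c = shuffle(c, 2); d = shuffle(d, 1); // restore column order
        }
        int[] out = new int[16];
        for (int i = 0; i < 4; i++) {
            out[i] = a[i] + state[i];
            out[4 + i] = b[i] + state[4 + i];
            out[8 + i] = c[i] + state[8 + i];
            out[12 + i] = d[i] + state[12 + i];
        }
        return out;
    }

    // Scalar quarter round on state words a, b, c, d (RFC 7539, section 2.1).
    private static void qr(int[] x, int a, int b, int c, int d) {
        x[a] += x[b]; x[d] = Integer.rotateLeft(x[d] ^ x[a], 16);
        x[c] += x[d]; x[b] = Integer.rotateLeft(x[b] ^ x[c], 12);
        x[a] += x[b]; x[d] = Integer.rotateLeft(x[d] ^ x[a], 8);
        x[c] += x[d]; x[b] = Integer.rotateLeft(x[b] ^ x[c], 7);
    }

    /** Scalar reference, one quarter round at a time, for comparison only. */
    public static int[] blockScalar(int[] state) {
        int[] x = state.clone();
        for (int i = 0; i < 10; i++) {
            qr(x, 0, 4, 8, 12); qr(x, 1, 5, 9, 13); qr(x, 2, 6, 10, 14); qr(x, 3, 7, 11, 15);
            qr(x, 0, 5, 10, 15); qr(x, 1, 6, 11, 12); qr(x, 2, 7, 8, 13); qr(x, 3, 4, 9, 14);
        }
        for (int i = 0; i < 16; i++) x[i] += state[i];
        return x;
    }
}
```

On real hardware the row rotations would be in-register lane shuffles, such as `EXT` on aarch64. Here they return new arrays only to keep the model readable.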

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23397#issuecomment-2630610061
