On Mon, 3 Feb 2025 10:56:28 GMT, Andrew Haley <a...@openjdk.org> wrote:

>> This enhancement changes the ChaCha20 block function intrinsic on
>> aarch64, moving away from the block-parallel implementation to the
>> quarter-round-parallel implementation that was done on x86_64.  Assembly
>> language profiling yielded an 11% improvement in throughput.  When put
>> together as an intrinsic and hooked into the JCE ChaCha20 cipher, the gains
>> are more modest, somewhere in the 2-4% range depending on job size, but
>> still an improvement.
>
> This looks very nice, and I'm tempted to just approve it as it is. My only 
> concern is that the algorithm changes aren't really explained, but I guess 
> what you have done here is the _128-Bit Vectorization_ in 
> `https://eprint.iacr.org/2013/759.pdf`. Is that right?

Hi @theRealAph, thanks for taking a look at the changes.

Actually, I hadn't read that paper, believe it or not, though it certainly 
looks like what I'm doing.  When I was prototyping this a few years ago in 
x86_64 assembly, it occurred to me that directly loading the state into 4 
consecutive vectors lined everything up in the columnar organization, so we 
could do the column half of the double-round right off the bat (which must 
have been intentional in the design of the cipher).  Then all we had to do 
was rotate the lanes leftward by 1/2/3 in the second/third/fourth vectors 
and run the quarter-rounds again for the diagonal half.  Longer vectors on 
AVX2 and AVX-512, or more vector registers on aarch64, then let me process 
more blocks at a time.
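
In case it helps make that concrete, here's a rough sketch of the idea in 
NEON intrinsics (my own illustrative code with made-up names, not the stub 
itself; the stub is hand-written assembly in the aarch64 backend):

```c
#include <arm_neon.h>
#include <stdint.h>

// Rotate each 32-bit lane left by a constant amount.
#define ROTL32(v, n) vorrq_u32(vshlq_n_u32(v, n), vshrq_n_u32(v, 32 - (n)))

// One lane-wise ChaCha20 quarter round: with each row of the 4x4 state
// in its own vector, this runs four quarter rounds at once, one per lane.
static inline void qr(uint32x4_t *a, uint32x4_t *b,
                      uint32x4_t *c, uint32x4_t *d) {
    *a = vaddq_u32(*a, *b); *d = veorq_u32(*d, *a); *d = ROTL32(*d, 16);
    *c = vaddq_u32(*c, *d); *b = veorq_u32(*b, *c); *b = ROTL32(*b, 12);
    *a = vaddq_u32(*a, *b); *d = veorq_u32(*d, *a); *d = ROTL32(*d, 8);
    *c = vaddq_u32(*c, *d); *b = veorq_u32(*b, *c); *b = ROTL32(*b, 7);
}

// One double round.  The column round falls out of the row layout
// directly; for the diagonal round, rotating rows 1/2/3 leftward by
// 1/2/3 lanes turns the diagonals into columns.
static void double_round(uint32x4_t v[4]) {
    qr(&v[0], &v[1], &v[2], &v[3]);      // column round
    v[1] = vextq_u32(v[1], v[1], 1);     // rotate lanes left by 1
    v[2] = vextq_u32(v[2], v[2], 2);     // ... by 2
    v[3] = vextq_u32(v[3], v[3], 3);     // ... by 3
    qr(&v[0], &v[1], &v[2], &v[3]);      // diagonal round
    v[1] = vextq_u32(v[1], v[1], 3);     // undo the lane rotations
    v[2] = vextq_u32(v[2], v[2], 2);
    v[3] = vextq_u32(v[3], v[3], 1);
}
```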

It was long enough ago that I don't recall exactly why, but when I initially 
did the quarter-round-parallel and block-parallel versions in aarch64 
assembly, they seemed pretty comparable in speed.  At the time, I took that 
to mean that whatever the block-parallel version gained by not needing lane 
rotations, it gave back by having to gather/scatter each 32-bit state value 
to and from different vectors at 64-byte offsets, so it more or less 
balanced out.  When I went back to look at the two approaches, I was able to 
tweak the loads and stores in the quarter-round-parallel version to be a 
little friendlier (one ld1 and a bunch of register-to-register movs vs. 4 
ld4r ops, and 4 st1 ops vs. 16 st4 ops).
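
To illustrate the load side (again an intrinsics sketch with names of my 
own, not the stub code): the quarter-round-parallel layout can pull the 
whole 64-byte state in with one contiguous load and fan it out to the other 
block sets with plain vector moves, only patching the counter lane per 
block, whereas the block-parallel layout broadcasts state words across 
vectors with ld4r instead.

```c
#include <arm_neon.h>
#include <stdint.h>

enum { BLOCKS = 4 };   // how many 64-byte blocks to keep in flight

// Load the initial state once and replicate it for each block.  The four
// vld1q_u32 calls can become a single four-register ld1 in assembly, and
// the per-block copies are just register-to-register moves; only the
// 32-bit counter lane (state word 12) differs between blocks.
static void load_state(const uint32_t state[16],
                       uint32x4_t v[BLOCKS][4]) {
    uint32x4_t row0 = vld1q_u32(state +  0);
    uint32x4_t row1 = vld1q_u32(state +  4);
    uint32x4_t row2 = vld1q_u32(state +  8);
    uint32x4_t row3 = vld1q_u32(state + 12);
    uint32_t ctr = vgetq_lane_u32(row3, 0);
    for (int b = 0; b < BLOCKS; b++) {
        v[b][0] = row0;
        v[b][1] = row1;
        v[b][2] = row2;
        v[b][3] = vsetq_lane_u32(ctr + (uint32_t)b, row3, 0);
    }
}
```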

As for explaining the algorithm changes, I could add comment text to the 
header of the stub function laying out the general idea behind what is being 
done.  It would certainly help anyone maintaining it down the line (myself 
included).

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23397#issuecomment-2631451246
