On Thu, 21 May 2026 00:06:29 GMT, Volodymyr Paprotski <[email protected]> 
wrote:

>> This PR:
>> - changes existing AVX512 SHA3 intrinsic to be more parallel
>> - adds an AVX2 SHA3 intrinsic
>> - change `SHA3Parallel.java` to NR=4 (to be able to exploit the AVX512 
>> parallelism while keeping doubleKeccak for platforms where double 
>> parallelism is preferable. I experimented with NR=8 as well, does also gain 
>> a few percent, but I think NR=4 is sufficient tradeoff)
>> 
>> Performance gains:
>> - `MessageDigestBench.digest`:
>>   - AVX2: **16%-39%**
>>   - AVX512: **24%-33%**
>> - `SignatureBench.MLDSA.sign`
>>   - AVX2: **6-12%**
>>   - AVX512: **11%-18%**
>> - `SignatureBench.MLDSA.verify`
>>   - AVX2: **2%-14%**
>>   - AVX512: **31%-40%**
>> - `KEMBench.MLKEM`
>>   - AVX2: **~5%**
>>   - AVX512: **14%-23%**
>> - `KEMBench.JSSE_*`
>>   - appears unaffected
>> 
>> Note on intrinsics. (As noted in the code..) there are multiple entrypoints 
>> wrapping the same intrinsic..
>> - `SHA3.implCompress`: single blockSize of user data xored with keccak
>> - `DigestBase.implCompressMultiBlock`: loop over user data and xor with 
>> keccak
>> - `SHA3Parallel.doubleKeccak`: (still used for AVX2) no message data, just 
>> two state vectors
>> - `SHA3Parallel.quadKeccak`: (AVX512 benefit) no message data, four state 
>> vectors
>> 
>> Note 1: `make test 
>> TEST="micro:org.openjdk.bench.javax.crypto.full.MessageDigestBench 
>> micro:org.openjdk.bench.javax.crypto.full.SignatureBench.MLDSA 
>> micro:org.openjdk.bench.javax.crypto.full.KEMBench"`
>> Note 2: I have left more targeted fuzzing and benchmarks out of this PR, but 
>> they are preserved at [on my 
>> branch](https://github.com/vpaprotsk/jdk/compare/sha3-avx-quad...vpaprotsk:jdk:sha3-avx-quad-extras?expand=1).
>>  If there is something you rather see pulled in.. (otherwise, can include a 
>> diff in JBS for 'future reference')
>> 
>> ---------
>> - [X] I confirm that I make this contribution in accordance with the 
>> [OpenJDK Interim AI Policy](https://openjdk.org/legal/ai).
>
> Volodymyr Paprotski has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   comments, next round done!

src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 83:

> 81: ATTRIBUTE_ALIGNED(64) static const uint64_t avx2_rotate_consts[] = {
> 82:   //  X0      X0     X1  X3  X1  X3     X2  X4   X2  X4
> 83:                      1, 28,  1, 28,    62, 27,  62, 27,   //    A0_,   
> A1A3,   A2A4

A0_ should be removed from comment as there are no corresponding entries?

src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 1055:

> 1053:       __ movq(T0, Address(buf, 5 * 8));   //b5
> 1054:       __ movq(T1, Address(buf, 15 * 8));  //b15
> 1055:       __ vshufpd(T0, T0/*b5*/, T1/*b15*/, 0b0000, vector_len); //b5A15

Comment should be //b5b15 ?

src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 1061:

> 1059:       __ movq(T0, Address(buf, 10 * 8));  //b10
> 1060:       __ vpxor(T0, T0, A10A20, vector_len);
> 1061:       __ vshufpd(A10A20, T0/*A10*/, A10A20, 0b1010, vector_len); 
> //b10A20

Looks to me that this vshufpd could be removed.
Or if retained, comment should be //A10A20 ?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/31125#discussion_r3284439491
PR Review Comment: https://git.openjdk.org/jdk/pull/31125#discussion_r3291081988
PR Review Comment: https://git.openjdk.org/jdk/pull/31125#discussion_r3291098777

Reply via email to