On Fri, 22 May 2026 02:46:29 GMT, Volodymyr Paprotski <[email protected]> 
wrote:

>> This PR:
>> - changes existing AVX512 SHA3 intrinsic to be more parallel
>> - adds an AVX2 SHA3 intrinsic
>> - changes `SHA3Parallel.java` to NR=4 (to be able to exploit the AVX512 
>> parallelism while keeping doubleKeccak for platforms where double 
>> parallelism is preferable). I experimented with NR=8 as well; it also gains 
>> a few percent, but I think NR=4 is a sufficient tradeoff.
>> 
>> Performance gains:
>> - `MessageDigestBench.digest`:
>>   - AVX2: **16%-39%**
>>   - AVX512: **24%-33%**
>> - `SignatureBench.MLDSA.sign`
>>   - AVX2: **6-12%**
>>   - AVX512: **11%-18%**
>> - `SignatureBench.MLDSA.verify`
>>   - AVX2: **2%-14%**
>>   - AVX512: **31%-40%**
>> - `KEMBench.MLKEM`
>>   - AVX2: **~5%**
>>   - AVX512: **14%-23%**
>> - `KEMBench.JSSE_*`
>>   - appears unaffected
>> 
>> Note on intrinsics: as noted in the code, there are multiple entry points 
>> wrapping the same intrinsic:
>> - `SHA3.implCompress`: single blockSize of user data xored with keccak
>> - `DigestBase.implCompressMultiBlock`: loop over user data and xor with 
>> keccak
>> - `SHA3Parallel.doubleKeccak`: (still used for AVX2) no message data, just 
>> two state vectors
>> - `SHA3Parallel.quadKeccak`: (AVX512 benefit) no message data, four state 
>> vectors
>> 
>> Note 1: `make test 
>> TEST="micro:org.openjdk.bench.javax.crypto.full.MessageDigestBench 
>> micro:org.openjdk.bench.javax.crypto.full.SignatureBench.MLDSA 
>> micro:org.openjdk.bench.javax.crypto.full.KEMBench"`
>> Note 2: I have left more targeted fuzzing and benchmarks out of this PR, but 
>> they are preserved [on my 
>> branch](https://github.com/vpaprotsk/jdk/compare/sha3-avx-quad...vpaprotsk:jdk:sha3-avx-quad-extras?expand=1).
>>  If there is something you would rather see pulled in, I can do that 
>> (otherwise, I can include a diff in JBS for future reference).
>> 
>> ---------
>> - [X] I confirm that I make this contribution in accordance with the 
>> [OpenJDK Interim AI Policy](https://openjdk.org/legal/ai).
>
> Volodymyr Paprotski has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   Comments from Aleksey Shipilev

The AVX2 benchmarks show mixed results (see attached) on my Intel Core 
i9-12900K Alder Lake 3.2GHz 24-Core w/32GB main memory:

ML-KEM decapsulation: -2% to 0% delta
ML-KEM encapsulation: -2% to 2% delta
ML-KEM key generation: -1% to 5% delta

ML-DSA sign of 1024 bytes: 0% to 5% delta
ML-DSA sign of 16384 bytes: -5% to 1% delta
ML-DSA verify of 1024 bytes: 6% to 12% delta
ML-DSA verify of 16384 bytes: -7% to 3% delta
ML-DSA key generation: 6% to 12% delta

As the numbers above and the attachment show, the regressions are i) tied to 
data size for sign/verify operations and ii) concentrated in ML-KEM's smaller 
key sizes.  For i), AVX2 has to do a number of shuffles (3 instructions) per 
round for the two 128-bit states, whereas the C2 inlining for rotations is 
probably already efficient in this area.  For ii), there is less work to do 
when expanding/generating the A matrix for the smaller key sizes.  Another 
possible source of slowdown relative to AVX-512 is that AVX2 does not support 
a true quad Keccak and could pay a higher price for unused lanes.
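
To make the rotation point in i) concrete, here is a minimal scalar sketch. It is 
not code from the PR or from the intrinsic; `RotateSketch` and `rotlEmulated` 
are hypothetical names. It only shows that a 64-bit rotate, which scalar Java 
expresses as a single `Long.rotateLeft` call, becomes a shift/shift/or sequence 
wherever no rotate instruction is available (AVX2 has no packed 64-bit rotate; 
`vprolq` is AVX-512 only). Whether this accounts for the "3 instructions" above 
is an assumption.

```java
final class RotateSketch {
    // Hypothetical helper: rotate-left built from two shifts and an OR,
    // the sequence needed when no native rotate instruction exists.
    static long rotlEmulated(long lane, int n) {
        // Java masks long shift distances to 6 bits, so -n acts as (64 - n) % 64.
        return (lane << n) | (lane >>> -n);
    }

    public static void main(String[] args) {
        long lane = 0x0123456789ABCDEFL;
        for (int n = 0; n < 64; n++) {
            // Long.rotateLeft is the single-call form used by scalar Java code.
            if (rotlEmulated(lane, n) != Long.rotateLeft(lane, n)) {
                throw new AssertionError("mismatch at n=" + n);
            }
        }
        System.out.println("emulated rotate matches Long.rotateLeft for all offsets");
    }
}
```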
[Intrinsics ML-KEM_ML-DSA Benchmarks - 
i9-8384353.pdf](https://github.com/user-attachments/files/28207824/Intrinsics.ML-KEM_ML-DSA.Benchmarks.-.i9-8384353.pdf)
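
The "less work" and "unused lanes" points in ii) can be put into rough numbers. 
The sketch below assumes one independent XOF stream per entry of the k x k 
matrix A (k = 2, 3, 4 for ML-KEM-512, -768, -1024) and assumes those streams are 
simply grouped into batches of 2 or 4 states. How `SHA3Parallel` actually 
batches them is not something this sketch knows; the class and method names are 
hypothetical.

```java
final class LaneUtilizationSketch {
    // Hypothetical helper: permutation calls needed to cover `states`
    // independent Keccak states when one call handles `width` states.
    static int calls(int states, int width) {
        return (states + width - 1) / width;
    }

    // Hypothetical helper: state slots left idle in the last, partial call.
    static int idleSlots(int states, int width) {
        return calls(states, width) * width - states;
    }

    public static void main(String[] args) {
        int[] ranks = {2, 3, 4}; // k for ML-KEM-512, -768, -1024
        for (int k : ranks) {
            int states = k * k; // assumed: one XOF stream per entry of A
            System.out.printf(
                "k=%d entries=%d | x2: %d calls, %d idle | x4: %d calls, %d idle%n",
                k, states,
                calls(states, 2), idleSlots(states, 2),
                calls(states, 4), idleSlots(states, 4));
        }
    }
}
```

Under these assumptions, k=2 yields only four streams in total, so there is little 
work to amortize any per-call overhead, and k=3 leaves three of four slots idle 
in the last four-wide batch.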

-------------

PR Comment: https://git.openjdk.org/jdk/pull/31125#issuecomment-4532071028
