Re: RFR: 8371259: ML-DSA AVX2 and AVX512 intrinsics and improvements [v6]

Jatin Bhateja Mon, 24 Nov 2025 19:04:40 -0800

On Mon, 24 Nov 2025 22:01:17 GMT, Volodymyr Paprotski <[email protected]> 
wrote:


>> - New AVX2 intrinsics are 1.6x-6.9x faster than Java baseline 
>>    - `SignatureBench.MLDSA` is 1.2x-2.2x faster
>>    - Note: there is no AVX2-SHA3 intrinsics yet (Being reviewed 
>> https://github.com/vpaprotsk/jdk/pull/7)
>> - AVX512 intrinsic improvements are 1.24x-1.5x faster then current version 
>>   - `SignatureBench.MLDSA` is upto 5% faster, never slower
>> 
>> Note on intrinsic:
>> - The emitted (existing) AVX512 assembler was not "significantly" changed; 
>> mostly more efficient instruction selection and tighter register allocation, 
>> which allowed removal of NTT loop and stack spill.
>> - Code was refactored to allow reuse of same assembler (as possible) for 
>> AVX512 and AVX2
>> 
>> Tests and benchmarks:
>> - Added a fuzz test to ensure Java and intrinsic produces exactly same result
>> - Added benchmark to measure the performance of intrinsic itself
>> 
>> make test TEST="test/jdk/sun/security/provider/acvp/Launcher.java 
>> test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java"
>> make test TEST="test/jdk/sun/security/provider/acvp/Launcher.java 
>> test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java" 
>> JTREG="JAVA_OPTIONS=-XX:UseAVX=2"
>> make test 
>> TEST="micro:org.openjdk.bench.javax.crypto.full.SignatureBench.MLDSA" 
>> MICRO="JAVA_OPTIONS=-XX:+UnlockDiagnosticVMOptions 
>> -XX:+UseDilithiumIntrinsics;FORK=1"
>> make test 
>> TEST="micro:org.openjdk.bench.javax.crypto.full.SignatureBench.MLDSA" 
>> MICRO="JAVA_OPTIONS=-XX:+UnlockDiagnosticVMOptions 
>> -XX:-UseDilithiumIntrinsics;FORK=1"
>
> Volodymyr Paprotski has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   spelling

Very nice work @vpaprotsk ,
Please also add in comments the links to original reference implimentation.

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 365:

> 363: 
> 364: static void loadXmms(const XMMRegister destinationRegs[], Register 
> source, int offset,
> 365:                        int vector_len, MacroAssembler *_masm, int regCnt 
> = -1, int memStep = -1) {

Suggestion:

                     int vector_len, MacroAssembler *_masm, int regCnt = -1, 
int memStep = -1) {

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 381:

> 379: 
> 380: static void storeXmms(Register destination, int offset, const 
> XMMRegister xmmRegs[],
> 381:                        int vector_len, MacroAssembler *_masm, int regCnt 
> = -1, int memStep = -1) {

Suggestion:

                      int vector_len, MacroAssembler *_masm, int regCnt = -1, 
int memStep = -1) {

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 659:

> 657: // zetas (int[128*8]) = c_rarg1
> 658: static address generate_dilithiumAlmostInverseNtt_avx(StubGenerator 
> *stubgen,
> 659:                                          int vector_len,MacroAssembler 
> *_masm) {

Fix indentation

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 718:

> 716: 
> 717:     // Constants for shuffle and montMul64
> 718:     __ mov64(scratch, 0b1010101010101010);

64 bit constant suffix

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 901:

> 899: // poly2 (int[256]) = c_rarg2
> 900: static address generate_dilithiumNttMult_avx(StubGenerator *stubgen,
> 901:                                      int vector_len, MacroAssembler 
> *_masm) {

Fix indentation

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 939:

> 937:                   vector_len, scratch); // 2^64 mod q
> 938:   if (vector_len == Assembler::AVX_512bit) {
> 939:     __ mov64(scratch, 0b0101010101010101);

Add long constant suffix

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 985:

> 983: // constant (int) = c_rarg1
> 984: static address generate_dilithiumMontMulByConstant_avx(StubGenerator 
> *stubgen,
> 985:                                         int vector_len, MacroAssembler 
> *_masm) {

Fix indentation

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 1026:

> 1024:     __ evpbroadcastd(constant, rConstant, Assembler::AVX_512bit); // 
> constant multiplier
> 1025: 
> 1026:     __ mov64(scratch, 0b0101010101010101); //dw-mask

Constant suffix

-------------

PR Review: https://git.openjdk.org/jdk/pull/28136#pullrequestreview-3503056034
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2558380867
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2558381318
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2558385868
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2558390552
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2558390067
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2558370904
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2558391135
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2558371478

Re: RFR: 8371259: ML-DSA AVX2 and AVX512 intrinsics and improvements [v6]

Reply via email to