Re: RFR: 8371259: ML-DSA AVX2 and AVX512 intrinsics and improvements

Jatin Bhateja Sun, 16 Nov 2025 23:17:10 -0800

On Tue, 4 Nov 2025 16:38:49 GMT, Volodymyr Paprotski <[email protected]> 
wrote:


> - New AVX2 intrinsics are 1.6x-6.9x faster than Java baseline 
>    - `SignatureBench.MLDSA` is 1.2x-2.2x faster
>    - Note: there is no AVX2-SHA3 intrinsics yet (Being reviewed 
> https://github.com/vpaprotsk/jdk/pull/7)
> - AVX512 intrinsic improvements are 1.24x-1.5x faster then current version 
>   - `SignatureBench.MLDSA` is upto 5% faster, never slower
> 
> Note on intrinsic:
> - The emitted (existing) AVX512 assembler was not "significantly" changed; 
> mostly more efficient instruction selection and tighter register allocation, 
> which allowed removal of NTT loop and stack spill.
> - Code was refactored to allow reuse of same assembler (as possible) for 
> AVX512 and AVX2
> 
> Tests and benchmarks:
> - Added a fuzz test to ensure Java and intrinsic produces exactly same result
> - Added benchmark to measure the performance of intrinsic itself
> 
> make test TEST="test/jdk/sun/security/provider/acvp/Launcher.java 
> test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java"
> make test TEST="test/jdk/sun/security/provider/acvp/Launcher.java 
> test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java" 
> JTREG="JAVA_OPTIONS=-XX:UseAVX=2"
> make test 
> TEST="micro:org.openjdk.bench.javax.crypto.full.SignatureBench.MLDSA" 
> MICRO="JAVA_OPTIONS=-XX:+UnlockDiagnosticVMOptions 
> -XX:+UseDilithiumIntrinsics;FORK=1"
> make test 
> TEST="micro:org.openjdk.bench.javax.crypto.full.SignatureBench.MLDSA" 
> MICRO="JAVA_OPTIONS=-XX:+UnlockDiagnosticVMOptions 
> -XX:-UseDilithiumIntrinsics;FORK=1"

Minor initial comments

src/hotspot/cpu/x86/assembler_x86.cpp line 3867:

> 3865:             (vector_len == AVX_256bit ? VM_Version::supports_avx2() :
> 3866:             (vector_len == AVX_512bit ? VM_Version::supports_evex() : 
> false)), "");
> 3867:   InstructionAttr attributes(vector_len, /* vex_w */ false, /* 
> legacy_mode */ false, /* no_mask_reg */ true, /* uses_vl */ true);

When you check for AVX512-VL you allow accessing 128/256 bit registers from the 
higher register bank [X/Y]MM(16-31) 

But your assertions are nowhere checking this.

src/hotspot/cpu/x86/assembler_x86.cpp line 3876:

> 3874:           (vector_len == AVX_256bit ? VM_Version::supports_avx2() :
> 3875:           (vector_len == AVX_512bit ? VM_Version::supports_evex() : 
> false)), "");
> 3876:   InstructionAttr attributes(vector_len, /* vex_w */ false, /* 
> legacy_mode */ false, /* no_mask_reg */ true, /* uses_vl */ true);

When you check for AVX512-VL you allow accessing 128/256 bit registers from the 
higher register bank [X/Y]MM(16-31)

But your assertions are nowhere checking this.

src/hotspot/cpu/x86/assembler_x86.cpp line 3882:

> 3880: 
> 3881: void Assembler::evmovsldup(XMMRegister dst, KRegister mask, XMMRegister 
> src, bool merge, int vector_len) {
> 3882:   assert(VM_Version::supports_evex(), "");

Suggestion:

  assert(vector_len == AVX_512 || VM_Version::supports_avx512vl), "");

src/hotspot/cpu/x86/assembler_x86.cpp line 3894:

> 3892: 
> 3893: void Assembler::evmovshdup(XMMRegister dst, KRegister mask, XMMRegister 
> src, bool merge, int vector_len) {
> 3894:   assert(VM_Version::supports_evex(), "");

Same as above

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 397:

> 395: //
> 396: static address generate_dilithiumAlmostNtt_avx(StubGenerator *stubgen,
> 397:                             int vector_len, MacroAssembler *_masm) {

Indentation corretness

test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java line 2:

> 1: /*
> 2:  * Copyright (c) 2024, 2025, Oracle and/or its affiliates. All rights 
> reserved.

Suggestion:

 * Copyright (c) 2025, Oracle and/or its affiliates. All rights reserved.

test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java line 114:

> 112:         rnd.setSeed(seed);
> 113:         //Note: it might be useful to increase this number during 
> development of new intrinsics
> 114:         final int repeat = 10000000;

Instead of high repetition count can you try tuning the tiered compilation 
threshold.

test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java line 145:

> 143:             coeffs1[j] = rnd.nextInt();
> 144:             coeffs2[j] = rnd.nextInt();
> 145:         }

You can uses generators for randome initialization of array

-------------

PR Review: https://git.openjdk.org/jdk/pull/28136#pullrequestreview-3471195396
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2532894350
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2532894989
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2532900199
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2532901821
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2532910907
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2532868326
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2532875974
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2532872372

Re: RFR: 8371259: ML-DSA AVX2 and AVX512 intrinsics and improvements

Reply via email to