On Tue, 4 Nov 2025 16:38:49 GMT, Volodymyr Paprotski <[email protected]> wrote:
> - New AVX2 intrinsics are 1.6x-6.9x faster than Java baseline > - `SignatureBench.MLDSA` is 1.2x-2.2x faster > - Note: there is no AVX2-SHA3 intrinsics yet (Being reviewed > https://github.com/vpaprotsk/jdk/pull/7) > - AVX512 intrinsic improvements are 1.24x-1.5x faster then current version > - `SignatureBench.MLDSA` is upto 5% faster, never slower > > Note on intrinsic: > - The emitted (existing) AVX512 assembler was not "significantly" changed; > mostly more efficient instruction selection and tighter register allocation, > which allowed removal of NTT loop and stack spill. > - Code was refactored to allow reuse of same assembler (as possible) for > AVX512 and AVX2 > > Tests and benchmarks: > - Added a fuzz test to ensure Java and intrinsic produces exactly same result > - Added benchmark to measure the performance of intrinsic itself > > make test TEST="test/jdk/sun/security/provider/acvp/Launcher.java > test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java" > make test TEST="test/jdk/sun/security/provider/acvp/Launcher.java > test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java" > JTREG="JAVA_OPTIONS=-XX:UseAVX=2" > make test > TEST="micro:org.openjdk.bench.javax.crypto.full.SignatureBench.MLDSA" > MICRO="JAVA_OPTIONS=-XX:+UnlockDiagnosticVMOptions > -XX:+UseDilithiumIntrinsics;FORK=1" > make test > TEST="micro:org.openjdk.bench.javax.crypto.full.SignatureBench.MLDSA" > MICRO="JAVA_OPTIONS=-XX:+UnlockDiagnosticVMOptions > -XX:-UseDilithiumIntrinsics;FORK=1" Minor initial comments src/hotspot/cpu/x86/assembler_x86.cpp line 3867: > 3865: (vector_len == AVX_256bit ? VM_Version::supports_avx2() : > 3866: (vector_len == AVX_512bit ? VM_Version::supports_evex() : > false)), ""); > 3867: InstructionAttr attributes(vector_len, /* vex_w */ false, /* > legacy_mode */ false, /* no_mask_reg */ true, /* uses_vl */ true); When you check for AVX512-VL you allow accessing 128/256 bit registers from the higher register bank [X/Y]MM(16-31) But your assertions are nowhere checking this. src/hotspot/cpu/x86/assembler_x86.cpp line 3876: > 3874: (vector_len == AVX_256bit ? VM_Version::supports_avx2() : > 3875: (vector_len == AVX_512bit ? VM_Version::supports_evex() : > false)), ""); > 3876: InstructionAttr attributes(vector_len, /* vex_w */ false, /* > legacy_mode */ false, /* no_mask_reg */ true, /* uses_vl */ true); When you check for AVX512-VL you allow accessing 128/256 bit registers from the higher register bank [X/Y]MM(16-31) But your assertions are nowhere checking this. src/hotspot/cpu/x86/assembler_x86.cpp line 3882: > 3880: > 3881: void Assembler::evmovsldup(XMMRegister dst, KRegister mask, XMMRegister > src, bool merge, int vector_len) { > 3882: assert(VM_Version::supports_evex(), ""); Suggestion: assert(vector_len == AVX_512 || VM_Version::supports_avx512vl), ""); src/hotspot/cpu/x86/assembler_x86.cpp line 3894: > 3892: > 3893: void Assembler::evmovshdup(XMMRegister dst, KRegister mask, XMMRegister > src, bool merge, int vector_len) { > 3894: assert(VM_Version::supports_evex(), ""); Same as above src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 397: > 395: // > 396: static address generate_dilithiumAlmostNtt_avx(StubGenerator *stubgen, > 397: int vector_len, MacroAssembler *_masm) { Indentation corretness test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java line 2: > 1: /* > 2: * Copyright (c) 2024, 2025, Oracle and/or its affiliates. All rights > reserved. Suggestion: * Copyright (c) 2025, Oracle and/or its affiliates. All rights reserved. test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java line 114: > 112: rnd.setSeed(seed); > 113: //Note: it might be useful to increase this number during > development of new intrinsics > 114: final int repeat = 10000000; Instead of high repetition count can you try tuning the tiered compilation threshold. test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java line 145: > 143: coeffs1[j] = rnd.nextInt(); > 144: coeffs2[j] = rnd.nextInt(); > 145: } You can uses generators for randome initialization of array ------------- PR Review: https://git.openjdk.org/jdk/pull/28136#pullrequestreview-3471195396 PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2532894350 PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2532894989 PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2532900199 PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2532901821 PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2532910907 PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2532868326 PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2532875974 PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2532872372
