On Thu, 20 Nov 2025 22:55:07 GMT, Volodymyr Paprotski <[email protected]> wrote:
>> - New AVX2 intrinsics are 1.6x-6.9x faster than Java baseline >> - `SignatureBench.MLDSA` is 1.2x-2.2x faster >> - Note: there is no AVX2-SHA3 intrinsics yet (Being reviewed >> https://github.com/vpaprotsk/jdk/pull/7) >> - AVX512 intrinsic improvements are 1.24x-1.5x faster then current version >> - `SignatureBench.MLDSA` is upto 5% faster, never slower >> >> Note on intrinsic: >> - The emitted (existing) AVX512 assembler was not "significantly" changed; >> mostly more efficient instruction selection and tighter register allocation, >> which allowed removal of NTT loop and stack spill. >> - Code was refactored to allow reuse of same assembler (as possible) for >> AVX512 and AVX2 >> >> Tests and benchmarks: >> - Added a fuzz test to ensure Java and intrinsic produces exactly same result >> - Added benchmark to measure the performance of intrinsic itself >> >> make test TEST="test/jdk/sun/security/provider/acvp/Launcher.java >> test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java" >> make test TEST="test/jdk/sun/security/provider/acvp/Launcher.java >> test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java" >> JTREG="JAVA_OPTIONS=-XX:UseAVX=2" >> make test >> TEST="micro:org.openjdk.bench.javax.crypto.full.SignatureBench.MLDSA" >> MICRO="JAVA_OPTIONS=-XX:+UnlockDiagnosticVMOptions >> -XX:+UseDilithiumIntrinsics;FORK=1" >> make test >> TEST="micro:org.openjdk.bench.javax.crypto.full.SignatureBench.MLDSA" >> MICRO="JAVA_OPTIONS=-XX:+UnlockDiagnosticVMOptions >> -XX:-UseDilithiumIntrinsics;FORK=1" > > Volodymyr Paprotski has updated the pull request incrementally with one > additional commit since the last revision: > > next set of comments Good work! I just found a few typos in the comments. src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 88: > 86: // +-----+-----+-----+-----+----- > 87: // > 88: // NOTE: size 0 and 1 are used for initial and final shuffles > respectivelly of Typo: respectivelly -> respectively src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 248: > 246: // We do Montgomery multiplications of two AVX registers in 4 steps: > 247: // 1. Do the multiplications of the corresponding even numbered slots > into > 248: // the odd numbered slots of a scratch2 register. Typo: scratch2 -> scratch src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 249: > 247: // 1. Do the multiplications of the corresponding even numbered slots > into > 248: // the odd numbered slots of a scratch2 register. > 249: // 2. Swap the even and odd numbered slots of the original input > registers.* Typo: unnecessary '*' at the end src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 250: > 248: // the odd numbered slots of a scratch2 register. > 249: // 2. Swap the even and odd numbered slots of the original input > registers.* > 250: // 3. Similar to step 1, but into output register. Typo: into output register -> into an output register src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 253: > 251: // 4. Combine the outputs of step 1 and step 3 into the output of the > Montgomery > 252: // multiplication. > 253: // (*For levels 0-6 in the Ntt and levels 1-7 of the inverse Ntt, need > NOT swap Typo: unnecessary '(*' at the beginning src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 282: > 280: const XMMRegister* scratch = scratch1 == input1 ? output: scratch1; > 281: > 282: // scratch = input1_even*intput2_even Suggestion: // scratch = input1_even * intput2_even src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 479: > 477: // level 0 - 128 > 478: // scratch1 = coeffs3 * zetas1 > 479: // coeffs3, coeffs1 = coeffs1±scratch1 Suggestion: // coeffs3, coeffs1 = coeffs1 ± scratch1 src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 524: > 522: // coeffs1_2 = coeffs1_2 + scratch1 > 523: loadXmms(Zetas3, zetas, level * 512, vector_len, _masm); > 524: shuffle(Scratch1, Coeffs1_2, Coeffs2_2, distance * 32); > //Coeffs2_2 freed Suggestion: // Coeffs2_2 freed src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 529: > 527: > 528: loadXmms(Zetas3, zetas, 4*64 + level * 512, vector_len, _masm); > 529: shuffle(Scratch1, Coeffs3_2, Coeffs4_2, distance * 32); > //Coeffs4_2 freed Suggestion: // Coeffs4_2 freed src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 554: > 552: const XMMRegister Coeffs2_2[] = {xmm4, xmm5, xmm6, xmm7}; > 553: > 554: // Since we cannot fit the entire payload into registers, we process process input -> process the input src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 555: > 553: > 554: // Since we cannot fit the entire payload into registers, we process > 555: // input in two stages. First half, load 8 registers 32 integers > each apart. First half -> For the first half src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 557: > 555: // input in two stages. First half, load 8 registers 32 integers > each apart. > 556: // With one load, we can process level 0-2 (128-, 64- and > 32-integers apart) > 557: // Remaining levels, load 8 registers from consecutive memory (16-, > 8-, 4-, Remaining -> For the remaining src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 558: > 556: // With one load, we can process level 0-2 (128-, 64- and > 32-integers apart) > 557: // Remaining levels, load 8 registers from consecutive memory (16-, > 8-, 4-, > 558: // 2-, 1-integer appart) appart -> apart src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 559: > 557: // Remaining levels, load 8 registers from consecutive memory (16-, > 8-, 4-, > 558: // 2-, 1-integer appart) > 559: // Levels 5, 6, 7 (4-, 2-, 1-integer appart) require shuffles within > registers appart -> apart src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 560: > 558: // 2-, 1-integer appart) > 559: // Levels 5, 6, 7 (4-, 2-, 1-integer appart) require shuffles within > registers > 560: // Other levels, shuffles can be done by re-aranging register order Other -> on the other re-aranging register order -> rearranging the register order src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 562: > 560: // Other levels, shuffles can be done by re-aranging register order > 561: > 562: // Four batches of 8 registers each, 128 bytes appart appart -> apart src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 701: > 699: // In each of these iterations half of the coefficients are added to > and > 700: // subtracted from the other half of the coefficients then the result > of > 701: // the substration is (Montgomery) multiplied by the corresponding > zetas. substration -> subtraction (I know this was in my own comment :-( ) src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 850: > 848: } > 849: > 850: // Four batches of 8 registers each, 128 bytes appart appart -> apart ------------- PR Comment: https://git.openjdk.org/jdk/pull/28136#issuecomment-3571728756 PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556771999 PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556825899 PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556836110 PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556839540 PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556845331 PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556853907 PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556865521 PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556913637 PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556915972 PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556943987 PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556925142 PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556945036 PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556949814 PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556953155 PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556942168 PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556956323 PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556978873 PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2556961642
