On Fri, 15 May 2026 19:31:41 GMT, Volodymyr Paprotski <[email protected]> wrote:
>> This PR: >> - changes existing AVX512 SHA3 intrinsic to be more parallel >> - adds an AVX2 SHA3 intrinsic >> - change `SHA3Parallel.java` to NR=4 (to be able to exploit the AVX512 >> parallelism while keeping doubleKeccak for platforms where double >> parallelism is preferable. I experimented with NR=8 as well, does also gain >> a few percent, but I think NR=4 is sufficient tradeoff) >> >> Performance gains: >> - `MessageDigestBench.digest`: >> - AVX2: **16%-39%** >> - AVX512: **24%-33%** >> - `SignatureBench.MLDSA.sign` >> - AVX2: **6-12%** >> - AVX512: **11%-18%** >> - `SignatureBench.MLDSA.verify` >> - AVX2: **2%-14%** >> - AVX512: **31%-40%** >> - `KEMBench.MLKEM` >> - AVX2: **~5%** >> - AVX512: **14%-23%** >> - `KEMBench.JSSE_*` >> - appears unaffected >> >> Note on intrinsics. (As noted in the code..) there are multiple entrypoints >> wrapping the same intrinsic.. >> - `SHA3.implCompress`: single blockSize of user data xored with keccak >> - `DigestBase.implCompressMultiBlock`: loop over user data and xor with >> keccak >> - `SHA3Parallel.doubleKeccak`: (still used for AVX2) no message data, just >> two state vectors >> - `SHA3Parallel.quadKeccak`: (AVX512 benefit) no message data, four state >> vectors >> >> Note 1: `make test >> TEST="micro:org.openjdk.bench.javax.crypto.full.MessageDigestBench >> micro:org.openjdk.bench.javax.crypto.full.SignatureBench.MLDSA >> micro:org.openjdk.bench.javax.crypto.full.KEMBench"` >> Note 2: I have left more targeted fuzzing and benchmarks out of this PR, but >> they are preserved at [on my >> branch](https://github.com/vpaprotsk/jdk/compare/sha3-avx-quad...vpaprotsk:jdk:sha3-avx-quad-extras?expand=1). >> If there is something you rather see pulled in.. (otherwise, can include a >> diff in JBS for 'future reference') >> >> --------- >> - [X] I confirm that I make this contribution in accordance with the >> [OpenJDK Interim AI Policy](https://openjdk.org/legal/ai). > > Volodymyr Paprotski has updated the pull request incrementally with two > additional commits since the last revision: > > - whitespace > - comments from Sandhya src/hotspot/cpu/x86/assembler_x86.cpp line 9522: > 9520: InstructionAttr attributes(vector_len, /* vex_w */ > VM_Version::supports_evex(), /* legacy_mode */ false, /* no_mask_reg */ true, > /* uses_vl */ true); > 9521: attributes.set_rex_vex_w_reverted(); > 9522: attributes.set_address_attributes(/* tuple_type */ EVEX_FV, /* > input_size_in_bits */ EVEX_NObit); Tuple type should be EVEX_M128. src/hotspot/cpu/x86/assembler_x86.cpp line 9627: > 9625: InstructionAttr attributes(vector_len, /* vex_w */ > VM_Version::supports_evex(), /* legacy_mode */ false, /* no_mask_reg */ true, > /* uses_vl */ true); > 9626: attributes.set_rex_vex_w_reverted(); > 9627: attributes.set_address_attributes(/* tuple_type */ EVEX_FV, /* > input_size_in_bits */ EVEX_NObit); Tuple type should be EVEX_M128. src/hotspot/cpu/x86/assembler_x86.cpp line 9778: > 9776: vector_len == AVX_256bit ? VM_Version::supports_avx2() : > 9777: vector_len == AVX_512bit ? VM_Version::supports_evex() : 0, > ""); > 9778: InstructionMark im(this); As the instruction is only supported for AVX2 and above, the following check: vector_len == AVX_128bit ? VM_Version::supports_avx() should be written as vector_len == AVX_128bit ? VM_Version::supports_avx2() Also then the assert(UseAVX > 1, "requires AVX2") becomes redundant and so could be removed. Likewise for vpsrlvq. src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 2: > 1: /* > 2: * Copyright (c) 2024, 2026, Oracle and/or its affiliates. All rights > reserved. This should remain as 2025. src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 248: > 246: } else { > 247: __ vmovdqu(X1, Address(state1, disp), Assembler::AVX_128bit); > 248: __ vshufpd(X2, X1, X2, 0b01, Assembler::AVX_128bit); When called for stubgen_sha3* the control reaches here. Nothing seems to be loaded into X2 prior but it is used in vshufpd? src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 293: > 291: __ evpxorq( A7, k1, A7, Address(buf, 7 * 8), false, > Assembler::AVX_128bit); > 292: __ evpxorq( A8, k1, A8, Address(buf, 8 * 8), false, > Assembler::AVX_128bit); > 293: __ cmpl(block_size, 72); Are we reading 8 extra bytes from buf (0 to 79 instead of 0 to 71)? K1 has 0x1F, we have q(8 byte) elements, so only last two bits of K1 would be used which are set to 1 so it is not limiting the memory read. I must be missing something. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/31125#discussion_r3251408863 PR Review Comment: https://git.openjdk.org/jdk/pull/31125#discussion_r3251409861 PR Review Comment: https://git.openjdk.org/jdk/pull/31125#discussion_r3251358621 PR Review Comment: https://git.openjdk.org/jdk/pull/31125#discussion_r3260622798 PR Review Comment: https://git.openjdk.org/jdk/pull/31125#discussion_r3262708646 PR Review Comment: https://git.openjdk.org/jdk/pull/31125#discussion_r3262802821
