On Mon, 11 May 2026 16:47:23 GMT, Volodymyr Paprotski <[email protected]> wrote:
> This PR: > - changes existing AVX512 SHA3 intrinsic to be more parallel > - adds an AVX2 SHA3 intrinsic > - change `SHA3Parallel.java` to NR=4 (to be able to exploit the AVX512 > parallelism while keeping doubleKeccak for platforms where double parallelism > is preferable. I experimented with NR=8 as well, does also gain a few > percent, but I think NR=4 is sufficient tradeoff) > > Performance gains: > - `MessageDigestBench.digest`: > - AVX2: **16%-39%** > - AVX512: **24%-33%** > - `SignatureBench.MLDSA.sign` > - AVX2: **6-12%** > - AVX512: **11%-18%** > - `SignatureBench.MLDSA.verify` > - AVX2: **2%-14%** > - AVX512: **31%-40%** > - `KEMBench.MLKEM` > - AVX2: **~5%** > - AVX512: **14%-23%** > - `KEMBench.JSSE_*` > - appears unaffected > > Note on intrinsics. (As noted in the code..) there are multiple entrypoints > wrapping the same intrinsic.. > - `SHA3.implCompress`: single blockSize of user data xored with keccak > - `DigestBase.implCompressMultiBlock`: loop over user data and xor with keccak > - `SHA3Parallel.doubleKeccak`: (still used for AVX2) no message data, just > two state vectors > - `SHA3Parallel.quadKeccak`: (AVX512 benefit) no message data, four state > vectors > > Note 1: `make test > TEST="micro:org.openjdk.bench.javax.crypto.full.MessageDigestBench > micro:org.openjdk.bench.javax.crypto.full.SignatureBench.MLDSA > micro:org.openjdk.bench.javax.crypto.full.KEMBench"` > Note 2: I have left more targeted fuzzing and benchmarks out of this PR, but > they are preserved at [on my > branch](https://github.com/vpaprotsk/jdk/compare/sha3-avx-quad...vpaprotsk:jdk:sha3-avx-quad-extras?expand=1). > If there is something you rather see pulled in.. (otherwise, can include a > diff in JBS for 'future reference') > > --------- > - [X] I confirm that I make this contribution in accordance with the [OpenJDK > Interim AI Policy](https://openjdk.org/legal/ai). I left a few comments regarding 'housekeeping' issues for the new intrinsic implementations. Thank you for correctly updating the registration of external addresses. n.b. I have left review of the generation steps to the x86 vectorization experts (I am happy to review any proposed aarch64 equivalent). src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 96: > 94: StubGenerator *stubgen, > 95: MacroAssembler *_masm) { > 96: bool multiBlock; You have moved the switch statement which verifies the stub id is in the expected range after the call to `load_archive_data`. Please return it to the start of this method to ensure we perform validation before trying to retrieve the stub from the AOT cache. Note that you need to follow the original and provide a case switch for all legitimate cases and a default case which calls `ShouldNotReachHere()`. src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 624: > 622: StubGenerator *stubgen, > 623: MacroAssembler *_masm) { > 624: bool multiBlock = stub_id == StubId::stubgen_sha3_implCompressMB_id; Please insert a switch statement here like the one employed in the previous generator to validate the stub id before we call `load_archive_data`. src/hotspot/share/runtime/stubDeclarations.hpp line 834: > 832: do_entry(compiler, double_keccak, double_keccak, double_keccak) \ > 833: do_stub(compiler, quad_keccak) \ > 834: do_entry(compiler, quad_keccak, quad_keccak, quad_keccak) \ I'm assuming that the intention in adding this stub to the generic declarations is that other architectures may also choose to implement this stub. Is that correct? src/java.base/share/classes/sun/security/provider/SHA3Parallel.java line 93: > 91: @IntrinsicCandidate > 92: private static int quadKeccak(long[] lanes0, long[] lanes1, long[] > lanes2, long[] lanes3) { > 93: doubleKeccak(lanes0, lanes1); So, on an architecture that does not provide the `quad_keccak` intrinsic intrinsic but does provide `double_keccak` we will still see two calls to the `doubleKeccak` intrinsic when a call is made to Java method `quadKeccak`. Likewise if we don't have `double_keccak` we can still field this with four calls to a `keccak` intrinsic. Nice! ------------- PR Review: https://git.openjdk.org/jdk/pull/31125#pullrequestreview-4270490073 PR Review Comment: https://git.openjdk.org/jdk/pull/31125#discussion_r3224832302 PR Review Comment: https://git.openjdk.org/jdk/pull/31125#discussion_r3224922748 PR Review Comment: https://git.openjdk.org/jdk/pull/31125#discussion_r3225003440 PR Review Comment: https://git.openjdk.org/jdk/pull/31125#discussion_r3225046003
