On Fri, 15 May 2026 19:31:41 GMT, Volodymyr Paprotski <[email protected]> 
wrote:

>> This PR:
>> - changes existing AVX512 SHA3 intrinsic to be more parallel
>> - adds an AVX2 SHA3 intrinsic
>> - change `SHA3Parallel.java` to NR=4 (to be able to exploit the AVX512 
>> parallelism while keeping doubleKeccak for platforms where double 
>> parallelism is preferable. I experimented with NR=8 as well, does also gain 
>> a few percent, but I think NR=4 is sufficient tradeoff)
>> 
>> Performance gains:
>> - `MessageDigestBench.digest`:
>>   - AVX2: **16%-39%**
>>   - AVX512: **24%-33%**
>> - `SignatureBench.MLDSA.sign`
>>   - AVX2: **6-12%**
>>   - AVX512: **11%-18%**
>> - `SignatureBench.MLDSA.verify`
>>   - AVX2: **2%-14%**
>>   - AVX512: **31%-40%**
>> - `KEMBench.MLKEM`
>>   - AVX2: **~5%**
>>   - AVX512: **14%-23%**
>> - `KEMBench.JSSE_*`
>>   - appears unaffected
>> 
>> Note on intrinsics. (As noted in the code..) there are multiple entrypoints 
>> wrapping the same intrinsic..
>> - `SHA3.implCompress`: single blockSize of user data xored with keccak
>> - `DigestBase.implCompressMultiBlock`: loop over user data and xor with 
>> keccak
>> - `SHA3Parallel.doubleKeccak`: (still used for AVX2) no message data, just 
>> two state vectors
>> - `SHA3Parallel.quadKeccak`: (AVX512 benefit) no message data, four state 
>> vectors
>> 
>> Note 1: `make test 
>> TEST="micro:org.openjdk.bench.javax.crypto.full.MessageDigestBench 
>> micro:org.openjdk.bench.javax.crypto.full.SignatureBench.MLDSA 
>> micro:org.openjdk.bench.javax.crypto.full.KEMBench"`
>> Note 2: I have left more targeted fuzzing and benchmarks out of this PR, but 
>> they are preserved at [on my 
>> branch](https://github.com/vpaprotsk/jdk/compare/sha3-avx-quad...vpaprotsk:jdk:sha3-avx-quad-extras?expand=1).
>>  If there is something you rather see pulled in.. (otherwise, can include a 
>> diff in JBS for 'future reference')
>> 
>> ---------
>> - [X] I confirm that I make this contribution in accordance with the 
>> [OpenJDK Interim AI Policy](https://openjdk.org/legal/ai).
>
> Volodymyr Paprotski has updated the pull request incrementally with two 
> additional commits since the last revision:
> 
>  - whitespace
>  - comments from Sandhya

src/hotspot/cpu/x86/assembler_x86.cpp line 9522:

> 9520:   InstructionAttr attributes(vector_len, /* vex_w */ 
> VM_Version::supports_evex(), /* legacy_mode */ false, /* no_mask_reg */ true, 
> /* uses_vl */ true);
> 9521:   attributes.set_rex_vex_w_reverted();
> 9522:   attributes.set_address_attributes(/* tuple_type */ EVEX_FV, /* 
> input_size_in_bits */ EVEX_NObit);

Tuple type should be EVEX_M128.

src/hotspot/cpu/x86/assembler_x86.cpp line 9627:

> 9625:   InstructionAttr attributes(vector_len, /* vex_w */ 
> VM_Version::supports_evex(), /* legacy_mode */ false, /* no_mask_reg */ true, 
> /* uses_vl */ true);
> 9626:   attributes.set_rex_vex_w_reverted();
> 9627:   attributes.set_address_attributes(/* tuple_type */ EVEX_FV, /* 
> input_size_in_bits */ EVEX_NObit);

Tuple type should be EVEX_M128.

src/hotspot/cpu/x86/assembler_x86.cpp line 9778:

> 9776:          vector_len == AVX_256bit ? VM_Version::supports_avx2() :
> 9777:          vector_len == AVX_512bit ? VM_Version::supports_evex() : 0, 
> "");
> 9778:   InstructionMark im(this);

As the instruction is only supported for AVX2 and above, the following check:
   vector_len == AVX_128bit ? VM_Version::supports_avx()  
should be written as
     vector_len == AVX_128bit ? VM_Version::supports_avx2()
Also then the assert(UseAVX > 1, "requires AVX2") becomes redundant and so 
could be removed.

Likewise for vpsrlvq.

src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 2:

> 1: /*
> 2:  * Copyright (c) 2024, 2026, Oracle and/or its affiliates. All rights 
> reserved.

This should remain as 2025.

src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 248:

> 246:     } else {
> 247:       __ vmovdqu(X1, Address(state1, disp), Assembler::AVX_128bit);
> 248:       __ vshufpd(X2, X1, X2, 0b01, Assembler::AVX_128bit);

When called for stubgen_sha3* the control reaches here. Nothing seems to be 
loaded into X2 prior but it is used in vshufpd?

src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 293:

> 291:     __ evpxorq( A7, k1,  A7, Address(buf,  7 * 8), false, 
> Assembler::AVX_128bit);
> 292:     __ evpxorq( A8, k1,  A8, Address(buf,  8 * 8), false, 
> Assembler::AVX_128bit);
> 293:     __ cmpl(block_size, 72);

Are we reading 8 extra bytes from buf (0 to 79 instead of 0 to 71)?
K1 has 0x1F, we have q(8 byte) elements, so only last two bits of K1 would be 
used which are set to 1 so it is not limiting the memory read. I must be 
missing something.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/31125#discussion_r3251408863
PR Review Comment: https://git.openjdk.org/jdk/pull/31125#discussion_r3251409861
PR Review Comment: https://git.openjdk.org/jdk/pull/31125#discussion_r3251358621
PR Review Comment: https://git.openjdk.org/jdk/pull/31125#discussion_r3260622798
PR Review Comment: https://git.openjdk.org/jdk/pull/31125#discussion_r3262708646
PR Review Comment: https://git.openjdk.org/jdk/pull/31125#discussion_r3262802821

Reply via email to