Re: RFR: 8384353: SHA3 AVX2 and AVX512 intrinsics and improvements [v3]

Volodymyr Paprotski Tue, 19 May 2026 12:54:33 -0700

On Fri, 15 May 2026 22:59:37 GMT, Sandhya Viswanathan 
<[email protected]> wrote:


>> Volodymyr Paprotski has updated the pull request incrementally with two 
>> additional commits since the last revision:
>> 
>>  - whitespace
>>  - comments from Sandhya
>
> src/hotspot/cpu/x86/assembler_x86.cpp line 9522:
> 
>> 9520:   InstructionAttr attributes(vector_len, /* vex_w */ 
>> VM_Version::supports_evex(), /* legacy_mode */ false, /* no_mask_reg */ 
>> true, /* uses_vl */ true);
>> 9521:   attributes.set_rex_vex_w_reverted();
>> 9522:   attributes.set_address_attributes(/* tuple_type */ EVEX_FV, /* 
>> input_size_in_bits */ EVEX_NObit);
> 
> Tuple type should be EVEX_M128.

done, thanks for catching. Looks like another 'AVX2-only' instruction I needed, 
though for this instruction, I think I did run this on EVEX machine with 
vector_len=256..

> src/hotspot/cpu/x86/assembler_x86.cpp line 9627:
> 
>> 9625:   InstructionAttr attributes(vector_len, /* vex_w */ 
>> VM_Version::supports_evex(), /* legacy_mode */ false, /* no_mask_reg */ 
>> true, /* uses_vl */ true);
>> 9626:   attributes.set_rex_vex_w_reverted();
>> 9627:   attributes.set_address_attributes(/* tuple_type */ EVEX_FV, /* 
>> input_size_in_bits */ EVEX_NObit);
> 
> Tuple type should be EVEX_M128.

done

> src/hotspot/cpu/x86/assembler_x86.cpp line 9778:
> 
>> 9776:          vector_len == AVX_256bit ? VM_Version::supports_avx2() :
>> 9777:          vector_len == AVX_512bit ? VM_Version::supports_evex() : 0, 
>> "");
>> 9778:   InstructionMark im(this);
> 
> As the instruction is only supported for AVX2 and above, the following check:
>    vector_len == AVX_128bit ? VM_Version::supports_avx()  
> should be written as
>      vector_len == AVX_128bit ? VM_Version::supports_avx2()
> Also then the assert(UseAVX > 1, "requires AVX2") becomes redundant and so 
> could be removed.
> 
> Likewise for vpsrlvq.

done

> src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 2:
> 
>> 1: /*
>> 2:  * Copyright (c) 2024, 2026, Oracle and/or its affiliates. All rights 
>> reserved.
> 
> This should remain as 2025.

done

> src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 163:
> 
>> 161:   int vector_len = Assembler::AVX_128bit;
>> 162:   bool multiBlock = stub_id == StubId::stubgen_sha3_implCompressMB_id;
>> 163:   bool doubleKeccak = true;
> 
> Instead of doubleKeccak may be we should name this more generic e.g. 
> parallelKeccak or multiKeccak.

I like parallel, since 'multi' is already taken by 'multi buffer' to mean 
several sequential buffers.. done

> src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 239:
> 
>> 237:       __ vshufpd(T0, C0, C1, 0b00, Assembler::AVX_128bit);
>> 238:       __ vshufpd(T1, C0, C1, 0b11, Assembler::AVX_128bit);
>> 239:       __ evinserti64x2(X1, X1, T0, 0b01, Assembler::AVX_256bit);
> 
> This is a AVX512DQ instruction.

Hmm.. this one needs some thought.. the trivial fix is to add dq as a 
requirement.. but so far as I can spot, this is the only DQ instruction in 
here.. maybe I can change the order of all the shuffles to be able to use a 
different instruction..

> src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 248:
> 
>> 246:     } else {
>> 247:       __ vmovdqu(X1, Address(state1, disp), Assembler::AVX_128bit);
>> 248:       __ vshufpd(X2, X1, X2, 0b01, Assembler::AVX_128bit);
> 
> When called for stubgen_sha3* the control reaches here. Nothing seems to be 
> loaded into X2 prior but it is used in vshufpd?

That is correct.. but you are right, very misleading. for the non-parallel 
keccak, (for AVX512 version) we only care about the first column, so the `1` in 
`0b01` doesnt matter ('dont care' bit.. but no X here..). I think a comment is 
appropriate here..

> src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 293:
> 
>> 291:     __ evpxorq( A7, k1,  A7, Address(buf,  7 * 8), false, 
>> Assembler::AVX_128bit);
>> 292:     __ evpxorq( A8, k1,  A8, Address(buf,  8 * 8), false, 
>> Assembler::AVX_128bit);
>> 293:     __ cmpl(block_size, 72);
> 
> Are we reading 8 extra bytes from buf (0 to 79 instead of 0 to 71)?
> K1 has 0x1F, we have q(8 byte) elements, so only last two bits of K1 would be 
> used which are set to 1 so it is not limiting the memory read. I must be 
> missing something.

ooh.. this one. this really tripped me up, because the way this intrinsic loops 
over user data is very non-intuitive (though its probably more 
intrinsic-friendly). Best I can describe: "limit is the start of the last 
block". From DigestBase.java


        // compress complete blocks
        if (len >= blockSize) {
            int limit = ofs + len;
            ofs = implCompressMultiBlock(b, ofs, limit - blockSize); // <<<<< 
HERE!!!
            len = limit - ofs;
        }
        // copy remainder to buffer
        if (len > 0) {
            System.arraycopy(b, ofs, buffer, 0, len);
            bufOfs = len;
        }
    }
    // compress complete blocks
    private int implCompressMultiBlock(byte[] b, int ofs, int limit) {
        implCompressMultiBlockCheck(b, ofs, limit);
        return implCompressMultiBlock0(b, ofs, limit);
    }

    @IntrinsicCandidate
    private int implCompressMultiBlock0(byte[] b, int ofs, int limit) {
        for (; ofs <= limit; ofs += blockSize) {
            implCompress(b, ofs);
        }
        return ofs;
    }

    private void implCompressMultiBlockCheck(byte[] b, int ofs, int limit) {
        if (limit < 0) {
            return;  // not an error because implCompressMultiBlockImpl won't 
execute if limit < 0
                     // and an exception is thrown if ofs < 0.
        }

        Objects.requireNonNull(b);
        Preconditions.checkIndex(ofs, b.length, Preconditions.AIOOBE_FORMATTER);

        int endIndex = (limit / blockSize) * blockSize  + blockSize - 1; // 
<<<<< HERE
        if (endIndex >= b.length) {
            throw new ArrayIndexOutOfBoundsException(endIndex);
        }
    }

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/31125#discussion_r3269188752
PR Review Comment: https://git.openjdk.org/jdk/pull/31125#discussion_r3269188317
PR Review Comment: https://git.openjdk.org/jdk/pull/31125#discussion_r3269189060
PR Review Comment: https://git.openjdk.org/jdk/pull/31125#discussion_r3269187997
PR Review Comment: https://git.openjdk.org/jdk/pull/31125#discussion_r3269186891
PR Review Comment: https://git.openjdk.org/jdk/pull/31125#discussion_r3269186849
PR Review Comment: https://git.openjdk.org/jdk/pull/31125#discussion_r3269187722
PR Review Comment: https://git.openjdk.org/jdk/pull/31125#discussion_r3269187093

Re: RFR: 8384353: SHA3 AVX2 and AVX512 intrinsics and improvements [v3]

Reply via email to