On Fri, 15 May 2026 22:59:37 GMT, Sandhya Viswanathan
<[email protected]> wrote:
>> Volodymyr Paprotski has updated the pull request incrementally with two
>> additional commits since the last revision:
>>
>> - whitespace
>> - comments from Sandhya
>
> src/hotspot/cpu/x86/assembler_x86.cpp line 9522:
>
>> 9520: InstructionAttr attributes(vector_len, /* vex_w */
>> VM_Version::supports_evex(), /* legacy_mode */ false, /* no_mask_reg */
>> true, /* uses_vl */ true);
>> 9521: attributes.set_rex_vex_w_reverted();
>> 9522: attributes.set_address_attributes(/* tuple_type */ EVEX_FV, /*
>> input_size_in_bits */ EVEX_NObit);
>
> Tuple type should be EVEX_M128.
done, thanks for catching. Looks like another 'AVX2-only' instruction I needed,
though for this instruction, I think I did run this on EVEX machine with
vector_len=256..
> src/hotspot/cpu/x86/assembler_x86.cpp line 9627:
>
>> 9625: InstructionAttr attributes(vector_len, /* vex_w */
>> VM_Version::supports_evex(), /* legacy_mode */ false, /* no_mask_reg */
>> true, /* uses_vl */ true);
>> 9626: attributes.set_rex_vex_w_reverted();
>> 9627: attributes.set_address_attributes(/* tuple_type */ EVEX_FV, /*
>> input_size_in_bits */ EVEX_NObit);
>
> Tuple type should be EVEX_M128.
done
> src/hotspot/cpu/x86/assembler_x86.cpp line 9778:
>
>> 9776: vector_len == AVX_256bit ? VM_Version::supports_avx2() :
>> 9777: vector_len == AVX_512bit ? VM_Version::supports_evex() : 0,
>> "");
>> 9778: InstructionMark im(this);
>
> As the instruction is only supported for AVX2 and above, the following check:
> vector_len == AVX_128bit ? VM_Version::supports_avx()
> should be written as
> vector_len == AVX_128bit ? VM_Version::supports_avx2()
> Also then the assert(UseAVX > 1, "requires AVX2") becomes redundant and so
> could be removed.
>
> Likewise for vpsrlvq.
done
> src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 2:
>
>> 1: /*
>> 2: * Copyright (c) 2024, 2026, Oracle and/or its affiliates. All rights
>> reserved.
>
> This should remain as 2025.
done
> src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 163:
>
>> 161: int vector_len = Assembler::AVX_128bit;
>> 162: bool multiBlock = stub_id == StubId::stubgen_sha3_implCompressMB_id;
>> 163: bool doubleKeccak = true;
>
> Instead of doubleKeccak may be we should name this more generic e.g.
> parallelKeccak or multiKeccak.
I like parallel, since 'multi' is already taken by 'multi buffer' to mean
several sequential buffers.. done
> src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 239:
>
>> 237: __ vshufpd(T0, C0, C1, 0b00, Assembler::AVX_128bit);
>> 238: __ vshufpd(T1, C0, C1, 0b11, Assembler::AVX_128bit);
>> 239: __ evinserti64x2(X1, X1, T0, 0b01, Assembler::AVX_256bit);
>
> This is a AVX512DQ instruction.
Hmm.. this one needs some thought.. the trivial fix is to add dq as a
requirement.. but so far as I can spot, this is the only DQ instruction in
here.. maybe I can change the order of all the shuffles to be able to use a
different instruction..
> src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 248:
>
>> 246: } else {
>> 247: __ vmovdqu(X1, Address(state1, disp), Assembler::AVX_128bit);
>> 248: __ vshufpd(X2, X1, X2, 0b01, Assembler::AVX_128bit);
>
> When called for stubgen_sha3* the control reaches here. Nothing seems to be
> loaded into X2 prior but it is used in vshufpd?
That is correct.. but you are right, very misleading. for the non-parallel
keccak, (for AVX512 version) we only care about the first column, so the `1` in
`0b01` doesnt matter ('dont care' bit.. but no X here..). I think a comment is
appropriate here..
> src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 293:
>
>> 291: __ evpxorq( A7, k1, A7, Address(buf, 7 * 8), false,
>> Assembler::AVX_128bit);
>> 292: __ evpxorq( A8, k1, A8, Address(buf, 8 * 8), false,
>> Assembler::AVX_128bit);
>> 293: __ cmpl(block_size, 72);
>
> Are we reading 8 extra bytes from buf (0 to 79 instead of 0 to 71)?
> K1 has 0x1F, we have q(8 byte) elements, so only last two bits of K1 would be
> used which are set to 1 so it is not limiting the memory read. I must be
> missing something.
ooh.. this one. this really tripped me up, because the way this intrinsic loops
over user data is very non-intuitive (though its probably more
intrinsic-friendly). Best I can describe: "limit is the start of the last
block". From DigestBase.java
// compress complete blocks
if (len >= blockSize) {
int limit = ofs + len;
ofs = implCompressMultiBlock(b, ofs, limit - blockSize); // <<<<<
HERE!!!
len = limit - ofs;
}
// copy remainder to buffer
if (len > 0) {
System.arraycopy(b, ofs, buffer, 0, len);
bufOfs = len;
}
}
// compress complete blocks
private int implCompressMultiBlock(byte[] b, int ofs, int limit) {
implCompressMultiBlockCheck(b, ofs, limit);
return implCompressMultiBlock0(b, ofs, limit);
}
@IntrinsicCandidate
private int implCompressMultiBlock0(byte[] b, int ofs, int limit) {
for (; ofs <= limit; ofs += blockSize) {
implCompress(b, ofs);
}
return ofs;
}
private void implCompressMultiBlockCheck(byte[] b, int ofs, int limit) {
if (limit < 0) {
return; // not an error because implCompressMultiBlockImpl won't
execute if limit < 0
// and an exception is thrown if ofs < 0.
}
Objects.requireNonNull(b);
Preconditions.checkIndex(ofs, b.length, Preconditions.AIOOBE_FORMATTER);
int endIndex = (limit / blockSize) * blockSize + blockSize - 1; //
<<<<< HERE
if (endIndex >= b.length) {
throw new ArrayIndexOutOfBoundsException(endIndex);
}
}
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/31125#discussion_r3269188752
PR Review Comment: https://git.openjdk.org/jdk/pull/31125#discussion_r3269188317
PR Review Comment: https://git.openjdk.org/jdk/pull/31125#discussion_r3269189060
PR Review Comment: https://git.openjdk.org/jdk/pull/31125#discussion_r3269187997
PR Review Comment: https://git.openjdk.org/jdk/pull/31125#discussion_r3269186891
PR Review Comment: https://git.openjdk.org/jdk/pull/31125#discussion_r3269186849
PR Review Comment: https://git.openjdk.org/jdk/pull/31125#discussion_r3269187722
PR Review Comment: https://git.openjdk.org/jdk/pull/31125#discussion_r3269187093