Re: RFR: 8360934: Add AVX-512 intrinsics for ML-KEM - enhancement on AVX512_VBMI and AVX512_VBMI2 [v2]

Volodymyr Paprotski Wed, 07 Jan 2026 08:47:19 -0800

On Wed, 7 Jan 2026 06:19:09 GMT, Shawn M Emery <[email protected]> wrote:


>> src/hotspot/cpu/x86/stubGenerator_x86_64_kyber.cpp line 906:
>> 
>>> 904:       __ addptr(condensed, 192);
>>> 905:       __ addptr(parsed, 256);
>>> 906:       __ subl(parsedLength, 128);
>> 
>> (128 instead of 256 here because `parsedLength` is an index into a `short` 
>> array.)
>> 
>> I am confused by the stride. The `twelve2Sixteen()` seems to (almost) 
>> guarantee that the parsed length is a multiple of 64 (last block can be 48 
>> bytes). This would imply a stride of 128 bytes for `parsed`. And 96 for 
>> `condensed`.
>> 
>> This is exactly how the existing code already behaves so I am less 
>> concerned, but I would like an explanation why it works?
>
> I believe the numbers are right: with each pass 256 bytes of coefficients are 
> `parsed` into the parse buffer.  This means that half of the coefficients 
> have been processed (`parsedLength` = 128).  Would having a comment stating 
> as such address your concerns?

I wasn't clear enough in my question. The asm does indeed process the number of 
bytes given by the increments. What I was trying to convince myself of is: 'how 
come we are not reading past the end of the array? Or are we?'

On one hand, this is exactly what the existing asm code does, so I will assume 
that it's correct. However, on the Java side of this code, I could only 
convince myself that ~two AVX512 vectors are processed at a time, not four.

So either I can't count, or there are some further (implicit) restrictions on 
the callers of `twelve2Sixteen`.
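
To make the counting concrete, here is a rough sketch of the arithmetic. It is 
not the HotSpot stub or the Java implementation: the class `StrideCheck` and 
its methods are made up for illustration, and the loop condition (keep going 
while the remaining count is positive) is an assumption about how such a loop 
might terminate. It only relies on 12-bit packing (3 condensed bytes hold 2 
coefficients), 16-bit parsed coefficients, and 64-byte AVX-512 registers.

```java
public class StrideCheck {
    static final int ZMM_BYTES = 64; // width of one AVX-512 register

    // 12-bit packing: every 3 condensed bytes hold 2 coefficients.
    static int condensedBytes(int coefficients) {
        return coefficients / 2 * 3;
    }

    // Each parsed coefficient is stored as a 16-bit short.
    static int parsedBytes(int coefficients) {
        return coefficients * 2;
    }

    // Hypothetical loop model (an assumption, not the actual stub):
    // consume `stride` coefficients per pass while any remain, and
    // report how many coefficients past `parsedLength` the last pass touches.
    static int overrun(int parsedLength, int stride) {
        int remaining = parsedLength;
        while (remaining > 0) {
            remaining -= stride;
        }
        return -remaining;
    }

    public static void main(String[] args) {
        // Stride of 128 coefficients: 192 condensed bytes, 256 parsed bytes,
        // i.e. four AVX-512 vectors of parsed output per pass.
        System.out.println(condensedBytes(128) + " " + parsedBytes(128)
                + " " + parsedBytes(128) / ZMM_BYTES);
        // Stride of 64 coefficients: 96 condensed bytes, 128 parsed bytes,
        // i.e. two AVX-512 vectors of parsed output per pass.
        System.out.println(condensedBytes(64) + " " + parsedBytes(64)
                + " " + parsedBytes(64) / ZMM_BYTES);
        // A length that is a multiple of 64 but not of 128:
        System.out.println(overrun(320, 128)); // overshoots by 64 coefficients
        System.out.println(overrun(320, 64));  // no overshoot
    }
}
```

Under that model, a stride of 128 coefficients is only safe if the length is a 
multiple of 128 (or the buffers are padded), which is the implicit restriction 
I am asking about.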

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/28815#discussion_r2669202305
