Re: RFR: 8360934: Add AVX-512 intrinsics for ML-KEM - enhancement on AVX512_VBMI and AVX512_VBMI2 [v2]

2026-01-07 Thread Ferenc Rakoczi
On Wed, 7 Jan 2026 17:59:52 GMT, Volodymyr Paprotski  
wrote:

>> In ML_KEM.java there is this  assert (and this is the only call to  
>> implKyber12To16() 
>> 
>> assert ((remainder == 0) || (remainder == 48)) &&
>> (index + i * 96 <= condensed.length);
>> implKyber12To16(condensed, index, parsed, parsedLength);
>> 
>> and one can check how the callers of twelve2Sixteen() make sure that this is 
>> the case.
>
> Yep, thats exactly the assert I was looking at as well.. looks to me like its 
> dividing the 'expanded-short-array-length' by 64 and ensuring the remainder 
> is zero (ignoring the 48 for a bit.. and the condensed-length check).
> 
> (for simplicity) So the 'expanded' array length should be a multiple of 64; 
> i.e. 128-bytes. But we stride the expanded array by 256 bytes? (and 
> parsedLength by 128-shorts..)
> 
> I haven't checked the callers of `twelve2Sixteen` but I suspect that the 
> length of the expanded array is always a multiple of 256-bytes (128-shorts).. 
> in which case, the assert is 'incomplete'?

Oooops, yes, the assert and the comment on twelve2sixteen() should be fixed. 
All of the calls are processing 192 or 384 bytes (and producing 128 or 256 
shorts). The comment and assert belonged to an earlier version and were not 
updated when I changed my mind about the implementation.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/28815#discussion_r2669726088


Re: RFR: 8360934: Add AVX-512 intrinsics for ML-KEM - enhancement on AVX512_VBMI and AVX512_VBMI2 [v2]

2026-01-07 Thread Volodymyr Paprotski
On Wed, 7 Jan 2026 17:47:59 GMT, Ferenc Rakoczi  wrote:

>> I wasn't as clear in my question. The asm is indeed processing the bytes in 
>> the increment. What I was trying to convince myself about.. 'how come we are 
>> not reading past the end of the array. Or are we?'.
>> 
>> On one hand, this is exactly what the existing asm code does, so I will 
>> assume that its correct. However, on the java side/version of this code, I 
>> could only convince myself about processing ~two AVX512 vectors at a time, 
>> not four.
>> 
>> So either I cant count, or there is some further (implicit) restrictions on 
>> the callers of `twelve2Sixteen`
>
> In ML_KEM.java there is this  assert (and this is the only call to  
> implKyber12To16() 
> 
> assert ((remainder == 0) || (remainder == 48)) &&
> (index + i * 96 <= condensed.length);
> implKyber12To16(condensed, index, parsed, parsedLength);
> 
> and one can check how the callers of twelve2Sixteen() make sure that this is 
> the case.

Yep, thats exactly the assert I was looking at as well.. looks to me like its 
dividing the 'expanded-short-array-length' by 64 and ensuring the remainder is 
zero (ignoring the 48 for a bit.. and the condensed-length check).

(for simplicity) So the 'expanded' array length should be a multiple of 64; 
i.e. 128-bytes. But we stride the expanded array by 256 bytes? (and 
parsedLength by 128-shorts..)

I haven't checked the callers of `twelve2Sixteen` but I suspect that the length 
of the expanded array is always a multiple of 256-bytes (128-shorts).. in which 
case, the assert is 'incomplete'?

-

PR Review Comment: https://git.openjdk.org/jdk/pull/28815#discussion_r2669535184


Re: RFR: 8360934: Add AVX-512 intrinsics for ML-KEM - enhancement on AVX512_VBMI and AVX512_VBMI2 [v2]

2026-01-07 Thread Ferenc Rakoczi
On Wed, 7 Jan 2026 16:43:30 GMT, Volodymyr Paprotski  
wrote:

>> I believe the numbers are right: with each pass 256 bytes of coefficients 
>> are `parsed` into the parse buffer.  This means that half of the 
>> coefficients have been processed (`parsedLength` = 128).  Would having a 
>> comment stating as such address your concerns?
>
> I wasn't as clear in my question. The asm is indeed processing the bytes in 
> the increment. What I was trying to convince myself about.. 'how come we are 
> not reading past the end of the array. Or are we?'.
> 
> On one hand, this is exactly what the existing asm code does, so I will 
> assume that its correct. However, on the java side/version of this code, I 
> could only convince myself about processing ~two AVX512 vectors at a time, 
> not four.
> 
> So either I cant count, or there is some further (implicit) restrictions on 
> the callers of `twelve2Sixteen`

In ML_KEM.java there is this  assert (and this is the only call to  
implKyber12To16() 

assert ((remainder == 0) || (remainder == 48)) &&
(index + i * 96 <= condensed.length);
implKyber12To16(condensed, index, parsed, parsedLength);

and one can check how the callers of twelve2Sixteen() make sure that this is 
the case.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/28815#discussion_r2669490940


Re: RFR: 8360934: Add AVX-512 intrinsics for ML-KEM - enhancement on AVX512_VBMI and AVX512_VBMI2 [v2]

2026-01-07 Thread Volodymyr Paprotski
On Wed, 7 Jan 2026 06:19:09 GMT, Shawn M Emery  wrote:

>> src/hotspot/cpu/x86/stubGenerator_x86_64_kyber.cpp line 906:
>> 
>>> 904:   __ addptr(condensed, 192);
>>> 905:   __ addptr(parsed, 256);
>>> 906:   __ subl(parsedLength, 128);
>> 
>> (128 instead of 256 here because `parsedLength` is an index to an `short` 
>> array..)
>> 
>> I am confused by the stride. The `twelve2Sixteen()` seems to (almost) 
>> guarantee that the parsed length is a multiple of 64 (last block can be 48 
>> bytes). This would imply a stride of 128 bytes for `parsed`. And 96 for 
>> `condensed`.
>> 
>> This is exactly how the existing code already behaves so I am less 
>> concerned, but I would like an explanation why it works?
>
> I believe the numbers are right: with each pass 256 bytes of coefficients are 
> `parsed` into the parse buffer.  This means that half of the coefficients 
> have been processed (`parsedLength` = 128).  Would having a comment stating 
> as such address your concerns?

I wasn't as clear in my question. The asm is indeed processing the bytes in the 
increment. What I was trying to convince myself about.. 'how come we are not 
reading past the end of the array. Or are we?'.

On one hand, this is exactly what the existing asm code does, so I will assume 
that its correct. However, on the java side/version of this code, I could only 
convince myself about processing ~two AVX512 vectors at a time, not four.

So either I cant count, or there is some further (implicit) restrictions on the 
callers of `twelve2Sixteen`

-

PR Review Comment: https://git.openjdk.org/jdk/pull/28815#discussion_r2669202305


Re: RFR: 8360934: Add AVX-512 intrinsics for ML-KEM - enhancement on AVX512_VBMI and AVX512_VBMI2 [v2]

2026-01-07 Thread Volodymyr Paprotski
On Wed, 7 Jan 2026 06:18:55 GMT, Shawn M Emery  wrote:

>> src/hotspot/cpu/x86/stubGenerator_x86_64_kyber.cpp line 862:
>> 
>>> 860:   __ addptr(condensed, condensedOffs);
>>> 861: 
>>> 862:   if (VM_Version::supports_avx512_vbmi2()) {
>> 
>> Which instruction needs vbmi2? All I could spot was that `evpermb` that 
>> needs vbmi. Relax the restriction slightly?
>
> Good catch!  Initially the code was using `vpshldvw`, but was changed to just 
> use `vpsrlvw`.  Fixed in next commit.
> I should probably update the bug synopsis to exclude VBMI2?

I would be happy with just the code being pedantic. Everything else is 'just 
nice' :)

-

PR Review Comment: https://git.openjdk.org/jdk/pull/28815#discussion_r2669186665


Re: RFR: 8360934: Add AVX-512 intrinsics for ML-KEM - enhancement on AVX512_VBMI and AVX512_VBMI2 [v2]

2026-01-07 Thread Volodymyr Paprotski
On Wed, 7 Jan 2026 13:18:50 GMT, Ferenc Rakoczi  wrote:

> > "Insert 0b nibble after every third nibble". I only have two questions, 
> > looks good otherwise.

@ferakocz apologies for the misunderstanding; everything after the PS was not a 
request for change.. those were the questions that occurred to me and I found 
the answer.. The only reason I put them in was for the next reviewer. Or if I 
am wrong, e.g. no, I did not find a better instruction than vpermb either. (My 
first reaction to seeing the java code, was 'oh, this is easy, just a `vpermb`, 
then had to reason out why not..)

-

PR Comment: https://git.openjdk.org/jdk/pull/28815#issuecomment-3719734572


Re: RFR: 8360934: Add AVX-512 intrinsics for ML-KEM - enhancement on AVX512_VBMI and AVX512_VBMI2 [v2]

2026-01-07 Thread Ferenc Rakoczi
On Wed, 7 Jan 2026 00:18:43 GMT, Volodymyr Paprotski  
wrote:

> "Insert 0b nibble after every third nibble". I only have two questions, 
> looks good otherwise.

Yes, that is the idea.

> 
> PS: things I've considered:
> 
> * Loop controls?
>   
>   * ML_KEM.java guarantees  (per callee comment and assert) lengths are 
> multiple of 64
>   * also same as original code
> * Why not simply a vpermb? Have zeroes already from the masked load with k1..

It *is* using vpermb (evpermb() generates the EVEX encoded VPERMB)

>   
>   * shuffle granularity is actually 4-bits, not 8-bits

Really? In what instruction? I hadn't found it in the manual.

> * logical shift already zeroes top bits, so `vpand` not required?

Only every 2nd byte is shifted, the rest needs to be masked.
>   
>   * odd columns not shifted, so still have extra bits that need clearing

Yes, that is what the vpand does. (actually, it also (unnecessarily) masks the 
shifted bytes.

> * Why VBMI?
>   
>   * needed for `evpermb`

Yes.

-

PR Comment: https://git.openjdk.org/jdk/pull/28815#issuecomment-3718842604


Re: RFR: 8360934: Add AVX-512 intrinsics for ML-KEM - enhancement on AVX512_VBMI and AVX512_VBMI2 [v2]

2026-01-06 Thread Shawn M Emery
On Tue, 6 Jan 2026 23:29:52 GMT, Volodymyr Paprotski  
wrote:

>> Shawn M Emery has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   Update copyright year
>
> src/hotspot/cpu/x86/stubGenerator_x86_64_kyber.cpp line 862:
> 
>> 860:   __ addptr(condensed, condensedOffs);
>> 861: 
>> 862:   if (VM_Version::supports_avx512_vbmi2()) {
> 
> Which instruction needs vbmi2? All I could spot was that `evpermb` that needs 
> vbmi. Relax the restriction slightly?

Good catch!  Initially the code was using `vpshldvw`, but was changed to just 
use `vpsrlvw`.  Fixed in next commit.
I should probably update the bug synopsis to exclude VBMI2?

> src/hotspot/cpu/x86/stubGenerator_x86_64_kyber.cpp line 906:
> 
>> 904:   __ addptr(condensed, 192);
>> 905:   __ addptr(parsed, 256);
>> 906:   __ subl(parsedLength, 128);
> 
> (128 instead of 256 here because `parsedLength` is an index to an `short` 
> array..)
> 
> I am confused by the stride. The `twelve2Sixteen()` seems to (almost) 
> guarantee that the parsed length is a multiple of 64 (last block can be 48 
> bytes). This would imply a stride of 128 bytes for `parsed`. And 96 for 
> `condensed`.
> 
> This is exactly how the existing code already behaves so I am less concerned, 
> but I would like an explanation why it works?

I believe the numbers are right: with each pass 256 bytes of coefficients are 
`parsed` into the parse buffer.  This means that half of the coefficients have 
been processed (`parseLength` = 128).  Would having a comment stating as such 
be sufficient for your concerns?

-

PR Review Comment: https://git.openjdk.org/jdk/pull/28815#discussion_r2667206396
PR Review Comment: https://git.openjdk.org/jdk/pull/28815#discussion_r2667206828


Re: RFR: 8360934: Add AVX-512 intrinsics for ML-KEM - enhancement on AVX512_VBMI and AVX512_VBMI2 [v2]

2026-01-06 Thread Volodymyr Paprotski
On Sat, 3 Jan 2026 00:23:13 GMT, Shawn M Emery  wrote:

>> This change allows use of the AVX512_VBMI/VMBI2 instruction set to further 
>> optimize decompression/parsing of polynomial coefficients for ML-KEM.  The 
>> speedup gained in the ML-KEM benchmarks for key generation is between 0.2 to 
>> 0.5%, encapsulation is  0.3 to 1.5%, and decapsulation is 0 to 0.9%.
>> 
>> Thank you to @sviswa7 and @ferakocz for their help in working through the 
>> early stages of this code with me.
>
> Shawn M Emery has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Update copyright year

"Insert 0b nibble after every third nibble". I only have two questions, 
looks good otherwise.


PS: things I've considered:

- Loop controls?
  - ML_KEM.java guarantees  (per callee comment and assert) lengths are 
multiple of 64
  - also same as original code
- Why not simply a vpermb? Have zeroes already from the masked load with k1..
  - shuffle granularity is actually 4-bits, not 8-bits
- logical shift already zeroes top bits, so `vpand` not required?
  - odd columns not shifted, so still have extra bits that need clearing
- Why VBMI?
  - needed for `evpermb`

src/hotspot/cpu/x86/stubGenerator_x86_64_kyber.cpp line 862:

> 860:   __ addptr(condensed, condensedOffs);
> 861: 
> 862:   if (VM_Version::supports_avx512_vbmi2()) {

Which instruction needs vbmi2? All I could spot was that `evpermb` that needs 
vbmi. Relax the restriction slightly?

src/hotspot/cpu/x86/stubGenerator_x86_64_kyber.cpp line 906:

> 904:   __ addptr(condensed, 192);
> 905:   __ addptr(parsed, 256);
> 906:   __ subl(parsedLength, 128);

(128 instead of 256 here because `parsedLength` is an index to an `short` 
array..)

I am confused by the stride. The `twelve2Sixteen()` seems to (almost) guarantee 
that the parsed length is a multiple of 64 (last block can be 48 bytes). This 
would imply a stride of 128 bytes for `parsed`. And 96 for `condensed`.

This is exactly how the existing code already behaves so I am less concerned, 
but I would like an explanation why it works?

-

PR Review: https://git.openjdk.org/jdk/pull/28815#pullrequestreview-3632845110
PR Review Comment: https://git.openjdk.org/jdk/pull/28815#discussion_r2666594767
PR Review Comment: https://git.openjdk.org/jdk/pull/28815#discussion_r263039


Re: RFR: 8360934: Add AVX-512 intrinsics for ML-KEM - enhancement on AVX512_VBMI and AVX512_VBMI2 [v2]

2026-01-02 Thread Shawn M Emery
> This change allows use of the AVX512_VBMI/VMBI2 instruction set to further 
> optimize decompression/parsing of polynomial coefficients for ML-KEM.  The 
> speedup gained in the ML-KEM benchmarks for key generation is between 0.2 to 
> 0.5%, encapsulation is  0.3 to 1.5%, and decapsulation is 0 to 0.9%.
> 
> Thank you to @sviswa7 and @ferakocz for their help in working through the 
> early stages of this code with me.

Shawn M Emery has updated the pull request incrementally with one additional 
commit since the last revision:

  Update copyright year

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/28815/files
  - new: https://git.openjdk.org/jdk/pull/28815/files/d2cadaf9..7cd8de53

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=28815&range=01
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=28815&range=00-01

  Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
  Patch: https://git.openjdk.org/jdk/pull/28815.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/28815/head:pull/28815

PR: https://git.openjdk.org/jdk/pull/28815


RFR: 8360934: Add AVX-512 intrinsics for ML-KEM - enhancement on AVX512_VBMI and AVX512_VBMI2

2025-12-21 Thread Shawn M. Emery

This change allows use of the AVX512_VBMI/VMBI2 instruction set to further 
optimize decompression/parsing of polynomial coefficients for ML-KEM.  The 
speedup gained in the ML-KEM benchmarks for key generation is between 0.2 to 
0.5%, encapsulation is  0.3 to 1.5%, and decapsulation is 0 to 0.9%.

Thank you to @sviswa7 and @ferakocz for their help in working through the early 
stages of this code with me.

-

Commit messages:
 - Merge with mainline
 - Swap parameter operation with source
 - Remove wrong mask from evpsrlvw
 - Reverse ordering for vpermb and vpsrlvw instructions
 - Switch from vpshldvw to vpsrlvw
 - Fix whitespaces
 - 8360934: Add AVX-512 intrinsics for ML-KEM - enhancement on AVX512_VBMI and 
AVX512_VBMI2

Changes:https://git.openjdk.org/jdk/pull/28815/files
  Webrev:https://webrevs.openjdk.org/?repo=jdk&pr=28815&range=00 

  Issue:https://bugs.openjdk.org/browse/JDK-8360934
  Stats: 88 lines in 1 file changed: 87 ins; 0 del; 1 mod
  Patch:https://git.openjdk.org/jdk/pull/28815.diff
  Fetch: git fetchhttps://git.openjdk.org/jdk.git pull/28815/head:pull/28815

PR:https://git.openjdk.org/jdk/pull/28815