Re: RFR: 8360934: Add AVX-512 intrinsics for ML-KEM - enhancement on AVX512_VBMI [v4]

Shawn M Emery Wed, 14 Jan 2026 07:07:40 -0800

On Mon, 12 Jan 2026 07:23:39 GMT, Shawn M Emery <[email protected]> wrote:


>>> > Better to align loop starting address to OptoLoopAlignment
>>> 
>>> For parity, should I do this for the other labels in the file as well?
>>> 
>>> > I will run the micro benchmark on AMD Turin and report back by early
>>> > next week.
>>> 
>>> That would be great, thank you for doing this!
>> 
>> Here are the scores on Turin.
>> 
>> 
>> Baseline:
>> Benchmark                                    (algorithm)  (keyLength)  (provider)   Mode  Cnt      Score   Error  Units
>> KeyPairGeneratorBench.MLKEM.generateKeyPair   ML-KEM-512            0              thrpt    2  62235.790          ops/s
>> KeyPairGeneratorBench.MLKEM.generateKeyPair   ML-KEM-768            0              thrpt    2  38238.390          ops/s
>> KeyPairGeneratorBench.MLKEM.generateKeyPair  ML-KEM-1024            0              thrpt    2  24725.512          ops/s
>> 
>> Withopt:
>> Benchmark                                    (algorithm)  (keyLength)  (provider)   Mode  Cnt      Score   Error  Units
>> KeyPairGeneratorBench.MLKEM.generateKeyPair   ML-KEM-512            0              thrpt    2  62483.697          ops/s
>> KeyPairGeneratorBench.MLKEM.generateKeyPair   ML-KEM-768            0              thrpt    2  38464.272          ops/s
>> KeyPairGeneratorBench.MLKEM.generateKeyPair  ML-KEM-1024            0              thrpt    2  24702.044          ops/s
>> 
>> 
>> 
>> Baseline:
>> Benchmark             (algorithm)  (provider)   Mode  Cnt      Score   Error  Units
>> KEMBench.decapsulate   ML-KEM-512              thrpt    2  46416.479          ops/s
>> KEMBench.decapsulate   ML-KEM-768              thrpt    2  28516.289          ops/s
>> KEMBench.decapsulate  ML-KEM-1024              thrpt    2  19250.020          ops/s
>> KEMBench.encapsulate   ML-KEM-512              thrpt    2  60374.724          ops/s
>> KEMBench.encapsulate   ML-KEM-768              thrpt    2  36226.100          ops/s
>> KEMBench.encapsulate  ML-KEM-1024              thrpt    2  23656.223          ops/s
>> 
>> Withopt:
>> Benchmark             (algorithm)  (provider)   Mode  Cnt      Score   Error  Units
>> KEMBench.decapsulate   ML-KEM-512              thrpt    2  46730.153          ops/s
>> KEMBench.decapsulate   ML-KEM-768              thrpt    2  28650.349          ops/s
>> KEMBench.decapsulate  ML-KEM-1024              thrpt    2  19390.927          ops/s
>> KEMBench.encapsulate   ML-KEM-512              thrpt    2  60238.211          ops/s
>> KEMBench.encapsulate   ML-KEM-768              thrpt    2  36454.138          ops/s
>> KEMBench.encapsulat...
>
> Thank you for sharing these results.  It is disconcerting to see the drop in 
> performance for i) key gen-1024, ii) encapsulation-512, and iii) 
> encapsulation-1024, though I don't know the SE for these runs.  During my 
> testing on an AMD EPYC 9J14 96-Core Processor I consistently get noticeable 
> performance increases for all ML-KEM operations:
> 
> [Publish ML_KEM Benchmarks - Sheet1.pdf](https://github.com/user-attachments/files/24559070/Publish.ML_KEM.Benchmarks.-.Sheet1.pdf)

Here are results comparing pre and post OptoLoopAlignment:

[Alignment ML_KEM Benchmarks - Sheet1.pdf](https://github.com/user-attachments/files/24607923/Alignment.ML_KEM.Benchmarks.-.Sheet1.pdf)
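
For a rough sense of scale, the relative change per benchmark in the quoted Turin scores can be computed as sketched below. This is only an illustration: the dictionary keys and the `percent_change` helper are made-up names, the scores are copied from the quoted JMH output, and encapsulate ML-KEM-1024 is left out because its Withopt score is cut off in the quote.

```python
# Throughput scores (ops/s) copied from the quoted Turin JMH output.
# Keys are illustrative labels, not JMH identifiers.
baseline = {
    "generateKeyPair ML-KEM-512": 62235.790,
    "generateKeyPair ML-KEM-768": 38238.390,
    "generateKeyPair ML-KEM-1024": 24725.512,
    "decapsulate ML-KEM-512": 46416.479,
    "decapsulate ML-KEM-768": 28516.289,
    "decapsulate ML-KEM-1024": 19250.020,
    "encapsulate ML-KEM-512": 60374.724,
    "encapsulate ML-KEM-768": 36226.100,
}
with_opt = {
    "generateKeyPair ML-KEM-512": 62483.697,
    "generateKeyPair ML-KEM-768": 38464.272,
    "generateKeyPair ML-KEM-1024": 24702.044,
    "decapsulate ML-KEM-512": 46730.153,
    "decapsulate ML-KEM-768": 28650.349,
    "decapsulate ML-KEM-1024": 19390.927,
    "encapsulate ML-KEM-512": 60238.211,
    "encapsulate ML-KEM-768": 36454.138,
}


def percent_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent (hypothetical helper)."""
    return (after - before) / before * 100.0


# Positive values mean higher throughput with the optimization.
changes = {
    name: percent_change(baseline[name], with_opt[name]) for name in baseline
}

for name, delta in changes.items():
    print(f"{name:30s} {delta:+.2f}%")
```

With these inputs every delta is under 1% in magnitude, in either direction. Given Cnt 2 and no reported error, it is hard to say from these numbers alone whether the small negative values are real regressions or run-to-run noise.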

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/28815#discussion_r2689366713
