On Sun, 11 Jan 2026 09:31:03 GMT, Jatin Bhateja <[email protected]> wrote:

>>> Better to align loop starting address to OptoLoopAlignment
>> 
>> For parity, should I do this for the other labels in the file as well?
>> 
>>> I will run the micro benchmark on AMD Turin and report back by early next 
>>> week.
>> 
>> That would be great, thank you for doing this!
>
> Just a note on loop alignment: there are multiple moving parts here. First, 
> aligning the starting address of a loop to 64 bytes ([recommendation from the 
> Zen5 optimization guide](https://docs.amd.com/v/u/en-US/58455_1.00), section 
> 2.8.3) ensures that small loop bodies are not split across a cache line. If 
> that happens, there is a code-entry penalty, since for the first iteration of 
> the loop the front end has to read multiple L1I cache lines. Once the loop is 
> decoded and its uops are in the op-cache (AMD) or DSB (Intel), the uop stream 
> for successive loop iterations is emitted from the op-cache. Since the 
> op-cache is shared between the 2 HW threads in an SMT configuration, in 
> noisy-neighbor scenarios or on context switches we may hit the code-entry 
> penalty again during the lifetime of the loop. 
> 
> So it's advisable to add alignment in this case for the other labels before 
> loops; we already have OptoLoopAlignment in place.
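
To make the cache-line argument above concrete, here is a minimal sketch of the arithmetic. It assumes 64-byte L1I cache lines (the figure quoted above); the addresses, the 40-byte loop size, and the helper names are hypothetical and chosen only for illustration. The actual value of OptoLoopAlignment is not stated in this thread, so 64 here is just the alignment mentioned in the quoted note.

```python
CACHE_LINE = 64  # assumed L1I cache line size in bytes


def padding_to_align(addr: int, alignment: int = CACHE_LINE) -> int:
    """Bytes of padding needed so that addr lands on an alignment boundary."""
    return (-addr) % alignment


def lines_touched(start: int, size: int, line: int = CACHE_LINE) -> int:
    """Number of cache lines spanned by a code region of `size` bytes at `start`."""
    first = start // line
    last = (start + size - 1) // line
    return last - first + 1


# Hypothetical loop: 40 bytes of code starting 48 bytes into a cache line.
loop_start = 0x1030
loop_size = 40

pad = padding_to_align(loop_start)
print("unaligned lines touched:", lines_touched(loop_start, loop_size))
print("padding needed:", pad)
print("aligned lines touched:", lines_touched(loop_start + pad, loop_size))
```

In this hypothetical case the unaligned loop body straddles two cache lines, while the aligned one fits in a single line, which is the situation the note above describes for small loop bodies.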

Here are the scores on Turin.


Baseline:
Benchmark                                    (algorithm)  (keyLength)  (provider)   Mode  Cnt      Score   Error  Units
KeyPairGeneratorBench.MLKEM.generateKeyPair   ML-KEM-512            0              thrpt    2  62235.790          ops/s
KeyPairGeneratorBench.MLKEM.generateKeyPair   ML-KEM-768            0              thrpt    2  38238.390          ops/s
KeyPairGeneratorBench.MLKEM.generateKeyPair  ML-KEM-1024            0              thrpt    2  24725.512          ops/s

With opt:
Benchmark                                    (algorithm)  (keyLength)  (provider)   Mode  Cnt      Score   Error  Units
KeyPairGeneratorBench.MLKEM.generateKeyPair   ML-KEM-512            0              thrpt    2  62483.697          ops/s
KeyPairGeneratorBench.MLKEM.generateKeyPair   ML-KEM-768            0              thrpt    2  38464.272          ops/s
KeyPairGeneratorBench.MLKEM.generateKeyPair  ML-KEM-1024            0              thrpt    2  24702.044          ops/s



Baseline:
Benchmark             (algorithm)  (provider)   Mode  Cnt      Score   Error  Units
KEMBench.decapsulate   ML-KEM-512              thrpt    2  46416.479          ops/s
KEMBench.decapsulate   ML-KEM-768              thrpt    2  28516.289          ops/s
KEMBench.decapsulate  ML-KEM-1024              thrpt    2  19250.020          ops/s
KEMBench.encapsulate   ML-KEM-512              thrpt    2  60374.724          ops/s
KEMBench.encapsulate   ML-KEM-768              thrpt    2  36226.100          ops/s
KEMBench.encapsulate  ML-KEM-1024              thrpt    2  23656.223          ops/s

With opt:
Benchmark             (algorithm)  (provider)   Mode  Cnt      Score   Error  Units
KEMBench.decapsulate   ML-KEM-512              thrpt    2  46730.153          ops/s
KEMBench.decapsulate   ML-KEM-768              thrpt    2  28650.349          ops/s
KEMBench.decapsulate  ML-KEM-1024              thrpt    2  19390.927          ops/s
KEMBench.encapsulate   ML-KEM-512              thrpt    2  60238.211          ops/s
KEMBench.encapsulate   ML-KEM-768              thrpt    2  36454.138          ops/s
KEMBench.encapsulate  ML-KEM-1024              thrpt    2  23649.839          ops/s


The system was set to a fixed frequency of 2.7 GHz during benchmarking.
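
For easier comparison, the relative change between the two runs can be computed as in the sketch below. The scores are copied from the tables above; the dictionary layout and the `percent_change` helper are illustrative names, not part of the benchmark suite. Note that each result has Cnt 2 and no reported error, so these deltas say nothing about run-to-run noise.

```python
# (baseline ops/s, with-opt ops/s), copied from the tables above
scores = {
    "generateKeyPair ML-KEM-512": (62235.790, 62483.697),
    "generateKeyPair ML-KEM-768": (38238.390, 38464.272),
    "generateKeyPair ML-KEM-1024": (24725.512, 24702.044),
    "decapsulate ML-KEM-512": (46416.479, 46730.153),
    "decapsulate ML-KEM-768": (28516.289, 28650.349),
    "decapsulate ML-KEM-1024": (19250.020, 19390.927),
    "encapsulate ML-KEM-512": (60374.724, 60238.211),
    "encapsulate ML-KEM-768": (36226.100, 36454.138),
    "encapsulate ML-KEM-1024": (23656.223, 23649.839),
}


def percent_change(baseline: float, optimized: float) -> float:
    """Relative throughput change in percent; positive means with-opt is faster."""
    return (optimized - baseline) / baseline * 100.0


deltas = {name: percent_change(b, o) for name, (b, o) in scores.items()}

for name, delta in deltas.items():
    print(f"{name:30s} {delta:+.2f}%")
```

Every delta computed this way is smaller than 1% in magnitude, in either direction.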

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/28815#discussion_r2679382599
