On Sun, 11 Jan 2026 09:31:03 GMT, Jatin Bhateja <[email protected]> wrote:
>>> Better to align loop starting address to OptoLoopAlignment
>>
>> For parity, should I do this for the other labels in the file as well?
>>
>>> I will run the micro benchmark on AMD Turin and report back by early next
>>> week.
>>
>> That would be great, thank you for doing this!
>
> Just a note on LoopAlignment: there are multiple moving parts here. First,
> aligning the starting address of a loop to 64 bytes ([recommendation from the
> Zen5 optimization guide](https://docs.amd.com/v/u/en-US/58455_1.00), section
> 2.8.3) ensures that small loop bodies are not split across a cache line. If
> that happens, there is a code-entry penalty, since for the first iteration of
> the loop the front end has to read multiple L1I cache lines. Once the loop is
> decoded and its uops are in the op cache (AMD) or DSB (Intel), the uop stream
> for successive loop iterations is emitted from the op cache. Since the op
> cache is shared between the 2 HW threads in an SMT configuration, in
> noisy-neighbor scenarios or on context switches we may hit the code-entry
> penalty again during the lifetime of the loop.
>
> So it's advisable to add alignment in this case for the other labels before
> loops; we already have OptoLoopAlignment in place.

Here are the scores on Turin.
Baseline:
Benchmark                                    (algorithm)  (keyLength)  (provider)   Mode  Cnt      Score  Error  Units
KeyPairGeneratorBench.MLKEM.generateKeyPair   ML-KEM-512            0              thrpt    2  62235.790         ops/s
KeyPairGeneratorBench.MLKEM.generateKeyPair   ML-KEM-768            0              thrpt    2  38238.390         ops/s
KeyPairGeneratorBench.MLKEM.generateKeyPair  ML-KEM-1024            0              thrpt    2  24725.512         ops/s

Withopt:
Benchmark                                    (algorithm)  (keyLength)  (provider)   Mode  Cnt      Score  Error  Units
KeyPairGeneratorBench.MLKEM.generateKeyPair   ML-KEM-512            0              thrpt    2  62483.697         ops/s
KeyPairGeneratorBench.MLKEM.generateKeyPair   ML-KEM-768            0              thrpt    2  38464.272         ops/s
KeyPairGeneratorBench.MLKEM.generateKeyPair  ML-KEM-1024            0              thrpt    2  24702.044         ops/s

Baseline:
Benchmark             (algorithm)  (provider)   Mode  Cnt      Score  Error  Units
KEMBench.decapsulate   ML-KEM-512              thrpt    2  46416.479         ops/s
KEMBench.decapsulate   ML-KEM-768              thrpt    2  28516.289         ops/s
KEMBench.decapsulate  ML-KEM-1024              thrpt    2  19250.020         ops/s
KEMBench.encapsulate   ML-KEM-512              thrpt    2  60374.724         ops/s
KEMBench.encapsulate   ML-KEM-768              thrpt    2  36226.100         ops/s
KEMBench.encapsulate  ML-KEM-1024              thrpt    2  23656.223         ops/s

Withopt:
Benchmark             (algorithm)  (provider)   Mode  Cnt      Score  Error  Units
KEMBench.decapsulate   ML-KEM-512              thrpt    2  46730.153         ops/s
KEMBench.decapsulate   ML-KEM-768              thrpt    2  28650.349         ops/s
KEMBench.decapsulate  ML-KEM-1024              thrpt    2  19390.927         ops/s
KEMBench.encapsulate   ML-KEM-512              thrpt    2  60238.211         ops/s
KEMBench.encapsulate   ML-KEM-768              thrpt    2  36454.138         ops/s
KEMBench.encapsulate  ML-KEM-1024              thrpt    2  23649.839         ops/s

System was set at a fixed frequency of 2.7 GHz during benchmarking.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/28815#discussion_r2679382599
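
For reading the Baseline and Withopt scores side by side, a minimal Python sketch of the relative change per row may help. The scores are copied from the results in this message; the `scores` dictionary, its row labels, and the `percent_change` helper are illustrative names, not part of the benchmark suite. With Cnt = 2 and no Error column reported, the output is only a rough comparison, not a significance test.

```python
# Scores in ops/s, copied from the results in this message: (baseline, withopt).
# The row labels are shortened for readability (an illustrative choice).
scores = {
    "generateKeyPair ML-KEM-512":  (62235.790, 62483.697),
    "generateKeyPair ML-KEM-768":  (38238.390, 38464.272),
    "generateKeyPair ML-KEM-1024": (24725.512, 24702.044),
    "decapsulate ML-KEM-512":      (46416.479, 46730.153),
    "decapsulate ML-KEM-768":      (28516.289, 28650.349),
    "decapsulate ML-KEM-1024":     (19250.020, 19390.927),
    "encapsulate ML-KEM-512":      (60374.724, 60238.211),
    "encapsulate ML-KEM-768":      (36226.100, 36454.138),
    "encapsulate ML-KEM-1024":     (23656.223, 23649.839),
}


def percent_change(baseline: float, withopt: float) -> float:
    """Relative throughput change in percent; positive means withopt is faster."""
    return (withopt - baseline) / baseline * 100.0


# Compute the change for every row and print it with an explicit sign.
changes = {name: percent_change(b, w) for name, (b, w) in scores.items()}
for name, delta in changes.items():
    print(f"{name:<28} {delta:+.2f}%")
```

On these numbers every row moves by less than 1% in either direction.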
