On Thu, 10 Nov 2022 20:11:46 GMT, Jamil Nimeh <jni...@openjdk.org> wrote:
>> This PR delivers ChaCha20 intrinsics that accelerate the core block function
>> that generates key stream from the key, counter and nonce. Intrinsics have
>> been written for the following platforms and instruction sets:
>>
>> - x86_64: AVX, AVX2 and AVX512
>> - aarch64: platforms that support the advanced SIMD instructions
>>
>> Note: Microbenchmark results moved to a comment in the PR so we don't have
>> to see it in every email.
>>
>> Special thanks to the folks who have made many helpful comments while this
>> PR was in draft form.
>
> Jamil Nimeh has updated the pull request incrementally with one additional
> commit since the last revision:
>
>   replace hi/lo word shuffles and left-right shift/or operations for vpshufd
>   on byte-aligned rotations

Using vpshufb (not vpshufd as I typo'ed on my commit message) on AVX/AVX2 for
8-bit and 16-bit left rotations has given us some modest speed gains:

Before (with intrinsics):

AVX=1
ChaCha20.encrypt         256  thrpt   40  1338667.215 ± 12012.240  ops/s
ChaCha20.encrypt        1024  thrpt   40   453682.363 ±  2559.322  ops/s
ChaCha20.encrypt        4096  thrpt   40   124785.645 ±   394.535  ops/s
ChaCha20.encrypt       16384  thrpt   40    31788.969 ±    90.770  ops/s

AVX=2
ChaCha20.encrypt         256  thrpt   40  1893810.127 ± 21870.718  ops/s
ChaCha20.encrypt        1024  thrpt   40   758024.511 ±  5414.552  ops/s
ChaCha20.encrypt        4096  thrpt   40   224032.805 ±   935.309  ops/s
ChaCha20.encrypt       16384  thrpt   40    58112.296 ±   498.048  ops/s

After (using vpshufb):

AVX=1
Benchmark         (dataSize)   Mode  Cnt        Score       Error  Units
ChaCha20.encrypt         256  thrpt   40  1447416.349 ± 14054.478  ops/s
ChaCha20.encrypt        1024  thrpt   40   495844.721 ±  1949.237  ops/s
ChaCha20.encrypt        4096  thrpt   40   138154.478 ±   411.707  ops/s
ChaCha20.encrypt       16384  thrpt   40    35165.143 ±   110.483  ops/s

AVX=2
ChaCha20.encrypt         256  thrpt   40  2020170.211 ± 10507.466  ops/s
ChaCha20.encrypt        1024  thrpt   40   829644.325 ±  6452.931  ops/s
ChaCha20.encrypt        4096  thrpt   40   246066.542 ±  1052.905  ops/s
ChaCha20.encrypt       16384  thrpt   40    64021.363 ±   468.979  ops/s

This was done on the same system that the original benchmarks were done on.
None of these changes affect AVX512. I'm working on a hybrid intrinsic
approach to get the best of both worlds for those smaller single-part jobs.

-------------

PR: https://git.openjdk.org/jdk/pull/7702
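
For readers wondering why a byte shuffle can stand in for the shift/or sequence: the ChaCha20 quarter round rotates 32-bit words left by 16, 12, 8 and 7 bits, and only the 16-bit and 8-bit cases move whole bytes. The scalar Java sketch below models one 32-bit lane and checks that a per-lane byte permutation gives the same result as shift/or. The class name, method names and mask arrays are hypothetical and only for illustration; this is not the stub code from the PR, and the masks assume little-endian byte order within each lane.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ByteShuffleRotate {
    // Hypothetical per-lane shuffle masks in the vpshufb sense:
    // destination byte i is taken from source byte MASK[i].
    // Byte 0 is the least significant byte of the 32-bit lane.
    static final int[] ROTL8_MASK  = {3, 0, 1, 2};
    static final int[] ROTL16_MASK = {2, 3, 0, 1};

    // Models a byte shuffle applied to a single 32-bit lane.
    static int shuffleRotate(int word, int[] mask) {
        byte[] src = ByteBuffer.allocate(4)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putInt(word)
                .array();
        byte[] dst = new byte[4];
        for (int i = 0; i < 4; i++) {
            dst[i] = src[mask[i]];
        }
        return ByteBuffer.wrap(dst).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }

    // The shift-left / shift-right / or sequence that the shuffle replaces.
    static int shiftOrRotate(int word, int n) {
        return (word << n) | (word >>> (32 - n));
    }

    public static void main(String[] args) {
        // The first four values are the ChaCha20 "expand 32-byte k" constants.
        int[] samples = {0x61707865, 0x3320646e, 0x79622d32, 0x6b206574,
                         0x80000001, -1, 0};
        for (int w : samples) {
            if (shuffleRotate(w, ROTL8_MASK) != shiftOrRotate(w, 8)
                    || shuffleRotate(w, ROTL16_MASK) != shiftOrRotate(w, 16)) {
                throw new AssertionError("mismatch for 0x" + Integer.toHexString(w));
            }
        }
        System.out.println("byte shuffles match shift/or for 8-bit and 16-bit rotations");
    }
}
```

The 12-bit and 7-bit rotations cross byte boundaries, so a byte permutation cannot express them and they still need the shift/or form on instruction sets without a vector rotate.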