On Thu, 10 Nov 2022 20:12:30 GMT, Jamil Nimeh <jni...@openjdk.org> wrote:
>> Jamil Nimeh has updated the pull request incrementally with one additional
>> commit since the last revision:
>>
>>   replace hi/lo word shuffles and left-right shift/or operations for vpshufd
>>   on byte-aligned rotations
>
> using vpshufb (not vpshufd as I typo'ed on my commit message) on AVX/AVX2 for
> 8-bit and 16-bit left rotations has given us some modest speed gains:
>
> Before (with intrinsics):
>
> AVX=1
> ChaCha20.encrypt         256  thrpt   40  1338667.215 ± 12012.240  ops/s
> ChaCha20.encrypt        1024  thrpt   40   453682.363 ±  2559.322  ops/s
> ChaCha20.encrypt        4096  thrpt   40   124785.645 ±   394.535  ops/s
> ChaCha20.encrypt       16384  thrpt   40    31788.969 ±    90.770  ops/s
>
> AVX=2
> ChaCha20.encrypt         256  thrpt   40  1893810.127 ± 21870.718  ops/s
> ChaCha20.encrypt        1024  thrpt   40   758024.511 ±  5414.552  ops/s
> ChaCha20.encrypt        4096  thrpt   40   224032.805 ±   935.309  ops/s
> ChaCha20.encrypt       16384  thrpt   40    58112.296 ±   498.048  ops/s
>
> After (using vpshufb):
>
> AVX=1
> Benchmark         (dataSize)   Mode  Cnt        Score       Error  Units
> ChaCha20.encrypt         256  thrpt   40  1447416.349 ± 14054.478  ops/s
> ChaCha20.encrypt        1024  thrpt   40   495844.721 ±  1949.237  ops/s
> ChaCha20.encrypt        4096  thrpt   40   138154.478 ±   411.707  ops/s
> ChaCha20.encrypt       16384  thrpt   40    35165.143 ±   110.483  ops/s
>
> AVX=2
> ChaCha20.encrypt         256  thrpt   40  2020170.211 ± 10507.466  ops/s
> ChaCha20.encrypt        1024  thrpt   40   829644.325 ±  6452.931  ops/s
> ChaCha20.encrypt        4096  thrpt   40   246066.542 ±  1052.905  ops/s
> ChaCha20.encrypt       16384  thrpt   40    64021.363 ±   468.979  ops/s
>
> This was done on the same system that the original benchmarks were done on.
> None of these changes affect AVX512.
>
> I'm working on a hybrid intrinsic approach to get the best of both worlds for
> those smaller single-part jobs.

@jnimeh Very nice work overall. I think it would be ok to get this PR integrated and do the hybrid approach as a follow on PR. Your work in general shows very good improvement over base.
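
For readers wondering why a byte shuffle can stand in for the shift/or sequence here: a left rotation of a 32-bit word by 8 or 16 bits only moves whole bytes, so it can be expressed as a fixed byte permutation, which is the kind of operation vpshufb performs per lane. The scalar Java sketch below illustrates the idea only; it is not the intrinsic code from the PR. The class name, the `shuffleBytes` helper, and the mask arrays are hypothetical names made up for this illustration, and the masks assume little-endian byte numbering within each word.

```java
class RotateAsByteShuffle {
    // Hypothetical permutation tables (not taken from the PR):
    // result byte i is taken from source byte MASK[i], byte 0 = least significant.
    static final int[] ROTL8_MASK  = {3, 0, 1, 2};
    static final int[] ROTL16_MASK = {2, 3, 0, 1};

    // Scalar model of a pshufb-style shuffle restricted to one 32-bit lane.
    static int shuffleBytes(int x, int[] mask) {
        int out = 0;
        for (int i = 0; i < 4; i++) {
            int srcByte = (x >>> (8 * mask[i])) & 0xFF; // pick the source byte
            out |= srcByte << (8 * i);                  // place it at position i
        }
        return out;
    }

    static int rotl8(int x)  { return shuffleBytes(x, ROTL8_MASK); }
    static int rotl16(int x) { return shuffleBytes(x, ROTL16_MASK); }

    public static void main(String[] args) {
        int[] samples = {0x11223344, 0x80000001, 0xFFFFFFFF, 0, 0xDEADBEEF};
        for (int x : samples) {
            // Compare the byte-shuffle form against the shift/or form.
            boolean ok8  = rotl8(x)  == Integer.rotateLeft(x, 8);
            boolean ok16 = rotl16(x) == Integer.rotateLeft(x, 16);
            System.out.printf("%08x rotl8=%08x rotl16=%08x match=%b%n",
                    x, rotl8(x), rotl16(x), ok8 && ok16);
        }
    }
}
```

The ChaCha20 quarter round rotates by 16, 12, 8 and 7 bits, so only the 16-bit and 8-bit rotations are byte-aligned and can be handled this way; the 12-bit and 7-bit rotations still need the shift/or form.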
-------------

PR: https://git.openjdk.org/jdk/pull/7702