On Thu, 10 Nov 2022 20:11:46 GMT, Jamil Nimeh <jni...@openjdk.org> wrote:

>> This PR delivers ChaCha20 intrinsics that accelerate the core block function 
>> that generates key stream from the key, counter and nonce.  Intrinsics have 
>> been written for the following platforms and instruction sets:
>> 
>> - x86_64: AVX, AVX2 and AVX512
>> - aarch64: platforms that support the advanced SIMD instructions
>> 
>> Note: Microbenchmark results moved to a comment in the PR so we don't have 
>> to see it in every email.
>> 
>> Special thanks to the folks who have made many helpful comments while this 
>> PR was in draft form.
>
> Jamil Nimeh has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   replace hi/lo word shuffles and left-right shift/or operations for vpshufd 
> on byte-aligned rotations

Using vpshufb (not vpshufd, as I mistyped in my commit message) on AVX/AVX2 for 
8-bit and 16-bit left rotations has given us some modest speed gains:

Before (with intrinsics):

AVX=1
Benchmark                  (dataSize)     Mode  Cnt        Score       Error  Units
ChaCha20.encrypt                  256    thrpt   40  1338667.215 ± 12012.240  ops/s
ChaCha20.encrypt                 1024    thrpt   40   453682.363 ±  2559.322  ops/s
ChaCha20.encrypt                 4096    thrpt   40   124785.645 ±   394.535  ops/s
ChaCha20.encrypt                16384    thrpt   40    31788.969 ±    90.770  ops/s

AVX=2
ChaCha20.encrypt                  256    thrpt   40  1893810.127 ± 21870.718  ops/s
ChaCha20.encrypt                 1024    thrpt   40   758024.511 ±  5414.552  ops/s
ChaCha20.encrypt                 4096    thrpt   40   224032.805 ±   935.309  ops/s
ChaCha20.encrypt                16384    thrpt   40    58112.296 ±   498.048  ops/s

After (using vpshufb):

AVX=1
Benchmark                  (dataSize)     Mode  Cnt        Score       Error  Units
ChaCha20.encrypt                  256    thrpt   40  1447416.349 ± 14054.478  ops/s
ChaCha20.encrypt                 1024    thrpt   40   495844.721 ±  1949.237  ops/s
ChaCha20.encrypt                 4096    thrpt   40   138154.478 ±   411.707  ops/s
ChaCha20.encrypt                16384    thrpt   40    35165.143 ±   110.483  ops/s

AVX=2
ChaCha20.encrypt                  256    thrpt   40  2020170.211 ± 10507.466  ops/s
ChaCha20.encrypt                 1024    thrpt   40   829644.325 ±  6452.931  ops/s
ChaCha20.encrypt                 4096    thrpt   40   246066.542 ±  1052.905  ops/s
ChaCha20.encrypt                16384    thrpt   40    64021.363 ±   468.979  ops/s
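
To put a number on the gains, the relative throughput change can be worked out 
from the scores above. The following is only a quick sketch: the class and 
method names (VpshufbGain, gainPercent) are made up for illustration, and it 
uses the mean scores only, ignoring the reported error margins. On these 
numbers the improvement comes out on the order of 7 to 11 percent.

```java
public class VpshufbGain {
    // Data sizes and mean scores (ops/s) copied from the tables above.
    static final int[] SIZES = {256, 1024, 4096, 16384};

    // Row 0 is AVX=1, row 1 is AVX=2.
    static final double[][] BEFORE = {
        {1338667.215, 453682.363, 124785.645, 31788.969},
        {1893810.127, 758024.511, 224032.805, 58112.296}
    };
    static final double[][] AFTER = {
        {1447416.349, 495844.721, 138154.478, 35165.143},
        {2020170.211, 829644.325, 246066.542, 64021.363}
    };

    // Relative throughput change, in percent (hypothetical helper).
    static double gainPercent(double before, double after) {
        return (after / before - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        for (int level = 0; level < BEFORE.length; level++) {
            System.out.println("AVX=" + (level + 1));
            for (int i = 0; i < SIZES.length; i++) {
                double gain = gainPercent(BEFORE[level][i], AFTER[level][i]);
                System.out.printf("  dataSize %5d: %+.1f%%%n", SIZES[i], gain);
            }
        }
    }
}
```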

These runs used the same system as the original benchmarks.  
None of these changes affect AVX512.
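
For anyone wondering why a byte shuffle can stand in for a rotation: when a 
32-bit rotation is a multiple of 8 bits, it just moves whole bytes around 
within the lane, so a single byte permute can replace the shift/shift/or 
sequence. Below is a scalar Java sketch of that idea, not the intrinsic code 
from the PR. The shuffleBytes helper and the index tables are hypothetical 
names that emulate a per-lane byte shuffle on one int. The other two ChaCha20 
quarter-round rotations (12 and 7 bits) are not byte-aligned, so this trick 
does not apply to them.

```java
public class ByteAlignedRotate {
    // Hypothetical index tables: entry i names the source byte
    // (0 = least significant) that lands in destination byte i.
    static final int[] ROTL16 = {2, 3, 0, 1};
    static final int[] ROTL8  = {3, 0, 1, 2};

    // Emulates a byte shuffle on a single 32-bit lane (illustrative only).
    static int shuffleBytes(int x, int[] idx) {
        int r = 0;
        for (int i = 0; i < 4; i++) {
            int b = (x >>> (8 * idx[i])) & 0xFF; // pick the source byte
            r |= b << (8 * i);                   // place it in destination slot i
        }
        return r;
    }

    // The shift/shift/or form of a left rotation, for comparison.
    static int rotlShiftOr(int x, int n) {
        return (x << n) | (x >>> (32 - n));
    }

    public static void main(String[] args) {
        int x = 0xAABBCCDD;
        System.out.printf("rotl16 shift/or: %08X  shuffle: %08X%n",
                rotlShiftOr(x, 16), shuffleBytes(x, ROTL16));
        System.out.printf("rotl8  shift/or: %08X  shuffle: %08X%n",
                rotlShiftOr(x, 8), shuffleBytes(x, ROTL8));
    }
}
```

In a vectorized version the same index pattern would be repeated for every 
32-bit lane in the shuffle mask, so one instruction rotates all lanes at once.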

I'm working on a hybrid intrinsic approach to get the best of both worlds for 
those smaller single-part jobs.

-------------

PR: https://git.openjdk.org/jdk/pull/7702
