On Thu, 10 Nov 2022 20:12:30 GMT, Jamil Nimeh <jni...@openjdk.org> wrote:

>> Jamil Nimeh has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   replace hi/lo word shuffles and left-right shift/or operations for vpshufd 
>> on byte-aligned rotations
>
> Using vpshufb (not vpshufd, as I typo'ed in my commit message) on AVX/AVX2 for 
> 8-bit and 16-bit left rotations has given us some modest speed gains, roughly 
> 7% to 11% in throughput across the data sizes below:
> Before (with intrinsics):
> 
> AVX=1
> Benchmark                  (dataSize)     Mode  Cnt        Score       Error  Units
> ChaCha20.encrypt                  256    thrpt   40  1338667.215 ± 12012.240  ops/s
> ChaCha20.encrypt                 1024    thrpt   40   453682.363 ±  2559.322  ops/s
> ChaCha20.encrypt                 4096    thrpt   40   124785.645 ±   394.535  ops/s
> ChaCha20.encrypt                16384    thrpt   40    31788.969 ±    90.770  ops/s
> 
> AVX=2
> Benchmark                  (dataSize)     Mode  Cnt        Score       Error  Units
> ChaCha20.encrypt                  256    thrpt   40  1893810.127 ± 21870.718  ops/s
> ChaCha20.encrypt                 1024    thrpt   40   758024.511 ±  5414.552  ops/s
> ChaCha20.encrypt                 4096    thrpt   40   224032.805 ±   935.309  ops/s
> ChaCha20.encrypt                16384    thrpt   40    58112.296 ±   498.048  ops/s
> 
> After (using vpshufb):
> 
> AVX=1
> Benchmark                  (dataSize)     Mode  Cnt        Score       Error  Units
> ChaCha20.encrypt                  256    thrpt   40  1447416.349 ± 14054.478  ops/s
> ChaCha20.encrypt                 1024    thrpt   40   495844.721 ±  1949.237  ops/s
> ChaCha20.encrypt                 4096    thrpt   40   138154.478 ±   411.707  ops/s
> ChaCha20.encrypt                16384    thrpt   40    35165.143 ±   110.483  ops/s
> 
> AVX=2
> Benchmark                  (dataSize)     Mode  Cnt        Score       Error  Units
> ChaCha20.encrypt                  256    thrpt   40  2020170.211 ± 10507.466  ops/s
> ChaCha20.encrypt                 1024    thrpt   40   829644.325 ±  6452.931  ops/s
> ChaCha20.encrypt                 4096    thrpt   40   246066.542 ±  1052.905  ops/s
> ChaCha20.encrypt                16384    thrpt   40    64021.363 ±   468.979  ops/s
> 
> These runs used the same system as the original benchmarks. None of these 
> changes affect AVX512.
> 
> I'm working on a hybrid intrinsic approach to get the best of both worlds for 
> those smaller single-part jobs.

@jnimeh Very nice work overall. I think it would be OK to get this PR 
integrated and do the hybrid approach as a follow-up PR. Your work in general 
shows a very good improvement over the baseline.
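
For readers following along, here is a rough scalar sketch of why a byte 
shuffle can stand in for the byte-aligned rotations (ChaCha20's quarter round 
rotates left by 16, 12, 8 and 7 bits, so only the 16- and 8-bit ones qualify). 
It models the 128-bit vpshufb lane behaviour in plain Java and checks it 
against Integer.rotateLeft. The class and method names (ByteShuffleRotate, 
pshufb, rotlMask, rotlViaShuffle) are hypothetical and written for 
illustration only; this is not the stub code from the PR.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative model only; names here are hypothetical, not from the PR.
class ByteShuffleRotate {

    // Scalar model of one 128-bit vpshufb lane: each destination byte is
    // selected by the low 4 bits of the matching mask byte, or zeroed when
    // the mask byte has its high bit set.
    static byte[] pshufb(byte[] src, byte[] mask) {
        byte[] dst = new byte[16];
        for (int i = 0; i < 16; i++) {
            dst[i] = (mask[i] & 0x80) != 0 ? 0 : src[mask[i] & 0x0F];
        }
        return dst;
    }

    // Builds a shuffle mask that rotates every little-endian 32-bit word
    // left by a whole number of bytes (8, 16 or 24 bits).
    static byte[] rotlMask(int bits) {
        if (bits != 8 && bits != 16 && bits != 24) {
            throw new IllegalArgumentException("rotation must be byte-aligned");
        }
        int k = bits / 8;
        byte[] mask = new byte[16];
        for (int i = 0; i < 16; i++) {
            int wordBase = i & ~3;               // first byte of this word
            mask[i] = (byte) (wordBase | ((i - k) & 3));
        }
        return mask;
    }

    // Rotates four 32-bit words left using only the byte shuffle.
    static int[] rotlViaShuffle(int[] words, int bits) {
        ByteBuffer in = ByteBuffer.allocate(16).order(ByteOrder.LITTLE_ENDIAN);
        for (int w : words) {
            in.putInt(w);
        }
        byte[] shuffled = pshufb(in.array(), rotlMask(bits));
        ByteBuffer out = ByteBuffer.wrap(shuffled).order(ByteOrder.LITTLE_ENDIAN);
        int[] result = new int[4];
        for (int j = 0; j < 4; j++) {
            result[j] = out.getInt();
        }
        return result;
    }

    public static void main(String[] args) {
        int[] words = {0x11223344, 0x61707865, 0xdeadbeef, 0x00000001};
        for (int bits : new int[] {8, 16}) {
            int[] got = rotlViaShuffle(words, bits);
            for (int j = 0; j < 4; j++) {
                if (got[j] != Integer.rotateLeft(words[j], bits)) {
                    throw new AssertionError("mismatch at word " + j);
                }
            }
        }
        System.out.println("byte shuffle matches Integer.rotateLeft for 8 and 16 bits");
    }
}
```

The point of the sketch is that a byte-aligned rotate becomes a single 
shuffle with a constant mask, instead of a left shift, a right shift and an 
OR. The 12- and 7-bit rotations are not byte-aligned, so they cannot be 
expressed this way.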

-------------

PR: https://git.openjdk.org/jdk/pull/7702
