On Fri, 31 Jan 2025 16:48:09 GMT, Jamil Nimeh <jni...@openjdk.org> wrote:

> This enhancement makes a change to the ChaCha20 block function intrinsic on 
> aarch64, moving away from the block parallel implementation and to the 
> quarter-round parallel implementation that was done on x86_64.  Assembly 
> language profiling yielded an 11% improvement in throughput.  When put 
> together as an intrinsic and hooked into the JCE ChaCha20 cipher, the gains 
> are more modest, somewhere in the 2-4% range depending on job size, but still 
> an improvement.
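
For readers unfamiliar with the term, the "quarter round" mentioned above is the basic ChaCha20 mixing step defined in RFC 8439, section 2.1. The following is only a scalar Java sketch of that step, checked against the RFC's test vector. It is not the aarch64 intrinsic from this PR, and the class and method names are made up for illustration.

```java
/** Scalar ChaCha20 quarter round (RFC 8439, section 2.1). Illustrative only. */
final class QuarterRoundSketch {

    /** Mixes the four state words at indices a, b, c, d in place. */
    static void quarterRound(int[] s, int a, int b, int c, int d) {
        s[a] += s[b]; s[d] = Integer.rotateLeft(s[d] ^ s[a], 16);
        s[c] += s[d]; s[b] = Integer.rotateLeft(s[b] ^ s[c], 12);
        s[a] += s[b]; s[d] = Integer.rotateLeft(s[d] ^ s[a], 8);
        s[c] += s[d]; s[b] = Integer.rotateLeft(s[b] ^ s[c], 7);
    }

    public static void main(String[] args) {
        // Input words from the RFC 8439 section 2.1.1 test vector.
        int[] s = {0x11111111, 0x01020304, 0x9b8d6f43, 0x01234567};
        quarterRound(s, 0, 1, 2, 3);
        // Prints ea2a92f4, cb1cf8ce, 4581472e, 5881c4bb (one per line).
        for (int w : s) {
            System.out.printf("%08x%n", w);
        }
    }
}
```

In a full ChaCha20 round, four such quarter rounds operate on disjoint words of the 16-word state, so they are independent of one another within a round.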

Some perf numbers...

ChaCha20 Intrinsics Disabled (-XX:-UseChaCha20Intrinsics)

Benchmark                                             (dataSize)  (keyLength)  (mode)  (padding)      (permutation)  (provider)   Mode  Cnt        Score      Error  Units
o.o.b.j.c.full.CipherBench.ChaCha20.decrypt                  256          256    None  NoPadding           ChaCha20              thrpt   40  1387685.897 ± 6380.864  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.decrypt                 1024          256    None  NoPadding           ChaCha20              thrpt   40   389604.653 ± 1152.250  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.decrypt                 4096          256    None  NoPadding           ChaCha20              thrpt   40   101251.772 ±  239.854  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.decrypt                16384          256    None  NoPadding           ChaCha20              thrpt   40    25564.584 ±   67.180  ops/s

o.o.b.j.c.full.CipherBench.ChaCha20.encrypt                  256          256    None  NoPadding           ChaCha20              thrpt   40  1321081.861 ± 3681.500  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.encrypt                 1024          256    None  NoPadding           ChaCha20              thrpt   40   386623.577 ±  726.790  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.encrypt                 4096          256    None  NoPadding           ChaCha20              thrpt   40   101205.846 ±  242.324  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.encrypt                16384          256    None  NoPadding           ChaCha20              thrpt   40    25672.120 ±   51.305  ops/s

o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.decrypt          256          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40   447115.739 ± 4961.898  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.decrypt         1024          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40   203335.249 ± 1061.335  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.decrypt         4096          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    63911.592 ±  263.081  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.decrypt        16384          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    17040.111 ±   52.876  ops/s

o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.encrypt          256          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40   565292.934 ± 3536.657  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.encrypt         1024          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40   222610.735 ± 1240.699  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.encrypt         4096          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    65414.212 ±  223.482  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.encrypt        16384          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    17134.066 ±   72.718  ops/s

o.o.b.j.c.small.CipherBench.ChaCha20Poly1305.decrypt       16384          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    17019.128 ±   65.802  ops/s
o.o.b.j.c.small.CipherBench.ChaCha20Poly1305.encrypt       16384          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    16997.012 ±   68.808  ops/s



Block-Parallel Intrinsics Implementation

Benchmark                                             (dataSize)  (keyLength)  (mode)  (padding)      (permutation)  (provider)   Mode  Cnt        Score       Error  Units
o.o.b.j.c.full.CipherBench.ChaCha20.decrypt                  256          256    None  NoPadding           ChaCha20              thrpt   40  2164945.312 ±  8845.473  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.decrypt                 1024          256    None  NoPadding           ChaCha20              thrpt   40   659831.098 ±  1968.217  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.decrypt                 4096          256    None  NoPadding           ChaCha20              thrpt   40   175252.222 ±   512.910  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.decrypt                16384          256    None  NoPadding           ChaCha20              thrpt   40    44329.489 ±   126.564  ops/s

o.o.b.j.c.full.CipherBench.ChaCha20.encrypt                  256          256    None  NoPadding           ChaCha20              thrpt   40  1975016.045 ± 11695.931  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.encrypt                 1024          256    None  NoPadding           ChaCha20              thrpt   40   640856.881 ±  1830.533  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.encrypt                 4096          256    None  NoPadding           ChaCha20              thrpt   40   173305.072 ±   366.240  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.encrypt                16384          256    None  NoPadding           ChaCha20              thrpt   40    44208.373 ±   107.018  ops/s

o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.decrypt          256          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40   466351.469 ±  3278.807  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.decrypt         1024          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40   247662.489 ±  1165.507  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.decrypt         4096          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    85367.721 ±   404.796  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.decrypt        16384          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    23492.360 ±    92.043  ops/s

o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.encrypt          256          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40   589645.973 ±  4262.663  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.encrypt         1024          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40   278130.465 ±  1394.179  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.encrypt         4096          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    88081.739 ±   443.476  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.encrypt        16384          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    23853.430 ±   104.346  ops/s

o.o.b.j.c.small.CipherBench.ChaCha20Poly1305.decrypt       16384          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    23620.475 ±    75.932  ops/s
o.o.b.j.c.small.CipherBench.ChaCha20Poly1305.encrypt       16384          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    23750.134 ±   118.572  ops/s



Quarter-Round Parallel Intrinsics Implementation

Benchmark                                             (dataSize)  (keyLength)  (mode)  (padding)      (permutation)  (provider)   Mode  Cnt        Score       Error  Units
o.o.b.j.c.full.CipherBench.ChaCha20.decrypt                  256          256    None  NoPadding           ChaCha20              thrpt   40  2219198.137 ± 13314.344  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.decrypt                 1024          256    None  NoPadding           ChaCha20              thrpt   40   684200.661 ±  3601.031  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.decrypt                 4096          256    None  NoPadding           ChaCha20              thrpt   40   181048.566 ±   942.201  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.decrypt                16384          256    None  NoPadding           ChaCha20              thrpt   40    46150.219 ±   118.031  ops/s

o.o.b.j.c.full.CipherBench.ChaCha20.encrypt                  256          256    None  NoPadding           ChaCha20              thrpt   40  2049320.671 ±  9549.691  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.encrypt                 1024          256    None  NoPadding           ChaCha20              thrpt   40   663456.090 ±  2722.964  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.encrypt                 4096          256    None  NoPadding           ChaCha20              thrpt   40   179921.834 ±   573.613  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.encrypt                16384          256    None  NoPadding           ChaCha20              thrpt   40    45885.159 ±   102.974  ops/s

o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.decrypt          256          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40   476694.433 ±  4118.055  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.decrypt         1024          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40   251749.129 ±  1535.415  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.decrypt         4096          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    87052.901 ±   436.111  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.decrypt        16384          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    24099.749 ±   136.009  ops/s

o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.encrypt          256          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40   601333.942 ±  5414.186  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.encrypt         1024          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40   280884.583 ±  2332.119  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.encrypt         4096          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    90250.320 ±   604.948  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.encrypt        16384          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    24346.217 ±   101.557  ops/s

o.o.b.j.c.small.CipherBench.ChaCha20Poly1305.decrypt       16384          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    23950.145 ±   119.081  ops/s
o.o.b.j.c.small.CipherBench.ChaCha20Poly1305.encrypt       16384          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    24405.675 ±    93.554  ops/s
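
To relate these scores to the 2-4% estimate quoted at the top, the relative gain of the quarter-round parallel version over the block-parallel version can be computed from the Score columns. The sketch below does this for the ChaCha20 decrypt rows only. The class and helper names are hypothetical, and the calculation ignores the reported error margins.

```java
/** Relative throughput gain between two JMH scores. Illustrative only. */
final class SpeedupSketch {

    /** Percentage change of candidate relative to baseline. */
    static double percentGain(double baseline, double candidate) {
        return (candidate - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        // ChaCha20.decrypt scores (ops/s) copied from the tables above.
        int[] dataSize = {256, 1024, 4096, 16384};
        double[] blockParallel = {2164945.312, 659831.098, 175252.222, 44329.489};
        double[] quarterRound = {2219198.137, 684200.661, 181048.566, 46150.219};

        for (int i = 0; i < dataSize.length; i++) {
            System.out.printf("dataSize=%d gain=%.2f%%%n",
                    dataSize[i], percentGain(blockParallel[i], quarterRound[i]));
        }
    }
}
```

For these four rows the gain works out to roughly 2.5% to 4.1%, in line with the range stated in the PR description.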

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23397#issuecomment-2627798257
