On Wed, 15 Dec 2021 09:20:38 GMT, Boris Ulasevich <bulasev...@openjdk.org> wrote:
> Background > > The goal is to improve SHA3 implementation performance as it runs up to two > times slower than native (OpenSSL, measured on AMD64 and AArch6464) > implementation. Some hardware provides SHA3 accelerators, but most (AMD64 and > most AArch64) do not. > > For AArch64 hardware that does support SHA3 accelerators, an intrinsic was > implemented for ARMv8.2 with SHA3 instructions support: > - [JDK-8252204](https://bugs.openjdk.java.net/browse/JDK-8252204): AArch64: > Implement SHA3 accelerator/intrinsic > > SHA3 java implementation have already been reworked to eliminate memory > consumption and make it work faster: > - [JDK-8157495](https://bugs.openjdk.java.net/browse/JDK-8157495): SHA-3 Hash > algorithm performance improvements (~12x speedup) > The latter issue addressed this previously, but there is still some room to > improve SHA3 performance with current C2 implementation, which is proposed in > this PR. > > Option 1 (this PR) > > With this change I unroll the internal cycles manually, inline it manually, > and use local variables (not array) for data processing. Such an approach > gives the best performance (see benchmark results). Without this change (with > current array data processing) we observe a lot of load/store operations in > comparison to processing in local variables, both on AMD64 and on AArch64. > Native implementations shows that on AArch64 (32 GPU registers) SHA-3 > algorithm can hold all 25 data and all temporary variables in registers. C2 > can't optimize it as well because many regisers are allocated for internal > usage: rscratch1, rscratch2, rmethod, rthread, etc. With data in local > variables the number of excessive load/stores is much smaller and performance > result is much better. > > Option 2 (alternative) > > LINK: [the change](https://github.com/bulasevich/jdk/commit/095fc9db) > > This is a more conservative change which minimizes code changes, but it has > lower performance improvement. Please let me know if you think this change is > better then main one: in this case I will replace it within this Pull Request. > > With this change I unroll the internal cycles manually and use @Forceinline > annotation. Manual unrolling is necessary because C2 does not recognize there > is a counted cycle that can be completely unrolled. Instead of replacing the > loop with five loop bodies C2 splits the loop to pre- main- and post- loop > which is not good for this case. C2 works better when the array is created > locally, but in our case the array is created on object instantiation, so C2 > can't prove the array length is constant. The second issue affecting > performance is that methods with unrolled loops get larger and C2 does not > inline them. It's addressed here by using @Forceinline annotation. > > Benchmark results > > We measured the change on four platforms: ARM32, PPC, AArch64 (with no > advanced SHA-3 instructions) and AMD on > > [MessageDigests](https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/java/security/MessageDigests.java) > benchmark. Here is the result: > http://cr.openjdk.java.net/~bulasevich/8275914/sha3_bench.html > - fix1 (red) is the Option 1 (this PR) change: gains 50/38/83/38% on > ARM32/PPC/AArch64/AMD64 > - fix2 (green) is the Option 2 (alternative) change: gains 23/33/40/17% on > ARM32/PPC/AArch64/AMD64 > > Testing > > Tested with JTREG and SHA-3 Project (NIST) test vectors: run SHA3 > implementation over the same vectors before and after the change, and > checking the results are the same. > The test tool: http://cr.openjdk.java.net/~bulasevich/8275914/RunKat.java > The test vectors compiled from the [Final Algorithm > Package](https://csrc.nist.gov/Projects/Hash-Functions/SHA-3-Project): > http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_224.txt > http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_256.txt > http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_384.txt > http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_512.txt Hi, Thanks for looking at this! > Have you run the tests to make sure it passes the Known Answer Tests? In my original post I included the test suite and KAT - I have not found ready-to use one, so I did it from scratch using the test vectors provided for KECCAK Final Algorithm Package. > Can you also include the before and after JMH benchmarks for SHA3 in the PR? > they are under test/micro/org/openjdk/bench/java/security/MessageDigests There are before/after MessageDigests benchmark results for option1/options2 on ARM32/PPC/ARM64/AMD: http://cr.openjdk.java.net/~bulasevich/8275914/sha3_bench.html - is it OK? ------------- PR: https://git.openjdk.java.net/jdk/pull/6847