On Tue, 1 Feb 2022 09:52:46 GMT, Boris Ulasevich <bulasev...@openjdk.org> wrote:
>> Background >> >> The goal is to improve SHA3 implementation performance as it runs up to two >> times slower than native (OpenSSL, measured on AMD64 and AArch6464) >> implementation. Some hardware provides SHA3 accelerators, but most (AMD64 >> and most AArch64) do not. >> >> For AArch64 hardware that does support SHA3 accelerators, an intrinsic was >> implemented for ARMv8.2 with SHA3 instructions support: >> - [JDK-8252204](https://bugs.openjdk.java.net/browse/JDK-8252204): AArch64: >> Implement SHA3 accelerator/intrinsic >> >> SHA3 java implementation have already been reworked to eliminate memory >> consumption and make it work faster: >> - [JDK-8157495](https://bugs.openjdk.java.net/browse/JDK-8157495): SHA-3 >> Hash algorithm performance improvements (~12x speedup) >> The latter issue addressed this previously, but there is still some room to >> improve SHA3 performance with current C2 implementation, which is proposed >> in this PR. >> >> Option 1 (this PR) >> >> With this change I unroll the internal cycles manually, inline it manually, >> and use local variables (not array) for data processing. Such an approach >> gives the best performance (see benchmark results). Without this change >> (with current array data processing) we observe a lot of load/store >> operations in comparison to processing in local variables, both on AMD64 and >> on AArch64. >> Native implementations shows that on AArch64 (32 GPU registers) SHA-3 >> algorithm can hold all 25 data and all temporary variables in registers. C2 >> can't optimize it as well because many regisers are allocated for internal >> usage: rscratch1, rscratch2, rmethod, rthread, etc. With data in local >> variables the number of excessive load/stores is much smaller and >> performance result is much better. >> >> Option 2 (alternative) >> >> LINK: [the change](https://github.com/bulasevich/jdk/commit/095fc9db) >> >> This is a more conservative change which minimizes code changes, but it has >> lower performance improvement. Please let me know if you think this change >> is better then main one: in this case I will replace it within this Pull >> Request. >> >> With this change I unroll the internal cycles manually and use @Forceinline >> annotation. Manual unrolling is necessary because C2 does not recognize >> there is a counted cycle that can be completely unrolled. Instead of >> replacing the loop with five loop bodies C2 splits the loop to pre- main- >> and post- loop which is not good for this case. C2 works better when the >> array is created locally, but in our case the array is created on object >> instantiation, so C2 can't prove the array length is constant. The second >> issue affecting performance is that methods with unrolled loops get larger >> and C2 does not inline them. It's addressed here by using @Forceinline >> annotation. >> >> Benchmark results >> >> We measured the change on four platforms: ARM32, PPC, AArch64 (with no >> advanced SHA-3 instructions) and AMD on >> >> [MessageDigests](https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/java/security/MessageDigests.java) >> benchmark. Here is the result: >> http://cr.openjdk.java.net/~bulasevich/8275914/sha3_bench.html >> - fix1 (red) is the Option 1 (this PR) change: gains 50/38/83/38% on >> ARM32/PPC/AArch64/AMD64 >> - fix2 (green) is the Option 2 (alternative) change: gains 23/33/40/17% on >> ARM32/PPC/AArch64/AMD64 >> >> Testing >> >> Tested with JTREG and SHA-3 Project (NIST) test vectors: run SHA3 >> implementation over the same vectors before and after the change, and >> checking the results are the same. >> The test tool: http://cr.openjdk.java.net/~bulasevich/8275914/RunKat.java >> The test vectors compiled from the [Final Algorithm >> Package](https://csrc.nist.gov/Projects/Hash-Functions/SHA-3-Project): >> http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_224.txt >> http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_256.txt >> http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_384.txt >> http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_512.txt > > Boris Ulasevich has updated the pull request incrementally with one > additional commit since the last revision: > > copyright fix Lgtm as well. ------------- Marked as reviewed by phh (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/6847