Re: RFR: 8275914: SHA3: changing java implementation to help C2 create high-performance code

Boris Ulasevich Thu, 27 Jan 2022 23:48:36 -0800

On Wed, 15 Dec 2021 09:20:38 GMT, Boris Ulasevich <bulasev...@openjdk.org> 
wrote:


> Background
> 
> The goal is to improve SHA3 implementation performance as it runs up to two 
> times slower than native (OpenSSL, measured on AMD64 and AArch6464) 
> implementation. Some hardware provides SHA3 accelerators, but most (AMD64 and 
> most AArch64) do not.
> 
> For AArch64 hardware that does support SHA3 accelerators, an intrinsic was 
> implemented for ARMv8.2 with SHA3 instructions support:
> - [JDK-8252204](https://bugs.openjdk.java.net/browse/JDK-8252204): AArch64: 
> Implement SHA3 accelerator/intrinsic
> 
> SHA3 java implementation have already been reworked to eliminate memory 
> consumption and make it work faster:
> - [JDK-8157495](https://bugs.openjdk.java.net/browse/JDK-8157495): SHA-3 Hash 
> algorithm performance improvements (~12x speedup)
> The latter issue addressed this previously, but there is still some room to 
> improve SHA3 performance with current C2 implementation, which is proposed in 
> this PR.
> 
> Option 1 (this PR)
> 
> With this change I unroll the internal cycles manually, inline it manually, 
> and use local variables (not array) for data processing. Such an approach 
> gives the best performance (see benchmark results). Without this change (with 
> current array data processing) we observe a lot of load/store operations in 
> comparison to processing in local variables, both on AMD64 and on AArch64.  
> Native implementations shows that on AArch64 (32 GPU registers) SHA-3 
> algorithm can hold all 25 data and all temporary variables in registers. C2 
> can't optimize it as well because many regisers are allocated for internal 
> usage: rscratch1, rscratch2, rmethod, rthread, etc.  With data in local 
> variables the number of excessive load/stores is much smaller and performance 
> result is much better.
> 
> Option 2 (alternative)
> 
> LINK: [the change](https://github.com/bulasevich/jdk/commit/095fc9db)
> 
> This is a more conservative change which minimizes code changes, but it has 
> lower performance improvement. Please let me know if you think this change is 
> better then main one: in this case I will replace it within this Pull Request.
> 
> With this change I unroll the internal cycles manually and use @Forceinline 
> annotation. Manual unrolling is necessary because C2 does not recognize there 
> is a counted cycle that can be completely unrolled. Instead of replacing the 
> loop with five loop bodies C2 splits the loop to pre- main- and post- loop 
> which is not good for this case. C2 works better when the array is created 
> locally, but in our case the array is created on object instantiation, so C2 
> can't prove the array length is constant. The second issue affecting 
> performance is that methods with unrolled loops get larger and C2 does not 
> inline them. It's addressed here by using @Forceinline annotation.
> 
> Benchmark results
> 
> We measured the change on four platforms: ARM32, PPC, AArch64 (with no 
> advanced SHA-3 instructions) and AMD on
>   
> [MessageDigests](https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/java/security/MessageDigests.java)
>  benchmark. Here is the result: 
> http://cr.openjdk.java.net/~bulasevich/8275914/sha3_bench.html
> - fix1 (red) is the Option 1 (this PR) change: gains 50/38/83/38% on 
> ARM32/PPC/AArch64/AMD64
> - fix2 (green) is the Option 2 (alternative) change: gains 23/33/40/17% on 
> ARM32/PPC/AArch64/AMD64
> 
> Testing
> 
> Tested with JTREG and SHA-3 Project (NIST) test vectors: run SHA3 
> implementation over the same vectors before and after the change, and 
> checking the results are the same.
> The test tool: http://cr.openjdk.java.net/~bulasevich/8275914/RunKat.java
> The test vectors compiled from the [Final Algorithm 
> Package](https://csrc.nist.gov/Projects/Hash-Functions/SHA-3-Project):
> http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_224.txt
> http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_256.txt
> http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_384.txt
> http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_512.txt

Hi,

Thanks for looking at this!

> Have you run the tests to make sure it passes the Known Answer Tests?

In my original post I included the test suite and KAT - I have not found 
ready-to use one, so I did it from scratch using the test vectors provided for 
KECCAK Final Algorithm Package.

> Can you also include the before and after JMH benchmarks for SHA3 in the PR? 
> they are under test/micro/org/openjdk/bench/java/security/MessageDigests

There are before/after MessageDigests benchmark results for option1/options2 on 
ARM32/PPC/ARM64/AMD: 
http://cr.openjdk.java.net/~bulasevich/8275914/sha3_bench.html - is it OK?

-------------

PR: https://git.openjdk.java.net/jdk/pull/6847

Re: RFR: 8275914: SHA3: changing java implementation to help C2 create high-performance code

Reply via email to