Re: RFR: 8275914: SHA3: changing java implementation to help C2 create high-performance code [v3]

Paul Hohensee Tue, 01 Feb 2022 09:13:42 -0800

On Tue, 1 Feb 2022 09:52:46 GMT, Boris Ulasevich <bulasev...@openjdk.org> wrote:


>> Background
>> 
>> The goal is to improve SHA3 implementation performance as it runs up to two 
>> times slower than native (OpenSSL, measured on AMD64 and AArch6464) 
>> implementation. Some hardware provides SHA3 accelerators, but most (AMD64 
>> and most AArch64) do not.
>> 
>> For AArch64 hardware that does support SHA3 accelerators, an intrinsic was 
>> implemented for ARMv8.2 with SHA3 instructions support:
>> - [JDK-8252204](https://bugs.openjdk.java.net/browse/JDK-8252204): AArch64: 
>> Implement SHA3 accelerator/intrinsic
>> 
>> SHA3 java implementation have already been reworked to eliminate memory 
>> consumption and make it work faster:
>> - [JDK-8157495](https://bugs.openjdk.java.net/browse/JDK-8157495): SHA-3 
>> Hash algorithm performance improvements (~12x speedup)
>> The latter issue addressed this previously, but there is still some room to 
>> improve SHA3 performance with current C2 implementation, which is proposed 
>> in this PR.
>> 
>> Option 1 (this PR)
>> 
>> With this change I unroll the internal cycles manually, inline it manually, 
>> and use local variables (not array) for data processing. Such an approach 
>> gives the best performance (see benchmark results). Without this change 
>> (with current array data processing) we observe a lot of load/store 
>> operations in comparison to processing in local variables, both on AMD64 and 
>> on AArch64.  
>> Native implementations shows that on AArch64 (32 GPU registers) SHA-3 
>> algorithm can hold all 25 data and all temporary variables in registers. C2 
>> can't optimize it as well because many regisers are allocated for internal 
>> usage: rscratch1, rscratch2, rmethod, rthread, etc.  With data in local 
>> variables the number of excessive load/stores is much smaller and 
>> performance result is much better.
>> 
>> Option 2 (alternative)
>> 
>> LINK: [the change](https://github.com/bulasevich/jdk/commit/095fc9db)
>> 
>> This is a more conservative change which minimizes code changes, but it has 
>> lower performance improvement. Please let me know if you think this change 
>> is better then main one: in this case I will replace it within this Pull 
>> Request.
>> 
>> With this change I unroll the internal cycles manually and use @Forceinline 
>> annotation. Manual unrolling is necessary because C2 does not recognize 
>> there is a counted cycle that can be completely unrolled. Instead of 
>> replacing the loop with five loop bodies C2 splits the loop to pre- main- 
>> and post- loop which is not good for this case. C2 works better when the 
>> array is created locally, but in our case the array is created on object 
>> instantiation, so C2 can't prove the array length is constant. The second 
>> issue affecting performance is that methods with unrolled loops get larger 
>> and C2 does not inline them. It's addressed here by using @Forceinline 
>> annotation.
>> 
>> Benchmark results
>> 
>> We measured the change on four platforms: ARM32, PPC, AArch64 (with no 
>> advanced SHA-3 instructions) and AMD on
>>   
>> [MessageDigests](https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/java/security/MessageDigests.java)
>>  benchmark. Here is the result: 
>> http://cr.openjdk.java.net/~bulasevich/8275914/sha3_bench.html
>> - fix1 (red) is the Option 1 (this PR) change: gains 50/38/83/38% on 
>> ARM32/PPC/AArch64/AMD64
>> - fix2 (green) is the Option 2 (alternative) change: gains 23/33/40/17% on 
>> ARM32/PPC/AArch64/AMD64
>> 
>> Testing
>> 
>> Tested with JTREG and SHA-3 Project (NIST) test vectors: run SHA3 
>> implementation over the same vectors before and after the change, and 
>> checking the results are the same.
>> The test tool: http://cr.openjdk.java.net/~bulasevich/8275914/RunKat.java
>> The test vectors compiled from the [Final Algorithm 
>> Package](https://csrc.nist.gov/Projects/Hash-Functions/SHA-3-Project):
>> http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_224.txt
>> http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_256.txt
>> http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_384.txt
>> http://cr.openjdk.java.net/~bulasevich/8275914/MsgKAT_512.txt
>
> Boris Ulasevich has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   copyright fix

Lgtm as well.

-------------

Marked as reviewed by phh (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/6847

Re: RFR: 8275914: SHA3: changing java implementation to help C2 create high-performance code [v3]

Reply via email to