On Wed, 21 Sep 2022 12:06:31 GMT, Aleksey Shipilev <sh...@openjdk.org> wrote:

>> Hi,
>> 
>> In the message digest implementations in the JDK, for example SHA256, two
>> bitwise expressions can be replaced with equivalent forms that each use one
>> fewer operation.  Specifically, "(x and y) xor ((complement x) and z)" can
>> be replaced with the equivalent "z xor (x and (y xor z))", and "(x and y)
>> xor (x and z) xor (y and z)" can be replaced with the equivalent "(x and y)
>> xor ((x xor y) and z)".  Each replacement eliminates one bitwise operation,
>> and thus improves performance.
>> 
>> Per my testing on my macOS laptop, the update to SHA256 improves message
>> digest throughput by 0.5%-0.8%.  The improvement is not large, but it may
>> still be worthwhile, as the change is simple and trivial, and it benefits
>> platforms that do not have CPU intrinsics for a given hash algorithm.
>> 
>> This patch updates the SHA2 implementation only.  Please let me know what
>> you think.  If you are fine with this small performance gain, I will update
>> more message digest implementations.  If no one is interested in these
>> small benefits, I will close this PR later.
>> 
>> Thanks,
>> Xuelei
>
> (Thinking out loud, because I am on the fence with this change)
> 
> First (negative), it departs from the SHA function definitions as stated in
> most papers.  This is a minor thing if we can prove the equivalence, but it
> still adds to maintenance overhead.  While the translation of the `maj`
> (boolean majority) function is more or less straightforward, using
> `xor`/`and` distributivity, the translation for `ch` is less so.  I had to
> scribble things down to verify that the new expressions' DNF matches the
> old one.
> 
> Second (negative), I suspect the current code is optimized more for
> instruction-level parallelism than for the number of operations.  (This
> might be why the SHA papers write it this way: to optimize latency in
> hardware implementations?)  In the case of `ch`, we have one operation less,
> but it is probably a wash, since in the original expression that spare
> operation can run in parallel with the rest of the code.  In the `maj` case
> it is worse: we now have three operations on the critical path (the
> expression tree is one level deeper) instead of two, which inhibits ILP.
> 
> Third (positive), I think the real targets are VM modes which cannot exploit
> any ILP, so the reduction in op count / bytecode size is sensible.  For
> example, Zero, which runs only in the interpreter, and where bytecode-level
> ILP cannot be exploited because every bytecode instruction is handled by a
> large chunk of code.  The improvements/regressions in this code are usually
> visible when hashing the JMODs or jlinking the image.  I see about 0.3%
> faster linux-x86_64-zero-release builds with this patch.
> 
> Fourth (positive), it is not unprecedented to optimize for interpreters here. 
> I did https://bugs.openjdk.org/browse/JDK-8256523 the last time in the same 
> code.
> 
> Are you getting +0.8% throughput on microbenchmarks?

@shipilev Good points and I agree with them.
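
For anyone reading along, here are the two rewrites in isolation (a sketch
with my own helper names, not the actual patch diff):

```java
// x, y, z are the 32-bit working variables of the compression loop.
static int chOld(int x, int y, int z)  { return (x & y) ^ (~x & z); }          // 4 ops
static int chNew(int x, int y, int z)  { return z ^ (x & (y ^ z)); }           // 3 ops

static int majOld(int x, int y, int z) { return (x & y) ^ (x & z) ^ (y & z); } // 5 ops
static int majNew(int x, int y, int z) { return (x & y) ^ ((x ^ y) & z); }     // 4 ops
```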
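
On the equivalence scribbling: since every operation involved is bitwise, bit
i of the result depends only on bit i of the inputs, so an exhaustive check
over single-bit inputs covers all 32-bit values.  A quick sketch of such a
check:

```java
public class ChMajEquivalence {
    public static void main(String[] args) {
        // Bitwise ops act on each bit independently, so the 8 single-bit
        // combinations of (x, y, z) cover every possible input.
        for (int x = 0; x <= 1; x++) {
            for (int y = 0; y <= 1; y++) {
                for (int z = 0; z <= 1; z++) {
                    int chOld  = ((x & y) ^ (~x & z)) & 1;
                    int chNew  = (z ^ (x & (y ^ z))) & 1;
                    int majOld = ((x & y) ^ (x & z) ^ (y & z)) & 1;
                    int majNew = ((x & y) ^ ((x ^ y) & z)) & 1;
                    if (chOld != chNew || majOld != majNew) {
                        throw new AssertionError(
                            "mismatch at x=" + x + ", y=" + y + ", z=" + z);
                    }
                }
            }
        }
        System.out.println("ch and maj rewrites agree on all inputs");
    }
}
```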

> Are you getting +0.8% throughput on microbenchmarks?

Yes, I got it a few times.  I tested on macOS/M1 and Linux/AMD with message
sizes of 64/512/2048/16384 bytes.  As the improvement is small, the numbers
were affected by platform status and load; I can see an improvement, but I
did not get a stable +0.8%.
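
The benchmark I used was roughly the following shape (a minimal JMH sketch;
the class name and setup are illustrative, not my exact harness):

```java
import java.security.MessageDigest;
import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@State(Scope.Thread)
@Fork(1)
public class SHA256Bench {

    // The message sizes I tested, in bytes.
    @Param({"64", "512", "2048", "16384"})
    int size;

    byte[] data;
    MessageDigest md;

    @Setup
    public void setup() throws Exception {
        data = new byte[size];
        new Random(42).nextBytes(data);
        md = MessageDigest.getInstance("SHA-256");
    }

    @Benchmark
    public byte[] digest() {
        // Single-call update + digest; digest() resets the instance afterwards.
        return md.digest(data);
    }
}
```

On hardware where the HotSpot SHA-256 intrinsic is available, running with
`-XX:-UseSHA256Intrinsics` is needed to actually exercise the pure-Java code
path that this patch changes.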

I got a more than +1% improvement in another algorithm implementation (SM3).
That's the reason I wanted to check the JDK implementations.  But you have
convinced me that this may not be worth the effort.  Thank you so much!

> https://bugs.openjdk.org/browse/JDK-8256523

Thank you so much for pointing this out.  From it, I learned something new:
Java inlining may not work in some circumstances.  Impressive PR description!

-------------

PR: https://git.openjdk.org/jdk/pull/10365
