On Wed, 21 Sep 2022 12:06:31 GMT, Aleksey Shipilev <sh...@openjdk.org> wrote:
>> Hi,
>>
>> In the message digest implementations in the JDK, for example SHA-256, two bitwise operations could be replaced with equivalent arithmetic, reducing the number of bitwise operations accordingly. Specifically, "(x and y) xor ((complement x) and z)" could be replaced with the equivalent "z xor (x and (y xor z))", and "(x and y) xor (x and z) xor (y and z)" could be replaced with the equivalent "(x and y) xor ((x xor y) and z)". Each replacement removes one bitwise operation and thus improves performance.
>>
>> Per my testing on my macOS laptop, the update to SHA-256 improves message digest throughput by 0.5%-0.8%. The improvement is not significant, but it might be worthwhile since the update is simple and trivial, especially for those platforms that do not have CPU intrinsics for a given hash algorithm.
>>
>> This patch updates the SHA-2 implementation only. Please let me know what you think. If you are good with this small performance gain, I will update more message digest implementations. If no one is interested in these small benefits, I will close this PR later.
>>
>> Thanks,
>> Xuelei
>
> (Thinking out loud, because I am on the fence with this change)
>
> First (negative), it departs from the SHA function definitions as stated in most papers. This is a minor thing if we can prove the equivalence, but it still adds to the maintenance overhead. While the translation of the `maj` (boolean majority) function is more or less straightforward, using the `xor`/`and` distributivity, the translation for `ch` is less so. I had to scribble things down to verify that the new expression's DNF matches the old one.
>
> Second (negative), I suspect the current code is optimized more for instruction-level parallelism than for the number of operations. (This might be the reason why SHA papers do this, to optimize latency in hardware implementations?) In the case of `ch`, we have one operation less, but it is probably a wash, since in the original expression that spare operation can run in parallel with the rest of the code. In the `maj` case it is worse: we now have three operations on the critical path (the expression tree is one level deeper) instead of two, which inhibits ILP.
>
> Third (positive), I think the real targets are VM modes which cannot exploit any ILP, so the reduction in op count / bytecode size is sensible. For example, Zero, which runs only in the interpreter, and where bytecode-level ILP cannot be exploited because every bytecode instruction is handled by a large chunk of code. The improvements/regressions in this code are usually visible when hashing the JMODs or jlinking the image. I see about 0.3% faster linux-x86_64-zero-release builds with this patch.
>
> Fourth (positive), it is not unprecedented to optimize for interpreters here. I did https://bugs.openjdk.org/browse/JDK-8256523 the last time in the same code.
>
> Are you getting +0.8% throughput on microbenchmarks?

@shipilev Good points, and I agree with them.

> Are you getting +0.8% throughput on microbenchmarks?

Yes, I got it a few times. I tested on macOS/M1 and Linux/AMD for sizes 64/512/2048/16384. As the improvement is not significant, the numbers were affected by the platform's status/load. I could see an improvement, but I did not get a stable +0.8%. I got a more than +1% improvement in another algorithm implementation (SM3). That's the reason I wanted to check out the JDK implementations. But you have convinced me that this may not be worth the effort. Thank you so much!
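For reference, here is a minimal, self-contained sketch of the two identities under discussion. This is not the JDK code; the class and method names are made up for illustration. It checks on random inputs that the proposed forms compute the same bits as the textbook `ch` and `maj`:

```java
import java.util.Random;

public class Sha2BoolFns {
    // Textbook SHA-2 choice function: (x & y) ^ (~x & z) -- 4 operations
    static int ch(int x, int y, int z)     { return (x & y) ^ (~x & z); }
    // Proposed form, one operation fewer: z ^ (x & (y ^ z)) -- 3 operations
    static int chOpt(int x, int y, int z)  { return z ^ (x & (y ^ z)); }

    // Textbook SHA-2 majority function: (x & y) ^ (x & z) ^ (y & z) -- 5 operations
    static int maj(int x, int y, int z)    { return (x & y) ^ (x & z) ^ (y & z); }
    // Proposed form via xor/and distributivity: (x & y) ^ ((x ^ y) & z) -- 4 operations
    static int majOpt(int x, int y, int z) { return (x & y) ^ ((x ^ y) & z); }

    public static void main(String[] args) {
        Random r = new Random();
        for (int i = 0; i < 1_000_000; i++) {
            int x = r.nextInt(), y = r.nextInt(), z = r.nextInt();
            if (ch(x, y, z)  != chOpt(x, y, z))  throw new AssertionError("ch mismatch");
            if (maj(x, y, z) != majOpt(x, y, z)) throw new AssertionError("maj mismatch");
        }
        System.out.println("ch/maj identities hold on 1M random inputs");
    }
}
```

The per-bit argument is short: for `ch`, if a bit of `x` is 1 both forms select the bit of `y`, and if it is 0 both select the bit of `z`; for `maj`, `(x ^ y) & z` distributes into `(x & z) ^ (y & z)`. Counting operations, `ch` drops from 4 to 3 and `maj` from 5 to 4, matching the claim of one operation saved per replacement. Note, as the review points out, that fewer operations does not necessarily mean lower latency: the optimized `maj` tree is one level deeper, so the saving mainly helps interpreters that execute one bytecode at a time.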
> https://bugs.openjdk.org/browse/JDK-8256523

Thank you so much for pointing this out. From it, I learned something new: Java inlining may not work in some circumstances. Impressive PR description!

-------------

PR: https://git.openjdk.org/jdk/pull/10365