Hi Florian

On 19/08/2014, at 12:32 am, Florian Weimer <fwei...@redhat.com> wrote:
> This change addresses a severe performance regression, first introduced in
> JDK 8, triggered by the negotiation of a GCM cipher suite in the TLS
> implementation. This regression is a result of the poor performance of the
> implementation of the GHASH function.
>
> I first tried to eliminate just the allocations in blockMult while still
> retaining the byte arrays. This did not substantially increase performance
> in my micro-benchmark. I then replaced the 16-byte arrays with longs,
> replaced the inner loops with direct bit fiddling on the longs, eliminated
> data-dependent conditionals (which are generally frowned upon in
> cryptographic algorithms due to the risk of timing attacks), and split the
> main loop in two, one for each half of the hash state. This is the result:
>
> <https://fweimer.fedorapeople.org/openjdk/ghash-performance/>
>
> Performance is roughly ten times faster. My test download over HTTPS is no
> longer CPU-bound, and GHASH hardly shows up in profiles anymore. (That's why
> I didn't consider further changes, lookup tables in particular.)
> Micro-benchmarking shows roughly a ten-fold increase in throughput, but this
> is probably underestimating it because of the high allocation rate of the old
> code.

I can reproduce these results using the Bouncy Castle GCM implementation and
your multiplication strategy.

Compared to AES/GCM baselines of 7.33 MB/s (Bouncy Castle basic multiplier)
and 3.32 MB/s (Oracle JCE) on 1.8.0_25, I get 37.4 MB/s with your 2x64-bit
multiplier strategy. That is a 4x improvement over the basic BC implementation
and a 10x improvement over the Oracle 1.8 JCE version. For comparison, the BC
table-based implementations (not constant time) can be pushed to 68 MB/s
(8x and 19x faster than the reference speeds).

These results are consistent for 1.8, 1.7, 1.6 and 1.6 -d32 on the same
OS/hardware (Mac OS X, Haswell MBP).

Interestingly, the constant-time conditional masking slightly outperforms the
equivalent (non-constant-time) branching code on 1.8/1.7 JVMs, so I don’t see
any penalty whatsoever for the constant-time approach (at least not on this
architecture - it looks like the extra ops matter less than the branches they
avoid).

> The performance improvement on 32-bit architectures is probably a bit less,
> but I suspect that using four ints instead of two longs would penalize 64-bit
> architectures.

With the 2x64-bit multiplier adapted to a 4x32-bit multiplier I get about
24 MB/s on all architectures (about a 35% penalty). Interestingly, this is
consistent even on 1.6 -d32.

cheers
tim

> --
> Florian Weimer / Red Hat Product Security
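
P.S. For anyone following along, here is a rough sketch of the kind of
2x64-bit constant-time multiplication being discussed - my own reconstruction
from Florian's description, not his actual patch. The method name blockMult,
the long[] state layout (big-endian halves), and the single loop over the two
halves (Florian's version reportedly splits it into two loops) are just
illustrative:

import java.util.Arrays;

public class GhashSketch {

    // Reduction constant for GF(2^128) in GCM's reflected bit order:
    // x^128 + x^7 + x^2 + x + 1 -> 0xE1 followed by 120 zero bits.
    private static final long R_HIGH = 0xE100000000000000L;

    // Multiplies x (two longs, big-endian halves of a 128-bit block) by
    // the hash subkey (h0, h1) in GF(2^128), in place, with no
    // data-dependent branches.
    static void blockMult(long[] x, long h0, long h1) {
        long z0 = 0, z1 = 0;    // accumulator Z
        long v0 = h0, v1 = h1;  // V starts as the hash subkey H

        for (int half = 0; half < 2; half++) {
            long xi = x[half];
            for (int i = 0; i < 64; i++) {
                // mask is all-ones if the current bit of X is set, else 0.
                long mask = -((xi >>> (63 - i)) & 1L);
                z0 ^= v0 & mask;
                z1 ^= v1 & mask;

                // 128-bit right shift of V; if the bit shifted out was 1,
                // fold in the reduction constant (again via a mask).
                long carry = -(v1 & 1L);
                v1 = (v1 >>> 1) | (v0 << 63);
                v0 = (v0 >>> 1) ^ (carry & R_HIGH);
            }
        }
        x[0] = z0;
        x[1] = z1;
    }

    public static void main(String[] args) {
        // Toy usage with arbitrary example values.
        long[] state = { 0x0123456789ABCDEFL, 0xFEDCBA9876543210L };
        blockMult(state, 0x66E94BD4EF8A2C3BL, 0x884CFA59CA342B2EL);
        System.out.println(Arrays.toString(state));
    }
}

The branching variant I benchmarked against simply replaces the two masks with
if-statements on the corresponding bits; that is the comparison the 1.8/1.7
numbers above refer to.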