Hi Florian

On 19/08/2014, at 12:32 am, Florian Weimer <fwei...@redhat.com> wrote:

> This change addresses a severe performance regression, first introduced in 
> JDK 8, triggered by the negotiation of a GCM cipher suite in the TLS 
> implementation.  This regression is a result of the poor performance of the 
> implementation of the GHASH function.
> 
> I first tried to eliminate just the allocations in blockMult while still 
> retaining the byte arrays.  This did not substantially increase performance 
> in my micro-benchmark.  I then replaced the 16-byte arrays with longs, 
> replaced the inner loops with direct bit fiddling on the longs, eliminated 
> data-dependent conditionals (which are generally frowned upon in 
> cryptographic algorithms due to the risk of timing attacks), and split the 
> main loop in two, one for each half of the hash state.  This is the result:
> 
>  <https://fweimer.fedorapeople.org/openjdk/ghash-performance/>
> 
> Performance is roughly ten times faster.  My test download over HTTPS is no 
> longer CPU-bound, and GHASH hardly shows up in profiles anymore. (That's why 
> I didn't consider further changes, lookup tables in particular.)  
> Micro-benchmarking shows roughly a ten-fold increase in throughput, but this 
> is probably underestimating it because of the high allocation rate of the old 
> code.

I can reproduce these results using the Bouncy Castle GCM implementation and 
your multiplication strategy.
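For anyone following along, the core of the 2x64 bit strategy looks roughly 
like the following. This is my own sketch of the technique under discussion 
(the bit-reflected GF(2^128) multiply from NIST SP 800-38D), not the actual 
webrev code, and the names (mult, z, v, x) are mine:

```java
// Sketch of a constant-time GHASH block multiplication on two longs, in the
// bit-reflected GF(2^128) representation. Not the actual patch code.
public class Ghash64 {

    // Reduction constant for x^128 + x^7 + x^2 + x + 1, bit-reflected.
    private static final long R = 0xE100000000000000L;

    // z = x * h in GF(2^128); (x0,x1) and (h0,h1) hold the high/low
    // 64 bits of each 128-bit block. The result lands in z[0], z[1].
    static void mult(long[] z, long x0, long x1, long h0, long h1) {
        long z0 = 0, z1 = 0;
        long v0 = h0, v1 = h1;
        // First half of the loop: the 64 high-order bits of x.
        for (int i = 0; i < 64; i++) {
            long m = x0 >> 63;                // all-ones if top bit of x0 set
            z0 ^= v0 & m;
            z1 ^= v1 & m;
            long c = -(v1 & 1);               // all-ones if v's last bit set
            v1 = (v1 >>> 1) | (v0 << 63);     // 128-bit right shift of v
            v0 = (v0 >>> 1) ^ (c & R);        // conditional reduction, no branch
            x0 <<= 1;
        }
        // Second half: the 64 low-order bits of x.
        for (int i = 0; i < 64; i++) {
            long m = x1 >> 63;
            z0 ^= v0 & m;
            z1 ^= v1 & m;
            long c = -(v1 & 1);
            v1 = (v1 >>> 1) | (v0 << 63);
            v0 = (v0 >>> 1) ^ (c & R);
            x1 <<= 1;
        }
        z[0] = z0;
        z[1] = z1;
    }

    public static void main(String[] args) {
        long[] z = new long[2];
        // The block 0x80000...0 is the multiplicative identity in this
        // representation, so h times it must come back unchanged.
        mult(z, 0x8000000000000000L, 0L,
                0x66e94bd4ef8a2c3bL, 0x884cfa59ca342b2eL);
        System.out.println(z[0] == 0x66e94bd4ef8a2c3bL
                        && z[1] == 0x884cfa59ca342b2eL);
        // Field multiplication is commutative; swapping the operands
        // must give the same product.
        long[] a = new long[2], b = new long[2];
        mult(a, 0x0123456789abcdefL, 0xfedcba9876543210L,
                0x66e94bd4ef8a2c3bL, 0x884cfa59ca342b2eL);
        mult(b, 0x66e94bd4ef8a2c3bL, 0x884cfa59ca342b2eL,
                0x0123456789abcdefL, 0xfedcba9876543210L);
        System.out.println(a[0] == b[0] && a[1] == b[1]);
    }
}
```

The arithmetic shift (x0 >> 63) and negation (-(v1 & 1)) produce all-ones or 
all-zero masks, which is what removes the data-dependent conditionals.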

Compared to a baseline for AES/GCM of 7.33 MB/s (Bouncy Castle basic 
multiplier) and 3.32 MB/s (Oracle JCE) on 1.8.0_25, your 2x64 bit multiplier 
strategy gives me 37.4 MB/s.

This is a 4x improvement over the basic BC implementation and a 10x improvement 
over the Oracle 1.8 JCE version.
For comparison, the BC table-based implementations (which are not constant 
time) can be pushed to 68 MB/s (8x and 19x faster than the respective 
baseline speeds).

These results are consistent across 1.8, 1.7, 1.6 and 1.6 -d32 on the same 
OS/hardware (Mac OS X, Haswell MBP).
Interestingly, the constant-time conditional masking slightly outperforms the 
equivalent (non-constant-time) branching code on the 1.8/1.7 JVMs, so I see no 
penalty whatsoever for constant time (at least not on this architecture; it 
looks like the extra ops matter less than the branches being avoided).

> 
> The performance improvement on 32-bit architectures is probably a bit less, 
> but I suspect that using four ints instead of two longs would penalize 64-bit 
> architectures.
> 

With the 2x64 multiplier adapted to a 4x32 bit version I get about 24 MB/s on 
all the JVMs above (about a 35% penalty). Interestingly, this is consistent 
even in 1.6 -d32.
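For the record, in the 4x32 adaptation the shift-and-reduce step spreads 
across four chained int shifts instead of two long ones, roughly like this 
(again my own sketch, checked against the 2x64 version):

```java
// Sketch of the 4x32-bit multiply-by-x step, verified against the
// 2x64-bit version. Names are mine, not the benchmark code's.
public class Shift32Demo {

    // 2x64: shift the 128-bit value (v0,v1) right by one and reduce.
    static long[] shift64(long v0, long v1) {
        long c = -(v1 & 1);
        long n1 = (v1 >>> 1) | (v0 << 63);
        long n0 = (v0 >>> 1) ^ (c & 0xE100000000000000L);
        return new long[] { n0, n1 };
    }

    // 4x32: the same shift over four ints; each word picks up the low
    // bit of the word above it, and the reduction constant only
    // touches the top word.
    static int[] shift32(int v0, int v1, int v2, int v3) {
        int c = -(v3 & 1);
        int n3 = (v3 >>> 1) | (v2 << 31);
        int n2 = (v2 >>> 1) | (v1 << 31);
        int n1 = (v1 >>> 1) | (v0 << 31);
        int n0 = (v0 >>> 1) ^ (c & 0xE1000000);
        return new int[] { n0, n1, n2, n3 };
    }

    public static void main(String[] args) {
        // Both representations must agree on random 128-bit inputs.
        java.util.Random rnd = new java.util.Random(7);
        boolean ok = true;
        for (int i = 0; i < 1000; i++) {
            long v0 = rnd.nextLong(), v1 = rnd.nextLong();
            long[] a = shift64(v0, v1);
            int[] b = shift32((int) (v0 >>> 32), (int) v0,
                              (int) (v1 >>> 32), (int) v1);
            long b0 = ((long) b[0] << 32) | (b[1] & 0xFFFFFFFFL);
            long b1 = ((long) b[2] << 32) | (b[3] & 0xFFFFFFFFL);
            ok &= a[0] == b0 && a[1] == b1;
        }
        System.out.println(ok);
    }
}
```

The doubled number of masks, shifts and carries per step is presumably where 
the ~35% penalty comes from.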


cheers
tim

> -- 
> Florian Weimer / Red Hat Product Security
