On Sun, 13 Nov 2022 02:54:10 GMT, Anthony Scarpino <ascarp...@openjdk.org> wrote:
> I would like a review of an update to the GCM code. A recent report showed
> that GCM memory usage for TLS was very large. This was a result of in-place
> buffers, which TLS uses, and how the code handled the combined intrinsic
> method during decryption. A temporary buffer was used because the combined
> intrinsic does gctr before ghash, which for in-place buffers overwrites the
> ciphertext before it is hashed and results in a bad tag. The fix is to not
> use the combined intrinsic during in-place decryption and instead depend on
> the individual GHASH and CounterMode intrinsics. Direct ByteBuffers are not
> affected, as they are not used by the intrinsics directly.
>
> The reduction in memory usage brought performance back to where it was
> before, despite using the slower intrinsics (gctr and ghash individually).
> The extra memory allocation for the temporary buffer outweighed the benefit
> of the faster intrinsic.
>
> JDK 17:    122913.554 ops/sec
> JDK 19:     94885.008 ops/sec
> Post fix:  122735.804 ops/sec
>
> There is no regression test because this is a memory change and test
> coverage already exists.

Carter, when I looked at this a few months back (admittedly I'm a fairly careless profiler and didn't dig all the way down to a root cause), it seemed to me that direct ByteBuffers might be taking a performance hit around here: https://github.com/openjdk/jdk/blob/3416bfa2560e240b5e602f10e98e8a06c96852df/src/java.base/share/classes/com/sun/crypto/provider/GaloisCounterMode.java#L722. Essentially, it's easier for the intrinsics to operate on byte arrays, so any direct data passed in gets copied into a new byte array, which is then passed to the intrinsic. It's possible that I'm misunderstanding, however.

I think one could test this hypothesis by adjusting the size of PARALLEL_LEN: halving it would make intrinsic usage less efficient, but should correspondingly halve the allocation rate.

-------------

PR: https://git.openjdk.org/jdk/pull/11121
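To make the in-place pattern discussed above concrete, here is a minimal sketch using only the public javax.crypto API. It encrypts a buffer with AES/GCM and then decrypts it with input and output referencing the same byte array, which is the TLS-style usage that, per the quoted message, went through the temporary-buffer path. The class name `InPlaceGcm` and method `roundTrip` are illustrative assumptions; the sketch only exercises the API path and does not show which intrinsics the provider selects, nor does it measure memory.

```java
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Illustrative sketch (hypothetical class): in-place AES/GCM, as TLS does it.
class InPlaceGcm {
    static final int TAG_BITS = 128; // 16-byte authentication tag
    static final int IV_LEN = 12;    // recommended GCM IV length

    // Encrypts, then decrypts in place (same array for input and output).
    // Returns the recovered plaintext.
    static byte[] roundTrip(byte[] plaintext) throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        SecretKey key = kg.generateKey();
        byte[] iv = new byte[IV_LEN];
        new SecureRandom().nextBytes(iv);
        GCMParameterSpec spec = new GCMParameterSpec(TAG_BITS, iv);

        Cipher enc = Cipher.getInstance("AES/GCM/NoPadding");
        enc.init(Cipher.ENCRYPT_MODE, key, spec);
        // One buffer sized for ciphertext plus tag; plaintext goes at the front.
        byte[] buf = new byte[enc.getOutputSize(plaintext.length)];
        System.arraycopy(plaintext, 0, buf, 0, plaintext.length);
        // Cipher.doFinal is documented as copy-safe, so input and output may
        // reference the same array.
        int ctLen = enc.doFinal(buf, 0, plaintext.length, buf, 0);

        Cipher dec = Cipher.getInstance("AES/GCM/NoPadding");
        dec.init(Cipher.DECRYPT_MODE, key, spec);
        // In-place decryption: the tag must be checked against the ciphertext,
        // which is why the ciphertext cannot be overwritten before GHASH sees it.
        int ptLen = dec.doFinal(buf, 0, ctLen, buf, 0);
        return Arrays.copyOf(buf, ptLen);
    }

    public static void main(String[] args) throws Exception {
        byte[] msg = "in-place GCM".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(msg, roundTrip(msg))); // prints true
    }
}
```

Running a loop of `roundTrip` calls under a profiler on JDK 19 versus a build with the fix would be one way to observe the allocation difference described in the message; the sketch itself makes no claim about the numbers.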