RSA intrinsics [Was: RFR(L): 8069539: RSA acceleration]

Andrew Haley Thu, 04 Jun 2015 10:10:41 -0700

I'm sorry this is a rather long email, and I pray for your patience.

I'm getting close to something I can put forward for review.  The
performance is encouraging.


[ Some background: The kernel of RSA and Diffie-Hellman key exchange
is Montgomery multiplication.  Optimizing RSA basically comes down to
optimizing Montgomery multiplication.  The core of OpenJDK's RSA is
BigInteger.oddModPow().  ]

My Montgomery multiply routine (mostly portable C, with a small
assembly-language insert) executes a 1024-bit multiply/reduce in about
2000ns; the hand-coded assembly language equivalent in OpenSSL is
faster (as you'd expect) but not very much faster: about 1700ns.  In
other words, compiled C is only about 20% slower.

Firstly, this is pretty remarkable performance by GCC (Yay!  Go GCC!)
and it shows I'm on the right track.  It also shows that there isn't a
huge amount to be gained by hand-coding Montgomery multiplication, but
anybody who fancies their hand can try to improve on GCC.  This is
also very nice because porting it to non-x86 hardware is fairly
straightforward -- certainly far easier than writing a large assembly-
language routine.

Here are some numbers for comparison.

Today's hs-comp:

                  sign    verify    sign/s verify/s
rsa  512 bits 0.000133s 0.000009s   7508.5 112819.1
rsa 1024 bits 0.000667s 0.000028s   1498.6  35497.2
rsa 2048 bits 0.003867s 0.000097s    258.6  10342.7
rsa 4096 bits 0.026383s 0.000357s     37.9   2799.8

After my patch:

                  sign    verify    sign/s verify/s
rsa  512 bits 0.000071s 0.000005s  14127.3 204112.4
rsa 1024 bits 0.000292s 0.000013s   3424.5  74204.1
rsa 2048 bits 0.001628s 0.000045s    614.4  22399.7
rsa 4096 bits 0.010966s 0.000163s     91.2   6117.8

So, it's about twice as fast we have at present.

[ Note that this comparison includes the latest "8081778: Use Intel
x64 CPU instructions for RSA acceleration" patch. ]

However, even after my patch OpenJDK is still somewhat slower than
OpenSSL, which is:

                  sign    verify    sign/s verify/s
rsa  512 bits 0.000048s 0.000004s  20687.1 257399.4
rsa 1024 bits 0.000189s 0.000011s   5288.3  91711.5
rsa 2048 bits 0.001174s 0.000037s    851.7  27367.2
rsa 4096 bits 0.008475s 0.000137s    118.0   7305.4

[ I am assuming that OpenSSL represents something like the "speed of
light" for RSA on x86: this is carefully hand-coded assembly language
and C, hand tuned.  Getting anywhere near OpenSSL is a major win. ]

Here's my problem:

Some of this slowdown is due to the overhead of using the JCE, but not
very much.  Quite a lot of it, however, is due to the fact that the
scratch memory used in oddModPow() is a big-endian array of jints.  I
have to convert the big-endian jints into native jlongs to do the
multiply on little-endian x86-64.

If I do the word reversal during the multiply (i.e. keep all data in
memory in little-endian form, swap words when reading and writing to
memory) the performance is horrible: a 1024-bit multiply takes 3000ns,
50% longer.  This perhaps isn't very surprising: if you do the
word-reversal before the multiplication you have O(N) swaps, if you do
it during the multiplication you have O(N^2).

I have found that the best thing to do is to word reverse all the data
in memory into temporary little-endian arrays and do the work on them.
It's much faster, but still really is very annoying: for 1024-bit RSA
the word reversal is 14% of the total runtime.

It would be nice to keep all of the data in an array of jlongs for the
duration of oddModPow().  Here's one idea: I could write a version of
oddModPow() which is guaranteed never to use the Java version of the
Montgomery multiply.  This would use a JNI method which calls the
native Montgomery multiply routine, guaranteeing that that we always
use that native routine, even from the interpreter and C1.  Then I
could keep all the internal state in native word order, and all this
word-swapping would just go away.  This would have the additional
benefit that it would be faster when using the interpreter and C1.
So, we'd have two versions of oddModPow() in BigInteger, and switch
between them depending on whether the platform had support for a
native Montgomery multiplication.

The downside of having two versions of oddModPow() would, of course,
be some code duplication.

Or am I just making too much fuss about this?  Maybe I should be happy
with what I've got.

Thank you for reading this far,

Andrew.

RSA intrinsics [Was: RFR(L): 8069539: RSA acceleration]

Reply via email to