I'm sorry this is a rather long email, and I pray for your patience. I'm getting close to something I can put forward for review. The performance is encouraging.
[ Some background: The kernel of RSA and Diffie-Hellman key exchange is Montgomery multiplication. Optimizing RSA basically comes down to optimizing Montgomery multiplication. The core of OpenJDK's RSA is BigInteger.oddModPow(). ] My Montgomery multiply routine (mostly portable C, with a small assembly-language insert) executes a 1024-bit multiply/reduce in about 2000ns; the hand-coded assembly language equivalent in OpenSSL is faster (as you'd expect) but not very much faster: about 1700ns. In other words, compiled C is only about 20% slower. Firstly, this is pretty remarkable performance by GCC (Yay! Go GCC!) and it shows I'm on the right track. It also shows that there isn't a huge amount to be gained by hand-coding Montgomery multiplication, but anybody who fancies their hand can try to improve on GCC. This is also very nice because porting it to non-x86 hardware is fairly straightforward -- certainly far easier than writing a large assembly- language routine. Here are some numbers for comparison. Today's hs-comp: sign verify sign/s verify/s rsa 512 bits 0.000133s 0.000009s 7508.5 112819.1 rsa 1024 bits 0.000667s 0.000028s 1498.6 35497.2 rsa 2048 bits 0.003867s 0.000097s 258.6 10342.7 rsa 4096 bits 0.026383s 0.000357s 37.9 2799.8 After my patch: sign verify sign/s verify/s rsa 512 bits 0.000071s 0.000005s 14127.3 204112.4 rsa 1024 bits 0.000292s 0.000013s 3424.5 74204.1 rsa 2048 bits 0.001628s 0.000045s 614.4 22399.7 rsa 4096 bits 0.010966s 0.000163s 91.2 6117.8 So, it's about twice as fast we have at present. [ Note that this comparison includes the latest "8081778: Use Intel x64 CPU instructions for RSA acceleration" patch. ] However, even after my patch OpenJDK is still somewhat slower than OpenSSL, which is: sign verify sign/s verify/s rsa 512 bits 0.000048s 0.000004s 20687.1 257399.4 rsa 1024 bits 0.000189s 0.000011s 5288.3 91711.5 rsa 2048 bits 0.001174s 0.000037s 851.7 27367.2 rsa 4096 bits 0.008475s 0.000137s 118.0 7305.4 [ I am assuming that OpenSSL represents something like the "speed of light" for RSA on x86: this is carefully hand-coded assembly language and C, hand tuned. Getting anywhere near OpenSSL is a major win. ] Here's my problem: Some of this slowdown is due to the overhead of using the JCE, but not very much. Quite a lot of it, however, is due to the fact that the scratch memory used in oddModPow() is a big-endian array of jints. I have to convert the big-endian jints into native jlongs to do the multiply on little-endian x86-64. If I do the word reversal during the multiply (i.e. keep all data in memory in little-endian form, swap words when reading and writing to memory) the performance is horrible: a 1024-bit multiply takes 3000ns, 50% longer. This perhaps isn't very surprising: if you do the word-reversal before the multiplication you have O(N) swaps, if you do it during the multiplication you have O(N^2). I have found that the best thing to do is to word reverse all the data in memory into temporary little-endian arrays and do the work on them. It's much faster, but still really is very annoying: for 1024-bit RSA the word reversal is 14% of the total runtime. It would be nice to keep all of the data in an array of jlongs for the duration of oddModPow(). Here's one idea: I could write a version of oddModPow() which is guaranteed never to use the Java version of the Montgomery multiply. This would use a JNI method which calls the native Montgomery multiply routine, guaranteeing that that we always use that native routine, even from the interpreter and C1. Then I could keep all the internal state in native word order, and all this word-swapping would just go away. This would have the additional benefit that it would be faster when using the interpreter and C1. So, we'd have two versions of oddModPow() in BigInteger, and switch between them depending on whether the platform had support for a native Montgomery multiplication. The downside of having two versions of oddModPow() would, of course, be some code duplication. Or am I just making too much fuss about this? Maybe I should be happy with what I've got. Thank you for reading this far, Andrew.