Sadly, this turned out to be a false positive. More precisely, the measurements are probably accurate, but the results aren't useful.
Two eggs in the face for the price of one here: 1) Yes, of course the test script included a validation check of the signature result. Unfortunately, that check was itself broken, so it didn't notice that the purported signature was nonsense. 2) Even without (1), the approach of bypassing the keystore would not have worked in its present form, because while Garner's Algorithm is not very complicated, it's not just modexp, so doing the modexp without touching the keystore would not have prevented us from needing to read from the keystore for the rest of the calculation. This should have been obvious from the start. So it's going through all the motions, and given the constant time design it's probably taking the same amount of time that it would if this approach worked, but the some of the input is garbage. GIGO. Feh. Going back to the original problem: profiling says that the main bottleneck is the slow retrieval from the keystore: this appears to take about twice as long as modexp, and the bulk of that time is spent in hal_aes_keyunwrap(), specifically in the function which runs a block through the AES core. So this looks like an AES throughput problem, at least given the way we're currently using the AES core. It's possible that we could speed up the keystore a bit by using a different wrapping algorithm, as Peter Gutmann has been suggesting all along, although perhaps for reasons other than speed. Eg, we could use AES in one of the classic block cipher modes instead of AES keywrap: presumably we'd need a MAC too, but at least in theory that could run in parallel with the encryption, although the control code to manage that might be a bit of nasty. At some point, just for confirmation, I (or somebody) should generate a test build with the AES keywrap code disabled (so, obviously, not to be used for any purpose other than this test) so that we can find out how fast the rest of the RSA signature path runs run without the keywrap overhead, where the remaining bottlenecks are, etc. Other alternatives, both of which boil down to tossing some part of the problem over the wall to the Verilog folks: * Build a real RSA CRT core, which would consist of two modexp cores whose outputs fed into the other operations needed for Garner's algorithm. See the rsa_crt() function in sw/libhal/rsa.c for what this would need to do. * Build a real AES keywrap core, and hope it runs significantly faster than the current approach of software driving AES ECB. Some useful work did come out of all this, in particular I think we understand the needs of the internal core management API better than we did before I fell down this rabbit hole, and that code will end up on the master branch eventually. But the current approach isn't going to deliver the hoped-for RSA speed. Sorry for raising false hopes. _______________________________________________ Tech mailing list Tech@cryptech.is https://lists.cryptech.is/listinfo/tech