Warning, this is a little long and a lot wonky. I did some timing tests to compare the RSA signing performance of Pavel's "old" ModExpA7 core (in the dual-core configuration we use for CRT) versus his "new" ModExpNG core.
First I built a bitstream with both cores, at 60MHz bus and core speed. Then I hacked rsa.c and modexp.c to support both cores, and added a function hal_modexp_use_modexpng() and CLI command `rsa modexpng on/off` to switch between them. Finally I added a bunch of instrumentation, using the ARM's cycle-counting facility, to get fine-grained execution timing.

This is all committed on branch 'modexpng' in the following repos:

  core/platform/common
  core/platform/alpha
  sw/libhal
  sw/stm32

The following tables are side-by-side comparisons of the two cores, signing the same message with the same 2048-bit key. All times are in milliseconds.

The first table is for bulk signing, i.e. the key has been used before, so the blinding factors have already been calculated, as well as the modulus coefficient and Montgomery factor that the core uses to speed up its calculations.

The first line is the result of running libhal/tests/parallel-signatures.py against the board. (For this, I took the median of 1000 signatures, since the mean would include the first signature, which would include the blinding factor and precalc times.) The third line is the result of calling hal_rpc_pkey_sign from the CLI, and is the mean of 1000 or so runs, after priming it once or twice. Therefore, the second line, which is the difference between the two, is the overhead of the serial RPC mechanism.

Note that this version does blinding factor mutation in software for both cores. The modexpng core does this for free, but using that would require some code changes to read the factors out of the core, so this is an area where modexpng could end up even faster.

Note also that message blinding/unblinding is done in software for modexpa7, but in hardware for modexpng.

Finally, note that there is a measurable penalty for unpacking libtfm fp_ints (little-endian structs) out to big-endian bytestrings, so that they can be fed to little-endian modexp cores.
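As a rough illustration of where the unpacking time goes, here is a minimal sketch in C of turning a little-endian digit array into a big-endian byte string. It is not the libhal or libtfm code: sketch_int, SKETCH_MAX_DIGITS and sketch_unpack_be are made-up names, and the 32-bit digit size is an assumption (the real fp_int has more fields, and its digit size depends on how libtfm is built).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a bignum: 32-bit digits, least significant
 * digit first.  Not the real libtfm fp_int layout. */
#define SKETCH_MAX_DIGITS 64            /* enough for a 2048-bit value */

typedef struct {
    uint32_t dp[SKETCH_MAX_DIGITS];     /* dp[0] is the least significant digit */
    size_t   used;                      /* number of digits in use */
} sketch_int;

/* Write n as a big-endian byte string of exactly len bytes, zero-padded
 * on the left.  Returns 0 on success, -1 if the value does not fit. */
int sketch_unpack_be(const sketch_int *n, uint8_t *out, size_t len)
{
    size_t i;

    if (n->used > SKETCH_MAX_DIGITS)
        return -1;

    /* Any value bytes that would fall beyond len must be zero. */
    for (i = len; i < n->used * 4; i++)
        if ((n->dp[i / 4] >> (8 * (i % 4))) & 0xff)
            return -1;

    /* Byte i counts from the least significant end of the value,
     * so it lands at out[len - 1 - i]. */
    for (i = 0; i < len; i++) {
        uint8_t b = 0;
        if (i / 4 < n->used)
            b = (uint8_t)(n->dp[i / 4] >> (8 * (i % 4)));
        out[len - 1 - i] = b;
    }
    return 0;
}
```

Every output byte costs a shift, a truncation and a reversed index. On a little-endian CPU, and assuming the digits are stored in native byte order, a little-endian byte string could instead be produced with a single memcpy of the digit array.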
We could win some cycles by copying little-endian to little-endian, if we are willing to tie libhal to a specific bignum model and a specific core model, which is actually what we're doing in the driver code already.

                                      modexpa7   modexpng
parallel-signatures                    149.597    109.626
  RPC overhead                          24.034     23.919
  hal_rpc_pkey_sign                    125.563     85.707
    hal_ks_fetch                        17.303     17.303
      hal_mkm_get_kek                    0.627      0.627
      hal_aes_keyunwrap                 16.596     16.596
    pkey_local_sign_rsa                108.177     68.323
      hal_rsa_private_key_from_der       0.687      0.687
      hal_rsa_decrypt                  107.474     67.628
        rsa_crt                        103.724     63.878
          blinding factor mutation      11.460     11.460
          blind/unblind message         12.039
          modexp2/ng                    77.428     52.398
            unpack_fp                   10.422     26.769
            hal_modexp2/ng              66.920     25.550

The second table is all about the first signature, where we calculate the blinding factors, the modulus coefficient, and the Montgomery factor. The blinding factors require doing a modexp, which requires calculating the factors for N, while the signing requires calculating the factors for P and Q. For the modexpa7 core, the "precalc" is done in hardware; for the modexpng core, it's done in software.

                                      modexpa7   modexpng
pkey_local_sign_rsa                    910.379    620.545
  hal_rsa_decrypt                      875.589    585.810
    rsa_crt                            871.767    582.056
      create_blinding_factors          706.332    378.762
        modexp                         585.381    257.834
          precalc N                               246.373
          hal_modexp/ng                577.831      3.941
            precalc N                  570.390
            (rest of modexp)             7.441
      modexp2/ng                       150.675    203.276
        unpack_fp                       10.422     26.769
        precalc P/Q                               150.878
        hal_modexp2/ng                 140.167     25.549
          precalc P/Q                   73.118

In conclusion, ModExpNG is significantly faster (about 27% less time per bulk signature end to end, and about 32% less for the first signature), and it's worth switching over to it completely.

Also, thanks to Joachim for pointing me at the Cortex-M4's DWT function. It made it a lot easier to measure execution time, and with a higher degree of confidence.

paul
_______________________________________________
Tech mailing list
Tech@cryptech.is
https://lists.cryptech.is/listinfo/tech