Hi,

Rob, I have certain questions about the way we store RSA keys. The background for those who have not been following is that I've been working on a faster ModExp core with full CRT support and integrated blinding. Those modifications will change the distribution of computations done inside the STM32 and on the FPGA, so as Rob pointed out, it makes sense to re-evaluate where and when specific things are carried out.

The "standard" RSA key (at least in the OpenSSL sense, as far as I know) consists of the following quantities: modulus, public exponent, secret exponent, two smaller primes, two shorter exponents and CRT helper coefficient. Our ModExp currently needs two additional quantities besides the aforementioned "standard" set of numbers.

The first one is the "Montgomery factor", it only depends on the modulus and is relatively easy to compute. Given modulus N which is L bits long, the factor is 2 ** (2*L) mod N.

The second quantity is the modulus-dependent reduction coefficient. Montgomery modular reduction works by adding a certain number of multiples of the modulus to the intermediate result to make some of the least significant bits zero. Then the product is reduced by simply shifting it to the right. The problem is to determine how many multiples to add, and that's where the coefficient comes into play.

Software implementations typically compute only the lowest 32 bits of this coefficient on the fly and then work by repeatedly clearing the lower 32 bits of the intermediate product. This is not viable in hardware, because both hardware multipliers and adders have latency. As soon as you're finished computing the multiple of modulus, you need to add it to the intermediate result. While you're adding, the multipliers will be idle, once you've added, you start multiplying and the adders are stalled. Hardware needs entire coefficient to keep the math pipeline busy at all times.

We currently have two decicated pieces of Verilog that do the two precomputations. I'm trying to understand whether it would be viable to offload the computation to the STM32 and get rid of those Verilog modules to simplify the core.

I think, that it may even be possible to not store the Montgomery factor at all and just precompute it on the fly when a key is loaded.

The reduction coefficient is computed according to the extended Euclidean algorithm and, as far as I remember, takes about the same time as the exponentiation itself, so it still makes sense to precompute it and store along with other key components.

There's also a math trick that allows you to get ~10% speed increase at the expense of precomputing one more word of the reduction coefficient than there are words in the modulus. ModExp internally operates on 16-bit words, because that's what the math slices in the FPGA can handle. So to take advantage of the trick we need to store 1040-bit quantity for 1024-bit keys, 2064-bit quantity for 2048-bit keys, etc. I wonder how inconvenient that might be?

--
With best regards,
Pavel Shatov
_______________________________________________
Tech mailing list
Tech@cryptech.is
https://lists.cryptech.is/listinfo/tech

Reply via email to