07.03.2019 9:15, Rob Austein пишет:
Software implementations typically compute only the lowest 32 bits of
this coefficient on the fly and then work by repeatedly clearing the
lower 32 bits of the intermediate product. This is not viable in
hardware, because both hardware multipliers and adders have
latency. As soon as you're finished computing the multiple of modulus,
you need to add it to the intermediate result. While you're adding,
the multipliers will be idle, once you've added, you start multiplying
and the adders are stalled. Hardware needs entire coefficient to keep
the math pipeline busy at all times.
We currently have two decicated pieces of Verilog that do the two
precomputations. I'm trying to understand whether it would be viable
to offload the computation to the STM32 and get rid of those Verilog
modules to simplify the core.
What, you don't want to do it on a RISC-V?
Trying this in software is might be worth the experiment. Is this
dependent only on the key's public components? Constant time is nice,
but is not so critical when the input values are already public.
Haven't thought about doing it on a RISC-V core, good point. Also nice
that you mentioned constant-time. Given modulus N which is L bits long,
the Montgomery factor is 2 ** (2*L) mod N, I currently compute it in
Verilog this way:
1. F = 1
2. for i=0 to 2*N-1
3. F1 = F << 1
4. F2 = F1 - N
5. F = (F2 < 0) ? F1 : F2
If we're doing CRT, we operate on the smaller primes which are secret,
so yes, that must be constant-time. And that was the initial reason I
wrote that helper Verilog module.
The reduction coefficient is more complicated, since it effectively
needs modular inversion: R = -N ** -1 mod 2 ** L. There are several
algorithms in the literature, I'm currently doing it this way:
1. R = 1
2. B = 1
3. NN = ~N + 1
4. for i=1 to L-1
5. B = B << 1
6. T = R * NN mod 2 ** L
7. if T[i] then
8. R = R + B
I believe there may exist faster ways to do it, but given that this
computation should only be done one per key generation, I don't know
whether it will make sense to further optimize this. I believe
constant-time is also required here since we need the coefficients for
the smaller primes to do CRT.
In that light I'm starting to think that my idea to offload the
computation to STM32 is not that smart after all. Speaking of RISC-V,
can it get us true constant-time operation?
There's also a math trick that allows you to get ~10% speed increase
at the expense of precomputing one more word of the reduction
coefficient than there are words in the modulus. ModExp internally
operates on 16-bit words, because that's what the math slices in the
FPGA can handle. So to take advantage of the trick we need to store
1040-bit quantity for 1024-bit keys, 2064-bit quantity for 2048-bit
keys, etc. I wonder how inconvenient that might be?
So you want to trade in your old extra storage for new? OK.
Plenty of PRIVATE tag values left if we need them.
Yes, that was basically my idea.
The other thing we discussed briefly a few days ago was short-lived
storage for RSA blinding factors, so that we can amortize the cost of
generating them over multiple signature operations with some kind of
relatively cheap mutation after each signature. We currently do this
using a bit of memory hanging off the side of the in-memory pkey
object. I don't think it makes sense to stuff blinding factors into
the keystore, since that would require a flash write after each
signature and this isn't really the sort of thing we want to preserve
for very long in any case. So I think the current tradeoff there is
about right, but figured I should put this out there while we're
discussing the rest of it in case others have better ideas.
I'm now looking into how to integrate blinding into the core. Suppose
that our modulus is N = P * Q and the message to sign is M. When doing
CRT, we do two "easier" exponentiations mod P and mod Q, but the message
M is twice larger. So we have to first compute two new bases MP = M mod
P and MQ = M mod Q. Now do I get it right, that what we want to do is we
blind the original twice larger message M? In theory we can blind the
two smaller bases separately. Okay, the latter may be a totally stupid
thing, because I haven't worked out all the math details yet, just asking.
--
With best regards,
Pavel Shatov
_______________________________________________
Tech mailing list
Tech@cryptech.is
https://lists.cryptech.is/listinfo/tech