Hi,

I've pushed the ModExpNG ("next-generation") IP core, which I've been working on since December, into the Git repository under /user/shatov/modexpng. Honestly, this has taken substantially longer than I initially anticipated, for primarily two reasons. The first one is already ten months old and is now learning how to stand on his feet without support. The second one is the interruption of funding for the project.

The new core is mostly compatible with the ModExpA7 core we currently use, but there are a few differences:

1) The top-level wrapper module has an extra port called 'clk_core' that clocks the DSP slices inside the core. This allows the math logic to be clocked at a frequency different from the system (FMC bus) clock.

We currently use the 180/3 = 60 MHz clock for both I/O and cores. In an effort to increase signing performance, we decided some time ago to go for 180/4 = 45 MHz I/O and 45*2 = 90 MHz cores, but never actually made the change; the modified version of the FMC arbiter has been sitting in my Git stash. With this new port we will be able to go for a different scheme: 45 MHz I/O, 45*2 = 90 MHz for all the other cores and 45*N MHz for ModExpNG. The upper limit on N seems to be 8 according to the datasheet, i.e. 45*8 = 360 MHz, but this has not been tested yet.

2) The core needs the same set of extra pre-computed numbers as ModExpA7, but it can no longer pre-compute them by itself. The pre-computation algorithm is totally different from the exponentiation procedure, so it made sense to leave it out and concentrate on optimizing just the exponentiation. I believe the pre-computation has to be done in the FPGA, because it involves the smaller secret moduli P and Q, so it must be constant-time at the very least. We can borrow the pre-computation modules from ModExpA7 (which was already written to operate in constant time) and turn them into "helper" math modules, which basically amounts to writing two wrappers. Another option, since there have been talks about using a RISC-V core, is to try doing the pre-computations on a soft processor.
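For reference, here is a rough Python sketch of the kind of Montgomery-style constants such a pre-computation typically produces (the factor R^2 mod m and the word-level -m^-1). This is only an illustration of the general technique; the authoritative list of numbers ModExpNG actually expects is defined by the Python model mentioned below, and the word width here is just a placeholder.

    # Illustration only: generic Montgomery-style pre-computation for an odd
    # modulus m. The exact set of constants ModExpNG expects comes from the
    # modexpng_fpga_model code, not from this sketch.
    def montgomery_constants(m, word_width=32):
        num_words = -(-m.bit_length() // word_width)    # ceil(bits / word_width)
        r = 1 << (word_width * num_words)               # Montgomery radix R
        r_squared = (r * r) % m                         # R^2 mod m, enters the Montgomery domain
        w = 1 << word_width
        m_prime = (-pow(m, -1, w)) % w                  # word-level -m^-1, used during reduction
        return r_squared, m_prime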

3) The core is hardwired to always do blinding. There are several reasons for this. First, Rob and I did some research, and the consensus seems to be something like "there is no solid proof that constant-time modular exponentiation, even when hardened against power analysis and other potential attacks, is safe without blinding, so we'd better leave blinding enabled just in case". Second, the overhead from blinding is too small (<1%) to justify the extra complexity of adding a dedicated control bit and all the logic that would be necessary to make the internal operations conditional. If one still absolutely doesn't want to do blinding for some reason, a pair of (1, 1) can be used as the blinding factors.
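To make the last point concrete, here is a minimal Python sketch of textbook base blinding; I'm not claiming the core applies the factors in exactly this order internally, but it shows what the two blinding factors are and why the pair (1, 1) degenerates into a plain, unblinded exponentiation.

    # Textbook RSA base blinding, for illustration. The "blinding factors" are
    # the pair (r^e mod n, r^-1 mod n); passing (1, 1) turns both
    # multiplications into no-ops and gives back plain pow(msg, d, n).
    import secrets

    def blinded_modexp(msg, d, n, e):
        r = secrets.randbelow(n - 2) + 2     # fresh random blinding value per signature
        b_in  = pow(r, e, n)                 # first blinding factor
        b_out = pow(r, -1, n)                # second blinding factor
        blinded = (msg * b_in) % n           # blind the base
        sig = pow(blinded, d, n)             # the actual exponentiation
        return (sig * b_out) % n             # unblind the result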

I've also written some demo C code for the STM32 to demonstrate how to use the new core. The code uses randomized test vectors generated by the Python math model in /user/shatov/modexpng_fpga_model.

I also measured the actual core performance in hardware. The Alpha board has a set of spare "IRQ" lines going from the STM32 to the FPGA, so I temporarily modified the demo driver to initially set the IRQ_0 pin low, drive it high just before pulsing the "next" control bit, and drive it low again after the "valid" bit goes high. On the FPGA side I added a counter that counts the number of ticks the IRQ_0 line stays high. Okay, yes, maybe it would have been easier to just use a scope and blink an LED, but I was too lazy to dust it off, sorry.
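Converting the counter readout into the figures below is then just a matter of dividing by the clock frequency. A trivial helper, shown here with a made-up tick count (the counter runs at the current 60 MHz system clock):

    # Turn the FPGA tick counter into latency / throughput figures.
    # 'ticks' below is a hypothetical readout, not an actual measurement.
    CLK_HZ = 60_000_000

    def ticks_to_figures(ticks):
        seconds = ticks / CLK_HZ
        return seconds * 1e3, 1.0 / seconds          # (ms per exp, exp/s)

    print(ticks_to_figures(234_000))                 # ~3.9 ms, ~256 exp/s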

Below are the times it takes to do an exponentiation. Of course, the actual signing speed will be somewhat lower, because other steps, such as hashing the message, writing the hash into the core, etc., are not taken into account:
               w/ CRT | no CRT
              --------+--------
1024-bit key:  3.9 ms |   24 ms
2048-bit key:   25 ms |  182 ms
4096-bit key:  184 ms | 1424 ms

The same thing in exponentiations per second:

                  w/ CRT |   no CRT
              -----------+----------
1024-bit key:  260 exp/s |  41 exp/s
2048-bit key:   40 exp/s | 5.4 exp/s
4096-bit key:  5.4 exp/s | 0.7 exp/s

The numbers above are for the current 60 MHz clock and a single core instance. Based on current resource usage (2% FFs, 4% LUTs, 4% DSPs, 11% BRAMs), in theory we can cram about eight cores into the bitstream and increase the clock frequency six times, so the upper theoretical limit on signing speed is 48x the numbers above. I don't know yet how close we can get to that. I will try to do some test implementations in the following days to have more specific numbers for the "restart" follow-up call.
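For the curious, the 8-core / 48x estimate follows directly from the utilization and clock figures above; a quick back-of-the-envelope check in Python:

    # Back-of-the-envelope check of the "48x" upper bound quoted above.
    usage = {"FF": 0.02, "LUT": 0.04, "DSP": 0.04, "BRAM": 0.11}

    bottleneck = max(usage.values())                 # BRAMs at 11% per core
    max_cores = int(1.0 / bottleneck)                # ~9, call it 8 to leave headroom
    clock_gain = (45 * 8) / 60                       # 360 MHz vs. the current 60 MHz
    print(max_cores, clock_gain, 8 * clock_gain)     # ~8 cores, 6x clock, 48x total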

As a side note, the "classic" boost from CRT on a conventional processor is four times. Since the computational complexity of multiplication is O(n**2), halving the modulus size makes each multiplication 4x cheaper. The exponent also becomes two times smaller, which is another 2x, but one has to do two exponentiations instead of one, so that 2x cancels out. A programmable logic implementation, on the other hand, can take advantage of that additional 2x, because it can do the two easier exponentiations simultaneously. This can be clearly seen from the timings above: the non-CRT/CRT ratio approaches 8 as the key length increases.
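For completeness, here is the textbook CRT signing flow in Python; this is the math behind that factor of 8, not a description of the core's internal algorithm.

    # Textbook RSA-CRT signing, for illustration only.
    def sign_crt(m, p, q, d):
        dp, dq = d % (p - 1), d % (q - 1)    # two half-size exponents
        q_inv = pow(q, -1, p)                # typically part of the CRT key material
        sp = pow(m % p, dp, p)               # two half-size exponentiations;
        sq = pow(m % q, dq, q)               # the FPGA core runs these in parallel
        h = ((sp - sq) * q_inv) % p          # Garner recombination
        return sq + q * h

On a CPU the two pow() calls run back to back, hence the classic ~4x; in the core they run side by side, which is where the ~8x visible in the tables comes from.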

The core should already be usable in its current state, but there are a few things that should be done in the future:

1) Determine how fast the DSP slices inside the core can be clocked. This basically means tightening the clock period constraint, running an implementation and seeing whether timing is met.

2) Try synthesizing a multi-core bitstream. The core takes advantage of the dedicated cascade paths between DSP slices. The problem is that Xilinx tools do not always take that into account and sometimes end up with an unroutable design, so manual floorplanning might be necessary.

3) Write at least some documentation.

4) The core is currently configured with NUM_MULTS = 8. This sets the number of parallel DSP slices doing the multiplication. NUM_MULTS can also be set to 16, 32 or 64. Doubling NUM_MULTS synthesizes a core that is twice as fast, but also twice as large. Basically we can trade resources for speed by adjusting NUM_MULTS. Note that for a multi-core signer the total throughput is not affected by this setting: with NUM_MULTS = 16 we can only fit four larger cores into the device instead of eight smaller ones.

So far I haven't tried anything above 8. I suspect NUM_MULTS = 64 is too extreme and won't work as-is, since it will create a column of 64 DSP slices that spans ~2/3 of the device. All the DSP slices in a column share a common input, so there will be a huge 1:64 operand broadcast tree that most probably won't meet timing without additional pipelining.

Basically the question is to find the right balance between many smaller cores and a few larger cores. Maybe the current setting of NUM_MULTS = 8 is just fine and we won't want to change it.
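As a rough illustration of that balance (taking the per-core BRAM figure quoted earlier as the limiting resource and assuming "twice faster, twice larger" holds across the board):

    # Rough model of the NUM_MULTS trade-off: per-core speed and per-core
    # resource usage both double with NUM_MULTS, so the aggregate throughput
    # of a full device stays roughly flat; only per-signature latency changes.
    BASE_BRAM = 0.11                                 # per core at NUM_MULTS = 8

    for num_mults in (8, 16, 32, 64):
        scale = num_mults // 8
        cores = int(1.0 / (BASE_BRAM * scale))       # how many cores fit
        print(num_mults, cores, cores * scale)       # setting, cores, relative throughput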


--
With best regards,
Pavel Shatov