Hi,

I've pushed the ModExpNG ("next-generation") IP core, which I've been working on since December, into the Git repository under /user/shatov/modexpng. Honestly, this has taken substantially longer than I initially anticipated, for primarily two reasons. The first one is already ten months old and is now learning how to stand on his feet without support. The second one is the interruption of funding for the project.

The new core is mostly compatible with the ModExpA7 core we currently use, but there are a few differences:

1) The top-level wrapper module has an extra port called 'clk_core' that clocks the DSP slices inside the core. This allows the math logic to be clocked at a frequency different from the system (FMC bus) clock.

We currently use the 180/3 = 60 MHz clock for both I/O and cores. In an effort to increase signing performance, we decided some time ago to go for 180/4 = 45 MHz I/O and 45*2 = 90 MHz cores, but never actually made the change; the modified version of the FMC arbiter has been sitting in my Git stash. With this new port we will be able to go for a different scheme: 45 MHz I/O, 45*2 = 90 MHz for all the other cores and 45*N MHz for ModExpNG. The upper limit on N seems to be 8 according to the datasheet, i.e. 45*8 = 360 MHz, but this has not been tested yet.

2) The core needs the same set of extra pre-computed numbers as ModExpA7, but it can no longer pre-compute them by itself. The pre-computation algorithm is totally different from the exponentiation procedure, so it made sense to leave it out and concentrate on optimizing just the exponentiation. I believe the pre-computation has to be done in the FPGA, because it involves the smaller secret moduli P and Q, so it must be constant-time at the very least. We can borrow the pre-computation modules from ModExpA7 (which was already written to operate in constant time) and turn them into "helper" math modules, which basically amounts to writing two wrappers. Another option, since there have been talks about using a RISC-V core, is to try doing the pre-computations on a soft processor.
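For reference, here is a rough Python sketch of the kind of Montgomery-style constants such a pre-computation typically produces (the factor R^2 mod m and the word-level -m^-1). This is only an illustration of the general technique; the authoritative list of numbers ModExpNG actually expects is defined by the Python model mentioned below, and the word width here is just a placeholder.

    # Illustration only: generic Montgomery-style pre-computation for an odd
    # modulus m. The exact set of constants ModExpNG expects comes from the
    # modexpng_fpga_model code, not from this sketch.
    def montgomery_constants(m, word_width=32):
        num_words = -(-m.bit_length() // word_width)    # ceil(bits / word_width)
        r = 1 << (word_width * num_words)               # Montgomery radix R
        r_squared = (r * r) % m                         # R^2 mod m, enters the Montgomery domain
        w = 1 << word_width
        m_prime = (-pow(m, -1, w)) % w                  # word-level -m^-1, used during reduction
        return r_squared, m_prime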

3) The core is hardwired to always do blinding. There are several reasons for this. First, Rob and I did some research, and the consensus seems to be something like "there is no solid proof that constant-time modular exponentiation, even when hardened against power analysis and other potential attacks, is safe without blinding, so we'd better leave blinding enabled just in case". Second, the overhead from blinding is too small (<1%) to justify the extra complexity of adding a dedicated control bit and all the logic that would be necessary to make the internal operations conditional. If one still absolutely doesn't want to do blinding for some reason, a pair of (1, 1) can be used as the blinding factors.
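To make the last point concrete, here is a minimal Python sketch of textbook base blinding; I'm not claiming the core applies the factors in exactly this order internally, but it shows what the two blinding factors are and why the pair (1, 1) degenerates into a plain, unblinded exponentiation.

    # Textbook RSA base blinding, for illustration. The "blinding factors" are
    # the pair (r^e mod n, r^-1 mod n); passing (1, 1) turns both
    # multiplications into no-ops and gives back plain pow(msg, d, n).
    import secrets

    def blinded_modexp(msg, d, n, e):
        r = secrets.randbelow(n - 2) + 2     # fresh random blinding value per signature
        b_in  = pow(r, e, n)                 # first blinding factor
        b_out = pow(r, -1, n)                # second blinding factor
        blinded = (msg * b_in) % n           # blind the base
        sig = pow(blinded, d, n)             # the actual exponentiation
        return (sig * b_out) % n             # unblind the result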

I've also written some demo C code for the STM32 to demonstrate how to use the new core. The code uses randomized test vectors generated by the Python math model in /user/shatov/modexpng_fpga_model.

I also measured the actual core performance in hardware. The Alpha board has a set of spare "IRQ" lines going from the STM32 to the FPGA, so I temporarily modified the demo driver to initially set the IRQ_0 pin low, drive it high just before pulsing the "next" control bit, and drive it low again after the "valid" bit goes high. On the FPGA side I added a counter that counts the number of ticks the IRQ_0 line stays high. Okay, yes, maybe it would have been easier to just use a scope and blink an LED, but I was too lazy to dust it off, sorry.
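Converting the counter readout into the figures below is then just a matter of dividing by the clock frequency. A trivial helper, shown here with a made-up tick count (the counter runs at the current 60 MHz system clock):

    # Turn the FPGA tick counter into latency / throughput figures.
    # 'ticks' below is a hypothetical readout, not an actual measurement.
    CLK_HZ = 60_000_000

    def ticks_to_figures(ticks):
        seconds = ticks / CLK_HZ
        return seconds * 1e3, 1.0 / seconds          # (ms per exp, exp/s)

    print(ticks_to_figures(234_000))                 # ~3.9 ms, ~256 exp/s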

Below are the times it takes to do an exponentiation. Of course, the actual signing speed will be somewhat lower, because other steps, such as hashing the message, writing the hash into the core, etc., are not taken into account:
               w/ CRT | no CRT
              --------+--------
1024-bit key:  3.9 ms |   24 ms
2048-bit key:   25 ms |  182 ms
4096-bit key:  184 ms | 1424 ms

The same thing in exponentiations per second:

                  w/ CRT |   no CRT
              -----------+----------
1024-bit key:  260 exp/s |  41 exp/s
2048-bit key:   40 exp/s | 5.4 exp/s
4096-bit key:  5.4 exp/s | 0.7 exp/s

The numbers above are for the current 60 MHz clock and a single core instance. Based on current resource usage (2% FFs, 4% LUTs, 4% DSPs, 11% BRAMs), in theory we can cram about eight cores into the bitstream and increase the clock frequency six times, so the upper theoretical limit on signing speed is 48x the numbers above. I don't know yet how close we can get to that. I will try to do some test implementations in the following days to have more specific numbers for the "restart" follow-up call.
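For the curious, the 8-core / 48x estimate follows directly from the utilization and clock figures above; a quick back-of-the-envelope check in Python:

    # Back-of-the-envelope check of the "48x" upper bound quoted above.
    usage = {"FF": 0.02, "LUT": 0.04, "DSP": 0.04, "BRAM": 0.11}

    bottleneck = max(usage.values())                 # BRAMs at 11% per core
    max_cores = int(1.0 / bottleneck)                # ~9, call it 8 to leave headroom
    clock_gain = (45 * 8) / 60                       # 360 MHz vs. the current 60 MHz
    print(max_cores, clock_gain, 8 * clock_gain)     # ~8 cores, 6x clock, 48x total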

As a side note, the "classic" boost from CRT on a conventional processor is four times. Since the computational complexity of multiplication is O(n**2), halving the modulus size makes each multiplication 4x cheaper. The exponent also becomes two times smaller, which is another 2x, but one has to do two exponentiations instead of one, so that 2x cancels out. A programmable logic implementation, on the other hand, can take advantage of that additional 2x, because it can do the two easier exponentiations simultaneously. This can be clearly seen from the timings above: the non-CRT/CRT ratio approaches 8 as the key length increases.
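For completeness, here is the textbook CRT signing flow in Python; this is the math behind that factor of 8, not a description of the core's internal algorithm.

    # Textbook RSA-CRT signing, for illustration only.
    def sign_crt(m, p, q, d):
        dp, dq = d % (p - 1), d % (q - 1)    # two half-size exponents
        q_inv = pow(q, -1, p)                # typically part of the CRT key material
        sp = pow(m % p, dp, p)               # two half-size exponentiations;
        sq = pow(m % q, dq, q)               # the FPGA core runs these in parallel
        h = ((sp - sq) * q_inv) % p          # Garner recombination
        return sq + q * h

On a CPU the two pow() calls run back to back, hence the classic ~4x; in the core they run side by side, which is where the ~8x visible in the tables comes from.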

The core should already be usable in its current state, but there are a few things that should be done in the future:

1) Determine how fast the DSP slices inside the core can be clocked. This basically means tightening the clock period constraint, running an implementation and seeing whether timing is met.

2) Try synthesizing a multi-core bitstream. The core takes advantage of the dedicated cascade paths between DSP slices. The problem is that Xilinx tools do not always take that into account and sometimes end up with an unroutable design, so manual floorplanning might be necessary.

3) Write at least some documentation.

4) The core is currently configured with NUM_MULTS = 8. This sets the number of parallel DSP slices doing the multiplication. NUM_MULTS can also be set to 16, 32 or 64. Doubling NUM_MULTS synthesizes a core that is twice as fast, but also twice as large. Basically we can trade resources for speed by adjusting NUM_MULTS. Note that for a multi-core signer the total throughput is not affected by this setting: with NUM_MULTS = 16 we can only fit four larger cores into the device instead of eight smaller ones.

So far I haven't tried anything above 8. I suspect NUM_MULTS = 64 is too extreme and won't work as-is, since it will create a column of 64 DSP slices that spans ~2/3 of the device. All the DSP slices in a column share a common input, so there will be a huge 1:64 operand broadcast tree that most probably won't meet timing without additional pipelining.

Basically the question is to find the right balance between many smaller cores and a few larger cores. Maybe the current setting of NUM_MULTS = 8 is just fine and we won't want to change it.
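As a rough illustration of that balance (taking the per-core BRAM figure quoted earlier as the limiting resource and assuming "twice faster, twice larger" holds across the board):

    # Rough model of the NUM_MULTS trade-off: per-core speed and per-core
    # resource usage both double with NUM_MULTS, so the aggregate throughput
    # of a full device stays roughly flat; only per-signature latency changes.
    BASE_BRAM = 0.11                                 # per core at NUM_MULTS = 8

    for num_mults in (8, 16, 32, 64):
        scale = num_mults // 8
        cores = int(1.0 / (BASE_BRAM * scale))       # how many cores fit
        print(num_mults, cores, cores * scale)       # setting, cores, relative throughput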


--
With best regards,
Pavel Shatov