Hi,

Yesterday I pushed a number of commits spread across several repositories. The changes include the following pieces:

1. The FMC bus arbiter was upgraded. Code from the `fmc_clk' branch was merged into master; this removes the old clock domain crossing logic and makes the I/O bus synchronous. Due to a limitation of the STM32 processor, the I/O can't run at the same 90 MHz clock as the cores, so it runs at 90 / 2 = 45 MHz. Strictly speaking, the new FMC arbiter still does 45/90 MHz clock domain crossing, but since the two clocks are synchronous, it's much simpler and has lower latency. Currently the total FMC bus transaction duration is 9 clock cycles (1 cycle synchronization stage + 4 cycles FMC controller latency + 3 cycles core selector latency + 1 cycle clock domain crossing logic), so the throughput is 32 bits * 45 MHz / 9 = 160 megabits/second.

2. The Alpha platform modules were upgraded. The cores are now clocked by a 90 MHz clock instead of 60 MHz. The new clock manager module also has an extra dedicated high-speed output port for cores that support higher clock speeds. Currently this frequency is 45 * 4 = 180 MHz and is only used by the new ModExpNG core (see below).

3. The core selector was upgraded. As we were adding new cores to the design and trying to increase their clock speed, timing problems started to arise when building the bitstream. The two primary reasons for this are the global asynchronous reset net and the high fanout of the readback data multiplexor. The first problem was solved by adding a special parametrized module that replicates the reset signal (see the commit message for a detailed description of how the module works). The second problem was alleviated by relaxing the setup constraint for the selector's output multiplexor. In short, since the I/O runs at only 45 MHz, it doesn't make sense to select the output data value at 90 MHz. Multi-cycle constraints were introduced that give the data two 90 MHz clock cycles instead of one to propagate through the multiplexor. Again, the corresponding commit message explains exactly how this works.

4. The ModExpNG core was upgraded. The core now supports a high-speed clock of up to 180 MHz. The latest performance measurements are as follows:

Exponentiation time in milliseconds:

                w/ CRT  |  non-CRT
              ----------+----------
1024-bit key:   1.37 ms |   8.28 ms
2048-bit key:   8.46 ms |  61.10 ms
4096-bit key:  61.72 ms | 475.08 ms

Speed in exponentiations per second:

                  w/ CRT   |   non-CRT
              -------------+-------------
1024-bit key:  731.4 exp/s | 120.7 exp/s
2048-bit key:  118.2 exp/s |  16.4 exp/s
4096-bit key:   16.2 exp/s |   2.1 exp/s
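
The two tables are related by rate = 1000 / time in ms. As a quick sanity check, here is a minimal Python sketch that recomputes the rates from the times listed above; the names `times_ms` and `exps_per_second` are made up for this illustration and are not part of any Cryptech code.

```python
# Exponentiation times in milliseconds, copied from the first table above.
times_ms = {
    (1024, "CRT"): 1.37,  (1024, "non-CRT"): 8.28,
    (2048, "CRT"): 8.46,  (2048, "non-CRT"): 61.10,
    (4096, "CRT"): 61.72, (4096, "non-CRT"): 475.08,
}


def exps_per_second(time_ms):
    """Convert the duration of one exponentiation (ms) into a rate (exp/s)."""
    return 1000.0 / time_ms


# Recompute the second table from the first one.
rates = {key: exps_per_second(t) for key, t in times_ms.items()}

for (bits, mode), rate in rates.items():
    print(f"{bits}-bit key, {mode:>7}: {rate:6.1f} exp/s")
```

The recomputed values agree with the second table to within the rounding of the printed times (the 1.37 ms entry, for instance, gives roughly 730 exp/s rather than 731.4).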

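For reference, the clock arithmetic from items 1 and 3 can be written out as a small Python sketch. The function and variable names are hypothetical, and the timing budget ignores clock uncertainty and setup/hold margins, so treat it as back-of-the-envelope only.

```python
# Per-stage latency of one FMC bus transaction, in 45 MHz I/O clock cycles
# (numbers taken from item 1 above).
stage_cycles = {
    "synchronization stage": 1,
    "FMC controller": 4,
    "core selector": 3,
    "clock domain crossing": 1,
}


def fmc_throughput_mbps(bus_width_bits, io_clock_mhz, cycles_per_transaction):
    """One bus-width word is moved every `cycles_per_transaction` cycles."""
    return bus_width_bits * io_clock_mhz / cycles_per_transaction


def period_ns(clock_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / clock_mhz


total_cycles = sum(stage_cycles.values())
throughput = fmc_throughput_mbps(32, 45, total_cycles)

# Item 3: a two-cycle path at 90 MHz gets the same time budget as
# one full period of the 45 MHz I/O clock.
mux_budget_ns = 2 * period_ns(90)

print(f"transaction length: {total_cycles} cycles")
print(f"throughput: {throughput:.0f} Mbit/s")
print(f"multiplexor budget: {mux_budget_ns:.2f} ns")
```

The last figure shows why the relaxed constraint is safe in principle: two 90 MHz cycles are exactly one 45 MHz I/O clock period, so the readback data only has to be stable once per I/O cycle.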

The next step would be to upgrade the HAL layer of the STM32 firmware. The crucial change is that the newer ModExp core has built-in support for blinding and also does the final part of Chinese Remainder Theorem based signing ("Garner's formula") itself, so the STM32 no longer needs to do any modular math operations at all when signing. I'm not very familiar with how the HAL layer works, so I'd pass this to someone with a better understanding to make the changes.


--
With best regards,
Pavel Shatov