Status report to get everybody onto the same page:

* Joachim has done a lot of work on making a faster AES core.

* Integrating that faster AES core into the alpha build does not make
  RSA signatures noticeably faster. We don't know why. Presumably
  this means that the current bottleneck is not really the AES core
  speed, but we don't know what the bottleneck is. We're still happy
  to have the faster AES core, that's good stuff, it's just not
  currently solving this particular problem.

* One of the things we tried as an experiment was halving the number
  of cores (so down from 8 modexp and 4 AES to 4 modexp and 2 AES),
  in case this was some kind of artifact of pushing the limits of
  what we can fit on this FPGA. Nope, no noticeable change.

* Overall signature throughput does go up as we add more clients
  until we hit the limit of how many cores are in the build, but it's
  not linear: two clients is nearly twice the throughput of one
  client; each additional client after that adds something, but a
  smaller something, to the overall throughput. This again tends to
  suggest that some shared resource is the real bottleneck.

* Several of us are still suspicious of FMC I/O here, among other
  reasons because the ARM spends a lot of time in the FMC I/O code
  during the do_block() function, which keeps showing up as the 800kg
  gorilla in the profiling results.

* Paul (re-)raised an interesting question about whether we could
  clock the FPGA from the 90MHz FMC bus and make this a synchronous
  interface, which at least in theory might significantly increase
  throughput on the bus. Might at least be worth the experiment,
  particularly if it:

  * Lets us get rid of the double read in fmc_read() and

  * Lets us stop having to spin-wait polling the FMC NWAIT GPIO pin
    after every 32-bit word we transfer(!).

  Assuming this change is workable, if we were to combine it with the
  trivial hack to move the byte swapping to Verilog, we'd finally
  have an interface which looks relatively sane in terms of the
  number of ARM instructions involved in moving data between ARM and
  FPGA.

  Paul and I don't know enough about the FMC bus to know whether this
  is plausible. Pavel? You're our FMC expert. Opinions?

* I noticed a few more minor things we could do with the current FMC
  I/O code which might squeeze a few more cycles out of it, will try
  them at some point, but as long as we have the NWAIT poll after
  every word I would not expect this to make much difference.

Throughput numbers from current tests, with profiling disabled (would
be significantly slower with profiling). This is for two different
bitstreams, identical other than the numbers of aes_fast and modexpa7
cores. "n" is the number of signatures per test, "c" is the number of
clients in that test.

# Testing with two aes_speed cores and four modexpa7 cores
rsa_2048 sigs/sec 5.67867374529 secs/sig 0:00:00.176097 mean 0:00:00.175575 (n 1000, c 1 t0 2018-05-25 00:48:08.021713 t1 2018-05-25 00:51:04.119169)
rsa_2048 sigs/sec 8.91530052613 secs/sig 0:00:00.112166 mean 0:00:00.223732 (n 1000, c 2 t0 2018-05-25 00:45:14.212604 t1 2018-05-25 00:47:06.379322)

# Testing with four aes_speed cores and eight modexpa7 cores
rsa_2048 sigs/sec 5.67869635076 secs/sig 0:00:00.176096 mean 0:00:00.175576 (n 1000, c 1 t0 2018-05-25 00:58:16.127982 t1 2018-05-25 01:01:12.224737)
rsa_2048 sigs/sec 8.91534805697 secs/sig 0:00:00.112166 mean 0:00:00.223737 (n 1000, c 2 t0 2018-05-25 00:54:39.331429 t1 2018-05-25 00:56:31.497549)
rsa_2048 sigs/sec 10.5404984184 secs/sig 0:00:00.094872 mean 0:00:00.283858 (n 1000, c 3 t0 2018-05-25 01:01:49.373243 t1 2018-05-25 01:03:24.245417)
rsa_2048 sigs/sec 10.9159940249 secs/sig 0:00:00.091608 mean 0:00:00.365467 (n 1000, c 4 t0 2018-05-25 01:04:11.440790 t1 2018-05-25 01:05:43.049488)
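
For anybody who wants to look at the scaling rather than the raw
numbers, here is a small Python sketch that turns the sigs/sec figures
from the larger build (one through four clients) into speedup relative
to one client, per-client efficiency, and the gain from each added
client. The names SIGS_PER_SEC and scaling_table() are made up for
this illustration and are not part of the test harness; the only input
is the throughput column copied from the results above.

```python
# Throughput in signatures/second, copied from the c=1..4 results above
# (the build with eight modexpa7 cores), keyed by number of clients.
SIGS_PER_SEC = {
    1: 5.67869635076,
    2: 8.91534805697,
    3: 10.5404984184,
    4: 10.9159940249,
}


def scaling_table(throughput):
    """Return rows of (clients, sigs/sec, speedup, efficiency, gain).

    speedup    = throughput relative to the single-client run
    efficiency = speedup divided by the number of clients (1.0 = linear)
    gain       = extra sigs/sec over the previous row (0.0 for the first)
    """
    base = throughput[1]
    rows = []
    previous = None
    for clients in sorted(throughput):
        rate = throughput[clients]
        speedup = rate / base
        efficiency = speedup / clients
        gain = 0.0 if previous is None else rate - previous
        rows.append((clients, rate, speedup, efficiency, gain))
        previous = rate
    return rows


for clients, rate, speedup, efficiency, gain in scaling_table(SIGS_PER_SEC):
    print("c=%d  %7.3f sigs/sec  speedup %.2fx  efficiency %3.0f%%  gain %+.3f"
          % (clients, rate, speedup, 100 * efficiency, gain))
```

On these particular runs the speedup works out to roughly 1.57x at two
clients and 1.92x at four, with the gain shrinking at each step, which
is the pattern one would expect if the clients are contending for some
shared resource.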