I have done some benchmarks of various cprng_fast implementations:

arc4-mtx	The libkern implementation from netbsd-current, which
		uses a spin mutex to serialize access to a single,
		shared arc4 state.

arc4-nomtx	The same, with the mutex calls #ifdeffed out.  This is
		what was in NetBSD prior to 2012.  This implementation
		is not correct.

arc4-percpu	A new implementation of cprng_fast using percpu state
		and arc4 as the core stream cipher.  Uses the arc4
		implementation from sys/crypto/arc4, slightly modified
		to provide an entry point that skips the XOR.

hc128-percpu	The same new implementation, but with hc128 as the
		core stream cipher.  Differs from what I posted
		earlier in that all use of inline functions in the
		public API has been removed.

hc128-inline	The percpu implementation I posted earlier, with all
		noted bugs fixed; it uses inlines in the header file
		which expose some algorithm guts to speed up
		cprng_fast32().
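All the percpu variants share the same basic shape.  Roughly -- a
minimal sketch, not the actual patch; cipher_keystream32() and
cprng_fast_rekey() are hypothetical stand-ins for the real arc4/hc128
entry points, while the percpu(9) calls are the real API:

#include <sys/types.h>
#include <sys/percpu.h>

struct cprng_fast_cpu {
	/* core cipher context (arc4 or hc128) lives here */
	uint64_t	nbytes;		/* output since last rekey */
};

static percpu_t *cprng_fast_percpu;	/* percpu_alloc()ed at init */

#define	CPRNG_FAST_REKEY_BYTES	(512ULL * 1024 * 1024)	/* 512MB */

uint32_t
cprng_fast32(void)
{
	struct cprng_fast_cpu *cc;
	uint32_t v;

	/*
	 * percpu_getref() returns this CPU's private state and
	 * disables preemption, so no lock is needed: no other
	 * context can touch this cipher context while we hold it.
	 */
	cc = percpu_getref(cprng_fast_percpu);
	if (cc->nbytes >= CPRNG_FAST_REKEY_BYTES)
		cprng_fast_rekey(cc);		/* hypothetical */
	v = cipher_keystream32(cc);		/* hypothetical */
	cc->nbytes += sizeof(v);
	percpu_putref(cprng_fast_percpu);
	return v;
}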
PROCEDURE

All implementations were modified to rekey every 512MB of output and
to use the CPU cycle counter to time the generation of 4MB of data
via 1,000,000 calls to cprng_fast32() after each rekeying.

All implementations were tested on a SuperMicro rackmount server with
an H8QC8+ motherboard, 4 dual-core Opteron 880 CPUs, and 32GB of
DDR400 RAM.  System memory bandwidth was measured at 7,732 MB/sec
using the STREAM benchmark.  Memory and cache latencies were not
measured.

All implementations were run in a GENERIC kernel compiled with "no
options DIAGNOSTIC".  The cycles-per-byte (cpb) and single-stream
tests were run 3 times per kernel and the median result chosen; the
4-stream test was run once per kernel due to the very long runtimes
for some cases.

The following tests were run:

1) Cycles-per-byte, 32 bit request: measured at initial keying of
   the generator and at each automatic rekeying.

2) Generate 4GB of data by requesting 256 bytes at a time from
   userspace, using a small C program that loops around the
   kern.arandom sysctl call.  Values are in seconds.

3) Generate 16GB of data by running 4 copies of the generator
   program in parallel.  Values are in seconds.
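The in-kernel timing for test 1 was, in essence, the following loop
(a reconstruction, not the actual instrumentation; read_cycles()
stands in for whatever cycle-counter read the port provides, shown
here as the x86 TSC):

#include <sys/types.h>
#include <sys/systm.h>
#include <sys/cprng.h>

static inline uint64_t
read_cycles(void)
{
	uint32_t lo, hi;

	/* x86 time-stamp counter; other ports substitute their own */
	__asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
	return ((uint64_t)hi << 32) | lo;
}

static void
cprng_fast_cpb(void)
{
	uint64_t t0, t1;
	int i;

	t0 = read_cycles();
	for (i = 0; i < 1000000; i++)
		(void)cprng_fast32();
	t1 = read_cycles();

	/* 1,000,000 calls x 4 bytes/call = 4,000,000 bytes */
	printf("cpb: %llu\n", (unsigned long long)((t1 - t0) / 4000000));
}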
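The userspace driver for tests 2 and 3 amounts to the following
(again a sketch, not the exact program used; runs were timed
externally, e.g. with time(1)):

#include <sys/param.h>
#include <sys/sysctl.h>
#include <err.h>
#include <stdint.h>

int
main(void)
{
	int mib[2] = { CTL_KERN, KERN_ARND };	/* kern.arandom */
	unsigned char buf[256];		/* 256 bytes per request */
	uint64_t i, iters;

	/* 4GB / 256 bytes per call = 16,777,216 calls */
	iters = (4ULL << 30) / sizeof(buf);
	for (i = 0; i < iters; i++) {
		size_t len = sizeof(buf);

		if (sysctl(mib, 2, buf, &len, NULL, 0) == -1)
			err(1, "sysctl kern.arandom");
	}
	return 0;
}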
RESULTS

kernel        cpb (32 bit)  4GB, 1 way (s)  16GB, 4 ways (s)  Scaling Factor
------        ------------  --------------  ----------------  --------------
arc4-mtx           35            42.58            398.83           0.106
arc4-nomtx         24            42.12           2338.92           0.018
arc4-percpu        27            33.63             41.59           0.808
hc128-percpu       21            23.75             34.90           0.680
hc128-inline       19            22.66             31.75           0.713

(The scaling factor is the 1-way runtime divided by the 4-way
runtime -- e.g. 42.58 / 398.83 = 0.106 for arc4-mtx.  Since the 4-way
test generates four times as much data, a factor of 1.0 would
indicate perfect scaling.)

DISCUSSION

The immediate surprise here is the performance of the "arc4-nomtx"
variant: it is almost as fast as the percpu HC128 for a single stream
of 32 bit requests, but an order of magnitude worse than "arc4-mtx"
for 4 streams in parallel.  This is likely due to severe memory and
cache contention caused by multiple CPUs attempting to update the
(improperly) "shared" ARC4 state in parallel.

The mutex-laden variant, "arc4-mtx", performs worse than everything
else when run in parallel -- but still an order of magnitude better
than the original, mutex-free arc4.  Optimization is hard.  Let's go
shopping!

The performance of arc4-percpu shows that, for a uniprocessor
generating a single stream of requests, the cost of the per-cpu data
references and bookkeeping is about 15% for short requests, which are
the worst case.  Perhaps an explicitly uniprocessor variant would be
justified in some cases.

All HC128 variants are faster than all ARC4 variants.  The percpu
ARC4 variant scales better than either percpu HC128 variant, likely
due to the better cache efficiency of its smaller state array, but it
is still slower in absolute terms when run 4-way on this system --
almost 25% slower than the variant of HC128 which uses inlines.

Inlining improves the performance of the percpu HC128 by about 10%,
but this is extremely sensitive to factors which are difficult to
control, such as the quality of compiler cache blocking for the
target architecture and CPU model.  In fact, with the code compiled
for a generic x86-64 CPU, defining DIAGNOSTIC -- which adds a single
not-taken branch prior to every invocation of the HC128 cipher in the
cycles-per-byte test case -- *reduces* cycles-per-byte to 17 or 18.
This is an area for further work.

Due to the additional implementation complexity required to buffer
their output, ciphers from the Salsa20 family were not tested.
Details at 11.

CONCLUSIONS

Personally, I'd go with the HC128 variant without the ugly inlining.
We restrict bytes-on-key to 512MB, and the best current attack on
HC128 requires 2^65 bytes of keystream to distinguish it, so I think
it's a reasonable choice at this time.

Percpu ARC4 is really a strawman; RC4 is just not safe at any speed
any longer.

Let's test Salsa20 and discuss further.

We should run some of these benchmarks on big-endian systems and on
systems that are architecturally unlike the out-of-order x86-64 I
used for my tests.
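P.S. For concreteness, the "ugly inlining" in hc128-inline is roughly
of this shape -- hypothetical names again, not the actual patch.  The
header exposes enough HC128 state that cprng_fast32() can emit one
keystream word without a function call into the cipher module
(compare the out-of-line sketch near the top; the rekey check is
elided here):

/* In a public header, so every caller inlines the fast path: */
static __inline uint32_t
cprng_fast32(void)
{
	struct cprng_fast_cpu *cc;
	uint32_t v;

	cc = percpu_getref(cprng_fast_percpu);
	/*
	 * hc128_step() stands for the word-at-a-time HC128 core,
	 * assuming the percpu struct carries an hc128 context;
	 * exposing it here is what leaks algorithm guts into the
	 * public API.
	 */
	v = hc128_step(&cc->hc128);
	cc->nbytes += sizeof(v);
	percpu_putref(cprng_fast_percpu);
	return v;
}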