I have done some benchmarks of various cprng_fast implementations:

arc4-mtx	The libkern implementation from netbsd-current, which
		uses a spin mutex to serialize access to a single,
		shared arc4 state.

arc4-nomtx	The same, with the mutex calls #ifdeffed out.  This is
		what was in NetBSD prior to 2012.  This implementation
		is not correct.

arc4-percpu	A new implementation of cprng_fast using percpu state
		and arc4 as the core stream cipher.  Uses the arc4
		implementation from sys/crypto/arc4, slightly modified
		to provide an entry point that skips the XOR.

hc128-percpu	The same new implementation, but with hc128 as the
		core stream cipher.  Differs from what I posted
		earlier in that all use of inline functions in the
		public API has been removed.

hc128-inline	The percpu implementation I posted earlier, with all
		noted bugs fixed; it uses inlines in the header file
		which expose some algorithm guts to speed up
		cprng_fast32().
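All the percpu variants share the same basic shape.  Roughly -- a
minimal sketch, not the actual patch; cipher_keystream32() and
cprng_fast_rekey() are hypothetical stand-ins for the real arc4/hc128
entry points, while the percpu(9) calls are the real API:

#include <sys/types.h>
#include <sys/percpu.h>

struct cprng_fast_cpu {
	/* core cipher context (arc4 or hc128) lives here */
	uint64_t	nbytes;		/* output since last rekey */
};

static percpu_t *cprng_fast_percpu;	/* percpu_alloc()ed at init */

#define	CPRNG_FAST_REKEY_BYTES	(512ULL * 1024 * 1024)	/* 512MB */

uint32_t
cprng_fast32(void)
{
	struct cprng_fast_cpu *cc;
	uint32_t v;

	/*
	 * percpu_getref() returns this CPU's private state and
	 * disables preemption, so no lock is needed: no other
	 * context can touch this cipher context while we hold it.
	 */
	cc = percpu_getref(cprng_fast_percpu);
	if (cc->nbytes >= CPRNG_FAST_REKEY_BYTES)
		cprng_fast_rekey(cc);		/* hypothetical */
	v = cipher_keystream32(cc);		/* hypothetical */
	cc->nbytes += sizeof(v);
	percpu_putref(cprng_fast_percpu);
	return v;
}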
PROCEDURE

All implementations were modified to rekey every 512MB of output and
to use the CPU cycle counter to time the generation of 4MB of data
via 1,000,000 calls to cprng_fast32() after each rekeying.

All implementations were tested on a SuperMicro rackmount server with
an H8QC8+ motherboard, 4 dual-core Opteron 880 CPUs, and 32GB of
DDR400 RAM.  System memory bandwidth was measured at 7,732 MB/sec
using the STREAM benchmark.  Memory and cache latencies were not
measured.

All implementations were run in a GENERIC kernel compiled with "no
options DIAGNOSTIC".  The cycles-per-byte (cpb) and single-stream
tests were run 3 times per kernel and the median result chosen; the
4-stream test was run once per kernel due to the very long runtimes
for some cases.

The following tests were run:

1) Cycles-per-byte, 32 bit request: measured at initial keying of
   the generator and at each automatic rekeying.

2) Generate 4GB of data by requesting 256 bytes at a time from
   userspace, using a small C program that loops around the
   kern.arandom sysctl call.  Values are in seconds.

3) Generate 16GB of data by running 4 copies of the generator
   program in parallel.  Values are in seconds.
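The in-kernel timing for test 1 was, in essence, the following loop
(a reconstruction, not the actual instrumentation; read_cycles()
stands in for whatever cycle-counter read the port provides, shown
here as the x86 TSC):

#include <sys/types.h>
#include <sys/systm.h>
#include <sys/cprng.h>

static inline uint64_t
read_cycles(void)
{
	uint32_t lo, hi;

	/* x86 time-stamp counter; other ports substitute their own */
	__asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
	return ((uint64_t)hi << 32) | lo;
}

static void
cprng_fast_cpb(void)
{
	uint64_t t0, t1;
	int i;

	t0 = read_cycles();
	for (i = 0; i < 1000000; i++)
		(void)cprng_fast32();
	t1 = read_cycles();

	/* 1,000,000 calls x 4 bytes/call = 4,000,000 bytes */
	printf("cpb: %llu\n", (unsigned long long)((t1 - t0) / 4000000));
}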
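The userspace driver for tests 2 and 3 amounts to the following
(again a sketch, not the exact program used; runs were timed
externally, e.g. with time(1)):

#include <sys/param.h>
#include <sys/sysctl.h>
#include <err.h>
#include <stdint.h>

int
main(void)
{
	int mib[2] = { CTL_KERN, KERN_ARND };	/* kern.arandom */
	unsigned char buf[256];		/* 256 bytes per request */
	uint64_t i, iters;

	/* 4GB / 256 bytes per call = 16,777,216 calls */
	iters = (4ULL << 30) / sizeof(buf);
	for (i = 0; i < iters; i++) {
		size_t len = sizeof(buf);

		if (sysctl(mib, 2, buf, &len, NULL, 0) == -1)
			err(1, "sysctl kern.arandom");
	}
	return 0;
}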
RESULTS

kernel        cpb (32 bit)  4GB, 1 way (s)  16GB, 4 ways (s)  Scaling Factor
------        ------------  --------------  ----------------  --------------
arc4-mtx           35            42.58            398.83           0.106
arc4-nomtx         24            42.12           2338.92           0.018
arc4-percpu        27            33.63             41.59           0.808
hc128-percpu       21            23.75             34.90           0.680
hc128-inline       19            22.66             31.75           0.713

(The scaling factor is the 1-way runtime divided by the 4-way
runtime -- e.g. 42.58 / 398.83 = 0.106 for arc4-mtx.  Since the 4-way
test generates four times as much data, a factor of 1.0 would
indicate perfect scaling.)

DISCUSSION

The immediate surprise here is the performance of the "arc4-nomtx"
variant: it is almost as fast as the percpu HC128 for a single stream
of 32 bit requests, but an order of magnitude worse than "arc4-mtx"
for 4 streams in parallel.  This is likely due to severe memory and
cache contention caused by multiple CPUs attempting to update the
(improperly) "shared" ARC4 state in parallel.

The mutex-laden variant, "arc4-mtx", performs worse than everything
else when run in parallel -- but still an order of magnitude better
than the original, mutex-free arc4.  Optimization is hard.  Let's go
shopping!

The performance of arc4-percpu shows that, for a uniprocessor
generating a single stream of requests, the cost of the per-cpu data
references and bookkeeping is about 15% for short requests, which are
the worst case.  Perhaps an explicitly uniprocessor variant would be
justified in some cases.

All HC128 variants are faster than all ARC4 variants.  The percpu
ARC4 variant scales better than either percpu HC128 variant, likely
due to the better cache efficiency of its smaller state array, but it
is still slower in absolute terms when run 4-way on this system --
almost 25% slower than the variant of HC128 which uses inlines.

Inlining improves the performance of the percpu HC128 by about 10%,
but this is extremely sensitive to factors which are difficult to
control, such as the quality of compiler cache blocking for the
target architecture and CPU model.  In fact, with the code compiled
for a generic x86-64 CPU, defining DIAGNOSTIC -- which adds a single
not-taken branch prior to every invocation of the HC128 cipher in the
cycles-per-byte test case -- *reduces* cycles-per-byte to 17 or 18.
This is an area for further work.

Due to the additional implementation complexity required to buffer
their output, ciphers from the Salsa20 family were not tested.
Details at 11.

CONCLUSIONS

Personally, I'd go with the HC128 variant without the ugly inlining.
We restrict bytes-on-key to 512MB, and the best current attack on
HC128 requires 2^65 bytes of keystream to distinguish it, so I think
it's a reasonable choice at this time.

Percpu ARC4 is really a strawman; RC4 is just not safe at any speed
any longer.

Let's test Salsa20 and discuss further.

We should run some of these benchmarks on big-endian systems and on
systems that are architecturally unlike the out-of-order x86-64 I
used for my tests.
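P.S. For concreteness, the "ugly inlining" in hc128-inline is roughly
of this shape -- hypothetical names again, not the actual patch.  The
header exposes enough HC128 state that cprng_fast32() can emit one
keystream word without a function call into the cipher module
(compare the out-of-line sketch near the top; the rekey check is
elided here):

/* In a public header, so every caller inlines the fast path: */
static __inline uint32_t
cprng_fast32(void)
{
	struct cprng_fast_cpu *cc;
	uint32_t v;

	cc = percpu_getref(cprng_fast_percpu);
	/*
	 * hc128_step() stands for the word-at-a-time HC128 core,
	 * assuming the percpu struct carries an hc128 context;
	 * exposing it here is what leaks algorithm guts into the
	 * public API.
	 */
	v = hc128_step(&cc->hc128);
	cc->nbytes += sizeof(v);
	percpu_putref(cprng_fast_percpu);
	return v;
}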