On Wed, Apr 23, 2014 at 02:34:33PM +0100, Mindaugas Rasiukevicius wrote:
> Thor Lancelot Simon <t...@panix.com> wrote:
> > On Wed, Apr 23, 2014 at 01:16:42PM +0100, Mindaugas Rasiukevicius wrote:
> > >
> > > You mentioned that 4GB of data are generated by requesting 256 bytes.
> > > It would be more interesting to see the throughput of 4 byte requests
> > > i.e. how many cprng_fast32() calls per second can we do?
> >
> > That is how the numbers in the "cpb" column were generated: I modified
> > the kernel code so that after each rekeying, it calls cprng_fast32()
> > 1,000,000 times in a tight loop to generate 4,000,000 bytes.
>
> OK.  Can you provide the time in seconds?  Having throughput per second
> is useful for general contemplation and comparison with the rates other
> subsystems can achieve.

This CPU has a 2.4GHz clock, so we'd be looking at 109 MB/sec for
chacha8 called via cprng_fast32(), assuming I counted the zeroes
correctly.

Interestingly, according to the rightmost column of the data at
http://bench.cr.yp.to/results-stream.html our implementation is about
as good as anyone else's for these inconveniently short requests.  In
the tables at that URL it is easy to see the overhead come to
predominate as the requests get shorter.  I suspect it might not even
help if we buffered aggressively, since the copies would probably be
quite costly.

However, I'm a little suspicious of the test framework they used,
since it shows truly immense overhead for short requests on MIPS
(Loongson) and ARM while x86 and PPC seem fine.  I wonder if it is
doing something dumb like calling the core transform through a
function pointer.  If we really want to be sure, we'll need to do
similar tests ourselves in-kernel.

Thor
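
For anyone who wants to redo the zero-counting, here is a minimal sketch
of the conversion between cycles per byte, MB/sec, and calls per second.
It assumes 1 MB = 10^6 bytes and a steady 2.4GHz clock. The function
names are made up for this illustration, and the cycles-per-byte value
is back-computed from the 109 MB/sec figure above rather than copied
from the benchmark table.

```python
# Unit conversions for a "cycles per byte" (cpb) benchmark figure.
# Assumptions: 1 MB = 1e6 bytes; the clock runs at a constant rate.

CLOCK_HZ = 2.4e9        # 2.4GHz clock, as stated in the message
BYTES_PER_CALL = 4      # cprng_fast32() returns 32 bits per call


def cycles_per_byte(clock_hz, mb_per_sec):
    """Back out cpb from a throughput figure in MB/sec."""
    return clock_hz / (mb_per_sec * 1e6)


def throughput_mb_per_sec(clock_hz, cpb):
    """Throughput in MB/sec for a given cpb figure."""
    return clock_hz / cpb / 1e6


def calls_per_sec(mb_per_sec, bytes_per_call=BYTES_PER_CALL):
    """Number of fixed-size requests served per second."""
    return mb_per_sec * 1e6 / bytes_per_call


def seconds_for_bytes(nbytes, clock_hz, cpb):
    """Wall-clock time to generate nbytes at a given cpb figure."""
    return nbytes * cpb / clock_hz


if __name__ == "__main__":
    cpb = cycles_per_byte(CLOCK_HZ, 109)
    print("cycles per byte:", round(cpb, 2))
    print("calls per second:", round(calls_per_sec(109)))
    print("seconds per 4,000,000-byte loop:",
          seconds_for_bytes(4_000_000, CLOCK_HZ, cpb))
```

Under those assumptions, 109 MB/sec corresponds to roughly 22 cycles per
byte and a bit over 27 million 4-byte calls per second.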