subject:"\[kernel\-hardening\] Re\: HalfSipHash Acceptable Usage"

Re: [kernel-hardening] Re: HalfSipHash Acceptable Usage

2016-12-21 Thread Jason A. Donenfeld

Hi George,

On Thu, Dec 22, 2016 at 4:55 AM, George Spelvin
 wrote:
> Do we have to go through this?  No, the benchmark was *not* bogus.
> Then I replaced the kernel #includes with the necessary typedefs
> and #defines to make it compile in user-space.
> * I didn't iterate 100K times, I timed the functions *once*.
> * I saved the times in a buffer and printed them all at the end
>   so printf() wouldn't pollute the caches.
> * Before every even-numbered iteration, I flushed the I-cache
>   of everything from _init to _fini (i.e. all the non-library code).
>   This cold-cache case is what is going to happen in the kernel.

Wow! Great. Thanks for the pointers on the right way to do this. Very
helpful, and enlightening results indeed. Think you could send me the
whole .c of what you finally came up with? I'd like to run this on
some other architectures; I've got a few odd boxes laying around here.

> The P4 results were:
> SipHash actually wins slightly in the cold-cache case, because
> it iterates more.  In the hot-cache case, it loses
> Core 2 duo:
> Pretty much a tie, honestly.
> Ivy Bridge:
> Modern processors *hate* cold caches.  But notice how md5 is *faster*
> than SipHash on hot-cache IPv6.
> Ivy Bridge, -m64:
> Of course, when you compile -m64, SipHash is unbeatable.

Okay, so I think these results are consistent with some of the
assessments from before -- that SipHash is really just fine as a
replacement for MD5. Not great on older 32-bit x86, but not too
horrible, and the performance improvements on every other architecture
and the security improvements everywhere are a net good.

> Here's the modified benchmark() code.  The entire package is
> a bit voluminous for the mailing list, but anyone is welcome to it.

Please do send! I'm sure I'll learn from reading it. Thanks again for
doing the hardwork of putting something proper together.

Thanks,
Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [kernel-hardening] Re: HalfSipHash Acceptable Usage

2016-12-21 Thread George Spelvin

> Plus the benchmark was bogus anyway, and when I built a more specific
> harness -- actually comparing the TCP sequence number functions --
> SipHash was faster than MD5, even on register starved x86. So I think
> we're fine and this chapter of the discussion can come to a close, in
> order to move on to more interesting things.

Do we have to go through this?  No, the benchmark was *not* bogus.

Here's myresults from *your* benchmark.  I can't reboot some of my test
machines, so I took net/core/secure_seq.c, lib/siphash.c, lib/md5.c and
include/linux/siphash.h straight out of your test tree.

Then I replaced the kernel #includes with the necessary typedefs
and #defines to make it compile in user-space.  (Voluminous but
straightforward.)  E.g.

#define __aligned(x) __attribute__((__aligned__(x)))
#define cacheline_aligned __aligned(64)
#define CONFIG_INET 1
#define IS_ENABLED(x) 1
#define ktime_get_real_ns() 0
#define sysctl_tcp_timestamps 0

... etc.

Then I modified your benchmark code into the appended code.  The
differences are:
* I didn't iterate 100K times, I timed the functions *once*.
* I saved the times in a buffer and printed them all at the end
  so printf() wouldn't pollute the caches.
* Before every even-numbered iteration, I flushed the I-cache
  of everything from _init to _fini (i.e. all the non-library code).
  This cold-cache case is what is going to happen in the kernel.

In the results below, note that I did *not* re-flush between phases
of the test.  The effects of cacheing is clearly apparent in the tcpv4
results, where the tcpv6 code loaded the cache.

You can also see that the SipHash code benefits more from cacheing when
entered with a cold cache, as it iterates over the input words, while
the MD5 code is one big unrolled blob.

Order of computation is down the columns first, across second.

The P4 results were:
tcpv6 md5 cold: 40843488358435843568
tcpv4 md5 cold: 1052 996 9961060 996
tcpv6 siphash cold: 40803296331232963312
tcpv4 siphash cold: 29682748297227162716
tcpv6 md5 hot:   900 712 712712  712
tcpv4 md5 hot:   632 672 672672  672
tcpv6 siphash hot:  24842292234023402340
tcpv4 siphash hot:  16601560156423401564

SipHash actually wins slightly in the cold-cache case, because
it iterates more.  In the hot-cache case, it loses horribly.

Core 2 duo:
tcpv6 md5 cold: 33962868296430122832
tcpv4 md5 cold: 13681044132013321308
tcpv6 siphash cold: 29402952291624482604
tcpv4 siphash cold: 31922988357635043624
tcpv6 md5 hot:  11161032 99610081008
tcpv4 md5 hot:   936 936 936 936 936
tcpv6 siphash hot:  12001236123611881188
tcpv4 siphash hot:   936 804 804 804 804

Pretty much a tie, honestly.

Ivy Bridge:
tcpv6 md5 cold: 60866136696263586060
tcpv4 md5 cold:  816 732104610541012
tcpv6 siphash cold: 37561886215223902566
tcpv4 siphash cold: 32642108302631203526
tcpv6 md5 hot:  1062 808 824 824 832
tcpv4 md5 hot:   730 730 740 748 748
tcpv6 siphash hot:   960 952 9361112 926
tcpv4 siphash hot:   638 544 562 552 560

Modern processors *hate* cold caches.  But notice how md5 is *faster*
than SipHash on hot-cache IPv6.

Ivy Bridge, -m64:
tcpv6 md5 cold: 46803672395636163525
tcpv4 md5 cold: 10661416117911791134
tcpv6 siphash cold:  9401258199516092255
tcpv4 siphash cold: 14401269129218701621
tcpv6 md5 hot:  1372112210881088
tcpv4 md5 hot:   997 997 997 997 998
tcpv6 siphash hot:   340 340 340 352 340
tcpv4 siphash hot:   227 238 238 238 238

Of course, when you compile -m64, SipHash is unbeatable.


Here's the modified benchmark() code.  The entire package is
a bit voluminous for the mailing list, but anyone is welcome to it.

static void clflush(void)
{
extern char const _init, _fini;
char const *p = &_init;

while (p < &_fini) {
asm("clflush %0" : : "m" (*p));
p += 64;
}
}

typedef uint32_t cycles_t;
static cycles_t get_cycles(void)
{
uint32_t eax, edx;
asm volatile("rdtsc" : "=a" (eax), "=d" (edx));
return eax;
}

static int benchmark(void)
{
cycles_t start, finish;
int i;
u32 seq_number = 0;
__be32 saddr6[4] = { 1, 4, 182, 393 }, daddr6[4] = { 9192, 18288, 
222, 0xff10 };
__be32 saddr4 = 2, daddr4 = 182112;
__be16 sport = 22, dport = 41992;

Re: [kernel-hardening] Re: HalfSipHash Acceptable Usage

2016-12-21 Thread Jason A. Donenfeld

On Wed, Dec 21, 2016 at 11:27 PM, Theodore Ts'o  wrote:
> And "with enough registers" includes ARM and MIPS, right?  So the only
> real problem is 32-bit x86, and you're right, at that point, only
> people who might care are people who are using a space-radiation
> hardened 386 --- and they're not likely to be doing high throughput
> TCP connections.  :-)

Plus the benchmark was bogus anyway, and when I built a more specific
harness -- actually comparing the TCP sequence number functions --
SipHash was faster than MD5, even on register starved x86. So I think
we're fine and this chapter of the discussion can come to a close, in
order to move on to more interesting things.
--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [kernel-hardening] Re: HalfSipHash Acceptable Usage

2016-12-21 Thread Eric Dumazet

On Wed, 2016-12-21 at 11:39 -0500, Rik van Riel wrote:

> Does anybody still have a P4?
> 
> If they do, they're probably better off replacing
> it with an Atom. The reduced power bills will pay
> for replacing that P4 within a year or two.

Well, maybe they have millions of units to replace.

> 
> In short, I am not sure how important the P4
> performance numbers are, especially if we can
> improve security for everybody else...

Worth adding that the ISN or syncookie generation are less than 10% of
the actual cost of handling a problematic (having to generate ISN or
syncookie) TCP packet anyway.

So we are talking of minors potential impact for '2000-era' cpus.

Definitely I vote for using SipHash in TCP ASAP.

--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [kernel-hardening] Re: HalfSipHash Acceptable Usage

2016-12-21 Thread Rik van Riel

On Wed, 2016-12-21 at 10:55 -0500, George Spelvin wrote:
> Actually, DJB just made a very relevant suggestion.
> 
> As I've mentioned, the 32-bit performance problems are an x86-
> specific
> problem.  ARM does very well, and other processors aren't bad at all.
> 
> SipHash fits very nicely (and runs very fast) in the MMX registers.
> 
> They're 64 bits, and there are 8 of them, so the integer registers
> can
> be reserved for pointers and loop counters and all that.  And there's
> reference code available.
> 
> How much does kernel_fpu_begin()/kernel_fpu_end() cost?

Those can be very expensive. Almost certainly not
worth it for small amounts of data.

-- 
All Rights Reversed.

signature.asc
Description: This is a digitally signed message part

Re: [kernel-hardening] Re: HalfSipHash Acceptable Usage

Re: [kernel-hardening] Re: HalfSipHash Acceptable Usage

Re: [kernel-hardening] Re: HalfSipHash Acceptable Usage

Re: [kernel-hardening] Re: HalfSipHash Acceptable Usage

Re: [kernel-hardening] Re: HalfSipHash Acceptable Usage

5 matches

Site Navigation

Mail list logo

Footer information