Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t

2017-03-22 Thread Kees Cook
On Tue, Mar 21, 2017 at 7:03 PM, Eric Dumazet  wrote:
> On Tue, 2017-03-21 at 16:51 -0700, Kees Cook wrote:
>
>> Am I understanding you correctly that you'd want something like:
>>
>> refcount.h:
>> #ifdef UNPROTECTED_REFCOUNT
>> #define refcount_inc(x)   atomic_inc(x)
>> ...
>> #else
>> void refcount_inc(...
>> ...
>> #endif
>>
>> some/net.c:
>> #define UNPROTECTED_REFCOUNT
>> #include 
>>
>> or similar?
>
> At first, it could be something simple like that yes.
>
> Note that we might define two refcount_inc() variants: one that does the
> whole set of tests, and a refcount_inc_relaxed() that might translate to
> atomic_inc() on non-debug kernels.
>
> Then later, maybe provide a dynamic infrastructure so that we can
> dynamically force the full checks even for refcount_inc_relaxed() on,
> say, 1% of the hosts, to get better debug coverage?

Well, this isn't about finding bugs in normal workflows. This is about
catching bugs that attackers have found and started exploiting to gain a
use-after-free primitive. The intention is for it to always be
enabled.

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t

2017-03-22 Thread Peter Zijlstra
On Wed, Mar 22, 2017 at 07:54:04AM -0700, Eric Dumazet wrote:
> 
> I guess someone could code a lib/test_refcount.c launching X threads
> using either atomic_inc or refcount_inc() in a loop.
> 
> That would give a rough estimate of the refcount_t overhead among
> various platforms.

Cycles spent on uncontended ops:

                                   SKL   SNB   IVB-EP

atomic:      lock incl             ~15   ~13   ~10
atomic-ref:  call refcount_inc     ~31   ~37   ~31
atomic-ref2: $inlined              ~23   ~22   ~21


Contended numbers (E3-1245 v5):


root@skl:~/spinlocks# LOCK=./atomic ./test1.sh
1: 14.797240
2: 87.451230
4: 100.747790
8: 118.234010

root@skl:~/spinlocks# LOCK=./atomic-ref ./test1.sh
1: 30.627320
2: 91.866730
4: 111.029560
8: 141.922420

root@skl:~/spinlocks# LOCK=./atomic-ref2 ./test1.sh
1: 23.243930
2: 98.620250
4: 119.604240
8: 124.864380



The code includes the patches found here:

  https://lkml.kernel.org/r/20170317211918.393791...@infradead.org

and effectively does:

#define REFCOUNT_WARN(cond, str) WARN_ON_ONCE(cond)

 s/WARN_ONCE/REFCOUNT_WARN/

on lib/refcount.c

Find the tarball of the userspace code used attached (it's a bit of a
mess; it's grown over time and needs a cleanup).

I used: gcc (Debian 6.3.0-6) 6.3.0 20170205


So while it's about ~20 cycles worse, reducing contention is far more
effective than reducing straight-line instruction count (which is also
entirely possible, because GCC generates absolute shite in places).


spinlocks.tar.bz2
Description: Binary data


Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t

2017-03-22 Thread Eric Dumazet
On Wed, 2017-03-22 at 16:08 +0100, Peter Zijlstra wrote:
> On Wed, Mar 22, 2017 at 07:54:04AM -0700, Eric Dumazet wrote:
> > On Wed, 2017-03-22 at 15:33 +0100, Peter Zijlstra wrote:
> > 
> > > 
> > > But I would feel a whole lot better about the entire thing if we could
> > > measure their impact. It would also give us good precedent to whack
> > > other potential users of _nocheck over the head with -- show numbers.
> > 
> > I won't be able to measure the impact on real workloads; our production
> > kernels are based on 4.3 at this moment.
> 
> Is there really no micro bench that exercises the relevant network
> paths? Do you really fully rely on Google production workloads?

You could run a synflood test, with ~10 Mpps.

sock_hold() is definitely used in SYN handling.

The latest upstream kernels do not work on my lab hosts, for whatever reason.




Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t

2017-03-22 Thread Peter Zijlstra
On Wed, Mar 22, 2017 at 07:54:04AM -0700, Eric Dumazet wrote:
> On Wed, 2017-03-22 at 15:33 +0100, Peter Zijlstra wrote:
> 
> > 
> > But I would feel a whole lot better about the entire thing if we could
> > measure their impact. It would also give us good precedent to whack
> > other potential users of _nocheck over the head with -- show numbers.
> 
> I won't be able to measure the impact on real workloads; our production
> kernels are based on 4.3 at this moment.

Is there really no micro bench that exercises the relevant network
paths? Do you really fully rely on Google production workloads?

> I guess someone could code a lib/test_refcount.c launching X threads
> using either atomic_inc or refcount_inc() in a loop.
> 
> That would give a rough estimate of the refcount_t overhead among
> various platforms.

It's also a fairly meaningless number. It doesn't include any of the
other work the network path does.


Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t

2017-03-22 Thread Eric Dumazet
On Wed, 2017-03-22 at 15:33 +0100, Peter Zijlstra wrote:

> 
> But I would feel a whole lot better about the entire thing if we could
> measure their impact. It would also give us good precedent to whack
> other potential users of _nocheck over the head with -- show numbers.

I won't be able to measure the impact on real workloads; our production
kernels are based on 4.3 at this moment.

I guess someone could code a lib/test_refcount.c launching X threads
using either atomic_inc or refcount_inc() in a loop.

That would give a rough estimate of the refcount_t overhead among
various platforms.
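
Something along these lines, perhaps -- a rough sketch only; the module
layout and all names below are made up, this is not an actual
lib/test_refcount.c:

/* Sketch of a refcount_t vs atomic_t micro-benchmark module. */
#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/atomic.h>
#include <linux/refcount.h>
#include <linux/ktime.h>

static int nthreads = 4;
module_param(nthreads, int, 0444);

#define LOOPS	(10 * 1000 * 1000L)

static atomic_t a_cnt = ATOMIC_INIT(1);
static refcount_t r_cnt = REFCOUNT_INIT(1);

static int bench_thread(void *data)
{
	bool use_refcount = (bool)(long)data;
	ktime_t t0 = ktime_get();
	long i;

	for (i = 0; i < LOOPS; i++) {
		if (use_refcount)
			refcount_inc(&r_cnt);
		else
			atomic_inc(&a_cnt);
	}
	pr_info("test_refcount: %s: %lld ns for %ld incs\n",
		use_refcount ? "refcount_inc" : "atomic_inc",
		ktime_to_ns(ktime_sub(ktime_get(), t0)), LOOPS);
	return 0;
}

static int __init test_refcount_init(void)
{
	int i;

	for (i = 0; i < nthreads; i++) {
		kthread_run(bench_thread, (void *)0L, "atomic/%d", i);
		kthread_run(bench_thread, (void *)1L, "refcount/%d", i);
	}
	return 0;	/* no module_exit: threads just run to completion */
}
module_init(test_refcount_init);

MODULE_LICENSE("GPL");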




Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t

2017-03-22 Thread Peter Zijlstra
On Wed, Mar 22, 2017 at 06:22:16AM -0700, Eric Dumazet wrote:

> But admittedly we can replace all these by standard refcount_inc() and
> simply provide a CONFIG option to turn off the checks, and let brave
> people enable this option.

Still brings us back to lacking a real reason to provide that CONFIG
option. Not to mention that this CONFIG knob will kill the warnings for
everything, even code that might not be as heavily audited as the
networking code and doesn't really care much about the performance of
refcount operations.


So I'm actually in favour of _nocheck variants, if we can show the need
for them. And I like your idea of being able to dynamically switch them
back to full debug as well.


But I would feel a whole lot better about the entire thing if we could
measure their impact. It would also give us good precedent to whack
other potential users of _nocheck over the head with -- show numbers.


Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t

2017-03-22 Thread Peter Zijlstra
On Tue, Mar 21, 2017 at 04:51:13PM -0700, Kees Cook wrote:
> On Tue, Mar 21, 2017 at 2:23 PM, Eric Dumazet  wrote:

> > Unfortunately there is no good test simulating real-world workloads,
> > which are mostly using TCP flows.
> 
> Sure, but there has to be _something_ that can be used to test to
> measure the effects. Without a meaningful test, it's weird to reject a
> change for performance reasons.

This. How can you optimize if there's no way to actually measure
something?

> > Most synthetic tools you can find are not using epoll(), and very often
> > hit bottlenecks in other layers.
> >
> >
> > It looks like our suggestion to get kernel builds with atomic_inc()
> > being exactly an atomic_inc() is not even discussed or implemented.
> 
> So, FWIW, I originally tried to make this a CONFIG in the first couple
> passes at getting a refcount defense. I would be fine with this, but I
> was not able to convince Peter. :) However, things have evolved a lot
> since then, so perhaps there are things to be done here.

Well, the argument was that unless there's a benchmark that shows it
cares, it's all premature optimization.

Similarly, you wanted this enabled at all times because hardening.



Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t

2017-03-22 Thread Eric Dumazet
On Wed, 2017-03-22 at 13:25 +0100, Peter Zijlstra wrote:
> On Tue, Mar 21, 2017 at 07:03:19PM -0700, Eric Dumazet wrote:
> 
> > Note that we might define two refcount_inc() variants: one that does the
> > whole set of tests, and a refcount_inc_relaxed() that might translate to
> > atomic_inc() on non-debug kernels.
> 
> So you'd want a duplicate interface, such that most code, which doesn't
> care about refcount performance much, can still have all the tests
> enabled.
> 
> But the code that cares about it (and preferably can prove it with
> numbers) can use the other.
> 
> I'm also somewhat hesitant to use _relaxed for this distinction, as it
> has a clear meaning in atomics, maybe _nocheck?
> 
> Also; what operations do you want _nocheck variants of, only
> refcount_inc() ?

I was mostly thinking of points where we were already checking the value
either before or after the atomic_inc(), using some lazy check (a la
WARN_ON(atomic_read(p) == 0)) or something like that.
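
For instance, call sites that today look roughly like this (purely
illustrative; struct foo and foo_hold() are made up):

#include <linux/atomic.h>
#include <linux/bug.h>

struct foo {
	atomic_t refcnt;
	/* ... */
};

static inline void foo_hold(struct foo *f)
{
	/* open-coded lazy sanity check that already sits next to the inc */
	WARN_ON(atomic_read(&f->refcnt) == 0);
	atomic_inc(&f->refcnt);
}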

But admittedly we can replace all these by standard refcount_inc() and
simply provide a CONFIG option to turn off the checks, and let brave
people enable this option.


 




Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t

2017-03-22 Thread Peter Zijlstra
On Tue, Mar 21, 2017 at 07:03:19PM -0700, Eric Dumazet wrote:

> Note that we might define two refcount_inc() variants: one that does the
> whole set of tests, and a refcount_inc_relaxed() that might translate to
> atomic_inc() on non-debug kernels.

So you'd want a duplicate interface, such that most code, which doesn't
care about refcount performance much, can still have all the tests
enabled.

But the code that cares about it (and preferably can prove it with
numbers) can use the other.

I'm also somewhat hesitant to use _relaxed for this distinction, as it
has a clear meaning in atomics, maybe _nocheck?

Also; what operations do you want _nocheck variants of, only
refcount_inc() ?

That said; I'm really loath to provide these without actual measurements
that prove they make a difference.

> Then later, maybe provide a dynamic infrastructure so that we can
> dynamically force the full checks even for refcount_inc_relaxed() on,
> say, 1% of the hosts, to get better debug coverage?

Shouldn't be too hard to do in arch-specific code using the alternatives
mechanism. Generic code could use jump labels, I suppose, but that would
result in bigger code.
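
Something like the following, perhaps (sketch only, not actual kernel
code; the key and helper names are invented):

#include <linux/jump_label.h>
#include <linux/refcount.h>

DEFINE_STATIC_KEY_FALSE(refcount_full_checks);

static inline void refcount_inc_nocheck(refcount_t *r)
{
	if (static_branch_unlikely(&refcount_full_checks))
		refcount_inc(r);	/* full zero/saturation checks */
	else
		atomic_inc(&r->refs);	/* plain inc, single insn on x86 */
}

/* flipped at runtime, e.g. from a sysctl, on some fraction of hosts */
static void refcount_debug_enable(void)
{
	static_branch_enable(&refcount_full_checks);
}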



Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t

2017-03-21 Thread Eric Dumazet
On Tue, 2017-03-21 at 16:51 -0700, Kees Cook wrote:

> Am I understanding you correctly that you'd want something like:
> 
> refcount.h:
> #ifdef UNPROTECTED_REFCOUNT
> #define refcount_inc(x)   atomic_inc(x)
> ...
> #else
> void refcount_inc(...
> ...
> #endif
> 
> some/net.c:
> #define UNPROTECTED_REFCOUNT
> #include 
> 
> or similar?

At first, it could be something simple like that yes.

Note that we might define two refcount_inc() variants: one that does the
whole set of tests, and a refcount_inc_relaxed() that might translate to
atomic_inc() on non-debug kernels.

Then later, maybe provide a dynamic infrastructure so that we can
dynamically force the full checks even for refcount_inc_relaxed() on,
say, 1% of the hosts, to get better debug coverage?





Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t

2017-03-21 Thread Kees Cook
On Tue, Mar 21, 2017 at 2:23 PM, Eric Dumazet  wrote:
> On Tue, 2017-03-21 at 13:49 -0700, Kees Cook wrote:
>
>> Yeah, this is exactly what I'd like to find as well. Just comparing
>> cycles between refcount implementations, while interesting, doesn't
>> show us real-world performance changes, which is what we need to
>> measure.
>>
>> Is Eric's "20 concurrent 'netperf -t UDP_STREAM'" example (from
>> elsewhere in this email thread) real-world meaningful enough?
>
> Not at all ;)
>
> This was targeting the specific change I had in mind for
> ip_idents_reserve(), which is not used by TCP flows.

Okay, I just wanted to check. I didn't think so, but it was the only
example in the thread.

> Unfortunately there is no good test simulating real-world workloads,
> which are mostly using TCP flows.

Sure, but there has to be _something_ that can be used to test to
measure the effects. Without a meaningful test, it's weird to reject a
change for performance reasons.

> Most synthetic tools you can find are not using epoll(), and very often
> hit bottlenecks in other layers.
>
>
> It looks like our suggestion to get kernel builds with atomic_inc()
> being exactly an atomic_inc() is not even discussed or implemented.

So, FWIW, I originally tried to make this a CONFIG in the first couple
passes at getting a refcount defense. I would be fine with this, but I
was not able to convince Peter. :) However, things have evolved a lot
since then, so perhaps there are things to be done here.

> Coding this would require less time than running a typical Google kernel
> qualification (roughly one month, thousands of hosts..., days of SWE).

It wasn't the issue of coding time; just that it had been specifically
not wanted. :)

Am I understanding you correctly that you'd want something like:

refcount.h:
#ifdef UNPROTECTED_REFCOUNT
#define refcount_inc(x)   atomic_inc(x)
...
#else
void refcount_inc(...
...
#endif

some/net.c:
#define UNPROTECTED_REFCOUNT
#include 

or similar?
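
Fleshed out slightly (still only a sketch; the macro names are invented
and this is not a real proposal):

/* refcount.h */
#ifdef UNPROTECTED_REFCOUNT
/* refcount_t wraps an atomic_t in .refs, so fall back to plain atomics */
#define refcount_inc(r)			atomic_inc(&(r)->refs)
#define refcount_dec_and_test(r)	atomic_dec_and_test(&(r)->refs)
#else
extern void refcount_inc(refcount_t *r);
extern bool refcount_dec_and_test(refcount_t *r);
#endif

/* some/net.c: opt out of the checks before pulling in the header */
#define UNPROTECTED_REFCOUNT
#include <linux/refcount.h>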

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t

2017-03-21 Thread David Miller
From: Eric Dumazet 
Date: Tue, 21 Mar 2017 14:23:09 -0700

> It looks like our suggestion to get kernel builds with atomic_inc()
> being exactly an atomic_inc() is not even discussed or implemented.
> 
> Coding this would require less time than running a typical Google kernel
> qualification (roughly one month, thousands of hosts..., days of SWE).

+1


Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t

2017-03-21 Thread Eric Dumazet
On Tue, 2017-03-21 at 13:49 -0700, Kees Cook wrote:

> Yeah, this is exactly what I'd like to find as well. Just comparing
> cycles between refcount implementations, while interesting, doesn't
> show us real-world performance changes, which is what we need to
> measure.
> 
> Is Eric's "20 concurrent 'netperf -t UDP_STREAM'" example (from
> elsewhere in this email thread) real-world meaningful enough?

Not at all ;)

This was targeting the specific change I had in mind for
ip_idents_reserve(), which is not used by TCP flows.

Unfortunately there is no good test simulating real-world workloads,
which are mostly using TCP flows.

Most synthetic tools you can find are not using epoll(), and very often
hit bottlenecks in other layers.


It looks like our suggestion to get kernel builds with atomic_inc()
being exactly an atomic_inc() is not even discussed or implemented.

Coding this would require less time than running a typical Google kernel
qualification (roughly one month, thousands of hosts..., days of SWE).




Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t

2017-03-21 Thread Kees Cook
On Mon, Mar 20, 2017 at 6:40 AM, Peter Zijlstra  wrote:
> On Mon, Mar 20, 2017 at 09:27:13PM +0800, Herbert Xu wrote:
>> On Mon, Mar 20, 2017 at 02:23:57PM +0100, Peter Zijlstra wrote:
>> >
>> > So what bench/setup do you want ran?
>>
>> You can start by counting how many cycles an atomic op takes
>> vs. how many cycles this new code takes.
>
> On what uarch?
>
> I think I tested hand coded asm version and it ended up about double the
> cycles for a cmpxchg loop vs the direct instruction on an IVB-EX (until
> the memory bus saturated, at which point they took the same). Newer
> parts will of course have different numbers,
>
> Can't we run some iperf on a 40gbe fiber loop or something? It would be
> very useful to have an actual workload we can run.

Yeah, this is exactly what I'd like to find as well. Just comparing
cycles between refcount implementations, while interesting, doesn't
show us real-world performance changes, which is what we need to
measure.

Is Eric's "20 concurrent 'netperf -t UDP_STREAM'" example (from
elsewhere in this email thread) real-world meaningful enough?

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t

2017-03-20 Thread Eric Dumazet
On Mon, 2017-03-20 at 09:18 -0700, Eric Dumazet wrote:

> Interesting.
> 
> UDP ipv4 xmit path gets a ~25 % improvement on PPC with this patch.
> 
> ( 20 concurrent netperf -t UDP_STREAM  : 2.45 Mpps -> 3.07 Mpps )

Well, there _is_ a difference, but not 25 % (this was probably caused by
different queues on TX or RX between my reboots).

I added a sysctl hack to be able to switch it dynamically on a given
workload, and we hit other bottlenecks (mainly qdisc locks and driver TX
locks) anyway.








Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t

2017-03-20 Thread Eric Dumazet
On Mon, 2017-03-20 at 07:59 -0700, Eric Dumazet wrote:
> On Mon, 2017-03-20 at 07:51 -0700, Eric Dumazet wrote:
> 
> > atomic_cmpxchg() on PowerPC is horribly more expensive because of the
> > added two SYNC instructions.
> 
> Although I just saw that refcount was using atomic_cmpxchg_relaxed()
> 
> Time to find some documentation (probably missing) or get some specs for
> this thing.

Interesting.

UDP ipv4 xmit path gets a ~25 % improvement on PPC with this patch.

( 20 concurrent netperf -t UDP_STREAM  : 2.45 Mpps -> 3.07 Mpps )

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 8471dd116771462d149e1da2807e446b69b74bcc..9f14aebf0ae1f5f366cfff0fbf58c48603916bc7 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -497,14 +497,14 @@ u32 ip_idents_reserve(u32 hash, int segs)
u32 now = (u32)jiffies;
u32 new, delta = 0;
 
-   if (old != now && cmpxchg(p_tstamp, old, now) == old)
+   if (old != now && cmpxchg_relaxed(p_tstamp, old, now) == old)
delta = prandom_u32_max(now - old);
 
/* Do not use atomic_add_return() as it makes UBSAN unhappy */
do {
old = (u32)atomic_read(p_id);
new = old + delta + segs;
-   } while (atomic_cmpxchg(p_id, old, new) != old);
+   } while (atomic_cmpxchg_relaxed(p_id, old, new) != old);
 
return new - segs;
 }





Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t

2017-03-20 Thread Peter Zijlstra
On Mon, Mar 20, 2017 at 07:51:01AM -0700, Eric Dumazet wrote:
> PowerPC has no efficient atomic_inc() and this definitely shows on
> network intensive workloads involving concurrent cores/threads.

Correct, PPC LL/SC are dreadfully expensive.

> atomic_cmpxchg() on PowerPC is horribly more expensive because of the
> added two SYNC instructions.

Note that refcount_t uses atomic_cmpxchg_release() and
atomic_cmpxchg_relaxed() which avoid most of the painful barriers.


Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t

2017-03-20 Thread Eric Dumazet
On Mon, 2017-03-20 at 14:40 +0100, Peter Zijlstra wrote:
> On Mon, Mar 20, 2017 at 09:27:13PM +0800, Herbert Xu wrote:
> > On Mon, Mar 20, 2017 at 02:23:57PM +0100, Peter Zijlstra wrote:
> > >
> > > So what bench/setup do you want ran?
> > 
> > You can start by counting how many cycles an atomic op takes
> > vs. how many cycles this new code takes.
> 
> On what uarch?
> 
> I think I tested hand coded asm version and it ended up about double the
> cycles for a cmpxchg loop vs the direct instruction on an IVB-EX (until
> the memory bus saturated, at which point they took the same). Newer
> parts will of course have different numbers,
> 
> Can't we run some iperf on a 40gbe fiber loop or something? It would be
> very useful to have an actual workload we can run.

If atomic ops are converted one by one, it is likely that results will
be noise.

We cannot start a global conversion without having a way to get
selective debugging, can we?

With that in place, adopting this fine infra would really not be a problem.

Some arches have an efficient atomic_inc() (no full barriers) while
"load + test + atomic_cmpxchg() + test + loop" is more expensive.

PowerPC has no efficient atomic_inc() and this definitely shows on
network intensive workloads involving concurrent cores/threads.

atomic_cmpxchg() on PowerPC is horribly more expensive because of the
added two SYNC instructions.

Networking performance is quite poor on PowerPC as of today.




RE: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t

2017-03-20 Thread David Laight
From: Peter Zijlstra
> Sent: 20 March 2017 14:28
> On Mon, Mar 20, 2017 at 02:10:24PM +, David Laight wrote:
> > On x86 the cpu flags from the 'lock inc/dec' could be used to reasonably
> > cheaply detect errors - provided you actually generate a forwards branch.
> 
> Note that currently there is no arch specific implementation. We could
> of course cure this.
> 
> But note that the thing you propose; using the overflow flag, can only
> reasonably be done on PREEMPT=n kernels, otherwise we have an incredible
> number of contexts that can nest.
> 
> Sure; getting all the stars aligned to double overflow is incredibly rare,
> but I don't want to be the one to have to debug that.

One overflow would set the overflow flag; you don't need both to fail.

In any case you can use the sign flag.
Say valid count values are -64k to -256 and 0 to MAXINT.
The count will normally be +ve unless the 'main free path'
has released the 64k references it holds.
If the sign bit is set after the inc/dec, the value is checked;
it might be valid, an error, or require the item to be freed.

OK, assuming the items have reasonable lifetimes and a nominal
'delete' function.

David



Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t

2017-03-20 Thread Eric Dumazet
On Mon, 2017-03-20 at 07:51 -0700, Eric Dumazet wrote:

> atomic_cmpxchg() on PowerPC is horribly more expensive because of the
> added two SYNC instructions.

Although I just saw that refcount was using atomic_cmpxchg_relaxed()

Time to find some documentation (probably missing) or get some specs for
this thing.





Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t

2017-03-20 Thread Herbert Xu
On Mon, Mar 20, 2017 at 02:23:57PM +0100, Peter Zijlstra wrote:
>
> So what bench/setup do you want ran?

You can start by counting how many cycles an atomic op takes
vs. how many cycles this new code takes.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t

2017-03-20 Thread Peter Zijlstra
On Mon, Mar 20, 2017 at 02:10:24PM +, David Laight wrote:
> On x86 the cpu flags from the 'lock inc/dec' could be used to reasonably
> cheaply detect errors - provided you actually generate a forwards branch.

Note that currently there is no arch specific implementation. We could
of course cure this.

But note that the thing you propose; using the overflow flag, can only
reasonably be done on PREEMPT=n kernels, otherwise we have an incredible
number of contexts that can nest.

Sure; getting all the stars aligned to double overflow is incredibly rare,
but I don't want to be the one to have to debug that.


RE: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t

2017-03-20 Thread David Laight
From: Herbert Xu
> Sent: 20 March 2017 13:16
> On Mon, Mar 20, 2017 at 11:39:37AM +0100, Peter Zijlstra wrote:
> >
> > Can we at least give a benchmark and have someone run numbers? We should
> > be able to quantify these things.
> 
> Do you realise how many times this thing gets hit at 10Gb/s or
> higher? Anyway, since you're proposing this change you should
> demonstrate that it does not cause a performance regression.

What checks does refcount_t actually do?

An extra decrement is hard to detect since the item gets freed early.
I guess making the main 'allocate/free' code hold (say) 64k references
would give some leeway for extra decrements.

An extra increment will be detected when the count eventually wraps;
unless the error is in a very common path, that won't happen for a long time.

On x86 the cpu flags from the 'lock inc/dec' could be used to reasonably
cheaply detect errors - provided you actually generate a forwards branch.

Otherwise having a common, but not every packet, code path verify that the
reference count is 'sane' would give reasonable coverage.
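
A crude sketch of that bias idea (entirely hypothetical; struct foo and
the helper names are made up, and 64k is just the example value above):

#include <linux/atomic.h>
#include <linux/bug.h>
#include <linux/slab.h>

#define FOO_REF_BIAS	0x10000		/* 64k references held by the owner */

struct foo {
	atomic_t refcnt;
	/* ... */
};

static struct foo *foo_alloc(void)
{
	struct foo *f = kzalloc(sizeof(*f), GFP_KERNEL);

	if (f)
		atomic_set(&f->refcnt, FOO_REF_BIAS);
	return f;
}

static void foo_put(struct foo *f)
{
	/* an extra put eats into the bias instead of freeing early */
	if (atomic_dec_and_test(&f->refcnt))
		kfree(f);
}

static void foo_free(struct foo *f)
{
	/* main free path drops its 64k references in one go */
	int left = atomic_sub_return(FOO_REF_BIAS, &f->refcnt);

	WARN_ON_ONCE(left < 0);		/* somebody did too many puts */
	if (left <= 0)
		kfree(f);
}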

David


Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t

2017-03-20 Thread Peter Zijlstra
On Mon, Mar 20, 2017 at 09:27:13PM +0800, Herbert Xu wrote:
> On Mon, Mar 20, 2017 at 02:23:57PM +0100, Peter Zijlstra wrote:
> >
> > So what bench/setup do you want ran?
> 
> You can start by counting how many cycles an atomic op takes
> vs. how many cycles this new code takes.

On what uarch?

I think I tested a hand-coded asm version and it ended up at about double
the cycles for a cmpxchg loop vs the direct instruction on an IVB-EX
(until the memory bus saturated, at which point they took the same).
Newer parts will of course have different numbers.

Can't we run some iperf on a 40gbe fiber loop or something? It would be
very useful to have an actual workload we can run.


Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t

2017-03-20 Thread Herbert Xu
On Mon, Mar 20, 2017 at 11:39:37AM +0100, Peter Zijlstra wrote:
>
> Can we at least give a benchmark and have someone run numbers? We should
> be able to quantify these things.

Do you realise how many times this thing gets hit at 10Gb/s or
higher? Anyway, since you're proposing this change you should
demonstrate that it does not cause a performance regression.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t

2017-03-20 Thread Peter Zijlstra
On Mon, Mar 20, 2017 at 09:16:29PM +0800, Herbert Xu wrote:
> On Mon, Mar 20, 2017 at 11:39:37AM +0100, Peter Zijlstra wrote:
> >
> > Can we at least give a benchmark and have someone run numbers? We should
> > be able to quantify these things.
> 
> Do you realise how many times this thing gets hit at 10Gb/s or
> higher? Anyway, since you're proposing this change you should
> demonstrate that it does not cause a performance regression.

So what bench/setup do you want ran?


Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t

2017-03-20 Thread Peter Zijlstra
On Sat, Mar 18, 2017 at 06:21:21PM -0700, David Miller wrote:
> From: Herbert Xu 
> Date: Sun, 19 Mar 2017 00:47:59 +0800
> 
> > Eric Dumazet  wrote:
> >> On Fri, 2017-03-17 at 07:42 +, Reshetova, Elena wrote:
> >> 
> >>> Should we then first measure the actual numbers to understand what we
> >>> are talking here about? 
> >>> I would be glad to do it if you suggest what is the correct way to do
> >>> measurements here to actually reflect the real life use cases. 
> >> 
> >> How have these patches been tested in real life exactly ?
> >> 
> >> Can you quantify number of added cycles per TCP packet, where I expect
> >> we have maybe 20 atomic operations in all layers ...
> > 
> > I completely agree.  I think this thing needs to default to the
> > existing atomic_t behaviour.
> 
> I totally agree as well, the refcount_t facility as-is is unacceptable
> for networking.

Can we at least give a benchmark and have someone run numbers? We should
be able to quantify these things.



Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t

2017-03-18 Thread David Miller
From: Herbert Xu 
Date: Sun, 19 Mar 2017 00:47:59 +0800

> Eric Dumazet  wrote:
>> On Fri, 2017-03-17 at 07:42 +, Reshetova, Elena wrote:
>> 
>>> Should we then first measure the actual numbers to understand what we
>>> are talking here about? 
>>> I would be glad to do it if you suggest what is the correct way to do
>>> measurements here to actually reflect the real life use cases. 
>> 
>> How have these patches been tested in real life exactly ?
>> 
>> Can you quantify number of added cycles per TCP packet, where I expect
>> we have maybe 20 atomic operations in all layers ...
> 
> I completely agree.  I think this thing needs to default to the
> existing atomic_t behaviour.

I totally agree as well, the refcount_t facility as-is is unacceptable
for networking.


Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t

2017-03-18 Thread Herbert Xu
Eric Dumazet  wrote:
> On Fri, 2017-03-17 at 07:42 +, Reshetova, Elena wrote:
> 
>> Should we then first measure the actual numbers to understand what we
>> are talking here about? 
>> I would be glad to do it if you suggest what is the correct way to do
>> measurements here to actually reflect the real life use cases. 
> 
> How have these patches been tested in real life exactly ?
> 
> Can you quantify number of added cycles per TCP packet, where I expect
> we have maybe 20 atomic operations in all layers ...

I completely agree.  I think this thing needs to default to the
existing atomic_t behaviour.

Thanks,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t

2017-03-17 Thread Eric Dumazet
On Fri, 2017-03-17 at 07:42 +, Reshetova, Elena wrote:

> Should we then first measure the actual numbers to understand what we
> are talking here about? 
> I would be glad to do it if you suggest what is the correct way to do
> measurements here to actually reflect the real life use cases. 

How have these patches been tested in real life, exactly?

Can you quantify the number of added cycles per TCP packet, where I expect
we have maybe 20 atomic operations in all layers ...


(sk refcnt, skb->users, page refcounts, sk->sk_wmem_alloc,
sk->sk_rmem_alloc, qdisc ...)

Once we 'protect' all of them, the cost will be quite high.

This translates to more fossil fuel being burnt.

One atomic_inc() used to be a single x86 instruction.

Rough estimate of refcount_inc():

0140 <refcount_inc>:
 140:   55                      push   %rbp
 141:   48 89 e5                mov    %rsp,%rbp
 144:   e8 00 00 00 00          callq  refcount_inc_not_zero
 149:   84 c0                   test   %al,%al
 14b:   74 02                   je     14f
 14d:   5d                      pop    %rbp
 14e:   c3                      retq

00e0 <refcount_inc_not_zero>:
  e0:   8b 17                   mov    (%rdi),%edx
  e2:   eb 10                   jmp    f4
  e4:   85 c9                   test   %ecx,%ecx
  e6:   74 1b                   je     103
  e8:   89 d0                   mov    %edx,%eax
  ea:   f0 0f b1 0f             lock cmpxchg %ecx,(%rdi)
  ee:   39 d0                   cmp    %edx,%eax
  f0:   74 0c                   je     fe
  f2:   89 c2                   mov    %eax,%edx
  f4:   85 d2                   test   %edx,%edx
  f6:   8d 4a 01                lea    0x1(%rdx),%ecx
  f9:   75 e9                   jne    e4
  fb:   31 c0                   xor    %eax,%eax
  fd:   c3                      retq
  fe:   83 f9 ff                cmp    $0xffffffff,%ecx
 101:   74 06                   je     109
 103:   b8 01 00 00 00          mov    $0x1,%eax
 108:   c3                      retq

This is simply bloat for most cases.

Again, I believe this infrastructure makes sense for debugging kernels.

If some vendors are willing to run fully enabled debug kernels,
that is their choice. Probably many devices won't show any difference.

Have we forced KASAN to be enabled in the Linux kernel, just because
it has found ~400 bugs so far?

I believe the refcount_t infra is not mature enough to be widely used
right now.

Maybe in a few months, when we have more flexibility, like the existing
debugging facilities (CONFIG_DEBUG_PAGEALLOC, CONFIG_DEBUG_PAGE_REF,
LOCKDEP, KMEMLEAK, KASAN, ...).




RE: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t

2017-03-17 Thread Reshetova, Elena
> From: Kees Cook 
> Date: Thu, 16 Mar 2017 11:38:25 -0600
> 
> > I am, of course, biased, but I think the evidence of actual
> > refcounting attacks outweighs the theoretical performance cost of
> > these changes.
> 
> This is not theoretical at all.
> 
> We count the nanoseconds that every packet takes to get processed and
> you are adding quite a bit.
> 
> I understand your point of view, but this is knowingly going to add
> performance regressions to the networking code.

Should we then first measure the actual numbers to understand what we are
talking about here?
I would be glad to do it if you suggest the correct way to do
measurements here that actually reflect the real-life use cases.

Best Regards,
Elena.


Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t

2017-03-16 Thread David Miller
From: Kees Cook 
Date: Thu, 16 Mar 2017 11:38:25 -0600

> I am, of course, biased, but I think the evidence of actual
> refcounting attacks outweighs the theoretical performance cost of
> these changes.

This is not theoretical at all.

We count the nanoseconds that every packet takes to get processed and
you are adding quite a bit.

I understand your point of view, but this is knowingly going to add
performance regressions to the networking code.


Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t

2017-03-16 Thread Kees Cook
On Thu, Mar 16, 2017 at 10:58 AM, Eric Dumazet  wrote:
> On Thu, 2017-03-16 at 17:28 +0200, Elena Reshetova wrote:
>> refcount_t type and corresponding API should be
>> used instead of atomic_t when the variable is used as
>> a reference counter. This allows to avoid accidental
>> refcounter overflows that might lead to use-after-free
>> situations.
>
>
> ...
>
>>  static __always_inline void sock_hold(struct sock *sk)
>>  {
>> - atomic_inc(&sk->sk_refcnt);
>> + refcount_inc(&sk->sk_refcnt);
>>  }
>>
>
> While I certainly see the value of these refcount_t, we have a very
> different behavior on these atomic_inc() which were doing a single
> inlined LOCK RMW on x86.

I think we can certainly investigate arch-specific ways to improve the
performance, but the consensus seemed to be that getting the
infrastructure in and doing the migration was the first set of steps.

> We now call an external function performing an
> atomic_read(), various ops/tests, then atomic_cmpxchg_relaxed(), in a
> loop, losing the nice ability of x86 to prevent livelocks.
>
> Looks like a lot of bloat, just to be able to chase hypothetical bugs in
> the kernel.
>
> I would love to have a way to enable extra debugging when I want a debug
> kernel, like LOCKDEP or KASAN.
>
> By adding all this bloat, we assert that the Linux kernel is terminally
> buggy and that every atomic_inc() we did was suspicious and needs to
> always be instrumented/validated.

This IS the assertion, unfortunately. With an average 5-year lifetime for
security flaws[1], and many of the last couple of years' public exploits
being refcount flaws[2], this is something we have to get done. We
need the default kernel to be much more self-protective, and this is
one of many places to make it happen.

I am, of course, biased, but I think the evidence of actual
refcounting attacks outweighs the theoretical performance cost of
these changes. If there is a realistic workflow that shows actual
problems, let's examine it and find a solution for that as a separate
part of this work without blocking this migration.

-Kees

[1] https://outflux.net/blog/archives/2016/10/18/security-bug-lifetime/
[2] http://kernsec.org/wiki/index.php/Bug_Classes/Integer_overflow

-- 
Kees Cook
Pixel Security


Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t

2017-03-16 Thread Eric Dumazet
On Thu, 2017-03-16 at 17:28 +0200, Elena Reshetova wrote:
> refcount_t type and corresponding API should be
> used instead of atomic_t when the variable is used as
> a reference counter. This allows to avoid accidental
> refcounter overflows that might lead to use-after-free
> situations.


...

>  static __always_inline void sock_hold(struct sock *sk)
>  {
> - atomic_inc(&sk->sk_refcnt);
> + refcount_inc(&sk->sk_refcnt);
>  }
>  

While I certainly see the value of these refcount_t, we have a very
different behavior on these atomic_inc() which were doing a single
inlined LOCK RMW on x86.

We now call an external function performing an
atomic_read(), various ops/tests, then atomic_cmpxchg_relaxed(), in a
loop, losing the nice ability of x86 to prevent livelocks.
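
Roughly (paraphrasing what lib/refcount.c does for these two operations;
not a verbatim copy of the file):

bool refcount_inc_not_zero(refcount_t *r)
{
	unsigned int old, new, val = atomic_read(&r->refs);

	for (;;) {
		new = val + 1;

		if (!val)			/* was zero: refuse to resurrect */
			return false;

		if (unlikely(!new))		/* saturated: stay at UINT_MAX */
			return true;

		old = atomic_cmpxchg_relaxed(&r->refs, val, new);
		if (old == val)
			break;

		val = old;			/* lost the race, retry */
	}

	WARN_ONCE(new == UINT_MAX, "refcount_t: saturated; leaking memory.\n");
	return true;
}

void refcount_inc(refcount_t *r)
{
	WARN_ONCE(!refcount_inc_not_zero(r),
		  "refcount_t: increment on 0; use-after-free.\n");
}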

Looks like a lot of bloat, just to be able to chase hypothetical bugs in
the kernel.

I would love to have a way to enable extra debugging when I want a debug
kernel, like LOCKDEP or KASAN.

By adding all this bloat, we assert that the Linux kernel is terminally
buggy and that every atomic_inc() we did was suspicious and needs to
always be instrumented/validated.





[PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t

2017-03-16 Thread Elena Reshetova
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows us to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova 
Signed-off-by: Hans Liljestrand 
Signed-off-by: Kees Cook 
Signed-off-by: David Windsor 
---
 crypto/algif_aead.c |  2 +-
 include/net/inet_hashtables.h   |  4 ++--
 include/net/request_sock.h  |  9 +
 include/net/sock.h  | 17 +
 net/atm/proc.c  |  2 +-
 net/bluetooth/af_bluetooth.c|  2 +-
 net/bluetooth/rfcomm/sock.c |  2 +-
 net/core/skbuff.c   |  6 +++---
 net/core/sock.c |  4 ++--
 net/ipv4/inet_connection_sock.c |  2 +-
 net/ipv4/inet_hashtables.c  |  4 ++--
 net/ipv4/inet_timewait_sock.c   |  8 
 net/ipv4/ping.c |  4 ++--
 net/ipv4/raw.c  |  2 +-
 net/ipv4/syncookies.c   |  2 +-
 net/ipv4/tcp_fastopen.c |  2 +-
 net/ipv4/tcp_ipv4.c |  4 ++--
 net/ipv4/udp.c  |  6 +++---
 net/ipv4/udp_diag.c |  4 ++--
 net/ipv6/datagram.c |  2 +-
 net/ipv6/inet6_hashtables.c |  4 ++--
 net/ipv6/tcp_ipv6.c |  4 ++--
 net/ipv6/udp.c  |  2 +-
 net/key/af_key.c|  2 +-
 net/l2tp/l2tp_debugfs.c |  3 +--
 net/llc/llc_conn.c  |  8 
 net/llc/llc_sap.c   |  2 +-
 net/netfilter/xt_TPROXY.c   |  4 ++--
 net/netlink/af_netlink.c|  6 +++---
 net/packet/af_packet.c  |  2 +-
 net/phonet/socket.c |  2 +-
 net/rxrpc/af_rxrpc.c|  2 +-
 net/sched/em_meta.c |  2 +-
 net/tipc/socket.c   |  2 +-
 net/unix/af_unix.c  |  2 +-
 35 files changed, 68 insertions(+), 67 deletions(-)

diff --git a/crypto/algif_aead.c b/crypto/algif_aead.c
index 5a80537..607380d 100644
--- a/crypto/algif_aead.c
+++ b/crypto/algif_aead.c
@@ -750,7 +750,7 @@ static void aead_sock_destruct(struct sock *sk)
unsigned int ivlen = crypto_aead_ivsize(
crypto_aead_reqtfm(&ctx->aead_req));
 
-   WARN_ON(atomic_read(&sk->sk_refcnt) != 0);
+   WARN_ON(refcount_read(&sk->sk_refcnt) != 0);
aead_put_sgl(sk);
sock_kzfree_s(sk, ctx->iv, ivlen);
sock_kfree_s(sk, ctx, ctx->len);
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 1178931..b9e6e0e 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -32,7 +32,7 @@
 #include 
 #include 
 
-#include <linux/atomic.h>
+#include <linux/refcount.h>
 #include 
 
 /* This is for all connections with a full identity, no wildcards.
@@ -334,7 +334,7 @@ static inline struct sock *inet_lookup(struct net *net,
sk = __inet_lookup(net, hashinfo, skb, doff, saddr, sport, daddr,
   dport, dif, &refcounted);
 
-   if (sk && !refcounted && !atomic_inc_not_zero(&sk->sk_refcnt))
+   if (sk && !refcounted && !refcount_inc_not_zero(&sk->sk_refcnt))
sk = NULL;
return sk;
 }
diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index a12a5d2..e76e8c2 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -89,7 +90,7 @@ reqsk_alloc(const struct request_sock_ops *ops, struct sock *sk_listener,
return NULL;
req->rsk_listener = NULL;
if (attach_listener) {
-   if (unlikely(!atomic_inc_not_zero(&sk_listener->sk_refcnt))) {
+   if (unlikely(!refcount_inc_not_zero(&sk_listener->sk_refcnt))) {
kmem_cache_free(ops->slab, req);
return NULL;
}
@@ -100,7 +101,7 @@ reqsk_alloc(const struct request_sock_ops *ops, struct sock *sk_listener,
sk_node_init(&req_to_sk(req)->sk_node);
sk_tx_queue_clear(req_to_sk(req));
req->saved_syn = NULL;
-   atomic_set(&req->rsk_refcnt, 0);
+   refcount_set(&req->rsk_refcnt, 0);
 
return req;
 }
@@ -108,7 +109,7 @@ reqsk_alloc(const struct request_sock_ops *ops, struct sock *sk_listener,
 static inline void reqsk_free(struct request_sock *req)
 {
/* temporary debugging */
-   WARN_ON_ONCE(atomic_read(&req->rsk_refcnt) != 0);
+   WARN_ON_ONCE(refcount_read(&req->rsk_refcnt) != 0);
 
req->rsk_ops->destructor(req);
if (req->rsk_listener)
@@ -119,7 +120,7 @@ static inline void reqsk_free(struct request_sock *req)
 
 static inline void reqsk_put(struct request_sock *req)
 {
-   if (atomic_dec_and_test(&req->rsk_refcnt))
+   if (refcount_dec_and_test(&req->rsk_refcnt))
reqsk_free(req);
 }
 
diff --git a/include/net/sock.h b/include/net/sock.h
index 24dcdba..c4f2b6c 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h