Re: [PATCH v6 0/2] x86: Implement fast refcount overflow protection

2017-07-24 Thread Peter Zijlstra
On Mon, Jul 24, 2017 at 10:09:32PM +1000, Michael Ellerman wrote:
> Peter Zijlstra  writes:
> > anyway, and the fact that your LL/SC is horrendously slow in any case.
> 
> Boo :/

:-)

> Just kidding. I suspect you're right that we can probably pack a
> reasonable amount of tests in the body of the LL/SC and not notice.
> 
> > Also, I still haven't seen an actual benchmark where our cmpxchg loop
> > actually regresses anything, just a lot of yelling about potential
> > regressions :/
> 
> Heh yeah. Though I have looked at the code it generates on PPC and it's
> not sleek, though I guess that's not a benchmark, is it :)

Oh for sure, GCC still can't sanely convert a cmpxchg loop (esp. if the
cmpxchg is implemented using asm) into a native LL/SC sequence, so the
generic code will end up looking pretty horrendous.

A native implementation of the same semantics should look loads better.
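
To illustrate the codegen difference, here is a userspace C11 sketch (not
the kernel code, which uses atomic_cmpxchg()): a strong compare-and-swap
hides its own retry loop inside every call on LL/SC machines, while a weak
one can compile down to a single native LL/SC sequence.

#include <stdatomic.h>

void inc_strong(atomic_int *r)
{
	int old = atomic_load_explicit(r, memory_order_relaxed);

	/* On LL/SC machines each strong CAS contains its own inner
	 * retry loop, nested inside this outer one. */
	while (!atomic_compare_exchange_strong_explicit(r, &old, old + 1,
			memory_order_relaxed, memory_order_relaxed))
		;
}

void inc_weak(atomic_int *r)
{
	int old = atomic_load_explicit(r, memory_order_relaxed);

	/* A weak CAS is allowed to fail spuriously, so the compiler can
	 * emit one native LL/SC sequence for the whole loop. */
	while (!atomic_compare_exchange_weak_explicit(r, &old, old + 1,
			memory_order_relaxed, memory_order_relaxed))
		;
}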


One thing that might help you is that refcount_dec_and_test() is weaker
than atomic_dec_and_test() wrt ordering, so that might help some
(RELEASE vs fully ordered).
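
As a minimal sketch of that weaker ordering, assuming the common
release-on-drop / acquire-before-free pattern (userspace C11 code, not the
kernel's implementation):

#include <stdatomic.h>
#include <stdbool.h>

static bool refcount_dec_and_test_sketch(atomic_int *r)
{
	/* RELEASE: prior stores to the object are ordered before the drop. */
	if (atomic_fetch_sub_explicit(r, 1, memory_order_release) == 1) {
		/* ACQUIRE only on the final drop, instead of the full
		 * barrier that a fully ordered atomic_dec_and_test()
		 * effectively pays on every call. */
		atomic_thread_fence(memory_order_acquire);
		return true;	/* last reference gone; caller may free */
	}
	return false;
}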


Re: [PATCH v6 0/2] x86: Implement fast refcount overflow protection

2017-07-24 Thread Michael Ellerman
Peter Zijlstra  writes:

> On Mon, Jul 24, 2017 at 04:38:06PM +1000, Michael Ellerman wrote:
>
>> What I'm not entirely clear on is what the best trade off is in terms of
>> overhead vs checks. The summary of behaviour between the fast and full
>> versions you promised Ingo will help there I think.
>
> That's something that's probably completely different for PPC than it is
> for x86.

Yeah definitely. I guess I see the x86 version as a lower bound on the
semantics we'd need to implement and still claim to implement the
refcount stuff.

> Both because your primitive is LL/SC and thus the saturation
> semantics we need a cmpxchg loop for are more natural in your case

Yay!

> anyway, and the fact that your LL/SC is horrendously slow in any case.

Boo :/

Just kidding. I suspect you're right that we can probably pack a
reasonable amount of tests in the body of the LL/SC and not notice.
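
A portable approximation of "tests in the body" might look like the sketch
below; REFCOUNT_SATURATED and the exact checks are illustrative assumptions,
and an arch-specific PPC version would place the same tests directly between
lwarx and stwcx. rather than around a weak CAS.

#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

#define REFCOUNT_SATURATED (INT_MIN / 2)	/* assumed saturation value */

static bool refcount_dec_and_test_llsc_style(atomic_int *r)
{
	int val = atomic_load_explicit(r, memory_order_relaxed);

	do {
		if (val == REFCOUNT_SATURATED)	/* already poisoned: no-op */
			return false;
		if (val <= 0)			/* underflow: real code would
						 * saturate and report here */
			return false;
	} while (!atomic_compare_exchange_weak_explicit(r, &val, val - 1,
			memory_order_release, memory_order_relaxed));

	return val == 1;			/* we dropped the last reference */
}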

> Also, I still haven't seen an actual benchmark where our cmpxchg loop
> actually regresses anything, just a lot of yelling about potential
> regressions :/

Heh yeah. Though I have looked at the code it generates on PPC and it's
not sleek, though I guess that's not a benchmark, is it :)

cheers


Re: [PATCH v6 0/2] x86: Implement fast refcount overflow protection

2017-07-24 Thread Peter Zijlstra
On Mon, Jul 24, 2017 at 04:38:06PM +1000, Michael Ellerman wrote:

> What I'm not entirely clear on is what the best trade off is in terms of
> overhead vs checks. The summary of behaviour between the fast and full
> versions you promised Ingo will help there I think.

That's something that's probably completely different for PPC than it is
for x86. Both because your primitive is LL/SC and thus the saturation
semantics we need a cmpxchg loop for are more natural in your case
anyway, and the fact that your LL/SC is horrendously slow in any case.

Also, I still haven't seen an actual benchmark where our cmpxchg loop
actually regresses anything, just a lot of yelling about potential
regressions :/
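
The read/test/retry shape in question, as a rough sketch rather than
lib/refcount.c itself (the saturation constant follows this series): on x86
it has to be an explicit cmpxchg loop, whereas on LL/SC machines the same
tests fold into the one native sequence.

#include <limits.h>
#include <stdatomic.h>

#define REFCOUNT_SATURATED (INT_MIN / 2)	/* illustrative, per this series */

static void refcount_inc_saturating(atomic_int *r)
{
	int old = atomic_load_explicit(r, memory_order_relaxed);
	int new;

	do {
		if (old < 0)		/* already saturated or overflowed */
			return;
		new = (old == INT_MAX) ? REFCOUNT_SATURATED : old + 1;
	} while (!atomic_compare_exchange_weak_explicit(r, &old, new,
			memory_order_relaxed, memory_order_relaxed));
}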


Re: [PATCH v6 0/2] x86: Implement fast refcount overflow protection

2017-07-24 Thread Michael Ellerman
Kees Cook  writes:
> On Fri, Jul 21, 2017 at 2:22 PM, Andrew Morton
>>
>> Do we have a feeling for how feasible/difficult it will be for other
>> architectures to implement such a thing?
>
> The PaX atomic_t overflow protection this is heavily based on was
> ported to a number of architectures (arm, powerpc, mips, sparc), so I
> suspect it shouldn't be too hard to adapt those for the more narrow
> refcount_t protection:
> https://forums.grsecurity.net/viewtopic.php?f=7&t=4173

The ppc code there appears to be 32-bit only and has other issues so
probably isn't something we'd use.

I don't think there should be any fundamental problem implementing it.

What I'm not entirely clear on is what the best trade off is in terms of
overhead vs checks. The summary of behaviour between the fast and full
versions you promised Ingo will help there I think.

cheers


Re: [PATCH v6 0/2] x86: Implement fast refcount overflow protection

2017-07-21 Thread Kees Cook
On Fri, Jul 21, 2017 at 2:22 PM, Andrew Morton
 wrote:
> On Thu, 20 Jul 2017 11:11:06 +0200 Ingo Molnar  wrote:
>
>>
>> * Kees Cook  wrote:
>>
>> > This implements refcount_t overflow protection on x86 without a noticeable
>> > performance impact, though without the fuller checking of REFCOUNT_FULL.
>> > This is done by duplicating the existing atomic_t refcount implementation
>> > but with normally a single instruction added to detect if the refcount
>> > has gone negative (i.e. wrapped past INT_MAX or below zero). When
>> > detected, the handler saturates the refcount_t to INT_MIN / 2. With this
>> > overflow protection, the erroneous reference release that would follow
>> > a wrap back to zero is blocked from happening, avoiding the class of
>> > refcount-over-increment use-after-free vulnerabilities entirely.
>> >
>> > Only the overflow case of refcounting can be perfectly protected, since it
>> > can be detected and stopped before the reference is freed and left to be
>> > abused by an attacker. This implementation also notices some of the "dec
>> > to 0 without test", and "below 0" cases. However, these only indicate that
>> > a use-after-free may have already happened. Such notifications are likely
>> > avoidable by an attacker that has already exploited a use-after-free
>> > vulnerability, but it's better to have them than allow such conditions to
>> > remain universally silent.
>> >
>> > On first overflow detection, the refcount value is reset to INT_MIN / 2
>> > (which serves as a saturation value), the offending process is killed,
>> > and a report and stack trace are produced. When operations detect only
>> > negative value results (such as changing an already saturated value),
>> > saturation still happens but no notification is performed (since the
>> > value was already saturated).
>> >
>> > On the matter of races, since the entire range beyond INT_MAX but before
>> > 0 is negative, every operation at INT_MIN / 2 will trap, leaving no
>> > overflow-only race condition.
>> >
>> > As for performance, this implementation adds a single "js" instruction
>> > to the regular execution flow of a copy of the standard atomic_t refcount
>> > operations. (The non-"and_test" refcount_dec() function, which is uncommon
>> > in regular refcount design patterns, has an additional "jz" instruction
>> > to detect reaching exactly zero.) Since this is a forward jump, it is by
>> > default the non-predicted path, which will be reinforced by dynamic branch
>> > prediction. The result is this protection having virtually no measurable
>> > change in performance over standard atomic_t operations. The error path,
>> > located in .text.unlikely, saves the refcount location and then uses UD0
>> > to fire a refcount exception handler, which resets the refcount, handles
>> > reporting, and returns to regular execution. This keeps the changes to
>> > .text size minimal, avoiding return jumps and open-coded calls to the
>> > error reporting routine.
>>
>> Pretty nice!
>>
>
> Yes, this is a relief.
>
> Do we have a feeling for how feasible/difficult it will be for other
> architectures to implement such a thing?

The PaX atomic_t overflow protection this is heavily based on was
ported to a number of architectures (arm, powerpc, mips, sparc), so I
suspect it shouldn't be too hard to adapt those for the more narrow
refcount_t protection:
https://forums.grsecurity.net/viewtopic.php?f=7&t=4173

And an arm64 port of the fast refcount_t protection is already happening too.

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH v6 0/2] x86: Implement fast refcount overflow protection

2017-07-21 Thread Andrew Morton
On Thu, 20 Jul 2017 11:11:06 +0200 Ingo Molnar  wrote:

> 
> * Kees Cook  wrote:
> 
> > This implements refcount_t overflow protection on x86 without a noticeable
> > performance impact, though without the fuller checking of REFCOUNT_FULL.
> > This is done by duplicating the existing atomic_t refcount implementation
> > but with normally a single instruction added to detect if the refcount
> > has gone negative (i.e. wrapped past INT_MAX or below zero). When
> > detected, the handler saturates the refcount_t to INT_MIN / 2. With this
> > overflow protection, the erroneous reference release that would follow
> > a wrap back to zero is blocked from happening, avoiding the class of
> > refcount-over-increment use-after-free vulnerabilities entirely.
> > 
> > Only the overflow case of refcounting can be perfectly protected, since it
> > can be detected and stopped before the reference is freed and left to be
> > abused by an attacker. This implementation also notices some of the "dec
> > to 0 without test", and "below 0" cases. However, these only indicate that
> > a use-after-free may have already happened. Such notifications are likely
> > avoidable by an attacker that has already exploited a use-after-free
> > vulnerability, but it's better to have them than allow such conditions to
> > remain universally silent.
> > 
> > On first overflow detection, the refcount value is reset to INT_MIN / 2
> > (which serves as a saturation value), the offending process is killed,
> > and a report and stack trace are produced. When operations detect only
> > negative value results (such as changing an already saturated value),
> > saturation still happens but no notification is performed (since the
> > value was already saturated).
> > 
> > On the matter of races, since the entire range beyond INT_MAX but before
> > 0 is negative, every operation at INT_MIN / 2 will trap, leaving no
> > overflow-only race condition.
> > 
> > As for performance, this implementation adds a single "js" instruction
> > to the regular execution flow of a copy of the standard atomic_t refcount
> > operations. (The non-"and_test" refcount_dec() function, which is uncommon
> > in regular refcount design patterns, has an additional "jz" instruction
> > to detect reaching exactly zero.) Since this is a forward jump, it is by
> > default the non-predicted path, which will be reinforced by dynamic branch
> > prediction. The result is this protection having virtually no measurable
> > change in performance over standard atomic_t operations. The error path,
> > located in .text.unlikely, saves the refcount location and then uses UD0
> > to fire a refcount exception handler, which resets the refcount, handles
> > reporting, and returns to regular execution. This keeps the changes to
> > .text size minimal, avoiding return jumps and open-coded calls to the
> > error reporting routine.
> 
> Pretty nice!
> 

Yes, this is a relief.

Do we have a feeling for how feasible/difficult it will be for other
architectures to implement such a thing?



Re: [PATCH v6 0/2] x86: Implement fast refcount overflow protection

2017-07-21 Thread Ingo Molnar

* Kees Cook  wrote:

> On Thu, Jul 20, 2017 at 10:15 AM, Kees Cook  wrote:
> > On Thu, Jul 20, 2017 at 2:11 AM, Ingo Molnar  wrote:
> >> Could you please also create a tabulated quick-comparison of the three
> >> variants, of all key properties, about behavior, feature and tradeoff
> >> differences?
> >>
> >> Something like:
> >>
> >>                             !ARCH_HAS_REFCOUNT  ARCH_HAS_REFCOUNT=y  REFCOUNT_FULL=y
> >>
> >> avg fast path instructions: 5                   3                    10
> >> behavior on overflow:       unsafe, silent      safe,   verbose      safe,   verbose
> >> behavior on underflow:      unsafe, silent      unsafe, verbose      unsafe, verbose
> >> ...
> >>
> >> etc. - note that this table is just a quick mockup with wild guesses.
> >> (Please add more comparisons of other aspects as well.)
> >>
> >> Such a comparison would make it easier for arch, subsystem and distribution
> >> maintainers to decide on which variant to use/enable.
> >
> > Sure, I can write this up. I'm not sure "safe"/"unsafe" is quite that
> > clean. The differences between -full and -fast are pretty subtle, but
> > I think I can describe it using the updated LKDTM tests I've written
> > to compare the two. There are conditions that -fast doesn't catch, but
> > those cases aren't actually useful for the overflow defense.
> >
> > As for "avg fast path instructions", do you mean the resulting
> > assembly for each refcount API function? I think it's going to look
> > something like "1   2   45", but I'll write it up.
> 
> So, doing a worst-case timing of a loop of inc() to INT_MAX and then
> dec_and_test() back to zero, I see this out of perf:
> 
> atomic
> 25255.114805  task-clock (msec)
>  82249267387  cycles
>  11208720041  instructions
> 
> refcount-fast
> 25259.577583  task-clock (msec)
>  82211446892  cycles
>  15486246572  instructions
> 
> refcount-full
> 44625.923432  task-clock (msec)
> 144814735193  cycles
> 105937495952  instructions
> 
> I'll still summarize all this in the v7 series, but I think that
> really clarifies the differences: 1.5x more instructions in -fast, but
> nearly identical cycles and clock. Using -full sees a large change (as
> expected).

Ok, that's pretty convincing - I'd suggest including a cycles row in the
table instead of an instructions row: the number of instructions is indeed
slightly misleading in this case.

Thanks,

Ingo


Re: [PATCH v6 0/2] x86: Implement fast refcount overflow protection

2017-07-20 Thread Kees Cook
On Thu, Jul 20, 2017 at 10:15 AM, Kees Cook  wrote:
> On Thu, Jul 20, 2017 at 2:11 AM, Ingo Molnar  wrote:
>> Could you please also create a tabulated quick-comparison of the three
>> variants, of all key properties, about behavior, feature and tradeoff
>> differences?
>>
>> Something like:
>>
>>                             !ARCH_HAS_REFCOUNT  ARCH_HAS_REFCOUNT=y  REFCOUNT_FULL=y
>>
>> avg fast path instructions: 5                   3                    10
>> behavior on overflow:       unsafe, silent      safe,   verbose      safe,   verbose
>> behavior on underflow:      unsafe, silent      unsafe, verbose      unsafe, verbose
>> ...
>>
>> etc. - note that this table is just a quick mockup with wild guesses.
>> (Please add more comparisons of other aspects as well.)
>>
>> Such a comparison would make it easier for arch, subsystem and distribution
>> maintainers to decide on which variant to use/enable.
>
> Sure, I can write this up. I'm not sure "safe"/"unsafe" is quite that
> clean. The differences between -full and -fast are pretty subtle, but
> I think I can describe it using the updated LKDTM tests I've written
> to compare the two. There are conditions that -fast doesn't catch, but
> those cases aren't actually useful for the overflow defense.
>
> As for "avg fast path instructions", do you mean the resulting
> assembly for each refcount API function? I think it's going to look
> something like "1   2   45", but I'll write it up.

So, doing a worst-case timing of a loop of inc() to INT_MAX and then
dec_and_test() back to zero, I see this out of perf:

atomic
25255.114805  task-clock (msec)
 82249267387  cycles
 11208720041  instructions

refcount-fast
25259.577583  task-clock (msec)
 82211446892  cycles
 15486246572  instructions

refcount-full
44625.923432  task-clock (msec)
144814735193  cycles
105937495952  instructions

I'll still summarize all this in the v7 series, but I think that
really clarifies the differences: 1.5x more instructions in -fast, but
nearly identical cycles and clock. Using -full sees a large change (as
expected).
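
A hypothetical userspace reconstruction of that worst-case loop is sketched
below; the quoted numbers came from kernel builds of the three refcount
variants, so everything here is illustrative rather than the actual harness.

#include <limits.h>
#include <stdatomic.h>

int main(void)
{
	atomic_int ref = 0;
	int i;

	for (i = 0; i < INT_MAX; i++)	/* inc() all the way to INT_MAX */
		atomic_fetch_add_explicit(&ref, 1, memory_order_relaxed);

	while (atomic_fetch_sub_explicit(&ref, 1,	/* dec_and_test() */
			memory_order_release) != 1)
		;				/* spin back down to zero */
	return 0;
}

Run under something like "perf stat -e task-clock,cycles,instructions" to
collect the three counters quoted above.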

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH v6 0/2] x86: Implement fast refcount overflow protection

2017-07-20 Thread Kees Cook
On Thu, Jul 20, 2017 at 2:11 AM, Ingo Molnar  wrote:
> Could you please also create a tabulated quick-comparison of the three
> variants, of all key properties, about behavior, feature and tradeoff
> differences?
>
> Something like:
>
>                             !ARCH_HAS_REFCOUNT  ARCH_HAS_REFCOUNT=y  REFCOUNT_FULL=y
>
> avg fast path instructions: 5                   3                    10
> behavior on overflow:       unsafe, silent      safe,   verbose      safe,   verbose
> behavior on underflow:      unsafe, silent      unsafe, verbose      unsafe, verbose
> ...
>
> etc. - note that this table is just a quick mockup with wild guesses.
> (Please add more comparisons of other aspects as well.)
>
> Such a comparison would make it easier for arch, subsystem and distribution
> maintainers to decide on which variant to use/enable.

Sure, I can write this up. I'm not sure "safe"/"unsafe" is quite that
clean. The differences between -full and -fast are pretty subtle, but
I think I can describe it using the updated LKDTM tests I've written
to compare the two. There are conditions that -fast doesn't catch, but
those cases aren't actually useful for the overflow defense.

As for "avg fast path instructions", do you mean the resulting
assembly for each refcount API function? I think it's going to look
something like "1   2   45", but I'll write it up.

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH v6 0/2] x86: Implement fast refcount overflow protection

2017-07-20 Thread Ingo Molnar

* Kees Cook  wrote:

> This implements refcount_t overflow protection on x86 without a noticeable
> performance impact, though without the fuller checking of REFCOUNT_FULL.
> This is done by duplicating the existing atomic_t refcount implementation
> but with normally a single instruction added to detect if the refcount
> has gone negative (i.e. wrapped past INT_MAX or below zero). When
> detected, the handler saturates the refcount_t to INT_MIN / 2. With this
> overflow protection, the erroneous reference release that would follow
> a wrap back to zero is blocked from happening, avoiding the class of
> refcount-over-increment use-after-free vulnerabilities entirely.
> 
> Only the overflow case of refcounting can be perfectly protected, since it
> can be detected and stopped before the reference is freed and left to be
> abused by an attacker. This implementation also notices some of the "dec
> to 0 without test", and "below 0" cases. However, these only indicate that
> a use-after-free may have already happened. Such notifications are likely
> avoidable by an attacker that has already exploited a use-after-free
> vulnerability, but it's better to have them than allow such conditions to
> remain universally silent.
> 
> On first overflow detection, the refcount value is reset to INT_MIN / 2
> (which serves as a saturation value), the offending process is killed,
> and a report and stack trace are produced. When operations detect only
> negative value results (such as changing an already saturated value),
> saturation still happens but no notification is performed (since the
> value was already saturated).
> 
> On the matter of races, since the entire range beyond INT_MAX but before
> 0 is negative, every operation at INT_MIN / 2 will trap, leaving no
> overflow-only race condition.
> 
> As for performance, this implementation adds a single "js" instruction
> to the regular execution flow of a copy of the standard atomic_t refcount
> operations. (The non-"and_test" refcount_dec() function, which is uncommon
> in regular refcount design patterns, has an additional "jz" instruction
> to detect reaching exactly zero.) Since this is a forward jump, it is by
> default the non-predicted path, which will be reinforced by dynamic branch
> prediction. The result is this protection having virtually no measurable
> change in performance over standard atomic_t operations. The error path,
> located in .text.unlikely, saves the refcount location and then uses UD0
> to fire a refcount exception handler, which resets the refcount, handles
> reporting, and returns to regular execution. This keeps the changes to
> .text size minimal, avoiding return jumps and open-coded calls to the
> error reporting routine.

Pretty nice!

Could you please also create a tabulated quick-comparison of the three
variants, of all key properties, about behavior, feature and tradeoff
differences?

Something like:

                            !ARCH_HAS_REFCOUNT  ARCH_HAS_REFCOUNT=y  REFCOUNT_FULL=y

avg fast path instructions: 5                   3                    10
behavior on overflow:       unsafe, silent      safe,   verbose      safe,   verbose
behavior on underflow:      unsafe, silent      unsafe, verbose      unsafe, verbose
...

etc. - note that this table is just a quick mockup with wild guesses.
(Please add more comparisons of other aspects as well.)

Such a comparison would make it easier for arch, subsystem and distribution 
maintainers to decide on which variant to use/enable.

Thanks,

Ingo


[PATCH v6 0/2] x86: Implement fast refcount overflow protection

2017-07-18 Thread Kees Cook
This series implements a fast refcount overflow protection for x86, which is
needed to provide coverage for the several refcount-overflow use-after-free
flaws the kernel has seen over the last many years. For example, here are
two from 2016:

http://perception-point.io/2016/01/14/analysis-and-exploitation-of-a-linux-kernel-vulnerability-cve-2016-0728/

http://cyseclabs.com/page?n=02012016

Patch 1 provides support for adding additional assembly to the GEN_*_RMWcc
macros, and patch 2 does the real work. (I left Josh's Reviewed-by since
the exception handler code has remained the same.)

Here is patch 2's full commit log:


This implements refcount_t overflow protection on x86 without a noticeable
performance impact, though without the fuller checking of REFCOUNT_FULL.
This is done by duplicating the existing atomic_t refcount implementation
but with normally a single instruction added to detect if the refcount
has gone negative (i.e. wrapped past INT_MAX or below zero). When
detected, the handler saturates the refcount_t to INT_MIN / 2. With this
overflow protection, the erroneous reference release that would follow
a wrap back to zero is blocked from happening, avoiding the class of
refcount-over-increment use-after-free vulnerabilities entirely.

Only the overflow case of refcounting can be perfectly protected, since it
can be detected and stopped before the reference is freed and left to be
abused by an attacker. This implementation also notices some of the "dec
to 0 without test", and "below 0" cases. However, these only indicate that
a use-after-free may have already happened. Such notifications are likely
avoidable by an attacker that has already exploited a use-after-free
vulnerability, but it's better to have them than allow such conditions to
remain universally silent.

On first overflow detection, the refcount value is reset to INT_MIN / 2
(which serves as a saturation value), the offending process is killed,
and a report and stack trace are produced. When operations detect only
negative value results (such as changing an already saturated value),
saturation still happens but no notification is performed (since the
value was already saturated).

On the matter of races, since the entire range beyond INT_MAX but before
0 is negative, every operation at INT_MIN / 2 will trap, leaving no
overflow-only race condition.
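
A standalone arithmetic check of that property (plain C; INT_MIN / 2 is the
saturation value described above):

#include <limits.h>
#include <stdio.h>

int main(void)
{
	int v = INT_MIN / 2;	/* 0xc0000000: the saturation value */

	/* Both directions stay negative, so the sign-flag check ("js")
	 * keeps firing; it would take roughly INT_MAX / 2 racing
	 * operations to drag the value back across zero. */
	printf("%d %d\n", v + 1000000, v - 1000000);
	return 0;
}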

As for performance, this implementation adds a single "js" instruction
to the regular execution flow of a copy of the standard atomic_t refcount
operations. (The non-"and_test" refcount_dec() function, which is uncommon
in regular refcount design patterns, has an additional "jz" instruction
to detect reaching exactly zero.) Since this is a forward jump, it is by
default the non-predicted path, which will be reinforced by dynamic branch
prediction. The result is this protection having virtually no measurable
change in performance over standard atomic_t operations. The error path,
located in .text.unlikely, saves the refcount location and then uses UD0
to fire a refcount exception handler, which resets the refcount, handles
reporting, and returns to regular execution. This keeps the changes to
.text size minimal, avoiding return jumps and open-coded calls to the
error reporting routine.

Example assembly comparison:

refcount_inc before
.text:
ffffffff81546149:   f0 ff 45 f4         lock incl -0xc(%rbp)

refcount_inc after
.text:
ffffffff81546149:   f0 ff 45 f4         lock incl -0xc(%rbp)
ffffffff8154614d:   0f 88 80 d5 17 00   js     ffffffff816c36d3
...
.text.unlikely:
ffffffff816c36d3:   48 8d 4d f4         lea    -0xc(%rbp),%rcx
ffffffff816c36d7:   0f ff               (bad)
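
The same fast-path shape can be approximated in userspace GCC inline asm.
This is a sketch only: the real patch recovers via an exception-table fixup
for the UD0 (and, as above, stashes the counter address in %rcx for the
handler), whereas this version simply traps.

/* x86-64, GCC/Clang inline asm; ud2 here just kills the process,
 * standing in for the kernel's recoverable UD0 handler. */
static inline void checked_inc(int *counter)
{
	asm volatile(
		"lock incl %0\n\t"
		"js 1f\n\t"			/* sign set => wrapped negative */
		".pushsection .text.unlikely\n"
		"1:	ud2\n"			/* kernel: lea + UD0 + fixup */
		".popsection"
		: "+m" (*counter)
		:
		: "cc", "memory");
}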

This protection is a modified version of the x86 PAX_REFCOUNT defense
from the last public patch of PaX/grsecurity, based on my understanding of
the code. Changes or omissions from the original code are mine and don't
reflect the original grsecurity/PaX code. Thanks to PaX Team for various
suggestions for improvement.


Thanks!

-Kees

v6:
- use single saturation value (INT_MIN / 2)
- detect refcount_dec() to zero and saturate

v5:
- add unchecked atomic_t implementation when !CONFIG_REFCOUNT_FULL
- use "leal" again, as in v3 for more flexible reset handling
- provide better underflow detection, with saturation

v4:
- switch to js from jns to gain static branch prediction benefits
- use .text.unlikely for js target, effectively making handler __cold
- use UD0 with refcount exception handler instead of int 0x81
- Kconfig defaults on when arch has support

v3:
- drop named text sections until we need to distinguish sizes/directions
- reset value immediately instead of passing back to handler
- drop needless export; josh

v2:
- fix instruction pointer decrement bug; thejh
- switch to js; pax-team
- improve commit log




