Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-09 Thread Dan Williams
On Tue, Jan 9, 2018 at 2:23 PM, Josh Poimboeuf  wrote:
> On Tue, Jan 09, 2018 at 01:59:04PM -0800, Dan Williams wrote:
>> > Right, but what's the purpose of preventing speculation past
>> > access_ok()?
>>
>> Caution. It's the same rationale for the nospec_array_ptr() patches.
>> If we, kernel community, can identify any possible speculation past a
>> bounds check we should inject a speculation mitigation. Unless there's
>> a way to be 100% certain that the first unwanted speculation can be
>> turned into a gadget later on in the instruction stream, err on the
>> side of shutting it down early.
>
> I'm all for being cautious.  The nospec_array_ptr() patches are fine,
> and they make sense in light of the variant 1 CVE.
>
> But that still doesn't answer my question.  I haven't seen *any*
> rationale for this patch.  It would be helpful to at least describe
> what's being protected against, even if it's hypothetical.  How can we
> review it if the commit log doesn't describe its purpose?

Certainly the changelog needs improvement, I'll roll these concerns
into v2 and we can go from there.


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-09 Thread Dan Williams
On Tue, Jan 9, 2018 at 2:23 PM, Josh Poimboeuf  wrote:
> On Tue, Jan 09, 2018 at 01:59:04PM -0800, Dan Williams wrote:
>> > Right, but what's the purpose of preventing speculation past
>> > access_ok()?
>>
>> Caution. It's the same rationale for the nospec_array_ptr() patches.
>> If we, kernel community, can identify any possible speculation past a
>> bounds check we should inject a speculation mitigation. Unless there's
>> a way to be 100% certain that the first unwanted speculation can be
>> turned into a gadget later on in the instruction stream, err on the
>> side of shutting it down early.
>
> I'm all for being cautious.  The nospec_array_ptr() patches are fine,
> and they make sense in light of the variant 1 CVE.
>
> But that still doesn't answer my question.  I haven't seen *any*
> rationale for this patch.  It would be helpful to at least describe
> what's being protected against, even if it's hypothetical.  How can we
> review it if the commit log doesn't describe its purpose?

Certainly the changelog needs improvement, I'll roll these concerns
into v2 and we can go from there.


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-09 Thread Josh Poimboeuf
On Tue, Jan 09, 2018 at 01:59:04PM -0800, Dan Williams wrote:
> > Right, but what's the purpose of preventing speculation past
> > access_ok()?
> 
> Caution. It's the same rationale for the nospec_array_ptr() patches.
> If we, kernel community, can identify any possible speculation past a
> bounds check we should inject a speculation mitigation. Unless there's
> a way to be 100% certain that the first unwanted speculation can be
> turned into a gadget later on in the instruction stream, err on the
> side of shutting it down early.

I'm all for being cautious.  The nospec_array_ptr() patches are fine,
and they make sense in light of the variant 1 CVE.

But that still doesn't answer my question.  I haven't seen *any*
rationale for this patch.  It would be helpful to at least describe
what's being protected against, even if it's hypothetical.  How can we
review it if the commit log doesn't describe its purpose?

-- 
Josh


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-09 Thread Josh Poimboeuf
On Tue, Jan 09, 2018 at 01:59:04PM -0800, Dan Williams wrote:
> > Right, but what's the purpose of preventing speculation past
> > access_ok()?
> 
> Caution. It's the same rationale for the nospec_array_ptr() patches.
> If we, kernel community, can identify any possible speculation past a
> bounds check we should inject a speculation mitigation. Unless there's
> a way to be 100% certain that the first unwanted speculation can be
> turned into a gadget later on in the instruction stream, err on the
> side of shutting it down early.

I'm all for being cautious.  The nospec_array_ptr() patches are fine,
and they make sense in light of the variant 1 CVE.

But that still doesn't answer my question.  I haven't seen *any*
rationale for this patch.  It would be helpful to at least describe
what's being protected against, even if it's hypothetical.  How can we
review it if the commit log doesn't describe its purpose?

-- 
Josh


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-09 Thread Dan Williams
On Tue, Jan 9, 2018 at 1:49 PM, Josh Poimboeuf  wrote:
> On Tue, Jan 09, 2018 at 01:47:09PM -0800, Dan Williams wrote:
>> On Tue, Jan 9, 2018 at 1:41 PM, Josh Poimboeuf  wrote:
>> > On Fri, Jan 05, 2018 at 06:52:07PM -0800, Linus Torvalds wrote:
>> >> On Fri, Jan 5, 2018 at 5:10 PM, Dan Williams  
>> >> wrote:
>> >> > From: Andi Kleen 
>> >> >
>> >> > When access_ok fails we should always stop speculating.
>> >> > Add the required barriers to the x86 access_ok macro.
>> >>
>> >> Honestly, this seems completely bogus.
>> >>
>> >> The description is pure garbage afaik.
>> >>
>> >> The fact is, we have to stop speculating when access_ok() does *not*
>> >> fail - because that's when we'll actually do the access. And it's that
>> >> access that needs to be non-speculative.
>> >>
>> >> That actually seems to be what the code does (it stops speculation
>> >> when __range_not_ok() returns false, but access_ok() is
>> >> !__range_not_ok()). But the explanation is crap, and dangerous.
>> >
>> > The description also seems to be missing the "why", as it's not
>> > self-evident (to me, at least).
>> >
>> > Isn't this (access_ok/uaccess_begin/ASM_STAC) too early for the lfence?
>> >
>> > i.e., wouldn't the pattern be:
>> >
>> > get_user(uval, uptr);
>> > if (uval < array_size) {
>> > lfence();
>> > foo = a2[a1[uval] * 256];
>> > }
>> >
>> > Shouldn't the lfence come much later, *after* reading the variable and
>> > comparing it and branching accordingly?
>>
>> The goal of putting the lfence in uaccess_begin() is to prevent
>> speculation past access_ok().
>
> Right, but what's the purpose of preventing speculation past
> access_ok()?

Caution. It's the same rationale for the nospec_array_ptr() patches.
If we, kernel community, can identify any possible speculation past a
bounds check we should inject a speculation mitigation. Unless there's
a way to be 100% certain that the first unwanted speculation can be
turned into a gadget later on in the instruction stream, err on the
side of shutting it down early.


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-09 Thread Dan Williams
On Tue, Jan 9, 2018 at 1:49 PM, Josh Poimboeuf  wrote:
> On Tue, Jan 09, 2018 at 01:47:09PM -0800, Dan Williams wrote:
>> On Tue, Jan 9, 2018 at 1:41 PM, Josh Poimboeuf  wrote:
>> > On Fri, Jan 05, 2018 at 06:52:07PM -0800, Linus Torvalds wrote:
>> >> On Fri, Jan 5, 2018 at 5:10 PM, Dan Williams  
>> >> wrote:
>> >> > From: Andi Kleen 
>> >> >
>> >> > When access_ok fails we should always stop speculating.
>> >> > Add the required barriers to the x86 access_ok macro.
>> >>
>> >> Honestly, this seems completely bogus.
>> >>
>> >> The description is pure garbage afaik.
>> >>
>> >> The fact is, we have to stop speculating when access_ok() does *not*
>> >> fail - because that's when we'll actually do the access. And it's that
>> >> access that needs to be non-speculative.
>> >>
>> >> That actually seems to be what the code does (it stops speculation
>> >> when __range_not_ok() returns false, but access_ok() is
>> >> !__range_not_ok()). But the explanation is crap, and dangerous.
>> >
>> > The description also seems to be missing the "why", as it's not
>> > self-evident (to me, at least).
>> >
>> > Isn't this (access_ok/uaccess_begin/ASM_STAC) too early for the lfence?
>> >
>> > i.e., wouldn't the pattern be:
>> >
>> > get_user(uval, uptr);
>> > if (uval < array_size) {
>> > lfence();
>> > foo = a2[a1[uval] * 256];
>> > }
>> >
>> > Shouldn't the lfence come much later, *after* reading the variable and
>> > comparing it and branching accordingly?
>>
>> The goal of putting the lfence in uaccess_begin() is to prevent
>> speculation past access_ok().
>
> Right, but what's the purpose of preventing speculation past
> access_ok()?

Caution. It's the same rationale for the nospec_array_ptr() patches.
If we, kernel community, can identify any possible speculation past a
bounds check we should inject a speculation mitigation. Unless there's
a way to be 100% certain that the first unwanted speculation can be
turned into a gadget later on in the instruction stream, err on the
side of shutting it down early.


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-09 Thread Josh Poimboeuf
On Tue, Jan 09, 2018 at 01:47:09PM -0800, Dan Williams wrote:
> On Tue, Jan 9, 2018 at 1:41 PM, Josh Poimboeuf  wrote:
> > On Fri, Jan 05, 2018 at 06:52:07PM -0800, Linus Torvalds wrote:
> >> On Fri, Jan 5, 2018 at 5:10 PM, Dan Williams  
> >> wrote:
> >> > From: Andi Kleen 
> >> >
> >> > When access_ok fails we should always stop speculating.
> >> > Add the required barriers to the x86 access_ok macro.
> >>
> >> Honestly, this seems completely bogus.
> >>
> >> The description is pure garbage afaik.
> >>
> >> The fact is, we have to stop speculating when access_ok() does *not*
> >> fail - because that's when we'll actually do the access. And it's that
> >> access that needs to be non-speculative.
> >>
> >> That actually seems to be what the code does (it stops speculation
> >> when __range_not_ok() returns false, but access_ok() is
> >> !__range_not_ok()). But the explanation is crap, and dangerous.
> >
> > The description also seems to be missing the "why", as it's not
> > self-evident (to me, at least).
> >
> > Isn't this (access_ok/uaccess_begin/ASM_STAC) too early for the lfence?
> >
> > i.e., wouldn't the pattern be:
> >
> > get_user(uval, uptr);
> > if (uval < array_size) {
> > lfence();
> > foo = a2[a1[uval] * 256];
> > }
> >
> > Shouldn't the lfence come much later, *after* reading the variable and
> > comparing it and branching accordingly?
> 
> The goal of putting the lfence in uaccess_begin() is to prevent
> speculation past access_ok().

Right, but what's the purpose of preventing speculation past
access_ok()?

-- 
Josh


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-09 Thread Josh Poimboeuf
On Tue, Jan 09, 2018 at 01:47:09PM -0800, Dan Williams wrote:
> On Tue, Jan 9, 2018 at 1:41 PM, Josh Poimboeuf  wrote:
> > On Fri, Jan 05, 2018 at 06:52:07PM -0800, Linus Torvalds wrote:
> >> On Fri, Jan 5, 2018 at 5:10 PM, Dan Williams  
> >> wrote:
> >> > From: Andi Kleen 
> >> >
> >> > When access_ok fails we should always stop speculating.
> >> > Add the required barriers to the x86 access_ok macro.
> >>
> >> Honestly, this seems completely bogus.
> >>
> >> The description is pure garbage afaik.
> >>
> >> The fact is, we have to stop speculating when access_ok() does *not*
> >> fail - because that's when we'll actually do the access. And it's that
> >> access that needs to be non-speculative.
> >>
> >> That actually seems to be what the code does (it stops speculation
> >> when __range_not_ok() returns false, but access_ok() is
> >> !__range_not_ok()). But the explanation is crap, and dangerous.
> >
> > The description also seems to be missing the "why", as it's not
> > self-evident (to me, at least).
> >
> > Isn't this (access_ok/uaccess_begin/ASM_STAC) too early for the lfence?
> >
> > i.e., wouldn't the pattern be:
> >
> > get_user(uval, uptr);
> > if (uval < array_size) {
> > lfence();
> > foo = a2[a1[uval] * 256];
> > }
> >
> > Shouldn't the lfence come much later, *after* reading the variable and
> > comparing it and branching accordingly?
> 
> The goal of putting the lfence in uaccess_begin() is to prevent
> speculation past access_ok().

Right, but what's the purpose of preventing speculation past
access_ok()?

-- 
Josh


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-09 Thread Dan Williams
On Tue, Jan 9, 2018 at 1:41 PM, Josh Poimboeuf  wrote:
> On Fri, Jan 05, 2018 at 06:52:07PM -0800, Linus Torvalds wrote:
>> On Fri, Jan 5, 2018 at 5:10 PM, Dan Williams  
>> wrote:
>> > From: Andi Kleen 
>> >
>> > When access_ok fails we should always stop speculating.
>> > Add the required barriers to the x86 access_ok macro.
>>
>> Honestly, this seems completely bogus.
>>
>> The description is pure garbage afaik.
>>
>> The fact is, we have to stop speculating when access_ok() does *not*
>> fail - because that's when we'll actually do the access. And it's that
>> access that needs to be non-speculative.
>>
>> That actually seems to be what the code does (it stops speculation
>> when __range_not_ok() returns false, but access_ok() is
>> !__range_not_ok()). But the explanation is crap, and dangerous.
>
> The description also seems to be missing the "why", as it's not
> self-evident (to me, at least).
>
> Isn't this (access_ok/uaccess_begin/ASM_STAC) too early for the lfence?
>
> i.e., wouldn't the pattern be:
>
> get_user(uval, uptr);
> if (uval < array_size) {
> lfence();
> foo = a2[a1[uval] * 256];
> }
>
> Shouldn't the lfence come much later, *after* reading the variable and
> comparing it and branching accordingly?

The goal of putting the lfence in uaccess_begin() is to prevent
speculation past access_ok(). You are correct that the cpu could later
mis-speculate on uval, that's where taint analysis tooling needs to
come into play to track uval to where it is used. That's where the
nospec_array_ptr() patches come into play.


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-09 Thread Dan Williams
On Tue, Jan 9, 2018 at 1:41 PM, Josh Poimboeuf  wrote:
> On Fri, Jan 05, 2018 at 06:52:07PM -0800, Linus Torvalds wrote:
>> On Fri, Jan 5, 2018 at 5:10 PM, Dan Williams  
>> wrote:
>> > From: Andi Kleen 
>> >
>> > When access_ok fails we should always stop speculating.
>> > Add the required barriers to the x86 access_ok macro.
>>
>> Honestly, this seems completely bogus.
>>
>> The description is pure garbage afaik.
>>
>> The fact is, we have to stop speculating when access_ok() does *not*
>> fail - because that's when we'll actually do the access. And it's that
>> access that needs to be non-speculative.
>>
>> That actually seems to be what the code does (it stops speculation
>> when __range_not_ok() returns false, but access_ok() is
>> !__range_not_ok()). But the explanation is crap, and dangerous.
>
> The description also seems to be missing the "why", as it's not
> self-evident (to me, at least).
>
> Isn't this (access_ok/uaccess_begin/ASM_STAC) too early for the lfence?
>
> i.e., wouldn't the pattern be:
>
> get_user(uval, uptr);
> if (uval < array_size) {
> lfence();
> foo = a2[a1[uval] * 256];
> }
>
> Shouldn't the lfence come much later, *after* reading the variable and
> comparing it and branching accordingly?

The goal of putting the lfence in uaccess_begin() is to prevent
speculation past access_ok(). You are correct that the cpu could later
mis-speculate on uval, that's where taint analysis tooling needs to
come into play to track uval to where it is used. That's where the
nospec_array_ptr() patches come into play.


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-09 Thread Josh Poimboeuf
On Fri, Jan 05, 2018 at 06:52:07PM -0800, Linus Torvalds wrote:
> On Fri, Jan 5, 2018 at 5:10 PM, Dan Williams  wrote:
> > From: Andi Kleen 
> >
> > When access_ok fails we should always stop speculating.
> > Add the required barriers to the x86 access_ok macro.
> 
> Honestly, this seems completely bogus.
> 
> The description is pure garbage afaik.
> 
> The fact is, we have to stop speculating when access_ok() does *not*
> fail - because that's when we'll actually do the access. And it's that
> access that needs to be non-speculative.
> 
> That actually seems to be what the code does (it stops speculation
> when __range_not_ok() returns false, but access_ok() is
> !__range_not_ok()). But the explanation is crap, and dangerous.

The description also seems to be missing the "why", as it's not
self-evident (to me, at least).

Isn't this (access_ok/uaccess_begin/ASM_STAC) too early for the lfence?

i.e., wouldn't the pattern be:

get_user(uval, uptr);
if (uval < array_size) {
lfence();
foo = a2[a1[uval] * 256];
}

Shouldn't the lfence come much later, *after* reading the variable and
comparing it and branching accordingly?

-- 
Josh


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-09 Thread Josh Poimboeuf
On Fri, Jan 05, 2018 at 06:52:07PM -0800, Linus Torvalds wrote:
> On Fri, Jan 5, 2018 at 5:10 PM, Dan Williams  wrote:
> > From: Andi Kleen 
> >
> > When access_ok fails we should always stop speculating.
> > Add the required barriers to the x86 access_ok macro.
> 
> Honestly, this seems completely bogus.
> 
> The description is pure garbage afaik.
> 
> The fact is, we have to stop speculating when access_ok() does *not*
> fail - because that's when we'll actually do the access. And it's that
> access that needs to be non-speculative.
> 
> That actually seems to be what the code does (it stops speculation
> when __range_not_ok() returns false, but access_ok() is
> !__range_not_ok()). But the explanation is crap, and dangerous.

The description also seems to be missing the "why", as it's not
self-evident (to me, at least).

Isn't this (access_ok/uaccess_begin/ASM_STAC) too early for the lfence?

i.e., wouldn't the pattern be:

get_user(uval, uptr);
if (uval < array_size) {
lfence();
foo = a2[a1[uval] * 256];
}

Shouldn't the lfence come much later, *after* reading the variable and
comparing it and branching accordingly?

-- 
Josh


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-08 Thread Linus Torvalds
On Mon, Jan 8, 2018 at 3:53 PM, Dan Williams  wrote:
>
> I've been thinking the "and" is only suitable for the array bounds
> check, for get_user() we're trying to block speculation past
> access_ok() at which point we can only do the lfence?

Well, we *could* do the "and", at least for the simple cases (ie the
true "get_user()" that integrates the access_ok with the access).

IOW, mainly the code in arch/x86/lib/getuser.S.

But it probably is a lot simpler to just add the "lfence" to ASM_STAC,
because by definition those cases don't tend to be the truly critical
ones - people who use those functions tend to do one or two accesses,
and the real cost is likely the I$ misses and the D$ miss to get
current->addr_limit. Not to mention the "stac" itself, which is much
more expensive than the access on current microarchitectures.

But something like this *might* work:

   index c97d935a29e8..7fa3d293beaf 100644
   --- a/arch/x86/lib/getuser.S
   +++ b/arch/x86/lib/getuser.S
   @@ -38,8 +38,11 @@
   .text
ENTRY(__get_user_1)
   mov PER_CPU_VAR(current_task), %_ASM_DX
   -   cmp TASK_addr_limit(%_ASM_DX),%_ASM_AX
   +   mov TASK_addr_limit(%_ASM_DX),%_ASM_DX
   +   cmp %_ASM_DX,%_ASM_AX
   jae bad_get_user
   +   or $0xfff,%_ASM_DX
   +   and %_ASM_DX,%_ASM_AX
   ASM_STAC
1: movzbl (%_ASM_AX),%edx
   xor %eax,%eax

(this only does the one-byte case - the 2/4/8 byte cases are exactly the same).

The above is completely untested and might have some stupid
thinko/typo, so take it purely as a "example patch" to show the
concept, rather than actually do it.

But just adding "lfence" to the existing ASM_STAC is a hell of a lot
easier, and the performance difference between that trivial patch and
the above "let's be clever with 'and'" might not be measurable.

I really have no idea how expensive lfence might actually end up being
in practice. It's possible that lfence is actually fairly cheap in
kernel code, since we tend to not have very high IPC anyway.

 Linus


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-08 Thread Linus Torvalds
On Mon, Jan 8, 2018 at 3:53 PM, Dan Williams  wrote:
>
> I've been thinking the "and" is only suitable for the array bounds
> check, for get_user() we're trying to block speculation past
> access_ok() at which point we can only do the lfence?

Well, we *could* do the "and", at least for the simple cases (ie the
true "get_user()" that integrates the access_ok with the access).

IOW, mainly the code in arch/x86/lib/getuser.S.

But it probably is a lot simpler to just add the "lfence" to ASM_STAC,
because by definition those cases don't tend to be the truly critical
ones - people who use those functions tend to do one or two accesses,
and the real cost is likely the I$ misses and the D$ miss to get
current->addr_limit. Not to mention the "stac" itself, which is much
more expensive than the access on current microarchitectures.

But something like this *might* work:

   index c97d935a29e8..7fa3d293beaf 100644
   --- a/arch/x86/lib/getuser.S
   +++ b/arch/x86/lib/getuser.S
   @@ -38,8 +38,11 @@
   .text
ENTRY(__get_user_1)
   mov PER_CPU_VAR(current_task), %_ASM_DX
   -   cmp TASK_addr_limit(%_ASM_DX),%_ASM_AX
   +   mov TASK_addr_limit(%_ASM_DX),%_ASM_DX
   +   cmp %_ASM_DX,%_ASM_AX
   jae bad_get_user
   +   or $0xfff,%_ASM_DX
   +   and %_ASM_DX,%_ASM_AX
   ASM_STAC
1: movzbl (%_ASM_AX),%edx
   xor %eax,%eax

(this only does the one-byte case - the 2/4/8 byte cases are exactly the same).

The above is completely untested and might have some stupid
thinko/typo, so take it purely as a "example patch" to show the
concept, rather than actually do it.

But just adding "lfence" to the existing ASM_STAC is a hell of a lot
easier, and the performance difference between that trivial patch and
the above "let's be clever with 'and'" might not be measurable.

I really have no idea how expensive lfence might actually end up being
in practice. It's possible that lfence is actually fairly cheap in
kernel code, since we tend to not have very high IPC anyway.

 Linus


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-08 Thread Dan Williams
On Mon, Jan 8, 2018 at 3:44 PM, Linus Torvalds
 wrote:
> On Mon, Jan 8, 2018 at 1:09 PM, Dan Williams  wrote:
>> On Sat, Jan 6, 2018 at 5:20 PM, Linus Torvalds
>>  wrote:
>>> On Sat, Jan 6, 2018 at 3:31 PM, Dan Williams  
>>> wrote:

 I assume if we put this in uaccess_begin() we also need audit for
 paths that use access_ok but don't do on to call uaccess_begin()? A
 quick glance shows a few places where we are open coding the stac().
 Perhaps land the lfence in stac() directly?
>>>
>>> Yeah, we should put it in uaccess_begin(), and in the actual user
>>> accessor helpers that do stac. Some of them probably should be changed
>>> to use uaccess_begin() instead while at it.
>>>
>>> One question for the CPU people: do we actually care and need to do
>>> this for things that might *write* to something? The speculative write
>>> obviously is killed, but does it perhaps bring in a cacheline even
>>> when killed?
>>
>> As far as I understand a write could trigger a request-for-ownership
>> read for the target cacheline.
>
> Oh, absolutely.
>
> I just wonder at what point that happens.
>
> Honestly, trying to get exclusive access to a cacheline can be _very_
> expensive (not just for the local thread), so I would actually expect
> that doing so for speculative writes is actually bad for performance.
>
> That's doubly true because - unlike reads - there is no critical
> latency issue, so trying to get the cache access started as early as
> possible simply isn't all that important.
>
> So I suspect that a write won't actually try to allocate the cacheline
> until the write has actually retired.
>
> End result: writes - unlike reads - *probably* will not speculatively
> perturb the cache with speculative write addresses.
>
>> Even though writes can trigger reads, as far as I can see the write
>> needs to be dependent on the first out-of-bounds read
>
> Yeah. A write on its own wouldn't matter, even if it were to perturb
> the cache state, because the address already comes from user space, so
> there's no new information in the cache perturbation for the attacker.
>
> But that all implies that we shouldn't need the lfence for the
> "put_user()" case, only for the get_user() (where the value we read
> would then perhaps be used to do another access).
>
> So we want to add the lfence (or "and") to get_user(), but not
> necessarily put_user().

Yes, perhaps __uaccess_begin_get() and __uaccess_begin_put() to keep
things separate?

> Agreed?

I've been thinking the "and" is only suitable for the array bounds
check, for get_user() we're trying to block speculation past
access_ok() at which point we can only do the lfence?


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-08 Thread Dan Williams
On Mon, Jan 8, 2018 at 3:44 PM, Linus Torvalds
 wrote:
> On Mon, Jan 8, 2018 at 1:09 PM, Dan Williams  wrote:
>> On Sat, Jan 6, 2018 at 5:20 PM, Linus Torvalds
>>  wrote:
>>> On Sat, Jan 6, 2018 at 3:31 PM, Dan Williams  
>>> wrote:

 I assume if we put this in uaccess_begin() we also need audit for
 paths that use access_ok but don't do on to call uaccess_begin()? A
 quick glance shows a few places where we are open coding the stac().
 Perhaps land the lfence in stac() directly?
>>>
>>> Yeah, we should put it in uaccess_begin(), and in the actual user
>>> accessor helpers that do stac. Some of them probably should be changed
>>> to use uaccess_begin() instead while at it.
>>>
>>> One question for the CPU people: do we actually care and need to do
>>> this for things that might *write* to something? The speculative write
>>> obviously is killed, but does it perhaps bring in a cacheline even
>>> when killed?
>>
>> As far as I understand a write could trigger a request-for-ownership
>> read for the target cacheline.
>
> Oh, absolutely.
>
> I just wonder at what point that happens.
>
> Honestly, trying to get exclusive access to a cacheline can be _very_
> expensive (not just for the local thread), so I would actually expect
> that doing so for speculative writes is actually bad for performance.
>
> That's doubly true because - unlike reads - there is no critical
> latency issue, so trying to get the cache access started as early as
> possible simply isn't all that important.
>
> So I suspect that a write won't actually try to allocate the cacheline
> until the write has actually retired.
>
> End result: writes - unlike reads - *probably* will not speculatively
> perturb the cache with speculative write addresses.
>
>> Even though writes can trigger reads, as far as I can see the write
>> needs to be dependent on the first out-of-bounds read
>
> Yeah. A write on its own wouldn't matter, even if it were to perturb
> the cache state, because the address already comes from user space, so
> there's no new information in the cache perturbation for the attacker.
>
> But that all implies that we shouldn't need the lfence for the
> "put_user()" case, only for the get_user() (where the value we read
> would then perhaps be used to do another access).
>
> So we want to add the lfence (or "and") to get_user(), but not
> necessarily put_user().

Yes, perhaps __uaccess_begin_get() and __uaccess_begin_put() to keep
things separate?

> Agreed?

I've been thinking the "and" is only suitable for the array bounds
check, for get_user() we're trying to block speculation past
access_ok() at which point we can only do the lfence?


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-08 Thread Linus Torvalds
On Mon, Jan 8, 2018 at 1:09 PM, Dan Williams  wrote:
> On Sat, Jan 6, 2018 at 5:20 PM, Linus Torvalds
>  wrote:
>> On Sat, Jan 6, 2018 at 3:31 PM, Dan Williams  
>> wrote:
>>>
>>> I assume if we put this in uaccess_begin() we also need audit for
>>> paths that use access_ok but don't do on to call uaccess_begin()? A
>>> quick glance shows a few places where we are open coding the stac().
>>> Perhaps land the lfence in stac() directly?
>>
>> Yeah, we should put it in uaccess_begin(), and in the actual user
>> accessor helpers that do stac. Some of them probably should be changed
>> to use uaccess_begin() instead while at it.
>>
>> One question for the CPU people: do we actually care and need to do
>> this for things that might *write* to something? The speculative write
>> obviously is killed, but does it perhaps bring in a cacheline even
>> when killed?
>
> As far as I understand a write could trigger a request-for-ownership
> read for the target cacheline.

Oh, absolutely.

I just wonder at what point that happens.

Honestly, trying to get exclusive access to a cacheline can be _very_
expensive (not just for the local thread), so I would actually expect
that doing so for speculative writes is actually bad for performance.

That's doubly true because - unlike reads - there is no critical
latency issue, so trying to get the cache access started as early as
possible simply isn't all that important.

So I suspect that a write won't actually try to allocate the cacheline
until the write has actually retired.

End result: writes - unlike reads - *probably* will not speculatively
perturb the cache with speculative write addresses.

> Even though writes can trigger reads, as far as I can see the write
> needs to be dependent on the first out-of-bounds read

Yeah. A write on its own wouldn't matter, even if it were to perturb
the cache state, because the address already comes from user space, so
there's no new information in the cache perturbation for the attacker.

But that all implies that we shouldn't need the lfence for the
"put_user()" case, only for the get_user() (where the value we read
would then perhaps be used to do another access).

So we want to add the lfence (or "and") to get_user(), but not
necessarily put_user().

Agreed?

  Linus


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-08 Thread Linus Torvalds
On Mon, Jan 8, 2018 at 1:09 PM, Dan Williams  wrote:
> On Sat, Jan 6, 2018 at 5:20 PM, Linus Torvalds
>  wrote:
>> On Sat, Jan 6, 2018 at 3:31 PM, Dan Williams  
>> wrote:
>>>
>>> I assume if we put this in uaccess_begin() we also need audit for
>>> paths that use access_ok but don't do on to call uaccess_begin()? A
>>> quick glance shows a few places where we are open coding the stac().
>>> Perhaps land the lfence in stac() directly?
>>
>> Yeah, we should put it in uaccess_begin(), and in the actual user
>> accessor helpers that do stac. Some of them probably should be changed
>> to use uaccess_begin() instead while at it.
>>
>> One question for the CPU people: do we actually care and need to do
>> this for things that might *write* to something? The speculative write
>> obviously is killed, but does it perhaps bring in a cacheline even
>> when killed?
>
> As far as I understand a write could trigger a request-for-ownership
> read for the target cacheline.

Oh, absolutely.

I just wonder at what point that happens.

Honestly, trying to get exclusive access to a cacheline can be _very_
expensive (not just for the local thread), so I would actually expect
that doing so for speculative writes is actually bad for performance.

That's doubly true because - unlike reads - there is no critical
latency issue, so trying to get the cache access started as early as
possible simply isn't all that important.

So I suspect that a write won't actually try to allocate the cacheline
until the write has actually retired.

End result: writes - unlike reads - *probably* will not speculatively
perturb the cache with speculative write addresses.

> Even though writes can trigger reads, as far as I can see the write
> needs to be dependent on the first out-of-bounds read

Yeah. A write on its own wouldn't matter, even if it were to perturb
the cache state, because the address already comes from user space, so
there's no new information in the cache perturbation for the attacker.

But that all implies that we shouldn't need the lfence for the
"put_user()" case, only for the get_user() (where the value we read
would then perhaps be used to do another access).

So we want to add the lfence (or "and") to get_user(), but not
necessarily put_user().

Agreed?

  Linus


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-08 Thread Dan Williams
On Sat, Jan 6, 2018 at 5:20 PM, Linus Torvalds
 wrote:
> On Sat, Jan 6, 2018 at 3:31 PM, Dan Williams  wrote:
>>
>> I assume if we put this in uaccess_begin() we also need audit for
>> paths that use access_ok but don't do on to call uaccess_begin()? A
>> quick glance shows a few places where we are open coding the stac().
>> Perhaps land the lfence in stac() directly?
>
> Yeah, we should put it in uaccess_begin(), and in the actual user
> accessor helpers that do stac. Some of them probably should be changed
> to use uaccess_begin() instead while at it.
>
> One question for the CPU people: do we actually care and need to do
> this for things that might *write* to something? The speculative write
> obviously is killed, but does it perhaps bring in a cacheline even
> when killed?

As far as I understand a write could trigger a request-for-ownership
read for the target cacheline.

> Because maybe we don't need the lfence in put_user(), only in get_user()?

Even though writes can trigger reads, as far as I can see the write
needs to be dependent on the first out-of-bounds read:

if (x < max)
y = array1[x];
put_user(array2 + y, z);

...in other words that first read should be annotated with
nospec_array_ptr() making an lfence in put_user() or other writes
moot.

yp = nospec_array_ptr(array1, x, max);
if (yp)
y = *yp;
put_user(array2 + y, z);


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-08 Thread Dan Williams
On Sat, Jan 6, 2018 at 5:20 PM, Linus Torvalds
 wrote:
> On Sat, Jan 6, 2018 at 3:31 PM, Dan Williams  wrote:
>>
>> I assume if we put this in uaccess_begin() we also need audit for
>> paths that use access_ok but don't do on to call uaccess_begin()? A
>> quick glance shows a few places where we are open coding the stac().
>> Perhaps land the lfence in stac() directly?
>
> Yeah, we should put it in uaccess_begin(), and in the actual user
> accessor helpers that do stac. Some of them probably should be changed
> to use uaccess_begin() instead while at it.
>
> One question for the CPU people: do we actually care and need to do
> this for things that might *write* to something? The speculative write
> obviously is killed, but does it perhaps bring in a cacheline even
> when killed?

As far as I understand a write could trigger a request-for-ownership
read for the target cacheline.

> Because maybe we don't need the lfence in put_user(), only in get_user()?

Even though writes can trigger reads, as far as I can see the write
needs to be dependent on the first out-of-bounds read:

if (x < max)
y = array1[x];
put_user(array2 + y, z);

...in other words that first read should be annotated with
nospec_array_ptr() making an lfence in put_user() or other writes
moot.

yp = nospec_array_ptr(array1, x, max);
if (yp)
y = *yp;
put_user(array2 + y, z);


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-08 Thread Ingo Molnar

* Peter Zijlstra  wrote:

> On Sun, Jan 07, 2018 at 06:24:11PM -0800, Alexei Starovoitov wrote:
> > How about:
> > CONFIG_SPECTRE1_WORKAROUND_INDEX_MASK
> > CONFIG_SPECTRE1_WORKAROUND_LOAD_FENCE
> 
> INSTRUCTION_FENCE if anything. LFENCE for Intel (and now also for AMD as
> per 0592b0bce169) is a misnomer, IFENCE would be a better name for it.

Btw., we should probably propagate that naming to most places and only point it 
out in comments/documentation that Intel/AMD calls this CPU functionality 
'LFENCE' 
for historic reasons and that the opcode got stolen for the Spectre fixes.

Thanks,

Ingo


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-08 Thread Ingo Molnar

* Peter Zijlstra  wrote:

> On Sun, Jan 07, 2018 at 06:24:11PM -0800, Alexei Starovoitov wrote:
> > How about:
> > CONFIG_SPECTRE1_WORKAROUND_INDEX_MASK
> > CONFIG_SPECTRE1_WORKAROUND_LOAD_FENCE
> 
> INSTRUCTION_FENCE if anything. LFENCE for Intel (and now also for AMD as
> per 0592b0bce169) is a misnomer, IFENCE would be a better name for it.

Btw., we should probably propagate that naming to most places and only point it 
out in comments/documentation that Intel/AMD calls this CPU functionality 
'LFENCE' 
for historic reasons and that the opcode got stolen for the Spectre fixes.

Thanks,

Ingo


RE: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-08 Thread David Laight
From: Alan Cox
> Sent: 08 January 2018 12:13
...
> > Try over 35% slowdown
> > Given that AWS instance runs known code (user and kernel) why do we
> > need to worry about any of these sideband attacks?
> 
> You may not need to. Amazon themselves obviously need to worry that no
> other VM steals your data (or vice versa) but above that (and with raw
> hardware appliances) if you control all the code you run then the nopti
> and other disables may be useful (At the end of the day as with anything
> else you do your own risk assessment).

I believe AWS allows VM kernels to load user-written device drivers
so the security of other VMs cannot rely on whether a VM is booted
with PTI=yes or PTI=no.

David



RE: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-08 Thread David Laight
From: Alan Cox
> Sent: 08 January 2018 12:13
...
> > Try over 35% slowdown
> > Given that AWS instance runs known code (user and kernel) why do we
> > need to worry about any of these sideband attacks?
> 
> You may not need to. Amazon themselves obviously need to worry that no
> other VM steals your data (or vice versa) but above that (and with raw
> hardware appliances) if you control all the code you run then the nopti
> and other disables may be useful (At the end of the day as with anything
> else you do your own risk assessment).

I believe AWS allows VM kernels to load user-written device drivers
so the security of other VMs cannot rely on whether a VM is booted
with PTI=yes or PTI=no.

David



Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-08 Thread Alan Cox
On Mon, 8 Jan 2018 12:00:28 +
David Laight  wrote:

> From: Linus Torvalds
> > Sent: 07 January 2018 19:47  
> ...
> > Also, people definitely *are* noticing the performance issues with the
> > current set of patches, and they are causing real problems. Go search
> > for reports of Amazon AWS slowdowns.  
> 
> Try over 35% slowdown
> Given that AWS instance runs known code (user and kernel) why do we
> need to worry about any of these sideband attacks?

You may not need to. Amazon themselves obviously need to worry that no
other VM steals your data (or vice versa) but above that (and with raw
hardware appliances) if you control all the code you run then the nopti
and other disables may be useful (At the end of the day as with anything
else you do your own risk assessment).

Do remember to consider if you are running untrusted but supposedly
sandboxed code (eg Java, js).

I'm not using pti etc on my minecraft VMs - no point. If anyone gets to
run arbitrary code on them except me then it's already compromised.

Alan


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-08 Thread Alan Cox
On Mon, 8 Jan 2018 12:00:28 +
David Laight  wrote:

> From: Linus Torvalds
> > Sent: 07 January 2018 19:47  
> ...
> > Also, people definitely *are* noticing the performance issues with the
> > current set of patches, and they are causing real problems. Go search
> > for reports of Amazon AWS slowdowns.  
> 
> Try over 35% slowdown
> Given that AWS instance runs known code (user and kernel) why do we
> need to worry about any of these sideband attacks?

You may not need to. Amazon themselves obviously need to worry that no
other VM steals your data (or vice versa) but above that (and with raw
hardware appliances) if you control all the code you run then the nopti
and other disables may be useful (At the end of the day as with anything
else you do your own risk assessment).

Do remember to consider if you are running untrusted but supposedly
sandboxed code (eg Java, js).

I'm not using pti etc on my minecraft VMs - no point. If anyone gets to
run arbitrary code on them except me then it's already compromised.

Alan


RE: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-08 Thread David Laight
From: Linus Torvalds
> Sent: 07 January 2018 19:47
...
> Also, people definitely *are* noticing the performance issues with the
> current set of patches, and they are causing real problems. Go search
> for reports of Amazon AWS slowdowns.

Try over 35% slowdown
Given that AWS instance runs known code (user and kernel) why do we
need to worry about any of these sideband attacks?

David



RE: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-08 Thread David Laight
From: Linus Torvalds
> Sent: 07 January 2018 19:47
...
> Also, people definitely *are* noticing the performance issues with the
> current set of patches, and they are causing real problems. Go search
> for reports of Amazon AWS slowdowns.

Try over 35% slowdown
Given that AWS instance runs known code (user and kernel) why do we
need to worry about any of these sideband attacks?

David



Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-08 Thread Andrea Arcangeli
On Sat, Jan 06, 2018 at 08:41:34PM +0100, Thomas Gleixner wrote:
> optimized argumentation. We need to make sure that we have a solution which
> kills the problem safely and then take it from there. Correctness first,
> optimization later is the rule for this. Better safe than sorry.

Agreed, assuming the objective here is to achieve a complete spectre
fix fast.

Also note there's a whole set of stuff to do in addition of IBRS:
IBPB, stuff_RSB() and the register hygiene in kernel entry points and
vmexists, that alters the whole syscall stackframe to be able to clear
callee saved registers.

That register hygiene was one of the most tedious pieces to get right
along with the PTI "rep movsb" (no C) stack trampoline that never
calls into C code with zero stack available because it's very bad to
do so, consdering C is free to use some stack for register spillage.

I suggest to discuss how important register hygiene is on top of IBRS,
IBPB and stuff_RSB() to fix spectre, not future optimizations that
only matter for old CPUS and are irrelevant for future silicon.

I also suggest to discuss how to automate the other parts of variant#1
lfence/mfence across the bound checks, depending on arch with a open
source scanner, or if to pretend developers think about it like we
think about mb() (except no regression test will ever notice a bounds
check speculation memory barrier missing).

Reptolines alone are leaving a whole set of stuff unfixed: register
hygiene still missing, bios/firmware calls still require ibrs, all asm
has to be audited by hand as there's no sure asm scanner I know of
(grep can go somewhere though) and the gcc dependency isn't very
flexible to begin with, and they don't help with lfence/mfence across
bound checks, they still require IBPB and stuff_RSB() to avoid
guest/user mode against guest/user spectre variant#2 attacks.

I don't see why we should talk about pure performance optimization at
this point instead of focusing on the above.

Not to tell if you want to guarantee mathematically that guest
userland cannot read the guest kernel memory by starting a spectre
variant#2 attack from guest userland to host userland (David Gilbert's
new attack discovery). For that you'll have to set ibrs_enabled 2
ibpb_enabled 1 mode or ibrs_enabled 0 ibpb_enabled 2 mode in the host
kernel or alternatively ibrs_enabled 0 ibpb_enabled 2 in the guest
kernel.

ibrs 2 bpbp 1 will prevent qemu userland to use the IBP so guest
userland cannot probe it. ibrs 0 ibpb 2 will flush the IBP state at
vmexit so qemu userland won't be affected by it. ibrs 0 ibpb 2 in
guest will flush the IBP state at kernel entry so guest userland won't
be able to affect anything.

Of course such an attack from guest user -> guest kernel -> host
kernel -> host user -> host kernel -> guest kernel -> guest user and
probing IBP (RSB is fixed for good with unconditional stuff_RSB in
vmexit even when SMEP is set, precisely because SMEP won't stop guest
ring 3 to probe host ring 3 RSB and same for ring 0) is far fetched,
but reptolines alone cannot solve it unless you also build qemu
userland with reptolines (which then means the whole userland has to
be built with reptolines because the qemu dependency chain is endless,
includes glibc etc..).

As a reminder (for lkml): if you use KVM, spectre variant#2 is the
only attack that can affect guest/host memory isolation. spectre
variant#1 and meltdown (aka variant#3) always have been impossible
through KVM guest/user isolation. spectre variant#2 is the one that is
harder to fix and it's the most theoretical of them all and it may be
impossible to mount as an attack depending on host kernel code that
has to play against itself to achieve it. The setup for such an attack
is very tedious, takes half an hour or several hours depending on the
amounts of memory and you may have to know already accurately the
kernel that is running on the host. As opposed to spectre variant#1
and meltdown (aka variant#3), it's very unlikely anybody gets attacked
through spectre variant#2. It's also the side channel with the lowest
amount of kbytes/sec of bandwidth if mounted successfully in the first
place. However if it can be mounted successfully it becomes almost a
concern as the other two variants, which is why it needs fixing too.

Thanks,
Andrea


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-08 Thread Andrea Arcangeli
On Sat, Jan 06, 2018 at 08:41:34PM +0100, Thomas Gleixner wrote:
> optimized argumentation. We need to make sure that we have a solution which
> kills the problem safely and then take it from there. Correctness first,
> optimization later is the rule for this. Better safe than sorry.

Agreed, assuming the objective here is to achieve a complete spectre
fix fast.

Also note there's a whole set of stuff to do in addition of IBRS:
IBPB, stuff_RSB() and the register hygiene in kernel entry points and
vmexists, that alters the whole syscall stackframe to be able to clear
callee saved registers.

That register hygiene was one of the most tedious pieces to get right
along with the PTI "rep movsb" (no C) stack trampoline that never
calls into C code with zero stack available because it's very bad to
do so, consdering C is free to use some stack for register spillage.

I suggest to discuss how important register hygiene is on top of IBRS,
IBPB and stuff_RSB() to fix spectre, not future optimizations that
only matter for old CPUS and are irrelevant for future silicon.

I also suggest to discuss how to automate the other parts of variant#1
lfence/mfence across the bound checks, depending on arch with a open
source scanner, or if to pretend developers think about it like we
think about mb() (except no regression test will ever notice a bounds
check speculation memory barrier missing).

Reptolines alone are leaving a whole set of stuff unfixed: register
hygiene still missing, bios/firmware calls still require ibrs, all asm
has to be audited by hand as there's no sure asm scanner I know of
(grep can go somewhere though) and the gcc dependency isn't very
flexible to begin with, and they don't help with lfence/mfence across
bound checks, they still require IBPB and stuff_RSB() to avoid
guest/user mode against guest/user spectre variant#2 attacks.

I don't see why we should talk about pure performance optimization at
this point instead of focusing on the above.

Not to tell if you want to guarantee mathematically that guest
userland cannot read the guest kernel memory by starting a spectre
variant#2 attack from guest userland to host userland (David Gilbert's
new attack discovery). For that you'll have to set ibrs_enabled 2
ibpb_enabled 1 mode or ibrs_enabled 0 ibpb_enabled 2 mode in the host
kernel or alternatively ibrs_enabled 0 ibpb_enabled 2 in the guest
kernel.

ibrs 2 bpbp 1 will prevent qemu userland to use the IBP so guest
userland cannot probe it. ibrs 0 ibpb 2 will flush the IBP state at
vmexit so qemu userland won't be affected by it. ibrs 0 ibpb 2 in
guest will flush the IBP state at kernel entry so guest userland won't
be able to affect anything.

Of course such an attack from guest user -> guest kernel -> host
kernel -> host user -> host kernel -> guest kernel -> guest user and
probing IBP (RSB is fixed for good with unconditional stuff_RSB in
vmexit even when SMEP is set, precisely because SMEP won't stop guest
ring 3 to probe host ring 3 RSB and same for ring 0) is far fetched,
but reptolines alone cannot solve it unless you also build qemu
userland with reptolines (which then means the whole userland has to
be built with reptolines because the qemu dependency chain is endless,
includes glibc etc..).

As a reminder (for lkml): if you use KVM, spectre variant#2 is the
only attack that can affect guest/host memory isolation. spectre
variant#1 and meltdown (aka variant#3) always have been impossible
through KVM guest/user isolation. spectre variant#2 is the one that is
harder to fix and it's the most theoretical of them all and it may be
impossible to mount as an attack depending on host kernel code that
has to play against itself to achieve it. The setup for such an attack
is very tedious, takes half an hour or several hours depending on the
amounts of memory and you may have to know already accurately the
kernel that is running on the host. As opposed to spectre variant#1
and meltdown (aka variant#3), it's very unlikely anybody gets attacked
through spectre variant#2. It's also the side channel with the lowest
amount of kbytes/sec of bandwidth if mounted successfully in the first
place. However if it can be mounted successfully it becomes almost a
concern as the other two variants, which is why it needs fixing too.

Thanks,
Andrea


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-08 Thread Peter Zijlstra
On Sun, Jan 07, 2018 at 06:57:35PM -0800, Alexei Starovoitov wrote:
> On Sun, Jan 07, 2018 at 01:59:35PM +, Alan Cox wrote:

> > lfence timing is also heavily dependent upon what work has to be done to
> > retire previous live instructions. 
> > BPF does not normally do a lot of writing so you'd expect the cost to be 
> > low.
> 
> right. to retire previous loads. Not sure what 'not a lot of writing'
> has to do with lfence.

LFENCE will wait for completion on _ALL_ prior instructions, not just
loads.

Stores are by far the most expensive instructions to wait for, as they
only complete once their value is globally visible (on x86).


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-08 Thread Peter Zijlstra
On Sun, Jan 07, 2018 at 06:57:35PM -0800, Alexei Starovoitov wrote:
> On Sun, Jan 07, 2018 at 01:59:35PM +, Alan Cox wrote:

> > lfence timing is also heavily dependent upon what work has to be done to
> > retire previous live instructions. 
> > BPF does not normally do a lot of writing so you'd expect the cost to be 
> > low.
> 
> right. to retire previous loads. Not sure what 'not a lot of writing'
> has to do with lfence.

LFENCE will wait for completion on _ALL_ prior instructions, not just
loads.

Stores are by far the most expensive instructions to wait for, as they
only complete once their value is globally visible (on x86).


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-08 Thread Peter Zijlstra
On Sun, Jan 07, 2018 at 06:24:11PM -0800, Alexei Starovoitov wrote:
> How about:
> CONFIG_SPECTRE1_WORKAROUND_INDEX_MASK
> CONFIG_SPECTRE1_WORKAROUND_LOAD_FENCE

INSTRUCTION_FENCE if anything. LFENCE for Intel (and now also for AMD as
per 0592b0bce169) is a misnomer, IFENCE would be a better name for it.


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-08 Thread Peter Zijlstra
On Sun, Jan 07, 2018 at 06:24:11PM -0800, Alexei Starovoitov wrote:
> How about:
> CONFIG_SPECTRE1_WORKAROUND_INDEX_MASK
> CONFIG_SPECTRE1_WORKAROUND_LOAD_FENCE

INSTRUCTION_FENCE if anything. LFENCE for Intel (and now also for AMD as
per 0592b0bce169) is a misnomer, IFENCE would be a better name for it.


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Greg KH
On Sun, Jan 07, 2018 at 09:23:47PM -0500, David Miller wrote:
> From: Thomas Gleixner 
> Date: Sun, 7 Jan 2018 21:56:39 +0100 (CET)
> 
> > I surely agree, but we have gone the way of PTI without the ability of
> > exempting individual processes exactly for one reason:
> > 
> >   Lack of time
> > 
> > It can be done on top of the PTI implementation and it won't take ages.
> > 
> > For spectre_v1/2 we face the same problem simply because we got informed so
> > much ahead of time and we were all twiddling thumbs, enjoying our christmas
> > vacation and having a good time.
> 
> I just want to point out that this should be noted in history as a
> case where all of this controlled disclosure stuff seems to have made
> things worse rather than better.

I will note that the "controlled disclosure" for this thing was a total
and complete mess, and unlike any that I have ever seen in the past.
The people involved in running it had no idea how to do it at all, and
because of that, it failed miserably, despite being warned about it
numerous times by numerous people.

> Why is there so much haste and paranoia if supposedly some group of
> people had all this extra time to think about and deal with this bug?

Because that group was so small and isolated that they did not actually
talk to anyone who could actually provide input to help deal with the
bug.

So we are stuck now with dealing with this "properly", which is fine,
but please don't think that this is an excuse to blame "controlled
disclosure".  We know how to do that correctly, it did not happen in
this case at all because of the people driving the problem refused to do
it.

> Think I'm nuts?  Ok, then how did we fare any better by keeping this
> junk under wraps for weeks if not months?  (seriously, did responsible
> people really know about this as far back as... June 2017?)

Some "people" did, just not some "responsible people" :)

Oh well, time for the kernel to fix hardware bugs again, that's what we
are here for, you would think we would be used to it by now...

greg k-h


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Greg KH
On Sun, Jan 07, 2018 at 09:23:47PM -0500, David Miller wrote:
> From: Thomas Gleixner 
> Date: Sun, 7 Jan 2018 21:56:39 +0100 (CET)
> 
> > I surely agree, but we have gone the way of PTI without the ability of
> > exempting individual processes exactly for one reason:
> > 
> >   Lack of time
> > 
> > It can be done on top of the PTI implementation and it won't take ages.
> > 
> > For spectre_v1/2 we face the same problem simply because we got informed so
> > much ahead of time and we were all twiddling thumbs, enjoying our christmas
> > vacation and having a good time.
> 
> I just want to point out that this should be noted in history as a
> case where all of this controlled disclosure stuff seems to have made
> things worse rather than better.

I will note that the "controlled disclosure" for this thing was a total
and complete mess, and unlike any that I have ever seen in the past.
The people involved in running it had no idea how to do it at all, and
because of that, it failed miserably, despite being warned about it
numerous times by numerous people.

> Why is there so much haste and paranoia if supposedly some group of
> people had all this extra time to think about and deal with this bug?

Because that group was so small and isolated that they did not actually
talk to anyone who could actually provide input to help deal with the
bug.

So we are stuck now with dealing with this "properly", which is fine,
but please don't think that this is an excuse to blame "controlled
disclosure".  We know how to do that correctly, it did not happen in
this case at all because of the people driving the problem refused to do
it.

> Think I'm nuts?  Ok, then how did we fare any better by keeping this
> junk under wraps for weeks if not months?  (seriously, did responsible
> people really know about this as far back as... June 2017?)

Some "people" did, just not some "responsible people" :)

Oh well, time for the kernel to fix hardware bugs again, that's what we
are here for, you would think we would be used to it by now...

greg k-h


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Alexei Starovoitov
On Sun, Jan 07, 2018 at 01:59:35PM +, Alan Cox wrote:
> > which means that POC is relying 64-bit address speculation.
> > In the places coverity found the user supplied value is 32-bit,
> 
> People have 32bit computers as well as 64bit and in some cases 32bit is
> fine for an attack depending where your target is relative to the object.

right. absolutely correct on 32-bit archs.
The question whether truncation to 32-bit is enough to workaround
spectre1 on 64-bit archs.
I hope the researchers can clarify.

> lfence timing is also heavily dependent upon what work has to be done to
> retire previous live instructions. 
> BPF does not normally do a lot of writing so you'd expect the cost to be low.

right. to retire previous loads. Not sure what 'not a lot of writing'
has to do with lfence.

Our XDP based DDOS mostly does reads with few writes for stats into maps,
whereas XDP based load balancer modifies every packet.
XDP is root only, so not relevant in the spectre context. Just clarifying
read vs writes in BPF.



Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Alexei Starovoitov
On Sun, Jan 07, 2018 at 01:59:35PM +, Alan Cox wrote:
> > which means that POC is relying 64-bit address speculation.
> > In the places coverity found the user supplied value is 32-bit,
> 
> People have 32bit computers as well as 64bit and in some cases 32bit is
> fine for an attack depending where your target is relative to the object.

right. absolutely correct on 32-bit archs.
The question whether truncation to 32-bit is enough to workaround
spectre1 on 64-bit archs.
I hope the researchers can clarify.

> lfence timing is also heavily dependent upon what work has to be done to
> retire previous live instructions. 
> BPF does not normally do a lot of writing so you'd expect the cost to be low.

right. to retire previous loads. Not sure what 'not a lot of writing'
has to do with lfence.

Our XDP based DDOS mostly does reads with few writes for stats into maps,
whereas XDP based load balancer modifies every packet.
XDP is root only, so not relevant in the spectre context. Just clarifying
read vs writes in BPF.



Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Alexei Starovoitov
On Sun, Jan 07, 2018 at 12:15:40PM -0800, Dan Williams wrote:
> 
> I'm thinking we should provide the option to at least build the
> hot-path nospec_array_ptr() usages without an lfence.
> 
> CONFIG_SPECTRE1_PARANOIA_SAFE
> CONFIG_SPECTRE1_PARANOIA_PERF

SAFE vs PERF naming is problematic and misleading, since users don't
have the data to make a decision they will be forced to go with SAFE.

What is not safe about array_access() macro with AND ?
How lfence approach makes it safer ?
Only because lfence was blessed by intel earlier when
they couldn't figure out a different way?

How about:
CONFIG_SPECTRE1_WORKAROUND_INDEX_MASK
CONFIG_SPECTRE1_WORKAROUND_LOAD_FENCE

> ...if only for easing performance testing and let the distribution set
> its policy.
> 
> Where hot-path usages can do:
> 
> nospec_relax(nospec_array_ptr())

AND approach doesn't prevent speculation hence nospec_ is an incorrect prefix.
Alan's "speculation management" terminology fits well here.

Can we keep array_access() name and change it underneath to
either mask or lfence ?



Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Alexei Starovoitov
On Sun, Jan 07, 2018 at 12:15:40PM -0800, Dan Williams wrote:
> 
> I'm thinking we should provide the option to at least build the
> hot-path nospec_array_ptr() usages without an lfence.
> 
> CONFIG_SPECTRE1_PARANOIA_SAFE
> CONFIG_SPECTRE1_PARANOIA_PERF

SAFE vs PERF naming is problematic and misleading, since users don't
have the data to make a decision they will be forced to go with SAFE.

What is not safe about array_access() macro with AND ?
How lfence approach makes it safer ?
Only because lfence was blessed by intel earlier when
they couldn't figure out a different way?

How about:
CONFIG_SPECTRE1_WORKAROUND_INDEX_MASK
CONFIG_SPECTRE1_WORKAROUND_LOAD_FENCE

> ...if only for easing performance testing and let the distribution set
> its policy.
> 
> Where hot-path usages can do:
> 
> nospec_relax(nospec_array_ptr())

AND approach doesn't prevent speculation hence nospec_ is an incorrect prefix.
Alan's "speculation management" terminology fits well here.

Can we keep array_access() name and change it underneath to
either mask or lfence ?



Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread David Miller
From: Thomas Gleixner 
Date: Sun, 7 Jan 2018 21:56:39 +0100 (CET)

> I surely agree, but we have gone the way of PTI without the ability of
> exempting individual processes exactly for one reason:
> 
>   Lack of time
> 
> It can be done on top of the PTI implementation and it won't take ages.
> 
> For spectre_v1/2 we face the same problem simply because we got informed so
> much ahead of time and we were all twiddling thumbs, enjoying our christmas
> vacation and having a good time.

I just want to point out that this should be noted in history as a
case where all of this controlled disclosure stuff seems to have made
things worse rather than better.

Why is there so much haste and paranoia if supposedly some group of
people had all this extra time to think about and deal with this bug?

>From what I've seen, every single time, the worse a problem is, the
more important it is to expose it to as many smart folks as possible.
And to do so as fast as possible.

And to me that means full disclosure immediately for the super high
level stuff like what we are dealing with here.

Think I'm nuts?  Ok, then how did we fare any better by keeping this
junk under wraps for weeks if not months?  (seriously, did responsible
people really know about this as far back as... June 2017?)

Controlled disclosure for high propfile bugs seems to only achieve two
things:

1) Vendors can cover their butts and come up with deflection
   strategies.

2) The "theatre" aspect of security can be maximized as much as
   possible.  We even have a pretty web site and cute avatars this
   time!

None of this has anything to do with having time to come up with the
best possible implementation of a fix.  You know, the technical part?

So after what appears to be as much as 6 months of deliberating the
very wise men in the special room said: "KPTI and lfence"

Do I get this right?


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread David Miller
From: Thomas Gleixner 
Date: Sun, 7 Jan 2018 21:56:39 +0100 (CET)

> I surely agree, but we have gone the way of PTI without the ability of
> exempting individual processes exactly for one reason:
> 
>   Lack of time
> 
> It can be done on top of the PTI implementation and it won't take ages.
> 
> For spectre_v1/2 we face the same problem simply because we got informed so
> much ahead of time and we were all twiddling thumbs, enjoying our christmas
> vacation and having a good time.

I just want to point out that this should be noted in history as a
case where all of this controlled disclosure stuff seems to have made
things worse rather than better.

Why is there so much haste and paranoia if supposedly some group of
people had all this extra time to think about and deal with this bug?

>From what I've seen, every single time, the worse a problem is, the
more important it is to expose it to as many smart folks as possible.
And to do so as fast as possible.

And to me that means full disclosure immediately for the super high
level stuff like what we are dealing with here.

Think I'm nuts?  Ok, then how did we fare any better by keeping this
junk under wraps for weeks if not months?  (seriously, did responsible
people really know about this as far back as... June 2017?)

Controlled disclosure for high propfile bugs seems to only achieve two
things:

1) Vendors can cover their butts and come up with deflection
   strategies.

2) The "theatre" aspect of security can be maximized as much as
   possible.  We even have a pretty web site and cute avatars this
   time!

None of this has anything to do with having time to come up with the
best possible implementation of a fix.  You know, the technical part?

So after what appears to be as much as 6 months of deliberating the
very wise men in the special room said: "KPTI and lfence"

Do I get this right?


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Alexei Starovoitov
On Sun, Jan 07, 2018 at 11:08:24AM +0100, Thomas Gleixner wrote:
> On Sat, 6 Jan 2018, Alexei Starovoitov wrote:
> > which clearly states that bpf_tail_call() was used in the attack.
> > Yet none of the intel nor arm patches address speculation in
> > this bpf helper!
> > It means that:
> > - gpz didn't share neither exploit nor the detailed description
> >   of the POC with cpu vendors until now
> > - coverity rules used to find all these places in the kernel
> >   failed to find bpf_tail_call
> > - cpu vendors were speculating what variant 1 can actually do
> 
> You forgot to mention that there might be other attacks than the public POC
> which are not covered by a simple AND 

if above statement is not a pure speculation please share CVE number.
For varaint[123] they've been reserved for months.

> For spectre_v1/2 we face the same problem simply because we got informed so
> much ahead of time and we were all twiddling thumbs, enjoying our christmas
> vacation and having a good time.

right. they were informed in a way that they failed to address
variant1 with pre-embargo and public patches.

> The exploits are out in the wild and they are observed already, so we

this statement contradicts with everything that was publicly stated.
Or you're referring to 'exploit' at the end of spectre paper?

> want to discuss the right way to fix it for the next 3 month and leave all
> doors open until the last bit of performance is squeezed out.

Let's look at facts:
- Linus explains his array_access() idea
- lfence proponents quickly point out that it needs gcc to be smart
  enough to emit setbe and go back to lfence patches
- I spent half an hour to tweak array_access() to be generic
  on all archs and compiler indepedent
- lfence proponets point out that AND with a variable somehow
  won't work, yet after further discussion it's actually fine due to
  the nature of variant1 attack that needs to train predictor with in-bounds
  access to mispredict with out-of-bounds speculative load
- then lfence proponets claim 'non public POC' not covered by AND
- it all in the matter of 2 days.
- and now the argument that it will take 3 month to discuss a solution
  without lfence, yet still no performance numbers for lfence
  though 'people in the know' were twiddling thumbs for months.

That's just not cool.



Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Alexei Starovoitov
On Sun, Jan 07, 2018 at 11:08:24AM +0100, Thomas Gleixner wrote:
> On Sat, 6 Jan 2018, Alexei Starovoitov wrote:
> > which clearly states that bpf_tail_call() was used in the attack.
> > Yet none of the intel nor arm patches address speculation in
> > this bpf helper!
> > It means that:
> > - gpz didn't share neither exploit nor the detailed description
> >   of the POC with cpu vendors until now
> > - coverity rules used to find all these places in the kernel
> >   failed to find bpf_tail_call
> > - cpu vendors were speculating what variant 1 can actually do
> 
> You forgot to mention that there might be other attacks than the public POC
> which are not covered by a simple AND 

if above statement is not a pure speculation please share CVE number.
For varaint[123] they've been reserved for months.

> For spectre_v1/2 we face the same problem simply because we got informed so
> much ahead of time and we were all twiddling thumbs, enjoying our christmas
> vacation and having a good time.

right. they were informed in a way that they failed to address
variant1 with pre-embargo and public patches.

> The exploits are out in the wild and they are observed already, so we

this statement contradicts with everything that was publicly stated.
Or you're referring to 'exploit' at the end of spectre paper?

> want to discuss the right way to fix it for the next 3 month and leave all
> doors open until the last bit of performance is squeezed out.

Let's look at facts:
- Linus explains his array_access() idea
- lfence proponents quickly point out that it needs gcc to be smart
  enough to emit setbe and go back to lfence patches
- I spent half an hour to tweak array_access() to be generic
  on all archs and compiler indepedent
- lfence proponets point out that AND with a variable somehow
  won't work, yet after further discussion it's actually fine due to
  the nature of variant1 attack that needs to train predictor with in-bounds
  access to mispredict with out-of-bounds speculative load
- then lfence proponets claim 'non public POC' not covered by AND
- it all in the matter of 2 days.
- and now the argument that it will take 3 month to discuss a solution
  without lfence, yet still no performance numbers for lfence
  though 'people in the know' were twiddling thumbs for months.

That's just not cool.



Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread David Miller
From: Thomas Gleixner 
Date: Sun, 7 Jan 2018 19:31:41 +0100 (CET)

> 2) Alexei's analyis is purely based on the public information of the google
>zero folks. If it would be complete and the only attack vector all fine.
> 
>If not and I doubt it is, we're going to regret this decision faster
>than we made it and this is not the kind of play field where we can
>afford that.

Please state this more clearly.

Do you know about other attack vectors and just are not allowed to
talk about them?

Or is this, ironically, speculation?


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread David Miller
From: Thomas Gleixner 
Date: Sun, 7 Jan 2018 19:31:41 +0100 (CET)

> 2) Alexei's analyis is purely based on the public information of the google
>zero folks. If it would be complete and the only attack vector all fine.
> 
>If not and I doubt it is, we're going to regret this decision faster
>than we made it and this is not the kind of play field where we can
>afford that.

Please state this more clearly.

Do you know about other attack vectors and just are not allowed to
talk about them?

Or is this, ironically, speculation?


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Willy Tarreau
On Sun, Jan 07, 2018 at 12:17:11PM -0800, Linus Torvalds wrote:
> We need to fix the security problem, but we need to do it *without*
> these braindead arguments that performance is somehow secondary.

OK OK. At least we should have security by default and let people trade
it against performance if they want and understand the risk. People
never know when they're secure enough (security doesn't measure) but
they know fairly well when they're not performant enough to take action
(most often changing the machine).

Willy


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Willy Tarreau
On Sun, Jan 07, 2018 at 12:17:11PM -0800, Linus Torvalds wrote:
> We need to fix the security problem, but we need to do it *without*
> these braindead arguments that performance is somehow secondary.

OK OK. At least we should have security by default and let people trade
it against performance if they want and understand the risk. People
never know when they're secure enough (security doesn't measure) but
they know fairly well when they're not performant enough to take action
(most often changing the machine).

Willy


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Thomas Gleixner
On Sun, 7 Jan 2018, Linus Torvalds wrote:
> We need to fix the security problem, but we need to do it *without*
> these braindead arguments that performance is somehow secondary.

I surely agree, but we have gone the way of PTI without the ability of
exempting individual processes exactly for one reason:

  Lack of time

It can be done on top of the PTI implementation and it won't take ages.

For spectre_v1/2 we face the same problem simply because we got informed so
much ahead of time and we were all twiddling thumbs, enjoying our christmas
vacation and having a good time.

The exploits are out in the wild and they are observed already, so we
really have to make a decision whether we want to fix that in the most
obvious ways even if it hurts performance right now and then take a break
from all that hell and sit down and sort the performance mess or whether we
want to discuss the right way to fix it for the next 3 month and leave all
doors open until the last bit of performance is squeezed out.

I totally can understand the anger of those who learned all of this a few
days ago and are now grasping straws to avoid the obviously coming
performance hit, but its not our fault that we have to make a decision
which cannot make everyone happy tomorrow.

If the general sentiment is that we have to fix the performance issue
first, then please let me know and I take 3 weeks of vacation right now.

Thanks,

tglx


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Thomas Gleixner
On Sun, 7 Jan 2018, Linus Torvalds wrote:
> We need to fix the security problem, but we need to do it *without*
> these braindead arguments that performance is somehow secondary.

I surely agree, but we have gone the way of PTI without the ability of
exempting individual processes exactly for one reason:

  Lack of time

It can be done on top of the PTI implementation and it won't take ages.

For spectre_v1/2 we face the same problem simply because we got informed so
much ahead of time and we were all twiddling thumbs, enjoying our christmas
vacation and having a good time.

The exploits are out in the wild and they are observed already, so we
really have to make a decision whether we want to fix that in the most
obvious ways even if it hurts performance right now and then take a break
from all that hell and sit down and sort the performance mess or whether we
want to discuss the right way to fix it for the next 3 month and leave all
doors open until the last bit of performance is squeezed out.

I totally can understand the anger of those who learned all of this a few
days ago and are now grasping straws to avoid the obviously coming
performance hit, but its not our fault that we have to make a decision
which cannot make everyone happy tomorrow.

If the general sentiment is that we have to fix the performance issue
first, then please let me know and I take 3 weeks of vacation right now.

Thanks,

tglx


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Linus Torvalds
On Sun, Jan 7, 2018 at 12:12 PM, Willy Tarreau  wrote:
>
> Linus, no need to explain that to me, I'm precisely trying to see how
> to disable PTI for a specific process because I face up to 45% loss in
> certain circumstances, making it a no-go. But while a few of us have
> very specific workloads emphasizing this impact, others have very
> different ones and will not notice. For example my laptop did boot
> pretty fine and I didn't notice anything until I fire a network
> benchmark.

Sure, most people have hardware where the bottleneck is entirely
elsewhere (slow network, rotating disk, whatever).

But this whole "normal people won't notice" is dangerous thinking.
They may well notice very much, we simply don't know what they are
doing.

Quite honesty, it's equally correct to say "normal people won't be
affected by the security issue in the first place".

That laptop that you didn't have any issues with? Likely it never had
an exploit running on it either!

So the whole "normal people" argument is pure and utter garbage. It's
wrong. It's pure shit when it comes to performance, but it's also pure
shit when it comes to the security issue.

Don't use it.

We need to fix the security problem, but we need to do it *without*
these braindead arguments that performance is somehow secondary.

   Linus


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Linus Torvalds
On Sun, Jan 7, 2018 at 12:12 PM, Willy Tarreau  wrote:
>
> Linus, no need to explain that to me, I'm precisely trying to see how
> to disable PTI for a specific process because I face up to 45% loss in
> certain circumstances, making it a no-go. But while a few of us have
> very specific workloads emphasizing this impact, others have very
> different ones and will not notice. For example my laptop did boot
> pretty fine and I didn't notice anything until I fire a network
> benchmark.

Sure, most people have hardware where the bottleneck is entirely
elsewhere (slow network, rotating disk, whatever).

But this whole "normal people won't notice" is dangerous thinking.
They may well notice very much, we simply don't know what they are
doing.

Quite honesty, it's equally correct to say "normal people won't be
affected by the security issue in the first place".

That laptop that you didn't have any issues with? Likely it never had
an exploit running on it either!

So the whole "normal people" argument is pure and utter garbage. It's
wrong. It's pure shit when it comes to performance, but it's also pure
shit when it comes to the security issue.

Don't use it.

We need to fix the security problem, but we need to do it *without*
these braindead arguments that performance is somehow secondary.

   Linus


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Dan Williams
On Sun, Jan 7, 2018 at 11:47 AM, Linus Torvalds
 wrote:
> On Sat, Jan 6, 2018 at 10:33 PM, Willy Tarreau  wrote:
>>
>> To be fair there's overreaction on both sides. The vast majority of
>> users need to get a 100% safe system and will never notice  any
>> difference.
>
> There is no such thing as a "100% safe system". Never will be - unless
> you make sure you have no users.
>
> Also, people definitely *are* noticing the performance issues with the
> current set of patches, and they are causing real problems. Go search
> for reports of Amazon AWS slowdowns.
>
> So this whole "security is so important that performance doesn't
> matter" mindset is pure and utter garbage.
>
> And the whole "normal people won't even notice" is pure garbage too.
> Don't spread that bullshit when you see actual normal people
> complaining.
>
> Performance matters. A *LOT*.

I'm thinking we should provide the option to at least build the
hot-path nospec_array_ptr() usages without an lfence.

CONFIG_SPECTRE1_PARANOIA_SAFE
CONFIG_SPECTRE1_PARANOIA_PERF

...if only for easing performance testing and let the distribution set
its policy.

Where hot-path usages can do:

nospec_relax(nospec_array_ptr())

...to optionally ellide the lfence.


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Dan Williams
On Sun, Jan 7, 2018 at 11:47 AM, Linus Torvalds
 wrote:
> On Sat, Jan 6, 2018 at 10:33 PM, Willy Tarreau  wrote:
>>
>> To be fair there's overreaction on both sides. The vast majority of
>> users need to get a 100% safe system and will never notice  any
>> difference.
>
> There is no such thing as a "100% safe system". Never will be - unless
> you make sure you have no users.
>
> Also, people definitely *are* noticing the performance issues with the
> current set of patches, and they are causing real problems. Go search
> for reports of Amazon AWS slowdowns.
>
> So this whole "security is so important that performance doesn't
> matter" mindset is pure and utter garbage.
>
> And the whole "normal people won't even notice" is pure garbage too.
> Don't spread that bullshit when you see actual normal people
> complaining.
>
> Performance matters. A *LOT*.

I'm thinking we should provide the option to at least build the
hot-path nospec_array_ptr() usages without an lfence.

CONFIG_SPECTRE1_PARANOIA_SAFE
CONFIG_SPECTRE1_PARANOIA_PERF

...if only for easing performance testing and let the distribution set
its policy.

Where hot-path usages can do:

nospec_relax(nospec_array_ptr())

...to optionally ellide the lfence.


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Willy Tarreau
On Sun, Jan 07, 2018 at 11:47:07AM -0800, Linus Torvalds wrote:
> And the whole "normal people won't even notice" is pure garbage too.
> Don't spread that bullshit when you see actual normal people
> complaining.
> 
> Performance matters. A *LOT*.

Linus, no need to explain that to me, I'm precisely trying to see how
to disable PTI for a specific process because I face up to 45% loss in
certain circumstances, making it a no-go. But while a few of us have
very specific workloads emphasizing this impact, others have very
different ones and will not notice. For example my laptop did boot
pretty fine and I didn't notice anything until I fire a network
benchmark.

Willy


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Willy Tarreau
On Sun, Jan 07, 2018 at 11:47:07AM -0800, Linus Torvalds wrote:
> And the whole "normal people won't even notice" is pure garbage too.
> Don't spread that bullshit when you see actual normal people
> complaining.
> 
> Performance matters. A *LOT*.

Linus, no need to explain that to me, I'm precisely trying to see how
to disable PTI for a specific process because I face up to 45% loss in
certain circumstances, making it a no-go. But while a few of us have
very specific workloads emphasizing this impact, others have very
different ones and will not notice. For example my laptop did boot
pretty fine and I didn't notice anything until I fire a network
benchmark.

Willy


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Linus Torvalds
On Sat, Jan 6, 2018 at 10:33 PM, Willy Tarreau  wrote:
>
> To be fair there's overreaction on both sides. The vast majority of
> users need to get a 100% safe system and will never notice  any
> difference.

There is no such thing as a "100% safe system". Never will be - unless
you make sure you have no users.

Also, people definitely *are* noticing the performance issues with the
current set of patches, and they are causing real problems. Go search
for reports of Amazon AWS slowdowns.

So this whole "security is so important that performance doesn't
matter" mindset is pure and utter garbage.

And the whole "normal people won't even notice" is pure garbage too.
Don't spread that bullshit when you see actual normal people
complaining.

Performance matters. A *LOT*.

Linus


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Linus Torvalds
On Sat, Jan 6, 2018 at 10:33 PM, Willy Tarreau  wrote:
>
> To be fair there's overreaction on both sides. The vast majority of
> users need to get a 100% safe system and will never notice  any
> difference.

There is no such thing as a "100% safe system". Never will be - unless
you make sure you have no users.

Also, people definitely *are* noticing the performance issues with the
current set of patches, and they are causing real problems. Go search
for reports of Amazon AWS slowdowns.

So this whole "security is so important that performance doesn't
matter" mindset is pure and utter garbage.

And the whole "normal people won't even notice" is pure garbage too.
Don't spread that bullshit when you see actual normal people
complaining.

Performance matters. A *LOT*.

Linus


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Alan Cox
> everyone.  I'm not saying this always happens, but it is reasonable to
> let the iterative pushback see if we can get to better code in this
> case rather than trying to cut it of with the "because *security*"
> argument.
> 

I'm not arguing otherwise - I'd just prefer most users machines are
secure while we have the discussion and while we see what other
architectural tricks people can come up with

Alan


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Alan Cox
> everyone.  I'm not saying this always happens, but it is reasonable to
> let the iterative pushback see if we can get to better code in this
> case rather than trying to cut it of with the "because *security*"
> argument.
> 

I'm not arguing otherwise - I'd just prefer most users machines are
secure while we have the discussion and while we see what other
architectural tricks people can come up with

Alan


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Thomas Gleixner
On Sun, 7 Jan 2018, James Bottomley wrote:

> On Sat, 2018-01-06 at 20:36 -0500, David Miller wrote:
> > From: Willy Tarreau 
> > Date: Sat, 6 Jan 2018 21:42:29 +0100
> > 
> > > On Sat, Jan 06, 2018 at 06:38:59PM +, Alan Cox wrote:
> > >> Normally people who propose security fixes don't have to argue
> > about the
> > >> fact they added 30 clocks to avoid your box being 0wned.
> > > 
> > > In fact it depends, because if a fix makes the system unusable for
> > its
> > > initial purpose, this fix will simply not be deployed at all, which
> > is
> > > the worst that can happen.
> > 
> > +1
> > 
> > I completely agree with Willy and Alexei.
> > 
> > And the scale isn't even accurate, we're talking about at least
> > hundreds upon hundreds of clocks, not 30, if we add an operation
> > whose side effect is to wait for all pending loads to complete.  So
> > yeah this is going to be heavily scrutinized.
> 
> Plus this is the standard kernel code review MO: we've never blindly
> accepted code just because *security* (otherwise we'd have grsec in by
> now).  We use the pushback to get better and more performant code.
>  What often happens is it turns out that the "either security or
> performance" position was a false dichotomy and there is a way of
> fixing stuff that's acceptable (although not usually perfect) for
> everyone.  I'm not saying this always happens, but it is reasonable to
> let the iterative pushback see if we can get to better code in this
> case rather than trying to cut it of with the "because *security*"
> argument.

In principle I agree, though there are a few things to consider:

1) We have not the time to discuss that for the next 6 month

2) Alexei's analyis is purely based on the public information of the google
   zero folks. If it would be complete and the only attack vector all fine.

   If not and I doubt it is, we're going to regret this decision faster
   than we made it and this is not the kind of play field where we can
   afford that.

The whole 'we know better and performance is key' attitude is what led to
this disaster in the first place. We should finaly start to learn.

Can we please stop that and live with the extra cycles for a few month up
to the point where we get more informed answers to all these questions?

Thanks,

tglx


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Thomas Gleixner
On Sun, 7 Jan 2018, James Bottomley wrote:

> On Sat, 2018-01-06 at 20:36 -0500, David Miller wrote:
> > From: Willy Tarreau 
> > Date: Sat, 6 Jan 2018 21:42:29 +0100
> > 
> > > On Sat, Jan 06, 2018 at 06:38:59PM +, Alan Cox wrote:
> > >> Normally people who propose security fixes don't have to argue
> > about the
> > >> fact they added 30 clocks to avoid your box being 0wned.
> > > 
> > > In fact it depends, because if a fix makes the system unusable for
> > its
> > > initial purpose, this fix will simply not be deployed at all, which
> > is
> > > the worst that can happen.
> > 
> > +1
> > 
> > I completely agree with Willy and Alexei.
> > 
> > And the scale isn't even accurate, we're talking about at least
> > hundreds upon hundreds of clocks, not 30, if we add an operation
> > whose side effect is to wait for all pending loads to complete.  So
> > yeah this is going to be heavily scrutinized.
> 
> Plus this is the standard kernel code review MO: we've never blindly
> accepted code just because *security* (otherwise we'd have grsec in by
> now).  We use the pushback to get better and more performant code.
>  What often happens is it turns out that the "either security or
> performance" position was a false dichotomy and there is a way of
> fixing stuff that's acceptable (although not usually perfect) for
> everyone.  I'm not saying this always happens, but it is reasonable to
> let the iterative pushback see if we can get to better code in this
> case rather than trying to cut it of with the "because *security*"
> argument.

In principle I agree, though there are a few things to consider:

1) We have not the time to discuss that for the next 6 month

2) Alexei's analyis is purely based on the public information of the google
   zero folks. If it would be complete and the only attack vector all fine.

   If not and I doubt it is, we're going to regret this decision faster
   than we made it and this is not the kind of play field where we can
   afford that.

The whole 'we know better and performance is key' attitude is what led to
this disaster in the first place. We should finaly start to learn.

Can we please stop that and live with the extra cycles for a few month up
to the point where we get more informed answers to all these questions?

Thanks,

tglx


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread James Bottomley
On Sat, 2018-01-06 at 20:36 -0500, David Miller wrote:
> From: Willy Tarreau 
> Date: Sat, 6 Jan 2018 21:42:29 +0100
> 
> > On Sat, Jan 06, 2018 at 06:38:59PM +, Alan Cox wrote:
> >> Normally people who propose security fixes don't have to argue
> about the
> >> fact they added 30 clocks to avoid your box being 0wned.
> > 
> > In fact it depends, because if a fix makes the system unusable for
> its
> > initial purpose, this fix will simply not be deployed at all, which
> is
> > the worst that can happen.
> 
> +1
> 
> I completely agree with Willy and Alexei.
> 
> And the scale isn't even accurate, we're talking about at least
> hundreds upon hundreds of clocks, not 30, if we add an operation
> whose side effect is to wait for all pending loads to complete.  So
> yeah this is going to be heavily scrutinized.

Plus this is the standard kernel code review MO: we've never blindly
accepted code just because *security* (otherwise we'd have grsec in by
now).  We use the pushback to get better and more performant code.
 What often happens is it turns out that the "either security or
performance" position was a false dichotomy and there is a way of
fixing stuff that's acceptable (although not usually perfect) for
everyone.  I'm not saying this always happens, but it is reasonable to
let the iterative pushback see if we can get to better code in this
case rather than trying to cut it of with the "because *security*"
argument.

James



Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread James Bottomley
On Sat, 2018-01-06 at 20:36 -0500, David Miller wrote:
> From: Willy Tarreau 
> Date: Sat, 6 Jan 2018 21:42:29 +0100
> 
> > On Sat, Jan 06, 2018 at 06:38:59PM +, Alan Cox wrote:
> >> Normally people who propose security fixes don't have to argue
> about the
> >> fact they added 30 clocks to avoid your box being 0wned.
> > 
> > In fact it depends, because if a fix makes the system unusable for
> its
> > initial purpose, this fix will simply not be deployed at all, which
> is
> > the worst that can happen.
> 
> +1
> 
> I completely agree with Willy and Alexei.
> 
> And the scale isn't even accurate, we're talking about at least
> hundreds upon hundreds of clocks, not 30, if we add an operation
> whose side effect is to wait for all pending loads to complete.  So
> yeah this is going to be heavily scrutinized.

Plus this is the standard kernel code review MO: we've never blindly
accepted code just because *security* (otherwise we'd have grsec in by
now).  We use the pushback to get better and more performant code.
 What often happens is it turns out that the "either security or
performance" position was a false dichotomy and there is a way of
fixing stuff that's acceptable (although not usually perfect) for
everyone.  I'm not saying this always happens, but it is reasonable to
let the iterative pushback see if we can get to better code in this
case rather than trying to cut it of with the "because *security*"
argument.

James



Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Alan Cox
> > 2.  It is very very complicated to answer a question like "is
> > sequence x safe on all of vendor's microprocessors" even for the vendor  
> 
> so far "is sequence x safe" was viewed by cpu vendors as
> "is sequence x going to stop speculative execution".

Incorrect. Modern processors are very very sophisticated beasts and you
are dealing with a wide range of architectures and micro-architectures
that all have their own differences.

> > Intel's current position is 'lfence'.  

> which clearly states that bpf_tail_call() was used in the attack.
> Yet none of the intel nor arm patches address speculation in
> this bpf helper!

There are a set of patches under discussion for eBPF both the JIT and
interpreter. BTW I would expect that there are things Coverity didn't
find, and that we'll also see variants on the attack where different
tricks for measurement emerge that may change what is needed a little.

> which means that POC is relying 64-bit address speculation.
> In the places coverity found the user supplied value is 32-bit,

People have 32bit computers as well as 64bit and in some cases 32bit is
fine for an attack depending where your target is relative to the object.

> > The differences involved on the "lfence" versus "and" versus before are
> > not likely to be anywhere in that order of magnitude.  
> 
> we clearly disagree here. Both intel and arm patches proposed
> to add lfence in bpf_map_lookup() which is the hottest function
> we have and we do run it at 40+Gbps speeds where every nanosecond
> counts, so no, lfence is not a solution.

The last solution I saw proposed in that path for the JIT is to "and" with
constant which in that situation clearly makes the most sense providing
it is safe on all the CPUs involved.

lfence timing is also heavily dependent upon what work has to be done to
retire previous live instructions. BPF does not normally do a lot of
writing so you'd expect the cost to be low.

Anyway - Intel's current position is that lfence is safe. It's not
impossible other sequences will in future be documented as safe by one or
more vendors of x86 processors.

Alan


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Alan Cox
> > 2.  It is very very complicated to answer a question like "is
> > sequence x safe on all of vendor's microprocessors" even for the vendor  
> 
> so far "is sequence x safe" was viewed by cpu vendors as
> "is sequence x going to stop speculative execution".

Incorrect. Modern processors are very very sophisticated beasts and you
are dealing with a wide range of architectures and micro-architectures
that all have their own differences.

> > Intel's current position is 'lfence'.  

> which clearly states that bpf_tail_call() was used in the attack.
> Yet none of the intel nor arm patches address speculation in
> this bpf helper!

There are a set of patches under discussion for eBPF both the JIT and
interpreter. BTW I would expect that there are things Coverity didn't
find, and that we'll also see variants on the attack where different
tricks for measurement emerge that may change what is needed a little.

> which means that POC is relying 64-bit address speculation.
> In the places coverity found the user supplied value is 32-bit,

People have 32bit computers as well as 64bit and in some cases 32bit is
fine for an attack depending where your target is relative to the object.

> > The differences involved on the "lfence" versus "and" versus before are
> > not likely to be anywhere in that order of magnitude.  
> 
> we clearly disagree here. Both intel and arm patches proposed
> to add lfence in bpf_map_lookup() which is the hottest function
> we have and we do run it at 40+Gbps speeds where every nanosecond
> counts, so no, lfence is not a solution.

The last solution I saw proposed in that path for the JIT is to "and" with
constant which in that situation clearly makes the most sense providing
it is safe on all the CPUs involved.

lfence timing is also heavily dependent upon what work has to be done to
retire previous live instructions. BPF does not normally do a lot of
writing so you'd expect the cost to be low.

Anyway - Intel's current position is that lfence is safe. It's not
impossible other sequences will in future be documented as safe by one or
more vendors of x86 processors.

Alan


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Thomas Gleixner
On Sat, 6 Jan 2018, Alexei Starovoitov wrote:
> which clearly states that bpf_tail_call() was used in the attack.
> Yet none of the intel nor arm patches address speculation in
> this bpf helper!
> It means that:
> - gpz didn't share neither exploit nor the detailed description
>   of the POC with cpu vendors until now
> - coverity rules used to find all these places in the kernel
>   failed to find bpf_tail_call
> - cpu vendors were speculating what variant 1 can actually do

You forgot to mention that there might be other attacks than the public POC
which are not covered by a simple AND 

Thanks,

tglx


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Thomas Gleixner
On Sat, 6 Jan 2018, Alexei Starovoitov wrote:
> which clearly states that bpf_tail_call() was used in the attack.
> Yet none of the intel nor arm patches address speculation in
> this bpf helper!
> It means that:
> - gpz didn't share neither exploit nor the detailed description
>   of the POC with cpu vendors until now
> - coverity rules used to find all these places in the kernel
>   failed to find bpf_tail_call
> - cpu vendors were speculating what variant 1 can actually do

You forgot to mention that there might be other attacks than the public POC
which are not covered by a simple AND 

Thanks,

tglx


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-06 Thread Willy Tarreau
On Sat, Jan 06, 2018 at 07:38:14PM -0800, Alexei Starovoitov wrote:
> yep. plenty of unknowns and what's happening now is an overreaction.

To be fair there's overreaction on both sides. The vast majority of
users need to get a 100% safe system and will never notice  any
difference. A few of us are more concerned by the risk of performance
loss brought by these fixes. We do all need to run tests on the
patchsets to bring numbers on the table.

> What is the rush to push half baked patches into upstream
> that don't even address the variant 1 ?

You probably need to trust the CPU vendors a bit for having had all
the details about the risks for a few months now and accept that if
they're destroying their product's performance compared to their main
competitor, they probably studied a number of other alternatives first.
It doesn't mean they thought about everything of course, and maybe they
need to study your proposal as a better solution to reduce criticism.

> which clearly states that bpf_tail_call() was used in the attack.
> Yet none of the intel nor arm patches address speculation in
> this bpf helper!
> It means that:
> - gpz didn't share neither exploit nor the detailed description
>   of the POC with cpu vendors until now

Or it was considered less urgent to fix compared to the rest. It's
also possible that the scariest details were not written in the GPZ
article.

> Now the attack is well described, yet cpu vendors still pushing
> for lfence patches that don't make sense. Why?

Imagine if you were in their position and were pushing a solution which
would later be found to be inefficient and to be vulnerable again. Your
name would appear twice in the press in a few months, this would be
terrible. It makes sense in their position to find the safest fix first
given that those like you or me really concerned about the performance
impact know how to add an option to a boot loader or revert a patch that
causes trouble.

> What I think is important is to understand vulnerability first.
> I don't think it was done.

I suspect it was clearly done by those how had no other option but
working on this painful series over the last few months :-/

> > The differences involved on the "lfence" versus "and" versus before are
> > not likely to be anywhere in that order of magnitude.
> 
> we clearly disagree here. Both intel and arm patches proposed
> to add lfence in bpf_map_lookup() which is the hottest function
> we have and we do run it at 40+Gbps speeds where every nanosecond
> counts, so no, lfence is not a solution.

Please help here by testing the patch series and reporting numbers
before, with the fix, and with your proposal. That's the best way to
catch their attention and to get your proposal considered as a viable
alternative (or as a partial one for certain environments). I did the
same when I believed it could be sufficient to add noise to RDTSC and
found it totally inefficient after testing. But it's better for
everyone that research and testing is done rather than criticizing the
proposed fixes (which is not fair to the people who work hard on them
for everyone else).

Cheers,
Willy



Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-06 Thread Willy Tarreau
On Sat, Jan 06, 2018 at 07:38:14PM -0800, Alexei Starovoitov wrote:
> yep. plenty of unknowns and what's happening now is an overreaction.

To be fair there's overreaction on both sides. The vast majority of
users need to get a 100% safe system and will never notice  any
difference. A few of us are more concerned by the risk of performance
loss brought by these fixes. We do all need to run tests on the
patchsets to bring numbers on the table.

> What is the rush to push half baked patches into upstream
> that don't even address the variant 1 ?

You probably need to trust the CPU vendors a bit for having had all
the details about the risks for a few months now and accept that if
they're destroying their product's performance compared to their main
competitor, they probably studied a number of other alternatives first.
It doesn't mean they thought about everything of course, and maybe they
need to study your proposal as a better solution to reduce criticism.

> which clearly states that bpf_tail_call() was used in the attack.
> Yet none of the intel nor arm patches address speculation in
> this bpf helper!
> It means that:
> - gpz didn't share neither exploit nor the detailed description
>   of the POC with cpu vendors until now

Or it was considered less urgent to fix compared to the rest. It's
also possible that the scariest details were not written in the GPZ
article.

> Now the attack is well described, yet cpu vendors still pushing
> for lfence patches that don't make sense. Why?

Imagine if you were in their position and were pushing a solution which
would later be found to be inefficient and to be vulnerable again. Your
name would appear twice in the press in a few months, this would be
terrible. It makes sense in their position to find the safest fix first
given that those like you or me really concerned about the performance
impact know how to add an option to a boot loader or revert a patch that
causes trouble.

> What I think is important is to understand vulnerability first.
> I don't think it was done.

I suspect it was clearly done by those how had no other option but
working on this painful series over the last few months :-/

> > The differences involved on the "lfence" versus "and" versus before are
> > not likely to be anywhere in that order of magnitude.
> 
> we clearly disagree here. Both intel and arm patches proposed
> to add lfence in bpf_map_lookup() which is the hottest function
> we have and we do run it at 40+Gbps speeds where every nanosecond
> counts, so no, lfence is not a solution.

Please help here by testing the patch series and reporting numbers
before, with the fix, and with your proposal. That's the best way to
catch their attention and to get your proposal considered as a viable
alternative (or as a partial one for certain environments). I did the
same when I believed it could be sufficient to add noise to RDTSC and
found it totally inefficient after testing. But it's better for
everyone that research and testing is done rather than criticizing the
proposed fixes (which is not fair to the people who work hard on them
for everyone else).

Cheers,
Willy



Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-06 Thread Alexei Starovoitov
On Sat, Jan 06, 2018 at 11:05:07PM +, Alan Cox wrote:
> > Even if it would be practical the speed probably going to be in bytes per 
> > second,
> > so to read anything meaningful an attack detection techniques (that people
> > are actively working on) will be able to catch it.
> > At the end security cannot be absolute.
> > The current level of paranoia shouldn't force us to make hastily decisions.
> 
> I think there are at least three overlapping problem spaces here
> 
> 1.This is a new field. That could mean that it turns out to be
> really hard and everyone discovers that eBPF was pretty much the only
> interesting attack. It could also mean we are going to see several years
> or refinement by evil geniuses all over the world and what we see now is
> tip of iceberg in cleverness.

yep. plenty of unknowns and what's happening now is an overreaction.

> 2.It is very very complicated to answer a question like "is
> sequence x safe on all of vendor's microprocessors" even for the vendor

so far "is sequence x safe" was viewed by cpu vendors as
"is sequence x going to stop speculative execution".
The AND approach wasn't even considered and I suspect there may be
other ways to avoid the side channel attack, but vendors are too
focused on stopping speculation. Why? The vulnerability already
disclosed in detail. Public clouds already patched their kernels.
What is the rush to push half baked patches into upstream
that don't even address the variant 1 ?

> Intel's current position is 'lfence'.

Let's look at the facts. From GPZ blog:
"In simplified terms, this program is used to determine the address
of prog_map by guessing the offset from prog_map to a userspace
address and tail-calling through prog_map at the guessed offsets."

which clearly states that bpf_tail_call() was used in the attack.
Yet none of the intel nor arm patches address speculation in
this bpf helper!
It means that:
- gpz didn't share neither exploit nor the detailed description
  of the POC with cpu vendors until now
- coverity rules used to find all these places in the kernel
  failed to find bpf_tail_call
- cpu vendors were speculating what variant 1 can actually do

Now the attack is well described, yet cpu vendors still pushing
for lfence patches that don't make sense. Why?

Furthermore GPZ blog states:
"To cause the branch prediction to predict that the offset is below
the length of prog_map, tail calls to an in-bounds index are
performed in between."

which means that attack is 'in-bound * largeN + out-of-bound * 1'.
Going back to our discussion few emails ago about 'mask' as a variable
it means that value predictor when speculating this last out of
bound access will correctly predict exact 'mask', so 'index & mask'
fix will certainly defeat this poc.

More from the blog:
"To increase the mis-speculation window, the cache line containing
the length of prog_map is bounced to another core."

and in the kernel we have:
struct bpf_map {
   atomic_t refcnt;
   enum bpf_map_type map_type;
   u32 key_size;
   u32 value_size;
   u32 max_entries;
   ...
note that only 'refcnt' could be used for bouncing,
so if we place max_entries and refcnt into different cache
lines this poc will likely fail as well.

More from the blog:
"At this point, a second eBPF program can be used to actually leak data.
In pseudocode, this program looks as follows:
uint64_t progmap_index = (((secret_data & bitmask) >> bitshift_selector) << 7) 
+ prog_array_base_offset;
bpf_tail_call(prog_map, progmap_index);
"
which means that POC is relying 64-bit address speculation.
In the places coverity found the user supplied value is 32-bit,
so none of them are suitable for this POC.
We cannot rule out that the attack cannot be crafted to work
with 32-bit, but it means there is no urgency to lfence everything.

> The important thing is that there is something clean, all architecture
> that can be used today that doesn't keep forcing everyone to change
> drivers when new/better ways to do the speculation management appear.

What I think is important is to understand vulnerability first.
I don't think it was done.

> The differences involved on the "lfence" versus "and" versus before are
> not likely to be anywhere in that order of magnitude.

we clearly disagree here. Both intel and arm patches proposed
to add lfence in bpf_map_lookup() which is the hottest function
we have and we do run it at 40+Gbps speeds where every nanosecond
counts, so no, lfence is not a solution.



Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-06 Thread Alexei Starovoitov
On Sat, Jan 06, 2018 at 11:05:07PM +, Alan Cox wrote:
> > Even if it would be practical the speed probably going to be in bytes per 
> > second,
> > so to read anything meaningful an attack detection techniques (that people
> > are actively working on) will be able to catch it.
> > At the end security cannot be absolute.
> > The current level of paranoia shouldn't force us to make hastily decisions.
> 
> I think there are at least three overlapping problem spaces here
> 
> 1.This is a new field. That could mean that it turns out to be
> really hard and everyone discovers that eBPF was pretty much the only
> interesting attack. It could also mean we are going to see several years
> or refinement by evil geniuses all over the world and what we see now is
> tip of iceberg in cleverness.

yep. plenty of unknowns and what's happening now is an overreaction.

> 2.It is very very complicated to answer a question like "is
> sequence x safe on all of vendor's microprocessors" even for the vendor

so far "is sequence x safe" was viewed by cpu vendors as
"is sequence x going to stop speculative execution".
The AND approach wasn't even considered and I suspect there may be
other ways to avoid the side channel attack, but vendors are too
focused on stopping speculation. Why? The vulnerability already
disclosed in detail. Public clouds already patched their kernels.
What is the rush to push half baked patches into upstream
that don't even address the variant 1 ?

> Intel's current position is 'lfence'.

Let's look at the facts. From GPZ blog:
"In simplified terms, this program is used to determine the address
of prog_map by guessing the offset from prog_map to a userspace
address and tail-calling through prog_map at the guessed offsets."

which clearly states that bpf_tail_call() was used in the attack.
Yet none of the intel nor arm patches address speculation in
this bpf helper!
It means that:
- gpz didn't share neither exploit nor the detailed description
  of the POC with cpu vendors until now
- coverity rules used to find all these places in the kernel
  failed to find bpf_tail_call
- cpu vendors were speculating what variant 1 can actually do

Now the attack is well described, yet cpu vendors still pushing
for lfence patches that don't make sense. Why?

Furthermore GPZ blog states:
"To cause the branch prediction to predict that the offset is below
the length of prog_map, tail calls to an in-bounds index are
performed in between."

which means that attack is 'in-bound * largeN + out-of-bound * 1'.
Going back to our discussion few emails ago about 'mask' as a variable
it means that value predictor when speculating this last out of
bound access will correctly predict exact 'mask', so 'index & mask'
fix will certainly defeat this poc.

More from the blog:
"To increase the mis-speculation window, the cache line containing
the length of prog_map is bounced to another core."

and in the kernel we have:
struct bpf_map {
   atomic_t refcnt;
   enum bpf_map_type map_type;
   u32 key_size;
   u32 value_size;
   u32 max_entries;
   ...
note that only 'refcnt' could be used for bouncing,
so if we place max_entries and refcnt into different cache
lines this poc will likely fail as well.

More from the blog:
"At this point, a second eBPF program can be used to actually leak data.
In pseudocode, this program looks as follows:
uint64_t progmap_index = (((secret_data & bitmask) >> bitshift_selector) << 7) 
+ prog_array_base_offset;
bpf_tail_call(prog_map, progmap_index);
"
which means that POC is relying 64-bit address speculation.
In the places coverity found the user supplied value is 32-bit,
so none of them are suitable for this POC.
We cannot rule out that the attack cannot be crafted to work
with 32-bit, but it means there is no urgency to lfence everything.

> The important thing is that there is something clean, all architecture
> that can be used today that doesn't keep forcing everyone to change
> drivers when new/better ways to do the speculation management appear.

What I think is important is to understand vulnerability first.
I don't think it was done.

> The differences involved on the "lfence" versus "and" versus before are
> not likely to be anywhere in that order of magnitude.

we clearly disagree here. Both intel and arm patches proposed
to add lfence in bpf_map_lookup() which is the hottest function
we have and we do run it at 40+Gbps speeds where every nanosecond
counts, so no, lfence is not a solution.



Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-06 Thread David Miller
From: Willy Tarreau 
Date: Sat, 6 Jan 2018 21:42:29 +0100

> On Sat, Jan 06, 2018 at 06:38:59PM +, Alan Cox wrote:
>> Normally people who propose security fixes don't have to argue about the
>> fact they added 30 clocks to avoid your box being 0wned.
> 
> In fact it depends, because if a fix makes the system unusable for its
> initial purpose, this fix will simply not be deployed at all, which is
> the worst that can happen.

+1

I completely agree with Willy and Alexei.

And the scale isn't even accurate, we're talking about at least
hundreds upon hundreds of clocks, not 30, if we add an operation whose
side effect is to wait for all pending loads to complete.  So yeah
this is going to be heavily scrutinized.



Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-06 Thread David Miller
From: Willy Tarreau 
Date: Sat, 6 Jan 2018 21:42:29 +0100

> On Sat, Jan 06, 2018 at 06:38:59PM +, Alan Cox wrote:
>> Normally people who propose security fixes don't have to argue about the
>> fact they added 30 clocks to avoid your box being 0wned.
> 
> In fact it depends, because if a fix makes the system unusable for its
> initial purpose, this fix will simply not be deployed at all, which is
> the worst that can happen.

+1

I completely agree with Willy and Alexei.

And the scale isn't even accurate, we're talking about at least
hundreds upon hundreds of clocks, not 30, if we add an operation whose
side effect is to wait for all pending loads to complete.  So yeah
this is going to be heavily scrutinized.



Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-06 Thread Linus Torvalds
On Sat, Jan 6, 2018 at 3:31 PM, Dan Williams  wrote:
>
> I assume if we put this in uaccess_begin() we also need audit for
> paths that use access_ok but don't do on to call uaccess_begin()? A
> quick glance shows a few places where we are open coding the stac().
> Perhaps land the lfence in stac() directly?

Yeah, we should put it in uaccess_begin(), and in the actual user
accessor helpers that do stac. Some of them probably should be changed
to use uaccess_begin() instead while at it.

One question for the CPU people: do we actually care and need to do
this for things that might *write* to something? The speculative write
obviously is killed, but does it perhaps bring in a cacheline even
when killed?

Because maybe we don't need the lfence in put_user(), only in get_user()?

   Linus


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-06 Thread Linus Torvalds
On Sat, Jan 6, 2018 at 3:31 PM, Dan Williams  wrote:
>
> I assume if we put this in uaccess_begin() we also need audit for
> paths that use access_ok but don't do on to call uaccess_begin()? A
> quick glance shows a few places where we are open coding the stac().
> Perhaps land the lfence in stac() directly?

Yeah, we should put it in uaccess_begin(), and in the actual user
accessor helpers that do stac. Some of them probably should be changed
to use uaccess_begin() instead while at it.

One question for the CPU people: do we actually care and need to do
this for things that might *write* to something? The speculative write
obviously is killed, but does it perhaps bring in a cacheline even
when killed?

Because maybe we don't need the lfence in put_user(), only in get_user()?

   Linus


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-06 Thread Dan Williams
On Fri, Jan 5, 2018 at 7:09 PM, Linus Torvalds
 wrote:
> On Fri, Jan 5, 2018 at 6:52 PM, Linus Torvalds
>  wrote:
>>
>> The fact is, we have to stop speculating when access_ok() does *not*
>> fail - because that's when we'll actually do the access. And it's that
>> access that needs to be non-speculative.
>
> I also suspect we should probably do this entirely differently.
>
> Maybe the whole lfence can be part of uaccess_begin() instead (ie
> currently 'stac()'). That would fit the existing structure better, I
> think. And it would avoid any confusion about the whole "when to stop
> speculation".

I assume if we put this in uaccess_begin() we also need audit for
paths that use access_ok but don't do on to call uaccess_begin()? A
quick glance shows a few places where we are open coding the stac().
Perhaps land the lfence in stac() directly?


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-06 Thread Dan Williams
On Fri, Jan 5, 2018 at 7:09 PM, Linus Torvalds
 wrote:
> On Fri, Jan 5, 2018 at 6:52 PM, Linus Torvalds
>  wrote:
>>
>> The fact is, we have to stop speculating when access_ok() does *not*
>> fail - because that's when we'll actually do the access. And it's that
>> access that needs to be non-speculative.
>
> I also suspect we should probably do this entirely differently.
>
> Maybe the whole lfence can be part of uaccess_begin() instead (ie
> currently 'stac()'). That would fit the existing structure better, I
> think. And it would avoid any confusion about the whole "when to stop
> speculation".

I assume if we put this in uaccess_begin() we also need audit for
paths that use access_ok but don't do on to call uaccess_begin()? A
quick glance shows a few places where we are open coding the stac().
Perhaps land the lfence in stac() directly?


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-06 Thread Alan Cox
> Even if it would be practical the speed probably going to be in bytes per 
> second,
> so to read anything meaningful an attack detection techniques (that people
> are actively working on) will be able to catch it.
> At the end security cannot be absolute.
> The current level of paranoia shouldn't force us to make hastily decisions.

I think there are at least three overlapping problem spaces here

1.  This is a new field. That could mean that it turns out to be
really hard and everyone discovers that eBPF was pretty much the only
interesting attack. It could also mean we are going to see several years
or refinement by evil geniuses all over the world and what we see now is
tip of iceberg in cleverness.

2.  It is very very complicated to answer a question like "is
sequence x safe on all of vendor's microprocessors" even for the vendor

3.  A lot of people outside of the professional security space are
very new to the concept of security economics and risk management as
opposed to seeing the fake binary nice green tick that says their
computers are secure that they can pass to their senior management or
show to audit.

> So how about we do array_access() macro similar to above by default
> with extra CONFIG_ to convert it to lfence ?

We don't have to decide today. Intel's current position is 'lfence'. Over
time we may see vendors provide more sequences. We will see vendors add
new instruction hints and the like (See the ARM whitepaper for example)
and the array access code will change and change again for the better.

The important thing is that there is something clean, all architecture
that can be used today that doesn't keep forcing everyone to change
drivers when new/better ways to do the speculation management appear.

> Why default to AND approach instead of lfence ?
> Because the kernel should still be usable. If security
> sacrifices performance so much such security will be turned off.
> Ex: kpti suppose to add 5-30%. If it means 10% on production workload
> and the datacenter capacity cannot grow 10% overnight, kpti will be off.

The differences involved on the "lfence" versus "and" versus before are
not likely to be anywhere in that order of magnitude. As I said I want to
take a hard look at the IPv4/6 ones but most of them are not in places
where you'd expect a lot of data to be in flight in a perf critical path.

Alan


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-06 Thread Alan Cox
> Even if it would be practical the speed probably going to be in bytes per 
> second,
> so to read anything meaningful an attack detection techniques (that people
> are actively working on) will be able to catch it.
> At the end security cannot be absolute.
> The current level of paranoia shouldn't force us to make hastily decisions.

I think there are at least three overlapping problem spaces here

1.  This is a new field. That could mean that it turns out to be
really hard and everyone discovers that eBPF was pretty much the only
interesting attack. It could also mean we are going to see several years
or refinement by evil geniuses all over the world and what we see now is
tip of iceberg in cleverness.

2.  It is very very complicated to answer a question like "is
sequence x safe on all of vendor's microprocessors" even for the vendor

3.  A lot of people outside of the professional security space are
very new to the concept of security economics and risk management as
opposed to seeing the fake binary nice green tick that says their
computers are secure that they can pass to their senior management or
show to audit.

> So how about we do array_access() macro similar to above by default
> with extra CONFIG_ to convert it to lfence ?

We don't have to decide today. Intel's current position is 'lfence'. Over
time we may see vendors provide more sequences. We will see vendors add
new instruction hints and the like (See the ARM whitepaper for example)
and the array access code will change and change again for the better.

The important thing is that there is something clean, all architecture
that can be used today that doesn't keep forcing everyone to change
drivers when new/better ways to do the speculation management appear.

> Why default to AND approach instead of lfence ?
> Because the kernel should still be usable. If security
> sacrifices performance so much such security will be turned off.
> Ex: kpti suppose to add 5-30%. If it means 10% on production workload
> and the datacenter capacity cannot grow 10% overnight, kpti will be off.

The differences involved on the "lfence" versus "and" versus before are
not likely to be anywhere in that order of magnitude. As I said I want to
take a hard look at the IPv4/6 ones but most of them are not in places
where you'd expect a lot of data to be in flight in a perf critical path.

Alan


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-06 Thread Thomas Gleixner
On Sat, 6 Jan 2018, Alexei Starovoitov wrote:
> So how about we do array_access() macro similar to above by default
> with extra CONFIG_ to convert it to lfence ?
> Why default to AND approach instead of lfence ?
> Because the kernel should still be usable. If security
> sacrifices performance so much such security will be turned off.
> Ex: kpti suppose to add 5-30%. If it means 10% on production workload
> and the datacenter capacity cannot grow 10% overnight, kpti will be off.

That's the decision and responsibility of the person who disables it.

Thanks,

tglx




Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-06 Thread Thomas Gleixner
On Sat, 6 Jan 2018, Alexei Starovoitov wrote:
> So how about we do array_access() macro similar to above by default
> with extra CONFIG_ to convert it to lfence ?
> Why default to AND approach instead of lfence ?
> Because the kernel should still be usable. If security
> sacrifices performance so much such security will be turned off.
> Ex: kpti suppose to add 5-30%. If it means 10% on production workload
> and the datacenter capacity cannot grow 10% overnight, kpti will be off.

That's the decision and responsibility of the person who disables it.

Thanks,

tglx




Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-06 Thread Alexei Starovoitov
On Sat, Jan 06, 2018 at 08:22:13PM +, Alan Cox wrote:
> > "Value prediction consists of predicting entire 32- and 64-bit register 
> > values
> > based  on  previously-seen values"
> 
> For their implementation yes
> 
> > 
> > > In other words there are at least two problems with Linus proposal
> > > 
> > > 1. The / mask has to be generated and that has to involve
> > > speculative flows.  
> > 
> > to answer above and Thomas's
> > "For one particular architecture and that's not a solution for generic 
> > code."
> > 
> > The following:
> > #define array_access(base, idx, max) ({ \
> > union { typeof(base[0]) _val; unsigned long _bit; } __u;\
> > unsigned long _i = (idx);   \
> > unsigned long _m = (max);   \
> > unsigned long _mask = ~(long)(_m - 1 - _i) >> 63;   \
> > __u._val = base[_i & _mask];\
> > __u._bit &= _mask;  \
> > __u._val; })
> > 
> > is generic and no speculative flows.
> 
> In the value speculation case imagine it's been called 1000 times for
> process A which as a limit of say 16 so that file->priv->max is 16, and
> then is run for process B which is different.
> 
> A value speculating processor waiting for file->priv->max which has been
> pushed out of cache by an attacker is at liberty to say 'I've no idea
> what max is but hey it was 16 last time so lets plug 16 in and keep going"
> 
> So while the change in the mask computation is clever and (subject to
> compiler cleverness) safe against guesses of which path will be taken I
> don't think it's generically safe.
> 
> Unfortunately a lot of things we index are of different sizes as seen by
> different tasks, or when passed different other index values so this does
> matter.
> 
> > Even if 'mask' in 'index & mask' example is a stall the educated
> > guess will come from the prior value (according to the quoted paper)
> 
> Which might be for a different set of variables when the table is say per
> process like file handles, or the value is different each call.
> 
> If we have single array of fixed size then I suspect you are right but
> usually we don't.

Thanks. I see your point. Agree on the above.
The variant 1 exploit does 2000 bytes a second using 64-bit address math.
Things like 'fd' are 32-bit, so it's magnitude higher attack
complexity already (without any kernel changes).
If we do above array_access() the exploit complexity increases even more.
More so the attacker would need to train fdt->max_fds on a known
good fdt with millions of files for 100s of iterations only to do
one speculative access on another fdt with small max_fds 
(to exploit value speculation from large max_fds)
while keeping cache line for that speculative out-of-bounds access on
small fdt empty and measuring cache load times on another cpu.
I frankly don't see such attack being able to keep cache lines pristine
for that small fdt speculation doing hundreds of non-speculative
accesses on another fdt. Way too many moving pieces.
Even if it would be practical the speed probably going to be in bytes per 
second,
so to read anything meaningful an attack detection techniques (that people
are actively working on) will be able to catch it.
At the end security cannot be absolute.
The current level of paranoia shouldn't force us to make hastily decisions.

So how about we do array_access() macro similar to above by default
with extra CONFIG_ to convert it to lfence ?
Why default to AND approach instead of lfence ?
Because the kernel should still be usable. If security
sacrifices performance so much such security will be turned off.
Ex: kpti suppose to add 5-30%. If it means 10% on production workload
and the datacenter capacity cannot grow 10% overnight, kpti will be off.



Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-06 Thread Alexei Starovoitov
On Sat, Jan 06, 2018 at 08:22:13PM +, Alan Cox wrote:
> > "Value prediction consists of predicting entire 32- and 64-bit register 
> > values
> > based  on  previously-seen values"
> 
> For their implementation yes
> 
> > 
> > > In other words there are at least two problems with Linus proposal
> > > 
> > > 1. The / mask has to be generated and that has to involve
> > > speculative flows.  
> > 
> > to answer above and Thomas's
> > "For one particular architecture and that's not a solution for generic 
> > code."
> > 
> > The following:
> > #define array_access(base, idx, max) ({ \
> > union { typeof(base[0]) _val; unsigned long _bit; } __u;\
> > unsigned long _i = (idx);   \
> > unsigned long _m = (max);   \
> > unsigned long _mask = ~(long)(_m - 1 - _i) >> 63;   \
> > __u._val = base[_i & _mask];\
> > __u._bit &= _mask;  \
> > __u._val; })
> > 
> > is generic and no speculative flows.
> 
> In the value speculation case imagine it's been called 1000 times for
> process A which as a limit of say 16 so that file->priv->max is 16, and
> then is run for process B which is different.
> 
> A value speculating processor waiting for file->priv->max which has been
> pushed out of cache by an attacker is at liberty to say 'I've no idea
> what max is but hey it was 16 last time so lets plug 16 in and keep going"
> 
> So while the change in the mask computation is clever and (subject to
> compiler cleverness) safe against guesses of which path will be taken I
> don't think it's generically safe.
> 
> Unfortunately a lot of things we index are of different sizes as seen by
> different tasks, or when passed different other index values so this does
> matter.
> 
> > Even if 'mask' in 'index & mask' example is a stall the educated
> > guess will come from the prior value (according to the quoted paper)
> 
> Which might be for a different set of variables when the table is say per
> process like file handles, or the value is different each call.
> 
> If we have single array of fixed size then I suspect you are right but
> usually we don't.

Thanks. I see your point. Agree on the above.
The variant 1 exploit does 2000 bytes a second using 64-bit address math.
Things like 'fd' are 32-bit, so it's magnitude higher attack
complexity already (without any kernel changes).
If we do above array_access() the exploit complexity increases even more.
More so the attacker would need to train fdt->max_fds on a known
good fdt with millions of files for 100s of iterations only to do
one speculative access on another fdt with small max_fds 
(to exploit value speculation from large max_fds)
while keeping cache line for that speculative out-of-bounds access on
small fdt empty and measuring cache load times on another cpu.
I frankly don't see such attack being able to keep cache lines pristine
for that small fdt speculation doing hundreds of non-speculative
accesses on another fdt. Way too many moving pieces.
Even if it would be practical the speed probably going to be in bytes per 
second,
so to read anything meaningful an attack detection techniques (that people
are actively working on) will be able to catch it.
At the end security cannot be absolute.
The current level of paranoia shouldn't force us to make hastily decisions.

So how about we do array_access() macro similar to above by default
with extra CONFIG_ to convert it to lfence ?
Why default to AND approach instead of lfence ?
Because the kernel should still be usable. If security
sacrifices performance so much such security will be turned off.
Ex: kpti suppose to add 5-30%. If it means 10% on production workload
and the datacenter capacity cannot grow 10% overnight, kpti will be off.



Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-06 Thread Willy Tarreau
On Sat, Jan 06, 2018 at 06:38:59PM +, Alan Cox wrote:
> Normally people who propose security fixes don't have to argue about the
> fact they added 30 clocks to avoid your box being 0wned.

In fact it depends, because if a fix makes the system unusable for its
initial purpose, this fix will simply not be deployed at all, which is
the worst that can happen. Especially when it cannot be disabled by
config and people stop updating their systems to stay on the last
"known good" version. Fortunately in Linux we often have the choice so
that users rarely have a valid reason for not upgrading!

Willy


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-06 Thread Willy Tarreau
On Sat, Jan 06, 2018 at 06:38:59PM +, Alan Cox wrote:
> Normally people who propose security fixes don't have to argue about the
> fact they added 30 clocks to avoid your box being 0wned.

In fact it depends, because if a fix makes the system unusable for its
initial purpose, this fix will simply not be deployed at all, which is
the worst that can happen. Especially when it cannot be disabled by
config and people stop updating their systems to stay on the last
"known good" version. Fortunately in Linux we often have the choice so
that users rarely have a valid reason for not upgrading!

Willy


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-06 Thread Alan Cox
> "Value prediction consists of predicting entire 32- and 64-bit register values
> based  on  previously-seen values"

For their implementation yes

> 
> > In other words there are at least two problems with Linus proposal
> > 
> > 1. The / mask has to be generated and that has to involve
> > speculative flows.  
> 
> to answer above and Thomas's
> "For one particular architecture and that's not a solution for generic code."
> 
> The following:
> #define array_access(base, idx, max) ({ \
> union { typeof(base[0]) _val; unsigned long _bit; } __u;\
> unsigned long _i = (idx);   \
> unsigned long _m = (max);   \
> unsigned long _mask = ~(long)(_m - 1 - _i) >> 63;   \
> __u._val = base[_i & _mask];\
> __u._bit &= _mask;  \
> __u._val; })
> 
> is generic and no speculative flows.

In the value speculation case imagine it's been called 1000 times for
process A which as a limit of say 16 so that file->priv->max is 16, and
then is run for process B which is different.

A value speculating processor waiting for file->priv->max which has been
pushed out of cache by an attacker is at liberty to say 'I've no idea
what max is but hey it was 16 last time so lets plug 16 in and keep going"

So while the change in the mask computation is clever and (subject to
compiler cleverness) safe against guesses of which path will be taken I
don't think it's generically safe.

Unfortunately a lot of things we index are of different sizes as seen by
different tasks, or when passed different other index values so this does
matter.

> Even if 'mask' in 'index & mask' example is a stall the educated
> guess will come from the prior value (according to the quoted paper)

Which might be for a different set of variables when the table is say per
process like file handles, or the value is different each call.

If we have single array of fixed size then I suspect you are right but
usually we don't.

Alan


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-06 Thread Alan Cox
> "Value prediction consists of predicting entire 32- and 64-bit register values
> based  on  previously-seen values"

For their implementation yes

> 
> > In other words there are at least two problems with Linus proposal
> > 
> > 1. The / mask has to be generated and that has to involve
> > speculative flows.  
> 
> to answer above and Thomas's
> "For one particular architecture and that's not a solution for generic code."
> 
> The following:
> #define array_access(base, idx, max) ({ \
> union { typeof(base[0]) _val; unsigned long _bit; } __u;\
> unsigned long _i = (idx);   \
> unsigned long _m = (max);   \
> unsigned long _mask = ~(long)(_m - 1 - _i) >> 63;   \
> __u._val = base[_i & _mask];\
> __u._bit &= _mask;  \
> __u._val; })
> 
> is generic and no speculative flows.

In the value speculation case imagine it's been called 1000 times for
process A which as a limit of say 16 so that file->priv->max is 16, and
then is run for process B which is different.

A value speculating processor waiting for file->priv->max which has been
pushed out of cache by an attacker is at liberty to say 'I've no idea
what max is but hey it was 16 last time so lets plug 16 in and keep going"

So while the change in the mask computation is clever and (subject to
compiler cleverness) safe against guesses of which path will be taken I
don't think it's generically safe.

Unfortunately a lot of things we index are of different sizes as seen by
different tasks, or when passed different other index values so this does
matter.

> Even if 'mask' in 'index & mask' example is a stall the educated
> guess will come from the prior value (according to the quoted paper)

Which might be for a different set of variables when the table is say per
process like file handles, or the value is different each call.

If we have single array of fixed size then I suspect you are right but
usually we don't.

Alan


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-06 Thread Alexei Starovoitov
On Sat, Jan 06, 2018 at 07:55:51PM +, Alan Cox wrote:
> > cpus execute what they see. speculative execution does the same
> > except results are not committed to visible registers and stay
> > in renanmed/shadow set. There is no 'undo' of the speculative execution.
> > The whole issue is that cache and branch predictor don't have
> > a shadow unlike registers.
> 
> Can I suggest you read something like "Exploitig Value Locaity to Exceed
> The Dataflow Limit" by Lipasti and Shen 1996.

thanks for the pointer.
A quote from above paper:
"Value prediction consists of predicting entire 32- and 64-bit register values
based  on  previously-seen values"

> In other words there are at least two problems with Linus proposal
> 
> 1. The / mask has to be generated and that has to involve
> speculative flows.

to answer above and Thomas's
"For one particular architecture and that's not a solution for generic code."

The following:
#define array_access(base, idx, max) ({ \
union { typeof(base[0]) _val; unsigned long _bit; } __u;\
unsigned long _i = (idx);   \
unsigned long _m = (max);   \
unsigned long _mask = ~(long)(_m - 1 - _i) >> 63;   \
__u._val = base[_i & _mask];\
__u._bit &= _mask;  \
__u._val; })

is generic and no speculative flows.

> 2. There are processors on the planet that may speculate not just what
> instruction to execute but faced with a stall on an input continue by
> using an educated guess at the value that will appear at the input in
> future.

correct. that's why earlier I mentioned that "if 'mask' cannot
be influenced by attacker".
Even if 'mask' in 'index & mask' example is a stall the educated
guess will come from the prior value (according to the quoted paper)

To be honest I haven't read that particular paper in the past, but
abstracts fits my understanding and this array_access() proposal.
Thanks for the pointer. Will read it fully to make sure
I didn't miss anything.



Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-06 Thread Alexei Starovoitov
On Sat, Jan 06, 2018 at 07:55:51PM +, Alan Cox wrote:
> > cpus execute what they see. speculative execution does the same
> > except results are not committed to visible registers and stay
> > in renanmed/shadow set. There is no 'undo' of the speculative execution.
> > The whole issue is that cache and branch predictor don't have
> > a shadow unlike registers.
> 
> Can I suggest you read something like "Exploitig Value Locaity to Exceed
> The Dataflow Limit" by Lipasti and Shen 1996.

thanks for the pointer.
A quote from above paper:
"Value prediction consists of predicting entire 32- and 64-bit register values
based  on  previously-seen values"

> In other words there are at least two problems with Linus proposal
> 
> 1. The / mask has to be generated and that has to involve
> speculative flows.

to answer above and Thomas's
"For one particular architecture and that's not a solution for generic code."

The following:
#define array_access(base, idx, max) ({ \
union { typeof(base[0]) _val; unsigned long _bit; } __u;\
unsigned long _i = (idx);   \
unsigned long _m = (max);   \
unsigned long _mask = ~(long)(_m - 1 - _i) >> 63;   \
__u._val = base[_i & _mask];\
__u._bit &= _mask;  \
__u._val; })

is generic and no speculative flows.

> 2. There are processors on the planet that may speculate not just what
> instruction to execute but faced with a stall on an input continue by
> using an educated guess at the value that will appear at the input in
> future.

correct. that's why earlier I mentioned that "if 'mask' cannot
be influenced by attacker".
Even if 'mask' in 'index & mask' example is a stall the educated
guess will come from the prior value (according to the quoted paper)

To be honest I haven't read that particular paper in the past, but
abstracts fits my understanding and this array_access() proposal.
Thanks for the pointer. Will read it fully to make sure
I didn't miss anything.



Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-06 Thread Alan Cox
> cpus execute what they see. speculative execution does the same
> except results are not committed to visible registers and stay
> in renanmed/shadow set. There is no 'undo' of the speculative execution.
> The whole issue is that cache and branch predictor don't have
> a shadow unlike registers.

Can I suggest you read something like "Exploitig Value Locaity to Exceed
The Dataflow Limit" by Lipasti and Shen 1996.

In other words there are at least two problems with Linus proposal

1. The / mask has to be generated and that has to involve
speculative flows.

2. There are processors on the planet that may speculate not just what
instruction to execute but faced with a stall on an input continue by
using an educated guess at the value that will appear at the input in
future.

> > > I think "lets sprinkle lfence everywhere" approach is going to
> > > cause serious performance degradation. Yet people pushing for lfence
> > > didn't present any numbers.  
> > 
> > Normally people who propose security fixes don't have to argue about the
> > fact they added 30 clocks to avoid your box being 0wned.
> > 
> > If you look at the patches very very few are in remotely hot paths, which
> > is good news. The ones in hot paths definitely need careful scrutiny but
> > the priority ought to be fixing and then optimizing unless its obvious
> > how to tackle it.  
> 
> fdtable, access_ok and networking are clearly hot, 

Those points are not when compared to a lot of paths executed far far
more (like packet processing). The ipv4/6 ones are mixing lfences and a
lot of stores so probably hurt with lfence. fdtable worries me
a lot less and is anyway clearly required.

I have reviewing the packet building ones on my TODO list for a reason.

Alan


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-06 Thread Alan Cox
> cpus execute what they see. speculative execution does the same
> except results are not committed to visible registers and stay
> in renanmed/shadow set. There is no 'undo' of the speculative execution.
> The whole issue is that cache and branch predictor don't have
> a shadow unlike registers.

Can I suggest you read something like "Exploitig Value Locaity to Exceed
The Dataflow Limit" by Lipasti and Shen 1996.

In other words there are at least two problems with Linus proposal

1. The / mask has to be generated and that has to involve
speculative flows.

2. There are processors on the planet that may speculate not just what
instruction to execute but faced with a stall on an input continue by
using an educated guess at the value that will appear at the input in
future.

> > > I think "lets sprinkle lfence everywhere" approach is going to
> > > cause serious performance degradation. Yet people pushing for lfence
> > > didn't present any numbers.  
> > 
> > Normally people who propose security fixes don't have to argue about the
> > fact they added 30 clocks to avoid your box being 0wned.
> > 
> > If you look at the patches very very few are in remotely hot paths, which
> > is good news. The ones in hot paths definitely need careful scrutiny but
> > the priority ought to be fixing and then optimizing unless its obvious
> > how to tackle it.  
> 
> fdtable, access_ok and networking are clearly hot, 

Those points are not when compared to a lot of paths executed far far
more (like packet processing). The ipv4/6 ones are mixing lfences and a
lot of stores so probably hurt with lfence. fdtable worries me
a lot less and is anyway clearly required.

I have reviewing the packet building ones on my TODO list for a reason.

Alan


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-06 Thread Thomas Gleixner
On Sat, 6 Jan 2018, Alexei Starovoitov wrote:

> On Sat, Jan 06, 2018 at 10:54:27AM -0800, Dan Williams wrote:
> > On Sat, Jan 6, 2018 at 10:39 AM, Alexei Starovoitov
> >  wrote:
> > [..]
> > >> retpoline is variant-2, this patch series is about variant-1.
> > >
> > > that's exactly the point. Don't slow down the kernel with lfences
> > > to solve variant 1. retpoline for 2 is ok from long term kernel
> > > viability perspective.
> > >
> > 
> > Setting aside that we still need to measure the impact of these
> > changes the end result will still be nospec_array_ptr() sprinkled in
> > various locations. So can we save the debate about what's inside that
> > macro on various architectures and at least proceed with annotating
> > the problematic locations? Perhaps we can go a step further and have a
> > config option to switch between the clever array_access() approach
> > from Linus that might be fine depending on the compiler, and the
> > cpu-vendor-recommended not to speculate implementation of
> > nospec_array_ptr().
> 
> recommended by panicing vendors who had no better ideas?
> Ohh, speculation is exploitable, let's stop speculation.
> Instead of fighting it we can safely steer it where it doesn't leak
> kernel data. AND approach is doing exactly that.

For one particular architecture and that's not a solution for generic
code.

Aside of that I fundamentally disagree with your purely performance
optimized argumentation. We need to make sure that we have a solution which
kills the problem safely and then take it from there. Correctness first,
optimization later is the rule for this. Better safe than sorry.

Thanks,

tglx




Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-06 Thread Thomas Gleixner
On Sat, 6 Jan 2018, Alexei Starovoitov wrote:

> On Sat, Jan 06, 2018 at 10:54:27AM -0800, Dan Williams wrote:
> > On Sat, Jan 6, 2018 at 10:39 AM, Alexei Starovoitov
> >  wrote:
> > [..]
> > >> retpoline is variant-2, this patch series is about variant-1.
> > >
> > > that's exactly the point. Don't slow down the kernel with lfences
> > > to solve variant 1. retpoline for 2 is ok from long term kernel
> > > viability perspective.
> > >
> > 
> > Setting aside that we still need to measure the impact of these
> > changes the end result will still be nospec_array_ptr() sprinkled in
> > various locations. So can we save the debate about what's inside that
> > macro on various architectures and at least proceed with annotating
> > the problematic locations? Perhaps we can go a step further and have a
> > config option to switch between the clever array_access() approach
> > from Linus that might be fine depending on the compiler, and the
> > cpu-vendor-recommended not to speculate implementation of
> > nospec_array_ptr().
> 
> recommended by panicing vendors who had no better ideas?
> Ohh, speculation is exploitable, let's stop speculation.
> Instead of fighting it we can safely steer it where it doesn't leak
> kernel data. AND approach is doing exactly that.

For one particular architecture and that's not a solution for generic
code.

Aside of that I fundamentally disagree with your purely performance
optimized argumentation. We need to make sure that we have a solution which
kills the problem safely and then take it from there. Correctness first,
optimization later is the rule for this. Better safe than sorry.

Thanks,

tglx




Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-06 Thread Dan Williams
On Sat, Jan 6, 2018 at 11:25 AM, Alexei Starovoitov
 wrote:
> On Sat, Jan 06, 2018 at 10:54:27AM -0800, Dan Williams wrote:
>> On Sat, Jan 6, 2018 at 10:39 AM, Alexei Starovoitov
>>  wrote:
>> [..]
>> >> retpoline is variant-2, this patch series is about variant-1.
>> >
>> > that's exactly the point. Don't slow down the kernel with lfences
>> > to solve variant 1. retpoline for 2 is ok from long term kernel
>> > viability perspective.
>> >
>>
>> Setting aside that we still need to measure the impact of these
>> changes the end result will still be nospec_array_ptr() sprinkled in
>> various locations. So can we save the debate about what's inside that
>> macro on various architectures and at least proceed with annotating
>> the problematic locations? Perhaps we can go a step further and have a
>> config option to switch between the clever array_access() approach
>> from Linus that might be fine depending on the compiler, and the
>> cpu-vendor-recommended not to speculate implementation of
>> nospec_array_ptr().
>
> recommended by panicing vendors who had no better ideas?
> Ohh, speculation is exploitable, let's stop speculation.
> Instead of fighting it we can safely steer it where it doesn't leak
> kernel data. AND approach is doing exactly that.
> It probably can be made independent of compiler choice to use setbe-like insn.

Right, when that 'probably' is 'certainly' for the architecture you
care about just update the nospec_array_ptr() definition at that
point.


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-06 Thread Dan Williams
On Sat, Jan 6, 2018 at 11:25 AM, Alexei Starovoitov
 wrote:
> On Sat, Jan 06, 2018 at 10:54:27AM -0800, Dan Williams wrote:
>> On Sat, Jan 6, 2018 at 10:39 AM, Alexei Starovoitov
>>  wrote:
>> [..]
>> >> retpoline is variant-2, this patch series is about variant-1.
>> >
>> > that's exactly the point. Don't slow down the kernel with lfences
>> > to solve variant 1. retpoline for 2 is ok from long term kernel
>> > viability perspective.
>> >
>>
>> Setting aside that we still need to measure the impact of these
>> changes the end result will still be nospec_array_ptr() sprinkled in
>> various locations. So can we save the debate about what's inside that
>> macro on various architectures and at least proceed with annotating
>> the problematic locations? Perhaps we can go a step further and have a
>> config option to switch between the clever array_access() approach
>> from Linus that might be fine depending on the compiler, and the
>> cpu-vendor-recommended not to speculate implementation of
>> nospec_array_ptr().
>
> recommended by panicing vendors who had no better ideas?
> Ohh, speculation is exploitable, let's stop speculation.
> Instead of fighting it we can safely steer it where it doesn't leak
> kernel data. AND approach is doing exactly that.
> It probably can be made independent of compiler choice to use setbe-like insn.

Right, when that 'probably' is 'certainly' for the architecture you
care about just update the nospec_array_ptr() definition at that
point.


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-06 Thread Alexei Starovoitov
On Sat, Jan 06, 2018 at 10:54:27AM -0800, Dan Williams wrote:
> On Sat, Jan 6, 2018 at 10:39 AM, Alexei Starovoitov
>  wrote:
> [..]
> >> retpoline is variant-2, this patch series is about variant-1.
> >
> > that's exactly the point. Don't slow down the kernel with lfences
> > to solve variant 1. retpoline for 2 is ok from long term kernel
> > viability perspective.
> >
> 
> Setting aside that we still need to measure the impact of these
> changes the end result will still be nospec_array_ptr() sprinkled in
> various locations. So can we save the debate about what's inside that
> macro on various architectures and at least proceed with annotating
> the problematic locations? Perhaps we can go a step further and have a
> config option to switch between the clever array_access() approach
> from Linus that might be fine depending on the compiler, and the
> cpu-vendor-recommended not to speculate implementation of
> nospec_array_ptr().

recommended by panicing vendors who had no better ideas?
Ohh, speculation is exploitable, let's stop speculation.
Instead of fighting it we can safely steer it where it doesn't leak
kernel data. AND approach is doing exactly that.
It probably can be made independent of compiler choice to use setbe-like insn.



Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-06 Thread Alexei Starovoitov
On Sat, Jan 06, 2018 at 10:54:27AM -0800, Dan Williams wrote:
> On Sat, Jan 6, 2018 at 10:39 AM, Alexei Starovoitov
>  wrote:
> [..]
> >> retpoline is variant-2, this patch series is about variant-1.
> >
> > that's exactly the point. Don't slow down the kernel with lfences
> > to solve variant 1. retpoline for 2 is ok from long term kernel
> > viability perspective.
> >
> 
> Setting aside that we still need to measure the impact of these
> changes the end result will still be nospec_array_ptr() sprinkled in
> various locations. So can we save the debate about what's inside that
> macro on various architectures and at least proceed with annotating
> the problematic locations? Perhaps we can go a step further and have a
> config option to switch between the clever array_access() approach
> from Linus that might be fine depending on the compiler, and the
> cpu-vendor-recommended not to speculate implementation of
> nospec_array_ptr().

recommended by panicing vendors who had no better ideas?
Ohh, speculation is exploitable, let's stop speculation.
Instead of fighting it we can safely steer it where it doesn't leak
kernel data. AND approach is doing exactly that.
It probably can be made independent of compiler choice to use setbe-like insn.



Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-06 Thread Dan Williams
On Sat, Jan 6, 2018 at 10:39 AM, Alexei Starovoitov
 wrote:
[..]
>> retpoline is variant-2, this patch series is about variant-1.
>
> that's exactly the point. Don't slow down the kernel with lfences
> to solve variant 1. retpoline for 2 is ok from long term kernel
> viability perspective.
>

Setting aside that we still need to measure the impact of these
changes the end result will still be nospec_array_ptr() sprinkled in
various locations. So can we save the debate about what's inside that
macro on various architectures and at least proceed with annotating
the problematic locations? Perhaps we can go a step further and have a
config option to switch between the clever array_access() approach
from Linus that might be fine depending on the compiler, and the
cpu-vendor-recommended not to speculate implementation of
nospec_array_ptr().


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-06 Thread Dan Williams
On Sat, Jan 6, 2018 at 10:39 AM, Alexei Starovoitov
 wrote:
[..]
>> retpoline is variant-2, this patch series is about variant-1.
>
> that's exactly the point. Don't slow down the kernel with lfences
> to solve variant 1. retpoline for 2 is ok from long term kernel
> viability perspective.
>

Setting aside that we still need to measure the impact of these
changes the end result will still be nospec_array_ptr() sprinkled in
various locations. So can we save the debate about what's inside that
macro on various architectures and at least proceed with annotating
the problematic locations? Perhaps we can go a step further and have a
config option to switch between the clever array_access() approach
from Linus that might be fine depending on the compiler, and the
cpu-vendor-recommended not to speculate implementation of
nospec_array_ptr().


  1   2   >