RE: [RFD] x86/split_lock: Request to Intel

2019-10-18 Thread hpa
On October 18, 2019 3:45:14 AM PDT, David Laight wrote:
>From: Luck, Tony
>> Sent: 18 October 2019 00:28
>...
>> 2) Fix set_bit() et al. to not issue atomic operations that cross
>> boundaries.
>> 
>> Fenghua had been pursuing option #1 in previous iterations. He found
>> a few more places with the help of the "grep" patterns suggested by
>> David Laight. So that path is up to ~8 patches now that do one of:
>>  + Change from u32 to u64
>>  + Force alignment with a union with a u64
>>  + Change to non-atomic (places that didn't need atomic)
>> 
>> Downside of strategy #1 is that people will add new misaligned cases
>> in the future. So this process has no defined end point.
>> 
>> Strategy #2 began when, looking at the split-lock issue, I saw that
>> with a constant bit argument set_bit() just does an "ORB" on the
>> affected byte (i.e. no split lock). Similar for clear_bit() and
>> change_bit(). Changing code to also do that for the variable bit case
>> is easy.
>> 
>> test_and_clear_bit() needs more care, but luckily, we had Peter Anvin
>> nearby to give us a neat solution.
>
>Changing the x86-64 bitops to use 32bit memory cycles is trivial
>(provided you are willing to accept a limit of 2G bits).
>
>OTOH this only works because x86 is LE.
>On any BE systems passing an 'int []' to any of the bit-functions is so
>terribly wrong it is unbelievable.
>
>So changing the x86-64 bitops is largely papering over a crack.
>
>In essence any code that casts the argument to any of the bitops
>functions is almost certainly badly broken on BE systems.
>
>The x86 cpu features code is always LE.
>It probably ought to have a typedef for a union of long [] and int [].
>
>   David
>

One thing I suggested is that we should actually expose the violations at 
compile time, either by wrapping them in macros using __alignof__ and/or by 
making the kernel compile with -Wcast-align.
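
A minimal sketch of the macro idea (the wrapper name is hypothetical, not
an existing kernel API):

#include <linux/bitops.h>
#include <linux/build_bug.h>

/*
 * Illustrative only: reject operands whose natural alignment is below
 * that of 'long' at compile time, instead of letting a cast hide a
 * future split lock at run time.
 */
#define set_bit_checked(nr, addr)					\
do {									\
	BUILD_BUG_ON(__alignof__(*(addr)) < __alignof__(long));	\
	set_bit((nr), (volatile unsigned long *)(addr));		\
} while (0)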

On x86 the btsl/btcl/btrl instructions can be used without limiting to 2 Gbit 
if the address is computed, the way one does for plain and, or, etc. However, 
if the real types of the arguments are exposed then it is possible to do better.
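
To illustrate the computed-address form (a sketch, not the kernel's actual
implementation):

#include <linux/types.h>

/*
 * Illustrative sketch: compute the 32-bit word address by hand so the
 * bit offset handed to btsl stays in 0..31 and the 2G-bit limit of the
 * register-offset form never applies.
 */
static inline void set_bit_computed(unsigned long nr, volatile void *addr)
{
	volatile u32 *word = (volatile u32 *)addr + (nr >> 5);

	asm volatile("lock btsl %1, %0"
		     : "+m" (*word)
		     : "Ir" ((u32)(nr & 31))
		     : "memory");
}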

Finally, as far as bigendian is concerned: the problem Linux has on bigendian 
machines is that it tries to use littleendian bitmaps on bigendian machines: on 
bigendian machines, bit 0 is naturally the MSB. If your reaction is "but that 
is absurd", then you have just grokked why bigendian is fundamentally broken.


RE: [RFD] x86/split_lock: Request to Intel

2019-10-18 Thread David Laight
From: Luck, Tony
> Sent: 18 October 2019 00:28
...
> 2) Fix set_bit() et al. to not issue atomic operations that cross boundaries.
> 
> Fenghua had been pursuing option #1 in previous iterations. He found a few
> more places with the help of the "grep" patterns suggested by David Laight.
> So that path is up to ~8 patches now that do one of:
>   + Change from u32 to u64
>   + Force alignment with a union with a u64
>   + Change to non-atomic (places that didn't need atomic)
> 
> Downside of strategy #1 is that people will add new misaligned cases in the
> future. So this process has no defined end point.
> 
> Strategy #2 began when, looking at the split-lock issue, I saw that with a
> constant bit argument set_bit() just does an "ORB" on the affected byte (i.e.
> no split lock). Similar for clear_bit() and change_bit(). Changing code to
> also do that for the variable bit case is easy.
> 
> test_and_clear_bit() needs more care, but luckily, we had Peter Anvin nearby
> to give us a neat solution.

Changing the x86-64 bitops to use 32bit memory cycles is trivial
(provided you are willing to accept a limit of 2G bits).

OTOH this only works because x86 is LE.
On any BE systems passing an 'int []' to any of the bit-functions is so
terribly wrong it is unbelievable.

So changing the x86-64 bitops is largely papering over a crack.

In essence any code that casts the argument to any of the bitops functions
is almost certainly badly broken on BE systems.

The x86 cpu features code is always LE.
It probably ought to have a typedef for a union of long [] and int [].
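
A sketch of what such a typedef might look like (the name is hypothetical;
NCAPINTS is the existing count of 32-bit capability words):

#include <linux/kernel.h>
#include <asm/cpufeatures.h>	/* NCAPINTS */

/*
 * Illustrative only, not current kernel code: the long[] member gives
 * the bitops the alignment they expect, while the int[] member keeps
 * the existing 32-bit accessors working (valid on LE only, as noted).
 */
typedef union {
	unsigned long	longs[DIV_ROUND_UP(NCAPINTS * 32, BITS_PER_LONG)];
	u32		ints[NCAPINTS];
} x86_capability_t;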

David




Re: [RFD] x86/split_lock: Request to Intel

2019-10-18 Thread Peter Zijlstra
On Fri, Oct 18, 2019 at 06:20:44PM +0800, Xiaoyao Li wrote:

> We enable #AC on all cores/threads to detect split lock.
>  - If user space causes an #AC, send SIGBUS to it.
>  - If the kernel causes an #AC, globally disable #AC on all cores/threads,
>    let the kernel keep running, and WARN. (Disabling #AC only on the thread
>    that generated it doesn't help, since the buggy kernel code can run on
>    any thread; hence it must be disabled on all of them.)
> 
> As described above, it is either enabled globally or disabled globally, so
> whether the MSR is per-core or per-thread really doesn't matter.

Go back and read the friggin' thread already. A big clue: virt ruins it
(like it tends to do).


Re: [RFD] x86/split_lock: Request to Intel

2019-10-18 Thread Xiaoyao Li

On 10/18/2019 5:02 PM, Thomas Gleixner wrote:
> On Fri, 18 Oct 2019, Xiaoyao Li wrote:
>> On 10/17/2019 8:29 PM, Thomas Gleixner wrote:
>>> The more I look at this trainwreck, the less interested I am in merging any
>>> of this at all.
>>>
>>> The fact that it took Intel more than a year to figure out that the MSR is
>>> per core and not per thread is yet another proof that this industry just
>>> works by pure chance.
>>
>> Whether it's per-core or per-thread doesn't much affect how we implement it
>> for host/native.
>
> How useful.

OK. IIUC, we can agree on a usage model for native like below:

We enable #AC on all cores/threads to detect split lock.
 - If user space causes an #AC, send SIGBUS to it.
 - If the kernel causes an #AC, globally disable #AC on all cores/threads,
   let the kernel keep running, and WARN. (Disabling #AC only on the thread
   that generated it doesn't help, since the buggy kernel code can run on
   any thread; hence it must be disabled on all of them.)

As described above, it is either enabled globally or disabled globally, so
whether the MSR is per-core or per-thread really doesn't matter.

>> And also, no matter whether it's per-core or per-thread, we can always do
>> something in VIRT.
>
> It matters a lot. If it would be per thread then we would not have this
> discussion at all.

Indeed, it's the fact that the control MSR bit is per-core that caused this
discussion. But the per-core scope mainly makes this feature difficult or
impossible to virtualize.

We could decide not to expose it to the guest to avoid the really bad
cases. However, even if we don't expose this feature to the guest and don't
virtualize it, the problem below remains.

If you think it's not a problem, and it's acceptable to add an option
letting KVM disable the host's #AC detection, we can just make it this way.
Then we can design the virtualization part without any change to the native
design at all.

>> Maybe what matters is below.
>>
>>> Seriously, this makes only sense when it's by default enabled and not
>>> rendered useless by VIRT. Otherwise we never get any reports and none of
>>> the issues are going to be fixed.
>>
>> For VIRT, it doesn't want an old guest to be killed due to #AC. But for
>> native, it doesn't want VIRT to disable the #AC detection.
>>
>> I think it's just about the default behavior: whether to disable the
>> host's #AC detection or kill the guest (SIGBUS or something else) once
>> there is a split-lock #AC in the guest.
>>
>> So we can provide a CONFIG option to set the default behavior and a module
>> parameter to let KVM set/change the default.
>
> Care to read through the whole discussion and figure out WHY it's not that
> simple?
>
> Thanks,
>
> 	tglx



Re: [RFD] x86/split_lock: Request to Intel

2019-10-18 Thread Thomas Gleixner
On Fri, 18 Oct 2019, Xiaoyao Li wrote:
> On 10/17/2019 8:29 PM, Thomas Gleixner wrote:
> > The more I look at this trainwreck, the less interested I am in merging any
> > of this at all.
> > 
> > The fact that it took Intel more than a year to figure out that the MSR is
> > per core and not per thread is yet another proof that this industry just
> > works by pure chance.
> > 
> 
> Whether it's per-core or per-thread doesn't much affect how we implement it
> for host/native.

How useful.

> And also, no matter whether it's per-core or per-thread, we can always do
> something in VIRT.

It matters a lot. If it would be per thread then we would not have this
discussion at all.

> Maybe what matters is below.
> 
> > Seriously, this makes only sense when it's by default enabled and not
> > rendered useless by VIRT. Otherwise we never get any reports and none of
> > the issues are going to be fixed.
> > 
> 
> For VIRT, it doesn't want an old guest to be killed due to #AC. But for
> native, it doesn't want VIRT to disable the #AC detection.
> 
> I think it's just about the default behavior: whether to disable the
> host's #AC detection or kill the guest (SIGBUS or something else) once
> there is a split-lock #AC in the guest.
> 
> So we can provide a CONFIG option to set the default behavior and a module
> parameter to let KVM set/change the default.

Care to read through the whole discussion and figure out WHY it's not that
simple?

Thanks,

tglx


Re: [RFD] x86/split_lock: Request to Intel

2019-10-17 Thread Xiaoyao Li

On 10/17/2019 8:29 PM, Thomas Gleixner wrote:
> The more I look at this trainwreck, the less interested I am in merging any
> of this at all.
>
> The fact that it took Intel more than a year to figure out that the MSR is
> per core and not per thread is yet another proof that this industry just
> works by pure chance.

Whether it's per-core or per-thread doesn't much affect how we implement it
for host/native.

And also, no matter whether it's per-core or per-thread, we can always do
something in VIRT.

Maybe what matters is below.

> Seriously, this makes only sense when it's by default enabled and not
> rendered useless by VIRT. Otherwise we never get any reports and none of
> the issues are going to be fixed.

For VIRT, it doesn't want an old guest to be killed due to #AC. But for
native, it doesn't want VIRT to disable the #AC detection.

I think it's just about the default behavior: whether to disable the
host's #AC detection or kill the guest (SIGBUS or something else) once
there is a split-lock #AC in the guest.

So we can provide a CONFIG option to set the default behavior and a module
parameter to let KVM set/change the default.
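
Purely as an illustration of that proposal (hypothetical names; nothing
like this was merged), the knob might look like:

#include <linux/kernel.h>
#include <linux/moduleparam.h>

/*
 * Hypothetical sketch: the Kconfig symbol picks the default behavior,
 * the module parameter lets KVM users override it.
 */
static bool split_lock_ac_forward = !IS_ENABLED(CONFIG_KVM_DISABLE_HOST_AC);
module_param(split_lock_ac_forward, bool, 0444);
MODULE_PARM_DESC(split_lock_ac_forward,
		 "Kill the guest on split-lock #AC instead of disabling host detection");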




Re: [RFD] x86/split_lock: Request to Intel

2019-10-17 Thread Sean Christopherson
On Thu, Oct 17, 2019 at 11:31:15PM +0200, Thomas Gleixner wrote:
> On Thu, 17 Oct 2019, Sean Christopherson wrote:
> > On Thu, Oct 17, 2019 at 02:29:45PM +0200, Thomas Gleixner wrote:
> > > The more I look at this trainwreck, the less interested I am in
> > > merging any of this at all.
> > > 
> > > The fact that it took Intel more than a year to figure out that the MSR is
> > > per core and not per thread is yet another proof that this industry just
> > > works by pure chance.
> > > 
> > > There is a simple way out of this misery:
> > > 
> > >   Intel issues a microcode update which does:
> > > 
> > > 1) Convert the OR logic of the AC enable bit in the TEST_CTRL MSR to
> > >AND logic, i.e. when one thread disables AC it's automatically
> > >disabled on the core.
> > > 
> >Alternatively it suppresses the #AC when the current thread has it
> >disabled.
> > 
> > 2) Provide a separate bit which indicates that the AC enable logic is
> >actually AND based or that #AC is suppressed when the current thread
> >has it disabled.
> > > 
> > > Which way I don't really care as long as it makes sense.
> > 
> > The #AC bit doesn't use OR-logic, it's straight up shared, i.e. writes on
> > one CPU are immediately visible on its sibling CPU.
> 
> That's less horrible than I read out of your initial explanation.
> 
> Thankfully all of this is meticulously documented in the SDM ...

Preaching to the choir on this one...

> Though it changes the picture radically. The truly shared MSR allows
> regular software synchronization without IPIs and without an insane amount
> of corner case handling.
> 
> So as you pointed out we need a per core state, which is influenced by:
> 
>  1) The global enablement switch
> 
>  2) Host induced #AC
> 
>  3) Guest induced #AC
> 
> A) Guest has #AC handling
> 
> B) Guest has no #AC handling
> 
> #1:
> 
>- OFF: #AC is globally disabled
> 
>- ON:  #AC is globally enabled
> 
>- FORCE: same as ON but #AC is enforced on guests
> 
> #2:
> 
>If the host triggers an #AC then the #AC has to be force disabled on the
>affected core independent of the state of #1. Nothing we can do about
>that and once the initial wave of #AC issues is fixed this should not
>happen on production systems. That disables #3 even for the #3.A case
>for simplicity's sake.
> 
> #3:
> 
>A) Guest has #AC handling
> 
>   #AC is forwarded to the guest. No further action is required aside from
>   accounting.
> 
>B) Guest has no #AC handling
> 
>   If #AC triggers the resulting action depends on the state of #1:
> 
>- FORCE: Guest is killed with SIGBUS or whatever the virt crowd
>  thinks is the appropriate solution
> 
>- ON: #AC triggered state is recorded per vCPU and the MSR is
>  toggled on VMENTER/VMEXIT in software from that point on.
>
> So the only interesting case is #3.B and #1.state == ON. There you need
> serialization of the state and the MSR write between the cores, but only
> when the vCPU triggered an #AC. Until then, nothing to do.

And "vCPU triggered an #AC" should include an explicit check in KVM's
emulator.

> vmenter()
> {
> 	if (vcpu->ac_disable)
> 		this_core_disable_ac();
> }
> 
> vmexit()
> {
> 	if (vcpu->ac_disable)
> 		this_core_enable_ac();
> }
> 
> this_core_dis/enable_ac() takes the global state into account and has the
> necessary serialization in place.

Overall, looks good to me.  Although Tony's mail makes it obvious we need
to sync internally...


RE: [RFD] x86/split_lock: Request to Intel

2019-10-17 Thread Luck, Tony
> If that's not going to happen, then we just bury the whole thing and put it
> on hold until a sane implementation of that functionality surfaces in
> silicon some day in the not so foreseeable future.

We will drop the patches to flip the MSR bits to enable checking.

But we can fix the split lock issues that have already been found in the kernel.

Two strategies:

1) Adjust alignments of arrays passed to set_bit() et al.

2) Fix set_bit() et al. to not issue atomic operations that cross boundaries.

Fenghua had been pursuing option #1 in previous iterations. He found a few
more places with the help of the "grep" patterns suggested by David Laight.
So that path is up to ~8 patches now that do one of:
+ Change from u32 to u64
+ Force alignment with a union with a u64 (see the sketch below)
+ Change to non-atomic (places that didn't need atomic)
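
For the union pattern, a minimal sketch (the struct and field names are
made up for illustration):

#include <linux/types.h>

/*
 * Illustrative only: the unused u64 member forces 8-byte alignment, so
 * an atomic bitop on 'flags' can never straddle an 8-byte boundary,
 * and therefore never a cacheline boundary.
 */
struct foo {
	union {
		u32 flags[2];	/* accessed via set_bit() and friends */
		u64 __align;	/* never accessed, only forces alignment */
	};
};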

Downside of strategy #1 is that people will add new misaligned cases in the
future. So this process has no defined end point.

Strategy #2 began when, looking at the split-lock issue, I saw that with a
constant bit argument set_bit() just does an "ORB" on the affected byte (i.e.
no split lock). Similar for clear_bit() and change_bit(). Changing code to
also do that for the variable bit case is easy.
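
A sketch of that idea for the variable-bit case (illustrative, not the
posted patch):

#include <linux/types.h>

/*
 * Illustrative sketch: address the byte that contains the bit, so the
 * LOCK'd operation is a single-byte access and can never cross a
 * cacheline boundary.
 */
static inline void set_bit_byteop(long nr, volatile unsigned long *addr)
{
	volatile u8 *byte = (volatile u8 *)addr + (nr >> 3);

	asm volatile("lock orb %1, %0"
		     : "+m" (*byte)
		     : "iq" ((u8)(1 << (nr & 7)))
		     : "memory");
}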

test_and_clear_bit() needs more care, but luckily, we had Peter Anvin nearby
to give us a neat solution.
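
One conceivable shape for such a fix (a guess, not necessarily the
solution referenced above) is a byte-wide cmpxchg loop, so the locked
access again stays within a single byte:

#include <linux/atomic.h>
#include <linux/compiler.h>
#include <linux/types.h>

static inline bool test_and_clear_bit_byteop(long nr, volatile unsigned long *addr)
{
	volatile u8 *byte = (volatile u8 *)addr + (nr >> 3);
	u8 mask = 1 << (nr & 7);
	u8 old;

	/* Retry until the byte is updated without interference. */
	do {
		old = READ_ONCE(*byte);
	} while (cmpxchg(byte, old, old & ~mask) != old);

	return old & mask;
}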

So strategy #2 is being tried now (and Fenghua will post some patches
soon).

Strategy #2 does increase code size when the bit number argument isn't
a constant. But that isn't the common case (Fenghua is counting and will
give numbers when patches are ready).

So take a look at the option #2 patches when they are posted. If the code
size increase is unacceptable, we can go back to fixing each of the callers
to get alignment right.

-Tony




Re: [RFD] x86/split_lock: Request to Intel

2019-10-17 Thread Thomas Gleixner
On Thu, 17 Oct 2019, Sean Christopherson wrote:
> On Thu, Oct 17, 2019 at 02:29:45PM +0200, Thomas Gleixner wrote:
> > The more I look at this trainwreck, the less interested I am in merging any
> > of this at all.
> > 
> > The fact that it took Intel more than a year to figure out that the MSR is
> > per core and not per thread is yet another proof that this industry just
> > works by pure chance.
> > 
> > There is a simple way out of this misery:
> > 
> >   Intel issues a microcode update which does:
> > 
> > 1) Convert the OR logic of the AC enable bit in the TEST_CTRL MSR to
> >AND logic, i.e. when one thread disables AC it's automatically
> >disabled on the core.
> > 
> >Alternatively it suppresses the #AC when the current thread has it
> >disabled.
> > 
> > 2) Provide a separate bit which indicates that the AC enable logic is
> >actually AND based or that #AC is suppressed when the current thread
> >has it disabled.
> > 
> > Which way I don't really care as long as it makes sense.
> 
> The #AC bit doesn't use OR-logic, it's straight up shared, i.e. writes on
> one CPU are immediately visible on its sibling CPU.

That's less horrible than I read out of your initial explanation.

Thankfully all of this is meticulously documented in the SDM ...

Though it changes the picture radically. The truly shared MSR allows
regular software synchronization without IPIs and without an insane amount
of corner case handling.

So as you pointed out we need a per core state, which is influenced by:

 1) The global enablement switch

 2) Host induced #AC

 3) Guest induced #AC

A) Guest has #AC handling

B) Guest has no #AC handling

#1:

   - OFF: #AC is globally disabled

   - ON:  #AC is globally enabled

   - FORCE: same as ON but #AC is enforced on guests

#2:

   If the host triggers an #AC then the #AC has to be force disabled on the
   affected core independent of the state of #1. Nothing we can do about
   that and once the initial wave of #AC issues is fixed this should not
   happen on production systems. That disables #3 even for the #3.A case
   for simplicity's sake.

#3:

   A) Guest has #AC handling

  #AC is forwarded to the guest. No further action is required aside from
  accounting.

   B) Guest has no #AC handling

  If #AC triggers the resulting action depends on the state of #1:

     - FORCE: Guest is killed with SIGBUS or whatever the virt crowd
       thinks is the appropriate solution

     - ON: #AC triggered state is recorded per vCPU and the MSR is
       toggled on VMENTER/VMEXIT in software from that point on.

So the only interesting case is #3.B and #1.state == ON. There you need
serialization of the state and the MSR write between the cores, but only
when the vCPU triggered an #AC. Until then, nothing to do.

vmenter()
{
	if (vcpu->ac_disable)
		this_core_disable_ac();
}

vmexit()
{
	if (vcpu->ac_disable)
		this_core_enable_ac();
}

this_core_dis/enable_ac() takes the global state into account and has the
necessary serialization in place.

Thanks,

tglx


Re: [RFD] x86/split_lock: Request to Intel

2019-10-17 Thread Sean Christopherson
On Thu, Oct 17, 2019 at 02:29:45PM +0200, Thomas Gleixner wrote:
> The more I look at this trainwreck, the less interested I am in merging any
> of this at all.
> 
> The fact that it took Intel more than a year to figure out that the MSR is
> per core and not per thread is yet another proof that this industry just
> works by pure chance.
> 
> There is a simple way out of this misery:
> 
>   Intel issues a microcode update which does:
> 
> 1) Convert the OR logic of the AC enable bit in the TEST_CTRL MSR to
>AND logic, i.e. when one thread disables AC it's automatically
>disabled on the core.
> 
>Alternatively it suppresses the #AC when the current thread has it
>disabled.
> 
> 2) Provide a separate bit which indicates that the AC enable logic is
>actually AND based or that #AC is suppressed when the current thread
>has it disabled.
> 
> Which way I don't really care as long as it makes sense.

The #AC bit doesn't use OR-logic, it's straight up shared, i.e. writes on
one CPU are immediately visible on its sibling CPU.  It doesn't magically
solve the problem, but I don't think we need IPIs to coordinate between
siblings, e.g. wouldn't something like this work?  The per-cpu things
here are pointers shared by both siblings.

void split_lock_disable(void)
{
	spinlock_t *ac_lock = this_cpu_ptr(split_lock_ac_lock);

	spin_lock(ac_lock);
	if (this_cpu_inc_return(*split_lock_ac_disabled) == 1)
		WRMSR(RDMSR() & ~bit);	/* pseudocode: clear the #AC enable bit */
	spin_unlock(ac_lock);
}

void split_lock_enable(void)
{
	spinlock_t *ac_lock = this_cpu_ptr(split_lock_ac_lock);

	spin_lock(ac_lock);
	if (this_cpu_dec_return(*split_lock_ac_disabled) == 0)
		WRMSR(RDMSR() | bit);	/* pseudocode: set the #AC enable bit */
	spin_unlock(ac_lock);
}


To avoid the spin_lock and WRMSR latency on every VM-Enter and VM-Exit,
actions (3a) and (4a) from your matrix (copied below) could be changed to
only do split_lock_disable() if the guest actually generates an #AC, and
then do split_lock_enable() on the next VM-Exit.  Assuming even legacy
guests are somewhat sane and rarely do split-locks, lazily disabling the
control would eliminate most of the overhead and would also reduce the
time that the sibling CPU is running in the host without #AC protection.


 N | #AC       | #AC enabled | SMT | Ctrl    | Guest | Action
 R | available | on host     |     | exposed | #AC   |
---|-----------|-------------|-----|---------|-------|---------------------------
 0 | N         | x           |  x  |    N    |   x   | None
   |           |             |     |         |       |
 1 | Y         | N           |  x  |    N    |   x   | None
   |           |             |     |         |       |
 2 | Y         | Y           |  x  |    Y    |   Y   | Forward to guest
   |           |             |     |         |       |
 3 | Y         | Y           |  N  |    Y    |   N   | A) Store in vCPU and
   |           |             |     |         |       |    toggle on VMENTER/EXIT
   |           |             |     |         |       | B) SIGBUS or KVM exit code
   |           |             |     |         |       |
 4 | Y         | Y           |  Y  |    Y    |   N   | A) Disable globally on
   |           |             |     |         |       |    host. Store in vCPU/guest
   |           |             |     |         |       |    state and eventually
   |           |             |     |         |       |    re-enable when the guest
   |           |             |     |         |       |    goes away.
   |           |             |     |         |       | B) SIGBUS or KVM exit code
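
A sketch of that lazy toggling (illustrative only; the structure and hook
names are hypothetical, building on split_lock_disable()/_enable() above):

struct vcpu_ac_state {
	bool ac_off;	/* set the first time this vCPU trips an #AC */
};

/* Called when a guest without #AC handling takes a split-lock #AC. */
static void guest_split_lock_ac(struct vcpu_ac_state *v)
{
	v->ac_off = true;
}

static void vmenter_ac(struct vcpu_ac_state *v)
{
	/* Run the known-bad legacy guest with detection off on this core. */
	if (v->ac_off)
		split_lock_disable();
}

static void vmexit_ac(struct vcpu_ac_state *v)
{
	/* Restore host-side #AC detection as soon as we are back. */
	if (v->ac_off)
		split_lock_enable();
}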


> If that's not going to happen, then we just bury the whole thing and put it
> on hold until a sane implementation of that functionality surfaces in
> silicon some day in the not so foreseeable future.
> 
> Seriously, this makes only sense when it's by default enabled and not
> rendered useless by VIRT. Otherwise we never get any reports and none of
> the issues are going to be fixed.
> 
> Thanks,
> 
>   tglx