Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use

2020-11-03 Thread Bae, Chang Seok
On Tue, 2020-11-03 at 13:41 -0800, Dave Hansen wrote:
> On 11/3/20 1:32 PM, Bae, Chang Seok wrote:
> > On Wed, 2020-10-14 at 09:29 -0700, Dave Hansen wrote:
> > > On 10/14/20 9:10 AM, Andy Lutomirski wrote:
> > > > I was suggesting something a little bit different.  We'd keep XMM,
> > > > YMM, ZMM, etc state stored exactly the way we do now and, for
> > > > AMX-using tasks, we would save the AMX state in an entirely
> > > > separate
> > > > buffer.  This way the pain of having a variable xstate layout is
> > > > confined just to AMX tasks.
> > > 
> > > OK, got it.
> > > 
> > > So, we'd either need a second set of XSAVE/XRSTORs, or "manual"
> > > copying
> > > of the registers out to memory.  We can preserve the modified
> > > optimization if we're careful about ordering, but only for *ONE* of
> > > the
> > > XSAVE buffers (if we use two).
> > 
> > For what is worth,
> > 
> > If using two buffers, the buffer for saving the tile data also needs
> > space
> > for the legacy states.
> 
> Just to be clear, you're talking about the 512-byte 'struct
> fxregs_state' which is the first member of 'struct xregs_state', right?

Yes.

> 
> > The AMX state is stored at the end of the XSAVE buffer (at least for
> > now).
> > So, the layout (in terms of offsets of non-AMX states) won't be changed
> > at
> > run-time.
> 
> I don't know what point you are trying to make here.

I was trying to clarify the concern that Andy had:

"the pain of having a variable xstate layout"

Thanks,
Chang


Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use

2020-11-03 Thread Dave Hansen
On 11/3/20 1:32 PM, Bae, Chang Seok wrote:
> On Wed, 2020-10-14 at 09:29 -0700, Dave Hansen wrote:
>> On 10/14/20 9:10 AM, Andy Lutomirski wrote:
>>> I was suggesting something a little bit different.  We'd keep XMM,
>>> YMM, ZMM, etc state stored exactly the way we do now and, for
>>> AMX-using tasks, we would save the AMX state in an entirely separate
>>> buffer.  This way the pain of having a variable xstate layout is
>>> confined just to AMX tasks.
>> OK, got it.
>>
>> So, we'd either need a second set of XSAVE/XRSTORs, or "manual" copying
>> of the registers out to memory.  We can preserve the modified
>> optimization if we're careful about ordering, but only for *ONE* of the
>> XSAVE buffers (if we use two).
> For what is worth,
> 
> If using two buffers, the buffer for saving the tile data also needs space
> for the legacy states.

Just to be clear, you're talking about the 512-byte 'struct
fxregs_state' which is the first member of 'struct xregs_state', right?

Basically, any two-buffer model wastes 576 bytes of space (legacy fxsave
plus xsave header) for each task that uses two buffers.  This provides
an overall space savings unless there are something like 16x more
TMUL-using tasks than non-TMUL-using tasks.

We could eliminate the 576 bytes of waste, but at the *performance* cost
of having a permanently (non-task_struct-embedded) out-of-line XSAVE
buffer for all tasks.

> The AMX state is stored at the end of the XSAVE buffer (at least for now).
> So, the layout (in terms of offsets of non-AMX states) won't be changed at
> run-time.

I don't know what point you are trying to make here.


Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use

2020-11-03 Thread Bae, Chang Seok
On Wed, 2020-10-14 at 09:29 -0700, Dave Hansen wrote:
> On 10/14/20 9:10 AM, Andy Lutomirski wrote:
> > 
> > I was suggesting something a little bit different.  We'd keep XMM,
> > YMM, ZMM, etc state stored exactly the way we do now and, for
> > AMX-using tasks, we would save the AMX state in an entirely separate
> > buffer.  This way the pain of having a variable xstate layout is
> > confined just to AMX tasks.
> 
> OK, got it.
> 
> So, we'd either need a second set of XSAVE/XRSTORs, or "manual" copying
> of the registers out to memory.  We can preserve the modified
> optimization if we're careful about ordering, but only for *ONE* of the
> XSAVE buffers (if we use two).

For what is worth,

If using two buffers, the buffer for saving the tile data also needs space
for the legacy states.

The AMX state is stored at the end of the XSAVE buffer (at least for now).
So, the layout (in terms of offsets of non-AMX states) won't be changed at
run-time.

Thanks,
Chang


Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use

2020-10-14 Thread Dave Hansen
On 10/14/20 9:10 AM, Andy Lutomirski wrote:
>> Actually, I think the modified optimization would survive such a scheme:
>>
>>  * copy page array into percpu area
>>  * XRSTORS from percpu area, modified optimization tuple is saved
>>  * run userspace
>>  * XSAVES back to percpu area.  tuple matches, modified optimization
>>is still in play
>>  * copy percpu area back to page array
>>
>> Since the XRSTORS->XSAVES pair is both done to the percpu area, the
>> XSAVE tracking hardware never knows it isn't working on the "canonical"
>> buffer (the page array).
> I was suggesting something a little bit different.  We'd keep XMM,
> YMM, ZMM, etc state stored exactly the way we do now and, for
> AMX-using tasks, we would save the AMX state in an entirely separate
> buffer.  This way the pain of having a variable xstate layout is
> confined just to AMX tasks.

OK, got it.

So, we'd either need a second set of XSAVE/XRSTORs, or "manual" copying
of the registers out to memory.  We can preserve the modified
optimization if we're careful about ordering, but only for *ONE* of the
XSAVE buffers (if we use two).

> I'm okay with vmalloc() too, but I do think we need to deal with the
> various corner cases like allocation failing.

Yeah, agreed about handling the corner cases.  Also, if we preserve
plain old vmalloc() for now, we need good tracepoints or stats so we can
precisely figure out how many vmalloc()s (and IPIs) are due to AMX.


Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use

2020-10-14 Thread Andy Lutomirski
On Tue, Oct 13, 2020 at 11:03 PM Dave Hansen  wrote:
>
> On 10/13/20 6:11 PM, Andy Lutomirski wrote:
> > I have no problem with vmalloc(), but I do have a problem with
> > vfree() due to the IPIs that result. We need a cache or something.
>
> This sounds like the kind of thing we should just build into vmalloc()
> instead of having a bunch of callers implement their own caches.  It
> shouldn't be too much of a challenge to have vmalloc() keep a cacheline
> or two of stats about common vmalloc() sizes and keep some pools around.
>
> It's not going to be hard to implement caches to reduce vfree()-induced
> churn, but I'm having a hard time imaging that it'll have anywhere near
> the benefits that it did for stacks.  Tasks fundamentally come and go a
> *lot*, and those paths are hot.
>
> Tasks who go to the trouble to populate 8k or 64k of register state
> fundamentally *can't* come and go frequently.  We also (probably) don't
> have to worry about AMX tasks doing fork()/exec() pairs and putting
> pressure on vmalloc().  Before an app can call out to library functions
> to do the fork, they've got to save the state off themselves and likely
> get it back to the init state.  The fork() code can tell AMX is in the
> init state and decline to allocate actual space for the child.
>
> > I have to say: this mechanism is awful. Can we get away with skipping
> > the dynamic XSAVES mess entirely?  What if we instead allocate
> > however much space we need as an array of pages and have one percpu
> > contiguous region. To save, we XSAVE(S or C) just the AMX state to
> > the percpu area and then copy it.  To restore, we do the inverse.  Or
> > would this kill the modified optimization and thus be horrible?
>
> Actually, I think the modified optimization would survive such a scheme:
>
>  * copy page array into percpu area
>  * XRSTORS from percpu area, modified optimization tuple is saved
>  * run userspace
>  * XSAVES back to percpu area.  tuple matches, modified optimization
>is still in play
>  * copy percpu area back to page array
>
> Since the XRSTORS->XSAVES pair is both done to the percpu area, the
> XSAVE tracking hardware never knows it isn't working on the "canonical"
> buffer (the page array).

I was suggesting something a little bit different.  We'd keep XMM,
YMM, ZMM, etc state stored exactly the way we do now and, for
AMX-using tasks, we would save the AMX state in an entirely separate
buffer.  This way the pain of having a variable xstate layout is
confined just to AMX tasks.

I'm okay with vmalloc() too, but I do think we need to deal with the
various corner cases like allocation failing.


Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use

2020-10-14 Thread Peter Zijlstra
On Tue, Oct 13, 2020 at 03:43:59PM -0700, Dave Hansen wrote:
> On 10/13/20 3:31 PM, Brown, Len wrote:
> > vmalloc() does not fail, and does not return an error, and so there is no 
> > concept
> > of returning a signal.
> 
> Well, the order-0 allocations are no-fail, as are the vmalloc kernel
> structures and the page tables that might have to be allocated.  But,
> that's not guaranteed to be in place *forever*.  I think we still need
> to check for and handle allocation failures, even if they're not known
> to be possible today.

Quite, on top of that, we could run out of vmalloc space (unlikely, but
sitll).

You really have to deal with vmalloc() failing.


Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use

2020-10-14 Thread Andy Lutomirski


> On Oct 13, 2020, at 3:31 PM, Brown, Len  wrote:
> 
> 
>> 
>> From: Andy Lutomirski  
> 
>>> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
>>> @@ -518,3 +518,40 @@ int fpu__exception_code(struct fpu *fpu, int trap_nr)
> ..
>>> +bool xfirstuse_event_handler(struct fpu *fpu)
>>> +{
>>> +   bool handled = false;
>>> +   u64 event_mask;
>>> +
>>> +   /* Check whether the first-use detection is running. */
>>> +   if (!static_cpu_has(X86_FEATURE_XFD) || !xfirstuse_enabled())
>>> +   return handled;
>>> +
> 
>> MSR_IA32_XFD_ERR needs to be wired up in the exception handler, not in
>> some helper called farther down the stack
> 
> xfirstuse_event_handler() is called directly from the IDTENTRY 
> exc_device_not_available():
> 
>>> @@ -1028,6 +1028,9 @@ DEFINE_IDTENTRY(exc_device_not_available)
>>> {
>>>unsigned long cr0 = read_cr0();
>>> 
>>> +if (xfirstuse_event_handler(>thread.fpu))
>>> +return;
> 
> Are you suggesting we should instead open code it inside that routine?

MSR_IA32_XFD_ERR is like CR2 and DR6 — it’s functionally a part of the 
exception. Let’s handle it as such.  (And, a bit like DR6, it’s a bit broken.)

> 
>> But this raises an interesting point -- what happens if allocation
>> fails?  I think that, from kernel code, we simply cannot support this
>> exception mechanism.  If kernel code wants to use AMX (and that would
>> be very strange indeed), it should call x86_i_am_crazy_amx_begin() and
>> handle errors, not rely on exceptions.  From user code, I assume we
>> send a signal if allocation fails.
> 
> The XFD feature allows us to transparently expand the kernel context switch 
> buffer
> for a user task, when that task first touches this associated hardware.
> It allows applications to operate as if the kernel had allocated the backing 
> AMX
> context switch buffer at initialization time.  However, since we expect only
> a sub-set of tasks to actually use AMX, we instead defer allocation until
> we actually see first use for that task, rather than allocating for all tasks.

I sure hope that not all tasks use it. Context-switching it will be absurdly 
expensive.

> 
> While we currently don't propose AMX use inside the kernel, it is conceivable
> that could be done in the future, just like AVX is used by the RAID code;
> and it would be done the same way -- kernel_fpu_begin()/kernel_fpu_end().
> Such future Kernel AMX use would _not_ arm XFD, and would _not_ provoke this 
> fault.
> (note that we context switch the XFD-armed state per task)

How expensive is *that*?  Can you give approximate cycle counts for saving, 
restoring, arming and disarming?

This reminds me of TS.  Writing TS was more expensive than saving the whole FPU 
state.  And, for kernel code, we can’t just “not arm” the XFD — we would have 
to disarm it.



> 
> vmalloc() does not fail, and does not return an error, and so there is no 
> concept
> of returning a signal.  If we got to the point where vmalloc() sleeps, then 
> the system
> has bigger OOM issues, and the OOM killer would be on the prowl.
> 
> If we were concerned about using vmalloc for a couple of pages in the task 
> structure,
> Then we could implement a routine to harvest unused buffers and free them

Kind of like we vmalloc a couple pages for kernel stacks, and we carefully 
cache them.  And we handle failure gracefully.

Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use

2020-10-14 Thread Andy Lutomirski


> On Oct 13, 2020, at 3:44 PM, Dave Hansen  wrote:
> 
> On 10/13/20 3:31 PM, Brown, Len wrote:
>> vmalloc() does not fail, and does not return an error, and so there is no 
>> concept
>> of returning a signal.
> 
> Well, the order-0 allocations are no-fail, as are the vmalloc kernel
> structures and the page tables that might have to be allocated.  But,
> that's not guaranteed to be in place *forever*.  I think we still need
> to check for and handle allocation failures, even if they're not known
> to be possible today.
> 
>> If we got to the point where vmalloc() sleeps, then the system
>> has bigger OOM issues, and the OOM killer would be on the prowl.
> 
> vmalloc() can *certainly* sleep.  Allocation failures mean returning
> NULL from the allocator, and the very way we avoid doing that is by
> sleeping to go reclaim some memory from some other allocation.
> 
> Sleeping is a normal and healthy part of handling allocation requests,
> including vmalloc().
> 
>> If we were concerned about using vmalloc for a couple of pages in the task 
>> structure,
>> Then we could implement a routine to harvest unused buffers and free them --
>> but that didn't seem worth the complexity.  Note that this feature is 64-bit 
>> only.
> 
> IMNHO, vmalloc() is overkill for ~10k, which is roughly the size of the
> XSAVE buffer for the first AMX implementation.  But, it's not overkill
> for the ~66k of space that will be needed if some CPU implementation
> comes along and uses all of the architectural space AMX provides.

I have no problem with vmalloc(), but I do have a problem with vfree() due to 
the IPIs that result. We need a cache or something.

I have to say: this mechanism is awful. Can we get away with skipping the 
dynamic XSAVES mess entirely?  What if we instead allocate however much space 
we need as an array of pages and have one percpu contiguous region. To save, we 
XSAVE(S or C) just the AMX state to the percpu area and then copy it.  To 
restore, we do the inverse.  Or would this kill the modified optimization and 
thus be horrible?

Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use

2020-10-14 Thread Dave Hansen
On 10/13/20 6:11 PM, Andy Lutomirski wrote:
> I have no problem with vmalloc(), but I do have a problem with
> vfree() due to the IPIs that result. We need a cache or something.

This sounds like the kind of thing we should just build into vmalloc()
instead of having a bunch of callers implement their own caches.  It
shouldn't be too much of a challenge to have vmalloc() keep a cacheline
or two of stats about common vmalloc() sizes and keep some pools around.

It's not going to be hard to implement caches to reduce vfree()-induced
churn, but I'm having a hard time imaging that it'll have anywhere near
the benefits that it did for stacks.  Tasks fundamentally come and go a
*lot*, and those paths are hot.

Tasks who go to the trouble to populate 8k or 64k of register state
fundamentally *can't* come and go frequently.  We also (probably) don't
have to worry about AMX tasks doing fork()/exec() pairs and putting
pressure on vmalloc().  Before an app can call out to library functions
to do the fork, they've got to save the state off themselves and likely
get it back to the init state.  The fork() code can tell AMX is in the
init state and decline to allocate actual space for the child.

> I have to say: this mechanism is awful. Can we get away with skipping
> the dynamic XSAVES mess entirely?  What if we instead allocate
> however much space we need as an array of pages and have one percpu
> contiguous region. To save, we XSAVE(S or C) just the AMX state to
> the percpu area and then copy it.  To restore, we do the inverse.  Or
> would this kill the modified optimization and thus be horrible?

Actually, I think the modified optimization would survive such a scheme:

 * copy page array into percpu area
 * XRSTORS from percpu area, modified optimization tuple is saved
 * run userspace
 * XSAVES back to percpu area.  tuple matches, modified optimization
   is still in play
 * copy percpu area back to page array

Since the XRSTORS->XSAVES pair is both done to the percpu area, the
XSAVE tracking hardware never knows it isn't working on the "canonical"
buffer (the page array).

It seems a bit ugly, but it's not like an 8k memcpy() is *that* painful.
 The double dcache footprint isn't super nice, though.

Chang, I think that leaves us with three possibilities:

1. use plain old vmalloc()
2. use vmalloc(), but implement caches for large XSAVE state, either in
   front of, or inside the vmalloc() API
3. Use a static per-cpu buffer for XSAVE*/XRSTOR* and memcpy() to/from
   a scattered list of pages.

A #4 also comes to mind:

Do something somewhat like kmap_atomic().  Map the scattered pages
contiguously and use XSAVE*/XRSTOR* directly, but unmap immediately
after XSAVE*/XRSTOR*.  For ~12k of state, I suspect the ~400-500 cycles
for 3 INVLPG's might beat out a memcpy() of 12k of state.

Global TLB invalidations would not be required.

I have to say, though, that I'd be OK with sticking with plain old
vmalloc() for now.


Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use

2020-10-13 Thread Dave Hansen
On 10/13/20 3:31 PM, Brown, Len wrote:
> vmalloc() does not fail, and does not return an error, and so there is no 
> concept
> of returning a signal.

Well, the order-0 allocations are no-fail, as are the vmalloc kernel
structures and the page tables that might have to be allocated.  But,
that's not guaranteed to be in place *forever*.  I think we still need
to check for and handle allocation failures, even if they're not known
to be possible today.

> If we got to the point where vmalloc() sleeps, then the system
> has bigger OOM issues, and the OOM killer would be on the prowl.

vmalloc() can *certainly* sleep.  Allocation failures mean returning
NULL from the allocator, and the very way we avoid doing that is by
sleeping to go reclaim some memory from some other allocation.

Sleeping is a normal and healthy part of handling allocation requests,
including vmalloc().

> If we were concerned about using vmalloc for a couple of pages in the task 
> structure,
> Then we could implement a routine to harvest unused buffers and free them --
> but that didn't seem worth the complexity.  Note that this feature is 64-bit 
> only.

IMNHO, vmalloc() is overkill for ~10k, which is roughly the size of the
XSAVE buffer for the first AMX implementation.  But, it's not overkill
for the ~66k of space that will be needed if some CPU implementation
comes along and uses all of the architectural space AMX provides.


RE: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use

2020-10-13 Thread Brown, Len
> From: Andy Lutomirski  

> > diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> > @@ -518,3 +518,40 @@ int fpu__exception_code(struct fpu *fpu, int trap_nr)
..
> > +bool xfirstuse_event_handler(struct fpu *fpu)
> > +{
> > +   bool handled = false;
> > +   u64 event_mask;
> > +
> > +   /* Check whether the first-use detection is running. */
> > +   if (!static_cpu_has(X86_FEATURE_XFD) || !xfirstuse_enabled())
> > +   return handled;
> > +

> MSR_IA32_XFD_ERR needs to be wired up in the exception handler, not in
> some helper called farther down the stack

xfirstuse_event_handler() is called directly from the IDTENTRY 
exc_device_not_available():

> > @@ -1028,6 +1028,9 @@ DEFINE_IDTENTRY(exc_device_not_available)
> >  {
> > unsigned long cr0 = read_cr0();
> >  
> > +   if (xfirstuse_event_handler(>thread.fpu))
> > +   return;

Are you suggesting we should instead open code it inside that routine?

> But this raises an interesting point -- what happens if allocation
> fails?  I think that, from kernel code, we simply cannot support this
> exception mechanism.  If kernel code wants to use AMX (and that would
> be very strange indeed), it should call x86_i_am_crazy_amx_begin() and
> handle errors, not rely on exceptions.  From user code, I assume we
> send a signal if allocation fails.

The XFD feature allows us to transparently expand the kernel context switch 
buffer
for a user task, when that task first touches this associated hardware.
It allows applications to operate as if the kernel had allocated the backing AMX
context switch buffer at initialization time.  However, since we expect only
a sub-set of tasks to actually use AMX, we instead defer allocation until
we actually see first use for that task, rather than allocating for all tasks.

While we currently don't propose AMX use inside the kernel, it is conceivable
that could be done in the future, just like AVX is used by the RAID code;
and it would be done the same way -- kernel_fpu_begin()/kernel_fpu_end().
Such future Kernel AMX use would _not_ arm XFD, and would _not_ provoke this 
fault.
(note that we context switch the XFD-armed state per task)

vmalloc() does not fail, and does not return an error, and so there is no 
concept
of returning a signal.  If we got to the point where vmalloc() sleeps, then the 
system
has bigger OOM issues, and the OOM killer would be on the prowl.

If we were concerned about using vmalloc for a couple of pages in the task 
structure,
Then we could implement a routine to harvest unused buffers and free them --
but that didn't seem worth the complexity.  Note that this feature is 64-bit 
only.

Thanks,
-Len




Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use

2020-10-01 Thread Andy Lutomirski
On Thu, Oct 1, 2020 at 1:43 PM Chang S. Bae  wrote:
>
> Intel's Extended Feature Disable (XFD) feature is an extension of the XSAVE
> architecture. XFD allows the kernel to enable a feature state in XCR0 and
> to receive a #NM trap when a task uses instructions accessing that state.
> In this way, Linux can allocate the large task->fpu buffer only for tasks
> that use it.
>
> XFD introduces two MSRs: IA32_XFD to enable/disable the feature and
> IA32_XFD_ERR to assist the #NM trap handler. Both use the same
> state-component bitmap format, used by XCR0.
>
> Use this hardware capability to find the right time to expand xstate area.
> Introduce two sets of helper functions for that:
>
> 1. The first set is primarily for interacting with the XFD hardware
>feature. Helpers for configuring disablement, e.g. in context switching,
>are:
> xdisable_setbits()
> xdisable_getbits()
> xdisable_switch()
>
> 2. The second set is for managing the first-use status and handling #NM
>trap:
> xfirstuse_enabled()
> xfirstuse_not_detected()
> xfirstuse_event_handler()
>
> The #NM handler induces the xstate area expansion to save the first-used
> states.
>
> No functional change until the kernel enables dynamic user states and XFD.
>
> Signed-off-by: Chang S. Bae 
> Reviewed-by: Len Brown 
> Cc: x...@kernel.org
> Cc: linux-kernel@vger.kernel.org
> ---
>  arch/x86/include/asm/cpufeatures.h  |  1 +
>  arch/x86/include/asm/fpu/internal.h | 53 -
>  arch/x86/include/asm/msr-index.h|  2 ++
>  arch/x86/kernel/fpu/core.c  | 37 
>  arch/x86/kernel/fpu/xstate.c| 34 --
>  arch/x86/kernel/process.c   |  5 +++
>  arch/x86/kernel/process_32.c|  2 +-
>  arch/x86/kernel/process_64.c|  2 +-
>  arch/x86/kernel/traps.c |  3 ++
>  9 files changed, 133 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/include/asm/cpufeatures.h 
> b/arch/x86/include/asm/cpufeatures.h
> index 2901d5df4366..7d7fe1d82966 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -274,6 +274,7 @@
>  #define X86_FEATURE_XSAVEC (10*32+ 1) /* XSAVEC instruction */
>  #define X86_FEATURE_XGETBV1(10*32+ 2) /* XGETBV with ECX = 1 
> instruction */
>  #define X86_FEATURE_XSAVES (10*32+ 3) /* XSAVES/XRSTORS 
> instructions */
> +#define X86_FEATURE_XFD(10*32+ 4) /* eXtended 
> Feature Disabling */
>
>  /*
>   * Extended auxiliary flags: Linux defined - for features scattered in 
> various
> diff --git a/arch/x86/include/asm/fpu/internal.h 
> b/arch/x86/include/asm/fpu/internal.h
> index 3b03ead87a46..f5dbbaa060fb 100644
> --- a/arch/x86/include/asm/fpu/internal.h
> +++ b/arch/x86/include/asm/fpu/internal.h
> @@ -572,11 +572,60 @@ static inline void switch_fpu_prepare(struct fpu 
> *old_fpu, int cpu)
>   * Misc helper functions:
>   */
>
> +/* The first-use detection helpers: */
> +
> +static inline void xdisable_setbits(u64 value)
> +{
> +   wrmsrl_safe(MSR_IA32_XFD, value);
> +}
> +
> +static inline u64 xdisable_getbits(void)
> +{
> +   u64 value;
> +
> +   rdmsrl_safe(MSR_IA32_XFD, );
> +   return value;
> +}
> +
> +static inline u64 xfirstuse_enabled(void)
> +{
> +   /* All the dynamic user components are first-use enabled. */
> +   return xfeatures_mask_user_dynamic;
> +}
> +
> +/*
> + * Convert fpu->firstuse_bv to xdisable configuration in MSR IA32_XFD.
> + * xdisable_setbits() only uses this.
> + */
> +static inline u64 xfirstuse_not_detected(struct fpu *fpu)
> +{
> +   u64 firstuse_bv = (fpu->state_mask & xfirstuse_enabled());
> +
> +   /*
> +* If first-use is not detected, set the bit. If the detection is
> +* not enabled, the bit is always zero in firstuse_bv. So, make
> +* following conversion:
> +*/
> +   return  (xfirstuse_enabled() ^ firstuse_bv);
> +}
> +
> +/* Update MSR IA32_XFD based on fpu->firstuse_bv */
> +static inline void xdisable_switch(struct fpu *prev, struct fpu *next)
> +{
> +   if (!static_cpu_has(X86_FEATURE_XFD) || !xfirstuse_enabled())
> +   return;
> +
> +   if (unlikely(prev->state_mask != next->state_mask))
> +   xdisable_setbits(xfirstuse_not_detected(next));
> +}
> +
> +bool xfirstuse_event_handler(struct fpu *fpu);
> +
>  /*
>   * Load PKRU from the FPU context if available. Delay loading of the
>   * complete FPU state until the return to userland.
>   */
> -static inline void switch_fpu_finish(struct fpu *new_fpu)
> +static inline void switch_fpu_finish(struct fpu *old_fpu, struct fpu 
> *new_fpu)
>  {
> u32 pkru_val = init_pkru_value;
> struct pkru_state *pk;
> @@ -586,6 +635,8 @@ static inline void switch_fpu_finish(struct fpu *new_fpu)
>
> set_thread_flag(TIF_NEED_FPU_LOAD);
>
> +   xdisable_switch(old_fpu, new_fpu);
> +
> 

[RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use

2020-10-01 Thread Chang S. Bae
Intel's Extended Feature Disable (XFD) feature is an extension of the XSAVE
architecture. XFD allows the kernel to enable a feature state in XCR0 and
to receive a #NM trap when a task uses instructions accessing that state.
In this way, Linux can allocate the large task->fpu buffer only for tasks
that use it.

XFD introduces two MSRs: IA32_XFD to enable/disable the feature and
IA32_XFD_ERR to assist the #NM trap handler. Both use the same
state-component bitmap format, used by XCR0.

Use this hardware capability to find the right time to expand xstate area.
Introduce two sets of helper functions for that:

1. The first set is primarily for interacting with the XFD hardware
   feature. Helpers for configuring disablement, e.g. in context switching,
   are:
xdisable_setbits()
xdisable_getbits()
xdisable_switch()

2. The second set is for managing the first-use status and handling #NM
   trap:
xfirstuse_enabled()
xfirstuse_not_detected()
xfirstuse_event_handler()

The #NM handler induces the xstate area expansion to save the first-used
states.

No functional change until the kernel enables dynamic user states and XFD.

Signed-off-by: Chang S. Bae 
Reviewed-by: Len Brown 
Cc: x...@kernel.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/include/asm/cpufeatures.h  |  1 +
 arch/x86/include/asm/fpu/internal.h | 53 -
 arch/x86/include/asm/msr-index.h|  2 ++
 arch/x86/kernel/fpu/core.c  | 37 
 arch/x86/kernel/fpu/xstate.c| 34 --
 arch/x86/kernel/process.c   |  5 +++
 arch/x86/kernel/process_32.c|  2 +-
 arch/x86/kernel/process_64.c|  2 +-
 arch/x86/kernel/traps.c |  3 ++
 9 files changed, 133 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/cpufeatures.h 
b/arch/x86/include/asm/cpufeatures.h
index 2901d5df4366..7d7fe1d82966 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -274,6 +274,7 @@
 #define X86_FEATURE_XSAVEC (10*32+ 1) /* XSAVEC instruction */
 #define X86_FEATURE_XGETBV1(10*32+ 2) /* XGETBV with ECX = 1 
instruction */
 #define X86_FEATURE_XSAVES (10*32+ 3) /* XSAVES/XRSTORS 
instructions */
+#define X86_FEATURE_XFD(10*32+ 4) /* eXtended Feature 
Disabling */
 
 /*
  * Extended auxiliary flags: Linux defined - for features scattered in various
diff --git a/arch/x86/include/asm/fpu/internal.h 
b/arch/x86/include/asm/fpu/internal.h
index 3b03ead87a46..f5dbbaa060fb 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -572,11 +572,60 @@ static inline void switch_fpu_prepare(struct fpu 
*old_fpu, int cpu)
  * Misc helper functions:
  */
 
+/* The first-use detection helpers: */
+
+static inline void xdisable_setbits(u64 value)
+{
+   wrmsrl_safe(MSR_IA32_XFD, value);
+}
+
+static inline u64 xdisable_getbits(void)
+{
+   u64 value;
+
+   rdmsrl_safe(MSR_IA32_XFD, );
+   return value;
+}
+
+static inline u64 xfirstuse_enabled(void)
+{
+   /* All the dynamic user components are first-use enabled. */
+   return xfeatures_mask_user_dynamic;
+}
+
+/*
+ * Convert fpu->firstuse_bv to xdisable configuration in MSR IA32_XFD.
+ * xdisable_setbits() only uses this.
+ */
+static inline u64 xfirstuse_not_detected(struct fpu *fpu)
+{
+   u64 firstuse_bv = (fpu->state_mask & xfirstuse_enabled());
+
+   /*
+* If first-use is not detected, set the bit. If the detection is
+* not enabled, the bit is always zero in firstuse_bv. So, make
+* following conversion:
+*/
+   return  (xfirstuse_enabled() ^ firstuse_bv);
+}
+
+/* Update MSR IA32_XFD based on fpu->firstuse_bv */
+static inline void xdisable_switch(struct fpu *prev, struct fpu *next)
+{
+   if (!static_cpu_has(X86_FEATURE_XFD) || !xfirstuse_enabled())
+   return;
+
+   if (unlikely(prev->state_mask != next->state_mask))
+   xdisable_setbits(xfirstuse_not_detected(next));
+}
+
+bool xfirstuse_event_handler(struct fpu *fpu);
+
 /*
  * Load PKRU from the FPU context if available. Delay loading of the
  * complete FPU state until the return to userland.
  */
-static inline void switch_fpu_finish(struct fpu *new_fpu)
+static inline void switch_fpu_finish(struct fpu *old_fpu, struct fpu *new_fpu)
 {
u32 pkru_val = init_pkru_value;
struct pkru_state *pk;
@@ -586,6 +635,8 @@ static inline void switch_fpu_finish(struct fpu *new_fpu)
 
set_thread_flag(TIF_NEED_FPU_LOAD);
 
+   xdisable_switch(old_fpu, new_fpu);
+
if (!cpu_feature_enabled(X86_FEATURE_OSPKE))
return;
 
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 2859ee4f39a8..0ccbe8cc99ad 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -610,6 +610,8 @@
 #define