Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

2021-03-30 Thread Len Brown
On Tue, Mar 30, 2021 at 4:28 AM Thomas Gleixner  wrote:
>
> Len,
>
> On Mon, Mar 29 2021 at 18:16, Len Brown wrote:
> > On Mon, Mar 29, 2021 at 2:49 PM Thomas Gleixner  wrote:
> > Let me know if this problem description is fair:
> >
> > Many-core Xeon servers will support AMX, and when I run an AMX application
> > on one, when I take an interrupt with AMX INIT=0, Linux may go idle on my 
> > CPU.
> > If Linux cpuidle requests C6, the hardware will demote to C1E.
> >
> > The concern is that a core in C1E will negatively impact power of
> > self, or performance
> > of a neighboring core.
> >
> > This is what we are talking about, right?
>
> Correct.
>
> > I'm delighted that there are Xeon customers, who care about this power 
> > savings.
> > Unfortunately, they are the exception, not the rule.
>
> That maybe true or not. The point is that there is some side effect and
> from a correctness point of view it needs to be addressed.

I don't see how demoting C6 to C1E is a "correctness" issue.
There is nothing "incorrect" about demoting to C1E when software permits C6,
and as I mentioned, this happens all the time.
All architectural state, including the AMX state, is preserved by hardware.

I do agree that there is the possibility that this scenario can result
may not be optimal power savings.
It isn't clear how often that situation might occur.

> >>- Use TILERELEASE on context switch after XSAVES?
> >
> > Yes, that would be perfectly reasonable.
> >
> >>- Any other mechanism on context switch
> >
> > XRESTOR of a context with INIT=1 would also do it.
> >
> >>- Clear XFD[18] when going idle and issue TILERELEASE depending
> >>  on the last state
> >
> > I think you mean to *set* XFD.
> > When the task touched AMX, he took a #NM, and we cleared XFD for that task.
> > So when we get here, XFD is already clear (unarmed).
> > Nevertheless, the setting of XFD is moot here.
>
> No. We set XFD when the task which used AMX schedules out. If the CPU
> reaches idle without going back to user space then XFD is still set and
> AMX INIT still 0. So my assumption was that in order to issue
> TILERELEASE before going idle, you need to clear XFD[18] first, but I
> just saw in the spec that it is not necessary.

Right, XFD setting is moot here.

> >>- Use any other means to set the thing back into INIT=1 state when
> >>  going idle
> >
> > TILERELEASE and XRESTOR are the tools in the toolbox, if necessary.
> >
> >> There is no option 'shrug and ignore' unfortunately.
> >
> > I'm not going to say it is impossible that this path will matter.
> > If some terrible things go wrong with the hardware, and the hardware
> > is configured and used in a very specific way, yes, this could matter.
>
> So then let me summarize for the bare metal case:
>
>1) The paragraph in the specification is unclear and needs to be
>   rephrased.
>
>   What I understood from your explanations so far:
>
>   When AMX is disabled by clearing XCR0[18:17], by clearing
>   CR4.OSXSAVE, or by setting IA32_XFD[18], then there are no
>   negative side effects due to AMX INIT=0 as long as the CPU is
>   executing.

Right.

>   The only possible side effect is when the CPU goes idle because
>   AMX INIT=0 limits C states.

Right.

>2) As a consequence of #1 there is no further action required on
>   context switch when XFD[18] is set.

I agree.

>3) When the CPU goes idle with AMX INIT=0 then the idle code should
>   invoke TILERELEASE. Maybe the condition is not even necessary and
>   TILERELEASE can be invoked unconditionally before trying to enter
>   idle.
>
> If that's correct, then this should be part of the next series.

If you insist, then that is fine with me.

Personally, I would prefer to be able to measure an actual benefit
before adding code, but this one is small, and so I'm not religious about it.

> > In the grand scheme of things, this is a pretty small issue, say,
> > compared to the API discussion.
>
> No argument about that.
>
> Thanks,
>
> tglx



-- 
Len Brown, Intel Open Source Technology Center


Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

2021-03-30 Thread Thomas Gleixner
Len,

On Mon, Mar 29 2021 at 18:16, Len Brown wrote:
> On Mon, Mar 29, 2021 at 2:49 PM Thomas Gleixner  wrote:
> Let me know if this problem description is fair:
>
> Many-core Xeon servers will support AMX, and when I run an AMX application
> on one, when I take an interrupt with AMX INIT=0, Linux may go idle on my CPU.
> If Linux cpuidle requests C6, the hardware will demote to C1E.
>
> The concern is that a core in C1E will negatively impact power of
> self, or performance
> of a neighboring core.
>
> This is what we are talking about, right?

Correct.

> I'm delighted that there are Xeon customers, who care about this power 
> savings.
> Unfortunately, they are the exception, not the rule.

That maybe true or not. The point is that there is some side effect and
from a correctness point of view it needs to be addressed.

>>- Use TILERELEASE on context switch after XSAVES?
>
> Yes, that would be perfectly reasonable.
>
>>- Any other mechanism on context switch
>
> XRESTOR of a context with INIT=1 would also do it.
>
>>- Clear XFD[18] when going idle and issue TILERELEASE depending
>>  on the last state
>
> I think you mean to *set* XFD.
> When the task touched AMX, he took a #NM, and we cleared XFD for that task.
> So when we get here, XFD is already clear (unarmed).
> Nevertheless, the setting of XFD is moot here.

No. We set XFD when the task which used AMX schedules out. If the CPU
reaches idle without going back to user space then XFD is still set and
AMX INIT still 0. So my assumption was that in order to issue
TILERELEASE before going idle, you need to clear XFD[18] first, but I
just saw in the spec that it is not necessary.

>>- Use any other means to set the thing back into INIT=1 state when
>>  going idle
>
> TILERELEASE and XRESTOR are the tools in the toolbox, if necessary.
>
>> There is no option 'shrug and ignore' unfortunately.
>
> I'm not going to say it is impossible that this path will matter.
> If some terrible things go wrong with the hardware, and the hardware
> is configured and used in a very specific way, yes, this could matter.

So then let me summarize for the bare metal case:

   1) The paragraph in the specification is unclear and needs to be
  rephrased.

  What I understood from your explanations so far:

  When AMX is disabled by clearing XCR0[18:17], by clearing
  CR4.OSXSAVE, or by setting IA32_XFD[18], then there are no
  negative side effects due to AMX INIT=0 as long as the CPU is
  executing.

  The only possible side effect is when the CPU goes idle because
  AMX INIT=0 limits C states.

   2) As a consequence of #1 there is no further action required on
  context switch when XFD[18] is set.

   3) When the CPU goes idle with AMX INIT=0 then the idle code should
  invoke TILERELEASE. Maybe the condition is not even necessary and
  TILERELEASE can be invoked unconditionally before trying to enter
  idle.

If that's correct, then this should be part of the next series.

> In the grand scheme of things, this is a pretty small issue, say,
> compared to the API discussion.

No argument about that.

Thanks,

tglx


Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

2021-03-29 Thread Len Brown
On Mon, Mar 29, 2021 at 2:49 PM Thomas Gleixner  wrote:

> According to documentation it is irrelevant whether AMX usage is
> disabled via XCR0, CR4.OSXSAVE or XFD[18]. In any case the effect of
> AMX INIT=0 will prevent C6.
>
> As I explained in great length there are enough ways to get into a
> situation where this can happen and a CPU goes idle with AMX INIT=0.
>
> So what are we supposed to do?

Let me know if this problem description is fair:

Many-core Xeon servers will support AMX, and when I run an AMX application
on one, when I take an interrupt with AMX INIT=0, Linux may go idle on my CPU.
If Linux cpuidle requests C6, the hardware will demote to C1E.

The concern is that a core in C1E will negatively impact power of
self, or performance
of a neighboring core.

This is what we are talking about, right?

First, I should mention that if I threw a dart at a map of Xeons
deployed across the universe, the chances are "significant" that I'd
hit one that is configured with C6 disabled, and this discussion would be moot.

Second, I should mention that Linux cpuidle demotes from deep C-states
to shallow ones all day long.  This is typically due to expected timer
expiration,
and other heuristics.

Third, I should mention that the processor itself demotes from C6 to C1E
for a number of reasons -- basically like what Linux is doing, but in HW.

Albeit, the hardware does have the capability to "un-demote" when it demotes
and recognizes it made a mistake, and that "un-demote" capability would
not be present if the reason for demotion was AVX INIT=0.

Okay, that said, let's assume we have found a system where this problem
could happen, and we use it in a way that makes it happen.  Would we notice?

If your system were profoundly idle, and one or more cores were in C1E,
then it would prevent the SOC from entering Package C6 (if enabled).
Yes, there is a measurable idle power difference between Package C1E
and Package C6.  (indeed, this is why Package C6 exists).

I'm delighted that there are Xeon customers, who care about this power savings.
Unfortunately, they are the exception, not the rule.

If you were to provoke this scenario on many cores simultaneously, then
I expect you could detect a power difference between C1E and CC6.
However, that difference would be smaller than the difference
in power due to the frequency choice of the running cores,
because it is basically just the L2-leakage vs L2-off difference.

Regarding frequency credits for a core being in C1E vs C6.
Yes, this is factored into the frequency credits for turbo mode.
How much impact, I can't say, because that information is not yet available.
However, this is mitigated by the fact that Xeon single core turbo
is deployed differently than client.  Xeon's are deployed
more with multi-core turbo in mind, and so how much you'll
notice C1E vs C6 may not be significant, unless perhaps it happened
on all the cores across the system.

>- Use TILERELEASE on context switch after XSAVES?

Yes, that would be perfectly reasonable.

>- Any other mechanism on context switch

XRESTOR of a context with INIT=1 would also do it.

>- Clear XFD[18] when going idle and issue TILERELEASE depending
>  on the last state

I think you mean to *set* XFD.
When the task touched AMX, he took a #NM, and we cleared XFD for that task.
So when we get here, XFD is already clear (unarmed).
Nevertheless, the setting of XFD is moot here.

>- Use any other means to set the thing back into INIT=1 state when
>  going idle

TILERELEASE and XRESTOR are the tools in the toolbox, if necessary.

> There is no option 'shrug and ignore' unfortunately.

I'm not going to say it is impossible that this path will matter.
If some terrible things go wrong with the hardware, and the hardware
is configured and used in a very specific way, yes, this could matter.

In the grand scheme of things, this is a pretty small issue,
say, compared to the API discussion.

thanks,
Len Brown, Intel Open Source Technology Center


-Len


Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

2021-03-29 Thread Len Brown
On Mon, Mar 29, 2021 at 1:43 PM Andy Lutomirski  wrote:

> > *switching* XCR0 on context switch is slow, but perfectly legal.
>
> How slow is it?  And how slow is switching XFD?  XFD is definitely 
> serializing?

XCR0 writes in a VM guest cause a VMEXIT..
XCR writes in a VM guest do not.

I will find out what the relative cost is on bare metal, where VMEXIT
is not an issue.
-- 
Len Brown, Intel Open Source Technology Center


Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

2021-03-29 Thread Thomas Gleixner
On Mon, Mar 29 2021 at 11:43, Len Brown wrote:
> On Mon, Mar 29, 2021 at 9:33 AM Thomas Gleixner  wrote:
> But yes, if a bare metal OS doesn't support any threading libraries
> that query XCR0 with xgetbv, and they don't care about the performance
> impact of switching XCR0, they could choose to switch XCR0 and
> would want to TILERELEASE to assure C6 access, if it is enabled.

That's not the point. The C6 issue has nothing to do with the ABI
considerations vs. XCR0.

According to documentation it is irrelevant whether AMX usage is
disabled via XCR0, CR4.OSXSAVE or XFD[18]. In any case the effect of
AMX INIT=0 will prevent C6.

As I explained in great length there are enough ways to get into a
situation where this can happen and a CPU goes idle with AMX INIT=0.

So what are we supposed to do?

   - Use TILERELEASE on context switch after XSAVES?

   - Any other mechanism on context switch

   - Clear XFD[18] when going idle and issue TILERELEASE depending
 on the last state

   - Use any other means to set the thing back into INIT=1 state when
 going idle

There is no option 'shrug and ignore' unfortunately.

Thanks,

tglx


Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

2021-03-29 Thread Bae, Chang Seok
On Mar 26, 2021, at 09:34, Jann Horn  wrote:
> On Sun, Feb 21, 2021 at 7:56 PM Chang S. Bae  wrote:
>> 
>> +   if (handle_xfirstuse_event(>thread.fpu))
>> +   return;
> 
> What happens if handle_xfirstuse_event() fails because vmalloc()
> failed in alloc_xstate_buffer()? I think that should probably kill the
> task with something like force_sig() - but as far as I can tell, at
> the moment, it will instead end up at die(), which should only be used
> for kernel bugs.

This question was raised on v1 before [1].

In the end, people suggested to handle the failure, e.g., with tracepoints or
stats. So, proposed this on the allocation site:

+   state_ptr = vmalloc(newsz);
+   if (!state_ptr) {
+   trace_x86_fpu_xstate_alloc_failed(fpu);
+   return -ENOMEM;
+   }

Also, I tried to justify this to Boris [2]:

  >> Maybe it is possible to backtrack this allocation failure out of #NM
  >> handling. But the tracepoint can provide a clear context, although limited
  >> to those using it.

  > Yes, add it when it is really needed. Not slapping it proactively and
  > hoping for any potential usage.

Let me know if you have a better way.

Thanks,
Chang

[1] https://lore.kernel.org/lkml/c4669d5f-11b8-3879-562c-78a791b86...@intel.com/
[2] https://lore.kernel.org/lkml/20210204131002.ga17...@zn.tnic/



Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

2021-03-29 Thread Andy Lutomirski


> On Mar 29, 2021, at 9:06 AM, Len Brown  wrote:
> 
> On Mon, Mar 29, 2021 at 11:43 AM Len Brown  wrote:
>> 
>> On Mon, Mar 29, 2021 at 9:33 AM Thomas Gleixner  wrote:
>> 
 I found the author of this passage, and he agreed to revise it to say this
 was targeted primarily at VMMs.
>>> 
>>> Why would this only a problem for VMMs?
>> 
>> VMMs may have to emulate different hardware for different guest OS's,
>> and they would likely "context switch" XCR0 to achieve that.
>> 
>> As switching XCR0 at run-time would confuse the heck out of user-space,
>> it was not imagined that a bare-metal OS would do that.
> 
> to clarify...
> *switching* XCR0 on context switch is slow, but perfectly legal.

How slow is it?  And how slow is switching XFD?  XFD is definitely serializing?


Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

2021-03-29 Thread Len Brown
On Mon, Mar 29, 2021 at 11:43 AM Len Brown  wrote:
>
> On Mon, Mar 29, 2021 at 9:33 AM Thomas Gleixner  wrote:
>
> > > I found the author of this passage, and he agreed to revise it to say this
> > > was targeted primarily at VMMs.
> >
> > Why would this only a problem for VMMs?
>
> VMMs may have to emulate different hardware for different guest OS's,
> and they would likely "context switch" XCR0 to achieve that.
>
> As switching XCR0 at run-time would confuse the heck out of user-space,
> it was not imagined that a bare-metal OS would do that.

to clarify...
*switching* XCR0 on context switch is slow, but perfectly legal.

*changing* XCR0 during the lifetime of a process, in any of its tasks,
on any of its CPUs, will confuse any software that uses xgetbv/XCR0
to calculate the size of XSAVE buffers for userspace threading.


-- 
Len Brown, Intel Open Source Technology Center


Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

2021-03-29 Thread Len Brown
On Mon, Mar 29, 2021 at 9:33 AM Thomas Gleixner  wrote:

> > I found the author of this passage, and he agreed to revise it to say this
> > was targeted primarily at VMMs.
>
> Why would this only a problem for VMMs?

VMMs may have to emulate different hardware for different guest OS's,
and they would likely "context switch" XCR0 to achieve that.

As switching XCR0 at run-time would confuse the heck out of user-space,
it was not imagined that a bare-metal OS would do that.

But yes, if a bare metal OS doesn't support any threading libraries
that query XCR0 with xgetbv, and they don't care about the performance
impact of switching XCR0, they could choose to switch XCR0 and
would want to TILERELEASE to assure C6 access, if it is enabled.

> > "negative power and performance implications" refers to the fact that
> > the processor will not enter C6 when AMX INIT=0, instead it will demote
> > to the next shallower C-state, eg C1E.
> >
> > (this is because the C6 flow doesn't save the AMX registers)
> >
> > For customers that have C6 enabled, the inability of a core to enter C6
> > may impact the maximum turbo frequency of other cores.
>
> That's the same on bare metal, right?

Yes, the hardware works exactly the same way.

thanks,
Len Brown, Intel Open Source Technology Center


Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

2021-03-29 Thread Thomas Gleixner
On Mon, Mar 29 2021 at 09:14, Len Brown wrote:
> On Sat, Mar 20, 2021 at 6:14 PM Thomas Gleixner  wrote:
>>
>> On Sun, Feb 21 2021 at 10:56, Chang S. Bae wrote:
>> > +
>> > +/* Update MSR IA32_XFD with xfirstuse_not_detected() if needed. */
>> > +static inline void xdisable_switch(struct fpu *prev, struct fpu *next)
>> > +{
>> > + if (!static_cpu_has(X86_FEATURE_XFD) || !xfirstuse_enabled())
>> > + return;
>> > +
>> > + if (unlikely(prev->state_mask != next->state_mask))
>> > + xdisable_setbits(xfirstuse_not_detected(next));
>> > +}
>>
>> So this is invoked on context switch. Toggling bit 18 of MSR_IA32_XFD
>> when it does not match. The spec document says:
>>
>>   "System software may disable use of Intel AMX by clearing XCR0[18:17], by
>>clearing CR4.OSXSAVE, or by setting IA32_XFD[18]. It is recommended that
>>system software initialize AMX state (e.g., by executing TILERELEASE)
>>before doing so. This is because maintaining AMX state in a
>>non-initialized state may have negative power and performance
>>implications."
>>
>> I'm not seeing anything related to this. Is this a recommendation
>> which can be ignored or is that going to be duct taped into the code
>> base once the first user complains about slowdowns of their non AMX
>> workloads on that machine?
>
> I found the author of this passage, and he agreed to revise it to say this
> was targeted primarily at VMMs.

Why would this only a problem for VMMs?

> "negative power and performance implications" refers to the fact that
> the processor will not enter C6 when AMX INIT=0, instead it will demote
> to the next shallower C-state, eg C1E.
>
> (this is because the C6 flow doesn't save the AMX registers)
>
> For customers that have C6 enabled, the inability of a core to enter C6
> may impact the maximum turbo frequency of other cores.

That's the same on bare metal, right?

Thanks,

tglx


Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

2021-03-29 Thread Len Brown
On Sat, Mar 20, 2021 at 6:14 PM Thomas Gleixner  wrote:
>
> On Sun, Feb 21 2021 at 10:56, Chang S. Bae wrote:
> > +
> > +/* Update MSR IA32_XFD with xfirstuse_not_detected() if needed. */
> > +static inline void xdisable_switch(struct fpu *prev, struct fpu *next)
> > +{
> > + if (!static_cpu_has(X86_FEATURE_XFD) || !xfirstuse_enabled())
> > + return;
> > +
> > + if (unlikely(prev->state_mask != next->state_mask))
> > + xdisable_setbits(xfirstuse_not_detected(next));
> > +}
>
> So this is invoked on context switch. Toggling bit 18 of MSR_IA32_XFD
> when it does not match. The spec document says:
>
>   "System software may disable use of Intel AMX by clearing XCR0[18:17], by
>clearing CR4.OSXSAVE, or by setting IA32_XFD[18]. It is recommended that
>system software initialize AMX state (e.g., by executing TILERELEASE)
>before doing so. This is because maintaining AMX state in a
>non-initialized state may have negative power and performance
>implications."
>
> I'm not seeing anything related to this. Is this a recommendation
> which can be ignored or is that going to be duct taped into the code
> base once the first user complains about slowdowns of their non AMX
> workloads on that machine?

I found the author of this passage, and he agreed to revise it to say this
was targeted primarily at VMMs.

"negative power and performance implications" refers to the fact that
the processor will not enter C6 when AMX INIT=0, instead it will demote
to the next shallower C-state, eg C1E.

(this is because the C6 flow doesn't save the AMX registers)

For customers that have C6 enabled, the inability of a core to enter C6
may impact the maximum turbo frequency of other cores.

thanks,
-Len Brown
Intel Open Source Technology Center


Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

2021-03-26 Thread Jann Horn
On Sun, Feb 21, 2021 at 7:56 PM Chang S. Bae  wrote:
> Intel's Extended Feature Disable (XFD) feature is an extension of the XSAVE
> architecture. XFD allows the kernel to enable a feature state in XCR0 and
> to receive a #NM trap when a task uses instructions accessing that state.
> In this way, Linux can defer allocating the large XSAVE buffer until tasks
> need it.
>
> XFD introduces two MSRs: IA32_XFD to enable/disable the feature and
> IA32_XFD_ERR to assist the #NM trap handler. Both use the same
> state-component bitmap format, used by XCR0.
>
> Use this hardware capability to find the right time to expand the xstate
> buffer. Introduce two sets of helper functions for that:
>
> 1. The first set is primarily for interacting with the XFD hardware:
> xdisable_setbits()
> xdisable_getbits()
> xdisable_switch()
>
> 2. The second set is for managing the first-use status and handling #NM
>trap:
> xfirstuse_enabled()
> xfirstuse_not_detected()
>
> The #NM handler induces the xstate buffer expansion to save the first-used
> states.
[...]
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index 7f5aec758f0e..821a7f408ad4 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
[...]
> +static __always_inline bool handle_xfirstuse_event(struct fpu *fpu)
> +{
> +   bool handled = false;
> +   u64 event_mask;
[...]
> +   if (alloc_xstate_buffer(fpu, event_mask))
> +   return handled;
[...]
> +}
> +
>  DEFINE_IDTENTRY(exc_device_not_available)
>  {
> unsigned long cr0 = read_cr0();
>
> +   if (handle_xfirstuse_event(>thread.fpu))
> +   return;

What happens if handle_xfirstuse_event() fails because vmalloc()
failed in alloc_xstate_buffer()? I think that should probably kill the
task with something like force_sig() - but as far as I can tell, at
the moment, it will instead end up at die(), which should only be used
for kernel bugs.

> +
>  #ifdef CONFIG_MATH_EMULATION
> if (!boot_cpu_has(X86_FEATURE_FPU) && (cr0 & X86_CR0_EM)) {
> struct math_emu_info info = { };
> --
> 2.17.1
>
>


Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

2021-03-25 Thread Liu, Jing2





For AMX, we must still reserve the space, but we are not going to write zeros
for clean state.  We so this in software by checking XINUSE=0, and clearing
the xstate_bf for the XSAVE.  As a result, for XINUSE=0, we can skip
writing the zeros, even though we can't compress the space.

So my understanding is that clearing xstate_bv will not help prevent saving
zeros, but only not masking EDX:EAX, since the following logic. Not sure if
this is just what you mean. :)

FWIW, PATCH21 [1] uses the instruction mask to skip writing zeros on sigframe.
Then, XSAVE will clear the xstate_bv for the xtile data state bit.

[1] 
https://lore.kernel.org/lkml/20210221185637.19281-22-chang.seok@intel.com/
Yes, no mask in EDX:EAX works in such case. Thanks for pointing out the 
patch.


BRs,
Jing


Thanks,
Chang




Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

2021-03-25 Thread Bae, Chang Seok
On Mar 24, 2021, at 22:12, Liu, Jing2  wrote:
> On 3/25/2021 5:09 AM, Len Brown wrote:
>> 
>> For AMX, we must still reserve the space, but we are not going to write zeros
>> for clean state.  We so this in software by checking XINUSE=0, and clearing
>> the xstate_bf for the XSAVE.  As a result, for XINUSE=0, we can skip
>> writing the zeros, even though we can't compress the space.
> So my understanding is that clearing xstate_bv will not help prevent saving
> zeros, but only not masking EDX:EAX, since the following logic. Not sure if
> this is just what you mean. :)

FWIW, PATCH21 [1] uses the instruction mask to skip writing zeros on sigframe.
Then, XSAVE will clear the xstate_bv for the xtile data state bit.

[1] 
https://lore.kernel.org/lkml/20210221185637.19281-22-chang.seok@intel.com/

Thanks,
Chang

Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

2021-03-24 Thread Liu, Jing2




On 3/25/2021 5:09 AM, Len Brown wrote:

On Tue, Mar 23, 2021 at 11:15 PM Liu, Jing2  wrote:


IMO, the problem with AVX512 state
is that we guaranteed it will be zero for XINUSE=0.
That means we have to write 0's on saves.

why "we have to write 0's on saves" when XINUSE=0.

Since due to SDM, if XINUSE=0, the XSAVES will *not* save the data and
xstate_bv bit is 0; if use XSAVE, it need save the state but
xstate_bv bit is also 0.

   It would be better
to be able to skip the write -- even if we can't save the space
we can save the data transfer.  (This is what we did for AMX).

With XFD feature that XFD=1, XSAVE command still has to save INIT state
to the area. So it seems with XINUSE=0 and XFD=1, the XSAVE(S) commands
do the same that both can help save the data transfer.

Hi Jing, Good observation!

There are 3 cases.

Hi Len, thanks for your reply.


1. Task context switch save into the context switch buffer.
Here we use XSAVES, and as you point out, XSAVES includes
the compaction optimization feature tracked by XINUSE.
So when AMX is enabled, but clean, XSAVES doesn't write zeros.
Further, it omits the buffer space for AMX in the destination altogether!
However, since XINUSE=1 is possible, we have to *allocate* a buffer
large enough to handle the dirty data for when XSAVES can not
employ that optimization.

Yes, I agree with you about the first case.


2. Entry into user signal handler saves into the user space sigframe.
Here we use XSAVE, and so the hardware will write zeros for XINUSE=0,
and for AVX512, we save neither time or space.

My understanding that for application compatibility, we can *not* compact
the destination buffer that user-space sees.  This is because existing code
may have adopted fixed size offsets.  (which is unfortunate).



And so, for AVX512, we both reserve the space, and we write zeros
for clean AVX512 state.
By XSAVE, I think this is true if we assume setting EDX:EAX AVX512 bits 
as 1,
which means XSAVE will write zeros when XINUSE=0. Is this the same 
assumption

with yours?...

For AMX, we must still reserve the space, but we are not going to write zeros
for clean state.  We so this in software by checking XINUSE=0, and clearing
the xstate_bf for the XSAVE.  As a result, for XINUSE=0, we can skip
writing the zeros, even though we can't compress the space.

So my understanding is that clearing xstate_bv will not help prevent saving
zeros, but only not masking EDX:EAX, since the following logic. Not sure if
this is just what you mean. :)

RFBM ← XCR0 AND EDX:EAX; /* bitwise logical AND */
OLD_BV ← XSTATE_BV field from XSAVE header;
...
FOR i ← 2 TO 62
IF RFBM[i] = 1
THEN save XSAVE state component i at offset n from base of XSAVE area;
FI;
ENDFOR;

XSTATE_BV field in XSAVE header ← (OLD_BV AND NOT RFBM) OR (XINUSE AND 
RFBM);



3. user space always uses fully uncompacted XSAVE buffers.


The reason I'm interested in XINUSE denotation is that it might be helpful
for the XFD MSRs context switch cost during vmexit and vmenter.

As the guest OS may be using XFD, the VMM can not use it for itself.
Rather, the VMM must context switch it when it switches between guests.
(or not expose it to guests at all)


My understand is that KVM cannot assume that userspace qemu uses XFD or not,
so KVM need context switch XFD between vcpu threads when vmexit/vmenter.

That's why I am thinking about detecting XINUSE when vmexit, otherwise, a
wrong armed IA32_XFD will impact XSAVES/XRSTORS causing guest AMX states
lost.

Thanks,
Jing


cheers,
-Len


cheers,
Len Brown, Intel Open Source Technology Center




Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

2021-03-24 Thread Andy Lutomirski


> On Mar 24, 2021, at 2:58 PM, Dave Hansen  wrote:
> 
> On 3/24/21 2:42 PM, Andy Lutomirski wrote:
> 3. user space always uses fully uncompacted XSAVE buffers.
> 
 There is no reason we have to do this for new states. Arguably we
 shouldn’t for AMX to avoid yet another altstack explosion.
>>> The thing that's worried me is that the list of OS-enabled states is
>>> visible to apps via XGETBV.  It doesn't seem too much of a stretch to
>>> think that apps will see AMX enabled with XGETBV and them assume that
>>> it's on the signal stack.
>>> 
>>> Please tell me I'm being too paranoid.  If we can break this
>>> assumption, it would get rid of a lot of future pain.
>> There are no AMX apps. I sure hope that there are no apps that
>> enumerate xfeatures with CPUID and try to decode the mess in the
>> signal stack.
> 
> I don't think they quite need to decode it in order to be screwed over a
> bit.  For instance, I don't think it's too crazy if someone did:
> 
>xcr0 = xgetbv(0);
>xrstor(xcr0, _stack[something]);
>// change some registers
>xsave(xcr0, _stack[something]);
> 
> The XRSTOR would work fine, but the XSAVE would overflow the stack
> because it would save the AMX state.  It also *looks* awfully benign.
> This is true even if the silly signal handler didn't know about AMX at
> *ALL*.
> 
> A good app would have checked that the xfeatures field in the header
> matched xcr0.

Ugh.

On the other hand, if we require a syscall to flip the AMX bit in XCR0, we 
could maybe get away with saying that programs that flip the bit and don’t 
understand the new ABI get to keep both pieces.

I don’t live futzing with the ABI like this, but AMX is really only barely 
compatible with everything before it.

Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

2021-03-24 Thread Dave Hansen
On 3/24/21 2:42 PM, Andy Lutomirski wrote:
 3. user space always uses fully uncompacted XSAVE buffers.

>>> There is no reason we have to do this for new states. Arguably we
>>> shouldn’t for AMX to avoid yet another altstack explosion.
>> The thing that's worried me is that the list of OS-enabled states is
>> visible to apps via XGETBV.  It doesn't seem too much of a stretch to
>> think that apps will see AMX enabled with XGETBV and them assume that
>> it's on the signal stack.
>>
>> Please tell me I'm being too paranoid.  If we can break this
>> assumption, it would get rid of a lot of future pain.
> There are no AMX apps. I sure hope that there are no apps that
> enumerate xfeatures with CPUID and try to decode the mess in the
> signal stack.

I don't think they quite need to decode it in order to be screwed over a
bit.  For instance, I don't think it's too crazy if someone did:

xcr0 = xgetbv(0);
xrstor(xcr0, _stack[something]);
// change some registers
xsave(xcr0, _stack[something]);

The XRSTOR would work fine, but the XSAVE would overflow the stack
because it would save the AMX state.  It also *looks* awfully benign.
This is true even if the silly signal handler didn't know about AMX at
*ALL*.

A good app would have checked that the xfeatures field in the header
matched xcr0.


Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

2021-03-24 Thread Andy Lutomirski


> On Mar 24, 2021, at 2:30 PM, Dave Hansen  wrote:
> 
> On 3/24/21 2:26 PM, Andy Lutomirski wrote:
>>> 3. user space always uses fully uncompacted XSAVE buffers.
>>> 
>> There is no reason we have to do this for new states. Arguably we
>> shouldn’t for AMX to avoid yet another altstack explosion.
> 
> The thing that's worried me is that the list of OS-enabled states is
> visible to apps via XGETBV.  It doesn't seem too much of a stretch to
> think that apps will see AMX enabled with XGETBV and them assume that
> it's on the signal stack.
> 
> Please tell me I'm being too paranoid.  If we can break this assumption,
> it would get rid of a lot of future pain.

There are no AMX apps. I sure hope that there are no apps that enumerate 
xfeatures with CPUID and try to decode the mess in the signal stack.

I do think we need to save AMX state *somewhere* if a signal happens unless 
userspace opts out, but I don’t think it needs to be in the nominally expected 
spot.

Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

2021-03-24 Thread Dave Hansen
On 3/24/21 2:26 PM, Andy Lutomirski wrote:
>> 3. user space always uses fully uncompacted XSAVE buffers.
>> 
> There is no reason we have to do this for new states. Arguably we
> shouldn’t for AMX to avoid yet another altstack explosion.

The thing that's worried me is that the list of OS-enabled states is
visible to apps via XGETBV.  It doesn't seem too much of a stretch to
think that apps will see AMX enabled with XGETBV and them assume that
it's on the signal stack.

Please tell me I'm being too paranoid.  If we can break this assumption,
it would get rid of a lot of future pain.


Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

2021-03-24 Thread Andy Lutomirski




> On Mar 24, 2021, at 2:09 PM, Len Brown  wrote:
> 
> On Tue, Mar 23, 2021 at 11:15 PM Liu, Jing2  
> wrote:
> 
>>> IMO, the problem with AVX512 state
>>> is that we guaranteed it will be zero for XINUSE=0.
>>> That means we have to write 0's on saves.
> 
>> why "we have to write 0's on saves" when XINUSE=0.
>> 
>> Since due to SDM, if XINUSE=0, the XSAVES will *not* save the data and
>> xstate_bv bit is 0; if use XSAVE, it need save the state but
>> xstate_bv bit is also 0.
>>>  It would be better
>>> to be able to skip the write -- even if we can't save the space
>>> we can save the data transfer.  (This is what we did for AMX).
>> With XFD feature that XFD=1, XSAVE command still has to save INIT state
>> to the area. So it seems with XINUSE=0 and XFD=1, the XSAVE(S) commands
>> do the same that both can help save the data transfer.
> 
> Hi Jing, Good observation!
> 
> There are 3 cases.
> 
> 1. Task context switch save into the context switch buffer.
> Here we use XSAVES, and as you point out, XSAVES includes
> the compaction optimization feature tracked by XINUSE.
> So when AMX is enabled, but clean, XSAVES doesn't write zeros.
> Further, it omits the buffer space for AMX in the destination altogether!
> However, since XINUSE=1 is possible, we have to *allocate* a buffer
> large enough to handle the dirty data for when XSAVES can not
> employ that optimization.
> 
> 2. Entry into user signal handler saves into the user space sigframe.
> Here we use XSAVE, and so the hardware will write zeros for XINUSE=0,
> and for AVX512, we save neither time or space.
> 
> My understanding that for application compatibility, we can *not* compact
> the destination buffer that user-space sees.  This is because existing code
> may have adopted fixed size offsets.  (which is unfortunate).
> 
> And so, for AVX512, we both reserve the space, and we write zeros
> for clean AVX512 state.
> 
> For AMX, we must still reserve the space, but we are not going to write zeros
> for clean state.  We so this in software by checking XINUSE=0, and clearing
> the xstate_bf for the XSAVE.  As a result, for XINUSE=0, we can skip
> writing the zeros, even though we can't compress the space.

Why?

> 
> 3. user space always uses fully uncompacted XSAVE buffers.
> 

There is no reason we have to do this for new states. Arguably we shouldn’t for 
AMX to avoid yet another altstack explosion.

Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

2021-03-24 Thread Len Brown
On Tue, Mar 23, 2021 at 11:15 PM Liu, Jing2  wrote:

> > IMO, the problem with AVX512 state
> > is that we guaranteed it will be zero for XINUSE=0.
> > That means we have to write 0's on saves.

> why "we have to write 0's on saves" when XINUSE=0.
>
> Since due to SDM, if XINUSE=0, the XSAVES will *not* save the data and
> xstate_bv bit is 0; if use XSAVE, it need save the state but
> xstate_bv bit is also 0.
> >   It would be better
> > to be able to skip the write -- even if we can't save the space
> > we can save the data transfer.  (This is what we did for AMX).
> With XFD feature that XFD=1, XSAVE command still has to save INIT state
> to the area. So it seems with XINUSE=0 and XFD=1, the XSAVE(S) commands
> do the same that both can help save the data transfer.

Hi Jing, Good observation!

There are 3 cases.

1. Task context switch save into the context switch buffer.
Here we use XSAVES, and as you point out, XSAVES includes
the compaction optimization feature tracked by XINUSE.
So when AMX is enabled, but clean, XSAVES doesn't write zeros.
Further, it omits the buffer space for AMX in the destination altogether!
However, since XINUSE=1 is possible, we have to *allocate* a buffer
large enough to handle the dirty data for when XSAVES can not
employ that optimization.

2. Entry into user signal handler saves into the user space sigframe.
Here we use XSAVE, and so the hardware will write zeros for XINUSE=0,
and for AVX512, we save neither time or space.

My understanding that for application compatibility, we can *not* compact
the destination buffer that user-space sees.  This is because existing code
may have adopted fixed size offsets.  (which is unfortunate).

And so, for AVX512, we both reserve the space, and we write zeros
for clean AVX512 state.

For AMX, we must still reserve the space, but we are not going to write zeros
for clean state.  We so this in software by checking XINUSE=0, and clearing
the xstate_bf for the XSAVE.  As a result, for XINUSE=0, we can skip
writing the zeros, even though we can't compress the space.

3. user space always uses fully uncompacted XSAVE buffers.

> The reason I'm interested in XINUSE denotation is that it might be helpful
> for the XFD MSRs context switch cost during vmexit and vmenter.

As the guest OS may be using XFD, the VMM can not use it for itself.
Rather, the VMM must context switch it when it switches between guests.
(or not expose it to guests at all)

cheers,
-Len


cheers,
Len Brown, Intel Open Source Technology Center


Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

2021-03-24 Thread Dave Hansen
On 3/23/21 2:52 PM, Bae, Chang Seok wrote:
>>  "System software may disable use of Intel AMX by clearing XCR0[18:17], by
>>   clearing CR4.OSXSAVE, or by setting IA32_XFD[18]. It is recommended that
>>   system software initialize AMX state (e.g., by executing TILERELEASE)
>>   before doing so. This is because maintaining AMX state in a
>>   non-initialized state may have negative power and performance
>>   implications."
>>
>> I'm not seeing anything related to this. Is this a recommendation
>> which can be ignored or is that going to be duct taped into the code
>> base once the first user complains about slowdowns of their non AMX
>> workloads on that machine?
> I think this part in the doc is worth to be mentioned at first:
> 
> “The XTILEDATA state component is very large, and an operating system may
> prefer not to allocate memory for the XTILEDATA state of every user
> thread. Such an operating system that enables Intel AMX might prefer to
> prevent specific user threads from using the feature. An extension called
> extended feature disable (XFD) is added to the XSAVE feature set to
> support such a usage. XFD is described in Section 3.2.6.”
> 
> So, in this series, instead of saving this state always, the state is saved
> only when used. XFD helps to detect each thread’s first use of those
> registers. Thus, the XFD’s MSR bit is maintained as per-task here.

This doesn't really have anything to do with XFD.

The spec says, basically, "as long as you have AMX state in the
registers, you may pay a penalty".

When we switch between userspace tasks, AMX gets automatically
reinitialized by XRSTOR if the task to which we switch is not using AMX.
 All is good there.

But, what if we remain in the kernel?  Let's say kswapd is going to run
for a while.  Does kswapd pay the AMX-not-in-init-state penalty?  Or,
what if we want to go to idle?  Does AMX state affect *how* idle the CPU
can go?

We probably want to actively go out and zap AMX state at some
well-defined boundary.  It's radioactive.  Task switching seems as sane
a place as any to do that.


Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

2021-03-23 Thread Liu, Jing2




On 3/24/2021 5:01 AM, Len Brown wrote:

I have an obnoxious question: do we really want to use the XFD mechanism?

Obnoxious questions are often the most valuable! :-)

[...]
cheers,
Len Brown, Intel Open Source Technology Center

ps. I agree that un-necessary XINUSE=1 is possible.
Notwithstanding the issues initially deploying AVX512, I am skeptical
that it is common today.

Sorry, I'm trying to understand from...

IMO, the problem with AVX512 state
is that we guaranteed it will be zero for XINUSE=0.
That means we have to write 0's on saves.

why "we have to write 0's on saves" when XINUSE=0.

Since due to SDM, if XINUSE=0, the XSAVES will *not* save the data and
xstate_bv bit is 0; if use XSAVE, it need save the state but
xstate_bv bit is also 0.

  It would be better
to be able to skip the write -- even if we can't save the space
we can save the data transfer.  (This is what we did for AMX).

With XFD feature that XFD=1, XSAVE command still has to save INIT state
to the area. So it seems with XINUSE=0 and XFD=1, the XSAVE(S) commands
do the same that both can help save the data transfer.

The reason I'm interested in XINUSE denotation is that it might be helpful
for the XFD MSRs context switch cost during vmexit and vmenter.

Thanks,
Jing


pps. your idea of requiring the user to allocate their own signal stack
is interesting.   It isn't really about allocating the stack, though --
the stack of the task that uses the feature is generally fine already.
The opportunity is to allow tasks that do *not* use the new feature to
get away with minimal data transfer and stack size.  As we don't
have the 0's guarantee for AMX, we bought the important part
of that back.




Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

2021-03-23 Thread Bae, Chang Seok
On Mar 20, 2021, at 15:13, Thomas Gleixner  wrote:
> On Sun, Feb 21 2021 at 10:56, Chang S. Bae wrote:
>> +
>> +/* Update MSR IA32_XFD with xfirstuse_not_detected() if needed. */
>> +static inline void xdisable_switch(struct fpu *prev, struct fpu *next)
>> +{
>> +if (!static_cpu_has(X86_FEATURE_XFD) || !xfirstuse_enabled())
>> +return;
>> +
>> +if (unlikely(prev->state_mask != next->state_mask))
>> +xdisable_setbits(xfirstuse_not_detected(next));
>> +}
> 
> So this is invoked on context switch. Toggling bit 18 of MSR_IA32_XFD
> when it does not match. The spec document says:
> 
>  "System software may disable use of Intel AMX by clearing XCR0[18:17], by
>   clearing CR4.OSXSAVE, or by setting IA32_XFD[18]. It is recommended that
>   system software initialize AMX state (e.g., by executing TILERELEASE)
>   before doing so. This is because maintaining AMX state in a
>   non-initialized state may have negative power and performance
>   implications."
> 
> I'm not seeing anything related to this. Is this a recommendation
> which can be ignored or is that going to be duct taped into the code
> base once the first user complains about slowdowns of their non AMX
> workloads on that machine?

I think this part in the doc is worth to be mentioned at first:

“The XTILEDATA state component is very large, and an operating system may
prefer not to allocate memory for the XTILEDATA state of every user
thread. Such an operating system that enables Intel AMX might prefer to
prevent specific user threads from using the feature. An extension called
extended feature disable (XFD) is added to the XSAVE feature set to
support such a usage. XFD is described in Section 3.2.6.”

So, in this series, instead of saving this state always, the state is saved
only when used. XFD helps to detect each thread’s first use of those
registers. Thus, the XFD’s MSR bit is maintained as per-task here.

Thanks,
Chang

Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

2021-03-23 Thread Len Brown
> I have an obnoxious question: do we really want to use the XFD mechanism?

Obnoxious questions are often the most valuable! :-)

> Right now, glibc, and hence most user space code, blindly uses
> whatever random CPU features are present for no particularly good
> reason, which means that all these features get stuck in the XINUSE=1
> state, even if there is no code whatsoever in the process that
> benefits.  AVX512 is bad enough as we're seeing right now.  AMX will
> be much worse if this happens.
>
> We *could* instead use XCR0 and require an actual syscall to enable
> it.  We could even then play games like requiring whomever enables the
> feature to allocate memory for the state save area for signals, and
> signal delivery could save the state and disable the feature, this
> preventing the signal frame from blowing up to 8 or 12 or who knows
> how many kB.

This approach would have some challenges.

Enumeration today is two parts.
1. CPUID tells you if the feature exists in the HW
2. xgetbv/XCR0 tells you if the OS supports that feature

Since #2 would be missing, you are right, there would need to be
a new API enabling the user to request the OS to enable support
for that task.

If that new API is not invoked before the user touches the feature,
they die with a #UD.

And so there would need to be some assurance that the API is successfully
called before any library might use the feature.  Is there a practical way
to guarantee that, given that the feature may be used (or not) only by
a dynamically
linked library?

If a library spawns threads and queries the size of XSAVE before the API
is called, it may be confused when that size changes after the API is called.

So a simple question, "who calls the API, and when?" isn't so simple.

Finally, note that XCR0 updates cause a VMEXIT,
while XFD updates do not.

So context switching XCR0 is possible, but is problematic.

The other combination is XFD + API rather than XCR0 + API.
With XFD, the context switching is faster, and the faulting (#NM and
the new MSR with #NM cause) is viable.
We have the bit set in XCR0, so no state size advantage.
Still have issues with API logistics.
So we didn't see that the API adds any value, only pain,
over transparent 1st use enabling with XFD and no API.

cheers,
Len Brown, Intel Open Source Technology Center

ps. I agree that un-necessary XINUSE=1 is possible.
Notwithstanding the issues initially deploying AVX512, I am skeptical
that it is common today.  IMO, the problem with AVX512 state
is that we guaranteed it will be zero for XINUSE=0.
That means we have to write 0's on saves.  It would be better
to be able to skip the write -- even if we can't save the space
we can save the data transfer.  (This is what we did for AMX).

pps. your idea of requiring the user to allocate their own signal stack
is interesting.   It isn't really about allocating the stack, though --
the stack of the task that uses the feature is generally fine already.
The opportunity is to allow tasks that do *not* use the new feature to
get away with minimal data transfer and stack size.  As we don't
have the 0's guarantee for AMX, we bought the important part
of that back.


Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

2021-03-20 Thread Andy Lutomirski
On Sat, Mar 20, 2021 at 3:13 PM Thomas Gleixner  wrote:
>
> On Sun, Feb 21 2021 at 10:56, Chang S. Bae wrote:
> > +
> > +/* Update MSR IA32_XFD with xfirstuse_not_detected() if needed. */
> > +static inline void xdisable_switch(struct fpu *prev, struct fpu *next)
> > +{
> > + if (!static_cpu_has(X86_FEATURE_XFD) || !xfirstuse_enabled())
> > + return;
> > +
> > + if (unlikely(prev->state_mask != next->state_mask))
> > + xdisable_setbits(xfirstuse_not_detected(next));
> > +}
>
> So this is invoked on context switch. Toggling bit 18 of MSR_IA32_XFD
> when it does not match. The spec document says:
>
>   "System software may disable use of Intel AMX by clearing XCR0[18:17], by
>clearing CR4.OSXSAVE, or by setting IA32_XFD[18]. It is recommended that
>system software initialize AMX state (e.g., by executing TILERELEASE)
>before doing so. This is because maintaining AMX state in a
>non-initialized state may have negative power and performance
>implications."
>
> I'm not seeing anything related to this. Is this a recommendation
> which can be ignored or is that going to be duct taped into the code
> base once the first user complains about slowdowns of their non AMX
> workloads on that machine?

I have an obnoxious question: do we really want to use the XFD mechanism?

Right now, glibc, and hence most user space code, blindly uses
whatever random CPU features are present for no particularly good
reason, which means that all these features get stuck in the XINUSE=1
state, even if there is no code whatsoever in the process that
benefits.  AVX512 is bad enough as we're seeing right now.  AMX will
be much worse if this happens.

We *could* instead use XCR0 and require an actual syscall to enable
it.  We could even then play games like requiring whomever enables the
feature to allocate memory for the state save area for signals, and
signal delivery could save the state and disable the feature, this
preventing the signal frame from blowing up to 8 or 12 or who knows
how many kB.

--Andy


Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

2021-03-20 Thread Thomas Gleixner
On Sun, Feb 21 2021 at 10:56, Chang S. Bae wrote:
> +
> +/* Update MSR IA32_XFD with xfirstuse_not_detected() if needed. */
> +static inline void xdisable_switch(struct fpu *prev, struct fpu *next)
> +{
> + if (!static_cpu_has(X86_FEATURE_XFD) || !xfirstuse_enabled())
> + return;
> +
> + if (unlikely(prev->state_mask != next->state_mask))
> + xdisable_setbits(xfirstuse_not_detected(next));
> +}

So this is invoked on context switch. Toggling bit 18 of MSR_IA32_XFD
when it does not match. The spec document says:

  "System software may disable use of Intel AMX by clearing XCR0[18:17], by
   clearing CR4.OSXSAVE, or by setting IA32_XFD[18]. It is recommended that
   system software initialize AMX state (e.g., by executing TILERELEASE)
   before doing so. This is because maintaining AMX state in a
   non-initialized state may have negative power and performance
   implications."

I'm not seeing anything related to this. Is this a recommendation
which can be ignored or is that going to be duct taped into the code
base once the first user complains about slowdowns of their non AMX
workloads on that machine?

Thanks,

tglx


[PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

2021-02-21 Thread Chang S. Bae
Intel's Extended Feature Disable (XFD) feature is an extension of the XSAVE
architecture. XFD allows the kernel to enable a feature state in XCR0 and
to receive a #NM trap when a task uses instructions accessing that state.
In this way, Linux can defer allocating the large XSAVE buffer until tasks
need it.

XFD introduces two MSRs: IA32_XFD to enable/disable the feature and
IA32_XFD_ERR to assist the #NM trap handler. Both use the same
state-component bitmap format, used by XCR0.

Use this hardware capability to find the right time to expand the xstate
buffer. Introduce two sets of helper functions for that:

1. The first set is primarily for interacting with the XFD hardware:
xdisable_setbits()
xdisable_getbits()
xdisable_switch()

2. The second set is for managing the first-use status and handling #NM
   trap:
xfirstuse_enabled()
xfirstuse_not_detected()

The #NM handler induces the xstate buffer expansion to save the first-used
states.

The XFD feature is enabled only for the compacted format. If the kernel
uses the standard format, the buffer has to be always enough for all the
states.

Signed-off-by: Chang S. Bae 
Reviewed-by: Len Brown 
Cc: x...@kernel.org
Cc: linux-kernel@vger.kernel.org
---
Changes from v3:
* Removed 'no functional change' in the changelog. (Borislav Petkov)

Changes from v2:
* Changed to enable XFD only when the compacted format is used.
* Updated the changelog with task->fpu removed. (Borislav Petkov)

Changes from v1:
* Inlined the XFD-induced #NM handling code (Andy Lutomirski)
---
 arch/x86/include/asm/cpufeatures.h  |  1 +
 arch/x86/include/asm/fpu/internal.h | 51 -
 arch/x86/include/asm/msr-index.h|  2 ++
 arch/x86/kernel/cpu/cpuid-deps.c|  1 +
 arch/x86/kernel/fpu/xstate.c| 37 +++--
 arch/x86/kernel/process.c   |  5 +++
 arch/x86/kernel/process_32.c|  2 +-
 arch/x86/kernel/process_64.c|  2 +-
 arch/x86/kernel/traps.c | 40 ++
 9 files changed, 135 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/cpufeatures.h 
b/arch/x86/include/asm/cpufeatures.h
index 84b887825f12..3170ab367cf2 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -277,6 +277,7 @@
 #define X86_FEATURE_XSAVEC (10*32+ 1) /* XSAVEC instruction */
 #define X86_FEATURE_XGETBV1(10*32+ 2) /* XGETBV with ECX = 1 
instruction */
 #define X86_FEATURE_XSAVES (10*32+ 3) /* XSAVES/XRSTORS 
instructions */
+#define X86_FEATURE_XFD(10*32+ 4) /* eXtended Feature 
Disabling */
 
 /*
  * Extended auxiliary flags: Linux defined - for features scattered in various
diff --git a/arch/x86/include/asm/fpu/internal.h 
b/arch/x86/include/asm/fpu/internal.h
index f964f3efc92e..c467312d38d8 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -557,11 +557,58 @@ static inline void switch_fpu_prepare(struct fpu 
*old_fpu, int cpu)
  * Misc helper functions:
  */
 
+/* The first-use detection helpers: */
+
+static inline void xdisable_setbits(u64 value)
+{
+   wrmsrl_safe(MSR_IA32_XFD, value);
+}
+
+static inline u64 xdisable_getbits(void)
+{
+   u64 value;
+
+   rdmsrl_safe(MSR_IA32_XFD, );
+   return value;
+}
+
+static inline u64 xfirstuse_enabled(void)
+{
+   /* All the dynamic user components are first-use enabled. */
+   return xfeatures_mask_user_dynamic;
+}
+
+/*
+ * Convert fpu->state_mask to the xdisable configuration to be written to
+ * MSR IA32_XFD.  So, xdisable_setbits() only uses this outcome.
+ */
+static inline u64 xfirstuse_not_detected(struct fpu *fpu)
+{
+   u64 firstuse_bv = (fpu->state_mask & xfirstuse_enabled());
+
+   /*
+* If first-use is not detected, set the bit. If the detection is
+* not enabled, the bit is always zero in firstuse_bv. So, make
+* following conversion:
+*/
+   return  (xfirstuse_enabled() ^ firstuse_bv);
+}
+
+/* Update MSR IA32_XFD with xfirstuse_not_detected() if needed. */
+static inline void xdisable_switch(struct fpu *prev, struct fpu *next)
+{
+   if (!static_cpu_has(X86_FEATURE_XFD) || !xfirstuse_enabled())
+   return;
+
+   if (unlikely(prev->state_mask != next->state_mask))
+   xdisable_setbits(xfirstuse_not_detected(next));
+}
+
 /*
  * Load PKRU from the FPU context if available. Delay loading of the
  * complete FPU state until the return to userland.
  */
-static inline void switch_fpu_finish(struct fpu *new_fpu)
+static inline void switch_fpu_finish(struct fpu *old_fpu, struct fpu *new_fpu)
 {
u32 pkru_val = init_pkru_value;
struct pkru_state *pk;
@@ -571,6 +618,8 @@ static inline void switch_fpu_finish(struct fpu *new_fpu)
 
set_thread_flag(TIF_NEED_FPU_LOAD);
 
+   xdisable_switch(old_fpu, new_fpu);
+
if