Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use
On Tue, 2020-11-03 at 13:41 -0800, Dave Hansen wrote:
> On 11/3/20 1:32 PM, Bae, Chang Seok wrote:
> > On Wed, 2020-10-14 at 09:29 -0700, Dave Hansen wrote:
> > > On 10/14/20 9:10 AM, Andy Lutomirski wrote:
> > > > I was suggesting something a little bit different.  We'd keep XMM,
> > > > YMM, ZMM, etc state stored exactly the way we do now and, for
> > > > AMX-using tasks, we would save the AMX state in an entirely
> > > > separate buffer.  This way the pain of having a variable xstate
> > > > layout is confined just to AMX tasks.
> > >
> > > OK, got it.
> > >
> > > So, we'd either need a second set of XSAVE/XRSTORs, or "manual"
> > > copying of the registers out to memory.  We can preserve the modified
> > > optimization if we're careful about ordering, but only for *ONE* of
> > > the XSAVE buffers (if we use two).
> >
> > For what is worth,
> >
> > If using two buffers, the buffer for saving the tile data also needs
> > space for the legacy states.
>
> Just to be clear, you're talking about the 512-byte 'struct
> fxregs_state' which is the first member of 'struct xregs_state', right?

Yes.

> > The AMX state is stored at the end of the XSAVE buffer (at least for
> > now).  So, the layout (in terms of offsets of non-AMX states) won't be
> > changed at run-time.
>
> I don't know what point you are trying to make here.

I was trying to clarify the concern that Andy had: "the pain of having a
variable xstate layout"

Thanks,
Chang
Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use
On 11/3/20 1:32 PM, Bae, Chang Seok wrote:
> On Wed, 2020-10-14 at 09:29 -0700, Dave Hansen wrote:
>> On 10/14/20 9:10 AM, Andy Lutomirski wrote:
>>> I was suggesting something a little bit different.  We'd keep XMM,
>>> YMM, ZMM, etc state stored exactly the way we do now and, for
>>> AMX-using tasks, we would save the AMX state in an entirely separate
>>> buffer.  This way the pain of having a variable xstate layout is
>>> confined just to AMX tasks.
>>
>> OK, got it.
>>
>> So, we'd either need a second set of XSAVE/XRSTORs, or "manual" copying
>> of the registers out to memory.  We can preserve the modified
>> optimization if we're careful about ordering, but only for *ONE* of the
>> XSAVE buffers (if we use two).
>
> For what is worth,
>
> If using two buffers, the buffer for saving the tile data also needs space
> for the legacy states.

Just to be clear, you're talking about the 512-byte 'struct
fxregs_state' which is the first member of 'struct xregs_state', right?

Basically, any two-buffer model wastes 576 bytes of space (legacy fxsave
plus xsave header) for each task that uses two buffers.  This provides an
overall space savings unless there are something like 16x more TMUL-using
tasks than non-TMUL-using tasks.

We could eliminate the 576 bytes of waste, but at the *performance* cost
of having a permanently (non-task_struct-embedded) out-of-line XSAVE
buffer for all tasks.

> The AMX state is stored at the end of the XSAVE buffer (at least for now).
> So, the layout (in terms of offsets of non-AMX states) won't be changed at
> run-time.

I don't know what point you are trying to make here.
Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use
On Wed, 2020-10-14 at 09:29 -0700, Dave Hansen wrote:
> On 10/14/20 9:10 AM, Andy Lutomirski wrote:
> >
> > I was suggesting something a little bit different.  We'd keep XMM,
> > YMM, ZMM, etc state stored exactly the way we do now and, for
> > AMX-using tasks, we would save the AMX state in an entirely separate
> > buffer.  This way the pain of having a variable xstate layout is
> > confined just to AMX tasks.
>
> OK, got it.
>
> So, we'd either need a second set of XSAVE/XRSTORs, or "manual" copying
> of the registers out to memory.  We can preserve the modified
> optimization if we're careful about ordering, but only for *ONE* of the
> XSAVE buffers (if we use two).

For what it's worth,

If using two buffers, the buffer for saving the tile data also needs
space for the legacy states.

The AMX state is stored at the end of the XSAVE buffer (at least for
now).  So, the layout (in terms of offsets of non-AMX states) won't be
changed at run-time.

Thanks,
Chang
Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use
On 10/14/20 9:10 AM, Andy Lutomirski wrote:
>> Actually, I think the modified optimization would survive such a scheme:
>>
>>  * copy page array into percpu area
>>  * XRSTORS from percpu area, modified optimization tuple is saved
>>  * run userspace
>>  * XSAVES back to percpu area.  tuple matches, modified optimization
>>    is still in play
>>  * copy percpu area back to page array
>>
>> Since the XRSTORS->XSAVES pair is both done to the percpu area, the
>> XSAVE tracking hardware never knows it isn't working on the "canonical"
>> buffer (the page array).
> I was suggesting something a little bit different.  We'd keep XMM,
> YMM, ZMM, etc state stored exactly the way we do now and, for
> AMX-using tasks, we would save the AMX state in an entirely separate
> buffer.  This way the pain of having a variable xstate layout is
> confined just to AMX tasks.

OK, got it.

So, we'd either need a second set of XSAVE/XRSTORs, or "manual" copying
of the registers out to memory.  We can preserve the modified
optimization if we're careful about ordering, but only for *ONE* of the
XSAVE buffers (if we use two).

> I'm okay with vmalloc() too, but I do think we need to deal with the
> various corner cases like allocation failing.

Yeah, agreed about handling the corner cases.

Also, if we preserve plain old vmalloc() for now, we need good
tracepoints or stats so we can precisely figure out how many vmalloc()s
(and IPIs) are due to AMX.
Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use
On Tue, Oct 13, 2020 at 11:03 PM Dave Hansen wrote:
>
> On 10/13/20 6:11 PM, Andy Lutomirski wrote:
> > I have no problem with vmalloc(), but I do have a problem with
> > vfree() due to the IPIs that result.  We need a cache or something.
>
> This sounds like the kind of thing we should just build into vmalloc()
> instead of having a bunch of callers implement their own caches.  It
> shouldn't be too much of a challenge to have vmalloc() keep a cacheline
> or two of stats about common vmalloc() sizes and keep some pools around.
>
> It's not going to be hard to implement caches to reduce vfree()-induced
> churn, but I'm having a hard time imaging that it'll have anywhere near
> the benefits that it did for stacks.  Tasks fundamentally come and go a
> *lot*, and those paths are hot.
>
> Tasks who go to the trouble to populate 8k or 64k of register state
> fundamentally *can't* come and go frequently.  We also (probably) don't
> have to worry about AMX tasks doing fork()/exec() pairs and putting
> pressure on vmalloc().  Before an app can call out to library functions
> to do the fork, they've got to save the state off themselves and likely
> get it back to the init state.  The fork() code can tell AMX is in the
> init state and decline to allocate actual space for the child.
>
> > I have to say: this mechanism is awful.  Can we get away with skipping
> > the dynamic XSAVES mess entirely?  What if we instead allocate
> > however much space we need as an array of pages and have one percpu
> > contiguous region.  To save, we XSAVE(S or C) just the AMX state to
> > the percpu area and then copy it.  To restore, we do the inverse.  Or
> > would this kill the modified optimization and thus be horrible?
>
> Actually, I think the modified optimization would survive such a scheme:
>
>  * copy page array into percpu area
>  * XRSTORS from percpu area, modified optimization tuple is saved
>  * run userspace
>  * XSAVES back to percpu area.  tuple matches, modified optimization
>    is still in play
>  * copy percpu area back to page array
>
> Since the XRSTORS->XSAVES pair is both done to the percpu area, the
> XSAVE tracking hardware never knows it isn't working on the "canonical"
> buffer (the page array).

I was suggesting something a little bit different.  We'd keep XMM,
YMM, ZMM, etc state stored exactly the way we do now and, for
AMX-using tasks, we would save the AMX state in an entirely separate
buffer.  This way the pain of having a variable xstate layout is
confined just to AMX tasks.

I'm okay with vmalloc() too, but I do think we need to deal with the
various corner cases like allocation failing.
Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use
On Tue, Oct 13, 2020 at 03:43:59PM -0700, Dave Hansen wrote:
> On 10/13/20 3:31 PM, Brown, Len wrote:
> > vmalloc() does not fail, and does not return an error, and so there
> > is no concept of returning a signal.
>
> Well, the order-0 allocations are no-fail, as are the vmalloc kernel
> structures and the page tables that might have to be allocated.  But,
> that's not guaranteed to be in place *forever*.  I think we still need
> to check for and handle allocation failures, even if they're not known
> to be possible today.

Quite. On top of that, we could run out of vmalloc space (unlikely, but
still).

You really have to deal with vmalloc() failing.
Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use
> On Oct 13, 2020, at 3:31 PM, Brown, Len wrote:
>
>> From: Andy Lutomirski
>
>>> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
>>> @@ -518,3 +518,40 @@ int fpu__exception_code(struct fpu *fpu, int trap_nr)
> ..
>>> +bool xfirstuse_event_handler(struct fpu *fpu)
>>> +{
>>> +	bool handled = false;
>>> +	u64 event_mask;
>>> +
>>> +	/* Check whether the first-use detection is running. */
>>> +	if (!static_cpu_has(X86_FEATURE_XFD) || !xfirstuse_enabled())
>>> +		return handled;
>>> +
>
>> MSR_IA32_XFD_ERR needs to be wired up in the exception handler, not in
>> some helper called farther down the stack
>
> xfirstuse_event_handler() is called directly from the IDTENTRY
> exc_device_not_available():
>
>>> @@ -1028,6 +1028,9 @@ DEFINE_IDTENTRY(exc_device_not_available)
>>> {
>>> 	unsigned long cr0 = read_cr0();
>>>
>>> +	if (xfirstuse_event_handler(&current->thread.fpu))
>>> +		return;
>
> Are you suggesting we should instead open code it inside that routine?

MSR_IA32_XFD_ERR is like CR2 and DR6 -- it's functionally a part of the
exception.  Let's handle it as such.  (And, a bit like DR6, it's a bit
broken.)

>> But this raises an interesting point -- what happens if allocation
>> fails?  I think that, from kernel code, we simply cannot support this
>> exception mechanism.  If kernel code wants to use AMX (and that would
>> be very strange indeed), it should call x86_i_am_crazy_amx_begin() and
>> handle errors, not rely on exceptions.  From user code, I assume we
>> send a signal if allocation fails.
>
> The XFD feature allows us to transparently expand the kernel context
> switch buffer for a user task, when that task first touches the
> associated hardware.  It allows applications to operate as if the kernel
> had allocated the backing AMX context switch buffer at initialization
> time.  However, since we expect only a sub-set of tasks to actually use
> AMX, we instead defer allocation until we actually see first use for
> that task, rather than allocating for all tasks.

I sure hope that not all tasks use it.  Context-switching it will be
absurdly expensive.

> While we currently don't propose AMX use inside the kernel, it is
> conceivable that could be done in the future, just like AVX is used by
> the RAID code; and it would be done the same way --
> kernel_fpu_begin()/kernel_fpu_end().
> Such future kernel AMX use would _not_ arm XFD, and would _not_ provoke
> this fault.
> (note that we context switch the XFD-armed state per task)

How expensive is *that*?  Can you give approximate cycle counts for
saving, restoring, arming and disarming?

This reminds me of TS.  Writing TS was more expensive than saving the
whole FPU state.  And, for kernel code, we can't just "not arm" the XFD
-- we would have to disarm it.

> vmalloc() does not fail, and does not return an error, and so there is
> no concept of returning a signal.  If we got to the point where
> vmalloc() sleeps, then the system has bigger OOM issues, and the OOM
> killer would be on the prowl.
>
> If we were concerned about using vmalloc for a couple of pages in the
> task structure, then we could implement a routine to harvest unused
> buffers and free them

Kind of like we vmalloc a couple pages for kernel stacks, and we
carefully cache them.  And we handle failure gracefully.
Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use
> On Oct 13, 2020, at 3:44 PM, Dave Hansen wrote:
>
> On 10/13/20 3:31 PM, Brown, Len wrote:
>> vmalloc() does not fail, and does not return an error, and so there is
>> no concept of returning a signal.
>
> Well, the order-0 allocations are no-fail, as are the vmalloc kernel
> structures and the page tables that might have to be allocated.  But,
> that's not guaranteed to be in place *forever*.  I think we still need
> to check for and handle allocation failures, even if they're not known
> to be possible today.
>
>> If we got to the point where vmalloc() sleeps, then the system
>> has bigger OOM issues, and the OOM killer would be on the prowl.
>
> vmalloc() can *certainly* sleep.  Allocation failures mean returning
> NULL from the allocator, and the very way we avoid doing that is by
> sleeping to go reclaim some memory from some other allocation.
>
> Sleeping is a normal and healthy part of handling allocation requests,
> including vmalloc().
>
>> If we were concerned about using vmalloc for a couple of pages in the
>> task structure, then we could implement a routine to harvest unused
>> buffers and free them -- but that didn't seem worth the complexity.
>> Note that this feature is 64-bit only.
>
> IMNHO, vmalloc() is overkill for ~10k, which is roughly the size of the
> XSAVE buffer for the first AMX implementation.  But, it's not overkill
> for the ~66k of space that will be needed if some CPU implementation
> comes along and uses all of the architectural space AMX provides.

I have no problem with vmalloc(), but I do have a problem with vfree()
due to the IPIs that result.  We need a cache or something.

I have to say: this mechanism is awful.  Can we get away with skipping
the dynamic XSAVES mess entirely?  What if we instead allocate however
much space we need as an array of pages and have one percpu contiguous
region.  To save, we XSAVE(S or C) just the AMX state to the percpu
area and then copy it.  To restore, we do the inverse.  Or would this
kill the modified optimization and thus be horrible?
Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use
On 10/13/20 6:11 PM, Andy Lutomirski wrote:
> I have no problem with vmalloc(), but I do have a problem with
> vfree() due to the IPIs that result.  We need a cache or something.

This sounds like the kind of thing we should just build into vmalloc()
instead of having a bunch of callers implement their own caches.  It
shouldn't be too much of a challenge to have vmalloc() keep a cacheline
or two of stats about common vmalloc() sizes and keep some pools around.

It's not going to be hard to implement caches to reduce vfree()-induced
churn, but I'm having a hard time imagining that it'll have anywhere near
the benefits that it did for stacks.  Tasks fundamentally come and go a
*lot*, and those paths are hot.

Tasks who go to the trouble to populate 8k or 64k of register state
fundamentally *can't* come and go frequently.  We also (probably) don't
have to worry about AMX tasks doing fork()/exec() pairs and putting
pressure on vmalloc().  Before an app can call out to library functions
to do the fork, they've got to save the state off themselves and likely
get it back to the init state.  The fork() code can tell AMX is in the
init state and decline to allocate actual space for the child.

> I have to say: this mechanism is awful.  Can we get away with skipping
> the dynamic XSAVES mess entirely?  What if we instead allocate
> however much space we need as an array of pages and have one percpu
> contiguous region.  To save, we XSAVE(S or C) just the AMX state to
> the percpu area and then copy it.  To restore, we do the inverse.  Or
> would this kill the modified optimization and thus be horrible?

Actually, I think the modified optimization would survive such a scheme:

 * copy page array into percpu area
 * XRSTORS from percpu area, modified optimization tuple is saved
 * run userspace
 * XSAVES back to percpu area.  tuple matches, modified optimization
   is still in play
 * copy percpu area back to page array

Since the XRSTORS->XSAVES pair is both done to the percpu area, the
XSAVE tracking hardware never knows it isn't working on the "canonical"
buffer (the page array).

It seems a bit ugly, but it's not like an 8k memcpy() is *that* painful.
The double dcache footprint isn't super nice, though.

Chang, I think that leaves us with three possibilities:

1. use plain old vmalloc()
2. use vmalloc(), but implement caches for large XSAVE state, either
   in front of, or inside the vmalloc() API
3. Use a static per-cpu buffer for XSAVE*/XRSTOR* and memcpy() to/from
   a scattered list of pages.

A #4 also comes to mind: Do something somewhat like kmap_atomic().  Map
the scattered pages contiguously and use XSAVE*/XRSTOR* directly, but
unmap immediately after XSAVE*/XRSTOR*.  For ~12k of state, I suspect
the ~400-500 cycles for 3 INVLPG's might beat out a memcpy() of 12k of
state.  Global TLB invalidations would not be required.

I have to say, though, that I'd be OK with sticking with plain old
vmalloc() for now.
Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use
On 10/13/20 3:31 PM, Brown, Len wrote:
> vmalloc() does not fail, and does not return an error, and so there is
> no concept of returning a signal.

Well, the order-0 allocations are no-fail, as are the vmalloc kernel
structures and the page tables that might have to be allocated.  But,
that's not guaranteed to be in place *forever*.  I think we still need
to check for and handle allocation failures, even if they're not known
to be possible today.

> If we got to the point where vmalloc() sleeps, then the system
> has bigger OOM issues, and the OOM killer would be on the prowl.

vmalloc() can *certainly* sleep.  Allocation failures mean returning
NULL from the allocator, and the very way we avoid doing that is by
sleeping to go reclaim some memory from some other allocation.

Sleeping is a normal and healthy part of handling allocation requests,
including vmalloc().

> If we were concerned about using vmalloc for a couple of pages in the
> task structure, then we could implement a routine to harvest unused
> buffers and free them -- but that didn't seem worth the complexity.
> Note that this feature is 64-bit only.

IMNHO, vmalloc() is overkill for ~10k, which is roughly the size of the
XSAVE buffer for the first AMX implementation.  But, it's not overkill
for the ~66k of space that will be needed if some CPU implementation
comes along and uses all of the architectural space AMX provides.
RE: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use
> From: Andy Lutomirski

> > diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> > @@ -518,3 +518,40 @@ int fpu__exception_code(struct fpu *fpu, int trap_nr)
..
> > +bool xfirstuse_event_handler(struct fpu *fpu)
> > +{
> > +	bool handled = false;
> > +	u64 event_mask;
> > +
> > +	/* Check whether the first-use detection is running. */
> > +	if (!static_cpu_has(X86_FEATURE_XFD) || !xfirstuse_enabled())
> > +		return handled;
> > +

> MSR_IA32_XFD_ERR needs to be wired up in the exception handler, not in
> some helper called farther down the stack

xfirstuse_event_handler() is called directly from the IDTENTRY
exc_device_not_available():

> > @@ -1028,6 +1028,9 @@ DEFINE_IDTENTRY(exc_device_not_available)
> > {
> > 	unsigned long cr0 = read_cr0();
> >
> > +	if (xfirstuse_event_handler(&current->thread.fpu))
> > +		return;

Are you suggesting we should instead open code it inside that routine?

> But this raises an interesting point -- what happens if allocation
> fails?  I think that, from kernel code, we simply cannot support this
> exception mechanism.  If kernel code wants to use AMX (and that would
> be very strange indeed), it should call x86_i_am_crazy_amx_begin() and
> handle errors, not rely on exceptions.  From user code, I assume we
> send a signal if allocation fails.

The XFD feature allows us to transparently expand the kernel context
switch buffer for a user task, when that task first touches the
associated hardware.  It allows applications to operate as if the kernel
had allocated the backing AMX context switch buffer at initialization
time.  However, since we expect only a sub-set of tasks to actually use
AMX, we instead defer allocation until we actually see first use for
that task, rather than allocating for all tasks.

While we currently don't propose AMX use inside the kernel, it is
conceivable that could be done in the future, just like AVX is used by
the RAID code; and it would be done the same way --
kernel_fpu_begin()/kernel_fpu_end().  Such future kernel AMX use would
_not_ arm XFD, and would _not_ provoke this fault.  (Note that we
context switch the XFD-armed state per task.)

vmalloc() does not fail, and does not return an error, and so there is
no concept of returning a signal.  If we got to the point where
vmalloc() sleeps, then the system has bigger OOM issues, and the OOM
killer would be on the prowl.

If we were concerned about using vmalloc for a couple of pages in the
task structure, then we could implement a routine to harvest unused
buffers and free them -- but that didn't seem worth the complexity.
Note that this feature is 64-bit only.

Thanks,
-Len
Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use
On Thu, Oct 1, 2020 at 1:43 PM Chang S. Bae wrote:
>
> Intel's Extended Feature Disable (XFD) feature is an extension of the XSAVE
> architecture. XFD allows the kernel to enable a feature state in XCR0 and
> to receive a #NM trap when a task uses instructions accessing that state.
> In this way, Linux can allocate the large task->fpu buffer only for tasks
> that use it.
>
> XFD introduces two MSRs: IA32_XFD to enable/disable the feature and
> IA32_XFD_ERR to assist the #NM trap handler. Both use the same
> state-component bitmap format, used by XCR0.
>
> Use this hardware capability to find the right time to expand xstate area.
> Introduce two sets of helper functions for that:
>
> 1. The first set is primarily for interacting with the XFD hardware
>    feature. Helpers for configuring disablement, e.g. in context switching,
>    are:
>        xdisable_setbits()
>        xdisable_getbits()
>        xdisable_switch()
>
> 2. The second set is for managing the first-use status and handling #NM
>    trap:
>        xfirstuse_enabled()
>        xfirstuse_not_detected()
>        xfirstuse_event_handler()
>
> The #NM handler induces the xstate area expansion to save the first-used
> states.
>
> No functional change until the kernel enables dynamic user states and XFD.
>
> Signed-off-by: Chang S. Bae
> Reviewed-by: Len Brown
> Cc: x...@kernel.org
> Cc: linux-kernel@vger.kernel.org
> ---
>  arch/x86/include/asm/cpufeatures.h  |  1 +
>  arch/x86/include/asm/fpu/internal.h | 53 -
>  arch/x86/include/asm/msr-index.h    |  2 ++
>  arch/x86/kernel/fpu/core.c          | 37 
>  arch/x86/kernel/fpu/xstate.c        | 34 --
>  arch/x86/kernel/process.c           |  5 +++
>  arch/x86/kernel/process_32.c        |  2 +-
>  arch/x86/kernel/process_64.c        |  2 +-
>  arch/x86/kernel/traps.c             |  3 ++
>  9 files changed, 133 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> index 2901d5df4366..7d7fe1d82966 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -274,6 +274,7 @@
>  #define X86_FEATURE_XSAVEC		(10*32+ 1) /* XSAVEC instruction */
>  #define X86_FEATURE_XGETBV1		(10*32+ 2) /* XGETBV with ECX = 1 instruction */
>  #define X86_FEATURE_XSAVES		(10*32+ 3) /* XSAVES/XRSTORS instructions */
> +#define X86_FEATURE_XFD			(10*32+ 4) /* eXtended Feature Disabling */
>
>  /*
>   * Extended auxiliary flags: Linux defined - for features scattered in various
> diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
> index 3b03ead87a46..f5dbbaa060fb 100644
> --- a/arch/x86/include/asm/fpu/internal.h
> +++ b/arch/x86/include/asm/fpu/internal.h
> @@ -572,11 +572,60 @@ static inline void switch_fpu_prepare(struct fpu *old_fpu, int cpu)
>   * Misc helper functions:
>   */
>
> +/* The first-use detection helpers: */
> +
> +static inline void xdisable_setbits(u64 value)
> +{
> +	wrmsrl_safe(MSR_IA32_XFD, value);
> +}
> +
> +static inline u64 xdisable_getbits(void)
> +{
> +	u64 value;
> +
> +	rdmsrl_safe(MSR_IA32_XFD, &value);
> +	return value;
> +}
> +
> +static inline u64 xfirstuse_enabled(void)
> +{
> +	/* All the dynamic user components are first-use enabled. */
> +	return xfeatures_mask_user_dynamic;
> +}
> +
> +/*
> + * Convert fpu->firstuse_bv to xdisable configuration in MSR IA32_XFD.
> + * xdisable_setbits() only uses this.
> + */
> +static inline u64 xfirstuse_not_detected(struct fpu *fpu)
> +{
> +	u64 firstuse_bv = (fpu->state_mask & xfirstuse_enabled());
> +
> +	/*
> +	 * If first-use is not detected, set the bit. If the detection is
> +	 * not enabled, the bit is always zero in firstuse_bv. So, make
> +	 * following conversion:
> +	 */
> +	return (xfirstuse_enabled() ^ firstuse_bv);
> +}
> +
> +/* Update MSR IA32_XFD based on fpu->firstuse_bv */
> +static inline void xdisable_switch(struct fpu *prev, struct fpu *next)
> +{
> +	if (!static_cpu_has(X86_FEATURE_XFD) || !xfirstuse_enabled())
> +		return;
> +
> +	if (unlikely(prev->state_mask != next->state_mask))
> +		xdisable_setbits(xfirstuse_not_detected(next));
> +}
> +
> +bool xfirstuse_event_handler(struct fpu *fpu);
> +
>  /*
>   * Load PKRU from the FPU context if available. Delay loading of the
>   * complete FPU state until the return to userland.
>   */
> -static inline void switch_fpu_finish(struct fpu *new_fpu)
> +static inline void switch_fpu_finish(struct fpu *old_fpu, struct fpu *new_fpu)
>  {
>  	u32 pkru_val = init_pkru_value;
>  	struct pkru_state *pk;
> @@ -586,6 +635,8 @@ static inline void switch_fpu_finish(struct fpu *new_fpu)
>
>  	set_thread_flag(TIF_NEED_FPU_LOAD);
>
> +	xdisable_switch(old_fpu, new_fpu);
> +
>
[RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use
Intel's Extended Feature Disable (XFD) feature is an extension of the XSAVE
architecture. XFD allows the kernel to enable a feature state in XCR0 and
to receive a #NM trap when a task uses instructions accessing that state.
In this way, Linux can allocate the large task->fpu buffer only for tasks
that use it.

XFD introduces two MSRs: IA32_XFD to enable/disable the feature and
IA32_XFD_ERR to assist the #NM trap handler. Both use the same
state-component bitmap format, used by XCR0.

Use this hardware capability to find the right time to expand the xstate
area. Introduce two sets of helper functions for that:

1. The first set is primarily for interacting with the XFD hardware
   feature. Helpers for configuring disablement, e.g. in context switching,
   are:
       xdisable_setbits()
       xdisable_getbits()
       xdisable_switch()

2. The second set is for managing the first-use status and handling the #NM
   trap:
       xfirstuse_enabled()
       xfirstuse_not_detected()
       xfirstuse_event_handler()

The #NM handler induces the xstate area expansion to save the first-used
states.

No functional change until the kernel enables dynamic user states and XFD.

Signed-off-by: Chang S. Bae
Reviewed-by: Len Brown
Cc: x...@kernel.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/include/asm/cpufeatures.h  |  1 +
 arch/x86/include/asm/fpu/internal.h | 53 -
 arch/x86/include/asm/msr-index.h    |  2 ++
 arch/x86/kernel/fpu/core.c          | 37 
 arch/x86/kernel/fpu/xstate.c        | 34 --
 arch/x86/kernel/process.c           |  5 +++
 arch/x86/kernel/process_32.c        |  2 +-
 arch/x86/kernel/process_64.c        |  2 +-
 arch/x86/kernel/traps.c             |  3 ++
 9 files changed, 133 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 2901d5df4366..7d7fe1d82966 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -274,6 +274,7 @@
 #define X86_FEATURE_XSAVEC		(10*32+ 1) /* XSAVEC instruction */
 #define X86_FEATURE_XGETBV1		(10*32+ 2) /* XGETBV with ECX = 1 instruction */
 #define X86_FEATURE_XSAVES		(10*32+ 3) /* XSAVES/XRSTORS instructions */
+#define X86_FEATURE_XFD			(10*32+ 4) /* eXtended Feature Disabling */
 
 /*
  * Extended auxiliary flags: Linux defined - for features scattered in various
diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 3b03ead87a46..f5dbbaa060fb 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -572,11 +572,60 @@ static inline void switch_fpu_prepare(struct fpu *old_fpu, int cpu)
  * Misc helper functions:
  */
 
+/* The first-use detection helpers: */
+
+static inline void xdisable_setbits(u64 value)
+{
+	wrmsrl_safe(MSR_IA32_XFD, value);
+}
+
+static inline u64 xdisable_getbits(void)
+{
+	u64 value;
+
+	rdmsrl_safe(MSR_IA32_XFD, &value);
+	return value;
+}
+
+static inline u64 xfirstuse_enabled(void)
+{
+	/* All the dynamic user components are first-use enabled. */
+	return xfeatures_mask_user_dynamic;
+}
+
+/*
+ * Convert fpu->firstuse_bv to xdisable configuration in MSR IA32_XFD.
+ * xdisable_setbits() only uses this.
+ */
+static inline u64 xfirstuse_not_detected(struct fpu *fpu)
+{
+	u64 firstuse_bv = (fpu->state_mask & xfirstuse_enabled());
+
+	/*
+	 * If first-use is not detected, set the bit. If the detection is
+	 * not enabled, the bit is always zero in firstuse_bv. So, make
+	 * the following conversion:
+	 */
+	return (xfirstuse_enabled() ^ firstuse_bv);
+}
+
+/* Update MSR IA32_XFD based on fpu->firstuse_bv */
+static inline void xdisable_switch(struct fpu *prev, struct fpu *next)
+{
+	if (!static_cpu_has(X86_FEATURE_XFD) || !xfirstuse_enabled())
+		return;
+
+	if (unlikely(prev->state_mask != next->state_mask))
+		xdisable_setbits(xfirstuse_not_detected(next));
+}
+
+bool xfirstuse_event_handler(struct fpu *fpu);
+
 /*
  * Load PKRU from the FPU context if available. Delay loading of the
  * complete FPU state until the return to userland.
  */
-static inline void switch_fpu_finish(struct fpu *new_fpu)
+static inline void switch_fpu_finish(struct fpu *old_fpu, struct fpu *new_fpu)
 {
 	u32 pkru_val = init_pkru_value;
 	struct pkru_state *pk;
@@ -586,6 +635,8 @@ static inline void switch_fpu_finish(struct fpu *new_fpu)
 
 	set_thread_flag(TIF_NEED_FPU_LOAD);
 
+	xdisable_switch(old_fpu, new_fpu);
+
 	if (!cpu_feature_enabled(X86_FEATURE_OSPKE))
 		return;
 
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 2859ee4f39a8..0ccbe8cc99ad 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -610,6 +610,8 @@
 #define