Re: [PATCH 1/4] x86/fpu: Add kernel_fpu_begin_mask() to selectively initialize state

2021-01-19 Thread Borislav Petkov
Just nitpicks:

On Sun, Jan 17, 2021 at 10:20:38PM -0800, Andy Lutomirski wrote:
> Currently, requesting kernel FPU access doesn't distinguish which parts of
> the extended ("FPU") state are needed.  This is nice for simplicity, but
> there are a few cases in which it's suboptimal:
> 
>  - The vast majority of in-kernel FPU users want XMM/YMM/ZMM state but do
>not use legacy 387 state.  These users want MXCSR initialized but don't
>care about the FPU control word.  Skipping FNINIT would save time.
>(Empirically, FNINIT is several times slower than LDMXCSR.)
> 
>  - Code that wants MMX doesn't want need MXCSR or FCW initialized.

"want/need" ?

>_mmx_memcpy(), for example, can run before CR4.OSFXSR gets set, and
>initializing MXCSR will fail.

"... because LDMXCSR generates an #UD when the aforementioned CR4 bit is
not set."

>  - Any future in-kernel users of XFD (eXtended Feature Disable)-capable
>dynamic states will need special handling.
> 
> This patch adds a more specific API that allows callers to specify exactly

s/This patch adds/Add/

> what they want.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette


RE: [PATCH 1/4] x86/fpu: Add kernel_fpu_begin_mask() to selectively initialize state

2021-01-18 Thread David Laight
From: Andy Lutomirski
> Sent: 18 January 2021 06:21
> 
> Currently, requesting kernel FPU access doesn't distinguish which parts of
> the extended ("FPU") state are needed.  This is nice for simplicity, but
> there are a few cases in which it's suboptimal:
> 
>  - The vast majority of in-kernel FPU users want XMM/YMM/ZMM state but do
>not use legacy 387 state.  These users want MXCSR initialized but don't
>care about the FPU control word.  Skipping FNINIT would save time.
>(Empirically, FNINIT is several times slower than LDMXCSR.)
> 
>  - Code that wants MMX doesn't want need MXCSR or FCW initialized.
>_mmx_memcpy(), for example, can run before CR4.OSFXSR gets set, and
>initializing MXCSR will fail.
> 
>  - Any future in-kernel users of XFD (eXtended Feature Disable)-capable
>dynamic states will need special handling.
> 
> This patch adds a more specific API that allows callers to specify exactly
> what they want.

Is it worth returning whether the required FPU feature is available?
Or, maybe optionally, whether it is available cheaply?

There are also code fragments that really just want one or two
[xyz]mm registers to speed something up.
For instance, PCIe reads can be a lot faster if a wide register
can be used.
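
As a minimal sketch of the kind of caller I mean, written against the
API added by this patch (the helper and its asm are hypothetical, not
existing kernel code):

	/*
	 * Hypothetical: copy out of an ioremapped buffer with 16-byte
	 * SSE loads. Only XMM state is used, so request MXCSR init and
	 * skip the (slow) FNINIT. Needs <linux/types.h> and
	 * <asm/fpu/api.h>.
	 */
	static void wide_read(void *dst, const void *src, size_t len)
	{
		size_t i;

		kernel_fpu_begin_mask(KFPU_XYZMM);
		for (i = 0; i + 16 <= len; i += 16)
			asm volatile("movdqu %1, %%xmm0\n\t"
				     "movdqu %%xmm0, %0"
				     : "=m" (*(unsigned char (*)[16])(dst + i))
				     : "m" (*(const unsigned char (*)[16])(src + i))
				     : "xmm0");
		kernel_fpu_end();
	}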

David




Re: [PATCH 1/4] x86/fpu: Add kernel_fpu_begin_mask() to selectively initialize state

2021-01-18 Thread Peter Zijlstra
On Sun, Jan 17, 2021 at 10:20:38PM -0800, Andy Lutomirski wrote:

>  - Code that wants MMX doesn't want need MXCSR or FCW initialized.
>_mmx_memcpy(), for example, can run before CR4.OSFXSR gets set, and
>initializing MXCSR will fail.

> +#define KFPU_MMX 0   /* nothing gets initialized */

This... why is that correct?


[PATCH 1/4] x86/fpu: Add kernel_fpu_begin_mask() to selectively initialize state

2021-01-17 Thread Andy Lutomirski
Currently, requesting kernel FPU access doesn't distinguish which parts of
the extended ("FPU") state are needed.  This is nice for simplicity, but
there are a few cases in which it's suboptimal:

 - The vast majority of in-kernel FPU users want XMM/YMM/ZMM state but do
   not use legacy 387 state.  These users want MXCSR initialized but don't
   care about the FPU control word.  Skipping FNINIT would save time.
   (Empirically, FNINIT is several times slower than LDMXCSR.)

 - Code that wants MMX doesn't want need MXCSR or FCW initialized.
   _mmx_memcpy(), for example, can run before CR4.OSFXSR gets set, and
   initializing MXCSR will fail.

 - Any future in-kernel users of XFD (eXtended Feature Disable)-capable
   dynamic states will need special handling.

This patch adds a more specific API that allows callers to specify exactly
what they want.

Signed-off-by: Andy Lutomirski 
---
 arch/x86/include/asm/fpu/api.h | 16 ++++++++++++++--
 arch/x86/kernel/fpu/core.c     | 17 +++++++++++------
 2 files changed, 25 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index dcd9503b1098..133907a200ef 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -16,14 +16,26 @@
  * Use kernel_fpu_begin/end() if you intend to use FPU in kernel context. It
  * disables preemption so be careful if you intend to use it for long periods
  * of time.
- * If you intend to use the FPU in softirq you need to check first with
+ * If you intend to use the FPU in irq/softirq you need to check first with
  * irq_fpu_usable() if it is possible.
  */
-extern void kernel_fpu_begin(void);
+
+/* Kernel FPU states to initialize in kernel_fpu_begin_mask() */
+#define KFPU_387        _BITUL(0)       /* FCW will be initialized */
+#define KFPU_XYZMM      _BITUL(1)       /* MXCSR will be initialized */
+#define KFPU_MMX        0               /* nothing gets initialized */
+
+extern void kernel_fpu_begin_mask(unsigned int kfpu_mask);
 extern void kernel_fpu_end(void);
 extern bool irq_fpu_usable(void);
 extern void fpregs_mark_activate(void);
 
+/* Code that is unaware of kernel_fpu_begin_mask() can use this */
+static inline void kernel_fpu_begin(void)
+{
+        kernel_fpu_begin_mask(KFPU_387 | KFPU_XYZMM);
+}
+
 /*
  * Use fpregs_lock() while editing CPU's FPU registers or fpu->state.
  * A context switch will (and softirq might) save CPU's FPU registers to
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index eb86a2b831b1..52d05c806aa6 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -121,7 +121,7 @@ int copy_fpregs_to_fpstate(struct fpu *fpu)
 }
 EXPORT_SYMBOL(copy_fpregs_to_fpstate);
 
-void kernel_fpu_begin(void)
+void kernel_fpu_begin_mask(unsigned int kfpu_mask)
 {
         preempt_disable();
 
@@ -141,13 +141,18 @@ void kernel_fpu_begin(void)
         }
         __cpu_invalidate_fpregs_state();
 
-        if (boot_cpu_has(X86_FEATURE_XMM))
-                ldmxcsr(MXCSR_DEFAULT);
+        /* Put sane initial values into the control registers. */
+        if (likely(kfpu_mask & KFPU_XYZMM)) {
+                if (boot_cpu_has(X86_FEATURE_XMM))
+                        ldmxcsr(MXCSR_DEFAULT);
+        }
 
-        if (boot_cpu_has(X86_FEATURE_FPU))
-                asm volatile ("fninit");
+        if (unlikely(kfpu_mask & KFPU_387)) {
+                if (boot_cpu_has(X86_FEATURE_FPU))
+                        asm volatile ("fninit");
+        }
 }
-EXPORT_SYMBOL_GPL(kernel_fpu_begin);
+EXPORT_SYMBOL_GPL(kernel_fpu_begin_mask);
 
 void kernel_fpu_end(void)
 {
-- 
2.29.2
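
For illustration, a sketch of how an MMX-only user such as
_mmx_memcpy() might adopt the new API (hypothetical code, loosely
modeled on arch/x86/lib/mmx_32.c, not part of this patch). KFPU_MMX
requests no control-register initialization, so this stays safe
before CR4.OSFXSR is set, where LDMXCSR would #UD:

	static void mmx_copy_example(void *to, const void *from, size_t len)
	{
		size_t i;

		kernel_fpu_begin_mask(KFPU_MMX);
		for (i = 0; i + 8 <= len; i += 8)
			asm volatile("movq (%0), %%mm0\n\t"
				     "movq %%mm0, (%1)"
				     : /* no outputs */
				     : "r" (from + i), "r" (to + i)
				     : "mm0", "memory");
		kernel_fpu_end();
	}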