Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save

2023-12-02 Thread Michael Ellerman
On Sun, 19 Nov 2023 09:18:02 -0600, Timothy Pearson wrote:
> During floating point and vector save to thread data fr0/vs0 are clobbered
> by the FPSCR/VSCR store routine.
> 
> [...]

Applied to powerpc/fixes.

[1/1] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
  https://git.kernel.org/powerpc/c/5e1d824f9a283cbf90f25241b66d1f69adb3835b

cheers


Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save

2023-11-30 Thread Timothy Pearson
- Original Message -
> From: "Michael Ellerman" 
> To: "Timothy Pearson" 
> Cc: "Jens Axboe" , "regressions" 
> , "npiggin" ,
> "christophe leroy" , "linuxppc-dev" 
> 
> Sent: Tuesday, November 28, 2023 6:57:01 AM
> Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec 
> register  save

> Michael Ellerman  writes:
>> Timothy Pearson  writes:
>>
>>> Just wanted to check back and see if this patch was going to be queued
>>> up soon?  We're still having to work around / advertise the data
>>> destruction issues the underlying bug is causing on e.g. Debian
>>> Stable.
>>
>> Yeah I'll apply it this week, so it will be in rc4.
> 
> I reworked the change log to include the exact call path I identified
> instead of the more high level description you had. And tweaked a few
> other bits of wording and so on, apparently fr0 is a kernelism, the ABI
> and binutils calls it f0.
> 
> I'm not sure how wedded you were to your change log, so if you dislike
> my edits let me know and we can come up with a joint one.
> 
> The actual patch is unchanged.
> 
> cheers

The commit message looks OK to me.  I've also seen application crashes as a 
result of the register corruption, but that may be a minor detail that isn't 
really worth updating things over at this point -- those come from e.g. glibc 
using vs0 as part of a path that processes pointer information, typically seen 
where there's a need to replicate the same pointer to adjacent fields in a data 
struct.
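
A hypothetical illustration of that pattern -- not glibc's actual code, just
the kind of VSX pointer splat (compiled with -mvsx) that a compiler may
materialise in vs0:

#include <altivec.h>

struct pair { void *a; void *b; };

/* Replicate one pointer into two adjacent fields via a vector splat.
 * The splat may be held in vs0; if the kernel clobbers vs0 behind the
 * program's back, both stored pointers come out garbage. */
void splat_ptr(struct pair *out, void *p)
{
	vector unsigned long long v = vec_splats((unsigned long long)p);

	out->a = (void *)v[0];
	out->b = (void *)v[1];
}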


Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save

2023-11-28 Thread Michael Ellerman
Michael Ellerman  writes:
> Timothy Pearson  writes:
>
>> Just wanted to check back and see if this patch was going to be queued
>> up soon?  We're still having to work around / advertise the data
>> destruction issues the underlying bug is causing on e.g. Debian
>> Stable.
>
> Yeah I'll apply it this week, so it will be in rc4.

I reworked the change log to include the exact call path I identified
instead of the more high level description you had. And tweaked a few
other bits of wording and so on, apparently fr0 is a kernelism, the ABI
and binutils calls it f0.

I'm not sure how wedded you were to your change log, so if you dislike
my edits let me know and we can come up with a joint one.

The actual patch is unchanged.

cheers


From 5e1d824f9a283cbf90f25241b66d1f69adb3835b Mon Sep 17 00:00:00 2001
From: Timothy Pearson 
Date: Sun, 19 Nov 2023 09:18:02 -0600
Subject: [PATCH] powerpc: Don't clobber f0/vs0 during fp|altivec register save

During floating point and vector save to thread data f0/vs0 are
clobbered by the FPSCR/VSCR store routine. This has been observed to
lead to userspace register corruption and application data corruption
with io-uring.

Fix it by restoring f0/vs0 after FPSCR/VSCR store has completed for
all the FP, altivec, VMX register save paths.

Tested under QEMU in kvm mode, running on a Talos II workstation with
dual POWER9 DD2.2 CPUs.

Additional detail (mpe):

Typically save_fpu() is called from __giveup_fpu() which saves the FP
regs and also *turns off FP* in the task's MSR, meaning the kernel will
reload the FP regs from the thread struct before letting the task use FP
again. So in that case save_fpu() is free to clobber f0 because the FP
regs no longer hold live values for the task.

There is another case though, which is the path via:
  sys_clone()
...
copy_process()
  dup_task_struct()
arch_dup_task_struct()
  flush_all_to_thread()
save_all()

That path saves the FP regs but leaves them live. That's meant as an
optimisation for a process that's using FP/VSX and then calls fork(),
leaving the regs live means the parent process doesn't have to take a
fault after the fork to get its FP regs back. The optimisation was added
in commit 8792468da5e1 ("powerpc: Add the ability to save FPU without
giving it up").

That path does clobber f0, but f0 is volatile across function calls,
and typically programs reach copy_process() from userspace via a syscall
wrapper function. So in normal usage f0 being clobbered across a
syscall doesn't cause visible data corruption.

But there is now a new path, because io-uring can call copy_process()
via create_io_thread() from the signal handling path. That's OK if the
signal is handled as part of syscall return, but it's not OK if the
signal is handled due to some other interrupt.

That path is:

interrupt_return_srr_user()
  interrupt_exit_user_prepare()
interrupt_exit_user_prepare_main()
  do_notify_resume()
get_signal()
  task_work_run()
create_worker_cb()
  create_io_worker()
copy_process()
  dup_task_struct()
arch_dup_task_struct()
  flush_all_to_thread()
save_all()
  if (tsk->thread.regs->msr & MSR_FP)
save_fpu()
# f0 is clobbered and potentially live in userspace

Note the above discussion applies equally to save_altivec().

Fixes: 8792468da5e1 ("powerpc: Add the ability to save FPU without giving it up")
Cc: stable@vger.kernel.org # v4.6+
Closes: https://lore.kernel.org/all/480932026.45576726.1699374859845.JavaMail.zimbra@raptorengineeringinc.com/
Closes: https://lore.kernel.org/linuxppc-dev/480221078.47953493.1700206777956.JavaMail.zimbra@raptorengineeringinc.com/
Tested-by: Timothy Pearson 
Tested-by: Jens Axboe 
Signed-off-by: Timothy Pearson 
[mpe: Reword change log to describe exact path of corruption & other minor tweaks]
Signed-off-by: Michael Ellerman 
Link: https://msgid.link/1921539696.48534988.1700407082933.JavaMail.zimbra@raptorengineeringinc.com
---
 arch/powerpc/kernel/fpu.S    | 13 +++++++++++++
 arch/powerpc/kernel/vector.S |  2 ++
 2 files changed, 15 insertions(+)

diff --git a/arch/powerpc/kernel/fpu.S b/arch/powerpc/kernel/fpu.S
index 6a9acfb690c9..2f8f3f93cbb6 100644
--- a/arch/powerpc/kernel/fpu.S
+++ b/arch/powerpc/kernel/fpu.S
@@ -23,6 +23,15 @@
 #include 
 
 #ifdef CONFIG_VSX
+#define __REST_1FPVSR(n,c,base)						\
+BEGIN_FTR_SECTION							\
+	b	2f;							\
+END_FTR_SECTION_IFSET(CPU_FTR_VSX);					\
+	REST_FPR(n,base);						\
+	b	3f;							\
+2:	REST_VSR(n,c,base);						\
+3:

Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save

2023-11-27 Thread Nicholas Piggin
On Tue Nov 28, 2023 at 10:59 AM AEST, Michael Ellerman wrote:
> Christophe Leroy  writes:
> > Le 27/11/2023 à 19:39, Timothy Pearson a écrit :
> >> Just wanted to check back and see if this patch was going to be
> >> queued up soon?  We're still having to work around / advertise the
> >> data destruction issues the underlying bug is causing on e.g. Debian
> >> Stable.
> >
> > Has any agreement been reached on the final solution?  Seeing the many 
> > discussions on patch v2 I had the feeling that it was not the final solution.
>
> The actual patch is fine I think.
>
> The discussion was about improving the explanation of exactly what's
> happening in the change log, and whether there is a larger bug causing
> FP corruption unrelated to io-uring.

One thing I said is maybe we should remove the "register save" concept
entirely and always giveup. But that would not be suitable for a minimal
fix. I didn't mean it as an alternate fix.

Even if we did always giveup in future, this patch still gives a nicer
API. There would have to be a noticeable performance advantage to not
restoring fr0/vs0 after saving before we'd think about changing it back
to clobber, since that encourages foot shooting.

Thanks,
Nick


Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save

2023-11-27 Thread Michael Ellerman
Christophe Leroy  writes:
> Le 27/11/2023 à 19:39, Timothy Pearson a écrit :
>> Just wanted to check back and see if this patch was going to be
>> queued up soon?  We're still having to work around / advertise the
>> data destruction issues the underlying bug is causing on e.g. Debian
>> Stable.
>
> Has any agreement been reached on the final solution?  Seeing the many 
> discussions on patch v2 I had the feeling that it was not the final solution.

The actual patch is fine I think.

The discussion was about improving the explanation of exactly what's
happening in the change log, and whether there is a larger bug causing
FP corruption unrelated to io-uring.

I'm now reasonably confident there's no detectable corruption of fr0
happening except via the io-uring -> clone path.

It's still a bad bug for us to corrupt fr0 across sys_clone(), but in
practice it doesn't affect userspace because fr0 is volatile across
function calls.
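
To illustrate the calling-convention point (a sketch, not kernel code): the
64-bit ELF ABI makes f0-f13 volatile, so a compiler never keeps a value live
in f0 across any call, including the libc wrapper around sys_clone():

#include <unistd.h>

double scale(double x)
{
	double t = x * 2.0;	/* t may be computed in f0 here */

	fork();			/* across any call, f0-f13 are assumed dead */

	return t;		/* so t was spilled or moved to a nonvolatile FPR,
				   and a clobbered f0 in the kernel goes unnoticed */
}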

cheers


Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save

2023-11-27 Thread Michael Ellerman
Timothy Pearson  writes:

> Just wanted to check back and see if this patch was going to be queued
> up soon?  We're still having to work around / advertise the data
> destruction issues the underlying bug is causing on e.g. Debian
> Stable.

Yeah I'll apply it this week, so it will be in rc4.

I wanted to be more certain the corruption only happens in practice with
io-uring before applying it.

cheers


Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save

2023-11-27 Thread Christophe Leroy
Hi,

Le 27/11/2023 à 19:39, Timothy Pearson a écrit :
> Just wanted to check back and see if this patch was going to be queued up 
> soon?  We're still having to work around / advertise the data destruction 
> issues the underlying bug is causing on e.g. Debian Stable.
> 

Has any agreement been reached on the final solution?  Seeing the many 
discussions on patch v2 I had the feeling that it was not the final solution.

Christophe


Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save

2023-11-27 Thread Timothy Pearson
Just wanted to check back and see if this patch was going to be queued up soon? 
 We're still having to work around / advertise the data destruction issues the 
underlying bug is causing on e.g. Debian Stable.

Thanks!


Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save

2023-11-23 Thread Timothy Pearson



- Original Message -
> From: "Michael Ellerman" 
> To: "Timothy Pearson" 
> Cc: "Jens Axboe" , "regressions" 
> , "npiggin" ,
> "christophe leroy" , "linuxppc-dev" 
> 
> Sent: Tuesday, November 21, 2023 11:01:50 PM
> Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec 
> register  save

> Timothy Pearson  writes:
>>
> ...
>>
>> So a little more detail on this, just to put it to rest properly vs.
>> assuming hand analysis caught every possible pathway. :)
>>
>> The debugging that generates this stack trace also verifies the following in
>> __giveup_fpu():
>>
>> 1.) tsk->thread.fp_state.fpr doesn't contain the FPSCR contents prior to 
>> calling
>> save_fpu()
>> 2.) tsk->thread.fp_state.fpr contains the FPSCR contents directly after 
>> calling
>> save_fpu()
>> 3.) MSR_FP is set both in the task struct and in the live MSR.
>>
>> Only if all three conditions are met will it generate the trace.  This
>> is a generalization of the hack I used to find the problem in the
>> first place.
>>
>> If the state will subsequently be reloaded from the thread struct,
>> that means we're reloading the registers from the thread struct that
>> we just verified was corrupted by the earlier save_fpu() call.  There
>> are only two ways I can see for that to be true -- one is if the
>> registers were already clobbered when giveup_all() was entered, and
>> the other is if save_fpu() went ahead and clobbered them right here
>> inside giveup_all().
>>
>> To see which scenario we were dealing with, I added a bit more
>> instrumentation to dump the current register state if MSR_FP bit was
>> already set in registers (i.e. not dumping data from task struct, but
>> using the live FPU registers instead), and sure enough the registers
>> are corrupt on entry, so something else has already called save_fpu()
>> before we even hit giveup_all() in this call chain.
> 
> Can you share the debug patch you're using?
> 
> cheers

Sure, here you go.  Note that with my FPU patch there is no WARN_ON hit, at 
least in my testing, so it isn't userspace purposefully loading the fr0/vs0 
register with the FPSCR.

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 392404688cec..bde57dc3262a 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -154,7 +154,49 @@ static void __giveup_fpu(struct task_struct *tsk)
 {
 	unsigned long msr;
 
+	// DEBUGGING
+	uint64_t prev_fpr0 = *(((uint64_t*)(&tsk->thread.fp_state.fpr[0]))+0);
+	uint64_t prev_fpr1 = *(((uint64_t*)(&tsk->thread.fp_state.fpr[0]))+1);
+	struct thread_fp_state debug_fp_state;
+	unsigned long currentmsr = mfmsr();
+
+	if (currentmsr & MSR_FP) {
+		store_fp_state(&debug_fp_state);
+		load_fp_state(&debug_fp_state);
+	}
+
 	save_fpu(tsk);
+
+	// DEBUGGING
+	if (tsk->thread.regs->msr & MSR_FP) {
+		if (((*(((uint64_t*)(&tsk->thread.fp_state.fpr[0]))+0) == 0x82004000) && (prev_fpr0 != 0x82004000))
+		    || ((*(((uint64_t*)(&tsk->thread.fp_state.fpr[0]))+1) == 0x82004000) && (prev_fpr1 != 0x82004000)))
+		{
+			WARN_ON(1);
+
+			printk("[TS %lld] In __giveup_fpu() for process [comm: '%s'  pid %d tid %d], before save current "
+				"fp0: 0x%016llx/%016llx fp1: 0x%016llx/%016llx fp8: 0x%016llx/%016llx fp9: 0x%016llx/%016llx"
+				" msr: 0x%016lx (FP %d VSX %d EE %d) on core %d\n",
+				ktime_get_boottime_ns(), current->comm, current->pid, current->tgid,
+				*(((uint64_t*)(&debug_fp_state.fpr[0]))+0), *(((uint64_t*)(&debug_fp_state.fpr[0]))+1),
+				*(((uint64_t*)(&debug_fp_state.fpr[1]))+0), *(((uint64_t*)(&debug_fp_state.fpr[1]))+1),
+				*(((uint64_t*)(&debug_fp_state.fpr[8]))+0), *(((uint64_t*)(&debug_fp_state.fpr[8]))+1),
+				*(((uint64_t*)(&tsk->thread.fp_state.fpr[9]))+0), *(((uint64_t*)(&tsk->thread.fp_state.fpr[9]))+1),
+				currentmsr, !!(currentmsr & MSR_FP), !!(currentmsr & MSR_VSX), !!(currentmsr & MSR_EE), raw_smp_processor_id());
+
+			printk("[TS %lld] In __giveup_fpu() for process [comm: '%s'  pid %d tid %d], after save saved "
+				"fp0: 0x%016llx/%016llx fp1: 0x%016llx/%016llx fp8: 0x%016llx/%016llx fp9: 0x%016llx/%016llx"
+				" msr: 0x%016lx (FP %d VSX %d EE %d) on core %d\n",
+				ktime_get_boottime_ns(), cu

Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save

2023-11-21 Thread Michael Ellerman
Timothy Pearson  writes:
>
...
>
> So a little more detail on this, just to put it to rest properly vs.
> assuming hand analysis caught every possible pathway. :)
>
> The debugging that generates this stack trace also verifies the following in 
> __giveup_fpu():
>
> 1.) tsk->thread.fp_state.fpr doesn't contain the FPSCR contents prior to 
> calling save_fpu()
> 2.) tsk->thread.fp_state.fpr contains the FPSCR contents directly after 
> calling save_fpu()
> 3.) MSR_FP is set both in the task struct and in the live MSR.
>
> Only if all three conditions are met will it generate the trace.  This
> is a generalization of the hack I used to find the problem in the
> first place.
>
> If the state will subsequently be reloaded from the thread struct,
> that means we're reloading the registers from the thread struct that
> we just verified was corrupted by the earlier save_fpu() call.  There
> are only two ways I can see for that to be true -- one is if the
> registers were already clobbered when giveup_all() was entered, and
> the other is if save_fpu() went ahead and clobbered them right here
> inside giveup_all().
>
> To see which scenario we were dealing with, I added a bit more
> instrumentation to dump the current register state if MSR_FP bit was
> already set in registers (i.e. not dumping data from task struct, but
> using the live FPU registers instead), and sure enough the registers
> are corrupt on entry, so something else has already called save_fpu()
> before we even hit giveup_all() in this call chain.

Can you share the debug patch you're using?

cheers


Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save

2023-11-20 Thread Nicholas Piggin
On Tue Nov 21, 2023 at 11:23 AM AEST, Timothy Pearson wrote:
>
>
> - Original Message -
> > From: "Michael Ellerman" 
> > To: "Timothy Pearson" 
> > Cc: "Jens Axboe" , "regressions" 
> > , "npiggin" ,
> > "christophe leroy" , "linuxppc-dev" 
> > 
> > Sent: Monday, November 20, 2023 5:39:52 PM
> > Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec 
> > register  save
>
> > Timothy Pearson  writes:
> >> - Original Message -
> >>> From: "Michael Ellerman" 
> > ...
> >>> 
> >>> But we now have a new path, because io-uring can call copy_process() via
> >>> create_io_thread() from the signal handling path. That's OK if the signal 
> >>> is
> >>> handled as we return from a syscall, but it's not OK if the signal is 
> >>> handled
> >>> due to some other interrupt.
> >>> 
> >>> Which is:
> >>> 
> >>> interrupt_return_srr_user()
> >>>  interrupt_exit_user_prepare()
> >>>interrupt_exit_user_prepare_main()
> >>>  do_notify_resume()
> >>>get_signal()
> >>>  task_work_run()
> >>>create_worker_cb()
> >>>  create_io_worker()
> >>>copy_process()
> >>>  dup_task_struct()
> >>>arch_dup_task_struct()
> >>>  flush_all_to_thread()
> >>>save_all()
> >>>  if (tsk->thread.regs->msr & MSR_FP)
> >>>save_fpu()
> >>># fr0 is clobbered and potentially live in 
> >>> userspace
> >>> 
> >>> 
> >>> So tldr I think the corruption is only an issue since io-uring started 
> >>> doing
> >>> the clone via signal, which I think matches the observed timeline of this 
> >>> bug
> >>> appearing.
> >>
> >> I agree the corruption really only started showing up in earnest on
> >> io_uring clone-via-signal, as this was confirmed several times in the
> >> course of debugging.
> > 
> > Thanks.
> > 
> >> Note as well that I may very well have a wrong call order in the
> >> commit message, since I was relying on a couple of WARN_ON() macros I
> >> inserted to check for a similar (but not identical) condition and
> >> didn't spend much time getting new traces after identifying the root
> >> cause.
> > 
> > Yep no worries. I'll reword it to incorporate the full path from my mail.
> > 
> >> I went back and grabbed some real world system-wide stack traces, since I 
> >> now
> >> know what to trigger on.  A typical example is:
> >>
> >> interrupt_return_srr_user()
> >>  interrupt_exit_user_prepare()
> >>   interrupt_exit_user_prepare_main()
> >>schedule()
> >> __schedule()
> >>  __switch_to()
> >>   giveup_all()
> >># tsk->thread.regs->msr MSR_FP is still set here
> >>__giveup_fpu()
> >> save_fpu()
> >> # fr0 is clobbered and potentially live in userspace
> > 
> > fr0 is not live there.
> > 
> > __giveup_fpu() does roughly:
> > 
> > 	msr = tsk->thread.regs->msr;
> > 	msr &= ~(MSR_FP|MSR_FE0|MSR_FE1);
> > 	msr &= ~MSR_VSX;
> > 	tsk->thread.regs->msr = msr;
> > 
> > ie. it clears the FP etc. bits from the task's MSR. That means the FP
> > state will be reloaded from the thread struct before the task is run again.
> > 
> > Also on that path we're switching to another task, so we'll be reloading
> > the other task's FP state before returning to userspace.
> > 
> > So I don't see any bug there.
>
> Yeah, you're right.  I was trying to get traces while doing something else, 
> and didn't think that all the way through, sorry. :)  It's not going to be 
> super easy to get a real trace (I was triggering the WARN_ON() from of fr0 
> getting set to to FPSCR), so let's just assume it's mainly the path you 
> manually found above and update the commit message accordingly.
>
> > There's only two places that call save_fpu() and skip the giveup logic,
> > which is save_all() and kvmppc_save_user_regs().
>
> Now that's interesting as well, since it might explain some issues I've seen 
> for years on a specific QEMU workload.  Once this is backported to stable 
> I'll need to get the kernels updated on those boxes and see if the issues 
> disappear...

Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save

2023-11-20 Thread Nicholas Piggin
On Tue Nov 21, 2023 at 2:10 PM AEST, Timothy Pearson wrote:
> - Original Message -
> > From: "Michael Ellerman" 
> > To: "Timothy Pearson" 
> > Cc: "Jens Axboe" , "regressions" 
> > , "npiggin" ,
> > "christophe leroy" , "linuxppc-dev" 
> > 
> > Sent: Monday, November 20, 2023 5:39:52 PM
> > Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec 
> > register  save
>
> > Timothy Pearson  writes:
> >> - Original Message -
> >>> From: "Michael Ellerman" 
> > ...
> >>> 
> >>> But we now have a new path, because io-uring can call copy_process() via
> >>> create_io_thread() from the signal handling path. That's OK if the signal 
> >>> is
> >>> handled as we return from a syscall, but it's not OK if the signal is 
> >>> handled
> >>> due to some other interrupt.
> >>> 
> >>> Which is:
> >>> 
> >>> interrupt_return_srr_user()
> >>>  interrupt_exit_user_prepare()
> >>>interrupt_exit_user_prepare_main()
> >>>  do_notify_resume()
> >>>get_signal()
> >>>  task_work_run()
> >>>create_worker_cb()
> >>>  create_io_worker()
> >>>copy_process()
> >>>  dup_task_struct()
> >>>arch_dup_task_struct()
> >>>  flush_all_to_thread()
> >>>save_all()
> >>>  if (tsk->thread.regs->msr & MSR_FP)
> >>>save_fpu()
> >>># fr0 is clobbered and potentially live in 
> >>> userspace
> >>> 
> >>> 
> >>> So tldr I think the corruption is only an issue since io-uring started 
> >>> doing
> >>> the clone via signal, which I think matches the observed timeline of this 
> >>> bug
> >>> appearing.
> >>
> >> I agree the corruption really only started showing up in earnest on
> >> io_uring clone-via-signal, as this was confirmed several times in the
> >> course of debugging.
> > 
> > Thanks.
> > 
> >> Note as well that I may very well have a wrong call order in the
> >> commit message, since I was relying on a couple of WARN_ON() macros I
> >> inserted to check for a similar (but not identical) condition and
> >> didn't spend much time getting new traces after identifying the root
> >> cause.
> > 
> > Yep no worries. I'll reword it to incorporate the full path from my mail.
> > 
> >> I went back and grabbed some real world system-wide stack traces, since I 
> >> now
> >> know what to trigger on.  A typical example is:
> >>
> >> interrupt_return_srr_user()
> >>  interrupt_exit_user_prepare()
> >>   interrupt_exit_user_prepare_main()
> >>schedule()
> >> __schedule()
> >>  __switch_to()
> >>   giveup_all()
> >># tsk->thread.regs->msr MSR_FP is still set here
> >>__giveup_fpu()
> >> save_fpu()
> >> # fr0 is clobbered and potentially live in userspace
> > 
> > fr0 is not live there.
>  
> > ie. it clears the FP etc. bits from the task's MSR. That means the FP
> > state will be reloaded from the thread struct before the task is run again.
>
> So a little more detail on this, just to put it to rest properly vs. assuming 
> hand analysis caught every possible pathway. :)
>
> The debugging that generates this stack trace also verifies the following in 
> __giveup_fpu():
>
> 1.) tsk->thread.fp_state.fpr doesn't contain the FPSCR contents prior to 
> calling save_fpu()
> 2.) tsk->thread.fp_state.fpr contains the FPSCR contents directly after 
> calling save_fpu()
> 3.) MSR_FP is set both in the task struct and in the live MSR.
>
> Only if all three conditions are met will it generate the trace.  This is a 
> generalization of the hack I used to find the problem in the first place.
>
> If the state will subsequently be reloaded from the thread struct, that means 
> we're reloading the registers from the thread struct that we just verified 
> was corrupted by the earlier save_fpu() call.  There are only two ways I can 
> see for that to be true -- one is if the registers were already clobbered 
> when giveup_all() was entered, and the other is if save_fpu() went ahead and 
> clobbered them right here inside giveup_all().

Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save

2023-11-20 Thread Timothy Pearson



- Original Message -
> From: "Timothy Pearson" 
> To: "Michael Ellerman" 
> Cc: "Jens Axboe" , "regressions" 
> , "npiggin" ,
> "christophe leroy" , "linuxppc-dev" 
> 
> Sent: Monday, November 20, 2023 10:10:32 PM
> Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec 
> register  save

> - Original Message -
>> From: "Michael Ellerman" 
>> To: "Timothy Pearson" 
>> Cc: "Jens Axboe" , "regressions" 
>> ,
>> "npiggin" ,
>> "christophe leroy" , "linuxppc-dev"
>> 
>> Sent: Monday, November 20, 2023 5:39:52 PM
>> Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec
>> register  save
> 
>> Timothy Pearson  writes:
>>> - Original Message -
>>>> From: "Michael Ellerman" 
>> ...
>>>> 
>>>> But we now have a new path, because io-uring can call copy_process() via
>>>> create_io_thread() from the signal handling path. That's OK if the signal 
>>>> is
>>>> handled as we return from a syscall, but it's not OK if the signal is 
>>>> handled
>>>> due to some other interrupt.
>>>> 
>>>> Which is:
>>>> 
>>>> interrupt_return_srr_user()
>>>>  interrupt_exit_user_prepare()
>>>>interrupt_exit_user_prepare_main()
>>>>  do_notify_resume()
>>>>get_signal()
>>>>  task_work_run()
>>>>create_worker_cb()
>>>>  create_io_worker()
>>>>copy_process()
>>>>  dup_task_struct()
>>>>arch_dup_task_struct()
>>>>  flush_all_to_thread()
>>>>save_all()
>>>>  if (tsk->thread.regs->msr & MSR_FP)
>>>>save_fpu()
>>>># fr0 is clobbered and potentially live in 
>>>> userspace
>>>> 
>>>> 
>>>> So tldr I think the corruption is only an issue since io-uring started 
>>>> doing
>>>> the clone via signal, which I think matches the observed timeline of this 
>>>> bug
>>>> appearing.
>>>
>>> I agree the corruption really only started showing up in earnest on
>>> io_uring clone-via-signal, as this was confirmed several times in the
>>> course of debugging.
>> 
>> Thanks.
>> 
>>> Note as well that I may very well have a wrong call order in the
>>> commit message, since I was relying on a couple of WARN_ON() macros I
>>> inserted to check for a similar (but not identical) condition and
>>> didn't spend much time getting new traces after identifying the root
>>> cause.
>> 
>> Yep no worries. I'll reword it to incorporate the full path from my mail.
>> 
>>> I went back and grabbed some real world system-wide stack traces, since I 
>>> now
>>> know what to trigger on.  A typical example is:
>>>
>>> interrupt_return_srr_user()
>>>  interrupt_exit_user_prepare()
>>>   interrupt_exit_user_prepare_main()
>>>schedule()
>>> __schedule()
>>>  __switch_to()
>>>   giveup_all()
>>># tsk->thread.regs->msr MSR_FP is still set here
>>>__giveup_fpu()
>>> save_fpu()
>>> # fr0 is clobbered and potentially live in userspace
>> 
>> fr0 is not live there.
> 
>> ie. it clears the FP etc. bits from the task's MSR. That means the FP
>> state will be reloaded from the thread struct before the task is run again.
> 
> So a little more detail on this, just to put it to rest properly vs. assuming
> hand analysis caught every possible pathway. :)
> 
> The debugging that generates this stack trace also verifies the following in
> __giveup_fpu():
> 
> 1.) tsk->thread.fp_state.fpr doesn't contain the FPSCR contents prior to 
> calling
> save_fpu()
> 2.) tsk->thread.fp_state.fpr contains the FPSCR contents directly after 
> calling
> save_fpu()
> 3.) MSR_FP is set both in the task struct and in the live MSR.
> 
> Only if all three conditions are met will it generate the trace.  This is a
> generalization of the hack I used to find the problem in the first place.
> 
> If the state will subsequently be reloaded from the thread struct, that means 
> we're reloading the registers from the thread struct that we just verified 
> was corrupted by the earlier save_fpu() call.

Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save

2023-11-20 Thread Timothy Pearson
- Original Message -
> From: "Michael Ellerman" 
> To: "Timothy Pearson" 
> Cc: "Jens Axboe" , "regressions" 
> , "npiggin" ,
> "christophe leroy" , "linuxppc-dev" 
> 
> Sent: Monday, November 20, 2023 5:39:52 PM
> Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec 
> register  save

> Timothy Pearson  writes:
>> - Original Message -
>>> From: "Michael Ellerman" 
> ...
>>> 
>>> But we now have a new path, because io-uring can call copy_process() via
>>> create_io_thread() from the signal handling path. That's OK if the signal is
>>> handled as we return from a syscall, but it's not OK if the signal is 
>>> handled
>>> due to some other interrupt.
>>> 
>>> Which is:
>>> 
>>> interrupt_return_srr_user()
>>>  interrupt_exit_user_prepare()
>>>interrupt_exit_user_prepare_main()
>>>  do_notify_resume()
>>>get_signal()
>>>  task_work_run()
>>>create_worker_cb()
>>>  create_io_worker()
>>>copy_process()
>>>  dup_task_struct()
>>>arch_dup_task_struct()
>>>  flush_all_to_thread()
>>>save_all()
>>>  if (tsk->thread.regs->msr & MSR_FP)
>>>save_fpu()
>>># fr0 is clobbered and potentially live in 
>>> userspace
>>> 
>>> 
>>> So tldr I think the corruption is only an issue since io-uring started doing
>>> the clone via signal, which I think matches the observed timeline of this 
>>> bug
>>> appearing.
>>
>> I agree the corruption really only started showing up in earnest on
>> io_uring clone-via-signal, as this was confirmed several times in the
>> course of debugging.
> 
> Thanks.
> 
>> Note as well that I may very well have a wrong call order in the
>> commit message, since I was relying on a couple of WARN_ON() macros I
>> inserted to check for a similar (but not identical) condition and
>> didn't spend much time getting new traces after identifying the root
>> cause.
> 
> Yep no worries. I'll reword it to incorporate the full path from my mail.
> 
>> I went back and grabbed some real world system-wide stack traces, since I now
>> know what to trigger on.  A typical example is:
>>
>> interrupt_return_srr_user()
>>  interrupt_exit_user_prepare()
>>   interrupt_exit_user_prepare_main()
>>schedule()
>> __schedule()
>>  __switch_to()
>>   giveup_all()
>># tsk->thread.regs->msr MSR_FP is still set here
>>__giveup_fpu()
>> save_fpu()
>> # fr0 is clobbered and potentially live in userspace
> 
> fr0 is not live there.
 
> ie. it clears the FP etc. bits from the task's MSR. That means the FP
> state will be reloaded from the thread struct before the task is run again.

So a little more detail on this, just to put it to rest properly vs. assuming 
hand analysis caught every possible pathway. :)

The debugging that generates this stack trace also verifies the following in 
__giveup_fpu():

1.) tsk->thread.fp_state.fpr doesn't contain the FPSCR contents prior to 
calling save_fpu()
2.) tsk->thread.fp_state.fpr contains the FPSCR contents directly after calling 
save_fpu()
3.) MSR_FP is set both in the task struct and in the live MSR.

Only if all three conditions are met will it generate the trace.  This is a 
generalization of the hack I used to find the problem in the first place.

If the state will subsequently be reloaded from the thread struct, that means 
we're reloading the registers from the thread struct that we just verified was 
corrupted by the earlier save_fpu() call.  There are only two ways I can see 
for that to be true -- one is if the registers were already clobbered when 
giveup_all() was entered, and the other is if save_fpu() went ahead and 
clobbered them right here inside giveup_all().

To see which scenario we were dealing with, I added a bit more instrumentation 
to dump the current register state if MSR_FP bit was already set in registers 
(i.e. not dumping data from task struct, but using the live FPU registers 
instead), and sure enough the registers are corrupt on entry, so something else 
has already called save_fpu() before we even hit giveup_all() in this call 
chain.

Unless I'm missing something, doesn't this effectively mean that anything 
interrupting a task can hit this bug?  Or, put another way, I'm seeing several 
processes hit this exact call chain with the corrupt register going back out to 
userspace without io_uring even in the mix, so there seems to be another 
pathway in play.  These traces are from a qemu guest, in case it matters given 
the kvm path is possibly susceptible.

Just a few things to think about.  The FPU patch itself definitely resolves the 
problems; I used a sledgehammer approach *specifically* so that there is no 
place for a rare call sequence we didn't consider to hit it again down the 
line. :)


Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save

2023-11-20 Thread Timothy Pearson



- Original Message -
> From: "Michael Ellerman" 
> To: "Timothy Pearson" 
> Cc: "Jens Axboe" , "regressions" 
> , "npiggin" ,
> "christophe leroy" , "linuxppc-dev" 
> 
> Sent: Monday, November 20, 2023 5:39:52 PM
> Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec 
> register  save

> Timothy Pearson  writes:
>> - Original Message -
>>> From: "Michael Ellerman" 
> ...
>>> 
>>> But we now have a new path, because io-uring can call copy_process() via
>>> create_io_thread() from the signal handling path. That's OK if the signal is
>>> handled as we return from a syscall, but it's not OK if the signal is 
>>> handled
>>> due to some other interrupt.
>>> 
>>> Which is:
>>> 
>>> interrupt_return_srr_user()
>>>  interrupt_exit_user_prepare()
>>>interrupt_exit_user_prepare_main()
>>>  do_notify_resume()
>>>get_signal()
>>>  task_work_run()
>>>create_worker_cb()
>>>  create_io_worker()
>>>copy_process()
>>>  dup_task_struct()
>>>arch_dup_task_struct()
>>>  flush_all_to_thread()
>>>save_all()
>>>  if (tsk->thread.regs->msr & MSR_FP)
>>>save_fpu()
>>># fr0 is clobbered and potentially live in 
>>> userspace
>>> 
>>> 
>>> So tldr I think the corruption is only an issue since io-uring started doing
>>> the clone via signal, which I think matches the observed timeline of this 
>>> bug
>>> appearing.
>>
>> I agree the corruption really only started showing up in earnest on
>> io_uring clone-via-signal, as this was confirmed several times in the
>> course of debugging.
> 
> Thanks.
> 
>> Note as well that I may very well have a wrong call order in the
>> commit message, since I was relying on a couple of WARN_ON() macros I
>> inserted to check for a similar (but not identical) condition and
>> didn't spend much time getting new traces after identifying the root
>> cause.
> 
> Yep no worries. I'll reword it to incorporate the full path from my mail.
> 
>> I went back and grabbed some real world system-wide stack traces, since I now
>> know what to trigger on.  A typical example is:
>>
>> interrupt_return_srr_user()
>>  interrupt_exit_user_prepare()
>>   interrupt_exit_user_prepare_main()
>>schedule()
>> __schedule()
>>  __switch_to()
>>   giveup_all()
>># tsk->thread.regs->msr MSR_FP is still set here
>>__giveup_fpu()
>> save_fpu()
>> # fr0 is clobbered and potentially live in userspace
> 
> fr0 is not live there.
> 
> __giveup_fpu() does roughly:
> 
> 	msr = tsk->thread.regs->msr;
> 	msr &= ~(MSR_FP|MSR_FE0|MSR_FE1);
> 	msr &= ~MSR_VSX;
> 	tsk->thread.regs->msr = msr;
> 
> ie. it clears the FP etc. bits from the task's MSR. That means the FP
> state will be reloaded from the thread struct before the task is run again.
> 
> Also on that path we're switching to another task, so we'll be reloading
> the other task's FP state before returning to userspace.
> 
> So I don't see any bug there.

Yeah, you're right.  I was trying to get traces while doing something else, and 
didn't think that all the way through, sorry. :)  It's not going to be super 
easy to get a real trace (I was triggering the WARN_ON() from fr0 getting 
set to the FPSCR value), so let's just assume it's mainly the path you manually found 
above and update the commit message accordingly.

> There's only two places that call save_fpu() and skip the giveup logic,
> which is save_all() and kvmppc_save_user_regs().

Now that's interesting as well, since it might explain some issues I've seen 
for years on a specific QEMU workload.  Once this is backported to stable I'll 
need to get the kernels updated on those boxes and see if the issues 
disappear...


Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save

2023-11-20 Thread Nicholas Piggin
On Tue Nov 21, 2023 at 9:39 AM AEST, Michael Ellerman wrote:
> Timothy Pearson  writes:
> > - Original Message -
> >> From: "Michael Ellerman" 
> ...
> >> 
> >> But we now have a new path, because io-uring can call copy_process() via
> >> create_io_thread() from the signal handling path. That's OK if the signal 
> >> is
> >> handled as we return from a syscall, but it's not OK if the signal is 
> >> handled
> >> due to some other interrupt.
> >> 
> >> Which is:
> >> 
> >> interrupt_return_srr_user()
> >>  interrupt_exit_user_prepare()
> >>interrupt_exit_user_prepare_main()
> >>  do_notify_resume()
> >>get_signal()
> >>  task_work_run()
> >>create_worker_cb()
> >>  create_io_worker()
> >>copy_process()
> >>  dup_task_struct()
> >>arch_dup_task_struct()
> >>  flush_all_to_thread()
> >>save_all()
> >>  if (tsk->thread.regs->msr & MSR_FP)
> >>save_fpu()
> >># fr0 is clobbered and potentially live in 
> >> userspace
> >> 
> >> 
> >> So tldr I think the corruption is only an issue since io-uring started 
> >> doing
> >> the clone via signal, which I think matches the observed timeline of this 
> >> bug
> >> appearing.
> >
> > I agree the corruption really only started showing up in earnest on
> > io_uring clone-via-signal, as this was confirmed several times in the
> > course of debugging.
>
> Thanks.
>
> > Note as well that I may very well have a wrong call order in the
> > commit message, since I was relying on a couple of WARN_ON() macros I
> > inserted to check for a similar (but not identical) condition and
> > didn't spend much time getting new traces after identifying the root
> > cause.
>
> Yep no worries. I'll reword it to incorporate the full path from my mail.
>
> > I went back and grabbed some real world system-wide stack traces, since I 
> > now know what to trigger on.  A typical example is:
> >
> > interrupt_return_srr_user()
> >  interrupt_exit_user_prepare()
> >   interrupt_exit_user_prepare_main()
> >schedule()
> > __schedule()
> >  __switch_to()
> >   giveup_all()
> ># tsk->thread.regs->msr MSR_FP is still set here
> >__giveup_fpu()
> > save_fpu()
> > # fr0 is clobbered and potentially live in userspace
>
> fr0 is not live there.
>
> __giveup_fpu() does roughly:
>
> 	msr = tsk->thread.regs->msr;
> 	msr &= ~(MSR_FP|MSR_FE0|MSR_FE1);
> 	msr &= ~MSR_VSX;
> 	tsk->thread.regs->msr = msr;
>
> ie. it clears the FP etc. bits from the task's MSR. That means the FP
> state will be reloaded from the thread struct before the task is run again.
>
> Also on that path we're switching to another task, so we'll be reloading
> the other task's FP state before returning to userspace.
>
> So I don't see any bug there.
>
> There's only two places that call save_fpu() and skip the giveup logic,
> which is save_all() and kvmppc_save_user_regs().
>
> save_all() is only called via clone() so I think that's unable to
> actually cause visible register corruption as I described in my previous
> mail.
>
> I thought the KVM case was similar, as it's called via an ioctl, but
> I'll have to talk to Nick as his mail indicates otherwise.

Yeah it can, because later on it runs the guest and that will clobber
other regs. I reproduced fr14 corruption in the host easily with KVM
selftests.

It should just do a "giveup" on FP/VEC (as it does with TM).
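
Something along these lines (a rough sketch with a hypothetical helper name,
not the eventual KVM patch): going through the giveup path clears MSR_FP and
MSR_VEC in the task's MSR, so clean state is reloaded from the thread struct
before the task runs again, rather than leaving clobbered registers live:

#include <linux/sched.h>
#include <asm/switch_to.h>

/* Sketch only: flush user FP/VEC state via giveup before running the guest */
static void flush_user_regs_for_guest(struct task_struct *tsk)
{
	if (tsk->thread.regs->msr & MSR_FP)
		giveup_fpu(tsk);	/* save_fpu() plus clearing MSR_FP */
#ifdef CONFIG_ALTIVEC
	if (tsk->thread.regs->msr & MSR_VEC)
		giveup_altivec(tsk);	/* save_altivec() plus clearing MSR_VEC */
#endif
}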

Thanks,
Nick


Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save

2023-11-20 Thread Nicholas Piggin
On Tue Nov 21, 2023 at 2:45 AM AEST, Timothy Pearson wrote:
>
>
> - Original Message -
> > From: "Michael Ellerman" 
> > To: "Timothy Pearson" , "Jens Axboe" 
> > , "regressions"
> > , "npiggin" , "christophe 
> > leroy" ,
> > "linuxppc-dev" 
> > Sent: Monday, November 20, 2023 1:10:06 AM
> > Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec 
> > register  save
>
> > Hi Timothy,
> > 
> > Great work debugging this. I think your fix is good, but I want to 
> > understand it
> > a bit more to make sure I can explain why we haven't seen it outside of
> > io-uring.
> > If this can be triggered outside of io-uring then I have even more 
> > backporting
> > in my future :}
> > 
> > Typically save_fpu() is called from __giveup_fpu() which saves the FP regs 
> > and
> > also *turns off FP* in the task's MSR, meaning the kernel will reload the FP 
> > regs
> > from the thread struct before letting the task use FP again. So in that case
> > save_fpu() is free to clobber fr0 because the FP regs no longer hold live 
> > values
> > for the task.
> > 
> > There is another case though, which is the path via:
> >  copy_process()
> >dup_task_struct()
> >  arch_dup_task_struct()
> >flush_all_to_thread()
> >  save_all()
> > 
> > That path saves the FP regs but leaves them live. That's meant as an
> > optimisation for a process that's using FP/VSX and then calls fork(), 
> > leaving
> > the regs live means the parent process doesn't have to take a fault after 
> > the
> > fork to get its FP regs back.
> > 
> > That path does clobber fr0, but fr0 is volatile across a syscall, and the 
> > only
> > way to reach copy_process() from userspace is via a syscall. So in normal 
> > usage
> > fr0 being clobbered across a syscall shouldn't cause data corruption.
> > 
> > Even if we handle a signal on the return from the fork() syscall, the worst 
> > that
> > happens is that the task's thread struct holds the clobbered fr0, but the 
> > task
> > doesn't care (because fr0 is volatile across the syscall anyway).
> > 
> > That path is something like:
> > 
> > system_call_vectored_common()
> >  system_call_exception()
> >sys_fork()
> >  kernel_clone()
> >copy_process()
> >  dup_task_struct()
> >arch_dup_task_struct()
> >  flush_all_to_thread()
> >save_all()
> >  if (tsk->thread.regs->msr & MSR_FP)
> >save_fpu()
> ># does not clear MSR_FP from regs->msr
> >  syscall_exit_prepare()
> >interrupt_exit_user_prepare_main()
> >  do_notify_resume()
> >get_signal()
> >handle_rt_signal64()
> >  prepare_setup_sigcontext()
> >flush_fp_to_thread()
> >  if (tsk->thread.regs->msr & MSR_FP)
> >giveup_fpu()
> >  __giveup_fpu
> >save_fpu()
> ># clobbered fr0 is saved, but task considers it volatile
> ># across syscall anyway
> > 
> > 
> > But we now have a new path, because io-uring can call copy_process() via
> > create_io_thread() from the signal handling path. That's OK if the signal is
> > handled as we return from a syscall, but it's not OK if the signal is 
> > handled
> > due to some other interrupt.
> > 
> > Which is:
> > 
> > interrupt_return_srr_user()
> >  interrupt_exit_user_prepare()
> >interrupt_exit_user_prepare_main()
> >  do_notify_resume()
> >get_signal()
> >  task_work_run()
> >create_worker_cb()
> >  create_io_worker()
> >copy_process()
> >  dup_task_struct()
> >arch_dup_task_struct()
> >  flush_all_to_thread()
> >save_all()
> >  if (tsk->thread.regs->msr & MSR_FP)
> >save_fpu()
> ># fr0 is clobbered and potentially live in 
> > userspace
> > 
> > 
> > So tldr I think the corruption is only an issue since io-uring started doing
> > the clone via signal, which I think matches the observed timeline of this 
> > bug appearing.

Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save

2023-11-20 Thread Michael Ellerman
Timothy Pearson  writes:
> - Original Message -
>> From: "Michael Ellerman" 
...
>> 
>> But we now have a new path, because io-uring can call copy_process() via
>> create_io_thread() from the signal handling path. That's OK if the signal is
>> handled as we return from a syscall, but it's not OK if the signal is handled
>> due to some other interrupt.
>> 
>> Which is:
>> 
>> interrupt_return_srr_user()
>>  interrupt_exit_user_prepare()
>>interrupt_exit_user_prepare_main()
>>  do_notify_resume()
>>get_signal()
>>  task_work_run()
>>create_worker_cb()
>>  create_io_worker()
>>copy_process()
>>  dup_task_struct()
>>arch_dup_task_struct()
>>  flush_all_to_thread()
>>save_all()
>>  if (tsk->thread.regs->msr & MSR_FP)
>>save_fpu()
>># fr0 is clobbered and potentially live in 
>> userspace
>> 
>> 
>> So tldr I think the corruption is only an issue since io-uring started doing
>> the clone via signal, which I think matches the observed timeline of this bug
>> appearing.
>
> I agree the corruption really only started showing up in earnest on
> io_uring clone-via-signal, as this was confirmed several times in the
> course of debugging.

Thanks.

> Note as well that I may very well have a wrong call order in the
> commit message, since I was relying on a couple of WARN_ON() macros I
> inserted to check for a similar (but not identical) condition and
> didn't spend much time getting new traces after identifying the root
> cause.

Yep no worries. I'll reword it to incorporate the full path from my mail.

> I went back and grabbed some real world system-wide stack traces, since I now 
> know what to trigger on.  A typical example is:
>
> interrupt_return_srr_user()
>  interrupt_exit_user_prepare()
>   interrupt_exit_user_prepare_main()
>schedule()
> __schedule()
>  __switch_to()
>   giveup_all()
># tsk->thread.regs->msr MSR_FP is still set here
>__giveup_fpu()
> save_fpu()
> # fr0 is clobbered and potentially live in userspace

fr0 is not live there.

__giveup_fpu() does roughly:

	msr = tsk->thread.regs->msr;
	msr &= ~(MSR_FP|MSR_FE0|MSR_FE1);
	msr &= ~MSR_VSX;
	tsk->thread.regs->msr = msr;

ie. it clears the FP etc. bits from the task's MSR. That means the FP
state will be reloaded from the thread struct before the task is run again.

Also on that path we're switching to another task, so we'll be reloading
the other task's FP state before returning to userspace.

So I don't see any bug there.

There's only two places that call save_fpu() and skip the giveup logic,
which is save_all() and kvmppc_save_user_regs().

save_all() is only called via clone() so I think that's unable to
actually cause visible register corruption as I described in my previous
mail.

I thought the KVM case was similar, as it's called via an ioctl, but
I'll have to talk to Nick as his mail indicates otherwise.

cheers


Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save

2023-11-20 Thread Timothy Pearson



- Original Message -
> From: "Michael Ellerman" 
> To: "Timothy Pearson" , "Jens Axboe" 
> , "regressions"
> , "npiggin" , "christophe 
> leroy" ,
> "linuxppc-dev" 
> Sent: Monday, November 20, 2023 1:10:06 AM
> Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec 
> register  save

> Hi Timothy,
> 
> Great work debugging this. I think your fix is good, but I want to understand 
> it
> a bit more to make sure I can explain why we haven't seen it outside of
> io-uring.
> If this can be triggered outside of io-uring then I have even more backporting
> in my future :}
> 
> Typically save_fpu() is called from __giveup_fpu() which saves the FP regs and
> also *turns off FP* in the task's MSR, meaning the kernel will reload the FP 
> regs
> from the thread struct before letting the task use FP again. So in that case
> save_fpu() is free to clobber fr0 because the FP regs no longer hold live 
> values
> for the task.
> 
> There is another case though, which is the path via:
>  copy_process()
>dup_task_struct()
>  arch_dup_task_struct()
>flush_all_to_thread()
>  save_all()
> 
> That path saves the FP regs but leaves them live. That's meant as an
> optimisation for a process that's using FP/VSX and then calls fork(), leaving
> the regs live means the parent process doesn't have to take a fault after the
> fork to get its FP regs back.
> 
> That path does clobber fr0, but fr0 is volatile across a syscall, and the only
> way to reach copy_process() from userspace is via a syscall. So in normal 
> usage
> fr0 being clobbered across a syscall shouldn't cause data corruption.
> 
> Even if we handle a signal on the return from the fork() syscall, the worst 
> that
> happens is that the task's thread struct holds the clobbered fr0, but the task
> doesn't care (because fr0 is volatile across the syscall anyway).
> 
> That path is something like:
> 
> system_call_vectored_common()
>  system_call_exception()
>sys_fork()
>  kernel_clone()
>copy_process()
>  dup_task_struct()
>arch_dup_task_struct()
>  flush_all_to_thread()
>save_all()
>  if (tsk->thread.regs->msr & MSR_FP)
>save_fpu()
># does not clear MSR_FP from regs->msr
>  syscall_exit_prepare()
>interrupt_exit_user_prepare_main()
>  do_notify_resume()
>get_signal()
>handle_rt_signal64()
>  prepare_setup_sigcontext()
>flush_fp_to_thread()
>  if (tsk->thread.regs->msr & MSR_FP)
>giveup_fpu()
>  __giveup_fpu
>save_fpu()
># clobbered fr0 is saved, but task considers it volatile
># across syscall anyway
> 
> 
> But we now have a new path, because io-uring can call copy_process() via
> create_io_thread() from the signal handling path. That's OK if the signal is
> handled as we return from a syscall, but it's not OK if the signal is handled
> due to some other interrupt.
> 
> Which is:
> 
> interrupt_return_srr_user()
>  interrupt_exit_user_prepare()
>interrupt_exit_user_prepare_main()
>  do_notify_resume()
>get_signal()
>  task_work_run()
>create_worker_cb()
>  create_io_worker()
>copy_process()
>  dup_task_struct()
>arch_dup_task_struct()
>  flush_all_to_thread()
>save_all()
>  if (tsk->thread.regs->msr & MSR_FP)
>save_fpu()
># fr0 is clobbered and potentially live in 
> userspace
> 
> 
> So tldr I think the corruption is only an issue since io-uring started doing
> the clone via signal, which I think matches the observed timeline of this bug
> appearing.

I agree the corruption really only started showing up in earnest on io_uring 
clone-via-signal, as this was confirmed several times in the course of 
debugging.  Bear in mind however that we have seen very, very rare crashes over 
several years in other tasks, and I am starting to think this bug might be the 
root cause (see below).

Note as well that I may very well have a wrong call order in the commit 
message, since I was relying on a couple of WARN_ON() macros I inserted to 
check for a similar (but not identical) condition and didn't spend much time 
getting new traces after identifying the root cause.

I went back and grabbed some real world system-wide stack traces, since I now 
know what to trigger on.  A typical example is:

interrupt_return_srr_user()
 interrupt_exit_user_prepare()
  interrupt_exit_user_prepare_main()
   schedule()
    __schedule()
     __switch_to()
      giveup_all()
       # tsk->thread.regs->msr MSR_FP is still set here
       __giveup_fpu()
        save_fpu()
         # fr0 is clobbered and potentially live in userspace

Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save

2023-11-20 Thread Nicholas Piggin
Yeah, awesome.

On Mon Nov 20, 2023 at 5:10 PM AEST, Michael Ellerman wrote:
> Hi Timothy,
>
> Great work debugging this. I think your fix is good, but I want to understand 
> it
> a bit more to make sure I can explain why we haven't seen it outside of 
> io-uring.

Analysis seems right to me.

Probably the best minimal fix. But I wonder if we should just use the
one path for saving/flushing/giving up, just use giveup instead of
save?

KVM looks odd too, and actually gets this wrong. In a way that's not
fixed by Timothy's patch, because it's just not restoring userspace
registers at all. Fortunately QEMU isn't in the habit of using non
volatile FP/VEC registers over a VCPU ioctl, but there's no reason it
couldn't do since GCC/LLVM can easily use them. KVM really wants to be
using giveup.

Thanks,
Nick

> If this can be triggered outside of io-uring then I have even more backporting
> in my future :}
>
> Typically save_fpu() is called from __giveup_fpu() which saves the FP regs and
> also *turns off FP* in the task's MSR, meaning the kernel will reload the FP 
> regs
> from the thread struct before letting the task use FP again. So in that case
> save_fpu() is free to clobber fr0 because the FP regs no longer hold live 
> values
> for the task.
>
> There is another case though, which is the path via:
>   copy_process()
> dup_task_struct()
>   arch_dup_task_struct()
> flush_all_to_thread()
>   save_all()
>
> That path saves the FP regs but leaves them live. That's meant as an
> optimisation for a process that's using FP/VSX and then calls fork(), leaving
> the regs live means the parent process doesn't have to take a fault after the
> fork to get its FP regs back.
>
> That path does clobber fr0, but fr0 is volatile across a syscall, and the only
> way to reach copy_process() from userspace is via a syscall. So in normal 
> usage
> fr0 being clobbered across a syscall shouldn't cause data corruption.
>
> Even if we handle a signal on the return from the fork() syscall, the worst 
> that
> happens is that the task's thread struct holds the clobbered fr0, but the task
> doesn't care (because fr0 is volatile across the syscall anyway).
>
> That path is something like:
>
> system_call_vectored_common()
>   system_call_exception()
> sys_fork()
>   kernel_clone()
> copy_process()
>   dup_task_struct()
> arch_dup_task_struct()
>   flush_all_to_thread()
> save_all()
>   if (tsk->thread.regs->msr & MSR_FP)
> save_fpu()
> # does not clear MSR_FP from regs->msr
>   syscall_exit_prepare()
> interrupt_exit_user_prepare_main()
>   do_notify_resume()
> get_signal()
> handle_rt_signal64()
>   prepare_setup_sigcontext()
> flush_fp_to_thread()
>   if (tsk->thread.regs->msr & MSR_FP)
> giveup_fpu()
>   __giveup_fpu
> save_fpu()
> # clobbered fr0 is saved, but task considers it volatile
> # across syscall anyway
>
>
> But we now have a new path, because io-uring can call copy_process() via
> create_io_thread() from the signal handling path. That's OK if the signal is
> handled as we return from a syscall, but it's not OK if the signal is handled
> due to some other interrupt.
>
> Which is:
>
> interrupt_return_srr_user()
>   interrupt_exit_user_prepare()
> interrupt_exit_user_prepare_main()
>   do_notify_resume()
> get_signal()
>   task_work_run()
> create_worker_cb()
>   create_io_worker()
> copy_process()
>   dup_task_struct()
> arch_dup_task_struct()
>   flush_all_to_thread()
> save_all()
>   if (tsk->thread.regs->msr & MSR_FP)
> save_fpu()
> # fr0 is clobbered and potentially live in 
> userspace
>
>
> So tldr I think the corruption is only an issue since io-uring started doing
> the clone via signal, which I think matches the observed timeline of this bug
> appearing.
>
> Gotta run home, will have a closer look at the actual patch later on.
>
> cheers
>
>
> Timothy Pearson  writes:
> > During floating point and vector save to thread data fr0/vs0 are clobbered
> > by the FPSCR/VSCR store routine.  This leads to userspace register 
> > corruption
> > and application data corruption / crash under the following rare condition:
> >
> >  * A userspace thread is executing with VSX/FP mode enabled
> >  * The userspace thread is making active use of fr0 and/or vs0
> >  * An IPI is taken in kernel mode, forcing the userspace thread to 
> > reschedule
> >  * The userspace thread is interrupted by the IPI before accessing data it
> >previously stored in fr0/vs0
> >  * The thread being switched in by the IPI has a pending signal

Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save

2023-11-19 Thread Michael Ellerman
Hi Timothy,

Great work debugging this. I think your fix is good, but I want to understand it
a bit more to make sure I can explain why we haven't seen it outside of 
io-uring.
If this can be triggered outside of io-uring then I have even more backporting
in my future :}

Typically save_fpu() is called from __giveup_fpu() which saves the FP regs and
also *turns off FP* in the task's MSR, meaning the kernel will reload the FP regs
from the thread struct before letting the task use FP again. So in that case
save_fpu() is free to clobber fr0 because the FP regs no longer hold live values
for the task.
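
For reference, __giveup_fpu() in arch/powerpc/kernel/process.c is roughly the
following (paraphrased from the kernel of this era):

static void __giveup_fpu(struct task_struct *tsk)
{
	unsigned long msr;

	save_fpu(tsk);		/* dump FP regs into tsk->thread.fp_state */
	msr = tsk->thread.regs->msr;
	msr &= ~(MSR_FP|MSR_FE0|MSR_FE1);	/* turn off FP for the task */
	if (cpu_has_feature(CPU_FTR_VSX))
		msr &= ~MSR_VSX;
	tsk->thread.regs->msr = msr;	/* forces a reload from the thread struct */
}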

There is another case though, which is the path via:
  copy_process()
dup_task_struct()
  arch_dup_task_struct()
flush_all_to_thread()
  save_all()

That path saves the FP regs but leaves them live. That's meant as an
optimisation for a process that's using FP/VSX and then calls fork(), leaving
the regs live means the parent process doesn't have to take a fault after the
fork to get its FP regs back.
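
As a usage-level illustration (hypothetical userspace code, not from the
kernel tree), this is the pattern the optimisation helps: a parent that is
mid-computation when it forks resumes FP work without taking an
FP-unavailable fault:

#include <unistd.h>
#include <sys/wait.h>

double work(double x)
{
	double acc = x;
	int i;

	for (i = 0; i < 1000; i++)
		acc = acc * 1.0001 + 0.5;	/* FP register state is live */

	pid_t pid = fork();			/* save_all() path: regs stay live */
	if (pid == 0)
		_exit(0);
	waitpid(pid, NULL, 0);

	for (i = 0; i < 1000; i++)		/* parent keeps computing... */
		acc = acc * 0.9999 + 0.25;	/* ...without refaulting FP */

	return acc;
}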

That path does clobber fr0, but fr0 is volatile across a syscall, and the only
way to reach copy_process() from userspace is via a syscall. So in normal usage
fr0 being clobbered across a syscall shouldn't cause data corruption.

Even if we handle a signal on the return from the fork() syscall, the worst that
happens is that the task's thread struct holds the clobbered fr0, but the task
doesn't care (because fr0 is volatile across the syscall anyway).

That path is something like:

system_call_vectored_common()
  system_call_exception()
sys_fork()
  kernel_clone()
copy_process()
  dup_task_struct()
arch_dup_task_struct()
  flush_all_to_thread()
save_all()
  if (tsk->thread.regs->msr & MSR_FP)
save_fpu()
# does not clear MSR_FP from regs->msr
  syscall_exit_prepare()
interrupt_exit_user_prepare_main()
  do_notify_resume()
get_signal()
handle_rt_signal64()
  prepare_setup_sigcontext()
flush_fp_to_thread()
  if (tsk->thread.regs->msr & MSR_FP)
giveup_fpu()
  __giveup_fpu
save_fpu()
# clobbered fr0 is saved, but task considers it volatile
# across syscall anyway


But we now have a new path, because io-uring can call copy_process() via
create_io_thread() from the signal handling path. That's OK if the signal is
handled as we return from a syscall, but it's not OK if the signal is handled
due to some other interrupt.

Which is:

interrupt_return_srr_user()
  interrupt_exit_user_prepare()
interrupt_exit_user_prepare_main()
  do_notify_resume()
get_signal()
  task_work_run()
create_worker_cb()
  create_io_worker()
copy_process()
  dup_task_struct()
arch_dup_task_struct()
  flush_all_to_thread()
save_all()
  if (tsk->thread.regs->msr & MSR_FP)
save_fpu()
# fr0 is clobbered and potentially live in userspace


So tldr I think the corruption is only an issue since io-uring started doing
the clone via signal, which I think matches the observed timeline of this bug
appearing.

Gotta run home, will have a closer look at the actual patch later on.

cheers


Timothy Pearson  writes:
> During floating point and vector save to thread data fr0/vs0 are clobbered
> by the FPSCR/VSCR store routine.  This leads to userspace register corruption
> and application data corruption / crash under the following rare condition:
>
>  * A userspace thread is executing with VSX/FP mode enabled
>  * The userspace thread is making active use of fr0 and/or vs0
>  * An IPI is taken in kernel mode, forcing the userspace thread to reschedule
>  * The userspace thread is interrupted by the IPI before accessing data it
>previously stored in fr0/vs0
>  * The thread being switched in by the IPI has a pending signal
>
> If these exact criteria are met, then the following sequence happens:
>
>  * The existing thread FP storage is still valid before the IPI, due to a
>prior call to save_fpu() or store_fp_state().  Note that the current
>fr0/vs0 registers have been clobbered, so the FP/VSX state in registers
>is now invalid pending a call to restore_fp()/restore_altivec().
>  * IPI -- FP/VSX register state remains invalid
>  * interrupt_exit_user_prepare_main() calls do_notify_resume(),
>due to the pending signal
>  * do_notify_resume() eventually calls save_fpu() via giveup_fpu(), which
>merrily reads and saves the invalid FP/VSX state to thread local storage.
>  * interrupt_exit_user_prepare_main() calls restore_math(), writing the 
> invalid
>FP/VSX state back to registers.
>  * Execution is released to userspace, and the application crashes or corrupts
>data.

[PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save

2023-11-19 Thread Timothy Pearson
During floating point and vector save to thread data fr0/vs0 are clobbered
by the FPSCR/VSCR store routine.  This leads to userspace register corruption
and application data corruption / crash under the following rare condition:

 * A userspace thread is executing with VSX/FP mode enabled
 * The userspace thread is making active use of fr0 and/or vs0
 * An IPI is taken in kernel mode, forcing the userspace thread to reschedule
 * The userspace thread is interrupted by the IPI before accessing data it
   previously stored in fr0/vs0
 * The thread being switched in by the IPI has a pending signal

If these exact criteria are met, then the following sequence happens:

 * The existing thread FP storage is still valid before the IPI, due to a
   prior call to save_fpu() or store_fp_state().  Note that the current
   fr0/vs0 registers have been clobbered, so the FP/VSX state in registers
   is now invalid pending a call to restore_fp()/restore_altivec().
 * IPI -- FP/VSX register state remains invalid
 * interrupt_exit_user_prepare_main() calls do_notify_resume(),
   due to the pending signal
 * do_notify_resume() eventually calls save_fpu() via giveup_fpu(), which
   merrily reads and saves the invalid FP/VSX state to thread local storage.
 * interrupt_exit_user_prepare_main() calls restore_math(), writing the invalid
   FP/VSX state back to registers.
 * Execution is released to userspace, and the application crashes or corrupts
   data.

Without the pending signal, do_notify_resume() is never called, therefore the
invalid register state doesn't matter as it is overwritten nearly immediately
by interrupt_exit_user_prepare_main() calling restore_math() before return
to userspace.

Restore fr0/vs0 after FPSCR/VSCR store has completed for both the fp and
altivec register save paths.

Tested under QEMU in kvm mode, running on a Talos II workstation with dual
POWER9 DD2.2 CPUs.

Closes: https://lore.kernel.org/all/480932026.45576726.1699374859845.JavaMail.zimbra@raptorengineeringinc.com/
Closes: https://lore.kernel.org/linuxppc-dev/480221078.47953493.1700206777956.JavaMail.zimbra@raptorengineeringinc.com/
Tested-by: Timothy Pearson 
Tested-by: Jens Axboe 
Signed-off-by: Timothy Pearson 
---
 arch/powerpc/kernel/fpu.S    | 13 +++++++++++++
 arch/powerpc/kernel/vector.S |  2 ++
 2 files changed, 15 insertions(+)

diff --git a/arch/powerpc/kernel/fpu.S b/arch/powerpc/kernel/fpu.S
index 6a9acfb690c9..2f8f3f93cbb6 100644
--- a/arch/powerpc/kernel/fpu.S
+++ b/arch/powerpc/kernel/fpu.S
@@ -23,6 +23,15 @@
 #include 
 
 #ifdef CONFIG_VSX
+#define __REST_1FPVSR(n,c,base)						\
+BEGIN_FTR_SECTION							\
+	b	2f;							\
+END_FTR_SECTION_IFSET(CPU_FTR_VSX);					\
+	REST_FPR(n,base);						\
+	b	3f;							\
+2:	REST_VSR(n,c,base);						\
+3:
+
 #define __REST_32FPVSRS(n,c,base)  \
 BEGIN_FTR_SECTION  \
b   2f; \
@@ -41,9 +50,11 @@ END_FTR_SECTION_IFSET(CPU_FTR_VSX);				\
 2: SAVE_32VSRS(n,c,base);  \
 3:
 #else
+#define __REST_1FPVSR(n,b,base)	REST_FPR(n, base)
 #define __REST_32FPVSRS(n,b,base)  REST_32FPRS(n, base)
 #define __SAVE_32FPVSRS(n,b,base)  SAVE_32FPRS(n, base)
 #endif
+#define REST_1FPVSR(n,c,base)   __REST_1FPVSR(n,__REG_##c,__REG_##base)
 #define REST_32FPVSRS(n,c,base) __REST_32FPVSRS(n,__REG_##c,__REG_##base)
 #define SAVE_32FPVSRS(n,c,base) __SAVE_32FPVSRS(n,__REG_##c,__REG_##base)
 
@@ -67,6 +78,7 @@ _GLOBAL(store_fp_state)
 	SAVE_32FPVSRS(0, R4, R3)
 	mffs	fr0
 	stfd	fr0,FPSTATE_FPSCR(r3)
+	REST_1FPVSR(0, R4, R3)
 	blr
 EXPORT_SYMBOL(store_fp_state)
 
@@ -138,4 +150,5 @@ _GLOBAL(save_fpu)
 2:	SAVE_32FPVSRS(0, R4, R6)
 	mffs	fr0
 	stfd	fr0,FPSTATE_FPSCR(r6)
+	REST_1FPVSR(0, R4, R6)
 	blr
diff --git a/arch/powerpc/kernel/vector.S b/arch/powerpc/kernel/vector.S
index 4094e4c4c77a..80b3f6e476b6 100644
--- a/arch/powerpc/kernel/vector.S
+++ b/arch/powerpc/kernel/vector.S
@@ -33,6 +33,7 @@ _GLOBAL(store_vr_state)
 	mfvscr	v0
 	li	r4, VRSTATE_VSCR
 	stvx	v0, r4, r3
+	lvx	v0, 0, r3
 	blr
 EXPORT_SYMBOL(store_vr_state)
 
@@ -109,6 +110,7 @@ _GLOBAL(save_altivec)
 	mfvscr	v0
 	li	r4,VRSTATE_VSCR
 	stvx	v0,r4,r7
+	lvx	v0,0,r7
 	blr
 
 #ifdef CONFIG_VSX
-- 
2.39.2