Re: [RFC PATCH 1/3] powerpc/pseries/iommu: Bring back userspace view for single level TCE tables
- Original Message - > From: "Jason Gunthorpe" > To: "Shivaprasad G Bhat" > Cc: "Timothy Pearson" , "Alex Williamson" > , "linuxppc-dev" > , "Michael Ellerman" , > "npiggin" , "christophe > leroy" , "aneesh kumar" > , "naveen n rao" > , "gbatra" , > brk...@linux.vnet.ibm.com, "Alexey Kardashevskiy" > , r...@kernel.org, "linux-kernel" > , "kvm" , "aik" > , msucha...@suse.de, "jroedel" , "vaibhav" > , sva...@linux.ibm.com > Sent: Tuesday, March 19, 2024 9:32:02 AM > Subject: Re: [RFC PATCH 1/3] powerpc/pseries/iommu: Bring back userspace view > for single level TCE tables > On Tue, Mar 12, 2024 at 01:14:20PM -0500, Shivaprasad G Bhat wrote: >> The commit 090bad39b237a ("powerpc/powernv: Add indirect levels to >> it_userspace") which implemented the tce indirect levels >> support for PowerNV ended up removing the single level support >> which existed by default(generic tce_iommu_userspace_view_alloc/free() >> calls). On pSeries the TCEs are single level, and the allocation >> of userspace view is lost with the removal of generic code. > > :( :( > > If this has been broken since 2018 and nobody cared till now can we > please go in a direction of moving this code to the new iommu APIs > instead of doubling down on more of this old stuff that apparently > almost nobody cares about ?? > > Jason Just FYI Raptor is working on porting things over to the new APIs. RFC patches should be posted in the next week or two.
Re: [PATCH v3] powerpc: Add gpr1 and fpu save/restore functions
- Original Message -
> From: "christophe leroy"
> To: "Timothy Pearson"
> Cc: "linuxppc-dev"
> Sent: Saturday, February 17, 2024 10:39:06 AM
> Subject: Re: [PATCH v3] powerpc: Add gpr1 and fpu save/restore functions
>
> On 16/02/2024 at 18:24, Timothy Pearson wrote:
>> When building the kernel in size optimized mode with the amdgpu module
>> enabled, gcc will begin referencing external gpr1 and fpu save/restore
>> functions. This will then cause a linker failure as we do not link against
>> libgcc which normally contains those builtin functions.
>
> Right, but modules are linked with --save-restore-funcs when using gcc
> so crtsavres.o is not used and only your change to scripts/mod/modpost.c
> seems to be required to be able to build the amdgpu module with GCC.
>
> Maybe that's different with clang, but maybe worth a test and then a
> second patch?

It looks to be gated on ld, specifically if we're CONFIG_PPC64 && CONFIG_LD_IS_BFD. I can update the patch to match that gate if desired.

> Nevertheless, see comments below, you can do even shorter and more
> readable using GAS macros.

That's true, but my goal was to get this working, not refactor the entire file. :) Any chance we can merge and do a fine polish later, especially if there's a distinct possibility of the entire file going away in the near future?
Re: [PATCH v2] powerpc: Add gpr1 and fpu save/restore functions
- Original Message -
> From: "christophe leroy"
> To: "Timothy Pearson", "linuxppc-dev"
> Sent: Tuesday, February 13, 2024 9:13:41 AM
> Subject: Re: [PATCH v2] powerpc: Add gpr1 and fpu save/restore functions
>
> On 12/02/2024 at 18:14, Timothy Pearson wrote:
>
> All this looks very similar to savegpr0_xx and restgpr0_xx with the
> difference being that r12 is used instead of r1.
>
> Could we avoid duplication by using macros?

Yes. v3 sent.
[PATCH v3] powerpc: Add gpr1 and fpu save/restore functions
When building the kernel in size optimized mode with the amdgpu module enabled, gcc will begin referencing external gpr1 and fpu save/restore functions. This will then cause a linker failure as we do not link against libgcc which normally contains those builtin functions.

Implement gpr1 and fpu save/restore functions per the PowerPC 64-bit ELFv2 ABI documentation. Tested on a Talos II with a WX7100 installed and running in DisplayCore mode.

Reported-by: kernel test robot
Tested-by: Timothy Pearson
Signed-off-by: Timothy Pearson
---
 arch/powerpc/kernel/prom_init_check.sh |   4 +-
 arch/powerpc/lib/crtsavres.S           | 363 +
 scripts/mod/modpost.c                  |   4 +
 3 files changed, 253 insertions(+), 118 deletions(-)

diff --git a/arch/powerpc/kernel/prom_init_check.sh b/arch/powerpc/kernel/prom_init_check.sh
index 69623b9045d5..76c5651e29d3 100644
--- a/arch/powerpc/kernel/prom_init_check.sh
+++ b/arch/powerpc/kernel/prom_init_check.sh
@@ -72,10 +72,10 @@ do
 	# ignore register save/restore funcitons
 	case $UNDEF in
-	_restgpr_*|_restgpr0_*|_rest32gpr_*)
+	_restgpr_*|_restgpr0_*|_restgpr1_*|_rest32gpr_*)
 		OK=1
 		;;
-	_savegpr_*|_savegpr0_*|_save32gpr_*)
+	_savegpr_*|_savegpr0_*|_savegpr1_*|_save32gpr_*)
 		OK=1
 		;;
 	esac
diff --git a/arch/powerpc/lib/crtsavres.S b/arch/powerpc/lib/crtsavres.S
index 7e5e1c28e56a..f97270d36720 100644
--- a/arch/powerpc/lib/crtsavres.S
+++ b/arch/powerpc/lib/crtsavres.S
@@ -3,6 +3,7 @@
  *
  * Copyright (C) 1995, 1996, 1998, 2000, 2001 Free Software Foundation, Inc.
  * Copyright 2008 Freescale Semiconductor, Inc.
+ * Copyright 2024 Raptor Engineering, LLC
  * Written By Michael Meissner
  *
  * Based on gcc/config/rs6000/crtsavres.asm from gcc
@@ -314,126 +315,134 @@ _GLOBAL(_restvr_31)
 
 #else /* CONFIG_PPC64 */
 
-.globl _savegpr0_14
-_savegpr0_14:
-	std	r14,-144(r1)
-.globl _savegpr0_15
-_savegpr0_15:
-	std	r15,-136(r1)
-.globl _savegpr0_16
-_savegpr0_16:
-	std	r16,-128(r1)
-.globl _savegpr0_17
-_savegpr0_17:
-	std	r17,-120(r1)
-.globl _savegpr0_18
-_savegpr0_18:
-	std	r18,-112(r1)
-.globl _savegpr0_19
-_savegpr0_19:
-	std	r19,-104(r1)
-.globl _savegpr0_20
-_savegpr0_20:
-	std	r20,-96(r1)
-.globl _savegpr0_21
-_savegpr0_21:
-	std	r21,-88(r1)
-.globl _savegpr0_22
-_savegpr0_22:
-	std	r22,-80(r1)
-.globl _savegpr0_23
-_savegpr0_23:
-	std	r23,-72(r1)
-.globl _savegpr0_24
-_savegpr0_24:
-	std	r24,-64(r1)
-.globl _savegpr0_25
-_savegpr0_25:
-	std	r25,-56(r1)
-.globl _savegpr0_26
-_savegpr0_26:
-	std	r26,-48(r1)
-.globl _savegpr0_27
-_savegpr0_27:
-	std	r27,-40(r1)
-.globl _savegpr0_28
-_savegpr0_28:
-	std	r28,-32(r1)
-.globl _savegpr0_29
-_savegpr0_29:
-	std	r29,-24(r1)
-.globl _savegpr0_30
-_savegpr0_30:
-	std	r30,-16(r1)
-.globl _savegpr0_31
-_savegpr0_31:
-	std	r31,-8(r1)
-	std	r0,16(r1)
+#define __PPC64_SAVEGPR(n,base)\
+.globl _savegpr##n##_14\
+_savegpr##n##_14: \
+	std	r14,-144(base) \
+.globl _savegpr##n##_15\
+_savegpr##n##_15: \
+	std	r15,-136(base) \
+.globl _savegpr##n##_16\
+_savegpr##n##_16: \
+	std	r16,-128(base) \
+.globl _savegpr##n##_17\
+_savegpr##n##_17: \
+	std	r17,-120(base) \
+.globl _savegpr##n##_18\
+_savegpr##n##_18: \
+	std	r18,-112(base) \
+.globl _savegpr##n##_19\
+_savegpr##n##_19: \
+	std	r19,-104(base) \
+.globl _savegpr##n##_20\
+_savegpr##n##_20: \
+	std	r20,-96(base) \
+.globl _savegpr##n##_21\
+_savegpr##n##_21: \
+	std	r21,-88(base) \
+.globl _savegpr##n##_22\
+_savegpr##n##_22: \
+	std	r22,-80(base) \
+.globl _savegpr##n##_23\
+_savegpr##n##_23: \
+	std	r23,-72(base) \
+.globl _savegpr##n##_24\
+_savegpr##n##_24: \
+	std	r24,-64(base) \
+.globl _savegpr##n##_25\
+_savegpr##n##_25: \
+	std	r25,-56(base) \
+.globl _savegpr##n##_26\
+_savegpr##n##_26: \
+	std	r26,-48(base) \
+.globl _savegpr##n##_27\
+_savegpr##n##_27: \
+	std	r27,-40(base) \
+.globl _savegpr##n##_28\
+_savegpr##n##_28: \
+	std	r28,-32(base) \
+.globl _savegpr##n##_29\
+_savegpr##n##_29: \
+	std	r29,-24(base) \
+.globl _savegpr##n##_30\
+_savegpr##n##_30: \
+	std	r30,-16(base) \
+.globl _savegpr##n##_31\
+_savegpr##n##_31: \
+	std	r31,-8(base
Re: [PATCH] powerpc: Add gpr1 and fpu save/restore functions
- Original Message - > From: "Michael Ellerman" > To: "Timothy Pearson" , "Segher Boessenkool" > > Cc: "linuxppc-dev" > Sent: Monday, February 12, 2024 11:23:30 PM > Subject: Re: [PATCH] powerpc: Add gpr1 and fpu save/restore functions > Timothy Pearson writes: >> ----- Original Message - >>> From: "Segher Boessenkool" >>> To: "Timothy Pearson" >>> Cc: "linuxppc-dev" >>> Sent: Monday, February 12, 2024 12:23:22 PM >>> Subject: Re: [PATCH] powerpc: Add gpr1 and fpu save/restore functions >> >>> On Mon, Feb 12, 2024 at 12:07:03PM -0600, Timothy Pearson wrote: >>>> > I have done it for *all* architectures some ten years ago. Never found >>>> > any problem. >>>> >>>> That makes sense, what I mean by invasive is that we'd need buy-in from the >>>> other >>>> maintainers across all of the affected architectures. Is that likely to >>>> occur? >>> >>> I don't know. Here is my PowerPC-specific patch, it's a bit older, it >>> might not apply cleanly anymore, the changes needed should be obvious >>> though: >>> >>> >>> === 8< === >>> commit f16dfa5257eb14549ce22243fb2b465615085134 >>> Author: Segher Boessenkool >>> Date: Sat May 3 03:48:06 2008 +0200 >>> >>>powerpc: Link vmlinux against libgcc.a >>> >>> diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile >>> index b7212b619c52..0a2fac6ffc1c 100644 >>> --- a/arch/powerpc/Makefile >>> +++ b/arch/powerpc/Makefile >>> @@ -158,6 +158,9 @@ core-y += >>> arch/powerpc/kernel/ >>> core-$(CONFIG_XMON)+= arch/powerpc/xmon/ >>> core-$(CONFIG_KVM) += arch/powerpc/kvm/ >>> >>> +LIBGCC := $(shell $(CC) $(KBUILD_CFLAGS) -print-libgcc-file-name) >>> +libs-y += $(LIBGCC) >>> + >>> drivers-$(CONFIG_OPROFILE) += arch/powerpc/oprofile/ >>> >>> # Default to zImage, override when needed >>> === 8< === >> >> OK. PowerPC maintainers, how would you prefer to handle this? > > I'll take the patch to add the functions for now. We can look into > linking against libgcc as a future cleanup. Sounds good. 
>>>> > There are better options than -Os, fwiw. Some --param's give smaller >>>> > *and* faster kernels. What exactly is best is heavily arch-dependent >>>> > though (as well as dependent on the application code, the kernel code in >>>> > this case) :-( >>>> >>>> I've been through this a few times, and -Os is the only option that makes >>>> things (just barely) fit unfortunately. >>> >>> -O2 with appropriate inlining tuning beats -Os every day of the week, >>> in my experience. >> >> On 6.6 it's 24MiB vs 40MiB, O2 vs. Os. :( > > What compiler/config etc. are you using for that? It's the kernel config that buildroot generates for skiroot -- I think a lot of the size difference is in some of the modules that we enable such as amdgpu, but haven't dug too deeply. Once this firmware release is in beta (and therefore published publicly) I'll send over a link to the configs. Thanks!
Re: [PATCH] powerpc: Add gpr1 and fpu save/restore functions
- Original Message - > From: "Segher Boessenkool" > To: "Timothy Pearson" > Cc: "linuxppc-dev" > Sent: Monday, February 12, 2024 12:23:22 PM > Subject: Re: [PATCH] powerpc: Add gpr1 and fpu save/restore functions > On Mon, Feb 12, 2024 at 12:07:03PM -0600, Timothy Pearson wrote: >> > I have done it for *all* architectures some ten years ago. Never found >> > any problem. >> >> That makes sense, what I mean by invasive is that we'd need buy-in from the >> other >> maintainers across all of the affected architectures. Is that likely to >> occur? > > I don't know. Here is my PowerPC-specific patch, it's a bit older, it > might not apply cleanly anymore, the changes needed should be obvious > though: > > > === 8< === > commit f16dfa5257eb14549ce22243fb2b465615085134 > Author: Segher Boessenkool > Date: Sat May 3 03:48:06 2008 +0200 > >powerpc: Link vmlinux against libgcc.a > > diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile > index b7212b619c52..0a2fac6ffc1c 100644 > --- a/arch/powerpc/Makefile > +++ b/arch/powerpc/Makefile > @@ -158,6 +158,9 @@ core-y += > arch/powerpc/kernel/ > core-$(CONFIG_XMON)+= arch/powerpc/xmon/ > core-$(CONFIG_KVM) += arch/powerpc/kvm/ > > +LIBGCC := $(shell $(CC) $(KBUILD_CFLAGS) -print-libgcc-file-name) > +libs-y += $(LIBGCC) > + > drivers-$(CONFIG_OPROFILE) += arch/powerpc/oprofile/ > > # Default to zImage, override when needed > === 8< === OK. PowerPC maintainers, how would you prefer to handle this? >> > There are better options than -Os, fwiw. Some --param's give smaller >> > *and* faster kernels. What exactly is best is heavily arch-dependent >> > though (as well as dependent on the application code, the kernel code in >> > this case) :-( >> >> I've been through this a few times, and -Os is the only option that makes >> things (just barely) fit unfortunately. > > -O2 with appropriate inlining tuning beats -Os every day of the week, > in my experience. On 6.6 it's 24MiB vs 40MiB, O2 vs. Os. :(
Re: [PATCH] powerpc: Add gpr1 and fpu save/restore functions
- Original Message - > From: "Segher Boessenkool" > To: "Timothy Pearson" > Cc: "linuxppc-dev" > Sent: Monday, February 12, 2024 11:59:06 AM > Subject: Re: [PATCH] powerpc: Add gpr1 and fpu save/restore functions > On Mon, Feb 12, 2024 at 11:46:19AM -0600, Timothy Pearson wrote: >> Interesting, that make sense. >> >> How should we proceed from the current situation? Bringing in libgcc seems >> like a fairly invasive change, > > I have done it for *all* architectures some ten years ago. Never found > any problem. That makes sense, what I mean by invasive is that we'd need buy-in from the other maintainers across all of the affected architectures. Is that likely to occur? >> should we merge this to fix the current bug >> (cannot build ppc64 kernel in size-optimized mode) and start discussion on >> bringing in libgcc as the long-term fix across multiple architectures? >> >> My goal here is to not have to carry a downstream patch in perpetuity for >> our embedded Linux firmware, which needs to be compiled in size-optimized >> mode due to hardware Flash limitations. > > There are better options than -Os, fwiw. Some --param's give smaller > *and* faster kernels. What exactly is best is heavily arch-dependent > though (as well as dependent on the application code, the kernel code in > this case) :-( I've been through this a few times, and -Os is the only option that makes things (just barely) fit unfortunately.
Re: [PATCH] powerpc: Add gpr1 and fpu save/restore functions
- Original Message -
> From: "Segher Boessenkool"
> To: "Timothy Pearson"
> Cc: "linuxppc-dev"
> Sent: Monday, February 12, 2024 11:30:43 AM
> Subject: Re: [PATCH] powerpc: Add gpr1 and fpu save/restore functions
>
> Long long time ago, linux-0.11 or something, it was discovered that some
> programming mistakes resulted in double-length divisions (64x64->64 on
> 32-bit systems, say). Most architectures have no hardware support for
> that, x86 is one of those; so you need very expensive support routines
> to do that (_udivdi3 or _divdi3 in that case, ...ti3 on 64-bit archs).
>
> So it was decided to not link to libgcc to avoid this. But that means
> that all the extremely many other support routines, more for some other
> archs, are also not there. While it would have been much easier to just
> link to something that provides the _{u,}divdi3 symbol and then causes a
> forced linking error from that!
>
> Segher

Interesting, that makes sense.

How should we proceed from the current situation? Bringing in libgcc seems like a fairly invasive change; should we merge this to fix the current bug (cannot build the ppc64 kernel in size-optimized mode) and start a discussion on bringing in libgcc as the long-term fix across multiple architectures?

My goal here is to not have to carry a downstream patch in perpetuity for our embedded Linux firmware, which needs to be compiled in size-optimized mode due to hardware Flash limitations.

Thanks!
[PATCH v2] powerpc: Add gpr1 and fpu save/restore functions
When building the kernel in size optimized mode with the amdgpu module enabled, gcc will begin referencing external gpr1 and fpu save/restore functions. This will then cause a linker failure as we do not link against libgcc which normally contains those builtin functions.

Implement gpr1 and fpu save/restore functions per the PowerPC 64-bit ELFv2 ABI documentation. Tested on a Talos II with a WX7100 installed and running in DisplayCore mode.

Reported-by: kernel test robot
Tested-by: Timothy Pearson
Signed-off-by: Timothy Pearson
---
 arch/powerpc/kernel/prom_init_check.sh |   4 +-
 arch/powerpc/lib/crtsavres.S           | 244 +
 scripts/mod/modpost.c                  |   4 +
 3 files changed, 250 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/prom_init_check.sh b/arch/powerpc/kernel/prom_init_check.sh
index 69623b9045d5..76c5651e29d3 100644
--- a/arch/powerpc/kernel/prom_init_check.sh
+++ b/arch/powerpc/kernel/prom_init_check.sh
@@ -72,10 +72,10 @@ do
 	# ignore register save/restore funcitons
 	case $UNDEF in
-	_restgpr_*|_restgpr0_*|_rest32gpr_*)
+	_restgpr_*|_restgpr0_*|_restgpr1_*|_rest32gpr_*)
 		OK=1
 		;;
-	_savegpr_*|_savegpr0_*|_save32gpr_*)
+	_savegpr_*|_savegpr0_*|_savegpr1_*|_save32gpr_*)
 		OK=1
 		;;
 	esac
diff --git a/arch/powerpc/lib/crtsavres.S b/arch/powerpc/lib/crtsavres.S
index 7e5e1c28e56a..6cd870aacd7f 100644
--- a/arch/powerpc/lib/crtsavres.S
+++ b/arch/powerpc/lib/crtsavres.S
@@ -3,6 +3,7 @@
  *
  * Copyright (C) 1995, 1996, 1998, 2000, 2001 Free Software Foundation, Inc.
  * Copyright 2008 Freescale Semiconductor, Inc.
+ * Copyright 2024 Raptor Engineering, LLC
  * Written By Michael Meissner
  *
  * Based on gcc/config/rs6000/crtsavres.asm from gcc
@@ -435,6 +436,127 @@ _restgpr0_31:
 	mtlr	r0
 	blr
 
+.globl _savegpr1_14
+_savegpr1_14:
+	std	r14,-144(r12)
+.globl _savegpr1_15
+_savegpr1_15:
+	std	r15,-136(r12)
+.globl _savegpr1_16
+_savegpr1_16:
+	std	r16,-128(r12)
+.globl _savegpr1_17
+_savegpr1_17:
+	std	r17,-120(r12)
+.globl _savegpr1_18
+_savegpr1_18:
+	std	r18,-112(r12)
+.globl _savegpr1_19
+_savegpr1_19:
+	std	r19,-104(r12)
+.globl _savegpr1_20
+_savegpr1_20:
+	std	r20,-96(r12)
+.globl _savegpr1_21
+_savegpr1_21:
+	std	r21,-88(r12)
+.globl _savegpr1_22
+_savegpr1_22:
+	std	r22,-80(r12)
+.globl _savegpr1_23
+_savegpr1_23:
+	std	r23,-72(r12)
+.globl _savegpr1_24
+_savegpr1_24:
+	std	r24,-64(r12)
+.globl _savegpr1_25
+_savegpr1_25:
+	std	r25,-56(r12)
+.globl _savegpr1_26
+_savegpr1_26:
+	std	r26,-48(r12)
+.globl _savegpr1_27
+_savegpr1_27:
+	std	r27,-40(r12)
+.globl _savegpr1_28
+_savegpr1_28:
+	std	r28,-32(r12)
+.globl _savegpr1_29
+_savegpr1_29:
+	std	r29,-24(r12)
+.globl _savegpr1_30
+_savegpr1_30:
+	std	r30,-16(r12)
+.globl _savegpr1_31
+_savegpr1_31:
+	std	r31,-8(r12)
+	std	r0,16(r12)
+	blr
+
+.globl _restgpr1_14
+_restgpr1_14:
+	ld	r14,-144(r12)
+.globl _restgpr1_15
+_restgpr1_15:
+	ld	r15,-136(r12)
+.globl _restgpr1_16
+_restgpr1_16:
+	ld	r16,-128(r12)
+.globl _restgpr1_17
+_restgpr1_17:
+	ld	r17,-120(r12)
+.globl _restgpr1_18
+_restgpr1_18:
+	ld	r18,-112(r12)
+.globl _restgpr1_19
+_restgpr1_19:
+	ld	r19,-104(r12)
+.globl _restgpr1_20
+_restgpr1_20:
+	ld	r20,-96(r12)
+.globl _restgpr1_21
+_restgpr1_21:
+	ld	r21,-88(r12)
+.globl _restgpr1_22
+_restgpr1_22:
+	ld	r22,-80(r12)
+.globl _restgpr1_23
+_restgpr1_23:
+	ld	r23,-72(r12)
+.globl _restgpr1_24
+_restgpr1_24:
+	ld	r24,-64(r12)
+.globl _restgpr1_25
+_restgpr1_25:
+	ld	r25,-56(r12)
+.globl _restgpr1_26
+_restgpr1_26:
+	ld	r26,-48(r12)
+.globl _restgpr1_27
+_restgpr1_27:
+	ld	r27,-40(r12)
+.globl _restgpr1_28
+_restgpr1_28:
+	ld	r28,-32(r12)
+.globl _restgpr1_29
+_restgpr1_29:
+	ld	r0,16(r12)
+	ld	r29,-24(r12)
+	mtlr	r0
+	ld	r30,-16(r12)
+	ld	r31,-8(r12)
+	blr
+
+.globl _restgpr1_30
+_restgpr1_30:
+	ld	r30,-16(r12)
+.globl _restgpr1_31
+_restgpr1_31:
+	ld	r0,16(r12)
+	ld	r31,-8(r12)
+	mtlr	r0
+	blr
+
 #ifdef CONFIG_ALTIVEC
 
 /* Called with r0 pointing just beyond the end of the vector save area. */
@@ -540,6 +662,128 @@ _restvr_31:
 
 #endif /* CONFIG_ALTIVEC */
 
+#ifdef CONFIG_PPC_FPU
+
+.globl _savefpr_14
+_savefpr_14:
+	stfd	f14,-144(r1)
+.globl _savefpr_15
+_savefpr_15:
+	stfd	f15,-136(r1)
+.globl _savefpr_16
+_savefpr_16:
+	stfd	f16,-128(r1)
+.globl _savefpr_17
+_savefpr_17:
+	stfd	f17,-120(r1)
+.globl _savefpr_18
+_savefpr_18:
+	stfd	f18,-112(r1)
+.globl
Re: [PATCH] powerpc: Add gpr1 and fpu save/restore functions
- Original Message -
> From: "Segher Boessenkool"
> To: "Timothy Pearson"
> Cc: "linuxppc-dev"
> Sent: Monday, February 12, 2024 11:02:07 AM
> Subject: Re: [PATCH] powerpc: Add gpr1 and fpu save/restore functions
>
> On Mon, Feb 12, 2024 at 10:41:18AM -0600, Timothy Pearson wrote:
>> Implement gpr1 and fpu save/restore functions per the ABI v2 documentation.
>
> There is no "ABI v2". This is the ELFv2 ABI, it is a name, it is not a
> version 2 of anything (in fact, it is version 1 everywhere).

Apologies, I wasn't precise on the name.

> The same functions are needed and used in other ABIs, too.
>
> But, why do this patch? You just need
>
> +LIBGCC := $(shell $(CC) $(KBUILD_CFLAGS) -print-libgcc-file-name)
> +libs-y += $(LIBGCC)
>
> and nothing more. It is required for proper functioning of GCC to link
> with the libgcc support library.

There is existing code in the kernel right now to provide support functions for gpr0 and altivec save/restore. I don't know the full story here, but at some point in the kernel's history it seems to have been decided to provide the helper functions in lieu of linking libgcc directly.

If this is incorrect, then I need to know that so I can rework the patch to enable libgcc and remove the existing support functions. Is there anyone on-list who knows more of the history and decision-making that went into the current state of the kernel here?

Thanks!
[PATCH] powerpc: Add gpr1 and fpu save/restore functions
When building the kernel in size optimized mode with the amdgpu module enabled, gcc will begin referencing external gpr1 and fpu save/restore functions. This will then cause a linker failure as we do not link against libgcc which normally contains those builtin functions.

Implement gpr1 and fpu save/restore functions per the ABI v2 documentation. Tested on a Talos II with a WX7100 installed and running in DisplayCore mode.

Reported-by: kernel test robot
Tested-by: Timothy Pearson
Signed-off-by: Timothy Pearson
---
 arch/powerpc/kernel/prom_init_check.sh |   4 +-
 arch/powerpc/lib/crtsavres.S           | 244 +
 scripts/mod/modpost.c                  |   4 +
 3 files changed, 250 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/prom_init_check.sh b/arch/powerpc/kernel/prom_init_check.sh
index 69623b9045d5..76c5651e29d3 100644
--- a/arch/powerpc/kernel/prom_init_check.sh
+++ b/arch/powerpc/kernel/prom_init_check.sh
@@ -72,10 +72,10 @@ do
 	# ignore register save/restore funcitons
 	case $UNDEF in
-	_restgpr_*|_restgpr0_*|_rest32gpr_*)
+	_restgpr_*|_restgpr0_*|_restgpr1_*|_rest32gpr_*)
 		OK=1
 		;;
-	_savegpr_*|_savegpr0_*|_save32gpr_*)
+	_savegpr_*|_savegpr0_*|_savegpr1_*|_save32gpr_*)
 		OK=1
 		;;
 	esac
diff --git a/arch/powerpc/lib/crtsavres.S b/arch/powerpc/lib/crtsavres.S
index 7e5e1c28e56a..6cd870aacd7f 100644
--- a/arch/powerpc/lib/crtsavres.S
+++ b/arch/powerpc/lib/crtsavres.S
@@ -3,6 +3,7 @@
  *
  * Copyright (C) 1995, 1996, 1998, 2000, 2001 Free Software Foundation, Inc.
  * Copyright 2008 Freescale Semiconductor, Inc.
+ * Copyright 2024 Raptor Engineering, LLC
  * Written By Michael Meissner
  *
  * Based on gcc/config/rs6000/crtsavres.asm from gcc
@@ -435,6 +436,127 @@ _restgpr0_31:
 	mtlr	r0
 	blr
 
+.globl _savegpr1_14
+_savegpr1_14:
+	std	r14,-144(r12)
+.globl _savegpr1_15
+_savegpr1_15:
+	std	r15,-136(r12)
+.globl _savegpr1_16
+_savegpr1_16:
+	std	r16,-128(r12)
+.globl _savegpr1_17
+_savegpr1_17:
+	std	r17,-120(r12)
+.globl _savegpr1_18
+_savegpr1_18:
+	std	r18,-112(r12)
+.globl _savegpr1_19
+_savegpr1_19:
+	std	r19,-104(r12)
+.globl _savegpr1_20
+_savegpr1_20:
+	std	r20,-96(r12)
+.globl _savegpr1_21
+_savegpr1_21:
+	std	r21,-88(r12)
+.globl _savegpr1_22
+_savegpr1_22:
+	std	r22,-80(r12)
+.globl _savegpr1_23
+_savegpr1_23:
+	std	r23,-72(r12)
+.globl _savegpr1_24
+_savegpr1_24:
+	std	r24,-64(r12)
+.globl _savegpr1_25
+_savegpr1_25:
+	std	r25,-56(r12)
+.globl _savegpr1_26
+_savegpr1_26:
+	std	r26,-48(r12)
+.globl _savegpr1_27
+_savegpr1_27:
+	std	r27,-40(r12)
+.globl _savegpr1_28
+_savegpr1_28:
+	std	r28,-32(r12)
+.globl _savegpr1_29
+_savegpr1_29:
+	std	r29,-24(r12)
+.globl _savegpr1_30
+_savegpr1_30:
+	std	r30,-16(r12)
+.globl _savegpr1_31
+_savegpr1_31:
+	std	r31,-8(r12)
+	std	r0,16(r12)
+	blr
+
+.globl _restgpr1_14
+_restgpr1_14:
+	ld	r14,-144(r12)
+.globl _restgpr1_15
+_restgpr1_15:
+	ld	r15,-136(r12)
+.globl _restgpr1_16
+_restgpr1_16:
+	ld	r16,-128(r12)
+.globl _restgpr1_17
+_restgpr1_17:
+	ld	r17,-120(r12)
+.globl _restgpr1_18
+_restgpr1_18:
+	ld	r18,-112(r12)
+.globl _restgpr1_19
+_restgpr1_19:
+	ld	r19,-104(r12)
+.globl _restgpr1_20
+_restgpr1_20:
+	ld	r20,-96(r12)
+.globl _restgpr1_21
+_restgpr1_21:
+	ld	r21,-88(r12)
+.globl _restgpr1_22
+_restgpr1_22:
+	ld	r22,-80(r12)
+.globl _restgpr1_23
+_restgpr1_23:
+	ld	r23,-72(r12)
+.globl _restgpr1_24
+_restgpr1_24:
+	ld	r24,-64(r12)
+.globl _restgpr1_25
+_restgpr1_25:
+	ld	r25,-56(r12)
+.globl _restgpr1_26
+_restgpr1_26:
+	ld	r26,-48(r12)
+.globl _restgpr1_27
+_restgpr1_27:
+	ld	r27,-40(r12)
+.globl _restgpr1_28
+_restgpr1_28:
+	ld	r28,-32(r12)
+.globl _restgpr1_29
+_restgpr1_29:
+	ld	r0,16(r12)
+	ld	r29,-24(r12)
+	mtlr	r0
+	ld	r30,-16(r12)
+	ld	r31,-8(r12)
+	blr
+
+.globl _restgpr1_30
+_restgpr1_30:
+	ld	r30,-16(r12)
+.globl _restgpr1_31
+_restgpr1_31:
+	ld	r0,16(r12)
+	ld	r31,-8(r12)
+	mtlr	r0
+	blr
+
 #ifdef CONFIG_ALTIVEC
 
 /* Called with r0 pointing just beyond the end of the vector save area. */
@@ -540,6 +662,128 @@ _restvr_31:
 
 #endif /* CONFIG_ALTIVEC */
 
+#ifdef CONFIG_PPC_FPU
+
+.globl _savefpr_14
+_savefpr_14:
+	stfd	f14,-144(r1)
+.globl _savefpr_15
+_savefpr_15:
+	stfd	f15,-136(r1)
+.globl _savefpr_16
+_savefpr_16:
+	stfd	f16,-128(r1)
+.globl _savefpr_17
+_savefpr_17:
+	stfd	f17,-120(r1)
+.globl _savefpr_18
+_savefpr_18:
+	stfd	f18,-112(r1)
+.globl _savefpr_19
Re: [PATCH 1/2] powerpc: iommu: Bring back table group release_ownership() call
- Original Message - > From: "Timothy Pearson" > To: "Jason Gunthorpe" > Cc: "Timothy Pearson" , "Shivaprasad G Bhat" > , "iommu" > , "linuxppc-dev" , > "linux-kernel" , > "Michael Ellerman" , "npiggin" , > "christophe leroy" > , "aneesh kumar" , > "naveen n rao" , > "jroedel" , "aik" , "bgray" > , "Greg Kroah-Hartman" > , "gbatra" , "vaibhav" > > Sent: Friday, January 26, 2024 9:39:56 AM > Subject: Re: [PATCH 1/2] powerpc: iommu: Bring back table group > release_ownership() call > - Original Message - >> From: "Jason Gunthorpe" >> To: "Timothy Pearson" >> Cc: "Shivaprasad G Bhat" , "iommu" >> , >> "linuxppc-dev" >> , "linux-kernel" >> , >> "Michael Ellerman" >> , "npiggin" , "christophe leroy" >> , "aneesh kumar" >> , "naveen n rao" , >> "jroedel" , "aik" >> , "bgray" , "Greg Kroah-Hartman" >> , "gbatra" >> , "vaibhav" >> Sent: Friday, January 26, 2024 9:38:06 AM >> Subject: Re: [PATCH 1/2] powerpc: iommu: Bring back table group >> release_ownership() call > >> On Fri, Jan 26, 2024 at 09:29:55AM -0600, Timothy Pearson wrote: >>> > On Fri, Jan 26, 2024 at 08:43:12PM +0530, Shivaprasad G Bhat wrote: >>> >> > Also, is there any chance someone can work on actually fixing this to >>> >> > be a proper iommu driver? I think that will become important for power >>> >> > to use the common dma_iommu code in the next year... >>> >> We are looking into it. >>> > >>> > Okay, let me know, I can possibly help make parts of this happen >>> > >>> > power is the last still-current architecture to be outside the modern >>> > IOMMU and DMA API design and I'm going to start proposing things that >>> > will not be efficient on power because of this. >>> >>> I can get development resources on this fairly rapidly, including >>> testing. 
We should figure out the best way forward and how to deal >>> with the VFIO side of things, even if that's a rewrite at the end of >>> the day the machine-specific codebase isn't *that* large for our two >>> target flavors (64-bit PowerNV and 64-bit pSeries). >> >> I have a feeling the way forward is to just start a power driver under >> drivers/iommu/ and use kconfig to make the user exclusively select >> either the legacy arch or the modern iommu. >> >> Get that working to a level where dma_iommu is happy. >> >> Get iommufd support in the new driver good enough to run your >> application. >> >> Just forget about the weird KVM and SPAPR stuff, leave it under the >> kconfig of the old code and nobody will run it. Almost nobody already >> runs it, apparently. > > We actually use QEMU/KVM/VFIO extensively at Raptor, so need the support and > need it to be performant... Never mind, I can't read this morning. :) You did say iommufd support, which gives the VFIO passthrough functionality. I think this is a reasonable approach, and will discuss further internally this afternoon.
Re: [PATCH 1/2] powerpc: iommu: Bring back table group release_ownership() call
- Original Message - > From: "Jason Gunthorpe" > To: "Timothy Pearson" > Cc: "Shivaprasad G Bhat" , "iommu" > , "linuxppc-dev" > , "linux-kernel" > , "Michael Ellerman" > , "npiggin" , "christophe leroy" > , "aneesh kumar" > , "naveen n rao" , > "jroedel" , "aik" > , "bgray" , "Greg Kroah-Hartman" > , "gbatra" > , "vaibhav" > Sent: Friday, January 26, 2024 9:38:06 AM > Subject: Re: [PATCH 1/2] powerpc: iommu: Bring back table group > release_ownership() call > On Fri, Jan 26, 2024 at 09:29:55AM -0600, Timothy Pearson wrote: >> > On Fri, Jan 26, 2024 at 08:43:12PM +0530, Shivaprasad G Bhat wrote: >> >> > Also, is there any chance someone can work on actually fixing this to >> >> > be a proper iommu driver? I think that will become important for power >> >> > to use the common dma_iommu code in the next year... >> >> We are looking into it. >> > >> > Okay, let me know, I can possibly help make parts of this happen >> > >> > power is the last still-current architecture to be outside the modern >> > IOMMU and DMA API design and I'm going to start proposing things that >> > will not be efficient on power because of this. >> >> I can get development resources on this fairly rapidly, including >> testing. We should figure out the best way forward and how to deal >> with the VFIO side of things, even if that's a rewrite at the end of >> the day the machine-specific codebase isn't *that* large for our two >> target flavors (64-bit PowerNV and 64-bit pSeries). > > I have a feeling the way forward is to just start a power driver under > drivers/iommu/ and use kconfig to make the user exclusively select > either the legacy arch or the modern iommu. > > Get that working to a level where dma_iommu is happy. > > Get iommufd support in the new driver good enough to run your > application. > > Just forget about the weird KVM and SPAPR stuff, leave it under the > kconfig of the old code and nobody will run it. Almost nobody already > runs it, apparently. 
We actually use QEMU/KVM/VFIO extensively at Raptor, so need the support and need it to be performant... > Remove it in a few years > > From what I remember the code under the arch directory is already very > nearly almost an iommu driver. I think someone could fairly quickly > get to something working and using dma_iommu.c. If s390 is any > experience there is some benchmarking and tweaking to get performance > equal to the arch's tweaked dma_iommu copy. > > Jason
Re: [PATCH 1/2] powerpc: iommu: Bring back table group release_ownership() call
- Original Message - > From: "Jason Gunthorpe" > To: "Shivaprasad G Bhat" > Cc: io...@lists.linux.dev, "linuxppc-dev" , > "linux-kernel" > , "Michael Ellerman" , > "npiggin" , "christophe > leroy" , "aneesh kumar" > , "naveen n rao" > , jroe...@suse.de, "Timothy Pearson" > , a...@amd.com, > bg...@linux.ibm.com, "Greg Kroah-Hartman" , > gba...@linux.vnet.ibm.com, > vaib...@linux.ibm.com > Sent: Friday, January 26, 2024 9:17:01 AM > Subject: Re: [PATCH 1/2] powerpc: iommu: Bring back table group > release_ownership() call > On Fri, Jan 26, 2024 at 08:43:12PM +0530, Shivaprasad G Bhat wrote: >> > Also, is there any chance someone can work on actually fixing this to >> > be a proper iommu driver? I think that will become important for power >> > to use the common dma_iommu code in the next year... >> We are looking into it. > > Okay, let me know, I can possibly help make parts of this happen > > power is the last still-current architecture to be outside the modern > IOMMU and DMA API design and I'm going to start proposing things that > will not be efficient on power because of this. I can get development resources on this fairly rapidly, including testing. We should figure out the best way forward and how to deal with the VFIO side of things, even if that's a rewrite at the end of the day the machine-specific codebase isn't *that* large for our two target flavors (64-bit PowerNV and 64-bit pSeries). > I think a basic iommu driver using the dma API would not be so hard. > > I don't know what to do about the SPAPR VFIO mess though. :( > > Jason
Re: PowerNV PCIe hotplug support?
les linked in: pnv_php switchtec i2c_dev ses enclosure scsi_transport_sas xts ecb sg ctr nvme cbc nvme_core ofpart drm_shmem_helper t10_pi vmx_crypto ipmi_powernv powernv_flash crc64_rocksoft_generic ipmi_devintf gf128mul drm_kms_helper crct10dif_vpmsum opal_prd ipmi_msghandler mtd at24 i2c_algo_bit regmap_i2c crc64_rocksoft crc64 evdev joydev aacraid binfmt_misc drm loop fuse dm_mod drm_panel_orientation_quirks ip_tables x_tables autofs4 nfsv3 nfs_acl nfs lockd grace sunrpc fscache netfs ib_mthca ib_ipoib ib_umad rdma_ucm rdma_cm iw_cm ib_cm configfs mlx4_ib ib_uverbs ib_core hid_generic usbhid hid xhci_pci xhci_hcd ixgbe mdio_devres of_mdio tg3 fixed_phy fwnode_mdio xfrm_algo mdio crc32c_vpmsum usbcore mlx4_core libphy usb_common [last unloaded: pnv_php] [ 157.326314] CPU: 13 PID: 1170 Comm: bash Tainted: GW 6.6.8 #1 [ 157.326414] Hardware name: T2P9S01 REV 1.01 POWER9 0x4e1203 opal:skiboot-9858186 PowerNV [ 157.326485] NIP: c0da0f44 LR: c0da0f38 CTR: c08b58e0 [ 157.326573] REGS: c0003a1bb7a0 TRAP: 0300 Tainted: GW (6.6.8) [ 157.326652] MSR: 90009033 CR: 24424222 XER: [ 157.326777] CFAR: c0d9d96c DAR: 0010 DSISR: 4000 IRQMASK: 0 GPR00: c0da0f38 c0003a1bba40 c10fe800 GPR04: 0040 0040 001b GPR08: c000201ffcba1b00 cf168a00 c00819063700 GPR12: c08b58e0 c01d1c80 000100f6a548 GPR16: 000100e93e40 GPR20: 000102cc5560 000100f21a58 7fffe59d6784 c000201fff792ed0 GPR24: 0001 c000201fff793630 c0001358d8e0 GPR28: c0001358d8c0 c0002000109cc000 c0002000109cc0c0 0010 [ 157.327630] NIP [c0da0f44] mutex_lock+0x34/0x90 [ 157.327698] LR [c0da0f38] mutex_lock+0x28/0x90 [ 157.327766] Call Trace: [ 157.327799] [c0003a1bba40] [c0da0f38] mutex_lock+0x28/0x90 (unreliable) [ 157.327880] [c0003a1bba70] [c0202658] msi_lock_descs+0x28/0x40 [ 157.327963] [c0003a1bba90] [c08b5948] pci_disable_msi+0x68/0xb0 [ 157.328072] [c0003a1bbac0] [c00819060708] pnv_php_disable_irq+0x140/0x160 [pnv_php] [ 157.328150] [c0003a1bbb10] [c00819060a28] pnv_php_free_slot+0x50/0x90 [pnv_php] [ 157.328248] 
[c0003a1bbb40] [c00819061fb0] pnv_php_unregister+0x248/0x290 [pnv_php] [ 157.328342] [c0003a10] [c0081906207c] pnv_php_disable_slot+0x84/0xe0 [pnv_php] [ 157.328443] [c0003a1bbbf0] [c08cf8d0] power_write_file+0xa0/0x180 [ 157.328546] [c0003a1bbc70] [c08c4b60] pci_slot_attr_store+0x40/0x60 [ 157.328643] [c0003a1bbc90] [c0614ef4] sysfs_kf_write+0x64/0x80 [ 157.328700] [c0003a1bbcb0] [c0613a18] kernfs_fop_write_iter+0x1b8/0x2a0 [ 157.328789] [c0003a1bbd00] [c0521070] vfs_write+0x350/0x4b0 [ 157.328881] [c0003a1bbdc0] [c05214d4] ksys_write+0x84/0x140 [ 157.328960] [c0003a1bbe10] [c0030d38] system_call_exception+0x168/0x3a0 [ 157.329052] [c0003a1bbe50] [c000c270] system_call_vectored_common+0xf0/0x280 [ 157.329139] --- interrupt: 3000 at 0x7fffa77ead38 [ 157.329191] NIP: 7fffa77ead38 LR: CTR: [ 157.329259] REGS: c0003a1bbe80 TRAP: 3000 Tainted: GW (6.6.8) [ 157.329327] MSR: 9000d033 CR: 48422408 XER: [ 157.329444] IRQMASK: 0 GPR00: 0004 7fffe59d6560 7fffa7906f00 0001 GPR04: 000102ce4c40 0002 0010 000102ce6c40 GPR08: GPR12: 7fffa7a1b0c0 000100f6a548 GPR16: 000100e93e40 GPR20: 000102cc5560 000100f21a58 7fffe59d6784 7fffe59d6780 GPR24: 0001 000100f706f0 000102ce4c40 0002 GPR28: 0002 7fffa7901980 000102ce4c40 0002 [ 157.330151] NIP [7fffa77ead38] 0x7fffa77ead38 [ 157.330208] LR [] 0x0 [ 157.330245] --- interrupt: 3000 [ 157.330275] Code: 3842d8f0 7c0802a6 6000 7c0802a6 fbe1fff8 7c7f1b78 f8010010 f821ffd1 4bffc9f5 6000 3920 e94d0908 <7d00f8a8> 7c284800 40c20010 7d40f9ad [ 157.330467] ---[ end trace ]--- [ 157.902980] note: bash[1170] exited with irqs disabled - Original Message - > From: "Timothy Pearson" > To: "linu
Re: PowerNV PCIe hotplug support?
] 0x0 [ 3771.108425] --- interrupt: 3000 [ 3771.108437] Code: 3842d8f0 7c0802a6 6000 7c0802a6 fbe1fff8 7c7f1b78 f8010010 f821ffd1 4bffc9f5 6000 3920 e94d0908 <7d00f8a8> 7c284800 40c20010 7d40f9ad [ 3771.108510] ---[ end trace ]--- - Original Message - > From: "Timothy Pearson" > To: "linuxppc-dev" > Sent: Wednesday, December 27, 2023 10:15:24 PM > Subject: PowerNV PCIe hotplug support? > I've been evaluating some new options for our POWER9-based hardware in the > NVMe > space, and would like some clarification on the current status of PCIe hotplug > for the PowerNV platforms. > > From what I understand, the pnv_php driver provides the basic hotplug > functionality on PowerNV. What I'm not clear on is to what extent this is > intended to flow downstream to attached PCIe switches. > > I have a test setup here that consists of a PMC 8533 switch and several > downstream NVMe drives, with the switch attached directly to the PHB4 root > port. After loading the pnv_php module, I can disconnect the downstream NVMe > devices by either using echo 0 on /sys/bus/pcu/slots/Snnn/power, or by > doing a physical surprise unplug, however nothing I try can induce a newly > plugged device to train and be detected on the bus. Even trying a echo 0 and > subsequent echo 1 to /sys/bus/pcu/slots/Snnn/power only results in the > device going offline, there seems to be no way to bring the device back online > short of a reboot. > > Hotplug of other devices connected directly to the PHB4 appears to work > properly > (I can online and offline via the power node); the issue seems to be > restricted > to downstream devices connected to the (theoretically hotplug capable) PMC > 8533 > switch. > > Is this the intended behavior for downstream (non-IBM) PCIe ports? Raptor can > provide resources to assist in a fix if needed, but I would like to understand > if this is a bug or an unimplemented feature first, and if the latter what the > main issues are likely to be in implementation. > > Thank you!
PowerNV PCIe hotplug support?
I've been evaluating some new options for our POWER9-based hardware in the NVMe space, and would like some clarification on the current status of PCIe hotplug for the PowerNV platforms.

From what I understand, the pnv_php driver provides the basic hotplug functionality on PowerNV. What I'm not clear on is to what extent this is intended to flow downstream to attached PCIe switches.

I have a test setup here that consists of a PMC 8533 switch and several downstream NVMe drives, with the switch attached directly to the PHB4 root port. After loading the pnv_php module, I can disconnect the downstream NVMe devices either by echoing 0 to /sys/bus/pci/slots/Snnn/power, or by doing a physical surprise unplug; however, nothing I try can induce a newly plugged device to train and be detected on the bus. Even trying an echo 0 and subsequent echo 1 to /sys/bus/pci/slots/Snnn/power only results in the device going offline; there seems to be no way to bring the device back online short of a reboot.

Hotplug of other devices connected directly to the PHB4 appears to work properly (I can online and offline via the power node); the issue seems to be restricted to downstream devices connected to the (theoretically hotplug capable) PMC 8533 switch.

Is this the intended behavior for downstream (non-IBM) PCIe ports? Raptor can provide resources to assist in a fix if needed, but I would like to understand whether this is a bug or an unimplemented feature first, and if the latter, what the main issues are likely to be in implementation.

Thank you!
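For reference, the off/on cycle being tested boils down to something like the sketch below. The helper itself and the default slot name S0001 are hypothetical, purely for illustration; the real slot names come from listing /sys/bus/pci/slots/ once pnv_php is loaded.

```shell
#!/bin/sh
# Hypothetical sketch of the sysfs slot power cycle described above.
# The default slot name S0001 is an assumption; check /sys/bus/pci/slots/
# on the target system for the names actually exposed by pnv_php.
cycle_slot() {
    power="/sys/bus/pci/slots/$1/power"
    if [ -w "$power" ]; then
        echo 0 > "$power"   # take the slot offline
        sleep 1             # arbitrary settle delay before re-powering
        echo 1 > "$power"   # request power-on; the link should retrain here
        cat "$power"        # read back the resulting power state
    else
        echo "slot $1 not present (is pnv_php loaded?)"
        return 1
    fi
}

cycle_slot "${1:-S0001}" || true
```

On the setup described above, the echo 0 half behaves as expected for slots directly on the PHB4, but for slots behind the PMC 8533 switch the subsequent echo 1 never brings the device back.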
Re: [RFC PATCH 10/12] drm/amd/display: Use ARCH_HAS_KERNEL_FPU_SUPPORT
- Original Message - > From: "Samuel Holland" > To: "Michael Ellerman" > Cc: "linux-kernel" , "amd-gfx" > , "linux-arch" > , "linux-arm-kernel" > , loonga...@lists.linux.dev, > "linuxppc-dev" , "x86" , > linux-ri...@lists.infradead.org, "Christoph > Hellwig" , "Timothy Pearson" > > Sent: Wednesday, December 13, 2023 7:03:20 PM > Subject: Re: [RFC PATCH 10/12] drm/amd/display: Use > ARCH_HAS_KERNEL_FPU_SUPPORT > On 2023-12-11 6:23 AM, Michael Ellerman wrote: >> Hi Samuel, >> >> Thanks for trying to clean all this up. >> >> One problem below. >> >> Samuel Holland writes: >>> Now that all previously-supported architectures select >>> ARCH_HAS_KERNEL_FPU_SUPPORT, this code can depend on that symbol instead >>> of the existing list of architectures. It can also take advantage of the >>> common kernel-mode FPU API and method of adjusting CFLAGS. >>> >>> Signed-off-by: Samuel Holland >> ... >>> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/dc_fpu.c >>> b/drivers/gpu/drm/amd/display/amdgpu_dm/dc_fpu.c >>> index 4ae4720535a5..b64f917174ca 100644 >>> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/dc_fpu.c >>> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/dc_fpu.c >>> @@ -87,20 +78,9 @@ void dc_fpu_begin(const char *function_name, const int >>> line) >>> WARN_ON_ONCE(!in_task()); >>> preempt_disable(); >>> depth = __this_cpu_inc_return(fpu_recursion_depth); >>> - >>> if (depth == 1) { >>> -#if defined(CONFIG_X86) || defined(CONFIG_LOONGARCH) >>> + BUG_ON(!kernel_fpu_available()); >>> kernel_fpu_begin(); >>> -#elif defined(CONFIG_PPC64) >>> - if (cpu_has_feature(CPU_FTR_VSX_COMP)) >>> - enable_kernel_vsx(); >>> - else if (cpu_has_feature(CPU_FTR_ALTIVEC_COMP)) >>> - enable_kernel_altivec(); >> >> Note altivec. 
>> >>> - else if (!cpu_has_feature(CPU_FTR_FPU_UNAVAILABLE)) >>> - enable_kernel_fp(); >>> -#elif defined(CONFIG_ARM64) >>> - kernel_neon_begin(); >>> -#endif >>> } >>> >>> TRACE_DCN_FPU(true, function_name, line, depth); >>> diff --git a/drivers/gpu/drm/amd/display/dc/dml/Makefile >>> b/drivers/gpu/drm/amd/display/dc/dml/Makefile >>> index ea7d60f9a9b4..5aad0f572ba3 100644 >>> --- a/drivers/gpu/drm/amd/display/dc/dml/Makefile >>> +++ b/drivers/gpu/drm/amd/display/dc/dml/Makefile >>> @@ -25,40 +25,8 @@ >>> # It provides the general basic services required by other DAL >>> # subcomponents. >>> >>> -ifdef CONFIG_X86 >>> -dml_ccflags-$(CONFIG_CC_IS_GCC) := -mhard-float >>> -dml_ccflags := $(dml_ccflags-y) -msse >>> -endif >>> - >>> -ifdef CONFIG_PPC64 >>> -dml_ccflags := -mhard-float -maltivec >>> -endif >> >> And altivec is enabled in the flags there. >> >> That doesn't match your implementation for powerpc in patch 7, which >> only deals with float. >> >> I suspect the AMD driver actually doesn't need altivec enabled, but I >> don't know that for sure. It compiles without it, but I don't have a GPU >> to actually test. I've added Timothy on Cc who added the support for >> powerpc to the driver originally, hopefully he has a test system. If you would like me to test I'm happy to do so, but I am travelling until Friday so would need to wait until then. Thanks!
Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
- Original Message - > From: "Michael Ellerman" > To: "Timothy Pearson" > Cc: "Jens Axboe" , "regressions" > , "npiggin" , > "christophe leroy" , "linuxppc-dev" > > Sent: Tuesday, November 28, 2023 6:57:01 AM > Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec > register save > Michael Ellerman writes: >> Timothy Pearson writes: >> >>> Just wanted to check back and see if this patch was going to be queued >>> up soon? We're still having to work around / advertise the data >>> destruction issues the underlying bug is causing on e.g. Debian >>> Stable. >> >> Yeah I'll apply it this week, so it will be in rc4. > > I reworked the change log to include the exact call path I identified > instead of the more high level description you had. And tweaked a few > other bits of wording and so on, apparently fr0 is a kernelism, the ABI > and binutils calls it f0. > > I'm not sure how wedded you were to your change log, so if you dislike > my edits let me know and we can come up with a joint one. > > The actual patch is unchanged. > > cheers The commit message looks OK to me. I've also seen application crashes as a result of the register corruption, but that may be a minor detail that isn't really worth updating things over at this point -- those come from e.g. glibc using vs0 as part of a path that processes pointer information, typically seen where there's a need to replicate the same pointer to adjacent fields in a data struct.
Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
Just wanted to check back and see if this patch was going to be queued up soon? We're still having to work around / advertise the data destruction issues the underlying bug is causing on e.g. Debian Stable. Thanks!
Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
- Original Message - > From: "Michael Ellerman" > To: "Timothy Pearson" > Cc: "Jens Axboe" , "regressions" > , "npiggin" , > "christophe leroy" , "linuxppc-dev" > > Sent: Tuesday, November 21, 2023 11:01:50 PM > Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec > register save > Timothy Pearson writes: >> > ... >> >> So a little more detail on this, just to put it to rest properly vs. >> assuming hand analysis caught every possible pathway. :) >> >> The debugging that generates this stack trace also verifies the following in >> __giveup_fpu(): >> >> 1.) tsk->thread.fp_state.fpr doesn't contain the FPSCR contents prior to >> calling >> save_fpu() >> 2.) tsk->thread.fp_state.fpr contains the FPSCR contents directly after >> calling >> save_fpu() >> 3.) MSR_FP is set both in the task struct and in the live MSR. >> >> Only if all three conditions are met will it generate the trace. This >> is a generalization of the hack I used to find the problem in the >> first place. >> >> If the state will subsequently be reloaded from the thread struct, >> that means we're reloading the registers from the thread struct that >> we just verified was corrupted by the earlier save_fpu() call. There >> are only two ways I can see for that to be true -- one is if the >> registers were already clobbered when giveup_all() was entered, and >> the other is if save_fpu() went ahead and clobbered them right here >> inside giveup_all(). >> >> To see which scenario we were dealing with, I added a bit more >> instrumentation to dump the current register state if MSR_FP bit was >> already set in registers (i.e. not dumping data from task struct, but >> using the live FPU registers instead), and sure enough the registers >> are corrupt on entry, so something else has already called save_fpu() >> before we even hit giveup_all() in this call chain. > > Can you share the debug patch you're using? > > cheers Sure, here you go. 
Note that with my FPU patch there is no WARN_ON hit, at least in my testing, so it isn't userspace purposefully loading the fr0/vs0 register with the FPSCR. diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c index 392404688cec..bde57dc3262a 100644 --- a/arch/powerpc/kernel/process.c +++ b/arch/powerpc/kernel/process.c @@ -154,7 +154,49 @@ static void __giveup_fpu(struct task_struct *tsk) { unsigned long msr; + // DEBUGGING + uint64_t prev_fpr0 = *(((uint64_t*)(&tsk->thread.fp_state.fpr[0]))+0); + uint64_t prev_fpr1 = *(((uint64_t*)(&tsk->thread.fp_state.fpr[0]))+1); + struct thread_fp_state debug_fp_state; + unsigned long currentmsr = mfmsr(); + + if (currentmsr & MSR_FP) { + store_fp_state(&debug_fp_state); + load_fp_state(&debug_fp_state); + } + save_fpu(tsk); + + // DEBUGGING + if (tsk->thread.regs->msr & MSR_FP) { + if (((*(((uint64_t*)(&tsk->thread.fp_state.fpr[0]))+0) == 0x82004000) && (prev_fpr0 != 0x82004000)) +|| ((*(((uint64_t*)(&tsk->thread.fp_state.fpr[0]))+1) == 0x82004000) && (prev_fpr1 != 0x82004000))) + { + WARN_ON(1); + + printk("[TS %lld] In __giveup_fpu() for process [comm: '%s' pid %d tid %d], before save current " + "fp0: 0x%016llx/%016llx fp1: 0x%016llx/%016llx fp8: 0x%016llx/%016llx fp9: 0x%016llx/%016llx" + " msr: 0x%016lx (FP %d VSX %d EE %d) on core %d\n", + ktime_get_boottime_ns(), current->comm, current->pid, current->tgid, + *(((uint64_t*)(&debug_fp_state.fpr[0]))+0), *(((uint64_t*)(&debug_fp_state.fpr[0]))+1), + *(((uint64_t*)(&debug_fp_state.fpr[1]))+0), *(((uint64_t*)(&debug_fp_state.fpr[1]))+1), + *(((uint64_t*)(&debug_fp_state.fpr[8]))+0), *(((uint64_t*)(&debug_fp_state.fpr[8]))+1), + *(((uint64_t*)(&tsk->thread.fp_state.fpr[9]))+0), *(((uint64_t*)(&tsk->thread.fp_state.fpr[9]))+1), + currentmsr, !!(currentmsr & MSR_FP), !!(currentmsr & MSR_VSX), !!(currentmsr & MSR_EE), raw_smp_processor_id()); + + printk("[TS %lld] In __giveup_fpu() for process [comm: '%s' pid %d tid %d], after save saved " + "fp0: 
0x%016llx/%016llx fp1: 0x%016llx/%016llx fp8: 0x%016llx/%016ll
Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
- Original Message - > From: "Timothy Pearson" > To: "Michael Ellerman" > Cc: "Jens Axboe" , "regressions" > , "npiggin" , > "christophe leroy" , "linuxppc-dev" > > Sent: Monday, November 20, 2023 10:10:32 PM > Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec > register save > - Original Message - >> From: "Michael Ellerman" >> To: "Timothy Pearson" >> Cc: "Jens Axboe" , "regressions" >> , >> "npiggin" , >> "christophe leroy" , "linuxppc-dev" >> >> Sent: Monday, November 20, 2023 5:39:52 PM >> Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec >> register save > >> Timothy Pearson writes: >>> - Original Message - >>>> From: "Michael Ellerman" >> ... >>>> >>>> But we now have a new path, because io-uring can call copy_process() via >>>> create_io_thread() from the signal handling path. That's OK if the signal >>>> is >>>> handled as we return from a syscall, but it's not OK if the signal is >>>> handled >>>> due to some other interrupt. >>>> >>>> Which is: >>>> >>>> interrupt_return_srr_user() >>>> interrupt_exit_user_prepare() >>>>interrupt_exit_user_prepare_main() >>>> do_notify_resume() >>>>get_signal() >>>> task_work_run() >>>>create_worker_cb() >>>> create_io_worker() >>>>copy_process() >>>> dup_task_struct() >>>>arch_dup_task_struct() >>>> flush_all_to_thread() >>>>save_all() >>>> if (tsk->thread.regs->msr & MSR_FP) >>>>save_fpu() >>>># fr0 is clobbered and potentially live in >>>> userspace >>>> >>>> >>>> So tldr I think the corruption is only an issue since io-uring started >>>> doing >>>> the clone via signal, which I think matches the observed timeline of this >>>> bug >>>> appearing. >>> >>> I agree the corruption really only started showing up in earnest on >>> io_uring clone-via-signal, as this was confirmed several times in the >>> course of debugging. >> >> Thanks. 
>> >>> Note as well that I may very well have a wrong call order in the >>> commit message, since I was relying on a couple of WARN_ON() macros I >>> inserted to check for a similar (but not identical) condition and >>> didn't spend much time getting new traces after identifying the root >>> cause. >> >> Yep no worries. I'll reword it to incorporate the full path from my mail. >> >>> I went back and grabbed some real world system-wide stack traces, since I >>> now >>> know what to trigger on. A typical example is: >>> >>> interrupt_return_srr_user() >>> interrupt_exit_user_prepare() >>> interrupt_exit_user_prepare_main() >>>schedule() >>> __schedule() >>> __switch_to() >>> giveup_all() >>># tsk->thread.regs->msr MSR_FP is still set here >>>__giveup_fpu() >>> save_fpu() >>> # fr0 is clobbered and potentially live in userspace >> >> fr0 is not live there. > >> ie. it clears the FP etc. bits from the task's MSR. That means the FP >> state will be reloaded from the thread struct before the task is run again. > > So a little more detail on this, just to put it to rest properly vs. assuming > hand analysis caught every possible pathway. :) > > The debugging that generates this stack trace also verifies the following in > __giveup_fpu(): > > 1.) tsk->thread.fp_state.fpr doesn't contain the FPSCR contents prior to > calling > save_fpu() > 2.) tsk->thread.fp_state.fpr contains the FPSCR contents directly after > calling > save_fpu() > 3.) MSR_FP is set both in the task struct and in the live MSR. > > Only if all three conditions are met will it generate the trace. This is a > generalization of the hack I used to find the problem in the first place. > > If the state will subsequently
Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
- Original Message - > From: "Michael Ellerman" > To: "Timothy Pearson" > Cc: "Jens Axboe" , "regressions" > , "npiggin" , > "christophe leroy" , "linuxppc-dev" > > Sent: Monday, November 20, 2023 5:39:52 PM > Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec > register save > Timothy Pearson writes: >> - Original Message - >>> From: "Michael Ellerman" > ... >>> >>> But we now have a new path, because io-uring can call copy_process() via >>> create_io_thread() from the signal handling path. That's OK if the signal is >>> handled as we return from a syscall, but it's not OK if the signal is >>> handled >>> due to some other interrupt. >>> >>> Which is: >>> >>> interrupt_return_srr_user() >>> interrupt_exit_user_prepare() >>>interrupt_exit_user_prepare_main() >>> do_notify_resume() >>>get_signal() >>> task_work_run() >>>create_worker_cb() >>> create_io_worker() >>>copy_process() >>> dup_task_struct() >>>arch_dup_task_struct() >>> flush_all_to_thread() >>>save_all() >>> if (tsk->thread.regs->msr & MSR_FP) >>>save_fpu() >>># fr0 is clobbered and potentially live in >>> userspace >>> >>> >>> So tldr I think the corruption is only an issue since io-uring started doing >>> the clone via signal, which I think matches the observed timeline of this >>> bug >>> appearing. >> >> I agree the corruption really only started showing up in earnest on >> io_uring clone-via-signal, as this was confirmed several times in the >> course of debugging. > > Thanks. > >> Note as well that I may very well have a wrong call order in the >> commit message, since I was relying on a couple of WARN_ON() macros I >> inserted to check for a similar (but not identical) condition and >> didn't spend much time getting new traces after identifying the root >> cause. > > Yep no worries. I'll reword it to incorporate the full path from my mail. > >> I went back and grabbed some real world system-wide stack traces, since I now >> know what to trigger on. 
A typical example is: >> >> interrupt_return_srr_user() >> interrupt_exit_user_prepare() >> interrupt_exit_user_prepare_main() >>schedule() >> __schedule() >> __switch_to() >> giveup_all() >># tsk->thread.regs->msr MSR_FP is still set here >>__giveup_fpu() >> save_fpu() >> # fr0 is clobbered and potentially live in userspace > > fr0 is not live there. > ie. it clears the FP etc. bits from the task's MSR. That means the FP > state will be reloaded from the thread struct before the task is run again. So a little more detail on this, just to put it to rest properly vs. assuming hand analysis caught every possible pathway. :) The debugging that generates this stack trace also verifies the following in __giveup_fpu(): 1.) tsk->thread.fp_state.fpr doesn't contain the FPSCR contents prior to calling save_fpu() 2.) tsk->thread.fp_state.fpr contains the FPSCR contents directly after calling save_fpu() 3.) MSR_FP is set both in the task struct and in the live MSR. Only if all three conditions are met will it generate the trace. This is a generalization of the hack I used to find the problem in the first place. If the state will subsequently be reloaded from the thread struct, that means we're reloading the registers from the thread struct that we just verified was corrupted by the earlier save_fpu() call. There are only two ways I can see for that to be true -- one is if the registers were already clobbered when giveup_all() was entered, and the other is if save_fpu() went ahead and clobbered them right here inside giveup_all(). To see which scenario we were dealing with, I added a bit more instrumentation to dump the current register state if MSR_FP bit was already set in registers (i.e. not dumping data from task struct, but using the live FPU registers instead), and sure enough the registers are corrupt on entry, so something else has already called save_fpu() before we even hit giveup_all() in this call chain. 
Unless I'm missing something, doesn't this effectively mean that anything interrupting a task can hit this bug? Or, put another way, I'm seeing several processes hit this exact call chain with the corrupt register going back out to userspace without io_uring even in the mix, so there seems to be another pathway in play. These traces are from a qemu guest, in case it matters given the kvm path is possibly susceptible. Just a few things to think about. The FPU patch itself definitely resolves the problems; I used a sledgehammer approach *specifically* so that there is no place for a rare call sequence we didn't consider to hit it again down the line. :)
Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
- Original Message - > From: "Michael Ellerman" > To: "Timothy Pearson" > Cc: "Jens Axboe" , "regressions" > , "npiggin" , > "christophe leroy" , "linuxppc-dev" > > Sent: Monday, November 20, 2023 5:39:52 PM > Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec > register save > Timothy Pearson writes: >> - Original Message - >>> From: "Michael Ellerman" > ... >>> >>> But we now have a new path, because io-uring can call copy_process() via >>> create_io_thread() from the signal handling path. That's OK if the signal is >>> handled as we return from a syscall, but it's not OK if the signal is >>> handled >>> due to some other interrupt. >>> >>> Which is: >>> >>> interrupt_return_srr_user() >>> interrupt_exit_user_prepare() >>>interrupt_exit_user_prepare_main() >>> do_notify_resume() >>>get_signal() >>> task_work_run() >>>create_worker_cb() >>> create_io_worker() >>>copy_process() >>> dup_task_struct() >>>arch_dup_task_struct() >>> flush_all_to_thread() >>>save_all() >>> if (tsk->thread.regs->msr & MSR_FP) >>>save_fpu() >>># fr0 is clobbered and potentially live in >>> userspace >>> >>> >>> So tldr I think the corruption is only an issue since io-uring started doing >>> the clone via signal, which I think matches the observed timeline of this >>> bug >>> appearing. >> >> I agree the corruption really only started showing up in earnest on >> io_uring clone-via-signal, as this was confirmed several times in the >> course of debugging. > > Thanks. > >> Note as well that I may very well have a wrong call order in the >> commit message, since I was relying on a couple of WARN_ON() macros I >> inserted to check for a similar (but not identical) condition and >> didn't spend much time getting new traces after identifying the root >> cause. > > Yep no worries. I'll reword it to incorporate the full path from my mail. > >> I went back and grabbed some real world system-wide stack traces, since I now >> know what to trigger on. 
A typical example is: >> >> interrupt_return_srr_user() >> interrupt_exit_user_prepare() >> interrupt_exit_user_prepare_main() >>schedule() >> __schedule() >> __switch_to() >> giveup_all() >># tsk->thread.regs->msr MSR_FP is still set here >>__giveup_fpu() >> save_fpu() >> # fr0 is clobbered and potentially live in userspace > > fr0 is not live there. > > __giveup_fpu() does roughly: > > msr = tsk->thread.regs->msr; > msr &= ~(MSR_FP|MSR_FE0|MSR_FE1); > msr &= ~MSR_VSX; > tsk->thread.regs->msr = msr; > > ie. it clears the FP etc. bits from the task's MSR. That means the FP > state will be reloaded from the thread struct before the task is run again. > > Also on that path we're switching to another task, so we'll be reloading > the other task's FP state before returning to userspace. > > So I don't see any bug there. Yeah, you're right. I was trying to get traces while doing something else, and didn't think that all the way through, sorry. :) It's not going to be super easy to get a real trace (I was triggering the WARN_ON() from fr0 getting set to the FPSCR), so let's just assume it's mainly the path you manually found above and update the commit message accordingly. > There's only two places that call save_fpu() and skip the giveup logic, > which is save_all() and kvmppc_save_user_regs(). Now that's interesting as well, since it might explain some issues I've seen for years on a specific QEMU workload. Once this is backported to stable I'll need to get the kernels updated on those boxes and see if the issues disappear...
Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
- Original Message - > From: "Michael Ellerman" > To: "Timothy Pearson" , "Jens Axboe" > , "regressions" > , "npiggin" , "christophe > leroy" , > "linuxppc-dev" > Sent: Monday, November 20, 2023 1:10:06 AM > Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec > register save > Hi Timothy, > > Great work debugging this. I think your fix is good, but I want to understand > it > a bit more to make sure I can explain why we haven't seen it outside of > io-uring. > If this can be triggered outside of io-uring then I have even more backporting > in my future :} > > Typically save_fpu() is called from __giveup_fpu() which saves the FP regs and > also *turns off FP* in the tasks MSR, meaning the kernel will reload the FP > regs > from the thread struct before letting the task use FP again. So in that case > save_fpu() is free to clobber fr0 because the FP regs no longer hold live > values > for the task. > > There is another case though, which is the path via: > copy_process() >dup_task_struct() > arch_dup_task_struct() >flush_all_to_thread() > save_all() > > That path saves the FP regs but leaves them live. That's meant as an > optimisation for a process that's using FP/VSX and then calls fork(), leaving > the regs live means the parent process doesn't have to take a fault after the > fork to get its FP regs back. > > That path does clobber fr0, but fr0 is volatile across a syscall, and the only > way to reach copy_process() from userspace is via a syscall. So in normal > usage > fr0 being clobbered across a syscall shouldn't cause data corruption. > > Even if we handle a signal on the return from the fork() syscall, the worst > that > happens is that the task's thread struct holds the clobbered fr0, but the task > doesn't care (because fr0 is volatile across the syscall anyway). 
> > That path is something like: > > system_call_vectored_common() > system_call_exception() >sys_fork() > kernel_clone() >copy_process() > dup_task_struct() >arch_dup_task_struct() > flush_all_to_thread() >save_all() > if (tsk->thread.regs->msr & MSR_FP) >save_fpu() ># does not clear MSR_FP from regs->msr > syscall_exit_prepare() >interrupt_exit_user_prepare_main() > do_notify_resume() >get_signal() >handle_rt_signal64() > prepare_setup_sigcontext() >flush_fp_to_thread() > if (tsk->thread.regs->msr & MSR_FP) >giveup_fpu() > __giveup_fpu >save_fpu() ># clobbered fr0 is saved, but task considers it volatile ># across syscall anyway > > > But we now have a new path, because io-uring can call copy_process() via > create_io_thread() from the signal handling path. That's OK if the signal is > handled as we return from a syscall, but it's not OK if the signal is handled > due to some other interrupt. > > Which is: > > interrupt_return_srr_user() > interrupt_exit_user_prepare() >interrupt_exit_user_prepare_main() > do_notify_resume() >get_signal() > task_work_run() >create_worker_cb() > create_io_worker() >copy_process() > dup_task_struct() >arch_dup_task_struct() > flush_all_to_thread() >save_all() > if (tsk->thread.regs->msr & MSR_FP) >save_fpu() ># fr0 is clobbered and potentially live in > userspace > > > So tldr I think the corruption is only an issue since io-uring started doing > the clone via signal, which I think matches the observed timeline of this bug > appearing. I agree the corruption really only started showing up in earnest on io_uring clone-via-signal, as this was confirmed several times in the course of debugging. Bear in mind however that we have seen very, very rare crashes over several years in other tasks, and I am starting to think this bug might be the root cause (see below). 
Note as well that I may very well have a wrong call order in the commit message, since I was relying on a couple of WARN_ON() macros I inserted to check for a similar (but not identical) condition and didn't spend much time getting new traces after identifying the root cause. I went
[PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
During floating point and vector save to thread data fr0/vs0 are clobbered by the FPSCR/VSCR store routine. This leads to userspace register corruption and application data corruption / crash under the following rare condition:

* A userspace thread is executing with VSX/FP mode enabled
* The userspace thread is making active use of fr0 and/or vs0
* An IPI is taken in kernel mode, forcing the userspace thread to reschedule
* The userspace thread is interrupted by the IPI before accessing data it previously stored in fr0/vs0
* The thread being switched in by the IPI has a pending signal

If these exact criteria are met, then the following sequence happens:

* The existing thread FP storage is still valid before the IPI, due to a prior call to save_fpu() or store_fp_state(). Note that the current fr0/vs0 registers have been clobbered, so the FP/VSX state in registers is now invalid pending a call to restore_fp()/restore_altivec().
* IPI -- FP/VSX register state remains invalid
* interrupt_exit_user_prepare_main() calls do_notify_resume(), due to the pending signal
* do_notify_resume() eventually calls save_fpu() via giveup_fpu(), which merrily reads and saves the invalid FP/VSX state to thread local storage.
* interrupt_exit_user_prepare_main() calls restore_math(), writing the invalid FP/VSX state back to registers.
* Execution is released to userspace, and the application crashes or corrupts data.

Without the pending signal, do_notify_resume() is never called, therefore the invalid register state doesn't matter as it is overwritten nearly immediately by interrupt_exit_user_prepare_main() calling restore_math() before return to userspace.

Restore fr0/vs0 after the FPSCR/VSCR store has completed, for both the fp and altivec register save paths.

Tested under QEMU in kvm mode, running on a Talos II workstation with dual POWER9 DD2.2 CPUs.
Closes: https://lore.kernel.org/all/480932026.45576726.1699374859845.javamail.zim...@raptorengineeringinc.com/
Closes: https://lore.kernel.org/linuxppc-dev/480221078.47953493.1700206777956.javamail.zim...@raptorengineeringinc.com/
Tested-by: Timothy Pearson
Tested-by: Jens Axboe
Signed-off-by: Timothy Pearson
---
 arch/powerpc/kernel/fpu.S    | 13 +++++++++++++
 arch/powerpc/kernel/vector.S |  2 ++
 2 files changed, 15 insertions(+)

diff --git a/arch/powerpc/kernel/fpu.S b/arch/powerpc/kernel/fpu.S
index 6a9acfb690c9..2f8f3f93cbb6 100644
--- a/arch/powerpc/kernel/fpu.S
+++ b/arch/powerpc/kernel/fpu.S
@@ -23,6 +23,15 @@
 #include

 #ifdef CONFIG_VSX
+#define __REST_1FPVSR(n,c,base)		\
+BEGIN_FTR_SECTION			\
+	b	2f;			\
+END_FTR_SECTION_IFSET(CPU_FTR_VSX);	\
+	REST_FPR(n,base);		\
+	b	3f;			\
+2:	REST_VSR(n,c,base);		\
+3:
+
 #define __REST_32FPVSRS(n,c,base)	\
 BEGIN_FTR_SECTION			\
 	b	2f;			\
@@ -41,9 +50,11 @@ END_FTR_SECTION_IFSET(CPU_FTR_VSX);	\
 2:	SAVE_32VSRS(n,c,base);		\
 3:
 #else
+#define __REST_1FPVSR(n,b,base)		REST_FPR(n, base)
 #define __REST_32FPVSRS(n,b,base)	REST_32FPRS(n, base)
 #define __SAVE_32FPVSRS(n,b,base)	SAVE_32FPRS(n, base)
 #endif
+#define REST_1FPVSR(n,c,base)   __REST_1FPVSR(n,__REG_##c,__REG_##base)
 #define REST_32FPVSRS(n,c,base) __REST_32FPVSRS(n,__REG_##c,__REG_##base)
 #define SAVE_32FPVSRS(n,c,base) __SAVE_32FPVSRS(n,__REG_##c,__REG_##base)

@@ -67,6 +78,7 @@ _GLOBAL(store_fp_state)
 	SAVE_32FPVSRS(0, R4, R3)
 	mffs	fr0
 	stfd	fr0,FPSTATE_FPSCR(r3)
+	REST_1FPVSR(0, R4, R3)
 	blr
 EXPORT_SYMBOL(store_fp_state)

@@ -138,4 +150,5 @@ _GLOBAL(save_fpu)
 2:	SAVE_32FPVSRS(0, R4, R6)
 	mffs	fr0
 	stfd	fr0,FPSTATE_FPSCR(r6)
+	REST_1FPVSR(0, R4, R6)
 	blr
diff --git a/arch/powerpc/kernel/vector.S b/arch/powerpc/kernel/vector.S
index 4094e4c4c77a..80b3f6e476b6 100644
--- a/arch/powerpc/kernel/vector.S
+++ b/arch/powerpc/kernel/vector.S
@@ -33,6 +33,7 @@ _GLOBAL(store_vr_state)
 	mfvscr	v0
 	li	r4, VRSTATE_VSCR
 	stvx	v0, r4, r3
+	lvx	v0, 0, r3
 	blr
 EXPORT_SYMBOL(store_vr_state)

@@ -109,6 +110,7 @@ _GLOBAL(save_altivec)
 	mfvscr	v0
 	li	r4,VRSTATE_VSCR
 	stvx	v0,r4,r7
+	lvx	v0,0,r7
 	blr

 #ifdef CONFIG_VSX
--
2.39.2
Re: [PATCH] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
- Original Message - > From: "Timothy Pearson" > To: "Jens Axboe" , "regressions" > , "Michael Ellerman" > , "npiggin" , "christophe leroy" > , "linuxppc-dev" > > Sent: Saturday, November 18, 2023 5:45:03 PM > Subject: [PATCH] powerpc: Don't clobber fr0/vs0 during fp|altivec register > save > During floating point and vector save to thread data fr0/vs0 are clobbered > by the FPSCR/VSCR store routine. This leads to userspace register corruption > and application data corruption / crash under the following rare condition: > > * A userspace thread is executing with VSX/FP mode enabled > * The userspace thread is making active use of fr0 and/or vs0 > * An IPI is taken in kernel mode, forcing the userspace thread to reschedule > * The userspace thread is interrupted by the IPI before accessing data it > previously stored in fr0/vs0 > * The thread being switched in by the IPI has a pending signal > > If these exact criteria are met, then the following sequence happens: > > * The existing thread FP storage is still valid before the IPI, due to a > prior call to save_fpu() or store_fp_state(). Note that the current > fr0/vs0 registers have been clobbered, so the FP/VSX state in registers > is now invalid pending a call to restore_fp()/restore_altivec(). > * IPI -- FP/VSX register state remains invalid > * interrupt_exit_user_prepare_main() calls do_notify_resume(), > due to the pending signal > * do_notify_resume() eventually calls save_fpu() via giveup_fpu(), which > merrily reads and saves the invalid FP/VSX state to thread local storage. > * interrupt_exit_user_prepare_main() calls restore_math(), writing the invalid > FP/VSX state back to registers. > * Execution is released to userspace, and the application crashes or corrupts > data. 
> > Without the pending signal, do_notify_resume() is never called, therefore the > invalid register state does't matter as it is overwritten nearly immeediately > by interrupt_exit_user_prepare_main() calling restore_math() before return > to userspace. > > The combination of MariaDB and io_uring is especially good at triggering data > corruption using the above sequence, see MariaDB bug MDEV-30728. > > Restore fr0/vs0 after FPSCR/VSCR store has completed for both the fp and > altivec register save paths. > > Tested under QEMU in kvm mode, running on a Talos II workstation with dual > POWER9 DD2.2 CPUs. > > Tested-by: Timothy Pearson > Signed-off-by: Timothy Pearson > --- > arch/powerpc/kernel/fpu.S| 13 + > arch/powerpc/kernel/vector.S | 4 > 2 files changed, 17 insertions(+) > > diff --git a/arch/powerpc/kernel/fpu.S b/arch/powerpc/kernel/fpu.S > index 6a9acfb690c9..2f8f3f93cbb6 100644 > --- a/arch/powerpc/kernel/fpu.S > +++ b/arch/powerpc/kernel/fpu.S > @@ -23,6 +23,15 @@ > #include > > #ifdef CONFIG_VSX > +#define __REST_1FPVSR(n,c,base) > \ > +BEGIN_FTR_SECTION\ > + b 2f; \ > +END_FTR_SECTION_IFSET(CPU_FTR_VSX); \ > + REST_FPR(n,base); \ > + b 3f; \ > +2: REST_VSR(n,c,base); \ > +3: > + > #define __REST_32FPVSRS(n,c,base) \ > BEGIN_FTR_SECTION \ > b 2f; \ > @@ -41,9 +50,11 @@ END_FTR_SECTION_IFSET(CPU_FTR_VSX); > \ > 2:SAVE_32VSRS(n,c,base); \ > 3: > #else > +#define __REST_1FPVSR(n,b,base) REST_FPR(n, base) > #define __REST_32FPVSRS(n,b,base) REST_32FPRS(n, base) > #define __SAVE_32FPVSRS(n,b,base) SAVE_32FPRS(n, base) > #endif > +#define REST_1FPVSR(n,c,base) __REST_1FPVSR(n,__REG_##c,__REG_##base) > #define REST_32FPVSRS(n,c,base) __REST_32FPVSRS(n,__REG_##c,__REG_##base) > #define SAVE_32FPVSRS(n,c,base) __SAVE_32FPVSRS(n,__REG_##c,__REG_##base) > > @@ -67,6 +78,7 @@ _GLOBAL(store_fp_state) > SAVE_32FPVSRS(0, R4, R3) > mffsfr0 > stfdfr0,FPSTATE_FPSCR(r3) > + REST_1FPVSR(0, R4, R3) > blr > EXPORT_SYMBOL(store_fp_state) > > @@ -138,4 +150,5 @@ 
_GLOBAL(save_fpu) > 2:SAVE_32FPVSRS(0, R4, R6) > mffsfr0 > stfdfr0,FPSTATE_FPSCR(r6) > + REST_1FPVSR(0, R4
[PATCH] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
During floating point and vector save to thread data fr0/vs0 are clobbered by the FPSCR/VSCR store routine. This leads to userspace register corruption and application data corruption / crash under the following rare condition:

* A userspace thread is executing with VSX/FP mode enabled
* The userspace thread is making active use of fr0 and/or vs0
* An IPI is taken in kernel mode, forcing the userspace thread to reschedule
* The userspace thread is interrupted by the IPI before accessing data it previously stored in fr0/vs0
* The thread being switched in by the IPI has a pending signal

If these exact criteria are met, then the following sequence happens:

* The existing thread FP storage is still valid before the IPI, due to a prior call to save_fpu() or store_fp_state(). Note that the current fr0/vs0 registers have been clobbered, so the FP/VSX state in registers is now invalid pending a call to restore_fp()/restore_altivec().
* IPI -- FP/VSX register state remains invalid
* interrupt_exit_user_prepare_main() calls do_notify_resume(), due to the pending signal
* do_notify_resume() eventually calls save_fpu() via giveup_fpu(), which merrily reads and saves the invalid FP/VSX state to thread local storage.
* interrupt_exit_user_prepare_main() calls restore_math(), writing the invalid FP/VSX state back to registers.
* Execution is released to userspace, and the application crashes or corrupts data.

Without the pending signal, do_notify_resume() is never called, therefore the invalid register state doesn't matter as it is overwritten nearly immediately by interrupt_exit_user_prepare_main() calling restore_math() before return to userspace.

The combination of MariaDB and io_uring is especially good at triggering data corruption using the above sequence, see MariaDB bug MDEV-30728.

Restore fr0/vs0 after the FPSCR/VSCR store has completed, for both the fp and altivec register save paths.
Tested under QEMU in kvm mode, running on a Talos II workstation with dual POWER9 DD2.2 CPUs.

Tested-by: Timothy Pearson
Signed-off-by: Timothy Pearson
---
 arch/powerpc/kernel/fpu.S    | 13 +++++++++++++
 arch/powerpc/kernel/vector.S |  4 ++++
 2 files changed, 17 insertions(+)

diff --git a/arch/powerpc/kernel/fpu.S b/arch/powerpc/kernel/fpu.S
index 6a9acfb690c9..2f8f3f93cbb6 100644
--- a/arch/powerpc/kernel/fpu.S
+++ b/arch/powerpc/kernel/fpu.S
@@ -23,6 +23,15 @@
 #include

 #ifdef CONFIG_VSX
+#define __REST_1FPVSR(n,c,base)		\
+BEGIN_FTR_SECTION			\
+	b	2f;			\
+END_FTR_SECTION_IFSET(CPU_FTR_VSX);	\
+	REST_FPR(n,base);		\
+	b	3f;			\
+2:	REST_VSR(n,c,base);		\
+3:
+
 #define __REST_32FPVSRS(n,c,base)	\
 BEGIN_FTR_SECTION			\
 	b	2f;			\
@@ -41,9 +50,11 @@ END_FTR_SECTION_IFSET(CPU_FTR_VSX);	\
 2:	SAVE_32VSRS(n,c,base);		\
 3:
 #else
+#define __REST_1FPVSR(n,b,base)		REST_FPR(n, base)
 #define __REST_32FPVSRS(n,b,base)	REST_32FPRS(n, base)
 #define __SAVE_32FPVSRS(n,b,base)	SAVE_32FPRS(n, base)
 #endif
+#define REST_1FPVSR(n,c,base)   __REST_1FPVSR(n,__REG_##c,__REG_##base)
 #define REST_32FPVSRS(n,c,base) __REST_32FPVSRS(n,__REG_##c,__REG_##base)
 #define SAVE_32FPVSRS(n,c,base) __SAVE_32FPVSRS(n,__REG_##c,__REG_##base)

@@ -67,6 +78,7 @@ _GLOBAL(store_fp_state)
 	SAVE_32FPVSRS(0, R4, R3)
 	mffs	fr0
 	stfd	fr0,FPSTATE_FPSCR(r3)
+	REST_1FPVSR(0, R4, R3)
 	blr
 EXPORT_SYMBOL(store_fp_state)

@@ -138,4 +150,5 @@ _GLOBAL(save_fpu)
 2:	SAVE_32FPVSRS(0, R4, R6)
 	mffs	fr0
 	stfd	fr0,FPSTATE_FPSCR(r6)
+	REST_1FPVSR(0, R4, R6)
 	blr
diff --git a/arch/powerpc/kernel/vector.S b/arch/powerpc/kernel/vector.S
index 4094e4c4c77a..8c63b05b421e 100644
--- a/arch/powerpc/kernel/vector.S
+++ b/arch/powerpc/kernel/vector.S
@@ -33,6 +33,8 @@ _GLOBAL(store_vr_state)
 	mfvscr	v0
 	li	r4, VRSTATE_VSCR
 	stvx	v0, r4, r3
+	li	r4, 0
+	lvx	v0, r4, r3
 	blr
 EXPORT_SYMBOL(store_vr_state)

@@ -109,6 +111,8 @@ _GLOBAL(save_altivec)
 	mfvscr	v0
 	li	r4,VRSTATE_VSCR
 	stvx	v0,r4,r7
+	li	r4,0
+	lvx	v0,r4,r7
 	blr

 #ifdef CONFIG_VSX
--
2.39.2
Re: [PATCH] powerpc: Fix data corruption on IPI
- Original Message - > From: "Timothy Pearson" > To: "Timothy Pearson" > Cc: "npiggin" , "linuxppc-dev" > > Sent: Friday, November 17, 2023 2:26:37 AM > Subject: Re: [PATCH] powerpc: Fix data corruption on IPI > - Original Message - >> From: "Timothy Pearson" >> To: "npiggin" >> Cc: "linuxppc-dev" >> Sent: Friday, November 17, 2023 2:20:29 AM >> Subject: Re: [PATCH] powerpc: Fix data corruption on IPI > >> - Original Message - >>> From: "npiggin" >>> To: "Timothy Pearson" , "Michael Ellerman" >>> >>> Cc: "linuxppc-dev" >>> Sent: Friday, November 17, 2023 2:01:12 AM >>> Subject: Re: [PATCH] powerpc: Fix data corruption on IPI >> >>> On Fri Nov 17, 2023 at 5:39 PM AEST, Timothy Pearson wrote: >>>> >>>> >>>> - Original Message - >>>> > From: "Michael Ellerman" >>>> > To: "Timothy Pearson" , "linuxppc-dev" >>>> > >>>> > Cc: "Jens Axboe" >>>> > Sent: Tuesday, November 14, 2023 6:14:37 AM >>>> > Subject: Re: [PATCH] powerpc: Fix data corruption on IPI >>>> >>>> > Hi Timothy, >>>> > >>>> > Thanks for debugging this, but I'm unclear why this is helping because >>>> > we should already have a full barrier (hwsync) on both the sending and >>>> > receiving side. >>>> > >>>> > More below. >>>> >>>> I've spent another few days poking at this, and think I might finally have >>>> something more solid in terms of what exactly is happening, but would like >>>> some >>>> feedback on the concept / how best to fix the potential problem. >>>> >>>> As background, there are several worker threads both in userspace and in >>>> kernel >>>> mode. Crucially, the main MariaDB data processing thread (the one that >>>> handles >>>> tasks like flushing dirty pages to disk) always runs on the same core as >>>> the >>>> io_uring kernel thread that picks up I/O worker creation requests and >>>> handles >>>> them via create_worker_cb(). >>>> >>>> Changes in the ~5.12 era switched away from a delayed worker setup. 
>>>> io_uring >>>> currently sets up the new process with create_io_thread(), and immediately >>>> uses >>>> an IPI to forcibly schedule the new process. Because of the way the two >>>> threads interact, the new process ends up grabbing the CPU from the running >>>> MariaDB user thread; I've never seen it schedule on a different core. If >>>> the >>>> timing is right in this process, things get trampled on in userspace and >>>> the >>>> database server either crashes or throws a corruption fault. >>>> >>>> Through extensive debugging, I've narrowed this down to invalid state in >>>> the VSX >>>> registers on return to the MariaDB user thread from the new kernel thread. >>>> For >>>> some reason, it seems we don't restore FP state on return from the >>>> PF_IO_WORKER >>>> thread, and something in the kernel was busy writing new data to them. >>>> >>>> A direct example I was able to observe is as follows: >>>> >>>> xxspltd vs0,vs0,0 <-- vs0 now zeroed out >>>> xorir9,r9,1<-- Presumably we switch to the new kernel thread >>>> here >>>> due to the IPI >>>> slwir9,r9,7<-- On userspace thread resume, vs0 now contains the >>>> value 0x8200400082004000 >>>> xxswapd vs8,vs0<-- vs8 now has the wrong value >>>> stxvd2x vs8,r3,r12 <-- userspace is now getting stepped on >>>> stw r9,116(r3) >>>> stxvd2x vs8,r3,r0 >>>> ... >>>> CRASH >>> >>> Nice find, that looks pretty conclusive. >>> >>>> This is a very difficult race to hit, but MariaDB naturally does repetitive >>>> operations with VSX registers so it does eventually fail. I ended up with >>>> a >>>> tight loop around glibc operations that use VSX to trigger the failure >>>> reliably >>>> enough to even understa
Re: [PATCH] powerpc: Fix data corruption on IPI
- Original Message - > From: "Timothy Pearson" > To: "npiggin" > Cc: "linuxppc-dev" > Sent: Friday, November 17, 2023 2:20:29 AM > Subject: Re: [PATCH] powerpc: Fix data corruption on IPI > - Original Message - >> From: "npiggin" >> To: "Timothy Pearson" , "Michael Ellerman" >> >> Cc: "linuxppc-dev" >> Sent: Friday, November 17, 2023 2:01:12 AM >> Subject: Re: [PATCH] powerpc: Fix data corruption on IPI > >> On Fri Nov 17, 2023 at 5:39 PM AEST, Timothy Pearson wrote: >>> >>> >>> - Original Message - >>> > From: "Michael Ellerman" >>> > To: "Timothy Pearson" , "linuxppc-dev" >>> > >>> > Cc: "Jens Axboe" >>> > Sent: Tuesday, November 14, 2023 6:14:37 AM >>> > Subject: Re: [PATCH] powerpc: Fix data corruption on IPI >>> >>> > Hi Timothy, >>> > >>> > Thanks for debugging this, but I'm unclear why this is helping because >>> > we should already have a full barrier (hwsync) on both the sending and >>> > receiving side. >>> > >>> > More below. >>> >>> I've spent another few days poking at this, and think I might finally have >>> something more solid in terms of what exactly is happening, but would like >>> some >>> feedback on the concept / how best to fix the potential problem. >>> >>> As background, there are several worker threads both in userspace and in >>> kernel >>> mode. Crucially, the main MariaDB data processing thread (the one that >>> handles >>> tasks like flushing dirty pages to disk) always runs on the same core as the >>> io_uring kernel thread that picks up I/O worker creation requests and >>> handles >>> them via create_worker_cb(). >>> >>> Changes in the ~5.12 era switched away from a delayed worker setup. >>> io_uring >>> currently sets up the new process with create_io_thread(), and immediately >>> uses >>> an IPI to forcibly schedule the new process. Because of the way the two >>> threads interact, the new process ends up grabbing the CPU from the running >>> MariaDB user thread; I've never seen it schedule on a different core. 
If >>> the >>> timing is right in this process, things get trampled on in userspace and the >>> database server either crashes or throws a corruption fault. >>> >>> Through extensive debugging, I've narrowed this down to invalid state in >>> the VSX >>> registers on return to the MariaDB user thread from the new kernel thread. >>> For >>> some reason, it seems we don't restore FP state on return from the >>> PF_IO_WORKER >>> thread, and something in the kernel was busy writing new data to them. >>> >>> A direct example I was able to observe is as follows: >>> >>> xxspltd vs0,vs0,0 <-- vs0 now zeroed out >>> xorir9,r9,1<-- Presumably we switch to the new kernel thread >>> here >>> due to the IPI >>> slwir9,r9,7<-- On userspace thread resume, vs0 now contains the >>> value 0x8200400082004000 >>> xxswapd vs8,vs0<-- vs8 now has the wrong value >>> stxvd2x vs8,r3,r12 <-- userspace is now getting stepped on >>> stw r9,116(r3) >>> stxvd2x vs8,r3,r0 >>> ... >>> CRASH >> >> Nice find, that looks pretty conclusive. >> >>> This is a very difficult race to hit, but MariaDB naturally does repetitive >>> operations with VSX registers so it does eventually fail. I ended up with a >>> tight loop around glibc operations that use VSX to trigger the failure >>> reliably >>> enough to even understand what was going on. >>> >>> As I am not as familiar with this part of the Linux kernel as with most >>> other >>> areas, what is the expected save/restore path for the FP/VSX registers >>> around >>> an IPI and associated forced thread switch? If restore_math() is in the >>> return >>> path, note that MSR_FP is set in regs->msr. >> >> Context switching these FP/vec registers should be pretty robust in >> general because it's not just io-uring that uses them. io-uring could >> be using some uncommon kernel code that uses the registers incorrectly >> though I guess. >> >>
Re: [PATCH] powerpc: Fix data corruption on IPI
- Original Message - > From: "npiggin" > To: "Timothy Pearson" , "Michael Ellerman" > > Cc: "linuxppc-dev" > Sent: Friday, November 17, 2023 2:01:12 AM > Subject: Re: [PATCH] powerpc: Fix data corruption on IPI > On Fri Nov 17, 2023 at 5:39 PM AEST, Timothy Pearson wrote: >> >> >> - Original Message - >> > From: "Michael Ellerman" >> > To: "Timothy Pearson" , "linuxppc-dev" >> > >> > Cc: "Jens Axboe" >> > Sent: Tuesday, November 14, 2023 6:14:37 AM >> > Subject: Re: [PATCH] powerpc: Fix data corruption on IPI >> >> > Hi Timothy, >> > >> > Thanks for debugging this, but I'm unclear why this is helping because >> > we should already have a full barrier (hwsync) on both the sending and >> > receiving side. >> > >> > More below. >> >> I've spent another few days poking at this, and think I might finally have >> something more solid in terms of what exactly is happening, but would like >> some >> feedback on the concept / how best to fix the potential problem. >> >> As background, there are several worker threads both in userspace and in >> kernel >> mode. Crucially, the main MariaDB data processing thread (the one that >> handles >> tasks like flushing dirty pages to disk) always runs on the same core as the >> io_uring kernel thread that picks up I/O worker creation requests and handles >> them via create_worker_cb(). >> >> Changes in the ~5.12 era switched away from a delayed worker setup. io_uring >> currently sets up the new process with create_io_thread(), and immediately >> uses >> an IPI to forcibly schedule the new process. Because of the way the two >> threads interact, the new process ends up grabbing the CPU from the running >> MariaDB user thread; I've never seen it schedule on a different core. If the >> timing is right in this process, things get trampled on in userspace and the >> database server either crashes or throws a corruption fault. 
>> >> Through extensive debugging, I've narrowed this down to invalid state in the >> VSX >> registers on return to the MariaDB user thread from the new kernel thread. >> For >> some reason, it seems we don't restore FP state on return from the >> PF_IO_WORKER >> thread, and something in the kernel was busy writing new data to them. >> >> A direct example I was able to observe is as follows: >> >> xxspltd vs0,vs0,0 <-- vs0 now zeroed out >> xorir9,r9,1<-- Presumably we switch to the new kernel thread here >> due to the IPI >> slwir9,r9,7<-- On userspace thread resume, vs0 now contains the >> value 0x8200400082004000 >> xxswapd vs8,vs0<-- vs8 now has the wrong value >> stxvd2x vs8,r3,r12 <-- userspace is now getting stepped on >> stw r9,116(r3) >> stxvd2x vs8,r3,r0 >> ... >> CRASH > > Nice find, that looks pretty conclusive. > >> This is a very difficult race to hit, but MariaDB naturally does repetitive >> operations with VSX registers so it does eventually fail. I ended up with a >> tight loop around glibc operations that use VSX to trigger the failure >> reliably >> enough to even understand what was going on. >> >> As I am not as familiar with this part of the Linux kernel as with most other >> areas, what is the expected save/restore path for the FP/VSX registers around >> an IPI and associated forced thread switch? If restore_math() is in the >> return >> path, note that MSR_FP is set in regs->msr. > > Context switching these FP/vec registers should be pretty robust in > general because it's not just io-uring that uses them. io-uring could > be using some uncommon kernel code that uses the registers incorrectly > though I guess. > >> >> Second question: should we even be using the VSX registers at all in kernel >> space? Is this a side effect of io_uring interacting so closely with >> userspace >> threads, or something else entirely? 
>>
>> If I can get pointed somewhat in the right direction I'm ready to develop the
>> rest of the fix for this issue...just trying to avoid another several days of
>> slogging through the source to see what it's supposed to be doing in the first
>> place. :)
>
> Kernel can use FP/VEC/VSX registers but it has to enable and disable
> explicitly. Such kernel code also should not be preemptible.
Re: [PATCH] powerpc: Fix data corruption on IPI
- Original Message - > From: "Timothy Pearson" > To: "Michael Ellerman" , "npiggin" > Cc: "linuxppc-dev" > Sent: Friday, November 17, 2023 1:39:37 AM > Subject: Re: [PATCH] powerpc: Fix data corruption on IPI > - Original Message - >> From: "Michael Ellerman" >> To: "Timothy Pearson" , "linuxppc-dev" >> >> Cc: "Jens Axboe" >> Sent: Tuesday, November 14, 2023 6:14:37 AM >> Subject: Re: [PATCH] powerpc: Fix data corruption on IPI > >> Hi Timothy, >> >> Thanks for debugging this, but I'm unclear why this is helping because >> we should already have a full barrier (hwsync) on both the sending and >> receiving side. >> >> More below. > > I've spent another few days poking at this, and think I might finally have > something more solid in terms of what exactly is happening, but would like > some > feedback on the concept / how best to fix the potential problem. > > As background, there are several worker threads both in userspace and in > kernel > mode. Crucially, the main MariaDB data processing thread (the one that > handles > tasks like flushing dirty pages to disk) always runs on the same core as the > io_uring kernel thread that picks up I/O worker creation requests and handles > them via create_worker_cb(). > > Changes in the ~5.12 era switched away from a delayed worker setup. io_uring > currently sets up the new process with create_io_thread(), and immediately > uses > an IPI to forcibly schedule the new process. Because of the way the two > threads interact, the new process ends up grabbing the CPU from the running > MariaDB user thread; I've never seen it schedule on a different core. If the > timing is right in this process, things get trampled on in userspace and the > database server either crashes or throws a corruption fault. > > Through extensive debugging, I've narrowed this down to invalid state in the > VSX > registers on return to the MariaDB user thread from the new kernel thread. 
> For > some reason, it seems we don't restore FP state on return from the > PF_IO_WORKER > thread, and something in the kernel was busy writing new data to them. > > A direct example I was able to observe is as follows: > > xxspltd vs0,vs0,0 <-- vs0 now zeroed out > xorir9,r9,1<-- Presumably we switch to the new kernel thread here > due to the IPI > slwir9,r9,7<-- On userspace thread resume, vs0 now contains the > value 0x8200400082004000 > xxswapd vs8,vs0<-- vs8 now has the wrong value > stxvd2x vs8,r3,r12 <-- userspace is now getting stepped on > stw r9,116(r3) > stxvd2x vs8,r3,r0 > ... > CRASH > > This is a very difficult race to hit, but MariaDB naturally does repetitive > operations with VSX registers so it does eventually fail. I ended up with a > tight loop around glibc operations that use VSX to trigger the failure > reliably > enough to even understand what was going on. > > As I am not as familiar with this part of the Linux kernel as with most other > areas, what is the expected save/restore path for the FP/VSX registers around > an IPI and associated forced thread switch? If restore_math() is in the > return > path, note that MSR_FP is set in regs->msr. > > Second question: should we even be using the VSX registers at all in kernel > space? Is this a side effect of io_uring interacting so closely with > userspace > threads, or something else entirely? Thinking a bit more, a third option could be if we're restoring garbage into the registers by accident. I know the I/O worker threads claim to use a "lightweight" version of kernel_clone(), and it'd be easy to have missed something important...
Re: [PATCH] powerpc: Fix data corruption on IPI
- Original Message - > From: "Michael Ellerman" > To: "Timothy Pearson" , "linuxppc-dev" > > Cc: "Jens Axboe" > Sent: Tuesday, November 14, 2023 6:14:37 AM > Subject: Re: [PATCH] powerpc: Fix data corruption on IPI > Hi Timothy, > > Thanks for debugging this, but I'm unclear why this is helping because > we should already have a full barrier (hwsync) on both the sending and > receiving side. > > More below. I've spent another few days poking at this, and think I might finally have something more solid in terms of what exactly is happening, but would like some feedback on the concept / how best to fix the potential problem. As background, there are several worker threads both in userspace and in kernel mode. Crucially, the main MariaDB data processing thread (the one that handles tasks like flushing dirty pages to disk) always runs on the same core as the io_uring kernel thread that picks up I/O worker creation requests and handles them via create_worker_cb(). Changes in the ~5.12 era switched away from a delayed worker setup. io_uring currently sets up the new process with create_io_thread(), and immediately uses an IPI to forcibly schedule the new process. Because of the way the two threads interact, the new process ends up grabbing the CPU from the running MariaDB user thread; I've never seen it schedule on a different core. If the timing is right in this process, things get trampled on in userspace and the database server either crashes or throws a corruption fault. Through extensive debugging, I've narrowed this down to invalid state in the VSX registers on return to the MariaDB user thread from the new kernel thread. For some reason, it seems we don't restore FP state on return from the PF_IO_WORKER thread, and something in the kernel was busy writing new data to them. 
A direct example I was able to observe is as follows:

xxspltd vs0,vs0,0  <-- vs0 now zeroed out
xori    r9,r9,1    <-- Presumably we switch to the new kernel thread here due to the IPI
slwi    r9,r9,7    <-- On userspace thread resume, vs0 now contains the value 0x8200400082004000
xxswapd vs8,vs0    <-- vs8 now has the wrong value
stxvd2x vs8,r3,r12 <-- userspace is now getting stepped on
stw     r9,116(r3)
stxvd2x vs8,r3,r0
...
CRASH

This is a very difficult race to hit, but MariaDB naturally does repetitive operations with VSX registers so it does eventually fail. I ended up with a tight loop around glibc operations that use VSX to trigger the failure reliably enough to even understand what was going on.

As I am not as familiar with this part of the Linux kernel as with most other areas, what is the expected save/restore path for the FP/VSX registers around an IPI and associated forced thread switch? If restore_math() is in the return path, note that MSR_FP is set in regs->msr.

Second question: should we even be using the VSX registers at all in kernel space? Is this a side effect of io_uring interacting so closely with userspace threads, or something else entirely?

If I can get pointed somewhat in the right direction I'm ready to develop the rest of the fix for this issue...just trying to avoid another several days of slogging through the source to see what it's supposed to be doing in the first place. :)

Thanks!
Re: [PATCH] powerpc: Fix data corruption on IPI
- Original Message -
> From: "Michael Ellerman"
> To: "Timothy Pearson" , "linuxppc-dev"
> Cc: "Jens Axboe"
> Sent: Tuesday, November 14, 2023 6:14:37 AM
> Subject: Re: [PATCH] powerpc: Fix data corruption on IPI
>
> Hi Timothy,
>
> Thanks for debugging this, but I'm unclear why this is helping because
> we should already have a full barrier (hwsync) on both the sending and
> receiving side.
>
> More below.

So it looks like we might be dealing with a couple of potentially separate issues here, not sure. This is probably the most infuriating bug I've run across in my career, so please bear with me -- with the amount of active test kernels I have installed at any one time I'm occasionally infecting one with the tests from another. ;)

First, I'm not 100% convinced the code in try_to_wake_up() is race-free on ppc64, since adding a memory barrier between it and the call to kick_process() significantly reduces the frequency of the on-disk corruption. Can you take a look and see if anything stands out as to why that would be important?

The second part of this though is that the barrier only reduces the frequency of the corruption, it does not eliminate the corruption. After some consolidation, what completely eliminates the corruption is a combination of:

* Switching to TWA_SIGNAL_NO_IPI in task_work_add()
* Adding udelay(1000) in io_uring/rw.c:io_write(), right before the call to io_rw_init_file()

Adding a memory barrier instead of the udelay() doesn't work, nor does adding the delay without switching to NO_IPI. [1]

Keeping TWA_SIGNAL and adding the barrier instruction also works, but this is conceptually simpler to understand as to why it would have an effect at all
[PATCH] powerpc: Fix data corruption on IPI
From 0b2678b7cdada1a3d9aec8626f31a988d81373fa Mon Sep 17 00:00:00 2001
From: Timothy Pearson
Date: Mon, 13 Nov 2023 22:42:58 -0600
Subject: [PATCH] powerpc: Fix data corruption on IPI

On multithreaded SMP workloads such as those using io_uring, it is possible for multiple threads to hold an inconsistent view of system memory when an IPI is issued. This in turn leads to userspace memory corruption with varying degrees of probability based on workload and inter-thread timing.

io_uring provokes this bug by its use of TWA_SIGNAL during thread creation, which is especially noticeable as significant userspace data corruption with certain workloads such as MariaDB (bug MDEV-30728). While using TWA_SIGNAL_NO_IPI works around the corruption, no other architecture requires this workaround.

Issue an lwsync barrier instruction prior to sending the IPI. This ensures the receiving CPU has a consistent view of system memory, in line with other architectures.

Tested under QEMU in kvm mode, running on a Talos II workstation with dual POWER9 DD2.2 CPUs.

Tested-by: Timothy Pearson
Signed-off-by: Timothy Pearson
---
 arch/powerpc/kernel/smp.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index ab691c89d787..ba42238de518 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -369,8 +369,10 @@ static inline void do_message_pass(int cpu, int msg)
 
 void arch_smp_send_reschedule(int cpu)
 {
-	if (likely(smp_ops))
+	if (likely(smp_ops)) {
+		__smp_lwsync();
 		do_message_pass(cpu, PPC_MSG_RESCHEDULE);
+	}
 }
 
 EXPORT_SYMBOL_GPL(arch_smp_send_reschedule);
-- 
2.39.2
Re: [PATCH] powerpc: Fix data corruption on IPI
- Original Message - > From: "Salvatore Bonaccorso" > To: "Timothy Pearson" > Cc: "Linuxppc-dev" > , "Jens > Axboe" > , "regressions" , "Michael > Ellerman" , "npiggin" > , "christophe leroy" > Sent: Tuesday, November 14, 2023 1:59:14 AM > Subject: Re: [PATCH] powerpc: Fix data corruption on IPI > On Mon, Nov 13, 2023 at 10:56:09PM -0600, Timothy Pearson wrote: >> >From 0b2678b7cdada1a3d9aec8626f31a988d81373fa Mon Sep 17 00:00:00 2001 >> From: Timothy Pearson >> Date: Mon, 13 Nov 2023 22:42:58 -0600 >> Subject: [PATCH] powerpc: Fix data corruption on IPI >> >> On multithreaded SMP workloads such as those using io_uring, it is possible >> for >> multiple threads to hold an inconsistent view of system memory when an IPI is >> issued. This in turn leads to userspace memory corruption with varying >> degrees >> of probability based on workload and inter-thread timing. >> >> io_uring provokes this bug by its use of TWA_SIGNAL during thread creation, >> which is especially noticeable as significant userspace data corruption with >> certain workloads such as MariaDB (bug MDEV-30728). While using >> TWA_SIGNAL_NO_IPI works around the corruption, no other architecture requires >> this workaround. >> >> Issue an lwsync barrier instruction prior to sending the IPI. This ensures >> the receiving CPU has a consistent view of system memory, in line with other >> architectures. >> >> Tested under QEMU in kvm mode, running on a Talos II workstation with dual >> POWER9 DD2.2 CPUs. 
>> >> Tested-by: Timothy Pearson >> Signed-off-by: Timothy Pearson >> --- >> arch/powerpc/kernel/smp.c | 4 +++- >> 1 file changed, 3 insertions(+), 1 deletion(-) >> >> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c >> index ab691c89d787..ba42238de518 100644 >> --- a/arch/powerpc/kernel/smp.c >> +++ b/arch/powerpc/kernel/smp.c >> @@ -369,8 +369,10 @@ static inline void do_message_pass(int cpu, int msg) >> >> void arch_smp_send_reschedule(int cpu) >> { >> -if (likely(smp_ops)) >> +if (likely(smp_ops)) { >> +__smp_lwsync(); >> do_message_pass(cpu, PPC_MSG_RESCHEDULE); >> +} >> } >> EXPORT_SYMBOL_GPL(arch_smp_send_reschedule); > > Once this is accepted in mainline, can you ensure that it get > backported to the needed relevant stable series? (Should it be CC'ed > as well for stable@?). Absolutely! We've been blocked on kernel upgrades for production database systems for a while due to this particular bug. > For context, and maybe worth adding a Link: reference as well this is > hit in Debian in https://bugs.debian.org/1032104 Sounds good. If anyone here needs a v2 with that line added just let me know. Thanks!
Re: linux-next: Tree for Jun 2 (arch/powerpc/kernel/iommu.c)
- Original Message - > From: "Randy Dunlap" > To: "Timothy Pearson" , "Michael Ellerman" > > Cc: "Stephen Rothwell" , "Linux Next Mailing List" > , "linux-kernel" > , "linuxppc-dev" > , "Alexey Kardashevskiy" > Sent: Thursday, June 15, 2023 11:00:08 AM > Subject: Re: linux-next: Tree for Jun 2 (arch/powerpc/kernel/iommu.c) > Hi Timothy, > > On 6/3/23 20:57, Timothy Pearson wrote: >> >> >> - Original Message - >>> From: "Michael Ellerman" >>> To: "Randy Dunlap" , "Stephen Rothwell" >>> , "Linux Next Mailing List" >>> >>> Cc: "linux-kernel" , "linuxppc-dev" >>> , "Alexey >>> Kardashevskiy" , "Timothy Pearson" >>> >>> Sent: Saturday, June 3, 2023 7:22:51 PM >>> Subject: Re: linux-next: Tree for Jun 2 (arch/powerpc/kernel/iommu.c) >> >>> Randy Dunlap writes: >>>> On 6/1/23 21:01, Stephen Rothwell wrote: >>>>> Hi all, >>>>> >>>>> Changes since 20230601: >>>>> >>>> >>>> On powerpc64, a randconfig failed with: >>>> >>>> In file included from ../include/linux/list.h:5, >>>> from ../include/linux/preempt.h:11, >>>> from ../include/linux/spinlock.h:56, >>>> from ../include/linux/mmzone.h:8, >>>> from ../include/linux/gfp.h:7, >>>> from ../include/linux/slab.h:15, >>>> from ../arch/powerpc/kernel/iommu.c:15: >>>> ../arch/powerpc/kernel/iommu.c: In function >>>> 'spapr_tce_setup_phb_iommus_initcall': >>>> ../arch/powerpc/kernel/iommu.c:1391:36: error: 'hose_list' undeclared >>>> (first use >>>> in this function); did you mean 'zonelist'? >>>> 1391 | list_for_each_entry(hose, &hose_list, list_node) { >>>> |^ >>> ... >>> >>> hose_list is in pci-common.c which is built when PCI=y. >>> >>> PSERIES and POWERNV force PCI=y. >>> >>> But this config has neither: >>> >>> # CONFIG_PPC_POWERNV is not set >>> # CONFIG_PPC_PSERIES is not set >>> CONFIG_HAVE_PCI=y >>> # CONFIG_PCI is not set >>> # CONFIG_COMMON_CLK_RS9_PCIE is not set >>> >>> >>> Probably the spapr_tce code should be wrapped in an #ifdef that is only >>> enabled when POWERNV || PSERIES is enabled. 
>>> >>> cheers >> >> Sounds reasonable, I was going to look into this further over the weekend. I >> can put together a patch for Monday if that works? > > Did you prepare a patch for this? I am still seeing this build error. > > thanks. > -- > ~Randy Yes, it was sent in to the linuxppc-dev list some weeks ago. Did it not arrive?
[PATCH] powerpc/iommu: Only build sPAPR access functions on pSeries and PowerNV

A build failure with CONFIG_HAVE_PCI=y but neither PSERIES nor POWERNV set was caught by the random configuration checker. Guard the sPAPR-specific IOMMU functions on CONFIG_PPC_PSERIES || CONFIG_PPC_POWERNV.

Signed-off-by: Timothy Pearson
---
 arch/powerpc/kernel/iommu.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 67f0b01e6ff5..c52449ae6936 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1090,6 +1090,7 @@ void iommu_tce_kill(struct iommu_table *tbl,
 }
 EXPORT_SYMBOL_GPL(iommu_tce_kill);
 
+#if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV)
 static int iommu_take_ownership(struct iommu_table *tbl)
 {
 	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
@@ -1140,6 +1141,7 @@ static void iommu_release_ownership(struct iommu_table *tbl)
 		spin_unlock(&tbl->pools[i].lock);
 	spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
 }
+#endif
 
 int iommu_add_device(struct iommu_table_group *table_group, struct device *dev)
 {
@@ -1171,6 +1173,7 @@ int iommu_add_device(struct iommu_table_group *table_group, struct device *dev)
 }
 EXPORT_SYMBOL_GPL(iommu_add_device);
 
+#if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV)
 /*
  * A simple iommu_table_group_ops which only allows reusing the existing
  * iommu_table. This handles VFIO for POWER7 or the nested KVM.
  */
@@ -1398,5 +1401,6 @@ static int __init spapr_tce_setup_phb_iommus_initcall(void)
 	return 0;
 }
 postcore_initcall_sync(spapr_tce_setup_phb_iommus_initcall);
+#endif
 
 #endif /* CONFIG_IOMMU_API */
-- 
2.30.2
Re: linux-next: Tree for Jun 2 (arch/powerpc/kernel/iommu.c)
- Original Message - > From: "Michael Ellerman" > To: "Randy Dunlap" , "Stephen Rothwell" > , "Linux Next Mailing List" > > Cc: "linux-kernel" , "linuxppc-dev" > , "Alexey > Kardashevskiy" , "Timothy Pearson" > > Sent: Saturday, June 3, 2023 7:22:51 PM > Subject: Re: linux-next: Tree for Jun 2 (arch/powerpc/kernel/iommu.c) > Randy Dunlap writes: >> On 6/1/23 21:01, Stephen Rothwell wrote: >>> Hi all, >>> >>> Changes since 20230601: >>> >> >> On powerpc64, a randconfig failed with: >> >> In file included from ../include/linux/list.h:5, >> from ../include/linux/preempt.h:11, >> from ../include/linux/spinlock.h:56, >> from ../include/linux/mmzone.h:8, >> from ../include/linux/gfp.h:7, >> from ../include/linux/slab.h:15, >> from ../arch/powerpc/kernel/iommu.c:15: >> ../arch/powerpc/kernel/iommu.c: In function >> 'spapr_tce_setup_phb_iommus_initcall': >> ../arch/powerpc/kernel/iommu.c:1391:36: error: 'hose_list' undeclared (first >> use >> in this function); did you mean 'zonelist'? >> 1391 | list_for_each_entry(hose, &hose_list, list_node) { >> |^ > ... > > hose_list is in pci-common.c which is built when PCI=y. > > PSERIES and POWERNV force PCI=y. > > But this config has neither: > > # CONFIG_PPC_POWERNV is not set > # CONFIG_PPC_PSERIES is not set > CONFIG_HAVE_PCI=y > # CONFIG_PCI is not set > # CONFIG_COMMON_CLK_RS9_PCIE is not set > > > Probably the spapr_tce code should be wrapped in an #ifdef that is only > enabled when POWERNV || PSERIES is enabled. > > cheers Sounds reasonable, I was going to look into this further over the weekend. I can put together a patch for Monday if that works?
Re: [PATCH v2 0/4] Reenable VFIO support on POWER systems
- Original Message - > From: "Michael Ellerman" > To: "Timothy Pearson" , "Timothy Pearson" > > Cc: "kvm" , "linuxppc-dev" > > Sent: Tuesday, March 21, 2023 5:33:57 AM > Subject: Re: [PATCH v2 0/4] Reenable VFIO support on POWER systems > Timothy Pearson writes: >> - Original Message - >>> From: "Timothy Pearson" >>> To: "Michael Ellerman" >>> Cc: "Timothy Pearson" , "kvm" >>> , "linuxppc-dev" >>> >>> Sent: Thursday, March 9, 2023 1:28:20 PM >>> Subject: Re: [PATCH v2 0/4] Reenable VFIO support on POWER systems >> >>> - Original Message - >>>> From: "Michael Ellerman" >>>> To: "Timothy Pearson" , "kvm" >>>> >>>> Cc: "linuxppc-dev" >>>> Sent: Thursday, March 9, 2023 5:40:01 AM >>>> Subject: Re: [PATCH v2 0/4] Reenable VFIO support on POWER systems >>> >>>> Timothy Pearson writes: >>>>> This patch series reenables VFIO support on POWER systems. It >>>>> is based on Alexey Kardashevskiys's patch series, rebased and >>>>> successfully tested under QEMU with a Marvell PCIe SATA controller >>>>> on a POWER9 Blackbird host. >>>>> >>>>> Alexey Kardashevskiy (3): >>>>> powerpc/iommu: Add "borrowing" iommu_table_group_ops >>>>> powerpc/pci_64: Init pcibios subsys a bit later >>>>> powerpc/iommu: Add iommu_ops to report capabilities and allow blocking >>>>> domains >>>> >>>> As sent the patches had lost Alexey's authorship (no From: line), I >>>> fixed it up when applying so the first 3 are authored by Alexey. >>>> >>>> cheers >>> >>> Thanks for catching that, it wasn't intentional. Probably used a wrong Git >>> command... >> >> Just wanted to touch base on the patches, since they're still listed as Under >> Review on patchwork. Are we good to go for the 6.4 merge window? > > They've been in my next (and so linux-next), since last week. I just > haven't updated patchwork yet. > > So yeah they are on track to go into mainline during the v6.4 merge window. > > cheers Sounds great, thanks! 
Saw them in the next tree but wasn't sure if the patchwork status was more reflective of overall status.
Re: [PATCH v2 0/4] Reenable VFIO support on POWER systems
- Original Message - > From: "Timothy Pearson" > To: "Michael Ellerman" > Cc: "Timothy Pearson" , "kvm" > , "linuxppc-dev" > > Sent: Thursday, March 9, 2023 1:28:20 PM > Subject: Re: [PATCH v2 0/4] Reenable VFIO support on POWER systems > - Original Message - >> From: "Michael Ellerman" >> To: "Timothy Pearson" , "kvm" >> >> Cc: "linuxppc-dev" >> Sent: Thursday, March 9, 2023 5:40:01 AM >> Subject: Re: [PATCH v2 0/4] Reenable VFIO support on POWER systems > >> Timothy Pearson writes: >>> This patch series reenables VFIO support on POWER systems. It >>> is based on Alexey Kardashevskiys's patch series, rebased and >>> successfully tested under QEMU with a Marvell PCIe SATA controller >>> on a POWER9 Blackbird host. >>> >>> Alexey Kardashevskiy (3): >>> powerpc/iommu: Add "borrowing" iommu_table_group_ops >>> powerpc/pci_64: Init pcibios subsys a bit later >>> powerpc/iommu: Add iommu_ops to report capabilities and allow blocking >>> domains >> >> As sent the patches had lost Alexey's authorship (no From: line), I >> fixed it up when applying so the first 3 are authored by Alexey. >> >> cheers > > Thanks for catching that, it wasn't intentional. Probably used a wrong Git > command... Just wanted to touch base on the patches, since they're still listed as Under Review on patchwork. Are we good to go for the 6.4 merge window? Thanks!
Re: [PATCH v2 0/4] Reenable VFIO support on POWER systems
- Original Message - > From: "Michael Ellerman" > To: "Timothy Pearson" , "kvm" > > Cc: "linuxppc-dev" > Sent: Thursday, March 9, 2023 5:40:01 AM > Subject: Re: [PATCH v2 0/4] Reenable VFIO support on POWER systems > Timothy Pearson writes: >> This patch series reenables VFIO support on POWER systems. It >> is based on Alexey Kardashevskiys's patch series, rebased and >> successfully tested under QEMU with a Marvell PCIe SATA controller >> on a POWER9 Blackbird host. >> >> Alexey Kardashevskiy (3): >> powerpc/iommu: Add "borrowing" iommu_table_group_ops >> powerpc/pci_64: Init pcibios subsys a bit later >> powerpc/iommu: Add iommu_ops to report capabilities and allow blocking >> domains > > As sent the patches had lost Alexey's authorship (no From: line), I > fixed it up when applying so the first 3 are authored by Alexey. > > cheers Thanks for catching that, it wasn't intentional. Probably used a wrong Git command...
Re: [PATCH v2 0/4] Reenable VFIO support on POWER systems
- Original Message - > From: "Alex Williamson" > To: "Timothy Pearson" > Cc: "kvm" , "linuxppc-dev" > > Sent: Monday, March 6, 2023 6:59:41 PM > Subject: Re: [PATCH v2 0/4] Reenable VFIO support on POWER systems > On Mon, 6 Mar 2023 18:35:22 -0600 (CST) > Timothy Pearson wrote: > >> ----- Original Message - >> > From: "Alex Williamson" >> > To: "Timothy Pearson" >> > Cc: "kvm" , "linuxppc-dev" >> > >> > Sent: Monday, March 6, 2023 5:46:07 PM >> > Subject: Re: [PATCH v2 0/4] Reenable VFIO support on POWER systems >> >> > On Mon, 6 Mar 2023 11:29:53 -0600 (CST) >> > Timothy Pearson wrote: >> > >> >> This patch series reenables VFIO support on POWER systems. It >> >> is based on Alexey Kardashevskiys's patch series, rebased and >> >> successfully tested under QEMU with a Marvell PCIe SATA controller >> >> on a POWER9 Blackbird host. >> >> >> >> Alexey Kardashevskiy (3): >> >> powerpc/iommu: Add "borrowing" iommu_table_group_ops >> >> powerpc/pci_64: Init pcibios subsys a bit later >> >> powerpc/iommu: Add iommu_ops to report capabilities and allow blocking >> >> domains >> >> >> >> Timothy Pearson (1): >> >> Add myself to MAINTAINERS for Power VFIO support >> >> >> >> MAINTAINERS | 5 + >> >> arch/powerpc/include/asm/iommu.h | 6 +- >> >> arch/powerpc/include/asm/pci-bridge.h | 7 + >> >> arch/powerpc/kernel/iommu.c | 246 +- >> >> arch/powerpc/kernel/pci_64.c | 2 +- >> >> arch/powerpc/platforms/powernv/pci-ioda.c | 36 +++- >> >> arch/powerpc/platforms/pseries/iommu.c| 27 +++ >> >> arch/powerpc/platforms/pseries/pseries.h | 4 + >> >> arch/powerpc/platforms/pseries/setup.c| 3 + >> >> drivers/vfio/vfio_iommu_spapr_tce.c | 96 ++--- >> >> 10 files changed, 338 insertions(+), 94 deletions(-) >> >> >> > >> > For vfio and MAINTAINERS portions, >> > >> > Acked-by: Alex Williamson >> > >> > I'll note though that spapr_tce_take_ownership() looks like it copied a >> > bug from the old tce_iommu_take_ownership() where tbl and tbl->it_map >> > are tested before calling 
iommu_take_ownership() but not in the unwind >> > loop, ie. tables we might have skipped on setup are unconditionally >> > released on unwind. Thanks, >> > >> > Alex >> >> Thanks for that. I'll put together a patch to get rid of that >> potential bug that can be applied after this series is merged, unless >> you'd rather I resubmit a v3 with the issue fixed? > > Follow-up fix is fine by me. Thanks, > > Alex Just sent that patch in. Thanks!
[PATCH] Check for IOMMU table validity in error handler
If tce_iommu_take_ownership is unable to take ownership of a specific IOMMU table, the unwinder in the error handler could attempt to release ownership of an invalid table. Check validity of each table in the unwinder before attempting to release ownership.

Thanks to Alex Williamson for the initial observation!

Signed-off-by: Timothy Pearson
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 60a50ce8701e..c012ecb42ebc 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -1219,10 +1219,15 @@ static int tce_iommu_take_ownership(struct tce_container *container,
 
 		rc = iommu_take_ownership(tbl);
 		if (rc) {
-			for (j = 0; j < i; ++j)
-				iommu_release_ownership(
-						table_group->tables[j]);
+			for (j = 0; j < i; ++j) {
+				struct iommu_table *tbl =
+					table_group->tables[j];
+
+				if (!tbl || !tbl->it_map)
+					continue;
+
+				iommu_release_ownership(table_group->tables[j]);
+			}
 			return rc;
 		}
 	}
-- 
2.30.2
Re: [PATCH v2 0/4] Reenable VFIO support on POWER systems
- Original Message - > From: "Alex Williamson" > To: "Timothy Pearson" > Cc: "kvm" , "linuxppc-dev" > > Sent: Monday, March 6, 2023 5:46:07 PM > Subject: Re: [PATCH v2 0/4] Reenable VFIO support on POWER systems > On Mon, 6 Mar 2023 11:29:53 -0600 (CST) > Timothy Pearson wrote: > >> This patch series reenables VFIO support on POWER systems. It >> is based on Alexey Kardashevskiys's patch series, rebased and >> successfully tested under QEMU with a Marvell PCIe SATA controller >> on a POWER9 Blackbird host. >> >> Alexey Kardashevskiy (3): >> powerpc/iommu: Add "borrowing" iommu_table_group_ops >> powerpc/pci_64: Init pcibios subsys a bit later >> powerpc/iommu: Add iommu_ops to report capabilities and allow blocking >> domains >> >> Timothy Pearson (1): >> Add myself to MAINTAINERS for Power VFIO support >> >> MAINTAINERS | 5 + >> arch/powerpc/include/asm/iommu.h | 6 +- >> arch/powerpc/include/asm/pci-bridge.h | 7 + >> arch/powerpc/kernel/iommu.c | 246 +- >> arch/powerpc/kernel/pci_64.c | 2 +- >> arch/powerpc/platforms/powernv/pci-ioda.c | 36 +++- >> arch/powerpc/platforms/pseries/iommu.c| 27 +++ >> arch/powerpc/platforms/pseries/pseries.h | 4 + >> arch/powerpc/platforms/pseries/setup.c| 3 + >> drivers/vfio/vfio_iommu_spapr_tce.c | 96 ++--- >> 10 files changed, 338 insertions(+), 94 deletions(-) >> > > For vfio and MAINTAINERS portions, > > Acked-by: Alex Williamson > > I'll note though that spapr_tce_take_ownership() looks like it copied a > bug from the old tce_iommu_take_ownership() where tbl and tbl->it_map > are tested before calling iommu_take_ownership() but not in the unwind > loop, ie. tables we might have skipped on setup are unconditionally > released on unwind. Thanks, > > Alex Thanks for that. I'll put together a patch to get rid of that potential bug that can be applied after this series is merged, unless you'd rather I resubmit a v3 with the issue fixed?
[PATCH v2 0/4] Reenable VFIO support on POWER systems
This patch series reenables VFIO support on POWER systems. It is based on Alexey Kardashevskiy's patch series, rebased and successfully tested under QEMU with a Marvell PCIe SATA controller on a POWER9 Blackbird host.

Alexey Kardashevskiy (3):
  powerpc/iommu: Add "borrowing" iommu_table_group_ops
  powerpc/pci_64: Init pcibios subsys a bit later
  powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains

Timothy Pearson (1):
  Add myself to MAINTAINERS for Power VFIO support

 MAINTAINERS                               | 5 +
 arch/powerpc/include/asm/iommu.h          | 6 +-
 arch/powerpc/include/asm/pci-bridge.h     | 7 +
 arch/powerpc/kernel/iommu.c               | 246 +-
 arch/powerpc/kernel/pci_64.c              | 2 +-
 arch/powerpc/platforms/powernv/pci-ioda.c | 36 +++-
 arch/powerpc/platforms/pseries/iommu.c    | 27 +++
 arch/powerpc/platforms/pseries/pseries.h  | 4 +
 arch/powerpc/platforms/pseries/setup.c    | 3 +
 drivers/vfio/vfio_iommu_spapr_tce.c       | 96 ++---
 10 files changed, 338 insertions(+), 94 deletions(-)

-- 
2.30.2
[PATCH v2 1/4] powerpc/iommu: Add "borrowing" iommu_table_group_ops
PPC64 IOMMU API defines iommu_table_group_ops which handles DMA windows for PEs: control the ownership, create/set/unset a table the hardware for dynamic DMA windows (DDW). VFIO uses the API to implement support on POWER. So far only PowerNV IODA2 (POWER8 and newer machines) implemented this and other cases (POWER7 or nested KVM) did not and instead reused existing iommu_table structs. This means 1) no DDW 2) ownership transfer is done directly in the VFIO SPAPR TCE driver. Soon POWER is going to get its own iommu_ops and ownership control is going to move there. This implements spapr_tce_table_group_ops which borrows iommu_table tables. The upside is that VFIO needs to know less about POWER. The new ops returns the existing table from create_table() and only checks if the same window is already set. This is only going to work if the default DMA window starts table_group.tce32_start and as big as pe->table_group.tce32_size (not the case for IODA2+ PowerNV). This changes iommu_table_group_ops::take_ownership() to return an error if borrowing a table failed. This should not cause any visible change in behavior for PowerNV. pSeries was not that well tested/supported anyway. Signed-off-by: Alexey Kardashevskiy Signed-off-by: Timothy Pearson --- arch/powerpc/include/asm/iommu.h | 6 +- arch/powerpc/kernel/iommu.c | 98 ++- arch/powerpc/platforms/powernv/pci-ioda.c | 6 +- arch/powerpc/platforms/pseries/iommu.c| 3 + drivers/vfio/vfio_iommu_spapr_tce.c | 94 -- 5 files changed, 121 insertions(+), 86 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 7e29c73e3dd4..678b5bdc79b1 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -175,7 +175,7 @@ struct iommu_table_group_ops { long (*unset_window)(struct iommu_table_group *table_group, int num); /* Switch ownership from platform code to external user (e.g. 
VFIO) */ - void (*take_ownership)(struct iommu_table_group *table_group); + long (*take_ownership)(struct iommu_table_group *table_group); /* Switch ownership from external user (e.g. VFIO) back to core */ void (*release_ownership)(struct iommu_table_group *table_group); }; @@ -215,6 +215,8 @@ extern long iommu_tce_xchg_no_kill(struct mm_struct *mm, enum dma_data_direction *direction); extern void iommu_tce_kill(struct iommu_table *tbl, unsigned long entry, unsigned long pages); + +extern struct iommu_table_group_ops spapr_tce_table_group_ops; #else static inline void iommu_register_group(struct iommu_table_group *table_group, int pci_domain_number, @@ -303,8 +305,6 @@ extern int iommu_tce_check_gpa(unsigned long page_shift, iommu_tce_check_gpa((tbl)->it_page_shift, (gpa))) extern void iommu_flush_tce(struct iommu_table *tbl); -extern int iommu_take_ownership(struct iommu_table *tbl); -extern void iommu_release_ownership(struct iommu_table *tbl); extern enum dma_data_direction iommu_tce_direction(unsigned long tce); extern unsigned long iommu_direction_to_tce_perm(enum dma_data_direction dir); diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index ee95937bdaf1..f9f5a9418092 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -1086,7 +1086,7 @@ void iommu_tce_kill(struct iommu_table *tbl, } EXPORT_SYMBOL_GPL(iommu_tce_kill); -int iommu_take_ownership(struct iommu_table *tbl) +static int iommu_take_ownership(struct iommu_table *tbl) { unsigned long flags, i, sz = (tbl->it_size + 7) >> 3; int ret = 0; @@ -1118,9 +1118,8 @@ int iommu_take_ownership(struct iommu_table *tbl) return ret; } -EXPORT_SYMBOL_GPL(iommu_take_ownership); -void iommu_release_ownership(struct iommu_table *tbl) +static void iommu_release_ownership(struct iommu_table *tbl) { unsigned long flags, i, sz = (tbl->it_size + 7) >> 3; @@ -1137,7 +1136,6 @@ void iommu_release_ownership(struct iommu_table *tbl) spin_unlock(&tbl->pools[i].lock); 
spin_unlock_irqrestore(&tbl->large_pool.lock, flags); } -EXPORT_SYMBOL_GPL(iommu_release_ownership); int iommu_add_device(struct iommu_table_group *table_group, struct device *dev) { @@ -1179,4 +1177,96 @@ void iommu_del_device(struct device *dev) iommu_group_remove_device(dev); } EXPORT_SYMBOL_GPL(iommu_del_device); + +/* + * A simple iommu_table_group_ops which only allows reusing the existing + * iommu_table. This handles VFIO for POWER7 or the nested KVM. + * The ops does not allow creating windows and only allows reusing the existing + * one if it matches table_group->tce32_start/tce32_size/page_shift. + */ +static unsigned long spapr_tce_get_table_size(__u32 page_shift, +
[PATCH v2 2/4] powerpc/pci_64: Init pcibios subsys a bit later
The following patches are going to add dependency/use of iommu_ops, which is initialized in subsys_initcall as well. This moves pcibios_init() to the next initcall level.

This should not cause any behavioral change.

Signed-off-by: Alexey Kardashevskiy
Signed-off-by: Timothy Pearson
---
 arch/powerpc/kernel/pci_64.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/pci_64.c b/arch/powerpc/kernel/pci_64.c
index fd42059ae2a5..e27342ef128b 100644
--- a/arch/powerpc/kernel/pci_64.c
+++ b/arch/powerpc/kernel/pci_64.c
@@ -73,7 +73,7 @@ static int __init pcibios_init(void)
 	return 0;
 }
 
-subsys_initcall(pcibios_init);
+subsys_initcall_sync(pcibios_init);
 
 int pcibios_unmap_io_space(struct pci_bus *bus)
 {
-- 
2.30.2
[PATCH v2] KVM: PPC: Make KVM_CAP_IRQFD_RESAMPLE support platform dependent

When introduced, IRQFD resampling worked on POWER8 with XICS. However KVM on POWER9 has never implemented it - the compatibility mode code ("XICS-on-XIVE") misses the kvm_notify_acked_irq() call and the native XIVE mode does not handle INTx in KVM at all.

This moves the capability support advertising to platforms and stops advertising it on XIVE, i.e. POWER9 and later.

Signed-off-by: Alexey Kardashevskiy
Signed-off-by: Timothy Pearson
---
 arch/arm64/kvm/arm.c       | 3 +++
 arch/mips/kvm/mips.c       | 3 +++
 arch/powerpc/kvm/powerpc.c | 6 ++++++
 arch/riscv/kvm/vm.c        | 3 +++
 arch/s390/kvm/kvm-s390.c   | 3 +++
 arch/x86/kvm/x86.c         | 3 +++
 virt/kvm/kvm_main.c        | 1 -
 7 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 3bd732eaf087..0ad50969430a 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -220,6 +220,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_VCPU_ATTRIBUTES:
 	case KVM_CAP_PTP_KVM:
 	case KVM_CAP_ARM_SYSTEM_SUSPEND:
+#ifdef CONFIG_HAVE_KVM_IRQFD
+	case KVM_CAP_IRQFD_RESAMPLE:
+#endif
 		r = 1;
 		break;
 	case KVM_CAP_SET_GUEST_DEBUG2:
diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
index 36c8991b5d39..52bdc479875d 100644
--- a/arch/mips/kvm/mips.c
+++ b/arch/mips/kvm/mips.c
@@ -1046,6 +1046,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_READONLY_MEM:
 	case KVM_CAP_SYNC_MMU:
 	case KVM_CAP_IMMEDIATE_EXIT:
+#ifdef CONFIG_HAVE_KVM_IRQFD
+	case KVM_CAP_IRQFD_RESAMPLE:
+#endif
 		r = 1;
 		break;
 	case KVM_CAP_NR_VCPUS:
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 4c5405fc5538..d23e25e8432d 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -576,6 +576,12 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		break;
 #endif
 
+#ifdef CONFIG_HAVE_KVM_IRQFD
+	case KVM_CAP_IRQFD_RESAMPLE:
+		r = !xive_enabled();
+		break;
+#endif
+
 	case KVM_CAP_PPC_ALLOC_HTAB:
 		r = hv_enabled;
 		break;
diff --git a/arch/riscv/kvm/vm.c b/arch/riscv/kvm/vm.c
index 65a964d7e70d..0ef7a6168018 100644
--- a/arch/riscv/kvm/vm.c
+++ b/arch/riscv/kvm/vm.c
@@ -65,6 +65,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_READONLY_MEM:
 	case KVM_CAP_MP_STATE:
 	case KVM_CAP_IMMEDIATE_EXIT:
+#ifdef CONFIG_HAVE_KVM_IRQFD
+	case KVM_CAP_IRQFD_RESAMPLE:
+#endif
 		r = 1;
 		break;
 	case KVM_CAP_NR_VCPUS:
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 39b36562c043..6ca84bfdd2dc 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -573,6 +573,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_S390_VCPU_RESETS:
 	case KVM_CAP_SET_GUEST_DEBUG:
 	case KVM_CAP_S390_DIAG318:
+#ifdef CONFIG_HAVE_KVM_IRQFD
+	case KVM_CAP_IRQFD_RESAMPLE:
+#endif
 		r = 1;
 		break;
 	case KVM_CAP_SET_GUEST_DEBUG2:
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7713420abab0..891aeace811e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4432,6 +4432,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_VAPIC:
 	case KVM_CAP_ENABLE_CAP:
 	case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
+#ifdef CONFIG_HAVE_KVM_IRQFD
+	case KVM_CAP_IRQFD_RESAMPLE:
+#endif
 		r = 1;
 		break;
 	case KVM_CAP_EXIT_HYPERCALL:
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d255964ec331..b1679d08a216 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4479,7 +4479,6 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 #endif
 #ifdef CONFIG_HAVE_KVM_IRQFD
 	case KVM_CAP_IRQFD:
-	case KVM_CAP_IRQFD_RESAMPLE:
 #endif
 	case KVM_CAP_IOEVENTFD_ANY_LENGTH:
 	case KVM_CAP_CHECK_EXTENSION_VM:
-- 
2.30.2
[PATCH v2 4/4] Add myself to MAINTAINERS for Power VFIO support
Signed-off-by: Timothy Pearson
---
 MAINTAINERS | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 8d5bc223f305..876f96e82d66 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9836,6 +9836,11 @@ F:	drivers/crypto/vmx/ghash*
 F:	drivers/crypto/vmx/ppc-xlate.pl
 F:	drivers/crypto/vmx/vmx.c
 
+IBM Power VFIO Support
+M:	Timothy Pearson
+S:	Supported
+F:	drivers/vfio/vfio_iommu_spapr_tce.c
+
 IBM ServeRAID RAID DRIVER
 S:	Orphan
 F:	drivers/scsi/ips.*
-- 
2.30.2
[PATCH v2 3/4] powerpc/iommu: Add iommu_ops to report capabilities and
allow blocking domains Up until now PPC64 managed to avoid using iommu_ops. The VFIO driver uses a SPAPR TCE sub-driver and all iommu_ops uses were kept in the Type1 VFIO driver. Recent development added 2 uses of iommu_ops to the generic VFIO which broke POWER: - a coherency capability check; - blocking IOMMU domain - iommu_group_dma_owner_claimed()/... This adds a simple iommu_ops which reports support for cache coherency and provides a basic support for blocking domains. No other domain types are implemented so the default domain is NULL. Since now iommu_ops controls the group ownership, this takes it out of VFIO. This adds an IOMMU device into a pci_controller (=PHB) and registers it in the IOMMU subsystem, iommu_ops is registered at this point. This setup is done in postcore_initcall_sync. This replaces iommu_group_add_device() with iommu_probe_device() as the former misses necessary steps in connecting PCI devices to IOMMU devices. This adds a comment about why explicit iommu_probe_device() is still needed. 
The previous discussion is here: https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20220707135552.3688927-1-...@ozlabs.ru/ https://patchwork.ozlabs.org/project/kvm-ppc/patch/20220701061751.1955857-1-...@ozlabs.ru/ Fixes: e8ae0e140c05 ("vfio: Require that devices support DMA cache coherence") Fixes: 70693f470848 ("vfio: Set DMA ownership for VFIO devices") Cc: Deming Wang Cc: Robin Murphy Cc: Jason Gunthorpe Cc: Alex Williamson Cc: Daniel Henrique Barboza Cc: Fabiano Rosas Cc: Murilo Opsfelder Araujo Cc: Nicholas Piggin Co-authored-by: Timothy Pearson Signed-off-by: Alexey Kardashevskiy Signed-off-by: Timothy Pearson --- arch/powerpc/include/asm/pci-bridge.h | 7 + arch/powerpc/kernel/iommu.c | 148 +- arch/powerpc/platforms/powernv/pci-ioda.c | 30 + arch/powerpc/platforms/pseries/iommu.c| 24 arch/powerpc/platforms/pseries/pseries.h | 4 + arch/powerpc/platforms/pseries/setup.c| 3 + drivers/vfio/vfio_iommu_spapr_tce.c | 8 -- 7 files changed, 214 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h index 71c1d26f2400..2aa3a091ef20 100644 --- a/arch/powerpc/include/asm/pci-bridge.h +++ b/arch/powerpc/include/asm/pci-bridge.h @@ -8,6 +8,7 @@ #include #include #include +#include struct device_node; @@ -44,6 +45,9 @@ struct pci_controller_ops { #endif void(*shutdown)(struct pci_controller *hose); + + struct iommu_group *(*device_group)(struct pci_controller *hose, + struct pci_dev *pdev); }; /* @@ -131,6 +135,9 @@ struct pci_controller { struct irq_domain *dev_domain; struct irq_domain *msi_domain; struct fwnode_handle*fwnode; + + /* iommu_ops support */ + struct iommu_device iommu; }; /* These are used for config access before all the PCI probing diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index f9f5a9418092..b42e202af3bc 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -35,6 +35,7 @@ #include #include #include +#include #define DBG(...) 
@@ -1156,8 +1157,14 @@ int iommu_add_device(struct iommu_table_group *table_group, struct device *dev) pr_debug("%s: Adding %s to iommu group %d\n", __func__, dev_name(dev), iommu_group_id(table_group->group)); - - return iommu_group_add_device(table_group->group, dev); + /* +* This is still not adding devices via the IOMMU bus notifier because +* of pcibios_init() from arch/powerpc/kernel/pci_64.c which calls +* pcibios_scan_phb() first (and this guy adds devices and triggers +* the notifier) and only then it calls pci_bus_add_devices() which +* configures DMA for buses which also creates PEs and IOMMU groups. +*/ + return iommu_probe_device(dev); } EXPORT_SYMBOL_GPL(iommu_add_device); @@ -1237,6 +1244,7 @@ static long spapr_tce_take_ownership(struct iommu_table_group *table_group) rc = iommu_take_ownership(tbl); if (!rc) continue; + for (j = 0; j < i; ++j) iommu_release_ownership(table_group->tables[j]); return rc; @@ -1269,4 +1277,140 @@ struct iommu_table_group_ops spapr_tce_table_group_ops = { .release_ownership = spapr_tce_release_ownership, }; +/* + * A simple iommu_ops to allow less cruft in generic VFIO code. + */ +static int spapr_tce_blocking_iommu_attach_dev(struct iommu_domain *dom, + struct device *dev) +{ + struct iommu_group *grp = iommu_group_get(dev); + struct iommu_table_group *table_group; + int ret = -EINVAL; + + if (!grp) +
Re: [PATCH] drm/amdgpu: Re-enable DCN for 64-bit powerpc
- Original Message - > From: "Linus Torvalds" > To: "Michael Ellerman" > Cc: "linuxppc-dev" , "Alex Deucher" > , "amd-gfx" > , li...@roeck-us.net, "linux-kernel" > , "Dan Horák" > , "Timothy Pearson" > Sent: Monday, July 25, 2022 2:19:57 PM > Subject: Re: [PATCH] drm/amdgpu: Re-enable DCN for 64-bit powerpc > On Mon, Jul 25, 2022 at 5:39 AM Michael Ellerman wrote: >> >> Further digging shows that the build failures only occur with compilers >> that default to 64-bit long double. > > Where the heck do we have 'long double' things anywhere in the kernel? > > I tried to grep for it, and failed miserably. I found some constants > that would qualify, but they were in the v4l colorspaces-details.rst > doc file. > > Strange. We don't, at least not that I can see. The affected code uses standard doubles. What I'm wondering is if the compiler is getting confused between standard and long doubles when they are both the same bit length...
Re: [PATCH RFC 3/5] powerpc/speculation: Add support for 'cpu_spec_mitigations=' cmdline options
Will be joining in ~ 5 mins. Getting Chromium set up here. - Original Message - > From: "Jiri Kosina" > To: "Josh Poimboeuf" > Cc: "Peter Zijlstra" , "Heiko Carstens" > , "Paul Mackerras" > , "H . Peter Anvin" , "Ingo Molnar" > , "Andrea Arcangeli" > , linux-s...@vger.kernel.org, x...@kernel.org, "Will > Deacon" , "Linus > Torvalds" , "Catalin Marinas" > , "Waiman Long" > , linux-a...@vger.kernel.org, "Jon Masters" > , "Borislav Petkov" , > "Andy Lutomirski" , "Thomas Gleixner" , > linux-arm-ker...@lists.infradead.org, > "Greg Kroah-Hartman" , > linux-ker...@vger.kernel.org, "Tyler Hicks" , > "Martin Schwidefsky" , linuxppc-dev@lists.ozlabs.org > Sent: Thursday, April 4, 2019 2:49:05 PM > Subject: Re: [PATCH RFC 3/5] powerpc/speculation: Add support for > 'cpu_spec_mitigations=' cmdline options > On Thu, 4 Apr 2019, Josh Poimboeuf wrote: > >> Configure powerpc CPU runtime speculation bug mitigations in accordance >> with the 'cpu_spec_mitigations=' cmdline options. This affects >> Meltdown, Spectre v1, Spectre v2, and Speculative Store Bypass. > [ ... snip ... ] >> -if (!no_nospec) >> +if (!no_nospec && cpu_spec_mitigations != CPU_SPEC_MITIGATIONS_OFF) > > '!no_nospec' is something that I am sure will come back to hunt me in my > bad dreams. > > But that's been there already, and fixing it is out of scope of this > patch. Other than that, as discussed previously -- I really like this new > global option. Feel free to add > > Reviewed-by: Jiri Kosina > > for the whole set. > > Thanks, > > -- > Jiri Kosina > SUSE Labs
Re: [PATCH 0/7] Add initial version of "cognitive DMA"
When should we be targeting merge?  At this point this is a substantial
improvement over currently shipping kernels for our systems, and we don't
really want to have to ship a patched / custom OS kernel if we can avoid it.

On 06/24/2018 08:09 PM, Russell Currey wrote:
> On Sat, 2018-06-23 at 18:52 -0500, Timothy Pearson wrote:
>
> There's still more to do and this shouldn't be merged yet - would
> encourage anyone with suitable hardware to test though.
>
>> POWER9 (PHB4) requires all peripherals using DMA to be either
>> restricted to 32-bit windows or capable of accessing the entire 64
>> bits of memory space. Some devices, such as most GPUs, can only
>> address up to a certain number of bits (approximately 40, in many
>> cases), while at the same time it is highly desirable to use a larger
>> DMA space than the fallback 32 bits.
>>
>> This series adds something called "cognitive DMA", which is a form of
>> dynamic TCE allocation. This allows the peripheral to DMA to host
>> addresses mapped in 1G (PHB4) or 256M (PHB3) chunks, and is
>> transparent to the peripheral and its driver stack.
>>
>> This series has been tested on a Talos II server with a Radeon WX4100
>> and a wide range of OpenGL applications. While there is still work,
>> notably involving what happens if a peripheral attempts to DMA close
>> to a TCE window boundary, this series greatly improves functionality
>> for AMD GPUs on POWER9 devices over the existing 32-bit DMA support.
>>
>> Russell Currey (4):
>>   powerpc/powernv/pci: Track largest available TCE order per PHB
>>   powerpc/powernv: DMA operations for discontiguous allocation
>>   powerpc/powernv/pci: Track DMA and TCE tables in debugfs
>>   powerpc/powernv/pci: Safety fixes for pseudobypass TCE allocation
>>
>> Timothy Pearson (3):
>>   powerpc/powernv/pci: Export pnv_pci_ioda2_tce_invalidate_pe
>>   powerpc/powernv/pci: Invalidate TCE cache after DMA map setup
>>   powerpc/powernv/pci: Don't use the lower 4G TCEs in pseudo-DMA mode
>>
>>  arch/powerpc/include/asm/dma-mapping.h    |   1 +
>>  arch/powerpc/platforms/powernv/Makefile   |   2 +-
>>  arch/powerpc/platforms/powernv/pci-dma.c  | 320 ++
>>  arch/powerpc/platforms/powernv/pci-ioda.c | 169 
>>  arch/powerpc/platforms/powernv/pci.h      |  11 +
>>  5 files changed, 452 insertions(+), 51 deletions(-)
>>  create mode 100644 arch/powerpc/platforms/powernv/pci-dma.c

-- 
Timothy Pearson
Raptor Engineering
+1 (415) 727-8645 (direct line)
+1 (512) 690-0200 (switchboard)
https://www.raptorengineering.com
[PATCH 4/7] powerpc/powernv/pci: Safety fixes for pseudobypass TCE
allocation

Signed-off-by: Russell Currey
---
 arch/powerpc/platforms/powernv/pci-dma.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-dma.c b/arch/powerpc/platforms/powernv/pci-dma.c
index 1d5409be343e..237940a2a052 100644
--- a/arch/powerpc/platforms/powernv/pci-dma.c
+++ b/arch/powerpc/platforms/powernv/pci-dma.c
@@ -29,8 +29,9 @@ static int dma_pseudo_bypass_select_tce(struct pnv_ioda_pe *pe, phys_addr_t addr
 {
 	int tce;
 	__be64 old, new;
+	unsigned long flags;
 
-	spin_lock(&pe->tce_alloc_lock);
+	spin_lock_irqsave(&pe->tce_alloc_lock, flags);
 	tce = bitmap_find_next_zero_area(pe->tce_bitmap,
 					 pe->tce_count,
 					 0,
@@ -40,9 +41,10 @@ static int dma_pseudo_bypass_select_tce(struct pnv_ioda_pe *pe, phys_addr_t addr
 	old = pe->tces[tce];
 	new = cpu_to_be64(addr | TCE_PCI_READ | TCE_PCI_WRITE);
 	pe->tces[tce] = new;
+	mb();
 	pe_info(pe, "allocating TCE %i 0x%016llx (old 0x%016llx)\n",
 		tce, new, old);
-	spin_unlock(&pe->tce_alloc_lock);
+	spin_unlock_irqrestore(&pe->tce_alloc_lock, flags);
 
 	return tce;
 }
-- 
2.17.1
[PATCH 3/7] powerpc/powernv/pci: Track DMA and TCE tables in debugfs
Add a new debugfs entry to trigger dumping out the tracking table and TCEs for a given PE, for example PE 0x4 of PHB 2: echo 0x4 > /sys/kernel/debug/powerpc/PCI0002/sketchy This will result in the table being dumped out in dmesg. Signed-off-by: Russell Currey --- arch/powerpc/platforms/powernv/pci-ioda.c | 43 +++ 1 file changed, 43 insertions(+) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 7ecc186493ca..55f0f7b885bc 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -3342,6 +3342,47 @@ static int pnv_pci_diag_data_set(void *data, u64 val) DEFINE_SIMPLE_ATTRIBUTE(pnv_pci_diag_data_fops, NULL, pnv_pci_diag_data_set, "%llu\n"); +static int pnv_pci_sketchy_set(void *data, u64 val) +{ + struct pci_controller *hose; + struct pnv_ioda_pe *pe; + struct pnv_phb *phb; + u64 entry1, entry2; + int i; + + hose = (struct pci_controller *)data; + if (!hose || !hose->private_data) + return -ENODEV; + + phb = hose->private_data; + pe = &phb->ioda.pe_array[val]; + + if (!pe) + return -EINVAL; + + if (!pe->tces || !pe->tce_tracker) + return -EIO; + + for (i = 0; i < pe->tce_count; i++) { + if (i > 16 && pe->tces[i] == 0) + break; + pr_info("%3d: %016llx\n", i, be64_to_cpu(pe->tces[i])); + } + + for (i = 0; i < pe->tce_count; i++) { + entry1 = pe->tce_tracker[i * 2]; + entry2 = pe->tce_tracker[i * 2 + 1]; + if (!entry1) + break; + pr_info("%3d: %016llx %016llx\n", i, entry1, entry2); + } + return 0; +} + +DEFINE_SIMPLE_ATTRIBUTE(pnv_pci_sketchy_fops, NULL, + pnv_pci_sketchy_set, "%llu\n"); + + #endif /* CONFIG_DEBUG_FS */ static void pnv_pci_ioda_create_dbgfs(void) @@ -3367,6 +3408,8 @@ static void pnv_pci_ioda_create_dbgfs(void) debugfs_create_file("dump_diag_regs", 0200, phb->dbgfs, hose, &pnv_pci_diag_data_fops); + debugfs_create_file("sketchy", 0200, phb->dbgfs, hose, + &pnv_pci_sketchy_fops); } #endif /* CONFIG_DEBUG_FS */ } -- 2.17.1
[PATCH 7/7] powerpc/powernv/pci: Don't use the lower 4G TCEs in
pseudo-DMA mode

Four TCEs are reserved for legacy 32-bit DMA mappings in pseudo DMA mode.
Mark these with an invalid address to avoid their use by the TCE cache
mapper.

Signed-off-by: Timothy Pearson
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index a6097dd323f8..e8a1333f6b3e 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1783,7 +1783,7 @@ static bool pnv_pci_ioda_pe_single_vendor(struct pnv_ioda_pe *pe)
 
 static int pnv_pci_pseudo_bypass_setup(struct pnv_ioda_pe *pe)
 {
-	u64 tce_count, table_size, window_size;
+	u64 i, tce_count, table_size, window_size;
 	struct pnv_phb *p = pe->phb;
 	struct page *table_pages;
 	__be64 *tces;
@@ -1835,6 +1835,12 @@ static int pnv_pci_pseudo_bypass_setup(struct pnv_ioda_pe *pe)
 	/* mark the first 4GB as reserved so this can still be used for 32bit */
 	bitmap_set(pe->tce_bitmap, 0, 1ULL << (32 - p->ioda.max_tce_order));
 
+	/* make sure reserved first 4GB TCEs are not used by the mapper
+	 * set each address to -1, which will never match an incoming request
+	 */
+	for (i = 0; i < 4; i++)
+		pe->tce_tracker[i * 2] = -1;
+
 	pe_info(pe, "pseudo-bypass sizes: tracker %d bitmap %d TCEs %lld\n",
 		tracker_entries, bitmap_size, tce_count);
 
-- 
2.17.1
[PATCH 5/7] powerpc/powernv/pci: Export
pnv_pci_ioda2_tce_invalidate_pe

Pseudo DMA support requires a method to invalidate the TCE cache.
Export pnv_pci_ioda2_tce_invalidate_pe for use by the pseudo DMA mapper.

Signed-off-by: Timothy Pearson
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 2 +-
 arch/powerpc/platforms/powernv/pci.h      | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 55f0f7b885bc..a6097dd323f8 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2102,7 +2102,7 @@ static void pnv_pci_phb3_tce_invalidate(struct pnv_ioda_pe *pe, bool rm,
 	}
 }
 
-static inline void pnv_pci_ioda2_tce_invalidate_pe(struct pnv_ioda_pe *pe)
+void pnv_pci_ioda2_tce_invalidate_pe(struct pnv_ioda_pe *pe)
 {
 	struct pnv_phb *phb = pe->phb;
 
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 83492aba90f1..8d3849e76be3 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -264,6 +264,7 @@ extern void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
 /* Nvlink functions */
 extern void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass);
 extern void pnv_pci_ioda2_tce_invalidate_entire(struct pnv_phb *phb, bool rm);
+extern void pnv_pci_ioda2_tce_invalidate_pe(struct pnv_ioda_pe *pe);
 extern struct pnv_ioda_pe *pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe);
 extern long pnv_npu_set_window(struct pnv_ioda_pe *npe, int num,
 		struct iommu_table *tbl);
-- 
2.17.1
[PATCH 6/7] powerpc/powernv/pci: Invalidate TCE cache after DMA map
setup

Per the IODA2 specification, TCEs must be invalidated after their settings
have been changed. Invalidate the cache after the address is changed
during TCE allocation when using pseudo DMA.

Signed-off-by: Timothy Pearson
---
 arch/powerpc/platforms/powernv/pci-dma.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-dma.c b/arch/powerpc/platforms/powernv/pci-dma.c
index 237940a2a052..060dbc168401 100644
--- a/arch/powerpc/platforms/powernv/pci-dma.c
+++ b/arch/powerpc/platforms/powernv/pci-dma.c
@@ -42,8 +42,7 @@ static int dma_pseudo_bypass_select_tce(struct pnv_ioda_pe *pe, phys_addr_t addr
 	new = cpu_to_be64(addr | TCE_PCI_READ | TCE_PCI_WRITE);
 	pe->tces[tce] = new;
 	mb();
-	pe_info(pe, "allocating TCE %i 0x%016llx (old 0x%016llx)\n",
-		tce, new, old);
+	pnv_pci_ioda2_tce_invalidate_pe(pe);
 	spin_unlock_irqrestore(&pe->tce_alloc_lock, flags);
 
 	return tce;
-- 
2.17.1
[PATCH 1/7] powerpc/powernv/pci: Track largest available TCE order
per PHB

Knowing the largest possible TCE size of a PHB is useful, so get it out
of the device tree. This relies on the property being added in OPAL.

It is assumed that any PHB4 or later machine would be running firmware
that implemented this property, and otherwise assumed to be PHB3, which
has a maximum TCE order of 28 bits or 256MB TCEs.

This is used later in the series.

Signed-off-by: Russell Currey
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 16 
 arch/powerpc/platforms/powernv/pci.h      |  3 +++
 2 files changed, 19 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 5bd0eb6681bc..bcb3bfce072a 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -3873,11 +3873,13 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 	struct resource r;
 	const __be64 *prop64;
 	const __be32 *prop32;
+	struct property *prop;
 	int len;
 	unsigned int segno;
 	u64 phb_id;
 	void *aux;
 	long rc;
+	u32 val;
 
 	if (!of_device_is_available(np))
 		return;
@@ -4016,6 +4018,20 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 	}
 	phb->ioda.pe_array = aux + pemap_off;
 
+	phb->ioda.max_tce_order = 0;
+	// Get TCE order from the DT.  If it's not present, assume P8
+	if (!of_get_property(np, "ibm,supported-tce-sizes", NULL)) {
+		phb->ioda.max_tce_order = 28; // assume P8 256mb TCEs
+	} else {
+		of_property_for_each_u32(np, "ibm,supported-tce-sizes", prop,
+					 prop32, val) {
+			if (val > phb->ioda.max_tce_order)
+				phb->ioda.max_tce_order = val;
+		}
+		pr_debug("PHB%llx Found max TCE order of %d bits\n",
+			 phb->opal_id, phb->ioda.max_tce_order);
+	}
+
 	/*
 	 * Choose PE number for root bus, which shouldn't have
 	 * M64 resources consumed by its child devices. To pick
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index eada4b6068cb..c9952def5e93 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -173,6 +173,9 @@ struct pnv_phb {
 		struct list_head	pe_list;
 		struct mutex		pe_list_mutex;
 
+		/* Largest supported TCE order bits */
+		uint8_t			max_tce_order;
+
 		/* Reverse map of PEs, indexed by {bus, devfn} */
 		unsigned int		pe_rmap[0x10000];
 	} ioda;
-- 
2.17.1
[PATCH 2/7] powerpc/powernv: DMA operations for discontiguous
allocation

Cognitive DMA is a new set of DMA operations that solve some issues for
devices that want to address more than 32 bits but can't address the 59
bits required to enable direct DMA.

The previous implementation for POWER8/PHB3 worked around this by
configuring a bypass from the default 32-bit address space into 64-bit
address space. This approach does not work for POWER9/PHB4 because
regions of memory are discontiguous and many devices will be unable to
address memory beyond the first node.

Instead, implement a new set of DMA operations that allocate TCEs as DMA
mappings are requested so that all memory is addressable even when a
one-to-one mapping between real addresses and DMA addresses isn't
possible. These TCEs are the maximum size available on the platform,
which is 256M on PHB3 and 1G on PHB4.

Devices can now map any region of memory up to the maximum amount they
can address according to the DMA mask set, in chunks of the largest
available TCE size.

This implementation replaces the need for the existing PHB3 solution and
should be compatible with future PHB versions.
Signed-off-by: Russell Currey --- arch/powerpc/include/asm/dma-mapping.h| 1 + arch/powerpc/platforms/powernv/Makefile | 2 +- arch/powerpc/platforms/powernv/pci-dma.c | 319 ++ arch/powerpc/platforms/powernv/pci-ioda.c | 102 +++ arch/powerpc/platforms/powernv/pci.h | 7 + 5 files changed, 381 insertions(+), 50 deletions(-) create mode 100644 arch/powerpc/platforms/powernv/pci-dma.c diff --git a/arch/powerpc/include/asm/dma-mapping.h b/arch/powerpc/include/asm/dma-mapping.h index 8fa394520af6..354f435160f3 100644 --- a/arch/powerpc/include/asm/dma-mapping.h +++ b/arch/powerpc/include/asm/dma-mapping.h @@ -74,6 +74,7 @@ static inline unsigned long device_to_mask(struct device *dev) extern struct dma_map_ops dma_iommu_ops; #endif extern const struct dma_map_ops dma_nommu_ops; +extern const struct dma_map_ops dma_pseudo_bypass_ops; static inline const struct dma_map_ops *get_arch_dma_ops(struct bus_type *bus) { diff --git a/arch/powerpc/platforms/powernv/Makefile b/arch/powerpc/platforms/powernv/Makefile index 703a350a7f4e..2467bdab3c13 100644 --- a/arch/powerpc/platforms/powernv/Makefile +++ b/arch/powerpc/platforms/powernv/Makefile @@ -6,7 +6,7 @@ obj-y += opal-msglog.o opal-hmi.o opal-power.o opal-irqchip.o obj-y += opal-kmsg.o opal-powercap.o opal-psr.o opal-sensor-groups.o obj-$(CONFIG_SMP) += smp.o subcore.o subcore-asm.o -obj-$(CONFIG_PCI) += pci.o pci-ioda.o npu-dma.o +obj-$(CONFIG_PCI) += pci.o pci-ioda.o npu-dma.o pci-dma.o obj-$(CONFIG_CXL_BASE) += pci-cxl.o obj-$(CONFIG_EEH) += eeh-powernv.o obj-$(CONFIG_PPC_SCOM) += opal-xscom.o diff --git a/arch/powerpc/platforms/powernv/pci-dma.c b/arch/powerpc/platforms/powernv/pci-dma.c new file mode 100644 index ..1d5409be343e --- /dev/null +++ b/arch/powerpc/platforms/powernv/pci-dma.c @@ -0,0 +1,319 @@ +/* + * DMA operations supporting pseudo-bypass for PHB3+ + * + * Author: Russell Currey + * + * Copyright 2018 IBM Corporation. 
+ * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License as published by the + * Free Software Foundation; either version 2 of the License, or (at your + * option) any later version. + */ + +#include +#include +#include +#include +#include + +#include +#include +#include +#include + +#include "pci.h" + +/* select and allocate a TCE using the bitmap */ +static int dma_pseudo_bypass_select_tce(struct pnv_ioda_pe *pe, phys_addr_t addr) +{ + int tce; + __be64 old, new; + + spin_lock(&pe->tce_alloc_lock); + tce = bitmap_find_next_zero_area(pe->tce_bitmap, +pe->tce_count, +0, +1, +0); + bitmap_set(pe->tce_bitmap, tce, 1); + old = pe->tces[tce]; + new = cpu_to_be64(addr | TCE_PCI_READ | TCE_PCI_WRITE); + pe->tces[tce] = new; + pe_info(pe, "allocating TCE %i 0x%016llx (old 0x%016llx)\n", + tce, new, old); + spin_unlock(&pe->tce_alloc_lock); + + return tce; +} + +/* + * The tracking table for assigning TCEs has two entries per TCE. + * - @entry1 contains the physical address and the smallest bit indicates + * if it's currently valid. + * - @entry2 contains the DMA address returned in the upper 34 bits, and a + * refcount in the lower 30 bits. + */ +static dma_addr_t dma_pseudo_bypass_get_address(struct device *dev, + phys_addr_t addr) +{ + struct pci_dev *pdev = container_of(dev, struct pci_dev, dev); + struct pci_controller *hose = pci_bus_to_host(pdev->bus); + struct pnv_phb *phb = hose->private_data; + struct pnv_ioda_pe *pe; +
[PATCH 0/7] Add initial version of "cognitive DMA"
POWER9 (PHB4) requires all peripherals using DMA to be either restricted
to 32-bit windows or capable of accessing the entire 64 bits of memory
space. Some devices, such as most GPUs, can only address up to a certain
number of bits (approximately 40, in many cases), while at the same time
it is highly desirable to use a larger DMA space than the fallback 32
bits.

This series adds something called "cognitive DMA", which is a form of
dynamic TCE allocation. This allows the peripheral to DMA to host
addresses mapped in 1G (PHB4) or 256M (PHB3) chunks, and is transparent
to the peripheral and its driver stack.

This series has been tested on a Talos II server with a Radeon WX4100
and a wide range of OpenGL applications. While there is still work,
notably involving what happens if a peripheral attempts to DMA close to
a TCE window boundary, this series greatly improves functionality for
AMD GPUs on POWER9 devices over the existing 32-bit DMA support.

Russell Currey (4):
  powerpc/powernv/pci: Track largest available TCE order per PHB
  powerpc/powernv: DMA operations for discontiguous allocation
  powerpc/powernv/pci: Track DMA and TCE tables in debugfs
  powerpc/powernv/pci: Safety fixes for pseudobypass TCE allocation

Timothy Pearson (3):
  powerpc/powernv/pci: Export pnv_pci_ioda2_tce_invalidate_pe
  powerpc/powernv/pci: Invalidate TCE cache after DMA map setup
  powerpc/powernv/pci: Don't use the lower 4G TCEs in pseudo-DMA mode

 arch/powerpc/include/asm/dma-mapping.h    |   1 +
 arch/powerpc/platforms/powernv/Makefile   |   2 +-
 arch/powerpc/platforms/powernv/pci-dma.c  | 320 ++
 arch/powerpc/platforms/powernv/pci-ioda.c | 169 
 arch/powerpc/platforms/powernv/pci.h      |  11 +
 5 files changed, 452 insertions(+), 51 deletions(-)
 create mode 100644 arch/powerpc/platforms/powernv/pci-dma.c

-- 
2.17.1