Re: [RFC PATCH 1/3] powerpc/pseries/iommu: Bring back userspace view for single level TCE tables
- Original Message - > From: "Jason Gunthorpe" > To: "Shivaprasad G Bhat" > Cc: "Timothy Pearson" , "Alex Williamson" > , "linuxppc-dev" > , "Michael Ellerman" , > "npiggin" , "christophe > leroy" , "aneesh kumar" > , "naveen n rao" > , "gbatra" , > brk...@linux.vnet.ibm.com, "Alexey Kardashevskiy" > , r...@kernel.org, "linux-kernel" > , "kvm" , "aik" > , msucha...@suse.de, "jroedel" , "vaibhav" > , sva...@linux.ibm.com > Sent: Tuesday, March 19, 2024 9:32:02 AM > Subject: Re: [RFC PATCH 1/3] powerpc/pseries/iommu: Bring back userspace view > for single level TCE tables > On Tue, Mar 12, 2024 at 01:14:20PM -0500, Shivaprasad G Bhat wrote: >> The commit 090bad39b237a ("powerpc/powernv: Add indirect levels to >> it_userspace") which implemented the tce indirect levels >> support for PowerNV ended up removing the single level support >> which existed by default(generic tce_iommu_userspace_view_alloc/free() >> calls). On pSeries the TCEs are single level, and the allocation >> of userspace view is lost with the removal of generic code. > > :( :( > > If this has been broken since 2018 and nobody cared till now can we > please go in a direction of moving this code to the new iommu APIs > instead of doubling down on more of this old stuff that apparently > almost nobody cares about ?? > > Jason Just FYI Raptor is working on porting things over to the new APIs. RFC patches should be posted in the next week or two.
Re: [PATCH v3] powerpc: Add gpr1 and fpu save/restore functions
- Original Message -
> From: "christophe leroy"
> To: "Timothy Pearson"
> Cc: "linuxppc-dev"
> Sent: Saturday, February 17, 2024 10:39:06 AM
> Subject: Re: [PATCH v3] powerpc: Add gpr1 and fpu save/restore functions
>
> On 16/02/2024 at 18:24, Timothy Pearson wrote:
>> When building the kernel in size optimized mode with the amdgpu module
>> enabled, gcc will begin referencing external gpr1 and fpu save/restore
>> functions. This will then cause a linker failure as we do not link against
>> libgcc which normally contains those builtin functions.
>
> Right, but modules are linked with --save-restore-funcs when using gcc
> so crtsavres.o is not used and only your change to scripts/mod/modpost.c
> seems to be required to be able to build the amdgpu module with GCC.
>
> Maybe that's different with clang, but maybe worth a test and then a
> second patch?

It looks to be gated on ld, specifically if we're CONFIG_PPC64 && CONFIG_LD_IS_BFD. I can update the patch to match that gate if desired.

> Nevertheless, see comments below, you can do even shorter and more
> readable using GAS macros.

That's true, but my goal was to get this working, not refactor the entire file. :) Any chance we can merge and do a fine polish later, especially if there's a distinct possibility of the entire file going away in the near future?
Re: [PATCH v2] powerpc: Add gpr1 and fpu save/restore functions
- Original Message -
> From: "christophe leroy"
> To: "Timothy Pearson", "linuxppc-dev"
> Sent: Tuesday, February 13, 2024 9:13:41 AM
> Subject: Re: [PATCH v2] powerpc: Add gpr1 and fpu save/restore functions
>
> On 12/02/2024 at 18:14, Timothy Pearson wrote:
>
> All this looks very similar to savegpr0_xx and restgpr0_xx with the
> difference being that r12 is used instead of r1.
>
> Could we avoid duplication by using macros?

Yes. v3 sent.
[PATCH v3] powerpc: Add gpr1 and fpu save/restore functions
When building the kernel in size optimized mode with the amdgpu module enabled, gcc will begin referencing external gpr1 and fpu save/restore functions. This will then cause a linker failure as we do not link against libgcc which normally contains those builtin functions.

Implement gpr1 and fpu save/restore functions per the PowerPC 64-bit ELFv2 ABI documentation. Tested on a Talos II with a WX7100 installed and running in DisplayCore mode.

Reported-by: kernel test robot
Tested-by: Timothy Pearson
Signed-off-by: Timothy Pearson
---
 arch/powerpc/kernel/prom_init_check.sh |   4 +-
 arch/powerpc/lib/crtsavres.S           | 363 +
 scripts/mod/modpost.c                  |   4 +
 3 files changed, 253 insertions(+), 118 deletions(-)

diff --git a/arch/powerpc/kernel/prom_init_check.sh b/arch/powerpc/kernel/prom_init_check.sh
index 69623b9045d5..76c5651e29d3 100644
--- a/arch/powerpc/kernel/prom_init_check.sh
+++ b/arch/powerpc/kernel/prom_init_check.sh
@@ -72,10 +72,10 @@ do
 	# ignore register save/restore funcitons
 	case $UNDEF in
-	_restgpr_*|_restgpr0_*|_rest32gpr_*)
+	_restgpr_*|_restgpr0_*|_restgpr1_*|_rest32gpr_*)
 		OK=1
 		;;
-	_savegpr_*|_savegpr0_*|_save32gpr_*)
+	_savegpr_*|_savegpr0_*|_savegpr1_*|_save32gpr_*)
 		OK=1
 		;;
 	esac
diff --git a/arch/powerpc/lib/crtsavres.S b/arch/powerpc/lib/crtsavres.S
index 7e5e1c28e56a..f97270d36720 100644
--- a/arch/powerpc/lib/crtsavres.S
+++ b/arch/powerpc/lib/crtsavres.S
@@ -3,6 +3,7 @@
  *
  * Copyright (C) 1995, 1996, 1998, 2000, 2001 Free Software Foundation, Inc.
  * Copyright 2008 Freescale Semiconductor, Inc.
+ * Copyright 2024 Raptor Engineering, LLC
  * Written By Michael Meissner
  *
  * Based on gcc/config/rs6000/crtsavres.asm from gcc
@@ -314,126 +315,134 @@ _GLOBAL(_restvr_31)
 
 #else /* CONFIG_PPC64 */
 
-.globl _savegpr0_14
-_savegpr0_14:
-	std	r14,-144(r1)
-.globl _savegpr0_15
-_savegpr0_15:
-	std	r15,-136(r1)
-.globl _savegpr0_16
-_savegpr0_16:
-	std	r16,-128(r1)
-.globl _savegpr0_17
-_savegpr0_17:
-	std	r17,-120(r1)
-.globl _savegpr0_18
-_savegpr0_18:
-	std	r18,-112(r1)
-.globl _savegpr0_19
-_savegpr0_19:
-	std	r19,-104(r1)
-.globl _savegpr0_20
-_savegpr0_20:
-	std	r20,-96(r1)
-.globl _savegpr0_21
-_savegpr0_21:
-	std	r21,-88(r1)
-.globl _savegpr0_22
-_savegpr0_22:
-	std	r22,-80(r1)
-.globl _savegpr0_23
-_savegpr0_23:
-	std	r23,-72(r1)
-.globl _savegpr0_24
-_savegpr0_24:
-	std	r24,-64(r1)
-.globl _savegpr0_25
-_savegpr0_25:
-	std	r25,-56(r1)
-.globl _savegpr0_26
-_savegpr0_26:
-	std	r26,-48(r1)
-.globl _savegpr0_27
-_savegpr0_27:
-	std	r27,-40(r1)
-.globl _savegpr0_28
-_savegpr0_28:
-	std	r28,-32(r1)
-.globl _savegpr0_29
-_savegpr0_29:
-	std	r29,-24(r1)
-.globl _savegpr0_30
-_savegpr0_30:
-	std	r30,-16(r1)
-.globl _savegpr0_31
-_savegpr0_31:
-	std	r31,-8(r1)
-	std	r0,16(r1)
+#define __PPC64_SAVEGPR(n,base)\
+.globl _savegpr##n##_14\
+_savegpr##n##_14: \
+	std	r14,-144(base) \
+.globl _savegpr##n##_15\
+_savegpr##n##_15: \
+	std	r15,-136(base) \
+.globl _savegpr##n##_16\
+_savegpr##n##_16: \
+	std	r16,-128(base) \
+.globl _savegpr##n##_17\
+_savegpr##n##_17: \
+	std	r17,-120(base) \
+.globl _savegpr##n##_18\
+_savegpr##n##_18: \
+	std	r18,-112(base) \
+.globl _savegpr##n##_19\
+_savegpr##n##_19: \
+	std	r19,-104(base) \
+.globl _savegpr##n##_20\
+_savegpr##n##_20: \
+	std	r20,-96(base) \
+.globl _savegpr##n##_21\
+_savegpr##n##_21: \
+	std	r21,-88(base) \
+.globl _savegpr##n##_22\
+_savegpr##n##_22: \
+	std	r22,-80(base) \
+.globl _savegpr##n##_23\
+_savegpr##n##_23: \
+	std	r23,-72(base) \
+.globl _savegpr##n##_24\
+_savegpr##n##_24: \
+	std	r24,-64(base) \
+.globl _savegpr##n##_25\
+_savegpr##n##_25: \
+	std	r25,-56(base) \
+.globl _savegpr##n##_26\
+_savegpr##n##_26: \
+	std	r26,-48(base) \
+.globl _savegpr##n##_27\
+_savegpr##n##_27: \
+	std	r27,-40(base) \
+.globl _savegpr##n##_28\
+_savegpr##n##_28: \
+	std	r28,-32(base) \
+.globl _savegpr##n##_29\
+_savegpr##n##_29: \
+	std	r29,-24(base) \
+.globl _savegpr##n##_30\
+_savegpr##n##_30: \
+	std	r30,-16(base) \
+.globl _savegpr##n##_31\
+_savegpr##n##_31: \
+	std	r31,-8(base
Re: [PATCH] powerpc: Add gpr1 and fpu save/restore functions
- Original Message - > From: "Michael Ellerman" > To: "Timothy Pearson" , "Segher Boessenkool" > > Cc: "linuxppc-dev" > Sent: Monday, February 12, 2024 11:23:30 PM > Subject: Re: [PATCH] powerpc: Add gpr1 and fpu save/restore functions > Timothy Pearson writes: >> ----- Original Message - >>> From: "Segher Boessenkool" >>> To: "Timothy Pearson" >>> Cc: "linuxppc-dev" >>> Sent: Monday, February 12, 2024 12:23:22 PM >>> Subject: Re: [PATCH] powerpc: Add gpr1 and fpu save/restore functions >> >>> On Mon, Feb 12, 2024 at 12:07:03PM -0600, Timothy Pearson wrote: >>>> > I have done it for *all* architectures some ten years ago. Never found >>>> > any problem. >>>> >>>> That makes sense, what I mean by invasive is that we'd need buy-in from the >>>> other >>>> maintainers across all of the affected architectures. Is that likely to >>>> occur? >>> >>> I don't know. Here is my PowerPC-specific patch, it's a bit older, it >>> might not apply cleanly anymore, the changes needed should be obvious >>> though: >>> >>> >>> === 8< === >>> commit f16dfa5257eb14549ce22243fb2b465615085134 >>> Author: Segher Boessenkool >>> Date: Sat May 3 03:48:06 2008 +0200 >>> >>>powerpc: Link vmlinux against libgcc.a >>> >>> diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile >>> index b7212b619c52..0a2fac6ffc1c 100644 >>> --- a/arch/powerpc/Makefile >>> +++ b/arch/powerpc/Makefile >>> @@ -158,6 +158,9 @@ core-y += >>> arch/powerpc/kernel/ >>> core-$(CONFIG_XMON)+= arch/powerpc/xmon/ >>> core-$(CONFIG_KVM) += arch/powerpc/kvm/ >>> >>> +LIBGCC := $(shell $(CC) $(KBUILD_CFLAGS) -print-libgcc-file-name) >>> +libs-y += $(LIBGCC) >>> + >>> drivers-$(CONFIG_OPROFILE) += arch/powerpc/oprofile/ >>> >>> # Default to zImage, override when needed >>> === 8< === >> >> OK. PowerPC maintainers, how would you prefer to handle this? > > I'll take the patch to add the functions for now. We can look into > linking against libgcc as a future cleanup. Sounds good. 
>>>> > There are better options than -Os, fwiw. Some --param's give smaller >>>> > *and* faster kernels. What exactly is best is heavily arch-dependent >>>> > though (as well as dependent on the application code, the kernel code in >>>> > this case) :-( >>>> >>>> I've been through this a few times, and -Os is the only option that makes >>>> things (just barely) fit unfortunately. >>> >>> -O2 with appropriate inlining tuning beats -Os every day of the week, >>> in my experience. >> >> On 6.6 it's 24MiB vs 40MiB, O2 vs. Os. :( > > What compiler/config etc. are you using for that? It's the kernel config that buildroot generates for skiroot -- I think a lot of the size difference is in some of the modules that we enable such as amdgpu, but haven't dug too deeply. Once this firmware release is in beta (and therefore published publicly) I'll send over a link to the configs. Thanks!
Re: [PATCH] powerpc: Add gpr1 and fpu save/restore functions
- Original Message - > From: "Segher Boessenkool" > To: "Timothy Pearson" > Cc: "linuxppc-dev" > Sent: Monday, February 12, 2024 12:23:22 PM > Subject: Re: [PATCH] powerpc: Add gpr1 and fpu save/restore functions > On Mon, Feb 12, 2024 at 12:07:03PM -0600, Timothy Pearson wrote: >> > I have done it for *all* architectures some ten years ago. Never found >> > any problem. >> >> That makes sense, what I mean by invasive is that we'd need buy-in from the >> other >> maintainers across all of the affected architectures. Is that likely to >> occur? > > I don't know. Here is my PowerPC-specific patch, it's a bit older, it > might not apply cleanly anymore, the changes needed should be obvious > though: > > > === 8< === > commit f16dfa5257eb14549ce22243fb2b465615085134 > Author: Segher Boessenkool > Date: Sat May 3 03:48:06 2008 +0200 > >powerpc: Link vmlinux against libgcc.a > > diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile > index b7212b619c52..0a2fac6ffc1c 100644 > --- a/arch/powerpc/Makefile > +++ b/arch/powerpc/Makefile > @@ -158,6 +158,9 @@ core-y += > arch/powerpc/kernel/ > core-$(CONFIG_XMON)+= arch/powerpc/xmon/ > core-$(CONFIG_KVM) += arch/powerpc/kvm/ > > +LIBGCC := $(shell $(CC) $(KBUILD_CFLAGS) -print-libgcc-file-name) > +libs-y += $(LIBGCC) > + > drivers-$(CONFIG_OPROFILE) += arch/powerpc/oprofile/ > > # Default to zImage, override when needed > === 8< === OK. PowerPC maintainers, how would you prefer to handle this? >> > There are better options than -Os, fwiw. Some --param's give smaller >> > *and* faster kernels. What exactly is best is heavily arch-dependent >> > though (as well as dependent on the application code, the kernel code in >> > this case) :-( >> >> I've been through this a few times, and -Os is the only option that makes >> things (just barely) fit unfortunately. > > -O2 with appropriate inlining tuning beats -Os every day of the week, > in my experience. On 6.6 it's 24MiB vs 40MiB, O2 vs. Os. :(
Re: [PATCH] powerpc: Add gpr1 and fpu save/restore functions
- Original Message - > From: "Segher Boessenkool" > To: "Timothy Pearson" > Cc: "linuxppc-dev" > Sent: Monday, February 12, 2024 11:59:06 AM > Subject: Re: [PATCH] powerpc: Add gpr1 and fpu save/restore functions > On Mon, Feb 12, 2024 at 11:46:19AM -0600, Timothy Pearson wrote: >> Interesting, that make sense. >> >> How should we proceed from the current situation? Bringing in libgcc seems >> like a fairly invasive change, > > I have done it for *all* architectures some ten years ago. Never found > any problem. That makes sense, what I mean by invasive is that we'd need buy-in from the other maintainers across all of the affected architectures. Is that likely to occur? >> should we merge this to fix the current bug >> (cannot build ppc64 kernel in size-optimized mode) and start discussion on >> bringing in libgcc as the long-term fix across multiple architectures? >> >> My goal here is to not have to carry a downstream patch in perpetuity for >> our embedded Linux firmware, which needs to be compiled in size-optimized >> mode due to hardware Flash limitations. > > There are better options than -Os, fwiw. Some --param's give smaller > *and* faster kernels. What exactly is best is heavily arch-dependent > though (as well as dependent on the application code, the kernel code in > this case) :-( I've been through this a few times, and -Os is the only option that makes things (just barely) fit unfortunately.
Re: [PATCH] powerpc: Add gpr1 and fpu save/restore functions
- Original Message -
> From: "Segher Boessenkool"
> To: "Timothy Pearson"
> Cc: "linuxppc-dev"
> Sent: Monday, February 12, 2024 11:30:43 AM
> Subject: Re: [PATCH] powerpc: Add gpr1 and fpu save/restore functions
>
> Long long time ago, linux-0.11 or something, it was discovered that some
> programming mistakes resulted in double-length divisions (64x64->64 on
> 32-bit systems, say). Most architectures have no hardware support for
> that, x86 is one of those; so you need very expensive support routines
> to do that (_udivdi3 or _divdi3 in that case, ...ti3 on 64-bit archs).
>
> So it was decided to not link to libgcc to avoid this. But that means
> that all the extremely many other support routines, more for some other
> archs, are also not there. While it would have been much easier to just
> link to something that provides the _{u,}divdi3 symbol and then causes a
> forced linking error from that!
>
> Segher

Interesting, that makes sense.

How should we proceed from the current situation? Bringing in libgcc seems like a fairly invasive change; should we merge this to fix the current bug (cannot build the ppc64 kernel in size-optimized mode) and start a discussion on bringing in libgcc as the long-term fix across multiple architectures?

My goal here is to not have to carry a downstream patch in perpetuity for our embedded Linux firmware, which needs to be compiled in size-optimized mode due to hardware Flash limitations.

Thanks!
[PATCH v2] powerpc: Add gpr1 and fpu save/restore functions
When building the kernel in size optimized mode with the amdgpu module enabled, gcc will begin referencing external gpr1 and fpu save/restore functions. This will then cause a linker failure as we do not link against libgcc which normally contains those builtin functions.

Implement gpr1 and fpu save/restore functions per the PowerPC 64-bit ELFv2 ABI documentation. Tested on a Talos II with a WX7100 installed and running in DisplayCore mode.

Reported-by: kernel test robot
Tested-by: Timothy Pearson
Signed-off-by: Timothy Pearson
---
 arch/powerpc/kernel/prom_init_check.sh |   4 +-
 arch/powerpc/lib/crtsavres.S           | 244 +
 scripts/mod/modpost.c                  |   4 +
 3 files changed, 250 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/prom_init_check.sh b/arch/powerpc/kernel/prom_init_check.sh
index 69623b9045d5..76c5651e29d3 100644
--- a/arch/powerpc/kernel/prom_init_check.sh
+++ b/arch/powerpc/kernel/prom_init_check.sh
@@ -72,10 +72,10 @@ do
 	# ignore register save/restore funcitons
 	case $UNDEF in
-	_restgpr_*|_restgpr0_*|_rest32gpr_*)
+	_restgpr_*|_restgpr0_*|_restgpr1_*|_rest32gpr_*)
 		OK=1
 		;;
-	_savegpr_*|_savegpr0_*|_save32gpr_*)
+	_savegpr_*|_savegpr0_*|_savegpr1_*|_save32gpr_*)
 		OK=1
 		;;
 	esac
diff --git a/arch/powerpc/lib/crtsavres.S b/arch/powerpc/lib/crtsavres.S
index 7e5e1c28e56a..6cd870aacd7f 100644
--- a/arch/powerpc/lib/crtsavres.S
+++ b/arch/powerpc/lib/crtsavres.S
@@ -3,6 +3,7 @@
  *
  * Copyright (C) 1995, 1996, 1998, 2000, 2001 Free Software Foundation, Inc.
  * Copyright 2008 Freescale Semiconductor, Inc.
+ * Copyright 2024 Raptor Engineering, LLC
  * Written By Michael Meissner
  *
  * Based on gcc/config/rs6000/crtsavres.asm from gcc
@@ -435,6 +436,127 @@ _restgpr0_31:
 	mtlr	r0
 	blr
 
+.globl _savegpr1_14
+_savegpr1_14:
+	std	r14,-144(r12)
+.globl _savegpr1_15
+_savegpr1_15:
+	std	r15,-136(r12)
+.globl _savegpr1_16
+_savegpr1_16:
+	std	r16,-128(r12)
+.globl _savegpr1_17
+_savegpr1_17:
+	std	r17,-120(r12)
+.globl _savegpr1_18
+_savegpr1_18:
+	std	r18,-112(r12)
+.globl _savegpr1_19
+_savegpr1_19:
+	std	r19,-104(r12)
+.globl _savegpr1_20
+_savegpr1_20:
+	std	r20,-96(r12)
+.globl _savegpr1_21
+_savegpr1_21:
+	std	r21,-88(r12)
+.globl _savegpr1_22
+_savegpr1_22:
+	std	r22,-80(r12)
+.globl _savegpr1_23
+_savegpr1_23:
+	std	r23,-72(r12)
+.globl _savegpr1_24
+_savegpr1_24:
+	std	r24,-64(r12)
+.globl _savegpr1_25
+_savegpr1_25:
+	std	r25,-56(r12)
+.globl _savegpr1_26
+_savegpr1_26:
+	std	r26,-48(r12)
+.globl _savegpr1_27
+_savegpr1_27:
+	std	r27,-40(r12)
+.globl _savegpr1_28
+_savegpr1_28:
+	std	r28,-32(r12)
+.globl _savegpr1_29
+_savegpr1_29:
+	std	r29,-24(r12)
+.globl _savegpr1_30
+_savegpr1_30:
+	std	r30,-16(r12)
+.globl _savegpr1_31
+_savegpr1_31:
+	std	r31,-8(r12)
+	std	r0,16(r12)
+	blr
+
+.globl _restgpr1_14
+_restgpr1_14:
+	ld	r14,-144(r12)
+.globl _restgpr1_15
+_restgpr1_15:
+	ld	r15,-136(r12)
+.globl _restgpr1_16
+_restgpr1_16:
+	ld	r16,-128(r12)
+.globl _restgpr1_17
+_restgpr1_17:
+	ld	r17,-120(r12)
+.globl _restgpr1_18
+_restgpr1_18:
+	ld	r18,-112(r12)
+.globl _restgpr1_19
+_restgpr1_19:
+	ld	r19,-104(r12)
+.globl _restgpr1_20
+_restgpr1_20:
+	ld	r20,-96(r12)
+.globl _restgpr1_21
+_restgpr1_21:
+	ld	r21,-88(r12)
+.globl _restgpr1_22
+_restgpr1_22:
+	ld	r22,-80(r12)
+.globl _restgpr1_23
+_restgpr1_23:
+	ld	r23,-72(r12)
+.globl _restgpr1_24
+_restgpr1_24:
+	ld	r24,-64(r12)
+.globl _restgpr1_25
+_restgpr1_25:
+	ld	r25,-56(r12)
+.globl _restgpr1_26
+_restgpr1_26:
+	ld	r26,-48(r12)
+.globl _restgpr1_27
+_restgpr1_27:
+	ld	r27,-40(r12)
+.globl _restgpr1_28
+_restgpr1_28:
+	ld	r28,-32(r12)
+.globl _restgpr1_29
+_restgpr1_29:
+	ld	r0,16(r12)
+	ld	r29,-24(r12)
+	mtlr	r0
+	ld	r30,-16(r12)
+	ld	r31,-8(r12)
+	blr
+
+.globl _restgpr1_30
+_restgpr1_30:
+	ld	r30,-16(r12)
+.globl _restgpr1_31
+_restgpr1_31:
+	ld	r0,16(r12)
+	ld	r31,-8(r12)
+	mtlr	r0
+	blr
+
 #ifdef CONFIG_ALTIVEC
 
 /* Called with r0 pointing just beyond the end of the vector save area. */
@@ -540,6 +662,128 @@ _restvr_31:
 
 #endif /* CONFIG_ALTIVEC */
 
+#ifdef CONFIG_PPC_FPU
+
+.globl _savefpr_14
+_savefpr_14:
+	stfd	f14,-144(r1)
+.globl _savefpr_15
+_savefpr_15:
+	stfd	f15,-136(r1)
+.globl _savefpr_16
+_savefpr_16:
+	stfd	f16,-128(r1)
+.globl _savefpr_17
+_savefpr_17:
+	stfd	f17,-120(r1)
+.globl _savefpr_18
+_savefpr_18:
+	stfd	f18,-112(r1)
+.globl
Re: [PATCH] powerpc: Add gpr1 and fpu save/restore functions
- Original Message -
> From: "Segher Boessenkool"
> To: "Timothy Pearson"
> Cc: "linuxppc-dev"
> Sent: Monday, February 12, 2024 11:02:07 AM
> Subject: Re: [PATCH] powerpc: Add gpr1 and fpu save/restore functions
>
> On Mon, Feb 12, 2024 at 10:41:18AM -0600, Timothy Pearson wrote:
>> Implement gpr1 and fpu save/restore functions per the ABI v2 documentation.
>
> There is no "ABI v2". This is the ELFv2 ABI, it is a name, it is not a
> version 2 of anything (in fact, it is version 1 everywhere).

Apologies, I wasn't precise on the name.

> The same functions are needed and used in other ABIs, too.
>
> But, why do this patch? You just need
>
> +LIBGCC := $(shell $(CC) $(KBUILD_CFLAGS) -print-libgcc-file-name)
> +libs-y += $(LIBGCC)
>
> and nothing more. It is required for proper functioning of GCC to link
> with the libgcc support library.

There is existing code in the kernel right now to provide support functions for gpr0 and altivec save/restore. I don't know the full story here, but at some point in the kernel's history it seems to have been decided to provide the helper functions in lieu of linking libgcc directly.

If this is incorrect, then I need to know that so I can rework the patch to enable libgcc and remove the existing support functions. Is there anyone on-list who knows more of the history and decision-making that went into the current state of the kernel here?

Thanks!
[PATCH] powerpc: Add gpr1 and fpu save/restore functions
When building the kernel in size optimized mode with the amdgpu module enabled, gcc will begin referencing external gpr1 and fpu save/restore functions. This will then cause a linker failure as we do not link against libgcc which normally contains those builtin functions.

Implement gpr1 and fpu save/restore functions per the ABI v2 documentation. Tested on a Talos II with a WX7100 installed and running in DisplayCore mode.

Reported-by: kernel test robot
Tested-by: Timothy Pearson
Signed-off-by: Timothy Pearson
---
 arch/powerpc/kernel/prom_init_check.sh |   4 +-
 arch/powerpc/lib/crtsavres.S           | 244 +
 scripts/mod/modpost.c                  |   4 +
 3 files changed, 250 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/prom_init_check.sh b/arch/powerpc/kernel/prom_init_check.sh
index 69623b9045d5..76c5651e29d3 100644
--- a/arch/powerpc/kernel/prom_init_check.sh
+++ b/arch/powerpc/kernel/prom_init_check.sh
@@ -72,10 +72,10 @@ do
 	# ignore register save/restore funcitons
 	case $UNDEF in
-	_restgpr_*|_restgpr0_*|_rest32gpr_*)
+	_restgpr_*|_restgpr0_*|_restgpr1_*|_rest32gpr_*)
 		OK=1
 		;;
-	_savegpr_*|_savegpr0_*|_save32gpr_*)
+	_savegpr_*|_savegpr0_*|_savegpr1_*|_save32gpr_*)
 		OK=1
 		;;
 	esac
diff --git a/arch/powerpc/lib/crtsavres.S b/arch/powerpc/lib/crtsavres.S
index 7e5e1c28e56a..6cd870aacd7f 100644
--- a/arch/powerpc/lib/crtsavres.S
+++ b/arch/powerpc/lib/crtsavres.S
@@ -3,6 +3,7 @@
  *
  * Copyright (C) 1995, 1996, 1998, 2000, 2001 Free Software Foundation, Inc.
  * Copyright 2008 Freescale Semiconductor, Inc.
+ * Copyright 2024 Raptor Engineering, LLC
  * Written By Michael Meissner
  *
  * Based on gcc/config/rs6000/crtsavres.asm from gcc
@@ -435,6 +436,127 @@ _restgpr0_31:
 	mtlr	r0
 	blr
 
+.globl _savegpr1_14
+_savegpr1_14:
+	std	r14,-144(r12)
+.globl _savegpr1_15
+_savegpr1_15:
+	std	r15,-136(r12)
+.globl _savegpr1_16
+_savegpr1_16:
+	std	r16,-128(r12)
+.globl _savegpr1_17
+_savegpr1_17:
+	std	r17,-120(r12)
+.globl _savegpr1_18
+_savegpr1_18:
+	std	r18,-112(r12)
+.globl _savegpr1_19
+_savegpr1_19:
+	std	r19,-104(r12)
+.globl _savegpr1_20
+_savegpr1_20:
+	std	r20,-96(r12)
+.globl _savegpr1_21
+_savegpr1_21:
+	std	r21,-88(r12)
+.globl _savegpr1_22
+_savegpr1_22:
+	std	r22,-80(r12)
+.globl _savegpr1_23
+_savegpr1_23:
+	std	r23,-72(r12)
+.globl _savegpr1_24
+_savegpr1_24:
+	std	r24,-64(r12)
+.globl _savegpr1_25
+_savegpr1_25:
+	std	r25,-56(r12)
+.globl _savegpr1_26
+_savegpr1_26:
+	std	r26,-48(r12)
+.globl _savegpr1_27
+_savegpr1_27:
+	std	r27,-40(r12)
+.globl _savegpr1_28
+_savegpr1_28:
+	std	r28,-32(r12)
+.globl _savegpr1_29
+_savegpr1_29:
+	std	r29,-24(r12)
+.globl _savegpr1_30
+_savegpr1_30:
+	std	r30,-16(r12)
+.globl _savegpr1_31
+_savegpr1_31:
+	std	r31,-8(r12)
+	std	r0,16(r12)
+	blr
+
+.globl _restgpr1_14
+_restgpr1_14:
+	ld	r14,-144(r12)
+.globl _restgpr1_15
+_restgpr1_15:
+	ld	r15,-136(r12)
+.globl _restgpr1_16
+_restgpr1_16:
+	ld	r16,-128(r12)
+.globl _restgpr1_17
+_restgpr1_17:
+	ld	r17,-120(r12)
+.globl _restgpr1_18
+_restgpr1_18:
+	ld	r18,-112(r12)
+.globl _restgpr1_19
+_restgpr1_19:
+	ld	r19,-104(r12)
+.globl _restgpr1_20
+_restgpr1_20:
+	ld	r20,-96(r12)
+.globl _restgpr1_21
+_restgpr1_21:
+	ld	r21,-88(r12)
+.globl _restgpr1_22
+_restgpr1_22:
+	ld	r22,-80(r12)
+.globl _restgpr1_23
+_restgpr1_23:
+	ld	r23,-72(r12)
+.globl _restgpr1_24
+_restgpr1_24:
+	ld	r24,-64(r12)
+.globl _restgpr1_25
+_restgpr1_25:
+	ld	r25,-56(r12)
+.globl _restgpr1_26
+_restgpr1_26:
+	ld	r26,-48(r12)
+.globl _restgpr1_27
+_restgpr1_27:
+	ld	r27,-40(r12)
+.globl _restgpr1_28
+_restgpr1_28:
+	ld	r28,-32(r12)
+.globl _restgpr1_29
+_restgpr1_29:
+	ld	r0,16(r12)
+	ld	r29,-24(r12)
+	mtlr	r0
+	ld	r30,-16(r12)
+	ld	r31,-8(r12)
+	blr
+
+.globl _restgpr1_30
+_restgpr1_30:
+	ld	r30,-16(r12)
+.globl _restgpr1_31
+_restgpr1_31:
+	ld	r0,16(r12)
+	ld	r31,-8(r12)
+	mtlr	r0
+	blr
+
 #ifdef CONFIG_ALTIVEC
 
 /* Called with r0 pointing just beyond the end of the vector save area. */
@@ -540,6 +662,128 @@ _restvr_31:
 
 #endif /* CONFIG_ALTIVEC */
 
+#ifdef CONFIG_PPC_FPU
+
+.globl _savefpr_14
+_savefpr_14:
+	stfd	f14,-144(r1)
+.globl _savefpr_15
+_savefpr_15:
+	stfd	f15,-136(r1)
+.globl _savefpr_16
+_savefpr_16:
+	stfd	f16,-128(r1)
+.globl _savefpr_17
+_savefpr_17:
+	stfd	f17,-120(r1)
+.globl _savefpr_18
+_savefpr_18:
+	stfd	f18,-112(r1)
+.globl _savefpr_19
Re: [PATCH 1/2] powerpc: iommu: Bring back table group release_ownership() call
- Original Message - > From: "Timothy Pearson" > To: "Jason Gunthorpe" > Cc: "Timothy Pearson" , "Shivaprasad G Bhat" > , "iommu" > , "linuxppc-dev" , > "linux-kernel" , > "Michael Ellerman" , "npiggin" , > "christophe leroy" > , "aneesh kumar" , > "naveen n rao" , > "jroedel" , "aik" , "bgray" > , "Greg Kroah-Hartman" > , "gbatra" , "vaibhav" > > Sent: Friday, January 26, 2024 9:39:56 AM > Subject: Re: [PATCH 1/2] powerpc: iommu: Bring back table group > release_ownership() call > - Original Message - >> From: "Jason Gunthorpe" >> To: "Timothy Pearson" >> Cc: "Shivaprasad G Bhat" , "iommu" >> , >> "linuxppc-dev" >> , "linux-kernel" >> , >> "Michael Ellerman" >> , "npiggin" , "christophe leroy" >> , "aneesh kumar" >> , "naveen n rao" , >> "jroedel" , "aik" >> , "bgray" , "Greg Kroah-Hartman" >> , "gbatra" >> , "vaibhav" >> Sent: Friday, January 26, 2024 9:38:06 AM >> Subject: Re: [PATCH 1/2] powerpc: iommu: Bring back table group >> release_ownership() call > >> On Fri, Jan 26, 2024 at 09:29:55AM -0600, Timothy Pearson wrote: >>> > On Fri, Jan 26, 2024 at 08:43:12PM +0530, Shivaprasad G Bhat wrote: >>> >> > Also, is there any chance someone can work on actually fixing this to >>> >> > be a proper iommu driver? I think that will become important for power >>> >> > to use the common dma_iommu code in the next year... >>> >> We are looking into it. >>> > >>> > Okay, let me know, I can possibly help make parts of this happen >>> > >>> > power is the last still-current architecture to be outside the modern >>> > IOMMU and DMA API design and I'm going to start proposing things that >>> > will not be efficient on power because of this. >>> >>> I can get development resources on this fairly rapidly, including >>> testing. 
We should figure out the best way forward and how to deal >>> with the VFIO side of things, even if that's a rewrite at the end of >>> the day the machine-specific codebase isn't *that* large for our two >>> target flavors (64-bit PowerNV and 64-bit pSeries). >> >> I have a feeling the way forward is to just start a power driver under >> drivers/iommu/ and use kconfig to make the user exclusively select >> either the legacy arch or the modern iommu. >> >> Get that working to a level where dma_iommu is happy. >> >> Get iommufd support in the new driver good enough to run your >> application. >> >> Just forget about the weird KVM and SPAPR stuff, leave it under the >> kconfig of the old code and nobody will run it. Almost nobody already >> runs it, apparently. > > We actually use QEMU/KVM/VFIO extensively at Raptor, so need the support and > need it to be performant... Never mind, I can't read this morning. :) You did say iommufd support, which gives the VFIO passthrough functionality. I think this is a reasonable approach, and will discuss further internally this afternoon.
Re: [PATCH 1/2] powerpc: iommu: Bring back table group release_ownership() call
- Original Message - > From: "Jason Gunthorpe" > To: "Timothy Pearson" > Cc: "Shivaprasad G Bhat" , "iommu" > , "linuxppc-dev" > , "linux-kernel" > , "Michael Ellerman" > , "npiggin" , "christophe leroy" > , "aneesh kumar" > , "naveen n rao" , > "jroedel" , "aik" > , "bgray" , "Greg Kroah-Hartman" > , "gbatra" > , "vaibhav" > Sent: Friday, January 26, 2024 9:38:06 AM > Subject: Re: [PATCH 1/2] powerpc: iommu: Bring back table group > release_ownership() call > On Fri, Jan 26, 2024 at 09:29:55AM -0600, Timothy Pearson wrote: >> > On Fri, Jan 26, 2024 at 08:43:12PM +0530, Shivaprasad G Bhat wrote: >> >> > Also, is there any chance someone can work on actually fixing this to >> >> > be a proper iommu driver? I think that will become important for power >> >> > to use the common dma_iommu code in the next year... >> >> We are looking into it. >> > >> > Okay, let me know, I can possibly help make parts of this happen >> > >> > power is the last still-current architecture to be outside the modern >> > IOMMU and DMA API design and I'm going to start proposing things that >> > will not be efficient on power because of this. >> >> I can get development resources on this fairly rapidly, including >> testing. We should figure out the best way forward and how to deal >> with the VFIO side of things, even if that's a rewrite at the end of >> the day the machine-specific codebase isn't *that* large for our two >> target flavors (64-bit PowerNV and 64-bit pSeries). > > I have a feeling the way forward is to just start a power driver under > drivers/iommu/ and use kconfig to make the user exclusively select > either the legacy arch or the modern iommu. > > Get that working to a level where dma_iommu is happy. > > Get iommufd support in the new driver good enough to run your > application. > > Just forget about the weird KVM and SPAPR stuff, leave it under the > kconfig of the old code and nobody will run it. Almost nobody already > runs it, apparently. 
We actually use QEMU/KVM/VFIO extensively at Raptor, so need the support and need it to be performant... > Remove it in a few years > > From what I remember the code under the arch directory is already very > nearly almost an iommu driver. I think someone could fairly quickly > get to something working and using dma_iommu.c. If s390 is any > experience there is some benchmarking and tweaking to get performance > equal to the arch's tweaked dma_iommu copy. > > Jason
Re: [PATCH 1/2] powerpc: iommu: Bring back table group release_ownership() call
- Original Message - > From: "Jason Gunthorpe" > To: "Shivaprasad G Bhat" > Cc: io...@lists.linux.dev, "linuxppc-dev" , > "linux-kernel" > , "Michael Ellerman" , > "npiggin" , "christophe > leroy" , "aneesh kumar" > , "naveen n rao" > , jroe...@suse.de, "Timothy Pearson" > , a...@amd.com, > bg...@linux.ibm.com, "Greg Kroah-Hartman" , > gba...@linux.vnet.ibm.com, > vaib...@linux.ibm.com > Sent: Friday, January 26, 2024 9:17:01 AM > Subject: Re: [PATCH 1/2] powerpc: iommu: Bring back table group > release_ownership() call > On Fri, Jan 26, 2024 at 08:43:12PM +0530, Shivaprasad G Bhat wrote: >> > Also, is there any chance someone can work on actually fixing this to >> > be a proper iommu driver? I think that will become important for power >> > to use the common dma_iommu code in the next year... >> We are looking into it. > > Okay, let me know, I can possibly help make parts of this happen > > power is the last still-current architecture to be outside the modern > IOMMU and DMA API design and I'm going to start proposing things that > will not be efficient on power because of this. I can get development resources on this fairly rapidly, including testing. We should figure out the best way forward and how to deal with the VFIO side of things, even if that's a rewrite at the end of the day the machine-specific codebase isn't *that* large for our two target flavors (64-bit PowerNV and 64-bit pSeries). > I think a basic iommu driver using the dma API would not be so hard. > > I don't know what to do about the SPAPR VFIO mess though. :( > > Jason
Re: PowerNV PCIe hotplug support?
les linked in: pnv_php switchtec i2c_dev ses enclosure scsi_transport_sas xts ecb sg ctr nvme cbc nvme_core ofpart drm_shmem_helper t10_pi vmx_crypto ipmi_powernv powernv_flash crc64_rocksoft_generic ipmi_devintf gf128mul drm_kms_helper crct10dif_vpmsum opal_prd ipmi_msghandler mtd at24 i2c_algo_bit regmap_i2c crc64_rocksoft crc64 evdev joydev aacraid binfmt_misc drm loop fuse dm_mod drm_panel_orientation_quirks ip_tables x_tables autofs4 nfsv3 nfs_acl nfs lockd grace sunrpc fscache netfs ib_mthca ib_ipoib ib_umad rdma_ucm rdma_cm iw_cm ib_cm configfs mlx4_ib ib_uverbs ib_core hid_generic usbhid hid xhci_pci xhci_hcd ixgbe mdio_devres of_mdio tg3 fixed_phy fwnode_mdio xfrm_algo mdio crc32c_vpmsum usbcore mlx4_core libphy usb_common [last unloaded: pnv_php] [ 157.326314] CPU: 13 PID: 1170 Comm: bash Tainted: GW 6.6.8 #1 [ 157.326414] Hardware name: T2P9S01 REV 1.01 POWER9 0x4e1203 opal:skiboot-9858186 PowerNV [ 157.326485] NIP: c0da0f44 LR: c0da0f38 CTR: c08b58e0 [ 157.326573] REGS: c0003a1bb7a0 TRAP: 0300 Tainted: GW (6.6.8) [ 157.326652] MSR: 90009033 CR: 24424222 XER: [ 157.326777] CFAR: c0d9d96c DAR: 0010 DSISR: 4000 IRQMASK: 0 GPR00: c0da0f38 c0003a1bba40 c10fe800 GPR04: 0040 0040 001b GPR08: c000201ffcba1b00 cf168a00 c00819063700 GPR12: c08b58e0 c01d1c80 000100f6a548 GPR16: 000100e93e40 GPR20: 000102cc5560 000100f21a58 7fffe59d6784 c000201fff792ed0 GPR24: 0001 c000201fff793630 c0001358d8e0 GPR28: c0001358d8c0 c0002000109cc000 c0002000109cc0c0 0010 [ 157.327630] NIP [c0da0f44] mutex_lock+0x34/0x90 [ 157.327698] LR [c0da0f38] mutex_lock+0x28/0x90 [ 157.327766] Call Trace: [ 157.327799] [c0003a1bba40] [c0da0f38] mutex_lock+0x28/0x90 (unreliable) [ 157.327880] [c0003a1bba70] [c0202658] msi_lock_descs+0x28/0x40 [ 157.327963] [c0003a1bba90] [c08b5948] pci_disable_msi+0x68/0xb0 [ 157.328072] [c0003a1bbac0] [c00819060708] pnv_php_disable_irq+0x140/0x160 [pnv_php] [ 157.328150] [c0003a1bbb10] [c00819060a28] pnv_php_free_slot+0x50/0x90 [pnv_php] [ 157.328248] 
[c0003a1bbb40] [c00819061fb0] pnv_php_unregister+0x248/0x290 [pnv_php] [ 157.328342] [c0003a10] [c0081906207c] pnv_php_disable_slot+0x84/0xe0 [pnv_php] [ 157.328443] [c0003a1bbbf0] [c08cf8d0] power_write_file+0xa0/0x180 [ 157.328546] [c0003a1bbc70] [c08c4b60] pci_slot_attr_store+0x40/0x60 [ 157.328643] [c0003a1bbc90] [c0614ef4] sysfs_kf_write+0x64/0x80 [ 157.328700] [c0003a1bbcb0] [c0613a18] kernfs_fop_write_iter+0x1b8/0x2a0 [ 157.328789] [c0003a1bbd00] [c0521070] vfs_write+0x350/0x4b0 [ 157.328881] [c0003a1bbdc0] [c05214d4] ksys_write+0x84/0x140 [ 157.328960] [c0003a1bbe10] [c0030d38] system_call_exception+0x168/0x3a0 [ 157.329052] [c0003a1bbe50] [c000c270] system_call_vectored_common+0xf0/0x280 [ 157.329139] --- interrupt: 3000 at 0x7fffa77ead38 [ 157.329191] NIP: 7fffa77ead38 LR: CTR: [ 157.329259] REGS: c0003a1bbe80 TRAP: 3000 Tainted: GW (6.6.8) [ 157.329327] MSR: 9000d033 CR: 48422408 XER: [ 157.329444] IRQMASK: 0 GPR00: 0004 7fffe59d6560 7fffa7906f00 0001 GPR04: 000102ce4c40 0002 0010 000102ce6c40 GPR08: GPR12: 7fffa7a1b0c0 000100f6a548 GPR16: 000100e93e40 GPR20: 000102cc5560 000100f21a58 7fffe59d6784 7fffe59d6780 GPR24: 0001 000100f706f0 000102ce4c40 0002 GPR28: 0002 7fffa7901980 000102ce4c40 0002 [ 157.330151] NIP [7fffa77ead38] 0x7fffa77ead38 [ 157.330208] LR [] 0x0 [ 157.330245] --- interrupt: 3000 [ 157.330275] Code: 3842d8f0 7c0802a6 6000 7c0802a6 fbe1fff8 7c7f1b78 f8010010 f821ffd1 4bffc9f5 6000 3920 e94d0908 <7d00f8a8> 7c284800 40c20010 7d40f9ad [ 157.330467] ---[ end trace ]--- [ 157.902980] note: bash[1170] exited with irqs disabled - Original Message - > From: "Timothy Pearson" > To: "linu
Re: PowerNV PCIe hotplug support?
] 0x0 [ 3771.108425] --- interrupt: 3000 [ 3771.108437] Code: 3842d8f0 7c0802a6 6000 7c0802a6 fbe1fff8 7c7f1b78 f8010010 f821ffd1 4bffc9f5 6000 3920 e94d0908 <7d00f8a8> 7c284800 40c20010 7d40f9ad [ 3771.108510] ---[ end trace ]--- - Original Message - > From: "Timothy Pearson" > To: "linuxppc-dev" > Sent: Wednesday, December 27, 2023 10:15:24 PM > Subject: PowerNV PCIe hotplug support? > I've been evaluating some new options for our POWER9-based hardware in the > NVMe > space, and would like some clarification on the current status of PCIe hotplug > for the PowerNV platforms. > > From what I understand, the pnv_php driver provides the basic hotplug > functionality on PowerNV. What I'm not clear on is to what extent this is > intended to flow downstream to attached PCIe switches. > > I have a test setup here that consists of a PMC 8533 switch and several > downstream NVMe drives, with the switch attached directly to the PHB4 root > port. After loading the pnv_php module, I can disconnect the downstream NVMe > devices by either using echo 0 on /sys/bus/pcu/slots/Snnn/power, or by > doing a physical surprise unplug, however nothing I try can induce a newly > plugged device to train and be detected on the bus. Even trying a echo 0 and > subsequent echo 1 to /sys/bus/pcu/slots/Snnn/power only results in the > device going offline, there seems to be no way to bring the device back online > short of a reboot. > > Hotplug of other devices connected directly to the PHB4 appears to work > properly > (I can online and offline via the power node); the issue seems to be > restricted > to downstream devices connected to the (theoretically hotplug capable) PMC > 8533 > switch. > > Is this the intended behavior for downstream (non-IBM) PCIe ports? Raptor can > provide resources to assist in a fix if needed, but I would like to understand > if this is a bug or an unimplemented feature first, and if the latter what the > main issues are likely to be in implementation. > > Thank you!
PowerNV PCIe hotplug support?
I've been evaluating some new options for our POWER9-based hardware in the NVMe space, and would like some clarification on the current status of PCIe hotplug for the PowerNV platforms.

From what I understand, the pnv_php driver provides the basic hotplug functionality on PowerNV. What I'm not clear on is to what extent this is intended to flow downstream to attached PCIe switches.

I have a test setup here that consists of a PMC 8533 switch and several downstream NVMe drives, with the switch attached directly to the PHB4 root port. After loading the pnv_php module, I can disconnect the downstream NVMe devices either by echoing 0 to /sys/bus/pci/slots/Snnn/power, or by doing a physical surprise unplug; however, nothing I try can induce a newly plugged device to train and be detected on the bus. Even trying an echo 0 and subsequent echo 1 to /sys/bus/pci/slots/Snnn/power only results in the device going offline; there seems to be no way to bring the device back online short of a reboot.

Hotplug of other devices connected directly to the PHB4 appears to work properly (I can online and offline via the power node); the issue seems to be restricted to downstream devices connected to the (theoretically hotplug capable) PMC 8533 switch.

Is this the intended behavior for downstream (non-IBM) PCIe ports? Raptor can provide resources to assist in a fix if needed, but I would like to understand whether this is a bug or an unimplemented feature first, and if the latter, what the main issues are likely to be in implementation.

Thank you!
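For reference, the off/on cycle being tested boils down to something like the sketch below. The helper itself and the default slot name S0001 are hypothetical, purely for illustration; the real slot names come from listing /sys/bus/pci/slots/ once pnv_php is loaded.

```shell
#!/bin/sh
# Hypothetical sketch of the sysfs slot power cycle described above.
# The default slot name S0001 is an assumption; check /sys/bus/pci/slots/
# on the target system for the names actually exposed by pnv_php.
cycle_slot() {
    power="/sys/bus/pci/slots/$1/power"
    if [ -w "$power" ]; then
        echo 0 > "$power"   # take the slot offline
        sleep 1             # arbitrary settle delay before re-powering
        echo 1 > "$power"   # request power-on; the link should retrain here
        cat "$power"        # read back the resulting power state
    else
        echo "slot $1 not present (is pnv_php loaded?)"
        return 1
    fi
}

cycle_slot "${1:-S0001}" || true
```

On the setup described above, the echo 0 half behaves as expected for slots directly on the PHB4, but for slots behind the PMC 8533 switch the subsequent echo 1 never brings the device back.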
Re: [RFC PATCH 10/12] drm/amd/display: Use ARCH_HAS_KERNEL_FPU_SUPPORT
- Original Message - > From: "Samuel Holland" > To: "Michael Ellerman" > Cc: "linux-kernel" , "amd-gfx" > , "linux-arch" > , "linux-arm-kernel" > , loonga...@lists.linux.dev, > "linuxppc-dev" , "x86" , > linux-ri...@lists.infradead.org, "Christoph > Hellwig" , "Timothy Pearson" > > Sent: Wednesday, December 13, 2023 7:03:20 PM > Subject: Re: [RFC PATCH 10/12] drm/amd/display: Use > ARCH_HAS_KERNEL_FPU_SUPPORT > On 2023-12-11 6:23 AM, Michael Ellerman wrote: >> Hi Samuel, >> >> Thanks for trying to clean all this up. >> >> One problem below. >> >> Samuel Holland writes: >>> Now that all previously-supported architectures select >>> ARCH_HAS_KERNEL_FPU_SUPPORT, this code can depend on that symbol instead >>> of the existing list of architectures. It can also take advantage of the >>> common kernel-mode FPU API and method of adjusting CFLAGS. >>> >>> Signed-off-by: Samuel Holland >> ... >>> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/dc_fpu.c >>> b/drivers/gpu/drm/amd/display/amdgpu_dm/dc_fpu.c >>> index 4ae4720535a5..b64f917174ca 100644 >>> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/dc_fpu.c >>> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/dc_fpu.c >>> @@ -87,20 +78,9 @@ void dc_fpu_begin(const char *function_name, const int >>> line) >>> WARN_ON_ONCE(!in_task()); >>> preempt_disable(); >>> depth = __this_cpu_inc_return(fpu_recursion_depth); >>> - >>> if (depth == 1) { >>> -#if defined(CONFIG_X86) || defined(CONFIG_LOONGARCH) >>> + BUG_ON(!kernel_fpu_available()); >>> kernel_fpu_begin(); >>> -#elif defined(CONFIG_PPC64) >>> - if (cpu_has_feature(CPU_FTR_VSX_COMP)) >>> - enable_kernel_vsx(); >>> - else if (cpu_has_feature(CPU_FTR_ALTIVEC_COMP)) >>> - enable_kernel_altivec(); >> >> Note altivec. 
>> >>> - else if (!cpu_has_feature(CPU_FTR_FPU_UNAVAILABLE)) >>> - enable_kernel_fp(); >>> -#elif defined(CONFIG_ARM64) >>> - kernel_neon_begin(); >>> -#endif >>> } >>> >>> TRACE_DCN_FPU(true, function_name, line, depth); >>> diff --git a/drivers/gpu/drm/amd/display/dc/dml/Makefile >>> b/drivers/gpu/drm/amd/display/dc/dml/Makefile >>> index ea7d60f9a9b4..5aad0f572ba3 100644 >>> --- a/drivers/gpu/drm/amd/display/dc/dml/Makefile >>> +++ b/drivers/gpu/drm/amd/display/dc/dml/Makefile >>> @@ -25,40 +25,8 @@ >>> # It provides the general basic services required by other DAL >>> # subcomponents. >>> >>> -ifdef CONFIG_X86 >>> -dml_ccflags-$(CONFIG_CC_IS_GCC) := -mhard-float >>> -dml_ccflags := $(dml_ccflags-y) -msse >>> -endif >>> - >>> -ifdef CONFIG_PPC64 >>> -dml_ccflags := -mhard-float -maltivec >>> -endif >> >> And altivec is enabled in the flags there. >> >> That doesn't match your implementation for powerpc in patch 7, which >> only deals with float. >> >> I suspect the AMD driver actually doesn't need altivec enabled, but I >> don't know that for sure. It compiles without it, but I don't have a GPU >> to actually test. I've added Timothy on Cc who added the support for >> powerpc to the driver originally, hopefully he has a test system. If you would like me to test I'm happy to do so, but I am travelling until Friday so would need to wait until then. Thanks!
Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
- Original Message - > From: "Michael Ellerman" > To: "Timothy Pearson" > Cc: "Jens Axboe" , "regressions" > , "npiggin" , > "christophe leroy" , "linuxppc-dev" > > Sent: Tuesday, November 28, 2023 6:57:01 AM > Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec > register save > Michael Ellerman writes: >> Timothy Pearson writes: >> >>> Just wanted to check back and see if this patch was going to be queued >>> up soon? We're still having to work around / advertise the data >>> destruction issues the underlying bug is causing on e.g. Debian >>> Stable. >> >> Yeah I'll apply it this week, so it will be in rc4. > > I reworked the change log to include the exact call path I identified > instead of the more high level description you had. And tweaked a few > other bits of wording and so on, apparently fr0 is a kernelism, the ABI > and binutils calls it f0. > > I'm not sure how wedded you were to your change log, so if you dislike > my edits let me know and we can come up with a joint one. > > The actual patch is unchanged. > > cheers The commit message looks OK to me. I've also seen application crashes as a result of the register corruption, but that may be a minor detail that isn't really worth updating things over at this point -- those come from e.g. glibc using vs0 as part of a path that processes pointer information, typically seen where there's a need to replicate the same pointer to adjacent fields in a data struct.
Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
Just wanted to check back and see if this patch was going to be queued up soon? We're still having to work around / advertise the data destruction issues the underlying bug is causing on e.g. Debian Stable. Thanks!
Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
- Original Message - > From: "Michael Ellerman" > To: "Timothy Pearson" > Cc: "Jens Axboe" , "regressions" > , "npiggin" , > "christophe leroy" , "linuxppc-dev" > > Sent: Tuesday, November 21, 2023 11:01:50 PM > Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec > register save > Timothy Pearson writes: >> > ... >> >> So a little more detail on this, just to put it to rest properly vs. >> assuming hand analysis caught every possible pathway. :) >> >> The debugging that generates this stack trace also verifies the following in >> __giveup_fpu(): >> >> 1.) tsk->thread.fp_state.fpr doesn't contain the FPSCR contents prior to >> calling >> save_fpu() >> 2.) tsk->thread.fp_state.fpr contains the FPSCR contents directly after >> calling >> save_fpu() >> 3.) MSR_FP is set both in the task struct and in the live MSR. >> >> Only if all three conditions are met will it generate the trace. This >> is a generalization of the hack I used to find the problem in the >> first place. >> >> If the state will subsequently be reloaded from the thread struct, >> that means we're reloading the registers from the thread struct that >> we just verified was corrupted by the earlier save_fpu() call. There >> are only two ways I can see for that to be true -- one is if the >> registers were already clobbered when giveup_all() was entered, and >> the other is if save_fpu() went ahead and clobbered them right here >> inside giveup_all(). >> >> To see which scenario we were dealing with, I added a bit more >> instrumentation to dump the current register state if MSR_FP bit was >> already set in registers (i.e. not dumping data from task struct, but >> using the live FPU registers instead), and sure enough the registers >> are corrupt on entry, so something else has already called save_fpu() >> before we even hit giveup_all() in this call chain. > > Can you share the debug patch you're using? > > cheers Sure, here you go. 
Note that with my FPU patch there is no WARN_ON hit, at least in my testing, so it isn't userspace purposefully loading the fr0/vs0 register with the FPSCR. diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c index 392404688cec..bde57dc3262a 100644 --- a/arch/powerpc/kernel/process.c +++ b/arch/powerpc/kernel/process.c @@ -154,7 +154,49 @@ static void __giveup_fpu(struct task_struct *tsk) { unsigned long msr; + // DEBUGGING + uint64_t prev_fpr0 = *(((uint64_t*)(&tsk->thread.fp_state.fpr[0]))+0); + uint64_t prev_fpr1 = *(((uint64_t*)(&tsk->thread.fp_state.fpr[0]))+1); + struct thread_fp_state debug_fp_state; + unsigned long currentmsr = mfmsr(); + + if (currentmsr & MSR_FP) { + store_fp_state(&debug_fp_state); + load_fp_state(&debug_fp_state); + } + save_fpu(tsk); + + // DEBUGGING + if (tsk->thread.regs->msr & MSR_FP) { + if (((*(((uint64_t*)(&tsk->thread.fp_state.fpr[0]))+0) == 0x82004000) && (prev_fpr0 != 0x82004000)) +|| ((*(((uint64_t*)(&tsk->thread.fp_state.fpr[0]))+1) == 0x82004000) && (prev_fpr1 != 0x82004000))) + { + WARN_ON(1); + + printk("[TS %lld] In __giveup_fpu() for process [comm: '%s' pid %d tid %d], before save current " + "fp0: 0x%016llx/%016llx fp1: 0x%016llx/%016llx fp8: 0x%016llx/%016llx fp9: 0x%016llx/%016llx" + " msr: 0x%016lx (FP %d VSX %d EE %d) on core %d\n", + ktime_get_boottime_ns(), current->comm, current->pid, current->tgid, + *(((uint64_t*)(&debug_fp_state.fpr[0]))+0), *(((uint64_t*)(&debug_fp_state.fpr[0]))+1), + *(((uint64_t*)(&debug_fp_state.fpr[1]))+0), *(((uint64_t*)(&debug_fp_state.fpr[1]))+1), + *(((uint64_t*)(&debug_fp_state.fpr[8]))+0), *(((uint64_t*)(&debug_fp_state.fpr[8]))+1), + *(((uint64_t*)(&tsk->thread.fp_state.fpr[9]))+0), *(((uint64_t*)(&tsk->thread.fp_state.fpr[9]))+1), + currentmsr, !!(currentmsr & MSR_FP), !!(currentmsr & MSR_VSX), !!(currentmsr & MSR_EE), raw_smp_processor_id()); + + printk("[TS %lld] In __giveup_fpu() for process [comm: '%s' pid %d tid %d], after save saved " + "fp0: 
0x%016llx/%016llx fp1: 0x%016llx/%016llx fp8: 0x%016llx/%016ll
Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
- Original Message - > From: "Timothy Pearson" > To: "Michael Ellerman" > Cc: "Jens Axboe" , "regressions" > , "npiggin" , > "christophe leroy" , "linuxppc-dev" > > Sent: Monday, November 20, 2023 10:10:32 PM > Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec > register save > - Original Message - >> From: "Michael Ellerman" >> To: "Timothy Pearson" >> Cc: "Jens Axboe" , "regressions" >> , >> "npiggin" , >> "christophe leroy" , "linuxppc-dev" >> >> Sent: Monday, November 20, 2023 5:39:52 PM >> Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec >> register save > >> Timothy Pearson writes: >>> - Original Message - >>>> From: "Michael Ellerman" >> ... >>>> >>>> But we now have a new path, because io-uring can call copy_process() via >>>> create_io_thread() from the signal handling path. That's OK if the signal >>>> is >>>> handled as we return from a syscall, but it's not OK if the signal is >>>> handled >>>> due to some other interrupt. >>>> >>>> Which is: >>>> >>>> interrupt_return_srr_user() >>>> interrupt_exit_user_prepare() >>>>interrupt_exit_user_prepare_main() >>>> do_notify_resume() >>>>get_signal() >>>> task_work_run() >>>>create_worker_cb() >>>> create_io_worker() >>>>copy_process() >>>> dup_task_struct() >>>>arch_dup_task_struct() >>>> flush_all_to_thread() >>>>save_all() >>>> if (tsk->thread.regs->msr & MSR_FP) >>>>save_fpu() >>>># fr0 is clobbered and potentially live in >>>> userspace >>>> >>>> >>>> So tldr I think the corruption is only an issue since io-uring started >>>> doing >>>> the clone via signal, which I think matches the observed timeline of this >>>> bug >>>> appearing. >>> >>> I agree the corruption really only started showing up in earnest on >>> io_uring clone-via-signal, as this was confirmed several times in the >>> course of debugging. >> >> Thanks. 
>> >>> Note as well that I may very well have a wrong call order in the >>> commit message, since I was relying on a couple of WARN_ON() macros I >>> inserted to check for a similar (but not identical) condition and >>> didn't spend much time getting new traces after identifying the root >>> cause. >> >> Yep no worries. I'll reword it to incorporate the full path from my mail. >> >>> I went back and grabbed some real world system-wide stack traces, since I >>> now >>> know what to trigger on. A typical example is: >>> >>> interrupt_return_srr_user() >>> interrupt_exit_user_prepare() >>> interrupt_exit_user_prepare_main() >>>schedule() >>> __schedule() >>> __switch_to() >>> giveup_all() >>># tsk->thread.regs->msr MSR_FP is still set here >>>__giveup_fpu() >>> save_fpu() >>> # fr0 is clobbered and potentially live in userspace >> >> fr0 is not live there. > >> ie. it clears the FP etc. bits from the task's MSR. That means the FP >> state will be reloaded from the thread struct before the task is run again. > > So a little more detail on this, just to put it to rest properly vs. assuming > hand analysis caught every possible pathway. :) > > The debugging that generates this stack trace also verifies the following in > __giveup_fpu(): > > 1.) tsk->thread.fp_state.fpr doesn't contain the FPSCR contents prior to > calling > save_fpu() > 2.) tsk->thread.fp_state.fpr contains the FPSCR contents directly after > calling > save_fpu() > 3.) MSR_FP is set both in the task struct and in the live MSR. > > Only if all three conditions are met will it generate the trace. This is a > generalization of the hack I used to find the problem in the first place. > > If the state will subsequently
Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
- Original Message - > From: "Michael Ellerman" > To: "Timothy Pearson" > Cc: "Jens Axboe" , "regressions" > , "npiggin" , > "christophe leroy" , "linuxppc-dev" > > Sent: Monday, November 20, 2023 5:39:52 PM > Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec > register save > Timothy Pearson writes: >> - Original Message - >>> From: "Michael Ellerman" > ... >>> >>> But we now have a new path, because io-uring can call copy_process() via >>> create_io_thread() from the signal handling path. That's OK if the signal is >>> handled as we return from a syscall, but it's not OK if the signal is >>> handled >>> due to some other interrupt. >>> >>> Which is: >>> >>> interrupt_return_srr_user() >>> interrupt_exit_user_prepare() >>>interrupt_exit_user_prepare_main() >>> do_notify_resume() >>>get_signal() >>> task_work_run() >>>create_worker_cb() >>> create_io_worker() >>>copy_process() >>> dup_task_struct() >>>arch_dup_task_struct() >>> flush_all_to_thread() >>>save_all() >>> if (tsk->thread.regs->msr & MSR_FP) >>>save_fpu() >>># fr0 is clobbered and potentially live in >>> userspace >>> >>> >>> So tldr I think the corruption is only an issue since io-uring started doing >>> the clone via signal, which I think matches the observed timeline of this >>> bug >>> appearing. >> >> I agree the corruption really only started showing up in earnest on >> io_uring clone-via-signal, as this was confirmed several times in the >> course of debugging. > > Thanks. > >> Note as well that I may very well have a wrong call order in the >> commit message, since I was relying on a couple of WARN_ON() macros I >> inserted to check for a similar (but not identical) condition and >> didn't spend much time getting new traces after identifying the root >> cause. > > Yep no worries. I'll reword it to incorporate the full path from my mail. > >> I went back and grabbed some real world system-wide stack traces, since I now >> know what to trigger on. 
A typical example is: >> >> interrupt_return_srr_user() >> interrupt_exit_user_prepare() >> interrupt_exit_user_prepare_main() >>schedule() >> __schedule() >> __switch_to() >> giveup_all() >># tsk->thread.regs->msr MSR_FP is still set here >>__giveup_fpu() >> save_fpu() >> # fr0 is clobbered and potentially live in userspace > > fr0 is not live there. > ie. it clears the FP etc. bits from the task's MSR. That means the FP > state will be reloaded from the thread struct before the task is run again. So a little more detail on this, just to put it to rest properly vs. assuming hand analysis caught every possible pathway. :) The debugging that generates this stack trace also verifies the following in __giveup_fpu(): 1.) tsk->thread.fp_state.fpr doesn't contain the FPSCR contents prior to calling save_fpu() 2.) tsk->thread.fp_state.fpr contains the FPSCR contents directly after calling save_fpu() 3.) MSR_FP is set both in the task struct and in the live MSR. Only if all three conditions are met will it generate the trace. This is a generalization of the hack I used to find the problem in the first place. If the state will subsequently be reloaded from the thread struct, that means we're reloading the registers from the thread struct that we just verified was corrupted by the earlier save_fpu() call. There are only two ways I can see for that to be true -- one is if the registers were already clobbered when giveup_all() was entered, and the other is if save_fpu() went ahead and clobbered them right here inside giveup_all(). To see which scenario we were dealing with, I added a bit more instrumentation to dump the current register state if MSR_FP bit was already set in registers (i.e. not dumping data from task struct, but using the live FPU registers instead), and sure enough the registers are corrupt on entry, so something else has already called save_fpu() before we even hit giveup_all() in this call chain. 
Unless I'm missing something, doesn't this effectively mean that anything interrupting a task can hit this bug? Or, put another way, I'm seeing several processes hit this exact call chain with the corrupt register going back out to userspace without io_uring even in the mix, so there seems to be another pathway in play. These traces are from a qemu guest, in case it matters given the kvm path is possibly susceptible. Just a few things to think about. The FPU patch itself definitely resolves the problems; I used a sledgehammer approach *specifically* so that there is no place for a rare call sequence we didn't consider to hit it again down the line. :)
Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
- Original Message - > From: "Michael Ellerman" > To: "Timothy Pearson" > Cc: "Jens Axboe" , "regressions" > , "npiggin" , > "christophe leroy" , "linuxppc-dev" > > Sent: Monday, November 20, 2023 5:39:52 PM > Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec > register save > Timothy Pearson writes: >> - Original Message - >>> From: "Michael Ellerman" > ... >>> >>> But we now have a new path, because io-uring can call copy_process() via >>> create_io_thread() from the signal handling path. That's OK if the signal is >>> handled as we return from a syscall, but it's not OK if the signal is >>> handled >>> due to some other interrupt. >>> >>> Which is: >>> >>> interrupt_return_srr_user() >>> interrupt_exit_user_prepare() >>>interrupt_exit_user_prepare_main() >>> do_notify_resume() >>>get_signal() >>> task_work_run() >>>create_worker_cb() >>> create_io_worker() >>>copy_process() >>> dup_task_struct() >>>arch_dup_task_struct() >>> flush_all_to_thread() >>>save_all() >>> if (tsk->thread.regs->msr & MSR_FP) >>>save_fpu() >>># fr0 is clobbered and potentially live in >>> userspace >>> >>> >>> So tldr I think the corruption is only an issue since io-uring started doing >>> the clone via signal, which I think matches the observed timeline of this >>> bug >>> appearing. >> >> I agree the corruption really only started showing up in earnest on >> io_uring clone-via-signal, as this was confirmed several times in the >> course of debugging. > > Thanks. > >> Note as well that I may very well have a wrong call order in the >> commit message, since I was relying on a couple of WARN_ON() macros I >> inserted to check for a similar (but not identical) condition and >> didn't spend much time getting new traces after identifying the root >> cause. > > Yep no worries. I'll reword it to incorporate the full path from my mail. > >> I went back and grabbed some real world system-wide stack traces, since I now >> know what to trigger on. 
A typical example is: >> >> interrupt_return_srr_user() >> interrupt_exit_user_prepare() >> interrupt_exit_user_prepare_main() >>schedule() >> __schedule() >> __switch_to() >> giveup_all() >># tsk->thread.regs->msr MSR_FP is still set here >>__giveup_fpu() >> save_fpu() >> # fr0 is clobbered and potentially live in userspace > > fr0 is not live there. > > __giveup_fpu() does roughly: > > msr = tsk->thread.regs->msr; > msr &= ~(MSR_FP|MSR_FE0|MSR_FE1); > msr &= ~MSR_VSX; > tsk->thread.regs->msr = msr; > > ie. it clears the FP etc. bits from the task's MSR. That means the FP > state will be reloaded from the thread struct before the task is run again. > > Also on that path we're switching to another task, so we'll be reloading > the other task's FP state before returning to userspace. > > So I don't see any bug there. Yeah, you're right. I was trying to get traces while doing something else, and didn't think that all the way through, sorry. :) It's not going to be super easy to get a real trace (I was triggering the WARN_ON() from fr0 getting set to the FPSCR), so let's just assume it's mainly the path you manually found above and update the commit message accordingly. > There's only two places that call save_fpu() and skip the giveup logic, > which is save_all() and kvmppc_save_user_regs(). Now that's interesting as well, since it might explain some issues I've seen for years on a specific QEMU workload. Once this is backported to stable I'll need to get the kernels updated on those boxes and see if the issues disappear...
Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
- Original Message - > From: "Michael Ellerman" > To: "Timothy Pearson" , "Jens Axboe" > , "regressions" > , "npiggin" , "christophe > leroy" , > "linuxppc-dev" > Sent: Monday, November 20, 2023 1:10:06 AM > Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec > register save > Hi Timothy, > > Great work debugging this. I think your fix is good, but I want to understand > it > a bit more to make sure I can explain why we haven't seen it outside of > io-uring. > If this can be triggered outside of io-uring then I have even more backporting > in my future :} > > Typically save_fpu() is called from __giveup_fpu() which saves the FP regs and > also *turns off FP* in the tasks MSR, meaning the kernel will reload the FP > regs > from the thread struct before letting the task use FP again. So in that case > save_fpu() is free to clobber fr0 because the FP regs no longer hold live > values > for the task. > > There is another case though, which is the path via: > copy_process() >dup_task_struct() > arch_dup_task_struct() >flush_all_to_thread() > save_all() > > That path saves the FP regs but leaves them live. That's meant as an > optimisation for a process that's using FP/VSX and then calls fork(), leaving > the regs live means the parent process doesn't have to take a fault after the > fork to get its FP regs back. > > That path does clobber fr0, but fr0 is volatile across a syscall, and the only > way to reach copy_process() from userspace is via a syscall. So in normal > usage > fr0 being clobbered across a syscall shouldn't cause data corruption. > > Even if we handle a signal on the return from the fork() syscall, the worst > that > happens is that the task's thread struct holds the clobbered fr0, but the task > doesn't care (because fr0 is volatile across the syscall anyway). 
> > That path is something like: > > system_call_vectored_common() > system_call_exception() >sys_fork() > kernel_clone() >copy_process() > dup_task_struct() >arch_dup_task_struct() > flush_all_to_thread() >save_all() > if (tsk->thread.regs->msr & MSR_FP) >save_fpu() ># does not clear MSR_FP from regs->msr > syscall_exit_prepare() >interrupt_exit_user_prepare_main() > do_notify_resume() >get_signal() >handle_rt_signal64() > prepare_setup_sigcontext() >flush_fp_to_thread() > if (tsk->thread.regs->msr & MSR_FP) >giveup_fpu() > __giveup_fpu >save_fpu() ># clobbered fr0 is saved, but task considers it volatile ># across syscall anyway > > > But we now have a new path, because io-uring can call copy_process() via > create_io_thread() from the signal handling path. That's OK if the signal is > handled as we return from a syscall, but it's not OK if the signal is handled > due to some other interrupt. > > Which is: > > interrupt_return_srr_user() > interrupt_exit_user_prepare() >interrupt_exit_user_prepare_main() > do_notify_resume() >get_signal() > task_work_run() >create_worker_cb() > create_io_worker() >copy_process() > dup_task_struct() >arch_dup_task_struct() > flush_all_to_thread() >save_all() > if (tsk->thread.regs->msr & MSR_FP) >save_fpu() ># fr0 is clobbered and potentially live in > userspace > > > So tldr I think the corruption is only an issue since io-uring started doing > the clone via signal, which I think matches the observed timeline of this bug > appearing. I agree the corruption really only started showing up in earnest on io_uring clone-via-signal, as this was confirmed several times in the course of debugging. Bear in mind however that we have seen very, very rare crashes over several years in other tasks, and I am starting to think this bug might be the root cause (see below). 
Note as well that I may very well have a wrong call order in the commit message, since I was relying on a couple of WARN_ON() macros I inserted to check for a similar (but not identical) condition and didn't spend much time getting new traces after identifying the root cause. I went
[PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
During floating point and vector save to thread data fr0/vs0 are clobbered by the FPSCR/VSCR store routine. This leads to userspace register corruption and application data corruption / crash under the following rare condition:

* A userspace thread is executing with VSX/FP mode enabled
* The userspace thread is making active use of fr0 and/or vs0
* An IPI is taken in kernel mode, forcing the userspace thread to reschedule
* The userspace thread is interrupted by the IPI before accessing data it previously stored in fr0/vs0
* The thread being switched in by the IPI has a pending signal

If these exact criteria are met, then the following sequence happens:

* The existing thread FP storage is still valid before the IPI, due to a prior call to save_fpu() or store_fp_state(). Note that the current fr0/vs0 registers have been clobbered, so the FP/VSX state in registers is now invalid pending a call to restore_fp()/restore_altivec().
* IPI -- FP/VSX register state remains invalid
* interrupt_exit_user_prepare_main() calls do_notify_resume(), due to the pending signal
* do_notify_resume() eventually calls save_fpu() via giveup_fpu(), which merrily reads and saves the invalid FP/VSX state to thread local storage.
* interrupt_exit_user_prepare_main() calls restore_math(), writing the invalid FP/VSX state back to registers.
* Execution is released to userspace, and the application crashes or corrupts data.

Without the pending signal, do_notify_resume() is never called, therefore the invalid register state doesn't matter as it is overwritten nearly immediately by interrupt_exit_user_prepare_main() calling restore_math() before return to userspace.

Restore fr0/vs0 after the FPSCR/VSCR store has completed, for both the fp and altivec register save paths.

Tested under QEMU in kvm mode, running on a Talos II workstation with dual POWER9 DD2.2 CPUs.
Closes: https://lore.kernel.org/all/480932026.45576726.1699374859845.javamail.zim...@raptorengineeringinc.com/
Closes: https://lore.kernel.org/linuxppc-dev/480221078.47953493.1700206777956.javamail.zim...@raptorengineeringinc.com/
Tested-by: Timothy Pearson
Tested-by: Jens Axboe
Signed-off-by: Timothy Pearson
---
 arch/powerpc/kernel/fpu.S    | 13 +++++++++++++
 arch/powerpc/kernel/vector.S |  2 ++
 2 files changed, 15 insertions(+)

diff --git a/arch/powerpc/kernel/fpu.S b/arch/powerpc/kernel/fpu.S
index 6a9acfb690c9..2f8f3f93cbb6 100644
--- a/arch/powerpc/kernel/fpu.S
+++ b/arch/powerpc/kernel/fpu.S
@@ -23,6 +23,15 @@
 #include

 #ifdef CONFIG_VSX
+#define __REST_1FPVSR(n,c,base)		\
+BEGIN_FTR_SECTION			\
+	b	2f;			\
+END_FTR_SECTION_IFSET(CPU_FTR_VSX);	\
+	REST_FPR(n,base);		\
+	b	3f;			\
+2:	REST_VSR(n,c,base);		\
+3:
+
 #define __REST_32FPVSRS(n,c,base)	\
 BEGIN_FTR_SECTION			\
 	b	2f;			\
@@ -41,9 +50,11 @@ END_FTR_SECTION_IFSET(CPU_FTR_VSX);	\
 2:	SAVE_32VSRS(n,c,base);		\
 3:
 #else
+#define __REST_1FPVSR(n,b,base)		REST_FPR(n, base)
 #define __REST_32FPVSRS(n,b,base)	REST_32FPRS(n, base)
 #define __SAVE_32FPVSRS(n,b,base)	SAVE_32FPRS(n, base)
 #endif
+#define REST_1FPVSR(n,c,base)   __REST_1FPVSR(n,__REG_##c,__REG_##base)
 #define REST_32FPVSRS(n,c,base) __REST_32FPVSRS(n,__REG_##c,__REG_##base)
 #define SAVE_32FPVSRS(n,c,base) __SAVE_32FPVSRS(n,__REG_##c,__REG_##base)

@@ -67,6 +78,7 @@ _GLOBAL(store_fp_state)
 	SAVE_32FPVSRS(0, R4, R3)
 	mffs	fr0
 	stfd	fr0,FPSTATE_FPSCR(r3)
+	REST_1FPVSR(0, R4, R3)
 	blr
 EXPORT_SYMBOL(store_fp_state)

@@ -138,4 +150,5 @@ _GLOBAL(save_fpu)
 2:	SAVE_32FPVSRS(0, R4, R6)
 	mffs	fr0
 	stfd	fr0,FPSTATE_FPSCR(r6)
+	REST_1FPVSR(0, R4, R6)
 	blr
diff --git a/arch/powerpc/kernel/vector.S b/arch/powerpc/kernel/vector.S
index 4094e4c4c77a..80b3f6e476b6 100644
--- a/arch/powerpc/kernel/vector.S
+++ b/arch/powerpc/kernel/vector.S
@@ -33,6 +33,7 @@ _GLOBAL(store_vr_state)
 	mfvscr	v0
 	li	r4, VRSTATE_VSCR
 	stvx	v0, r4, r3
+	lvx	v0, 0, r3
 	blr
 EXPORT_SYMBOL(store_vr_state)

@@ -109,6 +110,7 @@ _GLOBAL(save_altivec)
 	mfvscr	v0
 	li	r4,VRSTATE_VSCR
 	stvx	v0,r4,r7
+	lvx	v0,0,r7
 	blr

 #ifdef CONFIG_VSX
--
2.39.2
Re: [PATCH] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
- Original Message - > From: "Timothy Pearson" > To: "Jens Axboe" , "regressions" > , "Michael Ellerman" > , "npiggin" , "christophe leroy" > , "linuxppc-dev" > > Sent: Saturday, November 18, 2023 5:45:03 PM > Subject: [PATCH] powerpc: Don't clobber fr0/vs0 during fp|altivec register > save > During floating point and vector save to thread data fr0/vs0 are clobbered > by the FPSCR/VSCR store routine. This leads to userspace register corruption > and application data corruption / crash under the following rare condition: > > * A userspace thread is executing with VSX/FP mode enabled > * The userspace thread is making active use of fr0 and/or vs0 > * An IPI is taken in kernel mode, forcing the userspace thread to reschedule > * The userspace thread is interrupted by the IPI before accessing data it > previously stored in fr0/vs0 > * The thread being switched in by the IPI has a pending signal > > If these exact criteria are met, then the following sequence happens: > > * The existing thread FP storage is still valid before the IPI, due to a > prior call to save_fpu() or store_fp_state(). Note that the current > fr0/vs0 registers have been clobbered, so the FP/VSX state in registers > is now invalid pending a call to restore_fp()/restore_altivec(). > * IPI -- FP/VSX register state remains invalid > * interrupt_exit_user_prepare_main() calls do_notify_resume(), > due to the pending signal > * do_notify_resume() eventually calls save_fpu() via giveup_fpu(), which > merrily reads and saves the invalid FP/VSX state to thread local storage. > * interrupt_exit_user_prepare_main() calls restore_math(), writing the invalid > FP/VSX state back to registers. > * Execution is released to userspace, and the application crashes or corrupts > data. 
> > Without the pending signal, do_notify_resume() is never called, therefore the > invalid register state does't matter as it is overwritten nearly immeediately > by interrupt_exit_user_prepare_main() calling restore_math() before return > to userspace. > > The combination of MariaDB and io_uring is especially good at triggering data > corruption using the above sequence, see MariaDB bug MDEV-30728. > > Restore fr0/vs0 after FPSCR/VSCR store has completed for both the fp and > altivec register save paths. > > Tested under QEMU in kvm mode, running on a Talos II workstation with dual > POWER9 DD2.2 CPUs. > > Tested-by: Timothy Pearson > Signed-off-by: Timothy Pearson > --- > arch/powerpc/kernel/fpu.S| 13 + > arch/powerpc/kernel/vector.S | 4 > 2 files changed, 17 insertions(+) > > diff --git a/arch/powerpc/kernel/fpu.S b/arch/powerpc/kernel/fpu.S > index 6a9acfb690c9..2f8f3f93cbb6 100644 > --- a/arch/powerpc/kernel/fpu.S > +++ b/arch/powerpc/kernel/fpu.S > @@ -23,6 +23,15 @@ > #include > > #ifdef CONFIG_VSX > +#define __REST_1FPVSR(n,c,base) > \ > +BEGIN_FTR_SECTION\ > + b 2f; \ > +END_FTR_SECTION_IFSET(CPU_FTR_VSX); \ > + REST_FPR(n,base); \ > + b 3f; \ > +2: REST_VSR(n,c,base); \ > +3: > + > #define __REST_32FPVSRS(n,c,base) \ > BEGIN_FTR_SECTION \ > b 2f; \ > @@ -41,9 +50,11 @@ END_FTR_SECTION_IFSET(CPU_FTR_VSX); > \ > 2:SAVE_32VSRS(n,c,base); \ > 3: > #else > +#define __REST_1FPVSR(n,b,base) REST_FPR(n, base) > #define __REST_32FPVSRS(n,b,base) REST_32FPRS(n, base) > #define __SAVE_32FPVSRS(n,b,base) SAVE_32FPRS(n, base) > #endif > +#define REST_1FPVSR(n,c,base) __REST_1FPVSR(n,__REG_##c,__REG_##base) > #define REST_32FPVSRS(n,c,base) __REST_32FPVSRS(n,__REG_##c,__REG_##base) > #define SAVE_32FPVSRS(n,c,base) __SAVE_32FPVSRS(n,__REG_##c,__REG_##base) > > @@ -67,6 +78,7 @@ _GLOBAL(store_fp_state) > SAVE_32FPVSRS(0, R4, R3) > mffsfr0 > stfdfr0,FPSTATE_FPSCR(r3) > + REST_1FPVSR(0, R4, R3) > blr > EXPORT_SYMBOL(store_fp_state) > > @@ -138,4 +150,5 @@ 
_GLOBAL(save_fpu) > 2:SAVE_32FPVSRS(0, R4, R6) > mffsfr0 > stfdfr0,FPSTATE_FPSCR(r6) > + REST_1FPVSR(0, R4
[PATCH] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
During floating point and vector save to thread data fr0/vs0 are clobbered by the FPSCR/VSCR store routine. This leads to userspace register corruption and application data corruption / crash under the following rare condition:

* A userspace thread is executing with VSX/FP mode enabled
* The userspace thread is making active use of fr0 and/or vs0
* An IPI is taken in kernel mode, forcing the userspace thread to reschedule
* The userspace thread is interrupted by the IPI before accessing data it previously stored in fr0/vs0
* The thread being switched in by the IPI has a pending signal

If these exact criteria are met, then the following sequence happens:

* The existing thread FP storage is still valid before the IPI, due to a prior call to save_fpu() or store_fp_state(). Note that the current fr0/vs0 registers have been clobbered, so the FP/VSX state in registers is now invalid pending a call to restore_fp()/restore_altivec().
* IPI -- FP/VSX register state remains invalid
* interrupt_exit_user_prepare_main() calls do_notify_resume(), due to the pending signal
* do_notify_resume() eventually calls save_fpu() via giveup_fpu(), which merrily reads and saves the invalid FP/VSX state to thread local storage.
* interrupt_exit_user_prepare_main() calls restore_math(), writing the invalid FP/VSX state back to registers.
* Execution is released to userspace, and the application crashes or corrupts data.

Without the pending signal, do_notify_resume() is never called, therefore the invalid register state doesn't matter as it is overwritten nearly immediately by interrupt_exit_user_prepare_main() calling restore_math() before return to userspace.

The combination of MariaDB and io_uring is especially good at triggering data corruption using the above sequence, see MariaDB bug MDEV-30728.

Restore fr0/vs0 after the FPSCR/VSCR store has completed, for both the fp and altivec register save paths.
Tested under QEMU in kvm mode, running on a Talos II workstation with dual POWER9 DD2.2 CPUs.

Tested-by: Timothy Pearson
Signed-off-by: Timothy Pearson
---
 arch/powerpc/kernel/fpu.S    | 13 +++++++++++++
 arch/powerpc/kernel/vector.S |  4 ++++
 2 files changed, 17 insertions(+)

diff --git a/arch/powerpc/kernel/fpu.S b/arch/powerpc/kernel/fpu.S
index 6a9acfb690c9..2f8f3f93cbb6 100644
--- a/arch/powerpc/kernel/fpu.S
+++ b/arch/powerpc/kernel/fpu.S
@@ -23,6 +23,15 @@
 #include

 #ifdef CONFIG_VSX
+#define __REST_1FPVSR(n,c,base)		\
+BEGIN_FTR_SECTION			\
+	b	2f;			\
+END_FTR_SECTION_IFSET(CPU_FTR_VSX);	\
+	REST_FPR(n,base);		\
+	b	3f;			\
+2:	REST_VSR(n,c,base);		\
+3:
+
 #define __REST_32FPVSRS(n,c,base)	\
 BEGIN_FTR_SECTION			\
 	b	2f;			\
@@ -41,9 +50,11 @@ END_FTR_SECTION_IFSET(CPU_FTR_VSX);	\
 2:	SAVE_32VSRS(n,c,base);		\
 3:
 #else
+#define __REST_1FPVSR(n,b,base)		REST_FPR(n, base)
 #define __REST_32FPVSRS(n,b,base)	REST_32FPRS(n, base)
 #define __SAVE_32FPVSRS(n,b,base)	SAVE_32FPRS(n, base)
 #endif
+#define REST_1FPVSR(n,c,base)   __REST_1FPVSR(n,__REG_##c,__REG_##base)
 #define REST_32FPVSRS(n,c,base) __REST_32FPVSRS(n,__REG_##c,__REG_##base)
 #define SAVE_32FPVSRS(n,c,base) __SAVE_32FPVSRS(n,__REG_##c,__REG_##base)

@@ -67,6 +78,7 @@ _GLOBAL(store_fp_state)
 	SAVE_32FPVSRS(0, R4, R3)
 	mffs	fr0
 	stfd	fr0,FPSTATE_FPSCR(r3)
+	REST_1FPVSR(0, R4, R3)
 	blr
 EXPORT_SYMBOL(store_fp_state)

@@ -138,4 +150,5 @@ _GLOBAL(save_fpu)
 2:	SAVE_32FPVSRS(0, R4, R6)
 	mffs	fr0
 	stfd	fr0,FPSTATE_FPSCR(r6)
+	REST_1FPVSR(0, R4, R6)
 	blr
diff --git a/arch/powerpc/kernel/vector.S b/arch/powerpc/kernel/vector.S
index 4094e4c4c77a..8c63b05b421e 100644
--- a/arch/powerpc/kernel/vector.S
+++ b/arch/powerpc/kernel/vector.S
@@ -33,6 +33,8 @@ _GLOBAL(store_vr_state)
 	mfvscr	v0
 	li	r4, VRSTATE_VSCR
 	stvx	v0, r4, r3
+	li	r4, 0
+	lvx	v0, r4, r3
 	blr
 EXPORT_SYMBOL(store_vr_state)

@@ -109,6 +111,8 @@ _GLOBAL(save_altivec)
 	mfvscr	v0
 	li	r4,VRSTATE_VSCR
 	stvx	v0,r4,r7
+	li	r4,0
+	lvx	v0,r4,r7
 	blr

 #ifdef CONFIG_VSX
--
2.39.2
Re: [PATCH] powerpc: Fix data corruption on IPI
- Original Message - > From: "Timothy Pearson" > To: "Timothy Pearson" > Cc: "npiggin" , "linuxppc-dev" > > Sent: Friday, November 17, 2023 2:26:37 AM > Subject: Re: [PATCH] powerpc: Fix data corruption on IPI > - Original Message - >> From: "Timothy Pearson" >> To: "npiggin" >> Cc: "linuxppc-dev" >> Sent: Friday, November 17, 2023 2:20:29 AM >> Subject: Re: [PATCH] powerpc: Fix data corruption on IPI > >> - Original Message - >>> From: "npiggin" >>> To: "Timothy Pearson" , "Michael Ellerman" >>> >>> Cc: "linuxppc-dev" >>> Sent: Friday, November 17, 2023 2:01:12 AM >>> Subject: Re: [PATCH] powerpc: Fix data corruption on IPI >> >>> On Fri Nov 17, 2023 at 5:39 PM AEST, Timothy Pearson wrote: >>>> >>>> >>>> - Original Message - >>>> > From: "Michael Ellerman" >>>> > To: "Timothy Pearson" , "linuxppc-dev" >>>> > >>>> > Cc: "Jens Axboe" >>>> > Sent: Tuesday, November 14, 2023 6:14:37 AM >>>> > Subject: Re: [PATCH] powerpc: Fix data corruption on IPI >>>> >>>> > Hi Timothy, >>>> > >>>> > Thanks for debugging this, but I'm unclear why this is helping because >>>> > we should already have a full barrier (hwsync) on both the sending and >>>> > receiving side. >>>> > >>>> > More below. >>>> >>>> I've spent another few days poking at this, and think I might finally have >>>> something more solid in terms of what exactly is happening, but would like >>>> some >>>> feedback on the concept / how best to fix the potential problem. >>>> >>>> As background, there are several worker threads both in userspace and in >>>> kernel >>>> mode. Crucially, the main MariaDB data processing thread (the one that >>>> handles >>>> tasks like flushing dirty pages to disk) always runs on the same core as >>>> the >>>> io_uring kernel thread that picks up I/O worker creation requests and >>>> handles >>>> them via create_worker_cb(). >>>> >>>> Changes in the ~5.12 era switched away from a delayed worker setup. 
>>>> io_uring >>>> currently sets up the new process with create_io_thread(), and immediately >>>> uses >>>> an IPI to forcibly schedule the new process. Because of the way the two >>>> threads interact, the new process ends up grabbing the CPU from the running >>>> MariaDB user thread; I've never seen it schedule on a different core. If >>>> the >>>> timing is right in this process, things get trampled on in userspace and >>>> the >>>> database server either crashes or throws a corruption fault. >>>> >>>> Through extensive debugging, I've narrowed this down to invalid state in >>>> the VSX >>>> registers on return to the MariaDB user thread from the new kernel thread. >>>> For >>>> some reason, it seems we don't restore FP state on return from the >>>> PF_IO_WORKER >>>> thread, and something in the kernel was busy writing new data to them. >>>> >>>> A direct example I was able to observe is as follows: >>>> >>>> xxspltd vs0,vs0,0 <-- vs0 now zeroed out >>>> xorir9,r9,1<-- Presumably we switch to the new kernel thread >>>> here >>>> due to the IPI >>>> slwir9,r9,7<-- On userspace thread resume, vs0 now contains the >>>> value 0x8200400082004000 >>>> xxswapd vs8,vs0<-- vs8 now has the wrong value >>>> stxvd2x vs8,r3,r12 <-- userspace is now getting stepped on >>>> stw r9,116(r3) >>>> stxvd2x vs8,r3,r0 >>>> ... >>>> CRASH >>> >>> Nice find, that looks pretty conclusive. >>> >>>> This is a very difficult race to hit, but MariaDB naturally does repetitive >>>> operations with VSX registers so it does eventually fail. I ended up with >>>> a >>>> tight loop around glibc operations that use VSX to trigger the failure >>>> reliably >>>> enough to even understa
Re: [PATCH] powerpc: Fix data corruption on IPI
- Original Message - > From: "Timothy Pearson" > To: "npiggin" > Cc: "linuxppc-dev" > Sent: Friday, November 17, 2023 2:20:29 AM > Subject: Re: [PATCH] powerpc: Fix data corruption on IPI > - Original Message - >> From: "npiggin" >> To: "Timothy Pearson" , "Michael Ellerman" >> >> Cc: "linuxppc-dev" >> Sent: Friday, November 17, 2023 2:01:12 AM >> Subject: Re: [PATCH] powerpc: Fix data corruption on IPI > >> On Fri Nov 17, 2023 at 5:39 PM AEST, Timothy Pearson wrote: >>> >>> >>> - Original Message - >>> > From: "Michael Ellerman" >>> > To: "Timothy Pearson" , "linuxppc-dev" >>> > >>> > Cc: "Jens Axboe" >>> > Sent: Tuesday, November 14, 2023 6:14:37 AM >>> > Subject: Re: [PATCH] powerpc: Fix data corruption on IPI >>> >>> > Hi Timothy, >>> > >>> > Thanks for debugging this, but I'm unclear why this is helping because >>> > we should already have a full barrier (hwsync) on both the sending and >>> > receiving side. >>> > >>> > More below. >>> >>> I've spent another few days poking at this, and think I might finally have >>> something more solid in terms of what exactly is happening, but would like >>> some >>> feedback on the concept / how best to fix the potential problem. >>> >>> As background, there are several worker threads both in userspace and in >>> kernel >>> mode. Crucially, the main MariaDB data processing thread (the one that >>> handles >>> tasks like flushing dirty pages to disk) always runs on the same core as the >>> io_uring kernel thread that picks up I/O worker creation requests and >>> handles >>> them via create_worker_cb(). >>> >>> Changes in the ~5.12 era switched away from a delayed worker setup. >>> io_uring >>> currently sets up the new process with create_io_thread(), and immediately >>> uses >>> an IPI to forcibly schedule the new process. Because of the way the two >>> threads interact, the new process ends up grabbing the CPU from the running >>> MariaDB user thread; I've never seen it schedule on a different core. 
If >>> the >>> timing is right in this process, things get trampled on in userspace and the >>> database server either crashes or throws a corruption fault. >>> >>> Through extensive debugging, I've narrowed this down to invalid state in >>> the VSX >>> registers on return to the MariaDB user thread from the new kernel thread. >>> For >>> some reason, it seems we don't restore FP state on return from the >>> PF_IO_WORKER >>> thread, and something in the kernel was busy writing new data to them. >>> >>> A direct example I was able to observe is as follows: >>> >>> xxspltd vs0,vs0,0 <-- vs0 now zeroed out >>> xorir9,r9,1<-- Presumably we switch to the new kernel thread >>> here >>> due to the IPI >>> slwir9,r9,7<-- On userspace thread resume, vs0 now contains the >>> value 0x8200400082004000 >>> xxswapd vs8,vs0<-- vs8 now has the wrong value >>> stxvd2x vs8,r3,r12 <-- userspace is now getting stepped on >>> stw r9,116(r3) >>> stxvd2x vs8,r3,r0 >>> ... >>> CRASH >> >> Nice find, that looks pretty conclusive. >> >>> This is a very difficult race to hit, but MariaDB naturally does repetitive >>> operations with VSX registers so it does eventually fail. I ended up with a >>> tight loop around glibc operations that use VSX to trigger the failure >>> reliably >>> enough to even understand what was going on. >>> >>> As I am not as familiar with this part of the Linux kernel as with most >>> other >>> areas, what is the expected save/restore path for the FP/VSX registers >>> around >>> an IPI and associated forced thread switch? If restore_math() is in the >>> return >>> path, note that MSR_FP is set in regs->msr. >> >> Context switching these FP/vec registers should be pretty robust in >> general because it's not just io-uring that uses them. io-uring could >> be using some uncommon kernel code that uses the registers incorrectly >> though I guess. >> >>
Re: [PATCH] powerpc: Fix data corruption on IPI
- Original Message - > From: "npiggin" > To: "Timothy Pearson" , "Michael Ellerman" > > Cc: "linuxppc-dev" > Sent: Friday, November 17, 2023 2:01:12 AM > Subject: Re: [PATCH] powerpc: Fix data corruption on IPI > On Fri Nov 17, 2023 at 5:39 PM AEST, Timothy Pearson wrote: >> >> >> - Original Message - >> > From: "Michael Ellerman" >> > To: "Timothy Pearson" , "linuxppc-dev" >> > >> > Cc: "Jens Axboe" >> > Sent: Tuesday, November 14, 2023 6:14:37 AM >> > Subject: Re: [PATCH] powerpc: Fix data corruption on IPI >> >> > Hi Timothy, >> > >> > Thanks for debugging this, but I'm unclear why this is helping because >> > we should already have a full barrier (hwsync) on both the sending and >> > receiving side. >> > >> > More below. >> >> I've spent another few days poking at this, and think I might finally have >> something more solid in terms of what exactly is happening, but would like >> some >> feedback on the concept / how best to fix the potential problem. >> >> As background, there are several worker threads both in userspace and in >> kernel >> mode. Crucially, the main MariaDB data processing thread (the one that >> handles >> tasks like flushing dirty pages to disk) always runs on the same core as the >> io_uring kernel thread that picks up I/O worker creation requests and handles >> them via create_worker_cb(). >> >> Changes in the ~5.12 era switched away from a delayed worker setup. io_uring >> currently sets up the new process with create_io_thread(), and immediately >> uses >> an IPI to forcibly schedule the new process. Because of the way the two >> threads interact, the new process ends up grabbing the CPU from the running >> MariaDB user thread; I've never seen it schedule on a different core. If the >> timing is right in this process, things get trampled on in userspace and the >> database server either crashes or throws a corruption fault. 
>> >> Through extensive debugging, I've narrowed this down to invalid state in the >> VSX >> registers on return to the MariaDB user thread from the new kernel thread. >> For >> some reason, it seems we don't restore FP state on return from the >> PF_IO_WORKER >> thread, and something in the kernel was busy writing new data to them. >> >> A direct example I was able to observe is as follows: >> >> xxspltd vs0,vs0,0 <-- vs0 now zeroed out >> xorir9,r9,1<-- Presumably we switch to the new kernel thread here >> due to the IPI >> slwir9,r9,7<-- On userspace thread resume, vs0 now contains the >> value 0x8200400082004000 >> xxswapd vs8,vs0<-- vs8 now has the wrong value >> stxvd2x vs8,r3,r12 <-- userspace is now getting stepped on >> stw r9,116(r3) >> stxvd2x vs8,r3,r0 >> ... >> CRASH > > Nice find, that looks pretty conclusive. > >> This is a very difficult race to hit, but MariaDB naturally does repetitive >> operations with VSX registers so it does eventually fail. I ended up with a >> tight loop around glibc operations that use VSX to trigger the failure >> reliably >> enough to even understand what was going on. >> >> As I am not as familiar with this part of the Linux kernel as with most other >> areas, what is the expected save/restore path for the FP/VSX registers around >> an IPI and associated forced thread switch? If restore_math() is in the >> return >> path, note that MSR_FP is set in regs->msr. > > Context switching these FP/vec registers should be pretty robust in > general because it's not just io-uring that uses them. io-uring could > be using some uncommon kernel code that uses the registers incorrectly > though I guess. > >> >> Second question: should we even be using the VSX registers at all in kernel >> space? Is this a side effect of io_uring interacting so closely with >> userspace >> threads, or something else entirely? 
>>
>> If I can get pointed somewhat in the right direction I'm ready to develop the
>> rest of the fix for this issue...just trying to avoid another several days of
>> slogging through the source to see what it's supposed to be doing in the first
>> place. :)
>
> Kernel can use FP/VEC/VSX registers but it has to enable and disable
> explicitly. Such kernel code also should not be preemptible.
Re: [PATCH] powerpc: Fix data corruption on IPI
- Original Message - > From: "Timothy Pearson" > To: "Michael Ellerman" , "npiggin" > Cc: "linuxppc-dev" > Sent: Friday, November 17, 2023 1:39:37 AM > Subject: Re: [PATCH] powerpc: Fix data corruption on IPI > - Original Message - >> From: "Michael Ellerman" >> To: "Timothy Pearson" , "linuxppc-dev" >> >> Cc: "Jens Axboe" >> Sent: Tuesday, November 14, 2023 6:14:37 AM >> Subject: Re: [PATCH] powerpc: Fix data corruption on IPI > >> Hi Timothy, >> >> Thanks for debugging this, but I'm unclear why this is helping because >> we should already have a full barrier (hwsync) on both the sending and >> receiving side. >> >> More below. > > I've spent another few days poking at this, and think I might finally have > something more solid in terms of what exactly is happening, but would like > some > feedback on the concept / how best to fix the potential problem. > > As background, there are several worker threads both in userspace and in > kernel > mode. Crucially, the main MariaDB data processing thread (the one that > handles > tasks like flushing dirty pages to disk) always runs on the same core as the > io_uring kernel thread that picks up I/O worker creation requests and handles > them via create_worker_cb(). > > Changes in the ~5.12 era switched away from a delayed worker setup. io_uring > currently sets up the new process with create_io_thread(), and immediately > uses > an IPI to forcibly schedule the new process. Because of the way the two > threads interact, the new process ends up grabbing the CPU from the running > MariaDB user thread; I've never seen it schedule on a different core. If the > timing is right in this process, things get trampled on in userspace and the > database server either crashes or throws a corruption fault. > > Through extensive debugging, I've narrowed this down to invalid state in the > VSX > registers on return to the MariaDB user thread from the new kernel thread. 
> For > some reason, it seems we don't restore FP state on return from the > PF_IO_WORKER > thread, and something in the kernel was busy writing new data to them. > > A direct example I was able to observe is as follows: > > xxspltd vs0,vs0,0 <-- vs0 now zeroed out > xorir9,r9,1<-- Presumably we switch to the new kernel thread here > due to the IPI > slwir9,r9,7<-- On userspace thread resume, vs0 now contains the > value 0x8200400082004000 > xxswapd vs8,vs0<-- vs8 now has the wrong value > stxvd2x vs8,r3,r12 <-- userspace is now getting stepped on > stw r9,116(r3) > stxvd2x vs8,r3,r0 > ... > CRASH > > This is a very difficult race to hit, but MariaDB naturally does repetitive > operations with VSX registers so it does eventually fail. I ended up with a > tight loop around glibc operations that use VSX to trigger the failure > reliably > enough to even understand what was going on. > > As I am not as familiar with this part of the Linux kernel as with most other > areas, what is the expected save/restore path for the FP/VSX registers around > an IPI and associated forced thread switch? If restore_math() is in the > return > path, note that MSR_FP is set in regs->msr. > > Second question: should we even be using the VSX registers at all in kernel > space? Is this a side effect of io_uring interacting so closely with > userspace > threads, or something else entirely? Thinking a bit more, a third option could be if we're restoring garbage into the registers by accident. I know the I/O worker threads claim to use a "lightweight" version of kernel_clone(), and it'd be easy to have missed something important...
Re: [PATCH] powerpc: Fix data corruption on IPI
- Original Message - > From: "Michael Ellerman" > To: "Timothy Pearson" , "linuxppc-dev" > > Cc: "Jens Axboe" > Sent: Tuesday, November 14, 2023 6:14:37 AM > Subject: Re: [PATCH] powerpc: Fix data corruption on IPI > Hi Timothy, > > Thanks for debugging this, but I'm unclear why this is helping because > we should already have a full barrier (hwsync) on both the sending and > receiving side. > > More below. I've spent another few days poking at this, and think I might finally have something more solid in terms of what exactly is happening, but would like some feedback on the concept / how best to fix the potential problem. As background, there are several worker threads both in userspace and in kernel mode. Crucially, the main MariaDB data processing thread (the one that handles tasks like flushing dirty pages to disk) always runs on the same core as the io_uring kernel thread that picks up I/O worker creation requests and handles them via create_worker_cb(). Changes in the ~5.12 era switched away from a delayed worker setup. io_uring currently sets up the new process with create_io_thread(), and immediately uses an IPI to forcibly schedule the new process. Because of the way the two threads interact, the new process ends up grabbing the CPU from the running MariaDB user thread; I've never seen it schedule on a different core. If the timing is right in this process, things get trampled on in userspace and the database server either crashes or throws a corruption fault. Through extensive debugging, I've narrowed this down to invalid state in the VSX registers on return to the MariaDB user thread from the new kernel thread. For some reason, it seems we don't restore FP state on return from the PF_IO_WORKER thread, and something in the kernel was busy writing new data to them. 
A direct example I was able to observe is as follows:

xxspltd vs0,vs0,0  <-- vs0 now zeroed out
xori    r9,r9,1    <-- Presumably we switch to the new kernel thread here due to the IPI
slwi    r9,r9,7    <-- On userspace thread resume, vs0 now contains the value 0x8200400082004000
xxswapd vs8,vs0    <-- vs8 now has the wrong value
stxvd2x vs8,r3,r12 <-- userspace is now getting stepped on
stw     r9,116(r3)
stxvd2x vs8,r3,r0
...
CRASH

This is a very difficult race to hit, but MariaDB naturally does repetitive operations with VSX registers so it does eventually fail. I ended up with a tight loop around glibc operations that use VSX to trigger the failure reliably enough to even understand what was going on.

As I am not as familiar with this part of the Linux kernel as with most other areas, what is the expected save/restore path for the FP/VSX registers around an IPI and associated forced thread switch? If restore_math() is in the return path, note that MSR_FP is set in regs->msr.

Second question: should we even be using the VSX registers at all in kernel space? Is this a side effect of io_uring interacting so closely with userspace threads, or something else entirely?

If I can get pointed somewhat in the right direction I'm ready to develop the rest of the fix for this issue...just trying to avoid another several days of slogging through the source to see what it's supposed to be doing in the first place. :)

Thanks!
Re: [PATCH] powerpc: Fix data corruption on IPI
- Original Message -
> From: "Michael Ellerman"
> To: "Timothy Pearson" , "linuxppc-dev"
> Cc: "Jens Axboe"
> Sent: Tuesday, November 14, 2023 6:14:37 AM
> Subject: Re: [PATCH] powerpc: Fix data corruption on IPI
>
> Hi Timothy,
>
> Thanks for debugging this, but I'm unclear why this is helping because
> we should already have a full barrier (hwsync) on both the sending and
> receiving side.
>
> More below.

So it looks like we might be dealing with a couple of potentially separate issues here, not sure. This is probably the most infuriating bug I've run across in my career, so please bear with me -- with the amount of active test kernels I have installed at any one time I'm occasionally infecting one with the tests from another. ;)

First, I'm not 100% convinced the code in try_to_wake_up() is race-free on ppc64, since adding a memory barrier between it and the call to kick_process() significantly reduces the frequency of the on-disk corruption. Can you take a look and see if anything stands out as to why that would be important?

The second part of this though is that the barrier only reduces the frequency of the corruption, it does not eliminate the corruption. After some consolidation, what completely eliminates the corruption is a combination of:

* Switching to TWA_SIGNAL_NO_IPI in task_work_add()
* Adding udelay(1000) in io_uring/rw.c:io_write(), right before the call to io_rw_init_file()

Adding a memory barrier instead of the udelay() doesn't work, nor does adding the delay without switching to NO_IPI. [1]

Keeping TWA_SIGNAL and adding the barrier instruction also works, but this is conceptually simpler to understand as to why it would have an effect at all
[PATCH] powerpc: Fix data corruption on IPI
From 0b2678b7cdada1a3d9aec8626f31a988d81373fa Mon Sep 17 00:00:00 2001
From: Timothy Pearson
Date: Mon, 13 Nov 2023 22:42:58 -0600
Subject: [PATCH] powerpc: Fix data corruption on IPI

On multithreaded SMP workloads such as those using io_uring, it is possible for multiple threads to hold an inconsistent view of system memory when an IPI is issued. This in turn leads to userspace memory corruption with varying degrees of probability based on workload and inter-thread timing.

io_uring provokes this bug by its use of TWA_SIGNAL during thread creation, which is especially noticeable as significant userspace data corruption with certain workloads such as MariaDB (bug MDEV-30728). While using TWA_SIGNAL_NO_IPI works around the corruption, no other architecture requires this workaround.

Issue an lwsync barrier instruction prior to sending the IPI. This ensures the receiving CPU has a consistent view of system memory, in line with other architectures.

Tested under QEMU in kvm mode, running on a Talos II workstation with dual POWER9 DD2.2 CPUs.

Tested-by: Timothy Pearson
Signed-off-by: Timothy Pearson
---
 arch/powerpc/kernel/smp.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index ab691c89d787..ba42238de518 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -369,8 +369,10 @@ static inline void do_message_pass(int cpu, int msg)
 
 void arch_smp_send_reschedule(int cpu)
 {
-	if (likely(smp_ops))
+	if (likely(smp_ops)) {
+		__smp_lwsync();
 		do_message_pass(cpu, PPC_MSG_RESCHEDULE);
+	}
 }
 
 EXPORT_SYMBOL_GPL(arch_smp_send_reschedule);
-- 
2.39.2
Re: [PATCH] powerpc: Fix data corruption on IPI
- Original Message - > From: "Salvatore Bonaccorso" > To: "Timothy Pearson" > Cc: "Linuxppc-dev" > , "Jens > Axboe" > , "regressions" , "Michael > Ellerman" , "npiggin" > , "christophe leroy" > Sent: Tuesday, November 14, 2023 1:59:14 AM > Subject: Re: [PATCH] powerpc: Fix data corruption on IPI > On Mon, Nov 13, 2023 at 10:56:09PM -0600, Timothy Pearson wrote: >> >From 0b2678b7cdada1a3d9aec8626f31a988d81373fa Mon Sep 17 00:00:00 2001 >> From: Timothy Pearson >> Date: Mon, 13 Nov 2023 22:42:58 -0600 >> Subject: [PATCH] powerpc: Fix data corruption on IPI >> >> On multithreaded SMP workloads such as those using io_uring, it is possible >> for >> multiple threads to hold an inconsistent view of system memory when an IPI is >> issued. This in turn leads to userspace memory corruption with varying >> degrees >> of probability based on workload and inter-thread timing. >> >> io_uring provokes this bug by its use of TWA_SIGNAL during thread creation, >> which is especially noticeable as significant userspace data corruption with >> certain workloads such as MariaDB (bug MDEV-30728). While using >> TWA_SIGNAL_NO_IPI works around the corruption, no other architecture requires >> this workaround. >> >> Issue an lwsync barrier instruction prior to sending the IPI. This ensures >> the receiving CPU has a consistent view of system memory, in line with other >> architectures. >> >> Tested under QEMU in kvm mode, running on a Talos II workstation with dual >> POWER9 DD2.2 CPUs. 
>> >> Tested-by: Timothy Pearson >> Signed-off-by: Timothy Pearson >> --- >> arch/powerpc/kernel/smp.c | 4 +++- >> 1 file changed, 3 insertions(+), 1 deletion(-) >> >> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c >> index ab691c89d787..ba42238de518 100644 >> --- a/arch/powerpc/kernel/smp.c >> +++ b/arch/powerpc/kernel/smp.c >> @@ -369,8 +369,10 @@ static inline void do_message_pass(int cpu, int msg) >> >> void arch_smp_send_reschedule(int cpu) >> { >> -if (likely(smp_ops)) >> +if (likely(smp_ops)) { >> +__smp_lwsync(); >> do_message_pass(cpu, PPC_MSG_RESCHEDULE); >> +} >> } >> EXPORT_SYMBOL_GPL(arch_smp_send_reschedule); > > Once this is accepted in mainline, can you ensure that it get > backported to the needed relevant stable series? (Should it be CC'ed > as well for stable@?). Absolutely! We've been blocked on kernel upgrades for production database systems for a while due to this particular bug. > For context, and maybe worth adding a Link: reference as well this is > hit in Debian in https://bugs.debian.org/1032104 Sounds good. If anyone here needs a v2 with that line added just let me know. Thanks!
Re: linux-next: Tree for Jun 2 (arch/powerpc/kernel/iommu.c)
- Original Message - > From: "Randy Dunlap" > To: "Timothy Pearson" , "Michael Ellerman" > > Cc: "Stephen Rothwell" , "Linux Next Mailing List" > , "linux-kernel" > , "linuxppc-dev" > , "Alexey Kardashevskiy" > Sent: Thursday, June 15, 2023 11:00:08 AM > Subject: Re: linux-next: Tree for Jun 2 (arch/powerpc/kernel/iommu.c) > Hi Timothy, > > On 6/3/23 20:57, Timothy Pearson wrote: >> >> >> - Original Message - >>> From: "Michael Ellerman" >>> To: "Randy Dunlap" , "Stephen Rothwell" >>> , "Linux Next Mailing List" >>> >>> Cc: "linux-kernel" , "linuxppc-dev" >>> , "Alexey >>> Kardashevskiy" , "Timothy Pearson" >>> >>> Sent: Saturday, June 3, 2023 7:22:51 PM >>> Subject: Re: linux-next: Tree for Jun 2 (arch/powerpc/kernel/iommu.c) >> >>> Randy Dunlap writes: >>>> On 6/1/23 21:01, Stephen Rothwell wrote: >>>>> Hi all, >>>>> >>>>> Changes since 20230601: >>>>> >>>> >>>> On powerpc64, a randconfig failed with: >>>> >>>> In file included from ../include/linux/list.h:5, >>>> from ../include/linux/preempt.h:11, >>>> from ../include/linux/spinlock.h:56, >>>> from ../include/linux/mmzone.h:8, >>>> from ../include/linux/gfp.h:7, >>>> from ../include/linux/slab.h:15, >>>> from ../arch/powerpc/kernel/iommu.c:15: >>>> ../arch/powerpc/kernel/iommu.c: In function >>>> 'spapr_tce_setup_phb_iommus_initcall': >>>> ../arch/powerpc/kernel/iommu.c:1391:36: error: 'hose_list' undeclared >>>> (first use >>>> in this function); did you mean 'zonelist'? >>>> 1391 | list_for_each_entry(hose, &hose_list, list_node) { >>>> |^ >>> ... >>> >>> hose_list is in pci-common.c which is built when PCI=y. >>> >>> PSERIES and POWERNV force PCI=y. >>> >>> But this config has neither: >>> >>> # CONFIG_PPC_POWERNV is not set >>> # CONFIG_PPC_PSERIES is not set >>> CONFIG_HAVE_PCI=y >>> # CONFIG_PCI is not set >>> # CONFIG_COMMON_CLK_RS9_PCIE is not set >>> >>> >>> Probably the spapr_tce code should be wrapped in an #ifdef that is only >>> enabled when POWERNV || PSERIES is enabled. 
>>> >>> cheers >> >> Sounds reasonable, I was going to look into this further over the weekend. I >> can put together a patch for Monday if that works? > > Did you prepare a patch for this? I am still seeing this build error. > > thanks. > -- > ~Randy Yes, it was sent in to the linuxppc-dev list some weeks ago. Did it not arrive?
[PATCH] powerpc/iommu: Only build sPAPR access functions on pSeries and PowerNV

A build failure with CONFIG_HAVE_PCI=y but neither PSERIES nor POWERNV set was caught by the random configuration checker. Guard the sPAPR-specific IOMMU functions on CONFIG_PPC_PSERIES || CONFIG_PPC_POWERNV.

Signed-off-by: Timothy Pearson
---
 arch/powerpc/kernel/iommu.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 67f0b01e6ff5..c52449ae6936 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1090,6 +1090,7 @@ void iommu_tce_kill(struct iommu_table *tbl,
 }
 EXPORT_SYMBOL_GPL(iommu_tce_kill);
 
+#if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV)
 static int iommu_take_ownership(struct iommu_table *tbl)
 {
 	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
@@ -1140,6 +1141,7 @@ static void iommu_release_ownership(struct iommu_table *tbl)
 		spin_unlock(&tbl->pools[i].lock);
 	spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
 }
+#endif
 
 int iommu_add_device(struct iommu_table_group *table_group, struct device *dev)
 {
@@ -1171,6 +1173,7 @@ int iommu_add_device(struct iommu_table_group *table_group, struct device *dev)
 }
 EXPORT_SYMBOL_GPL(iommu_add_device);
 
+#if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV)
 /*
  * A simple iommu_table_group_ops which only allows reusing the existing
  * iommu_table. This handles VFIO for POWER7 or the nested KVM.
  */
@@ -1398,5 +1401,6 @@ static int __init spapr_tce_setup_phb_iommus_initcall(void)
 	return 0;
 }
 postcore_initcall_sync(spapr_tce_setup_phb_iommus_initcall);
+#endif
 
 #endif /* CONFIG_IOMMU_API */
-- 
2.30.2
Re: linux-next: Tree for Jun 2 (arch/powerpc/kernel/iommu.c)
- Original Message - > From: "Michael Ellerman" > To: "Randy Dunlap" , "Stephen Rothwell" > , "Linux Next Mailing List" > > Cc: "linux-kernel" , "linuxppc-dev" > , "Alexey > Kardashevskiy" , "Timothy Pearson" > > Sent: Saturday, June 3, 2023 7:22:51 PM > Subject: Re: linux-next: Tree for Jun 2 (arch/powerpc/kernel/iommu.c) > Randy Dunlap writes: >> On 6/1/23 21:01, Stephen Rothwell wrote: >>> Hi all, >>> >>> Changes since 20230601: >>> >> >> On powerpc64, a randconfig failed with: >> >> In file included from ../include/linux/list.h:5, >> from ../include/linux/preempt.h:11, >> from ../include/linux/spinlock.h:56, >> from ../include/linux/mmzone.h:8, >> from ../include/linux/gfp.h:7, >> from ../include/linux/slab.h:15, >> from ../arch/powerpc/kernel/iommu.c:15: >> ../arch/powerpc/kernel/iommu.c: In function >> 'spapr_tce_setup_phb_iommus_initcall': >> ../arch/powerpc/kernel/iommu.c:1391:36: error: 'hose_list' undeclared (first >> use >> in this function); did you mean 'zonelist'? >> 1391 | list_for_each_entry(hose, &hose_list, list_node) { >> |^ > ... > > hose_list is in pci-common.c which is built when PCI=y. > > PSERIES and POWERNV force PCI=y. > > But this config has neither: > > # CONFIG_PPC_POWERNV is not set > # CONFIG_PPC_PSERIES is not set > CONFIG_HAVE_PCI=y > # CONFIG_PCI is not set > # CONFIG_COMMON_CLK_RS9_PCIE is not set > > > Probably the spapr_tce code should be wrapped in an #ifdef that is only > enabled when POWERNV || PSERIES is enabled. > > cheers Sounds reasonable, I was going to look into this further over the weekend. I can put together a patch for Monday if that works?
Re: [PATCH v2 0/4] Reenable VFIO support on POWER systems
- Original Message - > From: "Michael Ellerman" > To: "Timothy Pearson" , "Timothy Pearson" > > Cc: "kvm" , "linuxppc-dev" > > Sent: Tuesday, March 21, 2023 5:33:57 AM > Subject: Re: [PATCH v2 0/4] Reenable VFIO support on POWER systems > Timothy Pearson writes: >> - Original Message - >>> From: "Timothy Pearson" >>> To: "Michael Ellerman" >>> Cc: "Timothy Pearson" , "kvm" >>> , "linuxppc-dev" >>> >>> Sent: Thursday, March 9, 2023 1:28:20 PM >>> Subject: Re: [PATCH v2 0/4] Reenable VFIO support on POWER systems >> >>> - Original Message - >>>> From: "Michael Ellerman" >>>> To: "Timothy Pearson" , "kvm" >>>> >>>> Cc: "linuxppc-dev" >>>> Sent: Thursday, March 9, 2023 5:40:01 AM >>>> Subject: Re: [PATCH v2 0/4] Reenable VFIO support on POWER systems >>> >>>> Timothy Pearson writes: >>>>> This patch series reenables VFIO support on POWER systems. It >>>>> is based on Alexey Kardashevskiys's patch series, rebased and >>>>> successfully tested under QEMU with a Marvell PCIe SATA controller >>>>> on a POWER9 Blackbird host. >>>>> >>>>> Alexey Kardashevskiy (3): >>>>> powerpc/iommu: Add "borrowing" iommu_table_group_ops >>>>> powerpc/pci_64: Init pcibios subsys a bit later >>>>> powerpc/iommu: Add iommu_ops to report capabilities and allow blocking >>>>> domains >>>> >>>> As sent the patches had lost Alexey's authorship (no From: line), I >>>> fixed it up when applying so the first 3 are authored by Alexey. >>>> >>>> cheers >>> >>> Thanks for catching that, it wasn't intentional. Probably used a wrong Git >>> command... >> >> Just wanted to touch base on the patches, since they're still listed as Under >> Review on patchwork. Are we good to go for the 6.4 merge window? > > They've been in my next (and so linux-next), since last week. I just > haven't updated patchwork yet. > > So yeah they are on track to go into mainline during the v6.4 merge window. > > cheers Sounds great, thanks! 
Saw them in the next tree but wasn't sure if the patchwork status was more reflective of overall status.
Re: [PATCH v2 0/4] Reenable VFIO support on POWER systems
- Original Message - > From: "Timothy Pearson" > To: "Michael Ellerman" > Cc: "Timothy Pearson" , "kvm" > , "linuxppc-dev" > > Sent: Thursday, March 9, 2023 1:28:20 PM > Subject: Re: [PATCH v2 0/4] Reenable VFIO support on POWER systems > - Original Message - >> From: "Michael Ellerman" >> To: "Timothy Pearson" , "kvm" >> >> Cc: "linuxppc-dev" >> Sent: Thursday, March 9, 2023 5:40:01 AM >> Subject: Re: [PATCH v2 0/4] Reenable VFIO support on POWER systems > >> Timothy Pearson writes: >>> This patch series reenables VFIO support on POWER systems. It >>> is based on Alexey Kardashevskiys's patch series, rebased and >>> successfully tested under QEMU with a Marvell PCIe SATA controller >>> on a POWER9 Blackbird host. >>> >>> Alexey Kardashevskiy (3): >>> powerpc/iommu: Add "borrowing" iommu_table_group_ops >>> powerpc/pci_64: Init pcibios subsys a bit later >>> powerpc/iommu: Add iommu_ops to report capabilities and allow blocking >>> domains >> >> As sent the patches had lost Alexey's authorship (no From: line), I >> fixed it up when applying so the first 3 are authored by Alexey. >> >> cheers > > Thanks for catching that, it wasn't intentional. Probably used a wrong Git > command... Just wanted to touch base on the patches, since they're still listed as Under Review on patchwork. Are we good to go for the 6.4 merge window? Thanks!
Re: [PATCH v2 0/4] Reenable VFIO support on POWER systems
- Original Message - > From: "Michael Ellerman" > To: "Timothy Pearson" , "kvm" > > Cc: "linuxppc-dev" > Sent: Thursday, March 9, 2023 5:40:01 AM > Subject: Re: [PATCH v2 0/4] Reenable VFIO support on POWER systems > Timothy Pearson writes: >> This patch series reenables VFIO support on POWER systems. It >> is based on Alexey Kardashevskiys's patch series, rebased and >> successfully tested under QEMU with a Marvell PCIe SATA controller >> on a POWER9 Blackbird host. >> >> Alexey Kardashevskiy (3): >> powerpc/iommu: Add "borrowing" iommu_table_group_ops >> powerpc/pci_64: Init pcibios subsys a bit later >> powerpc/iommu: Add iommu_ops to report capabilities and allow blocking >> domains > > As sent the patches had lost Alexey's authorship (no From: line), I > fixed it up when applying so the first 3 are authored by Alexey. > > cheers Thanks for catching that, it wasn't intentional. Probably used a wrong Git command...
Re: [PATCH v2 0/4] Reenable VFIO support on POWER systems
- Original Message - > From: "Alex Williamson" > To: "Timothy Pearson" > Cc: "kvm" , "linuxppc-dev" > > Sent: Monday, March 6, 2023 6:59:41 PM > Subject: Re: [PATCH v2 0/4] Reenable VFIO support on POWER systems > On Mon, 6 Mar 2023 18:35:22 -0600 (CST) > Timothy Pearson wrote: > >> ----- Original Message - >> > From: "Alex Williamson" >> > To: "Timothy Pearson" >> > Cc: "kvm" , "linuxppc-dev" >> > >> > Sent: Monday, March 6, 2023 5:46:07 PM >> > Subject: Re: [PATCH v2 0/4] Reenable VFIO support on POWER systems >> >> > On Mon, 6 Mar 2023 11:29:53 -0600 (CST) >> > Timothy Pearson wrote: >> > >> >> This patch series reenables VFIO support on POWER systems. It >> >> is based on Alexey Kardashevskiys's patch series, rebased and >> >> successfully tested under QEMU with a Marvell PCIe SATA controller >> >> on a POWER9 Blackbird host. >> >> >> >> Alexey Kardashevskiy (3): >> >> powerpc/iommu: Add "borrowing" iommu_table_group_ops >> >> powerpc/pci_64: Init pcibios subsys a bit later >> >> powerpc/iommu: Add iommu_ops to report capabilities and allow blocking >> >> domains >> >> >> >> Timothy Pearson (1): >> >> Add myself to MAINTAINERS for Power VFIO support >> >> >> >> MAINTAINERS | 5 + >> >> arch/powerpc/include/asm/iommu.h | 6 +- >> >> arch/powerpc/include/asm/pci-bridge.h | 7 + >> >> arch/powerpc/kernel/iommu.c | 246 +- >> >> arch/powerpc/kernel/pci_64.c | 2 +- >> >> arch/powerpc/platforms/powernv/pci-ioda.c | 36 +++- >> >> arch/powerpc/platforms/pseries/iommu.c| 27 +++ >> >> arch/powerpc/platforms/pseries/pseries.h | 4 + >> >> arch/powerpc/platforms/pseries/setup.c| 3 + >> >> drivers/vfio/vfio_iommu_spapr_tce.c | 96 ++--- >> >> 10 files changed, 338 insertions(+), 94 deletions(-) >> >> >> > >> > For vfio and MAINTAINERS portions, >> > >> > Acked-by: Alex Williamson >> > >> > I'll note though that spapr_tce_take_ownership() looks like it copied a >> > bug from the old tce_iommu_take_ownership() where tbl and tbl->it_map >> > are tested before calling 
iommu_take_ownership() but not in the unwind >> > loop, ie. tables we might have skipped on setup are unconditionally >> > released on unwind. Thanks, >> > >> > Alex >> >> Thanks for that. I'll put together a patch to get rid of that >> potential bug that can be applied after this series is merged, unless >> you'd rather I resubmit a v3 with the issue fixed? > > Follow-up fix is fine by me. Thanks, > > Alex Just sent that patch in. Thanks!
[PATCH] Check for IOMMU table validity in error handler
If tce_iommu_take_ownership is unable to take ownership of a specific IOMMU table, the unwinder in the error handler could attempt to release ownership of an invalid table. Check validity of each table in the unwinder before attempting to release ownership.

Thanks to Alex Williamson for the initial observation!

Signed-off-by: Timothy Pearson
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 60a50ce8701e..c012ecb42ebc 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -1219,10 +1219,15 @@ static int tce_iommu_take_ownership(struct tce_container *container,
 
 		rc = iommu_take_ownership(tbl);
 		if (rc) {
-			for (j = 0; j < i; ++j)
-				iommu_release_ownership(
-						table_group->tables[j]);
+			for (j = 0; j < i; ++j) {
+				struct iommu_table *tbl =
+					table_group->tables[j];
+
+				if (!tbl || !tbl->it_map)
+					continue;
+
+				iommu_release_ownership(table_group->tables[j]);
+			}
 			return rc;
 		}
 	}
-- 
2.30.2
Re: [PATCH v2 0/4] Reenable VFIO support on POWER systems
- Original Message - > From: "Alex Williamson" > To: "Timothy Pearson" > Cc: "kvm" , "linuxppc-dev" > > Sent: Monday, March 6, 2023 5:46:07 PM > Subject: Re: [PATCH v2 0/4] Reenable VFIO support on POWER systems > On Mon, 6 Mar 2023 11:29:53 -0600 (CST) > Timothy Pearson wrote: > >> This patch series reenables VFIO support on POWER systems. It >> is based on Alexey Kardashevskiys's patch series, rebased and >> successfully tested under QEMU with a Marvell PCIe SATA controller >> on a POWER9 Blackbird host. >> >> Alexey Kardashevskiy (3): >> powerpc/iommu: Add "borrowing" iommu_table_group_ops >> powerpc/pci_64: Init pcibios subsys a bit later >> powerpc/iommu: Add iommu_ops to report capabilities and allow blocking >> domains >> >> Timothy Pearson (1): >> Add myself to MAINTAINERS for Power VFIO support >> >> MAINTAINERS | 5 + >> arch/powerpc/include/asm/iommu.h | 6 +- >> arch/powerpc/include/asm/pci-bridge.h | 7 + >> arch/powerpc/kernel/iommu.c | 246 +- >> arch/powerpc/kernel/pci_64.c | 2 +- >> arch/powerpc/platforms/powernv/pci-ioda.c | 36 +++- >> arch/powerpc/platforms/pseries/iommu.c| 27 +++ >> arch/powerpc/platforms/pseries/pseries.h | 4 + >> arch/powerpc/platforms/pseries/setup.c| 3 + >> drivers/vfio/vfio_iommu_spapr_tce.c | 96 ++--- >> 10 files changed, 338 insertions(+), 94 deletions(-) >> > > For vfio and MAINTAINERS portions, > > Acked-by: Alex Williamson > > I'll note though that spapr_tce_take_ownership() looks like it copied a > bug from the old tce_iommu_take_ownership() where tbl and tbl->it_map > are tested before calling iommu_take_ownership() but not in the unwind > loop, ie. tables we might have skipped on setup are unconditionally > released on unwind. Thanks, > > Alex Thanks for that. I'll put together a patch to get rid of that potential bug that can be applied after this series is merged, unless you'd rather I resubmit a v3 with the issue fixed?
[PATCH v2 0/4] Reenable VFIO support on POWER systems
This patch series reenables VFIO support on POWER systems. It is based on Alexey Kardashevskiy's patch series, rebased and successfully tested under QEMU with a Marvell PCIe SATA controller on a POWER9 Blackbird host.

Alexey Kardashevskiy (3):
  powerpc/iommu: Add "borrowing" iommu_table_group_ops
  powerpc/pci_64: Init pcibios subsys a bit later
  powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains

Timothy Pearson (1):
  Add myself to MAINTAINERS for Power VFIO support

 MAINTAINERS                               | 5 +
 arch/powerpc/include/asm/iommu.h          | 6 +-
 arch/powerpc/include/asm/pci-bridge.h     | 7 +
 arch/powerpc/kernel/iommu.c               | 246 +-
 arch/powerpc/kernel/pci_64.c              | 2 +-
 arch/powerpc/platforms/powernv/pci-ioda.c | 36 +++-
 arch/powerpc/platforms/pseries/iommu.c    | 27 +++
 arch/powerpc/platforms/pseries/pseries.h  | 4 +
 arch/powerpc/platforms/pseries/setup.c    | 3 +
 drivers/vfio/vfio_iommu_spapr_tce.c       | 96 ++---
 10 files changed, 338 insertions(+), 94 deletions(-)

-- 
2.30.2
[PATCH v2 1/4] powerpc/iommu: Add "borrowing" iommu_table_group_ops
PPC64 IOMMU API defines iommu_table_group_ops which handles DMA windows for PEs: control the ownership, create/set/unset a table the hardware for dynamic DMA windows (DDW). VFIO uses the API to implement support on POWER. So far only PowerNV IODA2 (POWER8 and newer machines) implemented this and other cases (POWER7 or nested KVM) did not and instead reused existing iommu_table structs. This means 1) no DDW 2) ownership transfer is done directly in the VFIO SPAPR TCE driver. Soon POWER is going to get its own iommu_ops and ownership control is going to move there. This implements spapr_tce_table_group_ops which borrows iommu_table tables. The upside is that VFIO needs to know less about POWER. The new ops returns the existing table from create_table() and only checks if the same window is already set. This is only going to work if the default DMA window starts table_group.tce32_start and as big as pe->table_group.tce32_size (not the case for IODA2+ PowerNV). This changes iommu_table_group_ops::take_ownership() to return an error if borrowing a table failed. This should not cause any visible change in behavior for PowerNV. pSeries was not that well tested/supported anyway. Signed-off-by: Alexey Kardashevskiy Signed-off-by: Timothy Pearson --- arch/powerpc/include/asm/iommu.h | 6 +- arch/powerpc/kernel/iommu.c | 98 ++- arch/powerpc/platforms/powernv/pci-ioda.c | 6 +- arch/powerpc/platforms/pseries/iommu.c| 3 + drivers/vfio/vfio_iommu_spapr_tce.c | 94 -- 5 files changed, 121 insertions(+), 86 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 7e29c73e3dd4..678b5bdc79b1 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -175,7 +175,7 @@ struct iommu_table_group_ops { long (*unset_window)(struct iommu_table_group *table_group, int num); /* Switch ownership from platform code to external user (e.g. 
VFIO) */ - void (*take_ownership)(struct iommu_table_group *table_group); + long (*take_ownership)(struct iommu_table_group *table_group); /* Switch ownership from external user (e.g. VFIO) back to core */ void (*release_ownership)(struct iommu_table_group *table_group); }; @@ -215,6 +215,8 @@ extern long iommu_tce_xchg_no_kill(struct mm_struct *mm, enum dma_data_direction *direction); extern void iommu_tce_kill(struct iommu_table *tbl, unsigned long entry, unsigned long pages); + +extern struct iommu_table_group_ops spapr_tce_table_group_ops; #else static inline void iommu_register_group(struct iommu_table_group *table_group, int pci_domain_number, @@ -303,8 +305,6 @@ extern int iommu_tce_check_gpa(unsigned long page_shift, iommu_tce_check_gpa((tbl)->it_page_shift, (gpa))) extern void iommu_flush_tce(struct iommu_table *tbl); -extern int iommu_take_ownership(struct iommu_table *tbl); -extern void iommu_release_ownership(struct iommu_table *tbl); extern enum dma_data_direction iommu_tce_direction(unsigned long tce); extern unsigned long iommu_direction_to_tce_perm(enum dma_data_direction dir); diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index ee95937bdaf1..f9f5a9418092 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -1086,7 +1086,7 @@ void iommu_tce_kill(struct iommu_table *tbl, } EXPORT_SYMBOL_GPL(iommu_tce_kill); -int iommu_take_ownership(struct iommu_table *tbl) +static int iommu_take_ownership(struct iommu_table *tbl) { unsigned long flags, i, sz = (tbl->it_size + 7) >> 3; int ret = 0; @@ -1118,9 +1118,8 @@ int iommu_take_ownership(struct iommu_table *tbl) return ret; } -EXPORT_SYMBOL_GPL(iommu_take_ownership); -void iommu_release_ownership(struct iommu_table *tbl) +static void iommu_release_ownership(struct iommu_table *tbl) { unsigned long flags, i, sz = (tbl->it_size + 7) >> 3; @@ -1137,7 +1136,6 @@ void iommu_release_ownership(struct iommu_table *tbl) spin_unlock(&tbl->pools[i].lock); 
spin_unlock_irqrestore(&tbl->large_pool.lock, flags); } -EXPORT_SYMBOL_GPL(iommu_release_ownership); int iommu_add_device(struct iommu_table_group *table_group, struct device *dev) { @@ -1179,4 +1177,96 @@ void iommu_del_device(struct device *dev) iommu_group_remove_device(dev); } EXPORT_SYMBOL_GPL(iommu_del_device); + +/* + * A simple iommu_table_group_ops which only allows reusing the existing + * iommu_table. This handles VFIO for POWER7 or the nested KVM. + * The ops does not allow creating windows and only allows reusing the existing + * one if it matches table_group->tce32_start/tce32_size/page_shift. + */ +static unsigned long spapr_tce_get_table_size(__u32 page_shift, +
[PATCH v2 2/4] powerpc/pci_64: Init pcibios subsys a bit later
The following patches are going to add dependency/use of iommu_ops, which is initialized in subsys_initcall as well. This moves pcibios_init() to the next initcall level.

This should not cause any behavioral change.

Signed-off-by: Alexey Kardashevskiy
Signed-off-by: Timothy Pearson
---
 arch/powerpc/kernel/pci_64.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/pci_64.c b/arch/powerpc/kernel/pci_64.c
index fd42059ae2a5..e27342ef128b 100644
--- a/arch/powerpc/kernel/pci_64.c
+++ b/arch/powerpc/kernel/pci_64.c
@@ -73,7 +73,7 @@ static int __init pcibios_init(void)
 	return 0;
 }
 
-subsys_initcall(pcibios_init);
+subsys_initcall_sync(pcibios_init);
 
 int pcibios_unmap_io_space(struct pci_bus *bus)
 {
-- 
2.30.2
[PATCH v2] KVM: PPC: Make KVM_CAP_IRQFD_RESAMPLE support platform dependent

When introduced, IRQFD resampling worked on POWER8 with XICS. However KVM on POWER9 has never implemented it - the compatibility mode code ("XICS-on-XIVE") misses the kvm_notify_acked_irq() call and the native XIVE mode does not handle INTx in KVM at all.

This moves the capability support advertising to platforms and stops advertising it on XIVE, i.e. POWER9 and later.

Signed-off-by: Alexey Kardashevskiy
Signed-off-by: Timothy Pearson
---
 arch/arm64/kvm/arm.c       | 3 +++
 arch/mips/kvm/mips.c       | 3 +++
 arch/powerpc/kvm/powerpc.c | 6 ++++++
 arch/riscv/kvm/vm.c        | 3 +++
 arch/s390/kvm/kvm-s390.c   | 3 +++
 arch/x86/kvm/x86.c         | 3 +++
 virt/kvm/kvm_main.c        | 1 -
 7 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 3bd732eaf087..0ad50969430a 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -220,6 +220,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_VCPU_ATTRIBUTES:
 	case KVM_CAP_PTP_KVM:
 	case KVM_CAP_ARM_SYSTEM_SUSPEND:
+#ifdef CONFIG_HAVE_KVM_IRQFD
+	case KVM_CAP_IRQFD_RESAMPLE:
+#endif
 		r = 1;
 		break;
 	case KVM_CAP_SET_GUEST_DEBUG2:
diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
index 36c8991b5d39..52bdc479875d 100644
--- a/arch/mips/kvm/mips.c
+++ b/arch/mips/kvm/mips.c
@@ -1046,6 +1046,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_READONLY_MEM:
 	case KVM_CAP_SYNC_MMU:
 	case KVM_CAP_IMMEDIATE_EXIT:
+#ifdef CONFIG_HAVE_KVM_IRQFD
+	case KVM_CAP_IRQFD_RESAMPLE:
+#endif
 		r = 1;
 		break;
 	case KVM_CAP_NR_VCPUS:
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 4c5405fc5538..d23e25e8432d 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -576,6 +576,12 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		break;
 #endif
 
+#ifdef CONFIG_HAVE_KVM_IRQFD
+	case KVM_CAP_IRQFD_RESAMPLE:
+		r = !xive_enabled();
+		break;
+#endif
+
 	case KVM_CAP_PPC_ALLOC_HTAB:
 		r = hv_enabled;
 		break;
diff --git a/arch/riscv/kvm/vm.c b/arch/riscv/kvm/vm.c
index 65a964d7e70d..0ef7a6168018 100644
--- a/arch/riscv/kvm/vm.c
+++ b/arch/riscv/kvm/vm.c
@@ -65,6 +65,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_READONLY_MEM:
 	case KVM_CAP_MP_STATE:
 	case KVM_CAP_IMMEDIATE_EXIT:
+#ifdef CONFIG_HAVE_KVM_IRQFD
+	case KVM_CAP_IRQFD_RESAMPLE:
+#endif
 		r = 1;
 		break;
 	case KVM_CAP_NR_VCPUS:
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 39b36562c043..6ca84bfdd2dc 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -573,6 +573,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_S390_VCPU_RESETS:
 	case KVM_CAP_SET_GUEST_DEBUG:
 	case KVM_CAP_S390_DIAG318:
+#ifdef CONFIG_HAVE_KVM_IRQFD
+	case KVM_CAP_IRQFD_RESAMPLE:
+#endif
 		r = 1;
 		break;
 	case KVM_CAP_SET_GUEST_DEBUG2:
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7713420abab0..891aeace811e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4432,6 +4432,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_VAPIC:
 	case KVM_CAP_ENABLE_CAP:
 	case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
+#ifdef CONFIG_HAVE_KVM_IRQFD
+	case KVM_CAP_IRQFD_RESAMPLE:
+#endif
 		r = 1;
 		break;
 	case KVM_CAP_EXIT_HYPERCALL:
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d255964ec331..b1679d08a216 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4479,7 +4479,6 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 #endif
 #ifdef CONFIG_HAVE_KVM_IRQFD
 	case KVM_CAP_IRQFD:
-	case KVM_CAP_IRQFD_RESAMPLE:
 #endif
 	case KVM_CAP_IOEVENTFD_ANY_LENGTH:
 	case KVM_CAP_CHECK_EXTENSION_VM:
-- 
2.30.2
[PATCH v2 4/4] Add myself to MAINTAINERS for Power VFIO support
Signed-off-by: Timothy Pearson
---
 MAINTAINERS | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 8d5bc223f305..876f96e82d66 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9836,6 +9836,11 @@ F:	drivers/crypto/vmx/ghash*
 F:	drivers/crypto/vmx/ppc-xlate.pl
 F:	drivers/crypto/vmx/vmx.c
 
+IBM Power VFIO Support
+M:	Timothy Pearson
+S:	Supported
+F:	drivers/vfio/vfio_iommu_spapr_tce.c
+
 IBM ServeRAID RAID DRIVER
 S:	Orphan
 F:	drivers/scsi/ips.*
-- 
2.30.2
[PATCH v2 3/4] powerpc/iommu: Add iommu_ops to report capabilities and
allow blocking domains Up until now PPC64 managed to avoid using iommu_ops. The VFIO driver uses a SPAPR TCE sub-driver and all iommu_ops uses were kept in the Type1 VFIO driver. Recent development added 2 uses of iommu_ops to the generic VFIO which broke POWER: - a coherency capability check; - blocking IOMMU domain - iommu_group_dma_owner_claimed()/... This adds a simple iommu_ops which reports support for cache coherency and provides a basic support for blocking domains. No other domain types are implemented so the default domain is NULL. Since now iommu_ops controls the group ownership, this takes it out of VFIO. This adds an IOMMU device into a pci_controller (=PHB) and registers it in the IOMMU subsystem, iommu_ops is registered at this point. This setup is done in postcore_initcall_sync. This replaces iommu_group_add_device() with iommu_probe_device() as the former misses necessary steps in connecting PCI devices to IOMMU devices. This adds a comment about why explicit iommu_probe_device() is still needed. 
The previous discussion is here: https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20220707135552.3688927-1-...@ozlabs.ru/ https://patchwork.ozlabs.org/project/kvm-ppc/patch/20220701061751.1955857-1-...@ozlabs.ru/ Fixes: e8ae0e140c05 ("vfio: Require that devices support DMA cache coherence") Fixes: 70693f470848 ("vfio: Set DMA ownership for VFIO devices") Cc: Deming Wang Cc: Robin Murphy Cc: Jason Gunthorpe Cc: Alex Williamson Cc: Daniel Henrique Barboza Cc: Fabiano Rosas Cc: Murilo Opsfelder Araujo Cc: Nicholas Piggin Co-authored-by: Timothy Pearson Signed-off-by: Alexey Kardashevskiy Signed-off-by: Timothy Pearson --- arch/powerpc/include/asm/pci-bridge.h | 7 + arch/powerpc/kernel/iommu.c | 148 +- arch/powerpc/platforms/powernv/pci-ioda.c | 30 + arch/powerpc/platforms/pseries/iommu.c| 24 arch/powerpc/platforms/pseries/pseries.h | 4 + arch/powerpc/platforms/pseries/setup.c| 3 + drivers/vfio/vfio_iommu_spapr_tce.c | 8 -- 7 files changed, 214 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h index 71c1d26f2400..2aa3a091ef20 100644 --- a/arch/powerpc/include/asm/pci-bridge.h +++ b/arch/powerpc/include/asm/pci-bridge.h @@ -8,6 +8,7 @@ #include #include #include +#include struct device_node; @@ -44,6 +45,9 @@ struct pci_controller_ops { #endif void(*shutdown)(struct pci_controller *hose); + + struct iommu_group *(*device_group)(struct pci_controller *hose, + struct pci_dev *pdev); }; /* @@ -131,6 +135,9 @@ struct pci_controller { struct irq_domain *dev_domain; struct irq_domain *msi_domain; struct fwnode_handle*fwnode; + + /* iommu_ops support */ + struct iommu_device iommu; }; /* These are used for config access before all the PCI probing diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index f9f5a9418092..b42e202af3bc 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -35,6 +35,7 @@ #include #include #include +#include #define DBG(...) 
@@ -1156,8 +1157,14 @@ int iommu_add_device(struct iommu_table_group *table_group, struct device *dev) pr_debug("%s: Adding %s to iommu group %d\n", __func__, dev_name(dev), iommu_group_id(table_group->group)); - - return iommu_group_add_device(table_group->group, dev); + /* +* This is still not adding devices via the IOMMU bus notifier because +* of pcibios_init() from arch/powerpc/kernel/pci_64.c which calls +* pcibios_scan_phb() first (and this guy adds devices and triggers +* the notifier) and only then it calls pci_bus_add_devices() which +* configures DMA for buses which also creates PEs and IOMMU groups. +*/ + return iommu_probe_device(dev); } EXPORT_SYMBOL_GPL(iommu_add_device); @@ -1237,6 +1244,7 @@ static long spapr_tce_take_ownership(struct iommu_table_group *table_group) rc = iommu_take_ownership(tbl); if (!rc) continue; + for (j = 0; j < i; ++j) iommu_release_ownership(table_group->tables[j]); return rc; @@ -1269,4 +1277,140 @@ struct iommu_table_group_ops spapr_tce_table_group_ops = { .release_ownership = spapr_tce_release_ownership, }; +/* + * A simple iommu_ops to allow less cruft in generic VFIO code. + */ +static int spapr_tce_blocking_iommu_attach_dev(struct iommu_domain *dom, + struct device *dev) +{ + struct iommu_group *grp = iommu_group_get(dev); + struct iommu_table_group *table_group; + int ret = -EINVAL; + + if (!grp) +
Re: [PATCH] drm/amdgpu: Re-enable DCN for 64-bit powerpc
- Original Message - > From: "Linus Torvalds" > To: "Michael Ellerman" > Cc: "linuxppc-dev" , "Alex Deucher" > , "amd-gfx" > , li...@roeck-us.net, "linux-kernel" > , "Dan Horák" > , "Timothy Pearson" > Sent: Monday, July 25, 2022 2:19:57 PM > Subject: Re: [PATCH] drm/amdgpu: Re-enable DCN for 64-bit powerpc > On Mon, Jul 25, 2022 at 5:39 AM Michael Ellerman wrote: >> >> Further digging shows that the build failures only occur with compilers >> that default to 64-bit long double. > > Where the heck do we have 'long double' things anywhere in the kernel? > > I tried to grep for it, and failed miserably. I found some constants > that would qualify, but they were in the v4l colorspaces-details.rst > doc file. > > Strange. We don't, at least not that I can see. The affected code uses standard doubles. What I'm wondering is if the compiler is getting confused between standard and long doubles when they are both the same bit length...
Re: [PATCH RFC 3/5] powerpc/speculation: Add support for 'cpu_spec_mitigations=' cmdline options
Will be joining in ~ 5 mins. Getting Chromium set up here. - Original Message - > From: "Jiri Kosina" > To: "Josh Poimboeuf" > Cc: "Peter Zijlstra" , "Heiko Carstens" > , "Paul Mackerras" > , "H . Peter Anvin" , "Ingo Molnar" > , "Andrea Arcangeli" > , linux-s...@vger.kernel.org, x...@kernel.org, "Will > Deacon" , "Linus > Torvalds" , "Catalin Marinas" > , "Waiman Long" > , linux-a...@vger.kernel.org, "Jon Masters" > , "Borislav Petkov" , > "Andy Lutomirski" , "Thomas Gleixner" , > linux-arm-ker...@lists.infradead.org, > "Greg Kroah-Hartman" , > linux-ker...@vger.kernel.org, "Tyler Hicks" , > "Martin Schwidefsky" , linuxppc-dev@lists.ozlabs.org > Sent: Thursday, April 4, 2019 2:49:05 PM > Subject: Re: [PATCH RFC 3/5] powerpc/speculation: Add support for > 'cpu_spec_mitigations=' cmdline options > On Thu, 4 Apr 2019, Josh Poimboeuf wrote: > >> Configure powerpc CPU runtime speculation bug mitigations in accordance >> with the 'cpu_spec_mitigations=' cmdline options. This affects >> Meltdown, Spectre v1, Spectre v2, and Speculative Store Bypass. > [ ... snip ... ] >> -if (!no_nospec) >> +if (!no_nospec && cpu_spec_mitigations != CPU_SPEC_MITIGATIONS_OFF) > > '!no_nospec' is something that I am sure will come back to hunt me in my > bad dreams. > > But that's been there already, and fixing it is out of scope of this > patch. Other than that, as discussed previously -- I really like this new > global option. Feel free to add > > Reviewed-by: Jiri Kosina > > for the whole set. > > Thanks, > > -- > Jiri Kosina > SUSE Labs
Re: [PATCH 0/7] Add initial version of "cognitive DMA"
When should we be targeting merge?  At this point this is a substantial
improvement over currently shipping kernels for our systems, and we don't
really want to have to ship a patched / custom OS kernel if we can avoid it.

On 06/24/2018 08:09 PM, Russell Currey wrote:
> On Sat, 2018-06-23 at 18:52 -0500, Timothy Pearson wrote:
>
> There's still more to do and this shouldn't be merged yet - would
> encourage anyone with suitable hardware to test though.
>
>> POWER9 (PHB4) requires all peripherals using DMA to be either
>> restricted to 32-bit windows or capable of accessing the entire 64
>> bits of memory space. Some devices, such as most GPUs, can only
>> address up to a certain number of bits (approximately 40, in many
>> cases), while at the same time it is highly desirable to use a larger
>> DMA space than the fallback 32 bits.
>>
>> This series adds something called "cognitive DMA", which is a form of
>> dynamic TCE allocation. This allows the peripheral to DMA to host
>> addresses mapped in 1G (PHB4) or 256M (PHB3) chunks, and is
>> transparent to the peripheral and its driver stack.
>>
>> This series has been tested on a Talos II server with a Radeon WX4100
>> and a wide range of OpenGL applications. While there is still work,
>> notably involving what happens if a peripheral attempts to DMA close
>> to a TCE window boundary, this series greatly improves functionality
>> for AMD GPUs on POWER9 devices over the existing 32-bit DMA support.
>>
>> Russell Currey (4):
>>   powerpc/powernv/pci: Track largest available TCE order per PHB
>>   powerpc/powernv: DMA operations for discontiguous allocation
>>   powerpc/powernv/pci: Track DMA and TCE tables in debugfs
>>   powerpc/powernv/pci: Safety fixes for pseudobypass TCE allocation
>>
>> Timothy Pearson (3):
>>   powerpc/powernv/pci: Export pnv_pci_ioda2_tce_invalidate_pe
>>   powerpc/powernv/pci: Invalidate TCE cache after DMA map setup
>>   powerpc/powernv/pci: Don't use the lower 4G TCEs in pseudo-DMA mode
>>
>>  arch/powerpc/include/asm/dma-mapping.h    |   1 +
>>  arch/powerpc/platforms/powernv/Makefile   |   2 +-
>>  arch/powerpc/platforms/powernv/pci-dma.c  | 320 ++
>>  arch/powerpc/platforms/powernv/pci-ioda.c | 169 
>>  arch/powerpc/platforms/powernv/pci.h      |  11 +
>>  5 files changed, 452 insertions(+), 51 deletions(-)
>>  create mode 100644 arch/powerpc/platforms/powernv/pci-dma.c

-- 
Timothy Pearson
Raptor Engineering
+1 (415) 727-8645 (direct line)
+1 (512) 690-0200 (switchboard)
https://www.raptorengineering.com
[PATCH 4/7] powerpc/powernv/pci: Safety fixes for pseudobypass TCE
allocation

Signed-off-by: Russell Currey
---
 arch/powerpc/platforms/powernv/pci-dma.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-dma.c b/arch/powerpc/platforms/powernv/pci-dma.c
index 1d5409be343e..237940a2a052 100644
--- a/arch/powerpc/platforms/powernv/pci-dma.c
+++ b/arch/powerpc/platforms/powernv/pci-dma.c
@@ -29,8 +29,9 @@ static int dma_pseudo_bypass_select_tce(struct pnv_ioda_pe *pe, phys_addr_t addr
 {
 	int tce;
 	__be64 old, new;
+	unsigned long flags;
 
-	spin_lock(&pe->tce_alloc_lock);
+	spin_lock_irqsave(&pe->tce_alloc_lock, flags);
 	tce = bitmap_find_next_zero_area(pe->tce_bitmap,
 					 pe->tce_count,
 					 0,
@@ -40,9 +41,10 @@ static int dma_pseudo_bypass_select_tce(struct pnv_ioda_pe *pe, phys_addr_t addr
 	old = pe->tces[tce];
 	new = cpu_to_be64(addr | TCE_PCI_READ | TCE_PCI_WRITE);
 	pe->tces[tce] = new;
+	mb();
 	pe_info(pe, "allocating TCE %i 0x%016llx (old 0x%016llx)\n",
 		tce, new, old);
-	spin_unlock(&pe->tce_alloc_lock);
+	spin_unlock_irqrestore(&pe->tce_alloc_lock, flags);
 
 	return tce;
 }
-- 
2.17.1
[PATCH 3/7] powerpc/powernv/pci: Track DMA and TCE tables in debugfs
Add a new debugfs entry to trigger dumping out the tracking table and TCEs for a given PE, for example PE 0x4 of PHB 2: echo 0x4 > /sys/kernel/debug/powerpc/PCI0002/sketchy This will result in the table being dumped out in dmesg. Signed-off-by: Russell Currey --- arch/powerpc/platforms/powernv/pci-ioda.c | 43 +++ 1 file changed, 43 insertions(+) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 7ecc186493ca..55f0f7b885bc 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -3342,6 +3342,47 @@ static int pnv_pci_diag_data_set(void *data, u64 val) DEFINE_SIMPLE_ATTRIBUTE(pnv_pci_diag_data_fops, NULL, pnv_pci_diag_data_set, "%llu\n"); +static int pnv_pci_sketchy_set(void *data, u64 val) +{ + struct pci_controller *hose; + struct pnv_ioda_pe *pe; + struct pnv_phb *phb; + u64 entry1, entry2; + int i; + + hose = (struct pci_controller *)data; + if (!hose || !hose->private_data) + return -ENODEV; + + phb = hose->private_data; + pe = &phb->ioda.pe_array[val]; + + if (!pe) + return -EINVAL; + + if (!pe->tces || !pe->tce_tracker) + return -EIO; + + for (i = 0; i < pe->tce_count; i++) { + if (i > 16 && pe->tces[i] == 0) + break; + pr_info("%3d: %016llx\n", i, be64_to_cpu(pe->tces[i])); + } + + for (i = 0; i < pe->tce_count; i++) { + entry1 = pe->tce_tracker[i * 2]; + entry2 = pe->tce_tracker[i * 2 + 1]; + if (!entry1) + break; + pr_info("%3d: %016llx %016llx\n", i, entry1, entry2); + } + return 0; +} + +DEFINE_SIMPLE_ATTRIBUTE(pnv_pci_sketchy_fops, NULL, + pnv_pci_sketchy_set, "%llu\n"); + + #endif /* CONFIG_DEBUG_FS */ static void pnv_pci_ioda_create_dbgfs(void) @@ -3367,6 +3408,8 @@ static void pnv_pci_ioda_create_dbgfs(void) debugfs_create_file("dump_diag_regs", 0200, phb->dbgfs, hose, &pnv_pci_diag_data_fops); + debugfs_create_file("sketchy", 0200, phb->dbgfs, hose, + &pnv_pci_sketchy_fops); } #endif /* CONFIG_DEBUG_FS */ } -- 2.17.1
[PATCH 7/7] powerpc/powernv/pci: Don't use the lower 4G TCEs in
pseudo-DMA mode

Four TCEs are reserved for legacy 32-bit DMA mappings in pseudo DMA mode.
Mark these with an invalid address to avoid their use by the TCE cache
mapper.

Signed-off-by: Timothy Pearson
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index a6097dd323f8..e8a1333f6b3e 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1783,7 +1783,7 @@ static bool pnv_pci_ioda_pe_single_vendor(struct pnv_ioda_pe *pe)
 
 static int pnv_pci_pseudo_bypass_setup(struct pnv_ioda_pe *pe)
 {
-	u64 tce_count, table_size, window_size;
+	u64 i, tce_count, table_size, window_size;
 	struct pnv_phb *p = pe->phb;
 	struct page *table_pages;
 	__be64 *tces;
@@ -1835,6 +1835,12 @@ static int pnv_pci_pseudo_bypass_setup(struct pnv_ioda_pe *pe)
 	/* mark the first 4GB as reserved so this can still be used for 32bit */
 	bitmap_set(pe->tce_bitmap, 0, 1ULL << (32 - p->ioda.max_tce_order));
 
+	/* make sure reserved first 4GB TCEs are not used by the mapper
+	 * set each address to -1, which will never match an incoming request
+	 */
+	for (i = 0; i < 4; i++)
+		pe->tce_tracker[i * 2] = -1;
+
 	pe_info(pe, "pseudo-bypass sizes: tracker %d bitmap %d TCEs %lld\n",
 		tracker_entries, bitmap_size, tce_count);
 
-- 
2.17.1
[PATCH 5/7] powerpc/powernv/pci: Export
pnv_pci_ioda2_tce_invalidate_pe

Pseudo DMA support requires a method to invalidate the TCE cache.
Export pnv_pci_ioda2_tce_invalidate_pe for use by the pseudo DMA mapper.

Signed-off-by: Timothy Pearson
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 2 +-
 arch/powerpc/platforms/powernv/pci.h      | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 55f0f7b885bc..a6097dd323f8 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2102,7 +2102,7 @@ static void pnv_pci_phb3_tce_invalidate(struct pnv_ioda_pe *pe, bool rm,
 	}
 }
 
-static inline void pnv_pci_ioda2_tce_invalidate_pe(struct pnv_ioda_pe *pe)
+void pnv_pci_ioda2_tce_invalidate_pe(struct pnv_ioda_pe *pe)
 {
 	struct pnv_phb *phb = pe->phb;
 
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 83492aba90f1..8d3849e76be3 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -264,6 +264,7 @@ extern void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
 /* Nvlink functions */
 extern void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass);
 extern void pnv_pci_ioda2_tce_invalidate_entire(struct pnv_phb *phb, bool rm);
+extern void pnv_pci_ioda2_tce_invalidate_pe(struct pnv_ioda_pe *pe);
 extern struct pnv_ioda_pe *pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe);
 extern long pnv_npu_set_window(struct pnv_ioda_pe *npe, int num,
 		struct iommu_table *tbl);
-- 
2.17.1
[PATCH 6/7] powerpc/powernv/pci: Invalidate TCE cache after DMA map
setup

Per the IODA2 specification, TCEs must be invalidated after their settings
have been changed. Invalidate the cache after the address is changed
during TCE allocation when using pseudo DMA.

Signed-off-by: Timothy Pearson
---
 arch/powerpc/platforms/powernv/pci-dma.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-dma.c b/arch/powerpc/platforms/powernv/pci-dma.c
index 237940a2a052..060dbc168401 100644
--- a/arch/powerpc/platforms/powernv/pci-dma.c
+++ b/arch/powerpc/platforms/powernv/pci-dma.c
@@ -42,8 +42,7 @@ static int dma_pseudo_bypass_select_tce(struct pnv_ioda_pe *pe, phys_addr_t addr
 	new = cpu_to_be64(addr | TCE_PCI_READ | TCE_PCI_WRITE);
 	pe->tces[tce] = new;
 	mb();
-	pe_info(pe, "allocating TCE %i 0x%016llx (old 0x%016llx)\n",
-		tce, new, old);
+	pnv_pci_ioda2_tce_invalidate_pe(pe);
 	spin_unlock_irqrestore(&pe->tce_alloc_lock, flags);
 
 	return tce;
-- 
2.17.1
[PATCH 1/7] powerpc/powernv/pci: Track largest available TCE order
per PHB

Knowing the largest possible TCE size of a PHB is useful, so get it out
of the device tree. This relies on the property being added in OPAL.

It is assumed that any PHB4 or later machine would be running firmware
that implemented this property, and otherwise assumed to be PHB3, which
has a maximum TCE order of 28 bits or 256MB TCEs.

This is used later in the series.

Signed-off-by: Russell Currey
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 16 
 arch/powerpc/platforms/powernv/pci.h      |  3 +++
 2 files changed, 19 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 5bd0eb6681bc..bcb3bfce072a 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -3873,11 +3873,13 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 	struct resource r;
 	const __be64 *prop64;
 	const __be32 *prop32;
+	struct property *prop;
 	int len;
 	unsigned int segno;
 	u64 phb_id;
 	void *aux;
 	long rc;
+	u32 val;
 
 	if (!of_device_is_available(np))
 		return;
@@ -4016,6 +4018,20 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 	}
 	phb->ioda.pe_array = aux + pemap_off;
 
+	phb->ioda.max_tce_order = 0;
+	// Get TCE order from the DT.  If it's not present, assume P8
+	if (!of_get_property(np, "ibm,supported-tce-sizes", NULL)) {
+		phb->ioda.max_tce_order = 28; // assume P8 256mb TCEs
+	} else {
+		of_property_for_each_u32(np, "ibm,supported-tce-sizes", prop,
+					 prop32, val) {
+			if (val > phb->ioda.max_tce_order)
+				phb->ioda.max_tce_order = val;
+		}
+		pr_debug("PHB%llx Found max TCE order of %d bits\n",
+			 phb->opal_id, phb->ioda.max_tce_order);
+	}
+
 	/*
 	 * Choose PE number for root bus, which shouldn't have
 	 * M64 resources consumed by its child devices. To pick
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index eada4b6068cb..c9952def5e93 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -173,6 +173,9 @@ struct pnv_phb {
 		struct list_head	pe_list;
 		struct mutex		pe_list_mutex;
 
+		/* Largest supported TCE order bits */
+		uint8_t			max_tce_order;
+
 		/* Reverse map of PEs, indexed by {bus, devfn} */
 		unsigned int		pe_rmap[0x10000];
 	} ioda;
-- 
2.17.1
[PATCH 2/7] powerpc/powernv: DMA operations for discontiguous
allocation

Cognitive DMA is a new set of DMA operations that solve some issues for
devices that want to address more than 32 bits but can't address the 59
bits required to enable direct DMA.

The previous implementation for POWER8/PHB3 worked around this by
configuring a bypass from the default 32-bit address space into 64-bit
address space. This approach does not work for POWER9/PHB4 because
regions of memory are discontiguous and many devices will be unable to
address memory beyond the first node.

Instead, implement a new set of DMA operations that allocate TCEs as DMA
mappings are requested so that all memory is addressable even when a
one-to-one mapping between real addresses and DMA addresses isn't
possible. These TCEs are the maximum size available on the platform,
which is 256M on PHB3 and 1G on PHB4.

Devices can now map any region of memory up to the maximum amount they
can address according to the DMA mask set, in chunks of the largest
available TCE size.

This implementation replaces the need for the existing PHB3 solution and
should be compatible with future PHB versions.
Signed-off-by: Russell Currey --- arch/powerpc/include/asm/dma-mapping.h| 1 + arch/powerpc/platforms/powernv/Makefile | 2 +- arch/powerpc/platforms/powernv/pci-dma.c | 319 ++ arch/powerpc/platforms/powernv/pci-ioda.c | 102 +++ arch/powerpc/platforms/powernv/pci.h | 7 + 5 files changed, 381 insertions(+), 50 deletions(-) create mode 100644 arch/powerpc/platforms/powernv/pci-dma.c diff --git a/arch/powerpc/include/asm/dma-mapping.h b/arch/powerpc/include/asm/dma-mapping.h index 8fa394520af6..354f435160f3 100644 --- a/arch/powerpc/include/asm/dma-mapping.h +++ b/arch/powerpc/include/asm/dma-mapping.h @@ -74,6 +74,7 @@ static inline unsigned long device_to_mask(struct device *dev) extern struct dma_map_ops dma_iommu_ops; #endif extern const struct dma_map_ops dma_nommu_ops; +extern const struct dma_map_ops dma_pseudo_bypass_ops; static inline const struct dma_map_ops *get_arch_dma_ops(struct bus_type *bus) { diff --git a/arch/powerpc/platforms/powernv/Makefile b/arch/powerpc/platforms/powernv/Makefile index 703a350a7f4e..2467bdab3c13 100644 --- a/arch/powerpc/platforms/powernv/Makefile +++ b/arch/powerpc/platforms/powernv/Makefile @@ -6,7 +6,7 @@ obj-y += opal-msglog.o opal-hmi.o opal-power.o opal-irqchip.o obj-y += opal-kmsg.o opal-powercap.o opal-psr.o opal-sensor-groups.o obj-$(CONFIG_SMP) += smp.o subcore.o subcore-asm.o -obj-$(CONFIG_PCI) += pci.o pci-ioda.o npu-dma.o +obj-$(CONFIG_PCI) += pci.o pci-ioda.o npu-dma.o pci-dma.o obj-$(CONFIG_CXL_BASE) += pci-cxl.o obj-$(CONFIG_EEH) += eeh-powernv.o obj-$(CONFIG_PPC_SCOM) += opal-xscom.o diff --git a/arch/powerpc/platforms/powernv/pci-dma.c b/arch/powerpc/platforms/powernv/pci-dma.c new file mode 100644 index ..1d5409be343e --- /dev/null +++ b/arch/powerpc/platforms/powernv/pci-dma.c @@ -0,0 +1,319 @@ +/* + * DMA operations supporting pseudo-bypass for PHB3+ + * + * Author: Russell Currey + * + * Copyright 2018 IBM Corporation. 
+ * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License as published by the + * Free Software Foundation; either version 2 of the License, or (at your + * option) any later version. + */ + +#include +#include +#include +#include +#include + +#include +#include +#include +#include + +#include "pci.h" + +/* select and allocate a TCE using the bitmap */ +static int dma_pseudo_bypass_select_tce(struct pnv_ioda_pe *pe, phys_addr_t addr) +{ + int tce; + __be64 old, new; + + spin_lock(&pe->tce_alloc_lock); + tce = bitmap_find_next_zero_area(pe->tce_bitmap, +pe->tce_count, +0, +1, +0); + bitmap_set(pe->tce_bitmap, tce, 1); + old = pe->tces[tce]; + new = cpu_to_be64(addr | TCE_PCI_READ | TCE_PCI_WRITE); + pe->tces[tce] = new; + pe_info(pe, "allocating TCE %i 0x%016llx (old 0x%016llx)\n", + tce, new, old); + spin_unlock(&pe->tce_alloc_lock); + + return tce; +} + +/* + * The tracking table for assigning TCEs has two entries per TCE. + * - @entry1 contains the physical address and the smallest bit indicates + * if it's currently valid. + * - @entry2 contains the DMA address returned in the upper 34 bits, and a + * refcount in the lower 30 bits. + */ +static dma_addr_t dma_pseudo_bypass_get_address(struct device *dev, + phys_addr_t addr) +{ + struct pci_dev *pdev = container_of(dev, struct pci_dev, dev); + struct pci_controller *hose = pci_bus_to_host(pdev->bus); + struct pnv_phb *phb = hose->private_data; + struct pnv_ioda_pe *pe; +
[PATCH 0/7] Add initial version of "cognitive DMA"
POWER9 (PHB4) requires all peripherals using DMA to be either restricted
to 32-bit windows or capable of accessing the entire 64 bits of memory
space. Some devices, such as most GPUs, can only address up to a certain
number of bits (approximately 40, in many cases), while at the same time
it is highly desirable to use a larger DMA space than the fallback 32
bits.

This series adds something called "cognitive DMA", which is a form of
dynamic TCE allocation. This allows the peripheral to DMA to host
addresses mapped in 1G (PHB4) or 256M (PHB3) chunks, and is transparent
to the peripheral and its driver stack.

This series has been tested on a Talos II server with a Radeon WX4100
and a wide range of OpenGL applications. While there is still work,
notably involving what happens if a peripheral attempts to DMA close to
a TCE window boundary, this series greatly improves functionality for
AMD GPUs on POWER9 devices over the existing 32-bit DMA support.

Russell Currey (4):
  powerpc/powernv/pci: Track largest available TCE order per PHB
  powerpc/powernv: DMA operations for discontiguous allocation
  powerpc/powernv/pci: Track DMA and TCE tables in debugfs
  powerpc/powernv/pci: Safety fixes for pseudobypass TCE allocation

Timothy Pearson (3):
  powerpc/powernv/pci: Export pnv_pci_ioda2_tce_invalidate_pe
  powerpc/powernv/pci: Invalidate TCE cache after DMA map setup
  powerpc/powernv/pci: Don't use the lower 4G TCEs in pseudo-DMA mode

 arch/powerpc/include/asm/dma-mapping.h    |   1 +
 arch/powerpc/platforms/powernv/Makefile   |   2 +-
 arch/powerpc/platforms/powernv/pci-dma.c  | 320 ++
 arch/powerpc/platforms/powernv/pci-ioda.c | 169 
 arch/powerpc/platforms/powernv/pci.h      |  11 +
 5 files changed, 452 insertions(+), 51 deletions(-)
 create mode 100644 arch/powerpc/platforms/powernv/pci-dma.c

-- 
2.17.1