Re: [PATCH] mm/memtest: Add ARCH_USE_MEMTEST
On Thu, Feb 4, 2021 at 8:10 PM Anshuman Khandual wrote:
> early_memtest() does not get called from all architectures. Hence enabling
> CONFIG_MEMTEST and providing a valid memtest=[1..N] kernel command line
> option might not trigger the memory pattern tests as would be expected in
> normal circumstances. This situation is misleading.
>
> The change here prevents the above mentioned problem after introducing a
> new config option ARCH_USE_MEMTEST that should be subscribed on platforms
> that call early_memtest(), in order to enable the config CONFIG_MEMTEST.
> Conversely CONFIG_MEMTEST cannot be enabled on platforms where it would
> not be tested anyway.
>
> Cc: Russell King
> Cc: Catalin Marinas
> Cc: Will Deacon
> Cc: Thomas Bogendoerfer
> Cc: Michael Ellerman
> Cc: Benjamin Herrenschmidt
> Cc: Paul Mackerras
> Cc: Thomas Gleixner
> Cc: Ingo Molnar
> Cc: Chris Zankel
> Cc: Max Filippov
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-m...@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-xte...@linux-xtensa.org
> Cc: linux...@kvack.org
> Cc: linux-ker...@vger.kernel.org
> Signed-off-by: Anshuman Khandual
> ---
> This patch applies on v5.11-rc6 and has been tested on arm64 platform. But
> it has been just build tested on all other platforms.
>
>  arch/arm/Kconfig     | 1 +
>  arch/arm64/Kconfig   | 1 +
>  arch/mips/Kconfig    | 1 +
>  arch/powerpc/Kconfig | 1 +
>  arch/x86/Kconfig     | 1 +
>  arch/xtensa/Kconfig  | 1 +
>  lib/Kconfig.debug    | 9 ++++++++-
>  7 files changed, 14 insertions(+), 1 deletion(-)

Anshuman, entries in arch/*/Kconfig files are sorted in alphabetical order,
please keep them that way.

Reviewed-by: Max Filippov

--
Thanks.
--
Max
Re: [PATCH] arch:powerpc simple_write_to_buffer return check
Please provide some description of the change. And please clarify the patch
subject, because as far as I can see, the return is already checked, although
the check seems wrong.

On 04/02/2021 at 19:16, Mayank Suman wrote:

Signed-off-by: Mayank Suman
---
 arch/powerpc/kernel/eeh.c                    | 8 ++++----
 arch/powerpc/platforms/powernv/eeh-powernv.c | 4 ++--
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 813713c9120c..2dbe1558a71f 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -1628,8 +1628,8 @@ static ssize_t eeh_force_recover_write(struct file *filp,
 	char buf[20];
 	int ret;
 
-	ret = simple_write_to_buffer(buf, sizeof(buf), ppos, user_buf, count);
-	if (!ret)
+	ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, count);
+	if (ret <= 0)
 		return -EFAULT;

Why return -EFAULT when the function has returned -EINVAL? And why is it
-EFAULT when ret is 0? EFAULT means error accessing memory.

 	/*
@@ -1696,7 +1696,7 @@ static ssize_t eeh_dev_check_write(struct file *filp,
 	memset(buf, 0, sizeof(buf));
 	ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, count);
-	if (!ret)
+	if (ret <= 0)
 		return -EFAULT;
 
 	ret = sscanf(buf, "%x:%x:%x.%x", &domain, &bus, &dev, &fn);
@@ -1836,7 +1836,7 @@ static ssize_t eeh_dev_break_write(struct file *filp,
 	memset(buf, 0, sizeof(buf));
 	ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, count);
-	if (!ret)
+	if (ret <= 0)
 		return -EFAULT;
 
 	ret = sscanf(buf, "%x:%x:%x.%x", &domain, &bus, &dev, &fn);
diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
index 89e22c460ebf..36ed2b8f7375 100644
--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
@@ -76,8 +76,8 @@ static ssize_t pnv_eeh_ei_write(struct file *filp,
 		return -ENXIO;
 
 	/* Copy over argument buffer */
-	ret = simple_write_to_buffer(buf, sizeof(buf), ppos, user_buf, count);
-	if (!ret)
+	ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, count);
+	if (ret <= 0)
 		return -EFAULT;
 
 	/* Retrieve parameters */
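For reference, the error handling Christophe is pointing at would propagate
the helper's own error code instead of collapsing everything into -EFAULT. A
minimal sketch of that shape (not the final patch):

	ssize_t ret;

	memset(buf, 0, sizeof(buf));
	ret = simple_write_to_buffer(buf, sizeof(buf) - 1, ppos, user_buf, count);
	if (ret < 0)
		return ret;	/* propagate -EFAULT/-EINVAL from the helper */
	if (ret == 0)
		return -EINVAL;	/* zero-length write: nothing to parse */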
Re: [PATCH v2 1/1] powerpc/kvm: Save Timebase Offset to fix sched_clock() while running guest code.
Hey Nick, thanks for reviewing :)

On Fri, 2021-02-05 at 16:28 +1000, Nicholas Piggin wrote:
> Excerpts from Leonardo Bras's message of February 5, 2021 4:06 pm:
> > Before guest entry, TBU40 register is changed to reflect guest timebase.
> > After exiting guest, the register is reverted to its original value.
> >
> > If one tries to get the timestamp from host between those changes, it
> > will present an incorrect value.
> >
> > An example would be trying to add a tracepoint in
> > kvmppc_guest_entry_inject_int(), which depending on last tracepoint
> > acquired could actually cause the host to crash.
> >
> > Save the Timebase Offset to PACA and use it on sched_clock() to always
> > get the correct timestamp.
>
> Ouch. Not sure how reasonable it is to half switch into guest registers
> and expect to call into the wider kernel, fixing things up as we go.
> What if mftb is used in other places?

IIUC, the CPU is not supposed to call anything as host between guest entry
and guest exit, except guest-related cases, like
kvmppc_guest_entry_inject_int(), but anyway, if something calls mftb it will
still get the same value as before.

This is only supposed to change stuff that depends on sched_clock, like
tracepoints, that can happen in those exceptions.

> Especially as it doesn't seem like there is a reason that function _has_
> to be called after the timebase is switched to guest, that's just how
> the code is structured.

Correct, but if called, like in rb routines, used by tracepoints, the
difference between last tb and current (lower) tb may cause the CPU to trap
a PROGRAM exception, crashing the host.

> As a local hack to work out a bug okay. If you really need it upstream
> could you put it under a debug config option?

You mean something that is automatically selected whenever those configs
are enabled?
CONFIG_TRACEPOINT && CONFIG_KVM_BOOK3S_HANDLER && CONFIG_PPC_BOOK3S_64

Or something the user needs to select himself in menuconfig?

> Thanks,
> Nick

Thank you!
Leonardo Bras

> > Signed-off-by: Leonardo Bras
> > Suggested-by: Paul Mackerras
> > ---
> > Changes since v1:
> > - Subtracts offset only when CONFIG_KVM_BOOK3S_HANDLER and
> >   CONFIG_PPC_BOOK3S_64 are defined.
> > ---
> >  arch/powerpc/include/asm/kvm_book3s_asm.h | 1 +
> >  arch/powerpc/kernel/asm-offsets.c         | 1 +
> >  arch/powerpc/kernel/time.c                | 8 +++++++-
> >  arch/powerpc/kvm/book3s_hv.c              | 2 ++
> >  arch/powerpc/kvm/book3s_hv_rmhandlers.S   | 2 ++
> >  5 files changed, 13 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/powerpc/include/asm/kvm_book3s_asm.h b/arch/powerpc/include/asm/kvm_book3s_asm.h
> > index 078f4648ea27..e2c12a10eed2 100644
> > --- a/arch/powerpc/include/asm/kvm_book3s_asm.h
> > +++ b/arch/powerpc/include/asm/kvm_book3s_asm.h
> > @@ -131,6 +131,7 @@ struct kvmppc_host_state {
> >  	u64 cfar;
> >  	u64 ppr;
> >  	u64 host_fscr;
> > +	u64 tb_offset;	/* Timebase offset: keeps correct timebase while on guest */
> >  #endif
> >  };
> >
> > diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
> > index b12d7c049bfe..0beb8fdc6352 100644
> > --- a/arch/powerpc/kernel/asm-offsets.c
> > +++ b/arch/powerpc/kernel/asm-offsets.c
> > @@ -706,6 +706,7 @@ int main(void)
> >  	HSTATE_FIELD(HSTATE_CFAR, cfar);
> >  	HSTATE_FIELD(HSTATE_PPR, ppr);
> >  	HSTATE_FIELD(HSTATE_HOST_FSCR, host_fscr);
> > +	HSTATE_FIELD(HSTATE_TB_OFFSET, tb_offset);
> >  #endif /* CONFIG_PPC_BOOK3S_64 */
> >
> >  #else /* CONFIG_PPC_BOOK3S */
> > diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
> > index 67feb3524460..f27f0163792b 100644
> > --- a/arch/powerpc/kernel/time.c
> > +++ b/arch/powerpc/kernel/time.c
> > @@ -699,7 +699,13 @@ EXPORT_SYMBOL_GPL(tb_to_ns);
> >   */
> >  notrace unsigned long long sched_clock(void)
> >  {
> > -	return mulhdu(get_tb() - boot_tb, tb_to_ns_scale) << tb_to_ns_shift;
> > +	u64 tb = get_tb() - boot_tb;
> > +
> > +#if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_KVM_BOOK3S_HANDLER)
> > +	tb -= local_paca->kvm_hstate.tb_offset;
> > +#endif
> > +
> > +	return mulhdu(tb, tb_to_ns_scale) << tb_to_ns_shift;
> > }
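If a user-selectable debug option is the way this goes, the gating in
sched_clock() could key off that symbol instead of the two KVM configs
alone. A sketch, where CONFIG_KVM_BOOK3S_TB_OFFSET_DEBUG is a hypothetical
name used only for illustration:

notrace unsigned long long sched_clock(void)
{
	u64 tb = get_tb() - boot_tb;

#ifdef CONFIG_KVM_BOOK3S_TB_OFFSET_DEBUG	/* hypothetical symbol */
	/* Hide the applied guest timebase offset from host consumers. */
	tb -= local_paca->kvm_hstate.tb_offset;
#endif

	return mulhdu(tb, tb_to_ns_scale) << tb_to_ns_shift;
}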
Re: [PATCH v2 1/1] powerpc/kvm: Save Timebase Offset to fix sched_clock() while running guest code.
Excerpts from Leonardo Bras's message of February 5, 2021 4:06 pm:
> Before guest entry, TBU40 register is changed to reflect guest timebase.
> After exiting guest, the register is reverted to its original value.
>
> If one tries to get the timestamp from host between those changes, it
> will present an incorrect value.
>
> An example would be trying to add a tracepoint in
> kvmppc_guest_entry_inject_int(), which depending on last tracepoint
> acquired could actually cause the host to crash.
>
> Save the Timebase Offset to PACA and use it on sched_clock() to always
> get the correct timestamp.

Ouch. Not sure how reasonable it is to half switch into guest registers
and expect to call into the wider kernel, fixing things up as we go.
What if mftb is used in other places?

Especially as it doesn't seem like there is a reason that function _has_
to be called after the timebase is switched to guest, that's just how
the code is structured.

As a local hack to work out a bug okay. If you really need it upstream
could you put it under a debug config option?

Thanks,
Nick

> Signed-off-by: Leonardo Bras
> Suggested-by: Paul Mackerras
> ---
> Changes since v1:
> - Subtracts offset only when CONFIG_KVM_BOOK3S_HANDLER and
>   CONFIG_PPC_BOOK3S_64 are defined.
> ---
>  arch/powerpc/include/asm/kvm_book3s_asm.h | 1 +
>  arch/powerpc/kernel/asm-offsets.c         | 1 +
>  arch/powerpc/kernel/time.c                | 8 +++++++-
>  arch/powerpc/kvm/book3s_hv.c              | 2 ++
>  arch/powerpc/kvm/book3s_hv_rmhandlers.S   | 2 ++
>  5 files changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/include/asm/kvm_book3s_asm.h b/arch/powerpc/include/asm/kvm_book3s_asm.h
> index 078f4648ea27..e2c12a10eed2 100644
> --- a/arch/powerpc/include/asm/kvm_book3s_asm.h
> +++ b/arch/powerpc/include/asm/kvm_book3s_asm.h
> @@ -131,6 +131,7 @@ struct kvmppc_host_state {
>  	u64 cfar;
>  	u64 ppr;
>  	u64 host_fscr;
> +	u64 tb_offset;	/* Timebase offset: keeps correct timebase while on guest */
>  #endif
>  };
>
> diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
> index b12d7c049bfe..0beb8fdc6352 100644
> --- a/arch/powerpc/kernel/asm-offsets.c
> +++ b/arch/powerpc/kernel/asm-offsets.c
> @@ -706,6 +706,7 @@ int main(void)
>  	HSTATE_FIELD(HSTATE_CFAR, cfar);
>  	HSTATE_FIELD(HSTATE_PPR, ppr);
>  	HSTATE_FIELD(HSTATE_HOST_FSCR, host_fscr);
> +	HSTATE_FIELD(HSTATE_TB_OFFSET, tb_offset);
>  #endif /* CONFIG_PPC_BOOK3S_64 */
>
>  #else /* CONFIG_PPC_BOOK3S */
> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
> index 67feb3524460..f27f0163792b 100644
> --- a/arch/powerpc/kernel/time.c
> +++ b/arch/powerpc/kernel/time.c
> @@ -699,7 +699,13 @@ EXPORT_SYMBOL_GPL(tb_to_ns);
>   */
>  notrace unsigned long long sched_clock(void)
>  {
> -	return mulhdu(get_tb() - boot_tb, tb_to_ns_scale) << tb_to_ns_shift;
> +	u64 tb = get_tb() - boot_tb;
> +
> +#if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_KVM_BOOK3S_HANDLER)
> +	tb -= local_paca->kvm_hstate.tb_offset;
> +#endif
> +
> +	return mulhdu(tb, tb_to_ns_scale) << tb_to_ns_shift;
> }
>
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index b3731572295e..c08593c63353 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -3491,6 +3491,7 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu *vcpu, u64 time_limit,
>  		if ((tb & 0xffffff) < (new_tb & 0xffffff))
>  			mtspr(SPRN_TBU40, new_tb + 0x1000000);
>  		vc->tb_offset_applied = vc->tb_offset;
> +		local_paca->kvm_hstate.tb_offset = vc->tb_offset;
>  	}
>
>  	if (vc->pcr)
> @@ -3594,6 +3595,7 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu *vcpu, u64 time_limit,
>  		if ((tb & 0xffffff) < (new_tb & 0xffffff))
>  			mtspr(SPRN_TBU40, new_tb + 0x1000000);
>  		vc->tb_offset_applied = 0;
> +		local_paca->kvm_hstate.tb_offset = 0;
>  	}
>
>  	mtspr(SPRN_HDEC, 0x7fffffff);
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index b73140607875..8f7a9f7f4ee6 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -632,6 +632,7 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_300)
>  	cmpdi	r8,0
>  	beq	37f
>  	std	r8, VCORE_TB_OFFSET_APPL(r5)
> +	std	r8, HSTATE_TB_OFFSET(r13)
>  	mftb	r6		/* current host timebase */
>  	add	r8,r8,r6
>  	mtspr	SPRN_TBU40,r8	/* update upper 40 bits */
> @@ -1907,6 +1908,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
>  	beq	17f
>  	li	r0, 0
>  	std	r0, VCORE_TB_OFFSET_APPL(r5)
> +	std	r0, HSTATE_TB_OFFSET(r13)
>  	mftb	r6		/* current guest timebase */
>  	subf	r8,r8,r6
>  	mtspr	SPRN_TBU40,r8	/* update upper 40 bits */
> --
> 2.29.2
Re: [PATCH] arch:powerpc simple_write_to_buffer return check
On 05/02/21 4:05 am, Oliver O'Halloran wrote:
> On Fri, Feb 5, 2021 at 5:17 AM Mayank Suman wrote:
>>
>> Signed-off-by: Mayank Suman
>
> commit messages aren't optional

Sorry. I will include the commit message in PATCH v2.

>> ---
>>  arch/powerpc/kernel/eeh.c                    | 8 ++++----
>>  arch/powerpc/platforms/powernv/eeh-powernv.c | 4 ++--
>>  2 files changed, 6 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
>> index 813713c9120c..2dbe1558a71f 100644
>> --- a/arch/powerpc/kernel/eeh.c
>> +++ b/arch/powerpc/kernel/eeh.c
>> @@ -1628,8 +1628,8 @@ static ssize_t eeh_force_recover_write(struct file *filp,
>>  	char buf[20];
>>  	int ret;
>>
>> -	ret = simple_write_to_buffer(buf, sizeof(buf), ppos, user_buf, count);
>> -	if (!ret)
>> +	ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, count);
>
> We should probably be zeroing the buffer. Reading to sizeof(buf) - 1
> is done in a few places to guarantee that the string is nul
> terminated, but without the preceding memset() that isn't actually
> guaranteed.

Yes, the buffer should be zeroed out first. I have included memset() in
Patch v2.

>> +	if (ret <= 0)
>>  		return -EFAULT;
>
> EFAULT is supposed to be returned when the user supplies a buffer to
> write(2) which is outside their address space. I figured letting the
> sscanf() in the next step fail if the user writes a zero-length
> buffer and returning EINVAL made more sense. That said, the exact
> semantics around zero length writes are pretty handwavy so I guess
> this isn't wrong, but I don't think it's better either.

simple_write_to_buffer() may return a negative value on failure, so
-EFAULT should be returned in case of a negative return value. The
conditional (!ret) was not sufficient to catch a negative return value.
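Putting Oliver's two points together, the agreed-on shape is a zeroed buffer
plus a sizeof(buf) - 1 copy, which is what actually guarantees a
NUL-terminated string for the later sscanf(). Condensed:

	char buf[20];
	ssize_t ret;

	/*
	 * Zero first: copying at most sizeof(buf) - 1 bytes then always
	 * leaves a trailing NUL, so sscanf() sees a proper string.
	 */
	memset(buf, 0, sizeof(buf));
	ret = simple_write_to_buffer(buf, sizeof(buf) - 1, ppos, user_buf, count);
	if (ret <= 0)
		return -EFAULT;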
[PATCH v2 1/1] powerpc/kvm: Save Timebase Offset to fix sched_clock() while running guest code.
Before guest entry, TBU40 register is changed to reflect guest timebase.
After exiting guest, the register is reverted to its original value. If one
tries to get the timestamp from host between those changes, it will present
an incorrect value.

An example would be trying to add a tracepoint in
kvmppc_guest_entry_inject_int(), which depending on last tracepoint
acquired could actually cause the host to crash.

Save the Timebase Offset to PACA and use it on sched_clock() to always get
the correct timestamp.

Signed-off-by: Leonardo Bras
Suggested-by: Paul Mackerras
---
Changes since v1:
- Subtracts offset only when CONFIG_KVM_BOOK3S_HANDLER and
  CONFIG_PPC_BOOK3S_64 are defined.
---
 arch/powerpc/include/asm/kvm_book3s_asm.h | 1 +
 arch/powerpc/kernel/asm-offsets.c         | 1 +
 arch/powerpc/kernel/time.c                | 8 +++++++-
 arch/powerpc/kvm/book3s_hv.c              | 2 ++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   | 2 ++
 5 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_asm.h b/arch/powerpc/include/asm/kvm_book3s_asm.h
index 078f4648ea27..e2c12a10eed2 100644
--- a/arch/powerpc/include/asm/kvm_book3s_asm.h
+++ b/arch/powerpc/include/asm/kvm_book3s_asm.h
@@ -131,6 +131,7 @@ struct kvmppc_host_state {
 	u64 cfar;
 	u64 ppr;
 	u64 host_fscr;
+	u64 tb_offset;	/* Timebase offset: keeps correct timebase while on guest */
 #endif
 };
 
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index b12d7c049bfe..0beb8fdc6352 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -706,6 +706,7 @@ int main(void)
 	HSTATE_FIELD(HSTATE_CFAR, cfar);
 	HSTATE_FIELD(HSTATE_PPR, ppr);
 	HSTATE_FIELD(HSTATE_HOST_FSCR, host_fscr);
+	HSTATE_FIELD(HSTATE_TB_OFFSET, tb_offset);
 #endif /* CONFIG_PPC_BOOK3S_64 */
 
 #else /* CONFIG_PPC_BOOK3S */
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 67feb3524460..f27f0163792b 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -699,7 +699,13 @@ EXPORT_SYMBOL_GPL(tb_to_ns);
  */
 notrace unsigned long long sched_clock(void)
 {
-	return mulhdu(get_tb() - boot_tb, tb_to_ns_scale) << tb_to_ns_shift;
+	u64 tb = get_tb() - boot_tb;
+
+#if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_KVM_BOOK3S_HANDLER)
+	tb -= local_paca->kvm_hstate.tb_offset;
+#endif
+
+	return mulhdu(tb, tb_to_ns_scale) << tb_to_ns_shift;
 }
 
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index b3731572295e..c08593c63353 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3491,6 +3491,7 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu *vcpu, u64 time_limit,
 		if ((tb & 0xffffff) < (new_tb & 0xffffff))
 			mtspr(SPRN_TBU40, new_tb + 0x1000000);
 		vc->tb_offset_applied = vc->tb_offset;
+		local_paca->kvm_hstate.tb_offset = vc->tb_offset;
 	}
 
 	if (vc->pcr)
@@ -3594,6 +3595,7 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu *vcpu, u64 time_limit,
 		if ((tb & 0xffffff) < (new_tb & 0xffffff))
 			mtspr(SPRN_TBU40, new_tb + 0x1000000);
 		vc->tb_offset_applied = 0;
+		local_paca->kvm_hstate.tb_offset = 0;
 	}
 
 	mtspr(SPRN_HDEC, 0x7fffffff);
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index b73140607875..8f7a9f7f4ee6 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -632,6 +632,7 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_300)
 	cmpdi	r8,0
 	beq	37f
 	std	r8, VCORE_TB_OFFSET_APPL(r5)
+	std	r8, HSTATE_TB_OFFSET(r13)
 	mftb	r6		/* current host timebase */
 	add	r8,r8,r6
 	mtspr	SPRN_TBU40,r8	/* update upper 40 bits */
@@ -1907,6 +1908,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
 	beq	17f
 	li	r0, 0
 	std	r0, VCORE_TB_OFFSET_APPL(r5)
+	std	r0, HSTATE_TB_OFFSET(r13)
 	mftb	r6		/* current guest timebase */
 	subf	r8,r8,r6
 	mtspr	SPRN_TBU40,r8	/* update upper 40 bits */
--
2.29.2
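For readers following the sched_clock() arithmetic above: mulhdu() yields
the high 64 bits of an unsigned 64x64-bit multiply, so the conversion is
ns = (tb * tb_to_ns_scale / 2^64) << tb_to_ns_shift. A small, runnable
userspace model of why the offset must be subtracted from tb before
scaling; the 512 MHz timebase and the scale/shift values below are
illustrative assumptions, not the kernel's:

#include <stdint.h>
#include <stdio.h>

/* High 64 bits of an unsigned 64x64 multiply -- what ppc's mulhdu does. */
static uint64_t mulhdu(uint64_t a, uint64_t b)
{
	return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

int main(void)
{
	/*
	 * Illustrative values only: a 512 MHz timebase is 125/64 ns per
	 * tick, encoded here as scale = (125/64) * 2^63 with shift = 1,
	 * so ns = mulhdu(tb, scale) << shift.
	 */
	const uint64_t scale = 0xFA00000000000000ULL;	/* (125/64) * 2^63 */
	const unsigned shift = 1;

	uint64_t tb = 512000000;		/* one second worth of ticks */
	uint64_t guest_offset = 0x123456;	/* arbitrary guest TB offset */

	/* Prints 1000000000: one second in nanoseconds. */
	printf("host ns:      %llu\n",
	       (unsigned long long)(mulhdu(tb, scale) << shift));
	/*
	 * While the guest offset is folded into the running TB, subtracting
	 * the saved offset first (as the patch does) restores the
	 * host-relative timestamp.
	 */
	printf("corrected ns: %llu\n",
	       (unsigned long long)(mulhdu((tb + guest_offset) - guest_offset,
					   scale) << shift));
	return 0;
}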
Re: [PATCH v3 28/32] powerpc/64s: interrupt implement exit logic in C
On 05/02/2021 at 03:16, Nicholas Piggin wrote:
> Excerpts from Michael Ellerman's message of February 5, 2021 10:22 am:
>> Nicholas Piggin writes:
>>> Excerpts from Christophe Leroy's message of February 4, 2021 6:03 pm:
>>>> On 04/02/2021 at 04:27, Nicholas Piggin wrote:
>>>>> Excerpts from Christophe Leroy's message of February 4, 2021 2:25 am:
>>>>>> On 25/02/2020 at 18:35, Nicholas Piggin wrote:
>>>>>> ...
>>>>>>> +
>>>>>>> +	/*
>>>>>>> +	 * We don't need to restore AMR on the way back to userspace for KUAP.
>>>>>>> +	 * The value of AMR only matters while we're in the kernel.
>>>>>>> +	 */
>>>>>>> +	kuap_restore_amr(regs);
>>>>>>
>>>>>> Is that correct to restore KUAP state here? Shouldn't we have it at
>>>>>> lower level in assembly?
>>>>>>
>>>>>> Isn't there a risk that someone manages to call
>>>>>> interrupt_exit_kernel_prepare() or the end of it in a way or another,
>>>>>> and get the previous KUAP state restored by this way?
>>>>>
>>>>> I'm not sure if there is much more risk if it's here rather than the
>>>>> instruction being in another place in the code.
>>>>>
>>>>> There's a lot of user access around the kernel too if you want to find
>>>>> a gadget to unlock KUAP then I suppose there is a pretty large attack
>>>>> surface.
>>>>
>>>> My understanding is that user access scope is strictly limited, for
>>>> instance we enforce the begin/end of user access to be in the same
>>>> function, and we refrain from calling any other function inside the
>>>> user access window. x86 even has 'objtool' to enforce it at build time.
>>>> So in theory there is no way to get out of the function while user
>>>> access is open.
>>>>
>>>> Here with the interrupt exit function it is free beer. You have a place
>>>> where you re-open user access and return with a simple blr. So that's
>>>> open bar. If someone manages to just call the interrupt exit function,
>>>> then user access remains open
>>>
>>> Hmm okay maybe that's a good point.
>>
>> I don't think it's a very attractive gadget, it's not just a plain blr,
>> it does a full stack frame tear down before the return. And there are no
>> LR reloads anywhere very close.
>>
>> Obviously it depends on what the compiler decides to do, it's possible
>> it could be a usable gadget. But there are other places that are more
>> attractive I think, eg:
>>
>> c00000000061d768:	a6 03 3d 7d	mtspr	29,r9
>> c00000000061d76c:	2c 01 00 4c	isync
>> c00000000061d770:	00 00 00 60	nop
>> c00000000061d774:	30 00 21 38	addi	r1,r1,48
>> c00000000061d778:	20 00 80 4e	blr
>>
>> So I don't think we should redesign the code *purely* because we're
>> worried about interrupt_exit_kernel_prepare() being a useful gadget. If
>> we can come up with a way to restructure it that reads well and is
>> maintainable, and also reduces the chance of it being a good gadget then
>> sure.
>
> Okay. That would be good if we can keep it in C, the pkeys + kuap combo
> is fairly complicated and we might want to do something cleverer with it,
> so that would make it even more difficult in asm.

Ok. For ppc32, I prefer to keep it in assembly for the time being and move
everything from ASM to C at once after porting syscall and interrupts to C
and wrappers.

Hope this is OK for you.

Christophe
Re: [PATCH v5 02/10] powerpc/signal: Add unsafe_copy_{vsx, fpr}_from_user()
Hi Christopher,

I have checked that each implementation matches the corresponding *_to_user
implementation. We've had some debate about whether the overarching
implementation in the to/from pairs (especially where things go via a
bounce buffer) can be simplified - but that's probably not really something
that this patch set should do.

On that basis:

Reviewed-by: Daniel Axtens

Kind regards,
Daniel

> Reuse the "safe" implementation from signal.c except for calling
> unsafe_copy_from_user() to copy into a local buffer.
>
> Signed-off-by: Christopher M. Riedl
> ---
>  arch/powerpc/kernel/signal.h | 30 ++++++++++++++++++++++++++++++
>  1 file changed, 30 insertions(+)
>
> diff --git a/arch/powerpc/kernel/signal.h b/arch/powerpc/kernel/signal.h
> index 2559a681536e..7dfc536c78ef 100644
> --- a/arch/powerpc/kernel/signal.h
> +++ b/arch/powerpc/kernel/signal.h
> @@ -53,6 +53,30 @@ unsigned long copy_ckfpr_from_user(struct task_struct *task, void __user *from);
>  				&__f[i], label);\
>  } while (0)
>
> +#define unsafe_copy_fpr_from_user(task, from, label)	do {		\
> +	struct task_struct *__t = task;					\
> +	u64 __user *__f = (u64 __user *)from;				\
> +	u64 buf[ELF_NFPREG];						\
> +	int i;								\
> +									\
> +	unsafe_copy_from_user(buf, __f, sizeof(buf), label);		\
> +	for (i = 0; i < ELF_NFPREG - 1; i++)				\
> +		__t->thread.TS_FPR(i) = buf[i];				\
> +	__t->thread.fp_state.fpscr = buf[i];				\
> +} while (0)
> +
> +#define unsafe_copy_vsx_from_user(task, from, label)	do {		\
> +	struct task_struct *__t = task;					\
> +	u64 __user *__f = (u64 __user *)from;				\
> +	u64 buf[ELF_NVSRHALFREG];					\
> +	int i;								\
> +									\
> +	unsafe_copy_from_user(buf, __f, sizeof(buf), label);		\
> +	for (i = 0; i < ELF_NVSRHALFREG ; i++)				\
> +		__t->thread.fp_state.fpr[i][TS_VSRLOWOFFSET] = buf[i];	\
> +} while (0)
> +
> +
>  #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
>  #define unsafe_copy_ckfpr_to_user(to, task, label)	do {		\
>  	struct task_struct *__t = task;					\
> @@ -80,6 +104,10 @@ unsigned long copy_ckfpr_from_user(struct task_struct *task, void __user *from);
>  	unsafe_copy_to_user(to, (task)->thread.fp_state.fpr,		\
>  			ELF_NFPREG * sizeof(double), label)
>
> +#define unsafe_copy_fpr_from_user(task, from, label)			\
> +	unsafe_copy_from_user((task)->thread.fp_state.fpr, from,	\
> +			ELF_NFPREG * sizeof(double), label)
> +
>  static inline unsigned long
>  copy_fpr_to_user(void __user *to, struct task_struct *task)
>  {
> @@ -115,6 +143,8 @@ copy_ckfpr_from_user(struct task_struct *task, void __user *from)
>  #else
>  #define unsafe_copy_fpr_to_user(to, task, label) do { } while (0)
>
> +#define unsafe_copy_fpr_from_user(task, from, label) do { } while (0)
> +
>  static inline unsigned long
>  copy_fpr_to_user(void __user *to, struct task_struct *task)
>  {
> --
> 2.26.1
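For context, these macros are meant to run inside an already-open
user-access window, jumping to the label on a fault. A usage sketch,
assuming the user_read_access_begin()/user_read_access_end() helpers from
this series; the function name is illustrative:

static long restore_fpr_sketch(struct task_struct *tsk,
			       void __user *from, size_t len)
{
	if (!user_read_access_begin(from, len))
		return -EFAULT;

	/* No function calls while the window is open; bail via the label. */
	unsafe_copy_fpr_from_user(tsk, from, efault);

	user_read_access_end();
	return 0;

efault:
	user_read_access_end();
	return -EFAULT;
}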
Re: [PATCH v5 01/10] powerpc/uaccess: Add unsafe_copy_from_user
Hi Chris, Pending anything that sparse reported (which I haven't checked), this looks ok to me. Reviewed-by: Daniel Axtens Kind regards, Daniel > Just wrap __copy_tofrom_user() for the usual 'unsafe' pattern which > takes in a label to goto on error. > > Signed-off-by: Christopher M. Riedl > --- > arch/powerpc/include/asm/uaccess.h | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/arch/powerpc/include/asm/uaccess.h > b/arch/powerpc/include/asm/uaccess.h > index 501c9a79038c..036e82eefac9 100644 > --- a/arch/powerpc/include/asm/uaccess.h > +++ b/arch/powerpc/include/asm/uaccess.h > @@ -542,6 +542,9 @@ user_write_access_begin(const void __user *ptr, size_t > len) > #define unsafe_get_user(x, p, e) unsafe_op_wrap(__get_user_allowed(x, p), e) > #define unsafe_put_user(x, p, e) __put_user_goto(x, p, e) > > +#define unsafe_copy_from_user(d, s, l, e) \ > + unsafe_op_wrap(__copy_tofrom_user((__force void __user *)d, s, l), e) > + > #define unsafe_copy_to_user(d, s, l, e) \ > do { \ > u8 __user *_dst = (u8 __user *)(d); \ > -- > 2.26.1
Re: [PATCH v5 10/10] powerpc/signal64: Use __get_user() to copy sigset_t
On Wed Feb 3, 2021 at 12:43 PM CST, Christopher M. Riedl wrote:
> Usually sigset_t is exactly 8B which is a "trivial" size and does not
> warrant using __copy_from_user(). Use __get_user() directly in
> anticipation of future work to remove the trivial size optimizations
> from __copy_from_user(). Calling __get_user() also results in a small
> boost to signal handling throughput here.
>
> Signed-off-by: Christopher M. Riedl

This patch triggered sparse warnings about 'different address spaces'.
This minor fixup cleans that up:

diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index 42fdc4a7ff72..1dfda6403e14 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -97,7 +97,7 @@ static void prepare_setup_sigcontext(struct task_struct *tsk, int ctx_has_vsx_re
 #endif /* CONFIG_VSX */
 }
 
-static inline int get_user_sigset(sigset_t *dst, const sigset_t *src)
+static inline int get_user_sigset(sigset_t *dst, const sigset_t __user *src)
 {
 	if (sizeof(sigset_t) <= 8)
 		return __get_user(dst->sig[0], &src->sig[0]);
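Background on the warning: under sparse (__CHECKER__) the __user marker
expands to an address-space attribute, so a plain sigset_t * and a
sigset_t __user * are distinct pointer types. Roughly (the exact spelling
varies across kernel versions):

#ifdef __CHECKER__
# define __user	__attribute__((noderef, address_space(__user)))
#else
# define __user
#endif

/*
 * With the attribute in place, passing a __user pointer where a plain
 * pointer is declared (or vice versa) is a "different address spaces"
 * error -- which is what the one-character prototype fixup above silences.
 */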
[PATCH] mm/memtest: Add ARCH_USE_MEMTEST
early_memtest() does not get called from all architectures. Hence enabling
CONFIG_MEMTEST and providing a valid memtest=[1..N] kernel command line
option might not trigger the memory pattern tests as would be expected in
normal circumstances. This situation is misleading.

The change here prevents the above mentioned problem after introducing a
new config option ARCH_USE_MEMTEST that should be subscribed on platforms
that call early_memtest(), in order to enable the config CONFIG_MEMTEST.
Conversely CONFIG_MEMTEST cannot be enabled on platforms where it would not
be tested anyway.

Cc: Russell King
Cc: Catalin Marinas
Cc: Will Deacon
Cc: Thomas Bogendoerfer
Cc: Michael Ellerman
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: Chris Zankel
Cc: Max Filippov
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-m...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-xte...@linux-xtensa.org
Cc: linux...@kvack.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Anshuman Khandual
---
This patch applies on v5.11-rc6 and has been tested on arm64 platform. But
it has been just build tested on all other platforms.

 arch/arm/Kconfig     | 1 +
 arch/arm64/Kconfig   | 1 +
 arch/mips/Kconfig    | 1 +
 arch/powerpc/Kconfig | 1 +
 arch/x86/Kconfig     | 1 +
 arch/xtensa/Kconfig  | 1 +
 lib/Kconfig.debug    | 9 ++++++++-
 7 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 138248999df7..a63b53c568df 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -32,6 +32,7 @@ config ARM
 	select ARCH_SUPPORTS_ATOMIC_RMW
 	select ARCH_USE_BUILTIN_BSWAP
 	select ARCH_USE_CMPXCHG_LOCKREF
+	select ARCH_USE_MEMTEST
 	select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT if MMU
 	select ARCH_WANT_IPC_PARSE_VERSION
 	select ARCH_WANT_LD_ORPHAN_WARN
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index c4acf8230f20..dfee5831d876 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -70,6 +70,7 @@ config ARM64
 	select ARCH_USE_QUEUED_RWLOCKS
 	select ARCH_USE_QUEUED_SPINLOCKS
 	select ARCH_USE_SYM_ANNOTATIONS
+	select ARCH_USE_MEMTEST
 	select ARCH_SUPPORTS_DEBUG_PAGEALLOC
 	select ARCH_SUPPORTS_MEMORY_FAILURE
 	select ARCH_SUPPORTS_SHADOW_CALL_STACK if CC_HAVE_SHADOW_CALL_STACK
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 0a17bedf4f0d..1b21d8e53e6b 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -16,6 +16,7 @@ config MIPS
 	select ARCH_USE_CMPXCHG_LOCKREF if 64BIT
 	select ARCH_USE_QUEUED_RWLOCKS
 	select ARCH_USE_QUEUED_SPINLOCKS
+	select ARCH_USE_MEMTEST
 	select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT if MMU
 	select ARCH_WANT_IPC_PARSE_VERSION
 	select BUILDTIME_TABLE_SORT
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 107bb4319e0e..9935343a8750 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -151,6 +151,7 @@ config PPC
 	select ARCH_USE_CMPXCHG_LOCKREF		if PPC64
 	select ARCH_USE_QUEUED_RWLOCKS		if PPC_QUEUED_SPINLOCKS
 	select ARCH_USE_QUEUED_SPINLOCKS	if PPC_QUEUED_SPINLOCKS
+	select ARCH_USE_MEMTEST
 	select ARCH_WANT_IPC_PARSE_VERSION
 	select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
 	select ARCH_WANT_LD_ORPHAN_WARN
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 21f851179ff0..90545348db1b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -100,6 +100,7 @@ config X86
 	select ARCH_USE_QUEUED_RWLOCKS
 	select ARCH_USE_QUEUED_SPINLOCKS
 	select ARCH_USE_SYM_ANNOTATIONS
+	select ARCH_USE_MEMTEST
 	select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 	select ARCH_WANT_DEFAULT_BPF_JIT	if X86_64
 	select ARCH_WANTS_DYNAMIC_TASK_STRUCT
diff --git a/arch/xtensa/Kconfig b/arch/xtensa/Kconfig
index 37ce1489364e..8eb61fcdfc7f 100644
--- a/arch/xtensa/Kconfig
+++ b/arch/xtensa/Kconfig
@@ -9,6 +9,7 @@ config XTENSA
 	select ARCH_HAS_DMA_SET_UNCACHED if MMU
 	select ARCH_USE_QUEUED_RWLOCKS
 	select ARCH_USE_QUEUED_SPINLOCKS
+	select ARCH_USE_MEMTEST
 	select ARCH_WANT_FRAME_POINTERS
 	select ARCH_WANT_IPC_PARSE_VERSION
 	select BUILDTIME_TABLE_SORT
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 7937265ef879..6dd25b755a82 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2469,11 +2469,18 @@ config TEST_FPU
 
 endif # RUNTIME_TESTING_MENU
 
+config ARCH_USE_MEMTEST
+	bool
+	help
+	  An architecture should select this when it uses early_memtest()
+	  during boot process.
+
 config MEMTEST
 	bool "Memtest"
+	depends on ARCH_USE_MEMTEST
 	help
 	  This option adds a kernel parameter 'memtest', which allows memtest
-	  to be set.
+	  to be set and executed.
 	        memtest=0, mean disabled; -- default
 	        memtest=1, mean do 1 test pattern;
 	        ...
 	        memtest=17, mean do 17 test patterns.
--
2.29.2
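For context, the hook the new symbol tracks is a one-line call from arch
code once the usable memory range is known. A sketch of what a subscribing
architecture does (function placement is illustrative):

#include <linux/memblock.h>

static void __init arch_memtest_sketch(void)
{
	/*
	 * mm/memtest.c walks the free memblock regions in the given range
	 * and writes/verifies the patterns requested via memtest=N. With
	 * CONFIG_MEMTEST disabled this compiles to an empty stub.
	 */
	early_memtest(memblock_start_of_DRAM(), memblock_end_of_DRAM());
}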
[PATCH 1/1] powerpc/kvm: Save Timebase Offset to fix sched_clock() while running guest code.
Before guest entry, TBU40 register is changed to reflect guest timebase.
After exiting guest, the register is reverted to its original value. If one
tries to get the timestamp from host between those changes, it will present
an incorrect value.

An example would be trying to add a tracepoint in
kvmppc_guest_entry_inject_int(), which depending on last tracepoint
acquired could actually cause the host to crash.

Save the Timebase Offset to PACA and use it on sched_clock() to always get
the correct timestamp.

Signed-off-by: Leonardo Bras
---
 arch/powerpc/include/asm/kvm_book3s_asm.h | 1 +
 arch/powerpc/kernel/asm-offsets.c         | 1 +
 arch/powerpc/kernel/time.c                | 3 ++-
 arch/powerpc/kvm/book3s_hv.c              | 2 ++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   | 2 ++
 5 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_asm.h b/arch/powerpc/include/asm/kvm_book3s_asm.h
index 078f4648ea27..e2c12a10eed2 100644
--- a/arch/powerpc/include/asm/kvm_book3s_asm.h
+++ b/arch/powerpc/include/asm/kvm_book3s_asm.h
@@ -131,6 +131,7 @@ struct kvmppc_host_state {
 	u64 cfar;
 	u64 ppr;
 	u64 host_fscr;
+	u64 tb_offset;	/* Timebase offset: keeps correct timebase while on guest */
 #endif
 };
 
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index b12d7c049bfe..0beb8fdc6352 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -706,6 +706,7 @@ int main(void)
 	HSTATE_FIELD(HSTATE_CFAR, cfar);
 	HSTATE_FIELD(HSTATE_PPR, ppr);
 	HSTATE_FIELD(HSTATE_HOST_FSCR, host_fscr);
+	HSTATE_FIELD(HSTATE_TB_OFFSET, tb_offset);
 #endif /* CONFIG_PPC_BOOK3S_64 */
 
 #else /* CONFIG_PPC_BOOK3S */
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 67feb3524460..adf6648e3572 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -699,7 +699,8 @@ EXPORT_SYMBOL_GPL(tb_to_ns);
  */
 notrace unsigned long long sched_clock(void)
 {
-	return mulhdu(get_tb() - boot_tb, tb_to_ns_scale) << tb_to_ns_shift;
+	return mulhdu(get_tb() - boot_tb - local_paca->kvm_hstate.tb_offset, tb_to_ns_scale)
+			<< tb_to_ns_shift;
 }
 
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index b3731572295e..c08593c63353 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3491,6 +3491,7 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu *vcpu, u64 time_limit,
 		if ((tb & 0xffffff) < (new_tb & 0xffffff))
 			mtspr(SPRN_TBU40, new_tb + 0x1000000);
 		vc->tb_offset_applied = vc->tb_offset;
+		local_paca->kvm_hstate.tb_offset = vc->tb_offset;
 	}
 
 	if (vc->pcr)
@@ -3594,6 +3595,7 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu *vcpu, u64 time_limit,
 		if ((tb & 0xffffff) < (new_tb & 0xffffff))
 			mtspr(SPRN_TBU40, new_tb + 0x1000000);
 		vc->tb_offset_applied = 0;
+		local_paca->kvm_hstate.tb_offset = 0;
 	}
 
 	mtspr(SPRN_HDEC, 0x7fffffff);
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index b73140607875..8f7a9f7f4ee6 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -632,6 +632,7 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_300)
 	cmpdi	r8,0
 	beq	37f
 	std	r8, VCORE_TB_OFFSET_APPL(r5)
+	std	r8, HSTATE_TB_OFFSET(r13)
 	mftb	r6		/* current host timebase */
 	add	r8,r8,r6
 	mtspr	SPRN_TBU40,r8	/* update upper 40 bits */
@@ -1907,6 +1908,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
 	beq	17f
 	li	r0, 0
 	std	r0, VCORE_TB_OFFSET_APPL(r5)
+	std	r0, HSTATE_TB_OFFSET(r13)
 	mftb	r6		/* current guest timebase */
 	subf	r8,r8,r6
 	mtspr	SPRN_TBU40,r8	/* update upper 40 bits */
--
2.29.2
[PATCH] powerpc/kuap: Allow kernel thread to access userspace after kthread_use_mm
This fixes the bad fault reported by KUAP when io_wqe_worker accesses
userspace.

 Bug: Read fault blocked by KUAP!
 WARNING: CPU: 1 PID: 101841 at arch/powerpc/mm/fault.c:229 __do_page_fault+0x6b4/0xcd0
 NIP [c009e7e4] __do_page_fault+0x6b4/0xcd0
 LR [c009e7e0] __do_page_fault+0x6b0/0xcd0
 ..
 Call Trace:
 [c00016367330] [c009e7e0] __do_page_fault+0x6b0/0xcd0 (unreliable)
 [c000163673e0] [c009ee3c] do_page_fault+0x3c/0x120
 [c00016367430] [c000c848] handle_page_fault+0x10/0x2c
 --- interrupt: 300 at iov_iter_fault_in_readable+0x148/0x6f0
 ..
 NIP [c08e8228] iov_iter_fault_in_readable+0x148/0x6f0
 LR [c08e834c] iov_iter_fault_in_readable+0x26c/0x6f0
 interrupt: 300
 [c000163677e0] [c07154a0] iomap_write_actor+0xc0/0x280
 [c00016367880] [c070fc94] iomap_apply+0x1c4/0x780
 [c00016367990] [c0710330] iomap_file_buffered_write+0xa0/0x120
 [c000163679e0] [c0080040791c] xfs_file_buffered_aio_write+0x314/0x5e0 [xfs]
 [c00016367a90] [c06d74bc] io_write+0x10c/0x460
 [c00016367bb0] [c06d80e4] io_issue_sqe+0x8d4/0x1200
 [c00016367c70] [c06d8ad0] io_wq_submit_work+0xc0/0x250
 [c00016367cb0] [c06e2578] io_worker_handle_work+0x498/0x800
 [c00016367d40] [c06e2cdc] io_wqe_worker+0x3fc/0x4f0
 [c00016367da0] [c01cb0a4] kthread+0x1c4/0x1d0
 [c00016367e10] [c000dbf0] ret_from_kernel_thread+0x5c/0x6c

The kernel considers the AMR value of a kernel thread to be
AMR_KUAP_BLOCKED. Hence access to userspace is denied. This is of course
not correct and we should allow userspace access after kthread_use_mm().

To be precise, kthread_use_mm() should inherit the AMR value of the
operating address space. But the AMR value is thread-specific and we
inherit the address space and not thread access restrictions. Because of
this, ignore the AMR value when accessing userspace via a kernel thread.

Cc: Zorro Lang
Cc: Jens Axboe
Cc: Christophe Leroy
Cc: Nicholas Piggin
Signed-off-by: Aneesh Kumar K.V
---
 arch/powerpc/include/asm/book3s/64/kup.h | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/kup.h b/arch/powerpc/include/asm/book3s/64/kup.h
index f50f72e535aa..2064621ae7b6 100644
--- a/arch/powerpc/include/asm/book3s/64/kup.h
+++ b/arch/powerpc/include/asm/book3s/64/kup.h
@@ -202,22 +202,16 @@ DECLARE_STATIC_KEY_FALSE(uaccess_flush_key);
 #include
 #include
 
-/*
- * For kernel thread that doesn't have thread.regs return
- * default AMR/IAMR values.
- */
 static inline u64 current_thread_amr(void)
 {
-	if (current->thread.regs)
-		return current->thread.regs->amr;
-	return AMR_KUAP_BLOCKED;
+	VM_BUG_ON(!current->thread.regs);
+	return current->thread.regs->amr;
 }
 
 static inline u64 current_thread_iamr(void)
 {
-	if (current->thread.regs)
-		return current->thread.regs->iamr;
-	return AMR_KUEP_BLOCKED;
+	VM_BUG_ON(!current->thread.regs);
+	return current->thread.regs->iamr;
 }
 #endif /* CONFIG_PPC_PKEY */
 
@@ -384,7 +378,14 @@ static __always_inline void allow_user_access(void __user *to, const void __user
 	// This is written so we can resolve to a single case at build time
 	BUILD_BUG_ON(!__builtin_constant_p(dir));
 
-	if (mmu_has_feature(MMU_FTR_PKEY))
+	/*
+	 * Kernel threads may access user mm with kthread_use_mm() but
+	 * can't use current_thread_amr because they have thread.regs==NULL,
+	 * but they have no pkeys.
+	 */
+	if (current->flags & PF_KTHREAD)
+		thread_amr = 0;
+	else if (mmu_has_feature(MMU_FTR_PKEY))
 		thread_amr = current_thread_amr();
 
 	if (dir == KUAP_READ)
--
2.29.2
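The failing path boils down to a kernel thread temporarily adopting a user
mm and then doing an ordinary user copy -- the pattern io_wqe_worker relies
on. A minimal sketch (function name illustrative):

#include <linux/kthread.h>
#include <linux/uaccess.h>

/* A worker kthread servicing a request on behalf of a user task. */
static int worker_sketch(struct mm_struct *mm, void __user *ubuf,
			 void *kbuf, size_t len)
{
	int ret = 0;

	kthread_use_mm(mm);		/* adopt the issuing task's mm */
	/*
	 * copy_from_user() faults in user pages here; with KUAP this is
	 * where the pre-fix AMR_KUAP_BLOCKED value tripped the warning,
	 * because a kthread has thread.regs == NULL.
	 */
	if (copy_from_user(kbuf, ubuf, len))
		ret = -EFAULT;
	kthread_unuse_mm(mm);

	return ret;
}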
[PATCH] mm/pmem: Avoid inserting hugepage PTE entry with fsdax if hugepage support is disabled
Differentiate between hardware not supporting hugepages and the user
disabling THP via 'echo never > /sys/kernel/mm/transparent_hugepage/enabled'.

For the devdax namespace, the kernel handles the above via the
supported_alignment attribute and fails to initialize the namespace if the
namespace align value is not supported on the platform.

For the fsdax namespace, the kernel will continue to initialize the
namespace. This can result in the kernel creating a huge pte entry even
though the hardware doesn't support it.

We do want hugepage support with pmem even if the end-user disabled THP via
the sysfs file (/sys/kernel/mm/transparent_hugepage/enabled). Hence
differentiate between hardware/firmware lacking support vs user-controlled
disable of THP, and prevent a huge fault if the hardware lacks hugepage
support.

Signed-off-by: Aneesh Kumar K.V
---
 include/linux/huge_mm.h | 15 +++++++++------
 mm/huge_memory.c        |  6 +++++-
 2 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 6a19f35f836b..ba973efcd369 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -78,6 +78,7 @@ static inline vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn,
 }
 
 enum transparent_hugepage_flag {
+	TRANSPARENT_HUGEPAGE_NEVER_DAX,
 	TRANSPARENT_HUGEPAGE_FLAG,
 	TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
 	TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
@@ -123,6 +124,13 @@ extern unsigned long transparent_hugepage_flags;
  */
 static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
 {
+
+	/*
+	 * If hardware/firmware marked hugepage support disabled.
+	 */
+	if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_NEVER_DAX))
+		return false;
+
 	if (vma->vm_flags & VM_NOHUGEPAGE)
 		return false;
 
@@ -134,12 +142,7 @@ static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
 	if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_FLAG))
 		return true;
 
-	/*
-	 * For dax vmas, try to always use hugepage mappings. If the kernel does
-	 * not support hugepages, fsdax mappings will fallback to PAGE_SIZE
-	 * mappings, and device-dax namespaces, that try to guarantee a given
-	 * mapping size, will fail to enable
-	 */
+
 	if (vma_is_dax(vma))
 		return true;
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9237976abe72..d698b7e27447 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -386,7 +386,11 @@ static int __init hugepage_init(void)
 	struct kobject *hugepage_kobj;
 
 	if (!has_transparent_hugepage()) {
-		transparent_hugepage_flags = 0;
+		/*
+		 * Hardware doesn't support hugepages, hence disable
+		 * DAX PMD support.
+		 */
+		transparent_hugepage_flags = 1 << TRANSPARENT_HUGEPAGE_NEVER_DAX;
 		return -EINVAL;
 	}
 
--
2.29.2
Re: [PATCH] powerpc/perf: Record counter overflow always if SAMPLE_IP is unset
> On 04-Feb-2021, at 8:25 AM, Michael Ellerman wrote:
>
> Athira Rajeev writes:
>> While sampling for marked events, currently we record the sample only
>> if the SIAR valid bit of Sampled Instruction Event Register (SIER) is
>> set. SIAR_VALID bit is used for fetching the instruction address from
>> Sampled Instruction Address Register(SIAR). But there are some usecases,
>> where the user is interested only in the PMU stats at each counter
>> overflow and the exact IP of the overflow event is not required.
>> Dropping SIAR invalid samples will fail to record some of the counter
>> overflows in such cases.
>>
>> Example of such usecase is dumping the PMU stats (event counts)
>> after some regular amount of instructions/events from the userspace
>> (ex: via ptrace). Here counter overflow is indicated to userspace via
>> signal handler, and captured by monitoring and enabling I/O
>> signaling on the event file descriptor. In these cases, we expect to
>> get sample/overflow indication after each specified sample_period.
>>
>> Perf event attribute will not have PERF_SAMPLE_IP set in the
>> sample_type if exact IP of the overflow event is not requested. So
>> while profiling if SAMPLE_IP is not set, just record the counter
>> overflow irrespective of SIAR_VALID check.
>>
>> Signed-off-by: Athira Rajeev
>> ---
>>  arch/powerpc/perf/core-book3s.c | 10 ++++++++--
>>  1 file changed, 8 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
>> index 28206b1fe172..bb4828a05e4d 100644
>> --- a/arch/powerpc/perf/core-book3s.c
>> +++ b/arch/powerpc/perf/core-book3s.c
>> @@ -2166,10 +2166,16 @@ static void record_and_restart(struct perf_event *event, unsigned long val,
>>  	 * address even when freeze on supervisor state (kernel) is set in
>>  	 * MMCR2. Check attr.exclude_kernel and address to drop the sample in
>>  	 * these cases.
>> +	 *
>> +	 * If address is not requested in the sample
>> +	 * via PERF_SAMPLE_IP, just record that sample
>> +	 * irrespective of SIAR valid check.
>>  	 */
>> -	if (event->attr.exclude_kernel && record)
>> -		if (is_kernel_addr(mfspr(SPRN_SIAR)))
>> +	if (event->attr.exclude_kernel && record) {
>> +		if (is_kernel_addr(mfspr(SPRN_SIAR)) && (event->attr.sample_type & PERF_SAMPLE_IP))
>>  			record = 0;
>> +	} else if (!record && !(event->attr.sample_type & PERF_SAMPLE_IP))
>> +		record = 1;
>
> This seems wrong, you're assuming that record was set previously to
> = siar_valid(), but it may be that record is still 0 from the
> initialisation and we weren't going to record.
>
> Don't we need something more like this?

Hi Michael,

Thanks for checking the patch and sharing the suggestion. Yes, the below
change looks good and tested with my scenario. I will send a V2 with the
new change.

Thanks
Athira

> diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
> index 9fd06010e8b6..e4e8a017d339 100644
> --- a/arch/powerpc/perf/core-book3s.c
> +++ b/arch/powerpc/perf/core-book3s.c
> @@ -2136,7 +2136,12 @@ static void record_and_restart(struct perf_event *event, unsigned long val,
>  		left += period;
>  		if (left <= 0)
>  			left = period;
> -		record = siar_valid(regs);
> +
> +		if (event->attr.sample_type & PERF_SAMPLE_IP)
> +			record = siar_valid(regs);
> +		else
> +			record = 1;
> +
>  		event->hw.last_period = event->hw.sample_period;
>  	}
>  	if (left < 0x80000000LL)
> @@ -2154,9 +2159,10 @@ static void record_and_restart(struct perf_event *event, unsigned long val,
>  	 * MMCR2. Check attr.exclude_kernel and address to drop the sample in
>  	 * these cases.
>  	 */
> -	if (event->attr.exclude_kernel && record)
> -		if (is_kernel_addr(mfspr(SPRN_SIAR)))
> -			record = 0;
> +	if (event->attr.exclude_kernel &&
> +	    (event->attr.sample_type & PERF_SAMPLE_IP) &&
> +	    is_kernel_addr(mfspr(SPRN_SIAR)))
> +		record = 0;
>
>  	/*
>  	 * Finally record data if requested.
>
> cheers
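The usecase Athira describes -- counting overflows via signaling on the
event fd, without decoding IPs -- looks roughly like this from userspace.
All attribute values are illustrative, and note that sample_type
deliberately omits PERF_SAMPLE_IP:

#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/perf_event.h>

static volatile sig_atomic_t overflows;

static void on_overflow(int sig) { overflows++; }

int main(void)
{
	struct perf_event_attr attr;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_INSTRUCTIONS;
	attr.sample_period = 1000000;	/* illustrative period */
	/* No PERF_SAMPLE_IP: only the overflow notification is wanted. */
	attr.disabled = 1;

	fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
	if (fd < 0) { perror("perf_event_open"); return 1; }

	signal(SIGIO, on_overflow);
	fcntl(fd, F_SETFL, O_ASYNC);	/* signal-driven I/O on overflow */
	fcntl(fd, F_SETSIG, SIGIO);
	fcntl(fd, F_SETOWN, getpid());

	ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
	/* ... workload ... */
	ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

	printf("overflow notifications: %d\n", (int)overflows);
	return 0;
}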
Re: [PATCH v3 28/32] powerpc/64s: interrupt implement exit logic in C
Excerpts from Michael Ellerman's message of February 5, 2021 10:22 am:
> Nicholas Piggin writes:
>> Excerpts from Christophe Leroy's message of February 4, 2021 6:03 pm:
>>> On 04/02/2021 at 04:27, Nicholas Piggin wrote:
>>>> Excerpts from Christophe Leroy's message of February 4, 2021 2:25 am:
>>>>> On 25/02/2020 at 18:35, Nicholas Piggin wrote:
>>>>> ...
>>>>>> +
>>>>>> +	/*
>>>>>> +	 * We don't need to restore AMR on the way back to userspace for KUAP.
>>>>>> +	 * The value of AMR only matters while we're in the kernel.
>>>>>> +	 */
>>>>>> +	kuap_restore_amr(regs);
>>>>>
>>>>> Is that correct to restore KUAP state here? Shouldn't we have it at
>>>>> lower level in assembly?
>>>>>
>>>>> Isn't there a risk that someone manages to call
>>>>> interrupt_exit_kernel_prepare() or the end of it in a way or another,
>>>>> and get the previous KUAP state restored by this way?
>>>>
>>>> I'm not sure if there is much more risk if it's here rather than the
>>>> instruction being in another place in the code.
>>>>
>>>> There's a lot of user access around the kernel too if you want to find
>>>> a gadget to unlock KUAP then I suppose there is a pretty large attack
>>>> surface.
>>>
>>> My understanding is that user access scope is strictly limited, for
>>> instance we enforce the begin/end of user access to be in the same
>>> function, and we refrain from calling any other function inside the
>>> user access window. x86 even has 'objtool' to enforce it at build time.
>>> So in theory there is no way to get out of the function while user
>>> access is open.
>>>
>>> Here with the interrupt exit function it is free beer. You have a place
>>> where you re-open user access and return with a simple blr. So that's
>>> open bar. If someone manages to just call the interrupt exit function,
>>> then user access remains open
>>
>> Hmm okay maybe that's a good point.
>
> I don't think it's a very attractive gadget, it's not just a plain blr,
> it does a full stack frame tear down before the return. And there are no
> LR reloads anywhere very close.
>
> Obviously it depends on what the compiler decides to do, it's possible
> it could be a usable gadget. But there are other places that are more
> attractive I think, eg:
>
> c00000000061d768:	a6 03 3d 7d	mtspr	29,r9
> c00000000061d76c:	2c 01 00 4c	isync
> c00000000061d770:	00 00 00 60	nop
> c00000000061d774:	30 00 21 38	addi	r1,r1,48
> c00000000061d778:	20 00 80 4e	blr
>
> So I don't think we should redesign the code *purely* because we're
> worried about interrupt_exit_kernel_prepare() being a useful gadget. If
> we can come up with a way to restructure it that reads well and is
> maintainable, and also reduces the chance of it being a good gadget then
> sure.

Okay. That would be good if we can keep it in C, the pkeys + kuap combo is
fairly complicated and we might want to do something cleverer with it, so
that would make it even more difficult in asm.

Thanks,
Nick
Re: [PATCH v3 28/32] powerpc/64s: interrupt implement exit logic in C
Nicholas Piggin writes:
> Excerpts from Christophe Leroy's message of February 4, 2021 6:03 pm:
>> On 04/02/2021 at 04:27, Nicholas Piggin wrote:
>>> Excerpts from Christophe Leroy's message of February 4, 2021 2:25 am:
>>>> On 25/02/2020 at 18:35, Nicholas Piggin wrote:
>>>> ...
>>>>> +
>>>>> +	/*
>>>>> +	 * We don't need to restore AMR on the way back to userspace for KUAP.
>>>>> +	 * The value of AMR only matters while we're in the kernel.
>>>>> +	 */
>>>>> +	kuap_restore_amr(regs);
>>>>
>>>> Is that correct to restore KUAP state here? Shouldn't we have it at
>>>> lower level in assembly?
>>>>
>>>> Isn't there a risk that someone manages to call
>>>> interrupt_exit_kernel_prepare() or the end of it in a way or another,
>>>> and get the previous KUAP state restored by this way?
>>>
>>> I'm not sure if there is much more risk if it's here rather than the
>>> instruction being in another place in the code.
>>>
>>> There's a lot of user access around the kernel too if you want to find a
>>> gadget to unlock KUAP then I suppose there is a pretty large attack
>>> surface.
>>
>> My understanding is that user access scope is strictly limited, for
>> instance we enforce the begin/end of user access to be in the same
>> function, and we refrain from calling any other function inside the user
>> access window. x86 even has 'objtool' to enforce it at build time. So in
>> theory there is no way to get out of the function while user access is
>> open.
>>
>> Here with the interrupt exit function it is free beer. You have a place
>> where you re-open user access and return with a simple blr. So that's
>> open bar. If someone manages to just call the interrupt exit function,
>> then user access remains open
>
> Hmm okay maybe that's a good point.

I don't think it's a very attractive gadget, it's not just a plain blr,
it does a full stack frame tear down before the return. And there are no
LR reloads anywhere very close.

Obviously it depends on what the compiler decides to do, it's possible
it could be a usable gadget. But there are other places that are more
attractive I think, eg:

c00000000061d768:	a6 03 3d 7d	mtspr	29,r9
c00000000061d76c:	2c 01 00 4c	isync
c00000000061d770:	00 00 00 60	nop
c00000000061d774:	30 00 21 38	addi	r1,r1,48
c00000000061d778:	20 00 80 4e	blr

So I don't think we should redesign the code *purely* because we're
worried about interrupt_exit_kernel_prepare() being a useful gadget. If
we can come up with a way to restructure it that reads well and is
maintainable, and also reduces the chance of it being a good gadget then
sure.

cheers
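The discipline Christophe refers to -- open, access, close, all within one
function and with no intervening calls -- looks like this with the generic
unsafe accessors (a sketch; the err-label style matches x86/powerpc usage):

static int put_two_words(u32 __user *dst, u32 a, u32 b)
{
	if (!user_access_begin(dst, 2 * sizeof(u32)))
		return -EFAULT;

	/* Window is open: straight-line accesses only, no function calls. */
	unsafe_put_user(a, &dst[0], efault);
	unsafe_put_user(b, &dst[1], efault);

	user_access_end();	/* close before the only normal return path */
	return 0;

efault:
	user_access_end();
	return -EFAULT;
}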
Re: [PATCH v16 11/12] powerpc: Use OF alloc and free for FDT
On 2/4/21 3:36 PM, Rob Herring wrote: On Thu, Feb 4, 2021 at 5:23 PM Lakshmi Ramasubramanian wrote: On 2/4/21 11:26 AM, Rob Herring wrote: On Thu, Feb 4, 2021 at 10:42 AM Lakshmi Ramasubramanian wrote: of_alloc_and_init_fdt() and of_free_fdt() have been defined in drivers/of/kexec.c to allocate and free memory for FDT. Use of_alloc_and_init_fdt() and of_free_fdt() to allocate and initialize the FDT, and to free the FDT respectively. powerpc sets the FDT address in image_loader_data field in "struct kimage" and the memory is freed in kimage_file_post_load_cleanup(). This cleanup function uses kfree() to free the memory. But since of_alloc_and_init_fdt() uses kvmalloc() for allocation, the buffer needs to be freed using kvfree(). You could just change the kexec core to call kvfree() instead. Define "fdt" field in "struct kimage_arch" for powerpc to store the address of FDT, and free the memory in powerpc specific arch_kimage_file_post_load_cleanup(). However, given all the other buffers have an explicit field in kimage or kimage_arch, changing powerpc is to match arm64 is better IMO. Just to be clear: I'll leave this as is - free FDT buffer in powerpc's arch_kimage_file_post_load_cleanup() to match arm64 behavior. Yes. ok Will not change "kexec core" to call kvfree() - doing that change would require changing all architectures to use kvmalloc() for image_loader_data allocation. Actually, no. AIUI, kvfree() can be used whether you used kvmalloc, vmalloc, or kmalloc for the alloc. That is good information. Thanks. Signed-off-by: Lakshmi Ramasubramanian Suggested-by: Rob Herring Suggested-by: Thiago Jung Bauermann --- arch/powerpc/include/asm/kexec.h | 2 ++ arch/powerpc/kexec/elf_64.c | 26 -- arch/powerpc/kexec/file_load_64.c | 3 +++ 3 files changed, 21 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h index 2c0be93d239a..d7d13cac4d31 100644 --- a/arch/powerpc/include/asm/kexec.h +++ b/arch/powerpc/include/asm/kexec.h @@ -111,6 +111,8 @@ struct kimage_arch { unsigned long elf_headers_mem; unsigned long elf_headers_sz; void *elf_headers; + + void *fdt; }; char *setup_kdump_cmdline(struct kimage *image, char *cmdline, diff --git a/arch/powerpc/kexec/elf_64.c b/arch/powerpc/kexec/elf_64.c index d0e459bb2f05..51d2d8eb6c1b 100644 --- a/arch/powerpc/kexec/elf_64.c +++ b/arch/powerpc/kexec/elf_64.c @@ -19,6 +19,7 @@ #include #include #include +#include #include #include #include @@ -32,7 +33,7 @@ static void *elf64_load(struct kimage *image, char *kernel_buf, unsigned int fdt_size; unsigned long kernel_load_addr; unsigned long initrd_load_addr = 0, fdt_load_addr; - void *fdt; + void *fdt = NULL; const void *slave_code; struct elfhdr ehdr; char *modified_cmdline = NULL; @@ -103,18 +104,12 @@ static void *elf64_load(struct kimage *image, char *kernel_buf, } fdt_size = fdt_totalsize(initial_boot_params) * 2; - fdt = kmalloc(fdt_size, GFP_KERNEL); + fdt = of_alloc_and_init_fdt(fdt_size); if (!fdt) { pr_err("Not enough memory for the device tree.\n"); ret = -ENOMEM; goto out; } - ret = fdt_open_into(initial_boot_params, fdt, fdt_size); - if (ret < 0) { - pr_err("Error setting up the new device tree.\n"); - ret = -EINVAL; - goto out; - } ret = setup_new_fdt_ppc64(image, fdt, initrd_load_addr, The first thing this function does is call setup_new_fdt() which first calls of_kexec_setup_new_fdt(). (Note, I really don't understand the PPC code split. It looks like there's a 32-bit and 64-bit split, but 32-bit looks broken to me. 
Nothing ever calls setup_new_fdt() except setup_new_fdt_ppc64()). The arm64 version is calling of_alloc_and_init_fdt() and then of_kexec_setup_new_fdt() directly. So we can just make of_alloc_and_init_fdt() also call of_kexec_setup_new_fdt() (really, just tweak of_kexec_setup_new_fdt do the alloc and copy). ok - will move fdt allocation into of_kexec_setup_new_fdt(). I don't think the architecture needs to pick the size either. It's doubtful that either one is that sensitive to the amount of extra space. I am not clear about the above comment - are you saying the architectures don't need to pass FDT size to the alloc function? arm64 is adding command line string length and some extra space to the size computed from initial_boot_params for FDT Size: buf_size = fdt_totalsize(initial_boot_params) + cmdline_len + DTB_EXTRA_SPACE; powerpc is just using twice the size computed from initial_boot_params fdt_size = fdt_totalsize(initial_boot_params) * 2; I think it would be safe to let arm64 and
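From the discussion above, a plausible shape for the helper being talked
about (inferred here, not quoted from drivers/of/kexec.c): kvmalloc() the
buffer -- hence the kvfree() pairing noted in this thread -- and seed it
from the running system's device tree:

#include <linux/libfdt.h>
#include <linux/mm.h>
#include <linux/of_fdt.h>

static void *alloc_and_init_fdt_sketch(unsigned int fdt_size)
{
	void *fdt = kvmalloc(fdt_size, GFP_KERNEL);

	if (!fdt)
		return NULL;

	/* Copy the system FDT in and resize it to fill the new buffer. */
	if (fdt_open_into(initial_boot_params, fdt, fdt_size) < 0) {
		kvfree(fdt);
		return NULL;
	}
	return fdt;
}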
Re: [PATCH v16 11/12] powerpc: Use OF alloc and free for FDT
On Thu, Feb 4, 2021 at 5:23 PM Lakshmi Ramasubramanian wrote: > > On 2/4/21 11:26 AM, Rob Herring wrote: > > On Thu, Feb 4, 2021 at 10:42 AM Lakshmi Ramasubramanian > > wrote: > >> > >> of_alloc_and_init_fdt() and of_free_fdt() have been defined in > >> drivers/of/kexec.c to allocate and free memory for FDT. > >> > >> Use of_alloc_and_init_fdt() and of_free_fdt() to allocate and > >> initialize the FDT, and to free the FDT respectively. > >> > >> powerpc sets the FDT address in image_loader_data field in > >> "struct kimage" and the memory is freed in > >> kimage_file_post_load_cleanup(). This cleanup function uses kfree() > >> to free the memory. But since of_alloc_and_init_fdt() uses kvmalloc() > >> for allocation, the buffer needs to be freed using kvfree(). > > > > You could just change the kexec core to call kvfree() instead. > > > > >> Define "fdt" field in "struct kimage_arch" for powerpc to store > >> the address of FDT, and free the memory in powerpc specific > >> arch_kimage_file_post_load_cleanup(). > > > > However, given all the other buffers have an explicit field in kimage > > or kimage_arch, changing powerpc is to match arm64 is better IMO. > > Just to be clear: > I'll leave this as is - free FDT buffer in powerpc's > arch_kimage_file_post_load_cleanup() to match arm64 behavior. Yes. > Will not change "kexec core" to call kvfree() - doing that change would > require changing all architectures to use kvmalloc() for > image_loader_data allocation. Actually, no. AIUI, kvfree() can be used whether you used kvmalloc, vmalloc, or kmalloc for the alloc. > >> Signed-off-by: Lakshmi Ramasubramanian > >> Suggested-by: Rob Herring > >> Suggested-by: Thiago Jung Bauermann > >> --- > >> arch/powerpc/include/asm/kexec.h | 2 ++ > >> arch/powerpc/kexec/elf_64.c | 26 -- > >> arch/powerpc/kexec/file_load_64.c | 3 +++ > >> 3 files changed, 21 insertions(+), 10 deletions(-) > >> > >> diff --git a/arch/powerpc/include/asm/kexec.h > >> b/arch/powerpc/include/asm/kexec.h > >> index 2c0be93d239a..d7d13cac4d31 100644 > >> --- a/arch/powerpc/include/asm/kexec.h > >> +++ b/arch/powerpc/include/asm/kexec.h > >> @@ -111,6 +111,8 @@ struct kimage_arch { > >> unsigned long elf_headers_mem; > >> unsigned long elf_headers_sz; > >> void *elf_headers; > >> + > >> + void *fdt; > >> }; > >> > >> char *setup_kdump_cmdline(struct kimage *image, char *cmdline, > >> diff --git a/arch/powerpc/kexec/elf_64.c b/arch/powerpc/kexec/elf_64.c > >> index d0e459bb2f05..51d2d8eb6c1b 100644 > >> --- a/arch/powerpc/kexec/elf_64.c > >> +++ b/arch/powerpc/kexec/elf_64.c > >> @@ -19,6 +19,7 @@ > >> #include > >> #include > >> #include > >> +#include > >> #include > >> #include > >> #include > >> @@ -32,7 +33,7 @@ static void *elf64_load(struct kimage *image, char > >> *kernel_buf, > >> unsigned int fdt_size; > >> unsigned long kernel_load_addr; > >> unsigned long initrd_load_addr = 0, fdt_load_addr; > >> - void *fdt; > >> + void *fdt = NULL; > >> const void *slave_code; > >> struct elfhdr ehdr; > >> char *modified_cmdline = NULL; > >> @@ -103,18 +104,12 @@ static void *elf64_load(struct kimage *image, char > >> *kernel_buf, > >> } > >> > >> fdt_size = fdt_totalsize(initial_boot_params) * 2; > >> - fdt = kmalloc(fdt_size, GFP_KERNEL); > >> + fdt = of_alloc_and_init_fdt(fdt_size); > >> if (!fdt) { > >> pr_err("Not enough memory for the device tree.\n"); > >> ret = -ENOMEM; > >> goto out; > >> } > >> - ret = fdt_open_into(initial_boot_params, fdt, fdt_size); > >> - if (ret < 0) { > >> - pr_err("Error setting up the new device 
tree.\n"); > >> - ret = -EINVAL; > >> - goto out; > >> - } > >> > >> ret = setup_new_fdt_ppc64(image, fdt, initrd_load_addr, > > > > The first thing this function does is call setup_new_fdt() which first > > calls of_kexec_setup_new_fdt(). (Note, I really don't understand the > > PPC code split. It looks like there's a 32-bit and 64-bit split, but > > 32-bit looks broken to me. Nothing ever calls setup_new_fdt() except > > setup_new_fdt_ppc64()). The arm64 version is calling > > of_alloc_and_init_fdt() and then of_kexec_setup_new_fdt() directly. > > > > So we can just make of_alloc_and_init_fdt() also call > > of_kexec_setup_new_fdt() (really, just tweak of_kexec_setup_new_fdt do > > the alloc and copy). > ok - will move fdt allocation into of_kexec_setup_new_fdt(). > > I don't think the architecture needs to pick the > > size either. It's doubtful that either one is that sensitive to the > > amount of extra space. > I am not clear about the above comment - > are you saying the architectures don't need to pass FDT size to the > alloc function? > > arm64 is adding command line string length
Re: [PATCH 1/2] PCI/AER: Disable AER interrupt during suspend
[+cc Alex] On Thu, Jan 28, 2021 at 12:09:37PM +0800, Kai-Heng Feng wrote: > On Thu, Jan 28, 2021 at 4:51 AM Bjorn Helgaas wrote: > > On Thu, Jan 28, 2021 at 01:31:00AM +0800, Kai-Heng Feng wrote: > > > Commit 50310600ebda ("iommu/vt-d: Enable PCI ACS for platform opt in > > > hint") enables ACS, and some platforms lose its NVMe after resume from > > > firmware: > > > [ 50.947816] pcieport :00:1b.0: DPC: containment event, > > > status:0x1f01 source:0x > > > [ 50.947817] pcieport :00:1b.0: DPC: unmasked uncorrectable error > > > detected > > > [ 50.947829] pcieport :00:1b.0: PCIe Bus Error: > > > severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID) > > > [ 50.947830] pcieport :00:1b.0: device [8086:06ac] error > > > status/mask=0020/0001 > > > [ 50.947831] pcieport :00:1b.0:[21] ACSViol > > > (First) > > > [ 50.947841] pcieport :00:1b.0: AER: broadcast error_detected > > > message > > > [ 50.947843] nvme nvme0: frozen state error detected, reset controller > > > > > > It happens right after ACS gets enabled during resume. > > > > > > To prevent that from happening, disable AER interrupt and enable it on > > > system suspend and resume, respectively. > > > > Lots of questions here. Maybe this is what we'll end up doing, but I > > am curious about why the error is reported in the first place. > > > > Is this a consequence of the link going down and back up? > > Could be. From the observations, it only happens when firmware suspend > (S3) is used. > Maybe it happens when it's gets powered up, but I don't have equipment > to debug at hardware level. > > If we use non-firmware suspend method, enabling ACS after resume won't > trip AER and DPC. > > > Is it consequence of the device doing a DMA when it shouldn't? > > If it's doing DMA while suspending, the same error should also happen > after NVMe is suspended and before PCIe port suspending. > Furthermore, if non-firmware suspend method is used, there's so such > issue, so less likely to be any DMA operation. > > > Are we doing something in the wrong order during suspend? Or maybe > > resume, since I assume the error is reported during resume? > > Yes the error is reported during resume. The suspend/resume order > seems fine as non-firmware suspend doesn't have this issue. I really feel like we need a better understanding of what's going on here. Disabling the AER interrupt is like closing our eyes and pretending that because we don't see it, it didn't happen. An ACS error is triggered by a DMA, right? I'm assuming an MMIO access from the CPU wouldn't trigger this error. And it sounds like the error is triggered before we even start running the driver after resume. If we're powering up an NVMe device from D3cold and it DMAs before the driver touches it, something would be seriously broken. I doubt that's what's happening. Maybe a device could resume some previously programmed DMA after powering up from D3hot. Or maybe the error occurred on suspend, like if the device wasn't quiesced or something, but we didn't notice it until resume? The AER error status bits are RW1CS, which means they can be preserved across hot/warm/cold resets. Can you instrument the code to see whether the AER error status bit is set before enabling ACS? I'm not sure that merely enabling ACS (I assume you mean pci_std_enable_acs(), where we write PCI_ACS_CTRL) should cause an interrupt for a previously-logged error. I suspect that could happen when enabling *AER*, but I wouldn't think it would happen when enabling *ACS*. 
Does this error happen on multiple machines from different vendors? Wondering if it could be a BIOS issue, e.g., BIOS not cleaning up after it did something to cause an error. > > If we *do* take the error, why doesn't DPC recovery work? > > It works for the root port, but not for the NVMe drive: > [ 50.947816] pcieport :00:1b.0: DPC: containment event, > status:0x1f01 source:0x > [ 50.947817] pcieport :00:1b.0: DPC: unmasked uncorrectable error > detected > [ 50.947829] pcieport :00:1b.0: PCIe Bus Error: > severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver > ID) > [ 50.947830] pcieport :00:1b.0: device [8086:06ac] error > status/mask=0020/0001 > [ 50.947831] pcieport :00:1b.0:[21] ACSViol(First) > [ 50.947841] pcieport :00:1b.0: AER: broadcast error_detected message > [ 50.947843] nvme nvme0: frozen state error detected, reset controller > [ 50.948400] ACPI: EC: event unblocked > [ 50.948432] xhci_hcd :00:14.0: PME# disabled > [ 50.948444] xhci_hcd :00:14.0: enabling bus mastering > [ 50.949056] pcieport :00:1b.0: PME# disabled > [ 50.949068] pcieport :00:1c.0: PME# disabled > [ 50.949416] e1000e :00:1f.6: PME# disabled > [ 50.949463] e1000e :00:1f.6: enabling bus mastering > [ 50.951606] sd 0:0:0:0: [sda] Starting disk > [
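A sketch of the instrumentation suggested above, reading the port's AER uncorrectable error status before ACS gets enabled; the helper name and call site are hypothetical, but PCI_ERR_UNC_ACSV is the standard ACS Violation status bit:

/* Hypothetical debug helper: call just before the PCI_ACS_CTRL write in
 * pci_std_enable_acs() to see whether ACSViol was already logged (e.g.
 * during suspend) rather than newly triggered by enabling ACS. */
static void debug_dump_aer_uncor(struct pci_dev *pdev)
{
	u32 status;
	int aer = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ERR);

	if (!aer)
		return;

	pci_read_config_dword(pdev, aer + PCI_ERR_UNCOR_STATUS, &status);
	pci_info(pdev, "AER uncor status before ACS enable: %#010x%s\n",
		 status, (status & PCI_ERR_UNC_ACSV) ? " (ACSViol set)" : "");
}

Since the AER status bits are RW1CS, a bit that was set on suspend would survive resume and show up here before any new error is signaled.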
Re: [PATCH v16 11/12] powerpc: Use OF alloc and free for FDT
On 2/4/21 11:26 AM, Rob Herring wrote: On Thu, Feb 4, 2021 at 10:42 AM Lakshmi Ramasubramanian wrote: of_alloc_and_init_fdt() and of_free_fdt() have been defined in drivers/of/kexec.c to allocate and free memory for FDT. Use of_alloc_and_init_fdt() and of_free_fdt() to allocate and initialize the FDT, and to free the FDT respectively. powerpc sets the FDT address in image_loader_data field in "struct kimage" and the memory is freed in kimage_file_post_load_cleanup(). This cleanup function uses kfree() to free the memory. But since of_alloc_and_init_fdt() uses kvmalloc() for allocation, the buffer needs to be freed using kvfree(). You could just change the kexec core to call kvfree() instead. Define "fdt" field in "struct kimage_arch" for powerpc to store the address of FDT, and free the memory in powerpc specific arch_kimage_file_post_load_cleanup(). However, given all the other buffers have an explicit field in kimage or kimage_arch, changing powerpc is to match arm64 is better IMO. Just to be clear: I'll leave this as is - free FDT buffer in powerpc's arch_kimage_file_post_load_cleanup() to match arm64 behavior. Will not change "kexec core" to call kvfree() - doing that change would require changing all architectures to use kvmalloc() for image_loader_data allocation. Signed-off-by: Lakshmi Ramasubramanian Suggested-by: Rob Herring Suggested-by: Thiago Jung Bauermann --- arch/powerpc/include/asm/kexec.h | 2 ++ arch/powerpc/kexec/elf_64.c | 26 -- arch/powerpc/kexec/file_load_64.c | 3 +++ 3 files changed, 21 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h index 2c0be93d239a..d7d13cac4d31 100644 --- a/arch/powerpc/include/asm/kexec.h +++ b/arch/powerpc/include/asm/kexec.h @@ -111,6 +111,8 @@ struct kimage_arch { unsigned long elf_headers_mem; unsigned long elf_headers_sz; void *elf_headers; + + void *fdt; }; char *setup_kdump_cmdline(struct kimage *image, char *cmdline, diff --git a/arch/powerpc/kexec/elf_64.c b/arch/powerpc/kexec/elf_64.c index d0e459bb2f05..51d2d8eb6c1b 100644 --- a/arch/powerpc/kexec/elf_64.c +++ b/arch/powerpc/kexec/elf_64.c @@ -19,6 +19,7 @@ #include #include #include +#include #include #include #include @@ -32,7 +33,7 @@ static void *elf64_load(struct kimage *image, char *kernel_buf, unsigned int fdt_size; unsigned long kernel_load_addr; unsigned long initrd_load_addr = 0, fdt_load_addr; - void *fdt; + void *fdt = NULL; const void *slave_code; struct elfhdr ehdr; char *modified_cmdline = NULL; @@ -103,18 +104,12 @@ static void *elf64_load(struct kimage *image, char *kernel_buf, } fdt_size = fdt_totalsize(initial_boot_params) * 2; - fdt = kmalloc(fdt_size, GFP_KERNEL); + fdt = of_alloc_and_init_fdt(fdt_size); if (!fdt) { pr_err("Not enough memory for the device tree.\n"); ret = -ENOMEM; goto out; } - ret = fdt_open_into(initial_boot_params, fdt, fdt_size); - if (ret < 0) { - pr_err("Error setting up the new device tree.\n"); - ret = -EINVAL; - goto out; - } ret = setup_new_fdt_ppc64(image, fdt, initrd_load_addr, The first thing this function does is call setup_new_fdt() which first calls of_kexec_setup_new_fdt(). (Note, I really don't understand the PPC code split. It looks like there's a 32-bit and 64-bit split, but 32-bit looks broken to me. Nothing ever calls setup_new_fdt() except setup_new_fdt_ppc64()). The arm64 version is calling of_alloc_and_init_fdt() and then of_kexec_setup_new_fdt() directly. 
So we can just make of_alloc_and_init_fdt() also call of_kexec_setup_new_fdt() (really, just tweak of_kexec_setup_new_fdt do the alloc and copy). ok - will move fdt allocation into of_kexec_setup_new_fdt(). I don't think the architecture needs to pick the size either. It's doubtful that either one is that sensitive to the amount of extra space. I am not clear about the above comment - are you saying the architectures don't need to pass FDT size to the alloc function? arm64 is adding command line string length and some extra space to the size computed from initial_boot_params for FDT Size: buf_size = fdt_totalsize(initial_boot_params) + cmdline_len + DTB_EXTRA_SPACE; powerpc is just using twice the size computed from initial_boot_params fdt_size = fdt_totalsize(initial_boot_params) * 2; I think it would be safe to let arm64 and powerpc pass the required FDT size, along with the other params to of_kexec_setup_new_fdt() - and in this function we allocate FDT and set it up. And, for powerpc leave the remaining code in setup_new_fdt_ppc64(). Would that be ok?
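A sketch of that direction, assuming the allocation and copy move into the OF helper and each architecture only supplies its extra headroom; the name of_kexec_alloc_and_setup_fdt() and the extra_fdt_size parameter are illustrative, not settled API:

void *of_kexec_alloc_and_setup_fdt(const struct kimage *image,
				   unsigned long initrd_load_addr,
				   unsigned long initrd_len,
				   const char *cmdline,
				   size_t extra_fdt_size)
{
	void *fdt;
	size_t fdt_size;

	/* base size, plus room for the appended bootargs and extra nodes */
	fdt_size = fdt_totalsize(initial_boot_params) +
		   (cmdline ? strlen(cmdline) : 0) + extra_fdt_size;

	fdt = kvmalloc(fdt_size, GFP_KERNEL);
	if (!fdt)
		return NULL;

	/* duplicate the boot FDT into the writable buffer, then fix it up */
	if (fdt_open_into(initial_boot_params, fdt, fdt_size) ||
	    of_kexec_setup_new_fdt(image, fdt, initrd_load_addr,
				   initrd_len, cmdline)) {
		kvfree(fdt);
		return NULL;
	}

	return fdt;
}

With that shape, both the arm64 and powerpc loaders reduce to one call plus kexec_add_buffer(), and only the headroom constant differs per architecture.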
Re: [PATCH] arch:powerpc simple_write_to_buffer return check
On Fri, Feb 5, 2021 at 5:17 AM Mayank Suman wrote: > > Signed-off-by: Mayank Suman commit messages aren't optional > --- > arch/powerpc/kernel/eeh.c| 8  > arch/powerpc/platforms/powernv/eeh-powernv.c | 4 ++-- > 2 files changed, 6 insertions(+), 6 deletions(-) > > diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c > index 813713c9120c..2dbe1558a71f 100644 > --- a/arch/powerpc/kernel/eeh.c > +++ b/arch/powerpc/kernel/eeh.c > @@ -1628,8 +1628,8 @@ static ssize_t eeh_force_recover_write(struct file > *filp, > char buf[20]; > int ret; > > - ret = simple_write_to_buffer(buf, sizeof(buf), ppos, user_buf, count); > - if (!ret) > + ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, > count); We should probably be zeroing the buffer. Reading to sizeof(buf) - 1 is done in a few places to guarantee that the string is NUL-terminated, but without the preceding memset() that isn't actually guaranteed. > + if (ret <= 0) > return -EFAULT; EFAULT is supposed to be returned when the user supplies a buffer to write(2) which is outside their address space. I figured letting the sscanf() in the next step fail if the user writes a zero-length buffer and returning EINVAL made more sense. That said, the exact semantics around zero length writes are pretty handwavy so I guess this isn't wrong, but I don't think it's better either. > /* > @@ -1696,7 +1696,7 @@ static ssize_t eeh_dev_check_write(struct file *filp, > > memset(buf, 0, sizeof(buf)); > ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, > count); > - if (!ret) > + if (ret <= 0) > return -EFAULT; > > ret = sscanf(buf, "%x:%x:%x.%x", &domain, &bus, &dev, &fn); > @@ -1836,7 +1836,7 @@ static ssize_t eeh_dev_break_write(struct file *filp, > > memset(buf, 0, sizeof(buf)); > ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, > count); > - if (!ret) > + if (ret <= 0) > return -EFAULT; > > ret = sscanf(buf, "%x:%x:%x.%x", &domain, &bus, &dev, &fn); > diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c > b/arch/powerpc/platforms/powernv/eeh-powernv.c > index 89e22c460ebf..36ed2b8f7375 100644 > --- a/arch/powerpc/platforms/powernv/eeh-powernv.c > +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c > @@ -76,8 +76,8 @@ static ssize_t pnv_eeh_ei_write(struct file *filp, > return -ENXIO; > > /* Copy over argument buffer */ > - ret = simple_write_to_buffer(buf, sizeof(buf), ppos, user_buf, count); > - if (!ret) > + ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, > count); > + if (ret <= 0) > return -EFAULT; > > /* Retrieve parameters */ > -- > 2.30.0 >
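Putting those review points together, a minimal sketch of the idiom under discussion (the handler name and the parsed format are illustrative only): zero the buffer so the copied string is guaranteed NUL-terminated, read at most sizeof(buf) - 1 bytes, and propagate simple_write_to_buffer()'s own error code instead of collapsing everything into -EFAULT:

static ssize_t example_debugfs_write(struct file *filp,
				     const char __user *user_buf,
				     size_t count, loff_t *ppos)
{
	char buf[20];
	unsigned int val;
	ssize_t ret;

	memset(buf, 0, sizeof(buf));	/* guarantees NUL termination below */
	ret = simple_write_to_buffer(buf, sizeof(buf) - 1, ppos, user_buf, count);
	if (ret < 0)
		return ret;		/* e.g. -EFAULT on a bad user pointer */
	if (ret == 0)
		return -EINVAL;		/* zero-length write: nothing to parse */

	if (sscanf(buf, "%x", &val) != 1)
		return -EINVAL;

	/* ... act on val ... */
	return count;
}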
[PATCH] KVM: PPC: Book3S HV: Save and restore FSCR in the P9 path
The Facility Status and Control Register is a privileged SPR that defines the availability of some features in problem state. Since it can be written by the guest, we must restore it to the previous host value after guest exit. This restoration is currently done by taking the value from current->thread.fscr, which in the P9 path is not enough anymore because the guest could context switch the QEMU thread, causing the guest-current value to be saved into the thread struct. The above situation manifested when running a QEMU linked against a libc with System Call Vectored support, which causes scv instructions to be run by QEMU early during the guest boot (during SLOF), at which point the FSCR is 0 due to guest entry. After a few scv calls (1 to a couple hundred), the context switching happens and the QEMU thread runs with the guest value, resulting in a Facility Unavailable interrupt. This patch saves and restores the host value of FSCR in the inner guest entry loop in a way independent of current->thread.fscr. The old way of doing it is still kept in place because it works for the old entry path. Signed-off-by: Fabiano Rosas --- arch/powerpc/kvm/book3s_hv.c | 4 1 file changed, 4 insertions(+) diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index 6f612d240392..f2ddf7139a2a 100644 --- a/arch/powerpc/kvm/book3s_hv.c +++ b/arch/powerpc/kvm/book3s_hv.c @@ -3595,6 +3595,7 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, u64 time_limit, unsigned long host_tidr = mfspr(SPRN_TIDR); unsigned long host_iamr = mfspr(SPRN_IAMR); unsigned long host_amr = mfspr(SPRN_AMR); + unsigned long host_fscr = mfspr(SPRN_FSCR); s64 dec; u64 tb; int trap, save_pmu; @@ -3735,6 +3736,9 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, u64 time_limit, if (host_amr != vcpu->arch.amr) mtspr(SPRN_AMR, host_amr); + if (host_fscr != vcpu->arch.fscr) + mtspr(SPRN_FSCR, host_fscr); + msr_check_and_set(MSR_FP | MSR_VEC | MSR_VSX); store_fp_state(&vcpu->arch.fp); #ifdef CONFIG_ALTIVEC -- 2.29.2
[PATCH] arch:powerpc simple_write_to_buffer return check
Signed-off-by: Mayank Suman --- arch/powerpc/kernel/eeh.c| 8 arch/powerpc/platforms/powernv/eeh-powernv.c | 4 ++-- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c index 813713c9120c..2dbe1558a71f 100644 --- a/arch/powerpc/kernel/eeh.c +++ b/arch/powerpc/kernel/eeh.c @@ -1628,8 +1628,8 @@ static ssize_t eeh_force_recover_write(struct file *filp, char buf[20]; int ret; - ret = simple_write_to_buffer(buf, sizeof(buf), ppos, user_buf, count); - if (!ret) + ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, count); + if (ret <= 0) return -EFAULT; /* @@ -1696,7 +1696,7 @@ static ssize_t eeh_dev_check_write(struct file *filp, memset(buf, 0, sizeof(buf)); ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, count); - if (!ret) + if (ret <= 0) return -EFAULT; ret = sscanf(buf, "%x:%x:%x.%x", &domain, &bus, &dev, &fn); @@ -1836,7 +1836,7 @@ static ssize_t eeh_dev_break_write(struct file *filp, memset(buf, 0, sizeof(buf)); ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, count); - if (!ret) + if (ret <= 0) return -EFAULT; ret = sscanf(buf, "%x:%x:%x.%x", &domain, &bus, &dev, &fn); diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c index 89e22c460ebf..36ed2b8f7375 100644 --- a/arch/powerpc/platforms/powernv/eeh-powernv.c +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c @@ -76,8 +76,8 @@ static ssize_t pnv_eeh_ei_write(struct file *filp, return -ENXIO; /* Copy over argument buffer */ - ret = simple_write_to_buffer(buf, sizeof(buf), ppos, user_buf, count); - if (!ret) + ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, count); + if (ret <= 0) return -EFAULT; /* Retrieve parameters */ -- 2.30.0
Re: [PATCH RFC v1 2/6] swiotlb: convert variables to arrays
On Thu, Feb 04, 2021 at 11:49:23AM +0000, Robin Murphy wrote: > On 2021-02-04 07:29, Christoph Hellwig wrote: > > On Wed, Feb 03, 2021 at 03:37:05PM -0800, Dongli Zhang wrote: > > > This patch converts several swiotlb related variables to arrays, in > > > order to maintain stat/status for different swiotlb buffers. Here are > > > variables involved: > > > > > > - io_tlb_start and io_tlb_end > > > - io_tlb_nslabs and io_tlb_used > > > - io_tlb_list > > > - io_tlb_index > > > - max_segment > > > - io_tlb_orig_addr > > > - no_iotlb_memory > > > > > > There is no functional change and this is to prepare to enable 64-bit > > > swiotlb. > > > > Claire Chang (on Cc) already posted a patch like this a month ago, > > which looks much better because it actually uses a struct instead > > of all the random variables. > > Indeed, I skimmed the cover letter and immediately thought that this whole > thing is just the restricted DMA pool concept[1] again, only from a slightly > different angle. Kind of. Let me lay out how some of these pieces are right now: +---+ +--+ | | | | | | | | | a)Xen-SWIOTLB | | b)SWIOTLB (for !Xen) | | | | | +---XX--+ +---X--+ X XX XXX X XX +--XX---+ | | | | | c) SWIOTLB generic | | | +---+ Dongli's patches modify the SWIOTLB generic c), and Xen-SWIOTLB a) parts. Also see the IOMMU_INIT logic which lays this out in a bit more depth (for example how to enable SWIOTLB on AMD boxes, or IBM with Calgary IOMMU, etc - see iommu_table.h). Furthermore it lays the groundwork to allocate AMD SEV SWIOTLB buffers later after boot (so that you can stitch different pools together). All the bits are kind of inside of the SWIOTLB code. And also it changes the Xen-SWIOTLB to do something similar. The mempool did it similarly by taking the internal parts (aka the various io_tlb) of SWIOTLB and exposing them out and having other code: +---+ +--+ | | | | | | | | | a)Xen-SWIOTLB | | b)SWIOTLB (for !Xen) | | | | | +---XX--+ +---X--+ X XX XXX X XX +--XX---+ +--+ | | | Device tree | | +<+ enabling SWIOTLB | |c) SWIOTLB generic | | | | | | mempool | +---+ +--+ What I was suggesting to Claire was to follow the Xen model, that is, do something like this: +---+ +--+ ++ | | | | || | | | | || | a)Xen-SWIOTLB | | b)SWIOTLB (for !Xen) | | e) DT-SWIOTLB | | | | | || +---XX--+ +---X--+ +XX-X+ XXXX X X XX X XX XX XXX X XX X +--XXX--+ | | | | |c) SWIOTLB generic | | | +---+ so using the SWIOTLB generic parts, and then bolt on top of the device-tree logic, along with the mempool logic. But Christoph has an interesting suggestion which is to squash all the existing code (a, b, c) all together and pepper it with various jump-tables. So: -+ | SWIOTLB: | || | a) SWIOTLB (for non-Xen) | | b) Xen-SWIOTLB| | c) DT-SWIOTLB | || || -+ with all the various bits (M2P/P2M for Xen, mempool for ARM, and normal allocation for BM) in one big file.
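For contrast, here is a rough sketch of the struct-based shape referred to above; the field names are assumptions modeled on Claire Chang's restricted DMA pool series rather than a settled interface. Each pool becomes an instance instead of a set of globals:

struct io_tlb_mem {
	phys_addr_t	start;		/* was io_tlb_start */
	phys_addr_t	end;		/* was io_tlb_end */
	unsigned long	nslabs;		/* was io_tlb_nslabs */
	unsigned long	used;		/* was io_tlb_used */
	unsigned int	index;		/* was io_tlb_index */
	unsigned int	*list;		/* was io_tlb_list */
	phys_addr_t	*orig_addr;	/* was io_tlb_orig_addr */
	unsigned int	max_segment;
	bool		no_iotlb_memory;
	spinlock_t	lock;
};

A 64-bit pool, an SEV pool allocated after boot, or a per-device restricted pool is then just another struct io_tlb_mem instance handed to the common code.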
Re: [PATCH v16 11/12] powerpc: Use OF alloc and free for FDT
On Thu, Feb 4, 2021 at 10:42 AM Lakshmi Ramasubramanian wrote: > > of_alloc_and_init_fdt() and of_free_fdt() have been defined in > drivers/of/kexec.c to allocate and free memory for FDT. > > Use of_alloc_and_init_fdt() and of_free_fdt() to allocate and > initialize the FDT, and to free the FDT respectively. > > powerpc sets the FDT address in image_loader_data field in > "struct kimage" and the memory is freed in > kimage_file_post_load_cleanup(). This cleanup function uses kfree() > to free the memory. But since of_alloc_and_init_fdt() uses kvmalloc() > for allocation, the buffer needs to be freed using kvfree(). You could just change the kexec core to call kvfree() instead. > Define "fdt" field in "struct kimage_arch" for powerpc to store > the address of FDT, and free the memory in powerpc specific > arch_kimage_file_post_load_cleanup(). However, given all the other buffers have an explicit field in kimage or kimage_arch, changing powerpc to match arm64 is better IMO. > Signed-off-by: Lakshmi Ramasubramanian > Suggested-by: Rob Herring > Suggested-by: Thiago Jung Bauermann > --- > arch/powerpc/include/asm/kexec.h | 2 ++ > arch/powerpc/kexec/elf_64.c | 26 -- > arch/powerpc/kexec/file_load_64.c | 3 +++ > 3 files changed, 21 insertions(+), 10 deletions(-) > > diff --git a/arch/powerpc/include/asm/kexec.h > b/arch/powerpc/include/asm/kexec.h > index 2c0be93d239a..d7d13cac4d31 100644 > --- a/arch/powerpc/include/asm/kexec.h > +++ b/arch/powerpc/include/asm/kexec.h > @@ -111,6 +111,8 @@ struct kimage_arch { > unsigned long elf_headers_mem; > unsigned long elf_headers_sz; > void *elf_headers; > + > + void *fdt; > }; > > char *setup_kdump_cmdline(struct kimage *image, char *cmdline, > diff --git a/arch/powerpc/kexec/elf_64.c b/arch/powerpc/kexec/elf_64.c > index d0e459bb2f05..51d2d8eb6c1b 100644 > --- a/arch/powerpc/kexec/elf_64.c > +++ b/arch/powerpc/kexec/elf_64.c > @@ -19,6 +19,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -32,7 +33,7 @@ static void *elf64_load(struct kimage *image, char > *kernel_buf, > unsigned int fdt_size; > unsigned long kernel_load_addr; > unsigned long initrd_load_addr = 0, fdt_load_addr; > - void *fdt; > + void *fdt = NULL; > const void *slave_code; > struct elfhdr ehdr; > char *modified_cmdline = NULL; > @@ -103,18 +104,12 @@ static void *elf64_load(struct kimage *image, char > *kernel_buf, > } > > fdt_size = fdt_totalsize(initial_boot_params) * 2; > - fdt = kmalloc(fdt_size, GFP_KERNEL); > + fdt = of_alloc_and_init_fdt(fdt_size); > if (!fdt) { > pr_err("Not enough memory for the device tree.\n"); > ret = -ENOMEM; > goto out; > } > - ret = fdt_open_into(initial_boot_params, fdt, fdt_size); > - if (ret < 0) { > - pr_err("Error setting up the new device tree.\n"); > - ret = -EINVAL; > - goto out; > - } > > ret = setup_new_fdt_ppc64(image, fdt, initrd_load_addr, The first thing this function does is call setup_new_fdt() which first calls of_kexec_setup_new_fdt(). (Note, I really don't understand the PPC code split. It looks like there's a 32-bit and 64-bit split, but 32-bit looks broken to me. Nothing ever calls setup_new_fdt() except setup_new_fdt_ppc64()). The arm64 version is calling of_alloc_and_init_fdt() and then of_kexec_setup_new_fdt() directly. So we can just make of_alloc_and_init_fdt() also call of_kexec_setup_new_fdt() (really, just tweak of_kexec_setup_new_fdt to do the alloc and copy). I don't think the architecture needs to pick the size either.
It's doubtful that either one is that sensitive to the amount of extra space. > initrd_len, cmdline); > @@ -131,6 +126,10 @@ static void *elf64_load(struct kimage *image, char > *kernel_buf, > ret = kexec_add_buffer(); > if (ret) > goto out; > + > + /* FDT will be freed in arch_kimage_file_post_load_cleanup */ > + image->arch.fdt = fdt; > + > fdt_load_addr = kbuf.mem; > > pr_debug("Loaded device tree at 0x%lx\n", fdt_load_addr); > @@ -145,8 +144,15 @@ static void *elf64_load(struct kimage *image, char > *kernel_buf, > kfree(modified_cmdline); > kexec_free_elf_info(_info); > > - /* Make kimage_file_post_load_cleanup free the fdt buffer for us. */ > - return ret ? ERR_PTR(ret) : fdt; > + /* > +* Once FDT buffer has been successfully passed to kexec_add_buffer(), > +* the FDT buffer address is saved in image->arch.fdt. In that case, > +* the memory cannot be freed here in case of any other error. > +*/ > + if (ret && !image->arch.fdt) > + of_free_fdt(fdt); Just call kvfree() directly. > + > +
Re: [PATCH] cpufreq: Remove unused flag CPUFREQ_PM_NO_WARN
On Tue, Feb 2, 2021 at 6:42 AM Viresh Kumar wrote: > > This flag is set by one of the drivers but it isn't used in the code > otherwise. Remove the unused flag and update the driver. > > Signed-off-by: Viresh Kumar Applied as 5.12 material, thanks! > --- > Rebased over: > > https://lore.kernel.org/lkml/a59bb322b22c247d570b70a8e94067804287623b.1612241683.git.viresh.ku...@linaro.org/ > > drivers/cpufreq/pmac32-cpufreq.c | 3 +-- > include/linux/cpufreq.h | 13 + > 2 files changed, 6 insertions(+), 10 deletions(-) > > diff --git a/drivers/cpufreq/pmac32-cpufreq.c > b/drivers/cpufreq/pmac32-cpufreq.c > index 73621bc11976..4f20c6a9108d 100644 > --- a/drivers/cpufreq/pmac32-cpufreq.c > +++ b/drivers/cpufreq/pmac32-cpufreq.c > @@ -439,8 +439,7 @@ static struct cpufreq_driver pmac_cpufreq_driver = { > .init = pmac_cpufreq_cpu_init, > .suspend= pmac_cpufreq_suspend, > .resume = pmac_cpufreq_resume, > - .flags = CPUFREQ_PM_NO_WARN | > - CPUFREQ_NO_AUTO_DYNAMIC_SWITCHING, > + .flags = CPUFREQ_NO_AUTO_DYNAMIC_SWITCHING, > .attr = cpufreq_generic_attr, > .name = "powermac", > }; > diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h > index c8e40e91fe9b..353969c7acd3 100644 > --- a/include/linux/cpufreq.h > +++ b/include/linux/cpufreq.h > @@ -398,8 +398,11 @@ struct cpufreq_driver { > /* loops_per_jiffy or other kernel "constants" aren't affected by frequency > transitions */ > #define CPUFREQ_CONST_LOOPSBIT(1) > > -/* don't warn on suspend/resume speed mismatches */ > -#define CPUFREQ_PM_NO_WARN BIT(2) > +/* > + * Set by drivers that want the core to automatically register the cpufreq > + * driver as a thermal cooling device. > + */ > +#define CPUFREQ_IS_COOLING_DEV BIT(2) > > /* > * This should be set by platforms having multiple clock-domains, i.e. > @@ -431,12 +434,6 @@ struct cpufreq_driver { > */ > #define CPUFREQ_NO_AUTO_DYNAMIC_SWITCHING BIT(6) > > -/* > - * Set by drivers that want the core to automatically register the cpufreq > - * driver as a thermal cooling device. > - */ > -#define CPUFREQ_IS_COOLING_DEV BIT(7) > - > int cpufreq_register_driver(struct cpufreq_driver *driver_data); > int cpufreq_unregister_driver(struct cpufreq_driver *driver_data); > > -- > 2.25.0.rc1.19.g042ed3e048af >
Re: [PATCH v16 10/12] arm64: Use OF alloc and free functions for FDT
On Thu, Feb 04, 2021 at 08:41:33AM -0800, Lakshmi Ramasubramanian wrote: > of_alloc_and_init_fdt() and of_free_fdt() have been defined in > drivers/of/kexec.c to allocate and free memory for FDT. > > Use of_alloc_and_init_fdt() and of_free_fdt() to allocate and > initialize the FDT, and to free the FDT respectively. > > Signed-off-by: Lakshmi Ramasubramanian > Suggested-by: Rob Herring > --- > arch/arm64/kernel/machine_kexec_file.c | 37 +++--- > 1 file changed, 10 insertions(+), 27 deletions(-) Acked-by: Will Deacon Will
[PATCH v2 1/2] ima: Free IMA measurement buffer on error
IMA allocates kernel virtual memory to carry forward the measurement list, from the current kernel to the next kernel on the kexec system call, in the ima_add_kexec_buffer() function. In error code paths this memory is not freed, resulting in a memory leak. Free the memory allocated for the IMA measurement list in the error code paths of ima_add_kexec_buffer(). Signed-off-by: Lakshmi Ramasubramanian Suggested-by: Tyler Hicks Fixes: 7b8589cc29e7 ("ima: on soft reboot, save the measurement list") --- security/integrity/ima/ima_kexec.c | 1 + 1 file changed, 1 insertion(+) diff --git a/security/integrity/ima/ima_kexec.c b/security/integrity/ima/ima_kexec.c index 121de3e04af2..206ddcaa5c67 100644 --- a/security/integrity/ima/ima_kexec.c +++ b/security/integrity/ima/ima_kexec.c @@ -119,6 +119,7 @@ void ima_add_kexec_buffer(struct kimage *image) ret = kexec_add_buffer(&kbuf); if (ret) { pr_err("Error passing over kexec measurement buffer.\n"); + vfree(kexec_buffer); return; } -- 2.30.0
[PATCH v2 2/2] ima: Free IMA measurement buffer after kexec syscall
IMA allocates kernel virtual memory to carry forward the measurement list, from the current kernel to the next kernel on the kexec system call, in the ima_add_kexec_buffer() function. This buffer is not freed before completing the kexec system call, resulting in a memory leak. Add an ima_buffer field in "struct kimage" to store the virtual address of the buffer allocated for the IMA measurement list. Free the memory allocated for the IMA measurement list in the kimage_file_post_load_cleanup() function. Signed-off-by: Lakshmi Ramasubramanian Suggested-by: Tyler Hicks Reviewed-by: Thiago Jung Bauermann Reviewed-by: Tyler Hicks Fixes: 7b8589cc29e7 ("ima: on soft reboot, save the measurement list") --- include/linux/kexec.h | 5 + kernel/kexec_file.c| 5 + security/integrity/ima/ima_kexec.c | 2 ++ 3 files changed, 12 insertions(+) diff --git a/include/linux/kexec.h b/include/linux/kexec.h index 9e93bef52968..5f61389f5f36 100644 --- a/include/linux/kexec.h +++ b/include/linux/kexec.h @@ -300,6 +300,11 @@ struct kimage { /* Information for loading purgatory */ struct purgatory_info purgatory_info; #endif + +#ifdef CONFIG_IMA_KEXEC + /* Virtual address of IMA measurement buffer for kexec syscall */ + void *ima_buffer; +#endif }; /* kexec interface functions */ diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c index b02086d70492..5c3447cf7ad5 100644 --- a/kernel/kexec_file.c +++ b/kernel/kexec_file.c @@ -166,6 +166,11 @@ void kimage_file_post_load_cleanup(struct kimage *image) vfree(pi->sechdrs); pi->sechdrs = NULL; +#ifdef CONFIG_IMA_KEXEC + vfree(image->ima_buffer); + image->ima_buffer = NULL; +#endif /* CONFIG_IMA_KEXEC */ + /* See if architecture has anything to cleanup post load */ arch_kimage_file_post_load_cleanup(image); diff --git a/security/integrity/ima/ima_kexec.c b/security/integrity/ima/ima_kexec.c index 206ddcaa5c67..e29bea3dd4cc 100644 --- a/security/integrity/ima/ima_kexec.c +++ b/security/integrity/ima/ima_kexec.c @@ -129,6 +129,8 @@ void ima_add_kexec_buffer(struct kimage *image) return; } + image->ima_buffer = kexec_buffer; + pr_debug("kexec measurement buffer for the loaded kernel at 0x%lx.\n", kbuf.mem); } -- 2.30.0
Re: [PATCH] powerpc/fault: fix wrong KUAP fault for IO_URING
On 2/4/21 11:27 PM, Zorro Lang wrote: On Thu, Feb 04, 2021 at 10:31:59PM +0530, Aneesh Kumar K.V wrote: On 2/4/21 10:23 PM, Jens Axboe wrote: On 2/1/21 11:30 PM, Aneesh Kumar K.V wrote: On 2/2/21 11:50 AM, Christophe Leroy wrote: Le 02/02/2021 à 07:16, Aneesh Kumar K.V a écrit : On 2/2/21 11:32 AM, Christophe Leroy wrote: Le 02/02/2021 à 06:55, Aneesh Kumar K.V a écrit : Aneesh Kumar K.V writes: Nicholas Piggin writes: Excerpts from Michael Ellerman's message of January 30, 2021 9:22 pm: Christophe Leroy writes: +Aneesh Le 29/01/2021 à 07:52, Zorro Lang a écrit : .. [ 96.200296] [ cut here ] [ 96.200304] Bug: Read fault blocked by KUAP! [ 96.200309] WARNING: CPU: 3 PID: 1876 at arch/powerpc/mm/fault.c:229 bad_kernel_fault+0x180/0x310 [ 96.200734] NIP [c0849424] fault_in_pages_readable+0x104/0x350 [ 96.200741] LR [c084952c] fault_in_pages_readable+0x20c/0x350 [ 96.200747] --- interrupt: 300 Problem happens in a section where userspace access is supposed to be granted, so the patch you proposed is definitely not the right fix. c0849408:2c 01 00 4c isync c084940c:a6 03 3d 7d mtspr 29,r9 <== granting userspace access permission c0849410:2c 01 00 4c isync c0849414:00 00 36 e9 ld r9,0(r22) c0849418:20 00 29 81 lwz r9,32(r9) c084941c:00 02 29 71 andi. r9,r9,512 c0849420:78 d3 5e 7f mr r30,r26 ==> c0849424:00 00 bf 8b lbz r29,0(r31) <== accessing userspace c0849428:10 00 82 41 beq c0849438 c084942c:2c 01 00 4c isync c0849430:a6 03 bd 7e mtspr 29,r21 <== clearing userspace access permission c0849434:2c 01 00 4c isync My first guess is that the problem is linked to the following function, see the comment /* * For kernel thread that doesn't have thread.regs return * default AMR/IAMR values. */ static inline u64 current_thread_amr(void) { if (current->thread.regs) return current->thread.regs->amr; return AMR_KUAP_BLOCKED; } Above function was introduced by commit 48a8ab4eeb82 ("powerpc/book3s64/pkeys: Don't update SPRN_AMR when in kernel mode") Yeah that's a bit of a curly one. At some point io_uring did kthread_use_mm(), which is supposed to mean the kthread can operate on behalf of the original process that submitted the IO. But because KUAP is implemented using memory protection keys, it depends on the value of the AMR register, which is not part of the mm, it's in thread.regs->amr. And what's worse by the time we're in kthread_use_mm() we no longer have access to the thread.regs->amr of the original process that submitted the IO. We also can't simply move the AMR into the mm, precisely because it's per thread, not per mm. So TBH I don't know how we're going to fix this. I guess we could return AMR=unblocked for kernel threads, but that's arguably a bug because it allows a process to circumvent memory keys by asking the kernel to do the access. We shouldn't need to inherit AMR should we? We only need it to be locked for kernel threads until it's explicitly unlocked -- nothing mm specific there. I think current_thread_amr could return 0 for kernel threads? Or I would even avoid using that function for allow_user_access and open code the kthread case and remove it from current_thread_amr(). Thanks, Nick updated one From 8fdb0680f983940d61f91da8252b13c8d3e8ebee Mon Sep 17 00:00:00 2001 From: "Aneesh Kumar K.V" Date: Tue, 2 Feb 2021 09:23:38 +0530 Subject: [PATCH v2] powerpc/kuap: Allow kernel thread to access userspace after kthread_use_mm This fix the bad fault reported by KUAP when io_wqe_worker access userspace. Bug: Read fault blocked by KUAP! 
WARNING: CPU: 1 PID: 101841 at arch/powerpc/mm/fault.c:229 __do_page_fault+0x6b4/0xcd0 NIP [c009e7e4] __do_page_fault+0x6b4/0xcd0 LR [c009e7e0] __do_page_fault+0x6b0/0xcd0 .. Call Trace: [c00016367330] [c009e7e0] __do_page_fault+0x6b0/0xcd0 (unreliable) [c000163673e0] [c009ee3c] do_page_fault+0x3c/0x120 [c00016367430] [c000c848] handle_page_fault+0x10/0x2c --- interrupt: 300 at iov_iter_fault_in_readable+0x148/0x6f0 .. NIP [c08e8228] iov_iter_fault_in_readable+0x148/0x6f0 LR [c08e834c] iov_iter_fault_in_readable+0x26c/0x6f0 interrupt: 300 [c000163677e0] [c07154a0] iomap_write_actor+0xc0/0x280 [c00016367880] [c070fc94] iomap_apply+0x1c4/0x780 [c00016367990] [c0710330] iomap_file_buffered_write+0xa0/0x120 [c000163679e0] [c0080040791c] xfs_file_buffered_aio_write+0x314/0x5e0 [xfs] [c00016367a90] [c06d74bc] io_write+0x10c/0x460 [c00016367bb0] [c06d80e4] io_issue_sqe+0x8d4/0x1200
Re: [PATCH] powerpc/fault: fix wrong KUAP fault for IO_URING
On Thu, Feb 04, 2021 at 10:31:59PM +0530, Aneesh Kumar K.V wrote: > On 2/4/21 10:23 PM, Jens Axboe wrote: > > On 2/1/21 11:30 PM, Aneesh Kumar K.V wrote: > > > On 2/2/21 11:50 AM, Christophe Leroy wrote: > > > > > > > > > > > > Le 02/02/2021 à 07:16, Aneesh Kumar K.V a écrit : > > > > > On 2/2/21 11:32 AM, Christophe Leroy wrote: > > > > > > > > > > > > > > > > > > Le 02/02/2021 à 06:55, Aneesh Kumar K.V a écrit : > > > > > > > Aneesh Kumar K.V writes: > > > > > > > > > > > > > > > Nicholas Piggin writes: > > > > > > > > > > > > > > > > > Excerpts from Michael Ellerman's message of January 30, 2021 > > > > > > > > > 9:22 pm: > > > > > > > > > > Christophe Leroy writes: > > > > > > > > > > > +Aneesh > > > > > > > > > > > > > > > > > > > > > > Le 29/01/2021 à 07:52, Zorro Lang a écrit : > > > > > > > > > > .. > > > > > > > > > > > > [ 96.200296] [ cut here ] > > > > > > > > > > > > [ 96.200304] Bug: Read fault blocked by KUAP! > > > > > > > > > > > > [ 96.200309] WARNING: CPU: 3 PID: 1876 at > > > > > > > > > > > > arch/powerpc/mm/fault.c:229 bad_kernel_fault+0x180/0x310 > > > > > > > > > > > > > > > > > > > > > > > [ 96.200734] NIP [c0849424] > > > > > > > > > > > > fault_in_pages_readable+0x104/0x350 > > > > > > > > > > > > [ 96.200741] LR [c084952c] > > > > > > > > > > > > fault_in_pages_readable+0x20c/0x350 > > > > > > > > > > > > [ 96.200747] --- interrupt: 300 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Problem happens in a section where userspace access is > > > > > > > > > > > supposed > > > > > > > > > > > to be granted, so the patch you > > > > > > > > > > > proposed is definitely not the right fix. > > > > > > > > > > > > > > > > > > > > > > c0849408:2c 01 00 4c isync > > > > > > > > > > > c084940c:a6 03 3d 7d mtspr 29,r9 <== > > > > > > > > > > > granting > > > > > > > > > > > userspace access permission > > > > > > > > > > > c0849410:2c 01 00 4c isync > > > > > > > > > > > c0849414:00 00 36 e9 ld r9,0(r22) > > > > > > > > > > > c0849418:20 00 29 81 lwz r9,32(r9) > > > > > > > > > > > c084941c:00 02 29 71 andi. r9,r9,512 > > > > > > > > > > > c0849420:78 d3 5e 7f mr r30,r26 > > > > > > > > > > > ==> c0849424:00 00 bf 8b lbz > > > > > > > > > > > r29,0(r31) <== > > > > > > > > > > > accessing userspace > > > > > > > > > > > c0849428:10 00 82 41 beq > > > > > > > > > > > c0849438 > > > > > > > > > > > > > > > > > > > > > > c084942c:2c 01 00 4c isync > > > > > > > > > > > c0849430:a6 03 bd 7e mtspr 29,r21 <== > > > > > > > > > > > clearing userspace access permission > > > > > > > > > > > c0849434:2c 01 00 4c isync > > > > > > > > > > > > > > > > > > > > > > My first guess is that the problem is linked to the > > > > > > > > > > > following > > > > > > > > > > > function, see the comment > > > > > > > > > > > > > > > > > > > > > > /* > > > > > > > > > > > * For kernel thread that doesn't have thread.regs > > > > > > > > > > > return > > > > > > > > > > > * default AMR/IAMR values. > > > > > > > > > > > */ > > > > > > > > > > > static inline u64 current_thread_amr(void) > > > > > > > > > > > { > > > > > > > > > > > if (current->thread.regs) > > > > > > > > > > > return current->thread.regs->amr; > > > > > > > > > > > return AMR_KUAP_BLOCKED; > > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > Above function was introduced by commit 48a8ab4eeb82 > > > > > > > > > > > ("powerpc/book3s64/pkeys: Don't update SPRN_AMR > > > > > > > > > > > when in kernel mode") > > > > > > > > > > > > > > > > > > > > Yeah that's a bit of a curly one. 
> > > > > > > > > > > > > > > > > > > > At some point io_uring did kthread_use_mm(), which is > > > > > > > > > > supposed to > > > > > > > > > > mean > > > > > > > > > > the kthread can operate on behalf of the original process > > > > > > > > > > that > > > > > > > > > > submitted > > > > > > > > > > the IO. > > > > > > > > > > > > > > > > > > > > But because KUAP is implemented using memory protection > > > > > > > > > > keys, it > > > > > > > > > > depends > > > > > > > > > > on the value of the AMR register, which is not part of the > > > > > > > > > > mm, > > > > > > > > > > it's in > > > > > > > > > > thread.regs->amr. > > > > > > > > > > > > > > > > > > > > And what's worse by the time we're in kthread_use_mm() we no > > > > > > > > > > longer have > > > > > > > > > > access to the thread.regs->amr of the original process that > > > > > > > > > > submitted > > > > > > > > > > the IO. > > > > > > > > > > > > > > > > > > > > We also can't simply move the AMR into the mm, precisely > > > > > > > > > > because > > > > > > > > > > it's > > > > > > > > > > per thread, not per mm. > > > > > > >
Re: [PATCH] powerpc/fault: fix wrong KUAP fault for IO_URING
On 2/4/21 10:01 AM, Aneesh Kumar K.V wrote: > On 2/4/21 10:23 PM, Jens Axboe wrote: >> On 2/1/21 11:30 PM, Aneesh Kumar K.V wrote: >>> On 2/2/21 11:50 AM, Christophe Leroy wrote: Le 02/02/2021 à 07:16, Aneesh Kumar K.V a écrit : > On 2/2/21 11:32 AM, Christophe Leroy wrote: >> >> >> Le 02/02/2021 à 06:55, Aneesh Kumar K.V a écrit : >>> Aneesh Kumar K.V writes: >>> Nicholas Piggin writes: > Excerpts from Michael Ellerman's message of January 30, 2021 9:22 pm: >> Christophe Leroy writes: >>> +Aneesh >>> >>> Le 29/01/2021 à 07:52, Zorro Lang a écrit : >> .. [ 96.200296] [ cut here ] [ 96.200304] Bug: Read fault blocked by KUAP! [ 96.200309] WARNING: CPU: 3 PID: 1876 at arch/powerpc/mm/fault.c:229 bad_kernel_fault+0x180/0x310 >>> [ 96.200734] NIP [c0849424] fault_in_pages_readable+0x104/0x350 [ 96.200741] LR [c084952c] fault_in_pages_readable+0x20c/0x350 [ 96.200747] --- interrupt: 300 >>> >>> >>> Problem happens in a section where userspace access is supposed >>> to be granted, so the patch you >>> proposed is definitely not the right fix. >>> >>> c0849408:2c 01 00 4c isync >>> c084940c:a6 03 3d 7d mtspr 29,r9 <== granting >>> userspace access permission >>> c0849410:2c 01 00 4c isync >>> c0849414:00 00 36 e9 ld r9,0(r22) >>> c0849418:20 00 29 81 lwz r9,32(r9) >>> c084941c:00 02 29 71 andi. r9,r9,512 >>> c0849420:78 d3 5e 7f mr r30,r26 >>> ==> c0849424:00 00 bf 8b lbz r29,0(r31) <== >>> accessing userspace >>> c0849428:10 00 82 41 beq c0849438 >>> >>> c084942c:2c 01 00 4c isync >>> c0849430:a6 03 bd 7e mtspr 29,r21 <== >>> clearing userspace access permission >>> c0849434:2c 01 00 4c isync >>> >>> My first guess is that the problem is linked to the following >>> function, see the comment >>> >>> /* >>> * For kernel thread that doesn't have thread.regs return >>> * default AMR/IAMR values. >>> */ >>> static inline u64 current_thread_amr(void) >>> { >>> if (current->thread.regs) >>> return current->thread.regs->amr; >>> return AMR_KUAP_BLOCKED; >>> } >>> >>> Above function was introduced by commit 48a8ab4eeb82 >>> ("powerpc/book3s64/pkeys: Don't update SPRN_AMR >>> when in kernel mode") >> >> Yeah that's a bit of a curly one. >> >> At some point io_uring did kthread_use_mm(), which is supposed to >> mean >> the kthread can operate on behalf of the original process that >> submitted >> the IO. >> >> But because KUAP is implemented using memory protection keys, it >> depends >> on the value of the AMR register, which is not part of the mm, >> it's in >> thread.regs->amr. >> >> And what's worse by the time we're in kthread_use_mm() we no >> longer have >> access to the thread.regs->amr of the original process that >> submitted >> the IO. >> >> We also can't simply move the AMR into the mm, precisely because >> it's >> per thread, not per mm. >> >> So TBH I don't know how we're going to fix this. >> >> I guess we could return AMR=unblocked for kernel threads, but that's >> arguably a bug because it allows a process to circumvent memory >> keys by >> asking the kernel to do the access. > > We shouldn't need to inherit AMR should we? We only need it to be > locked > for kernel threads until it's explicitly unlocked -- nothing mm > specific > there. I think current_thread_amr could return 0 for kernel > threads? Or > I would even avoid using that function for allow_user_access and open > code the kthread case and remove it from current_thread_amr(). 
> > Thanks, > Nick >>> >>> updated one >>> >>> From 8fdb0680f983940d61f91da8252b13c8d3e8ebee Mon Sep 17 00:00:00 2001 >>> From: "Aneesh Kumar K.V" >>> Date: Tue, 2 Feb 2021 09:23:38 +0530 >>> Subject: [PATCH v2] powerpc/kuap: Allow kernel thread to access >>> userspace >>>after kthread_use_mm >>> >>> This fix the bad fault
Re: [PATCH] powerpc/fault: fix wrong KUAP fault for IO_URING
On 2/4/21 10:23 PM, Jens Axboe wrote: On 2/1/21 11:30 PM, Aneesh Kumar K.V wrote: On 2/2/21 11:50 AM, Christophe Leroy wrote: Le 02/02/2021 à 07:16, Aneesh Kumar K.V a écrit : On 2/2/21 11:32 AM, Christophe Leroy wrote: Le 02/02/2021 à 06:55, Aneesh Kumar K.V a écrit : Aneesh Kumar K.V writes: Nicholas Piggin writes: Excerpts from Michael Ellerman's message of January 30, 2021 9:22 pm: Christophe Leroy writes: +Aneesh Le 29/01/2021 à 07:52, Zorro Lang a écrit : .. [ 96.200296] [ cut here ] [ 96.200304] Bug: Read fault blocked by KUAP! [ 96.200309] WARNING: CPU: 3 PID: 1876 at arch/powerpc/mm/fault.c:229 bad_kernel_fault+0x180/0x310 [ 96.200734] NIP [c0849424] fault_in_pages_readable+0x104/0x350 [ 96.200741] LR [c084952c] fault_in_pages_readable+0x20c/0x350 [ 96.200747] --- interrupt: 300 Problem happens in a section where userspace access is supposed to be granted, so the patch you proposed is definitely not the right fix. c0849408:2c 01 00 4c isync c084940c:a6 03 3d 7d mtspr 29,r9 <== granting userspace access permission c0849410:2c 01 00 4c isync c0849414:00 00 36 e9 ld r9,0(r22) c0849418:20 00 29 81 lwz r9,32(r9) c084941c:00 02 29 71 andi. r9,r9,512 c0849420:78 d3 5e 7f mr r30,r26 ==> c0849424:00 00 bf 8b lbz r29,0(r31) <== accessing userspace c0849428:10 00 82 41 beq c0849438 c084942c:2c 01 00 4c isync c0849430:a6 03 bd 7e mtspr 29,r21 <== clearing userspace access permission c0849434:2c 01 00 4c isync My first guess is that the problem is linked to the following function, see the comment /* * For kernel thread that doesn't have thread.regs return * default AMR/IAMR values. */ static inline u64 current_thread_amr(void) { if (current->thread.regs) return current->thread.regs->amr; return AMR_KUAP_BLOCKED; } Above function was introduced by commit 48a8ab4eeb82 ("powerpc/book3s64/pkeys: Don't update SPRN_AMR when in kernel mode") Yeah that's a bit of a curly one. At some point io_uring did kthread_use_mm(), which is supposed to mean the kthread can operate on behalf of the original process that submitted the IO. But because KUAP is implemented using memory protection keys, it depends on the value of the AMR register, which is not part of the mm, it's in thread.regs->amr. And what's worse by the time we're in kthread_use_mm() we no longer have access to the thread.regs->amr of the original process that submitted the IO. We also can't simply move the AMR into the mm, precisely because it's per thread, not per mm. So TBH I don't know how we're going to fix this. I guess we could return AMR=unblocked for kernel threads, but that's arguably a bug because it allows a process to circumvent memory keys by asking the kernel to do the access. We shouldn't need to inherit AMR should we? We only need it to be locked for kernel threads until it's explicitly unlocked -- nothing mm specific there. I think current_thread_amr could return 0 for kernel threads? Or I would even avoid using that function for allow_user_access and open code the kthread case and remove it from current_thread_amr(). Thanks, Nick updated one From 8fdb0680f983940d61f91da8252b13c8d3e8ebee Mon Sep 17 00:00:00 2001 From: "Aneesh Kumar K.V" Date: Tue, 2 Feb 2021 09:23:38 +0530 Subject: [PATCH v2] powerpc/kuap: Allow kernel thread to access userspace after kthread_use_mm This fix the bad fault reported by KUAP when io_wqe_worker access userspace. Bug: Read fault blocked by KUAP! 
WARNING: CPU: 1 PID: 101841 at arch/powerpc/mm/fault.c:229 __do_page_fault+0x6b4/0xcd0 NIP [c009e7e4] __do_page_fault+0x6b4/0xcd0 LR [c009e7e0] __do_page_fault+0x6b0/0xcd0 .. Call Trace: [c00016367330] [c009e7e0] __do_page_fault+0x6b0/0xcd0 (unreliable) [c000163673e0] [c009ee3c] do_page_fault+0x3c/0x120 [c00016367430] [c000c848] handle_page_fault+0x10/0x2c --- interrupt: 300 at iov_iter_fault_in_readable+0x148/0x6f0 .. NIP [c08e8228] iov_iter_fault_in_readable+0x148/0x6f0 LR [c08e834c] iov_iter_fault_in_readable+0x26c/0x6f0 interrupt: 300 [c000163677e0] [c07154a0] iomap_write_actor+0xc0/0x280 [c00016367880] [c070fc94] iomap_apply+0x1c4/0x780 [c00016367990] [c0710330] iomap_file_buffered_write+0xa0/0x120 [c000163679e0] [c0080040791c] xfs_file_buffered_aio_write+0x314/0x5e0 [xfs] [c00016367a90] [c06d74bc] io_write+0x10c/0x460 [c00016367bb0] [c06d80e4] io_issue_sqe+0x8d4/0x1200 [c00016367c70] [c06d8ad0] io_wq_submit_work+0xc0/0x250 [c00016367cb0] [c06e2578]
Re: [PATCH] powerpc/fault: fix wrong KUAP fault for IO_URING
On 2/1/21 11:30 PM, Aneesh Kumar K.V wrote: > On 2/2/21 11:50 AM, Christophe Leroy wrote: >> >> >> Le 02/02/2021 à 07:16, Aneesh Kumar K.V a écrit : >>> On 2/2/21 11:32 AM, Christophe Leroy wrote: Le 02/02/2021 à 06:55, Aneesh Kumar K.V a écrit : > Aneesh Kumar K.V writes: > >> Nicholas Piggin writes: >> >>> Excerpts from Michael Ellerman's message of January 30, 2021 9:22 pm: Christophe Leroy writes: > +Aneesh > > Le 29/01/2021 à 07:52, Zorro Lang a écrit : .. >> [ 96.200296] [ cut here ] >> [ 96.200304] Bug: Read fault blocked by KUAP! >> [ 96.200309] WARNING: CPU: 3 PID: 1876 at >> arch/powerpc/mm/fault.c:229 bad_kernel_fault+0x180/0x310 > >> [ 96.200734] NIP [c0849424] >> fault_in_pages_readable+0x104/0x350 >> [ 96.200741] LR [c084952c] >> fault_in_pages_readable+0x20c/0x350 >> [ 96.200747] --- interrupt: 300 > > > Problem happens in a section where userspace access is supposed > to be granted, so the patch you > proposed is definitely not the right fix. > > c0849408:2c 01 00 4c isync > c084940c:a6 03 3d 7d mtspr 29,r9 <== granting > userspace access permission > c0849410:2c 01 00 4c isync > c0849414:00 00 36 e9 ld r9,0(r22) > c0849418:20 00 29 81 lwz r9,32(r9) > c084941c:00 02 29 71 andi. r9,r9,512 > c0849420:78 d3 5e 7f mr r30,r26 > ==> c0849424:00 00 bf 8b lbz r29,0(r31) <== > accessing userspace > c0849428:10 00 82 41 beq c0849438 > > c084942c:2c 01 00 4c isync > c0849430:a6 03 bd 7e mtspr 29,r21 <== > clearing userspace access permission > c0849434:2c 01 00 4c isync > > My first guess is that the problem is linked to the following > function, see the comment > > /* >* For kernel thread that doesn't have thread.regs return >* default AMR/IAMR values. >*/ > static inline u64 current_thread_amr(void) > { > if (current->thread.regs) > return current->thread.regs->amr; > return AMR_KUAP_BLOCKED; > } > > Above function was introduced by commit 48a8ab4eeb82 > ("powerpc/book3s64/pkeys: Don't update SPRN_AMR > when in kernel mode") Yeah that's a bit of a curly one. At some point io_uring did kthread_use_mm(), which is supposed to mean the kthread can operate on behalf of the original process that submitted the IO. But because KUAP is implemented using memory protection keys, it depends on the value of the AMR register, which is not part of the mm, it's in thread.regs->amr. And what's worse by the time we're in kthread_use_mm() we no longer have access to the thread.regs->amr of the original process that submitted the IO. We also can't simply move the AMR into the mm, precisely because it's per thread, not per mm. So TBH I don't know how we're going to fix this. I guess we could return AMR=unblocked for kernel threads, but that's arguably a bug because it allows a process to circumvent memory keys by asking the kernel to do the access. >>> >>> We shouldn't need to inherit AMR should we? We only need it to be >>> locked >>> for kernel threads until it's explicitly unlocked -- nothing mm >>> specific >>> there. I think current_thread_amr could return 0 for kernel >>> threads? Or >>> I would even avoid using that function for allow_user_access and open >>> code the kthread case and remove it from current_thread_amr(). >>> >>> Thanks, >>> Nick >> > > updated one > > From 8fdb0680f983940d61f91da8252b13c8d3e8ebee Mon Sep 17 00:00:00 2001 > From: "Aneesh Kumar K.V" > Date: Tue, 2 Feb 2021 09:23:38 +0530 > Subject: [PATCH v2] powerpc/kuap: Allow kernel thread to access > userspace > after kthread_use_mm > > This fix the bad fault reported by KUAP when io_wqe_worker access > userspace. 
> > Bug: Read fault blocked by KUAP! > WARNING: CPU: 1 PID: 101841 at arch/powerpc/mm/fault.c:229 > __do_page_fault+0x6b4/0xcd0 > NIP [c009e7e4] __do_page_fault+0x6b4/0xcd0 > LR [c009e7e0]
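For reference, a minimal sketch of Nick's suggestion from this thread, not the final fix: let a kernel thread that has adopted a user mm run unblocked, while other kernel threads stay blocked. The current->mm test is an assumption about how "has called kthread_use_mm()" would be detected:

static inline u64 current_thread_amr(void)
{
	if (current->thread.regs)
		return current->thread.regs->amr;

	/* kthread_use_mm() installed a user mm on this kernel thread,
	 * so let it access userspace on behalf of the submitting task
	 * (an AMR value of 0 places no read/write restrictions). */
	if (current->mm)
		return 0;

	return AMR_KUAP_BLOCKED;
}

The trade-off raised earlier in the thread still applies: a kernel thread with a borrowed mm bypasses the submitting process's memory keys.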
[PATCH v16 11/12] powerpc: Use OF alloc and free for FDT
of_alloc_and_init_fdt() and of_free_fdt() have been defined in drivers/of/kexec.c to allocate and free memory for FDT. Use of_alloc_and_init_fdt() and of_free_fdt() to allocate and initialize the FDT, and to free the FDT respectively. powerpc sets the FDT address in image_loader_data field in "struct kimage" and the memory is freed in kimage_file_post_load_cleanup(). This cleanup function uses kfree() to free the memory. But since of_alloc_and_init_fdt() uses kvmalloc() for allocation, the buffer needs to be freed using kvfree(). Define "fdt" field in "struct kimage_arch" for powerpc to store the address of FDT, and free the memory in powerpc specific arch_kimage_file_post_load_cleanup(). Signed-off-by: Lakshmi Ramasubramanian Suggested-by: Rob Herring Suggested-by: Thiago Jung Bauermann --- arch/powerpc/include/asm/kexec.h | 2 ++ arch/powerpc/kexec/elf_64.c | 26 -- arch/powerpc/kexec/file_load_64.c | 3 +++ 3 files changed, 21 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h index 2c0be93d239a..d7d13cac4d31 100644 --- a/arch/powerpc/include/asm/kexec.h +++ b/arch/powerpc/include/asm/kexec.h @@ -111,6 +111,8 @@ struct kimage_arch { unsigned long elf_headers_mem; unsigned long elf_headers_sz; void *elf_headers; + + void *fdt; }; char *setup_kdump_cmdline(struct kimage *image, char *cmdline, diff --git a/arch/powerpc/kexec/elf_64.c b/arch/powerpc/kexec/elf_64.c index d0e459bb2f05..51d2d8eb6c1b 100644 --- a/arch/powerpc/kexec/elf_64.c +++ b/arch/powerpc/kexec/elf_64.c @@ -19,6 +19,7 @@ #include #include #include +#include #include #include #include @@ -32,7 +33,7 @@ static void *elf64_load(struct kimage *image, char *kernel_buf, unsigned int fdt_size; unsigned long kernel_load_addr; unsigned long initrd_load_addr = 0, fdt_load_addr; - void *fdt; + void *fdt = NULL; const void *slave_code; struct elfhdr ehdr; char *modified_cmdline = NULL; @@ -103,18 +104,12 @@ static void *elf64_load(struct kimage *image, char *kernel_buf, } fdt_size = fdt_totalsize(initial_boot_params) * 2; - fdt = kmalloc(fdt_size, GFP_KERNEL); + fdt = of_alloc_and_init_fdt(fdt_size); if (!fdt) { pr_err("Not enough memory for the device tree.\n"); ret = -ENOMEM; goto out; } - ret = fdt_open_into(initial_boot_params, fdt, fdt_size); - if (ret < 0) { - pr_err("Error setting up the new device tree.\n"); - ret = -EINVAL; - goto out; - } ret = setup_new_fdt_ppc64(image, fdt, initrd_load_addr, initrd_len, cmdline); @@ -131,6 +126,10 @@ static void *elf64_load(struct kimage *image, char *kernel_buf, ret = kexec_add_buffer(&kbuf); if (ret) goto out; + + /* FDT will be freed in arch_kimage_file_post_load_cleanup */ + image->arch.fdt = fdt; + fdt_load_addr = kbuf.mem; pr_debug("Loaded device tree at 0x%lx\n", fdt_load_addr); @@ -145,8 +144,15 @@ static void *elf64_load(struct kimage *image, char *kernel_buf, kfree(modified_cmdline); kexec_free_elf_info(&elf_info); - /* Make kimage_file_post_load_cleanup free the fdt buffer for us. */ - return ret ? ERR_PTR(ret) : fdt; + /* +* Once FDT buffer has been successfully passed to kexec_add_buffer(), +* the FDT buffer address is saved in image->arch.fdt. In that case, +* the memory cannot be freed here in case of any other error. +*/ + if (ret && !image->arch.fdt) + of_free_fdt(fdt); + + return ret ? ERR_PTR(ret) : NULL; } const struct kexec_file_ops kexec_elf64_ops = { diff --git a/arch/powerpc/kexec/file_load_64.c b/arch/powerpc/kexec/file_load_64.c index 3cab318aa3b9..d9d5b5569a6d 100644 --- a/arch/powerpc/kexec/file_load_64.c +++ b/arch/powerpc/kexec/file_load_64.c @@ -1113,5 +1113,8 @@ int arch_kimage_file_post_load_cleanup(struct kimage *image) image->arch.elf_headers = NULL; image->arch.elf_headers_sz = 0; + of_free_fdt(image->arch.fdt); + image->arch.fdt = NULL; + return kexec_image_post_load_cleanup_default(image); } -- 2.30.0
[PATCH v16 12/12] arm64: Enable passing IMA log to next kernel on kexec
Update CONFIG_KEXEC_FILE to select CONFIG_HAVE_IMA_KEXEC, if CONFIG_IMA is enabled, to indicate that the IMA measurement log information is present in the device tree for ARM64. Co-developed-by: Prakhar Srivastava Signed-off-by: Prakhar Srivastava Signed-off-by: Lakshmi Ramasubramanian --- arch/arm64/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 05e17351e4f3..8a93573cebb6 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -1093,6 +1093,7 @@ config KEXEC config KEXEC_FILE bool "kexec file based system call" select KEXEC_CORE + select HAVE_IMA_KEXEC if IMA help This is new version of kexec system call. This system call is file based and takes file descriptors as system call argument -- 2.30.0
[PATCH v16 10/12] arm64: Use OF alloc and free functions for FDT
of_alloc_and_init_fdt() and of_free_fdt() have been defined in drivers/of/kexec.c to allocate and free memory for FDT. Use of_alloc_and_init_fdt() and of_free_fdt() to allocate and initialize the FDT, and to free the FDT respectively. Signed-off-by: Lakshmi Ramasubramanian Suggested-by: Rob Herring --- arch/arm64/kernel/machine_kexec_file.c | 37 +++--- 1 file changed, 10 insertions(+), 27 deletions(-) diff --git a/arch/arm64/kernel/machine_kexec_file.c b/arch/arm64/kernel/machine_kexec_file.c index 7da22bb7b9d5..7d6cc478f73c 100644 --- a/arch/arm64/kernel/machine_kexec_file.c +++ b/arch/arm64/kernel/machine_kexec_file.c @@ -29,7 +29,7 @@ const struct kexec_file_ops * const kexec_file_loaders[] = { int arch_kimage_file_post_load_cleanup(struct kimage *image) { - vfree(image->arch.dtb); + of_free_fdt(image->arch.dtb); image->arch.dtb = NULL; vfree(image->arch.elf_headers); @@ -57,36 +57,19 @@ static int create_dtb(struct kimage *image, cmdline_len = cmdline ? strlen(cmdline) : 0; buf_size = fdt_totalsize(initial_boot_params) + cmdline_len + DTB_EXTRA_SPACE; - - for (;;) { - buf = vmalloc(buf_size); - if (!buf) - return -ENOMEM; - - /* duplicate a device tree blob */ - ret = fdt_open_into(initial_boot_params, buf, buf_size); - if (ret) - return -EINVAL; - - ret = of_kexec_setup_new_fdt(image, buf, initrd_load_addr, + buf = of_alloc_and_init_fdt(buf_size); + if (!buf) + return -ENOMEM; + ret = of_kexec_setup_new_fdt(image, buf, initrd_load_addr, initrd_len, cmdline); - if (ret) { - vfree(buf); - if (ret == -ENOMEM) { - /* unlikely, but just in case */ - buf_size += DTB_EXTRA_SPACE; - continue; - } else { - return ret; - } - } - + if (!ret) { /* trim it */ fdt_pack(buf); *dtb = buf; + } else + of_free_fdt(buf); - return 0; - } + return ret; } static int prepare_elf_headers(void **addr, unsigned long *sz) @@ -224,6 +207,6 @@ int load_other_segments(struct kimage *image, out_err: image->nr_segments = orig_segments; - vfree(dtb); + of_free_fdt(dtb); return ret; } -- 2.30.0
[PATCH v16 08/12] powerpc: Delete unused function delete_fdt_mem_rsv()
delete_fdt_mem_rsv() defined in "arch/powerpc/kexec/file_load.c" has been renamed to fdt_find_and_del_mem_rsv(), and moved to "drivers/of/kexec.c". Remove delete_fdt_mem_rsv() in "arch/powerpc/kexec/file_load.c". Co-developed-by: Prakhar Srivastava Signed-off-by: Prakhar Srivastava Signed-off-by: Lakshmi Ramasubramanian --- arch/powerpc/include/asm/kexec.h | 1 - arch/powerpc/kexec/file_load.c | 32 2 files changed, 33 deletions(-) diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h index 939bc40dfa62..2c0be93d239a 100644 --- a/arch/powerpc/include/asm/kexec.h +++ b/arch/powerpc/include/asm/kexec.h @@ -118,7 +118,6 @@ char *setup_kdump_cmdline(struct kimage *image, char *cmdline, int setup_purgatory(struct kimage *image, const void *slave_code, const void *fdt, unsigned long kernel_load_addr, unsigned long fdt_load_addr); -int delete_fdt_mem_rsv(void *fdt, unsigned long start, unsigned long size); #ifdef CONFIG_PPC64 struct kexec_buf; diff --git a/arch/powerpc/kexec/file_load.c b/arch/powerpc/kexec/file_load.c index 5dd3a9c45a2d..036c8fb48fc3 100644 --- a/arch/powerpc/kexec/file_load.c +++ b/arch/powerpc/kexec/file_load.c @@ -108,35 +108,3 @@ int setup_purgatory(struct kimage *image, const void *slave_code, return 0; } - -/** - * delete_fdt_mem_rsv - delete memory reservation with given address and size - * - * Return: 0 on success, or negative errno on error. - */ -int delete_fdt_mem_rsv(void *fdt, unsigned long start, unsigned long size) -{ - int i, ret, num_rsvs = fdt_num_mem_rsv(fdt); - - for (i = 0; i < num_rsvs; i++) { - uint64_t rsv_start, rsv_size; - - ret = fdt_get_mem_rsv(fdt, i, _start, _size); - if (ret) { - pr_err("Malformed device tree.\n"); - return -EINVAL; - } - - if (rsv_start == start && rsv_size == size) { - ret = fdt_del_mem_rsv(fdt, i); - if (ret) { - pr_err("Error deleting device tree reservation.\n"); - return -EINVAL; - } - - return 0; - } - } - - return -ENOENT; -} -- 2.30.0
[PATCH v16 09/12] of: Define functions to allocate and free FDT
Kernel components that use Flattened Device Tree (FDT) allocate kernel memory and call fdt_open_into() to reorganize the tree into a form suitable for the read-write operations. These operations can be combined into a single function to allocate and initialize the FDT so the different architectures do not have to duplicate the code. Define of_alloc_and_init_fdt() and of_free_fdt() in drivers/of/kexec.c to allocate and initialize the FDT, and to free the FDT buffer respectively. Signed-off-by: Lakshmi Ramasubramanian Suggested-by: Rob Herring Suggested-by: Joe Perches --- drivers/of/kexec.c | 37 + include/linux/of.h | 2 ++ 2 files changed, 39 insertions(+) diff --git a/drivers/of/kexec.c b/drivers/of/kexec.c index 5ae0e5d90f55..197e71104f47 100644 --- a/drivers/of/kexec.c +++ b/drivers/of/kexec.c @@ -11,6 +11,7 @@ #include #include +#include #include #include #include @@ -28,6 +29,42 @@ #define FDT_PROP_RNG_SEED "rng-seed" #define RNG_SEED_SIZE 128 +/** + * of_alloc_and_init_fdt - Allocate and initialize a Flattened device tree + * + * @fdt_size: Flattened device tree size + * + * Return the allocated FDT buffer on success, or NULL on error. + */ +void *of_alloc_and_init_fdt(unsigned int fdt_size) +{ + void *fdt; + int ret; + + fdt = kvmalloc(fdt_size, GFP_KERNEL); + if (!fdt) + return NULL; + + ret = fdt_open_into(initial_boot_params, fdt, fdt_size); + if (ret < 0) { + pr_err("Error setting up the new device tree.\n"); + kvfree(fdt); + fdt = NULL; + } + + return fdt; +} + +/** + * of_free_fdt - Free the buffer for Flattened device tree + * + * @fdt: Flattened device tree buffer to free + */ +void of_free_fdt(void *fdt) +{ + kvfree(fdt); +} + /** * fdt_find_and_del_mem_rsv - delete memory reservation with given address and size * diff --git a/include/linux/of.h b/include/linux/of.h index 19f77dd12507..9f0261761e28 100644 --- a/include/linux/of.h +++ b/include/linux/of.h @@ -563,6 +563,8 @@ struct kimage; int of_kexec_setup_new_fdt(const struct kimage *image, void *fdt, unsigned long initrd_load_addr, unsigned long initrd_len, const char *cmdline); +void *of_alloc_and_init_fdt(unsigned int fdt_size); +void of_free_fdt(void *fdt); #ifdef CONFIG_IMA_KEXEC int of_ima_add_kexec_buffer(struct kimage *image, -- 2.30.0
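For reference, the intended call pattern is symmetric and hides the kvmalloc()/kvfree() choice from callers; a minimal sketch (the sizing heuristic is the one elf64_load() uses, any headroom policy works):

	unsigned int fdt_size = fdt_totalsize(initial_boot_params) * 2;
	void *fdt;

	fdt = of_alloc_and_init_fdt(fdt_size);	/* kvmalloc() + fdt_open_into() */
	if (!fdt)
		return -ENOMEM;

	/* ... edit the writable copy: fdt_setprop(), fdt_add_mem_rsv(), ... */

	of_free_fdt(fdt);			/* kvfree() */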
[PATCH v16 07/12] kexec: Use fdt_appendprop_addrrange() to add ima buffer to FDT
fdt_appendprop_addrrange() function adds a property, with the given name, to the device tree at the given node offset, and also sets the address and size of the property. This function should be used to add "linux,ima-kexec-buffer" property to the device tree and set the address and size of the IMA measurement buffer, instead of using custom function. Use fdt_appendprop_addrrange() to add "linux,ima-kexec-buffer" property to the device tree. This property holds the address and size of the IMA measurement buffer that needs to be passed from the current kernel to the next kernel across kexec system call. Remove custom code that is used in setup_ima_buffer() to add "linux,ima-kexec-buffer" property to the device tree. Co-developed-by: Prakhar Srivastava Signed-off-by: Prakhar Srivastava Signed-off-by: Lakshmi Ramasubramanian Reviewed-by: Thiago Jung Bauermann --- drivers/of/kexec.c | 57 -- 1 file changed, 5 insertions(+), 52 deletions(-) diff --git a/drivers/of/kexec.c b/drivers/of/kexec.c index 7ee4f498ca19..5ae0e5d90f55 100644 --- a/drivers/of/kexec.c +++ b/drivers/of/kexec.c @@ -232,36 +232,6 @@ int of_ima_add_kexec_buffer(struct kimage *image, return 0; } -/** - * write_number - Convert number to big-endian format - * - * @p: Buffer to write the number to - * @value: Number to convert - * @cells: Number of cells - * - * Return: 0 on success, or negative errno on error. - */ -static int write_number(void *p, u64 value, int cells) -{ - if (cells == 1) { - u32 tmp; - - if (value > U32_MAX) - return -EINVAL; - - tmp = cpu_to_be32(value); - memcpy(p, , sizeof(tmp)); - } else if (cells == 2) { - u64 tmp; - - tmp = cpu_to_be64(value); - memcpy(p, , sizeof(tmp)); - } else - return -EINVAL; - - return 0; -} - /** * setup_ima_buffer - add IMA buffer information to the fdt * @image: kexec image being loaded. @@ -273,32 +243,15 @@ static int write_number(void *p, u64 value, int cells) static int setup_ima_buffer(const struct kimage *image, void *fdt, int chosen_node) { - int ret, addr_cells, size_cells, entry_size; - u8 value[16]; + int ret; if (!image->ima_buffer_size) return 0; - ret = get_addr_size_cells(_cells, _cells); - if (ret) - return ret; - - entry_size = 4 * (addr_cells + size_cells); - - if (entry_size > sizeof(value)) - return -EINVAL; - - ret = write_number(value, image->ima_buffer_addr, addr_cells); - if (ret) - return ret; - - ret = write_number(value + 4 * addr_cells, image->ima_buffer_size, - size_cells); - if (ret) - return ret; - - ret = fdt_setprop(fdt, chosen_node, "linux,ima-kexec-buffer", value, - entry_size); + ret = fdt_appendprop_addrrange(fdt, 0, chosen_node, + "linux,ima-kexec-buffer", + image->ima_buffer_addr, + image->ima_buffer_size); if (ret < 0) return -EINVAL; -- 2.30.0
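fdt_appendprop_addrrange() encodes the range using the root node's #address-cells/#size-cells, so the next kernel can decode it generically instead of hard-coding cell sizes. A hedged sketch of the reader side (not part of this patch; fdt and chosen_node assumed in scope):

	const __be32 *prop;
	int len, naddr, nsize;
	u64 addr, size;

	naddr = fdt_address_cells(fdt, 0);
	nsize = fdt_size_cells(fdt, 0);

	prop = fdt_getprop(fdt, chosen_node, "linux,ima-kexec-buffer", &len);
	if (!prop || len < 4 * (naddr + nsize))
		return -ENOENT;

	addr = of_read_number(prop, naddr);
	size = of_read_number(prop + naddr, nsize);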
[PATCH v16 06/12] powerpc: Move arch independent ima kexec functions to drivers/of/kexec.c
The functions defined in "arch/powerpc/kexec/ima.c" handle setting up and freeing the resources required to carry over the IMA measurement list from the current kernel to the next kernel across kexec system call. These functions do not have architecture specific code, but are currently limited to powerpc. Move setup_ima_buffer() call into of_kexec_setup_new_fdt() defined in "drivers/of/kexec.c". Call of_kexec_setup_new_fdt() from setup_new_fdt_ppc64() and remove setup_new_fdt() in "arch/powerpc/kexec/file_load.c". Move the remaining architecture independent functions from "arch/powerpc/kexec/ima.c" to "drivers/of/kexec.c". Delete "arch/powerpc/kexec/ima.c" and "arch/powerpc/include/asm/ima.h". Remove references to the deleted files in powerpc and in ima. Co-developed-by: Prakhar Srivastava Signed-off-by: Prakhar Srivastava Signed-off-by: Lakshmi Ramasubramanian --- arch/powerpc/Kconfig | 2 +- arch/powerpc/include/asm/ima.h| 27 arch/powerpc/include/asm/kexec.h | 3 - arch/powerpc/kexec/Makefile | 7 - arch/powerpc/kexec/file_load.c| 35 - arch/powerpc/kexec/file_load_64.c | 4 +- arch/powerpc/kexec/ima.c | 202 - drivers/of/kexec.c| 239 ++ include/linux/of.h| 2 + security/integrity/ima/ima.h | 4 - 10 files changed, 245 insertions(+), 280 deletions(-) delete mode 100644 arch/powerpc/include/asm/ima.h delete mode 100644 arch/powerpc/kexec/ima.c diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 107bb4319e0e..d6e593ad270e 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -554,7 +554,7 @@ config KEXEC config KEXEC_FILE bool "kexec file based system call" select KEXEC_CORE - select HAVE_IMA_KEXEC + select HAVE_IMA_KEXEC if IMA select BUILD_BIN2C select KEXEC_ELF depends on PPC64 diff --git a/arch/powerpc/include/asm/ima.h b/arch/powerpc/include/asm/ima.h deleted file mode 100644 index 51f64fd06c19.. 
--- a/arch/powerpc/include/asm/ima.h +++ /dev/null @@ -1,27 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0 */ -#ifndef _ASM_POWERPC_IMA_H -#define _ASM_POWERPC_IMA_H - -struct kimage; - -int ima_get_kexec_buffer(void **addr, size_t *size); -int ima_free_kexec_buffer(void); - -#ifdef CONFIG_IMA -void remove_ima_buffer(void *fdt, int chosen_node); -#else -static inline void remove_ima_buffer(void *fdt, int chosen_node) {} -#endif - -#ifdef CONFIG_IMA_KEXEC -int setup_ima_buffer(const struct kimage *image, void *fdt, int chosen_node); -#else -static inline int setup_ima_buffer(const struct kimage *image, void *fdt, - int chosen_node) -{ - remove_ima_buffer(fdt, chosen_node); - return 0; -} -#endif /* CONFIG_IMA_KEXEC */ - -#endif /* _ASM_POWERPC_IMA_H */ diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h index 2248dc27080b..939bc40dfa62 100644 --- a/arch/powerpc/include/asm/kexec.h +++ b/arch/powerpc/include/asm/kexec.h @@ -118,9 +118,6 @@ char *setup_kdump_cmdline(struct kimage *image, char *cmdline, int setup_purgatory(struct kimage *image, const void *slave_code, const void *fdt, unsigned long kernel_load_addr, unsigned long fdt_load_addr); -int setup_new_fdt(const struct kimage *image, void *fdt, - unsigned long initrd_load_addr, unsigned long initrd_len, - const char *cmdline); int delete_fdt_mem_rsv(void *fdt, unsigned long start, unsigned long size); #ifdef CONFIG_PPC64 diff --git a/arch/powerpc/kexec/Makefile b/arch/powerpc/kexec/Makefile index 4aff6846c772..b6c52608cb49 100644 --- a/arch/powerpc/kexec/Makefile +++ b/arch/powerpc/kexec/Makefile @@ -9,13 +9,6 @@ obj-$(CONFIG_PPC32)+= relocate_32.o obj-$(CONFIG_KEXEC_FILE) += file_load.o ranges.o file_load_$(BITS).o elf_$(BITS).o -ifdef CONFIG_HAVE_IMA_KEXEC -ifdef CONFIG_IMA -obj-y += ima.o -endif -endif - - # Disable GCOV, KCOV & sanitizers in odd or sensitive code GCOV_PROFILE_core_$(BITS).o := n KCOV_INSTRUMENT_core_$(BITS).o := n diff --git a/arch/powerpc/kexec/file_load.c b/arch/powerpc/kexec/file_load.c index 956bcb2d1ec2..5dd3a9c45a2d 100644 --- a/arch/powerpc/kexec/file_load.c +++ b/arch/powerpc/kexec/file_load.c @@ -20,7 +20,6 @@ #include #include #include -#include #define SLAVE_CODE_SIZE256 /* First 0x100 bytes */ @@ -141,37 +140,3 @@ int delete_fdt_mem_rsv(void *fdt, unsigned long start, unsigned long size) return -ENOENT; } - -/* - * setup_new_fdt - modify /chosen and memory reservation for the next kernel - * @image: kexec image being loaded. - * @fdt: Flattened device tree for the next kernel. - * @initrd_load_addr: Address where the next initrd will be loaded. - * @initrd_len:Size of the next initrd, or 0 if
[PATCH v16 05/12] powerpc: Move ima buffer fields to struct kimage
The fields ima_buffer_addr and ima_buffer_size in "struct kimage_arch" for powerpc are used to carry forward the IMA measurement list across kexec system call. These fields are not architecture specific, but are currently limited to powerpc. arch_ima_add_kexec_buffer() defined in "arch/powerpc/kexec/ima.c" sets ima_buffer_addr and ima_buffer_size for the kexec system call. This function does not have architecture specific code, but is currently limited to powerpc. Move ima_buffer_addr and ima_buffer_size to "struct kimage". Rename arch_ima_add_kexec_buffer() to of_ima_add_kexec_buffer() and move it in drivers/of/kexec.c. Co-developed-by: Prakhar Srivastava Signed-off-by: Prakhar Srivastava Signed-off-by: Lakshmi Ramasubramanian Suggested-by: Will Deacon --- arch/powerpc/include/asm/ima.h | 3 --- arch/powerpc/include/asm/kexec.h | 5 - arch/powerpc/kexec/ima.c | 29 ++--- drivers/of/kexec.c | 23 +++ include/linux/kexec.h | 5 + include/linux/of.h | 5 + security/integrity/ima/ima_kexec.c | 3 ++- 7 files changed, 41 insertions(+), 32 deletions(-) diff --git a/arch/powerpc/include/asm/ima.h b/arch/powerpc/include/asm/ima.h index ead488cf3981..51f64fd06c19 100644 --- a/arch/powerpc/include/asm/ima.h +++ b/arch/powerpc/include/asm/ima.h @@ -14,9 +14,6 @@ static inline void remove_ima_buffer(void *fdt, int chosen_node) {} #endif #ifdef CONFIG_IMA_KEXEC -int arch_ima_add_kexec_buffer(struct kimage *image, unsigned long load_addr, - size_t size); - int setup_ima_buffer(const struct kimage *image, void *fdt, int chosen_node); #else static inline int setup_ima_buffer(const struct kimage *image, void *fdt, diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h index dbf09d2f36d0..2248dc27080b 100644 --- a/arch/powerpc/include/asm/kexec.h +++ b/arch/powerpc/include/asm/kexec.h @@ -111,11 +111,6 @@ struct kimage_arch { unsigned long elf_headers_mem; unsigned long elf_headers_sz; void *elf_headers; - -#ifdef CONFIG_IMA_KEXEC - phys_addr_t ima_buffer_addr; - size_t ima_buffer_size; -#endif }; char *setup_kdump_cmdline(struct kimage *image, char *cmdline, diff --git a/arch/powerpc/kexec/ima.c b/arch/powerpc/kexec/ima.c index 720e50e490b6..ed38125e2f87 100644 --- a/arch/powerpc/kexec/ima.c +++ b/arch/powerpc/kexec/ima.c @@ -128,23 +128,6 @@ void remove_ima_buffer(void *fdt, int chosen_node) } #ifdef CONFIG_IMA_KEXEC -/** - * arch_ima_add_kexec_buffer - do arch-specific steps to add the IMA buffer - * - * Architectures should use this function to pass on the IMA buffer - * information to the next kernel. - * - * Return: 0 on success, negative errno on error. 
- */ -int arch_ima_add_kexec_buffer(struct kimage *image, unsigned long load_addr, - size_t size) -{ - image->arch.ima_buffer_addr = load_addr; - image->arch.ima_buffer_size = size; - - return 0; -} - static int write_number(void *p, u64 value, int cells) { if (cells == 1) { @@ -180,7 +163,7 @@ int setup_ima_buffer(const struct kimage *image, void *fdt, int chosen_node) u8 value[16]; remove_ima_buffer(fdt, chosen_node); - if (!image->arch.ima_buffer_size) + if (!image->ima_buffer_size) return 0; ret = get_addr_size_cells(_cells, _cells); @@ -192,11 +175,11 @@ int setup_ima_buffer(const struct kimage *image, void *fdt, int chosen_node) if (entry_size > sizeof(value)) return -EINVAL; - ret = write_number(value, image->arch.ima_buffer_addr, addr_cells); + ret = write_number(value, image->ima_buffer_addr, addr_cells); if (ret) return ret; - ret = write_number(value + 4 * addr_cells, image->arch.ima_buffer_size, + ret = write_number(value + 4 * addr_cells, image->ima_buffer_size, size_cells); if (ret) return ret; @@ -206,13 +189,13 @@ int setup_ima_buffer(const struct kimage *image, void *fdt, int chosen_node) if (ret < 0) return -EINVAL; - ret = fdt_add_mem_rsv(fdt, image->arch.ima_buffer_addr, - image->arch.ima_buffer_size); + ret = fdt_add_mem_rsv(fdt, image->ima_buffer_addr, + image->ima_buffer_size); if (ret) return -EINVAL; pr_debug("IMA buffer at 0x%llx, size = 0x%zx\n", -image->arch.ima_buffer_addr, image->arch.ima_buffer_size); +image->ima_buffer_addr, image->ima_buffer_size); return 0; } diff --git a/drivers/of/kexec.c b/drivers/of/kexec.c index 4afd3cc1c04a..efbf54f164fd 100644 --- a/drivers/of/kexec.c +++ b/drivers/of/kexec.c @@ -63,6 +63,29 @@ static int fdt_find_and_del_mem_rsv(void *fdt, unsigned long start, unsigned
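With the fields promoted to struct kimage, the renamed helper has no architecture-specific work left to do. Its body, consistent with the deleted powerpc version (the diff above is cut off before it):

	int of_ima_add_kexec_buffer(struct kimage *image,
				    unsigned long load_addr, size_t size)
	{
		image->ima_buffer_addr = load_addr;
		image->ima_buffer_size = size;

		return 0;
	}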
[PATCH v16 04/12] powerpc: Use common of_kexec_setup_new_fdt()
From: Rob Herring The code for setting up the /chosen node in the device tree and updating the memory reservation for the next kernel has been moved to of_kexec_setup_new_fdt() defined in "drivers/of/kexec.c". Use the common of_kexec_setup_new_fdt() to setup the device tree and update the memory reservation for kexec for powerpc. Signed-off-by: Rob Herring Reviewed-by: Thiago Jung Bauermann Reviewed-by: Lakshmi Ramasubramanian --- arch/powerpc/kexec/file_load.c | 125 ++--- 1 file changed, 6 insertions(+), 119 deletions(-) diff --git a/arch/powerpc/kexec/file_load.c b/arch/powerpc/kexec/file_load.c index e452b11df631..956bcb2d1ec2 100644 --- a/arch/powerpc/kexec/file_load.c +++ b/arch/powerpc/kexec/file_load.c @@ -16,6 +16,7 @@ #include #include +#include #include #include #include @@ -156,132 +157,18 @@ int setup_new_fdt(const struct kimage *image, void *fdt, unsigned long initrd_load_addr, unsigned long initrd_len, const char *cmdline) { - int ret, chosen_node; - const void *prop; - - /* Remove memory reservation for the current device tree. */ - ret = delete_fdt_mem_rsv(fdt, __pa(initial_boot_params), -fdt_totalsize(initial_boot_params)); - if (ret == 0) - pr_debug("Removed old device tree reservation.\n"); - else if (ret != -ENOENT) - return ret; - - chosen_node = fdt_path_offset(fdt, "/chosen"); - if (chosen_node == -FDT_ERR_NOTFOUND) { - chosen_node = fdt_add_subnode(fdt, fdt_path_offset(fdt, "/"), - "chosen"); - if (chosen_node < 0) { - pr_err("Error creating /chosen.\n"); - return -EINVAL; - } - } else if (chosen_node < 0) { - pr_err("Malformed device tree: error reading /chosen.\n"); - return -EINVAL; - } - - /* Did we boot using an initrd? */ - prop = fdt_getprop(fdt, chosen_node, "linux,initrd-start", NULL); - if (prop) { - uint64_t tmp_start, tmp_end, tmp_size; - - tmp_start = fdt64_to_cpu(*((const fdt64_t *) prop)); - - prop = fdt_getprop(fdt, chosen_node, "linux,initrd-end", NULL); - if (!prop) { - pr_err("Malformed device tree.\n"); - return -EINVAL; - } - tmp_end = fdt64_to_cpu(*((const fdt64_t *) prop)); - - /* -* kexec reserves exact initrd size, while firmware may -* reserve a multiple of PAGE_SIZE, so check for both. -*/ - tmp_size = tmp_end - tmp_start; - ret = delete_fdt_mem_rsv(fdt, tmp_start, tmp_size); - if (ret == -ENOENT) - ret = delete_fdt_mem_rsv(fdt, tmp_start, -round_up(tmp_size, PAGE_SIZE)); - if (ret == 0) - pr_debug("Removed old initrd reservation.\n"); - else if (ret != -ENOENT) - return ret; - - /* If there's no new initrd, delete the old initrd's info. */ - if (initrd_len == 0) { - ret = fdt_delprop(fdt, chosen_node, - "linux,initrd-start"); - if (ret) { - pr_err("Error deleting linux,initrd-start.\n"); - return -EINVAL; - } - - ret = fdt_delprop(fdt, chosen_node, "linux,initrd-end"); - if (ret) { - pr_err("Error deleting linux,initrd-end.\n"); - return -EINVAL; - } - } - } - - if (initrd_len) { - ret = fdt_setprop_u64(fdt, chosen_node, - "linux,initrd-start", - initrd_load_addr); - if (ret < 0) - goto err; - - /* initrd-end is the first address after the initrd image. */ - ret = fdt_setprop_u64(fdt, chosen_node, "linux,initrd-end", - initrd_load_addr + initrd_len); - if (ret < 0) - goto err; - - ret = fdt_add_mem_rsv(fdt, initrd_load_addr, initrd_len); - if (ret) { - pr_err("Error reserving initrd memory: %s\n", - fdt_strerror(ret)); - return -EINVAL; - } - } - - if (cmdline != NULL) { - ret = fdt_setprop_string(fdt, chosen_node, "bootargs", cmdline); - if (ret < 0) - goto err; - } else {
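After this conversion the powerpc entry point is a thin wrapper: the generic /chosen and reservation handling is delegated, and only the powerpc-specific bits remain. What is left of setup_new_fdt() is, in outline (a sketch of the shape, not the exact hunk):

	int setup_new_fdt(const struct kimage *image, void *fdt,
			  unsigned long initrd_load_addr,
			  unsigned long initrd_len, const char *cmdline)
	{
		int ret;

		ret = of_kexec_setup_new_fdt(image, fdt, initrd_load_addr,
					     initrd_len, cmdline);
		if (ret)
			return ret;

		/* remaining powerpc-specific fixups (kdump reservations, ...) */
		return 0;
	}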
[PATCH v16 03/12] arm64: Use common of_kexec_setup_new_fdt()
From: Rob Herring The code for setting up the /chosen node in the device tree and updating the memory reservation for the next kernel has been moved to of_kexec_setup_new_fdt() defined in "drivers/of/kexec.c". Use the common of_kexec_setup_new_fdt() to setup the device tree and update the memory reservation for kexec for arm64. Signed-off-by: Rob Herring Reviewed-by: Thiago Jung Bauermann Reviewed-by: Lakshmi Ramasubramanian Acked-by: Will Deacon --- arch/arm64/kernel/machine_kexec_file.c | 123 + 1 file changed, 3 insertions(+), 120 deletions(-) diff --git a/arch/arm64/kernel/machine_kexec_file.c b/arch/arm64/kernel/machine_kexec_file.c index 03210f644790..7da22bb7b9d5 100644 --- a/arch/arm64/kernel/machine_kexec_file.c +++ b/arch/arm64/kernel/machine_kexec_file.c @@ -15,23 +15,12 @@ #include #include #include +#include #include -#include #include #include #include #include -#include - -/* relevant device tree properties */ -#define FDT_PROP_KEXEC_ELFHDR "linux,elfcorehdr" -#define FDT_PROP_MEM_RANGE "linux,usable-memory-range" -#define FDT_PROP_INITRD_START "linux,initrd-start" -#define FDT_PROP_INITRD_END"linux,initrd-end" -#define FDT_PROP_BOOTARGS "bootargs" -#define FDT_PROP_KASLR_SEED"kaslr-seed" -#define FDT_PROP_RNG_SEED "rng-seed" -#define RNG_SEED_SIZE 128 const struct kexec_file_ops * const kexec_file_loaders[] = { _image_ops, @@ -50,112 +39,6 @@ int arch_kimage_file_post_load_cleanup(struct kimage *image) return kexec_image_post_load_cleanup_default(image); } -static int setup_dtb(struct kimage *image, -unsigned long initrd_load_addr, unsigned long initrd_len, -char *cmdline, void *dtb) -{ - int off, ret; - - ret = fdt_path_offset(dtb, "/chosen"); - if (ret < 0) - goto out; - - off = ret; - - ret = fdt_delprop(dtb, off, FDT_PROP_KEXEC_ELFHDR); - if (ret && ret != -FDT_ERR_NOTFOUND) - goto out; - ret = fdt_delprop(dtb, off, FDT_PROP_MEM_RANGE); - if (ret && ret != -FDT_ERR_NOTFOUND) - goto out; - - if (image->type == KEXEC_TYPE_CRASH) { - /* add linux,elfcorehdr */ - ret = fdt_appendprop_addrrange(dtb, 0, off, - FDT_PROP_KEXEC_ELFHDR, - image->arch.elf_headers_mem, - image->arch.elf_headers_sz); - if (ret) - return (ret == -FDT_ERR_NOSPACE ? -ENOMEM : -EINVAL); - - /* add linux,usable-memory-range */ - ret = fdt_appendprop_addrrange(dtb, 0, off, - FDT_PROP_MEM_RANGE, - crashk_res.start, - crashk_res.end - crashk_res.start + 1); - if (ret) - return (ret == -FDT_ERR_NOSPACE ? 
-ENOMEM : -EINVAL); - } - - /* add bootargs */ - if (cmdline) { - ret = fdt_setprop_string(dtb, off, FDT_PROP_BOOTARGS, cmdline); - if (ret) - goto out; - } else { - ret = fdt_delprop(dtb, off, FDT_PROP_BOOTARGS); - if (ret && (ret != -FDT_ERR_NOTFOUND)) - goto out; - } - - /* add initrd-* */ - if (initrd_load_addr) { - ret = fdt_setprop_u64(dtb, off, FDT_PROP_INITRD_START, - initrd_load_addr); - if (ret) - goto out; - - ret = fdt_setprop_u64(dtb, off, FDT_PROP_INITRD_END, - initrd_load_addr + initrd_len); - if (ret) - goto out; - } else { - ret = fdt_delprop(dtb, off, FDT_PROP_INITRD_START); - if (ret && (ret != -FDT_ERR_NOTFOUND)) - goto out; - - ret = fdt_delprop(dtb, off, FDT_PROP_INITRD_END); - if (ret && (ret != -FDT_ERR_NOTFOUND)) - goto out; - } - - /* add kaslr-seed */ - ret = fdt_delprop(dtb, off, FDT_PROP_KASLR_SEED); - if (ret == -FDT_ERR_NOTFOUND) - ret = 0; - else if (ret) - goto out; - - if (rng_is_initialized()) { - u64 seed = get_random_u64(); - ret = fdt_setprop_u64(dtb, off, FDT_PROP_KASLR_SEED, seed); - if (ret) - goto out; - } else { - pr_notice("RNG is not initialised: omitting \"%s\" property\n", - FDT_PROP_KASLR_SEED); - } - - /* add rng-seed */ - if (rng_is_initialized()) { - void *rng_seed; - ret = fdt_setprop_placeholder(dtb, off, FDT_PROP_RNG_SEED, - RNG_SEED_SIZE, _seed); - if
[PATCH v16 00/12] Carry forward IMA measurement log on kexec on ARM64
On kexec file load Integrity Measurement Architecture (IMA) subsystem may verify the IMA signature of the kernel and initramfs, and measure it. The command line parameters passed to the kernel in the kexec call may also be measured by IMA. A remote attestation service can verify a TPM quote based on the TPM event log, the IMA measurement list, and the TPM PCR data. This can be achieved only if the IMA measurement log is carried over from the current kernel to the next kernel across the kexec call. powerpc already supports carrying forward the IMA measurement log on kexec. This patch set adds support for carrying forward the IMA measurement log on kexec on ARM64. This patch set moves the platform independent code defined for powerpc such that it can be reused for other platforms as well. A chosen node "linux,ima-kexec-buffer" is added to the DTB for ARM64 to hold the address and the size of the memory reserved to carry the IMA measurement log. This patch set has been tested for ARM64 platform using QEMU. I would like help from the community for testing this change on powerpc. Thanks. This patch set is based on commit b3f82afc1041 ("IMA: Measure kernel version in early boot") in https://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity.git "next-integrity" branch. Changelog: v16 - Defined functions to allocate and free buffer for FDT for powerpc and arm64. - Moved ima_buffer_addr and ima_buffer_size fields from "struct kimage_arch" in powerpc to "struct kimage" v15 - Included Rob's patches in the patch set, and rebased the changes to "next-integrity" branch. - Allocate memory for DTB, on arm64, using kmalloc() instead of vmalloc() to keep it consistent with powerpc implementation. - Call of_kexec_setup_new_fdt() from setup_new_fdt_ppc64() and remove setup_new_fdt() in the same patch to keep it bisect safe. v14 - Select CONFIG_HAVE_IMA_KEXEC for CONFIG_KEXEC_FILE, for powerpc and arm64, if CONFIG_IMA is enabled. - Use IS_ENABLED() macro instead of "#ifdef" in remove_ima_buffer(), ima_get_kexec_buffer(), and ima_free_kexec_buffer(). - Call of_kexec_setup_new_fdt() from setup_new_fdt_ppc64() and remove setup_new_fdt() in "arch/powerpc/kexec/file_load.c". v13 - Moved the arch independent functions to drivers/of/kexec.c and then refactored the code. - Moved arch_ima_add_kexec_buffer() to security/integrity/ima/ima_kexec.c v12 - Use fdt_appendprop_addrrange() in setup_ima_buffer() to setup the IMA measurement list property in the device tree. - Moved architecture independent functions from "arch/powerpc/kexec/ima.c" to "drivers/of/kexec." - Deleted "arch/powerpc/kexec/ima.c" and "arch/powerpc/include/asm/ima.h". 
v11 - Rebased the changes on the kexec code refactoring done by Rob Herring in his "dt/kexec" branch - Removed "extern" keyword in function declarations - Removed unnecessary header files included in C files - Updated patch descriptions per Thiago's comments v10 - Moved delete_fdt_mem_rsv(), remove_ima_buffer(), get_ima_kexec_buffer, and get_root_addr_size_cells() to drivers/of/kexec.c - Moved arch_ima_add_kexec_buffer() to security/integrity/ima/ima_kexec.c - Conditionally define IMA buffer fields in struct kimage_arch v9 - Moved delete_fdt_mem_rsv() to drivers/of/kexec_fdt.c - Defined a new function get_ima_kexec_buffer() in drivers/of/ima_kexec.c to replace do_get_kexec_buffer() - Changed remove_ima_kexec_buffer() to the original function name remove_ima_buffer() - Moved remove_ima_buffer() to drivers/of/ima_kexec.c - Moved ima_get_kexec_buffer() and ima_free_kexec_buffer() to security/integrity/ima/ima_kexec.c v8: - Moved remove_ima_kexec_buffer(), do_get_kexec_buffer(), and delete_fdt_mem_rsv() to drivers/of/fdt.c - Moved ima_dump_measurement_list() and ima_add_kexec_buffer() back to security/integrity/ima/ima_kexec.c v7: - Renamed remove_ima_buffer() to remove_ima_kexec_buffer() and moved this function definition to kernel. - Moved delete_fdt_mem_rsv() definition to kernel - Moved ima_dump_measurement_list() and ima_add_kexec_buffer() to a new file namely ima_kexec_fdt.c in IMA v6: - Remove any existing FDT_PROP_IMA_KEXEC_BUFFER property in the device tree and also its corresponding memory reservation in the currently running kernel. - Moved the function remove_ima_buffer() defined for powerpc to IMA and renamed the function to ima_remove_kexec_buffer(). Also, moved delete_fdt_mem_rsv() from powerpc to IMA. v5: - Merged get_addr_size_cells() and do_get_kexec_buffer() into a single function when moving the arch independent code from powerpc to IMA - Reverted the change to use FDT functions in powerpc code and added back the original code in get_addr_size_cells() and do_get_kexec_buffer() for powerpc. - Added fdt_add_mem_rsv() for ARM64 to reserve the memory for the IMA
[PATCH v16 01/12] powerpc: Rename kexec elfcorehdr_addr to elf_headers_mem
From: Rob Herring The architecture specific field, elfcorehdr_addr in struct kimage_arch, that holds the address of the buffer in memory for ELF core header for powerpc has a different name than the one used for arm64. This makes it hard to have a common code for setting up the device tree for kexec system call. Rename elfcorehdr_addr to elf_headers_mem to align with arm64 name so common code can use it. Signed-off-by: Rob Herring Reviewed-by: Thiago Jung Bauermann Reviewed-by: Lakshmi Ramasubramanian --- arch/powerpc/include/asm/kexec.h | 2 +- arch/powerpc/kexec/file_load.c| 4 ++-- arch/powerpc/kexec/file_load_64.c | 4 ++-- 3 files changed, 5 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h index 55d6ede30c19..dbf09d2f36d0 100644 --- a/arch/powerpc/include/asm/kexec.h +++ b/arch/powerpc/include/asm/kexec.h @@ -108,7 +108,7 @@ struct kimage_arch { unsigned long backup_start; void *backup_buf; - unsigned long elfcorehdr_addr; + unsigned long elf_headers_mem; unsigned long elf_headers_sz; void *elf_headers; diff --git a/arch/powerpc/kexec/file_load.c b/arch/powerpc/kexec/file_load.c index 9a232bc36c8f..e452b11df631 100644 --- a/arch/powerpc/kexec/file_load.c +++ b/arch/powerpc/kexec/file_load.c @@ -45,7 +45,7 @@ char *setup_kdump_cmdline(struct kimage *image, char *cmdline, return NULL; elfcorehdr_strlen = sprintf(cmdline_ptr, "elfcorehdr=0x%lx ", - image->arch.elfcorehdr_addr); + image->arch.elf_headers_mem); if (elfcorehdr_strlen + cmdline_len > COMMAND_LINE_SIZE) { pr_err("Appending elfcorehdr= exceeds cmdline size\n"); @@ -263,7 +263,7 @@ int setup_new_fdt(const struct kimage *image, void *fdt, * Avoid elfcorehdr from being stomped on in kdump kernel by * setting up memory reserve map. */ - ret = fdt_add_mem_rsv(fdt, image->arch.elfcorehdr_addr, + ret = fdt_add_mem_rsv(fdt, image->arch.elf_headers_mem, image->arch.elf_headers_sz); if (ret) { pr_err("Error reserving elfcorehdr memory: %s\n", diff --git a/arch/powerpc/kexec/file_load_64.c b/arch/powerpc/kexec/file_load_64.c index c69bcf9b547a..a05c19b3cc60 100644 --- a/arch/powerpc/kexec/file_load_64.c +++ b/arch/powerpc/kexec/file_load_64.c @@ -815,7 +815,7 @@ static int load_elfcorehdr_segment(struct kimage *image, struct kexec_buf *kbuf) goto out; } - image->arch.elfcorehdr_addr = kbuf->mem; + image->arch.elf_headers_mem = kbuf->mem; image->arch.elf_headers_sz = headers_sz; image->arch.elf_headers = headers; out: @@ -851,7 +851,7 @@ int load_crashdump_segments_ppc64(struct kimage *image, return ret; } pr_debug("Loaded elf core header at 0x%lx, bufsz=0x%lx memsz=0x%lx\n", -image->arch.elfcorehdr_addr, kbuf->bufsz, kbuf->memsz); +image->arch.elf_headers_mem, kbuf->bufsz, kbuf->memsz); return 0; } -- 2.30.0
[PATCH v16 02/12] of: Add a common kexec FDT setup function
From: Rob Herring Both arm64 and powerpc do essentially the same FDT /chosen setup for kexec. The differences are either omissions that arm64 should have or additional properties that will be ignored. The setup code can be combined and shared by both powerpc and arm64. The differences relative to the arm64 version: - If /chosen doesn't exist, it will be created (should never happen). - Any old dtb and initrd reserved memory will be released. - The new initrd and elfcorehdr are marked reserved. - "linux,booted-from-kexec" is set. The differences relative to the powerpc version: - "kaslr-seed" and "rng-seed" may be set. - "linux,elfcorehdr" is set. - Any existing "linux,usable-memory-range" is removed. Combine the code for setting up the /chosen node in the FDT and updating the memory reservation for kexec, for powerpc and arm64, in of_kexec_setup_new_fdt() and move it to "drivers/of/kexec.c". Signed-off-by: Rob Herring Reviewed-by: Thiago Jung Bauermann Reviewed-by: Lakshmi Ramasubramanian --- drivers/of/Makefile | 1 + drivers/of/kexec.c | 236 include/linux/of.h | 4 + 3 files changed, 241 insertions(+) create mode 100644 drivers/of/kexec.c diff --git a/drivers/of/Makefile b/drivers/of/Makefile index 6e1e5212f058..8ce11955afde 100644 --- a/drivers/of/Makefile +++ b/drivers/of/Makefile @@ -13,5 +13,6 @@ obj-$(CONFIG_OF_RESERVED_MEM) += of_reserved_mem.o obj-$(CONFIG_OF_RESOLVE) += resolver.o obj-$(CONFIG_OF_OVERLAY) += overlay.o obj-$(CONFIG_OF_NUMA) += of_numa.o +obj-$(CONFIG_KEXEC_FILE) += kexec.o obj-$(CONFIG_OF_UNITTEST) += unittest-data/ diff --git a/drivers/of/kexec.c b/drivers/of/kexec.c new file mode 100644 index ..4afd3cc1c04a --- /dev/null +++ b/drivers/of/kexec.c @@ -0,0 +1,236 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (C) 2020 Arm Limited + * + * Based on arch/arm64/kernel/machine_kexec_file.c: + * Copyright (C) 2018 Linaro Limited + * + * And arch/powerpc/kexec/file_load.c: + * Copyright (C) 2016 IBM Corporation + */ + +#include +#include +#include +#include +#include +#include +#include + +/* relevant device tree properties */ +#define FDT_PROP_KEXEC_ELFHDR "linux,elfcorehdr" +#define FDT_PROP_MEM_RANGE "linux,usable-memory-range" +#define FDT_PROP_INITRD_START "linux,initrd-start" +#define FDT_PROP_INITRD_END"linux,initrd-end" +#define FDT_PROP_BOOTARGS "bootargs" +#define FDT_PROP_KASLR_SEED"kaslr-seed" +#define FDT_PROP_RNG_SEED "rng-seed" +#define RNG_SEED_SIZE 128 + +/** + * fdt_find_and_del_mem_rsv - delete memory reservation with given address and size + * + * @fdt: Flattened device tree for the current kernel. + * @start: Starting address of the reserved memory. + * @size: Size of the reserved memory. + * + * Return: 0 on success, or negative errno on error. + */ +static int fdt_find_and_del_mem_rsv(void *fdt, unsigned long start, unsigned long size) +{ + int i, ret, num_rsvs = fdt_num_mem_rsv(fdt); + + for (i = 0; i < num_rsvs; i++) { + u64 rsv_start, rsv_size; + + ret = fdt_get_mem_rsv(fdt, i, _start, _size); + if (ret) { + pr_err("Malformed device tree.\n"); + return -EINVAL; + } + + if (rsv_start == start && rsv_size == size) { + ret = fdt_del_mem_rsv(fdt, i); + if (ret) { + pr_err("Error deleting device tree reservation.\n"); + return -EINVAL; + } + + return 0; + } + } + + return -ENOENT; +} + +/* + * of_kexec_setup_new_fdt - modify /chosen and memory reservation for the next kernel + * + * @image: kexec image being loaded. + * @fdt: Flattened device tree for the next kernel. + * @initrd_load_addr: Address where the next initrd will be loaded. 
+ * @initrd_len:Size of the next initrd, or 0 if there will be none. + * @cmdline: Command line for the next kernel, or NULL if there will + * be none. + * + * Return: 0 on success, or negative errno on error. + */ +int of_kexec_setup_new_fdt(const struct kimage *image, void *fdt, + unsigned long initrd_load_addr, unsigned long initrd_len, + const char *cmdline) +{ + int ret, chosen_node; + const void *prop; + + /* Remove memory reservation for the current device tree. */ + ret = fdt_find_and_del_mem_rsv(fdt, __pa(initial_boot_params), + fdt_totalsize(initial_boot_params)); + if (ret == -EINVAL) + return ret; + + chosen_node = fdt_path_offset(fdt, "/chosen"); + if (chosen_node == -FDT_ERR_NOTFOUND) + chosen_node = fdt_add_subnode(fdt,
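One inherited subtlety deserves a note: when deleting the old initrd reservation, kexec records the exact initrd size while firmware may have reserved a PAGE_SIZE-rounded amount, so the helper is tried with both sizes. In outline, from the powerpc logic this patch absorbs:

	/* kexec reserves the exact size; firmware may round up to a page. */
	ret = fdt_find_and_del_mem_rsv(fdt, tmp_start, tmp_size);
	if (ret == -ENOENT)
		ret = fdt_find_and_del_mem_rsv(fdt, tmp_start,
					       round_up(tmp_size, PAGE_SIZE));
	if (ret == -EINVAL)
		return ret;	/* malformed tree; a missing entry is fine */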
Re: [PATCH v2] powerpc/uprobes: Validation for prefixed instruction
On 2021/02/04 06:38PM, Naveen N. Rao wrote: > On 2021/02/04 04:17PM, Ravi Bangoria wrote: > > Don't allow Uprobe on 2nd word of a prefixed instruction. As per > > ISA 3.1, prefixed instruction should not cross 64-byte boundary. > > So don't allow Uprobe on such prefixed instruction as well. > > > > There are two ways probed instruction is changed in mapped pages. > > First, when Uprobe is activated, it searches for all the relevant > > pages and replace instruction in them. In this case, if we notice > > that probe is on the 2nd word of prefixed instruction, error out > > directly. Second, when Uprobe is already active and user maps a > > relevant page via mmap(), instruction is replaced via mmap() code > > path. But because Uprobe is invalid, entire mmap() operation can > > not be stopped. In this case just print an error and continue. > > > > Signed-off-by: Ravi Bangoria > > --- > > v1: > > http://lore.kernel.org/r/20210119091234.76317-1-ravi.bango...@linux.ibm.com > > v1->v2: > > - Instead of introducing new arch hook from verify_opcode(), use > > existing hook arch_uprobe_analyze_insn(). > > - Add explicit check for prefixed instruction crossing 64-byte > > boundary. If probe is on such instruction, throw an error. > > > > arch/powerpc/kernel/uprobes.c | 66 ++- > > 1 file changed, 65 insertions(+), 1 deletion(-) > > > > diff --git a/arch/powerpc/kernel/uprobes.c b/arch/powerpc/kernel/uprobes.c > > index e8a63713e655..485d19a2a31f 100644 > > --- a/arch/powerpc/kernel/uprobes.c > > +++ b/arch/powerpc/kernel/uprobes.c > > @@ -7,6 +7,7 @@ > > * Adapted from the x86 port by Ananth N Mavinakayanahalli > > > > */ > > #include > > +#include > > #include > > #include > > #include > > @@ -28,6 +29,69 @@ bool is_trap_insn(uprobe_opcode_t *insn) > > return (is_trap(*insn)); > > } > > > > +#ifdef CONFIG_PPC64 > > +static int get_instr(struct mm_struct *mm, unsigned long addr, u32 *instr) > > +{ > > + struct page *page; > > + struct vm_area_struct *vma; > > + void *kaddr; > > + unsigned int gup_flags = FOLL_FORCE | FOLL_SPLIT_PMD; > > + > > + if (get_user_pages_remote(mm, addr, 1, gup_flags, , , NULL) <= > > 0) > > + return -EINVAL; > > + > > + kaddr = kmap_atomic(page); > > + *instr = *((u32 *)(kaddr + (addr & ~PAGE_MASK))); > > + kunmap_atomic(kaddr); > > + put_page(page); > > + return 0; > > +} > > + > > +static int validate_prefixed_instr(struct mm_struct *mm, unsigned long > > addr) > > +{ > > + struct ppc_inst inst; > > + u32 prefix, suffix; > > + > > + /* > > +* No need to check if addr is pointing to beginning of the > > +* page. Even if probe is on a suffix of page-unaligned > > +* prefixed instruction, hw will raise exception and kernel > > +* will send SIGBUS. > > +*/ > > + if (!(addr & ~PAGE_MASK)) > > + return 0; > > + > > + if (get_instr(mm, addr, ) < 0) > > + return -EINVAL; > > + if (get_instr(mm, addr + 4, ) < 0) > > + return -EINVAL; > > + > > + inst = ppc_inst_prefix(prefix, suffix); > > + if (ppc_inst_prefixed(inst) && (addr & 0x3F) == 0x3C) { > > + printk_ratelimited("Cannot register a uprobe on 64 byte " > ^^ pr_info_ratelimited() > > It should be sufficient to check the primary opcode to determine if it > is a prefixed instruction. You don't have to read the suffix. I see that > we don't have a helper to do this currently, so you could do: > > if (ppc_inst_primary_opcode(ppc_inst(prefix)) == 1) Seeing the kprobes code, I realized that we have to check for another scenario (Thanks, Jordan!). 
If this is the suffix of a prefix instruction for which a uprobe has already been installed, then the previous word will be a 'trap' instruction. You need to check if there is a uprobe at the previous word, and if the original instruction there was a prefix instruction. - Naveen
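To make the scenario concrete: if a uprobe already sits on the prefix word, reading addr - 4 from the page returns the trap that replaced it, so the original instruction has to come from the saved copy attached to that uprobe. A rough sketch of the shape the check could take (find_uprobe_orig_insn() is a hypothetical lookup; no such helper is exported today):

	u32 prev;

	if (get_instr(mm, addr - 4, &prev) < 0)
		return -EINVAL;

	if (is_trap(prev)) {
		/*
		 * The previous word may be another uprobe's breakpoint;
		 * consult that uprobe's saved original instruction
		 * (hypothetical helper) rather than the patched page.
		 */
		if (find_uprobe_orig_insn(mm, addr - 4, &prev) < 0)
			return -EINVAL;
	}

	if (ppc_inst_primary_opcode(ppc_inst(prev)) == 1)
		return -EINVAL;	/* addr is the suffix of a prefixed insn */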
Re: [PATCH] vio: make remove callback return void
On Wed, Jan 27, 2021 at 10:50:10PM +0100, Uwe Kleine-König wrote: > The driver core ignores the return value of struct bus_type::remove() > because there is only little that can be done. To simplify the quest to > make this function return void, let struct vio_driver::remove() return > void, too. All users already unconditionally return 0, this commit makes > it obvious that returning an error code is a bad idea and makes it > obvious for future driver authors that returning an error code isn't > intended. > > Note there are two nominally different implementations for a vio bus: > one in arch/sparc/kernel/vio.c and the other in > arch/powerpc/platforms/pseries/vio.c. I didn't care to check which > driver is using which of these busses (or if even some of them can be > used with both) and simply adapt all drivers and the two bus codes in > one go. > > Note that for the powerpc implementation there is a semantical change: > Before this patch for a device that was bound to a driver without a > remove callback vio_cmo_bus_remove(viodev) wasn't called. As the device > core still considers the device unbound after vio_bus_remove() returns > calling this unconditionally is the consistent behaviour which is > implemented here. > > Signed-off-by: Uwe Kleine-König > --- > Hello, > > note that this change depends on > https://lore.kernel.org/r/20210121062005.53271-1-...@linux.ibm.com which > removes > an (ignored) return -EBUSY in drivers/net/ethernet/ibm/ibmvnic.c. > I don't know when/if this latter patch will be applied, so it might take > some time until my patch can go in. Acked-by: Greg Kroah-Hartman
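For driver authors the conversion is mechanical; a skeleton after this change (the foo_* names are illustrative):

	static void foo_remove(struct vio_dev *vdev)
	{
		struct foo *priv = dev_get_drvdata(&vdev->dev);

		/* tear down; the core ignores any outcome, so report nothing */
		foo_teardown(priv);
	}

	static struct vio_driver foo_driver = {
		.name		= "foo",
		.probe		= foo_probe,
		.remove		= foo_remove,	/* now returns void */
		.id_table	= foo_device_table,
	};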
Re: [PATCH v2] powerpc/uprobes: Validation for prefixed instruction
On 2/4/21 4:19 PM, Ravi Bangoria wrote: On 2/4/21 4:17 PM, Ravi Bangoria wrote: Don't allow Uprobe on 2nd word of a prefixed instruction. As per ISA 3.1, prefixed instruction should not cross 64-byte boundary. So don't allow Uprobe on such prefixed instruction as well. There are two ways probed instruction is changed in mapped pages. First, when Uprobe is activated, it searches for all the relevant pages and replace instruction in them. In this case, if we notice that probe is on the 2nd word of prefixed instruction, error out directly. Second, when Uprobe is already active and user maps a relevant page via mmap(), instruction is replaced via mmap() code path. But because Uprobe is invalid, entire mmap() operation can not be stopped. In this case just print an error and continue. @mpe, arch_uprobe_analyze_insn() can return early if cpu_has_feature(CPU_FTR_ARCH_31) is not set. But that will miss out a rare scenario of user running binary with prefixed instruction on p10 predecessors. Please let me know if I should add cpu_has_feature(CPU_FTR_ARCH_31) or not. Wouldn't that binary get a SIGILL in any case? I concur with Naveen... it makes sense to add the check. -- Ananth
Re: [PATCH v2] powerpc/uprobes: Validation for prefixed instruction
On 2021/02/04 04:19PM, Ravi Bangoria wrote: > > > On 2/4/21 4:17 PM, Ravi Bangoria wrote: > > Don't allow Uprobe on 2nd word of a prefixed instruction. As per > > ISA 3.1, prefixed instruction should not cross 64-byte boundary. > > So don't allow Uprobe on such prefixed instruction as well. > > > > There are two ways probed instruction is changed in mapped pages. > > First, when Uprobe is activated, it searches for all the relevant > > pages and replace instruction in them. In this case, if we notice > > that probe is on the 2nd word of prefixed instruction, error out > > directly. Second, when Uprobe is already active and user maps a > > relevant page via mmap(), instruction is replaced via mmap() code > > path. But because Uprobe is invalid, entire mmap() operation can > > not be stopped. In this case just print an error and continue. > > @mpe, > > arch_uprobe_analyze_insn() can return early if > cpu_has_feature(CPU_FTR_ARCH_31) is not set. But that will > miss out a rare scenario of user running binary with prefixed > instruction on p10 predecessors. Please let me know if I > should add cpu_has_feature(CPU_FTR_ARCH_31) or not. The check you are adding is very specific to prefixed instructions, so it makes sense to add a cpu feature check for v3.1. On older processors, those are invalid instructions like any other. The instruction emulation infrastructure will refuse to emulate it and the instruction will be single stepped. - Naveen
Re: [PATCH v2] powerpc/uprobes: Validation for prefixed instruction
On 2021/02/04 04:17PM, Ravi Bangoria wrote: > Don't allow Uprobe on 2nd word of a prefixed instruction. As per > ISA 3.1, prefixed instruction should not cross 64-byte boundary. > So don't allow Uprobe on such prefixed instruction as well. > > There are two ways probed instruction is changed in mapped pages. > First, when Uprobe is activated, it searches for all the relevant > pages and replace instruction in them. In this case, if we notice > that probe is on the 2nd word of prefixed instruction, error out > directly. Second, when Uprobe is already active and user maps a > relevant page via mmap(), instruction is replaced via mmap() code > path. But because Uprobe is invalid, entire mmap() operation can > not be stopped. In this case just print an error and continue. > > Signed-off-by: Ravi Bangoria > --- > v1: > http://lore.kernel.org/r/20210119091234.76317-1-ravi.bango...@linux.ibm.com > v1->v2: > - Instead of introducing new arch hook from verify_opcode(), use > existing hook arch_uprobe_analyze_insn(). > - Add explicit check for prefixed instruction crossing 64-byte > boundary. If probe is on such instruction, throw an error. > > arch/powerpc/kernel/uprobes.c | 66 ++- > 1 file changed, 65 insertions(+), 1 deletion(-) > > diff --git a/arch/powerpc/kernel/uprobes.c b/arch/powerpc/kernel/uprobes.c > index e8a63713e655..485d19a2a31f 100644 > --- a/arch/powerpc/kernel/uprobes.c > +++ b/arch/powerpc/kernel/uprobes.c > @@ -7,6 +7,7 @@ > * Adapted from the x86 port by Ananth N Mavinakayanahalli > > */ > #include > +#include > #include > #include > #include > @@ -28,6 +29,69 @@ bool is_trap_insn(uprobe_opcode_t *insn) > return (is_trap(*insn)); > } > > +#ifdef CONFIG_PPC64 > +static int get_instr(struct mm_struct *mm, unsigned long addr, u32 *instr) > +{ > + struct page *page; > + struct vm_area_struct *vma; > + void *kaddr; > + unsigned int gup_flags = FOLL_FORCE | FOLL_SPLIT_PMD; > + > + if (get_user_pages_remote(mm, addr, 1, gup_flags, , , NULL) <= > 0) > + return -EINVAL; > + > + kaddr = kmap_atomic(page); > + *instr = *((u32 *)(kaddr + (addr & ~PAGE_MASK))); > + kunmap_atomic(kaddr); > + put_page(page); > + return 0; > +} > + > +static int validate_prefixed_instr(struct mm_struct *mm, unsigned long addr) > +{ > + struct ppc_inst inst; > + u32 prefix, suffix; > + > + /* > + * No need to check if addr is pointing to beginning of the > + * page. Even if probe is on a suffix of page-unaligned > + * prefixed instruction, hw will raise exception and kernel > + * will send SIGBUS. > + */ > + if (!(addr & ~PAGE_MASK)) > + return 0; > + > + if (get_instr(mm, addr, ) < 0) > + return -EINVAL; > + if (get_instr(mm, addr + 4, ) < 0) > + return -EINVAL; > + > + inst = ppc_inst_prefix(prefix, suffix); > + if (ppc_inst_prefixed(inst) && (addr & 0x3F) == 0x3C) { > + printk_ratelimited("Cannot register a uprobe on 64 byte " ^^ pr_info_ratelimited() It should be sufficient to check the primary opcode to determine if it is a prefixed instruction. You don't have to read the suffix. I see that we don't have a helper to do this currently, so you could do: if (ppc_inst_primary_opcode(ppc_inst(prefix)) == 1) In the future, it might be worthwhile to add IS_PREFIX() as a macro similar to IS_MTMSRD() if there are more such uses. Along with this, if you also add the below to the start of this function, you can get rid of the #ifdef: if (!IS_ENABLED(CONFIG_PPC64)) return 0; - Naveen
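Pulling the review comments together, the validation could collapse to something like this (a sketch only; the author's final version may differ):

	static int validate_prefixed_instr(struct mm_struct *mm, unsigned long addr)
	{
		u32 prefix;

		if (!IS_ENABLED(CONFIG_PPC64))
			return 0;

		/* A prefixed insn crossing a page boundary faults in hw (SIGBUS). */
		if (!(addr & ~PAGE_MASK))
			return 0;

		if (get_instr(mm, addr, &prefix) < 0)
			return -EINVAL;

		/* Primary opcode 1 marks a prefix; no need to read the suffix. */
		if (ppc_inst_primary_opcode(ppc_inst(prefix)) == 1 &&
		    (addr & 0x3f) == 0x3c) {
			pr_info_ratelimited("Cannot register a uprobe on 64 byte unaligned prefixed instruction\n");
			return -EINVAL;
		}

		return 0;
	}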
[PATCH kernel v3] powerpc/uaccess: Skip might_fault() when user access is enabled
The amount of code executed with enabled user space access (unlocked KUAP) should be minimal. However with CONFIG_PROVE_LOCKING or CONFIG_DEBUG_ATOMIC_SLEEP enabled, might_fault() may end up replaying interrupts which in turn may access the user space and forget to restore the KUAP state. The problem places are: 1. strncpy_from_user (and similar) which unlock KUAP and call unsafe_get_user -> __get_user_allowed -> __get_user_nocheck() with do_allow=false to skip KUAP as the caller took care of it. 2. __put_user_nocheck_goto() which is called with unlocked KUAP. This changes __get_user_nocheck() to look at @do_allow to decide whether to skip might_fault(). Since strncpy_from_user/etc call might_fault() anyway before unlocking KUAP, there should be no visible change. This drops might_fault() in __put_user_nocheck_goto() as it is only called from unsafe_xxx helpers which manage KUAP themselves. Since keeping might_fault() is still desirable, this adds those to user_access_begin/read/write which is the last point where we can safely do so. Fixes: 334710b1496a ("powerpc/uaccess: Implement unsafe_put_user() using 'asm goto'") Signed-off-by: Alexey Kardashevskiy --- Changes: v3: * removed might_fault() from __put_user_nocheck_goto * added might_fault() to user(_|_read_|_write_)access_begin v2: * s/!do_allow/do_allow/ --- Here is more detail about the issue: https://lore.kernel.org/linuxppc-dev/20210203084503.gx6...@kitsune.suse.cz/T/ Another example of the problem: Kernel attempted to write user page (22c3) - exploit attempt? (uid: 0) [ cut here ] Bug: Write fault blocked by KUAP! WARNING: CPU: 1 PID: 16712 at /home/aik/p/kernel-syzkaller/arch/powerpc/mm/fault.c:229 __do_page_fault+0xca4/0xf10 NIP [c06ff804] filldir64+0x484/0x820 LR [c06ff7fc] filldir64+0x47c/0x820 --- interrupt: 300 [c000589f3b40] [c08131b0] proc_fill_cache+0xf0/0x2b0 [c000589f3c60] [c0814658] proc_pident_readdir+0x1f8/0x390 [c000589f3cc0] [c06fd8e8] iterate_dir+0x108/0x370 [c000589f3d20] [c06fe3d8] sys_getdents64+0xa8/0x410 [c000589f3db0] [c004b708] system_call_exception+0x178/0x2b0 [c000589f3e10] [c000e060] system_call_common+0xf0/0x27c --- arch/powerpc/include/asm/uaccess.h | 10 +++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/include/asm/uaccess.h b/arch/powerpc/include/asm/uaccess.h index 501c9a79038c..a789601998d3 100644 --- a/arch/powerpc/include/asm/uaccess.h +++ b/arch/powerpc/include/asm/uaccess.h @@ -216,8 +216,6 @@ do { \ #define __put_user_nocheck_goto(x, ptr, size, label) \ do { \ __typeof__(*(ptr)) __user *__pu_addr = (ptr); \ - if (!is_kernel_addr((unsigned long)__pu_addr)) \ - might_fault(); \ __chk_user_ptr(ptr);\ __put_user_size_goto((x), __pu_addr, (size), label);\ } while (0) @@ -313,7 +311,7 @@ do { \ __typeof__(size) __gu_size = (size);\ \ __chk_user_ptr(__gu_addr); \ - if (!is_kernel_addr((unsigned long)__gu_addr)) \ + if (do_allow && !is_kernel_addr((unsigned long)__gu_addr)) \ might_fault(); \ barrier_nospec(); \ if (do_allow) \ @@ -508,6 +506,8 @@ static __must_check inline bool user_access_begin(const void __user *ptr, size_t { if (unlikely(!access_ok(ptr, len))) return false; + if (!is_kernel_addr((unsigned long)ptr)) + might_fault(); allow_read_write_user((void __user *)ptr, ptr, len); return true; } @@ -521,6 +521,8 @@ user_read_access_begin(const void __user *ptr, size_t len) { if (unlikely(!access_ok(ptr, len))) return false; + if (!is_kernel_addr((unsigned long)ptr)) + might_fault(); allow_read_from_user(ptr, len); return true; } @@ -532,6 +534,8 @@
user_write_access_begin(const void __user *ptr, size_t len) { if (unlikely(!access_ok(ptr, len))) return false; + if (!is_kernel_addr((unsigned long)ptr)) + might_fault(); allow_write_to_user((void __user *)ptr, len); return true; } -- 2.17.1
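The placement follows from the API contract: user_access_begin() and friends are the last point where sleeping is still legal, while everything between them and user_access_end() runs with KUAP open and must not fault-check (and hence must not replay interrupts). The canonical usage, for reference (a fragment; val, tmp and uptr assumed declared):

	if (!user_access_begin(uptr, sizeof(*uptr)))	/* may sleep here */
		return -EFAULT;

	unsafe_put_user(val, uptr, Efault);	/* no might_fault(): KUAP open */
	unsafe_get_user(tmp, uptr, Efault);

	user_access_end();
	return 0;

Efault:
	user_access_end();
	return -EFAULT;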
Re: [PATCH 3/3] tools/perf: Add perf tools support to expose Performance Monitor Counter SPRs as part of extended regs
> On 03-Feb-2021, at 9:55 PM, Arnaldo Carvalho de Melo wrote: > > Em Wed, Feb 03, 2021 at 01:55:37AM -0500, Athira Rajeev escreveu: >> To enable presenting of Performance Monitor Counter Registers >> (PMC1 to PMC6) as part of extended regsiters, patch adds these >> to sample_reg_mask in the tool side (to use with -I? option). >> >> Simplified the PERF_REG_PMU_MASK_300/31 definition. Excluded the >> unsupported SPRs (MMCR3, SIER2, SIER3) from extended mask value for >> CPU_FTR_ARCH_300. > > Applied just 3/3, the tooling part, to my local branch, please holler if > I should wait a bit more. > > - Arnaldo > Thanks Arnaldo for taking the tool side changes. Athira. >> Signed-off-by: Athira Rajeev >> --- >> tools/arch/powerpc/include/uapi/asm/perf_regs.h | 28 >> +++-- >> tools/perf/arch/powerpc/include/perf_regs.h | 6 ++ >> tools/perf/arch/powerpc/util/perf_regs.c| 6 ++ >> 3 files changed, 34 insertions(+), 6 deletions(-) >> >> diff --git a/tools/arch/powerpc/include/uapi/asm/perf_regs.h >> b/tools/arch/powerpc/include/uapi/asm/perf_regs.h >> index bdf5f10f8b9f..578b3ee86105 100644 >> --- a/tools/arch/powerpc/include/uapi/asm/perf_regs.h >> +++ b/tools/arch/powerpc/include/uapi/asm/perf_regs.h >> @@ -55,17 +55,33 @@ enum perf_event_powerpc_regs { >> PERF_REG_POWERPC_MMCR3, >> PERF_REG_POWERPC_SIER2, >> PERF_REG_POWERPC_SIER3, >> +PERF_REG_POWERPC_PMC1, >> +PERF_REG_POWERPC_PMC2, >> +PERF_REG_POWERPC_PMC3, >> +PERF_REG_POWERPC_PMC4, >> +PERF_REG_POWERPC_PMC5, >> +PERF_REG_POWERPC_PMC6, >> /* Max regs without the extended regs */ >> PERF_REG_POWERPC_MAX = PERF_REG_POWERPC_MMCRA + 1, >> }; >> >> #define PERF_REG_PMU_MASK((1ULL << PERF_REG_POWERPC_MAX) - 1) >> >> -/* PERF_REG_EXTENDED_MASK value for CPU_FTR_ARCH_300 */ >> -#define PERF_REG_PMU_MASK_300 (((1ULL << (PERF_REG_POWERPC_MMCR2 + 1)) - >> 1) - PERF_REG_PMU_MASK) >> -/* PERF_REG_EXTENDED_MASK value for CPU_FTR_ARCH_31 */ >> -#define PERF_REG_PMU_MASK_31 (((1ULL << (PERF_REG_POWERPC_SIER3 + 1)) - >> 1) - PERF_REG_PMU_MASK) >> +/* Exclude MMCR3, SIER2, SIER3 for CPU_FTR_ARCH_300 */ >> +#define PERF_EXCLUDE_REG_EXT_300(7ULL << PERF_REG_POWERPC_MMCR3) >> >> -#define PERF_REG_MAX_ISA_300 (PERF_REG_POWERPC_MMCR2 + 1) >> -#define PERF_REG_MAX_ISA_31(PERF_REG_POWERPC_SIER3 + 1) >> +/* >> + * PERF_REG_EXTENDED_MASK value for CPU_FTR_ARCH_300 >> + * includes 9 SPRS from MMCR0 to PMC6 excluding the >> + * unsupported SPRS in PERF_EXCLUDE_REG_EXT_300. >> + */ >> +#define PERF_REG_PMU_MASK_300 ((0xfffULL << PERF_REG_POWERPC_MMCR0) - >> PERF_EXCLUDE_REG_EXT_300) >> + >> +/* >> + * PERF_REG_EXTENDED_MASK value for CPU_FTR_ARCH_31 >> + * includes 12 SPRs from MMCR0 to PMC6. 
>> + */ >> +#define PERF_REG_PMU_MASK_31 (0xfffULL << PERF_REG_POWERPC_MMCR0) >> + >> +#define PERF_REG_EXTENDED_MAX (PERF_REG_POWERPC_PMC6 + 1) >> #endif /* _UAPI_ASM_POWERPC_PERF_REGS_H */ >> diff --git a/tools/perf/arch/powerpc/include/perf_regs.h >> b/tools/perf/arch/powerpc/include/perf_regs.h >> index 63f3ac91049f..98b6f9eabfc3 100644 >> --- a/tools/perf/arch/powerpc/include/perf_regs.h >> +++ b/tools/perf/arch/powerpc/include/perf_regs.h >> @@ -71,6 +71,12 @@ >> [PERF_REG_POWERPC_MMCR3] = "mmcr3", >> [PERF_REG_POWERPC_SIER2] = "sier2", >> [PERF_REG_POWERPC_SIER3] = "sier3", >> +[PERF_REG_POWERPC_PMC1] = "pmc1", >> +[PERF_REG_POWERPC_PMC2] = "pmc2", >> +[PERF_REG_POWERPC_PMC3] = "pmc3", >> +[PERF_REG_POWERPC_PMC4] = "pmc4", >> +[PERF_REG_POWERPC_PMC5] = "pmc5", >> +[PERF_REG_POWERPC_PMC6] = "pmc6", >> }; >> >> static inline const char *perf_reg_name(int id) >> diff --git a/tools/perf/arch/powerpc/util/perf_regs.c >> b/tools/perf/arch/powerpc/util/perf_regs.c >> index 2b6d4704e3aa..8116a253f91f 100644 >> --- a/tools/perf/arch/powerpc/util/perf_regs.c >> +++ b/tools/perf/arch/powerpc/util/perf_regs.c >> @@ -68,6 +68,12 @@ >> SMPL_REG(mmcr3, PERF_REG_POWERPC_MMCR3), >> SMPL_REG(sier2, PERF_REG_POWERPC_SIER2), >> SMPL_REG(sier3, PERF_REG_POWERPC_SIER3), >> +SMPL_REG(pmc1, PERF_REG_POWERPC_PMC1), >> +SMPL_REG(pmc2, PERF_REG_POWERPC_PMC2), >> +SMPL_REG(pmc3, PERF_REG_POWERPC_PMC3), >> +SMPL_REG(pmc4, PERF_REG_POWERPC_PMC4), >> +SMPL_REG(pmc5, PERF_REG_POWERPC_PMC5), >> +SMPL_REG(pmc6, PERF_REG_POWERPC_PMC6), >> SMPL_REG_END >> }; >> >> -- >> 1.8.3.1 >> > > -- > > - Arnaldo
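For context, the end-user flow this enables looks roughly like the following (exact register list depends on the mask the running kernel advertises):

	# perf record -I?		# list available registers, now incl. pmc1..pmc6
	# perf record -I pmc1,pmc2,pmc3,pmc4,pmc5,pmc6 -e cycles -- ./workload
	# perf script -F ip,iregs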
Re: [PATCH] tools/perf: Fix powerpc gap between kernel end and module start
> On 03-Feb-2021, at 9:01 PM, Arnaldo Carvalho de Melo wrote: > > Thanks, collected the Tested-by from Kajol and the Acked-by from Jiri > and applied to my local tree for testing, then up to my perf/core > branch. > > - Arnaldo Thanks Arnaldo for taking this fix.
Re: [PATCH RFC v1 2/6] swiotlb: convert variables to arrays
On 2021-02-04 07:29, Christoph Hellwig wrote: On Wed, Feb 03, 2021 at 03:37:05PM -0800, Dongli Zhang wrote: This patch converts several swiotlb related variables to arrays, in order to maintain stat/status for different swiotlb buffers. Here are variables involved: - io_tlb_start and io_tlb_end - io_tlb_nslabs and io_tlb_used - io_tlb_list - io_tlb_index - max_segment - io_tlb_orig_addr - no_iotlb_memory There is no functional change and this is to prepare to enable 64-bit swiotlb. Claire Chang (on Cc) already posted a patch like this a month ago, which looks much better because it actually uses a struct instead of all the random variables. Indeed, I skimmed the cover letter and immediately thought that this whole thing is just the restricted DMA pool concept[1] again, only from a slightly different angle. Robin. [1] https://lore.kernel.org/linux-iommu/20210106034124.30560-1-tien...@chromium.org/
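For reference, the struct-based shape being suggested groups the parallel globals into one instance per pool, roughly (illustrative only; the field set mirrors the variables listed in the patch description):

	struct io_tlb_mem {
		phys_addr_t	start;
		phys_addr_t	end;
		unsigned long	nslabs;
		unsigned long	used;
		unsigned int	*list;
		unsigned int	index;
		unsigned int	max_segment;
		phys_addr_t	*orig_addr;
		bool		no_iotlb_memory;
	};

	static struct io_tlb_mem io_tlb_default_mem;	/* plus one per extra pool */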
Re: [PATCH v7 39/42] powerpc: move NMI entry/exit code into wrapper
Excerpts from Michael Ellerman's message of February 4, 2021 8:15 pm: > Nicholas Piggin writes: >> This moves the common NMI entry and exit code into the interrupt handler >> wrappers. >> >> This changes the behaviour of soft-NMI (watchdog) and HMI interrupts, and >> also MCE interrupts on 64e, by adding missing parts of the NMI entry to >> them. >> >> Signed-off-by: Nicholas Piggin >> --- >> arch/powerpc/include/asm/interrupt.h | 28 ++ >> arch/powerpc/kernel/mce.c| 11 - >> arch/powerpc/kernel/traps.c | 35 +--- >> arch/powerpc/kernel/watchdog.c | 10 >> 4 files changed, 38 insertions(+), 46 deletions(-) > > This is unhappy when injecting SLB multi-hits: > > root@p86-2:~# echo PPC_SLB_MULTIHIT > /sys/kernel/debug/provoke-crash/DIRECT > [ 312.496026][ T1344] kernel BUG at > arch/powerpc/include/asm/interrupt.h:152! > [ 312.496037][ T1344] Oops: Exception in kernel mode, sig: 5 [#1] > [ 312.496045][ T1344] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA > pSeries pseries hash. Blast! > 147 static inline void interrupt_nmi_exit_prepare(struct pt_regs *regs, > struct interrupt_nmi_state *state) > 148 { > 149 if (!IS_ENABLED(CONFIG_PPC_BOOK3S_64) || > 150 !firmware_has_feature(FW_FEATURE_LPAR) || > 151 radix_enabled() || (mfmsr() & MSR_DR)) > 152 nmi_exit(); > > > So presumably it's: > > #define __nmi_exit() \ > do {\ > BUG_ON(!in_nmi()); \ Yes that would be it, pseries machine check enables MMU half way through so only one side of this triggers. The MSR_DR check is supposed to catch the other NMIs that run with MMU on (perf, watchdog, etc). Suppose it could test TRAP(regs) explicitly although I wonder if we should also do this to keep things balanced diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c index 149cec2212e6..f57ca0c570be 100644 --- a/arch/powerpc/platforms/pseries/ras.c +++ b/arch/powerpc/platforms/pseries/ras.c @@ -719,6 +719,7 @@ static int mce_handle_err_virtmode(struct pt_regs *regs, static int mce_handle_error(struct pt_regs *regs, struct rtas_error_log *errp) { + unsigned long msr; struct pseries_errorlog *pseries_log; struct pseries_mc_errorlog *mce_log = NULL; int disposition = rtas_error_disposition(errp); @@ -747,9 +748,12 @@ static int mce_handle_error(struct pt_regs *regs, struct rtas_error_log *errp) * SLB multihit is done by now. */ out: - mtmsr(mfmsr() | MSR_IR | MSR_DR); + msr = mfmsr(); + mtmsr(msr | MSR_IR | MSR_DR); disposition = mce_handle_err_virtmode(regs, errp, mce_log, disposition); + mtmsr(msr); + return disposition; }
[PATCH] powerpc/kexec_file: fix FDT size estimation for kdump kernel
On systems with large amount of memory, loading kdump kernel through kexec_file_load syscall may fail with the below error: "Failed to update fdt with linux,drconf-usable-memory property" This happens because the size estimation for kdump kernel's FDT does not account for the additional space needed to setup usable memory properties. Fix it by accounting for the space needed to include linux,usable-memory & linux,drconf-usable-memory properties while estimating kdump kernel's FDT size. Fixes: 6ecd0163d360 ("powerpc/kexec_file: Add appropriate regions for memory reserve map") Cc: sta...@vger.kernel.org Signed-off-by: Hari Bathini --- arch/powerpc/include/asm/kexec.h |1 + arch/powerpc/kexec/elf_64.c |2 +- arch/powerpc/kexec/file_load_64.c | 34 ++ 3 files changed, 36 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h index 55d6ede30c19..9ab344d29a54 100644 --- a/arch/powerpc/include/asm/kexec.h +++ b/arch/powerpc/include/asm/kexec.h @@ -136,6 +136,7 @@ int load_crashdump_segments_ppc64(struct kimage *image, int setup_purgatory_ppc64(struct kimage *image, const void *slave_code, const void *fdt, unsigned long kernel_load_addr, unsigned long fdt_load_addr); +unsigned int kexec_fdt_totalsize_ppc64(struct kimage *image); int setup_new_fdt_ppc64(const struct kimage *image, void *fdt, unsigned long initrd_load_addr, unsigned long initrd_len, const char *cmdline); diff --git a/arch/powerpc/kexec/elf_64.c b/arch/powerpc/kexec/elf_64.c index d0e459bb2f05..9842e33533df 100644 --- a/arch/powerpc/kexec/elf_64.c +++ b/arch/powerpc/kexec/elf_64.c @@ -102,7 +102,7 @@ static void *elf64_load(struct kimage *image, char *kernel_buf, pr_debug("Loaded initrd at 0x%lx\n", initrd_load_addr); } - fdt_size = fdt_totalsize(initial_boot_params) * 2; + fdt_size = kexec_fdt_totalsize_ppc64(image); fdt = kmalloc(fdt_size, GFP_KERNEL); if (!fdt) { pr_err("Not enough memory for the device tree.\n"); diff --git a/arch/powerpc/kexec/file_load_64.c b/arch/powerpc/kexec/file_load_64.c index c69bcf9b547a..67fa7bfcfa30 100644 --- a/arch/powerpc/kexec/file_load_64.c +++ b/arch/powerpc/kexec/file_load_64.c @@ -21,6 +21,7 @@ #include #include #include +#include #include #include #include @@ -925,6 +926,39 @@ int setup_purgatory_ppc64(struct kimage *image, const void *slave_code, return ret; } +/** + * kexec_fdt_totalsize_ppc64 - Return the estimated size needed to setup FDT + * for kexec/kdump kernel. + * @image: kexec image being loaded. + * + * Returns the estimated size needed for kexec/kdump kernel FDT. + */ +unsigned int kexec_fdt_totalsize_ppc64(struct kimage *image) +{ + unsigned int fdt_size; + uint64_t usm_entries; + + /* +* The below estimate more than accounts for a typical kexec case where +* the additional space is to accommodate things like kexec cmdline, +* chosen node with properties for initrd start & end addresses and +* a property to indicate kexec boot.. +*/ + fdt_size = fdt_totalsize(initial_boot_params) + (2 * COMMAND_LINE_SIZE); + if (image->type != KEXEC_TYPE_CRASH) + return fdt_size; + + /* +* For kdump kernel, also account for linux,usable-memory and +* linux,drconf-usable-memory properties. Get an approximate on the +* number of usable memory entries and use for FDT size estimation. 
+*/ + usm_entries = ((memblock_end_of_DRAM() / drmem_lmb_size()) + + (2 * (resource_size(&crashk_res) / drmem_lmb_size()))); + fdt_size += (unsigned int)(usm_entries * sizeof(uint64_t)); + return fdt_size; +} + /** * setup_new_fdt_ppc64 - Update the flattened device-tree of the kernel * being loaded.
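To get a feel for the estimate, plug in illustrative numbers (1 TiB of RAM, 256 MiB LMBs and a 4 GiB crashkernel reservation are assumptions, not values from the patch):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
            uint64_t dram_end = 1ULL << 40;    /* stand-in for memblock_end_of_DRAM() */
            uint64_t lmb_size = 256ULL << 20;  /* stand-in for drmem_lmb_size() */
            uint64_t crash_sz = 4ULL << 30;    /* stand-in for resource_size(&crashk_res) */
            uint64_t usm_entries = (dram_end / lmb_size) + 2 * (crash_sz / lmb_size);

            printf("usm_entries=%llu extra_fdt_bytes=%llu\n",
                   (unsigned long long)usm_entries,
                   (unsigned long long)(usm_entries * sizeof(uint64_t)));
            return 0;  /* 4096 + 32 = 4128 entries, ~32 KiB added to the estimate */
    }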
Re: [PATCH] powerpc64/idle: Fix SP offsets when saving GPRs
Excerpts from Michael Ellerman's message of February 4, 2021 9:13 pm: > Nicholas Piggin writes: >> Excerpts from Christopher M. Riedl's message of February 4, 2021 4:59 pm: >>> On Sat Jan 30, 2021 at 7:44 AM CST, Nicholas Piggin wrote: Excerpts from Michael Ellerman's message of January 30, 2021 9:32 pm: > "Christopher M. Riedl" writes: >> The idle entry/exit code saves/restores GPRs in the stack "red zone" >> (Protected Zone according to PowerPC64 ELF ABI v2). However, the offset >> used for the first GPR is incorrect and overwrites the back chain - the >> Protected Zone actually starts below the current SP. In practice this is >> probably not an issue, but it's still incorrect so fix it. > > Nice catch. > > Corrupting the back chain means you can't backtrace from there, which > could be confusing for debugging one day. Yeah, we seem to have got away without noticing because the CPU will wake up and return out of here before it tries to unwind the stack, but if you tried to walk it by hand if the CPU got stuck in idle or something, then we'd get confused. > It does make me wonder why we don't just create a stack frame and use > the normal macros? It would use a bit more stack space, but we shouldn't > be short of stack space when going idle. > > Nick, was there a particular reason for using the red zone? I don't recall a particular reason, I think a normal stack frame is probably a good idea. >>> >>> I'll send a version using STACKFRAMESIZE - I assume that's the "normal" >>> stack frame :) >>> >> >> I think STACKFRAMESIZE is STACK_FRAME_OVERHEAD + NVGPRs. LR and CR can >> be saved in the caller's frame so that should be okay. > > TBH I didn't know/had forgotten we had STACKFRAMESIZE. > > The safest is SWITCH_FRAME_SIZE, because it's calculated based on > pt_regs (which can change size): > > DEFINE(SWITCH_FRAME_SIZE, STACK_FRAME_OVERHEAD + sizeof(struct > pt_regs)); > > But the name is obviously terrible for your usage, and it will allocate > a bit more space than you need, because pt_regs has more than just the GPRs. I don't see why that's safer if we're not using pt_regs. > >>> I admit I am a bit confused when I saw the similar but much smaller >>> STACK_FRAME_OVERHEAD which is also used in _some_ cases to save/restore >>> a few registers. >> >> Yeah if you don't need to save all nvgprs you can use caller's frame >> plus a few bytes in the minimum frame as volatile storage. >> >> STACK_FRAME_OVERHEAD should be 32 on LE, but I think the problem is a >> lot of asm uses it and hasn't necessarily been audited to make sure it's >> not assuming it's bigger. > > Yeah it's a total mess. See this ~3.5 year old issue :/ > > https://github.com/linuxppc/issues/issues/113 > >> You could actually use STACK_FRAME_MIN_SIZE for new code, maybe we add >> a STACK_FRAME_MIN_NVGPR_SIZE to match and use that. > > Something like that makes me nervous because it's so easy to > accidentally use one of the macros that expects a full pt_regs. > > I think ideally you have just two options. > > Option 1: > > You use: > STACK_FRAME_WITH_PT_REGS = STACK_FRAME_MIN_SIZE + sizeof(struct pt_regs) > > And then you can use all the macros in asm-offsets.c generated with > STACK_PT_REGS_OFFSET. I don't see a good reason to use pt_regs here, but in general sure this would be good to have and begin using. 
> Option 2: > > If you want to be fancy you manually allocate your frame with just > the right amount of space, but with the size there in the code, so for > example the idle code that wants 19 GPRs would do: > > stdu r1, -(STACK_FRAME_MIN_SIZE + 8 * 19)(r1) > > And then it's very obvious that you have a non-standard frame size and > need to be more careful. I would be happy with this for the idle code. Thanks, Nick
Re: [PATCH] powerpc64/idle: Fix SP offsets when saving GPRs
Nicholas Piggin writes: > Excerpts from Christopher M. Riedl's message of February 4, 2021 4:59 pm: >> On Sat Jan 30, 2021 at 7:44 AM CST, Nicholas Piggin wrote: >>> Excerpts from Michael Ellerman's message of January 30, 2021 9:32 pm: >>> > "Christopher M. Riedl" writes: >>> >> The idle entry/exit code saves/restores GPRs in the stack "red zone" >>> >> (Protected Zone according to PowerPC64 ELF ABI v2). However, the offset >>> >> used for the first GPR is incorrect and overwrites the back chain - the >>> >> Protected Zone actually starts below the current SP. In practice this is >>> >> probably not an issue, but it's still incorrect so fix it. >>> > >>> > Nice catch. >>> > >>> > Corrupting the back chain means you can't backtrace from there, which >>> > could be confusing for debugging one day. >>> >>> Yeah, we seem to have got away without noticing because the CPU will >>> wake up and return out of here before it tries to unwind the stack, >>> but if you tried to walk it by hand if the CPU got stuck in idle or >>> something, then we'd get confused. >>> >>> > It does make me wonder why we don't just create a stack frame and use >>> > the normal macros? It would use a bit more stack space, but we shouldn't >>> > be short of stack space when going idle. >>> > >>> > Nick, was there a particular reason for using the red zone? >>> >>> I don't recall a particular reason, I think a normal stack frame is >>> probably a good idea. >> >> I'll send a version using STACKFRAMESIZE - I assume that's the "normal" >> stack frame :) >> > > I think STACKFRAMESIZE is STACK_FRAME_OVERHEAD + NVGPRs. LR and CR can > be saved in the caller's frame so that should be okay. TBH I didn't know/had forgotten we had STACKFRAMESIZE. The safest is SWITCH_FRAME_SIZE, because it's calculated based on pt_regs (which can change size): DEFINE(SWITCH_FRAME_SIZE, STACK_FRAME_OVERHEAD + sizeof(struct pt_regs)); But the name is obviously terrible for your usage, and it will allocate a bit more space than you need, because pt_regs has more than just the GPRs. >> I admit I am a bit confused when I saw the similar but much smaller >> STACK_FRAME_OVERHEAD which is also used in _some_ cases to save/restore >> a few registers. > > Yeah if you don't need to save all nvgprs you can use caller's frame > plus a few bytes in the minimum frame as volatile storage. > > STACK_FRAME_OVERHEAD should be 32 on LE, but I think the problem is a > lot of asm uses it and hasn't necessarily been audited to make sure it's > not assuming it's bigger. Yeah it's a total mess. See this ~3.5 year old issue :/ https://github.com/linuxppc/issues/issues/113 > You could actually use STACK_FRAME_MIN_SIZE for new code, maybe we add > a STACK_FRAME_MIN_NVGPR_SIZE to match and use that. Something like that makes me nervous because it's so easy to accidentally use one of the macros that expects a full pt_regs. I think ideally you have just two options. Option 1: You use: STACK_FRAME_WITH_PT_REGS = STACK_FRAME_MIN_SIZE + sizeof(struct pt_regs) And then you can use all the macros in asm-offsets.c generated with STACK_PT_REGS_OFFSET. Option 2: If you want to be fancy you manually allocate your frame with just the right amount of space, but with the size there in the code, so for example the idle code that wants 19 GPRs would do: stdu r1, -(STACK_FRAME_MIN_SIZE + 8 * 19)(r1) And then it's very obvious that you have a non-standard frame size and need to be more careful. cheers
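A standalone sketch of the two options (STACK_FRAME_MIN_SIZE assumed to be the 32-byte ELFv2 minimal frame, pt_regs stubbed; both are assumptions for illustration):

    #include <stdio.h>

    #define STACK_FRAME_MIN_SIZE 32  /* assumed ELFv2 minimal frame */
    struct pt_regs { unsigned long gpr[32], nip, msr, ctr, link, xer, ccr; };  /* stub */

    /* Option 1: full pt_regs frame, usable with the STACK_PT_REGS_OFFSET macros */
    #define STACK_FRAME_WITH_PT_REGS (STACK_FRAME_MIN_SIZE + sizeof(struct pt_regs))

    /* Option 2: hand-sized frame for exactly the 19 GPRs the idle code saves */
    #define IDLE_FRAME_SIZE (STACK_FRAME_MIN_SIZE + 8 * 19)

    int main(void)
    {
            printf("pt_regs frame: %zu bytes, hand-sized idle frame: %d bytes\n",
                   STACK_FRAME_WITH_PT_REGS, IDLE_FRAME_SIZE);
            return 0;
    }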
[PATCH 18/20] crypto: nx: nx_debugfs: Header comments should not be kernel-doc
Fixes the following W=1 kernel build warning(s): drivers/crypto/nx/nx_debugfs.c:34: warning: Function parameter or member 'drv' not described in 'nx_debugfs_init' drivers/crypto/nx/nx_debugfs.c:34: warning: expecting prototype for Nest Accelerators driver(). Prototype was for nx_debugfs_init() instead Cc: "Breno Leitão" Cc: Nayna Jain Cc: Paulo Flabiano Smorigo Cc: Michael Ellerman Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Herbert Xu Cc: "David S. Miller" Cc: Kent Yoder Cc: linux-cry...@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Signed-off-by: Lee Jones --- drivers/crypto/nx/nx_debugfs.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/crypto/nx/nx_debugfs.c b/drivers/crypto/nx/nx_debugfs.c index 1975bcbee9974..ee7cd88bb10a7 100644 --- a/drivers/crypto/nx/nx_debugfs.c +++ b/drivers/crypto/nx/nx_debugfs.c @@ -1,5 +1,5 @@ // SPDX-License-Identifier: GPL-2.0-only -/** +/* * debugfs routines supporting the Power 7+ Nest Accelerators driver * * Copyright (C) 2011-2012 International Business Machines Inc. -- 2.25.1
[PATCH 16/20] crypto: vmx: Source headers are not good kernel-doc candidates
Fixes the following W=1 kernel build warning(s): drivers/crypto/vmx/vmx.c:23: warning: expecting prototype for Routines supporting VMX instructions on the Power 8(). Prototype was for p8_init() instead Cc: "Breno Leitão" Cc: Nayna Jain Cc: Paulo Flabiano Smorigo Cc: Michael Ellerman Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Herbert Xu Cc: "David S. Miller" Cc: Henrique Cerri Cc: linux-cry...@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Signed-off-by: Lee Jones --- drivers/crypto/vmx/vmx.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/crypto/vmx/vmx.c b/drivers/crypto/vmx/vmx.c index a40d08e75fc0b..7eb713cc87c8c 100644 --- a/drivers/crypto/vmx/vmx.c +++ b/drivers/crypto/vmx/vmx.c @@ -1,5 +1,5 @@ // SPDX-License-Identifier: GPL-2.0-only -/** +/* * Routines supporting VMX instructions on the Power 8 * * Copyright (C) 2015 International Business Machines Inc. -- 2.25.1
[PATCH 17/20] crypto: nx: nx-aes-cbc: Headers comments should not be kernel-doc
Fixes the following W=1 kernel build warning(s): drivers/crypto/nx/nx-aes-cbc.c:24: warning: Function parameter or member 'tfm' not described in 'cbc_aes_nx_set_key' drivers/crypto/nx/nx-aes-cbc.c:24: warning: Function parameter or member 'in_key' not described in 'cbc_aes_nx_set_key' drivers/crypto/nx/nx-aes-cbc.c:24: warning: Function parameter or member 'key_len' not described in 'cbc_aes_nx_set_key' drivers/crypto/nx/nx-aes-cbc.c:24: warning: expecting prototype for Nest Accelerators driver(). Prototype was for cbc_aes_nx_set_key() instead Cc: "Breno Leitão" Cc: Nayna Jain Cc: Paulo Flabiano Smorigo Cc: Michael Ellerman Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Herbert Xu Cc: "David S. Miller" Cc: Kent Yoder Cc: linux-cry...@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Signed-off-by: Lee Jones --- drivers/crypto/nx/nx-aes-cbc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/crypto/nx/nx-aes-cbc.c b/drivers/crypto/nx/nx-aes-cbc.c index 92e921eceed75..d6314ea9ae896 100644 --- a/drivers/crypto/nx/nx-aes-cbc.c +++ b/drivers/crypto/nx/nx-aes-cbc.c @@ -1,5 +1,5 @@ // SPDX-License-Identifier: GPL-2.0-only -/** +/* * AES CBC routines supporting the Power 7+ Nest Accelerators driver * * Copyright (C) 2011-2012 International Business Machines Inc. -- 2.25.1
[PATCH 19/20] crypto: nx: Demote header comment and add description for 'nbytes'
Fixes the following W=1 kernel build warning(s): drivers/crypto/nx/nx.c:31: warning: Incorrect use of kernel-doc format: * nx_hcall_sync - make an H_COP_OP hcall for the passed in op structure drivers/crypto/nx/nx.c:43: warning: Function parameter or member 'nx_ctx' not described in 'nx_hcall_sync' drivers/crypto/nx/nx.c:43: warning: Function parameter or member 'op' not described in 'nx_hcall_sync' drivers/crypto/nx/nx.c:43: warning: Function parameter or member 'may_sleep' not described in 'nx_hcall_sync' drivers/crypto/nx/nx.c:43: warning: expecting prototype for Nest Accelerators driver(). Prototype was for nx_hcall_sync() instead drivers/crypto/nx/nx.c:209: warning: Function parameter or member 'nbytes' not described in 'trim_sg_list' Cc: "Breno Leitão" Cc: Nayna Jain Cc: Paulo Flabiano Smorigo Cc: Michael Ellerman Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Herbert Xu Cc: "David S. Miller" Cc: Kent Yoder Cc: linux-cry...@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Signed-off-by: Lee Jones --- drivers/crypto/nx/nx.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/crypto/nx/nx.c b/drivers/crypto/nx/nx.c index 0d2dc5be7f192..010be6793c9fc 100644 --- a/drivers/crypto/nx/nx.c +++ b/drivers/crypto/nx/nx.c @@ -1,5 +1,5 @@ // SPDX-License-Identifier: GPL-2.0-only -/** +/* * Routines supporting the Power 7+ Nest Accelerators driver * * Copyright (C) 2011-2012 International Business Machines Inc. @@ -200,7 +200,8 @@ struct nx_sg *nx_walk_and_build(struct nx_sg *nx_dst, * @sg: sg list head * @end: sg lisg end * @delta: is the amount we need to crop in order to bound the list. - * + * @nbytes: length of data in the scatterlists or data length - whichever + * is greater. */ static long int trim_sg_list(struct nx_sg *sg, struct nx_sg *end, -- 2.25.1
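The convention behind these four patches, sketched with the parameter names taken from the warnings above (the descriptions and the prototype are reconstructed for illustration, not copied from the driver):

    /* Plain comment: ignored by scripts/kernel-doc. Right for file headers. */

    /**
     * nx_hcall_sync() - make an H_COP_OP hcall for the passed in op structure
     * @nx_ctx:    NX crypto context (illustrative description)
     * @op:        COP operation to submit (illustrative description)
     * @may_sleep: whether the retry loop may sleep (illustrative description)
     *
     * A comment opened with the kernel-doc marker must sit directly above the
     * prototype and describe every parameter, or W=1 builds warn as above.
     */
    static int nx_hcall_sync(struct nx_crypto_ctx *nx_ctx, struct vio_pfo_op *op,
                             u32 may_sleep);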
[PATCH 00/20] Rid W=1 warnings in Crypto
This set is part of a larger effort attempting to clean-up W=1 kernel builds, which are currently overwhelmingly riddled with niggly little warnings. This is set 1 of 2 sets required to fully clean Crypto. Lee Jones (20): crypto: hisilicon: sec_drv: Supply missing description for 'sec_queue_empty()'s 'queue' param crypto: bcm: util: Repair a couple of documentation formatting issues crypto: chelsio: chcr_core: File headers are not good candidates for kernel-doc crypto: ux500: hash: hash_core: Fix worthy kernel-doc headers and remove others crypto: bcm: spu: Fix formatting and misspelling issues crypto: keembay: ocs-hcu: Fix incorrectly named functions/structs crypto: bcm: spu2: Fix a whole host of kernel-doc misdemeanours crypto: ux500: cryp: Demote some conformant non-kernel headers fix another crypto: ux500: cryp_irq: File headers are not good kernel-doc candidates crypto: chelsio: chcr_algo: Fix a couple of kernel-doc issues caused by doc-rot crypto: ux500: cryp_core: Fix formatting issue and add description for 'session_id' crypto: atmel-ecc: Struct headers need to start with keyword 'struct' crypto: bcm: cipher: Provide description for 'req' and fix formatting issues crypto: caam: caampkc: Provide the name of the function crypto: caam: caamalg_qi2: Supply a couple of 'fallback' related descriptions crypto: vmx: Source headers are not good kernel-doc candidates crypto: nx: nx-aes-cbc: Headers comments should not be kernel-doc crypto: nx: nx_debugfs: Header comments should not be kernel-doc crypto: nx: Demote header comment and add description for 'nbytes' crypto: cavium: nitrox_isr: Demote non-compliant kernel-doc headers drivers/crypto/atmel-ecc.c| 2 +- drivers/crypto/bcm/cipher.c | 7 ++-- drivers/crypto/bcm/spu.c | 16 - drivers/crypto/bcm/spu2.c | 43 +-- drivers/crypto/bcm/util.c | 4 +-- drivers/crypto/caam/caamalg_qi2.c | 2 ++ drivers/crypto/caam/caampkc.c | 3 +- drivers/crypto/cavium/nitrox/nitrox_isr.c | 4 +-- drivers/crypto/chelsio/chcr_algo.c| 8 ++--- drivers/crypto/chelsio/chcr_core.c| 2 +- drivers/crypto/hisilicon/sec/sec_drv.c| 1 + drivers/crypto/keembay/ocs-hcu.c | 6 ++-- drivers/crypto/nx/nx-aes-cbc.c| 2 +- drivers/crypto/nx/nx.c| 5 +-- drivers/crypto/nx/nx_debugfs.c| 2 +- drivers/crypto/ux500/cryp/cryp.c | 5 +-- drivers/crypto/ux500/cryp/cryp_core.c | 5 +-- drivers/crypto/ux500/cryp/cryp_irq.c | 2 +- drivers/crypto/ux500/hash/hash_core.c | 15 +++- drivers/crypto/vmx/vmx.c | 2 +- 20 files changed, 71 insertions(+), 65 deletions(-) Cc: Alexandre Belloni Cc: Andreas Westin Cc: Atul Gupta Cc: Aymen Sghaier Cc: Ayush Sawal Cc: Benjamin Herrenschmidt Cc: Berne Hebark Cc: "Breno Leitão" Cc: Daniele Alessandrelli Cc: "David S. Miller" Cc: Declan Murphy Cc: "Gustavo A. R. Silva" Cc: Harsh Jain Cc: Henrique Cerri Cc: Herbert Xu Cc: "Horia Geantă" Cc: Jitendra Lulla Cc: Joakim Bech Cc: Jonas Linde Cc: Jonathan Cameron Cc: Kent Yoder Cc: linux-arm-ker...@lists.infradead.org Cc: linux-cry...@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: Ludovic Desroches Cc: Manoj Malviya Cc: Michael Ellerman Cc: M R Gowda Cc: Nayna Jain Cc: Nicolas Ferre Cc: Niklas Hernaeus Cc: Paul Mackerras Cc: Paulo Flabiano Smorigo Cc: Rob Rice Cc: Rohit Maheshwari Cc: Shujuan Chen Cc: Takashi Iwai Cc: Tudor Ambarus Cc: Vinay Kumar Yadav Cc: Zaibo Xu -- 2.25.1
Re: [PATCH v2] powerpc/uprobes: Validation for prefixed instruction
On 2/4/21 4:17 PM, Ravi Bangoria wrote: Don't allow a Uprobe on the 2nd word of a prefixed instruction. As per ISA 3.1, a prefixed instruction must not cross a 64-byte boundary, so don't allow a Uprobe on such a prefixed instruction either. There are two ways the probed instruction is changed in mapped pages. First, when a Uprobe is activated, it searches for all the relevant pages and replaces the instruction in them. In this case, if we notice that the probe is on the 2nd word of a prefixed instruction, error out directly. Second, when a Uprobe is already active and the user maps a relevant page via mmap(), the instruction is replaced via the mmap() code path. But because the Uprobe is invalid, the entire mmap() operation cannot be stopped. In this case just print an error and continue. @mpe, arch_uprobe_analyze_insn() can return early if cpu_has_feature(CPU_FTR_ARCH_31) is not set. But that will miss a rare scenario of a user running a binary with prefixed instructions on p10 predecessors. Please let me know if I should add the cpu_has_feature(CPU_FTR_ARCH_31) check or not. - Ravi
[PATCH v2] powerpc/uprobes: Validation for prefixed instruction
Don't allow a Uprobe on the 2nd word of a prefixed instruction. As per ISA 3.1, a prefixed instruction must not cross a 64-byte boundary, so don't allow a Uprobe on such a prefixed instruction either. There are two ways the probed instruction is changed in mapped pages. First, when a Uprobe is activated, it searches for all the relevant pages and replaces the instruction in them. In this case, if we notice that the probe is on the 2nd word of a prefixed instruction, error out directly. Second, when a Uprobe is already active and the user maps a relevant page via mmap(), the instruction is replaced via the mmap() code path. But because the Uprobe is invalid, the entire mmap() operation cannot be stopped. In this case just print an error and continue. Signed-off-by: Ravi Bangoria --- v1: http://lore.kernel.org/r/20210119091234.76317-1-ravi.bango...@linux.ibm.com v1->v2: - Instead of introducing new arch hook from verify_opcode(), use existing hook arch_uprobe_analyze_insn(). - Add explicit check for prefixed instruction crossing 64-byte boundary. If probe is on such instruction, throw an error. arch/powerpc/kernel/uprobes.c | 66 ++- 1 file changed, 65 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/kernel/uprobes.c b/arch/powerpc/kernel/uprobes.c index e8a63713e655..485d19a2a31f 100644 --- a/arch/powerpc/kernel/uprobes.c +++ b/arch/powerpc/kernel/uprobes.c @@ -7,6 +7,7 @@ * Adapted from the x86 port by Ananth N Mavinakayanahalli */ #include +#include #include #include #include @@ -28,6 +29,69 @@ bool is_trap_insn(uprobe_opcode_t *insn) return (is_trap(*insn)); } +#ifdef CONFIG_PPC64 +static int get_instr(struct mm_struct *mm, unsigned long addr, u32 *instr) +{ + struct page *page; + struct vm_area_struct *vma; + void *kaddr; + unsigned int gup_flags = FOLL_FORCE | FOLL_SPLIT_PMD; + + if (get_user_pages_remote(mm, addr, 1, gup_flags, &page, &vma, NULL) <= 0) + return -EINVAL; + + kaddr = kmap_atomic(page); + *instr = *((u32 *)(kaddr + (addr & ~PAGE_MASK))); + kunmap_atomic(kaddr); + put_page(page); + return 0; +} + +static int validate_prefixed_instr(struct mm_struct *mm, unsigned long addr) +{ + struct ppc_inst inst; + u32 prefix, suffix; + + /* +* No need to check if addr is pointing to beginning of the +* page. Even if probe is on a suffix of page-unaligned +* prefixed instruction, hw will raise exception and kernel +* will send SIGBUS. +*/ + if (!(addr & ~PAGE_MASK)) + return 0; + + if (get_instr(mm, addr, &prefix) < 0) + return -EINVAL; + if (get_instr(mm, addr + 4, &suffix) < 0) + return -EINVAL; + + inst = ppc_inst_prefix(prefix, suffix); + if (ppc_inst_prefixed(inst) && (addr & 0x3F) == 0x3C) { + printk_ratelimited("Cannot register a uprobe on 64 byte " + "unaligned prefixed instruction\n"); + return -EINVAL; + } + + suffix = prefix; + if (get_instr(mm, addr - 4, &prefix) < 0) + return -EINVAL; + + inst = ppc_inst_prefix(prefix, suffix); + if (ppc_inst_prefixed(inst)) { + printk_ratelimited("Cannot register a uprobe on the second " + "word of prefixed instruction\n"); + return -EINVAL; + } + return 0; +} +#else +static int validate_prefixed_instr(struct mm_struct *mm, unsigned long addr) +{ + return 0; +} +#endif + /** * arch_uprobe_analyze_insn * @mm: the probed address space. @@ -41,7 +105,7 @@ int arch_uprobe_analyze_insn(struct arch_uprobe *auprobe, if (addr & 0x03) return -EINVAL; - return 0; + return validate_prefixed_instr(mm, addr); } /* -- 2.26.2
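A quick standalone check of the (addr & 0x3F) == 0x3C test above: a prefix at word offset 0x3C puts its suffix at offset 0x40, in the next 64-byte block (the address below is arbitrary):

    #include <stdio.h>

    int main(void)
    {
            unsigned long addr = 0x1000003cUL;  /* arbitrary probe address */

            /* prefix occupies [0x3c, 0x40), suffix [0x40, 0x44) */
            printf("offset 0x%lx -> %s\n", addr & 0x3f,
                   (addr & 0x3f) == 0x3c ? "prefixed insn would cross 64B" : "ok");
            return 0;
    }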
Re: [PATCH 01/13] powerpc/powernv: remove get_cxl_module
Christoph Hellwig writes: > The static inline get_cxl_module function is entirely unused since commit > 8bf6b91a5125a ("Revert "powerpc/powernv: Add support for the cxl kernel > api on the real phb"), so remove it. > > Signed-off-by: Christoph Hellwig > Reviewed-by: Andrew Donnellan > --- > arch/powerpc/platforms/powernv/pci-cxl.c | 22 -- > 1 file changed, 22 deletions(-) Acked-by: Michael Ellerman cheers
RE: [PATCH v2 1/3] powerpc: sstep: Fix load and update emulation
From: Segher Boessenkool > Sent: 03 February 2021 21:18 ... > Power9 does: > > Load with Update Instructions (RA = 0) > EA is placed into R0. Does that change the value of 0? Rather reminds me of some 1960s era systems that had the small integers at fixed (global) addresses. FORTRAN always passes by reference, pass 0 and the address of the global zero location was passed, the called function could change 0 to 1 for the entire computer! > Load with Update Instructions (RA = RT) > The storage operand addressed by EA is accessed. The displacement > field is added to the data returned by the load and placed into RT. Shame that isn't standard - could be used to optimise some code. David
Re: [PATCH v7 39/42] powerpc: move NMI entry/exit code into wrapper
Nicholas Piggin writes: > This moves the common NMI entry and exit code into the interrupt handler > wrappers. > > This changes the behaviour of soft-NMI (watchdog) and HMI interrupts, and > also MCE interrupts on 64e, by adding missing parts of the NMI entry to > them. > > Signed-off-by: Nicholas Piggin > --- > arch/powerpc/include/asm/interrupt.h | 28 ++ > arch/powerpc/kernel/mce.c| 11 - > arch/powerpc/kernel/traps.c | 35 +--- > arch/powerpc/kernel/watchdog.c | 10 > 4 files changed, 38 insertions(+), 46 deletions(-) This is unhappy when injecting SLB multi-hits: root@p86-2:~# echo PPC_SLB_MULTIHIT > /sys/kernel/debug/provoke-crash/DIRECT [ 312.496026][ T1344] kernel BUG at arch/powerpc/include/asm/interrupt.h:152! [ 312.496037][ T1344] Oops: Exception in kernel mode, sig: 5 [#1] [ 312.496045][ T1344] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries [ 312.496053][ T1344] Modules linked in: squashfs dm_multipath scsi_dh_rdac scsi_dh_alua pseries_rng rng_core vmx_crypto gf128mul crc32c_vpmsum ip_tables x_tables autofs4 [ 312.496085][ T1344] CPU: 19 PID: 1344 Comm: bash Not tainted 5.11.0-rc2-gcc-8.2.0-00123-g3fadced17474-dirty #638 [ 312.496096][ T1344] NIP: c0ef1618 LR: c0ef1600 CTR: [ 312.496104][ T1344] REGS: c0001eb4ba00 TRAP: 0700 Not tainted (5.11.0-rc2-gcc-8.2.0-00123-g3fadced17474-dirty) [ 312.496113][ T1344] MSR: 80021033 CR: 48428284 XER: [ 312.496132][ T1344] CFAR: c0ef28b8 IRQMASK: 3 [ 312.496132][ T1344] GPR00: c0ef15e4 c0001eb4bca0 c1a53900 0001 [ 312.496132][ T1344] GPR04: c17e7230 4000 3ffe 6d66 [ 312.496132][ T1344] GPR08: c0001ec6fe80 0001 ce72d380 1001 [ 312.496132][ T1344] GPR12: 0010 c0001ec6fe80 0100235ad1d0 00013c2fb738 [ 312.496132][ T1344] GPR16: 00013c210ae0 2200 0100235af740 [ 312.496132][ T1344] GPR20: 0001 00013c2a3ca0 c00033c50040 [ 312.496132][ T1344] GPR24: c000100fbd80 c1a7dc78 0001 [ 312.496132][ T1344] GPR28: c0001eb4bd60 0001 0001 [ 312.496229][ T1344] NIP [c0ef1618] machine_check_early+0x138/0x1f0 [ 312.496241][ T1344] LR [c0ef1600] machine_check_early+0x120/0x1f0 [ 312.496249][ T1344] Call Trace: [ 312.496254][ T1344] [c0001eb4bca0] [c0ef15e4] machine_check_early+0x104/0x1f0 (unreliable) [ 312.496267][ T1344] [c0001eb4bcf0] [c0008394] machine_check_early_common+0x134/0x1f0 [ 312.496279][ T1344] --- interrupt: 200 at lkdtm_PPC_SLB_MULTIHIT+0x128/0x138 [ 312.496290][ T1344] NIP: c09cfea8 LR: c09cfea0 CTR: [ 312.496298][ T1344] REGS: c0001eb4bd60 TRAP: 0200 Not tainted (5.11.0-rc2-gcc-8.2.0-00123-g3fadced17474-dirty) [ 312.496307][ T1344] MSR: 80209033 CR: 24428284 XER: [ 312.496326][ T1344] CFAR: 021c DAR: c0080388 DSISR: 0080 IRQMASK: 0 [ 312.496326][ T1344] GPR00: c09cfea0 c000100fbb80 c1a53900 c0080388 [ 312.496326][ T1344] GPR04: 00b0 c00401023440 8e01c53300c0 00bf50d9 [ 312.496326][ T1344] GPR08: 0fff 0021 001c c0080488 [ 312.496326][ T1344] GPR12: 48428222 c0001ec6fe80 0100235ad1d0 00013c2fb738 [ 312.496326][ T1344] GPR16: 00013c210ae0 2200 0100235af740 [ 312.496326][ T1344] GPR20: 0001 00013c2a3ca0 c00033c50040 [ 312.496326][ T1344] GPR24: c000100fbd80 0011 c1a2ffb8 [ 312.496326][ T1344] GPR28: 04b0 c00033c5 c1105298 c0080388 [ 312.496427][ T1344] NIP [c09cfea8] lkdtm_PPC_SLB_MULTIHIT+0x128/0x138 [ 312.496437][ T1344] LR [c09cfea0] lkdtm_PPC_SLB_MULTIHIT+0x120/0x138 [ 312.496446][ T1344] --- interrupt: 200 [ 312.496451][ T1344] [c000100fbbf0] [c09cb530] lkdtm_do_action+0x40/0x80 [ 312.496462][ T1344] [c000100fbc10] [c09cbdfc] direct_entry+0x16c/0x350 [ 312.496473][ T1344] [c000100fbcc0] [c07a7590] full_proxy_write+0x90/0xe0 [ 312.496484][ 
T1344] [c000100fbd10] [c046b090] vfs_write+0xf0/0x310 [ 312.496496][ T1344] [c000100fbd60] [c046b48c] ksys_write+0x7c/0x140 [ 312.496507][ T1344] [c000100fbdb0] [c0036340] system_call_exception+0x1a0/0x2e0 [ 312.496518][ T1344] [c000100fbe10] [c000d360] system_call_common+0xf0/0x27c [ 312.496528][ T1344]
Re: [PATCH] powerpc64/idle: Fix SP offsets when saving GPRs
Excerpts from Christopher M. Riedl's message of February 4, 2021 4:59 pm: > On Sat Jan 30, 2021 at 7:44 AM CST, Nicholas Piggin wrote: >> Excerpts from Michael Ellerman's message of January 30, 2021 9:32 pm: >> > "Christopher M. Riedl" writes: >> >> The idle entry/exit code saves/restores GPRs in the stack "red zone" >> >> (Protected Zone according to PowerPC64 ELF ABI v2). However, the offset >> >> used for the first GPR is incorrect and overwrites the back chain - the >> >> Protected Zone actually starts below the current SP. In practice this is >> >> probably not an issue, but it's still incorrect so fix it. >> > >> > Nice catch. >> > >> > Corrupting the back chain means you can't backtrace from there, which >> > could be confusing for debugging one day. >> >> Yeah, we seem to have got away without noticing because the CPU will >> wake up and return out of here before it tries to unwind the stack, >> but if you tried to walk it by hand if the CPU got stuck in idle or >> something, then we'd get confused. >> >> > It does make me wonder why we don't just create a stack frame and use >> > the normal macros? It would use a bit more stack space, but we shouldn't >> > be short of stack space when going idle. >> > >> > Nick, was there a particular reason for using the red zone? >> >> I don't recall a particular reason, I think a normal stack frame is >> probably a good idea. > > I'll send a version using STACKFRAMESIZE - I assume that's the "normal" > stack frame :) > I think STACKFRAMESIZE is STACK_FRAME_OVERHEAD + NVGPRs. LR and CR can be saved in the caller's frame so that should be okay. > I admit I am a bit confused when I saw the similar but much smaller > STACK_FRAME_OVERHEAD which is also used in _some_ cases to save/restore > a few registers. Yeah if you don't need to save all nvgprs you can use caller's frame plus a few bytes in the minimum frame as volatile storage. STACK_FRAME_OVERHEAD should be 32 on LE, but I think the problem is a lot of asm uses it and hasn't necessarily been audited to make sure it's not assuming it's bigger. You could actually use STACK_FRAME_MIN_SIZE for new code, maybe we add a STACK_FRAME_MIN_NVGPR_SIZE to match and use that. Thanks, Nick
Re: [PATCH v3 28/32] powerpc/64s: interrupt implement exit logic in C
Excerpts from Christophe Leroy's message of February 4, 2021 6:03 pm: > > > Le 04/02/2021 à 04:27, Nicholas Piggin a écrit : >> Excerpts from Christophe Leroy's message of February 4, 2021 2:25 am: >>> >>> >>> Le 25/02/2020 à 18:35, Nicholas Piggin a écrit : Implement the bulk of interrupt return logic in C. The asm return code must handle a few cases: restoring full GPRs, and emulating stack store. >>> >>> +notrace unsigned long interrupt_exit_kernel_prepare(struct pt_regs *regs, unsigned long msr) +{ + unsigned long *ti_flagsp = &current_thread_info()->flags; + unsigned long flags; + + if (IS_ENABLED(CONFIG_PPC_BOOK3S) && unlikely(!(regs->msr & MSR_RI))) + unrecoverable_exception(regs); + BUG_ON(regs->msr & MSR_PR); + BUG_ON(!FULL_REGS(regs)); + + local_irq_save(flags); + + if (regs->softe == IRQS_ENABLED) { + /* Returning to a kernel context with local irqs enabled. */ + WARN_ON_ONCE(!(regs->msr & MSR_EE)); +again: + if (IS_ENABLED(CONFIG_PREEMPT)) { + /* Return to preemptible kernel context */ + if (unlikely(*ti_flagsp & _TIF_NEED_RESCHED)) { + if (preempt_count() == 0) + preempt_schedule_irq(); + } + } + + trace_hardirqs_on(); + __hard_EE_RI_disable(); + if (unlikely(lazy_irq_pending())) { + __hard_RI_enable(); + irq_soft_mask_set(IRQS_ALL_DISABLED); + trace_hardirqs_off(); + local_paca->irq_happened |= PACA_IRQ_HARD_DIS; + /* + * Can't local_irq_enable in case we are in interrupt + * context. Must replay directly. + */ + replay_soft_interrupts(); + irq_soft_mask_set(flags); + /* Took an interrupt, may have more exit work to do. */ + goto again; + } + local_paca->irq_happened = 0; + irq_soft_mask_set(IRQS_ENABLED); + } else { + /* Returning to a kernel context with local irqs disabled. */ + trace_hardirqs_on(); + __hard_EE_RI_disable(); + if (regs->msr & MSR_EE) + local_paca->irq_happened &= ~PACA_IRQ_HARD_DIS; + } + + +#ifdef CONFIG_PPC_TRANSACTIONAL_MEM + local_paca->tm_scratch = regs->msr; +#endif + + /* + * We don't need to restore AMR on the way back to userspace for KUAP. + * The value of AMR only matters while we're in the kernel. + */ + kuap_restore_amr(regs); >>> >>> Is it correct to restore KUAP state here? Shouldn't we have it at lower >>> level in assembly? >>> >>> Isn't there a risk that someone manages to call >>> interrupt_exit_kernel_prepare() or the end of it in >>> one way or another, and get the previous KUAP state restored this way? >> >> I'm not sure if there's much more risk if it's here rather than the >> instruction being in another place in the code. >> >> There's a lot of user access around the kernel too if you want to find a >> gadget to unlock KUAP then I suppose there is a pretty large attack >> surface. > > My understanding is that user access scope is strictly limited, for instance > we enforce the > begin/end of user access to be in the same function, and we refrain from > calling any other function > inside the user access window. x86 even has 'objtool' to enforce it at build > time. So in theory > there is no way to get out of the function while user access is open. > > Here with the interrupt exit function it is free beer. You have a place where > you re-open user > access and return with a simple blr. So that's open bar. If someone manages > to just call the > interrupt exit function, then user access remains open. Hmm okay, maybe that's a good point. >>> Also, it looks a bit strange to have kuap_save_amr_and_lock() done at >>> lowest level in assembly, and >>> kuap_restore_amr() done in upper level. That looks unbalanced.
>> >> I'd like to bring the entry assembly into C. >> > > I really think it's not a good idea. We'll get better control and readability > by keeping it at the > lowest possible level in assembly. > > x86 only saves and restores SMAP state in assembly. Okay we could go the other way and move the unlock into asm then. Thanks, Nick
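For reference, the confined-window convention described above looks like this in practice; a minimal sketch using the generic user_access_begin()/unsafe_get_user() API, with the window opened and closed in one function and no calls in between:

    static int read_user_word(const u32 __user *uptr, u32 *val)
    {
            if (!user_access_begin(uptr, sizeof(*uptr)))
                    return -EFAULT;
            unsafe_get_user(*val, uptr, fault);  /* no function calls in the window */
            user_access_end();
            return 0;

    fault:
            user_access_end();
            return -EFAULT;
    }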
Re: [PATCH RFC v1 5/6] xen-swiotlb: convert variables to arrays
So one thing that has been on my mind for a while: I'd really like to kill the separate dma ops in Xen swiotlb. If we compare xen-swiotlb to swiotlb the main difference seems to be: - additional reasons to bounce I/O vs the plain DMA capable - the possibility to do a hypercall on arm/arm64 - an extra translation layer before doing the phys_to_dma and vice versa - a special memory allocator I wonder if, between a few jump labels or other no-overhead enablement options and possibly better use of the dma_range_map, we could kill off most of swiotlb-xen instead of maintaining all this code duplication?
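The "few jump labels" idea could look roughly like this (a sketch only; xen_phys_to_dma() is a stand-in name for whatever the Xen translation helper would be): native boots stay branch-free and the Xen translation folds in behind a static key instead of a separate dma_ops table.

    DEFINE_STATIC_KEY_FALSE(swiotlb_xen_mode);

    static inline dma_addr_t swiotlb_addr_to_dma(struct device *dev, phys_addr_t paddr)
    {
            if (static_branch_unlikely(&swiotlb_xen_mode))
                    return xen_phys_to_dma(dev, paddr);  /* extra translation layer */
            return phys_to_dma(dev, paddr);
    }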
Re: [PATCH v4 1/2] powerpc: sstep: Fix load-store and update emulation
On 2021/02/04 01:37PM, Sandipan Das wrote: > The Power ISA says that the fixed-point load and update > instructions must neither use R0 for the base address (RA) > nor have the destination (RT) and the base address (RA) as > the same register. Similarly, for fixed-point stores and > floating-point loads and stores, the instruction is invalid > when R0 is used as the base address (RA). > > This is applicable to the following instructions. > * Load Byte and Zero with Update (lbzu) > * Load Byte and Zero with Update Indexed (lbzux) > * Load Halfword and Zero with Update (lhzu) > * Load Halfword and Zero with Update Indexed (lhzux) > * Load Halfword Algebraic with Update (lhau) > * Load Halfword Algebraic with Update Indexed (lhaux) > * Load Word and Zero with Update (lwzu) > * Load Word and Zero with Update Indexed (lwzux) > * Load Word Algebraic with Update Indexed (lwaux) > * Load Doubleword with Update (ldu) > * Load Doubleword with Update Indexed (ldux) > * Load Floating Single with Update (lfsu) > * Load Floating Single with Update Indexed (lfsux) > * Load Floating Double with Update (lfdu) > * Load Floating Double with Update Indexed (lfdux) > * Store Byte with Update (stbu) > * Store Byte with Update Indexed (stbux) > * Store Halfword with Update (sthu) > * Store Halfword with Update Indexed (sthux) > * Store Word with Update (stwu) > * Store Word with Update Indexed (stwux) > * Store Doubleword with Update (stdu) > * Store Doubleword with Update Indexed (stdux) > * Store Floating Single with Update (stfsu) > * Store Floating Single with Update Indexed (stfsux) > * Store Floating Double with Update (stfdu) > * Store Floating Double with Update Indexed (stfdux) > > E.g. the following behaviour is observed for an invalid > load and update instruction having RA = RT. > > While a userspace program having an instruction word like > 0xe9ce0001, i.e. ldu r14, 0(r14), runs without > receiving a SIGILL on a Power system (observed on P8 and > P9), the outcome of executing that instruction word varies > and its behaviour can be considered to be undefined. > > Attaching an uprobe at that instruction's address results > in emulation which currently performs the load as well as > writes the effective address back to the base register. > This might not match the outcome from hardware. > > To remove any inconsistencies, this adds additional checks > for the aforementioned instructions to make sure that the > emulation infrastructure treats them as unknown. The kernel > can then fall back to executing such instructions on hardware. > > Fixes: 0016a4cf5582 ("powerpc: Emulate most Book I instructions in > emulate_step()") > Signed-off-by: Sandipan Das > --- > Previous versions can be found at: > v3: > https://lore.kernel.org/linuxppc-dev/20210204071432.116439-1-sandi...@linux.ibm.com/ > v2: > https://lore.kernel.org/linuxppc-dev/20210203063841.431063-1-sandi...@linux.ibm.com/ > v1: > https://lore.kernel.org/linuxppc-dev/20201119054139.244083-1-sandi...@linux.ibm.com/ > > Changes in v4: > - Fixed grammar and switch-case alignment. > > Changes in v3: > - Consolidated the checks as suggested by Naveen. > - Consolidated load/store changes into a single patch. > - Included floating-point load/store and update instructions. > > Changes in v2: > - Jump to unknown_opcode instead of returning -1 for invalid > instruction forms. > > --- > arch/powerpc/lib/sstep.c | 14 ++ > 1 file changed, 14 insertions(+) For the series: Reviewed-by: Naveen N. Rao - Naveen
Re: [PATCH v2 1/3] powerpc: sstep: Fix load and update emulation
On 2021/02/03 03:17PM, Segher Boessenkool wrote: > On Wed, Feb 03, 2021 at 03:19:09PM +0530, Naveen N. Rao wrote: > > On 2021/02/03 12:08PM, Sandipan Das wrote: > > > The Power ISA says that the fixed-point load and update > > > instructions must neither use R0 for the base address (RA) > > > nor have the destination (RT) and the base address (RA) as > > > the same register. In these cases, the instruction is > > > invalid. > > > > However, the following behaviour is observed using some > > > invalid opcodes where RA = RT. > > > > > > A userspace program using an invalid instruction word like > > > 0xe9ce0001, i.e. "ldu r14, 0(r14)", runs and exits without > > > getting terminated abruptly. The instruction performs the > > > load operation but does not write the effective address to > > > the base address register. > > > > While the processor (p8 in my test) doesn't seem to be throwing an > > exception, I don't think it is necessarily loading the value. Qemu > > throws an exception though. It's probably best to term the behavior as > > being undefined. > > Power8 does: > > Load with Update Instructions (RA = 0) > EA is placed into R0. > Load with Update Instructions (RA = RT) > EA is placed into RT. The storage operand addressed by EA is > accessed, but the data returned by the load is discarded. I'm actually not seeing that. This is what I am testing with: li 8,0xaaa mr 6,1 std 8,64(6) #ldu 6,64(6) .long 0xe8c60041 And, r6 always ends up with 0xaea. It changes with the value I put into r6 though. Granted, this is all up in the air, but it does look like there is more going on and the value isn't the EA or the value at the address. > > Power9 does: > > Load with Update Instructions (RA = 0) > EA is placed into R0. > Load with Update Instructions (RA = RT) > The storage operand addressed by EA is accessed. The displacement > field is added to the data returned by the load and placed into RT. > > Both UMs also say > > Invalid Forms > In general, the POWER9 core handles invalid forms of instructions in > the manner that is most convenient for the particular case (within > the scope of meeting the boundedly-undefined definition described in > the Power ISA). This document specifies the behavior for these > cases. However, it is not recommended that software or other system > facilities make use of the POWER9 behavior in these cases because > such behavior might be different in another processor that > implements the Power ISA. > > (or POWER8 instead of POWER9 of course). Always complaining about most > invalid forms seems wise, certainly if not all recent CPUs behave the > same :-) Agreed. - Naveen
Re: [PATCH v2 0/5] shoot lazy tlbs
I'll ask Andrew to put this in -mm if no objections. The series now doesn't touch other archs in non-trivial ways, and core code is functionally not changed much / at all if the option is not selected so it's actually pretty simple aside from the powerpc change. Thanks, Nick Excerpts from Nicholas Piggin's message of December 14, 2020 4:53 pm: > This is another rebase, on top of mainline now (don't need the > asm-generic tree), and without any x86 or membarrier changes. > This makes the series far smaller and more manageable and > without the controversial bits. > > Thanks, > Nick > > Nicholas Piggin (5): > lazy tlb: introduce lazy mm refcount helper functions > lazy tlb: allow lazy tlb mm switching to be configurable > lazy tlb: shoot lazies, a non-refcounting lazy tlb option > powerpc: use lazy mm refcount helper functions > powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN > > arch/Kconfig | 30 ++ > arch/arm/mach-rpc/ecard.c| 2 +- > arch/powerpc/Kconfig | 1 + > arch/powerpc/kernel/smp.c| 2 +- > arch/powerpc/mm/book3s64/radix_tlb.c | 4 +- > fs/exec.c| 4 +- > include/linux/sched/mm.h | 20 +++ > kernel/cpu.c | 2 +- > kernel/exit.c| 2 +- > kernel/fork.c| 52 > kernel/kthread.c | 11 ++-- > kernel/sched/core.c | 88 > kernel/sched/sched.h | 4 +- > 13 files changed, 184 insertions(+), 38 deletions(-) > > -- > 2.23.0 > >
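The helper indirection in patch 1 is what makes the rest opt-in; roughly (a sketch only, the real helpers and config names may differ):

    static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
    {
            if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))  /* config name assumed */
                    mmgrab(mm);
    }

    static inline void mmdrop_lazy_tlb(struct mm_struct *mm)
    {
            if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
                    mmdrop(mm);
            /* under MMU_LAZY_TLB_SHOOTDOWN, lazy references are instead
               shot down (by IPI) before the mm is freed */
    }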
[PATCH v4 2/2] powerpc: sstep: Fix darn emulation
Commit 8813ff49607e ("powerpc/sstep: Check instruction validity against ISA version before emulation") introduced a proper way to skip unknown instructions. This makes sure that the same is used for the darn instruction when the range selection bits have a reserved value. Fixes: a23987ef267a ("powerpc: sstep: Add support for darn instruction") Signed-off-by: Sandipan Das --- arch/powerpc/lib/sstep.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c index 11f14b209d7f..683f7c20f74b 100644 --- a/arch/powerpc/lib/sstep.c +++ b/arch/powerpc/lib/sstep.c @@ -1916,7 +1916,7 @@ int analyse_instr(struct instruction_op *op, const struct pt_regs *regs, goto compute_done; } - return -1; + goto unknown_opcode; #ifdef __powerpc64__ case 777: /* modsd */ if (!cpu_has_feature(CPU_FTR_ARCH_300)) -- 2.25.1
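For context, darn's range selection lives in the two low bits of the RA field (sstep.c switches on ra & 0x3); L = 3 is the reserved value that now takes the unknown_opcode path. A standalone sketch (the instruction word below is fabricated for illustration, not a valid darn encoding):

    #include <stdio.h>

    int main(void)
    {
            unsigned int word = 0x7c000000 | (3u << 16);  /* fabricated, L = 3 */
            unsigned int L = (word >> 16) & 0x3;

            printf("L=%u -> %s\n", L,
                   L == 3 ? "reserved: skip emulation" : "valid range selector");
            return 0;
    }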
[PATCH v4 1/2] powerpc: sstep: Fix load-store and update emulation
The Power ISA says that the fixed-point load and update instructions must neither use R0 for the base address (RA) nor have the destination (RT) and the base address (RA) as the same register. Similarly, for fixed-point stores and floating-point loads and stores, the instruction is invalid when R0 is used as the base address (RA). This is applicable to the following instructions. * Load Byte and Zero with Update (lbzu) * Load Byte and Zero with Update Indexed (lbzux) * Load Halfword and Zero with Update (lhzu) * Load Halfword and Zero with Update Indexed (lhzux) * Load Halfword Algebraic with Update (lhau) * Load Halfword Algebraic with Update Indexed (lhaux) * Load Word and Zero with Update (lwzu) * Load Word and Zero with Update Indexed (lwzux) * Load Word Algebraic with Update Indexed (lwaux) * Load Doubleword with Update (ldu) * Load Doubleword with Update Indexed (ldux) * Load Floating Single with Update (lfsu) * Load Floating Single with Update Indexed (lfsux) * Load Floating Double with Update (lfdu) * Load Floating Double with Update Indexed (lfdux) * Store Byte with Update (stbu) * Store Byte with Update Indexed (stbux) * Store Halfword with Update (sthu) * Store Halfword with Update Indexed (sthux) * Store Word with Update (stwu) * Store Word with Update Indexed (stwux) * Store Doubleword with Update (stdu) * Store Doubleword with Update Indexed (stdux) * Store Floating Single with Update (stfsu) * Store Floating Single with Update Indexed (stfsux) * Store Floating Double with Update (stfdu) * Store Floating Double with Update Indexed (stfdux) E.g. the following behaviour is observed for an invalid load and update instruction having RA = RT. While a userspace program having an instruction word like 0xe9ce0001, i.e. ldu r14, 0(r14), runs without receiving a SIGILL on a Power system (observed on P8 and P9), the outcome of executing that instruction word varies and its behaviour can be considered to be undefined. Attaching an uprobe at that instruction's address results in emulation which currently performs the load as well as writes the effective address back to the base register. This might not match the outcome from hardware. To remove any inconsistencies, this adds additional checks for the aforementioned instructions to make sure that the emulation infrastructure treats them as unknown. The kernel can then fall back to executing such instructions on hardware. Fixes: 0016a4cf5582 ("powerpc: Emulate most Book I instructions in emulate_step()") Signed-off-by: Sandipan Das --- Previous versions can be found at: v3: https://lore.kernel.org/linuxppc-dev/20210204071432.116439-1-sandi...@linux.ibm.com/ v2: https://lore.kernel.org/linuxppc-dev/20210203063841.431063-1-sandi...@linux.ibm.com/ v1: https://lore.kernel.org/linuxppc-dev/20201119054139.244083-1-sandi...@linux.ibm.com/ Changes in v4: - Fixed grammar and switch-case alignment. Changes in v3: - Consolidated the checks as suggested by Naveen. - Consolidated load/store changes into a single patch. - Included floating-point load/store and update instructions. Changes in v2: - Jump to unknown_opcode instead of returning -1 for invalid instruction forms.
--- arch/powerpc/lib/sstep.c | 14 ++ 1 file changed, 14 insertions(+) diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c index e96cff845ef7..11f14b209d7f 100644 --- a/arch/powerpc/lib/sstep.c +++ b/arch/powerpc/lib/sstep.c @@ -3017,6 +3017,20 @@ int analyse_instr(struct instruction_op *op, const struct pt_regs *regs, } + if (OP_IS_LOAD_STORE(op->type) && (op->type & UPDATE)) { + switch (GETTYPE(op->type)) { + case LOAD: + if (ra == rd) + goto unknown_opcode; + fallthrough; + case STORE: + case LOAD_FP: + case STORE_FP: + if (ra == 0) + goto unknown_opcode; + } + } + #ifdef CONFIG_VSX if ((GETTYPE(op->type) == LOAD_VSX || GETTYPE(op->type) == STORE_VSX) && -- 2.25.1
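Decoding the commit message's example word shows what the new check catches (standalone sketch, D-form field positions):

    #include <stdio.h>

    int main(void)
    {
            unsigned int insn = 0xe9ce0001;          /* ldu r14, 0(r14) */
            unsigned int rt = (insn >> 21) & 0x1f;   /* destination */
            unsigned int ra = (insn >> 16) & 0x1f;   /* base address */

            printf("opcode=%u rt=%u ra=%u -> %s\n", insn >> 26, rt, ra,
                   (ra == 0 || ra == rt) ? "invalid form, run on hardware"
                                         : "emulatable");
            return 0;  /* opcode=58 rt=14 ra=14: invalid form */
    }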
Re: [PATCH v3 28/32] powerpc/64s: interrupt implement exit logic in C
Le 04/02/2021 à 04:27, Nicholas Piggin a écrit : Excerpts from Christophe Leroy's message of February 4, 2021 2:25 am: Le 25/02/2020 à 18:35, Nicholas Piggin a écrit : Implement the bulk of interrupt return logic in C. The asm return code must handle a few cases: restoring full GPRs, and emulating stack store. +notrace unsigned long interrupt_exit_kernel_prepare(struct pt_regs *regs, unsigned long msr) +{ + unsigned long *ti_flagsp = &current_thread_info()->flags; + unsigned long flags; + + if (IS_ENABLED(CONFIG_PPC_BOOK3S) && unlikely(!(regs->msr & MSR_RI))) + unrecoverable_exception(regs); + BUG_ON(regs->msr & MSR_PR); + BUG_ON(!FULL_REGS(regs)); + + local_irq_save(flags); + + if (regs->softe == IRQS_ENABLED) { + /* Returning to a kernel context with local irqs enabled. */ + WARN_ON_ONCE(!(regs->msr & MSR_EE)); +again: + if (IS_ENABLED(CONFIG_PREEMPT)) { + /* Return to preemptible kernel context */ + if (unlikely(*ti_flagsp & _TIF_NEED_RESCHED)) { + if (preempt_count() == 0) + preempt_schedule_irq(); + } + } + + trace_hardirqs_on(); + __hard_EE_RI_disable(); + if (unlikely(lazy_irq_pending())) { + __hard_RI_enable(); + irq_soft_mask_set(IRQS_ALL_DISABLED); + trace_hardirqs_off(); + local_paca->irq_happened |= PACA_IRQ_HARD_DIS; + /* +* Can't local_irq_enable in case we are in interrupt +* context. Must replay directly. +*/ + replay_soft_interrupts(); + irq_soft_mask_set(flags); + /* Took an interrupt, may have more exit work to do. */ + goto again; + } + local_paca->irq_happened = 0; + irq_soft_mask_set(IRQS_ENABLED); + } else { + /* Returning to a kernel context with local irqs disabled. */ + trace_hardirqs_on(); + __hard_EE_RI_disable(); + if (regs->msr & MSR_EE) + local_paca->irq_happened &= ~PACA_IRQ_HARD_DIS; + } + + +#ifdef CONFIG_PPC_TRANSACTIONAL_MEM + local_paca->tm_scratch = regs->msr; +#endif + + /* +* We don't need to restore AMR on the way back to userspace for KUAP. +* The value of AMR only matters while we're in the kernel. +*/ + kuap_restore_amr(regs); Is it correct to restore KUAP state here? Shouldn't we have it at lower level in assembly? Isn't there a risk that someone manages to call interrupt_exit_kernel_prepare() or the end of it in one way or another, and get the previous KUAP state restored this way? I'm not sure if there's much more risk if it's here rather than the instruction being in another place in the code. There's a lot of user access around the kernel too if you want to find a gadget to unlock KUAP then I suppose there is a pretty large attack surface. My understanding is that user access scope is strictly limited, for instance we enforce the begin/end of user access to be in the same function, and we refrain from calling any other function inside the user access window. x86 even has 'objtool' to enforce it at build time. So in theory there is no way to get out of the function while user access is open. Here with the interrupt exit function it is free beer. You have a place where you re-open user access and return with a simple blr. So that's open bar. If someone manages to just call the interrupt exit function, then user access remains open. Also, it looks a bit strange to have kuap_save_amr_and_lock() done at lowest level in assembly, and kuap_restore_amr() done in upper level. That looks unbalanced. I'd like to bring the entry assembly into C. I really think it's not a good idea. We'll get better control and readability by keeping it at the lowest possible level in assembly. x86 only saves and restores SMAP state in assembly. Christophe