Re: [PATCH v4 4/4] PCI: layerscape: Add suspend/resume for ls1043a
On Thu, Nov 30, 2023 at 03:22:14PM -0500, Frank Li wrote: > On Thu, Nov 30, 2023 at 03:17:39PM -0500, Frank Li wrote: > > On Thu, Nov 30, 2023 at 10:21:00PM +0530, Manivannan Sadhasivam wrote: > > > On Wed, Nov 29, 2023 at 04:44:12PM -0500, Frank Li wrote: > > > > In the suspend path, PME_Turn_Off message is sent to the endpoint to > > > > transition the link to L2/L3_Ready state. In this SoC, there is no way > > > > to > > > > check if the controller has received the PME_To_Ack from the endpoint or > > > > not. So to be on the safer side, the driver just waits for > > > > PCIE_PME_TO_L2_TIMEOUT_US before asserting the SoC specific PMXMTTURNOFF > > > > bit to complete the PME_Turn_Off handshake. This link would then enter > > > > L2/L3 state depending on the VAUX supply. > > > > > > > > In the resume path, the link is brought back from L2 to L0 by doing a > > > > software reset. > > > > > > > > > > Same comment on the patch description as on patch 2/4. > > > > > > > Signed-off-by: Frank Li > > > > --- > > > > > > > > Notes: > > > > Change from v3 to v4 > > > > - Call scfg_pcie_send_turnoff_msg() shared with ls1021a > > > > - update commit message > > > > > > > > Change from v2 to v3 > > > > - Remove ls_pcie_lut_readl(writel) function > > > > > > > > Change from v1 to v2 > > > > - Update subject 'a' to 'A' > > > > > > > > drivers/pci/controller/dwc/pci-layerscape.c | 63 - > > > > 1 file changed, 62 insertions(+), 1 deletion(-) > > > > > > > > diff --git a/drivers/pci/controller/dwc/pci-layerscape.c > > > > b/drivers/pci/controller/dwc/pci-layerscape.c > > > > index 590e07bb27002..d39700b3afaaa 100644 > > > > --- a/drivers/pci/controller/dwc/pci-layerscape.c > > > > +++ b/drivers/pci/controller/dwc/pci-layerscape.c > > > > @@ -41,6 +41,15 @@ > > > > #define SCFG_PEXSFTRSTCR 0x190 > > > > #define PEXSR(idx) BIT(idx) > > > > > > > > +/* LS1043A PEX PME control register */ > > > > +#define SCFG_PEXPMECR 0x144 > > > > +#define PEXPME(idx)BIT(31 - (idx) * 4) > > > > + > > > > +/* LS1043A PEX LUT debug register */ > > > > +#define LS_PCIE_LDBG 0x7fc > > > > +#define LDBG_SRBIT(30) > > > > +#define LDBG_WEBIT(31) > > > > + > > > > #define PCIE_IATU_NUM 6 > > > > > > > > struct ls_pcie_drvdata { > > > > @@ -225,6 +234,45 @@ static int ls1021a_pcie_exit_from_l2(struct > > > > dw_pcie_rp *pp) > > > > return scfg_pcie_exit_from_l2(pcie->scfg, SCFG_PEXSFTRSTCR, > > > > PEXSR(pcie->index)); > > > > } > > > > > > > > +static void ls1043a_pcie_send_turnoff_msg(struct dw_pcie_rp *pp) > > > > +{ > > > > + struct dw_pcie *pci = to_dw_pcie_from_pp(pp); > > > > + struct ls_pcie *pcie = to_ls_pcie(pci); > > > > + > > > > + scfg_pcie_send_turnoff_msg(pcie->scfg, SCFG_PEXPMECR, > > > > PEXPME(pcie->index)); > > > > +} > > > > + > > > > +static int ls1043a_pcie_exit_from_l2(struct dw_pcie_rp *pp) > > > > +{ > > > > + struct dw_pcie *pci = to_dw_pcie_from_pp(pp); > > > > + struct ls_pcie *pcie = to_ls_pcie(pci); > > > > + u32 val; > > > > + > > > > + /* > > > > +* Only way let PEX module exit L2 is do a software reset. > > > > > > Can you expand PEX? What is it used for? > > > > > > Also if the reset is only for the PEX module, please use the same comment > > > in > > > both patches 2 and 4. Patch 2 doesn't mention PEX in the comment. > > > > After read spec again, I think PEX is pci express. So it should software > > reset controller. I don't know what exactly did in the chip. But without > > below code, PCIe can't exit L2/L3. > > > > Any harmful if dwc controller reset? Anyway these code works well with > > intel network card. > > Sorry, sent too quick. It is PCIe express wrapper > > Copy from spec: > > "PEXLDBG[SR]. Once set the > PEXLDBG[SR] will enable the soft reset to the PEX wrapper." > Okay. Please use the below comment in both patches 2 and 4: /* Reset the PEX wrapper to bring the link out of L2 */ - Mani > Frank > > > > > Frank > > > > > > > > - Mani > > > > > > > +* LDBG_WE: allows the user to have write access to the > > > > PEXDBG[SR] for both setting and > > > > +* clearing the soft reset on the PEX module. > > > > +* LDBG_SR: When SR is set to 1, the PEX module enters soft > > > > reset. > > > > +*/ > > > > + val = ls_pcie_pf_lut_readl(pcie, LS_PCIE_LDBG); > > > > + val |= LDBG_WE; > > > > + ls_pcie_pf_lut_writel(pcie, LS_PCIE_LDBG, val); > > > > + > > > > + val = ls_pcie_pf_lut_readl(pcie, LS_PCIE_LDBG); > > > > + val |= LDBG_SR; > > > > + ls_pcie_pf_lut_writel(pcie, LS_PCIE_LDBG, val); > > > > + > > > > + val = ls_pcie_pf_lut_readl(pcie, LS_PCIE_LDBG); > > > > + val &= ~LDBG_SR; > > > > + ls_pcie_pf_lut_writel(pcie, LS_PCIE_LDBG, val); > > > > + > > > > + val =
Re: [PATCH v2] powerpc/pseries/vas: Use usleep_range() to support HCALL delay
On 11/29/23 6:07 PM, Michael Ellerman wrote: Haren Myneni writes: VAS allocate, modify and deallocate HCALLs returns H_LONG_BUSY_ORDER_1_MSEC or H_LONG_BUSY_ORDER_10_MSEC for busy delay and expects OS to reissue HCALL after that delay. But using msleep() will often sleep at least 20 msecs even though the hypervisor expects to reissue these HCALLs after 1 or 10msecs. It might cause these HCALLs takes longer when multiple threads issue open or close VAS windows simultaneously. So instead of msleep(), use usleep_range() to ensure sleep with the expected value before issuing HCALL again. Signed-off-by: Haren Myneni Suggested-by: Nathan Lynch --- v1 -> v2: - Use usleep_range instead of using RTAS sleep routine as suggested by Nathan --- arch/powerpc/platforms/pseries/vas.c | 24 +++- 1 file changed, 23 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/platforms/pseries/vas.c b/arch/powerpc/platforms/pseries/vas.c index 71d52a670d95..bade4402741f 100644 --- a/arch/powerpc/platforms/pseries/vas.c +++ b/arch/powerpc/platforms/pseries/vas.c @@ -36,9 +36,31 @@ static bool migration_in_progress; static long hcall_return_busy_check(long rc) { + unsigned int ms; + /* Check if we are stalled for some time */ if (H_IS_LONG_BUSY(rc)) { - msleep(get_longbusy_msecs(rc)); + ms = get_longbusy_msecs(rc); + /* +* Allocate, Modify and Deallocate HCALLs returns +* H_LONG_BUSY_ORDER_1_MSEC or H_LONG_BUSY_ORDER_10_MSEC +* for the long delay. So the delay should always be 1 +* or 10msecs, but sleeps 1msec in case if the long +* delay is > H_LONG_BUSY_ORDER_10_MSEC. +*/ + if (ms > 10) + ms = 1; I don't understand this. The hypervisor asked you to sleep for more than 10 milliseconds, so instead you sleep for 1? I can understand that we don't want to usleep() for the longer durations that could be returned, but so shouldn't the code be using msleep() for those values? Sleeping for a very short duration definitely seems wrong. Allocate, modify and deallocate HCALLs return only 1MSECS and 10MSECS for long delay. we should not expect > 10MSECS for these HCALLs. Hence ms = 1 if ms > 10 But it is confusing. So will use ms = 10 for ms >= 10 as Nathan suggested. + /* +* msleep() will often sleep at least 20 msecs even +* though the hypervisor expects to reissue these That makes it sound like the hypervisor is reissuing the hcalls. Better would be "the hypervisor suggests the kernel should reissue the hcall after ...". +* HCALLs after 1 or 10msecs. So use usleep_range() +* to sleep with the expected value. +* +* See Documentation/timers/timers-howto.rst on using +* the value range in usleep_range(). +*/ + usleep_range(ms * 100, ms * 1000); If ms == 1, then that's 100 usecs, which is not 1 millisecond? Please use USEC_PER_MSEC. Using usleep_range() same way as mentioned in rtas_busy_delay(). Thanks Haren rc = H_BUSY; } else if (rc == H_BUSY) { cond_resched(); cheers
[powerpc:fixes] BUILD SUCCESS dc158d23b33df9033bcc8e7117e8591dd2f9d125
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git fixes branch HEAD: dc158d23b33df9033bcc8e7117e8591dd2f9d125 KVM: PPC: Book3S HV: Fix KVM_RUN clobbering FP/VEC user registers elapsed time: 1203m configs tested: 166 configs skipped: 90 The following configs have been built successfully. More configs may be tested in the coming days. tested configs: alpha allnoconfig gcc alpha defconfig gcc arc allnoconfig gcc arc defconfig gcc archsdk_defconfig gcc arc randconfig-001-20231201 gcc arc randconfig-002-20231201 gcc arc tb10x_defconfig gcc arcvdk_hs38_defconfig gcc arm allnoconfig gcc arm randconfig-001-20231201 gcc arm randconfig-002-20231201 gcc arm randconfig-003-20231201 gcc arm randconfig-004-20231201 gcc armspear6xx_defconfig gcc arm64allmodconfig clang arm64 allnoconfig gcc arm64 defconfig gcc arm64 randconfig-001-20231201 gcc arm64 randconfig-002-20231201 gcc arm64 randconfig-003-20231201 gcc arm64 randconfig-004-20231201 gcc csky allnoconfig gcc cskydefconfig gcc csky randconfig-001-20231201 gcc csky randconfig-002-20231201 gcc hexagon allmodconfig clang hexagon allyesconfig clang i386 allmodconfig clang i386 allnoconfig clang i386 allyesconfig clang i386 randconfig-011-20231201 clang i386 randconfig-012-20231201 clang i386 randconfig-013-20231201 clang i386 randconfig-014-20231201 clang i386 randconfig-015-20231201 clang i386 randconfig-016-20231201 clang loongarchallmodconfig gcc loongarch allnoconfig gcc loongarchallyesconfig gcc loongarch defconfig gcc loongarch loongson3_defconfig gcc loongarch randconfig-001-20231201 gcc loongarch randconfig-002-20231201 gcc m68k allmodconfig gcc m68k allnoconfig gcc m68k allyesconfig gcc m68kdefconfig gcc m68k hp300_defconfig gcc m68k virt_defconfig gcc microblaze allmodconfig gcc microblazeallnoconfig gcc microblaze allyesconfig gcc microblaze defconfig gcc mips allmodconfig gcc mips allyesconfig gcc mips db1xxx_defconfig gcc mips fuloong2e_defconfig gcc mips loongson1b_defconfig gcc nios2allmodconfig gcc nios2 allnoconfig gcc nios2allyesconfig gcc nios2 defconfig gcc nios2 randconfig-001-20231201 gcc nios2 randconfig-002-20231201 gcc openrisc allmodconfig gcc openrisc allnoconfig gcc openrisc allyesconfig gcc openriscdefconfig gcc parisc allmodconfig gcc pariscallnoconfig gcc parisc allyesconfig gcc parisc defconfig gcc pariscgeneric-64bit_defconfig gcc pariscrandconfig-001-20231201 gcc pariscrandconfig-002-20231201 gcc parisc64defconfig gcc powerpc allmodconfig clang powerpc allnoconfig gcc powerpc allyesconfig clang powerpc bamboo_defconfig gcc powerpc eiger_defconfig gcc powerpc rainier_defconfig gcc powerpc randconfig-001-20231201 gcc powerpc randconfig-002-20231201 gcc powerpc
Re: linux-next: build failure after merge of the mm tree
Andrew Morton writes: > On Fri, 01 Dec 2023 09:39:20 +1100 Michael Ellerman > wrote: > >> > I am still carrying this patch (it should probably go into the mm >> > tree). Is someone going to pick it up (assuming it is correct)? >> >> I applied it to my next a few days ago, but I must have forgotten to >> push. It's in there now. > > I'll keep a copy in mm.git, to keep the dependencies nice. I added > your acked-by. Sure thing. Thanks. cheers
Re: [PATCH v4 4/4] PCI: layerscape: Add suspend/resume for ls1043a
On Thu, Nov 30, 2023 at 03:17:39PM -0500, Frank Li wrote: > On Thu, Nov 30, 2023 at 10:21:00PM +0530, Manivannan Sadhasivam wrote: > > On Wed, Nov 29, 2023 at 04:44:12PM -0500, Frank Li wrote: > > > In the suspend path, PME_Turn_Off message is sent to the endpoint to > > > transition the link to L2/L3_Ready state. In this SoC, there is no way to > > > check if the controller has received the PME_To_Ack from the endpoint or > > > not. So to be on the safer side, the driver just waits for > > > PCIE_PME_TO_L2_TIMEOUT_US before asserting the SoC specific PMXMTTURNOFF > > > bit to complete the PME_Turn_Off handshake. This link would then enter > > > L2/L3 state depending on the VAUX supply. > > > > > > In the resume path, the link is brought back from L2 to L0 by doing a > > > software reset. > > > > > > > Same comment on the patch description as on patch 2/4. > > > > > Signed-off-by: Frank Li > > > --- > > > > > > Notes: > > > Change from v3 to v4 > > > - Call scfg_pcie_send_turnoff_msg() shared with ls1021a > > > - update commit message > > > > > > Change from v2 to v3 > > > - Remove ls_pcie_lut_readl(writel) function > > > > > > Change from v1 to v2 > > > - Update subject 'a' to 'A' > > > > > > drivers/pci/controller/dwc/pci-layerscape.c | 63 - > > > 1 file changed, 62 insertions(+), 1 deletion(-) > > > > > > diff --git a/drivers/pci/controller/dwc/pci-layerscape.c > > > b/drivers/pci/controller/dwc/pci-layerscape.c > > > index 590e07bb27002..d39700b3afaaa 100644 > > > --- a/drivers/pci/controller/dwc/pci-layerscape.c > > > +++ b/drivers/pci/controller/dwc/pci-layerscape.c > > > @@ -41,6 +41,15 @@ > > > #define SCFG_PEXSFTRSTCR 0x190 > > > #define PEXSR(idx) BIT(idx) > > > > > > +/* LS1043A PEX PME control register */ > > > +#define SCFG_PEXPMECR0x144 > > > +#define PEXPME(idx) BIT(31 - (idx) * 4) > > > + > > > +/* LS1043A PEX LUT debug register */ > > > +#define LS_PCIE_LDBG 0x7fc > > > +#define LDBG_SR BIT(30) > > > +#define LDBG_WE BIT(31) > > > + > > > #define PCIE_IATU_NUM6 > > > > > > struct ls_pcie_drvdata { > > > @@ -225,6 +234,45 @@ static int ls1021a_pcie_exit_from_l2(struct > > > dw_pcie_rp *pp) > > > return scfg_pcie_exit_from_l2(pcie->scfg, SCFG_PEXSFTRSTCR, > > > PEXSR(pcie->index)); > > > } > > > > > > +static void ls1043a_pcie_send_turnoff_msg(struct dw_pcie_rp *pp) > > > +{ > > > + struct dw_pcie *pci = to_dw_pcie_from_pp(pp); > > > + struct ls_pcie *pcie = to_ls_pcie(pci); > > > + > > > + scfg_pcie_send_turnoff_msg(pcie->scfg, SCFG_PEXPMECR, > > > PEXPME(pcie->index)); > > > +} > > > + > > > +static int ls1043a_pcie_exit_from_l2(struct dw_pcie_rp *pp) > > > +{ > > > + struct dw_pcie *pci = to_dw_pcie_from_pp(pp); > > > + struct ls_pcie *pcie = to_ls_pcie(pci); > > > + u32 val; > > > + > > > + /* > > > + * Only way let PEX module exit L2 is do a software reset. > > > > Can you expand PEX? What is it used for? > > > > Also if the reset is only for the PEX module, please use the same comment in > > both patches 2 and 4. Patch 2 doesn't mention PEX in the comment. > > After read spec again, I think PEX is pci express. So it should software > reset controller. I don't know what exactly did in the chip. But without > below code, PCIe can't exit L2/L3. > > Any harmful if dwc controller reset? Anyway these code works well with > intel network card. If it is a DWC controller reset, then we need to program all CSRs like DBI etc... But from your reply it seems like the reset is limited to some module, so it is fine. - Mani > > Frank > > > > > - Mani > > > > > + * LDBG_WE: allows the user to have write access to the PEXDBG[SR] for > > > both setting and > > > + * clearing the soft reset on the PEX module. > > > + * LDBG_SR: When SR is set to 1, the PEX module enters soft reset. > > > + */ > > > + val = ls_pcie_pf_lut_readl(pcie, LS_PCIE_LDBG); > > > + val |= LDBG_WE; > > > + ls_pcie_pf_lut_writel(pcie, LS_PCIE_LDBG, val); > > > + > > > + val = ls_pcie_pf_lut_readl(pcie, LS_PCIE_LDBG); > > > + val |= LDBG_SR; > > > + ls_pcie_pf_lut_writel(pcie, LS_PCIE_LDBG, val); > > > + > > > + val = ls_pcie_pf_lut_readl(pcie, LS_PCIE_LDBG); > > > + val &= ~LDBG_SR; > > > + ls_pcie_pf_lut_writel(pcie, LS_PCIE_LDBG, val); > > > + > > > + val = ls_pcie_pf_lut_readl(pcie, LS_PCIE_LDBG); > > > + val &= ~LDBG_WE; > > > + ls_pcie_pf_lut_writel(pcie, LS_PCIE_LDBG, val); > > > + > > > + return 0; > > > +} > > > + > > > static const struct dw_pcie_host_ops ls_pcie_host_ops = { > > > .host_init = ls_pcie_host_init, > > > .pme_turn_off = ls_pcie_send_turnoff_msg, > > > @@ -242,6 +290,19 @@ static const struct ls_pcie_drvdata ls1021a_drvdata > > > = { > > > .exit_from_l2 = ls1021a_pcie_exit_from_l2, > > > }; > > > > > > +static const struct dw_pcie_host_ops ls1043a_pcie_host_ops = { > > > + .host_init = ls_pcie_host_init, > > > +
[PATCH] powerpc/44x: select I2C for CURRITUCK
Fix build errors when CURRITUCK=y and I2C is not builtin (=m or is not set). Fixes these build errors: powerpc-linux-ld: arch/powerpc/platforms/44x/ppc476.o: in function `avr_halt_system': ppc476.c:(.text+0x58): undefined reference to `i2c_smbus_write_byte_data' powerpc-linux-ld: arch/powerpc/platforms/44x/ppc476.o: in function `ppc47x_device_probe': ppc476.c:(.init.text+0x18): undefined reference to `i2c_register_driver' Fixes: 2a2c74b2efcb ("IBM Akebono: Add the Akebono platform") Signed-off-by: Randy Dunlap Reported-by: kernel test robot Closes: lore.kernel.org/r/202312010820.cmdwf5x9-...@intel.com Cc: Alistair Popple Cc: Benjamin Herrenschmidt Cc: linuxppc-dev@lists.ozlabs.org Cc: Michael Ellerman --- arch/powerpc/platforms/44x/Kconfig |1 + 1 file changed, 1 insertion(+) diff -- a/arch/powerpc/platforms/44x/Kconfig b/arch/powerpc/platforms/44x/Kconfig --- a/arch/powerpc/platforms/44x/Kconfig +++ b/arch/powerpc/platforms/44x/Kconfig @@ -173,6 +173,7 @@ config ISS4xx config CURRITUCK bool "IBM Currituck (476fpe) Support" depends on PPC_47x + select I2C select SWIOTLB select 476FPE select FORCE_PCI
Re: [PATCH v2] powerpc/pseries/vas: Use usleep_range() to support HCALL delay
On 11/29/23 5:43 PM, Nathan Lynch wrote: Haren Myneni writes: VAS allocate, modify and deallocate HCALLs returns H_LONG_BUSY_ORDER_1_MSEC or H_LONG_BUSY_ORDER_10_MSEC for busy delay and expects OS to reissue HCALL after that delay. But using msleep() will often sleep at least 20 msecs even though the hypervisor expects to reissue these HCALLs after 1 or 10msecs. I would word this as "the architecture suggests that the OS reissue these [...]" instead of framing it as something the platform "expects". It might cause these HCALLs takes longer when multiple threads issue open or close VAS windows simultaneously. This is imprecise. Over-sleeping by the OS doesn't cause individual hcalls to take longer. It is more accurate to say that the higher-level operation (allocate, modify, free) may take longer than necessary in cases where the OS must retry the hcalls involved. Correct, takes longer with multiple threads opening/closing windows. I will make it clear. So instead of msleep(), use usleep_range() to ensure sleep with the expected value before issuing HCALL again. Signed-off-by: Haren Myneni Suggested-by: Nathan Lynch --- v1 -> v2: - Use usleep_range instead of using RTAS sleep routine as suggested by Nathan --- arch/powerpc/platforms/pseries/vas.c | 24 +++- 1 file changed, 23 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/platforms/pseries/vas.c b/arch/powerpc/platforms/pseries/vas.c index 71d52a670d95..bade4402741f 100644 --- a/arch/powerpc/platforms/pseries/vas.c +++ b/arch/powerpc/platforms/pseries/vas.c @@ -36,9 +36,31 @@ static bool migration_in_progress; static long hcall_return_busy_check(long rc) { + unsigned int ms; This should move down into the H_IS_LONG_BUSY() block if it's not used outside of it. + /* Check if we are stalled for some time */ if (H_IS_LONG_BUSY(rc)) { - msleep(get_longbusy_msecs(rc)); + ms = get_longbusy_msecs(rc); + /* +* Allocate, Modify and Deallocate HCALLs returns +* H_LONG_BUSY_ORDER_1_MSEC or H_LONG_BUSY_ORDER_10_MSEC +* for the long delay. So the delay should always be 1 +* or 10msecs, but sleeps 1msec in case if the long +* delay is > H_LONG_BUSY_ORDER_10_MSEC. +*/ + if (ms > 10) + ms = 1; It's strange to coerce ms to 1 when it's greater than 10. Just clamp it to 10, e.g. ms = clamp(get_longbusy_msecs(rc), 1, 10); Sure, these HCALLs should not return > H_LONG_BUSY_ORDER_10_MSEC. + + /* +* msleep() will often sleep at least 20 msecs even +* though the hypervisor expects to reissue these +* HCALLs after 1 or 10msecs. So use usleep_range() +* to sleep with the expected value. +* +* See Documentation/timers/timers-howto.rst on using +* the value range in usleep_range(). +*/ + usleep_range(ms * 100, ms * 1000); If there's going to be commentary here I think it should just explain why potentially sleeping for less than the suggested time is OK. There is wording you can crib in rtas_busy_delay(). rc = H_BUSY; } else if (rc == H_BUSY) { cond_resched(); -- 2.26.3
Re: [PATCH] powerpc/mm: Fix null-pointer dereference in pgtable_cache_add
Thanks for your reply. Ok, I know what you mean, when name is NULL. The process should be aborted and the specific reason for the error should be printed, not just return. I will update v2 patch with "panic". Thanks again, Kunwu On 2023/11/28 19:32, Michael Ellerman wrote: Kunwu Chan writes: Hi Christophe, Thanks for your reply. It's my bad. According your reply, i read the code in sysfs_do_create_link_sd.There is a null pointer check indeed. My intention was to check null pointer after memory allocation. Whether we can add a comment here for someone like me, the null pointer check is no need here? I don't mind there being a NULL check for name. But the code shouldn't silently return if name can't be allocated. Notice that if we can't create the cache we *panic*. A failure to allocate name, which causes us to skip the cache creation, needs to also panic. cheers On 2023/11/24 23:17, Christophe Leroy wrote: Le 22/11/2023 à 10:00, Kunwu Chan a écrit : [Vous ne recevez pas souvent de courriers de chen...@kylinos.cn. Découvrez pourquoi ceci est important à https://aka.ms/LearnAboutSenderIdentification ] kasprintf() returns a pointer to dynamically allocated memory which can be NULL upon failure. Ensure the allocation was successful by checking the pointer validity. Are you sure this is needed ? Did you check what happens what name is NULL ? If I followed stuff correctly, I end up in function sysfs_do_create_link_sd() which already handles the NULL name case which a big hammer warning. Signed-off-by: Kunwu Chan --- arch/powerpc/mm/init-common.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/powerpc/mm/init-common.c b/arch/powerpc/mm/init-common.c index 119ef491f797..0884fc601c46 100644 --- a/arch/powerpc/mm/init-common.c +++ b/arch/powerpc/mm/init-common.c @@ -139,6 +139,8 @@ void pgtable_cache_add(unsigned int shift) align = max_t(unsigned long, align, minalign); name = kasprintf(GFP_KERNEL, "pgtable-2^%d", shift); + if (!name) + return; new = kmem_cache_create(name, table_size, align, 0, ctor(shift)); if (!new) panic("Could not allocate pgtable cache for order %d", shift); -- 2.34.1
Re: [PATCH 5/5] powerpc/64s: Fix CONFIG_NUMA=n build
On Thu, Nov 30, 2023, at 07:43, Michael Ellerman wrote: > "Arnd Bergmann" writes: >> On Wed, Nov 29, 2023, at 14:19, Michael Ellerman wrote: >>> diff --git a/arch/powerpc/mm/mmu_decl.h b/arch/powerpc/mm/mmu_decl.h >>> index 7f9ff0640124..72341b9fb552 100644 >>> --- a/arch/powerpc/mm/mmu_decl.h >>> +++ b/arch/powerpc/mm/mmu_decl.h >>> + >>> +#ifdef CONFIG_MEMORY_HOTPLUG >>> +int create_section_mapping(unsigned long start, unsigned long end, >>> + int nid, pgprot_t prot); >>> +#endif >> >> This one should probably go next to the remove_section_mapping() >> declaration in arch/powerpc/include/asm/sparsemem.h for consistency. > > That doesn't work due to: > > In file included from ../include/linux/numa.h:26, > from ../include/linux/async.h:13, > from ../init/initramfs.c:3: > ../arch/powerpc/include/asm/sparsemem.h:19:44: error: unknown type name > ‘pgprot_t’ >19 | int nid, pgprot_t prot); > |^~~~ > > Which might be fixable, but I'd rather just move > remove_section_mapping() into mmu_decl.h as well. Ok, makes sense. Arnd
[PATCH v2] powerpc/mm: Fix null-pointer dereference in pgtable_cache_add
kasprintf() returns a pointer to dynamically allocated memory which can be NULL upon failure. Ensure the allocation was successful by checking the pointer validity. Suggested-by: Christophe Leroy Suggested-by: Michael Ellerman Signed-off-by: Kunwu Chan --- v2: Use "panic" instead of "return" --- arch/powerpc/mm/init-common.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/powerpc/mm/init-common.c b/arch/powerpc/mm/init-common.c index 119ef491f797..9788950b33f5 100644 --- a/arch/powerpc/mm/init-common.c +++ b/arch/powerpc/mm/init-common.c @@ -139,6 +139,8 @@ void pgtable_cache_add(unsigned int shift) align = max_t(unsigned long, align, minalign); name = kasprintf(GFP_KERNEL, "pgtable-2^%d", shift); + if (!name) + panic("Failed to allocate memory for order %d", shift); new = kmem_cache_create(name, table_size, align, 0, ctor(shift)); if (!new) panic("Could not allocate pgtable cache for order %d", shift); -- 2.34.1
Re: [kvm-unit-tests PATCH v1 00/18] arm/arm64: Rework cache maintenance at boot
Hi, Thank you so much for reviving this, much appreciated. I wanted to let you know that I definitely plan to review the series as soon as possible, unfortunately I don't believe I won't be able to do that for at least 2 weeks. Thanks, Alex On Thu, Nov 30, 2023 at 04:07:02AM -0500, Shaoqin Huang wrote: > Hi, > > I'm posting Alexandru's patch set[1] rebased on the latest branch with the > conflicts being resolved. No big changes compare to its original code. > > As this version 1 of this series was posted one years ago, I would first let > you > recall it, what's the intention of this series and what this series do. You > can > view it by click the link[2] and view the cover-letter. > > Since when writing the series[1], the efi support for arm64[3] hasn't been > merged into the kvm-unit-tests, but now the efi support for arm64 has been > merged. Directly rebase the series[1] onto the latest branch will break the > efi > tests. This is mainly because the Patch #15 ("arm/arm64: Enable the MMU > early") > moves the mmu_enable() out of the setup_mmu(), which causes the efi test will > not enable the mmu. So I do a small change in the efi_mem_init() which makes > the > efi test also enable the MMU early, and make it works. > > And another change should be noticed is in the Patch #17 ("arm/arm64: Perform > dcache maintenance"). In the efi_mem_init(), it will disable the mmu, and > build > a new pagetable and re-enable the mmu, if the asm_mmu_disable clean and > invalidate the data caches for entire memory, we don't need to care the dcache > and after mmu disabled, we use the mmu_setup_early() to re-enable the mmu, > which > takes care all the cache maintenance. But the situation changes since the > Patch > #18 ("arm/arm64: Rework the cache maintenance in asm_mmu_disable") only clean > and invalidate the data caches for the stack memory area. So we need to clean > and invalidate the data caches manually before disable the mmu, I'm not > confident about current cache maintenance at the efi setup patch, so I ask for > your help to review it if it's right or not. > > And I also drop one patch ("s390: Do not use the physical allocator") from[1] > since this cause s390 test to fail. > > This series may include bug, so I really appreciate your review to improve > this > series together. > > You can get the code from: > > $ git clone https://gitlab.com/shahuang/kvm-unit-tests.git \ > -b arm-arm64-rework-cache-maintenance-at-boot-v1 > > [1] > https://gitlab.arm.com/linux-arm/kvm-unit-tests-ae/-/tree/arm-arm64-rework-cache-maintenance-at-boot-v2-wip2 > [2] > https://lore.kernel.org/all/20220809091558.14379-1-alexandru.eli...@arm.com/ > [3] > https://patchwork.kernel.org/project/kvm/cover/20230530160924.82158-1-nikos.nikole...@arm.com/ > > Changelog: > -- > RFC->v1: > - Gathered Reviewed-by tags. > - Various changes to commit messages and comments to hopefully make the code > easier to understand. > - Patches #8 ("lib/alloc_phys: Expand documentation with usage and > limitations") > are new. > - Folded patch "arm: page.h: Add missing libcflat.h include" into #17 > ("arm/arm64: Perform dcache maintenance at boot"). > - Reordered the series to group patches that touch aproximately the same > code > together - the patches that change the physical allocator are now first, > followed come the patches that change how the secondaries are brought > online. > - Fixed several nasty bugs where the r4 register was being clobbered in the > arm > assembly. > - Unmap the early UART address if the DTB address does not match the early > address. > - Added dcache maintenance when a page table is modified with the MMU > disabled. > - Moved the cache maintenance when disabling the MMU to be executed before > the > MMU is disabled. > - Rebase it on lasted branch which efi support has been merged. > - Make the efi test also enable MMU early. > - Add cache maintenance on efi setup path especially before mmu_disable. > > RFC: > https://lore.kernel.org/all/20220809091558.14379-1-alexandru.eli...@arm.com/ > > Alexandru Elisei (18): > Makefile: Define __ASSEMBLY__ for assembly files > powerpc: Replace the physical allocator with the page allocator > lib/alloc_phys: Initialize align_min > lib/alloc_phys: Consolidate allocate functions into memalign_early() > lib/alloc_phys: Remove locking > lib/alloc_phys: Remove allocation accounting > lib/alloc_phys: Add callback to perform cache maintenance > lib/alloc_phys: Expand documentation with usage and limitations > arm/arm64: Zero secondary CPUs' stack > arm/arm64: Allocate secondaries' stack using the page allocator > arm/arm64: assembler.h: Replace size with end address for > dcache_by_line_op > arm/arm64: Add C functions for doing cache maintenance > arm/arm64: Configure secondaries' stack before enabling the MMU > arm/arm64: Use pgd_alloc() to allocate mmu_idmap >
Re: Ping? Re: [PATCH rc] kvm: Prevent compiling virt/kvm/vfio.c unless VFIO is selected
On Wed, Nov 29, 2023 at 06:02:08PM -0800, Sean Christopherson wrote: > > > Ah, it's the same warning, I just missed the CONFIG_MODULES=n requirement. > > > > Oh, wait, doesn't that mean the approach won't work? IIRC doesn't > > symbol_get turn into just when non-modular turning this into a > > link failure without the kconfig part? > > Yes, but it doesn't cause linker errors. IIUC, because the extern declaration > is tagged "weak", a dummy default is used. E.g. on x86, this is what is > generated > with VFIO=y Oh wow that is some pretty dark magic there :| Jason
[PATCH] powerpc/irq: Allow softirq to hardirq stack transition
Allow a transition from the softirq stack to the hardirq stack when handling a hardirq. Doing so means a hardirq received while deep in softirq processing is less likely to cause a stack overflow of the softirq stack. Previously it wasn't safe to do so because irq_exit() (which initiates softirq processing) was called on the hardirq stack. That was changed in commit 1b1b6a6f4cc0 ("powerpc: handle irq_enter/ irq_exit in interrupt handler wrappers") and 1346d00e1bdf ("powerpc: Don't select HAVE_IRQ_EXIT_ON_IRQ_STACK"). The allowed transitions are now: - process stack -> hardirq stack - process stack -> softirq stack - process stack -> softirq stack -> hardirq stack Signed-off-by: Michael Ellerman --- arch/powerpc/kernel/irq.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c index 6f7d4edaa0bc..7504ceec5c58 100644 --- a/arch/powerpc/kernel/irq.c +++ b/arch/powerpc/kernel/irq.c @@ -284,15 +284,14 @@ static __always_inline void call_do_irq(struct pt_regs *regs, void *sp) void __do_IRQ(struct pt_regs *regs) { struct pt_regs *old_regs = set_irq_regs(regs); - void *cursp, *irqsp, *sirqsp; + void *cursp, *irqsp; /* Switch to the irq stack to handle this */ cursp = (void *)(current_stack_pointer & ~(THREAD_SIZE - 1)); irqsp = hardirq_ctx[raw_smp_processor_id()]; - sirqsp = softirq_ctx[raw_smp_processor_id()]; /* Already there ? If not switch stack and call */ - if (unlikely(cursp == irqsp || cursp == sirqsp)) + if (unlikely(cursp == irqsp)) __do_irq(regs, current_stack_pointer); else call_do_irq(regs, irqsp); -- 2.41.0
[PATCH 1/2] powerpc/mm: Fix build failures due to arch_reserved_kernel_pages()
With NUMA=n and FA_DUMP=y or PRESERVE_FA_DUMP=y the build fails with: arch/powerpc/kernel/fadump.c:1739:22: error: no previous prototype for ‘arch_reserved_kernel_pages’ [-Werror=missing-prototypes] 1739 | unsigned long __init arch_reserved_kernel_pages(void) | ^~ The prototype for arch_reserved_kernel_pages() is in include/linux/mm.h, but it's guarded by __HAVE_ARCH_RESERVED_KERNEL_PAGES. The powerpc headers define __HAVE_ARCH_RESERVED_KERNEL_PAGES in asm/mmzone.h, which is not included into the generic headers when NUMA=n. Move the definition of __HAVE_ARCH_RESERVED_KERNEL_PAGES into asm/mmu.h which is included regardless of NUMA=n. Additionally the ifdef around __HAVE_ARCH_RESERVED_KERNEL_PAGES needs to also check for CONFIG_PRESERVE_FA_DUMP. Signed-off-by: Michael Ellerman --- arch/powerpc/include/asm/mmu.h| 4 arch/powerpc/include/asm/mmzone.h | 3 --- 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/include/asm/mmu.h b/arch/powerpc/include/asm/mmu.h index 52cc25864a1b..d8b7e246a32f 100644 --- a/arch/powerpc/include/asm/mmu.h +++ b/arch/powerpc/include/asm/mmu.h @@ -412,5 +412,9 @@ extern void *abatron_pteptrs[2]; #include #endif +#if defined(CONFIG_FA_DUMP) || defined(CONFIG_PRESERVE_FA_DUMP) +#define __HAVE_ARCH_RESERVED_KERNEL_PAGES +#endif + #endif /* __KERNEL__ */ #endif /* _ASM_POWERPC_MMU_H_ */ diff --git a/arch/powerpc/include/asm/mmzone.h b/arch/powerpc/include/asm/mmzone.h index 4740ca230d36..da827d2d0866 100644 --- a/arch/powerpc/include/asm/mmzone.h +++ b/arch/powerpc/include/asm/mmzone.h @@ -42,9 +42,6 @@ u64 memory_hotplug_max(void); #else #define memory_hotplug_max() memblock_end_of_DRAM() #endif /* CONFIG_NUMA */ -#ifdef CONFIG_FA_DUMP -#define __HAVE_ARCH_RESERVED_KERNEL_PAGES -#endif #endif /* __KERNEL__ */ #endif /* _ASM_MMZONE_H_ */ -- 2.41.0
[kvm-unit-tests PATCH v1 02/18] powerpc: Replace the physical allocator with the page allocator
From: Alexandru Elisei The spapr_hcall test makes two page sized allocations using the physical allocator. Replace the physical allocator with the page allocator, which has has more features (like support for freeing allocations), and would allow for further simplification of the physical allocator. CC: Laurent Vivier CC: Thomas Huth CC: kvm-...@vger.kernel.org Signed-off-by: Alexandru Elisei --- lib/powerpc/setup.c | 9 ++--- powerpc/Makefile.common | 1 + powerpc/spapr_hcall.c | 5 +++-- 3 files changed, 10 insertions(+), 5 deletions(-) diff --git a/lib/powerpc/setup.c b/lib/powerpc/setup.c index 1be4c030..80fd38ae 100644 --- a/lib/powerpc/setup.c +++ b/lib/powerpc/setup.c @@ -15,6 +15,7 @@ #include #include #include +#include #include #include #include @@ -111,6 +112,7 @@ static void mem_init(phys_addr_t freemem_start) struct mem_region primary, mem = { .start = (phys_addr_t)-1, }; + phys_addr_t base, top; int nr_regs, i; nr_regs = dt_get_memory_params(regs, NR_MEM_REGIONS); @@ -146,9 +148,10 @@ static void mem_init(phys_addr_t freemem_start) __physical_start = mem.start; /* PHYSICAL_START */ __physical_end = mem.end; /* PHYSICAL_END */ - phys_alloc_init(freemem_start, primary.end - freemem_start); - phys_alloc_set_minimum_alignment(__icache_bytes > __dcache_bytes -? __icache_bytes : __dcache_bytes); + base = PAGE_ALIGN(freemem_start) >> PAGE_SHIFT; + top = primary.end >> PAGE_SHIFT; + page_alloc_init_area(0, base, top); + page_alloc_ops_enable(); } void setup(const void *fdt) diff --git a/powerpc/Makefile.common b/powerpc/Makefile.common index f8f47490..ae70443a 100644 --- a/powerpc/Makefile.common +++ b/powerpc/Makefile.common @@ -34,6 +34,7 @@ include $(SRCDIR)/scripts/asm-offsets.mak cflatobjs += lib/util.o cflatobjs += lib/getchar.o cflatobjs += lib/alloc_phys.o +cflatobjs += lib/alloc_page.o cflatobjs += lib/alloc.o cflatobjs += lib/devicetree.o cflatobjs += lib/migrate.o diff --git a/powerpc/spapr_hcall.c b/powerpc/spapr_hcall.c index e9b5300a..77ab4187 100644 --- a/powerpc/spapr_hcall.c +++ b/powerpc/spapr_hcall.c @@ -8,6 +8,7 @@ #include #include #include +#include #include #include @@ -58,8 +59,8 @@ static void test_h_page_init(int argc, char **argv) if (argc > 1) report_abort("Unsupported argument: '%s'", argv[1]); - dst = memalign(PAGE_SIZE, PAGE_SIZE); - src = memalign(PAGE_SIZE, PAGE_SIZE); + dst = alloc_page(); + src = alloc_page(); if (!dst || !src) report_abort("Failed to alloc memory"); -- 2.40.1
Re: [PATCH v2 1/3] powerpc/code-patching: Add generic memory patching
On Mon, Oct 16, 2023 at 04:01:45PM +1100, Benjamin Gray wrote: > patch_instruction() is designed for patching instructions in otherwise > readonly memory. Other consumers also sometimes need to patch readonly > memory, so have abused patch_instruction() for arbitrary data patches. > > This is a problem on ppc64 as patch_instruction() decides on the patch > width using the 'instruction' opcode to see if it's a prefixed > instruction. Data that triggers this can lead to larger writes, possibly > crossing a page boundary and failing the write altogether. > > Introduce patch_uint(), and patch_ulong(), with aliases patch_u32(), and > patch_u64() (on ppc64) designed for aligned data patches. The patch > size is now determined by the called function, and is passed as an > additional parameter to generic internals. > > While the instruction flushing is not required for data patches, the > use cases for data patching (mainly module loading and static calls) > are less performance sensitive than for instruction patching > (ftrace activation). That's debatable. While it is nice to be able to activate function tracing quickly, it is not necessarily a hot path. On the flip side, I do have a use case for data patching for ftrace activation :) > So the instruction flushing remains unconditional > in this patch. > > ppc32 does not support prefixed instructions, so is unaffected by the > original issue. Care is taken in not exposing the size parameter in the > public (non-static) interface, so the compiler can const-propagate it > away. > > Signed-off-by: Benjamin Gray > > --- > > v2: * Deduplicate patch_32() definition > * Use u32 for val32 > * Remove noinline > --- > arch/powerpc/include/asm/code-patching.h | 33 > arch/powerpc/lib/code-patching.c | 66 ++-- > 2 files changed, 83 insertions(+), 16 deletions(-) > > diff --git a/arch/powerpc/include/asm/code-patching.h > b/arch/powerpc/include/asm/code-patching.h > index 3f881548fb61..7c6056bb1706 100644 > --- a/arch/powerpc/include/asm/code-patching.h > +++ b/arch/powerpc/include/asm/code-patching.h > @@ -75,6 +75,39 @@ int patch_branch(u32 *addr, unsigned long target, int > flags); > int patch_instruction(u32 *addr, ppc_inst_t instr); > int raw_patch_instruction(u32 *addr, ppc_inst_t instr); > > +/* > + * patch_uint() and patch_ulong() should only be called on addresses where > the > + * patch does not cross a cacheline, otherwise it may not be flushed properly > + * and mixes of new and stale data may be observed. It cannot cross a page > + * boundary, as only the target page is mapped as writable. Should we enforce alignment requirements, especially for patch_ulong() on 64-bit powerpc? I am not sure if there are use cases for unaligned 64-bit writes. That should also ensure that the write doesn't cross a cacheline. > + * > + * patch_instruction() and other instruction patchers automatically satisfy > this > + * requirement due to instruction alignment requirements. > + */ > + > +#ifdef CONFIG_PPC64 > + > +int patch_uint(void *addr, unsigned int val); > +int patch_ulong(void *addr, unsigned long val); > + > +#define patch_u64 patch_ulong > + > +#else > + > +static inline int patch_uint(u32 *addr, unsigned int val) Is there a reason to use u32 * here? > +{ > + return patch_instruction(addr, ppc_inst(val)); > +} > + > +static inline int patch_ulong(void *addr, unsigned long val) > +{ > + return patch_instruction(addr, ppc_inst(val)); > +} > + > +#endif > + > +#define patch_u32 patch_uint > + > static inline unsigned long patch_site_addr(s32 *site) > { > return (unsigned long)site + *site; > diff --git a/arch/powerpc/lib/code-patching.c > b/arch/powerpc/lib/code-patching.c > index b00112d7ad46..60289332412f 100644 > --- a/arch/powerpc/lib/code-patching.c > +++ b/arch/powerpc/lib/code-patching.c > @@ -20,15 +20,14 @@ > #include > #include > > -static int __patch_instruction(u32 *exec_addr, ppc_inst_t instr, u32 > *patch_addr) > +static int __patch_memory(void *exec_addr, unsigned long val, void > *patch_addr, > + bool is_dword) > { > - if (!ppc_inst_prefixed(instr)) { > - u32 val = ppc_inst_val(instr); > + if (!IS_ENABLED(CONFIG_PPC64) || likely(!is_dword)) { > + u32 val32 = val; Would be good to add a comment indicating the need for this for BE. - Naveen
Re: [PATCH] perf test record+probe_libc_inet_pton: Fix call chain match on powerpc
Hi Arnaldo, Thank you for pointing it. From next time I will take care of it. -Likhitha. On 30/11/23 02:26, Arnaldo Carvalho de Melo wrote: Em Sun, Nov 26, 2023 at 02:09:14AM -0500, Likhitha Korrapati escreveu: The perf test "probe libc's inet_pton & backtrace it with ping" fails on powerpc as below: root@xxx perf]# perf test -v "probe libc's inet_pton & backtrace it with ping" 85: probe libc's inet_pton & backtrace it with ping : --- start --- test child forked, pid 96028 ping 96056 [002] 127271.101961: probe_libc:inet_pton: (7fffa1779a60) 7fffa1779a60 __GI___inet_pton+0x0 (/usr/lib64/glibc-hwcaps/power10/libc.so.6) 7fffa172a73c getaddrinfo+0x121c (/usr/lib64/glibc-hwcaps/power10/libc.so.6) FAIL: expected backtrace entry "gaih_inet.*\+0x[[:xdigit:]]+[[:space:]]\(/usr/lib64/glibc-hwcaps/power10/libc.so.6\)$" got "7fffa172a73c getaddrinfo+0x121c (/usr/lib64/glibc-hwcaps/power10/libc.so.6)" test child finished with -1 end Try to have quoted output, the ones separated by at the beginning of the line indented two spaces, so as to avoid: perf test record+probe_libc_inet_pton: Fix call chain match on powerpc The perf test "probe libc's inet_pton & backtrace it with ping" fails on powerpc as below: root@xxx perf]# perf test -v "probe libc's inet_pton & backtrace it with ping" 85: probe libc's inet_pton & backtrace it with ping : # Please enter the commit message for your changes. Lines starting # with '#' will be ignored, and an empty message aborts the commit. # # Author:Likhitha Korrapati # Date: Sun Nov 26 02:09:14 2023 -0500 I'm copy and pasting from the original post, thanks! - Arnaldo probe libc's inet_pton & backtrace it with ping: FAILED! This test installs a probe on libc's inet_pton function, which will use uprobes and then uses perf trace on a ping to localhost. It gets 3 levels deep backtrace and checks whether it is what we expected or not. The test started failing from RHEL 9.4 where as it works in previous distro version (RHEL 9.2). Test expects gaih_inet function to be part of backtrace. But in the glibc version (2.34-86) which is part of distro where it fails, this function is missing and hence the test is failing. From nm and ping command output we can confirm that gaih_inet function is not present in the expected backtrace for glibc version glibc-2.34-86 [root@xxx perf]# nm /usr/lib64/glibc-hwcaps/power10/libc.so.6 | grep gaih_inet 001273e0 t gaih_inet_serv 001cd8d8 r gaih_inet_typeproto [root@xxx perf]# perf script -i /tmp/perf.data.6E8 ping 104048 [000] 128582.508976: probe_libc:inet_pton: (7fff83779a60) 7fff83779a60 __GI___inet_pton+0x0 (/usr/lib64/glibc-hwcaps/power10/libc.so.6) 7fff8372a73c getaddrinfo+0x121c (/usr/lib64/glibc-hwcaps/power10/libc.so.6) 11dc73534 [unknown] (/usr/bin/ping) 7fff8362a8c4 __libc_start_call_main+0x84 (/usr/lib64/glibc-hwcaps/power10/libc.so.6) FAIL: expected backtrace entry "gaih_inet.*\+0x[[:xdigit:]]+[[:space:]]\(/usr/lib64/glibc-hwcaps/power10/libc.so.6\)$" got "7fff9d52a73c getaddrinfo+0x121c (/usr/lib64/glibc-hwcaps/power10/libc.so.6)" With version glibc-2.34-60 gaih_inet function is present as part of the expected backtrace. So we cannot just remove the gaih_inet function from the backtrace. [root@xxx perf]# nm /usr/lib64/glibc-hwcaps/power10/libc.so.6 | grep gaih_inet 00130490 t gaih_inet.constprop.0 0012e830 t gaih_inet_serv 001d45e4 r gaih_inet_typeproto [root@xxx perf]# ./perf script -i /tmp/perf.data.b6S ping 67906 [000] 22699.591699: probe_libc:inet_pton_3: (7fffbdd80820) 7fffbdd80820 __GI___inet_pton+0x0 (/usr/lib64/glibc-hwcaps/power10/libc.so.6) 7fffbdd31160 gaih_inet.constprop.0+0xcd0 (/usr/lib64/glibc-hwcaps/power10/libc.so.6) 7fffbdd31c7c getaddrinfo+0x14c (/usr/lib64/glibc-hwcaps/power10/libc.so.6) 1140d3558 [unknown] (/usr/bin/ping) This patch solves this issue by doing a conditional skip. If there is a gaih_inet function present in the libc then it will be added to the expected backtrace else the function will be skipped from being added to the expected backtrace. Output with the patch [root@xxx perf]# ./perf test -v "probe libc's inet_pton & backtrace it with ping" 83: probe libc's inet_pton & backtrace it with ping : --- start --- test child forked, pid 102662 ping 102692 [000] 127935.549973: probe_libc:inet_pton: (7fff93379a60) 7fff93379a60 __GI___inet_pton+0x0 (/usr/lib64/glibc-hwcaps/power10/libc.so.6) 7fff9332a73c getaddrinfo+0x121c (/usr/lib64/glibc-hwcaps/power10/libc.so.6) 11ef03534 [unknown] (/usr/bin/ping) test child finished with 0 end probe libc's inet_pton & backtrace it with ping: Ok Signed-off-by: Likhitha Korrapati Reported-by: Disha Goel --- tools/perf/tests/shell/record+probe_libc_inet_pton.sh | 5 - 1 file changed, 4 insertions(+), 1
[kvm-unit-tests PATCH v1 00/18] arm/arm64: Rework cache maintenance at boot
Hi, I'm posting Alexandru's patch set[1] rebased on the latest branch with the conflicts being resolved. No big changes compare to its original code. As this version 1 of this series was posted one years ago, I would first let you recall it, what's the intention of this series and what this series do. You can view it by click the link[2] and view the cover-letter. Since when writing the series[1], the efi support for arm64[3] hasn't been merged into the kvm-unit-tests, but now the efi support for arm64 has been merged. Directly rebase the series[1] onto the latest branch will break the efi tests. This is mainly because the Patch #15 ("arm/arm64: Enable the MMU early") moves the mmu_enable() out of the setup_mmu(), which causes the efi test will not enable the mmu. So I do a small change in the efi_mem_init() which makes the efi test also enable the MMU early, and make it works. And another change should be noticed is in the Patch #17 ("arm/arm64: Perform dcache maintenance"). In the efi_mem_init(), it will disable the mmu, and build a new pagetable and re-enable the mmu, if the asm_mmu_disable clean and invalidate the data caches for entire memory, we don't need to care the dcache and after mmu disabled, we use the mmu_setup_early() to re-enable the mmu, which takes care all the cache maintenance. But the situation changes since the Patch #18 ("arm/arm64: Rework the cache maintenance in asm_mmu_disable") only clean and invalidate the data caches for the stack memory area. So we need to clean and invalidate the data caches manually before disable the mmu, I'm not confident about current cache maintenance at the efi setup patch, so I ask for your help to review it if it's right or not. And I also drop one patch ("s390: Do not use the physical allocator") from[1] since this cause s390 test to fail. This series may include bug, so I really appreciate your review to improve this series together. You can get the code from: $ git clone https://gitlab.com/shahuang/kvm-unit-tests.git \ -b arm-arm64-rework-cache-maintenance-at-boot-v1 [1] https://gitlab.arm.com/linux-arm/kvm-unit-tests-ae/-/tree/arm-arm64-rework-cache-maintenance-at-boot-v2-wip2 [2] https://lore.kernel.org/all/20220809091558.14379-1-alexandru.eli...@arm.com/ [3] https://patchwork.kernel.org/project/kvm/cover/20230530160924.82158-1-nikos.nikole...@arm.com/ Changelog: -- RFC->v1: - Gathered Reviewed-by tags. - Various changes to commit messages and comments to hopefully make the code easier to understand. - Patches #8 ("lib/alloc_phys: Expand documentation with usage and limitations") are new. - Folded patch "arm: page.h: Add missing libcflat.h include" into #17 ("arm/arm64: Perform dcache maintenance at boot"). - Reordered the series to group patches that touch aproximately the same code together - the patches that change the physical allocator are now first, followed come the patches that change how the secondaries are brought online. - Fixed several nasty bugs where the r4 register was being clobbered in the arm assembly. - Unmap the early UART address if the DTB address does not match the early address. - Added dcache maintenance when a page table is modified with the MMU disabled. - Moved the cache maintenance when disabling the MMU to be executed before the MMU is disabled. - Rebase it on lasted branch which efi support has been merged. - Make the efi test also enable MMU early. - Add cache maintenance on efi setup path especially before mmu_disable. RFC: https://lore.kernel.org/all/20220809091558.14379-1-alexandru.eli...@arm.com/ Alexandru Elisei (18): Makefile: Define __ASSEMBLY__ for assembly files powerpc: Replace the physical allocator with the page allocator lib/alloc_phys: Initialize align_min lib/alloc_phys: Consolidate allocate functions into memalign_early() lib/alloc_phys: Remove locking lib/alloc_phys: Remove allocation accounting lib/alloc_phys: Add callback to perform cache maintenance lib/alloc_phys: Expand documentation with usage and limitations arm/arm64: Zero secondary CPUs' stack arm/arm64: Allocate secondaries' stack using the page allocator arm/arm64: assembler.h: Replace size with end address for dcache_by_line_op arm/arm64: Add C functions for doing cache maintenance arm/arm64: Configure secondaries' stack before enabling the MMU arm/arm64: Use pgd_alloc() to allocate mmu_idmap arm/arm64: Enable the MMU early arm/arm64: Map the UART when creating the translation tables arm/arm64: Perform dcache maintenance at boot arm/arm64: Rework the cache maintenance in asm_mmu_disable Makefile | 5 +- arm/Makefile.arm | 4 +- arm/Makefile.arm64 | 4 +- arm/Makefile.common| 6 +- arm/cstart.S | 71 +++-- arm/cstart64.S | 76 +-- lib/alloc_phys.c | 122
[kvm-unit-tests PATCH v1 01/18] Makefile: Define __ASSEMBLY__ for assembly files
From: Alexandru Elisei There are 25 header files today (found with grep -r "#ifndef __ASSEMBLY__) with functionality relies on the __ASSEMBLY__ prepocessor constant being correctly defined to work correctly. So far, kvm-unit-tests has relied on the assembly files to define the constant before including any header files which depend on it. Let's make sure that nobody gets this wrong and define it as a compiler constant when compiling assembly files. __ASSEMBLY__ is now defined for all .S files, even those that didn't set it explicitely before. Reviewed-by: Nikos Nikoleris Reviewed-by: Andrew Jones Signed-off-by: Alexandru Elisei Signed-off-by: Shaoqin Huang --- Makefile | 5 - arm/cstart.S | 1 - arm/cstart64.S | 1 - powerpc/cstart64.S | 1 - 4 files changed, 4 insertions(+), 4 deletions(-) diff --git a/Makefile b/Makefile index 602910dd..27ed14e6 100644 --- a/Makefile +++ b/Makefile @@ -92,6 +92,9 @@ CFLAGS += -Woverride-init -Wmissing-prototypes -Wstrict-prototypes autodepend-flags = -MMD -MP -MF $(dir $*).$(notdir $*).d +AFLAGS = $(CFLAGS) +AFLAGS += -D__ASSEMBLY__ + LDFLAGS += -nostdlib $(no_pie) -z noexecstack $(libcflat): $(cflatobjs) @@ -113,7 +116,7 @@ directories: @mkdir -p $(OBJDIRS) %.o: %.S - $(CC) $(CFLAGS) -c -nostdlib -o $@ $< + $(CC) $(AFLAGS) -c -nostdlib -o $@ $< -include */.*.d */*/.*.d diff --git a/arm/cstart.S b/arm/cstart.S index 3dd71ed9..b24ecabc 100644 --- a/arm/cstart.S +++ b/arm/cstart.S @@ -5,7 +5,6 @@ * * This work is licensed under the terms of the GNU LGPL, version 2. */ -#define __ASSEMBLY__ #include #include #include diff --git a/arm/cstart64.S b/arm/cstart64.S index bc2be45a..a8ad6dc8 100644 --- a/arm/cstart64.S +++ b/arm/cstart64.S @@ -5,7 +5,6 @@ * * This work is licensed under the terms of the GNU GPL, version 2. */ -#define __ASSEMBLY__ #include #include #include diff --git a/powerpc/cstart64.S b/powerpc/cstart64.S index 34e39341..fa32ef24 100644 --- a/powerpc/cstart64.S +++ b/powerpc/cstart64.S @@ -5,7 +5,6 @@ * * This work is licensed under the terms of the GNU LGPL, version 2. */ -#define __ASSEMBLY__ #include #include #include -- 2.40.1
Re: [PATCH v2 2/3] powerpc/64: Convert patch_instruction() to patch_u32()
On Mon, Oct 16, 2023 at 04:01:46PM +1100, Benjamin Gray wrote: > This use of patch_instruction() is working on 32 bit data, and can fail > if the data looks like a prefixed instruction and the extra write > crosses a page boundary. Use patch_u32() to fix the write size. > > Fixes: 8734b41b3efe ("powerpc/module_64: Fix livepatching for RO modules") > Link: https://lore.kernel.org/all/20230203004649.1f59dbd4@yea/ > Signed-off-by: Benjamin Gray > > --- > > v2: * Added the fixes tag, it seems appropriate even if the subject does > mention a more robust solution being required. > > patch_u64() should be more efficient, but judging from the bug report > it doesn't seem like the data is doubleword aligned. That doesn't look to be the case anymore due to commits 77e69ee7ce07 ("powerpc/64: modules support building with PCREL addresing") and 7e3a68be42e1 ("powerpc/64: vmlinux support building with PCREL addresing") - Naveen
[PATCH 2/2] powerpc: Fix build error due to is_valid_bugaddr()
With CONFIG_GENERIC_BUG=n the build fails with: arch/powerpc/kernel/traps.c:1442:5: error: no previous prototype for ‘is_valid_bugaddr’ [-Werror=missing-prototypes] 1442 | int is_valid_bugaddr(unsigned long addr) | ^~~~ The prototype is only defined, and the function is only needed, when CONFIG_GENERIC_BUG=y, so move the implementation under that. Signed-off-by: Michael Ellerman --- arch/powerpc/kernel/traps.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c index 5ea2014aff90..11e062b47d3f 100644 --- a/arch/powerpc/kernel/traps.c +++ b/arch/powerpc/kernel/traps.c @@ -1439,10 +1439,12 @@ static int emulate_instruction(struct pt_regs *regs) return -EINVAL; } +#ifdef CONFIG_GENERIC_BUG int is_valid_bugaddr(unsigned long addr) { return is_kernel_addr(addr); } +#endif #ifdef CONFIG_MATH_EMULATION static int emulate_math(struct pt_regs *regs) -- 2.41.0
Re: [PATCH 1/2] kexec: fix KEXEC_FILE dependencies
On 11/30/23 10:56, Andrew Morton wrote: On Thu, 2 Nov 2023 16:03:18 +0800 Baoquan He wrote: CONFIG_KEXEC_FILE, but still get purgatory code built in which is totally useless. Not sure if I think too much over this. I see your point here, and I would suggest changing the CONFIG_ARCH_SUPPORTS_KEXEC_PURGATORY symbol to just indicate the availability of the purgatory code for the arch, rather than actually controlling the code itself. I already mentioned this for s390, but riscv would need the same thing on top. I think the change below should address your concern. Since no new comment, do you mind spinning v2 to wrap all these up? This patchset remains in mm-hotfixes-unstable from the previous -rc cycle. Eric, do you have any comments? Arnd, do you plan on a v2? If not, should I merge v1? If so, should I now add cc:stable? My apologies, I lost this. I've looked at these changes, and I am in favor of these changes. Furthermore, I ran the following thru the Kconfig regression script, and did not find anything! I believe the following patch represents the current discussion threads around Kconfig and KEXEC/CRASH. Reviewed-by: Eric DeVolder Tested-by: Eric DeVolder Thanks! eric diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 6f105ee4f3cf..1f11a62809f2 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -608,10 +608,10 @@ config ARCH_SUPPORTS_KEXEC def_bool PPC_BOOK3S || PPC_E500 || (44x && !SMP) config ARCH_SUPPORTS_KEXEC_FILE - def_bool PPC64 && CRYPTO=y && CRYPTO_SHA256=y + def_bool PPC64 config ARCH_SUPPORTS_KEXEC_PURGATORY - def_bool KEXEC_FILE + def_bool y config ARCH_SELECTS_KEXEC_FILE def_bool y diff --git a/arch/riscv/Kbuild b/arch/riscv/Kbuild index d25ad1c19f88..ab181d187c23 100644 --- a/arch/riscv/Kbuild +++ b/arch/riscv/Kbuild @@ -5,7 +5,7 @@ obj-$(CONFIG_BUILTIN_DTB) += boot/dts/ obj-y += errata/ obj-$(CONFIG_KVM) += kvm/ -obj-$(CONFIG_ARCH_SUPPORTS_KEXEC_PURGATORY) += purgatory/ +obj-$(CONFIG_KEXEC_FILE) += purgatory/ # for cleaning subdir- += boot diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig index 95a2a06acc6a..98857d76e458 100644 --- a/arch/riscv/Kconfig +++ b/arch/riscv/Kconfig @@ -702,9 +702,7 @@ config ARCH_SELECTS_KEXEC_FILE select KEXEC_ELF config ARCH_SUPPORTS_KEXEC_PURGATORY - def_bool KEXEC_FILE - depends on CRYPTO=y - depends on CRYPTO_SHA256=y + def_bool y config ARCH_SUPPORTS_CRASH_DUMP def_bool y diff --git a/arch/riscv/kernel/elf_kexec.c b/arch/riscv/kernel/elf_kexec.c index e60fbd8660c4..3ac341d296db 100644 --- a/arch/riscv/kernel/elf_kexec.c +++ b/arch/riscv/kernel/elf_kexec.c @@ -266,7 +266,7 @@ static void *elf_kexec_load(struct kimage *image, char *kernel_buf, cmdline = modified_cmdline; } -#ifdef CONFIG_ARCH_SUPPORTS_KEXEC_PURGATORY +#ifdef CONFIG_KEXEC_FILE /* Add purgatory to the image */ kbuf.top_down = true; kbuf.mem = KEXEC_BUF_MEM_UNKNOWN; @@ -280,7 +280,7 @@ static void *elf_kexec_load(struct kimage *image, char *kernel_buf, sizeof(kernel_start), 0); if (ret) pr_err("Error update purgatory ret=%d\n", ret); -#endif /* CONFIG_ARCH_SUPPORTS_KEXEC_PURGATORY */ +#endif /* CONFIG_KEXEC_FILE */ /* Add the initrd to the image */ if (initrd != NULL) { diff --git a/arch/s390/Kbuild b/arch/s390/Kbuild index a5d3503b353c..f2ce80b65551 100644 --- a/arch/s390/Kbuild +++ b/arch/s390/Kbuild @@ -7,7 +7,7 @@ obj-$(CONFIG_S390_HYPFS) += hypfs/ obj-$(CONFIG_APPLDATA_BASE) += appldata/ obj-y += net/ obj-$(CONFIG_PCI) += pci/ -obj-$(CONFIG_ARCH_SUPPORTS_KEXEC_PURGATORY) += purgatory/ +obj-$(CONFIG_KEXEC_FILE) += purgatory/ # for cleaning subdir- += boot tools diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig index 3bec98d20283..d5d8f99d1f25 100644 --- a/arch/s390/Kconfig +++ b/arch/s390/Kconfig @@ -254,13 +254,13 @@ config ARCH_SUPPORTS_KEXEC def_bool y config ARCH_SUPPORTS_KEXEC_FILE - def_bool CRYPTO && CRYPTO_SHA256 && CRYPTO_SHA256_S390 + def_bool y config ARCH_SUPPORTS_KEXEC_SIG def_bool MODULE_SIG_FORMAT config ARCH_SUPPORTS_KEXEC_PURGATORY - def_bool KEXEC_FILE + def_bool y config ARCH_SUPPORTS_CRASH_DUMP def_bool y diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 3762f41bb092..1566748f16c4 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -2072,7 +2072,7 @@ config ARCH_SUPPORTS_KEXEC def_bool y config ARCH_SUPPORTS_KEXEC_FILE - def_bool X86_64 && CRYPTO && CRYPTO_SHA256 + def_bool X86_64 config ARCH_SELECTS_KEXEC_FILE def_bool y @@ -2080,7 +2080,7 @@ config ARCH_SELECTS_KEXEC_FILE select HAVE_IMA_KEXEC if IMA config ARCH_SUPPORTS_KEXEC_PURGATORY - def_bool KEXEC_FILE + def_bool y config ARCH_SUPPORTS_KEXEC_SIG def_bool y diff --git a/kernel/Kconfig.kexec b/kernel/Kconfig.kexec index 7aff28ded2f4..92120e396008 100644 --- a/kernel/Kconfig.kexec +++
Re: [PATCH] scsi: ibmvfc: replace deprecated strncpy with strscpy
On Mon, Oct 30, 2023 at 07:04:33PM +, Justin Stitt wrote: > strncpy() is deprecated for use on NUL-terminated destination strings > [1] and as such we should prefer more robust and less ambiguous string > interfaces. > > We expect these fields to be NUL-terminated as the property names from > which they are derived are also NUL-terminated. > > Moreover, NUL-padding is not required as our destination buffers are > already NUL-allocated and any future NUL-byte assignments are redundant > (like the ones that strncpy() does). > ibmvfc_probe() -> > | struct ibmvfc_host *vhost; > | struct Scsi_Host *shost; > ... > | shost = scsi_host_alloc(_template, sizeof(*vhost)); > ... **side note: is this a bug? Looks like a type to me ^** I think this is the "privsize", so *vhost is correct, IIUC. > ... > | vhost = shost_priv(shost); I.e. vhost is a part of the shost allocation > > ... where shost_priv() is: > | static inline void *shost_priv(struct Scsi_Host *shost) > | { > | return (void *)shost->hostdata; > | } > > .. and: > scsi_host_alloc() -> > | shost = kzalloc(sizeof(struct Scsi_Host) + privsize, GFP_KERNEL); As seen here. :) > > And for login_info->..., NUL-padding is also not required as it is > explicitly memset to 0: > | memset(login_info, 0, sizeof(*login_info)); > > Considering the above, a suitable replacement is `strscpy` [2] due to > the fact that it guarantees NUL-termination on the destination buffer > without unnecessarily NUL-padding. > > Link: > https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings > [1] > Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html > [2] > Link: https://github.com/KSPP/linux/issues/90 > Cc: linux-harden...@vger.kernel.org > Signed-off-by: Justin Stitt Yeah, this conversion looks correct to me too. Reviewed-by: Kees Cook -Kees > --- > Note: build-tested only. > > Found with: $ rg "strncpy\(" > --- > drivers/scsi/ibmvscsi/ibmvfc.c | 12 +++- > 1 file changed, 7 insertions(+), 5 deletions(-) > > diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c > index ce9eb00e2ca0..e73a39b1c832 100644 > --- a/drivers/scsi/ibmvscsi/ibmvfc.c > +++ b/drivers/scsi/ibmvscsi/ibmvfc.c > @@ -1455,7 +1455,7 @@ static void ibmvfc_gather_partition_info(struct > ibmvfc_host *vhost) > > name = of_get_property(rootdn, "ibm,partition-name", NULL); > if (name) > - strncpy(vhost->partition_name, name, > sizeof(vhost->partition_name)); > + strscpy(vhost->partition_name, name, > sizeof(vhost->partition_name)); > num = of_get_property(rootdn, "ibm,partition-no", NULL); > if (num) > vhost->partition_number = *num; > @@ -1498,13 +1498,15 @@ static void ibmvfc_set_login_info(struct ibmvfc_host > *vhost) > login_info->async.va = cpu_to_be64(vhost->async_crq.msg_token); > login_info->async.len = cpu_to_be32(async_crq->size * > sizeof(*async_crq->msgs.async)); > - strncpy(login_info->partition_name, vhost->partition_name, > IBMVFC_MAX_NAME); > - strncpy(login_info->device_name, > - dev_name(>host->shost_gendev), IBMVFC_MAX_NAME); > + strscpy(login_info->partition_name, vhost->partition_name, > + sizeof(login_info->partition_name)); > + > + strscpy(login_info->device_name, > + dev_name(>host->shost_gendev), > sizeof(login_info->device_name)); > > location = of_get_property(of_node, "ibm,loc-code", NULL); > location = location ? location : dev_name(vhost->dev); > - strncpy(login_info->drc_name, location, IBMVFC_MAX_NAME); > + strscpy(login_info->drc_name, location, sizeof(login_info->drc_name)); > } > > /** > > --- > base-commit: ffc253263a1375a65fa6c9f62a893e9767fbebfa > change-id: 20231030-strncpy-drivers-scsi-ibmvscsi-ibmvfc-c-ccfce3255d58 > > Best regards, > -- > Justin Stitt > -- Kees Cook
Re: [PATCH] scsi: ibmvscsi: replace deprecated strncpy with strscpy
On Mon, Oct 30, 2023 at 08:40:48PM +, Justin Stitt wrote: > strncpy() is deprecated for use on NUL-terminated destination strings > [1] and as such we should prefer more robust and less ambiguous string > interfaces. > > We expect partition_name to be NUL-terminated based on its usage with > format strings: > | dev_info(hostdata->dev, "host srp version: %s, " > |"host partition %s (%d), OS %d, max io %u\n", > |hostdata->madapter_info.srp_version, > |hostdata->madapter_info.partition_name, > |be32_to_cpu(hostdata->madapter_info.partition_number), > |be32_to_cpu(hostdata->madapter_info.os_type), > |be32_to_cpu(hostdata->madapter_info.port_max_txu[0])); > ... > | len = snprintf(buf, PAGE_SIZE, "%s\n", > |hostdata->madapter_info.partition_name); > > Moreover, NUL-padding is not required as madapter_info is explicitly > memset to 0: > | memset(>madapter_info, 0x00, > | sizeof(hostdata->madapter_info)); > > Considering the above, a suitable replacement is `strscpy` [2] due to > the fact that it guarantees NUL-termination on the destination buffer > without unnecessarily NUL-padding. > > Link: > https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings > [1] > Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html > [2] > Link: https://github.com/KSPP/linux/issues/90 > Cc: linux-harden...@vger.kernel.org > Signed-off-by: Justin Stitt Agreed; this conversion looks correct to me too. Reviewed-by: Kees Cook -- Kees Cook
Re: [PATCH RFC 06/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing
On Fri, Nov 24, 2023 at 11:07:51AM -0500, Peter Xu wrote: > On Fri, Nov 24, 2023 at 09:06:01AM +, Ryan Roberts wrote: > > I don't have any micro-benchmarks for GUP though, if that's your question. > > Is > > there an easy-to-use test I can run to get some numbers? I'd be happy to > > try it out. > > Thanks Ryan. Then nothing is needed to be tested if gup is not yet touched > from your side, afaict. I'll see whether I can provide some rough numbers > instead in the next post (I'll probably only be able to test it in a VM, > though, but hopefully that should still reflect mostly the truth). An update: I finished a round of 64K cont_pte test, in the slow gup micro benchmark I see ~15% perf degrade with this patchset applied on a VM on top of Apple M1. Frankly that's even less than I expected, considering not only how slow gup THP used to be, but also on the fact that that's a tight loop over slow gup, which in normal cases shouldn't happen: "present" ptes normally goes to fast-gup, while !present goes into a fault following it. I assume that's why nobody cared slow gup for THP before. I think adding cont_pte support shouldn't be very hard, but that will include making cont_pte idea global just for arm64 and riscv Svnapot. The current plan is I'll add that performance number into my commit message only, as I don't ever expect any real workload will regress with it. Maybe a global cont_pte api will be needed at some point, but perhaps not yet feel strongly for this use case. Please feel free to raise any concerns otherwise. Thanks, -- Peter Xu
Re: [PATCH v4 05/13] powerpc/rtas: Facilitate high-level call sequences
Nathan Lynch writes: > Alternatively, sys_rtas() could be refactored into locking and > non-locking paths, e.g. > > static long __do_sys_rtas(struct rtas_function *func) > { > // [ ... acquire rtas_lock, enter RTAS, fetch any error etc ... ] > } Of course this conflicts with the code generated by SYSCALL_DEFINE1(rtas), so a different function name would be needed here.
Re: linux-next: build failure after merge of the mm tree
Hi all, On Mon, 27 Nov 2023 14:48:52 +1100 Stephen Rothwell wrote: > > Just cc'ing the PowerPC guys to see if my fix is sensible. > > On Mon, 27 Nov 2023 13:28:09 +1100 Stephen Rothwell > wrote: > > > > After merging the mm tree, today's linux-next build (powerpc64 > > allnoconfig) failed like this: > > > > arch/powerpc/mm/book3s64/pgtable.c:557:5: error: no previous prototype for > > 'pmd_move_must_withdraw' [-Werror=missing-prototypes] > > 557 | int pmd_move_must_withdraw(struct spinlock *new_pmd_ptl, > > | ^~ > > cc1: all warnings being treated as errors > > > > Caused by commit > > > > c6345dfa6e3e ("Makefile.extrawarn: turn on missing-prototypes globally") > > > > I have added the following patch for today (which could be applied to > > the mm or powerpc trees): > > > > From 194805b44c11b4c0aa28bdcdc0bb0d82acef394c Mon Sep 17 00:00:00 2001 > > From: Stephen Rothwell > > Date: Mon, 27 Nov 2023 13:08:57 +1100 > > Subject: [PATCH] powerpc: pmd_move_must_withdraw() is only needed for > > CONFIG_TRANSPARENT_HUGEPAGE > > > > Signed-off-by: Stephen Rothwell > > --- > > arch/powerpc/mm/book3s64/pgtable.c | 2 ++ > > 1 file changed, 2 insertions(+) > > > > diff --git a/arch/powerpc/mm/book3s64/pgtable.c > > b/arch/powerpc/mm/book3s64/pgtable.c > > index be229290a6a7..3438ab72c346 100644 > > --- a/arch/powerpc/mm/book3s64/pgtable.c > > +++ b/arch/powerpc/mm/book3s64/pgtable.c > > @@ -542,6 +542,7 @@ void ptep_modify_prot_commit(struct vm_area_struct > > *vma, unsigned long addr, > > set_pte_at(vma->vm_mm, addr, ptep, pte); > > } > > > > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE > > /* > > * For hash translation mode, we use the deposited table to store hash slot > > * information and they are stored at PTRS_PER_PMD offset from related pmd > > @@ -563,6 +564,7 @@ int pmd_move_must_withdraw(struct spinlock *new_pmd_ptl, > > > > return true; > > } > > +#endif > > > > /* > > * Does the CPU support tlbie? > > -- > > 2.40.1 I am still carrying this patch (it should probably go into the mm tree). Is someone going to pick it up (assuming it is correct)? -- Cheers, Stephen Rothwell pgpKkhduvXRmu.pgp Description: OpenPGP digital signature
Re: linux-next: build failure after merge of the mm tree
On Fri, 1 Dec 2023 09:04:39 +1100 Stephen Rothwell wrote: > Hi all, > > > > diff --git a/arch/powerpc/mm/book3s64/pgtable.c > > > b/arch/powerpc/mm/book3s64/pgtable.c > > > index be229290a6a7..3438ab72c346 100644 > > > --- a/arch/powerpc/mm/book3s64/pgtable.c > > > +++ b/arch/powerpc/mm/book3s64/pgtable.c > > > @@ -542,6 +542,7 @@ void ptep_modify_prot_commit(struct vm_area_struct > > > *vma, unsigned long addr, > > > set_pte_at(vma->vm_mm, addr, ptep, pte); > > > } > > > > > > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE > > > /* > > > * For hash translation mode, we use the deposited table to store hash > > > slot > > > * information and they are stored at PTRS_PER_PMD offset from related > > > pmd > > > @@ -563,6 +564,7 @@ int pmd_move_must_withdraw(struct spinlock > > > *new_pmd_ptl, > > > > > > return true; > > > } > > > +#endif > > > > > > /* > > > * Does the CPU support tlbie? > > > -- > > > 2.40.1 > > I am still carrying this patch (it should probably go into the mm > tree). Is someone going to pick it up (assuming it is correct)? AFAIK we're still awaiting input from the ppc team. I'll grab it. If it breaks things then we-told-you-so!
Re: linux-next: build failure after merge of the mm tree
Stephen Rothwell writes: > On Mon, 27 Nov 2023 14:48:52 +1100 Stephen Rothwell > wrote: >> >> Just cc'ing the PowerPC guys to see if my fix is sensible. >> >> On Mon, 27 Nov 2023 13:28:09 +1100 Stephen Rothwell >> wrote: >> > >> > After merging the mm tree, today's linux-next build (powerpc64 >> > allnoconfig) failed like this: >> > >> > arch/powerpc/mm/book3s64/pgtable.c:557:5: error: no previous prototype for >> > 'pmd_move_must_withdraw' [-Werror=missing-prototypes] >> > 557 | int pmd_move_must_withdraw(struct spinlock *new_pmd_ptl, >> > | ^~ >> > cc1: all warnings being treated as errors >> > >> > Caused by commit >> > >> > c6345dfa6e3e ("Makefile.extrawarn: turn on missing-prototypes globally") >> > >> > I have added the following patch for today (which could be applied to >> > the mm or powerpc trees): >> > >> > From 194805b44c11b4c0aa28bdcdc0bb0d82acef394c Mon Sep 17 00:00:00 2001 >> > From: Stephen Rothwell >> > Date: Mon, 27 Nov 2023 13:08:57 +1100 >> > Subject: [PATCH] powerpc: pmd_move_must_withdraw() is only needed for >> > CONFIG_TRANSPARENT_HUGEPAGE >> > >> > Signed-off-by: Stephen Rothwell >> > --- >> > arch/powerpc/mm/book3s64/pgtable.c | 2 ++ >> > 1 file changed, 2 insertions(+) >> > >> > diff --git a/arch/powerpc/mm/book3s64/pgtable.c >> > b/arch/powerpc/mm/book3s64/pgtable.c >> > index be229290a6a7..3438ab72c346 100644 >> > --- a/arch/powerpc/mm/book3s64/pgtable.c >> > +++ b/arch/powerpc/mm/book3s64/pgtable.c >> > @@ -542,6 +542,7 @@ void ptep_modify_prot_commit(struct vm_area_struct >> > *vma, unsigned long addr, >> >set_pte_at(vma->vm_mm, addr, ptep, pte); >> > } >> > >> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE >> > /* >> > * For hash translation mode, we use the deposited table to store hash >> > slot >> > * information and they are stored at PTRS_PER_PMD offset from related pmd >> > @@ -563,6 +564,7 @@ int pmd_move_must_withdraw(struct spinlock >> > *new_pmd_ptl, >> > >> >return true; >> > } >> > +#endif >> > >> > /* >> > * Does the CPU support tlbie? >> > -- >> > 2.40.1 > > I am still carrying this patch (it should probably go into the mm > tree). Is someone going to pick it up (assuming it is correct)? I applied it to my next a few days ago, but I must have forgotten to push. It's in there now. cheers
Re: [PATCH v4 05/13] powerpc/rtas: Facilitate high-level call sequences
Nathan Lynch writes: > Michael Ellerman writes: >> Nathan Lynch writes: >>> Michael Ellerman writes: Nathan Lynch via B4 Relay writes: > From: Nathan Lynch > > On RTAS platforms there is a general restriction that the OS must not > enter RTAS on more than one CPU at a time. This low-level > serialization requirement is satisfied by holding a spin > lock (rtas_lock) across most RTAS function invocations. ... > diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c > index 1fc0b3fffdd1..52f2242d0c28 100644 > --- a/arch/powerpc/kernel/rtas.c > +++ b/arch/powerpc/kernel/rtas.c > @@ -581,6 +652,28 @@ static const struct rtas_function > *rtas_token_to_function(s32 token) > return NULL; > } > > +static void __rtas_function_lock(struct rtas_function *func) > +{ > + if (func && func->lock) > + mutex_lock(func->lock); > +} This is obviously going to defeat most static analysis tools. >>> >>> I guess it's not that obvious to me :-) Is it because the mutex_lock() >>> is conditional? I'll improve this if it's possible. >> >> Well maybe I'm not giving modern static analysis tools enough credit :) >> >> But what I mean that it's not easy to reason about what the function >> does in isolation. ie. all you can say is that it may or may not lock a >> mutex, and you can't say which mutex. > > I've pulled the thread on this a little bit and here is what I can do: > > * Discard rtas_lock_function() and rtas_unlock_function() and make the > function mutexes extern as needed. As of now only > rtas_ibm_get_vpd_lock will need to be exposed. This enables us to put > __acquires(), __releases(), and __must_hold() annotations in > papr-vpd.c since it explicitly manipulates the mutex. > > * Then sys_rtas() becomes the only site that needs > __rtas_function_lock() and __rtas_function_unlock(), which can be > open-coded and commented (and, one hopes, not emulated elsewhere). > > This will look something like: > > SYSCALL_DEFINE1(rtas, struct rtas_args __user *, uargs) > { > struct rtas_function *func = rtas_token_to_function(token); > > if (func->lock) > mutex_lock(func->lock); > > [ ... acquire rtas_lock, enter RTAS, fetch any errors ... ] > > if (func->lock) > mutex_unlock(func->lock); > > The indirection seems unavoidable since we're working backwards from a > token value (supplied by the user and not known at build time) to the > function descriptor. > > Is that tolerable for now? Yeah. Thanks for looking into it. I wasn't unhappy with the original version, but just slightly uneasy about the locking via pointer. But that new proposal sounds good, more code will have static lock annotations, and only sys_rtas() which is already weird, will have the dynamic stuff. > Alternatively, sys_rtas() could be refactored into locking and > non-locking paths, e.g. > > static long __do_sys_rtas(struct rtas_function *func) > { > // [ ... acquire rtas_lock, enter RTAS, fetch any error etc ... ] > } > > static long do_sys_rtas(struct rtas_function *func, struct mutex *mtx) > { > mutex_lock(mtx); > ret = __do_sys_rtas(func); > mutex_unlock(mtx); > > return ret; > } > > SYSCALL_DEFINE1(rtas, struct rtas_args __user *, uargs) > { > // real code does copy_from_user etc > struct rtas_function *func = rtas_token_to_function(uargs->token); > long ret; > > // [ ... input validation and filtering ... ] > > if (func->lock) > ret = do_sys_rtas(func, func->lock); > else > ret = __do_sys_rtas(func); > > // [ ... copy out results ... ] > > return ret; > } You could go even further and switch on the token, and handle each case separately so that you can then statically take the appropriate lock. But that's probably overkill. cheers
Re: linux-next: build failure after merge of the mm tree
On Fri, 01 Dec 2023 09:39:20 +1100 Michael Ellerman wrote: > > I am still carrying this patch (it should probably go into the mm > > tree). Is someone going to pick it up (assuming it is correct)? > > I applied it to my next a few days ago, but I must have forgotten to > push. It's in there now. I'll keep a copy in mm.git, to keep the dependencies nice. I added your acked-by.
Re: [PATCH v4 2/4] PCI: layerscape: Add suspend/resume for ls1021a
On Wed, Nov 29, 2023 at 04:44:10PM -0500, Frank Li wrote: > ls1021a add suspend/resume support. > "Add suspend/resume support for Layerscape LS1021a" > In the suspend path, PME_Turn_Off message is sent to the endpoint to > transition the link to L2/L3_Ready state. In this SoC, there is no way to > check if the controller has received the PME_To_Ack from the endpoint or > not. So to be on the safer side, the driver just waits for > PCIE_PME_TO_L2_TIMEOUT_US before asserting the SoC specific PMXMTTURNOFF > bit to complete the PME_Turn_Off handshake. This link would then enter "Then the link would enter L2/L3 state depending on the VAUX supply." > L2/L3 state depending on the VAUX supply. > > In the resume path, the link is brought back from L2 to L0 by doing a > software reset. > > Signed-off-by: Frank Li Couple of comments below. But the patch is looking good now. Thanks for the update. > --- > > Notes: > Change from v3 to v4 > - update commit message. > - it is reset a glue logic part for PCI controller. > - use regmap_write_bits() to reduce code change. > > Change from v2 to v3 > - update according to mani's feedback > change from v1 to v2 > - change subject 'a' to 'A' > > drivers/pci/controller/dwc/pci-layerscape.c | 83 - > 1 file changed, 82 insertions(+), 1 deletion(-) > > diff --git a/drivers/pci/controller/dwc/pci-layerscape.c > b/drivers/pci/controller/dwc/pci-layerscape.c > index aea89926bcc4f..42bca2c3b5c3e 100644 > --- a/drivers/pci/controller/dwc/pci-layerscape.c > +++ b/drivers/pci/controller/dwc/pci-layerscape.c > @@ -35,11 +35,19 @@ > #define PF_MCR_PTOMR BIT(0) > #define PF_MCR_EXL2S BIT(1) > > +/* LS1021A PEXn PM Write Control Register */ > +#define SCFG_PEXPMWRCR(idx) (0x5c + (idx) * 0x64) > +#define PMXMTTURNOFF BIT(31) > +#define SCFG_PEXSFTRSTCR 0x190 > +#define PEXSR(idx) BIT(idx) > + > #define PCIE_IATU_NUM6 > > struct ls_pcie_drvdata { > const u32 pf_off; > + const struct dw_pcie_host_ops *ops; > int (*exit_from_l2)(struct dw_pcie_rp *pp); > + bool scfg_support; > bool pm_support; > }; > > @@ -47,6 +55,8 @@ struct ls_pcie { > struct dw_pcie *pci; > const struct ls_pcie_drvdata *drvdata; > void __iomem *pf_base; > + struct regmap *scfg; > + int index; > bool big_endian; > }; > > @@ -171,13 +181,65 @@ static int ls_pcie_host_init(struct dw_pcie_rp *pp) > return 0; > } > > +static void scfg_pcie_send_turnoff_msg(struct regmap *scfg, u32 reg, u32 > mask) > +{ > + /* Send PME_Turn_Off message */ > + regmap_write_bits(scfg, reg, mask, mask); > + > + /* > + * There is no specific register to check for PME_To_Ack from endpoint. > + * So on the safe side, wait for PCIE_PME_TO_L2_TIMEOUT_US. > + */ > + mdelay(PCIE_PME_TO_L2_TIMEOUT_US/1000); > + > + /* > + * Layerscape hardware reference manual recommends clearing the > PMXMTTURNOFF bit > + * to complete the PME_Turn_Off handshake. > + */ > + regmap_write_bits(scfg, reg, mask, 0); > +} > + > +static void ls1021a_pcie_send_turnoff_msg(struct dw_pcie_rp *pp) > +{ > + struct dw_pcie *pci = to_dw_pcie_from_pp(pp); > + struct ls_pcie *pcie = to_ls_pcie(pci); > + > + scfg_pcie_send_turnoff_msg(pcie->scfg, SCFG_PEXPMWRCR(pcie->index), > PMXMTTURNOFF); > +} > + > +static int scfg_pcie_exit_from_l2(struct regmap *scfg, u32 reg, u32 mask) > +{ > + /* Only way exit from l2 is that do software reset */ "Only way exit from L2 is by doing a software reset of the controller" I really hope that the reset is not a global controller reset i.e., not resetting all CSRs. > + regmap_write_bits(scfg, reg, mask, mask); > + No need of a newline. > + regmap_write_bits(scfg, reg, mask, 0); > + > + return 0; > +} > + > +static int ls1021a_pcie_exit_from_l2(struct dw_pcie_rp *pp) > +{ > + struct dw_pcie *pci = to_dw_pcie_from_pp(pp); > + struct ls_pcie *pcie = to_ls_pcie(pci); > + > + return scfg_pcie_exit_from_l2(pcie->scfg, SCFG_PEXSFTRSTCR, > PEXSR(pcie->index)); > +} > + > static const struct dw_pcie_host_ops ls_pcie_host_ops = { > .host_init = ls_pcie_host_init, > .pme_turn_off = ls_pcie_send_turnoff_msg, > }; > > +static const struct dw_pcie_host_ops ls1021a_pcie_host_ops = { > + .host_init = ls_pcie_host_init, > + .pme_turn_off = ls1021a_pcie_send_turnoff_msg, > +}; > + > static const struct ls_pcie_drvdata ls1021a_drvdata = { > - .pm_support = false, > + .pm_support = true, > + .scfg_support = true, > + .ops = _pcie_host_ops, > + .exit_from_l2 = ls1021a_pcie_exit_from_l2, > }; > > static const struct ls_pcie_drvdata layerscape_drvdata = { > @@ -205,6 +267,8 @@ static int ls_pcie_probe(struct platform_device *pdev) > struct dw_pcie *pci; > struct ls_pcie *pcie; > struct resource *dbi_base; >
Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
- Original Message - > From: "Michael Ellerman" > To: "Timothy Pearson" > Cc: "Jens Axboe" , "regressions" > , "npiggin" , > "christophe leroy" , "linuxppc-dev" > > Sent: Tuesday, November 28, 2023 6:57:01 AM > Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec > register save > Michael Ellerman writes: >> Timothy Pearson writes: >> >>> Just wanted to check back and see if this patch was going to be queued >>> up soon? We're still having to work around / advertise the data >>> destruction issues the underlying bug is causing on e.g. Debian >>> Stable. >> >> Yeah I'll apply it this week, so it will be in rc4. > > I reworked the change log to include the exact call path I identified > instead of the more high level description you had. And tweaked a few > other bits of wording and so on, apparently fr0 is a kernelism, the ABI > and binutils calls it f0. > > I'm not sure how wedded you were to your change log, so if you dislike > my edits let me know and we can come up with a joint one. > > The actual patch is unchanged. > > cheers The commit message looks OK to me. I've also seen application crashes as a result of the register corruption, but that may be a minor detail that isn't really worth updating things over at this point -- those come from e.g. glibc using vs0 as part of a path that processes pointer information, typically seen where there's a need to replicate the same pointer to adjacent fields in a data struct.
Re: [PATCH v4 3/4] PCI: layerscape: Rename pf_* as pf_lut_*
On Wed, Nov 29, 2023 at 04:44:11PM -0500, Frank Li wrote: > 'pf' and 'lut' is just difference name in difference chips, but basic it is > a MMIO base address plus an offset. > > Rename it to avoid duplicate pf_* and lut_* in driver. > > Signed-off-by: Frank Li Reviewed-by: Manivannan Sadhasivam Can you fix the name in pci-layerscape-ep.c also? - Mani > --- > > Notes: > pf_lut is better than pf_* or lut* because some chip use 'pf', some chip > use 'lut'. > > change from v1 to v4 > - new patch at v3 > > drivers/pci/controller/dwc/pci-layerscape.c | 34 ++--- > 1 file changed, 17 insertions(+), 17 deletions(-) > > diff --git a/drivers/pci/controller/dwc/pci-layerscape.c > b/drivers/pci/controller/dwc/pci-layerscape.c > index 42bca2c3b5c3e..590e07bb27002 100644 > --- a/drivers/pci/controller/dwc/pci-layerscape.c > +++ b/drivers/pci/controller/dwc/pci-layerscape.c > @@ -44,7 +44,7 @@ > #define PCIE_IATU_NUM6 > > struct ls_pcie_drvdata { > - const u32 pf_off; > + const u32 pf_lut_off; > const struct dw_pcie_host_ops *ops; > int (*exit_from_l2)(struct dw_pcie_rp *pp); > bool scfg_support; > @@ -54,13 +54,13 @@ struct ls_pcie_drvdata { > struct ls_pcie { > struct dw_pcie *pci; > const struct ls_pcie_drvdata *drvdata; > - void __iomem *pf_base; > + void __iomem *pf_lut_base; > struct regmap *scfg; > int index; > bool big_endian; > }; > > -#define ls_pcie_pf_readl_addr(addr) ls_pcie_pf_readl(pcie, addr) > +#define ls_pcie_pf_lut_readl_addr(addr) ls_pcie_pf_lut_readl(pcie, addr) > #define to_ls_pcie(x)dev_get_drvdata((x)->dev) > > static bool ls_pcie_is_bridge(struct ls_pcie *pcie) > @@ -101,20 +101,20 @@ static void ls_pcie_fix_error_response(struct ls_pcie > *pcie) > iowrite32(PCIE_ABSERR_SETTING, pci->dbi_base + PCIE_ABSERR); > } > > -static u32 ls_pcie_pf_readl(struct ls_pcie *pcie, u32 off) > +static u32 ls_pcie_pf_lut_readl(struct ls_pcie *pcie, u32 off) > { > if (pcie->big_endian) > - return ioread32be(pcie->pf_base + off); > + return ioread32be(pcie->pf_lut_base + off); > > - return ioread32(pcie->pf_base + off); > + return ioread32(pcie->pf_lut_base + off); > } > > -static void ls_pcie_pf_writel(struct ls_pcie *pcie, u32 off, u32 val) > +static void ls_pcie_pf_lut_writel(struct ls_pcie *pcie, u32 off, u32 val) > { > if (pcie->big_endian) > - iowrite32be(val, pcie->pf_base + off); > + iowrite32be(val, pcie->pf_lut_base + off); > else > - iowrite32(val, pcie->pf_base + off); > + iowrite32(val, pcie->pf_lut_base + off); > } > > static void ls_pcie_send_turnoff_msg(struct dw_pcie_rp *pp) > @@ -124,11 +124,11 @@ static void ls_pcie_send_turnoff_msg(struct dw_pcie_rp > *pp) > u32 val; > int ret; > > - val = ls_pcie_pf_readl(pcie, LS_PCIE_PF_MCR); > + val = ls_pcie_pf_lut_readl(pcie, LS_PCIE_PF_MCR); > val |= PF_MCR_PTOMR; > - ls_pcie_pf_writel(pcie, LS_PCIE_PF_MCR, val); > + ls_pcie_pf_lut_writel(pcie, LS_PCIE_PF_MCR, val); > > - ret = readx_poll_timeout(ls_pcie_pf_readl_addr, LS_PCIE_PF_MCR, > + ret = readx_poll_timeout(ls_pcie_pf_lut_readl_addr, LS_PCIE_PF_MCR, >val, !(val & PF_MCR_PTOMR), >PCIE_PME_TO_L2_TIMEOUT_US/10, >PCIE_PME_TO_L2_TIMEOUT_US); > @@ -147,15 +147,15 @@ static int ls_pcie_exit_from_l2(struct dw_pcie_rp *pp) >* Set PF_MCR_EXL2S bit in LS_PCIE_PF_MCR register for the link >* to exit L2 state. >*/ > - val = ls_pcie_pf_readl(pcie, LS_PCIE_PF_MCR); > + val = ls_pcie_pf_lut_readl(pcie, LS_PCIE_PF_MCR); > val |= PF_MCR_EXL2S; > - ls_pcie_pf_writel(pcie, LS_PCIE_PF_MCR, val); > + ls_pcie_pf_lut_writel(pcie, LS_PCIE_PF_MCR, val); > > /* >* L2 exit timeout of 10ms is not defined in the specifications, >* it was chosen based on empirical observations. >*/ > - ret = readx_poll_timeout(ls_pcie_pf_readl_addr, LS_PCIE_PF_MCR, > + ret = readx_poll_timeout(ls_pcie_pf_lut_readl_addr, LS_PCIE_PF_MCR, >val, !(val & PF_MCR_EXL2S), >1000, >1); > @@ -243,7 +243,7 @@ static const struct ls_pcie_drvdata ls1021a_drvdata = { > }; > > static const struct ls_pcie_drvdata layerscape_drvdata = { > - .pf_off = 0xc, > + .pf_lut_off = 0xc, > .pm_support = true, > .exit_from_l2 = ls_pcie_exit_from_l2, > }; > @@ -293,7 +293,7 @@ static int ls_pcie_probe(struct platform_device *pdev) > > pcie->big_endian = of_property_read_bool(dev->of_node, "big-endian"); > > - pcie->pf_base = pci->dbi_base + pcie->drvdata->pf_off; > + pcie->pf_lut_base = pci->dbi_base + pcie->drvdata->pf_lut_off; >
Re: [PATCH v4 4/4] PCI: layerscape: Add suspend/resume for ls1043a
On Wed, Nov 29, 2023 at 04:44:12PM -0500, Frank Li wrote: > In the suspend path, PME_Turn_Off message is sent to the endpoint to > transition the link to L2/L3_Ready state. In this SoC, there is no way to > check if the controller has received the PME_To_Ack from the endpoint or > not. So to be on the safer side, the driver just waits for > PCIE_PME_TO_L2_TIMEOUT_US before asserting the SoC specific PMXMTTURNOFF > bit to complete the PME_Turn_Off handshake. This link would then enter > L2/L3 state depending on the VAUX supply. > > In the resume path, the link is brought back from L2 to L0 by doing a > software reset. > Same comment on the patch description as on patch 2/4. > Signed-off-by: Frank Li > --- > > Notes: > Change from v3 to v4 > - Call scfg_pcie_send_turnoff_msg() shared with ls1021a > - update commit message > > Change from v2 to v3 > - Remove ls_pcie_lut_readl(writel) function > > Change from v1 to v2 > - Update subject 'a' to 'A' > > drivers/pci/controller/dwc/pci-layerscape.c | 63 - > 1 file changed, 62 insertions(+), 1 deletion(-) > > diff --git a/drivers/pci/controller/dwc/pci-layerscape.c > b/drivers/pci/controller/dwc/pci-layerscape.c > index 590e07bb27002..d39700b3afaaa 100644 > --- a/drivers/pci/controller/dwc/pci-layerscape.c > +++ b/drivers/pci/controller/dwc/pci-layerscape.c > @@ -41,6 +41,15 @@ > #define SCFG_PEXSFTRSTCR 0x190 > #define PEXSR(idx) BIT(idx) > > +/* LS1043A PEX PME control register */ > +#define SCFG_PEXPMECR0x144 > +#define PEXPME(idx) BIT(31 - (idx) * 4) > + > +/* LS1043A PEX LUT debug register */ > +#define LS_PCIE_LDBG 0x7fc > +#define LDBG_SR BIT(30) > +#define LDBG_WE BIT(31) > + > #define PCIE_IATU_NUM6 > > struct ls_pcie_drvdata { > @@ -225,6 +234,45 @@ static int ls1021a_pcie_exit_from_l2(struct dw_pcie_rp > *pp) > return scfg_pcie_exit_from_l2(pcie->scfg, SCFG_PEXSFTRSTCR, > PEXSR(pcie->index)); > } > > +static void ls1043a_pcie_send_turnoff_msg(struct dw_pcie_rp *pp) > +{ > + struct dw_pcie *pci = to_dw_pcie_from_pp(pp); > + struct ls_pcie *pcie = to_ls_pcie(pci); > + > + scfg_pcie_send_turnoff_msg(pcie->scfg, SCFG_PEXPMECR, > PEXPME(pcie->index)); > +} > + > +static int ls1043a_pcie_exit_from_l2(struct dw_pcie_rp *pp) > +{ > + struct dw_pcie *pci = to_dw_pcie_from_pp(pp); > + struct ls_pcie *pcie = to_ls_pcie(pci); > + u32 val; > + > + /* > + * Only way let PEX module exit L2 is do a software reset. Can you expand PEX? What is it used for? Also if the reset is only for the PEX module, please use the same comment in both patches 2 and 4. Patch 2 doesn't mention PEX in the comment. - Mani > + * LDBG_WE: allows the user to have write access to the PEXDBG[SR] for > both setting and > + * clearing the soft reset on the PEX module. > + * LDBG_SR: When SR is set to 1, the PEX module enters soft reset. > + */ > + val = ls_pcie_pf_lut_readl(pcie, LS_PCIE_LDBG); > + val |= LDBG_WE; > + ls_pcie_pf_lut_writel(pcie, LS_PCIE_LDBG, val); > + > + val = ls_pcie_pf_lut_readl(pcie, LS_PCIE_LDBG); > + val |= LDBG_SR; > + ls_pcie_pf_lut_writel(pcie, LS_PCIE_LDBG, val); > + > + val = ls_pcie_pf_lut_readl(pcie, LS_PCIE_LDBG); > + val &= ~LDBG_SR; > + ls_pcie_pf_lut_writel(pcie, LS_PCIE_LDBG, val); > + > + val = ls_pcie_pf_lut_readl(pcie, LS_PCIE_LDBG); > + val &= ~LDBG_WE; > + ls_pcie_pf_lut_writel(pcie, LS_PCIE_LDBG, val); > + > + return 0; > +} > + > static const struct dw_pcie_host_ops ls_pcie_host_ops = { > .host_init = ls_pcie_host_init, > .pme_turn_off = ls_pcie_send_turnoff_msg, > @@ -242,6 +290,19 @@ static const struct ls_pcie_drvdata ls1021a_drvdata = { > .exit_from_l2 = ls1021a_pcie_exit_from_l2, > }; > > +static const struct dw_pcie_host_ops ls1043a_pcie_host_ops = { > + .host_init = ls_pcie_host_init, > + .pme_turn_off = ls1043a_pcie_send_turnoff_msg, > +}; > + > +static const struct ls_pcie_drvdata ls1043a_drvdata = { > + .pf_lut_off = 0x1, > + .pm_support = true, > + .scfg_support = true, > + .ops = _pcie_host_ops, > + .exit_from_l2 = ls1043a_pcie_exit_from_l2, > +}; > + > static const struct ls_pcie_drvdata layerscape_drvdata = { > .pf_lut_off = 0xc, > .pm_support = true, > @@ -252,7 +313,7 @@ static const struct of_device_id ls_pcie_of_match[] = { > { .compatible = "fsl,ls1012a-pcie", .data = _drvdata }, > { .compatible = "fsl,ls1021a-pcie", .data = _drvdata }, > { .compatible = "fsl,ls1028a-pcie", .data = _drvdata }, > - { .compatible = "fsl,ls1043a-pcie", .data = _drvdata }, > + { .compatible = "fsl,ls1043a-pcie", .data = _drvdata }, > { .compatible = "fsl,ls1046a-pcie", .data = _drvdata }, > { .compatible = "fsl,ls2080a-pcie", .data = _drvdata }, > {
Re: [PATCH] powerpc/irq: Allow softirq to hardirq stack transition
Le 30/11/2023 à 13:50, Michael Ellerman a écrit : > Allow a transition from the softirq stack to the hardirq stack when > handling a hardirq. Doing so means a hardirq received while deep in > softirq processing is less likely to cause a stack overflow of the > softirq stack. > > Previously it wasn't safe to do so because irq_exit() (which initiates > softirq processing) was called on the hardirq stack. > > That was changed in commit 1b1b6a6f4cc0 ("powerpc: handle irq_enter/ > irq_exit in interrupt handler wrappers") and 1346d00e1bdf ("powerpc: > Don't select HAVE_IRQ_EXIT_ON_IRQ_STACK"). > > The allowed transitions are now: > - process stack -> hardirq stack > - process stack -> softirq stack > - process stack -> softirq stack -> hardirq stack It means you don't like my patch https://patchwork.ozlabs.org/project/linuxppc-dev/patch/6cd9d8bb2258d8b51999c2584eac74423d2b5e29.1657203774.git.christophe.le...@csgroup.eu/ ? I never got any feedback. > > Signed-off-by: Michael Ellerman > --- > arch/powerpc/kernel/irq.c | 5 ++--- > 1 file changed, 2 insertions(+), 3 deletions(-) > > diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c > index 6f7d4edaa0bc..7504ceec5c58 100644 > --- a/arch/powerpc/kernel/irq.c > +++ b/arch/powerpc/kernel/irq.c > @@ -284,15 +284,14 @@ static __always_inline void call_do_irq(struct pt_regs > *regs, void *sp) > void __do_IRQ(struct pt_regs *regs) > { > struct pt_regs *old_regs = set_irq_regs(regs); > - void *cursp, *irqsp, *sirqsp; > + void *cursp, *irqsp; > > /* Switch to the irq stack to handle this */ > cursp = (void *)(current_stack_pointer & ~(THREAD_SIZE - 1)); > irqsp = hardirq_ctx[raw_smp_processor_id()]; > - sirqsp = softirq_ctx[raw_smp_processor_id()]; > > /* Already there ? If not switch stack and call */ > - if (unlikely(cursp == irqsp || cursp == sirqsp)) > + if (unlikely(cursp == irqsp)) > __do_irq(regs, current_stack_pointer); > else > call_do_irq(regs, irqsp);
Re: [PATCH 1/2] kexec: fix KEXEC_FILE dependencies
On Thu, 2 Nov 2023 16:03:18 +0800 Baoquan He wrote: > > > CONFIG_KEXEC_FILE, but still get purgatory code built in which is > > > totally useless. > > > > > > Not sure if I think too much over this. > > > > I see your point here, and I would suggest changing the > > CONFIG_ARCH_SUPPORTS_KEXEC_PURGATORY symbol to just indicate > > the availability of the purgatory code for the arch, rather > > than actually controlling the code itself. I already mentioned > > this for s390, but riscv would need the same thing on top. > > > > I think the change below should address your concern. > > Since no new comment, do you mind spinning v2 to wrap all these up? This patchset remains in mm-hotfixes-unstable from the previous -rc cycle. Eric, do you have any comments? Arnd, do you plan on a v2? If not, should I merge v1? If so, should I now add cc:stable?
Re: [PATCH 2/3] powerpc/pseries/memhp: Remove unbalanced dlpar_release_drc() call
> On Nov 28, 2023, at 9:21 AM, Nathan Lynch wrote: > > Nick Child writes: >> Hi Nathan, >> Patches 1 and 3 LGTM > > thanks. > >> Regarding this patch, dlpar_memory_remove_by_count() calls >> dlpar_add_lmb() and does not free drc on add error. >> dlpar_add_lmb() is called here in error recovery so probably >> not a big deal. >> >> This is all new code to me but it looks like if the requested >> number of lmbs cannot be removed then it attempts to add back >> the ones that were successfully removed. So if you cannot add >> an lmb that WAS successfully removed, it seems sane to also >> release the drc. > > Maybe I'll drop this one for now and turn my attention to removing all > the high-level rollback logic in this code. There's no reliable way to > accomplish what it's trying to do (i.e. restore the original quantity of > LMBs) and it just complicates things. Removing all of the rollback code is a wonderful idea.
Re: [PATCH v4 4/4] PCI: layerscape: Add suspend/resume for ls1043a
On Thu, Nov 30, 2023 at 10:21:00PM +0530, Manivannan Sadhasivam wrote: > On Wed, Nov 29, 2023 at 04:44:12PM -0500, Frank Li wrote: > > In the suspend path, PME_Turn_Off message is sent to the endpoint to > > transition the link to L2/L3_Ready state. In this SoC, there is no way to > > check if the controller has received the PME_To_Ack from the endpoint or > > not. So to be on the safer side, the driver just waits for > > PCIE_PME_TO_L2_TIMEOUT_US before asserting the SoC specific PMXMTTURNOFF > > bit to complete the PME_Turn_Off handshake. This link would then enter > > L2/L3 state depending on the VAUX supply. > > > > In the resume path, the link is brought back from L2 to L0 by doing a > > software reset. > > > > Same comment on the patch description as on patch 2/4. > > > Signed-off-by: Frank Li > > --- > > > > Notes: > > Change from v3 to v4 > > - Call scfg_pcie_send_turnoff_msg() shared with ls1021a > > - update commit message > > > > Change from v2 to v3 > > - Remove ls_pcie_lut_readl(writel) function > > > > Change from v1 to v2 > > - Update subject 'a' to 'A' > > > > drivers/pci/controller/dwc/pci-layerscape.c | 63 - > > 1 file changed, 62 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/pci/controller/dwc/pci-layerscape.c > > b/drivers/pci/controller/dwc/pci-layerscape.c > > index 590e07bb27002..d39700b3afaaa 100644 > > --- a/drivers/pci/controller/dwc/pci-layerscape.c > > +++ b/drivers/pci/controller/dwc/pci-layerscape.c > > @@ -41,6 +41,15 @@ > > #define SCFG_PEXSFTRSTCR 0x190 > > #define PEXSR(idx) BIT(idx) > > > > +/* LS1043A PEX PME control register */ > > +#define SCFG_PEXPMECR 0x144 > > +#define PEXPME(idx)BIT(31 - (idx) * 4) > > + > > +/* LS1043A PEX LUT debug register */ > > +#define LS_PCIE_LDBG 0x7fc > > +#define LDBG_SRBIT(30) > > +#define LDBG_WEBIT(31) > > + > > #define PCIE_IATU_NUM 6 > > > > struct ls_pcie_drvdata { > > @@ -225,6 +234,45 @@ static int ls1021a_pcie_exit_from_l2(struct dw_pcie_rp > > *pp) > > return scfg_pcie_exit_from_l2(pcie->scfg, SCFG_PEXSFTRSTCR, > > PEXSR(pcie->index)); > > } > > > > +static void ls1043a_pcie_send_turnoff_msg(struct dw_pcie_rp *pp) > > +{ > > + struct dw_pcie *pci = to_dw_pcie_from_pp(pp); > > + struct ls_pcie *pcie = to_ls_pcie(pci); > > + > > + scfg_pcie_send_turnoff_msg(pcie->scfg, SCFG_PEXPMECR, > > PEXPME(pcie->index)); > > +} > > + > > +static int ls1043a_pcie_exit_from_l2(struct dw_pcie_rp *pp) > > +{ > > + struct dw_pcie *pci = to_dw_pcie_from_pp(pp); > > + struct ls_pcie *pcie = to_ls_pcie(pci); > > + u32 val; > > + > > + /* > > +* Only way let PEX module exit L2 is do a software reset. > > Can you expand PEX? What is it used for? > > Also if the reset is only for the PEX module, please use the same comment in > both patches 2 and 4. Patch 2 doesn't mention PEX in the comment. After read spec again, I think PEX is pci express. So it should software reset controller. I don't know what exactly did in the chip. But without below code, PCIe can't exit L2/L3. Any harmful if dwc controller reset? Anyway these code works well with intel network card. Frank > > - Mani > > > +* LDBG_WE: allows the user to have write access to the PEXDBG[SR] for > > both setting and > > +* clearing the soft reset on the PEX module. > > +* LDBG_SR: When SR is set to 1, the PEX module enters soft reset. > > +*/ > > + val = ls_pcie_pf_lut_readl(pcie, LS_PCIE_LDBG); > > + val |= LDBG_WE; > > + ls_pcie_pf_lut_writel(pcie, LS_PCIE_LDBG, val); > > + > > + val = ls_pcie_pf_lut_readl(pcie, LS_PCIE_LDBG); > > + val |= LDBG_SR; > > + ls_pcie_pf_lut_writel(pcie, LS_PCIE_LDBG, val); > > + > > + val = ls_pcie_pf_lut_readl(pcie, LS_PCIE_LDBG); > > + val &= ~LDBG_SR; > > + ls_pcie_pf_lut_writel(pcie, LS_PCIE_LDBG, val); > > + > > + val = ls_pcie_pf_lut_readl(pcie, LS_PCIE_LDBG); > > + val &= ~LDBG_WE; > > + ls_pcie_pf_lut_writel(pcie, LS_PCIE_LDBG, val); > > + > > + return 0; > > +} > > + > > static const struct dw_pcie_host_ops ls_pcie_host_ops = { > > .host_init = ls_pcie_host_init, > > .pme_turn_off = ls_pcie_send_turnoff_msg, > > @@ -242,6 +290,19 @@ static const struct ls_pcie_drvdata ls1021a_drvdata = { > > .exit_from_l2 = ls1021a_pcie_exit_from_l2, > > }; > > > > +static const struct dw_pcie_host_ops ls1043a_pcie_host_ops = { > > + .host_init = ls_pcie_host_init, > > + .pme_turn_off = ls1043a_pcie_send_turnoff_msg, > > +}; > > + > > +static const struct ls_pcie_drvdata ls1043a_drvdata = { > > + .pf_lut_off = 0x1, > > + .pm_support = true, > > + .scfg_support = true, > > + .ops = _pcie_host_ops, > > + .exit_from_l2 = ls1043a_pcie_exit_from_l2, > > +}; > > + > > static const struct ls_pcie_drvdata layerscape_drvdata = { > > .pf_lut_off = 0xc, > > .pm_support = true, > > @@
Re: [PATCH v4 4/4] PCI: layerscape: Add suspend/resume for ls1043a
On Thu, Nov 30, 2023 at 03:17:39PM -0500, Frank Li wrote: > On Thu, Nov 30, 2023 at 10:21:00PM +0530, Manivannan Sadhasivam wrote: > > On Wed, Nov 29, 2023 at 04:44:12PM -0500, Frank Li wrote: > > > In the suspend path, PME_Turn_Off message is sent to the endpoint to > > > transition the link to L2/L3_Ready state. In this SoC, there is no way to > > > check if the controller has received the PME_To_Ack from the endpoint or > > > not. So to be on the safer side, the driver just waits for > > > PCIE_PME_TO_L2_TIMEOUT_US before asserting the SoC specific PMXMTTURNOFF > > > bit to complete the PME_Turn_Off handshake. This link would then enter > > > L2/L3 state depending on the VAUX supply. > > > > > > In the resume path, the link is brought back from L2 to L0 by doing a > > > software reset. > > > > > > > Same comment on the patch description as on patch 2/4. > > > > > Signed-off-by: Frank Li > > > --- > > > > > > Notes: > > > Change from v3 to v4 > > > - Call scfg_pcie_send_turnoff_msg() shared with ls1021a > > > - update commit message > > > > > > Change from v2 to v3 > > > - Remove ls_pcie_lut_readl(writel) function > > > > > > Change from v1 to v2 > > > - Update subject 'a' to 'A' > > > > > > drivers/pci/controller/dwc/pci-layerscape.c | 63 - > > > 1 file changed, 62 insertions(+), 1 deletion(-) > > > > > > diff --git a/drivers/pci/controller/dwc/pci-layerscape.c > > > b/drivers/pci/controller/dwc/pci-layerscape.c > > > index 590e07bb27002..d39700b3afaaa 100644 > > > --- a/drivers/pci/controller/dwc/pci-layerscape.c > > > +++ b/drivers/pci/controller/dwc/pci-layerscape.c > > > @@ -41,6 +41,15 @@ > > > #define SCFG_PEXSFTRSTCR 0x190 > > > #define PEXSR(idx) BIT(idx) > > > > > > +/* LS1043A PEX PME control register */ > > > +#define SCFG_PEXPMECR0x144 > > > +#define PEXPME(idx) BIT(31 - (idx) * 4) > > > + > > > +/* LS1043A PEX LUT debug register */ > > > +#define LS_PCIE_LDBG 0x7fc > > > +#define LDBG_SR BIT(30) > > > +#define LDBG_WE BIT(31) > > > + > > > #define PCIE_IATU_NUM6 > > > > > > struct ls_pcie_drvdata { > > > @@ -225,6 +234,45 @@ static int ls1021a_pcie_exit_from_l2(struct > > > dw_pcie_rp *pp) > > > return scfg_pcie_exit_from_l2(pcie->scfg, SCFG_PEXSFTRSTCR, > > > PEXSR(pcie->index)); > > > } > > > > > > +static void ls1043a_pcie_send_turnoff_msg(struct dw_pcie_rp *pp) > > > +{ > > > + struct dw_pcie *pci = to_dw_pcie_from_pp(pp); > > > + struct ls_pcie *pcie = to_ls_pcie(pci); > > > + > > > + scfg_pcie_send_turnoff_msg(pcie->scfg, SCFG_PEXPMECR, > > > PEXPME(pcie->index)); > > > +} > > > + > > > +static int ls1043a_pcie_exit_from_l2(struct dw_pcie_rp *pp) > > > +{ > > > + struct dw_pcie *pci = to_dw_pcie_from_pp(pp); > > > + struct ls_pcie *pcie = to_ls_pcie(pci); > > > + u32 val; > > > + > > > + /* > > > + * Only way let PEX module exit L2 is do a software reset. > > > > Can you expand PEX? What is it used for? > > > > Also if the reset is only for the PEX module, please use the same comment in > > both patches 2 and 4. Patch 2 doesn't mention PEX in the comment. > > After read spec again, I think PEX is pci express. So it should software > reset controller. I don't know what exactly did in the chip. But without > below code, PCIe can't exit L2/L3. > > Any harmful if dwc controller reset? Anyway these code works well with > intel network card. Sorry, sent too quick. It is PCIe express wrapper Copy from spec: "PEXLDBG[SR]. Once set the PEXLDBG[SR] will enable the soft reset to the PEX wrapper." Frank > > Frank > > > > > - Mani > > > > > + * LDBG_WE: allows the user to have write access to the PEXDBG[SR] for > > > both setting and > > > + * clearing the soft reset on the PEX module. > > > + * LDBG_SR: When SR is set to 1, the PEX module enters soft reset. > > > + */ > > > + val = ls_pcie_pf_lut_readl(pcie, LS_PCIE_LDBG); > > > + val |= LDBG_WE; > > > + ls_pcie_pf_lut_writel(pcie, LS_PCIE_LDBG, val); > > > + > > > + val = ls_pcie_pf_lut_readl(pcie, LS_PCIE_LDBG); > > > + val |= LDBG_SR; > > > + ls_pcie_pf_lut_writel(pcie, LS_PCIE_LDBG, val); > > > + > > > + val = ls_pcie_pf_lut_readl(pcie, LS_PCIE_LDBG); > > > + val &= ~LDBG_SR; > > > + ls_pcie_pf_lut_writel(pcie, LS_PCIE_LDBG, val); > > > + > > > + val = ls_pcie_pf_lut_readl(pcie, LS_PCIE_LDBG); > > > + val &= ~LDBG_WE; > > > + ls_pcie_pf_lut_writel(pcie, LS_PCIE_LDBG, val); > > > + > > > + return 0; > > > +} > > > + > > > static const struct dw_pcie_host_ops ls_pcie_host_ops = { > > > .host_init = ls_pcie_host_init, > > > .pme_turn_off = ls_pcie_send_turnoff_msg, > > > @@ -242,6 +290,19 @@ static const struct ls_pcie_drvdata ls1021a_drvdata > > > = { > > > .exit_from_l2 = ls1021a_pcie_exit_from_l2, > > > }; > > > > > > +static const struct dw_pcie_host_ops ls1043a_pcie_host_ops = { > > > + .host_init = ls_pcie_host_init, > > > + .pme_turn_off =
Re: [PATCH v4 05/13] powerpc/rtas: Facilitate high-level call sequences
Michael Ellerman writes: > Nathan Lynch writes: >> Michael Ellerman writes: >> >>> Nathan Lynch via B4 Relay >>> writes: From: Nathan Lynch On RTAS platforms there is a general restriction that the OS must not enter RTAS on more than one CPU at a time. This low-level serialization requirement is satisfied by holding a spin lock (rtas_lock) across most RTAS function invocations. >>> ... diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c index 1fc0b3fffdd1..52f2242d0c28 100644 --- a/arch/powerpc/kernel/rtas.c +++ b/arch/powerpc/kernel/rtas.c @@ -581,6 +652,28 @@ static const struct rtas_function *rtas_token_to_function(s32 token) return NULL; } +static void __rtas_function_lock(struct rtas_function *func) +{ + if (func && func->lock) + mutex_lock(func->lock); +} >>> >>> This is obviously going to defeat most static analysis tools. >> >> I guess it's not that obvious to me :-) Is it because the mutex_lock() >> is conditional? I'll improve this if it's possible. > > Well maybe I'm not giving modern static analysis tools enough credit :) > > But what I mean that it's not easy to reason about what the function > does in isolation. ie. all you can say is that it may or may not lock a > mutex, and you can't say which mutex. I've pulled the thread on this a little bit and here is what I can do: * Discard rtas_lock_function() and rtas_unlock_function() and make the function mutexes extern as needed. As of now only rtas_ibm_get_vpd_lock will need to be exposed. This enables us to put __acquires(), __releases(), and __must_hold() annotations in papr-vpd.c since it explicitly manipulates the mutex. * Then sys_rtas() becomes the only site that needs __rtas_function_lock() and __rtas_function_unlock(), which can be open-coded and commented (and, one hopes, not emulated elsewhere). This will look something like: SYSCALL_DEFINE1(rtas, struct rtas_args __user *, uargs) { struct rtas_function *func = rtas_token_to_function(token); if (func->lock) mutex_lock(func->lock); [ ... acquire rtas_lock, enter RTAS, fetch any errors ... ] if (func->lock) mutex_unlock(func->lock); The indirection seems unavoidable since we're working backwards from a token value (supplied by the user and not known at build time) to the function descriptor. Is that tolerable for now? Alternatively, sys_rtas() could be refactored into locking and non-locking paths, e.g. static long __do_sys_rtas(struct rtas_function *func) { // [ ... acquire rtas_lock, enter RTAS, fetch any error etc ... ] } static long do_sys_rtas(struct rtas_function *func, struct mutex *mtx) { mutex_lock(mtx); ret = __do_sys_rtas(func); mutex_unlock(mtx); return ret; } SYSCALL_DEFINE1(rtas, struct rtas_args __user *, uargs) { // real code does copy_from_user etc struct rtas_function *func = rtas_token_to_function(uargs->token); long ret; // [ ... input validation and filtering ... ] if (func->lock) ret = do_sys_rtas(func, func->lock); else ret = __do_sys_rtas(func); // [ ... copy out results ... ] return ret; }
Re: [PATCH v4 05/13] powerpc/rtas: Facilitate high-level call sequences
Michael Ellerman writes: > Nathan Lynch writes: >> Michael Ellerman writes: >>> Nathan Lynch writes: Michael Ellerman writes: > Nathan Lynch via B4 Relay > writes: >> From: Nathan Lynch >> >> On RTAS platforms there is a general restriction that the OS must not >> enter RTAS on more than one CPU at a time. This low-level >> serialization requirement is satisfied by holding a spin >> lock (rtas_lock) across most RTAS function invocations. > ... >> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c >> index 1fc0b3fffdd1..52f2242d0c28 100644 >> --- a/arch/powerpc/kernel/rtas.c >> +++ b/arch/powerpc/kernel/rtas.c >> @@ -581,6 +652,28 @@ static const struct rtas_function >> *rtas_token_to_function(s32 token) >> return NULL; >> } >> >> +static void __rtas_function_lock(struct rtas_function *func) >> +{ >> +if (func && func->lock) >> +mutex_lock(func->lock); >> +} > > This is obviously going to defeat most static analysis tools. I guess it's not that obvious to me :-) Is it because the mutex_lock() is conditional? I'll improve this if it's possible. >>> >>> Well maybe I'm not giving modern static analysis tools enough credit :) >>> >>> But what I mean that it's not easy to reason about what the function >>> does in isolation. ie. all you can say is that it may or may not lock a >>> mutex, and you can't say which mutex. >> >> I've pulled the thread on this a little bit and here is what I can do: >> >> * Discard rtas_lock_function() and rtas_unlock_function() and make the >> function mutexes extern as needed. As of now only >> rtas_ibm_get_vpd_lock will need to be exposed. This enables us to put >> __acquires(), __releases(), and __must_hold() annotations in >> papr-vpd.c since it explicitly manipulates the mutex. >> >> * Then sys_rtas() becomes the only site that needs >> __rtas_function_lock() and __rtas_function_unlock(), which can be >> open-coded and commented (and, one hopes, not emulated elsewhere). >> >> This will look something like: >> >> SYSCALL_DEFINE1(rtas, struct rtas_args __user *, uargs) >> { >> struct rtas_function *func = rtas_token_to_function(token); >> >> if (func->lock) >> mutex_lock(func->lock); >> >> [ ... acquire rtas_lock, enter RTAS, fetch any errors ... ] >> >> if (func->lock) >> mutex_unlock(func->lock); >> >> The indirection seems unavoidable since we're working backwards from a >> token value (supplied by the user and not known at build time) to the >> function descriptor. >> >> Is that tolerable for now? > > Yeah. Thanks for looking into it. > > I wasn't unhappy with the original version, but just slightly uneasy > about the locking via pointer. > > But that new proposal sounds good, more code will have static lock > annotations, and only sys_rtas() which is already weird, will have the > dynamic stuff. OK, I'll work that up then.