Re: [PATCH] perf vendor events: Fix eventcode of power10 json events
> On 25-May-2021, at 8:57 PM, Paul A. Clarke wrote: >> > I lost the original message, but Nageswara Sastry said: >> 1. Extracted all the 244 events from the patch. >> 2. Check them in 'perf list' - all 244 events found >> 3. Ran all the events with 'perf stat -e "event name" sleep 1', all ran fine. >>No errors were seen in 'dmesg' > > I count 255 events. > > PC Seems while extracting I filtered out newly added ones, so I got 244(255-11). Now checked with all 255 events. Thanks for pointing out. Thanks!! R.Nageswara Sastry
Re: simplify gendisk and request_queue allocation for bio based drivers
On Wed, May 26, 2021 at 12:41:37AM +0200, Ulf Hansson wrote: > On Fri, 21 May 2021 at 07:51, Christoph Hellwig wrote: > > > > Hi all, > > > > this series is the first part of cleaning up lifetimes and allocation of > > the gendisk and request_queue structure. It adds a new interface to > > allocate the disk and queue together for bio based drivers, and a helper > > for cleanup/free them when a driver is unloaded or a device is removed. > > May I ask what else you have in the pipe for the next steps? > > The reason why I ask is that I am looking into some issues related to > lifecycle problems of gendisk/mmc, typically triggered at SD/MMC card > removal. In the short run not much more than superficial cleanups. Eventually I want bio based drivers to not require a separate request_queue, leaving that purely as a data structure for blk-mq based drivers. But it will take a while until we get there, so it should not block any fixes. For hot unplug handling it might be worth to take a look at nvme, as it is tested a lot for that case.
[powerpc:next] BUILD SUCCESS b73c8cccd72ac28beaf262fd6ef4b91411fc8335
bmips_be_defconfig powerpc mpc8272_ads_defconfig powerpc sbc8548_defconfig sh sh7770_generic_defconfig arm am200epdkit_defconfig alpha defconfig riscvnommu_k210_defconfig powerpc tqm8541_defconfig powerpc mpc834x_itxgp_defconfig mips capcella_defconfig armmps2_defconfig nds32 defconfig powerpc obs600_defconfig powerpc bamboo_defconfig mipsbcm47xx_defconfig arm axm55xx_defconfig sh landisk_defconfig openriscor1ksim_defconfig mips maltaaprp_defconfig mips rm200_defconfig armpleb_defconfig arm omap1_defconfig mips mpc30x_defconfig powerpc rainier_defconfig powerpc ppc6xx_defconfig arm hackkit_defconfig mips mtx1_defconfig powerpc allyesconfig x86_64 defconfig mips tb0219_defconfig mipsar7_defconfig m68k m5249evb_defconfig arm tegra_defconfig riscv allnoconfig arm davinci_all_defconfig powerpc mgcoge_defconfig nds32 allnoconfig powerpc g5_defconfig arm u8500_defconfig sh se7750_defconfig sh se7206_defconfig arm cns3420vb_defconfig powerpc sequoia_defconfig sh se7619_defconfig s390defconfig m68kq40_defconfig microblaze defconfig sh se7705_defconfig m68kmvme16x_defconfig arm rpc_defconfig riscv nommu_k210_sdcard_defconfig mipse55_defconfig powerpcfsp2_defconfig mips malta_defconfig powerpc wii_defconfig sh j2_defconfig arm64alldefconfig sh secureedge5410_defconfig shedosk7705_defconfig powerpc tqm8xx_defconfig powerpc currituck_defconfig m68k apollo_defconfig arm pxa255-idp_defconfig powerpc canyonlands_defconfig powerpc pasemi_defconfig arcnsim_700_defconfig armdove_defconfig arc nsimosci_hs_defconfig s390 debug_defconfig arm aspeed_g5_defconfig mips rb532_defconfig armtrizeps4_defconfig ia64 allmodconfig ia64defconfig ia64 allyesconfig m68k allmodconfig m68kdefconfig nios2 defconfig arc allyesconfig nios2allyesconfig alphaallyesconfig xtensa allyesconfig h8300allyesconfig arc defconfig sh allmodconfig parisc defconfig s390 allyesconfig s390 allmodconfig parisc allyesconfig i386 allyesconfig sparcallyesconfig sparc defconfig i386defconfig mips allyesconfig mips allmodconfig powerpc allmodconfig powerpc allnoconfig i386 randconfig-a001-20210525 i386 randconfig-a002-20210525 i386 randconfig-a005-20210525 i386 randconfig-a006-20210525 i386 randconfig-a003-20210525 i386 randconfig-a004-20210525 i386 randconfig-a001-20210526 i386 randconfig-a002-20210526 i386 randconfig-a005-20210526 i386 randconfig-a004-20210526 i386 randconfig-a003-20210526 i386 randconfig-a006-20210526 x86_64
[powerpc:next-test] BUILD SUCCESS 027f55e87c3094270a3223f7331d033fe15a2b3f
mtx1_defconfig mips tb0219_defconfig mipsar7_defconfig m68k m5249evb_defconfig arm tegra_defconfig riscv allnoconfig csky alldefconfig sparc sparc32_defconfig sh se7343_defconfig sh se7619_defconfig m68kq40_defconfig powerpcklondike_defconfig microblaze defconfig sh se7705_defconfig m68kmvme16x_defconfig arm rpc_defconfig riscv nommu_k210_sdcard_defconfig mipse55_defconfig powerpc mpc8272_ads_defconfig powerpcfsp2_defconfig mips malta_defconfig powerpc wii_defconfig sh j2_defconfig arm64alldefconfig sh secureedge5410_defconfig mips mpc30x_defconfig shedosk7705_defconfig m68k apollo_defconfig arm pxa255-idp_defconfig powerpc canyonlands_defconfig sh sh7724_generic_defconfig powerpc pasemi_defconfig arcnsim_700_defconfig armdove_defconfig arc nsimosci_hs_defconfig s390 debug_defconfig arm aspeed_g5_defconfig mips rb532_defconfig armtrizeps4_defconfig ia64 allmodconfig ia64defconfig ia64 allyesconfig m68k allmodconfig m68kdefconfig nios2 defconfig arc allyesconfig nds32 allnoconfig nios2allyesconfig alpha defconfig alphaallyesconfig xtensa allyesconfig h8300allyesconfig arc defconfig sh allmodconfig parisc defconfig s390 allyesconfig s390 allmodconfig parisc allyesconfig s390defconfig i386 allyesconfig sparcallyesconfig sparc defconfig mips allyesconfig mips allmodconfig powerpc allyesconfig powerpc allmodconfig powerpc allnoconfig i386 randconfig-a001-20210525 i386 randconfig-a002-20210525 i386 randconfig-a005-20210525 i386 randconfig-a006-20210525 i386 randconfig-a003-20210525 i386 randconfig-a004-20210525 i386 randconfig-a001-20210526 i386 randconfig-a002-20210526 i386 randconfig-a005-20210526 i386 randconfig-a004-20210526 i386 randconfig-a003-20210526 i386 randconfig-a006-20210526 x86_64 randconfig-a013-20210525 x86_64 randconfig-a012-20210525 x86_64 randconfig-a014-20210525 x86_64 randconfig-a016-20210525 x86_64 randconfig-a015-20210525 x86_64 randconfig-a011-20210525 i386 randconfig-a011-20210525 i386 randconfig-a016-20210525 i386 randconfig-a015-20210525 i386 randconfig-a012-20210525 i386 randconfig-a014-20210525 i386 randconfig-a013-20210525 x86_64 randconfig-a005-20210526 x86_64 randconfig-a001-20210526 x86_64 randconfig-a006-20210526 x86_64 randconfig-a003-20210526 x86_64 randconfig-a004-20210526 x86_64 randconfig-a002-20210526 riscvallyesconfig riscvnommu_virt_defconfig riscv defconfig riscv rv32_defconfig riscvallmodconfig um allmodconfig umallnoconfig um allyesconfig um defconfig x86_64 allyesconfig x86_64rhel-8.3-kselftests x86_64 defconfig x86_64
[powerpc:merge] BUILD SUCCESS 112f47a1484ddca610b70cbe4a99f0d0f1701daf
allyesconfig sparc defconfig i386defconfig mips allyesconfig mips allmodconfig powerpc allyesconfig powerpc allmodconfig powerpc allnoconfig i386 randconfig-a001-20210525 i386 randconfig-a002-20210525 i386 randconfig-a005-20210525 i386 randconfig-a006-20210525 i386 randconfig-a003-20210525 i386 randconfig-a004-20210525 i386 randconfig-a001-20210526 i386 randconfig-a002-20210526 i386 randconfig-a005-20210526 i386 randconfig-a004-20210526 i386 randconfig-a003-20210526 i386 randconfig-a006-20210526 x86_64 randconfig-a013-20210525 x86_64 randconfig-a012-20210525 x86_64 randconfig-a014-20210525 x86_64 randconfig-a016-20210525 x86_64 randconfig-a015-20210525 x86_64 randconfig-a011-20210525 i386 randconfig-a011-20210525 i386 randconfig-a016-20210525 i386 randconfig-a015-20210525 i386 randconfig-a012-20210525 i386 randconfig-a014-20210525 i386 randconfig-a013-20210525 x86_64 randconfig-a005-20210526 x86_64 randconfig-a001-20210526 x86_64 randconfig-a006-20210526 x86_64 randconfig-a003-20210526 x86_64 randconfig-a004-20210526 x86_64 randconfig-a002-20210526 riscvallyesconfig riscvnommu_virt_defconfig riscv allnoconfig riscv defconfig riscv rv32_defconfig riscvallmodconfig um allmodconfig umallnoconfig um allyesconfig um defconfig x86_64 allyesconfig x86_64rhel-8.3-kselftests x86_64 defconfig x86_64 rhel-8.3 x86_64 rhel-8.3-kbuiltin x86_64 kexec clang tested configs: x86_64 randconfig-b001-20210525 x86_64 randconfig-a005-20210525 x86_64 randconfig-a006-20210525 x86_64 randconfig-a001-20210525 x86_64 randconfig-a003-20210525 x86_64 randconfig-a004-20210525 x86_64 randconfig-a002-20210525 --- 0-DAY CI Kernel Test Service, Intel Corporation https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org
Re: [PATCH v3 0/4] P2040/P2041 i2c recovery erratum
Wolfram Sang writes: > On Wed, May 12, 2021 at 09:20:48AM +1200, Chris Packham wrote: >> The P2040/P2041 has an erratum where the i2c recovery scheme >> documented in the reference manual (and currently implemented >> in the i2c-mpc.c driver) does not work. The errata document >> provides an alternative that does work. This series implements >> that alternative and uses a property in the devicetree to >> decide when the alternative mechanism is needed. > > The series looks good to me. Usually, I don't take DTS patches. This > time I'd make an exception and apply all patches to for-current because > this is clearly a bugfix. For that, I'd need an ack from PPC > maintainers. Could I have those for patches 2+3? Yep, done. cheers
Re: [PATCH v3 3/4] powerpc/fsl: set fsl,i2c-erratum-a004447 flag for P1010 i2c controllers
Chris Packham writes: > The i2c controllers on the P1010 have an erratum where the documented > scheme for i2c bus recovery will not work (A-004447). A different > mechanism is needed which is documented in the P1010 Chip Errata Rev L. > > Signed-off-by: Chris Packham > --- > > Notes: > Changes in v3: > - New > > arch/powerpc/boot/dts/fsl/p1010si-post.dtsi | 8 > 1 file changed, 8 insertions(+) Acked-by: Michael Ellerman cheers > diff --git a/arch/powerpc/boot/dts/fsl/p1010si-post.dtsi > b/arch/powerpc/boot/dts/fsl/p1010si-post.dtsi > index c2717f31925a..ccda0a91abf0 100644 > --- a/arch/powerpc/boot/dts/fsl/p1010si-post.dtsi > +++ b/arch/powerpc/boot/dts/fsl/p1010si-post.dtsi > @@ -122,7 +122,15 @@ memory-controller@2000 { > }; > > /include/ "pq3-i2c-0.dtsi" > + i2c@3000 { > + fsl,i2c-erratum-a004447; > + }; > + > /include/ "pq3-i2c-1.dtsi" > + i2c@3100 { > + fsl,i2c-erratum-a004447; > + }; > + > /include/ "pq3-duart-0.dtsi" > /include/ "pq3-espi-0.dtsi" > spi0: spi@7000 { > -- > 2.31.1
Re: [PATCH v3 2/4] powerpc/fsl: set fsl,i2c-erratum-a004447 flag for P2041 i2c controllers
Chris Packham writes: > The i2c controllers on the P2040/P2041 have an erratum where the > documented scheme for i2c bus recovery will not work (A-004447). A > different mechanism is needed which is documented in the P2040 Chip > Errata Rev Q (latest available at the time of writing). > > Signed-off-by: Chris Packham > --- > arch/powerpc/boot/dts/fsl/p2041si-post.dtsi | 16 > 1 file changed, 16 insertions(+) Acked-by: Michael Ellerman cheers > diff --git a/arch/powerpc/boot/dts/fsl/p2041si-post.dtsi > b/arch/powerpc/boot/dts/fsl/p2041si-post.dtsi > index 872e4485dc3f..ddc018d42252 100644 > --- a/arch/powerpc/boot/dts/fsl/p2041si-post.dtsi > +++ b/arch/powerpc/boot/dts/fsl/p2041si-post.dtsi > @@ -371,7 +371,23 @@ sdhc@114000 { > }; > > /include/ "qoriq-i2c-0.dtsi" > + i2c@118000 { > + fsl,i2c-erratum-a004447; > + }; > + > + i2c@118100 { > + fsl,i2c-erratum-a004447; > + }; > + > /include/ "qoriq-i2c-1.dtsi" > + i2c@119000 { > + fsl,i2c-erratum-a004447; > + }; > + > + i2c@119100 { > + fsl,i2c-erratum-a004447; > + }; > + > /include/ "qoriq-duart-0.dtsi" > /include/ "qoriq-duart-1.dtsi" > /include/ "qoriq-gpio-0.dtsi" > -- > 2.31.1
Re: simplify gendisk and request_queue allocation for bio based drivers
On Fri, 21 May 2021 at 07:51, Christoph Hellwig wrote: > > Hi all, > > this series is the first part of cleaning up lifetimes and allocation of > the gendisk and request_queue structure. It adds a new interface to > allocate the disk and queue together for bio based drivers, and a helper > for cleanup/free them when a driver is unloaded or a device is removed. May I ask what else you have in the pipe for the next steps? The reason why I ask is that I am looking into some issues related to lifecycle problems of gendisk/mmc, typically triggered at SD/MMC card removal. > > Together this removes the need to treat the gendisk and request_queue > as separate entities for bio based drivers. > > Diffstat: > arch/m68k/emu/nfblock.c | 20 +--- > arch/xtensa/platforms/iss/simdisk.c | 29 +-- > block/blk-core.c|1 > block/blk.h |6 - > block/genhd.c | 149 > +++- > block/partitions/core.c | 19 ++-- > drivers/block/brd.c | 94 +++--- > drivers/block/drbd/drbd_main.c | 23 + > drivers/block/n64cart.c |8 - > drivers/block/null_blk/main.c | 38 - > drivers/block/pktcdvd.c | 11 -- > drivers/block/ps3vram.c | 31 +-- > drivers/block/rsxx/dev.c| 39 +++-- > drivers/block/rsxx/rsxx_priv.h |1 > drivers/block/zram/zram_drv.c | 19 > drivers/lightnvm/core.c | 24 + > drivers/md/bcache/super.c | 15 --- > drivers/md/dm.c | 16 +-- > drivers/md/md.c | 25 ++ > drivers/memstick/core/ms_block.c|1 > drivers/nvdimm/blk.c| 27 +- > drivers/nvdimm/btt.c| 25 +- > drivers/nvdimm/btt.h|2 > drivers/nvdimm/pmem.c | 17 +--- > drivers/nvme/host/core.c|1 > drivers/nvme/host/multipath.c | 46 +++ > drivers/s390/block/dcssblk.c| 26 +- > drivers/s390/block/xpram.c | 26 ++ > include/linux/blkdev.h |1 > include/linux/genhd.h | 23 + > 30 files changed, 297 insertions(+), 466 deletions(-) This looks like a nice cleanup to me. Feel free to add, for the series: Reviewed-by: Ulf Hansson Kind regards Uffe
Re: [PATCH] perf vendor events: Fix eventcode of power10 json events
Tested patch with the following steps: 1. Extracted all the 244 events from the patch. 2. Check them in 'perf list' - all 244 events found 3. Ran all the events with 'perf stat -e "event name" sleep 1', all ran fine. No errors were seen in 'dmesg' Tested-by: Nageswara R Sastry > On 25-May-2021, at 12:07 PM, Kajol Jain wrote: > > Fixed the eventcode values in the power10 json event files to append > "0x" since these are hexadecimal values. > Patch also changes event description of PM_EXEC_STALL_LOAD_FINISH and > PM_EXEC_STALL_NTC_FLUSH event and move some events to correct files. > > Fixes: 32daa5d7899e ("perf vendor events: Initial JSON/events list for > power10 platform") > Signed-off-by: Kajol Jain > --- > .../arch/powerpc/power10/cache.json | 30 ++-- > .../arch/powerpc/power10/floating_point.json | 2 +- > .../arch/powerpc/power10/frontend.json| 124 ++-- > .../arch/powerpc/power10/locks.json | 4 +- > .../arch/powerpc/power10/marked.json | 61 > .../arch/powerpc/power10/memory.json | 79 +- > .../arch/powerpc/power10/others.json | 133 +++-- > .../arch/powerpc/power10/pipeline.json| 135 +- > .../pmu-events/arch/powerpc/power10/pmc.json | 8 +- > .../arch/powerpc/power10/translation.json | 22 +-- > 10 files changed, 299 insertions(+), 299 deletions(-) > > diff --git a/tools/perf/pmu-events/arch/powerpc/power10/cache.json > b/tools/perf/pmu-events/arch/powerpc/power10/cache.json > index 616f29098c71..605be14f441c 100644 > --- a/tools/perf/pmu-events/arch/powerpc/power10/cache.json > +++ b/tools/perf/pmu-events/arch/powerpc/power10/cache.json > @@ -1,46 +1,56 @@ > [ > { > -"EventCode": "1003C", > +"EventCode": "0x1003C", > "EventName": "PM_EXEC_STALL_DMISS_L2L3", > "BriefDescription": "Cycles in which the oldest instruction in the > pipeline was waiting for a load miss to resolve from either the local L2 or > local L3." > }, > { > -"EventCode": "34056", > +"EventCode": "0x1E054", > +"EventName": "PM_EXEC_STALL_DMISS_L21_L31", > +"BriefDescription": "Cycles in which the oldest instruction in the > pipeline was waiting for a load miss to resolve from another core's L2 or L3 > on the same chip." > + }, > + { > +"EventCode": "0x34054", > +"EventName": "PM_EXEC_STALL_DMISS_L2L3_NOCONFLICT", > +"BriefDescription": "Cycles in which the oldest instruction in the > pipeline was waiting for a load miss to resolve from the local L2 or local > L3, without a dispatch conflict." > + }, > + { > +"EventCode": "0x34056", > "EventName": "PM_EXEC_STALL_LOAD_FINISH", > -"BriefDescription": "Cycles in which the oldest instruction in the > pipeline was finishing a load after its data was reloaded from a data source > beyond the local L1; cycles in which the LSU was processing an L1-hit; cycles > in which the NTF instruction merged with another load in the LMQ." > +"BriefDescription": "Cycles in which the oldest instruction in the > pipeline was finishing a load after its data was reloaded from a data source > beyond the local L1; cycles in which the LSU was processing an L1-hit; cycles > in which the NTF instruction merged with another load in the LMQ; cycles in > which the NTF instruction is waiting for a data reload for a load miss, but > the data comes back with a non-NTF instruction." > }, > { > -"EventCode": "3006C", > +"EventCode": "0x3006C", > "EventName": "PM_RUN_CYC_SMT2_MODE", > "BriefDescription": "Cycles when this thread's run latch is set and the > core is in SMT2 mode." > }, > { > -"EventCode": "300F4", > +"EventCode": "0x300F4", > "EventName": "PM_RUN_INST_CMPL_CONC", > "BriefDescription": "PowerPC instructions completed by this thread when > all threads in the core had the run-latch set." > }, > { > -"EventCode": "4C016", > +"EventCode": "0x4C016", > "EventName": "PM_EXEC_STALL_DMISS_L2L3_CONFLICT", > "BriefDescription": "Cycles in which the oldest instruction in the > pipeline was waiting for a load miss to resolve from the local L2 or local > L3, with a dispatch conflict." > }, > { > -"EventCode": "4D014", > +"EventCode": "0x4D014", > "EventName": "PM_EXEC_STALL_LOAD", > "BriefDescription": "Cycles in which the oldest instruction in the > pipeline was a load instruction executing in the Load Store Unit." > }, > { > -"EventCode": "4D016", > +"EventCode": "0x4D016", > "EventName": "PM_EXEC_STALL_PTESYNC", > "BriefDescription": "Cycles in which the oldest instruction in the > pipeline was a PTESYNC instruction executing in the Load Store Unit." > }, > { > -"EventCode": "401EA", > +"EventCode": "0x401EA", > "EventName": "PM_THRESH_EXC_128", > "BriefDescription": "Threshold counter exceeded a value of 128." > }, > { > -"EventCode": "400F6", > +"EventCod
Re: [PATCH v3 0/4] P2040/P2041 i2c recovery erratum
On Wed, May 12, 2021 at 09:20:48AM +1200, Chris Packham wrote: > The P2040/P2041 has an erratum where the i2c recovery scheme > documented in the reference manual (and currently implemented > in the i2c-mpc.c driver) does not work. The errata document > provides an alternative that does work. This series implements > that alternative and uses a property in the devicetree to > decide when the alternative mechanism is needed. The series looks good to me. Usually, I don't take DTS patches. This time I'd make an exception and apply all patches to for-current because this is clearly a bugfix. For that, I'd need an ack from PPC maintainers. Could I have those for patches 2+3? signature.asc Description: PGP signature
Re: [PATCH v3 0/4] P2040/P2041 i2c recovery erratum
> For those reading along the v2 mentioned in that thread was posted as > https://lore.kernel.org/linux-i2c/20170511122033.22471-1-joakim.tjernl...@infinera.com/ > > there was a bit of discussion but it seemed to die out without reaching > a conclusion. > > The i2c-mpc driver is now using the generic recovery mechanism so that > addresses one bit of feedback from the original thread. Yes, and the generic recovery has been improved since then. There is an "incomplete_write_byte" fault injector now and the generic recovery handles it correctly meanwhile. Before, it actually could cause a write to happen but we are sending STOPs now. signature.asc Description: PGP signature
Re: [RFC PATCH 2/2] mm/mremap: Fix race between MOVE_PUD mremap and pageout
On Mon, May 24, 2021 at 10:34 PM Aneesh Kumar K.V wrote: > > @@ -221,6 +222,9 @@ static inline void page_vma_mapped_walk_done(struct > page_vma_mapped_walk *pvmw) > spin_unlock(pvmw->pte_ptl); > if (pvmw->pmd_ptl) > spin_unlock(pvmw->pmd_ptl); > + if (pvmw->pud_ptl) > + spin_unlock(pvmw->pud_ptl); > + > } You have this habit of adding odd whitespace.. But yes, this seems to be the right way to fix the races properly. The pageout code is special, the pageout code is normally not critical, so it's the pageout code that should go the extra mile to make up for the fact that it doesn't hold the mmap_sem like good page table modification codepaths do. Linus
Re: [PATCH v6 updated 9/11] mm/mremap: Fix race between mremap and pageout
On Mon, May 24, 2021 at 10:44 PM A lneesh Kumar K.V wrote: > > Should we worry about the below race. The window would be small > > CPU 1 CPU 2 CPU 3 > > mremap(old_addr, new_addr) page_shrinker/try_to_unmap_one > > mmap_write_lock_killable() > > addr = old_addr > > lock(pmd_ptl) > pmd = *old_pmd > pmd_clear(old_pmd) > flush_tlb_range(old_addr) > > lock(pte_ptl) > *new_pmd = pmd > unlock(pte_ptl) > > unlock(pmd_ptl) > lock(pte_ptl) > > *new_addr = 10; and fills > TLB > with new addr > and > old pfn > > ptep_clear_flush(old_addr) > old pfn is free. > Stale > TLB entry Hmm. Do you need a third CPU there? What is done above on CPU3 looks like it might just be CPU1 accessing the new range immediately. Which doesn't actually sound at all unlikely - so maybe the window is small, but it sounds like something that could happen. This looks nasty. The page shrinker has always been problematic because it basically avoids the normal full set of locks. I wonder if we could just make the page shrinker try-lock the mmap_sem and avoid all this that way. It _is_ allowed to fail, after all, and the page shrinker is "not normal" and should be less of a performance issue than all the actual normal VM paths. Does anybody have any good ideas? > > And new optimization for empty pmd, which seems unrelated to the > > change and should presumably be separate: > > That was added that we can safely do pte_lockptr() below Oh, because pte_lockptr() doesn't actually use the "old_pmd" pointer value - it actually *dereferences* the pointer. That looks like a mis-design. Why does it do that? Why don't we pass it the pmd value, if that's what it wants? Linus
Re: [PATCH v6 07/11] mm/mremap: Use range flush that does TLB and page walk cache flush
On Tue, May 25, 2021 at 3:28 AM Aneesh Kumar K.V wrote: > > How about flush_tlb_and_page_table_cache() ? Honestly, I'd prefer it to be a separate function. So keep the existing flush_tlb() as-is, and add a flush_tlb_walking_cache() and document that any architecture that flushes the walker caches as part of the regular tlb flush can just make that a no-op. Would that work for powerpc? But: > > (b) is this even worth it as a public interface? > > But such a large range invalidate doesn't imply we are freeing page > tables. No. But it's what everybody else (ie x86 and ARM) does, and if you're flushing megabytes of TLB's, what's the downside of flushing a few TLB walker cache entries? You already do that for internal powerpc errata anyway (ie "mm_needs_flush_escalation()"), so I'm saying that you might as well treat the page walker cache as a powerpc-internal implementation thing. Put another way: can you even _measure_ the difference between "just make powerpc look like everybody else" and "add a new explicit page table walker cache flush function interface"? Now, from a quick look at the powerpc code, it looks like powerpc is a bit mis-architected, and when you flush the walker cache, you flush everything for that ASID. x86 and arm only flush the parts affected by the TLB flush range (now, admittedly, that's what they do _architecturally_ - for all I know the actual hardware just always flushes all walker caches when you flush any TLB and so maybe they act exactly like the powrpc RIC_FLUSH_PWC in practice). So maybe it's measurable. But I kind of doubt it, and I'd like to know that you've actually done some testing to see that "yes, this matters, I can't just say 'if flushing more than a pmd, just flush the walker cache too'". Hmm? Linus
Re: [PATCH] perf vendor events: Fix eventcode of power10 json events
On Tue, May 25, 2021 at 09:42:15AM -0500, Paul A. Clarke wrote: > On Tue, May 25, 2021 at 12:07:23PM +0530, Kajol Jain wrote: > > Fixed the eventcode values in the power10 json event files to append > > "0x" since these are hexadecimal values. > > Patch also changes event description of PM_EXEC_STALL_LOAD_FINISH and > > PM_EXEC_STALL_NTC_FLUSH event and move some events to correct files. > > > > Fixes: 32daa5d7899e ("perf vendor events: Initial JSON/events list for > > power10 platform") > > Signed-off-by: Kajol Jain > > I checked that everything that was "removed" was actually just moved > somewhere else, that all the added EventCodes indeed had '0x', the > number of EventCodes added matched the number removed, and that the > additional text added seemed reasonable. LGTM. > > Reviewed-by: Paul A. Clarke I lost the original message, but Nageswara Sastry said: > 1. Extracted all the 244 events from the patch. > 2. Check them in 'perf list' - all 244 events found > 3. Ran all the events with 'perf stat -e "event name" sleep 1', all ran fine. > No errors were seen in 'dmesg' I count 255 events. PC
Re: [PATCH] perf vendor events: Fix eventcode of power10 json events
On Tue, May 25, 2021 at 12:07:23PM +0530, Kajol Jain wrote: > Fixed the eventcode values in the power10 json event files to append > "0x" since these are hexadecimal values. > Patch also changes event description of PM_EXEC_STALL_LOAD_FINISH and > PM_EXEC_STALL_NTC_FLUSH event and move some events to correct files. > > Fixes: 32daa5d7899e ("perf vendor events: Initial JSON/events list for > power10 platform") > Signed-off-by: Kajol Jain I checked that everything that was "removed" was actually just moved somewhere else, that all the added EventCodes indeed had '0x', the number of EventCodes added matched the number removed, and that the additional text added seemed reasonable. LGTM. Reviewed-by: Paul A. Clarke
Re: [RFC v2 4/4] powerpc/papr_scm: Add cpu hotplug support for nvdimm pmu device
On Tue, May 25, 2021 at 06:52:16PM +0530, Kajol Jain wrote: > Patch here adds cpu hotplug functions to nvdimm pmu. I'm thinking "Patch here" qualifies for "This patch", see Documentation/process/submitting-patches.rst . > It adds cpumask to designate a cpu to make HCALL to > collect the counter data for the nvdimm device and > update ABI documentation accordingly. > > Result in power9 lpar system: > command:# cat /sys/devices/nmem0/cpumask > 0 Is this specific to the papr thing, or should this be in generic nvdimm code?
[PATCH V3 2/2] selftests/powerpc: EBB selftest for MMCR0 control for PMU SPRs in ISA v3.1
With the MMCR0 control bit (PMCCEXT) in ISA v3.1, read access to group B registers is restricted when MMCR0 PMCC=0b00. In other platforms (like power9), the older behaviour works where group B PMU SPRs are readable. Patch creates a selftest which verifies that the test takes a SIGILL when attempting to read PMU registers via helper function "dump_ebb_state" for ISA v3.1. Signed-off-by: Athira Rajeev --- tools/testing/selftests/powerpc/pmu/ebb/Makefile | 2 +- .../powerpc/pmu/ebb/regs_access_pmccext_test.c | 63 ++ 2 files changed, 64 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/powerpc/pmu/ebb/regs_access_pmccext_test.c diff --git a/tools/testing/selftests/powerpc/pmu/ebb/Makefile b/tools/testing/selftests/powerpc/pmu/ebb/Makefile index c5ecb46..0101606 100644 --- a/tools/testing/selftests/powerpc/pmu/ebb/Makefile +++ b/tools/testing/selftests/powerpc/pmu/ebb/Makefile @@ -24,7 +24,7 @@ TEST_GEN_PROGS := reg_access_test event_attributes_test cycles_test \ fork_cleanup_test ebb_on_child_test\ ebb_on_willing_child_test back_to_back_ebbs_test \ lost_exception_test no_handler_test\ -cycles_with_mmcr2_test +cycles_with_mmcr2_test regs_access_pmccext_test top_srcdir = ../../../../../.. include ../../../lib.mk diff --git a/tools/testing/selftests/powerpc/pmu/ebb/regs_access_pmccext_test.c b/tools/testing/selftests/powerpc/pmu/ebb/regs_access_pmccext_test.c new file mode 100644 index 000..1eda8e9 --- /dev/null +++ b/tools/testing/selftests/powerpc/pmu/ebb/regs_access_pmccext_test.c @@ -0,0 +1,63 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2021, Athira Rajeev, IBM Corp. + */ + +#include +#include +#include +#include + +#include "ebb.h" + + +/* + * Test that closing the EBB event clears MMCR0_PMCC and + * sets MMCR0_PMCCEXT preventing further read access to the + * group B PMU registers. + */ + +static int regs_access_pmccext(void) +{ + struct event event; + + SKIP_IF(!ebb_is_supported()); + + event_init_named(&event, 0x1001e, "cycles"); + event_leader_ebb_init(&event); + + FAIL_IF(event_open(&event)); + + ebb_enable_pmc_counting(1); + setup_ebb_handler(standard_ebb_callee); + ebb_global_enable(); + FAIL_IF(ebb_event_enable(&event)); + + mtspr(SPRN_PMC1, pmc_sample_period(sample_period)); + + while (ebb_state.stats.ebb_count < 1) + FAIL_IF(core_busy_loop()); + + ebb_global_disable(); + event_close(&event); + + FAIL_IF(ebb_state.stats.ebb_count == 0); + + /* +* For ISA v3.1, verify the test takes a SIGILL when reading +* PMU regs after the event is closed. With the control bit +* in MMCR0 (PMCCEXT) restricting access to group B PMU regs, +* sigill is expected. +*/ + if (have_hwcap2(PPC_FEATURE2_ARCH_3_1)) + FAIL_IF(catch_sigill(dump_ebb_state)); + else + dump_ebb_state(); + + return 0; +} + +int main(void) +{ + return test_harness(regs_access_pmccext, "regs_access_pmccext"); +} -- 1.8.3.1
[PATCH V3 1/2] selftests/powerpc: Fix "no_handler" EBB selftest
The "no_handler_test" in ebb selftests attempts to read the PMU registers twice via helper function "dump_ebb_state". First dump is just before closing of event and the second invocation is done after closing of the event. The original intention of second dump_ebb_state was to dump the state of registers at the end of the test when the counters are frozen. But this will be achieved with the first call itself since sample period is set to low value and PMU will be frozen by then. Hence patch removes the dump which was done before closing of the event. Signed-off-by: Athira Rajeev Reported-by: Shirisha Ganta --- tools/testing/selftests/powerpc/pmu/ebb/no_handler_test.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/tools/testing/selftests/powerpc/pmu/ebb/no_handler_test.c b/tools/testing/selftests/powerpc/pmu/ebb/no_handler_test.c index fc5bf48..01e827c 100644 --- a/tools/testing/selftests/powerpc/pmu/ebb/no_handler_test.c +++ b/tools/testing/selftests/powerpc/pmu/ebb/no_handler_test.c @@ -50,8 +50,6 @@ static int no_handler_test(void) event_close(&event); - dump_ebb_state(); - /* The real test is that we never took an EBB at 0x0 */ return 0; -- 1.8.3.1
[PATCH V3 0/2] selftests/powerpc: Updates to EBB selftest for ISA v3.1
The "no_handler_test" in ebb selftests attempts to read the PMU registers after closing of the event via helper function "dump_ebb_state". With the MMCR0 control bit (PMCCEXT) in ISA v3.1, read access to group B registers is restricted when MMCR0 PMCC=0b00. Hence the call to dump_ebb_state after closing of event will generate a SIGILL, which is expected. Test has below in logs: <<>> !! child died by signal 4 failure: no_handler_test <<>> In other platforms (like power9), the older behaviour works where group B PMU SPRs are readable. The "dump_ebb_state" is called twice in the test. The second call after closing of event was done inorder to dump state of registers when the counters are frozen. But since the counters should already be frozen by the time first dump is done, patch1 drops the second call to "dump_ebb_state". To address the new sigill behaviour in ISA v3.1, patch2 creates a separate selftest. Changelog: v2 -> v3: Fixed a space issue in patch2. v1 -> v2: Addressed review comments from Michael Ellerman. First version attempted to address the SIGILL behaviour in existing "no_handler_test" test itself. As per mpe's suggestion, moved that to a separate test and removed the second call to "dump_ebb_state" since that is actually not needed. Athira Rajeev (2): selftests/powerpc: Fix "no_handler" EBB selftest selftests/powerpc: EBB selftest for MMCR0 control for PMU SPRs in ISA v3.1 tools/testing/selftests/powerpc/pmu/ebb/Makefile | 2 +- .../selftests/powerpc/pmu/ebb/no_handler_test.c| 2 - .../powerpc/pmu/ebb/regs_access_pmccext_test.c | 63 ++ 3 files changed, 64 insertions(+), 3 deletions(-) create mode 100644 tools/testing/selftests/powerpc/pmu/ebb/regs_access_pmccext_test.c -- 1.8.3.1
Re: [PATCH v6 07/11] mm/mremap: Use range flush that does TLB and page walk cache flush
Linus Torvalds writes: > On Sun, May 23, 2021 at 11:04 PM Aneesh Kumar K.V > wrote: >> >> Add new helper flush_pte_tlb_pwc_range() which invalidates both TLB and >> page walk cache where TLB entries are mapped with page size PAGE_SIZE. > > So I dislike this patch for two reasons: > > (a) naming. > > If the ppc people want to use crazy TLA's that have no meaning outside > of the powerpc community, that's fine. But only in powerpc code. > > "pwc" makes no sense to me, or to anybody else that isn't intimately > involved in low-level powerpc stuff. I assume it's "page walk cache", > but honestly, outside of this area, PWC is mostly used for a specific > type of webcam. > > So there's no way I'd accept this as-is, simply because of that. > flush_pte_tlb_pwc_range() is simply not an acceptable name. You would > have to spell it out, not use an obscure TLA. > > But I think you don't even want to do that, because of How about flush_tlb_and_page_table_cache() ? > > (b) is this even worth it as a public interface? > > Why doesn't the powerpc radix TLB flushing code just always flush the > page table walking cache when the range is larger than a PMD? > > Once you have big flush ranges like that, I don't believe it makes any > sense not to flush the walking cache too. But such a large range invalidate doesn't imply we are freeing page tables. Hence forcing a page table cache flush for large range invalidate can have performance impact. ppc64 don't do a range page table cache invalidate. Hence we will have to flush the full page table cache. > > NOTE! This is particularly true as "flush the walking cache" isn't a > well-defined operation anyway. Which _levels_ of the walking cache? > Again, the size (and alignment) of the flush would actually tell you. > A new boolean "flush" parameter does *NOT* tell that at all. > > So I think this new interface is mis-named, but I also think it's > pointless. Just DTRT automatically when somebody asks for a flush that > covers a PMD range (or a PUD range). > > Linus -aneesh
[V2 2/2] selftests/powerpc: EBB selftest for MMCR0 control for PMU SPRs in ISA v3.1
With the MMCR0 control bit (PMCCEXT) in ISA v3.1, read access to group B registers is restricted when MMCR0 PMCC=0b00. In other platforms (like power9), the older behaviour works where group B PMU SPRs are readable. Patch creates a selftest which verifies that the test takes a SIGILL when attempting to read PMU registers via helper function "dump_ebb_state" for ISA v3.1. Signed-off-by: Athira Rajeev --- tools/testing/selftests/powerpc/pmu/ebb/Makefile | 2 +- .../powerpc/pmu/ebb/regs_access_pmccext_test.c | 63 ++ 2 files changed, 64 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/powerpc/pmu/ebb/regs_access_pmccext_test.c diff --git a/tools/testing/selftests/powerpc/pmu/ebb/Makefile b/tools/testing/selftests/powerpc/pmu/ebb/Makefile index c5ecb46..0101606 100644 --- a/tools/testing/selftests/powerpc/pmu/ebb/Makefile +++ b/tools/testing/selftests/powerpc/pmu/ebb/Makefile @@ -24,7 +24,7 @@ TEST_GEN_PROGS := reg_access_test event_attributes_test cycles_test \ fork_cleanup_test ebb_on_child_test\ ebb_on_willing_child_test back_to_back_ebbs_test \ lost_exception_test no_handler_test\ -cycles_with_mmcr2_test +cycles_with_mmcr2_test regs_access_pmccext_test top_srcdir = ../../../../../.. include ../../../lib.mk diff --git a/tools/testing/selftests/powerpc/pmu/ebb/regs_access_pmccext_test.c b/tools/testing/selftests/powerpc/pmu/ebb/regs_access_pmccext_test.c new file mode 100644 index 000..5f1a040 --- /dev/null +++ b/tools/testing/selftests/powerpc/pmu/ebb/regs_access_pmccext_test.c @@ -0,0 +1,63 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2021, Athira Rajeev, IBM Corp. + */ + +#include +#include +#include +#include + +#include "ebb.h" + + +/* + * Test that closing the EBB event clears MMCR0_PMCC and + * sets MMCR0_PMCCEXT preventing further read access to the + * group B PMU registers. + */ + +static int regs_access_pmccext(void) +{ + struct event event; + + SKIP_IF(!ebb_is_supported()); + + event_init_named(&event, 0x1001e, "cycles"); + event_leader_ebb_init(&event); + + FAIL_IF(event_open(&event)); + + ebb_enable_pmc_counting(1); + setup_ebb_handler(standard_ebb_callee); + ebb_global_enable(); + FAIL_IF(ebb_event_enable(&event)); + + mtspr(SPRN_PMC1, pmc_sample_period(sample_period)); + + while (ebb_state.stats.ebb_count < 1) + FAIL_IF(core_busy_loop()); + + ebb_global_disable(); + event_close(&event); + + FAIL_IF(ebb_state.stats.ebb_count == 0); + + /* +* For ISA v3.1, verify the test takes a SIGILL when reading +* PMU regs after the event is closed. With the control bit +* in MMCR0 (PMCCEXT) restricting access to group B PMU regs, +* sigill is expected. +*/ + if (have_hwcap2(PPC_FEATURE2_ARCH_3_1)) + FAIL_IF(catch_sigill(dump_ebb_state)); + else + dump_ebb_state(); + + return 0; +} + +int main(void) +{ + return test_harness(regs_access_pmccext,"regs_access_pmccext"); +} -- 1.8.3.1
[V2 1/2] selftests/powerpc: Fix "no_handler" EBB selftest
The "no_handler_test" in ebb selftests attempts to read the PMU registers twice via helper function "dump_ebb_state". First dump is just before closing of event and the second invocation is done after closing of the event. The original intention of second dump_ebb_state was to dump the state of registers at the end of the test when the counters are frozen. But this will be achieved with the first call itself since sample period is set to low value and PMU will be frozen by then. Hence patch removes the dump which was done before closing of the event. Signed-off-by: Athira Rajeev Reported-by: Shirisha Ganta --- tools/testing/selftests/powerpc/pmu/ebb/no_handler_test.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/tools/testing/selftests/powerpc/pmu/ebb/no_handler_test.c b/tools/testing/selftests/powerpc/pmu/ebb/no_handler_test.c index fc5bf48..01e827c 100644 --- a/tools/testing/selftests/powerpc/pmu/ebb/no_handler_test.c +++ b/tools/testing/selftests/powerpc/pmu/ebb/no_handler_test.c @@ -50,8 +50,6 @@ static int no_handler_test(void) event_close(&event); - dump_ebb_state(); - /* The real test is that we never took an EBB at 0x0 */ return 0; -- 1.8.3.1
[V2 0/2] selftests/powerpc: Updates to EBB selftest for ISA v3.1
The "no_handler_test" in ebb selftests attempts to read the PMU registers after closing of the event via helper function "dump_ebb_state". With the MMCR0 control bit (PMCCEXT) in ISA v3.1, read access to group B registers is restricted when MMCR0 PMCC=0b00. Hence the call to dump_ebb_state after closing of event will generate a SIGILL, which is expected. Test has below in logs: <<>> !! child died by signal 4 failure: no_handler_test <<>> In other platforms (like power9), the older behaviour works where group B PMU SPRs are readable. The "dump_ebb_state" is called twice in the test. The second call after closing of event was done inorder to dump state of registers when the counters are frozen. But since the counters should already be frozen by the time first dump is done, patch1 drops the second call to "dump_ebb_state". To address the new sigill behaviour in ISA v3.1, patch2 creates a separate selftest. Changelog: v1 -> v2: Addressed review comments from Michael Ellerman. First version attempted to address the SIGILL behaviour in existing "no_handler_test" test itself. As per mpe's suggestion, moved that to a separate test and removed the second call to "dump_ebb_state" since that is actually not needed. Athira Rajeev (2): selftests/powerpc: Fix "no_handler" EBB selftest selftests/powerpc: EBB selftest for MMCR0 control for PMU SPRs in ISA v3.1 tools/testing/selftests/powerpc/pmu/ebb/Makefile | 2 +- .../selftests/powerpc/pmu/ebb/no_handler_test.c| 2 - .../powerpc/pmu/ebb/regs_access_pmccext_test.c | 63 ++ 3 files changed, 64 insertions(+), 3 deletions(-) create mode 100644 tools/testing/selftests/powerpc/pmu/ebb/regs_access_pmccext_test.c -- 1.8.3.1
[RFC v2 4/4] powerpc/papr_scm: Add cpu hotplug support for nvdimm pmu device
Patch here adds cpu hotplug functions to nvdimm pmu. It adds cpumask to designate a cpu to make HCALL to collect the counter data for the nvdimm device and update ABI documentation accordingly. Result in power9 lpar system: command:# cat /sys/devices/nmem0/cpumask 0 Signed-off-by: Kajol Jain --- Documentation/ABI/testing/sysfs-bus-papr-pmem | 6 ++ arch/powerpc/platforms/pseries/papr_scm.c | 61 +++ include/linux/nd.h| 17 -- 3 files changed, 79 insertions(+), 5 deletions(-) diff --git a/Documentation/ABI/testing/sysfs-bus-papr-pmem b/Documentation/ABI/testing/sysfs-bus-papr-pmem index 38c4daf65af2..986df1691914 100644 --- a/Documentation/ABI/testing/sysfs-bus-papr-pmem +++ b/Documentation/ABI/testing/sysfs-bus-papr-pmem @@ -76,6 +76,12 @@ Description: (RO) Attribute group to describe the magic bits For example:: noopstat = "event=0x1" +What: /sys/devices/nmemX/cpumask +Date: May 2021 +Contact: linuxppc-dev , linux-nvd...@lists.01.org, +Description: (RO) This sysfs file exposes the cpumask which is designated to make +HCALLs to retrieve nvdimm pmu event counter data. + What: /sys/devices/nmemX/events Date: May 2021 Contact: linuxppc-dev , linux-nvd...@lists.01.org, diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c index f2d57da98ff4..76121d876b7f 100644 --- a/arch/powerpc/platforms/pseries/papr_scm.c +++ b/arch/powerpc/platforms/pseries/papr_scm.c @@ -339,6 +339,28 @@ static ssize_t drc_pmem_query_stats(struct papr_scm_priv *p, return 0; } +static ssize_t cpumask_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct pmu *pmu = dev_get_drvdata(dev); + struct nvdimm_pmu *nd_pmu; + + nd_pmu = container_of(pmu, struct nvdimm_pmu, pmu); + + return cpumap_print_to_pagebuf(true, buf, cpumask_of(nd_pmu->cpu)); +} + +static DEVICE_ATTR_RO(cpumask); + +static struct attribute *nvdimm_cpumask_attrs[] = { + &dev_attr_cpumask.attr, + NULL, +}; + +static const struct attribute_group nvdimm_pmu_cpumask_group = { + .attrs = nvdimm_cpumask_attrs, +}; + PMU_FORMAT_ATTR(event, "config:0-4"); static struct attribute *nvdimm_pmu_format_attr[] = { @@ -459,6 +481,24 @@ static void papr_scm_pmu_del(struct perf_event *event, int flags) papr_scm_pmu_read(event); } +static int nvdimm_pmu_offline_cpu(unsigned int cpu, struct hlist_node *node) +{ + struct nvdimm_pmu *nd_pmu; + int target; + + nd_pmu = hlist_entry_safe(node, struct nvdimm_pmu, node); + + if (cpu != nd_pmu->cpu) + return 0; + + target = cpumask_last(cpu_active_mask); + if (target < 0 || target >= nr_cpu_ids) + return -1; + + nd_pmu->cpu = target; + return 0; +} + static ssize_t device_show_string(struct device *dev, struct device_attribute *attr, char *buf) { @@ -603,6 +643,7 @@ static int papr_scm_pmu_check_events(struct papr_scm_priv *p, struct nvdimm_pmu /* Fill attribute groups for the nvdimm pmu device */ nd_pmu->attr_groups[NVDIMM_PMU_FORMAT_ATTR] = &nvdimm_pmu_format_group; nd_pmu->attr_groups[NVDIMM_PMU_EVENT_ATTR] = nvdimm_pmu_events_group; + nd_pmu->attr_groups[NVDIMM_PMU_CPUMASK_ATTR] = &nvdimm_pmu_cpumask_group; nd_pmu->attr_groups[NVDIMM_PMU_NULL_ATTR] = NULL; kfree(single_stats); @@ -652,6 +693,20 @@ static void papr_scm_pmu_register(struct papr_scm_priv *p) nd_pmu->read = papr_scm_pmu_read; nd_pmu->add = papr_scm_pmu_add; nd_pmu->del = papr_scm_pmu_del; + nd_pmu->cpu = raw_smp_processor_id(); + + rc = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, +"perf/nvdimm:online", + NULL, nvdimm_pmu_offline_cpu); + if (rc < 0) + goto pmu_cpuhp_setup_err; + + nd_pmu->cpuhp_state = rc; + + /* Register the pmu instance for cpu hotplug */ + rc = cpuhp_state_add_instance_nocalls(nd_pmu->cpuhp_state, &nd_pmu->node); + if (rc) + goto cpuhp_instance_err; rc = register_nvdimm_pmu(nd_pmu, p->pdev); if (rc) @@ -665,6 +720,10 @@ static void papr_scm_pmu_register(struct papr_scm_priv *p) return; pmu_register_err: + cpuhp_state_remove_instance_nocalls(nd_pmu->cpuhp_state, &nd_pmu->node); +cpuhp_instance_err: + cpuhp_remove_multi_state(nd_pmu->cpuhp_state); +pmu_cpuhp_setup_err: nvdimm_pmu_mem_free(nd_pmu); kfree(p->nvdimm_events_map); pmu_check_events_err: @@ -675,6 +734,8 @@ static void papr_scm_pmu_register(struct papr_scm_priv *p) static void nvdimm_pmu_uinit(struct nvdimm_pmu *nd_pmu) { + cpuhp_state_remove_instance_nocalls(nd_pmu->cpuhp_state, &nd_pmu->node);
[RFC v2 1/4] drivers/nvdimm: Add perf interface to expose nvdimm performance stats
Patch adds performance stats reporting support for nvdimm. Added interface includes support for pmu register/unregister functions. A structure is added called nvdimm_pmu to be used for adding arch/platform specific data such as supported events and pmu event functions like event_init/add/read/del. User could use the standard perf tool to access perf events exposed via pmu. Signed-off-by: Kajol Jain --- drivers/nvdimm/Makefile | 1 + drivers/nvdimm/nd_perf.c | 58 include/linux/nd.h | 35 3 files changed, 94 insertions(+) create mode 100644 drivers/nvdimm/nd_perf.c diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile index 29203f3d3069..25dba6095612 100644 --- a/drivers/nvdimm/Makefile +++ b/drivers/nvdimm/Makefile @@ -18,6 +18,7 @@ nd_e820-y := e820.o libnvdimm-y := core.o libnvdimm-y += bus.o libnvdimm-y += dimm_devs.o +libnvdimm-y += nd_perf.o libnvdimm-y += dimm.o libnvdimm-y += region_devs.o libnvdimm-y += region.o diff --git a/drivers/nvdimm/nd_perf.c b/drivers/nvdimm/nd_perf.c new file mode 100644 index ..96280c1ff799 --- /dev/null +++ b/drivers/nvdimm/nd_perf.c @@ -0,0 +1,58 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * nd_perf.c: NVDIMM Device Performance Monitoring Unit support + * + * Perf interface to expose nvdimm performance stats. + * + * Copyright (C) 2021 IBM Corporation + */ + +#define pr_fmt(fmt) "nvdimm_pmu: " fmt + +#include + +int register_nvdimm_pmu(struct nvdimm_pmu *nd_pmu, struct platform_device *pdev) +{ + int rc; + + if (!nd_pmu || !pdev) + return -EINVAL; + + /* event functions like add/del/read/event_init should not be NULL */ + if (WARN_ON_ONCE(!(nd_pmu->event_init && nd_pmu->add && nd_pmu->del && nd_pmu->read))) + return -EINVAL; + + nd_pmu->pmu.task_ctx_nr = perf_invalid_context; + nd_pmu->pmu.name = nd_pmu->name; + nd_pmu->pmu.event_init = nd_pmu->event_init; + nd_pmu->pmu.add = nd_pmu->add; + nd_pmu->pmu.del = nd_pmu->del; + nd_pmu->pmu.read = nd_pmu->read; + + nd_pmu->pmu.attr_groups = nd_pmu->attr_groups; + nd_pmu->pmu.capabilities = PERF_PMU_CAP_NO_INTERRUPT | + PERF_PMU_CAP_NO_EXCLUDE; + + /* +* Add platform_device->dev pointer to nvdimm_pmu to access +* device data in events functions. +*/ + nd_pmu->dev = &pdev->dev; + + rc = perf_pmu_register(&nd_pmu->pmu, nd_pmu->name, -1); + if (rc) + return rc; + + pr_info("%s NVDIMM performance monitor support registered\n", + nd_pmu->name); + + return 0; +} +EXPORT_SYMBOL_GPL(register_nvdimm_pmu); + +void unregister_nvdimm_pmu(struct pmu *nd_pmu) +{ + /* handle freeing of memory in arch specific code */ + perf_pmu_unregister(nd_pmu); +} +EXPORT_SYMBOL_GPL(unregister_nvdimm_pmu); diff --git a/include/linux/nd.h b/include/linux/nd.h index ee9ad76afbba..a0e0619256be 100644 --- a/include/linux/nd.h +++ b/include/linux/nd.h @@ -8,6 +8,8 @@ #include #include #include +#include +#include enum nvdimm_event { NVDIMM_REVALIDATE_POISON, @@ -23,6 +25,39 @@ enum nvdimm_claim_class { NVDIMM_CCLASS_UNKNOWN, }; +/* Event attribute array index */ +#define NVDIMM_PMU_FORMAT_ATTR 0 +#define NVDIMM_PMU_EVENT_ATTR 1 +#define NVDIMM_PMU_NULL_ATTR 2 + +/** + * struct nvdimm_pmu - data structure for nvdimm perf driver + * + * @name: name of the nvdimm pmu device. + * @pmu: pmu data structure for nvdimm performance stats. + * @dev: nvdimm device pointer. + * @functions(event_init/add/del/read): platform specific pmu functions. + * @attr_groups: data structure for events and formats. + */ +struct nvdimm_pmu { + const char *name; + struct pmu pmu; + struct device *dev; + int (*event_init)(struct perf_event *event); + int (*add)(struct perf_event *event, int flags); + void (*del)(struct perf_event *event, int flags); + void (*read)(struct perf_event *event); + /* +* Attribute groups for the nvdimm pmu. Index 0 used for +* format attribute, index 1 used for event attribute and +* index 2 kept as NULL. +*/ + const struct attribute_group *attr_groups[3]; +}; + +int register_nvdimm_pmu(struct nvdimm_pmu *nvdimm, struct platform_device *pdev); +void unregister_nvdimm_pmu(struct pmu *pmu); + struct nd_device_driver { struct device_driver drv; unsigned long type; -- 2.27.0
[RFC v2 3/4] powerpc/papr_scm: Document papr_scm sysfs event format entries
This patch add event format and events details in ABI documentation Signed-off-by: Kajol Jain --- Documentation/ABI/testing/sysfs-bus-papr-pmem | 25 +++ 1 file changed, 25 insertions(+) diff --git a/Documentation/ABI/testing/sysfs-bus-papr-pmem b/Documentation/ABI/testing/sysfs-bus-papr-pmem index 92e2db0e2d3d..38c4daf65af2 100644 --- a/Documentation/ABI/testing/sysfs-bus-papr-pmem +++ b/Documentation/ABI/testing/sysfs-bus-papr-pmem @@ -59,3 +59,28 @@ Description: * "CchRHCnt" : Cache Read Hit Count * "CchWHCnt" : Cache Write Hit Count * "FastWCnt" : Fast Write Count + +What: /sys/devices/nmemX/format +Date: May 2021 +Contact: linuxppc-dev , linux-nvd...@lists.01.org, +Description: (RO) Attribute group to describe the magic bits +that go into perf_event_attr.config for a particular pmu. +(See ABI/testing/sysfs-bus-event_source-devices-format). + +Each attribute under this group defines a bit range of the +perf_event_attr.config. Supported attribute is listed +below:: + + event = "config:0-4" - event ID + + For example:: + noopstat = "event=0x1" + +What: /sys/devices/nmemX/events +Date: May 2021 +Contact: linuxppc-dev , linux-nvd...@lists.01.org, +Description:(RO) Attribute group to describe performance monitoring +events specific to papr-scm. Each attribute in this group describes +a single performance monitoring event supported by this nvdimm pmu. +The name of the file is the name of the event. +(See ABI/testing/sysfs-bus-event_source-devices-events). -- 2.27.0
[RFC v2 2/4] powerpc/papr_scm: Add perf interface support
Patch adds support for performance monitoring of papr-scm nvdimm devices via perf interface. It adds pmu functions like add/del/read/event_init for nvdimm_pmu struture. Patch adds a new parameter 'priv' in pdev_archdata structure to save nvdimm_pmu device pointer, to handle the unregistering of pmu device. papr_scm_pmu_register function populates the nvdimm_pmu structure with events, attribute groups along with event handling functions. Finally the populated nvdimm_pmu structure is passed to register the pmu device. Event handling functions internally uses hcall to get events and counter data. Result in power9 machine with 2 nvdimm device: Ex: List all event by perf list command:# perf list nmem nmem0/cchrhcnt/[Kernel PMU event] nmem0/cchwhcnt/[Kernel PMU event] nmem0/critrscu/[Kernel PMU event] nmem0/ctlresct/[Kernel PMU event] nmem0/ctlrestm/[Kernel PMU event] nmem0/fastwcnt/[Kernel PMU event] nmem0/hostlcnt/[Kernel PMU event] nmem0/hostldur/[Kernel PMU event] nmem0/hostscnt/[Kernel PMU event] nmem0/hostsdur/[Kernel PMU event] nmem0/medrcnt/ [Kernel PMU event] nmem0/medrdur/ [Kernel PMU event] nmem0/medwcnt/ [Kernel PMU event] nmem0/medwdur/ [Kernel PMU event] nmem0/memlife/ [Kernel PMU event] nmem0/noopstat/[Kernel PMU event] nmem0/ponsecs/ [Kernel PMU event] nmem1/cchrhcnt/[Kernel PMU event] nmem1/cchwhcnt/[Kernel PMU event] nmem1/critrscu/[Kernel PMU event] ... nmem1/noopstat/[Kernel PMU event] nmem1/ponsecs/ [Kernel PMU event] Signed-off-by: Kajol Jain --- arch/powerpc/include/asm/device.h | 5 + arch/powerpc/platforms/pseries/papr_scm.c | 358 ++ 2 files changed, 363 insertions(+) diff --git a/arch/powerpc/include/asm/device.h b/arch/powerpc/include/asm/device.h index 219559d65864..47ed639f3b8f 100644 --- a/arch/powerpc/include/asm/device.h +++ b/arch/powerpc/include/asm/device.h @@ -48,6 +48,11 @@ struct dev_archdata { struct pdev_archdata { u64 dma_mask; + /* +* Pointer to nvdimm_pmu structure, to handle the unregistering +* of pmu device +*/ + void *priv; }; #endif /* _ASM_POWERPC_DEVICE_H */ diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c index ef26fe40efb0..f2d57da98ff4 100644 --- a/arch/powerpc/platforms/pseries/papr_scm.c +++ b/arch/powerpc/platforms/pseries/papr_scm.c @@ -18,6 +18,8 @@ #include #include #include +#include +#include #define BIND_ANY_ADDR (~0ul) @@ -67,6 +69,8 @@ #define PAPR_SCM_PERF_STATS_EYECATCHER __stringify(SCMSTATS) #define PAPR_SCM_PERF_STATS_VERSION 0x1 +#define to_nvdimm_pmu(_pmu)container_of(_pmu, struct nvdimm_pmu, pmu) + /* Struct holding a single performance metric */ struct papr_scm_perf_stat { u8 stat_id[8]; @@ -116,6 +120,12 @@ struct papr_scm_priv { /* length of the stat buffer as expected by phyp */ size_t stat_buffer_len; + +/* array to have event_code and stat_id mappings */ + char **nvdimm_events_map; + + /* count of supported events */ + u32 total_events; }; static int papr_scm_pmem_flush(struct nd_region *nd_region, @@ -329,6 +339,347 @@ static ssize_t drc_pmem_query_stats(struct papr_scm_priv *p, return 0; } +PMU_FORMAT_ATTR(event, "config:0-4"); + +static struct attribute *nvdimm_pmu_format_attr[] = { + &format_attr_event.attr, + NULL, +}; + +static struct attribute_group nvdimm_pmu_format_group = { + .name = "format", + .attrs = nvdimm_pmu_format_attr, +}; + +static int papr_scm_pmu_get_value(struct perf_event *event, struct device *dev, u64 *count) +{ + struct papr_scm_perf_stat *stat; + struct papr_scm_perf_stats *stats; + struct papr_scm_priv *p = (struct papr_scm_priv *)dev->driver_data; + int rc, size; + + /* Allocate request buffer enough to hold single performance stat */ + size = sizeof(struct papr_scm_perf_stats) + + sizeof(struct papr_scm_perf_stat); + + if (!p || !p->nvdimm_events_map) + return -EINVAL; + + stats = kzalloc(size, GFP_KERNEL); + if (!stats) + return -ENO
[RFC v2 0/4] Add perf interface to expose nvdimm
Patch adds performance stats reporting support for nvdimm. Added interface includes support for pmu register/unregister functions. A structure is added called nvdimm_pmu to be used for adding arch/platform specific data such as supported events and pmu event functions like event_init/add/read/del. User could use the standard perf tool to access perf events exposed via pmu. Patchset includes implementation to expose IBM pseries platform nmem* device performance stats using this interface. Result from power9 pseries lpar with 2 nvdimm device: command:# perf list nmem nmem0/cchrhcnt/[Kernel PMU event] nmem0/cchwhcnt/[Kernel PMU event] nmem0/critrscu/[Kernel PMU event] nmem0/ctlresct/[Kernel PMU event] nmem0/ctlrestm/[Kernel PMU event] nmem0/fastwcnt/[Kernel PMU event] nmem0/hostlcnt/[Kernel PMU event] nmem0/hostldur/[Kernel PMU event] nmem0/hostscnt/[Kernel PMU event] nmem0/hostsdur/[Kernel PMU event] nmem0/medrcnt/ [Kernel PMU event] nmem0/medrdur/ [Kernel PMU event] nmem0/medwcnt/ [Kernel PMU event] nmem0/medwdur/ [Kernel PMU event] nmem0/memlife/ [Kernel PMU event] nmem0/noopstat/[Kernel PMU event] nmem0/ponsecs/ [Kernel PMU event] nmem1/cchrhcnt/[Kernel PMU event] nmem1/cchwhcnt/[Kernel PMU event] nmem1/critrscu/[Kernel PMU event] ... nmem1/noopstat/[Kernel PMU event] nmem1/ponsecs/ [Kernel PMU event] Patch1: Introduces the nvdimm_pmu structure, common function for pmu register along with callback routine check. Pacth2 Add code in arch/powerpc/platform/pseries/papr_scm.c to expose nmem* pmu. It fills in the nvdimm_pmu structure with event attrs and event functions and then registers the pmu by adding callbacks to register_nvdimm_pmu. Patch3: Sysfs documentation patch Patch4: Adds cpuhotplug support. Changelog --- v1 -> v2 - Removed intermediate functions nvdimm_pmu_read/nvdimm_pmu_add/ nvdimm_pmu_del/nvdimm_pmu_event_init and directly assigned platfrom specific routines. Also add check for any NULL functions. Suggested by: Peter Zijlstra - Add macros for event attribute array index which can be used to assign dynamically allocated attr_groups. - New function 'nvdimm_pmu_mem_free' is added to free dynamic memory allocated for attr_groups in papr_scm.c - PMU register call moved from papr_scm_nvdimm_init() to papr_scm_probe() - Move addition of cpu/node/cpuhp_state attributes in struct nvdimm_pmu to patch 4 where cpu hotplug code added - Removed device attribute from the attribute list of add/del/read/event_init functions in nvdimm_pmu structure as we need to assign them directly to pmu structure. - Some optimizations/fixes from previous RFC code Kajol Jain (4): drivers/nvdimm: Add perf interface to expose nvdimm performance stats powerpc/papr_scm: Add perf interface support powerpc/papr_scm: Document papr_scm sysfs event format entries powerpc/papr_scm: Add cpu hotplug support for nvdimm pmu device Documentation/ABI/testing/sysfs-bus-papr-pmem | 31 ++ arch/powerpc/include/asm/device.h | 5 + arch/powerpc/platforms/pseries/papr_scm.c | 419 ++ drivers/nvdimm/Makefile | 1 + drivers/nvdimm/nd_perf.c | 58 +++ include/linux/nd.h| 42 ++ 6 files changed, 556 insertions(+) create mode 100644 drivers/nvdimm/nd_perf.c -- 2.27.0
Re: [PATCH 1/3] sched/topology: Allow archs to populate distance map
* Valentin Schneider [2021-05-25 11:21:02]: > On 24/05/21 21:48, Srikar Dronamraju wrote: > > * Valentin Schneider [2021-05-24 15:16:09]: > >> Ok so from your arch you can figure out the *size* of the set of unique > >> distances, but not the individual node_distance(a, b)... That's quite > >> unfortunate. > > > > Yes, thats true. > > > >> > >> I suppose one way to avoid the hook would be to write some "fake" distance > >> values into your distance_lookup_table[] for offline nodes using your > >> distance_ref_point_depth thing, i.e. ensure an iteration of > >> node_distance(a, b) covers all distance values [1]. You can then keep patch > >> 3 around, and that should roughly be it. > >> > > > > Yes, this would suffice but to me its not very clean. > > static int found[distance_ref_point_depth]; > > > > for_each_node(node){ > > int i, nd, distance = LOCAL_DISTANCE; > > goto out; > > > > nd = node_distance(node, first_online_node) > > for (i=0; i < distance_ref_point_depth; i++, distance *= 2) { > > if (node_online) { > > if (distance != nd) > > continue; > > found[i] ++; > > break; > > } > > if (found[i]) > > continue; > > distance_lookup_table[node][i] = > > distance_lookup_table[first_online_node][i]; > > found[i] ++; > > break; > > } > > } > > > > But do note: We are setting a precedent for node distance between two nodes > > to change. > > > > Indeed. AFAICT it's that or the unique-distance-values hook :/ Peter, Please let me know which approach would you prefer. I am open to try any other approach too. In my humble opinion, unique-distance-values hook is more cleaner. Do you still have any concerns with the unique-distance-values hook? -- Thanks and Regards Srikar Dronamraju
Re: [PATCH 1/3] sched/topology: Allow archs to populate distance map
On 24/05/21 21:48, Srikar Dronamraju wrote: > * Valentin Schneider [2021-05-24 15:16:09]: >> Ok so from your arch you can figure out the *size* of the set of unique >> distances, but not the individual node_distance(a, b)... That's quite >> unfortunate. > > Yes, thats true. > >> >> I suppose one way to avoid the hook would be to write some "fake" distance >> values into your distance_lookup_table[] for offline nodes using your >> distance_ref_point_depth thing, i.e. ensure an iteration of >> node_distance(a, b) covers all distance values [1]. You can then keep patch >> 3 around, and that should roughly be it. >> > > Yes, this would suffice but to me its not very clean. > static int found[distance_ref_point_depth]; > > for_each_node(node){ > int i, nd, distance = LOCAL_DISTANCE; > goto out; > > nd = node_distance(node, first_online_node) > for (i=0; i < distance_ref_point_depth; i++, distance *= 2) { > if (node_online) { > if (distance != nd) > continue; > found[i] ++; > break; > } > if (found[i]) > continue; > distance_lookup_table[node][i] = > distance_lookup_table[first_online_node][i]; > found[i] ++; > break; > } > } > > But do note: We are setting a precedent for node distance between two nodes > to change. > Indeed. AFAICT it's that or the unique-distance-values hook :/
Re: [PATCH v6 updated 9/11] mm/mremap: Fix race between mremap and pageout
Linus Torvalds writes: > On Mon, May 24, 2021 at 3:38 AM Aneesh Kumar K.V > wrote: >> >> Avoid the above race with MOVE_PMD by holding pte ptl in mremap and waiting >> for >> parallel pagetable walk to finish operating on pte before updating new_pmd > > Ack on the concept. Should we worry about the below race. The window would be small CPU 1 CPU 2 CPU 3 mremap(old_addr, new_addr) page_shrinker/try_to_unmap_one mmap_write_lock_killable() addr = old_addr lock(pmd_ptl) pmd = *old_pmd pmd_clear(old_pmd) flush_tlb_range(old_addr) lock(pte_ptl) *new_pmd = pmd unlock(pte_ptl) unlock(pmd_ptl) lock(pte_ptl) *new_addr = 10; and fills TLB with new addr and old pfn ptep_clear_flush(old_addr) old pfn is free. Stale TLB entry > > However, not so much on the patch. > > Odd whitespace change: > >> @@ -254,6 +254,7 @@ static bool move_normal_pmd(struct vm_area_struct *vma, >> unsigned long old_addr, >> if (WARN_ON_ONCE(!pmd_none(*new_pmd))) >> return false; >> >> + >> /* >> * We don't have to worry about the ordering of src and dst >> * ptlocks because exclusive mmap_lock prevents deadlock. > > And new optimization for empty pmd, which seems unrelated to the > change and should presumably be separate: That was added that we can safely do pte_lockptr() below > >> @@ -263,6 +264,10 @@ static bool move_normal_pmd(struct vm_area_struct *vma, >> unsigned long old_addr, >> if (new_ptl != old_ptl) >> spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING); >> >> + if (pmd_none(*old_pmd)) >> + goto unlock_out; >> + >> + pte_ptl = pte_lockptr(mm, old_pmd); >> /* Clear the pmd */ >> pmd = *old_pmd; >> pmd_clear(old_pmd); > > And also, why does the above assign 'pte_ptl' without using it, when > the actual use is ten lines further down? So that we fetch the pte_ptl before we clear old_pmd. -aneesh
[RFC PATCH 1/2] mm/mremap: Fix race between MOVE_PMD mremap and pageout
CPU 1 CPU 2 CPU 3 mremap(old_addr, new_addr) page_shrinker/try_to_unmap_one mmap_write_lock_killable() addr = old_addr lock(pte_ptl) lock(pmd_ptl) pmd = *old_pmd pmd_clear(old_pmd) flush_tlb_range(old_addr) *new_pmd = pmd *new_addr = 10; and fills TLB with new addr and old pfn unlock(pmd_ptl) ptep_clear_flush() old pfn is free. Stale TLB entry Fix this race by holding pmd lock in pageout. This still doesn't handle the race between MOVE_PUD and pageout. Fixes: 2c91bd4a4e2e ("mm: speed up mremap by 20x on large regions") Link: https://lore.kernel.org/linux-mm/CAHk-=wgxvr04ebntxqfevontwnp6fdm+oj5vauqxp3s-huw...@mail.gmail.com Signed-off-by: Aneesh Kumar K.V --- include/linux/rmap.h | 9 ++--- mm/page_vma_mapped.c | 36 ++-- 2 files changed, 24 insertions(+), 21 deletions(-) diff --git a/include/linux/rmap.h b/include/linux/rmap.h index def5c62c93b3..272ab0c2b60b 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -207,7 +207,8 @@ struct page_vma_mapped_walk { unsigned long address; pmd_t *pmd; pte_t *pte; - spinlock_t *ptl; + spinlock_t *pte_ptl; + spinlock_t *pmd_ptl; unsigned int flags; }; @@ -216,8 +217,10 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw) /* HugeTLB pte is set to the relevant page table entry without pte_mapped. */ if (pvmw->pte && !PageHuge(pvmw->page)) pte_unmap(pvmw->pte); - if (pvmw->ptl) - spin_unlock(pvmw->ptl); + if (pvmw->pte_ptl) + spin_unlock(pvmw->pte_ptl); + if (pvmw->pmd_ptl) + spin_unlock(pvmw->pmd_ptl); } bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw); diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c index 2cf01d933f13..87a2c94c7e27 100644 --- a/mm/page_vma_mapped.c +++ b/mm/page_vma_mapped.c @@ -47,8 +47,10 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw) return false; } } - pvmw->ptl = pte_lockptr(pvmw->vma->vm_mm, pvmw->pmd); - spin_lock(pvmw->ptl); + if (USE_SPLIT_PTE_PTLOCKS) { + pvmw->pte_ptl = pte_lockptr(pvmw->vma->vm_mm, pvmw->pmd); + spin_lock(pvmw->pte_ptl); + } return true; } @@ -162,8 +164,8 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) if (!pvmw->pte) return false; - pvmw->ptl = huge_pte_lockptr(page_hstate(page), mm, pvmw->pte); - spin_lock(pvmw->ptl); + pvmw->pte_ptl = huge_pte_lockptr(page_hstate(page), mm, pvmw->pte); + spin_lock(pvmw->pte_ptl); if (!check_pte(pvmw)) return not_found(pvmw); return true; @@ -179,6 +181,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) if (!pud_present(*pud)) return false; pvmw->pmd = pmd_offset(pud, pvmw->address); + pvmw->pmd_ptl = pmd_lock(mm, pvmw->pmd); /* * Make sure the pmd value isn't cached in a register by the * compiler and used as a stale value after we've observed a @@ -186,7 +189,6 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) */ pmde = READ_ONCE(*pvmw->pmd); if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) { - pvmw->ptl = pmd_lock(mm, pvmw->pmd); if (likely(pmd_trans_huge(*pvmw->pmd))) { if (pvmw->flags & PVMW_MIGRATION) return not_found(pvmw); @@ -206,14 +208,10 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) } } return not_found(pvmw); - } else { - /* THP pmd was split under us: handle on pte level */ - spin_unlock(pvmw->ptl); - pvmw->ptl = NULL; } - } else if (!pmd_present(pmde)) { - return false; - } + } else if (!pmd_present(pmde)) + return not_found(pvmw); + if (!map_pte(pvmw)) goto next_pte; while (1) { @@ -233,19 +231,21 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) /* Did we cross page table boundary? */ if (pvmw->address % P
[RFC PATCH 2/2] mm/mremap: Fix race between MOVE_PUD mremap and pageout
CPU 1 CPU 2 CPU 3 mremap(old_addr, new_addr) page_shrinker/try_to_unmap_one mmap_write_lock_killable() addr = old_addr lock(pte_ptl) lock(pud_ptl) pud = *old_pud pud_clear(old_pud) flush_tlb_range(old_addr) *new_pud = pud *new_addr = 10; and fills TLB with new addr and old pfn unlock(pud_ptl) ptep_clear_flush() old pfn is free. Stale TLB entry Fix this race by holding pud lock in pageout. Signed-off-by: Aneesh Kumar K.V --- include/linux/rmap.h | 4 mm/page_vma_mapped.c | 13 ++--- 2 files changed, 14 insertions(+), 3 deletions(-) diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 272ab0c2b60b..491c65ce1d46 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -209,6 +209,7 @@ struct page_vma_mapped_walk { pte_t *pte; spinlock_t *pte_ptl; spinlock_t *pmd_ptl; + spinlock_t *pud_ptl; unsigned int flags; }; @@ -221,6 +222,9 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw) spin_unlock(pvmw->pte_ptl); if (pvmw->pmd_ptl) spin_unlock(pvmw->pmd_ptl); + if (pvmw->pud_ptl) + spin_unlock(pvmw->pud_ptl); + } bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw); diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c index 87a2c94c7e27..c913bc34b1d3 100644 --- a/mm/page_vma_mapped.c +++ b/mm/page_vma_mapped.c @@ -180,8 +180,11 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) pud = pud_offset(p4d, pvmw->address); if (!pud_present(*pud)) return false; + + pvmw->pud_ptl = pud_lock(mm, pud); pvmw->pmd = pmd_offset(pud, pvmw->address); - pvmw->pmd_ptl = pmd_lock(mm, pvmw->pmd); + if (USE_SPLIT_PMD_PTLOCKS) + pvmw->pmd_ptl = pmd_lock(mm, pvmw->pmd); /* * Make sure the pmd value isn't cached in a register by the * compiler and used as a stale value after we've observed a @@ -235,8 +238,12 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) spin_unlock(pvmw->pte_ptl); pvmw->pte_ptl = NULL; } - spin_unlock(pvmw->pmd_ptl); - pvmw->pmd_ptl = NULL; + if (pvmw->pmd_ptl) { + spin_unlock(pvmw->pmd_ptl); + pvmw->pmd_ptl = NULL; + } + spin_unlock(pvmw->pud_ptl); + pvmw->pud_ptl = NULL; goto restart; } else { pvmw->pte++; -- 2.31.1