Re: [PATCH] perf vendor events: Fix eventcode of power10 json events

2021-05-25 Thread Nageswara Sastry



> On 25-May-2021, at 8:57 PM, Paul A. Clarke  wrote:
>> 
> I lost the original message, but Nageswara Sastry said:
>> 1. Extracted all the 244 events from the patch.
>> 2. Check them in 'perf list' - all 244 events found
>> 3. Ran all the events with 'perf stat -e "event name" sleep 1', all ran fine.
>>No errors were seen in 'dmesg'
> 
> I count 255 events.
> 
> PC

Seems while extracting I filtered out newly added ones, so I got 244(255-11). 
Now checked with all 255 events. Thanks for pointing out.

Thanks!!
R.Nageswara Sastry

Re: simplify gendisk and request_queue allocation for bio based drivers

2021-05-25 Thread Christoph Hellwig
On Wed, May 26, 2021 at 12:41:37AM +0200, Ulf Hansson wrote:
> On Fri, 21 May 2021 at 07:51, Christoph Hellwig  wrote:
> >
> > Hi all,
> >
> > this series is the first part of cleaning up lifetimes and allocation of
> > the gendisk and request_queue structure.  It adds a new interface to
> > allocate the disk and queue together for bio based drivers, and a helper
> > for cleanup/free them when a driver is unloaded or a device is removed.
> 
> May I ask what else you have in the pipe for the next steps?
> 
> The reason why I ask is that I am looking into some issues related to
> lifecycle problems of gendisk/mmc, typically triggered at SD/MMC card
> removal.

In the short run not much more than superficial cleanups.  Eventually
I want bio based drivers to not require a separate request_queue, leaving
that purely as a data structure for blk-mq based drivers.  But it will
take a while until we get there, so it should not block any fixes.

For hot unplug handling it might be worth to take a look at nvme, as it
is tested a lot for that case.


[powerpc:next] BUILD SUCCESS b73c8cccd72ac28beaf262fd6ef4b91411fc8335

2021-05-25 Thread kernel test robot
   bmips_be_defconfig
powerpc mpc8272_ads_defconfig
powerpc sbc8548_defconfig
sh   sh7770_generic_defconfig
arm am200epdkit_defconfig
alpha   defconfig
riscvnommu_k210_defconfig
powerpc tqm8541_defconfig
powerpc   mpc834x_itxgp_defconfig
mips   capcella_defconfig
armmps2_defconfig
nds32   defconfig
powerpc  obs600_defconfig
powerpc  bamboo_defconfig
mipsbcm47xx_defconfig
arm axm55xx_defconfig
sh  landisk_defconfig
openriscor1ksim_defconfig
mips  maltaaprp_defconfig
mips  rm200_defconfig
armpleb_defconfig
arm   omap1_defconfig
mips mpc30x_defconfig
powerpc rainier_defconfig
powerpc  ppc6xx_defconfig
arm hackkit_defconfig
mips   mtx1_defconfig
powerpc  allyesconfig
x86_64  defconfig
mips tb0219_defconfig
mipsar7_defconfig
m68k   m5249evb_defconfig
arm   tegra_defconfig
riscv allnoconfig
arm davinci_all_defconfig
powerpc  mgcoge_defconfig
nds32 allnoconfig
powerpc  g5_defconfig
arm   u8500_defconfig
sh   se7750_defconfig
sh   se7206_defconfig
arm   cns3420vb_defconfig
powerpc sequoia_defconfig
sh   se7619_defconfig
s390defconfig
m68kq40_defconfig
microblaze  defconfig
sh   se7705_defconfig
m68kmvme16x_defconfig
arm rpc_defconfig
riscv nommu_k210_sdcard_defconfig
mipse55_defconfig
powerpcfsp2_defconfig
mips  malta_defconfig
powerpc wii_defconfig
sh   j2_defconfig
arm64alldefconfig
sh   secureedge5410_defconfig
shedosk7705_defconfig
powerpc  tqm8xx_defconfig
powerpc   currituck_defconfig
m68k apollo_defconfig
arm  pxa255-idp_defconfig
powerpc canyonlands_defconfig
powerpc  pasemi_defconfig
arcnsim_700_defconfig
armdove_defconfig
arc nsimosci_hs_defconfig
s390  debug_defconfig
arm   aspeed_g5_defconfig
mips  rb532_defconfig
armtrizeps4_defconfig
ia64 allmodconfig
ia64defconfig
ia64 allyesconfig
m68k allmodconfig
m68kdefconfig
nios2   defconfig
arc  allyesconfig
nios2allyesconfig
alphaallyesconfig
xtensa   allyesconfig
h8300allyesconfig
arc defconfig
sh   allmodconfig
parisc  defconfig
s390 allyesconfig
s390 allmodconfig
parisc   allyesconfig
i386 allyesconfig
sparcallyesconfig
sparc   defconfig
i386defconfig
mips allyesconfig
mips allmodconfig
powerpc  allmodconfig
powerpc   allnoconfig
i386 randconfig-a001-20210525
i386 randconfig-a002-20210525
i386 randconfig-a005-20210525
i386 randconfig-a006-20210525
i386 randconfig-a003-20210525
i386 randconfig-a004-20210525
i386 randconfig-a001-20210526
i386 randconfig-a002-20210526
i386 randconfig-a005-20210526
i386 randconfig-a004-20210526
i386 randconfig-a003-20210526
i386 randconfig-a006-20210526
x86_64

[powerpc:next-test] BUILD SUCCESS 027f55e87c3094270a3223f7331d033fe15a2b3f

2021-05-25 Thread kernel test robot
   mtx1_defconfig
mips tb0219_defconfig
mipsar7_defconfig
m68k   m5249evb_defconfig
arm   tegra_defconfig
riscv allnoconfig
csky alldefconfig
sparc   sparc32_defconfig
sh   se7343_defconfig
sh   se7619_defconfig
m68kq40_defconfig
powerpcklondike_defconfig
microblaze  defconfig
sh   se7705_defconfig
m68kmvme16x_defconfig
arm rpc_defconfig
riscv nommu_k210_sdcard_defconfig
mipse55_defconfig
powerpc mpc8272_ads_defconfig
powerpcfsp2_defconfig
mips  malta_defconfig
powerpc wii_defconfig
sh   j2_defconfig
arm64alldefconfig
sh   secureedge5410_defconfig
mips mpc30x_defconfig
shedosk7705_defconfig
m68k apollo_defconfig
arm  pxa255-idp_defconfig
powerpc canyonlands_defconfig
sh   sh7724_generic_defconfig
powerpc  pasemi_defconfig
arcnsim_700_defconfig
armdove_defconfig
arc nsimosci_hs_defconfig
s390  debug_defconfig
arm   aspeed_g5_defconfig
mips  rb532_defconfig
armtrizeps4_defconfig
ia64 allmodconfig
ia64defconfig
ia64 allyesconfig
m68k allmodconfig
m68kdefconfig
nios2   defconfig
arc  allyesconfig
nds32 allnoconfig
nios2allyesconfig
alpha   defconfig
alphaallyesconfig
xtensa   allyesconfig
h8300allyesconfig
arc defconfig
sh   allmodconfig
parisc  defconfig
s390 allyesconfig
s390 allmodconfig
parisc   allyesconfig
s390defconfig
i386 allyesconfig
sparcallyesconfig
sparc   defconfig
mips allyesconfig
mips allmodconfig
powerpc  allyesconfig
powerpc  allmodconfig
powerpc   allnoconfig
i386 randconfig-a001-20210525
i386 randconfig-a002-20210525
i386 randconfig-a005-20210525
i386 randconfig-a006-20210525
i386 randconfig-a003-20210525
i386 randconfig-a004-20210525
i386 randconfig-a001-20210526
i386 randconfig-a002-20210526
i386 randconfig-a005-20210526
i386 randconfig-a004-20210526
i386 randconfig-a003-20210526
i386 randconfig-a006-20210526
x86_64   randconfig-a013-20210525
x86_64   randconfig-a012-20210525
x86_64   randconfig-a014-20210525
x86_64   randconfig-a016-20210525
x86_64   randconfig-a015-20210525
x86_64   randconfig-a011-20210525
i386 randconfig-a011-20210525
i386 randconfig-a016-20210525
i386 randconfig-a015-20210525
i386 randconfig-a012-20210525
i386 randconfig-a014-20210525
i386 randconfig-a013-20210525
x86_64   randconfig-a005-20210526
x86_64   randconfig-a001-20210526
x86_64   randconfig-a006-20210526
x86_64   randconfig-a003-20210526
x86_64   randconfig-a004-20210526
x86_64   randconfig-a002-20210526
riscvallyesconfig
riscvnommu_virt_defconfig
riscv   defconfig
riscv  rv32_defconfig
riscvallmodconfig
um   allmodconfig
umallnoconfig
um   allyesconfig
um  defconfig
x86_64   allyesconfig
x86_64rhel-8.3-kselftests
x86_64  defconfig
x86_64

[powerpc:merge] BUILD SUCCESS 112f47a1484ddca610b70cbe4a99f0d0f1701daf

2021-05-25 Thread kernel test robot
allyesconfig
sparc   defconfig
i386defconfig
mips allyesconfig
mips allmodconfig
powerpc  allyesconfig
powerpc  allmodconfig
powerpc   allnoconfig
i386 randconfig-a001-20210525
i386 randconfig-a002-20210525
i386 randconfig-a005-20210525
i386 randconfig-a006-20210525
i386 randconfig-a003-20210525
i386 randconfig-a004-20210525
i386 randconfig-a001-20210526
i386 randconfig-a002-20210526
i386 randconfig-a005-20210526
i386 randconfig-a004-20210526
i386 randconfig-a003-20210526
i386 randconfig-a006-20210526
x86_64   randconfig-a013-20210525
x86_64   randconfig-a012-20210525
x86_64   randconfig-a014-20210525
x86_64   randconfig-a016-20210525
x86_64   randconfig-a015-20210525
x86_64   randconfig-a011-20210525
i386 randconfig-a011-20210525
i386 randconfig-a016-20210525
i386 randconfig-a015-20210525
i386 randconfig-a012-20210525
i386 randconfig-a014-20210525
i386 randconfig-a013-20210525
x86_64   randconfig-a005-20210526
x86_64   randconfig-a001-20210526
x86_64   randconfig-a006-20210526
x86_64   randconfig-a003-20210526
x86_64   randconfig-a004-20210526
x86_64   randconfig-a002-20210526
riscvallyesconfig
riscvnommu_virt_defconfig
riscv allnoconfig
riscv   defconfig
riscv  rv32_defconfig
riscvallmodconfig
um   allmodconfig
umallnoconfig
um   allyesconfig
um  defconfig
x86_64   allyesconfig
x86_64rhel-8.3-kselftests
x86_64  defconfig
x86_64   rhel-8.3
x86_64  rhel-8.3-kbuiltin
x86_64  kexec

clang tested configs:
x86_64   randconfig-b001-20210525
x86_64   randconfig-a005-20210525
x86_64   randconfig-a006-20210525
x86_64   randconfig-a001-20210525
x86_64   randconfig-a003-20210525
x86_64   randconfig-a004-20210525
x86_64   randconfig-a002-20210525

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org


Re: [PATCH v3 0/4] P2040/P2041 i2c recovery erratum

2021-05-25 Thread Michael Ellerman
Wolfram Sang  writes:
> On Wed, May 12, 2021 at 09:20:48AM +1200, Chris Packham wrote:
>> The P2040/P2041 has an erratum where the i2c recovery scheme
>> documented in the reference manual (and currently implemented
>> in the i2c-mpc.c driver) does not work. The errata document
>> provides an alternative that does work. This series implements
>> that alternative and uses a property in the devicetree to
>> decide when the alternative mechanism is needed.
>
> The series looks good to me. Usually, I don't take DTS patches. This
> time I'd make an exception and apply all patches to for-current because
> this is clearly a bugfix. For that, I'd need an ack from PPC
> maintainers. Could I have those for patches 2+3?

Yep, done.

cheers


Re: [PATCH v3 3/4] powerpc/fsl: set fsl,i2c-erratum-a004447 flag for P1010 i2c controllers

2021-05-25 Thread Michael Ellerman
Chris Packham  writes:
> The i2c controllers on the P1010 have an erratum where the documented
> scheme for i2c bus recovery will not work (A-004447). A different
> mechanism is needed which is documented in the P1010 Chip Errata Rev L.
>
> Signed-off-by: Chris Packham 
> ---
>
> Notes:
> Changes in v3:
> - New
>
>  arch/powerpc/boot/dts/fsl/p1010si-post.dtsi | 8 
>  1 file changed, 8 insertions(+)

Acked-by: Michael Ellerman 

cheers

> diff --git a/arch/powerpc/boot/dts/fsl/p1010si-post.dtsi 
> b/arch/powerpc/boot/dts/fsl/p1010si-post.dtsi
> index c2717f31925a..ccda0a91abf0 100644
> --- a/arch/powerpc/boot/dts/fsl/p1010si-post.dtsi
> +++ b/arch/powerpc/boot/dts/fsl/p1010si-post.dtsi
> @@ -122,7 +122,15 @@ memory-controller@2000 {
>   };
>  
>  /include/ "pq3-i2c-0.dtsi"
> + i2c@3000 {
> + fsl,i2c-erratum-a004447;
> + };
> +
>  /include/ "pq3-i2c-1.dtsi"
> + i2c@3100 {
> + fsl,i2c-erratum-a004447;
> + };
> +
>  /include/ "pq3-duart-0.dtsi"
>  /include/ "pq3-espi-0.dtsi"
>   spi0: spi@7000 {
> -- 
> 2.31.1


Re: [PATCH v3 2/4] powerpc/fsl: set fsl,i2c-erratum-a004447 flag for P2041 i2c controllers

2021-05-25 Thread Michael Ellerman
Chris Packham  writes:
> The i2c controllers on the P2040/P2041 have an erratum where the
> documented scheme for i2c bus recovery will not work (A-004447). A
> different mechanism is needed which is documented in the P2040 Chip
> Errata Rev Q (latest available at the time of writing).
>
> Signed-off-by: Chris Packham 
> ---
>  arch/powerpc/boot/dts/fsl/p2041si-post.dtsi | 16 
>  1 file changed, 16 insertions(+)

Acked-by: Michael Ellerman 

cheers

> diff --git a/arch/powerpc/boot/dts/fsl/p2041si-post.dtsi 
> b/arch/powerpc/boot/dts/fsl/p2041si-post.dtsi
> index 872e4485dc3f..ddc018d42252 100644
> --- a/arch/powerpc/boot/dts/fsl/p2041si-post.dtsi
> +++ b/arch/powerpc/boot/dts/fsl/p2041si-post.dtsi
> @@ -371,7 +371,23 @@ sdhc@114000 {
>   };
>  
>  /include/ "qoriq-i2c-0.dtsi"
> + i2c@118000 {
> + fsl,i2c-erratum-a004447;
> + };
> +
> + i2c@118100 {
> + fsl,i2c-erratum-a004447;
> + };
> +
>  /include/ "qoriq-i2c-1.dtsi"
> + i2c@119000 {
> + fsl,i2c-erratum-a004447;
> + };
> +
> + i2c@119100 {
> + fsl,i2c-erratum-a004447;
> + };
> +
>  /include/ "qoriq-duart-0.dtsi"
>  /include/ "qoriq-duart-1.dtsi"
>  /include/ "qoriq-gpio-0.dtsi"
> -- 
> 2.31.1


Re: simplify gendisk and request_queue allocation for bio based drivers

2021-05-25 Thread Ulf Hansson
On Fri, 21 May 2021 at 07:51, Christoph Hellwig  wrote:
>
> Hi all,
>
> this series is the first part of cleaning up lifetimes and allocation of
> the gendisk and request_queue structure.  It adds a new interface to
> allocate the disk and queue together for bio based drivers, and a helper
> for cleanup/free them when a driver is unloaded or a device is removed.

May I ask what else you have in the pipe for the next steps?

The reason why I ask is that I am looking into some issues related to
lifecycle problems of gendisk/mmc, typically triggered at SD/MMC card
removal.

>
> Together this removes the need to treat the gendisk and request_queue
> as separate entities for bio based drivers.
>
> Diffstat:
>  arch/m68k/emu/nfblock.c |   20 +---
>  arch/xtensa/platforms/iss/simdisk.c |   29 +--
>  block/blk-core.c|1
>  block/blk.h |6 -
>  block/genhd.c   |  149 
> +++-
>  block/partitions/core.c |   19 ++--
>  drivers/block/brd.c |   94 +++---
>  drivers/block/drbd/drbd_main.c  |   23 +
>  drivers/block/n64cart.c |8 -
>  drivers/block/null_blk/main.c   |   38 -
>  drivers/block/pktcdvd.c |   11 --
>  drivers/block/ps3vram.c |   31 +--
>  drivers/block/rsxx/dev.c|   39 +++--
>  drivers/block/rsxx/rsxx_priv.h  |1
>  drivers/block/zram/zram_drv.c   |   19 
>  drivers/lightnvm/core.c |   24 +
>  drivers/md/bcache/super.c   |   15 ---
>  drivers/md/dm.c |   16 +--
>  drivers/md/md.c |   25 ++
>  drivers/memstick/core/ms_block.c|1
>  drivers/nvdimm/blk.c|   27 +-
>  drivers/nvdimm/btt.c|   25 +-
>  drivers/nvdimm/btt.h|2
>  drivers/nvdimm/pmem.c   |   17 +---
>  drivers/nvme/host/core.c|1
>  drivers/nvme/host/multipath.c   |   46 +++
>  drivers/s390/block/dcssblk.c|   26 +-
>  drivers/s390/block/xpram.c  |   26 ++
>  include/linux/blkdev.h  |1
>  include/linux/genhd.h   |   23 +
>  30 files changed, 297 insertions(+), 466 deletions(-)

This looks like a nice cleanup to me.  Feel free to add, for the series:

Reviewed-by: Ulf Hansson 

Kind regards
Uffe


Re: [PATCH] perf vendor events: Fix eventcode of power10 json events

2021-05-25 Thread Nageswara Sastry
Tested patch with the following steps:

1. Extracted all the 244 events from the patch.
2. Check them in 'perf list' - all 244 events found
3. Ran all the events with 'perf stat -e "event name" sleep 1', all ran fine.
No errors were seen in 'dmesg'

Tested-by: Nageswara R Sastry 



> On 25-May-2021, at 12:07 PM, Kajol Jain  wrote:
> 
> Fixed the eventcode values in the power10 json event files to append
> "0x" since these are hexadecimal values.
> Patch also changes event description of PM_EXEC_STALL_LOAD_FINISH and
> PM_EXEC_STALL_NTC_FLUSH event and move some events to correct files.
> 
> Fixes: 32daa5d7899e ("perf vendor events: Initial JSON/events list for 
> power10 platform")
> Signed-off-by: Kajol Jain 
> ---
> .../arch/powerpc/power10/cache.json   |  30 ++--
> .../arch/powerpc/power10/floating_point.json  |   2 +-
> .../arch/powerpc/power10/frontend.json| 124 ++--
> .../arch/powerpc/power10/locks.json   |   4 +-
> .../arch/powerpc/power10/marked.json  |  61 
> .../arch/powerpc/power10/memory.json  |  79 +-
> .../arch/powerpc/power10/others.json  | 133 +++--
> .../arch/powerpc/power10/pipeline.json| 135 +-
> .../pmu-events/arch/powerpc/power10/pmc.json  |   8 +-
> .../arch/powerpc/power10/translation.json |  22 +--
> 10 files changed, 299 insertions(+), 299 deletions(-)
> 
> diff --git a/tools/perf/pmu-events/arch/powerpc/power10/cache.json 
> b/tools/perf/pmu-events/arch/powerpc/power10/cache.json
> index 616f29098c71..605be14f441c 100644
> --- a/tools/perf/pmu-events/arch/powerpc/power10/cache.json
> +++ b/tools/perf/pmu-events/arch/powerpc/power10/cache.json
> @@ -1,46 +1,56 @@
> [
>   {
> -"EventCode": "1003C",
> +"EventCode": "0x1003C",
> "EventName": "PM_EXEC_STALL_DMISS_L2L3",
> "BriefDescription": "Cycles in which the oldest instruction in the 
> pipeline was waiting for a load miss to resolve from either the local L2 or 
> local L3."
>   },
>   {
> -"EventCode": "34056",
> +"EventCode": "0x1E054",
> +"EventName": "PM_EXEC_STALL_DMISS_L21_L31",
> +"BriefDescription": "Cycles in which the oldest instruction in the 
> pipeline was waiting for a load miss to resolve from another core's L2 or L3 
> on the same chip."
> +  },
> +  {
> +"EventCode": "0x34054",
> +"EventName": "PM_EXEC_STALL_DMISS_L2L3_NOCONFLICT",
> +"BriefDescription": "Cycles in which the oldest instruction in the 
> pipeline was waiting for a load miss to resolve from the local L2 or local 
> L3, without a dispatch conflict."
> +  },
> +  {
> +"EventCode": "0x34056",
> "EventName": "PM_EXEC_STALL_LOAD_FINISH",
> -"BriefDescription": "Cycles in which the oldest instruction in the 
> pipeline was finishing a load after its data was reloaded from a data source 
> beyond the local L1; cycles in which the LSU was processing an L1-hit; cycles 
> in which the NTF instruction merged with another load in the LMQ."
> +"BriefDescription": "Cycles in which the oldest instruction in the 
> pipeline was finishing a load after its data was reloaded from a data source 
> beyond the local L1; cycles in which the LSU was processing an L1-hit; cycles 
> in which the NTF instruction merged with another load in the LMQ; cycles in 
> which the NTF instruction is waiting for a data reload for a load miss, but 
> the data comes back with a non-NTF instruction."
>   },
>   {
> -"EventCode": "3006C",
> +"EventCode": "0x3006C",
> "EventName": "PM_RUN_CYC_SMT2_MODE",
> "BriefDescription": "Cycles when this thread's run latch is set and the 
> core is in SMT2 mode."
>   },
>   {
> -"EventCode": "300F4",
> +"EventCode": "0x300F4",
> "EventName": "PM_RUN_INST_CMPL_CONC",
> "BriefDescription": "PowerPC instructions completed by this thread when 
> all threads in the core had the run-latch set."
>   },
>   {
> -"EventCode": "4C016",
> +"EventCode": "0x4C016",
> "EventName": "PM_EXEC_STALL_DMISS_L2L3_CONFLICT",
> "BriefDescription": "Cycles in which the oldest instruction in the 
> pipeline was waiting for a load miss to resolve from the local L2 or local 
> L3, with a dispatch conflict."
>   },
>   {
> -"EventCode": "4D014",
> +"EventCode": "0x4D014",
> "EventName": "PM_EXEC_STALL_LOAD",
> "BriefDescription": "Cycles in which the oldest instruction in the 
> pipeline was a load instruction executing in the Load Store Unit."
>   },
>   {
> -"EventCode": "4D016",
> +"EventCode": "0x4D016",
> "EventName": "PM_EXEC_STALL_PTESYNC",
> "BriefDescription": "Cycles in which the oldest instruction in the 
> pipeline was a PTESYNC instruction executing in the Load Store Unit."
>   },
>   {
> -"EventCode": "401EA",
> +"EventCode": "0x401EA",
> "EventName": "PM_THRESH_EXC_128",
> "BriefDescription": "Threshold counter exceeded a value of 128."
>   },
>   {
> -"EventCode": "400F6",
> +"EventCod

Re: [PATCH v3 0/4] P2040/P2041 i2c recovery erratum

2021-05-25 Thread Wolfram Sang
On Wed, May 12, 2021 at 09:20:48AM +1200, Chris Packham wrote:
> The P2040/P2041 has an erratum where the i2c recovery scheme
> documented in the reference manual (and currently implemented
> in the i2c-mpc.c driver) does not work. The errata document
> provides an alternative that does work. This series implements
> that alternative and uses a property in the devicetree to
> decide when the alternative mechanism is needed.

The series looks good to me. Usually, I don't take DTS patches. This
time I'd make an exception and apply all patches to for-current because
this is clearly a bugfix. For that, I'd need an ack from PPC
maintainers. Could I have those for patches 2+3?



signature.asc
Description: PGP signature


Re: [PATCH v3 0/4] P2040/P2041 i2c recovery erratum

2021-05-25 Thread w...@kernel.org

> For those reading along the v2 mentioned in that thread was posted as 
> https://lore.kernel.org/linux-i2c/20170511122033.22471-1-joakim.tjernl...@infinera.com/
>  
> there was a bit of discussion but it seemed to die out without reaching 
> a conclusion.
> 
> The i2c-mpc driver is now using the generic recovery mechanism so that 
> addresses one bit of feedback from the original thread.

Yes, and the generic recovery has been improved since then. There is an
"incomplete_write_byte" fault injector now and the generic recovery
handles it correctly meanwhile. Before, it actually could cause a write
to happen but we are sending STOPs now.



signature.asc
Description: PGP signature


Re: [RFC PATCH 2/2] mm/mremap: Fix race between MOVE_PUD mremap and pageout

2021-05-25 Thread Linus Torvalds
On Mon, May 24, 2021 at 10:34 PM Aneesh Kumar K.V
 wrote:
>
> @@ -221,6 +222,9 @@ static inline void page_vma_mapped_walk_done(struct 
> page_vma_mapped_walk *pvmw)
> spin_unlock(pvmw->pte_ptl);
> if (pvmw->pmd_ptl)
> spin_unlock(pvmw->pmd_ptl);
> +   if (pvmw->pud_ptl)
> +   spin_unlock(pvmw->pud_ptl);
> +
>  }

You have this habit of adding odd whitespace..

But yes, this seems to be the right way to fix the races properly. The
pageout code is special, the pageout code is normally not critical, so
it's the pageout code that should go the extra mile to make up for the
fact that it doesn't hold the mmap_sem like good page table
modification codepaths do.

Linus


Re: [PATCH v6 updated 9/11] mm/mremap: Fix race between mremap and pageout

2021-05-25 Thread Linus Torvalds
On Mon, May 24, 2021 at 10:44 PM A lneesh Kumar K.V
 wrote:
>
> Should we worry about the below race. The window would be small
>
> CPU 1   CPU 2   CPU 3
>
> mremap(old_addr, new_addr)  page_shrinker/try_to_unmap_one
>
> mmap_write_lock_killable()
>
> addr = old_addr
>
> lock(pmd_ptl)
> pmd = *old_pmd
> pmd_clear(old_pmd)
> flush_tlb_range(old_addr)
>
> lock(pte_ptl)
> *new_pmd = pmd
> unlock(pte_ptl)
>
> unlock(pmd_ptl)
> lock(pte_ptl)
> 
> *new_addr = 10; and fills
> TLB 
> with new addr
> and 
> old pfn
>
> ptep_clear_flush(old_addr)
> old pfn is free.
> Stale 
> TLB entry

Hmm. Do you need a third CPU there? What is done above on CPU3 looks
like it might just be CPU1 accessing the new range immediately.

Which doesn't actually sound at all unlikely - so maybe the window is
small, but it sounds like something that could happen.

This looks nasty. The page shrinker has always been problematic
because it basically avoids the normal full set of locks.

I wonder if we could just make the page shrinker try-lock the mmap_sem
and avoid all this that way. It _is_ allowed to fail, after all, and
the page shrinker is "not normal" and should be less of a performance
issue than all the actual normal VM paths.

Does anybody have any good ideas?

> > And new optimization for empty pmd, which seems unrelated to the
> > change and should presumably be separate:
>
> That was added that we can safely do pte_lockptr() below

Oh, because pte_lockptr() doesn't actually use the "old_pmd" pointer
value - it actually *dereferences* the pointer.

That looks like a mis-design. Why does it do that? Why don't we pass
it the pmd value, if that's what it wants?

   Linus


Re: [PATCH v6 07/11] mm/mremap: Use range flush that does TLB and page walk cache flush

2021-05-25 Thread Linus Torvalds
On Tue, May 25, 2021 at 3:28 AM Aneesh Kumar K.V
 wrote:
>
> How about flush_tlb_and_page_table_cache() ?

Honestly, I'd prefer it to be a separate function.

So keep the existing

 flush_tlb()

as-is, and add a

flush_tlb_walking_cache()

and document that any architecture that flushes the walker caches as
part of the regular tlb flush can just make that a no-op.

Would that work for powerpc?

But:

> >  (b) is this even worth it as a public interface?
>
> But such a large range invalidate doesn't imply we are freeing page
> tables.

No. But it's what everybody else (ie x86 and ARM) does, and if you're
flushing megabytes of TLB's, what's the downside of flushing a few TLB
walker cache entries?

You already do that for internal powerpc errata anyway (ie
"mm_needs_flush_escalation()"), so I'm saying that you might as well
treat the page walker cache as a powerpc-internal implementation
thing.

Put another way: can you even _measure_ the difference between "just
make powerpc look like everybody else" and "add a new explicit page
table walker cache flush function interface"?

Now, from a quick look at the powerpc code, it looks like powerpc is a
bit mis-architected, and when you flush the walker cache, you flush
everything for that ASID. x86 and arm only flush the parts affected by
the TLB flush range (now, admittedly, that's what they do
_architecturally_ - for all I know the actual hardware just always
flushes all walker caches when you flush any TLB and so maybe they act
exactly like the powrpc RIC_FLUSH_PWC in practice).

So maybe it's measurable. But I kind of doubt it, and I'd like to know
that you've actually done some testing to see that "yes, this matters,
I can't just say 'if flushing more than a pmd, just flush the walker
cache too'".

Hmm?

Linus


Re: [PATCH] perf vendor events: Fix eventcode of power10 json events

2021-05-25 Thread Paul A. Clarke
On Tue, May 25, 2021 at 09:42:15AM -0500, Paul A. Clarke wrote:
> On Tue, May 25, 2021 at 12:07:23PM +0530, Kajol Jain wrote:
> > Fixed the eventcode values in the power10 json event files to append
> > "0x" since these are hexadecimal values.
> > Patch also changes event description of PM_EXEC_STALL_LOAD_FINISH and
> > PM_EXEC_STALL_NTC_FLUSH event and move some events to correct files.
> > 
> > Fixes: 32daa5d7899e ("perf vendor events: Initial JSON/events list for 
> > power10 platform")
> > Signed-off-by: Kajol Jain 
> 
> I checked that everything that was "removed" was actually just moved
> somewhere else, that all the added EventCodes indeed had '0x', the
> number of EventCodes added matched the number removed, and that the
> additional text added seemed reasonable.  LGTM.
> 
> Reviewed-by: Paul A. Clarke 

I lost the original message, but Nageswara Sastry said:
> 1. Extracted all the 244 events from the patch.
> 2. Check them in 'perf list' - all 244 events found
> 3. Ran all the events with 'perf stat -e "event name" sleep 1', all ran fine.
> No errors were seen in 'dmesg'

I count 255 events.

PC


Re: [PATCH] perf vendor events: Fix eventcode of power10 json events

2021-05-25 Thread Paul A. Clarke
On Tue, May 25, 2021 at 12:07:23PM +0530, Kajol Jain wrote:
> Fixed the eventcode values in the power10 json event files to append
> "0x" since these are hexadecimal values.
> Patch also changes event description of PM_EXEC_STALL_LOAD_FINISH and
> PM_EXEC_STALL_NTC_FLUSH event and move some events to correct files.
> 
> Fixes: 32daa5d7899e ("perf vendor events: Initial JSON/events list for 
> power10 platform")
> Signed-off-by: Kajol Jain 

I checked that everything that was "removed" was actually just moved
somewhere else, that all the added EventCodes indeed had '0x', the
number of EventCodes added matched the number removed, and that the
additional text added seemed reasonable.  LGTM.

Reviewed-by: Paul A. Clarke 


Re: [RFC v2 4/4] powerpc/papr_scm: Add cpu hotplug support for nvdimm pmu device

2021-05-25 Thread Peter Zijlstra
On Tue, May 25, 2021 at 06:52:16PM +0530, Kajol Jain wrote:
> Patch here adds cpu hotplug functions to nvdimm pmu.

I'm thinking "Patch here" qualifies for "This patch", see
Documentation/process/submitting-patches.rst .

> It adds cpumask to designate a cpu to make HCALL to
> collect the counter data for the nvdimm device and
> update ABI documentation accordingly.
> 
> Result in power9 lpar system:
> command:# cat /sys/devices/nmem0/cpumask
> 0

Is this specific to the papr thing, or should this be in generic nvdimm
code?


[PATCH V3 2/2] selftests/powerpc: EBB selftest for MMCR0 control for PMU SPRs in ISA v3.1

2021-05-25 Thread Athira Rajeev
With the MMCR0 control bit (PMCCEXT) in ISA v3.1, read access to
group B registers is restricted when MMCR0 PMCC=0b00. In other
platforms (like power9), the older behaviour works where group B
PMU SPRs are readable.

Patch creates a selftest which verifies that the test takes a
SIGILL when attempting to read PMU registers via helper function
"dump_ebb_state" for ISA v3.1.

Signed-off-by: Athira Rajeev 
---
 tools/testing/selftests/powerpc/pmu/ebb/Makefile   |  2 +-
 .../powerpc/pmu/ebb/regs_access_pmccext_test.c | 63 ++
 2 files changed, 64 insertions(+), 1 deletion(-)
 create mode 100644 
tools/testing/selftests/powerpc/pmu/ebb/regs_access_pmccext_test.c

diff --git a/tools/testing/selftests/powerpc/pmu/ebb/Makefile 
b/tools/testing/selftests/powerpc/pmu/ebb/Makefile
index c5ecb46..0101606 100644
--- a/tools/testing/selftests/powerpc/pmu/ebb/Makefile
+++ b/tools/testing/selftests/powerpc/pmu/ebb/Makefile
@@ -24,7 +24,7 @@ TEST_GEN_PROGS := reg_access_test event_attributes_test 
cycles_test   \
 fork_cleanup_test ebb_on_child_test\
 ebb_on_willing_child_test back_to_back_ebbs_test   \
 lost_exception_test no_handler_test\
-cycles_with_mmcr2_test
+cycles_with_mmcr2_test regs_access_pmccext_test
 
 top_srcdir = ../../../../../..
 include ../../../lib.mk
diff --git a/tools/testing/selftests/powerpc/pmu/ebb/regs_access_pmccext_test.c 
b/tools/testing/selftests/powerpc/pmu/ebb/regs_access_pmccext_test.c
new file mode 100644
index 000..1eda8e9
--- /dev/null
+++ b/tools/testing/selftests/powerpc/pmu/ebb/regs_access_pmccext_test.c
@@ -0,0 +1,63 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright 2021, Athira Rajeev, IBM Corp.
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+#include "ebb.h"
+
+
+/*
+ * Test that closing the EBB event clears MMCR0_PMCC and
+ * sets MMCR0_PMCCEXT preventing further read access to the
+ * group B PMU registers.
+ */
+
+static int regs_access_pmccext(void)
+{
+   struct event event;
+
+   SKIP_IF(!ebb_is_supported());
+
+   event_init_named(&event, 0x1001e, "cycles");
+   event_leader_ebb_init(&event);
+
+   FAIL_IF(event_open(&event));
+
+   ebb_enable_pmc_counting(1);
+   setup_ebb_handler(standard_ebb_callee);
+   ebb_global_enable();
+   FAIL_IF(ebb_event_enable(&event));
+
+   mtspr(SPRN_PMC1, pmc_sample_period(sample_period));
+
+   while (ebb_state.stats.ebb_count < 1)
+   FAIL_IF(core_busy_loop());
+
+   ebb_global_disable();
+   event_close(&event);
+
+   FAIL_IF(ebb_state.stats.ebb_count == 0);
+
+   /*
+* For ISA v3.1, verify the test takes a SIGILL when reading
+* PMU regs after the event is closed. With the control bit
+* in MMCR0 (PMCCEXT) restricting access to group B PMU regs,
+* sigill is expected.
+*/
+   if (have_hwcap2(PPC_FEATURE2_ARCH_3_1))
+   FAIL_IF(catch_sigill(dump_ebb_state));
+   else
+   dump_ebb_state();
+
+   return 0;
+}
+
+int main(void)
+{
+   return test_harness(regs_access_pmccext, "regs_access_pmccext");
+}
-- 
1.8.3.1



[PATCH V3 1/2] selftests/powerpc: Fix "no_handler" EBB selftest

2021-05-25 Thread Athira Rajeev
The "no_handler_test" in ebb selftests attempts to read the PMU
registers twice via helper function "dump_ebb_state". First dump is
just before closing of event and the second invocation is done after
closing of the event. The original intention of second
dump_ebb_state was to dump the state of registers at the end of
the test when the counters are frozen. But this will be achieved
with the first call itself since sample period is set to low value
and PMU will be frozen by then. Hence patch removes the
dump which was done before closing of the event.

Signed-off-by: Athira Rajeev 
Reported-by: Shirisha Ganta 
---
 tools/testing/selftests/powerpc/pmu/ebb/no_handler_test.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/tools/testing/selftests/powerpc/pmu/ebb/no_handler_test.c 
b/tools/testing/selftests/powerpc/pmu/ebb/no_handler_test.c
index fc5bf48..01e827c 100644
--- a/tools/testing/selftests/powerpc/pmu/ebb/no_handler_test.c
+++ b/tools/testing/selftests/powerpc/pmu/ebb/no_handler_test.c
@@ -50,8 +50,6 @@ static int no_handler_test(void)
 
event_close(&event);
 
-   dump_ebb_state();
-
/* The real test is that we never took an EBB at 0x0 */
 
return 0;
-- 
1.8.3.1



[PATCH V3 0/2] selftests/powerpc: Updates to EBB selftest for ISA v3.1

2021-05-25 Thread Athira Rajeev
The "no_handler_test" in ebb selftests attempts to read the PMU
registers after closing of the event via helper function
"dump_ebb_state". With the MMCR0 control bit (PMCCEXT) in ISA v3.1,
read access to group B registers is restricted when MMCR0 PMCC=0b00.
Hence the call to dump_ebb_state after closing of event will generate
a SIGILL, which is expected.

Test has below in logs:
<<>>
!! child died by signal 4
failure: no_handler_test
<<>>

In other platforms (like power9), the older behaviour works where
group B PMU SPRs are readable. The "dump_ebb_state" is called twice
in the test. The second call after closing of event was done inorder
to dump state of registers when the counters are frozen. But since
the counters should already be frozen by the time first dump is done,
patch1 drops the second call to "dump_ebb_state".

To address the new sigill behaviour in ISA v3.1, patch2 creates
a separate selftest.

Changelog:
v2 -> v3:
Fixed a space issue in patch2.

v1 -> v2:
Addressed review comments from Michael Ellerman.
First version attempted to address the SIGILL
behaviour in existing "no_handler_test" test itself.
As per mpe's suggestion, moved that to a separate test
and removed the second call to "dump_ebb_state" since
that is actually not needed.

Athira Rajeev (2):
  selftests/powerpc: Fix "no_handler" EBB selftest
  selftests/powerpc: EBB selftest for MMCR0 control for PMU SPRs in ISA
v3.1

 tools/testing/selftests/powerpc/pmu/ebb/Makefile   |  2 +-
 .../selftests/powerpc/pmu/ebb/no_handler_test.c|  2 -
 .../powerpc/pmu/ebb/regs_access_pmccext_test.c | 63 ++
 3 files changed, 64 insertions(+), 3 deletions(-)
 create mode 100644 
tools/testing/selftests/powerpc/pmu/ebb/regs_access_pmccext_test.c

-- 
1.8.3.1



Re: [PATCH v6 07/11] mm/mremap: Use range flush that does TLB and page walk cache flush

2021-05-25 Thread Aneesh Kumar K.V
Linus Torvalds  writes:

> On Sun, May 23, 2021 at 11:04 PM Aneesh Kumar K.V
>  wrote:
>>
>> Add new helper flush_pte_tlb_pwc_range() which invalidates both TLB and
>> page walk cache where TLB entries are mapped with page size PAGE_SIZE.
>
> So I dislike this patch for two reasons:
>
>  (a) naming.
>
> If the ppc people want to use crazy TLA's that have no meaning outside
> of the powerpc community, that's fine. But only in powerpc code.
>
> "pwc" makes no sense to me, or to anybody else that isn't intimately
> involved in low-level powerpc stuff. I assume it's "page walk cache",
> but honestly, outside of this area, PWC is mostly used for a specific
> type of webcam.
>
> So there's no way I'd accept this as-is, simply because of that.
> flush_pte_tlb_pwc_range() is simply not an acceptable name. You would
> have to spell it out, not use an obscure TLA.
>
> But I think you don't even want to do that, because of

How about flush_tlb_and_page_table_cache() ?

>
>  (b) is this even worth it as a public interface?
>
> Why doesn't the powerpc radix TLB flushing code just always flush the
> page table walking cache when the range is larger than a PMD?
>
> Once you have big flush ranges like that, I don't believe it makes any
> sense not to flush the walking cache too.

But such a large range invalidate doesn't imply we are freeing page
tables. Hence forcing a page table cache flush for large range
invalidate can have performance impact. ppc64 don't do a range page
table cache invalidate. Hence we will have to flush the full page table
cache.

>
> NOTE! This is particularly true as "flush the walking cache" isn't a
> well-defined operation anyway. Which _levels_ of the walking cache?
> Again, the size (and alignment) of the flush would actually tell you.
> A new boolean "flush" parameter does *NOT* tell that at all.
>
> So I think this new interface is mis-named, but I also think it's
> pointless. Just DTRT automatically when somebody asks for a flush that
> covers a PMD range (or a PUD range).
>
>   Linus

-aneesh


[V2 2/2] selftests/powerpc: EBB selftest for MMCR0 control for PMU SPRs in ISA v3.1

2021-05-25 Thread Athira Rajeev
With the MMCR0 control bit (PMCCEXT) in ISA v3.1, read access to
group B registers is restricted when MMCR0 PMCC=0b00. In other
platforms (like power9), the older behaviour works where group B
PMU SPRs are readable.

Patch creates a selftest which verifies that the test takes a
SIGILL when attempting to read PMU registers via helper function
"dump_ebb_state" for ISA v3.1.

Signed-off-by: Athira Rajeev 
---
 tools/testing/selftests/powerpc/pmu/ebb/Makefile   |  2 +-
 .../powerpc/pmu/ebb/regs_access_pmccext_test.c | 63 ++
 2 files changed, 64 insertions(+), 1 deletion(-)
 create mode 100644 
tools/testing/selftests/powerpc/pmu/ebb/regs_access_pmccext_test.c

diff --git a/tools/testing/selftests/powerpc/pmu/ebb/Makefile 
b/tools/testing/selftests/powerpc/pmu/ebb/Makefile
index c5ecb46..0101606 100644
--- a/tools/testing/selftests/powerpc/pmu/ebb/Makefile
+++ b/tools/testing/selftests/powerpc/pmu/ebb/Makefile
@@ -24,7 +24,7 @@ TEST_GEN_PROGS := reg_access_test event_attributes_test 
cycles_test   \
 fork_cleanup_test ebb_on_child_test\
 ebb_on_willing_child_test back_to_back_ebbs_test   \
 lost_exception_test no_handler_test\
-cycles_with_mmcr2_test
+cycles_with_mmcr2_test regs_access_pmccext_test
 
 top_srcdir = ../../../../../..
 include ../../../lib.mk
diff --git a/tools/testing/selftests/powerpc/pmu/ebb/regs_access_pmccext_test.c 
b/tools/testing/selftests/powerpc/pmu/ebb/regs_access_pmccext_test.c
new file mode 100644
index 000..5f1a040
--- /dev/null
+++ b/tools/testing/selftests/powerpc/pmu/ebb/regs_access_pmccext_test.c
@@ -0,0 +1,63 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright 2021, Athira Rajeev, IBM Corp.
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+#include "ebb.h"
+
+
+/*
+ * Test that closing the EBB event clears MMCR0_PMCC and
+ * sets MMCR0_PMCCEXT preventing further read access to the
+ * group B PMU registers.
+ */
+
+static int regs_access_pmccext(void)
+{
+   struct event event;
+
+   SKIP_IF(!ebb_is_supported());
+
+   event_init_named(&event, 0x1001e, "cycles");
+   event_leader_ebb_init(&event);
+
+   FAIL_IF(event_open(&event));
+
+   ebb_enable_pmc_counting(1);
+   setup_ebb_handler(standard_ebb_callee);
+   ebb_global_enable();
+   FAIL_IF(ebb_event_enable(&event));
+
+   mtspr(SPRN_PMC1, pmc_sample_period(sample_period));
+
+   while (ebb_state.stats.ebb_count < 1)
+   FAIL_IF(core_busy_loop());
+
+   ebb_global_disable();
+   event_close(&event);
+
+   FAIL_IF(ebb_state.stats.ebb_count == 0);
+
+   /*
+* For ISA v3.1, verify the test takes a SIGILL when reading
+* PMU regs after the event is closed. With the control bit
+* in MMCR0 (PMCCEXT) restricting access to group B PMU regs,
+* sigill is expected.
+*/
+   if (have_hwcap2(PPC_FEATURE2_ARCH_3_1))
+   FAIL_IF(catch_sigill(dump_ebb_state));
+   else
+   dump_ebb_state();
+
+   return 0;
+}
+
+int main(void)
+{
+   return test_harness(regs_access_pmccext,"regs_access_pmccext");
+}
-- 
1.8.3.1



[V2 1/2] selftests/powerpc: Fix "no_handler" EBB selftest

2021-05-25 Thread Athira Rajeev
The "no_handler_test" in ebb selftests attempts to read the PMU
registers twice via helper function "dump_ebb_state". First dump is
just before closing of event and the second invocation is done after
closing of the event. The original intention of second
dump_ebb_state was to dump the state of registers at the end of
the test when the counters are frozen. But this will be achieved
with the first call itself since sample period is set to low value
and PMU will be frozen by then. Hence patch removes the
dump which was done before closing of the event.

Signed-off-by: Athira Rajeev 
Reported-by: Shirisha Ganta 
---
 tools/testing/selftests/powerpc/pmu/ebb/no_handler_test.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/tools/testing/selftests/powerpc/pmu/ebb/no_handler_test.c 
b/tools/testing/selftests/powerpc/pmu/ebb/no_handler_test.c
index fc5bf48..01e827c 100644
--- a/tools/testing/selftests/powerpc/pmu/ebb/no_handler_test.c
+++ b/tools/testing/selftests/powerpc/pmu/ebb/no_handler_test.c
@@ -50,8 +50,6 @@ static int no_handler_test(void)
 
event_close(&event);
 
-   dump_ebb_state();
-
/* The real test is that we never took an EBB at 0x0 */
 
return 0;
-- 
1.8.3.1



[V2 0/2] selftests/powerpc: Updates to EBB selftest for ISA v3.1

2021-05-25 Thread Athira Rajeev
The "no_handler_test" in ebb selftests attempts to read the PMU
registers after closing of the event via helper function
"dump_ebb_state". With the MMCR0 control bit (PMCCEXT) in ISA v3.1,
read access to group B registers is restricted when MMCR0 PMCC=0b00.
Hence the call to dump_ebb_state after closing of event will generate
a SIGILL, which is expected.

Test has below in logs:
<<>>
!! child died by signal 4
failure: no_handler_test
<<>>

In other platforms (like power9), the older behaviour works where
group B PMU SPRs are readable. The "dump_ebb_state" is called twice
in the test. The second call after closing of event was done inorder
to dump state of registers when the counters are frozen. But since
the counters should already be frozen by the time first dump is done,
patch1 drops the second call to "dump_ebb_state".

To address the new sigill behaviour in ISA v3.1, patch2 creates
a separate selftest.

Changelog:
v1 -> v2:
Addressed review comments from Michael Ellerman.
First version attempted to address the SIGILL
behaviour in existing "no_handler_test" test itself.
As per mpe's suggestion, moved that to a separate test
and removed the second call to "dump_ebb_state" since
that is actually not needed.

Athira Rajeev (2):
  selftests/powerpc: Fix "no_handler" EBB selftest
  selftests/powerpc: EBB selftest for MMCR0 control for PMU SPRs in ISA
v3.1

 tools/testing/selftests/powerpc/pmu/ebb/Makefile   |  2 +-
 .../selftests/powerpc/pmu/ebb/no_handler_test.c|  2 -
 .../powerpc/pmu/ebb/regs_access_pmccext_test.c | 63 ++
 3 files changed, 64 insertions(+), 3 deletions(-)
 create mode 100644 
tools/testing/selftests/powerpc/pmu/ebb/regs_access_pmccext_test.c

-- 
1.8.3.1



[RFC v2 4/4] powerpc/papr_scm: Add cpu hotplug support for nvdimm pmu device

2021-05-25 Thread Kajol Jain
Patch here adds cpu hotplug functions to nvdimm pmu.
It adds cpumask to designate a cpu to make HCALL to
collect the counter data for the nvdimm device and
update ABI documentation accordingly.

Result in power9 lpar system:
command:# cat /sys/devices/nmem0/cpumask
0

Signed-off-by: Kajol Jain 
---
 Documentation/ABI/testing/sysfs-bus-papr-pmem |  6 ++
 arch/powerpc/platforms/pseries/papr_scm.c | 61 +++
 include/linux/nd.h| 17 --
 3 files changed, 79 insertions(+), 5 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-bus-papr-pmem 
b/Documentation/ABI/testing/sysfs-bus-papr-pmem
index 38c4daf65af2..986df1691914 100644
--- a/Documentation/ABI/testing/sysfs-bus-papr-pmem
+++ b/Documentation/ABI/testing/sysfs-bus-papr-pmem
@@ -76,6 +76,12 @@ Description: (RO) Attribute group to describe the magic bits
For example::
noopstat = "event=0x1"
 
+What:  /sys/devices/nmemX/cpumask
+Date:  May 2021
+Contact:   linuxppc-dev , 
linux-nvd...@lists.01.org,
+Description:   (RO) This sysfs file exposes the cpumask which is designated to 
make
+HCALLs to retrieve nvdimm pmu event counter data.
+
 What:  /sys/devices/nmemX/events
 Date:  May 2021
 Contact:   linuxppc-dev , 
linux-nvd...@lists.01.org,
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index f2d57da98ff4..76121d876b7f 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -339,6 +339,28 @@ static ssize_t drc_pmem_query_stats(struct papr_scm_priv 
*p,
return 0;
 }
 
+static ssize_t cpumask_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   struct pmu *pmu = dev_get_drvdata(dev);
+   struct nvdimm_pmu *nd_pmu;
+
+   nd_pmu = container_of(pmu, struct nvdimm_pmu, pmu);
+
+   return cpumap_print_to_pagebuf(true, buf, cpumask_of(nd_pmu->cpu));
+}
+
+static DEVICE_ATTR_RO(cpumask);
+
+static struct attribute *nvdimm_cpumask_attrs[] = {
+   &dev_attr_cpumask.attr,
+   NULL,
+};
+
+static const struct attribute_group nvdimm_pmu_cpumask_group = {
+   .attrs = nvdimm_cpumask_attrs,
+};
+
 PMU_FORMAT_ATTR(event, "config:0-4");
 
 static struct attribute *nvdimm_pmu_format_attr[] = {
@@ -459,6 +481,24 @@ static void papr_scm_pmu_del(struct perf_event *event, int 
flags)
papr_scm_pmu_read(event);
 }
 
+static int nvdimm_pmu_offline_cpu(unsigned int cpu, struct hlist_node *node)
+{
+   struct nvdimm_pmu *nd_pmu;
+   int target;
+
+   nd_pmu = hlist_entry_safe(node, struct nvdimm_pmu, node);
+
+   if (cpu != nd_pmu->cpu)
+   return 0;
+
+   target = cpumask_last(cpu_active_mask);
+   if (target < 0 || target >= nr_cpu_ids)
+   return -1;
+
+   nd_pmu->cpu = target;
+   return 0;
+}
+
 static ssize_t device_show_string(struct device *dev, struct device_attribute 
*attr,
  char *buf)
 {
@@ -603,6 +643,7 @@ static int papr_scm_pmu_check_events(struct papr_scm_priv 
*p, struct nvdimm_pmu
/* Fill attribute groups for the nvdimm pmu device */
nd_pmu->attr_groups[NVDIMM_PMU_FORMAT_ATTR] = &nvdimm_pmu_format_group;
nd_pmu->attr_groups[NVDIMM_PMU_EVENT_ATTR] = nvdimm_pmu_events_group;
+   nd_pmu->attr_groups[NVDIMM_PMU_CPUMASK_ATTR] = 
&nvdimm_pmu_cpumask_group;
nd_pmu->attr_groups[NVDIMM_PMU_NULL_ATTR] = NULL;
 
kfree(single_stats);
@@ -652,6 +693,20 @@ static void papr_scm_pmu_register(struct papr_scm_priv *p)
nd_pmu->read = papr_scm_pmu_read;
nd_pmu->add = papr_scm_pmu_add;
nd_pmu->del = papr_scm_pmu_del;
+   nd_pmu->cpu = raw_smp_processor_id();
+
+   rc = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN,
+"perf/nvdimm:online",
+ NULL, nvdimm_pmu_offline_cpu);
+   if (rc < 0)
+   goto pmu_cpuhp_setup_err;
+
+   nd_pmu->cpuhp_state = rc;
+
+   /* Register the pmu instance for cpu hotplug */
+   rc = cpuhp_state_add_instance_nocalls(nd_pmu->cpuhp_state, 
&nd_pmu->node);
+   if (rc)
+   goto cpuhp_instance_err;
 
rc = register_nvdimm_pmu(nd_pmu, p->pdev);
if (rc)
@@ -665,6 +720,10 @@ static void papr_scm_pmu_register(struct papr_scm_priv *p)
return;
 
 pmu_register_err:
+   cpuhp_state_remove_instance_nocalls(nd_pmu->cpuhp_state, &nd_pmu->node);
+cpuhp_instance_err:
+   cpuhp_remove_multi_state(nd_pmu->cpuhp_state);
+pmu_cpuhp_setup_err:
nvdimm_pmu_mem_free(nd_pmu);
kfree(p->nvdimm_events_map);
 pmu_check_events_err:
@@ -675,6 +734,8 @@ static void papr_scm_pmu_register(struct papr_scm_priv *p)
 
 static void nvdimm_pmu_uinit(struct nvdimm_pmu *nd_pmu)
 {
+   cpuhp_state_remove_instance_nocalls(nd_pmu->cpuhp_state, &nd_pmu->node);

[RFC v2 1/4] drivers/nvdimm: Add perf interface to expose nvdimm performance stats

2021-05-25 Thread Kajol Jain
Patch adds performance stats reporting support for nvdimm.
Added interface includes support for pmu register/unregister
functions. A structure is added called nvdimm_pmu to be used for
adding arch/platform specific data such as supported events and pmu
event functions like event_init/add/read/del.
User could use the standard perf tool to access perf
events exposed via pmu.

Signed-off-by: Kajol Jain 
---
 drivers/nvdimm/Makefile  |  1 +
 drivers/nvdimm/nd_perf.c | 58 
 include/linux/nd.h   | 35 
 3 files changed, 94 insertions(+)
 create mode 100644 drivers/nvdimm/nd_perf.c

diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile
index 29203f3d3069..25dba6095612 100644
--- a/drivers/nvdimm/Makefile
+++ b/drivers/nvdimm/Makefile
@@ -18,6 +18,7 @@ nd_e820-y := e820.o
 libnvdimm-y := core.o
 libnvdimm-y += bus.o
 libnvdimm-y += dimm_devs.o
+libnvdimm-y += nd_perf.o
 libnvdimm-y += dimm.o
 libnvdimm-y += region_devs.o
 libnvdimm-y += region.o
diff --git a/drivers/nvdimm/nd_perf.c b/drivers/nvdimm/nd_perf.c
new file mode 100644
index ..96280c1ff799
--- /dev/null
+++ b/drivers/nvdimm/nd_perf.c
@@ -0,0 +1,58 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * nd_perf.c: NVDIMM Device Performance Monitoring Unit support
+ *
+ * Perf interface to expose nvdimm performance stats.
+ *
+ * Copyright (C) 2021 IBM Corporation
+ */
+
+#define pr_fmt(fmt) "nvdimm_pmu: " fmt
+
+#include 
+
+int register_nvdimm_pmu(struct nvdimm_pmu *nd_pmu, struct platform_device 
*pdev)
+{
+   int rc;
+
+   if (!nd_pmu || !pdev)
+   return -EINVAL;
+
+   /* event functions like add/del/read/event_init should not be NULL */
+   if (WARN_ON_ONCE(!(nd_pmu->event_init && nd_pmu->add && nd_pmu->del && 
nd_pmu->read)))
+   return -EINVAL;
+
+   nd_pmu->pmu.task_ctx_nr = perf_invalid_context;
+   nd_pmu->pmu.name = nd_pmu->name;
+   nd_pmu->pmu.event_init = nd_pmu->event_init;
+   nd_pmu->pmu.add = nd_pmu->add;
+   nd_pmu->pmu.del = nd_pmu->del;
+   nd_pmu->pmu.read = nd_pmu->read;
+
+   nd_pmu->pmu.attr_groups = nd_pmu->attr_groups;
+   nd_pmu->pmu.capabilities = PERF_PMU_CAP_NO_INTERRUPT |
+   PERF_PMU_CAP_NO_EXCLUDE;
+
+   /*
+* Add platform_device->dev pointer to nvdimm_pmu to access
+* device data in events functions.
+*/
+   nd_pmu->dev = &pdev->dev;
+
+   rc = perf_pmu_register(&nd_pmu->pmu, nd_pmu->name, -1);
+   if (rc)
+   return rc;
+
+   pr_info("%s NVDIMM performance monitor support registered\n",
+   nd_pmu->name);
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(register_nvdimm_pmu);
+
+void unregister_nvdimm_pmu(struct pmu *nd_pmu)
+{
+   /* handle freeing of memory in arch specific code */
+   perf_pmu_unregister(nd_pmu);
+}
+EXPORT_SYMBOL_GPL(unregister_nvdimm_pmu);
diff --git a/include/linux/nd.h b/include/linux/nd.h
index ee9ad76afbba..a0e0619256be 100644
--- a/include/linux/nd.h
+++ b/include/linux/nd.h
@@ -8,6 +8,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 enum nvdimm_event {
NVDIMM_REVALIDATE_POISON,
@@ -23,6 +25,39 @@ enum nvdimm_claim_class {
NVDIMM_CCLASS_UNKNOWN,
 };
 
+/* Event attribute array index */
+#define NVDIMM_PMU_FORMAT_ATTR 0
+#define NVDIMM_PMU_EVENT_ATTR  1
+#define NVDIMM_PMU_NULL_ATTR   2
+
+/**
+ * struct nvdimm_pmu - data structure for nvdimm perf driver
+ *
+ * @name: name of the nvdimm pmu device.
+ * @pmu: pmu data structure for nvdimm performance stats.
+ * @dev: nvdimm device pointer.
+ * @functions(event_init/add/del/read): platform specific pmu functions.
+ * @attr_groups: data structure for events and formats.
+ */
+struct nvdimm_pmu {
+   const char *name;
+   struct pmu pmu;
+   struct device *dev;
+   int (*event_init)(struct perf_event *event);
+   int  (*add)(struct perf_event *event, int flags);
+   void (*del)(struct perf_event *event, int flags);
+   void (*read)(struct perf_event *event);
+   /*
+* Attribute groups for the nvdimm pmu. Index 0 used for
+* format attribute, index 1 used for event attribute and
+* index 2 kept as NULL.
+*/
+   const struct attribute_group *attr_groups[3];
+};
+
+int register_nvdimm_pmu(struct nvdimm_pmu *nvdimm, struct platform_device 
*pdev);
+void unregister_nvdimm_pmu(struct pmu *pmu);
+
 struct nd_device_driver {
struct device_driver drv;
unsigned long type;
-- 
2.27.0



[RFC v2 3/4] powerpc/papr_scm: Document papr_scm sysfs event format entries

2021-05-25 Thread Kajol Jain
This patch add event format and events details in ABI
documentation

Signed-off-by: Kajol Jain 
---
 Documentation/ABI/testing/sysfs-bus-papr-pmem | 25 +++
 1 file changed, 25 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-papr-pmem 
b/Documentation/ABI/testing/sysfs-bus-papr-pmem
index 92e2db0e2d3d..38c4daf65af2 100644
--- a/Documentation/ABI/testing/sysfs-bus-papr-pmem
+++ b/Documentation/ABI/testing/sysfs-bus-papr-pmem
@@ -59,3 +59,28 @@ Description:
* "CchRHCnt" : Cache Read Hit Count
* "CchWHCnt" : Cache Write Hit Count
* "FastWCnt" : Fast Write Count
+
+What:  /sys/devices/nmemX/format
+Date:  May 2021
+Contact:   linuxppc-dev , 
linux-nvd...@lists.01.org,
+Description:   (RO) Attribute group to describe the magic bits
+that go into perf_event_attr.config for a particular pmu.
+(See ABI/testing/sysfs-bus-event_source-devices-format).
+
+Each attribute under this group defines a bit range of the
+perf_event_attr.config. Supported attribute is listed
+below::
+
+   event  = "config:0-4"  - event ID
+
+   For example::
+   noopstat = "event=0x1"
+
+What:  /sys/devices/nmemX/events
+Date:  May 2021
+Contact:   linuxppc-dev , 
linux-nvd...@lists.01.org,
+Description:(RO) Attribute group to describe performance monitoring
+events specific to papr-scm. Each attribute in this group 
describes
+a single performance monitoring event supported by this nvdimm 
pmu.
+The name of the file is the name of the event.
+(See ABI/testing/sysfs-bus-event_source-devices-events).
-- 
2.27.0



[RFC v2 2/4] powerpc/papr_scm: Add perf interface support

2021-05-25 Thread Kajol Jain
Patch adds support for performance monitoring of papr-scm
nvdimm devices via perf interface. It adds pmu functions
like add/del/read/event_init for nvdimm_pmu struture.

Patch adds a new parameter 'priv' in pdev_archdata structure to save
nvdimm_pmu device pointer, to handle the unregistering of pmu device.

papr_scm_pmu_register function populates the nvdimm_pmu structure
with events, attribute groups along with event handling functions.
Finally the populated nvdimm_pmu structure is passed to register
the pmu device.
Event handling functions internally uses hcall to get events and
counter data.

Result in power9 machine with 2 nvdimm device:

Ex: List all event by perf list

command:# perf list nmem

  nmem0/cchrhcnt/[Kernel PMU event]
  nmem0/cchwhcnt/[Kernel PMU event]
  nmem0/critrscu/[Kernel PMU event]
  nmem0/ctlresct/[Kernel PMU event]
  nmem0/ctlrestm/[Kernel PMU event]
  nmem0/fastwcnt/[Kernel PMU event]
  nmem0/hostlcnt/[Kernel PMU event]
  nmem0/hostldur/[Kernel PMU event]
  nmem0/hostscnt/[Kernel PMU event]
  nmem0/hostsdur/[Kernel PMU event]
  nmem0/medrcnt/ [Kernel PMU event]
  nmem0/medrdur/ [Kernel PMU event]
  nmem0/medwcnt/ [Kernel PMU event]
  nmem0/medwdur/ [Kernel PMU event]
  nmem0/memlife/ [Kernel PMU event]
  nmem0/noopstat/[Kernel PMU event]
  nmem0/ponsecs/ [Kernel PMU event]
  nmem1/cchrhcnt/[Kernel PMU event]
  nmem1/cchwhcnt/[Kernel PMU event]
  nmem1/critrscu/[Kernel PMU event]
  ...
  nmem1/noopstat/[Kernel PMU event]
  nmem1/ponsecs/ [Kernel PMU event]

Signed-off-by: Kajol Jain 
---
 arch/powerpc/include/asm/device.h |   5 +
 arch/powerpc/platforms/pseries/papr_scm.c | 358 ++
 2 files changed, 363 insertions(+)

diff --git a/arch/powerpc/include/asm/device.h 
b/arch/powerpc/include/asm/device.h
index 219559d65864..47ed639f3b8f 100644
--- a/arch/powerpc/include/asm/device.h
+++ b/arch/powerpc/include/asm/device.h
@@ -48,6 +48,11 @@ struct dev_archdata {
 
 struct pdev_archdata {
u64 dma_mask;
+   /*
+* Pointer to nvdimm_pmu structure, to handle the unregistering
+* of pmu device
+*/
+   void *priv;
 };
 
 #endif /* _ASM_POWERPC_DEVICE_H */
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index ef26fe40efb0..f2d57da98ff4 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -18,6 +18,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #define BIND_ANY_ADDR (~0ul)
 
@@ -67,6 +69,8 @@
 #define PAPR_SCM_PERF_STATS_EYECATCHER __stringify(SCMSTATS)
 #define PAPR_SCM_PERF_STATS_VERSION 0x1
 
+#define to_nvdimm_pmu(_pmu)container_of(_pmu, struct nvdimm_pmu, pmu)
+
 /* Struct holding a single performance metric */
 struct papr_scm_perf_stat {
u8 stat_id[8];
@@ -116,6 +120,12 @@ struct papr_scm_priv {
 
/* length of the stat buffer as expected by phyp */
size_t stat_buffer_len;
+
+/* array to have event_code and stat_id mappings */
+   char **nvdimm_events_map;
+
+   /* count of supported events */
+   u32 total_events;
 };
 
 static int papr_scm_pmem_flush(struct nd_region *nd_region,
@@ -329,6 +339,347 @@ static ssize_t drc_pmem_query_stats(struct papr_scm_priv 
*p,
return 0;
 }
 
+PMU_FORMAT_ATTR(event, "config:0-4");
+
+static struct attribute *nvdimm_pmu_format_attr[] = {
+   &format_attr_event.attr,
+   NULL,
+};
+
+static struct attribute_group nvdimm_pmu_format_group = {
+   .name = "format",
+   .attrs = nvdimm_pmu_format_attr,
+};
+
+static int papr_scm_pmu_get_value(struct perf_event *event, struct device 
*dev, u64 *count)
+{
+   struct papr_scm_perf_stat *stat;
+   struct papr_scm_perf_stats *stats;
+   struct papr_scm_priv *p = (struct papr_scm_priv *)dev->driver_data;
+   int rc, size;
+
+   /* Allocate request buffer enough to hold single performance stat */
+   size = sizeof(struct papr_scm_perf_stats) +
+   sizeof(struct papr_scm_perf_stat);
+
+   if (!p || !p->nvdimm_events_map)
+   return -EINVAL;
+
+   stats = kzalloc(size, GFP_KERNEL);
+   if (!stats)
+   return -ENO

[RFC v2 0/4] Add perf interface to expose nvdimm

2021-05-25 Thread Kajol Jain
Patch adds performance stats reporting support for nvdimm.
Added interface includes support for pmu register/unregister
functions. A structure is added called nvdimm_pmu to be used for
adding arch/platform specific data such as supported events and pmu
event functions like event_init/add/read/del.
User could use the standard perf tool to access perf
events exposed via pmu.

Patchset includes implementation to expose IBM pseries platform nmem*
device performance stats using this interface.

Result from power9 pseries lpar with 2 nvdimm device:
command:# perf list nmem
  nmem0/cchrhcnt/[Kernel PMU event]
  nmem0/cchwhcnt/[Kernel PMU event]
  nmem0/critrscu/[Kernel PMU event]
  nmem0/ctlresct/[Kernel PMU event]
  nmem0/ctlrestm/[Kernel PMU event]
  nmem0/fastwcnt/[Kernel PMU event]
  nmem0/hostlcnt/[Kernel PMU event]
  nmem0/hostldur/[Kernel PMU event]
  nmem0/hostscnt/[Kernel PMU event]
  nmem0/hostsdur/[Kernel PMU event]
  nmem0/medrcnt/ [Kernel PMU event]
  nmem0/medrdur/ [Kernel PMU event]
  nmem0/medwcnt/ [Kernel PMU event]
  nmem0/medwdur/ [Kernel PMU event]
  nmem0/memlife/ [Kernel PMU event]
  nmem0/noopstat/[Kernel PMU event]
  nmem0/ponsecs/ [Kernel PMU event]
  nmem1/cchrhcnt/[Kernel PMU event]
  nmem1/cchwhcnt/[Kernel PMU event]
  nmem1/critrscu/[Kernel PMU event]
  ...
  nmem1/noopstat/[Kernel PMU event]
  nmem1/ponsecs/ [Kernel PMU event]

Patch1:
Introduces the nvdimm_pmu structure, common function for pmu
register along with callback routine check.
Pacth2
Add code in arch/powerpc/platform/pseries/papr_scm.c to expose
nmem* pmu. It fills in the nvdimm_pmu structure with event attrs
and event functions and then registers the pmu by adding
callbacks to register_nvdimm_pmu.
Patch3:
Sysfs documentation patch
Patch4:
Adds cpuhotplug support.

Changelog
---
v1 -> v2
- Removed intermediate functions nvdimm_pmu_read/nvdimm_pmu_add/
  nvdimm_pmu_del/nvdimm_pmu_event_init and directly assigned
  platfrom specific routines. Also add check for any NULL functions.
  Suggested by: Peter Zijlstra

- Add macros for event attribute array index which can be used to
  assign dynamically allocated attr_groups.

- New function 'nvdimm_pmu_mem_free' is added to free dynamic
  memory allocated for attr_groups in papr_scm.c

- PMU register call moved from papr_scm_nvdimm_init() to papr_scm_probe()

- Move addition of cpu/node/cpuhp_state attributes in struct nvdimm_pmu
  to patch 4 where cpu hotplug code added

- Removed device attribute from the attribute list of
  add/del/read/event_init functions in nvdimm_pmu structure
  as we need to assign them directly to pmu structure.

- Some optimizations/fixes from previous RFC code

Kajol Jain (4):
  drivers/nvdimm: Add perf interface to expose nvdimm performance stats
  powerpc/papr_scm: Add perf interface support
  powerpc/papr_scm: Document papr_scm sysfs event format entries
  powerpc/papr_scm: Add cpu hotplug support for nvdimm pmu device

 Documentation/ABI/testing/sysfs-bus-papr-pmem |  31 ++
 arch/powerpc/include/asm/device.h |   5 +
 arch/powerpc/platforms/pseries/papr_scm.c | 419 ++
 drivers/nvdimm/Makefile   |   1 +
 drivers/nvdimm/nd_perf.c  |  58 +++
 include/linux/nd.h|  42 ++
 6 files changed, 556 insertions(+)
 create mode 100644 drivers/nvdimm/nd_perf.c

-- 
2.27.0



Re: [PATCH 1/3] sched/topology: Allow archs to populate distance map

2021-05-25 Thread Srikar Dronamraju
* Valentin Schneider  [2021-05-25 11:21:02]:

> On 24/05/21 21:48, Srikar Dronamraju wrote:
> > * Valentin Schneider  [2021-05-24 15:16:09]:
> >> Ok so from your arch you can figure out the *size* of the set of unique
> >> distances, but not the individual node_distance(a, b)... That's quite
> >> unfortunate.
> >
> > Yes, thats true.
> >
> >>
> >> I suppose one way to avoid the hook would be to write some "fake" distance
> >> values into your distance_lookup_table[] for offline nodes using your
> >> distance_ref_point_depth thing, i.e. ensure an iteration of
> >> node_distance(a, b) covers all distance values [1]. You can then keep patch
> >> 3 around, and that should roughly be it.
> >>
> >
> > Yes, this would suffice but to me its not very clean.
> > static int found[distance_ref_point_depth];
> >
> > for_each_node(node){
> >   int i, nd, distance = LOCAL_DISTANCE;
> >   goto out;
> >
> >   nd = node_distance(node, first_online_node)
> >   for (i=0; i < distance_ref_point_depth; i++, distance *= 2) {
> >   if (node_online) {
> >   if (distance != nd)
> >   continue;
> >   found[i] ++;
> >   break;
> >   }
> >   if (found[i])
> >   continue;
> >   distance_lookup_table[node][i] = 
> > distance_lookup_table[first_online_node][i];
> >   found[i] ++;
> >   break;
> >   }
> > }
> >
> > But do note: We are setting a precedent for node distance between two nodes
> > to change.
> >
> 
> Indeed. AFAICT it's that or the unique-distance-values hook :/

Peter,

Please let me know which approach would you prefer.
I am open to try any other approach too.

In my humble opinion, unique-distance-values hook is more cleaner.
Do you still have any concerns with the unique-distance-values hook?

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH 1/3] sched/topology: Allow archs to populate distance map

2021-05-25 Thread Valentin Schneider
On 24/05/21 21:48, Srikar Dronamraju wrote:
> * Valentin Schneider  [2021-05-24 15:16:09]:
>> Ok so from your arch you can figure out the *size* of the set of unique
>> distances, but not the individual node_distance(a, b)... That's quite
>> unfortunate.
>
> Yes, thats true.
>
>>
>> I suppose one way to avoid the hook would be to write some "fake" distance
>> values into your distance_lookup_table[] for offline nodes using your
>> distance_ref_point_depth thing, i.e. ensure an iteration of
>> node_distance(a, b) covers all distance values [1]. You can then keep patch
>> 3 around, and that should roughly be it.
>>
>
> Yes, this would suffice but to me its not very clean.
> static int found[distance_ref_point_depth];
>
> for_each_node(node){
>   int i, nd, distance = LOCAL_DISTANCE;
>   goto out;
>
>   nd = node_distance(node, first_online_node)
>   for (i=0; i < distance_ref_point_depth; i++, distance *= 2) {
>   if (node_online) {
>   if (distance != nd)
>   continue;
>   found[i] ++;
>   break;
>   }
>   if (found[i])
>   continue;
>   distance_lookup_table[node][i] = 
> distance_lookup_table[first_online_node][i];
>   found[i] ++;
>   break;
>   }
> }
>
> But do note: We are setting a precedent for node distance between two nodes
> to change.
>

Indeed. AFAICT it's that or the unique-distance-values hook :/


Re: [PATCH v6 updated 9/11] mm/mremap: Fix race between mremap and pageout

2021-05-25 Thread A lneesh Kumar K.V
Linus Torvalds  writes:

> On Mon, May 24, 2021 at 3:38 AM Aneesh Kumar K.V
>  wrote:
>>
>> Avoid the above race with MOVE_PMD by holding pte ptl in mremap and waiting 
>> for
>> parallel pagetable walk to finish operating on pte before updating new_pmd
>
> Ack on the concept.


Should we worry about the below race. The window would be small

CPU 1   CPU 2   CPU 3

mremap(old_addr, new_addr)  page_shrinker/try_to_unmap_one

mmap_write_lock_killable()

addr = old_addr

lock(pmd_ptl)
pmd = *old_pmd
pmd_clear(old_pmd)
flush_tlb_range(old_addr)

lock(pte_ptl)
*new_pmd = pmd
unlock(pte_ptl)

unlock(pmd_ptl)
lock(pte_ptl)

*new_addr = 10; and fills
TLB 
with new addr
and old 
pfn

ptep_clear_flush(old_addr)
old pfn is free.
Stale 
TLB entry



>
> However, not so much on the patch.
>
> Odd whitespace change:
>
>> @@ -254,6 +254,7 @@ static bool move_normal_pmd(struct vm_area_struct *vma, 
>> unsigned long old_addr,
>> if (WARN_ON_ONCE(!pmd_none(*new_pmd)))
>> return false;
>>
>> +
>> /*
>>  * We don't have to worry about the ordering of src and dst
>>  * ptlocks because exclusive mmap_lock prevents deadlock.
>
> And new optimization for empty pmd, which seems unrelated to the
> change and should presumably be separate:

That was added that we can safely do pte_lockptr() below

>
>> @@ -263,6 +264,10 @@ static bool move_normal_pmd(struct vm_area_struct *vma, 
>> unsigned long old_addr,
>> if (new_ptl != old_ptl)
>> spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
>>
>> +   if (pmd_none(*old_pmd))
>> +   goto unlock_out;
>> +
>> +   pte_ptl = pte_lockptr(mm, old_pmd);
>> /* Clear the pmd */
>> pmd = *old_pmd;
>> pmd_clear(old_pmd);
>
> And also, why does the above assign 'pte_ptl' without using it, when
> the actual use is ten lines further down?

So that we fetch the pte_ptl before we clear old_pmd. 


-aneesh


[RFC PATCH 1/2] mm/mremap: Fix race between MOVE_PMD mremap and pageout

2021-05-25 Thread Aneesh Kumar K.V
CPU 1   CPU 2   CPU 3

mremap(old_addr, new_addr)  page_shrinker/try_to_unmap_one

mmap_write_lock_killable()

addr = old_addr
lock(pte_ptl)
lock(pmd_ptl)
pmd = *old_pmd
pmd_clear(old_pmd)
flush_tlb_range(old_addr)

*new_pmd = pmd

*new_addr = 10; and fills
TLB 
with new addr
and old 
pfn

unlock(pmd_ptl)
ptep_clear_flush()
old pfn is free.
Stale 
TLB entry

Fix this race by holding pmd lock in pageout. This still doesn't handle the race
between MOVE_PUD and pageout.

Fixes: 2c91bd4a4e2e ("mm: speed up mremap by 20x on large regions")
Link: 
https://lore.kernel.org/linux-mm/CAHk-=wgxvr04ebntxqfevontwnp6fdm+oj5vauqxp3s-huw...@mail.gmail.com
Signed-off-by: Aneesh Kumar K.V 
---
 include/linux/rmap.h |  9 ++---
 mm/page_vma_mapped.c | 36 ++--
 2 files changed, 24 insertions(+), 21 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index def5c62c93b3..272ab0c2b60b 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -207,7 +207,8 @@ struct page_vma_mapped_walk {
unsigned long address;
pmd_t *pmd;
pte_t *pte;
-   spinlock_t *ptl;
+   spinlock_t *pte_ptl;
+   spinlock_t *pmd_ptl;
unsigned int flags;
 };
 
@@ -216,8 +217,10 @@ static inline void page_vma_mapped_walk_done(struct 
page_vma_mapped_walk *pvmw)
/* HugeTLB pte is set to the relevant page table entry without 
pte_mapped. */
if (pvmw->pte && !PageHuge(pvmw->page))
pte_unmap(pvmw->pte);
-   if (pvmw->ptl)
-   spin_unlock(pvmw->ptl);
+   if (pvmw->pte_ptl)
+   spin_unlock(pvmw->pte_ptl);
+   if (pvmw->pmd_ptl)
+   spin_unlock(pvmw->pmd_ptl);
 }
 
 bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw);
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 2cf01d933f13..87a2c94c7e27 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -47,8 +47,10 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw)
return false;
}
}
-   pvmw->ptl = pte_lockptr(pvmw->vma->vm_mm, pvmw->pmd);
-   spin_lock(pvmw->ptl);
+   if (USE_SPLIT_PTE_PTLOCKS) {
+   pvmw->pte_ptl = pte_lockptr(pvmw->vma->vm_mm, pvmw->pmd);
+   spin_lock(pvmw->pte_ptl);
+   }
return true;
 }
 
@@ -162,8 +164,8 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
if (!pvmw->pte)
return false;
 
-   pvmw->ptl = huge_pte_lockptr(page_hstate(page), mm, pvmw->pte);
-   spin_lock(pvmw->ptl);
+   pvmw->pte_ptl = huge_pte_lockptr(page_hstate(page), mm, 
pvmw->pte);
+   spin_lock(pvmw->pte_ptl);
if (!check_pte(pvmw))
return not_found(pvmw);
return true;
@@ -179,6 +181,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
if (!pud_present(*pud))
return false;
pvmw->pmd = pmd_offset(pud, pvmw->address);
+   pvmw->pmd_ptl = pmd_lock(mm, pvmw->pmd);
/*
 * Make sure the pmd value isn't cached in a register by the
 * compiler and used as a stale value after we've observed a
@@ -186,7 +189,6 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 */
pmde = READ_ONCE(*pvmw->pmd);
if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
-   pvmw->ptl = pmd_lock(mm, pvmw->pmd);
if (likely(pmd_trans_huge(*pvmw->pmd))) {
if (pvmw->flags & PVMW_MIGRATION)
return not_found(pvmw);
@@ -206,14 +208,10 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk 
*pvmw)
}
}
return not_found(pvmw);
-   } else {
-   /* THP pmd was split under us: handle on pte level */
-   spin_unlock(pvmw->ptl);
-   pvmw->ptl = NULL;
}
-   } else if (!pmd_present(pmde)) {
-   return false;
-   }
+   } else if (!pmd_present(pmde))
+   return not_found(pvmw);
+
if (!map_pte(pvmw))
goto next_pte;
while (1) {
@@ -233,19 +231,21 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk 
*pvmw)
/* Did we cross page table boundary? */
if (pvmw->address % P

[RFC PATCH 2/2] mm/mremap: Fix race between MOVE_PUD mremap and pageout

2021-05-25 Thread Aneesh Kumar K.V
CPU 1   CPU 2   CPU 3

mremap(old_addr, new_addr)  page_shrinker/try_to_unmap_one

mmap_write_lock_killable()

addr = old_addr
lock(pte_ptl)
lock(pud_ptl)
pud = *old_pud
pud_clear(old_pud)
flush_tlb_range(old_addr)

*new_pud = pud

*new_addr = 10; and fills
TLB 
with new addr
and old 
pfn

unlock(pud_ptl)
ptep_clear_flush()
old pfn is free.
Stale 
TLB entry

Fix this race by holding pud lock in pageout.

Signed-off-by: Aneesh Kumar K.V 
---
 include/linux/rmap.h |  4 
 mm/page_vma_mapped.c | 13 ++---
 2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 272ab0c2b60b..491c65ce1d46 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -209,6 +209,7 @@ struct page_vma_mapped_walk {
pte_t *pte;
spinlock_t *pte_ptl;
spinlock_t *pmd_ptl;
+   spinlock_t *pud_ptl;
unsigned int flags;
 };
 
@@ -221,6 +222,9 @@ static inline void page_vma_mapped_walk_done(struct 
page_vma_mapped_walk *pvmw)
spin_unlock(pvmw->pte_ptl);
if (pvmw->pmd_ptl)
spin_unlock(pvmw->pmd_ptl);
+   if (pvmw->pud_ptl)
+   spin_unlock(pvmw->pud_ptl);
+
 }
 
 bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw);
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 87a2c94c7e27..c913bc34b1d3 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -180,8 +180,11 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk 
*pvmw)
pud = pud_offset(p4d, pvmw->address);
if (!pud_present(*pud))
return false;
+
+   pvmw->pud_ptl = pud_lock(mm, pud);
pvmw->pmd = pmd_offset(pud, pvmw->address);
-   pvmw->pmd_ptl = pmd_lock(mm, pvmw->pmd);
+   if (USE_SPLIT_PMD_PTLOCKS)
+   pvmw->pmd_ptl = pmd_lock(mm, pvmw->pmd);
/*
 * Make sure the pmd value isn't cached in a register by the
 * compiler and used as a stale value after we've observed a
@@ -235,8 +238,12 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk 
*pvmw)
spin_unlock(pvmw->pte_ptl);
pvmw->pte_ptl = NULL;
}
-   spin_unlock(pvmw->pmd_ptl);
-   pvmw->pmd_ptl = NULL;
+   if (pvmw->pmd_ptl) {
+   spin_unlock(pvmw->pmd_ptl);
+   pvmw->pmd_ptl = NULL;
+   }
+   spin_unlock(pvmw->pud_ptl);
+   pvmw->pud_ptl = NULL;
goto restart;
} else {
pvmw->pte++;
-- 
2.31.1