Re: [PATCH 09/25] powerpc: ability to create execute-disabled pkeys
Ram Pai writes:

> powerpc has hardware support to disable execute on a pkey.
> This patch enables the ability to create execute-disabled
> keys.

Can you summarize here how this works? Access to IAMR is privileged, so how will the keys framework work with IAMR?

-aneesh
Re: [PATCH 4/7] powerpc: Free up four 64K PTE bits in 64K backed HPTE pages
On 10/24/2017 12:52 AM, Ram Pai wrote:
> On Thu, Sep 14, 2017 at 06:13:57PM +1000, Benjamin Herrenschmidt wrote:
>> On Fri, 2017-09-08 at 15:44 -0700, Ram Pai wrote:
>>> The second part of the PTE will hold
>>> (H_PAGE_F_SECOND|H_PAGE_F_GIX) at bit 60,61,62,63.
>>> NOTE: None of the bits in the secondary PTE were used
>>> by 64k-HPTE backed PTE.
>>
>> Have you measured the performance impact of this? The second part of
>> the PTE being in a different cache line, there could be one...
>
> hmm.. missed responding to this comment.
>
> I did a preliminary measurement running mmap bench in the selftest.
> Ran it multiple times. Almost always the numbers were either equal-to
> or better-than without the patch-series.

mmap bench doesn't do any fault. It is just mmap/munmap in a loop.

-aneesh
RE: [PATCH v10 0/4] this patchset is to remove PPCisms for QEIC
Hi all,

Could anybody review this patchset and take action on it? Thank you!

Best Regards
Qiang Zhao

> -Original Message-
> From: Zhao Qiang [mailto:qiang.z...@nxp.com]
> Sent: Monday, August 07, 2017 11:07 AM
> To: t...@linutronix.de
> Cc: o...@buserror.net; Xiaobo Xie; linux-ker...@vger.kernel.org;
> linuxppc-dev@lists.ozlabs.org; Qiang Zhao
> Subject: [PATCH v10 0/4] this patchset is to remove PPCisms for QEIC
>
> QEIC is supported on more than just powerpc boards, so remove PPCisms.
>
> changelog:
> Changes for v8:
> - use IRQCHIP_DECLARE() instead of subsys_initcall in qeic driver
> - remove include/soc/fsl/qe/qe_ic.h
> Changes for v9:
> - rebase
> - fix the compile issue when applying the second patch; in fact, there
>   was no compile issue when applying all the patches of this patchset
> Changes for v10:
> - simplify code, remove duplicated code
>
> Zhao Qiang (4):
> irqchip/qeic: move qeic driver from drivers/soc/fsl/qe
> Changes for v2:
> - modify the subject and commit msg
> Changes for v3:
> - merge .h file to .c, rename it with irq-qeic.c
> Changes for v4:
> - modify comments
> Changes for v5:
> - disable rename detection
> Changes for v6:
> - rebase
> Changes for v7:
> - na
>
> irqchip/qeic: merge qeic init code from platforms to a common function
> Changes for v2:
> - modify subject and commit msg
> - add check for qeic by type
> Changes for v3:
> - na
> Changes for v4:
> - na
> Changes for v5:
> - na
> Changes for v6:
> - rebase
> Changes for v7:
> - na
> Changes for v8:
> - use IRQCHIP_DECLARE() instead of subsys_initcall
>
> irqchip/qeic: merge qeic_of_init into qe_ic_init
> Changes for v2:
> - modify subject and commit msg
> - return 0 and add put node when return in qe_ic_init
> Changes for v3:
> - na
> Changes for v4:
> - na
> Changes for v5:
> - na
> Changes for v6:
> - rebase
> Changes for v7:
> - na
>
> irqchip/qeic: remove PPCisms for QEIC
> Changes for v6:
> - new added
> Changes for v7:
> - fix warning
> Changes for v8:
> - remove
> include/soc/fsl/qe/qe_ic.h
>
> Zhao Qiang (4):
> irqchip/qeic: move qeic driver from drivers/soc/fsl/qe
> irqchip/qeic: merge qeic init code from platforms to a common function
> irqchip/qeic: merge qeic_of_init into qe_ic_init
> irqchip/qeic: remove PPCisms for QEIC
>
> MAINTAINERS                                        |   6 +
> arch/powerpc/platforms/83xx/km83xx.c               |   1 -
> arch/powerpc/platforms/83xx/misc.c                 |  16 -
> arch/powerpc/platforms/83xx/mpc832x_mds.c          |   1 -
> arch/powerpc/platforms/83xx/mpc832x_rdb.c          |   1 -
> arch/powerpc/platforms/83xx/mpc836x_mds.c          |   1 -
> arch/powerpc/platforms/83xx/mpc836x_rdk.c          |   1 -
> arch/powerpc/platforms/85xx/corenet_generic.c      |  10 -
> arch/powerpc/platforms/85xx/mpc85xx_mds.c          |  15 -
> arch/powerpc/platforms/85xx/mpc85xx_rdb.c          |  17 -
> arch/powerpc/platforms/85xx/twr_p102x.c            |  15 -
> drivers/irqchip/Makefile                           |   1 +
> drivers/{soc/fsl/qe/qe_ic.c => irqchip/irq-qeic.c} | 358 +++++++++----------
> drivers/soc/fsl/qe/Makefile                        |   2 +-
> drivers/soc/fsl/qe/qe_ic.h                         | 103 --
> include/soc/fsl/qe/qe_ic.h                         | 139 ----
> 16 files changed, 218 insertions(+), 469 deletions(-)
> rename drivers/{soc/fsl/qe/qe_ic.c => irqchip/irq-qeic.c} (58%)
> delete mode 100644 drivers/soc/fsl/qe/qe_ic.h
> delete mode 100644 include/soc/fsl/qe/qe_ic.h
>
> --
> 2.1.0.27.g96db324
Re: [PATCH 4/7] powerpc: Free up four 64K PTE bits in 64K backed HPTE pages
On Mon, Oct 23, 2017 at 02:22:44PM +0530, Aneesh Kumar K.V wrote:
> Benjamin Herrenschmidt writes:
>
>> On Fri, 2017-09-08 at 15:44 -0700, Ram Pai wrote:
>>> The second part of the PTE will hold
>>> (H_PAGE_F_SECOND|H_PAGE_F_GIX) at bit 60,61,62,63.
>>> NOTE: None of the bits in the secondary PTE were used
>>> by 64k-HPTE backed PTE.
>>
>> Have you measured the performance impact of this? The second part of
>> the PTE being in a different cache line, there could be one...
>
> I am also looking at a patch series removing the slot tracking
> completely. With randomize address turned off and no swap in guest/host
> and making sure we touched most of guest ram, I don't find much impact
> in performance when we don't track the slot at all. I will post the
> patch series with numbers in a day or two. But my test was
>
> while (5000) {
>     mmap(128M)
>     touch every page of 2048 pages
>     munmap()
> }
>
> It could also be the best case in my run because I might have always
> found the hash pte slot in the primary. In one measurement with swap on
> and address randomization enabled, I did find a 50% impact. But then I
> was not able to recreate that again. So it could be something I did
> wrong in the test setup.
>
> Ram,
>
> Will you be able to get a test run with the above loop?

Yes. The results with the patch look good; better than without the patch.
/----------------------------------------------\
| Iteration | secs w/ patch  | secs w/o patch  |
|-----------|----------------|-----------------|
|     1     |   45.572621    |   49.046994     |
|     2     |   46.049545    |   49.378756     |
|     3     |   46.103657    |   49.223591     |
|     4     |   46.298903    |   48.991245     |
|     5     |   46.353202    |   48.988033     |
|     6     |   45.440878    |   49.175846     |
|     7     |   46.860373    |   49.008395     |
|     8     |   46.221390    |   49.236964     |
|     9     |   45.794993    |   49.171927     |
|    10     |   46.569491    |   48.995628     |
|-----------|----------------|-----------------|
|  average  |   46.1265053   |   49.1217379    |
\----------------------------------------------/

The code is as follows:

diff --git a/tools/testing/selftests/powerpc/benchmarks/mmap_bench.c b/tools/testing/selftests/powerpc/benchmarks/mmap_bench.c
index 8d084a2..ef2ad87 100644
--- a/tools/testing/selftests/powerpc/benchmarks/mmap_bench.c
+++ b/tools/testing/selftests/powerpc/benchmarks/mmap_bench.c
@@ -10,14 +10,14 @@
 
 #include "utils.h"
 
-#define ITERATIONS 500
+#define ITERATIONS 5000
 
 #define MEMSIZE (128 * 1024 * 1024)
 
 int test_mmap(void)
 {
 	struct timespec ts_start, ts_end;
-	unsigned long i = ITERATIONS;
+	unsigned long i = ITERATIONS, j;
 
 	clock_gettime(CLOCK_MONOTONIC, &ts_start);
 
@@ -25,6 +25,10 @@ int test_mmap(void)
 		char *c = mmap(NULL, MEMSIZE, PROT_READ|PROT_WRITE,
 			       MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
 		FAIL_IF(c == MAP_FAILED);
+
+		for (j = 0; j < (MEMSIZE >> 16); j++)
+			c[j << 16] = 0xf;
+
 		munmap(c, MEMSIZE);
 	}
Re: [PATCH v2 2/3] powerpc: use NMI IPI for smp_send_stop
Hi Nicholas,

[auto build test ERROR on powerpc/next]
[also build test ERROR on v4.14-rc6 next-20171018]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Nicholas-Piggin/avoid-secondary-hold-spinloop-when-possible/20171023-173012
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: powerpc-iss476-smp_defconfig (attached as .config)
compiler: powerpc-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=powerpc

All errors (new ones prefixed by >>):

   arch/powerpc/kernel/smp.o: In function `smp_send_stop':
>> arch/powerpc/kernel/smp.c:563: undefined reference to `smp_send_nmi_ipi'

vim +563 arch/powerpc/kernel/smp.c

   560	
   561	void smp_send_stop(void)
   562	{
 > 563		smp_send_nmi_ipi(NMI_IPI_ALL_OTHERS, stop_this_cpu, 100);
   564	}
   565	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
Re: [PATCH 02/25] powerpc: define an additional vma bit for protection keys.
On Thu, Sep 14, 2017 at 06:11:32PM +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2017-09-14 at 14:38 +1000, Balbir Singh wrote:
>> On Fri, 8 Sep 2017 15:44:50 -0700
>> Ram Pai wrote:
>>
>>> powerpc needs an additional vma bit to support 32 keys.
>>> Till the additional vma bit lands in include/linux/mm.h
>>> we have to define it in powerpc specific header file.
>>> This is needed to get pkeys working on power.
>>>
>>> Signed-off-by: Ram Pai
>>> ---
>>
>> "This" being an arch specific hack for the additional bit?
>
> Arch VMA bits? really? I'd rather we limit ourselves to 16 keys first
> then push for adding the extra bit to the generic code.

(hmm.. this mail did not land in my mailbox :( Sorry for the delay.
Just saw it in the mailing list.)

I think this is a good idea. I can code it in such a way that we support
16 or 32 keys depending on the availability of the vma bit. No more
hunks in the code.

RP
Re: [PATCH v2 2/3] powerpc: use NMI IPI for smp_send_stop
Hi Nicholas,

[auto build test ERROR on powerpc/next]
[also build test ERROR on v4.14-rc6 next-20171018]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Nicholas-Piggin/avoid-secondary-hold-spinloop-when-possible/20171023-173012
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: powerpc-ge_imp3a_defconfig (attached as .config)
compiler: powerpc-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=powerpc

All errors (new ones prefixed by >>):

   arch/powerpc/kernel/smp.o: In function `smp_send_stop':
>> smp.c:(.text+0x774): undefined reference to `smp_send_nmi_ipi'

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
Re: [PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2
On Sat, Oct 21, 2017 at 11:47:03AM -0400, Jerome Glisse wrote: > On Sat, Oct 21, 2017 at 04:54:40PM +1100, Balbir Singh wrote: > > On Thu, 2017-10-19 at 12:58 -0400, Jerome Glisse wrote: > > > On Thu, Oct 19, 2017 at 09:53:11PM +1100, Balbir Singh wrote: > > > > On Thu, Oct 19, 2017 at 2:28 PM, Jerome Glisse> > > > wrote: > > > > > On Thu, Oct 19, 2017 at 02:04:26PM +1100, Balbir Singh wrote: > > > > > > On Mon, 16 Oct 2017 23:10:02 -0400 > > > > > > jgli...@redhat.com wrote: > > > > > > > From: Jérôme Glisse [...] > > > > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > > > > > > index c037d3d34950..ff5bc647b51d 100644 > > > > > > > --- a/mm/huge_memory.c > > > > > > > +++ b/mm/huge_memory.c > > > > > > > @@ -1186,8 +1186,15 @@ static int > > > > > > > do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd, > > > > > > > goto out_free_pages; > > > > > > > VM_BUG_ON_PAGE(!PageHead(page), page); > > > > > > > > > > > > > > + /* > > > > > > > +* Leave pmd empty until pte is filled note we must notify > > > > > > > here as > > > > > > > +* concurrent CPU thread might write to new page before the > > > > > > > call to > > > > > > > +* mmu_notifier_invalidate_range_end() happens which can lead > > > > > > > to a > > > > > > > +* device seeing memory write in different order than CPU. 
> > > > > > > +* > > > > > > > +* See Documentation/vm/mmu_notifier.txt > > > > > > > +*/ > > > > > > > pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd); > > > > > > > - /* leave pmd empty until pte is filled */ > > > > > > > > > > > > > > pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd); > > > > > > > pmd_populate(vma->vm_mm, &_pmd, pgtable); > > > > > > > @@ -2026,8 +2033,15 @@ static void > > > > > > > __split_huge_zero_page_pmd(struct vm_area_struct *vma, > > > > > > > pmd_t _pmd; > > > > > > > int i; > > > > > > > > > > > > > > - /* leave pmd empty until pte is filled */ > > > > > > > - pmdp_huge_clear_flush_notify(vma, haddr, pmd); > > > > > > > + /* > > > > > > > +* Leave pmd empty until pte is filled note that it is fine > > > > > > > to delay > > > > > > > +* notification until mmu_notifier_invalidate_range_end() as > > > > > > > we are > > > > > > > +* replacing a zero pmd write protected page with a zero pte > > > > > > > write > > > > > > > +* protected page. > > > > > > > +* > > > > > > > +* See Documentation/vm/mmu_notifier.txt > > > > > > > +*/ > > > > > > > + pmdp_huge_clear_flush(vma, haddr, pmd); > > > > > > > > > > > > Shouldn't the secondary TLB know if the page size changed? > > > > > > > > > > It should not matter, we are talking virtual to physical on behalf > > > > > of a device against a process address space. So the hardware should > > > > > not care about the page size. > > > > > > > > > > > > > Does that not indicate how much the device can access? Could it try > > > > to access more than what is mapped? > > > > > > Assuming device has huge TLB and 2MB huge page with 4K small page. > > > You are going from one 1 TLB covering a 2MB zero page to 512 TLB > > > each covering 4K. Both case is read only and both case are pointing > > > to same data (ie zero). > > > > > > It is fine to delay the TLB invalidate on the device to the call of > > > mmu_notifier_invalidate_range_end(). 
The device will keep using the
>>> huge TLB for a little longer but both CPU and device are looking at
>>> the same data.
>>>
>>> Now if there is a racing thread that replaces one of the 512 zero pages
>>> after the split but before mmu_notifier_invalidate_range_end(), that
>>> code path would call mmu_notifier_invalidate_range() before changing
>>> the pte to point to something else. Which should shoot down the device
>>> TLB (it would be a serious device bug if this did not work).
>>
>> OK.. This seems reasonable, but I'd really like to see if it can be
>> tested.
>
> Well, hard to test; many factors. First, each device might react
> differently. Devices that only store TLB entries at 4k granularity are
> fine. Clever devices that can store TLB entries for 4k, 2M, ... can
> ignore an invalidation that is smaller than their TLB entry, i.e.
> getting a 4K invalidation would not invalidate a 2MB TLB entry in the
> device. I consider this buggy. I will go look at the PCIE ATS
> specification one more time and see if there is any wording related to
> that. I might bring up a question to the PCIE standard body if not.

So inside PCIE ATS there is the definition of "minimum translation or
invalidate size", which says 4096 bytes. So my understanding is that
hardware must support 4K invalidation in all cases and thus we should be
safe from the possible hazard above. But nonetheless I will repost
without the optimization for huge pages, to be more conservative, since
we want to be correct before we care about the last bit of optimization.

Cheers,
Jérôme
Re: lpar issue for ZONE_DEVICE p2pmem in 4.14-rc
> [3.537780] lpar: Attempting to resize HPT to shift 21 > [3.539251] Unable to resize hash page table to target order 21: -1 > [3.541079] Unable to create mapping for hot added memory > 0xc0002100..0xc00021000400: -2 > For #1 above please check if your qemu supports H_RESIZE_HPT_* hcalls? Balbir do you have any suggestions as to how to test for this support? Note I am running this on my x86_64 host so there is no virtualization hardware in my QEMU. My qemu is very recent (QEMU emulator version 2.10.50 (v2.10.0-1026-gd8f932c-dirty)). > For create mapping failures, the rc is -ENOENT. Can you help debug this > further? We could do hcall tracing or enable debugging. Sure I can help debug. My original email also had all you needed to recreate this issue so that’s an option too? Stephen
Re: [PATCH 4/7] powerpc: Free up four 64K PTE bits in 64K backed HPTE pages
On Thu, Sep 14, 2017 at 06:13:57PM +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2017-09-08 at 15:44 -0700, Ram Pai wrote:
>> The second part of the PTE will hold
>> (H_PAGE_F_SECOND|H_PAGE_F_GIX) at bit 60,61,62,63.
>> NOTE: None of the bits in the secondary PTE were used
>> by 64k-HPTE backed PTE.
>
> Have you measured the performance impact of this? The second part of
> the PTE being in a different cache line, there could be one...

hmm.. missed responding to this comment.

I did a preliminary measurement running mmap bench in the selftest.
Ran it multiple times. Almost always the numbers were either equal-to
or better-than without the patch-series.

RP
Re: [PATCH 06/25] powerpc: cleaup AMR,iAMR when a key is allocated or freed
On Mon, Oct 23, 2017 at 03:13:45PM +0530, Aneesh Kumar K.V wrote: > Balbir Singhwrites: > > > On Fri, 8 Sep 2017 15:44:54 -0700 > > Ram Pai wrote: > > > >> cleanup the bits corresponding to a key in the AMR, and IAMR > >> register, when the key is newly allocated/activated or is freed. > >> We dont want some residual bits cause the hardware enforce > >> unintended behavior when the key is activated or freed. > >> > >> Signed-off-by: Ram Pai > >> --- > >> arch/powerpc/include/asm/pkeys.h | 12 > >> 1 files changed, 12 insertions(+), 0 deletions(-) > >> > >> diff --git a/arch/powerpc/include/asm/pkeys.h > >> b/arch/powerpc/include/asm/pkeys.h > >> index 5a83ed7..53bf13b 100644 > >> --- a/arch/powerpc/include/asm/pkeys.h > >> +++ b/arch/powerpc/include/asm/pkeys.h > >> @@ -54,6 +54,8 @@ static inline bool mm_pkey_is_allocated(struct mm_struct > >> *mm, int pkey) > >>mm_set_pkey_is_allocated(mm, pkey)); > >> } > >> > >> +extern void __arch_activate_pkey(int pkey); > >> +extern void __arch_deactivate_pkey(int pkey); > >> /* > >> * Returns a positive, 5-bit key on success, or -1 on failure. > >> */ > >> @@ -80,6 +82,12 @@ static inline int mm_pkey_alloc(struct mm_struct *mm) > >> > >>ret = ffz((u32)mm_pkey_allocation_map(mm)); > >>mm_set_pkey_allocated(mm, ret); > >> + > >> + /* > >> + * enable the key in the hardware > >> + */ > >> + if (ret > 0) > >> + __arch_activate_pkey(ret); > >>return ret; > >> } > >> > >> @@ -91,6 +99,10 @@ static inline int mm_pkey_free(struct mm_struct *mm, > >> int pkey) > >>if (!mm_pkey_is_allocated(mm, pkey)) > >>return -EINVAL; > >> > >> + /* > >> + * Disable the key in the hardware > >> + */ > >> + __arch_deactivate_pkey(pkey); > >>mm_set_pkey_free(mm, pkey); > >> > >>return 0; > > > > I think some of these patches can be merged, too much fine granularity > > is hurting my ability to see the larger function/implementation. > > > Completely agree Me agree too :) Had to fine-grain it to satisfy comments received during the initial versions. 
I had about 10-12 patches in total. Thanks, will merge a couple of these patches. RP
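For readers following along, the allocation-map bookkeeping being discussed (find the first free key with ffz(), mark it allocated, clear it again on free) can be sketched as a small userspace model. All names here are illustrative, not the kernel's; the real helpers live in arch/powerpc/include/asm/pkeys.h and operate on the mm's pkey allocation map:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace model of the pkey allocation map discussed above.
 * Bit N set => key N is allocated. Keys 0 and 1 start out set,
 * mirroring the "reserved keys" mentioned elsewhere in the thread.
 */
static uint32_t pkey_map = 0x3;

/* Rough equivalent of the kernel's ffz(): first zero bit's index. */
static int first_zero_bit(uint32_t map)
{
	return __builtin_ctz(~map);	/* caller must ensure map != ~0u */
}

static int model_pkey_alloc(void)
{
	int key;

	if (pkey_map == ~0u)
		return -1;		/* all 32 keys in use */
	key = first_zero_bit(pkey_map);
	pkey_map |= 1u << key;
	/* the kernel would also clear the AMR/IAMR bits for 'key' here */
	return key;
}

static int model_pkey_free(int key)
{
	if (!(pkey_map & (1u << key)))
		return -22;		/* -EINVAL: key was not allocated */
	pkey_map &= ~(1u << key);
	/* the kernel would also scrub the AMR/IAMR bits for 'key' here */
	return 0;
}
```

With keys 0 and 1 pre-reserved, the first allocation yields key 2, and freeing a key makes it available again on the next allocation.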
Re: [PATCH 06/25] powerpc: cleaup AMR,iAMR when a key is allocated or freed
On Mon, Oct 23, 2017 at 03:13:33PM +0530, Aneesh Kumar K.V wrote: > Ram Paiwrites: > > > cleanup the bits corresponding to a key in the AMR, and IAMR > > register, when the key is newly allocated/activated or is freed. > > We dont want some residual bits cause the hardware enforce > > unintended behavior when the key is activated or freed. > > > > Signed-off-by: Ram Pai > > --- > > arch/powerpc/include/asm/pkeys.h | 12 > > 1 files changed, 12 insertions(+), 0 deletions(-) > > > > diff --git a/arch/powerpc/include/asm/pkeys.h > > b/arch/powerpc/include/asm/pkeys.h > > index 5a83ed7..53bf13b 100644 > > --- a/arch/powerpc/include/asm/pkeys.h > > +++ b/arch/powerpc/include/asm/pkeys.h > > @@ -54,6 +54,8 @@ static inline bool mm_pkey_is_allocated(struct mm_struct > > *mm, int pkey) > > mm_set_pkey_is_allocated(mm, pkey)); > > } > > > > +extern void __arch_activate_pkey(int pkey); > > +extern void __arch_deactivate_pkey(int pkey); > > /* > > * Returns a positive, 5-bit key on success, or -1 on failure. > > */ > > @@ -80,6 +82,12 @@ static inline int mm_pkey_alloc(struct mm_struct *mm) > > > > ret = ffz((u32)mm_pkey_allocation_map(mm)); > > mm_set_pkey_allocated(mm, ret); > > + > > + /* > > +* enable the key in the hardware > > +*/ > > + if (ret > 0) > > + __arch_activate_pkey(ret); > > return ret; > > } > > We are already arch specific because we are defining them in > arch/powerpc/include/asm/, then why __arch_activate_pkey() ? almost all the memory-key internal functions are named with a two leading underbars. So just *loosely* following a convention within that file. The corresponding functions without the two underbars are the one exposed to the arch-neutral code. RP
Re: [PATCH 03/25] powerpc: track allocation status of all pkeys
On Mon, Oct 23, 2017 at 03:11:28PM +0530, Aneesh Kumar K.V wrote:
> Ram Pai writes:
>
>> Total 32 keys are available on power7 and above. However
>> pkey 0,1 are reserved. So effectively we have 30 pkeys.
>
> When you say reserved, reserved by whom? Is that part of the ISA or
> PAPR? Also do you expect that to change? If not, why all this
> indirection? Can we have the mask as a #define for 4K and 64K page
> size config?

The reserved keys cannot be determined statically. It depends on the
platform and the key info exported to us by the device tree. Hence it
cannot be macro'd. One of the subsequent patches reads the device tree
and sets pkeys_total appropriately. FYI.

BTW: key 0 is reserved by the ISA. key 1 is reserved, but it is
nebulous; there is a programming note in the ISA to avoid key 1. Also,
testing and experimentation with key 1 led to unpredictable behavior on
powervm. Key 31 may or may not be used by the hypervisor; the device
tree has to be consulted to determine that.

Bottom line: the mask can be determined only at runtime.

RP
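The runtime determination described above can be modeled in a few lines of C. This is only an illustration of the arithmetic (keys 0 and 1 always held back, the top key conditionally held back depending on what firmware reports); the helper names are hypothetical, not the kernel's:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative sketch: build the initially-reserved key mask at runtime.
 * Key 0 is reserved by the ISA, key 1 is avoided per the ISA programming
 * note, and the top key (e.g. key 31) may be owned by the hypervisor
 * depending on what the device tree reports.
 */
static uint32_t initial_reserved_mask(int pkeys_total, bool hv_owns_top_key)
{
	uint32_t mask = (1u << 0) | (1u << 1);	/* keys 0 and 1 */

	if (hv_owns_top_key)
		mask |= 1u << (pkeys_total - 1);
	return mask;
}

/* How many keys remain usable once the reserved ones are subtracted. */
static int usable_keys(int pkeys_total, bool hv_owns_top_key)
{
	uint32_t mask = initial_reserved_mask(pkeys_total, hv_owns_top_key);

	return pkeys_total - __builtin_popcount(mask);
}
```

So with 32 hardware keys this gives the "effectively 30 pkeys" from the commit message, or 29 if the hypervisor also holds key 31.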
Re: [PATCH 02/25] powerpc: define an additional vma bit for protection keys.
On Mon, Oct 23, 2017 at 02:58:55PM +0530, Aneesh Kumar K.V wrote:
> "Aneesh Kumar K.V" writes:
>
>> Ram Pai writes:
>>
>>> powerpc needs an additional vma bit to support 32 keys.
>>> Till the additional vma bit lands in include/linux/mm.h
>>> we have to define it in powerpc specific header file.
>>> This is needed to get pkeys working on power.
>>>
>>> Signed-off-by: Ram Pai
>>> ---
>>> arch/powerpc/include/asm/pkeys.h | 18 ++
>>> 1 files changed, 18 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/arch/powerpc/include/asm/pkeys.h
>>> b/arch/powerpc/include/asm/pkeys.h
>>> index c02305a..44e01a2 100644
>>> --- a/arch/powerpc/include/asm/pkeys.h
>>> +++ b/arch/powerpc/include/asm/pkeys.h
>>> @@ -3,6 +3,24 @@
>>>
>>> extern bool pkey_inited;
>>> extern bool pkey_execute_disable_support;
>>> +
>>> +/*
>>> + * powerpc needs an additional vma bit to support 32 keys.
>>> + * Till the additional vma bit lands in include/linux/mm.h
>>> + * we have to carry the hunk below. This is needed to get
>>> + * pkeys working on power. -- Ram
>>> + */
>>> +#ifndef VM_HIGH_ARCH_BIT_4
>>> +#define VM_HIGH_ARCH_BIT_4	36
>>> +#define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
>>> +#define VM_PKEY_SHIFT	VM_HIGH_ARCH_BIT_0
>>> +#define VM_PKEY_BIT0	VM_HIGH_ARCH_0
>>> +#define VM_PKEY_BIT1	VM_HIGH_ARCH_1
>>> +#define VM_PKEY_BIT2	VM_HIGH_ARCH_2
>>> +#define VM_PKEY_BIT3	VM_HIGH_ARCH_3
>>> +#define VM_PKEY_BIT4	VM_HIGH_ARCH_4
>>> +#endif
>>> +
>>> #define ARCH_VM_PKEY_FLAGS 0
>>
>> Do we want them in pkeys.h? Even if they are arch specific, for the
>> existing ones we have them in include/linux/mm.h. IIUC, vmflags details
>> are always in mm.h? This will be the first exception to that?
> Also can we move that
>
> #if defined(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)
> # define VM_PKEY_SHIFT	VM_HIGH_ARCH_BIT_0
> # define VM_PKEY_BIT0	VM_HIGH_ARCH_0 /* A protection key is a 4-bit value */
> # define VM_PKEY_BIT1	VM_HIGH_ARCH_1
> # define VM_PKEY_BIT2	VM_HIGH_ARCH_2
> # define VM_PKEY_BIT3	VM_HIGH_ARCH_3
> #endif
>
> to
>
> #if defined(CONFIG_ARCH_HAS_PKEYS)
> # define VM_PKEY_SHIFT	VM_HIGH_ARCH_BIT_0
> # define VM_PKEY_BIT0	VM_HIGH_ARCH_0 /* A protection key is a 4-bit value */
> # define VM_PKEY_BIT1	VM_HIGH_ARCH_1
> # define VM_PKEY_BIT2	VM_HIGH_ARCH_2
> # define VM_PKEY_BIT3	VM_HIGH_ARCH_3
> #endif
>
> and then later update the generic code to handle PKEY_BIT4?

Yes. The above changes have been implemented in a patch sent to the mm
mailing list as well as to lkml.

https://lkml.org/lkml/2017/9/15/504

RP
Re: [PATCH 02/25] powerpc: define an additional vma bit for protection keys.
On Mon, Oct 23, 2017 at 02:55:51PM +0530, Aneesh Kumar K.V wrote: > Ram Paiwrites: > > > powerpc needs an additional vma bit to support 32 keys. > > Till the additional vma bit lands in include/linux/mm.h > > we have to define it in powerpc specific header file. > > This is needed to get pkeys working on power. > > > > Signed-off-by: Ram Pai > > --- > > arch/powerpc/include/asm/pkeys.h | 18 ++ > > 1 files changed, 18 insertions(+), 0 deletions(-) > > > > diff --git a/arch/powerpc/include/asm/pkeys.h > > b/arch/powerpc/include/asm/pkeys.h > > index c02305a..44e01a2 100644 > > --- a/arch/powerpc/include/asm/pkeys.h > > +++ b/arch/powerpc/include/asm/pkeys.h > > @@ -3,6 +3,24 @@ > > > > extern bool pkey_inited; > > extern bool pkey_execute_disable_support; > > + > > +/* > > + * powerpc needs an additional vma bit to support 32 keys. > > + * Till the additional vma bit lands in include/linux/mm.h > > + * we have to carry the hunk below. This is needed to get > > + * pkeys working on power. -- Ram > > + */ > > +#ifndef VM_HIGH_ARCH_BIT_4 > > +#define VM_HIGH_ARCH_BIT_4 36 > > +#define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4) > > +#define VM_PKEY_SHIFT VM_HIGH_ARCH_BIT_0 > > +#define VM_PKEY_BIT0 VM_HIGH_ARCH_0 > > +#define VM_PKEY_BIT1 VM_HIGH_ARCH_1 > > +#define VM_PKEY_BIT2 VM_HIGH_ARCH_2 > > +#define VM_PKEY_BIT3 VM_HIGH_ARCH_3 > > +#define VM_PKEY_BIT4 VM_HIGH_ARCH_4 > > +#endif > > + > > #define ARCH_VM_PKEY_FLAGS 0 > > Do we want them in pkeys.h ? Even if they are arch specific for the > existing ones we have them in include/linux/mm.h. IIUC, vmflags details > are always in mm.h? This will be the first exception to that? I am trying to get the real fix in include/linux/mm.h If that happens sooner than this hunk is not needed. Till then this is an exception, but it is the **ONLY** exception. I think your point is to have this hunk in include/linux/mm.h ? 
If yes, then it would be easier to push the real fix into
include/linux/mm.h instead of pushing this hunk in there.

RP
[PATCH 4/4] powerpc/kprobes: refactor kprobe_lookup_name for safer string operations
Use safer string manipulation functions when dealing with a user-provided
string in kprobe_lookup_name().

Reported-by: David Laight
Signed-off-by: Naveen N. Rao
---
 arch/powerpc/kernel/kprobes.c | 47 ++++++++++++++++++++-----------------------
 1 file changed, 20 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
index a14c61855705..bbb4795660f8 100644
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -53,7 +53,7 @@ bool arch_within_kprobe_blacklist(unsigned long addr)
 kprobe_opcode_t *kprobe_lookup_name(const char *name, unsigned int offset)
 {
-	kprobe_opcode_t *addr;
+	kprobe_opcode_t *addr = NULL;
 
 #ifdef PPC64_ELF_ABI_v2
 	/* PPC64 ABIv2 needs local entry point */
@@ -85,36 +85,29 @@ kprobe_opcode_t *kprobe_lookup_name(const char *name, unsigned int offset)
 	 * Also handle format.
 	 */
 	char dot_name[MODULE_NAME_LEN + 1 + KSYM_NAME_LEN];
-	const char *modsym;
 	bool dot_appended = false;
-	if ((modsym = strchr(name, ':')) != NULL) {
-		modsym++;
-		if (*modsym != '\0' && *modsym != '.') {
-			/* Convert to */
-			strncpy(dot_name, name, modsym - name);
-			dot_name[modsym - name] = '.';
-			dot_name[modsym - name + 1] = '\0';
-			strncat(dot_name, modsym,
-				sizeof(dot_name) - (modsym - name) - 2);
-			dot_appended = true;
-		} else {
-			dot_name[0] = '\0';
-			strncat(dot_name, name, sizeof(dot_name) - 1);
-		}
-	} else if (name[0] != '.') {
-		dot_name[0] = '.';
-		dot_name[1] = '\0';
-		strncat(dot_name, name, KSYM_NAME_LEN - 2);
+	const char *c;
+	ssize_t ret = 0;
+	int len = 0;
+
+	if ((c = strnchr(name, MODULE_NAME_LEN, ':')) != NULL) {
+		c++;
+		len = c - name;
+		memcpy(dot_name, name, len);
+	} else
+		c = name;
+
+	if (*c != '\0' && *c != '.') {
+		dot_name[len++] = '.';
 		dot_appended = true;
-	} else {
-		dot_name[0] = '\0';
-		strncat(dot_name, name, KSYM_NAME_LEN - 1);
 	}
-	addr = (kprobe_opcode_t *)kallsyms_lookup_name(dot_name);
-	if (!addr && dot_appended) {
-		/* Let's try the original non-dot symbol lookup */
+	ret = strscpy(dot_name + len, c, KSYM_NAME_LEN);
+	if (ret > 0)
+		addr = (kprobe_opcode_t *)kallsyms_lookup_name(dot_name);
+
+	/* Fallback to the original non-dot symbol lookup */
+	if (!addr && dot_appended)
 		addr = (kprobe_opcode_t *)kallsyms_lookup_name(name);
-	}
 #else
 	addr = (kprobe_opcode_t *)kallsyms_lookup_name(name);
 #endif
-- 
2.14.2
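The dot-name construction this patch performs ("mod:sym" becomes "mod:.sym", "sym" becomes ".sym", names already starting with '.' are copied unchanged) can be modeled in userspace. This is a simplified sketch, not the kernel code: it uses plain strchr() and a bounded strncpy() in place of the kernel's strnchr()/strscpy(), and it omits the kallsyms lookup entirely:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Userspace model of the dot-name logic in the patch above.
 * Returns true if a '.' was prepended to the symbol part; the real code
 * retries the original name on lookup failure in that case.
 */
static bool make_dot_name(const char *name, char *dot_name, size_t size)
{
	const char *c = strchr(name, ':');
	size_t len = 0;
	bool dot_appended = false;

	if (c != NULL) {
		/* copy "mod:" prefix verbatim, point c at the symbol part */
		c++;
		len = (size_t)(c - name);
		memcpy(dot_name, name, len);
	} else {
		c = name;
	}

	if (*c != '\0' && *c != '.') {
		dot_name[len++] = '.';
		dot_appended = true;
	}
	/* the kernel uses strscpy(); a bounded copy + terminator here */
	strncpy(dot_name + len, c, size - len - 1);
	dot_name[size - 1] = '\0';
	return dot_appended;
}
```

The interesting property the patch preserves: when a dot was appended and the dotted lookup fails, the caller falls back to the original, undotted name.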
[PATCH 3/4] powerpc/kprobes: Blacklist emulate_update_regs() from kprobes
Commit 3cdfcbfd32b9d ("powerpc: Change analyse_instr so it doesn't modify
*regs") introduced emulate_update_regs() to perform part of what
emulate_step() was doing earlier. However, this function was not added to
the kprobes blacklist. Add it so as to prevent it from being probed.

Signed-off-by: Naveen N. Rao
---
 arch/powerpc/lib/sstep.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
index 8c3955e183d4..70274b7b4773 100644
--- a/arch/powerpc/lib/sstep.c
+++ b/arch/powerpc/lib/sstep.c
@@ -2717,6 +2717,7 @@ void emulate_update_regs(struct pt_regs *regs, struct instruction_op *op)
 	}
 	regs->nip = next_pc;
 }
+NOKPROBE_SYMBOL(emulate_update_regs);
 
 /*
  * Emulate a previously-analysed load or store instruction.
-- 
2.14.2
[PATCH 2/4] powerpc/kprobes: Do not disable interrupts for optprobes and kprobes_on_ftrace
Per Documentation/kprobes.txt, we don't necessarily need to disable
interrupts before invoking the kprobe handlers. Masami submitted similar
changes for x86 via commit a19b2e3d783964 ("kprobes/x86: Remove IRQ
disabling from ftrace-based/optimized kprobes"). Do the same for powerpc.

Signed-off-by: Naveen N. Rao
---
 arch/powerpc/kernel/kprobes-ftrace.c | 10 ++--------
 arch/powerpc/kernel/optprobes.c      | 10 ----------
 2 files changed, 2 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/kernel/kprobes-ftrace.c b/arch/powerpc/kernel/kprobes-ftrace.c
index 4b1f34f685b1..7a1f99f1b47f 100644
--- a/arch/powerpc/kernel/kprobes-ftrace.c
+++ b/arch/powerpc/kernel/kprobes-ftrace.c
@@ -75,11 +75,7 @@ void kprobe_ftrace_handler(unsigned long nip, unsigned long parent_nip,
 {
 	struct kprobe *p;
 	struct kprobe_ctlblk *kcb;
-	unsigned long flags;
 
-	/* Disable irq for emulating a breakpoint and avoiding preempt */
-	local_irq_save(flags);
-	hard_irq_disable();
 	preempt_disable();
 
 	p = get_kprobe((kprobe_opcode_t *)nip);
@@ -105,16 +101,14 @@ void kprobe_ftrace_handler(unsigned long nip, unsigned long parent_nip,
 		else {
 			/*
 			 * If pre_handler returns !0, it sets regs->nip and
-			 * resets current kprobe. In this case, we still need
-			 * to restore irq, but not preemption.
+			 * resets current kprobe. In this case, we should not
+			 * re-enable preemption.
 			 */
-			local_irq_restore(flags);
 			return;
 		}
 	}
 end:
 	preempt_enable_no_resched();
-	local_irq_restore(flags);
 }
 NOKPROBE_SYMBOL(kprobe_ftrace_handler);
 
diff --git a/arch/powerpc/kernel/optprobes.c b/arch/powerpc/kernel/optprobes.c
index 60ba7f1370a8..8237884ca389 100644
--- a/arch/powerpc/kernel/optprobes.c
+++ b/arch/powerpc/kernel/optprobes.c
@@ -115,14 +115,10 @@ static unsigned long can_optimize(struct kprobe *p)
 static void optimized_callback(struct optimized_kprobe *op,
 			       struct pt_regs *regs)
 {
-	unsigned long flags;
-
 	/* This is possible if op is under delayed unoptimizing */
 	if (kprobe_disabled(&op->kp))
 		return;
 
-	local_irq_save(flags);
-	hard_irq_disable();
 	preempt_disable();
 
 	if (kprobe_running()) {
@@ -135,13 +131,7 @@ static void optimized_callback(struct optimized_kprobe *op,
 		__this_cpu_write(current_kprobe, NULL);
 	}
 
-	/*
-	 * No need for an explicit __hard_irq_enable() here.
-	 * local_irq_restore() will re-enable interrupts,
-	 * if they were hard disabled.
-	 */
 	preempt_enable_no_resched();
-	local_irq_restore(flags);
 }
 NOKPROBE_SYMBOL(optimized_callback);
-- 
2.14.2
[PATCH 1/4] powerpc/kprobes: Disable preemption before invoking probe handler for optprobes
Per Documentation/kprobes.txt, probe handlers need to be invoked with preemption disabled. Update optimized_callback() to do so. Also move the get_kprobe_ctlblk() invocation to after the preemption disable, since it accesses per-cpu data. This was not an issue so far since optprobes weren't selected if CONFIG_PREEMPT was enabled. Commit a30b85df7d599f ("kprobes: Use synchronize_rcu_tasks() for optprobe with CONFIG_PREEMPT=y") changes this.

Signed-off-by: Naveen N. Rao
---
 arch/powerpc/kernel/optprobes.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/optprobes.c b/arch/powerpc/kernel/optprobes.c
index 91e037ab20a1..60ba7f1370a8 100644
--- a/arch/powerpc/kernel/optprobes.c
+++ b/arch/powerpc/kernel/optprobes.c
@@ -115,7 +115,6 @@ static unsigned long can_optimize(struct kprobe *p)
 static void optimized_callback(struct optimized_kprobe *op,
 					struct pt_regs *regs)
 {
-	struct kprobe_ctlblk *kcb = get_kprobe_ctlblk();
 	unsigned long flags;
 
 	/* This is possible if op is under delayed unoptimizing */
@@ -124,13 +123,14 @@ static void optimized_callback(struct optimized_kprobe *op,
 
 	local_irq_save(flags);
 	hard_irq_disable();
+	preempt_disable();
 
 	if (kprobe_running()) {
 		kprobes_inc_nmissed_count(&op->kp);
 	} else {
 		__this_cpu_write(current_kprobe, &op->kp);
 		regs->nip = (unsigned long)op->kp.addr;
-		kcb->kprobe_status = KPROBE_HIT_ACTIVE;
+		get_kprobe_ctlblk()->kprobe_status = KPROBE_HIT_ACTIVE;
 		opt_pre_handler(&op->kp, regs);
 		__this_cpu_write(current_kprobe, NULL);
 	}
@@ -140,6 +140,7 @@ static void optimized_callback(struct optimized_kprobe *op,
 	 * local_irq_restore() will re-enable interrupts,
 	 * if they were hard disabled.
 	 */
+	preempt_enable_no_resched();
 	local_irq_restore(flags);
 }
 NOKPROBE_SYMBOL(optimized_callback);
-- 
2.14.2
Re: [PATCH 3/7] powerpc: Free up four 64K PTE bits in 4K backed HPTE pages
On Mon, Oct 23, 2017 at 02:17:39PM +0530, Aneesh Kumar K.V wrote: > Michael Ellermanwrites: > > > Ram Pai writes: > > > >> diff --git a/arch/powerpc/mm/hash64_64k.c b/arch/powerpc/mm/hash64_64k.c > >> index 1a68cb1..c6c5559 100644 > >> --- a/arch/powerpc/mm/hash64_64k.c > >> +++ b/arch/powerpc/mm/hash64_64k.c > >> @@ -126,18 +113,13 @@ int __hash_page_4K(unsigned long ea, unsigned long > >> access, unsigned long vsid, > >>if (__rpte_sub_valid(rpte, subpg_index)) { > >>int ret; > >> > >> - hash = hpt_hash(vpn, shift, ssize); > >> - hidx = __rpte_to_hidx(rpte, subpg_index); > >> - if (hidx & _PTEIDX_SECONDARY) > >> - hash = ~hash; > >> - slot = (hash & htab_hash_mask) * HPTES_PER_GROUP; > >> - slot += hidx & _PTEIDX_GROUP_IX; > >> + gslot = pte_get_hash_gslot(vpn, shift, ssize, rpte, > >> + subpg_index); > >> + ret = mmu_hash_ops.hpte_updatepp(gslot, rflags, vpn, > >> + MMU_PAGE_4K, MMU_PAGE_4K, ssize, flags); > > > > This was formatted correctly before: > > > >> - ret = mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, > >> - MMU_PAGE_4K, MMU_PAGE_4K, > >> - ssize, flags); > >>/* > >> - *if we failed because typically the HPTE wasn't really here > >> + * if we failed because typically the HPTE wasn't really here > > > > If you're fixing it up please make it "If ...". > > > >> * we try an insertion. > >> */ > >>if (ret == -1) > >> @@ -148,6 +130,15 @@ int __hash_page_4K(unsigned long ea, unsigned long > >> access, unsigned long vsid, > >>} > >> > >> htab_insert_hpte: > >> + > >> + /* > >> + * initialize all hidx entries to invalid value, > >> + * the first time the PTE is about to allocate > >> + * a 4K hpte > >> + */ > > > > Should be: > > /* > > * Initialize all hidx entries to invalid value, the first time > > * the PTE is about to allocate a 4K HPTE. > > */ > > > >> + if (!(old_pte & H_PAGE_COMBO)) > >> + rpte.hidx = ~0x0UL; > >> + > > > > Paul had the idea that if we biased the slot number by 1, we could make > > the "invalid" value be == 0. 
> >
> > That would avoid needing to do that above, and also mean the value is
> > correctly invalid from the get-go, which would be good IMO.
> >
> > I think now that you've added the slot accessors it would be pretty easy
> > to do.
>
> That would imply we lose one slot in the primary group, which means we
> will do extra work in some cases because our primary now has only 7
> slots. And in the case of pseries, the hypervisor will always return the
> least available slot, which implies we will do extra hcalls in the case
> of an hpte insert to an empty group?

No, that is not the idea. The idea is that slot 'F' in the secondary
will continue to be an invalid slot, but will be represented as
offset-by-one in the PTE. In other words, 0 will be represented as 1,
1 as 2, and n as (n+1)%32.

The idea seems feasible. It has the advantage that 0 in the PTE means
an invalid slot. But it can be confusing to the casual code-reader;
will need to put in a big-huge comment to explain that.

RP
[PATCH V3 2/2] pseries/initnodes: Ensure nodes initialized for hotplug
To: linuxppc-dev@lists.ozlabs.org
To: linux-ker...@vger.kernel.org
Cc: Michael Ellerman
Cc: Michael Bringmann
Cc: John Allen
Cc: Nathan Fontenot
Subject: [PATCH V3 2/2] pseries/initnodes: Ensure nodes initialized for hotplug

pseries/nodes: On pseries systems which allow 'hot-add' of CPU, it may
occur that the new resources are to be inserted into nodes that were not
used for memory resources at bootup. Many different configurations of
PowerPC resources may need to be supported depending upon the
environment. This patch fixes some problems encountered at runtime with
configurations that support memory-less nodes, or that hot-add resources
during system execution after boot.

Signed-off-by: Michael Bringmann
---
 arch/powerpc/mm/numa.c | 35 +++++++++++++++++++++++++++++++++--
 1 file changed, 33 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index f885ab7..37f697dc1 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -551,7 +551,7 @@ static int numa_setup_cpu(unsigned long lcpu)
 		nid = of_node_to_nid_single(cpu);
 
 out_present:
-	if (nid < 0 || !node_online(nid))
+	if (nid < 0 || !node_possible(nid))
 		nid = first_online_node;
 
 	map_cpu_to_node(lcpu, nid);
@@ -1311,6 +1311,35 @@ static long vphn_get_associativity(unsigned long cpu,
 	return rc;
 }
 
+static int verify_node_preparation(int nid)
+{
+	/*
+	 * Need to allocate/initialize NODE_DATA from a node with
+	 * memory (see memblock_alloc_try_nid). Code executed after
+	 * boot (like local_memory_node) often does not know enough
+	 * to recover fully for memoryless nodes.
+	 */
+	if (NODE_DATA(nid) == NULL) {
+		int ret;
+		printk(KERN_INFO "%s:%d node %d (%d)\n", __FUNCTION__, __LINE__, nid, first_online_node);
+		setup_node_data(nid, 0, 0);
+		printk(KERN_INFO "%s:%d node %d (%d)\n", __FUNCTION__, __LINE__, nid, first_online_node);
+		ret = try_online_node(nid);
+		printk(KERN_INFO "%s:%d node %d (%d) %d %p\n", __FUNCTION__, __LINE__, nid, first_online_node, ret, NODE_DATA(nid));
+
+		/*
+		 * Node structures successfully initialized, but
+		 * we still need the node to be offline.
+		 */
+		node_set_offline(nid);
+	}
+
+	if (NODE_DATA(nid)->node_spanned_pages == 0)
+		return first_online_node;
+
+	return nid;
+}
+
 /*
  * Update the CPU maps and sysfs entries for a single CPU when its NUMA
  * characteristics change. This function doesn't perform any locking and is
@@ -1419,9 +1448,11 @@ int numa_update_cpu_topology(bool cpus_locked)
 		/* Use associativity from first thread for all siblings */
 		vphn_get_associativity(cpu, associativity);
 		new_nid = associativity_to_nid(associativity);
-		if (new_nid < 0 || !node_online(new_nid))
+		if (new_nid < 0 || !node_possible(new_nid))
 			new_nid = first_online_node;
 
+		new_nid = verify_node_preparation(new_nid);
+
 		if (new_nid == numa_cpu_lookup_table[cpu]) {
 			cpumask_andnot(&cpu_associativity_changes_mask,
 					&cpu_associativity_changes_mask,
[PATCH V3 1/2] pseries/nodes: Ensure enough nodes avail for operations
pseries/nodes: On pseries systems which allow 'hot-add' of CPU or memory
resources, it may occur that the new resources are to be inserted into
nodes that were not used for these resources at bootup. In the kernel,
any node that is used must be defined and initialized. This patch
ensures that sufficient nodes are defined to support configuration
requirements after boot, as well as at boot.

This patch extracts the value of the lowest domain level (number of
allocable resources) from the device tree property
"ibm,max-associativity-domains" to use as the maximum number of nodes to
set up as possibly available in the system. This new setting will
override the instruction:

	nodes_and(node_possible_map, node_possible_map, node_online_map);

presently seen in the function arch/powerpc/mm/numa.c:initmem_init().

If the property is not present at boot, no operation will be performed
to define or enable additional nodes.

Signed-off-by: Michael Bringmann
---
 arch/powerpc/mm/numa.c | 39 ++++++++++++++++++++++++++++++++++++---
 1 file changed, 36 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index ec098b3..f885ab7 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -892,6 +892,36 @@ static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn)
 	NODE_DATA(nid)->node_spanned_pages = spanned_pages;
 }
 
+static void __init find_possible_nodes(void)
+{
+	struct device_node *rtas;
+	u32 numnodes, i;
+
+	if (min_common_depth <= 0)
+		return;
+
+	rtas = of_find_node_by_path("/rtas");
+	if (!rtas)
+		return;
+
+	if (of_property_read_u32_index(rtas,
+				"ibm,max-associativity-domains",
+				min_common_depth, &numnodes))
+		goto out;
+
+	pr_info("numa: Nodes = %d (mcd = %d)\n", numnodes,
+		min_common_depth);
+
+	for (i = 0; i < numnodes; i++) {
+		if (!node_possible(i))
+			node_set(i, node_possible_map);
+	}
+
+out:
+	if (rtas)
+		of_node_put(rtas);
+}
+
 void __init initmem_init(void)
 {
 	int nid, cpu;
@@ -905,12 +935,15 @@ void __init initmem_init(void)
 	memblock_dump_all();
 
 	/*
-	 * Reduce the possible NUMA nodes to the online NUMA nodes,
-	 * since we do not support node hotplug. This ensures that we
-	 * lower the maximum NUMA node ID to what is actually present.
+	 * Modify the set of possible NUMA nodes to reflect information
+	 * available about the set of online nodes, and the set of nodes
+	 * that we expect to make use of for this platform's affinity
+	 * calculations.
 	 */
 	nodes_and(node_possible_map, node_possible_map, node_online_map);
 
+	find_possible_nodes();
+
 	for_each_online_node(nid) {
 		unsigned long start_pfn, end_pfn;
[PATCH V3 0/2] pseries/nodes: Fix issues with memoryless nodes
pseries/nodes: Ensure enough nodes avail for operations
pseries/initnodes: Ensure nodes initialized for hotplug

Signed-off-by: Michael Bringmann

Michael Bringmann (2):
  pseries/nodes: Ensure enough nodes avail for operations
  pseries/initnodes: Ensure nodes initialized for hotplug
Re: [PATCH 3/4] char/tpm: Improve a size determination in nine functions
On Thu, Oct 19, 2017 at 04:58:23PM +, alexander.stef...@infineon.com wrote: > > On Tue, Oct 17, 2017 at 11:50:05AM +, alexander.stef...@infineon.com > > wrote: > > > > > Replace the specification of data structures by pointer dereferences > > > > > as the parameter for the operator "sizeof" to make the corresponding > > > > > size > > > > > determination a bit safer according to the Linux coding style > > > > > convention. > > > > > > > > > > > > This patch does one style in favor of the other. > > > > > > I actually prefer that style, so I'd welcome this change :) > > > > > > > At the end it's Jarkko's call, though I would NAK this as I think some > > > > one already told this to you for some other similar patch(es). > > > > > > > > > > > > I even would suggest to stop doing this noisy stuff, which keeps people > > > > busy for nothing. > > > > > > Cleaning up old code is also worth something, even if does not change > > > one bit in the assembly output in the end... > > > > > > Alexander > > > > Quite insignificant clean up it is that does more harm that gives any > > benefit as any new change adds debt to backporting. > > > > Anyway, this has been a useful patch set for me in the sense that I have > > clearer picture now on discarding/accepting commits. > > Indeed. I have now a better understanding for why some code looks as ugly as > it does. > These patches aren't a part of a sensible attempt to clean up the code. The first two patches are easy to review and have a clear benefit so that's fine any time. Patch 3 updates the code to a new style guideline but it doesn't really improve readability that much. Those sorts of patches are hard to review because you have to verify that the object code doesn't change. Plus it messes up git blame. It's good for new code and staging, but for old code, it's probably only worth it if there are other patches which make worth the price. You're basically applying it to make the patch sender happy. 
With patch 4, the tpm_ibmvtpm_probe() function was buggy and it's still buggy after the patch is applied. It's a waste of time re-arranging the code if you aren't going to fix the bugs. regards, dan carpenter
Re: [PATCH 1/4] powerpc/tm: Add commandline option to disable hardware transactional memory
On Sat, Oct 21, 2017 at 11:58:47AM +1100, Michael Neuling wrote: > On Fri, 2017-10-20 at 09:45 -0200, Breno Leitao wrote: > > Mikey, Cyril, > > > > On Thu, Oct 12, 2017 at 09:17:16PM +1100, Michael Ellerman wrote: > > > From: Cyril Bur> > > > > > Currently the kernel relies on firmware to inform it whether or not the > > > CPU supports HTM and as long as the kernel was built with > > > CONFIG_PPC_TRANSACTIONAL_MEM=y then it will allow userspace to make > > > use of the facility. > > > > > > There may be situations where it would be advantageous for the kernel > > > to not allow userspace to use HTM, currently the only way to achieve > > > this is to recompile the kernel with CONFIG_PPC_TRANSACTIONAL_MEM=n. > > > > > > This patch adds a simple commandline option so that HTM can be > > > disabled at boot time. > > > > > > Signed-off-by: Cyril Bur > > > [mpe: Simplify to a bool, move to prom.c, put doco in the right place. > > > Always disable, regardless of initial state, to avoid user confusion.] > > > Signed-off-by: Michael Ellerman > > > --- > > > Documentation/admin-guide/kernel-parameters.txt | 4 > > > arch/powerpc/kernel/prom.c | 31 > > > + > > > 2 files changed, 35 insertions(+) > > > > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt > > > b/Documentation/admin-guide/kernel-parameters.txt > > > index 05496622b4ef..ef03e6e16bdb 100644 > > > --- a/Documentation/admin-guide/kernel-parameters.txt > > > +++ b/Documentation/admin-guide/kernel-parameters.txt > > > @@ -3185,6 +3185,10 @@ > > > allowed (eg > > > kernel_enable_fpu()/kernel_disable_fpu()). > > > There is some performance impact when enabling > > > this. 
> > > > > > + ppc_tm= [PPC] > > > + Format: {"off"} > > > + Disable Hardware Transactional Memory > > > + > > > print-fatal-signals= > > > [KNL] debug: print fatal signals > > > > > > diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c > > > index f83056297441..d9bd6555f980 100644 > > > --- a/arch/powerpc/kernel/prom.c > > > +++ b/arch/powerpc/kernel/prom.c > > > @@ -658,6 +658,35 @@ static void __init early_reserve_mem(void) > > > #endif > > > } > > > > > > +#ifdef CONFIG_PPC_TRANSACTIONAL_MEM > > > +static bool tm_disabled __initdata; > > > > I think the name 'tm_disabled' might cause more confusion on the TM > > code. Mainly because we already have tm_enable() and tm_enabled() > > functions which are related to the MSR register and TM bit, and, with > > your new variable, tm_enabled() and tm_disabled are not going to be > > exclusionary. Neither tm_enable() with be able to toggle the tm_disabled > > value. > > Got a proposal for better names? That is the hardest part, but I thought about something as: * tm_disabled_on_boot * tm_off * tm_explicit_disabled * tm_feature_disabled * tm_enablement_disabled I think these names, although a bit longer, might avoid the confusion with tm_enable/tm_enabled nomenclature.
Re: [PATCH 03/25] powerpc: track allocation status of all pkeys
Ram Pai writes:

> Total 32 keys are available on power7 and above. However
> pkey 0,1 are reserved. So effectively we have 30 pkeys.

When you say reserved, reserved by whom? Is that part of the ISA or
PAPR? Also, do you expect that to change? If not, why all this
indirection? Can we have the mask as a #define for the 4K and 64K page
size configs?

> On 4K kernels, we do not have 5 bits in the PTE to
> represent all the keys; we only have 3 bits. Two of those
> keys are reserved; pkey 0 and pkey 1. So effectively we
> have 6 pkeys.
>
> This patch keeps track of reserved keys, allocated keys
> and keys that are currently free.
>
> Also it adds skeletal functions and macros, that the
> architecture-independent code expects to be available.

-aneesh
Re: [PATCH 02/25] powerpc: define an additional vma bit for protection keys.
Ram Paiwrites: > powerpc needs an additional vma bit to support 32 keys. > Till the additional vma bit lands in include/linux/mm.h > we have to define it in powerpc specific header file. > This is needed to get pkeys working on power. > > Signed-off-by: Ram Pai > --- > arch/powerpc/include/asm/pkeys.h | 18 ++ > 1 files changed, 18 insertions(+), 0 deletions(-) > > diff --git a/arch/powerpc/include/asm/pkeys.h > b/arch/powerpc/include/asm/pkeys.h > index c02305a..44e01a2 100644 > --- a/arch/powerpc/include/asm/pkeys.h > +++ b/arch/powerpc/include/asm/pkeys.h > @@ -3,6 +3,24 @@ > > extern bool pkey_inited; > extern bool pkey_execute_disable_support; > + > +/* > + * powerpc needs an additional vma bit to support 32 keys. > + * Till the additional vma bit lands in include/linux/mm.h > + * we have to carry the hunk below. This is needed to get > + * pkeys working on power. -- Ram > + */ > +#ifndef VM_HIGH_ARCH_BIT_4 > +#define VM_HIGH_ARCH_BIT_4 36 > +#define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4) > +#define VM_PKEY_SHIFT VM_HIGH_ARCH_BIT_0 > +#define VM_PKEY_BIT0 VM_HIGH_ARCH_0 > +#define VM_PKEY_BIT1 VM_HIGH_ARCH_1 > +#define VM_PKEY_BIT2 VM_HIGH_ARCH_2 > +#define VM_PKEY_BIT3 VM_HIGH_ARCH_3 > +#define VM_PKEY_BIT4 VM_HIGH_ARCH_4 > +#endif > + > #define ARCH_VM_PKEY_FLAGS 0 Do we want them in pkeys.h ? Even if they are arch specific for the existing ones we have them in include/linux/mm.h. IIUC, vmflags details are always in mm.h? This will be the first exception to that? > > static inline bool mm_pkey_is_allocated(struct mm_struct *mm, int pkey) > -- > 1.7.1
Re: [PATCH 3/7] powerpc: Free up four 64K PTE bits in 4K backed HPTE pages
Michael Ellermanwrites: > Ram Pai writes: > >> diff --git a/arch/powerpc/mm/hash64_64k.c b/arch/powerpc/mm/hash64_64k.c >> index 1a68cb1..c6c5559 100644 >> --- a/arch/powerpc/mm/hash64_64k.c >> +++ b/arch/powerpc/mm/hash64_64k.c >> @@ -126,18 +113,13 @@ int __hash_page_4K(unsigned long ea, unsigned long >> access, unsigned long vsid, >> if (__rpte_sub_valid(rpte, subpg_index)) { >> int ret; >> >> -hash = hpt_hash(vpn, shift, ssize); >> -hidx = __rpte_to_hidx(rpte, subpg_index); >> -if (hidx & _PTEIDX_SECONDARY) >> -hash = ~hash; >> -slot = (hash & htab_hash_mask) * HPTES_PER_GROUP; >> -slot += hidx & _PTEIDX_GROUP_IX; >> +gslot = pte_get_hash_gslot(vpn, shift, ssize, rpte, >> +subpg_index); >> +ret = mmu_hash_ops.hpte_updatepp(gslot, rflags, vpn, >> +MMU_PAGE_4K, MMU_PAGE_4K, ssize, flags); > > This was formatted correctly before: > >> -ret = mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, >> - MMU_PAGE_4K, MMU_PAGE_4K, >> - ssize, flags); >> /* >> - *if we failed because typically the HPTE wasn't really here >> + * if we failed because typically the HPTE wasn't really here > > If you're fixing it up please make it "If ...". > >> * we try an insertion. >> */ >> if (ret == -1) >> @@ -148,6 +130,15 @@ int __hash_page_4K(unsigned long ea, unsigned long >> access, unsigned long vsid, >> } >> >> htab_insert_hpte: >> + >> +/* >> + * initialize all hidx entries to invalid value, >> + * the first time the PTE is about to allocate >> + * a 4K hpte >> + */ > > Should be: > /* >* Initialize all hidx entries to invalid value, the first time > * the PTE is about to allocate a 4K HPTE. >*/ > >> +if (!(old_pte & H_PAGE_COMBO)) >> +rpte.hidx = ~0x0UL; >> + > > Paul had the idea that if we biased the slot number by 1, we could make > the "invalid" value be == 0. > > That would avoid needing to that above, and also mean the value is > correctly invalid from the get-go, which would be good IMO. > > I think now that you've added the slot accessors it would be pretty easy > to do. 
That would imply we lose one slot in the primary group, which means we
will do extra work in some cases because our primary now has only 7
slots. And in the case of pseries, the hypervisor will always return the
least available slot, which implies we will do extra hcalls in the case
of an hpte insert to an empty group?

-aneesh
Re: [PATCH 06/25] powerpc: cleaup AMR, iAMR when a key is allocated or freed
Balbir Singhwrites: > On Fri, 8 Sep 2017 15:44:54 -0700 > Ram Pai wrote: > >> cleanup the bits corresponding to a key in the AMR, and IAMR >> register, when the key is newly allocated/activated or is freed. >> We dont want some residual bits cause the hardware enforce >> unintended behavior when the key is activated or freed. >> >> Signed-off-by: Ram Pai >> --- >> arch/powerpc/include/asm/pkeys.h | 12 >> 1 files changed, 12 insertions(+), 0 deletions(-) >> >> diff --git a/arch/powerpc/include/asm/pkeys.h >> b/arch/powerpc/include/asm/pkeys.h >> index 5a83ed7..53bf13b 100644 >> --- a/arch/powerpc/include/asm/pkeys.h >> +++ b/arch/powerpc/include/asm/pkeys.h >> @@ -54,6 +54,8 @@ static inline bool mm_pkey_is_allocated(struct mm_struct >> *mm, int pkey) >> mm_set_pkey_is_allocated(mm, pkey)); >> } >> >> +extern void __arch_activate_pkey(int pkey); >> +extern void __arch_deactivate_pkey(int pkey); >> /* >> * Returns a positive, 5-bit key on success, or -1 on failure. >> */ >> @@ -80,6 +82,12 @@ static inline int mm_pkey_alloc(struct mm_struct *mm) >> >> ret = ffz((u32)mm_pkey_allocation_map(mm)); >> mm_set_pkey_allocated(mm, ret); >> + >> +/* >> + * enable the key in the hardware >> + */ >> +if (ret > 0) >> +__arch_activate_pkey(ret); >> return ret; >> } >> >> @@ -91,6 +99,10 @@ static inline int mm_pkey_free(struct mm_struct *mm, int >> pkey) >> if (!mm_pkey_is_allocated(mm, pkey)) >> return -EINVAL; >> >> +/* >> + * Disable the key in the hardware >> + */ >> +__arch_deactivate_pkey(pkey); >> mm_set_pkey_free(mm, pkey); >> >> return 0; > > I think some of these patches can be merged, too much fine granularity > is hurting my ability to see the larger function/implementation. Completely agree -aneesh
Re: [PATCH 06/25] powerpc: cleaup AMR, iAMR when a key is allocated or freed
Ram Paiwrites: > cleanup the bits corresponding to a key in the AMR, and IAMR > register, when the key is newly allocated/activated or is freed. > We dont want some residual bits cause the hardware enforce > unintended behavior when the key is activated or freed. > > Signed-off-by: Ram Pai > --- > arch/powerpc/include/asm/pkeys.h | 12 > 1 files changed, 12 insertions(+), 0 deletions(-) > > diff --git a/arch/powerpc/include/asm/pkeys.h > b/arch/powerpc/include/asm/pkeys.h > index 5a83ed7..53bf13b 100644 > --- a/arch/powerpc/include/asm/pkeys.h > +++ b/arch/powerpc/include/asm/pkeys.h > @@ -54,6 +54,8 @@ static inline bool mm_pkey_is_allocated(struct mm_struct > *mm, int pkey) > mm_set_pkey_is_allocated(mm, pkey)); > } > > +extern void __arch_activate_pkey(int pkey); > +extern void __arch_deactivate_pkey(int pkey); > /* > * Returns a positive, 5-bit key on success, or -1 on failure. > */ > @@ -80,6 +82,12 @@ static inline int mm_pkey_alloc(struct mm_struct *mm) > > ret = ffz((u32)mm_pkey_allocation_map(mm)); > mm_set_pkey_allocated(mm, ret); > + > + /* > + * enable the key in the hardware > + */ > + if (ret > 0) > + __arch_activate_pkey(ret); > return ret; > } We are already arch specific because we are defining them in arch/powerpc/include/asm/, then why __arch_activate_pkey() ? > > @@ -91,6 +99,10 @@ static inline int mm_pkey_free(struct mm_struct *mm, int > pkey) > if (!mm_pkey_is_allocated(mm, pkey)) > return -EINVAL; > > + /* > + * Disable the key in the hardware > + */ > + __arch_deactivate_pkey(pkey); > mm_set_pkey_free(mm, pkey); > > return 0; > -- > 1.7.1
Re: [PATCH 02/25] powerpc: define an additional vma bit for protection keys.
"Aneesh Kumar K.V"writes: > Ram Pai writes: > >> powerpc needs an additional vma bit to support 32 keys. >> Till the additional vma bit lands in include/linux/mm.h >> we have to define it in powerpc specific header file. >> This is needed to get pkeys working on power. >> >> Signed-off-by: Ram Pai >> --- >> arch/powerpc/include/asm/pkeys.h | 18 ++ >> 1 files changed, 18 insertions(+), 0 deletions(-) >> >> diff --git a/arch/powerpc/include/asm/pkeys.h >> b/arch/powerpc/include/asm/pkeys.h >> index c02305a..44e01a2 100644 >> --- a/arch/powerpc/include/asm/pkeys.h >> +++ b/arch/powerpc/include/asm/pkeys.h >> @@ -3,6 +3,24 @@ >> >> extern bool pkey_inited; >> extern bool pkey_execute_disable_support; >> + >> +/* >> + * powerpc needs an additional vma bit to support 32 keys. >> + * Till the additional vma bit lands in include/linux/mm.h >> + * we have to carry the hunk below. This is needed to get >> + * pkeys working on power. -- Ram >> + */ >> +#ifndef VM_HIGH_ARCH_BIT_4 >> +#define VM_HIGH_ARCH_BIT_4 36 >> +#define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4) >> +#define VM_PKEY_SHIFT VM_HIGH_ARCH_BIT_0 >> +#define VM_PKEY_BIT0VM_HIGH_ARCH_0 >> +#define VM_PKEY_BIT1VM_HIGH_ARCH_1 >> +#define VM_PKEY_BIT2VM_HIGH_ARCH_2 >> +#define VM_PKEY_BIT3VM_HIGH_ARCH_3 >> +#define VM_PKEY_BIT4VM_HIGH_ARCH_4 >> +#endif >> + >> #define ARCH_VM_PKEY_FLAGS 0 > > Do we want them in pkeys.h ? Even if they are arch specific for the > existing ones we have them in include/linux/mm.h. IIUC, vmflags details > are always in mm.h? This will be the first exception to that? 
Also can we move that #if defined (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) # define VM_PKEY_SHIFT VM_HIGH_ARCH_BIT_0 # define VM_PKEY_BIT0 VM_HIGH_ARCH_0 /* A protection key is a 4-bit value */ # define VM_PKEY_BIT1 VM_HIGH_ARCH_1 # define VM_PKEY_BIT2 VM_HIGH_ARCH_2 # define VM_PKEY_BIT3 VM_HIGH_ARCH_3 #endif to #if defined (CONFIG_ARCH_HAS_PKEYS) # define VM_PKEY_SHIFT VM_HIGH_ARCH_BIT_0 # define VM_PKEY_BIT0 VM_HIGH_ARCH_0 /* A protection key is a 4-bit value */ # define VM_PKEY_BIT1 VM_HIGH_ARCH_1 # define VM_PKEY_BIT2 VM_HIGH_ARCH_2 # define VM_PKEY_BIT3 VM_HIGH_ARCH_3 #endif And then later update the generic to handle PKEY_BIT4? -aneesh
Re: [PATCH 4/7] powerpc: Free up four 64K PTE bits in 64K backed HPTE pages
Benjamin Herrenschmidt writes:

> On Fri, 2017-09-08 at 15:44 -0700, Ram Pai wrote:
>> The second part of the PTE will hold
>> (H_PAGE_F_SECOND|H_PAGE_F_GIX) at bit 60,61,62,63.
>> NOTE: None of the bits in the secondary PTE were not used
>> by 64k-HPTE backed PTE.
>
> Have you measured the performance impact of this ? The second part of
> the PTE being in a different cache line there could be one...

I am also looking at a patch series removing the slot tracking
completely. With address randomization turned off, no swap in
guest/host, and making sure we touched most of guest ram, I don't find
much impact on performance when we don't track the slot at all. I will
post the patch series with numbers in a day or two. But my test was:

while (5000) {
	mmap(128M)
	touch every page of 2048 pages
	munmap()
}

This could also be the best case in my run, because I might have always
found the hash pte slot in the primary. In one measurement with swap on
and address randomization enabled, I did find a 50% impact. But then I
was not able to recreate that again, so it could be something I did
wrong in the test setup.

Ram,
Will you be able to get a test run with the above loop?

-aneesh
Re: [PATCH 1/4] powerpc/tm: Add commandline option to disable hardware transactional memory
On Mon, 2017-10-23 at 09:01 +, David Laight wrote: > From: Michael Neuling > > Sent: 21 October 2017 02:00 > > To: David Laight; 'Breno Leitao'; Michael Ellerman > > Cc: stew...@linux.vnet.ibm.com; linuxppc-...@ozlabs.org; cyril...@gmail.com > > Subject: Re: [PATCH 1/4] powerpc/tm: Add commandline option to disable > > hardware transactional memory > > > > On Fri, 2017-10-20 at 12:58 +, David Laight wrote: > > > > > This patch adds a simple commandline option so that HTM can be > > > > > disabled at boot time. > > > > > > ISTM that being able to disable it after boot would be more useful. > > > (ie in a startup script) > > > > I agree bug unfortunately that's impossible. > > > > If a process is already running in tm suspend, there is no way to stop it > > other > > than killing the process. At that point you may as well kexec with a new > > cmdline option > > Isn't that unlikely when the first rc scripts are being run? > Setting an early rc script is generally easier than adding a command line > parameter. Unlikely or not, it's a case we'd have to handle that would add significant complexity compared to what's being proposed for (IMHO) little gain. Also, this doesn't exclude that option later if we decided we could do it. > I don't know about ppc, but grub on x86 makes it painfully almost > impossible to have two boot entries that have different command line > options. Ok. Mikey
RE: [PATCH 1/4] powerpc/tm: Add commandline option to disable hardware transactional memory
From: Michael Neuling > Sent: 21 October 2017 02:00 > To: David Laight; 'Breno Leitao'; Michael Ellerman > Cc: stew...@linux.vnet.ibm.com; linuxppc-...@ozlabs.org; cyril...@gmail.com > Subject: Re: [PATCH 1/4] powerpc/tm: Add commandline option to disable > hardware transactional memory > > On Fri, 2017-10-20 at 12:58 +, David Laight wrote: > > > > This patch adds a simple commandline option so that HTM can be > > > > disabled at boot time. > > > > ISTM that being able to disable it after boot would be more useful. > > (ie in a startup script) > > I agree bug unfortunately that's impossible. > > If a process is already running in tm suspend, there is no way to stop it > other > than killing the process. At that point you may as well kexec with a new > cmdline option Isn't that unlikely when the first rc scripts are being run? Setting an early rc script is generally easier than adding a command line parameter. I don't know about ppc, but grub on x86 makes it painfully almost impossible to have two boot entries that have different command line options. David
Re: [PATCHv3 6/7] symbol lookup: use new kernel and module dereference functions
On (10/20/17 15:08), Petr Mladek wrote:
> On Thu 2017-10-19 15:42:35, Sergey Senozhatsky wrote:
> > Sorry for the delay and thanks for taking a look.
> >
> > I'll try to re-spin the patch set by the end of this week/early next
> > week.
> >
> > On (10/04/17 13:53), Petr Mladek wrote:
> > [..]
> > > Note that kallsyms_lookup() and module_address_lookup() is used
> > > in many other situations.
> >
> > we dereference only things that can be dereferenced.
> > so calling it on an already dereferenced address, or an address
> > that does not need to be dereferenced, is OK.
>
> My concern is that it changes the behavior. It will suddenly return
> different information for addresses that were not dereferenced before.

OK. I'd be really-really surprised to find out that anyone did
kallsyms_lookup()/module_address_lookup() on func descriptors, but I
understand your concerns. I'll try to keep everything within vsprintf().

-ss
Re: [PATCH v2] kernel/module_64.c: Add REL24 relocation support of livepatch symbols
On Saturday 21 October 2017 06:29 AM, Balbir Singh wrote: On Fri, 2017-10-20 at 14:07 +0200, Torsten Duwe wrote: On Wed, Oct 18, 2017 at 11:47:35AM +0530, Kamalesh Babulal wrote: Consider a trivial patch, supplied to kpatch tool for generating a livepatch module: --- a/fs/proc/meminfo.c +++ b/fs/proc/meminfo.c @@ -132,7 +132,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v) seq_printf(m, "VmallocTotal: %8lu kB\n", (unsigned long)VMALLOC_TOTAL >> 10); show_val_kb(m, "VmallocUsed:", 0ul); - show_val_kb(m, "VmallocChunk: ", 0ul); + show_val_kb(m, "VMALLOCCHUNK: ", 0ul); Am I assuming correctly that "kpatch tool" simply recompiles all code the way it would get compiled in a regular kernel build? kpatch is open source and is available on github. This patch is specific to the way kpatch works My understanding is that live patching modules need to be carefully prepared, which involves source code reorganisation and recompilation. In that process, you can easily declare show_val_kb() extern, and get the suitable instruction sequence for the call. Yes, we agree. For the current versions of kpatch, which involve a process of applying the patch and building the kernel without and with the patch and doing an elf diff (programatically), we cannot deviate from that process as it's architecture independent. This patch solves arch specific issues related to that process. Yes, that's the essence of the kpatch tool on building livepatchable kernel module, by doing an elf diff on the kernel with and without the patch applied. show_val_kb() is a simple example, consider more complex patch(s), if they need to be prepared manually as suggested. It beats the whole purpose of a kpatch tool, which programmatically prepares a livepatch module with close to no manual preparation required. It's the architecture limitation, which is addressed in this patch. This patch is outcome of long discussion at kpatch https://github.com/dynup/kpatch/pull/650 -- cheers, Kamalesh.
[PATCH kernel] powerpc/powernv/ioda: Remove unnecessary pcidev from pci_dn
The pcidev value stored in pci_dn is only used for NPU/NPU2 initialization. We can easily drop the cached pointer and use an ancient helper - pci_get_domain_bus_and_slot() instead in order to reduce complexity. Signed-off-by: Alexey Kardashevskiy--- arch/powerpc/include/asm/pci-bridge.h | 2 -- arch/powerpc/platforms/powernv/npu-dma.c | 5 - arch/powerpc/platforms/powernv/pci-ioda.c | 3 --- 3 files changed, 4 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h index 0b8aa1fe2d5f..752718a2949d 100644 --- a/arch/powerpc/include/asm/pci-bridge.h +++ b/arch/powerpc/include/asm/pci-bridge.h @@ -197,8 +197,6 @@ struct pci_dn { struct iommu_table_group *table_group; /* for phb's or bridges */ int pci_ext_config_space; /* for pci devices */ - - struct pci_dev *pcidev;/* back-pointer to the pci device */ #ifdef CONFIG_EEH struct eeh_dev *edev; /* eeh device */ #endif diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c index 2cb6cbea4b3b..b5844a950852 100644 --- a/arch/powerpc/platforms/powernv/npu-dma.c +++ b/arch/powerpc/platforms/powernv/npu-dma.c @@ -39,7 +39,10 @@ */ static struct pci_dev *get_pci_dev(struct device_node *dn) { - return PCI_DN(dn)->pcidev; + struct pci_dn *pdn = PCI_DN(dn); + + return pci_get_domain_bus_and_slot(pci_domain_nr(pdn->phb->bus), + pdn->busno, pdn->devfn); } /* Given a NPU device get the associated PCI device. 
*/ diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index bc60faa9f6c2..269f119e4b3c 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1047,7 +1047,6 @@ static struct pnv_ioda_pe *pnv_ioda_setup_dev_PE(struct pci_dev *dev) * At some point we want to remove the PDN completely anyways */ pci_dev_get(dev); - pdn->pcidev = dev; pdn->pe_number = pe->pe_number; pe->flags = PNV_IODA_PE_DEV; pe->pdev = dev; @@ -1094,7 +1093,6 @@ static void pnv_ioda_setup_same_PE(struct pci_bus *bus, struct pnv_ioda_pe *pe) continue; pe->device_count++; - pdn->pcidev = dev; pdn->pe_number = pe->pe_number; if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate) pnv_ioda_setup_same_PE(dev->subordinate, pe); @@ -1209,7 +1207,6 @@ static struct pnv_ioda_pe *pnv_ioda_setup_npu_PE(struct pci_dev *npu_pdev) pci_dev_get(npu_pdev); npu_pdn = pci_get_pdn(npu_pdev); rid = npu_pdev->bus->number << 8 | npu_pdn->devfn; - npu_pdn->pcidev = npu_pdev; npu_pdn->pe_number = pe_num; phb->ioda.pe_rmap[rid] = pe->pe_number; -- 2.11.0
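As an aside, the `rid` computed in pnv_ioda_setup_npu_PE above is the standard PCI requester-ID packing, the same bus/devfn pair that the replacement pci_get_domain_bus_and_slot() lookup keys on. A minimal sketch (helper names here are illustrative, not kernel API):

```c
#include <stdint.h>

/*
 * PCI requester ID: 8-bit bus number in the high byte, devfn in the
 * low byte.  devfn itself packs a 5-bit device (slot) number and a
 * 3-bit function number.  Mirrors "rid = bus->number << 8 | devfn"
 * in the patch above.
 */
static uint8_t pci_devfn(uint8_t slot, uint8_t func)
{
	return (uint8_t)(((slot & 0x1f) << 3) | (func & 0x7));
}

static uint16_t pci_rid(uint8_t bus, uint8_t devfn)
{
	return (uint16_t)((uint16_t)bus << 8 | devfn);
}
```

So device 3 function 1 on bus 0x40 yields devfn 0x19 and RID 0x4019, which together with the domain number uniquely identifies the device for the lookup.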
[PATCH v2 3/3] powerpc/powernv: Avoid waiting for secondary hold spinloop with OPAL
OPAL boot does not insert secondaries at 0x60 to wait at the secondary hold spinloop. Instead they are started later, and inserted at generic_secondary_smp_init(), which is after the secondary hold spinloop. Avoid waiting on this spinloop when booting with OPAL firmware. This wait always times out in that case. This saves 100ms boot time on powernv, and 10s of seconds of real time when booting on the simulator in SMP. Signed-off-by: Nicholas Piggin--- arch/powerpc/kernel/head_64.S | 16 +++- arch/powerpc/kernel/setup_64.c | 10 +- 2 files changed, 20 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/kernel/head_64.S b/arch/powerpc/kernel/head_64.S index c9e760ec7530..0deef350004f 100644 --- a/arch/powerpc/kernel/head_64.S +++ b/arch/powerpc/kernel/head_64.S @@ -55,12 +55,18 @@ * * For pSeries or server processors: * 1. The MMU is off & open firmware is running in real mode. - * 2. The kernel is entered at __start + * 2. The primary CPU enters at __start. + * 3. If the RTAS supports "query-cpu-stopped-state", then secondary + * CPUs will enter as directed by "start-cpu" RTAS call, which is + * generic_secondary_smp_init, with PIR in r3. + * 4. Else the secondary CPUs will enter at secondary_hold (0x60) as + * directed by the "start-cpu" RTAS call, with PIR in r3. * -or- For OPAL entry: - * 1. The MMU is off, processor in HV mode, primary CPU enters at 0 - * with device-tree in gpr3. We also get OPAL base in r8 and - * entry in r9 for debugging purposes - * 2. Secondary processors enter at 0x60 with PIR in gpr3 + * 1. The MMU is off, processor in HV mode. + * 2. The primary CPU enters at 0 with device-tree in r3, OPAL base + * in r8, and entry in r9 for debugging purposes. + * 3. Secondary CPUs enter as directed by OPAL_START_CPU call, which + * is at generic_secondary_smp_init, with PIR in r3. * * For Book3E processors: * 1. 
The MMU is on running in AS0 in a state defined in ePAPR diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c index 3f2453858f60..ecaab70e4b78 100644 --- a/arch/powerpc/kernel/setup_64.c +++ b/arch/powerpc/kernel/setup_64.c @@ -363,8 +363,16 @@ void early_setup_secondary(void) #if defined(CONFIG_SMP) || defined(CONFIG_KEXEC_CORE) static bool use_spinloop(void) { - if (!IS_ENABLED(CONFIG_PPC_BOOK3E)) + if (IS_ENABLED(CONFIG_PPC_BOOK3S)) { + /* +* See comments in head_64.S -- not all platforms insert +* secondaries at __secondary_hold and wait at the spin +* loop. +*/ + if (firmware_has_feature(FW_FEATURE_OPAL)) + return false; return true; + } /* * When book3e boots from kexec, the ePAPR spin table does -- 2.13.3
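The decision in use_spinloop() after this patch reduces to a small pure predicate; a sketch of the logic (the boolean arguments stand in for IS_ENABLED(CONFIG_PPC_BOOK3S), firmware_has_feature(FW_FEATURE_OPAL), and the Book3E ePAPR spin-table check, respectively):

```c
#include <stdbool.h>

/*
 * Book3S waits at the secondary-hold spinloop unless booted under
 * OPAL, whose secondaries are started later and enter directly at
 * generic_secondary_smp_init.  Book3E only needs the spinloop when
 * booted via an ePAPR spin table (the kexec case in the original
 * function).
 */
static bool use_spinloop(bool book3s, bool fw_opal, bool epapr_spin_table)
{
	if (book3s)
		return !fw_opal;
	return epapr_spin_table;
}
```

The OPAL case returning false is exactly what removes the always-timing-out wait described in the commit message.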
[PATCH v2 2/3] powerpc: use NMI IPI for smp_send_stop
Use the NMI IPI rather than smp_call_function for smp_send_stop. Have stopped CPUs hard disable interrupts rather than just soft disable. This function is used in crash/panic/shutdown paths to bring other CPUs down as quickly and reliably as possible, minimizing their potential to cause trouble. Avoiding the Linux smp_call_function infrastructure and (if supported) using true NMI IPIs makes this more robust. Also use spin loop primitives in the stop callback, mainly to help the processing speed of the active thread in the simulator. Signed-off-by: Nicholas Piggin--- arch/powerpc/kernel/smp.c | 9 + 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index e0a4c1f82e25..ce891030d925 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -547,19 +547,20 @@ void crash_send_ipi(void (*crash_ipi_callback)(struct pt_regs *)) } #endif -static void stop_this_cpu(void *dummy) +static void __noreturn stop_this_cpu(struct pt_regs *regs) { /* Remove this CPU */ set_cpu_online(smp_processor_id(), false); - local_irq_disable(); + hard_irq_disable(); + spin_begin(); while (1) - ; + spin_cpu_relax(); } void smp_send_stop(void) { - smp_call_function(stop_this_cpu, NULL, 0); + smp_send_nmi_ipi(NMI_IPI_ALL_OTHERS, stop_this_cpu, 100); } struct thread_info *current_set[NR_CPUS]; -- 2.13.3
[PATCH v2 1/3] powerpc/powernv: Always stop secondaries before reboot/shutdown
Currently powernv reboot and shutdown requests just leave secondaries to do their own things. This is undesirable because they can trigger any number of watchdogs while waiting for reboot, but also we don't know what else they might be doing, or they might be stuck somewhere causing trouble. The opal scheduled flash update code already ran into watchdog problems due to flashing taking a long time, but it's possible for regular reboots to trigger problems too (this is with watchdog_thresh set to 1, but I have seen it with watchdog_thresh at the default value once too): reboot: Restarting system [ 360.038896709,5] OPAL: Reboot request... Watchdog CPU:0 Hard LOCKUP Watchdog CPU:44 detected Hard LOCKUP other CPUS:16 Watchdog CPU:16 Hard LOCKUP watchdog: BUG: soft lockup - CPU#16 stuck for 3s! [swapper/16:0] So remove the special case for flash update, and unconditionally do smp_send_stop before rebooting. Return the CPUs to Linux stop loops rather than OPAL. The reason for this is that the path to firmware is longer, and the CPUs may have been interrupted from firmware, which may cause problems to re-enter it. It's better to put them into a simple spin loop to maximize the chance of a successful reboot. 
Signed-off-by: Nicholas Piggin--- arch/powerpc/include/asm/opal.h | 2 +- arch/powerpc/platforms/powernv/opal-flash.c | 28 +--- arch/powerpc/platforms/powernv/setup.c | 15 +-- 3 files changed, 7 insertions(+), 38 deletions(-) diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h index 04c32b08ffa1..ce58f4139ff5 100644 --- a/arch/powerpc/include/asm/opal.h +++ b/arch/powerpc/include/asm/opal.h @@ -317,7 +317,7 @@ struct rtc_time; extern unsigned long opal_get_boot_time(void); extern void opal_nvram_init(void); extern void opal_flash_update_init(void); -extern void opal_flash_term_callback(void); +extern void opal_flash_update_print_message(void); extern int opal_elog_init(void); extern void opal_platform_dump_init(void); extern void opal_sys_param_init(void); diff --git a/arch/powerpc/platforms/powernv/opal-flash.c b/arch/powerpc/platforms/powernv/opal-flash.c index 2fa3ac80cb4e..632871d78576 100644 --- a/arch/powerpc/platforms/powernv/opal-flash.c +++ b/arch/powerpc/platforms/powernv/opal-flash.c @@ -303,26 +303,9 @@ static int opal_flash_update(int op) return rc; } -/* Return CPUs to OPAL before starting FW update */ -static void flash_return_cpu(void *info) -{ - int cpu = smp_processor_id(); - - if (!cpu_online(cpu)) - return; - - /* Disable IRQ */ - hard_irq_disable(); - - /* Return the CPU to OPAL */ - opal_return_cpu(); -} - /* This gets called just before system reboots */ -void opal_flash_term_callback(void) +void opal_flash_update_print_message(void) { - struct cpumask mask; - if (update_flash_data.status != FLASH_IMG_READY) return; @@ -333,15 +316,6 @@ void opal_flash_term_callback(void) /* Small delay to help getting the above message out */ msleep(500); - - /* Return secondary CPUs to firmware */ - cpumask_copy(, cpu_online_mask); - cpumask_clear_cpu(smp_processor_id(), ); - if (!cpumask_empty()) - smp_call_function_many(, - flash_return_cpu, NULL, false); - /* Hard disable interrupts */ - hard_irq_disable(); } /* diff --git 
a/arch/powerpc/platforms/powernv/setup.c b/arch/powerpc/platforms/powernv/setup.c index b23d8beb0f1e..ff9422776a81 100644 --- a/arch/powerpc/platforms/powernv/setup.c +++ b/arch/powerpc/platforms/powernv/setup.c @@ -112,17 +112,12 @@ static void pnv_prepare_going_down(void) */ opal_event_shutdown(); - /* Soft disable interrupts */ - local_irq_disable(); + /* Print flash update message if one is scheduled. */ + opal_flash_update_print_message(); - /* -* Return secondary CPUs to firwmare if a flash update -* is pending otherwise we will get all sort of error -* messages about CPU being stuck etc.. This will also -* have the side effect of hard disabling interrupts so -* past this point, the kernel is effectively dead. -*/ - opal_flash_term_callback(); + smp_send_stop(); + + hard_irq_disable(); } static void __noreturn pnv_restart(char *cmd) -- 2.13.3
[PATCH v2 0/3] avoid secondary hold spinloop when possible
I've made this series only avoid the secondary spinloop on powernv. pSeries is a little more complicated: in some kexec cases the secondaries will be at 0x60. I haven't had time to get all that sorted out for this merge window, so OPAL-only this time. Nicholas Piggin (3): powerpc/powernv: Always stop secondaries before reboot/shutdown powerpc: use NMI IPI for smp_send_stop powerpc/powernv: Avoid waiting for secondary hold spinloop with OPAL arch/powerpc/include/asm/opal.h | 2 +- arch/powerpc/kernel/head_64.S | 16 +++- arch/powerpc/kernel/setup_64.c | 10 +- arch/powerpc/kernel/smp.c | 9 + arch/powerpc/platforms/powernv/opal-flash.c | 28 +--- arch/powerpc/platforms/powernv/setup.c | 15 +-- 6 files changed, 32 insertions(+), 48 deletions(-) -- 2.13.3
[PATCH 3/3] powerpc/powernv: Use FIXUP_ENDIAN_HV in OPAL return
Close the recoverability gap for OPAL calls by using FIXUP_ENDIAN_HV in the return path. Signed-off-by: Nicholas Piggin--- arch/powerpc/platforms/powernv/opal-wrappers.S | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S index 37cd170201a2..6f4b00a2ac46 100644 --- a/arch/powerpc/platforms/powernv/opal-wrappers.S +++ b/arch/powerpc/platforms/powernv/opal-wrappers.S @@ -94,7 +94,7 @@ opal_return: * bytes (always BE) since MSR:LE will end up fixed up as a side * effect of the rfid. */ - FIXUP_ENDIAN + FIXUP_ENDIAN_HV ld r2,PACATOC(r13); lwz r4,8(r1); ld r5,PPC_LR_STKOFF(r1); @@ -120,7 +120,7 @@ opal_real_call: hrfid opal_return_realmode: - FIXUP_ENDIAN + FIXUP_ENDIAN_HV ld r2,PACATOC(r13); lwz r11,8(r1); ld r12,PPC_LR_STKOFF(r1) -- 2.13.3
[PATCH 2/3] powerpc/book3s: Add an HV variant of FIXUP_ENDIAN that is recoverable
Add an HV variant of FIXUP_ENDIAN which uses HSRR[01] and does not clear MSR[RI], which improves recoverability. Signed-off-by: Nicholas Piggin--- arch/powerpc/include/asm/ppc_asm.h | 22 ++ 1 file changed, 22 insertions(+) diff --git a/arch/powerpc/include/asm/ppc_asm.h b/arch/powerpc/include/asm/ppc_asm.h index d6b56aebe602..ae94b3626b6c 100644 --- a/arch/powerpc/include/asm/ppc_asm.h +++ b/arch/powerpc/include/asm/ppc_asm.h @@ -774,6 +774,10 @@ END_FTR_SECTION_IFCLR(CPU_FTR_601) #ifdef CONFIG_PPC_BOOK3E #define FIXUP_ENDIAN #else +/* + * This version may be used in HV or non-HV context. + * MSR[EE] must be disabled. + */ #define FIXUP_ENDIAN \ tdi 0,0,0x48; /* Reverse endian of b . + 8 */ \ b 191f; /* Skip trampoline if endian is good */ \ @@ -789,6 +793,24 @@ END_FTR_SECTION_IFCLR(CPU_FTR_601) .long 0x2400004c; /* rfid */ \ 191: +/* + * This version may only be used with MSR[HV]=1 + * - Does not clear MSR[RI], so more robust. + * - Slightly smaller and faster. + */ +#define FIXUP_ENDIAN_HV \ + tdi 0,0,0x48; /* Reverse endian of b . + 8 */ \ + b 191f; /* Skip trampoline if endian is good */ \ + .long 0xa600607d; /* mfmsr r11 */ \ + .long 0x01006b69; /* xori r11,r11,1 */ \ + .long 0x05009f42; /* bcl 20,31,$+4 */ \ + .long 0xa602487d; /* mflr r10 */ \ + .long 0x14004a39; /* addi r10,r10,20 */ \ + .long 0xa64b5a7d; /* mthsrr0 r10 */ \ + .long 0xa64b7b7d; /* mthsrr1 r11 */ \ + .long 0x2402004c; /* hrfid */ \ +191: + #endif /* !CONFIG_PPC_BOOK3E */ #endif /* __ASSEMBLY__ */ -- 2.13.3
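A side note on the hand-encoded constants in these trampolines: each `.long` is the byte-reverse of the intended instruction, so it decodes correctly only when the CPU is running in the opposite endian to the one the kernel was built for. A quick way to sanity-check a constant is to byte-swap it and compare against the normal Power ISA encoding (the helper name below is illustrative; the expected values are the ISA encodings, e.g. hrfid is 0x4c000224):

```c
#include <stdint.h>

/* Reverse the byte order of a 32-bit word, recovering the normal
 * (same-endian) instruction encoding from a trampoline constant. */
static uint32_t rev32(uint32_t w)
{
	return (w >> 24) | ((w >> 8) & 0x0000ff00) |
	       ((w << 8) & 0x00ff0000) | (w << 24);
}
```

For example, byte-swapping the trampoline word 0xa600607d yields 0x7d6000a6, the encoding of mfmsr r11, and 0x2402004c yields 0x4c000224, hrfid — matching the comments in the macro.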
[PATCH 1/3] powerpc/book3s: use label for FIXUP_ENDIAN macro branch
Signed-off-by: Nicholas Piggin--- arch/powerpc/include/asm/ppc_asm.h | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/ppc_asm.h b/arch/powerpc/include/asm/ppc_asm.h index 36f3e41c9fbe..d6b56aebe602 100644 --- a/arch/powerpc/include/asm/ppc_asm.h +++ b/arch/powerpc/include/asm/ppc_asm.h @@ -776,7 +776,7 @@ END_FTR_SECTION_IFCLR(CPU_FTR_601) #else #define FIXUP_ENDIAN \ tdi 0,0,0x48; /* Reverse endian of b . + 8 */ \ - b $+44; /* Skip trampoline if endian is good */ \ + b 191f; /* Skip trampoline if endian is good */ \ .long 0xa600607d; /* mfmsr r11 */ \ .long 0x01006b69; /* xori r11,r11,1 */ \ .long 0x00004039; /* li r10,0 */ \ @@ -786,7 +786,8 @@ END_FTR_SECTION_IFCLR(CPU_FTR_601) .long 0x14004a39; /* addi r10,r10,20 */ \ .long 0xa6035a7d; /* mtsrr0 r10 */ \ .long 0xa6037b7d; /* mtsrr1 r11 */ \ - .long 0x2400004c /* rfid */ + .long 0x2400004c; /* rfid */ \ +191: #endif /* !CONFIG_PPC_BOOK3E */ -- 2.13.3
[PATCH 0/3] improve recoverability of OPAL calls
Here's a few patches to improve recoverability for FIXUP_ENDIAN on powernv. We should try to minimise SRR[01] (and MSR[RI]=0) usage as much as possible. Whether that's by using HSRR or mtmsrd, it usually results in faster and smaller code too. There's a few other places we can improve, but I've had these patches around for a while. Thanks, Nick Nicholas Piggin (3): powerpc/book3s: use label for FIXUP_ENDIAN macro branch powerpc/book3s: Add an HV variant of FIXUP_ENDIAN that is recoverable powerpc/powernv: Use FIXUP_ENDIAN_HV in OPAL return arch/powerpc/include/asm/ppc_asm.h | 27 -- arch/powerpc/platforms/powernv/opal-wrappers.S | 4 ++-- 2 files changed, 27 insertions(+), 4 deletions(-) -- 2.13.3