Re: [PATCH] mm: Define ARCH_HAS_FIRST_USER_ADDRESS

2021-04-14 Thread Anshuman Khandual
On 4/14/21 11:40 AM, Christophe Leroy wrote:
> 
> 
> On 14/04/2021 at 07:59, Anshuman Khandual wrote:
>>
>>
>> On 4/14/21 10:52 AM, Christophe Leroy wrote:
>>>
>>>
>>> On 14/04/2021 at 04:54, Anshuman Khandual wrote:
 Currently most platforms define FIRST_USER_ADDRESS as 0UL duplicating the
 same code all over. Instead define a new option ARCH_HAS_FIRST_USER_ADDRESS
 for those platforms which would override generic default FIRST_USER_ADDRESS
 value 0UL. This makes it much cleaner with reduced code.

 Cc: linux-al...@vger.kernel.org
 Cc: linux-snps-...@lists.infradead.org
 Cc: linux-arm-ker...@lists.infradead.org
 Cc: linux-c...@vger.kernel.org
 Cc: linux-hexa...@vger.kernel.org
 Cc: linux-i...@vger.kernel.org
 Cc: linux-m...@lists.linux-m68k.org
 Cc: linux-m...@vger.kernel.org
 Cc: openr...@lists.librecores.org
 Cc: linux-par...@vger.kernel.org
 Cc: linuxppc-dev@lists.ozlabs.org
 Cc: linux-ri...@lists.infradead.org
 Cc: linux-s...@vger.kernel.org
 Cc: linux...@vger.kernel.org
 Cc: sparcli...@vger.kernel.org
 Cc: linux...@lists.infradead.org
 Cc: linux-xte...@linux-xtensa.org
 Cc: x...@kernel.org
 Cc: linux...@kvack.org
 Cc: linux-ker...@vger.kernel.org
 Signed-off-by: Anshuman Khandual 
 ---
    arch/alpha/include/asm/pgtable.h | 1 -
    arch/arc/include/asm/pgtable.h   | 6 --
    arch/arm/Kconfig | 1 +
    arch/arm64/include/asm/pgtable.h | 2 --
    arch/csky/include/asm/pgtable.h  | 1 -
    arch/hexagon/include/asm/pgtable.h   | 3 ---
    arch/ia64/include/asm/pgtable.h  | 1 -
    arch/m68k/include/asm/pgtable_mm.h   | 1 -
    arch/microblaze/include/asm/pgtable.h    | 2 --
    arch/mips/include/asm/pgtable-32.h   | 1 -
    arch/mips/include/asm/pgtable-64.h   | 1 -
    arch/nds32/Kconfig   | 1 +
    arch/nios2/include/asm/pgtable.h | 2 --
    arch/openrisc/include/asm/pgtable.h  | 1 -
    arch/parisc/include/asm/pgtable.h    | 2 --
    arch/powerpc/include/asm/book3s/pgtable.h    | 1 -
    arch/powerpc/include/asm/nohash/32/pgtable.h | 1 -
    arch/powerpc/include/asm/nohash/64/pgtable.h | 2 --
    arch/riscv/include/asm/pgtable.h | 2 --
    arch/s390/include/asm/pgtable.h  | 2 --
    arch/sh/include/asm/pgtable.h    | 2 --
    arch/sparc/include/asm/pgtable_32.h  | 1 -
    arch/sparc/include/asm/pgtable_64.h  | 3 ---
    arch/um/include/asm/pgtable-2level.h | 1 -
    arch/um/include/asm/pgtable-3level.h | 1 -
    arch/x86/include/asm/pgtable_types.h | 2 --
    arch/xtensa/include/asm/pgtable.h    | 1 -
    include/linux/mm.h   | 4 ++++
    mm/Kconfig   | 4 ++++
    29 files changed, 10 insertions(+), 43 deletions(-)

 diff --git a/include/linux/mm.h b/include/linux/mm.h
 index 8ba434287387..47098ccd715e 100644
 --- a/include/linux/mm.h
 +++ b/include/linux/mm.h
 @@ -46,6 +46,10 @@ extern int sysctl_page_lock_unfairness;
      void init_mm_internals(void);
    +#ifndef ARCH_HAS_FIRST_USER_ADDRESS
>>>
>>> I guess you didn't test it. :)
>>
>> In fact I did :) Though I just booted it on arm64 and cross-compiled on
>> multiple other platforms.

I guess for all platforms, ARCH_HAS_FIRST_USER_ADDRESS would have just
evaluated to false, hence falling back on the generic definition. So
this never complained during build anywhere or during boot on arm64.

>>
>>>
>>> should be #ifndef CONFIG_ARCH_HAS_FIRST_USER_ADDRESS
>>
>> Right, meant that instead.
>>
>>>
 +#define FIRST_USER_ADDRESS    0UL
 +#endif
>>>
>>> But why do we need a config option at all for that ?
>>>
>>> Why not just:
>>>
>>> #ifndef FIRST_USER_ADDRESS
>>> #define FIRST_USER_ADDRESS    0UL
>>> #endif
>>
>> This sounds simpler. But just wondering, would there not be any possibility
>> of build problems due to the inclusion order between arch and generic code?
>>
> 
> For sure it has to be addressed carefully, but there is already a lot of
> stuff like that around pgtable.h
> 
> For instance, pte_offset_kernel() has a generic definition in
> linux/pgtable.h based on whether it is already defined or not.
> 
> Taking into account that FIRST_USER_ADDRESS is today in the architectures'
> asm/pgtable.h, I think putting the fallback definition in linux/pgtable.h
> would do the trick.

Agreed, <linux/pgtable.h> includes <asm/pgtable.h> at the beginning, and
if the arch defines FIRST_USER_ADDRESS, the generic one afterwards would
be skipped. The following change builds on multiple platforms.

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index ad086e6d7155..5da96f5df48f 100644
--- 
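
A minimal standalone sketch of the fallback pattern under discussion (a
hypothetical single file, not the posted patch): the arch header is seen
first, so the generic default only applies when the arch did not already
define FIRST_USER_ADDRESS.

/* pgtable_fallback_demo.c - hypothetical demo, not kernel code */
#include <stdio.h>

/* pretend this line came from the arch's <asm/pgtable.h> */
#define FIRST_USER_ADDRESS	0x1000UL

/* generic fallback, as it would sit in <linux/pgtable.h> */
#ifndef FIRST_USER_ADDRESS
#define FIRST_USER_ADDRESS	0UL
#endif

int main(void)
{
	printf("FIRST_USER_ADDRESS = %#lx\n", FIRST_USER_ADDRESS);
	return 0;
}

Dropping the arch #define above makes the fallback kick in and the
program print 0x0 instead.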

[powerpc:next] BUILD SUCCESS 7098f8f0cf0387443fd8702f24a8a2521d5133f3

2021-04-14 Thread kernel test robot
allyesconfig
xtensa   allyesconfig
h8300allyesconfig
arc defconfig
sh   allmodconfig
s390 allyesconfig
s390defconfig
parisc  defconfig
s390 allmodconfig
parisc   allyesconfig
sparcallyesconfig
sparc   defconfig
i386defconfig
mips allyesconfig
mips allmodconfig
powerpc  allyesconfig
powerpc  allmodconfig
powerpc   allnoconfig
i386 randconfig-a003-20210414
i386 randconfig-a006-20210414
i386 randconfig-a004-20210414
i386 randconfig-a001-20210414
i386 randconfig-a005-20210414
i386 randconfig-a002-20210414
x86_64   randconfig-a014-20210414
x86_64   randconfig-a015-20210414
x86_64   randconfig-a011-20210414
x86_64   randconfig-a013-20210414
x86_64   randconfig-a012-20210414
x86_64   randconfig-a016-20210414
i386 randconfig-a015-20210414
i386 randconfig-a014-20210414
i386 randconfig-a013-20210414
i386 randconfig-a012-20210414
i386 randconfig-a016-20210414
i386 randconfig-a011-20210414
riscvnommu_k210_defconfig
riscvnommu_virt_defconfig
riscv allnoconfig
riscv   defconfig
um   allmodconfig
um   allyesconfig
um  defconfig
x86_64rhel-8.3-kselftests
x86_64   rhel-8.3
x86_64  rhel-8.3-kbuiltin
x86_64  kexec

clang tested configs:
x86_64   randconfig-a003-20210414
x86_64   randconfig-a002-20210414
x86_64   randconfig-a005-20210414
x86_64   randconfig-a001-20210414
x86_64   randconfig-a006-20210414
x86_64   randconfig-a004-20210414

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org


[powerpc:merge] BUILD SUCCESS 0702e74703f57173e70cfab2a79a3e682e9e96ec

2021-04-14 Thread kernel test robot
 allnoconfig
nds32   defconfig
nios2allyesconfig
cskydefconfig
alphaallyesconfig
xtensa   allyesconfig
h8300allyesconfig
arc defconfig
sh   allmodconfig
s390 allyesconfig
s390 allmodconfig
parisc   allyesconfig
s390defconfig
sparcallyesconfig
sparc   defconfig
i386defconfig
mips allyesconfig
mips allmodconfig
powerpc  allyesconfig
powerpc  allmodconfig
i386 randconfig-a003-20210414
i386 randconfig-a006-20210414
i386 randconfig-a001-20210414
i386 randconfig-a005-20210414
i386 randconfig-a004-20210414
i386 randconfig-a002-20210414
i386 randconfig-a003-20210412
i386 randconfig-a001-20210412
i386 randconfig-a006-20210412
i386 randconfig-a005-20210412
i386 randconfig-a004-20210412
i386 randconfig-a002-20210412
x86_64   randconfig-a014-20210414
x86_64   randconfig-a015-20210414
x86_64   randconfig-a011-20210414
x86_64   randconfig-a013-20210414
x86_64   randconfig-a012-20210414
x86_64   randconfig-a016-20210414
x86_64   randconfig-a014-20210412
x86_64   randconfig-a015-20210412
x86_64   randconfig-a011-20210412
x86_64   randconfig-a013-20210412
x86_64   randconfig-a012-20210412
x86_64   randconfig-a016-20210412
i386 randconfig-a015-20210414
i386 randconfig-a014-20210414
i386 randconfig-a013-20210414
i386 randconfig-a012-20210414
i386 randconfig-a016-20210414
i386 randconfig-a011-20210414
i386 randconfig-a015-20210412
i386 randconfig-a014-20210412
i386 randconfig-a013-20210412
i386 randconfig-a012-20210412
i386 randconfig-a016-20210412
i386 randconfig-a011-20210412
riscvnommu_k210_defconfig
riscvnommu_virt_defconfig
riscv allnoconfig
riscv   defconfig
riscv  rv32_defconfig
um   allmodconfig
um   allyesconfig
um  defconfig
x86_64rhel-8.3-kselftests
x86_64   rhel-8.3
x86_64  rhel-8.3-kbuiltin
x86_64  kexec

clang tested configs:
x86_64   randconfig-a003-20210414
x86_64   randconfig-a002-20210414
x86_64   randconfig-a005-20210414
x86_64   randconfig-a001-20210414
x86_64   randconfig-a006-20210414
x86_64   randconfig-a004-20210414
x86_64   randconfig-a003-20210412
x86_64   randconfig-a002-20210412
x86_64   randconfig-a001-20210412
x86_64   randconfig-a005-20210412
x86_64   randconfig-a006-20210412
x86_64   randconfig-a004-20210412

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org


Re: [PATCH v2 0/5] powerpc/rtas: miscellaneous cleanups

2021-04-14 Thread Nathan Lynch
Christophe Leroy  writes:

> On 08/04/2021 at 16:06, Nathan Lynch wrote:
>> This is a reroll of the series posted here:
>> https://lore.kernel.org/linuxppc-dev/20210114220004.1138993-1-nath...@linux.ibm.com/
>> 
>> Originally this work was prompted by failures on radix MMU PowerVM
>> guests when passing buffers to RTAS that lay outside of its idea of
>> the RMA. In v1 I approached this as a problem to be solved in Linux,
>> but RTAS development has since decided to change their code so that
>> the RMA restriction does not apply with radix.
>> 
>> So in v2 I retain the cleanups and discard the more significant change
>> which accommodated the misbehaving RTAS versions.
>
> Is there a link with https://github.com/linuxppc/issues/issues/252 ?
>

No, not really.


linux-next: build warning after merge of the powerpc tree

2021-04-14 Thread Stephen Rothwell
Hi all,

After merging the powerpc tree, today's linux-next build (powerpc
ppc64_defconfig) produced this warning:

drivers/macintosh/via-pmu.c:190:12: warning: '__fake_sleep' defined but not 
used [-Wunused-variable]
  190 | static int __fake_sleep;
  |^~~~

Introduced by commit

  95d143923379 ("macintosh/via-pmu: Make some symbols static")

-- 
Cheers,
Stephen Rothwell




RE: [PATCH 1/1] mm: Fix struct page layout on 32-bit systems

2021-04-14 Thread David Laight
From: Matthew Wilcox
> Sent: 14 April 2021 22:36
> 
> On Wed, Apr 14, 2021 at 09:13:22PM +0200, Jesper Dangaard Brouer wrote:
> > (If others want to reproduce).  First I could not reproduce on ARM32.
> > Then I found out that enabling CONFIG_XEN on ARCH=arm was needed to
> > cause the issue by enabling CONFIG_ARCH_DMA_ADDR_T_64BIT.
> 
> hmmm ... you should be able to provoke it by enabling ARM_LPAE,
> which selects PHYS_ADDR_T_64BIT, and
> 
> config ARCH_DMA_ADDR_T_64BIT
> def_bool 64BIT || PHYS_ADDR_T_64BIT
> 
> >  struct page {
> > long unsigned int  flags;/* 0 4 */
> >
> > /* XXX 4 bytes hole, try to pack */
> >
> > union {
> > struct {
> > struct list_head lru;/* 8 8 */
> > struct address_space * mapping;  /*16 4 */
> > long unsigned int index; /*20 4 */
> > long unsigned int private;   /*24 4 */
> > };   /* 820 */
> > struct {
> > dma_addr_t dma_addr

Adding __packed here will remove the 4 byte hole before the union
and the compiler seems clever enough to know that anything following
a 'long' must also be 'long' aligned.
So you don't get anything horrid like byte accesses.
On 64bit dma_addr will remain 64bit aligned.
On arm32 dma_addr will be 32bit aligned - but forcing two 32bit accesses
won't make any difference.

So definitely the only simple fix.

David

> >; /* 8 8 */
> > };   /* 8 8 */
> [...]
> > } __attribute__((__aligned__(8)));   /* 824 */
> > union {
> > atomic_t   _mapcount;/*32 4 */
> > unsigned int   page_type;/*32 4 */
> > unsigned int   active;   /*32 4 */
> > intunits;/*32 4 */
> > };   /*32 4 */
> > atomic_t   _refcount;/*36 4 */
> >
> > /* size: 40, cachelines: 1, members: 4 */
> > /* sum members: 36, holes: 1, sum holes: 4 */
> > /* forced alignments: 1, forced holes: 1, sum forced holes: 4 */
> > /* last cacheline: 40 bytes */
> > } __attribute__((__aligned__(8)));
> 
> If you also enable CONFIG_MEMCG or enough options to make
> LAST_CPUPID_NOT_IN_PAGE_FLAGS true, you'll end up with another 4-byte
> hole at the end.




Re: [PATCH 1/1] mm: Fix struct page layout on 32-bit systems

2021-04-14 Thread Matthew Wilcox
On Wed, Apr 14, 2021 at 09:13:22PM +0200, Jesper Dangaard Brouer wrote:
> (If others want to reproduce).  First I could not reproduce on ARM32.
> Then I found out that enabling CONFIG_XEN on ARCH=arm was needed to
> cause the issue by enabling CONFIG_ARCH_DMA_ADDR_T_64BIT.

hmmm ... you should be able to provoke it by enabling ARM_LPAE,
which selects PHYS_ADDR_T_64BIT, and

config ARCH_DMA_ADDR_T_64BIT
def_bool 64BIT || PHYS_ADDR_T_64BIT

>  struct page {
> long unsigned int  flags;/* 0 4 */
> 
> /* XXX 4 bytes hole, try to pack */
> 
> union {
> struct {
> struct list_head lru;/* 8 8 */
> struct address_space * mapping;  /*16 4 */
> long unsigned int index; /*20 4 */
> long unsigned int private;   /*24 4 */
> };   /* 820 */
> struct {
> dma_addr_t dma_addr; /* 8 8 */
> };   /* 8 8 */
[...]
> } __attribute__((__aligned__(8)));   /* 824 */
> union {
> atomic_t   _mapcount;/*32 4 */
> unsigned int   page_type;/*32 4 */
> unsigned int   active;   /*32 4 */
> intunits;/*32 4 */
> };   /*32 4 */
> atomic_t   _refcount;/*36 4 */
> 
> /* size: 40, cachelines: 1, members: 4 */
> /* sum members: 36, holes: 1, sum holes: 4 */
> /* forced alignments: 1, forced holes: 1, sum forced holes: 4 */
> /* last cacheline: 40 bytes */
> } __attribute__((__aligned__(8)));

If you also enable CONFIG_MEMCG or enough options to make
LAST_CPUPID_NOT_IN_PAGE_FLAGS true, you'll end up with another 4-byte
hole at the end.


Re: [PATCH] powerpc/papr_scm: Reduce error severity if nvdimm stats inaccessible

2021-04-14 Thread Dan Williams
On Wed, Apr 14, 2021 at 5:40 AM Vaibhav Jain  wrote:
>
> Currently drc_pmem_qeury_stats() generates a dev_err in case
> "Enable Performance Information Collection" feature is disabled from
> HMC. The error is of the form below:
>
> papr_scm ibm,persistent-memory:ibm,pmemory@44104001: Failed to query
>  performance stats, Err:-10
>
> This error message confuses users as it implies a possible problem
> with the nvdimm even though its due to a disabled feature.
>
> So we fix this by explicitly handling the H_AUTHORITY error from the
> H_SCM_PERFORMANCE_STATS hcall and generating a warning instead of an
> error, saying that "Performance stats in-accessible".
>
> Fixes: 2d02bf835e57('powerpc/papr_scm: Fetch nvdimm performance stats from 
> PHYP')
> Signed-off-by: Vaibhav Jain 
> ---
>  arch/powerpc/platforms/pseries/papr_scm.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
> b/arch/powerpc/platforms/pseries/papr_scm.c
> index 835163f54244..9216424f8be3 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -277,6 +277,9 @@ static ssize_t drc_pmem_query_stats(struct papr_scm_priv 
> *p,
> dev_err(&p->pdev->dev,
> "Unknown performance stats, Err:0x%016lX\n", ret[0]);
> return -ENOENT;
> +   } else if (rc == H_AUTHORITY) {
> +   dev_warn(&p->pdev->dev, "Performance stats in-accessible");
> +   return -EPERM;

So userspace can spam the kernel log? Why is kernel log message needed
at all? EPERM told the caller what happened.


Re: [PATCH] powerpc/papr_scm: Reduce error severity if nvdimm stats inaccessible

2021-04-14 Thread Ira Weiny
On Wed, Apr 14, 2021 at 09:51:40PM +0530, Vaibhav Jain wrote:
> Thanks for looking into this patch Ira,
> 
> Ira Weiny  writes:
> 
> > On Wed, Apr 14, 2021 at 06:10:26PM +0530, Vaibhav Jain wrote:
> >> Currently drc_pmem_qeury_stats() generates a dev_err in case
> >> "Enable Performance Information Collection" feature is disabled from
> >> HMC. The error is of the form below:
> >> 
> >> papr_scm ibm,persistent-memory:ibm,pmemory@44104001: Failed to query
> >> performance stats, Err:-10
> >> 
> >> This error message confuses users as it implies a possible problem
> >> with the nvdimm even though its due to a disabled feature.
> >> 
> >> So we fix this by explicitly handling the H_AUTHORITY error from the
> >> H_SCM_PERFORMANCE_STATS hcall and generating a warning instead of an
> >> error, saying that "Performance stats in-accessible".
> >> 
> >> Fixes: 2d02bf835e57('powerpc/papr_scm: Fetch nvdimm performance stats from 
> >> PHYP')
> >> Signed-off-by: Vaibhav Jain 
> >> ---
> >>  arch/powerpc/platforms/pseries/papr_scm.c | 3 +++
> >>  1 file changed, 3 insertions(+)
> >> 
> >> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
> >> b/arch/powerpc/platforms/pseries/papr_scm.c
> >> index 835163f54244..9216424f8be3 100644
> >> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> >> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> >> @@ -277,6 +277,9 @@ static ssize_t drc_pmem_query_stats(struct 
> >> papr_scm_priv *p,
> >>dev_err(&p->pdev->dev,
> >>"Unknown performance stats, Err:0x%016lX\n", ret[0]);
> >>return -ENOENT;
> >> +  } else if (rc == H_AUTHORITY) {
> >> +  dev_warn(&p->pdev->dev, "Performance stats in-accessible");
> >> +  return -EPERM;
> >
> > Is this because of a disabled feature or because of permissions?
> 
> Its because of a disabled feature that revokes permission for a guest to
> retrieve performance statistics.
> 
> The feature is called "Enable Performance Information Collection" and
> once disabled the hcall H_SCM_PERFORMANCE_STATS returns an error
> H_AUTHORITY indicating that the guest doesn't have permission to retrieve
> performance statistics.

In that case would it be appropriate to have the error message indicate a
permission issue?

Something like 'permission denied'?

Ira



Re: [PATCH 1/1] mm: Fix struct page layout on 32-bit systems

2021-04-14 Thread Jesper Dangaard Brouer
On Wed, 14 Apr 2021 12:50:52 +0100
Matthew Wilcox  wrote:

> > That said, I think we need to have a quicker fix for the immediate
> > issue with 64-bit dma_addr on 32-bit arch and the misalignment hole
> > it leaves[3] in struct page.  In[3] you mention ppc32, does it only
> > happens on certain 32-bit archs?  
> 
> AFAICT it happens on mips32, ppc32, arm32 and arc.  It doesn't happen
> on x86-32 because dma_addr_t is 32-bit aligned.

(If others want to reproduce).  First I could not reproduce on ARM32.
Then I found out that enabling CONFIG_XEN on ARCH=arm was needed to
cause the issue by enabling CONFIG_ARCH_DMA_ADDR_T_64BIT.

Details below signature.
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

From file: arch/arm/Kconfig

config XEN
bool "Xen guest support on ARM"
depends on ARM && AEABI && OF
depends on CPU_V7 && !CPU_V6
depends on !GENERIC_ATOMIC64
depends on MMU
select ARCH_DMA_ADDR_T_64BIT
select ARM_PSCI
select SWIOTLB
select SWIOTLB_XEN
select PARAVIRT
help
  Say Y if you want to run Linux in a Virtual Machine on Xen on ARM.

My make compile command:

 export VERSION=gcc-arm-10.2-2020.11-x86_64-arm-none-linux-gnueabihf/
 export 
CROSS_COMPILE="/home/${USER}/cross-compilers/${VERSION}/bin/arm-none-linux-gnueabihf-"
 make -j8 ARCH=arm CROSS_COMPILE=$CROSS_COMPILE

Pahole output:
 $ pahole -C page mm/page_alloc.o

 struct page {
long unsigned int  flags;/* 0 4 */

/* XXX 4 bytes hole, try to pack */

union {
struct {
struct list_head lru;/* 8 8 */
struct address_space * mapping;  /*16 4 */
long unsigned int index; /*20 4 */
long unsigned int private;   /*24 4 */
};   /* 820 */
struct {
dma_addr_t dma_addr; /* 8 8 */
};   /* 8 8 */
struct {
union {
struct list_head slab_list; /* 8 8 */
struct {
struct page * next; /* 8 4 */
short int pages; /*12 2 */
short int pobjects; /*14 2 */
};   /* 8 8 */
};   /* 8 8 */
struct kmem_cache * slab_cache;  /*16 4 */
void * freelist; /*20 4 */
union {
void * s_mem;/*24 4 */
long unsigned int counters; /*24 4 */
struct {
unsigned int inuse:16; /*24: 0  4 */
unsigned int objects:15; /*24:16  4 
*/
unsigned int frozen:1; /*24:31  4 */
};   /*24 4 */
};   /*24 4 */
};   /* 820 */
struct {
long unsigned int compound_head; /* 8 4 */
unsigned char compound_dtor; /*12 1 */
unsigned char compound_order;/*13 1 */

/* XXX 2 bytes hole, try to pack */

atomic_t   compound_mapcount;/*16 4 */
unsigned int compound_nr;/*20 4 */
};   /* 816 */
struct {
long unsigned int _compound_pad_1; /* 8 4 */
atomic_t   hpage_pinned_refcount; /*12 4 */
struct list_head deferred_list;  /*16 8 */
};   /* 816 */
struct {
long unsigned int _pt_pad_1; /* 8 4 */
pgtable_t  pmd_huge_pte; /*12 4 */
long unsigned int _pt_pad_2; /*16 4 */
union {
struct mm_struct * pt_mm; /*20 4 */
atomic_t pt_frag_refcount; /*20 4 */
};

Re: [PATCH v1 1/2] powerpc/bitops: Use immediate operand when possible

2021-04-14 Thread Segher Boessenkool
On Wed, Apr 14, 2021 at 03:32:04PM +, David Laight wrote:
> From: Segher Boessenkool
> > Sent: 14 April 2021 16:19
> ...
> > > Could the kernel use GCC builtin atomic functions instead ?
> > >
> > > https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
> > 
> > Certainly that should work fine for the simpler cases that the atomic
> > operations are meant to provide.  But esp. for not-so-simple cases the
> > kernel may require some behaviour provided by the existing assembler
> > implementation, and not by the atomic builtins.
> > 
> > I'm not saying this cannot work, just that some serious testing will be
> > needed.  If it works it should be the best of all worlds, so then it is
> > a really good idea yes :-)
> 
> I suspect they just add an extra layer of abstraction that makes it
> even more difficult to verify and could easily get broken by a compiler
> update (etc).

I would say it uses an existing facility, instead of creating a kernel-
specific one.

> The other issue is that the code needs to be correct with compiled
> with (for example) -O0.
> That could very easily break anything except the asm implementation
> if additional memory accesses and/or increased code size cause grief.

The compiler generates correct code.  New versions of the compiler or
old, -O0 or not, under any phase of the moon.

Of course sometimes the compiler is broken, but there are pre-existing
ways of dealing with that, and there is no reason at all to think this
would break more often than random other code.


Segher


Re: [PATCH v2 0/5] powerpc/rtas: miscellaneous cleanups

2021-04-14 Thread Christophe Leroy




On 08/04/2021 at 16:06, Nathan Lynch wrote:

This is a reroll of the series posted here:
https://lore.kernel.org/linuxppc-dev/20210114220004.1138993-1-nath...@linux.ibm.com/

Originally this work was prompted by failures on radix MMU PowerVM
guests when passing buffers to RTAS that lay outside of its idea of
the RMA. In v1 I approached this as a problem to be solved in Linux,
but RTAS development has since decided to change their code so that
the RMA restriction does not apply with radix.

So in v2 I retain the cleanups and discard the more significant change
which accommodated the misbehaving RTAS versions.


Is there a link with https://github.com/linuxppc/issues/issues/252 ?



Changes since v1:
- Correct missing conversion of RTAS_RMOBUF_MAX ->
   RTAS_USER_REGION_SIZE in in_rmo_buf().
- Remove unnecessary braces in rtas_syscall_filter_init().
- Leave expression of RTAS_WORK_AREA_SIZE as-is instead of changing
   the factors in a confusing way, per discussion with Alexey.
- Drop "powerpc/rtas: constrain user region allocation to RMA"

Nathan Lynch (5):
   powerpc/rtas: improve ppc_rtas_rmo_buf_show documentation
   powerpc/rtas-proc: remove unused RMO_READ_BUF_MAX
   powerpc/rtas: remove ibm_suspend_me_token
   powerpc/rtas: move syscall filter setup into separate function
   powerpc/rtas: rename RTAS_RMOBUF_MAX to RTAS_USER_REGION_SIZE

  arch/powerpc/include/asm/rtas.h |  6 +++---
  arch/powerpc/kernel/rtas-proc.c | 15 +++
  arch/powerpc/kernel/rtas.c  | 34 +
  3 files changed, 32 insertions(+), 23 deletions(-)



[PATCH v3 3/3] powerpc/atomics: Remove atomic_inc()/atomic_dec() and friends

2021-04-14 Thread Christophe Leroy
Now that atomic_add() and atomic_sub() handle immediate operands,
atomic_inc() and atomic_dec() have no added value compared to the
generic fallback which calls atomic_add(1) and atomic_sub(1).

Also remove atomic_inc_not_zero() which falls back to
atomic_add_unless() which itself falls back to
atomic_fetch_add_unless() which now handles immediate operands.

Signed-off-by: Christophe Leroy 
---
v2: New
---
 arch/powerpc/include/asm/atomic.h | 95 ---
 1 file changed, 95 deletions(-)

diff --git a/arch/powerpc/include/asm/atomic.h 
b/arch/powerpc/include/asm/atomic.h
index eb1bdf14f67c..00ba5d9e837b 100644
--- a/arch/powerpc/include/asm/atomic.h
+++ b/arch/powerpc/include/asm/atomic.h
@@ -118,71 +118,6 @@ ATOMIC_OPS(xor, xor, "", K)
 #undef ATOMIC_OP_RETURN_RELAXED
 #undef ATOMIC_OP
 
-static __inline__ void atomic_inc(atomic_t *v)
-{
-   int t;
-
-   __asm__ __volatile__(
-"1:lwarx   %0,0,%2 # atomic_inc\n\
-   addic   %0,%0,1\n"
-"  stwcx.  %0,0,%2 \n\
-   bne-1b"
-   : "=&r" (t), "+m" (v->counter)
-   : "r" (&v->counter)
-   : "cc", "xer");
-}
-#define atomic_inc atomic_inc
-
-static __inline__ int atomic_inc_return_relaxed(atomic_t *v)
-{
-   int t;
-
-   __asm__ __volatile__(
-"1:lwarx   %0,0,%2 # atomic_inc_return_relaxed\n"
-"  addic   %0,%0,1\n"
-"  stwcx.  %0,0,%2\n"
-"  bne-1b"
-   : "=&r" (t), "+m" (v->counter)
-   : "r" (&v->counter)
-   : "cc", "xer");
-
-   return t;
-}
-
-static __inline__ void atomic_dec(atomic_t *v)
-{
-   int t;
-
-   __asm__ __volatile__(
-"1:lwarx   %0,0,%2 # atomic_dec\n\
-   addic   %0,%0,-1\n"
-"  stwcx.  %0,0,%2\n\
-   bne-1b"
-   : "=&r" (t), "+m" (v->counter)
-   : "r" (&v->counter)
-   : "cc", "xer");
-}
-#define atomic_dec atomic_dec
-
-static __inline__ int atomic_dec_return_relaxed(atomic_t *v)
-{
-   int t;
-
-   __asm__ __volatile__(
-"1:lwarx   %0,0,%2 # atomic_dec_return_relaxed\n"
-"  addic   %0,%0,-1\n"
-"  stwcx.  %0,0,%2\n"
-"  bne-1b"
-   : "=&r" (t), "+m" (v->counter)
-   : "r" (&v->counter)
-   : "cc", "xer");
-
-   return t;
-}
-
-#define atomic_inc_return_relaxed atomic_inc_return_relaxed
-#define atomic_dec_return_relaxed atomic_dec_return_relaxed
-
 #define atomic_cmpxchg(v, o, n) (cmpxchg(&((v)->counter), (o), (n)))
 #define atomic_cmpxchg_relaxed(v, o, n) \
cmpxchg_relaxed(&((v)->counter), (o), (n))
@@ -252,36 +187,6 @@ static __inline__ int atomic_fetch_add_unless(atomic_t *v, 
int a, int u)
 }
 #define atomic_fetch_add_unless atomic_fetch_add_unless
 
-/**
- * atomic_inc_not_zero - increment unless the number is zero
- * @v: pointer of type atomic_t
- *
- * Atomically increments @v by 1, so long as @v is non-zero.
- * Returns non-zero if @v was non-zero, and zero otherwise.
- */
-static __inline__ int atomic_inc_not_zero(atomic_t *v)
-{
-   int t1, t2;
-
-   __asm__ __volatile__ (
-   PPC_ATOMIC_ENTRY_BARRIER
-"1:lwarx   %0,0,%2 # atomic_inc_not_zero\n\
-   cmpwi   0,%0,0\n\
-   beq-2f\n\
-   addic   %1,%0,1\n"
-"  stwcx.  %1,0,%2\n\
-   bne-1b\n"
-   PPC_ATOMIC_EXIT_BARRIER
-   "\n\
-2:"
-   : "=&r" (t1), "=&r" (t2)
-   : "r" (&v->counter)
-   : "cc", "xer", "memory");
-
-   return t1;
-}
-#define atomic_inc_not_zero(v) atomic_inc_not_zero((v))
-
 /*
  * Atomically test *v and decrement if it is greater than 0.
  * The function returns the old value of *v minus 1, even if
-- 
2.25.0



[PATCH v3 2/3] powerpc/atomics: Use immediate operand when possible

2021-04-14 Thread Christophe Leroy
Today we get the following code generation for atomic operations:

c001bb2c:   39 20 00 01 li  r9,1
c001bb30:   7d 40 18 28 lwarx   r10,0,r3
c001bb34:   7d 09 50 50 subfr8,r9,r10
c001bb38:   7d 00 19 2d stwcx.  r8,0,r3

c001c7a8:   39 40 00 01 li  r10,1
c001c7ac:   7d 00 18 28 lwarx   r8,0,r3
c001c7b0:   7c ea 42 14 add r7,r10,r8
c001c7b4:   7c e0 19 2d stwcx.  r7,0,r3

By allowing GCC to choose between immediate or regular operation,
we get:

c001bb2c:   7d 20 18 28 lwarx   r9,0,r3
c001bb30:   39 49 ff ff addir10,r9,-1
c001bb34:   7d 40 19 2d stwcx.  r10,0,r3
--
c001c7a4:   7d 40 18 28 lwarx   r10,0,r3
c001c7a8:   39 0a 00 01 addir8,r10,1
c001c7ac:   7d 00 19 2d stwcx.  r8,0,r3

For "and", the dot form has to be used because "andi" doesn't exist.

For logical operations we use unsigned 16 bits immediate.
For arithmetic operations we use signed 16 bits immediate.

On pmac32_defconfig, it reduces the text by approx another 8 kbytes.

Signed-off-by: Christophe Leroy 
Acked-by: Segher Boessenkool 
---
v2: Use "addc/addic"
---
 arch/powerpc/include/asm/atomic.h | 56 +++
 1 file changed, 28 insertions(+), 28 deletions(-)

diff --git a/arch/powerpc/include/asm/atomic.h 
b/arch/powerpc/include/asm/atomic.h
index 61c6e8b200e8..eb1bdf14f67c 100644
--- a/arch/powerpc/include/asm/atomic.h
+++ b/arch/powerpc/include/asm/atomic.h
@@ -37,62 +37,62 @@ static __inline__ void atomic_set(atomic_t *v, int i)
__asm__ __volatile__("stw%U0%X0 %1,%0" : "=m"UPD_CONSTR(v->counter) : 
"r"(i));
 }
 
-#define ATOMIC_OP(op, asm_op)  \
+#define ATOMIC_OP(op, asm_op, suffix, sign, ...)   \
 static __inline__ void atomic_##op(int a, atomic_t *v) \
 {  \
int t;  \
\
__asm__ __volatile__(   \
 "1:lwarx   %0,0,%3 # atomic_" #op "\n" \
-   #asm_op " %0,%2,%0\n"   \
+   #asm_op "%I2" suffix " %0,%0,%2\n"  \
 "  stwcx.  %0,0,%3 \n" \
 "  bne-1b\n"   \
: "=&r" (t), "+m" (v->counter)  \
-   : "r" (a), "r" (&v->counter)\
-   : "cc");\
+   : "r"#sign (a), "r" (&v->counter)   \
+   : "cc", ##__VA_ARGS__); \
 }  \
 
-#define ATOMIC_OP_RETURN_RELAXED(op, asm_op)   \
+#define ATOMIC_OP_RETURN_RELAXED(op, asm_op, suffix, sign, ...)
\
 static inline int atomic_##op##_return_relaxed(int a, atomic_t *v) \
 {  \
int t;  \
\
__asm__ __volatile__(   \
 "1:lwarx   %0,0,%3 # atomic_" #op "_return_relaxed\n"  \
-   #asm_op " %0,%2,%0\n"   \
+   #asm_op "%I2" suffix " %0,%0,%2\n"  \
 "  stwcx.  %0,0,%3\n"  \
 "  bne-1b\n"   \
: "=&r" (t), "+m" (v->counter)  \
-   : "r" (a), "r" (&v->counter)\
-   : "cc");\
+   : "r"#sign (a), "r" (&v->counter)   \
+   : "cc", ##__VA_ARGS__); \
\
return t;   \
 }
 
-#define ATOMIC_FETCH_OP_RELAXED(op, asm_op)\
+#define ATOMIC_FETCH_OP_RELAXED(op, asm_op, suffix, sign, ...) \
 static inline int atomic_fetch_##op##_relaxed(int a, atomic_t *v)  \
 {  \
int res, t; \
\
__asm__ 

[PATCH v3 1/3] powerpc/bitops: Use immediate operand when possible

2021-04-14 Thread Christophe Leroy
Today we get the following code generation for bitops like
set or clear bit:

c0009fe0:   39 40 08 00 li  r10,2048
c0009fe4:   7c e0 40 28 lwarx   r7,0,r8
c0009fe8:   7c e7 53 78 or  r7,r7,r10
c0009fec:   7c e0 41 2d stwcx.  r7,0,r8

c000d568:   39 00 18 00 li  r8,6144
c000d56c:   7c c0 38 28 lwarx   r6,0,r7
c000d570:   7c c6 40 78 andcr6,r6,r8
c000d574:   7c c0 39 2d stwcx.  r6,0,r7

Most set bits are constant on lower 16 bits, so it can easily
be replaced by the "immediate" version of the operation. Allow
GCC to choose between the normal or immediate form.

For clear bits, on 32 bits 'rlwinm' can be used instead of 'andc'
when all bits to be cleared are consecutive.

On 64 bits we don't have any equivalent single operation for clearing
single bits or a few bits; we'd need two 'rldicl', so it is not
worth it, the li/andc sequence is doing the same.

With this patch we get:

c0009fe0:   7d 00 50 28 lwarx   r8,0,r10
c0009fe4:   61 08 08 00 ori r8,r8,2048
c0009fe8:   7d 00 51 2d stwcx.  r8,0,r10

c000d558:   7c e0 40 28 lwarx   r7,0,r8
c000d55c:   54 e7 05 64 rlwinm  r7,r7,0,21,18
c000d560:   7c e0 41 2d stwcx.  r7,0,r8

On pmac32_defconfig, it reduces the text by approx 10 kbytes.

Signed-off-by: Christophe Leroy 
---
v3:
- Using the mask validation proposed by Segher

v2:
- Use "n" instead of "i" as constraint for the rlwinm mask
- Improve mask verification to handle more than single bit masks

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/bitops.h | 89 ---
 1 file changed, 81 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/bitops.h 
b/arch/powerpc/include/asm/bitops.h
index 299ab33505a6..09500c789972 100644
--- a/arch/powerpc/include/asm/bitops.h
+++ b/arch/powerpc/include/asm/bitops.h
@@ -71,19 +71,61 @@ static inline void fn(unsigned long mask,   \
__asm__ __volatile__ (  \
prefix  \
 "1:"   PPC_LLARX(%0,0,%3,0) "\n"   \
-   stringify_in_c(op) "%0,%0,%2\n" \
+   #op "%I2 %0,%0,%2\n"\
PPC_STLCX "%0,0,%3\n"   \
"bne- 1b\n" \
: "=&r" (old), "+m" (*p)\
-   : "r" (mask), "r" (p)   \
+   : "rK" (mask), "r" (p)  \
: "cc", "memory");  \
 }
 
 DEFINE_BITOP(set_bits, or, "")
-DEFINE_BITOP(clear_bits, andc, "")
-DEFINE_BITOP(clear_bits_unlock, andc, PPC_RELEASE_BARRIER)
 DEFINE_BITOP(change_bits, xor, "")
 
+static __always_inline bool is_rlwinm_mask_valid(unsigned long x)
+{
+   if (!x)
+   return false;
+   if (x & 1)
+   x = ~x; // make the mask non-wrapping
+   x += x & -x;// adding the low set bit results in at most one bit set
+
+   return !(x & (x - 1));
+}
+
+#define DEFINE_CLROP(fn, prefix)   \
+static inline void fn(unsigned long mask, volatile unsigned long *_p)  \
+{  \
+   unsigned long old;  \
+   unsigned long *p = (unsigned long *)_p; \
+   \
+   if (IS_ENABLED(CONFIG_PPC32) && \
+   __builtin_constant_p(mask) && is_rlwinm_mask_valid(~mask)) {\
+   asm volatile (  \
+   prefix  \
+   "1:""lwarx  %0,0,%3\n"  \
+   "rlwinm %0,%0,0,%2\n"   \
+   "stwcx. %0,0,%3\n"  \
+   "bne- 1b\n" \
+   : "=&r" (old), "+m" (*p)\
+   : "n" (~mask), "r" (p)  \
+   : "cc", "memory");  \
+   } else {\
+   asm volatile (  \
+   prefix  \
+   "1:"PPC_LLARX(%0,0,%3,0) "\n"   \
+   "andc %0,%0,%2\n"   \
+   PPC_STLCX "%0,0,%3\n"   \
+   "bne- 1b\n" \
+   : "=&r" (old), "+m" (*p)\
+   : "r" 

Re: [PATCH] powerpc/papr_scm: Reduce error severity if nvdimm stats inaccessible

2021-04-14 Thread Vaibhav Jain
Thanks for looking into this patch Ira,

Ira Weiny  writes:

> On Wed, Apr 14, 2021 at 06:10:26PM +0530, Vaibhav Jain wrote:
>> Currently drc_pmem_qeury_stats() generates a dev_err in case
>> "Enable Performance Information Collection" feature is disabled from
>> HMC. The error is of the form below:
>> 
>> papr_scm ibm,persistent-memory:ibm,pmemory@44104001: Failed to query
>>   performance stats, Err:-10
>> 
>> This error message confuses users as it implies a possible problem
>> with the nvdimm even though its due to a disabled feature.
>> 
>> So we fix this by explicitly handling the H_AUTHORITY error from the
>> H_SCM_PERFORMANCE_STATS hcall and generating a warning instead of an
>> error, saying that "Performance stats in-accessible".
>> 
>> Fixes: 2d02bf835e57('powerpc/papr_scm: Fetch nvdimm performance stats from 
>> PHYP')
>> Signed-off-by: Vaibhav Jain 
>> ---
>>  arch/powerpc/platforms/pseries/papr_scm.c | 3 +++
>>  1 file changed, 3 insertions(+)
>> 
>> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
>> b/arch/powerpc/platforms/pseries/papr_scm.c
>> index 835163f54244..9216424f8be3 100644
>> --- a/arch/powerpc/platforms/pseries/papr_scm.c
>> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
>> @@ -277,6 +277,9 @@ static ssize_t drc_pmem_query_stats(struct papr_scm_priv 
>> *p,
>>  dev_err(>pdev->dev,
>>  "Unknown performance stats, Err:0x%016lX\n", ret[0]);
>>  return -ENOENT;
>> +} else if (rc == H_AUTHORITY) {
>> +dev_warn(>pdev->dev, "Performance stats in-accessible");
>> +return -EPERM;
>
> Is this because of a disabled feature or because of permissions?

Its because of a disabled feature that revokes permission for a guest to
retrieve performance statistics.

The feature is called "Enable Performance Information Collection" and
once disabled the hcall H_SCM_PERFORMANCE_STATS returns an error
H_AUTHORITY indicating that the guest doesn't have permission to retrieve
performance statistics.

>
> Ira
>
>>  } else if (rc != H_SUCCESS) {
>>  dev_err(>pdev->dev,
>>  "Failed to query performance stats, Err:%lld\n", rc);
>> -- 
>> 2.30.2
>> 

-- 
Cheers
~ Vaibhav


RE: [PATCH 1/1] mm: Fix struct page layout on 32-bit systems

2021-04-14 Thread David Laight
> Doing this fixes it:
> 
> +++ b/include/linux/types.h
> @@ -140,7 +140,7 @@ typedef u64 blkcnt_t;
>   * so they don't care about the size of the actual bus addresses.
>   */
>  #ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
> -typedef u64 dma_addr_t;
> +typedef u64 __attribute__((aligned(sizeof(void *)))) dma_addr_t;
>  #else
>  typedef u32 dma_addr_t;
>  #endif

I hate __packed so much I've been checking what it does!

If you add __packed to the dma_addr_t field inside the union
then gcc (at least) removes the pad from before it, but also
'remembers' the alignment that is enforced by other members
of the structure.

So you don't need the extra aligned(sizeof (void *)) since
that is implicit.

So in this case __packed probably has no side effects.
(Unless a 32bit arch has instructions for a 64bit read
that must not be on an 8n+4 boundary and the address is taken).

It also doesn't affect 64bit - since the previous field
forces 64bit alignment.

David
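
As an illustration of the layout point above, a small self-contained
struct (hypothetical, not the kernel's struct page) that can be built
for a 32-bit target and inspected with pahole; the __packed-style
attribute on the 64-bit member removes the 4-byte hole after the first
word:

/* packed_demo.c - hypothetical demo, not kernel code */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t dma_addr_t;

struct demo {
	unsigned long flags;			/* 4 bytes on a 32-bit ABI */
	union {
		struct {
			void *mapping;
			unsigned long index;
		};
		struct {
			/* drop __attribute__((packed)) to see the hole come back */
			dma_addr_t dma_addr __attribute__((packed));
		};
	};
};

int main(void)
{
	printf("sizeof(struct demo) = %zu\n", sizeof(struct demo));
	return 0;
}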




Re: [PATCH] powerpc/papr_scm: Reduce error severity if nvdimm stats inaccessible

2021-04-14 Thread Ira Weiny
On Wed, Apr 14, 2021 at 06:10:26PM +0530, Vaibhav Jain wrote:
> Currently drc_pmem_qeury_stats() generates a dev_err in case
> "Enable Performance Information Collection" feature is disabled from
> HMC. The error is of the form below:
> 
> papr_scm ibm,persistent-memory:ibm,pmemory@44104001: Failed to query
>performance stats, Err:-10
> 
> This error message confuses users as it implies a possible problem
> with the nvdimm even though its due to a disabled feature.
> 
> So we fix this by explicitly handling the H_AUTHORITY error from the
> H_SCM_PERFORMANCE_STATS hcall and generating a warning instead of an
> error, saying that "Performance stats in-accessible".
> 
> Fixes: 2d02bf835e57('powerpc/papr_scm: Fetch nvdimm performance stats from 
> PHYP')
> Signed-off-by: Vaibhav Jain 
> ---
>  arch/powerpc/platforms/pseries/papr_scm.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
> b/arch/powerpc/platforms/pseries/papr_scm.c
> index 835163f54244..9216424f8be3 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -277,6 +277,9 @@ static ssize_t drc_pmem_query_stats(struct papr_scm_priv 
> *p,
> dev_err(&p->pdev->dev,
>   "Unknown performance stats, Err:0x%016lX\n", ret[0]);
>   return -ENOENT;
> + } else if (rc == H_AUTHORITY) {
> + dev_warn(&p->pdev->dev, "Performance stats in-accessible");
> + return -EPERM;

Is this because of a disabled feature or because of permissions?

Ira

>   } else if (rc != H_SUCCESS) {
>   dev_err(&p->pdev->dev,
>   "Failed to query performance stats, Err:%lld\n", rc);
> -- 
> 2.30.2
> 


RE: [PATCH v1 1/2] powerpc/bitops: Use immediate operand when possible

2021-04-14 Thread David Laight
From: Segher Boessenkool
> Sent: 14 April 2021 16:19
...
> > Could the kernel use GCC builtin atomic functions instead ?
> >
> > https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
> 
> Certainly that should work fine for the simpler cases that the atomic
> operations are meant to provide.  But esp. for not-so-simple cases the
> kernel may require some behaviour provided by the existing assembler
> implementation, and not by the atomic builtins.
> 
> I'm not saying this cannot work, just that some serious testing will be
> needed.  If it works it should be the best of all worlds, so then it is
> a really good idea yes :-)

I suspect they just add an extra layer of abstraction that makes it
even more difficult to verify and could easily get broken by a compiler
update (etc).

The other issue is that the code needs to be correct with compiled
with (for example) -O0.
That could very easily break anything except the asm implementation
if additional memory accesses and/or increased code size cause grief.

David




Re: [PATCH v1 1/2] powerpc/bitops: Use immediate operand when possible

2021-04-14 Thread Segher Boessenkool
On Wed, Apr 14, 2021 at 02:42:51PM +0200, Christophe Leroy wrote:
> On 14/04/2021 at 14:24, Segher Boessenkool wrote:
> >On Wed, Apr 14, 2021 at 12:01:21PM +1000, Nicholas Piggin wrote:
> >>Would be nice if we could let the compiler deal with it all...
> >>
> >>static inline unsigned long lr(unsigned long *mem)
> >>{
> >> unsigned long val;
> >>
> >> /*
> >>  * This doesn't clobber memory but want to avoid memory 
> >>  operations
> >>  * moving ahead of it
> >>  */
> >> asm volatile("ldarx %0, %y1" : "=r"(val) : "Z"(*mem) : 
> >> "memory");
> >>
> >> return val;
> >>}
> >
> >(etc.)
> >
> >That can not work reliably: the compiler can put random instructions
> >between the larx and stcx. this way, and you then do not have guaranteed
> >forward progress anymore.  It can put the two in different routines
> >(after inlining and other interprocedural optimisations), duplicate
> >them, make a different number of copies of them, etc.
> >
> >Nothing of that is okay if you want to guarantee forward progress on all
> >implementations, and also not if you want to have good performance
> >everywhere (or anywhere even).  Unfortunately you have to write all
> >larx/stcx. loops as one block of assembler, so that you know exactly
> >what instructions will end up in your binary.
> >
> >If you don't, it will fail mysteriously after random recompilations, or
> >have performance degradations, etc.  You don't want to go there :-)
> >
> 
> Could the kernel use GCC builtin atomic functions instead ?
> 
> https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html

Certainly that should work fine for the simpler cases that the atomic
operations are meant to provide.  But esp. for not-so-simple cases the
kernel may require some behaviour provided by the existing assembler
implementation, and not by the atomic builtins.

I'm not saying this cannot work, just that some serious testing will be
needed.  If it works it should be the best of all worlds, so then it is
a really good idea yes :-)


Segher
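
A minimal illustration of the builtins referenced above (standalone
userspace sketch, not kernel code); whether the compiler emits a
larx/stcx. loop or something else for these is entirely up to it, which
is the concern raised in this thread:

/* atomic_builtin_demo.c - hypothetical demo */
#include <stdio.h>

int main(void)
{
	unsigned long v = 0;

	__atomic_fetch_add(&v, 1, __ATOMIC_RELAXED);	/* v += 1 atomically */
	__atomic_fetch_or(&v, 0x800, __ATOMIC_RELAXED);	/* v |= 0x800 atomically */

	printf("v = %#lx\n", v);
	return 0;
}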


[PATCH v3 4/4] powerpc: Move copy_from_kernel_nofault_inst()

2021-04-14 Thread Christophe Leroy
When probe_kernel_read_inst() was created, there was no good place to
put it, so a file called lib/inst.c was dedicated to it.

Since then, probe_kernel_read_inst() has been renamed
copy_from_kernel_nofault_inst(). And mm/maccess.c didn't exist at that
time. Today, mm/maccess.c is related to copy_from_kernel_nofault().

Move copy_from_kernel_nofault_inst() into mm/maccess.c

Signed-off-by: Christophe Leroy 
---
v2: Remove inst.o from Makefile
---
 arch/powerpc/lib/Makefile |  2 +-
 arch/powerpc/lib/inst.c   | 26 --
 arch/powerpc/mm/maccess.c | 21 +
 3 files changed, 22 insertions(+), 27 deletions(-)
 delete mode 100644 arch/powerpc/lib/inst.c

diff --git a/arch/powerpc/lib/Makefile b/arch/powerpc/lib/Makefile
index d4efc182662a..f2c690ee75d1 100644
--- a/arch/powerpc/lib/Makefile
+++ b/arch/powerpc/lib/Makefile
@@ -16,7 +16,7 @@ CFLAGS_code-patching.o += -DDISABLE_BRANCH_PROFILING
 CFLAGS_feature-fixups.o += -DDISABLE_BRANCH_PROFILING
 endif
 
-obj-y += alloc.o code-patching.o feature-fixups.o pmem.o inst.o 
test_code-patching.o
+obj-y += alloc.o code-patching.o feature-fixups.o pmem.o test_code-patching.o
 
 ifndef CONFIG_KASAN
 obj-y  +=  string.o memcmp_$(BITS).o
diff --git a/arch/powerpc/lib/inst.c b/arch/powerpc/lib/inst.c
deleted file mode 100644
index e554d1357f2f..
--- a/arch/powerpc/lib/inst.c
+++ /dev/null
@@ -1,26 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0-or-later
-/*
- *  Copyright 2020, IBM Corporation.
- */
-
-#include 
-#include 
-#include 
-#include 
-
-int copy_inst_from_kernel_nofault(struct ppc_inst *inst, struct ppc_inst *src)
-{
-   unsigned int val, suffix;
-   int err;
-
-   err = copy_from_kernel_nofault(&val, src, sizeof(val));
-   if (err)
-   return err;
-   if (IS_ENABLED(CONFIG_PPC64) && get_op(val) == OP_PREFIX) {
-   err = copy_from_kernel_nofault(&suffix, (void *)src + 4, 4);
-   *inst = ppc_inst_prefix(val, suffix);
-   } else {
-   *inst = ppc_inst(val);
-   }
-   return err;
-}
diff --git a/arch/powerpc/mm/maccess.c b/arch/powerpc/mm/maccess.c
index fa9a7a718fc6..a3c30a884076 100644
--- a/arch/powerpc/mm/maccess.c
+++ b/arch/powerpc/mm/maccess.c
@@ -3,7 +3,28 @@
 #include 
 #include 
 
+#include 
+#include 
+#include 
+
 bool copy_from_kernel_nofault_allowed(const void *unsafe_src, size_t size)
 {
return is_kernel_addr((unsigned long)unsafe_src);
 }
+
+int copy_inst_from_kernel_nofault(struct ppc_inst *inst, struct ppc_inst *src)
+{
+   unsigned int val, suffix;
+   int err;
+
+   err = copy_from_kernel_nofault(&val, src, sizeof(val));
+   if (err)
+   return err;
+   if (IS_ENABLED(CONFIG_PPC64) && get_op(val) == OP_PREFIX) {
+   err = copy_from_kernel_nofault(&suffix, (void *)src + 4, 4);
+   *inst = ppc_inst_prefix(val, suffix);
+   } else {
+   *inst = ppc_inst(val);
+   }
+   return err;
+}
-- 
2.25.0



[PATCH v3 1/4] powerpc: Remove probe_user_read_inst()

2021-04-14 Thread Christophe Leroy
Its name comes from former probe_user_read() function.
That function is now called copy_from_user_nofault().

probe_user_read_inst() uses copy_from_user_nofault() to read only
a few bytes. It is suboptimal.

It does the same as get_user_inst() but in addition disables
page faults.

But on the other hand, it is not used for the time being. So remove it
for now. If one day it is really needed, we can give it a new name
more in line with today's naming, and implement it using get_user_inst()

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/inst.h |  3 ---
 arch/powerpc/lib/inst.c | 31 ---
 2 files changed, 34 deletions(-)

diff --git a/arch/powerpc/include/asm/inst.h b/arch/powerpc/include/asm/inst.h
index 19e18af2fac9..2902d4e6a363 100644
--- a/arch/powerpc/include/asm/inst.h
+++ b/arch/powerpc/include/asm/inst.h
@@ -175,9 +175,6 @@ static inline char *__ppc_inst_as_str(char 
str[PPC_INST_STR_LEN], struct ppc_ins
__str;  \
 })
 
-int probe_user_read_inst(struct ppc_inst *inst,
-struct ppc_inst __user *nip);
-
 int probe_kernel_read_inst(struct ppc_inst *inst,
   struct ppc_inst *src);
 
diff --git a/arch/powerpc/lib/inst.c b/arch/powerpc/lib/inst.c
index 9cc17eb62462..c57b3548de37 100644
--- a/arch/powerpc/lib/inst.c
+++ b/arch/powerpc/lib/inst.c
@@ -9,24 +9,6 @@
 #include 
 
 #ifdef CONFIG_PPC64
-int probe_user_read_inst(struct ppc_inst *inst,
-struct ppc_inst __user *nip)
-{
-   unsigned int val, suffix;
-   int err;
-
-   err = copy_from_user_nofault(&val, nip, sizeof(val));
-   if (err)
-   return err;
-   if (get_op(val) == OP_PREFIX) {
-   err = copy_from_user_nofault(&suffix, (void __user *)nip + 4, 
4);
-   *inst = ppc_inst_prefix(val, suffix);
-   } else {
-   *inst = ppc_inst(val);
-   }
-   return err;
-}
-
 int probe_kernel_read_inst(struct ppc_inst *inst,
   struct ppc_inst *src)
 {
@@ -45,19 +27,6 @@ int probe_kernel_read_inst(struct ppc_inst *inst,
return err;
 }
 #else /* !CONFIG_PPC64 */
-int probe_user_read_inst(struct ppc_inst *inst,
-struct ppc_inst __user *nip)
-{
-   unsigned int val;
-   int err;
-
-   err = copy_from_user_nofault(&val, nip, sizeof(val));
-   if (!err)
-   *inst = ppc_inst(val);
-
-   return err;
-}
-
 int probe_kernel_read_inst(struct ppc_inst *inst,
   struct ppc_inst *src)
 {
-- 
2.25.0



[PATCH v3 3/4] powerpc: Rename probe_kernel_read_inst()

2021-04-14 Thread Christophe Leroy
When probe_kernel_read_inst() was created, it was to mimic
probe_kernel_read() function.

Since then, probe_kernel_read() has been renamed
copy_from_kernel_nofault().

Rename probe_kernel_read_inst() into copy_inst_from_kernel_nofault().

Signed-off-by: Christophe Leroy 
---
v3: copy_inst_from_kernel_nofault() instead of copy_from_kernel_nofault_inst()
---
 arch/powerpc/include/asm/inst.h|  3 +--
 arch/powerpc/kernel/align.c|  2 +-
 arch/powerpc/kernel/trace/ftrace.c | 22 +++---
 arch/powerpc/lib/inst.c|  3 +--
 4 files changed, 14 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/include/asm/inst.h b/arch/powerpc/include/asm/inst.h
index a40c3913a4a3..eaf5a6299034 100644
--- a/arch/powerpc/include/asm/inst.h
+++ b/arch/powerpc/include/asm/inst.h
@@ -177,7 +177,6 @@ static inline char *__ppc_inst_as_str(char 
str[PPC_INST_STR_LEN], struct ppc_ins
__str;  \
 })
 
-int probe_kernel_read_inst(struct ppc_inst *inst,
-  struct ppc_inst *src);
+int copy_inst_from_kernel_nofault(struct ppc_inst *inst, struct ppc_inst *src);
 
 #endif /* _ASM_POWERPC_INST_H */
diff --git a/arch/powerpc/kernel/align.c b/arch/powerpc/kernel/align.c
index a97d5f1a3905..8f350d0478e6 100644
--- a/arch/powerpc/kernel/align.c
+++ b/arch/powerpc/kernel/align.c
@@ -311,7 +311,7 @@ int fix_alignment(struct pt_regs *regs)
CHECK_FULL_REGS(regs);
 
if (is_kernel_addr(regs->nip))
-   r = probe_kernel_read_inst(&instr, (void *)regs->nip);
+   r = copy_inst_from_kernel_nofault(&instr, (void *)regs->nip);
else
r = __get_user_instr(instr, (void __user *)regs->nip);
 
diff --git a/arch/powerpc/kernel/trace/ftrace.c 
b/arch/powerpc/kernel/trace/ftrace.c
index 42761ebec9f7..ffe9537195aa 100644
--- a/arch/powerpc/kernel/trace/ftrace.c
+++ b/arch/powerpc/kernel/trace/ftrace.c
@@ -68,7 +68,7 @@ ftrace_modify_code(unsigned long ip, struct ppc_inst old, 
struct ppc_inst new)
 */
 
/* read the text we want to modify */
-   if (probe_kernel_read_inst(&replaced, (void *)ip))
+   if (copy_inst_from_kernel_nofault(&replaced, (void *)ip))
return -EFAULT;
 
/* Make sure it is what we expect it to be */
@@ -130,7 +130,7 @@ __ftrace_make_nop(struct module *mod,
struct ppc_inst op, pop;
 
/* read where this goes */
-   if (probe_kernel_read_inst(&op, (void *)ip)) {
+   if (copy_inst_from_kernel_nofault(&op, (void *)ip)) {
pr_err("Fetching opcode failed.\n");
return -EFAULT;
}
@@ -164,7 +164,7 @@ __ftrace_make_nop(struct module *mod,
/* When using -mkernel_profile there is no load to jump over */
pop = ppc_inst(PPC_INST_NOP);
 
-   if (probe_kernel_read_inst(&op, (void *)(ip - 4))) {
+   if (copy_inst_from_kernel_nofault(&op, (void *)(ip - 4))) {
pr_err("Fetching instruction at %lx failed.\n", ip - 4);
return -EFAULT;
}
@@ -197,7 +197,7 @@ __ftrace_make_nop(struct module *mod,
 * Check what is in the next instruction. We can see ld r2,40(r1), but
 * on first pass after boot we will see mflr r0.
 */
-   if (probe_kernel_read_inst(&op, (void *)(ip + 4))) {
+   if (copy_inst_from_kernel_nofault(&op, (void *)(ip + 4))) {
pr_err("Fetching op failed.\n");
return -EFAULT;
}
@@ -349,7 +349,7 @@ static int setup_mcount_compiler_tramp(unsigned long tramp)
return -1;
 
/* New trampoline -- read where this goes */
-   if (probe_kernel_read_inst(&op, (void *)tramp)) {
+   if (copy_inst_from_kernel_nofault(&op, (void *)tramp)) {
pr_debug("Fetching opcode failed.\n");
return -1;
}
@@ -399,7 +399,7 @@ static int __ftrace_make_nop_kernel(struct dyn_ftrace *rec, 
unsigned long addr)
struct ppc_inst op;
 
/* Read where this goes */
-   if (probe_kernel_read_inst(&op, (void *)ip)) {
+   if (copy_inst_from_kernel_nofault(&op, (void *)ip)) {
pr_err("Fetching opcode failed.\n");
return -EFAULT;
}
@@ -526,10 +526,10 @@ __ftrace_make_call(struct dyn_ftrace *rec, unsigned long 
addr)
struct module *mod = rec->arch.mod;
 
/* read where this goes */
-   if (probe_kernel_read_inst(op, ip))
+   if (copy_inst_from_kernel_nofault(op, ip))
return -EFAULT;
 
-   if (probe_kernel_read_inst(op + 1, ip + 4))
+   if (copy_inst_from_kernel_nofault(op + 1, ip + 4))
return -EFAULT;
 
if (!expected_nop_sequence(ip, op[0], op[1])) {
@@ -592,7 +592,7 @@ __ftrace_make_call(struct dyn_ftrace *rec, unsigned long 
addr)
unsigned long ip = rec->ip;
 
/* read where this goes */
-   if (probe_kernel_read_inst(&op, (void *)ip))
+   if (copy_inst_from_kernel_nofault(&op, (void *)ip))
return -EFAULT;
 
/* It should 

[PATCH v3 2/4] powerpc: Make probe_kernel_read_inst() common to PPC32 and PPC64

2021-04-14 Thread Christophe Leroy
We have two independent versions of probe_kernel_read_inst(), one for
PPC32 and one for PPC64.

The PPC32 version is identical to the first part of the PPC64 version.
The remaining part of the PPC64 version is not relevant for PPC32, but
not contradictory, so we can easily have a common function with
the PPC64 part opted out via an IS_ENABLED(CONFIG_PPC64) check.

The only thing needed is to add a version of ppc_inst_prefix() for PPC32.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/inst.h |  2 ++
 arch/powerpc/lib/inst.c | 17 +
 2 files changed, 3 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/include/asm/inst.h b/arch/powerpc/include/asm/inst.h
index 2902d4e6a363..a40c3913a4a3 100644
--- a/arch/powerpc/include/asm/inst.h
+++ b/arch/powerpc/include/asm/inst.h
@@ -102,6 +102,8 @@ static inline bool ppc_inst_equal(struct ppc_inst x, struct 
ppc_inst y)
 
 #define ppc_inst(x) ((struct ppc_inst){ .val = x })
 
+#define ppc_inst_prefix(x, y) ppc_inst(x)
+
 static inline bool ppc_inst_prefixed(struct ppc_inst x)
 {
return false;
diff --git a/arch/powerpc/lib/inst.c b/arch/powerpc/lib/inst.c
index c57b3548de37..0dff3ac2d45f 100644
--- a/arch/powerpc/lib/inst.c
+++ b/arch/powerpc/lib/inst.c
@@ -8,7 +8,6 @@
 #include 
 #include 
 
-#ifdef CONFIG_PPC64
 int probe_kernel_read_inst(struct ppc_inst *inst,
   struct ppc_inst *src)
 {
@@ -18,7 +17,7 @@ int probe_kernel_read_inst(struct ppc_inst *inst,
err = copy_from_kernel_nofault(&val, src, sizeof(val));
if (err)
return err;
-   if (get_op(val) == OP_PREFIX) {
+   if (IS_ENABLED(CONFIG_PPC64) && get_op(val) == OP_PREFIX) {
err = copy_from_kernel_nofault(&suffix, (void *)src + 4, 4);
*inst = ppc_inst_prefix(val, suffix);
} else {
@@ -26,17 +25,3 @@ int probe_kernel_read_inst(struct ppc_inst *inst,
}
return err;
 }
-#else /* !CONFIG_PPC64 */
-int probe_kernel_read_inst(struct ppc_inst *inst,
-  struct ppc_inst *src)
-{
-   unsigned int val;
-   int err;
-
-   err = copy_from_kernel_nofault(&val, src, sizeof(val));
-   if (!err)
-   *inst = ppc_inst(val);
-
-   return err;
-}
-#endif /* CONFIG_PPC64 */
-- 
2.25.0



Re: [PATCH v2 3/4] powerpc: Rename probe_kernel_read_inst()

2021-04-14 Thread Christophe Leroy




On 14/04/2021 at 07:23, Aneesh Kumar K.V wrote:

Christophe Leroy  writes:


When probe_kernel_read_inst() was created, it was to mimic
probe_kernel_read() function.

Since then, probe_kernel_read() has been renamed
copy_from_kernel_nofault().

Rename probe_kernel_read_inst() into copy_from_kernel_nofault_inst().


At first glance I read it as copy from kernel nofault instruction.
How about copy_inst_from_kernel_nofault()?


Yes good idea.

Christophe


Re: [PATCH net-next v2 1/2] of: net: pass the dst buffer to of_get_mac_address()

2021-04-14 Thread Michael Walle

Hi Dan,

Am 2021-04-14 07:33, schrieb Dan Carpenter:

url:
https://github.com/0day-ci/linux/commits/Michael-Walle/of-net-support-non-platform-devices-in-of_get_mac_address/20210406-234030
base:
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git
cc0626c2aaed8e475efdd85fa374b497a7192e35
config: x86_64-randconfig-m001-20210406 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot 
Reported-by: Dan Carpenter 

smatch warnings:
drivers/net/ethernet/xilinx/xilinx_axienet_main.c:2069 axienet_probe()
warn: passing a valid pointer to 'PTR_ERR'

vim +/PTR_ERR +2069 drivers/net/ethernet/xilinx/xilinx_axienet_main.c

522856cefaf09d Robert Hancock  2019-06-06  2060 /* Check for
Ethernet core IRQ (optional) */
522856cefaf09d Robert Hancock  2019-06-06  2061  	if (lp->eth_irq 
<= 0)

522856cefaf09d Robert Hancock  2019-06-06  2062
dev_info(&pdev->dev, "Ethernet core IRQ not defined\n");
522856cefaf09d Robert Hancock  2019-06-06  2063
8a3b7a252dca9f Daniel Borkmann 2012-01-19  2064 /* Retrieve the
MAC address */
411b125c6ace1f Michael Walle   2021-04-06  2065 ret =
of_get_mac_address(pdev->dev.of_node, mac_addr);
411b125c6ace1f Michael Walle   2021-04-06  2066 if (!ret) {
411b125c6ace1f Michael Walle   2021-04-06  2067
axienet_set_mac_address(ndev, mac_addr);
411b125c6ace1f Michael Walle   2021-04-06  2068 } else {
d05a9ed5c3a773 Robert Hancock  2019-06-06 @2069
dev_warn(&pdev->dev, "could not find MAC address property: 
%ld\n",
d05a9ed5c3a773 Robert Hancock  2019-06-06  2070  			 
PTR_ERR(mac_addr));


  ^
This should print "ret".


Thanks, this was fixed in the (now merged) v4. I forgot
to add you to that huge CC list. Sorry for that.

-michael


Re: [PATCH v1 1/2] powerpc/bitops: Use immediate operand when possible

2021-04-14 Thread Christophe Leroy




On 14/04/2021 at 14:24, Segher Boessenkool wrote:

On Wed, Apr 14, 2021 at 12:01:21PM +1000, Nicholas Piggin wrote:

Would be nice if we could let the compiler deal with it all...

static inline unsigned long lr(unsigned long *mem)
{
 unsigned long val;

 /*
  * This doesn't clobber memory but want to avoid memory operations
  * moving ahead of it
  */
 asm volatile("ldarx %0, %y1" : "=r"(val) : "Z"(*mem) : "memory");

 return val;
}


(etc.)

That can not work reliably: the compiler can put random instructions
between the larx and stcx. this way, and you then do not have guaranteed
forward progress anymore.  It can put the two in different routines
(after inlining and other interprocedural optimisations), duplicate
them, make a different number of copies of them, etc.

Nothing of that is okay if you want to guarantee forward progress on all
implementations, and also not if you want to have good performance
everywhere (or anywhere even).  Unfortunately you have to write all
larx/stcx. loops as one block of assembler, so that you know exactly
what instructions will end up in your binary.

If you don't, it will fail mysteriously after random recompilations, or
have performance degradations, etc.  You don't want to go there :-)



Could the kernel use GCC builtin atomic functions instead ?

https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
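
For reference, a complete larx/stcx. sequence written as a single asm block, in
the style Segher describes above, could look like the sketch below. This is an
illustrative example modelled on the kernel's existing atomic helpers (64-bit
ldarx/stdcx. assumed), not part of the bitops patch under review:

/*
 * Sketch only: the whole reserve/modify/store-conditional loop is one
 * asm block, so the compiler cannot schedule instructions between the
 * ldarx and the stdcx., and cannot duplicate or split the loop.
 */
static inline void atomic_or_sketch(unsigned long mask, unsigned long *p)
{
	unsigned long tmp;

	__asm__ __volatile__(
"1:	ldarx	%0,0,%2		# load and set reservation\n"
"	or	%0,%0,%3	# modify\n"
"	stdcx.	%0,0,%2		# store iff reservation still held\n"
"	bne-	1b		# otherwise retry\n"
	: "=&r" (tmp), "+m" (*p)
	: "r" (p), "r" (mask)
	: "cc", "memory");
}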




[PATCH] powerpc/papr_scm: Reduce error severity if nvdimm stats inaccessible

2021-04-14 Thread Vaibhav Jain
Currently drc_pmem_query_stats() generates a dev_err when the
"Enable Performance Information Collection" feature is disabled from
the HMC. The error is of the form below:

papr_scm ibm,persistent-memory:ibm,pmemory@44104001: Failed to query
 performance stats, Err:-10

This error message confuses users as it implies a possible problem
with the nvdimm even though it is due to a disabled feature.

So we fix this by explicitly handling the H_AUTHORITY error from the
H_SCM_PERFORMANCE_STATS hcall and generating a warning instead of an
error, saying that "Performance stats in-accessible".

Fixes: 2d02bf835e57('powerpc/papr_scm: Fetch nvdimm performance stats from 
PHYP')
Signed-off-by: Vaibhav Jain 
---
 arch/powerpc/platforms/pseries/papr_scm.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 835163f54244..9216424f8be3 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -277,6 +277,9 @@ static ssize_t drc_pmem_query_stats(struct papr_scm_priv *p,
dev_err(&p->pdev->dev,
"Unknown performance stats, Err:0x%016lX\n", ret[0]);
return -ENOENT;
+   } else if (rc == H_AUTHORITY) {
+   dev_warn(&p->pdev->dev, "Performance stats in-accessible");
+   return -EPERM;
} else if (rc != H_SUCCESS) {
dev_err(&p->pdev->dev,
"Failed to query performance stats, Err:%lld\n", rc);
-- 
2.30.2



Re: [PATCH v1 1/2] powerpc/bitops: Use immediate operand when possible

2021-04-14 Thread Segher Boessenkool
On Wed, Apr 14, 2021 at 12:01:21PM +1000, Nicholas Piggin wrote:
> Would be nice if we could let the compiler deal with it all...
> 
> static inline unsigned long lr(unsigned long *mem)
> {
> unsigned long val;
> 
> /*
>  * This doesn't clobber memory but want to avoid memory operations
>  * moving ahead of it
>  */
> asm volatile("ldarx %0, %y1" : "=r"(val) : "Z"(*mem) : "memory");
> 
> return val;
> }

(etc.)

That can not work reliably: the compiler can put random instructions
between the larx and stcx. this way, and you then do not have guaranteed
forward progress anymore.  It can put the two in different routines
(after inlining and other interprocedural optimisations), duplicate
them, make a different number of copies of them, etc.

Nothing of that is okay if you want to guarantee forward progress on all
implementations, and also not if you want to have good performance
everywhere (or anywhere even).  Unfortunately you have to write all
larx/stcx. loops as one block of assembler, so that you know exactly
what instructions will end up in your binary.

If you don't, it will fail mysteriously after random recompilations, or
have performance degradations, etc.  You don't want to go there :-)


Segher


Re: [PATCH 1/1] mm: Fix struct page layout on 32-bit systems

2021-04-14 Thread Ilias Apalodimas
On Wed, Apr 14, 2021 at 12:50:52PM +0100, Matthew Wilcox wrote:
> On Wed, Apr 14, 2021 at 10:10:44AM +0200, Jesper Dangaard Brouer wrote:
> > Yes, indeed! - And very frustrating.  It's keeping me up at night.
> > I'm dreaming about 32 vs 64 bit data structures. My fitbit stats tell
> > me that I don't sleep well with these kind of dreams ;-)
> 
> Then you're going to love this ... even with the latest patch, there's
> still a problem.  Because dma_addr_t is still 64-bit aligned _as a type_,
> that forces the union to be 64-bit aligned (as we already knew and worked
> around), but what I'd forgotten is that forces the entirety of struct
> page to be 64-bit aligned.  Which means ...
> 
> /* size: 40, cachelines: 1, members: 4 */
> /* padding: 4 */
> /* forced alignments: 1 */
> /* last cacheline: 40 bytes */
> } __attribute__((__aligned__(8)));
> 
> .. that we still have a hole!  It's just moved from being at offset 4
> to being at offset 36.
> 
> > That said, I think we need to have a quicker fix for the immediate
> > issue with 64-bit bit dma_addr on 32-bit arch and the misalignment hole
> > it leaves[3] in struct page.  In[3] you mention ppc32, does it only
> > happens on certain 32-bit archs?
> 
> AFAICT it happens on mips32, ppc32, arm32 and arc.  It doesn't happen
> on x86-32 because dma_addr_t is 32-bit aligned.
> 
> Doing this fixes it:
> 
> +++ b/include/linux/types.h
> @@ -140,7 +140,7 @@ typedef u64 blkcnt_t;
>   * so they don't care about the size of the actual bus addresses.
>   */
>  #ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
> -typedef u64 dma_addr_t;
> +typedef u64 __attribute__((aligned(sizeof(void *)))) dma_addr_t;
>  #else
>  typedef u32 dma_addr_t;
>  #endif
> 
> > I'm seriously considering removing page_pool's support for doing/keeping
> > DMA-mappings on 32-bit arch's.  AFAIK only a single driver use this.
> 
> ... if you're going to do that, then we don't need to do this.

FWIW I already proposed that to Matthew in private a few days ago...
I am not even sure the AM572x has that support.  I'd much prefer getting rid
of it as well, instead of overcomplicating the struct for a device no one is
going to need.

Cheers
/Ilias


Re: [PATCH 1/1] mm: Fix struct page layout on 32-bit systems

2021-04-14 Thread Matthew Wilcox
On Wed, Apr 14, 2021 at 10:10:44AM +0200, Jesper Dangaard Brouer wrote:
> Yes, indeed! - And very frustrating.  It's keeping me up at night.
> I'm dreaming about 32 vs 64 bit data structures. My fitbit stats tell
> me that I don't sleep well with these kind of dreams ;-)

Then you're going to love this ... even with the latest patch, there's
still a problem.  Because dma_addr_t is still 64-bit aligned _as a type_,
that forces the union to be 64-bit aligned (as we already knew and worked
around), but what I'd forgotten is that forces the entirety of struct
page to be 64-bit aligned.  Which means ...

/* size: 40, cachelines: 1, members: 4 */
/* padding: 4 */
/* forced alignments: 1 */
/* last cacheline: 40 bytes */
} __attribute__((__aligned__(8)));

.. that we still have a hole!  It's just moved from being at offset 4
to being at offset 36.

> That said, I think we need to have a quicker fix for the immediate
> issue with 64-bit bit dma_addr on 32-bit arch and the misalignment hole
> it leaves[3] in struct page.  In[3] you mention ppc32, does it only
> happens on certain 32-bit archs?

AFAICT it happens on mips32, ppc32, arm32 and arc.  It doesn't happen
on x86-32 because dma_addr_t is 32-bit aligned.

Doing this fixes it:

+++ b/include/linux/types.h
@@ -140,7 +140,7 @@ typedef u64 blkcnt_t;
  * so they don't care about the size of the actual bus addresses.
  */
 #ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
-typedef u64 dma_addr_t;
+typedef u64 __attribute__((aligned(sizeof(void *)))) dma_addr_t;
 #else
 typedef u32 dma_addr_t;
 #endif

> I'm seriously considering removing page_pool's support for doing/keeping
> DMA-mappings on 32-bit arch's.  AFAIK only a single driver use this.

... if you're going to do that, then we don't need to do this.
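
To see the effect being described, a small userspace sketch (illustrative names,
not the kernel's struct page) shows how capping the typedef's alignment to that
of a pointer changes both the member offset and the struct alignment when built
for a 32-bit ABI that naturally aligns u64 to 8 bytes (e.g. ARM EABI or ppc32):

/* Sketch: compare a naturally aligned u64 member with one whose
 * typedef alignment is capped at sizeof(void *), as in the hunk above. */
#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

typedef uint64_t dma64_natural_t;
typedef uint64_t __attribute__((aligned(sizeof(void *)))) dma64_capped_t;

struct page_like_natural { unsigned long flags; dma64_natural_t dma_addr; };
struct page_like_capped  { unsigned long flags; dma64_capped_t dma_addr; };

int main(void)
{
	printf("natural: size %zu, align %zu, dma_addr offset %zu\n",
	       sizeof(struct page_like_natural),
	       _Alignof(struct page_like_natural),
	       offsetof(struct page_like_natural, dma_addr));
	printf("capped:  size %zu, align %zu, dma_addr offset %zu\n",
	       sizeof(struct page_like_capped),
	       _Alignof(struct page_like_capped),
	       offsetof(struct page_like_capped, dma_addr));
	return 0;
}

On such a 32-bit target the "natural" variant reports a 4-byte hole before
dma_addr and 8-byte struct alignment, while the "capped" variant places the
member at offset 4 with 4-byte struct alignment.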


[PATCH v5] powerpc/traps: Enhance readability for trap types

2021-04-14 Thread Xiongwei Song
From: Xiongwei Song 

Define macros to list the ppc interrupt types in interrupt.h, and replace
references to the trap hex values with these macros.

The hex numbers are taken from arch/powerpc/kernel/exceptions-64e.S,
arch/powerpc/kernel/exceptions-64s.S, arch/powerpc/kernel/head_*.S,
arch/powerpc/kernel/head_booke.h and arch/powerpc/include/asm/kvm_asm.h.

v4-v5:
* Delete unnecessary #ifdef.
* Move INTERRUPT_* macros to interrupt.h, classified by CPU type.
  Drop traps.h.
* Directly define INTERRUPT_MACHINE_CHECK with 0x200.

v3-v4:
Fix compile issue:
arch/powerpc/kernel/process.c:1473:14: error: 'INTERRUPT_MACHINE_CHECK' 
undeclared (first use in this function); did you mean 'TAINT_MACHINE_CHECK'?
Reported-by: kernel test robot 

v2-v3:
Change the prefix of the trap macros to INTERRUPT_; the previous prefix
was TRAP_, which is not precise. This was suggested by Segher Boessenkool
and Nicholas Piggin.

v1-v2:
Define more trap macros to replace more trap hex values in the code, not just
for the __show_regs function. This was suggested by Christophe Leroy.

Signed-off-by: Xiongwei Song 
---
 arch/powerpc/include/asm/interrupt.h  | 48 +--
 arch/powerpc/kernel/fadump.c  |  2 +-
 arch/powerpc/kernel/interrupt.c   |  2 +-
 arch/powerpc/kernel/process.c |  4 ++-
 arch/powerpc/kernel/traps.c   |  6 ++--
 arch/powerpc/kexec/crash.c|  3 +-
 arch/powerpc/mm/book3s64/hash_utils.c |  4 +--
 arch/powerpc/mm/fault.c   | 16 -
 arch/powerpc/perf/core-book3s.c   |  5 +--
 arch/powerpc/xmon/xmon.c  | 20 +++
 10 files changed, 81 insertions(+), 29 deletions(-)

diff --git a/arch/powerpc/include/asm/interrupt.h 
b/arch/powerpc/include/asm/interrupt.h
index 05e7fc4ffb50..86daf1242088 100644
--- a/arch/powerpc/include/asm/interrupt.h
+++ b/arch/powerpc/include/asm/interrupt.h
@@ -9,6 +9,46 @@
 #include 
 #include 
 
+/* BookE/4xx */
+#define INTERRUPT_CRITICAL_INPUT  0x100
+
+/* BookE */
+#define INTERRUPT_DEBUG   0xd00
+#ifdef CONFIG_BOOKE
+#define INTERRUPT_PERFMON 0x260
+#define INTERRUPT_DOORBELL0x280
+#endif
+
+/* BookS/4xx/8xx */
+#define INTERRUPT_MACHINE_CHECK   0x200
+
+/* BookS/8xx */
+#define INTERRUPT_SYSTEM_RESET0x100
+
+/* BookS */
+#define INTERRUPT_DATA_SEGMENT0x380
+#define INTERRUPT_INST_SEGMENT0x480
+#define INTERRUPT_TRACE   0xd00
+#define INTERRUPT_H_DATA_STORAGE  0xe00
+#define INTERRUPT_H_FAC_UNAVAIL   0xf80
+#ifdef CONFIG_PPC_BOOK3S
+#define INTERRUPT_DOORBELL0xa00
+#define INTERRUPT_PERFMON 0xf00
+#endif
+
+/* BookE/BookS/4xx/8xx */
+#define INTERRUPT_DATA_STORAGE0x300
+#define INTERRUPT_INST_STORAGE0x400
+#define INTERRUPT_ALIGNMENT   0x600
+#define INTERRUPT_PROGRAM 0x700
+#define INTERRUPT_SYSCALL 0xc00
+
+/* BookE/BookS/44x */
+#define INTERRUPT_FP_UNAVAIL  0x800
+
+/* BookE/BookS/44x/8xx */
+#define INTERRUPT_DECREMENTER 0x900
+
 static inline void nap_adjust_return(struct pt_regs *regs)
 {
 #ifdef CONFIG_PPC_970_NAP
@@ -70,7 +110,7 @@ static inline void interrupt_enter_prepare(struct pt_regs 
*regs, struct interrup
 * CT_WARN_ON comes here via program_check_exception,
 * so avoid recursion.
 */
-   if (TRAP(regs) != 0x700)
+   if (TRAP(regs) != INTERRUPT_PROGRAM)
CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
}
 #endif
@@ -175,7 +215,8 @@ static inline void interrupt_nmi_enter_prepare(struct 
pt_regs *regs, struct inte
/* Don't do any per-CPU operations until interrupt state is fixed */
 #endif
/* Allow DEC and PMI to be traced when they are soft-NMI */
-   if (TRAP(regs) != 0x900 && TRAP(regs) != 0xf00 && TRAP(regs) != 0x260) {
+   if (TRAP(regs) != INTERRUPT_DECREMENTER &&
+   TRAP(regs) != INTERRUPT_PERFMON) {
state->ftrace_enabled = this_cpu_get_ftrace_enabled();
this_cpu_set_ftrace_enabled(0);
}
@@ -204,7 +245,8 @@ static inline void interrupt_nmi_exit_prepare(struct 
pt_regs *regs, struct inter
 */
 
 #ifdef CONFIG_PPC64
-   if (TRAP(regs) != 0x900 && TRAP(regs) != 0xf00 && TRAP(regs) != 0x260)
+   if (TRAP(regs) != INTERRUPT_DECREMENTER &&
+   TRAP(regs) != INTERRUPT_PERFMON)
this_cpu_set_ftrace_enabled(state->ftrace_enabled);
 
 #ifdef CONFIG_PPC_BOOK3S_64
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index eddf362caedc..b55b4c23f3b6 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -728,7 +728,7 @@ void crash_fadump(struct pt_regs *regs, const char *str)
 * If we came in via system reset, wait a while for the secondary
 * CPUs to enter.
 */
-   if (TRAP(&(fdh->regs)) == 0x100) {
+   if (TRAP(&(fdh->regs)) == INTERRUPT_SYSTEM_RESET) {
msecs = CRASH_TIMEOUT;
while 

[PATCH v2] init: consolidate trap_init()

2021-04-14 Thread Jisheng Zhang
Many architectures implement trap_init() as a NOP. Since there is
no generic default for trap_init(), this empty stub is duplicated among
these architectures. Provide a generic but weak NOP implementation
so that the empty stubs of trap_init() in these architectures can be dropped.

The alpha, microblaze and sparc32 platforms have a real trap_init()
implementation, but the __init marker is missing, so add it to these three
platforms.

Signed-off-by: Jisheng Zhang 
---
Since v1:
 - add __init marker to trap_init() for alpha, microblaze and sparc32
 - adjust the generic weak NOP trap_init() location so that it sits together
   with the other NOP implementations

 arch/alpha/kernel/traps.c  |  2 +-
 arch/arc/kernel/traps.c|  5 -
 arch/arm/kernel/traps.c|  5 -
 arch/h8300/kernel/traps.c  | 13 -
 arch/hexagon/kernel/traps.c|  4 
 arch/microblaze/kernel/traps.c |  2 +-
 arch/nds32/kernel/traps.c  |  5 -
 arch/nios2/kernel/traps.c  |  5 -
 arch/openrisc/kernel/traps.c   |  5 -
 arch/parisc/kernel/traps.c |  4 
 arch/powerpc/kernel/traps.c|  5 -
 arch/riscv/kernel/traps.c  |  5 -
 arch/sparc/kernel/traps_32.c   |  2 +-
 arch/um/kernel/trap.c  |  4 
 init/main.c|  2 ++
 15 files changed, 5 insertions(+), 63 deletions(-)

diff --git a/arch/alpha/kernel/traps.c b/arch/alpha/kernel/traps.c
index 921d4b6e4d95..96b203199c6c 100644
--- a/arch/alpha/kernel/traps.c
+++ b/arch/alpha/kernel/traps.c
@@ -973,7 +973,7 @@ do_entUnaUser(void __user * va, unsigned long opcode,
return;
 }
 
-void
+void __init
 trap_init(void)
 {
/* Tell PAL-code what global pointer we want in the kernel.  */
diff --git a/arch/arc/kernel/traps.c b/arch/arc/kernel/traps.c
index 57235e5c0cea..6b83e3f2b41c 100644
--- a/arch/arc/kernel/traps.c
+++ b/arch/arc/kernel/traps.c
@@ -20,11 +20,6 @@
 #include 
 #include 
 
-void __init trap_init(void)
-{
-   return;
-}
-
 void die(const char *str, struct pt_regs *regs, unsigned long address)
 {
show_kernel_fault_diag(str, regs, address);
diff --git a/arch/arm/kernel/traps.c b/arch/arm/kernel/traps.c
index 17d5a785df28..9baccef20392 100644
--- a/arch/arm/kernel/traps.c
+++ b/arch/arm/kernel/traps.c
@@ -780,11 +780,6 @@ void abort(void)
panic("Oops failed to kill thread");
 }
 
-void __init trap_init(void)
-{
-   return;
-}
-
 #ifdef CONFIG_KUSER_HELPERS
 static void __init kuser_init(void *vectors)
 {
diff --git a/arch/h8300/kernel/traps.c b/arch/h8300/kernel/traps.c
index 5d8b969cd8f3..c3a3ebf77fbb 100644
--- a/arch/h8300/kernel/traps.c
+++ b/arch/h8300/kernel/traps.c
@@ -30,19 +30,6 @@
 
 static DEFINE_SPINLOCK(die_lock);
 
-/*
- * this must be called very early as the kernel might
- * use some instruction that are emulated on the 060
- */
-
-void __init base_trap_init(void)
-{
-}
-
-void __init trap_init(void)
-{
-}
-
 asmlinkage void set_esp0(unsigned long ssp)
 {
current->thread.esp0 = ssp;
diff --git a/arch/hexagon/kernel/traps.c b/arch/hexagon/kernel/traps.c
index 904134b37232..edfc35dafeb1 100644
--- a/arch/hexagon/kernel/traps.c
+++ b/arch/hexagon/kernel/traps.c
@@ -28,10 +28,6 @@
 #define TRAP_SYSCALL   1
 #define TRAP_DEBUG 0xdb
 
-void __init trap_init(void)
-{
-}
-
 #ifdef CONFIG_GENERIC_BUG
 /* Maybe should resemble arch/sh/kernel/traps.c ?? */
 int is_valid_bugaddr(unsigned long addr)
diff --git a/arch/microblaze/kernel/traps.c b/arch/microblaze/kernel/traps.c
index 94b6fe93147d..7c15704fe56e 100644
--- a/arch/microblaze/kernel/traps.c
+++ b/arch/microblaze/kernel/traps.c
@@ -18,7 +18,7 @@
 #include 
 #include 
 
-void trap_init(void)
+void __init trap_init(void)
 {
__enable_hw_exceptions();
 }
diff --git a/arch/nds32/kernel/traps.c b/arch/nds32/kernel/traps.c
index ee0d9ae192a5..f06421c645af 100644
--- a/arch/nds32/kernel/traps.c
+++ b/arch/nds32/kernel/traps.c
@@ -183,11 +183,6 @@ void __pgd_error(const char *file, int line, unsigned long 
val)
 }
 
 extern char *exception_vector, *exception_vector_end;
-void __init trap_init(void)
-{
-   return;
-}
-
 void __init early_trap_init(void)
 {
unsigned long ivb = 0;
diff --git a/arch/nios2/kernel/traps.c b/arch/nios2/kernel/traps.c
index b172da4eb1a9..596986a74a26 100644
--- a/arch/nios2/kernel/traps.c
+++ b/arch/nios2/kernel/traps.c
@@ -105,11 +105,6 @@ void show_stack(struct task_struct *task, unsigned long 
*stack,
printk("%s\n", loglvl);
 }
 
-void __init trap_init(void)
-{
-   /* Nothing to do here */
-}
-
 /* Breakpoint handler */
 asmlinkage void breakpoint_c(struct pt_regs *fp)
 {
diff --git a/arch/openrisc/kernel/traps.c b/arch/openrisc/kernel/traps.c
index 4d61333c2623..aa1e709405ac 100644
--- a/arch/openrisc/kernel/traps.c
+++ b/arch/openrisc/kernel/traps.c
@@ -231,11 +231,6 @@ void unhandled_exception(struct pt_regs *regs, int ea, int 
vector)
die("Oops", regs, 9);
 }
 
-void __init trap_init(void)
-{
-   /* Nothing needs to be done */
-}
-
 

Re: [PATCH] init: consolidate trap_init()

2021-04-14 Thread Jisheng Zhang
On Wed, 14 Apr 2021 17:27:57 +0800
Jisheng Zhang  wrote:

> 
> On Wed, 14 Apr 2021 11:10:42 +0200
> Christophe Leroy  wrote:
> 
> >
> > On 14/04/2021 at 10:58, Jisheng Zhang wrote:
> > > Many architectures implement the trap_init() as NOP, since there is
> > > no such default for trap_init(), this empty stub is duplicated among
> > > these architectures. Provide a generic but weak NOP implementation
> > > to drop the empty stubs of trap_init() in these architectures.  
> >
> > You define the weak function in the __init section.
> >
> > Most but not all architectures had it in __init section.
> >
> > And the remaining ones may not be defined in __init section. For instance 
> > look at the one in alpha
> > architecture.
> >
> > Have you checked that it is not a problem ? It would be good to say 
> > something about it in the commit
> > description.  
> 
> For those non-nop platforms, I can only test x86/arm64, but both have the
> __init marker. I'm not sure whether this is a problem for alpha etc. Maybe
> I can check which section trap_init() sits in. Or, to avoid any possible
> regression, I can add the __init marker to the remaining ones that lack it
> in preparation patches.
> 

Hi,

I found that only three platforms don't have the __init marker for trap_init();
I will add the __init marker in three preparation patches in the new version.

thanks


Re: [PATCH] init: consolidate trap_init()

2021-04-14 Thread Jisheng Zhang
On Wed, 14 Apr 2021 11:10:42 +0200
Christophe Leroy  wrote:

> 
> On 14/04/2021 at 10:58, Jisheng Zhang wrote:
> > Many architectures implement the trap_init() as NOP, since there is
> > no such default for trap_init(), this empty stub is duplicated among
> > these architectures. Provide a generic but weak NOP implementation
> > to drop the empty stubs of trap_init() in these architectures.  
> 
> You define the weak function in the __init section.
> 
> Most but not all architectures had it in __init section.
> 
> And the remaining ones may not be defined in __init section. For instance 
> look at the one in alpha
> architecture.
> 
> Have you checked that it is not a problem ? It would be good to say something 
> about it in the commit
> description.

For those non-nop platforms, I can only test x86/arm64, but both have the
__init marker. I'm not sure whether this is a problem for alpha etc. Maybe
I can check which section trap_init() sits in. Or, to avoid any possible
regression, I can add the __init marker to the remaining ones that lack it
in preparation patches.

> 
> 
> >
> > Signed-off-by: Jisheng Zhang 
> > ---
> >   arch/arc/kernel/traps.c  |  5 -
> >   arch/arm/kernel/traps.c  |  5 -
> >   arch/h8300/kernel/traps.c| 13 -
> >   arch/hexagon/kernel/traps.c  |  4 
> >   arch/nds32/kernel/traps.c|  5 -
> >   arch/nios2/kernel/traps.c|  5 -
> >   arch/openrisc/kernel/traps.c |  5 -
> >   arch/parisc/kernel/traps.c   |  4 
> >   arch/powerpc/kernel/traps.c  |  5 -
> >   arch/riscv/kernel/traps.c|  5 -
> >   arch/um/kernel/trap.c|  4 
> >   init/main.c  |  2 ++
> >   12 files changed, 2 insertions(+), 60 deletions(-)
> >
> > diff --git a/init/main.c b/init/main.c
> > index 53b278845b88..4bdbe2928530 100644
> > --- a/init/main.c
> > +++ b/init/main.c
> > @@ -790,6 +790,8 @@ static inline void initcall_debug_enable(void)
> >   }
> >   #endif
> >
> > +void __init __weak trap_init(void) { }
> > +  
> 
> I think in a C file we don't try to save space as much as in a header file.
> 

This follows the style of most of the weak NOP implementations in init/main.c,
to keep the style unified within the file. I'm not sure which is better.

> I would prefer something like:
> 
> 
> void __init __weak trap_init(void)
> {
> }
> 
> 
> >   /* Report memory auto-initialization states for this boot. */
> >   static void __init report_meminit(void)
> >   {
> >  



Re: [PATCH v3] powerpc/papr_scm: Implement support for H_SCM_FLUSH hcall

2021-04-14 Thread Vaibhav Jain
"Aneesh Kumar K.V"  writes:

> Vaibhav Jain  writes:
>
>> Hi Shiva,
>>
>> Apologies for a late review but something just caught my eye while
>> working on a different patch.
>>
>> Shivaprasad G Bhat  writes:
>>
>>> Add support for ND_REGION_ASYNC capability if the device tree
>>> indicates 'ibm,hcall-flush-required' property in the NVDIMM node.
>>> Flush is done by issuing H_SCM_FLUSH hcall to the hypervisor.
>>>
>>> If the flush request failed, the hypervisor is expected to
>>> to reflect the problem in the subsequent nvdimm H_SCM_HEALTH call.
>>>
>>> This patch prevents mmap of namespaces with MAP_SYNC flag if the
>>> nvdimm requires an explicit flush[1].
>>>
>>> References:
>>> [1] 
>>> https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/ndctl.py.data/map_sync.c
>>>
>>> Signed-off-by: Shivaprasad G Bhat 
>>> ---
>>> v2 - https://www.spinics.net/lists/kvm-ppc/msg18799.html
>>> Changes from v2:
>>>- Fixed the commit message.
>>>- Add dev_dbg before the H_SCM_FLUSH hcall
>>>
>>> v1 - https://www.spinics.net/lists/kvm-ppc/msg18272.html
>>> Changes from v1:
>>>- Hcall semantics finalized, all changes are to accomodate them.
>>>
>>>  Documentation/powerpc/papr_hcalls.rst |   14 ++
>>>  arch/powerpc/include/asm/hvcall.h |3 +-
>>>  arch/powerpc/platforms/pseries/papr_scm.c |   40 
>>> +
>>>  3 files changed, 56 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/Documentation/powerpc/papr_hcalls.rst 
>>> b/Documentation/powerpc/papr_hcalls.rst
>>> index 48fcf1255a33..648f278eea8f 100644
>>> --- a/Documentation/powerpc/papr_hcalls.rst
>>> +++ b/Documentation/powerpc/papr_hcalls.rst
>>> @@ -275,6 +275,20 @@ Health Bitmap Flags:
>>>  Given a DRC Index collect the performance statistics for NVDIMM and copy 
>>> them
>>>  to the resultBuffer.
>>>  
>>> +**H_SCM_FLUSH**
>>> +
>>> +| Input: *drcIndex, continue-token*
>>> +| Out: *continue-token*
>>> +| Return Value: *H_SUCCESS, H_Parameter, H_P2, H_BUSY*
>>> +
>>> +Given a DRC Index Flush the data to backend NVDIMM device.
>>> +
>>> +The hcall returns H_BUSY when the flush takes longer time and the hcall 
>>> needs
>>> +to be issued multiple times in order to be completely serviced. The
>>> +*continue-token* from the output to be passed in the argument list of
>>> +subsequent hcalls to the hypervisor until the hcall is completely serviced
>>> +at which point H_SUCCESS or other error is returned by the hypervisor.
>>> +
>> Does the hcall semantic also include measures to trigger a barrier/fence
>> (like pm_wmb()) so that all the stores before the hcall are gauranteed
>> to have hit the pmem controller ?
>
> It is upto the hypervisor to implement the right sequence to ensure the
> guarantee. The hcall doesn't expect the store to reach the platform
> buffers.

Since the async_flush function is also called to perform a deep_flush from
libnvdimm, and the hcall doesn't guarantee that stores reach the platform
buffers, papr_scm_pmem_flush() should issue pm_wmb() before the hcall.

This would ensure that papr_scm_pmem_flush() semantics are consistent with
generic_nvdimm_flush().

Also, adding the statement "The hcall doesn't expect the store to reach
the platform buffers" to the hcall documentation would be good to have.

>
>
>>
>> If not then the papr_scm_pmem_flush() introduced below should issue a
>> fence like pm_wmb() before issuing the flush hcall.
>>
>> If yes then this behaviour should also be documented with the hcall
>> semantics above.
>>
>
> IIUC fdatasync on the backend file is enough to ensure the
> guarantee. Such a request will have REQ_PREFLUSH and REQ_FUA which will
> do the necessary barrier on the backing device holding the backend file.
>
Right, but that's assuming the nvdimm is backed by a regular file in the
hypervisor, which may not always be the case.


> -aneesh

-- 
Cheers
~ Vaibhav


Re: [PATCH] init: consolidate trap_init()

2021-04-14 Thread Christophe Leroy




On 14/04/2021 at 10:58, Jisheng Zhang wrote:

Many architectures implement trap_init() as a NOP. Since there is
no generic default for trap_init(), this empty stub is duplicated among
these architectures. Provide a generic but weak NOP implementation
so that the empty stubs of trap_init() in these architectures can be dropped.


You define the weak function in the __init section.

Most but not all architectures had it in __init section.

And the remaining ones may not be defined in __init section. For instance look at the one in alpha 
architecture.


Have you checked that it is not a problem ? It would be good to say something about it in the commit 
description.





Signed-off-by: Jisheng Zhang 
---
  arch/arc/kernel/traps.c  |  5 -
  arch/arm/kernel/traps.c  |  5 -
  arch/h8300/kernel/traps.c| 13 -
  arch/hexagon/kernel/traps.c  |  4 
  arch/nds32/kernel/traps.c|  5 -
  arch/nios2/kernel/traps.c|  5 -
  arch/openrisc/kernel/traps.c |  5 -
  arch/parisc/kernel/traps.c   |  4 
  arch/powerpc/kernel/traps.c  |  5 -
  arch/riscv/kernel/traps.c|  5 -
  arch/um/kernel/trap.c|  4 
  init/main.c  |  2 ++
  12 files changed, 2 insertions(+), 60 deletions(-)

diff --git a/init/main.c b/init/main.c
index 53b278845b88..4bdbe2928530 100644
--- a/init/main.c
+++ b/init/main.c
@@ -790,6 +790,8 @@ static inline void initcall_debug_enable(void)
  }
  #endif
  
+void __init __weak trap_init(void) { }
+

I think in a C file we don't try to save space as much as in a header file.

I would prefer something like:


void __init __weak trap_init(void)
{
}



  /* Report memory auto-initialization states for this boot. */
  static void __init report_meminit(void)
  {



[PATCH v4 9/9] powerpc/mm: Enable move pmd/pud

2021-04-14 Thread Aneesh Kumar K.V
mremap HAVE_MOVE_PMD/PUD optimization time comparison for 1GB region:
1GB mremap - Source PTE-aligned, Destination PTE-aligned
  mremap time:  1127034ns
1GB mremap - Source PMD-aligned, Destination PMD-aligned
  mremap time:   508817ns
1GB mremap - Source PUD-aligned, Destination PUD-aligned
  mremap time:23046ns

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/platforms/Kconfig.cputype | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/platforms/Kconfig.cputype 
b/arch/powerpc/platforms/Kconfig.cputype
index 3ce907523b1e..2e666e569fdf 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -97,6 +97,8 @@ config PPC_BOOK3S_64
select PPC_HAVE_PMU_SUPPORT
select SYS_SUPPORTS_HUGETLBFS
select HAVE_ARCH_TRANSPARENT_HUGEPAGE
+   select HAVE_MOVE_PMD
+   select HAVE_MOVE_PUD
select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
select ARCH_SUPPORTS_NUMA_BALANCING
select IRQ_WORK
-- 
2.30.2



[PATCH v4 8/9] mm/mremap: Allow arch runtime override

2021-04-14 Thread Aneesh Kumar K.V
Architectures like ppc64 support faster mremap only with radix
translation. Hence allow a runtime check w.r.t support for fast mremap.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/tlb.h |  6 ++
 mm/mremap.c| 15 ++-
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index 160422a439aa..058918a7cd3c 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -83,5 +83,11 @@ static inline int mm_is_thread_local(struct mm_struct *mm)
 }
 #endif
 
+#define arch_supports_page_tables_move arch_supports_page_tables_move
+static inline bool arch_supports_page_tables_move(void)
+{
+   return radix_enabled();
+}
+
 #endif /* __KERNEL__ */
 #endif /* __ASM_POWERPC_TLB_H */
diff --git a/mm/mremap.c b/mm/mremap.c
index 7ac1df8e6d51..6922e161928c 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -25,7 +25,7 @@
 #include 
 
 #include 
-#include 
+#include 
 #include 
 
 #include "internal.h"
@@ -221,6 +221,15 @@ static inline void flush_pte_tlb_pwc_range(struct 
vm_area_struct *vma,
 }
 #endif
 
+#ifndef arch_supports_page_tables_move
+#define arch_supports_page_tables_move arch_supports_page_tables_move
+static inline bool arch_supports_page_tables_move(void)
+{
+   return IS_ENABLED(CONFIG_HAVE_MOVE_PMD) ||
+   IS_ENABLED(CONFIG_HAVE_MOVE_PUD);
+}
+#endif
+
 #ifdef CONFIG_HAVE_MOVE_PMD
 static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
  unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd)
@@ -229,6 +238,8 @@ static bool move_normal_pmd(struct vm_area_struct *vma, 
unsigned long old_addr,
struct mm_struct *mm = vma->vm_mm;
pmd_t pmd;
 
+   if (!arch_supports_page_tables_move())
+   return false;
/*
 * The destination pmd shouldn't be established, free_pgtables()
 * should have released it.
@@ -295,6 +306,8 @@ static bool move_normal_pud(struct vm_area_struct *vma, 
unsigned long old_addr,
struct mm_struct *mm = vma->vm_mm;
pud_t pud;
 
+   if (!arch_supports_page_tables_move())
+   return false;
/*
 * The destination pud shouldn't be established, free_pgtables()
 * should have released it.
-- 
2.30.2



[PATCH v4 7/9] mm/mremap: Move TLB flush outside page table lock

2021-04-14 Thread Aneesh Kumar K.V
Move the TLB flush outside the page table lock so that the kernel does
less work while holding the page table lock. Releasing the ptl while old
TLB contents are still valid behaves as if such accesses happened before
the level-3 or level-2 entry update.

Signed-off-by: Aneesh Kumar K.V 
---
 mm/mremap.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 0e7b11daafee..7ac1df8e6d51 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -259,7 +259,7 @@ static bool move_normal_pmd(struct vm_area_struct *vma, 
unsigned long old_addr,
 * We don't have to worry about the ordering of src and dst
 * ptlocks because exclusive mmap_lock prevents deadlock.
 */
-   old_ptl = pmd_lock(vma->vm_mm, old_pmd);
+   old_ptl = pmd_lock(mm, old_pmd);
new_ptl = pmd_lockptr(mm, new_pmd);
if (new_ptl != old_ptl)
spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
@@ -271,11 +271,11 @@ static bool move_normal_pmd(struct vm_area_struct *vma, 
unsigned long old_addr,
VM_BUG_ON(!pmd_none(*new_pmd));
pmd_populate(mm, new_pmd, (pgtable_t)pmd_page_vaddr(pmd));
 
-   flush_pte_tlb_pwc_range(vma, old_addr, old_addr + PMD_SIZE, true);
if (new_ptl != old_ptl)
spin_unlock(new_ptl);
spin_unlock(old_ptl);
 
+   flush_pte_tlb_pwc_range(vma, old_addr, old_addr + PMD_SIZE, true);
return true;
 }
 #else
@@ -306,7 +306,7 @@ static bool move_normal_pud(struct vm_area_struct *vma, 
unsigned long old_addr,
 * We don't have to worry about the ordering of src and dst
 * ptlocks because exclusive mmap_lock prevents deadlock.
 */
-   old_ptl = pud_lock(vma->vm_mm, old_pud);
+   old_ptl = pud_lock(mm, old_pud);
new_ptl = pud_lockptr(mm, new_pud);
if (new_ptl != old_ptl)
spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
@@ -318,11 +318,11 @@ static bool move_normal_pud(struct vm_area_struct *vma, 
unsigned long old_addr,
VM_BUG_ON(!pud_none(*new_pud));
 
pud_populate(mm, new_pud, (pmd_t *)pud_page_vaddr(pud));
-   flush_pte_tlb_pwc_range(vma, old_addr, old_addr + PUD_SIZE, true);
if (new_ptl != old_ptl)
spin_unlock(new_ptl);
spin_unlock(old_ptl);
 
+   flush_pte_tlb_pwc_range(vma, old_addr, old_addr + PUD_SIZE, true);
return true;
 }
 #else
-- 
2.30.2



[PATCH v4 6/9] mm/mremap: Use range flush that does TLB and page walk cache flush

2021-04-14 Thread Aneesh Kumar K.V
Some architectures have the concept of a page walk cache, which needs
to be flushed when updating higher levels of page tables. A fast mremap
that moves page table pages instead of copying pte entries should flush
the page walk cache, since the old translation cache is no longer valid.

Add new helper flush_pte_tlb_pwc_range() which invalidates both TLB and
page walk cache where TLB entries are mapped with page size PAGE_SIZE.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/tlbflush.h | 11 +++
 mm/mremap.c   | 15 +--
 2 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush.h 
b/arch/powerpc/include/asm/book3s/64/tlbflush.h
index f9f8a3a264f7..c236b66f490b 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush.h
@@ -80,6 +80,17 @@ static inline void flush_hugetlb_tlb_range(struct 
vm_area_struct *vma,
return flush_hugetlb_tlb_pwc_range(vma, start, end, false);
 }
 
+#define flush_pte_tlb_pwc_range flush_tlb_pwc_range
+static inline void flush_pte_tlb_pwc_range(struct vm_area_struct *vma,
+  unsigned long start, unsigned long 
end,
+  bool also_pwc)
+{
+   if (radix_enabled())
+   return radix__flush_tlb_pwc_range_psize(vma->vm_mm, start,
+   end, mmu_virtual_psize, 
also_pwc);
+   return hash__flush_tlb_range(vma, start, end);
+}
+
 static inline void flush_tlb_range(struct vm_area_struct *vma,
   unsigned long start, unsigned long end)
 {
diff --git a/mm/mremap.c b/mm/mremap.c
index 574287f9bb39..0e7b11daafee 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -210,6 +210,17 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t 
*old_pmd,
drop_rmap_locks(vma);
 }
 
+#ifndef flush_pte_tlb_pwc_range
+#define flush_pte_tlb_pwc_range flush_pte_tlb_pwc_range
+static inline void flush_pte_tlb_pwc_range(struct vm_area_struct *vma,
+  unsigned long start,
+  unsigned long end,
+  bool also_pwc)
+{
+   return flush_tlb_range(vma, start, end);
+}
+#endif
+
 #ifdef CONFIG_HAVE_MOVE_PMD
 static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
  unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd)
@@ -260,7 +271,7 @@ static bool move_normal_pmd(struct vm_area_struct *vma, 
unsigned long old_addr,
VM_BUG_ON(!pmd_none(*new_pmd));
pmd_populate(mm, new_pmd, (pgtable_t)pmd_page_vaddr(pmd));
 
-   flush_tlb_range(vma, old_addr, old_addr + PMD_SIZE);
+   flush_pte_tlb_pwc_range(vma, old_addr, old_addr + PMD_SIZE, true);
if (new_ptl != old_ptl)
spin_unlock(new_ptl);
spin_unlock(old_ptl);
@@ -307,7 +318,7 @@ static bool move_normal_pud(struct vm_area_struct *vma, 
unsigned long old_addr,
VM_BUG_ON(!pud_none(*new_pud));
 
pud_populate(mm, new_pud, (pmd_t *)pud_page_vaddr(pud));
-   flush_tlb_range(vma, old_addr, old_addr + PUD_SIZE);
+   flush_pte_tlb_pwc_range(vma, old_addr, old_addr + PUD_SIZE, true);
if (new_ptl != old_ptl)
spin_unlock(new_ptl);
spin_unlock(old_ptl);
-- 
2.30.2



[PATCH v4 4/9] powerpc/mm/book3s64: Fix possible build error

2021-04-14 Thread Aneesh Kumar K.V
Update _tlbiel_pid() such that we can avoid build errors like below when
using this function in other places.

arch/powerpc/mm/book3s64/radix_tlb.c: In function 
‘__radix__flush_tlb_range_psize’:
arch/powerpc/mm/book3s64/radix_tlb.c:114:2: warning: ‘asm’ operand 3 probably 
does not match constraints
  114 |  asm volatile(PPC_TLBIEL(%0, %4, %3, %2, %1)
  |  ^~~
arch/powerpc/mm/book3s64/radix_tlb.c:114:2: error: impossible constraint in 
‘asm’
make[4]: *** [scripts/Makefile.build:271: arch/powerpc/mm/book3s64/radix_tlb.o] Error 1

With this fix, we can also drop the __always_inline in
__radix__flush_tlb_range_psize, which was added by commit e12d6d7d46a6
("powerpc/mm/radix: mark __radix__flush_tlb_range_psize() as __always_inline").

Reviewed-by: Christophe Leroy 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/book3s64/radix_tlb.c | 26 +-
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c 
b/arch/powerpc/mm/book3s64/radix_tlb.c
index 409e61210789..817a02ef6032 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -291,22 +291,30 @@ static inline void fixup_tlbie_lpid(unsigned long lpid)
 /*
  * We use 128 set in radix mode and 256 set in hpt mode.
  */
-static __always_inline void _tlbiel_pid(unsigned long pid, unsigned long ric)
+static inline void _tlbiel_pid(unsigned long pid, unsigned long ric)
 {
int set;
 
asm volatile("ptesync": : :"memory");
 
-   /*
-* Flush the first set of the TLB, and if we're doing a RIC_FLUSH_ALL,
-* also flush the entire Page Walk Cache.
-*/
-   __tlbiel_pid(pid, 0, ric);
+   switch (ric) {
+   case RIC_FLUSH_PWC:
 
-   /* For PWC, only one flush is needed */
-   if (ric == RIC_FLUSH_PWC) {
+   /* For PWC, only one flush is needed */
+   __tlbiel_pid(pid, 0, RIC_FLUSH_PWC);
ppc_after_tlbiel_barrier();
return;
+   case RIC_FLUSH_TLB:
+   __tlbiel_pid(pid, 0, RIC_FLUSH_TLB);
+   break;
+   case RIC_FLUSH_ALL:
+   default:
+   /*
+* Flush the first set of the TLB, and if
+* we're doing a RIC_FLUSH_ALL, also flush
+* the entire Page Walk Cache.
+*/
+   __tlbiel_pid(pid, 0, RIC_FLUSH_ALL);
}
 
if (!cpu_has_feature(CPU_FTR_ARCH_31)) {
@@ -1176,7 +1184,7 @@ void radix__tlb_flush(struct mmu_gather *tlb)
}
 }
 
-static __always_inline void __radix__flush_tlb_range_psize(struct mm_struct 
*mm,
+static void __radix__flush_tlb_range_psize(struct mm_struct *mm,
unsigned long start, unsigned long end,
int psize, bool also_pwc)
 {
-- 
2.30.2



[PATCH v4 5/9] powerpc/mm/book3s64: Update tlb flush routines to take a page walk cache flush argument

2021-04-14 Thread Aneesh Kumar K.V
Signed-off-by: Aneesh Kumar K.V 
---
 .../include/asm/book3s/64/tlbflush-radix.h| 19 +++-
 arch/powerpc/include/asm/book3s/64/tlbflush.h | 23 ---
 arch/powerpc/mm/book3s64/radix_hugetlbpage.c  |  4 +--
 arch/powerpc/mm/book3s64/radix_tlb.c  | 29 +++
 4 files changed, 42 insertions(+), 33 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h 
b/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
index 8b33601cdb9d..171441a43b35 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
@@ -56,15 +56,18 @@ static inline void radix__flush_all_lpid_guest(unsigned int 
lpid)
 }
 #endif
 
-extern void radix__flush_hugetlb_tlb_range(struct vm_area_struct *vma,
-  unsigned long start, unsigned long 
end);
-extern void radix__flush_tlb_range_psize(struct mm_struct *mm, unsigned long 
start,
-unsigned long end, int psize);
-extern void radix__flush_pmd_tlb_range(struct vm_area_struct *vma,
-  unsigned long start, unsigned long end);
-extern void radix__flush_tlb_range(struct vm_area_struct *vma, unsigned long 
start,
+void radix__flush_hugetlb_tlb_range(struct vm_area_struct *vma,
+   unsigned long start, unsigned long end,
+   bool flush_pwc);
+void radix__flush_pmd_tlb_range(struct vm_area_struct *vma,
+   unsigned long start, unsigned long end,
+   bool flush_pwc);
+void radix__flush_tlb_pwc_range_psize(struct mm_struct *mm, unsigned long 
start,
+ unsigned long end, int psize, bool 
flush_pwc);
+void radix__flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
unsigned long end);
-extern void radix__flush_tlb_kernel_range(unsigned long start, unsigned long 
end);
+void radix__flush_tlb_kernel_range(unsigned long start, unsigned long end);
+
 
 extern void radix__local_flush_tlb_mm(struct mm_struct *mm);
 extern void radix__local_flush_all_mm(struct mm_struct *mm);
diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush.h 
b/arch/powerpc/include/asm/book3s/64/tlbflush.h
index 215973b4cb26..f9f8a3a264f7 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush.h
@@ -45,13 +45,30 @@ static inline void tlbiel_all_lpid(bool radix)
hash__tlbiel_all(TLB_INVAL_SCOPE_LPID);
 }
 
+static inline void flush_pmd_tlb_pwc_range(struct vm_area_struct *vma,
+  unsigned long start,
+  unsigned long end,
+  bool flush_pwc)
+{
+   if (radix_enabled())
+   return radix__flush_pmd_tlb_range(vma, start, end, flush_pwc);
+   return hash__flush_tlb_range(vma, start, end);
+}
 
 #define __HAVE_ARCH_FLUSH_PMD_TLB_RANGE
 static inline void flush_pmd_tlb_range(struct vm_area_struct *vma,
   unsigned long start, unsigned long end)
+{
+   return flush_pmd_tlb_pwc_range(vma, start, end, false);
+}
+
+static inline void flush_hugetlb_tlb_pwc_range(struct vm_area_struct *vma,
+  unsigned long start,
+  unsigned long end,
+  bool flush_pwc)
 {
if (radix_enabled())
-   return radix__flush_pmd_tlb_range(vma, start, end);
+   return radix__flush_hugetlb_tlb_range(vma, start, end, 
flush_pwc);
return hash__flush_tlb_range(vma, start, end);
 }
 
@@ -60,9 +77,7 @@ static inline void flush_hugetlb_tlb_range(struct 
vm_area_struct *vma,
   unsigned long start,
   unsigned long end)
 {
-   if (radix_enabled())
-   return radix__flush_hugetlb_tlb_range(vma, start, end);
-   return hash__flush_tlb_range(vma, start, end);
+   return flush_hugetlb_tlb_pwc_range(vma, start, end, false);
 }
 
 static inline void flush_tlb_range(struct vm_area_struct *vma,
diff --git a/arch/powerpc/mm/book3s64/radix_hugetlbpage.c 
b/arch/powerpc/mm/book3s64/radix_hugetlbpage.c
index cb91071eef52..e62f5679b119 100644
--- a/arch/powerpc/mm/book3s64/radix_hugetlbpage.c
+++ b/arch/powerpc/mm/book3s64/radix_hugetlbpage.c
@@ -26,13 +26,13 @@ void radix__local_flush_hugetlb_page(struct vm_area_struct 
*vma, unsigned long v
 }
 
 void radix__flush_hugetlb_tlb_range(struct vm_area_struct *vma, unsigned long 
start,
-  unsigned long end)
+   unsigned long end, bool flush_pwc)
 {
int psize;
struct hstate *hstate = hstate_file(vma->vm_file);
 
psize 

[PATCH v4 3/9] mm/mremap: Use pmd/pud_populate to update page table entries

2021-04-14 Thread Aneesh Kumar K.V
pmd/pud_populate is the right interface to use to set the respective
page table entries. Some architectures, like ppc64, assume that set_pmd/pud_at
can only be used to set a hugepage PTE. Since we are not setting up a hugepage
PTE here, use the pmd/pud_populate interface.

Signed-off-by: Aneesh Kumar K.V 
---
 mm/mremap.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index ec8f840399ed..574287f9bb39 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -26,6 +26,7 @@
 
 #include 
 #include 
+#include 
 
 #include "internal.h"
 
@@ -257,9 +258,8 @@ static bool move_normal_pmd(struct vm_area_struct *vma, 
unsigned long old_addr,
pmd_clear(old_pmd);
 
VM_BUG_ON(!pmd_none(*new_pmd));
+   pmd_populate(mm, new_pmd, (pgtable_t)pmd_page_vaddr(pmd));
 
-   /* Set the new pmd */
-   set_pmd_at(mm, new_addr, new_pmd, pmd);
flush_tlb_range(vma, old_addr, old_addr + PMD_SIZE);
if (new_ptl != old_ptl)
spin_unlock(new_ptl);
@@ -306,8 +306,7 @@ static bool move_normal_pud(struct vm_area_struct *vma, 
unsigned long old_addr,
 
VM_BUG_ON(!pud_none(*new_pud));
 
-   /* Set the new pud */
-   set_pud_at(mm, new_addr, new_pud, pud);
+   pud_populate(mm, new_pud, (pmd_t *)pud_page_vaddr(pud));
flush_tlb_range(vma, old_addr, old_addr + PUD_SIZE);
if (new_ptl != old_ptl)
spin_unlock(new_ptl);
-- 
2.30.2



[PATCH v4 2/9] selftest/mremap_test: Avoid crash with static build

2021-04-14 Thread Aneesh Kumar K.V
With a large mmap map size, we can overlap with the text area and using
MAP_FIXED results in unmapping that area. Switch to MAP_FIXED_NOREPLACE
and handle the EEXIST error.

Reviewed-by: Kalesh Singh 
Signed-off-by: Aneesh Kumar K.V 
---
 tools/testing/selftests/vm/mremap_test.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/vm/mremap_test.c 
b/tools/testing/selftests/vm/mremap_test.c
index c9a5461eb786..0624d1bd71b5 100644
--- a/tools/testing/selftests/vm/mremap_test.c
+++ b/tools/testing/selftests/vm/mremap_test.c
@@ -75,9 +75,10 @@ static void *get_source_mapping(struct config c)
 retry:
addr += c.src_alignment;
src_addr = mmap((void *) addr, c.region_size, PROT_READ | PROT_WRITE,
-   MAP_FIXED | MAP_ANONYMOUS | MAP_SHARED, -1, 0);
+   MAP_FIXED_NOREPLACE | MAP_ANONYMOUS | MAP_SHARED,
+   -1, 0);
if (src_addr == MAP_FAILED) {
-   if (errno == EPERM)
+   if (errno == EPERM || errno == EEXIST)
goto retry;
goto error;
}
-- 
2.30.2



[PATCH v4 1/9] selftest/mremap_test: Update the test to handle pagesize other than 4K

2021-04-14 Thread Aneesh Kumar K.V
Instead of hardcoding a 4K page size, fetch it using sysconf(). The performance
measurement tests still assume 2M and 1G hugepage sizes.

Reviewed-by: Kalesh Singh 
Signed-off-by: Aneesh Kumar K.V 
---
 tools/testing/selftests/vm/mremap_test.c | 113 ---
 1 file changed, 61 insertions(+), 52 deletions(-)

diff --git a/tools/testing/selftests/vm/mremap_test.c 
b/tools/testing/selftests/vm/mremap_test.c
index 9c391d016922..c9a5461eb786 100644
--- a/tools/testing/selftests/vm/mremap_test.c
+++ b/tools/testing/selftests/vm/mremap_test.c
@@ -45,14 +45,15 @@ enum {
_4MB = 4ULL << 20,
_1GB = 1ULL << 30,
_2GB = 2ULL << 30,
-   PTE = _4KB,
PMD = _2MB,
PUD = _1GB,
 };
 
+#define PTE page_size
+
 #define MAKE_TEST(source_align, destination_align, size,   \
  overlaps, should_fail, test_name) \
-{  \
+(struct test){ \
.name = test_name,  \
.config = { \
.src_alignment = source_align,  \
@@ -252,12 +253,17 @@ static int parse_args(int argc, char **argv, unsigned int 
*threshold_mb,
return 0;
 }
 
+#define MAX_TEST 13
+#define MAX_PERF_TEST 3
 int main(int argc, char **argv)
 {
int failures = 0;
int i, run_perf_tests;
unsigned int threshold_mb = VALIDATION_DEFAULT_THRESHOLD;
unsigned int pattern_seed;
+   struct test test_cases[MAX_TEST];
+   struct test perf_test_cases[MAX_PERF_TEST];
+   int page_size;
time_t t;
 
pattern_seed = (unsigned int) time(&t);
@@ -268,56 +274,59 @@ int main(int argc, char **argv)
ksft_print_msg("Test 
configs:\n\tthreshold_mb=%u\n\tpattern_seed=%u\n\n",
   threshold_mb, pattern_seed);
 
-   struct test test_cases[] = {
-   /* Expected mremap failures */
-   MAKE_TEST(_4KB, _4KB, _4KB, OVERLAPPING, EXPECT_FAILURE,
- "mremap - Source and Destination Regions Overlapping"),
-   MAKE_TEST(_4KB, _1KB, _4KB, NON_OVERLAPPING, EXPECT_FAILURE,
- "mremap - Destination Address Misaligned (1KB-aligned)"),
-   MAKE_TEST(_1KB, _4KB, _4KB, NON_OVERLAPPING, EXPECT_FAILURE,
- "mremap - Source Address Misaligned (1KB-aligned)"),
-
-   /* Src addr PTE aligned */
-   MAKE_TEST(PTE, PTE, _8KB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "8KB mremap - Source PTE-aligned, Destination PTE-aligned"),
-
-   /* Src addr 1MB aligned */
-   MAKE_TEST(_1MB, PTE, _2MB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "2MB mremap - Source 1MB-aligned, Destination PTE-aligned"),
-   MAKE_TEST(_1MB, _1MB, _2MB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "2MB mremap - Source 1MB-aligned, Destination 1MB-aligned"),
-
-   /* Src addr PMD aligned */
-   MAKE_TEST(PMD, PTE, _4MB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "4MB mremap - Source PMD-aligned, Destination PTE-aligned"),
-   MAKE_TEST(PMD, _1MB, _4MB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "4MB mremap - Source PMD-aligned, Destination 1MB-aligned"),
-   MAKE_TEST(PMD, PMD, _4MB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "4MB mremap - Source PMD-aligned, Destination PMD-aligned"),
-
-   /* Src addr PUD aligned */
-   MAKE_TEST(PUD, PTE, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "2GB mremap - Source PUD-aligned, Destination PTE-aligned"),
-   MAKE_TEST(PUD, _1MB, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "2GB mremap - Source PUD-aligned, Destination 1MB-aligned"),
-   MAKE_TEST(PUD, PMD, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "2GB mremap - Source PUD-aligned, Destination PMD-aligned"),
-   MAKE_TEST(PUD, PUD, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "2GB mremap - Source PUD-aligned, Destination PUD-aligned"),
-   };
-
-   struct test perf_test_cases[] = {
-   /*
-* mremap 1GB region - Page table level aligned time
-* comparison.
-*/
-   MAKE_TEST(PTE, PTE, _1GB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "1GB mremap - Source PTE-aligned, Destination PTE-aligned"),
-   MAKE_TEST(PMD, PMD, _1GB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "1GB mremap - Source PMD-aligned, Destination PMD-aligned"),
-   MAKE_TEST(PUD, PUD, _1GB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "1GB mremap - Source PUD-aligned, Destination PUD-aligned"),
-   };
+   page_size = sysconf(_SC_PAGESIZE);
+
+   /* Expected mremap failures */
+   test_cases[0] = 

[PATCH v4 0/9] Speedup mremap on ppc64

2021-04-14 Thread Aneesh Kumar K.V
Hi,

This patchset enables MOVE_PMD/MOVE_PUD support on power. This requires
the platform to support updating higher-level page tables without
updating page table entries. This also needs to invalidate the Page Walk
Cache on architectures that support it.

Changes from v3:
* Fix build error reported by kernel test robot
* Address review feedback.

Changes from v2:
* switch from using mmu_gather to flush_pte_tlb_pwc_range() 

Changes from v1:
* Rebase to recent upstream
* Fix build issues with tlb_gather_mmu changes


Aneesh Kumar K.V (9):
  selftest/mremap_test: Update the test to handle pagesize other than 4K
  selftest/mremap_test: Avoid crash with static build
  mm/mremap: Use pmd/pud_populate to update page table entries
  powerpc/mm/book3s64: Fix possible build error
  powerpc/mm/book3s64: Update tlb flush routines to take a page walk
cache flush argument
  mm/mremap: Use range flush that does TLB and page walk cache flush
  mm/mremap: Move TLB flush outside page table lock
  mm/mremap: Allow arch runtime override
  powerpc/mm: Enable move pmd/pud

 .../include/asm/book3s/64/tlbflush-radix.h|  19 +--
 arch/powerpc/include/asm/book3s/64/tlbflush.h |  30 -
 arch/powerpc/include/asm/tlb.h|   6 +
 arch/powerpc/mm/book3s64/radix_hugetlbpage.c  |   4 +-
 arch/powerpc/mm/book3s64/radix_tlb.c  |  55 
 arch/powerpc/platforms/Kconfig.cputype|   2 +
 mm/mremap.c   |  41 --
 tools/testing/selftests/vm/mremap_test.c  | 118 ++
 8 files changed, 172 insertions(+), 103 deletions(-)

-- 
2.30.2



[PATCH] init: consolidate trap_init()

2021-04-14 Thread Jisheng Zhang
Many architectures implement trap_init() as a NOP. Since there is
no generic default for trap_init(), this empty stub is duplicated among
these architectures. Provide a generic but weak NOP implementation
so that the empty stubs of trap_init() in these architectures can be dropped.

Signed-off-by: Jisheng Zhang 
---
 arch/arc/kernel/traps.c  |  5 -
 arch/arm/kernel/traps.c  |  5 -
 arch/h8300/kernel/traps.c| 13 -
 arch/hexagon/kernel/traps.c  |  4 
 arch/nds32/kernel/traps.c|  5 -
 arch/nios2/kernel/traps.c|  5 -
 arch/openrisc/kernel/traps.c |  5 -
 arch/parisc/kernel/traps.c   |  4 
 arch/powerpc/kernel/traps.c  |  5 -
 arch/riscv/kernel/traps.c|  5 -
 arch/um/kernel/trap.c|  4 
 init/main.c  |  2 ++
 12 files changed, 2 insertions(+), 60 deletions(-)

diff --git a/arch/arc/kernel/traps.c b/arch/arc/kernel/traps.c
index 57235e5c0cea..6b83e3f2b41c 100644
--- a/arch/arc/kernel/traps.c
+++ b/arch/arc/kernel/traps.c
@@ -20,11 +20,6 @@
 #include 
 #include 
 
-void __init trap_init(void)
-{
-   return;
-}
-
 void die(const char *str, struct pt_regs *regs, unsigned long address)
 {
show_kernel_fault_diag(str, regs, address);
diff --git a/arch/arm/kernel/traps.c b/arch/arm/kernel/traps.c
index 17d5a785df28..9baccef20392 100644
--- a/arch/arm/kernel/traps.c
+++ b/arch/arm/kernel/traps.c
@@ -780,11 +780,6 @@ void abort(void)
panic("Oops failed to kill thread");
 }
 
-void __init trap_init(void)
-{
-   return;
-}
-
 #ifdef CONFIG_KUSER_HELPERS
 static void __init kuser_init(void *vectors)
 {
diff --git a/arch/h8300/kernel/traps.c b/arch/h8300/kernel/traps.c
index 5d8b969cd8f3..c3a3ebf77fbb 100644
--- a/arch/h8300/kernel/traps.c
+++ b/arch/h8300/kernel/traps.c
@@ -30,19 +30,6 @@
 
 static DEFINE_SPINLOCK(die_lock);
 
-/*
- * this must be called very early as the kernel might
- * use some instruction that are emulated on the 060
- */
-
-void __init base_trap_init(void)
-{
-}
-
-void __init trap_init(void)
-{
-}
-
 asmlinkage void set_esp0(unsigned long ssp)
 {
current->thread.esp0 = ssp;
diff --git a/arch/hexagon/kernel/traps.c b/arch/hexagon/kernel/traps.c
index 904134b37232..edfc35dafeb1 100644
--- a/arch/hexagon/kernel/traps.c
+++ b/arch/hexagon/kernel/traps.c
@@ -28,10 +28,6 @@
 #define TRAP_SYSCALL   1
 #define TRAP_DEBUG 0xdb
 
-void __init trap_init(void)
-{
-}
-
 #ifdef CONFIG_GENERIC_BUG
 /* Maybe should resemble arch/sh/kernel/traps.c ?? */
 int is_valid_bugaddr(unsigned long addr)
diff --git a/arch/nds32/kernel/traps.c b/arch/nds32/kernel/traps.c
index ee0d9ae192a5..f06421c645af 100644
--- a/arch/nds32/kernel/traps.c
+++ b/arch/nds32/kernel/traps.c
@@ -183,11 +183,6 @@ void __pgd_error(const char *file, int line, unsigned long 
val)
 }
 
 extern char *exception_vector, *exception_vector_end;
-void __init trap_init(void)
-{
-   return;
-}
-
 void __init early_trap_init(void)
 {
unsigned long ivb = 0;
diff --git a/arch/nios2/kernel/traps.c b/arch/nios2/kernel/traps.c
index b172da4eb1a9..596986a74a26 100644
--- a/arch/nios2/kernel/traps.c
+++ b/arch/nios2/kernel/traps.c
@@ -105,11 +105,6 @@ void show_stack(struct task_struct *task, unsigned long 
*stack,
printk("%s\n", loglvl);
 }
 
-void __init trap_init(void)
-{
-   /* Nothing to do here */
-}
-
 /* Breakpoint handler */
 asmlinkage void breakpoint_c(struct pt_regs *fp)
 {
diff --git a/arch/openrisc/kernel/traps.c b/arch/openrisc/kernel/traps.c
index 4d61333c2623..aa1e709405ac 100644
--- a/arch/openrisc/kernel/traps.c
+++ b/arch/openrisc/kernel/traps.c
@@ -231,11 +231,6 @@ void unhandled_exception(struct pt_regs *regs, int ea, int 
vector)
die("Oops", regs, 9);
 }
 
-void __init trap_init(void)
-{
-   /* Nothing needs to be done */
-}
-
 asmlinkage void do_trap(struct pt_regs *regs, unsigned long address)
 {
force_sig_fault(SIGTRAP, TRAP_BRKPT, (void __user *)regs->pc);
diff --git a/arch/parisc/kernel/traps.c b/arch/parisc/kernel/traps.c
index 8d8441d4562a..747c328fb886 100644
--- a/arch/parisc/kernel/traps.c
+++ b/arch/parisc/kernel/traps.c
@@ -859,7 +859,3 @@ void  __init early_trap_init(void)
 
initialize_ivt(&fault_vector_20);
 }
-
-void __init trap_init(void)
-{
-}
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index a44a30b0688c..e952bee89684 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -2207,11 +2207,6 @@ DEFINE_INTERRUPT_HANDLER(kernel_bad_stack)
die("Bad kernel stack pointer", regs, SIGABRT);
 }
 
-void __init trap_init(void)
-{
-}
-
-
 #ifdef CONFIG_PPC_EMULATED_STATS
 
 #define WARN_EMULATED_SETUP(type)  .type = { .name = #type }
diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
index 0879b5df11b9..b3f3d84de779 100644
--- a/arch/riscv/kernel/traps.c
+++ b/arch/riscv/kernel/traps.c
@@ -194,8 +194,3 @@ int is_valid_bugaddr(unsigned long pc)
return ((insn & __COMPRESSED_INSN_MASK) == 

Re: [PATCH net-next v2 1/2] of: net: pass the dst buffer to of_get_mac_address()

2021-04-14 Thread Dan Carpenter
Hi Michael,

url:
https://github.com/0day-ci/linux/commits/Michael-Walle/of-net-support-non-platform-devices-in-of_get_mac_address/20210406-234030
base:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git 
cc0626c2aaed8e475efdd85fa374b497a7192e35
config: x86_64-randconfig-m001-20210406 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot 
Reported-by: Dan Carpenter 

smatch warnings:
drivers/net/ethernet/xilinx/xilinx_axienet_main.c:2069 axienet_probe() warn: 
passing a valid pointer to 'PTR_ERR'

vim +/PTR_ERR +2069 drivers/net/ethernet/xilinx/xilinx_axienet_main.c

522856cefaf09d Robert Hancock  2019-06-06  2060 	/* Check for Ethernet core IRQ (optional) */
522856cefaf09d Robert Hancock  2019-06-06  2061 	if (lp->eth_irq <= 0)
522856cefaf09d Robert Hancock  2019-06-06  2062 		dev_info(&pdev->dev, "Ethernet core IRQ not defined\n");
522856cefaf09d Robert Hancock  2019-06-06  2063 
8a3b7a252dca9f Daniel Borkmann 2012-01-19  2064 	/* Retrieve the MAC address */
411b125c6ace1f Michael Walle   2021-04-06  2065 	ret = of_get_mac_address(pdev->dev.of_node, mac_addr);
411b125c6ace1f Michael Walle   2021-04-06  2066 	if (!ret) {
411b125c6ace1f Michael Walle   2021-04-06  2067 		axienet_set_mac_address(ndev, mac_addr);
411b125c6ace1f Michael Walle   2021-04-06  2068 	} else {
d05a9ed5c3a773 Robert Hancock  2019-06-06 @2069 		dev_warn(&pdev->dev, "could not find MAC address property: %ld\n",
d05a9ed5c3a773 Robert Hancock  2019-06-06  2070 			 PTR_ERR(mac_addr));
                                                 		 ^^^^^^^^^^^^^^^^^^
This should print "ret".

411b125c6ace1f Michael Walle   2021-04-06  2071 		axienet_set_mac_address(ndev, NULL);
8a3b7a252dca9f Daniel Borkmann 2012-01-19  2072 	}
8a3b7a252dca9f Daniel Borkmann 2012-01-19  2073 
8a3b7a252dca9f Daniel Borkmann 2012-01-19  2074 	lp->coalesce_count_rx = XAXIDMA_DFT_RX_THRESHOLD;
8a3b7a252dca9f Daniel Borkmann 2012-01-19  2075 	lp->coalesce_count_tx = XAXIDMA_DFT_TX_THRESHOLD;
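
As a minimal sketch of the fix Dan points at (assuming the probe code stays
as quoted above), the warning would print the integer error code returned by
of_get_mac_address() instead of passing the still-valid buffer to PTR_ERR():

		/* ret holds the errno from of_get_mac_address(); mac_addr is a
		 * plain buffer here, so PTR_ERR() on it is meaningless.
		 */
		dev_warn(&pdev->dev,
			 "could not find MAC address property: %d\n", ret);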

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org




Re: [PATCH v2] powerpc/eeh: Fix EEH handling for hugepages in ioremap space.

2021-04-14 Thread Mahesh J Salgaonkar
On 2021-04-13 10:53:39 Tue, Oliver O'Halloran wrote:
> On Mon, Apr 12, 2021 at 5:52 PM Mahesh Salgaonkar  
> wrote:
> >
> > During the EEH MMIO error checking, the current implementation fails to map
> > the (virtual) MMIO address back to the pci device on radix with hugepage
> > mappings for I/O. This results into failure to dispatch EEH event with no
> > recovery even when EEH capability has been enabled on the device.
> >
> > eeh_check_failure(token)# token = virtual MMIO address
> >   addr = eeh_token_to_phys(token);
> >   edev = eeh_addr_cache_get_dev(addr);
> >   if (!edev)
> > return 0;
> >   eeh_dev_check_failure(edev);  <= Dispatch the EEH event
> >
> > In case of hugepage mappings, eeh_token_to_phys() has a bug in virt -> phys
> > translation that results in wrong physical address, which is then passed to
> > eeh_addr_cache_get_dev() to match it against cached pci I/O address ranges
> > to get to a PCI device. Hence, it fails to find a match and the EEH event
> > never gets dispatched leaving the device in failed state.
> >
> > The commit 33439620680be ("powerpc/eeh: Handle hugepages in ioremap space")
> > introduced following logic to translate virt to phys for hugepage mappings:
> >
> > eeh_token_to_phys():
> > +   pa = pte_pfn(*ptep);
> > +
> > +   /* On radix we can do hugepage mappings for io, so handle that */
> > +   if (hugepage_shift) {
> > +   pa <<= hugepage_shift;  <= This is wrong
> > +   pa |= token & ((1ul << hugepage_shift) - 1);
> > +   }
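
A sketch of the corrected translation (keeping the variable names from the
quoted hunk; the actual fix may differ in detail) shifts the PFN by PAGE_SHIFT
and uses hugepage_shift only for the offset mask:

	pa = pte_pfn(*ptep);

	/* On radix we can do hugepage mappings for io, so handle that */
	if (!hugepage_shift)
		hugepage_shift = PAGE_SHIFT;

	pa <<= PAGE_SHIFT;				/* PFN -> physical address */
	pa |= token & ((1ul << hugepage_shift) - 1);	/* offset within the mapping */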
> 
> I think I vaguely remember thinking "is this right?" at the time.
> Apparently not!
> 
> Reviewed-by: Oliver O'Halloran 
> 

Thanks for the review.

> 
> It would probably be a good idea to add a debugfs interface to help
> with testing the address translation. Maybe something like:
> 
> echo  > /sys/kernel/debug/powerpc/eeh_addr_check
> 
> Then in the kernel:
> 
> struct resource *r = lookup_resource(mmio_addr);
> void *virt = ioremap_resource(r);
> ret = eeh_check_failure(virt);
> iounmap(virt)
> 
> return ret;
> 
> A little tedious, but then you can write a selftest :)

Sure, will give a try.
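
As an illustration only, such a debugfs hook could be sketched along these
lines (the function and file names, the permissions and the parent directory
are assumptions, not a final interface):

	static int eeh_debugfs_check_addr(void *data, u64 mmio_addr)
	{
		struct resource *r;
		void __iomem *virt;
		int ret;

		/* Find the MMIO resource the address belongs to. */
		r = lookup_resource(&iomem_resource, (resource_size_t)mmio_addr);
		if (!r)
			return -ENODEV;

		virt = ioremap(mmio_addr, PAGE_SIZE);
		if (!virt)
			return -ENOMEM;

		ret = eeh_check_failure(virt);	/* non-zero if a failure was found */
		iounmap(virt);

		return ret ? 0 : -EAGAIN;
	}
	DEFINE_DEBUGFS_ATTRIBUTE(eeh_addr_check_fops, NULL,
				 eeh_debugfs_check_addr, "%llu\n");

	/* Registered from the EEH init code, e.g.:
	 * debugfs_create_file_unsafe("eeh_addr_check", 0200,
	 *			      powerpc_debugfs_root, NULL,
	 *			      &eeh_addr_check_fops);
	 */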

Thanks,
-Mahesh.

-- 
Mahesh J Salgaonkar


Re: [PATCH 1/1] mm: Fix struct page layout on 32-bit systems

2021-04-14 Thread Jesper Dangaard Brouer
On Mon, 12 Apr 2021 02:15:32 +0100
Matthew Wilcox  wrote:

> On Sun, Apr 11, 2021 at 11:33:18AM +0100, Matthew Wilcox wrote:
> > Basically, we have three aligned dwords here.  We can either alias with
> > @flags and the first word of @lru, or the second word of @lru and @mapping,
> > or @index and @private.  @flags is a non-starter.  If we use @mapping,
> > then you have to set it to NULL before you free it, and I'm not sure
> > how easy that will be for you.  If that's trivial, then we could use
> > the layout:
> > 
> > unsigned long _pp_flags;
> > unsigned long pp_magic;
> > union {
> > dma_addr_t dma_addr;/* might be one or two words */
> > unsigned long _pp_align[2];
> > };
> > unsigned long pp_pfmemalloc;
> > unsigned long xmi;  
> 
> I forgot about the munmap path.  That calls zap_page_range() which calls
> set_page_dirty() which calls page_mapping().  If we use page->mapping,
> that's going to get interpreted as an address_space pointer.
> 
> *sigh*.  Foiled at every turn.

Yes, indeed! - And very frustrating.  It's keeping me up at night.
I'm dreaming about 32 vs 64 bit data structures. My fitbit stats tell
me that I don't sleep well with these kinds of dreams ;-)

> I'm kind of inclined towards using two (or more) bits for PageSlab as
> we discussed here:
> 
> https://lore.kernel.org/linux-mm/01000163efe179fe-d6270c58-eaba-482f-a6bd-334667250ef7-000...@email.amazonses.com/
> 
> so we have PageKAlloc that's true for PageSlab, PagePool, PageDMAPool,
> PageVMalloc, PageFrag and maybe a few other kernel-internal allocations.

I actually like this idea a lot.  I also think it will solve or remove
Matteo/Ilias's[2] need to introduce the pp_magic signature.  Ilias does
say[1] that page_pool pages could be used for TCP RX zerocopy, but I
don't think we should "allow" that (meaning page_pool should drop the
DMA-mapping and give up recycling).  I should argue why in that thread.

That said, I think we need a quicker fix for the immediate issue with
a 64-bit dma_addr on 32-bit arches and the misalignment hole it
leaves[3] in struct page.  In [3] you mention ppc32; does it only
happen on certain 32-bit arches?

I'm seriously considering removing page_pool's support for doing/keeping
DMA-mappings on 32-bit arches.  AFAIK only a single driver uses this.
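
As an illustration of the kind of quick fix being weighed here, one option (a
sketch only, assuming page->dma_addr is turned into two unsigned longs and the
split is hidden behind the page_pool accessors) would be:

	/* Sketch: with "unsigned long dma_addr[2]" in struct page, the union
	 * stays word-aligned on 32-bit even when dma_addr_t is 64-bit.  The
	 * shift is split (<< 16 << 16) so the dead branch still compiles when
	 * dma_addr_t is only 32 bits wide.
	 */
	static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
	{
		dma_addr_t ret = page->dma_addr[0];

		if (sizeof(dma_addr_t) > sizeof(unsigned long))
			ret |= (dma_addr_t)page->dma_addr[1] << 16 << 16;
		return ret;
	}

	static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
	{
		page->dma_addr[0] = addr;
		if (sizeof(dma_addr_t) > sizeof(unsigned long))
			page->dma_addr[1] = upper_32_bits(addr);
	}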

> (see also here:)
> https://lore.kernel.org/linux-mm/20180518194519.3820-18-wi...@infradead.org/

[1] https://lore.kernel.org/netdev/YHHuE7g73mZNrMV4@enceladus/
[2] 
https://patchwork.kernel.org/project/netdevbpf/patch/20210409223801.104657-3-mcr...@linux.microsoft.com/
[3] 
https://lore.kernel.org/linux-mm/20210410024313.gx2531...@casper.infradead.org/
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



Re: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain

2021-04-14 Thread Gautham R Shenoy
On Mon, Apr 12, 2021 at 06:33:55PM +0200, Michal Suchánek wrote:
> On Mon, Apr 12, 2021 at 04:24:44PM +0100, Mel Gorman wrote:
> > On Mon, Apr 12, 2021 at 02:21:47PM +0200, Vincent Guittot wrote:
> > > > > Peter, Valentin, Vincent, Mel, etal
> > > > >
> > > > > On architectures where we have multiple levels of cache access 
> > > > > latencies
> > > > > within a DIE, (For example: one within the current LLC or SMT core 
> > > > > and the
> > > > > other at MC or Hemisphere, and finally across hemispheres), do you 
> > > > > have any
> > > > > suggestions on how we could handle the same in the core scheduler?
> > >
> > > I would say that SD_SHARE_PKG_RESOURCES is there for that and doesn't
> > > only rely on cache
> > >
> >
> > From topology.c
> >
> > SD_SHARE_PKG_RESOURCES - describes shared caches
> >
> > I'm guessing here because I am not familiar with power10 but the central
> > problem appears to be when to prefer selecting a CPU sharing L2 or L3
> > cache and the core assumes the last-level-cache is the only relevant one.
> 
> It does not seem to be the case according to original description:
> 
>  When the scheduler tries to wakeup a task, it chooses between the
>  waker-CPU and the wakee's previous-CPU. Suppose this choice is called
>  the "target", then in the target's LLC domain, the scheduler
>  
>  a) tries to find an idle core in the LLC. This helps exploit the
> This is the same as (b)? Should this be SMT? ^^^

On POWER10, without this patch, the LLC is at the SMT sched-domain.
The difference between a) and b) is that a) searches for an idle
core, while b) searches for an idle CPU.


> SMT folding that the wakee task can benefit from. If an idle
> core is found, the wakee is woken up on it.
>  
>  b) Failing to find an idle core, the scheduler tries to find an idle
> CPU in the LLC. This helps minimise the wakeup latency for the
> wakee since it gets to run on the CPU immediately.
>  
>  c) Failing this, it will wake it up on target CPU.
>  
>  Thus, with P9-sched topology, since the CACHE domain comprises of two
>  SMT4 cores, there is a decent chance that we get an idle core, failing
>  which there is a relatively higher probability of finding an idle CPU
>  among the 8 threads in the domain.
>  
>  However, in P10-sched topology, since the SMT domain is the LLC and it
>  contains only a single SMT4 core, the probability that we find that
>  core to be idle is less. Furthermore, since there are only 4 CPUs to
>  search for an idle CPU, there is lower probability that we can get an
>  idle CPU to wake up the task on.
> 
> >
> > For this patch, I wondered if setting SD_SHARE_PKG_RESOURCES would have
> > unintended consequences for load balancing because load within a die may
> > not be spread between SMT4 domains if SD_SHARE_PKG_RESOURCES was set at
> > the MC level.
> 
> Not spreading load between SMT4 domains within MC is exactly what setting LLC
> at MC level would address, wouldn't it?
>
> As in on P10 we have two relevant levels but the topology as is describes only
> one, and moving the LLC level lower gives two levels the scheduler looks at
> again. Or am I missing something?

This is my current understanding as well, since with this patch we
would then be able to move tasks quickly between the SMT4 cores,
perhaps at the expense of losing out on cache-affinity. Which is why
it would be good to verify this using a test/benchmark.


> 
> Thanks
> 
> Michal
> 

--
Thanks and Regards
gautham.


Re: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain

2021-04-14 Thread Gautham R Shenoy
Hello Mel,

On Mon, Apr 12, 2021 at 04:24:44PM +0100, Mel Gorman wrote:
> On Mon, Apr 12, 2021 at 02:21:47PM +0200, Vincent Guittot wrote:
> > > > Peter, Valentin, Vincent, Mel, etal
> > > >
> > > > On architectures where we have multiple levels of cache access latencies
> > > > within a DIE, (For example: one within the current LLC or SMT core and 
> > > > the
> > > > other at MC or Hemisphere, and finally across hemispheres), do you have 
> > > > any
> > > > suggestions on how we could handle the same in the core scheduler?
> > 
> > I would say that SD_SHARE_PKG_RESOURCES is there for that and doesn't
> > only rely on cache
> > 
> 
> From topology.c
> 
>   SD_SHARE_PKG_RESOURCES - describes shared caches
>

Yes, I was aware that this flag describes shared caches, but this
current patch was the simplest way to achieve the effect, even though
the cores in the MC domain on POWER10 do not share a cache. However,
it is relatively faster to transfer data across the cores within the
MC domain compared to the cores outside the MC domain in the die.


> I'm guessing here because I am not familiar with power10 but the central
> problem appears to be when to prefer selecting a CPU sharing L2 or L3
> cache and the core assumes the last-level-cache is the only relevant one.
>

On POWER we have traditionally preferred to keep the LLC at the
sched-domain comprising the group of CPUs that share the L2 (since L3
is a victim cache on POWER).

On POWER9, the L2 was shared by the threads of a pair of SMT4 cores,
while on POWER10, L2 is shared by threads of a single SMT4 core.

Thus, the current task wake-up logic would have a lower probability of
finding an idle core inside an LLC, since it has only one core to
search in the LLC. This is why moving the LLC to the parent domain
(MC), consisting of a group of SMT4 cores among which cache snooping
is faster, is helpful for workloads that require single-threaded
performance.


> For this patch, I wondered if setting SD_SHARE_PKG_RESOURCES would have
> unintended consequences for load balancing because load within a die may
> not be spread between SMT4 domains if SD_SHARE_PKG_RESOURCES was set at
> the MC level.


Since we are adding SD_SHARE_PKG_RESOURCES to the parent of the only
sched-domain (an SMT4 domain) which currently has this flag set, would
it cause issues in spreading the load between the SMT4 domains?

Are there any tests/benchmarks that can help bring this out? It could
be good to understand this.

> 
> > >
> > > Minimally I think it would be worth detecting when there are multiple
> > > LLCs per node and detecting that in generic code as a static branch. In
> > > select_idle_cpu, consider taking two passes -- first on the LLC domain
> > > and if no idle CPU is found then taking a second pass if the search depth
> > 
> > We have done a lot of changes to reduce and optimize the fast path and
> > I don't think re adding another layer  in the fast path makes sense as
> > you will end up unrolling the for_each_domain behind some
> > static_banches.
> > 
> 
> Searching the node would only happen if a) there was enough search depth
> left and b) there were no idle CPUs at the LLC level. As no new domain
> is added, it's not clear to me why for_each_domain would change.
> 
> But still, your comment reminded me that different architectures have
> different requirements
> 
> Power 10 appears to prefer CPU selection sharing L2 cache but desires
>   spillover to L3 when selecting an idle CPU.
>

Indeed, so on POWER10 the preference would be:
1) idle core in the L2 domain,
2) idle core in the MC domain,
3) idle CPU in the L2 domain,
4) idle CPU in the MC domain.

This patch is able to achieve this *implicitly* because of the way
select_idle_cpu() and select_idle_core() are currently coded: in the
presence of idle cores at the MC level, select_idle_core() searches
for an idle core starting with the core of the target CPU.

If I understand your proposal correctly, it would make this explicit
as a two-level search, where we first search in the LLC domain and,
failing that, carry on the search in the rest of the die (assuming
that the LLC does not span the whole die).
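
In pseudocode, the two-level search would look roughly like the following
(scan_for_idle_cpu() is a hypothetical stand-in for the existing
select_idle_core()/select_idle_cpu() machinery, not a real function):

	static int two_pass_wakeup_target(struct task_struct *p, int target)
	{
		struct sched_domain *llc = rcu_dereference(per_cpu(sd_llc, target));
		int cpu = -1;

		/* Pass 1: confine the search to the LLC (the SMT4/L2 domain on P10). */
		if (llc)
			cpu = scan_for_idle_cpu(p, llc, target);

		/* Pass 2: only if nothing idle was found (and search budget remains),
		 * widen to the parent domain (MC / the rest of the die).
		 */
		if (cpu < 0 && llc && llc->parent)
			cpu = scan_for_idle_cpu(p, llc->parent, target);

		return cpu >= 0 ? cpu : target;
	}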


> X86 varies, it might want the Power10 approach for some families and prefer
>   L3 spilling over to a CPU on the same node in others.
> 
> S390 cares about something called books and drawers although I've no
>   idea what it means as such and whether it has any preferences on
>   search order.
> 
> ARM has similar requirements again according to "scheduler: expose the
>   topology of clusters and add cluster scheduler" and that one *does*
>   add another domain.
> 
> I had forgotten about the ARM patches but remembered that they were
> interesting because they potentially help the Zen situation but I didn't
> get the chance to review them before they fell off my radar again. About
> all I recall is that I thought the "cluster" terminology was vague.
> 
> The only 

Re: [PATCH 1/5] uapi: remove the unused HAVE_ARCH_STRUCT_FLOCK64 define

2021-04-14 Thread Stephen Rothwell
Hi Arnd,

On Mon, 12 Apr 2021 11:55:41 +0200 Arnd Bergmann  wrote:
>
> On Mon, Apr 12, 2021 at 10:55 AM Christoph Hellwig  wrote:
> >
> > Signed-off-by: Christoph Hellwig   
> 
> The patch looks good, but I'd like to see a description for each one.
> How about:
> 
> | The check was added when Stephen Rothwell created the file, but
> | no architecture ever defined it.

Actually it was used by the xtensa architecture until Dec, 2006.

-- 
Cheers,
Stephen Rothwell




Re: [PATCH] mm: Define ARCH_HAS_FIRST_USER_ADDRESS

2021-04-14 Thread Christophe Leroy




Le 14/04/2021 à 07:59, Anshuman Khandual a écrit :



On 4/14/21 10:52 AM, Christophe Leroy wrote:



Le 14/04/2021 à 04:54, Anshuman Khandual a écrit :

Currently most platforms define FIRST_USER_ADDRESS as 0UL duplicating the
same code all over. Instead define a new option ARCH_HAS_FIRST_USER_ADDRESS
for those platforms which would override generic default FIRST_USER_ADDRESS
value 0UL. This makes it much cleaner with reduced code.

Cc: linux-al...@vger.kernel.org
Cc: linux-snps-...@lists.infradead.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-c...@vger.kernel.org
Cc: linux-hexa...@vger.kernel.org
Cc: linux-i...@vger.kernel.org
Cc: linux-m...@lists.linux-m68k.org
Cc: linux-m...@vger.kernel.org
Cc: openr...@lists.librecores.org
Cc: linux-par...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-ri...@lists.infradead.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: linux...@lists.infradead.org
Cc: linux-xte...@linux-xtensa.org
Cc: x...@kernel.org
Cc: linux...@kvack.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Anshuman Khandual 
---
   arch/alpha/include/asm/pgtable.h | 1 -
   arch/arc/include/asm/pgtable.h   | 6 --
   arch/arm/Kconfig | 1 +
   arch/arm64/include/asm/pgtable.h | 2 --
   arch/csky/include/asm/pgtable.h  | 1 -
   arch/hexagon/include/asm/pgtable.h   | 3 ---
   arch/ia64/include/asm/pgtable.h  | 1 -
   arch/m68k/include/asm/pgtable_mm.h   | 1 -
   arch/microblaze/include/asm/pgtable.h    | 2 --
   arch/mips/include/asm/pgtable-32.h   | 1 -
   arch/mips/include/asm/pgtable-64.h   | 1 -
   arch/nds32/Kconfig   | 1 +
   arch/nios2/include/asm/pgtable.h | 2 --
   arch/openrisc/include/asm/pgtable.h  | 1 -
   arch/parisc/include/asm/pgtable.h    | 2 --
   arch/powerpc/include/asm/book3s/pgtable.h    | 1 -
   arch/powerpc/include/asm/nohash/32/pgtable.h | 1 -
   arch/powerpc/include/asm/nohash/64/pgtable.h | 2 --
   arch/riscv/include/asm/pgtable.h | 2 --
   arch/s390/include/asm/pgtable.h  | 2 --
   arch/sh/include/asm/pgtable.h    | 2 --
   arch/sparc/include/asm/pgtable_32.h  | 1 -
   arch/sparc/include/asm/pgtable_64.h  | 3 ---
   arch/um/include/asm/pgtable-2level.h | 1 -
   arch/um/include/asm/pgtable-3level.h | 1 -
   arch/x86/include/asm/pgtable_types.h | 2 --
   arch/xtensa/include/asm/pgtable.h    | 1 -
   include/linux/mm.h   | 4 
   mm/Kconfig   | 4 
   29 files changed, 10 insertions(+), 43 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8ba434287387..47098ccd715e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -46,6 +46,10 @@ extern int sysctl_page_lock_unfairness;
     void init_mm_internals(void);
   +#ifndef ARCH_HAS_FIRST_USER_ADDRESS


I guess you didn't test it . :)


In fact I did :) Though just booted it on arm64 and cross compiled on
multiple others platforms.



should be #ifndef CONFIG_ARCH_HAS_FIRST_USER_ADDRESS


Right, meant that instead.




+#define FIRST_USER_ADDRESS    0UL
+#endif


But why do we need a config option at all for that ?

Why not just:

#ifndef FIRST_USER_ADDRESS
#define FIRST_USER_ADDRESS    0UL
#endif


This sounds simpler. But just wondering, would there not be any possibility
of build problems due to the compilation sequence between arch and generic code?



For sure it has to be addressed carefully, but there is already a lot of stuff like that around 
pgtable.h


For instance, pte_offset_kernel() has a generic definition in linux/pgtable.h that is used only when 
the architecture has not already defined one.


Taking into account that FIRST_USER_ADDRESS is today in the architectures' asm/pgtable.h, I think 
putting the fallback definition in linux/pgtable.h would do the trick.
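
For reference, the existing pte_offset_kernel() fallback in linux/pgtable.h
looks roughly like this, and FIRST_USER_ADDRESS could follow the same pattern
(sketch only):

	/* The generic header defines the helper only when the architecture's
	 * asm/pgtable.h has not already done so.
	 */
	#ifndef pte_offset_kernel
	static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
	{
		return (pte_t *)pmd_page_vaddr(*pmd) + pte_index(address);
	}
	#define pte_offset_kernel pte_offset_kernel
	#endif

	/* The same approach for FIRST_USER_ADDRESS: */
	#ifndef FIRST_USER_ADDRESS
	#define FIRST_USER_ADDRESS	0UL
	#endif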