[PATCH] powerpc: Remove proc_trap()

2021-06-08 Thread Christophe Leroy
proc_trap() has never been used, remove it.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/reg.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 7c81d3e563b2..3bb01a8779c9 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -1435,8 +1435,6 @@ static inline void mtsr(u32 val, u32 idx)
 }
 #endif
 
-#define proc_trap()	asm volatile("trap")
-
 extern unsigned long current_stack_frame(void);
 
 register unsigned long current_stack_pointer asm("r1");
-- 
2.25.0



Re: [PATCH v1 04/12] mm/memory_hotplug: remove nid parameter from arch_remove_memory()

2021-06-08 Thread Heiko Carstens
On Mon, Jun 07, 2021 at 09:54:22PM +0200, David Hildenbrand wrote:
> The parameter is unused, let's remove it.
> 
> Signed-off-by: David Hildenbrand 
> ---
>  arch/arm64/mm/mmu.c| 3 +--
>  arch/ia64/mm/init.c| 3 +--
>  arch/powerpc/mm/mem.c  | 3 +--
>  arch/s390/mm/init.c| 3 +--
>  arch/sh/mm/init.c  | 3 +--
>  arch/x86/mm/init_32.c  | 3 +--
>  arch/x86/mm/init_64.c  | 3 +--
>  include/linux/memory_hotplug.h | 3 +--
>  mm/memory_hotplug.c| 4 ++--
>  mm/memremap.c  | 5 +
>  10 files changed, 11 insertions(+), 22 deletions(-)

For s390:
Acked-by: Heiko Carstens 


Re: [PATCH v2 1/3] powerpc/mm/hash: Avoid resizing-down HPT on first memory hotplug

2021-06-08 Thread Leonardo Brás
On Wed, 2021-06-09 at 14:40 +1000, David Gibson wrote:
> On Tue, Jun 08, 2021 at 09:52:10PM -0300, Leonardo Brás wrote:
> > On Mon, 2021-06-07 at 15:02 +1000, David Gibson wrote:
> > > On Fri, Apr 30, 2021 at 11:36:06AM -0300, Leonardo Bras wrote:
> > > > Because hypervisors may need to create HPTs without knowing the
> > > > guest page size, the smallest used page-size (4k) may be chosen,
> > > > resulting in a HPT that is possibly bigger than needed.
> > > > 
> > > > On a guest with bigger page-sizes, the amount of entries for HPT
> > > > may be too high, causing the guest to ask for a HPT resize-down on
> > > > the first hotplug.
> > > > 
> > > > This becomes a problem when HPT resize-down fails, and causes the
> > > > HPT resize to be performed on every LMB added, until HPT size is
> > > > compatible with the guest memory size, causing a major slowdown.
> > > > 
> > > > So, avoiding HPT resizing-down on hot-add significantly improves
> > > > memory hotplug times.
> > > > 
> > > > As an example, hotplugging 256GB on a 129GB guest took 710s
> > > > without this patch, and 21s after applied.
> > > > 
> > > > Signed-off-by: Leonardo Bras 
> > > 
> > > Sorry it's taken me so long to look at these
> > > 
> > > I don't love the extra statefulness that the 'shrinking' parameter
> > > adds, but I can't see an elegant way to avoid it, so:
> > > 
> > > Reviewed-by: David Gibson 
> > 
> > np, thanks for reviewing!
> 
> Actually... I take that back.  With the subsequent patches my
> discomfort with the complexity of implementing the batching grew.
> 
> I think I can see a simpler way - although it wasn't as clear as I
> thought it might be, without some deep history on this feature.
> 
> What's going on here is pretty hard to follow, because it starts in
> arch-specific code (arch/powerpc/platforms/pseries/hotplug-memory.c)
> where it processes the add/remove requests, then goes into generic
> code __add_memory() which eventually emerges back in arch specific
> code (hash__create_section_mapping()).
> 
> The HPT resizing calls are in the "inner" arch specific section,
> whereas it's only the outer arch section that has the information to
> batch properly.  The mutex and 'shrinking' parameter in Leonardo's
> code are all about conveying information from the outer to inner
> section.
> 
> Now, I think the reason I had the resize calls in the inner section
> was to accommodate the notion that a) pHyp might support resizing in
> future, and it could come in through a different path with its drmgr
> thingy and/or b) bare metal hash architectures might want to implement
> hash resizing, and this would make at least part of the path common.
> 
> Given the decreasing relevance of hash MMUs, I think we can now safely
> say neither of these is ever going to happen.
> 
> Therefore, we can simplify things by moving the HPT resize calls into
> the pseries LMB code, instead of create/remove_section_mapping.  Then
> to do batching without extra complications we just need this logic for
> all resizes (both add and remove):
> 
> let new_hpt_order = expected HPT size for new mem size;
> 
> if (new_hpt_order > current_hpt_order)
> resize to new_hpt_order
> 
> add/remove memory
> 
> if (new_hpt_order < current_hpt_order - 1)
> resize to new_hpt_order
> 
> 


Ok, that really does seem to simplify the batching a lot.

Question:
by LMB code, you mean dlpar_memory_{add,remove}_by_* ?
(dealing only with dlpar_{add,remove}_lmb() would not be enough to deal
with batching)

In my 3/3 reply I sent you some other examples of functions that
currently end up calling resize_hpt_for_hotplug() without coming from
hotplug-memory.c. Is it ok that they no longer call it?


Best regards,
Leonardo Bras



Re: [PATCH v2 3/3] powerpc/mm/hash: Avoid multiple HPT resize-downs on memory hotunplug

2021-06-08 Thread Leonardo Brás
On Mon, 2021-06-07 at 15:20 +1000, David Gibson wrote:
> On Fri, Apr 30, 2021 at 11:36:10AM -0300, Leonardo Bras wrote:
> > During memory hotunplug, after each LMB is removed, the HPT may be
> > resized-down if it would map a max of 4 times the current amount of
> > memory. (2 shifts, due to the introduced hysteresis)
> > 
> > It usually is not an issue, but it can take a lot of time if HPT
> > resizing-down fails. This happens because resize-down failures
> > usually repeat at each LMB removal, until there are no more
> > conflicting bolted entries, which can take a while to happen.
> > 
> > This can be solved by doing a single HPT resize at the end of memory
> > hotunplug, after all requested entries are removed.
> > 
> > To make this happen, it's necessary to temporarily disable all HPT
> > resize-downs before hotunplug, re-enable them after hotunplug ends,
> > and then resize-down HPT to the current memory size.
> > 
> > As an example, hotunplugging 256GB from a 385GB guest took 621s
> > without this patch, and 100s after applied.
> > 
> > Signed-off-by: Leonardo Bras 
> 
> Hrm.  This looks correct, but it seems overly complicated.
> 
> AFAICT, the resize calls that this adds should in practice be the
> *only* times we call resize, all the calls from the lower level code
> should be suppressed. 

That's correct.

>  In which case can't we just remove those calls
> entirely, and not deal with the clunky locking and exclusion here.
> That should also remove the need for the 'shrinking' parameter in
> 1/3.


If I get your suggestion correctly, you suggest something like:
1 - Never calling resize_hpt_for_hotplug() in
hash__remove_section_mapping(), thus not needing the shrinking
parameter.
2 - Functions in hotplug-memory.c that call dlpar_remove_lmb() would in
fact call another function to do the batch resize_hpt_for_hotplug() for
them

If so, that assumes that no other function currently calls
resize_hpt_for_hotplug() via another path, or if any do, that they do
not actually need to resize the HPT.

Is the above correct?

There are some examples of functions that currently call
resize_hpt_for_hotplug() by another path:

add_memory_driver_managed
virtio_mem_add_memory
dev_dax_kmem_probe

reserve_additional_memory
balloon_process
add_ballooned_pages

__add_memory
probe_store

__remove_memory
pseries_remove_memblock

remove_memory
dev_dax_kmem_remove
virtio_mem_remove_memory

memunmap_pages
pci_p2pdma_add_resource
virtio_fs_setup_dax


Best regards,
Leonardo Bras



Re: [PATCH v2 1/3] powerpc/mm/hash: Avoid resizing-down HPT on first memory hotplug

2021-06-08 Thread David Gibson
On Tue, Jun 08, 2021 at 09:52:10PM -0300, Leonardo Brás wrote:
> On Mon, 2021-06-07 at 15:02 +1000, David Gibson wrote:
> > On Fri, Apr 30, 2021 at 11:36:06AM -0300, Leonardo Bras wrote:
> > > Because hypervisors may need to create HPTs without knowing the
> > > guest page size, the smallest used page-size (4k) may be chosen,
> > > resulting in a HPT that is possibly bigger than needed.
> > > 
> > > On a guest with bigger page-sizes, the amount of entries for HPT
> > > may be too high, causing the guest to ask for a HPT resize-down on
> > > the first hotplug.
> > > 
> > > This becomes a problem when HPT resize-down fails, and causes the
> > > HPT resize to be performed on every LMB added, until HPT size is
> > > compatible with the guest memory size, causing a major slowdown.
> > > 
> > > So, avoiding HPT resizing-down on hot-add significantly improves
> > > memory hotplug times.
> > > 
> > > As an example, hotplugging 256GB on a 129GB guest took 710s without
> > > this patch, and 21s after applied.
> > > 
> > > Signed-off-by: Leonardo Bras 
> > 
> > Sorry it's taken me so long to look at these
> > 
> > I don't love the extra statefulness that the 'shrinking' parameter
> > adds, but I can't see an elegant way to avoid it, so:
> > 
> > Reviewed-by: David Gibson 
> 
> np, thanks for reviewing!

Actually... I take that back.  With the subsequent patches my
discomfort with the complexity of implementing the batching grew.

I think I can see a simpler way - although it wasn't as clear as I
thought it might be, without some deep history on this feature.

What's going on here is pretty hard to follow, because it starts in
arch-specific code (arch/powerpc/platforms/pseries/hotplug-memory.c)
where it processes the add/remove requests, then goes into generic
code __add_memory() which eventually emerges back in arch specific
code (hash__create_section_mapping()).

The HPT resizing calls are in the "inner" arch specific section,
whereas it's only the outer arch section that has the information to
batch properly.  The mutex and 'shrinking' parameter in Leonardo's
code are all about conveying information from the outer to inner
section.

Now, I think the reason I had the resize calls in the inner section
was to accommodate the notion that a) pHyp might support resizing in
future, and it could come in through a different path with its drmgr
thingy and/or b) bare metal hash architectures might want to implement
hash resizing, and this would make at least part of the path common.

Given the decreasing relevance of hash MMUs, I think we can now safely
say neither of these is ever going to happen.

Therefore, we can simplify things by moving the HPT resize calls into
the pseries LMB code, instead of create/remove_section_mapping.  Then
to do batching without extra complications we just need this logic for
all resizes (both add and remove):

let new_hpt_order = expected HPT size for new mem size;

if (new_hpt_order > current_hpt_order)
resize to new_hpt_order

add/remove memory

if (new_hpt_order < current_hpt_order - 1)
resize to new_hpt_order
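For illustration, that grow-before / shrink-after logic with one order of hysteresis can be sketched in C. The helper names below (hpt_order_for_mem_size(), resize_hpt()) and the base order are hypothetical stand-ins for the real pseries code, not the actual kernel functions:

```c
#include <assert.h>

/* Hypothetical stand-ins for the real pseries HPT code. */
static unsigned int current_hpt_order;

/* One order step per doubling of memory above 1GB; the base order of 18
 * is an arbitrary value for this sketch, not the real calculation. */
static unsigned int hpt_order_for_mem_size(unsigned long mem_size)
{
	unsigned int order = 18;

	while (mem_size > (1UL << 30)) {
		mem_size >>= 1;
		order++;
	}
	return order;
}

static void resize_hpt(unsigned int new_order)
{
	current_hpt_order = new_order;
}

/* Grow the HPT before the memory change, if needed. */
static void hotplug_hpt_resize_pre(unsigned long new_mem_size)
{
	unsigned int new_hpt_order = hpt_order_for_mem_size(new_mem_size);

	if (new_hpt_order > current_hpt_order)
		resize_hpt(new_hpt_order);
}

/* Shrink the HPT after the memory change, with one order of hysteresis
 * so borderline sizes don't cause repeated resizes. */
static void hotplug_hpt_resize_post(unsigned long new_mem_size)
{
	unsigned int new_hpt_order = hpt_order_for_mem_size(new_mem_size);

	if (new_hpt_order < current_hpt_order - 1)
		resize_hpt(new_hpt_order);
}
```

Both hooks run for every add and remove, so a grow happens once before an add, a shrink happens once after a remove, and a target within one order of the current HPT size leaves it untouched.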


-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH 03/11] Documentation: ocxl.rst: change FPGA indirect article to an

2021-06-08 Thread Andrew Donnellan

On 9/6/21 7:23 am, t...@redhat.com wrote:

From: Tom Rix 

Change use of 'a fpga' to 'an fpga'

Signed-off-by: Tom Rix 


Acked-by: Andrew Donnellan 


--
Andrew Donnellan  OzLabs, ADL Canberra
a...@linux.ibm.com IBM Australia Limited


Re: [PATCH v2 2/3] powerpc/mm/hash: Avoid multiple HPT resize-ups on memory hotplug

2021-06-08 Thread Leonardo Brás
On Mon, 2021-06-07 at 15:10 +1000, David Gibson wrote:
> On Fri, Apr 30, 2021 at 11:36:08AM -0300, Leonardo Bras wrote:
> > Every time a memory hotplug happens, and the memory limit crosses a
> > 2^n value, it may be necessary to perform HPT resizing-up, which can
> > take some time (over 100ms in my tests).
> > 
> > It usually is not an issue, but it can take some time if a lot of
> > memory is added to a guest with little starting memory:
> > Adding 256G to a 2GB guest, for example, will require 8 HPT resizes.
> > 
> > Perform an HPT resize before memory hotplug, updating HPT to its
> > final size (considering a successful hotplug), taking the number of
> > HPT resizes to at most one per memory hotplug action.
> > 
> > Signed-off-by: Leonardo Bras 
> 
> Reviewed-by: David Gibson 

Thanks David!

> 
> > ---
> >  arch/powerpc/include/asm/book3s/64/hash.h |  2 ++
> >  arch/powerpc/mm/book3s64/hash_utils.c | 20 +++
> >  .../platforms/pseries/hotplug-memory.c    |  9 +
> >  3 files changed, 31 insertions(+)
> > 
> > diff --git a/arch/powerpc/include/asm/book3s/64/hash.h
> > b/arch/powerpc/include/asm/book3s/64/hash.h
> > index d959b0195ad9..fad4af8b8543 100644
> > --- a/arch/powerpc/include/asm/book3s/64/hash.h
> > +++ b/arch/powerpc/include/asm/book3s/64/hash.h
> > @@ -255,6 +255,8 @@ int hash__create_section_mapping(unsigned long
> > start, unsigned long end,
> >  int nid, pgprot_t prot);
> >  int hash__remove_section_mapping(unsigned long start, unsigned
> > long end);
> >  
> > +void hash_batch_expand_prepare(unsigned long newsize);
> > +
> >  #endif /* !__ASSEMBLY__ */
> >  #endif /* __KERNEL__ */
> >  #endif /* _ASM_POWERPC_BOOK3S_64_HASH_H */
> > diff --git a/arch/powerpc/mm/book3s64/hash_utils.c
> > b/arch/powerpc/mm/book3s64/hash_utils.c
> > index 608e4ed397a9..3fa395b3fe57 100644
> > --- a/arch/powerpc/mm/book3s64/hash_utils.c
> > +++ b/arch/powerpc/mm/book3s64/hash_utils.c
> > @@ -859,6 +859,26 @@ int hash__remove_section_mapping(unsigned long
> > start, unsigned long end)
> >  
> > return rc;
> >  }
> > +
> > +void hash_batch_expand_prepare(unsigned long newsize)
> > +{
> > +   const u64 starting_size = ppc64_pft_size;
> > +
> > +	/*
> > +	 * Resizing-up HPT should never fail, but there are some cases
> > +	 * system starts with higher SHIFT than required, and we go through
> > +	 * the funny case of resizing HPT down while adding memory
> > +	 */
> > +
> > +	while (resize_hpt_for_hotplug(newsize, false) == -ENOSPC) {
> > +		newsize *= 2;
> > +		pr_warn("Hash collision while resizing HPT\n");
> > +
> > +		/* Do not try to resize to the starting size, or bigger value */
> > +		if (htab_shift_for_mem_size(newsize) >= starting_size)
> > +			break;
> > +	}
> > +}
> >  #endif /* CONFIG_MEMORY_HOTPLUG */
> >  
> >  static void __init hash_init_partition_table(phys_addr_t
> > hash_table,
> > diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c
> > b/arch/powerpc/platforms/pseries/hotplug-memory.c
> > index 8377f1f7c78e..48b2cfe4ce69 100644
> > --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
> > +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
> > @@ -13,6 +13,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  
> >  #include 
> >  #include 
> > @@ -671,6 +672,10 @@ static int dlpar_memory_add_by_count(u32
> > lmbs_to_add)
> > if (lmbs_available < lmbs_to_add)
> > return -EINVAL;
> >  
> > +   if (!radix_enabled())
> > +	hash_batch_expand_prepare(memblock_phys_mem_size() +
> > +				  lmbs_to_add * drmem_lmb_size());
> > +
> > for_each_drmem_lmb(lmb) {
> > if (lmb->flags & DRCONF_MEM_ASSIGNED)
> > continue;
> > @@ -788,6 +793,10 @@ static int dlpar_memory_add_by_ic(u32
> > lmbs_to_add, u32 drc_index)
> > if (lmbs_available < lmbs_to_add)
> > return -EINVAL;
> >  
> > +   if (!radix_enabled())
> > +	hash_batch_expand_prepare(memblock_phys_mem_size() +
> > +				  lmbs_to_add * drmem_lmb_size());
> > +
> > for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) {
> > if (lmb->flags & DRCONF_MEM_ASSIGNED)
> > continue;
> 




Re: [PATCH] crash_core, vmcoreinfo: Append 'SECTION_SIZE_BITS' to vmcoreinfo

2021-06-08 Thread Baoquan He
On 06/08/21 at 02:14pm, Andrew Morton wrote:
> On Tue, 8 Jun 2021 22:24:32 +0800 Baoquan He  wrote:
> 
> > On 06/08/21 at 06:33am, Pingfan Liu wrote:
> > > As mentioned in kernel commit 1d50e5d0c505 ("crash_core, vmcoreinfo:
> > > Append 'MAX_PHYSMEM_BITS' to vmcoreinfo"), SECTION_SIZE_BITS is used
> > > in the formula:
> > > #define SECTIONS_SHIFT    (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)
> > > 
> > > Besides SECTIONS_SHIFT, SECTION_SIZE_BITS is also used to calculate
> > > PAGES_PER_SECTION in makedumpfile just like kernel.
> > > 
> > > Unfortunately, this arch-dependent macro SECTION_SIZE_BITS changes, e.g.
> > > recently in kernel commit f0b13ee23241 ("arm64/sparsemem: reduce
> > > SECTION_SIZE_BITS"). But user space wants a stable interface to get this
> > > info. Such info is impossible to be deduced from a crashdump vmcore.
> > > Hence append SECTION_SIZE_BITS to vmcoreinfo.
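[For context on why user space needs the raw value: PAGES_PER_SECTION is derived directly from SECTION_SIZE_BITS. A quick illustration, using the arm64 4K-page values from before and after commit f0b13ee23241 as assumed examples:]

```python
PAGE_SHIFT = 12  # 4K pages

def pages_per_section(section_size_bits: int) -> int:
    # Mirrors the kernel's PAGES_PER_SECTION = 1 << (SECTION_SIZE_BITS - PAGE_SHIFT)
    return 1 << (section_size_bits - PAGE_SHIFT)

# arm64 before commit f0b13ee23241 (SECTION_SIZE_BITS = 30) vs after (27)
print(pages_per_section(30))  # 262144
print(pages_per_section(27))  # 32768
```

A dump tool assuming the old value would compute section boundaries a factor of 8 off, which is why the value has to be exported rather than guessed.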
> > 
> > ...
> >
> > Add the discussion of the original thread in kexec ML for reference:
> > http://lists.infradead.org/pipermail/kexec/2021-June/022676.html
> 
> I added a Link: for this.

Thanks, Andrew.

>  
> > This looks good to me.
> > 
> > Acked-by: Baoquan He 
> 
> I'm thinking we should backport this at least to Fixes:f0b13ee23241. 
> But perhaps it's simpler to just backport it as far as possible, so I
> added a bare cc:stable with no Fixes:.  Thoughts?

Yeah, it should add cc:stable, thanks. Otherwise it will break
vmcore dumping on the 5.12 stable kernel even with the updated
makedumpfile utility. A Fixes: f0b13ee23241 tag would help stable
kernel maintainers identify which kernels this patch needs to be
applied to. If only having cc:stable with no Fixes: is allowed, that's
also OK.



Re: [PATCH v2] lockdown,selinux: avoid bogus SELinux lockdown permission checks

2021-06-08 Thread Paul Moore
On Tue, Jun 8, 2021 at 7:02 AM Ondrej Mosnacek  wrote:
> On Thu, Jun 3, 2021 at 7:46 PM Paul Moore  wrote:

...

> > It sounds an awful lot like the lockdown hook is in the wrong spot.
> > It sounds like it would be a lot better to relocate the hook than
> > remove it.
>
> I don't see how you would solve this by moving the hook. Where do you
> want to relocate it?

Wherever it makes sense.  Based on your comments it really sounded
like the hook was in a bad spot and since your approach in a lot of
this had been to remove or disable hooks I wanted to make sure that
relocating the hook was something you had considered.  Thankfully it
sounds like you have considered moving the hook - that's good.

> The main obstacle is that the message containing
> the SA dump is sent to consumers via a simple netlink broadcast, which
> doesn't provide a facility to redact the SA secret on a per-consumer
> basis. I can't see any way to make the checks meaningful for SELinux
> without a major overhaul of the broadcast logic.

Fair enough.

-- 
paul moore
www.paul-moore.com


Re: switch the block layer to use kmap_local_page

2021-06-08 Thread Ira Weiny
On Tue, Jun 08, 2021 at 06:05:47PM +0200, Christoph Hellwig wrote:
> Hi all,
> 
> this series switches the core block layer code and all users of the
> existing bvec kmap helpers to use kmap_local_page.  Drivers that
> currently use open coded kmap_atomic calls will converted in a follow
> on series.

Other than the missing flush_dcache's, for the series:

Reviewed-by: Ira Weiny 

> 
> Diffstat:
>  arch/mips/include/asm/mach-rc32434/rb.h |2 -
>  block/bio-integrity.c   |   14 --
>  block/bio.c |   37 +++-
>  block/blk-map.c |2 -
>  block/bounce.c  |   35 ++
>  block/t10-pi.c  |   16 
>  drivers/block/ps3disk.c |   19 ++
>  drivers/block/rbd.c |   15 +--
>  drivers/md/dm-writecache.c  |5 +--
>  include/linux/bio.h |   42 
> 
>  include/linux/bvec.h|   27 ++--
>  include/linux/highmem.h |4 +--
>  12 files changed, 64 insertions(+), 154 deletions(-)


Re: [PATCH 14/16] block: use memcpy_from_bvec in __blk_queue_bounce

2021-06-08 Thread Ira Weiny
On Tue, Jun 08, 2021 at 06:06:01PM +0200, Christoph Hellwig wrote:
> Rewrite the actual bounce buffering loop in __blk_queue_bounce so that
> the memcpy_from_bvec helper can be used to perform the data copies.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  block/bounce.c | 21 +++--
>  1 file changed, 7 insertions(+), 14 deletions(-)
> 
> diff --git a/block/bounce.c b/block/bounce.c
> index a2fc6326b6c9..b5ad09e07bcf 100644
> --- a/block/bounce.c
> +++ b/block/bounce.c
> @@ -243,24 +243,17 @@ void __blk_queue_bounce(struct request_queue *q, struct 
> bio **bio_orig)
>* because the 'bio' is single-page bvec.
>*/
>   for (i = 0, to = bio->bi_io_vec; i < bio->bi_vcnt; to++, i++) {
> - struct page *page = to->bv_page;
> + struct page *bounce_page;
>  
> - if (!PageHighMem(page))
> + if (!PageHighMem(to->bv_page))
>   continue;
>  
> - to->bv_page = mempool_alloc(&page_pool, GFP_NOIO);
> - inc_zone_page_state(to->bv_page, NR_BOUNCE);
> + bounce_page = mempool_alloc(&page_pool, GFP_NOIO);
> + inc_zone_page_state(bounce_page, NR_BOUNCE);
>  
> - if (rw == WRITE) {
> - char *vto, *vfrom;
> -
> - flush_dcache_page(page);
> -
> - vto = page_address(to->bv_page) + to->bv_offset;
> - vfrom = kmap_atomic(page) + to->bv_offset;
> - memcpy(vto, vfrom, to->bv_len);
> - kunmap_atomic(vfrom);
> - }
> + if (rw == WRITE)
> + memcpy_from_bvec(page_address(bounce_page), to);

NIT: the fact that the copy is from 'to' makes my head hurt...  But I don't
see a good way to change that without declaring unnecessary variables...  :-(

The logic seems right.

Ira

> + to->bv_page = bounce_page;
>   }
>  
>   trace_block_bio_bounce(*bio_orig);
> -- 
> 2.30.2
> 


Re: [PATCH 09/16] ps3disk: use memcpy_{from,to}_bvec

2021-06-08 Thread Ira Weiny
On Tue, Jun 08, 2021 at 06:05:56PM +0200, Christoph Hellwig wrote:
>  
>   rq_for_each_segment(bvec, req, iter) {
> - unsigned long flags;
> - dev_dbg(&dev->sbd.core, "%s:%u: bio %u: %u sectors from %llu\n",
> - __func__, __LINE__, i, bio_sectors(iter.bio),
> - iter.bio->bi_iter.bi_sector);
> -
> - size = bvec.bv_len;
> - buf = bvec_kmap_irq(&bvec, &flags);
>   if (gather)
> - memcpy(dev->bounce_buf+offset, buf, size);
> + memcpy_from_bvec(dev->bounce_buf + offset, &bvec);
>   else
> - memcpy(buf, dev->bounce_buf+offset, size);
> - offset += size;
> - flush_kernel_dcache_page(bvec.bv_page);

I'm still not 100% sure that these flushes are needed, but they are not no-ops on
every arch.  Would it be best to preserve them after the memcpy_to/from_bvec()?

Same thing in patch 11 and 14.

Ira


[PATCH v15 9/9] powerpc/32: use set_memory_attr()

2021-06-08 Thread Jordan Niethe
From: Christophe Leroy 

Use set_memory_attr() instead of the PPC32 specific change_page_attr()

change_page_attr() was checking that the address was not mapped by
blocks and was handling highmem, but that's unneeded because the
affected pages can't be in highmem and block mapping verification
is already done by the callers.

Signed-off-by: Christophe Leroy 
[ruscur: rebase on powerpc/merge with Christophe's new patches]
Signed-off-by: Russell Currey 
Signed-off-by: Jordan Niethe 
---
 arch/powerpc/mm/pgtable_32.c | 60 ++--
 1 file changed, 10 insertions(+), 50 deletions(-)

diff --git a/arch/powerpc/mm/pgtable_32.c b/arch/powerpc/mm/pgtable_32.c
index e0ec67a16887..dcf5ecca19d9 100644
--- a/arch/powerpc/mm/pgtable_32.c
+++ b/arch/powerpc/mm/pgtable_32.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -132,64 +133,20 @@ void __init mapin_ram(void)
}
 }
 
-static int __change_page_attr_noflush(struct page *page, pgprot_t prot)
-{
-   pte_t *kpte;
-   unsigned long address;
-
-   BUG_ON(PageHighMem(page));
-   address = (unsigned long)page_address(page);
-
-   if (v_block_mapped(address))
-   return 0;
-   kpte = virt_to_kpte(address);
-   if (!kpte)
-   return -EINVAL;
-   __set_pte_at(&init_mm, address, kpte, mk_pte(page, prot), 0);
-
-   return 0;
-}
-
-/*
- * Change the page attributes of an page in the linear mapping.
- *
- * THIS DOES NOTHING WITH BAT MAPPINGS, DEBUG USE ONLY
- */
-static int change_page_attr(struct page *page, int numpages, pgprot_t prot)
-{
-   int i, err = 0;
-   unsigned long flags;
-   struct page *start = page;
-
-   local_irq_save(flags);
-   for (i = 0; i < numpages; i++, page++) {
-   err = __change_page_attr_noflush(page, prot);
-   if (err)
-   break;
-   }
-   wmb();
-   local_irq_restore(flags);
-   flush_tlb_kernel_range((unsigned long)page_address(start),
-  (unsigned long)page_address(page));
-   return err;
-}
-
 void mark_initmem_nx(void)
 {
-   struct page *page = virt_to_page(_sinittext);
unsigned long numpages = PFN_UP((unsigned long)_einittext) -
 PFN_DOWN((unsigned long)_sinittext);
 
if (v_block_mapped((unsigned long)_sinittext))
mmu_mark_initmem_nx();
else
-   change_page_attr(page, numpages, PAGE_KERNEL);
+   set_memory_attr((unsigned long)_sinittext, numpages, PAGE_KERNEL);
 }
 
 #ifdef CONFIG_STRICT_KERNEL_RWX
 void mark_rodata_ro(void)
 {
-   struct page *page;
unsigned long numpages;
 
if (v_block_mapped((unsigned long)_stext + 1)) {
@@ -198,20 +155,18 @@ void mark_rodata_ro(void)
return;
}
 
-   page = virt_to_page(_stext);
numpages = PFN_UP((unsigned long)_etext) -
   PFN_DOWN((unsigned long)_stext);
 
-   change_page_attr(page, numpages, PAGE_KERNEL_ROX);
+   set_memory_attr((unsigned long)_stext, numpages, PAGE_KERNEL_ROX);
/*
 * mark .rodata as read only. Use __init_begin rather than __end_rodata
 * to cover NOTES and EXCEPTION_TABLE.
 */
-   page = virt_to_page(__start_rodata);
numpages = PFN_UP((unsigned long)__init_begin) -
   PFN_DOWN((unsigned long)__start_rodata);
 
-   change_page_attr(page, numpages, PAGE_KERNEL_RO);
+   set_memory_attr((unsigned long)__start_rodata, numpages, PAGE_KERNEL_RO);
 
// mark_initmem_nx() should have already run by now
ptdump_check_wx();
@@ -221,9 +176,14 @@ void mark_rodata_ro(void)
 #ifdef CONFIG_DEBUG_PAGEALLOC
 void __kernel_map_pages(struct page *page, int numpages, int enable)
 {
+   unsigned long addr = (unsigned long)page_address(page);
+
if (PageHighMem(page))
return;
 
-   change_page_attr(page, numpages, enable ? PAGE_KERNEL : __pgprot(0));
+   if (enable)
+   set_memory_attr(addr, numpages, PAGE_KERNEL);
+   else
+   set_memory_attr(addr, numpages, __pgprot(0));
 }
 #endif /* CONFIG_DEBUG_PAGEALLOC */
-- 
2.25.1



[PATCH v15 8/9] powerpc/mm: implement set_memory_attr()

2021-06-08 Thread Jordan Niethe
From: Christophe Leroy 

In addition to the set_memory_xx() functions, which allow changing
the memory attributes of not (yet) used memory regions, implement a
set_memory_attr() function to:
- set the final memory protection after init on currently used
kernel regions.
- enable/disable kernel memory regions in the scope of DEBUG_PAGEALLOC.

Unlike set_memory_xx(), which can act in three steps as the regions
are unused, this function must modify pages 'on the fly' as the kernel
is executing from them. At the moment only PPC32 will use it, and changing
page attributes on the fly is not an issue.

Signed-off-by: Christophe Leroy 
Reported-by: kbuild test robot 
[ruscur: cast "data" to unsigned long instead of int]
Signed-off-by: Russell Currey 
Signed-off-by: Jordan Niethe 
---
 arch/powerpc/include/asm/set_memory.h |  2 ++
 arch/powerpc/mm/pageattr.c| 33 +++
 2 files changed, 35 insertions(+)

diff --git a/arch/powerpc/include/asm/set_memory.h 
b/arch/powerpc/include/asm/set_memory.h
index 64011ea444b4..b040094f7920 100644
--- a/arch/powerpc/include/asm/set_memory.h
+++ b/arch/powerpc/include/asm/set_memory.h
@@ -29,4 +29,6 @@ static inline int set_memory_x(unsigned long addr, int numpages)
return change_memory_attr(addr, numpages, SET_MEMORY_X);
 }
 
+int set_memory_attr(unsigned long addr, int numpages, pgprot_t prot);
+
 #endif
diff --git a/arch/powerpc/mm/pageattr.c b/arch/powerpc/mm/pageattr.c
index 5e5ae50a7f23..0876216ceee6 100644
--- a/arch/powerpc/mm/pageattr.c
+++ b/arch/powerpc/mm/pageattr.c
@@ -99,3 +99,36 @@ int change_memory_attr(unsigned long addr, int numpages, long action)
	return apply_to_existing_page_range(&init_mm, start, size,
					    change_page_attr, (void *)action);
 }
+
+/*
+ * Set the attributes of a page:
+ *
+ * This function is used by PPC32 at the end of init to set final kernel memory
+ * protection. It includes changing the mapping of the page it is executing from
+ * and data pages it is using.
+ */
+static int set_page_attr(pte_t *ptep, unsigned long addr, void *data)
+{
+   pgprot_t prot = __pgprot((unsigned long)data);
+
+   spin_lock(&init_mm.page_table_lock);
+
+   set_pte_at(&init_mm, addr, ptep, pte_modify(*ptep, prot));
+   flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+
+   spin_unlock(&init_mm.page_table_lock);
+
+   return 0;
+}
+
+int set_memory_attr(unsigned long addr, int numpages, pgprot_t prot)
+{
+   unsigned long start = ALIGN_DOWN(addr, PAGE_SIZE);
+   unsigned long sz = numpages * PAGE_SIZE;
+
+   if (numpages <= 0)
+   return 0;
+
+   return apply_to_existing_page_range(&init_mm, start, sz, set_page_attr,
+   (void *)pgprot_val(prot));
+}
-- 
2.25.1



[PATCH v15 7/9] powerpc: Set ARCH_HAS_STRICT_MODULE_RWX

2021-06-08 Thread Jordan Niethe
From: Russell Currey 

To enable strict module RWX on powerpc, set:

CONFIG_STRICT_MODULE_RWX=y

You should also have CONFIG_STRICT_KERNEL_RWX=y set to have any real
security benefit.

ARCH_HAS_STRICT_MODULE_RWX is set to require ARCH_HAS_STRICT_KERNEL_RWX.
This is due to a quirk in arch/Kconfig and arch/powerpc/Kconfig that
makes STRICT_MODULE_RWX *on by default* in configurations where
STRICT_KERNEL_RWX is *unavailable*.

Since this doesn't make much sense, and module RWX without kernel RWX
doesn't make much sense, having the same dependencies as kernel RWX
works around this problem.

Book3s/32 603 and 604 core processors are not able to write protect
kernel pages so do not set ARCH_HAS_STRICT_MODULE_RWX for Book3s/32.

Reviewed-by: Christophe Leroy 
Signed-off-by: Russell Currey 
[jpn: - predicate on !PPC_BOOK3S_604
  - make module_alloc() use PAGE_KERNEL protection]
Signed-off-by: Jordan Niethe 
---
v10: - Predicate on !PPC_BOOK3S_604
 - Make module_alloc() use PAGE_KERNEL protection
v11: - Neaten up
v13: Use strict_kernel_rwx_enabled()
v14: Make changes to module_alloc() its own commit
v15: - Force STRICT_KERNEL_RWX if STRICT_MODULE_RWX is selected
 - Predicate on !PPC_BOOK3S_32 instead
---
 arch/powerpc/Kconfig | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index abfe2e9225fa..72f307f1796b 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -142,6 +142,7 @@ config PPC
 	select ARCH_HAS_SCALED_CPUTIME		if VIRT_CPU_ACCOUNTING_NATIVE && PPC_BOOK3S_64
 	select ARCH_HAS_SET_MEMORY
 	select ARCH_HAS_STRICT_KERNEL_RWX	if ((PPC_BOOK3S_64 || PPC32) && !HIBERNATION)
+	select ARCH_HAS_STRICT_MODULE_RWX	if ARCH_HAS_STRICT_KERNEL_RWX && !PPC_BOOK3S_32
select ARCH_HAS_TICK_BROADCAST  if GENERIC_CLOCKEVENTS_BROADCAST
select ARCH_HAS_UACCESS_FLUSHCACHE
select ARCH_HAS_UBSAN_SANITIZE_ALL
@@ -267,6 +268,7 @@ config PPC
select PPC_DAWR if PPC64
select RTC_LIB
select SPARSE_IRQ
+   select STRICT_KERNEL_RWX if STRICT_MODULE_RWX
select SYSCTL_EXCEPTION_TRACE
select THREAD_INFO_IN_TASK
select VIRT_TO_BUS  if !PPC64
-- 
2.25.1



[PATCH v15 6/9] powerpc/bpf: Write protect JIT code

2021-06-08 Thread Jordan Niethe
Add the necessary call to bpf_jit_binary_lock_ro() to remove write and
add exec permissions to the JIT image after it has finished being
written.

Without CONFIG_STRICT_MODULE_RWX the image will be writable and
executable until the call to bpf_jit_binary_lock_ro().

Reviewed-by: Christophe Leroy 
Signed-off-by: Jordan Niethe 
---
v10: New to series
v11: Remove CONFIG_STRICT_MODULE_RWX conditional
---
 arch/powerpc/net/bpf_jit_comp.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/net/bpf_jit_comp.c b/arch/powerpc/net/bpf_jit_comp.c
index 6c8c268e4fe8..53aefee3fe70 100644
--- a/arch/powerpc/net/bpf_jit_comp.c
+++ b/arch/powerpc/net/bpf_jit_comp.c
@@ -237,6 +237,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
fp->jited_len = alloclen;
 
bpf_flush_icache(bpf_hdr, (u8 *)bpf_hdr + (bpf_hdr->pages * PAGE_SIZE));
+   bpf_jit_binary_lock_ro(bpf_hdr);
if (!fp->is_func || extra_pass) {
bpf_prog_fill_jited_linfo(fp, addrs);
 out_addrs:
-- 
2.25.1



[PATCH v15 5/9] powerpc/bpf: Remove bpf_jit_free()

2021-06-08 Thread Jordan Niethe
Commit 74451e66d516 ("bpf: make jited programs visible in traces") added
a default bpf_jit_free() implementation. Powerpc did not use the default
bpf_jit_free() as powerpc did not set the images read-only. The default
bpf_jit_free() called bpf_jit_binary_unlock_ro(), which is why it could
not be used for powerpc.

Commit d53d2f78cead ("bpf: Use vmalloc special flag") moved keeping
track of read-only memory to vmalloc. This included removing
bpf_jit_binary_unlock_ro(). Therefore there is no reason powerpc needs
its own bpf_jit_free(). Remove it.

Reviewed-by: Christophe Leroy 
Signed-off-by: Jordan Niethe 
---
v11: New to series
---
 arch/powerpc/net/bpf_jit_comp.c | 12 
 1 file changed, 12 deletions(-)

diff --git a/arch/powerpc/net/bpf_jit_comp.c b/arch/powerpc/net/bpf_jit_comp.c
index 798ac4350a82..6c8c268e4fe8 100644
--- a/arch/powerpc/net/bpf_jit_comp.c
+++ b/arch/powerpc/net/bpf_jit_comp.c
@@ -257,15 +257,3 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
 
return fp;
 }
-
-/* Overriding bpf_jit_free() as we don't set images read-only. */
-void bpf_jit_free(struct bpf_prog *fp)
-{
-   unsigned long addr = (unsigned long)fp->bpf_func & PAGE_MASK;
-   struct bpf_binary_header *bpf_hdr = (void *)addr;
-
-   if (fp->jited)
-   bpf_jit_binary_free(bpf_hdr);
-
-   bpf_prog_unlock_free(fp);
-}
-- 
2.25.1



[PATCH v15 4/9] powerpc/kprobes: Mark newly allocated probes as ROX

2021-06-08 Thread Jordan Niethe
From: Russell Currey 

Add the arch specific insn page allocator for powerpc. This allocates
ROX pages if STRICT_KERNEL_RWX is enabled. These pages are only written
to with patch_instruction() which is able to write RO pages.

Reviewed-by: Daniel Axtens 
Signed-off-by: Russell Currey 
Signed-off-by: Christophe Leroy 
[jpn: Reword commit message, switch to __vmalloc_node_range()]
Signed-off-by: Jordan Niethe 
---
v9: - vmalloc_exec() no longer exists
- Set the page to RW before freeing it
v10: - use __vmalloc_node_range()
v11: - Neaten up
v12: - Switch from __vmalloc_node_range() to module_alloc()
v13: Use strict_kernel_rwx_enabled()
v14: Use strict_module_rwx_enabled()
---
 arch/powerpc/kernel/kprobes.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
index 01ab2163659e..937e338053ff 100644
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -19,11 +19,13 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 #include 
 
 DEFINE_PER_CPU(struct kprobe *, current_kprobe) = NULL;
@@ -103,6 +105,21 @@ kprobe_opcode_t *kprobe_lookup_name(const char *name, unsigned int offset)
return addr;
 }
 
+void *alloc_insn_page(void)
+{
+   void *page;
+
+   page = module_alloc(PAGE_SIZE);
+   if (!page)
+   return NULL;
+
+   if (strict_module_rwx_enabled()) {
+   set_memory_ro((unsigned long)page, 1);
+   set_memory_x((unsigned long)page, 1);
+   }
+   return page;
+}
+
 int arch_prepare_kprobe(struct kprobe *p)
 {
int ret = 0;
-- 
2.25.1



[PATCH v15 3/9] powerpc/modules: Make module_alloc() Strict Module RWX aware

2021-06-08 Thread Jordan Niethe
Make module_alloc() use PAGE_KERNEL protections instead of
PAGE_KERNEL_EXEC if Strict Module RWX is enabled.

Signed-off-by: Jordan Niethe 
---
v14: - Split out from powerpc: Set ARCH_HAS_STRICT_MODULE_RWX
 - Add and use strict_module_rwx_enabled() helper
---
 arch/powerpc/include/asm/mmu.h | 5 +
 arch/powerpc/kernel/module.c   | 4 +++-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/mmu.h b/arch/powerpc/include/asm/mmu.h
index 998fe01dd1a8..27016b98ecb2 100644
--- a/arch/powerpc/include/asm/mmu.h
+++ b/arch/powerpc/include/asm/mmu.h
@@ -345,6 +345,11 @@ static inline bool strict_kernel_rwx_enabled(void)
return false;
 }
 #endif
+
+static inline bool strict_module_rwx_enabled(void)
+{
+	return IS_ENABLED(CONFIG_STRICT_MODULE_RWX) && strict_kernel_rwx_enabled();
+}
 #endif /* !__ASSEMBLY__ */
 
 /* The kernel use the constants below to index in the page sizes array.
diff --git a/arch/powerpc/kernel/module.c b/arch/powerpc/kernel/module.c
index 3f35c8d20be7..ed04a3ba66fe 100644
--- a/arch/powerpc/kernel/module.c
+++ b/arch/powerpc/kernel/module.c
@@ -92,12 +92,14 @@ int module_finalize(const Elf_Ehdr *hdr,
 static __always_inline void *
 __module_alloc(unsigned long size, unsigned long start, unsigned long end)
 {
+	pgprot_t prot = strict_module_rwx_enabled() ? PAGE_KERNEL : PAGE_KERNEL_EXEC;
+
/*
 * Don't do huge page allocations for modules yet until more testing
 * is done. STRICT_MODULE_RWX may require extra work to support this
 * too.
 */
-	return __vmalloc_node_range(size, 1, start, end, GFP_KERNEL, PAGE_KERNEL_EXEC,
+   return __vmalloc_node_range(size, 1, start, end, GFP_KERNEL, prot,
VM_FLUSH_RESET_PERMS | VM_NO_HUGE_VMAP,
NUMA_NO_NODE, __builtin_return_address(0));
 }
-- 
2.25.1



[PATCH v15 2/9] powerpc/lib/code-patching: Set up Strict RWX patching earlier

2021-06-08 Thread Jordan Niethe
setup_text_poke_area() is a late init call so it runs before
mark_rodata_ro() and after the init calls. This lets all the init code
patching simply write to their locations. In the future, kprobes is
going to allocate its instruction pages RO which means they will need
setup_text_poke_area() to have already been called for their code
patching. However, init_kprobes() (which allocates and patches some
instruction pages) is an early init call so it happens before
setup_text_poke_area().

start_kernel() calls poking_init() before any of the init calls. On
powerpc, poking_init() is currently a nop. setup_text_poke_area() relies
on kernel virtual memory, cpu hotplug and per_cpu_areas being setup.
setup_per_cpu_areas(), boot_cpu_hotplug_init() and mm_init() are called
before poking_init().

Turn setup_text_poke_area() into poking_init().

Reviewed-by: Christophe Leroy 
Reviewed-by: Russell Currey 
Signed-off-by: Jordan Niethe 
---
v9: New to series
---
 arch/powerpc/lib/code-patching.c | 12 
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index 870b30d9be2f..15296207e1ba 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -70,14 +70,11 @@ static int text_area_cpu_down(unsigned int cpu)
 }
 
 /*
- * Run as a late init call. This allows all the boot time patching to be done
- * simply by patching the code, and then we're called here prior to
- * mark_rodata_ro(), which happens after all init calls are run. Although
- * BUG_ON() is rude, in this case it should only happen if ENOMEM, and we judge
- * it as being preferable to a kernel that will crash later when someone tries
- * to use patch_instruction().
+ * Although BUG_ON() is rude, in this case it should only happen if ENOMEM, and
+ * we judge it as being preferable to a kernel that will crash later when
+ * someone tries to use patch_instruction().
  */
-static int __init setup_text_poke_area(void)
+int __init poking_init(void)
 {
BUG_ON(!cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
"powerpc/text_poke:online", text_area_cpu_up,
@@ -85,7 +82,6 @@ static int __init setup_text_poke_area(void)
 
return 0;
 }
-late_initcall(setup_text_poke_area);
 
 /*
  * This can be called for kernel text or a module.
-- 
2.25.1



[PATCH v15 1/9] powerpc/mm: Implement set_memory() routines

2021-06-08 Thread Jordan Niethe
From: Russell Currey 

The set_memory_{ro/rw/nx/x}() functions are required for
STRICT_MODULE_RWX, and are generally useful primitives to have.  This
implementation is designed to be generic across powerpc's many MMUs.
It's possible that this could be optimised to be faster for specific
MMUs.

This implementation does not handle cases where the caller is attempting
to change the mapping of the page it is executing from, or if another
CPU is concurrently using the page being altered.  These cases likely
shouldn't happen, but a more complex implementation with MMU-specific code
could safely handle them.

On hash, the linear mapping is not kept in the linux pagetable, so this
will not change the protection if used on that range. Currently these
functions are not used on the linear map so just WARN for now.

apply_to_existing_page_range() does not work on huge pages so for now
disallow changing the protection of huge pages.
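The SET_MEMORY_* action interface this adds can be sketched as a dispatcher. The toy below models a pte as a plain flags word, so everything except the switch shape is an assumption for illustration; the real change_page_attr() edits live ptes under apply_to_existing_page_range():

```c
#include <assert.h>

/* Action values, mirroring asm/set_memory.h in this series. */
#define SET_MEMORY_RO	0
#define SET_MEMORY_RW	1
#define SET_MEMORY_NX	2
#define SET_MEMORY_X	3

/* Invented flag bits for the toy "pte". */
#define PTE_RW		0x1UL
#define PTE_EXEC	0x2UL

/* Toy stand-in for one pte update inside change_page_attr(). */
static unsigned long apply_attr(unsigned long pte, long action)
{
	switch (action) {
	case SET_MEMORY_RO:
		return pte & ~PTE_RW;
	case SET_MEMORY_RW:
		return pte | PTE_RW;
	case SET_MEMORY_NX:
		return pte & ~PTE_EXEC;
	case SET_MEMORY_X:
		return pte | PTE_EXEC;
	}
	return pte;	/* unknown action: leave the mapping untouched */
}
```

Taking an action value rather than function pointers (the v12 change noted below) keeps the dispatch in one place and the callers trivial.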

Reviewed-by: Daniel Axtens 
Signed-off-by: Russell Currey 
Signed-off-by: Christophe Leroy 
[jpn: - Allow set memory functions to be used without Strict RWX
  - Hash: Disallow certain regions
  - Have change_page_attr() take function pointers to manipulate ptes
  - Radix: Add ptesync after set_pte_at()]
Signed-off-by: Jordan Niethe 
---
v10: WARN if trying to change the hash linear map
v11: - Update copywrite dates
 - Allow set memory functions to be used without Strict RWX
 - Hash: Disallow certain regions and add comment explaining why
 - Have change_page_attr() take function pointers to manipulate ptes
 - Clarify change_page_attr()'s comment
 - Radix: Add ptesync after set_pte_at()
v12: - change_page_attr() back to taking an action value
 - disallow operating on huge pages
v14: - only check is_vm_area_hugepages() for virtual memory
---
 arch/powerpc/Kconfig  |   1 +
 arch/powerpc/include/asm/set_memory.h |  32 
 arch/powerpc/mm/Makefile  |   2 +-
 arch/powerpc/mm/pageattr.c| 101 ++
 4 files changed, 135 insertions(+), 1 deletion(-)
 create mode 100644 arch/powerpc/include/asm/set_memory.h
 create mode 100644 arch/powerpc/mm/pageattr.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 36ba413f49d3..abfe2e9225fa 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -140,6 +140,7 @@ config PPC
select ARCH_HAS_PTE_DEVMAP  if PPC_BOOK3S_64
select ARCH_HAS_PTE_SPECIAL
	select ARCH_HAS_SCALED_CPUTIME  if VIRT_CPU_ACCOUNTING_NATIVE && PPC_BOOK3S_64
+   select ARCH_HAS_SET_MEMORY
	select ARCH_HAS_STRICT_KERNEL_RWX   if ((PPC_BOOK3S_64 || PPC32) && !HIBERNATION)
select ARCH_HAS_TICK_BROADCAST  if GENERIC_CLOCKEVENTS_BROADCAST
select ARCH_HAS_UACCESS_FLUSHCACHE
diff --git a/arch/powerpc/include/asm/set_memory.h b/arch/powerpc/include/asm/set_memory.h
new file mode 100644
index ..64011ea444b4
--- /dev/null
+++ b/arch/powerpc/include/asm/set_memory.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_POWERPC_SET_MEMORY_H
+#define _ASM_POWERPC_SET_MEMORY_H
+
+#define SET_MEMORY_RO  0
+#define SET_MEMORY_RW  1
+#define SET_MEMORY_NX  2
+#define SET_MEMORY_X   3
+
+int change_memory_attr(unsigned long addr, int numpages, long action);
+
+static inline int set_memory_ro(unsigned long addr, int numpages)
+{
+   return change_memory_attr(addr, numpages, SET_MEMORY_RO);
+}
+
+static inline int set_memory_rw(unsigned long addr, int numpages)
+{
+   return change_memory_attr(addr, numpages, SET_MEMORY_RW);
+}
+
+static inline int set_memory_nx(unsigned long addr, int numpages)
+{
+   return change_memory_attr(addr, numpages, SET_MEMORY_NX);
+}
+
+static inline int set_memory_x(unsigned long addr, int numpages)
+{
+   return change_memory_attr(addr, numpages, SET_MEMORY_X);
+}
+
+#endif
diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
index c3df3a8501d4..9142cf1fb0d5 100644
--- a/arch/powerpc/mm/Makefile
+++ b/arch/powerpc/mm/Makefile
@@ -5,7 +5,7 @@
 
 ccflags-$(CONFIG_PPC64):= $(NO_MINIMAL_TOC)
 
-obj-y  := fault.o mem.o pgtable.o mmap.o maccess.o \
+obj-y  := fault.o mem.o pgtable.o mmap.o maccess.o pageattr.o \
   init_$(BITS).o pgtable_$(BITS).o \
   pgtable-frag.o ioremap.o ioremap_$(BITS).o \
   init-common.o mmu_context.o drmem.o \
diff --git a/arch/powerpc/mm/pageattr.c b/arch/powerpc/mm/pageattr.c
new file mode 100644
index ..5e5ae50a7f23
--- /dev/null
+++ b/arch/powerpc/mm/pageattr.c
@@ -0,0 +1,101 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * MMU-generic set_memory implementation for powerpc
+ *
+ * Copyright 2019-2021, IBM Corporation.
+ */
+
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+
+
+/*
+ * Updates the attributes of a page 

[PATCH v15 0/9] powerpc: Further Strict RWX support

2021-06-08 Thread Jordan Niethe
Adding more Strict RWX support on powerpc, in particular Strict Module RWX.
It is now rebased on ppc next.

For reference the previous revision is available here: 
https://lore.kernel.org/linuxppc-dev/20210517032810.129949-1-jniet...@gmail.com/

Changes for v15:
- Force STRICT_KERNEL_RWX if STRICT_MODULE_RWX is selected
- Predicate on !PPC_BOOK3S_32 instead

Christophe Leroy (2):
  powerpc/mm: implement set_memory_attr()
  powerpc/32: use set_memory_attr()

Jordan Niethe (4):
  powerpc/lib/code-patching: Set up Strict RWX patching earlier
  powerpc/modules: Make module_alloc() Strict Module RWX aware
  powerpc/bpf: Remove bpf_jit_free()
  powerpc/bpf: Write protect JIT code

Russell Currey (3):
  powerpc/mm: Implement set_memory() routines
  powerpc/kprobes: Mark newly allocated probes as ROX
  powerpc: Set ARCH_HAS_STRICT_MODULE_RWX


 arch/powerpc/Kconfig  |   3 +
 arch/powerpc/include/asm/mmu.h|   5 +
 arch/powerpc/include/asm/set_memory.h |  34 +++
 arch/powerpc/kernel/kprobes.c |  17 
 arch/powerpc/kernel/module.c  |   4 +-
 arch/powerpc/lib/code-patching.c  |  12 +--
 arch/powerpc/mm/Makefile  |   2 +-
 arch/powerpc/mm/pageattr.c| 134 ++
 arch/powerpc/mm/pgtable_32.c  |  60 ++--
 arch/powerpc/net/bpf_jit_comp.c   |  13 +--
 10 files changed, 212 insertions(+), 72 deletions(-)
 create mode 100644 arch/powerpc/include/asm/set_memory.h
 create mode 100644 arch/powerpc/mm/pageattr.c

-- 
2.25.1



Re: [PATCH v2 1/3] powerpc/mm/hash: Avoid resizing-down HPT on first memory hotplug

2021-06-08 Thread Leonardo Brás
On Mon, 2021-06-07 at 15:02 +1000, David Gibson wrote:
> On Fri, Apr 30, 2021 at 11:36:06AM -0300, Leonardo Bras wrote:
> > Because hypervisors may need to create HPTs without knowing the guest
> > page size, the smallest used page-size (4k) may be chosen, resulting in
> > a HPT that is possibly bigger than needed.
> > 
> > On a guest with bigger page-sizes, the number of entries for HPT may be
> > too high, causing the guest to ask for a HPT resize-down on the first
> > hotplug.
> > 
> > This becomes a problem when HPT resize-down fails, and causes the
> > HPT resize to be performed on every LMB added, until HPT size is
> > compatible to guest memory size, causing a major slowdown.
> > 
> > So, avoiding HPT resizing-down on hot-add significantly improves memory
> > hotplug times.
> > 
> > As an example, hotplugging 256GB on a 129GB guest took 710s without this
> > patch, and 21s after applied.
> > 
> > Signed-off-by: Leonardo Bras 
> 
> Sorry it's taken me so long to look at these
> 
> I don't love the extra statefulness that the 'shrinking' parameter
> adds, but I can't see an elegant way to avoid it, so:
> 
> Reviewed-by: David Gibson 

np, thanks for reviewing!

Best regards,
Leonardo Bras
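The over-sizing described in the quoted commit message can be sketched numerically. Assuming the heuristic QEMU's sPAPR code uses for the default HPT — roughly one byte of HPT per 128 bytes of guest memory, power-of-two sized, with a 256 KiB floor (treat the constants as illustrative, not PAPR-mandated):

```c
#include <assert.h>

/* Round up to the next power of two (64-bit, v >= 1). */
static unsigned long long pow2ceil(unsigned long long v)
{
	unsigned long long p = 1;

	while (p < v)
		p <<= 1;
	return p;
}

/* Illustrative HPT size: memory/128, power of two, min 256 KiB. */
static unsigned long long hpt_bytes(unsigned long long mem_bytes)
{
	unsigned long long sz = pow2ceil(mem_bytes) >> 7;

	return sz < (256ULL << 10) ? (256ULL << 10) : sz;
}
```

For the 129GB guest above this gives a 2 GiB HPT, sized as if every page were 4k; a guest actually running with 64k pages needs far fewer HPTEs, which is what makes it request a resize-down on the first hotplug.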



Re: [PATCH v7 01/11] mm/mremap: Fix race between MOVE_PMD mremap and pageout

2021-06-08 Thread Hugh Dickins
On Tue, 8 Jun 2021, Aneesh Kumar K.V wrote:
> 
> mm/mremap: hold the rmap lock in write mode when moving page table entries.
> 
> To avoid a race between rmap walk and mremap, mremap does take_rmap_locks().
> The lock was taken to ensure that rmap walk doesn't miss a page table entry
> due to PTE moves via move_pagetables(). The kernel does further optimization
> of this lock such that if we are going to find the newly added vma after the
> old vma, the rmap lock is not taken. This is because rmap walk would find the
> vmas in the same order and if we don't find the page table attached to the
> older vma we would find it with the new vma, which we would iterate later.
> The actual lifetime of the page is still controlled by the PTE lock.
> 
> This patch updates the locking requirement to handle another race condition
> explained below with optimized mremap::
> 
> Optimized PMD move
> 
>     CPU 1                      CPU 2                          CPU 3
> 
>     mremap(old_addr, new_addr) page_shrinker/try_to_unmap_one
> 
>     mmap_write_lock_killable()
> 
>                                addr = old_addr
>                                lock(pte_ptl)
>     lock(pmd_ptl)
>     pmd = *old_pmd
>     pmd_clear(old_pmd)
>     flush_tlb_range(old_addr)
> 
>     *new_pmd = pmd
>                                                               *new_addr = 10; and fills
>                                                               TLB with new addr
>                                                               and old pfn
> 
>     unlock(pmd_ptl)
>                                ptep_clear_flush()
>                                old pfn is free.
>                                                               Stale TLB entry
> 

The PUD example below is mainly a waste of space and time:
"Optimized PUD move suffers from a similar race." would be better.

> Optimized PUD move:
> 
>     CPU 1                      CPU 2                          CPU 3
> 
>     mremap(old_addr, new_addr) page_shrinker/try_to_unmap_one
> 
>     mmap_write_lock_killable()
> 
>                                addr = old_addr
>                                lock(pte_ptl)
>     lock(pud_ptl)
>     pud = *old_pud
>     pud_clear(old_pud)
>     flush_tlb_range(old_addr)
> 
>     *new_pud = pud
>                                                               *new_addr = 10; and fills
>                                                               TLB with new addr
>                                                               and old pfn
> 
>     unlock(pud_ptl)
>                                ptep_clear_flush()
>                                old pfn is free.
>                                                               Stale TLB entry
> 
> Both the above race conditions can be fixed if we force the mremap path to
> take the rmap lock.
> 

Don't forget the Fixes and Link you had in the previous version:
Fixes: 2c91bd4a4e2e ("mm: speed up mremap by 20x on large regions")
Link: 
https://lore.kernel.org/linux-mm/CAHk-=wgxvr04ebntxqfevontwnp6fdm+oj5vauqxp3s-huw...@mail.gmail.com

> Signed-off-by: Aneesh Kumar K.V 

Thanks, this is orders of magnitude better!
Acked-by: Hugh Dickins 

> 
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 9cd352fb9cf8..f12df630fb37 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -517,7 +517,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>   } else if (IS_ENABLED(CONFIG_HAVE_MOVE_PUD) && extent == 
> PUD_SIZE) {
>  
>   if (move_pgt_entry(NORMAL_PUD, vma, old_addr, new_addr,
> -old_pud, new_pud, need_rmap_locks))
> +old_pud, new_pud, true))
>   continue;
>   }
>  
> @@ -544,7 +544,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>* moving at the PMD level if possible.
>*/
>   if (move_pgt_entry(NORMAL_PMD, vma, old_addr, new_addr,
> -old_pmd, new_pmd, need_rmap_locks))
> +old_pmd, new_pmd, true))
>   continue;
>   }
>  
> 


Re: [PATCH v3 8/9] mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA

2021-06-08 Thread Andrew Morton
On Tue,  8 Jun 2021 12:13:15 +0300 Mike Rapoport  wrote:

> From: Mike Rapoport 
> 
> After removal of DISCONTIGMEM the NEED_MULTIPLE_NODES and NUMA
> configuration options are equivalent.
> 
> Drop CONFIG_NEED_MULTIPLE_NODES and use CONFIG_NUMA instead.
> 
> Done with
> 
>   $ sed -i 's/CONFIG_NEED_MULTIPLE_NODES/CONFIG_NUMA/' \
>   $(git grep -wl CONFIG_NEED_MULTIPLE_NODES)
>   $ sed -i 's/NEED_MULTIPLE_NODES/NUMA/' \
>   $(git grep -wl NEED_MULTIPLE_NODES)
> 
> with manual tweaks afterwards.
> 
> ...
>
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -987,7 +987,7 @@ extern int movable_zone;
>  #ifdef CONFIG_HIGHMEM
>  static inline int zone_movable_is_highmem(void)
>  {
> -#ifdef CONFIG_NEED_MULTIPLE_NODES
> +#ifdef CONFIG_NUMA
>   return movable_zone == ZONE_HIGHMEM;
>  #else
>   return (ZONE_MOVABLE - 1) == ZONE_HIGHMEM;

I dropped this hunk - your "mm/mmzone.h: simplify is_highmem_idx()"
removed zone_movable_is_highmem().  


Re: [PATCH] crash_core, vmcoreinfo: Append 'SECTION_SIZE_BITS' to vmcoreinfo

2021-06-08 Thread Pingfan Liu
Correct mail address of Kazuhito

On Tue, Jun 8, 2021 at 6:34 PM Pingfan Liu  wrote:
>
> As mentioned in kernel commit 1d50e5d0c505 ("crash_core, vmcoreinfo:
> Append 'MAX_PHYSMEM_BITS' to vmcoreinfo"), SECTION_SIZE_BITS is used in the
> formula:
> #define SECTIONS_SHIFT	(MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)
>
> Besides SECTIONS_SHIFT, SECTION_SIZE_BITS is also used to calculate
> PAGES_PER_SECTION in makedumpfile just like kernel.
>
> Unfortunately, this arch-dependent macro SECTION_SIZE_BITS changes, e.g.
> recently in kernel commit f0b13ee23241 ("arm64/sparsemem: reduce
> SECTION_SIZE_BITS"). But user space wants a stable interface to get this
> info. Such info cannot be deduced from a crashdump vmcore.
> Hence append SECTION_SIZE_BITS to vmcoreinfo.
>
> Signed-off-by: Pingfan Liu 
> Cc: Bhupesh Sharma 
> Cc: Kazuhito Hagio 
> Cc: Dave Young 
> Cc: Baoquan He 
> Cc: Boris Petkov 
> Cc: Ingo Molnar 
> Cc: Thomas Gleixner 
> Cc: James Morse 
> Cc: Mark Rutland 
> Cc: Will Deacon 
> Cc: Catalin Marinas 
> Cc: Michael Ellerman 
> Cc: Paul Mackerras 
> Cc: Benjamin Herrenschmidt 
> Cc: Dave Anderson 
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-ker...@vger.kernel.org
> Cc: ke...@lists.infradead.org
> Cc: x...@kernel.org
> Cc: linux-arm-ker...@lists.infradead.org
> ---
>  kernel/crash_core.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index 825284baaf46..684a6061a13a 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -464,6 +464,7 @@ static int __init crash_save_vmcoreinfo_init(void)
> VMCOREINFO_LENGTH(mem_section, NR_SECTION_ROOTS);
> VMCOREINFO_STRUCT_SIZE(mem_section);
> VMCOREINFO_OFFSET(mem_section, section_mem_map);
> +   VMCOREINFO_NUMBER(SECTION_SIZE_BITS);
> VMCOREINFO_NUMBER(MAX_PHYSMEM_BITS);
>  #endif
> VMCOREINFO_STRUCT_SIZE(page);
> --
> 2.29.2
>
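The two values makedumpfile derives from SECTION_SIZE_BITS follow directly from the formula in the patch description. The arm64 numbers below (48-bit MAX_PHYSMEM_BITS, the 27-bit SECTION_SIZE_BITS from commit f0b13ee23241) are example constants for illustration; other architectures differ:

```c
#include <assert.h>

/*
 * Example constants: arm64 with 4 KiB pages, after SECTION_SIZE_BITS
 * was reduced from 30 to 27. These are assumptions for illustration.
 */
#define PAGE_SHIFT		12
#define MAX_PHYSMEM_BITS	48
#define SECTION_SIZE_BITS	27

/* The derived values user space needs, as in the kernel's mmzone.h. */
#define SECTIONS_SHIFT		(MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)
#define PFN_SECTION_SHIFT	(SECTION_SIZE_BITS - PAGE_SHIFT)
#define PAGES_PER_SECTION	(1UL << PFN_SECTION_SHIFT)
```

Dropping SECTION_SIZE_BITS from 30 to 27 shrinks each section from 1 GiB to 128 MiB, changing both derived values — which is why tools cannot hard-code it and the vmcoreinfo export helps.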


Re: [RFC] powerpc/pseries: Interface to represent PAPR firmware attributes

2021-06-08 Thread Fabiano Rosas
"Pratik R. Sampat"  writes:

Hi, I have some general comments and questions, mostly trying to
understand the design of the hcall and the use cases of the sysfs data:

> Adds a generic interface to represent the energy and frequency related
> PAPR attributes on the system using the new H_CALL
> "H_GET_ENERGY_SCALE_INFO".
>
> H_GET_EM_PARMS H_CALL was previously responsible for exporting this
> information in the lparcfg, however the H_GET_EM_PARMS H_CALL
> will be deprecated P10 onwards.
>
> The H_GET_ENERGY_SCALE_INFO H_CALL is of the following call format:
> hcall(
>   uint64 H_GET_ENERGY_SCALE_INFO,  // Get energy scale info
>   uint64 flags,   // Per the flag request
>   uint64 firstAttributeId,// The attribute id
>   uint64 bufferAddress,   // The logical address of the output buffer

Instead of logical address, guest address or guest physical address
would be more precise.

>   uint64 bufferSize   // The size in bytes of the output buffer
> );
>
> This H_CALL can query either all the attributes at once with
> firstAttributeId = 0, flags = 0 as well as query only one attribute
> at a time with firstAttributeId = id
>
> The output buffer consists of the following
> 1. number of attributes  - 8 bytes
> 2. array offset to the data location - 8 bytes

The offset is from the start of the buffer, isn't it? So not the array
offset.

> 3. version info  - 1 byte
> 4. A data array of size num attributes, which contains the following:
>   a. attribute ID  - 8 bytes
>   b. attribute value in number - 8 bytes
>   c. attribute name in string  - 64 bytes
>   d. attribute value in string - 64 bytes
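Modeling the quoted layout as packed C structs makes the packing concrete (the field names are invented here, not taken from any spec):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C view of the quoted output buffer; names are invented. */
struct h_energy_scale_attr {
	uint64_t id;		/* a. attribute ID             */
	uint64_t value;		/* b. attribute numeric value  */
	char name[64];		/* c. attribute name string    */
	char value_str[64];	/* d. attribute value string   */
} __attribute__((packed));

struct h_energy_scale_hdr {
	uint64_t num_attrs;	/* 1. number of attributes     */
	uint64_t array_offset;	/* 2. offset to the data array */
	uint8_t  version;	/* 3. version info             */
	/* the attribute array begins array_offset bytes into the buffer */
} __attribute__((packed));
```

One thing this makes visible: with a lone 1-byte version field, the 8-byte members of the data array are naturally misaligned unless the array offset compensates — the kind of detail a spec reference would settle.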

Is this new hypercall already present in the spec? These seem a bit
underspecified to me.

>
> The new H_CALL exports information in direct string value format, hence
> a new interface has been introduced in /sys/firmware/papr to export

Hm.. Maybe this should be something less generic than "papr"?

> this information to userspace in an extensible pass-through format.
> The H_CALL returns the name, numeric value and string value. As string
> values are in human readable format, therefore if the string value
> exists then that is given precedence over the numeric value.

So the hypervisor could simply not send the string representation? How
will the userspace tell the difference since they are reading everything
from a file?

Overall I'd say we should give the data in a more structured way and let
the user-facing tool do the formatting and presentation.

>
> The format of exposing the sysfs information is as follows:
> /sys/firmware/papr/
>   |-- attr_0_name
>   |-- attr_0_val
>   |-- attr_1_name
>   |-- attr_1_val
> ...

How do we keep a stable interface with userspace? Say the hypervisor
decides to add or remove attributes, change their order, string
representation, etc? It will inform us via the version field, but that
is lost when we output this to sysfs.

I get that if the userspace just iterate over the contents of the
directory then nothing breaks, but there is not much else it could do it
seems.

>
> The energy information that is exported is useful for userspace tools
> such as powerpc-utils. Currently these tools infer the
> "power_mode_data" value in the lparcfg, which in turn is obtained from
> the to be deprecated H_GET_EM_PARMS H_CALL.
> On future platforms, such userspace utilities will have to look at the
> data returned from the new H_CALL being populated in this new sysfs
> interface and report this information directly without the need of
> interpretation.
>
> Signed-off-by: Pratik R. Sampat 


Re: [PATCH v2] libnvdimm/pmem: Fix blk_cleanup_disk() usage

2021-06-08 Thread Jens Axboe
On 6/7/21 5:52 PM, Dan Williams wrote:
> The queue_to_disk() helper can not be used after del_gendisk()
> communicate @disk via the pgmap->owner.
> 
> Otherwise, queue_to_disk() returns NULL resulting in the splat below.
> 
>  Kernel attempted to read user page (330) - exploit attempt? (uid: 0)
>  BUG: Kernel NULL pointer dereference on read at 0x0330
>  Faulting instruction address: 0xc0906344
>  Oops: Kernel access of bad area, sig: 11 [#1]
>  [..]
>  NIP [c0906344] pmem_pagemap_cleanup+0x24/0x40
>  LR [c04701d4] memunmap_pages+0x1b4/0x4b0
>  Call Trace:
>  [c00022cbb9c0] [c09063c8] pmem_pagemap_kill+0x28/0x40 (unreliable)
>  [c00022cbb9e0] [c04701d4] memunmap_pages+0x1b4/0x4b0
>  [c00022cbba90] [c08b28a0] devm_action_release+0x30/0x50
>  [c00022cbbab0] [c08b39c8] release_nodes+0x2f8/0x3e0
>  [c00022cbbb60] [c08ac440] device_release_driver_internal+0x190/0x2b0
>  [c00022cbbba0] [c08a8450] unbind_store+0x130/0x170

Applied, thanks.

-- 
Jens Axboe



[PATCH 11/11] staging: fpgaboot: change FPGA indirect article to an

2021-06-08 Thread trix
From: Tom Rix 

Change use of 'a fpga' to 'an fpga'

Signed-off-by: Tom Rix 
---
 drivers/staging/gs_fpgaboot/README | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/staging/gs_fpgaboot/README b/drivers/staging/gs_fpgaboot/README
index b85a76849fc4a..ec1235a21bcc1 100644
--- a/drivers/staging/gs_fpgaboot/README
+++ b/drivers/staging/gs_fpgaboot/README
@@ -39,7 +39,7 @@ TABLE OF CONTENTS.
 
 5. USE CASE (from a mailing list discussion with Greg)
 
-   a. As a FPGA development support tool,
+   a. As an FPGA development support tool,
During FPGA firmware development, you need to download a new FPGA
image frequently.
You would do that with a dedicated JTAG, which usually a limited
-- 
2.26.3



[PATCH 10/11] fpga: stratix10-soc: change FPGA indirect article to an

2021-06-08 Thread trix
From: Tom Rix 

Change use of 'a fpga' to 'an fpga'

Signed-off-by: Tom Rix 
---
 drivers/fpga/stratix10-soc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/fpga/stratix10-soc.c b/drivers/fpga/stratix10-soc.c
index 657a70c5fc996..2aeb53f8e9d0f 100644
--- a/drivers/fpga/stratix10-soc.c
+++ b/drivers/fpga/stratix10-soc.c
@@ -271,7 +271,7 @@ static int s10_send_buf(struct fpga_manager *mgr, const char *buf, size_t count)
 }
 
 /*
- * Send a FPGA image to privileged layers to write to the FPGA.  When done
+ * Send an FPGA image to privileged layers to write to the FPGA.  When done
  * sending, free all service layer buffers we allocated in write_init.
  */
 static int s10_ops_write(struct fpga_manager *mgr, const char *buf,
-- 
2.26.3



[PATCH 09/11] fpga: of-fpga-region: change FPGA indirect article to an

2021-06-08 Thread trix
From: Tom Rix 

Change use of 'a fpga' to 'an fpga'

Signed-off-by: Tom Rix 
---
 drivers/fpga/of-fpga-region.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/fpga/of-fpga-region.c b/drivers/fpga/of-fpga-region.c
index e405309baadc1..e3c25576b6b9d 100644
--- a/drivers/fpga/of-fpga-region.c
+++ b/drivers/fpga/of-fpga-region.c
@@ -181,7 +181,7 @@ static int child_regions_with_firmware(struct device_node *overlay)
  * @region: FPGA region
  * @overlay: overlay applied to the FPGA region
  *
- * Given an overlay applied to a FPGA region, parse the FPGA image specific
+ * Given an overlay applied to an FPGA region, parse the FPGA image specific
  * info in the overlay and do some checking.
  *
  * Returns:
@@ -273,7 +273,7 @@ static struct fpga_image_info *of_fpga_region_parse_ov(
  * @region: FPGA region that the overlay was applied to
  * @nd: overlay notification data
  *
- * Called when an overlay targeted to a FPGA Region is about to be applied.
+ * Called when an overlay targeted to an FPGA Region is about to be applied.
  * Parses the overlay for properties that influence how the FPGA will be
  * programmed and does some checking. If the checks pass, programs the FPGA.
  * If the checks fail, overlay is rejected and does not get added to the
@@ -336,8 +336,8 @@ static void of_fpga_region_notify_post_remove(struct fpga_region *region,
  * @action:notifier action
  * @arg:   reconfig data
  *
- * This notifier handles programming a FPGA when a "firmware-name" property is
- * added to a fpga-region.
+ * This notifier handles programming an FPGA when a "firmware-name" property is
+ * added to an fpga-region.
  *
  * Returns NOTIFY_OK or error if FPGA programming fails.
  */
-- 
2.26.3



[PATCH 08/11] fpga: region: change FPGA indirect article to an

2021-06-08 Thread trix
From: Tom Rix 

Change use of 'a fpga' to 'an fpga'

Signed-off-by: Tom Rix 
---
 drivers/fpga/fpga-region.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/fpga/fpga-region.c b/drivers/fpga/fpga-region.c
index c3134b89c3fe5..c5c55d2f20b92 100644
--- a/drivers/fpga/fpga-region.c
+++ b/drivers/fpga/fpga-region.c
@@ -33,14 +33,14 @@ struct fpga_region *fpga_region_class_find(
 EXPORT_SYMBOL_GPL(fpga_region_class_find);
 
 /**
- * fpga_region_get - get an exclusive reference to a fpga region
+ * fpga_region_get - get an exclusive reference to an fpga region
  * @region: FPGA Region struct
  *
  * Caller should call fpga_region_put() when done with region.
  *
  * Return fpga_region struct if successful.
  * Return -EBUSY if someone already has a reference to the region.
- * Return -ENODEV if @np is not a FPGA Region.
+ * Return -ENODEV if @np is not an FPGA Region.
  */
 static struct fpga_region *fpga_region_get(struct fpga_region *region)
 {
@@ -234,7 +234,7 @@ struct fpga_region
 EXPORT_SYMBOL_GPL(fpga_region_create);
 
 /**
- * fpga_region_free - free a FPGA region created by fpga_region_create()
+ * fpga_region_free - free an FPGA region created by fpga_region_create()
  * @region: FPGA region
  */
 void fpga_region_free(struct fpga_region *region)
@@ -257,7 +257,7 @@ static void devm_fpga_region_release(struct device *dev, void *res)
  * @mgr: manager that programs this region
  * @get_bridges: optional function to get bridges to a list
  *
- * This function is intended for use in a FPGA region driver's probe function.
+ * This function is intended for use in an FPGA region driver's probe function.
  * After the region driver creates the region struct with
 * devm_fpga_region_create(), it should register it with fpga_region_register().
  * The region driver's remove function should call fpga_region_unregister().
@@ -291,7 +291,7 @@ struct fpga_region
 EXPORT_SYMBOL_GPL(devm_fpga_region_create);
 
 /**
- * fpga_region_register - register a FPGA region
+ * fpga_region_register - register an FPGA region
  * @region: FPGA region
  *
  * Return: 0 or -errno
@@ -303,10 +303,10 @@ int fpga_region_register(struct fpga_region *region)
 EXPORT_SYMBOL_GPL(fpga_region_register);
 
 /**
- * fpga_region_unregister - unregister a FPGA region
+ * fpga_region_unregister - unregister an FPGA region
  * @region: FPGA region
  *
- * This function is intended for use in a FPGA region driver's remove function.
+ * This function is intended for use in an FPGA region driver's remove function.
  */
 void fpga_region_unregister(struct fpga_region *region)
 {
-- 
2.26.3



[PATCH 07/11] fpga-mgr: change FPGA indirect article to an

2021-06-08 Thread trix
From: Tom Rix 

Change use of 'a fpga' to 'an fpga'

Signed-off-by: Tom Rix 
---
 drivers/fpga/fpga-mgr.c   | 22 +++---
 include/linux/fpga/fpga-mgr.h |  2 +-
 2 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/drivers/fpga/fpga-mgr.c b/drivers/fpga/fpga-mgr.c
index b85bc47c91a9a..ae21202105af7 100644
--- a/drivers/fpga/fpga-mgr.c
+++ b/drivers/fpga/fpga-mgr.c
@@ -26,7 +26,7 @@ struct fpga_mgr_devres {
 };
 
 /**
- * fpga_image_info_alloc - Allocate a FPGA image info struct
+ * fpga_image_info_alloc - Allocate an FPGA image info struct
  * @dev: owning device
  *
  * Return: struct fpga_image_info or NULL
@@ -50,7 +50,7 @@ struct fpga_image_info *fpga_image_info_alloc(struct device *dev)
 EXPORT_SYMBOL_GPL(fpga_image_info_alloc);
 
 /**
- * fpga_image_info_free - Free a FPGA image info struct
+ * fpga_image_info_free - Free an FPGA image info struct
  * @info: FPGA image info struct to free
  */
 void fpga_image_info_free(struct fpga_image_info *info)
@@ -470,7 +470,7 @@ static int fpga_mgr_dev_match(struct device *dev, const void *data)
 }
 
 /**
- * fpga_mgr_get - Given a device, get a reference to a fpga mgr.
+ * fpga_mgr_get - Given a device, get a reference to an fpga mgr.
  * @dev:   parent device that fpga mgr was registered with
  *
  * Return: fpga manager struct or IS_ERR() condition containing error code.
@@ -487,7 +487,7 @@ struct fpga_manager *fpga_mgr_get(struct device *dev)
 EXPORT_SYMBOL_GPL(fpga_mgr_get);
 
 /**
- * of_fpga_mgr_get - Given a device node, get a reference to a fpga mgr.
+ * of_fpga_mgr_get - Given a device node, get a reference to an fpga mgr.
  *
  * @node:  device node
  *
@@ -506,7 +506,7 @@ struct fpga_manager *of_fpga_mgr_get(struct device_node *node)
 EXPORT_SYMBOL_GPL(of_fpga_mgr_get);
 
 /**
- * fpga_mgr_put - release a reference to a fpga manager
+ * fpga_mgr_put - release a reference to an fpga manager
  * @mgr:   fpga manager structure
  */
 void fpga_mgr_put(struct fpga_manager *mgr)
@@ -550,7 +550,7 @@ void fpga_mgr_unlock(struct fpga_manager *mgr)
 EXPORT_SYMBOL_GPL(fpga_mgr_unlock);
 
 /**
- * fpga_mgr_create - create and initialize a FPGA manager struct
+ * fpga_mgr_create - create and initialize an FPGA manager struct
  * @dev:   fpga manager device from pdev
  * @name:  fpga manager name
  * @mops:  pointer to structure of fpga manager ops
@@ -617,7 +617,7 @@ struct fpga_manager *fpga_mgr_create(struct device *dev, const char *name,
 EXPORT_SYMBOL_GPL(fpga_mgr_create);
 
 /**
- * fpga_mgr_free - free a FPGA manager created with fpga_mgr_create()
+ * fpga_mgr_free - free an FPGA manager created with fpga_mgr_create()
  * @mgr:   fpga manager struct
  */
 void fpga_mgr_free(struct fpga_manager *mgr)
@@ -641,7 +641,7 @@ static void devm_fpga_mgr_release(struct device *dev, void *res)
  * @mops:  pointer to structure of fpga manager ops
  * @priv:  fpga manager private data
  *
- * This function is intended for use in a FPGA manager driver's probe function.
+ * This function is intended for use in an FPGA manager driver's probe function.
  * After the manager driver creates the manager struct with
  * devm_fpga_mgr_create(), it should register it with fpga_mgr_register().  The
  * manager driver's remove function should call fpga_mgr_unregister().  The
@@ -674,7 +674,7 @@ struct fpga_manager *devm_fpga_mgr_create(struct device *dev, const char *name,
 EXPORT_SYMBOL_GPL(devm_fpga_mgr_create);
 
 /**
- * fpga_mgr_register - register a FPGA manager
+ * fpga_mgr_register - register an FPGA manager
  * @mgr: fpga manager struct
  *
  * Return: 0 on success, negative error code otherwise.
@@ -706,10 +706,10 @@ int fpga_mgr_register(struct fpga_manager *mgr)
 EXPORT_SYMBOL_GPL(fpga_mgr_register);
 
 /**
- * fpga_mgr_unregister - unregister a FPGA manager
+ * fpga_mgr_unregister - unregister an FPGA manager
  * @mgr: fpga manager struct
  *
- * This function is intended for use in a FPGA manager driver's remove function.
+ * This function is intended for use in an FPGA manager driver's remove function.
  */
 void fpga_mgr_unregister(struct fpga_manager *mgr)
 {
diff --git a/include/linux/fpga/fpga-mgr.h b/include/linux/fpga/fpga-mgr.h
index 3a32b8e201857..474c1f5063070 100644
--- a/include/linux/fpga/fpga-mgr.h
+++ b/include/linux/fpga/fpga-mgr.h
@@ -75,7 +75,7 @@ enum fpga_mgr_states {
 #define FPGA_MGR_COMPRESSED_BITSTREAM  BIT(4)
 
 /**
- * struct fpga_image_info - information specific to a FPGA image
+ * struct fpga_image_info - information specific to an FPGA image
  * @flags: boolean flags as defined above
  * @enable_timeout_us: maximum time to enable traffic through bridge (uSec)
  * @disable_timeout_us: maximum time to disable traffic through bridge (uSec)
-- 
2.26.3



[PATCH 06/11] fpga: bridge: change FPGA indirect article to an

2021-06-08 Thread trix
From: Tom Rix 

Change use of 'a fpga' to 'an fpga'

Signed-off-by: Tom Rix 
---
 drivers/fpga/fpga-bridge.c   | 22 +++---
 include/linux/fpga/fpga-bridge.h |  2 +-
 2 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/drivers/fpga/fpga-bridge.c b/drivers/fpga/fpga-bridge.c
index 05c6d4f2d043f..beef53b194b27 100644
--- a/drivers/fpga/fpga-bridge.c
+++ b/drivers/fpga/fpga-bridge.c
@@ -85,14 +85,14 @@ static struct fpga_bridge *__fpga_bridge_get(struct device *dev,
 }
 
 /**
- * of_fpga_bridge_get - get an exclusive reference to a fpga bridge
+ * of_fpga_bridge_get - get an exclusive reference to an fpga bridge
  *
- * @np: node pointer of a FPGA bridge
+ * @np: node pointer of an FPGA bridge
  * @info: fpga image specific information
  *
  * Return fpga_bridge struct if successful.
  * Return -EBUSY if someone already has a reference to the bridge.
- * Return -ENODEV if @np is not a FPGA Bridge.
+ * Return -ENODEV if @np is not an FPGA Bridge.
  */
 struct fpga_bridge *of_fpga_bridge_get(struct device_node *np,
   struct fpga_image_info *info)
@@ -113,11 +113,11 @@ static int fpga_bridge_dev_match(struct device *dev, const void *data)
 }
 
 /**
- * fpga_bridge_get - get an exclusive reference to a fpga bridge
+ * fpga_bridge_get - get an exclusive reference to an fpga bridge
  * @dev:   parent device that fpga bridge was registered with
  * @info:  fpga manager info
  *
- * Given a device, get an exclusive reference to a fpga bridge.
+ * Given a device, get an exclusive reference to an fpga bridge.
  *
  * Return: fpga bridge struct or IS_ERR() condition containing error code.
  */
@@ -224,7 +224,7 @@ EXPORT_SYMBOL_GPL(fpga_bridges_put);
 /**
  * of_fpga_bridge_get_to_list - get a bridge, add it to a list
  *
- * @np: node pointer of a FPGA bridge
+ * @np: node pointer of an FPGA bridge
  * @info: fpga image specific information
  * @bridge_list: list of FPGA bridges
  *
@@ -373,7 +373,7 @@ struct fpga_bridge *fpga_bridge_create(struct device *dev, const char *name,
 EXPORT_SYMBOL_GPL(fpga_bridge_create);
 
 /**
- * fpga_bridge_free - free a fpga bridge created by fpga_bridge_create()
+ * fpga_bridge_free - free an fpga bridge created by fpga_bridge_create()
  * @bridge:FPGA bridge struct
  */
 void fpga_bridge_free(struct fpga_bridge *bridge)
@@ -397,7 +397,7 @@ static void devm_fpga_bridge_release(struct device *dev, void *res)
  * @br_ops:pointer to structure of fpga bridge ops
  * @priv:  FPGA bridge private data
  *
- * This function is intended for use in a FPGA bridge driver's probe function.
+ * This function is intended for use in an FPGA bridge driver's probe function.
  * After the bridge driver creates the struct with devm_fpga_bridge_create(), 
it
  * should register the bridge with fpga_bridge_register().  The bridge driver's
  * remove function should call fpga_bridge_unregister().  The bridge struct
@@ -430,7 +430,7 @@ struct fpga_bridge
 EXPORT_SYMBOL_GPL(devm_fpga_bridge_create);
 
 /**
- * fpga_bridge_register - register a FPGA bridge
+ * fpga_bridge_register - register an FPGA bridge
  *
  * @bridge: FPGA bridge struct
  *
@@ -454,11 +454,11 @@ int fpga_bridge_register(struct fpga_bridge *bridge)
 EXPORT_SYMBOL_GPL(fpga_bridge_register);
 
 /**
- * fpga_bridge_unregister - unregister a FPGA bridge
+ * fpga_bridge_unregister - unregister an FPGA bridge
  *
  * @bridge: FPGA bridge struct
  *
- * This function is intended for use in a FPGA bridge driver's remove function.
+ * This function is intended for use in an FPGA bridge driver's remove function.
  */
 void fpga_bridge_unregister(struct fpga_bridge *bridge)
 {
diff --git a/include/linux/fpga/fpga-bridge.h b/include/linux/fpga/fpga-bridge.h
index 817600a32c935..6c3c28806ff13 100644
--- a/include/linux/fpga/fpga-bridge.h
+++ b/include/linux/fpga/fpga-bridge.h
@@ -11,7 +11,7 @@ struct fpga_bridge;
 /**
  * struct fpga_bridge_ops - ops for low level FPGA bridge drivers
  * @enable_show: returns the FPGA bridge's status
- * @enable_set: set a FPGA bridge as enabled or disabled
+ * @enable_set: set an FPGA bridge as enabled or disabled
  * @fpga_bridge_remove: set FPGA into a specific state during driver remove
  * @groups: optional attribute groups.
  */
-- 
2.26.3



[PATCH 05/11] fpga: change FPGA indirect article to an

2021-06-08 Thread trix
From: Tom Rix 

Change use of 'a fpga' to 'an fpga'

Signed-off-by: Tom Rix 
---
 drivers/fpga/Kconfig | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/fpga/Kconfig b/drivers/fpga/Kconfig
index 33e15058d0dc7..8cd454ee20c0c 100644
--- a/drivers/fpga/Kconfig
+++ b/drivers/fpga/Kconfig
@@ -7,7 +7,7 @@ menuconfig FPGA
tristate "FPGA Configuration Framework"
help
  Say Y here if you want support for configuring FPGAs from the
- kernel.  The FPGA framework adds a FPGA manager class and FPGA
+ kernel.  The FPGA framework adds an FPGA manager class and FPGA
  manager drivers.
 
 if FPGA
@@ -134,7 +134,7 @@ config FPGA_REGION
tristate "FPGA Region"
depends on FPGA_BRIDGE
help
- FPGA Region common code.  A FPGA Region controls a FPGA Manager
+ FPGA Region common code.  An FPGA Region controls an FPGA Manager
  and the FPGA Bridges associated with either a reconfigurable
  region of an FPGA or a whole FPGA.
 
-- 
2.26.3



[PATCH 00/11] fpga: change FPGA indirect article to an

2021-06-08 Thread trix
From: Tom Rix 

A treewide followup of
https://lore.kernel.org/linux-fpga/2faf6ccb-005b-063a-a2a3-e177082c4...@silicom.dk/

Change the use of 'a fpga' to 'an fpga'
Ref usage in wiki
https://en.wikipedia.org/wiki/Field-programmable_gate_array
and Intel's 'FPGAs For Dummies'
https://plan.seek.intel.com/PSG_WW_NC_LPCD_FR_2018_FPGAforDummiesbook

Change was mechanical
 #!/bin/sh
 for f in `find . -type f`; do
   sed -i.bak 's/ a fpga/ an fpga/g' $f
   sed -i.bak 's/ A fpga/ An fpga/g' $f
   sed -i.bak 's/ a FPGA/ an FPGA/g' $f
   sed -i.bak 's/ A FPGA/ An FPGA/g' $f
 done
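One caveat of the leading-space patterns above: they skip an occurrence at the
very start of a line. A word-boundary variant handles that case too — a sketch
only, assuming GNU sed (`-E` and `\b` are GNU extensions), demonstrated on a
made-up sample file rather than the kernel tree:

```shell
# Create a hypothetical sample file; real usage would loop over the tree.
tmp=$(mktemp)
printf '%s\n' 'A fpga manager controls a FPGA region.' > "$tmp"
# Rewrite "a"/"A" before fpga/FPGA at any word boundary, then restore the
# capital for sentence-initial matches.
sed -i -E 's/\b[Aa] (fpga|FPGA)\b/an \1/g; s/^an (fpga|FPGA)/An \1/' "$tmp"
cat "$tmp"   # -> An fpga manager controls an FPGA region.
```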


Tom Rix (11):
  dt-bindings: fpga: fpga-region: change FPGA indirect article to an
  Documentation: fpga: dfl: change FPGA indirect article to an
  Documentation: ocxl.rst: change FPGA indirect article to an
  crypto: marvell: cesa: change FPGA indirect article to an
  fpga: change FPGA indirect article to an
  fpga: bridge: change FPGA indirect article to an
  fpga-mgr: change FPGA indirect article to an
  fpga: region: change FPGA indirect article to an
  fpga: of-fpga-region: change FPGA indirect article to an
  fpga: stratix10-soc: change FPGA indirect article to an
  staging: fpgaboot: change FPGA indirect article to an

 .../devicetree/bindings/fpga/fpga-region.txt  | 22 +--
 Documentation/fpga/dfl.rst|  4 ++--
 .../userspace-api/accelerators/ocxl.rst   |  2 +-
 drivers/crypto/marvell/cesa/cesa.h|  2 +-
 drivers/fpga/Kconfig  |  4 ++--
 drivers/fpga/fpga-bridge.c| 22 +--
 drivers/fpga/fpga-mgr.c   | 22 +--
 drivers/fpga/fpga-region.c| 14 ++--
 drivers/fpga/of-fpga-region.c |  8 +++
 drivers/fpga/stratix10-soc.c  |  2 +-
 drivers/staging/gs_fpgaboot/README|  2 +-
 include/linux/fpga/fpga-bridge.h  |  2 +-
 include/linux/fpga/fpga-mgr.h |  2 +-
 13 files changed, 54 insertions(+), 54 deletions(-)

-- 
2.26.3



[PATCH 04/11] crypto: marvell: cesa: change FPGA indirect article to an

2021-06-08 Thread trix
From: Tom Rix 

Change use of 'a fpga' to 'an fpga'

Signed-off-by: Tom Rix 
---
 drivers/crypto/marvell/cesa/cesa.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/crypto/marvell/cesa/cesa.h b/drivers/crypto/marvell/cesa/cesa.h
index c1007f2ba79c8..d215a6bed6bc7 100644
--- a/drivers/crypto/marvell/cesa/cesa.h
+++ b/drivers/crypto/marvell/cesa/cesa.h
@@ -66,7 +66,7 @@
 #define CESA_SA_ST_ACT_1   BIT(1)
 
 /*
- * CESA_SA_FPGA_INT_STATUS looks like a FPGA leftover and is documented only
+ * CESA_SA_FPGA_INT_STATUS looks like an FPGA leftover and is documented only
  * in Errata 4.12. It looks like that it was part of an IRQ-controller in FPGA
  * and someone forgot to remove  it while switching to the core and moving to
  * CESA_SA_INT_STATUS.
-- 
2.26.3



[PATCH 02/11] Documentation: fpga: dfl: change FPGA indirect article to an

2021-06-08 Thread trix
From: Tom Rix 

Change use of 'a fpga' to 'an fpga'

Signed-off-by: Tom Rix 
---
 Documentation/fpga/dfl.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/fpga/dfl.rst b/Documentation/fpga/dfl.rst
index ccc33f199df2a..ef9eec71f6f3a 100644
--- a/Documentation/fpga/dfl.rst
+++ b/Documentation/fpga/dfl.rst
@@ -57,7 +57,7 @@ FPGA Interface Unit (FIU) represents a standalone functional unit for the
 interface to FPGA, e.g. the FPGA Management Engine (FME) and Port (more
 descriptions on FME and Port in later sections).
 
-Accelerated Function Unit (AFU) represents a FPGA programmable region and
+Accelerated Function Unit (AFU) represents an FPGA programmable region and
 always connects to a FIU (e.g. a Port) as its child as illustrated above.
 
 Private Features represent sub features of the FIU and AFU. They could be
@@ -311,7 +311,7 @@ The driver organization in virtualization case is illustrated below:
  | PCI PF Device ||  | PCI VF Device |
  +---+|  +---+
 
-FPGA PCIe device driver is always loaded first once a FPGA PCIe PF or VF device
+FPGA PCIe device driver is always loaded first once an FPGA PCIe PF or VF device
 is detected. It:
 
 * Finishes enumeration on both FPGA PCIe PF and VF device using common
-- 
2.26.3



[PATCH 03/11] Documentation: ocxl.rst: change FPGA indirect article to an

2021-06-08 Thread trix
From: Tom Rix 

Change use of 'a fpga' to 'an fpga'

Signed-off-by: Tom Rix 
---
 Documentation/userspace-api/accelerators/ocxl.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/userspace-api/accelerators/ocxl.rst b/Documentation/userspace-api/accelerators/ocxl.rst
index 14cefc020e2d5..db7570d5e50d1 100644
--- a/Documentation/userspace-api/accelerators/ocxl.rst
+++ b/Documentation/userspace-api/accelerators/ocxl.rst
@@ -6,7 +6,7 @@ OpenCAPI is an interface between processors and accelerators. It aims
 at being low-latency and high-bandwidth. The specification is
 developed by the `OpenCAPI Consortium `_.
 
-It allows an accelerator (which could be a FPGA, ASICs, ...) to access
+It allows an accelerator (which could be an FPGA, ASICs, ...) to access
 the host memory coherently, using virtual addresses. An OpenCAPI
 device can also host its own memory, that can be accessed from the
 host.
-- 
2.26.3



[PATCH 01/11] dt-bindings: fpga: fpga-region: change FPGA indirect article to an

2021-06-08 Thread trix
From: Tom Rix 

Change use of 'a fpga' to 'an fpga'

Signed-off-by: Tom Rix 
---
 .../devicetree/bindings/fpga/fpga-region.txt  | 22 +--
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/Documentation/devicetree/bindings/fpga/fpga-region.txt b/Documentation/devicetree/bindings/fpga/fpga-region.txt
index d787d57491a1c..7d35152648387 100644
--- a/Documentation/devicetree/bindings/fpga/fpga-region.txt
+++ b/Documentation/devicetree/bindings/fpga/fpga-region.txt
@@ -38,7 +38,7 @@ Partial Reconfiguration (PR)
 
 Partial Reconfiguration Region (PRR)
  * Also called a "reconfigurable partition"
- * A PRR is a specific section of a FPGA reserved for reconfiguration.
+ * A PRR is a specific section of an FPGA reserved for reconfiguration.
  * A base (or static) FPGA image may create a set of PRR's that later may
be independently reprogrammed many times.
  * The size and specific location of each PRR is fixed.
@@ -105,7 +105,7 @@ reprogrammed independently while the rest of the system continues to function.
 Sequence
 
 
-When a DT overlay that targets a FPGA Region is applied, the FPGA Region will
+When a DT overlay that targets an FPGA Region is applied, the FPGA Region will
 do the following:
 
  1. Disable appropriate FPGA bridges.
@@ -134,8 +134,8 @@ The intended use is that a Device Tree overlay (DTO) can be used to reprogram an
 FPGA while an operating system is running.
 
 An FPGA Region that exists in the live Device Tree reflects the current state.
-If the live tree shows a "firmware-name" property or child nodes under a FPGA
-Region, the FPGA already has been programmed.  A DTO that targets a FPGA Region
+If the live tree shows a "firmware-name" property or child nodes under an FPGA
+Region, the FPGA already has been programmed.  A DTO that targets an FPGA Region
 and adds the "firmware-name" property is taken as a request to reprogram the
 FPGA.  After reprogramming is successful, the overlay is accepted into the live
 tree.
@@ -152,9 +152,9 @@ These FPGA regions are children of FPGA bridges which are then children of the
 base FPGA region.  The "Full Reconfiguration to add PRR's" example below shows
 this.
 
-If an FPGA Region does not specify a FPGA Manager, it will inherit the FPGA
+If an FPGA Region does not specify an FPGA Manager, it will inherit the FPGA
 Manager specified by its ancestor FPGA Region.  This supports both the case
-where the same FPGA Manager is used for all of a FPGA as well the case where
+where the same FPGA Manager is used for all of an FPGA as well the case where
 a different FPGA Manager is used for each region.
 
FPGA Regions do not inherit their ancestor FPGA regions' bridges.  This prevents
@@ -166,7 +166,7 @@ within the static image of the FPGA.
 Required properties:
 - compatible : should contain "fpga-region"
 - fpga-mgr : should contain a phandle to an FPGA Manager.  Child FPGA Regions
-   inherit this property from their ancestor regions.  A fpga-mgr property
+   inherit this property from their ancestor regions.  An fpga-mgr property
in a region will override any inherited FPGA manager.
 - #address-cells, #size-cells, ranges : must be present to handle address space
mapping for child nodes.
@@ -175,12 +175,12 @@ Optional properties:
 - firmware-name : should contain the name of an FPGA image file located on the
firmware search path.  If this property shows up in a live device tree
it indicates that the FPGA has already been programmed with this image.
-   If this property is in an overlay targeting a FPGA region, it is a
+   If this property is in an overlay targeting an FPGA region, it is a
request to program the FPGA with that image.
 - fpga-bridges : should contain a list of phandles to FPGA Bridges that must be
controlled during FPGA programming along with the parent FPGA bridge.
This property is optional if the FPGA Manager handles the bridges.
-If the fpga-region is  the child of a fpga-bridge, the list should not
+If the fpga-region is  the child of an fpga-bridge, the list should not
 contain the parent bridge.
 - partial-fpga-config : boolean, set if partial reconfiguration is to be done,
otherwise full reconfiguration is done.
@@ -279,7 +279,7 @@ Supported Use Models
 
 In all cases the live DT must have the FPGA Manager, FPGA Bridges (if any), and
 a FPGA Region.  The target of the Device Tree Overlay is the FPGA Region.  Some
-uses are specific to a FPGA device.
+uses are specific to an FPGA device.
 
  * No FPGA Bridges
In this case, the FPGA Manager which programs the FPGA also handles the
@@ -300,7 +300,7 @@ uses are specific to a FPGA device.
bridges need to exist in the FPGA that can gate the buses going to each FPGA
region while the buses are enabled for other sections.  Before any partial
reconfiguration can be done, a base FPGA image must be loaded which includes
-   PRR's with 


Re: [PATCH] crash_core, vmcoreinfo: Append 'SECTION_SIZE_BITS' to vmcoreinfo

2021-06-08 Thread Andrew Morton
On Tue, 8 Jun 2021 22:24:32 +0800 Baoquan He  wrote:

> On 06/08/21 at 06:33am, Pingfan Liu wrote:
> > As mentioned in kernel commit 1d50e5d0c505 ("crash_core, vmcoreinfo:
> > Append 'MAX_PHYSMEM_BITS' to vmcoreinfo"), SECTION_SIZE_BITS is used in the
> > formula:
> > #define SECTIONS_SHIFT  (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)
> > 
> > Besides SECTIONS_SHIFT, SECTION_SIZE_BITS is also used to calculate
> > PAGES_PER_SECTION in makedumpfile just like kernel.
> > 
> > Unfortunately, this arch-dependent macro SECTION_SIZE_BITS changes, e.g.
> > recently in kernel commit f0b13ee23241 ("arm64/sparsemem: reduce
> > SECTION_SIZE_BITS"). But user space wants a stable interface to get this
> > info. Such info is impossible to be deduced from a crashdump vmcore.
> > Hence append SECTION_SIZE_BITS to vmcoreinfo.
> 
> ...
>
> Add the discussion of the original thread in kexec ML for reference:
> http://lists.infradead.org/pipermail/kexec/2021-June/022676.html

I added a Link: for this.
 
> This looks good to me.
> 
> Acked-by: Baoquan He 

I'm thinking we should backport this at least to Fixes:f0b13ee23241. 
But perhaps it's simpler to just backport it as far as possible, so I
added a bare cc:stable with no Fixes:.  Thoughts?


Re: [PATCH] powerpc/32: Remove __main()

2021-06-08 Thread Segher Boessenkool
On Tue, Jun 08, 2021 at 05:22:51PM +, Christophe Leroy wrote:
> Comment says that __main() is there to make GCC happy.
> 
> It's been there since the implementation of ppc arch in Linux 1.3.45.
> 
> ppc32 is the only architecture having that. Even ppc64 doesn't have it.
> 
> Seems like GCC is still happy without it.
> 
> Drop it for good.

If you used G++ to build the kernel there could be a call to __main
inserted under some circumstances.   It is used in functions called
"main" if there is no other way to do initialisations (this should not
happen if you use -ffreestanding, and there should not be a function
called "main" anyway, but who knows).

Either way, yup, this is ancient history :-)


Segher


Re: [PATCH 13/16] block: use memcpy_from_bvec in bio_copy_kern_endio_read

2021-06-08 Thread Chaitanya Kulkarni
On 6/8/21 09:09, Christoph Hellwig wrote:
> Use memcpy_from_bvec instead of open coding the logic.
>
> Signed-off-by: Christoph Hellwig 

Looks good.

Reviewed-by: Chaitanya Kulkarni 



Re: [PATCH 12/16] block: use memcpy_to_bvec in copy_to_high_bio_irq

2021-06-08 Thread Chaitanya Kulkarni
On 6/8/21 09:08, Christoph Hellwig wrote:
> Use memcpy_to_bvec instead of opencoding the logic.
>
> Signed-off-by: Christoph Hellwig 

Looks good.

Reviewed-by: Chaitanya Kulkarni 



Re: [PATCH 05/16] bvec: add memcpy_{from,to}_bvec and memzero_bvec helper

2021-06-08 Thread Chaitanya Kulkarni
On 6/8/21 09:07, Christoph Hellwig wrote:
> Add helpers to perform common memory operation on a bvec.
>
> Signed-off-by: Christoph Hellwig 

Looks good.

Reviewed-by: Chaitanya Kulkarni 




Re: [PATCH 06/16] block: use memzero_page in zero_fill_bio

2021-06-08 Thread Chaitanya Kulkarni
On 6/8/21 09:07, Christoph Hellwig wrote:
> Use memzero_bvec to zero each segment in the bio instead of manually
> mapping and zeroing the data.
>
> Signed-off-by: Christoph Hellwig 

Looks good.

Reviewed-by: Chaitanya Kulkarni 




Re: [PATCH 04/16] bvec: add a bvec_kmap_local helper

2021-06-08 Thread Chaitanya Kulkarni
On 6/8/21 09:06, Christoph Hellwig wrote:
> Add a helper to call kmap_local_page on a bvec.  There is no need for
> an unmap helper given that kunmap_local accept any address in the mapped
> page.
>
> Signed-off-by: Christoph Hellwig 

Looks good.

Reviewed-by: Chaitanya Kulkarni 




Re: [PATCH 01/16] mm: use kmap_local_page in memzero_page

2021-06-08 Thread Chaitanya Kulkarni
On 6/8/21 09:06, Christoph Hellwig wrote:
> No need for kmap_atomic here.
>
> Signed-off-by: Christoph Hellwig 

Looks good.

Reviewed-by: Chaitanya Kulkarni 





Re: [PATCH 03/16] bvec: fix the include guards for bvec.h

2021-06-08 Thread Chaitanya Kulkarni
On 6/8/21 09:06, Christoph Hellwig wrote:
> Fix the include guards to match the file naming.
>
> Signed-off-by: Christoph Hellwig 

Looks good.

Reviewed-by: Chaitanya Kulkarni 



Re: [PATCH 2/4] drivers/nvdimm: Add perf interface to expose nvdimm performance stats

2021-06-08 Thread Peter Zijlstra
On Tue, Jun 08, 2021 at 05:26:58PM +0530, Kajol Jain wrote:
> +static int nvdimm_pmu_cpu_offline(unsigned int cpu, struct hlist_node *node)
> +{
> + struct nvdimm_pmu *nd_pmu;
> + u32 target;
> + int nodeid;
> + const struct cpumask *cpumask;
> +
> + nd_pmu = hlist_entry_safe(node, struct nvdimm_pmu, node);
> +
> + /* Clear it, incase given cpu is set in nd_pmu->arch_cpumask */
> + cpumask_test_and_clear_cpu(cpu, &nd_pmu->arch_cpumask);
> +
> + /*
> +  * If given cpu is not same as current designated cpu for
> +  * counter access, just return.
> +  */
> + if (cpu != nd_pmu->cpu)
> + return 0;
> +
> + /* Check for any active cpu in nd_pmu->arch_cpumask */
> + target = cpumask_any(&nd_pmu->arch_cpumask);
> + nd_pmu->cpu = target;
> +
> + /*
> +  * Incase we don't have any active cpu in nd_pmu->arch_cpumask,
> +  * check in given cpu's numa node list.
> +  */
> + if (target >= nr_cpu_ids) {
> + nodeid = cpu_to_node(cpu);
> + cpumask = cpumask_of_node(nodeid);
> + target = cpumask_any_but(cpumask, cpu);
> + nd_pmu->cpu = target;
> +
> + if (target >= nr_cpu_ids)
> + return -1;
> + }
> +
> + return 0;
> +}
> +
> +static int nvdimm_pmu_cpu_online(unsigned int cpu, struct hlist_node *node)
> +{
> + struct nvdimm_pmu *nd_pmu;
> +
> + nd_pmu = hlist_entry_safe(node, struct nvdimm_pmu, node);
> +
> + if (nd_pmu->cpu >= nr_cpu_ids)
> + nd_pmu->cpu = cpu;
> +
> + return 0;
> +}

> +static int nvdimm_pmu_cpu_hotplug_init(struct nvdimm_pmu *nd_pmu)
> +{
> + int nodeid, rc;
> + const struct cpumask *cpumask;
> +
> + /*
> +  * Incase cpu hotplug is not handled by arch specific code
> +  * they can still provide required cpumask which can be used
> +  * to get designated cpu for counter access.
> +  * Check for any active cpu in nd_pmu->arch_cpumask.
> +  */
> + if (!cpumask_empty(&nd_pmu->arch_cpumask)) {
> + nd_pmu->cpu = cpumask_any(&nd_pmu->arch_cpumask);
> + } else {
> + /* pick active cpu from the cpumask of device numa node. */
> + nodeid = dev_to_node(nd_pmu->dev);
> + cpumask = cpumask_of_node(nodeid);
> + nd_pmu->cpu = cpumask_any(cpumask);
> + }
> +
> + rc = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, "perf/nvdimm:online",
> +  nvdimm_pmu_cpu_online, nvdimm_pmu_cpu_offline);
> +

Did you actually test this hotplug stuff?

That is, create a counter, unplug the CPU the counter was on, and
continue counting? "perf stat -I" is a good option for this, concurrent
with a hotplug.

Because I don't think it's actually correct. The thing is perf core is
strictly per-cpu, and it will place the event on a specific CPU context.
If you then unplug that CPU, nothing will touch the events on that CPU
anymore.

What drivers that span CPUs need to do is call
perf_pmu_migrate_context() whenever the CPU they were assigned to goes
away. Please have a look at arch/x86/events/rapl.c or
arch/x86/events/amd/power.c for relatively simple drivers that have this
property.




[PATCH] powerpc/32: Remove __main()

2021-06-08 Thread Christophe Leroy
Comment says that __main() is there to make GCC happy.

It's been there since the implementation of ppc arch in Linux 1.3.45.

ppc32 is the only architecture having that. Even ppc64 doesn't have it.

Seems like GCC is still happy without it.

Drop it for good.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/misc_32.S | 6 --
 1 file changed, 6 deletions(-)

diff --git a/arch/powerpc/kernel/misc_32.S b/arch/powerpc/kernel/misc_32.S
index 6a076bef2932..39ab15419592 100644
--- a/arch/powerpc/kernel/misc_32.S
+++ b/arch/powerpc/kernel/misc_32.S
@@ -388,9 +388,3 @@ _GLOBAL(start_secondary_resume)
bl  start_secondary
b   .
 #endif /* CONFIG_SMP */
-   
-/*
- * This routine is just here to keep GCC happy - sigh...
- */
-_GLOBAL(__main)
-   blr
-- 
2.25.0



Re: [PATCH v7 00/11] Speedup mremap on ppc64

2021-06-08 Thread Linus Torvalds
On Mon, Jun 7, 2021 at 3:10 AM Nick Piggin  wrote:
>
> I'd really rather not do this, I'm not sure if micro benchmark captures 
> everything.

I don't much care what powerpc code does _internally_ for this
architecture-specific mis-design issue, but I really don't want to see
more complex generic interfaces unless you have better hard numbers
for them.

So far the numbers are: "no observable difference".

It would have to be not just observable, but actually meaningful for
me to go "ok, we'll add this crazy flag that nobody else cares about".

And honestly, from everything I've seen on page table walker caches:
they are great, but once you start remapping big ranges and
invalidating megabytes of TLB's, the walker caches just aren't going
to be your issue.

But: numbers talk.  I'd take the sane generic interfaces as a first
cut. If somebody then has really compelling numbers, we can _then_
look at that "optimize for odd page table walker cache situation"
case.

And in the meantime, maybe you can talk to the hardware people and
tell them that you want the "flush range" capability to work right,
and that if the walker cache is so important they shouldn't
have made it an all-or-nothing flush.

Linus


Re: [RFC] powerpc/pseries: Interface to represent PAPR firmware attributes

2021-06-08 Thread Pratik Sampat

I've implemented a POC using this interface for the powerpc-utils'
ppc64_cpu --frequency command-line tool to utilize this information
in userspace.

The POC has been hosted here:
https://github.com/pratiksampat/powerpc-utils/tree/H_GET_ENERGY_SCALE_INFO
and based on comments and suggestions I can further improve the
parsing logic from this initial implementation.

Sample output from the powerpc-utils tool is as follows:

# ppc64_cpu --frequency
Power and Performance Mode: 
Idle Power Saver Status   : 
Processor Folding Status  :  --> Printed if Idle power save status is 
supported

Platform reported frequencies --> Frequencies reported from the platform's 
H_CALL, i.e. the PAPR interface
min        :     GHz
max        :     GHz
static     :     GHz

Tool Computed frequencies
min        :     GHz (cpu XX)
max        :     GHz (cpu XX)
avg        :     GHz


On 04/06/21 10:05 pm, Pratik R. Sampat wrote:

Adds a generic interface to represent the energy and frequency related
PAPR attributes on the system using the new H_CALL
"H_GET_ENERGY_SCALE_INFO".

H_GET_EM_PARMS H_CALL was previously responsible for exporting this
information in the lparcfg, however the H_GET_EM_PARMS H_CALL
will be deprecated P10 onwards.

The H_GET_ENERGY_SCALE_INFO H_CALL is of the following call format:
hcall(
   uint64 H_GET_ENERGY_SCALE_INFO,  // Get energy scale info
   uint64 flags,   // Per the flag request
   uint64 firstAttributeId,// The attribute id
   uint64 bufferAddress,   // The logical address of the output buffer
   uint64 bufferSize   // The size in bytes of the output buffer
);

This H_CALL can query either all the attributes at once with
firstAttributeId = 0, flags = 0, or just one attribute
at a time with firstAttributeId = id

The output buffer consists of the following
1. number of attributes  - 8 bytes
2. array offset to the data location - 8 bytes
3. version info  - 1 byte
4. A data array of size num attributes, which contains the following:
   a. attribute ID  - 8 bytes
   b. attribute value in number - 8 bytes
   c. attribute name in string  - 64 bytes
   d. attribute value in string - 64 bytes
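The byte layout described above maps onto C structures roughly as follows
(struct and field names are illustrative, not taken from the patch; only the
sizes come from the description):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the H_GET_ENERGY_SCALE_INFO output buffer layout. */
struct energy_scale_attr {
	uint64_t id;          /* a. attribute ID             - 8 bytes  */
	uint64_t val;         /* b. attribute value (number) - 8 bytes  */
	char name[64];        /* c. attribute name (string)  - 64 bytes */
	char val_str[64];     /* d. attribute value (string) - 64 bytes */
} __attribute__((packed));

struct energy_scale_hdr {
	uint64_t num_attrs;    /* 1. number of attributes          - 8 bytes */
	uint64_t array_offset; /* 2. offset of the data array      - 8 bytes */
	uint8_t  version;      /* 3. version info                  - 1 byte  */
	/* a data array of num_attrs entries follows at array_offset */
} __attribute__((packed));

/* Compile-time checks that the modelled sizes match the description. */
_Static_assert(sizeof(struct energy_scale_attr) == 8 + 8 + 64 + 64,
	       "each array entry is 144 bytes");
_Static_assert(sizeof(struct energy_scale_hdr) == 8 + 8 + 1,
	       "header is 17 bytes");
```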

The new H_CALL exports information in direct string value format, hence
a new interface has been introduced in /sys/firmware/papr to export
this information to userspace in an extensible pass-through format.
The H_CALL returns the name, numeric value and string value. As string
values are in human-readable format, the string value, if it exists,
is given precedence over the numeric value.

The format of exposing the sysfs information is as follows:
/sys/firmware/papr/
   |-- attr_0_name
   |-- attr_0_val
   |-- attr_1_name
   |-- attr_1_val
...

The energy information that is exported is useful for userspace tools
such as powerpc-utils. Currently these tools infer the
"power_mode_data" value in the lparcfg, which in turn is obtained from
the soon-to-be-deprecated H_GET_EM_PARMS H_CALL.
On future platforms, such userspace utilities will have to look at the
data returned from the new H_CALL being populated in this new sysfs
interface and report this information directly without the need of
interpretation.

Signed-off-by: Pratik R. Sampat 
---
  Documentation/ABI/testing/sysfs-firmware-papr |  24 +++
  arch/powerpc/include/asm/hvcall.h |  21 +-
  arch/powerpc/kvm/trace_hv.h   |   1 +
  arch/powerpc/platforms/pseries/Makefile   |   3 +-
  .../pseries/papr_platform_attributes.c| 203 ++
  5 files changed, 250 insertions(+), 2 deletions(-)
  create mode 100644 Documentation/ABI/testing/sysfs-firmware-papr
  create mode 100644 arch/powerpc/platforms/pseries/papr_platform_attributes.c

diff --git a/Documentation/ABI/testing/sysfs-firmware-papr 
b/Documentation/ABI/testing/sysfs-firmware-papr
new file mode 100644
index ..1c040b44ac3b
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-firmware-papr
@@ -0,0 +1,24 @@
+What:  /sys/firmware/papr
+Date:  June 2021
+Contact:   Linux for PowerPC mailing list 
+Description:   Directory hosting a set of platform attributes on Linux
+   running as a PAPR guest.
+
+   Each file in a directory contains a platform
+   attribute pertaining to performance/energy-savings
+   mode and processor frequency.
+
+What:  /sys/firmware/papr/attr_X_name
+   /sys/firmware/papr/attr_X_val
+Date:  June 2021
+Contact:   Linux for PowerPC mailing list 
+Description:   PAPR attributes directory for POWERVM servers
+
+   This directory provides PAPR information. It
+   contains below sysfs attributes:
+
+   - attr_X_name: File contains the name of
+   attribute X
+
+   - attr_X_val: Numeric/string value of
+   attribute X
diff --git 

Re: [PATCH 08/16] dm-writecache: use bvec_kmap_local instead of bvec_kmap_irq

2021-06-08 Thread Christoph Hellwig
On Tue, Jun 08, 2021 at 09:30:56AM -0700, Bart Van Assche wrote:
> From one of the functions called by kunmap_local():
> 
> unsigned long addr = (unsigned long) vaddr & PAGE_MASK;
> 
> This won't work well if bvec->bv_offset >= PAGE_SIZE I assume?

It won't indeed.  Both the existing and new helpers operate on single
page bvecs only, and all callers only use those.  I should have
probably mentioned that in the cover letter and documented the
assumptions in the code, though.
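A minimal userspace model of the failure mode in question, using illustrative
constants (the real PAGE_SIZE/PAGE_MASK come from the architecture):
kunmap_local() recovers the mapping to drop as "vaddr & PAGE_MASK", which only
lands back on the mapped page when the offset added to the mapping stays below
PAGE_SIZE.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-ins for the architecture's page constants. */
#define SKETCH_PAGE_SIZE 4096UL
#define SKETCH_PAGE_MASK (~(SKETCH_PAGE_SIZE - 1))

/* Model of how kunmap_local() recovers the mapping base from the
 * address handed back by bvec_kmap_local(). */
static uintptr_t recover_base(uintptr_t vaddr)
{
	return vaddr & SKETCH_PAGE_MASK;
}
```

With bv_offset < PAGE_SIZE the base is recovered correctly; with a multi-page
offset the masked address points one or more pages past the actual mapping,
which is why both the old and new helpers assume single-page bvecs.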


Re: [PATCH 08/16] dm-writecache: use bvec_kmap_local instead of bvec_kmap_irq

2021-06-08 Thread Bart Van Assche
On 6/8/21 9:05 AM, Christoph Hellwig wrote:
> diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
> index aecc246ade26..93ca454eaca9 100644
> --- a/drivers/md/dm-writecache.c
> +++ b/drivers/md/dm-writecache.c
> @@ -1205,14 +1205,13 @@ static void memcpy_flushcache_optimized(void *dest, 
> void *source, size_t size)
>  static void bio_copy_block(struct dm_writecache *wc, struct bio *bio, void 
> *data)
>  {
>   void *buf;
> - unsigned long flags;
>   unsigned size;
>   int rw = bio_data_dir(bio);
>   unsigned remaining_size = wc->block_size;
>  
>   do {
>   struct bio_vec bv = bio_iter_iovec(bio, bio->bi_iter);
> - buf = bvec_kmap_irq(&bv, &flags);
> + buf = bvec_kmap_local(&bv);
>   size = bv.bv_len;
>   if (unlikely(size > remaining_size))
>   size = remaining_size;
> @@ -1230,7 +1229,7 @@ static void bio_copy_block(struct dm_writecache *wc, 
> struct bio *bio, void *data
>   memcpy_flushcache_optimized(data, buf, size);
>   }
>  
> - bvec_kunmap_irq(buf, &flags);
> + kunmap_local(buf);
>  
>   data = (char *)data + size;
>   remaining_size -= size;

From one of the functions called by kunmap_local():

unsigned long addr = (unsigned long) vaddr & PAGE_MASK;

This won't work well if bvec->bv_offset >= PAGE_SIZE I assume?

Thanks,

Bart.


Re: [PATCH 03/16] bvec: fix the include guards for bvec.h

2021-06-08 Thread Bart Van Assche
On 6/8/21 9:05 AM, Christoph Hellwig wrote:
> Fix the include guards to match the file naming.

Reviewed-by: Bart Van Assche 


Re: [PATCH 02/16] MIPS: don't include <linux/genhd.h> in <asm/mach-rc32434/rb.h>

2021-06-08 Thread Bart Van Assche
On 6/8/21 9:05 AM, Christoph Hellwig wrote:
> There is no need to include genhd.h from a random arch header, and not
> doing so prevents the possibility of nasty include loops.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  arch/mips/include/asm/mach-rc32434/rb.h | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/arch/mips/include/asm/mach-rc32434/rb.h 
> b/arch/mips/include/asm/mach-rc32434/rb.h
> index d502673a4f6c..34d179ca020b 100644
> --- a/arch/mips/include/asm/mach-rc32434/rb.h
> +++ b/arch/mips/include/asm/mach-rc32434/rb.h
> @@ -7,8 +7,6 @@
>  #ifndef __ASM_RC32434_RB_H
>  #define __ASM_RC32434_RB_H
>  
> -#include <linux/genhd.h>
> -
>  #define REGBASE  0x1800
>  #define IDT434_REG_BASE ((volatile void *) KSEG1ADDR(REGBASE))
> #define UART0BASE  0x58000

Reviewed-by: Bart Van Assche 


Re: [PATCH v4 2/4] lazy tlb: allow lazy tlb mm refcounting to be configurable

2021-06-08 Thread Andy Lutomirski
On 6/4/21 6:42 PM, Nicholas Piggin wrote:
> Add CONFIG_MMU_TLB_REFCOUNT which enables refcounting of the lazy tlb mm
> when it is context switched. This can be disabled by architectures that
> don't require this refcounting if they clean up lazy tlb mms when the
> last refcount is dropped. Currently this is always enabled, which is
> what existing code does, so the patch is effectively a no-op.
> 
> Rename rq->prev_mm to rq->prev_lazy_mm, because that's what it is.

I am in favor of this approach, but I would be a lot more comfortable
with the resulting code if task->active_mm were at least better
documented and possibly even guarded by ifdefs.

x86 bare metal currently does not need the core lazy mm refcounting, and
x86 bare metal *also* does not need ->active_mm.  Under the x86 scheme,
if lazy mm refcounting were configured out, ->active_mm could become a
dangling pointer, and this makes me extremely uncomfortable.

So I tend to think that, depending on config, the core code should
either keep ->active_mm [1] alive or get rid of it entirely.

[1] I don't really think it belongs in task_struct at all.  It's not a
property of the task.  It's the *per-cpu* mm that the core code is
keeping alive for lazy purposes.  How about consolidating it with the
copy in rq?

I guess the short summary of my opinion is that I like making this
configurable, but I do not like the state of the code.

--Andy


[PATCH 15/16] block: use bvec_kmap_local in t10_pi_type1_{prepare, complete}

2021-06-08 Thread Christoph Hellwig
Using local kmaps slightly reduces the chances of stray writes, and
the bvec interface cleans up the code a little bit.

Signed-off-by: Christoph Hellwig 
---
 block/t10-pi.c | 16 ++--
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/block/t10-pi.c b/block/t10-pi.c
index d910534b3a41..00c203b2a921 100644
--- a/block/t10-pi.c
+++ b/block/t10-pi.c
@@ -147,11 +147,10 @@ static void t10_pi_type1_prepare(struct request *rq)
break;
 
bip_for_each_vec(iv, bip, iter) {
-   void *p, *pmap;
unsigned int j;
+   void *p;
 
-   pmap = kmap_atomic(iv.bv_page);
-   p = pmap + iv.bv_offset;
+   p = bvec_kmap_local(&iv);
for (j = 0; j < iv.bv_len; j += tuple_sz) {
struct t10_pi_tuple *pi = p;
 
@@ -161,8 +160,7 @@ static void t10_pi_type1_prepare(struct request *rq)
ref_tag++;
p += tuple_sz;
}
-
-   kunmap_atomic(pmap);
+   kunmap_local(p);
}
 
bip->bip_flags |= BIP_MAPPED_INTEGRITY;
@@ -195,11 +193,10 @@ static void t10_pi_type1_complete(struct request *rq, 
unsigned int nr_bytes)
struct bvec_iter iter;
 
bip_for_each_vec(iv, bip, iter) {
-   void *p, *pmap;
unsigned int j;
+   void *p;
 
-   pmap = kmap_atomic(iv.bv_page);
-   p = pmap + iv.bv_offset;
+   p = bvec_kmap_local(&iv);
for (j = 0; j < iv.bv_len && intervals; j += tuple_sz) {
struct t10_pi_tuple *pi = p;
 
@@ -210,8 +207,7 @@ static void t10_pi_type1_complete(struct request *rq, 
unsigned int nr_bytes)
intervals--;
p += tuple_sz;
}
-
-   kunmap_atomic(pmap);
+   kunmap_local(p);
}
}
 }
-- 
2.30.2



[PATCH 16/16] block: use bvec_kmap_local in bio_integrity_process

2021-06-08 Thread Christoph Hellwig
Using local kmaps slightly reduces the chances of stray writes, and
the bvec interface cleans up the code a little bit.

Signed-off-by: Christoph Hellwig 
---
 block/bio-integrity.c | 14 ++
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 4b4eb8964a6f..8f54d49dc500 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -172,18 +172,16 @@ static blk_status_t bio_integrity_process(struct bio *bio,
iter.prot_buf = prot_buf;
 
__bio_for_each_segment(bv, bio, bviter, *proc_iter) {
-   void *kaddr = kmap_atomic(bv.bv_page);
+   void *kaddr = bvec_kmap_local(&bv);
 
-   iter.data_buf = kaddr + bv.bv_offset;
+   iter.data_buf = kaddr;
iter.data_size = bv.bv_len;
-
ret = proc_fn(&iter);
-   if (ret) {
-   kunmap_atomic(kaddr);
-   return ret;
-   }
+   kunmap_local(kaddr);
+
+   if (ret)
+   break;
 
-   kunmap_atomic(kaddr);
}
return ret;
 }
-- 
2.30.2



[PATCH 14/16] block: use memcpy_from_bvec in __blk_queue_bounce

2021-06-08 Thread Christoph Hellwig
Rewrite the actual bounce buffering loop in __blk_queue_bounce so that
the memcpy_from_bvec helper can be used to perform the data copies.

Signed-off-by: Christoph Hellwig 
---
 block/bounce.c | 21 +++--
 1 file changed, 7 insertions(+), 14 deletions(-)

diff --git a/block/bounce.c b/block/bounce.c
index a2fc6326b6c9..b5ad09e07bcf 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -243,24 +243,17 @@ void __blk_queue_bounce(struct request_queue *q, struct 
bio **bio_orig)
 * because the 'bio' is single-page bvec.
 */
for (i = 0, to = bio->bi_io_vec; i < bio->bi_vcnt; to++, i++) {
-   struct page *page = to->bv_page;
+   struct page *bounce_page;
 
-   if (!PageHighMem(page))
+   if (!PageHighMem(to->bv_page))
continue;
 
-   to->bv_page = mempool_alloc(&page_pool, GFP_NOIO);
-   inc_zone_page_state(to->bv_page, NR_BOUNCE);
+   bounce_page = mempool_alloc(&page_pool, GFP_NOIO);
+   inc_zone_page_state(bounce_page, NR_BOUNCE);
 
-   if (rw == WRITE) {
-   char *vto, *vfrom;
-
-   flush_dcache_page(page);
-
-   vto = page_address(to->bv_page) + to->bv_offset;
-   vfrom = kmap_atomic(page) + to->bv_offset;
-   memcpy(vto, vfrom, to->bv_len);
-   kunmap_atomic(vfrom);
-   }
+   if (rw == WRITE)
+   memcpy_from_bvec(page_address(bounce_page), to);
+   to->bv_page = bounce_page;
}
 
trace_block_bio_bounce(*bio_orig);
-- 
2.30.2



[PATCH 13/16] block: use memcpy_from_bvec in bio_copy_kern_endio_read

2021-06-08 Thread Christoph Hellwig
Use memcpy_from_bvec instead of open coding the logic.

Signed-off-by: Christoph Hellwig 
---
 block/blk-map.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-map.c b/block/blk-map.c
index 3743158ddaeb..d1448aaad980 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -400,7 +400,7 @@ static void bio_copy_kern_endio_read(struct bio *bio)
struct bvec_iter_all iter_all;
 
bio_for_each_segment_all(bvec, bio, iter_all) {
-   memcpy(p, page_address(bvec->bv_page), bvec->bv_len);
+   memcpy_from_bvec(p, bvec);
p += bvec->bv_len;
}
 
-- 
2.30.2



[PATCH 12/16] block: use memcpy_to_bvec in copy_to_high_bio_irq

2021-06-08 Thread Christoph Hellwig
Use memcpy_to_bvec instead of opencoding the logic.

Signed-off-by: Christoph Hellwig 
---
 block/bounce.c | 14 +-
 1 file changed, 1 insertion(+), 13 deletions(-)

diff --git a/block/bounce.c b/block/bounce.c
index 94081e013c58..a2fc6326b6c9 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -67,18 +67,6 @@ static __init int init_emergency_pool(void)
 
 __initcall(init_emergency_pool);
 
-/*
- * highmem version, map in to vec
- */
-static void bounce_copy_vec(struct bio_vec *to, unsigned char *vfrom)
-{
-   unsigned char *vto;
-
-   vto = kmap_atomic(to->bv_page);
-   memcpy(vto + to->bv_offset, vfrom, to->bv_len);
-   kunmap_atomic(vto);
-}
-
 /*
  * Simple bounce buffer support for highmem pages. Depending on the
  * queue gfp mask set, *to may or may not be a highmem page. kmap it
@@ -107,7 +95,7 @@ static void copy_to_high_bio_irq(struct bio *to, struct bio 
*from)
vfrom = page_address(fromvec.bv_page) +
tovec.bv_offset;
 
-   bounce_copy_vec(&tovec, vfrom);
+   memcpy_to_bvec(&tovec, vfrom);
flush_dcache_page(tovec.bv_page);
}
bio_advance_iter(from, &from_iter, tovec.bv_len);
-- 
2.30.2



[PATCH 11/16] block: rewrite bio_copy_data_iter to use bvec_kmap_local and memcpy_to_bvec

2021-06-08 Thread Christoph Hellwig
Use the proper helpers instead of open coding the copy.

Signed-off-by: Christoph Hellwig 
---
 block/bio.c | 28 
 1 file changed, 8 insertions(+), 20 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 1d7abdb83a39..c14d2e66c084 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1186,27 +1186,15 @@ EXPORT_SYMBOL(bio_advance);
 void bio_copy_data_iter(struct bio *dst, struct bvec_iter *dst_iter,
struct bio *src, struct bvec_iter *src_iter)
 {
-   struct bio_vec src_bv, dst_bv;
-   void *src_p, *dst_p;
-   unsigned bytes;
-
while (src_iter->bi_size && dst_iter->bi_size) {
-   src_bv = bio_iter_iovec(src, *src_iter);
-   dst_bv = bio_iter_iovec(dst, *dst_iter);
-
-   bytes = min(src_bv.bv_len, dst_bv.bv_len);
-
-   src_p = kmap_atomic(src_bv.bv_page);
-   dst_p = kmap_atomic(dst_bv.bv_page);
-
-   memcpy(dst_p + dst_bv.bv_offset,
-  src_p + src_bv.bv_offset,
-  bytes);
-
-   kunmap_atomic(dst_p);
-   kunmap_atomic(src_p);
-
-   flush_dcache_page(dst_bv.bv_page);
+   struct bio_vec src_bv = bio_iter_iovec(src, *src_iter);
+   struct bio_vec dst_bv = bio_iter_iovec(dst, *dst_iter);
+   unsigned int bytes = min(src_bv.bv_len, dst_bv.bv_len);
+   void *src_buf;
+
+   src_buf = bvec_kmap_local(&src_bv);
+   memcpy_to_bvec(&dst_bv, src_buf);
+   kunmap_local(src_buf);
 
bio_advance_iter_single(src, src_iter, bytes);
bio_advance_iter_single(dst, dst_iter, bytes);
-- 
2.30.2



[PATCH 10/16] block: remove bvec_kmap_irq and bvec_kunmap_irq

2021-06-08 Thread Christoph Hellwig
These two helpers are entirely unused now.

Signed-off-by: Christoph Hellwig 
---
 include/linux/bio.h | 42 --
 1 file changed, 42 deletions(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index a0b4cfdf62a4..169b14b10c16 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -5,7 +5,6 @@
 #ifndef __LINUX_BIO_H
 #define __LINUX_BIO_H
 
-#include <linux/highmem.h>
 #include <linux/mempool.h>
 #include <linux/ioprio.h>
 /* struct bio, bio_vec and BIO_* flags are defined in blk_types.h */
@@ -523,47 +522,6 @@ static inline void bio_clone_blkg_association(struct bio 
*dst,
  struct bio *src) { }
 #endif /* CONFIG_BLK_CGROUP */
 
-#ifdef CONFIG_HIGHMEM
-/*
- * remember never ever reenable interrupts between a bvec_kmap_irq and
- * bvec_kunmap_irq!
- */
-static inline char *bvec_kmap_irq(struct bio_vec *bvec, unsigned long *flags)
-{
-   unsigned long addr;
-
-   /*
-* might not be a highmem page, but the preempt/irq count
-* balancing is a lot nicer this way
-*/
-   local_irq_save(*flags);
-   addr = (unsigned long) kmap_atomic(bvec->bv_page);
-
-   BUG_ON(addr & ~PAGE_MASK);
-
-   return (char *) addr + bvec->bv_offset;
-}
-
-static inline void bvec_kunmap_irq(char *buffer, unsigned long *flags)
-{
-   unsigned long ptr = (unsigned long) buffer & PAGE_MASK;
-
-   kunmap_atomic((void *) ptr);
-   local_irq_restore(*flags);
-}
-
-#else
-static inline char *bvec_kmap_irq(struct bio_vec *bvec, unsigned long *flags)
-{
-   return page_address(bvec->bv_page) + bvec->bv_offset;
-}
-
-static inline void bvec_kunmap_irq(char *buffer, unsigned long *flags)
-{
-   *flags = 0;
-}
-#endif
-
 /*
  * BIO list management for use by remapping drivers (e.g. DM or MD) and loop.
  *
-- 
2.30.2



[PATCH 09/16] ps3disk: use memcpy_{from,to}_bvec

2021-06-08 Thread Christoph Hellwig
Use the bvec helpers instead of open coding the copy.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/ps3disk.c | 19 +++
 1 file changed, 3 insertions(+), 16 deletions(-)

diff --git a/drivers/block/ps3disk.c b/drivers/block/ps3disk.c
index ba3ece56cbb3..f2eb0225814f 100644
--- a/drivers/block/ps3disk.c
+++ b/drivers/block/ps3disk.c
@@ -84,26 +84,13 @@ static void ps3disk_scatter_gather(struct 
ps3_storage_device *dev,
unsigned int offset = 0;
struct req_iterator iter;
struct bio_vec bvec;
-   unsigned int i = 0;
-   size_t size;
-   void *buf;
 
rq_for_each_segment(bvec, req, iter) {
-   unsigned long flags;
-   dev_dbg(&dev->sbd.core, "%s:%u: bio %u: %u sectors from %llu\n",
-   __func__, __LINE__, i, bio_sectors(iter.bio),
-   iter.bio->bi_iter.bi_sector);
-
-   size = bvec.bv_len;
-   buf = bvec_kmap_irq(&bvec, &flags);
if (gather)
-   memcpy(dev->bounce_buf+offset, buf, size);
+   memcpy_from_bvec(dev->bounce_buf + offset, &bvec);
else
-   memcpy(buf, dev->bounce_buf+offset, size);
-   offset += size;
-   flush_kernel_dcache_page(bvec.bv_page);
-   bvec_kunmap_irq(buf, &flags);
-   i++;
+   memcpy_to_bvec(&bvec, dev->bounce_buf + offset);
+   offset += bvec.bv_len;
}
 }
 
-- 
2.30.2



[PATCH 08/16] dm-writecache: use bvec_kmap_local instead of bvec_kmap_irq

2021-06-08 Thread Christoph Hellwig
There is no need to disable interrupts in bio_copy_block, and the
local-only mapping helps to avoid any sort of problems with stray
writes into the bio data.

Signed-off-by: Christoph Hellwig 
---
 drivers/md/dm-writecache.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
index aecc246ade26..93ca454eaca9 100644
--- a/drivers/md/dm-writecache.c
+++ b/drivers/md/dm-writecache.c
@@ -1205,14 +1205,13 @@ static void memcpy_flushcache_optimized(void *dest, 
void *source, size_t size)
 static void bio_copy_block(struct dm_writecache *wc, struct bio *bio, void 
*data)
 {
void *buf;
-   unsigned long flags;
unsigned size;
int rw = bio_data_dir(bio);
unsigned remaining_size = wc->block_size;
 
do {
struct bio_vec bv = bio_iter_iovec(bio, bio->bi_iter);
-   buf = bvec_kmap_irq(&bv, &flags);
+   buf = bvec_kmap_local(&bv);
size = bv.bv_len;
if (unlikely(size > remaining_size))
size = remaining_size;
@@ -1230,7 +1229,7 @@ static void bio_copy_block(struct dm_writecache *wc, 
struct bio *bio, void *data
memcpy_flushcache_optimized(data, buf, size);
}
 
-   bvec_kunmap_irq(buf, &flags);
+   kunmap_local(buf);
 
data = (char *)data + size;
remaining_size -= size;
-- 
2.30.2



[PATCH 07/16] rbd: use memzero_bvec

2021-06-08 Thread Christoph Hellwig
Use memzero_bvec instead of reimplementing it.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/rbd.c | 15 ++-
 1 file changed, 2 insertions(+), 13 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index bbb88eb009e0..eb243fc4d108 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1219,24 +1219,13 @@ static void rbd_dev_mapping_clear(struct rbd_device 
*rbd_dev)
rbd_dev->mapping.size = 0;
 }
 
-static void zero_bvec(struct bio_vec *bv)
-{
-   void *buf;
-   unsigned long flags;
-
-   buf = bvec_kmap_irq(bv, &flags);
-   memset(buf, 0, bv->bv_len);
-   flush_dcache_page(bv->bv_page);
-   bvec_kunmap_irq(buf, &flags);
-}
-
 static void zero_bios(struct ceph_bio_iter *bio_pos, u32 off, u32 bytes)
 {
struct ceph_bio_iter it = *bio_pos;
 
ceph_bio_iter_advance(&it, off);
ceph_bio_iter_advance_step(&it, bytes, ({
-   zero_bvec(&bv);
+   memzero_bvec(&bv);
}));
 }
 
@@ -1246,7 +1235,7 @@ static void zero_bvecs(struct ceph_bvec_iter *bvec_pos, 
u32 off, u32 bytes)
 
ceph_bvec_iter_advance(&it, off);
ceph_bvec_iter_advance_step(&it, bytes, ({
-   zero_bvec(&bv);
+   memzero_bvec(&bv);
}));
 }
 
-- 
2.30.2



[PATCH 06/16] block: use memzero_page in zero_fill_bio

2021-06-08 Thread Christoph Hellwig
Use memzero_bvec to zero each segment in the bio instead of manually
mapping and zeroing the data.

Signed-off-by: Christoph Hellwig 
---
 block/bio.c | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 44205dfb6b60..1d7abdb83a39 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -495,16 +495,11 @@ EXPORT_SYMBOL(bio_kmalloc);
 
 void zero_fill_bio(struct bio *bio)
 {
-   unsigned long flags;
struct bio_vec bv;
struct bvec_iter iter;
 
-   bio_for_each_segment(bv, bio, iter) {
-   char *data = bvec_kmap_irq(&bv, &flags);
-   memset(data, 0, bv.bv_len);
-   flush_dcache_page(bv.bv_page);
-   bvec_kunmap_irq(data, &flags);
-   }
+   bio_for_each_segment(bv, bio, iter)
+   memzero_bvec(&bv);
 }
 EXPORT_SYMBOL(zero_fill_bio);
 
-- 
2.30.2



[PATCH 05/16] bvec: add memcpy_{from, to}_bvec and memzero_bvec helper

2021-06-08 Thread Christoph Hellwig
Add helpers to perform common memory operations on a bvec.

Signed-off-by: Christoph Hellwig 
---
 include/linux/bvec.h | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index d64d6c0ceb77..ac835fa01ee3 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -189,4 +189,19 @@ static inline void *bvec_kmap_local(struct bio_vec *bvec)
return kmap_local_page(bvec->bv_page) + bvec->bv_offset;
 }
 
+static inline void memcpy_from_bvec(char *to, struct bio_vec *bvec)
+{
+   memcpy_from_page(to, bvec->bv_page, bvec->bv_offset, bvec->bv_len);
+}
+
+static inline void memcpy_to_bvec(struct bio_vec *bvec, const char *from)
+{
+   memcpy_to_page(bvec->bv_page, bvec->bv_offset, from, bvec->bv_len);
+}
+
+static inline void memzero_bvec(struct bio_vec *bvec)
+{
+   memzero_page(bvec->bv_page, bvec->bv_offset, bvec->bv_len);
+}
+
 #endif /* __LINUX_BVEC_H */
-- 
2.30.2
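In userspace terms, the semantics of the three helpers added by this patch can
be sketched with a plain byte array standing in for struct page; the mock_bvec
type and the function bodies below are an illustrative model, not the kernel
implementation (which goes through memcpy_from_page/memcpy_to_page and kmap):

```c
#include <assert.h>
#include <string.h>

/* Userspace model: bv_page degenerates to a directly addressable byte
 * array, so the kmap step disappears.  Names mirror the patch. */
struct mock_bvec {
	unsigned char *bv_page;  /* stands in for struct page * */
	unsigned int bv_len;
	unsigned int bv_offset;
};

/* Copy bv_len bytes out of the bvec's [offset, offset + len) window. */
static void memcpy_from_bvec(char *to, struct mock_bvec *bvec)
{
	memcpy(to, bvec->bv_page + bvec->bv_offset, bvec->bv_len);
}

/* Copy bv_len bytes into the bvec's window. */
static void memcpy_to_bvec(struct mock_bvec *bvec, const char *from)
{
	memcpy(bvec->bv_page + bvec->bv_offset, from, bvec->bv_len);
}

/* Zero the bvec's window. */
static void memzero_bvec(struct mock_bvec *bvec)
{
	memset(bvec->bv_page + bvec->bv_offset, 0, bvec->bv_len);
}
```

The point of the helpers is exactly this: callers no longer open code the
page + offset arithmetic (or the kmap/kunmap pair) at every copy site.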



[PATCH 04/16] bvec: add a bvec_kmap_local helper

2021-06-08 Thread Christoph Hellwig
Add a helper to call kmap_local_page on a bvec.  There is no need for
an unmap helper given that kunmap_local accepts any address in the
mapped page.

Signed-off-by: Christoph Hellwig 
---
 include/linux/bvec.h | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 883faf5f1523..d64d6c0ceb77 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -7,6 +7,7 @@
 #ifndef __LINUX_BVEC_H
 #define __LINUX_BVEC_H
 
+#include <linux/highmem.h>
 #include 
 #include 
 #include 
@@ -183,4 +184,9 @@ static inline void bvec_advance(const struct bio_vec *bvec,
}
 }
 
+static inline void *bvec_kmap_local(struct bio_vec *bvec)
+{
+   return kmap_local_page(bvec->bv_page) + bvec->bv_offset;
+}
+
 #endif /* __LINUX_BVEC_H */
-- 
2.30.2



switch the block layer to use kmap_local_page

2021-06-08 Thread Christoph Hellwig
Hi all,

this series switches the core block layer code and all users of the
existing bvec kmap helpers to use kmap_local_page.  Drivers that
currently use open coded kmap_atomic calls will be converted in a
follow-on series.

Diffstat:
 arch/mips/include/asm/mach-rc32434/rb.h |2 -
 block/bio-integrity.c   |   14 --
 block/bio.c |   37 +++-
 block/blk-map.c |2 -
 block/bounce.c  |   35 ++
 block/t10-pi.c  |   16 
 drivers/block/ps3disk.c |   19 ++
 drivers/block/rbd.c |   15 +--
 drivers/md/dm-writecache.c  |5 +--
 include/linux/bio.h |   42 
 include/linux/bvec.h|   27 ++--
 include/linux/highmem.h |4 +--
 12 files changed, 64 insertions(+), 154 deletions(-)


[PATCH 01/16] mm: use kmap_local_page in memzero_page

2021-06-08 Thread Christoph Hellwig
No need for kmap_atomic here.

Signed-off-by: Christoph Hellwig 
---
 include/linux/highmem.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index 832b49b50c7b..0dc0451cf1d1 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -334,9 +334,9 @@ static inline void memcpy_to_page(struct page *page, size_t 
offset,
 
 static inline void memzero_page(struct page *page, size_t offset, size_t len)
 {
-   char *addr = kmap_atomic(page);
+   char *addr = kmap_local_page(page);
memset(addr + offset, 0, len);
-   kunmap_atomic(addr);
+   kunmap_local(addr);
 }
 
 #endif /* _LINUX_HIGHMEM_H */
-- 
2.30.2



[PATCH 03/16] bvec: fix the include guards for bvec.h

2021-06-08 Thread Christoph Hellwig
Fix the include guards to match the file naming.

Signed-off-by: Christoph Hellwig 
---
 include/linux/bvec.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index ff832e698efb..883faf5f1523 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -4,8 +4,8 @@
  *
  * Copyright (C) 2001 Ming Lei 
  */
-#ifndef __LINUX_BVEC_ITER_H
-#define __LINUX_BVEC_ITER_H
+#ifndef __LINUX_BVEC_H
+#define __LINUX_BVEC_H
 
 #include 
 #include 
@@ -183,4 +183,4 @@ static inline void bvec_advance(const struct bio_vec *bvec,
}
 }
 
-#endif /* __LINUX_BVEC_ITER_H */
+#endif /* __LINUX_BVEC_H */
-- 
2.30.2



[PATCH 02/16] MIPS: don't include <linux/genhd.h> in <asm/mach-rc32434/rb.h>

2021-06-08 Thread Christoph Hellwig
There is no need to include genhd.h from a random arch header, and not
doing so prevents the possibility of nasty include loops.

Signed-off-by: Christoph Hellwig 
---
 arch/mips/include/asm/mach-rc32434/rb.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/mips/include/asm/mach-rc32434/rb.h 
b/arch/mips/include/asm/mach-rc32434/rb.h
index d502673a4f6c..34d179ca020b 100644
--- a/arch/mips/include/asm/mach-rc32434/rb.h
+++ b/arch/mips/include/asm/mach-rc32434/rb.h
@@ -7,8 +7,6 @@
 #ifndef __ASM_RC32434_RB_H
 #define __ASM_RC32434_RB_H
 
-#include <linux/genhd.h>
-
 #define REGBASE  0x1800
 #define IDT434_REG_BASE ((volatile void *) KSEG1ADDR(REGBASE))
 #define UART0BASE  0x58000
-- 
2.30.2



Re: [PATCH v3 resend 01/15] mm: add setup_initial_init_mm() helper

2021-06-08 Thread Souptick Joarder
On Tue, Jun 8, 2021 at 8:27 PM Christophe Leroy
 wrote:
>
>
>
> Le 08/06/2021 à 16:53, Souptick Joarder a écrit :
> > On Tue, Jun 8, 2021 at 1:56 PM Kefeng Wang  
> > wrote:
> >>
> >> Add setup_initial_init_mm() helper to setup kernel text,
> >> data and brk.
> >>
> >> Cc: linux-snps-...@lists.infradead.org
> >> Cc: linux-arm-ker...@lists.infradead.org
> >> Cc: linux-c...@vger.kernel.org
> >> Cc: uclinux-h8-de...@lists.sourceforge.jp
> >> Cc: linux-m...@lists.linux-m68k.org
> >> Cc: openr...@lists.librecores.org
> >> Cc: linuxppc-dev@lists.ozlabs.org
> >> Cc: linux-ri...@lists.infradead.org
> >> Cc: linux...@vger.kernel.org
> >> Cc: linux-s...@vger.kernel.org
> >> Cc: x...@kernel.org
> >> Signed-off-by: Kefeng Wang 
> >> ---
> >>   include/linux/mm.h | 3 +++
> >>   mm/init-mm.c   | 9 +
> >>   2 files changed, 12 insertions(+)
> >>
> >> diff --git a/include/linux/mm.h b/include/linux/mm.h
> >> index c274f75efcf9..02aa057540b7 100644
> >> --- a/include/linux/mm.h
> >> +++ b/include/linux/mm.h
> >> @@ -244,6 +244,9 @@ int __add_to_page_cache_locked(struct page *page, 
> >> struct address_space *mapping,
> >>
> >>   #define lru_to_page(head) (list_entry((head)->prev, struct page, lru))
> >>
> >> +void setup_initial_init_mm(void *start_code, void *end_code,
> >> +  void *end_data, void *brk);
> >> +
> >
> > Gentle query -> is there any limitation to adding inline functions in
> > setup_arch() functions?
>
> Kefeng just followed comment from Mike I guess, see
> https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20210604070633.32363-2-wangkefeng.w...@huawei.com/#2696253

Ok.
>
> Christophe
>


Re: [PATCH v3 resend 11/15] powerpc: convert to setup_initial_init_mm()

2021-06-08 Thread Souptick Joarder
On Tue, Jun 8, 2021 at 8:24 PM Christophe Leroy
 wrote:
>
>
>
> Le 08/06/2021 à 16:36, Souptick Joarder a écrit :
> > On Tue, Jun 8, 2021 at 1:56 PM Kefeng Wang  
> > wrote:
> >>
> >> Use setup_initial_init_mm() helper to simplify code.
> >>
> >> Note klimit is (unsigned long) _end, with new helper,
> >> will use _end directly.
> >
> > With this change klimit left with no user in this file and can be
> > moved to some appropriate header.
> > But in a separate series.
>
> I have a patch to remove klimit, see
> https://patchwork.ozlabs.org/project/linuxppc-dev/patch/9fa9ba6807c17f93f35a582c199c646c4a8bfd9c.1622800638.git.christophe.le...@csgroup.eu/

Got it. Thanks :)

>
> Christophe
>
>
> >
> >>
> >> Cc: Michael Ellerman 
> >> Cc: Benjamin Herrenschmidt 
> >> Cc: linuxppc-dev@lists.ozlabs.org
> >> Signed-off-by: Kefeng Wang 
> >> ---
> >>   arch/powerpc/kernel/setup-common.c | 5 +
> >>   1 file changed, 1 insertion(+), 4 deletions(-)
> >>
> >> diff --git a/arch/powerpc/kernel/setup-common.c 
> >> b/arch/powerpc/kernel/setup-common.c
> >> index 74a98fff2c2f..96697c6e1e16 100644
> >> --- a/arch/powerpc/kernel/setup-common.c
> >> +++ b/arch/powerpc/kernel/setup-common.c
> >> @@ -927,10 +927,7 @@ void __init setup_arch(char **cmdline_p)
> >>
> >>  klp_init_thread_info(&init_task);
> >>
> >> -   init_mm.start_code = (unsigned long)_stext;
> >> -   init_mm.end_code = (unsigned long) _etext;
> >> -   init_mm.end_data = (unsigned long) _edata;
> >> -   init_mm.brk = klimit;
> >> +   setup_initial_init_mm(_stext, _etext, _edata, _end);
> >>
> >>  mm_iommu_init(&init_mm);
> >>  irqstack_early_init();
> >> --
> >> 2.26.2
> >>
> >>
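The body of setup_initial_init_mm() itself is not quoted in this thread;
judging from the four assignments the powerpc conversion above replaces, a
userspace model of the intended behaviour looks like this (mock_mm and the
global below are illustrative stand-ins for the kernel's init_mm):

```c
#include <assert.h>

/* Only the four fields the helper touches are modelled here. */
struct mock_mm {
	unsigned long start_code, end_code, end_data, brk;
};

static struct mock_mm init_mm;

/* Sketch: record the kernel's text/data/brk boundaries in init_mm,
 * mirroring the per-arch assignments the helper is meant to replace. */
static void setup_initial_init_mm(void *start_code, void *end_code,
				  void *end_data, void *brk)
{
	init_mm.start_code = (unsigned long)start_code;
	init_mm.end_code   = (unsigned long)end_code;
	init_mm.end_data   = (unsigned long)end_data;
	init_mm.brk        = (unsigned long)brk;
}
```

Each architecture then passes its own _stext/_etext/_edata/_end linker
symbols, as the powerpc hunk above does with (_stext, _etext, _edata, _end).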


Re: [PATCH v3 resend 01/15] mm: add setup_initial_init_mm() helper

2021-06-08 Thread Christophe Leroy




On 08/06/2021 at 16:53, Souptick Joarder wrote:

On Tue, Jun 8, 2021 at 1:56 PM Kefeng Wang  wrote:


Add setup_initial_init_mm() helper to setup kernel text,
data and brk.

Cc: linux-snps-...@lists.infradead.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-c...@vger.kernel.org
Cc: uclinux-h8-de...@lists.sourceforge.jp
Cc: linux-m...@lists.linux-m68k.org
Cc: openr...@lists.librecores.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-ri...@lists.infradead.org
Cc: linux...@vger.kernel.org
Cc: linux-s...@vger.kernel.org
Cc: x...@kernel.org
Signed-off-by: Kefeng Wang 
---
  include/linux/mm.h | 3 +++
  mm/init-mm.c   | 9 +
  2 files changed, 12 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c274f75efcf9..02aa057540b7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -244,6 +244,9 @@ int __add_to_page_cache_locked(struct page *page, struct 
address_space *mapping,

  #define lru_to_page(head) (list_entry((head)->prev, struct page, lru))

+void setup_initial_init_mm(void *start_code, void *end_code,
+  void *end_data, void *brk);
+


Gentle query -> is there any limitation on adding inline functions called
from setup_arch()?


Kefeng just followed comment from Mike I guess, see 
https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20210604070633.32363-2-wangkefeng.w...@huawei.com/#2696253


Christophe



Re: [PATCH v3 resend 11/15] powerpc: convert to setup_initial_init_mm()

2021-06-08 Thread Christophe Leroy




On 08/06/2021 at 16:36, Souptick Joarder wrote:

On Tue, Jun 8, 2021 at 1:56 PM Kefeng Wang  wrote:


Use setup_initial_init_mm() helper to simplify code.

Note klimit is (unsigned long) _end, with new helper,
will use _end directly.


With this change klimit left with no user in this file and can be
moved to some appropriate header.
But in a separate series.


I have a patch to remove klimit, see 
https://patchwork.ozlabs.org/project/linuxppc-dev/patch/9fa9ba6807c17f93f35a582c199c646c4a8bfd9c.1622800638.git.christophe.le...@csgroup.eu/


Christophe






Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Kefeng Wang 
---
  arch/powerpc/kernel/setup-common.c | 5 +
  1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/setup-common.c 
b/arch/powerpc/kernel/setup-common.c
index 74a98fff2c2f..96697c6e1e16 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -927,10 +927,7 @@ void __init setup_arch(char **cmdline_p)

 klp_init_thread_info(&init_task);

-   init_mm.start_code = (unsigned long)_stext;
-   init_mm.end_code = (unsigned long) _etext;
-   init_mm.end_data = (unsigned long) _edata;
-   init_mm.brk = klimit;
+   setup_initial_init_mm(_stext, _etext, _edata, _end);

 mm_iommu_init(&init_mm);
 irqstack_early_init();
--
2.26.2




Re: [PATCH v3 resend 01/15] mm: add setup_initial_init_mm() helper

2021-06-08 Thread Souptick Joarder
On Tue, Jun 8, 2021 at 1:56 PM Kefeng Wang  wrote:
>
> Add setup_initial_init_mm() helper to setup kernel text,
> data and brk.
>
> Cc: linux-snps-...@lists.infradead.org
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-c...@vger.kernel.org
> Cc: uclinux-h8-de...@lists.sourceforge.jp
> Cc: linux-m...@lists.linux-m68k.org
> Cc: openr...@lists.librecores.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-ri...@lists.infradead.org
> Cc: linux...@vger.kernel.org
> Cc: linux-s...@vger.kernel.org
> Cc: x...@kernel.org
> Signed-off-by: Kefeng Wang 
> ---
>  include/linux/mm.h | 3 +++
>  mm/init-mm.c   | 9 +
>  2 files changed, 12 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index c274f75efcf9..02aa057540b7 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -244,6 +244,9 @@ int __add_to_page_cache_locked(struct page *page, struct 
> address_space *mapping,
>
>  #define lru_to_page(head) (list_entry((head)->prev, struct page, lru))
>
> +void setup_initial_init_mm(void *start_code, void *end_code,
> +  void *end_data, void *brk);
> +

Gentle query -> is there any limitation on adding inline functions called
from setup_arch()?

>  /*
>   * Linux kernel virtual memory manager primitives.
>   * The idea being to have a "virtual" mm in the same way
> diff --git a/mm/init-mm.c b/mm/init-mm.c
> index 153162669f80..b4a6f38fb51d 100644
> --- a/mm/init-mm.c
> +++ b/mm/init-mm.c
> @@ -40,3 +40,12 @@ struct mm_struct init_mm = {
> .cpu_bitmap = CPU_BITS_NONE,
> INIT_MM_CONTEXT(init_mm)
>  };
> +
> +void setup_initial_init_mm(void *start_code, void *end_code,
> +  void *end_data, void *brk)
> +{
> +   init_mm.start_code = (unsigned long)start_code;
> +   init_mm.end_code = (unsigned long)end_code;
> +   init_mm.end_data = (unsigned long)end_data;
> +   init_mm.brk = (unsigned long)brk;
> +}
> --
> 2.26.2
>
>


Re: [PATCH v3 resend 11/15] powerpc: convert to setup_initial_init_mm()

2021-06-08 Thread Souptick Joarder
On Tue, Jun 8, 2021 at 1:56 PM Kefeng Wang  wrote:
>
> Use setup_initial_init_mm() helper to simplify code.
>
> Note klimit is (unsigned long) _end, with new helper,
> will use _end directly.

With this change klimit left with no user in this file and can be
moved to some appropriate header.
But in a separate series.

>
> Cc: Michael Ellerman 
> Cc: Benjamin Herrenschmidt 
> Cc: linuxppc-dev@lists.ozlabs.org
> Signed-off-by: Kefeng Wang 
> ---
>  arch/powerpc/kernel/setup-common.c | 5 +
>  1 file changed, 1 insertion(+), 4 deletions(-)
>
> diff --git a/arch/powerpc/kernel/setup-common.c 
> b/arch/powerpc/kernel/setup-common.c
> index 74a98fff2c2f..96697c6e1e16 100644
> --- a/arch/powerpc/kernel/setup-common.c
> +++ b/arch/powerpc/kernel/setup-common.c
> @@ -927,10 +927,7 @@ void __init setup_arch(char **cmdline_p)
>
> klp_init_thread_info(&init_task);
>
> -   init_mm.start_code = (unsigned long)_stext;
> -   init_mm.end_code = (unsigned long) _etext;
> -   init_mm.end_data = (unsigned long) _edata;
> -   init_mm.brk = klimit;
> +   setup_initial_init_mm(_stext, _etext, _edata, _end);
>
> mm_iommu_init(&init_mm);
> irqstack_early_init();
> --
> 2.26.2
>
>


Re: [PATCH] crash_core, vmcoreinfo: Append 'SECTION_SIZE_BITS' to vmcoreinfo

2021-06-08 Thread Baoquan He
On 06/08/21 at 06:33am, Pingfan Liu wrote:
> As mentioned in kernel commit 1d50e5d0c505 ("crash_core, vmcoreinfo:
> Append 'MAX_PHYSMEM_BITS' to vmcoreinfo"), SECTION_SIZE_BITS is used in the
> formula:
> #define SECTIONS_SHIFT (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)
> 
> Besides SECTIONS_SHIFT, SECTION_SIZE_BITS is also used to calculate
> PAGES_PER_SECTION in makedumpfile just like kernel.
> 
> Unfortunately, this arch-dependent macro SECTION_SIZE_BITS changes, e.g.
> recently in kernel commit f0b13ee23241 ("arm64/sparsemem: reduce
> SECTION_SIZE_BITS"). But user space wants a stable interface to get this
> info. Such info is impossible to be deduced from a crashdump vmcore.
> Hence append SECTION_SIZE_BITS to vmcoreinfo.

> 
> Signed-off-by: Pingfan Liu 
> Cc: Bhupesh Sharma 
> Cc: Kazuhito Hagio 
> Cc: Dave Young 
> Cc: Baoquan He 
> Cc: Boris Petkov 
> Cc: Ingo Molnar 
> Cc: Thomas Gleixner 
> Cc: James Morse 
> Cc: Mark Rutland 
> Cc: Will Deacon 
> Cc: Catalin Marinas 
> Cc: Michael Ellerman 
> Cc: Paul Mackerras 
> Cc: Benjamin Herrenschmidt 
> Cc: Dave Anderson 
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-ker...@vger.kernel.org
> Cc: ke...@lists.infradead.org
> Cc: x...@kernel.org
> Cc: linux-arm-ker...@lists.infradead.org


Add the discussion of the original thread in kexec ML for reference:
http://lists.infradead.org/pipermail/kexec/2021-June/022676.html

This looks good to me.

Acked-by: Baoquan He 

> ---
>  kernel/crash_core.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index 825284baaf46..684a6061a13a 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -464,6 +464,7 @@ static int __init crash_save_vmcoreinfo_init(void)
>   VMCOREINFO_LENGTH(mem_section, NR_SECTION_ROOTS);
>   VMCOREINFO_STRUCT_SIZE(mem_section);
>   VMCOREINFO_OFFSET(mem_section, section_mem_map);
> + VMCOREINFO_NUMBER(SECTION_SIZE_BITS);
>   VMCOREINFO_NUMBER(MAX_PHYSMEM_BITS);
>  #endif
>   VMCOREINFO_STRUCT_SIZE(page);
> -- 
> 2.29.2
> 



[PATCH] powerpc/signal64: Copy siginfo before changing regs->nip

2021-06-08 Thread Michael Ellerman
In commit 96d7a4e06fab ("powerpc/signal64: Rewrite handle_rt_signal64()
to minimise uaccess switches") the 64-bit signal code was rearranged to
use user_write_access_begin/end().

As part of that change the call to copy_siginfo_to_user() was moved
later in the function, so that it could be done after the
user_write_access_end().

In particular it was moved after we modify regs->nip to point to the
signal trampoline. That means if copy_siginfo_to_user() fails we exit
handle_rt_signal64() with an error but with regs->nip modified, whereas
previously we would not modify regs->nip until the copy succeeded.

Returning an error from signal delivery but with regs->nip updated
leaves the process in a sort of half-delivered state. We do immediately
force a SEGV in signal_setup_done(), called from do_signal(), so the
process should never run in the half-delivered state.

However that SEGV is not delivered until we've gone around to
do_notify_resume() again, so it's possible some tracing could observe
the half-delivered state.

There are other cases where we fail signal delivery with regs partly
updated, eg. the write to newsp and SA_SIGINFO, but the latter at least
is very unlikely to fail as it reads back from the frame we just wrote
to.

Looking at other arches they seem to be more careful about leaving regs
unchanged until the copy operations have succeeded, and in general that
seems like good hygiene.

So although the current behaviour is not clearly buggy, it's also not
clearly correct. So move the call to copy_siginfo_to_user() up prior to
the modification of regs->nip, which is closer to the old behaviour, and
easier to reason about.

Signed-off-by: Michael Ellerman 
---
 arch/powerpc/kernel/signal_64.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index dca66481d0c2..f9e1f5428b9e 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -902,6 +902,10 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
unsafe_copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set), badframe_block);
user_write_access_end();
 
+   /* Save the siginfo outside of the unsafe block. */
+   if (copy_siginfo_to_user(&frame->info, &ksig->info))
+   goto badframe;
+
/* Make sure signal handler doesn't get spurious FP exceptions */
tsk->thread.fp_state.fpscr = 0;
 
@@ -915,11 +919,6 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
regs->nip = (unsigned long) &frame->tramp[0];
}
 
-
-   /* Save the siginfo outside of the unsafe block. */
-   if (copy_siginfo_to_user(&frame->info, &ksig->info))
-   goto badframe;
-
/* Allocate a dummy caller frame for the signal handler. */
newsp = ((unsigned long)frame) - __SIGNAL_FRAMESIZE;
err |= put_user(regs->gpr[1], (unsigned long __user *)newsp);
-- 
2.25.1



Re: [PATCH v2 00/12] powerpc: Cleanup use of 'struct ppc_inst'

2021-06-08 Thread Christophe Leroy

Hi Michael,

On 20/05/2021 at 15:50, Christophe Leroy wrote:

This series is a cleanup of the use of 'struct ppc_inst'.

A confusion is made between internal representation of powerpc
instructions with 'struct ppc_inst' and in-memory code which is
and will always be an array of 'unsigned int'.

This series cleans it up.

First patch is fixing detection of missing '__user' flag by sparse
when using get_user_instr().

Last part of the series does some source code cleanup in
optprobes, it is put at the ends of this series because of
clashes with the 'struct ppc_inst' cleanups.


What are your plans for this series? I fear that the longer we wait, the more bad
uses of 'struct ppc_inst' we accumulate.
Several people are working in areas that manipulate instructions, so I think the sooner it
gets cleaned up the better. Do you agree?


Thanks
Christophe


Re: [PATCH v7 01/11] mm/mremap: Fix race between MOVE_PMD mremap and pageout

2021-06-08 Thread Kirill A. Shutemov
On Tue, Jun 08, 2021 at 04:47:19PM +0530, Aneesh Kumar K.V wrote:
> On 6/8/21 3:12 PM, Kirill A. Shutemov wrote:
> > On Tue, Jun 08, 2021 at 01:22:23PM +0530, Aneesh Kumar K.V wrote:
> > > 
> > > Hi Hugh,
> > > 
> > > Hugh Dickins  writes:
> > > 
> > > > On Mon, 7 Jun 2021, Aneesh Kumar K.V wrote:
> > > > 
> > > > > CPU 1                       CPU 2                       CPU 3
> > > > >
> > > > > mremap(old_addr, new_addr)  page_shrinker/try_to_unmap_one
> > > > >
> > > > > mmap_write_lock_killable()
> > > > >
> > > > >                             addr = old_addr
> > > > >                             lock(pte_ptl)
> > > > > lock(pmd_ptl)
> > > > > pmd = *old_pmd
> > > > > pmd_clear(old_pmd)
> > > > > flush_tlb_range(old_addr)
> > > > >
> > > > > *new_pmd = pmd
> > > > >                                                         *new_addr = 10; and fills
> > > > >                                                         TLB with new addr and old pfn
> > > > >
> > > > > unlock(pmd_ptl)
> > > > >                             ptep_clear_flush()
> > > > >                             old pfn is free.
> > > > >                                                         Stale TLB entry
> > > > > 
> > > > > Fix this race by holding pmd lock in pageout. This still doesn't 
> > > > > handle the race
> > > > > between MOVE_PUD and pageout.
> > > > > 
> > > > > Fixes: 2c91bd4a4e2e ("mm: speed up mremap by 20x on large regions")
> > > > > Link: 
> > > > > https://lore.kernel.org/linux-mm/CAHk-=wgxvr04ebntxqfevontwnp6fdm+oj5vauqxp3s-huw...@mail.gmail.com
> > > > > Signed-off-by: Aneesh Kumar K.V 
> > > > 
> > > > This seems very wrong to me, to require another level of locking in the
> > > > rmap lookup, just to fix some new pagetable games in mremap.
> > > > 
> > > > But Linus asked "Am I missing something?": neither of you have mentioned
> > > > mremap's take_rmap_locks(), so I hope that already meets your need.  And
> > > > if it needs to be called more often than before (see "need_rmap_locks"),
> > > > that's probably okay.
> > > > 
> > > > Hugh
> > > > 
> > > 
> > > Thanks for reviewing the change. I missed the rmap lock in the code
> > > path. How about the below change?
> > > 
> > >  mm/mremap: hold the rmap lock in write mode when moving page table 
> > > entries.
> > >  To avoid a race between rmap walk and mremap, mremap does 
> > > take_rmap_locks().
> > >  The lock was taken to ensure that rmap walk don't miss a page table 
> > > entry due to
> > >  PTE moves via move_pagetables(). The kernel does further 
> > > optimization of
> > >  this lock such that if we are going to find the newly added vma 
> > > after the
> > >  old vma, the rmap lock is not taken. This is because rmap walk would 
> > > find the
> > >  vmas in the same order and if we don't find the page table attached 
> > > to
> > >  older vma we would find it with the new vma which we would iterate 
> > > later.
> > >  The actual lifetime of the page is still controlled by the PTE lock.
> > >  This patch updates the locking requirement to handle another race 
> > > condition
> > >  explained below with optimized mremap::
> > >  Optimized PMD move
> > >
> > >  CPU 1                       CPU 2                       CPU 3
> > >
> > >  mremap(old_addr, new_addr)  page_shrinker/try_to_unmap_one
> > >
> > >  mmap_write_lock_killable()
> > >
> > >                              addr = old_addr
> > >                              lock(pte_ptl)
> > >  lock(pmd_ptl)
> > >  pmd = *old_pmd
> > >  pmd_clear(old_pmd)
> > >  flush_tlb_range(old_addr)
> > >
> > >  *new_pmd = pmd
> > >                                                          *new_addr = 10; and fills
> > >                                                          TLB with new addr and old pfn
> > >
> > >  unlock(pmd_ptl)
> > >                              ptep_clear_flush()
> > >                              old pfn is free.
> > >                                                          Stale TLB entry
> > >
> > >  Optimized PUD move:
> > >
> > >  CPU 1                       CPU 2                       CPU 3
> > >
> > >  mremap(old_addr, new_addr)  page_shrinker/try_to_unmap_one
> > >
> > >  mmap_write_lock_killable()
> > >
> > >                              addr = old_addr
> > >                              lock(pte_ptl)
> > >  lock(pud_ptl)
> > >  pud = *old_pud
> > >  pud_clear(old_pud)
> > 

[PATCH 4/4] powerpc/papr_scm: Document papr_scm sysfs event format entries

2021-06-08 Thread Kajol Jain
Details is added for the event, cpumask and format attributes
in the ABI documentation.

Signed-off-by: Kajol Jain 
---
 Documentation/ABI/testing/sysfs-bus-papr-pmem | 31 +++
 1 file changed, 31 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-papr-pmem 
b/Documentation/ABI/testing/sysfs-bus-papr-pmem
index 92e2db0e2d3d..be91de341454 100644
--- a/Documentation/ABI/testing/sysfs-bus-papr-pmem
+++ b/Documentation/ABI/testing/sysfs-bus-papr-pmem
@@ -59,3 +59,34 @@ Description:
* "CchRHCnt" : Cache Read Hit Count
* "CchWHCnt" : Cache Write Hit Count
* "FastWCnt" : Fast Write Count
+
+What:  /sys/devices/nmemX/format
+Date:  June 2021
+Contact:   linuxppc-dev , 
linux-nvd...@lists.01.org,
+Description:   (RO) Attribute group to describe the magic bits
+that go into perf_event_attr.config for a particular pmu.
+(See ABI/testing/sysfs-bus-event_source-devices-format).
+
+Each attribute under this group defines a bit range of the
+perf_event_attr.config. Supported attribute is listed
+below::
+
+   event  = "config:0-4"  - event ID
+
+   For example::
+   noopstat = "event=0x1"
+
+What:  /sys/devices/nmemX/events
+Date:  June 2021
+Contact:   linuxppc-dev , 
linux-nvd...@lists.01.org,
+Description:(RO) Attribute group to describe performance monitoring
+events specific to papr-scm. Each attribute in this group 
describes
+a single performance monitoring event supported by this nvdimm 
pmu.
+The name of the file is the name of the event.
+(See ABI/testing/sysfs-bus-event_source-devices-events).
+
+What:  /sys/devices/nmemX/cpumask
+Date:  June 2021
+Contact:   linuxppc-dev , 
linux-nvd...@lists.01.org,
+Description:   (RO) This sysfs file exposes the cpumask which is designated to 
make
+HCALLs to retrieve nvdimm pmu event counter data.
-- 
2.27.0



[PATCH 3/4] powerpc/papr_scm: Add perf interface support

2021-06-08 Thread Kajol Jain
Performance monitoring support for papr-scm nvdimm devices
via perf interface is added, which includes the addition of pmu
functions like add/del/read/event_init for the nvdimm_pmu structure.

A new parameter 'priv' is added to the pdev_archdata structure to save
the nvdimm_pmu device pointer, to handle the unregistering of the pmu device.

papr_scm_pmu_register function populates the nvdimm_pmu structure
with events, cpumask, attribute groups along with event handling
functions. Finally the populated nvdimm_pmu structure is passed to
register the pmu device.
Event handling functions internally uses hcall to get events and
counter data.

Result in power9 machine with 2 nvdimm device:

Ex: List all events via perf list

command:# perf list nmem

  nmem0/cchrhcnt/[Kernel PMU event]
  nmem0/cchwhcnt/[Kernel PMU event]
  nmem0/critrscu/[Kernel PMU event]
  nmem0/ctlresct/[Kernel PMU event]
  nmem0/ctlrestm/[Kernel PMU event]
  nmem0/fastwcnt/[Kernel PMU event]
  nmem0/hostlcnt/[Kernel PMU event]
  nmem0/hostldur/[Kernel PMU event]
  nmem0/hostscnt/[Kernel PMU event]
  nmem0/hostsdur/[Kernel PMU event]
  nmem0/medrcnt/ [Kernel PMU event]
  nmem0/medrdur/ [Kernel PMU event]
  nmem0/medwcnt/ [Kernel PMU event]
  nmem0/medwdur/ [Kernel PMU event]
  nmem0/memlife/ [Kernel PMU event]
  nmem0/noopstat/[Kernel PMU event]
  nmem0/ponsecs/ [Kernel PMU event]
  nmem1/cchrhcnt/[Kernel PMU event]
  nmem1/cchwhcnt/[Kernel PMU event]
  nmem1/critrscu/[Kernel PMU event]
  ...
  nmem1/noopstat/[Kernel PMU event]
  nmem1/ponsecs/ [Kernel PMU event]

Signed-off-by: Kajol Jain 
---
 arch/powerpc/include/asm/device.h |   5 +
 arch/powerpc/platforms/pseries/papr_scm.c | 365 ++
 2 files changed, 370 insertions(+)

diff --git a/arch/powerpc/include/asm/device.h 
b/arch/powerpc/include/asm/device.h
index 219559d65864..47ed639f3b8f 100644
--- a/arch/powerpc/include/asm/device.h
+++ b/arch/powerpc/include/asm/device.h
@@ -48,6 +48,11 @@ struct dev_archdata {
 
 struct pdev_archdata {
u64 dma_mask;
+   /*
+* Pointer to nvdimm_pmu structure, to handle the unregistering
+* of pmu device
+*/
+   void *priv;
 };
 
 #endif /* _ASM_POWERPC_DEVICE_H */
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index ef26fe40efb0..92632b4a4a60 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -18,6 +18,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #define BIND_ANY_ADDR (~0ul)
 
@@ -67,6 +69,8 @@
 #define PAPR_SCM_PERF_STATS_EYECATCHER __stringify(SCMSTATS)
 #define PAPR_SCM_PERF_STATS_VERSION 0x1
 
+#define to_nvdimm_pmu(_pmu)	container_of(_pmu, struct nvdimm_pmu, pmu)
+
 /* Struct holding a single performance metric */
 struct papr_scm_perf_stat {
u8 stat_id[8];
@@ -116,6 +120,12 @@ struct papr_scm_priv {
 
/* length of the stat buffer as expected by phyp */
size_t stat_buffer_len;
+
+/* array to have event_code and stat_id mappings */
+   char **nvdimm_events_map;
+
+   /* count of supported events */
+   u32 total_events;
 };
 
 static int papr_scm_pmem_flush(struct nd_region *nd_region,
@@ -329,6 +339,354 @@ static ssize_t drc_pmem_query_stats(struct papr_scm_priv 
*p,
return 0;
 }
 
+PMU_FORMAT_ATTR(event, "config:0-4");
+
+static struct attribute *nvdimm_pmu_format_attr[] = {
+   &format_attr_event.attr,
+   NULL,
+};
+
+static struct attribute_group nvdimm_pmu_format_group = {
+   .name = "format",
+   .attrs = nvdimm_pmu_format_attr,
+};
+
+static int papr_scm_pmu_get_value(struct perf_event *event, struct device 
*dev, u64 *count)
+{
+   struct papr_scm_perf_stat *stat;
+   struct papr_scm_perf_stats *stats;
+   struct papr_scm_priv *p = (struct papr_scm_priv *)dev->driver_data;
+   int rc, size;
+
+   /* Allocate request buffer enough to hold single performance stat */
+   size = sizeof(struct papr_scm_perf_stats) +
+   sizeof(struct papr_scm_perf_stat);
+
+   if (!p || !p->nvdimm_events_map)
+   return -EINVAL;
+
+   stats = kzalloc(size, GFP_KERNEL);
+   if (!stats)
+ 

[PATCH 2/4] drivers/nvdimm: Add perf interface to expose nvdimm performance stats

2021-06-08 Thread Kajol Jain
A common interface is added to get performance stats reporting
support for nvdimm devices. Added interface includes support for
pmu register/unregister functions, cpu hotplug and pmu event
functions like event_init/add/read/del.
Users can use the standard perf tool to access the perf
events exposed via the pmu.

Signed-off-by: Kajol Jain 
---
 drivers/nvdimm/Makefile  |   1 +
 drivers/nvdimm/nd_perf.c | 230 +++
 include/linux/nd.h   |   3 +
 3 files changed, 234 insertions(+)
 create mode 100644 drivers/nvdimm/nd_perf.c

diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile
index 29203f3d3069..25dba6095612 100644
--- a/drivers/nvdimm/Makefile
+++ b/drivers/nvdimm/Makefile
@@ -18,6 +18,7 @@ nd_e820-y := e820.o
 libnvdimm-y := core.o
 libnvdimm-y += bus.o
 libnvdimm-y += dimm_devs.o
+libnvdimm-y += nd_perf.o
 libnvdimm-y += dimm.o
 libnvdimm-y += region_devs.o
 libnvdimm-y += region.o
diff --git a/drivers/nvdimm/nd_perf.c b/drivers/nvdimm/nd_perf.c
new file mode 100644
index ..5cc1f1c65972
--- /dev/null
+++ b/drivers/nvdimm/nd_perf.c
@@ -0,0 +1,230 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * nd_perf.c: NVDIMM Device Performance Monitoring Unit support
+ *
+ * Perf interface to expose nvdimm performance stats.
+ *
+ * Copyright (C) 2021 IBM Corporation
+ */
+
+#define pr_fmt(fmt) "nvdimm_pmu: " fmt
+
+#include 
+
+static ssize_t nvdimm_pmu_cpumask_show(struct device *dev,
+  struct device_attribute *attr, char *buf)
+{
+   struct pmu *pmu = dev_get_drvdata(dev);
+   struct nvdimm_pmu *nd_pmu;
+
+   nd_pmu = container_of(pmu, struct nvdimm_pmu, pmu);
+
+   return cpumap_print_to_pagebuf(true, buf, cpumask_of(nd_pmu->cpu));
+}
+
+static int nvdimm_pmu_cpu_offline(unsigned int cpu, struct hlist_node *node)
+{
+   struct nvdimm_pmu *nd_pmu;
+   u32 target;
+   int nodeid;
+   const struct cpumask *cpumask;
+
+   nd_pmu = hlist_entry_safe(node, struct nvdimm_pmu, node);
+
+   /* Clear it, in case the given cpu is set in nd_pmu->arch_cpumask */
+   cpumask_test_and_clear_cpu(cpu, &nd_pmu->arch_cpumask);
+
+   /*
+* If given cpu is not same as current designated cpu for
+* counter access, just return.
+*/
+   if (cpu != nd_pmu->cpu)
+   return 0;
+
+   /* Check for any active cpu in nd_pmu->arch_cpumask */
+   target = cpumask_any(&nd_pmu->arch_cpumask);
+   nd_pmu->cpu = target;
+
+   /*
+* Incase we don't have any active cpu in nd_pmu->arch_cpumask,
+* check in given cpu's numa node list.
+*/
+   if (target >= nr_cpu_ids) {
+   nodeid = cpu_to_node(cpu);
+   cpumask = cpumask_of_node(nodeid);
+   target = cpumask_any_but(cpumask, cpu);
+   nd_pmu->cpu = target;
+
+   if (target >= nr_cpu_ids)
+   return -1;
+   }
+
+   return 0;
+}
+
+static int nvdimm_pmu_cpu_online(unsigned int cpu, struct hlist_node *node)
+{
+   struct nvdimm_pmu *nd_pmu;
+
+   nd_pmu = hlist_entry_safe(node, struct nvdimm_pmu, node);
+
+   if (nd_pmu->cpu >= nr_cpu_ids)
+   nd_pmu->cpu = cpu;
+
+   return 0;
+}
+
+static int create_cpumask_attr_group(struct nvdimm_pmu *nd_pmu)
+{
+   struct perf_pmu_events_attr *attr;
+   struct attribute **attrs;
+   struct attribute_group *nvdimm_pmu_cpumask_group;
+
+   attr = kzalloc(sizeof(*attr), GFP_KERNEL);
+   if (!attr)
+   return -ENOMEM;
+
+   attrs = kzalloc(2 * sizeof(struct attribute *), GFP_KERNEL);
+   if (!attrs) {
+   kfree(attr);
+   return -ENOMEM;
+   }
+
+   /* Allocate memory for cpumask attribute group */
+   nvdimm_pmu_cpumask_group = kzalloc(sizeof(*nvdimm_pmu_cpumask_group), 
GFP_KERNEL);
+   if (!nvdimm_pmu_cpumask_group) {
+   kfree(attr);
+   kfree(attrs);
+   return -ENOMEM;
+   }
+
+   sysfs_attr_init(&attr->attr.attr);
+   attr->attr.attr.name = "cpumask";
+   attr->attr.attr.mode = 0444;
+   attr->attr.show = nvdimm_pmu_cpumask_show;
+   attrs[0] = &attr->attr.attr;
+   attrs[1] = NULL;
+
+   nvdimm_pmu_cpumask_group->attrs = attrs;
+   nd_pmu->attr_groups[NVDIMM_PMU_CPUMASK_ATTR] = nvdimm_pmu_cpumask_group;
+   return 0;
+}
+
+static int nvdimm_pmu_cpu_hotplug_init(struct nvdimm_pmu *nd_pmu)
+{
+   int nodeid, rc;
+   const struct cpumask *cpumask;
+
+   /*
+* In case cpu hotplug is not handled by arch specific code,
+* it can still provide the required cpumask, which is used
+* to get the designated cpu for counter access.
+* Check for any active cpu in nd_pmu->arch_cpumask.
+*/
+   if (!cpumask_empty(&nd_pmu->arch_cpumask)) {
+   nd_pmu->cpu = cpumask_any(&nd_pmu->arch_cpumask);
+   } else {
+   /* pick active cpu from the 

[PATCH 1/4] drivers/nvdimm: Add nvdimm pmu structure

2021-06-08 Thread Kajol Jain
A structure is added, called nvdimm_pmu, for performance
stats reporting support of nvdimm devices. It can be used to add
nvdimm pmu data such as supported events and pmu event functions
like event_init/add/read/del with cpu hotplug support.

Signed-off-by: Kajol Jain 
---
 include/linux/nd.h | 43 +++
 1 file changed, 43 insertions(+)

diff --git a/include/linux/nd.h b/include/linux/nd.h
index ee9ad76afbba..712499cf7335 100644
--- a/include/linux/nd.h
+++ b/include/linux/nd.h
@@ -8,6 +8,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 enum nvdimm_event {
NVDIMM_REVALIDATE_POISON,
@@ -23,6 +25,47 @@ enum nvdimm_claim_class {
NVDIMM_CCLASS_UNKNOWN,
 };
 
+/* Event attribute array index */
+#define NVDIMM_PMU_FORMAT_ATTR 0
+#define NVDIMM_PMU_EVENT_ATTR  1
+#define NVDIMM_PMU_CPUMASK_ATTR2
+#define NVDIMM_PMU_NULL_ATTR   3
+
+/**
+ * struct nvdimm_pmu - data structure for nvdimm perf driver
+ *
+ * @name: name of the nvdimm pmu device.
+ * @pmu: pmu data structure for nvdimm performance stats.
+ * @dev: nvdimm device pointer.
+ * @functions(event_init/add/del/read): platform specific pmu functions.
+ * @attr_groups: data structure for events, formats and cpumask
+ * @cpu: designated cpu for counter access.
+ * @node: node for cpu hotplug notifier link.
+ * @cpuhp_state: state for cpu hotplug notification.
+ * @arch_cpumask: cpumask to get designated cpu for counter access.
+ */
+struct nvdimm_pmu {
+   const char *name;
+   struct pmu pmu;
+   struct device *dev;
+   int (*event_init)(struct perf_event *event);
+   int  (*add)(struct perf_event *event, int flags);
+   void (*del)(struct perf_event *event, int flags);
+   void (*read)(struct perf_event *event);
+   /*
+* Attribute groups for the nvdimm pmu. Index 0 used for
+* format attribute, index 1 used for event attribute,
+* index 2 used for cpumask attribute and index 3 kept as NULL.
+*/
+   const struct attribute_group *attr_groups[4];
+   int cpu;
+   struct hlist_node node;
+   enum cpuhp_state cpuhp_state;
+
+   /* cpumask provided by arch/platform specific code */
+   struct cpumask arch_cpumask;
+};
+
 struct nd_device_driver {
struct device_driver drv;
unsigned long type;
-- 
2.27.0



[PATCH 0/4] Add perf interface to expose nvdimm

2021-06-08 Thread Kajol Jain
Patchset adds performance stats reporting support for nvdimm.
Added interface includes support for pmu register/unregister
functions. A structure is added called nvdimm_pmu to be used for
adding arch/platform specific data such as supported events, cpumask and
pmu event functions like event_init/add/read/del.
Users can use the standard perf tool to access the perf
events exposed via the pmu.

Added implementation to expose IBM pseries platform nmem*
device performance stats using this interface.

Result from power9 pseries lpar with 2 nvdimm device:
command:# perf list nmem
  nmem0/cchrhcnt/[Kernel PMU event]
  nmem0/cchwhcnt/[Kernel PMU event]
  nmem0/critrscu/[Kernel PMU event]
  nmem0/ctlresct/[Kernel PMU event]
  nmem0/ctlrestm/[Kernel PMU event]
  nmem0/fastwcnt/[Kernel PMU event]
  nmem0/hostlcnt/[Kernel PMU event]
  nmem0/hostldur/[Kernel PMU event]
  nmem0/hostscnt/[Kernel PMU event]
  nmem0/hostsdur/[Kernel PMU event]
  nmem0/medrcnt/ [Kernel PMU event]
  nmem0/medrdur/ [Kernel PMU event]
  nmem0/medwcnt/ [Kernel PMU event]
  nmem0/medwdur/ [Kernel PMU event]
  nmem0/memlife/ [Kernel PMU event]
  nmem0/noopstat/[Kernel PMU event]
  nmem0/ponsecs/ [Kernel PMU event]
  nmem1/cchrhcnt/[Kernel PMU event]
  nmem1/cchwhcnt/[Kernel PMU event]
  nmem1/critrscu/[Kernel PMU event]
  ...
  nmem1/noopstat/[Kernel PMU event]
  nmem1/ponsecs/ [Kernel PMU event]

Patch1:
Introduces the nvdimm_pmu structure
Patch2:
Adds common interface to add arch/platform specific data
includes supported events, pmu event functions. It also
adds code for cpu hotplug support.
Patch3:
Add code in arch/powerpc/platform/pseries/papr_scm.c to expose
nmem* pmu. It fills in the nvdimm_pmu structure with event attrs
cpumask and event functions and then registers the pmu by adding
callbacks to register_nvdimm_pmu.
Patch4:
Sysfs documentation patch

Changelog
---
RFC v3 -> PATCH
- Link to the RFC v3 patchset : https://lkml.org/lkml/2021/5/29/28

- Remove RFC tag.

- Add nvdimm_pmu_cpu_online function.

- A new variable 'arch_cpumask' is added to the struct nvdimm_pmu
  which can be used to provide cpumask by the arch specific code.
  It will be used in case cpu hotplug is not handled by arch code.
  Now the common interface first checks for any active cpu in arch_cpumask
  to designate a cpu to collect counter data, and in case we don't have any
  active cpu in that mask, it will look into the cpumask of the device's
  numa node.

- Add code in papr_scm to fill the arch_cpumask variable with the required
  cpumask.

- Some optimizations/fixes from previous RFC code

v2 -> v3
- Link to the RFC v2 patchset : https://lkml.org/lkml/2021/5/25/591

- Moved hotplug code changes from papr_scm code to generic interface
  with required functionality as suggested by Peter Zijlstra

- Changed function parameter of unregister_nvdimm_pmu function from
  struct pmu to struct nvdimm_pmu.

- Now cpumask will get updated based on numa node of corresponding nvdimm
  device as suggested by Peter Zijlstra.

v1 -> v2
- Link to the RFC v1 patchset : https://lkml.org/lkml/2021/5/12/2747

- Removed intermediate functions nvdimm_pmu_read/nvdimm_pmu_add/
  nvdimm_pmu_del/nvdimm_pmu_event_init and directly assigned
platform specific routines. Also add checks for any NULL functions.
  Suggested by: Peter Zijlstra

- Add macros for event attribute array index which can be used to
  assign dynamically allocated attr_groups.

- New function 'nvdimm_pmu_mem_free' is added to free dynamic
  memory allocated for attr_groups in papr_scm.c

- PMU register call moved from papr_scm_nvdimm_init() to papr_scm_probe()

- Move addition of cpu/node/cpuhp_state attributes in struct nvdimm_pmu
  to patch 4 where cpu hotplug code added

- Removed device attribute from the attribute list of
  add/del/read/event_init functions in nvdimm_pmu structure
  as we need to assign them directly to pmu structure.

---

Kajol Jain (4):
  drivers/nvdimm: Add nvdimm pmu structure
  drivers/nvdimm: Add perf interface to expose nvdimm performance stats
  powerpc/papr_scm: Add perf interface support
  powerpc/papr_scm: Document papr_scm sysfs event format entries

 

Re: [PATCH] powerpc/kprobes: Pass ppc_inst as a pointer to emulate_step() on ppc32

2021-06-08 Thread Naveen N. Rao

Christophe Leroy wrote:



Le 07/06/2021 à 19:36, Christophe Leroy a écrit :



Le 07/06/2021 à 16:31, Christophe Leroy a écrit :



Le 07/06/2021 à 13:34, Naveen N. Rao a écrit :

Naveen N. Rao wrote:

Trying to use a kprobe on ppc32 results in the below splat:
    BUG: Unable to handle kernel data access on read at 0x7c0802a6
    Faulting instruction address: 0xc002e9f0
    Oops: Kernel access of bad area, sig: 11 [#1]
    BE PAGE_SIZE=4K PowerPC 44x Platform
    Modules linked in:
    CPU: 0 PID: 89 Comm: sh Not tainted 5.13.0-rc1-01824-g3a81c0495fdb #7
    NIP:  c002e9f0 LR: c0011858 CTR: 8a47
    REGS: c292fd50 TRAP: 0300   Not tainted  (5.13.0-rc1-01824-g3a81c0495fdb)
    MSR:  9000   CR: 24002002  XER: 2000
    DEAR: 7c0802a6 ESR: 
    
    NIP [c002e9f0] emulate_step+0x28/0x324
    LR [c0011858] optinsn_slot+0x128/0x1
    Call Trace:
 opt_pre_handler+0x7c/0xb4 (unreliable)
 optinsn_slot+0x128/0x1
 ret_from_syscall+0x0/0x28

The offending instruction is:
    81 24 00 00 lwz r9,0(r4)

Here, we are trying to load the second argument to emulate_step():
struct ppc_inst, which is the instruction to be emulated. On ppc64,
structures are passed in registers when passed by value. However, per
the ppc32 ABI, structures are always passed to functions as pointers.
This isn't being adhered to when setting up the call to emulate_step()
in the optprobe trampoline. Fix the same.

Fixes: eacf4c0202654a ("powerpc: Enable OPTPROBES on PPC32")
Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/kernel/optprobes.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)


Christophe,
Can you confirm if this patch works for you? It would be good if this can go in 
v5.13.



I'm trying to use kprobes, but I must be missing something. I have tried to follow the example in 
kernel's documentation:


  # echo 'p:myprobe do_sys_open dfd=%r3' > 
/sys/kernel/debug/tracing/kprobe_events

  # echo 1 > /sys/kernel/debug/tracing/events/kprobes/myprobe/enable

  # cat /sys/kernel/debug/kprobes/list

  c00122e4  k  kretprobe_trampoline+0x0    [OPTIMIZED]
  c018a1b4  k  do_sys_open+0x0    [OPTIMIZED]

  # cat /sys/kernel/debug/tracing/tracing_on

  1

  # cat /sys/kernel/debug/tracing/trace

# tracer: nop
#
# entries-in-buffer/entries-written: 0/0   #P:1
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||   TIMESTAMP  FUNCTION
#              | |       |   ||||      |         |



So it looks like I get no event. I can't believe that do_sys_open() is never 
hit.

This is without your patch, so it should Oops ?


Then it looks like something is locked up somewhere, because I can't do 
anything else:

  # echo 'p:myprobe2 do_sys_openat2 dfd=%r3' 
>/sys/kernel/debug/tracing/kprobe_events

  -sh: can't create /sys/kernel/debug/tracing/kprobe_events: Device or resource 
busy

  # echo '-:myprobe' > /sys/kernel/debug/tracing/kprobe_events

  -sh: can't create /sys/kernel/debug/tracing/kprobe_events: Device or resource 
busy

  # echo > /sys/kernel/debug/tracing/kprobe_events

  -sh: can't create /sys/kernel/debug/tracing/kprobe_events: Device or resource 
busy


These should work if you disable the event. See below...






Ok, did a new test. Seems like do_sys_open() is really never called.
I set the test at do_sys_openat2 , it was not optimised and was working.
I set the test at do_sys_openat2+0x10 , it was optimised and crashed.
Now I'm going to test the patch.

When I set an event, is that normal that it removes the previous one ? Then we can have only one 
event at a time ? And then when that event is enabled we get 'Device or resource busy' when trying 
to add a new one ?


You should append to kprobe_events (i.e., use '>>') when _adding_ an 
event, otherwise it is considered a write and it tries to remove the 
existing event, which can't be done if the event is enabled.


kprobe_events allows events to be removed only after they are disabled.
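A minimal sketch of that flow (assuming tracefs is mounted at /sys/kernel/debug/tracing and the commands run as root):

```shell
cd /sys/kernel/debug/tracing

# Append ('>>') so adding a second event doesn't try to remove the first
echo 'p:myprobe do_sys_openat2 dfd=%r3' >> kprobe_events
echo 'p:myprobe2 do_sys_openat2+0x10' >> kprobe_events

# An enabled event can't be removed: disable it first, then remove
echo 0 > events/kprobes/myprobe/enable
echo '-:myprobe' >> kprobe_events

# Truncating kprobe_events clears everything, but only succeeds once
# all remaining events are disabled
echo 0 > events/kprobes/enable
echo > kprobe_events
```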





I confirm it doesn't crash anymore and it now works with optimised probes.

Tested-by: Christophe Leroy 


Thanks!
- Naveen



Re: [PATCH v1 05/12] mm/memory_hotplug: remove nid parameter from remove_memory() and friends

2021-06-08 Thread David Hildenbrand

On 08.06.21 13:11, Michael Ellerman wrote:

David Hildenbrand  writes:

There is only a single user remaining. We can simply try to offline all
online nodes - which is fast, because we usually span pages and can skip
such nodes right away.


That makes me slightly nervous, because our big powerpc boxes tend to
trip on these scaling issues before others.

But the spanned pages check is just:

void try_offline_node(int nid)
{
pg_data_t *pgdat = NODE_DATA(nid);
 ...
if (pgdat->node_spanned_pages)
return;

So I guess that's pretty cheap, and it's only O(nodes), which should
never get that big.


Exactly. And if it does turn out to be a problem, we can walk all memory 
blocks before removing them, collecting the nid(s).


--
Thanks,

David / dhildenb



Re: [PATCH v7 01/11] mm/mremap: Fix race between MOVE_PMD mremap and pageout

2021-06-08 Thread Aneesh Kumar K.V

On 6/8/21 3:12 PM, Kirill A. Shutemov wrote:

On Tue, Jun 08, 2021 at 01:22:23PM +0530, Aneesh Kumar K.V wrote:


Hi Hugh,

Hugh Dickins  writes:


On Mon, 7 Jun 2021, Aneesh Kumar K.V wrote:


CPU 1                           CPU 2                           CPU 3

mremap(old_addr, new_addr)      page_shrinker/try_to_unmap_one

mmap_write_lock_killable()

                                addr = old_addr
                                lock(pte_ptl)
lock(pmd_ptl)
pmd = *old_pmd
pmd_clear(old_pmd)
flush_tlb_range(old_addr)

*new_pmd = pmd
                                                                *new_addr = 10; and fills
                                                                TLB with new addr
                                                                and old pfn

unlock(pmd_ptl)
                                ptep_clear_flush()
                                old pfn is free.
                                                                Stale TLB entry

Fix this race by holding pmd lock in pageout. This still doesn't handle the race
between MOVE_PUD and pageout.

Fixes: 2c91bd4a4e2e ("mm: speed up mremap by 20x on large regions")
Link: 
https://lore.kernel.org/linux-mm/CAHk-=wgxvr04ebntxqfevontwnp6fdm+oj5vauqxp3s-huw...@mail.gmail.com
Signed-off-by: Aneesh Kumar K.V 


This seems very wrong to me, to require another level of locking in the
rmap lookup, just to fix some new pagetable games in mremap.

But Linus asked "Am I missing something?": neither of you have mentioned
mremap's take_rmap_locks(), so I hope that already meets your need.  And
if it needs to be called more often than before (see "need_rmap_locks"),
that's probably okay.

Hugh



Thanks for reviewing the change. I missed the rmap lock in the code
path. How about the below change?

 mm/mremap: hold the rmap lock in write mode when moving page table entries.
 
 To avoid a race between rmap walk and mremap, mremap does take_rmap_locks().
 The lock was taken to ensure that rmap walk doesn't miss a page table entry
 due to PTE moves via move_pagetables(). The kernel does further optimization
 of this lock such that if we are going to find the newly added vma after the
 old vma, the rmap lock is not taken. This is because rmap walk would find the
 vmas in the same order and if we don't find the page table attached to the
 older vma we would find it with the new vma which we would iterate later.
 The actual lifetime of the page is still controlled by the PTE lock.
 
 This patch updates the locking requirement to handle another race condition
 explained below with optimized mremap::
 
 Optimized PMD move
 
 CPU 1                          CPU 2                           CPU 3
 
 mremap(old_addr, new_addr)     page_shrinker/try_to_unmap_one
 
 mmap_write_lock_killable()
 
                                addr = old_addr
                                lock(pte_ptl)
 lock(pmd_ptl)
 pmd = *old_pmd
 pmd_clear(old_pmd)
 flush_tlb_range(old_addr)
 
 *new_pmd = pmd
                                                                *new_addr = 10; and fills
                                                                TLB with new addr
                                                                and old pfn
 
 unlock(pmd_ptl)
                                ptep_clear_flush()
                                old pfn is free.
                                                                Stale TLB entry
 
 Optimized PUD move:
 
 CPU 1                          CPU 2                           CPU 3
 
 mremap(old_addr, new_addr)     page_shrinker/try_to_unmap_one
 
 mmap_write_lock_killable()
 
                                addr = old_addr
                                lock(pte_ptl)
 lock(pud_ptl)
 pud = *old_pud
 pud_clear(old_pud)
 flush_tlb_range(old_addr)
 
 *new_pud = pud
                                                                *new_addr = 10; and fills
                                                                TLB with new addr
                                                                and old pfn
 
 unlock(pud_ptl)
                                ptep_clear_flush()
                                old pfn is free.
                                                                Stale TLB entry
 
 Both the above race conditions can be fixed if we force the mremap path to
 take the rmap lock.
 
 Signed-off-by: Aneesh Kumar K.V 


Looks like it should be enough to address the 

Re: [PATCH v1 05/12] mm/memory_hotplug: remove nid parameter from remove_memory() and friends

2021-06-08 Thread Michael Ellerman
David Hildenbrand  writes:
> There is only a single user remaining. We can simply try to offline all
> online nodes - which is fast, because we usually span pages and can skip
> such nodes right away.

That makes me slightly nervous, because our big powerpc boxes tend to
trip on these scaling issues before others.

But the spanned pages check is just:

void try_offline_node(int nid)
{
pg_data_t *pgdat = NODE_DATA(nid);
...
if (pgdat->node_spanned_pages)
return;

So I guess that's pretty cheap, and it's only O(nodes), which should
never get that big.

> Cc: Michael Ellerman 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: "Rafael J. Wysocki" 
> Cc: Len Brown 
> Cc: Dan Williams 
> Cc: Vishal Verma 
> Cc: Dave Jiang 
> Cc: "Michael S. Tsirkin" 
> Cc: Jason Wang 
> Cc: Andrew Morton 
> Cc: Nathan Lynch 
> Cc: Laurent Dufour 
> Cc: "Aneesh Kumar K.V" 
> Cc: Scott Cheloha 
> Cc: Anton Blanchard 
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-a...@vger.kernel.org
> Cc: nvd...@lists.linux.dev
> Signed-off-by: David Hildenbrand 
> ---
>  .../platforms/pseries/hotplug-memory.c|  9 -
>  drivers/acpi/acpi_memhotplug.c|  7 +--
>  drivers/dax/kmem.c|  3 +--
>  drivers/virtio/virtio_mem.c   |  4 ++--
>  include/linux/memory_hotplug.h| 10 +-
>  mm/memory_hotplug.c   | 20 +--
>  6 files changed, 23 insertions(+), 30 deletions(-)
>
> diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
> b/arch/powerpc/platforms/pseries/hotplug-memory.c
> index 8377f1f7c78e..4a9232ddbefe 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
> @@ -286,7 +286,7 @@ static int pseries_remove_memblock(unsigned long base, 
> unsigned long memblock_si
>  {
>   unsigned long block_sz, start_pfn;
>   int sections_per_block;
> - int i, nid;
> + int i;
>  
>   start_pfn = base >> PAGE_SHIFT;
>  
> @@ -297,10 +297,9 @@ static int pseries_remove_memblock(unsigned long base, 
> unsigned long memblock_si
>  
>   block_sz = pseries_memory_block_size();
>   sections_per_block = block_sz / MIN_MEMORY_BLOCK_SIZE;
> - nid = memory_add_physaddr_to_nid(base);
>  
>   for (i = 0; i < sections_per_block; i++) {
> - __remove_memory(nid, base, MIN_MEMORY_BLOCK_SIZE);
> + __remove_memory(base, MIN_MEMORY_BLOCK_SIZE);
>   base += MIN_MEMORY_BLOCK_SIZE;
>   }
>  
> @@ -386,7 +385,7 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb)
>  
>   block_sz = pseries_memory_block_size();
>  
> - __remove_memory(mem_block->nid, lmb->base_addr, block_sz);
> + __remove_memory(lmb->base_addr, block_sz);
>   put_device(&mem_block->dev);
>  
>   /* Update memory regions for memory remove */
> @@ -638,7 +637,7 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
>  
>   rc = dlpar_online_lmb(lmb);
>   if (rc) {
> - __remove_memory(nid, lmb->base_addr, block_sz);
> + __remove_memory(lmb->base_addr, block_sz);
>   invalidate_lmb_associativity_index(lmb);
>   } else {
>   lmb->flags |= DRCONF_MEM_ASSIGNED;


Acked-by: Michael Ellerman  (powerpc)

cheers


Re: [PATCH] powerpc: Fix kernel-jump address for ppc64 wrapper boot

2021-06-08 Thread He Ying

Hello,


在 2021/6/8 12:55, Christophe Leroy 写道:



Le 04/06/2021 à 11:22, He Ying a écrit :

 From "64-bit PowerPC ELF Application Binary Interface Supplement 1.9",
we know that the value of a function pointer in a language like C is
the address of the function descriptor and the first doubleword
of the function descriptor contains the address of the entry point
of the function.

So, when we want to jump to an address (e.g. addr) to execute for
PPC-elf64abi, we should assign the address of addr, *NOT* addr itself,
to the function pointer, or the system will jump to the wrong address.

Link: 
https://refspecs.linuxfoundation.org/ELF/ppc64/PPC-elf64abi.html#FUNC-DES

Signed-off-by: He Ying 
---
  arch/powerpc/boot/main.c | 9 +
  1 file changed, 9 insertions(+)

diff --git a/arch/powerpc/boot/main.c b/arch/powerpc/boot/main.c
index cae31a6e8f02..50fd7f11b642 100644
--- a/arch/powerpc/boot/main.c
+++ b/arch/powerpc/boot/main.c
@@ -268,7 +268,16 @@ void start(void)
  if (console_ops.close)
  console_ops.close();
  +#ifdef CONFIG_PPC64_BOOT_WRAPPER


This kind of need doesn't deserve a #ifdef, see 
https://www.kernel.org/doc/html/latest/process/coding-style.html#conditional-compilation


You can do:


kentry = (kernel_entry_t)(IS_ENABLED(CONFIG_PPC64_BOOT_WRAPPER) ?
 &vmlinux.addr : vmlinux.addr);



Or, if you prefer something less compact:


if (IS_ENABLED(CONFIG_PPC64_BOOT_WRAPPER))
    kentry = (kernel_entry_t) &vmlinux.addr;
else
    kentry = (kernel_entry_t) vmlinux.addr;


Thanks for reviewing. But from Oliver's reply, this patch should be dropped.

No ppc platform currently builds the wrapper as a big-endian ppc64 ELF.

And ppc64le uses the LE ABI (ABIv2), which doesn't use function descriptors.

So this may not be a problem for now.


Thanks again.





+    /*
+     * For PPC-elf64abi, the value of a function pointer is the address
+     * of the function descriptor. And the first doubleword of a function
+     * descriptor contains the address of the entry point of the function.
+     */
+    kentry = (kernel_entry_t) &vmlinux.addr;
+#else
  kentry = (kernel_entry_t) vmlinux.addr;
+#endif
  if (ft_addr) {
  if(platform_ops.kentry)
  platform_ops.kentry(ft_addr, vmlinux.addr);


.


Re: [PATCH v2] lockdown,selinux: avoid bogus SELinux lockdown permission checks

2021-06-08 Thread Ondrej Mosnacek
On Thu, Jun 3, 2021 at 7:46 PM Paul Moore  wrote:
> On Wed, Jun 2, 2021 at 9:40 AM Ondrej Mosnacek  wrote:
> > On Fri, May 28, 2021 at 3:37 AM Paul Moore  wrote:
[...]
> > > I know you and Casey went back and forth on this in v1, but I agree
> > > with Casey that having two LSM hooks here is a mistake.  I know it
> > > makes backports hard, but spoiler alert: maintaining complex software
> > > over any non-trivial period of time is hard, really hard sometimes
> > > ;)
> >
> > Do you mean having two slots in lsm_hook_defs.h or also having two
> > security_*() functions? (It's not clear to me if you're just
> > reiterating disagreement with v1 or if you dislike the simplified v2
> > as well.)
>
> To be clear I don't think there should be two functions for this, just
> make whatever changes are necessary to the existing
> security_locked_down() LSM hook.  Yes, the backport is hard.  Yes, it
> will touch a lot of code.  Yes, those are lame excuses to not do the
> right thing.

OK, I guess I'll just go with the bigger patch.

> > > > The callers migrated to the new hook, passing NULL as cred:
> > > > 1. arch/powerpc/xmon/xmon.c
> > > >  Here the hook seems to be called from non-task context and is only
> > > >  used for redacting some sensitive values from output sent to
> > > >  userspace.
> > >
> > > This definitely sounds like kernel_t based on the description above.
> >
> > Here I'm a little concerned that the hook might be called from some
> > unusual interrupt, which is not masked by spin_lock_irqsave()... We
> > ran into this with PMI (Platform Management Interrupt) before, see
> > commit 5ae5fbd21079 ("powerpc/perf: Fix handling of privilege level
> > checks in perf interrupt context"). While I can't see anything that
> > would suggest something like this happening here, the whole thing is
> > so foreign to me that I'm wary of making assumptions :)
> >
> > @Michael/PPC devs, can you confirm to us that xmon_is_locked_down() is
> > only called from normal syscall/interrupt context (as opposed to
> > something tricky like PMI)?
>
> You did submit the code change so I assumed you weren't concerned
> about it :)  If it is a bad hook placement that is something else
> entirely.

What I submitted effectively avoided the SELinux hook to be called, so
I didn't bother checking deeper in what context those hook calls would
occur.

> Hopefully we'll get some guidance from the PPC folks.
>
> > > > 4. net/xfrm/xfrm_user.c:copy_to_user_*()
> > > >  Here a cryptographic secret is redacted based on the value returned
> > > >  from the hook. There are two possible actions that may lead here:
> > > >  a) A netlink message XFRM_MSG_GETSA with NLM_F_DUMP set - here the
> > > > task context is relevant, since the dumped data is sent back to
> > > > the current task.
> > >
> > > If the task context is relevant we should use it.
> >
> > Yes, but as I said it would create an asymmetry with case b), which
> > I'll expand on below...
> >
> > > >  b) When deleting an SA via XFRM_MSG_DELSA, the dumped SAs are
> > > > broadcasted to tasks subscribed to XFRM events - here the
> > > > SELinux check is not meningful as the current task's creds do
> > > > not represent the tasks that could potentially see the secret.
> > >
> > > This looks very similar to the BPF hook discussed above, I believe my
> > > comments above apply here as well.
> >
> > Using the current task is just logically wrong in this case. The
> > current task here is just simply deleting an SA that happens to have
> > some secret value in it. When deleting an SA, a notification is sent
> > to a group of subscribers (some group of other tasks), which includes
> > a dump of the secret value. The current task isn't doing any attempt
> > to breach lockdown, it's just deleting an SA.
> >
> > It also makes it really awkward to make policy decisions around this.
> > Suppose that domains A, B, and C need to be able to add/delete SAs and
> > domains D, E, and F need to receive notifications about changes in
> > SAs. Then if, say, domain E actually needs to see the secret values in
> > the notifications, you must grant the confidentiality permission to
> > all of A, B, C to keep things working. And now you have opened up the
> > door for A, B, C to do other lockdown-confidentiality stuff, even
> > though these domains themselves actually don't request/need any
> > confidential data from the kernel. That's just not logical and you may
> > actually end up (slightly) worse security-wise than if you just
> > skipped checking for XFRM secrets altogether, because you need to
> > allow confidentiality to domains for which it may be excessive.
>
> It sounds an awful lot like the lockdown hook is in the wrong spot.
> It sounds like it would be a lot better to relocate the hook than
> remove it.

I don't see how you would solve this by moving the hook. Where do you
want to relocate it? The main obstacle is that the 

Re: [PATCH v1 04/12] mm/memory_hotplug: remove nid parameter from arch_remove_memory()

2021-06-08 Thread Michael Ellerman
David Hildenbrand  writes:

> The parameter is unused, let's remove it.
>
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: Michael Ellerman 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Heiko Carstens 
> Cc: Vasily Gorbik 
> Cc: Christian Borntraeger 
> Cc: Yoshinori Sato 
> Cc: Rich Felker 
> Cc: Dave Hansen 
> Cc: Andy Lutomirski 
> Cc: Peter Zijlstra 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: Borislav Petkov 
> Cc: x...@kernel.org
> Cc: "H. Peter Anvin" 
> Cc: Andrew Morton 
> Cc: Anshuman Khandual 
> Cc: Ard Biesheuvel 
> Cc: Mike Rapoport 
> Cc: Nicholas Piggin 
> Cc: Pavel Tatashin 
> Cc: Baoquan He 
> Cc: Laurent Dufour 
> Cc: Sergei Trofimovich 
> Cc: Kefeng Wang 
> Cc: Michel Lespinasse 
> Cc: Christophe Leroy 
> Cc: "Aneesh Kumar K.V" 
> Cc: Thiago Jung Bauermann 
> Cc: Joe Perches 
> Cc: Pierre Morel 
> Cc: Jia He 
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-i...@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-s...@vger.kernel.org
> Cc: linux...@vger.kernel.org
> Signed-off-by: David Hildenbrand 
> ---
>  arch/arm64/mm/mmu.c| 3 +--
>  arch/ia64/mm/init.c| 3 +--
>  arch/powerpc/mm/mem.c  | 3 +--
>  arch/s390/mm/init.c| 3 +--
>  arch/sh/mm/init.c  | 3 +--
>  arch/x86/mm/init_32.c  | 3 +--
>  arch/x86/mm/init_64.c  | 3 +--
>  include/linux/memory_hotplug.h | 3 +--
>  mm/memory_hotplug.c| 4 ++--
>  mm/memremap.c  | 5 +
>  10 files changed, 11 insertions(+), 22 deletions(-)
>
...
> diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
> index 043bbeaf407c..fc5c36189c26 100644
> --- a/arch/powerpc/mm/mem.c
> +++ b/arch/powerpc/mm/mem.c
> @@ -115,8 +115,7 @@ int __ref arch_add_memory(int nid, u64 start, u64 size,
>   return rc;
>  }
>  
> -void __ref arch_remove_memory(int nid, u64 start, u64 size,
> -   struct vmem_altmap *altmap)
> +void __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap 
> *altmap)
>  {
>   unsigned long start_pfn = start >> PAGE_SHIFT;
>   unsigned long nr_pages = size >> PAGE_SHIFT;

Acked-by: Michael Ellerman  (powerpc)

cheers


[PATCH] crash_core, vmcoreinfo: Append 'SECTION_SIZE_BITS' to vmcoreinfo

2021-06-08 Thread Pingfan Liu
As mentioned in kernel commit 1d50e5d0c505 ("crash_core, vmcoreinfo:
Append 'MAX_PHYSMEM_BITS' to vmcoreinfo"), SECTION_SIZE_BITS is used in
the formula:
#define SECTIONS_SHIFT (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)

Besides SECTIONS_SHIFT, SECTION_SIZE_BITS is also used to calculate
PAGES_PER_SECTION in makedumpfile just like kernel.

Unfortunately, this arch-dependent macro SECTION_SIZE_BITS changes, e.g.
recently in kernel commit f0b13ee23241 ("arm64/sparsemem: reduce
SECTION_SIZE_BITS"). But user space wants a stable interface to get this
info. Such info cannot be deduced from a crashdump vmcore.
Hence append SECTION_SIZE_BITS to vmcoreinfo.

Signed-off-by: Pingfan Liu 
Cc: Bhupesh Sharma 
Cc: Kazuhito Hagio 
Cc: Dave Young 
Cc: Baoquan He 
Cc: Boris Petkov 
Cc: Ingo Molnar 
Cc: Thomas Gleixner 
Cc: James Morse 
Cc: Mark Rutland 
Cc: Will Deacon 
Cc: Catalin Marinas 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Cc: Benjamin Herrenschmidt 
Cc: Dave Anderson 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-ker...@vger.kernel.org
Cc: ke...@lists.infradead.org
Cc: x...@kernel.org
Cc: linux-arm-ker...@lists.infradead.org
---
 kernel/crash_core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 825284baaf46..684a6061a13a 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -464,6 +464,7 @@ static int __init crash_save_vmcoreinfo_init(void)
VMCOREINFO_LENGTH(mem_section, NR_SECTION_ROOTS);
VMCOREINFO_STRUCT_SIZE(mem_section);
VMCOREINFO_OFFSET(mem_section, section_mem_map);
+   VMCOREINFO_NUMBER(SECTION_SIZE_BITS);
VMCOREINFO_NUMBER(MAX_PHYSMEM_BITS);
 #endif
VMCOREINFO_STRUCT_SIZE(page);
-- 
2.29.2



Re: [PATCH v7 01/11] mm/mremap: Fix race between MOVE_PMD mremap and pageout

2021-06-08 Thread Kirill A. Shutemov
On Tue, Jun 08, 2021 at 01:22:23PM +0530, Aneesh Kumar K.V wrote:
> 
> Hi Hugh,
> 
> Hugh Dickins  writes:
> 
> > On Mon, 7 Jun 2021, Aneesh Kumar K.V wrote:
> >
> >> CPU 1                          CPU 2                           CPU 3
> >> 
> >> mremap(old_addr, new_addr)     page_shrinker/try_to_unmap_one
> >> 
> >> mmap_write_lock_killable()
> >> 
> >>                                addr = old_addr
> >>                                lock(pte_ptl)
> >> lock(pmd_ptl)
> >> pmd = *old_pmd
> >> pmd_clear(old_pmd)
> >> flush_tlb_range(old_addr)
> >> 
> >> *new_pmd = pmd
> >>                                                                *new_addr = 10; and fills
> >>                                                                TLB with new addr
> >>                                                                and old pfn
> >> 
> >> unlock(pmd_ptl)
> >>                                ptep_clear_flush()
> >>                                old pfn is free.
> >>                                                                Stale TLB entry
> >> 
> >> Fix this race by holding pmd lock in pageout. This still doesn't handle 
> >> the race
> >> between MOVE_PUD and pageout.
> >> 
> >> Fixes: 2c91bd4a4e2e ("mm: speed up mremap by 20x on large regions")
> >> Link: 
> >> https://lore.kernel.org/linux-mm/CAHk-=wgxvr04ebntxqfevontwnp6fdm+oj5vauqxp3s-huw...@mail.gmail.com
> >> Signed-off-by: Aneesh Kumar K.V 
> >
> > This seems very wrong to me, to require another level of locking in the
> > rmap lookup, just to fix some new pagetable games in mremap.
> >
> > But Linus asked "Am I missing something?": neither of you have mentioned
> > mremap's take_rmap_locks(), so I hope that already meets your need.  And
> > if it needs to be called more often than before (see "need_rmap_locks"),
> > that's probably okay.
> >
> > Hugh
> >
> 
> Thanks for reviewing the change. I missed the rmap lock in the code
> path. How about the below change?
> 
> mm/mremap: hold the rmap lock in write mode when moving page table entries.
> 
> To avoid a race between rmap walk and mremap, mremap does take_rmap_locks().
> The lock was taken to ensure that rmap walk don't miss a page table entry
> due to PTE moves via move_pagetables(). The kernel does further optimization
> of this lock such that if we are going to find the newly added vma after the
> old vma, the rmap lock is not taken. This is because rmap walk would find the
> vmas in the same order and if we don't find the page table attached to
> older vma we would find it with the new vma which we would iterate later.
> The actual lifetime of the page is still controlled by the PTE lock.
> 
> This patch updates the locking requirement to handle another race condition
> explained below with optimized mremap::
> 
> Optimized PMD move
> 
> CPU 1                          CPU 2                           CPU 3
> 
> mremap(old_addr, new_addr)     page_shrinker/try_to_unmap_one
> 
> mmap_write_lock_killable()
> 
>                                addr = old_addr
>                                lock(pte_ptl)
> lock(pmd_ptl)
> pmd = *old_pmd
> pmd_clear(old_pmd)
> flush_tlb_range(old_addr)
> 
> *new_pmd = pmd
>                                                                *new_addr = 10; and fills
>                                                                TLB with new addr
>                                                                and old pfn
> 
> unlock(pmd_ptl)
>                                ptep_clear_flush()
>                                old pfn is free.
>                                                                Stale TLB entry
> 
> Optimized PUD move:
> 
> CPU 1                          CPU 2                           CPU 3
> 
> mremap(old_addr, new_addr)     page_shrinker/try_to_unmap_one
> 
> mmap_write_lock_killable()
> 
>                                addr = old_addr
>                                lock(pte_ptl)
> lock(pud_ptl)
> pud = *old_pud
> pud_clear(old_pud)
> flush_tlb_range(old_addr)
> 
> *new_pud = pud
>                                                                *new_addr = 10; and fills
>                                                                TLB with new addr
>                                                                and old pfn
> 
> unlock(pud_ptl)
>                                ptep_clear_flush()
>                                old pfn is free.
> 
