Re: [RFC PATCH 0/6] mm: thp: use generic THP migration for NUMA hinting fault

2021-04-06 Thread Gerald Schaefer
On Thu, 1 Apr 2021 13:10:49 -0700
Yang Shi  wrote:

[...]
> > >
> > > Yes, it could be. The old migration behavior was to return -ENOMEM
> > > if THP migration was not supported, then split the THP. That behavior
> > > was not very friendly to some usecases, for example, memory policy and
> > > the upcoming migration in lieu of reclaim. But I don't mean we should
> > > restore the old behavior. We could split the THP if migration returns
> > > -ENOSYS and the page is a THP.
> >
> > OK, as long as we don't get any broken PMD migration entries established
> > for s390, some extra THP splitting would be acceptable I guess.
> 
> There will be no migration PMD installed. The current behavior is a
> no-op if THP migration is not supported.

Ok, just for completeness, since Mel also replied that the split
was not done on other architectures "because the loss from splitting
exceeded the gain of improved locality":

I did not mean to request extra splitting functionality for s390,
simply skipping / ignoring large PMDs would also be fine for s390,
no need to add extra complexity.


Re: [RFC PATCH 0/6] mm: thp: use generic THP migration for NUMA hinting fault

2021-03-31 Thread Gerald Schaefer
On Tue, 30 Mar 2021 09:51:46 -0700
Yang Shi  wrote:

> On Tue, Mar 30, 2021 at 7:42 AM Gerald Schaefer
>  wrote:
> >
> > On Mon, 29 Mar 2021 11:33:06 -0700
> > Yang Shi  wrote:
> >  
> > >
> > > When the THP NUMA fault support was added, THP migration was not
> > > supported yet, so an ad hoc THP migration was implemented in the NUMA
> > > fault handling.  Since v4.14 THP migration has been supported, so it
> > > doesn't make much sense to keep another THP migration implementation
> > > rather than using the generic migration code.  It is definitely a
> > > maintenance burden to keep two THP migration implementations for
> > > different code paths and it is more error prone.  Using the generic
> > > THP migration implementation allows us to remove the duplicate code
> > > and some hacks needed by the old ad hoc implementation.
> > >
> > > A quick grep shows that x86_64, PowerPC (book3s), ARM64 and S390
> > > support both THP and NUMA balancing.  Most of them support THP
> > > migration, except for S390.  Zi Yan tried to add THP migration support
> > > for S390 before, but it was not accepted due to the design of the S390
> > > PMD.  For the discussion, please see:
> > > https://lkml.org/lkml/2018/4/27/953.
> > >
> > > I'm not an expert on S390, so I am not sure whether it is feasible to
> > > support THP migration for S390 or not.  If it is not feasible, then the
> > > patchset may make THP NUMA balancing non-functional on S390.  Not sure
> > > if this is a show stopper, although the patchset does simplify the code
> > > a lot.  Anyway, it seems worth posting the series to the mailing list
> > > to get some feedback.  
> >
> > The reason why THP migration cannot work on s390 is that the migration
> > code will establish swap ptes in a pmd. The pmd layout is very different
> > from the pte layout on s390, so you cannot simply write a swap pte into
> > a pmd. There are no separate swp primitives for swap/migration pmds,
> > IIRC. And even if there were, we'd still need to find some space for a
> > present bit in the s390 pmd, and/or possibly move around some other bits.
> >
> > A lot of things can go wrong here, even if it could be possible in
> > theory, by introducing separate swp primitives in common code for pmd
> > entries, along with separate offset, type, shift, etc. I don't see that
> > happening in the near future.  
> 
> Thanks a lot for the elaboration. IIUC, implementing a migration PMD
> entry is *not* prevented by hardware, but it may be very tricky to
> implement, right?

Well, it depends. The HW is preventing proper full-blown swap + migration
support for PMD, similar to what we have for PTE, because we simply don't
have enough OS-defined bits in the PMD. A 5-bit swap type for example,
similar to a PTE, plus the PFN would not be possible.

The HW would not prevent a similar mechanism in principle, i.e. we could
mark it as invalid to trigger a fault, and have some magic bits that tell
the fault handler or migration code what it is about.

For handling migration aspects only, w/o any swap device or other support, a
single type bit could already be enough, to indicate read/write migration,
plus a "present" bit similar to PTE. But even those 2 bits would be hard to
find, though I would not entirely rule that out. That would be the tricky
part.

Then of course, common code would need some changes, to reflect the
different swap/migration (type) capabilities of PTE and PMD entries.
Not sure if such an approach would be acceptable for common code.

But this is just some very abstract and optimistic view, I have not
really properly looked into the details. So it might be even more
tricky, or not possible at all.
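
To make the idea a bit more concrete, here is a purely hypothetical sketch of
what such a minimal "migration-only" PMD encoding could look like. The bit
names and positions are invented for illustration and do not reflect the real
s390 PMD layout:

/*
 * Hypothetical sketch only: bit names and positions are made up for
 * illustration and do not match the real s390 PMD layout.
 */
#define _PMD_X_PRESENT		0x001UL	/* normal, valid PMD */
#define _PMD_X_MIGRATION	0x002UL	/* invalid PMD marks a migration entry */
#define _PMD_X_MIG_WRITE	0x004UL	/* single "type" bit: write vs. read migration */

static inline int pmd_x_is_migration_entry(unsigned long pmdval)
{
	return !(pmdval & _PMD_X_PRESENT) && (pmdval & _PMD_X_MIGRATION);
}

static inline int pmd_x_is_write_migration_entry(unsigned long pmdval)
{
	return (pmdval & _PMD_X_MIG_WRITE) != 0;
}

The fault handler would then recognize such an invalid entry and wait for the
migration to finish, similar to what the migration swap entry helpers do on
architectures with full swap PMD support.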

> 
> >
> > Not sure if this is a show stopper, but I am not familiar enough with
> > NUMA and migration code to judge. E.g., I do not see any swp entry action
> > in your patches, but I assume this is implicitly triggered by the switch
> > to generic THP migration code.  
> 
> Yes, exactly. The migrate_pages() called by migrate_misplaced_page()
> takes care of everything.
> 
> >
> > Could there be a work-around by splitting THP pages instead of marking
> > them as migrate pmds (via pte swap entries), at least when THP migration
> > is not supported? I guess it could also be acceptable if THP pages were
> > simply not migrated for NUMA balancing on s390, but then we might need
> > some extra config option to make that behavior explicit.  
> 
> Yes, it could be. The old migration behavior was to return -ENOMEM
> if THP migration was not supported, then split the THP. That behavior
> was not very friendly to some usecases, for example, memory policy and
> the upcoming migration in lieu of reclaim. But I don't mean we should
> restore the old behavior. We could split the THP if migration returns
> -ENOSYS and the page is a THP.

OK, as long as we don't get any broken PMD migration entries established
for s390, some extra THP splitting would be acceptable I guess.
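
For illustration, a rough sketch (not the actual patch) of the fallback
discussed above, i.e. splitting the THP when the generic migration path
reports -ENOSYS. migrate_one_thp() is a made-up stand-in for the real
migration call in migrate_pages():

/* Rough sketch of the -ENOSYS fallback, not the actual patch. */
static int migrate_thp_or_split(struct page *page, struct list_head *pagelist)
{
	int rc = migrate_one_thp(page);		/* hypothetical helper */

	if (rc == -ENOSYS && PageTransHuge(page)) {
		/* No arch THP migration support: fall back to base pages */
		lock_page(page);
		rc = split_huge_page_to_list(page, pagelist);
		unlock_page(page);
		if (!rc)
			rc = -EAGAIN;	/* retry with the split base pages */
	}
	return rc;
}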


Re: [PATCH 5/6] mm: migrate: don't split THP for misplaced NUMA page

2021-03-30 Thread Gerald Schaefer
On Mon, 29 Mar 2021 11:33:11 -0700
Yang Shi  wrote:

> The old behavior didn't split THP if migration failed due to lack of
> memory on the target node.  But the generic THP migration code does split
> THP, so keep the old behavior for misplaced NUMA page migration.
> 
> Signed-off-by: Yang Shi 
> ---
>  mm/migrate.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 86325c750c14..1c0c873375ab 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1444,6 +1444,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>   int swapwrite = current->flags & PF_SWAPWRITE;
>   int rc, nr_subpages;
>   LIST_HEAD(ret_pages);
> + bool nosplit = (reason == MR_NUMA_MISPLACED);
>  
>   if (!swapwrite)
>   current->flags |= PF_SWAPWRITE;
> @@ -1495,7 +1496,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>*/
>   case -ENOSYS:
>   /* THP migration is unsupported */
> - if (is_thp) {
> + if (is_thp && !nosplit) {

This is the "THP migration is unsupported" case, but according to your
description you rather want to change the -ENOMEM case?

Could this be the correct place to trigger THP split for NUMA balancing,
for architectures not supporting THP migration, like s390?

Do I understand it correctly that this change (for -ENOSYS) would
result in always failed THP migrations during NUMA balancing, if THP
migration was not supported?


Re: [RFC PATCH 0/6] mm: thp: use generic THP migration for NUMA hinting fault

2021-03-30 Thread Gerald Schaefer
On Mon, 29 Mar 2021 11:33:06 -0700
Yang Shi  wrote:

> 
> When the THP NUMA fault support was added, THP migration was not supported
> yet, so an ad hoc THP migration was implemented in the NUMA fault handling.
> Since v4.14 THP migration has been supported, so it doesn't make much sense
> to keep another THP migration implementation rather than using the generic
> migration code.  It is definitely a maintenance burden to keep two THP
> migration implementations for different code paths and it is more error
> prone.  Using the generic THP migration implementation allows us to remove
> the duplicate code and some hacks needed by the old ad hoc implementation.
> 
> A quick grep shows that x86_64, PowerPC (book3s), ARM64 and S390 support
> both THP and NUMA balancing.  Most of them support THP migration, except
> for S390.  Zi Yan tried to add THP migration support for S390 before, but
> it was not accepted due to the design of the S390 PMD.  For the discussion,
> please see: https://lkml.org/lkml/2018/4/27/953.
> 
> I'm not an expert on S390, so I am not sure whether it is feasible to
> support THP migration for S390 or not.  If it is not feasible, then the
> patchset may make THP NUMA balancing non-functional on S390.  Not sure if
> this is a show stopper, although the patchset does simplify the code a lot.
> Anyway, it seems worth posting the series to the mailing list to get some
> feedback.

The reason why THP migration cannot work on s390 is that the migration
code will establish swap ptes in a pmd. The pmd layout is very different from
the pte layout on s390, so you cannot simply write a swap pte into a pmd.
There are no separate swp primitives for swap/migration pmds, IIRC. And even
if there were, we'd still need to find some space for a present bit in the
s390 pmd, and/or possibly move around some other bits.

A lot of things can go wrong here, even if it could be possible in theory,
by introducing separate swp primitives in common code for pmd entries, along
with separate offset, type, shift, etc. I don't see that happening in the
near future.

Not sure if this is a show stopper, but I am not familiar enough with
NUMA and migration code to judge. E.g., I do not see any swp entry action
in your patches, but I assume this is implicitly triggered by the switch
to generic THP migration code.

Could there be a work-around by splitting THP pages instead of marking them
as migrate pmds (via pte swap entries), at least when THP migration is not
supported? I guess it could also be acceptable if THP pages were simply not
migrated for NUMA balancing on s390, but then we might need some extra config
option to make that behavior explicit.

See also my comment on patch #5 of this series.

Regards,
Gerald


Re: Freeing page tables through RCU

2021-02-26 Thread Gerald Schaefer
On Thu, 25 Feb 2021 20:58:20 +
Matthew Wilcox  wrote:

> In order to walk the page tables without the mmap semaphore, it must
> be possible to prevent them from being freed and reused (eg if munmap()
> races with viewing /proc/$pid/smaps).
> 
> There is various commentary within the mm on how to prevent this.  One way
> is to disable interrupts, relying on that to block rcu_sched or IPIs.
> I don't think the RT people are terribly happy about reading a proc file
> disabling interrupts, and it doesn't work for architectures that free
> page tables directly instead of batching them into an rcu_sched (because
> the IPI may not be sent to this CPU if the task has never run on it).
> 
> See "Fast GUP" in mm/gup.c
> 
> Ideally, I'd like rcu_read_lock() to delay page table reuse.  This is
> close to trivial for architectures which use entire pages or multiple
> pages for levels of their page tables as we can use the rcu_head embedded
> in struct page to queue the page for RCU.
> 
> s390 and powerpc are the only two architectures I know of that have
> levels of their page table that are smaller than their PAGE_SIZE.
> I'd like to discuss options.  There may be a complicated scheme that
> allows partial pages to be freed via RCU, but I have something simpler
> in mind.  For powerpc in particular, it can have a PAGE_SIZE of 64kB
> and then the MMU wants to see 4kB entries in the PMD.  I suggest that
> instead of allocating each 4kB entry individually, we allocate a 64kB
> page and fill in 16 consecutive PMDs.  This could cost a bit more memory
> (although if you've asked for a CONFIG_PAGE_SIZE of 64kB, you presumably
> don't care too much about it), but it'll make future page faults cheaper
> (as the PMDs will already be present, assuming you have good locality
> of reference).
> 
> I'd like to hear better ideas than this.

Some background on the situation for s390: The architecture defines
an 8 bit pagetable index, so we have 256 entries in a 2 KB pagetable,
but PAGE_SIZE is 4 KB. pte_alloc(_one) will use alloc_page() to allocate
a full 4 KB page, and then do some housekeeping to maintain a per-mm list
of such 4 KB pages, which will contain either one or two 2 KB pagetable
fragments.

This is also the reason why pgtable_t on s390 is not pointing to the
struct page of the (4 KB) page containing a 2 KB pagetable fragment, but
rather to the 2 KB pagetable itself.

I see at least two issues here, with using rcu_head embedded in the
struct page (for a 4 KB page):

1) There might be two 2 KB pagetables present in that 4 KB page,
and the rcu_head would affect both. Not sure if this would really be
a problem, because we already have a similar situation with the split
ptlock embedded in struct page, which also might lock two 2 KB pagetables,
i.e. more than necessary. It still is far less "over-locking" than
using mm->page_table_lock, and the move_pte() code e.g. takes care
to avoid a deadlock if src and dst ptlocks happen to be on the
same page.

So, a similar "over-locking" might also be possible and acceptable
for the rcu_head approach, but I do not really understand if that could
have some deadlock or other unwanted side-effects.

2) The "housekeeping" of our 2 KB pagetable fragments uses page->lru
to maintain the per-mm list. It also (mis)uses page->_refcount to mark
which 2 KB half is used/free, but that should not be an issue I guess.
Using page->lru will be an issue though. IIUC, then page->rcu_head will
overlay with page->lru, so using page->rcu_head for pagetable pages
on s390 would conflict with our page->lru usage for such pagetable pages.

I do not really see how that could be fixed, maybe we could find and
re-use other struct page members for our 2 KB fragment list. Also, for
kvm, there seem to be even more users of page->lru for pagetable pages,
in arch/s390/mm/gmap.c. Not sure though if those would actually also
affect "regular" pagetable walks, or if they are somehow independent.
But if we'd find some new list home for the 2 KB fragments, then that
could probably also be used for the gmap stuff.
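
For reference, a minimal sketch of the rcu_head approach, assuming full-page
page tables (so not the s390 2 KB fragments). On s390 this would clash with
the page->lru usage described above, because rcu_head and lru share the same
union in struct page:

static void pgtable_free_rcu(struct rcu_head *head)
{
	struct page *page = container_of(head, struct page, rcu_head);

	__free_page(page);
}

static void pgtable_free_deferred(struct page *page)
{
	/* walkers under rcu_read_lock() cannot see the page reused before a grace period */
	call_rcu(&page->rcu_head, pgtable_free_rcu);
}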


Re: [RFC PATCH 1/5] hugetlb: add hugetlb helpers for soft dirty support

2021-02-24 Thread Gerald Schaefer
On Wed, 24 Feb 2021 17:46:08 +0100
Gerald Schaefer  wrote:

[...]
> Then we fundamentally changed the way how we deal with that "hugetlb code
> is treating pmds as ptes" issue. Instead of caring about that in all
> huge_pte_xxx primitives, huge_ptep_get() will now return a nicely faked pte
> for s390, i.e. something that looks like a pte would look like, and not the
> real pmd/pud value. With that, hugetlb code can do all its pte handling on
> that fake pte, and the conversion back to a proper pmd/pud is done in
> set_huge_pte().

BTW, in case anybody is wondering, this conversion from and to pmd for s390
will also care about the soft dirty bit, even though it was not really used
before for hugetlb. So Mikes approach to add the default primitives for s390
should work fine.


Re: [RFC PATCH 1/5] hugetlb: add hugetlb helpers for soft dirty support

2021-02-24 Thread Gerald Schaefer
On Wed, 17 Feb 2021 11:24:15 -0500
Peter Xu  wrote:

> On Wed, Feb 10, 2021 at 04:03:18PM -0800, Mike Kravetz wrote:
> > Add interfaces to set and clear soft dirty in hugetlb ptes.  Make
> > hugetlb interfaces needed for /proc clear_refs available outside
> > hugetlb.c.
> > 
> > arch/s390 has its own version of most routines in asm-generic/hugetlb.h,
> > so add new routines there as well.
> > 
> > Signed-off-by: Mike Kravetz 
> > ---
> >  arch/s390/include/asm/hugetlb.h | 30 ++
> >  include/asm-generic/hugetlb.h   | 30 ++
> >  include/linux/hugetlb.h |  1 +
> >  mm/hugetlb.c| 10 +-
> >  4 files changed, 62 insertions(+), 9 deletions(-)
> > 
> > diff --git a/arch/s390/include/asm/hugetlb.h b/arch/s390/include/asm/hugetlb.h
> > index 60f9241e5e4a..b7d26248fb1c 100644
> > --- a/arch/s390/include/asm/hugetlb.h
> > +++ b/arch/s390/include/asm/hugetlb.h
> > @@ -105,6 +105,11 @@ static inline pte_t huge_pte_mkdirty(pte_t pte)
> > return pte_mkdirty(pte);
> >  }
> >  
> > +static inline pte_t huge_pte_mkyoung(pte_t pte)
> > +{
> > +   return pte_mkyoung(pte);
> > +}
> > +
> >  static inline pte_t huge_pte_wrprotect(pte_t pte)
> >  {
> > return pte_wrprotect(pte);
> > @@ -115,9 +120,34 @@ static inline pte_t huge_pte_modify(pte_t pte, pgprot_t newprot)
> > return pte_modify(pte, newprot);
> >  }
> >  
> > +static inline bool huge_pte_soft_dirty(pte_t pte)
> > +{
> > +   return pte_soft_dirty(pte);
> > +}
> > +
> > +static inline pte_t huge_pte_clear_soft_dirty(pte_t pte)
> > +{
> > +   return pte_clear_soft_dirty(pte);
> > +}
> > +
> > +static inline pte_t huge_pte_swp_clear_soft_dirty(pte_t pte)
> > +{
> > +   return pte_swp_clear_soft_dirty(pte);
> > +}
> > +
> 
> Indeed, asm/hugetlb.h of s390 doesn't include asm-generic/hugetlb.h, as is
> normally done by asm/hugetlb.h of other archs.  Do you know why it's special?
> E.g. the s390 version of huge_pte_wrprotect() is actually the same as the
> default version.

That is for "historical reasons", and yes, it doesn't look like it makes a lot
of sense any more.

The history part:

When s390 hugetlb support was introduced in 2008, there was no
asm-generic/hugetlb.h, and also no huge_pte_xxx primitives at all. They were
actually introduced because of s390, since the hugetlb common code did not
make any difference between pte and pmd types, see commit 7f2e9525ba55
("hugetlbfs: common code update for s390").

Back then, only few architectures with hugetlb support existed, and instead
of creating an asm-generic/hugetlb.h, I just added the primitives to the
individual arch include files.

5 years later, more huge_pte_xxx primitives were needed, and it appeared
to make sense to introduce asm-generic/hugetlb.h, see commit 106c992a5ebe
("mm/hugetlb: add more arch-defined huge_pte functions"). However, for s390,
all those primitives still needed special care, so we were / are the only
architecture not including that.

Then we fundamentally changed the way how we deal with that "hugetlb code
is treating pmds as ptes" issue. Instead of caring about that in all
huge_pte_xxx primitives, huge_ptep_get() will now return a nicely faked pte
for s390, i.e. something that looks like a pte would look like, and not the
real pmd/pud value. With that, hugetlb code can do all its pte handling on
that fake pte, and the conversion back to a proper pmd/pud is done in
set_huge_pte().

This is also why it will go very wrong on s390 if you directly look at
or manipulate a huge pte (i.e. pmd or pud) via its pointer, and not use
huge_ptep_get() and set_huge_pte().

Since that change, most of the huge_pte_xxx primitives are now the default
pte_xxx primitives also for s390, but apparently nobody thought about moving
to asm-generic/hugetlb.h.
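
Conceptually, the s390 approach described above looks roughly like this
(__huge_entry_to_pte() and __pte_to_huge_entry() are stand-in names for the
real conversion helpers, not the actual arch code):

static inline pte_t huge_ptep_get(pte_t *ptep)
{
	/* ptep really points to a pmd/pud entry on s390 */
	return __huge_entry_to_pte(*(unsigned long *)ptep);
}

static inline void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
				   pte_t *ptep, pte_t pte)
{
	/* convert the faked pte back to the real pmd/pud format */
	*(unsigned long *)ptep = __pte_to_huge_entry(pte);
}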

> When I looked at huge_pte_wrprotect() I also saw that there seems to be
> no real user of __HAVE_ARCH_HUGE_PTE_WRPROTECT.  Not sure whether it can be
> dropped.  My gut feeling is that s390 should also include
> asm-generic/hugetlb.h, but only redefine a helper if necessary, since I see
> no point in defining the same helper multiple times.

Your gut feeling seems right, I will look into cleaning that up. But don't
let that keep you from adding things there for now.

The __HAVE_ARCH_HUGE_PTE_WRPROTECT is not related to any s390-specifics,
I think it was already unused when it was introduced with commit c4916a008665a
("hugetlb: introduce generic version of huge_pte_wrprotect"). Maybe it was
just added for completeness or future support, because the corresponding
__HAVE_ARCH_HUGE_PTEP_SET_WRPROTECT did / does have some users.


Re: [RFC] linux-next panic in hugepage_subpool_put_pages()

2021-02-23 Thread Gerald Schaefer
On Tue, 23 Feb 2021 15:57:40 +0100
Gerald Schaefer  wrote:

[...]
> What I do not understand is how __free_huge_page() would be called at all
> in the call trace below (set_max_huge_pages -> alloc_pool_huge_page ->
> __free_huge_page -> hugepage_subpool_put_pages). From the code it seems
> that this should not be possible, so I must be missing something.

Ok, looking more closely at the dump and code, I see that __free_huge_page()
was called via alloc_pool_huge_page -> put_page() -> destroy_compound_page()
-> compound_page_dtors[2].

It doesn't feel right that alloc_pool_huge_page() ends up freeing the
newly allocated page again. Maybe some refcounting race, so that put_page()
wrongly assumes it was the last reference?
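
For reference, the destructor dispatch involved here looks roughly like this
(close to, but not necessarily identical with, the mm code of that time);
index 2 in compound_page_dtors[] is the hugetlb destructor:

static inline void destroy_compound_page(struct page *page)
{
	/* page[1].compound_dtor == HUGETLB_PAGE_DTOR (2) for hugetlb pages */
	compound_page_dtors[page[1].compound_dtor](page);
}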

Also from the dump, I could reconstruct the (head) struct page pointer
that __free_huge_page() was called with. Here is the content of the
head and the first tail page, maybe it can help. page->private in the tail
page was zeroed already in __free_huge_page(), but its original value was
the broken *spool pointer 004e, as seen in the register output
of the backtrace.

crash> struct -x page 037203dec000
struct page {
  flags = 0x3001, 
  {
{
  lru = {
next = 0x37203dec008, 
prev = 0x37203dec008
  }, 
  mapping = 0x0, 
  index = 0x0, 
  private = 0x0
}, 
{
  dma_addr = 0x37203dec008
}, 
{
  {
slab_list = {
  next = 0x37203dec008, 
  prev = 0x37203dec008
}, 
{
  next = 0x37203dec008, 
  pages = 0x372, 
  pobjects = 0x3dec008
}
  }, 
  slab_cache = 0x0, 
  freelist = 0x0, 
  {
s_mem = 0x0, 
counters = 0x0, 
{
  inuse = 0x0, 
  objects = 0x0, 
  frozen = 0x0
}
  }
}, 
{
  compound_head = 0x37203dec008, 
  compound_dtor = 0x0, 
  compound_order = 0x0, 
  compound_mapcount = {
counter = 0x3dec008
  }, 
  compound_nr = 0x0
}, 
{
  _compound_pad_1 = 0x37203dec008, 
  hpage_pinned_refcount = {
counter = 0x372
  }, 
  deferred_list = {
next = 0x0, 
prev = 0x0
  }
}, 
{
  _pt_pad_1 = 0x37203dec008, 
  pmd_huge_pte = 0x37203dec008, 
  _pt_pad_2 = 0x0, 
  {
pt_mm = 0x0, 
pt_frag_refcount = {
  counter = 0x0
}
  }, 
  ptl = {
{
  rlock = {
raw_lock = {
  lock = 0x0
}
  }
}
  }
}, 
{
  pgmap = 0x37203dec008, 
  zone_device_data = 0x37203dec008
}, 
callback_head = {
  next = 0x37203dec008, 
  func = 0x37203dec008
}
  }, 
  {
_mapcount = {
  counter = 0x
}, 
page_type = 0x, 
active = 0x, 
units = 0x
  }, 
  _refcount = {
counter = 0x0
  }, 
  memcg_data = 0x0
}


crash> struct -x page 037203dec040
struct page {
  flags = 0x3000, 
  {
{
  lru = {
next = 0x37203dec001, 
prev = 0x2080372
  }, 
  mapping = 0x1000400, 
  index = 0x2, 
  private = 0x0
}, 
{
  dma_addr = 0x37203dec001
}, 
{
  {
slab_list = {
  next = 0x37203dec001, 
  prev = 0x2080372
}, 
{
  next = 0x37203dec001, 
  pages = 0x2080372, 
  pobjects = 0x
}
  }, 
  slab_cache = 0x1000400, 
  freelist = 0x2, 
  {
s_mem = 0x0, 
counters = 0x0, 
{
  inuse = 0x0, 
  objects = 0x0, 
  frozen = 0x0
}
  }
}, 
{
  compound_head = 0x37203dec001, 
  compound_dtor = 0x2, 
  compound_order = 0x8, 
  compound_mapcount = {
counter = 0x
  }, 
  compound_nr = 0x100
}, 
{
  _compound_pad_1 = 0x37203dec001, 
  hpage_pinned_refcount = {
counter = 0x2080372
  }, 
  deferred_list = {
next = 0x1000400, 
prev = 0x2
  }
}, 
{
  _pt_pad_1 = 0x37203dec001, 
  pmd_huge_pte = 0x2080372, 
  _pt_pad_2 = 0x1000400, 
  {
pt_mm = 0x2, 
pt_frag_refcount = {
  counter = 0x0
}
  }, 
  ptl = {
{
  rlock = {
raw_lock = {
  lock = 0x0
}
  }
}
  }
}, 
{
  pgmap = 0x37203dec001, 
  zone_device_data = 0x2080372
}, 
callback_head = {
  next = 0x37203dec001, 
  func = 0x2080372
}
  }, 
  {
_mapcount = {
  counter = 0x
}, 
page_type = 0x, 
active = 0x, 
units = 0x
  }, 
  _refcount = {
counter = 0x0
  }, 
  memcg_data = 0x0
}


[RFC] linux-next panic in hugepage_subpool_put_pages()

2021-02-23 Thread Gerald Schaefer
Hi,

LTP triggered a panic on s390 in hugepage_subpool_put_pages() with
linux-next 5.12.0-20210222, see below.

It crashes on the spin_lock(&spool->lock) at the beginning, because the
passed-in *spool points to 004e, which is not addressable
memory. It rather looks like some flags and not a proper address. I suspect
some relation to the recent rework in that area, e.g. commit f1280272ae4d
("hugetlb: use page.private for hugetlb specific page flags").

__free_huge_page() calls hugepage_subpool_put_pages() and takes *spool from
hugetlb_page_subpool(page), which was changed by that commit to use
page[1]->private now.
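
Roughly (close to, but not necessarily identical with, the linux-next code at
that time), the accessors look like this, so whatever sits in page[1].private
directly becomes the *spool pointer:

static inline struct hugepage_subpool *hugetlb_page_subpool(struct page *hpage)
{
	/* the subpool pointer lives in the private field of the first tail page */
	return (struct hugepage_subpool *)page_private(hpage + 1);
}

static inline void hugetlb_set_page_subpool(struct page *hpage,
					    struct hugepage_subpool *subpool)
{
	set_page_private(hpage + 1, (unsigned long)subpool);
}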

What I do not understand is how __free_huge_page() would be called at all
in the call trace below (set_max_huge_pages -> alloc_pool_huge_page ->
__free_huge_page -> hugepage_subpool_put_pages). From the code it seems
that this should not be possible, so I must be missing something.

BTW, the workload of the failing LTP test futex_wake04 is not really
related, the crash happens already during environment setup when it
writes to /proc/sys/vm/nr_hugepages, before starting its actual workload.
It is very hard to reproduce though, so I could not do a proper bisect
so far. Running futex_wake04 alone also never triggered it, I only hit
it very rarely when it was run with ./runltp -f syscalls. Looks like
some kind of race.

[ 4013.255629] LTP: starting futex_wake04
[ 4013.299623] futex_wake04 (865156): drop_caches: 3
[ 4013.300137] Unable to handle kernel pointer dereference in virtual kernel 
address space
[ 4013.300140] Failing address: 004e TEID: 004e0403
[ 4013.300143] Fault in home space mode while using kernel ASCE.
[ 4013.300148] AS:00014e264007 R3:0024 
[ 4013.300172] Oops: 003b ilc:2 [#1] SMP 
[ 4013.300177] Modules linked in: quota_v2 quota_tree tun overlay ntfs exfat 
vfat fat loop sctp udp_tunnel ip6_udp_tunnel vfio_pci irqbypass vfio_virqfd 
dm_service_time scsi_debug vfio_ap kvm vhost_vsock 
vmw_vsock_virtio_transport_common vsock vhost vhost_iotlb xt_CHECKSUM 
xt_MASQUERADE xt_conntrack ipt_REJECT xt_tcpudp nft_compat nf_nat_tftp 
nft_objref nf_conntrack_tftp nft_counter bridge stp llc nft_fib_inet 
nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 
nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 
nf_defrag_ipv4 ip_set nf_tables nfnetlink sunrpc dm_multipath dm_mod 
scsi_dh_rdac scsi_dh_emc scsi_dh_alua vfio_ccw vfio_mdev mdev vfio_iommu_type1 
vfio eadm_sch sch_fq_codel configfs ip_tables x_tables ghash_s390 prng aes_s390 
des_s390 libdes sha512_s390 sha256_s390 sha1_s390 sha_common pkey zcrypt 
rng_core autofs4 [last unloaded: init_module]
[ 4013.300273] CPU: 4 PID: 865156 Comm: futex_wake04 Tainted: G   OE
 5.12.0-20210222.rc0.git0.abaf6f60176f.300.fc33.s390x+next #1
[ 4013.300278] Hardware name: IBM 2827 H43 738 (LPAR)
[ 4013.300280] Krnl PSW : 0704c0018000 00014d00c36c 
(hugepage_subpool_put_pages.part.0+0x2c/0x138)
[ 4013.300302]R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 
RI:0 EA:3
[ 4013.300307] Krnl GPRS:   004e 
0005
[ 4013.300311]1000 037203deffc0 0380092a7b80 
0001
[ 4013.300314]3001 00014e2db208 0001 
004e
[ 4013.300317]81b81f00 00014d88fde0 0380092a7a30 
0380092a79e8
[ 4013.300327] Krnl Code: 00014d00c35e: e3e0f0980024stg 
%r14,152(%r15)
  00014d00c364: a718lhi %r1,0
 #00014d00c368: 583003acl   %r3,940
 >00014d00c36c: ba132000cs  
%r1,%r3,0(%r2)
  00014d00c370: a7740076brc 
7,00014d00c45c
  00014d00c374: e310b014lg  
%r1,16(%r11)
  00014d00c37a: a71fcghi%r1,-1
  00014d00c37e: a784000abrc 
8,00014d00c392
[ 4013.300351] Call Trace:
[ 4013.300353]  [<00014d00c36c>] 
hugepage_subpool_put_pages.part.0+0x2c/0x138 
[ 4013.300358]  [<00014d00c546>] __free_huge_page+0xce/0x310 
[ 4013.300361]  [<00014d00a45a>] alloc_pool_huge_page+0x102/0x120 
[ 4013.300365]  [<00014d00b2ae>] set_max_huge_pages+0x13e/0x350 
[ 4013.300368]  [<00014d00b728>] hugetlb_sysctl_handler_common+0xd8/0x110 
[ 4013.300372]  [<00014d00dc20>] hugetlb_sysctl_handler+0x48/0x58 
[ 4013.300376]  [<00014d12bd08>] proc_sys_call_handler+0x138/0x238 
[ 4013.300381]  [<00014d058ade>] new_sync_write+0x10e/0x198 
[ 4013.300386]  [<00014d059644>] vfs_write.part.0+0x12c/0x238 
[ 4013.300390]  [<00014d05b8b8>] ksys_write+0x68/0xf8 
[ 4013.300394]  [<00014cd71eda>] do_syscall+0x82/0xd0 
[ 4013.300399]  [<00014d8671bc>] 

Re: [PATCH V2 0/2] mm/debug_vm_pgtable: Some minor updates

2020-12-09 Thread Gerald Schaefer
On Wed, 9 Dec 2020 08:11:13 +0530
Anshuman Khandual  wrote:

> 
> 
> On 12/1/20 5:49 PM, Anshuman Khandual wrote:
> > This series contains some cleanups and new test suggestions from Catalin
> > from an earlier discussion.
> > 
> > https://lore.kernel.org/linux-mm/20201123142237.GF17833@gaia/
> > 
> > This series is based on v5.10-rc6 and has been tested on arm64 and x86 but
> > has only been build tested on riscv, s390, arc etc. It would be great if
> > folks could test this on these platforms as well. Thank you.
> > 
> > Cc: Christophe Leroy 
> > Cc: Gerald Schaefer 
> > Cc: Vineet Gupta 
> > Cc: Catalin Marinas 
> > Cc: Paul Walmsley 
> > Cc: Andrew Morton 
> 
> Hello Gerald/Christophe/Vineet/Paul,
> 
> Could you please give this series a quick test on s390, ppc, arc,
> and riscv platforms. Thank you.
> 

Hi Anshuman, works fine for s390.

Tested-by: Gerald Schaefer  [s390]


[RFC PATCH 1/1] mm/hugetlb: clear compound_nr before freeing gigantic pages

2020-12-08 Thread Gerald Schaefer
Commit 1378a5ee451a ("mm: store compound_nr as well as compound_order")
added compound_nr counter to first tail struct page, overlaying with
page->mapping. The overlay itself is fine, but while freeing gigantic
hugepages via free_contig_range(), a "bad page" check will trigger for
non-NULL page->mapping on the first tail page:

[  276.681603] BUG: Bad page state in process bash  pfn:380001
[  276.681614] page:c35f0856 refcount:0 mapcount:0 
mapping:126b68aa index:0x0 pfn:0x380001
[  276.681620] aops:0x0
[  276.681622] flags: 0x3000()
[  276.681626] raw: 3000 0100 0122 
0001
[  276.681628] raw:    

[  276.681630] page dumped because: non-NULL mapping
[  276.681632] Modules linked in:
[  276.681637] CPU: 6 PID: 616 Comm: bash Not tainted 5.10.0-rc7-next-20201208 
#1
[  276.681639] Hardware name: IBM 3906 M03 703 (LPAR)
[  276.681641] Call Trace:
[  276.681648]  [<000458c252b6>] show_stack+0x6e/0xe8
[  276.681652]  [<00045971cf60>] dump_stack+0x90/0xc8
[  276.681656]  [<000458e8b186>] bad_page+0xd6/0x130
[  276.681658]  [<000458e8cdea>] free_pcppages_bulk+0x26a/0x800
[  276.681661]  [<000458e8e67e>] free_unref_page+0x6e/0x90
[  276.681663]  [<000458e8ea6c>] free_contig_range+0x94/0xe8
[  276.681666]  [<000458ea5e54>] update_and_free_page+0x1c4/0x2c8
[  276.681669]  [<000458ea784e>] free_pool_huge_page+0x11e/0x138
[  276.681671]  [<000458ea8530>] set_max_huge_pages+0x228/0x300
[  276.681673]  [<000458ea86c0>] nr_hugepages_store_common+0xb8/0x130
[  276.681678]  [<000458fd5b6a>] kernfs_fop_write+0xd2/0x218
[  276.681681]  [<000458ef9da0>] vfs_write+0xb0/0x2b8
[  276.681684]  [<000458efa15c>] ksys_write+0xac/0xe0
[  276.681687]  [<00045972c5ca>] system_call+0xe6/0x288
[  276.681730] Disabling lock debugging due to kernel taint

This is because only the compound_order is cleared in
destroy_compound_gigantic_page(), and compound_nr is set to 1U << order == 1
for order 0 in set_compound_order(page, 0).
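
For reference, this is roughly what set_compound_order() does since that
commit, so order 0 leaves compound_nr == 1 behind in the first tail page:

static inline void set_compound_order(struct page *page, unsigned int order)
{
	page[1].compound_order = order;
	page[1].compound_nr = 1U << order;	/* == 1 for order 0, not cleared */
}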

Fix this by explicitly clearing compound_nr for the first tail page after
calling set_compound_order(page, 0).

Cc:  # 5.9+
Fixes: 1378a5ee451a ("mm: store compound_nr as well as compound_order")
Signed-off-by: Gerald Schaefer 
---
 mm/hugetlb.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1f3bf1710b66..3bcc0bc7e02a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1216,6 +1216,7 @@ static void destroy_compound_gigantic_page(struct page *page,
}
 
set_compound_order(page, 0);
+   page[1].compound_nr = 0;
__ClearPageHead(page);
 }
 
-- 
2.17.1



[RFC PATCH 0/1] "Bad page state" while freeing gigantic pages

2020-12-08 Thread Gerald Schaefer
The following "Bad page state" occurs on s390 when freeing gigantic pages:

[  276.681603] BUG: Bad page state in process bash  pfn:380001
[  276.681614] page:c35f0856 refcount:0 mapcount:0 
mapping:126b68aa index:0x0 pfn:0x380001
[  276.681620] aops:0x0
[  276.681622] flags: 0x3000()
[  276.681626] raw: 3000 0100 0122 
0001
[  276.681628] raw:    

[  276.681630] page dumped because: non-NULL mapping
[  276.681632] Modules linked in:
[  276.681637] CPU: 6 PID: 616 Comm: bash Not tainted 5.10.0-rc7-next-20201208 
#1
[  276.681639] Hardware name: IBM 3906 M03 703 (LPAR)
[  276.681641] Call Trace:
[  276.681648]  [<000458c252b6>] show_stack+0x6e/0xe8
[  276.681652]  [<00045971cf60>] dump_stack+0x90/0xc8
[  276.681656]  [<000458e8b186>] bad_page+0xd6/0x130
[  276.681658]  [<000458e8cdea>] free_pcppages_bulk+0x26a/0x800
[  276.681661]  [<000458e8e67e>] free_unref_page+0x6e/0x90
[  276.681663]  [<000458e8ea6c>] free_contig_range+0x94/0xe8
[  276.681666]  [<000458ea5e54>] update_and_free_page+0x1c4/0x2c8
[  276.681669]  [<000458ea784e>] free_pool_huge_page+0x11e/0x138
[  276.681671]  [<000458ea8530>] set_max_huge_pages+0x228/0x300
[  276.681673]  [<000458ea86c0>] nr_hugepages_store_common+0xb8/0x130
[  276.681678]  [<000458fd5b6a>] kernfs_fop_write+0xd2/0x218
[  276.681681]  [<000458ef9da0>] vfs_write+0xb0/0x2b8
[  276.681684]  [<000458efa15c>] ksys_write+0xac/0xe0
[  276.681687]  [<00045972c5ca>] system_call+0xe6/0x288
[  276.681730] Disabling lock debugging due to kernel taint

I bisected it to commit 1378a5ee451a ("mm: store compound_nr as well as
compound_order"), and it seems that the new compound_nr overlaying
page->mapping is not properly cleared, which then triggers the non-NULL
mapping warning.

This is because only the compound_order is cleared in
destroy_compound_gigantic_page(), and compound_nr is set to 1U << order == 1
for order 0 in set_compound_order(page, 0).

For some reason, I cannot reproduce this on x86, but I do not see how
this could be an arch-specific issue. Still, I might be missing something,
and my proposed patch also looks a bit ugly (at least to me), hence this
RFC. Any comments?

BTW, for "normal sized" hugepages, this is not an issue, as page->mapping
seems to be cleared explicitly in this case, in free_tail_pages_check(),
but the freeing path for normal hugepages is quite different from that for
gigantic pages using free_contig_range(). So a "page[1].mapping = NULL"
might also be an option, instead of the "page[1].compound_nr = 0" in my
patch, but that looks even more ugly, since it would clear more than
needed.

Gerald Schaefer (1):
  mm/hugetlb: clear compound_nr before freeing gigantic pages

 mm/hugetlb.c | 1 +
 1 file changed, 1 insertion(+)

-- 
2.17.1



[PATCH] mm/userfaultfd: do not access vma->vm_mm after calling handle_userfault()

2020-11-10 Thread Gerald Schaefer
0 kernel/task_work.c:151
 tracehook_notify_resume include/linux/tracehook.h:188 [inline]
 do_notify_resume+0x100/0x148 arch/s390/kernel/signal.c:538
 system_call+0xe6/0x28c arch/s390/kernel/entry.S:416

The buggy address belongs to the object at 962d6948
 which belongs to the cache vm_area_struct of size 200
The buggy address is located 64 bytes inside of
 200-byte region [962d6948, 962d6a10)
The buggy address belongs to the page:
page:313a09fe refcount:1 mapcount:0 mapping: index:0x0 
pfn:0x962d6
flags: 0x3200(slab)
raw: 3200 4257e080 000c000c 8020ba00
raw:  000f001e 0001 96959501
page dumped because: kasan: bad access detected
page->mem_cgroup:96959501

Memory state around the buggy address:
 962d6880: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 962d6900: 00 fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb
>962d6980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ^
 962d6a00: fb fb fc fc fc fc fc fc fc fc 00 00 00 00 00 00
 962d6a80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
==

Fixes: 6b251fc96cf2c ("userfaultfd: call handle_userfault() for 
userfaultfd_missing() faults")
Cc:  # v4.3+
Reported-by: Alexander Egorenkov 
Signed-off-by: Gerald Schaefer 
---
 mm/huge_memory.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bdb070732352..b4630ccfd3c8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -711,7 +711,6 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
transparent_hugepage_use_zero_page()) {
pgtable_t pgtable;
struct page *zero_page;
-   bool set;
vm_fault_t ret;
pgtable = pte_alloc_one(vma->vm_mm);
if (unlikely(!pgtable))
@@ -724,13 +723,14 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
}
vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
ret = 0;
-   set = false;
if (pmd_none(*vmf->pmd)) {
ret = check_stable_address_space(vma->vm_mm);
if (ret) {
spin_unlock(vmf->ptl);
+   pte_free(vma->vm_mm, pgtable);
} else if (userfaultfd_missing(vma)) {
spin_unlock(vmf->ptl);
+   pte_free(vma->vm_mm, pgtable);
ret = handle_userfault(vmf, VM_UFFD_MISSING);
VM_BUG_ON(ret & VM_FAULT_FALLBACK);
} else {
@@ -738,12 +738,11 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
   haddr, vmf->pmd, zero_page);
update_mmu_cache_pmd(vma, vmf->address, 
vmf->pmd);
spin_unlock(vmf->ptl);
-   set = true;
}
-   } else
+   } else {
spin_unlock(vmf->ptl);
-   if (!set)
pte_free(vma->vm_mm, pgtable);
+   }
return ret;
}
gfp = vma_thp_gfp_mask(vma);
-- 
2.17.1



Re: [PATCH 08/13] s390/pci: Remove races against pte updates

2020-10-08 Thread Gerald Schaefer
On Wed,  7 Oct 2020 18:44:21 +0200
Daniel Vetter  wrote:

> Way back it was a reasonable assumptions that iomem mappings never
> change the pfn range they point at. But this has changed:
> 
> - gpu drivers dynamically manage their memory nowadays, invalidating
> ptes with unmap_mapping_range when buffers get moved
> 
> - contiguous dma allocations have moved from dedicated carveouts to
> cma regions. This means if we miss the unmap the pfn might contain
> pagecache or anon memory (well anything allocated with GFP_MOVEABLE)
> 
> - even /dev/mem now invalidates mappings when the kernel requests that
> iomem region when CONFIG_IO_STRICT_DEVMEM is set, see 3234ac664a87
> ("/dev/mem: Revoke mappings when a driver claims the region")
> 
> Accessing pfns obtained from ptes without holding all the locks is
> therefore no longer a good idea. Fix this.
> 
> Since zpci_memcpy_from|toio seems to not do anything nefarious with
> locks we just need to open code get_pfn and follow_pfn and make sure
> we drop the locks only after we're done. The write function also needs
> the copy_from_user move, since we can't take userspace faults while
> holding the mmap sem.
> 
> Signed-off-by: Daniel Vetter 
> Cc: Jason Gunthorpe 
> Cc: Dan Williams 
> Cc: Kees Cook 
> Cc: Andrew Morton 
> Cc: John Hubbard 
> Cc: Jérôme Glisse 
> Cc: Jan Kara 
> Cc: Dan Williams 
> Cc: linux...@kvack.org
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-samsung-...@vger.kernel.org
> Cc: linux-me...@vger.kernel.org
> Cc: Niklas Schnelle 
> Cc: Gerald Schaefer 
> Cc: linux-s...@vger.kernel.org
> ---
>  arch/s390/pci/pci_mmio.c | 98 +++-
>  1 file changed, 57 insertions(+), 41 deletions(-)

Looks good, thanks. Also survived some basic function test. Only some
minor nitpick, see below.

Reviewed-by: Gerald Schaefer 

> 
> diff --git a/arch/s390/pci/pci_mmio.c b/arch/s390/pci/pci_mmio.c
> index 401cf670a243..4d194cb09372 100644
> --- a/arch/s390/pci/pci_mmio.c
> +++ b/arch/s390/pci/pci_mmio.c
> @@ -119,33 +119,15 @@ static inline int __memcpy_toio_inuser(void __iomem *dst,
>   return rc;
>  }
>  
> -static long get_pfn(unsigned long user_addr, unsigned long access,
> - unsigned long *pfn)
> -{
> - struct vm_area_struct *vma;
> - long ret;
> -
> - mmap_read_lock(current->mm);
> - ret = -EINVAL;
> - vma = find_vma(current->mm, user_addr);
> - if (!vma)
> - goto out;
> - ret = -EACCES;
> - if (!(vma->vm_flags & access))
> - goto out;
> - ret = follow_pfn(vma, user_addr, pfn);
> -out:
> - mmap_read_unlock(current->mm);
> - return ret;
> -}
> -
>  SYSCALL_DEFINE3(s390_pci_mmio_write, unsigned long, mmio_addr,
>   const void __user *, user_buffer, size_t, length)
>  {
>   u8 local_buf[64];
>   void __iomem *io_addr;
>   void *buf;
> - unsigned long pfn;
> + struct vm_area_struct *vma;
> + pte_t *ptep;
> + spinlock_t *ptl;
>   long ret;
>  
>   if (!zpci_is_enabled())
> @@ -158,7 +140,7 @@ SYSCALL_DEFINE3(s390_pci_mmio_write, unsigned long, mmio_addr,
>* We only support write access to MIO capable devices if we are on
>* a MIO enabled system. Otherwise we would have to check for every
>* address if it is a special ZPCI_ADDR and would have to do
> -  * a get_pfn() which we don't need for MIO capable devices.  Currently
> +  * a pfn lookup which we don't need for MIO capable devices.  Currently
>* ISM devices are the only devices without MIO support and there is no
>* known need for accessing these from userspace.
>*/
> @@ -176,21 +158,37 @@ SYSCALL_DEFINE3(s390_pci_mmio_write, unsigned long, mmio_addr,
>   } else
>   buf = local_buf;
>  
> - ret = get_pfn(mmio_addr, VM_WRITE, &pfn);
> + ret = -EFAULT;
> + if (copy_from_user(buf, user_buffer, length))
> + goto out_free;
> +
> + mmap_read_lock(current->mm);
> + ret = -EINVAL;
> + vma = find_vma(current->mm, mmio_addr);
> + if (!vma)
> + goto out_unlock_mmap;
> + ret = -EACCES;
> + if (!(vma->vm_flags & VM_WRITE))
> + goto out_unlock_mmap;
> + if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
> + goto out_unlock_mmap;

That check for VM_IO | VM_PFNMAP was previously hidden inside follow_pfn(),
and that would have returned -EINVAL in this case. With your change, we
now return -EACCES. Not sure how important that is, but it feels wrong.
Maybe move the VM_IO | VM_PFNMAP check up, before t

Re: BUG: Bad page state in process dirtyc0w_child

2020-09-24 Thread Gerald Schaefer
On Thu, 24 Sep 2020 00:02:26 +0200
Gerald Schaefer  wrote:

> On Wed, 23 Sep 2020 14:50:36 -0700
> Linus Torvalds  wrote:
> 
> > On Wed, Sep 23, 2020 at 2:33 PM Gerald Schaefer
> >  wrote:
> > >
> > > Thanks, very nice walk-through, need some time to digest this. The TLB
> > > aspect is interesting, and we do have our own __tlb_remove_page_size(),
> > > which directly calls free_page_and_swap_cache() instead of the generic
> > > batched approach.
> > 
> > So I don't think it's the free_page_and_swap_cache() itself that is the 
> > problem.
> > 
> > As mentioned, the actual pages themselves should be handled by the
> > reference counting being atomic.
> > 
> > The interrupt disable is really about just the page *tables* being
> > free'd - not the final page level.
> > 
> > So the issue is that at least on x86-64, we have the serialization
> > that we will only free the page tables after a cross-CPU IPI has
> > flushed the TLB.
> > 
> > I think s390 just RCU-free's the page tables instead, which should fix it.
> > 
> > So I think this is special, and s390 is very different from x86, but I
> > don't think it's the problem.

Ah of course, I got confused by freeing pagetable pages vs. the pages
themselves. For the pagetable pages we actually use the generic
tlb_remove_table_(sync_)one, including the IPI-synchronizing
smp_call_function (CONFIG_MMU_GATHER_RCU_TABLE_FREE=y).

The "s390 magic" then only starts in our own __tlb_remove_table,
where we take care of the special 2K vs. 4K pagetable stuff.

Thanks a lot for this very valuable abstract of "who is who and why"
in pagetable memory management :-)

> > 
> > In fact, I think you pinpointed the real issue:
> > 
> > > Meanwhile, out of curiosity, while I still fail to comprehend commit
> > > 09854ba94c6a ("mm: do_wp_page() simplification") in its entirety, there
> > > is one detail that I find most confusing: the unlock_page() has moved
> > > behind the wp_page_reuse(), while it was the other way round before.
> > 
> > You know what? That was just a mistake, and I think you may actually
> > have hit the real cause of the problem.
> > 
> > It means that we keep the page locked until after we do the
> > pte_unmap_unlock(), so now we have no guarantees that we hold the page
> > referecne.
> > 
> > And then we unlock it - while somebody else might be freeing it.
> > 
> > So somebody is freeing a locked page just as we're unlocking it, and
> > that matches the problem you see exactly: the debug thing will hit
> > because the last free happened while locked, and then by the time the
> > printout happens it has become unlocked so it doesn't show any more.
> > 
> > Duh.
> > 
> > Would you mind testing just moving the unlock_page() back to before
> > the wp_page_reuse()?
> 
> Sure, I'll give it a try running over the night again.

It's all good now, no more occurrences with unlock_page() before
wp_page_reuse().


Re: BUG: Bad page state in process dirtyc0w_child

2020-09-23 Thread Gerald Schaefer
On Wed, 23 Sep 2020 14:50:36 -0700
Linus Torvalds  wrote:

> On Wed, Sep 23, 2020 at 2:33 PM Gerald Schaefer
>  wrote:
> >
> > Thanks, very nice walk-through, need some time to digest this. The TLB
> > aspect is interesting, and we do have our own __tlb_remove_page_size(),
> > which directly calls free_page_and_swap_cache() instead of the generic
> > batched approach.
> 
> So I don't think it's the free_page_and_swap_cache() itself that is the 
> problem.
> 
> As mentioned, the actual pages themselves should be handled by the
> reference counting being atomic.
> 
> The interrupt disable is really about just the page *tables* being
> free'd - not the final page level.
> 
> So the issue is that at least on x86-64, we have the serialization
> that we will only free the page tables after a cross-CPU IPI has
> flushed the TLB.
> 
> I think s390 just RCU-free's the page tables instead, which should fix it.
> 
> So I think this is special, and s390 is very different from x86, but I
> don't think it's the problem.
> 
> In fact, I think you pinpointed the real issue:
> 
> > Meanwhile, out of curiosity, while I still fail to comprehend commit
> > 09854ba94c6a ("mm: do_wp_page() simplification") in its entirety, there
> > is one detail that I find most confusing: the unlock_page() has moved
> > behind the wp_page_reuse(), while it was the other way round before.
> 
> You know what? That was just a mistake, and I think you may actually
> have hit the real cause of the problem.
> 
> It means that we keep the page locked until after we do the
> pte_unmap_unlock(), so now we have no guarantees that we hold the page
> referecne.
> 
> And then we unlock it - while somebody else might be freeing it.
> 
> So somebody is freeing a locked page just as we're unlocking it, and
> that matches the problem you see exactly: the debug thing will hit
> because the last free happened while locked, and then by the time the
> printout happens it has become unlocked so it doesn't show any more.
> 
> Duh.
> 
> Would you mind testing just moving the unlock_page() back to before
> the wp_page_reuse()?

Sure, I'll give it a try running over the night again.


Re: BUG: Bad page state in process dirtyc0w_child

2020-09-23 Thread Gerald Schaefer
On Wed, 23 Sep 2020 13:00:45 -0700
Linus Torvalds  wrote:

[...]
> 
> Ooh. One thing that is *very* different about s390 is that it frees
> the page directly, and doesn't batch things up to happen after the TLB
> flush.
> 
> Maybe THAT is the difference? Not that I can tell why it should
> matter, for all the reasons outlines above. But on x86-64, the
> __tlb_remove_page() function just adds the page to the "free this
> later" TLB flush structure, and if it fills up it does the TLB flush
> and then does the actual batched page freeing outside the page table
> lock.
> 
> And that *has* been one of the things that the fast-gup code depended
> on. We even have a big comment about it:
> 
> /*
>  * Disable interrupts. The nested form is used, in order to allow
>  * full, general purpose use of this routine.
>  *
>  * With interrupts disabled, we block page table pages from being
>  * freed from under us. See struct mmu_table_batch comments in
>  * include/asm-generic/tlb.h for more details.
>  *
>  * We do not adopt an rcu_read_lock(.) here as we also want to
>  * block IPIs that come from THPs splitting.
>  */
> 
> and maybe that whole thing doesn't hold true for s390 at all.

Thanks, very nice walk-through, need some time to digest this. The TLB
aspect is interesting, and we do have our own __tlb_remove_page_size(),
which directly calls free_page_and_swap_cache() instead of the generic
batched approach.
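
For reference, the s390 variant is essentially this (close to the
arch/s390/include/asm/tlb.h of that time), i.e. the page is freed immediately
instead of being queued for batched freeing after the TLB flush:

static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
					  struct page *page, int page_size)
{
	free_page_and_swap_cache(page);
	return false;	/* never asks the caller to flush and free a batch */
}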

I faintly remember that we also did have some batched and rcu_sched based
approach. It seems there was some rework with commit 9de7d833e370
("s390/tlb: Convert to generic mmu_gather") and discussion in
https://lore.kernel.org/linux-arch/20180918125151.31744-1-schwidef...@de.ibm.com/

Given the comment you mentioned and also this one from mm/gup.c:

/*
 * Fast GUP
 *
 * get_user_pages_fast attempts to pin user pages by walking the page
 * tables directly and avoids taking locks. Thus the walker needs to be
 * protected from page table pages being freed from under it, and should
 * block any THP splits.
 *
 * One way to achieve this is to have the walker disable interrupts, and
 * rely on IPIs from the TLB flushing code blocking before the page table
 * pages are freed. This is unsuitable for architectures that do not need
 * to broadcast an IPI when invalidating TLBs.
 *
 * Another way to achieve this is to batch up page table containing pages
 * belonging to more than one mm_user, then rcu_sched a callback to free those
 * pages. Disabling interrupts will allow the fast_gup walker to both block
 * the rcu_sched callback, and an IPI that we broadcast for splitting THPs
 * (which is a relatively rare event). The code below adopts this strategy.

It really sounds like we might have yet another issue with (common code)
gup_fast, and non-batched freeing of pagetable pages, though I wonder how
that would have worked with the s390-specific version of gup_fast before.

Certainly worth investigating, thanks a lot for the hint!

Meanwhile, out of curiosity, while I still fail to comprehend commit
09854ba94c6a ("mm: do_wp_page() simplification") in its entirety, there
is one detail that I find most confusing: the unlock_page() has moved
behind the wp_page_reuse(), while it was the other way round before.

Does it simply not really matter, or was it done on purpose, possibly
related to the "get rid of the nasty serialization on the page lock"
description?


Re: BUG: Bad page state in process dirtyc0w_child

2020-09-23 Thread Gerald Schaefer
On Tue, 22 Sep 2020 19:03:50 +0200
Gerald Schaefer  wrote:

> On Wed, 16 Sep 2020 16:28:06 +0200
> Heiko Carstens  wrote:
> 
> > On Sat, Sep 12, 2020 at 09:54:12PM -0400, Qian Cai wrote:
> > > Occasionally, running this LTP test will trigger an error below on
> > > s390:
> > > https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/security/dirtyc0w/dirtyc0w.c
> > > https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/security/dirtyc0w/dirtyc0w_child.c
> > > 
> > > this .config:
> > > https://gitlab.com/cailca/linux-mm/-/blob/master/s390.config
> > > 
> > > [ 6970.253173] LTP: starting dirtyc0w
> > > [ 6971.599102] BUG: Bad page state in process dirtyc0w_child  pfn:8865d
> > > [ 6971.599867] page:1a8328d7 refcount:0 mapcount:0 
> > > mapping: index:0x0 pfn:0x8865d
> > > [ 6971.599876] flags: 
> > > 0x4008000e(referenced|uptodate|dirty|swapbacked)
> > > [ 6971.599886] raw: 4008000e 0100 0122 
> > > 
> > > [ 6971.599893] raw:    
> > > 
> > > [ 6971.599900] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
> > > [ 6971.599906] Modules linked in: loop kvm ip_tables x_tables dm_mirror 
> > > dm_region_hash dm_log dm_mod [last unloaded: dummy_del_mod]
> > > [ 6971.599952] CPU: 1 PID: 65238 Comm: dirtyc0w_child Tainted: G  
> > >  O  5.9.0-rc4-next-20200909 #1
> > > [ 6971.599959] Hardware name: IBM 2964 N96 400 (z/VM 6.4.0)
> > > [ 6971.599964] Call Trace:
> > > [ 6971.599979]  [<73aec038>] show_stack+0x158/0x1f0 
> > > [ 6971.599986]  [<73af724a>] dump_stack+0x1f2/0x238 
> > > [ 6971.54]  [<72ed086a>] bad_page+0x1ba/0x1c0 
> > > [ 6971.60]  [<72ed20c4>] free_pcp_prepare+0x4fc/0x658 
> > > [ 6971.66]  [<72ed96a6>] free_unref_page+0xae/0x158 
> > > [ 6971.600013]  [<72e8286a>] unmap_page_range+0xb62/0x1df8 
> > > [ 6971.600019]  [<72e83bbc>] unmap_single_vma+0xbc/0x1c8 
> > > [ 6971.600025]  [<72e8418e>] zap_page_range+0x176/0x230 
> > > [ 6971.600033]  [<72eece8e>] do_madvise+0xfde/0x1270 
> > > [ 6971.600039]  [<72eed50a>] __s390x_sys_madvise+0x72/0x98 
> > > [ 6971.600047]  [<73b1cce4>] system_call+0xdc/0x278 
> > > [ 6971.600053] 2 locks held by dirtyc0w_child/65238:
> > > [ 6971.600058]  #0: 00013442fa18 (>mmap_lock){}-{3:3}, at: 
> > > do_madvise+0x17a/0x1270
> > > [ 6971.600432]  #1: 0001343f9060 (ptlock_ptr(page)#2){+.+.}-{2:2}, 
> > > at: unmap_page_range+0x640/0x1df8
> > > [ 6971.600487] Disabling lock debugging due to kernel taint
> > > 
> > > Once it happens, running it again will trigger in on another PFN.
> > > 
> > > [39717.085115] BUG: Bad page state in process dirtyc0w_child  pfn:af065 
> > > 
> > > Any thoughts?  
> > 
> > Alexander, Gerald, could you take a look?
> 
> Thanks for reporting. From the header of dirtyc0w.c it seems that this
> is testing some gup behavior. Given that we have an issue with gup_fast
> on s390, this could be related. I'll try to reproduce and do more
> analysis.
> 
> A fix for our gup_fast issue is also in linux-next now, as of 2020-09-20,
> but it was not yet included in your kernel version 5.9.0-rc4-next-20200909.
> So if this is related to the gup_fast issue, it should not occur again
> with linux-next kernels after 2020-09-20.

OK, I can now reproduce this, and unfortunately also with the gup_fast
fix, so it is something different. Bisecting is a bit hard, as it will
not always show immediately, sometimes takes up to an hour.

Still, I think I found the culprit, merge commit b25d1dc9474e "Merge
branch 'simplify-do_wp_page'". Without those 4 patches, it works fine,
running over night.

Not sure why this only shows on s390, should not be architecture-specific,
but we do often see subtle races earlier than others due to hypervisor
impact.

The first commit 09854ba94c6a ("mm: do_wp_page() simplification") already
introduces this error. The dirtyc0w_child test seems to play with cow
and racing madvise(MADV_DONTNEED), but I have not yet fully understood
it and also not the changes from commit 09854ba94c6a. As Linus already
mentioned in the merge commit message, this is some bad timing for such
a change, so I don't want to delay this further with trying to understand
it better before reporti

Re: BUG: Bad page state in process dirtyc0w_child

2020-09-22 Thread Gerald Schaefer
On Wed, 16 Sep 2020 16:28:06 +0200
Heiko Carstens  wrote:

> On Sat, Sep 12, 2020 at 09:54:12PM -0400, Qian Cai wrote:
> > Occasionally, running this LTP test will trigger an error below on
> > s390:
> > https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/security/dirtyc0w/dirtyc0w.c
> > https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/security/dirtyc0w/dirtyc0w_child.c
> > 
> > this .config:
> > https://gitlab.com/cailca/linux-mm/-/blob/master/s390.config
> > 
> > [ 6970.253173] LTP: starting dirtyc0w
> > [ 6971.599102] BUG: Bad page state in process dirtyc0w_child  pfn:8865d
> > [ 6971.599867] page:1a8328d7 refcount:0 mapcount:0 
> > mapping: index:0x0 pfn:0x8865d
> > [ 6971.599876] flags: 
> > 0x4008000e(referenced|uptodate|dirty|swapbacked)
> > [ 6971.599886] raw: 4008000e 0100 0122 
> > 
> > [ 6971.599893] raw:    
> > 
> > [ 6971.599900] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
> > [ 6971.599906] Modules linked in: loop kvm ip_tables x_tables dm_mirror 
> > dm_region_hash dm_log dm_mod [last unloaded: dummy_del_mod]
> > [ 6971.599952] CPU: 1 PID: 65238 Comm: dirtyc0w_child Tainted: G   
> > O  5.9.0-rc4-next-20200909 #1
> > [ 6971.599959] Hardware name: IBM 2964 N96 400 (z/VM 6.4.0)
> > [ 6971.599964] Call Trace:
> > [ 6971.599979]  [<73aec038>] show_stack+0x158/0x1f0 
> > [ 6971.599986]  [<73af724a>] dump_stack+0x1f2/0x238 
> > [ 6971.54]  [<72ed086a>] bad_page+0x1ba/0x1c0 
> > [ 6971.60]  [<72ed20c4>] free_pcp_prepare+0x4fc/0x658 
> > [ 6971.66]  [<72ed96a6>] free_unref_page+0xae/0x158 
> > [ 6971.600013]  [<72e8286a>] unmap_page_range+0xb62/0x1df8 
> > [ 6971.600019]  [<72e83bbc>] unmap_single_vma+0xbc/0x1c8 
> > [ 6971.600025]  [<72e8418e>] zap_page_range+0x176/0x230 
> > [ 6971.600033]  [<72eece8e>] do_madvise+0xfde/0x1270 
> > [ 6971.600039]  [<72eed50a>] __s390x_sys_madvise+0x72/0x98 
> > [ 6971.600047]  [<73b1cce4>] system_call+0xdc/0x278 
> > [ 6971.600053] 2 locks held by dirtyc0w_child/65238:
> > [ 6971.600058]  #0: 00013442fa18 (&mm->mmap_lock){}-{3:3}, at: 
> > do_madvise+0x17a/0x1270
> > [ 6971.600432]  #1: 0001343f9060 (ptlock_ptr(page)#2){+.+.}-{2:2}, at: 
> > unmap_page_range+0x640/0x1df8
> > [ 6971.600487] Disabling lock debugging due to kernel taint
> > 
> > Once it happens, running it again will trigger it on another PFN.
> > 
> > [39717.085115] BUG: Bad page state in process dirtyc0w_child  pfn:af065 
> > 
> > Any thoughts?  
> 
> Alexander, Gerald, could you take a look?

Thanks for reporting. From the header of dirtyc0w.c it seems that this
is testing some gup behavior. Given that we have an issue with gup_fast
on s390, this could be related. I'll try to reproduce and do more
analysis.

A fix for our gup_fast issue is also in linux-next now, as of 2020-09-20,
but it was not yet included in your kernel version 5.9.0-rc4-next-20200909.
So if this is related to the gup_fast issue, it should not occur again
with linux-next kernels after 2020-09-20.


Re: Ways to deprecate /sys/devices/system/memory/memoryX/phys_device ?

2020-09-22 Thread Gerald Schaefer
On Thu, 10 Sep 2020 12:20:34 +0200
David Hildenbrand  wrote:

> Hi everybody,
> 
> I was just exploring how /sys/devices/system/memory/memoryX/phys_device
> is/was used. It's one of these interfaces that most probably never
> should have been added but now we are stuck with it.
> 
> "phys_device" was used on s390x in older versions of lsmem[2]/chmem[3],
> back when they were still part of s390x-tools. They were later replaced
> [5] by the variants in linux-utils. For example, RHEL6 and RHEL7 contain
> lsmem/chmem from s390-utils. RHEL8 switched to versions from util-linux
> on s390x [4].
> 
> "phys_device" was added with sysfs support for memory hotplug in commit
> 3947be1969a9 ("[PATCH] memory hotplug: sysfs and add/remove functions")
> in 2005. It always returned 0.
> 
> s390x started returning something != 0 on some setups (if sclp.rzm is
> set by HW) in 2010 via commit 57b552ba0b2f("memory hotplug/s390: set
> phys_device").
> 
> For s390x, it allowed for identifying which memory block devices belong
> to the same memory increment (RZM). Only if all memory block devices
> comprising a single memory increment were offline, the memory could
> actually be removed in the hypervisor.
> 
> Since commit e5d709bb5fb7 ("s390/memory hotplug: provide
> memory_block_size_bytes() function") in 2013 a memory block devices
> spans at least one memory increment - which is why the interface isn't
> really helpful/used anymore (except by old lsmem/chmem tools).

Correct, so I do not see any problem for s390 with removing / changing
that for the upstream kernel. BTW, that commit also gave some relief
on the scaling issue, at least for s390. With increasing total memory
size, we also have increasing increment and thus memory block size.

Of course, that also has some limitations, IIRC max. 1 GB increment
size, but still better than the 256 MB default size.

> 
> There were once RFC patches to make use of it in ACPI, but it could be
> solved using different interfaces [1].
> 
> 
> While I'd love to rip it out completely, I think it would break old
> lsmem/chmem completely - and I assume that's not acceptable. I was
> wondering what would be considered safe to do now/in the future:
> 
> 1. Make it always return 0 (just as if "sclp.rzm" would be set to 0 on
> s390x). This will make old lsmem/chmem behave differently after
> switching to a new kernel, like if sclp.rzm would not be set by HW -
> AFAIU, it will assume all memory is in a single memory increment. Do we
> care?

No, at least not unless that kernel change gets backported to some
old distribution level where we still use lsmem/chmem from s390-tools.
Given that this is just some clean-up w/o any functional benefit, and
hopefully w/o any negative impact, I think we can safely assume that no
distributor will do that "just for fun".

Even if there would be good reasons for backports, then I guess we also
have good reasons for backporting / switching to the util-linux version
of lsmem / chmem for such distribution levels. Alternatively, adjust the
s390-tools lsmem / chmem there.

But I would rather "rip it out completely" than just return 0. You'd
need some lsmem / chmem changes anyway, at least in case this would
ever be backported.

> 2. Restrict it to s390x only. It always returned 0 on other
> architectures, I was not able to find any user.
> 
> I think 2 should be safe to do (never used on other archs). I do wonder
> what the feelings are about 1.

Please don't add any s390-specific workarounds here, that does not
really sound like a clean-up, rather the opposite.

That being said, I do not really see the benefit of this change at
all. As Michal mentioned, there really should be some more fundamental
change. And from the rest of this thread, it also seems that phys_device
usage might not be the biggest issue here.


Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-10 Thread Gerald Schaefer
On Thu, 10 Sep 2020 11:33:17 -0700
Linus Torvalds  wrote:

> On Thu, Sep 10, 2020 at 11:13 AM Jason Gunthorpe  wrote:
> >
> > So.. To change away from the stack option I think we'd have to pass
> > the READ_ONCE value to pXX_offset() as an extra argument instead of it
> > derefing the pointer internally.
> 
> Yeah, but I think that would actually be the better model than passing
> an address to a random stack location.
> 
> It's also effectively what we do in some other places, eg the whole
> logic with "orig" in the regular pte fault handling is basically doing
> unlocked loads of the pte, various decisions on that, and then doing a
> final "is this still the same pte" after it has gotten the page table
> lock.

That sounds a lot like the pXd_offset_orig() from Martin's first approach
in this thread:
https://lore.kernel.org/linuxppc-dev/20190418100218.0a4afd51@mschwideX1/

It is also the "Patch 1" option from the start of this thread:
https://lore.kernel.org/lkml/20200828140314.8556-1-gerald.schae...@linux.ibm.com/

I guess I chose wrongly there; I should have had more trust in Martin's
approach, and not tried so hard to do it like the others...

So, maybe we can start over again, from that patch option. It would of
course also initially introduce some gup-specific helpers, like with
the other approach. It seemed harder to generalize when I thought
about it back then, but I guess it should not be a lot harder than
the _addr_end stuff.
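
Just to make that idea more concrete, here is a purely hypothetical sketch
of such a gup-specific helper (signature and details invented here for
illustration, not taken from Martin's patch): the caller passes both the
pgd pointer and the value it already read with READ_ONCE(), so the folding
decision can be made from the value without dereferencing the pointer again:

/* hypothetical sketch only, s390 details simplified */
static inline p4d_t *p4d_offset_orig(pgd_t *pgdp, pgd_t pgd, unsigned long addr)
{
        if ((pgd_val(pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1)
                return (p4d_t *)pgd_deref(pgd) + p4d_index(addr); /* real p4d table */
        return (p4d_t *)pgdp;   /* folded: stay on the current table */
}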

Or, maybe this time, just not to risk Christian getting a heart attack,
we could go for the gup-specific helper first, so that we would at
least have a fix for the possible s390 data corruption. Jason, would
you agree that we send a new RFC, this time with pXd_offset_orig()
approach, and have that accepted as short-term fix?

Or would you rather also wait for some proper generic change? Have
lost that option from my radar, so cannot really judge how much more
effort it would be. I'm on vacation next week anyway, but Alexander
or Vasily (who did the option 1 patch) could look into this further.


Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-10 Thread Gerald Schaefer
On Thu, 10 Sep 2020 10:02:33 -0300
Jason Gunthorpe  wrote:

> On Thu, Sep 10, 2020 at 11:39:25AM +0200, Alexander Gordeev wrote:
> 
> > As Gerald mentioned, it is very difficult to explain in a clear way.
> > Hopefully, one could make sense out of it.
> 
> I would say the page table API requires this invariant:
> 
> pud = pud_offset(p4d, addr);
> do {
>   WARN_ON(pud != pud_offset(p4d, addr));
> next = pud_addr_end(addr, end);
> } while (pud++, addr = next, addr != end);
> 
> ie pud++ is supposed to be a shortcut for 
>   pud_offset(p4d, next)
> 

Hmm, IIUC, all architectures with static folding will simply return
the passed-in p4d pointer for pud_offset(p4d, addr), for 3-level
pagetables. There is no difference for s390. For gup_fast, that p4d
pointer is not really a pointer to a value in a pagetable, but
to some local copy of such a value, and not just for s390.

So, pud = p4d = pointer to copy, and increasing that pud pointer
cannot be the same as pud_offset(p4d, next). I do see your point
however, at last I think :-) My problem is that I do not see where
we would have an s390-specific issue here. Maybe my understanding
of how it works for others with static folding is wrong. That
would explain my difficulties in getting your point...

> While S390 does not follow this. Fixing addr_end brings it into
> alignment by preventing pud++ from happening.

Exactly, only that nobody seems to follow it, IIUC. Fixing it up
with pXd_addr_end was my impression of what we need to do, in order to
have it work the same way as for others.

> The only currently known side effect is that gup_fast crashes, but it
> sure is an unexpected thing.

Well, from my understanding it feels more unexpected that something
that is supposed to be a pointer to an entry in a page table, really is
just a pointer to some copy somewhere.


Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-10 Thread Gerald Schaefer
On Thu, 10 Sep 2020 12:10:26 -0300
Jason Gunthorpe  wrote:

> On Thu, Sep 10, 2020 at 03:28:03PM +0200, Gerald Schaefer wrote:
> > On Thu, 10 Sep 2020 10:02:33 -0300
> > Jason Gunthorpe  wrote:
> >   
> > > On Thu, Sep 10, 2020 at 11:39:25AM +0200, Alexander Gordeev wrote:
> > >   
> > > > As Gerald mentioned, it is very difficult to explain in a clear way.
> > > > Hopefully, one could make sense out of it.
> > > 
> > > I would say the page table API requires this invariant:
> > > 
> > > pud = pud_offset(p4d, addr);
> > > do {
> > >   WARN_ON(pud != pud_offset(p4d, addr));
> > > next = pud_addr_end(addr, end);
> > > } while (pud++, addr = next, addr != end);
> > > 
> > > ie pud++ is supposed to be a shortcut for 
> > >   pud_offset(p4d, next)
> > > 
> > > While S390 does not follow this. Fixing addr_end brings it into
> > > alignment by preventing pud++ from happening.
> > > 
> > > The only currently known side effect is that gup_fast crashes, but it
> > > sure is an unexpected thing.  
> > 
> > It only is unexpected in a "top-level folding" world, see my other reply.
> > Consider it an optimization, which was possible because of how our dynamic
> > folding works, and e.g. because we can determine the correct pagetable
> > level from a pXd value in pXd_offset.  
> 
> No, I disagree. The page walker API the arch presents has to have well
> defined semantics. For instance, there is an effort to define tests
> and invarients for the page table accesses to bring this understanding
> and uniformity:
> 
>  mm/debug_vm_pgtable.c
> 
> If we fix S390 using the pX_addr_end() change then the above should be
> updated with an invariant to check it. I've added Anshuman for some
> thoughts..

We are very aware of those tests, and are actually big supporters of the
idea. s390 is also among the supported architectures already, and the
tests have already helped us find / fix some s390 oddities.

However, we did not see any issues wrt to our pagetable walking,
neither with the current version, nor with the new generic approach.
We do currently see other issues, Anshuman will know what I mean :-)

> For better or worse, that invariant does exclude arches from using
> other folding techniques.
> 
> The other solution would be to address the other side of != and adjust
> the pud++
> 
> eg replcae pud++ with something like:
>   pud = pud_next_entry(p4d, pud, next)
> 
> Such that:
>   pud_next_entry(p4d, pud, next) === pud_offset(p4d, next)
> 
> In which case the invarient changes to 'callers can never do pointer
> arithmetic on the result of pXX_offset()' which is a bit harder to
> enforce.

I might have lost track a bit. Are we still talking about possible
functional impacts of either our current pagetable walking with s390
(apart from gup_fast), or the proposed generic change (for s390, or
others?)?

Or is this rather some (other) generic issue / idea that you have,
in order to put "some more structure / enforcement" to generic
pagetable walkers?


Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-10 Thread Gerald Schaefer
On Thu, 10 Sep 2020 10:02:33 -0300
Jason Gunthorpe  wrote:

> On Thu, Sep 10, 2020 at 11:39:25AM +0200, Alexander Gordeev wrote:
> 
> > As Gerald mentioned, it is very difficult to explain in a clear way.
> > Hopefully, one could make sense out of it.
> 
> I would say the page table API requires this invariant:
> 
> pud = pud_offset(p4d, addr);
> do {
>   WARN_ON(pud != pud_offset(p4d, addr));
> next = pud_addr_end(addr, end);
> } while (pud++, addr = next, addr != end);
> 
> ie pud++ is supposed to be a shortcut for 
>   pud_offset(p4d, next)
> 
> While S390 does not follow this. Fixing addr_end brings it into
> alignment by preventing pud++ from happening.
> 
> The only currently known side effect is that gup_fast crashes, but it
> sure is an unexpected thing.

It only is unexpected in a "top-level folding" world, see my other reply.
Consider it an optimization, which was possible because of how our dynamic
folding works, and e.g. because we can determine the correct pagetable
level from a pXd value in pXd_offset.

> This suggests another fix, which is to say that pud++ is undefined and
> pud_offset() must always be called, but I think that would cause worse
> codegen on all other archs.

There really is nothing to fix for s390 outside of gup_fast, or other
potential future READ_ONCE pagetable walkers. We do take the side-effect
of the generic change on all other pagetable walkers for s390, but it
really is rather a slight degradation than a fix.


Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-10 Thread Gerald Schaefer
On Wed, 9 Sep 2020 15:03:24 -0300
Jason Gunthorpe  wrote:

> On Wed, Sep 09, 2020 at 07:25:34PM +0200, Gerald Schaefer wrote:
> > I actually had to draw myself a picture to get some hold of
> > this, or rather a walk-through with a certain pud-crossing
> > range in a folded 3-level scenario. Not sure if I would have
> > understood my explanation above w/o that, but I hope you can
> > make some sense out of it. Or draw yourself a picture :-)
> 
> What I don't understand is how does anything work with S390 today?

That is totally comprehensible :-)

> If the fix is only to change pxx_addr_end() then than generic code
> like mm/pagewalk.c will iterate over a *different list* of page table
> entries. 
> 
> It's choice of entries to look at is entirely driven by pxx_addr_end().
> 
> Which suggest to me that mm/pagewalk.c also doesn't work properly
> today on S390 and this issue is not really about stack variables?

I guess you are confused by the fact that the generic change will indeed
change the logic for _all_ pagetable walkers on s390, not just for
the gup_fast case. But that doesn't mean that they were doing it wrong
before, we simply can do it both ways. However, we probably should
make that (in theory useless) change more explicit.

Let's compare before and after for mm/pagewalk.c on s390, with 3-level
pagetables, range crossing 2 GB pud boundary.

* Before (with pXd_addr_end always using static 5-level PxD_SIZE):

walk_pgd_range()
-> pgd_addr_end() will use static 2^53 PGDIR_SIZE, range is not cropped,
  no iterations needed, passed over to next level

walk_p4d_range()
-> p4d_addr_end() will use static 2^42 P4D_SIZE, range still not cropped

walk_pud_range()
-> pud_addr_end() now we're cropping, with 2^31 PUD_SIZE, need two
  iterations for range crossing pud boundary, doing
  that right here on a pudp which is actually the
  previously passed-through pgdp/p4dp (pointing to
  correct pagetable entry)

* After (with dynamic pXd_addr_end using "correct" PxD_SIZE boundaries,
 should be similar to other archs static "top-level folding"):

walk_pgd_range()
-> pgd_addr_end() will now determine "correct" boundary based on pgd
  value, i.e. 2^31 PUD_SIZE, do cropping now, iteration
  will now happen here

walk_p4d/pud_range()
->  operate on cropped range, will not iterate, instead return to pgd level,
which will then use the same pointer for iteration as in the "Before"
case, but not on the same level.
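
For reference, this is roughly the loop shape in mm/pagewalk.c that both
cases above go through (callbacks and error handling stripped); the key
point is that the pud++ iteration works on whatever pointer pud_offset()
returned, and only happens if pud_addr_end() did not already crop the
range one level up:

static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
                          struct mm_walk *walk)
{
        pud_t *pud = pud_offset(p4d, addr); /* on s390 3-level: returns p4d */
        unsigned long next;

        do {
                next = pud_addr_end(addr, end); /* decides where iteration happens */
                /* ... walk_pmd_range(pud, addr, next, walk) ... */
        } while (pud++, addr = next, addr != end);

        return 0;
}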

IMHO, our "Before" logic is more efficient, and also feels more natural.
After all, it is not really necessary to return to pgd level, and it will
surely cost some extra instructions. We are willing to take that cost
for the sake of doing it in a more generic way, hoping that will reduce
future issues. E.g. you already mentioned that you have plans for using
the READ_ONCE logic also in other places, and that would be such a
"future issue".

> Fundamentally if pXX_offset() and pXX_addr_end() must be consistent
> together, if pXX_offset() is folded then pXX_addr_end() must cause a
> single iteration of that level.

well, that sounds correct in theory, but I guess it depends on "how
you fold it". E.g. what does "if pXX_offset() is folded" mean?
Take pgd_offset() for the 3-level case above. From our previous
"middle-level folding/iteration" perspective, I would say that
pgd/p4d are folded into pud, so if you say "if pgd_offset() is folded
then pgd_addr_end() must cause a single iteration of that level",
we were doing it all correctly, i.e only having single iteration
on pgd/p4d level. You could even say that all others are doing /
using it wrong :-)

Now take pgd_offset() from the "top-level folding/iteration".
Here you would say that p4d/pud are folded into pgd, which again
does not sound like the natural / most efficient way to me,
but IIUC this has to be how it works for all other archs with
(static) pagetable folding. Now you'd say "if pud/p4d_offset()
is folded then pud/p4d_addr_end() must cause a single iteration
of that level", and that would sound correct. At least until
you look more closely, because e.g. p4d_addr_end() in
include/asm-generic/pgtable-nop4d.h is simply this:
#define p4d_addr_end(addr, end) (end)

How can that cause a single iteration? It clearly won't, it only
works because the previous pgd_addr_end already cropped the range
so that there will be only single iterations for p4d/pud.
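
A small worked example of that cropping, with assumed numbers (x86-64-like
4-level layout, folded p4d, PGDIR_SIZE = 512 GB, PUD_SIZE = 1 GB), walking
the range [0, 1 TB):

/*
 * pgd level: pgd_addr_end(0, 1 TB)   = 512 GB -> cropped, two iterations here
 * p4d level: p4d_addr_end(0, 512 GB) = 512 GB -> "(end)", a single iteration,
 *            but only because pgd_addr_end() already cropped the range
 * pud level: pud_addr_end() crops further, in 1 GB steps
 */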

The more I think of it, the more it sounds like s390 "middle-level
folding/iteration" was doing it "the right way", and everybody else
was wrong, or at least not in an optimally efficient way :-) Might
also be that only we could do this because we

Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-09 Thread Gerald Schaefer
On Wed, 9 Sep 2020 09:18:46 -0700
Dave Hansen  wrote:

> On 9/9/20 5:29 AM, Gerald Schaefer wrote:
> > This only works well as long there are real pagetable pointers involved,
> > that can also be used for iteration. For gup_fast, or any other future
> > pagetable walkers using the READ_ONCE logic w/o lock, that is not true.
> > There are pointers involved to local pXd values on the stack, because of
> > the READ_ONCE logic, and our middle-level iteration will suddenly iterate
> > over such stack pointers instead of pagetable pointers.
> 
> By "There are pointers involved to local pXd values on the stack", did
> you mean "locate" instead of "local"?  That sentence confused me.
> 
> Which code is it, exactly that allocates these troublesome on-stack pXd
> values, btw?

It is the gup_pXd_range() call sequence in mm/gup.c. It starts in
gup_pgd_range() with "pgdp = pgd_offset(current->mm, addr)" and then
the "pgd_t pgd = READ_ONCE(*pgdp)" which creates the first local
stack variable "pgd".

The next-level call to gup_p4d_range() gets this "pgd" value as
input, but not the original pgdp pointer where it was read from.
This is already the essential difference to other pagetable walkers
like e.g. walk_pXd_range() in mm/pagewalk.c, where the original
pointer is passed through. With READ_ONCE, that pointer must not
be further de-referenced, so instead the value is passed over.

In gup_p4d_range() we then have "p4dp = p4d_offset(&pgd, addr)",
with &pgd being a pointer to the passed-over pgd value, so that's
the first pXd pointer that does not point directly to the pXd in
the page table, but to a local stack variable.

With folded p4d, p4d_offset(&pgd, addr) will simply return
the passed-in &pgd pointer, so we now also have p4dp point to that.
That continues with "p4d_t p4d = READ_ONCE(*p4dp)", and that second
stack variable is passed to gup_huge_pud() and so on. Due to inlining,
all those variables will not really be passed anywhere, but simply
sit on the stack.

So far, IIUC, that would also happen on x86 (or everywhere else
actually) for folded levels, i.e. some pXd_offset() calls would
simply return the passed in (stack) value pointer. This works
as designed, and it will not lead to the "iteration over stack
pointer" for anybody but s390, because the pXd_addr_end()
boundaries usually take care that you always return to pgd
level for iteration, and that is the only level with a real
pagetable pointer. For s390, we stay at the first non-folded
level and do the iteration there, which is fine for other
pagetable walkers using the original pointers, but not for
the READ_ONCE-style gup_fast.

I actually had to draw myself a picture to get some hold of
this, or rather a walk-through with a certain pud-crossing
range in a folded 3-level scenario. Not sure if I would have
understood my explanation above w/o that, but I hope you can
make some sense out of it. Or draw yourself a picture :-)
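
For illustration, a stripped-down sketch of those mm/gup.c loops (huge page
handling and error paths omitted), showing where the stack copies and the
pointers to them come from:

static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
                         unsigned int flags, struct page **pages, int *nr)
{
        p4d_t *p4dp = p4d_offset(&pgd, addr);   /* folded: just returns &pgd */
        unsigned long next;

        do {
                p4d_t p4d = READ_ONCE(*p4dp);   /* second stack copy */

                next = p4d_addr_end(addr, end);
                /* ... gup_pud_range(p4d, addr, next, flags, pages, nr) ... */
        } while (p4dp++, addr = next, addr != end);

        return 1;
}

static void gup_pgd_range(unsigned long addr, unsigned long end,
                          unsigned int flags, struct page **pages, int *nr)
{
        pgd_t *pgdp = pgd_offset(current->mm, addr); /* real page table pointer */
        unsigned long next;

        do {
                pgd_t pgd = READ_ONCE(*pgdp);   /* first local stack copy */

                next = pgd_addr_end(addr, end);
                /* ... gup_p4d_range(pgd, addr, next, ...), value not pointer ... */
        } while (pgdp++, addr = next, addr != end);
}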


Re: [RFC PATCH v2 0/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-09 Thread Gerald Schaefer
On Tue, 8 Sep 2020 19:36:50 +0200
Gerald Schaefer  wrote:

[..]
> 
> It seems now that the generalization is very well accepted so far,
> apart from some apparent issues on arm. Also, merging 2 + 3 and
> putting them first seems to be acceptable, so we could do that for
> v3, if there are no objections.
> 
> Of course, we first need to address the few remaining issues for
> arm(32?), which do look quite confusing to me so far. BTW, sorry for
> the compile error with patch 3, I guess we did the cross-compile only
> for 1 + 2 applied, to see the bloat-o-meter changes. But I guess
> patch 3 already proved its usefulness by that :-)

Umm, replace "arm" with "power", sorry. No issues on arm so far, but
also no ack I think.

Thanks to Christophe for the power change, and to Mike for volunteering
for some cross compilation and cross-arch testing. Will send v3 with
merged and re-ordered patches after some more testing.


Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-09 Thread Gerald Schaefer
On Tue, 8 Sep 2020 07:30:50 -0700
Dave Hansen  wrote:

> On 9/7/20 11:00 AM, Gerald Schaefer wrote:
> > Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
> > code") introduced a subtle but severe bug on s390 with gup_fast, due to
> > dynamic page table folding.
> 
> Would it be fair to say that the "fake" page table entries s390
> allocates on the stack are what's causing the trouble here?  That might
> be a nice thing to open up with here.  "Dynamic page table folding"
> really means nothing to me.

Sorry, I guess my previous reply does not really explain "what the heck
is dynamic page table folding?".

On s390, we can have different number of page table levels for different
processes / mms. We always start with 3 levels, and update dynamically
on process demand to 4 or 5 levels, hence the dynamic folding. Still,
the PxD_SIZE/SHIFT is defined statically, so that e.g. pXd_addr_end() will
not reflect this dynamic behavior.

For the various pagetable walkers using pXd_addr_end() (w/o READ_ONCE
logic) this is no problem. With static folding, iteration over the folded
levels will always happen at pgd level (top-level folding). For s390,
we stay at the respective level and iterate there (dynamic middle-level
folding), only return to pgd level if there really were 5 levels.

This only works well as long there are real pagetable pointers involved,
that can also be used for iteration. For gup_fast, or any other future
pagetable walkers using the READ_ONCE logic w/o lock, that is not true.
There are pointers involved to local pXd values on the stack, because of
the READ_ONCE logic, and our middle-level iteration will suddenly iterate
over such stack pointers instead of pagetable pointers.

This will be addressed by making the pXd_addr_end() dynamic, for which
we need to see the pXd value in order to determine its level / type.
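
For comparison, this is the generic static variant in include/linux/pgtable.h
at this point; the boundary depends only on the compile-time PUD_SIZE/PUD_MASK,
never on the entry value:

#ifndef pud_addr_end
#define pud_addr_end(addr, end)						\
({	unsigned long __boundary = ((addr) + PUD_SIZE) & PUD_MASK;	\
	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
})
#endif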


Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-08 Thread Gerald Schaefer
On Tue, 8 Sep 2020 14:40:10 +0200
Christophe Leroy  wrote:

> 
> 
> On 08/09/2020 at 14:09, Christian Borntraeger wrote:
> > 
> > 
> > On 08.09.20 07:06, Christophe Leroy wrote:
> >>
> >>
> >> On 07/09/2020 at 20:00, Gerald Schaefer wrote:
> >>> From: Alexander Gordeev 
> >>>
> >>> Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
> >>> code") introduced a subtle but severe bug on s390 with gup_fast, due to
> >>> dynamic page table folding.
> >>>
> >>> The question "What would it require for the generic code to work for s390"
> >>> has already been discussed here
> >>> https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
> >>> and ended with a promising approach here
> >>> https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1
> >>> which in the end unfortunately didn't quite work completely.
> >>>
> >>> We tried to mimic static level folding by changing pgd_offset to always
> >>> calculate top level page table offset, and do nothing in folded 
> >>> pXd_offset.
> >>> What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do
> >>> not reflect this dynamic behaviour, and still act like static 5-level
> >>> page tables.
> >>>
> >>
> >> [...]
> >>
> >>>
> >>> Fix this by introducing new pXd_addr_end_folded helpers, which take an
> >>> additional pXd entry value parameter, that can be used on s390
> >>> to determine the correct page table level and return corresponding
> >>> end / boundary. With that, the pointer iteration will always
> >>> happen in gup_pgd_range for s390. No change for other architectures
> >>> introduced.
> >>
> >> Not sure pXd_addr_end_folded() is the best understandable name, allthough 
> >> I don't have any alternative suggestion at the moment.
> >> Maybe could be something like pXd_addr_end_fixup() as it will disappear in 
> >> the next patch, or pXd_addr_end_gup() ?
> >>
> >> Also, if it happens to be acceptable to get patch 2 in stable, I think you 
> >> should switch patch 1 and patch 2 to avoid the step through 
> >> pXd_addr_end_folded()
> > 
> > given that this fixes a data corruption issue, wouldnt it be the best to go 
> > forward
> > with this patch ASAP and then handle the other patches on top with all the 
> > time that
> > we need?
> 
> I have no strong opinion on this, but I feel rather tricky to have to 
> change generic part of GUP to use a new fonction then revert that change 
> in the following patch, just because you want the first patch in stable 
> and not the second one.
> 
> Regardless, I was wondering, why do we need a reference to the pXd at 
> all when calling pXd_addr_end() ?
> 
> Couldn't S390 retrieve the pXd by using the pXd_offset() dance with the 
> passed addr ?

Apart from the performance impact of re-doing what has already been
done by the caller, I think we would also break the READ_ONCE semantics.
After all, the pXd_offset() would also require some pXd pointer input,
which we don't have. So we would need to start over again from mm->pgd.

Also, it seems to be more in line with other primitives that take
a pXd value or pointer.


Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-08 Thread Gerald Schaefer
On Tue, 8 Sep 2020 07:30:50 -0700
Dave Hansen  wrote:

> On 9/7/20 11:00 AM, Gerald Schaefer wrote:
> > Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
> > code") introduced a subtle but severe bug on s390 with gup_fast, due to
> > dynamic page table folding.
> 
> Would it be fair to say that the "fake" page table entries s390
> allocates on the stack are what's causing the trouble here?  That might
> be a nice thing to open up with here.  "Dynamic page table folding"
> really means nothing to me.

We do not really allocate anything on the stack; it is the generic logic
from gup_fast that passes over pXd values (read once before), using
pointers to such (stack) variables instead of real pXd pointers.
That, combined with the fact that we just return the passed-in pointer in
pXd_offset() for folded levels, is what causes the trouble.

That works similarly on x86 IIUC, but with static folding, and thus also
proper pXd_addr_end() results because of statically (and correspondingly)
defined PxD_INDEX/SHIFT. We always have static 5-level PxD_INDEX/SHIFT, and
that cannot really be made dynamic, so we just make pXd_addr_end()
dynamic instead, and that requires the pXd value to determine the correct
pagetable level.

Still makes my head spin when trying to explain, sorry. It is a very
special s390 oddity, or lets call it "feature", because I don't think any
other architecture has "dynamic pagetable folding" capability, depending
on process requirements, for whatever it is worth...

> 
> > @@ -2521,7 +2521,7 @@ static int gup_pmd_range(pud_t pud, unsigned long 
> > addr, unsigned long end,
> > do {
> > pmd_t pmd = READ_ONCE(*pmdp);
> >  
> > -   next = pmd_addr_end(addr, end);
> > +   next = pmd_addr_end_folded(pmd, addr, end);
> > if (!pmd_present(pmd))
> > return 0;
> 
> It looks like you fix this up later, but this would be a problem if left
> this way.  There's no documentation for whether I use
> pmd_addr_end_folded() or pmd_addr_end() when writing a page table walker.

Yes, that is very unfortunate. We did have some lengthy comment in
include/linux/pgtable.h where the pXd_addr_end(_folded) helpers were defined.
But that was moved to arch/s390/include/asm/pgtable.h in this version,
probably because we already had the generalization in mind, where we
would not need such an explanation in the common header any more.

So, it might help better understand the issue that we have with
dynamic page table folding and READ_ONCE-style pagetable walkers
when looking at that comment.

Thanks for pointing out, that comment should definitely go into
include/linux/pgtable.h again. At least if we would still go for
that "s390 fix first, generalization second" approach, but it
seems we have other / better options now.


Re: [RFC PATCH v2 0/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-08 Thread Gerald Schaefer
On Tue, 8 Sep 2020 07:22:39 +0200
Christophe Leroy  wrote:

> 
> 
> > On 07/09/2020 at 22:12, Mike Rapoport wrote:
> > On Mon, Sep 07, 2020 at 08:00:55PM +0200, Gerald Schaefer wrote:
> >> This is v2 of an RFC previously discussed here:
> >> https://lore.kernel.org/lkml/20200828140314.8556-1-gerald.schae...@linux.ibm.com/
> >>
> >> Patch 1 is a fix for a regression in gup_fast on s390, after our conversion
> >> to common gup_fast code. It will introduce special helper functions
> >> pXd_addr_end_folded(), which have to be used in places where pagetable walk
> >> is done w/o lock and with READ_ONCE, so currently only in gup_fast.
> >>
> >> Patch 2 is an attempt to make that more generic, i.e. change pXd_addr_end()
> >> themselves by adding an extra pXd value parameter. That was suggested by
> >> Jason during v1 discussion, because he is already thinking of some other
> >> places where he might want to switch to the READ_ONCE logic for pagetable
> >> walks. In general, that would be the cleanest / safest solution, but there
> >> is some impact on other architectures and common code, hence the new and
> >> greatly enlarged recipient list.
> >>
> >> Patch 3 is a "nice to have" add-on, which makes pXd_addr_end() inline
> >> functions instead of #defines, so that we get some type checking for the
> >> new pXd value parameter.
> >>
> >> Not sure about Fixes/stable tags for the generic solution. Only patch 1
> >> fixes a real bug on s390, and has Fixes/stable tags. Patches 2 + 3 might
> >> still be nice to have in stable, to ease future backports, but I guess
> >> "nice to have" does not really qualify for stable backports.
> > 
> > I also think that adding pXd parameter to pXd_addr_end() is a cleaner
> > way and with this patch 1 is not really required. I would even merge
> > patches 2 and 3 into a single patch and use only it as the fix.
> 
> Why not merging patches 2 and 3, but I would keep patch 1 separate but 
> after the generic changes, so that we first do the generic changes, then 
> we do the specific S390 use of it.

Yes, we thought about that approach too. It would at least allow to
get all into stable, more or less nicely, as prerequisite for the s390
fix.

Two concerns kept us from going that way. For one, it might not be
the nicest way to get it all into stable, and we would not want to risk
further objections due to the imminent and rather scary data corruption
issue that we want to fix asap.

For the same reason, we thought that the generalization part might
need more time and agreement from various people, so that we could at
least get the first patch as short-term solution.

It seems now that the generalization is very well accepted so far,
apart from some apparent issues on arm. Also, merging 2 + 3 and
putting them first seems to be acceptable, so we could do that for
v3, if there are no objections.

Of course, we first need to address the few remaining issues for
arm(32?), which do look quite confusing to me so far. BTW, sorry for
the compile error with patch 3, I guess we did the cross-compile only
for 1 + 2 applied, to see the bloat-o-meter changes. But I guess
patch 3 already proved its usefulness by that :-)


[RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware

2020-09-07 Thread Gerald Schaefer
From: Alexander Gordeev 

Unlike all other page-table abstractions, the pXd_addr_end() functions do
not take into account the particular table entry in whose context they are
called. On architectures with dynamic page-table folding that might lead
to a lack of necessary information that is difficult to obtain other than
from the table entry itself. That already led to a subtle memory
corruption issue on s390.

By letting the pXd_addr_end() functions know about the page-table entry,
we allow archs not only to make extra checks, but also optimizations.

As result of this change the pXd_addr_end_folded() functions used
in gup_fast traversal code become unnecessary and get replaced with
universal pXd_addr_end() variants.

The arch-specific updates not only add dereferencing of page-table
entry pointers, but also small changes to the code flow to make those
dereferences possible, at least for x86 and powerpc. The same goes for
arm64, but in a way that should not have any impact.

So, even though the dereferenced page-table entries are not used on
archs other than s390, and are optimized out by the compiler, there
is a small change in kernel size and this is what bloat-o-meter reports:

x86:
add/remove: 0/0 grow/shrink: 2/0 up/down: 10/0 (10)
Function old new   delta
vmemmap_populate 587 592  +5
munlock_vma_pages_range  556 561  +5
Total: Before=15534694, After=15534704, chg +0.00%

powerpc:
add/remove: 0/0 grow/shrink: 1/0 up/down: 4/0 (4)
Function old new   delta
.remove_pagetable   16481652  +4
Total: Before=21478240, After=21478244, chg +0.00%

arm64:
add/remove: 0/0 grow/shrink: 0/0 up/down: 0/0 (0)
Function old new   delta
Total: Before=20240851, After=20240851, chg +0.00%

sparc:
add/remove: 0/0 grow/shrink: 0/0 up/down: 0/0 (0)
Function old new   delta
Total: Before=4907262, After=4907262, chg +0.00%

Signed-off-by: Alexander Gordeev 
Signed-off-by: Gerald Schaefer 
---
 arch/arm/include/asm/pgtable-2level.h|  2 +-
 arch/arm/mm/idmap.c  |  6 ++--
 arch/arm/mm/mmu.c|  8 ++---
 arch/arm64/kernel/hibernate.c| 16 ++
 arch/arm64/kvm/mmu.c | 16 +-
 arch/arm64/mm/kasan_init.c   |  8 ++---
 arch/arm64/mm/mmu.c  | 25 +++
 arch/powerpc/mm/book3s64/radix_pgtable.c |  7 ++---
 arch/powerpc/mm/hugetlbpage.c|  6 ++--
 arch/s390/include/asm/pgtable.h  |  8 ++---
 arch/s390/mm/page-states.c   |  8 ++---
 arch/s390/mm/pageattr.c  |  8 ++---
 arch/s390/mm/vmem.c  |  8 ++---
 arch/sparc/mm/hugetlbpage.c  |  6 ++--
 arch/um/kernel/tlb.c |  8 ++---
 arch/x86/mm/init_64.c| 15 -
 arch/x86/mm/kasan_init_64.c  | 16 +-
 include/asm-generic/pgtable-nop4d.h  |  2 +-
 include/asm-generic/pgtable-nopmd.h  |  2 +-
 include/asm-generic/pgtable-nopud.h  |  2 +-
 include/linux/pgtable.h  | 26 ---
 mm/gup.c |  8 ++---
 mm/ioremap.c |  8 ++---
 mm/kasan/init.c  | 17 +-
 mm/madvise.c |  4 +--
 mm/memory.c  | 40 
 mm/mlock.c   | 18 ---
 mm/mprotect.c|  8 ++---
 mm/pagewalk.c|  8 ++---
 mm/swapfile.c|  8 ++---
 mm/vmalloc.c | 16 +-
 31 files changed, 165 insertions(+), 173 deletions(-)

diff --git a/arch/arm/include/asm/pgtable-2level.h b/arch/arm/include/asm/pgtable-2level.h
index 3502c2f746ca..5e6416b339f4 100644
--- a/arch/arm/include/asm/pgtable-2level.h
+++ b/arch/arm/include/asm/pgtable-2level.h
@@ -209,7 +209,7 @@ static inline pmd_t *pmd_offset(pud_t *pud, unsigned long addr)
} while (0)
 
 /* we don't need complex calculations here as the pmd is folded into the pgd */
-#define pmd_addr_end(addr,end) (end)
+#define pmd_addr_end(pmd,addr,end) (end)
 
 #define set_pte_ext(ptep,pte,ext) cpu_set_pte_ext(ptep,pte,ext)
 
diff --git a/arch/arm/mm/idmap.c b/arch/arm/mm/idmap.c
index 448e57c6f653..5437f943ca8b 100644
--- a/arch/arm/mm/idmap.c
+++ b/arch/arm/mm/idmap.c
@@ -46,7 +46,7 @@ static void idmap_add_pmd(pud_t *pud, unsigned long addr, unsigned long end,
pmd = pmd_offset(pud, addr);
 
do {
-   next = pmd_addr_end(addr, end);
+   next = pmd_addr_end(*pmd, addr, end);
*pmd = __pmd((addr & PMD_MASK) | prot);
flush_pmd_entry(pmd);
} while (pmd++, addr = next, addr != end);

[RFC PATCH v2 0/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-07 Thread Gerald Schaefer
This is v2 of an RFC previously discussed here:
https://lore.kernel.org/lkml/20200828140314.8556-1-gerald.schae...@linux.ibm.com/

Patch 1 is a fix for a regression in gup_fast on s390, after our conversion
to common gup_fast code. It will introduce special helper functions
pXd_addr_end_folded(), which have to be used in places where pagetable walk
is done w/o lock and with READ_ONCE, so currently only in gup_fast.

Patch 2 is an attempt to make that more generic, i.e. change pXd_addr_end()
themselves by adding an extra pXd value parameter. That was suggested by
Jason during v1 discussion, because he is already thinking of some other
places where he might want to switch to the READ_ONCE logic for pagetable
walks. In general, that would be the cleanest / safest solution, but there
is some impact on other architectures and common code, hence the new and
greatly enlarged recipient list.

Patch 3 is a "nice to have" add-on, which makes pXd_addr_end() inline
functions instead of #defines, so that we get some type checking for the
new pXd value parameter.

Not sure about Fixes/stable tags for the generic solution. Only patch 1
fixes a real bug on s390, and has Fixes/stable tags. Patches 2 + 3 might
still be nice to have in stable, to ease future backports, but I guess
"nice to have" does not really qualify for stable backports.

Changes in v2:
- Pick option 2 from v1 discussion (pXd_addr_end_folded helpers)
- Add patch 2 + 3 for more generic approach

Alexander Gordeev (3):
  mm/gup: fix gup_fast with dynamic page table folding
  mm: make pXd_addr_end() functions page-table entry aware
  mm: make generic pXd_addr_end() macros inline functions

 arch/arm/include/asm/pgtable-2level.h|  2 +-
 arch/arm/mm/idmap.c  |  6 ++--
 arch/arm/mm/mmu.c|  8 ++---
 arch/arm64/kernel/hibernate.c| 16 +
 arch/arm64/kvm/mmu.c | 16 -
 arch/arm64/mm/kasan_init.c   |  8 ++---
 arch/arm64/mm/mmu.c  | 25 +++---
 arch/powerpc/mm/book3s64/radix_pgtable.c |  7 ++--
 arch/powerpc/mm/hugetlbpage.c|  6 ++--
 arch/s390/include/asm/pgtable.h  | 42 
 arch/s390/mm/page-states.c   |  8 ++---
 arch/s390/mm/pageattr.c  |  8 ++---
 arch/s390/mm/vmem.c  |  8 ++---
 arch/sparc/mm/hugetlbpage.c  |  6 ++--
 arch/um/kernel/tlb.c |  8 ++---
 arch/x86/mm/init_64.c| 15 -
 arch/x86/mm/kasan_init_64.c  | 16 -
 include/asm-generic/pgtable-nop4d.h  |  2 +-
 include/asm-generic/pgtable-nopmd.h  |  2 +-
 include/asm-generic/pgtable-nopud.h  |  2 +-
 include/linux/pgtable.h  | 38 -
 mm/gup.c |  8 ++---
 mm/ioremap.c |  8 ++---
 mm/kasan/init.c  | 17 +-
 mm/madvise.c |  4 +--
 mm/memory.c  | 40 +++---
 mm/mlock.c   | 18 +++---
 mm/mprotect.c|  8 ++---
 mm/pagewalk.c|  8 ++---
 mm/swapfile.c|  8 ++---
 mm/vmalloc.c | 16 -
 31 files changed, 219 insertions(+), 165 deletions(-)

-- 
2.17.1



[RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-07 Thread Gerald Schaefer
From: Alexander Gordeev 

Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
code") introduced a subtle but severe bug on s390 with gup_fast, due to
dynamic page table folding.

The question "What would it require for the generic code to work for s390"
has already been discussed here
https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
and ended with a promising approach here
https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1
which in the end unfortunately didn't quite work completely.

We tried to mimic static level folding by changing pgd_offset to always
calculate top level page table offset, and do nothing in folded pXd_offset.
What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do
not reflect this dynamic behaviour, and still act like static 5-level
page tables.

Here is an example of what happens with gup_fast on s390, for a task with
3-levels paging, crossing a 2 GB pud boundary:

// addr = 0x1007000, end = 0x10080001000
static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
                         unsigned int flags, struct page **pages, int *nr)
{
        unsigned long next;
        pud_t *pudp;

        // pud_offset returns &p4d itself (a pointer to a value on stack)
        pudp = pud_offset(&p4d, addr);
        do {
                // on second iteration reading "random" stack value
                pud_t pud = READ_ONCE(*pudp);

                // next = 0x1008000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390
                next = pud_addr_end(addr, end);
                ...
        } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack

        return 1;
}

pud_addr_end = 0x1008000 is correct, but the previous pgd/p4d_addr_end
should also have returned that limit, instead of the 5-level static
pgd/p4d limits with PUD_SIZE/MASK != PGDIR_SIZE/MASK. Then the "end"
parameter for gup_pud_range would also have been 0x1008000, and we
would not iterate further in gup_pud_range, but rather go back and
(correctly) do it in gup_pgd_range.

So, for the second iteration in gup_pud_range, we will increase pudp,
which pointed to a stack value and not the real pud table. This new pudp
will then point to whatever lies behind the p4d stack value. In general,
this happens to be the previously read pgd, but it probably could also
be something different, depending on compiler decisions.

Most unfortunately, if it happens to be the pgd value, which is the
same as the p4d / pud due to folding, it is a valid and present entry.
So after the increment, we would still point to the same pud entry.
The addr however has been increased in the second iteration, so that we
now have different pmd/pte_index values, which will result in very wrong
behaviour for the remaining gup_pmd/pte_range calls. We will effectively
operate on an address minus 2 GB, due to missing pudp increase.

In the "good case", if nothing is mapped there, we will fall back to
the slow gup path. But if something is mapped there, and valid
for gup_fast, we will end up (silently) getting references on the wrong
pages and also add the wrong pages to the **pages result array. This
can cause data corruption.

Fix this by introducing new pXd_addr_end_folded helpers, which take an
additional pXd entry value parameter, that can be used on s390
to determine the correct page table level and return corresponding
end / boundary. With that, the pointer iteration will always
happen in gup_pgd_range for s390. No change for other architectures
introduced.

Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
Cc:  # 5.2+
Reviewed-by: Gerald Schaefer 
Signed-off-by: Alexander Gordeev 
Signed-off-by: Gerald Schaefer 
---
 arch/s390/include/asm/pgtable.h | 42 +
 include/linux/pgtable.h | 16 +
 mm/gup.c|  8 +++
 3 files changed, 62 insertions(+), 4 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 7eb01a5459cd..027206e4959d 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -512,6 +512,48 @@ static inline bool mm_pmd_folded(struct mm_struct *mm)
 }
 #define mm_pmd_folded(mm) mm_pmd_folded(mm)
 
+/*
+ * With dynamic page table levels on s390, the static pXd_addr_end() functions
+ * will not return corresponding dynamic boundaries. This is no problem as long
+ * as only pXd pointers are passed down during page table walk, because
+ * pXd_offset() will simply return the given pointer for folded levels, and the
+ * pointer iteration over a range simply happens at the correct page table
+ * level.
+ * It is however a problem with gup_fast, or other places walking the page
+ * tables w/o locks using READ_ONCE(), and passing down the pXd values instead
+ * of pointers. In this case, the pointer given to pXd_offset(
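
(The diff is cut off above before the rest of the comment and the mm/gup.c
hunk; roughly, the mm/gup.c part of the change switches the gup_pXd_range()
loops to the new helpers, passing the previously read entry value, e.g. for
the pud level:)

 	do {
 		pud_t pud = READ_ONCE(*pudp);
 
-		next = pud_addr_end(addr, end);
+		next = pud_addr_end_folded(pud, addr, end);
 		if (pud_none(pud))
 			return 0;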

[RFC PATCH v2 3/3] mm: make generic pXd_addr_end() macros inline functions

2020-09-07 Thread Gerald Schaefer
From: Alexander Gordeev 

Since pXd_addr_end() macros take pXd page-table entry as a
parameter it makes sense to check the entry type on compile.
Even though most archs do not make use of page-table entries
in pXd_addr_end() calls, checking the type in traversal code
paths could help to avoid subtle bugs.

Signed-off-by: Alexander Gordeev 
Signed-off-by: Gerald Schaefer 
---
 include/linux/pgtable.h | 36 
 1 file changed, 20 insertions(+), 16 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 67ebc22cf83d..d9e7d16c2263 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -656,31 +656,35 @@ static inline int arch_unmap_one(struct mm_struct *mm,
  */
 
 #ifndef pgd_addr_end
-#define pgd_addr_end(pgd, addr, end)   \
-({ unsigned long __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK;  \
-   (__boundary - 1 < (end) - 1)? __boundary: (end);\
-})
+#define pgd_addr_end pgd_addr_end
+static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, unsigned long end)
+{  unsigned long __boundary = (addr + PGDIR_SIZE) & PGDIR_MASK;
+   return (__boundary - 1 < end - 1) ? __boundary : end;
+}
 #endif
 
 #ifndef p4d_addr_end
-#define p4d_addr_end(p4d, addr, end)   \
-({ unsigned long __boundary = ((addr) + P4D_SIZE) & P4D_MASK;  \
-   (__boundary - 1 < (end) - 1)? __boundary: (end);\
-})
+#define p4d_addr_end p4d_addr_end
+static inline unsigned long p4d_addr_end(p4d_t p4d, unsigned long addr, unsigned long end)
+{  unsigned long __boundary = (addr + P4D_SIZE) & P4D_MASK;
+   return (__boundary - 1 < end - 1) ? __boundary : end;
+}
 #endif
 
 #ifndef pud_addr_end
-#define pud_addr_end(pud, addr, end)   \
-({ unsigned long __boundary = ((addr) + PUD_SIZE) & PUD_MASK;  \
-   (__boundary - 1 < (end) - 1)? __boundary: (end);\
-})
+#define pud_addr_end pud_addr_end
+static inline unsigned long pud_addr_end(pud_t pud, unsigned long addr, unsigned long end)
+{  unsigned long __boundary = (addr + PUD_SIZE) & PUD_MASK;
+   return (__boundary - 1 < end - 1) ? __boundary : end;
+}
 #endif
 
 #ifndef pmd_addr_end
-#define pmd_addr_end(pmd, addr, end)   \
-({ unsigned long __boundary = ((addr) + PMD_SIZE) & PMD_MASK;  \
-   (__boundary - 1 < (end) - 1)? __boundary: (end);\
-})
+#define pmd_addr_end pmd_addr_end
+static inline unsigned long pmd_addr_end(pmd_t pmd, unsigned long addr, unsigned long end)
+{  unsigned long __boundary = (addr + PMD_SIZE) & PMD_MASK;
+   return (__boundary - 1 < end - 1) ? __boundary : end;
+}
 #endif
 
 /*
-- 
2.17.1
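
As a usage note on what the inline variants buy, assuming pud_t and pmd_t
are distinct struct types (as they are on most configurations): passing
the wrong entry type now fails at compile time, instead of being silently
ignored as with the macros, e.g.:

	pmd_t *pmdp = pmd_offset(pudp, addr);

	next = pud_addr_end(*pmdp, addr, end);	/* compile error with the inline
						   variant: pmd_t is not pud_t */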



Re: [RFC PATCH 0/2] mm/gup: fix gup_fast with dynamic page table folding

2020-09-02 Thread Gerald Schaefer
On Wed, 2 Sep 2020 14:24:37 +0200
Gerald Schaefer  wrote:

> On Tue, 1 Sep 2020 16:22:22 -0700
> John Hubbard  wrote:
> 
> > On 9/1/20 10:40 AM, Gerald Schaefer wrote:
> > > On Mon, 31 Aug 2020 12:15:53 -0700
> > > Andrew Morton  wrote:
> > ...
> > > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > > index e8cbc2e795d5..43dacbce823f 100644
> > > --- a/include/linux/pgtable.h
> > > +++ b/include/linux/pgtable.h
> > > @@ -681,6 +681,38 @@ static inline int arch_unmap_one(struct mm_struct 
> > > *mm,
> > >   })
> > >   #endif
> > >   
> > > +/*
> > > + * With dynamic page table levels on s390, the static pXd_addr_end() 
> > > functions
> > > + * will not return corresponding dynamic boundaries. This is no problem 
> > > as long
> > > + * as only pXd pointers are passed down during page table walk, because
> > > + * pXd_offset() will simply return the given pointer for folded levels, 
> > > and the
> > > + * pointer iteration over a range simply happens at the correct page 
> > > table
> > > + * level.
> > > + * It is however a problem with gup_fast, or other places walking the 
> > > page
> > > + * tables w/o locks using READ_ONCE(), and passing down the pXd values 
> > > instead
> > > + * of pointers. In this case, the pointer given to pXd_offset() is a 
> > > pointer to
> > > + * a stack variable, which cannot be used for pointer iteration at the 
> > > correct
> > > + * level. Instead, the iteration then has to happen by going up to pgd 
> > > level
> > > + * again. To allow this, provide pXd_addr_end_folded() functions with an
> > > + * additional pXd value parameter, which can be used on s390 to 
> > > determine the
> > > + * folding level and return the corresponding boundary.
> > 
> > Ah OK, I finally see what you have in mind. And as Jason noted, if we just
> > pass an additional parameter to pXd_addr_end() that's going to be
> > cleaner. And doing so puts this in line with other page table
> > abstractions that also carry more information than some architectures
> > need. For example, on x86, set_pte_at() ignores the first two
> > parameters:
> > 
> > #define set_pte_at(mm, addr, ptep, pte) native_set_pte_at(mm, addr, 
> > ptep, pte)
> > 
> > static inline void native_set_pte_at(struct mm_struct *mm, unsigned long 
> > addr,
> >  pte_t *ptep , pte_t pte)
> > {
> > native_set_pte(ptep, pte);
> > }
> > 
> > This type of abstraction has worked out very well, IMHO.
> 
> Yes, it certainly feels like the right way to do it, and it would
> not affect other archs in a functional way. It would however introduce
> a subtle change for s390 behavior on _all_ page table walkers, not
> just the READ_ONCE gup_fast path, i.e. it changes the level at which
> the pointer iteration is done. Of course, that *should* not have any
> functional issues, or else it would also be broken in gup_fast, but
> in this area we often were wrong with should / could assumptions...

Hmm, not so sure about that "not affect other archs", that might also
be one of those *should*s. Consider this change to mm/mlock.c from
our current internal generalization work, for example:

diff --git a/mm/mlock.c b/mm/mlock.c
index 93ca2bf30b4f..dbde97f317d4 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -374,8 +374,12 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
struct vm_area_struct *vma, struct zone *zone,
unsigned long start, unsigned long end)
 {
-   pte_t *pte;
spinlock_t *ptl;
+   pte_t *pte;
+   pmd_t *pmd;
+   pud_t *pud;
+   p4d_t *p4d;
+   pgd_t *pgd;
 
/*
 * Initialize pte walk starting at the already pinned page where we
@@ -384,10 +388,14 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
 */
pte = get_locked_pte(vma->vm_mm, start, &ptl);
/* Make sure we do not cross the page table boundary */
-   end = pgd_addr_end(start, end);
-   end = p4d_addr_end(start, end);
-   end = pud_addr_end(start, end);
-   end = pmd_addr_end(start, end);
+   pgd = pgd_offset(vma->vm_mm, start);
+   end = pgd_addr_end(*pgd, start, end);
+   p4d = p4d_offset(pgd, start);
+   end = p4d_addr_end(*p4d, start, end);
+   pud = pud_offset(p4d, start);
+   end = pud_addr_end(*pud, start, end);
+   pmd = pmd_offset(pud, start);
+   end = pmd_addr_end(*pmd, start, end);
 
/* The page next to the pinned page is the first we will try to get */

Re: [RFC PATCH 0/2] mm/gup: fix gup_fast with dynamic page table folding

2020-09-02 Thread Gerald Schaefer
On Tue, 1 Sep 2020 16:22:22 -0700
John Hubbard  wrote:

> On 9/1/20 10:40 AM, Gerald Schaefer wrote:
> > On Mon, 31 Aug 2020 12:15:53 -0700
> > Andrew Morton  wrote:
> ...
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index e8cbc2e795d5..43dacbce823f 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -681,6 +681,38 @@ static inline int arch_unmap_one(struct mm_struct *mm,
> >   })
> >   #endif
> >   
> > +/*
> > + * With dynamic page table levels on s390, the static pXd_addr_end() 
> > functions
> > + * will not return corresponding dynamic boundaries. This is no problem as 
> > long
> > + * as only pXd pointers are passed down during page table walk, because
> > + * pXd_offset() will simply return the given pointer for folded levels, 
> > and the
> > + * pointer iteration over a range simply happens at the correct page table
> > + * level.
> > + * It is however a problem with gup_fast, or other places walking the page
> > + * tables w/o locks using READ_ONCE(), and passing down the pXd values 
> > instead
> > + * of pointers. In this case, the pointer given to pXd_offset() is a 
> > pointer to
> > + * a stack variable, which cannot be used for pointer iteration at the 
> > correct
> > + * level. Instead, the iteration then has to happen by going up to pgd 
> > level
> > + * again. To allow this, provide pXd_addr_end_folded() functions with an
> > + * additional pXd value parameter, which can be used on s390 to determine 
> > the
> > + * folding level and return the corresponding boundary.
> 
> Ah OK, I finally see what you have in mind. And as Jason noted, if we just
> pass an additional parameter to pXd_addr_end() that's going to be
> cleaner. And doing so puts this in line with other page table
> abstractions that also carry more information than some architectures
> need. For example, on x86, set_pte_at() ignores the first two
> parameters:
> 
> #define set_pte_at(mm, addr, ptep, pte)   native_set_pte_at(mm, addr, 
> ptep, pte)
> 
> static inline void native_set_pte_at(struct mm_struct *mm, unsigned long addr,
>pte_t *ptep , pte_t pte)
> {
>   native_set_pte(ptep, pte);
> }
> 
> This type of abstraction has worked out very well, IMHO.

Yes, it certainly feels like the right way to do it, and it would
not affect other archs in a functional way. It would however introduce
a subtle change for s390 behavior on _all_ page table walkers, not
just the READ_ONCE gup_fast path, i.e. it changes the level at which
the pointer iteration is done. Of course, that *should* not have any
functional issues, or else it would also be broken in gup_fast, but
in this area we often were wrong with should / could assumptions...

At least it could have some (minor) performance impact on s390,
due to going up to pgd level again instead of staying at the
folded level, but only for crossing pud/p4d boundaries, so that
*should* be neglectable.

So, while totally agreeing that adding the pXd parameter to
pXd_addr_end() in general looks like the cleanest way, we will at
least need to give this some proper testing on s390.

One other question that came up here while doing that generalized
approach was "how to submit it", i.e. should it be split up in
multiple patches which might be cleaner, or have it all-in-one,
which would simplify Fixes/stable tags. After all, we must not forget
that this fixes a severe data integrity issue on s390, so we don't
want to unnecessarily complicate or defer backports for that.

That being said, maybe a compromise could be to split it up
in a way that first introduces the pXd_addr_end_folded()
helpers for gup only, with proper Fixes/stable tags? Then
do the generalization with more thorough testing for s390
after that, possibly also with Fixes/stable tags so that we
do not have different code in stable and upstream?


Re: [RFC PATCH 0/2] mm/gup: fix gup_fast with dynamic page table folding

2020-09-01 Thread Gerald Schaefer
On Mon, 31 Aug 2020 12:15:53 -0700
Andrew Morton  wrote:

> On Mon, 31 Aug 2020 13:53:36 +0200 Christian Borntraeger 
>  wrote:
> 
> > 
> > 
> > On 28.08.20 16:03, Gerald Schaefer wrote:
> > have some feedback soon if option 1 or option 2 would be acceptable 
> > from a common code perspective. Andrew, who of the mm people would 
> > be the right one to decide?
> 
> Jason and John Hubbard are doing most of the work in there at present,
> 
> Both patches look OK to me from a non-s390 perspective.  Unless we plan
> to implement Jason's more-general approach this time, I'd be inclined
> to defer to the s390 people as to the preferred implementation.

My gut feeling would say that the gup_pXd_addr_end approach makes
us behave more like other archs, at least for gup_fast, which cannot
be so bad.

Regarding generalization, both approaches would introduce some specific
helpers for this "page table walking w/o lock using READ_ONCE", currently
only used by gup_fast. What we initially had in mind wrt generalization
was replacing e.g. the pXd_addr_end() completely, with the new version that
takes the additional pXd value parameter. That would of course be quite
intrusive, also affecting other arch code, so it is probably not what we
want to do in the first step.

To make it a bit more generic and also usable for future paths in
mm/pagewalk.c for example, we could however at least name the new helpers
differently, e.g. pXd_addr_end_folded instead of gup_pXd_addr_end, and
also add some comment on why / where they should be used, like this
(will send again as proper patch with description if agreed):

---
 arch/s390/include/asm/pgtable.h | 49 +
 include/linux/pgtable.h | 32 +
 mm/gup.c|  8 +++---
 3 files changed, 85 insertions(+), 4 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 7eb01a5459cd..d74257f2cb5b 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -512,6 +512,55 @@ static inline bool mm_pmd_folded(struct mm_struct *mm)
 }
 #define mm_pmd_folded(mm) mm_pmd_folded(mm)
 
+static inline unsigned long rste_addr_end_folded(unsigned long rste,
+						unsigned long addr, unsigned long end)
+{
+   unsigned int type = rste & _REGION_ENTRY_TYPE_MASK;
+   unsigned long size, mask, boundary;
+
+   switch (type) {
+   case _REGION_ENTRY_TYPE_R1:
+   size = PGDIR_SIZE;
+   mask = PGDIR_MASK;
+   break;
+   case _REGION_ENTRY_TYPE_R2:
+   size = P4D_SIZE;
+   mask = P4D_MASK;
+   break;
+   case _REGION_ENTRY_TYPE_R3:
+   size = PUD_SIZE;
+   mask = PUD_MASK;
+   break;
+   default:
+   BUG();
+   };
+
+   boundary = (addr + size) & mask;
+
+   return (boundary - 1) < (end - 1) ? boundary : end;
+}
+
+#define pgd_addr_end_folded pgd_addr_end_folded
+static inline unsigned long pgd_addr_end_folded(pgd_t pgd, unsigned long addr,
+   unsigned long end)
+{
+   return rste_addr_end_folded(pgd_val(pgd), addr, end);
+}
+
+#define p4d_addr_end_folded p4d_addr_end_folded
+static inline unsigned long p4d_addr_end_folded(p4d_t p4d, unsigned long addr,
+   unsigned long end)
+{
+   return rste_addr_end_folded(p4d_val(p4d), addr, end);
+}
+
+#define pud_addr_end_folded pud_addr_end_folded
+static inline unsigned long pud_addr_end_folded(pud_t pud, unsigned long addr,
+   unsigned long end)
+{
+   return rste_addr_end_folded(pud_val(pud), addr, end);
+}
+
 static inline int mm_has_pgste(struct mm_struct *mm)
 {
 #ifdef CONFIG_PGSTE
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e8cbc2e795d5..43dacbce823f 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -681,6 +681,38 @@ static inline int arch_unmap_one(struct mm_struct *mm,
 })
 #endif
 
+/*
+ * With dynamic page table levels on s390, the static pXd_addr_end() functions
+ * will not return corresponding dynamic boundaries. This is no problem as long
+ * as only pXd pointers are passed down during page table walk, because
+ * pXd_offset() will simply return the given pointer for folded levels, and the
+ * pointer iteration over a range simply happens at the correct page table
+ * level.
+ * It is however a problem with gup_fast, or other places walking the page
+ * tables w/o locks using READ_ONCE(), and passing down the pXd values instead
+ * of pointers. In this case, the pointer given to pXd_offset() is a pointer to
+ * a stack variable, which cannot be used for pointer iteration at the correct
+ * level. Instead, the iteration then has to happen by going up to pgd level
+ * again. To allow this, provide pXd_addr_end_folded() functions with an
+ * additional pXd value parameter, which can be used on s390 to determine the
+ * folding level and return the corresponding boundary.
+ */

Re: [RFC PATCH 0/2] mm/gup: fix gup_fast with dynamic page table folding

2020-08-28 Thread Gerald Schaefer
On Fri, 28 Aug 2020 11:21:37 -0300
Jason Gunthorpe  wrote:

> On Fri, Aug 28, 2020 at 04:03:12PM +0200, Gerald Schaefer wrote:
> > Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
> > code") introduced a subtle but severe bug on s390 with gup_fast, due to
> > dynamic page table folding.
> 
> I think the page walk code in mm/pagewalk.c has similar issues to
> GUP. I've been noodling on some patches to add the missing stack
> copies to pagewalk.c as they are clearly missing..
> 
> It would be good if this could be less GUP specific?
> 
> Generically this is about walking the page table without holding the
> page table spinlocks using READ_ONCE.

Indeed, if there were other code paths doing that, they would most
likely also be broken (at least) for s390. Alexander was already
looking into generalizing the new gup-specific helpers, but so
far we assumed that would only be "nice to have" for the future,
and not fix any real issues at the moment. So we wanted to
focus on first fixing the very real gup_fast issue.

Both approaches here probably could be generalized, by either
changing pXd_addr_end() or pXd_offset(), but I guess it makes
sense to already take into account that we might need such
generalization sooner than expected.

Just to make sure, you are referring to some future / planned
changes to mm/pagewalk.c, and not some currently existing
pagetable walkers already using the READ_ONCE logic w/o
spinlocks? If those would exist already, I guess we would
already have issues on s390, independent from our conversion
to common code gup_fast.

Regards,
Gerald


[RFC PATCH 0/2] mm/gup: fix gup_fast with dynamic page table folding

2020-08-28 Thread Gerald Schaefer
Hi,

Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
code") introduced a subtle but severe bug on s390 with gup_fast, due to
dynamic page table folding.

The question "What would it require for the generic code to work for s390"
has already been discussed here
https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
and ended with a promising approach here
https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1
which in the end unfortunately didn't quite work completely.

We tried to mimic static level folding by changing pgd_offset to always
dynamically calculate top level page table offset, and do nothing in folded
pXd_offset. What has been overlooked is that PxD_SIZE/MASK and thus
pXd_addr_end do not reflect this dynamic behaviour, and still act like
static 5-level page tables.

Here is an example of what happens with gup_fast on s390, for a task with
3-levels paging, crossing a 2 GB pud boundary:

// addr = 0x1007ffff000, end = 0x10080001000
static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
			 unsigned int flags, struct page **pages, int *nr)
{
	unsigned long next;
	pud_t *pudp;

	// pud_offset returns &p4d itself (a pointer to a value on stack)
	pudp = pud_offset(&p4d, addr);
	do {
		// on second iteration reading "random" stack value
		pud_t pud = READ_ONCE(*pudp);

		// next = 0x10080000000
		next = pud_addr_end(addr, end);
		...
	} while (pudp++, addr = next, addr != end); // pudp++ iterating over stack

	return 1;
}

pud_addr_end = 0x10080000000 is correct, but the previous pgd/p4d_addr_end
should also have returned that limit, instead of the 5-level static
pgd/p4d limits with PUD_SIZE/MASK != PGDIR_SIZE/MASK. Then the "end"
parameter for gup_pud_range would also have been 0x10080000000, and we
would not iterate further in gup_pud_range, but rather go back and
(correctly) do it in gup_pgd_range.

So, for the second iteration in gup_pud_range, we will increase pudp,
which pointed to a stack value and not the real pud table. This new pudp
will then point to whatever lies behind the p4d stack value. In general,
this happens to be the previously read pgd, but it probably could also
be something different, depending on compiler decisions.

Most unfortunately, if it happens to be the pgd value, which is the
same as the p4d / pud due to folding, it is a valid and present entry.
So after the increment, we would still point to the same pud entry.
The addr however has been increased in the second iteration, so that we
now have different pmd/pte_index values, which will result in very wrong
behaviour for the remaining gup_pmd/pte_range calls. We will effectively
operate on an address minus 2 GB, due to missing pudp increase.

In the "good case", if nothing is mapped there, we will fall back to
the slow gup path. But if something is mapped there, and valid
for gup_fast, we will end up (silently) getting references on the wrong
pages and also add the wrong pages to the **pages result array. This
can cause data corruption.

We came up with two possible fix-ups, both will introduce some gup-specific
helper functions, which will have no effect on other archs than s390.

Patch 1 is the solution that has already been discussed in
https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
It will additionally pass along pXd pointers in gup_pXd_range, and
also introduce pXd_offset_orig for s390, which takes an additional
pXd entry value parameter. This allows correctly returning / incrementing
pointers while still preserving the READ_ONCE logic for gup_fast.

Patch 2 would instead introduce new gup_pXd_addr_end helpers, which take
an additional pXd entry value parameter, that can be used on s390
to determine the correct page table level and return corresponding
end / boundary. With that, the pointer iteration will always
happen in gup_pgd_range.
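
For illustration, a rough sketch (not the actual diff, and simplified; huge
page handling is omitted) of what the gup_fast top-level loop looks like
with approach 2: the boundary is derived from the pgd value read via
READ_ONCE(), so on s390 the iteration over folded levels always stays at
this level:

static void gup_pgd_range(unsigned long addr, unsigned long end,
			  unsigned int flags, struct page **pages, int *nr)
{
	unsigned long next;
	pgd_t *pgdp;

	pgdp = pgd_offset(current->mm, addr);
	do {
		pgd_t pgd = READ_ONCE(*pgdp);

		/* boundary from the entry value, not from static PGDIR_SIZE/MASK */
		next = gup_pgd_addr_end(pgd, addr, end);
		if (pgd_none(pgd))
			return;
		if (!gup_p4d_range(pgd, addr, next, flags, pages, nr))
			return;
	} while (pgdp++, addr = next, addr != end);
}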

Comments / preferences are welcome. As a last resort, we might also
revert back to the s390-specific gup_fast code, if none of the options
are agreeable.

Regards,
Gerald


[RFC PATCH 1/2] mm/gup: fix gup_fast with dynamic page table folding

2020-08-28 Thread Gerald Schaefer
From: Vasily Gorbik 

Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
code") introduced a subtle but severe bug on s390 with gup_fast, due to
dynamic page table folding.

The question "What would it require for the generic code to work for s390"
has already been discussed here
https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
and ended with a promising approach here
https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1
which in the end unfortunately didn't quite work completely.

We tried to mimic static level folding by changing pgd_offset to always
calculate top level page table offset, and do nothing in folded pXd_offset.
What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do
not reflect this dynamic behaviour, and still act like static 5-level
page tables.

Here is an example of what happens with gup_fast on s390, for a task with
3-levels paging, crossing a 2 GB pud boundary:

// addr = 0x1007ffff000, end = 0x10080001000
static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
			 unsigned int flags, struct page **pages, int *nr)
{
	unsigned long next;
	pud_t *pudp;

	// pud_offset returns &p4d itself (a pointer to a value on stack)
	pudp = pud_offset(&p4d, addr);
	do {
		// on second iteration reading "random" stack value
		pud_t pud = READ_ONCE(*pudp);

		// next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390
		next = pud_addr_end(addr, end);
		...
	} while (pudp++, addr = next, addr != end); // pudp++ iterating over stack

	return 1;
}

pud_addr_end = 0x10080000000 is correct, but the previous pgd/p4d_addr_end
should also have returned that limit, instead of the 5-level static
pgd/p4d limits with PUD_SIZE/MASK != PGDIR_SIZE/MASK. Then the "end"
parameter for gup_pud_range would also have been 0x10080000000, and we
would not iterate further in gup_pud_range, but rather go back and
(correctly) do it in gup_pgd_range.

So, for the second iteration in gup_pud_range, we will increase pudp,
which pointed to a stack value and not the real pud table. This new pudp
will then point to whatever lies behind the p4d stack value. In general,
this happens to be the previously read pgd, but it probably could also
be something different, depending on compiler decisions.

Most unfortunately, if it happens to be the pgd value, which is the
same as the p4d / pud due to folding, it is a valid and present entry.
So after the increment, we would still point to the same pud entry.
The addr however has been increased in the second iteration, so that we
now have different pmd/pte_index values, which will result in very wrong
behaviour for the remaining gup_pmd/pte_range calls. We will effectively
operate on an address minus 2 GB, due to missing pudp increase.

In the "good case", if nothing is mapped there, we will fall back to
the slow gup path. But if something is mapped there, and valid
for gup_fast, we will end up (silently) getting references on the wrong
pages and also add the wrong pages to the **pages result array. This
can cause data corruption.

Fix this with an approach that has already been discussed in
https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
It will additionally pass along pXd pointers in gup_pXd_range, and
also introduce pXd_offset_orig for s390, which takes an additional
pXd entry value parameter. This allows returning correct pointers
while still preserving the READ_ONCE logic for gup_fast. No change
for other architectures introduced.

Cc:  # 5.2+
Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
Reviewed-by: Gerald Schaefer 
Reviewed-by: Alexander Gordeev 
Signed-off-by: Vasily Gorbik 
---
 arch/s390/include/asm/pgtable.h | 42 +++--
 include/linux/pgtable.h | 10 
 mm/gup.c| 18 +++---
 3 files changed, 49 insertions(+), 21 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 7eb01a5459cd..69a92e39d7b8 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1260,26 +1260,44 @@ static inline pgd_t *pgd_offset_raw(pgd_t *pgd, unsigned long address)
 
 #define pgd_offset(mm, address) pgd_offset_raw(READ_ONCE((mm)->pgd), address)
 
-static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
+static inline p4d_t *p4d_offset_orig(pgd_t *pgdp, pgd_t pgd, unsigned long address)
 {
-   if ((pgd_val(*pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1)
-   return (p4d_t *) pgd_deref(*pgd) + p4d_index(address);
-   return (p4d_t *) pgd;
+   if ((pgd_val(pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1)
+   return (p4d_t *) pgd_deref(pgd) + p4d_index(address);

[RFC PATCH 2/2] mm/gup: fix gup_fast with dynamic page table folding

2020-08-28 Thread Gerald Schaefer
From: Alexander Gordeev 

Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
code") introduced a subtle but severe bug on s390 with gup_fast, due to
dynamic page table folding.

The question "What would it require for the generic code to work for s390"
has already been discussed here
https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
and ended with a promising approach here
https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1
which in the end unfortunately didn't quite work completely.

We tried to mimic static level folding by changing pgd_offset to always
calculate top level page table offset, and do nothing in folded pXd_offset.
What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do
not reflect this dynamic behaviour, and still act like static 5-level
page tables.

Here is an example of what happens with gup_fast on s390, for a task with
3-levels paging, crossing a 2 GB pud boundary:

// addr = 0x1007ffff000, end = 0x10080001000
static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
			 unsigned int flags, struct page **pages, int *nr)
{
	unsigned long next;
	pud_t *pudp;

	// pud_offset returns &p4d itself (a pointer to a value on stack)
	pudp = pud_offset(&p4d, addr);
	do {
		// on second iteration reading "random" stack value
		pud_t pud = READ_ONCE(*pudp);

		// next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390
		next = pud_addr_end(addr, end);
		...
	} while (pudp++, addr = next, addr != end); // pudp++ iterating over stack

	return 1;
}

pud_addr_end = 0x10080000000 is correct, but the previous pgd/p4d_addr_end
should also have returned that limit, instead of the 5-level static
pgd/p4d limits with PUD_SIZE/MASK != PGDIR_SIZE/MASK. Then the "end"
parameter for gup_pud_range would also have been 0x10080000000, and we
would not iterate further in gup_pud_range, but rather go back and
(correctly) do it in gup_pgd_range.

So, for the second iteration in gup_pud_range, we will increase pudp,
which pointed to a stack value and not the real pud table. This new pudp
will then point to whatever lies behind the p4d stack value. In general,
this happens to be the previously read pgd, but it probably could also
be something different, depending on compiler decisions.

Most unfortunately, if it happens to be the pgd value, which is the
same as the p4d / pud due to folding, it is a valid and present entry.
So after the increment, we would still point to the same pud entry.
The addr however has been increased in the second iteration, so that we
now have different pmd/pte_index values, which will result in very wrong
behaviour for the remaining gup_pmd/pte_range calls. We will effectively
operate on an address minus 2 GB, due to missing pudp increase.

In the "good case", if nothing is mapped there, we will fall back to
the slow gup path. But if something is mapped there, and valid
for gup_fast, we will end up (silently) getting references on the wrong
pages and also add the wrong pages to the **pages result array. This
can cause data corruption.

Fix this by introducing new gup_pXd_addr_end helpers, which take an
additional pXd entry value parameter, that can be used on s390
to determine the correct page table level and return corresponding
end / boundary. With that, the pointer iteration will always
happen in gup_pgd_range for s390. No change for other architectures
introduced.

Cc:  # 5.2+
Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
Reviewed-by: Gerald Schaefer 
Signed-off-by: Alexander Gordeev 
---
 arch/s390/include/asm/pgtable.h | 49 +
 include/linux/pgtable.h | 16 +++
 mm/gup.c|  8 +++---
 3 files changed, 69 insertions(+), 4 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 7eb01a5459cd..1b8f461f5783 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -512,6 +512,55 @@ static inline bool mm_pmd_folded(struct mm_struct *mm)
 }
 #define mm_pmd_folded(mm) mm_pmd_folded(mm)
 
+static inline unsigned long gup_folded_addr_end(unsigned long rste,
+						unsigned long addr, unsigned long end)
+{
+   unsigned int type = rste & _REGION_ENTRY_TYPE_MASK;
+   unsigned long size, mask, boundary;
+
+   switch (type) {
+   case _REGION_ENTRY_TYPE_R1:
+   size = PGDIR_SIZE;
+   mask = PGDIR_MASK;
+   break;
+   case _REGION_ENTRY_TYPE_R2:
+   size = P4D_SIZE;
+   mask = P4D_MASK;
+   break;
+   case _REGION_ENTRY_TYPE_R3:
+   size = PUD_SIZE;
+   mask = PUD_MASK;

Re: [PATCH v4 18/26] mm/s390: Use general page fault accounting

2020-07-01 Thread Gerald Schaefer
On Tue, 30 Jun 2020 16:45:48 -0400
Peter Xu  wrote:

> Use the general page fault accounting by passing regs into handle_mm_fault().
> It naturally solve the issue of multiple page fault accounting when page fault
> retry happened.
> 
> CC: Heiko Carstens 
> CC: Vasily Gorbik 
> CC: Christian Borntraeger 
> CC: linux-s...@vger.kernel.org
> Signed-off-by: Peter Xu 
> ---
>  arch/s390/mm/fault.c | 16 +---
>  1 file changed, 1 insertion(+), 15 deletions(-)

Reviewed-by: Gerald Schaefer 
Acked-by: Gerald Schaefer 


Re: [PATCH 01/26] mm: Do page fault accounting in handle_mm_fault

2020-06-26 Thread Gerald Schaefer
On Wed, 24 Jun 2020 16:34:12 -0400
Peter Xu  wrote:

> On Wed, Jun 24, 2020 at 08:49:03PM +0200, Gerald Schaefer wrote:
> > On Fri, 19 Jun 2020 12:05:13 -0400
> > Peter Xu  wrote:
> > 
> > [...]
> > 
> > > @@ -4393,6 +4425,38 @@ vm_fault_t handle_mm_fault(struct vm_area_struct 
> > > *vma, unsigned long address,
> > >   mem_cgroup_oom_synchronize(false);
> > >   }
> > > 
> > > + if (ret & VM_FAULT_RETRY)
> > > + return ret;
> > 
> > I'm wondering if this also needs a check and exit for VM_FAULT_ERROR.
> > In arch code (s390 and all others I briefly checked), the accounting
> > was skipped for VM_FAULT_ERROR case.
> 
> Yes. I didn't explicitly add the check because I thought it's still OK to 
> count
> the error cases, especially after we've discussed about
> PERF_COUNT_SW_PAGE_FAULTS in v1.  So far, the major reason (iiuc) to have
> PERF_COUNT_SW_PAGE_FAULTS still in per-arch handlers is to also cover these
> corner cases like VM_FAULT_ERROR.  So to me it makes sense too to also count
> them in here.  But I agree it changes the old counting on most archs.

Having PERF_COUNT_SW_PAGE_FAULTS count everything including VM_FAULT_ERROR
is OK. Just major/minor accounting should be only about successes, IIRC from
v1 discussion.

The "new rules" also say

+*  - faults that never even got here (because the address
+*wasn't valid). That includes arch_vma_access_permitted()
+*failing above.

VM_FAULT_ERROR, and also the arch-specific VM_FAULT_BADxxx, qualify
as "address wasn't valid" I think, so they should not be counted as
major/minor.

IIRC from v1, and we want to only count success as major/minor, maybe
the rule could also be made more clear about that, e.g. like

+*  - unsuccessful faults (because the address wasn't valid)
+*do not count. That includes arch_vma_access_permitted()
+*failing above.

> 
> Again, I don't have strong opinion either on this, just like the same to
> PERF_COUNT_SW_PAGE_FAULTS...  But if no one disagree, I will change this to:
> 
>   if (ret & (VM_FAULT_RETRY | VM_FAULT_ERROR))
>   return ret;
> 
> So we try our best to follow the past.

Sounds good to me, and VM_FAULT_BADxxx should never show up here.

> 
> Btw, note that there will still be some even more special corner cases. E.g.,
> for ARM64 it's also not accounted for some ARM64 specific fault errors
> (VM_FAULT_BADMAP, VM_FAULT_BADACCESS).  So even if we don't count
> VM_FAULT_ERROR, we might still count these for ARM64.  We can try to redefine
> VM_FAULT_ERROR in ARM64 to cover all the arch-specific errors, however that
> seems an overkill to me sololy for fault accountings, so hopefully I can 
> ignore
> that difference.

Hmm, arm64 already does not count the VM_FAULT_BADxxx, but also does not
call handle_mm_fault() for those, so no change with this patch. arm (and
also unicore32) do count those, but also do not call handle_mm_fault(), so
the change would be that they lose that accounting, IIUC.

I agree that this probably can be ignored. The code in arm64 also looks
more recent, so it's probably just a left-over in arm/unicore32 code.

Regards,
Gerald


Re: [PATCH RFC] s390x/vmem: get rid of memory segment list

2020-06-26 Thread Gerald Schaefer
On Fri, 26 Jun 2020 19:22:53 +0200
Gerald Schaefer  wrote:

> On Thu, 25 Jun 2020 17:00:29 +0200
> David Hildenbrand  wrote:
> 
> > I can't come up with a satisfying reason why we still need the memory
> > segment list. We used to represent in the list:
> > - boot memory
> > - standby memory added via add_memory()
> > - loaded dcss segments
> > 
> > When loading/unloading dcss segments, we already track them in a
> > separate list and check for overlaps
> > (arch/s390/mm/extmem.c:segment_overlaps_others()) when loading segments.
> > 
> > The overlap check was introduced for some segments in
> > commit b2300b9efe1b ("[S390] dcssblk: add >2G DCSSs support and stacked
> > contiguous DCSSs support.")
> > and was extended to cover all dcss segments in
> > commit ca57114609d1 ("s390/extmem: remove code for 31 bit addressing
> > mode").
> > 
> > Although I doubt that overlaps with boot memory and standby memory
> > are relevant, let's reshuffle the checks in load_segment() to request
> > the resource first. This will bail out in case we have overlaps with
> > other resources (esp. boot memory and standby memory). The order
> > is now different compared to segment_unload() and segment_unload(), but
> > that should not matter.
> 
> You are right that this is ancient, but I think "overlaps with boot
> memory and standby memory" were very relevant, that might actually
> have been the initial reason for this in ancient times (but I do not
> really remember).
> 
> With DCSS you can define a fixed start address where the segment will
> be loaded into guest address space. The current code queries this address
> and directly gives it to vmem_add_mapping(), at least if there is no
> overlap with other DCSS segments. If there would be an overlap with
> boot memory, and not checked / rejected in vmem_add_mapping(), the
> following dcss_diag() would probably fail because AFAIR z/VM will
> not allow loading a DCSS with guest memory overlap. So far, so good,
> but the vmem_remove_mapping() on the exit path would then remove
> (part of) boot memory.
> 
> That being said, I think your patch prevents this by moving
> request_resource() up, so we should not call vmem_add_mapping()
> for such overlaps. I still want to give it some test, but need
> to find / setup systems with that ancient technology first :-)
> 

Verified with DCSS overlapping boot and standby memory, works fine.
As expected, the error message changes, but I don't think that is a
problem, as long as you also remove the old -ENOSPC case / comment
in arch/s390/mm/extmem.c. It is actually more correct now I guess,
-ENOSPC doesn't look like the correct return value anyway.

Thanks for cleaning up! Looks good to me, and removes > 100 LOC,
unless Heiko remembers some other issues from ancient times.

Reviewed-by: Gerald Schaefer 
Tested-by: Gerald Schaefer  [DCSS]


Re: [PATCH RFC] s390x/vmem: get rid of memory segment list

2020-06-26 Thread Gerald Schaefer
On Fri, 26 Jun 2020 19:22:53 +0200
Gerald Schaefer  wrote:

[...]
> 
> That should be OK, as it is also the same message that we already
> see for overlaps with other DCSSs. But you probably also should
> remove that case from the segment_warning() function for
> completeness.

... and also from the comment of segment_load()


Re: [PATCH RFC] s390x/vmem: get rid of memory segment list

2020-06-26 Thread Gerald Schaefer
On Thu, 25 Jun 2020 17:00:29 +0200
David Hildenbrand  wrote:

> I can't come up with a satisfying reason why we still need the memory
> segment list. We used to represent in the list:
> - boot memory
> - standby memory added via add_memory()
> - loaded dcss segments
> 
> When loading/unloading dcss segments, we already track them in a
> separate list and check for overlaps
> (arch/s390/mm/extmem.c:segment_overlaps_others()) when loading segments.
> 
> The overlap check was introduced for some segments in
> commit b2300b9efe1b ("[S390] dcssblk: add >2G DCSSs support and stacked
> contiguous DCSSs support.")
> and was extended to cover all dcss segments in
> commit ca57114609d1 ("s390/extmem: remove code for 31 bit addressing
> mode").
> 
> Although I doubt that overlaps with boot memory and standby memory
> are relevant, let's reshuffle the checks in load_segment() to request
> the resource first. This will bail out in case we have overlaps with
> other resources (esp. boot memory and standby memory). The order
> is now different compared to segment_unload() and segment_unload(), but
> that should not matter.

You are right that this is ancient, but I think "overlaps with boot
memory and standby memory" were very relevant, that might actually
have been the initial reason for this in ancient times (but I do not
really remember).

With DCSS you can define a fixed start address where the segment will
be loaded into guest address space. The current code queries this address
and directly gives it to vmem_add_mapping(), at least if there is no
overlap with other DCSS segments. If there would be an overlap with
boot memory, and not checked / rejected in vmem_add_mapping(), the
following dcss_diag() would probably fail because AFAIR z/VM will
not allow loading a DCSS with guest memory overlap. So far, so good,
but the vmem_remove_mapping() on the exit path would then remove
(part of) boot memory.

That being said, I think your patch prevents this by moving
request_resource() up, so we should not call vmem_add_mapping()
for such overlaps. I still want to give it some test, but need
to find / setup systems with that ancient technology first :-)

One change introduced by this patch is that we no longer
see -ENOSPC and the corresponding error message from extmem code:
"DCSS %s overlaps with used storage and cannot be loaded"

Instead, now we would see -EBUSY and this message:
"%s needs used memory resources and cannot be loaded or queried"

That should be OK, as it is also the same message that we already
see for overlaps with other DCSSs. But you probably also should
remove that case from the segment_warning() function for
completeness.

Regards,
Gerald


Re: [PATCH 18/26] mm/s390: Use general page fault accounting

2020-06-24 Thread Gerald Schaefer
On Fri, 19 Jun 2020 12:13:35 -0400
Peter Xu  wrote:

> Use the general page fault accounting by passing regs into handle_mm_fault().
> It naturally solve the issue of multiple page fault accounting when page fault
> retry happened.
> 
> CC: Heiko Carstens 
> CC: Vasily Gorbik 
> CC: Christian Borntraeger 
> CC: linux-s...@vger.kernel.org
> Signed-off-by: Peter Xu 
> ---
>  arch/s390/mm/fault.c | 16 +---
>  1 file changed, 1 insertion(+), 15 deletions(-)
> 
> diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
> index ab6d7eedcfab..4d62ca7d3e09 100644
> --- a/arch/s390/mm/fault.c
> +++ b/arch/s390/mm/fault.c
> @@ -479,7 +479,7 @@ static inline vm_fault_t do_exception(struct pt_regs 
> *regs, int access)
>* make sure we exit gracefully rather than endlessly redo
>* the fault.
>*/
> - fault = handle_mm_fault(vma, address, flags, NULL);
> + fault = handle_mm_fault(vma, address, flags, regs);
>   if (fault_signal_pending(fault, regs)) {
>   fault = VM_FAULT_SIGNAL;
>   if (flags & FAULT_FLAG_RETRY_NOWAIT)
> @@ -489,21 +489,7 @@ static inline vm_fault_t do_exception(struct pt_regs 
> *regs, int access)
>   if (unlikely(fault & VM_FAULT_ERROR))
>   goto out_up;

There are two cases here where we skipped the accounting,
fault_signal_pending() and VM_FAULT_ERROR, similar to other archs.

fault_signal_pending() should be ok, because that only seems to be true
for fault & VM_FAULT_RETRY, in which case the new approach also skips
the accounting.

But for VM_FAULT_ERROR, the new approach would do accounting, IIUC. Is
that changed on purpose? See also my reply on [PATCH 01/26].

Regards,
Gerald


Re: [PATCH 01/26] mm: Do page fault accounting in handle_mm_fault

2020-06-24 Thread Gerald Schaefer
On Fri, 19 Jun 2020 12:05:13 -0400
Peter Xu  wrote:

[...]

> @@ -4393,6 +4425,38 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, 
> unsigned long address,
>   mem_cgroup_oom_synchronize(false);
>   }
> 
> + if (ret & VM_FAULT_RETRY)
> + return ret;

I'm wondering if this also needs a check and exit for VM_FAULT_ERROR.
In arch code (s390 and all others I briefly checked), the accounting
was skipped for VM_FAULT_ERROR case.

> +
> + /*
> +  * Do accounting in the common code, to avoid unnecessary
> +  * architecture differences or duplicated code.
> +  *
> +  * We arbitrarily make the rules be:
> +  *
> +  *  - faults that never even got here (because the address
> +  *wasn't valid). That includes arch_vma_access_permitted()

Missing "do not count" at the end of the first sentence?

> +  *failing above.
> +  *
> +  *So this is expressly not a "this many hardware page
> +  *faults" counter. Use the hw profiling for that.
> +  *
> +  *  - incomplete faults (ie RETRY) do not count (see above).
> +  *They will only count once completed.
> +  *
> +  *  - the fault counts as a "major" fault when the final
> +  *successful fault is VM_FAULT_MAJOR, or if it was a
> +  *retry (which implies that we couldn't handle it
> +  *immediately previously).
> +  *
> +  *  - if the fault is done for GUP, regs wil be NULL and

wil -> will

Regards,
Gerald


Re: [PATCH V3 0/4] mm/debug_vm_pgtable: Add some more tests

2020-06-24 Thread Gerald Schaefer
On Wed, 24 Jun 2020 13:05:39 +0200
Alexander Gordeev  wrote:

> On Wed, Jun 24, 2020 at 08:43:10AM +0530, Anshuman Khandual wrote:
> 
> [...]
> 
> > Hello Gerald/Christophe/Vineet,
> > 
> > It would be really great if you could give this series a quick test
> > on s390/ppc/arc platforms respectively. Thank you.
> 
> That worked for me with the default and debug s390 configurations.
> Would you like to try with some particular options or combinations
> of the options?

It will be enabled automatically on all archs that set
ARCH_HAS_DEBUG_VM_PGTABLE, which we do for s390 unconditionally.
Also, DEBUG_VM has to be set, which we have only in the debug config.
So only the s390 debug config will have it enabled, you can check
dmesg for "debug_vm_pgtable" to see when / where it was run, and if it
triggered any warnings.

I also checked with the v3 series, and it works fine for s390.
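
For reference, a minimal config fragment that should give you the test on
such a build (option names as used upstream, DEBUG_VM_PGTABLE defaults to y
once DEBUG_VM is set):

	CONFIG_DEBUG_VM=y
	CONFIG_DEBUG_VM_PGTABLE=y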


Possible duplicate page fault accounting on some archs after commit 4064b9827063

2020-06-10 Thread Gerald Schaefer
Hi,

Some architectures have their page fault accounting code inside the fault
retry loop, and rely on only going through that code once. Before commit
4064b9827063 ("mm: allow VM_FAULT_RETRY for multiple times"), that was
ensured by testing for and clearing FAULT_FLAG_ALLOW_RETRY.

That commit had to remove the clearing of FAULT_FLAG_ALLOW_RETRY for all
architectures, and introduced a subtle change to page fault accounting
logic in the affected archs. It is now possible to go through the retry
loop multiple times, and the affected archs would then account multiple
page faults instead of just one.

This was found by coincidence in s390 code, and a quick check showed that
there are quite a lot of other architectures that seem to be affected in a
similar way. I'm preparing a fix for s390, by moving the accounting behind
the retry loop, similar to x86. It is not completely straight-forward, so
I leave the fix for other archs to the respective maintainers.
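
The general shape of such a fix, shown here only as a simplified sketch and
not as the actual s390 patch, is to do the accounting once after the retry
loop instead of inside it:

retry:
	fault = handle_mm_fault(vma, address, flags);
	/* signal / VM_FAULT_ERROR handling omitted */
	if (fault & VM_FAULT_RETRY) {
		flags |= FAULT_FLAG_TRIED;
		/* re-take mmap_sem etc. before retrying (omitted) */
		goto retry;
	}

	/* accounting now runs exactly once per completed fault */
	if (fault & VM_FAULT_MAJOR) {
		current->maj_flt++;
		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address);
	} else {
		current->min_flt++;
		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
	}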

Added the lists for possibly affected archs on cc, but no guarantee for
completeness.

Regards,
Gerald


Re: [PATCH v4 05/11] s390: Change s390_kernel_write() return type to match memcpy()

2020-05-07 Thread Gerald Schaefer
On Thu, 7 May 2020 15:36:48 +0200 (CEST)
Jiri Kosina  wrote:

> On Wed, 29 Apr 2020, Josh Poimboeuf wrote:
> 
> > s390_kernel_write()'s function type is almost identical to memcpy().
> > Change its return type to "void *" so they can be used interchangeably.
> > 
> > Cc: linux-s...@vger.kernel.org
> > Cc: heiko.carst...@de.ibm.com
> > Signed-off-by: Josh Poimboeuf 
> > Acked-by: Joe Lawrence 
> > Acked-by: Miroslav Benes 
> 
> Also for this one -- s390 folks, could you please provide your Ack for 
> taking things through livepatching.git as part of this series?
> 
> Thanks.
> 

Ah, forgot about that one, sorry. Also looks good.

Acked-by: Gerald Schaefer  # s390


Re: [PATCH v4 06/11] s390/module: Use s390_kernel_write() for late relocations

2020-05-07 Thread Gerald Schaefer
On Thu, 7 May 2020 12:00:13 +0200 (CEST)
Jiri Kosina  wrote:

> On Wed, 29 Apr 2020, Josh Poimboeuf wrote:
> 
> > From: Peter Zijlstra 
> > 
> > Because of late module patching, a livepatch module needs to be able to
> > apply some of its relocations well after it has been loaded.  Instead of
> > playing games with module_{dis,en}able_ro(), use existing text poking
> > mechanisms to apply relocations after module loading.
> > 
> > So far only x86, s390 and Power have HAVE_LIVEPATCH but only the first
> > two also have STRICT_MODULE_RWX.
> > 
> > This will allow removal of the last module_disable_ro() usage in
> > livepatch.  The ultimate goal is to completely disallow making
> > executable mappings writable.
> > 
> > [ jpoimboe: Split up patches.  Use mod state to determine whether
> > memcpy() can be used.  Test and add fixes. ]
> > 
> > Cc: linux-s...@vger.kernel.org
> > Cc: Heiko Carstens 
> > Cc: Gerald Schaefer 
> > Cc: Christian Borntraeger 
> > Suggested-by: Josh Poimboeuf 
> > Signed-off-by: Peter Zijlstra (Intel) 
> > Signed-off-by: Josh Poimboeuf 
> > Acked-by: Peter Zijlstra (Intel) 
> > Acked-by: Joe Lawrence 
> > Acked-by: Miroslav Benes 
> 
> Could we please get an Ack / Reviewed-by: for this patch from s390 folks?
> 
> Thanks,
> 

Looks pretty straightforward, and using s390_kernel_write() is OK, so

Acked-by: Gerald Schaefer  # s390


Re: [PATCH v2] s390: simplify memory notifier for protecting kdump crash kernel area

2020-04-30 Thread Gerald Schaefer
On Wed, 29 Apr 2020 16:55:38 +0200
David Hildenbrand  wrote:

> On 24.04.20 10:39, David Hildenbrand wrote:
> > Assume we have a crashkernel area of 256MB reserved:
> > 
> > root@vm0:~# cat /proc/iomem
> > 00000000-6fffffff : System RAM
> >   0f258000-0fcf : Kernel code
> >   0fd0-101d10e3 : Kernel data
> >   105b3000-1068dfff : Kernel bss
> > 70000000-7fffffff : Crash kernel
> > 
> > This exactly corresponds to memory block 7 (memory block size is 256MB).
> > Trying to offline that memory block results in:
> > 
> > root@vm0:~# echo "offline" > /sys/devices/system/memory/memory7/state
> > -bash: echo: write error: Device or resource busy
> > 
> > [  128.458762] page:03d081c0 refcount:1 mapcount:0 
> > mapping:d01cecd4 index:0x0
> > [  128.458773] flags: 0x10001000(reserved)
> > [  128.458781] raw: 10001000 03d081c8 03d081c8 
> > 
> > [  128.458781] raw:   0001 
> > 
> > [  128.458783] page dumped because: unmovable page
> > 
> > The craskernel area is marked reserved in the bootmem allocator. This
> > results in the memmap getting initialized (refcount=1, PG_reserved), but
> > the pages are never freed to the page allocator.
> > 
> > So these pages look like allocated pages that are unmovable (esp.
> > PG_reserved), and therefore, memory offlining fails early, when trying to
> > isolate the page range.
> > 
> > We only have to care about the exchange area, make that clear.
> > 
> > Cc: Heiko Carstens 
> > Cc: Vasily Gorbik 
> > Cc: Christian Borntraeger 
> > Cc: Martin Schwidefsky 
> > Cc: Philipp Rudo 
> > Cc: Gerald Schaefer 
> > Cc: Eric W. Biederman 
> > Cc: Michal Hocko 
> > Signed-off-by: David Hildenbrand 
> > ---
> > 
> > Follow up of:
> > - "[PATCH v1] s390: drop memory notifier for protecting kdump crash kernel
> >area"
> > 
> > v1 -> v2:
> > - Keep the notifier, check for exchange area only
> > 
> > ---
> >  arch/s390/kernel/setup.c | 13 +
> >  1 file changed, 5 insertions(+), 8 deletions(-)
> > 
> > diff --git a/arch/s390/kernel/setup.c b/arch/s390/kernel/setup.c
> > index 0f0b140b5558..c0881f0a3175 100644
> > --- a/arch/s390/kernel/setup.c
> > +++ b/arch/s390/kernel/setup.c
> > @@ -594,9 +594,10 @@ static void __init setup_memory_end(void)
> >  #ifdef CONFIG_CRASH_DUMP
> >  
> >  /*
> > - * When kdump is enabled, we have to ensure that no memory from
> > - * the area [0 - crashkernel memory size] and
> > - * [crashk_res.start - crashk_res.end] is set offline.
> > + * When kdump is enabled, we have to ensure that no memory from the area
> > + * [0 - crashkernel memory size] is set offline - it will be exchanged with
> > + * the crashkernel memory region when kdump is triggered. The crashkernel
> > + * memory region can never get offlined (pages are unmovable).
> >   */
> >  static int kdump_mem_notifier(struct notifier_block *nb,
> >   unsigned long action, void *data)
> > @@ -607,11 +608,7 @@ static int kdump_mem_notifier(struct notifier_block *nb,
> >     return NOTIFY_OK;
> > if (arg->start_pfn < PFN_DOWN(resource_size(&crashk_res)))
> > return NOTIFY_BAD;
> > -   if (arg->start_pfn > PFN_DOWN(crashk_res.end))
> > -   return NOTIFY_OK;
> > -   if (arg->start_pfn + arg->nr_pages - 1 < PFN_DOWN(crashk_res.start))
> > -   return NOTIFY_OK;
> > -   return NOTIFY_BAD;
> > +   return NOTIFY_OK;
> >  }
> >  
> >  static struct notifier_block kdump_mem_nb = {
> > 
> 
> Ping.
> 

Looks good, thanks.

Reviewed-by: Gerald Schaefer 


Re: [PATCH V2 2/2] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-18 Thread Gerald Schaefer
On Wed, 18 Sep 2019 18:26:03 +0200
Christophe Leroy  wrote:

[..] 
> My suggestion was not to completely drop the #ifdef but to do like you 
> did in pgd_clear_tests() for instance, ie to add the following test on 
> top of the function:
> 
>   if (mm_pud_folded(mm) || is_defined(__ARCH_HAS_5LEVEL_HACK))
>   return;
> 

Ah, very nice, this would also fix the remaining issues for s390. Since
we have dynamic page table folding, neither __PAGETABLE_PXX_FOLDED nor
__ARCH_HAS_XLEVEL_HACK is defined, but mm_pxx_folded() will work.

mm_alloc() returns with a 3-level page table by default on s390, so we
will run into issues in p4d_clear/populate_tests(), and also at the end
with p4d/pud_free() (double free).

So, adding the mm_pud_folded() check to p4d_clear/populate_tests(),
and also adding mm_p4d/pud_folded() checks at the end before calling
p4d/pud_free(), would make it all work on s390.
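
For illustration, a minimal sketch of that, assuming the test functions get
the mm passed down as in Christophe's suggestion quoted above:

static void p4d_clear_tests(struct mm_struct *mm, p4d_t *p4dp)
{
	if (mm_pud_folded(mm))
		return;

	memset(p4dp, RANDOM_NZVALUE, sizeof(p4d_t));
	p4d_clear(p4dp);
	WARN_ON(!p4d_none(READ_ONCE(*p4dp)));
}

and, at the end of the test run, something like:

	if (!mm_p4d_folded(mm))
		p4d_free(mm, saved_p4dp);
	if (!mm_pud_folded(mm))
		pud_free(mm, saved_pudp);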

BTW, regarding p4d/pud_free(), I'm not sure if we should rather check
the folding inside our s390 functions, similar to how we do it for
p4d/pud_free_tlb(), instead of relying on not being called for folded
p4d/pud. So far, I see no problem with this behavior, all callers of
p4d/pud_free() should be fine because of our folding check within
p4d/pud_present/none(). But that doesn't mean that it is correct not
to check for the folding inside p4d/pud_free(). At least, with this
test module we do now have a caller of p4d/pud_free() on potentially
folded entries, so instead of adding pxx_folded() checks to this
test module, we could add them to our p4d/pud_free() functions.
Any thoughts on this?
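
Just as an illustration, such a check on the s390 side could look roughly
like this (assuming crst_table_free() as the underlying free routine,
analogous for pud_free()):

static inline void p4d_free(struct mm_struct *mm, p4d_t *p4d)
{
	if (!mm_p4d_folded(mm))
		crst_table_free(mm, (unsigned long *) p4d);
}

That way, a caller passing a folded entry would simply see a no-op, same as
with p4d/pud_free_tlb() today.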

Regards,
Gerald



Re: [PATCH 1/1] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-09 Thread Gerald Schaefer
On Mon, 9 Sep 2019 11:56:50 +0530
Anshuman Khandual  wrote:

[..]
> > 
> > Hmm, I simply used this on my system to make pud_clear_tests() work, not
> > sure if it works on all archs:
> > 
> > pud_val(*pudp) |= RANDOM_NZVALUE;  
> 
> Which compiles on arm64 but then fails on x86 because of the way pmd_val()
> has been defined there. on arm64 and s390 (with many others) pmd_val() is
> a macro which still got the variable that can be used as lvalue but that is
> not true for some other platforms like x86.
> 
> arch/arm64/include/asm/pgtable-types.h:   #define pmd_val(x)  ((x).pmd)
> arch/s390/include/asm/page.h: #define pmd_val(x)  ((x).pmd)
> arch/x86/include/asm/pgtable.h:   #define pmd_val(x)   native_pmd_val(x)
> 
> static inline pmdval_t native_pmd_val(pmd_t pmd)
> {
> return pmd.pmd;
> }
> 
> Unless I am mistaken, the return value from this function can not be used as
> lvalue for future assignments.
> 
> mm/arch_pgtable_test.c: In function ‘pud_clear_tests’:
> mm/arch_pgtable_test.c:156:17: error: lvalue required as left operand of assignment
>   pud_val(*pudp) |= RANDOM_ORVALUE;
>  ^~
> AFAICS pxx_val() were never intended to be used as lvalue and using it that 
> way
> might just happen to work on all those platforms which define them as macros.
> They meant to just provide values for an entry as being determined by the 
> platform.
> 
> In principle pxx_val() on an entry was not supposed to be modified directly 
> from
> generic code without going through (again) platform helpers for any specific 
> state
> change (write, old, dirty, special, huge etc). The current use case is a 
> deviation
> for that.
> 
> I originally went with memset() just to load up the entries with non-zero 
> value so
> that we know pxx_clear() are really doing the clearing. The same is being 
> followed
> for all pxx_same() checks.
> 
> Another way for fixing the problem would be to mark them with known attributes
> like write/young/huge etc instead which for sure will create non-zero entries.
> We can do that for pxx_clear() and pxx_same() tests and drop RANDOM_NZVALUE
> completely. Does that sound good ?

Umm, not really. Those mkwrite/young/huge etc. helpers only exist for
page table levels where we can also have large mappings, at least on s390.
Also, on s390 we again do some sanity checks before actually setting
the bits.
Good news is that at least for the pxx_same() checks the memset() is no
problem, because pxx_same() does not do any checks other than the same check.

For the pxx_clear_tests(), maybe it could be an option to put them behind the
pxx_populate_tests(), and rely on them having properly populated (non-clear)
values after that?

[...]
> > 
> > Actually, using get_unmapped_area() as suggested by Kirill could also
> > solve this issue. We do create a new mm with 3-level page tables on s390,
> > and the dynamic upgrade to 4 or 5 levels is then triggered exactly by
> > arch_get_unmapped_area(), depending on the addr. But I currently don't
> > see how / where arch_get_unmapped_area() is set up for such a dummy mm
> > created by mm_alloc().  
> 
> Normally they are set during program loading but we can set it up explicitly
> for the test mm_struct if we need to but there are some other challenges.
> 
> load_[aout|elf|flat|..]_binary()
>   setup_new_exec()
>   arch_pick_mmap_layout().
> 
> I did some initial experiments around get_unmapped_area(). Seems bit tricky
> to get it working on a pure 'test' mm_struct. It expects a real user context
> in the form of current->mm.

Yes, that's where I stopped because it looked rather complicated :-)
Not sure why Kirill suggested it initially, but if using get_unmapped_area()
would only be necessary to get properly initialized page table levels
on s390, you could also defer this to a later add-on patch.

Regards,
Gerald



Re: [PATCH 1/1] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-06 Thread Gerald Schaefer
On Fri, 6 Sep 2019 11:58:59 +0530
Anshuman Khandual  wrote:

> On 09/05/2019 10:36 PM, Gerald Schaefer wrote:
> > On Thu, 5 Sep 2019 14:48:14 +0530
> > Anshuman Khandual  wrote:
> >   
> >>> [...]
> >>>> +
> >>>> +#if !defined(__PAGETABLE_PMD_FOLDED) && !defined(__ARCH_HAS_4LEVEL_HACK)
> >>>> +static void pud_clear_tests(pud_t *pudp)
> >>>> +{
> >>>> +memset(pudp, RANDOM_NZVALUE, sizeof(pud_t));
> >>>> +pud_clear(pudp);
> >>>> +WARN_ON(!pud_none(READ_ONCE(*pudp)));
> >>>> +}
> >>>
> >>> For pgd/p4d/pud_clear(), we only clear if the page table level is present
> >>> and not folded. The memset() here overwrites the table type bits, so
> >>> pud_clear() will not clear anything on s390 and the pud_none() check will
> >>> fail.
> >>> Would it be possible to OR a (larger) random value into the table, so that
> >>> the lower 12 bits would be preserved?
> >>
> >> So the suggestion is instead of doing memset() on entry with 
> >> RANDOM_NZVALUE,
> >> it should OR a large random value preserving lower 12 bits. Hmm, this 
> >> should
> >> still do the trick for other platforms, they just need non zero value. So 
> >> on
> >> s390, the lower 12 bits on the page table entry already has valid value 
> >> while
> >> entering this function which would make sure that pud_clear() really does
> >> clear the entry ?  
> > 
> > Yes, in theory the table entry on s390 would have the type set in the last
> > 4 bits, so preserving those would be enough. If it does not conflict with
> > others, I would still suggest preserving all 12 bits since those would 
> > contain
> > arch-specific flags in general, just to be sure. For s390, the pte/pmd tests
> > would also work with the memset, but for consistency I think the same logic
> > should be used in all pxd_clear_tests.  
> 
> Makes sense but..
> 
> There is a small challenge with this. Modifying individual bits on a given
> page table entry from generic code like this test case is bit tricky. That
> is because there are not enough helpers to create entries with an absolute
> value. This would have been easier if all the platforms provided functions
> like __pxx() which is not the case now. Otherwise something like this should
> have worked.
> 
> 
> pud_t pud = READ_ONCE(*pudp);
> pud = __pud(pud_val(pud) | RANDOM_VALUE (keeping lower 12 bits 0))
> WRITE_ONCE(*pudp, pud);
> 
> But __pud() will fail to build in many platforms.

Hmm, I simply used this on my system to make pud_clear_tests() work, not
sure if it works on all archs:

pud_val(*pudp) |= RANDOM_NZVALUE;

> 
> The other alternative will be to make sure memset() happens on all other
> bits except the lower 12 bits which will depend on endianness. If s390
> has a fixed endianness, we can still use either of them which will hold
> good for others as well.
> 
> memset(pudp, RANDOM_NZVALUE, sizeof(pud_t) - 3);
> 
> OR
> 
> memset(pudp + 3, RANDOM_NZVALUE, sizeof(pud_t) - 3);
> 
> > 
> > However, there is another issue on s390 which will make this only work
> > for pud_clear_tests(), and not for the p4d/pgd_tests. The problem is that
> > mm_alloc() will only give you a 3-level page table initially on s390.
> > This means that pudp == p4dp == pgdp, and so the p4d/pgd_tests will
> > both see the pud level (of course this also affects other tests).  
> 
> Got it.
> 
> > 
> > Not sure yet how to fix this, i.e. how to initialize/update the page table
> > to 5 levels. We can handle 5 level page tables, and it would be good if
> > all levels could be tested, but using mm_alloc() to establish the page
> > tables might not work on s390. One option could be to provide an arch-hook
> > or weak function to allocate/initialize the mm.  
> 
> Sure, got it. Though I plan to do add some arch specific tests or init 
> sequence
> like the above later on but for now the idea is to get the smallest possible 
> set
> of test cases which builds and runs on all platforms without requiring any 
> arch
> specific hooks or special casing (#ifdef) to be agreed upon broadly and 
> accepted.
> 
> Do you think this is absolutely necessary on s390 for the very first set of 
> test
> cases or we can add this later on as an improvement ?

It can be added later, no problem. I did not expect this to work flawlessly
on s390 right from the start anyway, with all our peculiarities, so don't
let this hinder you. I might come up with an add-on patch later.

Actually, using get_unmapped_area() as suggested by Kirill could also
solve this issue. We do create a new mm with 3-level page tables on s390,
and the dynamic upgrade to 4 or 5 levels is then triggered exactly by
arch_get_unmapped_area(), depending on the addr. But I currently don't
see how / where arch_get_unmapped_area() is set up for such a dummy mm
created by mm_alloc().

Regards,
Gerald



Re: [PATCH 1/1] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-05 Thread Gerald Schaefer
On Thu, 5 Sep 2019 14:48:14 +0530
Anshuman Khandual  wrote:

> > [...]  
> >> +
> >> +#if !defined(__PAGETABLE_PMD_FOLDED) && !defined(__ARCH_HAS_4LEVEL_HACK)
> >> +static void pud_clear_tests(pud_t *pudp)
> >> +{
> >> +  memset(pudp, RANDOM_NZVALUE, sizeof(pud_t));
> >> +  pud_clear(pudp);
> >> +  WARN_ON(!pud_none(READ_ONCE(*pudp)));
> >> +}  
> > 
> > For pgd/p4d/pud_clear(), we only clear if the page table level is present
> > and not folded. The memset() here overwrites the table type bits, so
> > pud_clear() will not clear anything on s390 and the pud_none() check will
> > fail.
> > Would it be possible to OR a (larger) random value into the table, so that
> > the lower 12 bits would be preserved?  
> 
> So the suggestion is instead of doing memset() on entry with RANDOM_NZVALUE,
> it should OR a large random value preserving lower 12 bits. Hmm, this should
> still do the trick for other platforms, they just need non zero value. So on
> s390, the lower 12 bits on the page table entry already has valid value while
> entering this function which would make sure that pud_clear() really does
> clear the entry ?

Yes, in theory the table entry on s390 would have the type set in the last
4 bits, so preserving those would be enough. If it does not conflict with
others, I would still suggest preserving all 12 bits since those would contain
arch-specific flags in general, just to be sure. For s390, the pte/pmd tests
would also work with the memset, but for consistency I think the same logic
should be used in all pxd_clear_tests.
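
One way to express that suggestion (sketch only, and ignoring for the
moment that pxx_val() cannot be used as an lvalue on all architectures,
as discussed elsewhere in this thread):

	/* OR-in only bits above the low 12, so the existing arch type/flag bits survive */
	#define RANDOM_ORVALUE	GENMASK(BITS_PER_LONG - 1, 12)

	pud_val(*pudp) |= RANDOM_ORVALUE;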

However, there is another issue on s390 which will make this only work
for pud_clear_tests(), and not for the p4d/pgd_tests. The problem is that
mm_alloc() will only give you a 3-level page table initially on s390.
This means that pudp == p4dp == pgdp, and so the p4d/pgd_tests will
both see the pud level (of course this also affects other tests).

Not sure yet how to fix this, i.e. how to initialize/update the page table
to 5 levels. We can handle 5 level page tables, and it would be good if
all levels could be tested, but using mm_alloc() to establish the page
tables might not work on s390. One option could be to provide an arch-hook
or weak function to allocate/initialize the mm.

IIUC, the (dummy) mm is really only needed to provide an mm->pgd as starting
point, right?

Regards,
Gerald



Re: [PATCH 1/1] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-04 Thread Gerald Schaefer
On Tue,  3 Sep 2019 13:31:46 +0530
Anshuman Khandual  wrote:

> This adds a test module which will validate architecture page table helpers
> and accessors regarding compliance with generic MM semantics expectations.
> This will help various architectures in validating changes to the existing
> page table helpers or addition of new ones.
> 
> Test page table and memory pages creating it's entries at various level are
> all allocated from system memory with required alignments. If memory pages
> with required size and alignment could not be allocated, then all depending
> individual tests are skipped.

This looks very useful, thanks. Of course, s390 is quite special and does
not work nicely with this patch (yet), mostly because of our dynamic page
table levels/folding. Still need to figure out what can be fixed in the arch
code and what would need to be changed in the test module. See below for some
generic comments/questions.

At least one real bug in the s390 code was already revealed by this, which
is very nice. In pmd/pud_bad(), we also check large pmds/puds for sanity,
instead of reporting them as bad, which is apparently not how it is expected.

[...]
> +/*
> + * Basic operations
> + *
> + * mkold(entry)  = An old and not a young entry
> + * mkyoung(entry)= A young and not an old entry
> + * mkdirty(entry)= A dirty and not a clean entry
> + * mkclean(entry)= A clean and not a dirty entry
> + * mkwrite(entry)= A write and not a write protected entry
> + * wrprotect(entry)  = A write protected and not a write entry
> + * pxx_bad(entry)= A mapped and non-table entry
> + * pxx_same(entry1, entry2)  = Both entries hold the exact same value
> + */
> +#define VADDR_TEST   (PGDIR_SIZE + PUD_SIZE + PMD_SIZE + PAGE_SIZE)

Why is P4D_SIZE missing in the VADDR_TEST calculation?

[...]
> +
> +#if !defined(__PAGETABLE_PMD_FOLDED) && !defined(__ARCH_HAS_4LEVEL_HACK)
> +static void pud_clear_tests(pud_t *pudp)
> +{
> + memset(pudp, RANDOM_NZVALUE, sizeof(pud_t));
> + pud_clear(pudp);
> + WARN_ON(!pud_none(READ_ONCE(*pudp)));
> +}

For pgd/p4d/pud_clear(), we only clear if the page table level is present
and not folded. The memset() here overwrites the table type bits, so
pud_clear() will not clear anything on s390 and the pud_none() check will
fail.
Would it be possible to OR a (larger) random value into the table, so that
the lower 12 bits would be preserved?

> +
> +static void pud_populate_tests(struct mm_struct *mm, pud_t *pudp, pmd_t *pmdp)
> +{
> + /*
> +  * This entry points to next level page table page.
> +  * Hence this must not qualify as pud_bad().
> +  */
> + pmd_clear(pmdp);
> + pud_clear(pudp);
> + pud_populate(mm, pudp, pmdp);
> + WARN_ON(pud_bad(READ_ONCE(*pudp)));
> +}

This will populate the pud with a pmd pointer that does not point to the
beginning of the pmd table, but to the second entry (because of how
VADDR_TEST is constructed). This will result in failing pud_bad() check
on s390. Not sure why/how it works on other archs, but would it be possible
to align pmdp down to the beginning of the pmd table (and similar for the
other pxd_populate_tests)?
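
E.g. something like this (only a sketch, assuming a pmd table occupies
exactly one page; on s390 the mask would have to cover our larger pmd
tables):

static void pud_populate_tests(struct mm_struct *mm, pud_t *pudp, pmd_t *pmdp)
{
	/* align down to the (assumed) start of the pmd table */
	pmd_t *pmd_start = (pmd_t *)((unsigned long)pmdp & PAGE_MASK);

	pmd_clear(pmdp);
	pud_clear(pudp);
	pud_populate(mm, pudp, pmd_start);
	WARN_ON(pud_bad(READ_ONCE(*pudp)));
}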

[...]
> +
> + p4d_free(mm, saved_p4dp);
> + pud_free(mm, saved_pudp);
> + pmd_free(mm, saved_pmdp);
> + pte_free(mm, (pgtable_t) virt_to_page(saved_ptep));

pgtable_t is arch-specific, and on s390 it is not a struct page pointer,
but a pte pointer. So this will also go wrong on all other archs (if any)
where pgtable_t is not a struct page pointer.
Would it be possible to use pte_free_kernel() instead, and just pass
saved_ptep directly?
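
I.e. the teardown would then simply look like this (sketch):

	p4d_free(mm, saved_p4dp);
	pud_free(mm, saved_pudp);
	pmd_free(mm, saved_pmdp);
	pte_free_kernel(mm, saved_ptep);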

Regards,
Gerald



Re: [PATCH] mm: replace is_zero_pfn with is_huge_zero_pmd for thp

2019-08-26 Thread Gerald Schaefer
On Mon, 26 Aug 2019 06:18:58 -0700
Matthew Wilcox  wrote:

> Why did you not cc Gerald who wrote the patch?  You can't just
> run get_maintainers.pl and call it good.
> 
> On Sun, Aug 25, 2019 at 02:06:21PM -0600, Yu Zhao wrote:
> > For hugely mapped thp, we use is_huge_zero_pmd() to check if it's
> > zero page or not.
> > 
> > We do fill ptes with my_zero_pfn() when we split zero thp pmd, but
> >  this is not what we have in vm_normal_page_pmd().
> > pmd_trans_huge_lock() makes sure of it.
> > 
> > This is a trivial fix for /proc/pid/numa_maps, and AFAIK nobody
> > complains about it.
> > 
> > Signed-off-by: Yu Zhao 
> > ---
> >  mm/memory.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/mm/memory.c b/mm/memory.c
> > index e2bb51b6242e..ea3c74855b23 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -654,7 +654,7 @@ struct page *vm_normal_page_pmd(struct vm_area_struct 
> > *vma, unsigned long addr,
> >  
> > if (pmd_devmap(pmd))
> > return NULL;
> > -   if (is_zero_pfn(pfn))
> > +   if (is_huge_zero_pmd(pmd))
> > return NULL;
> > if (unlikely(pfn > highest_memmap_pfn))
> > return NULL;
> > -- 
> > 2.23.0.187.g17f5b7556c-goog
> >   

Looks good to me. The "_pmd" versions of can_gather_numa_stats() and
vm_normal_page() were introduced to avoid using pte_present/dirty() on
pmds, and that part is not affected by this patch.

In fact, for vm_normal_page_pmd() I basically copied most of the code
from vm_normal_page(), including the is_zero_pfn(pfn) check, which does
look wrong to me now. Using is_huge_zero_pmd() should be correct.

Maybe the description could also mention the symptom of this bug?
I would assume that it affects anon/dirty accounting in gather_pte_stats(),
for huge mappings, if zero page mappings are not correctly recognized.

Regards,
Gerald



Re: [PATCH AUTOSEL 4.14 62/95] mm, memory_hotplug: initialize struct pages for the full memory section

2019-05-08 Thread Gerald Schaefer
On Tue, 7 May 2019 13:18:06 -0400
Sasha Levin  wrote:

> On Tue, May 07, 2019 at 10:15:19AM -0700, Linus Torvalds wrote:
> >On Tue, May 7, 2019 at 10:02 AM Sasha Levin  wrote:
> >>
> >> I got it wrong then. I'll fix it up and get efad4e475c31 in instead.
> >
> >Careful. That one had a bug too, and we have 891cb2a72d82 ("mm,
> >memory_hotplug: fix off-by-one in is_pageblock_removable").
> >
> >All of these were *horribly* and subtly buggy, and might be
> >intertwined with other issues. And only trigger on a few specific
> >machines where the memory map layout is just right to trigger some
> >special case or other, and you have just the right config.
> >
> >It might be best to verify with Michal Hocko. Michal?
> 
> Michal, is there a testcase I can plug into kselftests to make sure we
> got this right (and don't regress)? We care a lot about memory hotplug
> working right.

We hit the panics on s390 with a special z/VM memory layout, but they both
can be triggered simply by using the mem= kernel parameter (and
CONFIG_DEBUG_VM_PGFLAGS=y).

With "mem=3075M" (and w/o the commits efad4e475c31 + 24feb47c5fa5), it
can be triggered by reading from
/sys/devices/system/memory/memory<N>/valid_zones, or from
/sys/devices/system/memory/memory<N>/removable, with <N> being the last
memory block.

This is with 256MB section size and memory block size. On LPAR, with
256MB section size and 1GB memory block size, for some reason the
"removable" issue doesn't trigger, only the "valid_zones" issue.

Using lsmem will also trigger it, as it reads both the valid_zones and
the removable attribute for all memory blocks. So, a test with
not-section-aligned mem= parameter and using lsmem could be an option.
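
Just as an illustration, the read part of such a test could be as simple as
this (untested sketch, a shell script around lsmem would of course also do):

#include <dirent.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	const char *base = "/sys/devices/system/memory";
	const char *attrs[] = { "valid_zones", "removable" };
	char path[256], buf[256];
	struct dirent *de;
	DIR *dir = opendir(base);
	FILE *f;
	int i;

	if (!dir)
		return 1;
	while ((de = readdir(dir))) {
		/* only look at the memory<N> block directories */
		if (strncmp(de->d_name, "memory", 6) != 0)
			continue;
		for (i = 0; i < 2; i++) {
			snprintf(path, sizeof(path), "%s/%s/%s",
				 base, de->d_name, attrs[i]);
			f = fopen(path, "r");
			if (!f)
				continue;
			if (fgets(buf, sizeof(buf), f))
				printf("%s %s: %s", de->d_name, attrs[i], buf);
			fclose(f);
		}
	}
	closedir(dir);
	return 0;
}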

Regards,
Gerald



Re: [PATCH AUTOSEL 4.14 62/95] mm, memory_hotplug: initialize struct pages for the full memory section

2019-05-07 Thread Gerald Schaefer
On Tue, 7 May 2019 13:02:08 -0400
Sasha Levin  wrote:

> On Tue, May 07, 2019 at 09:50:50AM -0700, Linus Torvalds wrote:
> >On Tue, May 7, 2019 at 9:31 AM Alexander Duyck
> > wrote:
> >>
> >> Wasn't this patch reverted in Linus's tree for causing a regression on
> >> some platforms? If so I'm not sure we should pull this in as a
> >> candidate for stable should we, or am I missing something?
> >
> >Good catch. It was reverted in commit 4aa9fc2a435a ("Revert "mm,
> >memory_hotplug: initialize struct pages for the full memory
> >section"").
> >
> >We ended up with efad4e475c31 ("mm, memory_hotplug:
> >is_mem_section_removable do not pass the end of a zone") instead (and
> >possibly others - this was just from looking for commit messages that
> >mentioned that reverted commit).
> 
> I got it wrong then. I'll fix it up and get efad4e475c31 in instead.

There were two commits replacing the reverted commit, fixing
is_mem_section_removable() and test_pages_in_a_zone() respectively:

commit 24feb47c5fa5 ("mm, memory_hotplug: test_pages_in_a_zone do not
pass the end of zone")
commit efad4e475c31 ("mm, memory_hotplug: is_mem_section_removable do
not pass the end of a zone")



Re: [PATCH 0/2] mm, memory_hotplug: fix uninitialized pages fallouts.

2019-01-29 Thread Gerald Schaefer
On Tue, 29 Jan 2019 14:49:20 +0100
Michal Hocko  wrote:

> On Tue 29-01-19 14:14:47, Gerald Schaefer wrote:
> > On Mon, 28 Jan 2019 15:45:04 +0100
> > Michal Hocko  wrote:
> > 
> > > Hi,
> > > Mikhail has posted fixes for the two bugs quite some time ago [1]. I
> > > have pushed back on those fixes because I believed that it is much
> > > better to plug the problem at the initialization time rather than play
> > > whack-a-mole all over the hotplug code and find all the places which
> > > expect the full memory section to be initialized. We have ended up with
> > > 2830bf6f05fb ("mm, memory_hotplug: initialize struct pages for the full
> > > memory section") merged and cause a regression [2][3]. The reason is
> > > that there might be memory layouts when two NUMA nodes share the same
> > > memory section so the merged fix is simply incorrect.
> > > 
> > > In order to plug this hole we really have to be zone range aware in
> > > those handlers. I have split up the original patch into two. One is
> > > unchanged (patch 2) and I took a different approach for `removable'
> > > crash. It would be great if Mikhail could test it still works for his
> > > memory layout.
> > > 
> > > [1] http://lkml.kernel.org/r/20181105150401.97287-2-zaslo...@linux.ibm.com
> > > [2] https://bugzilla.redhat.com/show_bug.cgi?id=1666948
> > > [3] http://lkml.kernel.org/r/20190125163938.ga20...@dhcp22.suse.cz
> > 
> > I verified that both patches fix the issues we had with valid_zones
> > (with mem=2050M) and removable (with mem=3075M).
> > 
> > However, the call trace in the description of your patch 1 is wrong.
> > You basically have the same call trace for test_pages_in_a_zone in
> > both patches. The "removable" patch should have the call trace for
> > is_mem_section_removable from Mikhails original patches:
> 
> Thanks for testing. Can I use you Tested-by?

Sure, forgot to add this:
Tested-by: Gerald Schaefer 

> 
> >  CONFIG_DEBUG_VM_PGFLAGS=y
> >  kernel parameter mem=3075M
> >  --
> >  page:03d08300c000 is uninitialized and poisoned
> >  page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> >  Call Trace:
> >  ([<0038596c>] is_mem_section_removable+0xb4/0x190)
> >   [<008f12fa>] show_mem_removable+0x9a/0xd8
> >   [<008cf9c4>] dev_attr_show+0x34/0x70
> >   [<00463ad0>] sysfs_kf_seq_show+0xc8/0x148
> >   [<003e4194>] seq_read+0x204/0x480
> >   [<003b53ea>] __vfs_read+0x32/0x178
> >   [<003b55b2>] vfs_read+0x82/0x138
> >   [<003b5be2>] ksys_read+0x5a/0xb0
> >   [<00b86ba0>] system_call+0xdc/0x2d8
> >  Last Breaking-Event-Address:
> >   [<0038596c>] is_mem_section_removable+0xb4/0x190
> >  Kernel panic - not syncing: Fatal exception: panic_on_oops
> 
> Yeah, this is c mistake on my end. I will use this trace instead.
> Thanks for spotting.



Re: [PATCH 0/2] mm, memory_hotplug: fix uninitialized pages fallouts.

2019-01-29 Thread Gerald Schaefer
On Mon, 28 Jan 2019 15:45:04 +0100
Michal Hocko  wrote:

> Hi,
> Mikhail has posted fixes for the two bugs quite some time ago [1]. I
> have pushed back on those fixes because I believed that it is much
> better to plug the problem at the initialization time rather than play
> whack-a-mole all over the hotplug code and find all the places which
> expect the full memory section to be initialized. We have ended up with
> 2830bf6f05fb ("mm, memory_hotplug: initialize struct pages for the full
> memory section") merged and cause a regression [2][3]. The reason is
> that there might be memory layouts when two NUMA nodes share the same
> memory section so the merged fix is simply incorrect.
> 
> In order to plug this hole we really have to be zone range aware in
> those handlers. I have split up the original patch into two. One is
> unchanged (patch 2) and I took a different approach for `removable'
> crash. It would be great if Mikhail could test it still works for his
> memory layout.
> 
> [1] http://lkml.kernel.org/r/20181105150401.97287-2-zaslo...@linux.ibm.com
> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1666948
> [3] http://lkml.kernel.org/r/20190125163938.ga20...@dhcp22.suse.cz

I verified that both patches fix the issues we had with valid_zones
(with mem=2050M) and removable (with mem=3075M).

However, the call trace in the description of your patch 1 is wrong.
You basically have the same call trace for test_pages_in_a_zone in
both patches. The "removable" patch should have the call trace for
is_mem_section_removable from Mikhails original patches:

 CONFIG_DEBUG_VM_PGFLAGS=y
 kernel parameter mem=3075M
 --
 page:03d08300c000 is uninitialized and poisoned
 page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
 Call Trace:
 ([<0038596c>] is_mem_section_removable+0xb4/0x190)
  [<008f12fa>] show_mem_removable+0x9a/0xd8
  [<008cf9c4>] dev_attr_show+0x34/0x70
  [<00463ad0>] sysfs_kf_seq_show+0xc8/0x148
  [<003e4194>] seq_read+0x204/0x480
  [<003b53ea>] __vfs_read+0x32/0x178
  [<003b55b2>] vfs_read+0x82/0x138
  [<003b5be2>] ksys_read+0x5a/0xb0
  [<00b86ba0>] system_call+0xdc/0x2d8
 Last Breaking-Event-Address:
  [<0038596c>] is_mem_section_removable+0xb4/0x190
 Kernel panic - not syncing: Fatal exception: panic_on_oops



[PATCH] iommu/intel-iommu: fix memory leak in intel_iommu_put_resv_regions()

2019-01-16 Thread Gerald Schaefer
Commit 9d3a4de4cb8d ("iommu: Disambiguate MSI region types") changed
the reserved region type in intel_iommu_get_resv_regions() from
IOMMU_RESV_RESERVED to IOMMU_RESV_MSI, but it forgot to also change
the type in intel_iommu_put_resv_regions().

This leads to a memory leak, because now the check in
intel_iommu_put_resv_regions() for IOMMU_RESV_RESERVED will never
be true, and no allocated regions will be freed.

Fix this by changing the region type in intel_iommu_put_resv_regions()
to IOMMU_RESV_MSI, matching the type of the allocated regions.

Fixes: 9d3a4de4cb8d ("iommu: Disambiguate MSI region types")
Cc:  # v4.11+
Signed-off-by: Gerald Schaefer 
---
 drivers/iommu/intel-iommu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 048b5ab36a02..b83e0f2025bb 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -5295,7 +5295,7 @@ static void intel_iommu_put_resv_regions(struct device 
*dev,
struct iommu_resv_region *entry, *next;
 
list_for_each_entry_safe(entry, next, head, list) {
-   if (entry->type == IOMMU_RESV_RESERVED)
+   if (entry->type == IOMMU_RESV_MSI)
kfree(entry);
}
 }
-- 
2.16.4



Re: [PATCH v1] iommu/s390: Declare s390 iommu reserved regions

2019-01-15 Thread Gerald Schaefer
On Tue, 15 Jan 2019 18:37:30 +0100
Pierre Morel  wrote:

> The s390 iommu can only allow DMA transactions between the zPCI device
> entries start_dma and end_dma.
> 
> Let's declare the regions before start_dma and after end_dma as
> reserved regions using the appropriate callback in iommu_ops.
> 
> Signed-off-by: Pierre Morel 
> ---
>  drivers/iommu/s390-iommu.c | 29 +
>  1 file changed, 29 insertions(+)
> 
> diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c
> index 22d4db3..5ca91a1 100644
> --- a/drivers/iommu/s390-iommu.c
> +++ b/drivers/iommu/s390-iommu.c
> @@ -363,6 +363,33 @@ void zpci_destroy_iommu(struct zpci_dev *zdev)
>   iommu_device_sysfs_remove(&zdev->iommu_dev);
>  }
> 
> +static void s390_get_resv_regions(struct device *dev, struct list_head *head)
> +{
> + struct iommu_resv_region *region;
> + struct zpci_dev *zdev = to_pci_dev(dev)->sysdata;
> +
> + region = iommu_alloc_resv_region(0, zdev->start_dma,
> +  0, IOMMU_RESV_RESERVED);
> + if (!region)
> + return;
> + list_add_tail(&region->list, head);
> +
> + region = iommu_alloc_resv_region(zdev->end_dma + 1,
> +  ~0UL - zdev->end_dma,
> +  0, IOMMU_RESV_RESERVED);

Can you guarantee that start_dma will never be 0 and end_dma never ~0UL,
even with future HW?

In any of these cases, your code would reserve strange ranges, and sysfs
would report broken reserved ranges.

Maybe add a check for start_dma > 0 and end_dma < ULONG_MAX?
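
I.e. something like this (only a sketch):

	if (zdev->start_dma) {
		region = iommu_alloc_resv_region(0, zdev->start_dma, 0,
						 IOMMU_RESV_RESERVED);
		if (!region)
			return;
		list_add_tail(&region->list, head);
	}

	if (zdev->end_dma < ULONG_MAX) {
		region = iommu_alloc_resv_region(zdev->end_dma + 1,
						 ULONG_MAX - zdev->end_dma,
						 0, IOMMU_RESV_RESERVED);
		if (!region)
			return;
		list_add_tail(&region->list, head);
	}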

> + if (!region)
> + return;
> + list_add_tail(&region->list, head);
> +}
> +
> +static void s390_put_resv_regions(struct device *dev, struct list_head *head)
> +{
> + struct iommu_resv_region *entry, *next;
> +
> + list_for_each_entry_safe(entry, next, head, list)
> + kfree(entry);
> +}

It looks very wrong that there is no matching list_del() for the previous
list_add_tail(). However, it seems to be done like this everywhere else,
and the calling functions (currently) only use temporary list_heads as
far as I can see, so I guess it should be OK (for now).

Still, a list_del() would be nice :-)
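
I.e. (sketch):

	list_for_each_entry_safe(entry, next, head, list) {
		list_del(&entry->list);
		kfree(entry);
	}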

> +
>  static const struct iommu_ops s390_iommu_ops = {
>   .capable = s390_iommu_capable,
>   .domain_alloc = s390_domain_alloc,
> @@ -376,6 +403,8 @@ static const struct iommu_ops s390_iommu_ops = {
>   .remove_device = s390_iommu_remove_device,
>   .device_group = generic_device_group,
>   .pgsize_bitmap = S390_IOMMU_PGSIZES,
> + .get_resv_regions = s390_get_resv_regions,
> + .put_resv_regions = s390_put_resv_regions,
>  };
> 
>  static int __init s390_iommu_init(void)

With the start/end_dma issue addressed (if necessary):
Acked-by: Gerald Schaefer 



Re: [PATCH v2 1/1] mm, memory_hotplug: Initialize struct pages for the full memory section

2018-12-14 Thread Gerald Schaefer
On Fri, 14 Dec 2018 16:49:14 +0100
David Hildenbrand  wrote:

> On 14.12.18 16:22, David Hildenbrand wrote:
> > On 12.12.18 18:27, Mikhail Zaslonko wrote:  
> >> If memory end is not aligned with the sparse memory section boundary, the
> >> mapping of such a section is only partly initialized. This may lead to
> >> VM_BUG_ON due to uninitialized struct page access from
> >> is_mem_section_removable() or test_pages_in_a_zone() function triggered by
> >> memory_hotplug sysfs handlers:
> >>
> >> Here are the the panic examples:
> >>  CONFIG_DEBUG_VM=y
> >>  CONFIG_DEBUG_VM_PGFLAGS=y
> >>
> >>  kernel parameter mem=2050M
> >>  --
> >>  page:03d082008000 is uninitialized and poisoned
> >>  page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> >>  Call Trace:
> >>  ([<00385b26>] test_pages_in_a_zone+0xde/0x160)
> >>   [<008f15c4>] show_valid_zones+0x5c/0x190
> >>   [<008cf9c4>] dev_attr_show+0x34/0x70
> >>   [<00463ad0>] sysfs_kf_seq_show+0xc8/0x148
> >>   [<003e4194>] seq_read+0x204/0x480
> >>   [<003b53ea>] __vfs_read+0x32/0x178
> >>   [<003b55b2>] vfs_read+0x82/0x138
> >>   [<003b5be2>] ksys_read+0x5a/0xb0
> >>   [<00b86ba0>] system_call+0xdc/0x2d8
> >>  Last Breaking-Event-Address:
> >>   [<00385b26>] test_pages_in_a_zone+0xde/0x160
> >>  Kernel panic - not syncing: Fatal exception: panic_on_oops
> >>
> >>  kernel parameter mem=3075M
> >>  --
> >>  page:03d08300c000 is uninitialized and poisoned
> >>  page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> >>  Call Trace:
> >>  ([<0038596c>] is_mem_section_removable+0xb4/0x190)
> >>   [<008f12fa>] show_mem_removable+0x9a/0xd8
> >>   [<008cf9c4>] dev_attr_show+0x34/0x70
> >>   [<00463ad0>] sysfs_kf_seq_show+0xc8/0x148
> >>   [<003e4194>] seq_read+0x204/0x480
> >>   [<003b53ea>] __vfs_read+0x32/0x178
> >>   [<003b55b2>] vfs_read+0x82/0x138
> >>   [<003b5be2>] ksys_read+0x5a/0xb0
> >>   [<00b86ba0>] system_call+0xdc/0x2d8
> >>  Last Breaking-Event-Address:
> >>   [<0038596c>] is_mem_section_removable+0xb4/0x190
> >>  Kernel panic - not syncing: Fatal exception: panic_on_oops
> >>
> >> Fix the problem by initializing the last memory section of each zone
> >> in memmap_init_zone() till the very end, even if it goes beyond the zone
> >> end.
> >>
> >> Signed-off-by: Mikhail Zaslonko 
> >> Reviewed-by: Gerald Schaefer 
> >> Cc: 
> >> ---
> >>  mm/page_alloc.c | 12 
> >>  1 file changed, 12 insertions(+)
> >>
> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >> index 2ec9cc407216..e2afdb2dc2c5 100644
> >> --- a/mm/page_alloc.c
> >> +++ b/mm/page_alloc.c
> >> @@ -5542,6 +5542,18 @@ void __meminit memmap_init_zone(unsigned long size, 
> >> int nid, unsigned long zone,
> >>cond_resched();
> >>}
> >>}
> >> +#ifdef CONFIG_SPARSEMEM
> >> +  /*
> >> +   * If the zone does not span the rest of the section then
> >> +   * we should at least initialize those pages. Otherwise we
> >> +   * could blow up on a poisoned page in some paths which depend
> >> +   * on full sections being initialized (e.g. memory hotplug).
> >> +   */
> >> +  while (end_pfn % PAGES_PER_SECTION) {
> >> +  __init_single_page(pfn_to_page(end_pfn), end_pfn, zone, nid);
> >> +  end_pfn++;  
> > 
> > This page will not be marked as PG_reserved - although it is a physical
> > memory gap. Do we care?
> >   
> 
> Hm, or do we even have any idea what this is (e.g. could it also be
> something not a gap)?

In the "mem=" restriction scenario it would be a gap, and probably fall
into the PG_reserved categorization from your recent patch:
 * - Pages falling into physical memory gaps - not IORESOURCE_SYSRAM. Trying
 *   to read/write these pages might end badly. Don't touch!

Not sure if it could be something else. In theory, if it is possible to have
a scenario where memory zones are not section-aligned, then this
end_pfn % PAGES_PER_SECTION part could be part of another zone. But the

Re: [PATCH AUTOSEL 4.14 02/15] s390/hibernate: fix error handling when suspend cpu != resume cpu

2018-10-26 Thread Gerald Schaefer
On Fri, 26 Oct 2018 07:09:20 -0400
Sasha Levin  wrote:

> On Fri, Oct 26, 2018 at 12:15:49PM +0200, Pavel Machek wrote:
> >On Fri 2018-10-26 10:22:14, Greg KH wrote:  
> >> On Fri, Oct 26, 2018 at 11:05:43AM +0200, Pavel Machek wrote:  
> >> > On Mon 2018-10-22 06:20:13, Sasha Levin wrote:  
> >> > > From: Gerald Schaefer 
> >> > >
> >> > > [ Upstream commit 55a5542a546238354d1f209f794414168cf8c71d ]
> >> > >
> >> > > The resume code checks if the resume cpu is the same as the suspend 
> >> > > cpu.
> >> > > If not, and if it is also not possible to switch to the suspend cpu, an
> >> > > error message should be printed and the resume process should be 
> >> > > stopped
> >> > > by loading a disabled wait psw.
> >> > >
> >> > > The current logic is broken in multiple ways, the message is never 
> >> > > printed,
> >> > > and the disabled wait psw never loaded because the kernel panics 
> >> > > before that:
> >> > > - sam31 and SIGP_SET_ARCHITECTURE to ESA mode is wrong, this will break
> >> > >   on the first 64bit instruction in sclp_early_printk().
> >> > > - The init stack should be used, but the stack pointer is not set up 
> >> > > correctly
> >> > >   (missing aghi %r15,-STACK_FRAME_OVERHEAD).
> >> > > - __sclp_early_printk() checks the sclp_init_state. If it is not
> >> > >   sclp_init_state_uninitialized, it simply returns w/o printing 
> >> > > anything.
> >> > >   In the resumed kernel however, sclp_init_state will never be 
> >> > > uninitialized.  
> >> >
> >> > Stable patches should fix one bug, and one bug only.  
> >>
> >> So should upstream patches, but the rule of "stable patches match
> >> upstream identically" overrules this :)  
> >
> >a) There is no such rule for upstream.  
> 
> I invite you to read the docs in Documentation/process/ where you'll
> find gems like "Each logically independent change should be formatted as
> a separate patch".
> 
> >b) You should split the patch if it is important enough.
> >
> >c) The "stable patches match upstream identically" rule does not
> >exist. Check the documentation.  
> 
> The rule in question here is that every patch that goes into -stable
> must be upstream. If we were to split this patch, it would create 2+
> patches that do not exist upstream, thus breaking our own rules.
> 
> Anyway, given that this is an upstream issue, I'll also invite you to
> have this discussion with the maintainer of the subsystem this patch
> went into; you raise a valid concern which can hopefully be addressed by
> that maintainer.

It fixes the error message when suspend cpu != resume cpu, and splitting
it up would result in multiple patches where none of them would fix the
issue by itself, so they are logically connected.

Anyway, that patch did not have a cc: stable on purpose, it fixes an
esoteric special case, where the kernel would end up in disabled wait
anyway, with or without the patch. So it would be perfectly fine to not
include it in any stable update.



Re: [PATCH] memory_hotplug: fix the panic when memory end is not on the section boundary

2018-09-12 Thread Gerald Schaefer
On Wed, 12 Sep 2018 14:40:18 +
Pasha Tatashin  wrote:

> On 9/12/18 10:27 AM, Gerald Schaefer wrote:
> > On Wed, 12 Sep 2018 15:39:33 +0200
> > Michal Hocko  wrote:
> >   
> >> On Wed 12-09-18 15:03:56, Gerald Schaefer wrote:
> >> [...]  
> >>> BTW, those sysfs attributes are world-readable, so anyone can trigger
> >>> the panic by simply reading them, or just run lsmem (also available for
> >>> x86 since util-linux 2.32). OK, you need a special 
> >>> not-memory-block-aligned
> >>> mem= parameter and DEBUG_VM for poison check, but w/o DEBUG_VM you would
> >>> still access uninitialized struct pages. This sounds very wrong, and I
> >>> think it really should be fixed.
> >>
> >> Ohh, absolutely. Nobody is questioning that. The thing is that the
> >> code has been likely always broken. We just haven't noticed because
> >> those unitialized parts where zeroed previously. Now that the implicit
> >> zeroying is gone it is just visible.
> >>
> >> All that I am arguing is that there are many places which assume
> >> pageblocks to be fully initialized and plugging one place that blows up
> >> at the time is just whack a mole. We need to address this much earlier.
> >> E.g. by allowing only full pageblocks when adding a memory range.  
> > 
> > Just to make sure we are talking about the same thing: when you say
> > "pageblocks", do you mean the MAX_ORDER_NR_PAGES / pageblock_nr_pages
> > unit of pages, or do you mean the memory (hotplug) block unit?  
> 
> From early discussion, it was about pageblock_nr_pages not about
> memory_block_size_bytes
> 
> > 
> > I do not see any issue here with MAX_ORDER_NR_PAGES / pageblock_nr_pages
> > pageblocks, and if there was such an issue, of course you are right that
> > this would affect many places. If there was such an issue, I would also
> > assume that we would see the new page poison warning in many other places.
> > 
> > The bug that Mikhails patch would fix only affects code that operates
> > on / iterates through memory (hotplug) blocks, and that does not happen
> > in many places, only in the two functions that his patch fixes.  
> 
> Just to be clear, so memory is pageblock_nr_pages aligned, yet
> memory_block are larger and panic is still triggered?
> 
> I ask, because 3075M is not 128M aligned.

Correct, 3075M is pageblock aligned (at least on s390), but not memory
block aligned (256 MB on s390). And the "not memory block aligned" is the
reason for the panic, because at least the two memory hotplug functions
seem to rely on completely initialized struct pages for a memory block.
In this scenario we don't have any partly initialized pageblocks.

While thinking about this, with mem= it may actually be possible to also
create a not-pageblock-aligned scenario, e.g. with mem=2097148K. I didn't
try this and I thought that at least pageblock-alignment would always be
present, but from a quick glance at the mem= parsing it should actually
be possible to also create such a scenario. Then we really would have
partly initialized pageblocks, and maybe other problems would occur.

> 
> > 
> > When you say "address this much earlier", do you mean changing the way
> > that free_area_init_core()/memmap_init() initialize struct pages, i.e.
> > have them not use zone->spanned_pages as limit, but rather align that
> > up to the memory block (not pageblock) boundary?
> >   
> 
> This was my initial proposal, to fix memmap_init() and initialize struct
> pages beyond the "end", and before the "start" to cover the whole
> section. But, I think Michal suggested (and he might correct me) to
> simply ignore unaligned memory to section memory much earlier: so
> anything that does not align to sparse order is not added at all to the
> system.
> 
> I think Michal's proposal would simplify and strengthen memory mapping
> overall.

Of course it would be better to fix this in one place by providing
proper alignment, but to what unit, pageblock, section, memory block?
I was just confused by the pageblock discussion, because in the current
scenario we do not have any pageblock issues, and pageblock alignment
would also not help here. Section alignment probably would, even though a
memory block can contain multiple sections; at least the memory hotplug
valid_zones/removable sysfs handlers seem to check for present sections
first, before accessing the struct pages.



Re: [PATCH] memory_hotplug: fix the panic when memory end is not on the section boundary

2018-09-12 Thread Gerald Schaefer
On Mon, 10 Sep 2018 15:26:55 +
Pasha Tatashin  wrote:

> 
> I agree memoryblock is a hack, it fails to do both things it was
> designed to do:
> 
> 1. On bare metal you cannot free a physical dimm of memory using
> memoryblock granularity because memory devices do not equal to physical
> dimms. Thus, if for some reason a particular dimm must be
> remove/replaced, memoryblock does not help us.
> 
> 2. On machines with hypervisors it fails to provide an adequate
> granularity to add/remove memory.
> 
> We should define a new user interface where memory can be added/removed
> at a finer granularity: sparse section size, but without a memory
> devices for each section. We should also provide an optional access to
> legacy interface where memory devices are exported but each is of
> section size.
> 
> So, when legacy interface is enabled, current way would work:
> 
> echo offline > /sys/devices/system/memory/memoryXXX/state
> 
> And new interface would allow us to do something like this:
> 
> echo offline 256M > /sys/devices/system/node/nodeXXX/memory
> 
> With optional start address for offline memory.
> echo offline [start_pa] size > /sys/devices/system/node/nodeXXX/memory
> start_pa and size must be section size aligned (128M).
> 
> It would probably be a good discussion for the next MM Summit how to
> solve the current memory hotplug interface limitations.

Please keep lsmem/chmem from util-linux in mind, when changing the
memory hotplug user interface.



Re: [PATCH] memory_hotplug: fix the panic when memory end is not on the section boundary

2018-09-12 Thread Gerald Schaefer
On Wed, 12 Sep 2018 15:39:33 +0200
Michal Hocko  wrote:

> On Wed 12-09-18 15:03:56, Gerald Schaefer wrote:
> [...]
> > BTW, those sysfs attributes are world-readable, so anyone can trigger
> > the panic by simply reading them, or just run lsmem (also available for
> > x86 since util-linux 2.32). OK, you need a special not-memory-block-aligned
> > mem= parameter and DEBUG_VM for poison check, but w/o DEBUG_VM you would
> > still access uninitialized struct pages. This sounds very wrong, and I
> > think it really should be fixed.  
> 
> Ohh, absolutely. Nobody is questioning that. The thing is that the
> code has been likely always broken. We just haven't noticed because
> those unitialized parts where zeroed previously. Now that the implicit
> zeroying is gone it is just visible.
> 
> All that I am arguing is that there are many places which assume
> pageblocks to be fully initialized and plugging one place that blows up
> at the time is just whack a mole. We need to address this much earlier.
> E.g. by allowing only full pageblocks when adding a memory range.

Just to make sure we are talking about the same thing: when you say
"pageblocks", do you mean the MAX_ORDER_NR_PAGES / pageblock_nr_pages
unit of pages, or do you mean the memory (hotplug) block unit?

I do not see any issue here with MAX_ORDER_NR_PAGES / pageblock_nr_pages
pageblocks, and if there was such an issue, of course you are right that
this would affect many places. If there was such an issue, I would also
assume that we would see the new page poison warning in many other places.

The bug that Mikhail's patch would fix only affects code that operates
on / iterates through memory (hotplug) blocks, and that does not happen
in many places, only in the two functions that his patch fixes.

When you say "address this much earlier", do you mean changing the way
that free_area_init_core()/memmap_init() initialize struct pages, i.e.
have them not use zone->spanned_pages as limit, but rather align that
up to the memory block (not pageblock) boundary?



Re: [PATCH] memory_hotplug: fix the panic when memory end is not on the section boundary

2018-09-12 Thread Gerald Schaefer
On Mon, 10 Sep 2018 15:17:54 +0200
Michal Hocko  wrote:

> [Cc Pavel]
> 
> On Mon 10-09-18 14:35:27, Mikhail Zaslonko wrote:
> > If memory end is not aligned with the linux memory section boundary, such
> > a section is only partly initialized. This may lead to VM_BUG_ON due to
> > uninitialized struct pages access from is_mem_section_removable() or
> > test_pages_in_a_zone() function.
> > 
> > Here is one of the panic examples:
> >  CONFIG_DEBUG_VM_PGFLAGS=y
> >  kernel parameter mem=3075M  
> 
> OK, so the last memory section is not full and we have a partial memory
> block right?
> 
> >  page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))  
> 
> OK, this means that the struct page is not fully initialized. Do you
> have a specific place which has triggered this assert?
> 
> >  [ cut here ]
> >  Call Trace:
> >  ([<0039b8a4>] is_mem_section_removable+0xcc/0x1c0)
> >   [<009558ba>] show_mem_removable+0xda/0xe0
> >   [<009325fc>] dev_attr_show+0x3c/0x80
> >   [<0047e7ea>] sysfs_kf_seq_show+0xda/0x160
> >   [<003fc4e0>] seq_read+0x208/0x4c8
> >   [<003cb80e>] __vfs_read+0x46/0x180
> >   [<003cb9ce>] vfs_read+0x86/0x148
> >   [<003cc06a>] ksys_read+0x62/0xc0
> >   [<00c001c0>] system_call+0xdc/0x2d8
> > 
> > This fix checks if the page lies within the zone boundaries before
> > accessing the struct page data. The check is added to both functions.
> > Actually similar check has already been present in
> > is_pageblock_removable_nolock() function but only after the struct page
> > is accessed.
> >   
> 
> Well, I am afraid this is not the proper solution. We are relying on the
> full pageblock worth of initialized struct pages at many other place. We
> used to do that in the past because we have initialized the full
> section but this has been changed recently. Pavel, do you have any ideas
> how to deal with this partial mem sections now?

This is not about partly initialized pageblocks, it is about partly
initialized struct pages for memory (hotplug) blocks. And this also
seems to "work as designed", i.e. memmap_init_zone() will stop at
zone->spanned_pages, and not initialize the full memory block if
you specify a not-memory-block-aligned mem= parameter.

"Normal" memory management code should never get in touch with the
uninitialized part, only the 2 memory hotplug sysfs handlers for
"valid_zones" and "removable" will iterate over a complete memory block.

So it does seem rather straight-forward to simply fix those 2 functions,
and not let them go beyond zone->spanned_pages, which is what Mikhails patch
would do. What exactly is wrong with that approach, and how would you rather
have it fixed? A patch that changes memmap_init_zone() to initialize all
struct pages of the last memory block, even beyond zone->spanned_pages?
Could this be done w/o side-effects?

If you look at test_pages_in_a_zone(), there is already some logic that
obviously assumes that at least the first page of the memory block is
initialized, and then while iterating over the rest, a check for
zone_spans_pfn() can easily be added. Mikhail applied the same logic
to is_mem_section_removable(), and his patch does fix the panic (or access
to uninitialized struct pages w/o DEBUG_VM poisoning).
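
The resulting logic would roughly be (only a sketch, not the exact patch):

	struct zone *zone = page_zone(pfn_to_page(start_pfn));
	unsigned long pfn;

	for (pfn = start_pfn; pfn < start_pfn + nr_pages; pfn += pageblock_nr_pages) {
		if (!zone_spans_pfn(zone, pfn))
			break;	/* beyond the zone, struct pages may be uninitialized */
		if (!is_pageblock_removable_nolock(pfn_to_page(pfn)))
			return false;
	}
	return true;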

BTW, those sysfs attributes are world-readable, so anyone can trigger
the panic by simply reading them, or just run lsmem (also available for
x86 since util-linux 2.32). OK, you need a special not-memory-block-aligned
mem= parameter and DEBUG_VM for poison check, but w/o DEBUG_VM you would
still access uninitialized struct pages. This sounds very wrong, and I
think it really should be fixed.

> 
> > Signed-off-by: Mikhail Zaslonko 
> > Reviewed-by: Gerald Schaefer 
> > Cc: 
> > ---
> >  mm/memory_hotplug.c | 20 +++-
> >  1 file changed, 11 insertions(+), 9 deletions(-)
> > 
> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index 9eea6e809a4e..8e20e8fcc3b0 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -1229,9 +1229,8 @@ static struct page *next_active_pageblock(struct page 
> > *page)
> > return page + pageblock_nr_pages;
> >  }
> >  
> > -static bool is_pageblock_removable_nolock(struct page *page)
> > +static bool is_pageblock_removable_nolock(struct page *page, struct zone 
> > **zone)
> >  {
> > -   struct zone *zone;
> > unsigned long pfn;
> >  
> > /*
> > @@ -1241,15 +1240,14 @@ static bool is_pageblock_removable_nolock(str

Re: [RFC PATCH 0/9] Enable THP migration for all possible architectures

2018-04-27 Thread Gerald Schaefer
On Thu, 26 Apr 2018 10:27:55 -0400
Zi Yan  wrote:

> From: Zi Yan 
> 
> Hi all,
> 
> THP migration is only enabled on x86_64 with a special
> ARCH_ENABLE_THP_MIGRATION macro. This patchset enables THP migration for
> all architectures that uses transparent hugepage, so that special macro can
> be dropped. Instead, THP migration is enabled/disabled via
> /sys/kernel/mm/transparent_hugepage/enable_thp_migration.
> 
> I grepped for TRANSPARENT_HUGEPAGE in arch folder and got 9 architectures that
> are supporting transparent hugepage. I mechanically add __pmd_to_swp_entry() 
> and
> __swp_entry_to_pmd() based on existing __pte_to_swp_entry() and
> __swp_entry_to_pte() for all these architectures, except tile which is going 
> to
> be dropped.

This will not work on s390, because the pmd layout is very different from the pte
layout. Using __swp_entry/type/offset() on a pmd will go horribly wrong.
I currently don't see a chance to make this work for us, so please make/keep
this configurable, and do not configure it for s390.

Regards,
Gerald



Re: [PATCH 3/3] iommu: s390: constify iommu_ops

2017-08-28 Thread Gerald Schaefer
On Mon, 28 Aug 2017 17:42:50 +0530
Arvind Yadav  wrote:

> iommu_ops are not supposed to change at runtime.
> Functions 'bus_set_iommu' working with const iommu_ops provided
> by . So mark the non-const structs as const.
> 
> Signed-off-by: Arvind Yadav 
> ---
>  drivers/iommu/s390-iommu.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Acked-by: Gerald Schaefer 

> 
> diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c
> index 8788640..400d75f 100644
> --- a/drivers/iommu/s390-iommu.c
> +++ b/drivers/iommu/s390-iommu.c
> @@ -327,7 +327,7 @@ static size_t s390_iommu_unmap(struct iommu_domain 
> *domain,
>   return size;
>  }
> 
> -static struct iommu_ops s390_iommu_ops = {
> +static const struct iommu_ops s390_iommu_ops = {
>   .capable = s390_iommu_capable,
>   .domain_alloc = s390_domain_alloc,
>   .domain_free = s390_domain_free,



Re: [RFC PATCH 0/5] mm, memory_hotplug: allocate memmap from hotadded memory

2017-07-31 Thread Gerald Schaefer
On Mon, 31 Jul 2017 17:53:50 +0200
Michal Hocko  wrote:

> On Mon 31-07-17 17:04:59, Gerald Schaefer wrote:
> > On Mon, 31 Jul 2017 14:53:19 +0200
> > Michal Hocko  wrote:
> > 
> > > On Mon 31-07-17 14:35:21, Gerald Schaefer wrote:
> > > > On Fri, 28 Jul 2017 14:19:41 +0200
> > > > Michal Hocko  wrote:
> > > > 
> > > > > On Thu 27-07-17 08:56:52, Michal Hocko wrote:
> > > > > > On Wed 26-07-17 17:06:59, Jerome Glisse wrote:
> > > > > > [...]
> > > > > > > This does not seems to be an opt-in change ie if i am reading 
> > > > > > > patch 3
> > > > > > > correctly if an altmap is not provided to __add_pages() you 
> > > > > > > fallback
> > > > > > > to allocating from begining of zone. This will not work with HMM 
> > > > > > > ie
> > > > > > > device private memory. So at very least i would like to see some 
> > > > > > > way
> > > > > > > to opt-out of this. Maybe a new argument like bool forbid_altmap ?
> > > > > > 
> > > > > > OK, I see! I will think about how to make a sane api for that.
> > > > > 
> > > > > This is what I came up with. s390 guys mentioned that I cannot simply
> > > > > use the new range at this stage yet. This will need probably some 
> > > > > other
> > > > > changes but I guess we want an opt-in approach with an arch veto in 
> > > > > general.
> > > > > 
> > > > > So what do you think about the following? Only x86 is update now and I
> > > > > will split it into two parts but the idea should be clear at least.
> > > > 
> > > > This looks good, and the kernel will also boot again on s390 when 
> > > > applied
> > > > on top of the other 5 patches (plus adding the s390 part here).
> > > 
> > > Thanks for testing Gerald! I am still undecided whether the arch code
> > > should veto MHP_RANGE_ACCESSIBLE if it cannot be supported or just set
> > > it when it is supported. My last post did the later but the first one
> > > sounds like a more clear API to me. I will keep thinking about it.
> > > 
> > > Anyway, did you have any chance to consider mapping the new physical
> > > memory range inside arch_add_memory rather than during online on s390?
> > 
> > Well, it still looks like we cannot do w/o splitting up add_memory():
> > 1) (only) set up section map during our initial memory detection, w/o
> > allocating struct pages, so that the sysfs entries get created also for
> > our offline memory (or else we have no way to online it later)
> > 2) set up vmemmap and allocate struct pages with your new altmap approach
> > during our MEM_GOING_ONLINE callback, because only now the memory is really
> > accessible
> 
> As I've tried to mentioned in my other response. This is not possible
> because there are memory hotplug usecases which never do an explicit
> online.

Of course the default behaviour should not change; we only need an option
to do the "2-stage" approach. E.g. we would call __add_pages() from our
MEM_GOING_ONLINE handler instead of from arch_add_memory() as before, but
then we would need some way to add memory sections (for generating sysfs
memory blocks) only, without allocating struct pages. See also below.

> 
> I am sorry to ask again. But why exactly cannot we make the range
> accessible from arch_add_memory on s390?

We have no ACPI or other events to indicate new memory; both online and
offline memory need to be (hypervisor-)defined upfront, and we then want
to be able to use memory hotplug for ballooning at runtime.

Making the range accessible is equivalent to a hypervisor call that assigns
the memory to the guest. The problem with arch_add_memory() is now that
this gets called from add_memory(), which we call during initial memory
detection for the offline memory ranges. At that time, assigning all
offline memory to the guest, and thus making it accessible, would break
the ballooning usecase (even if it is still offline in Linux, the
hypervisor could not use it for other guests any more).

The main difference to other architectures is that we cannot simply
call add_memory() (and thus arch_add_memory()) at the time when the
offline memory is actually supposed to go online (e.g. triggered by ACPI).
Instead, we need to make sure that the offline memory is detected early,
and sysfs entries are created for it, so that it can be set online
later on demand.

Maybe our design to use add_memory() for offl

Re: [RFC PATCH 0/5] mm, memory_hotplug: allocate memmap from hotadded memory

2017-07-31 Thread Gerald Schaefer
On Mon, 31 Jul 2017 14:53:19 +0200
Michal Hocko <mho...@kernel.org> wrote:

> On Mon 31-07-17 14:35:21, Gerald Schaefer wrote:
> > On Fri, 28 Jul 2017 14:19:41 +0200
> > Michal Hocko <mho...@kernel.org> wrote:
> > 
> > > On Thu 27-07-17 08:56:52, Michal Hocko wrote:
> > > > On Wed 26-07-17 17:06:59, Jerome Glisse wrote:
> > > > [...]
> > > > > This does not seems to be an opt-in change ie if i am reading patch 3
> > > > > correctly if an altmap is not provided to __add_pages() you fallback
> > > > > to allocating from begining of zone. This will not work with HMM ie
> > > > > device private memory. So at very least i would like to see some way
> > > > > to opt-out of this. Maybe a new argument like bool forbid_altmap ?
> > > > 
> > > > OK, I see! I will think about how to make a sane api for that.
> > > 
> > > This is what I came up with. s390 guys mentioned that I cannot simply
> > > use the new range at this stage yet. This will need probably some other
> > > changes but I guess we want an opt-in approach with an arch veto in 
> > > general.
> > > 
> > > So what do you think about the following? Only x86 is update now and I
> > > will split it into two parts but the idea should be clear at least.
> > 
> > This looks good, and the kernel will also boot again on s390 when applied
> > on top of the other 5 patches (plus adding the s390 part here).
> 
> Thanks for testing Gerald! I am still undecided whether the arch code
> should veto MHP_RANGE_ACCESSIBLE if it cannot be supported or just set
> it when it is supported. My last post did the later but the first one
> sounds like a more clear API to me. I will keep thinking about it.
> 
> Anyway, did you have any chance to consider mapping the new physical
> memory range inside arch_add_memory rather than during online on s390?

Well, it still looks like we cannot do w/o splitting up add_memory():
1) (only) set up section map during our initial memory detection, w/o
allocating struct pages, so that the sysfs entries get created also for
our offline memory (or else we have no way to online it later)
2) set up vmemmap and allocate struct pages with your new altmap approach
during our MEM_GOING_ONLINE callback, because only now the memory is really
accessible

Besides the obvious problem that this would need a new interface, there is
also the problem that (at least) show_valid_zones() in drivers/base/memory.c
operates on struct pages from _offline_ memory, for its page_zone() checks.
This will not work well if we have no struct pages for offline memory ...
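
For illustration, the problematic access boils down to something like this
(simplified, not the exact code from drivers/base/memory.c):

  /* reading .../memoryN/valid_zones looks at the struct pages of the
   * block, even while the block is still offline
   */
  static struct zone *zone_of_block(unsigned long start_pfn)
  {
          struct page *page = pfn_to_page(start_pfn);

          /* page_zone() reads page->flags, i.e. it dereferences the memmap */
          return page_zone(page);
  }

If the memmap for such a block were allocated from the block itself, this
dereference would hit offline (or already removed) memory.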

BTW, the latter may also be an issue with your rework on any architecture.
Not sure if I understood it correctly, but the situation on s390 (i.e.
having offline memory blocks visible in sysfs) should be similar to
the scenario on x86, when you plug in memory, set it online in the acpi
handler, and then manually set it offline again via sysfs. Now the
memory is still visible in sysfs, and reading the valid_zones attribute
will trigger an access to struct pages for that memory. What if this
memory is now physically removed, in a race with such a struct page
access? Before your patch that would probably be OK, because the struct
pages were not part of the removed memory, but now you should get an
addressing exception, right?

Regards,
Gerald



Re: [RFC PATCH 2/5] mm, arch: unify vmemmap_populate altmap handling

2017-07-31 Thread Gerald Schaefer
On Mon, 31 Jul 2017 14:55:56 +0200
Michal Hocko <mho...@kernel.org> wrote:

> On Mon 31-07-17 14:40:53, Gerald Schaefer wrote:
> [...]
> > > @@ -247,12 +248,12 @@ int __meminit vmemmap_populate(unsigned long start, 
> > > unsigned long end, int node)
> > >* use large frames even if they are only partially
> > >* used.
> > >* Otherwise we would have also page tables since
> > > -  * vmemmap_populate gets called for each section
> > > +  * __vmemmap_populate gets called for each section
> > >* separately. */
> > >   if (MACHINE_HAS_EDAT1) {
> > >   void *new_page;
> > > 
> > > - new_page = vmemmap_alloc_block(PMD_SIZE, node);
> > > + new_page = __vmemmap_alloc_block_buf(PMD_SIZE, 
> > > node, altmap);
> > >   if (!new_page)
> > >   goto out;
> > >   pmd_val(*pm_dir) = __pa(new_page) | sgt_prot;
> > 
> > There is another call to vmemmap_alloc_block() in this function, a couple
> > of lines below, this should also be replaced by __vmemmap_alloc_block_buf().
> 
> I've noticed that one but in general I have only transformed PMD
> mappings because we shouldn't even get to pte level if the former works
> AFAICS. Memory sections should be always 2MB aligned unless I am missing
> something. Or is this not true?

vmemmap_populate() on s390 will only stop at pmd level if we have HW
support for large pages (MACHINE_HAS_EDAT1). In that case we will allocate
a PMD_SIZE block with vmemmap_alloc_block() and map it on pmd level as
a large page.

Without HW large page support, we will continue to allocate a pte page,
populate the pmd entry with that, and fall through to the pte_none()
check below, with its PAGE_SIZE vmemmap_alloc_block() allocation. In this
case we should use the __vmemmap_alloc_block_buf().
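
I.e. the pte path would need roughly the same treatment as the pmd path
above (sketch only, reusing the pt_dir/pgt_prot names from the surrounding
s390 vmem code):

  if (pte_none(*pt_dir)) {
          void *new_page;

          /* without EDAT1 the memmap is mapped with 4K pages, so this
           * allocation should also come from the altmap, if present
           */
          new_page = __vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
          if (!new_page)
                  goto out;
          pte_val(*pt_dir) = __pa(new_page) | pgt_prot;
  }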

Regards,
Gerald



Re: [RFC PATCH 2/5] mm, arch: unify vmemmap_populate altmap handling

2017-07-31 Thread Gerald Schaefer
On Wed, 26 Jul 2017 10:33:30 +0200
Michal Hocko  wrote:

> From: Michal Hocko 
> 
> vmem_altmap allows vmemmap_populate to allocate memmap (struct page
> array) from an alternative allocator rather than bootmem resp.
> kmalloc. Only x86 currently supports altmap handling, most likely
> because only nvdimm code uses this mechanism currently and the code
> depends on ZONE_DEVICE which is present only for x86_64. This will
> change in follow up changes so we would like other architectures
> to support it as well.
> 
> Provide vmemmap_populate generic implementation which simply resolves
> altmap and then call into arch specific __vmemmap_populate.
> Architectures then only need to use __vmemmap_alloc_block_buf to
> allocate the memmap. vmemmap_free then needs to call vmem_altmap_free
> if there is any altmap associated with the address.
> 
> This patch shouldn't introduce any functional changes because
> to_vmem_altmap always returns NULL on !x86_64.
> 
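
(For reference, the generic wrapper described above would be roughly the
following; a sketch of the idea only, not the exact hunk from this patch:

  int __meminit vmemmap_populate(unsigned long start, unsigned long end,
                                 int node)
  {
          struct vmem_altmap *altmap = to_vmem_altmap(start);

          /* resolve the altmap once, then let the arch code allocate
           * the memmap via __vmemmap_alloc_block_buf()
           */
          return __vmemmap_populate(start, end, node, altmap);
  }
)
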
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: Tony Luck 
> Cc: Fenghua Yu 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Michael Ellerman 
> Cc: Martin Schwidefsky 
> Cc: Heiko Carstens 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: "H. Peter Anvin" 
> Cc: linuxppc-...@lists.ozlabs.org
> Cc: linux-i...@vger.kernel.org
> Cc: x...@kernel.org
> Signed-off-by: Michal Hocko 
> ---
>  arch/arm64/mm/mmu.c   |  9 ++---
>  arch/ia64/mm/discontig.c  |  4 +++-
>  arch/powerpc/mm/init_64.c | 29 -
>  arch/s390/mm/vmem.c   |  7 ---
>  arch/sparc/mm/init_64.c   |  6 +++---
>  arch/x86/mm/init_64.c |  4 ++--
>  include/linux/memremap.h  | 13 ++---
>  include/linux/mm.h| 19 ++-
>  mm/sparse-vmemmap.c   |  2 +-
>  9 files changed, 59 insertions(+), 34 deletions(-)
> 
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 0c429ec6fde8..5de1161e7a1b 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -649,12 +649,15 @@ int kern_addr_valid(unsigned long addr)
>  }
>  #ifdef CONFIG_SPARSEMEM_VMEMMAP
>  #if !ARM64_SWAPPER_USES_SECTION_MAPS
> -int __meminit vmemmap_populate(unsigned long start, unsigned long end, int 
> node)
> +int __meminit __vmemmap_populate(unsigned long start, unsigned long end, int 
> node,
> + struct vmem_altmap *altmap)
>  {
> + WARN(altmap, "altmap unsupported\n");
>   return vmemmap_populate_basepages(start, end, node);
>  }
>  #else/* !ARM64_SWAPPER_USES_SECTION_MAPS */
> -int __meminit vmemmap_populate(unsigned long start, unsigned long end, int 
> node)
> +int __meminit __vmemmap_populate(unsigned long start, unsigned long end, int 
> node,
> + struct vmem_altmap *altmap)
>  {
>   unsigned long addr = start;
>   unsigned long next;
> @@ -677,7 +680,7 @@ int __meminit vmemmap_populate(unsigned long start, 
> unsigned long end, int node)
>   if (pmd_none(*pmd)) {
>   void *p = NULL;
> 
> - p = vmemmap_alloc_block_buf(PMD_SIZE, node);
> + p = __vmemmap_alloc_block_buf(PMD_SIZE, node, altmap);
>   if (!p)
>   return -ENOMEM;
> 
> diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
> index 878626805369..2a939e877ced 100644
> --- a/arch/ia64/mm/discontig.c
> +++ b/arch/ia64/mm/discontig.c
> @@ -753,8 +753,10 @@ void arch_refresh_nodedata(int update_node, pg_data_t 
> *update_pgdat)
>  #endif
> 
>  #ifdef CONFIG_SPARSEMEM_VMEMMAP
> -int __meminit vmemmap_populate(unsigned long start, unsigned long end, int 
> node)
> +int __meminit __vmemmap_populate(unsigned long start, unsigned long end, int 
> node,
> + struct vmem_altmap *altmap)
>  {
> + WARN(altmap, "altmap unsupported\n");
>   return vmemmap_populate_basepages(start, end, node);
>  }
> 
> diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
> index ec84b31c6c86..5ea5e870a589 100644
> --- a/arch/powerpc/mm/init_64.c
> +++ b/arch/powerpc/mm/init_64.c
> @@ -44,6 +44,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
> 
>  #include 
>  #include 
> @@ -115,7 +116,8 @@ static struct vmemmap_backing *next;
>  static int num_left;
>  static int num_freed;
> 
> -static __meminit struct vmemmap_backing * vmemmap_list_alloc(int node)
> +static __meminit struct vmemmap_backing * vmemmap_list_alloc(int node,
> + struct vmem_altmap *altmap)
>  {
>   struct vmemmap_backing *vmem_back;
>   /* get from freed entries first */
> @@ -129,7 +131,7 @@ static __meminit struct vmemmap_backing * 
> vmemmap_list_alloc(int node)
> 
>   /* allocate a page when required and hand out chunks */
>   if (!num_left) {
> - next = vmemmap_alloc_block(PAGE_SIZE, node);
> + next = __vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
>   if (unlikely(!next)) {
>   WARN_ON(1);
>   return NULL;
> @@ 

Re: [RFC PATCH 0/5] mm, memory_hotplug: allocate memmap from hotadded memory

2017-07-31 Thread Gerald Schaefer
On Fri, 28 Jul 2017 14:19:41 +0200
Michal Hocko  wrote:

> On Thu 27-07-17 08:56:52, Michal Hocko wrote:
> > On Wed 26-07-17 17:06:59, Jerome Glisse wrote:
> > [...]
> > > This does not seems to be an opt-in change ie if i am reading patch 3
> > > correctly if an altmap is not provided to __add_pages() you fallback
> > > to allocating from begining of zone. This will not work with HMM ie
> > > device private memory. So at very least i would like to see some way
> > > to opt-out of this. Maybe a new argument like bool forbid_altmap ?
> > 
> > OK, I see! I will think about how to make a sane api for that.
> 
> This is what I came up with. s390 guys mentioned that I cannot simply
> use the new range at this stage yet. This will need probably some other
> changes but I guess we want an opt-in approach with an arch veto in general.
> 
> So what do you think about the following? Only x86 is update now and I
> will split it into two parts but the idea should be clear at least.

This looks good, and the kernel will also boot again on s390 when applied
on top of the other 5 patches (plus adding the s390 part here).

Regards,
Gerald



Re: [RFC PATCH 3/5] mm, memory_hotplug: allocate memmap from the added memory range for sparse-vmemmap

2017-07-26 Thread Gerald Schaefer
On Wed, 26 Jul 2017 14:30:41 +0200
Michal Hocko  wrote:

> On Wed 26-07-17 13:45:39, Heiko Carstens wrote:
> [...]
> > In general I do like your idea, however if I understand your patches
> > correctly we might have an ordering problem on s390: it is not possible to
> > access hot-added memory on s390 before it is online (MEM_GOING_ONLINE
> > succeeded).
> 
> Could you point me to the code please? I cannot seem to find the
> notifier which implements that.

It is in drivers/s390/char/sclp_cmd.c: sclp_mem_notifier(). 

> 
> > On MEM_GOING_ONLINE we ask the hypervisor to back the potential available
> > hot-added memory region with physical pages. Accessing those ranges before
> > that will result in an exception.
> 
> Can we make the range which backs the memmap range available? E.g from
> s390 specific __vmemmap_populate path?

No, only the complete range of a storage increment can be made available.
The size of those increments may vary between z/VM and LPAR, but with LPAR
it will always be at least 256 MB, IIRC.

> 
> > However with your approach the memory is still allocated when add_memory()
> > is being called, correct? That wouldn't be a change to the current
> > behaviour; except for the ordering problem outlined above.
> 
> Could you be more specific please? I do not change when the memmap is
> allocated.

I guess this is about the difference between s390 and others, wrt when
we call add_memory(). It is also in drivers/s390/char/sclp_cmd.c, early
during memory detection, as opposed to other archs, where I guess this
could be triggered by an ACPI event during runtime, at least for newly
added and to-be-onlined memory.

This probably means that any approach that tries to allocate memmap
memory during add_memory(), out of the "to-be-onlined but still offline"
memory, will be difficult for s390, because we call add_memory() only once
during memory detection for the complete range of (hypervisor) defined
online and offline memory. The offline parts are then made available in
the MEM_GOING_ONLINE notifier called from online_pages(). Only after
this point the memory would then be available to allocate a memmap in it.

Nevertheless, we have great interest in such an "allocate memmap from
the added memory range" solution. I guess we would need some way to
separate the memmap allocation from add_memory(), which sounds odd,
or provide some way to have add_memory() only allocate a memmap for
online memory, and a mechanism to add the memmaps for offline memory
blocks later when they are being set online.
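
Just to illustrate what I mean, a purely hypothetical interface (all names
made up, not a proposal for the final API) could look like:

  /* stage 1, at memory detection time: create the sections and sysfs
   * memory blocks, but do not allocate a memmap yet
   */
  int add_memory_sections_only(int nid, u64 start, u64 size);

  /* stage 2, from the MEM_GOING_ONLINE path, once the storage increment
   * is backed by the hypervisor: allocate the memmap, possibly from the
   * new range itself via an altmap
   */
  int add_memory_memmap(int nid, u64 start, u64 size,
                        struct vmem_altmap *altmap);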

Regards,
Gerald



Re: [PATCH 2/2] iommu/s390: Add support for iommu_device handling

2017-06-29 Thread Gerald Schaefer
On Tue, 27 Jun 2017 17:28:06 +0200
Joerg Roedel <j...@8bytes.org> wrote:

> On Mon, Jun 19, 2017 at 05:02:19PM +0200, Gerald Schaefer wrote:
> > On Thu, 15 Jun 2017 15:11:52 +0200
> > Joerg Roedel <j...@8bytes.org> wrote:
> > > + rc = zpci_init_iommu(zdev);
> > > + if (rc)
> > > + goto out_free;
> > > +
> > 
> > After this point, there are two options for failure (zpci_enable_device +
> > zpci_scan_bus), but missing error handling with an appropriate call to
> > zpci_destroy_iommu().
> 
> Right, I'll fix that in the next version.
> 
> > I must admit that I do not understand what these sysfs attributes are
> > used for, and by whom and when. Is it really necessary/correct to register
> > them (including the s390_iommu_ops) _before_ the zpci_dev is registered
> > to the bus (zpci_scan_bus -> pci_scan_root_bus/pci_bus_add_devices)?
> > 
> > What is the expected life cycle for the attributes, and are they already
> > needed when a device is still under DMA API access, or even before it is
> > added to the PCI bus?
> 
> The sysfs attributes are used by user-space for topology detection. A
> user-space program using VFIO needs to know which PCI-devices it needs
> to assign to the VFIO driver in order to gain access.
> 
> On s390 this probably doesn't matter much, as the real PCI topology is
> not exposed, but it matters on other architectures. The purpose of this
> patch is not so much the sysfs attributes. Its a step to convert all
> IOMMU drivers to use the iommu_device_register() interface. Goal is that
> the iommu core code knows about individual hardware iommus and the
> devices behind it.
> 
> When this is implemented in all iommu drivers, we can start moving more
> code out of the drivers into the iommu core. This also probably doesn't
> matter much for s390, but all iommu drivers need to use this interface
> before we can move on. The sysfs-stuff is just a by-product of that.
> 
> So to move on with that, it would be good to get an Acked-by on this
> patch from you guys once you think I fixed everything :)

Ok, thanks for the explanation. So it should not be a problem if the
attributes are registered in zpci_create_device() before the device is
registered to the bus, especially since they do not provide any manipulation
function but only topology information.

BTW, I get the following new attributes with this patch:
/sys/devices/pci:00/:00:00.0/iommu
/sys/devices/virtual/iommu
/sys/devices/virtual/iommu/s390-iommu.0012
/sys/class/iommu/s390-iommu.0012

Regards,
Gerald



Re: [PATCH 1/3] iommu: Return ERR_PTR() values from device_group call-backs

2017-06-28 Thread Gerald Schaefer
On Wed, 28 Jun 2017 14:00:56 +0200
Joerg Roedel <j...@8bytes.org> wrote:

> From: Joerg Roedel <jroe...@suse.de>
> 
> The generic device_group call-backs in iommu.c return NULL
> in case of error. Since they are getting ERR_PTR values from
> iommu_group_alloc(), just pass them up instead.
> 
> Reported-by: Gerald Schaefer <gerald.schae...@de.ibm.com>
> Signed-off-by: Joerg Roedel <jroe...@suse.de>
> ---

Looks good,
Reviewed-by: Gerald Schaefer <gerald.schae...@de.ibm.com>


>  drivers/iommu/iommu.c | 14 ++
>  1 file changed, 2 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index cf7ca7e..de09e1e 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -915,13 +915,7 @@ static int get_pci_alias_or_group(struct pci_dev *pdev, 
> u16 alias, void *opaque)
>   */
>  struct iommu_group *generic_device_group(struct device *dev)
>  {
> - struct iommu_group *group;
> -
> - group = iommu_group_alloc();
> - if (IS_ERR(group))
> - return NULL;
> -
> - return group;
> + return iommu_group_alloc();
>  }
> 
>  /*
> @@ -988,11 +982,7 @@ struct iommu_group *pci_device_group(struct device *dev)
>   return group;
> 
>   /* No shared group found, allocate new */
> - group = iommu_group_alloc();
> - if (IS_ERR(group))
> - return NULL;
> -
> - return group;
> + return iommu_group_alloc();
>  }
> 
>  /**



Re: [PATCH 2/2] iommu/s390: Add support for iommu_device handling

2017-06-19 Thread Gerald Schaefer
On Thu, 15 Jun 2017 15:11:52 +0200
Joerg Roedel  wrote:

> From: Joerg Roedel 
> 
> Add support for the iommu_device_register interface to make
> the s390 hardware iommus visible to the iommu core and in
> sysfs.
> 
> Signed-off-by: Joerg Roedel 
> ---
>  arch/s390/include/asm/pci.h |  7 +++
>  arch/s390/pci/pci.c |  5 +
>  drivers/iommu/s390-iommu.c  | 35 +++
>  3 files changed, 47 insertions(+)
> 
> diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
> index 4e31866..a3f697e 100644
> --- a/arch/s390/include/asm/pci.h
> +++ b/arch/s390/include/asm/pci.h
> @@ -8,6 +8,7 @@
> 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -123,6 +124,8 @@ struct zpci_dev {
>   unsigned long   iommu_pages;
>   unsigned intnext_bit;
> 
> + struct iommu_device iommu_dev;  /* IOMMU core handle */
> +
>   char res_name[16];
>   struct zpci_bar_struct bars[PCI_BAR_COUNT];
> 
> @@ -173,6 +176,10 @@ int clp_add_pci_device(u32, u32, int);
>  int clp_enable_fh(struct zpci_dev *, u8);
>  int clp_disable_fh(struct zpci_dev *);
> 
> +/* IOMMU Interface */
> +int zpci_init_iommu(struct zpci_dev *zdev);
> +void zpci_destroy_iommu(struct zpci_dev *zdev);
> +
>  #ifdef CONFIG_PCI
>  /* Error handling and recovery */
>  void zpci_event_error(void *);
> diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
> index 8051df1..4c13da6 100644
> --- a/arch/s390/pci/pci.c
> +++ b/arch/s390/pci/pci.c
> @@ -749,6 +749,7 @@ void pcibios_remove_bus(struct pci_bus *bus)
> 
>   zpci_exit_slot(zdev);
>   zpci_cleanup_bus_resources(zdev);
> + zpci_destroy_iommu(zdev);
>   zpci_free_domain(zdev);
> 
>   spin_lock(&zpci_list_lock);
> @@ -820,6 +821,10 @@ int zpci_create_device(struct zpci_dev *zdev)
>   if (rc)
>   goto out;
> 
> + rc = zpci_init_iommu(zdev);
> + if (rc)
> + goto out_free;
> +

After this point, there are two options for failure (zpci_enable_device +
zpci_scan_bus), but missing error handling with an appropriate call to
zpci_destroy_iommu().
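
I.e. roughly something like this in zpci_create_device() (sketch only, the
label names are made up):

  rc = zpci_init_iommu(zdev);
  if (rc)
          goto out_free;
  ...
  rc = zpci_enable_device(zdev);
  if (rc)
          goto out_destroy_iommu;

  rc = zpci_scan_bus(zdev);
  if (rc)
          goto out_disable;
  ...
  out_disable:
          zpci_disable_device(zdev);
  out_destroy_iommu:
          zpci_destroy_iommu(zdev);
  out_free:
          zpci_free_domain(zdev);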

I must admit that I do not understand what these sysfs attributes are
used for, and by whom and when. Is it really necessary/correct to register
them (including the s390_iommu_ops) _before_ the zpci_dev is registered
to the bus (zpci_scan_bus -> pci_scan_root_bus/pci_bus_add_devices)?

What is the expected life cycle for the attributes, and are they already
needed when a device is still under DMA API access, or even before it is
added to the PCI bus?

>   mutex_init(&zdev->lock);
>   if (zdev->state == ZPCI_FN_STATE_CONFIGURED) {
>   rc = zpci_enable_device(zdev);
> diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c
> index 8788640..85f3bc5 100644
> --- a/drivers/iommu/s390-iommu.c
> +++ b/drivers/iommu/s390-iommu.c
> @@ -18,6 +18,8 @@
>   */
>  #define S390_IOMMU_PGSIZES   (~0xFFFUL)
> 
> +static struct iommu_ops s390_iommu_ops;
> +
>  struct s390_domain {
>   struct iommu_domain domain;
>   struct list_headdevices;
> @@ -166,11 +168,13 @@ static void s390_iommu_detach_device(struct 
> iommu_domain *domain,
>  static int s390_iommu_add_device(struct device *dev)
>  {
>   struct iommu_group *group = iommu_group_get_for_dev(dev);
> + struct zpci_dev *zdev = to_pci_dev(dev)->sysdata;
> 
>   if (IS_ERR(group))
>   return PTR_ERR(group);
> 
>   iommu_group_put(group);
> + iommu_device_link(&zdev->iommu_dev, dev);
> 
>   return 0;
>  }
> @@ -197,6 +201,7 @@ static void s390_iommu_remove_device(struct device *dev)
>   s390_iommu_detach_device(domain, dev);
>   }
> 
> + iommu_device_unlink(&zdev->iommu_dev, dev);
>   iommu_group_remove_device(dev);
>  }
> 
> @@ -327,6 +332,36 @@ static size_t s390_iommu_unmap(struct iommu_domain 
> *domain,
>   return size;
>  }
> 
> +int zpci_init_iommu(struct zpci_dev *zdev)
> +{
> + int rc = 0;
> +
> + rc = iommu_device_sysfs_add(&zdev->iommu_dev, NULL, NULL,
> + "s390-iommu.%08x", zdev->fid);
> + if (rc)
> + goto out_err;
> +
> + iommu_device_set_ops(&zdev->iommu_dev, &s390_iommu_ops);
> +
> + rc = iommu_device_register(&zdev->iommu_dev);
> + if (rc)
> + goto out_sysfs;
> +
> + return 0;
> +
> +out_sysfs:
> + iommu_device_sysfs_remove(&zdev->iommu_dev);
> +
> +out_err:
> + return rc;
> +}
> +
> +void zpci_destroy_iommu(struct zpci_dev *zdev)
> +{
> + iommu_device_unregister(&zdev->iommu_dev);
> + iommu_device_sysfs_remove(&zdev->iommu_dev);
> +}
> +
>  static struct iommu_ops s390_iommu_ops = {
>   .capable = s390_iommu_capable,
>   .domain_alloc = s390_domain_alloc,



Re: [PATCH 1/2] iommu/s390: Use iommu_group_get_for_dev() in s390_iommu_add_device()

2017-06-16 Thread Gerald Schaefer
On Thu, 15 Jun 2017 15:11:51 +0200
Joerg Roedel <j...@8bytes.org> wrote:

> From: Joerg Roedel <jroe...@suse.de>
> 
> The iommu_group_get_for_dev() function also attaches the
> device to its group, so this code doesn't need to be in the
> iommu driver.
> 
> Further by using this function the driver can make use of
> default domains in the future.
> 
> Signed-off-by: Joerg Roedel <jroe...@suse.de>

Seems pretty straightforward, so
Reviewed-by: Gerald Schaefer <gerald.schae...@de.ibm.com>

However, looking at iommu_group_get_for_dev(), I wonder if the
generic_device_group() always returns the right thing, but that
would be independent from this patch.

With generic_device_group() returning NULL in case the allocation failed,
this part of iommu_group_get_for_dev() would then happily dereference the
NULL pointer, because IS_ERR(group) would be false:

if (ops && ops->device_group)
group = ops->device_group(dev);

if (IS_ERR(group))
return group;

/*
 * Try to allocate a default domain - needs support from the
 * IOMMU driver.
 */
if (!group->default_domain) {

The same is true for pci_device_group(), which also returns NULL in case
of allocation failure. I guess both functions should just return the
group pointer from iommu_group_alloc() directly, which already would
contain an appropriate ERR_PTR, but never NULL.
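
I.e. simply (sketch):

  struct iommu_group *generic_device_group(struct device *dev)
  {
          /* iommu_group_alloc() already returns an ERR_PTR on failure,
           * never NULL, so it can be passed through directly
           */
          return iommu_group_alloc();
  }

and the same for the tail end of pci_device_group().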

What do you think?

> ---
>  drivers/iommu/s390-iommu.c | 15 +--
>  1 file changed, 5 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c
> index 179e636..8788640 100644
> --- a/drivers/iommu/s390-iommu.c
> +++ b/drivers/iommu/s390-iommu.c
> @@ -165,20 +165,14 @@ static void s390_iommu_detach_device(struct 
> iommu_domain *domain,
> 
>  static int s390_iommu_add_device(struct device *dev)
>  {
> - struct iommu_group *group;
> - int rc;
> + struct iommu_group *group = iommu_group_get_for_dev(dev);
> 
> - group = iommu_group_get(dev);
> - if (!group) {
> - group = iommu_group_alloc();
> - if (IS_ERR(group))
> - return PTR_ERR(group);
> - }
> + if (IS_ERR(group))
> + return PTR_ERR(group);
> 
> - rc = iommu_group_add_device(group, dev);
>   iommu_group_put(group);
> 
> - return rc;
> + return 0;
>  }
> 
>  static void s390_iommu_remove_device(struct device *dev)
> @@ -344,6 +338,7 @@ static struct iommu_ops s390_iommu_ops = {
>   .iova_to_phys = s390_iommu_iova_to_phys,
>   .add_device = s390_iommu_add_device,
>   .remove_device = s390_iommu_remove_device,
> + .device_group = generic_device_group,
>   .pgsize_bitmap = S390_IOMMU_PGSIZES,
>  };
> 


