Re: [PATCH 07/12] s390: add pte_free_defer(), with use of mmdrop_async()

2023-06-08 Thread Gerald Schaefer
On Wed, 7 Jun 2023 20:35:05 -0700 (PDT)
Hugh Dickins  wrote:

> On Tue, 6 Jun 2023, Gerald Schaefer wrote:
> > On Mon, 5 Jun 2023 22:11:52 -0700 (PDT)
> > Hugh Dickins  wrote:  
> > > On Thu, 1 Jun 2023 15:57:51 +0200
> > > Gerald Schaefer  wrote:  
> > > > 
> > > > Yes, we have 2 pagetables in one 4K page, which could result in same
> > > > rcu_head reuse. It might be possible to use the cleverness from our
> > > > page_table_free() function, e.g. to only do the call_rcu() once, for
> > > > the case where both 2K pagetable fragments become unused, similar to
> > > > how we decide when to actually call __free_page().
> > > > 
> > > > However, it might be much worse, and page->rcu_head from a pagetable
> > > > page cannot be used at all for s390, because we also use page->lru
> > > > to keep our list of free 2K pagetable fragments. I always get confused
> > > > by struct page unions, so not completely sure, but it seems to me that
> > > > page->rcu_head would overlay with page->lru, right?
> > > 
> > > Sigh, yes, page->rcu_head overlays page->lru.  But (please correct me if
> > > I'm wrong) I think that s390 could use exactly the same technique for
> > > its list of free 2K pagetable fragments as it uses for its list of THP
> > > "deposited" pagetable fragments, over in arch/s390/mm/pgtable.c: use
> > > the first two longs of the page table itself for threading the list.  
> > 
> > Nice idea, I think that could actually work, since we only need the empty
> > 2K halves on the list. So it should be possible to store the list_head
> > inside those.  
> 
> Jason quickly pointed out the flaw in my thinking there.

Yes, while I had the right concerns about "the to-be-freed pagetables would
still be accessible, but not really valid, if we added them back to the list,
with list_heads inside them", when suggesting the approach w/o passing over
the mm, I missed that we would have the very same issue already with the
existing page_table_free_rcu().

Thankfully Jason was watching out!

> 
> >   
> > > 
> > > And while it could use third and fourth longs instead, I don't see any
> > > need for that: a deposited pagetable has been allocated, so would not
> > > be on the list of free fragments.  
> > 
> > Correct, that should not interfere.
> >   
> > > 
> > > Below is one of the grossest patches I've ever posted: gross because
> > > it's a rushed attempt to see whether that is viable, while it would take
> > > me longer to understand all the s390 cleverness there (even though the
> > > PP AA commentary above page_table_alloc() is excellent).  
> > 
> > Sounds fair, this is also some of the grossest code we have, which is also
> > why Alexander added the comment. I guess we could use even more comments
> > inside the code, as it still confuses me more than it should.
> > 
> > Considering that, you did remarkably well. Your patch seems to work fine,
> > at least it survived some LTP mm tests. I will also add it to our CI runs,
> > to give it some more testing. Will report tomorrow if it breaks something.
> > See also below for some patch comments.  
> 
> Many thanks for your effort on this patch.  I don't expect the testing
> of it to catch Jason's point, that I'm corrupting the page table while
> it's on its way through RCU to being freed, but he's right nonetheless.

Right, tests ran fine, but we would have introduced subtle issues with
racing gup_fast, I guess.

> 
> I'll integrate your fixes below into what I have here, but probably
> just archive it as something to refer to later in case it might play
> a part; but probably it will not - sorry for wasting your time.

No worries, looking at that s390 code can never be amiss. It seems I need
a regular refresher, at least I'm sure I already understood it better in the
past.

And who knows, with Jason's recent thoughts, that "list_head inside
pagetable" idea might not be dead yet.

> 
> >   
> > > 
> > > I'm hoping the use of page->lru in arch/s390/mm/gmap.c is disjoint.
> > > And cmma_init_nodat()? Ah, that's __init so I guess disjoint.  
> > 
> > cmma_init_nodat() should be disjoint, not only because it is __init,
> > but also because it explicitly skips pagetable pages, so it should
> > never touch page->lru of those.
> > 
> > Not very familiar with the gmap code, it does look disjoint, and we should
> > also use complete 4K pages for pagetables instead of 2K fragments there,
> > but Christian or Claudio should also have a look.
> >   
> > > 
> > > Gerald, s390 folk: would it be possible for you to give this
> > > a try, suggest corrections and improvements, and then I can make it
> > > a separate patch of the series; and work on avoiding concurrent use
> > > of the rcu_head by pagetable fragment buddies (ideally fit in with
> > > the scheme already there, maybe DD bits to go along with the PP AA).  
> > 
> > It feels like it could be possible to not only avoid the double
> > rcu_head, but also avoid passing over the mm via page->pt_mm.
> > I.e. have pte_free_defer(), 

Re: [PATCH 07/12] s390: add pte_free_defer(), with use of mmdrop_async()

2023-06-08 Thread Jason Gunthorpe
On Wed, Jun 07, 2023 at 08:35:05PM -0700, Hugh Dickins wrote:

> My current thinking (but may be proved wrong) is along the lines of:
> why does something on its way to being freed need to be on any list
> other than the rcu_head list?  I expect the current answer is, that the
> other half is allocated, so the page won't be freed; but I hope that
> we can put it back on that list once we're through with the rcu_head.

I was having the same thought. It is pretty tricky, but if this was
made into some core helper then PPC and S390 could both use it and PPC
would get a nice upgrade to have the S390 frag re-use instead of
leaking frags.

Broadly we have three states:

 all frags free
 at least one frag free
 all frags used

'all frags free' should be returned to the allocator
'at least one frag free' should have the struct page on the mm_struct's list
'all frags used' should be on no list.

So if we go from 
  all frags used -> at least one frag free
then we queue it for RCU, and the RCU callback puts it on the mm_struct list.

If we go from 
   at least one frag free -> all frags free
then we take it off the mm_struct list, queue it for RCU, and the RCU
callback frees it.
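
The three states and two list transitions above can be sketched in plain C as follows. This is only an illustration under stated assumptions: every name here (frag_page_state, list_change, frag_page_classify, frag_free_transition) is hypothetical and not taken from the kernel, the free mask is a simple per-page bitmask, and no locking or real RCU is modelled.

```c
#include <assert.h>

/* Hypothetical names throughout; nothing here is kernel API. */
enum frag_page_state {
	FRAGS_ALL_USED,		/* page is on no list */
	FRAGS_SOME_FREE,	/* page is on the mm_struct's list */
	FRAGS_ALL_FREE,		/* page goes back to the allocator */
};

enum list_change {
	LIST_NO_CHANGE,		/* list membership stays as it is */
	LIST_ADD_AFTER_RCU,	/* all used -> at least one free */
	LIST_DEL_THEN_RCU_FREE,	/* at least one free -> all free */
};

/* free_mask has one bit per fragment; nr_frags would be 2 on s390. */
static enum frag_page_state frag_page_classify(unsigned int free_mask,
					       unsigned int nr_frags)
{
	unsigned int all;

	assert(nr_frags >= 2 && nr_frags < 32);
	all = (1u << nr_frags) - 1;
	free_mask &= all;
	if (free_mask == 0)
		return FRAGS_ALL_USED;
	if (free_mask == all)
		return FRAGS_ALL_FREE;
	return FRAGS_SOME_FREE;
}

/* Which list operation does freeing fragment 'bit' imply? */
static enum list_change frag_free_transition(unsigned int free_mask,
					     unsigned int bit,
					     unsigned int nr_frags)
{
	enum frag_page_state before = frag_page_classify(free_mask, nr_frags);
	enum frag_page_state after =
		frag_page_classify(free_mask | (1u << bit), nr_frags);

	if (before == FRAGS_ALL_USED && after == FRAGS_SOME_FREE)
		return LIST_ADD_AFTER_RCU;
	if (before == FRAGS_SOME_FREE && after == FRAGS_ALL_FREE)
		return LIST_DEL_THEN_RCU_FREE;
	/* Assumption: with 3-4 frags a page can stay "some free". */
	return LIST_NO_CHANGE;
}
```

With two fragments per page, freeing the first fragment of a fully used page yields LIST_ADD_AFTER_RCU and freeing the second yields LIST_DEL_THEN_RCU_FREE.
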

Your trick to put the list_head for the mm_struct list into the frag
memory looks like the right direction. So 'at least one frag free' has
a single already RCU free'd frag hold the list head pointer. Thus we
never use the LRU and the rcu_head is always available.

The struct page itself can contain the actual free frag bitmask.

I think if we split up the memory used for pt_frag_refcount we can get
enough bits to keep track of everything. With only 2-4 frags we should
be OK.

So we track this data in the struct page:
  - Current RCU free TODO bitmask - if non-zero then an RCU callback is
already triggered
  - Next RCU TODO bitmask - if an RCU callback is already triggered then we
accumulate more free'd frags here
  - Current Free Bits - only updated by the RCU callback

?

We'd also need to store the mm_struct pointer in the struct page for
the RCU callback to be able to add/remove from the mm_struct list.

I'm not sure how much of the work can be done with atomics and how
much would need to rely on spinlock inside the mm_struct.

It feels feasible and not so bad. :)

Figure it out and test it on S390 then make power use the same common
code, and we get full RCU page table freeing using a reliable rcu_head
on both of these previously troublesome architectures :) Yay

Jason
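
A minimal single-threaded sketch of the three bitmasks proposed in the mail above, to show how one rcu_head per page could serve several fragments. All names (frag_track, frag_defer_free, frag_rcu_done) are hypothetical; the mail leaves open how much can be done with atomics versus a spinlock, so this sketch deliberately uses neither and only models the bookkeeping.

```c
#include <assert.h>

/* Hypothetical per-page bookkeeping; not an existing kernel structure. */
struct frag_track {
	unsigned int cur_rcu_todo;	/* non-zero: an RCU callback is pending */
	unsigned int next_rcu_todo;	/* frags freed while one is pending */
	unsigned int free_bits;		/* only updated by the RCU callback */
};

/*
 * Defer the free of fragment 'bit'.
 * Returns 1 if the caller must queue the page's single rcu_head,
 * 0 if a callback is already pending and the bit was accumulated.
 */
static int frag_defer_free(struct frag_track *t, unsigned int bit)
{
	unsigned int mask = 1u << bit;

	if (t->cur_rcu_todo == 0) {
		t->cur_rcu_todo = mask;
		return 1;
	}
	t->next_rcu_todo |= mask;
	return 0;
}

/*
 * What the RCU callback would do once the grace period has elapsed.
 * Returns 1 if the rcu_head must be queued again for accumulated frags.
 */
static int frag_rcu_done(struct frag_track *t)
{
	t->free_bits |= t->cur_rcu_todo;
	t->cur_rcu_todo = t->next_rcu_todo;
	t->next_rcu_todo = 0;
	return t->cur_rcu_todo != 0;
}
```

The point of the split is that the rcu_head is never queued twice at once: a second free during a pending grace period only sets a bit, and the callback requeues itself if anything accumulated.
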


Re: [PATCH 07/12] s390: add pte_free_defer(), with use of mmdrop_async()

2023-06-07 Thread Hugh Dickins
On Tue, 6 Jun 2023, Gerald Schaefer wrote:
> On Mon, 5 Jun 2023 22:11:52 -0700 (PDT)
> Hugh Dickins  wrote:
> > On Thu, 1 Jun 2023 15:57:51 +0200
> > Gerald Schaefer  wrote:
> > > 
> > > Yes, we have 2 pagetables in one 4K page, which could result in same
> > > rcu_head reuse. It might be possible to use the cleverness from our
> > > page_table_free() function, e.g. to only do the call_rcu() once, for
> > > the case where both 2K pagetable fragments become unused, similar to
> > > how we decide when to actually call __free_page().
> > > 
> > > However, it might be much worse, and page->rcu_head from a pagetable
> > > page cannot be used at all for s390, because we also use page->lru
> > > to keep our list of free 2K pagetable fragments. I always get confused
> > > by struct page unions, so not completely sure, but it seems to me that
> > > page->rcu_head would overlay with page->lru, right?  
> > 
> > Sigh, yes, page->rcu_head overlays page->lru.  But (please correct me if
> > I'm wrong) I think that s390 could use exactly the same technique for
> > its list of free 2K pagetable fragments as it uses for its list of THP
> > "deposited" pagetable fragments, over in arch/s390/mm/pgtable.c: use
> > the first two longs of the page table itself for threading the list.
> 
> Nice idea, I think that could actually work, since we only need the empty
> 2K halves on the list. So it should be possible to store the list_head
> inside those.

Jason quickly pointed out the flaw in my thinking there.

> 
> > 
> > And while it could use third and fourth longs instead, I don't see any
> > need for that: a deposited pagetable has been allocated, so would not
> > be on the list of free fragments.
> 
> Correct, that should not interfere.
> 
> > 
> > Below is one of the grossest patches I've ever posted: gross because
> > it's a rushed attempt to see whether that is viable, while it would take
> > me longer to understand all the s390 cleverness there (even though the
> > PP AA commentary above page_table_alloc() is excellent).
> 
> Sounds fair, this is also some of the grossest code we have, which is also
> why Alexander added the comment. I guess we could use even more comments
> inside the code, as it still confuses me more than it should.
> 
> Considering that, you did remarkably well. Your patch seems to work fine,
> at least it survived some LTP mm tests. I will also add it to our CI runs,
> to give it some more testing. Will report tomorrow if it breaks something.
> See also below for some patch comments.

Many thanks for your effort on this patch.  I don't expect the testing
of it to catch Jason's point, that I'm corrupting the page table while
it's on its way through RCU to being freed, but he's right nonetheless.

I'll integrate your fixes below into what I have here, but probably
just archive it as something to refer to later in case it might play
a part; but probably it will not - sorry for wasting your time.

> 
> > 
> > I'm hoping the use of page->lru in arch/s390/mm/gmap.c is disjoint.
> > And cmma_init_nodat()? Ah, that's __init so I guess disjoint.
> 
> cmma_init_nodat() should be disjoint, not only because it is __init,
> but also because it explicitly skips pagetable pages, so it should
> never touch page->lru of those.
> 
> Not very familiar with the gmap code, it does look disjoint, and we should
> also use complete 4K pages for pagetables instead of 2K fragments there,
> but Christian or Claudio should also have a look.
> 
> > 
> > Gerald, s390 folk: would it be possible for you to give this
> > a try, suggest corrections and improvements, and then I can make it
> > a separate patch of the series; and work on avoiding concurrent use
> > of the rcu_head by pagetable fragment buddies (ideally fit in with
> > the scheme already there, maybe DD bits to go along with the PP AA).
> 
> It feels like it could be possible to not only avoid the double
> rcu_head, but also avoid passing over the mm via page->pt_mm.
> I.e. have pte_free_defer(), which has the mm, do all the checks and
> list updates that page_table_free() does, for which we need the mm.
> Then just skip the pgtable_pte_page_dtor() + __free_page() at the end,
> and do call_rcu(pte_free_now) instead. The pte_free_now() could then
> just do _dtor/__free_page similar to the generic version.

I'm not sure: I missed your suggestion there when I first skimmed
through, and today have spent more time getting deeper into how it's
done at present.  I am now feeling more confident of a way forward,
a nicely integrated way forward, than I was yesterday.
Though getting it right may not be so easy.

When Jason pointed out the existing RCU, I initially hoped that it might
already provide the necessary framework: but sadly not, because the
unbatched case (used when additional memory is not available) does not
use RCU at all, but instead the tlb_remove_table_sync_one() IRQ hack.
If I used that, it would cripple the s390 implementation unacceptably.

> 
> I must admit 

Re: [PATCH 07/12] s390: add pte_free_defer(), with use of mmdrop_async()

2023-06-07 Thread Hugh Dickins
On Tue, 6 Jun 2023, Jason Gunthorpe wrote:
> On Mon, Jun 05, 2023 at 10:11:52PM -0700, Hugh Dickins wrote:
> 
> > "deposited" pagetable fragments, over in arch/s390/mm/pgtable.c: use
> > the first two longs of the page table itself for threading the list.
> 
> It is not RCU anymore if it writes to the page table itself before the
> grace period, so this change seems to break the RCU behavior of
> page_table_free_rcu(). The rcu sync is inside tlb_remove_table()
> called after the stores.

Yes indeed, thanks for pointing that out.

> 
> Maybe something like an xarray on the mm to hold the frags?

I think we can manage without that:
I'll say slightly more in reply to Gerald.

Hugh


Re: [PATCH 07/12] s390: add pte_free_defer(), with use of mmdrop_async()

2023-06-06 Thread Gerald Schaefer
On Mon, 5 Jun 2023 22:11:52 -0700 (PDT)
Hugh Dickins  wrote:

> On Sun, 28 May 2023, Hugh Dickins wrote:
> 
> > Add s390-specific pte_free_defer(), to call pte_free() via call_rcu().
> > pte_free_defer() will be called inside khugepaged's retract_page_tables()
> > loop, where allocating extra memory cannot be relied upon.  This precedes
> > the generic version to avoid build breakage from incompatible pgtable_t.
> > 
> > This version is more complicated than others: because page_table_free()
> > needs to know which fragment is being freed, and which mm to link it to.
> > 
> > page_table_free()'s fragment handling is clever, but I could too easily
> > break it: what's done here in pte_free_defer() and pte_free_now() might
> > be better integrated with page_table_free()'s cleverness, but not by me!
> > 
> > By the time that page_table_free() gets called via RCU, it's conceivable
> > that mm would already have been freed: so mmgrab() in pte_free_defer()
> > and mmdrop() in pte_free_now().  No, that is not a good context to call
> > mmdrop() from, so make mmdrop_async() public and use that.  
> 
> But Matthew Wilcox quickly pointed out that sharing one page->rcu_head
> between multiple page tables is tricky: something I knew but had lost
> sight of.  So the powerpc and s390 patches were broken: powerpc fairly
> easily fixed, but s390 more painful.
> 
> In https://lore.kernel.org/linux-s390/20230601155751.7c949ca4@thinkpad-T15/
> On Thu, 1 Jun 2023 15:57:51 +0200
> Gerald Schaefer  wrote:
> > 
> > Yes, we have 2 pagetables in one 4K page, which could result in same
> > rcu_head reuse. It might be possible to use the cleverness from our
> > page_table_free() function, e.g. to only do the call_rcu() once, for
> > the case where both 2K pagetable fragments become unused, similar to
> > how we decide when to actually call __free_page().
> > 
> > However, it might be much worse, and page->rcu_head from a pagetable
> > page cannot be used at all for s390, because we also use page->lru
> > to keep our list of free 2K pagetable fragments. I always get confused
> > by struct page unions, so not completely sure, but it seems to me that
> > page->rcu_head would overlay with page->lru, right?  
> 
> Sigh, yes, page->rcu_head overlays page->lru.  But (please correct me if
> I'm wrong) I think that s390 could use exactly the same technique for
> its list of free 2K pagetable fragments as it uses for its list of THP
> "deposited" pagetable fragments, over in arch/s390/mm/pgtable.c: use
> the first two longs of the page table itself for threading the list.

Nice idea, I think that could actually work, since we only need the empty
2K halves on the list. So it should be possible to store the list_head
inside those.

> 
> And while it could use third and fourth longs instead, I don't see any
> need for that: a deposited pagetable has been allocated, so would not
> be on the list of free fragments.

Correct, that should not interfere.

> 
> Below is one of the grossest patches I've ever posted: gross because
> it's a rushed attempt to see whether that is viable, while it would take
> me longer to understand all the s390 cleverness there (even though the
> PP AA commentary above page_table_alloc() is excellent).

Sounds fair, this is also some of the grossest code we have, which is also
why Alexander added the comment. I guess we could use even more comments
inside the code, as it still confuses me more than it should.

Considering that, you did remarkably well. Your patch seems to work fine,
at least it survived some LTP mm tests. I will also add it to our CI runs,
to give it some more testing. Will report tomorrow if it breaks something.
See also below for some patch comments.

> 
> I'm hoping the use of page->lru in arch/s390/mm/gmap.c is disjoint.
> And cmma_init_nodat()? Ah, that's __init so I guess disjoint.

cmma_init_nodat() should be disjoint, not only because it is __init,
but also because it explicitly skips pagetable pages, so it should
never touch page->lru of those.

Not very familiar with the gmap code, it does look disjoint, and we should
also use complete 4K pages for pagetables instead of 2K fragments there,
but Christian or Claudio should also have a look.

> 
> Gerald, s390 folk: would it be possible for you to give this
> a try, suggest corrections and improvements, and then I can make it
> a separate patch of the series; and work on avoiding concurrent use
> of the rcu_head by pagetable fragment buddies (ideally fit in with
> the scheme already there, maybe DD bits to go along with the PP AA).

It feels like it could be possible to not only avoid the double
rcu_head, but also avoid passing over the mm via page->pt_mm.
I.e. have pte_free_defer(), which has the mm, do all the checks and
list updates that page_table_free() does, for which we need the mm.
Then just skip the pgtable_pte_page_dtor() + __free_page() at the end,
and do call_rcu(pte_free_now) instead. The pte_free_now() could then
just do 

Re: [PATCH 07/12] s390: add pte_free_defer(), with use of mmdrop_async()

2023-06-06 Thread Jason Gunthorpe
On Mon, Jun 05, 2023 at 10:11:52PM -0700, Hugh Dickins wrote:

> "deposited" pagetable fragments, over in arch/s390/mm/pgtable.c: use
> the first two longs of the page table itself for threading the list.

It is not RCU anymore if it writes to the page table itself before the
grace period, so this change seems to break the RCU behavior of
page_table_free_rcu(). The rcu sync is inside tlb_remove_table()
called after the stores.

Maybe something like an xarray on the mm to hold the frags?

Jason


Re: [PATCH 07/12] s390: add pte_free_defer(), with use of mmdrop_async()

2023-06-05 Thread Hugh Dickins
On Sun, 28 May 2023, Hugh Dickins wrote:

> Add s390-specific pte_free_defer(), to call pte_free() via call_rcu().
> pte_free_defer() will be called inside khugepaged's retract_page_tables()
> loop, where allocating extra memory cannot be relied upon.  This precedes
> the generic version to avoid build breakage from incompatible pgtable_t.
> 
> This version is more complicated than others: because page_table_free()
> needs to know which fragment is being freed, and which mm to link it to.
> 
> page_table_free()'s fragment handling is clever, but I could too easily
> break it: what's done here in pte_free_defer() and pte_free_now() might
> be better integrated with page_table_free()'s cleverness, but not by me!
> 
> By the time that page_table_free() gets called via RCU, it's conceivable
> that mm would already have been freed: so mmgrab() in pte_free_defer()
> and mmdrop() in pte_free_now().  No, that is not a good context to call
> mmdrop() from, so make mmdrop_async() public and use that.

But Matthew Wilcox quickly pointed out that sharing one page->rcu_head
between multiple page tables is tricky: something I knew but had lost
sight of.  So the powerpc and s390 patches were broken: powerpc fairly
easily fixed, but s390 more painful.

In https://lore.kernel.org/linux-s390/20230601155751.7c949ca4@thinkpad-T15/
On Thu, 1 Jun 2023 15:57:51 +0200
Gerald Schaefer  wrote:
> 
> Yes, we have 2 pagetables in one 4K page, which could result in same
> rcu_head reuse. It might be possible to use the cleverness from our
> page_table_free() function, e.g. to only do the call_rcu() once, for
> the case where both 2K pagetable fragments become unused, similar to
> how we decide when to actually call __free_page().
> 
> However, it might be much worse, and page->rcu_head from a pagetable
> page cannot be used at all for s390, because we also use page->lru
> to keep our list of free 2K pagetable fragments. I always get confused
> by struct page unions, so not completely sure, but it seems to me that
> page->rcu_head would overlay with page->lru, right?

Sigh, yes, page->rcu_head overlays page->lru.  But (please correct me if
I'm wrong) I think that s390 could use exactly the same technique for
its list of free 2K pagetable fragments as it uses for its list of THP
"deposited" pagetable fragments, over in arch/s390/mm/pgtable.c: use
the first two longs of the page table itself for threading the list.
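
To make the overlay concrete, here is a small user-space sketch. It is a simplified stand-in, not the real struct page layout: the types list_head_sketch, rcu_head_sketch and page_sketch are hypothetical, and only the fact that the two members share storage is modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for struct list_head and struct rcu_head. */
struct list_head_sketch {
	struct list_head_sketch *next, *prev;
};

struct rcu_head_sketch {
	struct rcu_head_sketch *next;
	void (*func)(struct rcu_head_sketch *head);
};

/* Hypothetical, heavily reduced "struct page": lru and rcu_head share storage. */
struct page_sketch {
	unsigned long flags;
	union {
		struct list_head_sketch lru;
		struct rcu_head_sketch rcu_head;
	};
};

/* Both members start at the same offset, so they cannot be used at once. */
static int lru_overlays_rcu_head(void)
{
	return offsetof(struct page_sketch, lru) ==
	       offsetof(struct page_sketch, rcu_head);
}

/* Filling in the rcu_head destroys the list linkage kept in lru. */
static int rcu_write_clobbers_lru(void)
{
	struct list_head_sketch head = { &head, &head };
	struct page_sketch page;

	page.flags = 0;
	page.lru.next = &head;
	page.lru.prev = &head;
	page.rcu_head.next = NULL;	/* roughly what queueing a callback does */
	page.rcu_head.func = NULL;
	return page.lru.next != &head;
}
```

This is why a page that sits on a list through page->lru cannot also be handed to call_rcu() through page->rcu_head.
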

And while it could use third and fourth longs instead, I don't see any
need for that: a deposited pagetable has been allocated, so would not
be on the list of free fragments.

Below is one of the grossest patches I've ever posted: gross because
it's a rushed attempt to see whether that is viable, while it would take
me longer to understand all the s390 cleverness there (even though the
PP AA commentary above page_table_alloc() is excellent).

I'm hoping the use of page->lru in arch/s390/mm/gmap.c is disjoint.
And cmma_init_nodat()? Ah, that's __init so I guess disjoint.

Gerald, s390 folk: would it be possible for you to give this
a try, suggest corrections and improvements, and then I can make it
a separate patch of the series; and work on avoiding concurrent use
of the rcu_head by pagetable fragment buddies (ideally fit in with
the scheme already there, maybe DD bits to go along with the PP AA).

Why am I even asking you to move away from page->lru: why don't I
thread s390's pte_free_defer() pagetables like THP's deposit does?
I cannot, because the deferred pagetables have to remain accessible
as valid pagetables, until the RCU grace period has elapsed - unless
all the list pointers would appear as pte_none(), which I doubt.
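
A user-space sketch of the problem described in the paragraph above: once list pointers are stored in the first two longs of a fragment, those two slots no longer look like empty entries to a lockless reader. The constants and helper names are hypothetical; in particular EMPTY_ENTRY is only an assumed stand-in for whatever value pte_none() accepts.

```c
#include <assert.h>
#include <stdint.h>

#define FRAG_ENTRIES	256		/* assumed: 2K fragment of 8-byte entries */
#define EMPTY_ENTRY	((uintptr_t)0x400)	/* hypothetical "none" marker */

/* Fill a fragment with entries that a reader would treat as empty. */
static void frag_init_empty(uintptr_t *frag)
{
	int i;

	for (i = 0; i < FRAG_ENTRIES; i++)
		frag[i] = EMPTY_ENTRY;
}

/* Would a lockless walker see only empty entries in this fragment? */
static int frag_all_none(const uintptr_t *frag)
{
	int i;

	for (i = 0; i < FRAG_ENTRIES; i++)
		if (frag[i] != EMPTY_ENTRY)
			return 0;
	return 1;
}

/* Thread the fragment onto a list by reusing its first two longs. */
static void frag_thread(uintptr_t *frag, uintptr_t next, uintptr_t prev)
{
	frag[0] = next;		/* takes the place of list_head.next */
	frag[1] = prev;		/* takes the place of list_head.prev */
}
```

After frag_thread(), frag_all_none() is false, which is the corruption a concurrent reader could observe before the grace period ends.
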

(That may limit our possibilities with the deposited pagetables in
future: I can imagine them too wanting to remain accessible as valid
pagetables.  But that's not needed by this series, and s390 only uses
deposit/withdraw for anon THP; and some are hoping that we might be
able to move away from deposit/withdraw altogether - though powerpc's
special use will make that more difficult.)

Thanks!
Hugh

--- 6.4-rc5/arch/s390/mm/pgalloc.c
+++ linux/arch/s390/mm/pgalloc.c
@@ -232,6 +232,7 @@ void page_table_free_pgste(struct page *
  */
 unsigned long *page_table_alloc(struct mm_struct *mm)
 {
+	struct list_head *listed;
 	unsigned long *table;
 	struct page *page;
 	unsigned int mask, bit;
@@ -241,8 +242,8 @@ unsigned long *page_table_alloc(struct m
 		table = NULL;
 		spin_lock_bh(&mm->context.lock);
 		if (!list_empty(&mm->context.pgtable_list)) {
-			page = list_first_entry(&mm->context.pgtable_list,
-						struct page, lru);
+			listed = mm->context.pgtable_list.next;
+			page = virt_to_page(listed);
 			mask = atomic_read(&page->_refcount) >> 24;
 			/*
 			 * The pending removal bits must also be 

[PATCH 07/12] s390: add pte_free_defer(), with use of mmdrop_async()

2023-05-29 Thread Hugh Dickins
Add s390-specific pte_free_defer(), to call pte_free() via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

This version is more complicated than others: because page_table_free()
needs to know which fragment is being freed, and which mm to link it to.

page_table_free()'s fragment handling is clever, but I could too easily
break it: what's done here in pte_free_defer() and pte_free_now() might
be better integrated with page_table_free()'s cleverness, but not by me!

By the time that page_table_free() gets called via RCU, it's conceivable
that mm would already have been freed: so mmgrab() in pte_free_defer()
and mmdrop() in pte_free_now().  No, that is not a good context to call
mmdrop() from, so make mmdrop_async() public and use that.

Signed-off-by: Hugh Dickins 
---
 arch/s390/include/asm/pgalloc.h |  4 ++++
 arch/s390/mm/pgalloc.c          | 34 ++++++++++++++++++++++++++++++++++
 include/linux/mm_types.h        |  2 +-
 include/linux/sched/mm.h        |  1 +
 kernel/fork.c                   |  2 +-
 5 files changed, 41 insertions(+), 2 deletions(-)

diff --git a/arch/s390/include/asm/pgalloc.h b/arch/s390/include/asm/pgalloc.h
index 17eb618f1348..89a9d5ef94f8 100644
--- a/arch/s390/include/asm/pgalloc.h
+++ b/arch/s390/include/asm/pgalloc.h
@@ -143,6 +143,10 @@ static inline void pmd_populate(struct mm_struct *mm,
 #define pte_free_kernel(mm, pte) page_table_free(mm, (unsigned long *) pte)
 #define pte_free(mm, pte) page_table_free(mm, (unsigned long *) pte)
 
+/* arch use pte_free_defer() implementation in arch/s390/mm/pgalloc.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 void vmem_map_init(void);
 void *vmem_crst_alloc(unsigned long val);
 pte_t *vmem_pte_alloc(void);
diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c
index 66ab68db9842..0129de9addfd 100644
--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ -346,6 +346,40 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
 	__free_page(page);
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static void pte_free_now(struct rcu_head *head)
+{
+   struct page *page;
+   unsigned long mm_bit;
+   struct mm_struct *mm;
+   unsigned long *table;
+
+   page = container_of(head, struct page, rcu_head);
+   table = (unsigned long *)page_to_virt(page);
+   mm_bit = (unsigned long)page->pt_mm;
+   /* 4K page has only two 2K fragments, but alignment allows eight */
+   mm = (struct mm_struct *)(mm_bit & ~7);
+   table += PTRS_PER_PTE * (mm_bit & 7);
+   page_table_free(mm, table);
+   mmdrop_async(mm);
+}
+
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+   struct page *page;
+   unsigned long mm_bit;
+
+   mmgrab(mm);
+   page = virt_to_page(pgtable);
+   /* Which 2K page table fragment of a 4K page? */
+   mm_bit = ((unsigned long)pgtable & ~PAGE_MASK) /
+   (PTRS_PER_PTE * sizeof(pte_t));
+   mm_bit += (unsigned long)mm;
+   page->pt_mm = (struct mm_struct *)mm_bit;
+   call_rcu(&page->rcu_head, pte_free_now);
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
 void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
 unsigned long vmaddr)
 {
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 306a3d1a0fa6..1667a1bdb8a8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -146,7 +146,7 @@ struct page {
 			pgtable_t pmd_huge_pte; /* protected by page->ptl */
 			unsigned long _pt_pad_2;	/* mapping */
 			union {
-				struct mm_struct *pt_mm; /* x86 pgds only */
+				struct mm_struct *pt_mm; /* x86 pgd, s390 */
 				atomic_t pt_frag_refcount; /* powerpc */
};
 #if ALLOC_SPLIT_PTLOCKS
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 8d89c8c4fac1..a9043d1a0d55 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -41,6 +41,7 @@ static inline void smp_mb__after_mmgrab(void)
 	smp_mb__after_atomic();
 }
 
+extern void mmdrop_async(struct mm_struct *mm);
 extern void __mmdrop(struct mm_struct *mm);
 
 static inline void mmdrop(struct mm_struct *mm)
diff --git a/kernel/fork.c b/kernel/fork.c
index ed4e01daccaa..fa4486b65c56 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -942,7 +942,7 @@ static void mmdrop_async_fn(struct work_struct *work)
 	__mmdrop(mm);
 }
 
-static void mmdrop_async(struct mm_struct *mm)
+void mmdrop_async(struct mm_struct *mm)
 {
 	if (unlikely(atomic_dec_and_test(&mm->mm_count))) {
 		INIT_WORK(&mm->async_put_work, mmdrop_async_fn);
-- 
2.35.3
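
The page->pt_mm encoding used by pte_free_defer() and pte_free_now() in the patch above packs the fragment index into the low bits of the mm pointer. A user-space sketch of that idea follows; the helper names and the *_SKETCH constants are hypothetical, and it assumes the pointed-to object is at least 8-byte aligned so that three low bits are free.

```c
#include <assert.h>
#include <stdint.h>

#define TAG_MASK		((uintptr_t)7)	/* low bits free with 8-byte alignment */
#define PAGE_SIZE_SKETCH	4096UL		/* assumed 4K page */
#define FRAG_SIZE_SKETCH	2048UL		/* assumed 2K page table fragment */

/* Pack a small index into the unused low bits of an aligned pointer. */
static void *tag_pack(void *ptr, unsigned int idx)
{
	uintptr_t val = (uintptr_t)ptr;

	assert((val & TAG_MASK) == 0);	/* pointer must be 8-byte aligned */
	assert(idx <= TAG_MASK);
	return (void *)(val + idx);
}

/* Recover the original pointer. */
static void *tag_ptr(void *tagged)
{
	return (void *)((uintptr_t)tagged & ~TAG_MASK);
}

/* Recover the index. */
static unsigned int tag_idx(void *tagged)
{
	return (unsigned int)((uintptr_t)tagged & TAG_MASK);
}

/* Which 2K fragment of its 4K page does this table address fall in? */
static unsigned int frag_index(uintptr_t table_addr)
{
	return (unsigned int)((table_addr & (PAGE_SIZE_SKETCH - 1)) /
			      FRAG_SIZE_SKETCH);
}
```

The RCU callback receives only the rcu_head, so both the mm and the fragment index have to be recoverable from the page; tagging the pointer avoids a second field at the cost of relying on alignment.
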