Re: [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping

2018-07-04 Thread Michal Hocko
On Tue 03-07-18 11:22:17, Yang Shi wrote:
> 
> 
> On 7/2/18 11:09 PM, Michal Hocko wrote:
> > On Mon 02-07-18 13:48:45, Andrew Morton wrote:
> > > On Mon, 2 Jul 2018 16:05:02 +0200 Michal Hocko  wrote:
> > > 
> > > > On Fri 29-06-18 20:15:47, Andrew Morton wrote:
> > > > [...]
> > > > > Would one of your earlier designs have addressed all usecases?  I
> > > > > expect the dumb unmap-a-little-bit-at-a-time approach would have?
> > > > It has been already pointed out that this will not work.
> > > I said "one of".  There were others.
> > Well, I was aware of only two potential solutions. Either do the
> > heavy lifting under the shared lock and do the rest with the exclusive
> > one, or this approach: drop the lock in parts. Maybe I have missed others?
> > 
> > > > You simply
> > > > cannot drop the mmap_sem during unmap because another thread could
> > > > change the address space under your feet. So you need some form of
> > > > VM_DEAD and handle concurrent and conflicting address space operations.
> > > Unclear that this is a problem.  If a thread does an unmap of a range
> > > of virtual address space, there's no guarantee that upon return some
> > > other thread has not already mapped new stuff into that address range.
> > > So what's changed?
> > Well, consider the following scenario:
> > Thread A = calling mmap(NULL, sizeA)
> > Thread B = calling munmap(addr, sizeB)
> > 
> > They do not use any external synchronization and rely on the atomic
> > munmap. Thread B only munmaps a range that it knows belongs to it (e.g.
> > it called mmap in the past). It should be clear that ThreadA should not
> > get an address from the addr, sizeB range, right? In the simplest case
> > it will not happen. But let's say that the addr, sizeB range has
> > unmapped holes for whatever reason. Now any time munmap drops the
> > exclusive lock after handling one VMA, Thread A might find its sizeA
> > range and use it. ThreadB then might remove this new range as soon as it
> > gets its exclusive lock again.
> 
> I'm a little bit confused here. If ThreadB has already unmapped that range
> and ThreadA then uses it, that doesn't sound like a problem, since ThreadB
> should just go ahead and handle the next range when it gets its exclusive
> lock again, right? I don't see why ThreadB would revisit that range to remove it.

Not if the new range overlaps with the follow-up range that ThreadB handles.
Example:

B: munmap [X][XX] [XX]
B: breaks the lock after processing the first vma.
A: mmap []
B: munmap retakes the lock and revalidates from the last vm_end because
   the old vma->vm_next might be gone
B:   [XXX][X] [XX]

So you munmap part of the new range. Sure, you can play some tricks and skip
over VMAs that do not start above your last vma->vm_end, or something
like that, but I expect there are other cans of worms hidden there.
-- 
Michal Hocko
SUSE Labs


Re: [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping

2018-07-03 Thread Yang Shi




On 7/2/18 11:09 PM, Michal Hocko wrote:

On Mon 02-07-18 13:48:45, Andrew Morton wrote:

On Mon, 2 Jul 2018 16:05:02 +0200 Michal Hocko  wrote:


On Fri 29-06-18 20:15:47, Andrew Morton wrote:
[...]

Would one of your earlier designs have addressed all usecases?  I
expect the dumb unmap-a-little-bit-at-a-time approach would have?

It has been already pointed out that this will not work.

I said "one of".  There were others.

Well, I was aware of only two potential solutions. Either do the
heavy lifting under the shared lock and do the rest with the exclusive
one, or this approach: drop the lock in parts. Maybe I have missed others?


You simply
cannot drop the mmap_sem during unmap because another thread could
change the address space under your feet. So you need some form of
VM_DEAD and handle concurrent and conflicting address space operations.

Unclear that this is a problem.  If a thread does an unmap of a range
of virtual address space, there's no guarantee that upon return some
other thread has not already mapped new stuff into that address range.
So what's changed?

Well, consider the following scenario:
Thread A = calling mmap(NULL, sizeA)
Thread B = calling munmap(addr, sizeB)

They do not use any external synchronization and rely on the atomic
munmap. Thread B only munmaps a range that it knows belongs to it (e.g.
it called mmap in the past). It should be clear that ThreadA should not
get an address from the addr, sizeB range, right? In the simplest case
it will not happen. But let's say that the addr, sizeB range has
unmapped holes for whatever reason. Now any time munmap drops the
exclusive lock after handling one VMA, Thread A might find its sizeA
range and use it. ThreadB then might remove this new range as soon as it
gets its exclusive lock again.


I'm a little bit confused here. If ThreadB has already unmapped that 
range and ThreadA then uses it, that doesn't sound like a problem, since 
ThreadB should just go ahead and handle the next range when it gets its 
exclusive lock again, right? I don't see why ThreadB would revisit that 
range to remove it.


But if ThreadA uses MAP_FIXED, it might remap some ranges, and then ThreadB 
removes them. That might trigger SIGSEGV or SIGBUS, but this is not 
guaranteed on a vanilla kernel either if the application doesn't do any 
synchronization. It all depends on timing.




Is such code safe? No, it is not, and I would call it fragile at best,
but people tend to do weird things, and atomic munmap behavior is
something they can easily depend on.

Another example would be an atomic address range probing by
MAP_FIXED_NOREPLACE. It would simply break for similar reasons.


Yes, I agree, it could simply break MAP_FIXED_NOREPLACE.

Yang



I remember my attempt to make MAP_LOCKED consistent with mlock (if the
population fails then return -ENOMEM); that required dropping the
shared mmap_sem and taking it in exclusive mode (because we do not
have upgrade_read), and Linus was strongly against it [1][2] for very
similar reasons. If you drop the lock you simply do not know what
happened under your feet.

[1] 
http://lkml.kernel.org/r/CA+55aFydkG-BgZzry5DrTzueVh9VvEcVJdLV8iOyUphQk=0...@mail.gmail.com
[2] 
http://lkml.kernel.org/r/ca+55afyajquhghw59qnwkgk4dbv0tpmdd7-1xqpo7dzwvo_...@mail.gmail.com




Re: [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping

2018-07-03 Thread Yang Shi




On 7/2/18 11:09 PM, Michal Hocko wrote:

On Mon 02-07-18 13:48:45, Andrew Morton wrote:

On Mon, 2 Jul 2018 16:05:02 +0200 Michal Hocko  wrote:


On Fri 29-06-18 20:15:47, Andrew Morton wrote:
[...]

Would one of your earlier designs have addressed all usecases?  I
expect the dumb unmap-a-little-bit-at-a-time approach would have?

It has been already pointed out that this will not work.

I said "one of".  There were others.

Well, I was aware of only two potential solutions. Either do the
heavy lifting under the shared lock and do the rest with the exclusive
one, or this approach: drop the lock in parts. Maybe I have missed others?


There is the other one, which I presented at the LSF/MM summit. But it 
actually turns out that one looks very similar to the one currently under review.


Yang




You simply
cannot drop the mmap_sem during unmap because another thread could
change the address space under your feet. So you need some form of
VM_DEAD and handle concurrent and conflicting address space operations.

Unclear that this is a problem.  If a thread does an unmap of a range
of virtual address space, there's no guarantee that upon return some
other thread has not already mapped new stuff into that address range.
So what's changed?

Well, consider the following scenario:
Thread A = calling mmap(NULL, sizeA)
Thread B = calling munmap(addr, sizeB)

They do not use any external synchronization and rely on the atomic
munmap. Thread B only munmaps a range that it knows belongs to it (e.g.
it called mmap in the past). It should be clear that ThreadA should not
get an address from the addr, sizeB range, right? In the simplest case
it will not happen. But let's say that the addr, sizeB range has
unmapped holes for whatever reason. Now any time munmap drops the
exclusive lock after handling one VMA, Thread A might find its sizeA
range and use it. ThreadB then might remove this new range as soon as it
gets its exclusive lock again.

Is such code safe? No, it is not, and I would call it fragile at best,
but people tend to do weird things, and atomic munmap behavior is
something they can easily depend on.

Another example would be an atomic address range probing by
MAP_FIXED_NOREPLACE. It would simply break for similar reasons.

I remember my attempt to make MAP_LOCKED consistent with mlock (if the
population fails then return -ENOMEM); that required dropping the
shared mmap_sem and taking it in exclusive mode (because we do not
have upgrade_read), and Linus was strongly against it [1][2] for very
similar reasons. If you drop the lock you simply do not know what
happened under your feet.

[1] 
http://lkml.kernel.org/r/CA+55aFydkG-BgZzry5DrTzueVh9VvEcVJdLV8iOyUphQk=0...@mail.gmail.com
[2] 
http://lkml.kernel.org/r/ca+55afyajquhghw59qnwkgk4dbv0tpmdd7-1xqpo7dzwvo_...@mail.gmail.com




Re: [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping

2018-07-03 Thread Michal Hocko
On Tue 03-07-18 11:12:05, Kirill A. Shutemov wrote:
> On Mon, Jul 02, 2018 at 02:49:28PM +0200, Michal Hocko wrote:
> > On Mon 02-07-18 15:33:50, Kirill A. Shutemov wrote:
> > [...]
> > > I probably miss the explanation somewhere, but what's wrong with allowing
> > > other thread to re-populate the VMA?
> > 
> > We have discussed that earlier and it boils down to how racy access
> > to munmap is supposed to behave. Right now we get either the original
> > content or a SEGV. If we simply allow madvise_dontneed before the real
> > unmap, we could get a new page as well. There might be (quite broken, I
> > would say) user space code that would silently corrupt data that
> > way.
> 
> Okay, so we add a lot of complexity to accommodate broken userspace that
> may or may not exist. Is it right? :)

I would really love to do the simplest and most obvious thing

diff --git a/mm/mmap.c b/mm/mmap.c
index 336bee8c4e25..86ffb179c3b5 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2811,6 +2811,8 @@ EXPORT_SYMBOL(vm_munmap);
 SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
 {
profile_munmap(addr);
+   if (len > LARGE_NUMBER)
+   do_madvise(addr, len, MADV_DONTNEED);
return vm_munmap(addr, len);
 }
 
but the argument that the current semantic of good data or SEGV on
racing threads would no longer be preserved sounds valid to me. Remember,
optimizations shouldn't eat your data. How do we ensure that we won't
corrupt data silently?

Besides that, if it was this simple then we would not even need any kernel
code. You could do it from glibc resp. any munmap wrapper. So maybe
the proper answer is: if you do care, then just help the system and
DONTNEED your data before you munmap, as an optimization for large
mappings.
-- 
Michal Hocko
SUSE Labs


Re: [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping

2018-07-03 Thread Kirill A. Shutemov
On Mon, Jul 02, 2018 at 02:49:28PM +0200, Michal Hocko wrote:
> On Mon 02-07-18 15:33:50, Kirill A. Shutemov wrote:
> [...]
> > I probably miss the explanation somewhere, but what's wrong with allowing
> > other thread to re-populate the VMA?
> 
> We have discussed that earlier and it boils down to how racy access
> to munmap is supposed to behave. Right now we get either the original
> content or a SEGV. If we simply allow madvise_dontneed before the real
> unmap, we could get a new page as well. There might be (quite broken, I
> would say) user space code that would silently corrupt data that
> way.

Okay, so we add a lot of complexity to accommodate broken userspace that
may or may not exist. Is it right? :)

-- 
 Kirill A. Shutemov


Re: [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping

2018-07-03 Thread Kirill A. Shutemov
On Mon, Jul 02, 2018 at 10:19:32AM -0700, Yang Shi wrote:
> 
> 
> On 7/2/18 5:33 AM, Kirill A. Shutemov wrote:
> > On Sat, Jun 30, 2018 at 06:39:44AM +0800, Yang Shi wrote:
> > > When running some mmap/munmap scalability tests with large memory
> > > (i.e. >300GB), the below hung task issue may happen occasionally.
> > > INFO: task ps:14018 blocked for more than 120 seconds.
> > > Tainted: GE 4.9.79-009.ali3000.alios7.x86_64 #1
> > >   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> > > message.
> > >   ps  D0 14018  1 0x0004
> > >885582f84000 885e8682f000 880972943000 885ebf499bc0
> > >8828ee12 c900349bfca8 817154d0 0040
> > >00ff812f872a 885ebf499bc0 024000d000948300 880972943000
> > >   Call Trace:
> > >[] ? __schedule+0x250/0x730
> > >[] schedule+0x36/0x80
> > >[] rwsem_down_read_failed+0xf0/0x150
> > >[] call_rwsem_down_read_failed+0x18/0x30
> > >[] down_read+0x20/0x40
> > >[] proc_pid_cmdline_read+0xd9/0x4e0
> > >[] ? do_filp_open+0xa5/0x100
> > >[] __vfs_read+0x37/0x150
> > >[] ? security_file_permission+0x9b/0xc0
> > >[] vfs_read+0x96/0x130
> > >[] SyS_read+0x55/0xc0
> > >[] entry_SYSCALL_64_fastpath+0x1a/0xc5
> > > 
> > > It is because munmap holds mmap_sem from very beginning to all the way
> > > down to the end, and doesn't release it in the middle. When unmapping
> > > large mapping, it may take long time (take ~18 seconds to unmap 320GB
> > > mapping with every single page mapped on an idle machine).
> > > 
> > > Zapping pages is the most time consuming part. According to the
> > > suggestion from Michal Hocko [1], zapping pages can be done while holding
> > > the read mmap_sem, like MADV_DONTNEED does. Then re-acquire the write
> > > mmap_sem to clean up VMAs. All zapped VMAs will have the VM_DEAD flag set;
> > > a page fault to a VM_DEAD vma will trigger SIGSEGV.
> > > 
> > > Define large mapping size thresh as PUD size or 1GB, just zap pages with
> > > read mmap_sem for mappings which are >= thresh value.
> > > 
> > > If the vma has VM_LOCKED | VM_HUGETLB | VM_PFNMAP or uprobe, then just
> > > fallback to regular path since unmapping those mappings need acquire
> > > write mmap_sem.
> > > 
> > > For the time being, just do this in munmap syscall path. Other
> > > vm_munmap() or do_munmap() call sites remain intact for stability
> > > reason.
> > > 
> > > The below is some regression and performance data collected on a machine
> > > with 32 cores of E5-2680 @ 2.70GHz and 384GB memory.
> > > 
> > > With the patched kernel, write mmap_sem hold time drops from seconds
> > > to the microsecond level.
> > > 
> > > [1] https://lwn.net/Articles/753269/
> > > 
> > > Cc: Michal Hocko 
> > > Cc: Matthew Wilcox 
> > > Cc: Laurent Dufour 
> > > Cc: Andrew Morton 
> > > Signed-off-by: Yang Shi 
> > > ---
> > >   mm/mmap.c | 136 
> > > +-
> > >   1 file changed, 134 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index 87dcf83..d61e08b 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -2763,6 +2763,128 @@ static int munmap_lookup_vma(struct mm_struct 
> > > *mm, struct vm_area_struct **vma,
> > >   return 1;
> > >   }
> > > +/* Consider PUD size or 1GB mapping as large mapping */
> > > +#ifdef HPAGE_PUD_SIZE
> > > +#define LARGE_MAP_THRESH HPAGE_PUD_SIZE
> > > +#else
> > > +#define LARGE_MAP_THRESH (1 * 1024 * 1024 * 1024)
> > > +#endif
> > PUD_SIZE is defined everywhere.
> 
> Only if THP is defined; otherwise it is:
> 
> #define HPAGE_PUD_SIZE ({ BUILD_BUG(); 0; })

I'm talking about PUD_SIZE, not HPAGE_PUD_SIZE.
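Kirill's point, sketched as a kernel-side fragment (an untested illustration, not a patch from the thread): the generic pgtable headers define PUD_SIZE on every architecture, so the threshold needs no #ifdef fallback.

```c
/* PUD_SIZE comes from the pgtable headers on all architectures
 * (1 GB on x86-64 with 4K pages), unlike HPAGE_PUD_SIZE, which
 * BUILD_BUG()s without transparent hugepage support. */
#define LARGE_MAP_THRESH PUD_SIZE
```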

-- 
 Kirill A. Shutemov


Re: [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping

2018-07-03 Thread Michal Hocko
On Mon 02-07-18 13:48:45, Andrew Morton wrote:
> On Mon, 2 Jul 2018 16:05:02 +0200 Michal Hocko  wrote:
> 
> > On Fri 29-06-18 20:15:47, Andrew Morton wrote:
> > [...]
> > > Would one of your earlier designs have addressed all usecases?  I
> > > expect the dumb unmap-a-little-bit-at-a-time approach would have?
> > 
> > It has been already pointed out that this will not work.
> 
> I said "one of".  There were others.

Well, I was aware of only two potential solutions: either do the
heavy lifting under the shared lock and do the rest with the exclusive
one, or drop the lock in parts. Maybe I have missed others?

> > You simply
> > cannot drop the mmap_sem during unmap because another thread could
> > change the address space under your feet. So you need some form of
> > VM_DEAD and handle concurrent and conflicting address space operations.
> 
> Unclear that this is a problem.  If a thread does an unmap of a range
> of virtual address space, there's no guarantee that upon return some
> other thread has not already mapped new stuff into that address range. 
> So what's changed?

Well, consider the following scenario:
Thread A = calling mmap(NULL, sizeA)
Thread B = calling munmap(addr, sizeB)

They do not use any external synchronization and rely on the atomic
munmap. Thread B only munmaps a range that it knows belongs to it (e.g.
it called mmap in the past). It should be clear that Thread A should not
get an address from the addr, sizeB range, right? In the simplest case
it will not happen. But let's say that the addr, sizeB range has
unmapped holes for whatever reason. Now anytime munmap drops the
exclusive lock after handling one VMA, Thread A might find its sizeA
range and use it. Thread B then might remove this new range as soon as
it gets its exclusive lock again.

Is such code safe? No, it is not, and I would call it fragile at best,
but people tend to do weird things and atomic munmap behavior is
something they can easily depend on.

Another example would be atomic address range probing with
MAP_FIXED_NOREPLACE. It would simply break for similar reasons.

I remember my attempt to make MAP_LOCKED consistent with mlock (if the
population fails then return -ENOMEM), and that required dropping the
shared mmap_sem and taking it in exclusive mode (because we do not have
upgrade_read), and Linus was strongly against it [1][2] for very similar
reasons. If you drop the lock you simply do not know what happened under
your feet.

[1] 
http://lkml.kernel.org/r/CA+55aFydkG-BgZzry5DrTzueVh9VvEcVJdLV8iOyUphQk=0...@mail.gmail.com
[2] 
http://lkml.kernel.org/r/ca+55afyajquhghw59qnwkgk4dbv0tpmdd7-1xqpo7dzwvo_...@mail.gmail.com
-- 
Michal Hocko
SUSE Labs


Re: [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping

2018-07-02 Thread Yang Shi




On 6/29/18 9:26 PM, Yang Shi wrote:



On 6/29/18 8:15 PM, Andrew Morton wrote:
On Fri, 29 Jun 2018 19:28:15 -0700 Yang Shi 
 wrote:





we're adding a bunch of code to 32-bit kernels which will never be
executed.

I'm thinking it would be better to be much more explicit with "#ifdef
CONFIG_64BIT" in this code, rather than relying upon the above magic.

But I tend to think that the fact that we haven't solved anything on
locked vmas or on uprobed mappings is a showstopper for the whole
approach :(
I agree it is not that perfect. But, it still could improve most use
cases.

Well, those unaddressed usecases will need to be fixed at some point.


Yes, definitely.


What's our plan for that?


As I mentioned in the earlier email, locked and hugetlb cases might be 
able to be solved by separating vm_flags update and actual unmap. I 
will look into it further later.


By looking into this further, I think both mlocked and hugetlb vmas can 
be handled.


For mlocked vmas, it is easy since we acquire write mmap_sem before 
unmapping, so the VM_LOCKED flag can be cleared there, then unmap, just 
like what the regular path does.


For hugetlb vmas, the VM_MAYSHARE flag is just checked by 
huge_pmd_share() in hugetlb_fault()->huge_pte_alloc(); the other call 
site is dup_mm()->copy_page_range()->copy_hugetlb_page_range(), and we 
don't care about that call chain in this case.


So we may expand VM_DEAD to hugetlb_fault().  Michal suggested to check 
VM_DEAD in check_stable_address_space(), so it would be called in 
hugetlb_fault() too (not in current code), then the page fault handler 
would bail out before huge_pte_alloc() is called.


With this trick, we don't have to care about when the vm_flags is 
updated, we can unmap hugetlb vmas in read mmap_sem critical section, 
then update the vm_flags with write mmap_sem held or before the unmap.


Yang



From my point of view, uprobe mapping sounds not that vital.



Would one of your earlier designs have addressed all usecases? I
expect the dumb unmap-a-little-bit-at-a-time approach would have?


Yes. The v1 design does unmap while holding write mmap_sem. So, the 
vm_flags update is not a problem.


Thanks,
Yang




For the locked vmas and hugetlb vmas, unmapping operations need to
modify vm_flags. But, I'm wondering whether we might be able to separate
the unmap and the vm_flags update. Because we know they will be unmapped
right away, the vm_flags might be updated in the write mmap_sem critical
section before or after the actual unmap is called. This is just off the
top of my head.

For uprobed mappings, I'm not sure how vital it is to this case.

Thanks,
Yang








Re: [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping

2018-07-02 Thread Andrew Morton
On Mon, 2 Jul 2018 16:05:02 +0200 Michal Hocko  wrote:

> On Fri 29-06-18 20:15:47, Andrew Morton wrote:
> [...]
> > Would one of your earlier designs have addressed all usecases?  I
> > expect the dumb unmap-a-little-bit-at-a-time approach would have?
> 
> It has been already pointed out that this will not work.

I said "one of".  There were others.

> You simply
> cannot drop the mmap_sem during unmap because another thread could
> change the address space under your feet. So you need some form of
> VM_DEAD and handle concurrent and conflicting address space operations.

Unclear that this is a problem.  If a thread does an unmap of a range
of virtual address space, there's no guarantee that upon return some
other thread has not already mapped new stuff into that address range. 
So what's changed?





Re: [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping

2018-07-02 Thread Yang Shi




On 7/2/18 5:33 AM, Kirill A. Shutemov wrote:

On Sat, Jun 30, 2018 at 06:39:44AM +0800, Yang Shi wrote:

When running some mmap/munmap scalability tests with large memory (i.e.
> 300GB), the below hung task issue may happen occasionally.

INFO: task ps:14018 blocked for more than 120 seconds.
Tainted: GE 4.9.79-009.ali3000.alios7.x86_64 #1
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
  ps  D0 14018  1 0x0004
   885582f84000 885e8682f000 880972943000 885ebf499bc0
   8828ee12 c900349bfca8 817154d0 0040
   00ff812f872a 885ebf499bc0 024000d000948300 880972943000
  Call Trace:
   [] ? __schedule+0x250/0x730
   [] schedule+0x36/0x80
   [] rwsem_down_read_failed+0xf0/0x150
   [] call_rwsem_down_read_failed+0x18/0x30
   [] down_read+0x20/0x40
   [] proc_pid_cmdline_read+0xd9/0x4e0
   [] ? do_filp_open+0xa5/0x100
   [] __vfs_read+0x37/0x150
   [] ? security_file_permission+0x9b/0xc0
   [] vfs_read+0x96/0x130
   [] SyS_read+0x55/0xc0
   [] entry_SYSCALL_64_fastpath+0x1a/0xc5

It is because munmap holds mmap_sem from very beginning to all the way
down to the end, and doesn't release it in the middle. When unmapping
large mapping, it may take long time (take ~18 seconds to unmap 320GB
mapping with every single page mapped on an idle machine).

Zapping pages is the most time consuming part, according to the
suggestion from Michal Hock [1], zapping pages can be done with holding
read mmap_sem, like what MADV_DONTNEED does. Then re-acquire write
mmap_sem to cleanup vmas. All zapped vmas will have VM_DEAD flag set,
the page fault to VM_DEAD vma will trigger SIGSEGV.

Define large mapping size thresh as PUD size or 1GB, just zap pages with
read mmap_sem for mappings which are >= thresh value.

If the vma has VM_LOCKED | VM_HUGETLB | VM_PFNMAP or uprobe, then just
fallback to regular path since unmapping those mappings need acquire
write mmap_sem.

For the time being, just do this in munmap syscall path. Other
vm_munmap() or do_munmap() call sites remain intact for stability
reason.

The below is some regression and performance data collected on a machine
with 32 cores of E5-2680 @ 2.70GHz and 384GB memory.

With the patched kernel, write mmap_sem hold time is dropped to us level
from second.

[1] https://lwn.net/Articles/753269/

Cc: Michal Hocko 
Cc: Matthew Wilcox 
Cc: Laurent Dufour 
Cc: Andrew Morton 
Signed-off-by: Yang Shi 
---
  mm/mmap.c | 136 +-
  1 file changed, 134 insertions(+), 2 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 87dcf83..d61e08b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2763,6 +2763,128 @@ static int munmap_lookup_vma(struct mm_struct *mm, 
struct vm_area_struct **vma,
return 1;
  }
  
+/* Consider PUD size or 1GB mapping as large mapping */

+#ifdef HPAGE_PUD_SIZE
+#define LARGE_MAP_THRESH   HPAGE_PUD_SIZE
+#else
+#define LARGE_MAP_THRESH   (1 * 1024 * 1024 * 1024)
+#endif

PUD_SIZE is defined everywhere.


If THP is defined, otherwise it is:

#define HPAGE_PUD_SIZE ({ BUILD_BUG(); 0; })




+
+/* Unmap large mapping early with acquiring read mmap_sem */
+static int do_munmap_zap_early(struct mm_struct *mm, unsigned long start,
+  size_t len, struct list_head *uf)
+{
+   unsigned long end = 0;
+   struct vm_area_struct *vma = NULL, *prev, *tmp;
+   bool success = false;
+   int ret = 0;
+
+   if (!munmap_addr_sanity(start, len))
+   return -EINVAL;
+
+   len = PAGE_ALIGN(len);
+
+   end = start + len;
+
+   /* Just deal with uf in regular path */
+   if (unlikely(uf))
+   goto regular_path;
+
+   if (len >= LARGE_MAP_THRESH) {
+   /*
+* need write mmap_sem to split vma and set VM_DEAD flag
+* splitting vma up-front to save PITA to clean if it is failed

What errors do you talk about? ENOMEM on VMA split? Anything else?


Yes, ENOMEM on vma split.




+*/
+   down_write(&mm->mmap_sem);
+   ret = munmap_lookup_vma(mm, &vma, &prev, start, end);
+   if (ret != 1) {
+   up_write(&mm->mmap_sem);
+   return ret;
+   }
+   /* This ret value might be returned, so reset it */
+   ret = 0;
+
+   /*
+* Unmapping vmas, which has VM_LOCKED|VM_HUGETLB|VM_PFNMAP
+* flag set or has uprobes set, need acquire write map_sem,
+* so skip them in early zap. Just deal with such mapping in
+* regular path.
+   


Re: [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping

2018-07-02 Thread Yang Shi




On 7/2/18 6:53 AM, Michal Hocko wrote:

On Sat 30-06-18 06:39:44, Yang Shi wrote:

When running some mmap/munmap scalability tests with large memory (i.e.
> 300GB), the below hung task issue may happen occasionally.

INFO: task ps:14018 blocked for more than 120 seconds.
Tainted: GE 4.9.79-009.ali3000.alios7.x86_64 #1
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
  ps  D0 14018  1 0x0004
   885582f84000 885e8682f000 880972943000 885ebf499bc0
   8828ee12 c900349bfca8 817154d0 0040
   00ff812f872a 885ebf499bc0 024000d000948300 880972943000
  Call Trace:
   [] ? __schedule+0x250/0x730
   [] schedule+0x36/0x80
   [] rwsem_down_read_failed+0xf0/0x150
   [] call_rwsem_down_read_failed+0x18/0x30
   [] down_read+0x20/0x40
   [] proc_pid_cmdline_read+0xd9/0x4e0
   [] ? do_filp_open+0xa5/0x100
   [] __vfs_read+0x37/0x150
   [] ? security_file_permission+0x9b/0xc0
   [] vfs_read+0x96/0x130
   [] SyS_read+0x55/0xc0
   [] entry_SYSCALL_64_fastpath+0x1a/0xc5

It is because munmap holds mmap_sem from very beginning to all the way
down to the end, and doesn't release it in the middle. When unmapping
large mapping, it may take long time (take ~18 seconds to unmap 320GB
mapping with every single page mapped on an idle machine).

Zapping pages is the most time consuming part, according to the
suggestion from Michal Hock [1], zapping pages can be done with holding

s@Hock@Hocko@


Sorry for the wrong spelling.




read mmap_sem, like what MADV_DONTNEED does. Then re-acquire write
mmap_sem to cleanup vmas. All zapped vmas will have VM_DEAD flag set,
the page fault to VM_DEAD vma will trigger SIGSEGV.

This really deserves an explanation why all the dance is really needed.

It would also be good to mention how you achieve the overall
consistency. E.g. you are dropping mmap_sem and then re-taking it for
write. What if any pending write lock succeeds and modifies the address
space? Does it matter, why if not?


Sure.




Define large mapping size thresh as PUD size or 1GB, just zap pages with
read mmap_sem for mappings which are >= thresh value.

If the vma has VM_LOCKED | VM_HUGETLB | VM_PFNMAP or uprobe, then just
fallback to regular path since unmapping those mappings need acquire
write mmap_sem.

For the time being, just do this in munmap syscall path. Other
vm_munmap() or do_munmap() call sites remain intact for stability
reason.

What are those stability reasons?


mmap() and mremap() may call do_munmap() as well, so it may introduce 
more race conditions if they used the zap-early version of do_munmap() 
too. They would have many more chances to take mmap_sem to change the 
address space and cause conflicts.


And, it looks like they are not the vital source of long write mmap_sem 
hold times. So, it sounds not worth making things more complicated for 
the time being.





The below is some regression and performance data collected on a machine
with 32 cores of E5-2680 @ 2.70GHz and 384GB memory.

With the patched kernel, write mmap_sem hold time is dropped to us level
from second.

I haven't read through the implementation carefully TBH but the changelog
needs quite some work to explain the solution and resulting semantic of
munmap after the change.


Thanks for the suggestion. Will polish the changelog.

Yang





Re: [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping

2018-07-02 Thread Michal Hocko
On Fri 29-06-18 20:15:47, Andrew Morton wrote:
[...]
> Would one of your earlier designs have addressed all usecases?  I
> expect the dumb unmap-a-little-bit-at-a-time approach would have?

It has been already pointed out that this will not work. You simply
cannot drop the mmap_sem during unmap because another thread could
change the address space under your feet. So you need some form of
VM_DEAD and handle concurrent and conflicting address space operations.
-- 
Michal Hocko
SUSE Labs



Re: [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping

2018-07-02 Thread Michal Hocko
On Sat 30-06-18 06:39:44, Yang Shi wrote:
> When running some mmap/munmap scalability tests with large memory (i.e.
> > 300GB), the below hung task issue may happen occasionally.
> 
> INFO: task ps:14018 blocked for more than 120 seconds.
>Tainted: GE 4.9.79-009.ali3000.alios7.x86_64 #1
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> message.
>  ps  D0 14018  1 0x0004
>   885582f84000 885e8682f000 880972943000 885ebf499bc0
>   8828ee12 c900349bfca8 817154d0 0040
>   00ff812f872a 885ebf499bc0 024000d000948300 880972943000
>  Call Trace:
>   [] ? __schedule+0x250/0x730
>   [] schedule+0x36/0x80
>   [] rwsem_down_read_failed+0xf0/0x150
>   [] call_rwsem_down_read_failed+0x18/0x30
>   [] down_read+0x20/0x40
>   [] proc_pid_cmdline_read+0xd9/0x4e0
>   [] ? do_filp_open+0xa5/0x100
>   [] __vfs_read+0x37/0x150
>   [] ? security_file_permission+0x9b/0xc0
>   [] vfs_read+0x96/0x130
>   [] SyS_read+0x55/0xc0
>   [] entry_SYSCALL_64_fastpath+0x1a/0xc5
> 
> It is because munmap holds mmap_sem from very beginning to all the way
> down to the end, and doesn't release it in the middle. When unmapping
> large mapping, it may take long time (take ~18 seconds to unmap 320GB
> mapping with every single page mapped on an idle machine).
> 
> Zapping pages is the most time consuming part, according to the
> suggestion from Michal Hock [1], zapping pages can be done with holding

s@Hock@Hocko@

> read mmap_sem, like what MADV_DONTNEED does. Then re-acquire write
> mmap_sem to cleanup vmas. All zapped vmas will have VM_DEAD flag set,
> the page fault to VM_DEAD vma will trigger SIGSEGV.

This really deserves an explanation of why all this dancing is needed.

It would also be good to mention how you achieve overall
consistency. E.g. you are dropping mmap_sem and then re-taking it for
write. What if a pending write lock succeeds and modifies the address
space in the meantime? Does it matter, and if not, why not?

> Define large mapping size thresh as PUD size or 1GB, just zap pages with
> read mmap_sem for mappings which are >= thresh value.
> 
> If the vma has VM_LOCKED | VM_HUGETLB | VM_PFNMAP or uprobe, then just
> fallback to regular path since unmapping those mappings need acquire
> write mmap_sem.
> 
> For the time being, just do this in munmap syscall path. Other
> vm_munmap() or do_munmap() call sites remain intact for stability
> reason.

What are those stability reasons?

> The below is some regression and performance data collected on a machine
> with 32 cores of E5-2680 @ 2.70GHz and 384GB memory.
> 
> With the patched kernel, write mmap_sem hold time is dropped to us level
> from second.

I haven't read through the implementation carefully TBH, but the changelog
needs quite some work to explain the solution and the resulting semantics of
munmap after the change.
-- 
Michal Hocko
SUSE Labs


Re: [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping

2018-07-02 Thread Michal Hocko
On Mon 02-07-18 15:33:50, Kirill A. Shutemov wrote:
[...]
> I probably miss the explanation somewhere, but what's wrong with allowing
> other thread to re-populate the VMA?

We have discussed that earlier and it boils down to how racy access
during munmap is supposed to behave. Right now we get either the original
content or SEGV. If we simply madvise_dontneed before the real
unmap, we could get a new page as well. There might be (quite broken, I
would say) user space code that would silently corrupt data that
way.
-- 
Michal Hocko
SUSE Labs


Re: [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping

2018-07-02 Thread Kirill A. Shutemov
On Sat, Jun 30, 2018 at 06:39:44AM +0800, Yang Shi wrote:
> When running some mmap/munmap scalability tests with large memory (i.e.
> > 300GB), the below hung task issue may happen occasionally.
> 
> INFO: task ps:14018 blocked for more than 120 seconds.
>Tainted: GE 4.9.79-009.ali3000.alios7.x86_64 #1
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> message.
>  ps  D0 14018  1 0x0004
>   885582f84000 885e8682f000 880972943000 885ebf499bc0
>   8828ee12 c900349bfca8 817154d0 0040
>   00ff812f872a 885ebf499bc0 024000d000948300 880972943000
>  Call Trace:
>   [] ? __schedule+0x250/0x730
>   [] schedule+0x36/0x80
>   [] rwsem_down_read_failed+0xf0/0x150
>   [] call_rwsem_down_read_failed+0x18/0x30
>   [] down_read+0x20/0x40
>   [] proc_pid_cmdline_read+0xd9/0x4e0
>   [] ? do_filp_open+0xa5/0x100
>   [] __vfs_read+0x37/0x150
>   [] ? security_file_permission+0x9b/0xc0
>   [] vfs_read+0x96/0x130
>   [] SyS_read+0x55/0xc0
>   [] entry_SYSCALL_64_fastpath+0x1a/0xc5
> 
> It is because munmap holds mmap_sem from very beginning to all the way
> down to the end, and doesn't release it in the middle. When unmapping
> large mapping, it may take long time (take ~18 seconds to unmap 320GB
> mapping with every single page mapped on an idle machine).
> 
> Zapping pages is the most time consuming part, according to the
> suggestion from Michal Hock [1], zapping pages can be done with holding
> read mmap_sem, like what MADV_DONTNEED does. Then re-acquire write
> mmap_sem to cleanup vmas. All zapped vmas will have VM_DEAD flag set,
> the page fault to VM_DEAD vma will trigger SIGSEGV.
> 
> Define large mapping size thresh as PUD size or 1GB, just zap pages with
> read mmap_sem for mappings which are >= thresh value.
> 
> If the vma has VM_LOCKED | VM_HUGETLB | VM_PFNMAP or uprobe, then just
> fallback to regular path since unmapping those mappings need acquire
> write mmap_sem.
> 
> For the time being, just do this in munmap syscall path. Other
> vm_munmap() or do_munmap() call sites remain intact for stability
> reason.
> 
> The below is some regression and performance data collected on a machine
> with 32 cores of E5-2680 @ 2.70GHz and 384GB memory.
> 
> With the patched kernel, write mmap_sem hold time is dropped to us level
> from second.
> 
> [1] https://lwn.net/Articles/753269/
> 
> Cc: Michal Hocko 
> Cc: Matthew Wilcox 
> Cc: Laurent Dufour 
> Cc: Andrew Morton 
> Signed-off-by: Yang Shi 
> ---
>  mm/mmap.c | 136 
> +-
>  1 file changed, 134 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 87dcf83..d61e08b 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2763,6 +2763,128 @@ static int munmap_lookup_vma(struct mm_struct *mm, 
> struct vm_area_struct **vma,
>   return 1;
>  }
>  
> +/* Consider PUD size or 1GB mapping as large mapping */
> +#ifdef HPAGE_PUD_SIZE
> +#define LARGE_MAP_THRESH HPAGE_PUD_SIZE
> +#else
> +#define LARGE_MAP_THRESH (1 * 1024 * 1024 * 1024)
> +#endif

PUD_SIZE is defined everywhere.

> +
> +/* Unmap large mapping early with acquiring read mmap_sem */
> +static int do_munmap_zap_early(struct mm_struct *mm, unsigned long start,
> +size_t len, struct list_head *uf)
> +{
> + unsigned long end = 0;
> + struct vm_area_struct *vma = NULL, *prev, *tmp;
> + bool success = false;
> + int ret = 0;
> +
> + if (!munmap_addr_sanity(start, len))
> + return -EINVAL;
> +
> + len = PAGE_ALIGN(len);
> +
> + end = start + len;
> +
> + /* Just deal with uf in regular path */
> + if (unlikely(uf))
> + goto regular_path;
> +
> + if (len >= LARGE_MAP_THRESH) {
> + /*
> +  * need write mmap_sem to split vma and set VM_DEAD flag
> +  * splitting vma up-front to save PITA to clean if it is failed

What errors do you talk about? ENOMEM on VMA split? Anything else?

> +  */
> + down_write(&mm->mmap_sem);
> + ret = munmap_lookup_vma(mm, &vma, &prev, start, end);
> + if (ret != 1) {
> + up_write(&mm->mmap_sem);
> + return ret;
> + }
> + /* This ret value might be returned, so reset it */
> + ret = 0;
> +
> + /*
> +  * Unmapping vmas, which has VM_LOCKED|VM_HUGETLB|VM_PFNMAP
> +  * flag set or has uprobes set, need acquire write map_sem,
> +  * so skip them in early zap. Just deal with such mapping in
> +  * regular path.
> +  * 

Re: [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping

2018-06-29 Thread Yang Shi




On 6/29/18 8:15 PM, Andrew Morton wrote:

On Fri, 29 Jun 2018 19:28:15 -0700 Yang Shi  wrote:




we're adding a bunch of code to 32-bit kernels which will never be
executed.

I'm thinking it would be better to be much more explicit with "#ifdef
CONFIG_64BIT" in this code, rather than relying upon the above magic.

But I tend to think that the fact that we haven't solved anything on
locked vmas or on uprobed mappings is a showstopper for the whole
approach :(

I agree it is not that perfect. But it could still improve most use
cases.

Well, those unaddressed usecases will need to be fixed at some point.


Yes, definitely.


What's our plan for that?


As I mentioned in the earlier email, locked and hugetlb cases might be 
able to be solved by separating vm_flags update and actual unmap. I will 
look into it further later.


From my point of view, uprobe mapping sounds not that vital.



Would one of your earlier designs have addressed all usecases?  I
expect the dumb unmap-a-little-bit-at-a-time approach would have?


Yes. The v1 design does unmap while holding write mmap_sem. So, the 
vm_flags update is not a problem.


Thanks,
Yang




For the locked vmas and hugetlb vmas, unmapping operations need modify
vm_flags. But, I'm wondering we might be able to separate unmap and
vm_flags update. Because we know they will be unmapped right away, the
vm_flags might be able to be updated in write mmap_sem critical section
before the actual unmap is called or after it. This is just off the top
of my head.

For uprobed mappings, I'm not sure how vital it is to this case.

Thanks,
Yang





Re: [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping

2018-06-29 Thread Andrew Morton
On Fri, 29 Jun 2018 19:28:15 -0700 Yang Shi  wrote:

> 
> 
> > we're adding a bunch of code to 32-bit kernels which will never be
> > executed.
> >
> > I'm thinking it would be better to be much more explicit with "#ifdef
> > CONFIG_64BIT" in this code, rather than relying upon the above magic.
> >
> > But I tend to think that the fact that we haven't solved anything on
> > locked vmas or on uprobed mappings is a showstopper for the whole
> > approach :(
> 
> I agree it is not that perfect. But it could still improve most use 
> cases.

Well, those unaddressed usecases will need to be fixed at some point. 
What's our plan for that?

Would one of your earlier designs have addressed all usecases?  I
expect the dumb unmap-a-little-bit-at-a-time approach would have?

> For the locked vmas and hugetlb vmas, unmapping operations need modify 
> vm_flags. But, I'm wondering we might be able to separate unmap and 
> vm_flags update. Because we know they will be unmapped right away, the 
> vm_flags might be able to be updated in write mmap_sem critical section 
> before the actual unmap is called or after it. This is just off the top 
> of my head.
> 
> For uprobed mappings, I'm not sure how vital it is to this case.
> 
> Thanks,
> Yang
> 
> >



Re: [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping

2018-06-29 Thread Yang Shi




On 6/29/18 6:35 PM, Andrew Morton wrote:

On Sat, 30 Jun 2018 06:39:44 +0800 Yang Shi  wrote:


And...


diff --git a/mm/mmap.c b/mm/mmap.c
index 87dcf83..d61e08b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2763,6 +2763,128 @@ static int munmap_lookup_vma(struct mm_struct *mm, 
struct vm_area_struct **vma,
return 1;
  }
  
+/* Consider PUD size or 1GB mapping as large mapping */

+#ifdef HPAGE_PUD_SIZE
+#define LARGE_MAP_THRESH   HPAGE_PUD_SIZE
+#else
+#define LARGE_MAP_THRESH   (1 * 1024 * 1024 * 1024)
+#endif

So this assumes that 32-bit machines cannot have 1GB mappings (fair
enough) and this is the sole means by which we avoid falling into the
"len >= LARGE_MAP_THRESH" codepath, which will behave very badly, at
least because for such machines, VM_DEAD is zero.

This is rather ugly and fragile.  And, I guess, explains why we can't
give all mappings this treatment: 32-bit machines can't do it.  And
we're adding a bunch of code to 32-bit kernels which will never be
executed.

I'm thinking it would be better to be much more explicit with "#ifdef
CONFIG_64BIT" in this code, rather than relying upon the above magic.

But I tend to think that the fact that we haven't solved anything on
locked vmas or on uprobed mappings is a showstopper for the whole
approach :(


I agree it is not that perfect. But it could still improve most use 
cases.


For the locked vmas and hugetlb vmas, unmapping operations need to modify 
vm_flags. But, I'm wondering we might be able to separate unmap and 
vm_flags update. Because we know they will be unmapped right away, the 
vm_flags might be able to be updated in write mmap_sem critical section 
before the actual unmap is called or after it. This is just off the top 
of my head.


For uprobed mappings, I'm not sure how vital it is to this case.

Thanks,
Yang
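Concretely, the explicit guard Andrew suggests might look like the following (a sketch only; LARGE_MAP_THRESH and do_munmap_zap_early come from the RFC patch, not mainline):

```c
/* Only 64-bit kernels get the early-zap path; on 32-bit, VM_DEAD does
 * not exist and 1GB mappings are not expected, so compile the code out
 * explicitly instead of relying on the threshold never being reached. */
#ifdef CONFIG_64BIT
#define LARGE_MAP_THRESH	PUD_SIZE
static int do_munmap_zap_early(struct mm_struct *mm, unsigned long start,
			       size_t len, struct list_head *uf);
#else
/* 32-bit: fall through to the regular do_munmap() path unconditionally. */
#endif
```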







Re: [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping

2018-06-29 Thread Yang Shi




On 6/29/18 6:28 PM, Andrew Morton wrote:

On Sat, 30 Jun 2018 06:39:44 +0800 Yang Shi  wrote:


When running some mmap/munmap scalability tests with large memory (i.e. > 300GB), the below hung task issue may happen occasionally.

INFO: task ps:14018 blocked for more than 120 seconds.
Tainted: GE 4.9.79-009.ali3000.alios7.x86_64 #1
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
  ps  D0 14018  1 0x0004
   885582f84000 885e8682f000 880972943000 885ebf499bc0
   8828ee12 c900349bfca8 817154d0 0040
   00ff812f872a 885ebf499bc0 024000d000948300 880972943000
  Call Trace:
   [] ? __schedule+0x250/0x730
   [] schedule+0x36/0x80
   [] rwsem_down_read_failed+0xf0/0x150
   [] call_rwsem_down_read_failed+0x18/0x30
   [] down_read+0x20/0x40
   [] proc_pid_cmdline_read+0xd9/0x4e0
   [] ? do_filp_open+0xa5/0x100
   [] __vfs_read+0x37/0x150
   [] ? security_file_permission+0x9b/0xc0
   [] vfs_read+0x96/0x130
   [] SyS_read+0x55/0xc0
   [] entry_SYSCALL_64_fastpath+0x1a/0xc5

It is because munmap holds mmap_sem from very beginning to all the way
down to the end, and doesn't release it in the middle. When unmapping
large mapping, it may take long time (take ~18 seconds to unmap 320GB
mapping with every single page mapped on an idle machine).

Zapping pages is the most time consuming part, according to the
suggestion from Michal Hock [1], zapping pages can be done with holding
read mmap_sem, like what MADV_DONTNEED does. Then re-acquire write
mmap_sem to cleanup vmas. All zapped vmas will have VM_DEAD flag set,
the page fault to VM_DEAD vma will trigger SIGSEGV.

Define large mapping size thresh as PUD size or 1GB, just zap pages with
read mmap_sem for mappings which are >= thresh value.

Perhaps it would be better to treat all mappings in this fashion,
regardless of size.  Simpler code, lesser mmap_sem hold times, much
better testing coverage.  Is there any particular downside to doing
this?


Actually, my testing uses 4K size to improve the coverage. The only 
downside AFAICS is that the cost of repeatedly acquiring/releasing the 
lock may outweigh the benefits.





If the vma has VM_LOCKED | VM_HUGETLB | VM_PFNMAP or uprobe, then just
fallback to regular path since unmapping those mappings need acquire
write mmap_sem.

So we'll still get huge latencies and softlockup warnings for some
use cases.  This is a problem!


Because unmapping such an area needs to modify vm_flags, which needs write 
mmap_sem, in the current code base.





For the time being, just do this in munmap syscall path. Other
vm_munmap() or do_munmap() call sites remain intact for stability
reason.

The below is some regression and performance data collected on a machine
with 32 cores of E5-2680 @ 2.70GHz and 384GB memory.

Where is this "regression and performance data"?  Something missing from
the changelog?


Oops, it might have been removed inadvertently.


With the patched kernel, write mmap_sem hold time is dropped to us level
from second.

[1] https://lwn.net/Articles/753269/

...

--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2763,6 +2763,128 @@ static int munmap_lookup_vma(struct mm_struct *mm, 
struct vm_area_struct **vma,
return 1;
  }
  
+/* Consider PUD size or 1GB mapping as large mapping */

+#ifdef HPAGE_PUD_SIZE
+#define LARGE_MAP_THRESH   HPAGE_PUD_SIZE
+#else
+#define LARGE_MAP_THRESH   (1 * 1024 * 1024 * 1024)
+#endif
+
+/* Unmap large mapping early with acquiring read mmap_sem */
+static int do_munmap_zap_early(struct mm_struct *mm, unsigned long start,
+  size_t len, struct list_head *uf)

Can we have a comment describing what `uf' is and what it does? (at least)


Sure.




+{
+   unsigned long end = 0;
+   struct vm_area_struct *vma = NULL, *prev, *tmp;

`tmp' is a poor choice of identifier - it communicates neither
the variable's type nor its purpose.

Perhaps rename vma to start_vma(?) and rename tmp to vma?

And declaring start_vma to be const would be a nice readability addition.


Sounds ok.




+   bool success = false;
+   int ret = 0;
+
+   if (!munmap_addr_sanity(start, len))
+   return -EINVAL;
+
+   len = PAGE_ALIGN(len);
+
+   end = start + len;
+
+   /* Just deal with uf in regular path */
+   if (unlikely(uf))
+   goto regular_path;
+
+   if (len >= LARGE_MAP_THRESH) {
+   /*
+* need write mmap_sem to split vma and set VM_DEAD flag
+* splitting vma up-front to save PITA to clean if it is failed
+*/
+   down_write(&mm->mmap_sem);
+   ret = munmap_lookup_vma(mm, &vma, &prev, start, end);

Re: [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping

2018-06-29 Thread Yang Shi




On 6/29/18 6:28 PM, Andrew Morton wrote:

On Sat, 30 Jun 2018 06:39:44 +0800 Yang Shi  wrote:


When running some mmap/munmap scalability tests with large memory (i.e.

300GB), the below hung task issue may happen occasionally.

INFO: task ps:14018 blocked for more than 120 seconds.
Tainted: GE 4.9.79-009.ali3000.alios7.x86_64 #1
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
  ps  D0 14018  1 0x0004
   885582f84000 885e8682f000 880972943000 885ebf499bc0
   8828ee12 c900349bfca8 817154d0 0040
   00ff812f872a 885ebf499bc0 024000d000948300 880972943000
  Call Trace:
   [] ? __schedule+0x250/0x730
   [] schedule+0x36/0x80
   [] rwsem_down_read_failed+0xf0/0x150
   [] call_rwsem_down_read_failed+0x18/0x30
   [] down_read+0x20/0x40
   [] proc_pid_cmdline_read+0xd9/0x4e0
   [] ? do_filp_open+0xa5/0x100
   [] __vfs_read+0x37/0x150
   [] ? security_file_permission+0x9b/0xc0
   [] vfs_read+0x96/0x130
   [] SyS_read+0x55/0xc0
   [] entry_SYSCALL_64_fastpath+0x1a/0xc5

It is because munmap holds mmap_sem from very beginning to all the way
down to the end, and doesn't release it in the middle. When unmapping
large mapping, it may take long time (take ~18 seconds to unmap 320GB
mapping with every single page mapped on an idle machine).

Zapping pages is the most time-consuming part. Following the suggestion
from Michal Hocko [1], pages can be zapped while holding the read
mmap_sem, like MADV_DONTNEED does, and the write mmap_sem is then
re-acquired to clean up the vmas. All zapped vmas will have the VM_DEAD
flag set; a page fault on a VM_DEAD vma will trigger SIGSEGV.

Define the large-mapping threshold as PUD size or 1GB, and only zap
pages with the read mmap_sem for mappings which are >= that threshold.

Perhaps it would be better to treat all mappings in this fashion,
regardless of size.  Simpler code, lesser mmap_sem hold times, much
better testing coverage.  Is there any particular downside to doing
this?


Actually, my testing uses 4K size to improve the coverage. The only
downside AFAICS is that the cost of repeatedly acquiring/releasing the
lock may outweigh the benefits.





If the vma has VM_LOCKED | VM_HUGETLB | VM_PFNMAP or uprobes, then just
fall back to the regular path, since unmapping those mappings needs to
acquire the write mmap_sem.

So we'll still get huge latencies and softlockup warnings for some
usecases.  This is a problem!


Because unmapping such an area needs to modify vm_flags, which requires 
the write mmap_sem in the current code base.





For the time being, just do this in the munmap syscall path. Other
vm_munmap() or do_munmap() call sites remain intact for stability
reasons.

The below is some regression and performance data collected on a machine
with 32 cores of E5-2680 @ 2.70GHz and 384GB memory.

Where is this "regression and performance data"?  Something missing from
the changelog?


oops, it might be removed inadvertently.


With the patched kernel, the write mmap_sem hold time drops from seconds
to the microsecond level.

[1] https://lwn.net/Articles/753269/

...

--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2763,6 +2763,128 @@ static int munmap_lookup_vma(struct mm_struct *mm, 
struct vm_area_struct **vma,
return 1;
  }
  
+/* Consider PUD size or 1GB mapping as large mapping */

+#ifdef HPAGE_PUD_SIZE
+#define LARGE_MAP_THRESH   HPAGE_PUD_SIZE
+#else
+#define LARGE_MAP_THRESH   (1 * 1024 * 1024 * 1024)
+#endif
+
+/* Unmap large mapping early with acquiring read mmap_sem */
+static int do_munmap_zap_early(struct mm_struct *mm, unsigned long start,
+  size_t len, struct list_head *uf)

Can we have a comment describing what `uf' is and what it does? (at least)


Sure.




+{
+   unsigned long end = 0;
+   struct vm_area_struct *vma = NULL, *prev, *tmp;

`tmp' is a poor choice of identifier - it doesn't communicate either
the variable's type nor its purpose.

Perhaps rename vma to start_vma(?) and rename tmp to vma?

And declaring start_vma to be const would be a nice readability addition.


Sounds ok.




+   bool success = false;
+   int ret = 0;
+
+   if (!munmap_addr_sanity(start, len))
+   return -EINVAL;
+
+   len = PAGE_ALIGN(len);
+
+   end = start + len;
+
+   /* Just deal with uf in regular path */
+   if (unlikely(uf))
+   goto regular_path;
+
+   if (len >= LARGE_MAP_THRESH) {
+   /*
+* need write mmap_sem to split vmas and set the VM_DEAD flag;
+* split vmas up-front to spare the PITA of cleaning up if it fails
+*/
+   down_write(&mm->mmap_sem);
+   ret = munmap_lookup_vma(mm, &vma, &prev, 

Re: [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping

2018-06-29 Thread Andrew Morton
On Sat, 30 Jun 2018 06:39:44 +0800 Yang Shi  wrote:


And...

> diff --git a/mm/mmap.c b/mm/mmap.c
> index 87dcf83..d61e08b 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2763,6 +2763,128 @@ static int munmap_lookup_vma(struct mm_struct *mm, 
> struct vm_area_struct **vma,
>   return 1;
>  }
>  
> +/* Consider PUD size or 1GB mapping as large mapping */
> +#ifdef HPAGE_PUD_SIZE
> +#define LARGE_MAP_THRESH HPAGE_PUD_SIZE
> +#else
> +#define LARGE_MAP_THRESH (1 * 1024 * 1024 * 1024)
> +#endif

So this assumes that 32-bit machines cannot have 1GB mappings (fair
enough) and this is the sole means by which we avoid falling into the
"len >= LARGE_MAP_THRESH" codepath, which will behave very badly, at
least because for such machines, VM_DEAD is zero.

This is rather ugly and fragile.  And, I guess, explains why we can't
give all mappings this treatment: 32-bit machines can't do it.  And
we're adding a bunch of code to 32-bit kernels which will never be
executed.

I'm thinking it would be better to be much more explicit with "#ifdef
CONFIG_64BIT" in this code, rather than relying upon the above magic.

But I tend to think that the fact that we haven't solved anything on
locked vmas or on uprobed mappings is a showstopper for the whole
approach :(





Re: [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping

2018-06-29 Thread Andrew Morton
On Sat, 30 Jun 2018 06:39:44 +0800 Yang Shi  wrote:

> When running some mmap/munmap scalability tests with large memory (i.e.
> > 300GB), the below hung task issue may happen occasionally.
> 
> INFO: task ps:14018 blocked for more than 120 seconds.
>Tainted: GE 4.9.79-009.ali3000.alios7.x86_64 #1
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> message.
>  ps  D0 14018  1 0x0004
>   885582f84000 885e8682f000 880972943000 885ebf499bc0
>   8828ee12 c900349bfca8 817154d0 0040
>   00ff812f872a 885ebf499bc0 024000d000948300 880972943000
>  Call Trace:
>   [] ? __schedule+0x250/0x730
>   [] schedule+0x36/0x80
>   [] rwsem_down_read_failed+0xf0/0x150
>   [] call_rwsem_down_read_failed+0x18/0x30
>   [] down_read+0x20/0x40
>   [] proc_pid_cmdline_read+0xd9/0x4e0
>   [] ? do_filp_open+0xa5/0x100
>   [] __vfs_read+0x37/0x150
>   [] ? security_file_permission+0x9b/0xc0
>   [] vfs_read+0x96/0x130
>   [] SyS_read+0x55/0xc0
>   [] entry_SYSCALL_64_fastpath+0x1a/0xc5
> 
> This is because munmap holds mmap_sem from the very beginning all the
> way to the end, without releasing it in the middle. Unmapping a large
> mapping may take a long time (~18 seconds to unmap a 320GB mapping with
> every single page mapped, on an idle machine).
> 
> 
> Zapping pages is the most time-consuming part. Following the suggestion
> from Michal Hocko [1], pages can be zapped while holding the read
> mmap_sem, like MADV_DONTNEED does, and the write mmap_sem is then
> re-acquired to clean up the vmas. All zapped vmas will have the VM_DEAD
> flag set; a page fault on a VM_DEAD vma will trigger SIGSEGV.
> 
> Define the large-mapping threshold as PUD size or 1GB, and only zap
> pages with the read mmap_sem for mappings which are >= that threshold.

Perhaps it would be better to treat all mappings in this fashion,
regardless of size.  Simpler code, lesser mmap_sem hold times, much
better testing coverage.  Is there any particular downside to doing
this?

> If the vma has VM_LOCKED | VM_HUGETLB | VM_PFNMAP or uprobes, then just
> fall back to the regular path, since unmapping those mappings needs to
> acquire the write mmap_sem.

So we'll still get huge latencies and softlockup warnings for some
usecases.  This is a problem!

> For the time being, just do this in the munmap syscall path. Other
> vm_munmap() or do_munmap() call sites remain intact for stability
> reasons.
> 
> The below is some regression and performance data collected on a machine
> with 32 cores of E5-2680 @ 2.70GHz and 384GB memory.

Where is this "regression and performance data"?  Something missing from
the changelog?

> With the patched kernel, the write mmap_sem hold time drops from seconds
> to the microsecond level.
> 
> [1] https://lwn.net/Articles/753269/
> 
> ...
>
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2763,6 +2763,128 @@ static int munmap_lookup_vma(struct mm_struct *mm, 
> struct vm_area_struct **vma,
>   return 1;
>  }
>  
> +/* Consider PUD size or 1GB mapping as large mapping */
> +#ifdef HPAGE_PUD_SIZE
> +#define LARGE_MAP_THRESH HPAGE_PUD_SIZE
> +#else
> +#define LARGE_MAP_THRESH (1 * 1024 * 1024 * 1024)
> +#endif
> +
> +/* Unmap large mapping early with acquiring read mmap_sem */
> +static int do_munmap_zap_early(struct mm_struct *mm, unsigned long start,
> +size_t len, struct list_head *uf)

Can we have a comment describing what `uf' is and what it does? (at least)

> +{
> + unsigned long end = 0;
> + struct vm_area_struct *vma = NULL, *prev, *tmp;

`tmp' is a poor choice of identifier - it doesn't communicate either
the variable's type nor its purpose.

Perhaps rename vma to start_vma(?) and rename tmp to vma?

And declaring start_vma to be const would be a nice readability addition.

> + bool success = false;
> + int ret = 0;
> +
> + if (!munmap_addr_sanity(start, len))
> + return -EINVAL;
> +
> + len = PAGE_ALIGN(len);
> +
> + end = start + len;
> +
> + /* Just deal with uf in regular path */
> + if (unlikely(uf))
> + goto regular_path;
> +
> + if (len >= LARGE_MAP_THRESH) {
> + /*
> +  * need write mmap_sem to split vma and set VM_DEAD flag
> +  * splitting vma up-front to save PITA to clean if it is failed
> +  */
> +	down_write(&mm->mmap_sem);
> +	ret = munmap_lookup_vma(mm, &vma, &prev, start, end);
> +	if (ret != 1) {
> +		up_write(&mm->mmap_sem);
> +		return ret;

Can just use `goto out' here, and that would avoid the unpleasing use
of a deeply embedded `return'.

> + }
> + 


[RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping

2018-06-29 Thread Yang Shi
When running some mmap/munmap scalability tests with large memory (i.e.
> 300GB), the below hung task issue may happen occasionally.

INFO: task ps:14018 blocked for more than 120 seconds.
   Tainted: GE 4.9.79-009.ali3000.alios7.x86_64 #1
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
 ps  D0 14018  1 0x0004
  885582f84000 885e8682f000 880972943000 885ebf499bc0
  8828ee12 c900349bfca8 817154d0 0040
  00ff812f872a 885ebf499bc0 024000d000948300 880972943000
 Call Trace:
  [] ? __schedule+0x250/0x730
  [] schedule+0x36/0x80
  [] rwsem_down_read_failed+0xf0/0x150
  [] call_rwsem_down_read_failed+0x18/0x30
  [] down_read+0x20/0x40
  [] proc_pid_cmdline_read+0xd9/0x4e0
  [] ? do_filp_open+0xa5/0x100
  [] __vfs_read+0x37/0x150
  [] ? security_file_permission+0x9b/0xc0
  [] vfs_read+0x96/0x130
  [] SyS_read+0x55/0xc0
  [] entry_SYSCALL_64_fastpath+0x1a/0xc5

This is because munmap holds mmap_sem from the very beginning all the
way to the end, without releasing it in the middle. Unmapping a large
mapping may take a long time (~18 seconds to unmap a 320GB mapping with
every single page mapped, on an idle machine).

Zapping pages is the most time-consuming part. Following the suggestion
from Michal Hocko [1], pages can be zapped while holding the read
mmap_sem, like MADV_DONTNEED does, and the write mmap_sem is then
re-acquired to clean up the vmas. All zapped vmas will have the VM_DEAD
flag set; a page fault on a VM_DEAD vma will trigger SIGSEGV.

Define the large-mapping threshold as PUD size or 1GB, and only zap
pages with the read mmap_sem for mappings which are >= that threshold.

If the vma has VM_LOCKED | VM_HUGETLB | VM_PFNMAP or uprobes, then just
fall back to the regular path, since unmapping those mappings needs to
acquire the write mmap_sem.

For the time being, just do this in the munmap syscall path. Other
vm_munmap() or do_munmap() call sites remain intact for stability
reasons.

The below is some regression and performance data collected on a machine
with 32 cores of E5-2680 @ 2.70GHz and 384GB memory.

With the patched kernel, the write mmap_sem hold time drops from seconds
to the microsecond level.

[1] https://lwn.net/Articles/753269/

Cc: Michal Hocko 
Cc: Matthew Wilcox 
Cc: Laurent Dufour 
Cc: Andrew Morton 
Signed-off-by: Yang Shi 
---
 mm/mmap.c | 136 +-
 1 file changed, 134 insertions(+), 2 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 87dcf83..d61e08b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2763,6 +2763,128 @@ static int munmap_lookup_vma(struct mm_struct *mm, 
struct vm_area_struct **vma,
return 1;
 }
 
+/* Consider PUD size or 1GB mapping as large mapping */
+#ifdef HPAGE_PUD_SIZE
+#define LARGE_MAP_THRESH   HPAGE_PUD_SIZE
+#else
+#define LARGE_MAP_THRESH   (1 * 1024 * 1024 * 1024)
+#endif
+
+/* Unmap large mapping early with acquiring read mmap_sem */
+static int do_munmap_zap_early(struct mm_struct *mm, unsigned long start,
+  size_t len, struct list_head *uf)
+{
+   unsigned long end = 0;
+   struct vm_area_struct *vma = NULL, *prev, *tmp;
+   bool success = false;
+   int ret = 0;
+
+   if (!munmap_addr_sanity(start, len))
+   return -EINVAL;
+
+   len = PAGE_ALIGN(len);
+
+   end = start + len;
+
+   /* Just deal with uf in regular path */
+   if (unlikely(uf))
+   goto regular_path;
+
+   if (len >= LARGE_MAP_THRESH) {
+   /*
+* need write mmap_sem to split vmas and set the VM_DEAD flag;
+* split vmas up-front to spare the PITA of cleaning up if it fails
+*/
+   down_write(&mm->mmap_sem);
+   ret = munmap_lookup_vma(mm, &vma, &prev, start, end);
+   if (ret != 1) {
+   up_write(&mm->mmap_sem);
+   return ret;
+   }
+   /* This ret value might be returned, so reset it */
+   ret = 0;
+
+   /*
+* Unmapping vmas which have the VM_LOCKED|VM_HUGETLB|VM_PFNMAP
+* flags set or have uprobes needs to acquire the write mmap_sem,
+* so skip them in the early zap. Just deal with such mappings in
+* the regular path.
+* Borrow can_madv_dontneed_vma() to check the conditions.
+*/
+   tmp = vma;
+   while (tmp && tmp->vm_start < end) {
+   if (!can_madv_dontneed_vma(tmp) ||
+   vma_has_uprobes(tmp, start, end)) {
+   up_write(&mm->mmap_sem);
+ 
