Re: [PATCH] mm: release the spinlock on zap_pte_range

2019-08-06 Thread Minchan Kim
On Wed, Jul 31, 2019 at 09:21:01AM +0200, Michal Hocko wrote:
> On Wed 31-07-19 14:44:47, Minchan Kim wrote:
> > On Tue, Jul 30, 2019 at 02:57:51PM +0200, Michal Hocko wrote:
> > > [Cc Nick - the email thread starts 
> > > http://lkml.kernel.org/r/20190729071037.241581-1-minc...@kernel.org
> > >  A very brief summary is that mark_page_accessed seems to be quite
> > >  expensive and the question is whether we still need it and why
> > >  SetPageReferenced cannot be used instead. More below.]
> > > 
> > > On Tue 30-07-19 21:39:35, Minchan Kim wrote:
> [...]
> > > > commit bf3f3bc5e73
> > > > Author: Nick Piggin 
> > > > Date:   Tue Jan 6 14:38:55 2009 -0800
> > > > 
> > > > mm: don't mark_page_accessed in fault path
> > > > 
> > > > Doing a mark_page_accessed at fault-time, then doing 
> > > > SetPageReferenced at
> > > > unmap-time if the pte is young has a number of problems.
> > > > 
> > > > mark_page_accessed is supposed to be roughly the equivalent of a 
> > > > young pte
> > > > for unmapped references. Unfortunately it doesn't come with any 
> > > > context:
> > > > after being called, reclaim doesn't know who or why the page was 
> > > > touched.
> > > > 
> > > > So calling mark_page_accessed not only adds extra lru or 
> > > > PG_referenced
> > > > manipulations for pages that are already going to have pte_young 
> > > > ptes anyway,
> > > > but it also adds these references which are difficult to work with 
> > > > from the
> > > > context of vma specific references (eg. MADV_SEQUENTIAL pte_young 
> > > > may not
> > > > wish to contribute to the page being referenced).
> > > > 
> > > > Then, simply doing SetPageReferenced when zapping a pte and finding 
> > > > it is
> > > > young, is not a really good solution either. SetPageReferenced does 
> > > > not
> > > > correctly promote the page to the active list for example. So after 
> > > > removing
> > > > mark_page_accessed from the fault path, several 
> > > > mmap()+touch+munmap() would
> > > > have a very different result from several read(2) calls for 
> > > > example, which
> > > > is not really desirable.
> > > 
> > > Well, I have to say that this is rather vague to me. Nick, could you be
> > > more specific about which workloads do benefit from this change? Let's
> > > say that the zapped pte is the only referenced one and then reclaim
> > > finds the page on inactive list. We would go and reclaim it. But does
> > > that matter so much? Hot pages would be referenced from multiple ptes
> > > very likely, no?
> > 
> > As Nick mentioned in the description, without mark_page_accessed in
> > zapping part, repeated mmap + touch + munmap never activated the page
> > while several read(2) calls easily promote it.
> 
> And is this really a problem? If we refault the same page, then the
> refault detection should catch it, no? In other words, is the above still
> a problem these days?

I admit we have never been fair to them: read(2) syscall pages are easily
promoted regardless of zap timing, unlike mmap-based pages.

However, if we remove the mark_page_accessed from zap_pte_range, it would
make things even more unfair, in that read(2)-accessed pages are still
easily promoted while mmap-based pages have to go through a refault to be
promoted.
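
To make that concrete, the two access patterns being compared throughout
this thread look roughly like the following user-space sketch (illustrative
only; sizes and error handling are made up):

#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* mmap + touch + munmap: the only trace of the access is the young pte
 * that zap_pte_range() sees when the mapping is torn down. */
static void touch_via_mmap(int fd, size_t len)
{
	char *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);

	if (p == MAP_FAILED)
		return;
	for (size_t i = 0; i < len; i += 4096)
		(void)*(volatile char *)(p + i);
	munmap(p, len);
}

/* Repeated read(2): each pass goes through the page cache read path,
 * which marks the pages accessed, so they get promoted to the active
 * list after a couple of passes. */
static void touch_via_read(int fd, size_t len)
{
	char buf[4096];

	lseek(fd, 0, SEEK_SET);
	for (size_t off = 0; off < len; off += sizeof(buf))
		if (read(fd, buf, sizeof(buf)) <= 0)
			break;
}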

I also want to remove the costly overhead from this hot path, but I
couldn't come up with a nicer solution.
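
For reference, the hot path in question is the young-pte handling for file
pages in zap_pte_range(); at the time it looked roughly like the following
(paraphrased from mm/memory.c, not an exact quote):

			if (!PageAnon(page)) {
				if (pte_dirty(ptent)) {
					force_flush = 1;
					set_page_dirty(page);
				}
				if (pte_young(ptent) &&
				    likely(!(vma->vm_flags & VM_SEQ_READ)))
					/* the call whose activate_page() cost
					 * shows up in the profiles discussed
					 * in this thread */
					mark_page_accessed(page);
			}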


Re: [PATCH] mm: release the spinlock on zap_pte_range

2019-07-31 Thread Michal Hocko
On Wed 31-07-19 14:44:47, Minchan Kim wrote:
> On Tue, Jul 30, 2019 at 02:57:51PM +0200, Michal Hocko wrote:
> > [Cc Nick - the email thread starts 
> > http://lkml.kernel.org/r/20190729071037.241581-1-minc...@kernel.org
> >  A very brief summary is that mark_page_accessed seems to be quite
> >  expensive and the question is whether we still need it and why
> >  SetPageReferenced cannot be used instead. More below.]
> > 
> > On Tue 30-07-19 21:39:35, Minchan Kim wrote:
[...]
> > > commit bf3f3bc5e73
> > > Author: Nick Piggin 
> > > Date:   Tue Jan 6 14:38:55 2009 -0800
> > > 
> > > mm: don't mark_page_accessed in fault path
> > > 
> > > Doing a mark_page_accessed at fault-time, then doing 
> > > SetPageReferenced at
> > > unmap-time if the pte is young has a number of problems.
> > > 
> > > mark_page_accessed is supposed to be roughly the equivalent of a 
> > > young pte
> > > for unmapped references. Unfortunately it doesn't come with any 
> > > context:
> > > after being called, reclaim doesn't know who or why the page was 
> > > touched.
> > > 
> > > So calling mark_page_accessed not only adds extra lru or PG_referenced
> > > manipulations for pages that are already going to have pte_young ptes 
> > > anyway,
> > > but it also adds these references which are difficult to work with 
> > > from the
> > > context of vma specific references (eg. MADV_SEQUENTIAL pte_young may 
> > > not
> > > wish to contribute to the page being referenced).
> > > 
> > > Then, simply doing SetPageReferenced when zapping a pte and finding 
> > > it is
> > > young, is not a really good solution either. SetPageReferenced does 
> > > not
> > > correctly promote the page to the active list for example. So after 
> > > removing
> > > mark_page_accessed from the fault path, several mmap()+touch+munmap() 
> > > would
> > > have a very different result from several read(2) calls for example, 
> > > which
> > > is not really desirable.
> > 
> > Well, I have to say that this is rather vague to me. Nick, could you be
> > more specific about which workloads do benefit from this change? Let's
> > say that the zapped pte is the only referenced one and then reclaim
> > finds the page on inactive list. We would go and reclaim it. But does
> > that matter so much? Hot pages would be referenced from multiple ptes
> > very likely, no?
> 
> As Nick mentioned in the description, without mark_page_accessed in
> zapping part, repeated mmap + touch + munmap never activated the page
> while several read(2) calls easily promote it.

And is this really a problem? If we refault the same page, then the
refault detection should catch it, no? In other words, is the above still
a problem these days?
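
For context, the refault detection mentioned above lives in mm/workingset.c;
heavily simplified, and with hypothetical helper names, the idea is:

/* Heavily simplified sketch of the refault path: the shadow entry left
 * behind at eviction encodes when the page was evicted; if the page
 * refaults soon enough, it goes straight back onto the active list. */
static void workingset_refault_sketch(struct page *page, void *shadow)
{
	unsigned long distance = refault_distance(shadow);	/* hypothetical */

	if (distance <= active_file_lru_size())			/* hypothetical */
		SetPageActive(page);
}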

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm: release the spinlock on zap_pte_range

2019-07-30 Thread Minchan Kim
On Tue, Jul 30, 2019 at 12:42:07PM -0700, Andrew Morton wrote:
> On Mon, 29 Jul 2019 17:20:52 +0900 Minchan Kim  wrote:
> 
> > > > @@ -1022,7 +1023,16 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > > > flush_tlb_batched_pending(mm);
> > > > arch_enter_lazy_mmu_mode();
> > > > do {
> > > > -   pte_t ptent = *pte;
> > > > +   pte_t ptent;
> > > > +
> > > > +   if (progress >= 32) {
> > > > +   progress = 0;
> > > > +   if (need_resched())
> > > > +   break;
> > > > +   }
> > > > +   progress += 8;
> > > 
> > > Why 8?
> > 
> > Just copied from copy_pte_range.
> 
> copy_pte_range() does
> 
>   if (pte_none(*src_pte)) {
>   progress++;
>   continue;
>   }
>   entry.val = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte,
>   vma, addr, rss);
>   if (entry.val)
>   break;
>   progress += 8;
> 
> which appears to be an attempt to balance the cost of copy_one_pte()
> against the cost of not calling copy_one_pte().
> 

Indeed.

> Your code doesn't do this balancing and hence can be simpler.

Based on the balancing around copy_one_pte, it seems we should balance the
cost of mark_page_accessed against the cost of not calling it; IOW, add 8
only when mark_page_accessed is actually called.

However, not every mark_page_accessed is heavy, since it uses a pagevec,
and the caller cannot know in advance whether the target page will be
activated or merely get PG_referenced, which is cheap. Thus, I agree: let's
not make it complicated.
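
That two-step behaviour is the core of mark_page_accessed(); simplified
from mm/swap.c of that era, it is roughly:

	/* Simplified: the first access only sets the cheap PG_referenced
	 * bit; a second access on a still-inactive page takes the expensive
	 * activate_page() path (pagevec plus LRU manipulation). */
	if (!PageActive(page) && PageReferenced(page)) {
		if (PageLRU(page))
			activate_page(page);
		ClearPageReferenced(page);
	} else if (!PageReferenced(page)) {
		SetPageReferenced(page);
	}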

> 
> It all seems a bit overdesigned.  need_resched() is cheap.  It's
> possibly a mistake to check need_resched() on *every* loop because some
> crazy scheduling load might livelock us.  But surely it would be enough
> to do something like
> 
>   if (progress++ && need_resched()) {
>   
>   progress = 0;
>   }
> 
> and leave it at that?

Seems like this?

From bb1d7aaf520e98a6f9d988c25121602c28e12e67 Mon Sep 17 00:00:00 2001
From: Minchan Kim 
Date: Mon, 29 Jul 2019 15:28:48 +0900
Subject: [PATCH] mm: release the spinlock on zap_pte_range

In our testing (camera recording), Miguel and Wei found that
unmap_page_range easily takes more than 6ms with preemption disabled. The
reason is that it holds the page table spinlock for the entire 512-page
operation in a PMD. 6.2ms is far from trivial for user experience if an RT
task cannot run in that time, since it can cause frame drops or audio
glitches.

I took some time to benchmark it by adding trace_printk hooks between
pte_offset_map_lock and pte_unmap_unlock in zap_pte_range. The testing
device is a 2018 premium mobile device.

I can easily see a 2ms delay to release 2M (i.e., 512 pages) when the task
runs on a little core, even without any IPI or LRU lock contention. That
is already too heavy.

If I remove activate_page, 35-40% of the zap_pte_range overhead is gone,
so most of the overhead (about 0.7ms) comes from activate_page via
mark_page_accessed. Thus, if there is LRU contention, that 0.7ms could
accumulate up to several ms.

Thus, this patch adds a preemption point once every 32 iterations of the
loop.

Reported-by: Miguel de Dios 
Reported-by: Wei Wang 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Mel Gorman 
Signed-off-by: Minchan Kim 
---
 mm/memory.c | 18 +++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 2e796372927fd..8bfcef09da674 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1007,6 +1007,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
struct zap_details *details)
 {
struct mm_struct *mm = tlb->mm;
+   int progress = 0;
int force_flush = 0;
int rss[NR_MM_COUNTERS];
spinlock_t *ptl;
@@ -1022,7 +1023,15 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
flush_tlb_batched_pending(mm);
arch_enter_lazy_mmu_mode();
do {
-   pte_t ptent = *pte;
+   pte_t ptent;
+
+   if (progress++ >= 32) {
+   progress = 0;
+   if (need_resched())
+   break;
+   }
+
+   ptent = *pte;
if (pte_none(ptent))
continue;
 
@@ -1123,8 +1132,11 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
if (force_flush) {
force_flush = 0;
tlb_flush_mmu(tlb);
-   if (addr != end)
-   goto again;
+   }
+
+   if (addr != end) {
+   progress = 0;
+   goto again;
}
 
return addr;
-- 
2.22.0.709.g102302147b-goog



Re: [PATCH] mm: release the spinlock on zap_pte_range

2019-07-30 Thread Minchan Kim
On Tue, Jul 30, 2019 at 02:57:51PM +0200, Michal Hocko wrote:
> [Cc Nick - the email thread starts 
> http://lkml.kernel.org/r/20190729071037.241581-1-minc...@kernel.org
>  A very brief summary is that mark_page_accessed seems to be quite
>  expensive and the question is whether we still need it and why
>  SetPageReferenced cannot be used instead. More below.]
> 
> On Tue 30-07-19 21:39:35, Minchan Kim wrote:
> > On Tue, Jul 30, 2019 at 02:32:37PM +0200, Michal Hocko wrote:
> > > On Tue 30-07-19 21:11:10, Minchan Kim wrote:
> > > > On Mon, Jul 29, 2019 at 10:35:15AM +0200, Michal Hocko wrote:
> > > > > On Mon 29-07-19 17:20:52, Minchan Kim wrote:
> > > > > > On Mon, Jul 29, 2019 at 09:45:23AM +0200, Michal Hocko wrote:
> > > > > > > On Mon 29-07-19 16:10:37, Minchan Kim wrote:
> > > > > > > > In our testing(camera recording), Miguel and Wei found 
> > > > > > > > unmap_page_range
> > > > > > > > takes above 6ms with preemption disabled easily. When I see 
> > > > > > > > that, the
> > > > > > > > reason is it holds page table spinlock during entire 512 page 
> > > > > > > > operation
> > > > > > > > in a PMD. 6.2ms is never trivial for user experience if RT task 
> > > > > > > > couldn't
> > > > > > > > run in the time because it could make frame drop or glitch 
> > > > > > > > audio problem.
> > > > > > > 
> > > > > > > Where is the time spent during the tear down? 512 pages doesn't 
> > > > > > > sound
> > > > > > > like a lot to tear down. Is it the TLB flushing?
> > > > > > 
> > > > > > Miguel confirmed there is no such big latency without 
> > > > > > mark_page_accessed
> > > > > > in zap_pte_range so I guess it's the contention of LRU lock as well 
> > > > > > as
> > > > > > heavy activate_page overhead which is not trivial, either.
> > > > > 
> > > > > Please give us more details ideally with some numbers.
> > > > 
> > > > I had a time to benchmark it via adding some trace_printk hooks between
> > > > pte_offset_map_lock and pte_unmap_unlock in zap_pte_range. The testing
> > > > device is 2018 premium mobile device.
> > > > 
> > > > I can get 2ms delay rather easily to release 2M(ie, 512 pages) when the
> > > > task runs on little core even though it doesn't have any IPI and LRU
> > > > lock contention. It's already too heavy.
> > > > 
> > > > If I remove activate_page, 35-40% overhead of zap_pte_range is gone
> > > > so most of overhead(about 0.7ms) comes from activate_page via
> > > > mark_page_accessed. Thus, if there is LRU contention, that 0.7ms could
> > > > accumulate up to several ms.
> > > 
> > > Thanks for this information. This is something that should be a part of
> > > the changelog. I am sorry to still poke into this because I still do not
> > 
> > I will include it.
> > 
> > > have a full understanding of what is going on and while I do not object
> > > to drop the spinlock I still suspect this is papering over a deeper
> > > problem.
> > 
> > I couldn't come up with better solution. Feel free to suggest it.
> > 
> > > 
> > > If mark_page_accessed is really expensive then why do we even bother to
> > > do it in the tear down path in the first place? Why don't we simply set
> > > a referenced bit on the page to reflect the young pte bit? I might be
> > > missing something here of course.
> > 
> > commit bf3f3bc5e73
> > Author: Nick Piggin 
> > Date:   Tue Jan 6 14:38:55 2009 -0800
> > 
> > mm: don't mark_page_accessed in fault path
> > 
> > Doing a mark_page_accessed at fault-time, then doing SetPageReferenced 
> > at
> > unmap-time if the pte is young has a number of problems.
> > 
> > mark_page_accessed is supposed to be roughly the equivalent of a young 
> > pte
> > for unmapped references. Unfortunately it doesn't come with any context:
> > after being called, reclaim doesn't know who or why the page was 
> > touched.
> > 
> > So calling mark_page_accessed not only adds extra lru or PG_referenced
> > manipulations for pages that are already going to have pte_young ptes 
> > anyway,
> > but it also adds these references which are difficult to work with from 
> > the
> > context of vma specific references (eg. MADV_SEQUENTIAL pte_young may 
> > not
> > wish to contribute to the page being referenced).
> > 
> > Then, simply doing SetPageReferenced when zapping a pte and finding it 
> > is
> > young, is not a really good solution either. SetPageReferenced does not
> > correctly promote the page to the active list for example. So after 
> > removing
> > mark_page_accessed from the fault path, several mmap()+touch+munmap() 
> > would
> > have a very different result from several read(2) calls for example, 
> > which
> > is not really desirable.
> 
> Well, I have to say that this is rather vague to me. Nick, could you be
> more specific about which workloads do benefit from this change? Let's
> say that the zapped pte is the only referenced one and then reclaim
> finds the page on inactive list. We would go and reclaim it. But does
> that matter so much? Hot pages would be referenced from multiple ptes
> very likely, no?

Re: [PATCH] mm: release the spinlock on zap_pte_range

2019-07-30 Thread Andrew Morton
On Mon, 29 Jul 2019 17:20:52 +0900 Minchan Kim  wrote:

> > > @@ -1022,7 +1023,16 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > >   flush_tlb_batched_pending(mm);
> > >   arch_enter_lazy_mmu_mode();
> > >   do {
> > > - pte_t ptent = *pte;
> > > + pte_t ptent;
> > > +
> > > + if (progress >= 32) {
> > > + progress = 0;
> > > + if (need_resched())
> > > + break;
> > > + }
> > > + progress += 8;
> > 
> > Why 8?
> 
> Just copied from copy_pte_range.

copy_pte_range() does

if (pte_none(*src_pte)) {
progress++;
continue;
}
entry.val = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte,
vma, addr, rss);
if (entry.val)
break;
progress += 8;

which appears to be an attempt to balance the cost of copy_one_pte()
against the cost of not calling copy_one_pte().
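
For completeness, the resched check that this progress counter feeds sits
at the top of the same copy_pte_range() loop; at the time it looked roughly
like this (lightly trimmed):

	do {
		/*
		 * We are holding two locks at this point - either of them
		 * could generate latencies in another task on another CPU.
		 */
		if (progress >= 32) {
			progress = 0;
			if (need_resched() ||
			    spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
				break;
		}
		/* ... pte_none() check and copy_one_pte() as quoted above ... */
	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);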

Your code doesn't do this balancing and hence can be simpler.

It all seems a bit overdesigned.  need_resched() is cheap.  It's
possibly a mistake to check need_resched() on *every* loop because some
crazy scheduling load might livelock us.  But surely it would be enough
to do something like

if (progress++ && need_resched()) {

progress = 0;
}

and leave it at that?


Re: [PATCH] mm: release the spinlock on zap_pte_range

2019-07-30 Thread Michal Hocko
[Cc Nick - the email thread starts 
http://lkml.kernel.org/r/20190729071037.241581-1-minc...@kernel.org
 A very brief summary is that mark_page_accessed seems to be quite
 expensive and the question is whether we still need it and why
 SetPageReferenced cannot be used instead. More below.]

On Tue 30-07-19 21:39:35, Minchan Kim wrote:
> On Tue, Jul 30, 2019 at 02:32:37PM +0200, Michal Hocko wrote:
> > On Tue 30-07-19 21:11:10, Minchan Kim wrote:
> > > On Mon, Jul 29, 2019 at 10:35:15AM +0200, Michal Hocko wrote:
> > > > On Mon 29-07-19 17:20:52, Minchan Kim wrote:
> > > > > On Mon, Jul 29, 2019 at 09:45:23AM +0200, Michal Hocko wrote:
> > > > > > On Mon 29-07-19 16:10:37, Minchan Kim wrote:
> > > > > > > In our testing(camera recording), Miguel and Wei found 
> > > > > > > unmap_page_range
> > > > > > > takes above 6ms with preemption disabled easily. When I see that, 
> > > > > > > the
> > > > > > > reason is it holds page table spinlock during entire 512 page 
> > > > > > > operation
> > > > > > > in a PMD. 6.2ms is never trivial for user experience if RT task 
> > > > > > > couldn't
> > > > > > > run in the time because it could make frame drop or glitch audio 
> > > > > > > problem.
> > > > > > 
> > > > > > Where is the time spent during the tear down? 512 pages doesn't 
> > > > > > sound
> > > > > > like a lot to tear down. Is it the TLB flushing?
> > > > > 
> > > > > Miguel confirmed there is no such big latency without 
> > > > > mark_page_accessed
> > > > > in zap_pte_range so I guess it's the contention of LRU lock as well as
> > > > > heavy activate_page overhead which is not trivial, either.
> > > > 
> > > > Please give us more details ideally with some numbers.
> > > 
> > > I had a time to benchmark it via adding some trace_printk hooks between
> > > pte_offset_map_lock and pte_unmap_unlock in zap_pte_range. The testing
> > > device is 2018 premium mobile device.
> > > 
> > > I can get 2ms delay rather easily to release 2M(ie, 512 pages) when the
> > > task runs on little core even though it doesn't have any IPI and LRU
> > > lock contention. It's already too heavy.
> > > 
> > > If I remove activate_page, 35-40% overhead of zap_pte_range is gone
> > > so most of overhead(about 0.7ms) comes from activate_page via
> > > mark_page_accessed. Thus, if there is LRU contention, that 0.7ms could
> > > accumulate up to several ms.
> > 
> > Thanks for this information. This is something that should be a part of
> > the changelog. I am sorry to still poke into this because I still do not
> 
> I will include it.
> 
> > have a full understanding of what is going on and while I do not object
> > to drop the spinlock I still suspect this is papering over a deeper
> > problem.
> 
> I couldn't come up with better solution. Feel free to suggest it.
> 
> > 
> > If mark_page_accessed is really expensive then why do we even bother to
> > do it in the tear down path in the first place? Why don't we simply set
> > a referenced bit on the page to reflect the young pte bit? I might be
> > missing something here of course.
> 
> commit bf3f3bc5e73
> Author: Nick Piggin 
> Date:   Tue Jan 6 14:38:55 2009 -0800
> 
> mm: don't mark_page_accessed in fault path
> 
> Doing a mark_page_accessed at fault-time, then doing SetPageReferenced at
> unmap-time if the pte is young has a number of problems.
> 
> mark_page_accessed is supposed to be roughly the equivalent of a young pte
> for unmapped references. Unfortunately it doesn't come with any context:
> after being called, reclaim doesn't know who or why the page was touched.
> 
> So calling mark_page_accessed not only adds extra lru or PG_referenced
> manipulations for pages that are already going to have pte_young ptes 
> anyway,
> but it also adds these references which are difficult to work with from 
> the
> context of vma specific references (eg. MADV_SEQUENTIAL pte_young may not
> wish to contribute to the page being referenced).
> 
> Then, simply doing SetPageReferenced when zapping a pte and finding it is
> young, is not a really good solution either. SetPageReferenced does not
> correctly promote the page to the active list for example. So after 
> removing
> mark_page_accessed from the fault path, several mmap()+touch+munmap() 
> would
> have a very different result from several read(2) calls for example, which
> is not really desirable.

Well, I have to say that this is rather vague to me. Nick, could you be
more specific about which workloads do benefit from this change? Let's
say that the zapped pte is the only referenced one and then reclaim
finds the page on inactive list. We would go and reclaim it. But does
that matter so much? Hot pages would be referenced from multiple ptes
very likely, no?

That being said, considering that mark_page_accessed is not free, do we
have strong reasons to keep it?

Or do I miss something?

> Signed-off-by: Nick Piggin 
> Acked-by: Johannes Weiner 
> 

Re: [PATCH] mm: release the spinlock on zap_pte_range

2019-07-30 Thread Minchan Kim
On Tue, Jul 30, 2019 at 02:32:37PM +0200, Michal Hocko wrote:
> On Tue 30-07-19 21:11:10, Minchan Kim wrote:
> > On Mon, Jul 29, 2019 at 10:35:15AM +0200, Michal Hocko wrote:
> > > On Mon 29-07-19 17:20:52, Minchan Kim wrote:
> > > > On Mon, Jul 29, 2019 at 09:45:23AM +0200, Michal Hocko wrote:
> > > > > On Mon 29-07-19 16:10:37, Minchan Kim wrote:
> > > > > > In our testing(camera recording), Miguel and Wei found 
> > > > > > unmap_page_range
> > > > > > takes above 6ms with preemption disabled easily. When I see that, 
> > > > > > the
> > > > > > reason is it holds page table spinlock during entire 512 page 
> > > > > > operation
> > > > > > in a PMD. 6.2ms is never trivial for user experience if RT task 
> > > > > > couldn't
> > > > > > run in the time because it could make frame drop or glitch audio 
> > > > > > problem.
> > > > > 
> > > > > Where is the time spent during the tear down? 512 pages doesn't sound
> > > > > like a lot to tear down. Is it the TLB flushing?
> > > > 
> > > > Miguel confirmed there is no such big latency without mark_page_accessed
> > > > in zap_pte_range so I guess it's the contention of LRU lock as well as
> > > > heavy activate_page overhead which is not trivial, either.
> > > 
> > > Please give us more details ideally with some numbers.
> > 
> > I had a time to benchmark it via adding some trace_printk hooks between
> > pte_offset_map_lock and pte_unmap_unlock in zap_pte_range. The testing
> > device is 2018 premium mobile device.
> > 
> > I can get 2ms delay rather easily to release 2M(ie, 512 pages) when the
> > task runs on little core even though it doesn't have any IPI and LRU
> > lock contention. It's already too heavy.
> > 
> > If I remove activate_page, 35-40% overhead of zap_pte_range is gone
> > so most of overhead(about 0.7ms) comes from activate_page via
> > mark_page_accessed. Thus, if there is LRU contention, that 0.7ms could
> > accumulate up to several ms.
> 
> Thanks for this information. This is something that should be a part of
> the changelog. I am sorry to still poke into this because I still do not

I will include it.

> have a full understanding of what is going on and while I do not object
> to drop the spinlock I still suspect this is papering over a deeper
> problem.

I couldn't come up with a better solution. Feel free to suggest one.

> 
> If mark_page_accessed is really expensive then why do we even bother to
> do it in the tear down path in the first place? Why don't we simply set
> a referenced bit on the page to reflect the young pte bit? I might be
> missing something here of course.

commit bf3f3bc5e73
Author: Nick Piggin 
Date:   Tue Jan 6 14:38:55 2009 -0800

mm: don't mark_page_accessed in fault path

Doing a mark_page_accessed at fault-time, then doing SetPageReferenced at
unmap-time if the pte is young has a number of problems.

mark_page_accessed is supposed to be roughly the equivalent of a young pte
for unmapped references. Unfortunately it doesn't come with any context:
after being called, reclaim doesn't know who or why the page was touched.

So calling mark_page_accessed not only adds extra lru or PG_referenced
manipulations for pages that are already going to have pte_young ptes 
anyway,
but it also adds these references which are difficult to work with from the
context of vma specific references (eg. MADV_SEQUENTIAL pte_young may not
wish to contribute to the page being referenced).

Then, simply doing SetPageReferenced when zapping a pte and finding it is
young, is not a really good solution either. SetPageReferenced does not
correctly promote the page to the active list for example. So after removing
mark_page_accessed from the fault path, several mmap()+touch+munmap() would
have a very different result from several read(2) calls for example, which
is not really desirable.

Signed-off-by: Nick Piggin 
Acked-by: Johannes Weiner 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 


Re: [PATCH] mm: release the spinlock on zap_pte_range

2019-07-30 Thread Michal Hocko
On Tue 30-07-19 21:11:10, Minchan Kim wrote:
> On Mon, Jul 29, 2019 at 10:35:15AM +0200, Michal Hocko wrote:
> > On Mon 29-07-19 17:20:52, Minchan Kim wrote:
> > > On Mon, Jul 29, 2019 at 09:45:23AM +0200, Michal Hocko wrote:
> > > > On Mon 29-07-19 16:10:37, Minchan Kim wrote:
> > > > > In our testing(camera recording), Miguel and Wei found 
> > > > > unmap_page_range
> > > > > takes above 6ms with preemption disabled easily. When I see that, the
> > > > > reason is it holds page table spinlock during entire 512 page 
> > > > > operation
> > > > > in a PMD. 6.2ms is never trivial for user experience if RT task 
> > > > > couldn't
> > > > > run in the time because it could make frame drop or glitch audio 
> > > > > problem.
> > > > 
> > > > Where is the time spent during the tear down? 512 pages doesn't sound
> > > > like a lot to tear down. Is it the TLB flushing?
> > > 
> > > Miguel confirmed there is no such big latency without mark_page_accessed
> > > in zap_pte_range so I guess it's the contention of LRU lock as well as
> > > heavy activate_page overhead which is not trivial, either.
> > 
> > Please give us more details ideally with some numbers.
> 
> I had a time to benchmark it via adding some trace_printk hooks between
> pte_offset_map_lock and pte_unmap_unlock in zap_pte_range. The testing
> device is 2018 premium mobile device.
> 
> I can get 2ms delay rather easily to release 2M(ie, 512 pages) when the
> task runs on little core even though it doesn't have any IPI and LRU
> lock contention. It's already too heavy.
> 
> If I remove activate_page, 35-40% overhead of zap_pte_range is gone
> so most of overhead(about 0.7ms) comes from activate_page via
> mark_page_accessed. Thus, if there is LRU contention, that 0.7ms could
> accumulate up to several ms.

Thanks for this information. This is something that should be a part of
the changelog. I am sorry to keep poking at this, but I still do not have
a full understanding of what is going on, and while I do not object to
dropping the spinlock, I still suspect this is papering over a deeper
problem.

If mark_page_accessed is really expensive then why do we even bother to
do it in the tear down path in the first place? Why don't we simply set
a referenced bit on the page to reflect the young pte bit? I might be
missing something here of course.
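
Concretely, the alternative being asked about would replace the
mark_page_accessed() call in zap_pte_range() with something like this
(a sketch of the suggestion, not a tested patch):

			/* Record the young pte cheaply instead of doing the
			 * full pagevec/LRU work under the page table lock. */
			if (pte_young(ptent) &&
			    likely(!(vma->vm_flags & VM_SEQ_READ)))
				SetPageReferenced(page);

The bf3f3bc5e73 changelog quoted elsewhere in the thread argues this is not
enough, because SetPageReferenced() alone never promotes the page to the
active list.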

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm: release the spinlock on zap_pte_range

2019-07-30 Thread Minchan Kim
On Mon, Jul 29, 2019 at 10:35:15AM +0200, Michal Hocko wrote:
> On Mon 29-07-19 17:20:52, Minchan Kim wrote:
> > On Mon, Jul 29, 2019 at 09:45:23AM +0200, Michal Hocko wrote:
> > > On Mon 29-07-19 16:10:37, Minchan Kim wrote:
> > > > In our testing(camera recording), Miguel and Wei found unmap_page_range
> > > > takes above 6ms with preemption disabled easily. When I see that, the
> > > > reason is it holds page table spinlock during entire 512 page operation
> > > > in a PMD. 6.2ms is never trivial for user experience if RT task couldn't
> > > > run in the time because it could make frame drop or glitch audio 
> > > > problem.
> > > 
> > > Where is the time spent during the tear down? 512 pages doesn't sound
> > > like a lot to tear down. Is it the TLB flushing?
> > 
> > Miguel confirmed there is no such big latency without mark_page_accessed
> > in zap_pte_range so I guess it's the contention of LRU lock as well as
> > heavy activate_page overhead which is not trivial, either.
> 
> Please give us more details ideally with some numbers.

I took some time to benchmark it by adding trace_printk hooks between
pte_offset_map_lock and pte_unmap_unlock in zap_pte_range. The testing
device is a 2018 premium mobile device.

I can easily see a 2ms delay to release 2M (i.e., 512 pages) when the task
runs on a little core, even without any IPI or LRU lock contention. That
is already too heavy.

If I remove activate_page, 35-40% of the zap_pte_range overhead is gone,
so most of the overhead (about 0.7ms) comes from activate_page via
mark_page_accessed. Thus, if there is LRU contention, that 0.7ms could
accumulate up to several ms.
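
A minimal sketch of that kind of instrumentation (assuming
sched_clock()-based timing around the locked section; not the exact hooks
used) is:

	u64 t0, t1;

	t0 = sched_clock();
	start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
	/* ... the per-pte loop of zap_pte_range() runs here ... */
	pte_unmap_unlock(start_pte, ptl);
	t1 = sched_clock();
	trace_printk("zap_pte_range: ptl held for %llu ns\n", t1 - t0);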


Re: [PATCH] mm: release the spinlock on zap_pte_range

2019-07-29 Thread Michal Hocko
On Mon 29-07-19 17:20:52, Minchan Kim wrote:
> On Mon, Jul 29, 2019 at 09:45:23AM +0200, Michal Hocko wrote:
> > On Mon 29-07-19 16:10:37, Minchan Kim wrote:
> > > In our testing(camera recording), Miguel and Wei found unmap_page_range
> > > takes above 6ms with preemption disabled easily. When I see that, the
> > > reason is it holds page table spinlock during entire 512 page operation
> > > in a PMD. 6.2ms is never trivial for user experience if RT task couldn't
> > > run in the time because it could make frame drop or glitch audio problem.
> > 
> > Where is the time spent during the tear down? 512 pages doesn't sound
> > like a lot to tear down. Is it the TLB flushing?
> 
> Miguel confirmed there is no such big latency without mark_page_accessed
> in zap_pte_range so I guess it's the contention of LRU lock as well as
> heavy activate_page overhead which is not trivial, either.

Please give us more details ideally with some numbers.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm: release the spinlock on zap_pte_range

2019-07-29 Thread Minchan Kim
On Mon, Jul 29, 2019 at 09:45:23AM +0200, Michal Hocko wrote:
> On Mon 29-07-19 16:10:37, Minchan Kim wrote:
> > In our testing(camera recording), Miguel and Wei found unmap_page_range
> > takes above 6ms with preemption disabled easily. When I see that, the
> > reason is it holds page table spinlock during entire 512 page operation
> > in a PMD. 6.2ms is never trivial for user experience if RT task couldn't
> > run in the time because it could make frame drop or glitch audio problem.
> 
> Where is the time spent during the tear down? 512 pages doesn't sound
> like a lot to tear down. Is it the TLB flushing?

Miguel confirmed there is no such big latency without mark_page_accessed
in zap_pte_range, so I guess it is the LRU lock contention as well as the
heavy activate_page overhead, which is not trivial either.

> 
> > This patch adds a preemption point like copy_pte_range.
> > 
> > Reported-by: Miguel de Dios 
> > Reported-by: Wei Wang 
> > Cc: Michal Hocko 
> > Cc: Johannes Weiner 
> > Cc: Mel Gorman 
> > Signed-off-by: Minchan Kim 
> > ---
> >  mm/memory.c | 19 ---
> >  1 file changed, 16 insertions(+), 3 deletions(-)
> > 
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 2e796372927fd..bc3e0c5e4f89b 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1007,6 +1007,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > struct zap_details *details)
> >  {
> > struct mm_struct *mm = tlb->mm;
> > +   int progress = 0;
> > int force_flush = 0;
> > int rss[NR_MM_COUNTERS];
> > spinlock_t *ptl;
> > @@ -1022,7 +1023,16 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > flush_tlb_batched_pending(mm);
> > arch_enter_lazy_mmu_mode();
> > do {
> > -   pte_t ptent = *pte;
> > +   pte_t ptent;
> > +
> > +   if (progress >= 32) {
> > +   progress = 0;
> > +   if (need_resched())
> > +   break;
> > +   }
> > +   progress += 8;
> 
> Why 8?

Just copied from copy_pte_range.


Re: [PATCH] mm: release the spinlock on zap_pte_range

2019-07-29 Thread Michal Hocko
On Mon 29-07-19 16:10:37, Minchan Kim wrote:
> In our testing(camera recording), Miguel and Wei found unmap_page_range
> takes above 6ms with preemption disabled easily. When I see that, the
> reason is it holds page table spinlock during entire 512 page operation
> in a PMD. 6.2ms is never trivial for user experience if RT task couldn't
> run in the time because it could make frame drop or glitch audio problem.

Where is the time spent during the tear down? 512 pages doesn't sound
like a lot to tear down. Is it the TLB flushing?

> This patch adds a preemption point like copy_pte_range.
> 
> Reported-by: Miguel de Dios 
> Reported-by: Wei Wang 
> Cc: Michal Hocko 
> Cc: Johannes Weiner 
> Cc: Mel Gorman 
> Signed-off-by: Minchan Kim 
> ---
>  mm/memory.c | 19 ---
>  1 file changed, 16 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 2e796372927fd..bc3e0c5e4f89b 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1007,6 +1007,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>   struct zap_details *details)
>  {
>   struct mm_struct *mm = tlb->mm;
> + int progress = 0;
>   int force_flush = 0;
>   int rss[NR_MM_COUNTERS];
>   spinlock_t *ptl;
> @@ -1022,7 +1023,16 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>   flush_tlb_batched_pending(mm);
>   arch_enter_lazy_mmu_mode();
>   do {
> - pte_t ptent = *pte;
> + pte_t ptent;
> +
> + if (progress >= 32) {
> + progress = 0;
> + if (need_resched())
> + break;
> + }
> + progress += 8;

Why 8?

> +
> + ptent = *pte;
>   if (pte_none(ptent))
>   continue;
>  
> @@ -1123,8 +1133,11 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>   if (force_flush) {
>   force_flush = 0;
>   tlb_flush_mmu(tlb);
> - if (addr != end)
> - goto again;
> + }
> +
> + if (addr != end) {
> + progress = 0;
> + goto again;
>   }
>  
>   return addr;
> -- 
> 2.22.0.709.g102302147b-goog

-- 
Michal Hocko
SUSE Labs


[PATCH] mm: release the spinlock on zap_pte_range

2019-07-29 Thread Minchan Kim
In our testing (camera recording), Miguel and Wei found that
unmap_page_range easily takes more than 6ms with preemption disabled. The
reason is that it holds the page table spinlock for the entire 512-page
operation in a PMD. 6.2ms is far from trivial for user experience if an RT
task cannot run in that time, since it can cause frame drops or audio
glitches.

This patch adds a preemption point like copy_pte_range.

Reported-by: Miguel de Dios 
Reported-by: Wei Wang 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Mel Gorman 
Signed-off-by: Minchan Kim 
---
 mm/memory.c | 19 ---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 2e796372927fd..bc3e0c5e4f89b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1007,6 +1007,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
struct zap_details *details)
 {
struct mm_struct *mm = tlb->mm;
+   int progress = 0;
int force_flush = 0;
int rss[NR_MM_COUNTERS];
spinlock_t *ptl;
@@ -1022,7 +1023,16 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
flush_tlb_batched_pending(mm);
arch_enter_lazy_mmu_mode();
do {
-   pte_t ptent = *pte;
+   pte_t ptent;
+
+   if (progress >= 32) {
+   progress = 0;
+   if (need_resched())
+   break;
+   }
+   progress += 8;
+
+   ptent = *pte;
if (pte_none(ptent))
continue;
 
@@ -1123,8 +1133,11 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
if (force_flush) {
force_flush = 0;
tlb_flush_mmu(tlb);
-   if (addr != end)
-   goto again;
+   }
+
+   if (addr != end) {
+   progress = 0;
+   goto again;
}
 
return addr;
-- 
2.22.0.709.g102302147b-goog