Re: [PATCH v2 00/16] Multigenerational LRU Framework

2021-04-17 Thread Michel Lespinasse
On Thu, Apr 15, 2021 at 01:13:13AM -0600, Yu Zhao wrote:
> Page table scanning doesn't replace the existing rmap walk. It is
> complementary and only happens when it is likely that most of the
> pages on a system under pressure have been referenced, i.e., out of
> *inactive* pages, by definition of the existing implementation. Under
> such a condition, scanning *active* pages one by one with the rmap is
> likely to cost more than scanning them all at once via page tables.
> When we evict *inactive* pages, we still use the rmap and share a
> common path with the existing code.
> 
> Page table scanning falls back to the rmap walk if the page tables of
> a process are apparently sparse, i.e., rss < size of the page tables.

Could you expand a bit more as to how page table scanning and rmap
scanning coexist ? Say, there is some memory pressure and you want to
identify good candidate pages to reclaim. You could scan processes with
the page table scanning method, or you could scan the lru list through
the rmap method. How do you mix the two - when you use the lru/rmap
method, won't you encounter both pages that are mapped in "dense"
processes where scanning page tables would have been better, and pages
that are mapped in "sparse" processes where you are happy to be using
rmap, and even pages that are mapped into both types of processes at
once ?  Or, can you change the lru/rmap scan so that it will efficiently
skip over all dense processes when you use it ?

Thanks,

--
Michel "walken" Lespinasse


Re: [PATCH v2 00/16] Multigenerational LRU Framework

2021-04-15 Thread Huang, Ying
Yu Zhao  writes:

> On Wed, Apr 14, 2021 at 9:00 PM Andi Kleen  wrote:
>>
>> > We fall back to the rmap when it's obviously not smart to do so. There
>> > is still a lot of room for improvement in this function though, i.e.,
>> > it should be per VMA and NUMA aware.
>>
>> Okay so it's more a question to tune the cross over heuristic. That
>> sounds much easier than replacing everything.
>>
>> Of course long term it might be a problem to maintain too many
>> different ways to do things, but I suppose short term it's a reasonable
>> strategy.
>
> Hi Rik, Ying,
>
> Sorry for being persistent. I want to make sure we are on the same page:
>
> Page table scanning doesn't replace the existing rmap walk. It is
> complementary and only happens when it is likely that most of the
> pages on a system under pressure have been referenced, i.e., out of
> *inactive* pages, by definition of the existing implementation. Under
> such a condition, scanning *active* pages one by one with the rmap is
> likely to cost more than scanning them all at once via page tables.
> When we evict *inactive* pages, we still use the rmap and share a
> common path with the existing code.
>
> Page table scanning falls back to the rmap walk if the page tables of
> a process are apparently sparse, i.e., rss < size of the page tables.
>
> I should have clarified this at the very beginning of the discussion.
> But it has become so natural to me and I assumed we'd all see it this
> way.
>
> Your concern regarding the NUMA optimization is still valid, and it's
> a high priority.

Hi, Yu,

In general, I think it's a good idea to combine the page table scanning
and rmap scanning in page reclaim.  For example, if the working-set
transitions, we can take advantage of the fast page table scanning to
identify the new working-set quickly, while falling back to the rmap
scanning if the page table scanning doesn't help.

Best Regards,
Huang, Ying


Re: [PATCH v2 00/16] Multigenerational LRU Framework

2021-04-15 Thread Yu Zhao
On Wed, Apr 14, 2021 at 9:00 PM Andi Kleen  wrote:
>
> > We fall back to the rmap when it's obviously not smart to do so. There
> > is still a lot of room for improvement in this function though, i.e.,
> > it should be per VMA and NUMA aware.
>
> Okay so it's more a question to tune the cross over heuristic. That
> sounds much easier than replacing everything.
>
> Of course long term it might be a problem to maintain too many
> different ways to do things, but I suppose short term it's a reasonable
> strategy.

Hi Rik, Ying,

Sorry for being persistent. I want to make sure we are on the same page:

Page table scanning doesn't replace the existing rmap walk. It is
complementary and only happens when it is likely that most of the
pages on a system under pressure have been referenced, i.e., out of
*inactive* pages, by definition of the existing implementation. Under
such a condition, scanning *active* pages one by one with the rmap is
likely to cost more than scanning them all at once via page tables.
When we evict *inactive* pages, we still use the rmap and share a
common path with the existing code.

Page table scanning falls back to the rmap walk if the page tables of
a process are apparently sparse, i.e., rss < size of the page tables.
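
A minimal userspace sketch of that fallback condition (the parameter names
and numbers below are illustrative; the real check is the patchset's
should_skip_mm(), quoted elsewhere in the thread):

#include <stdbool.h>
#include <stdio.h>

#define PAGE_SIZE 4096UL

/* fall back to the rmap when the mapped pages are fewer than the pages
 * spent on the page tables that map them */
static bool should_skip_mm(unsigned long rss_pages, unsigned long pgtable_bytes)
{
	return rss_pages < pgtable_bytes / PAGE_SIZE;
}

int main(void)
{
	/* dense: 100k mapped pages behind ~200 pages of page tables */
	printf("dense  -> skip page table scan? %d\n",
	       should_skip_mm(100000, 200 * PAGE_SIZE));

	/* sparse: 50 mapped pages scattered across the same page tables */
	printf("sparse -> skip page table scan? %d\n",
	       should_skip_mm(50, 200 * PAGE_SIZE));
	return 0;
}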

I should have clarified this at the very beginning of the discussion.
But it has become so natural to me and I assumed we'd all see it this
way.

Your concern regarding the NUMA optimization is still valid, and it's
a high priority.

Thanks.


Re: [PATCH v2 00/16] Multigenerational LRU Framework

2021-04-14 Thread Andi Kleen
> We fall back to the rmap when it's obviously not smart to do so. There
> is still a lot of room for improvement in this function though, i.e.,
> it should be per VMA and NUMA aware.

Okay so it's more a question to tune the cross over heuristic. That
sounds much easier than replacing everything.

Of course long term it might be a problem to maintain too many 
different ways to do things, but I suppose short term it's a reasonable
strategy.

-Andi



Re: [PATCH v2 00/16] Multigenerational LRU Framework

2021-04-14 Thread Dave Chinner
On Wed, Apr 14, 2021 at 01:16:52AM -0600, Yu Zhao wrote:
> On Tue, Apr 13, 2021 at 10:50 PM Dave Chinner  wrote:
> > On Tue, Apr 13, 2021 at 09:40:12PM -0600, Yu Zhao wrote:
> > > On Tue, Apr 13, 2021 at 5:14 PM Dave Chinner  wrote:
> > > > Profiles would be interesting, because it sounds to me like reclaim
> > > > *might* be batching page cache removal better (e.g. fewer, larger
> > > > batches) and so spending less time contending on the mapping tree
> > > > lock...
> > > >
> > > > IOWs, I suspect this result might actually be a result of less lock
> > > > contention due to a change in batch processing characteristics of
> > > > the new algorithm rather than it being a "better" algorithm...
> > >
> > > I appreciate the profile. But there is no batching in
> > > __remove_mapping() -- it locks the mapping for each page, and
> > > therefore the lock contention penalizes the mainline and this patchset
> > > equally. It looks worse on your system because the four kswapd threads
> > > from different nodes were working on the same file.
> >
> > I think you misunderstand exactly what I mean by "batching" here.
> > I'm not talking about doing multiple pieces of work under a single
> > lock. What I mean is that the overall amount of work done in a
> > single reclaim scan (i.e a "reclaim batch") is packaged differently.
> >
> > We already batch up page reclaim via building a page list and then
> > passing it to shrink_page_list() to process the batch of pages in a
> > single pass. Each page in this page list batch then calls
> > remove_mapping() to pull the page form the LRU, we have a run of
> > contention between the foreground read() thread and the background
> > kswapd.
> >
> > If the size or nature of the pages in the batch passed to
> > shrink_page_list() changes, then the amount of time a reclaim batch
> > is going to put pressure on the mapping tree lock will also change.
> > That's the "change in batching behaviour" I'm referring to here. I
> > haven't read through the patchset to determine if you change the
> > shrink_page_list() algorithm, but it likely changes what is passed
> > to be reclaimed and that in turn changes the locking patterns that
> > fall out of shrink_page_list...
> 
> Ok, if we are talking about the size of the batch passed to
> shrink_page_list(), both the mainline and this patchset cap it at
> SWAP_CLUSTER_MAX, which is 32. There are corner cases, but when
> running fio/io_uring, it's safe to say both use 32.
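
A shape-only sketch of the batching under discussion -- not the actual
shrink_page_list()/__remove_mapping() code -- just to show where the
per-page mapping lock sits inside one reclaim batch:

#include <stdio.h>

#define SWAP_CLUSTER_MAX 32	/* batch cap in both mainline and the patchset */

struct page { int id; };

static void remove_mapping(struct page *page)
{
	/* in the kernel this takes the mapping tree (xarray) lock per page */
	printf("remove page %d from its mapping\n", page->id);
}

static void shrink_page_list(struct page *batch, int nr)
{
	for (int i = 0; i < nr; i++)
		remove_mapping(&batch[i]);
}

int main(void)
{
	struct page batch[SWAP_CLUSTER_MAX];

	for (int i = 0; i < SWAP_CLUSTER_MAX; i++)
		batch[i].id = i;

	shrink_page_list(batch, SWAP_CLUSTER_MAX);	/* one reclaim batch */
	return 0;
}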

You're still looking at micro-scale behaviour, not the larger-scale
batching effects. Are we passing SWAP_CLUSTER_MAX groups of pages to
shrink_page_list() at a different rate?

When I say "batch of work" when talking about the page cache cycling
*500 thousand pages a second* through the cache, I'm not talking
about batches of 32 pages. I'm talking about the entire batch of
work kswapd does in an invocation cycle.

Is it scanning 100k pages 10 times a second? or 10k pages a hundred
times a second? How long does a batch take to run? how long does it
sleep between processing batches? Is there any change in these
metrics as a result of the multi-gen LRU patches?

Basically, we're looking at how access to the mapping lock is
changing the contention profile, and whether that is significant or
not. I suspect it is, because when you have highly contended locks
and you do something external that reduces unrelated lock
contention, it's because that external thing is taking more time to
do and so there's less time to spend hitting locks hard...

As such, I don't think this test is a good measure of the multi-gen
LRU patches at all - performance is dominated by the severity of
lock contention external to the LRU scanning algorithm, and it's
hard to infer anything through such lock contention.

> I don't want to paste everything here -- they'd clutter. Please see
> all the detailed profiles in the attachment. Let me know if their
> formats are not to your liking. I still have the raw perf.data.

Which makes the discussion thread just about impossible to follow or
comment on. Please just post the relevant excerpt of the stack
profile that you are commenting on.

> > > And I plan to reach out to other communities, e.g., PostgreSQL, to
> > > benchmark the patchset. I heard they have been complaining about the
> > > buffered io performance under memory pressure. Any other benchmarks
> > > you'd suggest?
> > >
> > > BTW, you might find another surprise in how less frequently slab
> > > shrinkers are called under memory pressure, because this patchset is a
> > > lot better at finding pages to reclaim and therefore doesn't overkill
> > > slabs.
> >
> > That's actually very likely to be a Bad Thing and cause unexpected
> > performance and OOM based regressions. When the machine finally runs
> > out of page cache it can easily reclaim, it's going to get stuck
> > with long tail latencies reclaiming huge slab caches as they've had
> > no substantial ongoing pressure put on them to keep them in balance

Re: [PATCH v2 00/16] Multigenerational LRU Framework

2021-04-14 Thread Dave Chinner
On Wed, Apr 14, 2021 at 08:43:36AM -0600, Jens Axboe wrote:
> On 4/13/21 5:14 PM, Dave Chinner wrote:
> > On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote:
> >> On 4/13/21 1:51 AM, SeongJae Park wrote:
> >>> From: SeongJae Park 
> >>>
> >>> Hello,
> >>>
> >>>
> >>> Very interesting work, thank you for sharing this :)
> >>>
> >>> On Tue, 13 Apr 2021 00:56:17 -0600 Yu Zhao  wrote:
> >>>
>  What's new in v2
>  
>  Special thanks to Jens Axboe for reporting a regression in buffered
>  I/O and helping test the fix.
> >>>
> >>> Is the discussion open?  If so, could you please give me a link?
> >>
> >> I wasn't on the initial post (or any of the lists it was posted to), but
> >> it's on the google page reclaim list. Not sure if that is public or not.
> >>
> >> tldr is that I was pretty excited about this work, as buffered IO tends
> >> to suck (a lot) for high throughput applications. My test case was
> >> pretty simple:
> >>
> >> Randomly read a fast device, using 4k buffered IO, and watch what
> >> happens when the page cache gets filled up. For this particular test,
> >> we'll initially be doing 2.1GB/sec of IO, and then drop to 1.5-1.6GB/sec
> >> with kswapd using a lot of CPU trying to keep up. That's mainline
> >> behavior.
> > 
> > I see this exact same behaviour here, too, but I RCA'd it to
> > contention between the inode and memory reclaim for the mapping
> > structure that indexes the page cache. Basically the mapping tree
> > lock is the contention point here - you can either be adding pages
> > to the mapping during IO, or memory reclaim can be removing pages
> > from the mapping, but we can't do both at once.
> > 
> > So we end up with kswapd spinning on the mapping tree lock like so
> > when doing 1.6GB/s in 4kB buffered IO:
> > 
> > -   20.06% 0.00%  [kernel]   [k] kswapd
> >    - 20.06% kswapd
> >       - 20.05% balance_pgdat
> >          - 20.03% shrink_node
> >             - 19.92% shrink_lruvec
> >                - 19.91% shrink_inactive_list
> >                   - 19.22% shrink_page_list
> >                      - 17.51% __remove_mapping
> >                         - 14.16% _raw_spin_lock_irqsave
> >                            - 14.14% do_raw_spin_lock
> >                                 __pv_queued_spin_lock_slowpath
> >                         - 1.56% __delete_from_page_cache
> >                              0.63% xas_store
> >                         - 0.78% _raw_spin_unlock_irqrestore
> >                            - 0.69% do_raw_spin_unlock
> >                                 __raw_callee_save___pv_queued_spin_unlock
> >                      - 0.82% free_unref_page_list
> >                         - 0.72% free_unref_page_commit
> >                              0.57% free_pcppages_bulk

Re: [PATCH v2 00/16] Multigenerational LRU Framework

2021-04-14 Thread Yu Zhao
On Wed, Apr 14, 2021 at 1:42 PM Rik van Riel  wrote:
>
> On Wed, 2021-04-14 at 13:14 -0600, Yu Zhao wrote:
> > On Wed, Apr 14, 2021 at 9:59 AM Rik van Riel 
> > wrote:
> > > On Wed, 2021-04-14 at 08:51 -0700, Andi Kleen wrote:
> > > > >2) It will not scan PTE tables under non-leaf PMD entries that do not
> > > > >   have the accessed bit set, when
> > > > >   CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y.
> > > >
> > > > This assumes  that workloads have reasonable locality. Could
> > > > there
> > > > be a worst case where only one or two pages in each PTE are used,
> > > > so this PTE skipping trick doesn't work?
> > >
> > > Databases with large shared memory segments shared between
> > > many processes come to mind as a real-world example of a
> > > worst case scenario.
> >
> > Well, I don't think you two are talking about the same thing. Andi
> > was
> > focusing on sparsity. Your example seems to be about sharing, i.e.,
> > high mapcount. Of course both can happen at the same time, as I
> > tested
> > here:
> > https://lore.kernel.org/linux-mm/yhful%2fddtiml4...@google.com/#t
> >
> > I'm skeptical that shared memory used by databases is that sparse,
> > i.e., one page per PTE table, because the extremely low locality
> > would
> > heavily penalize their performance. But my knowledge in databases is
> > close to zero. So feel free to enlighten me or just ignore what I
> > said.
>
> A database may have a 200GB shared memory segment,
> and a worker task that gets spun up to handle a
> query might access only 1MB of memory to answer
> that query.
>
> That memory could be from anywhere inside the
> shared memory segment. Maybe some of the accesses
> are more dense, and others more sparse, who knows?
>
> A lot of the locality will depend on how memory space
> inside the shared memory segment is reclaimed and
> recycled inside the database.

Thanks. Yeah, I guess we'll just need to see more benchmarks from the
database realm. Stay tuned :)


Re: [PATCH v2 00/16] Multigenerational LRU Framework

2021-04-14 Thread Yu Zhao
On Wed, Apr 14, 2021 at 8:43 AM Jens Axboe  wrote:
>
> On 4/13/21 5:14 PM, Dave Chinner wrote:
> > On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote:
> >> On 4/13/21 1:51 AM, SeongJae Park wrote:
> >>> From: SeongJae Park 
> >>>
> >>> Hello,
> >>>
> >>>
> >>> Very interesting work, thank you for sharing this :)
> >>>
> >>> On Tue, 13 Apr 2021 00:56:17 -0600 Yu Zhao  wrote:
> >>>
>  What's new in v2
>  
>  Special thanks to Jens Axboe for reporting a regression in buffered
>  I/O and helping test the fix.
> >>>
> >>> Is the discussion open?  If so, could you please give me a link?
> >>
> >> I wasn't on the initial post (or any of the lists it was posted to), but
> >> it's on the google page reclaim list. Not sure if that is public or not.
> >>
> >> tldr is that I was pretty excited about this work, as buffered IO tends
> >> to suck (a lot) for high throughput applications. My test case was
> >> pretty simple:
> >>
> >> Randomly read a fast device, using 4k buffered IO, and watch what
> >> happens when the page cache gets filled up. For this particular test,
> >> we'll initially be doing 2.1GB/sec of IO, and then drop to 1.5-1.6GB/sec
> >> with kswapd using a lot of CPU trying to keep up. That's mainline
> >> behavior.
> >
> > I see this exact same behaviour here, too, but I RCA'd it to
> > contention between the inode and memory reclaim for the mapping
> > structure that indexes the page cache. Basically the mapping tree
> > lock is the contention point here - you can either be adding pages
> > to the mapping during IO, or memory reclaim can be removing pages
> > from the mapping, but we can't do both at once.
> >
> > So we end up with kswapd spinning on the mapping tree lock like so
> > when doing 1.6GB/s in 4kB buffered IO:
> >
> > -   20.06% 0.00%  [kernel]   [k] kswapd
> >    - 20.06% kswapd
> >       - 20.05% balance_pgdat
> >          - 20.03% shrink_node
> >             - 19.92% shrink_lruvec
> >                - 19.91% shrink_inactive_list
> >                   - 19.22% shrink_page_list
> >                      - 17.51% __remove_mapping
> >                         - 14.16% _raw_spin_lock_irqsave
> >                            - 14.14% do_raw_spin_lock
> >                                 __pv_queued_spin_lock_slowpath
> >                         - 1.56% __delete_from_page_cache
> >                              0.63% xas_store
> >                         - 0.78% _raw_spin_unlock_irqrestore
> >                            - 0.69% do_raw_spin_unlock
> >                                 __raw_callee_save___pv_queued_spin_unlock
> >                      - 0.82% free_unref_page_list
> >                         - 0.72% free_unref_page_commit
> >                              0.57% free_pcppages_bulk

Re: [PATCH v2 00/16] Multigenerational LRU Framework

2021-04-14 Thread Rik van Riel
On Wed, 2021-04-14 at 13:14 -0600, Yu Zhao wrote:
> On Wed, Apr 14, 2021 at 9:59 AM Rik van Riel 
> wrote:
> > On Wed, 2021-04-14 at 08:51 -0700, Andi Kleen wrote:
> > > >2) It will not scan PTE tables under non-leaf PMD entries that do not
> > > >   have the accessed bit set, when
> > > >   CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y.
> > > 
> > > This assumes  that workloads have reasonable locality. Could
> > > there
> > > be a worst case where only one or two pages in each PTE are used,
> > > so this PTE skipping trick doesn't work?
> > 
> > Databases with large shared memory segments shared between
> > many processes come to mind as a real-world example of a
> > worst case scenario.
> 
> Well, I don't think you two are talking about the same thing. Andi
> was
> focusing on sparsity. Your example seems to be about sharing, i.e.,
> high mapcount. Of course both can happen at the same time, as I
> tested
> here:
> https://lore.kernel.org/linux-mm/yhful%2fddtiml4...@google.com/#t
> 
> I'm skeptical that shared memory used by databases is that sparse,
> i.e., one page per PTE table, because the extremely low locality
> would
> heavily penalize their performance. But my knowledge in databases is
> close to zero. So feel free to enlighten me or just ignore what I
> said.

A database may have a 200GB shared memory segment,
and a worker task that gets spun up to handle a
query might access only 1MB of memory to answer
that query.

That memory could be from anywhere inside the
shared memory segment. Maybe some of the accesses
are more dense, and others more sparse, who knows?

A lot of the locality will depend on how memory space
inside the shared memory segment is reclaimed and
recycled inside the database.

-- 
All Rights Reversed.




Re: [PATCH v2 00/16] Multigenerational LRU Framework

2021-04-14 Thread Yu Zhao
On Wed, Apr 14, 2021 at 9:59 AM Rik van Riel  wrote:
>
> On Wed, 2021-04-14 at 08:51 -0700, Andi Kleen wrote:
> > >2) It will not scan PTE tables under non-leaf PMD entries that do not
> > >   have the accessed bit set, when
> > >   CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y.
> >
> > This assumes  that workloads have reasonable locality. Could there
> > be a worst case where only one or two pages in each PTE are used,
> > so this PTE skipping trick doesn't work?
>
> Databases with large shared memory segments shared between
> many processes come to mind as a real-world example of a
> worst case scenario.

Well, I don't think you two are talking about the same thing. Andi was
focusing on sparsity. Your example seems to be about sharing, i.e.,
high mapcount. Of course both can happen at the same time, as I tested
here:
https://lore.kernel.org/linux-mm/yhful%2fddtiml4...@google.com/#t

I'm skeptical that shared memory used by databases is that sparse,
i.e., one page per PTE table, because the extremely low locality would
heavily penalize their performance. But my knowledge in databases is
close to zero. So feel free to enlighten me or just ignore what I
said.


Re: [PATCH v2 00/16] Multigenerational LRU Framework

2021-04-14 Thread Yu Zhao
On Wed, Apr 14, 2021 at 9:51 AM Andi Kleen  wrote:
>
> >2) It will not scan PTE tables under non-leaf PMD entries that do not
> >   have the accessed bit set, when
> >   CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y.
>
> This assumes  that workloads have reasonable locality. Could there
> be a worst case where only one or two pages in each PTE are used,
> so this PTE skipping trick doesn't work?

Hi Andi,

Yes, it does make that assumption. And yes, there could. AFAIK, only
x86 supports this.
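
To illustrate the PTE-skipping trick being discussed, a toy userspace
simulation (made-up types, not kernel code): a whole PTE table is skipped
when its parent non-leaf PMD entry has not been marked accessed:

#include <stdbool.h>
#include <stdio.h>

#define PTRS_PER_PTE 512

struct pte { bool accessed; };
struct pmd { bool accessed; struct pte ptes[PTRS_PER_PTE]; };

static unsigned long scan_pmd(const struct pmd *pmd)
{
	unsigned long referenced = 0;

	/* with CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG (x86), the accessed bit on
	 * the non-leaf PMD entry is set when anything under it is used, so a
	 * clear bit lets us skip the whole 512-entry PTE table */
	if (!pmd->accessed)
		return 0;

	for (int i = 0; i < PTRS_PER_PTE; i++)
		if (pmd->ptes[i].accessed)
			referenced++;
	return referenced;
}

int main(void)
{
	static struct pmd cold, hot = { .accessed = true };

	hot.ptes[0].accessed = true;
	hot.ptes[7].accessed = true;

	printf("cold PMD: %lu referenced pages found\n", scan_pmd(&cold));
	printf("hot PMD:  %lu referenced pages found\n", scan_pmd(&hot));
	return 0;
}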

I wrote a crude test to verify this, and it maps exactly one page
within each PTE table. And I found page table scanning didn't
underperform the rmap:

https://lore.kernel.org/linux-mm/yhful%2fddtiml4...@google.com/#t

The reason (sorry for repeating this) is page table scanning is conditional:

bool should_skip_mm()
{
	...
	/* leave the legwork to the rmap if mapped pages are too sparse */
	if (RSS < mm_pgtables_bytes(mm) / PAGE_SIZE)
		return true;

}

We fall back to the rmap when it's obviously not smart to do so. There
is still a lot of room for improvement in this function though, i.e.,
it should be per VMA and NUMA aware.

Note that page table scanning doesn't replace the existing rmap scan.
It's complementary, and it happens when there is a good chance that
most of the pages on a system under pressure have been referenced.
IOW, scanning them one by one with the rmap would cost more than
scanning them all at once via page tables.

Sounds reasonable?

Thanks.


Re: [PATCH v2 00/16] Multigenerational LRU Framework

2021-04-14 Thread Yu Zhao
On Wed, Apr 14, 2021 at 7:52 AM Rik van Riel  wrote:
>
> On Wed, 2021-04-14 at 16:27 +0800, Huang, Ying wrote:
> > Yu Zhao  writes:
> >
> > > On Wed, Apr 14, 2021 at 12:15 AM Huang, Ying 
> > > wrote:
> > > >
> > > NUMA Optimization
> > > -
> > > Support NUMA policies and per-node RSS counters.
> > >
> > > We only can move forward one step at a time. Fair?
> >
> > You don't need to implement that now definitely.  But we can discuss
> > the
> > possible solution now.
>
> That was my intention, too. I want to make sure we don't
> end up "painting ourselves into a corner" by moving in some
> direction we have no way to get out of.
>
> The patch set looks promising, but we need some plan to
> avoid the worst case behaviors that forced us into rmap
> based scanning initially.

Hi Rik,

By design, we voluntarily fall back to the rmap when page tables of a
process are too sparse. At the moment, we have

bool should_skip_mm()
{
	...
	/* leave the legwork to the rmap if mapped pages are too sparse */
	if (RSS < mm_pgtables_bytes(mm) / PAGE_SIZE)
		return true;

}

So yes, I agree we have more work to do in this direction, the
fallback should be per VMA and NUMA aware. Note that once the fallback
happens, it shares the same path with the existing implementation.

Probably I should have clarified that this patchset does not replace
the rmap with page table scanning. It conditionally uses page table
scanning when it thinks most of the pages on a system could have been
referenced, i.e., when it thinks walking the rmap would be less
efficient, based on generations.

It *unconditionally* walks the rmap to scan each of the pages it
eventually tries to evict, because scanning page tables for a small
batch of pages it wants to evict is too costly.

One of the simple ways to look at how the mixture of page table
scanning and the rmap works is:
  1) it scans page tables (but might fall back to the rmap) to
deactivate pages from the active list to the inactive list, when the
inactive list becomes empty
  2) it walks the rmap (not page table scanning) when it evicts
individual pages from the inactive list.
Does it make sense?
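
As a minimal sketch of that split (with hypothetical helper names, not the
patchset's actual functions):

#include <stdbool.h>
#include <stdio.h>

static bool inactive_list_empty = true;
static bool page_tables_too_sparse;	/* rss < size of the page tables */

static void scan_page_tables(void)   { puts("aging: scan page tables"); }
static void scan_with_rmap(void)     { puts("aging: fall back to the rmap"); }
static void evict_one_via_rmap(void) { puts("eviction: rmap walk for one page"); }

static void reclaim(void)
{
	/* 1) aging: refill the inactive list only when it runs empty */
	if (inactive_list_empty) {
		if (page_tables_too_sparse)
			scan_with_rmap();
		else
			scan_page_tables();
		inactive_list_empty = false;
	}

	/* 2) eviction: always per page via the rmap, sharing the path of the
	 * existing implementation */
	evict_one_via_rmap();
}

int main(void)
{
	reclaim();
	return 0;
}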

I fully agree "the mixture" is currently statistically decided, and it
must be made worst-case scenario proof.

> > Note that it's possible that only some processes are bound to some
> > NUMA
> > nodes, while other processes aren't bound.
>
> For workloads like PostgreSQL or Oracle, it is common
> to have maybe 70% of memory in a large shared memory
> segment, spread between all the NUMA nodes, and mapped
> into hundreds, if not thousands, of processes in the
> system.

I do plan to reach out to the PostgreSQL community and ask for help to
benchmark this patchset. Will keep everybody posted.

> Now imagine we have an 8 node system, and memory
> pressure in the DMA32 zone of node 0.
>
> How will the current VM behave?

At the moment, we don't plan to make the DMA32 zone reclaim a
priority. Rather, I'd suggest
  1) stay with the existing implementation
  2) boost the watermark for DMA32

> What will the virtual scanning need to do?

The high priority items are:

To-do List
==
KVM Optimization

Support shadow page table scanning.

NUMA Optimization
-
Support NUMA policies and per-node RSS counters.

We are just trying to focus our resources on the trending use cases. Reasonable?

> If we can come up with a solution to make virtual
> scanning scale for that kind of workload, great.

It won't be easy, but IMO nothing worth doing is easy :)

> If not ... if it turns out most of the benefits of
> the multigenerational LRU framework come from sorting
> the pages into multiple LRUs, and from being able
> to easily reclaim unmapped pages before having to
> scan mapped ones, could it be an idea to implement
> that first, independently from virtual scanning?

This option is on the table considering the possibilities
  1) there are unforeseeable problems we couldn't solve
  2) sorting pages alone has demonstrated its standalone value

I guess 2) alone will help people heavily using page cache. Google
isn't one of them though. Personally I'm neutral (at least trying to
be), and my goal is to accommodate everybody as best as I can.

> I am all for improving our page reclaim system, I just
> want to make sure we don't revisit the old traps that
> forced us where we are today :)

Yeah, I do see your concerns and we need more data. Any suggestions on
benchmarks you'd be interested in?

Thanks.


Re: [PATCH v2 00/16] Multigenerational LRU Framework

2021-04-14 Thread Johannes Weiner
Hello Yu,

On Tue, Apr 13, 2021 at 12:56:17AM -0600, Yu Zhao wrote:
> What's new in v2
> 
> Special thanks to Jens Axboe for reporting a regression in buffered
> I/O and helping test the fix.
> 
> This version includes the support of tiers, which represent levels of
> usage from file descriptors only. Pages accessed N times via file
> descriptors belong to tier order_base_2(N). Each generation contains
> at most MAX_NR_TIERS tiers, and they require additional MAX_NR_TIERS-2
> bits in page->flags. In contrast to moving across generations which
> requires the lru lock, moving across tiers only involves an atomic
> operation on page->flags and therefore has a negligible cost. A
> feedback loop modeled after the well-known PID controller monitors the
> refault rates across all tiers and decides when to activate pages from
> which tiers, on the reclaim path.
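
For reference, order_base_2(N) is ceil(log2(N)): pages read once through a
file descriptor land in tier 0, twice in tier 1, 3-4 times in tier 2, and
so on. A small standalone sketch of that mapping (the tier cap and helper
below are illustrative, not the patchset's code):

#include <stdio.h>

#define MAX_NR_TIERS 4	/* illustrative cap */

static unsigned int order_base_2(unsigned long n)
{
	unsigned int order = 0;

	while ((1UL << order) < n)
		order++;
	return order;	/* ceil(log2(n)); order_base_2(1) == 0 */
}

static unsigned int tier_of(unsigned long nr_fd_accesses)
{
	unsigned int tier = order_base_2(nr_fd_accesses);

	return tier < MAX_NR_TIERS - 1 ? tier : MAX_NR_TIERS - 1;
}

int main(void)
{
	for (unsigned long n = 1; n <= 8; n++)
		printf("accessed %lu time(s) via fd -> tier %u\n", n, tier_of(n));
	return 0;
}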

Could you elaborate a bit more on the difference between generations
and tiers?

A refault, a page table reference, or a buffered read through a file
descriptor ultimately all boil down to a memory access. The value of
having that memory resident and the cost of bringing it in from
backing storage should be the same regardless of how it's accessed by
userspace; and whether it's an in-memory reference or a non-resident
reference should have the same relative impact on the page's age.

With that context, I don't understand why file descriptor refs and
refaults get such special treatment. Could you shed some light here?

> This feedback model has a few advantages over the current feedforward
> model:
> 1) It has a negligible overhead in the buffered I/O access path
>because activations are done in the reclaim path.

This is useful if the workload isn't reclaim bound, but it can be
hazardous to defer work to reclaim, too.

If you go through the git history, there have been several patches to
soften access recognition inside reclaim because it can come with
large latencies when page reclaim kicks in after a longer period with
no memory pressure and doesn't have uptodate reference information -
to the point where eating a few extra IOs tend to add less latency to
the workload than waiting for reclaim to refresh its aging data.

Could you elaborate a bit more on the tradeoff here?

> Highlights from the discussions on v1
> =
> Thanks to Ying Huang and Dave Hansen for the comments and suggestions
> on page table scanning.
> 
> A simple worst-case scenario test did not find page table scanning
> underperforms the rmap because of the following optimizations:
> 1) It will not scan page tables from processes that have been sleeping
>since the last scan.
> 2) It will not scan PTE tables under non-leaf PMD entries that do not
>have the accessed bit set, when
>CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y.
> 3) It will not zigzag between the PGD table and the same PMD or PTE
>table spanning multiple VMAs. In other words, it finishes all the
>VMAs with the range of the same PMD or PTE table before it returns
>to the PGD table. This optimizes workloads that have large numbers
>of tiny VMAs, especially when CONFIG_PGTABLE_LEVELS=5.
> 
> TLDR
> 
> The current page reclaim is too expensive in terms of CPU usage and
> often making poor choices about what to evict. We would like to offer
> an alternative framework that is performant, versatile and
> straightforward.
> 
> Repo
> 
> git fetch https://linux-mm.googlesource.com/page-reclaim refs/changes/73/1173/1
> 
> Gerrit https://linux-mm-review.googlesource.com/c/page-reclaim/+/1173
> 
> Background
> ==
> DRAM is a major factor in total cost of ownership, and improving
> memory overcommit brings a high return on investment.

RAM cost on one hand.

On the other, paging backends have seen a revolutionary explosion in
iop/s capacity from solid state devices and CPUs that allow in-memory
compression at scale, so a higher rate of paging (semi-random IO) and
thus larger levels of overcommit are possible than ever before.

There is a lot of new opportunity here.

> Over the past decade of research and experimentation in memory
> overcommit, we observed a distinct trend across millions of servers
> and clients: the size of page cache has been decreasing because of
> the growing popularity of cloud storage. Nowadays anon pages account
> for more than 90% of our memory consumption and page cache contains
> mostly executable pages.

This gives the impression that because the number of setups heavily
using the page cache has reduced somewhat, its significance is waning
as well. I don't think that's true. I think we'll continue to have
mainstream workloads for which the page cache is significant.

Yes, the importance of paging anon memory more efficiently (or paging
it at all again, for that matter), has increased dramatically. But IMO
not because it's more prevalent, but rather because of the increase in
paging capacity from the hardware side. It's

Re: [PATCH v2 00/16] Multigenerational LRU Framework

2021-04-14 Thread Rik van Riel
On Wed, 2021-04-14 at 08:51 -0700, Andi Kleen wrote:
> >2) It will not scan PTE tables under non-leaf PMD entries that do not
> >   have the accessed bit set, when
> >   CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y.
> 
> This assumes  that workloads have reasonable locality. Could there
> be a worst case where only one or two pages in each PTE are used,
> so this PTE skipping trick doesn't work?

Databases with large shared memory segments shared between
many processes come to mind as a real-world example of a
worst case scenario.

-- 
All Rights Reversed.




Re: [page-reclaim] Re: [PATCH v2 00/16] Multigenerational LRU Framework

2021-04-14 Thread Shakeel Butt
On Wed, Apr 14, 2021 at 6:52 AM Rik van Riel  wrote:
>
> On Wed, 2021-04-14 at 16:27 +0800, Huang, Ying wrote:
> > Yu Zhao  writes:
> >
> > > On Wed, Apr 14, 2021 at 12:15 AM Huang, Ying 
> > > wrote:
> > > >
> > > NUMA Optimization
> > > -
> > > Support NUMA policies and per-node RSS counters.
> > >
> > > We only can move forward one step at a time. Fair?
> >
> > You don't need to implement that now definitely.  But we can discuss
> > the
> > possible solution now.
>
> That was my intention, too. I want to make sure we don't
> end up "painting ourselves into a corner" by moving in some
> direction we have no way to get out of.
>
> The patch set looks promising, but we need some plan to
> avoid the worst case behaviors that forced us into rmap
> based scanning initially.
>
> > Note that it's possible that only some processes are bound to some
> > NUMA
> > nodes, while other processes aren't bound.
>
> For workloads like PostgreSQL or Oracle, it is common
> to have maybe 70% of memory in a large shared memory
> segment, spread between all the NUMA nodes, and mapped
> into hundreds, if not thousands, of processes in the
> system.
>
> Now imagine we have an 8 node system, and memory
> pressure in the DMA32 zone of node 0.
>
> How will the current VM behave?
>
> What will the virtual scanning need to do?
>
> If we can come up with a solution to make virtual
> scanning scale for that kind of workload, great.
>
> If not ... if it turns out most of the benefits of
> the multigenerational LRU framework come from sorting
> the pages into multiple LRUs, and from being able
> to easily reclaim unmapped pages before having to
> scan mapped ones, could it be an idea to implement
> that first, independently from virtual scanning?
>
> I am all for improving our page reclaim system, I just
> want to make sure we don't revisit the old traps that
> forced us where we are today :)
>

One potential idea is to take a hybrid approach of rmap and virtual
scanning. If the number of pages that are targeted to be scanned is
below some threshold, do rmap, otherwise virtual scanning. I think we
can experimentally find a good value for that threshold.
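
A minimal sketch of that idea (the threshold and helper names are made up
for illustration; as noted, a real value would have to be found
experimentally):

#include <stdio.h>

#define VIRTUAL_SCAN_THRESHOLD (1UL << 16)	/* arbitrary: 64k pages */

static void rmap_scan(void)    { puts("scan via rmap"); }
static void virtual_scan(void) { puts("scan via page tables"); }

static void scan(unsigned long nr_to_scan)
{
	if (nr_to_scan < VIRTUAL_SCAN_THRESHOLD)
		rmap_scan();
	else
		virtual_scan();
}

int main(void)
{
	scan(1024);		/* small scan target -> rmap */
	scan(1UL << 20);	/* large scan target -> page table scan */
	return 0;
}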


Re: [PATCH v2 00/16] Multigenerational LRU Framework

2021-04-14 Thread Andi Kleen
> Now imagine we have an 8 node system, and memory
> pressure in the DMA32 zone of node 0.

The question is how much do we still care about DMA32.
If there are problems they can probably just turn on the IOMMU for
these IO mappings.

-Andi


Re: [PATCH v2 00/16] Multigenerational LRU Framework

2021-04-14 Thread Andi Kleen
>2) It will not scan PTE tables under non-leaf PMD entries that do not
>   have the accessed bit set, when
>   CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y.

This assumes  that workloads have reasonable locality. Could there
be a worst case where only one or two pages in each PTE are used,
so this PTE skipping trick doesn't work?

-Andi


Re: [PATCH v2 00/16] Multigenerational LRU Framework

2021-04-14 Thread Jens Axboe
On 4/13/21 5:14 PM, Dave Chinner wrote:
> On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote:
>> On 4/13/21 1:51 AM, SeongJae Park wrote:
>>> From: SeongJae Park 
>>>
>>> Hello,
>>>
>>>
>>> Very interesting work, thank you for sharing this :)
>>>
>>> On Tue, 13 Apr 2021 00:56:17 -0600 Yu Zhao  wrote:
>>>
 What's new in v2
 
 Special thanks to Jens Axboe for reporting a regression in buffered
 I/O and helping test the fix.
>>>
>>> Is the discussion open?  If so, could you please give me a link?
>>
>> I wasn't on the initial post (or any of the lists it was posted to), but
>> it's on the google page reclaim list. Not sure if that is public or not.
>>
>> tldr is that I was pretty excited about this work, as buffered IO tends
>> to suck (a lot) for high throughput applications. My test case was
>> pretty simple:
>>
>> Randomly read a fast device, using 4k buffered IO, and watch what
>> happens when the page cache gets filled up. For this particular test,
>> we'll initially be doing 2.1GB/sec of IO, and then drop to 1.5-1.6GB/sec
>> with kswapd using a lot of CPU trying to keep up. That's mainline
>> behavior.
> 
> I see this exact same behaviour here, too, but I RCA'd it to
> contention between the inode and memory reclaim for the mapping
> structure that indexes the page cache. Basically the mapping tree
> lock is the contention point here - you can either be adding pages
> to the mapping during IO, or memory reclaim can be removing pages
> from the mapping, but we can't do both at once.
> 
> So we end up with kswapd spinning on the mapping tree lock like so
> when doing 1.6GB/s in 4kB buffered IO:
> 
> -   20.06% 0.00%  [kernel]   [k] kswapd
>    - 20.06% kswapd
>       - 20.05% balance_pgdat
>          - 20.03% shrink_node
>             - 19.92% shrink_lruvec
>                - 19.91% shrink_inactive_list
>                   - 19.22% shrink_page_list
>                      - 17.51% __remove_mapping
>                         - 14.16% _raw_spin_lock_irqsave
>                            - 14.14% do_raw_spin_lock
>                                 __pv_queued_spin_lock_slowpath
>                         - 1.56% __delete_from_page_cache
>                              0.63% xas_store
>                         - 0.78% _raw_spin_unlock_irqrestore
>                            - 0.69% do_raw_spin_unlock
>                                 __raw_callee_save___pv_queued_spin_unlock
>                      - 0.82% free_unref_page_list
>                         - 0.72% free_unref_page_commit
>                              0.57% free_pcppages_bulk
> 
> And these are the processes consuming CPU:
> 
>5171 root  20   0 1442496   5696   1284 R  99.7   0.0   1:07.7

Re: [PATCH v2 00/16] Multigenerational LRU Framework

2021-04-14 Thread Rik van Riel
On Wed, 2021-04-14 at 16:27 +0800, Huang, Ying wrote:
> Yu Zhao  writes:
> 
> > On Wed, Apr 14, 2021 at 12:15 AM Huang, Ying 
> > wrote:
> > > 
> > NUMA Optimization
> > -
> > Support NUMA policies and per-node RSS counters.
> > 
> > We only can move forward one step at a time. Fair?
> 
> You don't need to implement that now definitely.  But we can discuss
> the
> possible solution now.

That was my intention, too. I want to make sure we don't
end up "painting ourselves into a corner" by moving in some
direction we have no way to get out of.

The patch set looks promising, but we need some plan to
avoid the worst case behaviors that forced us into rmap
based scanning initially.

> Note that it's possible that only some processes are bound to some
> NUMA
> nodes, while other processes aren't bound.

For workloads like PostgreSQL or Oracle, it is common
to have maybe 70% of memory in a large shared memory
segment, spread between all the NUMA nodes, and mapped
into hundreds, if not thousands, of processes in the
system.

Now imagine we have an 8 node system, and memory
pressure in the DMA32 zone of node 0.

How will the current VM behave?

What will the virtual scanning need to do?

If we can come up with a solution to make virtual
scanning scale for that kind of workload, great.

If not ... if it turns out most of the benefits of
the multigenerational LRU framework come from sorting
the pages into multiple LRUs, and from being able
to easily reclaim unmapped pages before having to
scan mapped ones, could it be an idea to implement
that first, independently from virtual scanning?

I am all for improving our page reclaim system, I just
want to make sure we don't revisit the old traps that
forced us where we are today :)

-- 
All Rights Reversed.




Re: [PATCH v2 00/16] Multigenerational LRU Framework

2021-04-14 Thread Yu Zhao
On Wed, Apr 14, 2021 at 01:16:52AM -0600, Yu Zhao wrote:
> On Tue, Apr 13, 2021 at 10:50 PM Dave Chinner  wrote:
> >
> > On Tue, Apr 13, 2021 at 09:40:12PM -0600, Yu Zhao wrote:
> > > On Tue, Apr 13, 2021 at 5:14 PM Dave Chinner  wrote:
> > > > On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote:
> > > > > On 4/13/21 1:51 AM, SeongJae Park wrote:
> > > > > > From: SeongJae Park 
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > >
> > > > > > Very interesting work, thank you for sharing this :)
> > > > > >
> > > > > > On Tue, 13 Apr 2021 00:56:17 -0600 Yu Zhao  
> > > > > > wrote:
> > > > > >
> > > > > >> What's new in v2
> > > > > >> 
> > > > > >> Special thanks to Jens Axboe for reporting a regression in buffered
> > > > > >> I/O and helping test the fix.
> > > > > >
> > > > > > Is the discussion open?  If so, could you please give me a link?
> > > > >
> > > > > I wasn't on the initial post (or any of the lists it was posted to), 
> > > > > but
> > > > > it's on the google page reclaim list. Not sure if that is public or 
> > > > > not.
> > > > >
> > > > > tldr is that I was pretty excited about this work, as buffered IO 
> > > > > tends
> > > > > to suck (a lot) for high throughput applications. My test case was
> > > > > pretty simple:
> > > > >
> > > > > Randomly read a fast device, using 4k buffered IO, and watch what
> > > > > happens when the page cache gets filled up. For this particular test,
> > > > > we'll initially be doing 2.1GB/sec of IO, and then drop to 
> > > > > 1.5-1.6GB/sec
> > > > > with kswapd using a lot of CPU trying to keep up. That's mainline
> > > > > behavior.
> > > >
> > > > I see this exact same behaviour here, too, but I RCA'd it to
> > > > contention between the inode and memory reclaim for the mapping
> > > > structure that indexes the page cache. Basically the mapping tree
> > > > lock is the contention point here - you can either be adding pages
> > > > to the mapping during IO, or memory reclaim can be removing pages
> > > > from the mapping, but we can't do both at once.
> > > >
> > > > So we end up with kswapd spinning on the mapping tree lock like so
> > > > when doing 1.6GB/s in 4kB buffered IO:
> > > >
> > > > -   20.06% 0.00%  [kernel]   [k] kswapd
> > > >    - 20.06% kswapd
> > > >       - 20.05% balance_pgdat
> > > >          - 20.03% shrink_node
> > > >             - 19.92% shrink_lruvec
> > > >                - 19.91% shrink_inactive_list
> > > >                   - 19.22% shrink_page_list
> > > >                      - 17.51% __remove_mapping
> > > >                         - 14.16% _raw_spin_lock_irqsave
> > > >                            - 14.14% do_raw_spin_lock
> > > >                                 __pv_queued_spin_lock_slowpath
> > > >                         - 1.56% __delete_from_page_cache
> > > >                              0.63% xas_store
> > > >                         - 0.78% _raw_spin_unlock_irqrestore
> > > >                            - 0.69% do_raw_spin_unlock

Re: [PATCH v2 00/16] Multigenerational LRU Framework

2021-04-14 Thread Huang, Ying
Yu Zhao  writes:

> On Wed, Apr 14, 2021 at 12:15 AM Huang, Ying  wrote:
>>
>> Yu Zhao  writes:
>>
>> > On Tue, Apr 13, 2021 at 8:30 PM Rik van Riel  wrote:
>> >>
>> >> On Wed, 2021-04-14 at 09:14 +1000, Dave Chinner wrote:
>> >> > On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote:
>> >> >
>> >> > > The initial posting of this patchset did no better, in fact it did
>> >> > > a bit
>> >> > > worse. Performance dropped to the same levels and kswapd was using
>> >> > > as
>> >> > > much CPU as before, but on top of that we also got excessive
>> >> > > swapping.
>> >> > > Not at a high rate, but 5-10MB/sec continually.
>> >> > >
>> >> > > I had some back and forths with Yu Zhao and tested a few new
>> >> > > revisions,
>> >> > > and the current series does much better in this regard. Performance
>> >> > > still dips a bit when page cache fills, but not nearly as much, and
>> >> > > kswapd is using less CPU than before.
>> >> >
>> >> > Profiles would be interesting, because it sounds to me like reclaim
>> >> > *might* be batching page cache removal better (e.g. fewer, larger
>> >> > batches) and so spending less time contending on the mapping tree
>> >> > lock...
>> >> >
>> >> > IOWs, I suspect this result might actually be a result of less lock
>> >> > contention due to a change in batch processing characteristics of
>> >> > the new algorithm rather than it being a "better" algorithm...
>> >>
>> >> That seems quite likely to me, given the issues we have
>> >> had with virtual scan reclaim algorithms in the past.
>> >
>> > Hi Rik,
>> >
>> > Let me paste the code so we can move beyond the "batching" hypothesis:
>> >
>> > static int __remove_mapping(struct address_space *mapping, struct page *page,
>> > 			    bool reclaimed, struct mem_cgroup *target_memcg)
>> > {
>> > 	unsigned long flags;
>> > 	int refcount;
>> > 	void *shadow = NULL;
>> >
>> > 	BUG_ON(!PageLocked(page));
>> > 	BUG_ON(mapping != page_mapping(page));
>> >
>> > 	xa_lock_irqsave(&mapping->i_pages, flags);
>> >
>> >> SeongJae, what is this algorithm supposed to do when faced
>> >> with situations like this:
>> >
>> > I'll assume the questions were directed at me, not SeongJae.
>> >
>> >> 1) Running on a system with 8 NUMA nodes, and memory
>> >>pressure in one of those nodes.
>> >> 2) Running PostgreSQL or Oracle, with hundreds of
>> >>processes mapping the same (very large) shared
>> >>memory segment.
>> >>
>> >> How do you keep your algorithm from falling into the worst
>> >> case virtual scanning scenarios that were crippling the
>> >> 2.4 kernel 15+ years ago on systems with just a few GB of
>> >> memory?
>> >
>> > There is a fundamental shift: that time we were scanning for cold pages,
>> > and nowadays we are scanning for hot pages.
>> >
>> > I'd be surprised if scanning for cold pages didn't fall apart, because it'd
>> > find most of the entries accessed, if they are present at all.
>> >
>> > Scanning for hot pages, on the other hand, is way better. Let me just
>> > reiterate:
>> > 1) It will not scan page tables from processes that have been sleeping
>> >since the last scan.
>> > 2) It will not scan PTE tables under non-leaf PMD entries that do not
>> >have the accessed bit set, when
>> >CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y.
>> > 3) It will not zigzag between the PGD table and the same PMD or PTE
>> >table spanning multiple VMAs. In other words, it finishes all the
>> >VMAs with the range of the same PMD or PTE table before it returns
>> >to the PGD table. This optimizes workloads that have large numbers
>> >of tiny VMAs, especially when CONFIG_PGTABLE_LEVELS=5.
>> >
>> > So the cost is roughly proportional to the number of referenced pages it
>> > discovers. If there is no memory pressure, no scanning at all. For a system
>> > under heavy memory pressure, most of the pages are referenced (otherwise
>> > why would it be under memory pressure?), and if we use the rmap, we need to
>> > scan a lot of pages anyway. Why not just scan them all?
>>
>> This may not be the case.  For rmap scanning, it's possible to scan only
>> a small portion of memory.  But with the page table scanning, you need
>> to scan almost all (I understand you have some optimization as above).
>
> Hi Ying,
>
> Let's take a step back.
>
> For the sake of discussion, when does the scanning have to happen? Can
> we agree that the simplest answer is when we have evicted all inactive
> pages?
>
> If so, my next question is who's filled in the memory space previously
> occupied by those inactive pages? Newly faulted in pages, right? They
> have the accessed bit set, and we can't evict them without scanning
> them first, would you agree?
>
> And there are also existing active pages, and they were protected from
> eviction. But now we need to deactivate some of them. Do you think
> whether they'd have been used or not since the last scan? (Remember
> they w

Re: [PATCH v2 00/16] Multigenerational LRU Framework

2021-04-14 Thread Yu Zhao
On Wed, Apr 14, 2021 at 12:15 AM Huang, Ying  wrote:
>
> Yu Zhao  writes:
>
> > On Tue, Apr 13, 2021 at 8:30 PM Rik van Riel  wrote:
> >>
> >> On Wed, 2021-04-14 at 09:14 +1000, Dave Chinner wrote:
> >> > On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote:
> >> >
> >> > > The initial posting of this patchset did no better, in fact it did
> >> > > a bit
> >> > > worse. Performance dropped to the same levels and kswapd was using
> >> > > as
> >> > > much CPU as before, but on top of that we also got excessive
> >> > > swapping.
> >> > > Not at a high rate, but 5-10MB/sec continually.
> >> > >
> >> > > I had some back and forths with Yu Zhao and tested a few new
> >> > > revisions,
> >> > > and the current series does much better in this regard. Performance
> >> > > still dips a bit when page cache fills, but not nearly as much, and
> >> > > kswapd is using less CPU than before.
> >> >
> >> > Profiles would be interesting, because it sounds to me like reclaim
> >> > *might* be batching page cache removal better (e.g. fewer, larger
> >> > batches) and so spending less time contending on the mapping tree
> >> > lock...
> >> >
> >> > IOWs, I suspect this result might actually be a result of less lock
> >> > contention due to a change in batch processing characteristics of
> >> > the new algorithm rather than it being a "better" algorithm...
> >>
> >> That seems quite likely to me, given the issues we have
> >> had with virtual scan reclaim algorithms in the past.
> >
> > Hi Rik,
> >
> > Let me paste the code so we can move beyond the "batching" hypothesis:
> >
> > static int __remove_mapping(struct address_space *mapping, struct page *page,
> > 			    bool reclaimed, struct mem_cgroup *target_memcg)
> > {
> > 	unsigned long flags;
> > 	int refcount;
> > 	void *shadow = NULL;
> >
> > 	BUG_ON(!PageLocked(page));
> > 	BUG_ON(mapping != page_mapping(page));
> >
> > 	xa_lock_irqsave(&mapping->i_pages, flags);
> >
> >> SeongJae, what is this algorithm supposed to do when faced
> >> with situations like this:
> >
> > I'll assume the questions were directed at me, not SeongJae.
> >
> >> 1) Running on a system with 8 NUMA nodes, and memory
> >>pressure in one of those nodes.
> >> 2) Running PostgreSQL or Oracle, with hundreds of
> >>processes mapping the same (very large) shared
> >>memory segment.
> >>
> >> How do you keep your algorithm from falling into the worst
> >> case virtual scanning scenarios that were crippling the
> >> 2.4 kernel 15+ years ago on systems with just a few GB of
> >> memory?
> >
> > There is a fundamental shift: that time we were scanning for cold pages,
> > and nowadays we are scanning for hot pages.
> >
> > I'd be surprised if scanning for cold pages didn't fall apart, because it'd
> > find most of the entries accessed, if they are present at all.
> >
> > Scanning for hot pages, on the other hand, is way better. Let me just
> > reiterate:
> > 1) It will not scan page tables from processes that have been sleeping
> >since the last scan.
> > 2) It will not scan PTE tables under non-leaf PMD entries that do not
> >have the accessed bit set, when
> >CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y.
> > 3) It will not zigzag between the PGD table and the same PMD or PTE
> >table spanning multiple VMAs. In other words, it finishes all the
> >VMAs with the range of the same PMD or PTE table before it returns
> >to the PGD table. This optimizes workloads that have large numbers
> >of tiny VMAs, especially when CONFIG_PGTABLE_LEVELS=5.
> >
> > So the cost is roughly proportional to the number of referenced pages it
> > discovers. If there is no memory pressure, no scanning at all. For a system
> > under heavy memory pressure, most of the pages are referenced (otherwise
> > why would it be under memory pressure?), and if we use the rmap, we need to
> > scan a lot of pages anyway. Why not just scan them all?
>
> This may not be the case.  For rmap scanning, it's possible to scan only
> a small portion of memory.  But with the page table scanning, you need
> to scan almost all (I understand you have some optimization as above).

Hi Ying,

Let's take a step back.

For the sake of discussion, when does the scanning have to happen? Can
we agree that the simplest answer is when we have evicted all inactive
pages?

If so, my next question is who's filled in the memory space previously
occupied by those inactive pages? Newly faulted in pages, right? They
have the accessed bit set, and we can't evict them without scanning
them first, would you agree?

And there are also existing active pages, and they were protected from
eviction. But now we need to deactivate some of them. Do you think
whether they'd have been used or not since the last scan? (Remember
they were active.)

You mentioned "a small portion" and "almost all". How do you interpret
them in terms of these steps?

Intuitively, "a small portion" and "a

Re: [PATCH v2 00/16] Multigenerational LRU Framework

2021-04-13 Thread Huang, Ying
Yu Zhao  writes:

> On Tue, Apr 13, 2021 at 8:30 PM Rik van Riel  wrote:
>>
>> On Wed, 2021-04-14 at 09:14 +1000, Dave Chinner wrote:
>> > On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote:
>> >
>> > > The initial posting of this patchset did no better, in fact it did
>> > > a bit
>> > > worse. Performance dropped to the same levels and kswapd was using
>> > > as
>> > > much CPU as before, but on top of that we also got excessive
>> > > swapping.
>> > > Not at a high rate, but 5-10MB/sec continually.
>> > >
>> > > I had some back and forths with Yu Zhao and tested a few new
>> > > revisions,
>> > > and the current series does much better in this regard. Performance
>> > > still dips a bit when page cache fills, but not nearly as much, and
>> > > kswapd is using less CPU than before.
>> >
>> > Profiles would be interesting, because it sounds to me like reclaim
>> > *might* be batching page cache removal better (e.g. fewer, larger
>> > batches) and so spending less time contending on the mapping tree
>> > lock...
>> >
>> > IOWs, I suspect this result might actually be a result of less lock
>> > contention due to a change in batch processing characteristics of
>> > the new algorithm rather than it being a "better" algorithm...
>>
>> That seems quite likely to me, given the issues we have
>> had with virtual scan reclaim algorithms in the past.
>
> Hi Rik,
>
> Let me paste the code so we can move beyond the "batching" hypothesis:
>
> static int __remove_mapping(struct address_space *mapping, struct page *page,
> 			    bool reclaimed, struct mem_cgroup *target_memcg)
> {
> 	unsigned long flags;
> 	int refcount;
> 	void *shadow = NULL;
>
> 	BUG_ON(!PageLocked(page));
> 	BUG_ON(mapping != page_mapping(page));
>
> 	xa_lock_irqsave(&mapping->i_pages, flags);
>
>> SeongJae, what is this algorithm supposed to do when faced
>> with situations like this:
>
> I'll assume the questions were directed at me, not SeongJae.
>
>> 1) Running on a system with 8 NUMA nodes, and memory
>>pressure in one of those nodes.
>> 2) Running PostgreSQL or Oracle, with hundreds of
>>processes mapping the same (very large) shared
>>memory segment.
>>
>> How do you keep your algorithm from falling into the worst
>> case virtual scanning scenarios that were crippling the
>> 2.4 kernel 15+ years ago on systems with just a few GB of
>> memory?
>
> There is a fundamental shift: that time we were scanning for cold pages,
> and nowadays we are scanning for hot pages.
>
> I'd be surprised if scanning for cold pages didn't fall apart, because it'd
> find most of the entries accessed, if they are present at all.
>
> Scanning for hot pages, on the other hand, is way better. Let me just
> reiterate:
> 1) It will not scan page tables from processes that have been sleeping
>since the last scan.
> 2) It will not scan PTE tables under non-leaf PMD entries that do not
>have the accessed bit set, when
>CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y.
> 3) It will not zigzag between the PGD table and the same PMD or PTE
>table spanning multiple VMAs. In other words, it finishes all the
>VMAs with the range of the same PMD or PTE table before it returns
>to the PGD table. This optimizes workloads that have large numbers
>of tiny VMAs, especially when CONFIG_PGTABLE_LEVELS=5.
>
> So the cost is roughly proportional to the number of referenced pages it
> discovers. If there is no memory pressure, no scanning at all. For a system
> under heavy memory pressure, most of the pages are referenced (otherwise
> why would it be under memory pressure?), and if we use the rmap, we need to
> scan a lot of pages anyway. Why not just scan them all?

This may not be the case.  For rmap scanning, it's possible to scan only
a small portion of memory.  But with page table scanning, you need to
scan almost all of it (I understand you have some optimizations, as
above).  As Rik showed in the test case above, there may be memory
pressure on only one of 8 NUMA nodes (because of NUMA binding?).  Then
rmap scanning only needs to scan pages on that node, while page table
scanning may need to scan pages on other nodes too.
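
A stand-alone sketch of that asymmetry may help: reclaim driven by per-node
LRU lists only touches pages on the node under pressure, while a walk of one
process's page tables visits its pages on every node. All structures and
numbers below are made up for illustration and are not from the patchset.

/*
 * Per-node LRU scan vs per-process page table scan (illustrative only).
 */
#include <stdio.h>

#define NR_NODES	8
#define PAGES_PER_NODE	1000
#define NR_PAGES	(NR_NODES * PAGES_PER_NODE)

static int node_of[NR_PAGES];	/* NUMA node each of the process's pages lives on */

int main(void)
{
	int pressured_node = 3;	/* only this node is under memory pressure */
	long lru_visited = 0, vscan_visited = 0;

	/* spread the process's pages round-robin across all nodes */
	for (int i = 0; i < NR_PAGES; i++)
		node_of[i] = i % NR_NODES;

	/* per-node LRU/rmap scan: only the pressured node's list is walked */
	for (int i = 0; i < NR_PAGES; i++)
		if (node_of[i] == pressured_node)
			lru_visited++;

	/* page table scan: walks the process's address space, all nodes */
	for (int i = 0; i < NR_PAGES; i++)
		vscan_visited++;

	printf("LRU scan visits        %ld pages (node %d only)\n",
	       lru_visited, pressured_node);
	printf("page table scan visits %ld pages (all %d nodes)\n",
	       vscan_visited, NR_NODES);
	return 0;
}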

Best Regards,
Huang, Ying

> This way you save a
> lot because of batching (now it's time to talk about batching). Besides,
> page tables have far better memory locality than the rmap. For the shared
> memory example you gave, the rmap needs to lock *each* page it scans. How
> many 4KB pages does your large file have? I'll leave the math to you.
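
To make that math concrete, here is a back-of-the-envelope sketch. The 100 GB
segment size and the 400 mapping processes are assumed numbers, not figures
from the thread; the point is only the scale of per-page work the rmap does.

/*
 * Rough cost of an rmap walk over a large shared segment (assumed sizes).
 */
#include <stdio.h>

int main(void)
{
	unsigned long long seg_bytes  = 100ULL << 30;	/* assumed 100 GB segment */
	unsigned long long page_size  = 4096;		/* 4 KB pages */
	unsigned long long nr_mappers = 400;		/* assumed processes mapping it */

	unsigned long long nr_pages = seg_bytes / page_size;

	/*
	 * The rmap walk handles pages one by one: lock the page, then walk
	 * the interval tree of VMAs mapping it to find and test each PTE.
	 */
	printf("4 KB pages in segment:            %llu\n", nr_pages);
	printf("rmap: page locks taken:           %llu\n", nr_pages);
	printf("rmap: interval-tree walks:        %llu\n", nr_pages);
	printf("rmap: PTEs tested (pages x maps): %llu\n", nr_pages * nr_mappers);

	/*
	 * A page table walk instead reads each process's PTEs sequentially,
	 * table by table, with no per-page locking.
	 */
	printf("walk: sequential PTE reads:       %llu\n", nr_pages * nr_mappers);
	return 0;
}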
>
> Here are some profiles:
>
> zram with the rmap (mainline)
>   31.03%  page_vma_mapped_walk
>   25.59%  lzo1x_1_do_compress
>4.63%  do_raw_spin_lock
>3.89%  vma_interval_tree_iter_next
>3.33%  vma_interval_tree_subtree_search
>
> zram with page table scanning (this patchset)
>   49.36%  lzo1x_1_do_compress
>4.54%  page_vma_mapped_walk
>4.45%  memset_erms
>3.47%  walk_pte_range
>2.88%  zram_bvec_r

Re: [PATCH v2 00/16] Multigenerational LRU Framework

2021-04-13 Thread Dave Chinner
On Tue, Apr 13, 2021 at 09:40:12PM -0600, Yu Zhao wrote:
> On Tue, Apr 13, 2021 at 5:14 PM Dave Chinner  wrote:
> > On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote:
> > > On 4/13/21 1:51 AM, SeongJae Park wrote:
> > > > From: SeongJae Park 
> > > >
> > > > Hello,
> > > >
> > > >
> > > > Very interesting work, thank you for sharing this :)
> > > >
> > > > On Tue, 13 Apr 2021 00:56:17 -0600 Yu Zhao  wrote:
> > > >
> > > >> What's new in v2
> > > >> 
> > > >> Special thanks to Jens Axboe for reporting a regression in buffered
> > > >> I/O and helping test the fix.
> > > >
> > > > Is the discussion open?  If so, could you please give me a link?
> > >
> > > I wasn't on the initial post (or any of the lists it was posted to), but
> > > it's on the google page reclaim list. Not sure if that is public or not.
> > >
> > > tldr is that I was pretty excited about this work, as buffered IO tends
> > > to suck (a lot) for high throughput applications. My test case was
> > > pretty simple:
> > >
> > > Randomly read a fast device, using 4k buffered IO, and watch what
> > > happens when the page cache gets filled up. For this particular test,
> > > we'll initially be doing 2.1GB/sec of IO, and then drop to 1.5-1.6GB/sec
> > > with kswapd using a lot of CPU trying to keep up. That's mainline
> > > behavior.
> >
> > I see this exact same behaviour here, too, but I RCA'd it to
> > contention between the inode and memory reclaim for the mapping
> > structure that indexes the page cache. Basically the mapping tree
> > lock is the contention point here - you can either be adding pages
> > to the mapping during IO, or memory reclaim can be removing pages
> > from the mapping, but we can't do both at once.
> >
> > So we end up with kswapd spinning on the mapping tree lock like so
> > when doing 1.6GB/s in 4kB buffered IO:
> >
> > -   20.06%     0.00%  [kernel]  [k] kswapd
> >    - 20.06% kswapd
> >       - 20.05% balance_pgdat
> >          - 20.03% shrink_node
> >             - 19.92% shrink_lruvec
> >                - 19.91% shrink_inactive_list
> >                   - 19.22% shrink_page_list
> >                      - 17.51% __remove_mapping
> >                         - 14.16% _raw_spin_lock_irqsave
> >                            - 14.14% do_raw_spin_lock
> >                                 __pv_queued_spin_lock_slowpath
> >                         - 1.56% __delete_from_page_cache
> >                              0.63% xas_store
> >                         - 0.78% _raw_spin_unlock_irqrestore
> >                            - 0.69% do_raw_spin_unlock
> >                                 __raw_callee_save___pv_queued_spin_unlock
> >                      - 0.82% free_unref_page_list
> >                         - 0.72% free_unref_page_commit

Re: [PATCH v2 00/16] Multigenerational LRU Framework

2021-04-13 Thread Yu Zhao
On Tue, Apr 13, 2021 at 5:14 PM Dave Chinner  wrote:
>
> On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote:
> > On 4/13/21 1:51 AM, SeongJae Park wrote:
> > > From: SeongJae Park 
> > >
> > > Hello,
> > >
> > >
> > > Very interesting work, thank you for sharing this :)
> > >
> > > On Tue, 13 Apr 2021 00:56:17 -0600 Yu Zhao  wrote:
> > >
> > >> What's new in v2
> > >> 
> > >> Special thanks to Jens Axboe for reporting a regression in buffered
> > >> I/O and helping test the fix.
> > >
> > > Is the discussion open?  If so, could you please give me a link?
> >
> > I wasn't on the initial post (or any of the lists it was posted to), but
> > it's on the google page reclaim list. Not sure if that is public or not.
> >
> > tldr is that I was pretty excited about this work, as buffered IO tends
> > to suck (a lot) for high throughput applications. My test case was
> > pretty simple:
> >
> > Randomly read a fast device, using 4k buffered IO, and watch what
> > happens when the page cache gets filled up. For this particular test,
> > we'll initially be doing 2.1GB/sec of IO, and then drop to 1.5-1.6GB/sec
> > with kswapd using a lot of CPU trying to keep up. That's mainline
> > behavior.
>
> I see this exact same behaviour here, too, but I RCA'd it to
> contention between the inode and memory reclaim for the mapping
> structure that indexes the page cache. Basically the mapping tree
> lock is the contention point here - you can either be adding pages
> to the mapping during IO, or memory reclaim can be removing pages
> from the mapping, but we can't do both at once.
>
> So we end up with kswapd spinning on the mapping tree lock like so
> when doing 1.6GB/s in 4kB buffered IO:
>
> -   20.06%     0.00%  [kernel]  [k] kswapd
>    - 20.06% kswapd
>       - 20.05% balance_pgdat
>          - 20.03% shrink_node
>             - 19.92% shrink_lruvec
>                - 19.91% shrink_inactive_list
>                   - 19.22% shrink_page_list
>                      - 17.51% __remove_mapping
>                         - 14.16% _raw_spin_lock_irqsave
>                            - 14.14% do_raw_spin_lock
>                                 __pv_queued_spin_lock_slowpath
>                         - 1.56% __delete_from_page_cache
>                              0.63% xas_store
>                         - 0.78% _raw_spin_unlock_irqrestore
>                            - 0.69% do_raw_spin_unlock
>                                 __raw_callee_save___pv_queued_spin_unlock
>                      - 0.82% free_unref_page_list
>                         - 0.72% free_unref_page_commit
>                              0.57% free_pcppages_bulk
>
> And these are the processes consuming CPU:
>
>5171 root   

Re: [PATCH v2 00/16] Multigenerational LRU Framework

2021-04-13 Thread Rik van Riel
On Wed, 2021-04-14 at 09:14 +1000, Dave Chinner wrote:
> On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote:
> 
> > The initial posting of this patchset did no better, in fact it did
> > a bit
> > worse. Performance dropped to the same levels and kswapd was using
> > as
> > much CPU as before, but on top of that we also got excessive
> > swapping.
> > Not at a high rate, but 5-10MB/sec continually.
> > 
> > I had some back and forths with Yu Zhao and tested a few new
> > revisions,
> > and the current series does much better in this regard. Performance
> > still dips a bit when page cache fills, but not nearly as much, and
> > kswapd is using less CPU than before.
> 
> Profiles would be interesting, because it sounds to me like reclaim
> *might* be batching page cache removal better (e.g. fewer, larger
> batches) and so spending less time contending on the mapping tree
> lock...
> 
> IOWs, I suspect this result might actually be a result of less lock
> contention due to a change in batch processing characteristics of
> the new algorithm rather than it being a "better" algorithm...

That seems quite likely to me, given the issues we have
had with virtual scan reclaim algorithms in the past.

SeongJae, what is this algorithm supposed to do when faced
with situations like this:
1) Running on a system with 8 NUMA nodes, and
   memory pressure in one of those nodes.
2) Running PostgreSQL or Oracle, with hundreds of
   processes mapping the same (very large) shared
   memory segment.

How do you keep your algorithm from falling into the worst
case virtual scanning scenarios that were crippling the
2.4 kernel 15+ years ago on systems with just a few GB of
memory?

-- 
All Rights Reversed.




Re: [PATCH v2 00/16] Multigenerational LRU Framework

2021-04-13 Thread Dave Chinner
On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote:
> On 4/13/21 1:51 AM, SeongJae Park wrote:
> > From: SeongJae Park 
> > 
> > Hello,
> > 
> > 
> > Very interesting work, thank you for sharing this :)
> > 
> > On Tue, 13 Apr 2021 00:56:17 -0600 Yu Zhao  wrote:
> > 
> >> What's new in v2
> >> 
> >> Special thanks to Jens Axboe for reporting a regression in buffered
> >> I/O and helping test the fix.
> > 
> > Is the discussion open?  If so, could you please give me a link?
> 
> I wasn't on the initial post (or any of the lists it was posted to), but
> it's on the google page reclaim list. Not sure if that is public or not.
> 
> tldr is that I was pretty excited about this work, as buffered IO tends
> to suck (a lot) for high throughput applications. My test case was
> pretty simple:
> 
> Randomly read a fast device, using 4k buffered IO, and watch what
> happens when the page cache gets filled up. For this particular test,
> we'll initially be doing 2.1GB/sec of IO, and then drop to 1.5-1.6GB/sec
> with kswapd using a lot of CPU trying to keep up. That's mainline
> behavior.

I see this exact same behaviour here, too, but I RCA'd it to
contention between the inode and memory reclaim for the mapping
structure that indexes the page cache. Basically the mapping tree
lock is the contention point here - you can either be adding pages
to the mapping during IO, or memory reclaim can be removing pages
from the mapping, but we can't do both at once.

So we end up with kswapd spinning on the mapping tree lock like so
when doing 1.6GB/s in 4kB buffered IO:

-   20.06%     0.00%  [kernel]  [k] kswapd
   - 20.06% kswapd
      - 20.05% balance_pgdat
         - 20.03% shrink_node
            - 19.92% shrink_lruvec
               - 19.91% shrink_inactive_list
                  - 19.22% shrink_page_list
                     - 17.51% __remove_mapping
                        - 14.16% _raw_spin_lock_irqsave
                           - 14.14% do_raw_spin_lock
                                __pv_queued_spin_lock_slowpath
                        - 1.56% __delete_from_page_cache
                             0.63% xas_store
                        - 0.78% _raw_spin_unlock_irqrestore
                           - 0.69% do_raw_spin_unlock
                                __raw_callee_save___pv_queued_spin_unlock
                     - 0.82% free_unref_page_list
                        - 0.72% free_unref_page_commit
                             0.57% free_pcppages_bulk

And these are the processes consuming CPU:

   5171 root  20   0 1442496   5696   1284 R  99.7   0.0   1:07.78 fio
   1150 root  20   0   0  0  0 S  47.4   0.0   0:22.70 kswapd1
   1146 root  20   0   0  0  0 S  44.0   0.0   0:21.85 kswapd0
   1152 root  20   0   0  0  0
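
The serialization described above can be modelled in user space: one thread
stands in for the buffered-IO path adding pages to the mapping, another for
kswapd removing them, and a single mutex stands in for the xarray lock on
i_pages. This is only a conceptual sketch (build with -pthread); the names and
loop counts are arbitrary, but it shows why the two sides cannot make progress
at the same time on one mapping.

/*
 * Insertion and reclaim contending on one per-mapping lock (illustrative).
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t i_pages_lock = PTHREAD_MUTEX_INITIALIZER;
static long cached_pages;	/* pages currently in the "mapping" */

static void *buffered_reader(void *arg)	/* fio-like reader filling the cache */
{
	(void)arg;
	for (long i = 0; i < 1000000; i++) {
		pthread_mutex_lock(&i_pages_lock);	/* add-to-page-cache side */
		cached_pages++;
		pthread_mutex_unlock(&i_pages_lock);
	}
	return NULL;
}

static void *reclaimer(void *arg)	/* kswapd-like reclaim emptying it */
{
	(void)arg;
	for (long i = 0; i < 1000000; i++) {
		pthread_mutex_lock(&i_pages_lock);	/* remove-from-mapping side */
		if (cached_pages > 0)
			cached_pages--;
		pthread_mutex_unlock(&i_pages_lock);
	}
	return NULL;
}

int main(void)
{
	pthread_t reader, reclaim;

	pthread_create(&reader, NULL, buffered_reader, NULL);
	pthread_create(&reclaim, NULL, reclaimer, NULL);
	pthread_join(reader, NULL);
	pthread_join(reclaim, NULL);

	printf("pages left in the mapping: %ld\n", cached_pages);
	return 0;
}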

Re: [PATCH v2 00/16] Multigenerational LRU Framework

2021-04-13 Thread SeongJae Park
From: SeongJae Park 

On Tue, 13 Apr 2021 10:13:24 -0600 Jens Axboe  wrote:

> On 4/13/21 1:51 AM, SeongJae Park wrote:
> > From: SeongJae Park 
> > 
> > Hello,
> > 
> > 
> > Very interesting work, thank you for sharing this :)
> > 
> > On Tue, 13 Apr 2021 00:56:17 -0600 Yu Zhao  wrote:
> > 
> >> What's new in v2
> >> 
> >> Special thanks to Jens Axboe for reporting a regression in buffered
> >> I/O and helping test the fix.
> > 
> > Is the discussion open?  If so, could you please give me a link?
> 
> I wasn't on the initial post (or any of the lists it was posted to), but
> it's on the google page reclaim list. Not sure if that is public or not.
> 
> tldr is that I was pretty excited about this work, as buffered IO tends
> to suck (a lot) for high throughput applications. My test case was
> pretty simple:
> 
> Randomly read a fast device, using 4k buffered IO, and watch what
> happens when the page cache gets filled up. For this particular test,
> we'll initially be doing 2.1GB/sec of IO, and then drop to 1.5-1.6GB/sec
> with kswapd using a lot of CPU trying to keep up. That's mainline
> behavior.
> 
> The initial posting of this patchset did no better, in fact it did a bit
> worse. Performance dropped to the same levels and kswapd was using as
> much CPU as before, but on top of that we also got excessive swapping.
> Not at a high rate, but 5-10MB/sec continually.
> 
> I had some back and forths with Yu Zhao and tested a few new revisions,
> and the current series does much better in this regard. Performance
> still dips a bit when page cache fills, but not nearly as much, and
> kswapd is using less CPU than before.
> 
> Hope that helps,

Appreciate this kind and detailed explanation, Jens!

So, my understanding is that v2 of this patchset improved the performance by
using frequency (tier) in addition to recency (generation number) for buffered
I/O pages.  That makes sense to me.  If I'm misunderstanding, please let me
know.


Thanks,
SeongJae Park

> -- 
> Jens Axboe
> 


Re: [PATCH v2 00/16] Multigenerational LRU Framework

2021-04-13 Thread Jens Axboe
On 4/13/21 1:51 AM, SeongJae Park wrote:
> From: SeongJae Park 
> 
> Hello,
> 
> 
> Very interesting work, thank you for sharing this :)
> 
> On Tue, 13 Apr 2021 00:56:17 -0600 Yu Zhao  wrote:
> 
>> What's new in v2
>> 
>> Special thanks to Jens Axboe for reporting a regression in buffered
>> I/O and helping test the fix.
> 
> Is the discussion open?  If so, could you please give me a link?

I wasn't on the initial post (or any of the lists it was posted to), but
it's on the google page reclaim list. Not sure if that is public or not.

tldr is that I was pretty excited about this work, as buffered IO tends
to suck (a lot) for high throughput applications. My test case was
pretty simple:

Randomly read a fast device, using 4k buffered IO, and watch what
happens when the page cache gets filled up. For this particular test,
we'll initially be doing 2.1GB/sec of IO, and then drop to 1.5-1.6GB/sec
with kswapd using a lot of CPU trying to keep up. That's mainline
behavior.

The initial posting of this patchset did no better, in fact it did a bit
worse. Performance dropped to the same levels and kswapd was using as
much CPU as before, but on top of that we also got excessive swapping.
Not at a high rate, but 5-10MB/sec continually.

I had some back and forths with Yu Zhao and tested a few new revisions,
and the current series does much better in this regard. Performance
still dips a bit when page cache fills, but not nearly as much, and
kswapd is using less CPU than before.

Hope that helps,
-- 
Jens Axboe



Re: [PATCH v2 00/16] Multigenerational LRU Framework

2021-04-13 Thread SeongJae Park
From: SeongJae Park 

Hello,


Very interesting work, thank you for sharing this :)

On Tue, 13 Apr 2021 00:56:17 -0600 Yu Zhao  wrote:

> What's new in v2
> 
> Special thanks to Jens Axboe for reporting a regression in buffered
> I/O and helping test the fix.

Is the discussion open?  If so, could you please give me a link?

> 
> This version includes the support of tiers, which represent levels of
> usage from file descriptors only. Pages accessed N times via file
> descriptors belong to tier order_base_2(N). Each generation contains
> at most MAX_NR_TIERS tiers, and they require additional MAX_NR_TIERS-2
> bits in page->flags. In contrast to moving across generations which
> requires the lru lock, moving across tiers only involves an atomic
> operation on page->flags and therefore has a negligible cost. A
> feedback loop modeled after the well-known PID controller monitors the
> refault rates across all tiers and decides when to activate pages from
> which tiers, on the reclaim path.
> 
> This feedback model has a few advantages over the current feedforward
> model:
> 1) It has a negligible overhead in the buffered I/O access path
>because activations are done in the reclaim path.
> 2) It takes mapped pages into account and avoids overprotecting pages
>accessed multiple times via file descriptors.
> 3) More tiers offer better protection to pages accessed more than
>twice when buffered-I/O-intensive workloads are under memory
>pressure.
> 
> The fio/io_uring benchmark shows 14% improvement in IOPS when randomly
> accessing Samsung PM981a in the buffered I/O mode.

Improvement under memory pressure, right?  How much pressure?

[...]
> 
> Differential scans via page tables
> ----------------------------------
> Each differential scan discovers all pages that have been referenced
> since the last scan. Specifically, it walks the mm_struct list
> associated with an lruvec to scan page tables of processes that have
> been scheduled since the last scan.

Does this mean it scans only the virtual address spaces of processes, and therefore
pages in the page cache that are not mmap()-ed will not be scanned?

> The cost of each differential scan
> is roughly proportional to the number of referenced pages it
> discovers. Unless address spaces are extremely sparse, page tables
> usually have better memory locality than the rmap. The end result is
> generally a significant reduction in CPU usage, for workloads using a
> large amount of anon memory.

When and how frequently does it scan?


Thanks,
SeongJae Park

[...]


[PATCH v2 00/16] Multigenerational LRU Framework

2021-04-12 Thread Yu Zhao
What's new in v2

Special thanks to Jens Axboe for reporting a regression in buffered
I/O and helping test the fix.

This version includes the support of tiers, which represent levels of
usage from file descriptors only. Pages accessed N times via file
descriptors belong to tier order_base_2(N). Each generation contains
at most MAX_NR_TIERS tiers, and they require additional MAX_NR_TIERS-2
bits in page->flags. In contrast to moving across generations which
requires the lru lock, moving across tiers only involves an atomic
operation on page->flags and therefore has a negligible cost. A
feedback loop modeled after the well-known PID controller monitors the
refault rates across all tiers and decides when to activate pages from
which tiers, on the reclaim path.
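
As a quick illustration of the tier arithmetic above, the sketch below
computes order_base_2(N) for a page accessed N times via file descriptors and
caps it at the last tier. MAX_NR_TIERS is set to 4 here purely for
illustration; the cover letter does not state its value.

/*
 * Tier index from the number of file-descriptor accesses (illustrative).
 */
#include <stdio.h>

#define MAX_NR_TIERS	4	/* assumed, for illustration */

/* order_base_2(n): ceil(log2(n)), with order_base_2(1) == 0 */
static unsigned int order_base_2(unsigned long n)
{
	unsigned int order = 0;

	while ((1UL << order) < n)
		order++;
	return order;
}

static unsigned int page_tier(unsigned long nr_accesses)
{
	unsigned int tier = order_base_2(nr_accesses);

	return tier < MAX_NR_TIERS ? tier : MAX_NR_TIERS - 1;
}

int main(void)
{
	for (unsigned long n = 1; n <= 10; n++)
		printf("accessed %2lu times -> tier %u\n", n, page_tier(n));
	return 0;
}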

This feedback model has a few advantages over the current feedforward
model:
1) It has a negligible overhead in the buffered I/O access path
   because activations are done in the reclaim path.
2) It takes mapped pages into account and avoids overprotecting pages
   accessed multiple times via file descriptors.
3) More tiers offer better protection to pages accessed more than
   twice when buffered-I/O-intensive workloads are under memory
   pressure.
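
The cover letter models the activation decision as a PID controller over
per-tier refault rates. The sketch below is a much cruder, purely proportional
stand-in: it compares each tier's refault rate with tier 0's and suggests
activation when the rate is noticeably higher. The structure, names and the
1.5x threshold are assumptions, not the patchset's actual logic.

/*
 * Refault-rate comparison across tiers (illustrative stand-in only).
 */
#include <stdio.h>

#define NR_TIERS	4

struct tier_stats {
	unsigned long evicted;		/* pages evicted from this tier */
	unsigned long refaulted;	/* evicted pages that came back */
};

/* should pages from this tier be activated rather than evicted? */
static int tier_should_activate(const struct tier_stats *t,
				const struct tier_stats *base)
{
	if (!t->evicted || !base->evicted)
		return 0;
	/* refault rate noticeably above the baseline tier's rate? */
	return 2 * t->refaulted * base->evicted >
	       3 * base->refaulted * t->evicted;	/* i.e. > 1.5x */
}

int main(void)
{
	struct tier_stats tiers[NR_TIERS] = {
		{ 1000, 100 },	/* tier 0: 10% refault rate (baseline) */
		{  800, 100 },	/* tier 1: 12.5% */
		{  400, 120 },	/* tier 2: 30% */
		{  200,  90 },	/* tier 3: 45% */
	};

	for (int i = 1; i < NR_TIERS; i++)
		printf("tier %d: %s\n", i,
		       tier_should_activate(&tiers[i], &tiers[0]) ?
		       "activate on reclaim" : "evict");
	return 0;
}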

The fio/io_uring benchmark shows 14% improvement in IOPS when randomly
accessing Samsung PM981a in the buffered I/O mode.

Highlights from the discussions on v1
=====================================
Thanks to Ying Huang and Dave Hansen for the comments and suggestions
on page table scanning.

A simple worst-case scenario test did not find page table scanning
underperforms the rmap because of the following optimizations:
1) It will not scan page tables from processes that have been sleeping
   since the last scan.
2) It will not scan PTE tables under non-leaf PMD entries that do not
   have the accessed bit set, when
   CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y.
3) It will not zigzag between the PGD table and the same PMD or PTE
   table spanning multiple VMAs. In other words, it finishes all the
   VMAs with the range of the same PMD or PTE table before it returns
   to the PGD table. This optimizes workloads that have large numbers
   of tiny VMAs, especially when CONFIG_PGTABLE_LEVELS=5.
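
Optimization 2 above can be pictured with a toy two-level walk: when a
non-leaf entry's accessed bit is clear, the entire table below it is skipped.
The layout and flag bits here are simplified stand-ins, not the kernel's real
page-table format.

/*
 * Skipping a "PTE table" under a non-young "PMD" entry (illustrative).
 */
#include <stdio.h>
#include <stdlib.h>

#define PTRS_PER_PMD	8
#define PTRS_PER_PTE	512
#define FLAG_YOUNG	0x1	/* stand-in for the hardware accessed bit */

struct pte_table {
	unsigned int pte[PTRS_PER_PTE];		/* leaf entries */
};

struct pmd_entry {
	unsigned int flags;			/* non-leaf accessed bit lives here */
	struct pte_table *table;
};

int main(void)
{
	struct pmd_entry pmd[PTRS_PER_PMD];
	long visited = 0, skipped_tables = 0;

	/* back every PMD entry with a PTE table, mark only two of them young */
	for (int i = 0; i < PTRS_PER_PMD; i++) {
		pmd[i].flags = 0;
		pmd[i].table = calloc(1, sizeof(struct pte_table));
		if (i == 1 || i == 6) {
			pmd[i].flags |= FLAG_YOUNG;
			pmd[i].table->pte[42] |= FLAG_YOUNG;
		}
	}

	for (int i = 0; i < PTRS_PER_PMD; i++) {
		if (!(pmd[i].flags & FLAG_YOUNG)) {
			skipped_tables++;	/* nothing below was accessed */
			continue;
		}
		for (int j = 0; j < PTRS_PER_PTE; j++) {
			visited++;
			if (pmd[i].table->pte[j] & FLAG_YOUNG)
				pmd[i].table->pte[j] &= ~FLAG_YOUNG;	/* record and clear */
		}
	}

	printf("PTE entries visited: %ld, PTE tables skipped: %ld\n",
	       visited, skipped_tables);

	for (int i = 0; i < PTRS_PER_PMD; i++)
		free(pmd[i].table);
	return 0;
}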

TLDR

The current page reclaim is too expensive in terms of CPU usage and
often making poor choices about what to evict. We would like to offer
an alternative framework that is performant, versatile and
straightforward.

Repo

git fetch https://linux-mm.googlesource.com/page-reclaim refs/changes/73/1173/1

Gerrit https://linux-mm-review.googlesource.com/c/page-reclaim/+/1173

Background
==========
DRAM is a major factor in total cost of ownership, and improving
memory overcommit brings a high return on investment. Over the past
decade of research and experimentation in memory overcommit, we
observed a distinct trend across millions of servers and clients: the
size of page cache has been decreasing because of the growing
popularity of cloud storage. Nowadays anon pages account for more than
90% of our memory consumption and page cache contains mostly
executable pages.

Problems

Notion of active/inactive
-------------------------
For servers equipped with hundreds of gigabytes of memory, the
granularity of the active/inactive is too coarse to be useful for job
scheduling. False active/inactive rates are relatively high, and thus
the assumed savings may not materialize.

For phones and laptops, executable pages are frequently evicted
despite the fact that there are many less recently used anon pages.
Major faults on executable pages cause "janks" (slow UI renderings)
and negatively impact user experience.

For lruvecs from different memcgs or nodes, comparisons are impossible
due to the lack of a common frame of reference.

Incremental scans via rmap
--------------------------
Each incremental scan picks up at where the last scan left off and
stops after it has found a handful of unreferenced pages. For
workloads using a large amount of anon memory, incremental scans lose
the advantage under sustained memory pressure due to high ratios of
the number of scanned pages to the number of reclaimed pages. In our
case, the average ratio of pgscan to pgsteal is above 7.

On top of that, the rmap has poor memory locality due to its complex
data structures. The combined effects typically result in a high
amount of CPU usage in the reclaim path. For example, with zram, a
typical kswapd profile on v5.11 looks like:
  31.03%  page_vma_mapped_walk
  25.59%  lzo1x_1_do_compress
   4.63%  do_raw_spin_lock
   3.89%  vma_interval_tree_iter_next
   3.33%  vma_interval_tree_subtree_search

And with real swap, it looks like:
  45.16%  page_vma_mapped_walk
   7.61%  do_raw_spin_lock
   5.69%  vma_interval_tree_iter_next
   4.91%  vma_interval_tree_subtree_search
   3.71%  page_referenced_one

Solutions
=========
Notion of generation numbers

The notion of generation numbers introduces a q