Re: [RFC PATCH v1 00/13] lru_lock scalability

2018-02-13 Thread Daniel Jordan

On 02/08/2018 06:36 PM, Andrew Morton wrote:

On Wed, 31 Jan 2018 18:04:00 -0500 daniel.m.jor...@oracle.com wrote:


lru_lock, a per-node* spinlock that protects an LRU list, is one of the
hottest locks in the kernel.  On some workloads on large machines, it
shows up at the top of lock_stat.


Do you have details on which callsites are causing the problem?  That
would permit us to consider other approaches, perhaps.


Sure, there are two paths where we're seeing contention.

In the first one, a pagevec's worth of anonymous pages are added to 
various LRUs when the per-cpu pagevec fills up:


  /* take an anonymous page fault, eventually end up at... */
  handle_pte_fault
do_anonymous_page
  lru_cache_add_active_or_unevictable
lru_cache_add
  __lru_cache_add
__pagevec_lru_add
  pagevec_lru_move_fn
/* contend on lru_lock */
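
To make that concrete, here's the kind of userspace loop that drives
this path (an illustrative sketch, not the benchmark we actually ran);
each process write-faults its own anonymous mapping, so the per-cpu
pagevecs repeatedly fill and drain under lru_lock:

  /* fault.c: build with gcc -O2, run several copies concurrently */
  #include <stddef.h>
  #include <sys/mman.h>

  int main(void)
  {
          const size_t sz = 1UL << 30;    /* 1 GiB anonymous mapping */
          char *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          if (p == MAP_FAILED)
                  return 1;

          /* the first write to each page takes an anonymous page fault */
          for (size_t off = 0; off < sz; off += 4096)
                  p[off] = 1;

          return 0;
  }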


In the second, one or more pages are removed from an LRU under one hold 
of lru_lock:


  // userland calls munmap or exit, eventually end up at...
  zap_pte_range
__tlb_remove_page // returns true because we eventually hit
  // MAX_GATHER_BATCH_COUNT in tlb_next_batch
tlb_flush_mmu_free
  free_pages_and_swap_cache
release_pages
  /* contend on lru_lock */
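
The same sketch exercises this second path too if the mapping is torn
down explicitly at the end (process exit takes the same route through
zap_pte_range), e.g. a hypothetical addition after the faulting loop:

  /* unmapping the faulted region frees its pages in batches */
  munmap(p, sz);        /* zap_pte_range -> ... -> release_pages */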


For broader context, we've run decision support benchmarks in which
lru_lock (and zone->lock) show long wait times.  But we're not the only
ones seeing this, judging by comments already in the kernel source:


mm/vmscan.c:
 * zone_lru_lock is heavily contended.  Some of the functions that
 * shrink the lists perform better by taking out a batch of pages
 * and working on them outside the LRU lock.
 *
 * For pagecache intensive workloads, this function is the hottest
 * spot in the kernel (apart from copy_*_user functions).
...
static unsigned long isolate_lru_pages(unsigned long nr_to_scan,


include/linux/mmzone.h:
 * zone->lock and the [pgdat->lru_lock] are two of the hottest locks in the kernel.
 * So add a wild amount of padding here to ensure that they fall into separate
 * cachelines. ...


Anyway, if you're seeing this lock in your workloads, I'm interested in 
hearing what you're running so we can get more real world data on this.


Re: [RFC PATCH v1 00/13] lru_lock scalability

2018-02-08 Thread Andrew Morton
On Wed, 31 Jan 2018 18:04:00 -0500 daniel.m.jor...@oracle.com wrote:

> lru_lock, a per-node* spinlock that protects an LRU list, is one of the
> hottest locks in the kernel.  On some workloads on large machines, it
> shows up at the top of lock_stat.

Do you have details on which callsites are causing the problem?  That
would permit us to consider other approaches, perhaps.



Re: [RFC PATCH v1 00/13] lru_lock scalability

2018-02-02 Thread Steven Whitehouse

Hi,


On 02/02/18 04:18, Daniel Jordan wrote:



On 02/01/2018 10:54 AM, Steven Whitehouse wrote:

Hi,


On 31/01/18 23:04, daniel.m.jor...@oracle.com wrote:

lru_lock, a per-node* spinlock that protects an LRU list, is one of the
hottest locks in the kernel.  On some workloads on large machines, it
shows up at the top of lock_stat.

One way to improve lru_lock scalability is to introduce an array of locks,
with each lock protecting certain batches of LRU pages.

 *ooo**ooo**ooo** ...
 |   ||   ||   ||
  \ batch 1 /  \ batch 2 /  \ batch 3 /

In this ASCII depiction of an LRU, a page is represented with either '*'
or 'o'.  An asterisk indicates a sentinel page, which is a page at the
edge of a batch.  An 'o' indicates a non-sentinel page.

To remove a non-sentinel LRU page, only one lock from the array is
required.  This allows multiple threads to remove pages from different
batches simultaneously.  A sentinel page requires lru_lock in addition to
a lock from the array.
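
In code, the delete side of this scheme looks roughly like the sketch
below (illustrative only: the array size and the field/helper names are
made up here, and the per-page batch id and sentinel bit are assumed to
be stored in struct page):

  #include <linux/list.h>
  #include <linux/mm_types.h>
  #include <linux/spinlock.h>

  #define NR_LRU_BATCH_LOCKS 32

  struct pgdat_sketch {
          spinlock_t lru_batch_locks[NR_LRU_BATCH_LOCKS];
          spinlock_t lru_lock;            /* still taken for sentinel pages */
  };

  static void lru_batch_del(struct pgdat_sketch *pgdat, struct page *page,
                            unsigned int batch_id, bool sentinel)
  {
          spinlock_t *batch;

          batch = &pgdat->lru_batch_locks[batch_id % NR_LRU_BATCH_LOCKS];

          if (sentinel)
                  /* A sentinel's neighbors may belong to adjacent batches,
                   * so lru_lock is needed as well; take it first so the
                   * lock ordering is always lru_lock -> batch lock. */
                  spin_lock(&pgdat->lru_lock);

          spin_lock(batch);
          list_del(&page->lru);           /* unlink within this batch */
          spin_unlock(batch);

          if (sentinel)
                  spin_unlock(&pgdat->lru_lock);
  }

Deletes in different batches can then proceed in parallel, while code
that needs the whole list stable can take lru_lock plus every lock in
the array (the lru_[un]lock_all APIs later in this series).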

Full performance numbers appear in the last patch in this series, but this
prototype allows a microbenchmark to do up to 28% more page faults per
second with 16 or more concurrent processes.

This work was developed in collaboration with Steve Sistare.

Note: This is an early prototype.  I'm submitting it now to support my
request to attend LSF/MM, as well as get early feedback on the idea.  Any
comments appreciated.


* lru_lock is actually per-memcg, but without memcgs in the picture it
   becomes per-node.

GFS2 has an lru list for glocks, which can be contended under certain
workloads. Work is still ongoing to figure out exactly why, but this
looks like it might be a good approach to that issue too. The main
purpose of GFS2's lru list is to allow shrinking of the glocks under
memory pressure via the gfs2_scan_glock_lru() function, and it looks
like this type of approach could be used there to improve the
scalability.


Glad to hear that this could help in gfs2 as well.

Hopefully struct gfs2_glock is less space constrained than struct page 
for storing the few bits of metadata that this approach requires.


Daniel

We obviously want to keep gfs2_glock small; within reason, though, yes,
we can add some additional fields as required. The use case is pretty
much a standard LRU list: items are added and removed, mostly at the
active end of the list, and the inactive end of the list is scanned
periodically by gfs2_scan_glock_lru().


Steve.



Re: [RFC PATCH v1 00/13] lru_lock scalability

2018-02-01 Thread Daniel Jordan



On 02/01/2018 10:54 AM, Steven Whitehouse wrote:

Hi,


On 31/01/18 23:04, daniel.m.jor...@oracle.com wrote:

lru_lock, a per-node* spinlock that protects an LRU list, is one of the
hottest locks in the kernel.  On some workloads on large machines, it
shows up at the top of lock_stat.

One way to improve lru_lock scalability is to introduce an array of locks,
with each lock protecting certain batches of LRU pages.

 *ooo**ooo**ooo** ...
 |   ||   ||   ||
  \ batch 1 /  \ batch 2 /  \ batch 3 /

In this ASCII depiction of an LRU, a page is represented with either '*'
or 'o'.  An asterisk indicates a sentinel page, which is a page at the
edge of a batch.  An 'o' indicates a non-sentinel page.

To remove a non-sentinel LRU page, only one lock from the array is
required.  This allows multiple threads to remove pages from different
batches simultaneously.  A sentinel page requires lru_lock in addition to
a lock from the array.

Full performance numbers appear in the last patch in this series, but this
prototype allows a microbenchmark to do up to 28% more page faults per
second with 16 or more concurrent processes.

This work was developed in collaboration with Steve Sistare.

Note: This is an early prototype.  I'm submitting it now to support my
request to attend LSF/MM, as well as get early feedback on the idea.  Any
comments appreciated.


* lru_lock is actually per-memcg, but without memcgs in the picture it
   becomes per-node.

GFS2 has an lru list for glocks, which can be contended under certain 
workloads. Work is still ongoing to figure out exactly why, but this looks like 
it might be a good approach to that issue too. The main purpose of GFS2's lru 
list is to allow shrinking of the glocks under memory pressure via the 
gfs2_scan_glock_lru() function, and it looks like this type of approach could 
be used there to improve the scalability.


Glad to hear that this could help in gfs2 as well.

Hopefully struct gfs2_glock is less space constrained than struct page for 
storing the few bits of metadata that this approach requires.

Daniel



Steve.



Aaron Lu (1):
   mm: add a percpu_pagelist_batch sysctl interface

Daniel Jordan (12):
   mm: allow compaction to be disabled
   mm: add lock array to pgdat and batch fields to struct page
   mm: introduce struct lru_list_head in lruvec to hold per-LRU batch
 info
   mm: add batching logic to add/delete/move API's
   mm: add lru_[un]lock_all APIs
   mm: convert to-be-refactored lru_lock callsites to lock-all API
   mm: temporarily convert lru_lock callsites to lock-all API
   mm: introduce add-only version of pagevec_lru_move_fn
   mm: add LRU batch lock API's
   mm: use lru_batch locking in release_pages
   mm: split up release_pages into non-sentinel and sentinel passes
   mm: splice local lists onto the front of the LRU

  include/linux/mm_inline.h | 209 +-
  include/linux/mm_types.h  |   5 ++
  include/linux/mmzone.h    |  25 +-
  kernel/sysctl.c           |   9 ++
  mm/Kconfig                |   1 -
  mm/huge_memory.c          |   6 +-
  mm/memcontrol.c           |   5 +-
  mm/mlock.c                |  11 +--
  mm/mmzone.c               |   7 +-
  mm/page_alloc.c           |  43 +-
  mm/page_idle.c            |   4 +-
  mm/swap.c                 | 208 -
  mm/vmscan.c               |  49 +--
  13 files changed, 500 insertions(+), 82 deletions(-)





Re: [RFC PATCH v1 00/13] lru_lock scalability

2018-02-01 Thread Steven Whitehouse

Hi,


On 31/01/18 23:04, daniel.m.jor...@oracle.com wrote:

lru_lock, a per-node* spinlock that protects an LRU list, is one of the
hottest locks in the kernel.  On some workloads on large machines, it
shows up at the top of lock_stat.

One way to improve lru_lock scalability is to introduce an array of locks,
with each lock protecting certain batches of LRU pages.

 *ooo**ooo**ooo** ...
 |   ||   ||   ||
  \ batch 1 /  \ batch 2 /  \ batch 3 /

In this ASCII depiction of an LRU, a page is represented with either '*'
or 'o'.  An asterisk indicates a sentinel page, which is a page at the
edge of a batch.  An 'o' indicates a non-sentinel page.

To remove a non-sentinel LRU page, only one lock from the array is
required.  This allows multiple threads to remove pages from different
batches simultaneously.  A sentinel page requires lru_lock in addition to
a lock from the array.

Full performance numbers appear in the last patch in this series, but this
prototype allows a microbenchmark to do up to 28% more page faults per
second with 16 or more concurrent processes.

This work was developed in collaboration with Steve Sistare.

Note: This is an early prototype.  I'm submitting it now to support my
request to attend LSF/MM, as well as get early feedback on the idea.  Any
comments appreciated.


* lru_lock is actually per-memcg, but without memcgs in the picture it
   becomes per-node.

GFS2 has an lru list for glocks, which can be contended under certain
workloads. Work is still ongoing to figure out exactly why, but this
looks like it might be a good approach to that issue too. The main
purpose of GFS2's lru list is to allow shrinking of the glocks under
memory pressure via the gfs2_scan_glock_lru() function, and it looks
like this type of approach could be used there to improve the scalability.


Steve.



Aaron Lu (1):
   mm: add a percpu_pagelist_batch sysctl interface

Daniel Jordan (12):
   mm: allow compaction to be disabled
   mm: add lock array to pgdat and batch fields to struct page
   mm: introduce struct lru_list_head in lruvec to hold per-LRU batch
 info
   mm: add batching logic to add/delete/move API's
   mm: add lru_[un]lock_all APIs
   mm: convert to-be-refactored lru_lock callsites to lock-all API
   mm: temporarily convert lru_lock callsites to lock-all API
   mm: introduce add-only version of pagevec_lru_move_fn
   mm: add LRU batch lock API's
   mm: use lru_batch locking in release_pages
   mm: split up release_pages into non-sentinel and sentinel passes
   mm: splice local lists onto the front of the LRU

  include/linux/mm_inline.h | 209 +-
  include/linux/mm_types.h  |   5 ++
  include/linux/mmzone.h    |  25 +-
  kernel/sysctl.c           |   9 ++
  mm/Kconfig                |   1 -
  mm/huge_memory.c          |   6 +-
  mm/memcontrol.c           |   5 +-
  mm/mlock.c                |  11 +--
  mm/mmzone.c               |   7 +-
  mm/page_alloc.c           |  43 +-
  mm/page_idle.c            |   4 +-
  mm/swap.c                 | 208 -
  mm/vmscan.c               |  49 +--
  13 files changed, 500 insertions(+), 82 deletions(-)