Re: [RFC PATCH v1 00/13] lru_lock scalability
On 02/08/2018 06:36 PM, Andrew Morton wrote:
> On Wed, 31 Jan 2018 18:04:00 -0500 daniel.m.jor...@oracle.com wrote:
>> lru_lock, a per-node* spinlock that protects an LRU list, is one of the
>> hottest locks in the kernel.  On some workloads on large machines, it
>> shows up at the top of lock_stat.
>
> Do you have details on which callsites are causing the problem?  That
> would permit us to consider other approaches, perhaps.

Sure, there are two paths where we're seeing contention.

In the first one, a pagevec's worth of anonymous pages are added to
various LRUs when the per-cpu pagevec fills up:

  /* take an anonymous page fault, eventually end up at... */
  handle_pte_fault
    do_anonymous_page
      lru_cache_add_active_or_unevictable
        lru_cache_add
          __lru_cache_add
            __pagevec_lru_add
              pagevec_lru_move_fn       /* contend on lru_lock */

In the second, one or more pages are removed from an LRU under one hold
of lru_lock:

  // userland calls munmap or exit, eventually end up at...
  zap_pte_range
    __tlb_remove_page           // returns true because we eventually hit
                                // MAX_GATHER_BATCH_COUNT in tlb_next_batch
  tlb_flush_mmu_free
    free_pages_and_swap_cache
      release_pages             /* contend on lru_lock */

For a broader context, we've run decision support benchmarks where
lru_lock (and zone->lock) show long wait times.  But we're not the only
ones according to certain kernel comments:

mm/vmscan.c:
 * zone_lru_lock is heavily contended.  Some of the functions that
 * shrink the lists perform better by taking out a batch of pages
 * and working on them outside the LRU lock.
 *
 * For pagecache intensive workloads, this function is the hottest
 * spot in the kernel (apart from copy_*_user functions).
 ...
static unsigned long isolate_lru_pages(unsigned long nr_to_scan,

include/linux/mmzone.h:
 * zone->lock and the [pgdat->lru_lock] are two of the hottest locks in the
 * kernel.  So add a wild amount of padding here to ensure that they fall
 * into separate cachelines.
 ...
Anyway, if you're seeing this lock in your workloads, I'm interested in hearing what you're running so we can get more real world data on this.
Re: [RFC PATCH v1 00/13] lru_lock scalability
On Wed, 31 Jan 2018 18:04:00 -0500 daniel.m.jor...@oracle.com wrote:

> lru_lock, a per-node* spinlock that protects an LRU list, is one of the
> hottest locks in the kernel.  On some workloads on large machines, it
> shows up at the top of lock_stat.

Do you have details on which callsites are causing the problem?  That
would permit us to consider other approaches, perhaps.
Re: [RFC PATCH v1 00/13] lru_lock scalability
Hi,

On 02/02/18 04:18, Daniel Jordan wrote:
> On 02/01/2018 10:54 AM, Steven Whitehouse wrote:
>> Hi,
>>
>> On 31/01/18 23:04, daniel.m.jor...@oracle.com wrote:
>>> lru_lock, a per-node* spinlock that protects an LRU list, is one of the
>>> hottest locks in the kernel.  On some workloads on large machines, it
>>> shows up at the top of lock_stat.
>>>
>>> One way to improve lru_lock scalability is to introduce an array of
>>> locks, with each lock protecting certain batches of LRU pages.
>>>
>>>         *ooo**ooo**ooo** ...
>>>         |   ||   ||   |
>>>          \ batch 1 / \ batch 2 / \ batch 3 /
>>>
>>> In this ASCII depiction of an LRU, a page is represented with either '*'
>>> or 'o'.  An asterisk indicates a sentinel page, which is a page at the
>>> edge of a batch.  An 'o' indicates a non-sentinel page.
>>>
>>> To remove a non-sentinel LRU page, only one lock from the array is
>>> required.  This allows multiple threads to remove pages from different
>>> batches simultaneously.  A sentinel page requires lru_lock in addition
>>> to a lock from the array.
>>>
>>> Full performance numbers appear in the last patch in this series, but
>>> this prototype allows a microbenchmark to do up to 28% more page faults
>>> per second with 16 or more concurrent processes.
>>>
>>> This work was developed in collaboration with Steve Sistare.
>>>
>>> Note: This is an early prototype.  I'm submitting it now to support my
>>> request to attend LSF/MM, as well as get early feedback on the idea.
>>> Any comments appreciated.
>>>
>>> * lru_lock is actually per-memcg, but without memcg's in the picture it
>>> becomes per-node.
>> GFS2 has an lru list for glocks, which can be contended under certain
>> workloads.  Work is still ongoing to figure out exactly why, but this
>> looks like it might be a good approach to that issue too.  The main
>> purpose of GFS2's lru list is to allow shrinking of the glocks under
>> memory pressure via the gfs2_scan_glock_lru() function, and it looks
>> like this type of approach could be used there to improve the
>> scalability,
> Glad to hear that this could help in gfs2 as well.
> Hopefully struct gfs2_glock is less space constrained than struct page
> for storing the few bits of metadata that this approach requires.
>
> Daniel
We obviously want to keep gfs2_glock small, but within reason we can add
some additional fields as required.  The use case is pretty much a
standard LRU list: items are added and removed, mostly at the active end
of the list, and the inactive end of the list is scanned periodically by
gfs2_scan_glock_lru().

Steve.
Re: [RFC PATCH v1 00/13] lru_lock scalability
On 02/01/2018 10:54 AM, Steven Whitehouse wrote:
> Hi,
>
> On 31/01/18 23:04, daniel.m.jor...@oracle.com wrote:
>> lru_lock, a per-node* spinlock that protects an LRU list, is one of the
>> hottest locks in the kernel.  On some workloads on large machines, it
>> shows up at the top of lock_stat.
>>
>> One way to improve lru_lock scalability is to introduce an array of
>> locks, with each lock protecting certain batches of LRU pages.
>>
>>         *ooo**ooo**ooo** ...
>>         |   ||   ||   |
>>          \ batch 1 / \ batch 2 / \ batch 3 /
>>
>> In this ASCII depiction of an LRU, a page is represented with either '*'
>> or 'o'.  An asterisk indicates a sentinel page, which is a page at the
>> edge of a batch.  An 'o' indicates a non-sentinel page.
>>
>> To remove a non-sentinel LRU page, only one lock from the array is
>> required.  This allows multiple threads to remove pages from different
>> batches simultaneously.  A sentinel page requires lru_lock in addition
>> to a lock from the array.
>>
>> Full performance numbers appear in the last patch in this series, but
>> this prototype allows a microbenchmark to do up to 28% more page faults
>> per second with 16 or more concurrent processes.
>>
>> This work was developed in collaboration with Steve Sistare.
>>
>> Note: This is an early prototype.  I'm submitting it now to support my
>> request to attend LSF/MM, as well as get early feedback on the idea.
>> Any comments appreciated.
>>
>> * lru_lock is actually per-memcg, but without memcg's in the picture it
>> becomes per-node.
> GFS2 has an lru list for glocks, which can be contended under certain
> workloads.  Work is still ongoing to figure out exactly why, but this
> looks like it might be a good approach to that issue too.  The main
> purpose of GFS2's lru list is to allow shrinking of the glocks under
> memory pressure via the gfs2_scan_glock_lru() function, and it looks
> like this type of approach could be used there to improve the
> scalability,
Glad to hear that this could help in gfs2 as well.
Hopefully struct gfs2_glock is less space constrained than struct page
for storing the few bits of metadata that this approach requires.

Daniel

> Steve.
>
>> Aaron Lu (1):
>>   mm: add a percpu_pagelist_batch sysctl interface
>>
>> Daniel Jordan (12):
>>   mm: allow compaction to be disabled
>>   mm: add lock array to pgdat and batch fields to struct page
>>   mm: introduce struct lru_list_head in lruvec to hold per-LRU batch info
>>   mm: add batching logic to add/delete/move API's
>>   mm: add lru_[un]lock_all APIs
>>   mm: convert to-be-refactored lru_lock callsites to lock-all API
>>   mm: temporarily convert lru_lock callsites to lock-all API
>>   mm: introduce add-only version of pagevec_lru_move_fn
>>   mm: add LRU batch lock API's
>>   mm: use lru_batch locking in release_pages
>>   mm: split up release_pages into non-sentinel and sentinel passes
>>   mm: splice local lists onto the front of the LRU
>>
>>  include/linux/mm_inline.h | 209 +-
>>  include/linux/mm_types.h  |   5 ++
>>  include/linux/mmzone.h    |  25 +-
>>  kernel/sysctl.c           |   9 ++
>>  mm/Kconfig                |   1 -
>>  mm/huge_memory.c          |   6 +-
>>  mm/memcontrol.c           |   5 +-
>>  mm/mlock.c                |  11 +--
>>  mm/mmzone.c               |   7 +-
>>  mm/page_alloc.c           |  43 +-
>>  mm/page_idle.c            |   4 +-
>>  mm/swap.c                 | 208 -
>>  mm/vmscan.c               |  49 +--
>>  13 files changed, 500 insertions(+), 82 deletions(-)
Re: [RFC PATCH v1 00/13] lru_lock scalability
Hi,

On 31/01/18 23:04, daniel.m.jor...@oracle.com wrote:
> lru_lock, a per-node* spinlock that protects an LRU list, is one of the
> hottest locks in the kernel.  On some workloads on large machines, it
> shows up at the top of lock_stat.
>
> One way to improve lru_lock scalability is to introduce an array of
> locks, with each lock protecting certain batches of LRU pages.
>
>         *ooo**ooo**ooo** ...
>         |   ||   ||   |
>          \ batch 1 / \ batch 2 / \ batch 3 /
>
> In this ASCII depiction of an LRU, a page is represented with either '*'
> or 'o'.  An asterisk indicates a sentinel page, which is a page at the
> edge of a batch.  An 'o' indicates a non-sentinel page.
>
> To remove a non-sentinel LRU page, only one lock from the array is
> required.  This allows multiple threads to remove pages from different
> batches simultaneously.  A sentinel page requires lru_lock in addition
> to a lock from the array.
>
> Full performance numbers appear in the last patch in this series, but
> this prototype allows a microbenchmark to do up to 28% more page faults
> per second with 16 or more concurrent processes.
>
> This work was developed in collaboration with Steve Sistare.
>
> Note: This is an early prototype.  I'm submitting it now to support my
> request to attend LSF/MM, as well as get early feedback on the idea.
> Any comments appreciated.
>
> * lru_lock is actually per-memcg, but without memcg's in the picture it
> becomes per-node.
GFS2 has an lru list for glocks, which can be contended under certain
workloads.  Work is still ongoing to figure out exactly why, but this
looks like it might be a good approach to that issue too.  The main
purpose of GFS2's lru list is to allow shrinking of the glocks under
memory pressure via the gfs2_scan_glock_lru() function, and it looks
like this type of approach could be used there to improve the
scalability,

Steve.
> Aaron Lu (1):
>   mm: add a percpu_pagelist_batch sysctl interface
>
> Daniel Jordan (12):
>   mm: allow compaction to be disabled
>   mm: add lock array to pgdat and batch fields to struct page
>   mm: introduce struct lru_list_head in lruvec to hold per-LRU batch info
>   mm: add batching logic to add/delete/move API's
>   mm: add lru_[un]lock_all APIs
>   mm: convert to-be-refactored lru_lock callsites to lock-all API
>   mm: temporarily convert lru_lock callsites to lock-all API
>   mm: introduce add-only version of pagevec_lru_move_fn
>   mm: add LRU batch lock API's
>   mm: use lru_batch locking in release_pages
>   mm: split up release_pages into non-sentinel and sentinel passes
>   mm: splice local lists onto the front of the LRU
>
>  include/linux/mm_inline.h | 209 +-
>  include/linux/mm_types.h  |   5 ++
>  include/linux/mmzone.h    |  25 +-
>  kernel/sysctl.c           |   9 ++
>  mm/Kconfig                |   1 -
>  mm/huge_memory.c          |   6 +-
>  mm/memcontrol.c           |   5 +-
>  mm/mlock.c                |  11 +--
>  mm/mmzone.c               |   7 +-
>  mm/page_alloc.c           |  43 +-
>  mm/page_idle.c            |   4 +-
>  mm/swap.c                 | 208 -
>  mm/vmscan.c               |  49 +--
>  13 files changed, 500 insertions(+), 82 deletions(-)