Re: [patch 00/19] VM pageout scalability improvements
On Fri, 11 Jan 2008 16:11:15 +0530 Balbir Singh <[EMAIL PROTECTED]> wrote:

> I've just started the patch series, the compile fails for me on a
> powerpc box. global_lru_pages() is defined under CONFIG_PM, but used
> elsewhere in mm/page-writeback.c. None of the global_lru_pages()
> parameters depend on CONFIG_PM. Here's a simple patch to fix it.

Thank you for the fix. I have applied it to my tree.

--
All rights reversed.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 00/19] VM pageout scalability improvements
* Rik van Riel <[EMAIL PROTECTED]> [2008-01-08 15:59:39]:

> Changelog:
> - merge memcontroller split LRU code into the main split LRU patch,
>   since it is not functionally different (it was split up only to help
>   people who had seen the last version of the patch series review it)

Hi, Rik,

I see a strange behaviour with this patchset. I have a program
(pagetest from Vaidy), that does the following

1. Can allocate different kinds of memory: mapped, malloc'ed or shared
2. Allocates and touches all the memory in a loop (2 times)

I mount the memory controller, limit it to 400M, run pagetest and ask
it to touch 1000M. Without this patchset everything runs fine, but
with this patchset installed, I immediately see

pagetest invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
Call Trace:
[c000e5aef400] [c000eb24] .show_stack+0x70/0x1bc (unreliable)
[c000e5aef4b0] [c00c] .oom_kill_process+0x80/0x260
[c000e5aef570] [c00bc498] .mem_cgroup_out_of_memory+0x6c/0x98
[c000e5aef610] [c00f2574] .mem_cgroup_charge_common+0x1e0/0x414
[c000e5aef6e0] [c00b852c] .add_to_page_cache+0x48/0x164
[c000e5aef780] [c00b8664] .add_to_page_cache_lru+0x1c/0x68
[c000e5aef810] [c012db50] .mpage_readpages+0xbc/0x15c
[c000e5aef940] [c018bdac] .ext3_readpages+0x28/0x40
[c000e5aef9c0] [c00c3978] .__do_page_cache_readahead+0x158/0x260
[c000e5aefa90] [c00bac44] .filemap_fault+0x18c/0x3d4
[c000e5aefb70] [c00cd510] .__do_fault+0xb0/0x588
[c000e5aefc80] [c05653cc] .do_page_fault+0x440/0x620
[c000e5aefe30] [c0005408] handle_page_fault+0x20/0x58
Mem-info:
Node 0 DMA per-cpu:
CPU0: hi: 6, btch: 1 usd: 4
CPU1: hi: 6, btch: 1 usd: 0
CPU2: hi: 6, btch: 1 usd: 3
CPU3: hi: 6, btch: 1 usd: 4
active_anon:9099 active_file:1523 inactive_anon:0 inactive_file:2869
noreclaim:0 dirty:20 writeback:0 unstable:0 free:44210 slab:639
mapped:1724 pagetables:475 bounce:0
Node 0 DMA free:2829440kB min:7808kB low:9728kB high:11712kB
active_anon:582336kB inactive_anon:0kB active_file:97472kB
inactive_file:183616kB noreclaim:0kB present:3813760kB pages_scanned:0
all_unreclaimable? no
lowmem_reserve[]: 0 0 0
Node 0 DMA: 3*64kB 5*128kB 5*256kB 4*512kB 2*1024kB 4*2048kB 3*4096kB
2*8192kB 170*16384kB = 2828352kB
Swap cache: add 0, delete 0, find 0/0
Free swap = 3148608kB
Total swap = 3148608kB
Free swap:       3148608kB
59648 pages of RAM
677 reserved pages
28165 pages shared
0 pages swap cached
Memory cgroup out of memory: kill process 6593 (pagetest) score 1003
or a child
Killed process 6593 (pagetest)

I am using a powerpc box with 64K size pages. I'll try and investigate
further, just a heads up on the failure I am seeing.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
Re: [patch 00/19] VM pageout scalability improvements
* Rik van Riel <[EMAIL PROTECTED]> [2008-01-08 15:59:39]:

> On large memory systems, the VM can spend way too much time scanning
> through pages that it cannot (or should not) evict from memory. Not
> only does it use up CPU time, but it also provokes lock contention
> and can leave large systems under memory pressure in a catatonic state.
>
> Against 2.6.24-rc6-mm1
>
> This patch series improves VM scalability by:
>
> 1) making the locking a little more scalable
>
> 2) putting filesystem backed, swap backed and non-reclaimable pages
>    onto their own LRUs, so the system only scans the pages that it
>    can/should evict from memory
>
> 3) switching to SEQ replacement for the anonymous LRUs, so the
>    number of pages that need to be scanned when the system
>    starts swapping is bound to a reasonable number
>
> More info on the overall design can be found at:
>
>   http://linux-mm.org/PageReplacementDesign
>
> Changelog:
> - merge memcontroller split LRU code into the main split LRU patch,
>   since it is not functionally different (it was split up only to help
>   people who had seen the last version of the patch series review it)
> - drop the page_file_cache debugging patch, since it never triggered
> - reintroduce code to not scan anon list if swap is full
> - add code to scan anon list if page cache is very small already
> - use lumpy reclaim more aggressively for smaller order > 1 allocations

Hi, Rik,

I've just started the patch series; the compile fails for me on a
powerpc box. global_lru_pages() is defined under CONFIG_PM, but used
elsewhere in mm/page-writeback.c. None of the global_lru_pages()
parameters depend on CONFIG_PM. Here's a simple patch to fix it.
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b14e188..39e6aef 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1920,6 +1920,14 @@ void wakeup_kswapd(struct zone *zone, int order)
 	wake_up_interruptible(&pgdat->kswapd_wait);
 }
 
+unsigned long global_lru_pages(void)
+{
+	return global_page_state(NR_ACTIVE_ANON)
+		+ global_page_state(NR_ACTIVE_FILE)
+		+ global_page_state(NR_INACTIVE_ANON)
+		+ global_page_state(NR_INACTIVE_FILE);
+}
+
 #ifdef CONFIG_PM
 /*
  * Helper function for shrink_all_memory(). Tries to reclaim 'nr_pages' pages
@@ -1968,14 +1976,6 @@ static unsigned long shrink_all_zones(unsigned long nr_pages, int prio,
 	return ret;
 }
 
-unsigned long global_lru_pages(void)
-{
-	return global_page_state(NR_ACTIVE_ANON)
-		+ global_page_state(NR_ACTIVE_FILE)
-		+ global_page_state(NR_INACTIVE_ANON)
-		+ global_page_state(NR_INACTIVE_FILE);
-}
-
 /*
  * Try to free `nr_pages' of memory, system-wide, and return the number of
  * freed pages.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
Re: [patch 00/19] VM pageout scalability improvements
On Jan 10, 2008 10:41 AM, Rik van Riel <[EMAIL PROTECTED]> wrote:
> On Wed, 9 Jan 2008 23:39:02 -0500
> "Mike Snitzer" <[EMAIL PROTECTED]> wrote:
>
> > How much trouble am I asking for if I were to try to get your patchset
> > to fly on a fairly recent "stable" kernel (e.g. 2.6.22.15)? If
> > workable, is such an effort before it's time relative to your TODO?
>
> Quite a bit :)
>
> The -mm kernel has the memory controller code, which means the
> mm/ directory is fairly different. My patch set sits on top
> of that.
>
> Chances are that once the -mm kernel goes upstream (in 2.6.25-rc1),
> I can start building on top of that.
>
> OTOH, maybe I could get my patch series onto a recent 2.6.23.X with
> minimal chainsaw effort.

That would be great! I can't speak for others, but -mm poses a problem
for testing your patchset because it is so bleeding edge. Let me know
if you take the plunge on a 2.6.23.x backport; I'd really appreciate it.

Is anyone else interested in consuming a 2.6.23.x backport of Rik's
patchset? If so, please speak up.

> > I see that you have an old port to a FC7-based 2.6.21 here:
> > http://people.redhat.com/riel/vmsplit/
> >
> > Also, do you have a public git repo that you regularly publish to for
> > this patchset? If not a git repo, do you put the raw patchset on some
> > http/ftp server?
>
> Up to now I have only emailed out the patches. Since there is demand
> for them to be downloadable from somewhere, I'll also start putting
> them on http://people.redhat.com/riel/

Great, thanks.

Mike
Re: [patch 00/19] VM pageout scalability improvements
On Wed, 9 Jan 2008 23:39:02 -0500
"Mike Snitzer" <[EMAIL PROTECTED]> wrote:

> How much trouble am I asking for if I were to try to get your patchset
> to fly on a fairly recent "stable" kernel (e.g. 2.6.22.15)? If
> workable, is such an effort before it's time relative to your TODO?

Quite a bit :)

The -mm kernel has the memory controller code, which means the
mm/ directory is fairly different. My patch set sits on top
of that.

Chances are that once the -mm kernel goes upstream (in 2.6.25-rc1),
I can start building on top of that.

OTOH, maybe I could get my patch series onto a recent 2.6.23.X with
minimal chainsaw effort.

> I see that you have an old port to a FC7-based 2.6.21 here:
> http://people.redhat.com/riel/vmsplit/
>
> Also, do you have a public git repo that you regularly publish to for
> this patchset? If not a git repo, do you put the raw patchset on some
> http/ftp server?

Up to now I have only emailed out the patches. Since there is demand
for them to be downloadable from somewhere, I'll also start putting
them on http://people.redhat.com/riel/

--
All rights reversed.
Re: [patch 00/19] VM pageout scalability improvements
On Jan 8, 2008 3:59 PM, Rik van Riel <[EMAIL PROTECTED]> wrote:
> On large memory systems, the VM can spend way too much time scanning
> through pages that it cannot (or should not) evict from memory. Not
> only does it use up CPU time, but it also provokes lock contention
> and can leave large systems under memory pressure in a catatonic state.
>
> Against 2.6.24-rc6-mm1

Hi Rik,

How much trouble am I asking for if I were to try to get your patchset
to fly on a fairly recent "stable" kernel (e.g. 2.6.22.15)? If
workable, is such an effort before it's time relative to your TODO?

I see that you have an old port to a FC7-based 2.6.21 here:
http://people.redhat.com/riel/vmsplit/

Also, do you have a public git repo that you regularly publish to for
this patchset? If not a git repo, do you put the raw patchset on some
http/ftp server?

thanks,
Mike
[patch 00/19] VM pageout scalability improvements
On large memory systems, the VM can spend way too much time scanning
through pages that it cannot (or should not) evict from memory. Not
only does it use up CPU time, but it also provokes lock contention
and can leave large systems under memory pressure in a catatonic state.

Against 2.6.24-rc6-mm1

This patch series improves VM scalability by:

1) making the locking a little more scalable

2) putting filesystem backed, swap backed and non-reclaimable pages
   onto their own LRUs, so the system only scans the pages that it
   can/should evict from memory

3) switching to SEQ replacement for the anonymous LRUs, so the
   number of pages that need to be scanned when the system
   starts swapping is bound to a reasonable number

More info on the overall design can be found at:

  http://linux-mm.org/PageReplacementDesign

Changelog:
- merge memcontroller split LRU code into the main split LRU patch,
  since it is not functionally different (it was split up only to help
  people who had seen the last version of the patch series review it)
- drop the page_file_cache debugging patch, since it never triggered
- reintroduce code to not scan anon list if swap is full
- add code to scan anon list if page cache is very small already
- use lumpy reclaim more aggressively for smaller order > 1 allocations

--
All Rights Reversed
Re: [patch 00/19] VM pageout scalability improvements
On Mon, 7 Jan 2008 11:07:54 -0800 (PST)
Christoph Lameter <[EMAIL PROTECTED]> wrote:

> On Fri, 4 Jan 2008, Lee Schermerhorn wrote:
>
> > We see this on both NUMA and non-NUMA. x86_64 and ia64. The basic
> > criteria to reproduce is to be able to run thousands [or low 10s of
> > thousands] of tasks, continually increasing the number until the system
> > just goes into reclaim. Instead of swapping, the system seems to
> > hang--unresponsive from the console, but with "soft lockup" messages
> > spitting out every few seconds...
>
> Ditto here.

I have some suspicions on what could be causing this.

The most obvious suspect is get_scan_ratio() continuing to return
100% file reclaim, 0% anon reclaim when the file LRUs have already
been reduced to something very small, because reclaiming up to that
point was easy.

I plan to add some code to automatically set the anon reclaim to 100%
if (free + file_active + file_inactive <= zone->pages_high), meaning
that reclaiming just file pages will not be able to free enough pages.

--
All rights reversed.
Re: [patch 00/19] VM pageout scalability improvements
On Fri, 4 Jan 2008, Lee Schermerhorn wrote:

> We see this on both NUMA and non-NUMA. x86_64 and ia64. The basic
> criteria to reproduce is to be able to run thousands [or low 10s of
> thousands] of tasks, continually increasing the number until the system
> just goes into reclaim. Instead of swapping, the system seems to
> hang--unresponsive from the console, but with "soft lockup" messages
> spitting out every few seconds...

Ditto here.
Re: [patch 00/19] VM pageout scalability improvements
On Mon, 7 Jan 2008 19:06:10 +0900
KAMEZAWA Hiroyuki <[EMAIL PROTECTED]> wrote:

> On Thu, 3 Jan 2008 12:00:00 -0500
> Rik van Riel <[EMAIL PROTECTED]> wrote:
>
> > If there is no swap space, my VM code will not bother scanning
> > any anon pages. This has the same effect as moving the pages
> > to the no-reclaim list, with the extra benefit of being able to
> > resume scanning the anon lists once swap space is freed.
>
> Is this 'avoiding scanning anon if no swap' feature in this set ?

I seem to have lost that code in a forward merge :(

Dunno if I started the forward merge from an older series that Lee
had, or if I lost the code myself... I'll put it back in ASAP.

--
All rights reversed.
Re: [patch 00/19] VM pageout scalability improvements
On Thu, 3 Jan 2008 12:00:00 -0500
Rik van Riel <[EMAIL PROTECTED]> wrote:

> On Thu, 03 Jan 2008 11:52:08 -0500
> Lee Schermerhorn <[EMAIL PROTECTED]> wrote:
>
> > Also, I should point out that the full noreclaim series includes a
> > couple of other patches NOT posted here by Rik:
> >
> > 1) treat swap backed pages as nonreclaimable when no swap space is
> > available. This addresses a problem we've seen in real life, with
> > vmscan spending a lot of time trying to reclaim anon/shmem/tmpfs/...
> > pages only to find that there is no swap space--add_to_swap() fails.
> > Maybe not a problem with Rik's new anon page handling.
>
> If there is no swap space, my VM code will not bother scanning
> any anon pages. This has the same effect as moving the pages
> to the no-reclaim list, with the extra benefit of being able to
> resume scanning the anon lists once swap space is freed.

Is this 'avoiding scanning anon if no swap' feature in this set ?

Thanks,
-Kame
Re: [patch 00/19] VM pageout scalability improvements
Rik van Riel wrote:
> On Fri, 04 Jan 2008 17:34:00 +0100
> Andi Kleen <[EMAIL PROTECTED]> wrote:
>
> > Lee Schermerhorn <[EMAIL PROTECTED]> writes:
> >
> > > We can easily [he says, glibly] reproduce the hang on the anon_vma lock
> >
> > Is that a NUMA platform? On non x86? Perhaps you just need queued spinlocks?
>
> I really think that the anon_vma and i_mmap_lock spinlock hangs are
> due to the lack of queued spinlocks.
>
> Not because I have seen your system hang, but because I've seen one
> of Larry's test systems here hang in scary/amusing ways :)

Changing the anon_vma->lock into a rwlock_t helps because
page_lock_anon_vma() can take it for read, and that's where the
contention is. However, it's the fact that under some tests most of
the pages are in vmas queued to one anon_vma that causes so much lock
contention.

> With queued spinlocks the system should just slow down, not hang.
Re: [patch 00/19] VM pageout scalability improvements
On Fri, 2008-01-04 at 17:34 +0100, Andi Kleen wrote:
> Lee Schermerhorn <[EMAIL PROTECTED]> writes:
>
> > We can easily [he says, glibly] reproduce the hang on the anon_vma lock
>
> Is that a NUMA platform? On non x86? Perhaps you just need queued spinlocks?

We see this on both NUMA and non-NUMA, x86_64 and ia64. The basic
criteria to reproduce is to be able to run thousands [or low 10s of
thousands] of tasks, continually increasing the number until the system
just goes into reclaim. Instead of swapping, the system seems to
hang--unresponsive from the console, but with "soft lockup" messages
spitting out every few seconds...

Lee
Re: [patch 00/19] VM pageout scalability improvements
On Fri, 04 Jan 2008 17:34:00 +0100
Andi Kleen <[EMAIL PROTECTED]> wrote:

> Lee Schermerhorn <[EMAIL PROTECTED]> writes:
>
> > We can easily [he says, glibly] reproduce the hang on the anon_vma lock
>
> Is that a NUMA platform? On non x86? Perhaps you just need queued spinlocks?

I really think that the anon_vma and i_mmap_lock spinlock hangs are
due to the lack of queued spinlocks. Not because I have seen your
system hang, but because I've seen one of Larry's test systems here
hang in scary/amusing ways :)

With queued spinlocks the system should just slow down, not hang.

--
All rights reversed.
Re: [patch 00/19] VM pageout scalability improvements
Lee Schermerhorn <[EMAIL PROTECTED]> writes:

> We can easily [he says, glibly] reproduce the hang on the anon_vma lock

Is that a NUMA platform? On non x86? Perhaps you just need queued
spinlocks?

-Andi
Re: [patch 00/19] VM pageout scalability improvements
On Thu, 2008-01-03 at 17:00 -0500, Rik van Riel wrote:
> On Thu, 03 Jan 2008 12:13:32 -0500
> Lee Schermerhorn <[EMAIL PROTECTED]> wrote:
>
> > Yes, but the problem, when it occurs, is very awkward. The system just
> > hangs for hours/days spinning on the reverse mapping locks--in both
> > page_referenced() and try_to_unmap(). No pages get reclaimed and NO OOM
> > kill occurs because we never get that far. So, I'm not sure I'd call
> > any OOM kills resulting from this patch "false". The memory is
> > effectively nonreclaimable. Now, I think that your anon pages SEQ
> > patch will eliminate the contention in page_referenced[_anon](), but we
> > could still hang in try_to_unmap().
>
> I am hoping that Nick's ticket spinlocks will fix this problem.
>
> Would you happen to have any test cases for the above problem that
> I could use to reproduce the problem and look for an automatic fix?

We can easily [he says, glibly] reproduce the hang on the anon_vma lock
with AIM7 loads on our test platforms. Perhaps we can come up with an
AIM workload to reproduce the phenomenon on one of your test platforms.
I've seen the hang with 15K-20K tasks on a 4 socket x86_64 with 16-32G
of memory and quite a bit of storage. I've also seen related hangs on
both anon_vma and i_mmap_lock during a heavy usex stress load on the
splitlru+noreclaim patches. [This, by the way, without and WITH my
rw_lock patches for both anon_vma and i_mmap_lock.] I can try to
package up the workload to run on your system.

> Any fix that requires the sysadmin to tune things _just_ right seems
> too dangerous to me - especially if a change in the workload can
> result in the system doing exactly the wrong thing...
>
> The idea is valid, but it just has to work automagically.
>
> Btw, if page_referenced() is called less, the locks that try_to_unmap()
> also takes should get less contention.

Makes sense. We'll have to see.

Lee
Re: [patch 00/19] VM pageout scalability improvements
On Thu, 03 Jan 2008 12:13:32 -0500
Lee Schermerhorn <[EMAIL PROTECTED]> wrote:

> Yes, but the problem, when it occurs, is very awkward. The system just
> hangs for hours/days spinning on the reverse mapping locks--in both
> page_referenced() and try_to_unmap(). No pages get reclaimed and NO OOM
> kill occurs because we never get that far. So, I'm not sure I'd call
> any OOM kills resulting from this patch "false". The memory is
> effectively nonreclaimable. Now, I think that your anon pages SEQ
> patch will eliminate the contention in page_referenced[_anon](), but we
> could still hang in try_to_unmap().

I am hoping that Nick's ticket spinlocks will fix this problem.

Would you happen to have any test cases for the above problem that
I could use to reproduce the problem and look for an automatic fix?

Any fix that requires the sysadmin to tune things _just_ right seems
too dangerous to me - especially if a change in the workload can
result in the system doing exactly the wrong thing...

The idea is valid, but it just has to work automagically.

Btw, if page_referenced() is called less, the locks that try_to_unmap()
also takes should get less contention.

--
All Rights Reversed
Re: [patch 00/19] VM pageout scalability improvements
On Thu, 2008-01-03 at 12:00 -0500, Rik van Riel wrote:
> On Thu, 03 Jan 2008 11:52:08 -0500
> Lee Schermerhorn <[EMAIL PROTECTED]> wrote:
>
> > Also, I should point out that the full noreclaim series includes a
> > couple of other patches NOT posted here by Rik:
> >
> > 1) treat swap backed pages as nonreclaimable when no swap space is
> > available. This addresses a problem we've seen in real life, with
> > vmscan spending a lot of time trying to reclaim anon/shmem/tmpfs/...
> > pages only to find that there is no swap space--add_to_swap() fails.
> > Maybe not a problem with Rik's new anon page handling.
>
> If there is no swap space, my VM code will not bother scanning
> any anon pages. This has the same effect as moving the pages
> to the no-reclaim list, with the extra benefit of being able to
> resume scanning the anon lists once swap space is freed.
>
> > 2) treat anon pages with "excessively long" anon_vma lists as
> > nonreclaimable. "excessively long" here is a sysctl tunable parameter.
> > This also addresses problems we've seen with benchmarks and stress
> > tests--all cpus spinning on some anon_vma lock. In "real life", we've
> > seen this behavior with file backed pages--spinning on the
> > i_mmap_lock--running Oracle workloads with user counts in the few
> > thousands. Again, something we may not need with Rik's vmscan rework.
> > If we did want to do this, we'd probably want to address file backed
> > pages and add support to bring the pages back from the noreclaim list
> > when the number of "mappers" drops below the threshold. My current
> > patch leaves anon pages as non-reclaimable until they're freed, or
> > manually scanned via the mechanism introduced by patch 12.
>
> I can see some issues with that patch. Specifically, if the threshold
> is set too high no pages will be affected, and if the threshold is too
> low all pages will become non-reclaimable, leading to a false OOM kill.
>
> Not only is it a very big hammer, it's also a rather awkward one...

Yes, but the problem, when it occurs, is very awkward. The system just
hangs for hours/days spinning on the reverse mapping locks--in both
page_referenced() and try_to_unmap(). No pages get reclaimed and NO OOM
kill occurs because we never get that far. So, I'm not sure I'd call
any OOM kills resulting from this patch "false". The memory is
effectively nonreclaimable. Now, I think that your anon pages SEQ
patch will eliminate the contention in page_referenced[_anon](), but we
could still hang in try_to_unmap(). And we have the issue with file
backed pages and the i_mmap_lock.

I'll see if this issue comes up in testing with the current series.
If not, cool! If so, we just have more work to do.

Later,
Lee
Re: [patch 00/19] VM pageout scalability improvements
On Thu, 03 Jan 2008 11:52:08 -0500
Lee Schermerhorn <[EMAIL PROTECTED]> wrote:

> Also, I should point out that the full noreclaim series includes a
> couple of other patches NOT posted here by Rik:
>
> 1) treat swap backed pages as nonreclaimable when no swap space is
> available. This addresses a problem we've seen in real life, with
> vmscan spending a lot of time trying to reclaim anon/shmem/tmpfs/...
> pages only to find that there is no swap space--add_to_swap() fails.
> Maybe not a problem with Rik's new anon page handling.

If there is no swap space, my VM code will not bother scanning
any anon pages. This has the same effect as moving the pages
to the no-reclaim list, with the extra benefit of being able to
resume scanning the anon lists once swap space is freed.

> 2) treat anon pages with "excessively long" anon_vma lists as
> nonreclaimable. "excessively long" here is a sysctl tunable parameter.
> This also addresses problems we've seen with benchmarks and stress
> tests--all cpus spinning on some anon_vma lock. In "real life", we've
> seen this behavior with file backed pages--spinning on the
> i_mmap_lock--running Oracle workloads with user counts in the few
> thousands. Again, something we may not need with Rik's vmscan rework.
> If we did want to do this, we'd probably want to address file backed
> pages and add support to bring the pages back from the noreclaim list
> when the number of "mappers" drops below the threshold. My current
> patch leaves anon pages as non-reclaimable until they're freed, or
> manually scanned via the mechanism introduced by patch 12.

I can see some issues with that patch. Specifically, if the threshold
is set too high no pages will be affected, and if the threshold is too
low all pages will become non-reclaimable, leading to a false OOM kill.

Not only is it a very big hammer, it's also a rather awkward one...

--
All Rights Reversed
Re: [patch 00/19] VM pageout scalability improvements
On Wed, 2008-01-02 at 17:41 -0500, [EMAIL PROTECTED] wrote:
> On large memory systems, the VM can spend way too much time scanning
> through pages that it cannot (or should not) evict from memory. Not
> only does it use up CPU time, but it also provokes lock contention
> and can leave large systems under memory pressure in a catatonic state.
>
> Against 2.6.24-rc6-mm1
>
> This patch series improves VM scalability by:
>
> 1) making the locking a little more scalable
>
> 2) putting filesystem backed, swap backed and non-reclaimable pages
>    onto their own LRUs, so the system only scans the pages that it
>    can/should evict from memory
>
> 3) switching to SEQ replacement for the anonymous LRUs, so the
>    number of pages that need to be scanned when the system
>    starts swapping is bound to a reasonable number
>
> The noreclaim patches come verbatim from Lee Schermerhorn and
> Nick Piggin. I have made a few small fixes to them and left out
> the bits that are no longer needed with split file/anon lists.
>
> The exception is "Scan noreclaim list for reclaimable pages",
> which should not be needed but could be a useful debugging tool.

Note that patch 14/19 [SHM_LOCK/UNLOCK handling] depends on the
infrastructure introduced by the "Scan noreclaim list for reclaimable
pages" patch. When SHM_UNLOCKing a shm segment, we call a new
scan_mapping_noreclaim_page() function to check all of the pages in
the segment for reclaimability. There might be other reasons for the
pages to be non-reclaimable... So, we can't merge 14/19 as is w/o
some of patch 12. We can probably eliminate the sysctl and per node
sysfs attributes to force a scan. But, as Rik says, this has been
useful for debugging--e.g., periodically forcing a full rescan while
running a stress load.

Also, I should point out that the full noreclaim series includes a
couple of other patches NOT posted here by Rik:

1) treat swap backed pages as nonreclaimable when no swap space is
available. This addresses a problem we've seen in real life, with
vmscan spending a lot of time trying to reclaim anon/shmem/tmpfs/...
pages only to find that there is no swap space--add_to_swap() fails.
Maybe not a problem with Rik's new anon page handling. We'll see. If
we did want to add this filter, we'll need a way to bring back pages
from the noreclaim list that are there only for lack of swap space
when space is added or becomes available.

2) treat anon pages with "excessively long" anon_vma lists as
nonreclaimable. "excessively long" here is a sysctl tunable parameter.
This also addresses problems we've seen with benchmarks and stress
tests--all cpus spinning on some anon_vma lock. In "real life", we've
seen this behavior with file backed pages--spinning on the
i_mmap_lock--running Oracle workloads with user counts in the few
thousands. Again, something we may not need with Rik's vmscan rework.
If we did want to do this, we'd probably want to address file backed
pages and add support to bring the pages back from the noreclaim list
when the number of "mappers" drops below the threshold. My current
patch leaves anon pages as non-reclaimable until they're freed, or
manually scanned via the mechanism introduced by patch 12.

Lee
[patch 00/19] VM pageout scalability improvements
On large memory systems, the VM can spend way too much time scanning
through pages that it cannot (or should not) evict from memory. Not
only does it use up CPU time, but it also provokes lock contention
and can leave large systems under memory pressure in a catatonic state.

Against 2.6.24-rc6-mm1

This patch series improves VM scalability by:

1) making the locking a little more scalable

2) putting filesystem backed, swap backed and non-reclaimable pages
   onto their own LRUs, so the system only scans the pages that it
   can/should evict from memory

3) switching to SEQ replacement for the anonymous LRUs, so the
   number of pages that need to be scanned when the system
   starts swapping is bound to a reasonable number

The noreclaim patches come verbatim from Lee Schermerhorn and
Nick Piggin. I have made a few small fixes to them and left out
the bits that are no longer needed with split file/anon lists.

The exception is "Scan noreclaim list for reclaimable pages",
which should not be needed but could be a useful debugging tool.

--
All Rights Reversed