Re: [patch 00/19] VM pageout scalability improvements

2008-01-11 Thread Rik van Riel
On Fri, 11 Jan 2008 16:11:15 +0530
Balbir Singh <[EMAIL PROTECTED]> wrote:

> I've just started the patch series, the compile fails for me on a
> powerpc box. global_lru_pages() is defined under CONFIG_PM, but used
> elsewhere in mm/page-writeback.c. None of the global_lru_pages()
> parameters depend on CONFIG_PM. Here's a simple patch to fix it.

Thank you for the fix.  I have applied it to my tree.

-- 
All rights reversed.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 00/19] VM pageout scalability improvements

2008-01-11 Thread Balbir Singh
* Rik van Riel <[EMAIL PROTECTED]> [2008-01-08 15:59:39]:

> Changelog:
> - merge memcontroller split LRU code into the main split LRU patch,
>   since it is not functionally different (it was split up only to help
>   people who had seen the last version of the patch series review it)

Hi, Rik,

I see a strange behaviour with this patchset. I have a program
(pagetest from Vaidy), that does the following

1. Can allocate different kinds of memory, mapped, malloc'ed or shared
2. Allocates and touches all the memory in a loop (2 times)
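
For readers without the test program, the malloc case of the loop described above can be approximated as follows. This is a hypothetical stand-in, not Vaidy's pagetest: the function name, its parameters and the one-byte-per-page write pattern are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for the test described above (not the real
 * pagetest): allocate `bytes` with malloc() and write one byte to
 * every page, `passes` times.  Returns the number of page touches,
 * or 0 if the allocation fails.
 */
static size_t touch_malloced(size_t bytes, size_t page_size, int passes)
{
    assert(page_size > 0 && passes > 0);

    volatile unsigned char *buf = malloc(bytes);
    size_t touches = 0;

    if (!buf)
        return 0;

    for (int pass = 0; pass < passes; pass++) {
        for (size_t off = 0; off < bytes; off += page_size) {
            buf[off] = (unsigned char)pass;   /* fault the page in */
            touches++;
        }
    }

    free((void *)buf);
    return touches;
}
```

Calling it with a size well above the cgroup limit (for example 1000M against a 400M limit) is the kind of load the report describes.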

I mount the memory controller and limit it to 400M and run pagetest
and ask it to touch 1000M. Without this patchset everything runs fine,
but with this patchset installed, I immediately see

 pagetest invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
 Call Trace:
 [c000e5aef400] [c000eb24] .show_stack+0x70/0x1bc (unreliable)
 [c000e5aef4b0] [c00c] .oom_kill_process+0x80/0x260
 [c000e5aef570] [c00bc498] .mem_cgroup_out_of_memory+0x6c/0x98
 [c000e5aef610] [c00f2574] .mem_cgroup_charge_common+0x1e0/0x414
 [c000e5aef6e0] [c00b852c] .add_to_page_cache+0x48/0x164
 [c000e5aef780] [c00b8664] .add_to_page_cache_lru+0x1c/0x68
 [c000e5aef810] [c012db50] .mpage_readpages+0xbc/0x15c
 [c000e5aef940] [c018bdac] .ext3_readpages+0x28/0x40
 [c000e5aef9c0] [c00c3978] .__do_page_cache_readahead+0x158/0x260
 [c000e5aefa90] [c00bac44] .filemap_fault+0x18c/0x3d4
 [c000e5aefb70] [c00cd510] .__do_fault+0xb0/0x588
 [c000e5aefc80] [c05653cc] .do_page_fault+0x440/0x620
 [c000e5aefe30] [c0005408] handle_page_fault+0x20/0x58
 Mem-info:
 Node 0 DMA per-cpu:
 CPU0: hi:6, btch:   1 usd:   4
 CPU1: hi:6, btch:   1 usd:   0
 CPU2: hi:6, btch:   1 usd:   3
 CPU3: hi:6, btch:   1 usd:   4
 Active_anon:9099 active_file:1523 inactive_anon0
  inactive_file:2869 noreclaim:0 dirty:20 writeback:0 unstable:0
  free:44210 slab:639 mapped:1724 pagetables:475 bounce:0
 Node 0 DMA free:2829440kB min:7808kB low:9728kB high:11712kB active_anon:582336kB inactive_anon:0kB active_file:97472kB inactive_file:183616kB noreclaim:0kB present:3813760kB pages_scanned:0 all_unreclaimable? no
 lowmem_reserve[]: 0 0 0
 Node 0 DMA: 3*64kB 5*128kB 5*256kB 4*512kB 2*1024kB 4*2048kB 3*4096kB 2*8192kB 170*16384kB = 2828352kB
 Swap cache: add 0, delete 0, find 0/0
 Free swap  = 3148608kB
 Total swap = 3148608kB
 Free swap:   3148608kB
 59648 pages of RAM
 677 reserved pages
 28165 pages shared
 0 pages swap cached
 Memory cgroup out of memory: kill process 6593 (pagetest) score 1003 or a child
 Killed process 6593 (pagetest)

I am using a powerpc box with 64K size pages. I'll try and investigate further,
just a heads up on the failure I am seeing.
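
The page counts in the dump above are in units of 64K pages, which can be checked against the kB figures on the "Node 0 DMA" line. A small sketch of that arithmetic follows; the helper names are made up for illustration and are not kernel functions.

```c
#include <assert.h>

/* Convert a page count to kilobytes for a given page size in kB. */
static unsigned long pages_to_kb(unsigned long pages, unsigned long page_kb)
{
    assert(page_kb > 0);
    return pages * page_kb;
}

/* Number of whole pages that fit under a limit given in megabytes. */
static unsigned long mb_to_pages(unsigned long mb, unsigned long page_kb)
{
    assert(page_kb > 0);
    return mb * 1024 / page_kb;
}
```

With 64K pages, active_anon:9099 corresponds to 582336kB and free:44210 to 2829440kB, matching the per-node line.  The 400M limit is 6400 such pages and the 1000M the test touches is 16000.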

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: [patch 00/19] VM pageout scalability improvements

2008-01-11 Thread Balbir Singh
* Rik van Riel <[EMAIL PROTECTED]> [2008-01-08 15:59:39]:

> On large memory systems, the VM can spend way too much time scanning
> through pages that it cannot (or should not) evict from memory. Not
> only does it use up CPU time, but it also provokes lock contention
> and can leave large systems under memory pressure in a catatonic state.
> 
> Against 2.6.24-rc6-mm1
> 
> This patch series improves VM scalability by:
> 
> 1) making the locking a little more scalable
> 
> 2) putting filesystem backed, swap backed and non-reclaimable pages
>    onto their own LRUs, so the system only scans the pages that it
>    can/should evict from memory
> 
> 3) switching to SEQ replacement for the anonymous LRUs, so the
>    number of pages that need to be scanned when the system
>    starts swapping is bound to a reasonable number
> 
> More info on the overall design can be found at:
> 
>   http://linux-mm.org/PageReplacementDesign
> 
> 
> Changelog:
> - merge memcontroller split LRU code into the main split LRU patch,
>   since it is not functionally different (it was split up only to help
>   people who had seen the last version of the patch series review it)
> - drop the page_file_cache debugging patch, since it never triggered
> - reintroduce code to not scan anon list if swap is full
> - add code to scan anon list if page cache is very small already
> - use lumpy reclaim more aggressively for smaller order > 1 allocations
>

Hi, Rik,

I've just started the patch series, the compile fails for me on a
powerpc box. global_lru_pages() is defined under CONFIG_PM, but used
elsewhere in mm/page-writeback.c. None of the global_lru_pages()
parameters depend on CONFIG_PM. Here's a simple patch to fix it.

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b14e188..39e6aef 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1920,6 +1920,14 @@ void wakeup_kswapd(struct zone *zone, int order)
    wake_up_interruptible(&pgdat->kswapd_wait);
 }
 
+unsigned long global_lru_pages(void)
+{
+   return global_page_state(NR_ACTIVE_ANON)
+   + global_page_state(NR_ACTIVE_FILE)
+   + global_page_state(NR_INACTIVE_ANON)
+   + global_page_state(NR_INACTIVE_FILE);
+}
+
 #ifdef CONFIG_PM
 /*
  * Helper function for shrink_all_memory().  Tries to reclaim 'nr_pages' pages
@@ -1968,14 +1976,6 @@ static unsigned long shrink_all_zones(unsigned long nr_pages, int prio,
    return ret;
 }
 
-unsigned long global_lru_pages(void)
-{
-   return global_page_state(NR_ACTIVE_ANON)
-   + global_page_state(NR_ACTIVE_FILE)
-   + global_page_state(NR_INACTIVE_ANON)
-   + global_page_state(NR_INACTIVE_FILE);
-}
-
 /*
  * Try to free `nr_pages' of memory, system-wide, and return the number of
  * freed pages.
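
The effect of the move can be reproduced in user space.  The sketch below is only an illustration: the counter array and the stand-in global_page_state() are assumptions, and pm_only_helper() and writeback_style_caller() are hypothetical names.  It shows that a function defined outside the #ifdef is visible to callers whether or not CONFIG_PM is defined.

```c
#include <assert.h>

/* Hypothetical stand-ins for the kernel's LRU page counters. */
enum lru_counter {
    NR_ACTIVE_ANON,
    NR_ACTIVE_FILE,
    NR_INACTIVE_ANON,
    NR_INACTIVE_FILE,
    NR_COUNTERS
};

static unsigned long counters[NR_COUNTERS];

static unsigned long global_page_state(enum lru_counter item)
{
    assert(item < NR_COUNTERS);
    return counters[item];
}

/* Defined unconditionally, so callers build with or without CONFIG_PM. */
unsigned long global_lru_pages(void)
{
    return global_page_state(NR_ACTIVE_ANON)
         + global_page_state(NR_ACTIVE_FILE)
         + global_page_state(NR_INACTIVE_ANON)
         + global_page_state(NR_INACTIVE_FILE);
}

#ifdef CONFIG_PM
/* Only the suspend-related code stays behind the config option. */
unsigned long pm_only_helper(void)
{
    return global_lru_pages();
}
#endif

/* A caller outside the #ifdef, like the one in mm/page-writeback.c. */
unsigned long writeback_style_caller(void)
{
    return global_lru_pages();
}
```

Had global_lru_pages() been left inside the #ifdef block, writeback_style_caller() would fail to build when CONFIG_PM is not set, which is the failure reported above.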

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: [patch 00/19] VM pageout scalability improvements

2008-01-10 Thread Mike Snitzer
On Jan 10, 2008 10:41 AM, Rik van Riel <[EMAIL PROTECTED]> wrote:
>
> On Wed, 9 Jan 2008 23:39:02 -0500
> "Mike Snitzer" <[EMAIL PROTECTED]> wrote:
>
> > How much trouble am I asking for if I were to try to get your patchset
> > to fly on a fairly recent "stable" kernel (e.g. 2.6.22.15)?  If
> > workable, is such an effort before its time relative to your TODO?
>
> Quite a bit :)
>
> The -mm kernel has the memory controller code, which means the
> mm/ directory is fairly different.  My patch set sits on top
> of that.
>
> Chances are that once the -mm kernel goes upstream (in 2.6.25-rc1),
> I can start building on top of that.
>
> OTOH, maybe I could get my patch series onto a recent 2.6.23.X with
> minimal chainsaw effort.

That would be great!  I can't speak for others but -mm poses a problem
for testing your patchset because it is so bleeding.  Let me know if
you take the plunge on a 2.6.23.x backport; I'd really appreciate it.

Is anyone else interested in consuming a 2.6.23.x backport of Rik's
patchset?  If so please speak up.

> > I see that you have an old port to a FC7-based 2.6.21 here:
> > http://people.redhat.com/riel/vmsplit/
> >
> > Also, do you have a public git repo that you regularly publish to for
> > this patchset?  If not a git repo do you put the raw patchset on some
> > http/ftp server?
>
> Up to now I have only emailed out the patches. Since there is demand
> for them to be downloadable from somewhere, I'll also start putting
> them on http://people.redhat.com/riel/

Great, thanks.

Mike


Re: [patch 00/19] VM pageout scalability improvements

2008-01-10 Thread Rik van Riel
On Wed, 9 Jan 2008 23:39:02 -0500
"Mike Snitzer" <[EMAIL PROTECTED]> wrote:

> How much trouble am I asking for if I were to try to get your patchset
> to fly on a fairly recent "stable" kernel (e.g. 2.6.22.15)?  If
> workable, is such an effort before its time relative to your TODO?

Quite a bit :)

The -mm kernel has the memory controller code, which means the
mm/ directory is fairly different.  My patch set sits on top
of that.

Chances are that once the -mm kernel goes upstream (in 2.6.25-rc1),
I can start building on top of that.

OTOH, maybe I could get my patch series onto a recent 2.6.23.X with
minimal chainsaw effort.

> I see that you have an old port to a FC7-based 2.6.21 here:
> http://people.redhat.com/riel/vmsplit/
> 
> Also, do you have a public git repo that you regularly publish to for
> this patchset?  If not a git repo do you put the raw patchset on some
> http/ftp server?

Up to now I have only emailed out the patches. Since there is demand
for them to be downloadable from somewhere, I'll also start putting
them on http://people.redhat.com/riel/

-- 
All rights reversed.


Re: [patch 00/19] VM pageout scalability improvements

2008-01-09 Thread Mike Snitzer
On Jan 8, 2008 3:59 PM, Rik van Riel <[EMAIL PROTECTED]> wrote:
> On large memory systems, the VM can spend way too much time scanning
> through pages that it cannot (or should not) evict from memory. Not
> only does it use up CPU time, but it also provokes lock contention
> and can leave large systems under memory pressure in a catatonic state.
>
> Against 2.6.24-rc6-mm1

Hi Rik,

How much trouble am I asking for if I were to try to get your patchset
to fly on a fairly recent "stable" kernel (e.g. 2.6.22.15)?  If
workable, is such an effort before its time relative to your TODO?

I see that you have an old port to a FC7-based 2.6.21 here:
http://people.redhat.com/riel/vmsplit/

Also, do you have a public git repo that you regularly publish to for
this patchset?  If not a git repo do you put the raw patchset on some
http/ftp server?

thanks,
Mike


[patch 00/19] VM pageout scalability improvements

2008-01-08 Thread Rik van Riel
On large memory systems, the VM can spend way too much time scanning
through pages that it cannot (or should not) evict from memory. Not
only does it use up CPU time, but it also provokes lock contention
and can leave large systems under memory pressure in a catatonic state.

Against 2.6.24-rc6-mm1

This patch series improves VM scalability by:

1) making the locking a little more scalable

2) putting filesystem backed, swap backed and non-reclaimable pages
   onto their own LRUs, so the system only scans the pages that it
   can/should evict from memory

3) switching to SEQ replacement for the anonymous LRUs, so the
   number of pages that need to be scanned when the system
   starts swapping is bound to a reasonable number
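
As a rough illustration of point 2, the choice of list for a page can be sketched like this.  The list names mirror the counters used elsewhere in the series (active/inactive, anon/file, noreclaim); the struct, its fields and page_lru() as written here are hypothetical and are not taken from the patches.

```c
#include <assert.h>
#include <stdbool.h>

/* One LRU per page type; names are illustrative. */
enum lru_list {
    LRU_INACTIVE_ANON,
    LRU_ACTIVE_ANON,
    LRU_INACTIVE_FILE,
    LRU_ACTIVE_FILE,
    LRU_NORECLAIM,
    NR_LRU_LISTS
};

/* Hypothetical summary of the properties that decide the list. */
struct page_info {
    bool swap_backed;   /* anon, shmem, tmpfs */
    bool active;        /* recently referenced */
    bool reclaimable;   /* false e.g. for pages of a SHM_LOCKed segment */
};

static enum lru_list page_lru(const struct page_info *p)
{
    assert(p);

    if (!p->reclaimable)
        return LRU_NORECLAIM;   /* never looked at by the normal scan */
    if (p->swap_backed)
        return p->active ? LRU_ACTIVE_ANON : LRU_INACTIVE_ANON;
    return p->active ? LRU_ACTIVE_FILE : LRU_INACTIVE_FILE;
}
```

With separate lists, a scan for file pages never has to walk past anon or non-reclaimable pages to find candidates.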

More info on the overall design can be found at:

http://linux-mm.org/PageReplacementDesign


Changelog:
- merge memcontroller split LRU code into the main split LRU patch,
  since it is not functionally different (it was split up only to help
  people who had seen the last version of the patch series review it)
- drop the page_file_cache debugging patch, since it never triggered
- reintroduce code to not scan anon list if swap is full
- add code to scan anon list if page cache is very small already
- use lumpy reclaim more aggressively for smaller order > 1 allocations

-- 
All Rights Reversed



Re: [patch 00/19] VM pageout scalability improvements

2008-01-07 Thread Rik van Riel
On Mon, 7 Jan 2008 11:07:54 -0800 (PST)
Christoph Lameter <[EMAIL PROTECTED]> wrote:
> On Fri, 4 Jan 2008, Lee Schermerhorn wrote:
> 
> > We see this on both NUMA and non-NUMA. x86_64 and ia64.  The basic
> > criteria to reproduce is to be able to run thousands [or low 10s of
> > thousands] of tasks, continually increasing the number until the system
> > just goes into reclaim.  Instead of swapping, the system seems to
> > hang--unresponsive from the console, but with "soft lockup" messages
> > spitting out every few seconds...
> 
> Ditto here.

I have some suspicions on what could be causing this.

The most obvious suspect is get_scan_ratio() continuing to return
100% file reclaim, 0% anon reclaim when the file LRUs have already
been reduced to something very small, because reclaiming up to that
point was easy.

I plan to add some code to automatically set the anon reclaim to
100% if (free + file_active + file_inactive <= zone->pages_high),
meaning that reclaiming just file pages will not be able to free
enough pages.
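
A minimal sketch of the check described above, under stated assumptions: struct zone_counts and scan_ratio() are hypothetical, `anon_pct` stands for whatever the normal heuristic proposed, and setting the file share to 0 in the override case is an assumption (the text only says the anon share becomes 100%).

```c
#include <assert.h>

/* Hypothetical snapshot of one zone's counters, in pages. */
struct zone_counts {
    unsigned long free;
    unsigned long file_active;
    unsigned long file_inactive;
    unsigned long pages_high;   /* high watermark */
};

/*
 * percent[0] is the anon share, percent[1] the file share.
 * anon_pct is the share the regular heuristic would have used.
 */
static void scan_ratio(const struct zone_counts *z, unsigned anon_pct,
                       unsigned percent[2])
{
    assert(z && anon_pct <= 100);

    /* Even freeing every file page would not reach the watermark. */
    if (z->free + z->file_active + z->file_inactive <= z->pages_high) {
        percent[0] = 100;
        percent[1] = 0;     /* assumption: file scanning is pointless here */
        return;
    }

    percent[0] = anon_pct;
    percent[1] = 100 - anon_pct;
}
```

The override only triggers when the file LRUs are already too small to matter, so the easy file-only reclaim path is left alone in the common case.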

-- 
All rights reversed.


Re: [patch 00/19] VM pageout scalability improvements

2008-01-07 Thread Christoph Lameter
On Fri, 4 Jan 2008, Lee Schermerhorn wrote:

> We see this on both NUMA and non-NUMA. x86_64 and ia64.  The basic
> criteria to reproduce is to be able to run thousands [or low 10s of
> thousands] of tasks, continually increasing the number until the system
> just goes into reclaim.  Instead of swapping, the system seems to
> hang--unresponsive from the console, but with "soft lockup" messages
> spitting out every few seconds...

Ditto here.



Re: [patch 00/19] VM pageout scalability improvements

2008-01-07 Thread Rik van Riel
On Mon, 7 Jan 2008 19:06:10 +0900
KAMEZAWA Hiroyuki <[EMAIL PROTECTED]> wrote:
> On Thu, 3 Jan 2008 12:00:00 -0500
> Rik van Riel <[EMAIL PROTECTED]> wrote:

> > If there is no swap space, my VM code will not bother scanning
> > any anon pages.  This has the same effect as moving the pages
> > to the no-reclaim list, with the extra benefit of being able to
> > resume scanning the anon lists once swap space is freed.
> > 
> Is this 'avoiding scanning anon if no swap' feature  in this set ?

I seem to have lost that code in a forward merge :(

Dunno if I started the forward merge from an older series that
Lee had or if I lost the code myself...

I'll put it back in ASAP.

-- 
All rights reversed.


Re: [patch 00/19] VM pageout scalability improvements

2008-01-07 Thread KAMEZAWA Hiroyuki
On Thu, 3 Jan 2008 12:00:00 -0500
Rik van Riel <[EMAIL PROTECTED]> wrote:

> On Thu, 03 Jan 2008 11:52:08 -0500
> Lee Schermerhorn <[EMAIL PROTECTED]> wrote:
> 
> > Also, I should point out that the full noreclaim series includes a
> > couple of other patches NOT posted here by Rik:
> > 
> > 1) treat swap backed pages as nonreclaimable when no swap space is
> > available.  This addresses a problem we've seen in real life, with
> > vmscan spending a lot of time trying to reclaim anon/shmem/tmpfs/...
> > pages only to find that there is no swap space--add_to_swap() fails.
> > Maybe not a problem with Rik's new anon page handling.
> 
> If there is no swap space, my VM code will not bother scanning
> any anon pages.  This has the same effect as moving the pages
> to the no-reclaim list, with the extra benefit of being able to
> resume scanning the anon lists once swap space is freed.
> 
Is this 'avoiding scanning anon if no swap' feature  in this set ?

Thanks
-Kame



Re: [patch 00/19] VM pageout scalability improvements

2008-01-04 Thread Larry Woodman

Rik van Riel wrote:

> On Fri, 04 Jan 2008 17:34:00 +0100
> Andi Kleen <[EMAIL PROTECTED]> wrote:
>
> > Lee Schermerhorn <[EMAIL PROTECTED]> writes:
> >
> > > We can easily [he says, glibly] reproduce the hang on the anon_vma lock
> >
> > Is that a NUMA platform? On non x86? Perhaps you just need queued spinlocks?
>
> I really think that the anon_vma and i_mmap_lock spinlock hangs are
> due to the lack of queued spinlocks.  Not because I have seen your
> system hang, but because I've seen one of Larry's test systems here
> hang in scary/amusing ways :)

Changing the anon_vma->lock into a rwlock_t helps because
page_lock_anon_vma() can take it for read, and that's where the
contention is.  However, it's the fact that under some tests most of
the pages are in vmas queued to one anon_vma that causes so much
lock contention.

> With queued spinlocks the system should just slow down, not hang.



Re: [patch 00/19] VM pageout scalability improvements

2008-01-04 Thread Lee Schermerhorn
On Fri, 2008-01-04 at 17:34 +0100, Andi Kleen wrote:
> Lee Schermerhorn <[EMAIL PROTECTED]> writes:
> 
> > We can easily [he says, glibly] reproduce the hang on the anon_vma lock
> 
> Is that a NUMA platform? On non x86? Perhaps you just need queued spinlocks?

We see this on both NUMA and non-NUMA. x86_64 and ia64.  The basic
criteria to reproduce is to be able to run thousands [or low 10s of
thousands] of tasks, continually increasing the number until the system
just goes into reclaim.  Instead of swapping, the system seems to
hang--unresponsive from the console, but with "soft lockup" messages
spitting out every few seconds...


Lee 


> 
> -Andi



Re: [patch 00/19] VM pageout scalability improvements

2008-01-04 Thread Rik van Riel
On Fri, 04 Jan 2008 17:34:00 +0100
Andi Kleen <[EMAIL PROTECTED]> wrote:
> Lee Schermerhorn <[EMAIL PROTECTED]> writes:
> 
> > We can easily [he says, glibly] reproduce the hang on the anon_vma lock
> 
> Is that a NUMA platform? On non x86? Perhaps you just need queued spinlocks?

I really think that the anon_vma and i_mmap_lock spinlock hangs are
due to the lack of queued spinlocks.  Not because I have seen your
system hang, but because I've seen one of Larry's test systems here
hang in scary/amusing ways :)

With queued spinlocks the system should just slow down, not hang.
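
One common form of FIFO-ordered spinlock is a ticket lock.  The following is a minimal user-space sketch with C11 atomics, meant only to show why waiters are served in arrival order; it is not the kernel implementation, and the struct and function names are made up.

```c
#include <assert.h>
#include <stdatomic.h>

struct ticket_lock {
    atomic_uint next;    /* next ticket to hand out */
    atomic_uint owner;   /* ticket currently being served */
};

static void ticket_lock_init(struct ticket_lock *l)
{
    assert(l);
    atomic_init(&l->next, 0);
    atomic_init(&l->owner, 0);
}

/* Take a ticket, then spin until it is our turn.  Returns the ticket. */
static unsigned ticket_lock_acquire(struct ticket_lock *l)
{
    unsigned me = atomic_fetch_add_explicit(&l->next, 1,
                                            memory_order_relaxed);

    while (atomic_load_explicit(&l->owner, memory_order_acquire) != me)
        ;   /* spin; later arrivals hold higher tickets and wait behind us */

    return me;
}

/* Hand the lock to the holder of the next ticket. */
static void ticket_lock_release(struct ticket_lock *l)
{
    atomic_fetch_add_explicit(&l->owner, 1, memory_order_release);
}
```

Because each waiter spins for its own ticket number, no CPU can be starved by others that keep re-grabbing the lock, which fits the expectation that a contended system slows down rather than hangs.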

-- 
All rights reversed.


Re: [patch 00/19] VM pageout scalability improvements

2008-01-04 Thread Andi Kleen
Lee Schermerhorn <[EMAIL PROTECTED]> writes:

> We can easily [he says, glibly] reproduce the hang on the anon_vma lock

Is that a NUMA platform? On non x86? Perhaps you just need queued spinlocks?

-Andi


Re: [patch 00/19] VM pageout scalability improvements

2008-01-04 Thread Lee Schermerhorn
On Thu, 2008-01-03 at 17:00 -0500, Rik van Riel wrote:
> On Thu, 03 Jan 2008 12:13:32 -0500
> Lee Schermerhorn <[EMAIL PROTECTED]> wrote:
> 
> > Yes, but the problem, when it occurs, is very awkward.  The system just
> > hangs for hours/days spinning on the reverse mapping locks--in both
> > page_referenced() and try_to_unmap().  No pages get reclaimed and NO OOM
> > kill occurs because we never get that far.  So, I'm not sure I'd call
> > any OOM kills resulting from this patch as "false".  The memory is
> > effectively nonreclaimable.   Now, I think that your anon pages SEQ
> > patch will eliminate the contention in page_referenced[_anon](), but we
> > could still hang in try_to_unmap().
> 
> I am hoping that Nick's ticket spinlocks will fix this problem.
> 
> Would you happen to have any test cases for the above problem that
> I could use to reproduce the problem and look for an automatic fix?

We can easily [he says, glibly] reproduce the hang on the anon_vma lock
with AIM7 loads on our test platforms.  Perhaps we can come up with an
AIM workload to reproduce the phenomenon on one of your test platforms.
I've seen the hang with 15K-20K tasks on a 4 socket x86_64 with 16-32G
of memory and quite a bit of storage.

I've also seen related hangs on both anon_vma and i_mmap_lock during a
heavy usex stress load on the splitlru+noreclaim patches.  [This, by the
way, without and WITH my rw_lock patches for both anon_vma and
i_mmap_lock.]  I can try to package up the workload to run on your
system.

> 
> Any fix that requires the sysadmin to tune things _just_ right seems
> too dangerous to me - especially if a change in the workload can
> result in the system doing exactly the wrong thing...
> 
> The idea is valid, but it just has to work automagically.
> 
> Btw, if page_referenced() is called less, the locks that try_to_unmap()
> also takes should get less contention.

Makes sense.  We'll have to see.

Lee
> 



Re: [patch 00/19] VM pageout scalability improvements

2008-01-03 Thread Rik van Riel
On Thu, 03 Jan 2008 12:13:32 -0500
Lee Schermerhorn <[EMAIL PROTECTED]> wrote:

> Yes, but the problem, when it occurs, is very awkward.  The system just
> hangs for hours/days spinning on the reverse mapping locks--in both
> page_referenced() and try_to_unmap().  No pages get reclaimed and NO OOM
> kill occurs because we never get that far.  So, I'm not sure I'd call
> any OOM kills resulting from this patch as "false".  The memory is
> effectively nonreclaimable.   Now, I think that your anon pages SEQ
> patch will eliminate the contention in page_referenced[_anon](), but we
> could still hang in try_to_unmap().

I am hoping that Nick's ticket spinlocks will fix this problem.

Would you happen to have any test cases for the above problem that
I could use to reproduce the problem and look for an automatic fix?

Any fix that requires the sysadmin to tune things _just_ right seems
too dangerous to me - especially if a change in the workload can
result in the system doing exactly the wrong thing...

The idea is valid, but it just has to work automagically.

Btw, if page_referenced() is called less, the locks that try_to_unmap()
also takes should get less contention.

-- 
All Rights Reversed


Re: [patch 00/19] VM pageout scalability improvements

2008-01-03 Thread Lee Schermerhorn
On Thu, 2008-01-03 at 12:00 -0500, Rik van Riel wrote:
> On Thu, 03 Jan 2008 11:52:08 -0500
> Lee Schermerhorn <[EMAIL PROTECTED]> wrote:
> 
> > Also, I should point out that the full noreclaim series includes a
> > couple of other patches NOT posted here by Rik:
> > 
> > 1) treat swap backed pages as nonreclaimable when no swap space is
> > available.  This addresses a problem we've seen in real life, with
> > vmscan spending a lot of time trying to reclaim anon/shmem/tmpfs/...
> > pages only to find that there is no swap space--add_to_swap() fails.
> > Maybe not a problem with Rik's new anon page handling.
> 
> If there is no swap space, my VM code will not bother scanning
> any anon pages.  This has the same effect as moving the pages
> to the no-reclaim list, with the extra benefit of being able to
> resume scanning the anon lists once swap space is freed.
> 
> > 2) treat anon pages with "excessively long" anon_vma lists as
> > nonreclaimable.   "excessively long" here is a sysctl tunable parameter.
> > This also addresses problems we've seen with benchmarks and stress
> > tests--all cpus spinning on some anon_vma lock.  In "real life", we've
> > seen this behavior with file backed pages--spinning on the
> > i_mmap_lock--running Oracle workloads with user counts in the few
> > thousands.  Again, something we may not need with Rik's vmscan rework.
> > If we did want to do this, we'd probably want to address file backed
> > pages and add support to bring the pages back from the noreclaim list
> > when the number of "mappers" drops below the threshold.  My current
> > patch leaves anon pages as non-reclaimable until they're freed, or
> > manually scanned via the mechanism introduced by patch 12.
> 
> I can see some issues with that patch.  Specifically, if the threshold
> is set too high no pages will be affected, and if the threshold is too
> low all pages will become non-reclaimable, leading to a false OOM kill.
> 
> Not only is it a very big hammer, it's also a rather awkward one...

Yes, but the problem, when it occurs, is very awkward.  The system just
hangs for hours/days spinning on the reverse mapping locks--in both
page_referenced() and try_to_unmap().  No pages get reclaimed and NO OOM
kill occurs because we never get that far.  So, I'm not sure I'd call
any OOM kills resulting from this patch as "false".  The memory is
effectively nonreclaimable.   Now, I think that your anon pages SEQ
patch will eliminate the contention in page_referenced[_anon](), but we
could still hang in try_to_unmap().  And we have the issue with file
backed pages and the i_mmap_lock.  I'll see if this issue comes up in
testing with the current series.  If not, cool!  If so, we just have
more work to do.

Later,
Lee
> 



Re: [patch 00/19] VM pageout scalability improvements

2008-01-03 Thread Rik van Riel
On Thu, 03 Jan 2008 11:52:08 -0500
Lee Schermerhorn <[EMAIL PROTECTED]> wrote:

> Also, I should point out that the full noreclaim series includes a
> couple of other patches NOT posted here by Rik:
> 
> 1) treat swap backed pages as nonreclaimable when no swap space is
> available.  This addresses a problem we've seen in real life, with
> vmscan spending a lot of time trying to reclaim anon/shmem/tmpfs/...
> pages only to find that there is no swap space--add_to_swap() fails.
> Maybe not a problem with Rik's new anon page handling.

If there is no swap space, my VM code will not bother scanning
any anon pages.  This has the same effect as moving the pages
to the no-reclaim list, with the extra benefit of being able to
resume scanning the anon lists once swap space is freed.
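
The difference from moving pages to a no-reclaim list can be sketched as a per-pass decision.  Everything here is hypothetical (struct vm_state, anon_pages_to_scan() and the batch parameter are not from the patches); the point is only that the pages stay on the anon list, so nothing has to be moved back when swap becomes available again.

```c
#include <assert.h>

/* Hypothetical view of the state one reclaim pass looks at. */
struct vm_state {
    long free_swap_pages;       /* swap slots currently available */
    unsigned long anon_pages;   /* pages sitting on the anon lists */
};

/* Returns how many anon pages this pass would look at. */
static unsigned long anon_pages_to_scan(const struct vm_state *vm,
                                        unsigned long batch)
{
    assert(vm && vm->free_swap_pages >= 0);

    if (vm->free_swap_pages == 0)
        return 0;   /* nowhere to write them: skip, leave them on the list */

    return batch < vm->anon_pages ? batch : vm->anon_pages;
}
```

Once free_swap_pages rises above zero, the very next pass scans the anon lists again without any rescue step.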

> 2) treat anon pages with "excessively long" anon_vma lists as
> nonreclaimable.   "excessively long" here is a sysctl tunable parameter.
> This also addresses problems we've seen with benchmarks and stress
> tests--all cpus spinning on some anon_vma lock.  In "real life", we've
> seen this behavior with file backed pages--spinning on the
> i_mmap_lock--running Oracle workloads with user counts in the few
> thousands.  Again, something we may not need with Rik's vmscan rework.
> If we did want to do this, we'd probably want to address file backed
> pages and add support to bring the pages back from the noreclaim list
> when the number of "mappers" drops below the threshold.  My current
> patch leaves anon pages as non-reclaimable until they're freed, or
> manually scanned via the mechanism introduced by patch 12.

I can see some issues with that patch.  Specifically, if the threshold
is set too high no pages will be affected, and if the threshold is too
low all pages will become non-reclaimable, leading to a false OOM kill.

Not only is it a very big hammer, it's also a rather awkward one...
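
The sensitivity to the threshold is easy to see with a toy model.  The function below is a hypothetical illustration, not code from the patch in question: it counts how many pages would be treated as non-reclaimable for a given threshold, given the number of mappers behind each page.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count pages whose anon_vma list length exceeds the tunable
 * threshold.  `mappers[i]` is the list length behind page i.
 */
static size_t count_nonreclaimable(const unsigned *mappers, size_t n,
                                   unsigned threshold)
{
    assert(mappers || n == 0);

    size_t hit = 0;

    for (size_t i = 0; i < n; i++)
        if (mappers[i] > threshold)
            hit++;

    return hit;
}
```

For list lengths of 1, 50, 2000 and 15000, a threshold of 100000 affects no page, a threshold of 0 affects all four, and only a value in between (say 1000) separates the heavily shared pages from the rest.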

-- 
All Rights Reversed


Re: [patch 00/19] VM pageout scalability improvements

2008-01-03 Thread Lee Schermerhorn
On Wed, 2008-01-02 at 17:41 -0500, [EMAIL PROTECTED] wrote:
> On large memory systems, the VM can spend way too much time scanning
> through pages that it cannot (or should not) evict from memory. Not
> only does it use up CPU time, but it also provokes lock contention
> and can leave large systems under memory pressure in a catatonic state.
> 
> Against 2.6.24-rc6-mm1
> 
> This patch series improves VM scalability by:
> 
> 1) making the locking a little more scalable
> 
> 2) putting filesystem backed, swap backed and non-reclaimable pages
>    onto their own LRUs, so the system only scans the pages that it
>    can/should evict from memory
> 
> 3) switching to SEQ replacement for the anonymous LRUs, so the
>    number of pages that need to be scanned when the system
>    starts swapping is bound to a reasonable number
> 
> The noreclaim patches come verbatim from Lee Schermerhorn and
> Nick Piggin.  I have made a few small fixes to them and left out
> the bits that are no longer needed with split file/anon lists.
> 
> The exception is "Scan noreclaim list for reclaimable pages",
> which should not be needed but could be a useful debugging tool.

Note that patch 14/19 [SHM_LOCK/UNLOCK handling] depends on the
infrastructure introduced by the "Scan noreclaim list for reclaimable
pages" patch.  When SHM_UNLOCKing a shm segment, we call a new
scan_mapping_noreclaim_page() function to check all of the pages in the
segment for reclaimability.  There might be other reasons for the pages
to be non-reclaimable...

So, we can't merge 14/19 as is w/o some of patch 12.  We can probably
eliminate the sysctl and per node sysfs attributes to force a scan.
But, as Rik says, this has been useful for debugging--e.g., periodically
forcing a full rescan while running a stress load.

Also, I should point out that the full noreclaim series includes a
couple of other patches NOT posted here by Rik:

1) treat swap backed pages as nonreclaimable when no swap space is
available.  This addresses a problem we've seen in real life, with
vmscan spending a lot of time trying to reclaim anon/shmem/tmpfs/...
pages only to find that there is no swap space--add_to_swap() fails.
Maybe not a problem with Rik's new anon page handling.  We'll see.  If
we did want to add this filter, we'll need a way to bring back pages
from the noreclaim list that are there only for lack of swap space when
space is added or becomes available.

2) treat anon pages with "excessively long" anon_vma lists as
nonreclaimable.   "excessively long" here is a sysctl tunable parameter.
This also addresses problems we've seen with benchmarks and stress
tests--all cpus spinning on some anon_vma lock.  In "real life", we've
seen this behavior with file backed pages--spinning on the
i_mmap_lock--running Oracle workloads with user counts in the few
thousands.  Again, something we may not need with Rik's vmscan rework.
If we did want to do this, we'd probably want to address file backed
pages and add support to bring the pages back from the noreclaim list
when the number of "mappers" drops below the threshold.  My current
patch leaves anon pages as non-reclaimable until they're freed, or
manually scanned via the mechanism introduced by patch 12.

Lee
> 



[patch 00/19] VM pageout scalability improvements

2008-01-02 Thread linux-kernel
On large memory systems, the VM can spend way too much time scanning
through pages that it cannot (or should not) evict from memory. Not
only does it use up CPU time, but it also provokes lock contention
and can leave large systems under memory pressure in a catatonic state.

Against 2.6.24-rc6-mm1

This patch series improves VM scalability by:

1) making the locking a little more scalable

2) putting filesystem backed, swap backed and non-reclaimable pages
   onto their own LRUs, so the system only scans the pages that it
   can/should evict from memory

3) switching to SEQ replacement for the anonymous LRUs, so the
   number of pages that need to be scanned when the system
   starts swapping is bound to a reasonable number

The noreclaim patches come verbatim from Lee Schermerhorn and
Nick Piggin.  I have made a few small fixes to them and left out
the bits that are no longer needed with split file/anon lists.

The exception is "Scan noreclaim list for reclaimable pages",
which should not be needed but could be a useful debugging tool.

-- 
All Rights Reversed
