Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks
Christoph Lameter wrote:
> On Wed, 22 Aug 2007, Peter Zijlstra wrote:
>
> > > That is an extreme case that AFAIK we currently ignore and could be
> > > avoided with some effort.
> >
> > Its not extreme, not even rare, and its handled now. Its what
> > PF_MEMALLOC is for.
>
> No its not. If you have all pages allocated as anonymous pages and your
> writeout requires more pages than available in the reserves then you are
> screwed either way regardless if you have PF_MEMALLOC set or not.

Only if the _first_ writeout needs more pages. If the sum of all writeouts
needs more pages than you have available, that is fine. After all, buffer
heads and some other metadata are freed on IO completion.

Recursive reclaim will also be able to free the data pages after IO
completion, and really fix the problem.

-- 
Politics is the struggle between those who want to make their country the
best in the world, and those who believe it already is. Each group calls
the other unpatriotic.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks
On Thu, 23 Aug 2007, Andrea Arcangeli wrote:

> On Tue, Aug 21, 2007 at 03:32:25PM -0700, Christoph Lameter wrote:
> > 1. Like in the earlier patchset allow reentry to reclaim under
> >    PF_MEMALLOC if we are out of all memory.
>
> Can you simply tweak on the may_writepage flag only to achieve the
> second pass? We're talking here about a totally non-performance case,
> almost impossible to hit in practice unless you do real weird things,
> and certainly very unlikely to happen. So I'm unsure what's all that
> complexity just to make a regular pass on the lru looking for clean
> pages, something may_writepage=0 already does.

Yes that is what the PF_MEMALLOC patch that I posted before does. This
discussion gets me more and more to thinking that the recursive reclaim on
PF_MEMALLOC is all that is needed for emergency situations (to get out of
the "tight spot").

See http://marc.info/?l=linux-kernel&m=118710219116624&w=2

> If the PF_MEMALLOC is found empty, I agree entering reclaim a second
> time with may_writepage=0 sounds theoretically a good idea (in
> practice it should never be necessary). printk must also be printed to
> warn the user he was risking to deadlock for real and he has to
> increase the min_free_kbytes.

Ok. I can add a printk to that one.

> That sounds a bit risky, there are latency considerations here to
> make, GFP_ATOMIC will run with irq locally disabled and it may hang
> for indefinite amount of time (O(N)). So irq latency may break and it
> may be better to lose a packet once in a while than to hang
> interrupts. If you want to do this you'd probably need to add a new
> GFP_ATOMIC_RECLAIM or similar.

Well we could do the same as for PF_MEMALLOC: print a warning and then
reclaim nevertheless if we cannot fail (we already have a GFP_NOFAIL
flag). It is better to generate a latency than to have the system fail
altogether. However the GFP_ATOMIC reclaim patchset is a bit more invasive
(http://marc.info/?l=linux-mm&m=118710584014150&w=2). Maybe this is too
much churn for the rare need of such a reclaim.
Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks
On Thu, 2007-08-23 at 14:08 +0200, Andrea Arcangeli wrote:
> On Wed, Aug 22, 2007 at 12:09:03AM +0200, Peter Zijlstra wrote:
> > Strictly speaking:
> >
> > if:
> >
> >   page = alloc_page(gfp);
> >
> > fails but:
> >
> >   obj = kmem_cache_alloc(s, gfp);
> >
> > succeeds then its a bug.
>
> Why? this is like saying that if alloc_pages(order=1) fails but
> alloc_pages(order=0) succeeds then it's a bug. Obviously it's not a
> bug.
>
> The only bug is if slab allocations <=4k fails despite
> alloc_pages(order=0) would succeed.

That would be currently true. However I need it to be stricter.

I want to do networked swap. In order to be able to receive writeout
completions when in the PF_MEMALLOC region I need to introduce a new
network state, because the receive path has to operate in a steady state
with limited (bounded) memory use.

Normal networking either consumes memory, or fails to receive anything at
all.

So this new network state will allocate space for a packet, receive the
packet from the NIC, inspect the packet, and toss the packet when it's not
found to be aimed at the VM (i.e. it does not contain a writeout
completion).

(So the total memory consumption of this state is 0 - it always frees what
it takes - but the memory use is non-0 but bounded - it does temporarily
use memory, but will limit itself to never exceed a given maximum.)

Because the network stack runs on the slab allocator in general (both
kmem_cache and kmalloc) I need this extra guarantee so that a slab
allocated from the reserves will not serve objects to some random
non-critical application.

If this is not restricted, this network state can leak memory to outside of
PF_MEMALLOC and will not be stable.

So what I need is:

  kmem_cache_alloc(s, gfp) to fail when alloc_page(gfp) fails

agreeing on the extra condition:

  when kmem_cache_size(s) <= PAGE_SIZE

and the extra note that:

  I only really need it to fail for ALLOC_NO_WATERMARKS, the other
  levels like ALLOC_HIGH and ALLOC_HARDER are not critical.

Which ends up with:

  if the current gfp-context does not allow ALLOC_NO_WATERMARKS
  allocations, and alloc_page() fails, so must kmem_cache_alloc(s,) if
  kmem_cache_size(s) <= PAGE_SIZE.

(yes this leaves jumbo frames broken)
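The rule stated above can be modelled in user space. What follows is a minimal sketch under stated assumptions, not kernel code: struct zone_model, struct cache_model, model_alloc_page() and model_cache_alloc() are hypothetical names invented for illustration, a single "no_watermarks" boolean stands in for the gfp context, and the model only covers caches whose objects fit in one page. The only thing taken from the message is the invariant itself: a slab allocation must fail whenever a page allocation in the same context would fail.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u   /* assumption: 4k pages */

/* Hypothetical zone: free pages plus the reserve kept below "min". */
struct zone_model {
    unsigned long free_pages;
    unsigned long min_pages;
};

/* Hypothetical slab cache; object_size is assumed to be 1..MODEL_PAGE_SIZE. */
struct cache_model {
    size_t object_size;
    unsigned int free_objects;  /* objects left in already allocated slabs */
};

/* Would a page allocation succeed right now in this context? */
static bool page_alloc_would_succeed(const struct zone_model *z,
                                     bool no_watermarks)
{
    if (z->free_pages == 0)
        return false;
    /* Only callers that may ignore watermarks can dip into the reserve. */
    return no_watermarks || z->free_pages > z->min_pages;
}

static bool model_alloc_page(struct zone_model *z, bool no_watermarks)
{
    if (!page_alloc_would_succeed(z, no_watermarks))
        return false;
    z->free_pages--;
    return true;
}

/*
 * The strict rule: refuse to hand out an object, even one that is already
 * sitting in a slab, when the caller could not have obtained a page itself.
 */
static bool model_cache_alloc(struct cache_model *c, struct zone_model *z,
                              bool no_watermarks)
{
    if (!page_alloc_would_succeed(z, no_watermarks))
        return false;
    if (c->free_objects == 0) {
        if (!model_alloc_page(z, no_watermarks))
            return false;
        c->free_objects = (unsigned int)(MODEL_PAGE_SIZE / c->object_size);
    }
    c->free_objects--;
    return true;
}
```

With the zone sitting exactly at its min mark, a normal caller is refused even though the cache still holds free objects, while a caller that may ignore watermarks is served. This is deliberately the literal form of the rule; tracking which slabs were actually taken from the reserve would be a refinement the model does not attempt.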
Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks
On Wed, Aug 22, 2007 at 10:03:45PM +0200, Peter Zijlstra wrote:
> Its not extreme, not even rare, and its handled now. Its what
> PF_MEMALLOC is for.

Agreed. This is the whole point: either you limit the max amount of anon
memory, slab, and alloc_pages a driver can do, or you reserve a pool.

Guess what? In practice, limiting the max ram a driver can eat in
alloc_pages while at the same time limiting the max amount of pages that
can be anon ram, etc. etc., is called "reserving a pool of freepages for
PF_MEMALLOC".

Now in theory we could try a may_writepage=0 second reclaim pass before
using the PF_MEMALLOC pool, but would that make any difference other than
being slower? We can argue about what should be done first, but the
PF_MEMALLOC pool isn't likely to go away with this patch... The only way to
make it go away is to have every subsystem, including tcp incoming, use
mempools for everything, which is too complicated to implement, so we have
to live in the imperfect world that just works well enough. This logic of
falling back to a may_writepage=0 pass will make things a bit more reliable
but certainly not perfect, and it doesn't obsolete the need for the current
code IMHO.
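The equivalence argued above (capping what everyone else may allocate is the same thing as reserving a pool) can be made concrete with a small user-space model of a watermark check. This is a sketch under assumptions: the MODEL_* flag names and model_watermark_ok() are hypothetical, and the halving for ALLOC_HIGH and the further quarter for ALLOC_HARDER reflect my recollection of how the page allocator of that era scaled the min mark, so the exact fractions should be treated as illustrative.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the allocation levels named in the thread. */
enum {
    MODEL_ALLOC_HIGH          = 1 << 0,
    MODEL_ALLOC_HARDER        = 1 << 1,
    MODEL_ALLOC_NO_WATERMARKS = 1 << 2,  /* what PF_MEMALLOC grants */
};

/*
 * May an allocation with these flags proceed when the zone has free_pages
 * pages free and a min watermark of min_mark pages?
 */
static bool model_watermark_ok(long free_pages, long min_mark, int flags)
{
    long min = min_mark;

    /* Reserve users may take anything that is actually there. */
    if (flags & MODEL_ALLOC_NO_WATERMARKS)
        return free_pages > 0;

    /* Assumed scaling: each privilege level lowers the bar a bit. */
    if (flags & MODEL_ALLOC_HIGH)
        min -= min / 2;
    if (flags & MODEL_ALLOC_HARDER)
        min -= min / 4;

    return free_pages > min;
}
```

In this model every page below the (scaled) min mark is unreachable for ordinary callers, so the set of pages under the lowest bar is exactly the pool that only PF_MEMALLOC-style allocations can draw from.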
Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks
On Wed, Aug 22, 2007 at 12:09:03AM +0200, Peter Zijlstra wrote:
> Strictly speaking:
>
> if:
>
>   page = alloc_page(gfp);
>
> fails but:
>
>   obj = kmem_cache_alloc(s, gfp);
>
> succeeds then its a bug.

Why? this is like saying that if alloc_pages(order=1) fails but
alloc_pages(order=0) succeeds then it's a bug. Obviously it's not a
bug.

The only bug is if slab allocations <=4k fails despite
alloc_pages(order=0) would succeed.
Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks
On Tue, Aug 21, 2007 at 03:32:25PM -0700, Christoph Lameter wrote:
> 1. Like in the earlier patchset allow reentry to reclaim under
>    PF_MEMALLOC if we are out of all memory.

Can you simply tweak on the may_writepage flag only to achieve the
second pass? We're talking here about a totally non-performance case,
almost impossible to hit in practice unless you do real weird things,
and certainly very unlikely to happen. So I'm unsure what's all that
complexity just to make a regular pass on the lru looking for clean
pages, something may_writepage=0 already does.

Like Andi said, at most one may_writepage=0 recursion should be
allowed.

If the PF_MEMALLOC is found empty, I agree entering reclaim a second
time with may_writepage=0 sounds theoretically a good idea (in
practice it should never be necessary). A printk must also be issued to
warn the user that he was risking a real deadlock and that he has to
increase min_free_kbytes.

> 2. Do the laundry as here but do not write out laundry directly.
>    Instead move laundry to a new lru style list in the zone structure.
>    This will allow the recursive reclaim to also trigger writeout
>    of pages (what this patchset was supposed to accomplish).

A new lru for this sounds overkill to me, we're talking about deadlock
avoidance, this has absolutely nothing to do with real life 99.% of
runtime of all kernels out there.

> 3. Perform writeback only from kswapd. Make other threads
>    wait on kswapd if memory is low, we can wait and writeback still
>    has to progress.

What does it buy you to think about other threads? The whole trouble is
that PF_MEMALLOC is global: no matter which thread does the writeback
(pdflush, as in the other email to Andi, or kswapd here), it'll still
deadlock the same way. If your intent is to limit the max number of
in-flight writepage calls, that could be achieved with a semaphore, not by
context switching for no good reason. kswapd is needed for atomic
allocations and to pipeline the VM so that the vm more likely runs
asynchronously inside kswapd.

> 4. Then allow reclaim of GFP_ATOMIC allocs (see
>    http://marc.info/?l=linux-kernel&m=118710595617696&w=2). Atomic
>    reclaim can then also put pages onto the zone laundry lists from where
>    it is going to be picked up and written out by kswapd ASAP. This one
>    may be tricky so maybe keep this separate.

That sounds a bit risky, there are latency considerations here to
make, GFP_ATOMIC will run with irq locally disabled and it may hang
for indefinite amount of time (O(N)). So irq latency may break and it
may be better to lose a packet once in a while than to hang
interrupts. If you want to do this you'd probably need to add a new
GFP_ATOMIC_RECLAIM or similar.
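The fallback discussed for point 1 can be sketched as a small user-space model. All names here (struct lru_model, struct task_model, model_reclaim(), model_alloc_under_memalloc()) are hypothetical, and the model assumes that a clean page can be freed without allocating anything while a dirty page cannot be freed without writeout. It shows the three ingredients from the reply: try the reserve first, then one clean-pages-only pass (the may_writepage=0 case), guarded against recursion and accompanied by a warning.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical LRU contents. */
struct lru_model {
    unsigned long clean_pages;
    unsigned long dirty_pages;
};

/* Hypothetical per-task state: set while the fallback pass is running. */
struct task_model {
    bool in_fallback_reclaim;
};

/* Free up to "want" pages; dirty pages are skipped unless may_writepage. */
static unsigned long model_reclaim(struct lru_model *lru, unsigned long want,
                                   bool may_writepage)
{
    unsigned long freed = 0;

    while (freed < want && lru->clean_pages > 0) {
        lru->clean_pages--;
        freed++;
    }
    while (may_writepage && freed < want && lru->dirty_pages > 0) {
        lru->dirty_pages--;
        freed++;
    }
    return freed;
}

/* Allocate one page for a task that is already in PF_MEMALLOC context. */
static bool model_alloc_under_memalloc(struct task_model *t,
                                       struct lru_model *lru,
                                       unsigned long *reserve)
{
    unsigned long freed;

    if (*reserve > 0) {
        (*reserve)--;
        return true;
    }

    /* At most one level of fallback: never re-enter from inside it. */
    if (t->in_fallback_reclaim)
        return false;

    fprintf(stderr, "warning: PF_MEMALLOC reserve exhausted, "
                    "consider increasing min_free_kbytes\n");

    t->in_fallback_reclaim = true;
    freed = model_reclaim(lru, 1, false);   /* may_writepage=0 pass */
    t->in_fallback_reclaim = false;

    return freed > 0;
}
```

In the model the guard flag is never hit by the code itself, because model_reclaim() does not allocate; it is kept to show where the "at most one recursion" limit would sit.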
Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks
On Wed, 2007-08-22 at 13:16 -0700, Christoph Lameter wrote:
> On Wed, 22 Aug 2007, Peter Zijlstra wrote:
>
> > > > As shown, there are cases where there just isn't any memory to
> > > > reclaim.
          ^^^
> > > > Please accept this.
> > > That is an extreme case that AFAIK we currently ignore and could be
> > > avoided with some effort.
> >
> > Its not extreme, not even rare, and its handled now. Its what
> > PF_MEMALLOC is for.
>
> No its not. If you have all pages allocated as anonymous pages and your
> writeout requires more pages than available in the reserves then you are
> screwed either way regardless if you have PF_MEMALLOC set or not.

Christoph, we were talking about memory to reclaim, not about exhausting
the reserves.

> > > The initial PF_MEMALLOC patchset seems to be
> > > still enough to deal with your issues.
> >
> > Take the anonyous workload, user-space will block once the page
> > allocator hits ALLOC_MIN. Network will be able to receive until
> > ALLOC_MIN|ALLOC_HIGH - if the completion doesn't arrive by then it will
> > start dropping all packets until there is memory again. But userspace is
> > wedged and hence will not consume the network traffic, hence we
> > deadlock.
> >
> > Even if there is something to reclaim initially, if the pressure
> > persists that can eventually be exhausted.
>
> Sure ultimately you will end up with pages that are all unreclaimable if
> you reclaim all reclaimable memory.
>
> > > multiple critical tasks on various devices that have various memory
> > > needs.
> > > So multiple critical spots can happen concurrently in multiple
> > > application contexts.
> >
> > yes, reclaim can be unbounded concurrent, and that is one of the
> > (theoretically) major problems we currently have.
>
> So your patchset is not fixing it?

No, and I never said it would. I've been meaning to do one that does
though. Just haven't come around to actually doing it :-/

> > > We have that with PF_MEMALLOC.
> >
> > Exactly. But if you recognise the need for PF_MEMALLOC then what is this
> > argument about?
>
> The PF_MEMALLOC patchset f.e. is about avoiding to go out of
> memory when there is still memory available even if we are doing a
> PF_MEMALLOC allocation and would OOM otherwise.

Right, but as long as there is a need for PF_MEMALLOC there is a need for
the patches I proposed.

> > Networking can currently be seen as having two states:
> >
> >  1 receive packets and consume memory
> >  2 drop all packets (when out of memory)
> >
> > I need a 3rd state:
> >
> >  3 receiving packets but not consuming memory
>
> So far a good idea. If you are not consuming memory then why are the
> allocators involved?

Because I do need to receive some packets, it's just that I'll free them
again. So it won't keep consuming memory. This needs a little pool of
memory in order to operate in a stable state.

It's:
  alloc, receive, inspect, free

total memory use: 0
memory delta: a little

(it's just that you need to be able to receive a significant number of
packets, not 1, due to funny things like ip-defragmentation before you can
be sure to actually receive 1 whole tcp packet - but the idea is the same)

> > Now, I need this state when we're in PF_MEMALLOC territory, because I
> > need to be able to process an unspecified amount of network traffic in
> > order to receive the writeout completion.
> >
> > In order to operate this 3rd network state, some memory is needed in
> > which packets can be received and when deemed not important freed and
> > reused.
> >
> > It needs a bounded amount of memory in order to process an unbounded
> > amount of network traffic.
> >
> > What exactly is not clear about this? If you accept the need for
> > PF_MEMALLOC you surely must also agree that at the point you're using it
> > running reclaim is useless.
>
> Yes looks like you would like to add something to the network layer to
> filter important packets. As long as you stay within PF_MEMALLOC
> boundaries you can allocate and throw packets away. If you want to have a
> reserve that is secure and just for you then you need to take it away from
> the reserves (which in turn will lead reclaim to restore them).

Ah, but also note that _using_ PF_MEMALLOC is the trigger to enter that
3rd network state. These two are tightly coupled. You only need this 3rd
state when under PF_MEMALLOC, otherwise we could just receive normally.

So, my thinking was that, if the current reserves are good enough to keep
the system 'deadlock' free, I can just enlarge the reserves by whatever it
is I need for that network state and we're all good, no?

Why separate these two? If the current reserve is large enough (and
theoretically it is not - but I'm meaning to fix that) it will not consume
the extra memory I added below.

Note how:
  [PATCH 09/10] mm: emergency pool
pushes up the current reserves in a fashion so as to maintain the relative
operating range of the page allocator (distance between min,low,high and
scaling of the wmarks under ALLOC_HIGH|ALLOC_HARDER). Also, failing
Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks
On Wed, 22 Aug 2007, Peter Zijlstra wrote:

> > That is an extreme case that AFAIK we currently ignore and could be
> > avoided with some effort.
>
> Its not extreme, not even rare, and its handled now. Its what
> PF_MEMALLOC is for.

No its not. If you have all pages allocated as anonymous pages and your
writeout requires more pages than available in the reserves then you are
screwed either way regardless if you have PF_MEMALLOC set or not.

> > The initial PF_MEMALLOC patchset seems to be
> > still enough to deal with your issues.
>
> Take the anonymous workload, user-space will block once the page
> allocator hits ALLOC_MIN. Network will be able to receive until
> ALLOC_MIN|ALLOC_HIGH - if the completion doesn't arrive by then it will
> start dropping all packets until there is memory again. But userspace is
> wedged and hence will not consume the network traffic, hence we
> deadlock.
>
> Even if there is something to reclaim initially, if the pressure
> persists that can eventually be exhausted.

Sure ultimately you will end up with pages that are all unreclaimable if
you reclaim all reclaimable memory.

> > multiple critical tasks on various devices that have various memory needs.
> > So multiple critical spots can happen concurrently in multiple
> > application contexts.
>
> yes, reclaim can be unbounded concurrent, and that is one of the
> (theoretically) major problems we currently have.

So your patchset is not fixing it?

> > We have that with PF_MEMALLOC.
>
> Exactly. But if you recognise the need for PF_MEMALLOC then what is this
> argument about?

The PF_MEMALLOC patchset f.e. is about avoiding to go out of memory when
there is still memory available even if we are doing a PF_MEMALLOC
allocation and would OOM otherwise.

> Networking can currently be seen as having two states:
>
>  1 receive packets and consume memory
>  2 drop all packets (when out of memory)
>
> I need a 3rd state:
>
>  3 receiving packets but not consuming memory

So far a good idea. If you are not consuming memory then why are the
allocators involved?

> Now, I need this state when we're in PF_MEMALLOC territory, because I
> need to be able to process an unspecified amount of network traffic in
> order to receive the writeout completion.
>
> In order to operate this 3rd network state, some memory is needed in
> which packets can be received and when deemed not important freed and
> reused.
>
> It needs a bounded amount of memory in order to process an unbounded
> amount of network traffic.
>
> What exactly is not clear about this? If you accept the need for
> PF_MEMALLOC you surely must also agree that at the point you're using it
> running reclaim is useless.

Yes looks like you would like to add something to the network layer to
filter important packets. As long as you stay within PF_MEMALLOC
boundaries you can allocate and throw packets away. If you want to have
a reserve that is secure and just for you then you need to take it away
from the reserves (which in turn will lead reclaim to restore them).

> > > Also, failing a memory allocation isn't bad, why are you so worried
> > > about that? It happens all the time.
> >
> > Its a performance impact and plainly does not make sense if there is
> > reclaimable memory available. The common action of the vm is to reclaim if
> > there is a demand for memory. Now we suddenly abandon that approach?
>
> I'm utterly confused by this, on one hand you recognise the need for
> PF_MEMALLOC but on the other hand you're saying its not needed and
> anybody needing memory (even reclaim itself) should use reclaim.

The VM reclaims memory on demand but in exceptional limited cases where
we cannot do so we use the reserves. I am sure you know this.
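The "3rd state" quoted above (a bounded amount of memory handling an unbounded amount of traffic) can be illustrated with a toy buffer pool. This is a sketch under assumptions, not the networking code being proposed: POOL_SIZE, struct pkt_pool, struct packet, the critical flag, pool_receive() and pool_complete() are all hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stdbool.h>

#define POOL_SIZE 4 /* hypothetical bound on emergency receive buffers */

struct pkt_pool {
    int in_use; /* buffers currently held by queued packets */
};

struct packet {
    bool critical; /* e.g. belongs to the socket carrying writeout traffic */
};

/*
 * Receive one packet in the emergency state. A buffer is taken from the
 * pool to inspect the packet; packets deemed not important give the
 * buffer straight back, so they never increase memory use.
 * Returns true if the packet was kept and still holds a buffer.
 */
static bool pool_receive(struct pkt_pool *pool, const struct packet *pkt)
{
    assert(pool->in_use >= 0 && pool->in_use <= POOL_SIZE);
    if (pool->in_use == POOL_SIZE)
        return false;       /* pool exhausted: drop without allocating */
    pool->in_use++;         /* buffer needed just to look at the packet */
    if (!pkt->critical) {
        pool->in_use--;     /* not important: free and reuse the buffer */
        return false;
    }
    return true;            /* kept until the consumer has processed it */
}

/* The consumer finished with one kept packet; its buffer returns to the pool. */
static void pool_complete(struct pkt_pool *pool)
{
    assert(pool->in_use > 0);
    pool->in_use--;
}
```

However many unimportant packets arrive, in_use never exceeds POOL_SIZE, which is the bounded-memory property the quoted text asks for. Where the pool's pages come from (the general reserves or a separate one) is the open question in the reply above and is not modelled here.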
Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks
On Wed, 2007-08-22 at 12:04 -0700, Christoph Lameter wrote:
> On Wed, 22 Aug 2007, Peter Zijlstra wrote:
>
> > Its unavoidable, at some point it just happens. Also using reclaim
> > doesn't seem like the ideal way to get out of live-locks since reclaim
> > itself can live-lock on these large boxen.
>
> If reclaim can live lock then it needs to be fixed.

Riel is working on that.

> > As shown, there are cases where there just isn't any memory to reclaim.
> > Please accept this.
>
> That is an extreme case that AFAIK we currently ignore and could be
> avoided with some effort.

Its not extreme, not even rare, and its handled now. Its what
PF_MEMALLOC is for.

> The initial PF_MEMALLOC patchset seems to be
> still enough to deal with your issues.

No it isnt.

Take the anonymous workload, user-space will block once the page
allocator hits ALLOC_MIN. Network will be able to receive until
ALLOC_MIN|ALLOC_HIGH - if the completion doesn't arrive by then it will
start dropping all packets until there is memory again. But userspace is
wedged and hence will not consume the network traffic, hence we
deadlock.

Even if there is something to reclaim initially, if the pressure
persists that can eventually be exhausted.

> > Also, by reclaiming memory and getting out of the tight spot you give
> > the rest of the system access to that memory, and it can be used for
> > other things than getting out of the tight spot.
>
> The rest of the system may have their own tight spots. Language like "the
> tight spot" sets up all sort of alarms over here since you seem to be
> thinking about a system doing a single task.

reclaim

> The system may be handling
> multiple critical tasks on various devices that have various memory needs.
> So multiple critical spots can happen concurrently in multiple
> application contexts.

yes, reclaim can be unbounded concurrent, and that is one of the
(theoretically) major problems we currently have.

> > You really want a separate allocation state that allows only reclaim to
> > access memory.
>
> We have that with PF_MEMALLOC.

Exactly. But if you recognise the need for PF_MEMALLOC then what is this
argument about?

Networking can currently be seen as having two states:

 1 receive packets and consume memory
 2 drop all packets (when out of memory)

I need a 3rd state:

 3 receiving packets but not consuming memory

Now, I need this state when we're in PF_MEMALLOC territory, because I
need to be able to process an unspecified amount of network traffic in
order to receive the writeout completion.

In order to operate this 3rd network state, some memory is needed in
which packets can be received and when deemed not important freed and
reused.

It needs a bounded amount of memory in order to process an unbounded
amount of network traffic.

What exactly is not clear about this? If you accept the need for
PF_MEMALLOC you surely must also agree that at the point you're using it
running reclaim is useless.

> > Also, failing a memory allocation isn't bad, why are you so worried
> > about that? It happens all the time.
>
> Its a performance impact and plainly does not make sense if there is
> reclaimable memory available. The common action of the vm is to reclaim if
> there is a demand for memory. Now we suddenly abandon that approach?

I'm utterly confused by this, on one hand you recognise the need for
PF_MEMALLOC but on the other hand you're saying its not needed and
anybody needing memory (even reclaim itself) should use reclaim.
Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks
On Wed, 22 Aug 2007, Ingo Molnar wrote:

> Could you outline the "big picture" as you see it? To me your argument
> that reclaim can always be done instantly and that the cases where it
> cannot be done are pathological and need to be avoided is fundamentally
> dangerous and quite a bit short-sighted at first glance.

That is a bit overdrawing my argument. The issues that Peter saw can be
fixed by allowing recursive reclaim (see the earlier patchset). The rest
is so far sugar on top or building extreme cases where we already have
trouble.

> The big picture way to think about this is the following: the free page
> pool is the "cache" of the MM. It's what "greases" the mechanism and
> bridges the inevitable reclaim latency and makes "atomic memory"
> available to the reclaim mechanism itself. We _cannot_ remove that cache
> without a conceptual replacement (or a _very_ robust argument and proof
> that the free pages pool is not needed at all - this would be a major
> design change (and a stupid mistake IMO)). Your patchset, in essence,
> tries to claim that we dont really need this cache and that all that
> matters is to keep enough clean pagecache pages around. That approach
> misses the full picture and i dont think we can progress without
> agreeing on the fundamentals first.

The patchset attempts to deal with the reserves in a more intelligent
way in order not to fail when this pool becomes exhausted because some
device needs a lot of memory in the writeout path.

> That "cache" cannot be handled in your scheme: a fully or mostly
> anonymous workload (tons of apps are like that) instantly destroys the
> "there is always a minimal amount of atomically reclaimable pages
> around" property of freelists, and this cannot be talked or tweaked
> around by twiddling any existing property of anonymous reclaim.

An extreme anonymous workload like discussed here can even cause the
current VM to fail. Realistically at least portions of the executable
and various slab caches will remain in memory in addition to the
reserves.

> Anonymous memory is dirty and takes ages to reclaim. The fact that your
> patchset causes an easy anonymous OOM further underlines this flaw of
> your thinking. Not making anonymous workloads OOM is the _hardest_ part
> of the MM, by far. Pagecache reclaim is a breeze in comparison :-)

The central flaw in my thinking was the switching off of PF_MEMALLOC on
the writeout path instead of allowing recursive PF_MEMALLOC reclaim as
in the first patch. But the first patchset did not have that flaw.
Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks
On Wed, 22 Aug 2007, Peter Zijlstra wrote:

> Its unavoidable, at some point it just happens. Also using reclaim
> doesn't seem like the ideal way to get out of live-locks since reclaim
> itself can live-lock on these large boxen.

If reclaim can live lock then it needs to be fixed.

> As shown, there are cases where there just isn't any memory to reclaim.
> Please accept this.

That is an extreme case that AFAIK we currently ignore and could be
avoided with some effort. The initial PF_MEMALLOC patchset seems to be
still enough to deal with your issues.

> Also, by reclaiming memory and getting out of the tight spot you give
> the rest of the system access to that memory, and it can be used for
> other things than getting out of the tight spot.

The rest of the system may have their own tight spots. Language like "the
tight spot" sets up all sort of alarms over here since you seem to be
thinking about a system doing a single task. The system may be handling
multiple critical tasks on various devices that have various memory needs.
So multiple critical spots can happen concurrently in multiple
application contexts.

> You really want a separate allocation state that allows only reclaim to
> access memory.

We have that with PF_MEMALLOC.

> Also, failing a memory allocation isn't bad, why are you so worried
> about that? It happens all the time.

Its a performance impact and plainly does not make sense if there is
reclaimable memory available. The common action of the vm is to reclaim if
there is a demand for memory. Now we suddenly abandon that approach?
Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks
* Christoph Lameter <[EMAIL PROTECTED]> wrote:

> > I want slab to fail when a similar page alloc would fail, no magic.
>
> Yes I know. I do not want allocations to fail but that reclaim occurs
> in order to avoid failing any allocation. We need provisions that make
> sure that we never get into such a bad memory situation that would
> cause severe slowness and usually end up in a livelock anyways.

Could you outline the "big picture" as you see it? To me your argument
that reclaim can always be done instantly and that the cases where it
cannot be done are pathological and need to be avoided is fundamentally
dangerous and quite a bit short-sighted at first glance.

The big picture way to think about this is the following: the free page
pool is the "cache" of the MM. It's what "greases" the mechanism and
bridges the inevitable reclaim latency and makes "atomic memory"
available to the reclaim mechanism itself. We _cannot_ remove that cache
without a conceptual replacement (or a _very_ robust argument and proof
that the free pages pool is not needed at all - this would be a major
design change (and a stupid mistake IMO)). Your patchset, in essence,
tries to claim that we dont really need this cache and that all that
matters is to keep enough clean pagecache pages around. That approach
misses the full picture and i dont think we can progress without
agreeing on the fundamentals first.

That "cache" cannot be handled in your scheme: a fully or mostly
anonymous workload (tons of apps are like that) instantly destroys the
"there is always a minimal amount of atomically reclaimable pages
around" property of freelists, and this cannot be talked or tweaked
around by twiddling any existing property of anonymous reclaim.

Anonymous memory is dirty and takes ages to reclaim. The fact that your
patchset causes an easy anonymous OOM further underlines this flaw of
your thinking. Not making anonymous workloads OOM is the _hardest_ part
of the MM, by far. Pagecache reclaim is a breeze in comparison :-)

So there is a large and fundamental rift between having pages on the
freelist (instantly available to any context) and having them on the
(current) LRU where they might or might not be clean, etc. The freelists
are an implicit guarantee of buffering and atomicity and they can and do
save the day if everything else fails to keep stuff insta-freeable.

(And then we havent even considered the performance and scalability
differences between picking from the pcp freelists versus picking pages
from the LRU, havent considered the better higher-order page allocation
property of the buddy pool and havent considered the atomicity of
in-irq-handler allocations.)

	Ingo
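The rift described above, between pages on the freelist (available to any context) and pages on the LRU (available only to a context that can run reclaim, and only if clean), can be made concrete with a toy model. This is purely illustrative: struct toy_mm, its counters and toy_alloc() are hypothetical, and real allocation involves far more than three integers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical accounting of where pages currently sit. */
struct toy_mm {
    int free_pages; /* on the freelist: instantly available */
    int clean_lru;  /* clean pages on the LRU: need a reclaim pass */
    int dirty_lru;  /* dirty or anonymous pages: need writeout first */
};

/*
 * Try to obtain one page. can_reclaim is false for atomic contexts
 * (for example an irq handler), which may only use the freelist.
 */
static bool toy_alloc(struct toy_mm *mm, bool can_reclaim)
{
    assert(mm->free_pages >= 0 && mm->clean_lru >= 0 && mm->dirty_lru >= 0);
    if (mm->free_pages > 0) {
        mm->free_pages--;   /* fast path: no reclaim latency */
        return true;
    }
    if (can_reclaim && mm->clean_lru > 0) {
        mm->clean_lru--;    /* slow path: take a clean page off the LRU */
        return true;
    }
    return false;           /* only dirty pages left, or context cannot reclaim */
}
```

With an empty freelist an atomic caller fails even though clean LRU pages exist, and with a mostly anonymous (dirty) workload every caller fails; those are the two cases the message argues the free page pool exists to cover.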
Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks
On Tue, 2007-08-21 at 15:43 -0700, Christoph Lameter wrote:
> On Wed, 22 Aug 2007, Peter Zijlstra wrote:
>
> > Also, all I want is for slab to honour gfp flags like page allocation
> > does, nothing more, nothing less.
> >
> > (well, actually slightly less, since I'm only really interested in the
> > ALLOC_MIN|ALLOC_HIGH|ALLOC_HARDER -> ALLOC_NO_WATERMARKS transition and
> > not all higher ones)
>
> I am still not sure what that brings you. There may be multiple
> PF_MEMALLOC going on at the same time. On a large system with N cpus
> there may be more than N of these that can steal objects from one another.

Yes, quite aware of that, and have ideas on how to properly fix that.
Once it is, the reserves can be shrunk too, perhaps you can work on
this?

> A NUMA system will be shot anyways if memory gets that problematic to
> handle since the OS cannot effectively place memory if all zones are
> overallocated so that only a few pages are left.

Also not a new problem.

> > I want slab to fail when a similar page alloc would fail, no magic.
>
> Yes I know. I do not want allocations to fail but that reclaim occurs in
> order to avoid failing any allocation. We need provisions that
> make sure that we never get into such a bad memory situation that would
> cause severe slowness and usually end up in a livelock anyways.

Its unavoidable, at some point it just happens. Also using reclaim
doesn't seem like the ideal way to get out of live-locks since reclaim
itself can live-lock on these large boxen.

> > > > Anonymous pages are there to stay, and we cannot tell people how to
> > > > use them. So we need some free or freeable pages in order to avoid the
> > > > vm deadlock that arises from all memory dirty.
> > >
> > > No one is trying to abolish Anonymous pages. Free memory is readily
> > > available on demand if one calls reclaim. Your scheme introduces complex
> > > negotiations over a few scraps of memory when large amounts of memory
> > > would still be readily available if one would do the right thing and call
> > > into reclaim.
> >
> > This is the thing I contend, there need not be large amounts of memory
> > around. In my test prog the hot code path fits into a single page, the
> > rest can be anonymous.
>
> Thats a bit extreme. We need to make sure that there are larger amounts
> of memory around. Pages are used for all sorts of short term uses (like
> slab shrinking etc etc.). If memory is that low that a single page matters
> then we are in very bad shape anyways.

Yes we are, but its a legitimate situation. Denying it won't get us very
far.

Also placing a large bound on anonymous memory usage is not going to be
appreciated by the userspace people.

Slab cache will also be at a minimum if the pressure persists for a
while.

> > > Sounds like you would like to change the way we handle memory in general
> > > in the VM? Reclaim (and thus finding freeable pages) is basic to Linux
> > > memory management.
> >
> > Not quite, currently we have free pages in the reserves, if you want to
> > replace some (or all) of that by freeable pages then that is a change.
>
> We have free pages primarily to optimize the allocation. Meaning we do not
> have to run reclaim on every call. We want to use all of memory. The
> reserves are there for the case that we cannot call into reclaim.
> The easy
> solution if that is problematic is to enhance the reclaim to work in the
> critical situations that we care about.

As shown, there are cases where there just isn't any memory to reclaim.
Please accept this.

Also, by reclaiming memory and getting out of the tight spot you give
the rest of the system access to that memory, and it can be used for
other things than getting out of the tight spot.

You really want a separate allocation state that allows only reclaim to
access memory.

> > > Sorry I just got into this a short time ago and I may need a few cycles
> > > to get this all straight. An approach that uses memory instead of
> > > ignoring available memory is certainly better.
> >
> > Sure if and when possible. There will always be need to fall back to the
> > reserves.
>
> Maybe. But we can certainly avoid that as much as possible which would
> also increase our ability to use all available memory instead of leaving
> some of it unused.
>
> > A bit off-topic, re that reclaim from atomic context:
> > Currently we try to hold spinlocks only for short periods of time so
> > that reclaim can be preempted, if you run all of reclaim from a
> > non-preemptible context you get very large preemption latencies and if
> > done from int context it'd also generate large int latencies.
>
> If you call into the page allocator from an interrupt context then you are
> already in bad shape since we may check pcps lists and then potentially
> have to traverse the zonelists and check all sorts of things.

Only an issue on these obscenely large NUMA boxen, normal machines don't
have large zone lists. No reason to hurt the small boxen in favour of
the large boxen.

> If we would implement atomic reclaim then the reserves
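The ALLOC_MIN|ALLOC_HIGH|ALLOC_HARDER -> ALLOC_NO_WATERMARKS transition mentioned in the message above can be pictured with a small user-space model of a watermark check. The flag names follow the thread; the flag values, the function watermark_ok() and the exact scaling (half for ALLOC_HIGH, a further quarter for ALLOC_HARDER) are assumptions made for illustration and should be checked against the kernel source of the relevant version.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag names as used in the thread; the values are arbitrary here. */
#define ALLOC_MIN           0x01u
#define ALLOC_HIGH          0x02u
#define ALLOC_HARDER        0x04u
#define ALLOC_NO_WATERMARKS 0x08u

/*
 * Hypothetical check: may an allocation with these flags proceed when
 * free_pages remain and the zone's minimum watermark is min_mark?
 * Each extra flag lowers the threshold; ALLOC_NO_WATERMARKS (the
 * PF_MEMALLOC case) ignores the watermark and may dig into the reserves.
 */
static bool watermark_ok(long free_pages, long min_mark, unsigned int flags)
{
    long min = min_mark;

    assert(min_mark >= 0);
    if (flags & ALLOC_NO_WATERMARKS)
        return free_pages > 0;
    if (flags & ALLOC_HIGH)
        min -= min / 2;     /* assumed scaling: allow down to half */
    if (flags & ALLOC_HARDER)
        min -= min / 4;     /* assumed scaling: a further quarter off */
    return free_pages > min;
}
```

With min_mark = 100, an ordinary ALLOC_MIN request stops succeeding at 100 free pages while ALLOC_MIN|ALLOC_HIGH keeps going down to 50, which matches the picture elsewhere in the thread of user-space blocking first and the network receive path continuing a little longer.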
Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks
On Wed, 2007-08-22 at 12:04 -0700, Christoph Lameter wrote:
> On Wed, 22 Aug 2007, Peter Zijlstra wrote:
>
> > It's unavoidable, at some point it just happens. Also using reclaim doesn't seem like the ideal way to get out of live-locks since reclaim itself can live-lock on these large boxen.
>
> If reclaim can live lock then it needs to be fixed. Riel is working on that.
>
> > As shown, there are cases where there just isn't any memory to reclaim. Please accept this.
>
> That is an extreme case that AFAIK we currently ignore and could be avoided with some effort.

It's not extreme, not even rare, and it's handled now. It's what PF_MEMALLOC is for.

> The initial PF_MEMALLOC patchset seems to be still enough to deal with your issues.

No it isn't.

Take the anonymous workload: user-space will block once the page allocator hits ALLOC_MIN. Network will be able to receive until ALLOC_MIN|ALLOC_HIGH - if the completion doesn't arrive by then it will start dropping all packets until there is memory again. But userspace is wedged and hence will not consume the network traffic, hence we deadlock.

Even if there is something to reclaim initially, if the pressure persists that can eventually be exhausted.

> > Also, by reclaiming memory and getting out of the tight spot you give the rest of the system access to that memory, and it can be used for other things than getting out of the tight spot.
>
> The rest of the system may have their own tight spots. Language like "the tight spot" sets off all sorts of alarms over here since you seem to be thinking about a system doing a single task. The system may be handling multiple critical tasks on various devices that have various memory needs. So multiple critical spots can happen concurrently in multiple application contexts.

Yes, reclaim can be unbounded concurrent, and that is one of the (theoretically) major problems we currently have.

> > You really want a separate allocation state that allows only reclaim to access memory.
>
> We have that with PF_MEMALLOC.

Exactly. But if you recognise the need for PF_MEMALLOC then what is this argument about?

Networking can currently be seen as having two states:

 1. receive packets and consume memory
 2. drop all packets (when out of memory)

I need a 3rd state:

 3. receiving packets but not consuming memory

Now, I need this state when we're in PF_MEMALLOC territory, because I need to be able to process an unspecified amount of network traffic in order to receive the writeout completion.

In order to operate this 3rd network state, some memory is needed in which packets can be received and, when deemed not important, freed and reused. It needs a bounded amount of memory in order to process an unbounded amount of network traffic.

What exactly is not clear about this? If you accept the need for PF_MEMALLOC you surely must also agree that at the point you're using it, running reclaim is useless.

> > Also, failing a memory allocation isn't bad, why are you so worried about that? It happens all the time.
>
> It's a performance impact and plainly does not make sense if there is reclaimable memory available. The common action of the vm is to reclaim if there is a demand for memory. Now we suddenly abandon that approach?

I'm utterly confused by this: on one hand you recognise the need for PF_MEMALLOC, but on the other hand you're saying it's not needed and anybody needing memory (even reclaim itself) should use reclaim.
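The third network state described above (a bounded amount of memory serving an unbounded amount of traffic) can be sketched as a small userspace model. This is only an illustration under assumptions: `rx_pool`, `rx_process`, `RX_POOL_SIZE` and the per-packet "critical" flag are hypothetical names invented here, not part of the networking stack or of any patch in this thread. The model is also serial, so it never holds more than one buffer at a time, whereas a real receive path would have several in flight.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RX_POOL_SIZE 4          /* hypothetical size of the bounded reserve */

struct rx_pool {
    size_t in_use;              /* buffers currently holding a packet */
    size_t peak;                /* highest in_use value ever observed */
};

struct rx_stats {
    size_t delivered;           /* critical packets handed to the kernel */
    size_t dropped;             /* packets freed right after inspection */
};

/* Take one buffer from the reserve; fails when the reserve is exhausted. */
static bool rx_buf_get(struct rx_pool *pool)
{
    if (pool->in_use == RX_POOL_SIZE)
        return false;
    pool->in_use++;
    if (pool->in_use > pool->peak)
        pool->peak = pool->in_use;
    return true;
}

/* Return a buffer to the reserve so that it can be reused. */
static void rx_buf_put(struct rx_pool *pool)
{
    assert(pool->in_use > 0);
    pool->in_use--;
}

/*
 * Process npackets packets. is_critical[i] says whether packet i belongs
 * to the writeout path (for example the completion the VM is waiting for).
 * Each packet is received into a reserve buffer, inspected, and the buffer
 * is returned before the next packet arrives, so memory use stays bounded
 * no matter how large npackets is.
 */
static struct rx_stats rx_process(struct rx_pool *pool,
                                  const bool *is_critical, size_t npackets)
{
    struct rx_stats stats = {0, 0};

    for (size_t i = 0; i < npackets; i++) {
        if (!rx_buf_get(pool)) {
            stats.dropped++;    /* state 2: no buffer available at all */
            continue;
        }
        if (is_critical[i])
            stats.delivered++;  /* consumed inside the kernel */
        else
            stats.dropped++;    /* not important: free and reuse */
        rx_buf_put(pool);
    }
    return stats;
}
```

The point of the sketch is only that the packet count and the buffer count are decoupled: the traffic volume affects `dropped`, never `peak`.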
Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks
On Wed, 22 Aug 2007, Peter Zijlstra wrote:

> > That is an extreme case that AFAIK we currently ignore and could be avoided with some effort.
>
> It's not extreme, not even rare, and it's handled now. It's what PF_MEMALLOC is for.

No it's not. If you have all pages allocated as anonymous pages and your writeout requires more pages than are available in the reserves, then you are screwed either way, regardless of whether you have PF_MEMALLOC set or not.

> > The initial PF_MEMALLOC patchset seems to be still enough to deal with your issues.
>
> Take the anonymous workload: user-space will block once the page allocator hits ALLOC_MIN. Network will be able to receive until ALLOC_MIN|ALLOC_HIGH - if the completion doesn't arrive by then it will start dropping all packets until there is memory again. But userspace is wedged and hence will not consume the network traffic, hence we deadlock.
>
> Even if there is something to reclaim initially, if the pressure persists that can eventually be exhausted.

Sure, ultimately you will end up with pages that are all unreclaimable if you reclaim all reclaimable memory.

> > multiple critical tasks on various devices that have various memory needs. So multiple critical spots can happen concurrently in multiple application contexts.
>
> Yes, reclaim can be unbounded concurrent, and that is one of the (theoretically) major problems we currently have.

So your patchset is not fixing it?

> > We have that with PF_MEMALLOC.
>
> Exactly. But if you recognise the need for PF_MEMALLOC then what is this argument about?

The PF_MEMALLOC patchset, for example, is about avoiding an out-of-memory failure when memory is still available, even if we are doing a PF_MEMALLOC allocation and would OOM otherwise.

> Networking can currently be seen as having two states:
>
>  1. receive packets and consume memory
>  2. drop all packets (when out of memory)
>
> I need a 3rd state:
>
>  3. receiving packets but not consuming memory

So far a good idea. If you are not consuming memory then why are the allocators involved?

> Now, I need this state when we're in PF_MEMALLOC territory, because I need to be able to process an unspecified amount of network traffic in order to receive the writeout completion.
>
> In order to operate this 3rd network state, some memory is needed in which packets can be received and, when deemed not important, freed and reused. It needs a bounded amount of memory in order to process an unbounded amount of network traffic.
>
> What exactly is not clear about this? If you accept the need for PF_MEMALLOC you surely must also agree that at the point you're using it, running reclaim is useless.

Yes, it looks like you would like to add something to the network layer to filter important packets. As long as you stay within PF_MEMALLOC boundaries you can allocate and throw packets away. If you want to have a reserve that is secure and just for you, then you need to take it away from the reserves (which in turn will lead reclaim to restore them).

> > > Also, failing a memory allocation isn't bad, why are you so worried about that? It happens all the time.
> >
> > It's a performance impact and plainly does not make sense if there is reclaimable memory available. The common action of the vm is to reclaim if there is a demand for memory. Now we suddenly abandon that approach?
>
> I'm utterly confused by this: on one hand you recognise the need for PF_MEMALLOC, but on the other hand you're saying it's not needed and anybody needing memory (even reclaim itself) should use reclaim.

The VM reclaims memory on demand, but in exceptional, limited cases where we cannot do so we use the reserves. I am sure you know this.
Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks
On Wed, 22 Aug 2007, Peter Zijlstra wrote:

> Also, all I want is for slab to honour gfp flags like page allocation does, nothing more, nothing less.
>
> (well, actually slightly less, since I'm only really interested in the ALLOC_MIN|ALLOC_HIGH|ALLOC_HARDER -> ALLOC_NO_WATERMARKS transition and not all higher ones)

I am still not sure what that brings you. There may be multiple PF_MEMALLOC allocations going on at the same time. On a large system with N cpus there may be more than N of these that can steal objects from one another. A NUMA system will be shot anyway if memory gets that problematic to handle, since the OS cannot effectively place memory if all zones are overallocated so that only a few pages are left.

> I want slab to fail when a similar page alloc would fail, no magic.

Yes, I know. I do not want allocations to fail; I want reclaim to occur in order to avoid failing any allocation. We need provisions that make sure that we never get into such a bad memory situation, one that would cause severe slowness and usually end up in a livelock anyway.

> > > Anonymous pages are there to stay, and we cannot tell people how to use them. So we need some free or freeable pages in order to avoid the vm deadlock that arises from all memory dirty.
> >
> > No one is trying to abolish anonymous pages. Free memory is readily available on demand if one calls reclaim. Your scheme introduces complex negotiations over a few scraps of memory when large amounts of memory would still be readily available if one would do the right thing and call into reclaim.
>
> This is the thing I contend, there need not be large amounts of memory around. In my test prog the hot code path fits into a single page, the rest can be anonymous.

That's a bit extreme. We need to make sure that there are larger amounts of memory around. Pages are used for all sorts of short-term uses (like slab shrinking etc.). If memory is so low that a single page matters then we are in very bad shape anyway.

> > Sounds like you would like to change the way we handle memory in general in the VM? Reclaim (and thus finding freeable pages) is basic to Linux memory management.
>
> Not quite, currently we have free pages in the reserves, if you want to replace some (or all) of that by freeable pages then that is a change.

We have free pages primarily to optimize the allocation, meaning we do not have to run reclaim on every call. We want to use all of memory. The reserves are there for the case in which we cannot call into reclaim. The easy solution, if that is problematic, is to enhance reclaim to work in the critical situations that we care about.

> > Sorry I just got into this a short time ago and I may need a few cycles to get this all straight. An approach that uses memory instead of ignoring available memory is certainly better.
>
> Sure if and when possible. There will always be need to fall back to the reserves.

Maybe. But we can certainly avoid that as much as possible, which would also increase our ability to use all available memory instead of leaving some of it unused.

> A bit off-topic, re that reclaim from atomic context:
> Currently we try to hold spinlocks only for short periods of time so that reclaim can be preempted, if you run all of reclaim from a non-preemptible context you get very large preemption latencies and if done from int context it'd also generate large int latencies.

If you call into the page allocator from an interrupt context then you are already in bad shape, since we may check the pcp lists and then potentially have to traverse the zonelists and check all sorts of things. If we implemented atomic reclaim then the reserves may become a latency optimization. At least we will not fail anymore if the reserves are out.
Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks
On Tue, 21 Aug 2007, Rik van Riel wrote:

> Christoph Lameter wrote:
>
> > I want general improvements to reclaim to address the issues that you see and other issues related to reclaim instead of the strange code that makes PF_MEMALLOC allocs compete for allocations from a single slab and putting logic into the kernel to decide which allocs to fail. We can reclaim after all. It's just a matter of finding the right way to do this.
>
> The simplest way of achieving that would be to allow recursion of the page reclaim code, under the condition that the second level call can only reclaim clean pages, while the "outer" call does what the VM does today.

Yes, that is what the precursor to this patchset does. See http://marc.info/?l=linux-mm=118710207203449=2

This one did not even come up to the level of the earlier one. Sigh.

The way forward may be:

1. Like in the earlier patchset, allow reentry to reclaim under PF_MEMALLOC if we are out of all memory.

2. Do the laundry as here but do not write out laundry directly. Instead move laundry to a new lru style list in the zone structure. This will allow the recursive reclaim to also trigger writeout of pages (what this patchset was supposed to accomplish).

3. Perform writeback only from kswapd. Make other threads wait on kswapd if memory is low; we can wait, and writeback still has to progress.

4. Then allow reclaim of GFP_ATOMIC allocs (see http://marc.info/?l=linux-kernel=118710595617696=2). Atomic reclaim can then also put pages onto the zone laundry lists, from where they are going to be picked up and written out by kswapd ASAP. This one may be tricky so maybe keep this separate.
Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks
On Tue, 2007-08-21 at 14:29 -0700, Christoph Lameter wrote:
> On Tue, 21 Aug 2007, Peter Zijlstra wrote:
>
> > It quickly ends up with all of memory in the laundry list and then recursing into __alloc_pages which will fail to make progress and OOMs.
>
> H... Okay that needs to be addressed. Reserves need to be used and we only should enter reclaim if that runs out (like the first patch that I did).
>
> > But aside from the numerous issues with the patch set as presented, I'm not seeing the big picture, why are you doing this.
>
> I want general improvements to reclaim to address the issues that you see and other issues related to reclaim instead of the strange code that makes PF_MEMALLOC allocs compete for allocations from a single slab and putting logic into the kernel to decide which allocs to fail. We can reclaim after all. It's just a matter of finding the right way to do this.

The latest patch I posted got rid of that global slab.

Also, all I want is for slab to honour gfp flags like page allocation does, nothing more, nothing less.

(well, actually slightly less, since I'm only really interested in the ALLOC_MIN|ALLOC_HIGH|ALLOC_HARDER -> ALLOC_NO_WATERMARKS transition and not all higher ones)

I want slab to fail when a similar page alloc would fail, no magic.

Strictly speaking, if:

  page = alloc_page(gfp);

fails but:

  obj = kmem_cache_alloc(s, gfp);

succeeds, then it's a bug.

But I'm not actually needing it that strict: just the ALLOC_NO_WATERMARKS part needs to be done; ALLOC_HARDER and ALLOC_HIGH may fudge a bit.

> > Anonymous pages are there to stay, and we cannot tell people how to use them. So we need some free or freeable pages in order to avoid the vm deadlock that arises from all memory dirty.
>
> No one is trying to abolish Anonymous pages. Free memory is readily available on demand if one calls reclaim. Your scheme introduces complex negotiations over a few scraps of memory when large amounts of memory would still be readily available if one would do the right thing and call into reclaim.

This is the thing I contend, there need not be large amounts of memory around. In my test prog the hot code path fits into a single page, the rest can be anonymous.

> > 'Optimizing' this by switching to freeable pages has mainly disadvantages IMHO, finding them scrambles LRU order and complexifies reclaim and all that for a relatively small gain in space for clean pagecache pages.
>
> Sounds like you would like to change the way we handle memory in general in the VM? Reclaim (and thus finding freeable pages) is basic to Linux memory management.

Not quite, currently we have free pages in the reserves, if you want to replace some (or all) of that by freeable pages then that is a change.

I'm just using the reserves.

> > Please, stop writing patches and write down a solid proposal of how you envision the VM working in the various scenarios and why it's better than the current approach.
>
> Sorry I just got into this a short time ago and I may need a few cycles to get this all straight. An approach that uses memory instead of ignoring available memory is certainly better.

Sure if and when possible. There will always be need to fall back to the reserves.

A bit off-topic, re that reclaim from atomic context:
Currently we try to hold spinlocks only for short periods of time so that reclaim can be preempted. If you run all of reclaim from a non-preemptible context you get very large preemption latencies, and if done from int context it'd also generate large int latencies.
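The consistency rule stated in this message (a slab allocation should not succeed where an equivalent page allocation would fail) can be illustrated with a toy userspace model. Everything here is an assumption made for illustration: the flag values, `struct zone_model`, and the functions `page_alloc_ok`, `slab_alloc_lax` and `slab_alloc_strict` are hypothetical and are not kernel code. The watermark arithmetic is a simplification loosely modelled on the allocator's watermark check, not a copy of it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values for this model; the kernel's values differ. */
enum {
    ALLOC_NO_WATERMARKS = 1 << 0,   /* caller may use all reserves */
    ALLOC_HIGH          = 1 << 1,   /* caller may dip somewhat below min */
    ALLOC_HARDER        = 1 << 2,   /* caller may dip a bit further */
};

struct zone_model {
    long free_pages;    /* pages on the free list */
    long min_mark;      /* min watermark of the zone */
    long slab_objects;  /* free objects left in already-allocated slabs */
};

/* Simplified check: would a page allocation with these flags succeed? */
static bool page_alloc_ok(const struct zone_model *z, int alloc_flags)
{
    long min = z->min_mark;

    if (alloc_flags & ALLOC_NO_WATERMARKS)
        return z->free_pages > 0;
    if (alloc_flags & ALLOC_HIGH)
        min -= min / 2;
    if (alloc_flags & ALLOC_HARDER)
        min -= min / 4;
    return z->free_pages > min;
}

/*
 * The behaviour being objected to: free objects are handed out without
 * looking at the watermarks, so a caller can succeed here even though
 * page_alloc_ok() would have refused it.
 */
static bool slab_alloc_lax(struct zone_model *z, int alloc_flags)
{
    (void)alloc_flags;
    if (z->slab_objects == 0)
        return false;
    z->slab_objects--;
    return true;
}

/* The requested behaviour: fail whenever the page allocation would fail. */
static bool slab_alloc_strict(struct zone_model *z, int alloc_flags)
{
    if (!page_alloc_ok(z, alloc_flags))
        return false;
    return slab_alloc_lax(z, alloc_flags);
}
```

With 5 free pages and a min watermark of 16, only an `ALLOC_NO_WATERMARKS` caller passes the page-level check, so under the strict variant only that caller gets a slab object. The lax variant serves any caller, which is the mismatch described as a bug above.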
Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks
Christoph Lameter wrote:

> I want general improvements to reclaim to address the issues that you see and other issues related to reclaim instead of the strange code that makes PF_MEMALLOC allocs compete for allocations from a single slab and putting logic into the kernel to decide which allocs to fail. We can reclaim after all. It's just a matter of finding the right way to do this.

The simplest way of achieving that would be to allow recursion of the page reclaim code, under the condition that the second level call can only reclaim clean pages, while the "outer" call does what the VM does today.

--
Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic.
Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks
On Tue, 21 Aug 2007, Rik van Riel wrote:

> > What is preventing that from occurring right now? If the dirty pages are aligned in the right way you can have the exact same situation.
>
> For one, dirty page writeout is done even when free memory is low. The kernel will dig into the PF_MEMALLOC reserves, instead of deciding not to do writeout unless there is lots of free memory.

Right, that is a fundamental problem with this RFC. We need to be able to get into PF_MEMALLOC reserves for writeout.

> Secondly, why would you want to recreate this worst case on purpose every time the pageout code runs?

I did not intend that to occur.
Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks
On Tue, 21 Aug 2007, Peter Zijlstra wrote:

> It quickly ends up with all of memory in the laundry list and then recursing into __alloc_pages which will fail to make progress and OOMs.

H... Okay that needs to be addressed. Reserves need to be used and we should only enter reclaim if they run out (like the first patch that I did).

> But aside from the numerous issues with the patch set as presented, I'm not seeing the big picture, why are you doing this.

I want general improvements to reclaim to address the issues that you see and other issues related to reclaim instead of the strange code that makes PF_MEMALLOC allocs compete for allocations from a single slab and putting logic into the kernel to decide which allocs to fail. We can reclaim after all. It's just a matter of finding the right way to do this.

> Anonymous pages are there to stay, and we cannot tell people how to use them. So we need some free or freeable pages in order to avoid the vm deadlock that arises from all memory dirty.

No one is trying to abolish anonymous pages. Free memory is readily available on demand if one calls reclaim. Your scheme introduces complex negotiations over a few scraps of memory when large amounts of memory would still be readily available if one would do the right thing and call into reclaim.

> 'Optimizing' this by switching to freeable pages has mainly disadvantages IMHO, finding them scrambles LRU order and complexifies reclaim and all that for a relatively small gain in space for clean pagecache pages.

Sounds like you would like to change the way we handle memory in general in the VM? Reclaim (and thus finding freeable pages) is basic to Linux memory management.

> Please, stop writing patches and write down a solid proposal of how you envision the VM working in the various scenarios and why it's better than the current approach.

Sorry, I just got into this a short time ago and I may need a few cycles to get this all straight. An approach that uses memory instead of ignoring available memory is certainly better.
Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks
On Tue, 2007-08-21 at 13:48 -0700, Christoph Lameter wrote:
> On Tue, 21 Aug 2007, Peter Zijlstra wrote:
>
> > This almost insta-OOMs with anonymous workloads.
>
> What does the workload do? So writeout needs to begin earlier. There are likely issues with throttling.

The workload is a single program mapping 256M of anonymous memory and cycling through it with writes, run on a 128M setup.

It quickly ends up with all of memory in the laundry list and then recursing into __alloc_pages which will fail to make progress and OOMs.

But aside from the numerous issues with the patch set as presented, I'm not seeing the big picture, why are you doing this.

Anonymous pages are there to stay, and we cannot tell people how to use them. So we need some free or freeable pages in order to avoid the vm deadlock that arises from all memory dirty.

Currently we keep them free, this has the advantage that the buddy allocator can at least try to coalesce them.

'Optimizing' this by switching to freeable pages has mainly disadvantages IMHO, finding them scrambles LRU order and complexifies reclaim and all that for a relatively small gain in space for clean pagecache pages.

Please, stop writing patches and write down a solid proposal of how you envision the VM working in the various scenarios and why it's better than the current approach.
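The test program itself was not posted in this thread. A workload of the kind described (map a region of anonymous memory and cycle through it with writes) could look roughly like the sketch below. The function name `dirty_anonymous` and its parameters are assumptions made for illustration, not the original test code.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE         /* for MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map `bytes` of anonymous memory and write to every page `passes` times.
 * Each write dirties a page that has no backing file, so the only way the
 * kernel can reclaim it is to write it to swap.
 * Returns the number of page writes performed, or 0 if the mapping failed.
 */
static size_t dirty_anonymous(size_t bytes, unsigned int passes)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t writes = 0;
    unsigned char *mem;

    if (page <= 0 || bytes == 0)
        return 0;

    mem = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return 0;

    for (unsigned int pass = 0; pass < passes; pass++) {
        for (size_t off = 0; off < bytes; off += (size_t)page) {
            mem[off] = (unsigned char)pass;   /* touch one byte per page */
            writes++;
        }
    }

    munmap(mem, bytes);
    return writes;
}
```

To approximate the reported scenario one would call this with a size larger than physical memory (the message mentions 256M on a 128M machine) and enough passes to keep the pressure up. Called with a small size, it simply exercises the loop.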
Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks
Christoph Lameter wrote:
> On Tue, 21 Aug 2007, Rik van Riel wrote:
>
> > Christoph Lameter wrote:
> >
> > > 1. First reclaiming non dirty pages. Dirty pages are deferred until reclaim has reestablished the high marks. Then all the dirty pages (the laundry) are written out.
> >
> > That sounds like a horrendously bad idea. While one process is busy freeing all the non dirty pages, other processes can allocate those pages, leaving you with no memory to free up the dirty pages!
>
> What is preventing that from occurring right now? If the dirty pages are aligned in the right way you can have the exact same situation.

For one, dirty page writeout is done even when free memory is low. The kernel will dig into the PF_MEMALLOC reserves, instead of deciding not to do writeout unless there is lots of free memory.

Secondly, why would you want to recreate this worst case on purpose every time the pageout code runs?
Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks
On Tue, 21 Aug 2007, Dave McCracken wrote:

> On Monday 20 August 2007, Christoph Lameter wrote:
> > 1. First reclaiming non dirty pages. Dirty pages are deferred until reclaim has reestablished the high marks. Then all the dirty pages (the laundry) are written out.
>
> I don't buy it. What happens when there aren't enough clean pages in the system to achieve the high water mark? I'm guessing we'd get a quick OOM (as observed by Peter).

We reclaim the clean pages that there are (removing the executable pages from memory) and then we do writeback. The quick OOM is due to throttling not working right AFAIK.

> > 2. Reclaim is essentially complete during the writeout phase. So we remove PF_MEMALLOC and allow recursive reclaim if we still run into trouble during writeout.
>
> You're assuming the system is static and won't allocate new pages behind your back. We could be back to critically low memory before the write happens.

Yes and that occurs now too.

> More broadly, we need to be proactive about getting dirty pages cleaned before they consume the system. Deferring the write just makes it harder to keep up.

Cleaning dirty pages through writeout consumes memory. Writing dirty pages out early makes the memory situation even worse.
Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks
On Tue, 21 Aug 2007, Rik van Riel wrote:

> Christoph Lameter wrote:
>
> > 1. First reclaiming non dirty pages. Dirty pages are deferred until reclaim has reestablished the high marks. Then all the dirty pages (the laundry) are written out.
>
> That sounds like a horrendously bad idea. While one process is busy freeing all the non dirty pages, other processes can allocate those pages, leaving you with no memory to free up the dirty pages!

What is preventing that from occurring right now? If the dirty pages are aligned in the right way you can have the exact same situation.

> Also, writing out all the dirty pages at once seems like it could hurt latency quite badly, especially on large systems.

We only write back the dirty pages that we are about to reclaim, not all of them. The bigger batching occurs if we go through multiple priorities. Plus, writeback in the sync reclaim case is stopped if the device becomes contended anyway.
Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks
On Tue, 21 Aug 2007, Peter Zijlstra wrote:

> This almost insta-OOMs with anonymous workloads.

What does the workload do? So writeout needs to begin earlier. There are likely issues with throttling.
Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks
On Monday 20 August 2007, Christoph Lameter wrote:
> 1. First reclaiming non dirty pages. Dirty pages are deferred until reclaim has reestablished the high marks. Then all the dirty pages (the laundry) are written out.

I don't buy it. What happens when there aren't enough clean pages in the system to achieve the high water mark? I'm guessing we'd get a quick OOM (as observed by Peter).

> 2. Reclaim is essentially complete during the writeout phase. So we remove PF_MEMALLOC and allow recursive reclaim if we still run into trouble during writeout.

You're assuming the system is static and won't allocate new pages behind your back. We could be back to critically low memory before the write happens.

More broadly, we need to be proactive about getting dirty pages cleaned before they consume the system. Deferring the write just makes it harder to keep up.

Dave McCracken
Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks
Christoph Lameter wrote:

> 1. First reclaiming non dirty pages. Dirty pages are deferred until reclaim has reestablished the high marks. Then all the dirty pages (the laundry) are written out.

That sounds like a horrendously bad idea. While one process is busy freeing all the non dirty pages, other processes can allocate those pages, leaving you with no memory to free up the dirty pages!

How exactly are you planning to prevent that problem?

Also, writing out all the dirty pages at once seems like it could hurt latency quite badly, especially on large systems.
Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks
On Mon, 2007-08-20 at 14:50 -0700, Christoph Lameter wrote:
> One of the problems with reclaim writeout is that it occurs when memory in a zone is low. A particularly bad problem can occur if memory in a zone is already low and now the first page that we encounter during reclaim is dirty. So the writeout function is called without the filesystem or device having much of a reserve that would allow further allocations. Triggering writeout of dirty pages early does not improve the memory situation since the actual writeout of the page is a relatively long process. The call to writepage will therefore not improve the low memory situation but make it worse because extra memory may be needed to get the device to write the page.
>
> This patchset fixes that issue by:
>
> 1. First reclaiming non dirty pages. Dirty pages are deferred until reclaim has reestablished the high marks. Then all the dirty pages (the laundry) are written out.
>
> 2. Reclaim is essentially complete during the writeout phase. So we remove PF_MEMALLOC and allow recursive reclaim if we still run into trouble during writeout.

This almost insta-OOMs with anonymous workloads.
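The first step of the quoted proposal, and the objection to it, can be made concrete with a toy model. This is a sketch under assumptions and does not reproduce any code from the patchset: `enum page_state`, `struct reclaim_result` and `reclaim_clean_first` are hypothetical names, and the `target` count stands in for "the high marks have been reestablished".

```c
#include <assert.h>
#include <stddef.h>

enum page_state { PAGE_CLEAN, PAGE_DIRTY, PAGE_FREE, PAGE_LAUNDRY };

struct reclaim_result {
    size_t freed;     /* clean pages freed immediately */
    size_t laundry;   /* dirty pages deferred for later writeout */
};

/*
 * Walk the pages in order. Clean pages are freed until `target` pages have
 * been freed. Dirty pages encountered on the way are not written out but
 * moved to the laundry, to be written once the target has been reached.
 */
static struct reclaim_result reclaim_clean_first(enum page_state *pages,
                                                 size_t npages, size_t target)
{
    struct reclaim_result r = {0, 0};

    for (size_t i = 0; i < npages && r.freed < target; i++) {
        if (pages[i] == PAGE_CLEAN) {
            pages[i] = PAGE_FREE;
            r.freed++;
        } else if (pages[i] == PAGE_DIRTY) {
            pages[i] = PAGE_LAUNDRY;
            r.laundry++;
        }
    }
    return r;
}
```

With a mix of clean and dirty pages the target is reached and only a few pages are deferred. With a purely dirty set, as an anonymous workload produces, nothing is freed and every page ends up in the laundry, which matches the failure reported in the reply above.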
Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks
On Monday 20 August 2007, Christoph Lameter wrote:

> 1. First reclaiming non dirty pages. Dirty pages are deferred until reclaim has reestablished the high marks. Then all the dirty pages (the laundry) is written out.

I don't buy it. What happens when there aren't enough clean pages in the system to achieve the high water mark? I'm guessing we'd get a quick OOM (as observed by Peter).

> 2. Reclaim is essentially complete during the writeout phase. So we remove PF_MEMALLOC and allow recursive reclaim if we still run into trouble during writeout.

You're assuming the system is static and won't allocate new pages behind your back. We could be back to critically low memory before the write happens.

More broadly, we need to be proactive about getting dirty pages cleaned before they consume the system. Deferring the write just makes it harder to keep up.

Dave McCracken
Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks
On Tue, 21 Aug 2007, Peter Zijlstra wrote:

> This almost insta-OOMs with anonymous workloads.

What does the workload do? So writeout needs to begin earlier. There are likely issues with throttling.
Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks
On Tue, 21 Aug 2007, Rik van Riel wrote:

> Christoph Lameter wrote:
>
> > 1. First reclaiming non dirty pages. Dirty pages are deferred until reclaim has reestablished the high marks. Then all the dirty pages (the laundry) is written out.
>
> That sounds like a horrendously bad idea. While one process is busy freeing all the non dirty pages, other processes can allocate those pages, leaving you with no memory to free up the dirty pages!

What is preventing that from occurring right now? If the dirty pages are aligned in the right way you can have the exact same situation.

> Also, writing out all the dirty pages at once seems like it could hurt latency quite badly, especially on large systems.

We only write back the dirty pages that we are about to reclaim, not all of them. The bigger batching occurs if we go through multiple priorities. Plus writeback in the sync reclaim case is stopped if the device becomes contended anyways.
Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks
On Tue, 21 Aug 2007, Dave McCracken wrote:

> On Monday 20 August 2007, Christoph Lameter wrote:
>
> > 1. First reclaiming non dirty pages. Dirty pages are deferred until reclaim has reestablished the high marks. Then all the dirty pages (the laundry) is written out.
>
> I don't buy it. What happens when there aren't enough clean pages in the system to achieve the high water mark? I'm guessing we'd get a quick OOM (as observed by Peter).

We reclaim the clean pages that there are (removing the executable pages from memory) and then we do writeback. The quick OOM is due to throttling not working right AFAIK.

> > 2. Reclaim is essentially complete during the writeout phase. So we remove PF_MEMALLOC and allow recursive reclaim if we still run into trouble during writeout.
>
> You're assuming the system is static and won't allocate new pages behind your back. We could be back to critically low memory before the write happens.

Yes and that occurs now too.

> More broadly, we need to be proactive about getting dirty pages cleaned before they consume the system. Deferring the write just makes it harder to keep up.

Cleaning dirty pages through writeout consumes memory. Writing dirty pages out early makes the memory situation even worse.
Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks
On Tue, 2007-08-21 at 13:48 -0700, Christoph Lameter wrote:

> On Tue, 21 Aug 2007, Peter Zijlstra wrote:
>
> > This almost insta-OOMs with anonymous workloads.
>
> What does the workload do? So writeout needs to begin earlier. There are likely issues with throttling.

The workload is a single program mapping 256M of anonymous memory and cycling through it with writes, run on a 128M setup. It quickly ends up with all of memory in the laundry list and then recurses into __alloc_pages, which fails to make progress and OOMs.

But aside from the numerous issues with the patch set as presented, I'm not seeing the big picture: why are you doing this?

Anonymous pages are there to stay, and we cannot tell people how to use them. So we need some free or freeable pages in order to avoid the VM deadlock that arises from all memory being dirty.

Currently we keep them free; this has the advantage that the buddy allocator can at least try to coalesce them.

'Optimizing' this by switching to freeable pages has mainly disadvantages IMHO: finding them scrambles LRU order and complexifies reclaim, and all that for a relatively small gain in space for clean pagecache pages.

Please, stop writing patches and write down a solid proposal of how you envision the VM working in the various scenarios and why it's better than the current approach.
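The test program itself was not posted in this message. A workload of the shape described (map anonymous memory, then cycle through it with writes) might look roughly like the following sketch. The function name `dirty_anon_memory()` and its parameters are made up for illustration, and the sketch assumes a Linux or POSIX system where `MAP_ANONYMOUS` is available.

```c
#define _DEFAULT_SOURCE  /* for MAP_ANONYMOUS under strict C standards */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map `bytes` of anonymous memory and write one byte into every page,
 * `passes` times over. Returns the number of page writes performed,
 * or 0 if nothing could be mapped.
 */
size_t dirty_anon_memory(size_t bytes, unsigned passes)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t writes = 0;

    if (bytes == 0 || page <= 0)
        return 0;

    unsigned char *mem = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return 0;

    for (unsigned p = 0; p < passes; p++) {
        for (size_t off = 0; off < bytes; off += (size_t)page) {
            mem[off] = (unsigned char)(p + 1);  /* dirties the page */
            writes++;
        }
    }

    munmap(mem, bytes);
    return writes;
}
```

The case described above would correspond to calling this with 256M and a large pass count on a machine with 128M of RAM, so that the mapping cannot stay resident. Do not run it at that size on a machine you care about.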
Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks
Christoph Lameter wrote:

> On Tue, 21 Aug 2007, Rik van Riel wrote:
>
> > Christoph Lameter wrote:
> >
> > > 1. First reclaiming non dirty pages. Dirty pages are deferred until reclaim has reestablished the high marks. Then all the dirty pages (the laundry) is written out.
> >
> > That sounds like a horrendously bad idea. While one process is busy freeing all the non dirty pages, other processes can allocate those pages, leaving you with no memory to free up the dirty pages!
>
> What is preventing that from occurring right now? If the dirty pages are aligned in the right way you can have the exact same situation.

For one, dirty page writeout is done even when free memory is low. The kernel will dig into the PF_MEMALLOC reserves, instead of deciding not to do writeout unless there is lots of free memory.

Secondly, why would you want to recreate this worst case on purpose every time the pageout code runs?

-- 
Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic.
Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks
On Tue, 21 Aug 2007, Peter Zijlstra wrote:

> It quickly ends up with all of memory in the laundry list and then recurses into __alloc_pages, which fails to make progress and OOMs.

H... Okay that needs to be addressed. Reserves need to be used and we only should enter reclaim if that runs out (like the first patch that I did).

> But aside from the numerous issues with the patch set as presented, I'm not seeing the big picture: why are you doing this?

I want general improvements to reclaim to address the issues that you see and other issues related to reclaim instead of the strange code that makes PF_MEMALLOC allocs compete for allocations from a single slab and putting logic into the kernel to decide which allocs to fail. We can reclaim after all. It's just a matter of finding the right way to do this.

> Anonymous pages are there to stay, and we cannot tell people how to use them. So we need some free or freeable pages in order to avoid the VM deadlock that arises from all memory being dirty.

No one is trying to abolish anonymous pages. Free memory is readily available on demand if one calls reclaim. Your scheme introduces complex negotiations over a few scraps of memory when large amounts of memory would still be readily available if one would do the right thing and call into reclaim.

> 'Optimizing' this by switching to freeable pages has mainly disadvantages IMHO: finding them scrambles LRU order and complexifies reclaim, and all that for a relatively small gain in space for clean pagecache pages.

Sounds like you would like to change the way we handle memory in general in the VM? Reclaim (and thus finding freeable pages) is basic to Linux memory management.

> Please, stop writing patches and write down a solid proposal of how you envision the VM working in the various scenarios and why it's better than the current approach.

Sorry I just got into this a short time ago and I may need a few cycles to get this all straight. An approach that uses memory instead of ignoring available memory is certainly better.
Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks
On Tue, 21 Aug 2007, Rik van Riel wrote:

> > What is preventing that from occurring right now? If the dirty pages are aligned in the right way you can have the exact same situation.
>
> For one, dirty page writeout is done even when free memory is low. The kernel will dig into the PF_MEMALLOC reserves, instead of deciding not to do writeout unless there is lots of free memory.

Right, that is a fundamental problem with this RFC. We need to be able to get into PF_MEMALLOC reserves for writeout.

> Secondly, why would you want to recreate this worst case on purpose every time the pageout code runs?

I did not intend that to occur.
Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks
Christoph Lameter wrote:

> I want general improvements to reclaim to address the issues that you see and other issues related to reclaim instead of the strange code that makes PF_MEMALLOC allocs compete for allocations from a single slab and putting logic into the kernel to decide which allocs to fail. We can reclaim after all. Its just a matter of finding the right way to do this.

The simplest way of achieving that would be to allow recursion of the page reclaim code, under the condition that the second level call can only reclaim clean pages, while the outer call does what the VM does today.

-- 
Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic.
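The recursion rule suggested above can be sketched as a toy userspace model. Everything here is an assumption made for illustration: `struct toy_page`, `struct toy_zone` and `toy_reclaim()` are hypothetical, and "writeout needs one free scratch page" is a deliberate simplification of the real memory cost of I/O. The outer call (depth 0) may clean and free dirty pages; a nested call may only free clean ones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_page {
    bool in_use;
    bool dirty;
};

struct toy_zone {
    struct toy_page *pages;
    size_t n;
    size_t free;  /* number of free pages in the zone */
};

/*
 * Returns the number of pages freed directly by this call. Pages freed
 * by a nested call show up in z->free but not in the return value.
 */
size_t toy_reclaim(struct toy_zone *z, size_t want, unsigned depth)
{
    size_t freed = 0;

    assert(z != NULL);
    for (size_t i = 0; i < z->n && freed < want; i++) {
        struct toy_page *p = &z->pages[i];

        if (!p->in_use)
            continue;
        if (p->dirty) {
            if (depth > 0)
                continue;  /* nested call: clean pages only */
            /* Writeout needs a scratch page; recurse if none is free. */
            if (z->free == 0 && toy_reclaim(z, 1, depth + 1) == 0)
                continue;  /* nothing to back the I/O with */
            p->dirty = false;  /* pretend the writeout completed */
        }
        p->in_use = false;
        z->free++;
        freed++;
    }
    return freed;
}
```

In this model a single clean page is enough to get writeout started, after which the freed pages back further writeout. With no clean page at all the outer call makes no progress, which matches the all-anonymous case raised earlier in the thread.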
Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks
On Tue, 2007-08-21 at 14:29 -0700, Christoph Lameter wrote:

> On Tue, 21 Aug 2007, Peter Zijlstra wrote:
>
> > It quickly ends up with all of memory in the laundry list and then recurses into __alloc_pages, which fails to make progress and OOMs.
>
> H... Okay that needs to be addressed. Reserves need to be used and we only should enter reclaim if that runs out (like the first patch that I did).
>
> > But aside from the numerous issues with the patch set as presented, I'm not seeing the big picture: why are you doing this?
>
> I want general improvements to reclaim to address the issues that you see and other issues related to reclaim instead of the strange code that makes PF_MEMALLOC allocs compete for allocations from a single slab and putting logic into the kernel to decide which allocs to fail. We can reclaim after all. It's just a matter of finding the right way to do this.

The latest patch I posted got rid of that global slab.

Also, all I want is for slab to honour gfp flags like page allocation does, nothing more, nothing less. (Well, actually slightly less, since I'm only really interested in the ALLOC_MIN|ALLOC_HIGH|ALLOC_HARDER -> ALLOC_NO_WATERMARKS transition and not all higher ones.)

I want slab to fail when a similar page alloc would fail, no magic.

Strictly speaking, if:

    page = alloc_page(gfp);

fails but:

    obj = kmem_cache_alloc(s, gfp);

succeeds, then it's a bug.

But I'm not actually needing it that strict; just the ALLOC_NO_WATERMARKS part needs to be done. ALLOC_HARDER and ALLOC_HIGH may fudge a bit.

> > Anonymous pages are there to stay, and we cannot tell people how to use them. So we need some free or freeable pages in order to avoid the VM deadlock that arises from all memory being dirty.
>
> No one is trying to abolish anonymous pages. Free memory is readily available on demand if one calls reclaim. Your scheme introduces complex negotiations over a few scraps of memory when large amounts of memory would still be readily available if one would do the right thing and call into reclaim.

This is the thing I contend: there need not be large amounts of memory around. In my test prog the hot code path fits into a single page, the rest can be anonymous.

> > 'Optimizing' this by switching to freeable pages has mainly disadvantages IMHO: finding them scrambles LRU order and complexifies reclaim, and all that for a relatively small gain in space for clean pagecache pages.
>
> Sounds like you would like to change the way we handle memory in general in the VM? Reclaim (and thus finding freeable pages) is basic to Linux memory management.

Not quite. Currently we have free pages in the reserves; if you want to replace some (or all) of that by freeable pages then that is a change. I'm just using the reserves.

> > Please, stop writing patches and write down a solid proposal of how you envision the VM working in the various scenarios and why it's better than the current approach.
>
> Sorry I just got into this a short time ago and I may need a few cycles to get this all straight. An approach that uses memory instead of ignoring available memory is certainly better.

Sure, if and when possible. There will always be need to fall back to the reserves.

A bit off-topic, re that reclaim from atomic context: currently we try to hold spinlocks only for short periods of time so that reclaim can be preempted. If you run all of reclaim from a non-preemptible context you get very large preemption latencies, and if done from int context it'd also generate large int latencies.
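The invariant asked for above (an object allocation must not succeed where the equivalent page allocation would fail) can be illustrated with a toy userspace model. This is a sketch, not kernel code: `TOY_ALLOC_NO_WATERMARKS`, `struct toy_zone`, `struct toy_cache` and both functions are hypothetical names, and the model only distinguishes the two classes the message says matter, namely allocations that respect the minimum watermark and allocations that may ignore it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_ALLOC_NO_WATERMARKS 0x1u  /* hypothetical flag for this model */

struct toy_zone {
    long free_pages;
    long min_mark;   /* minimum watermark */
};

struct toy_cache {
    unsigned free_objects;  /* objects sitting in already allocated slabs */
};

/* Would a page allocation with these flags succeed right now? */
bool toy_page_alloc_ok(const struct toy_zone *z, unsigned flags)
{
    if (flags & TOY_ALLOC_NO_WATERMARKS)
        return z->free_pages > 0;
    return z->free_pages > z->min_mark;
}

/*
 * Object allocation that honours the same check: it fails whenever the
 * page allocation would fail, even if cached objects are available.
 */
bool toy_cache_alloc(struct toy_cache *c, const struct toy_zone *z,
                     unsigned flags)
{
    assert(c != NULL && z != NULL);
    if (!toy_page_alloc_ok(z, flags))
        return false;
    if (c->free_objects > 0)
        c->free_objects--;  /* serve from an existing slab */
    /* else: a new slab page would be taken; not modelled here */
    return true;
}
```

Below the watermark, an ordinary request is refused even though the cache still holds free objects, so those objects stay available for callers that are entitled to ignore the watermark.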
Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks
On Tue, 21 Aug 2007, Rik van Riel wrote:

> Christoph Lameter wrote:
>
> > I want general improvements to reclaim to address the issues that you see and other issues related to reclaim instead of the strange code that makes PF_MEMALLOC allocs compete for allocations from a single slab and putting logic into the kernel to decide which allocs to fail. We can reclaim after all. Its just a matter of finding the right way to do this.
>
> The simplest way of achieving that would be to allow recursion of the page reclaim code, under the condition that the second level call can only reclaim clean pages, while the outer call does what the VM does today.

Yes, that is what the precursor to this patchset does. See http://marc.info/?l=linux-mm&m=118710207203449&w=2

This one did not even come up to the level of the earlier one. Sigh.

The way forward may be:

1. Like in the earlier patchset, allow reentry to reclaim under PF_MEMALLOC if we are out of all memory.

2. Do the laundry as here but do not write out laundry directly. Instead move laundry to a new lru style list in the zone structure. This will allow the recursive reclaim to also trigger writeout of pages (what this patchset was supposed to accomplish).

3. Perform writeback only from kswapd. Make other threads wait on kswapd if memory is low; we can wait, and writeback still has to progress.

4. Then allow reclaim of GFP_ATOMIC allocs (see http://marc.info/?l=linux-kernel&m=118710595617696&w=2). Atomic reclaim can then also put pages onto the zone laundry lists from where they are going to be picked up and written out by kswapd ASAP. This one may be tricky so maybe keep this separate.
Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks
On Wed, 22 Aug 2007, Peter Zijlstra wrote:

> Also, all I want is for slab to honour gfp flags like page allocation does, nothing more, nothing less. (Well, actually slightly less, since I'm only really interested in the ALLOC_MIN|ALLOC_HIGH|ALLOC_HARDER -> ALLOC_NO_WATERMARKS transition and not all higher ones.)

I am still not sure what that brings you. There may be multiple PF_MEMALLOC going on at the same time. On a large system with N cpus there may be more than N of these that can steal objects from one another.

A NUMA system will be shot anyways if memory gets that problematic to handle, since the OS cannot effectively place memory if all zones are overallocated so that only a few pages are left.

> I want slab to fail when a similar page alloc would fail, no magic.

Yes I know. I do not want allocations to fail but that reclaim occurs in order to avoid failing any allocation. We need provisions that make sure that we never get into such a bad memory situation that would cause severe slowness and usually end up in a livelock anyways.

> > > Anonymous pages are there to stay, and we cannot tell people how to use them. So we need some free or freeable pages in order to avoid the VM deadlock that arises from all memory being dirty.
> >
> > No one is trying to abolish anonymous pages. Free memory is readily available on demand if one calls reclaim. Your scheme introduces complex negotiations over a few scraps of memory when large amounts of memory would still be readily available if one would do the right thing and call into reclaim.
>
> This is the thing I contend: there need not be large amounts of memory around. In my test prog the hot code path fits into a single page, the rest can be anonymous.

That's a bit extreme. We need to make sure that there are larger amounts of memory around. Pages are used for all sorts of short term uses (like slab shrinking etc.). If memory is so low that a single page matters then we are in very bad shape anyways.

> > Sounds like you would like to change the way we handle memory in general in the VM? Reclaim (and thus finding freeable pages) is basic to Linux memory management.
>
> Not quite. Currently we have free pages in the reserves; if you want to replace some (or all) of that by freeable pages then that is a change.

We have free pages primarily to optimize the allocation, meaning we do not have to run reclaim on every call. We want to use all of memory. The reserves are there for the case that we cannot call into reclaim. The easy solution if that is problematic is to enhance the reclaim to work in the critical situations that we care about.

> > Sorry I just got into this a short time ago and I may need a few cycles to get this all straight. An approach that uses memory instead of ignoring available memory is certainly better.
>
> Sure, if and when possible. There will always be need to fall back to the reserves.

Maybe. But we can certainly avoid that as much as possible, which would also increase our ability to use all available memory instead of leaving some of it unused.

> A bit off-topic, re that reclaim from atomic context: currently we try to hold spinlocks only for short periods of time so that reclaim can be preempted. If you run all of reclaim from a non-preemptible context you get very large preemption latencies, and if done from int context it'd also generate large int latencies.

If you call into the page allocator from an interrupt context then you are already in bad shape, since we may check pcp lists and then potentially have to traverse the zonelists and check all sorts of things.

If we would implement atomic reclaim then the reserves may become a latency optimization. At least we will not fail anymore if the reserves are out.