Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-25 Thread Rik van Riel

Christoph Lameter wrote:

On Wed, 22 Aug 2007, Peter Zijlstra wrote:

That is an extreme case that AFAIK we currently ignore and could be 
avoided with some effort.

Its not extreme, not even rare, and its handled now. Its what
PF_MEMALLOC is for.


No its not. If you have all pages allocated as anonymous pages and your 
writeout requires more pages than available in the reserves then you are 
screwed either way regardless if you have PF_MEMALLOC set or not.


Only if the _first_ writeout needs more pages.

If the sum of all writeouts need more pages than you have
available, that is fine.  After all, buffer heads and some
other metadata is freed on IO completion.

Recursive reclaim will also be able to free the data pages
after IO completion, and really fix the problem.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-25 Thread Rik van Riel

Christoph Lameter wrote:

On Wed, 22 Aug 2007, Peter Zijlstra wrote:

That is an extreme case that AFAIK we currently ignore and could be 
avoided with some effort.

Its not extreme, not even rare, and its handled now. Its what
PF_MEMALLOC is for.


No its not. If you have all pages allocated as anonymous pages and your 
writeout requires more pages than available in the reserves then you are 
screwed either way regardless if you have PF_MEMALLOC set or not.


Only if the _first_ writeout needs more pages.

If the sum of all writeouts need more pages than you have
available, that is fine.  After all, buffer heads and some
other metadata is freed on IO completion.

Recursive reclaim will also be able to free the data pages
after IO completion, and really fix the problem.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-23 Thread Christoph Lameter
On Thu, 23 Aug 2007, Andrea Arcangeli wrote:

> On Tue, Aug 21, 2007 at 03:32:25PM -0700, Christoph Lameter wrote:
> > 1. Like in the earlier patchset allow reentry to reclaim under 
> >PF_MEMALLOC if we are out of all memory.
> 
> Can you simply tweak on the may_writepage flag only to achieve the
> second pass? We're talking here about a totally non-performance case,
> almost impossible to hit in practice unless you do real weird things,
> and certainly very unlikely to happen. So I'm unsure what's all that
> complexity just to make a regular pass on the lru looking for clean
> pages, something may_writepage=0 already does.
> 

Yes that is what the PF_MEMALLOC patch that I posted before does. This 
discussion gets me more and more to thinking that the recursive reclaim on 
PF_MEMALLOC is all that is needed for emergency situations (to get out of 
the "tight spot").

See
http://marc.info/?l=linux-kernel=118710219116624=2

> If the PF_MEMALLOC is found empty, I agree entering reclaim a second
> time with may_writepage=0 sounds theoretically a good idea (in
> practice it should never be necessary). printk must also be printed to
> warn the user he was risking to deadlock for real and he has to
> increase the min_free_kbytes.

Ok. I can add a printk to that one.

> That sounds a bit risky, there are latency considerations here to
> make, GFP_ATOMIC will run with irq locally disabled and it may hang
> for indefinite amount of time (O(N)). So irq latency may break and it
> may be better to lose a packet once in a while than to hang
> interrupts. If you want to do this you'd probably need to add a new
> GFP_ATOMIC_RECLAIM or similar.

Well we could do the same as for PF_MEMALLOC: print a warning and then 
reclaim nevertheless if we cannot fail (We already have a GFP_NOFAIL 
flag). It is better to generate a latency than the system failing 
altogether. However the GFP_ATOMIC reclaim patchset is a 
bit more invasive (http://marc.info/?l=linux-mm=118710584014150=2). 
Maybe this is too much churn for the rare need of such a reclaim.
 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-23 Thread Peter Zijlstra
On Thu, 2007-08-23 at 14:08 +0200, Andrea Arcangeli wrote:
> On Wed, Aug 22, 2007 at 12:09:03AM +0200, Peter Zijlstra wrote:
> > Strictly speaking:
> > 
> > if:
> > 
> >  page = alloc_page(gfp);
> > 
> > fails but:
> > 
> >  obj = kmem_cache_alloc(s, gfp);
> > 
> > succeeds then its a bug.
> 
> Why? this is like saying that if alloc_pages(order=1) fails but
> alloc_pages(order=0) succeeds then it's a bug. Obviously it's not a
> bug.
> 
> The only bug is if slab allocations <=4k fails despite
> alloc_pages(order=0) would succeed.

That would be currently true. However I need it to be stricter.

I'm wanting to do networked swap. And in order to be able to receive
writeout completions when in the PF_MEMALLOC region I need to introduce
a new network state. This is because it needs to operate in a steady
state with limited (bounded) memory use.

Normal network either consumes memory, or fails to receive anything at
all.

So this new network state will allocate space for a packet, receive the
packet from the NIC, inspect the packet, and toss the packet when its
not found to be aimed at the VM (ie. does not contain a writeout
completion).

So the total memory consumption of this state is 0 - it always frees
what it takes, but the memory use is non 0 but bounded - it does
temporarily use memory, but will limit itself to never exceed a given
maximum)

Because the network stack runs on the slab allocator in generic (both
kmem_cache and kmalloc) I need this extra guarantee so that a slab
allocated from the reserves will not serve objects to some random
non-critical application.

If this is not restricted this network state can leak memory to outside
of PF_MEMALLOC and will not be stable.

So what I need is:

  kmem_cache_alloc(s, gfp) to fail when alloc_page(gfp) fails

agreeing on the extra condition:

  when kmem_cache_size(s) <= PAGE_SIZE

and the extra note that:

  I only really need it to fail for ALLOC_NO_WATERMARKS, the other
  levels like ALLOC_HIGH and ALLOC_HARDER are not critical.

Which ends up with:

  if the current gfp-context does not allow ALLOC_NO_WATERMARKS
allocations, and alloc_page() fails, so must kmem_cache_alloc(s,) if
kmem_cache_size(s) <= PAGE_SIZE.

(yes this leaves jumbo frames broken)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-23 Thread Andrea Arcangeli
On Wed, Aug 22, 2007 at 10:03:45PM +0200, Peter Zijlstra wrote:
> Its not extreme, not even rare, and its handled now. Its what
> PF_MEMALLOC is for.

Agreed. This is the whole point, either you limit the max amount of
anon memory, slab, alloc_pages a driver can do or you reserve a
pool. Guess what? In practice limiting the max ram a driver can eat in
alloc_pages, at the same time while limting the max amount of pages
that can be anon ram, etc..etc.. is called "reserving a pool of
freepages for PF_MEMALLOC".

Now in theory we could try a may_writepage=0 second reclaim pass
before using the PF_MEMALLOC pool but would that make any difference
other than being slower? We can argue what should be done first but
the PF_MEMALLOC pool isn't likely to go away with this patch... only
way to make it go away is to have every subsystem including tcp
incoming to have mempools for everything which is too complicated to
implement so we've to live the imperfect world that just works good
enough.

This logic of falling back in a may_writepage=0 pass will make things
a bit more reliable but certainly not perfect and it doesn't obsolete
the need of the current code IMHO.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-23 Thread Andrea Arcangeli
On Wed, Aug 22, 2007 at 12:09:03AM +0200, Peter Zijlstra wrote:
> Strictly speaking:
> 
> if:
> 
>  page = alloc_page(gfp);
> 
> fails but:
> 
>  obj = kmem_cache_alloc(s, gfp);
> 
> succeeds then its a bug.

Why? this is like saying that if alloc_pages(order=1) fails but
alloc_pages(order=0) succeeds then it's a bug. Obviously it's not a
bug.

The only bug is if slab allocations <=4k fails despite
alloc_pages(order=0) would succeed.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-23 Thread Andrea Arcangeli
On Tue, Aug 21, 2007 at 03:32:25PM -0700, Christoph Lameter wrote:
> 1. Like in the earlier patchset allow reentry to reclaim under 
>PF_MEMALLOC if we are out of all memory.

Can you simply tweak on the may_writepage flag only to achieve the
second pass? We're talking here about a totally non-performance case,
almost impossible to hit in practice unless you do real weird things,
and certainly very unlikely to happen. So I'm unsure what's all that
complexity just to make a regular pass on the lru looking for clean
pages, something may_writepage=0 already does.

Like Andi said at most one may_writepage=0 recursion should be
allowed.

If the PF_MEMALLOC is found empty, I agree entering reclaim a second
time with may_writepage=0 sounds theoretically a good idea (in
practice it should never be necessary). printk must also be printed to
warn the user he was risking to deadlock for real and he has to
increase the min_free_kbytes.

> 2. Do the laundry as here but do not write out laundry directly.
>Instead move laundry to a new lru style list in the zone structure.
>This will allow the recursive reclaim to also trigger writeout
>of pages (what this patchset was supposed to accomplish).

A new lru for this sounds overkill to me, we're talking about deadlock
avoidance, this has absolutely nothing to do with real life 99.%
of runtime of all kernels out there.

> 3. Perform writeback only from kswapd. Make other threads
>wait on kswapd if memory is low, we can wait and writeback still
>has to progress.

What does buy you to think about other threads? The whole trouble is
that PF_MEMALLOC is global, no matter which thread (pdflush like other
email to Andi or kswapd here) still it'll deadlock the same way. If
your intent is to limit the max number of in-flight writepage that
could be achieved with a sempahore, not by context switching for no
good reason. kswapd is needed for atomic allocations and to pipeline
the VM so that the vm runs more likely asynchronous inside kswapd.

> 4. Then allow reclaim of GFP_ATOMIC allocs (see
>http://marc.info/?l=linux-kernel=118710595617696=2). Atomic
>reclaim can then also put pages onto the zone laundry lists from where
>it is going to be picked up and written out by kswapd ASAP. This one
>may be tricky so maybe keep this separate.

That sounds a bit risky, there are latency considerations here to
make, GFP_ATOMIC will run with irq locally disabled and it may hang
for indefinite amount of time (O(N)). So irq latency may break and it
may be better to lose a packet once in a while than to hang
interrupts. If you want to do this you'd probably need to add a new
GFP_ATOMIC_RECLAIM or similar.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-23 Thread Peter Zijlstra
On Wed, 2007-08-22 at 13:16 -0700, Christoph Lameter wrote:
> On Wed, 22 Aug 2007, Peter Zijlstra wrote:


> > > > As shown, there are cases where there just isn't any memory to reclaim.
   ^^^
> > > > Please accept this.

> > > That is an extreme case that AFAIK we currently ignore and could be 
> > > avoided with some effort.
> > 
> > Its not extreme, not even rare, and its handled now. Its what
> > PF_MEMALLOC is for.
> 
> No its not. If you have all pages allocated as anonymous pages and your 
> writeout requires more pages than available in the reserves then you are 
> screwed either way regardless if you have PF_MEMALLOC set or not.

Christoph, we were talking about memory to reclaim, no about exhausting
the reserves.

> > > The initial PF_MEMALLOC patchset seems to be 
> > > still enough to deal with your issues.
> > 
> > Take the anonyous workload, user-space will block once the page
> > allocator hits ALLOC_MIN. Network will be able to receive until
> > ALLOC_MIN|ALLOC_HIGH - if the completion doesn't arrive by then it will
> > start dropping all packets until there is memory again. But userspace is
> > wedged and hence will not consume the network traffic, hence we
> > deadlock.
> > 
> > Even if there is something to reclaim initially, if the pressure
> > persists that can eventually be exhausted.
> 
> Sure ultimately you will end up with pages that are all unreclaimable if 
> you reclaim all reclaimable memory.
> 
> > > multiple critical tasks on various devices that have various memory 
> > > needs. 
> > > So multiple critical spots can happen concurrently in multiple 
> > > application contexts.
> > 
> > yes, reclaim can be unbounded concurrent, and that is one of the
> > (theoretically) major problems we currently have.
> 
> So your patchset is not fixing it?

No, and I never said it would. I've been meaning to do one that does
though. Just haven't come around to actually doing it :-/

> > > We have that with PF_MEMALLOC.
> > 
> > Exactly. But if you recognise the need for PF_MEMALLOC then what is this
> > argument about?
> 
> The PF_MEMALLOC patchset f.e. is about avoiding to go out of 
> memory when there is still memory available even if we are doing a 
> PF_MEMALLOC allocation and would OOM otherwise.

Right, but as long as there is a need for PF_MEMALLOC there is a need
for the patches I proposed.

> > Networking can currently be seen as having two states:
> > 
> >  1 receive packets and consume memory
> >  2 drop all packets (when out of memory)
> > 
> > I need a 3rd state:
> > 
> >  3 receiving packets but not consuming memory
> 
> So far a good idea. If you are not consuming memory then why are the 
> allocators involved?

Because I do need to receive some packets, its just that I'll free them
again. So it won't keep consuming memory. This needs a little pool of
memory in order to operate in a stable state.

Its: alloc, receive, inspect, free
total memory use: 0
memory delta: a little
 
(its just that you need to be able to receive a significant number of
packets, not 1, due to funny things like ip-defragmentation before you
can be sure to actually receive 1 whole tcp packet - but the idea is the
same)

> > Now, I need this state when we're in PF_MEMALLOC territory, because I
> > need to be able to process an unspecified amount of network traffic in
> > order to receive the writeout completion.
> > 
> > In order to operate this 3rd network state, some memory is needed in
> > which packets can be received and when deemed not important freed and
> > reused.
> > 
> > It needs a bounded amount of memory in order to process an unbounded
> > amount of network traffic.
> > 
> > What exactly is not clear about this? If you accept the need for
> > PF_MEMALLOC you surely must also agree that at the point you're using it
> > running reclaim is useless.
> 
> Yes looks like you would like to add something to the network layer to 
> filter important packets. As long as you stay within PF_MEMALLOC 
> boundaries you can allocate and throw packets away. If you want to have a 
> reserve that is secure and just for you then you need to take it away from 
> the reserves (which in turn will lead reclaim to restore them).

Ah, but also note that _using_ PF_MEMALLOC is the trigger to enter that
3rd network state. These two are tightly coupled. You only need this 3rd
state when under PF_MEMALLOC, otherwise we could just receive normally.

So, my thinking was that, if the current reserves are good enough to
keep the system 'deadlock' free, I can just enlarge the reserves by
whatever it is I need for that network state and we're all good, no?

Why separate these two? If the current reserve is large enough (and
theoretically it is not - but I'm meaning to fix that) it will not
consume the extra memory I added below.

Note how:
  [PATCH 09/10] mm: emergency pool
pushes up the current reserves in a fashion so as to maintain the
relative 

Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-23 Thread Peter Zijlstra
On Wed, 2007-08-22 at 13:16 -0700, Christoph Lameter wrote:
 On Wed, 22 Aug 2007, Peter Zijlstra wrote:


As shown, there are cases where there just isn't any memory to reclaim.
   ^^^
Please accept this.

   That is an extreme case that AFAIK we currently ignore and could be 
   avoided with some effort.
  
  Its not extreme, not even rare, and its handled now. Its what
  PF_MEMALLOC is for.
 
 No its not. If you have all pages allocated as anonymous pages and your 
 writeout requires more pages than available in the reserves then you are 
 screwed either way regardless if you have PF_MEMALLOC set or not.

Christoph, we were talking about memory to reclaim, no about exhausting
the reserves.

   The initial PF_MEMALLOC patchset seems to be 
   still enough to deal with your issues.
  
  Take the anonyous workload, user-space will block once the page
  allocator hits ALLOC_MIN. Network will be able to receive until
  ALLOC_MIN|ALLOC_HIGH - if the completion doesn't arrive by then it will
  start dropping all packets until there is memory again. But userspace is
  wedged and hence will not consume the network traffic, hence we
  deadlock.
  
  Even if there is something to reclaim initially, if the pressure
  persists that can eventually be exhausted.
 
 Sure ultimately you will end up with pages that are all unreclaimable if 
 you reclaim all reclaimable memory.
 
   multiple critical tasks on various devices that have various memory 
   needs. 
   So multiple critical spots can happen concurrently in multiple 
   application contexts.
  
  yes, reclaim can be unbounded concurrent, and that is one of the
  (theoretically) major problems we currently have.
 
 So your patchset is not fixing it?

No, and I never said it would. I've been meaning to do one that does
though. Just haven't come around to actually doing it :-/

   We have that with PF_MEMALLOC.
  
  Exactly. But if you recognise the need for PF_MEMALLOC then what is this
  argument about?
 
 The PF_MEMALLOC patchset f.e. is about avoiding to go out of 
 memory when there is still memory available even if we are doing a 
 PF_MEMALLOC allocation and would OOM otherwise.

Right, but as long as there is a need for PF_MEMALLOC there is a need
for the patches I proposed.

  Networking can currently be seen as having two states:
  
   1 receive packets and consume memory
   2 drop all packets (when out of memory)
  
  I need a 3rd state:
  
   3 receiving packets but not consuming memory
 
 So far a good idea. If you are not consuming memory then why are the 
 allocators involved?

Because I do need to receive some packets, its just that I'll free them
again. So it won't keep consuming memory. This needs a little pool of
memory in order to operate in a stable state.

Its: alloc, receive, inspect, free
total memory use: 0
memory delta: a little
 
(its just that you need to be able to receive a significant number of
packets, not 1, due to funny things like ip-defragmentation before you
can be sure to actually receive 1 whole tcp packet - but the idea is the
same)

  Now, I need this state when we're in PF_MEMALLOC territory, because I
  need to be able to process an unspecified amount of network traffic in
  order to receive the writeout completion.
  
  In order to operate this 3rd network state, some memory is needed in
  which packets can be received and when deemed not important freed and
  reused.
  
  It needs a bounded amount of memory in order to process an unbounded
  amount of network traffic.
  
  What exactly is not clear about this? If you accept the need for
  PF_MEMALLOC you surely must also agree that at the point you're using it
  running reclaim is useless.
 
 Yes looks like you would like to add something to the network layer to 
 filter important packets. As long as you stay within PF_MEMALLOC 
 boundaries you can allocate and throw packets away. If you want to have a 
 reserve that is secure and just for you then you need to take it away from 
 the reserves (which in turn will lead reclaim to restore them).

Ah, but also note that _using_ PF_MEMALLOC is the trigger to enter that
3rd network state. These two are tightly coupled. You only need this 3rd
state when under PF_MEMALLOC, otherwise we could just receive normally.

So, my thinking was that, if the current reserves are good enough to
keep the system 'deadlock' free, I can just enlarge the reserves by
whatever it is I need for that network state and we're all good, no?

Why separate these two? If the current reserve is large enough (and
theoretically it is not - but I'm meaning to fix that) it will not
consume the extra memory I added below.

Note how:
  [PATCH 09/10] mm: emergency pool
pushes up the current reserves in a fashion so as to maintain the
relative operating range of the page allocator (distance between
min,low,high and scaling of the wmarks under ALLOC_HIGH|ALLOC_HARDER).

Also, failing 

Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-23 Thread Andrea Arcangeli
On Tue, Aug 21, 2007 at 03:32:25PM -0700, Christoph Lameter wrote:
 1. Like in the earlier patchset allow reentry to reclaim under 
PF_MEMALLOC if we are out of all memory.

Can you simply tweak on the may_writepage flag only to achieve the
second pass? We're talking here about a totally non-performance case,
almost impossible to hit in practice unless you do real weird things,
and certainly very unlikely to happen. So I'm unsure what's all that
complexity just to make a regular pass on the lru looking for clean
pages, something may_writepage=0 already does.

Like Andi said at most one may_writepage=0 recursion should be
allowed.

If the PF_MEMALLOC is found empty, I agree entering reclaim a second
time with may_writepage=0 sounds theoretically a good idea (in
practice it should never be necessary). printk must also be printed to
warn the user he was risking to deadlock for real and he has to
increase the min_free_kbytes.

 2. Do the laundry as here but do not write out laundry directly.
Instead move laundry to a new lru style list in the zone structure.
This will allow the recursive reclaim to also trigger writeout
of pages (what this patchset was supposed to accomplish).

A new lru for this sounds overkill to me, we're talking about deadlock
avoidance, this has absolutely nothing to do with real life 99.%
of runtime of all kernels out there.

 3. Perform writeback only from kswapd. Make other threads
wait on kswapd if memory is low, we can wait and writeback still
has to progress.

What does buy you to think about other threads? The whole trouble is
that PF_MEMALLOC is global, no matter which thread (pdflush like other
email to Andi or kswapd here) still it'll deadlock the same way. If
your intent is to limit the max number of in-flight writepage that
could be achieved with a sempahore, not by context switching for no
good reason. kswapd is needed for atomic allocations and to pipeline
the VM so that the vm runs more likely asynchronous inside kswapd.

 4. Then allow reclaim of GFP_ATOMIC allocs (see
http://marc.info/?l=linux-kernelm=118710595617696w=2). Atomic
reclaim can then also put pages onto the zone laundry lists from where
it is going to be picked up and written out by kswapd ASAP. This one
may be tricky so maybe keep this separate.

That sounds a bit risky, there are latency considerations here to
make, GFP_ATOMIC will run with irq locally disabled and it may hang
for indefinite amount of time (O(N)). So irq latency may break and it
may be better to lose a packet once in a while than to hang
interrupts. If you want to do this you'd probably need to add a new
GFP_ATOMIC_RECLAIM or similar.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-23 Thread Andrea Arcangeli
On Wed, Aug 22, 2007 at 12:09:03AM +0200, Peter Zijlstra wrote:
 Strictly speaking:
 
 if:
 
  page = alloc_page(gfp);
 
 fails but:
 
  obj = kmem_cache_alloc(s, gfp);
 
 succeeds then its a bug.

Why? this is like saying that if alloc_pages(order=1) fails but
alloc_pages(order=0) succeeds then it's a bug. Obviously it's not a
bug.

The only bug is if slab allocations =4k fails despite
alloc_pages(order=0) would succeed.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-23 Thread Andrea Arcangeli
On Wed, Aug 22, 2007 at 10:03:45PM +0200, Peter Zijlstra wrote:
 Its not extreme, not even rare, and its handled now. Its what
 PF_MEMALLOC is for.

Agreed. This is the whole point, either you limit the max amount of
anon memory, slab, alloc_pages a driver can do or you reserve a
pool. Guess what? In practice limiting the max ram a driver can eat in
alloc_pages, at the same time while limting the max amount of pages
that can be anon ram, etc..etc.. is called reserving a pool of
freepages for PF_MEMALLOC.

Now in theory we could try a may_writepage=0 second reclaim pass
before using the PF_MEMALLOC pool but would that make any difference
other than being slower? We can argue what should be done first but
the PF_MEMALLOC pool isn't likely to go away with this patch... only
way to make it go away is to have every subsystem including tcp
incoming to have mempools for everything which is too complicated to
implement so we've to live the imperfect world that just works good
enough.

This logic of falling back in a may_writepage=0 pass will make things
a bit more reliable but certainly not perfect and it doesn't obsolete
the need of the current code IMHO.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-23 Thread Peter Zijlstra
On Thu, 2007-08-23 at 14:08 +0200, Andrea Arcangeli wrote:
 On Wed, Aug 22, 2007 at 12:09:03AM +0200, Peter Zijlstra wrote:
  Strictly speaking:
  
  if:
  
   page = alloc_page(gfp);
  
  fails but:
  
   obj = kmem_cache_alloc(s, gfp);
  
  succeeds then its a bug.
 
 Why? this is like saying that if alloc_pages(order=1) fails but
 alloc_pages(order=0) succeeds then it's a bug. Obviously it's not a
 bug.
 
 The only bug is if slab allocations =4k fails despite
 alloc_pages(order=0) would succeed.

That would be currently true. However I need it to be stricter.

I'm wanting to do networked swap. And in order to be able to receive
writeout completions when in the PF_MEMALLOC region I need to introduce
a new network state. This is because it needs to operate in a steady
state with limited (bounded) memory use.

Normal network either consumes memory, or fails to receive anything at
all.

So this new network state will allocate space for a packet, receive the
packet from the NIC, inspect the packet, and toss the packet when its
not found to be aimed at the VM (ie. does not contain a writeout
completion).

So the total memory consumption of this state is 0 - it always frees
what it takes, but the memory use is non 0 but bounded - it does
temporarily use memory, but will limit itself to never exceed a given
maximum)

Because the network stack runs on the slab allocator in generic (both
kmem_cache and kmalloc) I need this extra guarantee so that a slab
allocated from the reserves will not serve objects to some random
non-critical application.

If this is not restricted this network state can leak memory to outside
of PF_MEMALLOC and will not be stable.

So what I need is:

  kmem_cache_alloc(s, gfp) to fail when alloc_page(gfp) fails

agreeing on the extra condition:

  when kmem_cache_size(s) = PAGE_SIZE

and the extra note that:

  I only really need it to fail for ALLOC_NO_WATERMARKS, the other
  levels like ALLOC_HIGH and ALLOC_HARDER are not critical.

Which ends up with:

  if the current gfp-context does not allow ALLOC_NO_WATERMARKS
allocations, and alloc_page() fails, so must kmem_cache_alloc(s,) if
kmem_cache_size(s) = PAGE_SIZE.

(yes this leaves jumbo frames broken)

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-23 Thread Christoph Lameter
On Thu, 23 Aug 2007, Andrea Arcangeli wrote:

 On Tue, Aug 21, 2007 at 03:32:25PM -0700, Christoph Lameter wrote:
  1. Like in the earlier patchset allow reentry to reclaim under 
 PF_MEMALLOC if we are out of all memory.
 
 Can you simply tweak on the may_writepage flag only to achieve the
 second pass? We're talking here about a totally non-performance case,
 almost impossible to hit in practice unless you do real weird things,
 and certainly very unlikely to happen. So I'm unsure what's all that
 complexity just to make a regular pass on the lru looking for clean
 pages, something may_writepage=0 already does.
 

Yes that is what the PF_MEMALLOC patch that I posted before does. This 
discussion gets me more and more to thinking that the recursive reclaim on 
PF_MEMALLOC is all that is needed for emergency situations (to get out of 
the tight spot).

See
http://marc.info/?l=linux-kernelm=118710219116624w=2

 If the PF_MEMALLOC is found empty, I agree entering reclaim a second
 time with may_writepage=0 sounds theoretically a good idea (in
 practice it should never be necessary). printk must also be printed to
 warn the user he was risking to deadlock for real and he has to
 increase the min_free_kbytes.

Ok. I can add a printk to that one.

 That sounds a bit risky, there are latency considerations here to
 make, GFP_ATOMIC will run with irq locally disabled and it may hang
 for indefinite amount of time (O(N)). So irq latency may break and it
 may be better to lose a packet once in a while than to hang
 interrupts. If you want to do this you'd probably need to add a new
 GFP_ATOMIC_RECLAIM or similar.

Well we could do the same as for PF_MEMALLOC: print a warning and then 
reclaim nevertheless if we cannot fail (We already have a GFP_NOFAIL 
flag). It is better to generate a latency than the system failing 
altogether. However the GFP_ATOMIC reclaim patchset is a 
bit more invasive (http://marc.info/?l=linux-mmm=118710584014150w=2). 
Maybe this is too much churn for the rare need of such a reclaim.
 
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-22 Thread Christoph Lameter
On Wed, 22 Aug 2007, Peter Zijlstra wrote:

> > That is an extreme case that AFAIK we currently ignore and could be 
> > avoided with some effort.
> 
> Its not extreme, not even rare, and its handled now. Its what
> PF_MEMALLOC is for.

No its not. If you have all pages allocated as anonymous pages and your 
writeout requires more pages than available in the reserves then you are 
screwed either way regardless if you have PF_MEMALLOC set or not.

> > The initial PF_MEMALLOC patchset seems to be 
> > still enough to deal with your issues.
> 
> Take the anonyous workload, user-space will block once the page
> allocator hits ALLOC_MIN. Network will be able to receive until
> ALLOC_MIN|ALLOC_HIGH - if the completion doesn't arrive by then it will
> start dropping all packets until there is memory again. But userspace is
> wedged and hence will not consume the network traffic, hence we
> deadlock.
> 
> Even if there is something to reclaim initially, if the pressure
> persists that can eventually be exhausted.

Sure ultimately you will end up with pages that are all unreclaimable if 
you reclaim all reclaimable memory.

> > multiple critical tasks on various devices that have various memory needs. 
> > So multiple critical spots can happen concurrently in multiple 
> > application contexts.
> 
> yes, reclaim can be unbounded concurrent, and that is one of the
> (theoretically) major problems we currently have.

So your patchset is not fixing it?

> > We have that with PF_MEMALLOC.
> 
> Exactly. But if you recognise the need for PF_MEMALLOC then what is this
> argument about?

The PF_MEMALLOC patchset f.e. is about avoiding to go out of 
memory when there is still memory available even if we are doing a 
PF_MEMALLOC allocation and would OOM otherwise.

> Networking can currently be seen as having two states:
> 
>  1 receive packets and consume memory
>  2 drop all packets (when out of memory)
> 
> I need a 3rd state:
> 
>  3 receiving packets but not consuming memory

So far a good idea. If you are not consuming memory then why are the 
allocators involved?
 
> Now, I need this state when we're in PF_MEMALLOC territory, because I
> need to be able to process an unspecified amount of network traffic in
> order to receive the writeout completion.
> 
> In order to operate this 3rd network state, some memory is needed in
> which packets can be received and when deemed not important freed and
> reused.
> 
> It needs a bounded amount of memory in order to process an unbounded
> amount of network traffic.
> 
> What exactly is not clear about this? If you accept the need for
> PF_MEMALLOC you surely must also agree that at the point you're using it
> running reclaim is useless.

Yes looks like you would like to add something to the network layer to 
filter important packets. As long as you stay within PF_MEMALLOC 
boundaries you can allocate and throw packets away. If you want to have a 
reserve that is secure and just for you then you need to take it away from 
the reserves (which in turn will lead reclaim to restore them).

> > > Also, failing a memory allocation isn't bad, why are you so worried
> > > about that? It happens all the time.
> > 
> > Its a performance impact and plainly does not make sense if there is 
> > reclaimable memory availble. The common action of the vm is to reclaim if 
> > there is a demand for memory. Now we suddenly abandon that approach?
> 
> I'm utterly confused by this, on one hand you recognise the need for
> PF_MEMALLOC but on the other hand you're saying its not needed and
> anybody needing memory (even reclaim itself) should use reclaim.

The VM reclaims memory on demand but in exceptional limited cases where we 
cannot do so we use the reserves. I am sure you know this.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-22 Thread Peter Zijlstra
On Wed, 2007-08-22 at 12:04 -0700, Christoph Lameter wrote:
> On Wed, 22 Aug 2007, Peter Zijlstra wrote:
> 
> > Its unavoidable, at some point it just happens. Also using reclaim
> > doesn't seem like the ideal way to get out of live-locks since reclaim
> > itself can live-lock on these large boxen.
> 
> If reclaim can live lock then it needs to be fixed.

Riel is working on that.

> > As shown, there are cases where there just isn't any memory to reclaim.
> > Please accept this.
> 
> That is an extreme case that AFAIK we currently ignore and could be 
> avoided with some effort.

Its not extreme, not even rare, and its handled now. Its what
PF_MEMALLOC is for.

> The initial PF_MEMALLOC patchset seems to be 
> still enough to deal with your issues.

No it isnt. 

Take the anonyous workload, user-space will block once the page
allocator hits ALLOC_MIN. Network will be able to receive until
ALLOC_MIN|ALLOC_HIGH - if the completion doesn't arrive by then it will
start dropping all packets until there is memory again. But userspace is
wedged and hence will not consume the network traffic, hence we
deadlock.

Even if there is something to reclaim initially, if the pressure
persists that can eventually be exhausted.

> > Also, by reclaiming memory and getting out of the tight spot you give
> > the rest of the system access to that memory, and it can be used for
> > other things than getting out of the tight spot.
> 
> The rest of the system may have their own tights spots. Language the "the 
> tight spot" sets up all sort of alarms over here since you seem to be 
> thinking about a system doing a single task.

reclaim

>  The system may be handling 
> multiple critical tasks on various devices that have various memory needs. 
> So multiple critical spots can happen concurrently in multiple 
> application contexts.

yes, reclaim can be unbounded concurrent, and that is one of the
(theoretically) major problems we currently have.

> > You really want a separate allocation state that allows only reclaim to
> > access memory.
> 
> We have that with PF_MEMALLOC.

Exactly. But if you recognise the need for PF_MEMALLOC then what is this
argument about?

Networking can currently be seen as having two states:

 1 receive packets and consume memory
 2 drop all packets (when out of memory)

I need a 3rd state:

 3 receiving packets but not consuming memory

Now, I need this state when we're in PF_MEMALLOC territory, because I
need to be able to process an unspecified amount of network traffic in
order to receive the writeout completion.

In order to operate this 3rd network state, some memory is needed in
which packets can be received and when deemed not important freed and
reused.

It needs a bounded amount of memory in order to process an unbounded
amount of network traffic.

What exactly is not clear about this? If you accept the need for
PF_MEMALLOC you surely must also agree that at the point you're using it
running reclaim is useless.

> > Also, failing a memory allocation isn't bad, why are you so worried
> > about that? It happens all the time.
> 
> Its a performance impact and plainly does not make sense if there is 
> reclaimable memory availble. The common action of the vm is to reclaim if 
> there is a demand for memory. Now we suddenly abandon that approach?

I'm utterly confused by this, on one hand you recognise the need for
PF_MEMALLOC but on the other hand you're saying its not needed and
anybody needing memory (even reclaim itself) should use reclaim.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-22 Thread Christoph Lameter
On Wed, 22 Aug 2007, Ingo Molnar wrote:

> Could you outline the "big picture" as you see it? To me your argument 
> that reclaim can always be done instantly and that the cases where it 
> cannot be done are pathological and need to be avoided is fundamentally 
> dangerous and quite a bit short-sighted at first glance.

That is a bit overdrawing my argument. The issues that Peter saw can be 
fixed by allowing recursive reclaim (see the earlier patchset). The rest 
is so far sugar on top or building extreme cases where we already have 
trouble.

> The big picture way to think about this is the following: the free page 
> pool is the "cache" of the MM. It's what "greases" the mechanism and 
> bridges the inevitable reclaim latency and makes "atomic memory" 
> available to the reclaim mechanism itself. We _cannot_ remove that cache 
> without a conceptual replacement (or a _very_ robust argument and proof 
> that the free pages pool is not needed at all - this would be a major 
> design change (and a stupid mistake IMO)). Your patchset, in essence, 
> tries to claim that we dont really need this cache and that all that 
> matters is to keep enough clean pagecache pages around. That approach 
> misses the full picture and i dont think we can progress without 
> agreeing on the fundamentals first.

The patchset attempts to deal with the reserves in a more intelligent 
way in order not to fail when this pool becomes exhausted because some
device needs a lot of memory in the writeout path.

> That "cache" cannot be handled in your scheme: a fully or mostly 
> anonymous workload (tons of apps are like that) instantly destroys the 
> "there is always a minimal amount of atomically reclaimable pages 
> around" property of freelists, and this cannot be talked or tweaked 
> around by twiddling any existing property of anonymous reclaim. 

A extreme anonymous workload like discussed here can even cause the 
current VM to fail. Realistically at least portions of the executable and 
varios slab caches will remain in memory in addition to the reserves.

> Anonymous memory is dirty and takes ages to reclaim. The fact that your 
> patchset causes an easy anonymous OOM further underlines this flaw of 
> your thinking. Not making anonymous workloads OOM is the _hardest_ part 
> of the MM, by far. Pagecache reclaim is a breeze in comparison :-)

The central flaw in my thinking was the switching of of PF_MEMALLOC on the 
writeout path instead of allowing recursive PF_MEMALLOC reclaim as in the 
first patch. But the first patchset did not have that flaw.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-22 Thread Christoph Lameter
On Wed, 22 Aug 2007, Peter Zijlstra wrote:

> Its unavoidable, at some point it just happens. Also using reclaim
> doesn't seem like the ideal way to get out of live-locks since reclaim
> itself can live-lock on these large boxen.

If reclaim can live lock then it needs to be fixed.

> As shown, there are cases where there just isn't any memory to reclaim.
> Please accept this.

That is an extreme case that AFAIK we currently ignore and could be 
avoided with some effort. The initial PF_MEMALLOC patchset seems to be 
still enough to deal with your issues.

> Also, by reclaiming memory and getting out of the tight spot you give
> the rest of the system access to that memory, and it can be used for
> other things than getting out of the tight spot.

The rest of the system may have their own tights spots. Language the "the 
tight spot" sets up all sort of alarms over here since you seem to be 
thinking about a system doing a single task. The system may be handling 
multiple critical tasks on various devices that have various memory needs. 
So multiple critical spots can happen concurrently in multiple 
application contexts.

> You really want a separate allocation state that allows only reclaim to
> access memory.

We have that with PF_MEMALLOC.

> Also, failing a memory allocation isn't bad, why are you so worried
> about that? It happens all the time.

Its a performance impact and plainly does not make sense if there is 
reclaimable memory availble. The common action of the vm is to reclaim if 
there is a demand for memory. Now we suddenly abandon that approach?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-22 Thread Ingo Molnar

* Christoph Lameter <[EMAIL PROTECTED]> wrote:

> > I want slab to fail when a similar page alloc would fail, no magic.
> 
> Yes I know. I do not want allocations to fail but that reclaim occurs 
> in order to avoid failing any allocation. We need provisions that make 
> sure that we never get into such a bad memory situation that would 
> cause severe slowless and usually end up in a livelock anyways.

Could you outline the "big picture" as you see it? To me your argument 
that reclaim can always be done instantly and that the cases where it 
cannot be done are pathological and need to be avoided is fundamentally 
dangerous and quite a bit short-sighted at first glance.

The big picture way to think about this is the following: the free page 
pool is the "cache" of the MM. It's what "greases" the mechanism and 
bridges the inevitable reclaim latency and makes "atomic memory" 
available to the reclaim mechanism itself. We _cannot_ remove that cache 
without a conceptual replacement (or a _very_ robust argument and proof 
that the free pages pool is not needed at all - this would be a major 
design change (and a stupid mistake IMO)). Your patchset, in essence, 
tries to claim that we dont really need this cache and that all that 
matters is to keep enough clean pagecache pages around. That approach 
misses the full picture and i dont think we can progress without 
agreeing on the fundamentals first.

That "cache" cannot be handled in your scheme: a fully or mostly 
anonymous workload (tons of apps are like that) instantly destroys the 
"there is always a minimal amount of atomically reclaimable pages 
around" property of freelists, and this cannot be talked or tweaked 
around by twiddling any existing property of anonymous reclaim. 
Anonymous memory is dirty and takes ages to reclaim. The fact that your 
patchset causes an easy anonymous OOM further underlines this flaw of 
your thinking. Not making anonymous workloads OOM is the _hardest_ part 
of the MM, by far. Pagecache reclaim is a breeze in comparison :-)

So there is a large and fundamental rift between having pages on the 
freelist (instantly available to any context) and having them on the 
(current) LRU where they might or might not be clean, etc. The freelists 
are an implicit guarantee of buffering and atomicity and they can and do 
save the day if everything else fails to keep stuff insta-freeable. (And 
then we havent even considered the performance and scalability 
differences between picking from the pcp freelists versus picking pages 
from the LRU, havent considered the better higher-order page allocation 
property of the buddy pool and havent considered the atomicity of 
in-irq-handler allocations.)

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-22 Thread Peter Zijlstra
On Tue, 2007-08-21 at 15:43 -0700, Christoph Lameter wrote:
> On Wed, 22 Aug 2007, Peter Zijlstra wrote:
> 
> > Also, all I want is for slab to honour gfp flags like page allocation
> > does, nothing more, nothing less.
> > 
> > (well, actually slightly less, since I'm only really interrested in the
> > ALLOC_MIN|ALLOC_HIGH|ALLOC_HARDER -> ALLOC_NO_WATERMARKS transition and
> > not all higher ones)
> 
> I am still not sure what that brings you. There may be multiple 
> PF_MEMALLOC going on at the same time. On a large system with N cpus
> there may be more than N of these that can steal objects from one another. 

Yes, quite aware of that, and have ideas on how to properly fix that.
Once it is, the reserves can be shrunk too, perhaps you can work on
this?

> A NUMA system will be shot anyways if memory gets that problematic to 
> handle since the OS cannot effectively place memory if all zones are 
> overallocated so that only a few pages are left.

Also not a new problem.

> > I want slab to fail when a similar page alloc would fail, no magic.
> 
> Yes I know. I do not want allocations to fail but that reclaim occurs in 
> order to avoid failing any allocation. We need provisions that 
> make sure that we never get into such a bad memory situation that would
> cause severe slowless and usually end up in a livelock anyways.

Its unavoidable, at some point it just happens. Also using reclaim
doesn't seem like the ideal way to get out of live-locks since reclaim
itself can live-lock on these large boxen.

> > > > Anonymous pages are a there to stay, and we cannot tell people how to
> > > > use them. So we need some free or freeable pages in order to avoid the
> > > > vm deadlock that arises from all memory dirty.
> > > 
> > > No one is trying to abolish Anonymous pages. Free memory is readily 
> > > available on demand if one calls reclaim. Your scheme introduces complex 
> > > negotiations over a few scraps of memory when large amounts of memory 
> > > would still be readily available if one would do the right thing and call 
> > > into reclaim.
> > 
> > This is the thing I contend, there need not be large amounts of memory
> > around. In my test prog the hot code path fits into a single page, the
> > rest can be anonymous.
> 
> Thats a bit extreme We need to make sure that there are larger amounts 
> of memory around. Pages are used for all shorts of short term uses (like 
> slab shrinking etc etc.). If memory is that low that a single page matters
> then we are in very bad shape anyways.

Yes we are, but its a legitimate situation. Denying it won't get us very
far. Also placing a large bound on anonymous memory usage is not going
to be appreciated by the userspace people.

Slab cache will also be at a minimum is the pressure persists for a
while.

> > > Sounds like you would like to change the way we handle memory in general 
> > > in the VM? Reclaim (and thus finding freeable pages) is basic to Linux 
> > > memory management.
> > 
> > Not quite, currently we have free pages in the reserves, if you want to
> > replace some (or all) of that by freeable pages then that is a change.
> 
> We have free pages primarily to optimize the allocation. Meaning we do not 
> have to run reclaim on every call. We want to use all of memory. The 
> reserves are there for the case that we cannot call into reclaim. 

> The easy 
> solution if that is problematic is to enhance the reclaim to work in the
> critical situations that we care about.

As shown, there are cases where there just isn't any memory to reclaim.
Please accept this.

Also, by reclaiming memory and getting out of the tight spot you give
the rest of the system access to that memory, and it can be used for
other things than getting out of the tight spot.

You really want a separate allocation state that allows only reclaim to
access memory.

> > > Sorry I just got into this a short time ago and I may need a few cycles 
> > > to get this all straight. An approach that uses memory instead of 
> > > ignoring available memory is certainly better.
> > 
> > Sure if and when possible. There will always be need to fall back to the
> > reserves.
> 
> Maybe. But we can certainly avoid that as much as possible which would 
> also increase our ability to use all available memory instead of leaving 
> some of it unused./
> 
> > A bit off-topic, re that reclaim from atomic context:
> > Currently we try to hold spinlocks only for short periods of time so
> > that reclaim can be preempted, if you run all of reclaim from a
> > non-preemptible context you get very large preemption latencies and if
> > done from int context it'd also generate large int latencies.
> 
> If you call into the page allocator from an interrupt context then you are 
> already in bad shape since we may check pcps lists and then potentially 
> have to traverse the zonelists and check all sorts of things. 

Only an issue on these obscenely large NUMA boxen, normal machines don't
have large zone 

Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-22 Thread Peter Zijlstra
On Tue, 2007-08-21 at 15:43 -0700, Christoph Lameter wrote:
 On Wed, 22 Aug 2007, Peter Zijlstra wrote:
 
  Also, all I want is for slab to honour gfp flags like page allocation
  does, nothing more, nothing less.
  
  (well, actually slightly less, since I'm only really interrested in the
  ALLOC_MIN|ALLOC_HIGH|ALLOC_HARDER - ALLOC_NO_WATERMARKS transition and
  not all higher ones)
 
 I am still not sure what that brings you. There may be multiple 
 PF_MEMALLOC going on at the same time. On a large system with N cpus
 there may be more than N of these that can steal objects from one another. 

Yes, quite aware of that, and have ideas on how to properly fix that.
Once it is, the reserves can be shrunk too, perhaps you can work on
this?

 A NUMA system will be shot anyways if memory gets that problematic to 
 handle since the OS cannot effectively place memory if all zones are 
 overallocated so that only a few pages are left.

Also not a new problem.

  I want slab to fail when a similar page alloc would fail, no magic.
 
 Yes I know. I do not want allocations to fail but that reclaim occurs in 
 order to avoid failing any allocation. We need provisions that 
 make sure that we never get into such a bad memory situation that would
 cause severe slowless and usually end up in a livelock anyways.

Its unavoidable, at some point it just happens. Also using reclaim
doesn't seem like the ideal way to get out of live-locks since reclaim
itself can live-lock on these large boxen.

Anonymous pages are a there to stay, and we cannot tell people how to
use them. So we need some free or freeable pages in order to avoid the
vm deadlock that arises from all memory dirty.
   
   No one is trying to abolish Anonymous pages. Free memory is readily 
   available on demand if one calls reclaim. Your scheme introduces complex 
   negotiations over a few scraps of memory when large amounts of memory 
   would still be readily available if one would do the right thing and call 
   into reclaim.
  
  This is the thing I contend, there need not be large amounts of memory
  around. In my test prog the hot code path fits into a single page, the
  rest can be anonymous.
 
 Thats a bit extreme We need to make sure that there are larger amounts 
 of memory around. Pages are used for all shorts of short term uses (like 
 slab shrinking etc etc.). If memory is that low that a single page matters
 then we are in very bad shape anyways.

Yes we are, but its a legitimate situation. Denying it won't get us very
far. Also placing a large bound on anonymous memory usage is not going
to be appreciated by the userspace people.

Slab cache will also be at a minimum is the pressure persists for a
while.

   Sounds like you would like to change the way we handle memory in general 
   in the VM? Reclaim (and thus finding freeable pages) is basic to Linux 
   memory management.
  
  Not quite, currently we have free pages in the reserves, if you want to
  replace some (or all) of that by freeable pages then that is a change.
 
 We have free pages primarily to optimize the allocation. Meaning we do not 
 have to run reclaim on every call. We want to use all of memory. The 
 reserves are there for the case that we cannot call into reclaim. 

 The easy 
 solution if that is problematic is to enhance the reclaim to work in the
 critical situations that we care about.

As shown, there are cases where there just isn't any memory to reclaim.
Please accept this.

Also, by reclaiming memory and getting out of the tight spot you give
the rest of the system access to that memory, and it can be used for
other things than getting out of the tight spot.

You really want a separate allocation state that allows only reclaim to
access memory.

   Sorry I just got into this a short time ago and I may need a few cycles 
   to get this all straight. An approach that uses memory instead of 
   ignoring available memory is certainly better.
  
  Sure if and when possible. There will always be need to fall back to the
  reserves.
 
 Maybe. But we can certainly avoid that as much as possible which would 
 also increase our ability to use all available memory instead of leaving 
 some of it unused./
 
  A bit off-topic, re that reclaim from atomic context:
  Currently we try to hold spinlocks only for short periods of time so
  that reclaim can be preempted, if you run all of reclaim from a
  non-preemptible context you get very large preemption latencies and if
  done from int context it'd also generate large int latencies.
 
 If you call into the page allocator from an interrupt context then you are 
 already in bad shape since we may check pcps lists and then potentially 
 have to traverse the zonelists and check all sorts of things. 

Only an issue on these obscenely large NUMA boxen, normal machines don't
have large zone lists. No reason to hurt the small boxen in favour of
the large boxen.

 If we 
 would implement atomic reclaim then the reserves 

Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-22 Thread Ingo Molnar

* Christoph Lameter [EMAIL PROTECTED] wrote:

  I want slab to fail when a similar page alloc would fail, no magic.
 
 Yes I know. I do not want allocations to fail but that reclaim occurs 
 in order to avoid failing any allocation. We need provisions that make 
 sure that we never get into such a bad memory situation that would 
 cause severe slowless and usually end up in a livelock anyways.

Could you outline the big picture as you see it? To me your argument 
that reclaim can always be done instantly and that the cases where it 
cannot be done are pathological and need to be avoided is fundamentally 
dangerous and quite a bit short-sighted at first glance.

The big picture way to think about this is the following: the free page 
pool is the cache of the MM. It's what greases the mechanism and 
bridges the inevitable reclaim latency and makes atomic memory 
available to the reclaim mechanism itself. We _cannot_ remove that cache 
without a conceptual replacement (or a _very_ robust argument and proof 
that the free pages pool is not needed at all - this would be a major 
design change (and a stupid mistake IMO)). Your patchset, in essence, 
tries to claim that we dont really need this cache and that all that 
matters is to keep enough clean pagecache pages around. That approach 
misses the full picture and i dont think we can progress without 
agreeing on the fundamentals first.

That cache cannot be handled in your scheme: a fully or mostly 
anonymous workload (tons of apps are like that) instantly destroys the 
there is always a minimal amount of atomically reclaimable pages 
around property of freelists, and this cannot be talked or tweaked 
around by twiddling any existing property of anonymous reclaim. 
Anonymous memory is dirty and takes ages to reclaim. The fact that your 
patchset causes an easy anonymous OOM further underlines this flaw of 
your thinking. Not making anonymous workloads OOM is the _hardest_ part 
of the MM, by far. Pagecache reclaim is a breeze in comparison :-)

So there is a large and fundamental rift between having pages on the 
freelist (instantly available to any context) and having them on the 
(current) LRU where they might or might not be clean, etc. The freelists 
are an implicit guarantee of buffering and atomicity and they can and do 
save the day if everything else fails to keep stuff insta-freeable. (And 
then we havent even considered the performance and scalability 
differences between picking from the pcp freelists versus picking pages 
from the LRU, havent considered the better higher-order page allocation 
property of the buddy pool and havent considered the atomicity of 
in-irq-handler allocations.)

Ingo
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-22 Thread Christoph Lameter
On Wed, 22 Aug 2007, Peter Zijlstra wrote:

 Its unavoidable, at some point it just happens. Also using reclaim
 doesn't seem like the ideal way to get out of live-locks since reclaim
 itself can live-lock on these large boxen.

If reclaim can live lock then it needs to be fixed.

 As shown, there are cases where there just isn't any memory to reclaim.
 Please accept this.

That is an extreme case that AFAIK we currently ignore and could be 
avoided with some effort. The initial PF_MEMALLOC patchset seems to be 
still enough to deal with your issues.

 Also, by reclaiming memory and getting out of the tight spot you give
 the rest of the system access to that memory, and it can be used for
 other things than getting out of the tight spot.

The rest of the system may have their own tights spots. Language the the 
tight spot sets up all sort of alarms over here since you seem to be 
thinking about a system doing a single task. The system may be handling 
multiple critical tasks on various devices that have various memory needs. 
So multiple critical spots can happen concurrently in multiple 
application contexts.

 You really want a separate allocation state that allows only reclaim to
 access memory.

We have that with PF_MEMALLOC.

 Also, failing a memory allocation isn't bad, why are you so worried
 about that? It happens all the time.

Its a performance impact and plainly does not make sense if there is 
reclaimable memory availble. The common action of the vm is to reclaim if 
there is a demand for memory. Now we suddenly abandon that approach?
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-22 Thread Christoph Lameter
On Wed, 22 Aug 2007, Ingo Molnar wrote:

 Could you outline the big picture as you see it? To me your argument 
 that reclaim can always be done instantly and that the cases where it 
 cannot be done are pathological and need to be avoided is fundamentally 
 dangerous and quite a bit short-sighted at first glance.

That is a bit overdrawing my argument. The issues that Peter saw can be 
fixed by allowing recursive reclaim (see the earlier patchset). The rest 
is so far sugar on top or building extreme cases where we already have 
trouble.

 The big picture way to think about this is the following: the free page 
 pool is the cache of the MM. It's what greases the mechanism and 
 bridges the inevitable reclaim latency and makes atomic memory 
 available to the reclaim mechanism itself. We _cannot_ remove that cache 
 without a conceptual replacement (or a _very_ robust argument and proof 
 that the free pages pool is not needed at all - this would be a major 
 design change (and a stupid mistake IMO)). Your patchset, in essence, 
 tries to claim that we dont really need this cache and that all that 
 matters is to keep enough clean pagecache pages around. That approach 
 misses the full picture and i dont think we can progress without 
 agreeing on the fundamentals first.

The patchset attempts to deal with the reserves in a more intelligent 
way in order not to fail when this pool becomes exhausted because some
device needs a lot of memory in the writeout path.

 That cache cannot be handled in your scheme: a fully or mostly 
 anonymous workload (tons of apps are like that) instantly destroys the 
 there is always a minimal amount of atomically reclaimable pages 
 around property of freelists, and this cannot be talked or tweaked 
 around by twiddling any existing property of anonymous reclaim. 

A extreme anonymous workload like discussed here can even cause the 
current VM to fail. Realistically at least portions of the executable and 
varios slab caches will remain in memory in addition to the reserves.

 Anonymous memory is dirty and takes ages to reclaim. The fact that your 
 patchset causes an easy anonymous OOM further underlines this flaw of 
 your thinking. Not making anonymous workloads OOM is the _hardest_ part 
 of the MM, by far. Pagecache reclaim is a breeze in comparison :-)

The central flaw in my thinking was the switching of of PF_MEMALLOC on the 
writeout path instead of allowing recursive PF_MEMALLOC reclaim as in the 
first patch. But the first patchset did not have that flaw.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-22 Thread Peter Zijlstra
On Wed, 2007-08-22 at 12:04 -0700, Christoph Lameter wrote:
 On Wed, 22 Aug 2007, Peter Zijlstra wrote:
 
  Its unavoidable, at some point it just happens. Also using reclaim
  doesn't seem like the ideal way to get out of live-locks since reclaim
  itself can live-lock on these large boxen.
 
 If reclaim can live lock then it needs to be fixed.

Riel is working on that.

  As shown, there are cases where there just isn't any memory to reclaim.
  Please accept this.
 
 That is an extreme case that AFAIK we currently ignore and could be 
 avoided with some effort.

Its not extreme, not even rare, and its handled now. Its what
PF_MEMALLOC is for.

 The initial PF_MEMALLOC patchset seems to be 
 still enough to deal with your issues.

No it isnt. 

Take the anonyous workload, user-space will block once the page
allocator hits ALLOC_MIN. Network will be able to receive until
ALLOC_MIN|ALLOC_HIGH - if the completion doesn't arrive by then it will
start dropping all packets until there is memory again. But userspace is
wedged and hence will not consume the network traffic, hence we
deadlock.

Even if there is something to reclaim initially, if the pressure
persists that can eventually be exhausted.

  Also, by reclaiming memory and getting out of the tight spot you give
  the rest of the system access to that memory, and it can be used for
  other things than getting out of the tight spot.
 
 The rest of the system may have their own tights spots. Language the the 
 tight spot sets up all sort of alarms over here since you seem to be 
 thinking about a system doing a single task.

reclaim

  The system may be handling 
 multiple critical tasks on various devices that have various memory needs. 
 So multiple critical spots can happen concurrently in multiple 
 application contexts.

yes, reclaim can be unbounded concurrent, and that is one of the
(theoretically) major problems we currently have.

  You really want a separate allocation state that allows only reclaim to
  access memory.
 
 We have that with PF_MEMALLOC.

Exactly. But if you recognise the need for PF_MEMALLOC then what is this
argument about?

Networking can currently be seen as having two states:

 1 receive packets and consume memory
 2 drop all packets (when out of memory)

I need a 3rd state:

 3 receiving packets but not consuming memory

Now, I need this state when we're in PF_MEMALLOC territory, because I
need to be able to process an unspecified amount of network traffic in
order to receive the writeout completion.

In order to operate this 3rd network state, some memory is needed in
which packets can be received and when deemed not important freed and
reused.

It needs a bounded amount of memory in order to process an unbounded
amount of network traffic.

What exactly is not clear about this? If you accept the need for
PF_MEMALLOC you surely must also agree that at the point you're using it
running reclaim is useless.

  Also, failing a memory allocation isn't bad, why are you so worried
  about that? It happens all the time.
 
 Its a performance impact and plainly does not make sense if there is 
 reclaimable memory availble. The common action of the vm is to reclaim if 
 there is a demand for memory. Now we suddenly abandon that approach?

I'm utterly confused by this, on one hand you recognise the need for
PF_MEMALLOC but on the other hand you're saying its not needed and
anybody needing memory (even reclaim itself) should use reclaim.


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-22 Thread Christoph Lameter
On Wed, 22 Aug 2007, Peter Zijlstra wrote:

  That is an extreme case that AFAIK we currently ignore and could be 
  avoided with some effort.
 
 Its not extreme, not even rare, and its handled now. Its what
 PF_MEMALLOC is for.

No its not. If you have all pages allocated as anonymous pages and your 
writeout requires more pages than available in the reserves then you are 
screwed either way regardless if you have PF_MEMALLOC set or not.

  The initial PF_MEMALLOC patchset seems to be 
  still enough to deal with your issues.
 
 Take the anonyous workload, user-space will block once the page
 allocator hits ALLOC_MIN. Network will be able to receive until
 ALLOC_MIN|ALLOC_HIGH - if the completion doesn't arrive by then it will
 start dropping all packets until there is memory again. But userspace is
 wedged and hence will not consume the network traffic, hence we
 deadlock.
 
 Even if there is something to reclaim initially, if the pressure
 persists that can eventually be exhausted.

Sure ultimately you will end up with pages that are all unreclaimable if 
you reclaim all reclaimable memory.

  multiple critical tasks on various devices that have various memory needs. 
  So multiple critical spots can happen concurrently in multiple 
  application contexts.
 
 yes, reclaim can be unbounded concurrent, and that is one of the
 (theoretically) major problems we currently have.

So your patchset is not fixing it?

  We have that with PF_MEMALLOC.
 
 Exactly. But if you recognise the need for PF_MEMALLOC then what is this
 argument about?

The PF_MEMALLOC patchset f.e. is about avoiding to go out of 
memory when there is still memory available even if we are doing a 
PF_MEMALLOC allocation and would OOM otherwise.

 Networking can currently be seen as having two states:
 
  1 receive packets and consume memory
  2 drop all packets (when out of memory)
 
 I need a 3rd state:
 
  3 receiving packets but not consuming memory

So far a good idea. If you are not consuming memory then why are the 
allocators involved?
 
 Now, I need this state when we're in PF_MEMALLOC territory, because I
 need to be able to process an unspecified amount of network traffic in
 order to receive the writeout completion.
 
 In order to operate this 3rd network state, some memory is needed in
 which packets can be received and when deemed not important freed and
 reused.
 
 It needs a bounded amount of memory in order to process an unbounded
 amount of network traffic.
 
 What exactly is not clear about this? If you accept the need for
 PF_MEMALLOC you surely must also agree that at the point you're using it
 running reclaim is useless.

Yes looks like you would like to add something to the network layer to 
filter important packets. As long as you stay within PF_MEMALLOC 
boundaries you can allocate and throw packets away. If you want to have a 
reserve that is secure and just for you then you need to take it away from 
the reserves (which in turn will lead reclaim to restore them).

   Also, failing a memory allocation isn't bad, why are you so worried
   about that? It happens all the time.
  
  Its a performance impact and plainly does not make sense if there is 
  reclaimable memory availble. The common action of the vm is to reclaim if 
  there is a demand for memory. Now we suddenly abandon that approach?
 
 I'm utterly confused by this, on one hand you recognise the need for
 PF_MEMALLOC but on the other hand you're saying its not needed and
 anybody needing memory (even reclaim itself) should use reclaim.

The VM reclaims memory on demand but in exceptional limited cases where we 
cannot do so we use the reserves. I am sure you know this.


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-21 Thread Christoph Lameter
On Wed, 22 Aug 2007, Peter Zijlstra wrote:

> Also, all I want is for slab to honour gfp flags like page allocation
> does, nothing more, nothing less.
> 
> (well, actually slightly less, since I'm only really interrested in the
> ALLOC_MIN|ALLOC_HIGH|ALLOC_HARDER -> ALLOC_NO_WATERMARKS transition and
> not all higher ones)

I am still not sure what that brings you. There may be multiple 
PF_MEMALLOC going on at the same time. On a large system with N cpus
there may be more than N of these that can steal objects from one another. 

A NUMA system will be shot anyways if memory gets that problematic to 
handle since the OS cannot effectively place memory if all zones are 
overallocated so that only a few pages are left.


> I want slab to fail when a similar page alloc would fail, no magic.

Yes I know. I do not want allocations to fail but that reclaim occurs in 
order to avoid failing any allocation. We need provisions that 
make sure that we never get into such a bad memory situation that would
cause severe slowless and usually end up in a livelock anyways.

> > > Anonymous pages are a there to stay, and we cannot tell people how to
> > > use them. So we need some free or freeable pages in order to avoid the
> > > vm deadlock that arises from all memory dirty.
> > 
> > No one is trying to abolish Anonymous pages. Free memory is readily 
> > available on demand if one calls reclaim. Your scheme introduces complex 
> > negotiations over a few scraps of memory when large amounts of memory 
> > would still be readily available if one would do the right thing and call 
> > into reclaim.
> 
> This is the thing I contend, there need not be large amounts of memory
> around. In my test prog the hot code path fits into a single page, the
> rest can be anonymous.

Thats a bit extreme We need to make sure that there are larger amounts 
of memory around. Pages are used for all shorts of short term uses (like 
slab shrinking etc etc.). If memory is that low that a single page matters
then we are in very bad shape anyways.

> > Sounds like you would like to change the way we handle memory in general 
> > in the VM? Reclaim (and thus finding freeable pages) is basic to Linux 
> > memory management.
> 
> Not quite, currently we have free pages in the reserves, if you want to
> replace some (or all) of that by freeable pages then that is a change.

We have free pages primarily to optimize the allocation. Meaning we do not 
have to run reclaim on every call. We want to use all of memory. The 
reserves are there for the case that we cannot call into reclaim. The easy 
solution if that is problematic is to enhance the reclaim to work in the
critical situations that we care about.


> > Sorry I just got into this a short time ago and I may need a few cycles 
> > to get this all straight. An approach that uses memory instead of 
> > ignoring available memory is certainly better.
> 
> Sure if and when possible. There will always be need to fall back to the
> reserves.

Maybe. But we can certainly avoid that as much as possible which would 
also increase our ability to use all available memory instead of leaving 
some of it unused./

> A bit off-topic, re that reclaim from atomic context:
> Currently we try to hold spinlocks only for short periods of time so
> that reclaim can be preempted, if you run all of reclaim from a
> non-preemptible context you get very large preemption latencies and if
> done from int context it'd also generate large int latencies.

If you call into the page allocator from an interrupt context then you are 
already in bad shape since we may check pcps lists and then potentially 
have to traverse the zonelists and check all sorts of things. If we 
would implement atomic reclaim then the reserves may become a latency 
optimizations. At least we will not fail anymore if the reserves are out.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-21 Thread Christoph Lameter
On Tue, 21 Aug 2007, Rik van Riel wrote:

> Christoph Lameter wrote:
> 
> > I want general improvements to reclaim to address the issues that you see
> > and other issues related to reclaim instead of the strange code that makes
> > PF_MEMALLOC allocs compete for allocations from a single slab and putting
> > logic into the kernel to decide which allocs to fail. We can reclaim after
> > all. Its just a matter of finding the right way to do this. 
> 
> The simplest way of achieving that would be to allow
> recursion of the page reclaim code, under the condition
> that the second level call can only reclaim clean pages,
> while the "outer" call does what the VM does today.

Yes that is what the precursor to this patchset does.

See http://marc.info/?l=linux-mm=118710207203449=2

This one did not even come up to the level of the earlier one. Sigh.

The way forward may be:

1. Like in the earlier patchset allow reentry to reclaim under 
   PF_MEMALLOC if we are out of all memory.

2. Do the laundry as here but do not write out laundry directly.
   Instead move laundry to a new lru style list in the zone structure.
   This will allow the recursive reclaim to also trigger writeout
   of pages (what this patchset was supposed to accomplish).

3. Perform writeback only from kswapd. Make other threads
   wait on kswapd if memory is low, we can wait and writeback still
   has to progress.

4. Then allow reclaim of GFP_ATOMIC allocs (see
   http://marc.info/?l=linux-kernel=118710595617696=2). Atomic
   reclaim can then also put pages onto the zone laundry lists from where
   it is going to be picked up and written out by kswapd ASAP. This one
   may be tricky so maybe keep this separate.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-21 Thread Peter Zijlstra
On Tue, 2007-08-21 at 14:29 -0700, Christoph Lameter wrote:
> On Tue, 21 Aug 2007, Peter Zijlstra wrote:
> 
> > It quickly ends up with all of memory in the laundry list and then
> > recursing into __alloc_pages which will fail to make progress and OOMs.
> 
> H... Okay that needs to be addressed. Reserves need to be used and we 
> only should enter reclaim if that runs out (like the first patch that I 
> did).
> 
> > But aside from the numerous issues with the patch set as presented, I'm
> > not seeing the seeing the big picture, why are you doing this.
> 
> I want general improvements to reclaim to address the issues that you see 
> and other issues related to reclaim instead of the strange code that makes 
> PF_MEMALLOC allocs compete for allocations from a single slab and putting 
> logic into the kernel to decide which allocs to fail. We can reclaim after 
> all. Its just a matter of finding the right way to do this. 

The latest patch I posted got rid of that global slab.

Also, all I want is for slab to honour gfp flags like page allocation
does, nothing more, nothing less.

(well, actually slightly less, since I'm only really interrested in the
ALLOC_MIN|ALLOC_HIGH|ALLOC_HARDER -> ALLOC_NO_WATERMARKS transition and
not all higher ones)

I want slab to fail when a similar page alloc would fail, no magic.

Strictly speaking:

if:

 page = alloc_page(gfp);

fails but:

 obj = kmem_cache_alloc(s, gfp);

succeeds then its a bug.

But I'm not actually needing it that strict, just the ALLOC_NO_WATERMARK
part needs to be done, ALLOC_HARDER, ALLOC_HIGH those may fudge a bit.

> > Anonymous pages are a there to stay, and we cannot tell people how to
> > use them. So we need some free or freeable pages in order to avoid the
> > vm deadlock that arises from all memory dirty.
> 
> No one is trying to abolish Anonymous pages. Free memory is readily 
> available on demand if one calls reclaim. Your scheme introduces complex 
> negotiations over a few scraps of memory when large amounts of memory 
> would still be readily available if one would do the right thing and call 
> into reclaim.

This is the thing I contend, there need not be large amounts of memory
around. In my test prog the hot code path fits into a single page, the
rest can be anonymous.

> > 'Optimizing' this by switching to freeable pages has mainly
> > disadvantages IMHO, finding them scrambles LRU order and complexifies
> > relcaim and all that for a relatively small gain in space for clean
> > pagecache pages.
> 
> Sounds like you would like to change the way we handle memory in general 
> in the VM? Reclaim (and thus finding freeable pages) is basic to Linux 
> memory management.

Not quite, currently we have free pages in the reserves, if you want to
replace some (or all) of that by freeable pages then that is a change.

I'm just using the reserves.

> > Please, stop writing patches and write down a solid proposal of how you
> > envision the VM working in the various scenarios and why its better than
> > the current approach.
> 
> Sorry I just got into this a short time ago and I may need a few cycles 
> to get this all straight. An approach that uses memory instead of 
> ignoring available memory is certainly better.

Sure if and when possible. There will always be need to fall back to the
reserves.

A bit off-topic, re that reclaim from atomic context:
Currently we try to hold spinlocks only for short periods of time so
that reclaim can be preempted, if you run all of reclaim from a
non-preemptible context you get very large preemption latencies and if
done from int context it'd also generate large int latencies.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-21 Thread Rik van Riel

Christoph Lameter wrote:

I want general improvements to reclaim to address the issues that you see 
and other issues related to reclaim instead of the strange code that makes 
PF_MEMALLOC allocs compete for allocations from a single slab and putting 
logic into the kernel to decide which allocs to fail. We can reclaim after 
all. Its just a matter of finding the right way to do this. 


The simplest way of achieving that would be to allow
recursion of the page reclaim code, under the condition
that the second level call can only reclaim clean pages,
while the "outer" call does what the VM does today.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-21 Thread Christoph Lameter
On Tue, 21 Aug 2007, Rik van Riel wrote:

> > What is preventing that from occurring right now? If the dirty pags are
> > aligned in the right way you can have the exact same situation.
> 
> For one, dirty page writeout is done even when free memory
> is low.  The kernel will dig into the PF_MEMALLOC reserves,
> instead of deciding not to do writeout unless there is lots
> of free memory.

Right that is a fundamental problem with this RFC. We need to be able to 
get into PF_MEMALLOC reserves for writeout.
  
> Secondly, why would you want to recreate this worst case on
> purpose every time the pageout code runs?

I did not intend that to occur.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-21 Thread Christoph Lameter
On Tue, 21 Aug 2007, Peter Zijlstra wrote:

> It quickly ends up with all of memory in the laundry list and then
> recursing into __alloc_pages which will fail to make progress and OOMs.

H... Okay that needs to be addressed. Reserves need to be used and we 
only should enter reclaim if that runs out (like the first patch that I 
did).

> But aside from the numerous issues with the patch set as presented, I'm
> not seeing the seeing the big picture, why are you doing this.

I want general improvements to reclaim to address the issues that you see 
and other issues related to reclaim instead of the strange code that makes 
PF_MEMALLOC allocs compete for allocations from a single slab and putting 
logic into the kernel to decide which allocs to fail. We can reclaim after 
all. Its just a matter of finding the right way to do this. 

> Anonymous pages are a there to stay, and we cannot tell people how to
> use them. So we need some free or freeable pages in order to avoid the
> vm deadlock that arises from all memory dirty.

No one is trying to abolish Anonymous pages. Free memory is readily 
available on demand if one calls reclaim. Your scheme introduces complex 
negotiations over a few scraps of memory when large amounts of memory 
would still be readily available if one would do the right thing and call 
into reclaim.

> 'Optimizing' this by switching to freeable pages has mainly
> disadvantages IMHO, finding them scrambles LRU order and complexifies
> relcaim and all that for a relatively small gain in space for clean
> pagecache pages.

Sounds like you would like to change the way we handle memory in general 
in the VM? Reclaim (and thus finding freeable pages) is basic to Linux 
memory management.

> Please, stop writing patches and write down a solid proposal of how you
> envision the VM working in the various scenarios and why its better than
> the current approach.

Sorry I just got into this a short time ago and I may need a few cycles 
to get this all straight. An approach that uses memory instead of 
ignoring available memory is certainly better.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-21 Thread Peter Zijlstra
On Tue, 2007-08-21 at 13:48 -0700, Christoph Lameter wrote:
> On Tue, 21 Aug 2007, Peter Zijlstra wrote:
> 
> > This almost insta-OOMs with anonymous workloads.
> 
> What does the workload do? So writeout needs to begin earlier. There are 
> likely issues with throttling.

The workload is a single program mapping 256M of anonymous memory and
cycling through it with writes ran on a 128M setup.

It quickly ends up with all of memory in the laundry list and then
recursing into __alloc_pages which will fail to make progress and OOMs.

But aside from the numerous issues with the patch set as presented, I'm
not seeing the seeing the big picture, why are you doing this.

Anonymous pages are a there to stay, and we cannot tell people how to
use them. So we need some free or freeable pages in order to avoid the
vm deadlock that arises from all memory dirty.

Currently we keep them free, this has the advantage that the buddy
allocator can at least try to coalese them.

'Optimizing' this by switching to freeable pages has mainly
disadvantages IMHO, finding them scrambles LRU order and complexifies
relcaim and all that for a relatively small gain in space for clean
pagecache pages.

Please, stop writing patches and write down a solid proposal of how you
envision the VM working in the various scenarios and why its better than
the current approach.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-21 Thread Rik van Riel

Christoph Lameter wrote:

On Tue, 21 Aug 2007, Rik van Riel wrote:


Christoph Lameter wrote:


1. First reclaiming non dirty pages. Dirty pages are deferred until reclaim
   has reestablished the high marks. Then all the dirty pages (the laundry)
   is written out.

That sounds like a horrendously bad idea.  While one process
is busy freeing all the non dirty pages, other processes can
allocate those pages, leaving you with no memory to free up
the dirty pages!


What is preventing that from occurring right now? If the dirty pags are 
aligned in the right way you can have the exact same situation.


For one, dirty page writeout is done even when free memory
is low.  The kernel will dig into the PF_MEMALLOC reserves,
instead of deciding not to do writeout unless there is lots
of free memory.

Secondly, why would you want to recreate this worst case on
purpose every time the pageout code runs?

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-21 Thread Christoph Lameter
On Tue, 21 Aug 2007, Dave McCracken wrote:

> On Monday 20 August 2007, Christoph Lameter wrote:
> > 1. First reclaiming non dirty pages. Dirty pages are deferred until reclaim
> >    has reestablished the high marks. Then all the dirty pages (the laundry)
> >    is written out.
> 
> I don't buy it.  What happens when there aren't enough clean pages in the 
> system to achieve the high water mark?  I'm guessing we'd get a quick OOM (as 
> observed by Peter).

We reclaim the clean pages that there are (removing the executable 
pages from memory) and then we do writeback.

The quick OOM is due to throttling not working right AFAIK.

> > 2. Reclaim is essentially complete during the writeout phase. So we remove
> >    PF_MEMALLOC and allow recursive reclaim if we still run into trouble
> >    during writeout.
> 
> You're assuming the system is static and won't allocate new pages behind your 
> back.  We could be back to critically low memory before the write happens.

Yes and that occurs now too.

> More broadly, we need to be proactive about getting dirty pages cleaned 
> before 
> they consume the system.  Deferring the write just makes it harder to keep 
> up.

Cleaning dirty pages through writeout consumes memory. Writing dirty pages 
out early makes the memory situation even worse.


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-21 Thread Christoph Lameter
On Tue, 21 Aug 2007, Rik van Riel wrote:

> Christoph Lameter wrote:
> 
> > 1. First reclaiming non dirty pages. Dirty pages are deferred until reclaim
> >has reestablished the high marks. Then all the dirty pages (the laundry)
> >is written out.
> 
> That sounds like a horrendously bad idea.  While one process
> is busy freeing all the non dirty pages, other processes can
> allocate those pages, leaving you with no memory to free up
> the dirty pages!

What is preventing that from occurring right now? If the dirty pags are 
aligned in the right way you can have the exact same situation.
 
> Also, writing out all the dirty pages at once seems like it
> could hurt latency quite badly, especially on large systems.

We only write back the dirty pages that we are about to reclaim not all of 
them. The bigger batching occurs if we go through multiple priorities. 
Plus writeback in the sync reclaim case is stopped if the device becomes 
contended anyways.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-21 Thread Christoph Lameter
On Tue, 21 Aug 2007, Peter Zijlstra wrote:

> This almost insta-OOMs with anonymous workloads.

What does the workload do? So writeout needs to begin earlier. There are 
likely issues with throttling.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-21 Thread Dave McCracken
On Monday 20 August 2007, Christoph Lameter wrote:
> 1. First reclaiming non dirty pages. Dirty pages are deferred until reclaim
>    has reestablished the high marks. Then all the dirty pages (the laundry)
>    is written out.

I don't buy it.  What happens when there aren't enough clean pages in the 
system to achieve the high water mark?  I'm guessing we'd get a quick OOM (as 
observed by Peter).

> 2. Reclaim is essentially complete during the writeout phase. So we remove
>    PF_MEMALLOC and allow recursive reclaim if we still run into trouble
>    during writeout.

You're assuming the system is static and won't allocate new pages behind your 
back.  We could be back to critically low memory before the write happens.

More broadly, we need to be proactive about getting dirty pages cleaned before 
they consume the system.  Deferring the write just makes it harder to keep 
up.

Dave McCracken
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-21 Thread Rik van Riel

Christoph Lameter wrote:


1. First reclaiming non dirty pages. Dirty pages are deferred until reclaim
   has reestablished the high marks. Then all the dirty pages (the laundry)
   is written out.


That sounds like a horrendously bad idea.  While one process
is busy freeing all the non dirty pages, other processes can
allocate those pages, leaving you with no memory to free up
the dirty pages!

How exactly are you planning to prevent that problem?

Also, writing out all the dirty pages at once seems like it
could hurt latency quite badly, especially on large systems.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-21 Thread Peter Zijlstra
On Mon, 2007-08-20 at 14:50 -0700, Christoph Lameter wrote:
> One of the problems with reclaim writeout is that it occurs when memory in a
> zone is low. A particular bad problem can occur if memory in a zone is
> already low and now the first page that we encounter during reclaim is dirty.
> So the writeout function is called without the filesystem or device having
> much of a reserve that would allow further allocations. Triggering writeout
> of dirty pages early does not improve the memory situation since the actual
> writeout of the page is a relatively long process. The call to writepage
> will therefore not improve the low memory situation but make it worse
> because extra memory may be needed to get the device to write the page.
> 
> This patchset fixes that issue by:
> 
> 1. First reclaiming non dirty pages. Dirty pages are deferred until reclaim
>has reestablished the high marks. Then all the dirty pages (the laundry)
>is written out.
> 
> 2. Reclaim is essentially complete during the writeout phase. So we remove
>PF_MEMALLOC and allow recursive reclaim if we still run into trouble
>during writeout.

This almost insta-OOMs with anonymous workloads.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-21 Thread Peter Zijlstra
On Mon, 2007-08-20 at 14:50 -0700, Christoph Lameter wrote:
 One of the problems with reclaim writeout is that it occurs when memory in a
 zone is low. A particular bad problem can occur if memory in a zone is
 already low and now the first page that we encounter during reclaim is dirty.
 So the writeout function is called without the filesystem or device having
 much of a reserve that would allow further allocations. Triggering writeout
 of dirty pages early does not improve the memory situation since the actual
 writeout of the page is a relatively long process. The call to writepage
 will therefore not improve the low memory situation but make it worse
 because extra memory may be needed to get the device to write the page.
 
 This patchset fixes that issue by:
 
 1. First reclaiming non dirty pages. Dirty pages are deferred until reclaim
has reestablished the high marks. Then all the dirty pages (the laundry)
is written out.
 
 2. Reclaim is essentially complete during the writeout phase. So we remove
PF_MEMALLOC and allow recursive reclaim if we still run into trouble
during writeout.

This almost insta-OOMs with anonymous workloads.


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-21 Thread Rik van Riel

Christoph Lameter wrote:


1. First reclaiming non dirty pages. Dirty pages are deferred until reclaim
   has reestablished the high marks. Then all the dirty pages (the laundry)
   is written out.


That sounds like a horrendously bad idea.  While one process
is busy freeing all the non dirty pages, other processes can
allocate those pages, leaving you with no memory to free up
the dirty pages!

How exactly are you planning to prevent that problem?

Also, writing out all the dirty pages at once seems like it
could hurt latency quite badly, especially on large systems.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-21 Thread Dave McCracken
On Monday 20 August 2007, Christoph Lameter wrote:
 1. First reclaiming non dirty pages. Dirty pages are deferred until reclaim
    has reestablished the high marks. Then all the dirty pages (the laundry)
    is written out.

I don't buy it.  What happens when there aren't enough clean pages in the 
system to achieve the high water mark?  I'm guessing we'd get a quick OOM (as 
observed by Peter).

 2. Reclaim is essentially complete during the writeout phase. So we remove
    PF_MEMALLOC and allow recursive reclaim if we still run into trouble
    during writeout.

You're assuming the system is static and won't allocate new pages behind your 
back.  We could be back to critically low memory before the write happens.

More broadly, we need to be proactive about getting dirty pages cleaned before 
they consume the system.  Deferring the write just makes it harder to keep 
up.

Dave McCracken
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-21 Thread Christoph Lameter
On Tue, 21 Aug 2007, Peter Zijlstra wrote:

 This almost insta-OOMs with anonymous workloads.

What does the workload do? So writeout needs to begin earlier. There are 
likely issues with throttling.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-21 Thread Christoph Lameter
On Tue, 21 Aug 2007, Rik van Riel wrote:

 Christoph Lameter wrote:
 
  1. First reclaiming non dirty pages. Dirty pages are deferred until reclaim
 has reestablished the high marks. Then all the dirty pages (the laundry)
 is written out.
 
 That sounds like a horrendously bad idea.  While one process
 is busy freeing all the non dirty pages, other processes can
 allocate those pages, leaving you with no memory to free up
 the dirty pages!

What is preventing that from occurring right now? If the dirty pags are 
aligned in the right way you can have the exact same situation.
 
 Also, writing out all the dirty pages at once seems like it
 could hurt latency quite badly, especially on large systems.

We only write back the dirty pages that we are about to reclaim not all of 
them. The bigger batching occurs if we go through multiple priorities. 
Plus writeback in the sync reclaim case is stopped if the device becomes 
contended anyways.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-21 Thread Christoph Lameter
On Tue, 21 Aug 2007, Dave McCracken wrote:

 On Monday 20 August 2007, Christoph Lameter wrote:
  1. First reclaiming non dirty pages. Dirty pages are deferred until reclaim
     has reestablished the high marks. Then all the dirty pages (the laundry)
     is written out.
 
 I don't buy it.  What happens when there aren't enough clean pages in the 
 system to achieve the high water mark?  I'm guessing we'd get a quick OOM (as 
 observed by Peter).

We reclaim the clean pages that there are (removing the executable 
pages from memory) and then we do writeback.

The quick OOM is due to throttling not working right AFAIK.

  2. Reclaim is essentially complete during the writeout phase. So we remove
     PF_MEMALLOC and allow recursive reclaim if we still run into trouble
     during writeout.
 
 You're assuming the system is static and won't allocate new pages behind your 
 back.  We could be back to critically low memory before the write happens.

Yes and that occurs now too.

 More broadly, we need to be proactive about getting dirty pages cleaned 
 before 
 they consume the system.  Deferring the write just makes it harder to keep 
 up.

Cleaning dirty pages through writeout consumes memory. Writing dirty pages 
out early makes the memory situation even worse.


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-21 Thread Peter Zijlstra
On Tue, 2007-08-21 at 13:48 -0700, Christoph Lameter wrote:
 On Tue, 21 Aug 2007, Peter Zijlstra wrote:
 
  This almost insta-OOMs with anonymous workloads.
 
 What does the workload do? So writeout needs to begin earlier. There are 
 likely issues with throttling.

The workload is a single program mapping 256M of anonymous memory and
cycling through it with writes ran on a 128M setup.

It quickly ends up with all of memory in the laundry list and then
recursing into __alloc_pages which will fail to make progress and OOMs.

But aside from the numerous issues with the patch set as presented, I'm
not seeing the seeing the big picture, why are you doing this.

Anonymous pages are a there to stay, and we cannot tell people how to
use them. So we need some free or freeable pages in order to avoid the
vm deadlock that arises from all memory dirty.

Currently we keep them free, this has the advantage that the buddy
allocator can at least try to coalese them.

'Optimizing' this by switching to freeable pages has mainly
disadvantages IMHO, finding them scrambles LRU order and complexifies
relcaim and all that for a relatively small gain in space for clean
pagecache pages.

Please, stop writing patches and write down a solid proposal of how you
envision the VM working in the various scenarios and why its better than
the current approach.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-21 Thread Rik van Riel

Christoph Lameter wrote:

On Tue, 21 Aug 2007, Rik van Riel wrote:


Christoph Lameter wrote:


1. First reclaiming non dirty pages. Dirty pages are deferred until reclaim
   has reestablished the high marks. Then all the dirty pages (the laundry)
   is written out.

That sounds like a horrendously bad idea.  While one process
is busy freeing all the non dirty pages, other processes can
allocate those pages, leaving you with no memory to free up
the dirty pages!


What is preventing that from occurring right now? If the dirty pags are 
aligned in the right way you can have the exact same situation.


For one, dirty page writeout is done even when free memory
is low.  The kernel will dig into the PF_MEMALLOC reserves,
instead of deciding not to do writeout unless there is lots
of free memory.

Secondly, why would you want to recreate this worst case on
purpose every time the pageout code runs?

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-21 Thread Christoph Lameter
On Tue, 21 Aug 2007, Peter Zijlstra wrote:

 It quickly ends up with all of memory in the laundry list and then
 recursing into __alloc_pages which will fail to make progress and OOMs.

H... Okay that needs to be addressed. Reserves need to be used and we 
only should enter reclaim if that runs out (like the first patch that I 
did).

 But aside from the numerous issues with the patch set as presented, I'm
 not seeing the seeing the big picture, why are you doing this.

I want general improvements to reclaim to address the issues that you see 
and other issues related to reclaim instead of the strange code that makes 
PF_MEMALLOC allocs compete for allocations from a single slab and putting 
logic into the kernel to decide which allocs to fail. We can reclaim after 
all. Its just a matter of finding the right way to do this. 

 Anonymous pages are a there to stay, and we cannot tell people how to
 use them. So we need some free or freeable pages in order to avoid the
 vm deadlock that arises from all memory dirty.

No one is trying to abolish Anonymous pages. Free memory is readily 
available on demand if one calls reclaim. Your scheme introduces complex 
negotiations over a few scraps of memory when large amounts of memory 
would still be readily available if one would do the right thing and call 
into reclaim.

 'Optimizing' this by switching to freeable pages has mainly
 disadvantages IMHO, finding them scrambles LRU order and complexifies
 relcaim and all that for a relatively small gain in space for clean
 pagecache pages.

Sounds like you would like to change the way we handle memory in general 
in the VM? Reclaim (and thus finding freeable pages) is basic to Linux 
memory management.

 Please, stop writing patches and write down a solid proposal of how you
 envision the VM working in the various scenarios and why its better than
 the current approach.

Sorry I just got into this a short time ago and I may need a few cycles 
to get this all straight. An approach that uses memory instead of 
ignoring available memory is certainly better.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-21 Thread Christoph Lameter
On Tue, 21 Aug 2007, Rik van Riel wrote:

  What is preventing that from occurring right now? If the dirty pags are
  aligned in the right way you can have the exact same situation.
 
 For one, dirty page writeout is done even when free memory
 is low.  The kernel will dig into the PF_MEMALLOC reserves,
 instead of deciding not to do writeout unless there is lots
 of free memory.

Right that is a fundamental problem with this RFC. We need to be able to 
get into PF_MEMALLOC reserves for writeout.
  
 Secondly, why would you want to recreate this worst case on
 purpose every time the pageout code runs?

I did not intend that to occur.


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-21 Thread Rik van Riel

Christoph Lameter wrote:

I want general improvements to reclaim to address the issues that you see 
and other issues related to reclaim instead of the strange code that makes 
PF_MEMALLOC allocs compete for allocations from a single slab and putting 
logic into the kernel to decide which allocs to fail. We can reclaim after 
all. Its just a matter of finding the right way to do this. 


The simplest way of achieving that would be to allow
recursion of the page reclaim code, under the condition
that the second level call can only reclaim clean pages,
while the outer call does what the VM does today.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-21 Thread Peter Zijlstra
On Tue, 2007-08-21 at 14:29 -0700, Christoph Lameter wrote:
 On Tue, 21 Aug 2007, Peter Zijlstra wrote:
 
  It quickly ends up with all of memory in the laundry list and then
  recursing into __alloc_pages which will fail to make progress and OOMs.
 
 H... Okay that needs to be addressed. Reserves need to be used and we 
 only should enter reclaim if that runs out (like the first patch that I 
 did).
 
  But aside from the numerous issues with the patch set as presented, I'm
  not seeing the seeing the big picture, why are you doing this.
 
 I want general improvements to reclaim to address the issues that you see 
 and other issues related to reclaim instead of the strange code that makes 
 PF_MEMALLOC allocs compete for allocations from a single slab and putting 
 logic into the kernel to decide which allocs to fail. We can reclaim after 
 all. Its just a matter of finding the right way to do this. 

The latest patch I posted got rid of that global slab.

Also, all I want is for slab to honour gfp flags like page allocation
does, nothing more, nothing less.

(well, actually slightly less, since I'm only really interrested in the
ALLOC_MIN|ALLOC_HIGH|ALLOC_HARDER - ALLOC_NO_WATERMARKS transition and
not all higher ones)

I want slab to fail when a similar page alloc would fail, no magic.

Strictly speaking:

if:

 page = alloc_page(gfp);

fails but:

 obj = kmem_cache_alloc(s, gfp);

succeeds then its a bug.

But I'm not actually needing it that strict, just the ALLOC_NO_WATERMARK
part needs to be done, ALLOC_HARDER, ALLOC_HIGH those may fudge a bit.

  Anonymous pages are a there to stay, and we cannot tell people how to
  use them. So we need some free or freeable pages in order to avoid the
  vm deadlock that arises from all memory dirty.
 
 No one is trying to abolish Anonymous pages. Free memory is readily 
 available on demand if one calls reclaim. Your scheme introduces complex 
 negotiations over a few scraps of memory when large amounts of memory 
 would still be readily available if one would do the right thing and call 
 into reclaim.

This is the thing I contend, there need not be large amounts of memory
around. In my test prog the hot code path fits into a single page, the
rest can be anonymous.

  'Optimizing' this by switching to freeable pages has mainly
  disadvantages IMHO, finding them scrambles LRU order and complexifies
  relcaim and all that for a relatively small gain in space for clean
  pagecache pages.
 
 Sounds like you would like to change the way we handle memory in general 
 in the VM? Reclaim (and thus finding freeable pages) is basic to Linux 
 memory management.

Not quite, currently we have free pages in the reserves, if you want to
replace some (or all) of that by freeable pages then that is a change.

I'm just using the reserves.

  Please, stop writing patches and write down a solid proposal of how you
  envision the VM working in the various scenarios and why its better than
  the current approach.
 
 Sorry I just got into this a short time ago and I may need a few cycles 
 to get this all straight. An approach that uses memory instead of 
 ignoring available memory is certainly better.

Sure if and when possible. There will always be need to fall back to the
reserves.

A bit off-topic, re that reclaim from atomic context:
Currently we try to hold spinlocks only for short periods of time so
that reclaim can be preempted, if you run all of reclaim from a
non-preemptible context you get very large preemption latencies and if
done from int context it'd also generate large int latencies.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-21 Thread Christoph Lameter
On Tue, 21 Aug 2007, Rik van Riel wrote:

 Christoph Lameter wrote:
 
  I want general improvements to reclaim to address the issues that you see
  and other issues related to reclaim instead of the strange code that makes
  PF_MEMALLOC allocs compete for allocations from a single slab and putting
  logic into the kernel to decide which allocs to fail. We can reclaim after
  all. Its just a matter of finding the right way to do this. 
 
 The simplest way of achieving that would be to allow
 recursion of the page reclaim code, under the condition
 that the second level call can only reclaim clean pages,
 while the outer call does what the VM does today.

Yes that is what the precursor to this patchset does.

See http://marc.info/?l=linux-mmm=118710207203449w=2

This one did not even come up to the level of the earlier one. Sigh.

The way forward may be:

1. Like in the earlier patchset allow reentry to reclaim under 
   PF_MEMALLOC if we are out of all memory.

2. Do the laundry as here but do not write out laundry directly.
   Instead move laundry to a new lru style list in the zone structure.
   This will allow the recursive reclaim to also trigger writeout
   of pages (what this patchset was supposed to accomplish).

3. Perform writeback only from kswapd. Make other threads
   wait on kswapd if memory is low, we can wait and writeback still
   has to progress.

4. Then allow reclaim of GFP_ATOMIC allocs (see
   http://marc.info/?l=linux-kernelm=118710595617696w=2). Atomic
   reclaim can then also put pages onto the zone laundry lists from where
   it is going to be picked up and written out by kswapd ASAP. This one
   may be tricky so maybe keep this separate.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/7] Postphone reclaim laundry to write at high water marks

2007-08-21 Thread Christoph Lameter
On Wed, 22 Aug 2007, Peter Zijlstra wrote:

 Also, all I want is for slab to honour gfp flags like page allocation
 does, nothing more, nothing less.
 
 (well, actually slightly less, since I'm only really interrested in the
 ALLOC_MIN|ALLOC_HIGH|ALLOC_HARDER - ALLOC_NO_WATERMARKS transition and
 not all higher ones)

I am still not sure what that brings you. There may be multiple 
PF_MEMALLOC going on at the same time. On a large system with N cpus
there may be more than N of these that can steal objects from one another. 

A NUMA system will be shot anyways if memory gets that problematic to 
handle since the OS cannot effectively place memory if all zones are 
overallocated so that only a few pages are left.


 I want slab to fail when a similar page alloc would fail, no magic.

Yes I know. I do not want allocations to fail but that reclaim occurs in 
order to avoid failing any allocation. We need provisions that 
make sure that we never get into such a bad memory situation that would
cause severe slowless and usually end up in a livelock anyways.

   Anonymous pages are a there to stay, and we cannot tell people how to
   use them. So we need some free or freeable pages in order to avoid the
   vm deadlock that arises from all memory dirty.
  
  No one is trying to abolish Anonymous pages. Free memory is readily 
  available on demand if one calls reclaim. Your scheme introduces complex 
  negotiations over a few scraps of memory when large amounts of memory 
  would still be readily available if one would do the right thing and call 
  into reclaim.
 
 This is the thing I contend, there need not be large amounts of memory
 around. In my test prog the hot code path fits into a single page, the
 rest can be anonymous.

Thats a bit extreme We need to make sure that there are larger amounts 
of memory around. Pages are used for all shorts of short term uses (like 
slab shrinking etc etc.). If memory is that low that a single page matters
then we are in very bad shape anyways.

  Sounds like you would like to change the way we handle memory in general 
  in the VM? Reclaim (and thus finding freeable pages) is basic to Linux 
  memory management.
 
 Not quite, currently we have free pages in the reserves, if you want to
 replace some (or all) of that by freeable pages then that is a change.

We have free pages primarily to optimize the allocation. Meaning we do not 
have to run reclaim on every call. We want to use all of memory. The 
reserves are there for the case that we cannot call into reclaim. The easy 
solution if that is problematic is to enhance the reclaim to work in the
critical situations that we care about.


  Sorry I just got into this a short time ago and I may need a few cycles 
  to get this all straight. An approach that uses memory instead of 
  ignoring available memory is certainly better.
 
 Sure if and when possible. There will always be need to fall back to the
 reserves.

Maybe. But we can certainly avoid that as much as possible which would 
also increase our ability to use all available memory instead of leaving 
some of it unused./

 A bit off-topic, re that reclaim from atomic context:
 Currently we try to hold spinlocks only for short periods of time so
 that reclaim can be preempted, if you run all of reclaim from a
 non-preemptible context you get very large preemption latencies and if
 done from int context it'd also generate large int latencies.

If you call into the page allocator from an interrupt context then you are 
already in bad shape since we may check pcps lists and then potentially 
have to traverse the zonelists and check all sorts of things. If we 
would implement atomic reclaim then the reserves may become a latency 
optimizations. At least we will not fail anymore if the reserves are out.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/