Re: [RFC 0/8] Cpuset aware writeback

2007-04-21 Thread Christoph Lameter
On Sat, 21 Apr 2007, Ethan Solomita wrote:

>Exactly -- your patch should be consistent and do it the same way as
> whatever your patch is built against. Your patch is built against a kernel
> that subtracts off highmem. "Do it..." are you handing off the patch and are
> done with it?

Yes, as said before, the patch is not finished. As I told you I have other 
things to do right now. It is not high on my agenda and some other 
developers have shown an interest. Feel free to take over the patch.


Re: [RFC 0/8] Cpuset aware writeback

2007-04-21 Thread Ethan Solomita

Christoph Lameter wrote:

On Fri, 20 Apr 2007, Ethan Solomita wrote:

  

cpuset_write_dirty_map.htm

   In __set_page_dirty_nobuffers() you always call cpuset_update_dirty_nodes()
but in __set_page_dirty_buffers() you call it only if page->mapping is still
set after locking. Is there a reason for the difference? Also a question not
about your patch: why do those functions call __mark_inode_dirty() even if the
dirty page has been truncated and mapping == NULL?



If page->mapping has been cleared then the page was removed from the 
mapping. __mark_inode_dirty just dirties the inode. If a truncation occurs 
then the inode was modified.
  


   You didn't address the first half. Why do the buffers() and 
nobuffers() act differently when calling cpuset_update_dirty_nodes()?



cpuset_write_throttle.htm

   I noticed that several lines have leading spaces. I didn't check if other
patches have the problem too.



Maybe download the patches? How did those strange .htm endings get 
appended to the patches?
  


   Something weird with Firefox, but instead of jumping on me did you 
consider double checking your patches? I just went back, found the text 
versions, and the spaces are still there, e.g.:


+   unsigned long dirtyable_memory;



   In get_dirty_limits(), when cpusets are configured you don't subtract highmem
the same way that is done without cpusets. Is this intentional?



That is something in flux upstream. Linus changed it recently. Do it one 
way or the other.
  


   Exactly -- your patch should be consistent and do it the same way as 
whatever your patch is built against. Your patch is built against a 
kernel that subtracts off highmem. "Do it..." are you handing off the 
patch and are done with it?
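
(For context, the highmem subtraction in question looks roughly like the
following -- a simplified sketch of the non-cpuset path in get_dirty_limits(),
not the patched code; the exact upstream form was in flux at the time:)

	unsigned long available_memory = vm_total_pages;
#ifdef CONFIG_HIGHMEM
	/* sketch: exclude highmem pages from the dirtyable pool */
	available_memory -= totalhigh_pages;
#endif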



   It seems that dirty_exceeded is still a global punishment across cpusets.
Should it be addressed?



Sure. It would be best if you could place that somehow in a cpuset.
  


   Again it sounds like you're handing them off. I'm not objecting, I 
just hadn't understood that.

   -- Ethan



Re: [RFC 0/8] Cpuset aware writeback

2007-04-20 Thread Christoph Lameter
On Fri, 20 Apr 2007, Ethan Solomita wrote:

> cpuset_write_dirty_map.htm
> 
>In __set_page_dirty_nobuffers() you always call cpuset_update_dirty_nodes()
> but in __set_page_dirty_buffers() you call it only if page->mapping is still
> set after locking. Is there a reason for the difference? Also a question not
> about your patch: why do those functions call __mark_inode_dirty() even if the
> dirty page has been truncated and mapping == NULL?

If page->mapping has been cleared then the page was removed from the 
mapping. __mark_inode_dirty just dirties the inode. If a truncation occurs 
then the inode was modified.
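
(To make the asymmetry concrete, the buffers path has roughly the following
shape -- a sketch condensed from the descriptions in this thread, not the
posted patch:)

static int sketch_set_page_dirty_buffers(struct page *page)
{
	struct address_space *mapping = page_mapping(page);

	write_lock_irq(&mapping->tree_lock);
	if (page->mapping) {	/* page was not truncated under us */
		cpuset_update_dirty_nodes(mapping, page);
		radix_tree_tag_set(&mapping->page_tree, page_index(page),
				   PAGECACHE_TAG_DIRTY);
	}
	write_unlock_irq(&mapping->tree_lock);
	/* the inode is dirtied even if the page went away */
	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
	return 1;
}

(__set_page_dirty_nobuffers() updates the dirty node map without an
equivalent recheck, which is the difference being asked about.)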

> cpuset_write_throttle.htm
> 
>I noticed that several lines have leading spaces. I didn't check if other
> patches have the problem too.

Maybe download the patches? How did those strange .htm endings get 
appended to the patches?

>In get_dirty_limits(), when cpusets are configured you don't subtract highmem
> the same way that is done without cpusets. Is this intentional?

That is something in flux upstream. Linus changed it recently. Do it one 
way or the other.

>It seems that dirty_exceeded is still a global punishment across cpusets.
> Should it be addressed?

Sure. It would be best if you could place that somehow in a cpuset.
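
(One possible shape, purely as a hypothetical sketch -- the field and helper
below are made up for illustration and were not part of the posted series:)

struct cpuset {
	/* ... existing fields ... */
	int dirty_exceeded;	/* writers in this cpuset are over their limit */
};

static void cpuset_throttle_if_needed(struct cpuset *cs)
{
	/* per-cpuset replacement for the global dirty_exceeded flag */
	if (cs->dirty_exceeded)
		congestion_wait(WRITE, HZ / 10);
}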



Re: [RFC 0/8] Cpuset aware writeback

2007-04-20 Thread Ethan Solomita

Christoph Lameter wrote:
Hmmm... Sorry. I got distracted and I have sent them to Kame-san who was 
interested in working on them. 


I have placed the most recent version at
http://ftp.kernel.org/pub/linux/kernel/people/christoph/cpuset_dirty
  


   Hi Christoph -- a few comments on the patches:

cpuset_write_dirty_map.htm

   In __set_page_dirty_nobuffers() you always call 
cpuset_update_dirty_nodes() but in __set_page_dirty_buffers() you call 
it only if page->mapping is still set after locking. Is there a reason 
for the difference? Also a question not about your patch: why do those 
functions call __mark_inode_dirty() even if the dirty page has been 
truncated and mapping == NULL?


cpuset_write_throttle.htm

   I noticed that several lines have leading spaces. I didn't check if 
other patches have the problem too.


   In get_dirty_limits(), when cpusets are configured you don't subtract 
highmem the same way that is done without cpusets. Is this intentional?


   It seems that dirty_exceeded is still a global punishment across 
cpusets. Should it be addressed?



   -- Ethan



Re: [RFC 0/8] Cpuset aware writeback

2007-04-19 Thread Christoph Lameter
On Thu, 19 Apr 2007, Ethan Solomita wrote:

> > Hmmm... Sorry. I got distracted and I have sent them to Kame-san who was
> > interested in working on them. 
> > I have placed the most recent version at
> > http://ftp.kernel.org/pub/linux/kernel/people/christoph/cpuset_dirty
> >   
> 
>Do you expect any conflicts with the per-bdi dirty throttling patches?

You would have to check that yourself. The need for cpuset aware writeback 
has lessened due to the writeback fixes to NFS. The per-bdi dirty throttling 
reduces the need further. The role of the cpuset aware writeback is
simply to implement measures to deal with the worst case scenarios.



Re: [RFC 0/8] Cpuset aware writeback

2007-04-19 Thread Ethan Solomita

Christoph Lameter wrote:

On Wed, 18 Apr 2007, Ethan Solomita wrote:

  

   Any new ETA? I'm trying to decide whether to go back to your original
patches or wait for the new set. Adding new knobs isn't as important to me as
having something that fixes the core problem, so hopefully this isn't waiting
on them. They could always be patches on top of your core patches.
   -- Ethan



Hmmm... Sorry. I got distracted and I have sent them to Kame-san who was 
interested in working on them. 


I have placed the most recent version at
http://ftp.kernel.org/pub/linux/kernel/people/christoph/cpuset_dirty
  


   Do you expect any conflicts with the per-bdi dirty throttling patches?
   -- Ethan



Re: [RFC 0/8] Cpuset aware writeback

2007-04-18 Thread Christoph Lameter
On Wed, 18 Apr 2007, Ethan Solomita wrote:

>Any new ETA? I'm trying to decide whether to go back to your original
> patches or wait for the new set. Adding new knobs isn't as important to me as
> having something that fixes the core problem, so hopefully this isn't waiting
> on them. They could always be patches on top of your core patches.
>-- Ethan

Hmmm... Sorry. I got distracted and I have sent them to Kame-san who was 
interested in working on them. 

I have placed the most recent version at
http://ftp.kernel.org/pub/linux/kernel/people/christoph/cpuset_dirty



Re: [RFC 0/8] Cpuset aware writeback

2007-04-18 Thread Ethan Solomita

Christoph Lameter wrote:

On Wed, 21 Mar 2007, Ethan Solomita wrote:

  

Christoph Lameter wrote:


On Thu, 1 Feb 2007, Ethan Solomita wrote:

  

   Hi Christoph -- has anything come of resolving the NFS / OOM concerns that
Andrew Morton expressed concerning the patch? I'd be happy to see some
progress on getting this patch (i.e. the one you posted on 1/23) through.


Peter Zijlstra addressed the NFS issue. I will submit the patch again as
soon as the writeback code stabilizes a bit.
  

I'm pinging to see if this has gotten anywhere. Are you ready to
resubmit? Do we have the evidence to convince Andrew that the NFS issues are
resolved and so this patch won't obscure anything?



The NFS patch went into Linus' tree a couple of days ago and I have a new 
version ready with additional support to set dirty ratios per cpuset. 
There is some interest in adding more VM controls to this patch. I hope I 
can post the next rev by tomorrow.
  


   Any new ETA? I'm trying to decide whether to go back to your 
original patches or wait for the new set. Adding new knobs isn't as 
important to me as having something that fixes the core problem, so 
hopefully this isn't waiting on them. They could always be patches on 
top of your core patches.

   -- Ethan



Re: [RFC 0/8] Cpuset aware writeback

2007-03-21 Thread Christoph Lameter
On Wed, 21 Mar 2007, Andrew Morton wrote:

> > The NFS patch went into Linus' tree a couple of days ago
> 
> Did it fix the oom issues which you were observing?

Yes it reduced the dirty ratios to reasonable numbers in a simple copy 
operation that created large amounts of dirty pages before. The trouble now 
is to check whether the cpuset writeback patch still works correctly.

Probably have to turn off block device congestion checks somehow.


Re: [RFC 0/8] Cpuset aware writeback

2007-03-21 Thread Andrew Morton
On Wed, 21 Mar 2007 14:29:42 -0700 (PDT)
Christoph Lameter <[EMAIL PROTECTED]> wrote:

> On Wed, 21 Mar 2007, Ethan Solomita wrote:
> 
> > Christoph Lameter wrote:
> > > On Thu, 1 Feb 2007, Ethan Solomita wrote:
> > > 
> > > >Hi Christoph -- has anything come of resolving the NFS / OOM concerns
> > > > that
> > > > Andrew Morton expressed concerning the patch? I'd be happy to see some
> > > > progress on getting this patch (i.e. the one you posted on 1/23) 
> > > > through.
> > > 
> > > Peter Zijlstra addressed the NFS issue. I will submit the patch again as
> > > soon as the writeback code stabilizes a bit.
> > 
> > I'm pinging to see if this has gotten anywhere. Are you ready to
> > resubmit? Do we have the evidence to convince Andrew that the NFS issues are
> > resolved and so this patch won't obscure anything?
> 
> The NFS patch went into Linus' tree a couple of days ago

Did it fix the oom issues which you were observing?


Re: [RFC 0/8] Cpuset aware writeback

2007-03-21 Thread Christoph Lameter
On Wed, 21 Mar 2007, Ethan Solomita wrote:

> Christoph Lameter wrote:
> > On Thu, 1 Feb 2007, Ethan Solomita wrote:
> > 
> > >Hi Christoph -- has anything come of resolving the NFS / OOM concerns
> > > that
> > > Andrew Morton expressed concerning the patch? I'd be happy to see some
> > > progress on getting this patch (i.e. the one you posted on 1/23) through.
> > 
> > Peter Zijlstra addressed the NFS issue. I will submit the patch again as
> > soon as the writeback code stabilizes a bit.
> 
>   I'm pinging to see if this has gotten anywhere. Are you ready to
> resubmit? Do we have the evidence to convince Andrew that the NFS issues are
> resolved and so this patch won't obscure anything?

The NFS patch went into Linus' tree a couple of days ago and I have a new 
version ready with additional support to set dirty ratios per cpuset. 
There is some interest in adding more VM controls to this patch. I hope I 
can post the next rev by tomorrow.



Re: [RFC 0/8] Cpuset aware writeback

2007-03-21 Thread Ethan Solomita

Christoph Lameter wrote:

On Thu, 1 Feb 2007, Ethan Solomita wrote:


   Hi Christoph -- has anything come of resolving the NFS / OOM concerns that
Andrew Morton expressed concerning the patch? I'd be happy to see some
progress on getting this patch (i.e. the one you posted on 1/23) through.


Peter Zijlstra addressed the NFS issue. I will submit the patch again as 
soon as the writeback code stabilizes a bit.


	I'm pinging to see if this has gotten anywhere. Are you ready to 
resubmit? Do we have the evidence to convince Andrew that the NFS issues 
are resolved and so this patch won't obscure anything?


Thanks,
-- Ethan


Re: [RFC 0/8] Cpuset aware writeback

2007-02-01 Thread Andrew Morton
On Thu, 1 Feb 2007 21:29:06 -0800 (PST) Christoph Lameter <[EMAIL PROTECTED]> 
wrote:

> On Thu, 1 Feb 2007, Andrew Morton wrote:
> 
> > > Peter Zijlstra addressed the NFS issue.
> > 
> > Did he?  Are you yet in a position to confirm that?
> 
> He provided a solution to fix the congestion issue in NFS. I thought 
> that is what you were looking for? That should make NFS behave more
> like a block device right?

We hope so.

The cpuset-aware-writeback patches were explicitly written to hide the bug which
Peter's patches hopefully address.  They hence remove our best way of confirming
that Peter's patches fix the problem which you've observed in a proper fashion.

Until we've confirmed that the NFS problem is nailed, I wouldn't want to merge
cpuset-aware-writeback.  I'm hoping to be able to do that with fake-numa on x86-64
but haven't got onto it yet.


Re: [RFC 0/8] Cpuset aware writeback

2007-02-01 Thread Neil Brown
On Thursday February 1, [EMAIL PROTECTED] wrote:
> 
> > The network stack is of course a different (much harder) problem.
> 
> An NFS solution is possible without solving the network stack issue?

NFS is currently able to make more than max_dirty_ratio of memory
Dirty/Writeback without being effectively throttled.  So it can use up
way more than it should and put pressure in the network stack.

If NFS were throttled like other block-based filesystems (which
Peter's patch should do), then there will normally be a lot more head
room and the network stack will normally be able to cope.  There might
still be situations where you can run out of memory to the extent that
NFS cannot make forward progress, but they will be substantially less
likely (I think you need lots of TCP streams with slow consumers and
fast producers so that TCP is forced to use up its reserves).

The block layer guarantees not to run out of memory.
The network layer makes a best effort as long as nothing goes crazy.
NFS (currently) doesn't do quite enough to stop things going crazy.

At least, that is my understanding.

NeilBrown


Re: [RFC 0/8] Cpuset aware writeback

2007-02-01 Thread Christoph Lameter
On Fri, 2 Feb 2007, Neil Brown wrote:

> md/raid doesn't cause any problems here.  It preallocates enough to be
> sure that it can always make forward progress.  In general the entire
> block layer from generic_make_request down can always successfully
> write a block out in a reasonable amount of time without requiring
> kmalloc to succeed (with obvious exceptions like loop and nbd which go
> back up to a higher layer).

Hmmm... I wonder if that could be generalized. A device driver could make 
a reservation by increasing min_free_kbytes? Additional drivers in a 
chain could make additional reservations in such a way that enough 
memory is set aside for the worst case?
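
(Sketched purely hypothetically -- no such reservation API existed; the
helper below only illustrates the idea of stacked drivers each bumping the
global reserve:)

static int mem_reserve(int kbytes)
{
	min_free_kbytes += kbytes;	/* grow the global reserve */
	setup_per_zone_pages_min();	/* recompute the zone watermarks */
	return 0;
}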

> The network stack is of course a different (much harder) problem.

An NFS solution is possible without solving the network stack issue?



Re: [RFC 0/8] Cpuset aware writeback

2007-02-01 Thread Neil Brown
On Thursday February 1, [EMAIL PROTECTED] wrote:
>The NFS problems also exist for non cpuset scenarios 
> and we have by and large been able to live with it so I think they are 
> lower priority. It seems that the basic problem is created by the dirty 
> ratios in a cpuset.

Some of our customers haven't been able to live with it.  I'm really
glad this will soon be fixed in mainline as it means our somewhat less
elegant fix in SLES can go away :-)

> 
> BTW the block layer also may be layered with raid and stuff and then we 
> have similar issues. There is no general way so far of handling these 
> situations except by twiddling around with min_free_kbytes praying 5 Hail 
> Mary's and trying again.

md/raid doesn't cause any problems here.  It preallocates enough to be
sure that it can always make forward progress.  In general the entire
block layer from generic_make_request down can always successfully
write a block out in a reasonable amount of time without requiring
kmalloc to succeed (with obvious exceptions like loop and nbd which go
back up to a higher layer).

The network stack is of course a different (much harder) problem.

NeilBrown


Re: [RFC 0/8] Cpuset aware writeback

2007-02-01 Thread Christoph Lameter
On Thu, 1 Feb 2007, Andrew Morton wrote:

> > Peter Zijlstra addressed the NFS issue.
> 
> Did he?  Are you yet in a position to confirm that?

He provided a solution to fix the congestion issue in NFS. I thought 
that is what you were looking for? That should make NFS behave more
like a block device right?

As I said before I think NFS is inherently unfixable given the layering of 
a block device on top of the network stack (which consists of an unknown 
number of additional intermediate layers). Cpuset writeback needs to work 
in the same way as in a machine without cpusets. If it fails then at least 
let the cpuset behave as if we had a machine all on our own and fail in 
both cases in the same way. Right now we create dangerous low memory 
conditions due to high dirty ratios in a cpuset created by not having a 
throttling method. The NFS problems also exist for non cpuset scenarios 
and we have by and large been able to live with it so I think they are 
lower priority. It seems that the basic problem is created by the dirty 
ratios in a cpuset.

BTW the block layer also may be layered with raid and stuff and then we 
have similar issues. There is no general way so far of handling these 
situations except by twiddling around with min_free_kbytes praying 5 Hail 
Mary's and trying again. Maybe we are able to allocate all needed memory from 
PF_MEMALLOC processes during reclaim and hopefully there is now enough 
memory for these allocations and those that happen to occur during an 
interrupt while we reclaim.


Re: [RFC 0/8] Cpuset aware writeback

2007-02-01 Thread Andrew Morton
On Thu, 1 Feb 2007 18:16:05 -0800 (PST) Christoph Lameter <[EMAIL PROTECTED]> 
wrote:

> On Thu, 1 Feb 2007, Ethan Solomita wrote:
> 
> >Hi Christoph -- has anything come of resolving the NFS / OOM concerns 
> > that
> > Andrew Morton expressed concerning the patch? I'd be happy to see some
> > progress on getting this patch (i.e. the one you posted on 1/23) through.
> 
> Peter Zijlstra addressed the NFS issue.

Did he?  Are you yet in a position to confirm that?



Re: [RFC 0/8] Cpuset aware writeback

2007-02-01 Thread Christoph Lameter
On Thu, 1 Feb 2007, Ethan Solomita wrote:

>Hi Christoph -- has anything come of resolving the NFS / OOM concerns that
> Andrew Morton expressed concerning the patch? I'd be happy to see some
> progress on getting this patch (i.e. the one you posted on 1/23) through.

Peter Zijlstra addressed the NFS issue. I will submit the patch again as 
soon as the writeback code stabilizes a bit.



Re: [RFC 0/8] Cpuset aware writeback

2007-02-01 Thread Ethan Solomita
   Hi Christoph -- has anything come of resolving the NFS / OOM 
concerns that Andrew Morton expressed concerning the patch? I'd be happy 
to see some progress on getting this patch (i.e. the one you posted on 
1/23) through.


   Thanks,
   -- Ethan



Re: [RFC 0/8] Cpuset aware writeback

2007-01-17 Thread Christoph Lameter
On Wed, 17 Jan 2007, Andrew Morton wrote:

> > The problem there is that we do a GFP_ATOMIC allocation (no allocation 
> > context) that may fail when the first page is dirtied. We must therefore 
> > be able to subsequently allocate the nodemask_t in set_page_dirty(). 
> > Otherwise the first failure will mean that there will never be a dirty 
> > map for the inode/mapping.
> 
> True.  But it's pretty simple to change __mark_inode_dirty() to fix this.

Ok I tried it but this won't work unless I also pass the page struct pointer to 
__mark_inode_dirty() since the dirty_nodes pointer could be freed 
when the inode_lock is dropped. So I cannot dereference the 
dirty_nodes pointer outside of __mark_inode_dirty. 

If I expand __mark_inode_dirty then all variations of mark_inode_dirty() 
need to be changed and we need to pass a page struct everywhere. This 
results in extensive changes.

I think I need to stick with the tree_lock. This also makes more sense 
since we modify dirty information in the address_space structure and the 
radix tree is already protected by that lock.
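
(The approach settled on here looks roughly like this -- a sketch based on
the thread's description: the nodemask is allocated lazily under tree_lock,
so a failed GFP_ATOMIC attempt is simply retried the next time a page of
the mapping is dirtied:)

static void cpuset_update_dirty_nodes(struct address_space *mapping,
				      struct page *page)
{
	nodemask_t *nodes = mapping->dirty_nodes;

	if (!nodes) {
		nodes = kzalloc(sizeof(nodemask_t), GFP_ATOMIC);
		if (!nodes)
			return;	/* retried when the next page is dirtied */
		mapping->dirty_nodes = nodes;	/* tree_lock held by caller */
	}
	node_set(page_to_nid(page), *nodes);
}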




Re: [RFC 0/8] Cpuset aware writeback

2007-01-17 Thread Andrew Morton
> On Wed, 17 Jan 2007 17:10:25 -0800 (PST) Christoph Lameter <[EMAIL PROTECTED]> wrote:
> On Wed, 17 Jan 2007, Andrew Morton wrote:
> 
> > > The inode lock is not taken when the page is dirtied.
> > 
> > The inode_lock is taken when the address_space's first page is dirtied.  It 
> > is
> > also taken when the address_space's last dirty page is cleaned.  So the 
> > place
> > where the inode is added to and removed from sb->s_dirty is, I think, 
> > exactly
> > the place where we want to attach and detach 
> > address_space.dirty_page_nodemask.
> 
> The problem there is that we do a GFP_ATOMIC allocation (no allocation 
> context) that may fail when the first page is dirtied. We must therefore 
> be able to subsequently allocate the nodemask_t in set_page_dirty(). 
> Otherwise the first failure will mean that there will never be a dirty 
> map for the inode/mapping.

True.  But it's pretty simple to change __mark_inode_dirty() to fix this.


Re: [RFC 0/8] Cpuset aware writeback

2007-01-17 Thread Christoph Lameter
On Wed, 17 Jan 2007, Andrew Morton wrote:

> > The inode lock is not taken when the page is dirtied.
> 
> The inode_lock is taken when the address_space's first page is dirtied.  It is
> also taken when the address_space's last dirty page is cleaned.  So the place
> where the inode is added to and removed from sb->s_dirty is, I think, exactly
> the place where we want to attach and detach 
> address_space.dirty_page_nodemask.

The problem there is that we do a GFP_ATOMIC allocation (no allocation 
context) that may fail when the first page is dirtied. We must therefore 
be able to subsequently allocate the nodemask_t in set_page_dirty(). 
Otherwise the first failure will mean that there will never be a dirty 
map for the inode/mapping.



Re: [RFC 0/8] Cpuset aware writeback

2007-01-17 Thread Andrew Morton
> On Wed, 17 Jan 2007 11:43:42 -0800 (PST) Christoph Lameter <[EMAIL PROTECTED]> wrote:
> On Tue, 16 Jan 2007, Andrew Morton wrote:
> 
> > Do what blockdevs do: limit the number of in-flight requests (Peter's
> > recent patch seems to be doing that for us) (perhaps only when PF_MEMALLOC
> > is in effect, to keep Trond happy) and implement a mempool for the NFS
> > request critical store.  Additionally:
> > 
> > - we might need to twiddle the NFS gfp_flags so it doesn't call the
> >   oom-killer on failure: just return NULL.
> > 
> > - consider going off-cpuset for critical allocations.  It's better than
> >   going oom.  A suitable implementation might be to ignore the caller's
> >   cpuset if PF_MEMALLOC.  Maybe put a WARN_ON_ONCE in there: we prefer that
> >   it not happen and we want to know when it does.
> 
> Given the intermediate  layers (network, additional gizmos (ip over xxx) 
> and the network cards) that will not be easy.

Paul has observed that it's already done.  But it seems to not be working.

> > btw, regarding the per-address_space node mask: I think we should free it
> > when the inode is clean (!mapping_tagged(PAGECACHE_TAG_DIRTY)).  Chances
> > are, the inode will be dirty for 30 seconds and in-core for hours.  We
> > might as well steal its nodemask storage and give it to the next file which
> > gets written to.  A suitable place to do all this is in
> > __mark_inode_dirty(I_DIRTY_PAGES), using inode_lock to protect
> > address_space.dirty_page_nodemask.
> 
> The inode lock is not taken when the page is dirtied.

The inode_lock is taken when the address_space's first page is dirtied.  It is
also taken when the address_space's last dirty page is cleaned.  So the place
where the inode is added to and removed from sb->s_dirty is, I think, exactly
the place where we want to attach and detach address_space.dirty_page_nodemask.

> The tree_lock
> is already taken when the mapping is dirtied and so I used that to
> avoid races adding and removing pointers to nodemasks from the address 
> space.
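
(Andrew's attach/detach suggestion, sketched -- not merged code; the field
name follows his "dirty_page_nodemask" above, and the elided parts stand in
for the existing dirtying logic:)

void __mark_inode_dirty(struct inode *inode, int flags)
{
	struct address_space *mapping = inode->i_mapping;

	spin_lock(&inode_lock);
	/* attach the nodemask when the inode first gains dirty pages */
	if ((flags & I_DIRTY_PAGES) && !mapping->dirty_page_nodemask)
		mapping->dirty_page_nodemask =
			kmalloc(sizeof(nodemask_t), GFP_ATOMIC);
	/* ... existing dirtying logic; detach again when the inode is clean ... */
	spin_unlock(&inode_lock);
}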


Re: [RFC 0/8] Cpuset aware writeback

2007-01-17 Thread Christoph Lameter
On Tue, 16 Jan 2007, Andrew Morton wrote:

> Do what blockdevs do: limit the number of in-flight requests (Peter's
> recent patch seems to be doing that for us) (perhaps only when PF_MEMALLOC
> is in effect, to keep Trond happy) and implement a mempool for the NFS
> request critical store.  Additionally:
> 
> - we might need to twiddle the NFS gfp_flags so it doesn't call the
>   oom-killer on failure: just return NULL.
> 
> - consider going off-cpuset for critical allocations.  It's better than
>   going oom.  A suitable implementation might be to ignore the caller's
>   cpuset if PF_MEMALLOC.  Maybe put a WARN_ON_ONCE in there: we prefer that
>   it not happen and we want to know when it does.

Given the intermediate layers (network, additional gizmos (ip over xxx) 
and the network cards) that will not be easy.

> btw, regarding the per-address_space node mask: I think we should free it
> when the inode is clean (!mapping_tagged(PAGECACHE_TAG_DIRTY)).  Chances
> are, the inode will be dirty for 30 seconds and in-core for hours.  We
> might as well steal its nodemask storage and give it to the next file which
> gets written to.  A suitable place to do all this is in
> __mark_inode_dirty(I_DIRTY_PAGES), using inode_lock to protect
> address_space.dirty_page_nodemask.

The inode lock is not taken when the page is dirtied. The tree_lock
is already taken when the mapping is dirtied and so I used that to
avoid races adding and removing pointers to nodemasks from the address 
space.


Re: [RFC 0/8] Cpuset aware writeback

2007-01-17 Thread Andrew Morton
> On Wed, 17 Jan 2007 00:01:58 -0800 Paul Jackson <[EMAIL PROTECTED]> wrote:
> Andrew wrote:
> > - consider going off-cpuset for critical allocations. 
> 
> We do ... in mm/page_alloc.c:
> 
> /*
>  * This is the last chance, in general, before the goto nopage.
>  * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
>  * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
>  */
> page = get_page_from_freelist(gfp_mask, order, zonelist, alloc_flags);
> 
> We also allow GFP_KERNEL requests to escape the current cpuset, to the nearest
> enclosing mem_exclusive cpuset, which is typically a big cpuset covering most
> of the system.

hrm.   So how come NFS is getting oom-killings?

The oom-killer normally spews lots of useful stuff, including backtrace.  For some
reason that's not coming out for Christoph.  Log facility level, perhaps?


Re: [RFC 0/8] Cpuset aware writeback

2007-01-17 Thread Paul Jackson
Andrew wrote:
> - consider going off-cpuset for critical allocations. 

We do ... in mm/page_alloc.c:

/*
 * This is the last chance, in general, before the goto nopage.
 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
 */
page = get_page_from_freelist(gfp_mask, order, zonelist, alloc_flags);

We also allow GFP_KERNEL requests to escape the current cpuset, to the nearest
enclosing mem_exclusive cpuset, which is typically a big cpuset covering most
of the system.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401


Re: [RFC 0/8] Cpuset aware writeback

2007-01-17 Thread Christoph Lameter
On Tue, 16 Jan 2007, Andrew Morton wrote:

> Do what blockdevs do: limit the number of in-flight requests (Peter's
> recent patch seems to be doing that for us) (perhaps only when PF_MEMALLOC
> is in effect, to keep Trond happy) and implement a mempool for the NFS
> request critical store.  Additionally:
> 
> - we might need to twiddle the NFS gfp_flags so it doesn't call the
>   oom-killer on failure: just return NULL.
> 
> - consider going off-cpuset for critical allocations.  It's better than
>   going oom.  A suitable implementation might be to ignore the caller's
>   cpuset if PF_MEMALLOC.  Maybe put a WARN_ON_ONCE in there: we prefer that
>   it not happen and we want to know when it does.

Given the intermediate layers (network, additional gizmos (ip over xxx)
and the network cards) that will not be easy.

> btw, regarding the per-address_space node mask: I think we should free it
> when the inode is clean (!mapping_tagged(PAGECACHE_TAG_DIRTY)).  Chances
> are, the inode will be dirty for 30 seconds and in-core for hours.  We
> might as well steal its nodemask storage and give it to the next file which
> gets written to.  A suitable place to do all this is in
> __mark_inode_dirty(I_DIRTY_PAGES), using inode_lock to protect
> address_space.dirty_page_nodemask.

The inode lock is not taken when the page is dirtied. The tree_lock
is already taken when the mapping is dirtied and so I used that to
avoid races adding and removing pointers to nodemasks from the address
space.


Re: [RFC 0/8] Cpuset aware writeback

2007-01-17 Thread Andrew Morton
> On Wed, 17 Jan 2007 11:43:42 -0800 (PST) Christoph Lameter <[EMAIL PROTECTED]> wrote:
> On Tue, 16 Jan 2007, Andrew Morton wrote:
> 
> > Do what blockdevs do: limit the number of in-flight requests (Peter's
> > recent patch seems to be doing that for us) (perhaps only when PF_MEMALLOC
> > is in effect, to keep Trond happy) and implement a mempool for the NFS
> > request critical store.  Additionally:
> > 
> > - we might need to twiddle the NFS gfp_flags so it doesn't call the
> >   oom-killer on failure: just return NULL.
> > 
> > - consider going off-cpuset for critical allocations.  It's better than
> >   going oom.  A suitable implementation might be to ignore the caller's
> >   cpuset if PF_MEMALLOC.  Maybe put a WARN_ON_ONCE in there: we prefer that
> >   it not happen and we want to know when it does.
> 
> Given the intermediate layers (network, additional gizmos (ip over xxx)
> and the network cards) that will not be easy.

Paul has observed that it's already done.  But it seems to not be working.

> > btw, regarding the per-address_space node mask: I think we should free it
> > when the inode is clean (!mapping_tagged(PAGECACHE_TAG_DIRTY)).  Chances
> > are, the inode will be dirty for 30 seconds and in-core for hours.  We
> > might as well steal its nodemask storage and give it to the next file which
> > gets written to.  A suitable place to do all this is in
> > __mark_inode_dirty(I_DIRTY_PAGES), using inode_lock to protect
> > address_space.dirty_page_nodemask.
> 
> The inode lock is not taken when the page is dirtied.

The inode_lock is taken when the address_space's first page is dirtied.  It is
also taken when the address_space's last dirty page is cleaned.  So the place
where the inode is added to and removed from sb->s_dirty is, I think, exactly
the place where we want to attach and detach address_space.dirty_page_nodemask.

> The tree_lock
> is already taken when the mapping is dirtied and so I used that to
> avoid races adding and removing pointers to nodemasks from the address
> space.
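
(For illustration, the attach/detach points described above might look
roughly like this; dirty_page_nodemask is the field name used in this
thread, and the helpers are hypothetical, not code from any posted patch.)

	/* Called under inode_lock from __mark_inode_dirty(I_DIRTY_PAGES),
	 * i.e. when the inode first acquires dirty pages. */
	static void dirty_nodemask_attach(struct address_space *mapping)
	{
		if (!mapping->dirty_page_nodemask)
			mapping->dirty_page_nodemask =
				kzalloc(sizeof(nodemask_t), GFP_ATOMIC);
	}

	/* Called under inode_lock once the last dirty page is cleaned, so
	 * the storage can be handed to the next file that gets written. */
	static void dirty_nodemask_detach(struct address_space *mapping)
	{
		if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
			kfree(mapping->dirty_page_nodemask);
			mapping->dirty_page_nodemask = NULL;
		}
	}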


Re: [RFC 0/8] Cpuset aware writeback

2007-01-17 Thread Christoph Lameter
On Wed, 17 Jan 2007, Andrew Morton wrote:

> > The inode lock is not taken when the page is dirtied.
> 
> The inode_lock is taken when the address_space's first page is dirtied.  It is
> also taken when the address_space's last dirty page is cleaned.  So the place
> where the inode is added to and removed from sb->s_dirty is, I think, exactly
> the place where we want to attach and detach
> address_space.dirty_page_nodemask.

The problem there is that we do a GFP_ATOMIC allocation (no allocation 
context) that may fail when the first page is dirtied. We must therefore 
be able to subsequently allocate the nodemask_t in set_page_dirty(). 
Otherwise the first failure will mean that there will never be a dirty 
map for the inode/mapping.



Re: [RFC 0/8] Cpuset aware writeback

2007-01-17 Thread Andrew Morton
> On Wed, 17 Jan 2007 17:10:25 -0800 (PST) Christoph Lameter <[EMAIL PROTECTED]> wrote:
> On Wed, 17 Jan 2007, Andrew Morton wrote:
> 
> > > The inode lock is not taken when the page is dirtied.
> > 
> > The inode_lock is taken when the address_space's first page is dirtied.  It is
> > also taken when the address_space's last dirty page is cleaned.  So the place
> > where the inode is added to and removed from sb->s_dirty is, I think, exactly
> > the place where we want to attach and detach
> > address_space.dirty_page_nodemask.
> 
> The problem there is that we do a GFP_ATOMIC allocation (no allocation
> context) that may fail when the first page is dirtied. We must therefore
> be able to subsequently allocate the nodemask_t in set_page_dirty().
> Otherwise the first failure will mean that there will never be a dirty
> map for the inode/mapping.

True.  But it's pretty simple to change __mark_inode_dirty() to fix this.


Re: [RFC 0/8] Cpuset aware writeback

2007-01-17 Thread Christoph Lameter
On Wed, 17 Jan 2007, Andrew Morton wrote:

> > The problem there is that we do a GFP_ATOMIC allocation (no allocation
> > context) that may fail when the first page is dirtied. We must therefore
> > be able to subsequently allocate the nodemask_t in set_page_dirty().
> > Otherwise the first failure will mean that there will never be a dirty
> > map for the inode/mapping.
> 
> True.  But it's pretty simple to change __mark_inode_dirty() to fix this.

Ok, I tried it, but this won't work unless I also pass the page struct pointer to
__mark_inode_dirty(), since the dirty_node pointer could be freed
when the inode_lock is dropped. So I cannot dereference the
dirty_nodes pointer outside of __mark_inode_dirty().

If I expand __mark_inode_dirty() then all variations of mark_inode_dirty()
need to be changed and we need to pass a page struct everywhere. This
results in extensive changes.

I think I need to stick with the tree_lock. This also makes more sense
since we modify dirty information in the address_space structure and the
radix tree is already protected by that lock.




Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread Andrew Morton
> On Tue, 16 Jan 2007 22:27:36 -0800 (PST) Christoph Lameter <[EMAIL PROTECTED]> wrote:
> On Tue, 16 Jan 2007, Andrew Morton wrote:
> 
> > Yes, this is the result of the hierarchical nature of cpusets, which already 
> > causes issues with the scheduler. It is rather typical that cpusets are 
> > used to partition the memory and cpus. Overlapping cpusets seem to have 
> > mainly an administrative function. Paul?
> > 
> > The typical usage scenarios don't matter a lot: the examples I gave show
> > that the core problem remains unsolved.  People can still hit the bug.
> 
> I agree the overlap issue is a problem and I hope it can be addressed 
> somehow for the rare cases in which such nesting takes place.
> 
> One easy solution may be to check the dirty ratio before engaging in 
> reclaim. If the dirty ratio is sufficiently high then trigger writeout via 
> pdflush (we already wake up pdflush while scanning and you already noted 
> that pdflush writeout is not occurring within the context of the current 
> cpuset) and pass over any dirty pages during LRU scans until some pages 
> have been cleaned up.
> 
> This means we allow allocation of additional kernel memory outside of the 
> cpuset while triggering writeout of inodes that have pages on the nodes 
> of the cpuset. The memory directly used by the application is still 
> limited. Just the temporary information needed for writeback is allocated 
> outside.

Gad.  None of that should be necessary.

> Well, it still sounds somewhat like a hack. Any other ideas out there?

Do what blockdevs do: limit the number of in-flight requests (Peter's
recent patch seems to be doing that for us) (perhaps only when PF_MEMALLOC
is in effect, to keep Trond happy) and implement a mempool for the NFS
request critical store.  Additionally:

- we might need to twiddle the NFS gfp_flags so it doesn't call the
  oom-killer on failure: just return NULL.

- consider going off-cpuset for critical allocations.  It's better than
  going oom.  A suitable implementation might be to ignore the caller's
  cpuset if PF_MEMALLOC.  Maybe put a WARN_ON_ONCE in there: we prefer that
  it not happen and we want to know when it does.
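
(For concreteness, a rough sketch of that last idea. cpuset_zone_allowed()
is the real function being discussed; the shape of the check below is
illustrative, not from any posted patch.)

	/* In cpuset_zone_allowed(): let PF_MEMALLOC allocations escape the
	 * cpuset rather than oom; they are critical and short-lived. */
	if (current->flags & PF_MEMALLOC) {
		WARN_ON_ONCE(1);	/* we want to know when this fires */
		return 1;
	}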



btw, regarding the per-address_space node mask: I think we should free it
when the inode is clean (!mapping_tagged(PAGECACHE_TAG_DIRTY)).  Chances
are, the inode will be dirty for 30 seconds and in-core for hours.  We
might as well steal its nodemask storage and give it to the next file which
gets written to.  A suitable place to do all this is in
__mark_inode_dirty(I_DIRTY_PAGES), using inode_lock to protect
address_space.dirty_page_nodemask.


Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread Christoph Lameter
On Tue, 16 Jan 2007, Andrew Morton wrote:

> > Yes, this is the result of the hierarchical nature of cpusets, which already 
> > causes issues with the scheduler. It is rather typical that cpusets are 
> > used to partition the memory and cpus. Overlapping cpusets seem to have 
> > mainly an administrative function. Paul?
> 
> The typical usage scenarios don't matter a lot: the examples I gave show
> that the core problem remains unsolved.  People can still hit the bug.

I agree the overlap issue is a problem and I hope it can be addressed 
somehow for the rare cases in which such nesting takes place.

One easy solution may be to check the dirty ratio before engaging in 
reclaim. If the dirty ratio is sufficiently high then trigger writeout via 
pdflush (we already wake up pdflush while scanning and you already noted 
that pdflush writeout is not occurring within the context of the current 
cpuset) and pass over any dirty pages during LRU scans until some pages 
have been cleaned up.

This means we allow allocation of additional kernel memory outside of the 
cpuset while triggering writeout of inodes that have pages on the nodes 
of the cpuset. The memory directly used by the application is still 
limited. Just the temporary information needed for writeback is allocated 
outside.

Well, it still sounds somewhat like a hack. Any other ideas out there?


Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread Andrew Morton
> On Tue, 16 Jan 2007 19:40:17 -0800 (PST) Christoph Lameter <[EMAIL PROTECTED]> wrote:
> On Tue, 16 Jan 2007, Andrew Morton wrote:
> 
> > Consider: non-exclusive cpuset A consists of mems 0-15, non-exclusive
> > cpuset B consists of mems 0-3.  A task running in cpuset A can freely dirty
> > all of cpuset B's memory.  A task running in cpuset B gets oomkilled.
> > 
> > Consider: a 32-node machine has nodes 0-3 full of dirty memory.  I create a
> > cpuset containing nodes 0-2 and start using it.  I get oomkilled.
> > 
> > There may be other scenarios.
> 
> Yes, this is the result of the hierarchical nature of cpusets, which already 
> causes issues with the scheduler. It is rather typical that cpusets are 
> used to partition the memory and cpus. Overlapping cpusets seem to have 
> mainly an administrative function. Paul?

The typical usage scenarios don't matter a lot: the examples I gave show
that the core problem remains unsolved.  People can still hit the bug.

> > So what I suggest we do is to fix the NFS bug, then move on to considering
> > the performance problems.
> 
> The NFS "bug" has been there for ages and no one cares since write 
> throttling works effectively. Since NFS can go via any network technology 
> (f.e. infiniband) we have many potential issues at that point that depend 
> on the underlying network technology. As far as I can recall we decided 
> that these stacking issues are inherently problematic and basically 
> unsolvable.

The problem you refer to arises from the inability of the net driver to
allocate memory for an outbound ack.  Such allocations aren't constrained to
a cpuset.

I expect that we can solve the NFS oom problem along the same lines as
block devices.  Certainly it's dumb of us to oom-kill a process rather than
going off-cpuset for a small and short-lived allocation.  It's also dumb of
us to allocate a basically unbounded number of nfs requests rather than
waiting for some of the ones which we _have_ allocated to complete.


> > On reflection, I agree that your proposed changes are sensible-looking for
> > addressing the probable, not-yet-demonstrated-and-quantified performance
> > problem.  The per-inode (should be per-address_space, maybe it is?) node
> 
> The address space is part of the inode.

Physically, yes.  Logically, it is not.  The address_space controls the
data-plane part of a file and is the appropriate place in which to store
this nodemask.

> Some of my development versions had
> the dirty_map in the address space. However, the end of the inode was a 
> convenient place for a runtime-sized nodemask.
> 
> > map is unfortunate.  Need to think about that a bit more.  For a start, it
> > should be dynamically allocated (from a new, purpose-created slab cache):
> > most in-core inodes don't have any dirty pages and don't need this
> > additional storage.
> 
> We also considered such an approach. However, it creates the problem 
> of performing a slab allocation while dirtying pages. At that point we do 
> not have an allocation context, nor can we block.

Yes, it must be an atomic allocation.  If it fails, we don't care.  Chances
are it'll succeed when the next page in this address_space gets dirtied.

Plus we don't waste piles of memory on read-only files.

> > But this is unrelated to the NFS bug ;)
> 
> Looks more like a design issue (given its layering on top of the 
> networking layer) and not a bug. The "bug" surfaces when writeback is not 
> done properly. I wonder what happens if other filesystems are pushed to 
> the border of the dirty abyss.  The mmap tracking 
> fixes that were done in 2.6.19 were done because of similar symptoms, 
> because the system's dirty tracking was off. This is fundamentally the 
> same issue showing up in a cpuset. So we should be able to produce the
> hangs (and yes, another customer-reported issue on this one is that 
> reclaim is continually running and we basically livelock the system) that 
> we saw for the mmap dirty tracking issues, in addition to the NFS problems 
> seen so far.
> 
> Memory allocation is required in most filesystem flush paths. If we cannot 
> allocate memory then we cannot clean pages and thus we continue trying -> 
> Livelock. I still see this as a fundamental correctness issue in the 
> kernel.

I'll believe all that once someone has got down and tried to fix NFS, and
has failed ;)



Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread Paul Jackson
> Yes, this is the result of the hierarchical nature of cpusets, which already 
> causes issues with the scheduler. It is rather typical that cpusets are 
> used to partition the memory and cpus. Overlapping cpusets seem to have 
> mainly an administrative function. Paul?

The heavy weight tasks, which are expected to be applying serious memory
pressure (whether for data pages or dirty file pages), are usually in
non-overlapping cpusets, or sharing the same cpuset, but not partially
overlapping with, or a proper superset of, some other cpuset holding an
active job.

The higher level cpusets, such as the top cpuset, or the one deeded over
to the Batch Scheduler, are proper supersets of many other cpusets.  We
avoid putting anything heavy weight in those cpusets.

Sometimes of course a task turns out to be unexpectedly heavy weight.
But in that case, we're mostly interested in function (system keeps
running), not performance.

That is, if someone setup what Andrew described, with a job in a large
cpuset sucking up all available memory from one in a smaller, contained
cpuset, I don't think I'm tuning for optimum performance anymore.
Rather I'm just trying to keep the system running and keep unrelated
jobs unaffected while we dig our way out of the hole.  If the smaller
job OOM's, that's tough nuggies.  They asked for it.  Jobs in
-unrelated- (non-overlapping) cpusets should ride out the storm with
little or no impact on their performance.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401


Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread Christoph Lameter
On Tue, 16 Jan 2007, Andrew Morton wrote:

> Consider: non-exclusive cpuset A consists of mems 0-15, non-exclusive
> cpuset B consists of mems 0-3.  A task running in cpuset A can freely dirty
> all of cpuset B's memory.  A task running in cpuset B gets oomkilled.
> 
> Consider: a 32-node machine has nodes 0-3 full of dirty memory.  I create a
> cpuset containing nodes 0-2 and start using it.  I get oomkilled.
> 
> There may be other scenarios.

Yes, this is the result of the hierarchical nature of cpusets, which already 
causes issues with the scheduler. It is rather typical that cpusets are 
used to partition the memory and cpus. Overlapping cpusets seem to have 
mainly an administrative function. Paul?

> So what I suggest we do is to fix the NFS bug, then move on to considering
> the performance problems.

The NFS "bug" has been there for ages and no one cares since write 
throttling works effectively. Since NFS can go via any network technology 
(f.e. infiniband) we have many potential issues at that point that depend 
on the underlying network technology. As far as I can recall we decided 
that these stacking issues are inherently problematic and basically 
unsolvable.

> On reflection, I agree that your proposed changes are sensible-looking for
> addressing the probable, not-yet-demonstrated-and-quantified performance
> problem.  The per-inode (should be per-address_space, maybe it is?) node

The address space is part of the inode. Some of my development versions had 
the dirty_map in the address space. However, the end of the inode was a 
convenient place for a runtime-sized nodemask.

> map is unfortunate.  Need to think about that a bit more.  For a start, it
> should be dynamically allocated (from a new, purpose-created slab cache):
> most in-core inodes don't have any dirty pages and don't need this
> additional storage.

We also considered such an approach. However, it creates the problem 
of performing a slab allocation while dirtying pages. At that point we do 
not have an allocation context, nor can we block.

> But this is unrelated to the NFS bug ;)

Looks more like a design issue (given its layering on top of the 
networking layer) and not a bug. The "bug" surfaces when writeback is not 
done properly. I wonder what happens if other filesystems are pushed to 
the border of the dirty abyss.  The mmap tracking 
fixes that were done in 2.6.19 were done because of similar symptoms, 
because the system's dirty tracking was off. This is fundamentally the 
same issue showing up in a cpuset. So we should be able to produce the
hangs (and yes, another customer-reported issue on this one is that 
reclaim is continually running and we basically livelock the system) that 
we saw for the mmap dirty tracking issues, in addition to the NFS problems 
seen so far.

Memory allocation is required in most filesystem flush paths. If we cannot 
allocate memory then we cannot clean pages and thus we continue trying -> 
Livelock. I still see this as a fundamental correctness issue in the 
kernel.


Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread Andrew Morton
> On Tue, 16 Jan 2007 17:30:26 -0800 (PST) Christoph Lameter <[EMAIL PROTECTED]> wrote:
> > Nope.  You've completely omitted the little fact that we'll do writeback in
> > the offending zone off the LRU.  Slower, maybe.  But it should work and the
> > system should recover.  If it's not doing that (it isn't) then we should
> > fix it rather than avoiding it (by punting writeback over to pdflush).
> 
> pdflush is not running *at* all nor is dirty throttling working. That is 
> correct behavior? We could do background writeback but we choose not to do 
> so? Instead we wait until we hit reclaim and then block (well, it seems 
> that we do not block; the blocking there also fails since we again check 
> global ratios)?

I agree that it is a worthy objective to be able to constrain a cpuset's
dirty memory levels.  But as a performance optimisation and NOT as a
correctness fix.

Consider: non-exclusive cpuset A consists of mems 0-15, non-exclusive
cpuset B consists of mems 0-3.  A task running in cpuset A can freely dirty
all of cpuset B's memory.  A task running in cpuset B gets oomkilled.

Consider: a 32-node machine has nodes 0-3 full of dirty memory.  I create a
cpuset containing nodes 0-2 and start using it.  I get oomkilled.

There may be other scenarios.


IOW, we have a correctness problem, and we have a probable,
not-yet-demonstrated-and-quantified performance problem.  Fixing the latter
(in the proposed fashion) will *not* fix the former.

So what I suggest we do is to fix the NFS bug, then move on to considering
the performance problems.



On reflection, I agree that your proposed changes are sensible-looking for
addressing the probable, not-yet-demonstrated-and-quantified performance
problem.  The per-inode (should be per-address_space, maybe it is?) node
map is unfortunate.  Need to think about that a bit more.  For a start, it
should be dynamically allocated (from a new, purpose-created slab cache):
most in-core inodes don't have any dirty pages and don't need this
additional storage.

Also, I worry about the worst-case performance of that linear search across
the inodes.

But this is unrelated to the NFS bug ;)



Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread Christoph Lameter
On Tue, 16 Jan 2007, Andrew Morton wrote:

> Nope.  You've completely omitted the little fact that we'll do writeback in
> the offending zone off the LRU.  Slower, maybe.  But it should work and the
> system should recover.  If it's not doing that (it isn't) then we should
> fix it rather than avoiding it (by punting writeback over to pdflush).

pdflush is not running *at* all nor is dirty throttling working. That is 
correct behavior? We could do background writeback but we choose not to do 
so? Instead we wait until we hit reclaim and then block (well, it seems 
that we do not block; the blocking there also fails since we again check 
global ratios)?

> > The patchset does not allow processes to allocate from other nodes than 
> > the current cpuset.
> 
> Yes it does.  It asks pdflush to perform writeback of the offending zone(s)
> rather than (or as well as) doing it directly.  The only reason pdflush can
> sucessfuly do that is because pdflush can allocate its requests from other
> zones.

Ok pdflush is able to do that. Still the application is not able to 
extend its memory beyond the cpuset. What about writeback throttling? 
There it all breaks down. The cpuset is effective and we are unable to 
allocate any more memory. 

The reason this works is because not all of memory is dirty. Thus reclaim 
will be able to free up memory or there is enough memory free.

> > AFAIK any filesystem/block device can go oom with the current broken 
> > writeback; it just takes a few allocations. It's a matter of hitting the 
> > sweet spots.
> 
> That shouldn't be possible, in theory.  Block IO is supposed to succeed if
> *all memory in the machine is dirty*: the old
> dirty-everything-with-MAP_SHARED-then-exit problem.  Lots of testing went
> into that and it works.  It also failed on NFS although I thought that got
> "fixed" a year or so ago.  Apparently not.

Humm... Really?

> > Nope. Why would a dirty zone pose a problem? The problem exists if you 
> > cannot allocate more memory.
> 
> Well one example would be a GFP_KERNEL allocation on a highmem machine in
> which all of ZONE_NORMAL is dirty.

That is a restricted allocation which will lead to reclaim.

> > If we have multiple zones then other zones may still provide memory to 
> > continue (same as in UP).
> 
> Not if all the eligible zones are all-dirty.

They are all dirty if we do not throttle the dirty pages.

> Right now, what we have is an NFS bug.  How about we fix it, then
> reevaluate the situation?

The "NFS bug" only exists when using a cpuset. If you run NFS without 
cpusets then the throttling will kick in and everything is fine.

> A good starting point would be to show us one of these oom-killer traces.

No traces. Since the process is killed within a cpuset we only get 
messages like:

Nov 28 16:19:52 ic4 kernel: Out of Memory: Kill process 679783 (ncks) score 0 
and children.
Nov 28 16:19:52 ic4 kernel: No available memory in cpuset: Killed process 
679783 (ncks).
Nov 28 16:27:58 ic4 kernel: oom-killer: gfp_mask=0x200d2, order=0

Probably need to rerun these with some patches.

> > Let's say we have a cpuset with 4 nodes (thus 4 zones) and we are running 
> > on the first node. Then we copy a large file to disk. Node local 
> > allocation means that we allocate from the first node. After we reach 40% 
> > of the node then we throttle? This is going to be a significant 
> > performance degradation since we can no longer use the memory of other 
> > nodes to buffer writeout.
> 
> That was what I was referring to.

Note that this was describing the behavior you wanted, not the way things 
work. Is it desired behavior not to use all the memory resources of the 
cpuset and to slow down the system?




Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread Andrew Morton
> On Tue, 16 Jan 2007 16:16:30 -0800 (PST) Christoph Lameter <[EMAIL PROTECTED]> wrote:
> On Tue, 16 Jan 2007, Andrew Morton wrote:
> 
> > It's a workaround for a still-unfixed NFS problem.
> 
> No, it's doing proper throttling. Without this patchset there will be *no* 
> writeback and throttling at all. F.e. let's say we have 20 nodes of 1G each
> and a cpuset that only spans one node.
> 
> Then a process running in that cpuset can dirty all of memory and still 
> continue running without writeback kicking in. The background dirty ratio
> is at 10% and the dirty ratio at 40%. Neither of those boundaries can ever
> be reached because the process will only ever be able to dirty memory on 
> one node, which is 5%. There will be no throttling, no background 
> writeback, no blocking for dirty pages.
> 
> At some point we run into reclaim (possibly we have ~99% of the cpuset 
> dirty) and then we trigger writeout. Okay so if the filesystem / block 
> device is robust enough and does not require memory allocations then we 
> likely will survive that and do slow writeback page by page from the LRU.
> 
> writeback is completely hosed for that situation. This patch restores 
> expected behavior in a cpuset (which is a form of system partition that 
> should mirror the system as a whole). At 10% dirty we should start 
> background writeback and at 40% we should block. If that is done then even 
> fragile combinations of filesystem/block devices will work as they do 
> without cpusets.

Nope.  You've completely omitted the little fact that we'll do writeback in
the offending zone off the LRU.  Slower, maybe.  But it should work and the
system should recover.  If it's not doing that (it isn't) then we should
fix it rather than avoiding it (by punting writeback over to pdflush).

Once that's fixed, if we determine that there are remaining and significant
performance issues then we can take a look at that.

> 
> > > Yes we can fix these allocations by allowing processes to allocate from 
> > > other nodes. But then the container function of cpusets is no longer 
> > > there.
> > But that's what your patch already does!
> 
> The patchset does not allow processes to allocate from other nodes than 
> the current cpuset.

Yes it does.  It asks pdflush to perform writeback of the offending zone(s)
rather than (or as well as) doing it directly.  The only reason pdflush can
sucessfuly do that is because pdflush can allocate its requests from other
zones.

> 
> AFAIK any filesystem/block device can go oom with the current broken 
> writeback; it just takes a few allocations. It's a matter of hitting the 
> sweet spots.

That shouldn't be possible, in theory.  Block IO is supposed to succeed if
*all memory in the machine is dirty*: the old
dirty-everything-with-MAP_SHARED-then-exit problem.  Lots of testing went
into that and it works.  It also failed on NFS although I thought that got
"fixed" a year or so ago.  Apparently not.

> > But we also can get into trouble if a *zone* is all-dirty.  Any solution to
> > the cpuset problem should solve that problem too, no?
> 
> Nope. Why would a dirty zone pose a problem? The problem exists if you 
> cannot allocate more memory.

Well one example would be a GFP_KERNEL allocation on a highmem machine in
which all of ZONE_NORMAL is dirty.

> If a cpuset contains a single node which is a 
> single zone then this patchset will also address that issue.
> 
> If we have multiple zones then other zones may still provide memory to 
> continue (same as in UP).

Not if all the eligible zones are all-dirty.

> > > Yes, but when we enter reclaim most of the pages of a zone may already be 
> > > dirty/writeback so we fail.
> > 
> > No.  If the dirty limits become per-zone then no zone will ever have >40%
> > dirty.
> 
> I am still confused as to why you would want per zone dirty limits?

The need for that has yet to be demonstrated.  There _might_ be a problem,
but we need test cases and analyses to demonstrate that need.

Right now, what we have is an NFS bug.  How about we fix it, then
reevaluate the situation?

A good starting point would be to show us one of these oom-killer traces.

> Let's say we have a cpuset with 4 nodes (thus 4 zones) and we are running 
> on the first node. Then we copy a large file to disk. Node local 
> allocation means that we allocate from the first node. After we reach 40% 
> of the node then we throttle? This is going to be a significant 
> performance degradation since we can no longer use the memory of other 
> nodes to buffer writeout.

That was what I was referring to.




Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread Christoph Lameter
On Tue, 16 Jan 2007, Andrew Morton wrote:

> It's a workaround for a still-unfixed NFS problem.

No, it's doing proper throttling. Without this patchset there will be *no* 
writeback and throttling at all. F.e. let's say we have 20 nodes of 1G each
and a cpuset that only spans one node.

Then a process running in that cpuset can dirty all of memory and still 
continue running without writeback kicking in. The background dirty ratio
is at 10% and the dirty ratio at 40%. Neither of those boundaries can ever
be reached because the process will only ever be able to dirty memory on 
one node, which is 5%. There will be no throttling, no background 
writeback, no blocking for dirty pages.
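
Spelling out the arithmetic: the global limits are computed against all 20G,
so

	background writeback threshold = 10% of 20G = 2G
	dirty throttling threshold     = 40% of 20G = 8G

while the cpuset can hold at most 1G of dirty pages (5% of total memory), so
neither threshold can ever be crossed from inside it.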

At some point we run into reclaim (possibly we have ~99% of the cpuset 
dirty) and then we trigger writeout. Okay so if the filesystem / block 
device is robust enough and does not require memory allocations then we 
likely will survive that and do slow writeback page by page from the LRU.

writeback is completely hosed for that situation. This patch restores 
expected behavior in a cpuset (which is a form of system partition that 
should mirror the system as a whole). At 10% dirty we should start 
background writeback and at 40% we should block. If that is done then even 
fragile combinations of filesystem/block devices will work as they do 
without cpusets.


> > Yes we can fix these allocations by allowing processes to allocate from 
> > other nodes. But then the container function of cpusets is no longer 
> > there.
> But that's what your patch already does!

The patchset does not allow processes to allocate from other nodes than 
the current cpuset. There is no change as to the source of memory 
allocations.
 
> > NFS is okay as far as I can tell. Dirty throttling works fine in non-cpuset 
> > environments because we throttle if 40% of memory becomes dirty or 
> > under writeback.
> 
> Repeat: NFS shouldn't go oom.  It should fail the allocation, recover, wait
> for existing IO to complete.  Back that up with a mempool for NFS requests
> and the problem is solved, I think?

AFAIK any filesystem/block device can go oom with the current broken 
writeback; it just takes a few allocations. It's a matter of hitting the 
sweet spots.

> But we also can get into trouble if a *zone* is all-dirty.  Any solution to
> the cpuset problem should solve that problem too, no?

Nope. Why would a dirty zone pose a problem? The problem exists if you 
cannot allocate more memory. If a cpuset contains a single node which is a 
single zone then this patchset will also address that issue.

If we have multiple zones then other zones may still provide memory to 
continue (same as in UP).

> > Yes, but when we enter reclaim most of the pages of a zone may already be 
> > dirty/writeback so we fail.
> 
> No.  If the dirty limits become per-zone then no zone will ever have >40%
> dirty.

I am still confused as to why you would want per zone dirty limits?

Let's say we have a cpuset with 4 nodes (thus 4 zones) and we are running 
on the first node. Then we copy a large file to disk. Node local 
allocation means that we allocate from the first node. After we reach 40% 
of the node then we throttle? This is going to be a significant 
performance degradation since we can no longer use the memory of other 
nodes to buffer writeout.

> The obvious fix here is: when a zone hits 40% dirty, perform dirty-memory
> reduction in that zone, throttling the dirtying process.  I suspect this
> would work very badly in common situations with, say, typical i386 boxes.

Absolute crap. You can prototype that broken behavior with zone reclaim by 
the way. Just switch on writeback during zone reclaim and watch how memory 
on a cpuset is unused and how the system becomes slow as molasses.



Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread David Chinner
On Tue, Jan 16, 2007 at 01:53:25PM -0800, Andrew Morton wrote:
> > On Mon, 15 Jan 2007 21:47:43 -0800 (PST) Christoph Lameter
> > <[EMAIL PROTECTED]> wrote:
> >
> > Currently cpusets are not able to do proper writeback since dirty ratio
> > calculations and writeback are all done for the system as a whole.
> 
> We _do_ do proper writeback.  But it's less efficient than it might be, and
> there's an NFS problem.
> 
> > This may result in a large percentage of a cpuset to become dirty without
> > writeout being triggered. Under NFS this can lead to OOM conditions.
> 
> OK, a big question: is this patchset a performance improvement or a
> correctness fix?  Given the above, and the lack of benchmark results I'm
> assuming it's for correctness.

Given that we've already got a 25-30% buffered write performance
degradation between 2.6.18 and 2.6.20-rc4 for simple sequential
write I/O to multiple filesystems concurrently, I'd really like
to see some serious I/O performance regression testing on this
change before it goes anywhere.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread Andrew Morton
> On Tue, 16 Jan 2007 14:15:56 -0800 (PST) Christoph Lameter <[EMAIL PROTECTED]> wrote:
>
> ...
>
> > > This may result in a large percentage of a cpuset
> > > to become dirty without writeout being triggered. Under NFS
> > > this can lead to OOM conditions.
> > 
> > OK, a big question: is this patchset a performance improvement or a
> > correctness fix?  Given the above, and the lack of benchmark results I'm
> > assuming it's for correctness.
> 
> It is a correctness fix both for NFS OOM and doing proper cpuset writeout.

It's a workaround for a still-unfixed NFS problem.

> > - Why does NFS go oom?  Because it allocates potentially-unbounded
> >   numbers of requests in the writeback path?
> > 
> >   It was able to go oom on non-numa machines before dirty-page-tracking
> >   went in.  So a general problem has now become specific to some NUMA
> >   setups.
> 
> 
> Right. The issue is that large portions of memory become dirty / 
> writeback since no writeback occurs because dirty limits are not checked 
> for a cpuset. Then NFS attempts to write out when doing LRU scans but is 
> unable to allocate memory.
>  
> >   So an obvious, equivalent and vastly simpler "fix" would be to teach
> >   the NFS client to go off-cpuset when trying to allocate these requests.
> 
> Yes we can fix these allocations by allowing processes to allocate from 
> other nodes. But then the container function of cpusets is no longer 
> there.

But that's what your patch already does!

It asks pdflush to write the pages instead of the direct-reclaim caller. 
The only reason pdflush doesn't go oom is that pdflush lives outside the
direct-reclaim caller's cpuset and is hence able to obtain those nfs
requests from off-cpuset zones.

> > (But is it really bad? What actual problems will it cause once NFS is 
> > fixed?)
> 
> NFS is okay as far as I can tell. Dirty throttling works fine in non-cpuset 
> environments because we throttle if 40% of memory becomes dirty or 
> under writeback.

Repeat: NFS shouldn't go oom.  It should fail the allocation, recover, wait
for existing IO to complete.  Back that up with a mempool for NFS requests
and the problem is solved, I think?
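
(A sketch of the mempool idea, using the stock mempool API; the cache and
function names are illustrative, not NFS's actual ones.)

	static struct kmem_cache *nfs_req_cachep;	/* assumed slab cache */
	static mempool_t *nfs_req_mempool;

	static int __init nfs_req_mempool_init(void)
	{
		nfs_req_mempool = mempool_create(32,	/* reserved objects */
						 mempool_alloc_slab,
						 mempool_free_slab,
						 nfs_req_cachep);
		return nfs_req_mempool ? 0 : -ENOMEM;
	}

	/* Writeback path: because __GFP_WAIT is set in GFP_NOFS, this sleeps
	 * on the reserve instead of failing, so it cannot oom. */
	static struct nfs_page *nfs_req_alloc(void)
	{
		return mempool_alloc(nfs_req_mempool, GFP_NOFS);
	}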

> > I don't understand why the proposed patches are cpuset-aware at all.  This
> > is a per-zone problem, and a per-zone fix would seem to be appropriate, and
> > more general.  For example, i386 machines can presumably get into trouble
> > if all of ZONE_DMA or ZONE_NORMAL get dirty.  A good implementation would
> > address that problem as well.  So I think it should all be per-zone?
> 
> No. A zone can be completely dirty as long as we are allowed to allocate 
> from other zones.

But we also can get into trouble if a *zone* is all-dirty.  Any solution to
the cpuset problem should solve that problem too, no?

> > Do we really need those per-inode cpumasks?  When page reclaim encounters a
> > dirty page on the zone LRU, we automatically know that page->mapping->host
> > has at least one dirty page in this zone, yes?  We could immediately ask
> 
> Yes, but when we enter reclaim most of the pages of a zone may already be 
> dirty/writeback so we fail.

No.  If the dirty limits become per-zone then no zone will ever have >40%
dirty.

The obvious fix here is: when a zone hits 40% dirty, perform dirty-memory
reduction in that zone, throttling the dirtying process.  I suspect this
would work very badly in common situations with, say, typical i386 boxes.




Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread Christoph Lameter
On Wed, 17 Jan 2007, Andi Kleen wrote:

> > Secondly we modify the dirty limit calculation to be based
> > on the active cpuset.
> 
> The global dirty limit definitely seems to be a problem
> in several cases, but my feeling is that the cpuset is the wrong unit
> to keep track of it. Most likely it should be more fine grained.

We already have zone reclaim that can take care of smaller units but why 
would we start writeback if only one zone is full of dirty pages and there
are lots of other zones (nodes) that are free?

> > If we are in a cpuset then we select only inodes for writeback
> > that have pages on the nodes of the cpuset.
> 
> Is there any indication this change helps on smaller systems
> or is it purely a large system optimization?

The bigger the system the larger the problem, because the dirty ratio is
currently calculated based on the percentage of dirty pages in the system
as a whole. The smaller the fraction of the system a cpuset contains, the
less effective dirty_ratio and background_dirty_ratio become.
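
(In other words, the patchset computes the thresholds against the cpuset's
own memory, roughly like the following illustrative sketch, where
node_unreclaimable() stands in for the NR_UNRECLAIMABLE accounting quoted
below.)

	unsigned long dirtyable = 0;
	int node;

	for_each_node_mask(node, current->mems_allowed)
		dirtyable += node_present_pages(node) - node_unreclaimable(node);

	background_thresh = dirtyable * dirty_background_ratio / 100;
	dirty_thresh      = dirtyable * vm_dirty_ratio / 100;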

> > B. We add a new counter NR_UNRECLAIMABLE that is subtracted
> >    from the available pages in a node. This allows us to
> >    accurately calculate the dirty ratio even if large portions
> >    of the node have been allocated for huge pages or for
> >    slab pages.
> 
> That sounds like a useful change by itself.

I can separate that one out.



Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread Christoph Lameter
On Tue, 16 Jan 2007, Andrew Morton wrote:

> > On Mon, 15 Jan 2007 21:47:43 -0800 (PST) Christoph Lameter <[EMAIL PROTECTED]> wrote:
> >
> > Currently cpusets are not able to do proper writeback since
> > dirty ratio calculations and writeback are all done for the system
> > as a whole.
> 
> We _do_ do proper writeback.  But it's less efficient than it might be, and
> there's an NFS problem.

Well yes, we write back during LRU scans when a potentially high percentage 
of the memory in a cpuset is dirty.

> > This may result in a large percentage of a cpuset
> > to become dirty without writeout being triggered. Under NFS
> > this can lead to OOM conditions.
> 
> OK, a big question: is this patchset a performance improvement or a
> correctness fix?  Given the above, and the lack of benchmark results I'm
> assuming it's for correctness.

It is a correctness fix both for NFS OOM and doing proper cpuset writeout.

> - Why does NFS go oom?  Because it allocates potentially-unbounded
>   numbers of requests in the writeback path?
> 
>   It was able to go oom on non-numa machines before dirty-page-tracking
>   went in.  So a general problem has now become specific to some NUMA
>   setups.


Right. The issue is that large portions of memory become dirty / 
writeback since no writeback occurs because dirty limits are not checked 
for a cpuset. Then NFS attempts to write out when doing LRU scans but is 
unable to allocate memory.
 
>   So an obvious, equivalent and vastly simpler "fix" would be to teach
>   the NFS client to go off-cpuset when trying to allocate these requests.

Yes we can fix these allocations by allowing processes to allocate from 
other nodes. But then the container function of cpusets is no longer 
there.

> (But is it really bad? What actual problems will it cause once NFS is fixed?)

NFS is okay as far as I can tell. Dirty throttling works fine in non-cpuset 
environments because we throttle if 40% of memory becomes dirty or 
under writeback.

> I don't understand why the proposed patches are cpuset-aware at all.  This
> is a per-zone problem, and a per-zone fix would seem to be appropriate, and
> more general.  For example, i386 machines can presumably get into trouble
> if all of ZONE_DMA or ZONE_NORMAL get dirty.  A good implementation would
> address that problem as well.  So I think it should all be per-zone?

No. A zone can be completely dirty as long as we are allowed to allocate 
from other zones.

> Do we really need those per-inode cpumasks?  When page reclaim encounters a
> dirty page on the zone LRU, we automatically know that page->mapping->host
> has at least one dirty page in this zone, yes?  We could immediately ask

Yes, but when we enter reclaim most of the pages of a zone may already be 
dirty/writeback so we fail. Also when we enter reclaim we may not have
the proper process / cpuset context. There is no use in throttling kswapd. 
We need to throttle the process that is dirtying memory.

> But all of this is, I think, unneeded if NFS is fixed.  It's hopefully a
> performance optimisation to permit writeout in a less seeky fashion. 
> Unless there's some other problem with excessively dirty zones.

The patchset improves performance because the filesystem can do sequential 
writeouts. So yes, in some ways this is a performance improvement. But this 
is only because the patch makes dirty throttling for cpusets work in the 
same way as on a non-NUMA system.


Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread Andi Kleen

> Secondly we modify the dirty limit calculation to be based
> on the active cpuset.

The global dirty limit definitely seems to be a problem
in several cases, but my feeling is that the cpuset is the wrong unit
to keep track of it. Most likely it should be more fine grained.

> If we are in a cpuset then we select only inodes for writeback
> that have pages on the nodes of the cpuset.

Is there any indication this change helps on smaller systems
or is it purely a large system optimization?

> B. We add a new counter NR_UNRECLAIMABLE that is subtracted
>    from the available pages in a node. This allows us to
>    accurately calculate the dirty ratio even if large portions
>    of the node have been allocated for huge pages or for
>    slab pages.

That sounds like a useful change by itself.

-Andi


Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread Andrew Morton
> On Mon, 15 Jan 2007 21:47:43 -0800 (PST) Christoph Lameter <[EMAIL PROTECTED]> wrote:
>
> Currently cpusets are not able to do proper writeback since
> dirty ratio calculations and writeback are all done for the system
> as a whole.

We _do_ do proper writeback.  But it's less efficient than it might be, and
there's an NFS problem.

> This may result in a large percentage of a cpuset
> to become dirty without writeout being triggered. Under NFS
> this can lead to OOM conditions.

OK, a big question: is this patchset a performance improvement or a
correctness fix?  Given the above, and the lack of benchmark results I'm
assuming it's for correctness.

- Why does NFS go oom?  Because it allocates potentially-unbounded
  numbers of requests in the writeback path?

  It was able to go oom on non-numa machines before dirty-page-tracking
  went in.  So a general problem has now become specific to some NUMA
  setups.

  We have earlier discussed fixing NFS to not do that.  Make it allocate
  a fixed number of requests and then block, just like get_request_wait()
  (see the sketch after this list).  This is one reason why
  block_congestion_wait() and friends got renamed to congestion_wait():
  it's on the path to getting NFS better aligned with how block devices
  are handling this.

- There's no reason which I can see why NFS _has_ to go oom.  It could
  just fail the memory allocation for the request and then wait for the
  stuff which it _has_ submitted to complete.  We do that for block
  devices, backed by mempools.

- Why does NFS go oom if there's free memory in other nodes?  I assume
  that's what's happening, because things apparently work OK if you ask
  pdflush to do exactly the thing which the direct-reclaim process was
  attempting to do: allocate NFS requests and do writeback.

  So an obvious, equivalent and vastly simpler "fix" would be to teach
  the NFS client to go off-cpuset when trying to allocate these requests.
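
(A sketch of the "fixed number of requests, then block" pattern, using the
real congestion_wait() primitive; the counter and limit names are
illustrative.)

	/* Bound in-flight NFS requests the way the block layer bounds
	 * struct request allocations in get_request_wait(). */
	while (atomic_read(&nfs_requests_outstanding) >= NFS_MAX_OUTSTANDING)
		congestion_wait(WRITE, HZ/50);	/* wait for writeback progress */
	atomic_inc(&nfs_requests_outstanding);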

I suspect that if we do some or all of the above, NFS gets better and the
bug which motivated this patchset goes away.

But that being said, yes, allowing a zone to go 100% dirty like this is
bad, and it'd be nice to be able to fix it.

(But is it really bad? What actual problems will it cause once NFS is fixed?)

Assuming that it is bad, yes, we'll obviously need the extra per-zone
dirty-memory accounting.




I don't understand why the proposed patches are cpuset-aware at all.  This
is a per-zone problem, and a per-zone fix would seem to be appropriate, and
more general.  For example, i386 machines can presumably get into trouble
if all of ZONE_DMA or ZONE_NORMAL get dirty.  A good implementation would
address that problem as well.  So I think it should all be per-zone?




Do we really need those per-inode cpumasks?  When page reclaim encounters a
dirty page on the zone LRU, we automatically know that page->mapping->host
has at least one dirty page in this zone, yes?  We could immediately ask
pdflush to write out some pages from that inode.  We would need to take a
ref on the inode (while the page is locked, to avoid racing with inode
reclaim) and pass that inode off to pdflush (actually pass a list of such
inodes off to pdflush, keep appending to it).

Extra refinements would include

- telling pdflush the file offset of the page so it can do writearound

- getting pdflush to deactivate any pages which it writes out, so that
  rotate_reclaimable_page() has a good chance of moving them to the tail of
  the inactive list for immediate reclaim.
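
(A rough sketch of that mechanism; igrab() and pdflush_operation() are real
interfaces, while background_write_inode() is a hypothetical callback that
would do the writearound and then iput() the inode.)

	/* Called from page reclaim with the dirty page locked, so
	 * page->mapping->host cannot be torn down under us. */
	static void ask_pdflush_for_inode(struct page *page)
	{
		struct inode *inode = igrab(page->mapping->host);

		if (!inode)
			return;		/* inode is already being reclaimed */

		/* Runs the callback in a pdflush thread, which lives outside
		 * the direct-reclaim caller's cpuset; on failure we would
		 * have to iput() here (omitted for brevity). */
		pdflush_operation(background_write_inode, (unsigned long)inode);
	}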

But all of this is, I think, unneeded if NFS is fixed.  It's hopefully a
performance optimisation to permit writeout in a less seeky fashion. 
Unless there's some other problem with excessively dirty zones.



Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread Christoph Lameter
On Tue, 16 Jan 2007, Peter Zijlstra wrote:

> > B. We add a new counter NR_UNRECLAIMABLE that is subtracted
> >    from the available pages in a node. This allows us to
> >    accurately calculate the dirty ratio even if large portions
> >    of the node have been allocated for huge pages or for
> >    slab pages.
> 
> What about mlock'ed pages?

mlocked pages can be dirty and written back, right? So for the
dirty ratio calculation they do not play a role. We may need a
separate counter for mlocked pages if they are to be considered
for other decisions in the VM.

> Otherwise it all looks good.
> 
> Acked-by: Peter Zijlstra <[EMAIL PROTECTED]>

Thanks.


Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread Christoph Lameter
On Tue, 16 Jan 2007, Paul Jackson wrote:

> > 1. The nodemask expands the inode structure significantly if the
> > architecture allows a high number of nodes. This is only an issue
> > for IA64. 
> 
> Should that logic be disabled if HOTPLUG is configured on?  Or is
> nr_node_ids a valid upper limit on what could be plugged in, even on a
> mythical HOTPLUG system?

nr_node_ids is a valid upper limit on what could be plugged in. We could
modify nodemasks to only use nr_node_ids bits and the kernel would still
be functioning correctly.
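
(Concretely, sizing by nr_node_ids rather than MAX_NUMNODES means something
like the following; illustrative only.)

	/* Bytes for a dirty map covering only the nodes that can ever
	 * exist on this boot, not the configured maximum. */
	size_t dirty_map_size = BITS_TO_LONGS(nr_node_ids) * sizeof(unsigned long);
	unsigned long *dirty_map = kzalloc(dirty_map_size, GFP_ATOMIC);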

> > 2. The calculation of the per cpuset limits can require looping
> > over a number of nodes which may bring the performance of get_dirty_limits
> > near pre 2.6.18 performance
> 
> Could we cache these limits?  Perhaps they only need to be recalculated if
> a task's mems_allowed changes?

No they change dynamically. In particular writeout reduces the number of 
dirty / unstable pages.

> Separate question - what happens if a task's mems_allowed changes while it is
> dirtying pages?  We could easily end up with dirty pages on nodes that are
> no longer allowed to the task.  Is there any way that such a miscalculation
> could cause us to do harmful things?

The dirty_map on an inode is independent of a cpuset. The cpuset only 
comes into effect when we decide to do writeout and are scanning for files 
with pages on the nodes of interest.

> In patch 2/8:
> > The dirty map is cleared when the inode is cleared. There is no
> > synchronization (except for atomic nature of node_set) for the dirty_map. 
> > The
> > only problem that could be done is that we do not write out an inode if a
> > node bit is not set.
> 
> Does this mean that a dirty page could be left 'forever' in memory, unwritten,
> exposing us to risk of data corruption on disk, from some write done weeks 
> ago,
> but unwritten, in the event of say a power loss?

No it will age and be written out anyways. Note that there are usually 
multiple dirty pages which reduces the chance of the race. These are node
bits that help to decide when to start writeout of all dirty pages of an 
inode regardless of where the other pages are.

> Also in patch 2/8:
> > +static inline void cpuset_update_dirty_nodes(struct inode *i,
> > +   struct page *page) {}
> 
> Is an incomplete 'struct inode;' declaration needed here in cpuset.h,
> to avoid a warning if compiling with CPUSETS not configured?

Correct.
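
For reference, the !CONFIG_CPUSETS stub in include/linux/cpuset.h would then
look something like this:

    struct inode;   /* forward declaration; avoids a warning when
                       CPUSETS is not configured */

    static inline void cpuset_update_dirty_nodes(struct inode *i,
                                            struct page *page) {}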

> 
> In patch 4/8:
> > We now add per node information which I think is equal or less effort
> > since there are less nodes than processors.
> 
> Not so on Paul Menage's fake NUMA nodes - he can have say 64 fake nodes on
> a system with 2 or 4 CPUs and one real node.  But I guess that's ok ...

True but then it's fake.

> In patch 4/8:
> > +#ifdef CONFIG_CPUSETS
> > +   /*
> > +* Calculate the limits relative to the current cpuset if necessary.
> > +*/
> > +   if (unlikely(nodes &&
> > +   !nodes_subset(node_online_map, *nodes))) {
> > +   int node;
> > +
> > +   is_subset = 1;
> > +   ...
> > +#ifdef CONFIG_HIGHMEM
> > +   high_memory += NODE_DATA(node)
> > +   ->node_zones[ZONE_HIGHMEM]->present_pages;
> > +#endif
> > +   nr_mapped += node_page_state(node, NR_FILE_MAPPED) +
> > +   node_page_state(node, NR_ANON_PAGES);
> > +   }
> > +   } else
> > +#endif
> > +   {
> 
> I'm wishing there was a clearer way to write the above code.  Nested
> ifdef's and an ifdef block ending in an open 'else' and perhaps the first
> #ifdef CONFIG_CPUSETS ever, outside of fs/proc/base.c ...

I have tried to replicate the structure for global dirty_limits 
calculation which has the same ifdef.
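
One possible way to flatten those ifdefs, sketched with a hypothetical helper
that compiles to a stub when CPUSETS is off; get_dirty_limits() would call it
and fall back to the global calculation when it returns 0:

    #ifdef CONFIG_CPUSETS
    /* Fill in per-cpuset totals; returns 0 when the allowed nodes are
     * not a proper subset of the online nodes. */
    static int cpuset_dirty_totals(nodemask_t *nodes,
                    unsigned long *dirtyable, unsigned long *nr_mapped)
    {
            int node;

            if (!nodes || nodes_subset(node_online_map, *nodes))
                    return 0;       /* unrestricted: use the global path */
            for_each_node_mask(node, *nodes) {
                    *dirtyable += NODE_DATA(node)->node_present_pages;
                    *nr_mapped += node_page_state(node, NR_FILE_MAPPED) +
                                  node_page_state(node, NR_ANON_PAGES);
            }
            return 1;
    }
    #else
    static inline int cpuset_dirty_totals(nodemask_t *nodes,
                    unsigned long *dirtyable, unsigned long *nr_mapped)
    {
            return 0;
    }
    #endif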
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread Paul Jackson
Christoph wrote:
> Currently cpusets are not able to do proper writeback since
> dirty ratio calculations and writeback are all done for the system
> as a whole.

Thanks for tackling this - it is sorely needed.

I'm afraid my review will be mostly cosmetic; I'm not competent
to comment on the really interesting stuff.

> If we are in a cpuset then we select only inodes for writeback
> that have pages on the nodes of the cpuset.

Sorry - you tripped over a subtle distinction that happens to be on my
list of things to notice.

When cpusets are configured, -all- tasks are in a cpuset.  And
(correctly so, I trust) this patch doesn't look into the task's cpuset
to see what nodes it allows.  Rather it looks to the mems_allowed field
in the task struct, which is equal to or (when set_mempolicy is used) a
subset of that task's cpuset's allowed nodes.

Perhaps the following phrasing would be more accurate:

  If CPUSETs are configured, then we select only the inodes for
  writeback that have dirty pages on that task's mems_allowed nodes.

> Secondly we modify the dirty limit calculation to be based
> on the active cpuset.

As above, perhaps the following would be more accurate:

  Secondly we modify the dirty limit calculation to be based
  on the current task's mems_allowed nodes.

> 1. The nodemask expands the inode structure significantly if the
> architecture allows a high number of nodes. This is only an issue
> for IA64. 

Should that logic be disabled if HOTPLUG is configured on?  Or is
nr_node_ids a valid upper limit on what could be plugged in, even on a
mythical HOTPLUG system?

> 2. The calculation of the per cpuset limits can require looping
> over a number of nodes which may bring the performance of get_dirty_limits
> near pre 2.6.18 performance

Could we cache these limits?  Perhaps they only need to be recalculated if
a task's mems_allowed changes?

Separate question - what happens if a task's mems_allowed changes while it is
dirtying pages?  We could easily end up with dirty pages on nodes that are
no longer allowed to the task.  Is there any way that such a miscalculation
could cause us to do harmful things?

In patch 2/8:
> The dirty map is cleared when the inode is cleared. There is no
> synchronization (except for atomic nature of node_set) for the dirty_map. The
> only problem that could be done is that we do not write out an inode if a
> node bit is not set.

Does this mean that a dirty page could be left 'forever' in memory, unwritten,
exposing us to risk of data corruption on disk, from some write done weeks ago,
but unwritten, in the event of say a power loss?

Also in patch 2/8:
> +static inline void cpuset_update_dirty_nodes(struct inode *i,
> + struct page *page) {}

Is an incomplete 'struct inode;' declaration needed here in cpuset.h,
to avoid a warning if compiling with CPUSETS not configured?

In patch 4/8:
> We now add per node information which I think is equal or less effort
> since there are less nodes than processors.

Not so on Paul Menage's fake NUMA nodes - he can have say 64 fake nodes on
a system with 2 or 4 CPUs and one real node.  But I guess that's ok ...

In patch 4/8:
> +#ifdef CONFIG_CPUSETS
> + /*
> +  * Calculate the limits relative to the current cpuset if necessary.
> +  */
> + if (unlikely(nodes &&
> + !nodes_subset(node_online_map, *nodes))) {
> + int node;
> +
> + is_subset = 1;
> + ...
> +#ifdef CONFIG_HIGHMEM
> + high_memory += NODE_DATA(node)
> + ->node_zones[ZONE_HIGHMEM]->present_pages;
> +#endif
> + nr_mapped += node_page_state(node, NR_FILE_MAPPED) +
> + node_page_state(node, NR_ANON_PAGES);
> + }
> + } else
> +#endif
> + {

I'm wishing there was a clearer way to write the above code.  Nested
ifdef's and an ifdef block ending in an open 'else' and perhaps the first
#ifdef CONFIG_CPUSETS ever, outside of fs/proc/base.c ...

However I have no clue if such a clearer way exists.  Sorry.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread Andi Kleen

> Secondly we modify the dirty limit calculation to be based
> on the active cpuset.

The global dirty limit definitely seems to be a problem
in several cases, but my feeling is that the cpuset is the wrong unit
to keep track of it. Most likely it should be more fine grained.

> If we are in a cpuset then we select only inodes for writeback
> that have pages on the nodes of the cpuset.

Is there any indication this change helps on smaller systems
or is it purely a large system optimization?

> B. We add a new counter NR_UNRECLAIMABLE that is subtracted
>    from the available pages in a node. This allows us to
>    accurately calculate the dirty ratio even if large portions
>    of the node have been allocated for huge pages or for
>    slab pages.

That sounds like a useful change by itself.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread Christoph Lameter
On Tue, 16 Jan 2007, Andrew Morton wrote:

> > On Mon, 15 Jan 2007 21:47:43 -0800 (PST) Christoph Lameter <[EMAIL PROTECTED]> wrote:
>
> > Currently cpusets are not able to do proper writeback since
> > dirty ratio calculations and writeback are all done for the system
> > as a whole.
>
> We _do_ do proper writeback.  But it's less efficient than it might be, and
> there's an NFS problem.

Well yes we write back during LRU scans when a potentially high percentage 
of the memory in a cpuset is dirty.

> > This may result in a large percentage of a cpuset
> > to become dirty without writeout being triggered. Under NFS
> > this can lead to OOM conditions.
>
> OK, a big question: is this patchset a performance improvement or a
> correctness fix?  Given the above, and the lack of benchmark results I'm
> assuming it's for correctness.

It is a correctness fix both for NFS OOM and doing proper cpuset writeout.

> - Why does NFS go oom?  Because it allocates potentially-unbounded
>   numbers of requests in the writeback path?
>
>   It was able to go oom on non-numa machines before dirty-page-tracking
>   went in.  So a general problem has now become specific to some NUMA
>   setups.


Right. The issue is that large portions of memory become dirty /
writeback since no writeback occurs because dirty limits are not checked
for a cpuset. Then NFS attempts to write out when doing LRU scans but is
unable to allocate memory.
>
>   So an obvious, equivalent and vastly simpler fix would be to teach
>   the NFS client to go off-cpuset when trying to allocate these requests.

Yes we can fix these allocations by allowing processes to allocate from 
other nodes. But then the container function of cpusets is no longer 
there.

> (But is it really bad? What actual problems will it cause once NFS is fixed?)

NFS is okay as far as I can tell. Dirty throttling works fine in non-cpuset
environments because we throttle if 40% of memory becomes dirty or
under writeback.

> I don't understand why the proposed patches are cpuset-aware at all.  This
> is a per-zone problem, and a per-zone fix would seem to be appropriate, and
> more general.  For example, i386 machines can presumably get into trouble
> if all of ZONE_DMA or ZONE_NORMAL get dirty.  A good implementation would
> address that problem as well.  So I think it should all be per-zone?

No. A zone can be completely dirty as long as we are allowed to allocate 
from other zones.

> Do we really need those per-inode cpumasks?  When page reclaim encounters a
> dirty page on the zone LRU, we automatically know that page->mapping->host
> has at least one dirty page in this zone, yes?  We could immediately ask

Yes, but when we enter reclaim most of the pages of a zone may already be 
dirty/writeback so we fail. Also when we enter reclaim we may not have
the proper process / cpuset context. There is no use to throttle kswapd. 
We need to throttle the process that is dirtying memory.

> But all of this is, I think, unneeded if NFS is fixed.  It's hopefully a
> performance optimisation to permit writeout in a less seeky fashion.
> Unless there's some other problem with excessively dirty zones.

The patchset improves performance because the filesystem can do sequential 
writeouts. So yes in some ways this is a performance improvement. But this 
is only because this patch makes dirty throttling for cpusets work in the 
same way as for a non-NUMA system.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread Christoph Lameter
On Wed, 17 Jan 2007, Andi Kleen wrote:

> > Secondly we modify the dirty limit calculation to be based
> > on the active cpuset.
>
> The global dirty limit definitely seems to be a problem
> in several cases, but my feeling is that the cpuset is the wrong unit
> to keep track of it. Most likely it should be more fine grained.

We already have zone reclaim that can take care of smaller units but why 
would we start writeback if only one zone is full of dirty pages and there
are lots of other zones (nodes) that are free?

> > If we are in a cpuset then we select only inodes for writeback
> > that have pages on the nodes of the cpuset.
>
> Is there any indication this change helps on smaller systems
> or is it purely a large system optimization?

The bigger the system the larger the problem, because the dirty ratio
is currently calculated based on the percentage of dirty pages
in the system as a whole. The smaller the fraction of the system a cpuset
contains, the less effective the dirty_ratio and background_dirty_ratio
become.

> > B. We add a new counter NR_UNRECLAIMABLE that is subtracted
> >    from the available pages in a node. This allows us to
> >    accurately calculate the dirty ratio even if large portions
> >    of the node have been allocated for huge pages or for
> >    slab pages.
>
> That sounds like a useful change by itself.

I can separate that one out.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread Andrew Morton
> On Tue, 16 Jan 2007 14:15:56 -0800 (PST) Christoph Lameter <[EMAIL PROTECTED]> wrote:

> ...

> > > This may result in a large percentage of a cpuset
> > > to become dirty without writeout being triggered. Under NFS
> > > this can lead to OOM conditions.
> >
> > OK, a big question: is this patchset a performance improvement or a
> > correctness fix?  Given the above, and the lack of benchmark results I'm
> > assuming it's for correctness.
>
> It is a correctness fix both for NFS OOM and doing proper cpuset writeout.

It's a workaround for a still-unfixed NFS problem.

> > - Why does NFS go oom?  Because it allocates potentially-unbounded
> >   numbers of requests in the writeback path?
> >
> >   It was able to go oom on non-numa machines before dirty-page-tracking
> >   went in.  So a general problem has now become specific to some NUMA
> >   setups.
>
>
> Right. The issue is that large portions of memory become dirty /
> writeback since no writeback occurs because dirty limits are not checked
> for a cpuset. Then NFS attempts to write out when doing LRU scans but is
> unable to allocate memory.
> >
> >   So an obvious, equivalent and vastly simpler fix would be to teach
> >   the NFS client to go off-cpuset when trying to allocate these requests.
>
> Yes we can fix these allocations by allowing processes to allocate from
> other nodes. But then the container function of cpusets is no longer
> there.

But that's what your patch already does!

It asks pdflush to write the pages instead of the direct-reclaim caller. 
The only reason pdflush doesn't go oom is that pdflush lives outside the
direct-reclaim caller's cpuset and is hence able to obtain those nfs
requests from off-cpuset zones.

> > (But is it really bad? What actual problems will it cause once NFS is
> > fixed?)
>
> NFS is okay as far as I can tell. Dirty throttling works fine in non-cpuset
> environments because we throttle if 40% of memory becomes dirty or
> under writeback.

Repeat: NFS shouldn't go oom.  It should fail the allocation, recover, wait
for existing IO to complete.  Back that up with a mempool for NFS requests
and the problem is solved, I think?
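
A rough sketch of that, using the stock mempool API (MIN_NFS_REQUESTS and the
wrapper name are made up; nfs_page_cachep stands in for the existing nfs_page
slab cache, which is private to fs/nfs/pagelist.c):

    #include <linux/mempool.h>

    #define MIN_NFS_REQUESTS 32     /* assumed reserve size */

    static mempool_t *nfs_page_pool;

    static int __init nfs_page_pool_init(void)
    {
            nfs_page_pool = mempool_create_slab_pool(MIN_NFS_REQUESTS,
                                                     nfs_page_cachep);
            return nfs_page_pool ? 0 : -ENOMEM;
    }

    static struct nfs_page *nfs_page_alloc_throttled(void)
    {
            /* With __GFP_WAIT set, mempool_alloc() does not fail: it
             * sleeps until a previously submitted request is freed
             * back into the pool. */
            return mempool_alloc(nfs_page_pool, GFP_NOFS);
    }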

> > I don't understand why the proposed patches are cpuset-aware at all.  This
> > is a per-zone problem, and a per-zone fix would seem to be appropriate, and
> > more general.  For example, i386 machines can presumably get into trouble
> > if all of ZONE_DMA or ZONE_NORMAL get dirty.  A good implementation would
> > address that problem as well.  So I think it should all be per-zone?
>
> No. A zone can be completely dirty as long as we are allowed to allocate
> from other zones.

But we also can get into trouble if a *zone* is all-dirty.  Any solution to
the cpuset problem should solve that problem too, no?

> > Do we really need those per-inode cpumasks?  When page reclaim encounters a
> > dirty page on the zone LRU, we automatically know that page->mapping->host
> > has at least one dirty page in this zone, yes?  We could immediately ask
>
> Yes, but when we enter reclaim most of the pages of a zone may already be
> dirty/writeback so we fail.

No.  If the dirty limits become per-zone then no zone will ever have 40%
dirty.

The obvious fix here is: when a zone hits 40% dirty, perform dirty-memory
reduction in that zone, throttling the dirtying process.  I suspect this
would work very badly in common situations with, say, typical i386 boxes.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread David Chinner
On Tue, Jan 16, 2007 at 01:53:25PM -0800, Andrew Morton wrote:
> > On Mon, 15 Jan 2007 21:47:43 -0800 (PST) Christoph Lameter
> > <[EMAIL PROTECTED]> wrote:
> >
> > Currently cpusets are not able to do proper writeback since dirty ratio
> > calculations and writeback are all done for the system as a whole.
>
> We _do_ do proper writeback.  But it's less efficient than it might be, and
> there's an NFS problem.
>
> > This may result in a large percentage of a cpuset to become dirty without
> > writeout being triggered. Under NFS this can lead to OOM conditions.
>
> OK, a big question: is this patchset a performance improvement or a
> correctness fix?  Given the above, and the lack of benchmark results I'm
> assuming it's for correctness.

Given that we've already got a 25-30% buffered write performance
degradation between 2.6.18 and 2.6.20-rc4 for simple sequential
write I/O to multiple filesystems concurrently, I'd really like
to see some serious I/O performance regression testing on this
change before it goes anywhere.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread Christoph Lameter
On Tue, 16 Jan 2007, Andrew Morton wrote:

> It's a workaround for a still-unfixed NFS problem.

No, it's doing proper throttling. Without this patchset there will be *no*
writeback and throttling at all. F.e. let's say we have 20 nodes of 1G each
and a cpuset that only spans one node.

Then a process running in that cpuset can dirty all of memory and still
continue running without writeback continuing. The background dirty ratio
is at 10% and the dirty ratio at 40%. Neither of those boundaries can ever
be reached because the process will only ever be able to dirty memory on
one node which is 5%. There will be no throttling, no background
writeback, no blocking for dirty pages.

At some point we run into reclaim (possibly we have ~99% of the cpuset
dirty) and then we trigger writeout. Okay so if the filesystem / block 
device is robust enough and does not require memory allocations then we 
likely will survive that and do slow writeback page by page from the LRU.

Writeback is completely hosed for that situation. This patch restores
expected behavior in a cpuset (which is a form of system partition that 
should mirror the system as a whole). At 10% dirty we should start 
background writeback and at 40% we should block. If that is done then even 
fragile combinations of filesystem/block devices will work as they do 
without cpusets.


> > Yes we can fix these allocations by allowing processes to allocate from
> > other nodes. But then the container function of cpusets is no longer
> > there.
> But that's what your patch already does!

The patchset does not allow processes to allocate from other nodes than 
the current cpuset. There is no change as to the source of memory 
allocations.
 
> > NFS is okay as far as I can tell. Dirty throttling works fine in non-cpuset
> > environments because we throttle if 40% of memory becomes dirty or
> > under writeback.
>
> Repeat: NFS shouldn't go oom.  It should fail the allocation, recover, wait
> for existing IO to complete.  Back that up with a mempool for NFS requests
> and the problem is solved, I think?

AFAIK any filesystem/block device can go oom with the current broken
writeback; it just does a few allocations. It's a matter of hitting the
sweet spots.

> But we also can get into trouble if a *zone* is all-dirty.  Any solution to
> the cpuset problem should solve that problem too, no?

Nope. Why would a dirty zone pose a problem? The problem exists if you
cannot allocate more memory. If a cpuset contains a single node which is a
single zone then this patchset will also address that issue.

If we have multiple zones then other zones may still provide memory to 
continue (same as in UP).

> > Yes, but when we enter reclaim most of the pages of a zone may already be
> > dirty/writeback so we fail.
>
> No.  If the dirty limits become per-zone then no zone will ever have 40%
> dirty.

I am still confused as to why you would want per-zone dirty limits?

Let's say we have a cpuset with 4 nodes (thus 4 zones) and we are running
on the first node. Then we copy a large file to disk. Node local 
allocation means that we allocate from the first node. After we reach 40% 
of the node then we throttle? This is going to be a significant 
performance degradation since we can no longer use the memory of other 
nodes to buffer writeout.

> The obvious fix here is: when a zone hits 40% dirty, perform dirty-memory
> reduction in that zone, throttling the dirtying process.  I suspect this
> would work very badly in common situations with, say, typical i386 boxes.

Absolute crap. You can prototype that broken behavior with zone reclaim by 
the way. Just switch on writeback during zone reclaim and watch how memory 
on a cpuset is unused and how the system becomes slow as molasses.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread Andrew Morton
> On Tue, 16 Jan 2007 16:16:30 -0800 (PST) Christoph Lameter <[EMAIL PROTECTED]> wrote:
> On Tue, 16 Jan 2007, Andrew Morton wrote:
>
> > It's a workaround for a still-unfixed NFS problem.
>
> No, it's doing proper throttling. Without this patchset there will be *no*
> writeback and throttling at all. F.e. let's say we have 20 nodes of 1G each
> and a cpuset that only spans one node.
>
> Then a process running in that cpuset can dirty all of memory and still
> continue running without writeback continuing. The background dirty ratio
> is at 10% and the dirty ratio at 40%. Neither of those boundaries can ever
> be reached because the process will only ever be able to dirty memory on
> one node which is 5%. There will be no throttling, no background
> writeback, no blocking for dirty pages.
>
> At some point we run into reclaim (possibly we have ~99% of the cpuset
> dirty) and then we trigger writeout. Okay so if the filesystem / block
> device is robust enough and does not require memory allocations then we
> likely will survive that and do slow writeback page by page from the LRU.
>
> Writeback is completely hosed for that situation. This patch restores
> expected behavior in a cpuset (which is a form of system partition that
> should mirror the system as a whole). At 10% dirty we should start
> background writeback and at 40% we should block. If that is done then even
> fragile combinations of filesystem/block devices will work as they do
> without cpusets.

Nope.  You've completely omitted the little fact that we'll do writeback in
the offending zone off the LRU.  Slower, maybe.  But it should work and the
system should recover.  If it's not doing that (it isn't) then we should
fix it rather than avoiding it (by punting writeback over to pdflush).

Once that's fixed, if we determine that there are remaining and significant
performance issues then we can take a look at that.

 
> > > Yes we can fix these allocations by allowing processes to allocate from
> > > other nodes. But then the container function of cpusets is no longer
> > > there.
> > But that's what your patch already does!
>
> The patchset does not allow processes to allocate from other nodes than
> the current cpuset.

Yes it does.  It asks pdflush to perform writeback of the offending zone(s)
rather than (or as well as) doing it directly.  The only reason pdflush can
successfully do that is because pdflush can allocate its requests from other
zones.

 
> AFAIK any filesystem/block device can go oom with the current broken
> writeback; it just does a few allocations. It's a matter of hitting the
> sweet spots.

That shouldn't be possible, in theory.  Block IO is supposed to succeed if
*all memory in the machine is dirty*: the old
dirty-everything-with-MAP_SHARED-then-exit problem.  Lots of testing went
into that and it works.  It also failed on NFS although I thought that got
fixed a year or so ago.  Apparently not.

> > But we also can get into trouble if a *zone* is all-dirty.  Any solution to
> > the cpuset problem should solve that problem too, no?
>
> Nope. Why would a dirty zone pose a problem? The problem exists if you
> cannot allocate more memory.

Well one example would be a GFP_KERNEL allocation on a highmem machine in
which all of ZONE_NORMAL is dirty.

> If a cpuset contains a single node which is a
> single zone then this patchset will also address that issue.
>
> If we have multiple zones then other zones may still provide memory to
> continue (same as in UP).

Not if all the eligible zones are all-dirty.

> > > Yes, but when we enter reclaim most of the pages of a zone may already be
> > > dirty/writeback so we fail.
> >
> > No.  If the dirty limits become per-zone then no zone will ever have 40%
> > dirty.
>
> I am still confused as to why you would want per-zone dirty limits?

The need for that has yet to be demonstrated.  There _might_ be a problem,
but we need test cases and analyses to demonstrate that need.

Right now, what we have is an NFS bug.  How about we fix it, then
reevaluate the situation?

A good starting point would be to show us one of these oom-killer traces.

> Let's say we have a cpuset with 4 nodes (thus 4 zones) and we are running
> on the first node. Then we copy a large file to disk. Node local
> allocation means that we allocate from the first node. After we reach 40%
> of the node then we throttle? This is going to be a significant
> performance degradation since we can no longer use the memory of other
> nodes to buffer writeout.

That was what I was referring to.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread Christoph Lameter
On Tue, 16 Jan 2007, Andrew Morton wrote:

> Nope.  You've completely omitted the little fact that we'll do writeback in
> the offending zone off the LRU.  Slower, maybe.  But it should work and the
> system should recover.  If it's not doing that (it isn't) then we should
> fix it rather than avoiding it (by punting writeback over to pdflush).

pdflush is not running *at* all, nor is dirty throttling working. That is
correct behavior? We could do background writeback but we choose not to do
so? Instead we wait until we hit reclaim and then block (well, it seems
that we do not block; the blocking there also fails since we again check
global ratios)?

> > The patchset does not allow processes to allocate from other nodes than
> > the current cpuset.
>
> Yes it does.  It asks pdflush to perform writeback of the offending zone(s)
> rather than (or as well as) doing it directly.  The only reason pdflush can
> successfully do that is because pdflush can allocate its requests from other
> zones.

Ok pdflush is able to do that. Still the application is not able to 
extend its memory beyond the cpuset. What about writeback throttling? 
There it all breaks down. The cpuset is effective and we are unable to 
allocate any more memory. 

The reason this works is because not all of memory is dirty. Thus reclaim 
will be able to free up memory or there is enough memory free.

> > AFAIK any filesystem/block device can go oom with the current broken
> > writeback; it just does a few allocations. It's a matter of hitting the
> > sweet spots.
>
> That shouldn't be possible, in theory.  Block IO is supposed to succeed if
> *all memory in the machine is dirty*: the old
> dirty-everything-with-MAP_SHARED-then-exit problem.  Lots of testing went
> into that and it works.  It also failed on NFS although I thought that got
> fixed a year or so ago.  Apparently not.

Humm... Really?

> > Nope. Why would a dirty zone pose a problem? The problem exists if you
> > cannot allocate more memory.
>
> Well one example would be a GFP_KERNEL allocation on a highmem machine in
> which all of ZONE_NORMAL is dirty.

That is a restricted allocation which will lead to reclaim.

> > If we have multiple zones then other zones may still provide memory to
> > continue (same as in UP).
>
> Not if all the eligible zones are all-dirty.

They are all dirty if we do not throttle the dirty pages.

> Right now, what we have is an NFS bug.  How about we fix it, then
> reevaluate the situation?

The NFS bug only exists when using a cpuset. If you run NFS without 
cpusets then the throttling will kick in and everything is fine.

> A good starting point would be to show us one of these oom-killer traces.

No traces. Since the process is killed within a cpuset we only get 
messages like:

Nov 28 16:19:52 ic4 kernel: Out of Memory: Kill process 679783 (ncks) score 0 and children.
Nov 28 16:19:52 ic4 kernel: No available memory in cpuset: Killed process 679783 (ncks).
Nov 28 16:27:58 ic4 kernel: oom-killer: gfp_mask=0x200d2, order=0

Probably need to rerun these with some patches.

> > Let's say we have a cpuset with 4 nodes (thus 4 zones) and we are running
> > on the first node. Then we copy a large file to disk. Node local
> > allocation means that we allocate from the first node. After we reach 40%
> > of the node then we throttle? This is going to be a significant
> > performance degradation since we can no longer use the memory of other
> > nodes to buffer writeout.
>
> That was what I was referring to.

Note that this was describing the behavior you wanted, not the way things
work. Is it desired behavior not to use all the memory resources of the
cpuset and to slow down the system?


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread Andrew Morton
> On Tue, 16 Jan 2007 17:30:26 -0800 (PST) Christoph Lameter <[EMAIL PROTECTED]> wrote:
> > Nope.  You've completely omitted the little fact that we'll do writeback in
> > the offending zone off the LRU.  Slower, maybe.  But it should work and the
> > system should recover.  If it's not doing that (it isn't) then we should
> > fix it rather than avoiding it (by punting writeback over to pdflush).
>
> pdflush is not running *at* all, nor is dirty throttling working. That is
> correct behavior? We could do background writeback but we choose not to do
> so? Instead we wait until we hit reclaim and then block (well, it seems
> that we do not block; the blocking there also fails since we again check
> global ratios)?

I agree that it is a worthy objective to be able to constrain a cpuset's
dirty memory levels.  But as a performance optimisation and NOT as a
correctness fix.

Consider: non-exclusive cpuset A consists of mems 0-15, non-exclusive
cpuset B consists of mems 0-3.  A task running in cpuset A can freely dirty
all of cpuset B's memory.  A task running in cpuset B gets oomkilled.

Consider: a 32-node machine has nodes 0-3 full of dirty memory.  I create a
cpuset containing nodes 0-2 and start using it.  I get oomkilled.

There may be other scenarios.


IOW, we have a correctness problem, and we have a probable,
not-yet-demonstrated-and-quantified performance problem.  Fixing the latter
(in the proposed fashion) will *not* fix the former.

So what I suggest we do is to fix the NFS bug, then move on to considering
the performance problems.



On reflection, I agree that your proposed changes are sensible-looking for
addressing the probable, not-yet-demonstrated-and-quantified performance
problem.  The per-inode (should be per-address_space, maybe it is?) node
map is unfortunate.  Need to think about that a bit more.  For a start, it
should be dynamically allocated (from a new, purpose-created slab cache):
most in-core inodes don't have any dirty pages and don't need this
additional storage.

Also, I worry about the worst-case performance of that linear search across
the inodes.

But this is unrelated to the NFS bug ;)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread Christoph Lameter
On Tue, 16 Jan 2007, Andrew Morton wrote:

> Consider: non-exclusive cpuset A consists of mems 0-15, non-exclusive
> cpuset B consists of mems 0-3.  A task running in cpuset A can freely dirty
> all of cpuset B's memory.  A task running in cpuset B gets oomkilled.
>
> Consider: a 32-node machine has nodes 0-3 full of dirty memory.  I create a
> cpuset containing nodes 0-2 and start using it.  I get oomkilled.
>
> There may be other scenarios.

Yes this is the result of the hierarchical nature of cpusets which already
causes issues with the scheduler. It is rather typical that cpusets are
used to partition the memory and cpus. Overlapping cpusets seem to have
mainly an administrative function. Paul?

> So what I suggest we do is to fix the NFS bug, then move on to considering
> the performance problems.

The NFS bug has been there for ages and no one cares since write 
throttling works effectively. Since NFS can go via any network technology 
(f.e. infiniband) we have many potential issues at that point that depend 
on the underlying network technology. As far as I can recall we decided 
that these stacking issues are inherently problematic and basically 
unsolvable.

> On reflection, I agree that your proposed changes are sensible-looking for
> addressing the probable, not-yet-demonstrated-and-quantified performance
> problem.  The per-inode (should be per-address_space, maybe it is?) node

The address space is part of the inode. Some of my development versions had
the dirty_map in the address space. However, the end of the inode was a
convenient place for a runtime-sized nodemask.

> map is unfortunate.  Need to think about that a bit more.  For a start, it
> should be dynamically allocated (from a new, purpose-created slab cache):
> most in-core inodes don't have any dirty pages and don't need this
> additional storage.

We also considered such an approach. However, it creates the problem
of performing a slab allocation while dirtying pages. At that point we do
not have an allocation context, nor can we block.

> But this is unrelated to the NFS bug ;)

Looks more like a design issue (given its layering on top of the
networking layer) and not a bug. The bug surfaces when writeback is not
done properly. I wonder what happens if other filesystems are pushed to
the border of the dirty abyss.  The mmap tracking
fixes that were done in 2.6.19 were done because of similar symptoms
because the system's dirty tracking was off. This is fundamentally the
same issue showing up in a cpuset. So we should be able to produce the
hangs (looks like ... yes, another customer-reported issue on this one is that
reclaim is continually running and we basically livelock the system) that
we saw for the mmap dirty tracking issues in addition to the NFS problems
seen so far.

Memory allocation is required in most filesystem flush paths. If we cannot 
allocate memory then we cannot clean pages and thus we continue trying - 
Livelock. I still see this as a fundamental correctness issue in the 
kernel.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread Paul Jackson
> Yes this is the result of the hierarchical nature of cpusets which already
> causes issues with the scheduler. It is rather typical that cpusets are
> used to partition the memory and cpus. Overlapping cpusets seem to have
> mainly an administrative function. Paul?

The heavy weight tasks, which are expected to be applying serious memory
pressure (whether for data pages or dirty file pages), are usually in
non-overlapping cpusets, or sharing the same cpuset, but not partially
overlapping with, or a proper superset of, some other cpuset holding an
active job.

The higher level cpusets, such as the top cpuset, or the one deeded over
to the Batch Scheduler, are proper supersets of many other cpusets.  We
avoid putting anything heavy weight in those cpusets.

Sometimes of course a task turns out to be unexpectedly heavy weight.
But in that case, we're mostly interested in function (system keeps
running), not performance.

That is, if someone setup what Andrew described, with a job in a large
cpuset sucking up all available memory from one in a smaller, contained
cpuset, I don't think I'm tuning for optimum performance anymore.
Rather I'm just trying to keep the system running and keep unrelated
jobs unaffected while we dig our way out of the hole.  If the smaller
job OOM's, that's tough nuggies.  They asked for it.  Jobs in
-unrelated- (non-overlapping) cpusets should ride out the storm with
little or no impact on their performance.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread Andrew Morton
> On Tue, 16 Jan 2007 19:40:17 -0800 (PST) Christoph Lameter <[EMAIL PROTECTED]> wrote:
> On Tue, 16 Jan 2007, Andrew Morton wrote:
>
> > Consider: non-exclusive cpuset A consists of mems 0-15, non-exclusive
> > cpuset B consists of mems 0-3.  A task running in cpuset A can freely dirty
> > all of cpuset B's memory.  A task running in cpuset B gets oomkilled.
> >
> > Consider: a 32-node machine has nodes 0-3 full of dirty memory.  I create a
> > cpuset containing nodes 0-2 and start using it.  I get oomkilled.
> >
> > There may be other scenarios.
>
> Yes this is the result of the hierarchical nature of cpusets which already
> causes issues with the scheduler. It is rather typical that cpusets are
> used to partition the memory and cpus. Overlapping cpusets seem to have
> mainly an administrative function. Paul?

The typical usage scenarios don't matter a lot: the examples I gave show
that the core problem remains unsolved.  People can still hit the bug.

> > So what I suggest we do is to fix the NFS bug, then move on to considering
> > the performance problems.
>
> The NFS bug has been there for ages and no one cares since write
> throttling works effectively. Since NFS can go via any network technology
> (f.e. infiniband) we have many potential issues at that point that depend
> on the underlying network technology. As far as I can recall we decided
> that these stacking issues are inherently problematic and basically
> unsolvable.

The problem you refer to arises from the inability of the net driver to
allocate memory for an outbound ack.  Such allocations aren't constrained to
a cpuset.

I expect that we can solve the NFS oom problem along the same lines as
block devices.  Certainly it's dumb of us to oom-kill a process rather than
going off-cpuset for a small and short-lived allocation.  It's also dumb of
us to allocate a basically unbounded number of nfs requests rather than
waiting for some of the ones which we _have_ allocated to complete.


> > On reflection, I agree that your proposed changes are sensible-looking for
> > addressing the probable, not-yet-demonstrated-and-quantified performance
> > problem.  The per-inode (should be per-address_space, maybe it is?) node
>
> The address space is part of the inode.

Physically, yes.  Logically, it is not.  The address_space controls the
data-plane part of a file and is the appropriate place in which to store
this nodemask.

> Some of my development versions had
> the dirty_map in the address space. However, the end of the inode was a
> convenient place for a runtime-sized nodemask.
>
> > map is unfortunate.  Need to think about that a bit more.  For a start, it
> > should be dynamically allocated (from a new, purpose-created slab cache):
> > most in-core inodes don't have any dirty pages and don't need this
> > additional storage.
>
> We also considered such an approach. However, it creates the problem
> of performing a slab allocation while dirtying pages. At that point we do
> not have an allocation context, nor can we block.

Yes, it must be an atomic allocation.  If it fails, we don't care.  Chances
are it'll succeed when the next page in this address_space gets dirtied.

Plus we don't waste piles of memory on read-only files.
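
Putting those constraints together, a sketch of the lazy, atomic scheme
(dirty_map_cachep and the mapping->dirty_nodes field are assumptions here,
not existing code):

    /* Allocate the per-address_space dirty node map when the first page
     * is dirtied.  GFP_ATOMIC because we may hold locks here; a failed
     * or lost-race allocation is harmless, the next dirtying retries. */
    static void cpuset_mark_dirty_node(struct address_space *mapping,
                                       struct page *page)
    {
            nodemask_t *map = mapping->dirty_nodes;

            if (!map) {
                    map = kmem_cache_zalloc(dirty_map_cachep, GFP_ATOMIC);
                    if (!map)
                            return;         /* retry on next dirtying */
                    if (cmpxchg(&mapping->dirty_nodes, NULL, map)) {
                            /* lost the race; use the winner's map */
                            kmem_cache_free(dirty_map_cachep, map);
                            map = mapping->dirty_nodes;
                    }
            }
            node_set(page_to_nid(page), *map);
    }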

> > But this is unrelated to the NFS bug ;)
>
> Looks more like a design issue (given its layering on top of the
> networking layer) and not a bug. The bug surfaces when writeback is not
> done properly. I wonder what happens if other filesystems are pushed to
> the border of the dirty abyss.  The mmap tracking
> fixes that were done in 2.6.19 were done because of similar symptoms
> because the system's dirty tracking was off. This is fundamentally the
> same issue showing up in a cpuset. So we should be able to produce the
> hangs (looks like ... yes, another customer-reported issue on this one is that
> reclaim is continually running and we basically livelock the system) that
> we saw for the mmap dirty tracking issues in addition to the NFS problems
> seen so far.
>
> Memory allocation is required in most filesystem flush paths. If we cannot
> allocate memory then we cannot clean pages and thus we continue trying -
> Livelock. I still see this as a fundamental correctness issue in the
> kernel.

I'll believe all that once someone has got down and tried to fix NFS, and
has failed ;)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread Christoph Lameter
On Tue, 16 Jan 2007, Andrew Morton wrote:

> > Yes this is the result of the hierarchical nature of cpusets which already
> > causes issues with the scheduler. It is rather typical that cpusets are
> > used to partition the memory and cpus. Overlapping cpusets seem to have
> > mainly an administrative function. Paul?
>
> The typical usage scenarios don't matter a lot: the examples I gave show
> that the core problem remains unsolved.  People can still hit the bug.

I agree the overlap issue is a problem and I hope it can be addressed 
somehow for the rare cases in which such nesting takes place.

One easy solution may be to check the dirty ratio before engaging in
reclaim. If the dirty ratio is sufficiently high then trigger writeout via
pdflush (we already wake up pdflush while scanning and you already noted
that pdflush writeout is not occurring within the context of the current
cpuset) and pass over any dirty pages during LRU scans until some pages
have been cleaned up.
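A rough sketch of that check (cpuset_page_state(), NR_DIRTYABLE and wakeup_pdflush_nodes() are hypothetical names for this illustration, not existing API):

static int cpuset_dirty_exceeded(struct cpuset *cs)
{
	unsigned long dirty = cpuset_page_state(cs, NR_FILE_DIRTY);
	unsigned long dirtyable = cpuset_page_state(cs, NR_DIRTYABLE);

	return dirty * 100 > dirtyable * vm_dirty_ratio;
}

static void cpuset_writeback_throttle(struct cpuset *cs,
				      struct scan_control *sc)
{
	if (!cpuset_dirty_exceeded(cs))
		return;

	/* Kick background writeback at this cpuset's dirty inodes... */
	wakeup_pdflush_nodes(&cs->mems_allowed);
	/* ...and pass over dirty pages in this scan until some are clean. */
	sc->may_writepage = 0;
}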

This means we allow allocation of additional kernel memory outside of the 
cpuset while triggering writeout of inodes that have pages on the nodes 
of the cpuset. The memory directly used by the application is still 
limited. Just the temporary information needed for writeback is allocated 
outside.

Well, it still sounds somewhat like a hack. Any other ideas out there?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread Andrew Morton
On Tue, 16 Jan 2007 22:27:36 -0800 (PST) Christoph Lameter [EMAIL PROTECTED] wrote:
> On Tue, 16 Jan 2007, Andrew Morton wrote:
>
> > > Yes this is the result of the hierarchical nature of cpusets which already
> > > causes issues with the scheduler. It is rather typical that cpusets are
> > > used to partition the memory and cpus. Overlapping cpusets seem to have
> > > mainly an administrative function. Paul?
> >
> > The typical usage scenarios don't matter a lot: the examples I gave show
> > that the core problem remains unsolved.  People can still hit the bug.
>
> I agree the overlap issue is a problem and I hope it can be addressed
> somehow for the rare cases in which such nesting takes place.
>
> One easy solution may be to check the dirty ratio before engaging in
> reclaim. If the dirty ratio is sufficiently high then trigger writeout via
> pdflush (we already wake up pdflush while scanning and you already noted
> that pdflush writeout is not occurring within the context of the current
> cpuset) and pass over any dirty pages during LRU scans until some pages
> have been cleaned up.
>
> This means we allow allocation of additional kernel memory outside of the
> cpuset while triggering writeout of inodes that have pages on the nodes
> of the cpuset. The memory directly used by the application is still
> limited. Just the temporary information needed for writeback is allocated
> outside.

Gad.  None of that should be necessary.

> Well, it still sounds somewhat like a hack. Any other ideas out there?

Do what blockdevs do: limit the number of in-flight requests (Peter's
recent patch seems to be doing that for us) (perhaps only when PF_MEMALLOC
is in effect, to keep Trond happy) and implement a mempool for the NFS
request critical store.  Additionally:

- we might need to twiddle the NFS gfp_flags so it doesn't call the
  oom-killer on failure: just return NULL.

- consider going off-cpuset for critical allocations.  It's better than
  going oom.  A suitable implementation might be to ignore the caller's
  cpuset if PF_MEMALLOC.  Maybe put a WARN_ON_ONCE in there: we prefer that
  it not happen and we want to know when it does.
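To make that concrete, a sketch of what the PF_MEMALLOC escape could look like (the wrapper name and the exact hook point in the allocator's cpuset filter are assumptions, not an existing patch):

/*
 * Sketch only: an escape hatch in the allocator's cpuset check.
 */
static inline int cpuset_zone_allowed_critical(struct zone *z, gfp_t gfp_mask)
{
	if (cpuset_zone_allowed(z, gfp_mask))
		return 1;

	if (current->flags & PF_MEMALLOC) {
		WARN_ON_ONCE(1);	/* we prefer this never happens... */
		return 1;		/* ...but off-cpuset beats going oom */
	}

	return 0;
}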



btw, regarding the per-address_space node mask: I think we should free it
when the inode is clean (!mapping_tagged(PAGECACHE_TAG_DIRTY)).  Chances
are, the inode will be dirty for 30 seconds and in-core for hours.  We
might as well steal its nodemask storage and give it to the next file which
gets written to.  A suitable place to do all this is in
__mark_inode_dirty(I_DIRTY_PAGES), using inode_lock to protect
address_space.dirty_page_nodemask.
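For concreteness, a sketch of that lifetime rule (dirty_page_nodemask comes from the paragraph above; the helper names, the cache, and the kmem_cache_create form are assumptions for illustration):

static struct kmem_cache *nodemask_cachep;

void __init dirty_nodemask_cache_init(void)
{
	/* Size by the nodes possible on this boot, not MAX_NUMNODES. */
	nodemask_cachep = kmem_cache_create("dirty_nodemask",
			BITS_TO_LONGS(nr_node_ids) * sizeof(unsigned long),
			0, SLAB_PANIC, NULL, NULL);
}

/* Called from __mark_inode_dirty(I_DIRTY_PAGES) with inode_lock held. */
static void inode_attach_dirty_nodemask(struct inode *inode)
{
	struct address_space *mapping = inode->i_mapping;

	if (!mapping->dirty_page_nodemask)
		mapping->dirty_page_nodemask =
			kmem_cache_zalloc(nodemask_cachep, GFP_ATOMIC);
}

/* Called with inode_lock held once writeback finds the mapping clean. */
static void inode_detach_dirty_nodemask(struct inode *inode)
{
	struct address_space *mapping = inode->i_mapping;

	if (mapping->dirty_page_nodemask &&
	    !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
		kmem_cache_free(nodemask_cachep, mapping->dirty_page_nodemask);
		mapping->dirty_page_nodemask = NULL;
	}
}

That way a long-lived clean inode costs nothing, and the storage cycles to whichever files are actively being written.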
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/8] Cpuset aware writeback

2007-01-15 Thread Peter Zijlstra
On Mon, 2007-01-15 at 21:47 -0800, Christoph Lameter wrote:
> Currently cpusets are not able to do proper writeback since
> dirty ratio calculations and writeback are all done for the system
> as a whole. This may result in a large percentage of a cpuset
> to become dirty without writeout being triggered. Under NFS
> this can lead to OOM conditions.
> 
> Writeback will occur during the LRU scans. But such writeout
> is not effective since we write page by page and not in inode page
> order (regular writeback).
> 
> In order to fix the problem we first of all introduce a method to
> establish a map of nodes that contain dirty pages for each
> inode mapping.
> 
> Secondly we modify the dirty limit calculation to be based
> on the active cpuset.
> 
> If we are in a cpuset then we select only inodes for writeback
> that have pages on the nodes of the cpuset.
> 
> After we have the cpuset throttling in place we can then make
> further fixups:
> 
> A. We can do inode based writeout from direct reclaim
>    avoiding single page writes to the filesystem.
>
> B. We add a new counter NR_UNRECLAIMABLE that is subtracted
>    from the available pages in a node. This allows us to
>    accurately calculate the dirty ratio even if large portions
>    of the node have been allocated for huge pages or for
>    slab pages.

What about mlock'ed pages?

> There are a couple of points where some better ideas could be used:
> 
> 1. The nodemask expands the inode structure significantly if the
> architecture allows a high number of nodes. This is only an issue
> for IA64. For that platform we expand the inode structure by 128 bytes
> (to support 1024 nodes). The last patch attempts to address the issue
> by using the knowledge about the maximum possible number of nodes
> determined on bootup to shrink the nodemask.

Not the prettiest indeed, no ideas though.

> 2. The calculation of the per cpuset limits can require looping
> over a number of nodes which may bring the performance of get_dirty_limits
> near pre 2.6.18 performance (before the introduction of the ZVC counters)
> (only for cpuset based limit calculation). There is no way of keeping these
> counters per cpuset since cpusets may overlap.

Well, you gain functionality, you lose some runtime, sad but probably
worth it.

Otherwise it all looks good.

Acked-by: Peter Zijlstra <[EMAIL PROTECTED]>

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC 0/8] Cpuset aware writeback

2007-01-15 Thread Christoph Lameter
Currently cpusets are not able to do proper writeback since
dirty ratio calculations and writeback are all done for the system
as a whole. This may result in a large percentage of a cpuset
to become dirty without writeout being triggered. Under NFS
this can lead to OOM conditions.

Writeback will occur during the LRU scans. But such writeout
is not effective since we write page by page and not in inode page
order (regular writeback).

In order to fix the problem we first of all introduce a method to
establish a map of nodes that contain dirty pages for each
inode mapping.
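A minimal sketch of what that tracking amounts to (the field and helper names are assumed for this example, not taken from the posted patches):

/*
 * Whenever a page is dirtied, record its node in a nodemask hanging
 * off the mapping, so writeback can later pick out the inodes that
 * actually have dirty pages on a given cpuset's nodes.
 */
static inline void mapping_record_dirty_node(struct address_space *mapping,
					     struct page *page)
{
	nodemask_t *nodes = mapping->dirty_nodes;	/* assumed field */

	if (nodes)
		node_set(page_to_nid(page), *nodes);
}

/* Writeback can then skip inodes with no dirty pages in the cpuset: */
static inline int mapping_dirty_in_nodes(struct address_space *mapping,
					 const nodemask_t *cpuset_nodes)
{
	return !mapping->dirty_nodes ||
	       nodes_intersects(*mapping->dirty_nodes, *cpuset_nodes);
}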

Secondly we modify the dirty limit calculation to be based
on the active cpuset.

If we are in a cpuset then we select only inodes for writeback
that have pages on the nodes of the cpuset.

After we have the cpuset throttling in place we can then make
further fixups:

A. We can do inode based writeout from direct reclaim
   avoiding single page writes to the filesystem.

B. We add a new counter NR_UNRECLAIMABLE that is subtracted
   from the available pages in a node. This allows us to
   accurately calculate the dirty ratio even if large portions
   of the node have been allocated for huge pages or for
   slab pages.

There are a couple of points where some better ideas could be used:

1. The nodemask expands the inode structure significantly if the
architecture allows a high number of nodes. This is only an issue
for IA64. For that platform we expand the inode structure by 128 bytes
(to support 1024 nodes). The last patch attempts to address the issue
by using the knowledge about the maximum possible number of nodes
determined on bootup to shrink the nodemask.

2. The calculation of the per cpuset limits can require looping
over a number of nodes which may bring the performance of get_dirty_limits
near pre 2.6.18 performance (before the introduction of the ZVC counters)
(only for cpuset based limit calculation). There is no way of keeping these
counters per cpuset since cpusets may overlap.
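For illustration, a sketch of the loop whose cost point 2 worries about (a reconstruction for this discussion, not the posted patch; the counter names are approximate):

static unsigned long cpuset_dirtyable_memory(const nodemask_t *nodes)
{
	unsigned long total = 0;
	int node;

	/* Sum the per-node ZVC counters over the cpuset's nodes. */
	for_each_node_mask(node, *nodes) {
		total += node_page_state(node, NR_FREE_PAGES);
		total += node_page_state(node, NR_ACTIVE);
		total += node_page_state(node, NR_INACTIVE);
		/* Proposed counter: don't count huge-page/slab memory. */
		total -= node_page_state(node, NR_UNRECLAIMABLE);
	}
	return total;
}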

Paul probably needs to go through this and may want additional fixes to
keep things in harmony with cpusets.

Tested on:
IA64 NUMA 128p, 12p

Compiles on:
i386 SMP
x86_64 UP


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

