Re: [RFC 0/8] Cpuset aware writeback
On Sat, 21 Apr 2007, Ethan Solomita wrote:

> Exactly -- your patch should be consistent and do it the same way as
> whatever your patch is built against. Your patch is built against a kernel
> that subtracts off highmem. "Do it..." -- are you handing off the patch
> and are done with it?

Yes, as said before, the patch is not finished. As I told you, I have other
things to do right now. It is not high on my agenda, and some other
developers have shown an interest. Feel free to take over the patch.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 0/8] Cpuset aware writeback
Christoph Lameter wrote:
> On Fri, 20 Apr 2007, Ethan Solomita wrote:
> > cpuset_write_dirty_map.htm
> >
> > In __set_page_dirty_nobuffers() you always call
> > cpuset_update_dirty_nodes() but in __set_page_dirty_buffers() you call
> > it only if page->mapping is still set after locking. Is there a reason
> > for the difference? Also a question not about your patch: why do those
> > functions call __mark_inode_dirty() even if the dirty page has been
> > truncated and mapping == NULL?
>
> If page->mapping has been cleared then the page was removed from the
> mapping. __mark_inode_dirty just dirties the inode. If a truncation
> occurs then the inode was modified.

You didn't address the first half. Why do the buffers() and nobuffers()
variants act differently when calling cpuset_update_dirty_nodes()?

> > cpuset_write_throttle.htm
> >
> > I noticed that several lines have leading spaces. I didn't check if
> > other patches have the problem too.
>
> Maybe download the patches? How did those strange .htm endings get
> appended to the patches?

Something weird with Firefox, but instead of jumping on me did you consider
double-checking your patches? I just went back, found the text versions,
and the spaces are still there. E.g.:

+        unsigned long dirtyable_memory;

> > In get_dirty_limits(), when cpusets are configured you don't subtract
> > highmem the same way that is done without cpusets. Is this intentional?
>
> That is something in flux upstream. Linus changed it recently. Do it one
> way or the other.

Exactly -- your patch should be consistent and do it the same way as
whatever your patch is built against. Your patch is built against a kernel
that subtracts off highmem. "Do it..." -- are you handing off the patch and
are done with it?

> > It seems that dirty_exceeded is still a global punishment across
> > cpusets. Should it be addressed?
>
> Sure. It would be best if you could place that somehow in a cpuset.

Again, it sounds like you're handing them off. I'm not objecting; I just
hadn't understood that.

-- Ethan
Re: [RFC 0/8] Cpuset aware writeback
On Fri, 20 Apr 2007, Ethan Solomita wrote:

> cpuset_write_dirty_map.htm
>
> In __set_page_dirty_nobuffers() you always call cpuset_update_dirty_nodes()
> but in __set_page_dirty_buffers() you call it only if page->mapping is
> still set after locking. Is there a reason for the difference? Also a
> question not about your patch: why do those functions call
> __mark_inode_dirty() even if the dirty page has been truncated and
> mapping == NULL?

If page->mapping has been cleared then the page was removed from the
mapping. __mark_inode_dirty just dirties the inode. If a truncation occurs
then the inode was modified.

> cpuset_write_throttle.htm
>
> I noticed that several lines have leading spaces. I didn't check if other
> patches have the problem too.

Maybe download the patches? How did those strange .htm endings get appended
to the patches?

> In get_dirty_limits(), when cpusets are configured you don't subtract
> highmem the same way that is done without cpusets. Is this intentional?

That is something in flux upstream. Linus changed it recently. Do it one
way or the other.

> It seems that dirty_exceeded is still a global punishment across cpusets.
> Should it be addressed?

Sure. It would be best if you could place that somehow in a cpuset.
Re: [RFC 0/8] Cpuset aware writeback
Christoph Lameter wrote:
> H Sorry. I got distracted and I have sent them to Kame-san who was
> interested in working on them.
> I have placed the most recent version at
> http://ftp.kernel.org/pub/linux/kernel/people/christoph/cpuset_dirty

Hi Christoph -- a few comments on the patches:

cpuset_write_dirty_map.htm

In __set_page_dirty_nobuffers() you always call cpuset_update_dirty_nodes()
but in __set_page_dirty_buffers() you call it only if page->mapping is
still set after locking. Is there a reason for the difference? Also a
question not about your patch: why do those functions call
__mark_inode_dirty() even if the dirty page has been truncated and
mapping == NULL?

cpuset_write_throttle.htm

I noticed that several lines have leading spaces. I didn't check if other
patches have the problem too.

In get_dirty_limits(), when cpusets are configured you don't subtract
highmem the same way that is done without cpusets. Is this intentional?

It seems that dirty_exceeded is still a global punishment across cpusets.
Should it be addressed?

-- Ethan
Re: [RFC 0/8] Cpuset aware writeback
On Thu, 19 Apr 2007, Ethan Solomita wrote:

> > H Sorry. I got distracted and I have sent them to Kame-san who was
> > interested in working on them.
> > I have placed the most recent version at
> > http://ftp.kernel.org/pub/linux/kernel/people/christoph/cpuset_dirty
>
> Do you expect any conflicts with the per-bdi dirty throttling patches?

You would have to check that yourself. The need for cpuset aware writeback
is less due to writeback fixes to NFS. The per-bdi dirty throttling is
further reducing the need. The role of the cpuset aware writeback is simply
to implement measures to deal with the worst case scenarios.
Re: [RFC 0/8] Cpuset aware writeback
Christoph Lameter wrote:
> On Wed, 18 Apr 2007, Ethan Solomita wrote:
>
> > Any new ETA? I'm trying to decide whether to go back to your original
> > patches or wait for the new set. Adding new knobs isn't as important to
> > me as having something that fixes the core problem, so hopefully this
> > isn't waiting on them. They could always be patches on top of your core
> > patches.
> > -- Ethan
>
> H Sorry. I got distracted and I have sent them to Kame-san who was
> interested in working on them.
> I have placed the most recent version at
> http://ftp.kernel.org/pub/linux/kernel/people/christoph/cpuset_dirty

Do you expect any conflicts with the per-bdi dirty throttling patches?

-- Ethan
Re: [RFC 0/8] Cpuset aware writeback
On Wed, 18 Apr 2007, Ethan Solomita wrote:

> Any new ETA? I'm trying to decide whether to go back to your original
> patches or wait for the new set. Adding new knobs isn't as important to me
> as having something that fixes the core problem, so hopefully this isn't
> waiting on them. They could always be patches on top of your core patches.
>
> -- Ethan

H Sorry. I got distracted and I have sent them to Kame-san who was
interested in working on them.

I have placed the most recent version at
http://ftp.kernel.org/pub/linux/kernel/people/christoph/cpuset_dirty
Re: [RFC 0/8] Cpuset aware writeback
Christoph Lameter wrote:
> On Wed, 21 Mar 2007, Ethan Solomita wrote:
> > Christoph Lameter wrote:
> > > On Thu, 1 Feb 2007, Ethan Solomita wrote:
> > > > Hi Christoph -- has anything come of resolving the NFS / OOM
> > > > concerns that Andrew Morton expressed concerning the patch? I'd be
> > > > happy to see some progress on getting this patch (i.e. the one you
> > > > posted on 1/23) through.
> > > Peter Zijlstra addressed the NFS issue. I will submit the patch again
> > > as soon as the writeback code stabilizes a bit.
> > I'm pinging to see if this has gotten anywhere. Are you ready to
> > resubmit? Do we have the evidence to convince Andrew that the NFS
> > issues are resolved and so this patch won't obscure anything?
> The NFS patch went into Linus' tree a couple of days ago and I have a new
> version ready with additional support to set dirty ratios per cpuset.
> There is some interest in adding more VM controls to this patch. I hope I
> can post the next rev by tomorrow.

Any new ETA? I'm trying to decide whether to go back to your original
patches or wait for the new set. Adding new knobs isn't as important to me
as having something that fixes the core problem, so hopefully this isn't
waiting on them. They could always be patches on top of your core patches.

-- Ethan
Re: [RFC 0/8] Cpuset aware writeback
On Wed, 21 Mar 2007, Andrew Morton wrote:

> > The NFS patch went into Linus' tree a couple of days ago
>
> Did it fix the oom issues which you were observing?

Yes, it reduced the dirty ratios to reasonable numbers in a simple copy
operation that created large amounts of dirty pages before. The trouble now
is to check if the cpuset writeback patch still works correctly. Probably
have to turn off block device congestion checks somehow.
Re: [RFC 0/8] Cpuset aware writeback
On Wed, 21 Mar 2007 14:29:42 -0700 (PDT) Christoph Lameter
<[EMAIL PROTECTED]> wrote:

> On Wed, 21 Mar 2007, Ethan Solomita wrote:
> > Christoph Lameter wrote:
> > > On Thu, 1 Feb 2007, Ethan Solomita wrote:
> > > > Hi Christoph -- has anything come of resolving the NFS / OOM
> > > > concerns that Andrew Morton expressed concerning the patch? I'd be
> > > > happy to see some progress on getting this patch (i.e. the one you
> > > > posted on 1/23) through.
> > > Peter Zijlstra addressed the NFS issue. I will submit the patch again
> > > as soon as the writeback code stabilizes a bit.
> > I'm pinging to see if this has gotten anywhere. Are you ready to
> > resubmit? Do we have the evidence to convince Andrew that the NFS
> > issues are resolved and so this patch won't obscure anything?
> The NFS patch went into Linus' tree a couple of days ago

Did it fix the oom issues which you were observing?
Re: [RFC 0/8] Cpuset aware writeback
On Wed, 21 Mar 2007, Ethan Solomita wrote:

> Christoph Lameter wrote:
> > On Thu, 1 Feb 2007, Ethan Solomita wrote:
> > > Hi Christoph -- has anything come of resolving the NFS / OOM concerns
> > > that Andrew Morton expressed concerning the patch? I'd be happy to
> > > see some progress on getting this patch (i.e. the one you posted on
> > > 1/23) through.
> > Peter Zijlstra addressed the NFS issue. I will submit the patch again
> > as soon as the writeback code stabilizes a bit.
>
> I'm pinging to see if this has gotten anywhere. Are you ready to
> resubmit? Do we have the evidence to convince Andrew that the NFS issues
> are resolved and so this patch won't obscure anything?

The NFS patch went into Linus' tree a couple of days ago and I have a new
version ready with additional support to set dirty ratios per cpuset. There
is some interest in adding more VM controls to this patch. I hope I can
post the next rev by tomorrow.
Re: [RFC 0/8] Cpuset aware writeback
Christoph Lameter wrote:
> On Thu, 1 Feb 2007, Ethan Solomita wrote:
> > Hi Christoph -- has anything come of resolving the NFS / OOM concerns
> > that Andrew Morton expressed concerning the patch? I'd be happy to see
> > some progress on getting this patch (i.e. the one you posted on 1/23)
> > through.
> Peter Zijlstra addressed the NFS issue. I will submit the patch again as
> soon as the writeback code stabilizes a bit.

I'm pinging to see if this has gotten anywhere. Are you ready to resubmit?
Do we have the evidence to convince Andrew that the NFS issues are resolved
and so this patch won't obscure anything?

Thanks,
-- Ethan
Re: [RFC 0/8] Cpuset aware writeback
On Thu, 1 Feb 2007 21:29:06 -0800 (PST) Christoph Lameter
<[EMAIL PROTECTED]> wrote:

> On Thu, 1 Feb 2007, Andrew Morton wrote:
> > > Peter Zijlstra addressed the NFS issue.
> > Did he? Are you yet in a position to confirm that?
> He provided a solution to fix the congestion issue in NFS. I thought
> that is what you were looking for? That should make NFS behave more like
> a block device right?

We hope so. The cpuset-aware-writeback patches were explicitly written to
hide the bug which Peter's patches hopefully address. They hence remove our
best way of confirming that Peter's patches fix the problem which you've
observed in a proper fashion. Until we've confirmed that the NFS problem is
nailed, I wouldn't want to merge cpuset-aware-writeback. I'm hoping to be
able to do that with fake-numa on x86-64 but haven't got onto it yet.
Re: [RFC 0/8] Cpuset aware writeback
On Thursday February 1, [EMAIL PROTECTED] wrote:
> > The network stack is of course a different (much harder) problem.
>
> An NFS solution is possible without solving the network stack issue?

NFS is currently able to make more than max_dirty_ratio of memory
Dirty/Writeback without being effectively throttled. So it can use up way
more than it should and put pressure on the network stack. If NFS were
throttled like other block-based filesystems (which Peter's patch should
do), then there will normally be a lot more headroom and the network stack
will normally be able to cope.

There might still be situations where you can run out of memory to the
extent that NFS cannot make forward progress, but they will be
substantially less likely (I think you need lots of TCP streams with slow
consumers and fast producers so that TCP is forced to use up its reserves).

The block layer guarantees not to run out of memory. The network layer
makes a best effort as long as nothing goes crazy. NFS (currently) doesn't
do quite enough to stop things going crazy.

At least, that is my understanding.

NeilBrown
Re: [RFC 0/8] Cpuset aware writeback
On Fri, 2 Feb 2007, Neil Brown wrote:

> md/raid doesn't cause any problems here. It preallocates enough to be
> sure that it can always make forward progress. In general the entire
> block layer from generic_make_request down can always successfully write
> a block out in a reasonable amount of time without requiring kmalloc to
> succeed (with obvious exceptions like loop and nbd which go back up to a
> higher layer).

Hmmm... I wonder if that could be generalized. A device driver could make a
reservation by increasing min_free_kbytes? Additional drivers in a chain
could make additional reservations in such a way that enough memory is set
aside for the worst case?

> The network stack is of course a different (much harder) problem.

An NFS solution is possible without solving the network stack issue?
Re: [RFC 0/8] Cpuset aware writeback
On Thursday February 1, [EMAIL PROTECTED] wrote:
> The NFS problems also exist for non-cpuset scenarios and we have by and
> large been able to live with it so I think they are lower priority. It
> seems that the basic problem is created by the dirty ratios in a cpuset.

Some of our customers haven't been able to live with it. I'm really glad
this will soon be fixed in mainline as it means our somewhat less elegant
fix in SLES can go away :-)

> BTW the block layer also may be layered with raid and stuff and then we
> have similar issues. There is no general way so far of handling these
> situations except by twiddling around with min_free_kbytes, praying 5
> Hail Marys and trying again.

md/raid doesn't cause any problems here. It preallocates enough to be sure
that it can always make forward progress. In general the entire block layer
from generic_make_request down can always successfully write a block out in
a reasonable amount of time without requiring kmalloc to succeed (with
obvious exceptions like loop and nbd which go back up to a higher layer).

The network stack is of course a different (much harder) problem.

NeilBrown
Re: [RFC 0/8] Cpuset aware writeback
On Thu, 1 Feb 2007, Andrew Morton wrote:

> > Peter Zijlstra addressed the NFS issue.
>
> Did he? Are you yet in a position to confirm that?

He provided a solution to fix the congestion issue in NFS. I thought that
is what you were looking for? That should make NFS behave more like a block
device right?

As I said before I think NFS is inherently unfixable given the layering of
a block device on top of the network stack (which consists of an unknown
number of additional intermediate layers). Cpuset writeback needs to work
in the same way as in a machine without cpusets. If it fails then at least
let the cpuset behave as if we had a machine all on our own and fail in
both cases in the same way. Right now we create dangerous low memory
conditions due to high dirty ratios in a cpuset created by not having a
throttling method.

The NFS problems also exist for non-cpuset scenarios and we have by and
large been able to live with them, so I think they are lower priority. It
seems that the basic problem is created by the dirty ratios in a cpuset.

BTW the block layer also may be layered with raid and stuff and then we
have similar issues. There is no general way so far of handling these
situations except by twiddling around with min_free_kbytes, praying 5 Hail
Marys and trying again. Maybe we are able to allocate all needed memory
from PF_MEMALLOC processes during reclaim, and hopefully there is now
enough memory for these allocations and those that happen to occur during
an interrupt while we reclaim.
Re: [RFC 0/8] Cpuset aware writeback
On Thu, 1 Feb 2007 18:16:05 -0800 (PST) Christoph Lameter <[EMAIL PROTECTED]> wrote:
> On Thu, 1 Feb 2007, Ethan Solomita wrote:
> > Hi Christoph -- has anything come of resolving the NFS / OOM concerns
> > that Andrew Morton expressed concerning the patch? I'd be happy to see
> > some progress on getting this patch (i.e. the one you posted on 1/23)
> > through.
>
> Peter Zijlstra addressed the NFS issue.

Did he? Are you yet in a position to confirm that?
Re: [RFC 0/8] Cpuset aware writeback
On Thu, 1 Feb 2007, Ethan Solomita wrote:
> Hi Christoph -- has anything come of resolving the NFS / OOM concerns
> that Andrew Morton expressed concerning the patch? I'd be happy to see
> some progress on getting this patch (i.e. the one you posted on 1/23)
> through.

Peter Zijlstra addressed the NFS issue. I will submit the patch again as
soon as the writeback code stabilizes a bit.
Re: [RFC 0/8] Cpuset aware writeback
Hi Christoph -- has anything come of resolving the NFS / OOM concerns
that Andrew Morton expressed concerning the patch? I'd be happy to see
some progress on getting this patch (i.e. the one you posted on 1/23)
through.

Thanks,
-- Ethan
Re: [RFC 0/8] Cpuset aware writeback
On Fri, 2 Feb 2007, Neil Brown wrote:
> md/raid doesn't cause any problems here. It preallocates enough to be
> sure that it can always make forward progress.
>
> In general the entire block layer from generic_make_request down can
> always successfully write a block out in a reasonable amount of time
> without requiring kmalloc to succeed (with obvious exceptions like loop
> and nbd which go back up to a higher layer).

Hmmm... I wonder if that could be generalized. A device driver could
make a reservation by increasing min_free_kbytes? Additional drivers in
a chain could make additional reservations in such a way that enough
memory is set aside for the worst case?

> The network stack is of course a different (much harder) problem.

An NFS solution is possible without solving the network stack issue?
Re: [RFC 0/8] Cpuset aware writeback
On Thursday February 1, [EMAIL PROTECTED] wrote:
> > The network stack is of course a different (much harder) problem.
>
> An NFS solution is possible without solving the network stack issue?

NFS is currently able to make more than max_dirty_ratio of memory
Dirty/Writeback without being effectively throttled. So it can use up
way more than it should and put pressure on the network stack. If NFS
were throttled like other block-based filesystems (which Peter's patch
should do), then there will normally be a lot more head room and the
network stack will normally be able to cope.

There might still be situations where you can run out of memory to the
extent that NFS cannot make forward progress, but they will be
substantially less likely (I think you need lots of TCP streams with
slow consumers and fast producers, so that TCP is forced to use up its
reserves).

The block layer guarantees not to run out of memory. The network layer
makes a best effort as long as nothing goes crazy. NFS (currently)
doesn't do quite enough to stop things going crazy.

At least, that is my understanding.

NeilBrown
Re: [RFC 0/8] Cpuset aware writeback
On Thu, 1 Feb 2007 21:29:06 -0800 (PST) Christoph Lameter <[EMAIL PROTECTED]> wrote:
> On Thu, 1 Feb 2007, Andrew Morton wrote:
> > > Peter Zijlstra addressed the NFS issue.
> >
> > Did he? Are you yet in a position to confirm that?
>
> He provided a solution to fix the congestion issue in NFS. I thought
> that is what you were looking for? That should make NFS behave more
> like a block device right?

We hope so. The cpuset-aware-writeback patches were explicitly written
to hide the bug which Peter's patches hopefully address. They hence
remove our best way of confirming that Peter's patches fix the problem
which you've observed in a proper fashion.

Until we've confirmed that the NFS problem is nailed, I wouldn't want to
merge cpuset-aware-writeback. I'm hoping to be able to do that with
fake-numa on x86-64 but haven't got onto it yet.
Re: [RFC 0/8] Cpuset aware writeback
On Wed, 17 Jan 2007, Andrew Morton wrote:
> > The problem there is that we do a GFP_ATOMIC allocation (no
> > allocation context) that may fail when the first page is dirtied. We
> > must therefore be able to subsequently allocate the nodemask_t in
> > set_page_dirty(). Otherwise the first failure will mean that there
> > will never be a dirty map for the inode/mapping.
>
> True. But it's pretty simple to change __mark_inode_dirty() to fix this.

Ok, I tried it, but this won't work unless I also pass the page struct
pointer to __mark_inode_dirty(), since the dirty_nodes pointer could be
freed when the inode_lock is dropped. So I cannot dereference the
dirty_nodes pointer outside of __mark_inode_dirty.

If I expand __mark_inode_dirty then all variations of mark_inode_dirty()
need to be changed and we need to pass a page struct everywhere. This
results in extensive changes.

I think I need to stick with the tree_lock. This also makes more sense,
since we modify dirty information in the address_space structure and the
radix tree is already protected by that lock.
Re: [RFC 0/8] Cpuset aware writeback
On Wed, 17 Jan 2007 17:10:25 -0800 (PST) Christoph Lameter <[EMAIL PROTECTED]> wrote:
> On Wed, 17 Jan 2007, Andrew Morton wrote:
> > > The inode lock is not taken when the page is dirtied.
> >
> > The inode_lock is taken when the address_space's first page is
> > dirtied. It is also taken when the address_space's last dirty page is
> > cleaned. So the place where the inode is added to and removed from
> > sb->s_dirty is, I think, exactly the place where we want to attach
> > and detach address_space.dirty_page_nodemask.
>
> The problem there is that we do a GFP_ATOMIC allocation (no allocation
> context) that may fail when the first page is dirtied. We must
> therefore be able to subsequently allocate the nodemask_t in
> set_page_dirty(). Otherwise the first failure will mean that there will
> never be a dirty map for the inode/mapping.

True. But it's pretty simple to change __mark_inode_dirty() to fix this.
Re: [RFC 0/8] Cpuset aware writeback
On Wed, 17 Jan 2007, Andrew Morton wrote:
> > The inode lock is not taken when the page is dirtied.
>
> The inode_lock is taken when the address_space's first page is dirtied.
> It is also taken when the address_space's last dirty page is cleaned.
> So the place where the inode is added to and removed from sb->s_dirty
> is, I think, exactly the place where we want to attach and detach
> address_space.dirty_page_nodemask.

The problem there is that we do a GFP_ATOMIC allocation (no allocation
context) that may fail when the first page is dirtied. We must therefore
be able to subsequently allocate the nodemask_t in set_page_dirty().
Otherwise the first failure will mean that there will never be a dirty
map for the inode/mapping.
Re: [RFC 0/8] Cpuset aware writeback
On Wed, 17 Jan 2007 11:43:42 -0800 (PST) Christoph Lameter <[EMAIL PROTECTED]> wrote:
> On Tue, 16 Jan 2007, Andrew Morton wrote:
> > Do what blockdevs do: limit the number of in-flight requests (Peter's
> > recent patch seems to be doing that for us) (perhaps only when
> > PF_MEMALLOC is in effect, to keep Trond happy) and implement a
> > mempool for the NFS request critical store. Additionally:
> >
> > - we might need to twiddle the NFS gfp_flags so it doesn't call the
> >   oom-killer on failure: just return NULL.
> >
> > - consider going off-cpuset for critical allocations. It's better
> >   than going oom. A suitable implementation might be to ignore the
> >   caller's cpuset if PF_MEMALLOC. Maybe put a WARN_ON_ONCE in there:
> >   we prefer that it not happen and we want to know when it does.
>
> Given the intermediate layers (network, additional gizmos (ip over xxx)
> and the network cards) that will not be easy.

Paul has observed that it's already done. But it seems to not be working.

> > btw, regarding the per-address_space node mask: I think we should
> > free it when the inode is clean
> > (!mapping_tagged(PAGECACHE_TAG_DIRTY)). Chances are, the inode will
> > be dirty for 30 seconds and in-core for hours. We might as well steal
> > its nodemask storage and give it to the next file which gets written
> > to. A suitable place to do all this is in
> > __mark_inode_dirty(I_DIRTY_PAGES), using inode_lock to protect
> > address_space.dirty_page_nodemask.
>
> The inode lock is not taken when the page is dirtied.

The inode_lock is taken when the address_space's first page is dirtied.
It is also taken when the address_space's last dirty page is cleaned. So
the place where the inode is added to and removed from sb->s_dirty is, I
think, exactly the place where we want to attach and detach
address_space.dirty_page_nodemask.

> The tree_lock is already taken when the mapping is dirtied and so I
> used that to avoid races adding and removing pointers to nodemasks
> from the address space.
Re: [RFC 0/8] Cpuset aware writeback
On Tue, 16 Jan 2007, Andrew Morton wrote:
> Do what blockdevs do: limit the number of in-flight requests (Peter's
> recent patch seems to be doing that for us) (perhaps only when
> PF_MEMALLOC is in effect, to keep Trond happy) and implement a mempool
> for the NFS request critical store. Additionally:
>
> - we might need to twiddle the NFS gfp_flags so it doesn't call the
>   oom-killer on failure: just return NULL.
>
> - consider going off-cpuset for critical allocations. It's better than
>   going oom. A suitable implementation might be to ignore the caller's
>   cpuset if PF_MEMALLOC. Maybe put a WARN_ON_ONCE in there: we prefer
>   that it not happen and we want to know when it does.

Given the intermediate layers (network, additional gizmos (ip over xxx)
and the network cards) that will not be easy.

> btw, regarding the per-address_space node mask: I think we should free
> it when the inode is clean (!mapping_tagged(PAGECACHE_TAG_DIRTY)).
> Chances are, the inode will be dirty for 30 seconds and in-core for
> hours. We might as well steal its nodemask storage and give it to the
> next file which gets written to. A suitable place to do all this is in
> __mark_inode_dirty(I_DIRTY_PAGES), using inode_lock to protect
> address_space.dirty_page_nodemask.

The inode lock is not taken when the page is dirtied. The tree_lock is
already taken when the mapping is dirtied and so I used that to avoid
races adding and removing pointers to nodemasks from the address space.
Re: [RFC 0/8] Cpuset aware writeback
On Wed, 17 Jan 2007 00:01:58 -0800 Paul Jackson <[EMAIL PROTECTED]> wrote:
> Andrew wrote:
> > - consider going off-cpuset for critical allocations.
>
> We do ... in mm/page_alloc.c:
>
>	 * This is the last chance, in general, before the goto nopage.
>	 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
>	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
>	 */
>	page = get_page_from_freelist(gfp_mask, order, zonelist, alloc_flags);
>
> We also allow GFP_KERNEL requests to escape the current cpuset, to the
> nearest enclosing mem_exclusive cpuset, which is typically a big cpuset
> covering most of the system.

hrm. So how come NFS is getting oom-killings?

The oom-killer normally spews lots of useful stuff, including backtrace.
For some reason that's not coming out for Christoph. Log facility level,
perhaps?
Re: [RFC 0/8] Cpuset aware writeback
Andrew wrote:
> - consider going off-cpuset for critical allocations.

We do ... in mm/page_alloc.c:

	 * This is the last chance, in general, before the goto nopage.
	 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
	 */
	page = get_page_from_freelist(gfp_mask, order, zonelist, alloc_flags);

We also allow GFP_KERNEL requests to escape the current cpuset, to the
nearest enclosing mem_exclusive cpuset, which is typically a big cpuset
covering most of the system.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401
Re: [RFC 0/8] Cpuset aware writeback
On Tue, 16 Jan 2007 22:27:36 -0800 (PST) Christoph Lameter <[EMAIL PROTECTED]> wrote:
> On Tue, 16 Jan 2007, Andrew Morton wrote:
> > > Yes this is the result of the hierarchical nature of cpusets which
> > > already causes issues with the scheduler. It is rather typical that
> > > cpusets are used to partition the memory and cpus. Overlapping
> > > cpusets seem to have mainly an administrative function. Paul?
> >
> > The typical usage scenarios don't matter a lot: the examples I gave
> > show that the core problem remains unsolved. People can still hit
> > the bug.
>
> I agree the overlap issue is a problem and I hope it can be addressed
> somehow for the rare cases in which such nesting takes place.
>
> One easy solution may be to check the dirty ratio before engaging in
> reclaim. If the dirty ratio is sufficiently high then trigger writeout
> via pdflush (we already wake up pdflush while scanning and you already
> noted that pdflush writeout is not occurring within the context of the
> current cpuset) and pass over any dirty pages during LRU scans until
> some pages have been cleaned up.
>
> This means we allow allocation of additional kernel memory outside of
> the cpuset while triggering writeout of inodes that have pages on the
> nodes of the cpuset. The memory directly used by the application is
> still limited. Just the temporary information needed for writeback is
> allocated outside.

Gad. None of that should be necessary.

> Well, it still sounds somewhat like a hack. Any other ideas out there?

Do what blockdevs do: limit the number of in-flight requests (Peter's
recent patch seems to be doing that for us) (perhaps only when
PF_MEMALLOC is in effect, to keep Trond happy) and implement a mempool
for the NFS request critical store. Additionally:

- we might need to twiddle the NFS gfp_flags so it doesn't call the
  oom-killer on failure: just return NULL.

- consider going off-cpuset for critical allocations. It's better than
  going oom. A suitable implementation might be to ignore the caller's
  cpuset if PF_MEMALLOC. Maybe put a WARN_ON_ONCE in there: we prefer
  that it not happen and we want to know when it does.

btw, regarding the per-address_space node mask: I think we should free
it when the inode is clean (!mapping_tagged(PAGECACHE_TAG_DIRTY)).
Chances are, the inode will be dirty for 30 seconds and in-core for
hours. We might as well steal its nodemask storage and give it to the
next file which gets written to. A suitable place to do all this is in
__mark_inode_dirty(I_DIRTY_PAGES), using inode_lock to protect
address_space.dirty_page_nodemask.
Re: [RFC 0/8] Cpuset aware writeback
On Tue, 16 Jan 2007, Andrew Morton wrote:
> > Yes this is the result of the hierarchical nature of cpusets which already causes issues with the scheduler. It is rather typical that cpusets are used to partition the memory and cpus. Overlapping cpusets seem to have mainly an administrative function. Paul?
>
> The typical usage scenarios don't matter a lot: the examples I gave show that the core problem remains unsolved. People can still hit the bug.

I agree the overlap issue is a problem and I hope it can be addressed somehow for the rare cases in which such nesting takes place.

One easy solution may be to check the dirty ratio before engaging in reclaim. If the dirty ratio is sufficiently high then trigger writeout via pdflush (we already wake up pdflush while scanning, and you already noted that pdflush writeout is not occurring within the context of the current cpuset) and pass over any dirty pages during LRU scans until some pages have been cleaned up.

This means we allow allocation of additional kernel memory outside of the cpuset while triggering writeout of inodes that have pages on the nodes of the cpuset. The memory directly used by the application is still limited. Just the temporary information needed for writeback is allocated outside.

Well, it still sounds somewhat like a hack. Any other ideas out there?
Re: [RFC 0/8] Cpuset aware writeback
> On Tue, 16 Jan 2007 19:40:17 -0800 (PST) Christoph Lameter <[EMAIL PROTECTED]> wrote:
> On Tue, 16 Jan 2007, Andrew Morton wrote:
> > Consider: non-exclusive cpuset A consists of mems 0-15, non-exclusive cpuset B consists of mems 0-3. A task running in cpuset A can freely dirty all of cpuset B's memory. A task running in cpuset B gets oomkilled.
> >
> > Consider: a 32-node machine has nodes 0-3 full of dirty memory. I create a cpuset containing nodes 0-2 and start using it. I get oomkilled.
> >
> > There may be other scenarios.
>
> Yes this is the result of the hierarchical nature of cpusets which already causes issues with the scheduler. It is rather typical that cpusets are used to partition the memory and cpus. Overlapping cpusets seem to have mainly an administrative function. Paul?

The typical usage scenarios don't matter a lot: the examples I gave show that the core problem remains unsolved. People can still hit the bug.

> > So what I suggest we do is to fix the NFS bug, then move on to considering the performance problems.
>
> The NFS "bug" has been there for ages and no one cares since write throttling works effectively. Since NFS can go via any network technology (f.e. infiniband) we have many potential issues at that point that depend on the underlying network technology. As far as I can recall we decided that these stacking issues are inherently problematic and basically unsolvable.

The problem you refer to arises from the inability of the net driver to allocate memory for an outbound ack. Such allocations aren't constrained to a cpuset. I expect that we can solve the NFS oom problem along the same lines as block devices. Certainly it's dumb of us to oom-kill a process rather than going off-cpuset for a small and short-lived allocation. It's also dumb of us to allocate a basically unbounded number of nfs requests rather than waiting for some of the ones which we _have_ allocated to complete.
> > On reflection, I agree that your proposed changes are sensible-looking for addressing the probable, not-yet-demonstrated-and-quantified performance problem. The per-inode (should be per-address_space, maybe it is?) node
>
> The address space is part of the inode.

Physically, yes. Logically, it is not. The address_space controls the data-plane part of a file and is the appropriate place in which to store this nodemask.

> Some of my development versions put the dirty_map in the address space. However, the end of the inode was a convenient place for a runtime-sized nodemask.
>
> > map is unfortunate. Need to think about that a bit more. For a start, it should be dynamically allocated (from a new, purpose-created slab cache): most in-core inodes don't have any dirty pages and don't need this additional storage.
>
> We also considered such an approach. However, it creates the problem of performing a slab allocation while dirtying pages. At that point we do not have an allocation context, nor can we block.

Yes, it must be an atomic allocation. If it fails, we don't care. Chances are it'll succeed when the next page in this address_space gets dirtied. Plus we don't waste piles of memory on read-only files.

> > But this is unrelated to the NFS bug ;)
>
> Looks more like a design issue (given its layering on top of the networking layer) and not a bug. The "bug" surfaces when writeback is not done properly. I wonder what happens if other filesystems are pushed to the border of the dirty abyss. The mmap tracking fixes that were done in 2.6.19 were done because of similar symptoms, because the system's dirty tracking was off. This is fundamentally the same issue showing up in a cpuset. So we should be able to produce the hangs (looks ... yes, another customer-reported issue on this one is that reclaim is continually running and we basically livelock the system) that we saw for the mmap dirty tracking issues, in addition to the NFS problems seen so far.
>
> Memory allocation is required in most filesystem flush paths. If we cannot allocate memory then we cannot clean pages and thus we continue trying -> livelock. I still see this as a fundamental correctness issue in the kernel.

I'll believe all that once someone has got down and tried to fix NFS, and has failed ;)
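The allocation scheme discussed above (allocate the per-address_space dirty nodemask atomically on the first dirtied page, tolerate failure, and free the mask again once the mapping is clean) can be sketched in userspace. This is only a model of the idea, not code from the patchset; the struct and function names are illustrative.

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace model: the dirty-node mask is allocated lazily when a page
 * is first dirtied and released once the mapping is clean again.  In
 * the kernel the allocation would be GFP_ATOMIC; a failure is harmless
 * because the next dirtied page retries it. */

struct mapping {
    unsigned long *dirty_nodes;   /* NULL while the file is clean */
    unsigned long nr_dirty;
};

/* Called when a page residing on 'node' is dirtied. */
static int mapping_mark_dirty(struct mapping *m, int node)
{
    if (!m->dirty_nodes) {
        m->dirty_nodes = calloc(1, sizeof(unsigned long));
        if (!m->dirty_nodes)
            return 0;             /* atomic alloc failed: retry later */
    }
    *m->dirty_nodes |= 1UL << node;
    m->nr_dirty++;
    return 1;
}

/* Called when the last dirty page is cleaned: give the nodemask storage
 * back, mirroring "free it when the inode is clean". */
static void mapping_clean(struct mapping *m)
{
    free(m->dirty_nodes);
    m->dirty_nodes = NULL;
    m->nr_dirty = 0;
}

static int mapping_has_dirty_on(const struct mapping *m, int node)
{
    return m->dirty_nodes && ((*m->dirty_nodes >> node) & 1UL);
}
```

Read-only files never pay for the mask: it only exists between the first dirtying and the final cleaning of the mapping.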
Re: [RFC 0/8] Cpuset aware writeback
> Yes this is the result of the hierarchical nature of cpusets which already causes issues with the scheduler. It is rather typical that cpusets are used to partition the memory and cpus. Overlapping cpusets seem to have mainly an administrative function. Paul?

The heavyweight tasks, which are expected to be applying serious memory pressure (whether for data pages or dirty file pages), are usually in non-overlapping cpusets, or sharing the same cpuset, but not partially overlapping with, or a proper superset of, some other cpuset holding an active job.

The higher level cpusets, such as the top cpuset, or the one deeded over to the Batch Scheduler, are proper supersets of many other cpusets. We avoid putting anything heavyweight in those cpusets.

Sometimes of course a task turns out to be unexpectedly heavyweight. But in that case, we're mostly interested in function (the system keeps running), not performance. That is, if someone set up what Andrew described, with a job in a large cpuset sucking up all available memory from one in a smaller, contained cpuset, I don't think I'm tuning for optimum performance anymore. Rather I'm just trying to keep the system running and keep unrelated jobs unaffected while we dig our way out of the hole. If the smaller job OOMs, that's tough nuggies. They asked for it.

Jobs in -unrelated- (non-overlapping) cpusets should ride out the storm with little or no impact on their performance.

-- 
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401
Re: [RFC 0/8] Cpuset aware writeback
On Tue, 16 Jan 2007, Andrew Morton wrote:
> Consider: non-exclusive cpuset A consists of mems 0-15, non-exclusive cpuset B consists of mems 0-3. A task running in cpuset A can freely dirty all of cpuset B's memory. A task running in cpuset B gets oomkilled.
>
> Consider: a 32-node machine has nodes 0-3 full of dirty memory. I create a cpuset containing nodes 0-2 and start using it. I get oomkilled.
>
> There may be other scenarios.

Yes this is the result of the hierarchical nature of cpusets which already causes issues with the scheduler. It is rather typical that cpusets are used to partition the memory and cpus. Overlapping cpusets seem to have mainly an administrative function. Paul?

> So what I suggest we do is to fix the NFS bug, then move on to considering the performance problems.

The NFS "bug" has been there for ages and no one cares since write throttling works effectively. Since NFS can go via any network technology (f.e. infiniband) we have many potential issues at that point that depend on the underlying network technology. As far as I can recall we decided that these stacking issues are inherently problematic and basically unsolvable.

> On reflection, I agree that your proposed changes are sensible-looking for addressing the probable, not-yet-demonstrated-and-quantified performance problem. The per-inode (should be per-address_space, maybe it is?) node

The address space is part of the inode. Some of my development versions put the dirty_map in the address space. However, the end of the inode was a convenient place for a runtime-sized nodemask.

> map is unfortunate. Need to think about that a bit more. For a start, it should be dynamically allocated (from a new, purpose-created slab cache): most in-core inodes don't have any dirty pages and don't need this additional storage.

We also considered such an approach. However, it creates the problem of performing a slab allocation while dirtying pages. At that point we do not have an allocation context, nor can we block.

> But this is unrelated to the NFS bug ;)

Looks more like a design issue (given its layering on top of the networking layer) and not a bug. The "bug" surfaces when writeback is not done properly. I wonder what happens if other filesystems are pushed to the border of the dirty abyss. The mmap tracking fixes that were done in 2.6.19 were done because of similar symptoms, because the system's dirty tracking was off. This is fundamentally the same issue showing up in a cpuset. So we should be able to produce the hangs (looks ... yes, another customer-reported issue on this one is that reclaim is continually running and we basically livelock the system) that we saw for the mmap dirty tracking issues, in addition to the NFS problems seen so far.

Memory allocation is required in most filesystem flush paths. If we cannot allocate memory then we cannot clean pages and thus we continue trying -> livelock. I still see this as a fundamental correctness issue in the kernel.
Re: [RFC 0/8] Cpuset aware writeback
> On Tue, 16 Jan 2007 17:30:26 -0800 (PST) Christoph Lameter <[EMAIL PROTECTED]> wrote:
> > Nope. You've completely omitted the little fact that we'll do writeback in the offending zone off the LRU. Slower, maybe. But it should work and the system should recover. If it's not doing that (it isn't) then we should fix it rather than avoiding it (by punting writeback over to pdflush).
>
> pdflush is not running *at all*, nor is dirty throttling working. Is that correct behavior? We could do background writeback but we choose not to do so? Instead we wait until we hit reclaim and then block (well, it seems that we do not block; the blocking there also fails since we again check global ratios)?

I agree that it is a worthy objective to be able to constrain a cpuset's dirty memory levels. But as a performance optimisation and NOT as a correctness fix.

Consider: non-exclusive cpuset A consists of mems 0-15, non-exclusive cpuset B consists of mems 0-3. A task running in cpuset A can freely dirty all of cpuset B's memory. A task running in cpuset B gets oomkilled.

Consider: a 32-node machine has nodes 0-3 full of dirty memory. I create a cpuset containing nodes 0-2 and start using it. I get oomkilled.

There may be other scenarios.

IOW, we have a correctness problem, and we have a probable, not-yet-demonstrated-and-quantified performance problem. Fixing the latter (in the proposed fashion) will *not* fix the former.

So what I suggest we do is to fix the NFS bug, then move on to considering the performance problems.

On reflection, I agree that your proposed changes are sensible-looking for addressing the probable, not-yet-demonstrated-and-quantified performance problem. The per-inode (should be per-address_space, maybe it is?) node map is unfortunate. Need to think about that a bit more. For a start, it should be dynamically allocated (from a new, purpose-created slab cache): most in-core inodes don't have any dirty pages and don't need this additional storage. Also, I worry about the worst-case performance of that linear search across the inodes.

But this is unrelated to the NFS bug ;)
Re: [RFC 0/8] Cpuset aware writeback
On Tue, 16 Jan 2007, Andrew Morton wrote:
> Nope. You've completely omitted the little fact that we'll do writeback in the offending zone off the LRU. Slower, maybe. But it should work and the system should recover. If it's not doing that (it isn't) then we should fix it rather than avoiding it (by punting writeback over to pdflush).

pdflush is not running *at all*, nor is dirty throttling working. Is that correct behavior? We could do background writeback but we choose not to do so? Instead we wait until we hit reclaim and then block (well, it seems that we do not block; the blocking there also fails since we again check global ratios)?

> > The patchset does not allow processes to allocate from other nodes than the current cpuset.
>
> Yes it does. It asks pdflush to perform writeback of the offending zone(s) rather than (or as well as) doing it directly. The only reason pdflush can successfully do that is because pdflush can allocate its requests from other zones.

Ok, pdflush is able to do that. Still the application is not able to extend its memory beyond the cpuset. What about writeback throttling? There it all breaks down. The cpuset is effective and we are unable to allocate any more memory. The reason this works is because not all of memory is dirty. Thus reclaim will be able to free up memory, or there is enough memory free.

> > AFAIK any filesystem/block device can go oom with the current broken writeback; it just does a few allocations. It's a matter of hitting the sweet spots.
>
> That shouldn't be possible, in theory. Block IO is supposed to succeed if *all memory in the machine is dirty*: the old dirty-everything-with-MAP_SHARED-then-exit problem. Lots of testing went into that and it works. It also failed on NFS although I thought that got "fixed" a year or so ago. Apparently not.

Humm... Really?

> > Nope. Why would a dirty zone pose a problem? The problem exists if you cannot allocate more memory.
>
> Well one example would be a GFP_KERNEL allocation on a highmem machine in which all of ZONE_NORMAL is dirty.

That is a restricted allocation which will lead to reclaim.

> > If we have multiple zones then other zones may still provide memory to continue (same as in UP).
>
> Not if all the eligible zones are all-dirty.

They are all dirty if we do not throttle the dirty pages.

> Right now, what we have is an NFS bug. How about we fix it, then reevaluate the situation?

The "NFS bug" only exists when using a cpuset. If you run NFS without cpusets then the throttling will kick in and everything is fine.

> A good starting point would be to show us one of these oom-killer traces.

No traces. Since the process is killed within a cpuset we only get messages like:

Nov 28 16:19:52 ic4 kernel: Out of Memory: Kill process 679783 (ncks) score 0 and children.
Nov 28 16:19:52 ic4 kernel: No available memory in cpuset: Killed process 679783 (ncks).
Nov 28 16:27:58 ic4 kernel: oom-killer: gfp_mask=0x200d2, order=0

Probably need to rerun these with some patches.

> > Let's say we have a cpuset with 4 nodes (thus 4 zones) and we are running on the first node. Then we copy a large file to disk. Node-local allocation means that we allocate from the first node. After we reach 40% of the node then we throttle? This is going to be a significant performance degradation since we can no longer use the memory of other nodes to buffer writeout.
>
> That was what I was referring to.

Note that this was describing the behavior you wanted, not the way things work. Is it desired behavior not to use all the memory resources of the cpuset and to slow down the system?
Re: [RFC 0/8] Cpuset aware writeback
> On Tue, 16 Jan 2007 16:16:30 -0800 (PST) Christoph Lameter <[EMAIL PROTECTED]> wrote:
> On Tue, 16 Jan 2007, Andrew Morton wrote:
> > It's a workaround for a still-unfixed NFS problem.
>
> No, it's doing proper throttling. Without this patchset there will be *no* writeback and throttling at all. F.e. let's say we have 20 nodes of 1G each and a cpuset that only spans one node.
>
> Then a process running in that cpuset can dirty all of memory and still continue running without writeback continuing. The background dirty ratio is at 10% and the dirty ratio at 40%. Neither of those boundaries can ever be reached because the process will only ever be able to dirty memory on one node, which is 5%. There will be no throttling, no background writeback, no blocking for dirty pages.
>
> At some point we run into reclaim (possibly we have ~99% of the cpuset dirty) and then we trigger writeout. Okay, so if the filesystem / block device is robust enough and does not require memory allocations then we likely will survive that and do slow writeback page by page from the LRU.
>
> Writeback is completely hosed for that situation. This patch restores expected behavior in a cpuset (which is a form of system partition that should mirror the system as a whole). At 10% dirty we should start background writeback and at 40% we should block. If that is done then even fragile combinations of filesystem/block devices will work as they do without cpusets.

Nope. You've completely omitted the little fact that we'll do writeback in the offending zone off the LRU. Slower, maybe. But it should work and the system should recover. If it's not doing that (it isn't) then we should fix it rather than avoiding it (by punting writeback over to pdflush). Once that's fixed, if we determine that there are remaining and significant performance issues then we can take a look at that.

> > > Yes we can fix these allocations by allowing processes to allocate from other nodes. But then the container function of cpusets is no longer there.
> >
> > But that's what your patch already does!
>
> The patchset does not allow processes to allocate from other nodes than the current cpuset.

Yes it does. It asks pdflush to perform writeback of the offending zone(s) rather than (or as well as) doing it directly. The only reason pdflush can successfully do that is because pdflush can allocate its requests from other zones.

> AFAIK any filesystem/block device can go oom with the current broken writeback; it just does a few allocations. It's a matter of hitting the sweet spots.

That shouldn't be possible, in theory. Block IO is supposed to succeed if *all memory in the machine is dirty*: the old dirty-everything-with-MAP_SHARED-then-exit problem. Lots of testing went into that and it works. It also failed on NFS although I thought that got "fixed" a year or so ago. Apparently not.

> > But we also can get into trouble if a *zone* is all-dirty. Any solution to the cpuset problem should solve that problem too, no?
>
> Nope. Why would a dirty zone pose a problem? The problem exists if you cannot allocate more memory.

Well, one example would be a GFP_KERNEL allocation on a highmem machine in which all of ZONE_NORMAL is dirty.

> If a cpuset contains a single node which is a single zone then this patchset will also address that issue.
>
> If we have multiple zones then other zones may still provide memory to continue (same as in UP).

Not if all the eligible zones are all-dirty.

> > > Yes, but when we enter reclaim most of the pages of a zone may already be dirty/writeback so we fail.
> >
> > No. If the dirty limits become per-zone then no zone will ever have >40% dirty.
>
> I am still confused as to why you would want per-zone dirty limits?

The need for that has yet to be demonstrated. There _might_ be a problem, but we need test cases and analyses to demonstrate that need.

Right now, what we have is an NFS bug. How about we fix it, then reevaluate the situation?

A good starting point would be to show us one of these oom-killer traces.

> Let's say we have a cpuset with 4 nodes (thus 4 zones) and we are running on the first node. Then we copy a large file to disk. Node-local allocation means that we allocate from the first node. After we reach 40% of the node then we throttle? This is going to be a significant performance degradation since we can no longer use the memory of other nodes to buffer writeout.

That was what I was referring to.
Re: [RFC 0/8] Cpuset aware writeback
On Tue, 16 Jan 2007, Andrew Morton wrote:
> It's a workaround for a still-unfixed NFS problem.

No, it's doing proper throttling. Without this patchset there will be *no* writeback and throttling at all. F.e. let's say we have 20 nodes of 1G each and a cpuset that only spans one node.

Then a process running in that cpuset can dirty all of memory and still continue running without writeback continuing. The background dirty ratio is at 10% and the dirty ratio at 40%. Neither of those boundaries can ever be reached because the process will only ever be able to dirty memory on one node, which is 5%. There will be no throttling, no background writeback, no blocking for dirty pages.

At some point we run into reclaim (possibly we have ~99% of the cpuset dirty) and then we trigger writeout. Okay, so if the filesystem / block device is robust enough and does not require memory allocations then we likely will survive that and do slow writeback page by page from the LRU.

Writeback is completely hosed for that situation. This patch restores expected behavior in a cpuset (which is a form of system partition that should mirror the system as a whole). At 10% dirty we should start background writeback and at 40% we should block. If that is done then even fragile combinations of filesystem/block devices will work as they do without cpusets.

> > Yes we can fix these allocations by allowing processes to allocate from other nodes. But then the container function of cpusets is no longer there.
>
> But that's what your patch already does!

The patchset does not allow processes to allocate from other nodes than the current cpuset. There is no change as to the source of memory allocations.

> > NFS is okay as far as I can tell. Dirty throttling works fine in non-cpuset environments because we throttle if 40% of memory becomes dirty or under writeback.
>
> Repeat: NFS shouldn't go oom. It should fail the allocation, recover, and wait for existing IO to complete. Back that up with a mempool for NFS requests and the problem is solved, I think?

AFAIK any filesystem/block device can go oom with the current broken writeback; it just does a few allocations. It's a matter of hitting the sweet spots.

> But we also can get into trouble if a *zone* is all-dirty. Any solution to the cpuset problem should solve that problem too, no?

Nope. Why would a dirty zone pose a problem? The problem exists if you cannot allocate more memory. If a cpuset contains a single node which is a single zone then this patchset will also address that issue.

If we have multiple zones then other zones may still provide memory to continue (same as in UP).

> > Yes, but when we enter reclaim most of the pages of a zone may already be dirty/writeback so we fail.
>
> No. If the dirty limits become per-zone then no zone will ever have >40% dirty.

I am still confused as to why you would want per-zone dirty limits?

Let's say we have a cpuset with 4 nodes (thus 4 zones) and we are running on the first node. Then we copy a large file to disk. Node-local allocation means that we allocate from the first node. After we reach 40% of the node then we throttle? This is going to be a significant performance degradation since we can no longer use the memory of other nodes to buffer writeout.

> The obvious fix here is: when a zone hits 40% dirty, perform dirty-memory reduction in that zone, throttling the dirtying process. I suspect this would work very badly in common situations with, say, typical i386 boxes.

Absolute crap. You can prototype that broken behavior with zone reclaim, by the way. Just switch on writeback during zone reclaim and watch how memory on a cpuset goes unused and how the system becomes slow as molasses.
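The 20-node example above is plain arithmetic, and it can be made concrete with a small userspace model. This is only a sketch of the calculation being argued about, not code from the patchset, and the function names are illustrative.

```c
#include <assert.h>

/* Model of the dirty-limit arithmetic: global limits are a percentage
 * of total memory, so a task confined to a small cpuset can never reach
 * them; computing the limit over the cpuset's own pages restores
 * throttling.  Illustrative code only. */

/* Highest fraction of total memory (in percent) that a task confined to
 * 'cpuset_pages' can ever dirty. */
static int max_global_dirty_pct(long cpuset_pages, long total_pages)
{
    return (int)(cpuset_pages * 100 / total_pages);
}

/* Dirty threshold in pages, evaluated over 'base_pages': the whole
 * machine for the stock kernel, the cpuset with the patchset applied. */
static long dirty_limit(long base_pages, int ratio_pct)
{
    return base_pages * ratio_pct / 100;
}
```

With 20 nodes of 262144 pages each and a one-node cpuset, the task tops out at 5% of global memory, below both the 10% background and 40% throttle limits; a per-cpuset limit of 40% of the node does trigger.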
Re: [RFC 0/8] Cpuset aware writeback
On Tue, Jan 16, 2007 at 01:53:25PM -0800, Andrew Morton wrote:
> > On Mon, 15 Jan 2007 21:47:43 -0800 (PST) Christoph Lameter <[EMAIL PROTECTED]> wrote:
> > > Currently cpusets are not able to do proper writeback since dirty ratio calculations and writeback are all done for the system as a whole.
> >
> > We _do_ do proper writeback. But it's less efficient than it might be, and there's an NFS problem.
> >
> > > This may result in a large percentage of a cpuset becoming dirty without writeout being triggered. Under NFS this can lead to OOM conditions.
> >
> > OK, a big question: is this patchset a performance improvement or a correctness fix? Given the above, and the lack of benchmark results, I'm assuming it's for correctness.

Given that we've already got a 25-30% buffered write performance degradation between 2.6.18 and 2.6.20-rc4 for simple sequential write I/O to multiple filesystems concurrently, I'd really like to see some serious I/O performance regression testing on this change before it goes anywhere.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: [RFC 0/8] Cpuset aware writeback
> On Tue, 16 Jan 2007 14:15:56 -0800 (PST) Christoph Lameter <[EMAIL PROTECTED]> wrote:
> > > ...
> > > This may result in a large percentage of a cpuset becoming dirty without writeout being triggered. Under NFS this can lead to OOM conditions.
> >
> > OK, a big question: is this patchset a performance improvement or a correctness fix? Given the above, and the lack of benchmark results, I'm assuming it's for correctness.
>
> It is a correctness fix, both for NFS OOM and for doing proper cpuset writeout.

It's a workaround for a still-unfixed NFS problem.

> > - Why does NFS go oom? Because it allocates potentially-unbounded numbers of requests in the writeback path?
> >
> > It was able to go oom on non-numa machines before dirty-page-tracking went in. So a general problem has now become specific to some NUMA setups.
>
> Right. The issue is that large portions of memory become dirty / writeback since no writeback occurs because dirty limits are not checked for a cpuset. Then NFS attempts to writeout when doing LRU scans but is unable to allocate memory.
>
> > So an obvious, equivalent and vastly simpler "fix" would be to teach the NFS client to go off-cpuset when trying to allocate these requests.
>
> Yes we can fix these allocations by allowing processes to allocate from other nodes. But then the container function of cpusets is no longer there.

But that's what your patch already does! It asks pdflush to write the pages instead of the direct-reclaim caller. The only reason pdflush doesn't go oom is that pdflush lives outside the direct-reclaim caller's cpuset and is hence able to obtain those nfs requests from off-cpuset zones.

> > (But is it really bad? What actual problems will it cause once NFS is fixed?)
>
> NFS is okay as far as I can tell. Dirty throttling works fine in non-cpuset environments because we throttle if 40% of memory becomes dirty or under writeback.

Repeat: NFS shouldn't go oom. It should fail the allocation, recover, and wait for existing IO to complete. Back that up with a mempool for NFS requests and the problem is solved, I think?

> > I don't understand why the proposed patches are cpuset-aware at all. This is a per-zone problem, and a per-zone fix would seem to be appropriate, and more general. For example, i386 machines can presumably get into trouble if all of ZONE_DMA or ZONE_NORMAL get dirty. A good implementation would address that problem as well. So I think it should all be per-zone?
>
> No. A zone can be completely dirty as long as we are allowed to allocate from other zones.

But we also can get into trouble if a *zone* is all-dirty. Any solution to the cpuset problem should solve that problem too, no?

> > Do we really need those per-inode cpumasks? When page reclaim encounters a dirty page on the zone LRU, we automatically know that page->mapping->host has at least one dirty page in this zone, yes? We could immediately ask
>
> Yes, but when we enter reclaim most of the pages of a zone may already be dirty/writeback so we fail.

No. If the dirty limits become per-zone then no zone will ever have >40% dirty.

The obvious fix here is: when a zone hits 40% dirty, perform dirty-memory reduction in that zone, throttling the dirtying process. I suspect this would work very badly in common situations with, say, typical i386 boxes.
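The mempool idea above (in the kernel, `mempool_create()`/`mempool_alloc()`) can be illustrated with a toy userspace model of the guarantee it provides: allocation beyond the reserve fails and the caller waits, but a completed request always frees a slot, so writeback makes forward progress instead of triggering the OOM killer. This is only an illustration of the invariant, not the real mempool API; the names are invented.

```c
#include <assert.h>

/* Toy model of a request reserve: a bounded number of in-flight
 * requests.  Exhaustion means "wait for a completion", never "OOM". */

struct reqpool {
    int reserve;   /* slots guaranteed for writeback requests */
    int in_use;    /* requests currently in flight */
};

/* Try to take a request slot; failure tells the caller to wait for
 * existing IO to complete rather than allocating unboundedly. */
static int req_alloc(struct reqpool *p)
{
    if (p->in_use >= p->reserve)
        return 0;
    p->in_use++;
    return 1;
}

/* An IO completion returns its slot, unblocking a waiter. */
static void req_free(struct reqpool *p)
{
    p->in_use--;
}
```

The point of the design choice: the reserve bounds memory consumed by the writeback path, so even an all-dirty cpuset can drain itself through a fixed-size pipe.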
Re: [RFC 0/8] Cpuset aware writeback
On Wed, 17 Jan 2007, Andi Kleen wrote:
> > Secondly we modify the dirty limit calculation to be based on the active cpuset.
>
> The global dirty limit definitely seems to be a problem in several cases, but my feeling is that the cpuset is the wrong unit to keep track of it. Most likely it should be more fine-grained.

We already have zone reclaim that can take care of smaller units, but why would we start writeback if only one zone is full of dirty pages and there are lots of other zones (nodes) that are free?

> > If we are in a cpuset then we select only inodes for writeback that have pages on the nodes of the cpuset.
>
> Is there any indication this change helps on smaller systems or is it purely a large system optimization?

The bigger the system the larger the problem, because the dirty ratio is currently calculated based on the percentage of dirty pages in the system as a whole. The smaller the fraction of the system a cpuset contains, the less effective dirty_ratio and background_dirty_ratio become.

> > B. We add a new counter NR_UNRECLAIMABLE that is subtracted from the available pages in a node. This allows us to accurately calculate the dirty ratio even if large portions of the node have been allocated for huge pages or for slab pages.
>
> That sounds like a useful change by itself.

I can separate that one out.
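The NR_UNRECLAIMABLE point can be shown with a short sketch: pages pinned by huge pages or slab cannot hold pagecache, so they should be subtracted before applying dirty_ratio. This is a userspace model of the arithmetic, not the patchset's code, and the field names are illustrative.

```c
#include <assert.h>

/* Model: dirty thresholds should be computed over pages that can
 * actually become dirty pagecache.  A node half-filled with huge pages
 * would otherwise need 40% of *all* its pages dirty before throttling,
 * i.e. 80% of what can really be dirtied. */

struct node_stats {
    long present_pages;
    long unreclaimable;   /* huge pages, slab, ... (NR_UNRECLAIMABLE) */
};

static long dirtyable_memory(const struct node_stats *n)
{
    return n->present_pages - n->unreclaimable;
}

static long dirty_threshold(const struct node_stats *n, int dirty_ratio)
{
    return dirtyable_memory(n) * dirty_ratio / 100;
}
```

For a node with half its pages unreclaimable, the corrected 40% threshold is half of what the naive all-pages calculation would allow.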
Re: [RFC 0/8] Cpuset aware writeback
On Tue, 16 Jan 2007, Andrew Morton wrote:

> > On Mon, 15 Jan 2007 21:47:43 -0800 (PST) Christoph Lameter
> > <[EMAIL PROTECTED]> wrote:
> >
> > Currently cpusets are not able to do proper writeback since
> > dirty ratio calculations and writeback are all done for the system
> > as a whole.
>
> We _do_ do proper writeback. But it's less efficient than it might be, and
> there's an NFS problem.

Well, yes, we write back during LRU scans when a potentially high
percentage of the memory in a cpuset is dirty.

> > This may result in a large percentage of a cpuset
> > to become dirty without writeout being triggered. Under NFS
> > this can lead to OOM conditions.
>
> OK, a big question: is this patchset a performance improvement or a
> correctness fix? Given the above, and the lack of benchmark results I'm
> assuming it's for correctness.

It is a correctness fix, both for NFS OOM and for doing proper cpuset
writeout.

> - Why does NFS go oom? Because it allocates potentially-unbounded
>   numbers of requests in the writeback path?
>
>   It was able to go oom on non-numa machines before dirty-page-tracking
>   went in. So a general problem has now become specific to some NUMA
>   setups.

Right. The issue is that large portions of memory become dirty / writeback
since no writeback occurs because dirty limits are not checked for a
cpuset. Then NFS attempts to write out when doing LRU scans but is unable
to allocate memory.

> So an obvious, equivalent and vastly simpler "fix" would be to teach
> the NFS client to go off-cpuset when trying to allocate these requests.

Yes, we can fix these allocations by allowing processes to allocate from
other nodes. But then the container function of cpusets is no longer
there.

> (But is it really bad? What actual problems will it cause once NFS is
> fixed?)

NFS is okay as far as I can tell. Dirty throttling works fine in
non-cpuset environments because we throttle if 40% of memory becomes dirty
or under writeback.

> I don't understand why the proposed patches are cpuset-aware at all. This
> is a per-zone problem, and a per-zone fix would seem to be appropriate, and
> more general. For example, i386 machines can presumably get into trouble
> if all of ZONE_DMA or ZONE_NORMAL get dirty. A good implementation would
> address that problem as well. So I think it should all be per-zone?

No. A zone can be completely dirty as long as we are allowed to allocate
from other zones.

> Do we really need those per-inode cpumasks? When page reclaim encounters a
> dirty page on the zone LRU, we automatically know that page->mapping->host
> has at least one dirty page in this zone, yes? We could immediately ask

Yes, but when we enter reclaim most of the pages of a zone may already be
dirty/writeback so we fail. Also when we enter reclaim we may not have the
proper process / cpuset context. There is no use to throttle kswapd. We
need to throttle the process that is dirtying memory.

> But all of this is, I think, unneeded if NFS is fixed. It's hopefully a
> performance optimisation to permit writeout in a less seeky fashion.
> Unless there's some other problem with excessively dirty zones.

The patchset improves performance because the filesystem can do sequential
writeouts. So yes, in some ways this is a performance improvement. But
this is only because this patch makes dirty throttling for cpusets work in
the same way as for a non-NUMA system.
Re: [RFC 0/8] Cpuset aware writeback
> Secondly we modify the dirty limit calculation to be based
> on the active cpuset.

The global dirty limit definitely seems to be a problem
in several cases, but my feeling is that the cpuset is the wrong unit
to keep track of it. Most likely it should be more fine grained.

> If we are in a cpuset then we select only inodes for writeback
> that have pages on the nodes of the cpuset.

Is there any indication this change helps on smaller systems
or is it purely a large system optimization?

> B. We add a new counter NR_UNRECLAIMABLE that is subtracted
>    from the available pages in a node. This allows us to
>    accurately calculate the dirty ratio even if large portions
>    of the node have been allocated for huge pages or for
>    slab pages.

That sounds like a useful change by itself.

-Andi
Re: [RFC 0/8] Cpuset aware writeback
> On Mon, 15 Jan 2007 21:47:43 -0800 (PST) Christoph Lameter
> <[EMAIL PROTECTED]> wrote:
>
> Currently cpusets are not able to do proper writeback since
> dirty ratio calculations and writeback are all done for the system
> as a whole.

We _do_ do proper writeback. But it's less efficient than it might be, and
there's an NFS problem.

> This may result in a large percentage of a cpuset
> to become dirty without writeout being triggered. Under NFS
> this can lead to OOM conditions.

OK, a big question: is this patchset a performance improvement or a
correctness fix? Given the above, and the lack of benchmark results I'm
assuming it's for correctness.

- Why does NFS go oom? Because it allocates potentially-unbounded
  numbers of requests in the writeback path?

  It was able to go oom on non-numa machines before dirty-page-tracking
  went in. So a general problem has now become specific to some NUMA
  setups.

  We have earlier discussed fixing NFS to not do that. Make it allocate a
  fixed number of requests and to then block. Just like
  get_request_wait().

  This is one reason why block_congestion_wait() and friends got renamed
  to congestion_wait(): it's on the path to getting NFS better aligned
  with how block devices are handling this.

- There's no reason which I can see why NFS _has_ to go oom. It could
  just fail the memory allocation for the request and then wait for the
  stuff which it _has_ submitted to complete. We do that for block
  devices, backed by mempools.

- Why does NFS go oom if there's free memory in other nodes? I assume
  that's what's happening, because things apparently work OK if you ask
  pdflush to do exactly the thing which the direct-reclaim process was
  attempting to do: allocate NFS requests and do writeback.

  So an obvious, equivalent and vastly simpler "fix" would be to teach
  the NFS client to go off-cpuset when trying to allocate these requests.

I suspect that if we do some or all of the above, NFS gets better and the
bug which motivated this patchset goes away.

But that being said, yes, allowing a zone to go 100% dirty like this is
bad, and it'd be nice to be able to fix it. (But is it really bad? What
actual problems will it cause once NFS is fixed?) Assuming that it is bad,
yes, we'll obviously need the extra per-zone dirty-memory accounting.

I don't understand why the proposed patches are cpuset-aware at all. This
is a per-zone problem, and a per-zone fix would seem to be appropriate,
and more general. For example, i386 machines can presumably get into
trouble if all of ZONE_DMA or ZONE_NORMAL get dirty. A good implementation
would address that problem as well. So I think it should all be per-zone?

Do we really need those per-inode cpumasks? When page reclaim encounters a
dirty page on the zone LRU, we automatically know that page->mapping->host
has at least one dirty page in this zone, yes? We could immediately ask
pdflush to write out some pages from that inode. We would need to take a
ref on the inode (while the page is locked, to avoid racing with inode
reclaim) and pass that inode off to pdflush (actually pass a list of such
inodes off to pdflush, keep appending to it).

Extra refinements would include

- telling pdflush the file offset of the page so it can do writearound

- getting pdflush to deactivate any pages which it writes out, so that
  rotate_reclaimable_page() has a good chance of moving them to the tail
  of the inactive list for immediate reclaim.

But all of this is, I think, unneeded if NFS is fixed. It's hopefully a
performance optimisation to permit writeout in a less seeky fashion.
Unless there's some other problem with excessively dirty zones.
Re: [RFC 0/8] Cpuset aware writeback
On Tue, 16 Jan 2007, Peter Zijlstra wrote:

> > B. We add a new counter NR_UNRECLAIMABLE that is subtracted
> >    from the available pages in a node. This allows us to
> >    accurately calculate the dirty ratio even if large portions
> >    of the node have been allocated for huge pages or for
> >    slab pages.
>
> What about mlock'ed pages?

mlocked pages can be dirty and written back, right? So for the dirty ratio
calculation they do not play a role. We may need a separate counter for
mlocked pages if they are to be considered for other decisions in the VM.

> Otherwise it all looks good.
>
> Acked-by: Peter Zijlstra <[EMAIL PROTECTED]>

Thanks.
Re: [RFC 0/8] Cpuset aware writeback
On Tue, 16 Jan 2007, Paul Jackson wrote:

> > 1. The nodemask expands the inode structure significantly if the
> >    architecture allows a high number of nodes. This is only an issue
> >    for IA64.
>
> Should that logic be disabled if HOTPLUG is configured on? Or is
> nr_node_ids a valid upper limit on what could be plugged in, even on a
> mythical HOTPLUG system?

nr_node_ids is a valid upper limit on what could be plugged in. We could
modify nodemasks to only use nr_node_ids bits and the kernel would still
be functioning correctly.

> > 2. The calculation of the per cpuset limits can require looping
> >    over a number of nodes which may bring the performance of
> >    get_dirty_limits near pre 2.6.18 performance
>
> Could we cache these limits? Perhaps they only need to be recalculated
> if a task's mems_allowed changes?

No, they change dynamically. In particular writeout reduces the number of
dirty / unstable pages.

> Separate question - what happens if a task's mems_allowed changes while
> it is dirtying pages? We could easily end up with dirty pages on nodes
> that are no longer allowed to the task. Is there any way that such a
> miscalculation could cause us to do harmful things?

The dirty_map on an inode is independent of a cpuset. The cpuset only
comes into effect when we decide to do writeout and are scanning for files
with pages on the nodes of interest.

> In patch 2/8:
> > The dirty map is cleared when the inode is cleared. There is no
> > synchronization (except for atomic nature of node_set) for the
> > dirty_map. The only problem that could be done is that we do not write
> > out an inode if a node bit is not set.
>
> Does this mean that a dirty page could be left 'forever' in memory,
> unwritten, exposing us to risk of data corruption on disk, from some
> write done weeks ago, but unwritten, in the event of say a power loss?

No, it will age and be written out anyway. Note that there are usually
multiple dirty pages, which reduces the chance of the race. These are
node bits that help to decide when to start writeout of all dirty pages
of an inode regardless of where the other pages are.

> Also in patch 2/8:
> > +static inline void cpuset_update_dirty_nodes(struct inode *i,
> > +			struct page *page) {}
>
> Is an incomplete 'struct inode;' declaration needed here in cpuset.h,
> to avoid a warning if compiling with CPUSETS not configured?

Correct.

> In patch 4/8:
> > We now add per node information which I think is equal or less effort
> > since there are less nodes than processors.
>
> Not so on Paul Menage's fake NUMA nodes - he can have say 64 fake nodes
> on a system with 2 or 4 CPUs and one real node. But I guess that's ok ...

True, but then it's fake.

> In patch 4/8:
> > +#ifdef CONFIG_CPUSETS
> > +	/*
> > +	 * Calculate the limits relative to the current cpuset if necessary.
> > +	 */
> > +	if (unlikely(nodes &&
> > +			!nodes_subset(node_online_map, *nodes))) {
> > +		int node;
> > +
> > +		is_subset = 1;
> > +		...
> > +#ifdef CONFIG_HIGHMEM
> > +		high_memory += NODE_DATA(node)
> > +			->node_zones[ZONE_HIGHMEM]->present_pages;
> > +#endif
> > +		nr_mapped += node_page_state(node, NR_FILE_MAPPED) +
> > +			node_page_state(node, NR_ANON_PAGES);
> > +		}
> > +	} else
> > +#endif
> > +	{
>
> I'm wishing there was a clearer way to write the above code. Nested
> ifdef's and an ifdef block ending in an open 'else' and perhaps the
> first #ifdef CONFIG_CPUSETS ever, outside of fs/proc/base.c ...

I have tried to replicate the structure for the global dirty_limits
calculation, which has the same ifdef.
Re: [RFC 0/8] Cpuset aware writeback
Christoph wrote:
> Currently cpusets are not able to do proper writeback since
> dirty ratio calculations and writeback are all done for the system
> as a whole.

Thanks for tackling this - it is sorely needed. I'm afraid my review will
be mostly cosmetic; I'm not competent to comment on the really interesting
stuff.

> If we are in a cpuset then we select only inodes for writeback
> that have pages on the nodes of the cpuset.

Sorry - you tripped over a subtle distinction that happens to be on my
list of things to notice. When cpusets are configured, -all- tasks are in
a cpuset. And (correctly so, I trust) this patch doesn't look into the
task's cpuset to see what nodes it allows. Rather it looks to the
mems_allowed field in the task struct, which is equal to or (when
set_mempolicy is used) a subset of that task's cpuset's allowed nodes.

Perhaps the following phrasing would be more accurate:

    If CPUSETs are configured, then we select only the inodes for
    writeback that have dirty pages on that task's mems_allowed nodes.

> Secondly we modify the dirty limit calculation to be based
> on the active cpuset.

As above, perhaps the following would be more accurate:

    Secondly we modify the dirty limit calculation to be based on the
    current task's mems_allowed nodes.

> 1. The nodemask expands the inode structure significantly if the
>    architecture allows a high number of nodes. This is only an issue
>    for IA64.

Should that logic be disabled if HOTPLUG is configured on? Or is
nr_node_ids a valid upper limit on what could be plugged in, even on a
mythical HOTPLUG system?

> 2. The calculation of the per cpuset limits can require looping
>    over a number of nodes which may bring the performance of
>    get_dirty_limits near pre 2.6.18 performance

Could we cache these limits? Perhaps they only need to be recalculated if
a task's mems_allowed changes?

Separate question - what happens if a task's mems_allowed changes while
it is dirtying pages? We could easily end up with dirty pages on nodes
that are no longer allowed to the task. Is there any way that such a
miscalculation could cause us to do harmful things?

In patch 2/8:
> The dirty map is cleared when the inode is cleared. There is no
> synchronization (except for atomic nature of node_set) for the
> dirty_map. The only problem that could be done is that we do not write
> out an inode if a node bit is not set.

Does this mean that a dirty page could be left 'forever' in memory,
unwritten, exposing us to risk of data corruption on disk, from some
write done weeks ago, but unwritten, in the event of say a power loss?

Also in patch 2/8:
> +static inline void cpuset_update_dirty_nodes(struct inode *i,
> +			struct page *page) {}

Is an incomplete 'struct inode;' declaration needed here in cpuset.h,
to avoid a warning if compiling with CPUSETS not configured?

In patch 4/8:
> We now add per node information which I think is equal or less effort
> since there are less nodes than processors.

Not so on Paul Menage's fake NUMA nodes - he can have say 64 fake nodes
on a system with 2 or 4 CPUs and one real node. But I guess that's ok ...

In patch 4/8:
> +#ifdef CONFIG_CPUSETS
> +	/*
> +	 * Calculate the limits relative to the current cpuset if necessary.
> +	 */
> +	if (unlikely(nodes &&
> +			!nodes_subset(node_online_map, *nodes))) {
> +		int node;
> +
> +		is_subset = 1;
> +		...
> +#ifdef CONFIG_HIGHMEM
> +		high_memory += NODE_DATA(node)
> +			->node_zones[ZONE_HIGHMEM]->present_pages;
> +#endif
> +		nr_mapped += node_page_state(node, NR_FILE_MAPPED) +
> +			node_page_state(node, NR_ANON_PAGES);
> +		}
> +	} else
> +#endif
> +	{

I'm wishing there was a clearer way to write the above code. Nested
ifdef's and an ifdef block ending in an open 'else' and perhaps the first
#ifdef CONFIG_CPUSETS ever, outside of fs/proc/base.c ... However I have
no clue if such a clearer way exists. Sorry.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401
Re: [RFC 0/8] Cpuset aware writeback
Christoph wrote: Currently cpusets are not able to do proper writeback since dirty ratio calculations and writeback are all done for the system as a whole. Thanks for tackling this - it is sorely needed. I'm afraid my review will be mostly cosmetic; I'm not competent to comment on the really interesting stuff. If we are in a cpuset then we select only inodes for writeback that have pages on the nodes of the cpuset. Sorry - you tripped over a subtle distinction that happens to be on my list of things to notice. When cpusets are configured, -all- tasks are in a cpuset. And (correctly so, I trust) this patch doesn't look into the tasks cpuset to see what nodes it allows. Rather it looks to the mems_allowed field in the task struct, which is equal to or (when set_mempolicy is used) a subset of that tasks cpusets allowed nodes. Perhaps the following phrasing would be more accurate: If CPUSETs are configured, then we select only the inodes for writeback that have dirty pages on that tasks mems_allowed nodes. Secondly we modify the dirty limit calculation to be based on the acctive cpuset. As above, perhaps the following would be more accurate: Secondly we modify the dirty limit calculation to be based on the current tasks mems_allowed nodes. 1. The nodemask expands the inode structure significantly if the architecture allows a high number of nodes. This is only an issue for IA64. Should that logic be disabled if HOTPLUG is configured on? Or is nr_node_ids a valid upper limit on what could be plugged in, even on a mythical HOTPLUG system? 2. The calculation of the per cpuset limits can require looping over a number of nodes which may bring the performance of get_dirty_limits near pre 2.6.18 performance Could we cache these limits? Perhaps they only need to be recalculated if a tasks mems_allowed changes? Separate question - what happens if a tasks mems_allowed changes while it is dirtying pages? 
We could easily end up with dirty pages on nodes that are no longer allowed to the task. Is there anyway that such a miscalculation could cause us to do harmful things? In patch 2/8: The dirty map is cleared when the inode is cleared. There is no synchronization (except for atomic nature of node_set) for the dirty_map. The only problem that could be done is that we do not write out an inode if a node bit is not set. Does this mean that a dirty page could be left 'forever' in memory, unwritten, exposing us to risk of data corruption on disk, from some write done weeks ago, but unwritten, in the event of say a power loss? Also in patch 2/8: +static inline void cpuset_update_dirty_nodes(struct inode *i, + struct page *page) {} Is an incomplete 'struct inode;' declaration needed here in cpuset.h, to avoid a warning if compiling with CPUSETS not configured? In patch 4/8: We now add per node information which I think is equal or less effort since there are less nodes than processors. Not so on Paul Menage's fake NUMA nodes - he can have say 64 fake nodes on a system with 2 or 4 CPUs and one real node. But I guess that's ok ... In patch 4/8: +#ifdef CONFIG_CPUSETS + /* + * Calculate the limits relative to the current cpuset if necessary. + */ + if (unlikely(nodes + !nodes_subset(node_online_map, *nodes))) { + int node; + + is_subset = 1; + ... +#ifdef CONFIG_HIGHMEM + high_memory += NODE_DATA(node) + -node_zones[ZONE_HIGHMEM]-present_pages; +#endif + nr_mapped += node_page_state(node, NR_FILE_MAPPED) + + node_page_state(node, NR_ANON_PAGES); + } + } else +#endif + { I'm wishing there was a clearer way to write the above code. Nested ifdef's and an ifdef block ending in an open 'else' and perhaps the first #ifdef CONFIG_CPUSETS ever, outside of fs/proc/base.c ... However I have no clue if such a clearer way exists. Sorry. -- I won't rest till it's the best ... 
Programmer, Linux Scalability Paul Jackson [EMAIL PROTECTED] 1.925.600.0401 - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 0/8] Cpuset aware writeback
On Tue, 16 Jan 2007, Paul Jackson wrote: 1. The nodemask expands the inode structure significantly if the architecture allows a high number of nodes. This is only an issue for IA64. Should that logic be disabled if HOTPLUG is configured on? Or is nr_node_ids a valid upper limit on what could be plugged in, even on a mythical HOTPLUG system? nr_node_ids is a valid upper limit on what could be plugged in. We could modify nodemasks to only use nr_node_ids bits and the kernel would still be functioning correctly. 2. The calculation of the per cpuset limits can require looping over a number of nodes which may bring the performance of get_dirty_limits near pre 2.6.18 performance Could we cache these limits? Perhaps they only need to be recalculated if a tasks mems_allowed changes? No they change dynamically. In particular writeout reduces the number of dirty / unstable pages. Separate question - what happens if a tasks mems_allowed changes while it is dirtying pages? We could easily end up with dirty pages on nodes that are no longer allowed to the task. Is there anyway that such a miscalculation could cause us to do harmful things? The dirty_map on an inode is independent of a cpuset. The cpuset only comes into effect when we decide to do writeout and are scanning for files with pages on the nodes of interest. In patch 2/8: The dirty map is cleared when the inode is cleared. There is no synchronization (except for atomic nature of node_set) for the dirty_map. The only problem that could be done is that we do not write out an inode if a node bit is not set. Does this mean that a dirty page could be left 'forever' in memory, unwritten, exposing us to risk of data corruption on disk, from some write done weeks ago, but unwritten, in the event of say a power loss? No it will age and be written out anyways. Note that there are usually multiple dirty pages which reduces the chance of the race. 
These are node bits that help to decide when to start writeout of all dirty pages of an inode regardless of where the other pages are. Also in patch 2/8: +static inline void cpuset_update_dirty_nodes(struct inode *i, + struct page *page) {} Is an incomplete 'struct inode;' declaration needed here in cpuset.h, to avoid a warning if compiling with CPUSETS not configured? Correct. In patch 4/8: We now add per node information which I think is equal or less effort since there are less nodes than processors. Not so on Paul Menage's fake NUMA nodes - he can have say 64 fake nodes on a system with 2 or 4 CPUs and one real node. But I guess that's ok ... True but then its fake. In patch 4/8: +#ifdef CONFIG_CPUSETS + /* +* Calculate the limits relative to the current cpuset if necessary. +*/ + if (unlikely(nodes + !nodes_subset(node_online_map, *nodes))) { + int node; + + is_subset = 1; + ... +#ifdef CONFIG_HIGHMEM + high_memory += NODE_DATA(node) + -node_zones[ZONE_HIGHMEM]-present_pages; +#endif + nr_mapped += node_page_state(node, NR_FILE_MAPPED) + + node_page_state(node, NR_ANON_PAGES); + } + } else +#endif + { I'm wishing there was a clearer way to write the above code. Nested ifdef's and an ifdef block ending in an open 'else' and perhaps the first #ifdef CONFIG_CPUSETS ever, outside of fs/proc/base.c ... I have tried to replicate the structure for global dirty_limits calculation which has the same ifdef. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 0/8] Cpuset aware writeback
On Tue, 16 Jan 2007, Peter Zijlstra wrote: B. We add a new counter NR_UNRECLAIMABLE that is subtracted from the available pages in a node. This allows us to accurately calculate the dirty ratio even if large portions of the node have been allocated for huge pages or for slab pages. What about mlock'ed pages? mlocked pages can be dirty and written back right? So for the dirty ratio calculation they do not play a role. We may need a separate counter for mlocked pages if they are to be considered for other decisions in the VM. Otherwise it all looks good. Acked-by: Peter Zijlstra [EMAIL PROTECTED] Thanks. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 0/8] Cpuset aware writeback
On Mon, 15 Jan 2007 21:47:43 -0800 (PST) Christoph Lameter [EMAIL PROTECTED] wrote: Currently cpusets are not able to do proper writeback since dirty ratio calculations and writeback are all done for the system as a whole. We _do_ do proper writeback. But it's less efficient than it might be, and there's an NFS problem. This may result in a large percentage of a cpuset to become dirty without writeout being triggered. Under NFS this can lead to OOM conditions. OK, a big question: is this patchset a performance improvement or a correctness fix? Given the above, and the lack of benchmark results I'm assuming it's for correctness. - Why does NFS go oom? Because it allocates potentially-unbounded numbers of requests in the writeback path? It was able to go oom on non-numa machines before dirty-page-tracking went in. So a general problem has now become specific to some NUMA setups. We have earlier discussed fixing NFS to not do that. Make it allocate a fixed number of requests and to then block. Just like get_request_wait(). This is one reason why block_congestion_wait() and friends got renamed to congestion_wait(): it's on the path to getting NFS better aligned with how block devices are handling this. - There's no reason which I can see why NFS _has_ to go oom. It could just fail the memory allocation for the request and then wait for the stuff which it _has_ submitted to complete. We do that for block devices, backed by mempools. - Why does NFS go oom if there's free memory in other nodes? I assume that's what's happening, because things apparently work OK if you ask pdflush to do exactly the thing which the direct-reclaim process was attempting to do: allocate NFS requests and do writeback. So an obvious, equivalent and vastly simpler fix would be to teach the NFS client to go off-cpuset when trying to allocate these requests. I suspect that if we do some or all of the above, NFS gets better and the bug which motivated this patchset goes away. 
But that being said, yes, allowing a zone to go 100% dirty like this is bad, and it's be nice to be able to fix it. (But is it really bad? What actual problems will it cause once NFS is fixed?) Assuming that it is bad, yes, we'll obviously need the extra per-zone dirty-memory accounting. I don't understand why the proposed patches are cpuset-aware at all. This is a per-zone problem, and a per-zone fix would seem to be appropriate, and more general. For example, i386 machines can presumably get into trouble if all of ZONE_DMA or ZONE_NORMAL get dirty. A good implementation would address that problem as well. So I think it should all be per-zone? Do we really need those per-inode cpumasks? When page reclaim encounters a dirty page on the zone LRU, we automatically know that page-mapping-host has at least one dirty page in this zone, yes? We could immediately ask pdflush to write out some pages from that inode. We would need to take a ref on the inode (while the page is locked, to avoid racing with inode reclaim) and pass that inode off to pdflush (actually pass a list of such inodes off to pdflush, keep appending to it). Extra refinements would include - telling pdflush the file offset of the page so it can do writearound - getting pdflush to deactivate any pages which it writes out, so that rotate_reclaimable_page() has a good chance of moving them to the tail of the inactive list for immediate reclaim. But all of this is, I think, unneeded if NFS is fixed. It's hopefully a performance optimisation to permit writeout in a less seeky fashion. Unless there's some other problem with excessively dirty zones. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 0/8] Cpuset aware writeback
Secondly we modify the dirty limit calculation to be based on the acctive cpuset. The global dirty limit definitely seems to be a problem in several cases, but my feeling is that the cpuset is the wrong unit to keep track of it. Most likely it should be more fine grained. If we are in a cpuset then we select only inodes for writeback that have pages on the nodes of the cpuset. Is there any indication this change helps on smaller systems or is it purely a large system optimization? B. We add a new counter NR_UNRECLAIMABLE that is subtracted from the available pages in a node. This allows us to accurately calculate the dirty ratio even if large portions of the node have been allocated for huge pages or for slab pages. That sounds like a useful change by itself. -Andi - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 0/8] Cpuset aware writeback
On Tue, 16 Jan 2007, Andrew Morton wrote: On Mon, 15 Jan 2007 21:47:43 -0800 (PST) Christoph Lameter [EMAIL PROTECTED] wrote: Currently cpusets are not able to do proper writeback since dirty ratio calculations and writeback are all done for the system as a whole. We _do_ do proper writeback. But it's less efficient than it might be, and there's an NFS problem. Well yes we write back during LRU scans when a potentially high percentage of the memory in cpuset is dirty. This may result in a large percentage of a cpuset to become dirty without writeout being triggered. Under NFS this can lead to OOM conditions. OK, a big question: is this patchset a performance improvement or a correctness fix? Given the above, and the lack of benchmark results I'm assuming it's for correctness. It is a correctness fix both for NFS OOM and doing proper cpuset writeout. - Why does NFS go oom? Because it allocates potentially-unbounded numbers of requests in the writeback path? It was able to go oom on non-numa machines before dirty-page-tracking went in. So a general problem has now become specific to some NUMA setups. Right. The issue is that large portions of memory become dirty / writeback since no writeback occurs because dirty limits are not checked for a cpuset. Then NFS attempt to writeout when doing LRU scans but is unable to allocate memory. So an obvious, equivalent and vastly simpler fix would be to teach the NFS client to go off-cpuset when trying to allocate these requests. Yes we can fix these allocations by allowing processes to allocate from other nodes. But then the container function of cpusets is no longer there. (But is it really bad? What actual problems will it cause once NFS is fixed?) NFS is okay as far as I can tell. dirty throttling works fine in non cpuset environments because we throttle if 40% of memory becomes dirty or under writeback. I don't understand why the proposed patches are cpuset-aware at all. 
This is a per-zone problem, and a per-zone fix would seem to be appropriate, and more general. For example, i386 machines can presumably get into trouble if all of ZONE_DMA or ZONE_NORMAL get dirty. A good implementation would address that problem as well. So I think it should all be per-zone? No. A zone can be completely dirty as long as we are allowed to allocate from other zones. Do we really need those per-inode cpumasks? When page reclaim encounters a dirty page on the zone LRU, we automatically know that page->mapping->host has at least one dirty page in this zone, yes? We could immediately ask Yes, but when we enter reclaim most of the pages of a zone may already be dirty/writeback so we fail. Also when we enter reclaim we may not have the proper process / cpuset context. There is no use to throttle kswapd. We need to throttle the process that is dirtying memory. But all of this is, I think, unneeded if NFS is fixed. It's hopefully a performance optimisation to permit writeout in a less seeky fashion. Unless there's some other problem with excessively dirty zones. The patchset improves performance because the filesystem can do sequential writeouts. So yes in some ways this is a performance improvement. But this is only because this patch makes dirty throttling for cpusets work in the same way as for non-NUMA systems.
Re: [RFC 0/8] Cpuset aware writeback
On Wed, 17 Jan 2007, Andi Kleen wrote: Secondly we modify the dirty limit calculation to be based on the active cpuset. The global dirty limit definitely seems to be a problem in several cases, but my feeling is that the cpuset is the wrong unit to keep track of it. Most likely it should be more fine-grained. We already have zone reclaim that can take care of smaller units but why would we start writeback if only one zone is full of dirty pages and there are lots of other zones (nodes) that are free? If we are in a cpuset then we select only inodes for writeback that have pages on the nodes of the cpuset. Is there any indication this change helps on smaller systems or is it purely a large system optimization? The bigger the system the larger the problem, because the dirty ratio is currently calculated as a percentage of the system's memory as a whole. The smaller the fraction of the system a cpuset contains, the less effective the dirty_ratio and background_dirty_ratio become. B. We add a new counter NR_UNRECLAIMABLE that is subtracted from the available pages in a node. This allows us to accurately calculate the dirty ratio even if large portions of the node have been allocated for huge pages or for slab pages. That sounds like a useful change by itself. I can separate that one out.
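The dilution point made here can be made concrete with a few lines of C. The numbers in the test are the thread's own example (20 nodes of 1G each, global background ratio 10%); the function name is made up for illustration:

```c
#include <assert.h>

/* With globally computed limits, a cpuset can only trigger writeback if
 * the memory it spans is itself a large enough share of the machine. */
static int cpuset_can_reach_ratio(unsigned long cpuset_pages,
                                  unsigned long total_pages,
                                  int ratio)
{
    /* Largest share of total memory this cpuset can ever dirty (percent). */
    unsigned long max_share = cpuset_pages * 100 / total_pages;
    return max_share >= (unsigned long)ratio;
}
```

A one-node cpuset on a 20-node machine tops out at 5% of total memory dirty, so neither the 10% background ratio nor the 40% throttle ratio can ever fire; computing the limits against the cpuset's own memory restores them.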
Re: [RFC 0/8] Cpuset aware writeback
On Tue, 16 Jan 2007 14:15:56 -0800 (PST) Christoph Lameter [EMAIL PROTECTED] wrote: ... This may result in a large percentage of a cpuset becoming dirty without writeout being triggered. Under NFS this can lead to OOM conditions. OK, a big question: is this patchset a performance improvement or a correctness fix? Given the above, and the lack of benchmark results I'm assuming it's for correctness. It is a correctness fix both for NFS OOM and doing proper cpuset writeout. It's a workaround for a still-unfixed NFS problem. - Why does NFS go oom? Because it allocates potentially-unbounded numbers of requests in the writeback path? It was able to go oom on non-numa machines before dirty-page-tracking went in. So a general problem has now become specific to some NUMA setups. Right. The issue is that large portions of memory become dirty / writeback since no writeback occurs because dirty limits are not checked for a cpuset. Then NFS attempts to write out when doing LRU scans but is unable to allocate memory. So an obvious, equivalent and vastly simpler fix would be to teach the NFS client to go off-cpuset when trying to allocate these requests. Yes we can fix these allocations by allowing processes to allocate from other nodes. But then the container function of cpusets is no longer there. But that's what your patch already does! It asks pdflush to write the pages instead of the direct-reclaim caller. The only reason pdflush doesn't go oom is that pdflush lives outside the direct-reclaim caller's cpuset and is hence able to obtain those nfs requests from off-cpuset zones. (But is it really bad? What actual problems will it cause once NFS is fixed?) NFS is okay as far as I can tell. Dirty throttling works fine in non-cpuset environments because we throttle if 40% of memory becomes dirty or under writeback. Repeat: NFS shouldn't go oom. It should fail the allocation, recover, wait for existing IO to complete.
Back that up with a mempool for NFS requests and the problem is solved, I think? I don't understand why the proposed patches are cpuset-aware at all. This is a per-zone problem, and a per-zone fix would seem to be appropriate, and more general. For example, i386 machines can presumably get into trouble if all of ZONE_DMA or ZONE_NORMAL get dirty. A good implementation would address that problem as well. So I think it should all be per-zone? No. A zone can be completely dirty as long as we are allowed to allocate from other zones. But we also can get into trouble if a *zone* is all-dirty. Any solution to the cpuset problem should solve that problem too, no? Do we really need those per-inode cpumasks? When page reclaim encounters a dirty page on the zone LRU, we automatically know that page->mapping->host has at least one dirty page in this zone, yes? We could immediately ask Yes, but when we enter reclaim most of the pages of a zone may already be dirty/writeback so we fail. No. If the dirty limits become per-zone then no zone will ever have 40% dirty. The obvious fix here is: when a zone hits 40% dirty, perform dirty-memory reduction in that zone, throttling the dirtying process. I suspect this would work very badly in common situations with, say, typical i386 boxes.
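The per-zone alternative floated above amounts to checking each zone's own dirty fraction. A minimal userspace sketch of that check, with hypothetical names:

```c
#include <assert.h>

/* Per-zone throttling as proposed here: the dirtier is throttled as soon
 * as any single zone crosses the dirty ratio, regardless of how much
 * clean memory the other zones still have. */
static int zone_should_throttle(unsigned long zone_dirty_pages,
                                unsigned long zone_total_pages,
                                int dirty_ratio)
{
    return zone_dirty_pages * 100 >=
           zone_total_pages * (unsigned long)dirty_ratio;
}
```

The objection raised later in the thread is visible in this shape: with node-local allocation a sequential writer fills its local zone first, so it would throttle at 40% of one zone while the rest of the cpuset's memory sits clean.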
Re: [RFC 0/8] Cpuset aware writeback
On Tue, Jan 16, 2007 at 01:53:25PM -0800, Andrew Morton wrote: On Mon, 15 Jan 2007 21:47:43 -0800 (PST) Christoph Lameter [EMAIL PROTECTED] wrote: Currently cpusets are not able to do proper writeback since dirty ratio calculations and writeback are all done for the system as a whole. We _do_ do proper writeback. But it's less efficient than it might be, and there's an NFS problem. This may result in a large percentage of a cpuset to become dirty without writeout being triggered. Under NFS this can lead to OOM conditions. OK, a big question: is this patchset a performance improvement or a correctness fix? Given the above, and the lack of benchmark results I'm assuming it's for correctness. Given that we've already got a 25-30% buffered write performance degradation between 2.6.18 and 2.6.20-rc4 for simple sequential write I/O to multiple filesystems concurrently, I'd really like to see some serious I/O performance regression testing on this change before it goes anywhere. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group
Re: [RFC 0/8] Cpuset aware writeback
On Tue, 16 Jan 2007, Andrew Morton wrote: It's a workaround for a still-unfixed NFS problem. No, it's doing proper throttling. Without this patchset there will be *no* writeback and throttling at all. F.e. let's say we have 20 nodes of 1G each and a cpuset that only spans one node. Then a process running in that cpuset can dirty all of memory and still continue running without writeback occurring. The background dirty ratio is at 10% and the dirty ratio at 40%. Neither of those boundaries can ever be reached because the process will only ever be able to dirty memory on one node, which is 5%. There will be no throttling, no background writeback, no blocking for dirty pages. At some point we run into reclaim (possibly we have ~99% of the cpuset dirty) and then we trigger writeout. Okay, so if the filesystem / block device is robust enough and does not require memory allocations then we likely will survive that and do slow writeback page by page from the LRU. Writeback is completely hosed for that situation. This patch restores expected behavior in a cpuset (which is a form of system partition that should mirror the system as a whole). At 10% dirty we should start background writeback and at 40% we should block. If that is done then even fragile combinations of filesystem/block devices will work as they do without cpusets. Yes we can fix these allocations by allowing processes to allocate from other nodes. But then the container function of cpusets is no longer there. But that's what your patch already does! The patchset does not allow processes to allocate from other nodes than the current cpuset. There is no change as to the source of memory allocations. NFS is okay as far as I can tell. Dirty throttling works fine in non-cpuset environments because we throttle if 40% of memory becomes dirty or under writeback. Repeat: NFS shouldn't go oom. It should fail the allocation, recover, wait for existing IO to complete.
Back that up with a mempool for NFS requests and the problem is solved, I think? AFAIK any filesystem/block device can go oom with the current broken writeback if it just does a few allocations. It's a matter of hitting the sweet spots. But we also can get into trouble if a *zone* is all-dirty. Any solution to the cpuset problem should solve that problem too, no? Nope. Why would a dirty zone pose a problem? The problem exists if you cannot allocate more memory. If a cpuset contains a single node which is a single zone then this patchset will also address that issue. If we have multiple zones then other zones may still provide memory to continue (same as in UP). Yes, but when we enter reclaim most of the pages of a zone may already be dirty/writeback so we fail. No. If the dirty limits become per-zone then no zone will ever have 40% dirty. I am still confused as to why you would want per-zone dirty limits? Let's say we have a cpuset with 4 nodes (thus 4 zones) and we are running on the first node. Then we copy a large file to disk. Node-local allocation means that we allocate from the first node. After we reach 40% of the node then we throttle? This is going to be a significant performance degradation since we can no longer use the memory of other nodes to buffer writeout. The obvious fix here is: when a zone hits 40% dirty, perform dirty-memory reduction in that zone, throttling the dirtying process. I suspect this would work very badly in common situations with, say, typical i386 boxes. Absolute crap. You can prototype that broken behavior with zone reclaim by the way. Just switch on writeback during zone reclaim and watch how memory on a cpuset is unused and how the system becomes slow as molasses.
Re: [RFC 0/8] Cpuset aware writeback
On Tue, 16 Jan 2007 16:16:30 -0800 (PST) Christoph Lameter [EMAIL PROTECTED] wrote: On Tue, 16 Jan 2007, Andrew Morton wrote: It's a workaround for a still-unfixed NFS problem. No, it's doing proper throttling. Without this patchset there will be *no* writeback and throttling at all. F.e. let's say we have 20 nodes of 1G each and a cpuset that only spans one node. Then a process running in that cpuset can dirty all of memory and still continue running without writeback occurring. The background dirty ratio is at 10% and the dirty ratio at 40%. Neither of those boundaries can ever be reached because the process will only ever be able to dirty memory on one node, which is 5%. There will be no throttling, no background writeback, no blocking for dirty pages. At some point we run into reclaim (possibly we have ~99% of the cpuset dirty) and then we trigger writeout. Okay, so if the filesystem / block device is robust enough and does not require memory allocations then we likely will survive that and do slow writeback page by page from the LRU. Writeback is completely hosed for that situation. This patch restores expected behavior in a cpuset (which is a form of system partition that should mirror the system as a whole). At 10% dirty we should start background writeback and at 40% we should block. If that is done then even fragile combinations of filesystem/block devices will work as they do without cpusets. Nope. You've completely omitted the little fact that we'll do writeback in the offending zone off the LRU. Slower, maybe. But it should work and the system should recover. If it's not doing that (it isn't) then we should fix it rather than avoiding it (by punting writeback over to pdflush). Once that's fixed, if we determine that there are remaining and significant performance issues then we can take a look at that. Yes we can fix these allocations by allowing processes to allocate from other nodes. But then the container function of cpusets is no longer there.
But that's what your patch already does! The patchset does not allow processes to allocate from other nodes than the current cpuset. Yes it does. It asks pdflush to perform writeback of the offending zone(s) rather than (or as well as) doing it directly. The only reason pdflush can successfully do that is because pdflush can allocate its requests from other zones. AFAIK any filesystem/block device can go oom with the current broken writeback if it just does a few allocations. It's a matter of hitting the sweet spots. That shouldn't be possible, in theory. Block IO is supposed to succeed if *all memory in the machine is dirty*: the old dirty-everything-with-MAP_SHARED-then-exit problem. Lots of testing went into that and it works. It also failed on NFS although I thought that got fixed a year or so ago. Apparently not. But we also can get into trouble if a *zone* is all-dirty. Any solution to the cpuset problem should solve that problem too, no? Nope. Why would a dirty zone pose a problem? The problem exists if you cannot allocate more memory. Well one example would be a GFP_KERNEL allocation on a highmem machine in which all of ZONE_NORMAL is dirty. If a cpuset contains a single node which is a single zone then this patchset will also address that issue. If we have multiple zones then other zones may still provide memory to continue (same as in UP). Not if all the eligible zones are all-dirty. Yes, but when we enter reclaim most of the pages of a zone may already be dirty/writeback so we fail. No. If the dirty limits become per-zone then no zone will ever have 40% dirty. I am still confused as to why you would want per-zone dirty limits? The need for that has yet to be demonstrated. There _might_ be a problem, but we need test cases and analyses to demonstrate that need. Right now, what we have is an NFS bug. How about we fix it, then reevaluate the situation? A good starting point would be to show us one of these oom-killer traces.
Let's say we have a cpuset with 4 nodes (thus 4 zones) and we are running on the first node. Then we copy a large file to disk. Node-local allocation means that we allocate from the first node. After we reach 40% of the node then we throttle? This is going to be a significant performance degradation since we can no longer use the memory of other nodes to buffer writeout. That was what I was referring to.
Re: [RFC 0/8] Cpuset aware writeback
On Tue, 16 Jan 2007, Andrew Morton wrote: Nope. You've completely omitted the little fact that we'll do writeback in the offending zone off the LRU. Slower, maybe. But it should work and the system should recover. If it's not doing that (it isn't) then we should fix it rather than avoiding it (by punting writeback over to pdflush). pdflush is not running *at all* nor is dirty throttling working. That is correct behavior? We could do background writeback but we choose not to do so? Instead we wait until we hit reclaim and then block (well, it seems that we do not block; the blocking there also fails since we again check global ratios)? The patchset does not allow processes to allocate from other nodes than the current cpuset. Yes it does. It asks pdflush to perform writeback of the offending zone(s) rather than (or as well as) doing it directly. The only reason pdflush can successfully do that is because pdflush can allocate its requests from other zones. Ok, pdflush is able to do that. Still the application is not able to extend its memory beyond the cpuset. What about writeback throttling? There it all breaks down. The cpuset is effective and we are unable to allocate any more memory. The reason this works is because not all of memory is dirty. Thus reclaim will be able to free up memory or there is enough memory free. AFAIK any filesystem/block device can go oom with the current broken writeback if it just does a few allocations. It's a matter of hitting the sweet spots. That shouldn't be possible, in theory. Block IO is supposed to succeed if *all memory in the machine is dirty*: the old dirty-everything-with-MAP_SHARED-then-exit problem. Lots of testing went into that and it works. It also failed on NFS although I thought that got fixed a year or so ago. Apparently not. Humm... Really? Nope. Why would a dirty zone pose a problem? The problem exists if you cannot allocate more memory.
Well one example would be a GFP_KERNEL allocation on a highmem machine in which all of ZONE_NORMAL is dirty. That is a restricted allocation which will lead to reclaim. If we have multiple zones then other zones may still provide memory to continue (same as in UP). Not if all the eligible zones are all-dirty. They are all dirty if we do not throttle the dirty pages. Right now, what we have is an NFS bug. How about we fix it, then reevaluate the situation? The NFS bug only exists when using a cpuset. If you run NFS without cpusets then the throttling will kick in and everything is fine. A good starting point would be to show us one of these oom-killer traces. No traces. Since the process is killed within a cpuset we only get messages like: Nov 28 16:19:52 ic4 kernel: Out of Memory: Kill process 679783 (ncks) score 0 and children. Nov 28 16:19:52 ic4 kernel: No available memory in cpuset: Killed process 679783 (ncks). Nov 28 16:27:58 ic4 kernel: oom-killer: gfp_mask=0x200d2, order=0 Probably need to rerun these with some patches. Let's say we have a cpuset with 4 nodes (thus 4 zones) and we are running on the first node. Then we copy a large file to disk. Node-local allocation means that we allocate from the first node. After we reach 40% of the node then we throttle? This is going to be a significant performance degradation since we can no longer use the memory of other nodes to buffer writeout. That was what I was referring to. Note that this was describing the behavior you wanted, not the way things work. It is desired behavior not to use all the memory resources of the cpuset and slow down the system?
Re: [RFC 0/8] Cpuset aware writeback
On Tue, 16 Jan 2007 17:30:26 -0800 (PST) Christoph Lameter [EMAIL PROTECTED] wrote: Nope. You've completely omitted the little fact that we'll do writeback in the offending zone off the LRU. Slower, maybe. But it should work and the system should recover. If it's not doing that (it isn't) then we should fix it rather than avoiding it (by punting writeback over to pdflush). pdflush is not running *at* all nor is dirty throttling working. That is correct behavior? We could do background writeback but we choose not to do so? Instead we wait until we hit reclaim and then block (well it seems that we do not block the blocking there also fails since we again check global ratios)? I agree that it is a worthy objective to be able to constrain a cpuset's dirty memory levels. But as a performance optimisation and NOT as a correctness fix. Consider: non-exclusive cpuset A consists of mems 0-15, non-exclusive cpuset B consists of mems 0-3. A task running in cpuset A can freely dirty all of cpuset B's memory. A task running in cpuset B gets oomkilled. Consider: a 32-node machine has nodes 0-3 full of dirty memory. I create a cpuset containing nodes 0-2 and start using it. I get oomkilled. There may be other scenarios. IOW, we have a correctness problem, and we have a probable, not-yet-demonstrated-and-quantified performance problem. Fixing the latter (in the proposed fashion) will *not* fix the former. So what I suggest we do is to fix the NFS bug, then move on to considering the performance problems. On reflection, I agree that your proposed changes are sensible-looking for addressing the probable, not-yet-demonstrated-and-quantified performance problem. The per-inode (should be per-address_space, maybe it is?) node map is unfortunate. Need to think about that a bit more. For a start, it should be dynamically allocated (from a new, purpose-created slab cache): most in-core inodes don't have any dirty pages and don't need this additional storage. 
Also, I worry about the worst-case performance of that linear search across the inodes. But this is unrelated to the NFS bug ;)
Re: [RFC 0/8] Cpuset aware writeback
On Tue, 16 Jan 2007, Andrew Morton wrote: Consider: non-exclusive cpuset A consists of mems 0-15, non-exclusive cpuset B consists of mems 0-3. A task running in cpuset A can freely dirty all of cpuset B's memory. A task running in cpuset B gets oomkilled. Consider: a 32-node machine has nodes 0-3 full of dirty memory. I create a cpuset containing nodes 0-2 and start using it. I get oomkilled. There may be other scenarios. Yes this is the result of the hierarchical nature of cpusets which already causes issues with the scheduler. It is rather typical that cpusets are used to partition the memory and cpus. Overlapping cpusets seem to have mainly an administrative function. Paul? So what I suggest we do is to fix the NFS bug, then move on to considering the performance problems. The NFS bug has been there for ages and no one cares since write throttling works effectively. Since NFS can go via any network technology (f.e. infiniband) we have many potential issues at that point that depend on the underlying network technology. As far as I can recall we decided that these stacking issues are inherently problematic and basically unsolvable. On reflection, I agree that your proposed changes are sensible-looking for addressing the probable, not-yet-demonstrated-and-quantified performance problem. The per-inode (should be per-address_space, maybe it is?) node The address space is part of the inode. Some of my development versions had the dirty_map in the address space. However, the end of the inode was a convenient place for a runtime-sized nodemask. map is unfortunate. Need to think about that a bit more. For a start, it should be dynamically allocated (from a new, purpose-created slab cache): most in-core inodes don't have any dirty pages and don't need this additional storage.
But this is unrelated to the NFS bug ;) Looks more like a design issue (given its layering on top of the networking layer) and not a bug. The bug surfaces when writeback is not done properly. I wonder what happens if other filesystems are pushed to the border of the dirty abyss. The mmap tracking fixes that were done in 2.6.19 were done because of similar symptoms: the system's dirty tracking was off. This is fundamentally the same issue showing up in a cpuset. So we should be able to produce the hangs (looks ... yes, another customer-reported issue on this one is that reclaim is continually running and we basically livelock the system) that we saw for the mmap dirty tracking issues in addition to the NFS problems seen so far. Memory allocation is required in most filesystem flush paths. If we cannot allocate memory then we cannot clean pages and thus we continue trying: livelock. I still see this as a fundamental correctness issue in the kernel.
Re: [RFC 0/8] Cpuset aware writeback
Yes this is the result of the hierarchical nature of cpusets which already causes issues with the scheduler. It is rather typical that cpusets are used to partition the memory and cpus. Overlapping cpusets seem to have mainly an administrative function. Paul? The heavy weight tasks, which are expected to be applying serious memory pressure (whether for data pages or dirty file pages), are usually in non-overlapping cpusets, or sharing the same cpuset, but not partially overlapping with, or a proper superset of, some other cpuset holding an active job. The higher level cpusets, such as the top cpuset, or the one deeded over to the Batch Scheduler, are proper supersets of many other cpusets. We avoid putting anything heavy weight in those cpusets. Sometimes of course a task turns out to be unexpectedly heavy weight. But in that case, we're mostly interested in function (system keeps running), not performance. That is, if someone set up what Andrew described, with a job in a large cpuset sucking up all available memory from one in a smaller, contained cpuset, I don't think I'm tuning for optimum performance anymore. Rather I'm just trying to keep the system running and keep unrelated jobs unaffected while we dig our way out of the hole. If the smaller job OOM's, that's tough nuggies. They asked for it. Jobs in -unrelated- (non-overlapping) cpusets should ride out the storm with little or no impact on their performance. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson [EMAIL PROTECTED] 1.925.600.0401
Re: [RFC 0/8] Cpuset aware writeback
On Tue, 16 Jan 2007 19:40:17 -0800 (PST) Christoph Lameter [EMAIL PROTECTED] wrote: On Tue, 16 Jan 2007, Andrew Morton wrote: Consider: non-exclusive cpuset A consists of mems 0-15, non-exclusive cpuset B consists of mems 0-3. A task running in cpuset A can freely dirty all of cpuset B's memory. A task running in cpuset B gets oomkilled. Consider: a 32-node machine has nodes 0-3 full of dirty memory. I create a cpuset containing nodes 0-2 and start using it. I get oomkilled. There may be other scenarios. Yes this is the result of the hierarchical nature of cpusets which already causes issues with the scheduler. It is rather typical that cpusets are used to partition the memory and cpus. Overlapping cpusets seem to have mainly an administrative function. Paul? The typical usage scenarios don't matter a lot: the examples I gave show that the core problem remains unsolved. People can still hit the bug. So what I suggest we do is to fix the NFS bug, then move on to considering the performance problems. The NFS bug has been there for ages and no one cares since write throttling works effectively. Since NFS can go via any network technology (f.e. infiniband) we have many potential issues at that point that depend on the underlying network technology. As far as I can recall we decided that these stacking issues are inherently problematic and basically unsolvable. The problem you refer to arises from the inability of the net driver to allocate memory for an outbound ack. Such allocations aren't constrained to a cpuset. I expect that we can solve the NFS oom problem along the same lines as block devices. Certainly it's dumb of us to oom-kill a process rather than going off-cpuset for a small and short-lived allocation. It's also dumb of us to allocate a basically unbounded number of nfs requests rather than waiting for some of the ones which we _have_ allocated to complete.
On reflection, I agree that your proposed changes are sensible-looking for addressing the probable, not-yet-demonstrated-and-quantified performance problem. The per-inode (should be per-address_space, maybe it is?) node The address space is part of the inode. Physically, yes. Logically, it is not. The address_space controls the data-plane part of a file and is the appropriate place in which to store this nodemask. Some of my development versions had the dirty_map in the address space. However, the end of the inode was a convenient place for a runtime-sized nodemask. map is unfortunate. Need to think about that a bit more. For a start, it should be dynamically allocated (from a new, purpose-created slab cache): most in-core inodes don't have any dirty pages and don't need this additional storage. We also considered such an approach. However, it creates the problem of performing a slab allocation while dirtying pages. At that point we do not have an allocation context, nor can we block. Yes, it must be an atomic allocation. If it fails, we don't care. Chances are it'll succeed when the next page in this address_space gets dirtied. Plus we don't waste piles of memory on read-only files. But this is unrelated to the NFS bug ;) Looks more like a design issue (given its layering on top of the networking layer) and not a bug. The bug surfaces when writeback is not done properly. I wonder what happens if other filesystems are pushed to the border of the dirty abyss. The mmap tracking fixes that were done in 2.6.19 were done because of similar symptoms: the system's dirty tracking was off. This is fundamentally the same issue showing up in a cpuset. So we should be able to produce the hangs (looks ... yes, another customer-reported issue on this one is that reclaim is continually running and we basically livelock the system) that we saw for the mmap dirty tracking issues in addition to the NFS problems seen so far.
Memory allocation is required in most filesystem flush paths. If we cannot allocate memory then we cannot clean pages and thus we continue trying: livelock. I still see this as a fundamental correctness issue in the kernel. I'll believe all that once someone has got down and tried to fix NFS, and has failed ;)
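The lazily allocated per-address_space dirty-node map discussed above (atomic allocation, failure tolerated) might look roughly like this userspace sketch; the types and the calloc stand-in for kmem_cache_alloc(..., GFP_ATOMIC) are illustrative only:

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_NODES 64
#define BITS_PER_LONG (8 * sizeof(unsigned long))

struct dirty_map {
    unsigned long bits[MAX_NODES / BITS_PER_LONG + 1];
};

struct mapping {
    struct dirty_map *dirty_nodes;  /* NULL until a page is dirtied */
};

/* Record that 'node' holds a dirty page of this mapping.  The map is
 * allocated on first use; an allocation failure is simply tolerated,
 * since the next dirtied page will retry. */
static void mapping_set_dirty_node(struct mapping *m, int node)
{
    if (!m->dirty_nodes) {
        /* Stand-in for an atomic slab allocation: may fail. */
        m->dirty_nodes = calloc(1, sizeof(*m->dirty_nodes));
        if (!m->dirty_nodes)
            return;  /* don't care: retried on the next dirtied page */
    }
    m->dirty_nodes->bits[node / BITS_PER_LONG] |= 1UL << (node % BITS_PER_LONG);
}
```

Read-only inodes never pay for the nodemask, which addresses the storage concern; writeback can then walk inodes and skip any whose map does not intersect the cpuset's nodes.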
Re: [RFC 0/8] Cpuset aware writeback
On Tue, 16 Jan 2007, Andrew Morton wrote: Yes this is the result of the hierarchical nature of cpusets which already causes issues with the scheduler. It is rather typical that cpusets are used to partition the memory and cpus. Overlapping cpusets seem to have mainly an administrative function. Paul? The typical usage scenarios don't matter a lot: the examples I gave show that the core problem remains unsolved. People can still hit the bug. I agree the overlap issue is a problem and I hope it can be addressed somehow for the rare cases in which such nesting takes place. One easy solution may be to check the dirty ratio before engaging in reclaim. If the dirty ratio is sufficiently high then trigger writeout via pdflush (we already wake up pdflush while scanning, and you already noted that pdflush writeout is not occurring within the context of the current cpuset) and pass over any dirty pages during LRU scans until some pages have been cleaned up. This means we allow allocation of additional kernel memory outside of the cpuset while triggering writeout of inodes that have pages on the nodes of the cpuset. The memory directly used by the application is still limited. Just the temporary information needed for writeback is allocated outside. Well, it still sounds somewhat like a hack. Any other ideas out there?
Re: [RFC 0/8] Cpuset aware writeback
On Tue, 16 Jan 2007 22:27:36 -0800 (PST) Christoph Lameter [EMAIL PROTECTED] wrote: On Tue, 16 Jan 2007, Andrew Morton wrote: Yes this is the result of the hierarchical nature of cpusets which already causes issues with the scheduler. It is rather typical that cpusets are used to partition the memory and cpus. Overlapping cpusets seem to have mainly an administrative function. Paul? The typical usage scenarios don't matter a lot: the examples I gave show that the core problem remains unsolved. People can still hit the bug. I agree the overlap issue is a problem and I hope it can be addressed somehow for the rare cases in which such nesting takes place. One easy solution may be to check the dirty ratio before engaging in reclaim. If the dirty ratio is sufficiently high then trigger writeout via pdflush (we already wake up pdflush while scanning, and you already noted that pdflush writeout is not occurring within the context of the current cpuset) and pass over any dirty pages during LRU scans until some pages have been cleaned up. This means we allow allocation of additional kernel memory outside of the cpuset while triggering writeout of inodes that have pages on the nodes of the cpuset. The memory directly used by the application is still limited. Just the temporary information needed for writeback is allocated outside. Gad. None of that should be necessary. Well, it still sounds somewhat like a hack. Any other ideas out there? Do what blockdevs do: limit the number of in-flight requests (Peter's recent patch seems to be doing that for us) (perhaps only when PF_MEMALLOC is in effect, to keep Trond happy) and implement a mempool for the NFS request critical store. Additionally:
- we might need to twiddle the NFS gfp_flags so it doesn't call the oom-killer on failure: just return NULL.
- consider going off-cpuset for critical allocations. It's better than going oom. A suitable implementation might be to ignore the caller's cpuset if PF_MEMALLOC. 
Maybe put a WARN_ON_ONCE in there: we prefer that it not happen and we want to know when it does. btw, regarding the per-address_space node mask: I think we should free it when the inode is clean (!mapping_tagged(PAGECACHE_TAG_DIRTY)). Chances are, the inode will be dirty for 30 seconds and in-core for hours. We might as well steal its nodemask storage and give it to the next file which gets written to. A suitable place to do all this is in __mark_inode_dirty(I_DIRTY_PAGES), using inode_lock to protect address_space.dirty_page_nodemask.
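The free-on-clean lifecycle proposed here (allocate lazily on the clean-to-dirty transition, release the storage as soon as the mapping has no more dirty pages) can be modeled the same way. All names below are hypothetical stand-ins; in the kernel the transitions would happen in __mark_inode_dirty() and under inode_lock:

```c
#include <stdlib.h>

/* Userspace model of the proposed nodemask lifecycle.  The names are
 * invented for illustration, not taken from the patch. */
struct addrspace_model {
	unsigned long *dirty_nodes;  /* present only while pages are dirty */
};

/* Clean->dirty transition: best-effort allocation, then record the
 * node.  On allocation failure we silently skip; a later dirtying
 * retries. */
static void as_became_dirty(struct addrspace_model *a, int node)
{
	if (!a->dirty_nodes)
		a->dirty_nodes = calloc(1, sizeof(unsigned long));
	if (a->dirty_nodes)
		*a->dirty_nodes |= 1UL << node;
}

/* Dirty->clean transition: steal the storage back, since the inode may
 * stay in core for hours while only being dirty for seconds. */
static void as_became_clean(struct addrspace_model *a)
{
	free(a->dirty_nodes);
	a->dirty_nodes = NULL;
}
```

The point of the design is exactly what Andrew notes: the nodemask lives only for the short dirty window, so long-lived clean inodes cost nothing.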
Re: [RFC 0/8] Cpuset aware writeback
On Mon, 2007-01-15 at 21:47 -0800, Christoph Lameter wrote:
> Currently cpusets are not able to do proper writeback since
> dirty ratio calculations and writeback are all done for the system
> as a whole. This may result in a large percentage of a cpuset
> becoming dirty without writeout being triggered. Under NFS
> this can lead to OOM conditions.
>
> Writeback will occur during the LRU scans. But such writeout
> is not effective since we write page by page and not in inode page
> order (regular writeback).
>
> In order to fix the problem we first of all introduce a method to
> establish a map of nodes that contain dirty pages for each
> inode mapping.
>
> Secondly we modify the dirty limit calculation to be based
> on the active cpuset.
>
> If we are in a cpuset then we select only inodes for writeback
> that have pages on the nodes of the cpuset.
>
> After we have the cpuset throttling in place we can then make
> further fixups:
>
> A. We can do inode based writeout from direct reclaim
>    avoiding single page writes to the filesystem.
>
> B. We add a new counter NR_UNRECLAIMABLE that is subtracted
>    from the available pages in a node. This allows us to
>    accurately calculate the dirty ratio even if large portions
>    of the node have been allocated for huge pages or for
>    slab pages.

What about mlock'ed pages?

> There are a couple of points where some better ideas could be used:
>
> 1. The nodemask expands the inode structure significantly if the
> architecture allows a high number of nodes. This is only an issue
> for IA64. For that platform we expand the inode structure by 128 bytes
> (to support 1024 nodes). The last patch attempts to address the issue
> by using the knowledge about the maximum possible number of nodes
> determined on bootup to shrink the nodemask.

Not the prettiest indeed, no ideas though.

> 2. 
> The calculation of the per cpuset limits can require looping
> over a number of nodes which may bring the performance of get_dirty_limits
> near pre-2.6.18 performance (before the introduction of the ZVC counters)
> (only for cpuset based limit calculation). There is no way of keeping these
> counters per cpuset since cpusets may overlap.

Well, you gain functionality, you lose some runtime; sad, but probably worth it. Otherwise it all looks good.

Acked-by: Peter Zijlstra <[EMAIL PROTECTED]>
[RFC 0/8] Cpuset aware writeback
Currently cpusets are not able to do proper writeback since dirty ratio calculations and writeback are all done for the system as a whole. This may result in a large percentage of a cpuset becoming dirty without writeout being triggered. Under NFS this can lead to OOM conditions.

Writeback will occur during the LRU scans. But such writeout is not effective since we write page by page and not in inode page order (regular writeback).

In order to fix the problem we first of all introduce a method to establish a map of nodes that contain dirty pages for each inode mapping. Secondly we modify the dirty limit calculation to be based on the active cpuset. If we are in a cpuset then we select only inodes for writeback that have pages on the nodes of the cpuset.

After we have the cpuset throttling in place we can then make further fixups:

A. We can do inode based writeout from direct reclaim, avoiding single page writes to the filesystem.

B. We add a new counter NR_UNRECLAIMABLE that is subtracted from the available pages in a node. This allows us to accurately calculate the dirty ratio even if large portions of the node have been allocated for huge pages or for slab pages.

There are a couple of points where some better ideas could be used:

1. The nodemask expands the inode structure significantly if the architecture allows a high number of nodes. This is only an issue for IA64. For that platform we expand the inode structure by 128 bytes (to support 1024 nodes). The last patch attempts to address the issue by using the knowledge about the maximum possible number of nodes determined on bootup to shrink the nodemask.

2. The calculation of the per cpuset limits can require looping over a number of nodes, which may bring the performance of get_dirty_limits near pre-2.6.18 performance (before the introduction of the ZVC counters) (only for cpuset based limit calculation). There is no way of keeping these counters per cpuset since cpusets may overlap. 
Paul probably needs to go through this and may want additional fixes to keep things in harmony with cpusets. Tested on: IA64 NUMA 128p, 12p. Compiles on: i386 SMP, x86_64 UP.
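Point 2 above (the per-cpuset limit that must loop over the cpuset's nodes, because overlapping cpusets rule out a single per-cpuset counter) can be illustrated with a short userspace model. The counter arrays and the percentages are invented for the example; in the kernel the per-node figures would come from the ZVC counters:

```c
#define MAX_NODES 8

/* Invented per-node counters standing in for the kernel's ZVC data. */
static unsigned long node_pages[MAX_NODES];   /* dirtyable pages */
static unsigned long node_dirty[MAX_NODES];   /* dirty pages */

/* Compute the dirty percentage over just the nodes of a cpuset, given
 * as a bitmask.  Because cpusets may overlap, this sum cannot be kept
 * as one per-cpuset counter and must be recomputed by looping over the
 * member nodes, which is the cost the RFC warns about. */
static int cpuset_dirty_percent(unsigned long node_mask)
{
	unsigned long pages = 0, dirty = 0;
	int node;

	for (node = 0; node < MAX_NODES; node++) {
		if (!(node_mask & (1UL << node)))
			continue;
		pages += node_pages[node];
		dirty += node_dirty[node];
	}
	return pages ? (int)(dirty * 100 / pages) : 0;
}
```

Two overlapping cpusets can share a node: each one's loop simply includes that node's counters in its own sum, which is exactly why no single per-cpuset running counter can be maintained.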