Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
On Fri, May 3, 2013 at 5:11 PM, Michal Hocko wrote:
> On Wed 02-01-13 11:44:21, Michal Hocko wrote:
>> On Wed 26-12-12 01:26:07, Sha Zhengju wrote:
>> > From: Sha Zhengju
>> >
>> > This patch adds memcg routines to count dirty pages, which allows the
>> > memory controller to maintain an accurate view of the amount of its
>> > dirty memory and can provide some info for users while cgroup's direct
>> > reclaim is working.
>>
>> I guess you meant targeted resp. (hard/soft) limit reclaim here,
>> right? It is true that this is direct reclaim but it is not clear to me
>> why the usefulness should be limited to the reclaim for users. I would
>> understand this if the users were in fact in-kernel users.
>>
>> [...]
>> > To prevent the AB/BA deadlock mentioned by Greg Thelen in the previous
>> > version (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
>> > ->private_lock --> mapping->tree_lock --> memcg->move_lock.
>> > So we need to take mapping->tree_lock ahead of TestSetPageDirty in
>> > __set_page_dirty() and __set_page_dirty_nobuffers(). But in order to
>> > avoid useless spinlock contention, a preparatory PageDirty() check is
>> > added.
>>
>> But there is another AA deadlock here I believe:
>> page_remove_rmap
>>   mem_cgroup_begin_update_page_stat   <<< 1
>>   set_page_dirty
>>     __set_page_dirty_buffers
>>       __set_page_dirty
>>         mem_cgroup_begin_update_page_stat   <<< 2
>>           move_lock_mem_cgroup
>>             spin_lock_irqsave(&memcg->move_lock, *flags);
>
> JFYI, since abf09bed ("s390/mm: implement software dirty bits") this is
> no longer possible. I haven't checked whether there are other cases like
> this one, and it would be better if mem_cgroup_begin_update_page_stat
> were recursion safe, if that can be done without too many hacks.
> I will have a look at this (hopefully) sometime next week.

Hi Michal,

I'm sorry for not being able to return to this problem immediately after
LSF/MM. That is good news.

IIRC, it's the only place where we have encountered the recursion problem
in accounting memcg dirty pages. But I'll try to revive my previous work
of simplifying the mem_cgroup_begin_update_page_stat() lock. I'll get back
to it in the next few days.

--
Thanks,
Sha
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
On Wed 02-01-13 11:44:21, Michal Hocko wrote:
> On Wed 26-12-12 01:26:07, Sha Zhengju wrote:
> > From: Sha Zhengju
> >
> > This patch adds memcg routines to count dirty pages, which allows the
> > memory controller to maintain an accurate view of the amount of its
> > dirty memory and can provide some info for users while cgroup's direct
> > reclaim is working.
>
> I guess you meant targeted resp. (hard/soft) limit reclaim here,
> right? It is true that this is direct reclaim but it is not clear to me
> why the usefulness should be limited to the reclaim for users. I would
> understand this if the users were in fact in-kernel users.
>
> [...]
> > To prevent the AB/BA deadlock mentioned by Greg Thelen in the previous
> > version (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
> > ->private_lock --> mapping->tree_lock --> memcg->move_lock.
> > So we need to take mapping->tree_lock ahead of TestSetPageDirty in
> > __set_page_dirty() and __set_page_dirty_nobuffers(). But in order to
> > avoid useless spinlock contention, a preparatory PageDirty() check is
> > added.
>
> But there is another AA deadlock here I believe:
> page_remove_rmap
>   mem_cgroup_begin_update_page_stat   <<< 1
>   set_page_dirty
>     __set_page_dirty_buffers
>       __set_page_dirty
>         mem_cgroup_begin_update_page_stat   <<< 2
>           move_lock_mem_cgroup
>             spin_lock_irqsave(&memcg->move_lock, *flags);

JFYI, since abf09bed ("s390/mm: implement software dirty bits") this is no
longer possible. I haven't checked whether there are other cases like this
one, and it would be better if mem_cgroup_begin_update_page_stat were
recursion safe, if that can be done without too many hacks.
I will have a look at this (hopefully) sometime next week.

[...]
--
Michal Hocko
SUSE Labs
Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
On Thu, Jan 10, 2013 at 1:03 PM, Kamezawa Hiroyuki wrote:
> (2013/01/10 13:26), Sha Zhengju wrote:
>
>> But this method also has its pros and cons (e.g. it needs lock nesting).
>> So I wonder whether the following is able to deal with these issues all
>> together (CPU-A does "page stat accounting" and CPU-B does "move"):
>>
>> CPU-A                                CPU-B
>>
>> move_lock_mem_cgroup()
>> memcg = pc->mem_cgroup
>> SetPageDirty(page)
>> move_unlock_mem_cgroup()
>>                                      move_lock_mem_cgroup()
>>                                      if (PageDirty) {
>>                                          old_memcg->nr_dirty--;
>>                                          new_memcg->nr_dirty++;
>>                                      }
>>                                      pc->mem_cgroup = new_memcg
>>                                      move_unlock_mem_cgroup()
>>
>> memcg->nr_dirty++
>>
>> For CPU-A, we save pc->mem_cgroup in a temporary variable just before
>> SetPageDirty inside move_lock, and then update the stats if the page is
>> set PG_dirty successfully. CPU-B may do the "moving" in advance, so
>> "old_memcg->nr_dirty--" will make old_memcg->nr_dirty incorrect, but
>> soon CPU-A will do "memcg->nr_dirty++" on its heels, which amends the
>> stats. However, there is a potential problem that old_memcg->nr_dirty
>> may go negative for a very short period, but that is not a big issue
>> IMHO.
>
> IMHO, this will work. Please take care that the recorded memcg will not
> be an invalid pointer when you update nr_dirty later.
> (Maybe RCU will protect it.)

Yes, there are 3 places that change pc->mem_cgroup: charge, uncharge and
move_account. "charge" has no race with the stat updater and "uncharge"
doesn't reset pc->mem_cgroup directly; "move_account" is just the one we
are handling, so they should do no harm here. Meanwhile, an invalid
pointer caused by cgroup deletion may also be avoided by RCU. Yet it's a
rough conclusion from a quick look...

> _If_ this method can handle the "nesting" problem more cleanly and make
> the implementation simpler, please go ahead. To be honest, I'm not sure
> how the code will be until seeing the patch.

Okay, later I'll try to propose the patch.

> Hmm, why did you write SetPageDirty() here rather than TestSetPageDirty()?

No particular reason... TestSetPageDirty() may be more precise... : )

--
Thanks,
Sha
Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
(2013/01/10 13:26), Sha Zhengju wrote:
> But this method also has its pros and cons (e.g. it needs lock nesting).
> So I wonder whether the following is able to deal with these issues all
> together (CPU-A does "page stat accounting" and CPU-B does "move"):
>
> CPU-A                                CPU-B
>
> move_lock_mem_cgroup()
> memcg = pc->mem_cgroup
> SetPageDirty(page)
> move_unlock_mem_cgroup()
>                                      move_lock_mem_cgroup()
>                                      if (PageDirty) {
>                                          old_memcg->nr_dirty--;
>                                          new_memcg->nr_dirty++;
>                                      }
>                                      pc->mem_cgroup = new_memcg
>                                      move_unlock_mem_cgroup()
>
> memcg->nr_dirty++
>
> For CPU-A, we save pc->mem_cgroup in a temporary variable just before
> SetPageDirty inside move_lock, and then update the stats if the page is
> set PG_dirty successfully. But CPU-B may do the "moving" in advance, so
> "old_memcg->nr_dirty--" will make old_memcg->nr_dirty incorrect, but
> soon CPU-A will do "memcg->nr_dirty++" on its heels, which amends the
> stats. However, there is a potential problem that old_memcg->nr_dirty
> may go negative for a very short period, but that is not a big issue
> IMHO.

IMHO, this will work. Please take care that the recorded memcg will not be
an invalid pointer when you update nr_dirty later.
(Maybe RCU will protect it.)

_If_ this method can handle the "nesting" problem more cleanly and make
the implementation simpler, please go ahead. To be honest, I'm not sure
how the code will be until seeing the patch.

Hmm, why did you write SetPageDirty() here rather than TestSetPageDirty()?

Thanks,
-Kame
Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
On Thu, Jan 10, 2013 at 10:16 AM, Kamezawa Hiroyuki wrote:
> (2013/01/10 0:02), Sha Zhengju wrote:
>> On Mon, Jan 7, 2013 at 3:25 PM, Kamezawa Hiroyuki wrote:
>>> (2013/01/05 13:48), Sha Zhengju wrote:
>>>> On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko wrote:
>>>>> On Wed 26-12-12 01:26:07, Sha Zhengju wrote:
>>>>>> From: Sha Zhengju
>>>>>>
>>>>>> This patch adds memcg routines to count dirty pages, which allows
>>>>>> the memory controller to maintain an accurate view of the amount of
>>>>>> its dirty memory and can provide some info for users while cgroup's
>>>>>> direct reclaim is working.
>>>>>
>>>>> I guess you meant targeted resp. (hard/soft) limit reclaim here,
>>>>> right? It is true that this is direct reclaim but it is not clear
>>>>> to me
>>>>
>>>> Yes, I meant memcg hard/soft reclaim here, which is triggered
>>>> directly by allocation and is distinct from background kswapd
>>>> reclaim (global).
>>>>
>>>>> why the usefulness should be limited to the reclaim for users. I
>>>>> would understand this if the users were in fact in-kernel users.
>>>>
>>>> One of the reasons I'm trying to account the dirty pages is to get a
>>>> broader overall view of memory usage, because memcg hard/soft reclaim
>>>> may affect the response time of user applications.
>>>> Yeah, the beneficiary can be the application administrator or kernel
>>>> users. :P
>>>>
>>>>> [...]
>>>>>> To prevent the AB/BA deadlock mentioned by Greg Thelen in the
>>>>>> previous version (https://lkml.org/lkml/2012/7/30/227), we adjust
>>>>>> the lock order:
>>>>>> ->private_lock --> mapping->tree_lock --> memcg->move_lock.
>>>>>> So we need to take mapping->tree_lock ahead of TestSetPageDirty in
>>>>>> __set_page_dirty() and __set_page_dirty_nobuffers(). But in order
>>>>>> to avoid useless spinlock contention, a preparatory PageDirty()
>>>>>> check is added.
>>>>>
>>>>> But there is another AA deadlock here I believe:
>>>>> page_remove_rmap
>>>>>   mem_cgroup_begin_update_page_stat   <<< 1
>>>>>   set_page_dirty
>>>>>     __set_page_dirty_buffers
>>>>>       __set_page_dirty
>>>>>         mem_cgroup_begin_update_page_stat   <<< 2
>>>>>           move_lock_mem_cgroup
>>>>>             spin_lock_irqsave(&memcg->move_lock, *flags);
>>>>>
>>>>> mem_cgroup_begin_update_page_stat is not recursive wrt. locking
>>>>> AFAICS because we might race with the moving charges:
>>>>>
>>>>> CPU0                                    CPU1
>>>>> page_remove_rmap
>>>>>                                         mem_cgroup_can_attach
>>>>> mem_cgroup_begin_update_page_stat (1)
>>>>>   rcu_read_lock
>>>>>                                         mem_cgroup_start_move
>>>>>                                           atomic_inc(&memcg_moving)
>>>>>                                           atomic_inc(&memcg->moving_account)
>>>>>                                           synchronize_rcu
>>>>>   __mem_cgroup_begin_update_page_stat
>>>>>     mem_cgroup_stolen   <<< TRUE
>>>>>     move_lock_mem_cgroup
>>>>> [...]
>>>>> mem_cgroup_begin_update_page_stat (2)
>>>>>   __mem_cgroup_begin_update_page_stat
>>>>>     mem_cgroup_stolen   <<< still TRUE
>>>>>     move_lock_mem_cgroup   <<< DEADLOCK
>>>>> [...]
>>>>> mem_cgroup_end_update_page_stat
>>>>>   rcu_unlock
>>>>>                                         # wake up from synchronize_rcu
>>>>>                                         [...]
>>>>>                                         mem_cgroup_move_task
>>>>>                                           mem_cgroup_move_charge
>>>>>                                             walk_page_range
>>>>>                                               mem_cgroup_move_account
>>>>>                                                 move_lock_mem_cgroup
>>>>>
>>>>> Maybe I have missed some other locking which would prevent this from
>>>>> happening, but the locking relations are really complicated in this
>>>>> area, so if mem_cgroup_{begin,end}_update_page_stat might be called
>>>>> recursively then we need a fat comment which justifies that.
>>>>
>>>> Ohhh... good catch! I didn't notice there is a recursive call of
>>>> mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap().
>>>> The mem_cgroup_{begin,end}_update_page_stat() design has depressed me
>>>> a lot recently, as the lock granularity is a little bigger than I
>>>> thought. Not only the resource but also some code logic is in the
>>>> range of locking, which may be deadlock prone. The problem still
>>>> exists if we try to add stat accounting of other memcg pages later,
>>>> so may I make bold to suggest that we dig into the lock again...
>>>>
>>>> But with regard to the current lock implementation, I doubt whether
>>>> we can account MEM_CGROUP_STAT_FILE_{MAPPED, DIRTY} in one breath and
>>>> just try to get move_lock once in the beginning. IMHO we can make
>>>> mem_cgroup_{begin,end}_update_page_stat() recursion aware, and what
>>>> I'm thinking now is changing memcg->move_lock from the original
>>>> spinlock to a rw-spinlock: mem_cgroup_{begin,end}_update_page_stat()
>>>> would take the read lock, which makes it reentrant, and the memcg
>>>> task-moving side would take the write lock.
Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
On Wed 09-01-13 22:35:12, Sha Zhengju wrote:
[...]
> To my knowledge, each task is forked in the root memcg, and there's a
> moving while attaching it to a cgroup. So move_account is also a
> frequent behavior to some extent.

Not really. Every fork/exec copies the current group (see cgroup_fork),
so there is no moving on that path.

[...]
--
Michal Hocko
SUSE Labs
Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
Hi Hugh, On Mon, Jan 7, 2013 at 4:02 AM, Hugh Dickins wrote: > On Sat, 5 Jan 2013, Sha Zhengju wrote: >> On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko wrote: >> > >> > Maybe I have missed some other locking which would prevent this from >> > happening but the locking relations are really complicated in this area >> > so if mem_cgroup_{begin,end}_update_page_stat might be called >> > recursively then we need a fat comment which justifies that. >> > >> >> Ohhh...good catching! I didn't notice there is a recursive call of >> mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap(). >> The mem_cgroup_{begin,end}_update_page_stat() design has depressed >> me a lot recently as the lock granularity is a little bigger than I thought. >> Not only the resource but also some code logic is in the range of locking >> which may be deadlock prone. The problem still exists if we are trying to >> add stat account of other memcg page later, may I make bold to suggest >> that we dig into the lock again... > > Forgive me, I must confess I'm no more than skimming this thread, > and don't like dumping unsigned-off patches on people; but thought > that on balance it might be more helpful than not if I offer you a > patch I worked on around 3.6-rc2 (but have updated to 3.8-rc2 below). Thanks for your interest in this matter! I really appreciate your work! > I too was getting depressed by the constraints imposed by > mem_cgroup_{begin,end}_update_page_stat (good job though Kamezawa-san > did to minimize them), and wanted to replace by something freer, more > RCU-like. In the end it seemed more effort than it was worth to go > as far as I wanted, but I do think that this is some improvement over > what we currently have, and should deal with your recursion issue. It takes me some time to understand the patch. yeah, it can solve my recursion issue and also reduce some locks(e.g. move_lock). But it did have some side effect on move end as it will become slower. 
To my knowledge, each task is forked in root memcg, and there's a moving while attaching it to a cgroup. So move_account is also a frequent behavior to some extent. Some comments are below. > But if this does appear useful to memcg people, then we really ought > to get it checked over by locking/barrier experts before going further. > I think myself that I've over-barriered it, and could use a little > lighter; but they (Paul McKenney, Peter Zijlstra, Oleg Nesterov come > to mind) will see more clearly, and may just hate the whole thing, > as yet another peculiar lockdep-avoiding hand-crafted locking scheme. > I've not wanted to waste their time on reviewing it, if it's not even > going to be useful to memcg people. > > It may be easier to understand if you just apply the patch and look > at the result in mm/memcontrol.c, where I tried to gather the pieces > together in one place and describe them ("These functions mediate..."). > > Hugh > > include/linux/memcontrol.h | 39 +-- > mm/memcontrol.c| 375 +-- > mm/rmap.c | 20 - > 3 files changed, 257 insertions(+), 177 deletions(-) > > --- 3.8-rc2/include/linux/memcontrol.h 2012-12-22 09:43:27.172015571 -0800 > +++ linux/include/linux/memcontrol.h2013-01-02 14:47:47.960394878 -0800 > @@ -136,32 +136,28 @@ static inline bool mem_cgroup_disabled(v > return false; > } > > -void __mem_cgroup_begin_update_page_stat(struct page *page, bool *locked, > -unsigned long *flags); > - > +void __mem_cgroup_begin_update_page_stat(struct page *page); > +void __mem_cgroup_end_update_page_stat(void); > extern atomic_t memcg_moving; > > static inline void mem_cgroup_begin_update_page_stat(struct page *page, > - bool *locked, unsigned long *flags) > +bool *clamped) > { > - if (mem_cgroup_disabled()) > - return; > - rcu_read_lock(); > - *locked = false; > - if (atomic_read(_moving)) > - __mem_cgroup_begin_update_page_stat(page, locked, flags); > + preempt_disable(); Referring to synchronize_rcu in mem_cgroup_begin_move(), here 
rcu_read_lock() lost? > + *clamped = false; > + if (unlikely(atomic_read(&memcg_moving))) { > + __mem_cgroup_begin_update_page_stat(page); > + *clamped = true; > + } > } > > -void __mem_cgroup_end_update_page_stat(struct page *page, > - unsigned long *flags); > static inline void mem_cgroup_end_update_page_stat(struct page *page, > - bool *locked, unsigned long *flags) > + bool *clamped) > { > - if (mem_cgroup_disabled()) > - return; > - if (*locked) > - __mem_cgroup_end_update_page_stat(page, flags); > - rcu_read_unlock(); > + /* We don't currently use the page arg, but keep it for symmetry */
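For reference, the fast path of the quoted helpers can be modeled in userspace C as follows. This is only a sketch under my own naming: `slow_begin`/`slow_end` stand in for `__mem_cgroup_{begin,end}_update_page_stat`, and `preempt_disable()/enable()` have no userspace equivalent, so they appear only as comments. The point it illustrates is that the common case costs one atomic read, and the `*clamped` flag records whether the slow path was entered so the end side undoes exactly what the begin side did:

```c
#include <stdatomic.h>

atomic_int memcg_moving;  /* nonzero while any charge move is in flight */

void slow_begin(void) { /* per-page serialization, elided in this model */ }
void slow_end(void)   { }

void begin_update_page_stat(int *clamped)
{
    /* kernel: preempt_disable(); */
    *clamped = 0;
    if (atomic_load(&memcg_moving)) {   /* rare: a move is pending */
        slow_begin();
        *clamped = 1;                   /* remember we took the slow path */
    }
}

void end_update_page_stat(const int *clamped)
{
    if (*clamped)
        slow_end();                     /* undo only what begin did */
    /* kernel: preempt_enable(); */
}

int demo_fast_and_slow(void)
{
    int c;
    begin_update_page_stat(&c);
    int fast = (c == 0);                /* no mover: fast path taken */
    end_update_page_stat(&c);

    atomic_store(&memcg_moving, 1);
    begin_update_page_stat(&c);
    int slow = (c == 1);                /* mover pending: clamped */
    end_update_page_stat(&c);
    atomic_store(&memcg_moving, 0);
    return fast && slow;
}
```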
Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
On Mon, Jan 7, 2013 at 4:07 AM, Greg Thelen wrote: > On Tue, Dec 25 2012, Sha Zhengju wrote: > >> From: Sha Zhengju >> >> This patch adds memcg routines to count dirty pages, which allows memory >> controller >> to maintain an accurate view of the amount of its dirty memory and can >> provide some >> info for users while cgroup's direct reclaim is working. >> >> After Kame's commit 89c06bd5(memcg: use new logic for page stat accounting), >> we can >> use 'struct page' flag to test page state instead of per page_cgroup flag. >> But memcg >> has a feature to move a page from a cgroup to another one and may have race >> between >> "move" and "page stat accounting". So in order to avoid the race we have >> designed a >> bigger lock: >> >> mem_cgroup_begin_update_page_stat() >> modify page information-->(a) >> mem_cgroup_update_page_stat() -->(b) >> mem_cgroup_end_update_page_stat() >> It requires (a) and (b)(dirty pages accounting) can stay close enough. >> In the previous two prepare patches, we have reworked the vfs set page dirty >> routines >> and now the interfaces are more explicit: >> incrementing (2): >> __set_page_dirty >> __set_page_dirty_nobuffers >> decrementing (2): >> clear_page_dirty_for_io >> cancel_dirty_page >> >> To prevent AB/BA deadlock mentioned by Greg Thelen in previous version >> (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order: >> ->private_lock --> mapping->tree_lock --> memcg->move_lock. >> So we need to make mapping->tree_lock ahead of TestSetPageDirty in >> __set_page_dirty() >> and __set_page_dirty_nobuffers(). But in order to avoiding useless spinlock >> contention, >> a prepare PageDirty() checking is added. 
>> >> >> Signed-off-by: Sha Zhengju >> Acked-by: KAMEZAWA Hiroyuki >> Acked-by: Fengguang Wu >> --- >> fs/buffer.c | 14 +- >> include/linux/memcontrol.h | 1 + >> mm/filemap.c | 10 ++ >> mm/memcontrol.c | 29 ++--- >> mm/page-writeback.c | 39 --- >> mm/truncate.c | 6 ++ >> 6 files changed, 84 insertions(+), 15 deletions(-) > > __nilfs_clear_page_dirty() clears PageDirty, does it need modification > for this patch series? It doesn't need to do so. mem_cgroup_dec/inc_page_stat() is accompanied by dec/inc_zone_page_state() to account memcg page stat. IMHO we only have to do some modification where SetPageDirty and dec/inc_zone_page_state() occur together. __nilfs_clear_page_dirty() will call clear_page_dirty_for_io(page) later, where the accounting is done. >> diff --git a/fs/buffer.c b/fs/buffer.c >> index 762168a..53402d2 100644 >> --- a/fs/buffer.c >> +++ b/fs/buffer.c >> @@ -612,19 +612,31 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode); >> int __set_page_dirty(struct page *page, >> struct address_space *mapping, int warn) >> { >> + bool locked; >> + unsigned long flags; >> + >> if (unlikely(!mapping)) >> return !TestSetPageDirty(page); >> >> - if (TestSetPageDirty(page)) >> + if (PageDirty(page)) >> return 0; >> >> spin_lock_irq(&mapping->tree_lock); >> + mem_cgroup_begin_update_page_stat(page, &locked, &flags); >> + >> + if (TestSetPageDirty(page)) { >> + mem_cgroup_end_update_page_stat(page, &locked, &flags); >> + spin_unlock_irq(&mapping->tree_lock); >> + return 0; >> + } >> + >> if (page->mapping) {/* Race with truncate? 
*/ >> WARN_ON_ONCE(warn && !PageUptodate(page)); >> account_page_dirtied(page, mapping); >> radix_tree_tag_set(&mapping->page_tree, >> page_index(page), PAGECACHE_TAG_DIRTY); >> } >> + mem_cgroup_end_update_page_stat(page, &locked, &flags); >> spin_unlock_irq(&mapping->tree_lock); >> __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); >> >> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h >> index 5421b8a..2685d8a 100644 >> --- a/include/linux/memcontrol.h >> +++ b/include/linux/memcontrol.h >> @@ -44,6 +44,7 @@ enum mem_cgroup_stat_index { >> MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */ >> MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */ >> MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */ >> + MEM_CGROUP_STAT_FILE_DIRTY, /* # of dirty pages in page cache */ >> MEM_CGROUP_STAT_NSTATS, >> }; >> >> diff --git a/mm/filemap.c b/mm/filemap.c >> index 83efee7..b589be5 100644 >> --- a/mm/filemap.c >> +++ b/mm/filemap.c >> @@ -62,6 +62,11 @@ >> * ->swap_lock (exclusive_swap_page, others) >> *->mapping->tree_lock >> * >> + *->private_lock (__set_page_dirty_buffers) >> + * ->mapping->tree_lock >> + *->memcg->move_lock (mem_cgroup_begin_update_page_stat-> >> + * move_lock_mem_cgroup) >> + * >> * ->i_mutex
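The ordering being negotiated in the diff above boils down to a double-check pattern: a cheap unlocked PageDirty() test first (to avoid useless lock contention), then the authoritative TestSetPageDirty() only once tree_lock and the memcg stat section are held. Here is an illustrative userspace analog — all names are made up, and the kernel lock calls survive only as comments:

```c
#include <stdatomic.h>

struct fake_page {
    atomic_int dirty;   /* stands in for PG_dirty */
    int accounted;      /* stands in for zone + memcg dirty counters */
};

/* Returns 1 if this caller transitioned the page to dirty and accounted it. */
int set_page_dirty_sketch(struct fake_page *p)
{
    if (atomic_load(&p->dirty))     /* prepare PageDirty() check: skip the */
        return 0;                   /* locks entirely if already dirty     */

    /* kernel: spin_lock_irq(&mapping->tree_lock);
     *         mem_cgroup_begin_update_page_stat(page, &locked, &flags); */
    if (atomic_exchange(&p->dirty, 1))  /* TestSetPageDirty() under lock */
        return 0;                       /* raced: someone else dirtied it */
    p->accounted++;                     /* account_page_dirtied() analog */
    /* kernel: mem_cgroup_end_update_page_stat(...);
     *         spin_unlock_irq(&mapping->tree_lock); */
    return 1;
}

int demo_single_accounting(void)
{
    struct fake_page p = { 0, 0 };
    int first = set_page_dirty_sketch(&p);
    int second = set_page_dirty_sketch(&p);
    /* however many callers race, exactly one accounts the transition */
    return first == 1 && second == 0 && p.accounted == 1;
}
```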
Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
On Wed 09-01-13 22:35:12, Sha Zhengju wrote: [...] To my knowledge, each task is forked in root memcg, and there's a moving while attaching it to a cgroup. So move_account is also a frequent behavior to some extent. Not really. Every fork/exec copies the current group (see cgroup_fork) so there is no moving on that path. [...] -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
On Mon, Jan 7, 2013 at 3:25 PM, Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote: (2013/01/05 13:48), Sha Zhengju wrote: On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko mho...@suse.cz wrote: On Wed 26-12-12 01:26:07, Sha Zhengju wrote: From: Sha Zhengju handai@taobao.com This patch adds memcg routines to count dirty pages, which allows memory controller to maintain an accurate view of the amount of its dirty memory and can provide some info for users while cgroup's direct reclaim is working. I guess you meant targeted resp. (hard/soft) limit reclaim here, right? It is true that this is direct reclaim but it is not clear to me Yes, I meant memcg hard/soft reclaim here which is triggered directly by allocation and is distinct from background kswapd reclaim (global). why the usefulnes should be limitted to the reclaim for users. I would understand this if the users was in fact in-kernel users. One of the reasons I'm trying to accounting the dirty pages is to get a more board overall view of memory usages because memcg hard/soft reclaim may have effect on response time of user application. Yeah, the beneficiary can be application administrator or kernel users. :P [...] To prevent AB/BA deadlock mentioned by Greg Thelen in previous version (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order: ->private_lock --> mapping->tree_lock --> memcg->move_lock. So we need to make mapping->tree_lock ahead of TestSetPageDirty in __set_page_dirty() and __set_page_dirty_nobuffers(). But in order to avoiding useless spinlock contention, a prepare PageDirty() checking is added. But there is another AA deadlock here I believe. page_remove_rmap mem_cgroup_begin_update_page_stat <<< 1 set_page_dirty __set_page_dirty_buffers __set_page_dirty mem_cgroup_begin_update_page_stat <<< 2 move_lock_mem_cgroup spin_lock_irqsave(&memcg->move_lock, *flags); mem_cgroup_begin_update_page_stat is not recursive wrt. 
locking AFAICS because we might race with the moving charges: CPU0 CPU1 page_remove_rmap mem_cgroup_can_attach mem_cgroup_begin_update_page_stat (1) rcu_read_lock mem_cgroup_start_move atomic_inc(&memcg_moving) atomic_inc(&memcg->moving_account) synchronize_rcu __mem_cgroup_begin_update_page_stat mem_cgroup_stolen TRUE move_lock_mem_cgroup [...] mem_cgroup_begin_update_page_stat (2) __mem_cgroup_begin_update_page_stat mem_cgroup_stolen still TRUE move_lock_mem_cgroup DEADLOCK [...] mem_cgroup_end_update_page_stat rcu_unlock # wake up from synchronize_rcu [...] mem_cgroup_move_task mem_cgroup_move_charge walk_page_range mem_cgroup_move_account move_lock_mem_cgroup Maybe I have missed some other locking which would prevent this from happening but the locking relations are really complicated in this area so if mem_cgroup_{begin,end}_update_page_stat might be called recursively then we need a fat comment which justifies that. Ohhh...good catching! I didn't notice there is a recursive call of mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap(). The mem_cgroup_{begin,end}_update_page_stat() design has depressed me a lot recently as the lock granularity is a little bigger than I thought. Not only the resource but also some code logic is in the range of locking which may be deadlock prone. The problem still exists if we are trying to add stat account of other memcg page later, may I make bold to suggest that we dig into the lock again... But with regard to the current lock implementation, I doubt if we can account MEM_CGROUP_STAT_FILE_{MAPPED, DIRTY} in one breath and just try to get move_lock once in the beginning. 
IMHO we can make mem_cgroup_{begin,end}_update_page_stat() recursion-aware, and what I'm thinking of now is changing memcg->move_lock to a rw-spinlock from the original spinlock: mem_cgroup_{begin,end}_update_page_stat() would take the read lock, which makes it reenterable, and the memcg task-moving side would take the write lock. 
Then the race may be the following: CPU0 CPU1 page_remove_rmap mem_cgroup_can_attach mem_cgroup_begin_update_page_stat (1) rcu_read_lock mem_cgroup_start_move atomic_inc(&memcg_moving)
Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
(2013/01/10 13:26), Sha Zhengju wrote: But this method also has its pros and cons (e.g. it needs lock nesting). So I doubt whether the following is able to deal with these issues all together: (CPU-A does page stat accounting and CPU-B does move) CPU-A CPU-B move_lock_mem_cgroup() memcg = pc->mem_cgroup SetPageDirty(page) move_unlock_mem_cgroup() move_lock_mem_cgroup() if (PageDirty) { old_memcg->nr_dirty--; new_memcg->nr_dirty++; } pc->mem_cgroup = new_memcg move_unlock_mem_cgroup() memcg->nr_dirty++ For CPU-A, we save pc->mem_cgroup in a temporary variable just before SetPageDirty inside move_lock and then update stats if the page is set PG_dirty successfully. But CPU-B may do the moving in advance, so old_memcg->nr_dirty-- will make old_memcg->nr_dirty incorrect, but soon CPU-A will do memcg->nr_dirty++ on its heels, which amends the stats. However, there is a potential problem that old_memcg->nr_dirty may go negative for a very short period, but that is not a big issue IMHO. IMHO, this will work. Please take care that the recorded memcg is not an invalid pointer when you update nr_dirty later. (Maybe RCU will protect it.) _If_ this method can handle the nesting problem more clearly and make the implementation simpler, please go ahead. To be honest, I'm not sure how the code will look until seeing the patch. Hmm, why do you write SetPageDirty() here rather than TestSetPageDirty()? Thanks, -Kame
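The snapshot-then-account scheme Sha sketches (and Kamezawa conditionally approves) can be condensed into a small userspace model of the counter arithmetic under the CPU-A/CPU-B interleaving above. The struct and function names are made up, and the real move_lock/RCU protection of the recorded memcg pointer is elided into comments:

```c
struct memcg_model { long nr_dirty; };
struct page_cgroup_model { struct memcg_model *mem_cgroup; };

/* CPU-B's move: re-tag the page and transfer any dirty count it observes. */
void move_account_sketch(struct page_cgroup_model *pc,
                         struct memcg_model *to, int pg_dirty)
{
    if (pg_dirty) {
        pc->mem_cgroup->nr_dirty--;   /* may dip below zero transiently */
        to->nr_dirty++;
    }
    pc->mem_cgroup = to;
}

int demo_transient_minus(void)
{
    struct memcg_model oldcg = { 0 }, newcg = { 0 };
    struct page_cgroup_model pc = { &oldcg };

    /* CPU-A, under move_lock: snapshot the owner, then SetPageDirty() */
    struct memcg_model *snap = pc.mem_cgroup;
    int pg_dirty = 1;
    /* move_unlock; CPU-B's move slips in before CPU-A accounts: */
    move_account_sketch(&pc, &newcg, pg_dirty);
    int transient = (oldcg.nr_dirty == -1);   /* the short-lived minus */
    snap->nr_dirty++;                         /* CPU-A finishes: repairs it */
    return transient && oldcg.nr_dirty == 0 && newcg.nr_dirty == 1;
}
```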
Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
(2013/01/09 14:15), Hugh Dickins wrote: On Mon, 7 Jan 2013, Kamezawa Hiroyuki wrote: (2013/01/07 5:02), Hugh Dickins wrote: Forgive me, I must confess I'm no more than skimming this thread, and don't like dumping unsigned-off patches on people; but thought that on balance it might be more helpful than not if I offer you a patch I worked on around 3.6-rc2 (but have updated to 3.8-rc2 below). I too was getting depressed by the constraints imposed by mem_cgroup_{begin,end}_update_page_stat (good job though Kamezawa-san did to minimize them), and wanted to replace by something freer, more RCU-like. In the end it seemed more effort than it was worth to go as far as I wanted, but I do think that this is some improvement over what we currently have, and should deal with your recursion issue. In what case does this improve performance ? Perhaps none. I was aiming to not degrade performance at the stats update end, and make it more flexible, so new stats can be updated which would be problematic today (for lock ordering and recursion reasons). I've not done any performance measurement on it, and don't have enough cpus for an interesting report; but if someone thinks it might solve a problem for them, and has plenty of cpus to test with, please go ahead, we'd be glad to hear the results. Hi, this patch seems interesting but...doesn't this make move_account() very slow if the number of cpus increases because of scanning all cpus per a page ? And this looks like reader-can-block-writer percpu rwlock..it's too heavy to writers if there are many readers. I was happy to make the relatively rare move_account end considerably heavier. I'll be disappointed if it turns out to be prohibitively heavy at that end - if we're going to make move_account impossible, there are much easier ways to achieve that! - but it is a possibility. move_account at task move has been a required feature for NEC, and Nishimura-san did a good job on it. I'd like to keep it available as much as possible. 
Something you might have missed when considering many readers (stats updaters): the move_account end does not wait for a moment when there are no readers, that would indeed be a losing strategy; it just waits for each cpu that's updating page stats to leave that section, so every cpu is sure to notice and hold off if it then tries to update the page which is to be moved. (I may not be explaining that very well!) Hmm, yeah, maybe I missed something. BTW, if nesting, mem_cgroup_end_update_page_stat() seems to make the counter go negative. Thanks, -Kame
Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
On Mon, 7 Jan 2013, Kamezawa Hiroyuki wrote: > (2013/01/07 5:02), Hugh Dickins wrote: > > > > Forgive me, I must confess I'm no more than skimming this thread, > > and don't like dumping unsigned-off patches on people; but thought > > that on balance it might be more helpful than not if I offer you a > > patch I worked on around 3.6-rc2 (but have updated to 3.8-rc2 below). > > > > I too was getting depressed by the constraints imposed by > > mem_cgroup_{begin,end}_update_page_stat (good job though Kamezawa-san > > did to minimize them), and wanted to replace by something freer, more > > RCU-like. In the end it seemed more effort than it was worth to go > > as far as I wanted, but I do think that this is some improvement over > > what we currently have, and should deal with your recursion issue. > > > In what case does this improve performance ? Perhaps none. I was aiming to not degrade performance at the stats update end, and make it more flexible, so new stats can be updated which would be problematic today (for lock ordering and recursion reasons). I've not done any performance measurement on it, and don't have enough cpus for an interesting report; but if someone thinks it might solve a problem for them, and has plenty of cpus to test with, please go ahead, we'd be glad to hear the results. > Hi, this patch seems interesting but...doesn't this make move_account() very > slow if the number of cpus increases because of scanning all cpus per a page > ? > And this looks like reader-can-block-writer percpu rwlock..it's too heavy to > writers if there are many readers. I was happy to make the relatively rare move_account end considerably heavier. I'll be disappointed if it turns out to be prohibitively heavy at that end - if we're going to make move_account impossible, there are much easier ways to achieve that! - but it is a possibility. 
Something you might have missed when considering many readers (stats updaters): the move_account end does not wait for a moment when there are no readers, that would indeed be a losing strategy; it just waits for each cpu that's updating page stats to leave that section, so every cpu is sure to notice and hold off if it then tries to update the page which is to be moved. (I may not be explaining that very well!)

Hugh
Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
(2013/01/09 14:15), Hugh Dickins wrote: On Mon, 7 Jan 2013, Kamezawa Hiroyuki wrote: (2013/01/07 5:02), Hugh Dickins wrote:

Forgive me, I must confess I'm no more than skimming this thread, and don't like dumping unsigned-off patches on people; but thought that on balance it might be more helpful than not if I offer you a patch I worked on around 3.6-rc2 (but have updated to 3.8-rc2 below). I too was getting depressed by the constraints imposed by mem_cgroup_{begin,end}_update_page_stat (good job though Kamezawa-san did to minimize them), and wanted to replace by something freer, more RCU-like. In the end it seemed more effort than it was worth to go as far as I wanted, but I do think that this is some improvement over what we currently have, and should deal with your recursion issue.

In what case does this improve performance ?

Perhaps none. I was aiming to not degrade performance at the stats update end, and make it more flexible, so new stats can be updated which would be problematic today (for lock ordering and recursion reasons). I've not done any performance measurement on it, and don't have enough cpus for an interesting report; but if someone thinks it might solve a problem for them, and has plenty of cpus to test with, please go ahead, we'd be glad to hear the results.

Hi, this patch seems interesting but...doesn't this make move_account() very slow if the number of cpus increases because of scanning all cpus per a page ? And this looks like reader-can-block-writer percpu rwlock..it's too heavy to writers if there are many readers.

I was happy to make the relatively rare move_account end considerably heavier. I'll be disappointed if it turns out to be prohibitively heavy at that end - if we're going to make move_account impossible, there are much easier ways to achieve that! - but it is a possibility.

move_account at task-move has been a required feature for NEC, and Nishimura-san did a good job on it. I'd like to keep it available as much as possible.
Something you might have missed when considering many readers (stats updaters): the move_account end does not wait for a moment when there are no readers, that would indeed be a losing strategy; it just waits for each cpu that's updating page stats to leave that section, so every cpu is sure to notice and hold off if it then tries to update the page which is to be moved. (I may not be explaining that very well!)

Hmm, yeah, maybe I missed something. BTW, if nesting, mem_cgroup_end_update_page_stat() seems to make the counter go negative.

Thanks,
-Kame
Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
(2013/01/07 5:02), Hugh Dickins wrote: On Sat, 5 Jan 2013, Sha Zhengju wrote: On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko wrote: Maybe I have missed some other locking which would prevent this from happening but the locking relations are really complicated in this area so if mem_cgroup_{begin,end}_update_page_stat might be called recursively then we need a fat comment which justifies that. Ohhh...good catching! I didn't notice there is a recursive call of mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap(). The mem_cgroup_{begin,end}_update_page_stat() design has depressed me a lot recently as the lock granularity is a little bigger than I thought. Not only the resource but also some code logic is in the range of locking which may be deadlock prone. The problem still exists if we are trying to add stat account of other memcg page later, may I make bold to suggest that we dig into the lock again... Forgive me, I must confess I'm no more than skimming this thread, and don't like dumping unsigned-off patches on people; but thought that on balance it might be more helpful than not if I offer you a patch I worked on around 3.6-rc2 (but have updated to 3.8-rc2 below). I too was getting depressed by the constraints imposed by mem_cgroup_{begin,end}_update_page_stat (good job though Kamezawa-san did to minimize them), and wanted to replace by something freer, more RCU-like. In the end it seemed more effort than it was worth to go as far as I wanted, but I do think that this is some improvement over what we currently have, and should deal with your recursion issue. In what case does this improve performance ? But if this does appear useful to memcg people, then we really ought to get it checked over by locking/barrier experts before going further. 
I think myself that I've over-barriered it, and could use a little lighter; but they (Paul McKenney, Peter Zijlstra, Oleg Nesterov come to mind) will see more clearly, and may just hate the whole thing, as yet another peculiar lockdep-avoiding hand-crafted locking scheme. I've not wanted to waste their time on reviewing it, if it's not even going to be useful to memcg people. It may be easier to understand if you just apply the patch and look at the result in mm/memcontrol.c, where I tried to gather the pieces together in one place and describe them ("These functions mediate...").

Hugh

Hi, this patch seems interesting but...doesn't this make move_account() very slow if the number of cpus increases because of scanning all cpus per a page ? And this looks like reader-can-block-writer percpu rwlock..it's too heavy to writers if there are many readers.

Thanks,
-Kame
Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
(2013/01/05 13:48), Sha Zhengju wrote: On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko wrote: On Wed 26-12-12 01:26:07, Sha Zhengju wrote: From: Sha Zhengju

This patch adds memcg routines to count dirty pages, which allows memory controller to maintain an accurate view of the amount of its dirty memory and can provide some info for users while cgroup's direct reclaim is working.

I guess you meant targeted resp. (hard/soft) limit reclaim here, right? It is true that this is direct reclaim but it is not clear to me

Yes, I meant memcg hard/soft reclaim here which is triggered directly by allocation and is distinct from background kswapd reclaim (global).

why the usefulness should be limited to the reclaim for users. I would understand this if the users were in fact in-kernel users.

One of the reasons I'm trying to account the dirty pages is to get a broader overall view of memory usage, because memcg hard/soft reclaim may have an effect on the response time of user applications. Yeah, the beneficiary can be the application administrator or kernel users. :P

[...] To prevent the AB/BA deadlock mentioned by Greg Thelen in the previous version (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order: ->private_lock --> mapping->tree_lock --> memcg->move_lock. So we need to take mapping->tree_lock ahead of TestSetPageDirty in __set_page_dirty() and __set_page_dirty_nobuffers(). But in order to avoid useless spinlock contention, a preparatory PageDirty() check is added.

But there is another AA deadlock here I believe:

page_remove_rmap
  mem_cgroup_begin_update_page_stat            <<< 1
  set_page_dirty
    __set_page_dirty_buffers
      __set_page_dirty
        mem_cgroup_begin_update_page_stat      <<< 2
          move_lock_mem_cgroup
            spin_lock_irqsave(&memcg->move_lock, *flags);

mem_cgroup_begin_update_page_stat is not recursive wrt. locking AFAICS because we might race with the moving charges:

CPU0                                    CPU1
page_remove_rmap
                                        mem_cgroup_can_attach
mem_cgroup_begin_update_page_stat (1)
  rcu_read_lock
                                          mem_cgroup_start_move
                                            atomic_inc(&memcg_moving)
                                            atomic_inc(&memcg->moving_account)
                                            synchronize_rcu
    __mem_cgroup_begin_update_page_stat
      mem_cgroup_stolen          <<< TRUE
      move_lock_mem_cgroup
[...]
mem_cgroup_begin_update_page_stat (2)
  __mem_cgroup_begin_update_page_stat
    mem_cgroup_stolen            <<< still TRUE
    move_lock_mem_cgroup         <<< DEADLOCK
[...]
mem_cgroup_end_update_page_stat
  rcu_unlock
                                        # wake up from synchronize_rcu
[...]
                                        mem_cgroup_move_task
                                          mem_cgroup_move_charge
                                            walk_page_range
                                              mem_cgroup_move_account
                                                move_lock_mem_cgroup

Maybe I have missed some other locking which would prevent this from happening, but the locking relations are really complicated in this area, so if mem_cgroup_{begin,end}_update_page_stat might be called recursively then we need a fat comment which justifies that.

Ohhh...good catching! I didn't notice there is a recursive call of mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap(). The mem_cgroup_{begin,end}_update_page_stat() design has depressed me a lot recently, as the lock granularity is a little bigger than I thought. Not only the resource but also some code logic is in the range of locking, which may be deadlock prone. The problem still exists if we try to add stat accounting of other memcg pages later, so may I make bold to suggest that we dig into the lock again...

But with regard to the current lock implementation, I doubt whether we can account MEM_CGROUP_STAT_FILE_{MAPPED, DIRTY} in one breath and just try to get move_lock once in the beginning. IMHO we can make mem_cgroup_{begin,end}_update_page_stat() recursion-aware, and what I'm thinking now is changing memcg->move_lock from the original spinlock to a rw-spinlock: mem_cgroup_{begin,end}_update_page_stat() would take the read lock, which makes it re-enterable, and the memcg task-moving side would take the write lock.
Then the race may be the following:

CPU0                                    CPU1
page_remove_rmap
                                        mem_cgroup_can_attach
mem_cgroup_begin_update_page_stat (1)
  rcu_read_lock
                                          mem_cgroup_start_move
Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
On Tue, Dec 25 2012, Sha Zhengju wrote:

> From: Sha Zhengju
>
> This patch adds memcg routines to count dirty pages, which allows memory controller
> to maintain an accurate view of the amount of its dirty memory and can provide some
> info for users while cgroup's direct reclaim is working.
>
> After Kame's commit 89c06bd5 (memcg: use new logic for page stat accounting), we can
> use 'struct page' flag to test page state instead of per page_cgroup flag. But memcg
> has a feature to move a page from a cgroup to another one and may have race between
> "move" and "page stat accounting". So in order to avoid the race we have designed a
> bigger lock:
>
>         mem_cgroup_begin_update_page_stat()
>         modify page information        -->(a)
>         mem_cgroup_update_page_stat()  -->(b)
>         mem_cgroup_end_update_page_stat()
>
> It requires (a) and (b) (dirty pages accounting) can stay close enough.
> In the previous two prepare patches, we have reworked the vfs set page dirty routines
> and now the interfaces are more explicit:
>         incrementing (2):
>                 __set_page_dirty
>                 __set_page_dirty_nobuffers
>         decrementing (2):
>                 clear_page_dirty_for_io
>                 cancel_dirty_page
>
> To prevent AB/BA deadlock mentioned by Greg Thelen in previous version
> (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
> ->private_lock --> mapping->tree_lock --> memcg->move_lock.
> So we need to make mapping->tree_lock ahead of TestSetPageDirty in __set_page_dirty()
> and __set_page_dirty_nobuffers(). But in order to avoid useless spinlock contention,
> a prepare PageDirty() checking is added.
>
> Signed-off-by: Sha Zhengju
> Acked-by: KAMEZAWA Hiroyuki
> Acked-by: Fengguang Wu
> ---
>  fs/buffer.c                |   14 +-
>  include/linux/memcontrol.h |    1 +
>  mm/filemap.c               |   10 ++
>  mm/memcontrol.c            |   29 ++---
>  mm/page-writeback.c        |   39 ---
>  mm/truncate.c              |    6 ++
>  6 files changed, 84 insertions(+), 15 deletions(-)

__nilfs_clear_page_dirty() clears PageDirty, does it need modification for this patch series?
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 762168a..53402d2 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -612,19 +612,31 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
>  int __set_page_dirty(struct page *page,
>  		struct address_space *mapping, int warn)
>  {
> +	bool locked;
> +	unsigned long flags;
> +
>  	if (unlikely(!mapping))
>  		return !TestSetPageDirty(page);
>
> -	if (TestSetPageDirty(page))
> +	if (PageDirty(page))
>  		return 0;
>
>  	spin_lock_irq(&mapping->tree_lock);
> +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> +
> +	if (TestSetPageDirty(page)) {
> +		mem_cgroup_end_update_page_stat(page, &locked, &flags);
> +		spin_unlock_irq(&mapping->tree_lock);
> +		return 0;
> +	}
> +
>  	if (page->mapping) {	/* Race with truncate? */
>  		WARN_ON_ONCE(warn && !PageUptodate(page));
>  		account_page_dirtied(page, mapping);
>  		radix_tree_tag_set(&mapping->page_tree,
>  				page_index(page), PAGECACHE_TAG_DIRTY);
>  	}
> +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
>  	spin_unlock_irq(&mapping->tree_lock);
>  	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 5421b8a..2685d8a 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -44,6 +44,7 @@ enum mem_cgroup_stat_index {
>  	MEM_CGROUP_STAT_RSS,		/* # of pages charged as anon rss */
>  	MEM_CGROUP_STAT_FILE_MAPPED,	/* # of pages charged as file rss */
>  	MEM_CGROUP_STAT_SWAP,		/* # of pages, swapped out */
> +	MEM_CGROUP_STAT_FILE_DIRTY,	/* # of dirty pages in page cache */
>  	MEM_CGROUP_STAT_NSTATS,
>  };
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 83efee7..b589be5 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -62,6 +62,11 @@
>   *    ->swap_lock		(exclusive_swap_page, others)
>   *      ->mapping->tree_lock
>   *
> + *    ->private_lock		(__set_page_dirty_buffers)
> + *      ->mapping->tree_lock
> + *        ->memcg->move_lock	(mem_cgroup_begin_update_page_stat->
> + *				 move_lock_mem_cgroup)
> + *
>   *  ->i_mutex
>   *    ->i_mmap_mutex		(truncate->unmap_mapping_range)
>   *
> @@ -112,6 +117,8 @@
>  void __delete_from_page_cache(struct page *page)
>  {
>  	struct address_space *mapping = page->mapping;
> +	bool locked;
> +	unsigned long flags;
>
>  	/*
>  	 * if we're uptodate, flush out into the cleancache, otherwise
> @@ -139,10 +146,13 @@ void __delete_from_page_cache(struct page *page)
>  	 *
Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
On Sat, 5 Jan 2013, Sha Zhengju wrote: > On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko wrote: > > > > Maybe I have missed some other locking which would prevent this from > > happening but the locking relations are really complicated in this area > > so if mem_cgroup_{begin,end}_update_page_stat might be called > > recursively then we need a fat comment which justifies that. > > > > Ohhh...good catching! I didn't notice there is a recursive call of > mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap(). > The mem_cgroup_{begin,end}_update_page_stat() design has depressed > me a lot recently as the lock granularity is a little bigger than I thought. > Not only the resource but also some code logic is in the range of locking > which may be deadlock prone. The problem still exists if we are trying to > add stat account of other memcg page later, may I make bold to suggest > that we dig into the lock again... Forgive me, I must confess I'm no more than skimming this thread, and don't like dumping unsigned-off patches on people; but thought that on balance it might be more helpful than not if I offer you a patch I worked on around 3.6-rc2 (but have updated to 3.8-rc2 below). I too was getting depressed by the constraints imposed by mem_cgroup_{begin,end}_update_page_stat (good job though Kamezawa-san did to minimize them), and wanted to replace by something freer, more RCU-like. In the end it seemed more effort than it was worth to go as far as I wanted, but I do think that this is some improvement over what we currently have, and should deal with your recursion issue. But if this does appear useful to memcg people, then we really ought to get it checked over by locking/barrier experts before going further. 
I think myself that I've over-barriered it, and could use a little lighter; but they (Paul McKenney, Peter Zijlstra, Oleg Nesterov come to mind) will see more clearly, and may just hate the whole thing, as yet another peculiar lockdep-avoiding hand-crafted locking scheme. I've not wanted to waste their time on reviewing it, if it's not even going to be useful to memcg people. It may be easier to understand if you just apply the patch and look at the result in mm/memcontrol.c, where I tried to gather the pieces together in one place and describe them ("These functions mediate...").

Hugh

 include/linux/memcontrol.h |   39 +--
 mm/memcontrol.c            |  375 +--
 mm/rmap.c                  |   20 -
 3 files changed, 257 insertions(+), 177 deletions(-)

--- 3.8-rc2/include/linux/memcontrol.h	2012-12-22 09:43:27.172015571 -0800
+++ linux/include/linux/memcontrol.h	2013-01-02 14:47:47.960394878 -0800
@@ -136,32 +136,28 @@ static inline bool mem_cgroup_disabled(v
 	return false;
 }
 
-void __mem_cgroup_begin_update_page_stat(struct page *page, bool *locked,
-					unsigned long *flags);
-
+void __mem_cgroup_begin_update_page_stat(struct page *page);
+void __mem_cgroup_end_update_page_stat(void);
 extern atomic_t memcg_moving;
 
 static inline void mem_cgroup_begin_update_page_stat(struct page *page,
-					bool *locked, unsigned long *flags)
+					bool *clamped)
 {
-	if (mem_cgroup_disabled())
-		return;
-	rcu_read_lock();
-	*locked = false;
-	if (atomic_read(&memcg_moving))
-		__mem_cgroup_begin_update_page_stat(page, locked, flags);
+	preempt_disable();
+	*clamped = false;
+	if (unlikely(atomic_read(&memcg_moving))) {
+		__mem_cgroup_begin_update_page_stat(page);
+		*clamped = true;
+	}
 }
 
-void __mem_cgroup_end_update_page_stat(struct page *page,
-					unsigned long *flags);
 static inline void mem_cgroup_end_update_page_stat(struct page *page,
-					bool *locked, unsigned long *flags)
+					bool *clamped)
 {
-	if (mem_cgroup_disabled())
-		return;
-	if (*locked)
-		__mem_cgroup_end_update_page_stat(page, flags);
-	rcu_read_unlock();
+	/* We don't currently use the page arg, but keep it for symmetry */
+	if (unlikely(*clamped))
+		__mem_cgroup_end_update_page_stat();
+	preempt_enable();
 }
 
 void mem_cgroup_update_page_stat(struct page *page,
@@ -345,13 +341,16 @@ mem_cgroup_print_oom_info(struct mem_cgr
 }
 
 static inline void mem_cgroup_begin_update_page_stat(struct page *page,
-					bool *locked, unsigned long *flags)
+					bool *clamped)
 {
+	/* It may be helpful to our callers if the stub behaves the same way */
+	preempt_disable();
 }
 
 static inline void mem_cgroup_end_update_page_stat(struct page *page,
-					bool *locked, unsigned long
Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
(2013/01/07 5:02), Hugh Dickins wrote:
> On Sat, 5 Jan 2013, Sha Zhengju wrote:
>> On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko <mho...@suse.cz> wrote:
>>> Maybe I have missed some other locking which would prevent this from
>>> happening but the locking relations are really complicated in this area
>>> so if mem_cgroup_{begin,end}_update_page_stat might be called
>>> recursively then we need a fat comment which justifies that.
>>
>> Ohhh... good catch! I didn't notice there is a recursive call of
>> mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap(). The
>> mem_cgroup_{begin,end}_update_page_stat() design has depressed me a lot
>> recently, as the lock granularity is a little bigger than I thought. Not
>> only the resource but also some code logic is inside the locked region,
>> which may be deadlock prone. The problem will still exist if we try to
>> account other memcg page stats later, so may I be so bold as to suggest
>> that we dig into the lock again...
>
> Forgive me, I must confess I'm no more than skimming this thread, and
> don't like dumping unsigned-off patches on people; but thought that on
> balance it might be more helpful than not if I offer you a patch I worked
> on around 3.6-rc2 (but have updated to 3.8-rc2 below). I too was getting
> depressed by the constraints imposed by
> mem_cgroup_{begin,end}_update_page_stat (good job though Kamezawa-san did
> to minimize them), and wanted to replace by something freer, more
> RCU-like. In the end it seemed more effort than it was worth to go as far
> as I wanted, but I do think that this is some improvement over what we
> currently have, and should deal with your recursion issue.

In what case does this improve performance?

> But if this does appear useful to memcg people, then we really ought to
> get it checked over by locking/barrier experts before going further. I
> think myself that I've over-barriered it, and could use a little lighter;
> but they (Paul McKenney, Peter Zijlstra, Oleg Nesterov come to mind) will
> see more clearly, and may just hate the whole thing, as yet another
> peculiar lockdep-avoiding hand-crafted locking scheme. I've not wanted to
> waste their time on reviewing it, if it's not even going to be useful to
> memcg people.
>
> It may be easier to understand if you just apply the patch and look at
> the result in mm/memcontrol.c, where I tried to gather the pieces
> together in one place and describe them ("These functions mediate...").
>
> Hugh

Hi, this patch seems interesting, but... doesn't this make move_account()
very slow if the number of cpus increases, because of scanning all cpus
per page? And this looks like a reader-can-block-writer percpu rwlock...
it's too heavy on writers if there are many readers.

Thanks,
-Kame
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko wrote:
> On Wed 26-12-12 01:26:07, Sha Zhengju wrote:
>> From: Sha Zhengju
>>
>> This patch adds memcg routines to count dirty pages, which allows the
>> memory controller to maintain an accurate view of the amount of its
>> dirty memory and can provide some info for users while cgroup's direct
>> reclaim is working.
>
> I guess you meant targeted resp. (hard/soft) limit reclaim here,
> right? It is true that this is direct reclaim but it is not clear to me

Yes, I meant memcg hard/soft reclaim here, which is triggered directly by
allocation and is distinct from background kswapd reclaim (global).

> why the usefulness should be limited to the reclaim for users. I would
> understand this if the users were in fact in-kernel users.
>

One of the reasons I'm trying to account the dirty pages is to get a
broader overall view of memory usage, because memcg hard/soft reclaim may
affect the response time of user applications. Yeah, the beneficiary can
be the application administrator or kernel users. :P

> [...]
>> To prevent the AB/BA deadlock mentioned by Greg Thelen in the previous
>> version (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
>> ->private_lock --> mapping->tree_lock --> memcg->move_lock.
>> So we need to take mapping->tree_lock ahead of TestSetPageDirty in
>> __set_page_dirty() and __set_page_dirty_nobuffers(). But in order to
>> avoid useless spinlock contention, a preparatory PageDirty() check is
>> added.
>
> But there is another AA deadlock here I believe.
> page_remove_rmap
>   mem_cgroup_begin_update_page_stat		<<< 1
>   set_page_dirty
>     __set_page_dirty_buffers
>       __set_page_dirty
>         mem_cgroup_begin_update_page_stat	<<< 2
>           move_lock_mem_cgroup
>             spin_lock_irqsave(&memcg->move_lock, *flags);
>
> mem_cgroup_begin_update_page_stat is not recursive wrt. locking AFAICS
> because we might race with the moving charges:
> CPU0						CPU1
> page_remove_rmap
>						mem_cgroup_can_attach
> mem_cgroup_begin_update_page_stat (1)
>   rcu_read_lock
>						  mem_cgroup_start_move
>						    atomic_inc(&memcg_moving)
>						    atomic_inc(&memcg->moving_account)
>						    synchronize_rcu
>   __mem_cgroup_begin_update_page_stat
>     mem_cgroup_stolen	<<< TRUE
>     move_lock_mem_cgroup
>     [...]
> mem_cgroup_begin_update_page_stat (2)
>   __mem_cgroup_begin_update_page_stat
>     mem_cgroup_stolen	<<< still TRUE
>     move_lock_mem_cgroup	<<< DEADLOCK
>     [...]
> mem_cgroup_end_update_page_stat
>   rcu_unlock
>						  # wake up from synchronize_rcu
>						[...]
>						mem_cgroup_move_task
>						  mem_cgroup_move_charge
>						    walk_page_range
>						      mem_cgroup_move_account
>						        move_lock_mem_cgroup
>
> Maybe I have missed some other locking which would prevent this from
> happening but the locking relations are really complicated in this area
> so if mem_cgroup_{begin,end}_update_page_stat might be called
> recursively then we need a fat comment which justifies that.
>

Ohhh... good catch! I didn't notice there is a recursive call of
mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap(). The
mem_cgroup_{begin,end}_update_page_stat() design has depressed me a lot
recently, as the lock granularity is a little bigger than I thought. Not
only the resource but also some code logic is inside the locked region,
which may be deadlock prone. The problem will still exist if we try to
account other memcg page stats later, so may I be so bold as to suggest
that we dig into the lock again...

But with regard to the current lock implementation, I doubt whether we can
account MEM_CGROUP_STAT_FILE_{MAPPED, DIRTY} in one breath by taking
move_lock just once at the beginning. IMHO we can make
mem_cgroup_{begin,end}_update_page_stat() recursion aware, and what I'm
thinking of now is changing memcg->move_lock from the original spinlock to
a rw-spinlock: mem_cgroup_{begin,end}_update_page_stat() takes the read
lock, which makes it reentrant, and the memcg task-moving side takes the
write lock.

Then the race may be the following:
CPU0						CPU1
page_remove_rmap
						mem_cgroup_can_attach
mem_cgroup_begin_update_page_stat (1)
  rcu_read_lock
						  mem_cgroup_start_move
Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
On Wed 26-12-12 01:26:07, Sha Zhengju wrote:
> From: Sha Zhengju
>
> This patch adds memcg routines to count dirty pages, which allows the
> memory controller to maintain an accurate view of the amount of its
> dirty memory and can provide some info for users while cgroup's direct
> reclaim is working.

I guess you meant targeted resp. (hard/soft) limit reclaim here,
right? It is true that this is direct reclaim but it is not clear to me
why the usefulness should be limited to the reclaim for users. I would
understand this if the users were in fact in-kernel users.

[...]
> To prevent the AB/BA deadlock mentioned by Greg Thelen in the previous
> version (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
> ->private_lock --> mapping->tree_lock --> memcg->move_lock.
> So we need to take mapping->tree_lock ahead of TestSetPageDirty in
> __set_page_dirty() and __set_page_dirty_nobuffers(). But in order to
> avoid useless spinlock contention, a preparatory PageDirty() check is
> added.

But there is another AA deadlock here I believe.
page_remove_rmap
  mem_cgroup_begin_update_page_stat		<<< 1
  set_page_dirty
    __set_page_dirty_buffers
      __set_page_dirty
        mem_cgroup_begin_update_page_stat	<<< 2
          move_lock_mem_cgroup
            spin_lock_irqsave(&memcg->move_lock, *flags);

mem_cgroup_begin_update_page_stat is not recursive wrt. locking AFAICS
because we might race with the moving charges:
CPU0						CPU1
page_remove_rmap
						mem_cgroup_can_attach
mem_cgroup_begin_update_page_stat (1)
  rcu_read_lock
						  mem_cgroup_start_move
						    atomic_inc(&memcg_moving)
						    atomic_inc(&memcg->moving_account)
						    synchronize_rcu
  __mem_cgroup_begin_update_page_stat
    mem_cgroup_stolen	<<< TRUE
    move_lock_mem_cgroup
    [...]
mem_cgroup_begin_update_page_stat (2)
  __mem_cgroup_begin_update_page_stat
    mem_cgroup_stolen	<<< still TRUE
    move_lock_mem_cgroup	<<< DEADLOCK
    [...]
mem_cgroup_end_update_page_stat
  rcu_unlock
						  # wake up from synchronize_rcu
						[...]
						mem_cgroup_move_task
						  mem_cgroup_move_charge
						    walk_page_range
						      mem_cgroup_move_account
						        move_lock_mem_cgroup

Maybe I have missed some other locking which would prevent this from
happening but the locking relations are really complicated in this area
so if mem_cgroup_{begin,end}_update_page_stat might be called
recursively then we need a fat comment which justifies that.
[...]
--
Michal Hocko
SUSE Labs
[PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
From: Sha Zhengju

This patch adds memcg routines to count dirty pages, which allows the
memory controller to maintain an accurate view of the amount of its dirty
memory and can provide some info for users while cgroup's direct reclaim
is working.

After Kame's commit 89c06bd5 ("memcg: use new logic for page stat
accounting"), we can use the 'struct page' flag to test page state instead
of a per page_cgroup flag. But memcg has a feature to move a page from one
cgroup to another, so there may be a race between "move" and "page stat
accounting". In order to avoid the race we have designed a bigger lock:

	mem_cgroup_begin_update_page_stat()
	modify page information		-->(a)
	mem_cgroup_update_page_stat()	-->(b)
	mem_cgroup_end_update_page_stat()

It requires that (a) and (b) (dirty pages accounting) stay close enough.

In the previous two preparatory patches, we have reworked the vfs set page
dirty routines and now the interfaces are more explicit:
	incrementing (2):
		__set_page_dirty
		__set_page_dirty_nobuffers
	decrementing (2):
		clear_page_dirty_for_io
		cancel_dirty_page

To prevent the AB/BA deadlock mentioned by Greg Thelen in the previous
version (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
->private_lock --> mapping->tree_lock --> memcg->move_lock.
So we need to take mapping->tree_lock ahead of TestSetPageDirty in
__set_page_dirty() and __set_page_dirty_nobuffers(). But in order to avoid
useless spinlock contention, a preparatory PageDirty() check is added.
Signed-off-by: Sha Zhengju
Acked-by: KAMEZAWA Hiroyuki
Acked-by: Fengguang Wu
---
 fs/buffer.c                |   14 +-
 include/linux/memcontrol.h |    1 +
 mm/filemap.c               |   10 ++
 mm/memcontrol.c            |   29 ++---
 mm/page-writeback.c        |   39 ---
 mm/truncate.c              |    6 ++
 6 files changed, 84 insertions(+), 15 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 762168a..53402d2 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -612,19 +612,31 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
 int __set_page_dirty(struct page *page,
 		struct address_space *mapping, int warn)
 {
+	bool locked;
+	unsigned long flags;
+
 	if (unlikely(!mapping))
 		return !TestSetPageDirty(page);
 
-	if (TestSetPageDirty(page))
+	if (PageDirty(page))
 		return 0;
 
 	spin_lock_irq(&mapping->tree_lock);
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
+
+	if (TestSetPageDirty(page)) {
+		mem_cgroup_end_update_page_stat(page, &locked, &flags);
+		spin_unlock_irq(&mapping->tree_lock);
+		return 0;
+	}
+
 	if (page->mapping) {	/* Race with truncate? */
 		WARN_ON_ONCE(warn && !PageUptodate(page));
 		account_page_dirtied(page, mapping);
 		radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 	}
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 	spin_unlock_irq(&mapping->tree_lock);
 	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5421b8a..2685d8a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -44,6 +44,7 @@ enum mem_cgroup_stat_index {
 	MEM_CGROUP_STAT_RSS,		/* # of pages charged as anon rss */
 	MEM_CGROUP_STAT_FILE_MAPPED,	/* # of pages charged as file rss */
 	MEM_CGROUP_STAT_SWAP,		/* # of pages, swapped out */
+	MEM_CGROUP_STAT_FILE_DIRTY,	/* # of dirty pages in page cache */
 	MEM_CGROUP_STAT_NSTATS,
 };
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 83efee7..b589be5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -62,6 +62,11 @@
  *    ->swap_lock		(exclusive_swap_page, others)
  *      ->mapping->tree_lock
  *
+ *    ->private_lock		(__set_page_dirty_buffers)
+ *      ->mapping->tree_lock
+ *        ->memcg->move_lock	(mem_cgroup_begin_update_page_stat->
+ *				 move_lock_mem_cgroup)
+ *
  *  ->i_mutex
  *    ->i_mmap_mutex		(truncate->unmap_mapping_range)
  *
@@ -112,6 +117,8 @@ void __delete_from_page_cache(struct page *page)
 {
 	struct address_space *mapping = page->mapping;
+	bool locked;
+	unsigned long flags;
 
 	/*
 	 * if we're uptodate, flush out into the cleancache, otherwise
@@ -139,10 +146,13 @@ void __delete_from_page_cache(struct page *page)
 	 * Fix it up by doing a final dirty accounting check after
 	 * having removed the page entirely.
 	 */
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_dec_page_stat(page,