Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting

2013-05-03 Thread Sha Zhengju
On Fri, May 3, 2013 at 5:11 PM, Michal Hocko  wrote:
> On Wed 02-01-13 11:44:21, Michal Hocko wrote:
>> On Wed 26-12-12 01:26:07, Sha Zhengju wrote:
>> > From: Sha Zhengju 
>> >
>> > This patch adds memcg routines to count dirty pages, which allows memory 
>> > controller
>> > to maintain an accurate view of the amount of its dirty memory and can 
>> > provide some
>> > info for users while cgroup's direct reclaim is working.
>>
>> I guess you meant targeted resp. (hard/soft) limit reclaim here,
>> right? It is true that this is direct reclaim but it is not clear to me
>> why the usefulness should be limited to the reclaim for users. I would
>> understand this if the users were in fact in-kernel users.
>>
>> [...]
>> > To prevent AB/BA deadlock mentioned by Greg Thelen in previous version
>> > (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
>> > ->private_lock --> mapping->tree_lock --> memcg->move_lock.
>> > So we need to take mapping->tree_lock ahead of TestSetPageDirty in
>> > __set_page_dirty() and __set_page_dirty_nobuffers(). But in order to avoid
>> > useless spinlock contention, a preparatory PageDirty() check is added.
>>
>> But there is another AA deadlock here I believe.
>> page_remove_rmap
>>   mem_cgroup_begin_update_page_stat   <<< 1
>>   set_page_dirty
>> __set_page_dirty_buffers
>>   __set_page_dirty
>> mem_cgroup_begin_update_page_stat <<< 2
>> move_lock_mem_cgroup
>>   spin_lock_irqsave(&memcg->move_lock, *flags);
>
> JFYI since abf09bed (s390/mm: implement software dirty bits) this is no
> longer possible. I haven't checked whether there are other cases like
> this one, and it would be better if mem_cgroup_begin_update_page_stat
> were recursion-safe, if that can be done without too many hacks.
> I will have a look at this (hopefully) sometime next week.
>

Hi Michal,


I'm sorry for not being able to return to this problem immediately after LSF/MM.
That is good news. IIRC, it's the only place where we have encountered a
recursion problem in accounting memcg dirty pages. But I'll try to revive
my previous work of simplifying the mem_cgroup_begin_update_page_stat() lock.
I'll get back to it in the next few days.


--
Thanks,
Sha


Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting

2013-05-03 Thread Michal Hocko
On Wed 02-01-13 11:44:21, Michal Hocko wrote:
> On Wed 26-12-12 01:26:07, Sha Zhengju wrote:
> > From: Sha Zhengju 
> > 
> > This patch adds memcg routines to count dirty pages, which allows memory 
> > controller
> > to maintain an accurate view of the amount of its dirty memory and can 
> > provide some
> > info for users while cgroup's direct reclaim is working.
> 
> I guess you meant targeted resp. (hard/soft) limit reclaim here,
> right? It is true that this is direct reclaim but it is not clear to me
> why the usefulness should be limited to the reclaim for users. I would
> understand this if the users were in fact in-kernel users.
> 
> [...]
> > To prevent AB/BA deadlock mentioned by Greg Thelen in previous version
> > (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
> > ->private_lock --> mapping->tree_lock --> memcg->move_lock.
> > So we need to take mapping->tree_lock ahead of TestSetPageDirty in
> > __set_page_dirty() and __set_page_dirty_nobuffers(). But in order to avoid
> > useless spinlock contention, a preparatory PageDirty() check is added.
> 
> But there is another AA deadlock here I believe.
> page_remove_rmap
>   mem_cgroup_begin_update_page_stat   <<< 1
>   set_page_dirty
> __set_page_dirty_buffers
>   __set_page_dirty
> mem_cgroup_begin_update_page_stat <<< 2
> move_lock_mem_cgroup
>   spin_lock_irqsave(&memcg->move_lock, *flags);

JFYI since abf09bed (s390/mm: implement software dirty bits) this is no
longer possible. I haven't checked whether there are other cases like
this one, and it would be better if mem_cgroup_begin_update_page_stat
were recursion-safe, if that can be done without too many hacks.
I will have a look at this (hopefully) sometime next week.
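
For illustration, "recursion safe" would amount to something like the sketch
below: only the outermost begin/end pair takes the real lock, and inner pairs
just bump a per-thread nesting count. This is a self-contained user-space
sketch with made-up names (a pthread mutex standing in for move_lock), not a
proposal for the actual memcg code:

#include <pthread.h>

static pthread_mutex_t move_lock = PTHREAD_MUTEX_INITIALIZER;
static __thread int stat_nesting;        /* per-thread nesting depth */

static void begin_update_page_stat(void)
{
        if (stat_nesting++ > 0)
                return;                  /* nested call: lock already held */
        pthread_mutex_lock(&move_lock);  /* outermost call takes the lock */
}

static void end_update_page_stat(void)
{
        if (--stat_nesting > 0)
                return;                  /* still inside an outer section */
        pthread_mutex_unlock(&move_lock);
}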

[...]
-- 
Michal Hocko
SUSE Labs


Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting

2013-01-10 Thread Sha Zhengju
On Thu, Jan 10, 2013 at 1:03 PM, Kamezawa Hiroyuki wrote:
> (2013/01/10 13:26), Sha Zhengju wrote:
>
>> But this method also has its pros and cons(e.g. need lock nesting). So
>> I doubt whether the following is able to deal with these issues all
>> together:
>> (CPU-A does "page stat accounting" and CPU-B does "move")
>>
>>          CPU-A                            CPU-B
>>
>> move_lock_mem_cgroup()
>> memcg = pc->mem_cgroup
>> SetPageDirty(page)
>> move_unlock_mem_cgroup()
>>                                   move_lock_mem_cgroup()
>>                                   if (PageDirty) {
>>                                           old_memcg->nr_dirty --;
>>                                           new_memcg->nr_dirty ++;
>>                                   }
>>                                   pc->mem_cgroup = new_memcg
>>                                   move_unlock_mem_cgroup()
>>
>> memcg->nr_dirty ++
>>
>>
>> For CPU-A, we save pc->mem_cgroup in a temporary variable just before
>> SetPageDirty inside move_lock and then update stats if the page is set
>> PG_dirty successfully. But CPU-B may do the "moving" first, so
>> "old_memcg->nr_dirty --" will make old_memcg->nr_dirty incorrect, but
>> CPU-A will then do "memcg->nr_dirty ++" right on its heels, which amends
>> the stats.
>> However, there is a potential problem that old_memcg->nr_dirty may go
>> negative for a very short period, but that's not a big issue IMHO.
>>
>
> IMHO, this will work. Please take care that the recorded memcg is not an
> invalid pointer by the time you update nr_dirty later.
> (Maybe RCU will protect it.)
>
Yes, there are 3 places that change pc->mem_cgroup: charge, uncharge and
move_account. "charge" has no race with the stat updater, "uncharge"
doesn't reset pc->mem_cgroup directly, and "move_account" is exactly the
one we are handling, so they should do no harm here. Meanwhile, an invalid
pointer caused by cgroup deletion may also be avoided by RCU. Still, this
is a rough conclusion from a quick look...

> _If_ this method can handle the "nesting" problem more cleanly and make the
> implementation simpler, please go ahead. To be honest, I'm not sure how the
> code will be until
Okay, later I'll try to propose the patch.

> seeing the patch. Hmm, why do you write SetPageDirty() here rather than
> TestSetPageDirty()?
>
No particular reason...TestSetPageDirty() may be more precise... : )
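
For illustration, the ordering discussed above can be sketched as the
following self-contained user-space snippet (made-up names, a pthread mutex
standing in for move_lock; this is only a sketch, not the kernel patch). The
stat side records the owner and sets the dirty flag under the lock, then bumps
the counter outside it; the move side fixes up counters for pages it already
sees dirty, so the old group's counter may dip below zero briefly until the
stat side's increment lands:

#include <pthread.h>
#include <stdbool.h>

struct group { long nr_dirty; };
struct page  { bool dirty; struct group *owner; };

static pthread_mutex_t move_lock = PTHREAD_MUTEX_INITIALIZER;

static void stat_side(struct page *page)                      /* "CPU-A" */
{
        struct group *g;

        pthread_mutex_lock(&move_lock);
        g = page->owner;                /* record the owner under the lock */
        page->dirty = true;             /* SetPageDirty() */
        pthread_mutex_unlock(&move_lock);

        __atomic_fetch_add(&g->nr_dirty, 1, __ATOMIC_RELAXED);
}

static void move_side(struct page *page, struct group *new_g) /* "CPU-B" */
{
        pthread_mutex_lock(&move_lock);
        if (page->dirty) {              /* already accounted as dirty */
                __atomic_fetch_sub(&page->owner->nr_dirty, 1, __ATOMIC_RELAXED);
                __atomic_fetch_add(&new_g->nr_dirty, 1, __ATOMIC_RELAXED);
        }
        page->owner = new_g;
        pthread_mutex_unlock(&move_lock);
}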


-- 
Thanks,
Sha


Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting

2013-01-09 Thread Kamezawa Hiroyuki

(2013/01/10 13:26), Sha Zhengju wrote:


But this method also has its pros and cons(e.g. need lock nesting). So
I doubt whether the following is able to deal with these issues all
together:
(CPU-A does "page stat accounting" and CPU-B does "move")

         CPU-A                            CPU-B

move_lock_mem_cgroup()
memcg = pc->mem_cgroup
SetPageDirty(page)
move_unlock_mem_cgroup()
                                  move_lock_mem_cgroup()
                                  if (PageDirty) {
                                          old_memcg->nr_dirty --;
                                          new_memcg->nr_dirty ++;
                                  }
                                  pc->mem_cgroup = new_memcg
                                  move_unlock_mem_cgroup()

memcg->nr_dirty ++


For CPU-A, we save pc->mem_cgroup in a temporary variable just before
SetPageDirty inside move_lock and then update stats if the page is set
PG_dirty successfully. But CPU-B may do the "moving" first, so
"old_memcg->nr_dirty --" will make old_memcg->nr_dirty incorrect, but
CPU-A will then do "memcg->nr_dirty ++" right on its heels, which amends
the stats.
However, there is a potential problem that old_memcg->nr_dirty may go
negative for a very short period, but that's not a big issue IMHO.



IMHO, this will work. Please take care that the recorded memcg is not an
invalid pointer by the time you update nr_dirty later.
(Maybe RCU will protect it.)

_If_ this method can handle the "nesting" problem more cleanly and make the
implementation simpler, please go ahead. To be honest, I'm not sure how the
code will look until I see the patch. Hmm, why do you write SetPageDirty()
here rather than TestSetPageDirty()?

Thanks,
-Kame






Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting

2013-01-09 Thread Sha Zhengju
On Thu, Jan 10, 2013 at 10:16 AM, Kamezawa Hiroyuki wrote:
> (2013/01/10 0:02), Sha Zhengju wrote:
>>
>> On Mon, Jan 7, 2013 at 3:25 PM, Kamezawa Hiroyuki wrote:
>>>
>>> (2013/01/05 13:48), Sha Zhengju wrote:


 On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko  wrote:
>
>
> On Wed 26-12-12 01:26:07, Sha Zhengju wrote:
>>
>>
>> From: Sha Zhengju 
>>
>> This patch adds memcg routines to count dirty pages, which allows
>> memory
>> controller
>> to maintain an accurate view of the amount of its dirty memory and can
>> provide some
>> info for users while cgroup's direct reclaim is working.
>
>
>
> I guess you meant targeted resp. (hard/soft) limit reclaim here,
> right? It is true that this is direct reclaim but it is not clear to me



 Yes, I meant memcg hard/soft reclaim here which is triggered directly
 by allocation and is distinct from background kswapd reclaim (global).

> why the usefulness should be limited to the reclaim for users. I would
> understand this if the users were in fact in-kernel users.
>

 One of the reasons I'm trying to account the dirty pages is to get a
 broader overall view of memory usage, because memcg hard/soft reclaim
 may have an effect on the response time of user applications.
 Yeah, the beneficiary can be the application administrator or kernel users.
 :P

> [...]
>>
>>
>> To prevent AB/BA deadlock mentioned by Greg Thelen in previous version
>> (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
>> ->private_lock --> mapping->tree_lock --> memcg->move_lock.
>> So we need to take mapping->tree_lock ahead of TestSetPageDirty in
>> __set_page_dirty() and __set_page_dirty_nobuffers(). But in order to avoid
>> useless spinlock contention, a preparatory PageDirty() check is added.
>
>
>
> But there is another AA deadlock here I believe.
> page_remove_rmap
> mem_cgroup_begin_update_page_stat <<< 1
> set_page_dirty
>   __set_page_dirty_buffers
> __set_page_dirty
>   mem_cgroup_begin_update_page_stat   <<< 2
> move_lock_mem_cgroup
>   spin_lock_irqsave(&memcg->move_lock, *flags);
>
> mem_cgroup_begin_update_page_stat is not recursive wrt. locking AFAICS
> because we might race with the moving charges:
>   CPU0                                    CPU1
> page_remove_rmap
>                                           mem_cgroup_can_attach
>   mem_cgroup_begin_update_page_stat (1)
>     rcu_read_lock
>                                             mem_cgroup_start_move
>                                               atomic_inc(&memcg_moving)
>                                               atomic_inc(&memcg->moving_account)
>                                               synchronize_rcu
>     __mem_cgroup_begin_update_page_stat
>       mem_cgroup_stolen <<< TRUE
>       move_lock_mem_cgroup
>   [...]
>   mem_cgroup_begin_update_page_stat (2)
>     __mem_cgroup_begin_update_page_stat
>       mem_cgroup_stolen <<< still TRUE
>       move_lock_mem_cgroup  <<< DEADLOCK
>   [...]
>   mem_cgroup_end_update_page_stat
>     rcu_unlock
>                                             # wake up from synchronize_rcu
>                                           [...]
>                                           mem_cgroup_move_task
>                                             mem_cgroup_move_charge
>                                               walk_page_range
>                                                 mem_cgroup_move_account
>                                                   move_lock_mem_cgroup
>
>
> Maybe I have missed some other locking which would prevent this from
> happening but the locking relations are really complicated in this area
> so if mem_cgroup_{begin,end}_update_page_stat might be called
> recursively then we need a fat comment which justifies that.
>

 Ohhh...good catching!  I didn't notice there is a recursive call of
 mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap().
 The mem_cgroup_{begin,end}_update_page_stat() design has depressed
 me a lot recently as the lock granularity is a little bigger than I
 thought.
 Not only the resource but also some code logic is in the range of
 locking
 which may be deadlock prone. The problem still exists if we are trying
 to
 add stat account of other memcg page later, may I make bold to suggest
 that we dig into the lock again...

 But with regard to the current lock implementation, I doubt whether we can
 account MEM_CGROUP_STAT_FILE_{MAPPED, DIRTY} in one breath and just
 try to get move_lock once in the beginning. IMHO we can make
 mem_cgroup_{begin,end}_update_page_stat() to recursive aware and what
 I'm
 thinking 

Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting

2013-01-09 Thread Kamezawa Hiroyuki

(2013/01/10 0:02), Sha Zhengju wrote:

On Mon, Jan 7, 2013 at 3:25 PM, Kamezawa Hiroyuki
 wrote:

(2013/01/05 13:48), Sha Zhengju wrote:


On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko  wrote:


On Wed 26-12-12 01:26:07, Sha Zhengju wrote:


From: Sha Zhengju 

This patch adds memcg routines to count dirty pages, which allows memory
controller
to maintain an accurate view of the amount of its dirty memory and can
provide some
info for users while cgroup's direct reclaim is working.



I guess you meant targeted resp. (hard/soft) limit reclaim here,
right? It is true that this is direct reclaim but it is not clear to me



Yes, I meant memcg hard/soft reclaim here which is triggered directly
by allocation and is distinct from background kswapd reclaim (global).


why the usefulness should be limited to the reclaim for users. I would
understand this if the users were in fact in-kernel users.



One of the reasons I'm trying to account the dirty pages is to get a
broader overall view of memory usage, because memcg hard/soft reclaim
may have an effect on the response time of user applications.
Yeah, the beneficiary can be the application administrator or kernel users.
:P


[...]


To prevent AB/BA deadlock mentioned by Greg Thelen in previous version
(https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
->private_lock --> mapping->tree_lock --> memcg->move_lock.
So we need to take mapping->tree_lock ahead of TestSetPageDirty in
__set_page_dirty() and __set_page_dirty_nobuffers(). But in order to avoid
useless spinlock contention, a preparatory PageDirty() check is added.



But there is another AA deadlock here I believe.
page_remove_rmap
mem_cgroup_begin_update_page_stat <<< 1
set_page_dirty
  __set_page_dirty_buffers
__set_page_dirty
  mem_cgroup_begin_update_page_stat   <<< 2
move_lock_mem_cgroup
  spin_lock_irqsave(&memcg->move_lock, *flags);

mem_cgroup_begin_update_page_stat is not recursive wrt. locking AFAICS
because we might race with the moving charges:
  CPU0                                    CPU1
page_remove_rmap
                                          mem_cgroup_can_attach
  mem_cgroup_begin_update_page_stat (1)
    rcu_read_lock
                                            mem_cgroup_start_move
                                              atomic_inc(&memcg_moving)
                                              atomic_inc(&memcg->moving_account)
                                              synchronize_rcu
    __mem_cgroup_begin_update_page_stat
      mem_cgroup_stolen <<< TRUE
      move_lock_mem_cgroup
  [...]
  mem_cgroup_begin_update_page_stat (2)
    __mem_cgroup_begin_update_page_stat
      mem_cgroup_stolen <<< still TRUE
      move_lock_mem_cgroup  <<< DEADLOCK
  [...]
  mem_cgroup_end_update_page_stat
    rcu_unlock
                                            # wake up from synchronize_rcu
                                          [...]
                                          mem_cgroup_move_task
                                            mem_cgroup_move_charge
                                              walk_page_range
                                                mem_cgroup_move_account
                                                  move_lock_mem_cgroup


Maybe I have missed some other locking which would prevent this from
happening but the locking relations are really complicated in this area
so if mem_cgroup_{begin,end}_update_page_stat might be called
recursively then we need a fat comment which justifies that.



Ohhh...good catching!  I didn't notice there is a recursive call of
mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap().
The mem_cgroup_{begin,end}_update_page_stat() design has depressed
me a lot recently as the lock granularity is a little bigger than I
thought.
Not only the resource but also some code logic is in the range of locking
which may be deadlock prone. The problem still exists if we are trying to
add stat account of other memcg page later, may I make bold to suggest
that we dig into the lock again...

But with regard to the current lock implementation, I doubt whether we can
account MEM_CGROUP_STAT_FILE_{MAPPED, DIRTY} in one breath and just
try to get move_lock once in the beginning. IMHO we can make
mem_cgroup_{begin,end}_update_page_stat() recursion aware, and what I'm
thinking now is changing memcg->move_lock from the original spinlock to a
rw-spinlock: mem_cgroup_{begin,end}_update_page_stat() try to get the read
lock, which makes them reenterable, and the memcg task-moving side tries to
get the write lock. Then the race may be the following:

  CPU0                                    CPU1
page_remove_rmap
                                          mem_cgroup_can_attach
  mem_cgroup_begin_update_page_stat (1)
    rcu_read_lock
                                            mem_cgroup_start_move
                                              atomic_inc(&memcg_moving)
                                              atomic_inc(&memcg->moving_account)

Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting

2013-01-09 Thread Sha Zhengju
On Mon, Jan 7, 2013 at 3:25 PM, Kamezawa Hiroyuki wrote:
> (2013/01/05 13:48), Sha Zhengju wrote:
>>
>> On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko  wrote:
>>>
>>> On Wed 26-12-12 01:26:07, Sha Zhengju wrote:

 From: Sha Zhengju 

 This patch adds memcg routines to count dirty pages, which allows memory
 controller
 to maintain an accurate view of the amount of its dirty memory and can
 provide some
 info for users while cgroup's direct reclaim is working.
>>>
>>>
>>> I guess you meant targeted resp. (hard/soft) limit reclaim here,
>>> right? It is true that this is direct reclaim but it is not clear to me
>>
>>
>> Yes, I meant memcg hard/soft reclaim here which is triggered directly
>> by allocation and is distinct from background kswapd reclaim (global).
>>
>>> why the usefulness should be limited to the reclaim for users. I would
>>> understand this if the users were in fact in-kernel users.
>>>
>>
>> One of the reasons I'm trying to account the dirty pages is to get a
>> broader overall view of memory usage, because memcg hard/soft reclaim
>> may have an effect on the response time of user applications.
>> Yeah, the beneficiary can be the application administrator or kernel users.
>> :P
>>
>>> [...]

 To prevent AB/BA deadlock mentioned by Greg Thelen in previous version
 (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
 ->private_lock --> mapping->tree_lock --> memcg->move_lock.
 So we need to take mapping->tree_lock ahead of TestSetPageDirty in
 __set_page_dirty() and __set_page_dirty_nobuffers(). But in order to avoid
 useless spinlock contention, a preparatory PageDirty() check is added.
>>>
>>>
>>> But there is another AA deadlock here I believe.
>>> page_remove_rmap
>>>mem_cgroup_begin_update_page_stat <<< 1
>>>set_page_dirty
>>>  __set_page_dirty_buffers
>>>__set_page_dirty
>>>  mem_cgroup_begin_update_page_stat   <<< 2
>>>move_lock_mem_cgroup
>>>  spin_lock_irqsave(&memcg->move_lock, *flags);
>>>
>>> mem_cgroup_begin_update_page_stat is not recursive wrt. locking AFAICS
>>> because we might race with the moving charges:
>>>   CPU0                                    CPU1
>>> page_remove_rmap
>>>                                           mem_cgroup_can_attach
>>>   mem_cgroup_begin_update_page_stat (1)
>>>     rcu_read_lock
>>>                                             mem_cgroup_start_move
>>>                                               atomic_inc(&memcg_moving)
>>>                                               atomic_inc(&memcg->moving_account)
>>>                                               synchronize_rcu
>>>     __mem_cgroup_begin_update_page_stat
>>>       mem_cgroup_stolen <<< TRUE
>>>       move_lock_mem_cgroup
>>>   [...]
>>>   mem_cgroup_begin_update_page_stat (2)
>>>     __mem_cgroup_begin_update_page_stat
>>>       mem_cgroup_stolen <<< still TRUE
>>>       move_lock_mem_cgroup  <<< DEADLOCK
>>>   [...]
>>>   mem_cgroup_end_update_page_stat
>>>     rcu_unlock
>>>                                             # wake up from synchronize_rcu
>>>                                           [...]
>>>                                           mem_cgroup_move_task
>>>                                             mem_cgroup_move_charge
>>>                                               walk_page_range
>>>                                                 mem_cgroup_move_account
>>>                                                   move_lock_mem_cgroup
>>>
>>>
>>> Maybe I have missed some other locking which would prevent this from
>>> happening but the locking relations are really complicated in this area
>>> so if mem_cgroup_{begin,end}_update_page_stat might be called
>>> recursively then we need a fat comment which justifies that.
>>>
>>
>> Ohhh...good catching!  I didn't notice there is a recursive call of
>> mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap().
>> The mem_cgroup_{begin,end}_update_page_stat() design has depressed
>> me a lot recently as the lock granularity is a little bigger than I
>> thought.
>> Not only the resource but also some code logic is in the range of locking
>> which may be deadlock prone. The problem still exists if we are trying to
>> add stat account of other memcg page later, may I make bold to suggest
>> that we dig into the lock again...
>>
>> But with regard to the current lock implementation, I doubt whether we can
>> account MEM_CGROUP_STAT_FILE_{MAPPED, DIRTY} in one breath and just
>> try to get move_lock once in the beginning. IMHO we can make
>> mem_cgroup_{begin,end}_update_page_stat() recursion aware, and what I'm
>> thinking now is changing memcg->move_lock from the original spinlock to a
>> rw-spinlock: mem_cgroup_{begin,end}_update_page_stat() try to get the read
>> lock, which makes them reenterable, and the memcg task-moving side tries to
>> get the write lock. Then the race may be the following:
>>
>>  CPU0
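
For illustration, the rw-spinlock variant quoted above would look roughly
like the sketch below, under the assumption that memcg->move_lock were turned
into an rwlock_t (hypothetical names, not a real patch). Kernel rwlocks let
readers nest, so a recursive begin/end would not self-deadlock, while the
mover's write lock still excludes every updater:

/* Sketch only: assumes memcg->move_lock were changed to an rwlock_t. */
static void page_stat_read_lock(struct mem_cgroup *memcg, unsigned long *flags)
{
        read_lock_irqsave(&memcg->move_lock, *flags);    /* shared: safe to nest */
}

static void page_stat_read_unlock(struct mem_cgroup *memcg, unsigned long *flags)
{
        read_unlock_irqrestore(&memcg->move_lock, *flags);
}

static void move_account_write_lock(struct mem_cgroup *memcg, unsigned long *flags)
{
        write_lock_irqsave(&memcg->move_lock, *flags);   /* excludes all updaters */
}

static void move_account_write_unlock(struct mem_cgroup *memcg, unsigned long *flags)
{
        write_unlock_irqrestore(&memcg->move_lock, *flags);
}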

Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting

2013-01-09 Thread Michal Hocko
On Wed 09-01-13 22:35:12, Sha Zhengju wrote:
[...]
> To my knowledge, each task is forked in root memcg, and there's a
> moving while attaching it to a cgroup. So move_account is also a
> frequent behavior to some extent.

Not really. Every fork/exec copies the current group (see
cgroup_fork), so there is no moving on that path.
[...]
-- 
Michal Hocko
SUSE Labs


Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting

2013-01-09 Thread Sha Zhengju
Hi Hugh,

On Mon, Jan 7, 2013 at 4:02 AM, Hugh Dickins  wrote:
> On Sat, 5 Jan 2013, Sha Zhengju wrote:
>> On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko  wrote:
>> >
>> > Maybe I have missed some other locking which would prevent this from
>> > happening but the locking relations are really complicated in this area
>> > so if mem_cgroup_{begin,end}_update_page_stat might be called
>> > recursively then we need a fat comment which justifies that.
>> >
>>
>> Ohhh...good catching!  I didn't notice there is a recursive call of
>> mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap().
>> The mem_cgroup_{begin,end}_update_page_stat() design has depressed
>> me a lot recently as the lock granularity is a little bigger than I thought.
>> Not only the resource but also some code logic is in the range of locking
>> which may be deadlock prone. The problem still exists if we are trying to
>> add stat account of other memcg page later, may I make bold to suggest
>> that we dig into the lock again...
>
> Forgive me, I must confess I'm no more than skimming this thread,
> and don't like dumping unsigned-off patches on people; but thought
> that on balance it might be more helpful than not if I offer you a
> patch I worked on around 3.6-rc2 (but have updated to 3.8-rc2 below).

Thanks for your interest in this matter! I really appreciate your work!

> I too was getting depressed by the constraints imposed by
> mem_cgroup_{begin,end}_update_page_stat (good job though Kamezawa-san
> did to minimize them), and wanted to replace by something freer, more
> RCU-like.  In the end it seemed more effort than it was worth to go
> as far as I wanted, but I do think that this is some improvement over
> what we currently have, and should deal with your recursion issue.
It took me some time to understand the patch. Yeah, it can solve my
recursion issue and also removes some locking (e.g. move_lock). But it does
have a side effect on the move end, which will become slower. To my
knowledge, each task is forked in the root memcg, and there's a move
while attaching it to a cgroup, so move_account is also a frequent
operation to some extent.
Some comments are below.

> But if this does appear useful to memcg people, then we really ought
> to get it checked over by locking/barrier experts before going further.
> I think myself that I've over-barriered it, and could use a little
> lighter; but they (Paul McKenney, Peter Zijlstra, Oleg Nesterov come
> to mind) will see more clearly, and may just hate the whole thing,
> as yet another peculiar lockdep-avoiding hand-crafted locking scheme.
> I've not wanted to waste their time on reviewing it, if it's not even
> going to be useful to memcg people.
>
> It may be easier to understand if you just apply the patch and look
> at the result in mm/memcontrol.c, where I tried to gather the pieces
> together in one place and describe them ("These functions mediate...").
>
> Hugh
>
>  include/linux/memcontrol.h |   39 +--
>  mm/memcontrol.c|  375 +--
>  mm/rmap.c  |   20 -
>  3 files changed, 257 insertions(+), 177 deletions(-)
>
> --- 3.8-rc2/include/linux/memcontrol.h  2012-12-22 09:43:27.172015571 -0800
> +++ linux/include/linux/memcontrol.h2013-01-02 14:47:47.960394878 -0800
> @@ -136,32 +136,28 @@ static inline bool mem_cgroup_disabled(v
> return false;
>  }
>
> -void __mem_cgroup_begin_update_page_stat(struct page *page, bool *locked,
> -unsigned long *flags);
> -
> +void __mem_cgroup_begin_update_page_stat(struct page *page);
> +void __mem_cgroup_end_update_page_stat(void);
>  extern atomic_t memcg_moving;
>
>  static inline void mem_cgroup_begin_update_page_stat(struct page *page,
> -   bool *locked, unsigned long *flags)
> +bool *clamped)
>  {
> -   if (mem_cgroup_disabled())
> -   return;
> -   rcu_read_lock();
> -   *locked = false;
> -   if (atomic_read(_moving))
> -   __mem_cgroup_begin_update_page_stat(page, locked, flags);
> +   preempt_disable();

Referring to the synchronize_rcu() in mem_cgroup_begin_move(), isn't the
rcu_read_lock() lost here?

> +   *clamped = false;
> +   if (unlikely(atomic_read(_moving))) {
> +   __mem_cgroup_begin_update_page_stat(page);
> +   *clamped = true;
> +   }
>  }
>
> -void __mem_cgroup_end_update_page_stat(struct page *page,
> -   unsigned long *flags);
>  static inline void mem_cgroup_end_update_page_stat(struct page *page,
> -   bool *locked, unsigned long *flags)
> +  bool *clamped)
>  {
> -   if (mem_cgroup_disabled())
> -   return;
> -   if (*locked)
> -   __mem_cgroup_end_update_page_stat(page, flags);
> -   rcu_read_unlock();
> +   /* We 

Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting

2013-01-09 Thread Sha Zhengju
On Mon, Jan 7, 2013 at 4:07 AM, Greg Thelen  wrote:
> On Tue, Dec 25 2012, Sha Zhengju wrote:
>
>> From: Sha Zhengju 
>>
>> This patch adds memcg routines to count dirty pages, which allows memory 
>> controller
>> to maintain an accurate view of the amount of its dirty memory and can 
>> provide some
>> info for users while cgroup's direct reclaim is working.
>>
>> After Kame's commit 89c06bd5(memcg: use new logic for page stat accounting), 
>> we can
>> use 'struct page' flag to test page state instead of per page_cgroup flag. 
>> But memcg
>> has a feature to move a page from a cgroup to another one and may have race 
>> between
>> "move" and "page stat accounting". So in order to avoid the race we have 
>> designed a
>> bigger lock:
>>
>>  mem_cgroup_begin_update_page_stat()
>>  modify page information-->(a)
>>  mem_cgroup_update_page_stat()  -->(b)
>>  mem_cgroup_end_update_page_stat()
>> It requires (a) and (b)(dirty pages accounting) can stay close enough.
>> In the previous two prepare patches, we have reworked the vfs set page dirty 
>> routines
>> and now the interfaces are more explicit:
>> incrementing (2):
>> __set_page_dirty
>> __set_page_dirty_nobuffers
>> decrementing (2):
>> clear_page_dirty_for_io
>> cancel_dirty_page
>>
>> To prevent AB/BA deadlock mentioned by Greg Thelen in previous version
>> (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
>> ->private_lock --> mapping->tree_lock --> memcg->move_lock.
>> So we need to take mapping->tree_lock ahead of TestSetPageDirty in
>> __set_page_dirty() and __set_page_dirty_nobuffers(). But in order to avoid
>> useless spinlock contention, a preparatory PageDirty() check is added.
>>
>>
>> Signed-off-by: Sha Zhengju 
>> Acked-by: KAMEZAWA Hiroyuki 
>> Acked-by: Fengguang Wu 
>> ---
>>  fs/buffer.c|   14 +-
>>  include/linux/memcontrol.h |1 +
>>  mm/filemap.c   |   10 ++
>>  mm/memcontrol.c|   29 ++---
>>  mm/page-writeback.c|   39 ---
>>  mm/truncate.c  |6 ++
>>  6 files changed, 84 insertions(+), 15 deletions(-)
>
> __nilfs_clear_page_dirty() clears PageDirty, does it need modification
> for this patch series?

It doesn't need to.
mem_cgroup_dec/inc_page_stat() accompanies
dec/inc_zone_page_state() to account memcg page stats.  IMHO we only
have to modify the places where SetPageDirty and
dec/inc_zone_page_state() occur together.
__nilfs_clear_page_dirty() will call clear_page_dirty_for_io(page)
later, where the accounting is done.
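
In other words, the hunks in this series pair the two counters under the
begin/end protection, roughly like this sketch (not the exact patch text):

bool locked;
unsigned long flags;

mem_cgroup_begin_update_page_stat(page, &locked, &flags);
if (TestClearPageDirty(page)) {
        /* the memcg counter moves together with the zone counter */
        mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
        dec_zone_page_state(page, NR_FILE_DIRTY);
}
mem_cgroup_end_update_page_stat(page, &locked, &flags);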

>> diff --git a/fs/buffer.c b/fs/buffer.c
>> index 762168a..53402d2 100644
>> --- a/fs/buffer.c
>> +++ b/fs/buffer.c
>> @@ -612,19 +612,31 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
>>  int __set_page_dirty(struct page *page,
>>   struct address_space *mapping, int warn)
>>  {
>> + bool locked;
>> + unsigned long flags;
>> +
>>   if (unlikely(!mapping))
>>   return !TestSetPageDirty(page);
>>
>> - if (TestSetPageDirty(page))
>> + if (PageDirty(page))
>>   return 0;
>>
>>   spin_lock_irq(&mapping->tree_lock);
>> + mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>> +
>> + if (TestSetPageDirty(page)) {
>> + mem_cgroup_end_update_page_stat(page, &locked, &flags);
>> + spin_unlock_irq(&mapping->tree_lock);
>> + return 0;
>> + }
>> +
>>   if (page->mapping) {/* Race with truncate? */
>>   WARN_ON_ONCE(warn && !PageUptodate(page));
>>   account_page_dirtied(page, mapping);
>>   radix_tree_tag_set(&mapping->page_tree,
>>   page_index(page), PAGECACHE_TAG_DIRTY);
>>   }
>> + mem_cgroup_end_update_page_stat(page, &locked, &flags);
>>   spin_unlock_irq(&mapping->tree_lock);
>>   __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 5421b8a..2685d8a 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -44,6 +44,7 @@ enum mem_cgroup_stat_index {
>>   MEM_CGROUP_STAT_RSS,   /* # of pages charged as anon rss */
>>   MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
>>   MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
>> + MEM_CGROUP_STAT_FILE_DIRTY,  /* # of dirty pages in page cache */
>>   MEM_CGROUP_STAT_NSTATS,
>>  };
>>
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 83efee7..b589be5 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -62,6 +62,11 @@
>>   *  ->swap_lock  (exclusive_swap_page, others)
>>   *->mapping->tree_lock
>>   *
>> + *    ->private_lock (__set_page_dirty_buffers)
>> + *      ->mapping->tree_lock
>> + *        ->memcg->move_lock (mem_cgroup_begin_update_page_stat->
>> + *                              move_lock_mem_cgroup)

Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting

2013-01-09 Thread Sha Zhengju
On Mon, Jan 7, 2013 at 4:07 AM, Greg Thelen gthe...@google.com wrote:
 On Tue, Dec 25 2012, Sha Zhengju wrote:

 From: Sha Zhengju handai@taobao.com

 This patch adds memcg routines to count dirty pages, which allows memory 
 controller
 to maintain an accurate view of the amount of its dirty memory and can 
 provide some
 info for users while cgroup's direct reclaim is working.

 After Kame's commit 89c06bd5(memcg: use new logic for page stat accounting), 
 we can
 use 'struct page' flag to test page state instead of per page_cgroup flag. 
 But memcg
 has a feature to move a page from a cgroup to another one and may have race 
 between
 move and page stat accounting. So in order to avoid the race we have 
 designed a
 bigger lock:

  mem_cgroup_begin_update_page_stat()
  modify page information--(a)
  mem_cgroup_update_page_stat()  --(b)
  mem_cgroup_end_update_page_stat()
 It requires (a) and (b)(dirty pages accounting) can stay close enough.
 In the previous two prepare patches, we have reworked the vfs set page dirty 
 routines
 and now the interfaces are more explicit:
 incrementing (2):
 __set_page_dirty
 __set_page_dirty_nobuffers
 decrementing (2):
 clear_page_dirty_for_io
 cancel_dirty_page

 To prevent AB/BA deadlock mentioned by Greg Thelen in previous version
 (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
 -private_lock -- mapping-tree_lock -- memcg-move_lock.
 So we need to make mapping-tree_lock ahead of TestSetPageDirty in 
 __set_page_dirty()
 and __set_page_dirty_nobuffers(). But in order to avoiding useless spinlock 
 contention,
 a prepare PageDirty() checking is added.


 Signed-off-by: Sha Zhengju handai@taobao.com
 Acked-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujtisu.com
 Acked-by: Fengguang Wu fengguang...@intel.com
 ---
  fs/buffer.c|   14 +-
  include/linux/memcontrol.h |1 +
  mm/filemap.c   |   10 ++
  mm/memcontrol.c|   29 ++---
  mm/page-writeback.c|   39 ---
  mm/truncate.c  |6 ++
  6 files changed, 84 insertions(+), 15 deletions(-)

 __nilfs_clear_page_dirty() clears PageDirty, does it need modification
 for this patch series?

It doesn't need to do so.
mem_cgroup_dec/inc_page_stat() is accompany with
dec/inc_zone_page_state() to account memcg page stat.  IMHO we only
have to do some modification while SetPageDirty and
dec/inc_zone_page_state() occur together.
__nilfs_clear_page_dirty() will call clear_page_dirty_for_io(page)
later where the accounting is done.

 diff --git a/fs/buffer.c b/fs/buffer.c
 index 762168a..53402d2 100644
 --- a/fs/buffer.c
 +++ b/fs/buffer.c
 @@ -612,19 +612,31 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
  int __set_page_dirty(struct page *page,
   struct address_space *mapping, int warn)
  {
 + bool locked;
 + unsigned long flags;
 +
   if (unlikely(!mapping))
   return !TestSetPageDirty(page);

 - if (TestSetPageDirty(page))
 + if (PageDirty(page))
   return 0;

   spin_lock_irq(mapping-tree_lock);
 + mem_cgroup_begin_update_page_stat(page, locked, flags);
 +
 + if (TestSetPageDirty(page)) {
 + mem_cgroup_end_update_page_stat(page, locked, flags);
 + spin_unlock_irq(mapping-tree_lock);
 + return 0;
 + }
 +
   if (page-mapping) {/* Race with truncate? */
   WARN_ON_ONCE(warn  !PageUptodate(page));
   account_page_dirtied(page, mapping);
   radix_tree_tag_set(mapping-page_tree,
   page_index(page), PAGECACHE_TAG_DIRTY);
   }
 + mem_cgroup_end_update_page_stat(page, locked, flags);
   spin_unlock_irq(mapping-tree_lock);
   __mark_inode_dirty(mapping-host, I_DIRTY_PAGES);

 diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
 index 5421b8a..2685d8a 100644
 --- a/include/linux/memcontrol.h
 +++ b/include/linux/memcontrol.h
 @@ -44,6 +44,7 @@ enum mem_cgroup_stat_index {
   MEM_CGROUP_STAT_RSS,   /* # of pages charged as anon rss */
   MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
   MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
 + MEM_CGROUP_STAT_FILE_DIRTY,  /* # of dirty pages in page cache */
   MEM_CGROUP_STAT_NSTATS,
  };

 diff --git a/mm/filemap.c b/mm/filemap.c
 index 83efee7..b589be5 100644
 --- a/mm/filemap.c
 +++ b/mm/filemap.c
 @@ -62,6 +62,11 @@
   *  -swap_lock  (exclusive_swap_page, others)
   *-mapping-tree_lock
   *
 + *-private_lock (__set_page_dirty_buffers)
 + *  -mapping-tree_lock
 + *-memcg-move_lock (mem_cgroup_begin_update_page_stat-
 + *   move_lock_mem_cgroup)
 + *
   *  -i_mutex
   

Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting

2013-01-09 Thread Sha Zhengju
Hi Hugh,

On Mon, Jan 7, 2013 at 4:02 AM, Hugh Dickins hu...@google.com wrote:
 On Sat, 5 Jan 2013, Sha Zhengju wrote:
 On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko mho...@suse.cz wrote:
 
  Maybe I have missed some other locking which would prevent this from
  happening but the locking relations are really complicated in this area
  so if mem_cgroup_{begin,end}_update_page_stat might be called
  recursively then we need a fat comment which justifies that.
 

 Ohhh...good catching!  I didn't notice there is a recursive call of
 mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap().
 The mem_cgroup_{begin,end}_update_page_stat() design has depressed
 me a lot recently as the lock granularity is a little bigger than I thought.
 Not only the resource but also some code logic is in the range of locking
 which may be deadlock prone. The problem still exists if we are trying to
 add stat account of other memcg page later, may I make bold to suggest
 that we dig into the lock again...

 Forgive me, I must confess I'm no more than skimming this thread,
 and don't like dumping unsigned-off patches on people; but thought
 that on balance it might be more helpful than not if I offer you a
 patch I worked on around 3.6-rc2 (but have updated to 3.8-rc2 below).

Thanks for your interest in this matter! I really appreciate your work!

 I too was getting depressed by the constraints imposed by
 mem_cgroup_{begin,end}_update_page_stat (good job though Kamezawa-san
 did to minimize them), and wanted to replace by something freer, more
 RCU-like.  In the end it seemed more effort than it was worth to go
 as far as I wanted, but I do think that this is some improvement over
 what we currently have, and should deal with your recursion issue.
It takes me some time to understand the patch. yeah, it can solve my
recursion issue and also reduce some locks(e.g. move_lock). But it did
have some side effect on move end as it will become slower. To my
knowledge, each task is forked in root memcg, and there's a moving
while attaching it to a cgroup. So move_account is also a frequent
behavior to some extent.
Some comments are below.

 But if this does appear useful to memcg people, then we really ought
 to get it checked over by locking/barrier experts before going further.
 I think myself that I've over-barriered it, and could use a little
 lighter; but they (Paul McKenney, Peter Zijlstra, Oleg Nesterov come
 to mind) will see more clearly, and may just hate the whole thing,
 as yet another peculiar lockdep-avoiding hand-crafted locking scheme.
 I've not wanted to waste their time on reviewing it, if it's not even
 going to be useful to memcg people.

 It may be easier to understand if you just apply the patch and look
 at the result in mm/memcontrol.c, where I tried to gather the pieces
 together in one place and describe them (These functions mediate...).

 Hugh

  include/linux/memcontrol.h |   39 +--
  mm/memcontrol.c|  375 +--
  mm/rmap.c  |   20 -
  3 files changed, 257 insertions(+), 177 deletions(-)

 --- 3.8-rc2/include/linux/memcontrol.h  2012-12-22 09:43:27.172015571 -0800
 +++ linux/include/linux/memcontrol.h2013-01-02 14:47:47.960394878 -0800
 @@ -136,32 +136,28 @@ static inline bool mem_cgroup_disabled(v
 return false;
  }

 -void __mem_cgroup_begin_update_page_stat(struct page *page, bool *locked,
 -unsigned long *flags);
 -
 +void __mem_cgroup_begin_update_page_stat(struct page *page);
 +void __mem_cgroup_end_update_page_stat(void);
  extern atomic_t memcg_moving;

  static inline void mem_cgroup_begin_update_page_stat(struct page *page,
 -   bool *locked, unsigned long *flags)
 +bool *clamped)
  {
 -   if (mem_cgroup_disabled())
 -   return;
 -   rcu_read_lock();
 -   *locked = false;
 -   if (atomic_read(memcg_moving))
 -   __mem_cgroup_begin_update_page_stat(page, locked, flags);
 +   preempt_disable();

Referring to synchronize_rcu in mem_cgroup_begin_move(), here
rcu_read_lock() lost?

 +   *clamped = false;
 +   if (unlikely(atomic_read(memcg_moving))) {
 +   __mem_cgroup_begin_update_page_stat(page);
 +   *clamped = true;
 +   }
  }

 -void __mem_cgroup_end_update_page_stat(struct page *page,
 -   unsigned long *flags);
  static inline void mem_cgroup_end_update_page_stat(struct page *page,
 -   bool *locked, unsigned long *flags)
 +  bool *clamped)
  {
 -   if (mem_cgroup_disabled())
 -   return;
 -   if (*locked)
 -   __mem_cgroup_end_update_page_stat(page, flags);
 -   rcu_read_unlock();
 +   /* We don't currently use the page arg, but keep it for symmetry */
 +   if 

Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting

2013-01-09 Thread Michal Hocko
On Wed 09-01-13 22:35:12, Sha Zhengju wrote:
[...]
 To my knowledge, each task is forked in root memcg, and there's a
 moving while attaching it to a cgroup. So move_account is also a
 frequent behavior to some extent.

Not really. Every fork/exec is copies the current group (see
cgroup_fork) so there is no moving on that path.
[...]
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting

2013-01-09 Thread Sha Zhengju
On Mon, Jan 7, 2013 at 3:25 PM, Kamezawa Hiroyuki
kamezawa.hir...@jp.fujitsu.com wrote:
 (2013/01/05 13:48), Sha Zhengju wrote:

 On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko mho...@suse.cz wrote:

 On Wed 26-12-12 01:26:07, Sha Zhengju wrote:

 From: Sha Zhengju handai@taobao.com

 This patch adds memcg routines to count dirty pages, which allows memory
 controller
 to maintain an accurate view of the amount of its dirty memory and can
 provide some
 info for users while cgroup's direct reclaim is working.


 I guess you meant targeted resp. (hard/soft) limit reclaim here,
 right? It is true that this is direct reclaim but it is not clear to me


 Yes, I meant memcg hard/soft reclaim here which is triggered directly
 by allocation and is distinct from background kswapd reclaim (global).

 why the usefulnes should be limitted to the reclaim for users. I would
 understand this if the users was in fact in-kernel users.


 One of the reasons I'm trying to accounting the dirty pages is to get a
 more board overall view of memory usages because memcg hard/soft
 reclaim may have effect on response time of user application.
 Yeah, the beneficiary can be application administrator or kernel users.
 :P

 [...]

 To prevent AB/BA deadlock mentioned by Greg Thelen in previous version
 (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
 -private_lock -- mapping-tree_lock -- memcg-move_lock.
 So we need to make mapping-tree_lock ahead of TestSetPageDirty in
 __set_page_dirty()
 and __set_page_dirty_nobuffers(). But in order to avoiding useless
 spinlock contention,
 a prepare PageDirty() checking is added.


 But there is another AA deadlock here I believe.
 page_remove_rmap
mem_cgroup_begin_update_page_stat  1
set_page_dirty
  __set_page_dirty_buffers
__set_page_dirty
  mem_cgroup_begin_update_page_stat2
move_lock_mem_cgroup
  spin_lock_irqsave(memcg-move_lock, *flags);

 mem_cgroup_begin_update_page_stat is not recursive wrt. locking AFAICS
 because we might race with the moving charges:
  CPU0CPU1
 page_remove_rmap
  mem_cgroup_can_attach
mem_cgroup_begin_update_page_stat (1)
  rcu_read_lock
mem_cgroup_start_move

 atomic_inc(memcg_moving)

 atomic_inc(memcg-moving_account)
  synchronize_rcu
  __mem_cgroup_begin_update_page_stat
mem_cgroup_stolen  TRUE
move_lock_mem_cgroup
[...]
  mem_cgroup_begin_update_page_stat (2)
__mem_cgroup_begin_update_page_stat
  mem_cgroup_stolen  still TRUE
  move_lock_mem_cgroup   DEADLOCK
[...]
mem_cgroup_end_update_page_stat
  rcu_unlock
# wake up from
 synchronize_rcu
  [...]
  mem_cgroup_move_task
mem_cgroup_move_charge
  walk_page_range

 mem_cgroup_move_account

 move_lock_mem_cgroup


 Maybe I have missed some other locking which would prevent this from
 happening but the locking relations are really complicated in this area
 so if mem_cgroup_{begin,end}_update_page_stat might be called
 recursively then we need a fat comment which justifies that.


 Ohhh...good catching!  I didn't notice there is a recursive call of
 mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap().
 The mem_cgroup_{begin,end}_update_page_stat() design has depressed
 me a lot recently as the lock granularity is a little bigger than I
 thought.
 Not only the resource but also some code logic is in the range of locking
 which may be deadlock prone. The problem still exists if we are trying to
 add stat account of other memcg page later, may I make bold to suggest
 that we dig into the lock again...

 But with regard to the current lock implementation, I doubt if we can we
 can
 account MEM_CGROUP_STAT_FILE_{MAPPED, DIRTY} in one breath and just
 try to get move_lock once in the beginning. IMHO we can make
 mem_cgroup_{begin,end}_update_page_stat() to recursive aware and what I'm
 thinking now is changing memcg-move_lock to rw-spinlock from the
 original spinlock:
 mem_cgroup_{begin,end}_update_page_stat() try to get the read lock which
 make it
 reenterable and memcg moving task side try to get the write spinlock.
 Then the race may be following:

  CPU0CPU1
 page_remove_rmap
  mem_cgroup_can_attach
mem_cgroup_begin_update_page_stat (1)
  rcu_read_lock
mem_cgroup_start_move

 atomic_inc(memcg_moving)

 

Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting

2013-01-09 Thread Kamezawa Hiroyuki

(2013/01/10 0:02), Sha Zhengju wrote:

On Mon, Jan 7, 2013 at 3:25 PM, Kamezawa Hiroyuki
kamezawa.hir...@jp.fujitsu.com wrote:

(2013/01/05 13:48), Sha Zhengju wrote:


On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko mho...@suse.cz wrote:


On Wed 26-12-12 01:26:07, Sha Zhengju wrote:


From: Sha Zhengju handai@taobao.com

This patch adds memcg routines to count dirty pages, which allows memory
controller
to maintain an accurate view of the amount of its dirty memory and can
provide some
info for users while cgroup's direct reclaim is working.



I guess you meant targeted resp. (hard/soft) limit reclaim here,
right? It is true that this is direct reclaim but it is not clear to me



Yes, I meant memcg hard/soft reclaim here which is triggered directly
by allocation and is distinct from background kswapd reclaim (global).


why the usefulnes should be limitted to the reclaim for users. I would
understand this if the users was in fact in-kernel users.



One of the reasons I'm trying to accounting the dirty pages is to get a
more board overall view of memory usages because memcg hard/soft
reclaim may have effect on response time of user application.
Yeah, the beneficiary can be application administrator or kernel users.
:P


[...]


To prevent AB/BA deadlock mentioned by Greg Thelen in previous version
(https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
-private_lock -- mapping-tree_lock -- memcg-move_lock.
So we need to make mapping-tree_lock ahead of TestSetPageDirty in
__set_page_dirty()
and __set_page_dirty_nobuffers(). But in order to avoiding useless
spinlock contention,
a prepare PageDirty() checking is added.



But there is another AA deadlock here I believe.
page_remove_rmap
mem_cgroup_begin_update_page_stat  1
set_page_dirty
  __set_page_dirty_buffers
__set_page_dirty
  mem_cgroup_begin_update_page_stat2
move_lock_mem_cgroup
  spin_lock_irqsave(memcg-move_lock, *flags);

mem_cgroup_begin_update_page_stat is not recursive wrt. locking AFAICS
because we might race with the moving charges:
  CPU0CPU1
page_remove_rmap
  mem_cgroup_can_attach
mem_cgroup_begin_update_page_stat (1)
  rcu_read_lock
mem_cgroup_start_move

atomic_inc(memcg_moving)

atomic_inc(memcg-moving_account)
  synchronize_rcu
  __mem_cgroup_begin_update_page_stat
mem_cgroup_stolen  TRUE
move_lock_mem_cgroup
[...]
  mem_cgroup_begin_update_page_stat (2)
__mem_cgroup_begin_update_page_stat
  mem_cgroup_stolen  still TRUE
  move_lock_mem_cgroup   DEADLOCK
[...]
mem_cgroup_end_update_page_stat
  rcu_unlock
# wake up from
synchronize_rcu
  [...]
  mem_cgroup_move_task
mem_cgroup_move_charge
  walk_page_range

mem_cgroup_move_account

move_lock_mem_cgroup


Maybe I have missed some other locking which would prevent this from
happening but the locking relations are really complicated in this area
so if mem_cgroup_{begin,end}_update_page_stat might be called
recursively then we need a fat comment which justifies that.



Ohhh...good catching!  I didn't notice there is a recursive call of
mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap().
The mem_cgroup_{begin,end}_update_page_stat() design has depressed
me a lot recently as the lock granularity is a little bigger than I
thought.
Not only the resource but also some code logic is in the range of locking
which may be deadlock prone. The problem still exists if we are trying to
add stat account of other memcg page later, may I make bold to suggest
that we dig into the lock again...

But with regard to the current lock implementation, I doubt if we can we
can
account MEM_CGROUP_STAT_FILE_{MAPPED, DIRTY} in one breath and just
try to get move_lock once in the beginning. IMHO we can make
mem_cgroup_{begin,end}_update_page_stat() to recursive aware and what I'm
thinking now is changing memcg-move_lock to rw-spinlock from the
original spinlock:
mem_cgroup_{begin,end}_update_page_stat() try to get the read lock which
make it
reenterable and memcg moving task side try to get the write spinlock.
Then the race may be following:

  CPU0CPU1
page_remove_rmap
  mem_cgroup_can_attach
mem_cgroup_begin_update_page_stat (1)
  rcu_read_lock
mem_cgroup_start_move

atomic_inc(memcg_moving)


Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting

2013-01-09 Thread Sha Zhengju
On Thu, Jan 10, 2013 at 10:16 AM, Kamezawa Hiroyuki
kamezawa.hir...@jp.fujitsu.com wrote:
 (2013/01/10 0:02), Sha Zhengju wrote:

 On Mon, Jan 7, 2013 at 3:25 PM, Kamezawa Hiroyuki
 kamezawa.hir...@jp.fujitsu.com wrote:

 (2013/01/05 13:48), Sha Zhengju wrote:


 On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko mho...@suse.cz wrote:


 On Wed 26-12-12 01:26:07, Sha Zhengju wrote:


 From: Sha Zhengju handai@taobao.com

 This patch adds memcg routines to count dirty pages, which allows
 memory
 controller
 to maintain an accurate view of the amount of its dirty memory and can
 provide some
 info for users while cgroup's direct reclaim is working.



 I guess you meant targeted resp. (hard/soft) limit reclaim here,
 right? It is true that this is direct reclaim but it is not clear to me



 Yes, I meant memcg hard/soft reclaim here which is triggered directly
 by allocation and is distinct from background kswapd reclaim (global).

 why the usefulnes should be limitted to the reclaim for users. I would
 understand this if the users was in fact in-kernel users.


 One of the reasons I'm trying to account the dirty pages is to get a
 broader overall view of memory usage, because memcg hard/soft
 reclaim may have an effect on the response time of user applications.
 Yeah, the beneficiary can be the application administrator or kernel users.
 :P

 [...]


 To prevent AB/BA deadlock mentioned by Greg Thelen in previous version
 (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
 ->private_lock --> mapping->tree_lock --> memcg->move_lock.
 So we need to make mapping->tree_lock ahead of TestSetPageDirty in
 __set_page_dirty()
 and __set_page_dirty_nobuffers(). But in order to avoiding useless
 spinlock contention,
 a prepare PageDirty() checking is added.



 But there is another AA deadlock here I believe.
 page_remove_rmap
   mem_cgroup_begin_update_page_stat         <<< 1
   set_page_dirty
     __set_page_dirty_buffers
       __set_page_dirty
         mem_cgroup_begin_update_page_stat   <<< 2
           move_lock_mem_cgroup
             spin_lock_irqsave(&memcg->move_lock, *flags);

 mem_cgroup_begin_update_page_stat is not recursive wrt. locking AFAICS
 because we might race with the moving charges:
         CPU0                                CPU1
 page_remove_rmap
                                     mem_cgroup_can_attach
   mem_cgroup_begin_update_page_stat (1)
     rcu_read_lock
                                       mem_cgroup_start_move
                                         atomic_inc(&memcg_moving)
                                         atomic_inc(&memcg->moving_account)
                                         synchronize_rcu
     __mem_cgroup_begin_update_page_stat
       mem_cgroup_stolen <<< TRUE
       move_lock_mem_cgroup
       [...]
   mem_cgroup_begin_update_page_stat (2)
     __mem_cgroup_begin_update_page_stat
       mem_cgroup_stolen <<< still TRUE
       move_lock_mem_cgroup <<< DEADLOCK
       [...]
   mem_cgroup_end_update_page_stat
     rcu_unlock
                                       # wake up from synchronize_rcu
                                       [...]
                                       mem_cgroup_move_task
                                         mem_cgroup_move_charge
                                           walk_page_range
                                             mem_cgroup_move_account
                                               move_lock_mem_cgroup


 Maybe I have missed some other locking which would prevent this from
 happening but the locking relations are really complicated in this area
 so if mem_cgroup_{begin,end}_update_page_stat might be called
 recursively then we need a fat comment which justifies that.


 Ohhh...good catching!  I didn't notice there is a recursive call of
 mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap().
 The mem_cgroup_{begin,end}_update_page_stat() design has depressed
 me a lot recently as the lock granularity is a little bigger than I
 thought.
 Not only the resource but also some code logic is in the range of
 locking
 which may be deadlock prone. The problem still exists if we are trying
 to
 add stat account of other memcg page later, may I make bold to suggest
 that we dig into the lock again...

 But with regard to the current lock implementation, I doubt whether we can
 account MEM_CGROUP_STAT_FILE_{MAPPED, DIRTY} in one breath and just
 try to get move_lock once in the beginning. IMHO we can make
 mem_cgroup_{begin,end}_update_page_stat() recursion aware, and what I'm
 thinking of now is changing memcg->move_lock from the original spinlock
 to a rw-spinlock:
 mem_cgroup_{begin,end}_update_page_stat() tries to get the read lock, which
 makes it reentrant, and the memcg moving task side tries to get the write
 lock.
 Then the race may be the following:

         CPU0                                CPU1
 page_remove_rmap
                                     mem_cgroup_can_attach
   mem_cgroup_begin_update_page_stat (1)
     rcu_read_lock
                                       mem_cgroup_start_move

 

Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting

2013-01-09 Thread Kamezawa Hiroyuki

(2013/01/10 13:26), Sha Zhengju wrote:


But this method also has its pros and cons(e.g. need lock nesting). So
I doubt whether the following is able to deal with these issues all
together:
(CPU-A does page stat accounting and CPU-B does move)

        CPU-A                                CPU-B

move_lock_mem_cgroup()
memcg = pc->mem_cgroup
SetPageDirty(page)
move_unlock_mem_cgroup()
                                     move_lock_mem_cgroup()
                                     if (PageDirty) {
                                             old_memcg->nr_dirty--;
                                             new_memcg->nr_dirty++;
                                     }
                                     pc->mem_cgroup = new_memcg
                                     move_unlock_mem_cgroup()

memcg->nr_dirty++


For CPU-A, we save pc->mem_cgroup in a temporary variable just before
SetPageDirty inside move_lock and then update the stats if the page is set
PG_dirty successfully. But CPU-B may do the moving in advance, so its
old_memcg->nr_dirty-- will make old_memcg->nr_dirty incorrect; however,
CPU-A's memcg->nr_dirty++ follows on its heels and amends the stats.
So there is a potential problem that old_memcg->nr_dirty may go negative
for a very short period, but that is not a big issue IMHO.
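
A minimal C model of this scheme (a userspace sketch; struct memcg, the
page_dirty flag and move_lock below are simplifications, not the kernel
structures):

    #include <pthread.h>
    #include <stdio.h>

    struct memcg { long nr_dirty; };

    static pthread_mutex_t move_lock = PTHREAD_MUTEX_INITIALIZER;
    static struct memcg old_memcg, new_memcg;
    static struct memcg *page_memcg = &old_memcg;   /* models pc->mem_cgroup */
    static int page_dirty;                          /* models PG_dirty */

    static void set_page_dirty_account(void)        /* CPU-A */
    {
        struct memcg *memcg;
        int already_dirty;

        pthread_mutex_lock(&move_lock);
        memcg = page_memcg;                 /* record the owner first */
        already_dirty = page_dirty;         /* TestSetPageDirty() */
        page_dirty = 1;
        pthread_mutex_unlock(&move_lock);

        if (!already_dirty)
            memcg->nr_dirty++;              /* outside the lock; the mover may
                                               already have fixed up the counters */
    }

    static void move_account(void)                  /* CPU-B */
    {
        pthread_mutex_lock(&move_lock);
        if (page_dirty) {
            old_memcg.nr_dirty--;           /* may transiently go negative */
            new_memcg.nr_dirty++;
        }
        page_memcg = &new_memcg;            /* pc->mem_cgroup = new_memcg */
        pthread_mutex_unlock(&move_lock);
    }

    int main(void)
    {
        set_page_dirty_account();
        move_account();
        printf("old %ld new %ld\n", old_memcg.nr_dirty, new_memcg.nr_dirty);
        return 0;
    }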



IMHO, this will work. Please take care that the recorded memcg is not an
invalid pointer when you update nr_dirty later.
(Maybe RCU will protect it.)

_If_ this method can handle the nesting problem more cleanly and make the
implementation simpler, please go ahead. To be honest, I'm not sure how the
code will look until I see the patch. Hmm, why do you write SetPageDirty()
here rather than TestSetPageDirty()?

Thanks,
-Kame




--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting

2013-01-08 Thread Kamezawa Hiroyuki

(2013/01/09 14:15), Hugh Dickins wrote:

On Mon, 7 Jan 2013, Kamezawa Hiroyuki wrote:

(2013/01/07 5:02), Hugh Dickins wrote:


Forgive me, I must confess I'm no more than skimming this thread,
and don't like dumping unsigned-off patches on people; but thought
that on balance it might be more helpful than not if I offer you a
patch I worked on around 3.6-rc2 (but have updated to 3.8-rc2 below).

I too was getting depressed by the constraints imposed by
mem_cgroup_{begin,end}_update_page_stat (good job though Kamezawa-san
did to minimize them), and wanted to replace by something freer, more
RCU-like.  In the end it seemed more effort than it was worth to go
as far as I wanted, but I do think that this is some improvement over
what we currently have, and should deal with your recursion issue.


In what case does this improve performance ?


Perhaps none.  I was aiming to not degrade performance at the stats
update end, and make it more flexible, so new stats can be updated which
would be problematic today (for lock ordering and recursion reasons).

I've not done any performance measurement on it, and don't have enough
cpus for an interesting report; but if someone thinks it might solve a
problem for them, and has plenty of cpus to test with, please go ahead,
we'd be glad to hear the results.


Hi, this patch seems interesting but...doesn't this make move_account() very
slow if the number of cpus increases because of scanning all cpus per a page
?
And this looks like reader-can-block-writer percpu rwlock..it's too heavy to
writers if there are many readers.


I was happy to make the relatively rare move_account end considerably
heavier.  I'll be disappointed if it turns out to be prohibitively
heavy at that end - if we're going to make move_account impossible,
there are much easier ways to achieve that! - but it is a possibility.



move_account at task-move has been a required feature for NEC, and Nishimura-san
did a good job on it. I'd like to keep it available as much as possible.


Something you might have missed when considering many readers (stats
updaters): the move_account end does not wait for a moment when there
are no readers, that would indeed be a losing strategy; it just waits
for each cpu that's updating page stats to leave that section, so every
cpu is sure to notice and hold off if it then tries to update the page
which is to be moved.  (I may not be explaining that very well!)



Hmm, yeah, maybe I am missing something.

BTW, if nested, mem_cgroup_end_update_page_stat() seems to make the counter go negative.

Thanks,
-Kame


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting

2013-01-08 Thread Hugh Dickins
On Mon, 7 Jan 2013, Kamezawa Hiroyuki wrote:
> (2013/01/07 5:02), Hugh Dickins wrote:
> > 
> > Forgive me, I must confess I'm no more than skimming this thread,
> > and don't like dumping unsigned-off patches on people; but thought
> > that on balance it might be more helpful than not if I offer you a
> > patch I worked on around 3.6-rc2 (but have updated to 3.8-rc2 below).
> > 
> > I too was getting depressed by the constraints imposed by
> > mem_cgroup_{begin,end}_update_page_stat (good job though Kamezawa-san
> > did to minimize them), and wanted to replace by something freer, more
> > RCU-like.  In the end it seemed more effort than it was worth to go
> > as far as I wanted, but I do think that this is some improvement over
> > what we currently have, and should deal with your recursion issue.
> > 
> In what case does this improve performance ?

Perhaps none.  I was aiming to not degrade performance at the stats
update end, and make it more flexible, so new stats can be updated which
would be problematic today (for lock ordering and recursion reasons).

I've not done any performance measurement on it, and don't have enough
cpus for an interesting report; but if someone thinks it might solve a
problem for them, and has plenty of cpus to test with, please go ahead,
we'd be glad to hear the results.

> Hi, this patch seems interesting but...doesn't this make move_account() very
> slow if the number of cpus increases because of scanning all cpus per a page
> ?
> And this looks like reader-can-block-writer percpu rwlock..it's too heavy to
> writers if there are many readers.

I was happy to make the relatively rare move_account end considerably
heavier.  I'll be disappointed if it turns out to be prohibitively
heavy at that end - if we're going to make move_account impossible,
there are much easier ways to achieve that! - but it is a possibility.

Something you might have missed when considering many readers (stats
updaters): the move_account end does not wait for a moment when there
are no readers, that would indeed be a losing strategy; it just waits
for each cpu that's updating page stats to leave that section, so every
cpu is sure to notice and hold off if it then tries to update the page
which is to be moved.  (I may not be explaining that very well!)
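
A userspace model of that waiting scheme (the NCPUS constant, the per-cpu flag
array and the memory ordering below are illustrative simplifications, not the
per-cpu machinery in the patch):

    #include <stdatomic.h>
    #include <stdio.h>

    #define NCPUS 4

    static atomic_int updating[NCPUS];      /* per-cpu "inside a stat update" flag */
    static atomic_int moving;               /* like memcg_moving */

    static void begin_update(int cpu)
    {
        atomic_store(&updating[cpu], 1);
        /* the real code would now re-check "moving" and take a slow path
           if a move has started; omitted in this model */
    }

    static void end_update(int cpu)
    {
        atomic_store(&updating[cpu], 0);
    }

    static void move_account_wait(void)
    {
        int cpu;

        atomic_fetch_add(&moving, 1);       /* later updaters will notice this */
        for (cpu = 0; cpu < NCPUS; cpu++)
            while (atomic_load(&updating[cpu]))
                ;                           /* wait only for this cpu to leave
                                               its current update section */
        /* ... safe to move the page's stats now ... */
        atomic_fetch_sub(&moving, 1);
    }

    int main(void)
    {
        begin_update(0);
        end_update(0);
        move_account_wait();
        printf("move completed after per-cpu waits (moving=%d)\n",
               atomic_load(&moving));
        return 0;
    }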

Hugh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting

2013-01-06 Thread Kamezawa Hiroyuki

(2013/01/07 5:02), Hugh Dickins wrote:

On Sat, 5 Jan 2013, Sha Zhengju wrote:

On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko  wrote:


Maybe I have missed some other locking which would prevent this from
happening but the locking relations are really complicated in this area
so if mem_cgroup_{begin,end}_update_page_stat might be called
recursively then we need a fat comment which justifies that.



Ohhh...good catching!  I didn't notice there is a recursive call of
mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap().
The mem_cgroup_{begin,end}_update_page_stat() design has depressed
me a lot recently as the lock granularity is a little bigger than I thought.
Not only the resource but also some code logic is in the range of locking
which may be deadlock prone. The problem still exists if we are trying to
add stat account of other memcg page later, may I make bold to suggest
that we dig into the lock again...


Forgive me, I must confess I'm no more than skimming this thread,
and don't like dumping unsigned-off patches on people; but thought
that on balance it might be more helpful than not if I offer you a
patch I worked on around 3.6-rc2 (but have updated to 3.8-rc2 below).

I too was getting depressed by the constraints imposed by
mem_cgroup_{begin,end}_update_page_stat (good job though Kamezawa-san
did to minimize them), and wanted to replace by something freer, more
RCU-like.  In the end it seemed more effort than it was worth to go
as far as I wanted, but I do think that this is some improvement over
what we currently have, and should deal with your recursion issue.


In what case does this improve performance ?


But if this does appear useful to memcg people, then we really ought
to get it checked over by locking/barrier experts before going further.
I think myself that I've over-barriered it, and could use a little
lighter; but they (Paul McKenney, Peter Zijlstra, Oleg Nesterov come
to mind) will see more clearly, and may just hate the whole thing,
as yet another peculiar lockdep-avoiding hand-crafted locking scheme.
I've not wanted to waste their time on reviewing it, if it's not even
going to be useful to memcg people.

It may be easier to understand if you just apply the patch and look
at the result in mm/memcontrol.c, where I tried to gather the pieces
together in one place and describe them ("These functions mediate...").

Hugh



Hi, this patch seems interesting but...doesn't this make move_account() very
slow if the number of cpus increases because of scanning all cpus per a page ?
And this looks like reader-can-block-writer percpu rwlock..it's too heavy to
writers if there are many readers.


Thanks,
-Kame


 








--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting

2013-01-06 Thread Kamezawa Hiroyuki

(2013/01/05 13:48), Sha Zhengju wrote:

On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko  wrote:

On Wed 26-12-12 01:26:07, Sha Zhengju wrote:

From: Sha Zhengju 

This patch adds memcg routines to count dirty pages, which allows memory 
controller
to maintain an accurate view of the amount of its dirty memory and can provide 
some
info for users while cgroup's direct reclaim is working.


I guess you meant targeted resp. (hard/soft) limit reclaim here,
right? It is true that this is direct reclaim but it is not clear to me


Yes, I meant memcg hard/soft reclaim here which is triggered directly
by allocation and is distinct from background kswapd reclaim (global).


why the usefulnes should be limitted to the reclaim for users. I would
understand this if the users was in fact in-kernel users.



One of the reasons I'm trying to account the dirty pages is to get a
broader overall view of memory usage, because memcg hard/soft
reclaim may have an effect on the response time of user applications.
Yeah, the beneficiary can be the application administrator or kernel users.  :P


[...]

To prevent AB/BA deadlock mentioned by Greg Thelen in previous version
(https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
->private_lock --> mapping->tree_lock --> memcg->move_lock.
So we need to make mapping->tree_lock ahead of TestSetPageDirty in 
__set_page_dirty()
and __set_page_dirty_nobuffers(). But in order to avoiding useless spinlock 
contention,
a prepare PageDirty() checking is added.


But there is another AA deadlock here I believe.
page_remove_rmap
  mem_cgroup_begin_update_page_stat         <<< 1
  set_page_dirty
    __set_page_dirty_buffers
      __set_page_dirty
        mem_cgroup_begin_update_page_stat   <<< 2
          move_lock_mem_cgroup
            spin_lock_irqsave(&memcg->move_lock, *flags);

mem_cgroup_begin_update_page_stat is not recursive wrt. locking AFAICS
because we might race with the moving charges:
        CPU0                                CPU1
page_remove_rmap
                                    mem_cgroup_can_attach
  mem_cgroup_begin_update_page_stat (1)
    rcu_read_lock
                                      mem_cgroup_start_move
                                        atomic_inc(&memcg_moving)
                                        atomic_inc(&memcg->moving_account)
                                        synchronize_rcu
    __mem_cgroup_begin_update_page_stat
      mem_cgroup_stolen <<< TRUE
      move_lock_mem_cgroup
      [...]
  mem_cgroup_begin_update_page_stat (2)
    __mem_cgroup_begin_update_page_stat
      mem_cgroup_stolen <<< still TRUE
      move_lock_mem_cgroup <<< DEADLOCK
      [...]
  mem_cgroup_end_update_page_stat
    rcu_unlock
                                      # wake up from synchronize_rcu
                                      [...]
                                      mem_cgroup_move_task
                                        mem_cgroup_move_charge
                                          walk_page_range
                                            mem_cgroup_move_account
                                              move_lock_mem_cgroup


Maybe I have missed some other locking which would prevent this from
happening but the locking relations are really complicated in this area
so if mem_cgroup_{begin,end}_update_page_stat might be called
recursively then we need a fat comment which justifies that.



Ohhh...good catching!  I didn't notice there is a recursive call of
mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap().
The mem_cgroup_{begin,end}_update_page_stat() design has depressed
me a lot recently as the lock granularity is a little bigger than I thought.
Not only the resource but also some code logic is in the range of locking
which may be deadlock prone. The problem still exists if we are trying to
add stat account of other memcg page later, may I make bold to suggest
that we dig into the lock again...

But with regard to the current lock implementation, I doubt whether we can
account MEM_CGROUP_STAT_FILE_{MAPPED, DIRTY} in one breath and just
try to get move_lock once in the beginning. IMHO we can make
mem_cgroup_{begin,end}_update_page_stat() recursion aware, and what I'm
thinking of now is changing memcg->move_lock from the original spinlock to a
rw-spinlock:
mem_cgroup_{begin,end}_update_page_stat() tries to get the read lock, which
makes it reentrant, and the memcg moving task side tries to get the write lock.
Then the race may be the following:

 CPU0CPU1
page_remove_rmap
 mem_cgroup_can_attach
   mem_cgroup_begin_update_page_stat (1)
 rcu_read_lock
   mem_cgroup_start_move
   

Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting

2013-01-06 Thread Greg Thelen
On Tue, Dec 25 2012, Sha Zhengju wrote:

> From: Sha Zhengju 
>
> This patch adds memcg routines to count dirty pages, which allows memory 
> controller
> to maintain an accurate view of the amount of its dirty memory and can 
> provide some
> info for users while cgroup's direct reclaim is working.
>
> After Kame's commit 89c06bd5(memcg: use new logic for page stat accounting), 
> we can
> use 'struct page' flag to test page state instead of per page_cgroup flag. 
> But memcg
> has a feature to move a page from a cgroup to another one and may have race 
> between
> "move" and "page stat accounting". So in order to avoid the race we have 
> designed a
> bigger lock:
>
>  mem_cgroup_begin_update_page_stat()
>  modify page information-->(a)
>  mem_cgroup_update_page_stat()  -->(b)
>  mem_cgroup_end_update_page_stat()
> It requires (a) and (b)(dirty pages accounting) can stay close enough.
> In the previous two prepare patches, we have reworked the vfs set page dirty 
> routines
> and now the interfaces are more explicit:
> incrementing (2):
> __set_page_dirty
> __set_page_dirty_nobuffers
> decrementing (2):
> clear_page_dirty_for_io
> cancel_dirty_page
>
> To prevent AB/BA deadlock mentioned by Greg Thelen in previous version
> (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
> ->private_lock --> mapping->tree_lock --> memcg->move_lock.
> So we need to make mapping->tree_lock ahead of TestSetPageDirty in 
> __set_page_dirty()
> and __set_page_dirty_nobuffers(). But in order to avoiding useless spinlock 
> contention,
> a prepare PageDirty() checking is added.
>
>
> Signed-off-by: Sha Zhengju 
> Acked-by: KAMEZAWA Hiroyuki 
> Acked-by: Fengguang Wu 
> ---
>  fs/buffer.c|   14 +-
>  include/linux/memcontrol.h |1 +
>  mm/filemap.c   |   10 ++
>  mm/memcontrol.c|   29 ++---
>  mm/page-writeback.c|   39 ---
>  mm/truncate.c  |6 ++
>  6 files changed, 84 insertions(+), 15 deletions(-)

__nilfs_clear_page_dirty() clears PageDirty, does it need modification
for this patch series?

> diff --git a/fs/buffer.c b/fs/buffer.c
> index 762168a..53402d2 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -612,19 +612,31 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
>  int __set_page_dirty(struct page *page,
>   struct address_space *mapping, int warn)
>  {
> + bool locked;
> + unsigned long flags;
> +
>   if (unlikely(!mapping))
>   return !TestSetPageDirty(page);
>  
> - if (TestSetPageDirty(page))
> + if (PageDirty(page))
>   return 0;
>  
>   spin_lock_irq(&mapping->tree_lock);
> + mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> +
> + if (TestSetPageDirty(page)) {
> +         mem_cgroup_end_update_page_stat(page, &locked, &flags);
> +         spin_unlock_irq(&mapping->tree_lock);
> +         return 0;
> + }
> +
>   if (page->mapping) {/* Race with truncate? */
>   WARN_ON_ONCE(warn && !PageUptodate(page));
>   account_page_dirtied(page, mapping);
>   radix_tree_tag_set(&mapping->page_tree,
>   page_index(page), PAGECACHE_TAG_DIRTY);
>   }
> + mem_cgroup_end_update_page_stat(page, &locked, &flags);
>   spin_unlock_irq(&mapping->tree_lock);
>   __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>  
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 5421b8a..2685d8a 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -44,6 +44,7 @@ enum mem_cgroup_stat_index {
>   MEM_CGROUP_STAT_RSS,   /* # of pages charged as anon rss */
>   MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
>   MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
> + MEM_CGROUP_STAT_FILE_DIRTY,  /* # of dirty pages in page cache */
>   MEM_CGROUP_STAT_NSTATS,
>  };
>  
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 83efee7..b589be5 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -62,6 +62,11 @@
>   *  ->swap_lock  (exclusive_swap_page, others)
>   *->mapping->tree_lock
>   *
> + *->private_lock (__set_page_dirty_buffers)
> + *  ->mapping->tree_lock
> + *->memcg->move_lock (mem_cgroup_begin_update_page_stat->
> + *   move_lock_mem_cgroup)
> + *
>   *  ->i_mutex
>   *->i_mmap_mutex (truncate->unmap_mapping_range)
>   *
> @@ -112,6 +117,8 @@
>  void __delete_from_page_cache(struct page *page)
>  {
>   struct address_space *mapping = page->mapping;
> + bool locked;
> + unsigned long flags;
>  
>   /*
>* if we're uptodate, flush out into the cleancache, otherwise
> @@ -139,10 +146,13 @@ void __delete_from_page_cache(struct page *page)
>   

Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting

2013-01-06 Thread Hugh Dickins
On Sat, 5 Jan 2013, Sha Zhengju wrote:
> On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko  wrote:
> >
> > Maybe I have missed some other locking which would prevent this from
> > happening but the locking relations are really complicated in this area
> > so if mem_cgroup_{begin,end}_update_page_stat might be called
> > recursively then we need a fat comment which justifies that.
> >
> 
> Ohhh...good catching!  I didn't notice there is a recursive call of
> mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap().
> The mem_cgroup_{begin,end}_update_page_stat() design has depressed
> me a lot recently as the lock granularity is a little bigger than I thought.
> Not only the resource but also some code logic is in the range of locking
> which may be deadlock prone. The problem still exists if we are trying to
> add stat account of other memcg page later, may I make bold to suggest
> that we dig into the lock again...

Forgive me, I must confess I'm no more than skimming this thread,
and don't like dumping unsigned-off patches on people; but thought
that on balance it might be more helpful than not if I offer you a
patch I worked on around 3.6-rc2 (but have updated to 3.8-rc2 below).

I too was getting depressed by the constraints imposed by
mem_cgroup_{begin,end}_update_page_stat (good job though Kamezawa-san
did to minimize them), and wanted to replace by something freer, more
RCU-like.  In the end it seemed more effort than it was worth to go
as far as I wanted, but I do think that this is some improvement over
what we currently have, and should deal with your recursion issue.

But if this does appear useful to memcg people, then we really ought
to get it checked over by locking/barrier experts before going further.
I think myself that I've over-barriered it, and could use a little
lighter; but they (Paul McKenney, Peter Zijlstra, Oleg Nesterov come
to mind) will see more clearly, and may just hate the whole thing,
as yet another peculiar lockdep-avoiding hand-crafted locking scheme.
I've not wanted to waste their time on reviewing it, if it's not even
going to be useful to memcg people.

It may be easier to understand if you just apply the patch and look
at the result in mm/memcontrol.c, where I tried to gather the pieces
together in one place and describe them ("These functions mediate...").

Hugh

 include/linux/memcontrol.h |   39 +--
 mm/memcontrol.c|  375 +--
 mm/rmap.c  |   20 -
 3 files changed, 257 insertions(+), 177 deletions(-)

--- 3.8-rc2/include/linux/memcontrol.h  2012-12-22 09:43:27.172015571 -0800
+++ linux/include/linux/memcontrol.h2013-01-02 14:47:47.960394878 -0800
@@ -136,32 +136,28 @@ static inline bool mem_cgroup_disabled(v
return false;
 }
 
-void __mem_cgroup_begin_update_page_stat(struct page *page, bool *locked,
-unsigned long *flags);
-
+void __mem_cgroup_begin_update_page_stat(struct page *page);
+void __mem_cgroup_end_update_page_stat(void);
 extern atomic_t memcg_moving;
 
 static inline void mem_cgroup_begin_update_page_stat(struct page *page,
-   bool *locked, unsigned long *flags)
+bool *clamped)
 {
-   if (mem_cgroup_disabled())
-   return;
-   rcu_read_lock();
-   *locked = false;
-   if (atomic_read(&memcg_moving))
-   __mem_cgroup_begin_update_page_stat(page, locked, flags);
+   preempt_disable();
+   *clamped = false;
+   if (unlikely(atomic_read(&memcg_moving))) {
+   __mem_cgroup_begin_update_page_stat(page);
+   *clamped = true;
+   }
 }
 
-void __mem_cgroup_end_update_page_stat(struct page *page,
-   unsigned long *flags);
 static inline void mem_cgroup_end_update_page_stat(struct page *page,
-   bool *locked, unsigned long *flags)
+  bool *clamped)
 {
-   if (mem_cgroup_disabled())
-   return;
-   if (*locked)
-   __mem_cgroup_end_update_page_stat(page, flags);
-   rcu_read_unlock();
+   /* We don't currently use the page arg, but keep it for symmetry */
+   if (unlikely(*clamped))
+   __mem_cgroup_end_update_page_stat();
+   preempt_enable();
 }
 
 void mem_cgroup_update_page_stat(struct page *page,
@@ -345,13 +341,16 @@ mem_cgroup_print_oom_info(struct mem_cgr
 }
 
 static inline void mem_cgroup_begin_update_page_stat(struct page *page,
-   bool *locked, unsigned long *flags)
+bool *clamped)
 {
+   /* It may be helpful to our callers if the stub behaves the same way */
+   preempt_disable();
 }
 
 static inline void mem_cgroup_end_update_page_stat(struct page *page,
-   bool *locked, unsigned long 
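
A caller-side sketch of the interface proposed above (a userspace model with
stub slow paths and an atomic counter standing in for memcg_moving; the real
helpers live in mm/memcontrol.c and take a struct page, so the names and
bodies here are assumptions for illustration only):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    static atomic_int memcg_moving;                 /* raised while charges move */

    static void __begin_update_page_stat(void) { /* slow-path stub */ }
    static void __end_update_page_stat(void)   { /* slow-path stub */ }

    static void begin_update_page_stat(bool *clamped)
    {
        /* preempt_disable() in the kernel version */
        *clamped = false;
        if (atomic_load(&memcg_moving)) {
            __begin_update_page_stat();
            *clamped = true;
        }
    }

    static void end_update_page_stat(bool *clamped)
    {
        if (*clamped)
            __end_update_page_stat();
        /* preempt_enable() in the kernel version */
    }

    static void account_page_dirtied_model(void)
    {
        bool clamped;

        begin_update_page_stat(&clamped);
        /* TestSetPageDirty() and the per-memcg nr_dirty update would go here */
        end_update_page_stat(&clamped);
    }

    int main(void)
    {
        account_page_dirtied_model();
        printf("stat update done (moving=%d)\n", atomic_load(&memcg_moving));
        return 0;
    }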


Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting

2013-01-04 Thread Sha Zhengju
On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko  wrote:
> On Wed 26-12-12 01:26:07, Sha Zhengju wrote:
>> From: Sha Zhengju 
>>
>> This patch adds memcg routines to count dirty pages, which allows memory 
>> controller
>> to maintain an accurate view of the amount of its dirty memory and can 
>> provide some
>> info for users while cgroup's direct reclaim is working.
>
> I guess you meant targeted resp. (hard/soft) limit reclaim here,
> right? It is true that this is direct reclaim but it is not clear to me

Yes, I meant memcg hard/soft reclaim here which is triggered directly
by allocation and is distinct from background kswapd reclaim (global).

> why the usefulnes should be limitted to the reclaim for users. I would
> understand this if the users was in fact in-kernel users.
>

One of the reasons I'm trying to account the dirty pages is to get a
broader overall view of memory usage, because memcg hard/soft
reclaim may have an effect on the response time of user applications.
Yeah, the beneficiary can be the application administrator or kernel users.  :P

> [...]
>> To prevent AB/BA deadlock mentioned by Greg Thelen in previous version
>> (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
>> ->private_lock --> mapping->tree_lock --> memcg->move_lock.
>> So we need to make mapping->tree_lock ahead of TestSetPageDirty in 
>> __set_page_dirty()
>> and __set_page_dirty_nobuffers(). But in order to avoiding useless spinlock 
>> contention,
>> a prepare PageDirty() checking is added.
>
> But there is another AA deadlock here I believe.
> page_remove_rmap
>   mem_cgroup_begin_update_page_stat         <<< 1
>   set_page_dirty
>     __set_page_dirty_buffers
>       __set_page_dirty
>         mem_cgroup_begin_update_page_stat   <<< 2
>           move_lock_mem_cgroup
>             spin_lock_irqsave(&memcg->move_lock, *flags);
>
> mem_cgroup_begin_update_page_stat is not recursive wrt. locking AFAICS
> because we might race with the moving charges:
>         CPU0                                CPU1
> page_remove_rmap
>                                     mem_cgroup_can_attach
>   mem_cgroup_begin_update_page_stat (1)
>     rcu_read_lock
>                                       mem_cgroup_start_move
>                                         atomic_inc(&memcg_moving)
>                                         atomic_inc(&memcg->moving_account)
>                                         synchronize_rcu
>     __mem_cgroup_begin_update_page_stat
>       mem_cgroup_stolen <<< TRUE
>       move_lock_mem_cgroup
>       [...]
>   mem_cgroup_begin_update_page_stat (2)
>     __mem_cgroup_begin_update_page_stat
>       mem_cgroup_stolen <<< still TRUE
>       move_lock_mem_cgroup <<< DEADLOCK
>       [...]
>   mem_cgroup_end_update_page_stat
>     rcu_unlock
>                                       # wake up from synchronize_rcu
>                                       [...]
>                                       mem_cgroup_move_task
>                                         mem_cgroup_move_charge
>                                           walk_page_range
>                                             mem_cgroup_move_account
>                                               move_lock_mem_cgroup
>
>
> Maybe I have missed some other locking which would prevent this from
> happening but the locking relations are really complicated in this area
> so if mem_cgroup_{begin,end}_update_page_stat might be called
> recursively then we need a fat comment which justifies that.
>

Ohhh...good catching!  I didn't notice there is a recursive call of
mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap().
The mem_cgroup_{begin,end}_update_page_stat() design has depressed
me a lot recently as the lock granularity is a little bigger than I thought.
Not only the resource but also some code logic is in the range of locking
which may be deadlock prone. The problem still exists if we are trying to
add stat account of other memcg page later, may I make bold to suggest
that we dig into the lock again...

But with regard to the current lock implementation, I doubt whether we can
account MEM_CGROUP_STAT_FILE_{MAPPED, DIRTY} in one breath and just
try to get move_lock once in the beginning. IMHO we can make
mem_cgroup_{begin,end}_update_page_stat() recursion aware, and what I'm
thinking of now is changing memcg->move_lock from the original spinlock to a
rw-spinlock:
mem_cgroup_{begin,end}_update_page_stat() tries to get the read lock, which
makes it reentrant, and the memcg moving task side tries to get the write lock.
Then the race may be the following:

CPU0CPU1
page_remove_rmap
mem_cgroup_can_attach
  mem_cgroup_begin_update_page_stat (1)
rcu_read_lock
 

Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting

2013-01-04 Thread Sha Zhengju
On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko mho...@suse.cz wrote:
 On Wed 26-12-12 01:26:07, Sha Zhengju wrote:
 From: Sha Zhengju handai@taobao.com

 This patch adds memcg routines to count dirty pages, which allows memory 
 controller
 to maintain an accurate view of the amount of its dirty memory and can 
 provide some
 info for users while cgroup's direct reclaim is working.

 I guess you meant targeted resp. (hard/soft) limit reclaim here,
 right? It is true that this is direct reclaim but it is not clear to me

Yes, I meant memcg hard/soft reclaim here which is triggered directly
by allocation and is distinct from background kswapd reclaim (global).

 why the usefulnes should be limitted to the reclaim for users. I would
 understand this if the users was in fact in-kernel users.


One of the reasons I'm trying to accounting the dirty pages is to get a
more board overall view of memory usages because memcg hard/soft
reclaim may have effect on response time of user application.
Yeah, the beneficiary can be application administrator or kernel users.  :P

 [...]
 To prevent AB/BA deadlock mentioned by Greg Thelen in previous version
 (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
 -private_lock -- mapping-tree_lock -- memcg-move_lock.
 So we need to make mapping-tree_lock ahead of TestSetPageDirty in 
 __set_page_dirty()
 and __set_page_dirty_nobuffers(). But in order to avoiding useless spinlock 
 contention,
 a prepare PageDirty() checking is added.

 But there is another AA deadlock here I believe.
 page_remove_rmap
   mem_cgroup_begin_update_page_stat  1
   set_page_dirty
 __set_page_dirty_buffers
   __set_page_dirty
 mem_cgroup_begin_update_page_stat2
   move_lock_mem_cgroup
 spin_lock_irqsave(memcg-move_lock, *flags);

 mem_cgroup_begin_update_page_stat is not recursive wrt. locking AFAICS
 because we might race with the moving charges:
 CPU0CPU1
 page_remove_rmap
 mem_cgroup_can_attach
   mem_cgroup_begin_update_page_stat (1)
 rcu_read_lock
   mem_cgroup_start_move
 atomic_inc(memcg_moving)
 
 atomic_inc(memcg-moving_account)
 synchronize_rcu
 __mem_cgroup_begin_update_page_stat
   mem_cgroup_stolen  TRUE
   move_lock_mem_cgroup
   [...]
 mem_cgroup_begin_update_page_stat (2)
   __mem_cgroup_begin_update_page_stat
 mem_cgroup_stolen  still TRUE
 move_lock_mem_cgroup   DEADLOCK
   [...]
   mem_cgroup_end_update_page_stat
 rcu_unlock
   # wake up from 
 synchronize_rcu
 [...]
 mem_cgroup_move_task
   mem_cgroup_move_charge
 walk_page_range
   mem_cgroup_move_account
 move_lock_mem_cgroup


 Maybe I have missed some other locking which would prevent this from
 happening but the locking relations are really complicated in this area
 so if mem_cgroup_{begin,end}_update_page_stat might be called
 recursively then we need a fat comment which justifies that.


Ohhh... good catch!  I didn't notice there is a recursive call of
mem_cgroup_{begin,end}_update_page_stat() in page_remove_rmap().
The mem_cgroup_{begin,end}_update_page_stat() design has depressed
me a lot recently, as the lock granularity is a little bigger than I thought:
not only the resource itself but also some code logic sits inside the locked
region, which may be deadlock prone. The problem will come up again if we try
to add accounting of other memcg page stats later, so may I be so bold as to
suggest that we dig into the lock again...

But with regard to the current lock implementation, I doubt whether we can
account MEM_CGROUP_STAT_FILE_{MAPPED, DIRTY} in one breath and just
take move_lock once at the beginning. IMHO we can make
mem_cgroup_{begin,end}_update_page_stat() recursion aware, and what I'm
thinking of now is changing memcg->move_lock from the original spinlock to an
rw-spinlock: mem_cgroup_{begin,end}_update_page_stat() take the read lock,
which makes them reentrant, and the memcg task-moving side takes the write
lock (a rough sketch of the idea follows the diagram below).
Then the race may be the following:

CPU0                                            CPU1
page_remove_rmap
                                                mem_cgroup_can_attach
  mem_cgroup_begin_update_page_stat (1)
    rcu_read_lock
                                                  mem_cgroup_start_move
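
For illustration only, the rw-lock scheme above might look roughly like the
sketch below. The struct and helper names are made up for the sketch (they are
not from this patch set), and lock initialization and the actual stat-update
bodies are omitted:

#include <linux/spinlock.h>

struct memcg_sketch {
	rwlock_t move_lock;		/* was: spinlock_t move_lock */
};

/* page-stat updaters: may nest, so take the lock for reading */
static void sketch_begin_update_page_stat(struct memcg_sketch *memcg,
					  unsigned long *flags)
{
	read_lock_irqsave(&memcg->move_lock, *flags);
}

static void sketch_end_update_page_stat(struct memcg_sketch *memcg,
					unsigned long *flags)
{
	read_unlock_irqrestore(&memcg->move_lock, *flags);
}

/* charge-moving side: excludes all page-stat updaters */
static void sketch_move_account(struct memcg_sketch *memcg)
{
	unsigned long flags;

	write_lock_irqsave(&memcg->move_lock, flags);
	/* ... move the page and its accounted stats here ... */
	write_unlock_irqrestore(&memcg->move_lock, flags);
}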

Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting

2013-01-02 Thread Michal Hocko
On Wed 26-12-12 01:26:07, Sha Zhengju wrote:
> From: Sha Zhengju 
> 
> This patch adds memcg routines to count dirty pages, which allows memory 
> controller
> to maintain an accurate view of the amount of its dirty memory and can 
> provide some
> info for users while cgroup's direct reclaim is working.

I guess you meant targeted resp. (hard/soft) limit reclaim here,
right? It is true that this is direct reclaim but it is not clear to me
why the usefulness should be limited to the reclaim for users. I would
understand this if the users were in fact in-kernel users.

[...]
> To prevent AB/BA deadlock mentioned by Greg Thelen in previous version
> (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
> ->private_lock --> mapping->tree_lock --> memcg->move_lock.
> So we need to make mapping->tree_lock ahead of TestSetPageDirty in 
> __set_page_dirty()
> and __set_page_dirty_nobuffers(). But in order to avoiding useless spinlock 
> contention,
> a prepare PageDirty() checking is added.

But there is another AA deadlock here I believe.
page_remove_rmap
  mem_cgroup_begin_update_page_stat <<< 1
  set_page_dirty
    __set_page_dirty_buffers
      __set_page_dirty
        mem_cgroup_begin_update_page_stat <<< 2
          move_lock_mem_cgroup
            spin_lock_irqsave(&memcg->move_lock, *flags);

mem_cgroup_begin_update_page_stat is not recursive wrt. locking AFAICS
because we might race with the moving charges:
CPU0                                            CPU1

page_remove_rmap
                                                mem_cgroup_can_attach
  mem_cgroup_begin_update_page_stat (1)
    rcu_read_lock
                                                  mem_cgroup_start_move
                                                    atomic_inc(&memcg_moving)
                                                    atomic_inc(&memcg->moving_account)
                                                    synchronize_rcu
    __mem_cgroup_begin_update_page_stat
      mem_cgroup_stolen <<< TRUE
      move_lock_mem_cgroup
    [...]
    mem_cgroup_begin_update_page_stat (2)
      __mem_cgroup_begin_update_page_stat
        mem_cgroup_stolen <<< still TRUE
        move_lock_mem_cgroup  <<< DEADLOCK
    [...]
    mem_cgroup_end_update_page_stat
      rcu_unlock
                                                    # wake up from synchronize_rcu
                                                [...]
                                                mem_cgroup_move_task
                                                  mem_cgroup_move_charge
                                                    walk_page_range
                                                      mem_cgroup_move_account
                                                        move_lock_mem_cgroup


Maybe I have missed some other locking which would prevent this from
happening but the locking relations are really complicated in this area
so if mem_cgroup_{begin,end}_update_page_stat might be called
recursively then we need a fat comment which justifies that.
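
The nesting above, written out as (simplified) C just to make the AA nature
concrete. This is only an illustration of the call path, not code from the
patch; the function name is invented and the rmap details are elided:

#include <linux/mm.h>
#include <linux/memcontrol.h>

static void page_remove_rmap_like(struct page *page)
{
	bool locked;
	unsigned long flags;

	mem_cgroup_begin_update_page_stat(page, &locked, &flags);	/* (1) */

	/* ... rmap bookkeeping that may end up redirtying the page ... */
	set_page_dirty(page);
	/*
	 * set_page_dirty()
	 *   __set_page_dirty_buffers()
	 *     __set_page_dirty()
	 *       mem_cgroup_begin_update_page_stat()			(2)
	 *         move_lock_mem_cgroup()
	 *           spin_lock_irqsave(&memcg->move_lock, ...)
	 * If mem_cgroup_stolen() is still true, (2) spins on the move_lock
	 * already taken by (1) on this very task: the AA deadlock.
	 */

	mem_cgroup_end_update_page_stat(page, &locked, &flags);
}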

[...]
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH V3 4/8] memcg: add per cgroup dirty pages accounting

2012-12-25 Thread Sha Zhengju
From: Sha Zhengju 

This patch adds memcg routines to count dirty pages, which allows the memory
controller to maintain an accurate view of the amount of its dirty memory and
can provide some info for users while the cgroup's direct reclaim is working.

After Kame's commit 89c06bd5 (memcg: use new logic for page stat accounting),
we can use a 'struct page' flag to test page state instead of a per-page_cgroup
flag. But memcg has a feature to move a page from one cgroup to another and may
have a race between "move" and "page stat accounting". So in order to avoid the
race we have designed a bigger lock:

 mem_cgroup_begin_update_page_stat()
 modify page information        -->(a)
 mem_cgroup_update_page_stat()  -->(b)
 mem_cgroup_end_update_page_stat()
It requires that (a) and (b) (the dirty page accounting) stay close enough.
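
In caller terms the protocol boils down to something like the sketch below,
condensed from the __set_page_dirty() hunk further down. tree_lock handling and
error paths are omitted here, and the FILE_DIRTY bump is assumed to sit inside
account_page_dirtied():

#include <linux/mm.h>
#include <linux/memcontrol.h>

static int set_page_dirty_sketch(struct page *page,
				 struct address_space *mapping)
{
	bool locked;
	unsigned long flags;
	int ret = 0;

	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
	if (!TestSetPageDirty(page)) {
		/* (a) modify page information */
		account_page_dirtied(page, mapping);
		/*
		 * (b) ... which is assumed to increment
		 * MEM_CGROUP_STAT_FILE_DIRTY while the protection taken
		 * above is still held.
		 */
		ret = 1;
	}
	mem_cgroup_end_update_page_stat(page, &locked, &flags);
	return ret;
}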
In the previous two prepare patches, we have reworked the vfs set page dirty
routines and now the interfaces are more explicit:
incrementing (2):
__set_page_dirty
__set_page_dirty_nobuffers
decrementing (2):
clear_page_dirty_for_io
cancel_dirty_page

To prevent the AB/BA deadlock mentioned by Greg Thelen in the previous version
(https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
->private_lock --> mapping->tree_lock --> memcg->move_lock.
So we need to take mapping->tree_lock ahead of TestSetPageDirty in
__set_page_dirty() and __set_page_dirty_nobuffers(). But in order to avoid
useless spinlock contention, a preparatory PageDirty() check is added.


Signed-off-by: Sha Zhengju 
Acked-by: KAMEZAWA Hiroyuki 
Acked-by: Fengguang Wu 
---
 fs/buffer.c|   14 +-
 include/linux/memcontrol.h |1 +
 mm/filemap.c   |   10 ++
 mm/memcontrol.c|   29 ++---
 mm/page-writeback.c|   39 ---
 mm/truncate.c  |6 ++
 6 files changed, 84 insertions(+), 15 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 762168a..53402d2 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -612,19 +612,31 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
 int __set_page_dirty(struct page *page,
struct address_space *mapping, int warn)
 {
+   bool locked;
+   unsigned long flags;
+
if (unlikely(!mapping))
return !TestSetPageDirty(page);
 
-   if (TestSetPageDirty(page))
+   if (PageDirty(page))
return 0;
 
	spin_lock_irq(&mapping->tree_lock);
+   mem_cgroup_begin_update_page_stat(page, &locked, &flags);
+
+   if (TestSetPageDirty(page)) {
+   mem_cgroup_end_update_page_stat(page, &locked, &flags);
+   spin_unlock_irq(&mapping->tree_lock);
+   return 0;
+   }
+
if (page->mapping) {/* Race with truncate? */
WARN_ON_ONCE(warn && !PageUptodate(page));
account_page_dirtied(page, mapping);
		radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
+   mem_cgroup_end_update_page_stat(page, &locked, &flags);
	spin_unlock_irq(&mapping->tree_lock);
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5421b8a..2685d8a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -44,6 +44,7 @@ enum mem_cgroup_stat_index {
MEM_CGROUP_STAT_RSS,   /* # of pages charged as anon rss */
MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
+   MEM_CGROUP_STAT_FILE_DIRTY,  /* # of dirty pages in page cache */
MEM_CGROUP_STAT_NSTATS,
 };
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 83efee7..b589be5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -62,6 +62,11 @@
  *  ->swap_lock(exclusive_swap_page, others)
  *->mapping->tree_lock
  *
+ *->private_lock   (__set_page_dirty_buffers)
+ *  ->mapping->tree_lock
+ *->memcg->move_lock   (mem_cgroup_begin_update_page_stat->
+ * move_lock_mem_cgroup)
+ *
  *  ->i_mutex
  *->i_mmap_mutex   (truncate->unmap_mapping_range)
  *
@@ -112,6 +117,8 @@
 void __delete_from_page_cache(struct page *page)
 {
struct address_space *mapping = page->mapping;
+   bool locked;
+   unsigned long flags;
 
/*
 * if we're uptodate, flush out into the cleancache, otherwise
@@ -139,10 +146,13 @@ void __delete_from_page_cache(struct page *page)
 * Fix it up by doing a final dirty accounting check after
 * having removed the page entirely.
 */
+   mem_cgroup_begin_update_page_stat(page, &locked, &flags);
if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
+   mem_cgroup_dec_page_stat(page, 
