Re: [PATCH v6 28/29] slub: slub-specific propagation changes.

2012-11-14 Thread Glauber Costa
On 11/09/2012 07:37 AM, Sasha Levin wrote:
> On 11/08/2012 01:51 AM, Glauber Costa wrote:
>> On 11/07/2012 04:53 PM, Sasha Levin wrote:
>>> On 11/01/2012 08:07 AM, Glauber Costa wrote:
 SLUB allows us to tune a particular cache behavior with sysfs-based
 tunables.  When creating a new memcg cache copy, we'd like to preserve
 any tunables the parent cache already had.

 This can be done by tapping into the store attribute function provided
 by the allocator. We of course don't need to mess with read-only
 fields. Since the attributes can have multiple types and are stored
 internally by sysfs, the best strategy is to issue a ->show() in the
 root cache, and then ->store() in the memcg cache.

 The drawback of that is that sysfs can allocate up to a page of
 buffering for show(), which we are likely not to need but also can't
 rule out. To avoid always allocating a page for that, we can update the
 caches at store time with the maximum attribute size ever stored to the
 root cache. We will then get a buffer big enough to hold it. The
 corollary to this is that if no stores happened, nothing will be
 propagated.

 It can also happen that a root cache has its tunables updated during
 normal system operation. In this case, we will propagate the change to
 all caches that are already active.

 Signed-off-by: Glauber Costa 
 CC: Christoph Lameter 
 CC: Pekka Enberg 
 CC: Michal Hocko 
 CC: Kamezawa Hiroyuki 
 CC: Johannes Weiner 
 CC: Suleiman Souhlal 
 CC: Tejun Heo 
 ---
>>>
>>> Hi guys,
>>>
>>> This patch is making lockdep angry! *bark bark*
>>>
>>> [  351.935003] ==
>>> [  351.937693] [ INFO: possible circular locking dependency detected ]
>>> [  351.939720] 3.7.0-rc4-next-20121106-sasha-8-g353b62f #117 Tainted: G 
>>>W
>>> [  351.942444] ---
>>> [  351.943528] trinity-child13/6961 is trying to acquire lock:
>>> [  351.943528]  (s_active#43){.+}, at: [] 
>>> sysfs_addrm_finish+0x31/0x60
>>> [  351.943528]
>>> [  351.943528] but task is already holding lock:
>>> [  351.943528]  (slab_mutex){+.+.+.}, at: [] 
>>> kmem_cache_destroy+0x22/0xe0
>>> [  351.943528]
>>> [  351.943528] which lock already depends on the new lock.
>>> [  351.943528]
>>> [  351.943528]
>>> [  351.943528] the existing dependency chain (in reverse order) is:
>>> [  351.943528]
>>> -> #1 (slab_mutex){+.+.+.}:
>>> [  351.960334][] lock_acquire+0x1aa/0x240
>>> [  351.960334][] __mutex_lock_common+0x59/0x5a0
>>> [  351.960334][] mutex_lock_nested+0x3f/0x50
>>> [  351.960334][] slab_attr_store+0xde/0x110
>>> [  351.960334][] sysfs_write_file+0xfa/0x150
>>> [  351.960334][] vfs_write+0xb0/0x180
>>> [  351.960334][] sys_pwrite64+0x60/0xb0
>>> [  351.960334][] tracesys+0xe1/0xe6
>>> [  351.960334]
>>> -> #0 (s_active#43){.+}:
>>> [  351.960334][] __lock_acquire+0x14df/0x1ca0
>>> [  351.960334][] lock_acquire+0x1aa/0x240
>>> [  351.960334][] sysfs_deactivate+0x122/0x1a0
>>> [  351.960334][] sysfs_addrm_finish+0x31/0x60
>>> [  351.960334][] sysfs_remove_dir+0x89/0xd0
>>> [  351.960334][] kobject_del+0x16/0x40
>>> [  351.960334][] __kmem_cache_shutdown+0x40/0x60
>>> [  351.960334][] kmem_cache_destroy+0x40/0xe0
>>> [  351.960334][] mon_text_release+0x78/0xe0
>>> [  351.960334][] __fput+0x122/0x2d0
>>> [  351.960334][] fput+0x9/0x10
>>> [  351.960334][] task_work_run+0xbe/0x100
>>> [  351.960334][] do_exit+0x432/0xbd0
>>> [  351.960334][] do_group_exit+0x84/0xd0
>>> [  351.960334][] get_signal_to_deliver+0x81d/0x930
>>> [  351.960334][] do_signal+0x3a/0x950
>>> [  351.960334][] do_notify_resume+0x3e/0x90
>>> [  351.960334][] int_signal+0x12/0x17
>>> [  351.960334]

First: sorry I took so long; I had some problems on my way back from
Spain...

I just managed to reproduce it by following the callchain. In summary:

1) When we store an attribute, we call sysfs_get_active(), which takes
the sd->dep_map lock, where 'sd' is the specific dirent.

2) ->store() is called with that lock held.

3) ->store() then takes the slab_mutex.

4) While destroying the cache, with the slab_mutex held, we eventually
get to kobject_put(), which deep down in the callchain reaches
sysfs_addrm_finish(), which can take the s_active lock again (see the
sketch below).
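
For illustration only, here is a minimal user-space sketch of that
AB-BA inversion (plain pthreads, not the kernel code; the two mutexes
merely stand in for sysfs's s_active/sd->dep_map and the slab_mutex):

	/* cc deadlock_sketch.c -o deadlock_sketch -lpthread */
	#include <pthread.h>
	#include <stdio.h>
	#include <unistd.h>

	static pthread_mutex_t s_active = PTHREAD_MUTEX_INITIALIZER;
	static pthread_mutex_t slab_mutex = PTHREAD_MUTEX_INITIALIZER;

	/* store path: sysfs_get_active(), then ->store() takes slab_mutex */
	static void *store_path(void *unused)
	{
		pthread_mutex_lock(&s_active);
		usleep(10000);			/* widen the race window */
		pthread_mutex_lock(&slab_mutex);
		pthread_mutex_unlock(&slab_mutex);
		pthread_mutex_unlock(&s_active);
		return NULL;
	}

	/* destroy path: kmem_cache_destroy() takes slab_mutex, then
	 * kobject_put() -> ... -> sysfs_addrm_finish() needs s_active */
	static void *destroy_path(void *unused)
	{
		pthread_mutex_lock(&slab_mutex);
		usleep(10000);
		pthread_mutex_lock(&s_active);
		pthread_mutex_unlock(&s_active);
		pthread_mutex_unlock(&slab_mutex);
		return NULL;
	}

	int main(void)
	{
		pthread_t t1, t2;

		pthread_create(&t1, NULL, store_path, NULL);
		pthread_create(&t2, NULL, destroy_path, NULL);
		pthread_join(t1, NULL);	/* with this timing, both threads block forever */
		pthread_join(t2, NULL);
		puts("not reached once both threads take their first lock");
		return 0;
	}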

In summary, creating a kmem-limited memcg, storing an attribute in the
global cache, and then deleting the memcg should trigger this. The funny
thing is that I had a test exactly like this in which it didn't trigger,
and now I know why: I was storing attributes for "dentry", which can
stay around for longer until it completely runs out of objects, which
will depend on the vmscan 

Re: [PATCH v6 28/29] slub: slub-specific propagation changes.

2012-11-08 Thread Sasha Levin
On 11/08/2012 01:51 AM, Glauber Costa wrote:
> On 11/07/2012 04:53 PM, Sasha Levin wrote:
>> On 11/01/2012 08:07 AM, Glauber Costa wrote:
>>> SLUB allows us to tune a particular cache behavior with sysfs-based
>>> tunables.  When creating a new memcg cache copy, we'd like to preserve
>>> any tunables the parent cache already had.
>>>
>>> This can be done by tapping into the store attribute function provided
>>> by the allocator. We of course don't need to mess with read-only
>>> fields. Since the attributes can have multiple types and are stored
>>> internally by sysfs, the best strategy is to issue a ->show() in the
>>> root cache, and then ->store() in the memcg cache.
>>>
>>> The drawback of that is that sysfs can allocate up to a page of
>>> buffering for show(), which we are likely not to need but also can't
>>> rule out. To avoid always allocating a page for that, we can update the
>>> caches at store time with the maximum attribute size ever stored to the
>>> root cache. We will then get a buffer big enough to hold it. The
>>> corollary to this is that if no stores happened, nothing will be
>>> propagated.
>>>
>>> It can also happen that a root cache has its tunables updated during
>>> normal system operation. In this case, we will propagate the change to
>>> all caches that are already active.
>>>
>>> Signed-off-by: Glauber Costa 
>>> CC: Christoph Lameter 
>>> CC: Pekka Enberg 
>>> CC: Michal Hocko 
>>> CC: Kamezawa Hiroyuki 
>>> CC: Johannes Weiner 
>>> CC: Suleiman Souhlal 
>>> CC: Tejun Heo 
>>> ---
>>
>> Hi guys,
>>
>> This patch is making lockdep angry! *bark bark*
>>
>> [  351.935003] ==
>> [  351.937693] [ INFO: possible circular locking dependency detected ]
>> [  351.939720] 3.7.0-rc4-next-20121106-sasha-8-g353b62f #117 Tainted: G  
>>   W
>> [  351.942444] ---
>> [  351.943528] trinity-child13/6961 is trying to acquire lock:
>> [  351.943528]  (s_active#43){.+}, at: [] 
>> sysfs_addrm_finish+0x31/0x60
>> [  351.943528]
>> [  351.943528] but task is already holding lock:
>> [  351.943528]  (slab_mutex){+.+.+.}, at: [] 
>> kmem_cache_destroy+0x22/0xe0
>> [  351.943528]
>> [  351.943528] which lock already depends on the new lock.
>> [  351.943528]
>> [  351.943528]
>> [  351.943528] the existing dependency chain (in reverse order) is:
>> [  351.943528]
>> -> #1 (slab_mutex){+.+.+.}:
>> [  351.960334][] lock_acquire+0x1aa/0x240
>> [  351.960334][] __mutex_lock_common+0x59/0x5a0
>> [  351.960334][] mutex_lock_nested+0x3f/0x50
>> [  351.960334][] slab_attr_store+0xde/0x110
>> [  351.960334][] sysfs_write_file+0xfa/0x150
>> [  351.960334][] vfs_write+0xb0/0x180
>> [  351.960334][] sys_pwrite64+0x60/0xb0
>> [  351.960334][] tracesys+0xe1/0xe6
>> [  351.960334]
>> -> #0 (s_active#43){.+}:
>> [  351.960334][] __lock_acquire+0x14df/0x1ca0
>> [  351.960334][] lock_acquire+0x1aa/0x240
>> [  351.960334][] sysfs_deactivate+0x122/0x1a0
>> [  351.960334][] sysfs_addrm_finish+0x31/0x60
>> [  351.960334][] sysfs_remove_dir+0x89/0xd0
>> [  351.960334][] kobject_del+0x16/0x40
>> [  351.960334][] __kmem_cache_shutdown+0x40/0x60
>> [  351.960334][] kmem_cache_destroy+0x40/0xe0
>> [  351.960334][] mon_text_release+0x78/0xe0
>> [  351.960334][] __fput+0x122/0x2d0
>> [  351.960334][] fput+0x9/0x10
>> [  351.960334][] task_work_run+0xbe/0x100
>> [  351.960334][] do_exit+0x432/0xbd0
>> [  351.960334][] do_group_exit+0x84/0xd0
>> [  351.960334][] get_signal_to_deliver+0x81d/0x930
>> [  351.960334][] do_signal+0x3a/0x950
>> [  351.960334][] do_notify_resume+0x3e/0x90
>> [  351.960334][] int_signal+0x12/0x17
>> [  351.960334]
>> [  351.960334] other info that might help us debug this:
>> [  351.960334]
>> [  351.960334]  Possible unsafe locking scenario:
>> [  351.960334]
>> [  351.960334]CPU0CPU1
>> [  351.960334]
>> [  351.960334]   lock(slab_mutex);
>> [  351.960334]lock(s_active#43);
>> [  351.960334]lock(slab_mutex);
>> [  351.960334]   lock(s_active#43);
>> [  351.960334]
>> [  351.960334]  *** DEADLOCK ***
>> [  351.960334]
>> [  351.960334] 2 locks held by trinity-child13/6961:
>> [  351.960334]  #0:  (mon_lock){+.+.+.}, at: [] 
>> mon_text_release+0x25/0xe0
>> [  351.960334]  #1:  (slab_mutex){+.+.+.}, at: [] 
>> kmem_cache_destroy+0x22/0xe0
>> [  351.960334]
>> [  351.960334] stack backtrace:
>> [  351.960334] Pid: 6961, comm: trinity-child13 Tainted: GW
>> 3.7.0-rc4-next-20121106-sasha-8-g353b62f #117
>> [  351.960334] Call Trace:
>> [  351.960334]  [] print_circular_bug+0x1fb/0x20c
>> [  351.960334]  [] __lock_acquire+0x14df/0x1ca0
>> 

Re: [PATCH v6 28/29] slub: slub-specific propagation changes.

2012-11-07 Thread Glauber Costa
On 11/07/2012 04:53 PM, Sasha Levin wrote:
> On 11/01/2012 08:07 AM, Glauber Costa wrote:
>> SLUB allows us to tune a particular cache behavior with sysfs-based
>> tunables.  When creating a new memcg cache copy, we'd like to preserve
>> any tunables the parent cache already had.
>>
>> This can be done by tapping into the store attribute function provided
>> by the allocator. We of course don't need to mess with read-only
>> fields. Since the attributes can have multiple types and are stored
>> internally by sysfs, the best strategy is to issue a ->show() in the
>> root cache, and then ->store() in the memcg cache.
>>
>> The drawback of that is that sysfs can allocate up to a page of
>> buffering for show(), which we are likely not to need but also can't
>> rule out. To avoid always allocating a page for that, we can update the
>> caches at store time with the maximum attribute size ever stored to the
>> root cache. We will then get a buffer big enough to hold it. The
>> corollary to this is that if no stores happened, nothing will be
>> propagated.
>>
>> It can also happen that a root cache has its tunables updated during
>> normal system operation. In this case, we will propagate the change to
>> all caches that are already active.
>>
>> Signed-off-by: Glauber Costa 
>> CC: Christoph Lameter 
>> CC: Pekka Enberg 
>> CC: Michal Hocko 
>> CC: Kamezawa Hiroyuki 
>> CC: Johannes Weiner 
>> CC: Suleiman Souhlal 
>> CC: Tejun Heo 
>> ---
> 
> Hi guys,
> 
> This patch is making lockdep angry! *bark bark*
> 
> [  351.935003] ==
> [  351.937693] [ INFO: possible circular locking dependency detected ]
> [  351.939720] 3.7.0-rc4-next-20121106-sasha-8-g353b62f #117 Tainted: G   
>  W
> [  351.942444] ---
> [  351.943528] trinity-child13/6961 is trying to acquire lock:
> [  351.943528]  (s_active#43){.+}, at: [] 
> sysfs_addrm_finish+0x31/0x60
> [  351.943528]
> [  351.943528] but task is already holding lock:
> [  351.943528]  (slab_mutex){+.+.+.}, at: [] 
> kmem_cache_destroy+0x22/0xe0
> [  351.943528]
> [  351.943528] which lock already depends on the new lock.
> [  351.943528]
> [  351.943528]
> [  351.943528] the existing dependency chain (in reverse order) is:
> [  351.943528]
> -> #1 (slab_mutex){+.+.+.}:
> [  351.960334][] lock_acquire+0x1aa/0x240
> [  351.960334][] __mutex_lock_common+0x59/0x5a0
> [  351.960334][] mutex_lock_nested+0x3f/0x50
> [  351.960334][] slab_attr_store+0xde/0x110
> [  351.960334][] sysfs_write_file+0xfa/0x150
> [  351.960334][] vfs_write+0xb0/0x180
> [  351.960334][] sys_pwrite64+0x60/0xb0
> [  351.960334][] tracesys+0xe1/0xe6
> [  351.960334]
> -> #0 (s_active#43){.+}:
> [  351.960334][] __lock_acquire+0x14df/0x1ca0
> [  351.960334][] lock_acquire+0x1aa/0x240
> [  351.960334][] sysfs_deactivate+0x122/0x1a0
> [  351.960334][] sysfs_addrm_finish+0x31/0x60
> [  351.960334][] sysfs_remove_dir+0x89/0xd0
> [  351.960334][] kobject_del+0x16/0x40
> [  351.960334][] __kmem_cache_shutdown+0x40/0x60
> [  351.960334][] kmem_cache_destroy+0x40/0xe0
> [  351.960334][] mon_text_release+0x78/0xe0
> [  351.960334][] __fput+0x122/0x2d0
> [  351.960334][] fput+0x9/0x10
> [  351.960334][] task_work_run+0xbe/0x100
> [  351.960334][] do_exit+0x432/0xbd0
> [  351.960334][] do_group_exit+0x84/0xd0
> [  351.960334][] get_signal_to_deliver+0x81d/0x930
> [  351.960334][] do_signal+0x3a/0x950
> [  351.960334][] do_notify_resume+0x3e/0x90
> [  351.960334][] int_signal+0x12/0x17
> [  351.960334]
> [  351.960334] other info that might help us debug this:
> [  351.960334]
> [  351.960334]  Possible unsafe locking scenario:
> [  351.960334]
> [  351.960334]CPU0CPU1
> [  351.960334]
> [  351.960334]   lock(slab_mutex);
> [  351.960334]lock(s_active#43);
> [  351.960334]lock(slab_mutex);
> [  351.960334]   lock(s_active#43);
> [  351.960334]
> [  351.960334]  *** DEADLOCK ***
> [  351.960334]
> [  351.960334] 2 locks held by trinity-child13/6961:
> [  351.960334]  #0:  (mon_lock){+.+.+.}, at: [] 
> mon_text_release+0x25/0xe0
> [  351.960334]  #1:  (slab_mutex){+.+.+.}, at: [] 
> kmem_cache_destroy+0x22/0xe0
> [  351.960334]
> [  351.960334] stack backtrace:
> [  351.960334] Pid: 6961, comm: trinity-child13 Tainted: GW
> 3.7.0-rc4-next-20121106-sasha-8-g353b62f #117
> [  351.960334] Call Trace:
> [  351.960334]  [] print_circular_bug+0x1fb/0x20c
> [  351.960334]  [] __lock_acquire+0x14df/0x1ca0
> [  351.960334]  [] ? debug_check_no_locks_freed+0x185/0x1e0
> [  351.960334]  [] lock_acquire+0x1aa/0x240
> [  351.960334]  [] ? 

Re: [PATCH v6 28/29] slub: slub-specific propagation changes.

2012-11-07 Thread Sasha Levin
On 11/01/2012 08:07 AM, Glauber Costa wrote:
> SLUB allows us to tune a particular cache behavior with sysfs-based
> tunables.  When creating a new memcg cache copy, we'd like to preserve
> any tunables the parent cache already had.
> 
> This can be done by tapping into the store attribute function provided
> by the allocator. We of course don't need to mess with read-only
> fields. Since the attributes can have multiple types and are stored
> internally by sysfs, the best strategy is to issue a ->show() in the
> root cache, and then ->store() in the memcg cache.
> 
> The drawback of that is that sysfs can allocate up to a page of
> buffering for show(), which we are likely not to need but also can't
> rule out. To avoid always allocating a page for that, we can update the
> caches at store time with the maximum attribute size ever stored to the
> root cache. We will then get a buffer big enough to hold it. The
> corollary to this is that if no stores happened, nothing will be
> propagated.
> 
> It can also happen that a root cache has its tunables updated during
> normal system operation. In this case, we will propagate the change to
> all caches that are already active.
> 
> Signed-off-by: Glauber Costa 
> CC: Christoph Lameter 
> CC: Pekka Enberg 
> CC: Michal Hocko 
> CC: Kamezawa Hiroyuki 
> CC: Johannes Weiner 
> CC: Suleiman Souhlal 
> CC: Tejun Heo 
> ---

Hi guys,

This patch is making lockdep angry! *bark bark*

[  351.935003] ==
[  351.937693] [ INFO: possible circular locking dependency detected ]
[  351.939720] 3.7.0-rc4-next-20121106-sasha-8-g353b62f #117 Tainted: G 
   W
[  351.942444] ---
[  351.943528] trinity-child13/6961 is trying to acquire lock:
[  351.943528]  (s_active#43){.+}, at: [] 
sysfs_addrm_finish+0x31/0x60
[  351.943528]
[  351.943528] but task is already holding lock:
[  351.943528]  (slab_mutex){+.+.+.}, at: [] 
kmem_cache_destroy+0x22/0xe0
[  351.943528]
[  351.943528] which lock already depends on the new lock.
[  351.943528]
[  351.943528]
[  351.943528] the existing dependency chain (in reverse order) is:
[  351.943528]
-> #1 (slab_mutex){+.+.+.}:
[  351.960334][] lock_acquire+0x1aa/0x240
[  351.960334][] __mutex_lock_common+0x59/0x5a0
[  351.960334][] mutex_lock_nested+0x3f/0x50
[  351.960334][] slab_attr_store+0xde/0x110
[  351.960334][] sysfs_write_file+0xfa/0x150
[  351.960334][] vfs_write+0xb0/0x180
[  351.960334][] sys_pwrite64+0x60/0xb0
[  351.960334][] tracesys+0xe1/0xe6
[  351.960334]
-> #0 (s_active#43){.+}:
[  351.960334][] __lock_acquire+0x14df/0x1ca0
[  351.960334][] lock_acquire+0x1aa/0x240
[  351.960334][] sysfs_deactivate+0x122/0x1a0
[  351.960334][] sysfs_addrm_finish+0x31/0x60
[  351.960334][] sysfs_remove_dir+0x89/0xd0
[  351.960334][] kobject_del+0x16/0x40
[  351.960334][] __kmem_cache_shutdown+0x40/0x60
[  351.960334][] kmem_cache_destroy+0x40/0xe0
[  351.960334][] mon_text_release+0x78/0xe0
[  351.960334][] __fput+0x122/0x2d0
[  351.960334][] fput+0x9/0x10
[  351.960334][] task_work_run+0xbe/0x100
[  351.960334][] do_exit+0x432/0xbd0
[  351.960334][] do_group_exit+0x84/0xd0
[  351.960334][] get_signal_to_deliver+0x81d/0x930
[  351.960334][] do_signal+0x3a/0x950
[  351.960334][] do_notify_resume+0x3e/0x90
[  351.960334][] int_signal+0x12/0x17
[  351.960334]
[  351.960334] other info that might help us debug this:
[  351.960334]
[  351.960334]  Possible unsafe locking scenario:
[  351.960334]
[  351.960334]CPU0CPU1
[  351.960334]
[  351.960334]   lock(slab_mutex);
[  351.960334]lock(s_active#43);
[  351.960334]lock(slab_mutex);
[  351.960334]   lock(s_active#43);
[  351.960334]
[  351.960334]  *** DEADLOCK ***
[  351.960334]
[  351.960334] 2 locks held by trinity-child13/6961:
[  351.960334]  #0:  (mon_lock){+.+.+.}, at: [] 
mon_text_release+0x25/0xe0
[  351.960334]  #1:  (slab_mutex){+.+.+.}, at: [] 
kmem_cache_destroy+0x22/0xe0
[  351.960334]
[  351.960334] stack backtrace:
[  351.960334] Pid: 6961, comm: trinity-child13 Tainted: GW
3.7.0-rc4-next-20121106-sasha-8-g353b62f #117
[  351.960334] Call Trace:
[  351.960334]  [] print_circular_bug+0x1fb/0x20c
[  351.960334]  [] __lock_acquire+0x14df/0x1ca0
[  351.960334]  [] ? debug_check_no_locks_freed+0x185/0x1e0
[  351.960334]  [] lock_acquire+0x1aa/0x240
[  351.960334]  [] ? sysfs_addrm_finish+0x31/0x60
[  351.960334]  [] sysfs_deactivate+0x122/0x1a0
[  351.960334]  [] ? sysfs_addrm_finish+0x31/0x60
[  351.960334]  [] sysfs_addrm_finish+0x31/0x60
[  351.960334]  [] sysfs_remove_dir+0x89/0xd0
[  351.960334]  [] 

Re: [PATCH v6 28/29] slub: slub-specific propagation changes.

2012-11-06 Thread Andrew Morton
On Thu,  1 Nov 2012 16:07:44 +0400
Glauber Costa  wrote:

> SLUB allows us to tune a particular cache behavior with sysfs-based
> tunables.  When creating a new memcg cache copy, we'd like to preserve
> any tunables the parent cache already had.
> 
> This can be done by tapping into the store attribute function provided
> by the allocator. We of course don't need to mess with read-only
> fields. Since the attributes can have multiple types and are stored
> internally by sysfs, the best strategy is to issue a ->show() in the
> root cache, and then ->store() in the memcg cache.
> 
> The drawback of that is that sysfs can allocate up to a page of
> buffering for show(), which we are likely not to need but also can't
> rule out. To avoid always allocating a page for that, we can update the
> caches at store time with the maximum attribute size ever stored to the
> root cache. We will then get a buffer big enough to hold it. The
> corollary to this is that if no stores happened, nothing will be
> propagated.
> 
> It can also happen that a root cache has its tunables updated during
> normal system operation. In this case, we will propagate the change to
> all caches that are already active.
> 
> ...
>
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3955,6 +3956,7 @@ int __kmem_cache_create(struct kmem_cache *s, unsigned long flags)
>   if (err)
>   return err;
>  
> + memcg_propagate_slab_attrs(s);
>   mutex_unlock(&slab_mutex);
>   err = sysfs_slab_add(s);
>   mutex_lock(&slab_mutex);
> @@ -5180,6 +5182,7 @@ static ssize_t slab_attr_store(struct kobject *kobj,
>   struct slab_attribute *attribute;
>   struct kmem_cache *s;
>   int err;
> + int i __maybe_unused;
>  
>   attribute = to_slab_attr(attr);
>   s = to_slab(kobj);
> @@ -5188,10 +5191,81 @@ static ssize_t slab_attr_store(struct kobject *kobj,
>   return -EIO;
>  
>   err = attribute->store(s, buf, len);
> +#ifdef CONFIG_MEMCG_KMEM
> + if (slab_state < FULL)
> + return err;
>  
> + if ((err < 0) || !is_root_cache(s))
> + return err;
> +
> + mutex_lock(&slab_mutex);
> + if (s->max_attr_size < len)
> + s->max_attr_size = len;
> +
> + for_each_memcg_cache_index(i) {
> + struct kmem_cache *c = cache_from_memcg(s, i);
> + if (c)
> + /* return value determined by the parent cache only */
> + attribute->store(c, buf, len);
> + }
> + mutex_unlock(&slab_mutex);
> +#endif
>   return err;
>  }

hm, __maybe_unused is an ugly thing.  We can avoid it by tweaking the
code a bit:

diff -puN mm/slub.c~slub-slub-specific-propagation-changes-fix mm/slub.c
--- a/mm/slub.c~slub-slub-specific-propagation-changes-fix
+++ a/mm/slub.c
@@ -5175,7 +5175,6 @@ static ssize_t slab_attr_store(struct ko
struct slab_attribute *attribute;
struct kmem_cache *s;
int err;
-   int i __maybe_unused;
 
attribute = to_slab_attr(attr);
s = to_slab(kobj);
@@ -5185,23 +5184,24 @@ static ssize_t slab_attr_store(struct ko
 
err = attribute->store(s, buf, len);
 #ifdef CONFIG_MEMCG_KMEM
-   if (slab_state < FULL)
-   return err;
+   if (slab_state >= FULL && err >= 0 && is_root_cache(s)) {
+   int i;
 
-   if ((err < 0) || !is_root_cache(s))
-   return err;
-
-   mutex_lock(&slab_mutex);
-   if (s->max_attr_size < len)
-   s->max_attr_size = len;
-
-   for_each_memcg_cache_index(i) {
-   struct kmem_cache *c = cache_from_memcg(s, i);
-   if (c)
-   /* return value determined by the parent cache only */
-   attribute->store(c, buf, len);
+   mutex_lock(&slab_mutex);
+   if (s->max_attr_size < len)
+   s->max_attr_size = len;
+
+   for_each_memcg_cache_index(i) {
+   struct kmem_cache *c = cache_from_memcg(s, i);
+   /*
+* This function's return value is determined by the
+* parent cache only
+*/
+   if (c)
+   attribute->store(c, buf, len);
+   }
+   mutex_unlock(&slab_mutex);
}
-   mutex_unlock(&slab_mutex);
 #endif
return err;
 }

Also, the comment in there tells the reader *what the code does*, not
*why it does it*.  Why do we ignore the ->store return value for child
caches?


[PATCH v6 28/29] slub: slub-specific propagation changes.

2012-11-01 Thread Glauber Costa
SLUB allows us to tune a particular cache behavior with sysfs-based
tunables.  When creating a new memcg cache copy, we'd like to preserve
any tunables the parent cache already had.

This can be done by tapping into the store attribute function provided
by the allocator. We of course don't need to mess with read-only
fields. Since the attributes can have multiple types and are stored
internally by sysfs, the best strategy is to issue a ->show() in the
root cache, and then ->store() in the memcg cache.

The drawback of that is that sysfs can allocate up to a page of
buffering for show(), which we are likely not to need but also can't
rule out. To avoid always allocating a page for that, we can update the
caches at store time with the maximum attribute size ever stored to the
root cache. We will then get a buffer big enough to hold it. The
corollary to this is that if no stores happened, nothing will be
propagated.

It can also happen that a root cache has its tunables updated during
normal system operation. In this case, we will propagate the change to
all caches that are already active.

Signed-off-by: Glauber Costa 
CC: Christoph Lameter 
CC: Pekka Enberg 
CC: Michal Hocko 
CC: Kamezawa Hiroyuki 
CC: Johannes Weiner 
CC: Suleiman Souhlal 
CC: Tejun Heo 
---
 include/linux/slub_def.h |  1 +
 mm/slub.c                | 76 +++-
 2 files changed, 76 insertions(+), 1 deletion(-)

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 364ba6c..9db4825 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -103,6 +103,7 @@ struct kmem_cache {
 #endif
 #ifdef CONFIG_MEMCG_KMEM
struct memcg_cache_params *memcg_params;
+   int max_attr_size; /* for propagation, maximum size of a stored attr */
 #endif
 
 #ifdef CONFIG_NUMA
diff --git a/mm/slub.c b/mm/slub.c
index ffc8ede..6c39af8 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -203,13 +203,14 @@ enum track_item { TRACK_ALLOC, TRACK_FREE };
 static int sysfs_slab_add(struct kmem_cache *);
 static int sysfs_slab_alias(struct kmem_cache *, const char *);
 static void sysfs_slab_remove(struct kmem_cache *);
-
+static void memcg_propagate_slab_attrs(struct kmem_cache *s);
 #else
 static inline int sysfs_slab_add(struct kmem_cache *s) { return 0; }
 static inline int sysfs_slab_alias(struct kmem_cache *s, const char *p)
{ return 0; }
 static inline void sysfs_slab_remove(struct kmem_cache *s) { }
 
+static inline void memcg_propagate_slab_attrs(struct kmem_cache *s) { }
 #endif
 
 static inline void stat(const struct kmem_cache *s, enum stat_item si)
@@ -3955,6 +3956,7 @@ int __kmem_cache_create(struct kmem_cache *s, unsigned long flags)
if (err)
return err;
 
+   memcg_propagate_slab_attrs(s);
mutex_unlock(&slab_mutex);
err = sysfs_slab_add(s);
mutex_lock(&slab_mutex);
@@ -5180,6 +5182,7 @@ static ssize_t slab_attr_store(struct kobject *kobj,
struct slab_attribute *attribute;
struct kmem_cache *s;
int err;
+   int i __maybe_unused;
 
attribute = to_slab_attr(attr);
s = to_slab(kobj);
@@ -5188,10 +5191,81 @@ static ssize_t slab_attr_store(struct kobject *kobj,
return -EIO;
 
err = attribute->store(s, buf, len);
+#ifdef CONFIG_MEMCG_KMEM
+   if (slab_state < FULL)
+   return err;
 
+   if ((err < 0) || !is_root_cache(s))
+   return err;
+
+   mutex_lock(&slab_mutex);
+   if (s->max_attr_size < len)
+   s->max_attr_size = len;
+
+   for_each_memcg_cache_index(i) {
+   struct kmem_cache *c = cache_from_memcg(s, i);
+   if (c)
+   /* return value determined by the parent cache only */
+   attribute->store(c, buf, len);
+   }
+   mutex_unlock(&slab_mutex);
+#endif
return err;
 }
 
+static void memcg_propagate_slab_attrs(struct kmem_cache *s)
+{
+#ifdef CONFIG_MEMCG_KMEM
+   int i;
+   char *buffer = NULL;
+
+   if (!is_root_cache(s))
+   return;
+
+   /*
+* This means this cache had no attribute written. Therefore, no point
+* in copying default values around
+*/
+   if (!s->max_attr_size)
+   return;
+
+   for (i = 0; i < ARRAY_SIZE(slab_attrs); i++) {
+   char mbuf[64];
+   char *buf;
+   struct slab_attribute *attr = to_slab_attr(slab_attrs[i]);
+
+   if (!attr || !attr->store || !attr->show)
+   continue;
+
+   /*
+* It is really bad that we have to allocate here, so we will
+* do it only as a fallback. If we actually allocate, though,
+* we can just use the allocated buffer until the end.
+*
+* Most of the slub attributes will tend to be very small in
+* 
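
(The archive truncates the message at this point.) To make the
description above concrete, here is a hedged sketch of how the rest of
the loop plausibly reads -- reconstructed from the changelog, not the
author's verbatim code; 'root' and 'memcg_cache' are illustrative names
for the root cache and the new memcg copy:

	/* illustrative sketch, not the original patch text */
		/*
		 * (sketch) prefer the small on-stack buffer; fall back
		 * to a single page for anything larger
		 */
		if (buffer)
			buf = buffer;
		else if (root->max_attr_size < sizeof(mbuf))
			buf = mbuf;
		else {
			buffer = (char *)get_zeroed_page(GFP_KERNEL);
			if (WARN_ON(!buffer))
				continue;
			buf = buffer;
		}

		/* round-trip: ->show() the tunable from the root cache... */
		attr->show(root, buf);
		/* ...then ->store() the same text into the memcg copy */
		attr->store(memcg_cache, buf, strlen(buf));
	}			/* end of the for loop over slab_attrs */

	if (buffer)
		free_page((unsigned long)buffer);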
