Re: [PATCH] sysfs: fix namespace refcnt leak

2014-02-18 Thread Li Zefan
On 2014/2/19 7:12, Tejun Heo wrote:
> Hey,
> 
> On Mon, Feb 17, 2014 at 11:54:59AM +0800, Li Zefan wrote:
>> I think we have to fix kernfs in order to fix refcnt leak in sysfs
>> and cgroupfs. This fix is for 3.14, but it creates conflicts for
>> cgroup-next.
>>
>> 
>>
>> As mount() and kill_sb() are not a one-to-one match, we shouldn't get
>> ns refcnt unconditionally in sysfs_mount(), and instead we should
>> get the refcnt only when kernfs_mount() allocated a new superblock.
> 
> Ugh... nasty :(
> 
>> @@ -132,6 +132,7 @@ const void *kernfs_super_ns(struct super_block *sb)
>>   * @flags: mount flags specified for the mount
>>   * @root: kernfs_root of the hierarchy being mounted
>>   * @ns: optional namespace tag of the mount
>> + * @new: tell the caller if we allocated a new superblock
> 
> Maybe something like @new_sb_created is better?
> 
>>  struct super_block *sb;
>>  struct kernfs_super_info *info;
>>  int error;
>>  
>> +*new_sb = false;
> 
> Can we make it optional so that users who don't care about it can
> ignore it?
> 

cgroupfs also needs this to fix refcnt leak.

Because success in finding an existing cgroup_root doesn't mean no new
superblock is needed. For example:

  # mount -t cgroup -o cpuacct xxx /cgroup
  # mkdir /cgroup/tmp
  # umount /cgroup          <--- sb will be freed but cgroup_root won't

  // this will allocate new sb, but we find the cgroup_root is there.
  # mount -t cgroup -o cpuacct xxx /cgroup

But debugfs won't need this if it's converted to kernfs.
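
For illustration only, the caller-side pattern this enables would look
roughly like this (a sketch, not the actual cgroup code; cgroup_fs_type,
root->kf_root and cgroup_put_root() are hypothetical names standing in for
whatever pins and drops the cgroup_root reference):

	struct dentry *dentry;
	bool new_sb;

	dentry = kernfs_mount(&cgroup_fs_type, flags, root->kf_root, &new_sb);
	/*
	 * Keep the reference pinned only if a new superblock was actually
	 * allocated; if an existing sb was reused, it already holds a
	 * reference and ours must be dropped to avoid the leak.
	 */
	if (IS_ERR(dentry) || !new_sb)
		cgroup_put_root(root);
	return dentry;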

How about I keep kernfs_mount() API intact, and when this fix gets
merged into mainline, you merge the fix into cgroup-next, and then
I make a fix for cgroup by changing kernfs_mount()?

>> @@ -430,9 +431,9 @@ static inline int kernfs_rename(struct kernfs_node *kn,
>>  
>>  static inline struct dentry *
>>  kernfs_mount(struct file_system_type *fs_type, int flags,
>> - struct kernfs_root *root)
>> + struct kernfs_root *root, bool *new_sb)
>>  {
>> -return kernfs_mount_ns(fs_type, flags, root, NULL);
>> +return kernfs_mount_ns(fs_type, flags, root, NULL, new_sb);
> 
> And let kernfs_mount() just use NULL for the parameter?
> 




Re: [cgroup/task_lock] INFO: suspicious RCU usage.

2014-02-25 Thread Li Zefan
On 2014/2/21 23:40, Tejun Heo wrote:
> Hello,
> 
> On Fri, Feb 21, 2014 at 10:16:22AM +0800, Fengguang Wu wrote:
>> commit fb47fea7a59cf3d6387c566084a6684b5005af83
>> Author: Tejun Heo 
>> AuthorDate: Thu Feb 13 15:16:35 2014 -0500
>> Commit: Tejun Heo 
>> CommitDate: Thu Feb 13 15:16:35 2014 -0500
>>
>> cgroup: drop task_lock() protection around task->cgroups
>> 
>> For optimization, task_lock() is additionally used to protect
>> task->cgroups.  The optimization is pretty dubious as either
>> css_set_rwsem is grabbed anyway or PF_EXITING already protects
>> task->cgroups.  It adds only overhead and confusion at this point.
>> Let's drop task_[un]lock() and update comments accordingly.
>> 
>> Signed-off-by: Tejun Heo 
>>
>> [main] Setsockopt(1 2b 80d1000 4) on fd 223 [17:2:768]
>> [   27.030764] 
>> [   27.031119] ===
>> [   27.031833] [ INFO: suspicious RCU usage. ]
>> [   27.032536] 3.14.0-rc3-02458-g837caba #2 Not tainted
>> [   27.033378] ---
>> [   27.044237] include/linux/cgroup.h:697 suspicious rcu_dereference_check() usage!
>> [   27.045795] 
>> [   27.045795] other info that might help us debug this:
>> [   27.045795] 
>> [   27.047114] 
>> [   27.047114] rcu_scheduler_active = 1, debug_locks = 0
>> [main] Setsockopt(1 c 80d1000 4) on fd 225 [39:5:0]
>> [   27.048751] 2 locks held by trinity-c0/4479:
>> [   27.049478]  #0:  (callback_mutex){+.+...}, at: [<81118395>] cpuset_cpus_allowed+0x1e/0x123
>> [   27.051132]  #1:  (&(&p->alloc_lock)->rlock){+.+...}, at: [<8111839c>] cpuset_cpus_allowed+0x25/0x123
>> [   27.052788] 
>> [   27.052788] stack backtrace:
>> [   27.053528] CPU: 0 PID: 4479 Comm: trinity-c0 Not tainted 3.14.0-rc3-02458-g837caba #2
>> [   27.064971]   
>>  919eff28 81877cc3[main] Setsockopt(1 7 80d1000 4) on fd 226 [1:5:1]
> 
> So, this is from removing task_lock from task_css_set_check(), and
> adding rcu_read_lock() in cpuset_cpus_allowed() should fix it.

Yeah, rcu_read_lock() should be sufficient.

> I'm
> not sure how much of task_lock() locking we currently have in cpuset
> is actually necessary tho.  Shouldn't we be able to do most with just
> callback_mutex, if not cpuset_mutex?  Li, any ideas?
> 

task_lock() is also used to protect task->mems_allowed. I'll see if we
can get rid of most (if not all) task_lock() lockings in cpuset.



Re: [PATCHSET v2 cgroup/for-3.15] cgroup: update task migration path

2014-02-25 Thread Li Zefan
On 2014/2/14 4:28, Tejun Heo wrote:
> Hello,
> 
> This is v2 of update-task-migration-path patchset.  Changes from v1[L]
> are
> 
> * Rebased on top of "[PATCH cgroup/for-3.14-fixes] cgroup: update
>   cgroup_enable_task_cg_lists() to grab siglock"
> 
> * 0005-cgroup-update-how-a-newly-forked-task-gets-associate.patch and
>   0006-cgroup-drop-task_lock-protection-around-task-cgroups.patch
>   added to address the race between migration and fork paths.
> 
> Currently, when migrating a task or process from one cgroup to
> another, a flex_array is used to keep track of the target tasks and
> associated css_sets.  The current implementation has several issues.
> 
> * flex_array size is limited.  Given the current data structure, the
>   limit is ~87k on 64bit, which is pretty high but not impossible to
>   hit.
> 
> * If multiple targets are being migrated, as migrating each target
>   involves memory allocation, it can fail at any point.  cgroup core
>   doesn't keep track of enough state to roll back partial migration
>   either, so it ends up aborting with some targets migrated with no
>   way of finding out which.  While this isn't a big issue now, we're
>   gonna be making more use of multi-target migration.
> 
> * Fork path could race against migration path and it was impossible to
>   implement a mechanism to migrate all tasks of a cgroup to another
>   because migration path can't tell whether there are just forked
>   tasks pointing to the css_set but not linked yet.
> 
> This patchset updates task migration path such that
> 
> * task->cg_list and css_sets are also used to keep track of targets
>   during migration so that no extra memory allocation is necessary to
>   keep track of migration targets.
> 
> * Migration is split into several stages so that all preparations
>   which may fail can be performed for all targets before actually
>   starting migrating tasks.  Ignoring ->can_attach() failure, this can
>   guarantee all-or-nothing semantics of multi-target migration.
> 
> * Newly forked tasks are now atomically associated with and linked to
>   the parent's css_set in cgroup_post_fork().  This guarantees that
>   the new task either is visible in the source cgroup once the
>   parent's migration is complete or ends up in the target cgroup in
>   the first place.  This means that just keeping moving tasks out of a
>   cgroup until it's empty is guaranteed to migrate all tasks.
> 
> This patchset contains the following seven patches.
> 
>  0001-cgroup-add-css_set-mg_tasks.patch
>  0002-cgroup-use-css_set-mg_tasks-to-track-target-tasks-du.patch
>  0003-cgroup-separate-out-cset_group_from_root-from-task_c.patch
>  0004-cgroup-split-process-task-migration-into-four-steps.patch
>  0005-cgroup-update-how-a-newly-forked-task-gets-associate.patch
>  0006-cgroup-drop-task_lock-protection-around-task-cgroups.patch
>  0007-cgroup-update-cgroup_transfer_tasks-to-either-succee.patch
> 

Acked-by: Li Zefan 



[PATCH v3] sysfs: fix namespace refcnt leak

2014-02-25 Thread Li Zefan
As mount() and kill_sb() are not a one-to-one match, we shouldn't get
ns refcnt unconditionally in sysfs_mount(), and instead we should
get the refcnt only when kernfs_mount() allocated a new superblock.

v2:
- Changed the name of the new argument, suggested by Tejun.
- Made the argument optional, suggested by Tejun.

v3:
- Made the new argument the second-to-last arg, suggested by Tejun.

Reviewed-by: Tejun Heo 
Signed-off-by: Li Zefan 
---
 fs/kernfs/mount.c  | 8 +++-
 fs/sysfs/mount.c   | 5 +++--
 include/linux/kernfs.h | 9 +
 3 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 405279b..6a5f04a 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -131,6 +131,7 @@ const void *kernfs_super_ns(struct super_block *sb)
  * @fs_type: file_system_type of the fs being mounted
  * @flags: mount flags specified for the mount
  * @root: kernfs_root of the hierarchy being mounted
+ * @new_sb_created: tell the caller if we allocated a new superblock
  * @ns: optional namespace tag of the mount
  *
  * This is to be called from each kernfs user's file_system_type->mount()
@@ -141,7 +142,8 @@ const void *kernfs_super_ns(struct super_block *sb)
  * The return value can be passed to the vfs layer verbatim.
  */
 struct dentry *kernfs_mount_ns(struct file_system_type *fs_type, int flags,
-  struct kernfs_root *root, const void *ns)
+  struct kernfs_root *root, bool *new_sb_created,
+  const void *ns)
 {
struct super_block *sb;
struct kernfs_super_info *info;
@@ -159,6 +161,10 @@ struct dentry *kernfs_mount_ns(struct file_system_type *fs_type, int flags,
kfree(info);
if (IS_ERR(sb))
return ERR_CAST(sb);
+
+   if (new_sb_created)
+   *new_sb_created = !sb->s_root;
+
if (!sb->s_root) {
error = kernfs_fill_super(sb);
if (error) {
diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index 5c7fdd9..a66ad61 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -27,6 +27,7 @@ static struct dentry *sysfs_mount(struct file_system_type *fs_type,
 {
struct dentry *root;
void *ns;
+   bool new_sb;
 
if (!(flags & MS_KERNMOUNT)) {
if (!capable(CAP_SYS_ADMIN) && !fs_fully_visible(fs_type))
@@ -37,8 +38,8 @@ static struct dentry *sysfs_mount(struct file_system_type *fs_type,
}
 
ns = kobj_ns_grab_current(KOBJ_NS_TYPE_NET);
-   root = kernfs_mount_ns(fs_type, flags, sysfs_root, ns);
-   if (IS_ERR(root))
+   root = kernfs_mount_ns(fs_type, flags, sysfs_root, &new_sb, ns);
+   if (IS_ERR(root) || !new_sb)
kobj_ns_drop(KOBJ_NS_TYPE_NET, ns);
return root;
 }
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 649497a..09669d0 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -279,7 +279,8 @@ void kernfs_notify(struct kernfs_node *kn);
 
 const void *kernfs_super_ns(struct super_block *sb);
 struct dentry *kernfs_mount_ns(struct file_system_type *fs_type, int flags,
-  struct kernfs_root *root, const void *ns);
+  struct kernfs_root *root, bool *new_sb_created,
+  const void *ns);
 void kernfs_kill_sb(struct super_block *sb);
 
 void kernfs_init(void);
@@ -372,7 +373,7 @@ static inline const void *kernfs_super_ns(struct super_block *sb)
 
 static inline struct dentry *
 kernfs_mount_ns(struct file_system_type *fs_type, int flags,
-   struct kernfs_root *root, const void *ns)
+   struct kernfs_root *root, bool *new_sb_created, const void *ns)
 { return ERR_PTR(-ENOSYS); }
 
 static inline void kernfs_kill_sb(struct super_block *sb) { }
@@ -430,9 +431,9 @@ static inline int kernfs_rename(struct kernfs_node *kn,
 
 static inline struct dentry *
 kernfs_mount(struct file_system_type *fs_type, int flags,
-struct kernfs_root *root)
+struct kernfs_root *root, bool *new_sb_created)
 {
-   return kernfs_mount_ns(fs_type, flags, root, NULL);
+   return kernfs_mount_ns(fs_type, flags, root, new_sb_created, NULL);
 }
 
 #endif /* __LINUX_KERNFS_H */
-- 
1.8.0.2



Re: [cgroup/task_lock] INFO: suspicious RCU usage.

2014-02-25 Thread Li Zefan
>>> [main] Setsockopt(1 2b 80d1000 4) on fd 223 [17:2:768]
>>> [   27.030764] 
>>> [   27.031119] ===
>>> [   27.031833] [ INFO: suspicious RCU usage. ]
>>> [   27.032536] 3.14.0-rc3-02458-g837caba #2 Not tainted
>>> [   27.033378] ---
>>> [   27.044237] include/linux/cgroup.h:697 suspicious rcu_dereference_check() usage!
>>> [   27.045795] 
>>> [   27.045795] other info that might help us debug this:
>>> [   27.045795] 
>>> [   27.047114] 
>>> [   27.047114] rcu_scheduler_active = 1, debug_locks = 0
>>> [main] Setsockopt(1 c 80d1000 4) on fd 225 [39:5:0]
>>> [   27.048751] 2 locks held by trinity-c0/4479:
>>> [   27.049478]  #0:  (callback_mutex){+.+...}, at: [<81118395>] cpuset_cpus_allowed+0x1e/0x123
>>> [   27.051132]  #1:  (&(&p->alloc_lock)->rlock){+.+...}, at: [<8111839c>] cpuset_cpus_allowed+0x25/0x123
>>> [   27.052788] 
>>> [   27.052788] stack backtrace:
>>> [   27.053528] CPU: 0 PID: 4479 Comm: trinity-c0 Not tainted 3.14.0-rc3-02458-g837caba #2
>>> [   27.064971]   
>>>  919eff28 81877cc3[main] Setsockopt(1 7 80d1000 4) on fd 226 [1:5:1]
>>
>> So, this is from removing task_lock from task_css_set_check(), and
>> adding rcu_read_lock() in cpuset_cpus_allowed() should fix it.
> 
> Yeah, rcu_read_lock() should be sufficient.
> 
>> I'm
>> not sure how much of task_lock() locking we currently have in cpuset
>> is actually necessary tho.  Shouldn't we be able to do most with just
>> callback_mutex, if not cpuset_mutex?  Li, any ideas?
>>
> 
> task_lock() is also used to protect task->mems_allowed. I'll see if we
> can get rid of most (if not all) task_lock() lockings in cpuset.
> 

After a quick look, I think except for the one in cpuset_change_task_nodemask(),
all other task_lock() lockings can be replaced by rcu_read_lock().



Re: [PATCH v3] sysfs: fix namespace refcnt leak

2014-02-25 Thread Li Zefan
On 2014/2/25 23:17, Tejun Heo wrote:
> On Tue, Feb 25, 2014 at 07:19:55AM -0800, Greg Kroah-Hartman wrote:
>> On Tue, Feb 25, 2014 at 09:42:56AM -0500, Tejun Heo wrote:
>>> On Tue, Feb 25, 2014 at 07:28:44PM +0800, Li Zefan wrote:
>>>> As mount() and kill_sb() are not a one-to-one match, we shouldn't get
>>>> ns refcnt unconditionally in sysfs_mount(), and instead we should
>>>> get the refcnt only when kernfs_mount() allocated a new superblock.
>>>>
>>>> v2:
>>>> - Changed the name of the new argument, suggested by Tejun.
>>>> - Made the argument optional, suggested by Tejun.
>>>>
>>>> v3:
>>>> - Made the new argument the second-to-last arg, suggested by Tejun.
>>>>
>>>> Reviewed-by: Tejun Heo 
>>>> Signed-off-by: Li Zefan 
>>>
>>>  Acked-by: Tejun Heo 
>>
>> Is this needed for 3.14-final or 3.15?
> 
> It also fixes sysfs refcnting, so should also be applied to 3.14, I
> think.
> 

Actually it fixes sysfs refcnting only, but the change to kernfs is
necessary for fixing cgroupfs.



[PATCH 2/2] cpuset: fix a race condition in __cpuset_node_allowed_softwall()

2014-02-27 Thread Li Zefan
It's not safe to access a task's cpuset after releasing task_lock(),
because the task may be moved to another cpuset and the old cpuset may
then be freed. Holding callback_mutex won't help.
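
The window in the pre-patch ordering, annotated (a sketch of the old code,
with the race marked):

	task_lock(current);
	cs = nearest_hardwall_ancestor(task_cs(current));
	task_unlock(current);
	/*
	 * Window: current can now be attached to a different cpuset, and
	 * the cpuset 'cs' points to can be freed, so the dereference
	 * below may be a use-after-free.
	 */
	allowed = node_isset(node, cs->mems_allowed);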

Cc: 
Signed-off-by: Li Zefan 
---
 kernel/cpuset.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 5f50ec6..c63a0d9 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -2412,9 +2412,9 @@ int __cpuset_node_allowed_softwall(int node, gfp_t gfp_mask)
 
task_lock(current);
cs = nearest_hardwall_ancestor(task_cs(current));
+   allowed = node_isset(node, cs->mems_allowed);
task_unlock(current);
 
-   allowed = node_isset(node, cs->mems_allowed);
mutex_unlock(&callback_mutex);
return allowed;
 }
-- 
1.8.0.2


[PATCH 1/2] cpuset: fix a locking issue in cpuset_migrate_mm()

2014-02-27 Thread Li Zefan
I can trigger a lockdep warning:

  # mount -t cgroup -o cpuset xxx /cgroup
  # mkdir /cgroup/cpuset
  # mkdir /cgroup/tmp
  # echo 0 > /cgroup/tmp/cpuset.cpus
  # echo 0 > /cgroup/tmp/cpuset.mems
  # echo 1 > /cgroup/tmp/cpuset.memory_migrate
  # echo $$ > /cgroup/tmp/tasks
  # echo 1 > /cgroup/tmp/cpuset.mems

  ===
  [ INFO: suspicious RCU usage. ]
  3.14.0-rc1-0.1-default+ #32 Not tainted
  ---
  include/linux/cgroup.h:682 suspicious rcu_dereference_check() usage!
  ...
[] dump_stack+0x72/0x86
[] lockdep_rcu_suspicious+0x101/0x140
[] cpuset_migrate_mm+0xb1/0xe0
  ...

We used to hold cgroup_mutex when calling cpuset_migrate_mm(), but now
we hold cpuset_mutex, which causes task_css() to complain.

This is not a false-positive but a real issue.

Holding cpuset_mutex won't prevent a task's cpuset from changing, and
it won't prevent the original task->cgroup from being destroyed during
this change.

Fixes: 5d21cc2db040 (cpuset: replace cgroup_mutex locking with cpuset internal locking)
Cc:  # 3.9+
Signed-off-by: Li Zefan 
---
 kernel/cpuset.c | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index d8bec21..5f50ec6 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -948,12 +948,6 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
  *Temporarilly set tasks mems_allowed to target nodes of migration,
  *so that the migration code can allocate pages on these nodes.
  *
- *Call holding cpuset_mutex, so current's cpuset won't change
- *during this call, as manage_mutex holds off any cpuset_attach()
- *calls.  Therefore we don't need to take task_lock around the
- *call to guarantee_online_mems(), as we know no one is changing
- *our task's cpuset.
- *
  *While the mm_struct we are migrating is typically from some
  *other task, the task_struct mems_allowed that we are hacking
  *is for our current task, which must allocate new pages for that
@@ -970,8 +964,10 @@ static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from,
 
do_migrate_pages(mm, from, to, MPOL_MF_MOVE_ALL);
 
+   rcu_read_lock();
mems_cs = effective_nodemask_cpuset(task_cs(tsk));
guarantee_online_mems(mems_cs, &tsk->mems_allowed);
+   rcu_read_unlock();
 }
 
 /*
-- 
1.8.0.2


Re: [PATCH 1/2] cpuset: fix a locking issue in cpuset_migrate_mm()

2014-02-27 Thread Li Zefan
On 2014/2/27 18:19, Li Zefan wrote:
> I can trigger a lockdep warning:
> 
>   # mount -t cgroup -o cpuset xxx /cgroup
>   # mkdir /cgroup/cpuset
>   # mkdir /cgroup/tmp
>   # echo 0 > /cgroup/tmp/cpuset.cpus
>   # echo 0 > /cgroup/tmp/cpuset.mems
>   # echo 1 > /cgroup/tmp/cpuset.memory_migrate
>   # echo $$ > /cgroup/tmp/tasks
>   # echo 1 > /cgroup/tmp/cpuset.mems
> 
>   ===
>   [ INFO: suspicious RCU usage. ]
>   3.14.0-rc1-0.1-default+ #32 Not tainted
>   ---
>   include/linux/cgroup.h:682 suspicious rcu_dereference_check() usage!
>   ...
> [] dump_stack+0x72/0x86
> [] lockdep_rcu_suspicious+0x101/0x140
> [] cpuset_migrate_mm+0xb1/0xe0
>   ...
> 
> We used to hold cgroup_mutex when calling cpuset_migrate_mm(), but now
> we hold cpuset_mutex, which causes task_css() to complain.
> 
> This is not a false-positive but a real issue.
> 
> Holding cpuset_mutex won't prevent a task's cpuset from changing, and

I meant it won't prevent a task from migrating to another cpuset.

> it won't prevent the original task->cgroup from being destroyed during
> this change.




Re: [PATCH] cpuset: Add missing rcu read locks

2014-03-03 Thread Li Zefan
On 2014/2/27 22:20, Sasha Levin wrote:
> We call task_cs() which ends up going into cgroup code without having rcu
> read lock. Fix it by grabbing the lock before retrival of task_cs() and usage
> of the result.
> 

Thanks for the patch.

Fengguang already reported it, and I've made a patch for it, but I was fixing
two bugs for 3.14 and then waiting for them to be applied.

There're other places where rcu_read_lock() should be used, and we need to
remove task_lock().



[PATCH] cpuset: use rcu_read_lock() to protect task_cs()

2014-03-03 Thread Li Zefan
We no longer use task_lock() to protect tsk->cgroups.

Reported-by: Fengguang Wu 
Signed-off-by: Li Zefan 
---
 kernel/cpuset.c | 24 +---
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index c63a0d9..4b3560d 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -2235,10 +2235,10 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
struct cpuset *cpus_cs;
 
mutex_lock(&callback_mutex);
-   task_lock(tsk);
+   rcu_read_lock();
cpus_cs = effective_cpumask_cpuset(task_cs(tsk));
guarantee_online_cpus(cpus_cs, pmask);
-   task_unlock(tsk);
+   rcu_read_unlock();
mutex_unlock(&callback_mutex);
 }
 
@@ -2291,10 +2291,10 @@ nodemask_t cpuset_mems_allowed(struct task_struct *tsk)
nodemask_t mask;
 
mutex_lock(&callback_mutex);
-   task_lock(tsk);
+   rcu_read_lock();
mems_cs = effective_nodemask_cpuset(task_cs(tsk));
guarantee_online_mems(mems_cs, &mask);
-   task_unlock(tsk);
+   rcu_read_unlock();
mutex_unlock(&callback_mutex);
 
return mask;
@@ -2410,10 +2410,10 @@ int __cpuset_node_allowed_softwall(int node, gfp_t gfp_mask)
/* Not hardwall and node outside mems_allowed: scan up cpusets */
mutex_lock(&callback_mutex);
 
-   task_lock(current);
+   rcu_read_lock();
cs = nearest_hardwall_ancestor(task_cs(current));
allowed = node_isset(node, cs->mems_allowed);
-   task_unlock(current);
+   rcu_read_unlock();
 
mutex_unlock(&callback_mutex);
return allowed;
@@ -2539,24 +2539,26 @@ int cpuset_mems_allowed_intersects(const struct task_struct *tsk1,
  * @task: pointer to task_struct of some task.
  *
  * Description: Prints @task's name, cpuset name, and cached copy of its
- * mems_allowed to the kernel log.  Must hold task_lock(task) to allow
- * dereferencing task_cs(task).
+ * mems_allowed to the kernel log.
  */
 void cpuset_print_task_mems_allowed(struct task_struct *tsk)
 {
 /* Statically allocated to prevent using excess stack. */
static char cpuset_nodelist[CPUSET_NODELIST_LEN];
static DEFINE_SPINLOCK(cpuset_buffer_lock);
-   struct cgroup *cgrp = task_cs(tsk)->css.cgroup;
+   struct cgroup *cgrp;
 
spin_lock(&cpuset_buffer_lock);
+   rcu_read_lock();
 
+   cgrp = task_cs(tsk)->css.cgroup;
nodelist_scnprintf(cpuset_nodelist, CPUSET_NODELIST_LEN,
   tsk->mems_allowed);
printk(KERN_INFO "%s cpuset=", tsk->comm);
pr_cont_cgroup_name(cgrp);
pr_cont(" mems_allowed=%s\n", cpuset_nodelist);
 
+   rcu_read_unlock();
spin_unlock(&cpuset_buffer_lock);
 }
 
@@ -2588,9 +2590,9 @@ int cpuset_memory_pressure_enabled __read_mostly;
 
 void __cpuset_memory_pressure_bump(void)
 {
-   task_lock(current);
+   rcu_read_lock();
fmeter_markevent(&task_cs(current)->fmeter);
-   task_unlock(current);
+   rcu_read_unlock();
 }
 
 #ifdef CONFIG_PROC_PID_CPUSET
-- 
1.8.0.2



Re: [PATCH] cgroup: missing rcu read lock around task_css_set

2014-03-04 Thread Li Zefan
On 2014/3/5 3:47, Tejun Heo wrote:
> On Tue, Mar 04, 2014 at 12:20:45PM -0500, Sasha Levin wrote:
>>> Hrm... there is a PF_EXITING check there already:
>>>
>>> #define task_css_set_check(task, __c)\
>>> rcu_dereference_check((task)->cgroups,\
>>> lockdep_is_held(_mutex) ||\
>>> lockdep_is_held(_set_rwsem) ||\
>>> ((task)->flags & PF_EXITING) || (__c))
>>>
>>> I see it's not happening on Linus's master so I'll run a bisection to 
>>> figure out what broke it.
>>
>> Hi Tejun,
>>
>> It bisects down to your patch: "cgroup: drop task_lock() protection
>> around task->cgroups". I'll look into it later unless it's obvious
>> to you.
> 
> Hmmm... maybe I'm confused and PF_EXITING is not set there and
> task_lock was what held off the lockdep warning.  Confused
> 

Because this cgroup_exit() is called in a failure path in copy_process(),
where PF_EXITING has never been set on the task.



Re: [PATCH 1/3] kmemleak: allow freeing internal objects after disabling kmemleak

2014-03-16 Thread Li Zefan
On 2014/3/13 20:14, Catalin Marinas wrote:
> On Thu, Mar 13, 2014 at 06:47:46AM +0000, Li Zefan wrote:
>> +Freeing kmemleak internal objects
>> +-
>> +
>> +To allow access to previosuly found memory leaks even when an error fatal
>> +to kmemleak happens, internal kmemleak objects won't be freed when kmemleak
>> +is disabled, and those objects may occupy a large part of physical
>> +memory.
>> +
>> +If you want to make sure they're freed before disabling kmemleak:
>> +
>> +  # echo scan=off > /sys/kernel/debug/kmemleak
>> +  # echo off > /sys/kernel/debug/kmemleak
> 
> I would actually change the code to do a stop_scan_thread() as part of
> the "off" handling so that scan=off is not required (we can't put it as
> part of the kmemleak_disable because we need scan_mutex held).
> 

Sounds reasonable.



[PATCH v2 2/3] kmemleak: remove redundant code

2014-03-16 Thread Li Zefan
- remove kmemleak_padding().
- remove kmemleak_release().

Signed-off-by: Li Zefan 
---
 include/linux/kmemleak.h | 2 --
 mm/kmemleak.c| 7 +--
 2 files changed, 1 insertion(+), 8 deletions(-)

diff --git a/include/linux/kmemleak.h b/include/linux/kmemleak.h
index 2a5e554..5bb4246 100644
--- a/include/linux/kmemleak.h
+++ b/include/linux/kmemleak.h
@@ -30,8 +30,6 @@ extern void kmemleak_alloc_percpu(const void __percpu *ptr, 
size_t size) __ref;
 extern void kmemleak_free(const void *ptr) __ref;
 extern void kmemleak_free_part(const void *ptr, size_t size) __ref;
 extern void kmemleak_free_percpu(const void __percpu *ptr) __ref;
-extern void kmemleak_padding(const void *ptr, unsigned long offset,
-size_t size) __ref;
 extern void kmemleak_not_leak(const void *ptr) __ref;
 extern void kmemleak_ignore(const void *ptr) __ref;
 extern void kmemleak_scan_area(const void *ptr, size_t size, gfp_t gfp) __ref;
diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 7fc030e..54270f2 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -1545,11 +1545,6 @@ static int kmemleak_open(struct inode *inode, struct file *file)
return seq_open(file, &kmemleak_seq_ops);
 }
 
-static int kmemleak_release(struct inode *inode, struct file *file)
-{
-   return seq_release(inode, file);
-}
-
 static int dump_str_object_info(const char *str)
 {
unsigned long flags;
@@ -1680,7 +1675,7 @@ static const struct file_operations kmemleak_fops = {
.read   = seq_read,
.write  = kmemleak_write,
.llseek = seq_lseek,
-   .release= kmemleak_release,
+   .release= seq_release,
 };
 
 /*
-- 
1.8.0.2



[PATCH v2 1/3] kmemleak: allow freeing internal objects after kmemleak was disabled

2014-03-16 Thread Li Zefan
Currently if kmemleak is disabled, the kmemleak objects can never be freed,
no matter if it's disabled by a user or due to fatal errors.

Those objects can be a big waste of memory.

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
1200264 1197433  99%    0.30K  46164       26    369312K kmemleak_object

With this patch, internal objects will be freed immediately if kmemleak is
disabled explicitly by a user. If it's disabled due to a kmemleak error,
the user will be informed, and then he/she can reclaim memory with:

# echo off > /sys/kernel/debug/kmemleak

v2: use "off" handler instead of "clear" handler to do this, suggested
by Catalin.

Signed-off-by: Li Zefan 
---
 Documentation/kmemleak.txt | 14 +-
 mm/kmemleak.c  | 21 -
 2 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/Documentation/kmemleak.txt b/Documentation/kmemleak.txt
index 6dc8013..00aa013 100644
--- a/Documentation/kmemleak.txt
+++ b/Documentation/kmemleak.txt
@@ -42,7 +42,8 @@ objects to be reported as orphan.
 Memory scanning parameters can be modified at run-time by writing to the
 /sys/kernel/debug/kmemleak file. The following parameters are supported:
 
-  off  - disable kmemleak (irreversible)
+  off  - disable kmemleak, or free all kmemleak objects if kmemleak
+ has been disabled due to fatal errors. (irreversible).
   stack=on - enable the task stacks scanning (default)
   stack=off- disable the tasks stacks scanning
   scan=on  - start the automatic memory scanning thread (default)
@@ -118,6 +119,17 @@ Then as usual to get your report with:
 
   # cat /sys/kernel/debug/kmemleak
 
+Freeing kmemleak internal objects
+---------------------------------
+
+To allow access to previously found memory leaks even when an error fatal
+to kmemleak happens, internal kmemleak objects won't be freed in this case.
+Those objects may occupy a large part of physical memory.
+
+You can reclaim memory from those objects with:
+
+  # echo off > /sys/kernel/debug/kmemleak
+
 Kmemleak API
 
 
diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 31f01c5..7fc030e 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -1616,9 +1616,6 @@ static ssize_t kmemleak_write(struct file *file, const char __user *user_buf,
int buf_size;
int ret;
 
-   if (!atomic_read(&kmemleak_enabled))
-   return -EBUSY;
-
buf_size = min(size, (sizeof(buf) - 1));
if (strncpy_from_user(buf, user_buf, buf_size) < 0)
return -EFAULT;
@@ -1628,9 +1625,18 @@ static ssize_t kmemleak_write(struct file *file, const char __user *user_buf,
if (ret < 0)
return ret;
 
-   if (strncmp(buf, "off", 3) == 0)
+   if (strncmp(buf, "off", 3) == 0) {
+   stop_scan_thread();
kmemleak_disable();
-   else if (strncmp(buf, "stack=on", 8) == 0)
+   goto out;
+   }
+
+   if (!atomic_read(&kmemleak_enabled)) {
+   ret = -EBUSY;
+   goto out;
+   }
+
+   if (strncmp(buf, "stack=on", 8) == 0)
kmemleak_stack_scan = 1;
else if (strncmp(buf, "stack=off", 9) == 0)
kmemleak_stack_scan = 0;
@@ -1695,6 +1701,11 @@ static void kmemleak_do_cleanup(struct work_struct *work)
list_for_each_entry_rcu(object, &object_list, object_list)
delete_object_full(object->pointer);
rcu_read_unlock();
+   } else {
+   pr_info("Disable kmemleak without freeing internal objects, "
+   "so you may still check information on memory leak. "
+   "You may reclaim memory by writing \"off\" to "
+   "/sys/kernel/debug/kmemleak\n");
}
mutex_unlock(_mutex);
 }
-- 
1.8.0.2



[PATCH v2 3/3] kmemleak: change some global variables to int

2014-03-16 Thread Li Zefan
They don't have to be atomic_t, because they are simple boolean
toggles.

Signed-off-by: Li Zefan 
---
 mm/kmemleak.c | 78 +--
 1 file changed, 39 insertions(+), 39 deletions(-)

diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 54270f2..c352c63 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -192,15 +192,15 @@ static struct kmem_cache *object_cache;
 static struct kmem_cache *scan_area_cache;
 
 /* set if tracing memory operations is enabled */
-static atomic_t kmemleak_enabled = ATOMIC_INIT(0);
+static int kmemleak_enabled;
 /* set in the late_initcall if there were no errors */
-static atomic_t kmemleak_initialized = ATOMIC_INIT(0);
+static int kmemleak_initialized;
 /* enables or disables early logging of the memory operations */
-static atomic_t kmemleak_early_log = ATOMIC_INIT(1);
+static int kmemleak_early_log = 1;
 /* set if a kmemleak warning was issued */
-static atomic_t kmemleak_warning = ATOMIC_INIT(0);
+static int kmemleak_warning;
 /* set if a fatal kmemleak error has occurred */
-static atomic_t kmemleak_error = ATOMIC_INIT(0);
+static int kmemleak_error;
 
 /* minimum and maximum address that may be valid pointers */
 static unsigned long min_addr = ULONG_MAX;
@@ -267,7 +267,7 @@ static void kmemleak_disable(void);
 #define kmemleak_warn(x...)do {\
pr_warning(x);  \
dump_stack();   \
-   atomic_set(&kmemleak_warning, 1);   \
+   kmemleak_warning = 1;   \
 } while (0)
 
 /*
@@ -805,7 +805,7 @@ static void __init log_early(int op_type, const void *ptr, size_t size,
unsigned long flags;
struct early_log *log;
 
-   if (atomic_read(&kmemleak_error)) {
+   if (kmemleak_error) {
/* kmemleak stopped recording, just count the requests */
crt_early_log++;
return;
@@ -840,7 +840,7 @@ static void early_alloc(struct early_log *log)
unsigned long flags;
int i;
 
-   if (!atomic_read(&kmemleak_enabled) || !log->ptr || IS_ERR(log->ptr))
+   if (!kmemleak_enabled || !log->ptr || IS_ERR(log->ptr))
return;
 
/*
@@ -893,9 +893,9 @@ void __ref kmemleak_alloc(const void *ptr, size_t size, int min_count,
 {
pr_debug("%s(0x%p, %zu, %d)\n", __func__, ptr, size, min_count);
 
-   if (atomic_read(&kmemleak_enabled) && ptr && !IS_ERR(ptr))
+   if (kmemleak_enabled && ptr && !IS_ERR(ptr))
create_object((unsigned long)ptr, size, min_count, gfp);
-   else if (atomic_read(&kmemleak_early_log))
+   else if (kmemleak_early_log)
log_early(KMEMLEAK_ALLOC, ptr, size, min_count);
 }
 EXPORT_SYMBOL_GPL(kmemleak_alloc);
@@ -919,11 +919,11 @@ void __ref kmemleak_alloc_percpu(const void __percpu *ptr, size_t size)
 * Percpu allocations are only scanned and not reported as leaks
 * (min_count is set to 0).
 */
-   if (atomic_read(&kmemleak_enabled) && ptr && !IS_ERR(ptr))
+   if (kmemleak_enabled && ptr && !IS_ERR(ptr))
for_each_possible_cpu(cpu)
create_object((unsigned long)per_cpu_ptr(ptr, cpu),
  size, 0, GFP_KERNEL);
-   else if (atomic_read(&kmemleak_early_log))
+   else if (kmemleak_early_log)
log_early(KMEMLEAK_ALLOC_PERCPU, ptr, size, 0);
 }
 EXPORT_SYMBOL_GPL(kmemleak_alloc_percpu);
@@ -939,9 +939,9 @@ void __ref kmemleak_free(const void *ptr)
 {
pr_debug("%s(0x%p)\n", __func__, ptr);
 
-   if (atomic_read(&kmemleak_enabled) && ptr && !IS_ERR(ptr))
+   if (kmemleak_enabled && ptr && !IS_ERR(ptr))
delete_object_full((unsigned long)ptr);
-   else if (atomic_read(&kmemleak_early_log))
+   else if (kmemleak_early_log)
log_early(KMEMLEAK_FREE, ptr, 0, 0);
 }
 EXPORT_SYMBOL_GPL(kmemleak_free);
@@ -959,9 +959,9 @@ void __ref kmemleak_free_part(const void *ptr, size_t size)
 {
pr_debug("%s(0x%p)\n", __func__, ptr);
 
-   if (atomic_read(&kmemleak_enabled) && ptr && !IS_ERR(ptr))
+   if (kmemleak_enabled && ptr && !IS_ERR(ptr))
delete_object_part((unsigned long)ptr, size);
-   else if (atomic_read(&kmemleak_early_log))
+   else if (kmemleak_early_log)
log_early(KMEMLEAK_FREE_PART, ptr, size, 0);
 }
 EXPORT_SYMBOL_GPL(kmemleak_free_part);
@@ -979,11 +979,11 @@ void __ref kmemleak_free_percpu(const void __percpu *ptr)
 
pr_debug("%s(0x%p)\n", __func__, ptr);
 
-   if (atomic_read(&kmemleak_enabled) && ptr && !IS_ERR(ptr))
+   if (kmemleak_enabled && ptr && !IS_ERR(ptr))
for_each_possible_cpu(cpu)
delete_object_full((unsigned long)per_cpu_ptr(ptr,
  

Re: [PATCHSET cgroup/for-3.15] cgroup: prepare for the default unified hierarchy

2014-03-17 Thread Li Zefan
On 2014/3/14 3:50, Tejun Heo wrote:
> Hello,
> 
> The planned unified hierarchy is just around the corner.  We're gonna
> repurpose the dummy hierarchy to serve as the default unified
> hierarchy.  Controllers which aren't attached to other hierarchies are
> already automatically bound to the dummy hierarchy; however, the dummy
> hierarchy is minimally initialized during boot and can't currently act
> as a full hierarchy.
> 
> This patchset gradually turns the dummy hierarchy into fully
> functional default one and makes it mountable via selecting
> sane_behavior without any subsystems specified.  While this patchset
> does make the default hierarchy mountable, the unified hierarchy
> functionality isn't implemented yet and cgroup creation in the default
> hierarchy is rejected for now.
> 
> This lays down all the boilerplate work for the unified hierarchy.
> The next patchset will implement the actual unified functionality.
> 
> This patchset contains the following eleven patches.
> 
>  0001-cgroup-cgroup-subsys-should-be-cleared-after-the-css.patch
>  0002-cgroup-relocate-setting-of-CGRP_DEAD.patch
>  0003-cgroup-reorganize-cgroup-bootstrapping.patch
>  0004-cgroup-use-cgroup_setup_root-to-initialize-cgroup_du.patch
>  0005-cgroup-remove-NULL-checks-from-pr_cont_-cgroup_-name.patch
>  0006-cgroup-treat-cgroup_dummy_root-as-an-equivalent-hier.patch
>  0007-cgroup-move-subsys_mask-from-cgroupfs_root-to-cgroup.patch
>  0008-cgroup-rename-cgroup_dummy_root-and-related-names.patch
>  0009-cgroup-drop-const-from-buffer-of-cftype-write_string.patch
>  0010-cgroup-make-cgrp_dfl_root-mountable.patch
>  0011-cgroup-implement-CFTYPE_ONLY_ON_DFL.patch
> 
> 0001-0002 are misc fix / prep.
> 
> 0003-0009 (mostly) turn the dummy hierarchy into a full hierarchy
> which takes on controllers which aren't attached to other hierarchies.
> 
> 0010-0011 make the default hierarchy mountable and implement
> CFTYPE_ONLY_DFL.
> 
> This patchset is on top of cgroup/for-3.15 b8dadcb58d54 ("cpuset: use
> rcu_read_lock() to protect task_cs()").  diffstat follows.
> 
>  block/blk-throttle.c  |4
>  block/cfq-iosched.c   |4
>  include/linux/cgroup.h|   57 ++---
>  kernel/cgroup.c   |  464 +-
>  kernel/cgroup_freezer.c   |2
>  kernel/cpuset.c   |2
>  mm/hugetlb_cgroup.c   |2
>  mm/memcontrol.c   |4
>  net/core/netprio_cgroup.c |    2
>  net/ipv4/tcp_memcontrol.c |2
>  security/device_cgroup.c  |4
>  11 files changed, 295 insertions(+), 252 deletions(-)
> 

Looks good in general!

Acked-by: Li Zefan 



[PATCH] cgroup: fix a failure path in create_css()

2014-03-18 Thread Li Zefan
If online_css() fails, we should remove cgroup files belonging
to css->ss.

Signed-off-by: Li Zefan 
---
 kernel/cgroup.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 3790617..e673407 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -3408,17 +3408,17 @@ static int create_css(struct cgroup *cgrp, struct cgroup_subsys *ss)
 
err = percpu_ref_init(&css->refcnt, css_release);
if (err)
-   goto err_free;
+   goto err_free_css;
 
init_css(css, ss, cgrp);
 
err = cgroup_populate_dir(cgrp, 1 << ss->id);
if (err)
-   goto err_free;
+   goto err_free_percpu_ref;
 
err = online_css(css);
if (err)
-   goto err_free;
+   goto err_clear_dir;
 
cgroup_get(cgrp);
css_get(css->parent);
@@ -3434,8 +3434,11 @@ static int create_css(struct cgroup *cgrp, struct cgroup_subsys *ss)
 
return 0;
 
-err_free:
+err_clear_dir:
+   cgroup_clear_dir(css->cgroup, 1 << css->ss->id);
+err_free_percpu_ref:
percpu_ref_cancel_init(&css->refcnt);
+err_free_css:
ss->css_free(css);
return err;
 }
-- 
1.8.0.2


Re: [cgroup] BUG: unable to handle kernel NULL pointer dereference at 0000001c

2014-02-19 Thread Li Zefan
On 2014/2/19 20:25, Fengguang Wu wrote:
> Greetings,
> 
> I got the below dmesg and the first bad commit is
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-next
> 
> commit ace2bee8135a3dc725958b8d08c55ee9df813d39
> Author: Tejun Heo 
> AuthorDate: Tue Feb 11 11:52:47 2014 -0500
> Commit: Tejun Heo 
> CommitDate: Tue Feb 11 11:52:47 2014 -0500
> 
> cgroup: introduce cgroup_tree_mutex
> 
>     Signed-off-by: Tejun Heo 
> Acked-by: Li Zefan 
> 

Thanks for the report. It was introduced by

e61734c55c24cdf11b07e52a74aec4dc4a7f4bd0 cgroup: remove cgroup->name

And it has been fixed by

cc045e3952175e84c38dad22dea14465b9fc8fb5 cgroup: deal with dummy_top in cgroup_name() and cgroup_path()

The fix has been in for-next for about a week, so I'm curious why
you still hit this bug.

> [   97.464167] BUG: unable to handle kernel NULL pointer dereference at 0000001c
> [   97.464897] IP: [<8130f33d>] kernfs_path_locked+0x7d/0xc0
> [   97.465519] *pde = 
> [   97.465960] Oops:  [#1] SMP
> [   97.466936] CPU: 0 PID: 24 Comm: kworker/0:1 Tainted: G S  W 3.14.0-rc2-01592-g97162c1 #6
> [   97.467818] Workqueue: events_freezable_power_ disk_events_workfn
> [   97.468342] task: 801c7560 ti: 801d6000 task.ti: 801d6000
> [   97.468836] EIP: 0060:[<8130f33d>] EFLAGS: 0046 CPU: 0
> [   97.469291] EIP is at kernfs_path_locked+0x7d/0xc0
> [   97.469636] EAX:  EBX:  ECX: 801d7c34 EDX: 801d7bb4
> [   97.470024] ESI: 0046 EDI: 801d7c33 EBP: 801d7b68 ESP: 801d7b58
> [   97.470505]  DS: 007b ES: 007b FS: 00d8 GS:  SS: 0068
> [   97.470929] CR0: 8005003b CR2: 001c CR3: 03b92000 CR4: 00300690
> [   97.471550] DR0:  DR1:  DR2:  DR3: 
> [   97.472155] DR6:  DR7: 
> [   97.472397] Stack:
> [   97.472397]  801d7bb4  0046 801d7bb4 801d7b80 8130fe59 
> 0080 929c0ce0
> [   97.472397]  929ae498 0018 801d7c40 8199b88a 0002 0001 
>  8199b650
> [   97.472397]  929ae498  0001 801c7560 929ae5e4 929b46f8 
> 929c0ca0 819978d0
> [   97.472397] Call Trace:
> [   97.472397]  [<8130fe59>] kernfs_path+0x49/0x80
> [   97.472397]  [<8199b88a>] cfq_find_alloc_queue+0x3fa/0x620
> [   97.472397]  [<8199b650>] ? cfq_find_alloc_queue+0x1c0/0x620
> [   97.472397]  [<819978d0>] ? cfq_put_request+0x220/0x220
> [   97.472397]  [<8199bca5>] ? cfq_set_request+0x85/0x740
> [   97.472397]  [<8111e5ce>] ? do_raw_spin_lock+0xe/0x2e0
> [   97.472397]  [<8199bb35>] cfq_get_queue+0x85/0x170
> [   97.472397]  [<8199bff7>] cfq_set_request+0x3d7/0x740
> 



[PATCH v2] sysfs: fix namespace refcnt leak

2014-02-20 Thread Li Zefan
As mount() and kill_sb() are not a one-to-one match, we shouldn't get
ns refcnt unconditionally in sysfs_mount(), and instead we should
get the refcnt only when kernfs_mount() allocated a new superblock.

v2:
- Changed the name of the new argument, suggested by Tejun.
- Made the argument optional, suggested by Tejun.

Signed-off-by: Li Zefan 
---

This fix is for 3.14.

---
 fs/kernfs/mount.c  | 8 +++-
 fs/sysfs/mount.c   | 5 +++--
 include/linux/kernfs.h | 9 +
 3 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 405279b..47f4efd 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -132,6 +132,7 @@ const void *kernfs_super_ns(struct super_block *sb)
  * @flags: mount flags specified for the mount
  * @root: kernfs_root of the hierarchy being mounted
  * @ns: optional namespace tag of the mount
+ * @new_sb_created: tell the caller if we allocated a new superblock
  *
  * This is to be called from each kernfs user's file_system_type->mount()
  * implementation, which should pass through the specified @fs_type and
@@ -141,7 +142,8 @@ const void *kernfs_super_ns(struct super_block *sb)
  * The return value can be passed to the vfs layer verbatim.
  */
 struct dentry *kernfs_mount_ns(struct file_system_type *fs_type, int flags,
-  struct kernfs_root *root, const void *ns)
+  struct kernfs_root *root, const void *ns,
+  bool *new_sb_created)
 {
struct super_block *sb;
struct kernfs_super_info *info;
@@ -159,6 +161,10 @@ struct dentry *kernfs_mount_ns(struct file_system_type *fs_type, int flags,
kfree(info);
if (IS_ERR(sb))
return ERR_CAST(sb);
+
+   if (new_sb_created)
+   *new_sb_created = !sb->s_root;
+
if (!sb->s_root) {
error = kernfs_fill_super(sb);
if (error) {
diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index 5c7fdd9..f5bea79 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -27,6 +27,7 @@ static struct dentry *sysfs_mount(struct file_system_type *fs_type,
 {
struct dentry *root;
void *ns;
+   bool new_sb;
 
if (!(flags & MS_KERNMOUNT)) {
if (!capable(CAP_SYS_ADMIN) && !fs_fully_visible(fs_type))
@@ -37,8 +38,8 @@ static struct dentry *sysfs_mount(struct file_system_type *fs_type,
}
 
ns = kobj_ns_grab_current(KOBJ_NS_TYPE_NET);
-   root = kernfs_mount_ns(fs_type, flags, sysfs_root, ns);
-   if (IS_ERR(root))
+   root = kernfs_mount_ns(fs_type, flags, sysfs_root, ns, &new_sb);
+   if (IS_ERR(root) || !new_sb)
kobj_ns_drop(KOBJ_NS_TYPE_NET, ns);
return root;
 }
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 649497a..ea3f5d2 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -279,7 +279,8 @@ void kernfs_notify(struct kernfs_node *kn);
 
 const void *kernfs_super_ns(struct super_block *sb);
 struct dentry *kernfs_mount_ns(struct file_system_type *fs_type, int flags,
-  struct kernfs_root *root, const void *ns);
+  struct kernfs_root *root, const void *ns,
+  bool *new_sb_created);
 void kernfs_kill_sb(struct super_block *sb);
 
 void kernfs_init(void);
@@ -372,7 +373,7 @@ static inline const void *kernfs_super_ns(struct super_block *sb)
 
 static inline struct dentry *
 kernfs_mount_ns(struct file_system_type *fs_type, int flags,
-   struct kernfs_root *root, const void *ns)
+   struct kernfs_root *root, const void *ns, bool *new_sb_created)
 { return ERR_PTR(-ENOSYS); }
 
 static inline void kernfs_kill_sb(struct super_block *sb) { }
@@ -430,9 +431,9 @@ static inline int kernfs_rename(struct kernfs_node *kn,
 
 static inline struct dentry *
 kernfs_mount(struct file_system_type *fs_type, int flags,
-struct kernfs_root *root)
+struct kernfs_root *root, bool *new_sb_created)
 {
-   return kernfs_mount_ns(fs_type, flags, root, NULL);
+   return kernfs_mount_ns(fs_type, flags, root, NULL, new_sb_created);
 }
 
 #endif /* __LINUX_KERNFS_H */
-- 
1.8.0.2



Re: [PATCH 2/3] jffs2: avoid soft-lockup in jffs2_reserve_space_gc()

2014-02-11 Thread Li Zefan
>> --- a/fs/jffs2/nodemgmt.c
>> +++ b/fs/jffs2/nodemgmt.c
>> @@ -216,15 +216,20 @@ int jffs2_reserve_space_gc(struct jffs2_sb_info *c, 
>> uint32_t minsize,
>>  
>>  jffs2_dbg(1, "%s(): Requested 0x%x bytes\n", __func__, minsize);
>>  
>> -spin_lock(&c->erase_completion_lock);
>> -while(ret == -EAGAIN) {
>> +while (true) {
>> +spin_lock(&c->erase_completion_lock);
>>  ret = jffs2_do_reserve_space(c, minsize, len, sumsize);
>>  if (ret) {
>>  jffs2_dbg(1, "%s(): looping, ret is %d\n",
>>__func__, ret);
>>  }
>> +spin_unlock(&c->erase_completion_lock);
>> +
>> +if (ret == -EAGAIN)
>> +cond_resched();
>> +else
>> +break;
>>  }
>> -spin_unlock(&c->erase_completion_lock);
>>  if (!ret)
>>  ret = jffs2_prealloc_raw_node_refs(c, c->nextblock, 1);
> 
> Looks OK.  We can do this:
> 

Yeah, thanks for the cleanup.

> --- a/fs/jffs2/nodemgmt.c~jffs2-avoid-soft-lockup-in-jffs2_reserve_space_gc-fix
> +++ a/fs/jffs2/nodemgmt.c
> @@ -211,7 +211,7 @@ out:
>  int jffs2_reserve_space_gc(struct jffs2_sb_info *c, uint32_t minsize,
>  uint32_t *len, uint32_t sumsize)
>  {
> - int ret = -EAGAIN;
> + int ret;
>   minsize = PAD(minsize);
>  
>   jffs2_dbg(1, "%s(): Requested 0x%x bytes\n", __func__, minsize);
> _



Re: [PATCH] cgroup: protect modifications to cgroup_idr with cgroup_mutex

2014-02-11 Thread Li Zefan
On 2014/2/11 18:20, Michal Hocko wrote:
> Hi Li,
> good work in reproducing the issue so quickly!
> I have tried to backport this patch to 3.12 kernel but the code has
> changed since then.
> The only two instances of idr_remove which are called outside of
> cgroup_mutex seem to be:
>  - cgroup_create calling it from err_free_id: path
>  - css_free_work_fn
>  mem_cgroup_css_free
>__mem_cgroup_free
>  free_css_id
> 
> The second one takes ss->id_lock spinlock which should be sufficient
> to exclude get_new_cssid but cgroup_mount and cgroup_create don't use
> id_lock. They do hold cgroup_mutex though. So I guess I need something
> like the following (I will have it tested):

I don't think you need to do anything with ss->idr.

cgroup_create() calls alloc_css_id() -> get_new_cssid(), and get_new_cssid()
uses id_lock.

cgroup_mount() won't touch ss->idr, because the css_id for root cgroup is
always 0.



Re: [PATCH cgroup/for-3.14-fixes] cgroup: protect modifications to cgroup_idr with cgroup_mutex

2014-02-11 Thread Li Zefan
On 2014/2/12 0:26, Michal Hocko wrote:
> On Tue 11-02-14 10:41:05, Tejun Heo wrote:
> [...]
>> @@ -4254,12 +4256,12 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
>>  
>>  return 0;
>>  
>> -err_unlock:
>> -mutex_unlock(&cgroup_mutex);
>> -/* Release the reference count that we took on the superblock */
>> -deactivate_super(sb);
>>  err_free_id:
>>  idr_remove(&root->cgroup_idr, cgrp->id);
>> +/* Release the reference count that we took on the superblock */
>> +deactivate_super(sb);
>> +err_unlock:
>> +mutex_unlock(&cgroup_mutex);
>>  err_free_name:
>>  kfree(rcu_dereference_raw(cgrp->name));
>>  err_free_cgrp:
> 
> Do I have to change deactivate_super vs. mutex_unlock ordering in my
> backport for 3.12 as well?
> 

Your change is wrong: you shouldn't drop the sb refcnt in the err_unlock path.

But you made me think about whether it's OK to hold cgroup_mutex while
calling deactivate_super(), and the answer is NO! deactivate_super() may
call cgroup_kill_sb(), which will acquire cgroup_mutex.

I'll update the patch.

Thanks to Tejun, we won't be entangled with vfs internals anymore after
converting to kernfs.
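
The potential deadlock, sketched as an annotated call chain (the scenario
described above, not literal code):

	cgroup_create()
	    mutex_lock(&cgroup_mutex);
	    ...
	    deactivate_super(sb);            /* if this drops the last sb ref */
	        cgroup_kill_sb(sb);
	            mutex_lock(&cgroup_mutex);   /* deadlock: already held */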



Re: [PATCH cgroup/for-3.14-fixes] cgroup: protect modifications to cgroup_idr with cgroup_mutex

2014-02-11 Thread Li Zefan
On 2014/2/12 10:15, Li Zefan wrote:
> On 2014/2/12 0:26, Michal Hocko wrote:
>> On Tue 11-02-14 10:41:05, Tejun Heo wrote:
>> [...]
>>> @@ -4254,12 +4256,12 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
>>>  
>>> return 0;
>>>  
>>> -err_unlock:
>>> -   mutex_unlock(&cgroup_mutex);
>>> -   /* Release the reference count that we took on the superblock */
>>> -   deactivate_super(sb);
>>>  err_free_id:
>>> idr_remove(&root->cgroup_idr, cgrp->id);
>>> +   /* Release the reference count that we took on the superblock */
>>> +   deactivate_super(sb);
>>> +err_unlock:
>>> +   mutex_unlock(&cgroup_mutex);
>>>  err_free_name:
>>> kfree(rcu_dereference_raw(cgrp->name));
>>>  err_free_cgrp:
>>
>> Do I have to change deactivate_super vs. mutex_unlock ordering in my
>> backport for 3.12 as well?
>>
> 
> Your change is wrong: you shouldn't drop the sb refcnt in the err_unlock path.
> 
> But you made me think about whether it's OK to hold cgroup_mutex while
> calling deactivate_super(), and the answer is NO! deactivate_super() may
> call cgroup_kill_sb(), which will acquire cgroup_mutex.
> 
> I'll update the patch.
> 
> Thanks to Tejun, we won't be entangled with vfs internals anymore after
> converting to kernfs.
> 

On second thought, it should be safe to call deactivate_super() before
releasing cgroup_mutex: cgroup_create() is called through vfs, so vfs
should guarantee the superblock won't disappear, and thus this
deactivate_super() won't drop the sb refcnt to 0.

Still this is just my guess without diving into vfs code, and we'd better
not depend on it even my guess is correct.



[PATCH v2] cgroup: protect modifications to cgroup->idr with cgroup_mutex

2014-02-11 Thread Li Zefan
Setup cgroupfs like this:
  # mount -t cgroup -o cpuacct xxx /cgroup
  # mkdir /cgroup/sub1
  # mkdir /cgroup/sub2

Then run these two commands:
  # for ((; ;)) { mkdir /cgroup/sub1/tmp && rmdir /cgroup/sub1/tmp; } &
  # for ((; ;)) { mkdir /cgroup/sub2/tmp && rmdir /cgroup/sub2/tmp; } &

After seconds you may see this warning:

[ cut here ]
WARNING: CPU: 1 PID: 25243 at lib/idr.c:527 sub_remove+0x87/0x1b0()
idr_remove called for id=6 which is not allocated.
...
Call Trace:
 [] dump_stack+0x7a/0x96
 [] warn_slowpath_common+0x8c/0xc0
 [] warn_slowpath_fmt+0x46/0x50
 [] sub_remove+0x87/0x1b0
 [] ? css_killed_work_fn+0x32/0x1b0
 [] idr_remove+0x25/0xd0
 [] cgroup_destroy_css_killed+0x5b/0xc0
 [] css_killed_work_fn+0x130/0x1b0
 [] process_one_work+0x26c/0x550
 [] worker_thread+0x12e/0x3b0
 [] kthread+0xe6/0xf0
 [] ret_from_fork+0x7c/0xb0
---[ end trace 2d1577ec10cf80d0 ]---

It's because allocating/removing cgroup ID is not properly synchronized.

The bug was introduced when we converted cgroup_ida to cgroup_idr.
While synchronization is already done inside ida_simple_{get,remove}(),
users are responsible for concurrent calls to idr_{alloc,remove}().
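
In other words, a minimal sketch of the rule the fix enforces (assuming
cgroup_mutex as the chosen lock):

  mutex_lock(&cgroup_mutex);
  cgrp->id = idr_alloc(&root->cgroup_idr, NULL, 1, 0, GFP_KERNEL);
  mutex_unlock(&cgroup_mutex);
  ...
  mutex_lock(&cgroup_mutex);
  idr_remove(&root->cgroup_idr, cgrp->id);
  mutex_unlock(&cgroup_mutex);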

v2:
- Don't call deactivate_super() inside cgroup_mutex, as cgroup_kill_sb()
will be called if the sb refcnt reaches 0. I don't think this can happen,
as cgroup_create() is called through vfs, so vfs should guarantee the
superblock won't disappear. Still, better not to depend on it even if my
guess is probably correct.

Fixes: 4e96ee8e981b ("cgroup: convert cgroup_ida to cgroup_idr")
Reported-by: Michal Hocko 
Signed-off-by: Li Zefan 
---

Also rebased on "cgroup: fix error return from cgroup_create()".

---
 include/linux/cgroup.h |  2 ++
 kernel/cgroup.c| 47 ++-
 2 files changed, 28 insertions(+), 21 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 5c09759..9450f02 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -166,6 +166,8 @@ struct cgroup {
 	 *
 	 * The ID of the root cgroup is always 0, and a new cgroup
 	 * will be assigned with a smallest available ID.
+	 *
+	 * Allocating/Removing ID must be protected by cgroup_mutex.
 	 */
 	int id;
 
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 7be8fe4..37f217c 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -886,7 +886,9 @@ static void cgroup_diput(struct dentry *dentry, struct inode *inode)
 	 * per-subsystem and moved to css->id so that lookups are
 	 * successful until the target css is released.
 	 */
+	mutex_lock(&cgroup_mutex);
 	idr_remove(&cgrp->root->cgroup_idr, cgrp->id);
+	mutex_unlock(&cgroup_mutex);
 	cgrp->id = -1;
 
 	call_rcu(&cgrp->rcu_head, cgroup_free_rcu);
@@ -4174,15 +4176,12 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
 	}
 	rcu_assign_pointer(cgrp->name, name);
 
-	/*
-	 * Temporarily set the pointer to NULL, so idr_find() won't return
-	 * a half-baked cgroup.
-	 */
-	cgrp->id = idr_alloc(&root->cgroup_idr, NULL, 1, 0, GFP_KERNEL);
-	if (cgrp->id < 0) {
-		err = -ENOMEM;
-		goto err_free_name;
-	}
+	/* Grab a reference on the superblock so the hierarchy doesn't
+	 * get deleted on unmount if there are child cgroups.  This
+	 * can be done outside cgroup_mutex, since the sb can't
+	 * disappear while someone has an open control file on the
+	 * fs */
+	atomic_inc(&sb->s_active);
 
 	/*
 	 * Only live parents can have children.  Note that the liveliness
@@ -4193,15 +4192,18 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
 	 */
 	if (!cgroup_lock_live_group(parent)) {
 		err = -ENODEV;
-		goto err_free_id;
+		goto err_free_name;
 	}
 
-	/* Grab a reference on the superblock so the hierarchy doesn't
-	 * get deleted on unmount if there are child cgroups.  This
-	 * can be done outside cgroup_mutex, since the sb can't
-	 * disappear while someone has an open control file on the
-	 * fs */
-	atomic_inc(&sb->s_active);
+	/*
+	 * Temporarily set the pointer to NULL, so idr_find() won't return
+	 * a half-baked cgroup.
+	 */
+	cgrp->id = idr_alloc(&root->cgroup_idr, NULL, 1, 0, GFP_KERNEL);
+	if (cgrp->id < 0) {
+		err = -ENOMEM;
+		goto err_unlock;
+	}
 
init_cgroup_housekeeping(cgrp);
 
@@ -4225,7 +4227,7 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
 	 */
 	err = cgroup_create_file(dentry, S_IFDIR | mode, sb);
 	if (err < 0)
-		goto err_unlock;
+		goto err_free_id;

Re: [PATCH v2] cgroup: protect modifications to cgroup->idr with cgroup_mutex

2014-02-11 Thread Li Zefan
On 2014/2/12 14:37, Tejun Heo wrote:
> Hello, Li.
> 
> On Wed, Feb 12, 2014 at 02:28:53PM +0800, Li Zefan wrote:
>> v2:
>> - Don't call deactivate_super() inside cgroup_mutex, as cgroup_kill_sb()
>> will be called if the sb refcnt reaches 0. I don't think this can happen,
>> as cgroup_create() is called through vfs, so vfs should guarantee the
>> superblock won't disappear. Still, better not to depend on it even if my
>> guess is probably correct.
> 
> If the deadlock can't actually happen, I don't really care either way
> as the code goes away after kernfs conversion anyway.  I've already
> applied v1, so if you think this change is important, can you send an
> incremental patch?
> 

I'm fine with sticking with v1.

I'm pretty confident it's safe, as we can increment the sb refcnt without
any checking or locking (without even cgroup_mutex, as the comment says) in
cgroup_create().



Re: [PATCH 6/8] cgroup: remove cgroup->name

2014-02-11 Thread Li Zefan
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index dd52e7f..30eee3b 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -111,8 +111,7 @@ static char *task_group_path(struct task_group *tg)
>   if (autogroup_path(tg, group_path, PATH_MAX))
>   return group_path;
>  
> - cgroup_path(tg->css.cgroup, group_path, PATH_MAX);
> - return group_path;
> + return cgroup_path(tg->css.cgroup, group_path, PATH_MAX);

The caller doesn't check the return value but passes it to printk/seq_printf,
and now cgroup_path() might return -ENAMETOOLONG...



Re: [PATCH 6/8] cgroup: remove cgroup->name

2014-02-12 Thread Li Zefan
On 2014/2/12 16:27, Tejun Heo wrote:
> On Wed, Feb 12, 2014 at 03:52:20PM +0800, Li Zefan wrote:
>> The caller won't check the return value but pass it to printk/seq_printf,
>> but now cgroup_path() might return -ENAMETOOLONG..
> 
> cgroup_path() returns NULL on overflow which printk handles fine, no?
> 

Ah, right. My mistake. I didn't take an afternoon nap, and was a bit sleepy.



Re: [PATCHSET v2 cgroup/for-3.15] cgroup: cleanups after kernfs conversion

2014-02-12 Thread Li Zefan
On 2014/2/9 0:38, Tejun Heo wrote:
> Hello,
> 
> This is v2 of cleanups-after-kernfs-conversion patchset.  Nothing
> really changed since the last take[L]. It just got rebased on top of
> the updated patches.
> 
> This patchset does a number of cleanups which are possible now that
> cgroup is converted to kernfs.  This patchset contains the following
> eight patches.
> 
>  0001-cgroup-warn-if-xattr-is-specified-with-sane_behavior.patch
>  0002-cgroup-relocate-cgroup_rm_cftypes.patch
>  0003-cgroup-remove-cftype_set.patch
>  0004-cgroup-simplify-dynamic-cftype-addition-and-removal.patch
>  0005-cgroup-make-cgroup-hold-onto-its-kernfs_node.patch
>  0006-cgroup-remove-cgroup-name.patch
>  0007-cgroup-rename-cgroupfs_root-number_of_cgroups-to-nr_.patch
>  0008-cgroup-remove-cgroupfs_root-refcnt.patch
> 

Acked-by: Li Zefan 



Re: [PATCH] cgroup: protect modifications to cgroup_idr with cgroup_mutex

2014-02-12 Thread Li Zefan
On 2014/2/12 17:12, Michal Hocko wrote:
> Li has pointed out that my previous backport was not correct because
> err_unlock label releases a reference to the superblock which was not taken
> before idr_alloc. I've also removed cgroup_mutex from free_css_id as per
> Li.
> Fixed in this version.
> 

Looks good to me!

Are you going to send it to the stable mailing list after the fix hits mainline,
or use it internally?

It has small conflicts with this patch that will also be backported to 3.12:

https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git/commit/kernel/cgroup.c?h=for-3.14-fixes=b58c89986a77a23658682a100eb15d8edb571ebb



Re: [PATCHSET cgroup/for-3.15] cgroup: more cleanups

2014-02-12 Thread Li Zefan
On 2014/2/9 21:52, Tejun Heo wrote:
> Hello,
> 
> This series contains assorted cleanups which also prepare for the
> planned migration taskset handling update.
> 
> This patchset contains the following sixteen patches.
> 
>  0001-cgroup-disallow-xattr-release_agent-and-name-if-sane.patch
>  0002-cgroup-drop-CGRP_ROOT_SUBSYS_BOUND.patch
>  0003-cgroup-enable-task_cg_lists-on-the-first-cgroup-moun.patch
>  0004-cgroup-relocate-cgroup_enable_task_cg_lists.patch
>  0005-cgroup-implement-cgroup_has_tasks-and-unexport-cgrou.patch
>  0006-cgroup-reimplement-cgroup_transfer_tasks-without-usi.patch
>  0007-cgroup-make-css_set_lock-a-rwsem-and-rename-it-to-cs.patch
>  0008-cpuset-use-css_task_iter_start-next-end-instead-of-c.patch
>  0009-cgroup-remove-css_scan_tasks.patch
>  0010-cgroup-separate-out-put_css_set_locked-and-remove-pu.patch
>  0011-cgroup-move-css_set_rwsem-locking-outside-of-cgroup_.patch
>  0012-cgroup-drop-skip_css-from-cgroup_taskset_for_each.patch
>  0013-cpuset-don-t-use-cgroup_taskset_cur_css.patch
>  0014-cgroup-remove-cgroup_taskset_cur_css-and-cgroup_task.patch
>  0015-cgroup-cosmetic-updates-to-cgroup_attach_task.patch
>  0016-cgroup-unexport-functions.patch
> 
> The notables ones are
> 
>  0003-0004 move task_cg_list enabling to the first mount instead of
>  the first css task iteration.
> 
>  0005-0009 make css_set_lock a rwsem so that css_task_iter allows
>  blocking during iteration and removes css_scan_tasks().
> 
>  0010-0015 clean up migration path to prepare for the planned
>  migration taskset handling update.
> 

Acked-by: Li Zefan 



Re: [PATCH 09/16] cgroup: remove css_scan_tasks()

2014-02-12 Thread Li Zefan
On 2014/2/9 21:52, Tejun Heo wrote:
> css_scan_tasks() doesn't have any user left.  Remove it.
> 

I've always disliked css_scan_tasks().



Re: [PATCH cgroup/for-3.14-fixes] cgroup: update cgroup_enable_task_cg_lists() to grab siglock

2014-02-13 Thread Li Zefan
On 2014/2/14 2:29, Tejun Heo wrote:
> Currently, there's nothing preventing cgroup_enable_task_cg_lists()
> from missing set PF_EXITING and race against cgroup_exit().  Depending
> on the timing, cgroup_exit() may finish with the task still linked on
> css_set leading to list corruption.  Fix it by grabbing siglock in
> cgroup_enable_task_cg_lists() so that PF_EXITING is guaranteed to be
> visible.
> 
> This whole on-demand cg_list optimization is extremely fragile and has
> ample possibility to lead to bugs which can cause things like
> once-a-year oops during boot.

I added the PF_EXITING check years ago:

commit 0e04388f0189fa1f6812a8e1cb6172136eada87e
Author: Li Zefan 
Date:   Thu Apr 17 11:37:15 2008 +0800

cgroup: fix a race condition in manipulating tsk->cg_list

Now the only race I see is caused by checking tsk->cg_list without locking
in cgroup_exit():

cgroup_enable_task_cg_lists()                 cgroup_exit()
  ...
  if (!(p->flags & PF_EXITING) &&
      list_empty(&p->cg_list))
                                              exit_signal(tsk)  <-- sets PF_EXITING
                                              ...
                                              if (!list_empty(&tsk->cg_list)) {
                                                      down_write(&css_set_rwsem);
                                                      if (!list_empty(&tsk->cg_list))
                                                              list_del_init(&tsk->cg_list);
                                                      up_write(&css_set_rwsem);
                                              }
  list_add(&p->cg_list, ...);

Your patch can fix this race, but after diving into the code I don't think
the race exists, because exit_mm() takes task_lock, exit_mm() is called
after exit_signal() and before cgroup_exit(), and task_lock is also taken
by cgroup_enable_task_cg_lists().
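
The ordering I'm relying on looks roughly like this (a sketch of do_exit();
details may vary between kernel versions):

  do_exit()
    exit_signals(tsk);   /* sets PF_EXITING */
    ...
    exit_mm(tsk);        /* takes and releases task_lock(tsk) */
    ...
    cgroup_exit(tsk, 1); /* runs after the task_lock section above */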

I totally agree the code is fragile and that we should take your patch. I just
want to make it clear whether the bug exists in real life or not, so that we
can write a better changelog, decide whether to queue the patch for 3.14 or
3.15, and decide whether to mark it for stable.

> I'm wondering whether the better
> approach would be just adding "cgroup_disable=all" handling which
> disables the whole cgroup rather than tempting fate with this
> on-demand craziness.
> 

:)

> Signed-off-by: Tejun Heo 
> Cc: sta...@vger.kernel.org
> ---
>  kernel/cgroup.c |5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 68d8710..105f273 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -2905,9 +2905,14 @@ static void cgroup_enable_task_cg_lists(void)
>  	 * We should check if the process is exiting, otherwise
>  	 * it will race with cgroup_exit() in that the list
>  	 * entry won't be deleted though the process has exited.
> +	 * Do it while holding siglock so that we don't end up
> +	 * racing against cgroup_exit().
>  	 */
> +	spin_lock_irq(&p->sighand->siglock);
>  	if (!(p->flags & PF_EXITING) && list_empty(&p->cg_list))
>  		list_add(&p->cg_list, &task_css_set(p)->tasks);
> +	spin_unlock_irq(&p->sighand->siglock);
> +
>  	task_unlock(p);
>  } while_each_thread(g, p);
>  read_unlock(&tasklist_lock);
> 



[PATCH 2/3] cgroup: fix memory leak in cgroup_mount()

2014-02-14 Thread Li Zefan
We should free the memory allocated in parse_cgroupfs_options() before
calling this function again.

Signed-off-by: Li Zefan 
---
 kernel/cgroup.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 0e794ca..37d94a2 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1540,6 +1540,8 @@ retry:
 	if (!atomic_inc_not_zero(&root->top_cgroup.refcnt)) {
 		mutex_unlock(&cgroup_mutex);
 		mutex_unlock(&cgroup_tree_mutex);
+		kfree(opts.release_agent);
+		kfree(opts.name);
 		msleep(10);
 		goto retry;
 	}
-- 
1.8.0.2



[PATCH 1/3] cgroup: fix locking in cgroupstats_build()

2014-02-14 Thread Li Zefan
css_set_lock has been converted to css_set_rwsem, and a rwsem can't be
nested inside rcu_read_lock().
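
For the record, the problematic nesting looked like this (sketch); down_read()
may sleep, which isn't allowed inside an RCU read-side critical section:

  rcu_read_lock();
  down_read(&css_set_rwsem);   /* may sleep -> suspicious RCU usage splat */
  ...
  up_read(&css_set_rwsem);
  rcu_read_unlock();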

Signed-off-by: Li Zefan 
---
 kernel/cgroup.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index af4ecc3..0e794ca 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2997,6 +2997,8 @@ int cgroupstats_build(struct cgroupstats *stats, struct dentry *dentry)
 	    kernfs_type(kn) != KERNFS_DIR)
 		return -EINVAL;
 
+	mutex_lock(&cgroup_mutex);
+
 	/*
 	 * We aren't being called from kernfs and there's no guarantee on
 	 * @kn->priv's validity.  For this and css_tryget_from_dir(),
@@ -3004,10 +3006,12 @@ int cgroupstats_build(struct cgroupstats *stats, struct dentry *dentry)
 	 */
 	rcu_read_lock();
 	cgrp = rcu_dereference(kn->priv);
-	if (!cgrp) {
+	if (!cgrp || cgroup_is_dead(cgrp)) {
 		rcu_read_unlock();
+		mutex_unlock(&cgroup_mutex);
 		return -ENOENT;
 	}
+	rcu_read_unlock();
 
 	css_task_iter_start(&cgrp->dummy_css, &it);
 	while ((tsk = css_task_iter_next(&it))) {
@@ -3032,7 +3036,7 @@ int cgroupstats_build(struct cgroupstats *stats, struct dentry *dentry)
 	}
 	css_task_iter_end(&it);
 
-	rcu_read_unlock();
+	mutex_unlock(&cgroup_mutex);
 	return 0;
 }
 
-- 
1.8.0.2


[PATCH 3/3] cgroup: deal with dummy_top in cgroup_name() and cgroup_path()

2014-02-14 Thread Li Zefan
My kernel fails to boot, because blkcg calls cgroup_path() while
cgroupfs is not mounted.

Fix both cgroup_name() and cgroup_path().

Signed-off-by: Li Zefan 
---
 include/linux/cgroup.h | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index ef0b3af..8c283a9 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -487,13 +487,21 @@ struct cgroup_subsys_state *seq_css(struct seq_file *seq);
 
 static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
 {
-   return kernfs_name(cgrp->kn, buf, buflen);
+   /* dummy_top doesn't have a kn associated */
+   if (cgrp->kn)
+   return kernfs_name(cgrp->kn, buf, buflen);
+   else
+   return strlcpy(buf, "/", buflen);
 }
 
 static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
  size_t buflen)
 {
-   return kernfs_path(cgrp->kn, buf, buflen);
+   /* dummy_top doesn't have a kn associated */
+   if (cgrp->kn)
+   return kernfs_path(cgrp->kn, buf, buflen);
+   strlcpy(buf, "/", buflen);
+   return (buflen <= 2) ? NULL : buf;
 }
 
 static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
-- 
1.8.0.2



[PATCH] kernfs: fix kernfs_node_from_dentry()

2014-02-14 Thread Li Zefan
Currently kernfs_node_from_dentry() returns NULL for root dentry,
because root_dentry->d_op == NULL.

Due to this bug cgroupstats_build() returns -EINVAL for root cgroup.

  # mount -t cgroup -o cpuacct xxx /cgroup
  # Documentation/accounting/getdelays -C /cgroup
  fatal reply error,  errno -22

With this fix:

  # Documentation/accounting/getdelays -C /cgroup
  sleeping 305, blocked 0, running 1, stopped 0, uninterruptible 1

Signed-off-by: Li Zefan 
---
 fs/kernfs/dir.c | 2 +-
 fs/kernfs/kernfs-internal.h | 1 +
 fs/kernfs/mount.c   | 2 +-
 3 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 939684e..300f57b 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -501,7 +501,7 @@ const struct dentry_operations kernfs_dops = {
  */
 struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry)
 {
-	if (dentry->d_op == &kernfs_dops)
+	if (dentry->d_sb->s_op == &kernfs_sops)
 		return dentry->d_fsdata;
 	return NULL;
 }
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index a91d7a1..8be13b2 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -65,6 +65,7 @@ struct kernfs_super_info {
 };
 #define kernfs_info(SB) ((struct kernfs_super_info *)(SB->s_fs_info))
 
+extern const struct super_operations kernfs_sops;
 extern struct kmem_cache *kernfs_node_cache;
 
 /*
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index e5b28b0..405279b 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -39,7 +39,7 @@ static int kernfs_sop_show_options(struct seq_file *sf, struct dentry *dentry)
return 0;
 }
 
-static const struct super_operations kernfs_sops = {
+const struct super_operations kernfs_sops = {
.statfs = simple_statfs,
.drop_inode = generic_delete_inode,
.evict_inode= kernfs_evict_inode,
-- 
1.8.0.2



[PATCH] cgroup: fix top cgroup refcnt leak

2014-02-14 Thread Li Zefan
If we mount the same cgroupfs at several mount points, and then
umount all of them, kill_sb() will be called only once.

Therefore it's wrong to increment top_cgroup's refcnt when we find
an existing cgroup_root.

Try:
# mount -t cgroup -o cpuacct xxx /cgroup
# mount -t cgroup -o cpuacct xxx /cgroup2
# cat /proc/cgroups | grep cpuacct
cpuacct 2   1   1
# umount /cgroup
# umount /cgroup2
# cat /proc/cgroups | grep cpuacct
cpuacct 2   1   1

You'll see cgroupfs will never be freed.
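
The imbalance, step by step (a sketch):

  mount #1: sb and cgroup_root allocated        -> refcnt = 1
  mount #2: existing root found,
            atomic_inc_not_zero()               -> refcnt = 2
  umount both: kill_sb() runs only once         -> refcnt = 1, never reaches 0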

Also move this chunk of code upwards.

Signed-off-by: Li Zefan 
---
 kernel/cgroup.c | 32 
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 37d94a2..5bfe738 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1498,6 +1498,22 @@ retry:
 		bool name_match = false;
 
 		/*
+		 * A root's lifetime is governed by its top cgroup.  Zero
+		 * ref indicate that the root is being destroyed.  Wait for
+		 * destruction to complete so that the subsystems are free.
+		 * We can use wait_queue for the wait but this path is
+		 * super cold.  Let's just sleep for a bit and retry.
+		 */
+		if (!atomic_read(&root->top_cgroup.refcnt)) {
+			mutex_unlock(&cgroup_mutex);
+			mutex_unlock(&cgroup_tree_mutex);
+			kfree(opts.release_agent);
+			kfree(opts.name);
+			msleep(10);
+			goto retry;
+		}
+
+		/*
 		 * If we asked for a name then it must match.  Also, if
 		 * name matches but sybsys_mask doesn't, we should fail.
 		 * Remember whether name matched.
@@ -1530,22 +1546,6 @@ retry:
}
}
 
-		/*
-		 * A root's lifetime is governed by its top cgroup.  Zero
-		 * ref indicate that the root is being destroyed.  Wait for
-		 * destruction to complete so that the subsystems are free.
-		 * We can use wait_queue for the wait but this path is
-		 * super cold.  Let's just sleep for a bit and retry.
-		 */
-		if (!atomic_inc_not_zero(&root->top_cgroup.refcnt)) {
-			mutex_unlock(&cgroup_mutex);
-			mutex_unlock(&cgroup_tree_mutex);
-			kfree(opts.release_agent);
-			kfree(opts.name);
-			msleep(10);
-			goto retry;
-		}
-
-
ret = 0;
goto out_unlock;
}
-- 
1.8.0.2


Re: [PATCH] cgroup: fix top cgroup refcnt leak

2014-02-14 Thread Li Zefan
On 2014/02/14 17:36, Li Zefan wrote:
> If we mount the same cgroupfs at several mount points, and then
> umount all of them, kill_sb() will be called only once.
> 
> Therefore it's wrong to increment top_cgroup's refcnt when we find
> an existing cgroup_root.
> 
> Try:
>   # mount -t cgroup -o cpuacct xxx /cgroup
>   # mount -t cgroup -o cpuacct xxx /cgroup2
>   # cat /proc/cgroups | grep cpuacct
>   cpuacct 2   1   1
>   # umount /cgroup
>   # umount /cgroup2
>   # cat /proc/cgroups | grep cpuacct
>   cpuacct 2   1   1
> 
> You'll see cgroupfs will never be freed.
> 
> Also move this chunk of code upwards.
> 
> Signed-off-by: Li Zefan 
> ---
>  kernel/cgroup.c | 32 
>  1 file changed, 16 insertions(+), 16 deletions(-)
> 
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 37d94a2..5bfe738 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1498,6 +1498,22 @@ retry:
>   bool name_match = false;
>  
>   /*
> +  * A root's lifetime is governed by its top cgroup.  Zero
> +  * ref indicate that the root is being destroyed.  Wait for
> +  * destruction to complete so that the subsystems are free.
> +  * We can use wait_queue for the wait but this path is
> +  * super cold.  Let's just sleep for a bit and retry.
> +  */
> +	if (!atomic_read(&root->top_cgroup.refcnt)) {

oops, this fix is wrong. We call kernfs_mount() without cgroup locks, and it
drops the cgroup refcnt if it fails.

I guess we need to bump the refcnt and then drop it after kernfs_mount().
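
Something like this, perhaps (a hypothetical sketch only, not a tested patch;
cgroup_put() here stands in for whatever helper ends up dropping the ref):

  atomic_inc(&root->top_cgroup.refcnt);   /* pin the root across kernfs_mount() */
  dentry = kernfs_mount(&cgroup_fs_type, flags, root->kf_root);
  cgroup_put(&root->top_cgroup);          /* kernfs_mount() has already dropped
                                             its own reference on failure */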


[PATCH 1/3] jffs2: fix unbalanced locking

2014-02-07 Thread Li Zefan
This was found at runtime by our internal debugging feature, but this
bug won't lead to deadlock, as the structure that this lock is embedded
in is freed on error.

Cc: 
Signed-off-by: Li Zefan 
---
 fs/jffs2/readinode.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/jffs2/readinode.c b/fs/jffs2/readinode.c
index 386303d..8261021 100644
--- a/fs/jffs2/readinode.c
+++ b/fs/jffs2/readinode.c
@@ -1143,6 +1143,7 @@ static int jffs2_do_read_inode_internal(struct jffs2_sb_info *c,
 		JFFS2_ERROR("cannot read nodes for ino %u, returned error is %d\n", f->inocache->ino, ret);
 		if (f->inocache->state == INO_STATE_READING)
 			jffs2_set_inocache_state(c, f->inocache, INO_STATE_CHECKEDABSENT);
+		mutex_unlock(&f->sem);
 		return ret;
 	}
 
@@ -1159,6 +1160,7 @@ static int jffs2_do_read_inode_internal(struct jffs2_sb_info *c,
 		jffs2_free_tmp_dnode_info(rii.mdata_tn);
 		rii.mdata_tn = NULL;
 	}
+	mutex_unlock(&f->sem);
 	return ret;
 }
 
@@ -1183,6 +1185,7 @@ static int jffs2_do_read_inode_internal(struct jffs2_sb_info *c,
 			if (!rii.fds) {
 				if (f->inocache->state == INO_STATE_READING)
 					jffs2_set_inocache_state(c, f->inocache, INO_STATE_CHECKEDABSENT);
+				mutex_unlock(&f->sem);
 				return -EIO;
 			}
 			JFFS2_NOTICE("but it has children so we fake some modes for it\n");
-- 
1.8.0.2



[PATCH 2/3] jffs2: avoid soft-lockup in jffs2_reserve_space_gc()

2014-02-07 Thread Li Zefan
We triggered a soft lockup under stress testing on a 2.6.34 kernel.

BUG: soft lockup - CPU#1 stuck for 60009ms! [lockf2.test:14488]
...
[] (jffs2_do_reserve_space+0x420/0x440 [jffs2])
[] (jffs2_reserve_space_gc+0x34/0x78 [jffs2])
[] (jffs2_garbage_collect_dnode.isra.3+0x264/0x478 [jffs2])
[] (jffs2_garbage_collect_pass+0x9c0/0xe4c [jffs2])
[] (jffs2_reserve_space+0x104/0x2a8 [jffs2])
[] (jffs2_write_inode_range+0x5c/0x4d4 [jffs2])
[] (jffs2_write_end+0x198/0x2c0 [jffs2])
[] (generic_file_buffered_write+0x158/0x200)
[] (__generic_file_aio_write+0x3a4/0x414)
[] (generic_file_aio_write+0x5c/0xbc)
[] (do_sync_write+0x98/0xd4)
[] (vfs_write+0xa8/0x150)
[] (sys_write+0x3c/0xc0)]

Fix this by adding a cond_resched() in the while loop.

Cc: 
Signed-off-by: Li Zefan 
---
 fs/jffs2/nodemgmt.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/fs/jffs2/nodemgmt.c b/fs/jffs2/nodemgmt.c
index 0331072..fb30161 100644
--- a/fs/jffs2/nodemgmt.c
+++ b/fs/jffs2/nodemgmt.c
@@ -216,15 +216,20 @@ int jffs2_reserve_space_gc(struct jffs2_sb_info *c, uint32_t minsize,
 
 	jffs2_dbg(1, "%s(): Requested 0x%x bytes\n", __func__, minsize);
 
-	spin_lock(&c->erase_completion_lock);
-	while(ret == -EAGAIN) {
+	while (true) {
+		spin_lock(&c->erase_completion_lock);
 		ret = jffs2_do_reserve_space(c, minsize, len, sumsize);
 		if (ret) {
 			jffs2_dbg(1, "%s(): looping, ret is %d\n",
 				  __func__, ret);
 		}
+		spin_unlock(&c->erase_completion_lock);
+
+		if (ret == -EAGAIN)
+			cond_resched();
+		else
+			break;
 	}
-	spin_unlock(&c->erase_completion_lock);
 	if (!ret)
 		ret = jffs2_prealloc_raw_node_refs(c, c->nextblock, 1);
 
-- 
1.8.0.2



[PATCH 3/3] jffs2: remove wait queue after schedule()

2014-02-07 Thread Li Zefan
@wait is a local variable, so if we don't remove it from the wait queue
list, a later wake_up() may end up accessing invalid memory.

This was spotted by visual inspection.
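
For context, the pattern the fix restores is the canonical one (a sketch; the
surrounding jffs2 error handling is elided):

  DECLARE_WAITQUEUE(wait, current);

  set_current_state(TASK_UNINTERRUPTIBLE);
  add_wait_queue(&c->erase_wait, &wait);
  spin_unlock(&c->erase_completion_lock);

  schedule();
  remove_wait_queue(&c->erase_wait, &wait);  /* the missing step: unlink the
                                                on-stack entry before it goes away */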

Cc: 
Signed-off-by: Li Zefan 
---
 fs/jffs2/nodemgmt.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/jffs2/nodemgmt.c b/fs/jffs2/nodemgmt.c
index fb30161..610a22c 100644
--- a/fs/jffs2/nodemgmt.c
+++ b/fs/jffs2/nodemgmt.c
@@ -179,6 +179,7 @@ int jffs2_reserve_space(struct jffs2_sb_info *c, uint32_t minsize,
 			spin_unlock(&c->erase_completion_lock);
 
 			schedule();
+			remove_wait_queue(&c->erase_wait, &wait);
 		} else
 			spin_unlock(&c->erase_completion_lock);
 	} else if (ret)
-- 
1.8.0.2



Re: Attaching a cgroup subsystem to multiple hierarchies

2014-02-07 Thread Li Zefan
(Add Michal back to the Cc list, and Cc cgroup mailing list)

On 2014/2/7 17:21, Glyn Normington wrote:
> Hi Michal
> 
> On 6 Feb 2014, at 18:59, Michal Hocko  wrote:
> 
>> On Wed 05-02-14 14:39:52, Glyn Normington wrote:
>>> Reading cgroups.txt and casting around the net leads me to believe
>>> that it is possible to attach a cgroup subsystem (e.g. memory) to
>>> multiple hierarchies, but this seems to result in “mirrored”
>>> hierarchies which are automatically kept in step with each other -
>>> essentially it looks like the same hierarchy at multiple file system
>>> paths.
>>>
>>> Take the following interaction for example:
>>>
>>> $ pwd   
>>> /home/vagrant
>>> $ mkdir mem1
>>> $ mkdir mem2
>>> $ sudo su
>>> # mount -t cgroup -o memory none /home/vagrant/mem1
>>> # mount -t cgroup -o memory none /home/vagrant/mem2
>>> # cd mem1
>>> # mkdir inst1  
>>> # ls inst1 
>>> cgroup.clone_children  memory.failcnt ...
>>> # ls ../mem2
>>> cgroup.clone_children  inst1 memory.limit_in_bytes ...
>>> # cd inst1
>>> # echo 100 > memory.limit_in_bytes 
>>> # cat memory.limit_in_bytes 
>>> 1003520
>>> # cat ../../mem2/inst1/memory.limit_in_bytes 
>>> 1003520
>>> # echo $$ > tasks
>>> # cat tasks
>>> 1365
>>> 1409
>>> # cat ../../mem2/inst1/tasks
>>> 1365
>>> 1411
>>>
>>> Is this working as intended?
>>
>> Yes, it doesn't make any sense to have two different views on the same
>> controllers.
> 
> Then wouldn’t it be better for the second mount to fail?
> 

We don't disallow mounting procfs/sysfs at more than one mount point.
Why would we want to do that for cgroupfs?

>>
>>> Is there some other way to attach a subsystem to *distinct*
>>> hierarchies?
>>
>> No.
>>
>>> Distinct hierarchies would allow distinct cgroups, distinct settings
>>> (e.g. memory.limit_in_bytes) and distinct partitions of the tasks in
>>> the system.
>>
>> Which one should be applied then?
> 
> Good question. All of them, I would say: the constraints due to distinct 
> settings would be ANDed together.
> 
> The implementation would be more complex and less efficient as a subsystem's 
> resources consumed by a process would need charging against each hierarchy to 
> which the subsystem was attached.
> 
> I very much doubt this would be worth implementing and I’m not at all 
> suggesting it.
> 

Don't even think about it. :)

>>
>>>
>>> Note: I don’t have a good use for this function - I’m simply
>>> trying to reverse engineer the semantics of cgroups to get a precise
>>> understanding.
>>
>> I think there is no need to reverse engineer ;)
>> Documentation/cgroups/cgroups.txt in the kernel tree does give a decent
>> description IMO.
> 
> I disagree. For example, cgroups.txt does not clearly state whether or not a 
> single subsystem may be attached to distinct hierarchies.
> 
> This seems to have caused confusion elsewhere. For example, Red Hat write “… 
> a single subsystem can be attached to two hierarchies if both of those 
> hierarchies have only that subsystem attached.” ([1]).
> 

No documentation is perfect, but you can make it better by sending us
a patch.



[PATCH 1/2] sched/rt: Remove leaf_rt_rq_list from struct rq

2014-02-07 Thread Li Zefan
This is a leftover from commit e23ee74777f389369431d77390c4b09332ce026a
("sched/rt: Simplify pull_rt_task() logic and remove .leaf_rt_rq_list").

Signed-off-by: Li Zefan 
---
 kernel/sched/core.c  | 1 -
 kernel/sched/sched.h | 4 
 2 files changed, 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a88f4a4..1f7d88a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6277,7 +6277,6 @@ void __init sched_init(void)
 
 		rq->rt.rt_runtime = def_rt_bandwidth.rt_runtime;
 #ifdef CONFIG_RT_GROUP_SCHED
-		INIT_LIST_HEAD(&rq->leaf_rt_rq_list);
 		init_tg_rt_entry(&root_task_group, &rq->rt, NULL, i, NULL);
 #endif
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 88c85b2..1c83665 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -438,10 +438,6 @@ struct rq {
struct list_head leaf_cfs_rq_list;
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
-#ifdef CONFIG_RT_GROUP_SCHED
-   struct list_head leaf_rt_rq_list;
-#endif
-
/*
 * This is part of a global counter where only the total sum
 * over all CPUs matters. A task can increase this counter on
-- 
1.8.0.2


[PATCH 2/2] sched/rt: Make init_sched_rt_class() __init

2014-02-07 Thread Li Zefan
It's a bootstrap function.

Signed-off-by: Li Zefan 
---
 kernel/sched/rt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 1c40655..e45ec9a 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1825,7 +1825,7 @@ static void switched_from_rt(struct rq *rq, struct 
task_struct *p)
resched_task(rq->curr);
 }
 
-void init_sched_rt_class(void)
+void __init init_sched_rt_class(void)
 {
unsigned int i;
 
-- 
1.8.0.2


[PATCH] cgroup: protect modifications to cgroup_idr with cgroup_mutex

2014-02-11 Thread Li Zefan
Setup cgroupfs like this:
  # mount -t cgroup -o cpuacct xxx /cgroup
  # mkdir /cgroup/sub1
  # mkdir /cgroup/sub2

Then run these two commands:
  # for ((; ;)) { mkdir /cgroup/sub1/tmp && rmdir /cgroup/sub1/tmp; } &
  # for ((; ;)) { mkdir /cgroup/sub2/tmp && rmdir /cgroup/sub2/tmp; } &

After seconds you may see this warning:

[ cut here ]
WARNING: CPU: 1 PID: 25243 at lib/idr.c:527 sub_remove+0x87/0x1b0()
idr_remove called for id=6 which is not allocated.
...
Call Trace:
 [] dump_stack+0x7a/0x96
 [] warn_slowpath_common+0x8c/0xc0
 [] warn_slowpath_fmt+0x46/0x50
 [] sub_remove+0x87/0x1b0
 [] ? css_killed_work_fn+0x32/0x1b0
 [] idr_remove+0x25/0xd0
 [] cgroup_destroy_css_killed+0x5b/0xc0
 [] css_killed_work_fn+0x130/0x1b0
 [] process_one_work+0x26c/0x550
 [] worker_thread+0x12e/0x3b0
 [] kthread+0xe6/0xf0
 [] ret_from_fork+0x7c/0xb0
---[ end trace 2d1577ec10cf80d0 ]---

It's because allocating/removing cgroup ID is not properly synchronized.

The bug was introduced when we converted cgroup_ida to cgroup_idr.
While synchronization is already done inside ida_simple_{get,remove}(),
users are responsible for concurrent calls to idr_{alloc,remove}().

Fixes: 4e96ee8e981b ("cgroup: convert cgroup_ida to cgroup_idr")
Cc:  #3.12+
Reported-by: Michal Hocko 
Signed-off-by: Li Zefan 
---
 include/linux/cgroup.h |  2 ++
 kernel/cgroup.c| 30 --
 2 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 5c09759..9450f02 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -166,6 +166,8 @@ struct cgroup {
 	 *
 	 * The ID of the root cgroup is always 0, and a new cgroup
 	 * will be assigned with a smallest available ID.
+	 *
+	 * Allocating/Removing ID must be protected by cgroup_mutex.
 	 */
 	int id;
 
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index e2f46ba..4bbaadd 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -886,7 +886,9 @@ static void cgroup_diput(struct dentry *dentry, struct inode *inode)
 	 * per-subsystem and moved to css->id so that lookups are
 	 * successful until the target css is released.
 	 */
+	mutex_lock(&cgroup_mutex);
 	idr_remove(&cgrp->root->cgroup_idr, cgrp->id);
+	mutex_unlock(&cgroup_mutex);
 	cgrp->id = -1;
 
 	call_rcu(&cgrp->rcu_head, cgroup_free_rcu);
@@ -4173,14 +4175,6 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
 	rcu_assign_pointer(cgrp->name, name);
 
 	/*
-	 * Temporarily set the pointer to NULL, so idr_find() won't return
-	 * a half-baked cgroup.
-	 */
-	cgrp->id = idr_alloc(&root->cgroup_idr, NULL, 1, 0, GFP_KERNEL);
-	if (cgrp->id < 0)
-		goto err_free_name;
-
-	/*
 	 * Only live parents can have children.  Note that the liveliness
 	 * check isn't strictly necessary because cgroup_mkdir() and
 	 * cgroup_rmdir() are fully synchronized by i_mutex; however, do it
@@ -4189,9 +4183,17 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
 	 */
 	if (!cgroup_lock_live_group(parent)) {
 		err = -ENODEV;
-		goto err_free_id;
+		goto err_free_name;
 	}
 
+	/*
+	 * Temporarily set the pointer to NULL, so idr_find() won't return
+	 * a half-baked cgroup.
+	 */
+	cgrp->id = idr_alloc(&root->cgroup_idr, NULL, 1, 0, GFP_KERNEL);
+	if (cgrp->id < 0)
+		goto err_unlock;
+
 	/* Grab a reference on the superblock so the hierarchy doesn't
 	 * get deleted on unmount if there are child cgroups.  This
 	 * can be done outside cgroup_mutex, since the sb can't
@@ -4221,7 +4223,7 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
 	 */
 	err = cgroup_create_file(dentry, S_IFDIR | mode, sb);
 	if (err < 0)
-		goto err_unlock;
+		goto err_free_id;
 	lockdep_assert_held(&dentry->d_inode->i_mutex);
 
 	cgrp->serial_nr = cgroup_serial_nr_next++;
@@ -4257,12 +4259,12 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
 
 	return 0;
 
-err_unlock:
-	mutex_unlock(&cgroup_mutex);
-	/* Release the reference count that we took on the superblock */
-	deactivate_super(sb);
 err_free_id:
 	idr_remove(&root->cgroup_idr, cgrp->id);
+	/* Release the reference count that we took on the superblock */
+	deactivate_super(sb);
+err_unlock:
+	mutex_unlock(&cgroup_mutex);
 err_free_name:
 	kfree(rcu_dereference_raw(cgrp->name));
 err_free_cgrp:
-- 
1.8.0.2

Re: [PATCHSET cgroup/for-3.15] cgroup: convert to kernfs

2014-02-11 Thread Li Zefan
On 2014/1/29 7:54, Tejun Heo wrote:
> Hello,
> 
> cgroup filesystem code was derived from the original sysfs
> implementation which was heavily intertwined with vfs objects and
> locking with the goal of re-using the existing vfs infrastructure.
> That experiment turned out rather disastrous and sysfs switched, a
> long time ago, to distributed filesystem model where a separate
> representation is maintained which is queried by vfs.  Unfortunately,
> cgroup stuck with the failed experiment all these years and
> accumulated even more problems over time.
> 
> Locking and object lifetime management being entangled with vfs is
> probably the most egregious.  vfs was never designed to be misused
> like this and cgroup ends up jumping through various convoluted
> dancing to make things work.  Even then, operations across multiple
> cgroups can't be done safely as it'll deadlock with rename locking.
> 
> Recently, kernfs is separated out from sysfs so that it can be used by
> users other than sysfs.  This patchset converts cgroup to use kernfs.
> This simplifies code base quite a bit and allows further cleanups and
> implementation of features which depend on operating across multiple
> nodes.
> 
> This patchset contains the following 13 patches.
> 
>  0001-cgroup-improve-css_from_dir-into-css_tryget_from_dir.patch
>  0002-cgroup-introduce-cgroup_tree_mutex.patch
>  0003-cgroup-release-cgroup_mutex-over-file-removals.patch
>  0004-cgroup-restructure-locking-and-error-handling-in-cgr.patch
>  0005-cgroup-factor-out-cgroup_setup_root-from-cgroup_moun.patch
>  0006-cgroup-update-cgroup-name-handling.patch
>  0007-cgroup-make-cgroup_subsys-base_cftypes-use-cgroup_ad.patch
>  0008-cgroup-update-the-meaning-of-cftype-max_write_len.patch
>  0009-cgroup-introduce-cgroup_init-exit_cftypes.patch
>  0010-cgroup-introduce-cgroup_ino.patch
>  0011-cgroup-misc-preps-for-kernfs-conversion.patch
>  0012-cgroup-relocate-functions-in-preparation-of-kernfs-c.patch
>  0013-cgroup-convert-to-kernfs.patch
> 
> 0001 updates css_from_dir() into css_tryget_from_dir() so that the
> caller doesn't have to worry about RCU locking.
> 
> 0002-0004 introduce cgroup_tree_mutex and prepare locking.
> 
> 0005-0006 are misc preps.
> 
> 0007-0008 update cftype handling in preparation.
> 
> 0009-0012 are misc preps.
> 
> 0013 converts cgroup to kernfs.
> 
> This patchset is on top of
> 
>   linus#master 54c0a4b46150 ("Merge branch 'akpm' (incoming from Andrew)")
> + [1] [PATCHSET cgroup/for-3.14-fixes] cgroup: four misc fixes
> + [2] [PATCH] arm, pm, vmpressure: add missing slab.h includes
> + [3] [PATCHSET v2 cgroup/for-3.15] cgroup: drop module support and cgroup_root_mutex
> + git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git review-kernfs-cgroup-prep
> 
> and also available in the following git branch.
> 
>  git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-kernfs-conversion
> 
> diffstat follows.
> 
>  block/blk-throttle.c   |4
>  block/cfq-iosched.c|3
>  include/linux/cgroup.h |   71 --
>  kernel/cgroup.c| 1712 ++++++++-
>  kernel/events/core.c   |   17
>  mm/memcontrol.c|   16
>  mm/memory-failure.c|8
>  7 files changed, 754 insertions(+), 1077 deletions(-)
> 

Looks good to me!

Acked-by: Li Zefan 

Will look into remaining patchsets tomorrow.



[PATCH] hung_task: change sysctl_hung_task_check_count to int

2013-09-13 Thread Li Zefan
As sysctl_hung_task_check_count is unsigned long, when this value is
assigned to max_count in check_hung_uninterruptible_tasks(), it's
truncated to int type.

Therefore if we write 2^32 to sysctl.hung_task_check_count, hung task
detection will be effectively disabled.

Not a big deal, but still it's better to fix this inconsistency.
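
Roughly where the truncation happens (a sketch of the relevant lines in
kernel/hung_task.c):

  static void check_hung_uninterruptible_tasks(unsigned long timeout)
  {
          int max_count = sysctl_hung_task_check_count;  /* unsigned long
                                                            silently truncated */
          ...
  }

  /* echo 4294967296 > /proc/sys/kernel/hung_task_check_count
     => max_count becomes 0 and no task is ever checked */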

Signed-off-by: Li Zefan 
---
 include/linux/sched/sysctl.h | 2 +-
 kernel/hung_task.c   | 2 +-
 kernel/sysctl.c  | 4 ++--
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index bf8086b..9552afa 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -2,8 +2,8 @@
 #define _SCHED_SYSCTL_H
 
 #ifdef CONFIG_DETECT_HUNG_TASK
+extern int  sysctl_hung_task_check_count;
 extern unsigned int  sysctl_hung_task_panic;
-extern unsigned long sysctl_hung_task_check_count;
 extern unsigned long sysctl_hung_task_timeout_secs;
 extern unsigned long sysctl_hung_task_warnings;
 extern int proc_dohung_task_timeout_secs(struct ctl_table *table, int write,
diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index 6df6149..b5a5d42 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -19,7 +19,7 @@
 /*
  * The number of tasks checked:
  */
-unsigned long __read_mostly sysctl_hung_task_check_count = PID_MAX_LIMIT;
+int __read_mostly sysctl_hung_task_check_count = PID_MAX_LIMIT;
 
 /*
  * Limit number of tasks checked in a batch.
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 07f6fc4..5e5cc21 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -962,9 +962,9 @@ static struct ctl_table kern_table[] = {
{
.procname   = "hung_task_check_count",
 		.data		= &sysctl_hung_task_check_count,
-   .maxlen = sizeof(unsigned long),
+   .maxlen = sizeof(int),
.mode   = 0644,
-   .proc_handler   = proc_doulongvec_minmax,
+   .proc_handler   = proc_dointvec_minmax,
},
{
.procname   = "hung_task_timeout_secs",
-- 
1.8.0.2


Re: [PATCH] hung_task: change sysctl_hung_task_check_count to int

2013-09-13 Thread Li Zefan
Cc: Ingo, as this touched include/linux/sched/.

On 2013/9/14 10:47, Li Zefan wrote:
> As sysctl_hung_task_check_count is unsigned long, when this value is
> assigned to max_count in check_hung_uninterruptible_tasks(), it's
> truncated to int type.
> 
> Therefore if we write 2^32 to sysctl.hung_task_check_count, hung task
> detection will be effectively disabled.
> 
> Not a big deal, but still it's better to fix this inconsistency.
> 
> Signed-off-by: Li Zefan 
> ---
>  include/linux/sched/sysctl.h | 2 +-
>  kernel/hung_task.c   | 2 +-
>  kernel/sysctl.c  | 4 ++--
>  3 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
> index bf8086b..9552afa 100644
> --- a/include/linux/sched/sysctl.h
> +++ b/include/linux/sched/sysctl.h
> @@ -2,8 +2,8 @@
>  #define _SCHED_SYSCTL_H
>  
>  #ifdef CONFIG_DETECT_HUNG_TASK
>> +extern int	sysctl_hung_task_check_count;
>  extern unsigned int  sysctl_hung_task_panic;
> -extern unsigned long sysctl_hung_task_check_count;
>  extern unsigned long sysctl_hung_task_timeout_secs;
>  extern unsigned long sysctl_hung_task_warnings;
>  extern int proc_dohung_task_timeout_secs(struct ctl_table *table, int write,
> diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> index 6df6149..b5a5d42 100644
> --- a/kernel/hung_task.c
> +++ b/kernel/hung_task.c
> @@ -19,7 +19,7 @@
>  /*
>   * The number of tasks checked:
>   */
> -unsigned long __read_mostly sysctl_hung_task_check_count = PID_MAX_LIMIT;
> +int __read_mostly sysctl_hung_task_check_count = PID_MAX_LIMIT;
>  
>  /*
>   * Limit number of tasks checked in a batch.
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 07f6fc4..5e5cc21 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -962,9 +962,9 @@ static struct ctl_table kern_table[] = {
>   {
>   .procname   = "hung_task_check_count",
> 		.data		= &sysctl_hung_task_check_count,
> - .maxlen = sizeof(unsigned long),
> + .maxlen = sizeof(int),
>   .mode   = 0644,
> - .proc_handler   = proc_doulongvec_minmax,
> + .proc_handler   = proc_dointvec_minmax,
>   },
>   {
>   .procname   = "hung_task_timeout_secs",
> 



Re: [PATCH] hung_task: change sysctl_hung_task_check_count to int

2013-09-14 Thread Li Zefan
>> On 2013/9/14 10:47, Li Zefan wrote:
>>> As sysctl_hung_task_check_count is unsigned long, when this value is
>>> assigned to max_count in check_hung_uninterruptible_tasks(), it's
>>> truncated to int type.
>>>
>>> Therefore if we write 2^32 to sysctl.hung_task_check_count, hung task
>>> detection will be effectively disabled.
>>>
>>> Not a big deal, but still it's better to fix this inconsistency.
>>>
>>> Signed-off-by: Li Zefan 
>>> ---
>>>  include/linux/sched/sysctl.h | 2 +-
>>>  kernel/hung_task.c   | 2 +-
>>>  kernel/sysctl.c  | 4 ++--
>>>  3 files changed, 4 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
>>> index bf8086b..9552afa 100644
>>> --- a/include/linux/sched/sysctl.h
>>> +++ b/include/linux/sched/sysctl.h
>>> @@ -2,8 +2,8 @@
>>>  #define _SCHED_SYSCTL_H
>>>  
>>>  #ifdef CONFIG_DETECT_HUNG_TASK
>>> +extern int  sysctl_hung_task_check_count;
>>>  extern unsigned int  sysctl_hung_task_panic;
>>> -extern unsigned long sysctl_hung_task_check_count;
>>>  extern unsigned long sysctl_hung_task_timeout_secs;
>>>  extern unsigned long sysctl_hung_task_warnings;
>>>  extern int proc_dohung_task_timeout_secs(struct ctl_table *table, int write,
> 
> Looks good, but I suppose most of the unsigned longs above ought to be 
> downgraded to int as well?
> 

They can be, but they don't have to be.

> Acked-by: Ingo Molnar 

Thanks!



Re: [PATCH 1/2] [RFC v2] seqcount: Add lockdep functionality to seqcount/seqlock structures

2013-09-14 Thread Li Zefan
On 2013/9/14 8:19, John Stultz wrote:
> Currently seqlocks and seqcounts don't support lockdep.
> 
> After running across a seqcount related deadlock in the timekeeping
> code, I used a less-refined and more focused varient of this patch
> to narrow down the cause of the issue.
> 
> This is a first-pass attempt to properly enable lockdep functionality
> on seqlocks and seqcounts.
> 
> Since seqcounts are used in the vdso gettimeofday code, I've provided
> lockdep accessors.
> 
> I've also handled one case where there were nested seqlock writers
> and there may be more edge cases.
> 
> Comments and feedback would be appreciated!
> 

Could you describe how seqlocks/seqcounts can lead to deadlock in the
changelog?

> Cc: Mathieu Desnoyers 
> Cc: Li Zefan 
> Cc: Steven Rostedt 
> Cc: Peter Zijlstra 
> Cc: Ingo Molnar 
> Cc: Thomas Gleixner 
> Signed-off-by: John Stultz 
> ---
> v2:
>  * Update to new simplified lockdep.h
>  * vdso accessor simplifications
>  * removed needless preempt_disable
>  * removed unneeded ifdefs



Re: [PATCH 2/2] [RFC] cpuset: Fix potential deadlock w/ set_mems_allowed

2013-09-14 Thread Li Zefan
Cc Mel, who added seqcount to cpuset.

On 2013/9/14 8:19, John Stultz wrote:
> After adding lockdep support to seqlock/seqcount structures,
> I started seeing the following warning:
> 
> [1.070907] ==
> [1.072015] [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]
> [1.073181] 3.11.0+ #67 Not tainted
> [1.073801] --
> [1.074882] kworker/u4:2/708 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
> [1.076088]  (>mems_allowed_seq){+.+...}, at: [] 
> new_slab+0x5f/0x280
> [1.077572]
> [1.077572] and this task is already holding:
> [1.078593]  (&(>__queue_lock)->rlock){..-...}, at: 
> [] blk_execute_rq_nowait+0x53/0xf0
> [1.080042] which would create a new lock dependency:
> [1.080042]  (&(>__queue_lock)->rlock){..-...} -> 
> (>mems_allowed_seq){+.+...}
> [1.080042]
> [1.080042] but this new dependency connects a SOFTIRQ-irq-safe lock:
> [1.080042]  (&(>__queue_lock)->rlock){..-...}
> [1.080042] ... which became SOFTIRQ-irq-safe at:
> [1.080042]   [] __lock_acquire+0x5b9/0x1db0
> [1.080042]   [] lock_acquire+0x95/0x130
> [1.080042]   [] _raw_spin_lock+0x41/0x80
> [1.080042]   [] scsi_device_unbusy+0x7e/0xd0
> [1.080042]   [] scsi_finish_command+0x32/0xf0
> [1.080042]   [] scsi_softirq_done+0xa1/0x130
> [1.080042]   [] blk_done_softirq+0x73/0x90
> [1.080042]   [] __do_softirq+0x110/0x2f0
> [1.080042]   [] run_ksoftirqd+0x2d/0x60
> [1.080042]   [] smpboot_thread_fn+0x156/0x1e0
> [1.080042]   [] kthread+0xd6/0xe0
> [1.080042]   [] ret_from_fork+0x7c/0xb0
> [1.080042]
> [1.080042] to a SOFTIRQ-irq-unsafe lock:
> [1.080042]  (&p->mems_allowed_seq){+.+...}
> [1.080042] ... which became SOFTIRQ-irq-unsafe at:
> [1.080042] ...  [] __lock_acquire+0x613/0x1db0
> [1.080042]   [] lock_acquire+0x95/0x130
> [1.080042]   [] kthreadd+0x82/0x180
> [1.080042]   [] ret_from_fork+0x7c/0xb0
> [1.080042]
> [1.080042] other info that might help us debug this:
> [1.080042]
> [1.080042]  Possible interrupt unsafe locking scenario:
> [1.080042]
> [1.080042]        CPU0                    CPU1
> [1.080042]        ----                    ----
> [1.080042]   lock(&p->mems_allowed_seq);
> [1.080042]                                local_irq_disable();
> [1.080042]                                lock(&(&q->__queue_lock)->rlock);
> [1.080042]                                lock(&p->mems_allowed_seq);
> [1.080042]   <Interrupt>
> [1.080042]     lock(&(&q->__queue_lock)->rlock);
> [1.080042]
> [1.080042]  *** DEADLOCK ***
> 
> The issue stems from the kthreadd() function calling set_mems_allowed
> with irqs enabled. While it's possibly unlikely for the actual deadlock
> to trigger, a fix is fairly simple: disable irqs before taking the
> mems_allowed_seq lock.
> 

Now I get it. I'm fine with this change.

Acked-by: Li Zefan 

> Let me know if you have any other suggestions or alternative fixes you'd
> prefer.
> 
> Cc: Li Zefan 
> Cc: Mathieu Desnoyers 
> Cc: Steven Rostedt 
> Cc: Peter Zijlstra 
> Cc: Ingo Molnar 
> Cc: Thomas Gleixner 
> Signed-off-by: John Stultz 
> ---
>  include/linux/cpuset.h | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index cc1b01c..3fe661f 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -110,10 +110,14 @@ static inline bool put_mems_allowed(unsigned int seq)
>  
>  static inline void set_mems_allowed(nodemask_t nodemask)
>  {
> + unsigned long flags;
> +
>   task_lock(current);
> + local_irq_save(flags);
>  	write_seqcount_begin(&current->mems_allowed_seq);
>  	current->mems_allowed = nodemask;
>  	write_seqcount_end(&current->mems_allowed_seq);
> + local_irq_restore(flags);
>   task_unlock(current);
>  }
>  
> 



[PATCH v2 01/12] cpuset: add cs->effective_cpus and cs->effective_mems

2013-10-11 Thread Li Zefan
We're going to have separate user-configured masks and effective ones.

Eventually the configured masks can only be changed by writing cpuset.cpus
and cpuset.mems, and they won't be restricted by the parent cpuset. The
effective masks, on the other hand, reflect cpu/memory hotplug and
hierarchical restriction; these are the real masks that apply to the tasks
in the cpuset.

We calculate effective mask this way:
  - top cpuset's effective_mask == online_mask, otherwise
  - cpuset's effective_mask == configured_mask & parent effective_mask,
if the result is empty, it inherits parent effective mask.
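
A minimal sketch of that calculation (illustrative only; not part of this patch):

  static void compute_effective_cpumask(struct cpumask *new_cpus,
                                        struct cpuset *cs,
                                        struct cpuset *parent)
  {
          cpumask_and(new_cpus, cs->cpus_allowed, parent->effective_cpus);
          if (cpumask_empty(new_cpus))
                  cpumask_copy(new_cpus, parent->effective_cpus);
  }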

Those behavior changes are for sane_behavior only. For !sane_behavior
effective_mask and configured_mask are the same, so we won't break old
interfaces.

This patch adds and initializes the effective masks in struct cpuset.
The effective masks of the top cpuset are the same as its configured masks,
and a child cpuset inherits its parent's effective masks.

This won't introduce behavior change.

v2:
- s/real_{mems,cpus}_allowed/effective_{mems,cpus}, suggested by Tejun.
- don't init effective masks in cpuset_css_online() if !sane_behavior

Signed-off-by: Li Zefan 
---
 kernel/cpuset.c | 59 ++---
 1 file changed, 48 insertions(+), 11 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 6bf981e..e13fc2a 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -81,8 +81,14 @@ struct cpuset {
struct cgroup_subsys_state css;
 
unsigned long flags;/* "unsigned long" so bitops work */
-   cpumask_var_t cpus_allowed; /* CPUs allowed to tasks in cpuset */
-   nodemask_t mems_allowed;/* Memory Nodes allowed to tasks */
+
+   /* user-configured CPUs and Memory Nodes allowed to tasks */
+   cpumask_var_t cpus_allowed;
+   nodemask_t mems_allowed;
+
+   /* effective CPUs and Memory Nodes allowed to tasks */
+   cpumask_var_t effective_cpus;
+   nodemask_t effective_mems;
 
/*
 * This is old Memory Nodes tasks took on.
@@ -381,13 +387,20 @@ static struct cpuset *alloc_trial_cpuset(struct cpuset *cs)
if (!trial)
return NULL;
 
-   if (!alloc_cpumask_var(&trial->cpus_allowed, GFP_KERNEL)) {
-   kfree(trial);
-   return NULL;
-   }
-   cpumask_copy(trial->cpus_allowed, cs->cpus_allowed);
+   if (!alloc_cpumask_var(&trial->cpus_allowed, GFP_KERNEL))
+   goto free_cs;
+   if (!alloc_cpumask_var(&trial->effective_cpus, GFP_KERNEL))
+   goto free_cpus;
 
+   cpumask_copy(trial->cpus_allowed, cs->cpus_allowed);
+   cpumask_copy(trial->effective_cpus, cs->effective_cpus);
return trial;
+
+free_cpus:
+   free_cpumask_var(trial->cpus_allowed);
+free_cs:
+   kfree(trial);
+   return NULL;
 }
 
 /**
@@ -396,6 +409,7 @@ static struct cpuset *alloc_trial_cpuset(struct cpuset *cs)
  */
 static void free_trial_cpuset(struct cpuset *trial)
 {
+   free_cpumask_var(trial->effective_cpus);
free_cpumask_var(trial->cpus_allowed);
kfree(trial);
 }
@@ -1948,18 +1962,26 @@ cpuset_css_alloc(struct cgroup_subsys_state *parent_css)
cs = kzalloc(sizeof(*cs), GFP_KERNEL);
if (!cs)
return ERR_PTR(-ENOMEM);
-   if (!alloc_cpumask_var(&cs->cpus_allowed, GFP_KERNEL)) {
-   kfree(cs);
-   return ERR_PTR(-ENOMEM);
-   }
+   if (!alloc_cpumask_var(&cs->cpus_allowed, GFP_KERNEL))
+   goto free_cs;
+   if (!alloc_cpumask_var(&cs->effective_cpus, GFP_KERNEL))
+   goto free_cpus;
 
set_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
cpumask_clear(cs->cpus_allowed);
nodes_clear(cs->mems_allowed);
+   cpumask_clear(cs->effective_cpus);
+   nodes_clear(cs->effective_mems);
fmeter_init(&cs->fmeter);
cs->relax_domain_level = -1;
 
return &cs->css;
+
+free_cpus:
+   free_cpumask_var(cs->cpus_allowed);
+free_cs:
+   kfree(cs);
+   return ERR_PTR(-ENOMEM);
 }
 
 static int cpuset_css_online(struct cgroup_subsys_state *css)
@@ -1982,6 +2004,13 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 
number_of_cpusets++;
 
+   mutex_lock(&callback_mutex);
+   if (cgroup_sane_behavior(cs->css.cgroup)) {
+   cpumask_copy(cs->effective_cpus, parent->effective_cpus);
+   cs->effective_mems = parent->effective_mems;
+   }
+   mutex_unlock(&callback_mutex);
+
if (!test_bit(CGRP_CPUSET_CLONE_CHILDREN, &css->cgroup->flags))
goto out_unlock;
 
@@ -2041,6 +2070,7 @@ static void cpuset_css_free(struct cgroup_subsys_state *css)
 {
struct cpuset *cs = css_cs(css);
 
+   free_cpumask_var(cs->effective_cpus);
free_cpumask_var(cs->cpus_allowed);
kfree(cs);
 }
@@ -2071,9 +2101,13 @@ int __init cpuset_init(void)
 
  

[PATCH v2 00/12] cpuset: separate configured masks and effective masks

2013-10-11 Thread Li Zefan
This patchset introduces behavior changes, but only if you mount cgroupfs
with sane_behavior option:

- We introduce new interfaces cpuset.effective_cpus and cpuset.effective_mems,
  while cpuset.cpus and cpuset.mems will be configured masks.

- The configured masks can be changed by writing cpuset.cpus/mems only. They
  won't be changed when hotplug happens.

- Users can configure cpus and mems without restrictions from the parent cpuset.
  effective masks will enforce the hierarchical behavior.

- Users can also configure cpus and mems to include already offlined CPUs/nodes.

- When a CPU/node is onlined, it will be brought back to the effective masks
  if it's in the configured masks.

- We build sched domains based on effective cpumask but not configured cpumask.


v2:
- fixed two bugs
- made changelogs more verbose
- added more comments
- changed cs->real_{mems,cpus}_allowed to cs->effective_{mems, cpus}
- split "cpuset: enable onlined cpu/node in effective masks" into 2 patches
- exported cpuset.effective_{cpus,mems} unconditionally


Li Zefan (12):
  cpuset: add cs->effective_cpus and cs->effective_mems
  cpuset: update cpuset->effective_{cpus,mems} at hotplug
  cpuset: update cs->effective_{cpus,mems} when config changes
  cpuset: inherit ancestor's masks if effective_{cpus,mems} becomes empty
  cpuset: use effective cpumask to build sched domains
  cpuset: initialize top_cpuset's configured masks at mount
  cpuset: apply cs->effective_{cpus,mems}
  cpuset: make cs->{cpus,mems}_allowed as user-configured masks
  cpuset: refactor cpuset_hotplug_update_tasks()
  cpuset: enable onlined cpu/node in effective masks
  cpuset: allow writing offlined masks to cpuset.cpus/mems
  cpuset: export effective masks to userspace

 kernel/cpuset.c | 513 ++--
 1 file changed, 316 insertions(+), 197 deletions(-)

-- 
1.8.0.2


[PATCH v2 02/12] cpuset: update cpuset->effective_{cpus,mems} at hotplug

2013-10-11 Thread Li Zefan
We're going to have separate user-configured masks and effective ones.

Eventually configured masks can only be changed by writing cpuset.cpus
and cpuset.mems, and they won't be restricted by parent cpuset. While
effective masks reflect cpu/memory hotplug and hierarchical restriction,
and these are the real masks that apply to the tasks in the cpuset.

We calculate effective mask this way:
  - top cpuset's effective_mask == online_mask, otherwise
  - cpuset's effective_mask == configured_mask & parent effective_mask,
if the result is empty, it inherits parent effective mask.

Those behavior changes are for sane_behavior only. For !sane_behavior
effective_mask and configured_mask are the same, so we won't break old
interfaces.

To make cs->effective_{cpus,mems} to be effective masks, we need to
  - change the effective masks at hotplug
  - change the effective masks at config change
  - take on ancestor's mask when the effective mask is empty

The first item is done here.

This won't introduce behavior change.

Signed-off-by: Li Zefan 
---
 kernel/cpuset.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index e13fc2a..d0ccde2 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -2186,6 +2186,7 @@ retry:
 
mutex_lock(&callback_mutex);
cpumask_andnot(cs->cpus_allowed, cs->cpus_allowed, &off_cpus);
+   cpumask_andnot(cs->effective_cpus, cs->effective_cpus, &off_cpus);
mutex_unlock(&callback_mutex);
 
/*
@@ -2200,6 +2201,7 @@ retry:
 
mutex_lock(&callback_mutex);
nodes_andnot(cs->mems_allowed, cs->mems_allowed, off_mems);
+   nodes_andnot(cs->effective_mems, cs->effective_mems, off_mems);
mutex_unlock(&callback_mutex);
 
/*
@@ -2263,6 +2265,7 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
if (cpus_updated) {
mutex_lock(&callback_mutex);
cpumask_copy(top_cpuset.cpus_allowed, &new_cpus);
+   cpumask_copy(top_cpuset.effective_cpus, &new_cpus);
mutex_unlock(&callback_mutex);
/* we don't mess with cpumasks of tasks in top_cpuset */
}
@@ -2271,6 +2274,7 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
if (mems_updated) {
mutex_lock(&callback_mutex);
top_cpuset.mems_allowed = new_mems;
+   top_cpuset.effective_mems = new_mems;
mutex_unlock(&callback_mutex);
update_tasks_nodemask(&top_cpuset, NULL);
}
-- 
1.8.0.2



[PATCH v2 03/12] cpuset: update cs->effective_{cpus,mems} when config changes

2013-10-11 Thread Li Zefan
We're going to have separate user-configured masks and effective ones.

Eventually configured masks can only be changed by writing cpuset.cpus
and cpuset.mems, and they won't be restricted by parent cpuset. While
effective masks reflect cpu/memory hotplug and hierarchical restriction,
and these are the real masks that apply to the tasks in the cpuset.

We calculate effective mask this way:
  - top cpuset's effective_mask == online_mask, otherwise
  - cpuset's effective_mask == configured_mask & parent effective_mask,
if the result is empty, it inherits parent effective mask.

Those behavior changes are for sane_behavior only. For !sane_behavior
effective_mask and configured_mask are the same, so we won't break old
interfaces.

To make cs->effective_{cpus,mems} to be effective masks, we need to
  - change the effective masks at hotplug
  - change the effective masks at config change
  - take on ancestor's mask when the effective mask is empty

The second item is done here. We don't need to treat root_cs specially
in update_cpumasks_hier(). While at it, remove the redundant variable
is_load_balanced.

This won't introduce behavior change.

v2:
- revise the comment in update_{cpu,node}masks_hier(), suggested by Tejun.
- fix to use @cp instead of @cs in these two functions.

Signed-off-by: Li Zefan 
---
 kernel/cpuset.c | 115 
 1 file changed, 66 insertions(+), 49 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index d0ccde2..bdc6047 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -879,39 +879,49 @@ static void update_tasks_cpumask(struct cpuset *cs, struct ptr_heap *heap)
css_scan_tasks(&cs->css, NULL, cpuset_change_cpumask, cs, heap);
 }
 
-/*
- * update_tasks_cpumask_hier - Update the cpumasks of tasks in the hierarchy.
- * @root_cs: the root cpuset of the hierarchy
- * @update_root: update root cpuset or not?
+/**
+ * update_cpumasks_hier - Update effective cpumasks and tasks in the subtree
+ * @cs: the cpuset to consider
+ * @trialcs: the trial cpuset
  * @heap: the heap used by css_scan_tasks()
  *
- * This will update cpumasks of tasks in @root_cs and all other empty cpusets
- * which take on cpumask of @root_cs.
- *
- * Called with cpuset_mutex held
+ * When configured cpumask is changed, the effective cpumasks of this cpuset
+ * and all its descendants need to be updated.
  */
-static void update_tasks_cpumask_hier(struct cpuset *root_cs,
- bool update_root, struct ptr_heap *heap)
+static void update_cpumasks_hier(struct cpuset *cs, struct cpuset *trialcs,
+struct ptr_heap *heap)
 {
-   struct cpuset *cp;
struct cgroup_subsys_state *pos_css;
+   struct cpuset *cp;
 
rcu_read_lock();
-   cpuset_for_each_descendant_pre(cp, pos_css, root_cs) {
-   if (cp == root_cs) {
-   if (!update_root)
-   continue;
-   } else {
-   /* skip the whole subtree if @cp have some CPU */
-   if (!cpumask_empty(cp->cpus_allowed)) {
-   pos_css = css_rightmost_descendant(pos_css);
-   continue;
-   }
+   cpuset_for_each_descendant_pre(cp, pos_css, cs) {
+   struct cpuset *parent = parent_cs(cp);
+   struct cpumask *new_cpus = trialcs->effective_cpus;
+
+   cpumask_and(new_cpus, cp->cpus_allowed,
+   parent->effective_cpus);
+
+   /*
+* Skip the whole subtree if the cpumask remains the same
+* and isn't empty. If it's empty, we need to update tasks
+* to take on an ancestor's cpumask.
+*/
+   if (cpumask_equal(new_cpus, cp->effective_cpus) &&
+   ((cp == cs) || !cpumask_empty(new_cpus))) {
+   pos_css = css_rightmost_descendant(pos_css);
+   continue;
}
+
if (!css_tryget(&cp->css))
continue;
+
rcu_read_unlock();
 
+   mutex_lock(&callback_mutex);
+   cpumask_copy(cp->effective_cpus, new_cpus);
+   mutex_unlock(&callback_mutex);
+
update_tasks_cpumask(cp, heap);
 
rcu_read_lock();
@@ -930,7 +940,6 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 {
struct ptr_heap heap;
int retval;
-   int is_load_balanced;
 
/* top_cpuset.cpus_allowed tracks cpu_online_mask; it's read-only */
if (cs == &top_cpuset)
@@ -965,17 +974,15 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
if (retval)
return retval;
 
-   is_load_balanced = is_sched_load_balance(trialcs);
-
mutex_lock(&callback_mutex);
cpumask_copy(

[PATCH v2 05/12] cpuset: use effective cpumask to build sched domains

2013-10-11 Thread Li Zefan
We're going to have separate user-configured masks and effective ones.

Eventually configured masks can only be changed by writing cpuset.cpus
and cpuset.mems, and they won't be restricted by parent cpuset. While
effective masks reflect cpu/memory hotplug and hierarchical restriction,
and these are the real masks that apply to the tasks in the cpuset.

We calculate effective mask this way:
  - top cpuset's effective_mask == online_mask, otherwise
  - cpuset's effective_mask == configured_mask & parent effective_mask,
if the result is empty, it inherits parent effective mask.

Those behavior changes are for sane_behavior only. For !sane_behavior
effective_mask and configured_mask are the same, so we won't break old
interfaces.

This patch updates cpuset to use effective masks to build sched domains.

This won't introduce behavior change.

v2:
- Add a comment for the call of rebuild_sched_domains(), suggested
by Tejun.

Signed-off-by: Li Zefan 
---
 kernel/cpuset.c | 23 ---
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 6723b88..360e547 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -499,11 +499,11 @@ out:
 #ifdef CONFIG_SMP
 /*
  * Helper routine for generate_sched_domains().
- * Do cpusets a, b have overlapping cpus_allowed masks?
+ * Do cpusets a, b have overlapping effective cpus_allowed masks?
  */
 static int cpusets_overlap(struct cpuset *a, struct cpuset *b)
 {
-   return cpumask_intersects(a->cpus_allowed, b->cpus_allowed);
+   return cpumask_intersects(a->effective_cpus, b->effective_cpus);
 }
 
 static void
@@ -620,7 +620,7 @@ static int generate_sched_domains(cpumask_var_t **domains,
*dattr = SD_ATTR_INIT;
update_domain_attr_tree(dattr, &top_cpuset);
}
-   cpumask_copy(doms[0], top_cpuset.cpus_allowed);
+   cpumask_copy(doms[0], top_cpuset.effective_cpus);
 
goto done;
}
@@ -727,7 +727,7 @@ restart:
struct cpuset *b = csa[j];
 
if (apn == b->pn) {
-   cpumask_or(dp, dp, b->cpus_allowed);
+   cpumask_or(dp, dp, b->effective_cpus);
if (dattr)
update_domain_attr_tree(dattr + nslot, b);
 
@@ -893,6 +893,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct cpuset *trialcs,
 {
struct cgroup_subsys_state *pos_css;
struct cpuset *cp;
+   bool need_rebuild_sched_domains = false;
 
rcu_read_lock();
cpuset_for_each_descendant_pre(cp, pos_css, cs) {
@@ -930,10 +931,21 @@ static void update_cpumasks_hier(struct cpuset *cs, struct cpuset *trialcs,
 
update_tasks_cpumask(cp, heap);
 
+   /*
+* If the effective cpumask of any non-empty cpuset is
+* changed, we need to rebuild sched domains.
+*/
+   if (!cpumask_empty(cp->cpus_allowed) &&
+   is_sched_load_balance(cp))
+   need_rebuild_sched_domains = true;
+
rcu_read_lock();
css_put(&cp->css);
}
rcu_read_unlock();
+
+   if (need_rebuild_sched_domains)
+   rebuild_sched_domains_locked();
 }
 
 /**
@@ -987,9 +999,6 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
update_cpumasks_hier(cs, trialcs, &heap);
 
heap_free(&heap);
-
-   if (is_sched_load_balance(cs))
-   rebuild_sched_domains_locked();
return 0;
 }
 
-- 
1.8.0.2



[PATCH v2 04/12] cpuset: inherit ancestor's masks if effective_{cpus,mems} becomes empty

2013-10-11 Thread Li Zefan
We're going to have separate user-configured masks and effective ones.

Eventually configured masks can only be changed by writing cpuset.cpus
and cpuset.mems, and they won't be restricted by parent cpuset. While
effective masks reflect cpu/memory hotplug and hierarchical restriction,
and these are the real masks that apply to the tasks in the cpuset.

We calculate effective mask this way:
  - top cpuset's effective_mask == online_mask, otherwise
  - cpuset's effective_mask == configured_mask & parent effective_mask,
if the result is empty, it inherits parent effective mask.

Those behavior changes are for sane_behavior only. For !sane_behavior
effective_mask and configured_mask are the same, so we won't break old
interfaces.

To make cs->effective_{cpus,mems} to be effective masks, we need to
  - change the effective masks at hotplug
  - change the effective masks at config change
  - take on ancestor's mask when the effective mask is empty

This won't introduce behavior change.

v2:
- Add comments to explain effective masks are the same as the configured
masks for !sane_behavior.

Signed-off-by: Li Zefan 
---
 kernel/cpuset.c | 48 ++--
 1 file changed, 34 insertions(+), 14 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index bdc6047..6723b88 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -899,16 +899,22 @@ static void update_cpumasks_hier(struct cpuset *cs, struct cpuset *trialcs,
struct cpuset *parent = parent_cs(cp);
struct cpumask *new_cpus = trialcs->effective_cpus;
 
-   cpumask_and(new_cpus, cp->cpus_allowed,
-   parent->effective_cpus);
+   /*
+* If !sane_behavior, new_cpus will equal cpus_allowed,
+* which is not empty, so it's guaranteed the effective mask
+* is the same as the configured mask.
+*/
+   cpumask_and(new_cpus, cp->cpus_allowed, parent->effective_cpus);
 
/*
-* Skip the whole subtree if the cpumask remains the same
-* and isn't empty. If it's empty, we need to update tasks
-* to take on an ancestor's cpumask.
+* If it becomes empty, inherit the effective mask of the
+* parent, which is guaranteed to have some CPUs.
 */
-   if (cpumask_equal(new_cpus, cp->effective_cpus) &&
-   ((cp == cs) || !cpumask_empty(new_cpus))) {
+   if (cpumask_empty(new_cpus))
+   cpumask_copy(new_cpus, parent->effective_cpus);
+
+   /* Skip the whole subtree if the cpumask remains the same. */
+   if (cpumask_equal(new_cpus, cp->effective_cpus)) {
pos_css = css_rightmost_descendant(pos_css);
continue;
}
@@ -1163,16 +1169,22 @@ static void update_nodemasks_hier(struct cpuset *cs, struct cpuset *trialcs,
struct cpuset *parent = parent_cs(cp);
nodemask_t *new_mems = &trialcs->effective_mems;
 
-   nodes_and(*new_mems, cp->mems_allowed,
- parent->effective_mems);
+   /*
+* If !sane_behavior, new_mems will equal mems_allowed,
+* which is not empty, so it's guaranteed the effective mask
+* is the same as the configured mask.
+*/
+   nodes_and(*new_mems, cp->mems_allowed, parent->effective_mems);
 
/*
-* Skip the whole subtree if the nodemask remains the same
-* and isn't empty. If it's empty, we need to update tasks
-* to take on an ancestor's nodemask.
+* If it becomes empty, inherit the effective mask of the
+* parent, which is guaranteed to have some MEMs.
 */
-   if (nodes_equal(*new_mems, cp->effective_mems) &&
-   ((cp == cs) || !nodes_empty(*new_mems))) {
+   if (nodes_empty(*new_mems))
+   *new_mems = parent->effective_mems;
+
+   /* Skip the whole subtree if the nodemask is not changed. */
+   if (nodes_equal(*new_mems, cp->effective_mems)) {
pos_css = css_rightmost_descendant(pos_css);
continue;
}
@@ -2203,7 +2215,11 @@ retry:
 
mutex_lock(&callback_mutex);
cpumask_andnot(cs->cpus_allowed, cs->cpus_allowed, &off_cpus);
+
+   /* Inherit the effective mask of the parent, if it becomes empty. */
+   cpumask_andnot(cs->effective_cpus, cs->effective_cpus, &off_cpus);
+   if (sane && cpumask_empty(cs->effective_cpus))
+   cpumask_copy(cs->effective_cpus, parent_cs(cs)->effective_cpus);

[PATCH v2 06/12] cpuset: initialize top_cpuset's configured masks at mount

2013-10-11 Thread Li Zefan
As we now have to support both sane_behavior and !sane_behavior,
top_cpuset's configured masks need to be initialized accordingly.

Signed-off-by: Li Zefan 
---
 kernel/cpuset.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 360e547..5c53ba5 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -2113,8 +2113,27 @@ static void cpuset_css_free(struct cgroup_subsys_state *css)
kfree(cs);
 }
 
+void cpuset_bind(struct cgroup_subsys_state *root_css)
+{
+   mutex_lock(&cpuset_mutex);
+   mutex_lock(&callback_mutex);
+
+   if (cgroup_sane_behavior(root_css->cgroup)) {
+   cpumask_copy(top_cpuset.cpus_allowed, cpu_possible_mask);
+   top_cpuset.mems_allowed = node_possible_map;
+   } else {
+   cpumask_copy(top_cpuset.cpus_allowed,
+top_cpuset.effective_cpus);
+   top_cpuset.mems_allowed = top_cpuset.effective_mems;
+   }
+
+   mutex_unlock(&callback_mutex);
+   mutex_unlock(&cpuset_mutex);
+}
+
 struct cgroup_subsys cpuset_subsys = {
.name = "cpuset",
+   .bind = cpuset_bind,
.css_alloc = cpuset_css_alloc,
.css_online = cpuset_css_online,
.css_offline = cpuset_css_offline,
-- 
1.8.0.2



[PATCH v2 08/12] cpuset: make cs->{cpus,mems}_allowed as user-configured masks

2013-10-11 Thread Li Zefan
Now that we use effective masks to enforce the hierarchical behavior,
we can use cs->{cpus,mems}_allowed as the user-configured masks.

Configured masks can be changed by writing cpuset.cpus and cpuset.mems
only. The new behaviors are:

- They won't be changed by hotplug anymore.
- They won't be limited by its parent's masks.

This is a behavior change, but it won't take effect unless mounted with
sane_behavior.

v2:
- Add comments to explain the differences between configured masks and
effective masks.

Signed-off-by: Li Zefan 
---
 kernel/cpuset.c | 51 ---
 1 file changed, 40 insertions(+), 11 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 040ec59..e47115e 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -82,6 +82,26 @@ struct cpuset {
 
unsigned long flags;/* "unsigned long" so bitops work */
 
+   /*
+* If sane_behavior is set:
+*
+* The user-configured masks can only be changed by writing to
+* cpuset.cpus and cpuset.mems, and won't be limited by the
+* parent masks.
+*
+* The effective masks are the real masks that apply to the tasks
+* in the cpuset. They may be changed if the configured masks are
+* changed or hotplug happens.
+*
+* effective_mask == configured_mask & parent's effective_mask,
+* and if it ends up empty, it will inherit the parent's mask.
+*
+*
+* If sane_behavior is not set:
+*
+* The user-configured masks are always the same as the effective masks.
+*/
+
/* user-configured CPUs and Memory Nodes allowed to tasks */
cpumask_var_t cpus_allowed;
nodemask_t mems_allowed;
@@ -455,9 +475,13 @@ static int validate_change(struct cpuset *cur, struct cpuset *trial)
 
par = parent_cs(cur);
 
-   /* We must be a subset of our parent cpuset */
+   /*
+* We must be a subset of our parent cpuset, unless sane_behavior
+* flag is set.
+*/
ret = -EACCES;
-   if (!is_cpuset_subset(trial, par))
+   if (!cgroup_sane_behavior(cur->css.cgroup) &&
+   !is_cpuset_subset(trial, par))
goto out;
 
/*
@@ -779,7 +803,7 @@ static void rebuild_sched_domains_locked(void)
 * passing doms with offlined cpu to partition_sched_domains().
 * Anyways, hotplug work item will rebuild sched domains.
 */
-   if (!cpumask_equal(top_cpuset.cpus_allowed, cpu_active_mask))
+   if (!cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
goto out;
 
/* Generate domain masks and attrs */
@@ -2191,11 +2215,12 @@ retry:
goto retry;
}
 
-   cpumask_andnot(&off_cpus, cs->cpus_allowed, top_cpuset.cpus_allowed);
-   nodes_andnot(off_mems, cs->mems_allowed, top_cpuset.mems_allowed);
+   cpumask_andnot(&off_cpus, cs->effective_cpus, top_cpuset.effective_cpus);
+   nodes_andnot(off_mems, cs->effective_mems, top_cpuset.effective_mems);
 
mutex_lock(&callback_mutex);
-   cpumask_andnot(cs->cpus_allowed, cs->cpus_allowed, &off_cpus);
+   if (!sane)
+   cpumask_andnot(cs->cpus_allowed, cs->cpus_allowed, &off_cpus);
 
/* Inherit the effective mask of the parent, if it becomes empty. */
cpumask_andnot(cs->effective_cpus, cs->effective_cpus, &off_cpus);
@@ -2214,7 +2239,8 @@ retry:
update_tasks_cpumask(cs, NULL);
 
mutex_lock(&callback_mutex);
-   nodes_andnot(cs->mems_allowed, cs->mems_allowed, off_mems);
+   if (!sane)
+   nodes_andnot(cs->mems_allowed, cs->mems_allowed, off_mems);
 
/* Inherit the effective mask of the parent, if it becomes empty */
nodes_andnot(cs->effective_mems, cs->effective_mems, off_mems);
@@ -2269,6 +2295,7 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
static cpumask_t new_cpus;
static nodemask_t new_mems;
bool cpus_updated, mems_updated;
+   bool sane = cgroup_sane_behavior(top_cpuset.css.cgroup);
 
mutex_lock(_mutex);
 
@@ -2276,13 +2303,14 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
cpumask_copy(&new_cpus, cpu_active_mask);
new_mems = node_states[N_MEMORY];
 
-   cpus_updated = !cpumask_equal(top_cpuset.cpus_allowed, &new_cpus);
-   mems_updated = !nodes_equal(top_cpuset.mems_allowed, new_mems);
+   cpus_updated = !cpumask_equal(top_cpuset.effective_cpus, &new_cpus);
+   mems_updated = !nodes_equal(top_cpuset.effective_mems, new_mems);
 
/* synchronize cpus_allowed to cpu_active_mask */
if (cpus_updated) {
mutex_lock(&callback_mutex);
-   cpumask_copy(top_cpuset.cpus_allowed, &new_cpus);
+   if (!sane)
+   cpumask_copy(top_cpuset.cpus_allowed, &new_cpus);
cpumask_copy(top_cpuset.effecti

[PATCH v2 07/12] cpuset: apply cs->effective_{cpus,mems}

2013-10-11 Thread Li Zefan
Now we can use cs->effective_{cpus,mems} as effective masks. It's
used whenever:

- we update tasks' cpus_allowed/mems_allowed,
- we want to retrieve tasks_cs(tsk)'s cpus_allowed/mems_allowed.

They actually replace effective_{cpu,node}mask_cpuset().

effective_mask == configured_mask & parent effective_mask except when
the result is empty, in which case it inherits the parent's effective_mask.
The result equals the mask computed from effective_{cpu,node}mask_cpuset().

This won't affect the original !sane_behavior mode, because in this case we
make sure the effective masks are always the same with user-configured
masks.

Signed-off-by: Li Zefan 
---
 kernel/cpuset.c | 83 ++---
 1 file changed, 14 insertions(+), 69 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 5c53ba5..040ec59 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -318,9 +318,9 @@ static struct file_system_type cpuset_fs_type = {
  */
 static void guarantee_online_cpus(struct cpuset *cs, struct cpumask *pmask)
 {
-   while (!cpumask_intersects(cs->cpus_allowed, cpu_online_mask))
+   while (!cpumask_intersects(cs->effective_cpus, cpu_online_mask))
cs = parent_cs(cs);
-   cpumask_and(pmask, cs->cpus_allowed, cpu_online_mask);
+   cpumask_and(pmask, cs->effective_cpus, cpu_online_mask);
 }
 
 /*
@@ -336,9 +336,9 @@ static void guarantee_online_cpus(struct cpuset *cs, struct cpumask *pmask)
  */
 static void guarantee_online_mems(struct cpuset *cs, nodemask_t *pmask)
 {
-   while (!nodes_intersects(cs->mems_allowed, node_states[N_MEMORY]))
+   while (!nodes_intersects(cs->effective_mems, node_states[N_MEMORY]))
cs = parent_cs(cs);
-   nodes_and(*pmask, cs->mems_allowed, node_states[N_MEMORY]);
+   nodes_and(*pmask, cs->effective_mems, node_states[N_MEMORY]);
 }
 
 /*
@@ -803,45 +803,6 @@ void rebuild_sched_domains(void)
mutex_unlock(_mutex);
 }
 
-/*
- * effective_cpumask_cpuset - return nearest ancestor with non-empty cpus
- * @cs: the cpuset in interest
- *
- * A cpuset's effective cpumask is the cpumask of the nearest ancestor
- * with non-empty cpus. We use effective cpumask whenever:
- * - we update tasks' cpus_allowed. (they take on the ancestor's cpumask
- *   if the cpuset they reside in has no cpus)
- * - we want to retrieve task_cs(tsk)'s cpus_allowed.
- *
- * Called with cpuset_mutex held. cpuset_cpus_allowed_fallback() is an
- * exception. See comments there.
- */
-static struct cpuset *effective_cpumask_cpuset(struct cpuset *cs)
-{
-   while (cpumask_empty(cs->cpus_allowed))
-   cs = parent_cs(cs);
-   return cs;
-}
-
-/*
- * effective_nodemask_cpuset - return nearest ancestor with non-empty mems
- * @cs: the cpuset in interest
- *
- * A cpuset's effective nodemask is the nodemask of the nearest ancestor
- * with non-empty memss. We use effective nodemask whenever:
- * - we update tasks' mems_allowed. (they take on the ancestor's nodemask
- *   if the cpuset they reside in has no mems)
- * - we want to retrieve task_cs(tsk)'s mems_allowed.
- *
- * Called with cpuset_mutex held.
- */
-static struct cpuset *effective_nodemask_cpuset(struct cpuset *cs)
-{
-   while (nodes_empty(cs->mems_allowed))
-   cs = parent_cs(cs);
-   return cs;
-}
-
 /**
  * cpuset_change_cpumask - make a task's cpus_allowed the same as its cpuset's
  * @tsk: task to test
@@ -856,9 +817,8 @@ static struct cpuset *effective_nodemask_cpuset(struct cpuset *cs)
 static void cpuset_change_cpumask(struct task_struct *tsk, void *data)
 {
struct cpuset *cs = data;
-   struct cpuset *cpus_cs = effective_cpumask_cpuset(cs);
 
-   set_cpus_allowed_ptr(tsk, cpus_cs->cpus_allowed);
+   set_cpus_allowed_ptr(tsk, cs->effective_cpus);
 }
 
 /**
@@ -1026,14 +986,12 @@ static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from,
const nodemask_t *to)
 {
struct task_struct *tsk = current;
-   struct cpuset *mems_cs;
 
tsk->mems_allowed = *to;
 
do_migrate_pages(mm, from, to, MPOL_MF_MOVE_ALL);
 
-   mems_cs = effective_nodemask_cpuset(task_cs(tsk));
-   guarantee_online_mems(mems_cs, &tsk->mems_allowed);
+   guarantee_online_mems(task_cs(tsk), &tsk->mems_allowed);
 }
 
 /*
@@ -1128,13 +1086,12 @@ static void *cpuset_being_rebound;
 static void update_tasks_nodemask(struct cpuset *cs, struct ptr_heap *heap)
 {
static nodemask_t newmems;  /* protected by cpuset_mutex */
-   struct cpuset *mems_cs = effective_nodemask_cpuset(cs);
struct cpuset_change_nodemask_arg arg = { .cs = cs,
  .newmems = &newmems };
 
cpuset_being_rebound = cs;  /* causes mpol_dup() rebind */
 
-   guarantee_online_mems(mems_cs, &newmems);
+   guarantee_online_mems(cs, &newmems);
 
/*

[PATCH v2 09/12] cpuset: refactor cpuset_hotplug_update_tasks()

2013-10-11 Thread Li Zefan
We mix the handling for both sane_behavior and !sane_behavior in the
same function, and it's quite messy, so split into two functions.

Signed-off-by: Li Zefan 
---
 kernel/cpuset.c | 118 ++--
 1 file changed, 63 insertions(+), 55 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index e47115e..cefc8f4 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -2186,6 +2186,65 @@ static void remove_tasks_in_empty_cpuset(struct cpuset *cs)
}
 }
 
+static void hotplug_update_tasks_insane(struct cpuset *cs,
+   struct cpumask *off_cpus,
+   nodemask_t *off_mems)
+{
+   bool is_empty;
+
+   mutex_lock(&callback_mutex);
+   cpumask_andnot(cs->cpus_allowed, cs->cpus_allowed, off_cpus);
+   cpumask_andnot(cs->effective_cpus, cs->effective_cpus, off_cpus);
+   nodes_andnot(cs->mems_allowed, cs->mems_allowed, *off_mems);
+   nodes_andnot(cs->effective_mems, cs->effective_mems, *off_mems);
+   mutex_unlock(&callback_mutex);
+
+   /*
+* Don't call update_tasks_cpumask() if the cpuset becomes empty,
+* as the tasks will be migrated to an ancestor.
+*/
+   if (!cpumask_empty(off_cpus) && !cpumask_empty(cs->cpus_allowed))
+   update_tasks_cpumask(cs, NULL);
+   if (!nodes_empty(*off_mems) && !nodes_empty(cs->mems_allowed))
+   update_tasks_nodemask(cs, NULL);
+
+   is_empty = cpumask_empty(cs->cpus_allowed) ||
+  nodes_empty(cs->mems_allowed);
+
+   mutex_unlock(&cpuset_mutex);
+
+   /*
+* Move tasks to the nearest ancestor with execution resources.
+* This is a full cgroup operation which will also call back into
+* cpuset. Should be done outside any lock.
+*/
+   if (is_empty)
+   remove_tasks_in_empty_cpuset(cs);
+
+   mutex_lock(&cpuset_mutex);
+}
+
+static void hotplug_update_tasks_sane(struct cpuset *cs,
+ struct cpumask *off_cpus,
+ nodemask_t *off_mems)
+{
+   mutex_lock(&callback_mutex);
+   cpumask_andnot(cs->effective_cpus, cs->effective_cpus, off_cpus);
+   if (cpumask_empty(cs->effective_cpus))
+   cpumask_copy(cs->effective_cpus,
+parent_cs(cs)->effective_cpus);
+
+   nodes_andnot(cs->effective_mems, cs->effective_mems, *off_mems);
+   if (nodes_empty(cs->effective_mems))
+   cs->effective_mems = parent_cs(cs)->effective_mems;
+   mutex_unlock(&callback_mutex);
+
+   if (!cpumask_empty(off_cpus))
+   update_tasks_cpumask(cs, NULL);
+   if (!nodes_empty(*off_mems))
+   update_tasks_nodemask(cs, NULL);
+}
+
 /**
  * cpuset_hotplug_update_tasks - update tasks in a cpuset for hotunplug
  * @cs: cpuset in interest
@@ -2198,9 +2257,6 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs)
 {
static cpumask_t off_cpus;
static nodemask_t off_mems;
-   bool is_empty;
-   bool sane = cgroup_sane_behavior(cs->css.cgroup);
-
 retry:
wait_event(cpuset_attach_wq, cs->attach_in_progress == 0);
 
@@ -2218,60 +2274,12 @@ retry:
cpumask_andnot(&off_cpus, cs->effective_cpus, top_cpuset.effective_cpus);
nodes_andnot(off_mems, cs->effective_mems, top_cpuset.effective_mems);
 
-   mutex_lock(&callback_mutex);
-   if (!sane)
-   cpumask_andnot(cs->cpus_allowed, cs->cpus_allowed, &off_cpus);
-
-   /* Inherit the effective mask of the parent, if it becomes empty. */
-   cpumask_andnot(cs->effective_cpus, cs->effective_cpus, &off_cpus);
-   if (sane && cpumask_empty(cs->effective_cpus))
-   cpumask_copy(cs->effective_cpus, parent_cs(cs)->effective_cpus);
-   mutex_unlock(&callback_mutex);
-
-   /*
-* If sane_behavior flag is set, we need to update tasks' cpumask
-* for empty cpuset to take on ancestor's cpumask. Otherwise, don't
-* call update_tasks_cpumask() if the cpuset becomes empty, as
-* the tasks in it will be migrated to an ancestor.
-*/
-   if ((sane && cpumask_empty(cs->cpus_allowed)) ||
-   (!cpumask_empty(&off_cpus) && !cpumask_empty(cs->cpus_allowed)))
-   update_tasks_cpumask(cs, NULL);
-
-   mutex_lock(&callback_mutex);
-   if (!sane)
-   nodes_andnot(cs->mems_allowed, cs->mems_allowed, off_mems);
-
-   /* Inherit the effective mask of the parent, if it becomes empty */
-   nodes_andnot(cs->effective_mems, cs->effective_mems, off_mems);
-   if (sane && nodes_empty(cs->effective_mems))
-   cs->effective_mems = parent_cs(cs)->effective_mems;
-   mutex_unlock(&callback_mutex);
-
-   /*
-* If sane_behavior flag is set, we n

[PATCH v2 10/12] cpuset: enable onlined cpu/node in effective masks

2013-10-11 Thread Li Zefan
Firstly offline cpu1:

  # echo 0-1 > cpuset.cpus
  # echo 0 > /sys/devices/system/cpu/cpu1/online
  # cat cpuset.cpus
  0-1
  # cat cpuset.effective_cpus
  0

Then online it:

  # echo 1 > /sys/devices/system/cpu/cpu1/online
  # cat cpuset.cpus
  0-1
  # cat cpuset.effective_cpus
  0-1

And cpuset will bring it back to the effective mask.

The implementation is quite straightforward. Instead of calculating the
offlined cpus/mems and doing updates, we just set the new effective_mask
to online_mask & configured_mask, as sketched below.
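
A rough sketch of that recomputation (illustrative only; new_cpus is a
scratch cpumask_t and parent_cs() the parent lookup, as in the patch):

  cpumask_and(&new_cpus, cs->cpus_allowed, parent_cs(cs)->effective_cpus);
  if (cpumask_empty(&new_cpus))
          cpumask_copy(&new_cpus, parent_cs(cs)->effective_cpus);
  cpumask_copy(cs->effective_cpus, &new_cpus);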

This is a behavior change for sane_behavior, so !sane_behavior won't
be affected.

v2:
- made refactoring of cpuset_hotplug_update_tasks() a separate patch,
suggested by Tejun.
- make hotplug_update_tasks_insane() use @new_cpus and @new_mems as
hotplug_update_tasks_sane() does.

Signed-off-by: Li Zefan 
---
 kernel/cpuset.c | 64 -
 1 file changed, 36 insertions(+), 28 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index cefc8f4..e71c04f 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -2186,26 +2186,27 @@ static void remove_tasks_in_empty_cpuset(struct cpuset *cs)
}
 }
 
-static void hotplug_update_tasks_insane(struct cpuset *cs,
-   struct cpumask *off_cpus,
-   nodemask_t *off_mems)
+static void
+hotplug_update_tasks_insane(struct cpuset *cs,
+   struct cpumask *new_cpus, nodemask_t *new_mems,
+   bool cpus_updated, bool mems_updated)
 {
bool is_empty;
 
mutex_lock(&callback_mutex);
-   cpumask_andnot(cs->cpus_allowed, cs->cpus_allowed, off_cpus);
-   cpumask_andnot(cs->effective_cpus, cs->effective_cpus, off_cpus);
-   nodes_andnot(cs->mems_allowed, cs->mems_allowed, *off_mems);
-   nodes_andnot(cs->effective_mems, cs->effective_mems, *off_mems);
+   cpumask_copy(cs->cpus_allowed, new_cpus);
+   cpumask_copy(cs->effective_cpus, new_cpus);
+   cs->mems_allowed = *new_mems;
+   cs->effective_mems = *new_mems;
mutex_unlock(&callback_mutex);
 
/*
 * Don't call update_tasks_cpumask() if the cpuset becomes empty,
* as the tasks will be migrated to an ancestor.
 */
-   if (!cpumask_empty(off_cpus) && !cpumask_empty(cs->cpus_allowed))
+   if (cpus_updated && !cpumask_empty(cs->cpus_allowed))
update_tasks_cpumask(cs, NULL);
-   if (!nodes_empty(*off_mems) && !nodes_empty(cs->mems_allowed))
+   if (mems_updated && !nodes_empty(cs->mems_allowed))
update_tasks_nodemask(cs, NULL);
 
is_empty = cpumask_empty(cs->cpus_allowed) ||
@@ -2224,24 +2225,24 @@ static void hotplug_update_tasks_insane(struct cpuset *cs,
mutex_lock(&cpuset_mutex);
 }
 
-static void hotplug_update_tasks_sane(struct cpuset *cs,
- struct cpumask *off_cpus,
- nodemask_t *off_mems)
+static void
+hotplug_update_tasks_sane(struct cpuset *cs,
+ struct cpumask *new_cpus, nodemask_t *new_mems,
+ bool cpus_updated, bool mems_updated)
 {
+   if (cpumask_empty(new_cpus))
+   cpumask_copy(new_cpus, parent_cs(cs)->effective_cpus);
+   if (nodes_empty(*new_mems))
+   *new_mems = parent_cs(cs)->effective_mems;
+
mutex_lock(&callback_mutex);
-   cpumask_andnot(cs->effective_cpus, cs->effective_cpus, off_cpus);
-   if (cpumask_empty(cs->effective_cpus))
-   cpumask_copy(cs->effective_cpus,
-parent_cs(cs)->effective_cpus);
-
-   nodes_andnot(cs->effective_mems, cs->effective_mems, *off_mems);
-   if (nodes_empty(cs->effective_mems))
-   cs->effective_mems = parent_cs(cs)->effective_mems;
+   cpumask_copy(cs->effective_cpus, new_cpus);
+   cs->effective_mems = *new_mems;
mutex_unlock(&callback_mutex);
 
-   if (!cpumask_empty(off_cpus))
+   if (cpus_updated)
update_tasks_cpumask(cs, NULL);
-   if (!nodes_empty(*off_mems))
+   if (mems_updated)
update_tasks_nodemask(cs, NULL);
 }
 
@@ -2255,8 +2256,10 @@ static void hotplug_update_tasks_sane(struct cpuset *cs,
  */
 static void cpuset_hotplug_update_tasks(struct cpuset *cs)
 {
-   static cpumask_t off_cpus;
-   static nodemask_t off_mems;
+   static cpumask_t new_cpus;
+   static nodemask_t new_mems;
+   bool cpus_updated;
+   bool mems_updated;
 retry:
wait_event(cpuset_attach_wq, cs->attach_in_progress == 0);
 
@@ -2271,13 +2274,18 @@ retry:
goto retry;
}
 
-   cpumask_andnot(&off_cpus, cs->effective_cpus, top_cpuset.effective_cpus);
-   nodes_andnot(off_mems, cs->effective_mems, top_cpuset.effec

[PATCH v2 11/12] cpuset: allow writing offlined masks to cpuset.cpus/mems

2013-10-11 Thread Li Zefan
As the configured masks won't be limited by its parent, and the top
cpuset's masks won't change when hotplug happens, it's natural to
allow writing offlined masks to the configured masks.

Signed-off-by: Li Zefan 
---
 kernel/cpuset.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index e71c04f..a98723d 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -960,7 +960,8 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
if (retval < 0)
return retval;
 
-   if (!cpumask_subset(trialcs->cpus_allowed, cpu_active_mask))
+   if (!cpumask_subset(trialcs->cpus_allowed,
+   top_cpuset.cpus_allowed))
return -EINVAL;
}
 
@@ -1238,8 +1239,8 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
goto done;
 
if (!nodes_subset(trialcs->mems_allowed,
-   node_states[N_MEMORY])) {
-   retval =  -EINVAL;
+ top_cpuset.mems_allowed)) {
+   retval = -EINVAL;
goto done;
}
}
-- 
1.8.0.2



[PATCH v2 12/12] cpuset: export effective masks to userspace

2013-10-11 Thread Li Zefan
cpuset.cpus and cpuset.mems are the configured masks, and we need
to export effective masks to userspace, so users know the real
cpus_allowed and mems_allowed that apply to the tasks in a cpuset.

cpuset.effective_cpus and cpuset.effective_mems will be created for
sane_behavior only.

v2:
- export those masks unconditionally, suggested by Tejun.

Signed-off-by: Li Zefan 
---
 kernel/cpuset.c | 34 --
 1 file changed, 28 insertions(+), 6 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index a98723d..c8ba514 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1614,6 +1614,8 @@ typedef enum {
FILE_MEMORY_MIGRATE,
FILE_CPULIST,
FILE_MEMLIST,
+   FILE_EFFECTIVE_CPULIST,
+   FILE_EFFECTIVE_MEMLIST,
FILE_CPU_EXCLUSIVE,
FILE_MEM_EXCLUSIVE,
FILE_MEM_HARDWALL,
@@ -1762,23 +1764,23 @@ out_unlock:
  * across a page fault.
  */
 
-static size_t cpuset_sprintf_cpulist(char *page, struct cpuset *cs)
+static size_t cpuset_sprintf_cpulist(char *page, struct cpumask *pmask)
 {
size_t count;
 
mutex_lock(&callback_mutex);
-   count = cpulist_scnprintf(page, PAGE_SIZE, cs->cpus_allowed);
+   count = cpulist_scnprintf(page, PAGE_SIZE, pmask);
mutex_unlock(&callback_mutex);
 
return count;
 }
 
-static size_t cpuset_sprintf_memlist(char *page, struct cpuset *cs)
+static size_t cpuset_sprintf_memlist(char *page, nodemask_t mask)
 {
size_t count;
 
mutex_lock(&callback_mutex);
-   count = nodelist_scnprintf(page, PAGE_SIZE, cs->mems_allowed);
+   count = nodelist_scnprintf(page, PAGE_SIZE, mask);
mutex_unlock(&callback_mutex);
 
return count;
@@ -1802,10 +1804,16 @@ static ssize_t cpuset_common_file_read(struct cgroup_subsys_state *css,
 
switch (type) {
case FILE_CPULIST:
-   s += cpuset_sprintf_cpulist(s, cs);
+   s += cpuset_sprintf_cpulist(s, cs->cpus_allowed);
break;
case FILE_MEMLIST:
-   s += cpuset_sprintf_memlist(s, cs);
+   s += cpuset_sprintf_memlist(s, cs->mems_allowed);
+   break;
+   case FILE_EFFECTIVE_CPULIST:
+   s += cpuset_sprintf_cpulist(s, cs->effective_cpus);
+   break;
+   case FILE_EFFECTIVE_MEMLIST:
+   s += cpuset_sprintf_memlist(s, cs->effective_mems);
break;
default:
retval = -EINVAL;
@@ -1880,6 +1888,13 @@ static struct cftype files[] = {
},
 
{
+   .name = "effective_cpus",
+   .read = cpuset_common_file_read,
+   .max_write_len = (100U + 6 * NR_CPUS),
+   .private = FILE_EFFECTIVE_CPULIST,
+   },
+
+   {
.name = "mems",
.read = cpuset_common_file_read,
.write_string = cpuset_write_resmask,
@@ -1888,6 +1903,13 @@ static struct cftype files[] = {
},
 
{
+   .name = "effective_mems",
+   .read = cpuset_common_file_read,
+   .max_write_len = (100U + 6 * MAX_NUMNODES),
+   .private = FILE_EFFECTIVE_MEMLIST,
+   },
+
+   {
.name = "cpu_exclusive",
.read_u64 = cpuset_read_u64,
.write_u64 = cpuset_write_u64,
-- 
1.8.0.2



Re: cgroup_attach_task && while_each_thread (Was: cgroup attach task - slogging cpu)

2013-10-11 Thread Li Zefan
On 2013/10/10 0:54, Oleg Nesterov wrote:
> And I am starting to think that this change should also fix the
> while_each_thread() problems in this particular case.
> 
> In general the code like
> 
>   rcu_read_lock();
>   task = find_get_task(...);
>   rcu_read_unlock();
> 
>   rcu_read_lock();
>   t = task;
>   do {
>   ...
>   } while_each_thread (task, t);
>   rcu_read_unlock();
> 
> is wrong even if while_each_thread() was correct (and we have a lot
> of examples of this pattern). A GP can pass before the 2nd rcu-lock,
> and we simply can't trust ->thread_group.next.
> 
> But I didn't notice that cgroup_attach_task(tsk, threadgroup) can only
> be called with threadgroup == T when a) tsk is ->group_leader and b)
> we hold threadgroup_lock() which blocks de_thread(). IOW, in this case
> "tsk" can't be removed from ->thread_group list before other threads.
> 
> If next_thread() sees thread_group.next != leader, we know that
> the .next thread didn't do __unhash_process() yet, and since we
> know that in this case "leader" didn't do this too we are safe.
> 
> In short: __unhash_process(leader) (in this case) can never change
> ->thread_group.next of another thread, because leader->thread_group
> should be already list_empty().
> 

If threadgroup == false, and if the tsk is exiting or is already in
the targeted cgroup, we won't break the loop due to the bug but do
this:

  while_each_thread(task, t)

If @task isn't the leader, we might get stuck in the loop?



[PATCH] cgroup: fix to break the while loop in cgroup_attach_task() correctly

2013-10-11 Thread Li Zefan
From: Anjana V Kumar 

Both Anjana and Eunki reported a stall in the while_each_thread loop
in cgroup_attach_task().

It's because, when we attach a single thread to a cgroup, if the thread
is exiting or is already in that cgroup, we won't break the loop.

If the task is already in the cgroup, the bug can lead to another thread
being attached to the cgroup unexpectedly:

  # echo 5207 > tasks
  # cat tasks
  5207
  # echo 5207 > tasks
  # cat tasks
  5207
  5215

What's worse, if the task to be attached isn't the leader of the thread
group, we might never exit the loop, hence the cpu stall. Thanks to Oleg
for the analysis.
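
The control-flow pitfall is plain C, independent of cgroup internals: in
a do-while loop, continue jumps to the loop condition rather than out of
the loop, so it skips any break placed later in the body. A standalone
illustration (not kernel code):

  #include <stdio.h>

  int main(void)
  {
          int i = 0;
          do {
                  if (i == 0) {
                          i++;
                          continue;       /* jumps to the while condition... */
                  }
                  break;                  /* ...so this break is skipped */
          } while (i < 10);
          printf("i = %d\n", i);          /* prints 1: a second pass ran */
          return 0;
  }

In cgroup_attach_task() the condition is while_each_thread(), which
advances to the next thread, hence the possibly endless walk.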

This bug was introduced by commit 081aa458c38ba576bdd4265fc807fa95b48b9e79
("cgroup: consolidate cgroup_attach_task() and cgroup_attach_proc()")

Cc:  # 3.9+
Reported-by: Eunki Kim 
Reported-by: Anjana V Kumar 
Signed-off-by: Anjana V Kumar 
[ lizf: - fixed the first continue, pointed out by Oleg,
- rewrote changelog. ]
Signed-off-by: Li Zefan 
---
 kernel/cgroup.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index a5629f1..3db1d2e 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2002,7 +2002,7 @@ static int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk,
 
/* @tsk either already exited or can't exit until the end */
if (tsk->flags & PF_EXITING)
-   continue;
+   goto next;
 
/* as per above, nr_threads may decrease, but not increase. */
BUG_ON(i >= group_size);
@@ -2010,7 +2010,7 @@ static int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk,
ent.cgrp = task_cgroup_from_root(tsk, root);
/* nothing to do if this task is already in the cgroup */
if (ent.cgrp == cgrp)
-   continue;
+   goto next;
/*
 * saying GFP_ATOMIC has no effect here because we did prealloc
 * earlier, but it's good form to communicate our expectations.
@@ -2018,7 +2018,7 @@ static int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk,
retval = flex_array_put(group, i, , GFP_ATOMIC);
BUG_ON(retval != 0);
i++;
-
+next:
if (!threadgroup)
break;
} while_each_thread(leader, tsk);
-- 
1.8.0.2



Re: [PATCH] slub: Fix calculation of cpu slabs

2013-10-14 Thread Li Zefan
Hi Pekka,

could you pick up this patch?

On 2013/9/11 5:06, Christoph Lameter wrote:
> On Tue, 10 Sep 2013, Li Zefan wrote:
> 
>> We should use page->pages instead of page->pobjects when calculating
>> the number of cpu partial slabs. This also fixes the mapping of slabs
>> and nodes.
> 
> True.
> 
>> As there's no variable storing the number of total/active objects in
>> cpu partial slabs, and we don't have user interfaces requiring those
>> statistics, I just add WARN_ON for those cases.
> 
> 
> Well that is not strictly required but it does not hurt either.
> 
> Acked-by: Christoph Lameter 
> 



[PATCH v2] hung_task: change sysctl_hung_task_check_count to int

2013-09-23 Thread Li Zefan
As sysctl_hung_task_check_count is unsigned long, when this value is
assigned to max_count in check_hung_uninterruptible_tasks(), it's
truncated to int type.

Therefore if we write 2^32 to sysctl.hung_task_check_count, hung task
detection will be effectively disabled.
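
A standalone illustration of the truncation (ordinary C, assuming a
64-bit machine where unsigned long is wider than int):

  #include <stdio.h>

  int main(void)
  {
          unsigned long check_count = 4294967296UL;       /* 2^32 */
          int max_count = check_count;    /* keeps the low 32 bits: 0 */

          printf("max_count = %d\n", max_count);          /* prints 0 */
          return 0;
  }

With max_count == 0, the checking loop never examines a single task.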

With this fix, it will still truncate the user input, but reading
sysctl.hung_task_check_count reflects the truncated value.

v2: limit lowest value to 0.

Signed-off-by: Li Zefan 
Acked-by: Ingo Molnar 
---
 include/linux/sched/sysctl.h | 2 +-
 kernel/hung_task.c   | 2 +-
 kernel/sysctl.c  | 5 +++--
 3 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index bf8086b..9552afa 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -2,8 +2,8 @@
 #define _SCHED_SYSCTL_H
 
 #ifdef CONFIG_DETECT_HUNG_TASK
+extern int  sysctl_hung_task_check_count;
 extern unsigned int  sysctl_hung_task_panic;
-extern unsigned long sysctl_hung_task_check_count;
 extern unsigned long sysctl_hung_task_timeout_secs;
 extern unsigned long sysctl_hung_task_warnings;
 extern int proc_dohung_task_timeout_secs(struct ctl_table *table, int write,
diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index 3e97fb1..0422523 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -20,7 +20,7 @@
 /*
  * The number of tasks checked:
  */
-unsigned long __read_mostly sysctl_hung_task_check_count = PID_MAX_LIMIT;
+int __read_mostly sysctl_hung_task_check_count = PID_MAX_LIMIT;
 
 /*
  * Limit number of tasks checked in a batch.
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 7453418..bcb35ab 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -962,9 +962,10 @@ static struct ctl_table kern_table[] = {
{
.procname   = "hung_task_check_count",
.data   = &sysctl_hung_task_check_count,
-   .maxlen = sizeof(unsigned long),
+   .maxlen = sizeof(int),
.mode   = 0644,
-   .proc_handler   = proc_doulongvec_minmax,
+   .proc_handler   = proc_dointvec_minmax,
+   .extra1 = &zero,
},
{
.procname   = "hung_task_timeout_secs",
-- 
1.8.0.2


[PATCH] jffs2: fix unbalanced locking

2013-09-23 Thread Li Zefan
In the failure path in jffs2_do_crccheck_inode() the lock isn't released
before returning.

This probably won't cause a real bug, because the structure that contains
the lock is freed in this case.

Signed-off-by: Li Zefan 
---
 fs/jffs2/readinode.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/jffs2/readinode.c b/fs/jffs2/readinode.c
index ae81b01..55cf63d 100644
--- a/fs/jffs2/readinode.c
+++ b/fs/jffs2/readinode.c
@@ -1425,7 +1425,8 @@ int jffs2_do_crccheck_inode(struct jffs2_sb_info *c, struct jffs2_inode_cache *ic)
jffs2_do_clear_inode(c, f);
}
jffs2_xattr_do_crccheck_inode(c, ic);
-   kfree (f);
+   mutex_unlock(&f->sem);
+   kfree(f);
return ret;
 }
 
-- 
1.8.0.2


[PATCH v6 0/5] memcg, cgroup: kill css id

2013-09-23 Thread Li Zefan
Hi Andrew,

The whole patchset has been acked and reviewed by Michal and Tejun.
Could you merge it into mm tree?

===

This patchset converts memcg to use cgroup->id, and then we remove cgroup
css_id.

As we've removed memcg's own refcnt, converting memcg to use cgroup->id
is very straight-forward.

v6:
- rebased against mmotm 2013-09-20-15-59
- moved cgroup id check from mem_cgroup_css_alloc() to mem_cgroup_css_online()

v5:
- rebased against mmotm 2013-08-07-16-55

v4:
- make cgroup_from_id() inline and check if cgroup_mutex is held.
- add a comment for idr_remove() in cgroup_offline_fn().

v2->v3:
- some minor cleanups suggested by Michal.
- fixed the call to idr_alloc() in cgroup_init() in the first patch.

Li Zefan (5):
  memcg: convert to use cgroup_is_descendant()
  memcg: convert to use cgroup id
  memcg: fail to create cgroup if the cgroup id is too big
  memcg: stop using css id
  cgroup: kill css_id

 include/linux/cgroup.h |  37 
 kernel/cgroup.c| 248 +
 mm/memcontrol.c|  66 +++--
 3 files changed, 41 insertions(+), 310 deletions(-)

-- 
1.8.0.2


[PATCH v6 1/5] memcg: convert to use cgroup_is_descendant()

2013-09-23 Thread Li Zefan
This is a preparation to kill css_id.

Signed-off-by: Li Zefan 
Acked-by: Michal Hocko 
---
 mm/memcontrol.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7dda769..9117249 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1405,7 +1405,7 @@ bool __mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg,
return true;
if (!root_memcg->use_hierarchy || !memcg)
return false;
-   return css_is_ancestor(&memcg->css, &root_memcg->css);
+   return cgroup_is_descendant(memcg->css.cgroup, root_memcg->css.cgroup);
 }
 
 static bool mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg,
-- 
1.8.0.2



[PATCH v6 4/5] memcg: stop using css id

2013-09-23 Thread Li Zefan
Now memcg uses cgroup id instead of css id. Update some comments and
set mem_cgroup_subsys->use_id to 0.

Signed-off-by: Li Zefan 
Acked-by: Michal Hocko 
---
 mm/memcontrol.c | 23 ---
 1 file changed, 8 insertions(+), 15 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4e40ebe..32b2d33 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -592,16 +592,11 @@ static void disarm_sock_keys(struct mem_cgroup *memcg)
 #ifdef CONFIG_MEMCG_KMEM
 /*
  * This will be the memcg's index in each cache's ->memcg_params->memcg_caches.
- * There are two main reasons for not using the css_id for this:
- *  1) this works better in sparse environments, where we have a lot of memcgs,
- * but only a few kmem-limited. Or also, if we have, for instance, 200
- * memcgs, and none but the 200th is kmem-limited, we'd have to have a
- * 200 entry array for that.
- *
- *  2) In order not to violate the cgroup API, we would like to do all memory
- * allocation in ->create(). At that point, we haven't yet allocated the
- * css_id. Having a separate index prevents us from messing with the cgroup
- * core for this
+ * The main reason for not using cgroup id for this:
+ *  this works better in sparse environments, where we have a lot of memcgs,
+ *  but only a few kmem-limited. Or also, if we have, for instance, 200
+ *  memcgs, and none but the 200th is kmem-limited, we'd have to have a
+ *  200 entry array for that.
  *
  * The current size of the caches array is stored in
  * memcg_limited_groups_array_size.  It will double each time we have to
@@ -616,14 +611,14 @@ int memcg_limited_groups_array_size;
  * cgroups is a reasonable guess. In the future, it could be a parameter or
  * tunable, but that is strictly not necessary.
  *
- * MAX_SIZE should be as large as the number of css_ids. Ideally, we could get
+ * MAX_SIZE should be as large as the number of cgrp_ids. Ideally, we could get
  * this constant directly from cgroup, but it is understandable that this is
  * better kept as an internal representation in cgroup.c. In any case, the
- * css_id space is not getting any smaller, and we don't have to necessarily
+ * cgrp_id space is not getting any smaller, and we don't have to necessarily
  * increase ours as well if it increases.
  */
 #define MEMCG_CACHES_MIN_SIZE 4
-#define MEMCG_CACHES_MAX_SIZE 65535
+#define MEMCG_CACHES_MAX_SIZE MEM_CGROUP_ID_MAX
 
 /*
  * A lot of the calls to the cache allocation functions are expected to be
@@ -6215,7 +6210,6 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
size_t size = memcg_size();
 
mem_cgroup_remove_from_trees(memcg);
-   free_css_id(&mem_cgroup_subsys, &memcg->css);
 
for_each_node(node)
free_mem_cgroup_per_zone_info(memcg, node);
@@ -7012,7 +7006,6 @@ struct cgroup_subsys mem_cgroup_subsys = {
.bind = mem_cgroup_bind,
.base_cftypes = mem_cgroup_files,
.early_init = 0,
-   .use_id = 1,
 };
 
 #ifdef CONFIG_MEMCG_SWAP
-- 
1.8.0.2


[PATCH v6 2/5] memcg: convert to use cgroup id

2013-09-23 Thread Li Zefan
Use cgroup id instead of css id. This is a preparation to kill css id.

Note: as memcg treats 0 as an invalid id, while cgroup ids start at 0,
we define memcg_id == cgroup_id + 1.

Signed-off-by: Li Zefan 
Acked-by: Michal Hocko 
---
 mm/memcontrol.c | 32 ++--
 1 file changed, 22 insertions(+), 10 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9117249..6719e2c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -498,6 +498,23 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
return (memcg == root_mem_cgroup);
 }
 
+static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
+{
+   /*
+* The ID of the root cgroup is 0, but memcg treat 0 as an
+* invalid ID, so we return (cgroup_id + 1).
+*/
+   return memcg->css.cgroup->id + 1;
+}
+
+static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
+{
+   struct cgroup_subsys_state *css;
+
css = css_from_id(id - 1, &mem_cgroup_subsys);
+   return mem_cgroup_from_css(css);
+}
+
 /* Writing them here to avoid exposing memcg's inner layout */
 #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
 
@@ -2850,15 +2867,10 @@ static void __mem_cgroup_cancel_local_charge(struct mem_cgroup *memcg,
  */
 static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
 {
-   struct cgroup_subsys_state *css;
-
/* ID 0 is unused ID */
if (!id)
return NULL;
-   css = css_lookup(&mem_cgroup_subsys, id);
-   if (!css)
-   return NULL;
-   return mem_cgroup_from_css(css);
+   return mem_cgroup_from_id(id);
 }
 
 struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
@@ -4368,7 +4380,7 @@ mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout)
 * css_get() was called in uncharge().
 */
if (do_swap_account && swapout && memcg)
-   swap_cgroup_record(ent, css_id(&memcg->css));
+   swap_cgroup_record(ent, mem_cgroup_id(memcg));
 }
 #endif
 
@@ -4420,8 +4432,8 @@ static int mem_cgroup_move_swap_account(swp_entry_t entry,
 {
unsigned short old_id, new_id;
 
-   old_id = css_id(&from->css);
-   new_id = css_id(&to->css);
+   old_id = mem_cgroup_id(from);
+   new_id = mem_cgroup_id(to);
 
if (swap_cgroup_cmpxchg(entry, old_id, new_id) == old_id) {
mem_cgroup_swap_statistics(from, false);
@@ -6571,7 +6583,7 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
}
/* There is a swap entry and a page doesn't exist or isn't charged */
if (ent.val && !ret &&
-   css_id(&mc.from->css) == lookup_swap_cgroup_id(ent)) {
+   mem_cgroup_id(mc.from) == lookup_swap_cgroup_id(ent)) {
ret = MC_TARGET_SWAP;
if (target)
target->ent = ent;
-- 
1.8.0.2
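
The only subtle point in the conversion is the off-by-one at the boundary.
A small stand-alone sketch (plain C, with hypothetical helpers mirroring
mem_cgroup_id()/mem_cgroup_from_id()) of the property the patch depends on:

  #include <assert.h>
  #include <stdio.h>

  /* Hypothetical userspace mirror of the mapping: cgroup ids start at 0,
   * memcg reserves 0 as "invalid", hence memcg_id = cgroup_id + 1. */
  static unsigned short to_memcg_id(int cgroup_id)
  {
          return (unsigned short)(cgroup_id + 1);
  }

  static int to_cgroup_id(unsigned short memcg_id)
  {
          return (int)memcg_id - 1;
  }

  int main(void)
  {
          /* The root cgroup (id 0) maps to memcg id 1, keeping 0 free
           * as the invalid id... */
          assert(to_memcg_id(0) == 1);

          /* ...and the mapping round-trips for every valid cgroup id. */
          for (int id = 0; id < 65535; id++)
                  assert(to_cgroup_id(to_memcg_id(id)) == id);

          printf("memcg id mapping round-trips\n");
          return 0;
  }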



[PATCH v6 3/5] memcg: fail to create cgroup if the cgroup id is too big

2013-09-23 Thread Li Zefan
memcg requires the cgroup id to be smaller than 65536.

This is a preparation to kill css id.

Signed-off-by: Li Zefan 
Acked-by: Michal Hocko 
---
 mm/memcontrol.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6719e2c..4e40ebe 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -498,6 +498,12 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
return (memcg == root_mem_cgroup);
 }
 
+/*
+ * We restrict the id in the range of [1, 65535], so it can fit into
+ * an unsigned short.
+ */
+#define MEM_CGROUP_ID_MAX  USHRT_MAX
+
 static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
 {
/*
@@ -6312,6 +6318,9 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
struct mem_cgroup *parent = mem_cgroup_from_css(css_parent(css));
int error = 0;
 
+   if (css->cgroup->id > MEM_CGROUP_ID_MAX)
+   return -ENOSPC;
+
if (!parent)
return 0;
 
-- 
1.8.0.2
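
The bound itself is just the capacity of an unsigned short. A trivial
sketch (hypothetical check_id(), with -ENOSPC hard-coded as -28 purely
for illustration) of the failure rule the hunk adds:

  #include <limits.h>
  #include <stdio.h>

  /* Hypothetical mirror of the online check: any cgroup id that does
   * not fit in the unsigned short fields memcg uses (swap records,
   * kmem cache indices) must make cgroup creation fail early. */
  static int check_id(int cgroup_id)
  {
          return cgroup_id > USHRT_MAX ? -28 /* -ENOSPC on Linux */ : 0;
  }

  int main(void)
  {
          printf("%d %d\n", check_id(65535), check_id(65536)); /* 0 -28 */
          return 0;
  }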



[PATCH v6 5/5] cgroup: kill css_id

2013-09-23 Thread Li Zefan
The only user of css_id was memcg, and it has been converted to use
cgroup->id, so kill css_id.

Signed-off-by: Li Zefan 
Reviewed-by: Michal Hocko 
Acked-by: Tejun Heo 
---
 include/linux/cgroup.h |  37 
 kernel/cgroup.c| 248 +
 2 files changed, 1 insertion(+), 284 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 3561d30..39c1d94 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -612,11 +612,6 @@ struct cgroup_subsys {
int subsys_id;
int disabled;
int early_init;
-   /*
-* True if this subsys uses ID. ID is not available before cgroup_init()
-* (not available in early_init time.)
-*/
-   bool use_id;
 
/*
 * If %false, this subsystem is properly hierarchical -
@@ -642,9 +637,6 @@ struct cgroup_subsys {
 */
struct cgroupfs_root *root;
struct list_head sibling;
-   /* used when use_id == true */
-   struct idr idr;
-   spinlock_t id_lock;
 
/* list of cftype_sets */
struct list_head cftsets;
@@ -875,35 +867,6 @@ int css_scan_tasks(struct cgroup_subsys_state *css,
 int cgroup_attach_task_all(struct task_struct *from, struct task_struct *);
 int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from);
 
-/*
- * CSS ID is ID for cgroup_subsys_state structs under subsys. This only works
- * if cgroup_subsys.use_id == true. It can be used for looking up and scanning.
- * CSS ID is assigned at cgroup allocation (create) automatically
- * and removed when subsys calls free_css_id() function. This is because
- * the lifetime of cgroup_subsys_state is subsys's matter.
- *
- * Looking up and scanning function should be called under rcu_read_lock().
- * Taking cgroup_mutex is not necessary for following calls.
- * But the css returned by this routine can be "not populated yet" or "being
- * destroyed". The caller should check css and cgroup's status.
- */
-
-/*
- * Typically Called at ->destroy(), or somewhere the subsys frees
- * cgroup_subsys_state.
- */
-void free_css_id(struct cgroup_subsys *ss, struct cgroup_subsys_state *css);
-
-/* Find a cgroup_subsys_state which has given ID */
-
-struct cgroup_subsys_state *css_lookup(struct cgroup_subsys *ss, int id);
-
-/* Returns true if root is ancestor of cg */
-bool css_is_ancestor(struct cgroup_subsys_state *cg,
-const struct cgroup_subsys_state *root);
-
-/* Get id and depth of css */
-unsigned short css_id(struct cgroup_subsys_state *css);
 struct cgroup_subsys_state *css_from_dir(struct dentry *dentry,
 struct cgroup_subsys *ss);
 
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index b3ffe76..41f71c3 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -125,38 +125,6 @@ struct cfent {
 };
 
 /*
- * CSS ID -- ID per subsys's Cgroup Subsys State(CSS). used only when
- * cgroup_subsys->use_id != 0.
- */
-#define CSS_ID_MAX (65535)
-struct css_id {
-   /*
-* The css to which this ID points. This pointer is set to valid value
-* after cgroup is populated. If cgroup is removed, this will be NULL.
-* This pointer is expected to be RCU-safe because destroy()
-* is called after synchronize_rcu(). But for safe use, css_tryget()
-* should be used for avoiding race.
-*/
-   struct cgroup_subsys_state __rcu *css;
-   /*
-* ID of this css.
-*/
-   unsigned short id;
-   /*
-* Depth in hierarchy which this ID belongs to.
-*/
-   unsigned short depth;
-   /*
-* ID is freed by RCU. (and lookup routine is RCU safe.)
-*/
-   struct rcu_head rcu_head;
-   /*
-* Hierarchy of CSS ID belongs to.
-*/
-   unsigned short stack[0]; /* Array of Length (depth+1) */
-};
-
-/*
  * cgroup_event represents events which userspace want to receive.
  */
 struct cgroup_event {
@@ -387,9 +355,6 @@ struct cgrp_cset_link {
 static struct css_set init_css_set;
 static struct cgrp_cset_link init_cgrp_cset_link;
 
-static int cgroup_init_idr(struct cgroup_subsys *ss,
-  struct cgroup_subsys_state *css);
-
 /*
  * css_set_lock protects the list of css_set objects, and the chain of
  * tasks off each css_set.  Nests outside task->alloc_lock due to
@@ -841,8 +806,6 @@ static struct backing_dev_info cgroup_backing_dev_info = {
.capabilities   = BDI_CAP_NO_ACCT_AND_WRITEBACK,
 };
 
-static int alloc_css_id(struct cgroup_subsys_state *child_css);
-
 static struct inode *cgroup_new_inode(umode_t mode, struct super_block *sb)
 {
struct inode *inode = new_inode(sb);
@@ -4240,21 +4203,6 @@ static int cgroup_populate_dir(struct cgroup *cgrp, unsigned long subsys_mask)
goto err;
}
}
-
-   /* This cgroup is ready now */
-   for

Re: [PATCH v6 0/5] memcg, cgroup: kill css id

2013-09-23 Thread Li Zefan
On 2013/9/23 21:12, Tejun Heo wrote:
> On Mon, Sep 23, 2013 at 09:08:16AM -0400, Tejun Heo wrote:
>> Hello,
>>
>> On Mon, Sep 23, 2013 at 04:55:20PM +0800, Li Zefan wrote:
>>> The whole patchset has been acked and reviewed by Michal and Tejun.
>>> Could you merge it into mm tree?
>>
>> Ah... I really hoped that this had been merged during -rc1 window.
>> Andrew, would it be okay to carry this series through cgroup tree?  It
>> doesn't really have much to do with mm proper and it's a PITA to have
>> to keep updating css_id code from cgroup side when it's scheduled to
>> go away.  If carried in -mm, it's likely to cause conflicts with
>> ongoing cgroup changes too.

I would love to see this patchset go through the cgroup tree. The changes to
memcg are quite small, and as the -mm tree is based on -next it won't cause
future conflicts.

> 
> Also, wasn't this already in -mm during the last devel cycle?  ISTR
> conflicts with it in -mm with other cgroup core changes.  Is there any
> specific reason why this wasn't merged during the merge windw?
> 

No, it never went into -mm tree... I guess it's because Andrew was too
busy and overlooked this patchset?



[PATCH] slub: Fix calculation of cpu slabs

2013-09-09 Thread Li Zefan
  /sys/kernel/slab/:t-048 # cat cpu_slabs
  231 N0=16 N1=215
  /sys/kernel/slab/:t-048 # cat slabs
  145 N0=36 N1=109

See, the number of slabs is smaller than that of cpu slabs.

The bug was introduced by commit 49e2258586b423684f03c278149ab46d8f8b6700
("slub: per cpu cache for partial pages").

We should use page->pages instead of page->pobjects when calculating
the number of cpu partial slabs. This also fixes the mapping of slabs
and nodes.

As there's no variable storing the number of total/active objects in
cpu partial slabs, and we don't have user interfaces requiring those
statistics, I just add WARN_ON for those cases.

Cc: <stable@vger.kernel.org> # 3.2+
Signed-off-by: Li Zefan 
---
 mm/slub.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/slub.c b/mm/slub.c
index e3ba1f2..6ea461d 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4300,7 +4300,13 @@ static ssize_t show_slab_objects(struct kmem_cache *s,
 
page = ACCESS_ONCE(c->partial);
if (page) {
-   x = page->pobjects;
+   node = page_to_nid(page);
+   if (flags & SO_TOTAL)
+   WARN_ON_ONCE(1);
+   else if (flags & SO_OBJECTS)
+   WARN_ON_ONCE(1);
+   else
+   x = page->pages;
total += x;
nodes[node] += x;
}
-- 
1.8.0.2
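
The distinction at the heart of the bug is pages versus objects. A toy
model (hypothetical struct and made-up numbers, not the slub code) of why
summing pobjects into a slab count overstates it:

  #include <stdio.h>

  /* Hypothetical model of a per-cpu partial list: "pages" counts slab
   * pages on the list, "pobjects" approximates the free objects spread
   * across them.  A slab count must add pages, not pobjects. */
  struct partial_list {
          int pages;
          int pobjects;
  };

  int main(void)
  {
          struct partial_list c = { .pages = 3, .pobjects = 48 };

          printf("slabs this cpu really holds: %d\n", c.pages);
          printf("what the buggy sum added:    %d\n", c.pobjects);
          return 0;
  }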


Re: [PATCH trivial] block/ioctl.c: let code match 'kernel code style'

2013-09-03 Thread Li Zefan
Please don't. Pure coding style cleanup is discouraged.

You're not going to run checkpatch.pl on the whole kernel tree and fix
all the complaints, are you?

On 2013/9/3 16:29, Chen Gang wrote:
> For 'switch case', remove redundancy '\t' (also can let related lines
> within 80 columns) and remove redundancy empty lines, just like other
> 'switch case' which match 'kernel code style' within the file.
> 
> Let blkpg_ioctl() within 80 columns. Let 2nd line of blkdev_ioctl() and
> __blkdev_driver_ioctl() align 1st line parameter's start position, just
> like blk_ioctl_discard() and blk_ioctl_zeroout() within the file.
> 
> For is_unrecognized_ioctl(), can shrink the 'return' statement into one
> line (so can save 2 lines), it still matches 'kernel code style' and it
> is no conflict with others within the file.
> 
> 
> Signed-off-by: Chen Gang 
> ---
>  block/ioctl.c |  216 -
>  1 files changed, 105 insertions(+), 111 deletions(-)



Re: [PATCHSET cgroup/for-3.12] cgroup: factor out css creation into create_css()

2013-09-04 Thread Li Zefan
On 2013/8/29 5:03, Tejun Heo wrote:
> Hello,
> 
> For unified hierarchy, a css's (cgroup_subsys_state) lifetime will be
> different from that of the associated cgroup.  css's may be created
> and destroyed dynamically over the lifetime of a single cgroup.  The
> previous changes decoupled css destruction from cgroup's.  This
> patchset decouples css creation from cgroup's.
> 
> This patchset contains the following nine patches.
> 
>  0001-cgroup-fix-css-leaks-on-online_css-failure.patch
>  0002-cgroup-css-iterations-and-css_from_dir-are-safe-unde.patch
>  0003-cgroup-make-for_each_subsys-useable-under-cgroup_roo.patch
>  0004-cgroup-move-css_id-commit-from-cgroup_populate_dir-t.patch
>  0005-cgroup-reorder-operations-in-cgroup_create.patch
>  0006-cgroup-combine-css-handling-loops-in-cgroup_create.patch
>  0007-cgroup-factor-out-cgroup_subsys_state-creation-into-.patch
>  0008-cgroup-implement-for_each_css.patch
>  0009-cgroup-remove-for_each_root_subsys.patch
> 

Acked-by: Li Zefan 



Re: mistakes in code vs. maintainer flow mistakes

2013-07-22 Thread Li Zefan
> Perhaps what might help here is a kernel organizational chart.  A graph
> of who sends pull requests to Linus, and their subsystem maintainers.
> For example, in the USB "branch" there would be:
> 
>                         Linus Torvalds
>                 (Linux kernel release engineer)
>                               |
>                               |
>                       Greg Kroah-Hartman
>                             (USB)
>                               |
>                               |
>      -----------------------------------------------------
>      |                  |                |               |
>      |                  |                |               |
>  Sarah Sharp            |                |         Oliver Neukum
> (USB3 and USB core)     |                |   (USB NCM and auto-suspend)
>                         |                |
>                    Alan Stern            |
>        (EHCI/UHCI/OHCI and USB core)     |
>                                          |
>                                          |
>                                    Felipe Balbi
>                             (USB3 plat and USB gadget)
> 

Nice chart, except that the complete chart will no doubt break the
80-character limit. Actually, as the hierarchy is quite flat, I can't
imagine how long the longest line will be.



Re: [Ksummit-2013-discuss] [ATTEND] How to act on LKML

2013-07-22 Thread Li Zefan
On 2013/7/21 21:22, Ric Wheeler wrote:
> On 07/20/2013 01:04 PM, Ben Hutchings wrote:
>> On Fri, 2013-07-19 at 13:42 -0500, Felipe Contreras wrote:
>>> On Fri, Jul 19, 2013 at 7:08 AM, Ingo Molnar  wrote:
>>>>
>>>> * Felipe Contreras  wrote:
>>>>
>>>>> As Linus already pointed out, not everybody has to work with
>>>>> everybody.
>>>>
>>>> That's not the point though, the point is to potentially roughly double
>>>> the creative brain capacity of the Linux kernel project.
>>>
>>> Unfortunately that's impossible; we all know there aren't as many
>>> women programmers as there are men.
>> In some countries, though not all.
>>
>> But we also know (or should realise) that the gender ratio among
>> programmers in general is much less unbalanced than in some free
>> software communities including the Linux kernel developers.
>>
> 
> Just a couple of data points to add.
> 
> When I was in graduate school in Israel, we had more women doing their PhD
> than men. Not a huge sample, but it was interesting.
> 
> The counter sample is the number of coding women we have at Red Hat in the 
> kernel team. We are around zero per cent. Certainly a sign that we need to do 
> better, regardless of the broader community challenges...
> 

IT companies in China try to make sure there's at least one (most of the
time the result is just one) female developer/tester in a team, and a team
is ~10 people. The same goes for a kernel team, though there it's harder
to meet.

Don't know if the same strategy is applied in other countries.



Re: [Ksummit-2013-discuss] [ATTEND] How to act on LKML

2013-07-22 Thread Li Zefan
On 2013/7/23 9:39, Steven Rostedt wrote:
> On Tue, 2013-07-23 at 09:26 +0800, Li Zefan wrote:
> 
>> IT companies in China try to make sure there's at least one (most of the
>> time the result is just one) female developer/tester in a team, and a team
>> is ~10 people. The same goes for a kernel team, though there it's harder
>> to meet.
>>
>> Don't know if the same strategy is applied in other countries.
> 
> Just my observation, but it seems that I see more women in tech from the
> Asian countries than from the US.
> 
> Watching my two teenage daughters grow up here as well as their friends,
> the focus of our schools still seem more bent on being good in sports
> than in academics, and even worse for science. Sports for girls happen
> to be much more serious than when I was in school. Being a "nerd" for a
> boy is starting to get a bit more acceptance (see Big Bang Theory), but
> for girls they seem a bit more harsh. At least from what I can tell by
> watching how things are with my kids and their friends. One of the
> friends of my daughter, who does very well in school, hides her grades
> and "pretends" to be stupid. This is really a sad state of affairs if
> you ask me :-(
> 

In China it's the opposite. In college, girls like to stay in the school
library to study, and in general they get better scores than boys, and
they don't like sports. But being good at studying is not the same as
being good at programming, and in fact they are not keen on coding!

And I think IT companies in China tend to lower their requirements when
the job interviewee is female.



Re: [Ksummit-2013-discuss] KS Topic request: Handling the Stable kernel, let's dump the cc: stable tag

2013-07-22 Thread Li Zefan
On 2013/7/23 5:24, KOSAKI Motohiro wrote:
>> I review what I can, but recently have often missed the 2 day review
>> period.
>>
>> Review from the authors and maintainers is probably more valuable than
>> that from generalists on the stable list.
> 
> From the point of view of subsystem developers, the problem is that there
> are too many stable branches. I can't memorize how they differ from each
> other. Currently www.kernel.org lists two stable and five longterm
> branches. Who knows all of them? Eventually I only check them when I
> have enough spare time.

I only check those maintained by Greg.



[PATCH v2] cgroup: avoid accessing modular cgroup subsys structure without locking

2013-03-04 Thread Li Zefan
subsys[i] is set to NULL in cgroup_unload_subsys() at modular unload,
and that's protected by cgroup_mutex, and then the memory *subsys[i]
resides in will be freed.

So this is unsafe without any locking:

  if (!ss || ss->module)
  ...

v2:
- add a comment for enum cgroup_subsys_id
- simplify the comment in cgroup_exit()

Signed-off-by: Li Zefan 
---
 include/linux/cgroup.h | 17 ++---
 kernel/cgroup.c| 28 ++--
 2 files changed, 28 insertions(+), 17 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 75c6ec1..5f76829 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -44,14 +44,25 @@ extern void cgroup_unload_subsys(struct cgroup_subsys *ss);
 
 extern const struct file_operations proc_cgroup_operations;
 
-/* Define the enumeration of all builtin cgroup subsystems */
+/*
+ * Define the enumeration of all cgroup subsystems.
+ *
+ * We define ids for builtin subsystems and then modular ones.
+ */
 #define SUBSYS(_x) _x ## _subsys_id,
-#define IS_SUBSYS_ENABLED(option) IS_ENABLED(option)
 enum cgroup_subsys_id {
+#define IS_SUBSYS_ENABLED(option) IS_BUILTIN(option)
+#include <linux/cgroup_subsys.h>
+#undef IS_SUBSYS_ENABLED
+   CGROUP_BUILTIN_SUBSYS_COUNT,
+
+   __CGROUP_SUBSYS_TEMP_PLACEHOLDER = CGROUP_BUILTIN_SUBSYS_COUNT - 1,
+
+#define IS_SUBSYS_ENABLED(option) IS_MODULE(option)
#include <linux/cgroup_subsys.h>
+#undef IS_SUBSYS_ENABLED
CGROUP_SUBSYS_COUNT,
 };
-#undef IS_SUBSYS_ENABLED
 #undef SUBSYS
 
 /* Per-subsystem/per-cgroup state maintained by the system. */
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 9df799d..7a6c4c7 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4940,17 +4940,17 @@ void cgroup_post_fork(struct task_struct *child)
 * and addition to css_set.
 */
if (need_forkexit_callback) {
-   for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
+   /*
+* fork/exit callbacks are supported only for builtin
+* subsystems, and the builtin section of the subsys
+* array is immutable, so we don't need to lock the
+* subsys array here. On the other hand, modular section
+* of the array can be freed at module unload, so we
+* can't touch that.
+*/
+   for (i = 0; i < CGROUP_BUILTIN_SUBSYS_COUNT; i++) {
struct cgroup_subsys *ss = subsys[i];
 
-   /*
-* fork/exit callbacks are supported only for
-* builtin subsystems and we don't need further
-* synchronization as they never go away.
-*/
-   if (!ss || ss->module)
-   continue;
-
if (ss->fork)
ss->fork(child);
}
@@ -5015,13 +5015,13 @@ void cgroup_exit(struct task_struct *tsk, int run_callbacks)
tsk->cgroups = &init_css_set;
 
if (run_callbacks && need_forkexit_callback) {
-   for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
+   /*
+* fork/exit callbacks are supported only for builtin
+* subsystems, see cgroup_post_fork() for details.
+*/
+   for (i = 0; i < CGROUP_BUILTIN_SUBSYS_COUNT; i++) {
struct cgroup_subsys *ss = subsys[i];
 
-   /* modular subsystems can't use callbacks */
-   if (!ss || ss->module)
-   continue;
-
if (ss->exit) {
struct cgroup *old_cgrp =

rcu_dereference_raw(cg->subsys[i])->cgroup;
-- 
1.8.0.2
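
The enum trick above is compact but easy to misread. A stand-alone sketch
(hypothetical entries, same placeholder pattern) showing how the builtin
ids stay a dense prefix of the id space:

  #include <stdio.h>

  /* Hypothetical version of the id layout: builtin entries first, then
   * a placeholder that rewinds the implicit counter so the modular
   * entries continue right after the builtin ones. */
  enum subsys_id {
          BUILTIN_A,                                 /* 0 */
          BUILTIN_B,                                 /* 1 */
          BUILTIN_COUNT,                             /* 2 */

          __TEMP_PLACEHOLDER = BUILTIN_COUNT - 1,    /* rewind to 1 */

          MODULAR_C,                                 /* 2 again */
          TOTAL_COUNT,                               /* 3 */
  };

  int main(void)
  {
          /* [0, BUILTIN_COUNT) can be walked without locking; only
           * entries in [BUILTIN_COUNT, TOTAL_COUNT) may go away. */
          printf("builtin: %d, total: %d\n", BUILTIN_COUNT, TOTAL_COUNT);
          return 0;
  }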


[PATCH] cpuset: remove include of cgroup.h from cpuset.h

2013-03-04 Thread Li Zefan
We don't need to include cgroup.h in cpuset.h.

Signed-off-by: Li Zefan 
---
 include/linux/cpuset.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 8c8a60d..ccd1de8 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -11,7 +11,6 @@
#include <linux/sched.h>
#include <linux/cpumask.h>
#include <linux/nodemask.h>
-#include <linux/cgroup.h>
#include <linux/mm.h>
 
 #ifdef CONFIG_CPUSETS
-- 
1.8.0.2



[PATCH] perf: remove include of cgroup.h from perf_event.h

2013-03-04 Thread Li Zefan
Move struct perf_cgroup_info and perf_cgroup to kernel/events/core.c,
and then we can remove the include of cgroup.h.

Signed-off-by: Li Zefan 
---
 include/linux/perf_event.h | 18 +-
 kernel/events/core.c   | 15 +++
 2 files changed, 16 insertions(+), 17 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index e47ee46..8737e1c 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -21,7 +21,6 @@
  */
 
 #ifdef CONFIG_PERF_EVENTS
-# include <linux/cgroup.h>
# include <asm/perf_event.h>
# include <asm/local64.h>
 #endif
@@ -299,22 +298,7 @@ struct swevent_hlist {
 #define PERF_ATTACH_GROUP  0x02
 #define PERF_ATTACH_TASK   0x04
 
-#ifdef CONFIG_CGROUP_PERF
-/*
- * perf_cgroup_info keeps track of time_enabled for a cgroup.
- * This is a per-cpu dynamically allocated data structure.
- */
-struct perf_cgroup_info {
-   u64 time;
-   u64 timestamp;
-};
-
-struct perf_cgroup {
-   struct  cgroup_subsys_state css;
-   struct  perf_cgroup_info *info; /* timing info, one per cpu */
-};
-#endif
-
+struct perf_cgroup;
 struct ring_buffer;
 
 /**
diff --git a/kernel/events/core.c b/kernel/events/core.c
index b0cd865..5976a2a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -37,6 +37,7 @@
#include <linux/ftrace_event.h>
#include <linux/hw_breakpoint.h>
#include <linux/mm_types.h>
+#include <linux/cgroup.h>
 
 #include "internal.h"
 
@@ -234,6 +235,20 @@ static void perf_ctx_unlock(struct perf_cpu_context *cpuctx,
 #ifdef CONFIG_CGROUP_PERF
 
 /*
+ * perf_cgroup_info keeps track of time_enabled for a cgroup.
+ * This is a per-cpu dynamically allocated data structure.
+ */
+struct perf_cgroup_info {
+   u64 time;
+   u64 timestamp;
+};
+
+struct perf_cgroup {
+   struct cgroup_subsys_state  css;
+   struct perf_cgroup_info *info;
+};
+
+/*
  * Must ensure cgroup is pinned (css_get) before calling
  * this function. In other words, we cannot call this function
  * if there is no cgroup event for the current CPU context.
-- 
1.8.0.2



[PATCH] res_counter: remove include of cgroup.h from res_counter.h

2013-03-04 Thread Li Zefan
It's not needed at all.

Signed-off-by: Li Zefan 
---
 include/linux/res_counter.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 5ae8456..a83a849 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -13,7 +13,7 @@
  * info about what this counter is.
  */
 
-#include <linux/cgroup.h>
+#include <linux/spinlock.h>
 
 /*
  * The core object. the cgroup that wishes to account for some
-- 
1.8.0.2



[PATCH 0/9] sched: Shrink include/linux/sched.h

2013-03-05 Thread Li Zefan
While working on a cgroup patch which also touched include/linux/sched.h,
I found that some function/macro/structure declarations can be moved to
kernel/sched/sched.h, and some can even be removed entirely, so here's
the patchset.

The result is a reduction of ~200 LOC from include/linux/sched.h.

0001-sched-Remove-some-dummy-functions.patch
0002-sched-Remove-test_sd_parent.patch
0003-sched-Move-SCHED_LOAD_SHIFT-macros-to-kernel-sched-s.patch
0004-sched-Move-struct-sched_group-to-kernel-sched-sched..patch
0005-sched-Move-wake-flags-to-kernel-sched-sched.h.patch
0006-sched-Move-struct-sched_class-to-kernel-sched-sched..patch
0007-sched-Make-default_scale_freq_power-static.patch
0008-sched-Move-group-scheduling-functions-out-of-include.patch
0009-sched-Remove-double-declaration-of-root_task_group.patch

--
 include/linux/sched.h | 194 +-
 kernel/sched/core.c   |  14 ++--
 kernel/sched/fair.c   |   6 +-
 kernel/sched/sched.h  | 159 +++--
 4 files changed, 168 insertions(+), 205 deletions(-)



[PATCH 1/9] sched: Remove some dummy functions

2013-03-05 Thread Li Zefan
No one will call those functions if CONFIG_SCHED_DEBUG=n.

Signed-off-by: Li Zefan 
---
 include/linux/sched.h | 12 
 1 file changed, 12 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d35d2b6..2715fbb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -127,18 +127,6 @@ extern void proc_sched_show_task(struct task_struct *p, struct seq_file *m);
 extern void proc_sched_set_task(struct task_struct *p);
 extern void
 print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq);
-#else
-static inline void
-proc_sched_show_task(struct task_struct *p, struct seq_file *m)
-{
-}
-static inline void proc_sched_set_task(struct task_struct *p)
-{
-}
-static inline void
-print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
-{
-}
 #endif
 
 /*
-- 
1.8.0.2


[PATCH 2/9] sched: Remove test_sd_parent()

2013-03-05 Thread Li Zefan
It's unused.

Signed-off-by: Li Zefan 
---
 include/linux/sched.h | 9 -
 1 file changed, 9 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2715fbb..e880d7d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -959,15 +959,6 @@ extern void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 cpumask_var_t *alloc_sched_domains(unsigned int ndoms);
 void free_sched_domains(cpumask_var_t doms[], unsigned int ndoms);
 
-/* Test a flag in parent sched domain */
-static inline int test_sd_parent(struct sched_domain *sd, int flag)
-{
-   if (sd->parent && (sd->parent->flags & flag))
-   return 1;
-
-   return 0;
-}
-
 unsigned long default_scale_freq_power(struct sched_domain *sd, int cpu);
 unsigned long default_scale_smt_power(struct sched_domain *sd, int cpu);
 
-- 
1.8.0.2


[PATCH 4/9] sched: Move struct sched_group to kernel/sched/sched.h

2013-03-05 Thread Li Zefan
Move struct sched_group_power and sched_group and related inline
functions to kernel/sched/sched.h, as they are used internally only.

Signed-off-by: Li Zefan 
---
 include/linux/sched.h | 58 ++-
 kernel/sched/sched.h  | 56 +
 2 files changed, 58 insertions(+), 56 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f8826d0..0d64130 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -780,62 +780,6 @@ enum cpu_idle_type {
 
 extern int __weak arch_sd_sibiling_asym_packing(void);
 
-struct sched_group_power {
-   atomic_t ref;
-   /*
-* CPU power of this group, SCHED_LOAD_SCALE being max power for a
-* single CPU.
-*/
-   unsigned int power, power_orig;
-   unsigned long next_update;
-   /*
-* Number of busy cpus in this group.
-*/
-   atomic_t nr_busy_cpus;
-
-   unsigned long cpumask[0]; /* iteration mask */
-};
-
-struct sched_group {
-   struct sched_group *next;   /* Must be a circular list */
-   atomic_t ref;
-
-   unsigned int group_weight;
-   struct sched_group_power *sgp;
-
-   /*
-* The CPUs this group covers.
-*
-* NOTE: this field is variable length. (Allocated dynamically
-* by attaching extra space to the end of the structure,
-* depending on how many CPUs the kernel has booted up with)
-*/
-   unsigned long cpumask[0];
-};
-
-static inline struct cpumask *sched_group_cpus(struct sched_group *sg)
-{
-   return to_cpumask(sg->cpumask);
-}
-
-/*
- * cpumask masking which cpus in the group are allowed to iterate up the domain
- * tree.
- */
-static inline struct cpumask *sched_group_mask(struct sched_group *sg)
-{
-   return to_cpumask(sg->sgp->cpumask);
-}
-
-/**
- * group_first_cpu - Returns the first cpu in the cpumask of a sched_group.
- * @group: The group whose first cpu is to be returned.
- */
-static inline unsigned int group_first_cpu(struct sched_group *group)
-{
-   return cpumask_first(sched_group_cpus(group));
-}
-
 struct sched_domain_attr {
int relax_domain_level;
 };
@@ -846,6 +790,8 @@ struct sched_domain_attr {
 
 extern int sched_domain_level_max;
 
+struct sched_group;
+
 struct sched_domain {
/* These fields must be setup */
struct sched_domain *parent;/* top domain must be null terminated */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 709a30c..1a4a2b1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -572,6 +572,62 @@ static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
 DECLARE_PER_CPU(struct sched_domain *, sd_llc);
 DECLARE_PER_CPU(int, sd_llc_id);
 
+struct sched_group_power {
+   atomic_t ref;
+   /*
+* CPU power of this group, SCHED_LOAD_SCALE being max power for a
+* single CPU.
+*/
+   unsigned int power, power_orig;
+   unsigned long next_update;
+   /*
+* Number of busy cpus in this group.
+*/
+   atomic_t nr_busy_cpus;
+
+   unsigned long cpumask[0]; /* iteration mask */
+};
+
+struct sched_group {
+   struct sched_group *next;   /* Must be a circular list */
+   atomic_t ref;
+
+   unsigned int group_weight;
+   struct sched_group_power *sgp;
+
+   /*
+* The CPUs this group covers.
+*
+* NOTE: this field is variable length. (Allocated dynamically
+* by attaching extra space to the end of the structure,
+* depending on how many CPUs the kernel has booted up with)
+*/
+   unsigned long cpumask[0];
+};
+
+static inline struct cpumask *sched_group_cpus(struct sched_group *sg)
+{
+   return to_cpumask(sg->cpumask);
+}
+
+/*
+ * cpumask masking which cpus in the group are allowed to iterate up the domain
+ * tree.
+ */
+static inline struct cpumask *sched_group_mask(struct sched_group *sg)
+{
+   return to_cpumask(sg->sgp->cpumask);
+}
+
+/**
+ * group_first_cpu - Returns the first cpu in the cpumask of a sched_group.
+ * @group: The group whose first cpu is to be returned.
+ */
+static inline unsigned int group_first_cpu(struct sched_group *group)
+{
+   return cpumask_first(sched_group_cpus(group));
+}
+
 extern int group_balance_cpu(struct sched_group *sg);
 
 #endif /* CONFIG_SMP */
-- 
1.8.0.2


[PATCH 5/9] sched: Move wake flags to kernel/sched/sched.h

2013-03-05 Thread Li Zefan
They are used internally only.

Signed-off-by: Li Zefan 
---
 include/linux/sched.h | 7 ---
 kernel/sched/sched.h  | 7 +++
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0d64130..863b505 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -920,13 +920,6 @@ struct uts_namespace;
 struct rq;
 struct sched_domain;
 
-/*
- * wake flags
- */
-#define WF_SYNC		0x01		/* waker goes to sleep after wakup */
-#define WF_FORK		0x02		/* child wakeup after fork */
-#define WF_MIGRATED	0x04		/* internal use, task got migrated */
-
 #define ENQUEUE_WAKEUP 1
 #define ENQUEUE_HEAD   2
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1a4a2b1..4e5c2af 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -865,6 +865,13 @@ static inline void finish_lock_switch(struct rq *rq, struct task_struct *prev)
 }
 #endif /* __ARCH_WANT_UNLOCKED_CTXSW */
 
+/*
+ * wake flags
+ */
+#define WF_SYNC		0x01		/* waker goes to sleep after wakeup */
+#define WF_FORK		0x02		/* child wakeup after fork */
+#define WF_MIGRATED	0x4		/* internal use, task got migrated */
+
 static inline void update_load_add(struct load_weight *lw, unsigned long inc)
 {
lw->weight += inc;
-- 
1.8.0.2


[PATCH 6/9] sched: Move struct sched_class to kernel/sched/sched.h

2013-03-05 Thread Li Zefan
It's used internally only.

Signed-off-by: Li Zefan 
---
 include/linux/sched.h | 59 ---
 kernel/sched/sched.h  | 55 +++
 2 files changed, 55 insertions(+), 59 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 863b505..04b834f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -917,65 +917,6 @@ struct mempolicy;
 struct pipe_inode_info;
 struct uts_namespace;
 
-struct rq;
-struct sched_domain;
-
-#define ENQUEUE_WAKEUP 1
-#define ENQUEUE_HEAD   2
-#ifdef CONFIG_SMP
-#define ENQUEUE_WAKING 4   /* sched_class::task_waking was called */
-#else
-#define ENQUEUE_WAKING 0
-#endif
-
-#define DEQUEUE_SLEEP  1
-
-struct sched_class {
-   const struct sched_class *next;
-
-   void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
-   void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
-   void (*yield_task) (struct rq *rq);
-   bool (*yield_to_task) (struct rq *rq, struct task_struct *p, bool preempt);
-
-   void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
-
-   struct task_struct * (*pick_next_task) (struct rq *rq);
-   void (*put_prev_task) (struct rq *rq, struct task_struct *p);
-
-#ifdef CONFIG_SMP
-   int  (*select_task_rq)(struct task_struct *p, int sd_flag, int flags);
-   void (*migrate_task_rq)(struct task_struct *p, int next_cpu);
-
-   void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
-   void (*post_schedule) (struct rq *this_rq);
-   void (*task_waking) (struct task_struct *task);
-   void (*task_woken) (struct rq *this_rq, struct task_struct *task);
-
-   void (*set_cpus_allowed)(struct task_struct *p,
-const struct cpumask *newmask);
-
-   void (*rq_online)(struct rq *rq);
-   void (*rq_offline)(struct rq *rq);
-#endif
-
-   void (*set_curr_task) (struct rq *rq);
-   void (*task_tick) (struct rq *rq, struct task_struct *p, int queued);
-   void (*task_fork) (struct task_struct *p);
-
-   void (*switched_from) (struct rq *this_rq, struct task_struct *task);
-   void (*switched_to) (struct rq *this_rq, struct task_struct *task);
-   void (*prio_changed) (struct rq *this_rq, struct task_struct *task,
-int oldprio);
-
-   unsigned int (*get_rr_interval) (struct rq *rq,
-struct task_struct *task);
-
-#ifdef CONFIG_FAIR_GROUP_SCHED
-   void (*task_move_group) (struct task_struct *p, int on_rq);
-#endif
-};
-
 struct load_weight {
unsigned long weight, inv_weight;
 };
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4e5c2af..eca526d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -951,6 +951,61 @@ enum cpuacct_stat_index {
CPUACCT_STAT_NSTATS,
 };
 
+#define ENQUEUE_WAKEUP 1
+#define ENQUEUE_HEAD   2
+#ifdef CONFIG_SMP
+#define ENQUEUE_WAKING 4   /* sched_class::task_waking was called */
+#else
+#define ENQUEUE_WAKING 0
+#endif
+
+#define DEQUEUE_SLEEP  1
+
+struct sched_class {
+   const struct sched_class *next;
+
+   void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
+   void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
+   void (*yield_task) (struct rq *rq);
+   bool (*yield_to_task) (struct rq *rq, struct task_struct *p, bool preempt);
+
+   void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
+
+   struct task_struct * (*pick_next_task) (struct rq *rq);
+   void (*put_prev_task) (struct rq *rq, struct task_struct *p);
+
+#ifdef CONFIG_SMP
+   int  (*select_task_rq)(struct task_struct *p, int sd_flag, int flags);
+   void (*migrate_task_rq)(struct task_struct *p, int next_cpu);
+
+   void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
+   void (*post_schedule) (struct rq *this_rq);
+   void (*task_waking) (struct task_struct *task);
+   void (*task_woken) (struct rq *this_rq, struct task_struct *task);
+
+   void (*set_cpus_allowed)(struct task_struct *p,
+const struct cpumask *newmask);
+
+   void (*rq_online)(struct rq *rq);
+   void (*rq_offline)(struct rq *rq);
+#endif
+
+   void (*set_curr_task) (struct rq *rq);
+   void (*task_tick) (struct rq *rq, struct task_struct *p, int queued);
+   void (*task_fork) (struct task_struct *p);
+
+   void (*switched_from) (struct rq *this_rq, struct task_struct *task);
+   void (*switched_to) (struct rq *this_rq, struct task_struct *task);
+   void (*prio_changed) (struct rq *this_rq, struct task_struct *task,
+int oldprio);
+
+   unsigned int (*get_rr_interval) (struct rq

[PATCH 3/9] sched: Move SCHED_LOAD_SHIFT macros to kernel/sched/sched.h

2013-03-05 Thread Li Zefan
They are used internally only.

Signed-off-by: Li Zefan 
---
 include/linux/sched.h | 25 -
 kernel/sched/sched.h  | 26 +-
 2 files changed, 25 insertions(+), 26 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e880d7d..f8826d0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -756,31 +756,6 @@ enum cpu_idle_type {
 };
 
 /*
- * Increase resolution of nice-level calculations for 64-bit architectures.
- * The extra resolution improves shares distribution and load balancing of
- * low-weight task groups (eg. nice +19 on an autogroup), deeper taskgroup
- * hierarchies, especially on larger systems. This is not a user-visible change
- * and does not change the user-interface for setting shares/weights.
- *
- * We increase resolution only if we have enough bits to allow this increased
- * resolution (i.e. BITS_PER_LONG > 32). The costs for increasing resolution
- * when BITS_PER_LONG <= 32 are pretty high and the returns do not justify the
- * increased costs.
- */
-#if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load  */
-# define SCHED_LOAD_RESOLUTION 10
-# define scale_load(w) ((w) << SCHED_LOAD_RESOLUTION)
-# define scale_load_down(w)((w) >> SCHED_LOAD_RESOLUTION)
-#else
-# define SCHED_LOAD_RESOLUTION 0
-# define scale_load(w) (w)
-# define scale_load_down(w)(w)
-#endif
-
-#define SCHED_LOAD_SHIFT   (10 + SCHED_LOAD_RESOLUTION)
-#define SCHED_LOAD_SCALE   (1L << SCHED_LOAD_SHIFT)
-
-/*
  * Increase resolution of cpu_power calculations
  */
 #define SCHED_POWER_SHIFT  10
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cc03cfd..709a30c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -33,6 +33,31 @@ extern __read_mostly int scheduler_running;
  */
 #define NS_TO_JIFFIES(TIME)((unsigned long)(TIME) / (NSEC_PER_SEC / HZ))
 
+/*
+ * Increase resolution of nice-level calculations for 64-bit architectures.
+ * The extra resolution improves shares distribution and load balancing of
+ * low-weight task groups (eg. nice +19 on an autogroup), deeper taskgroup
+ * hierarchies, especially on larger systems. This is not a user-visible change
+ * and does not change the user-interface for setting shares/weights.
+ *
+ * We increase resolution only if we have enough bits to allow this increased
+ * resolution (i.e. BITS_PER_LONG > 32). The costs for increasing resolution
+ * when BITS_PER_LONG <= 32 are pretty high and the returns do not justify the
+ * increased costs.
+ */
+#if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load  */
+# define SCHED_LOAD_RESOLUTION 10
+# define scale_load(w) ((w) << SCHED_LOAD_RESOLUTION)
+# define scale_load_down(w)((w) >> SCHED_LOAD_RESOLUTION)
+#else
+# define SCHED_LOAD_RESOLUTION 0
+# define scale_load(w) (w)
+# define scale_load_down(w)(w)
+#endif
+
+#define SCHED_LOAD_SHIFT   (10 + SCHED_LOAD_RESOLUTION)
+#define SCHED_LOAD_SCALE   (1L << SCHED_LOAD_SHIFT)
+
 #define NICE_0_LOADSCHED_LOAD_SCALE
 #define NICE_0_SHIFT   SCHED_LOAD_SHIFT
 
@@ -784,7 +809,6 @@ static inline void finish_lock_switch(struct rq *rq, struct task_struct *prev)
 }
 #endif /* __ARCH_WANT_UNLOCKED_CTXSW */
 
-
 static inline void update_load_add(struct load_weight *lw, unsigned long inc)
 {
lw->weight += inc;
-- 
1.8.0.2


[PATCH 7/9] sched: Make default_scale_freq_power() static

2013-03-05 Thread Li Zefan
As default_scale_{freq,smt}_power() and scale_rt_power() are used
in kernel/sched/fair.c only, annotate them as static functions.

Signed-off-by: Li Zefan 
---
 include/linux/sched.h | 3 ---
 kernel/sched/fair.c   | 6 +++---
 2 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 04b834f..eadd113 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -880,9 +880,6 @@ extern void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 cpumask_var_t *alloc_sched_domains(unsigned int ndoms);
 void free_sched_domains(cpumask_var_t doms[], unsigned int ndoms);
 
-unsigned long default_scale_freq_power(struct sched_domain *sd, int cpu);
-unsigned long default_scale_smt_power(struct sched_domain *sd, int cpu);
-
 bool cpus_share_cache(int this_cpu, int that_cpu);
 
 #else /* CONFIG_SMP */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a33e59..9f23112 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4245,7 +4245,7 @@ static inline int get_sd_load_idx(struct sched_domain *sd,
return load_idx;
 }
 
-unsigned long default_scale_freq_power(struct sched_domain *sd, int cpu)
+static unsigned long default_scale_freq_power(struct sched_domain *sd, int cpu)
 {
return SCHED_POWER_SCALE;
 }
@@ -4255,7 +4255,7 @@ unsigned long __weak arch_scale_freq_power(struct sched_domain *sd, int cpu)
return default_scale_freq_power(sd, cpu);
 }
 
-unsigned long default_scale_smt_power(struct sched_domain *sd, int cpu)
+static unsigned long default_scale_smt_power(struct sched_domain *sd, int cpu)
 {
unsigned long weight = sd->span_weight;
unsigned long smt_gain = sd->smt_gain;
@@ -4270,7 +4270,7 @@ unsigned long __weak arch_scale_smt_power(struct sched_domain *sd, int cpu)
return default_scale_smt_power(sd, cpu);
 }
 
-unsigned long scale_rt_power(int cpu)
+static unsigned long scale_rt_power(int cpu)
 {
struct rq *rq = cpu_rq(cpu);
u64 total, available, age_stamp, avg;
-- 
1.8.0.2


[PATCH 8/9] sched: Move group scheduling functions out of include/linux/sched.h

2013-03-05 Thread Li Zefan
- Make sched_group_{set_,}rt_runtime(), sched_group_{set_,}rt_period() and
sched_rt_can_attach() static.

- Move sched_{create,destroy,online,offline}_group() to kernel/sched/sched.h.

- Remove declaration of sched_group_shares().

Signed-off-by: Li Zefan 
---
 include/linux/sched.h | 21 -
 kernel/sched/core.c   | 10 +-
 kernel/sched/sched.h  | 12 
 3 files changed, 17 insertions(+), 26 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index eadd113..fc039ce 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2512,28 +2512,7 @@ extern long sched_setaffinity(pid_t pid, const struct cpumask *new_mask);
 extern long sched_getaffinity(pid_t pid, struct cpumask *mask);
 
 #ifdef CONFIG_CGROUP_SCHED
-
 extern struct task_group root_task_group;
-
-extern struct task_group *sched_create_group(struct task_group *parent);
-extern void sched_online_group(struct task_group *tg,
-  struct task_group *parent);
-extern void sched_destroy_group(struct task_group *tg);
-extern void sched_offline_group(struct task_group *tg);
-extern void sched_move_task(struct task_struct *tsk);
-#ifdef CONFIG_FAIR_GROUP_SCHED
-extern int sched_group_set_shares(struct task_group *tg, unsigned long shares);
-extern unsigned long sched_group_shares(struct task_group *tg);
-#endif
-#ifdef CONFIG_RT_GROUP_SCHED
-extern int sched_group_set_rt_runtime(struct task_group *tg,
- long rt_runtime_us);
-extern long sched_group_rt_runtime(struct task_group *tg);
-extern int sched_group_set_rt_period(struct task_group *tg,
- long rt_period_us);
-extern long sched_group_rt_period(struct task_group *tg);
-extern int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk);
-#endif
 #endif /* CONFIG_CGROUP_SCHED */
 
 extern int task_can_switch_user(struct user_struct *up,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7f12624..9ad26c9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7455,7 +7455,7 @@ unlock:
return err;
 }
 
-int sched_group_set_rt_runtime(struct task_group *tg, long rt_runtime_us)
+static int sched_group_set_rt_runtime(struct task_group *tg, long rt_runtime_us)
 {
u64 rt_runtime, rt_period;
 
@@ -7467,7 +7467,7 @@ int sched_group_set_rt_runtime(struct task_group *tg, long rt_runtime_us)
return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
 }
 
-long sched_group_rt_runtime(struct task_group *tg)
+static long sched_group_rt_runtime(struct task_group *tg)
 {
u64 rt_runtime_us;
 
@@ -7479,7 +7479,7 @@ long sched_group_rt_runtime(struct task_group *tg)
return rt_runtime_us;
 }
 
-int sched_group_set_rt_period(struct task_group *tg, long rt_period_us)
+static int sched_group_set_rt_period(struct task_group *tg, long rt_period_us)
 {
u64 rt_runtime, rt_period;
 
@@ -7492,7 +7492,7 @@ int sched_group_set_rt_period(struct task_group *tg, long rt_period_us)
return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
 }
 
-long sched_group_rt_period(struct task_group *tg)
+static long sched_group_rt_period(struct task_group *tg)
 {
u64 rt_period_us;
 
@@ -7527,7 +7527,7 @@ static int sched_rt_global_constraints(void)
return ret;
 }
 
-int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk)
+static int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk)
 {
/* Don't accept realtime tasks when there is no way for them to run */
if (rt_task(tsk) && tg->rt_bandwidth.rt_runtime == 0)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index eca526d..304fc1c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -221,6 +221,18 @@ extern void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq,
struct sched_rt_entity *rt_se, int cpu,
struct sched_rt_entity *parent);
 
+extern struct task_group *sched_create_group(struct task_group *parent);
+extern void sched_online_group(struct task_group *tg,
+  struct task_group *parent);
+extern void sched_destroy_group(struct task_group *tg);
+extern void sched_offline_group(struct task_group *tg);
+
+extern void sched_move_task(struct task_struct *tsk);
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+extern int sched_group_set_shares(struct task_group *tg, unsigned long shares);
+#endif
+
 #else /* CONFIG_CGROUP_SCHED */
 
 struct cfs_bandwidth { };
-- 
1.8.0.2


[PATCH 9/9] sched: Remove double declaration of root_task_group

2013-03-05 Thread Li Zefan
It's already declared in include/linux/sched.h

Signed-off-by: Li Zefan 
---
 kernel/sched/core.c  | 4 
 kernel/sched/sched.h | 5 -
 2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9ad26c9..42ecbcb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6861,6 +6861,10 @@ int in_sched_functions(unsigned long addr)
 }
 
 #ifdef CONFIG_CGROUP_SCHED
+/*
+ * Default task group.
+ * Every task in system belongs to this group at bootup.
+ */
 struct task_group root_task_group;
 LIST_HEAD(task_groups);
 #endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 304fc1c..30bebb9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -179,11 +179,6 @@ struct task_group {
 #define MAX_SHARES (1UL << 18)
 #endif
 
-/* Default task group.
- * Every task in system belong to this group at bootup.
- */
-extern struct task_group root_task_group;
-
 typedef int (*tg_visitor)(struct task_group *, void *);
 
 extern int walk_tg_tree_from(struct task_group *from,
-- 
1.8.0.2


Re: [PATCH] perf: remove include of cgroup.h from perf_event.h

2013-03-05 Thread Li Zefan
On 2013/3/5 16:33, Stephane Eranian wrote:
> On Tue, Mar 5, 2013 at 4:38 AM, Li Zefan  wrote:
>> Move struct perf_cgroup_info and perf_cgroup to kernel/events/core.c,
>> and then we can remove the include of cgroup.h.
>>
>> Signed-off-by: Li Zefan 
>> ---
>>  include/linux/perf_event.h | 18 +-
>>  kernel/events/core.c   | 15 +++
>>  2 files changed, 16 insertions(+), 17 deletions(-)
>>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index e47ee46..8737e1c 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -21,7 +21,6 @@
>>   */
>>
>>  #ifdef CONFIG_PERF_EVENTS
>> -# include <linux/cgroup.h>
>>  # include <asm/perf_event.h>
>>  # include <asm/local64.h>
>>  #endif
>> @@ -299,22 +298,7 @@ struct swevent_hlist {
>>  #define PERF_ATTACH_GROUP  0x02
>>  #define PERF_ATTACH_TASK   0x04
>>
>> -#ifdef CONFIG_CGROUP_PERF
>> -/*
>> - * perf_cgroup_info keeps track of time_enabled for a cgroup.
>> - * This is a per-cpu dynamically allocated data structure.
>> - */
>> -struct perf_cgroup_info {
>> -   u64 time;
>> -   u64 timestamp;
>> -};
>> -
>> -struct perf_cgroup {
>> -   struct  cgroup_subsys_state css;
>> -   struct  perf_cgroup_info *info; /* timing info, one per cpu */
>> -};
>> -#endif
>> -
>> +struct perf_cgroup;
> 
> The problem is that you have struct perf_cgroup in the struct perf_event
> structure. Today, this field is not referenced outside of kernel/events/core.c,
> but it is available outside this file. If someday the field is referenced, your
> changes will have to be reverted. So I am wondering what is the point
> of the change right now?
> 

I touch cgroup.h quite frequently, so I'd like to reduce the compile time
caused by cgroup.h changes; that's why I've made a few patches to remove
cgroup.h from some header files.

It's better to expose as few interfaces as possible, and if someone wants to
export something he should think about whether it's the right thing to do.

Just as we often make static functions available to external users, and also
often do the reverse, this patch is nothing special.
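
The underlying technique is the usual opaque-pointer one. A self-contained
sketch (made-up foo_* names, nothing from perf) of why a forward declaration
is enough for users of a pointer-based API:

  #include <stdio.h>
  #include <stdlib.h>

  /* "Public header" part: a forward declaration is all callers need,
   * so the header defining the struct stays private. */
  struct foo_state;
  struct foo_state *foo_new(int x);
  int foo_get(const struct foo_state *st);

  /* "Private implementation" part, normally in its own .c file. */
  struct foo_state {
          int x;
  };

  struct foo_state *foo_new(int x)
  {
          struct foo_state *st = malloc(sizeof(*st));

          st->x = x;
          return st;
  }

  int foo_get(const struct foo_state *st)
  {
          return st->x;
  }

  int main(void)
  {
          struct foo_state *st = foo_new(42);

          printf("%d\n", foo_get(st));   /* prints: 42 */
          free(st);
          return 0;
  }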


