Re: [v4 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-04-18 Thread Kirill Tkhai
On 18.04.2018 21:48, Yang Shi wrote:
> 
> 
> On 4/18/18 3:11 AM, Kirill Tkhai wrote:
>> On 14.04.2018 21:24, Yang Shi wrote:
>>> mmap_sem is on the hot path of kernel, and it very contended, but it is
>>> abused too. It is used to protect arg_start|end and evn_start|end when
>>> reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make
>>> sense since those proc files just expect to read 4 values atomically and
>>> not related to VM, they could be set to arbitrary values by C/R.
>>>
>>> And, the mmap_sem contention may cause unexpected issue like below:
>>>
>>> INFO: task ps:14018 blocked for more than 120 seconds.
>>>     Tainted: G    E 4.9.79-009.ali3000.alios7.x86_64 #1
>>>   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
>>> message.
>>>   ps  D    0 14018  1 0x0004
>>>    885582f84000 885e8682f000 880972943000 885ebf499bc0
>>>    8828ee12 c900349bfca8 817154d0 0040
>>>    00ff812f872a 885ebf499bc0 024000d000948300 880972943000
>>>   Call Trace:
>>>    [] ? __schedule+0x250/0x730
>>>    [] schedule+0x36/0x80
>>>    [] rwsem_down_read_failed+0xf0/0x150
>>>    [] call_rwsem_down_read_failed+0x18/0x30
>>>    [] down_read+0x20/0x40
>>>    [] proc_pid_cmdline_read+0xd9/0x4e0
>>>    [] ? do_filp_open+0xa5/0x100
>>>    [] __vfs_read+0x37/0x150
>>>    [] ? security_file_permission+0x9b/0xc0
>>>    [] vfs_read+0x96/0x130
>>>    [] SyS_read+0x55/0xc0
>>>    [] entry_SYSCALL_64_fastpath+0x1a/0xc5
>>>
>>> Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock
>>> for them to mitigate the abuse of mmap_sem.
>>>
>>> So, introduce a new spinlock in mm_struct to protect the concurrent
>>> access to arg_start|end, env_start|end and others, as well as replace
>>> write map_sem to read to protect the race condition between prctl and
>>> sys_brk which might break check_data_rlimit(), and makes prctl more
>>> friendly to other VM operations.
>>>
>>> This patch just eliminates the abuse of mmap_sem, but it can't resolve the
>>> above hung task warning completely since the later access_remote_vm() call
>>> needs acquire mmap_sem. The mmap_sem scalability issue will be solved in the
>>> future.
>>>
>>> Signed-off-by: Yang Shi 
>>> Cc: Alexey Dobriyan 
>>> Cc: Michal Hocko 
>>> Cc: Matthew Wilcox 
>>> Cc: Mateusz Guzik 
>>> Cc: Cyrill Gorcunov 
>>> ---
>>> v3 --> v4:
>>> * Protected values update with down_read + spin_lock to prevent from race
>>>    condition between prctl and sys_brk and made prctl more friendly to VM
>>>    operations per Michal's suggestion
>>>
>>> v2 --> v3:
>>> * Restored down_write in prctl syscall
>>> * Elaborate the limitation of this patch suggested by Michal
>>> * Protect those fields by the new lock except brk and start_brk per Michal's
>>>    suggestion
>>> * Based off Cyrill's non PR_SET_MM_MAP oprations deprecation patch
>>>    (https://lkml.org/lkml/2018/4/5/541)
>>>
>>> v1 --> v2:
>>> * Use spinlock instead of rwlock per Mattew's suggestion
>>> * Replace down_write to down_read in prctl_set_mm (see commit log for 
>>> details)
>>>   fs/proc/base.c   | 8 
>>>   include/linux/mm_types.h | 2 ++
>>>   kernel/fork.c    | 1 +
>>>   kernel/sys.c | 6 --
>>>   mm/init-mm.c | 1 +
>>>   5 files changed, 12 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/fs/proc/base.c b/fs/proc/base.c
>>> index eafa39a..3551757 100644
>>> --- a/fs/proc/base.c
>>> +++ b/fs/proc/base.c
>>> @@ -239,12 +239,12 @@ static ssize_t proc_pid_cmdline_read(struct file 
>>> *file, char __user *buf,
>>>   goto out_mmput;
>>>   }
>>>   -    down_read(>mmap_sem);
>>> +    spin_lock(>arg_lock);
>>>   arg_start = mm->arg_start;
>>>   arg_end = mm->arg_end;
>>>   env_start = mm->env_start;
>>>   env_end = mm->env_end;
>>> -    up_read(>mmap_sem);
>>> +    spin_unlock(>arg_lock);
>>>     BUG_ON(arg_start > arg_end);
>>>   BUG_ON(env_start > env_end);
>>> @@ -929,10 +929,10 @@ static ssize_t environ_read(struct file *file, char 
>>> __user *buf,
>>>   if (!mmget_not_zero(mm))
>>>   goto free;
>>>   -    down_read(>mmap_sem);
>>> +    spin_lock(>arg_lock);
>>>   env_start = mm->env_start;
>>>   env_end = mm->env_end;
>>> -    up_read(>mmap_sem);
>>> +    spin_unlock(>arg_lock);
>>>     while (count > 0) {
>>>   size_t this_len, max_len;
>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>>> index 2161234..49dd59e 100644
>>> --- a/include/linux/mm_types.h
>>> +++ b/include/linux/mm_types.h
>>> @@ -413,6 +413,8 @@ struct mm_struct {
>>>   unsigned long exec_vm;    /* VM_EXEC & ~VM_WRITE & ~VM_STACK */
>>>   unsigned long stack_vm;    /* VM_STACK */
>>>   unsigned long def_flags;
>>> +
>>> +    spinlock_t arg_lock; /* 

Re: [v4 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-04-18 Thread Kirill Tkhai
On 18.04.2018 21:48, Yang Shi wrote:
> 
> 
> On 4/18/18 3:11 AM, Kirill Tkhai wrote:
>> On 14.04.2018 21:24, Yang Shi wrote:
>>> mmap_sem is on the hot path of kernel, and it very contended, but it is
>>> abused too. It is used to protect arg_start|end and evn_start|end when
>>> reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make
>>> sense since those proc files just expect to read 4 values atomically and
>>> not related to VM, they could be set to arbitrary values by C/R.
>>>
>>> And, the mmap_sem contention may cause unexpected issue like below:
>>>
>>> INFO: task ps:14018 blocked for more than 120 seconds.
>>>     Tainted: G    E 4.9.79-009.ali3000.alios7.x86_64 #1
>>>   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
>>> message.
>>>   ps  D    0 14018  1 0x0004
>>>    885582f84000 885e8682f000 880972943000 885ebf499bc0
>>>    8828ee12 c900349bfca8 817154d0 0040
>>>    00ff812f872a 885ebf499bc0 024000d000948300 880972943000
>>>   Call Trace:
>>>    [] ? __schedule+0x250/0x730
>>>    [] schedule+0x36/0x80
>>>    [] rwsem_down_read_failed+0xf0/0x150
>>>    [] call_rwsem_down_read_failed+0x18/0x30
>>>    [] down_read+0x20/0x40
>>>    [] proc_pid_cmdline_read+0xd9/0x4e0
>>>    [] ? do_filp_open+0xa5/0x100
>>>    [] __vfs_read+0x37/0x150
>>>    [] ? security_file_permission+0x9b/0xc0
>>>    [] vfs_read+0x96/0x130
>>>    [] SyS_read+0x55/0xc0
>>>    [] entry_SYSCALL_64_fastpath+0x1a/0xc5
>>>
>>> Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock
>>> for them to mitigate the abuse of mmap_sem.
>>>
>>> So, introduce a new spinlock in mm_struct to protect the concurrent
>>> access to arg_start|end, env_start|end and others, as well as replace
>>> write map_sem to read to protect the race condition between prctl and
>>> sys_brk which might break check_data_rlimit(), and makes prctl more
>>> friendly to other VM operations.
>>>
>>> This patch just eliminates the abuse of mmap_sem, but it can't resolve the
>>> above hung task warning completely since the later access_remote_vm() call
>>> needs acquire mmap_sem. The mmap_sem scalability issue will be solved in the
>>> future.
>>>
>>> Signed-off-by: Yang Shi 
>>> Cc: Alexey Dobriyan 
>>> Cc: Michal Hocko 
>>> Cc: Matthew Wilcox 
>>> Cc: Mateusz Guzik 
>>> Cc: Cyrill Gorcunov 
>>> ---
>>> v3 --> v4:
>>> * Protected values update with down_read + spin_lock to prevent from race
>>>    condition between prctl and sys_brk and made prctl more friendly to VM
>>>    operations per Michal's suggestion
>>>
>>> v2 --> v3:
>>> * Restored down_write in prctl syscall
>>> * Elaborate the limitation of this patch suggested by Michal
>>> * Protect those fields by the new lock except brk and start_brk per Michal's
>>>    suggestion
>>> * Based off Cyrill's non PR_SET_MM_MAP oprations deprecation patch
>>>    (https://lkml.org/lkml/2018/4/5/541)
>>>
>>> v1 --> v2:
>>> * Use spinlock instead of rwlock per Mattew's suggestion
>>> * Replace down_write to down_read in prctl_set_mm (see commit log for 
>>> details)
>>>   fs/proc/base.c   | 8 
>>>   include/linux/mm_types.h | 2 ++
>>>   kernel/fork.c    | 1 +
>>>   kernel/sys.c | 6 --
>>>   mm/init-mm.c | 1 +
>>>   5 files changed, 12 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/fs/proc/base.c b/fs/proc/base.c
>>> index eafa39a..3551757 100644
>>> --- a/fs/proc/base.c
>>> +++ b/fs/proc/base.c
>>> @@ -239,12 +239,12 @@ static ssize_t proc_pid_cmdline_read(struct file 
>>> *file, char __user *buf,
>>>   goto out_mmput;
>>>   }
>>>   -    down_read(>mmap_sem);
>>> +    spin_lock(>arg_lock);
>>>   arg_start = mm->arg_start;
>>>   arg_end = mm->arg_end;
>>>   env_start = mm->env_start;
>>>   env_end = mm->env_end;
>>> -    up_read(>mmap_sem);
>>> +    spin_unlock(>arg_lock);
>>>     BUG_ON(arg_start > arg_end);
>>>   BUG_ON(env_start > env_end);
>>> @@ -929,10 +929,10 @@ static ssize_t environ_read(struct file *file, char 
>>> __user *buf,
>>>   if (!mmget_not_zero(mm))
>>>   goto free;
>>>   -    down_read(>mmap_sem);
>>> +    spin_lock(>arg_lock);
>>>   env_start = mm->env_start;
>>>   env_end = mm->env_end;
>>> -    up_read(>mmap_sem);
>>> +    spin_unlock(>arg_lock);
>>>     while (count > 0) {
>>>   size_t this_len, max_len;
>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>>> index 2161234..49dd59e 100644
>>> --- a/include/linux/mm_types.h
>>> +++ b/include/linux/mm_types.h
>>> @@ -413,6 +413,8 @@ struct mm_struct {
>>>   unsigned long exec_vm;    /* VM_EXEC & ~VM_WRITE & ~VM_STACK */
>>>   unsigned long stack_vm;    /* VM_STACK */
>>>   unsigned long def_flags;
>>> +
>>> +    spinlock_t arg_lock; /* protect the below fields */
>> What the reason is spinlock is used to protect this fields?
>> There may be several readers, say, 

Re: [v4 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-04-18 Thread Yang Shi



On 4/18/18 3:11 AM, Kirill Tkhai wrote:

On 14.04.2018 21:24, Yang Shi wrote:

mmap_sem is on the hot path of kernel, and it very contended, but it is
abused too. It is used to protect arg_start|end and evn_start|end when
reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make
sense since those proc files just expect to read 4 values atomically and
not related to VM, they could be set to arbitrary values by C/R.

And, the mmap_sem contention may cause unexpected issue like below:

INFO: task ps:14018 blocked for more than 120 seconds.
Tainted: GE 4.9.79-009.ali3000.alios7.x86_64 #1
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
  ps  D0 14018  1 0x0004
   885582f84000 885e8682f000 880972943000 885ebf499bc0
   8828ee12 c900349bfca8 817154d0 0040
   00ff812f872a 885ebf499bc0 024000d000948300 880972943000
  Call Trace:
   [] ? __schedule+0x250/0x730
   [] schedule+0x36/0x80
   [] rwsem_down_read_failed+0xf0/0x150
   [] call_rwsem_down_read_failed+0x18/0x30
   [] down_read+0x20/0x40
   [] proc_pid_cmdline_read+0xd9/0x4e0
   [] ? do_filp_open+0xa5/0x100
   [] __vfs_read+0x37/0x150
   [] ? security_file_permission+0x9b/0xc0
   [] vfs_read+0x96/0x130
   [] SyS_read+0x55/0xc0
   [] entry_SYSCALL_64_fastpath+0x1a/0xc5

Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock
for them to mitigate the abuse of mmap_sem.

So, introduce a new spinlock in mm_struct to protect the concurrent
access to arg_start|end, env_start|end and others, as well as replace
write map_sem to read to protect the race condition between prctl and
sys_brk which might break check_data_rlimit(), and makes prctl more
friendly to other VM operations.

This patch just eliminates the abuse of mmap_sem, but it can't resolve the
above hung task warning completely since the later access_remote_vm() call
needs acquire mmap_sem. The mmap_sem scalability issue will be solved in the
future.

Signed-off-by: Yang Shi 
Cc: Alexey Dobriyan 
Cc: Michal Hocko 
Cc: Matthew Wilcox 
Cc: Mateusz Guzik 
Cc: Cyrill Gorcunov 
---
v3 --> v4:
* Protected values update with down_read + spin_lock to prevent from race
   condition between prctl and sys_brk and made prctl more friendly to VM
   operations per Michal's suggestion

v2 --> v3:
* Restored down_write in prctl syscall
* Elaborate the limitation of this patch suggested by Michal
* Protect those fields by the new lock except brk and start_brk per Michal's
   suggestion
* Based off Cyrill's non PR_SET_MM_MAP oprations deprecation patch
   (https://lkml.org/lkml/2018/4/5/541)

v1 --> v2:
* Use spinlock instead of rwlock per Mattew's suggestion
* Replace down_write to down_read in prctl_set_mm (see commit log for details)
  fs/proc/base.c   | 8 
  include/linux/mm_types.h | 2 ++
  kernel/fork.c| 1 +
  kernel/sys.c | 6 --
  mm/init-mm.c | 1 +
  5 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index eafa39a..3551757 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -239,12 +239,12 @@ static ssize_t proc_pid_cmdline_read(struct file *file, 
char __user *buf,
goto out_mmput;
}
  
-	down_read(>mmap_sem);

+   spin_lock(>arg_lock);
arg_start = mm->arg_start;
arg_end = mm->arg_end;
env_start = mm->env_start;
env_end = mm->env_end;
-   up_read(>mmap_sem);
+   spin_unlock(>arg_lock);
  
  	BUG_ON(arg_start > arg_end);

BUG_ON(env_start > env_end);
@@ -929,10 +929,10 @@ static ssize_t environ_read(struct file *file, char 
__user *buf,
if (!mmget_not_zero(mm))
goto free;
  
-	down_read(>mmap_sem);

+   spin_lock(>arg_lock);
env_start = mm->env_start;
env_end = mm->env_end;
-   up_read(>mmap_sem);
+   spin_unlock(>arg_lock);
  
  	while (count > 0) {

size_t this_len, max_len;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 2161234..49dd59e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -413,6 +413,8 @@ struct mm_struct {
unsigned long exec_vm;  /* VM_EXEC & ~VM_WRITE & ~VM_STACK */
unsigned long stack_vm; /* VM_STACK */
unsigned long def_flags;
+
+   spinlock_t arg_lock; /* protect the below fields */

What the reason is spinlock is used to protect this fields?
There may be several readers, say, doing "ps axf" and it's
OK for them to access these fields in parallel. Why should
we delay them by each other?

rw_lock seems be more suitable for here.


Thanks a lot for the suggestion. We did think about using rwlock, but it 
sounds the benefit might be not worth the overhead. If it turns out 
rwlock is 

Re: [v4 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-04-18 Thread Yang Shi



On 4/18/18 3:11 AM, Kirill Tkhai wrote:

On 14.04.2018 21:24, Yang Shi wrote:

mmap_sem is on the hot path of kernel, and it very contended, but it is
abused too. It is used to protect arg_start|end and evn_start|end when
reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make
sense since those proc files just expect to read 4 values atomically and
not related to VM, they could be set to arbitrary values by C/R.

And, the mmap_sem contention may cause unexpected issue like below:

INFO: task ps:14018 blocked for more than 120 seconds.
Tainted: GE 4.9.79-009.ali3000.alios7.x86_64 #1
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
  ps  D0 14018  1 0x0004
   885582f84000 885e8682f000 880972943000 885ebf499bc0
   8828ee12 c900349bfca8 817154d0 0040
   00ff812f872a 885ebf499bc0 024000d000948300 880972943000
  Call Trace:
   [] ? __schedule+0x250/0x730
   [] schedule+0x36/0x80
   [] rwsem_down_read_failed+0xf0/0x150
   [] call_rwsem_down_read_failed+0x18/0x30
   [] down_read+0x20/0x40
   [] proc_pid_cmdline_read+0xd9/0x4e0
   [] ? do_filp_open+0xa5/0x100
   [] __vfs_read+0x37/0x150
   [] ? security_file_permission+0x9b/0xc0
   [] vfs_read+0x96/0x130
   [] SyS_read+0x55/0xc0
   [] entry_SYSCALL_64_fastpath+0x1a/0xc5

Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock
for them to mitigate the abuse of mmap_sem.

So, introduce a new spinlock in mm_struct to protect the concurrent
access to arg_start|end, env_start|end and others, as well as replace
write map_sem to read to protect the race condition between prctl and
sys_brk which might break check_data_rlimit(), and makes prctl more
friendly to other VM operations.

This patch just eliminates the abuse of mmap_sem, but it can't resolve the
above hung task warning completely since the later access_remote_vm() call
needs acquire mmap_sem. The mmap_sem scalability issue will be solved in the
future.

Signed-off-by: Yang Shi 
Cc: Alexey Dobriyan 
Cc: Michal Hocko 
Cc: Matthew Wilcox 
Cc: Mateusz Guzik 
Cc: Cyrill Gorcunov 
---
v3 --> v4:
* Protected values update with down_read + spin_lock to prevent from race
   condition between prctl and sys_brk and made prctl more friendly to VM
   operations per Michal's suggestion

v2 --> v3:
* Restored down_write in prctl syscall
* Elaborate the limitation of this patch suggested by Michal
* Protect those fields by the new lock except brk and start_brk per Michal's
   suggestion
* Based off Cyrill's non PR_SET_MM_MAP oprations deprecation patch
   (https://lkml.org/lkml/2018/4/5/541)

v1 --> v2:
* Use spinlock instead of rwlock per Mattew's suggestion
* Replace down_write to down_read in prctl_set_mm (see commit log for details)
  fs/proc/base.c   | 8 
  include/linux/mm_types.h | 2 ++
  kernel/fork.c| 1 +
  kernel/sys.c | 6 --
  mm/init-mm.c | 1 +
  5 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index eafa39a..3551757 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -239,12 +239,12 @@ static ssize_t proc_pid_cmdline_read(struct file *file, 
char __user *buf,
goto out_mmput;
}
  
-	down_read(>mmap_sem);

+   spin_lock(>arg_lock);
arg_start = mm->arg_start;
arg_end = mm->arg_end;
env_start = mm->env_start;
env_end = mm->env_end;
-   up_read(>mmap_sem);
+   spin_unlock(>arg_lock);
  
  	BUG_ON(arg_start > arg_end);

BUG_ON(env_start > env_end);
@@ -929,10 +929,10 @@ static ssize_t environ_read(struct file *file, char 
__user *buf,
if (!mmget_not_zero(mm))
goto free;
  
-	down_read(>mmap_sem);

+   spin_lock(>arg_lock);
env_start = mm->env_start;
env_end = mm->env_end;
-   up_read(>mmap_sem);
+   spin_unlock(>arg_lock);
  
  	while (count > 0) {

size_t this_len, max_len;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 2161234..49dd59e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -413,6 +413,8 @@ struct mm_struct {
unsigned long exec_vm;  /* VM_EXEC & ~VM_WRITE & ~VM_STACK */
unsigned long stack_vm; /* VM_STACK */
unsigned long def_flags;
+
+   spinlock_t arg_lock; /* protect the below fields */

What the reason is spinlock is used to protect this fields?
There may be several readers, say, doing "ps axf" and it's
OK for them to access these fields in parallel. Why should
we delay them by each other?

rw_lock seems be more suitable for here.


Thanks a lot for the suggestion. We did think about using rwlock, but it 
sounds the benefit might be not worth the overhead. If it turns out 
rwlock is worth, we definitely can change it to rwlock later.





unsigned long start_code, end_code, start_data, end_data;

Re: [v4 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-04-18 Thread Yang Shi



On 4/18/18 2:40 AM, Cyrill Gorcunov wrote:

On Wed, Apr 18, 2018 at 11:03:14AM +0200, Michal Hocko wrote:

What about something like the following?
"
arg_lock protects concurent updates but we still need mmap_sem for read
to exclude races with do_brk.
"
Acked-by: Michal Hocko 

Yes, thanks! Andrew, could you slightly update the changelog please?

No, I meant it to be a comment in the _code_.

Ah, I see. Then small patch on top should do the trick.


Will send out an incremental patch soon.




Re: [v4 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-04-18 Thread Yang Shi



On 4/18/18 2:40 AM, Cyrill Gorcunov wrote:

On Wed, Apr 18, 2018 at 11:03:14AM +0200, Michal Hocko wrote:

What about something like the following?
"
arg_lock protects concurent updates but we still need mmap_sem for read
to exclude races with do_brk.
"
Acked-by: Michal Hocko 

Yes, thanks! Andrew, could you slightly update the changelog please?

No, I meant it to be a comment in the _code_.

Ah, I see. Then small patch on top should do the trick.


Will send out an incremental patch soon.




Re: [v4 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-04-18 Thread Kirill Tkhai
On 14.04.2018 21:24, Yang Shi wrote:
> mmap_sem is on the hot path of kernel, and it very contended, but it is
> abused too. It is used to protect arg_start|end and evn_start|end when
> reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make
> sense since those proc files just expect to read 4 values atomically and
> not related to VM, they could be set to arbitrary values by C/R.
> 
> And, the mmap_sem contention may cause unexpected issue like below:
> 
> INFO: task ps:14018 blocked for more than 120 seconds.
>Tainted: GE 4.9.79-009.ali3000.alios7.x86_64 #1
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> message.
>  ps  D0 14018  1 0x0004
>   885582f84000 885e8682f000 880972943000 885ebf499bc0
>   8828ee12 c900349bfca8 817154d0 0040
>   00ff812f872a 885ebf499bc0 024000d000948300 880972943000
>  Call Trace:
>   [] ? __schedule+0x250/0x730
>   [] schedule+0x36/0x80
>   [] rwsem_down_read_failed+0xf0/0x150
>   [] call_rwsem_down_read_failed+0x18/0x30
>   [] down_read+0x20/0x40
>   [] proc_pid_cmdline_read+0xd9/0x4e0
>   [] ? do_filp_open+0xa5/0x100
>   [] __vfs_read+0x37/0x150
>   [] ? security_file_permission+0x9b/0xc0
>   [] vfs_read+0x96/0x130
>   [] SyS_read+0x55/0xc0
>   [] entry_SYSCALL_64_fastpath+0x1a/0xc5
> 
> Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock
> for them to mitigate the abuse of mmap_sem.
> 
> So, introduce a new spinlock in mm_struct to protect the concurrent
> access to arg_start|end, env_start|end and others, as well as replace
> write map_sem to read to protect the race condition between prctl and
> sys_brk which might break check_data_rlimit(), and makes prctl more
> friendly to other VM operations.
> 
> This patch just eliminates the abuse of mmap_sem, but it can't resolve the
> above hung task warning completely since the later access_remote_vm() call
> needs acquire mmap_sem. The mmap_sem scalability issue will be solved in the
> future.
> 
> Signed-off-by: Yang Shi 
> Cc: Alexey Dobriyan 
> Cc: Michal Hocko 
> Cc: Matthew Wilcox 
> Cc: Mateusz Guzik 
> Cc: Cyrill Gorcunov 
> ---
> v3 --> v4:
> * Protected values update with down_read + spin_lock to prevent from race
>   condition between prctl and sys_brk and made prctl more friendly to VM
>   operations per Michal's suggestion
> 
> v2 --> v3:
> * Restored down_write in prctl syscall
> * Elaborate the limitation of this patch suggested by Michal
> * Protect those fields by the new lock except brk and start_brk per Michal's
>   suggestion
> * Based off Cyrill's non PR_SET_MM_MAP oprations deprecation patch
>   (https://lkml.org/lkml/2018/4/5/541)
> 
> v1 --> v2:
> * Use spinlock instead of rwlock per Mattew's suggestion
> * Replace down_write to down_read in prctl_set_mm (see commit log for details)
>  fs/proc/base.c   | 8 
>  include/linux/mm_types.h | 2 ++
>  kernel/fork.c| 1 +
>  kernel/sys.c | 6 --
>  mm/init-mm.c | 1 +
>  5 files changed, 12 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index eafa39a..3551757 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -239,12 +239,12 @@ static ssize_t proc_pid_cmdline_read(struct file *file, 
> char __user *buf,
>   goto out_mmput;
>   }
>  
> - down_read(>mmap_sem);
> + spin_lock(>arg_lock);
>   arg_start = mm->arg_start;
>   arg_end = mm->arg_end;
>   env_start = mm->env_start;
>   env_end = mm->env_end;
> - up_read(>mmap_sem);
> + spin_unlock(>arg_lock);
>  
>   BUG_ON(arg_start > arg_end);
>   BUG_ON(env_start > env_end);
> @@ -929,10 +929,10 @@ static ssize_t environ_read(struct file *file, char 
> __user *buf,
>   if (!mmget_not_zero(mm))
>   goto free;
>  
> - down_read(>mmap_sem);
> + spin_lock(>arg_lock);
>   env_start = mm->env_start;
>   env_end = mm->env_end;
> - up_read(>mmap_sem);
> + spin_unlock(>arg_lock);
>  
>   while (count > 0) {
>   size_t this_len, max_len;
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 2161234..49dd59e 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -413,6 +413,8 @@ struct mm_struct {
>   unsigned long exec_vm;  /* VM_EXEC & ~VM_WRITE & ~VM_STACK */
>   unsigned long stack_vm; /* VM_STACK */
>   unsigned long def_flags;
> +
> + spinlock_t arg_lock; /* protect the below fields */

What the reason is spinlock is used to protect this fields?
There may be several readers, say, doing "ps axf" and it's
OK for them to access these fields in parallel. Why should
we delay them by each other?

rw_lock seems be more suitable for here.

>   unsigned long start_code, 

Re: [v4 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-04-18 Thread Kirill Tkhai
On 14.04.2018 21:24, Yang Shi wrote:
> mmap_sem is on the hot path of kernel, and it very contended, but it is
> abused too. It is used to protect arg_start|end and evn_start|end when
> reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make
> sense since those proc files just expect to read 4 values atomically and
> not related to VM, they could be set to arbitrary values by C/R.
> 
> And, the mmap_sem contention may cause unexpected issue like below:
> 
> INFO: task ps:14018 blocked for more than 120 seconds.
>Tainted: GE 4.9.79-009.ali3000.alios7.x86_64 #1
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> message.
>  ps  D0 14018  1 0x0004
>   885582f84000 885e8682f000 880972943000 885ebf499bc0
>   8828ee12 c900349bfca8 817154d0 0040
>   00ff812f872a 885ebf499bc0 024000d000948300 880972943000
>  Call Trace:
>   [] ? __schedule+0x250/0x730
>   [] schedule+0x36/0x80
>   [] rwsem_down_read_failed+0xf0/0x150
>   [] call_rwsem_down_read_failed+0x18/0x30
>   [] down_read+0x20/0x40
>   [] proc_pid_cmdline_read+0xd9/0x4e0
>   [] ? do_filp_open+0xa5/0x100
>   [] __vfs_read+0x37/0x150
>   [] ? security_file_permission+0x9b/0xc0
>   [] vfs_read+0x96/0x130
>   [] SyS_read+0x55/0xc0
>   [] entry_SYSCALL_64_fastpath+0x1a/0xc5
> 
> Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock
> for them to mitigate the abuse of mmap_sem.
> 
> So, introduce a new spinlock in mm_struct to protect the concurrent
> access to arg_start|end, env_start|end and others, as well as replace
> write map_sem to read to protect the race condition between prctl and
> sys_brk which might break check_data_rlimit(), and makes prctl more
> friendly to other VM operations.
> 
> This patch just eliminates the abuse of mmap_sem, but it can't resolve the
> above hung task warning completely since the later access_remote_vm() call
> needs acquire mmap_sem. The mmap_sem scalability issue will be solved in the
> future.
> 
> Signed-off-by: Yang Shi 
> Cc: Alexey Dobriyan 
> Cc: Michal Hocko 
> Cc: Matthew Wilcox 
> Cc: Mateusz Guzik 
> Cc: Cyrill Gorcunov 
> ---
> v3 --> v4:
> * Protected values update with down_read + spin_lock to prevent from race
>   condition between prctl and sys_brk and made prctl more friendly to VM
>   operations per Michal's suggestion
> 
> v2 --> v3:
> * Restored down_write in prctl syscall
> * Elaborate the limitation of this patch suggested by Michal
> * Protect those fields by the new lock except brk and start_brk per Michal's
>   suggestion
> * Based off Cyrill's non PR_SET_MM_MAP oprations deprecation patch
>   (https://lkml.org/lkml/2018/4/5/541)
> 
> v1 --> v2:
> * Use spinlock instead of rwlock per Mattew's suggestion
> * Replace down_write to down_read in prctl_set_mm (see commit log for details)
>  fs/proc/base.c   | 8 
>  include/linux/mm_types.h | 2 ++
>  kernel/fork.c| 1 +
>  kernel/sys.c | 6 --
>  mm/init-mm.c | 1 +
>  5 files changed, 12 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index eafa39a..3551757 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -239,12 +239,12 @@ static ssize_t proc_pid_cmdline_read(struct file *file, 
> char __user *buf,
>   goto out_mmput;
>   }
>  
> - down_read(>mmap_sem);
> + spin_lock(>arg_lock);
>   arg_start = mm->arg_start;
>   arg_end = mm->arg_end;
>   env_start = mm->env_start;
>   env_end = mm->env_end;
> - up_read(>mmap_sem);
> + spin_unlock(>arg_lock);
>  
>   BUG_ON(arg_start > arg_end);
>   BUG_ON(env_start > env_end);
> @@ -929,10 +929,10 @@ static ssize_t environ_read(struct file *file, char 
> __user *buf,
>   if (!mmget_not_zero(mm))
>   goto free;
>  
> - down_read(>mmap_sem);
> + spin_lock(>arg_lock);
>   env_start = mm->env_start;
>   env_end = mm->env_end;
> - up_read(>mmap_sem);
> + spin_unlock(>arg_lock);
>  
>   while (count > 0) {
>   size_t this_len, max_len;
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 2161234..49dd59e 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -413,6 +413,8 @@ struct mm_struct {
>   unsigned long exec_vm;  /* VM_EXEC & ~VM_WRITE & ~VM_STACK */
>   unsigned long stack_vm; /* VM_STACK */
>   unsigned long def_flags;
> +
> + spinlock_t arg_lock; /* protect the below fields */

What the reason is spinlock is used to protect this fields?
There may be several readers, say, doing "ps axf" and it's
OK for them to access these fields in parallel. Why should
we delay them by each other?

rw_lock seems be more suitable for here.

>   unsigned long start_code, end_code, start_data, end_data;
>   unsigned long start_brk, brk, start_stack;
>   unsigned long arg_start, arg_end, 

Re: [v4 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-04-18 Thread Cyrill Gorcunov
On Wed, Apr 18, 2018 at 11:03:14AM +0200, Michal Hocko wrote:
> > > 
> > > What about something like the following?
> > > "
> > > arg_lock protects concurent updates but we still need mmap_sem for read
> > > to exclude races with do_brk.
> > > "
> > > Acked-by: Michal Hocko 
> > 
> > Yes, thanks! Andrew, could you slightly update the changelog please?
> 
> No, I meant it to be a comment in the _code_.

Ah, I see. Then small patch on top should do the trick.


Re: [v4 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-04-18 Thread Cyrill Gorcunov
On Wed, Apr 18, 2018 at 11:03:14AM +0200, Michal Hocko wrote:
> > > 
> > > What about something like the following?
> > > "
> > > arg_lock protects concurent updates but we still need mmap_sem for read
> > > to exclude races with do_brk.
> > > "
> > > Acked-by: Michal Hocko 
> > 
> > Yes, thanks! Andrew, could you slightly update the changelog please?
> 
> No, I meant it to be a comment in the _code_.

Ah, I see. Then small patch on top should do the trick.


Re: [v4 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-04-18 Thread Michal Hocko
On Wed 18-04-18 12:02:17, Cyrill Gorcunov wrote:
> On Wed, Apr 18, 2018 at 10:05:55AM +0200, Michal Hocko wrote:
> > 
> > Yes, looks good to me. As mentioned in other emails prctl_set_mm_map
> > really deserves a comment explaining why we are doing the down_read
> > 
> > What about something like the following?
> > "
> > arg_lock protects concurent updates but we still need mmap_sem for read
> > to exclude races with do_brk.
> > "
> > Acked-by: Michal Hocko 
> 
> Yes, thanks! Andrew, could you slightly update the changelog please?

No, I meant it to be a comment in the _code_.

-- 
Michal Hocko
SUSE Labs


Re: [v4 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-04-18 Thread Michal Hocko
On Wed 18-04-18 12:02:17, Cyrill Gorcunov wrote:
> On Wed, Apr 18, 2018 at 10:05:55AM +0200, Michal Hocko wrote:
> > 
> > Yes, looks good to me. As mentioned in other emails prctl_set_mm_map
> > really deserves a comment explaining why we are doing the down_read
> > 
> > What about something like the following?
> > "
> > arg_lock protects concurent updates but we still need mmap_sem for read
> > to exclude races with do_brk.
> > "
> > Acked-by: Michal Hocko 
> 
> Yes, thanks! Andrew, could you slightly update the changelog please?

No, I meant it to be a comment in the _code_.

-- 
Michal Hocko
SUSE Labs


Re: [v4 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-04-18 Thread Cyrill Gorcunov
On Wed, Apr 18, 2018 at 10:05:55AM +0200, Michal Hocko wrote:
> 
> Yes, looks good to me. As mentioned in other emails prctl_set_mm_map
> really deserves a comment explaining why we are doing the down_read
> 
> What about something like the following?
> "
> arg_lock protects concurent updates but we still need mmap_sem for read
> to exclude races with do_brk.
> "
> Acked-by: Michal Hocko 

Yes, thanks! Andrew, could you slightly update the changelog please?


Re: [v4 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-04-18 Thread Cyrill Gorcunov
On Wed, Apr 18, 2018 at 10:05:55AM +0200, Michal Hocko wrote:
> 
> Yes, looks good to me. As mentioned in other emails prctl_set_mm_map
> really deserves a comment explaining why we are doing the down_read
> 
> What about something like the following?
> "
> arg_lock protects concurent updates but we still need mmap_sem for read
> to exclude races with do_brk.
> "
> Acked-by: Michal Hocko 

Yes, thanks! Andrew, could you slightly update the changelog please?


Re: [v4 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-04-18 Thread Michal Hocko
On Sun 15-04-18 02:24:51, Yang Shi wrote:
> mmap_sem is on the hot path of kernel, and it very contended, but it is
> abused too. It is used to protect arg_start|end and evn_start|end when
> reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make
> sense since those proc files just expect to read 4 values atomically and
> not related to VM, they could be set to arbitrary values by C/R.
> 
> And, the mmap_sem contention may cause unexpected issue like below:
> 
> INFO: task ps:14018 blocked for more than 120 seconds.
>Tainted: GE 4.9.79-009.ali3000.alios7.x86_64 #1
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> message.
>  ps  D0 14018  1 0x0004
>   885582f84000 885e8682f000 880972943000 885ebf499bc0
>   8828ee12 c900349bfca8 817154d0 0040
>   00ff812f872a 885ebf499bc0 024000d000948300 880972943000
>  Call Trace:
>   [] ? __schedule+0x250/0x730
>   [] schedule+0x36/0x80
>   [] rwsem_down_read_failed+0xf0/0x150
>   [] call_rwsem_down_read_failed+0x18/0x30
>   [] down_read+0x20/0x40
>   [] proc_pid_cmdline_read+0xd9/0x4e0
>   [] ? do_filp_open+0xa5/0x100
>   [] __vfs_read+0x37/0x150
>   [] ? security_file_permission+0x9b/0xc0
>   [] vfs_read+0x96/0x130
>   [] SyS_read+0x55/0xc0
>   [] entry_SYSCALL_64_fastpath+0x1a/0xc5
> 
> Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock
> for them to mitigate the abuse of mmap_sem.
> 
> So, introduce a new spinlock in mm_struct to protect the concurrent
> access to arg_start|end, env_start|end and others, as well as replace
> write map_sem to read to protect the race condition between prctl and
> sys_brk which might break check_data_rlimit(), and makes prctl more
> friendly to other VM operations.
> 
> This patch just eliminates the abuse of mmap_sem, but it can't resolve the
> above hung task warning completely since the later access_remote_vm() call
> needs acquire mmap_sem. The mmap_sem scalability issue will be solved in the
> future.
> 
> Signed-off-by: Yang Shi 
> Cc: Alexey Dobriyan 
> Cc: Michal Hocko 
> Cc: Matthew Wilcox 
> Cc: Mateusz Guzik 
> Cc: Cyrill Gorcunov 

Yes, looks good to me. As mentioned in other emails prctl_set_mm_map
really deserves a comment explaining why we are doing the down_read

What about something like the following?
"
arg_lock protects concurent updates but we still need mmap_sem for read
to exclude races with do_brk.
"
Acked-by: Michal Hocko 

> ---
> v3 --> v4:
> * Protected values update with down_read + spin_lock to prevent from race
>   condition between prctl and sys_brk and made prctl more friendly to VM
>   operations per Michal's suggestion
> 
> v2 --> v3:
> * Restored down_write in prctl syscall
> * Elaborate the limitation of this patch suggested by Michal
> * Protect those fields by the new lock except brk and start_brk per Michal's
>   suggestion
> * Based off Cyrill's non PR_SET_MM_MAP oprations deprecation patch
>   (https://lkml.org/lkml/2018/4/5/541)
> 
> v1 --> v2:
> * Use spinlock instead of rwlock per Mattew's suggestion
> * Replace down_write to down_read in prctl_set_mm (see commit log for details)
>  fs/proc/base.c   | 8 
>  include/linux/mm_types.h | 2 ++
>  kernel/fork.c| 1 +
>  kernel/sys.c | 6 --
>  mm/init-mm.c | 1 +
>  5 files changed, 12 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index eafa39a..3551757 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -239,12 +239,12 @@ static ssize_t proc_pid_cmdline_read(struct file *file, 
> char __user *buf,
>   goto out_mmput;
>   }
>  
> - down_read(>mmap_sem);
> + spin_lock(>arg_lock);
>   arg_start = mm->arg_start;
>   arg_end = mm->arg_end;
>   env_start = mm->env_start;
>   env_end = mm->env_end;
> - up_read(>mmap_sem);
> + spin_unlock(>arg_lock);
>  
>   BUG_ON(arg_start > arg_end);
>   BUG_ON(env_start > env_end);
> @@ -929,10 +929,10 @@ static ssize_t environ_read(struct file *file, char 
> __user *buf,
>   if (!mmget_not_zero(mm))
>   goto free;
>  
> - down_read(>mmap_sem);
> + spin_lock(>arg_lock);
>   env_start = mm->env_start;
>   env_end = mm->env_end;
> - up_read(>mmap_sem);
> + spin_unlock(>arg_lock);
>  
>   while (count > 0) {
>   size_t this_len, max_len;
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 2161234..49dd59e 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -413,6 +413,8 @@ struct mm_struct {
>   unsigned long exec_vm;  /* VM_EXEC & ~VM_WRITE & ~VM_STACK */
>   unsigned long stack_vm; /* VM_STACK */
>   unsigned long def_flags;
> +
> + 

Re: [v4 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-04-18 Thread Michal Hocko
On Sun 15-04-18 02:24:51, Yang Shi wrote:
> mmap_sem is on the hot path of kernel, and it very contended, but it is
> abused too. It is used to protect arg_start|end and evn_start|end when
> reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make
> sense since those proc files just expect to read 4 values atomically and
> not related to VM, they could be set to arbitrary values by C/R.
> 
> And, the mmap_sem contention may cause unexpected issue like below:
> 
> INFO: task ps:14018 blocked for more than 120 seconds.
>Tainted: GE 4.9.79-009.ali3000.alios7.x86_64 #1
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> message.
>  ps  D0 14018  1 0x0004
>   885582f84000 885e8682f000 880972943000 885ebf499bc0
>   8828ee12 c900349bfca8 817154d0 0040
>   00ff812f872a 885ebf499bc0 024000d000948300 880972943000
>  Call Trace:
>   [] ? __schedule+0x250/0x730
>   [] schedule+0x36/0x80
>   [] rwsem_down_read_failed+0xf0/0x150
>   [] call_rwsem_down_read_failed+0x18/0x30
>   [] down_read+0x20/0x40
>   [] proc_pid_cmdline_read+0xd9/0x4e0
>   [] ? do_filp_open+0xa5/0x100
>   [] __vfs_read+0x37/0x150
>   [] ? security_file_permission+0x9b/0xc0
>   [] vfs_read+0x96/0x130
>   [] SyS_read+0x55/0xc0
>   [] entry_SYSCALL_64_fastpath+0x1a/0xc5
> 
> Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock
> for them to mitigate the abuse of mmap_sem.
> 
> So, introduce a new spinlock in mm_struct to protect the concurrent
> access to arg_start|end, env_start|end and others, as well as replace
> write map_sem to read to protect the race condition between prctl and
> sys_brk which might break check_data_rlimit(), and makes prctl more
> friendly to other VM operations.
> 
> This patch just eliminates the abuse of mmap_sem, but it can't resolve the
> above hung task warning completely since the later access_remote_vm() call
> needs acquire mmap_sem. The mmap_sem scalability issue will be solved in the
> future.
> 
> Signed-off-by: Yang Shi 
> Cc: Alexey Dobriyan 
> Cc: Michal Hocko 
> Cc: Matthew Wilcox 
> Cc: Mateusz Guzik 
> Cc: Cyrill Gorcunov 

Yes, looks good to me. As mentioned in other emails prctl_set_mm_map
really deserves a comment explaining why we are doing the down_read

What about something like the following?
"
arg_lock protects concurent updates but we still need mmap_sem for read
to exclude races with do_brk.
"
Acked-by: Michal Hocko 

> ---
> v3 --> v4:
> * Protected values update with down_read + spin_lock to prevent from race
>   condition between prctl and sys_brk and made prctl more friendly to VM
>   operations per Michal's suggestion
> 
> v2 --> v3:
> * Restored down_write in prctl syscall
> * Elaborate the limitation of this patch suggested by Michal
> * Protect those fields by the new lock except brk and start_brk per Michal's
>   suggestion
> * Based off Cyrill's non PR_SET_MM_MAP oprations deprecation patch
>   (https://lkml.org/lkml/2018/4/5/541)
> 
> v1 --> v2:
> * Use spinlock instead of rwlock per Mattew's suggestion
> * Replace down_write to down_read in prctl_set_mm (see commit log for details)
>  fs/proc/base.c   | 8 
>  include/linux/mm_types.h | 2 ++
>  kernel/fork.c| 1 +
>  kernel/sys.c | 6 --
>  mm/init-mm.c | 1 +
>  5 files changed, 12 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index eafa39a..3551757 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -239,12 +239,12 @@ static ssize_t proc_pid_cmdline_read(struct file *file, 
> char __user *buf,
>   goto out_mmput;
>   }
>  
> - down_read(>mmap_sem);
> + spin_lock(>arg_lock);
>   arg_start = mm->arg_start;
>   arg_end = mm->arg_end;
>   env_start = mm->env_start;
>   env_end = mm->env_end;
> - up_read(>mmap_sem);
> + spin_unlock(>arg_lock);
>  
>   BUG_ON(arg_start > arg_end);
>   BUG_ON(env_start > env_end);
> @@ -929,10 +929,10 @@ static ssize_t environ_read(struct file *file, char 
> __user *buf,
>   if (!mmget_not_zero(mm))
>   goto free;
>  
> - down_read(>mmap_sem);
> + spin_lock(>arg_lock);
>   env_start = mm->env_start;
>   env_end = mm->env_end;
> - up_read(>mmap_sem);
> + spin_unlock(>arg_lock);
>  
>   while (count > 0) {
>   size_t this_len, max_len;
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 2161234..49dd59e 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -413,6 +413,8 @@ struct mm_struct {
>   unsigned long exec_vm;  /* VM_EXEC & ~VM_WRITE & ~VM_STACK */
>   unsigned long stack_vm; /* VM_STACK */
>   unsigned long def_flags;
> +
> + spinlock_t arg_lock; /* protect the below fields */
>   unsigned long start_code, end_code, start_data, end_data;
>   unsigned long start_brk, 

Re: [v4 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-04-18 Thread Michal Hocko
On Tue 17-04-18 23:39:19, Cyrill Gorcunov wrote:
> On Tue, Apr 17, 2018 at 11:29:57AM -0700, Andrew Morton wrote:
> > On Sun, 15 Apr 2018 02:24:51 +0800 Yang Shi  
> > wrote:
> > 
> > > mmap_sem is on the hot path of kernel, and it very contended, but it is
> > > abused too. It is used to protect arg_start|end and evn_start|end when
> > > reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make
> > > sense since those proc files just expect to read 4 values atomically and
> > > not related to VM, they could be set to arbitrary values by C/R.
> > > 
> > > And, the mmap_sem contention may cause unexpected issue like below:
> > > 
> > > INFO: task ps:14018 blocked for more than 120 seconds.
> > >Tainted: GE 4.9.79-009.ali3000.alios7.x86_64 #1
> > >  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> > > message.
> > >  ps  D0 14018  1 0x0004
> > >   885582f84000 885e8682f000 880972943000 885ebf499bc0
> > >   8828ee12 c900349bfca8 817154d0 0040
> > >   00ff812f872a 885ebf499bc0 024000d000948300 880972943000
> > >  Call Trace:
> > >   [] ? __schedule+0x250/0x730
> > >   [] schedule+0x36/0x80
> > >   [] rwsem_down_read_failed+0xf0/0x150
> > >   [] call_rwsem_down_read_failed+0x18/0x30
> > >   [] down_read+0x20/0x40
> > >   [] proc_pid_cmdline_read+0xd9/0x4e0
> > >   [] ? do_filp_open+0xa5/0x100
> > >   [] __vfs_read+0x37/0x150
> > >   [] ? security_file_permission+0x9b/0xc0
> > >   [] vfs_read+0x96/0x130
> > >   [] SyS_read+0x55/0xc0
> > >   [] entry_SYSCALL_64_fastpath+0x1a/0xc5
> > > 
> > > Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock
> > > for them to mitigate the abuse of mmap_sem.
> > > 
> > > So, introduce a new spinlock in mm_struct to protect the concurrent
> > > access to arg_start|end, env_start|end and others, as well as replace
> > > write map_sem to read to protect the race condition between prctl and
> > > sys_brk which might break check_data_rlimit(), and makes prctl more
> > > friendly to other VM operations.
> > 
> > (We should move check_data_rlimit() out of the .h file)
> > 
> > It seems inconsistent to be using mmap_sem to protect ->start_brk and
> > friends in sys_brk().  We've already declared that these are protected
> > by arg_lock so that's what we should be using?  And getting this
> > consistent should permit us to stop using mmap_sem in prctl()
> > altogether?
> 
> Nope, we still can't. Look, the down_read part order the call with
> sys_brk. while arg_lock orders prctl call itself. That said if
> someone is calling sys_brk while we're in a middle of prctl it
> should wait until prctl finished. But two simultaneous prcl
> may proceed without taking a write lock using arg_lock as
> a barrier.

A small comment would be due. The changelog mentions this but it would
be nicer to comment why we care about mmap_sem for read in prctl.

-- 
Michal Hocko
SUSE Labs


Re: [v4 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-04-18 Thread Michal Hocko
On Tue 17-04-18 23:39:19, Cyrill Gorcunov wrote:
> On Tue, Apr 17, 2018 at 11:29:57AM -0700, Andrew Morton wrote:
> > On Sun, 15 Apr 2018 02:24:51 +0800 Yang Shi  
> > wrote:
> > 
> > > mmap_sem is on the hot path of kernel, and it very contended, but it is
> > > abused too. It is used to protect arg_start|end and evn_start|end when
> > > reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make
> > > sense since those proc files just expect to read 4 values atomically and
> > > not related to VM, they could be set to arbitrary values by C/R.
> > > 
> > > And, the mmap_sem contention may cause unexpected issue like below:
> > > 
> > > INFO: task ps:14018 blocked for more than 120 seconds.
> > >Tainted: GE 4.9.79-009.ali3000.alios7.x86_64 #1
> > >  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> > > message.
> > >  ps  D0 14018  1 0x0004
> > >   885582f84000 885e8682f000 880972943000 885ebf499bc0
> > >   8828ee12 c900349bfca8 817154d0 0040
> > >   00ff812f872a 885ebf499bc0 024000d000948300 880972943000
> > >  Call Trace:
> > >   [] ? __schedule+0x250/0x730
> > >   [] schedule+0x36/0x80
> > >   [] rwsem_down_read_failed+0xf0/0x150
> > >   [] call_rwsem_down_read_failed+0x18/0x30
> > >   [] down_read+0x20/0x40
> > >   [] proc_pid_cmdline_read+0xd9/0x4e0
> > >   [] ? do_filp_open+0xa5/0x100
> > >   [] __vfs_read+0x37/0x150
> > >   [] ? security_file_permission+0x9b/0xc0
> > >   [] vfs_read+0x96/0x130
> > >   [] SyS_read+0x55/0xc0
> > >   [] entry_SYSCALL_64_fastpath+0x1a/0xc5
> > > 
> > > Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock
> > > for them to mitigate the abuse of mmap_sem.
> > > 
> > > So, introduce a new spinlock in mm_struct to protect the concurrent
> > > access to arg_start|end, env_start|end and others, as well as replace
> > > write map_sem to read to protect the race condition between prctl and
> > > sys_brk which might break check_data_rlimit(), and makes prctl more
> > > friendly to other VM operations.
> > 
> > (We should move check_data_rlimit() out of the .h file)
> > 
> > It seems inconsistent to be using mmap_sem to protect ->start_brk and
> > friends in sys_brk().  We've already declared that these are protected
> > by arg_lock so that's what we should be using?  And getting this
> > consistent should permit us to stop using mmap_sem in prctl()
> > altogether?
> 
> Nope, we still can't. Look, the down_read part order the call with
> sys_brk. while arg_lock orders prctl call itself. That said if
> someone is calling sys_brk while we're in a middle of prctl it
> should wait until prctl finished. But two simultaneous prcl
> may proceed without taking a write lock using arg_lock as
> a barrier.

A small comment would be due. The changelog mentions this but it would
be nicer to comment why we care about mmap_sem for read in prctl.

-- 
Michal Hocko
SUSE Labs


Re: [v4 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-04-17 Thread Yang Shi



On 4/17/18 11:29 AM, Andrew Morton wrote:

On Sun, 15 Apr 2018 02:24:51 +0800 Yang Shi  wrote:


mmap_sem is on the hot path of kernel, and it very contended, but it is
abused too. It is used to protect arg_start|end and evn_start|end when
reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make
sense since those proc files just expect to read 4 values atomically and
not related to VM, they could be set to arbitrary values by C/R.

And, the mmap_sem contention may cause unexpected issue like below:

INFO: task ps:14018 blocked for more than 120 seconds.
Tainted: GE 4.9.79-009.ali3000.alios7.x86_64 #1
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
  ps  D0 14018  1 0x0004
   885582f84000 885e8682f000 880972943000 885ebf499bc0
   8828ee12 c900349bfca8 817154d0 0040
   00ff812f872a 885ebf499bc0 024000d000948300 880972943000
  Call Trace:
   [] ? __schedule+0x250/0x730
   [] schedule+0x36/0x80
   [] rwsem_down_read_failed+0xf0/0x150
   [] call_rwsem_down_read_failed+0x18/0x30
   [] down_read+0x20/0x40
   [] proc_pid_cmdline_read+0xd9/0x4e0
   [] ? do_filp_open+0xa5/0x100
   [] __vfs_read+0x37/0x150
   [] ? security_file_permission+0x9b/0xc0
   [] vfs_read+0x96/0x130
   [] SyS_read+0x55/0xc0
   [] entry_SYSCALL_64_fastpath+0x1a/0xc5

Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock
for them to mitigate the abuse of mmap_sem.

So, introduce a new spinlock in mm_struct to protect the concurrent
access to arg_start|end, env_start|end and others, as well as replace
write map_sem to read to protect the race condition between prctl and
sys_brk which might break check_data_rlimit(), and makes prctl more
friendly to other VM operations.

(We should move check_data_rlimit() out of the .h file)


I don't get the point, check_data_rlimit() is still used by both prctl 
and sys_brk.




It seems inconsistent to be using mmap_sem to protect ->start_brk and
friends in sys_brk().  We've already declared that these are protected
by arg_lock so that's what we should be using?  And getting this
consistent should permit us to stop using mmap_sem in prctl()
altogether?


Cyrill already helped to elaborate the reason. arg_lock just can protect 
the concurrent access of brk between prctl calls, but not sys_brk.


Thanks,
Yang




Re: [v4 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-04-17 Thread Yang Shi



On 4/17/18 11:29 AM, Andrew Morton wrote:

On Sun, 15 Apr 2018 02:24:51 +0800 Yang Shi  wrote:


mmap_sem is on the hot path of kernel, and it very contended, but it is
abused too. It is used to protect arg_start|end and evn_start|end when
reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make
sense since those proc files just expect to read 4 values atomically and
not related to VM, they could be set to arbitrary values by C/R.

And, the mmap_sem contention may cause unexpected issue like below:

INFO: task ps:14018 blocked for more than 120 seconds.
Tainted: GE 4.9.79-009.ali3000.alios7.x86_64 #1
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
  ps  D0 14018  1 0x0004
   885582f84000 885e8682f000 880972943000 885ebf499bc0
   8828ee12 c900349bfca8 817154d0 0040
   00ff812f872a 885ebf499bc0 024000d000948300 880972943000
  Call Trace:
   [] ? __schedule+0x250/0x730
   [] schedule+0x36/0x80
   [] rwsem_down_read_failed+0xf0/0x150
   [] call_rwsem_down_read_failed+0x18/0x30
   [] down_read+0x20/0x40
   [] proc_pid_cmdline_read+0xd9/0x4e0
   [] ? do_filp_open+0xa5/0x100
   [] __vfs_read+0x37/0x150
   [] ? security_file_permission+0x9b/0xc0
   [] vfs_read+0x96/0x130
   [] SyS_read+0x55/0xc0
   [] entry_SYSCALL_64_fastpath+0x1a/0xc5

Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock
for them to mitigate the abuse of mmap_sem.

So, introduce a new spinlock in mm_struct to protect the concurrent
access to arg_start|end, env_start|end and others, as well as replace
write map_sem to read to protect the race condition between prctl and
sys_brk which might break check_data_rlimit(), and makes prctl more
friendly to other VM operations.

(We should move check_data_rlimit() out of the .h file)


I don't get the point, check_data_rlimit() is still used by both prctl 
and sys_brk.




It seems inconsistent to be using mmap_sem to protect ->start_brk and
friends in sys_brk().  We've already declared that these are protected
by arg_lock so that's what we should be using?  And getting this
consistent should permit us to stop using mmap_sem in prctl()
altogether?


Cyrill already helped to elaborate the reason. arg_lock just can protect 
the concurrent access of brk between prctl calls, but not sys_brk.


Thanks,
Yang




Re: [v4 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-04-17 Thread Cyrill Gorcunov
On Tue, Apr 17, 2018 at 11:29:57AM -0700, Andrew Morton wrote:
> On Sun, 15 Apr 2018 02:24:51 +0800 Yang Shi  
> wrote:
> 
> > mmap_sem is on the hot path of kernel, and it very contended, but it is
> > abused too. It is used to protect arg_start|end and evn_start|end when
> > reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make
> > sense since those proc files just expect to read 4 values atomically and
> > not related to VM, they could be set to arbitrary values by C/R.
> > 
> > And, the mmap_sem contention may cause unexpected issue like below:
> > 
> > INFO: task ps:14018 blocked for more than 120 seconds.
> >Tainted: GE 4.9.79-009.ali3000.alios7.x86_64 #1
> >  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> > message.
> >  ps  D0 14018  1 0x0004
> >   885582f84000 885e8682f000 880972943000 885ebf499bc0
> >   8828ee12 c900349bfca8 817154d0 0040
> >   00ff812f872a 885ebf499bc0 024000d000948300 880972943000
> >  Call Trace:
> >   [] ? __schedule+0x250/0x730
> >   [] schedule+0x36/0x80
> >   [] rwsem_down_read_failed+0xf0/0x150
> >   [] call_rwsem_down_read_failed+0x18/0x30
> >   [] down_read+0x20/0x40
> >   [] proc_pid_cmdline_read+0xd9/0x4e0
> >   [] ? do_filp_open+0xa5/0x100
> >   [] __vfs_read+0x37/0x150
> >   [] ? security_file_permission+0x9b/0xc0
> >   [] vfs_read+0x96/0x130
> >   [] SyS_read+0x55/0xc0
> >   [] entry_SYSCALL_64_fastpath+0x1a/0xc5
> > 
> > Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock
> > for them to mitigate the abuse of mmap_sem.
> > 
> > So, introduce a new spinlock in mm_struct to protect the concurrent
> > access to arg_start|end, env_start|end and others, as well as replace
> > write map_sem to read to protect the race condition between prctl and
> > sys_brk which might break check_data_rlimit(), and makes prctl more
> > friendly to other VM operations.
> 
> (We should move check_data_rlimit() out of the .h file)
> 
> It seems inconsistent to be using mmap_sem to protect ->start_brk and
> friends in sys_brk().  We've already declared that these are protected
> by arg_lock so that's what we should be using?  And getting this
> consistent should permit us to stop using mmap_sem in prctl()
> altogether?

Nope, we still can't. Look, the down_read part order the call with
sys_brk. while arg_lock orders prctl call itself. That said if
someone is calling sys_brk while we're in a middle of prctl it
should wait until prctl finished. But two simultaneous prcl
may proceed without taking a write lock using arg_lock as
a barrier.


Re: [v4 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-04-17 Thread Cyrill Gorcunov
On Tue, Apr 17, 2018 at 11:29:57AM -0700, Andrew Morton wrote:
> On Sun, 15 Apr 2018 02:24:51 +0800 Yang Shi  
> wrote:
> 
> > mmap_sem is on the hot path of kernel, and it very contended, but it is
> > abused too. It is used to protect arg_start|end and evn_start|end when
> > reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make
> > sense since those proc files just expect to read 4 values atomically and
> > not related to VM, they could be set to arbitrary values by C/R.
> > 
> > And, the mmap_sem contention may cause unexpected issue like below:
> > 
> > INFO: task ps:14018 blocked for more than 120 seconds.
> >Tainted: GE 4.9.79-009.ali3000.alios7.x86_64 #1
> >  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> > message.
> >  ps  D0 14018  1 0x0004
> >   885582f84000 885e8682f000 880972943000 885ebf499bc0
> >   8828ee12 c900349bfca8 817154d0 0040
> >   00ff812f872a 885ebf499bc0 024000d000948300 880972943000
> >  Call Trace:
> >   [] ? __schedule+0x250/0x730
> >   [] schedule+0x36/0x80
> >   [] rwsem_down_read_failed+0xf0/0x150
> >   [] call_rwsem_down_read_failed+0x18/0x30
> >   [] down_read+0x20/0x40
> >   [] proc_pid_cmdline_read+0xd9/0x4e0
> >   [] ? do_filp_open+0xa5/0x100
> >   [] __vfs_read+0x37/0x150
> >   [] ? security_file_permission+0x9b/0xc0
> >   [] vfs_read+0x96/0x130
> >   [] SyS_read+0x55/0xc0
> >   [] entry_SYSCALL_64_fastpath+0x1a/0xc5
> > 
> > Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock
> > for them to mitigate the abuse of mmap_sem.
> > 
> > So, introduce a new spinlock in mm_struct to protect the concurrent
> > access to arg_start|end, env_start|end and others, as well as replace
> > write map_sem to read to protect the race condition between prctl and
> > sys_brk which might break check_data_rlimit(), and makes prctl more
> > friendly to other VM operations.
> 
> (We should move check_data_rlimit() out of the .h file)
> 
> It seems inconsistent to be using mmap_sem to protect ->start_brk and
> friends in sys_brk().  We've already declared that these are protected
> by arg_lock so that's what we should be using?  And getting this
> consistent should permit us to stop using mmap_sem in prctl()
> altogether?

Nope, we still can't. Look, the down_read part order the call with
sys_brk. while arg_lock orders prctl call itself. That said if
someone is calling sys_brk while we're in a middle of prctl it
should wait until prctl finished. But two simultaneous prcl
may proceed without taking a write lock using arg_lock as
a barrier.


Re: [v4 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-04-17 Thread Andrew Morton
On Sun, 15 Apr 2018 02:24:51 +0800 Yang Shi  wrote:

> mmap_sem is on the hot path of kernel, and it very contended, but it is
> abused too. It is used to protect arg_start|end and evn_start|end when
> reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make
> sense since those proc files just expect to read 4 values atomically and
> not related to VM, they could be set to arbitrary values by C/R.
> 
> And, the mmap_sem contention may cause unexpected issue like below:
> 
> INFO: task ps:14018 blocked for more than 120 seconds.
>Tainted: GE 4.9.79-009.ali3000.alios7.x86_64 #1
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> message.
>  ps  D0 14018  1 0x0004
>   885582f84000 885e8682f000 880972943000 885ebf499bc0
>   8828ee12 c900349bfca8 817154d0 0040
>   00ff812f872a 885ebf499bc0 024000d000948300 880972943000
>  Call Trace:
>   [] ? __schedule+0x250/0x730
>   [] schedule+0x36/0x80
>   [] rwsem_down_read_failed+0xf0/0x150
>   [] call_rwsem_down_read_failed+0x18/0x30
>   [] down_read+0x20/0x40
>   [] proc_pid_cmdline_read+0xd9/0x4e0
>   [] ? do_filp_open+0xa5/0x100
>   [] __vfs_read+0x37/0x150
>   [] ? security_file_permission+0x9b/0xc0
>   [] vfs_read+0x96/0x130
>   [] SyS_read+0x55/0xc0
>   [] entry_SYSCALL_64_fastpath+0x1a/0xc5
> 
> Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock
> for them to mitigate the abuse of mmap_sem.
> 
> So, introduce a new spinlock in mm_struct to protect the concurrent
> access to arg_start|end, env_start|end and others, as well as replace
> write map_sem to read to protect the race condition between prctl and
> sys_brk which might break check_data_rlimit(), and makes prctl more
> friendly to other VM operations.

(We should move check_data_rlimit() out of the .h file)

It seems inconsistent to be using mmap_sem to protect ->start_brk and
friends in sys_brk().  We've already declared that these are protected
by arg_lock so that's what we should be using?  And getting this
consistent should permit us to stop using mmap_sem in prctl()
altogether?



Re: [v4 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-04-17 Thread Andrew Morton
On Sun, 15 Apr 2018 02:24:51 +0800 Yang Shi  wrote:

> mmap_sem is on the hot path of kernel, and it very contended, but it is
> abused too. It is used to protect arg_start|end and evn_start|end when
> reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make
> sense since those proc files just expect to read 4 values atomically and
> not related to VM, they could be set to arbitrary values by C/R.
> 
> And, the mmap_sem contention may cause unexpected issue like below:
> 
> INFO: task ps:14018 blocked for more than 120 seconds.
>Tainted: GE 4.9.79-009.ali3000.alios7.x86_64 #1
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> message.
>  ps  D0 14018  1 0x0004
>   885582f84000 885e8682f000 880972943000 885ebf499bc0
>   8828ee12 c900349bfca8 817154d0 0040
>   00ff812f872a 885ebf499bc0 024000d000948300 880972943000
>  Call Trace:
>   [] ? __schedule+0x250/0x730
>   [] schedule+0x36/0x80
>   [] rwsem_down_read_failed+0xf0/0x150
>   [] call_rwsem_down_read_failed+0x18/0x30
>   [] down_read+0x20/0x40
>   [] proc_pid_cmdline_read+0xd9/0x4e0
>   [] ? do_filp_open+0xa5/0x100
>   [] __vfs_read+0x37/0x150
>   [] ? security_file_permission+0x9b/0xc0
>   [] vfs_read+0x96/0x130
>   [] SyS_read+0x55/0xc0
>   [] entry_SYSCALL_64_fastpath+0x1a/0xc5
> 
> Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock
> for them to mitigate the abuse of mmap_sem.
> 
> So, introduce a new spinlock in mm_struct to protect the concurrent
> access to arg_start|end, env_start|end and others, as well as replace
> write map_sem to read to protect the race condition between prctl and
> sys_brk which might break check_data_rlimit(), and makes prctl more
> friendly to other VM operations.

(We should move check_data_rlimit() out of the .h file)

It seems inconsistent to be using mmap_sem to protect ->start_brk and
friends in sys_brk().  We've already declared that these are protected
by arg_lock so that's what we should be using?  And getting this
consistent should permit us to stop using mmap_sem in prctl()
altogether?



Re: [v4 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-04-15 Thread Cyrill Gorcunov
On Sun, Apr 15, 2018 at 02:24:51AM +0800, Yang Shi wrote:
> mmap_sem is on the hot path of kernel, and it very contended, but it is
> abused too. It is used to protect arg_start|end and evn_start|end when
> reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make
> sense since those proc files just expect to read 4 values atomically and
> not related to VM, they could be set to arbitrary values by C/R.
> 
> And, the mmap_sem contention may cause unexpected issue like below:
> 
> INFO: task ps:14018 blocked for more than 120 seconds.
>Tainted: GE 4.9.79-009.ali3000.alios7.x86_64 #1
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> message.
>  ps  D0 14018  1 0x0004
>   885582f84000 885e8682f000 880972943000 885ebf499bc0
>   8828ee12 c900349bfca8 817154d0 0040
>   00ff812f872a 885ebf499bc0 024000d000948300 880972943000
>  Call Trace:
>   [] ? __schedule+0x250/0x730
>   [] schedule+0x36/0x80
>   [] rwsem_down_read_failed+0xf0/0x150
>   [] call_rwsem_down_read_failed+0x18/0x30
>   [] down_read+0x20/0x40
>   [] proc_pid_cmdline_read+0xd9/0x4e0
>   [] ? do_filp_open+0xa5/0x100
>   [] __vfs_read+0x37/0x150
>   [] ? security_file_permission+0x9b/0xc0
>   [] vfs_read+0x96/0x130
>   [] SyS_read+0x55/0xc0
>   [] entry_SYSCALL_64_fastpath+0x1a/0xc5
> 
> Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock
> for them to mitigate the abuse of mmap_sem.
> 
> So, introduce a new spinlock in mm_struct to protect the concurrent
> access to arg_start|end, env_start|end and others, as well as replace
> write map_sem to read to protect the race condition between prctl and
> sys_brk which might break check_data_rlimit(), and makes prctl more
> friendly to other VM operations.
> 
> This patch just eliminates the abuse of mmap_sem, but it can't resolve the
> above hung task warning completely since the later access_remote_vm() call
> needs acquire mmap_sem. The mmap_sem scalability issue will be solved in the
> future.
> 
> Signed-off-by: Yang Shi 
> Cc: Alexey Dobriyan 
> Cc: Michal Hocko 
> Cc: Matthew Wilcox 
> Cc: Mateusz Guzik 
> Cc: Cyrill Gorcunov 
> ---

Looks ok to me, thanks!
Reviewed-by: Cyrill Gorcunov 


Re: [v4 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-04-15 Thread Cyrill Gorcunov
On Sun, Apr 15, 2018 at 02:24:51AM +0800, Yang Shi wrote:
> mmap_sem is on the hot path of kernel, and it very contended, but it is
> abused too. It is used to protect arg_start|end and evn_start|end when
> reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make
> sense since those proc files just expect to read 4 values atomically and
> not related to VM, they could be set to arbitrary values by C/R.
> 
> And, the mmap_sem contention may cause unexpected issue like below:
> 
> INFO: task ps:14018 blocked for more than 120 seconds.
>Tainted: GE 4.9.79-009.ali3000.alios7.x86_64 #1
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> message.
>  ps  D0 14018  1 0x0004
>   885582f84000 885e8682f000 880972943000 885ebf499bc0
>   8828ee12 c900349bfca8 817154d0 0040
>   00ff812f872a 885ebf499bc0 024000d000948300 880972943000
>  Call Trace:
>   [] ? __schedule+0x250/0x730
>   [] schedule+0x36/0x80
>   [] rwsem_down_read_failed+0xf0/0x150
>   [] call_rwsem_down_read_failed+0x18/0x30
>   [] down_read+0x20/0x40
>   [] proc_pid_cmdline_read+0xd9/0x4e0
>   [] ? do_filp_open+0xa5/0x100
>   [] __vfs_read+0x37/0x150
>   [] ? security_file_permission+0x9b/0xc0
>   [] vfs_read+0x96/0x130
>   [] SyS_read+0x55/0xc0
>   [] entry_SYSCALL_64_fastpath+0x1a/0xc5
> 
> Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock
> for them to mitigate the abuse of mmap_sem.
> 
> So, introduce a new spinlock in mm_struct to protect the concurrent
> access to arg_start|end, env_start|end and others, as well as replace
> write map_sem to read to protect the race condition between prctl and
> sys_brk which might break check_data_rlimit(), and makes prctl more
> friendly to other VM operations.
> 
> This patch just eliminates the abuse of mmap_sem, but it can't resolve the
> above hung task warning completely since the later access_remote_vm() call
> needs acquire mmap_sem. The mmap_sem scalability issue will be solved in the
> future.
> 
> Signed-off-by: Yang Shi 
> Cc: Alexey Dobriyan 
> Cc: Michal Hocko 
> Cc: Matthew Wilcox 
> Cc: Mateusz Guzik 
> Cc: Cyrill Gorcunov 
> ---

Looks ok to me, thanks!
Reviewed-by: Cyrill Gorcunov