Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-04-01 Thread Yang Shi



On 3/26/18 11:29 PM, Michal Hocko wrote:
> On Tue 27-03-18 02:20:39, Yang Shi wrote:
> [...]
> The patch looks reasonable to me. Maybe it would be better to be more
> explicit about the purpose of the patch. As others noticed, this alone
> wouldn't solve the mmap_sem contention issues. I _think_ that if you
> were more explicit about the mmap_sem abuse it would trigger less
> questions.
>
> I have just one more question. Now that you are touching this area,
> would you be willing to remove the following ugliness?
>
> > diff --git a/kernel/sys.c b/kernel/sys.c
> > index f2289de..17bddd2 100644
> > --- a/kernel/sys.c
> > +++ b/kernel/sys.c
> > @@ -1959,7 +1959,7 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
> >  		return error;
> >  	}
> >
> > -	down_write(&mm->mmap_sem);
> > +	down_read(&mm->mmap_sem);
>
> Why do we need to hold mmap_sem here and call find_vma, when only
> PR_SET_MM_ENV_END: is consuming it? I guess we can replace it with the
> new lock and take the mmap_sem only for PR_SET_MM_ENV_END.

Sorry for taking a little longer to get back; I was traveling.
I think everything can be protected by the new arg_lock except
mm->brk, since arg_lock can't prevent concurrent writes from sys_brk().

We may use arg_lock to protect everything other than mm->brk. The
complexity sounds acceptable.

Of course, as Cyrill mentioned, he prefers deprecating prctl_set_mm
since C/R is the only user of it. We may wait until he is done?

Thanks,
Yang

> Thanks!




Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-03-28 Thread Michal Hocko
On Tue 27-03-18 21:52:17, Cyrill Gorcunov wrote:
> On Tue, Mar 27, 2018 at 02:38:11PM -0400, Yang Shi wrote:
> > > Why do we need to hold mmap_sem here and call find_vma, when only
> > > PR_SET_MM_ENV_END: is consuming it? I guess we can replace it with the
> > > new lock and take the mmap_sem only for PR_SET_MM_ENV_END.
> >
> > Actually, I didn't think about why. It looks like prctl_set_mm() checks
> > whether the vma exists when it tries to set stack_start, argv_* and env_*,
> > btw, not only env_end.
> >
> > Cyrill may be able to give us some hint since C/R is the main user of this
> > API.
>
> First and most important, it makes the code smaller. This prctl call is
> really rarely used. Of course we can optimize it, but as I said I would
> prefer to simply deprecate this old interface (and I am going to do so
> once time permits).

Ohh, it would be really great if we can remove this thingy altogether. I
cannot say it has a wee bit of my sympathy.

-- 
Michal Hocko
SUSE Labs


Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-03-27 Thread Cyrill Gorcunov
On Tue, Mar 27, 2018 at 02:38:11PM -0400, Yang Shi wrote:
> > Why do we need to hold mmap_sem here and call find_vma, when only
> > PR_SET_MM_ENV_END: is consuming it? I guess we can replace it with the
> > new lock and take the mmap_sem only for PR_SET_MM_ENV_END.
>
> Actually, I didn't think about why. It looks like prctl_set_mm() checks
> whether the vma exists when it tries to set stack_start, argv_* and env_*,
> btw, not only env_end.
>
> Cyrill may be able to give us some hint since C/R is the main user of this
> API.

First and most important, it makes the code smaller. This prctl call is
really rarely used. Of course we can optimize it, but as I said I would
prefer to simply deprecate this old interface (and I am going to do so
once time permits).


Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-03-27 Thread Yang Shi



On 3/27/18 3:32 AM, Cyrill Gorcunov wrote:
> On Mon, Mar 26, 2018 at 05:59:49PM -0400, Yang Shi wrote:
> > > Say we have two syscalls running prctl_set_mm_map in parallel, and imagine
> > > one has @start_brk = 20 @brk = 10 and the second caller has @start_brk = 30
> > > and @brk = 20. Since the call is now guarded by _read_, both calls run
> > > unlocked, and due to out-of-order execution it may happen that when both
> > > finish we have @start_brk = 30 and @brk = 10. In turn, the "write" semaphore
> > > has been taken to have consistent data on exit: either you have [20;10]
> > > or [30;20] assigned, not something mixed.
> > >
> > > That said I think using read-lock here would be a bug.
> >
> > Yes it sounds so. However, it was down_read before
> > ddf1d398e517e660207e2c807f76a90df543a217 ("prctl: take mmap sem for writing
> > to protect against others"). And, that commit is for fixing the concurrent
> > writing to arg_* and env_*. I just checked that commit, but omitted the brk
> > part. The potential issue mentioned by you should exist before that commit,
> > but might be just not discovered or very rare to hit.
> >
> > I will change it back to down_write.
>
> down_read before was a bug ;) And it was not discovered earlier simply
> because not that many users of this interface exist, namely only criu
> as far as I know by now.

Thanks for confirming this. I assumed so :-)




Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-03-27 Thread Yang Shi



On 3/27/18 2:29 AM, Michal Hocko wrote:
> On Tue 27-03-18 02:20:39, Yang Shi wrote:
> [...]
> The patch looks reasonable to me. Maybe it would be better to be more
> explicit about the purpose of the patch. As others noticed, this alone
> wouldn't solve the mmap_sem contention issues. I _think_ that if you
> were more explicit about the mmap_sem abuse it would trigger less
> questions.

Yes, sure.

> I have just one more question. Now that you are touching this area,
> would you be willing to remove the following ugliness?
>
> > diff --git a/kernel/sys.c b/kernel/sys.c
> > index f2289de..17bddd2 100644
> > --- a/kernel/sys.c
> > +++ b/kernel/sys.c
> > @@ -1959,7 +1959,7 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
> >  		return error;
> >  	}
> >
> > -	down_write(&mm->mmap_sem);
> > +	down_read(&mm->mmap_sem);
>
> Why do we need to hold mmap_sem here and call find_vma, when only
> PR_SET_MM_ENV_END: is consuming it? I guess we can replace it with the
> new lock and take the mmap_sem only for PR_SET_MM_ENV_END.

Actually, I didn't think about why. It looks like prctl_set_mm() checks
whether the vma exists when it tries to set stack_start, argv_* and env_*,
btw, not only env_end.

Cyrill may be able to give us some hint since C/R is the main user of
this API.

Yang

> Thanks!




Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-03-27 Thread Michal Hocko
On Tue 27-03-18 16:31:23, Mateusz Guzik wrote:
> On Tue, Mar 27, 2018 at 08:29:39AM +0200, Michal Hocko wrote:
> > On Tue 27-03-18 02:20:39, Yang Shi wrote:
> > [...]
> > The patch looks reasonable to me. Maybe it would be better to be more
> > explicit about the purpose of the patch. As others noticed, this alone
> > wouldn't solve the mmap_sem contention issues. I _think_ that if you
> > were more explicit about the mmap_sem abuse it would trigger less
> > questions.
> > 
> 
> > From what I gather even with other fixes the kernel will still end up
> grabbing the semaphore. In this case I don't see what's the upside of
> adding the spinlock for args. The downside is growth of mm_struct.

Because accessing the specific address in the address space can be later
changed to use more fine-grained locking. There are people
experimenting with range locking. These mmap_sem abusers, on the other
hand, will require the full range lock without a good reason. So it is
really worth it to remove them and replace them with more fine-grained
locking.

If the mm_struct growth is a real concern (I haven't checked that) then we
can use a set of hashed locks or something else.
-- 
Michal Hocko
SUSE Labs
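
The "set of hashed locks" alternative mentioned above is not spelled out in the thread. One way to picture it is the userspace sketch below: a fixed, shared table of spinlocks indexed by a hash of the object's address, so the protected structure itself does not grow. Everything here is an assumption for illustration only: the table size, the hash, and the names arg_lock_bucket(), arg_lock_for(), hashed_arg_lock() and hashed_arg_unlock() are hypothetical, and C11 atomic_flag stands in for the kernel's spinlock_t.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table size; must be a power of two for the mask below. */
#define ARG_LOCK_BUCKETS 64

static_assert((ARG_LOCK_BUCKETS & (ARG_LOCK_BUCKETS - 1)) == 0,
	      "ARG_LOCK_BUCKETS must be a power of two");

/*
 * Shared lock table. The first flag is cleared explicitly; the remaining
 * elements are zero-initialized, which common implementations treat as clear.
 */
static atomic_flag arg_locks[ARG_LOCK_BUCKETS] = { ATOMIC_FLAG_INIT };

/* Map an object address to a bucket index (illustrative hash only). */
size_t arg_lock_bucket(const void *obj)
{
	uintptr_t key = (uintptr_t)obj;

	key ^= key >> 12;
	return (size_t)((key >> 4) & (ARG_LOCK_BUCKETS - 1));
}

/* Return the lock that guards the given object. */
atomic_flag *arg_lock_for(const void *obj)
{
	return &arg_locks[arg_lock_bucket(obj)];
}

/* Spin until the bucket lock for obj is acquired. */
void hashed_arg_lock(const void *obj)
{
	while (atomic_flag_test_and_set_explicit(arg_lock_for(obj),
						 memory_order_acquire))
		; /* critical sections are expected to be a few loads or stores */
}

/* Release the bucket lock for obj. */
void hashed_arg_unlock(const void *obj)
{
	atomic_flag_clear_explicit(arg_lock_for(obj), memory_order_release);
}
```

The trade-off in such a scheme is that unrelated objects hashing to the same bucket contend on one lock, in exchange for not adding a field to every structure.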


Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-03-27 Thread Mateusz Guzik
On Tue, Mar 27, 2018 at 08:29:39AM +0200, Michal Hocko wrote:
> On Tue 27-03-18 02:20:39, Yang Shi wrote:
> [...]
> The patch looks reasonable to me. Maybe it would be better to be more
> explicit about the purpose of the patch. As others noticed, this alone
> wouldn't solve the mmap_sem contention issues. I _think_ that if you
> were more explicit about the mmap_sem abuse it would trigger less
> questions.
> 

From what I gather even with other fixes the kernel will still end up
grabbing the semaphore. In this case I don't see what's the upside of
adding the spinlock for args. The downside is growth of mm_struct.

i.e. the code can be refactored to just hold the lock and relock only if
necessary (unable to copy to user without faulting)

-- 
Mateusz Guzik


Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-03-27 Thread Cyrill Gorcunov
On Mon, Mar 26, 2018 at 06:12:55PM -0400, Yang Shi wrote:
> > +	if (unlikely(arg_start > arg_end || env_start > env_end)) {
> > +		cond_resched();
> > +		goto retry;
>
> Can't it get stuck in an endless loop if the condition is always true?

Yes, unfortunately it can.

> > +	}
> >
> > for reading these fields.
> >
> > By the way, /proc/pid/ readers serve as a canary that tells us some
> > mm_mmap related problem is happening. On the other hand, it is sad that
> > such a canary cannot be terminated by a signal due to the use of
> > unkillable waits. I wish we could use killable waits.
>
> I already proposed patches (https://lkml.org/lkml/2018/2/26/1197) to do this
> a few weeks ago. In the review, akpm suggested mitigating the mmap_sem
> contention instead of using the killable version as a workaround. Then the
> preliminary unmapping by section patches
> (https://lkml.org/lkml/2018/3/20/786) were proposed. In the discussion, we
> decided to eliminate the mmap_sem abuse; this is where the patch came from.


Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-03-27 Thread Cyrill Gorcunov
On Tue, Mar 27, 2018 at 07:00:56AM +0900, Tetsuo Handa wrote:
> 
> > To be fair I would prefer to drop this old per-field
> > interface completely. This per-field interface was rather an ugly
> > solution from my side.
> 
> But this is userspace visible API and thus we cannot change.

Hi! We could deprecate this API call for a couple of releases
and then, if nobody complains, we could rip it out completely.
There should not be many users, I think; I haven't heard of
anyone except criu ever using it.

> > > Then, I wonder whether reading arg_start|end and env_start|end
> > > atomically makes sense. Just retry reading if arg_start > arg_end or
> > > env_start > env_end is fine?
> > 
> > Tetsuo, let me re-read this code tomorrow, maybe I miss something obvious.
> > 
> 
> You are not missing my point. What I thought is
> 
> +retry:
> - down_read(&mm->mmap_sem);
>   arg_start = mm->arg_start;
>   arg_end = mm->arg_end;
>   env_start = mm->env_start;
>   env_end = mm->env_end;
> - up_read(&mm->mmap_sem);
>  
> - BUG_ON(arg_start > arg_end);
> - BUG_ON(env_start > env_end);
> + if (unlikely(arg_start > arg_end || env_start > env_end)) {
> + cond_resched();
> + goto retry;
> + }
> 
> for reading these fields.

I fear such contention loops are acceptable only if they are
guaranteed to finish eventually, which doesn't look to be the case
in the code above.

Cyrill
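
The termination concern raised above can be illustrated with a bounded variant of the retry loop. This is only a userspace sketch under stated assumptions, not something proposed in the thread: SNAPSHOT_MAX_TRIES, struct arg_env, arg_env_init() and snapshot_bounded() are hypothetical names, sched_yield() stands in for cond_resched(), and C11 atomics stand in for plain loads of the mm_struct fields. Note also that passing the sanity check does not prove the four values came from one consistent update; it only rules out the obviously inverted ranges.

```c
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical upper bound on retries so the loop always terminates. */
#define SNAPSHOT_MAX_TRIES 16

/* Stand-in for the four mm_struct fields, written without a lock. */
struct arg_env {
	_Atomic unsigned long arg_start, arg_end, env_start, env_end;
};

struct arg_env_snapshot {
	unsigned long arg_start, arg_end, env_start, env_end;
};

void arg_env_init(struct arg_env *mm, unsigned long arg_start,
		  unsigned long arg_end, unsigned long env_start,
		  unsigned long env_end)
{
	atomic_init(&mm->arg_start, arg_start);
	atomic_init(&mm->arg_end, arg_end);
	atomic_init(&mm->env_start, env_start);
	atomic_init(&mm->env_end, env_end);
}

/*
 * Read the four fields and retry while they look inverted, but give up
 * with -EAGAIN after SNAPSHOT_MAX_TRIES attempts instead of looping forever.
 */
int snapshot_bounded(struct arg_env *mm, struct arg_env_snapshot *out)
{
	assert(mm && out);

	for (int tries = 0; tries < SNAPSHOT_MAX_TRIES; tries++) {
		out->arg_start = atomic_load(&mm->arg_start);
		out->arg_end = atomic_load(&mm->arg_end);
		out->env_start = atomic_load(&mm->env_start);
		out->env_end = atomic_load(&mm->env_end);

		if (out->arg_start <= out->arg_end &&
		    out->env_start <= out->env_end)
			return 0;

		sched_yield(); /* userspace stand-in for cond_resched() */
	}
	return -EAGAIN;
}
```

With a bound like this, values that are permanently inverted produce an error for the reader rather than a livelock.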


Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-03-27 Thread Cyrill Gorcunov
On Mon, Mar 26, 2018 at 05:59:49PM -0400, Yang Shi wrote:
> > Say we have two syscalls running prctl_set_mm_map in parallel, and imagine
> > one has @start_brk = 20 @brk = 10 and the second caller has @start_brk = 30
> > and @brk = 20. Since the call is now guarded by _read_, both calls run
> > unlocked, and due to out-of-order execution it may happen that when both
> > finish we have @start_brk = 30 and @brk = 10. In turn, the "write" semaphore
> > has been taken to have consistent data on exit: either you have [20;10]
> > or [30;20] assigned, not something mixed.
> > 
> > That said I think using read-lock here would be a bug.
> 
> Yes it sounds so. However, it was down_read before
> ddf1d398e517e660207e2c807f76a90df543a217 ("prctl: take mmap sem for writing
> to protect against others"). And, that commit is for fixing the concurrent
> writing to arg_* and env_*. I just checked that commit, but omitted the brk
> part. The potential issue mentioned by you should exist before that commit,
> but might be just not discovered or very rare to hit.
> 
> I will change it back to down_write.

down_read before was a bug ;) And it was not discovered earlier simply
because not that many users of this interface exist, namely only criu
as far as I know by now.
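
The point about mixed [start_brk; brk] pairs can be shown in a small userspace analogue: both fields are written inside one exclusive critical section, so a reader taking the same lock sees either the old pair or the new pair. This is a sketch under assumptions, not kernel code: struct brk_state and the brk_state_*() helpers are hypothetical, and a pthread mutex stands in for taking mmap_sem for writing.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for the two mm_struct fields from the example. */
struct brk_state {
	pthread_mutex_t lock; /* plays the role of the exclusive (write) lock */
	unsigned long start_brk;
	unsigned long brk;
};

void brk_state_init(struct brk_state *st)
{
	int rc = pthread_mutex_init(&st->lock, NULL);

	assert(rc == 0);
	(void)rc;
	st->start_brk = 0;
	st->brk = 0;
}

/* Update both fields as one unit; no other writer can interleave. */
void brk_state_set(struct brk_state *st, unsigned long start_brk,
		   unsigned long brk)
{
	pthread_mutex_lock(&st->lock);
	st->start_brk = start_brk;
	st->brk = brk;
	pthread_mutex_unlock(&st->lock);
}

/* Read both fields under the same lock, so the pair is never mixed. */
void brk_state_get(struct brk_state *st, unsigned long *start_brk,
		   unsigned long *brk)
{
	pthread_mutex_lock(&st->lock);
	*start_brk = st->start_brk;
	*brk = st->brk;
	pthread_mutex_unlock(&st->lock);
}
```

If two writers only held a shared (read) lock, nothing would order their stores, which is how the [30;10] outcome described above could appear.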


Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-03-26 Thread Michal Hocko
On Tue 27-03-18 02:20:39, Yang Shi wrote:
[...]
The patch looks reasonable to me. Maybe it would be better to be more
explicit about the purpose of the patch. As others noticed, this alone
wouldn't solve the mmap_sem contention issues. I _think_ that if you
were more explicit about the mmap_sem abuse it would trigger less
questions.

I have just one more question. Now that you are touching this area,
would you be willing to remove the following ugliness?

> diff --git a/kernel/sys.c b/kernel/sys.c
> index f2289de..17bddd2 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -1959,7 +1959,7 @@ static int prctl_set_mm_map(int opt, const void __user 
> *addr, unsigned long data
>   return error;
>   }
>  
> - down_write(&mm->mmap_sem);
> + down_read(&mm->mmap_sem);

Why do we need to hold mmap_sem here and call find_vma, when only
PR_SET_MM_ENV_END: is consuming it? I guess we can replace it with the
new lock and take the mmap_sem only for PR_SET_MM_ENV_END.

Thanks!
-- 
Michal Hocko
SUSE Labs


Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-03-26 Thread Yang Shi



On 3/26/18 6:00 PM, Tetsuo Handa wrote:
> Cyrill Gorcunov wrote:
> > On Tue, Mar 27, 2018 at 06:10:09AM +0900, Tetsuo Handa wrote:
> > > On 2018/03/27 4:21, Cyrill Gorcunov wrote:
> > > > That said I think using read-lock here would be a bug.
> > >
> > > If I understand correctly, the caller can't set both fields atomically, for
> > > prctl() does not receive both fields at one call.
> > >
> > >   prctl(PR_SET_MM, PR_SET_MM_ARG_START xor PR_SET_MM_ARG_END xor
> > >         PR_SET_MM_ENV_START xor PR_SET_MM_ENV_END, new value, 0, 0);
> >
> > True, but the key moment is that two/three/four system calls can
> > run simultaneously. And while previously they were ordered by "write",
> > with the read lock they are completely unordered and this really
> > worries me.
>
> Yes, we need exclusive lock when updating these fields.
>
> > To be fair I would prefer to drop this old per-field
> > interface completely. This per-field interface was rather an ugly
> > solution from my side.
>
> But this is userspace visible API and thus we cannot change.
>
> > > Then, I wonder whether reading arg_start|end and env_start|end
> > > atomically makes sense. Just retry reading if arg_start > arg_end or
> > > env_start > env_end is fine?
> >
> > Tetsuo, let me re-read this code tomorrow, maybe I miss something obvious.
>
> You are not missing my point. What I thought is
>
> +retry:
> -	down_read(&mm->mmap_sem);
>  	arg_start = mm->arg_start;
>  	arg_end = mm->arg_end;
>  	env_start = mm->env_start;
>  	env_end = mm->env_end;
> -	up_read(&mm->mmap_sem);
>
> -	BUG_ON(arg_start > arg_end);
> -	BUG_ON(env_start > env_end);
> +	if (unlikely(arg_start > arg_end || env_start > env_end)) {
> +		cond_resched();
> +		goto retry;

Can't it get stuck in an endless loop if the condition is always true?

> +	}
>
> for reading these fields.
>
> By the way, /proc/pid/ readers serve as a canary that tells us some
> mm_mmap related problem is happening. On the other hand, it is sad that
> such a canary cannot be terminated by a signal due to the use of
> unkillable waits. I wish we could use killable waits.

I already proposed patches (https://lkml.org/lkml/2018/2/26/1197) to do
this a few weeks ago. In the review, akpm suggested mitigating the
mmap_sem contention instead of using the killable version as a workaround.
Then the preliminary unmapping by section patches
(https://lkml.org/lkml/2018/3/20/786) were proposed. In the discussion,
we decided to eliminate the mmap_sem abuse; this is where the patch came
from.

Yang


Yang




Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-03-26 Thread Tetsuo Handa
Cyrill Gorcunov wrote:
> On Tue, Mar 27, 2018 at 06:10:09AM +0900, Tetsuo Handa wrote:
> > On 2018/03/27 4:21, Cyrill Gorcunov wrote:
> > > That said I think using read-lock here would be a bug.
> > 
> > If I understand correctly, the caller can't set both fields atomically, for
> > prctl() does not receive both fields at one call.
> > 
> >   prctl(PR_SET_MM, PR_SET_MM_ARG_START xor PR_SET_MM_ARG_END xor 
> > PR_SET_MM_ENV_START xor PR_SET_MM_ENV_END, new value, 0, 0);
> > 
> 
> True, but the key moment is that two/three/four system calls can
> run simultaneously. And while previously they were ordered by "write",
> with the read lock they are completely unordered and this really
> worries me.

Yes, we need exclusive lock when updating these fields.

> To be fair I would prefer to drop this old per-field
> interface completely. This per-field interface was rather an ugly
> solution from my side.

But this is userspace visible API and thus we cannot change.

> 
> > Then, I wonder whether reading arg_start|end and env_start|end
> > atomically makes sense. Just retry reading if arg_start > arg_end or
> > env_start > env_end is fine?
> 
> Tetsuo, let me re-read this code tomorrow, maybe I miss something obvious.
> 

You are not missing my point. What I thought is

+retry:
-	down_read(&mm->mmap_sem);
 	arg_start = mm->arg_start;
 	arg_end = mm->arg_end;
 	env_start = mm->env_start;
 	env_end = mm->env_end;
-	up_read(&mm->mmap_sem);
 
-	BUG_ON(arg_start > arg_end);
-	BUG_ON(env_start > env_end);
+	if (unlikely(arg_start > arg_end || env_start > env_end)) {
+		cond_resched();
+		goto retry;
+	}

for reading these fields.

By the way, /proc/pid/ readers serve as a canary that tells us some
mm_mmap related problem is happening. On the other hand, it is sad that
such a canary cannot be terminated by a signal due to the use of
unkillable waits. I wish we could use killable waits.


Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-03-26 Thread Yang Shi



On 3/26/18 3:21 PM, Cyrill Gorcunov wrote:
> On Mon, Mar 26, 2018 at 11:37:25AM -0700, Matthew Wilcox wrote:
> > On Tue, Mar 27, 2018 at 02:20:39AM +0800, Yang Shi wrote:
> > > +++ b/kernel/sys.c
> > > @@ -1959,7 +1959,7 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
> > >  		return error;
> > >  	}
> > >
> > > -	down_write(&mm->mmap_sem);
> > > +	down_read(&mm->mmap_sem);
> > >
> > >  	/*
> > >  	 * We don't validate if these members are pointing to
> > > @@ -1980,10 +1980,13 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
> > >  	mm->start_brk   = prctl_map.start_brk;
> > >  	mm->brk         = prctl_map.brk;
> > >  	mm->start_stack = prctl_map.start_stack;
> > > +
> > > +	spin_lock(&mm->arg_lock);
> > >  	mm->arg_start   = prctl_map.arg_start;
> > >  	mm->arg_end     = prctl_map.arg_end;
> > >  	mm->env_start   = prctl_map.env_start;
> > >  	mm->env_end     = prctl_map.env_end;
> > > +	spin_unlock(&mm->arg_lock);
> > >
> > >  	/*
> > >  	 * Note this update of @saved_auxv is lockless thus
> >
> > I see the argument for the change to a write lock was because of a BUG
> > validating arg_start and arg_end, but more generally, we are updating these
> > values, so a write-lock is probably a good idea, and this is a very rare
> > operation to do, so we don't care about making this more parallel.  I would
> > not make this change (but if other more knowledgeable people in this area
> > disagree with me, I will withdraw my objection to this part).
>
> Say we have two syscalls running prctl_set_mm_map in parallel, and imagine
> one has @start_brk = 20 @brk = 10 and the second caller has @start_brk = 30
> and @brk = 20. Since the call is now guarded by _read_, both calls run
> unlocked, and due to out-of-order execution it may happen that when both
> finish we have @start_brk = 30 and @brk = 10. In turn, the "write" semaphore
> has been taken to have consistent data on exit: either you have [20;10]
> or [30;20] assigned, not something mixed.
>
> That said I think using read-lock here would be a bug.

Yes it sounds so. However, it was down_read before
ddf1d398e517e660207e2c807f76a90df543a217 ("prctl: take mmap sem for
writing to protect against others"). And, that commit is for fixing the
concurrent writing to arg_* and env_*. I just checked that commit, but
omitted the brk part. The potential issue mentioned by you should exist
before that commit, but might be just not discovered or very rare to hit.

I will change it back to down_write.

Thanks,
Yang

>
> 	Cyrill




Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-03-26 Thread Cyrill Gorcunov
On Tue, Mar 27, 2018 at 06:10:09AM +0900, Tetsuo Handa wrote:
> On 2018/03/27 4:21, Cyrill Gorcunov wrote:
> > That said I think using read-lock here would be a bug.
> 
> If I understand correctly, the caller can't set both fields atomically, for
> prctl() does not receive both fields at one call.
> 
>   prctl(PR_SET_MM, PR_SET_MM_ARG_START xor PR_SET_MM_ARG_END xor 
> PR_SET_MM_ENV_START xor PR_SET_MM_ENV_END, new value, 0, 0);
> 

True, but the key moment is that two/three/four system calls can
run simultaneously. And while previously they were ordered by "write",
with the read lock they are completely unordered and this really
worries me. To be fair I would prefer to drop this old per-field
interface completely. This per-field interface was rather an ugly
solution from my side.

> Then, I wonder whether reading arg_start|end and env_start|end
> atomically makes sense. Just retry reading if arg_start > arg_end or
> env_start > env_end is fine?

Tetsuo, let me re-read this code tomorrow, maybe I miss something obvious.


Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-03-26 Thread Yang Shi



On 3/26/18 5:10 PM, Tetsuo Handa wrote:
> On 2018/03/27 4:21, Cyrill Gorcunov wrote:
> > That said I think using read-lock here would be a bug.
>
> If I understand correctly, the caller can't set both fields atomically, for
> prctl() does not receive both fields at one call.
>
>   prctl(PR_SET_MM, PR_SET_MM_ARG_START xor PR_SET_MM_ARG_END xor
>         PR_SET_MM_ENV_START xor PR_SET_MM_ENV_END, new value, 0, 0);
>
> Then, I wonder whether reading arg_start|end and env_start|end atomically
> makes sense. Just retry reading if arg_start > arg_end or
> env_start > env_end is fine?

It might get stuck in an endless loop if those are set to wrong values, right?




Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-03-26 Thread Yang Shi



On 3/26/18 3:42 PM, Mateusz Guzik wrote:
> On Tue, Mar 27, 2018 at 02:20:39AM +0800, Yang Shi wrote:
> > mmap_sem is on the hot path of kernel, and it is very contended, but it is
> > abused too. It is used to protect arg_start|end and env_start|end when
> > reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make
> > sense since those proc files just expect to read 4 values atomically and
> > not related to VM, they could be set to arbitrary values by C/R.
>
> They are not arbitrary - there is basic validation performed when
> setting them.
>
> > And, the mmap_sem contention may cause unexpected issue like below:
> >
> > INFO: task ps:14018 blocked for more than 120 seconds.
> > Tainted: GE 4.9.79-009.ali3000.alios7.x86_64 #1
> >  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >  ps  D0 14018  1 0x0004
> >   885582f84000 885e8682f000 880972943000 885ebf499bc0
> >   8828ee12 c900349bfca8 817154d0 0040
> >   00ff812f872a 885ebf499bc0 024000d000948300 880972943000
> >  Call Trace:
> >   [] ? __schedule+0x250/0x730
> >   [] schedule+0x36/0x80
> >   [] rwsem_down_read_failed+0xf0/0x150
> >   [] call_rwsem_down_read_failed+0x18/0x30
> >   [] down_read+0x20/0x40
> >   [] proc_pid_cmdline_read+0xd9/0x4e0
> >   [] ? do_filp_open+0xa5/0x100
> >   [] __vfs_read+0x37/0x150
> >   [] ? security_file_permission+0x9b/0xc0
> >   [] vfs_read+0x96/0x130
> >   [] SyS_read+0x55/0xc0
> >   [] entry_SYSCALL_64_fastpath+0x1a/0xc5
> >
> > Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock
> > for them to mitigate the abuse of mmap_sem.
>
> While switching to arg spinlock here will relieve mmap_sem to an extent,
> it won't help with the problem you are seeing here.
>
> proc_pid_cmdline_read -> access_process_vm -> __access_remote_vm and you
> got yet another down_read(&mm->mmap_sem);.
>
> i.e. the issue you ran into is well known and predates my change.
>
> The problem does not stem from contention either, but blocking for a
> long time while holding the lock - the most common example is dealing
> with dead nfs mount vs mmaped areas.
>
> I don't have good ideas how to fix the problem. The least bad I came up
> with was to trylock with a timeout - after a failure either return an
> error or resort to returning p_comm. ps/top could be modified to
> fallback to snatching the name from /status.
>
> Since the lock owner is now being stored in the semaphore, perhaps the
> above routine can happily spin until it grabs the lock or the owner is
> detected to have gone into uninterruptible sleep and react accordingly.
>
> I don't know whether it is feasible to somehow avoid the mmap lock
> altogether.
>
> If it has to be there no matter what the code can be refactored to grab
> it once and relock only if copyout would fault. This would in particular
> reduce the number of times it is taken to begin with and still provide
> the current synchronisation against prctl. But the fundamental problem
> will remain.
>
> That said, refactoring above will have the same effect as your patch and
> will avoid growing mm_struct.
>
> That's my $0,03. MM overlords have to comment on what to do with this.

Thanks for the comment. Yes, the spin lock absolutely can't solve all
the mmap_sem scalability issues. Actually, I already proposed a
preliminary RFC to try to mitigate the mmap_sem issue. It is still under
review.

Other than that, we also found mmap_sem is abused in some places, so this
patch is proposed to reduce the abuse of mmap_sem.

Yang

> > So, introduce a new spinlock in mm_struct to protect the concurrent access
> > to arg_start|end and env_start|end.
> >
> > And, commit ddf1d398e517e660207e2c807f76a90df543a217 ("prctl: take mmap
> > sem for writing to protect against others") changed down_read to
> > down_write to avoid write race condition in prctl_set_mm(). Since we
> > already have dedicated lock to protect them, it is safe to change back
> > to down_read.
> >
> > Signed-off-by: Yang Shi
> > Cc: Alexey Dobriyan
> > Cc: Michal Hocko
> > Cc: Matthew Wilcox
> > Cc: Mateusz Guzik
> > Cc: Cyrill Gorcunov
> > ---
> > v1 --> v2:
> > * Use spinlock instead of rwlock per Matthew's suggestion
> > * Replace down_write to down_read in prctl_set_mm (see commit log for details)
> >
> >  fs/proc/base.c   |  8
> >  include/linux/mm_types.h |  2 ++
> >  kernel/fork.c|  1 +
> >  kernel/sys.c | 14 ++
> >  mm/init-mm.c |  1 +
> >  5 files changed, 18 insertions(+), 8 deletions(-)
> >
> > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > index 9298324..e0282b6 100644
> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> > @@ -242,12 +242,12 @@ static ssize_t proc_pid_cmdline_read(struct file *file, char __user *buf,
> >  		goto out_mmput;
> >  	}
> >
> > -	down_read(&mm->mmap_sem);
> > +	spin_lock(&mm->arg_lock);
> >  	arg_start = mm->arg_start;
> >  	arg_end = mm->arg_end;
> >  	env_start = mm->env_start;
> >  	env_end = mm->env_end;
> > -	up_read(&mm->mmap_sem);
> > +	spin_unlock(&mm->arg_lock);
> >
> >  	BUG_ON(arg_start > arg_end);
> >  	BUG_ON(env_start > env_end);
> > @@ -929,10 +929

Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-03-26 Thread Tetsuo Handa
On 2018/03/27 4:21, Cyrill Gorcunov wrote:
> That said I think using read-lock here would be a bug.

If I understand correctly, the caller can't set both fields atomically, for
prctl() does not receive both fields at one call.

  prctl(PR_SET_MM, PR_SET_MM_ARG_START xor PR_SET_MM_ARG_END xor 
PR_SET_MM_ENV_START xor PR_SET_MM_ENV_END, new value, 0, 0);

Then, I wonder whether reading arg_start|end and env_start|end atomically makes
sense. Just retry reading if arg_start > arg_end or env_start > env_end is fine?
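
To make the per-field nature of the interface concrete, here is a minimal userspace sketch of setting the argument range with two separate prctl(PR_SET_MM, ...) calls. It assumes Linux with <sys/prctl.h> providing the PR_SET_MM_* constants, and a caller that normally needs CAP_SYS_RESOURCE; the wrapper names set_mm_field() and set_arg_range() are hypothetical. The comment marks the window in which a concurrent reader could see a new arg_start together with an old arg_end.

```c
#include <assert.h>
#include <errno.h>
#include <sys/prctl.h>

/*
 * Set exactly one mm field per call, as the per-field interface requires.
 * Returns 0 on success or a negative errno value on failure.
 */
int set_mm_field(int opt, unsigned long value)
{
	switch (opt) {
	case PR_SET_MM_ARG_START:
	case PR_SET_MM_ARG_END:
	case PR_SET_MM_ENV_START:
	case PR_SET_MM_ENV_END:
		break;
	default:
		return -EINVAL; /* this sketch only covers the four fields above */
	}

	if (prctl(PR_SET_MM, opt, value, 0, 0) == -1)
		return -errno;
	return 0;
}

/* Update the argument range; the two fields cannot be set in one call. */
int set_arg_range(unsigned long start, unsigned long end)
{
	int err;

	assert(start <= end);

	err = set_mm_field(PR_SET_MM_ARG_START, start);
	if (err)
		return err;
	/*
	 * Window: until the next call completes, readers may observe the new
	 * arg_start combined with the old arg_end.
	 */
	return set_mm_field(PR_SET_MM_ARG_END, end);
}
```

Without the required capability the calls are expected to fail with an error, so the sketch is mainly useful for reading rather than for running unprivileged.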


Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-03-26 Thread Mateusz Guzik
On Tue, Mar 27, 2018 at 02:20:39AM +0800, Yang Shi wrote:
> mmap_sem is on the hot path of kernel, and it is very contended, but it is
> abused too. It is used to protect arg_start|end and env_start|end when
> reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make
> sense since those proc files just expect to read 4 values atomically and
> not related to VM, they could be set to arbitrary values by C/R.
> 

They are not arbitrary - there is basic validation performed when
setting them.

> And, the mmap_sem contention may cause unexpected issue like below:
> 
> INFO: task ps:14018 blocked for more than 120 seconds.
>Tainted: GE 4.9.79-009.ali3000.alios7.x86_64 #1
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> message.
>  ps  D0 14018  1 0x0004
>   885582f84000 885e8682f000 880972943000 885ebf499bc0
>   8828ee12 c900349bfca8 817154d0 0040
>   00ff812f872a 885ebf499bc0 024000d000948300 880972943000
>  Call Trace:
>   [] ? __schedule+0x250/0x730
>   [] schedule+0x36/0x80
>   [] rwsem_down_read_failed+0xf0/0x150
>   [] call_rwsem_down_read_failed+0x18/0x30
>   [] down_read+0x20/0x40
>   [] proc_pid_cmdline_read+0xd9/0x4e0
>   [] ? do_filp_open+0xa5/0x100
>   [] __vfs_read+0x37/0x150
>   [] ? security_file_permission+0x9b/0xc0
>   [] vfs_read+0x96/0x130
>   [] SyS_read+0x55/0xc0
>   [] entry_SYSCALL_64_fastpath+0x1a/0xc5
> 
> Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock
> for them to mitigate the abuse of mmap_sem.
> 

While switching to arg spinlock here will relieve mmap_sem to an extent,
it won't help with the problem you are seeing here.

proc_pid_cmdline_read -> access_process_vm -> __access_remote_vm and you
got yet another down_read(&mm->mmap_sem);.

i.e. the issue you ran into is well known and predates my change.

The problem does not stem from contention either, but blocking for a
long time while holding the lock - the most common example is dealing
with dead nfs mount vs mmaped areas.

I don't have good ideas how to fix the problem. The least bad I came up
with was to trylock with a timeout - after a failure either return an
error or resort to returning p_comm. ps/top could be modified to
fallback to snatching the name from /status.

Since the lock owner is now being stored in the semaphore, perhaps the
above routine can happily spin until it grabs the lock or the owner is
detected to have gone into uninterruptible sleep and react accordingly. 

I don't know whether it is feasible to somehow avoid the mmap lock
altogether.

If it has to be there no matter what the code can be refactored to grab
it once and relock only if copyout would fault. This would in particular
reduce the number of times it is taken to begin with and still provide
the current synchronisation against prctl. But the fundamental problem
will remain.

That said, refactoring above will have the same effect as your patch and
will avoid growing mm_struct.

That's my $0,03. MM overlords have to comment on what to do with this.
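
The "trylock with a timeout, then fall back to p_comm" idea described above can be pictured with the following userspace sketch. It is an illustration under assumptions only: struct task_info, task_info_init(), lock_with_timeout() and read_name() are hypothetical names, a C11 atomic_flag stands in for mmap_sem, and the wait is a yield loop against a CLOCK_MONOTONIC deadline rather than anything the kernel actually provides.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical per-task data: cmdline needs the lock, comm does not. */
struct task_info {
	atomic_flag busy;    /* stand-in for a lock that may be held for long */
	const char *cmdline; /* full command line */
	const char *comm;    /* short name, used as the fallback */
};

void task_info_init(struct task_info *t, const char *cmdline, const char *comm)
{
	atomic_flag_clear(&t->busy);
	t->cmdline = cmdline;
	t->comm = comm;
}

static long long now_ms(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Try to take the lock until timeout_ms has elapsed; false on timeout. */
bool lock_with_timeout(atomic_flag *lock, long timeout_ms)
{
	long long deadline;

	assert(timeout_ms >= 0);
	deadline = now_ms() + timeout_ms;

	while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire)) {
		if (now_ms() >= deadline)
			return false;
		sched_yield();
	}
	return true;
}

/* Return the full command line, or the short name if the lock is stuck. */
const char *read_name(struct task_info *t, long timeout_ms)
{
	const char *name;

	if (!lock_with_timeout(&t->busy, timeout_ms))
		return t->comm;

	name = t->cmdline;
	atomic_flag_clear_explicit(&t->busy, memory_order_release);
	return name;
}
```

A reader built this way degrades to the short name instead of blocking indefinitely behind a holder stuck on, for example, a dead NFS mount.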

> So, introduce a new spinlock in mm_struct to protect the concurrent access
> to arg_start|end and env_start|end.
> 
> And, commit ddf1d398e517e660207e2c807f76a90df543a217 ("prctl: take mmap
> sem for writing to protect against others") changed down_read to
> down_write to avoid write race condition in prctl_set_mm(). Since we
> already have dedicated lock to protect them, it is safe to change back
> to down_read.
> 
> Signed-off-by: Yang Shi 
> Cc: Alexey Dobriyan 
> Cc: Michal Hocko 
> Cc: Matthew Wilcox 
> Cc: Mateusz Guzik 
> Cc: Cyrill Gorcunov 
> ---
> v1 --> v2:
> * Use spinlock instead of rwlock per Matthew's suggestion
> * Replace down_write to down_read in prctl_set_mm (see commit log for details)
> 
>  fs/proc/base.c   |  8 
>  include/linux/mm_types.h |  2 ++
>  kernel/fork.c|  1 +
>  kernel/sys.c | 14 ++
>  mm/init-mm.c |  1 +
>  5 files changed, 18 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 9298324..e0282b6 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -242,12 +242,12 @@ static ssize_t proc_pid_cmdline_read(struct file *file, 
> char __user *buf,
>   goto out_mmput;
>   }
>  
> - down_read(&mm->mmap_sem);
> + spin_lock(&mm->arg_lock);
>   arg_start = mm->arg_start;
>   arg_end = mm->arg_end;
>   env_start = mm->env_start;
>   env_end = mm->env_end;
> - up_read(&mm->mmap_sem);
> + spin_unlock(&mm->arg_lock);
>  
>   BUG_ON(arg_start > arg_end);
>   BUG_ON(env_start > env_end);
> @@ -929,10 +929,10 @@ static ssize_t environ_read(struct file *file, char 
> __user *buf,
>   if (!mmget_not_zero(mm))
>   goto free;
>  
> - down_read(&mm->mmap_sem);
> + spin_lock(&mm->arg_lock);
>   env_start = mm->env_start;
>   env_end = mm->env_end;
> -  

Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-03-26 Thread Cyrill Gorcunov
On Mon, Mar 26, 2018 at 11:37:25AM -0700, Matthew Wilcox wrote:
> On Tue, Mar 27, 2018 at 02:20:39AM +0800, Yang Shi wrote:
> > +++ b/kernel/sys.c
> > @@ -1959,7 +1959,7 @@ static int prctl_set_mm_map(int opt, const void 
> > __user *addr, unsigned long data
> > return error;
> > }
> >  
> > -   down_write(&mm->mmap_sem);
> > +   down_read(&mm->mmap_sem);
> >  
> > /*
> >  * We don't validate if these members are pointing to
> > @@ -1980,10 +1980,13 @@ static int prctl_set_mm_map(int opt, const void 
> > __user *addr, unsigned long data
> > mm->start_brk   = prctl_map.start_brk;
> > mm->brk = prctl_map.brk;
> > mm->start_stack = prctl_map.start_stack;
> > +
> > +   spin_lock(&mm->arg_lock);
> > mm->arg_start   = prctl_map.arg_start;
> > mm->arg_end = prctl_map.arg_end;
> > mm->env_start   = prctl_map.env_start;
> > mm->env_end = prctl_map.env_end;
> > +   spin_unlock(&mm->arg_lock);
> >  
> > /*
> >  * Note this update of @saved_auxv is lockless thus
> 
> I see the argument for the change to a write lock was because of a BUG
> validating arg_start and arg_end, but more generally, we are updating these
> values, so a write-lock is probably a good idea, and this is a very rare
> operation to do, so we don't care about making this more parallel.  I would
> not make this change (but if other more knowledgeable people in this area
> disagree with me, I will withdraw my objection to this part).

Say we have two syscalls running prctl_set_mm_map in parallel, and imagine
one has @start_brk = 20, @brk = 10 and the second caller has @start_brk = 30,
@brk = 20. Since the call is now guarded by the _read_ lock, both callers
run unlocked against each other, and due to out-of-order execution it may
happen that when both finish we have @start_brk = 30 and @brk = 10. The
"write" semaphore was taken precisely to have consistent data on exit:
either [20;10] or [30;20] is assigned, not something mixed.

That said I think using read-lock here would be a bug.
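
The mixed result can be reproduced in a small userspace sketch. It is not kernel code: struct fake_mm and the helpers are made-up names, and the overlapping case is written out by hand as one possible ordering of the four stores so that the outcome is deterministic rather than timing dependent.

```c
#include <assert.h>

/* Hypothetical stand-in for the two mm_struct fields in the example. */
struct fake_mm {
	unsigned long start_brk;
	unsigned long brk;
};

/* The two stores that one prctl_set_mm_map()-like caller performs. */
static void store_start_brk(struct fake_mm *mm, unsigned long v)
{
	mm->start_brk = v;
}

static void store_brk(struct fake_mm *mm, unsigned long v)
{
	mm->brk = v;
}

/*
 * Writers excluded from each other (the down_write() case): caller A's
 * stores complete before caller B's begin, so the pair is never mixed.
 */
void run_serialized(struct fake_mm *mm)
{
	store_start_brk(mm, 20);	/* caller A */
	store_brk(mm, 10);		/* caller A */
	store_start_brk(mm, 30);	/* caller B */
	store_brk(mm, 20);		/* caller B */
}

/*
 * Writers allowed to overlap (the down_read() case): one possible
 * ordering of the same four stores.
 */
void run_interleaved(struct fake_mm *mm)
{
	store_start_brk(mm, 20);	/* caller A */
	store_start_brk(mm, 30);	/* caller B */
	store_brk(mm, 20);		/* caller B */
	store_brk(mm, 10);		/* caller A */
}

/* Usage: the serialized run ends in [30;20], the overlapping one in [30;10]. */
void demo_mixed_pair(void)
{
	struct fake_mm mm = { 0, 0 };

	run_serialized(&mm);
	assert(mm.start_brk == 30 && mm.brk == 20);
	run_interleaved(&mm);
	assert(mm.start_brk == 30 && mm.brk == 10);
}
```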

Cyrill


Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-03-26 Thread Matthew Wilcox
On Tue, Mar 27, 2018 at 02:20:39AM +0800, Yang Shi wrote:
> +++ b/kernel/sys.c
> @@ -1959,7 +1959,7 @@ static int prctl_set_mm_map(int opt, const void __user 
> *addr, unsigned long data
>   return error;
>   }
>  
> - down_write(&mm->mmap_sem);
> + down_read(&mm->mmap_sem);
>  
>   /*
>* We don't validate if these members are pointing to
> @@ -1980,10 +1980,13 @@ static int prctl_set_mm_map(int opt, const void 
> __user *addr, unsigned long data
>   mm->start_brk   = prctl_map.start_brk;
>   mm->brk = prctl_map.brk;
>   mm->start_stack = prctl_map.start_stack;
> +
> + spin_lock(&mm->arg_lock);
>   mm->arg_start   = prctl_map.arg_start;
>   mm->arg_end = prctl_map.arg_end;
>   mm->env_start   = prctl_map.env_start;
>   mm->env_end = prctl_map.env_end;
> + spin_unlock(&mm->arg_lock);
>  
>   /*
>* Note this update of @saved_auxv is lockless thus

I see the argument for the change to a write lock was because of a BUG
validating arg_start and arg_end, but more generally, we are updating these
values, so a write-lock is probably a good idea, and this is a very rare
operation to do, so we don't care about making this more parallel.  I would
not make this change (but if other more knowledgeable people in this area
disagree with me, I will withdraw my objection to this part).



[v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-03-26 Thread Yang Shi
mmap_sem is on the hot path of the kernel and is heavily contended, but it
is also abused. It is used to protect arg_start|end and env_start|end when
reading /proc/$PID/cmdline and /proc/$PID/environ, but that doesn't make
sense: those proc files just expect to read 4 values atomically, and the
values are not related to VM; they could be set to arbitrary values by C/R.

And, the mmap_sem contention may cause unexpected issues like the one below:

INFO: task ps:14018 blocked for more than 120 seconds.
   Tainted: GE 4.9.79-009.ali3000.alios7.x86_64 #1
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 ps  D0 14018  1 0x0004
  885582f84000 885e8682f000 880972943000 885ebf499bc0
  8828ee12 c900349bfca8 817154d0 0040
  00ff812f872a 885ebf499bc0 024000d000948300 880972943000
 Call Trace:
  [] ? __schedule+0x250/0x730
  [] schedule+0x36/0x80
  [] rwsem_down_read_failed+0xf0/0x150
  [] call_rwsem_down_read_failed+0x18/0x30
  [] down_read+0x20/0x40
  [] proc_pid_cmdline_read+0xd9/0x4e0
  [] ? do_filp_open+0xa5/0x100
  [] __vfs_read+0x37/0x150
  [] ? security_file_permission+0x9b/0xc0
  [] vfs_read+0x96/0x130
  [] SyS_read+0x55/0xc0
  [] entry_SYSCALL_64_fastpath+0x1a/0xc5

Both Alexey Dobriyan and Michal Hocko suggested using a dedicated lock
for them to mitigate the abuse of mmap_sem.

So, introduce a new spinlock in mm_struct to protect the concurrent access
to arg_start|end and env_start|end.
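
As a rough userspace illustration of the intended pattern (not the kernel code in this patch): struct fake_mm, struct arg_env and the helpers below are made-up names, and a C11 atomic_flag stands in for spinlock_t. The reader copies all four values in one short critical section and uses the private copy afterwards.

```c
#include <assert.h>
#include <stdatomic.h>

/* The four values that must be read or written as one unit. */
struct arg_env {
	unsigned long arg_start, arg_end, env_start, env_end;
};

/* Hypothetical stand-in for the relevant part of mm_struct. */
struct fake_mm {
	atomic_flag arg_lock;	/* userspace stand-in for spinlock_t arg_lock */
	struct arg_env v;
};

static void fake_spin_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		;	/* spin until the holder clears the flag */
}

static void fake_spin_unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

/* Writer side (the prctl path): update all four values under the lock. */
void set_arg_env(struct fake_mm *mm, struct arg_env new_vals)
{
	fake_spin_lock(&mm->arg_lock);
	mm->v = new_vals;
	fake_spin_unlock(&mm->arg_lock);
}

/* Reader side (the /proc path): copy all four values under the lock. */
struct arg_env snapshot_arg_env(struct fake_mm *mm)
{
	struct arg_env snap;

	fake_spin_lock(&mm->arg_lock);
	snap = mm->v;
	fake_spin_unlock(&mm->arg_lock);
	return snap;
}

/* Usage: a snapshot always reflects one complete update. */
void demo_snapshot(void)
{
	struct fake_mm mm = { .arg_lock = ATOMIC_FLAG_INIT };
	struct arg_env s;

	set_arg_env(&mm, (struct arg_env){ 10, 20, 20, 40 });
	s = snapshot_arg_env(&mm);
	assert(s.arg_start <= s.arg_end && s.env_start <= s.env_end);
}
```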

And, commit ddf1d398e517e660207e2c807f76a90df543a217 ("prctl: take mmap
sem for writing to protect against others") changed down_read to
down_write to avoid a write race condition in prctl_set_mm(). Since we
now have a dedicated lock to protect them, it is safe to change back
to down_read.

Signed-off-by: Yang Shi 
Cc: Alexey Dobriyan 
Cc: Michal Hocko 
Cc: Matthew Wilcox 
Cc: Mateusz Guzik 
Cc: Cyrill Gorcunov 
---
v1 --> v2:
* Use spinlock instead of rwlock per Matthew's suggestion
* Replace down_write with down_read in prctl_set_mm (see commit log for details)

 fs/proc/base.c           |  8 ++++----
 include/linux/mm_types.h |  2 ++
 kernel/fork.c            |  1 +
 kernel/sys.c             | 14 ++++++++++----
 mm/init-mm.c             |  1 +
 5 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 9298324..e0282b6 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -242,12 +242,12 @@ static ssize_t proc_pid_cmdline_read(struct file *file, 
char __user *buf,
goto out_mmput;
}
 
-   down_read(&mm->mmap_sem);
+   spin_lock(&mm->arg_lock);
arg_start = mm->arg_start;
arg_end = mm->arg_end;
env_start = mm->env_start;
env_end = mm->env_end;
-   up_read(&mm->mmap_sem);
+   spin_unlock(&mm->arg_lock);
 
BUG_ON(arg_start > arg_end);
BUG_ON(env_start > env_end);
@@ -929,10 +929,10 @@ static ssize_t environ_read(struct file *file, char 
__user *buf,
if (!mmget_not_zero(mm))
goto free;
 
-   down_read(&mm->mmap_sem);
+   spin_lock(&mm->arg_lock);
env_start = mm->env_start;
env_end = mm->env_end;
-   up_read(&mm->mmap_sem);
+   spin_unlock(&mm->arg_lock);
 
while (count > 0) {
size_t this_len, max_len;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index fd1af6b..3be4588 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -413,6 +413,8 @@ struct mm_struct {
unsigned long def_flags;
unsigned long start_code, end_code, start_data, end_data;
unsigned long start_brk, brk, start_stack;
+
+   spinlock_t arg_lock; /* protect concurrent access to arg_* and env_* */
unsigned long arg_start, arg_end, env_start, env_end;
 
unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
diff --git a/kernel/fork.c b/kernel/fork.c
index e5d9d40..6540ae7 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -898,6 +898,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, 
struct task_struct *p,
mm->pinned_vm = 0;
memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
spin_lock_init(&mm->page_table_lock);
+   spin_lock_init(&mm->arg_lock);
mm_init_cpumask(mm);
mm_init_aio(mm);
mm_init_owner(mm, p);
diff --git a/kernel/sys.c b/kernel/sys.c
index f2289de..17bddd2 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1959,7 +1959,7 @@ static int prctl_set_mm_map(int opt, const void __user 
*addr, unsigned long data
return error;
}
 
-   down_write(&mm->mmap_sem);
+   down_read(&mm->mmap_sem);
 
/*
 * We don't validate if these members are pointing to
@@ -1980,10 +1980,13 @@ static int prctl_set_mm_map(int opt, const void __user 
*addr, unsigned long data
mm->start_brk   = prctl_map.start_brk;
mm->brk = prctl_map.brk;
mm->start_stack = pr