Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
On 3/26/18 11:29 PM, Michal Hocko wrote: On Tue 27-03-18 02:20:39, Yang Shi wrote: [...] The patch looks reasonable to me. Maybe it would be better to be more explicit about the purpose of the patch. As others noticed, this alone wouldn't solve the mmap_sem contention issues. I _think_ that if you were more explicit about the mmap_sem abuse it would trigger less questions. I have just one more question. Now that you are touching this area, would you be willing to remove the following ugliness? diff --git a/kernel/sys.c b/kernel/sys.c index f2289de..17bddd2 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -1959,7 +1959,7 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data return error; } - down_write(&mm->mmap_sem); + down_read(&mm->mmap_sem); Why do we need to hold mmap_sem here and call find_vma, when only PR_SET_MM_ENV_END: is consuming it? I guess we can replace it wit the new lock and take the mmap_sem only for PR_SET_MM_ENV_END. Sorry for taking a little bit longer to get back since I was traveling. I think all the stuff can be protected by the new arg_lock except mm->brk since arg_lock can't prevent from concurrent writing from sys_brk(). We may use arg_lock to protect everything else other than mm->brk. The complexity sounds acceptable. Of course, as Cyrill mentioned, he prefers to deprecating prctl_set_mm since C/R is the only user of it. We may wait until he is done? Thanks, Yang Thanks!
Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
On Tue 27-03-18 21:52:17, Cyrill Gorcunov wrote: > On Tue, Mar 27, 2018 at 02:38:11PM -0400, Yang Shi wrote: > > > Why do we need to hold mmap_sem here and call find_vma, when only > > > PR_SET_MM_ENV_END: is consuming it? I guess we can replace it wit the > > > new lock and take the mmap_sem only for PR_SET_MM_ENV_END. > > > > Actually, I didn't think of why. It looks prctl_set_mm() checks if vma does > > exist when it tries to set stack_start, argv_* and env_*, btw not only > > env_end. > > > > Cyrill may be able to give us some hint since C/R is the main user of this > > API. > > First and most important it makes code smaller. This prctl call is really > rarely used. Of course we can optimize it, but as I said I would prefer > to simply deprecate this old interface (and I gonne to do so once time > permit). Ohh, it would be really great if we can remove this thingy altogether. I cannot say it has a wee bit of my sympathy. -- Michal Hocko SUSE Labs
Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
On Tue, Mar 27, 2018 at 02:38:11PM -0400, Yang Shi wrote: > > Why do we need to hold mmap_sem here and call find_vma, when only > > PR_SET_MM_ENV_END: is consuming it? I guess we can replace it wit the > > new lock and take the mmap_sem only for PR_SET_MM_ENV_END. > > Actually, I didn't think of why. It looks prctl_set_mm() checks if vma does > exist when it tries to set stack_start, argv_* and env_*, btw not only > env_end. > > Cyrill may be able to give us some hint since C/R is the main user of this > API. First and most important it makes code smaller. This prctl call is really rarely used. Of course we can optimize it, but as I said I would prefer to simply deprecate this old interface (and I gonne to do so once time permit).
Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
On 3/27/18 3:32 AM, Cyrill Gorcunov wrote: On Mon, Mar 26, 2018 at 05:59:49PM -0400, Yang Shi wrote: Say we've two syscalls running prctl_set_mm_map in parallel, and imagine one have @start_brk = 20 @brk = 10 and second caller has @start_brk = 30 and @brk = 20. Since now the call is guarded by _read_ the both calls unlocked and due to OO engine it may happen then when both finish we have @start_brk = 30 and @brk = 10. In turn "write" semaphore has been take to have consistent data on exit, either you have [20;10] or [30;20] assigned not something mixed. That said I think using read-lock here would be a bug. Yes it sounds so. However, it was down_read before ddf1d398e517e660207e2c807f76a90df543a217 ("prctl: take mmap sem for writing to protect against others"). And, that commit is for fixing the concurrent writing to arg_* and env_*. I just checked that commit, but omitted the brk part. The potential issue mentioned by you should exist before that commit, but might be just not discovered or very rare to hit. I will change it back to down_write. down_read before was a bug ;) And it was not discovered earlier simply because not that many users of this interface exist, namely only criu as far as I know by now. Thanks for confirming this. I assumed so :-)
Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
On 3/27/18 2:29 AM, Michal Hocko wrote: On Tue 27-03-18 02:20:39, Yang Shi wrote: [...] The patch looks reasonable to me. Maybe it would be better to be more explicit about the purpose of the patch. As others noticed, this alone wouldn't solve the mmap_sem contention issues. I _think_ that if you were more explicit about the mmap_sem abuse it would trigger less questions. Yes, sure. I have just one more question. Now that you are touching this area, would you be willing to remove the following ugliness? diff --git a/kernel/sys.c b/kernel/sys.c index f2289de..17bddd2 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -1959,7 +1959,7 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data return error; } - down_write(&mm->mmap_sem); + down_read(&mm->mmap_sem); Why do we need to hold mmap_sem here and call find_vma, when only PR_SET_MM_ENV_END: is consuming it? I guess we can replace it wit the new lock and take the mmap_sem only for PR_SET_MM_ENV_END. Actually, I didn't think of why. It looks prctl_set_mm() checks if vma does exist when it tries to set stack_start, argv_* and env_*, btw not only env_end. Cyrill may be able to give us some hint since C/R is the main user of this API. Yang Thanks!
Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
On Tue 27-03-18 16:31:23, Mateusz Guzik wrote: > On Tue, Mar 27, 2018 at 08:29:39AM +0200, Michal Hocko wrote: > > On Tue 27-03-18 02:20:39, Yang Shi wrote: > > [...] > > The patch looks reasonable to me. Maybe it would be better to be more > > explicit about the purpose of the patch. As others noticed, this alone > > wouldn't solve the mmap_sem contention issues. I _think_ that if you > > were more explicit about the mmap_sem abuse it would trigger less > > questions. > > > > >From what I gather even with other fixes the kernel will still end up > grabbing the semaphore. In this case I don't see what's the upside of > adding the spinlock for args. The downside is growth of mm_struct. Because accessing the specific address in the address space can be later changed to use a more fine-grained locking. There are people experimenting with range locking. These mmap_sem abusers, on the other hand, will require the full range lock without a good reason. So it is really worth it to remove them and replace by a more fine grained locking. If the mm_struct grow is a real concern (I haven't checked that) then we can use a set of hashed locks or something else. -- Michal Hocko SUSE Labs
Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
On Tue, Mar 27, 2018 at 08:29:39AM +0200, Michal Hocko wrote: > On Tue 27-03-18 02:20:39, Yang Shi wrote: > [...] > The patch looks reasonable to me. Maybe it would be better to be more > explicit about the purpose of the patch. As others noticed, this alone > wouldn't solve the mmap_sem contention issues. I _think_ that if you > were more explicit about the mmap_sem abuse it would trigger less > questions. > >From what I gather even with other fixes the kernel will still end up grabbing the semaphore. In this case I don't see what's the upside of adding the spinlock for args. The downside is growth of mm_struct. i.e. the code can be refactored to just hold the lock and relock only if necessary (unable to copy to user without faulting) -- Mateusz Guzik
Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
On Mon, Mar 26, 2018 at 06:12:55PM -0400, Yang Shi wrote: > > + if (unlikely(arg_start > arg_end || env_start > env_end)) { > > + cond_resched(); > > + goto retry; > > Can't it trap into dead loop if the condition is always false? Yes, unfortunately it can. > > + } > > > > for reading these fields. > > > > By the way, /proc/pid/ readers are serving as a canary who tells something > > mm_mmap related problem is happening. On the other hand, it is sad that > > such canary cannot be terminated by signal due to use of unkillable waits. > > I wish we can use killable waits. > > I already proposed patches (https://lkml.org/lkml/2018/2/26/1197) to do this > a few weeks ago. In the review, akpm suggested mitigate the mmap_sem > contention instead of using killable version workaround. Then the > preliminary unmaping by section patches > (https://lkml.org/lkml/2018/3/20/786) were proposed. In the discussion, we > decided to eliminate the mmap_sem abuse, this is where the patch came from.
Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
On Tue, Mar 27, 2018 at 07:00:56AM +0900, Tetsuo Handa wrote: > > > To be fair I would prefer to drop this old per-field > > interface completely. This per-field interface was rather an ugly > > solution from my side. > > But this is userspace visible API and thus we cannot change. Hi! We could deplrecate this API call for a couple of releases and then if nobody complain we could rip it off completely. There should not be many users I think, didn't heard that someone except criu used it ever. > > > Then, I wonder whether reading arg_start|end and env_start|end atomically > > > makes > > > sense. Just retry reading if arg_start > env_end or env_start > env_end > > > is fine? > > > > Tetsuo, let me re-read this code tomorrow, maybe I miss something obvious. > > > > You are not missing my point. What I thought is > > +retry: > - down_read(&mm->mmap_sem); > arg_start = mm->arg_start; > arg_end = mm->arg_end; > env_start = mm->env_start; > env_end = mm->env_end; > - up_read(&mm->mmap_sem); > > - BUG_ON(arg_start > arg_end); > - BUG_ON(env_start > env_end); > + if (unlikely(arg_start > arg_end || env_start > env_end)) { > + cond_resched(); > + goto retry; > + } > > for reading these fields. I fear such contentional cycles are acceptable if only they are guaranteed to finish eventually. Which doesn't look so in the code above. Cyrill
Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
On Mon, Mar 26, 2018 at 05:59:49PM -0400, Yang Shi wrote: > > Say we've two syscalls running prctl_set_mm_map in parallel, and imagine > > one have @start_brk = 20 @brk = 10 and second caller has @start_brk = 30 > > and @brk = 20. Since now the call is guarded by _read_ the both calls > > unlocked and due to OO engine it may happen then when both finish > > we have @start_brk = 30 and @brk = 10. In turn "write" semaphore > > has been take to have consistent data on exit, either you have [20;10] > > or [30;20] assigned not something mixed. > > > > That said I think using read-lock here would be a bug. > > Yes it sounds so. However, it was down_read before > ddf1d398e517e660207e2c807f76a90df543a217 ("prctl: take mmap sem for writing > to protect against others"). And, that commit is for fixing the concurrent > writing to arg_* and env_*. I just checked that commit, but omitted the brk > part. The potential issue mentioned by you should exist before that commit, > but might be just not discovered or very rare to hit. > > I will change it back to down_write. down_read before was a bug ;) And it was not discovered earlier simply because not that many users of this interface exist, namely only criu as far as I know by now.
Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
On Tue 27-03-18 02:20:39, Yang Shi wrote: [...] The patch looks reasonable to me. Maybe it would be better to be more explicit about the purpose of the patch. As others noticed, this alone wouldn't solve the mmap_sem contention issues. I _think_ that if you were more explicit about the mmap_sem abuse it would trigger less questions. I have just one more question. Now that you are touching this area, would you be willing to remove the following ugliness? > diff --git a/kernel/sys.c b/kernel/sys.c > index f2289de..17bddd2 100644 > --- a/kernel/sys.c > +++ b/kernel/sys.c > @@ -1959,7 +1959,7 @@ static int prctl_set_mm_map(int opt, const void __user > *addr, unsigned long data > return error; > } > > - down_write(&mm->mmap_sem); > + down_read(&mm->mmap_sem); Why do we need to hold mmap_sem here and call find_vma, when only PR_SET_MM_ENV_END: is consuming it? I guess we can replace it wit the new lock and take the mmap_sem only for PR_SET_MM_ENV_END. Thanks! -- Michal Hocko SUSE Labs
Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
On 3/26/18 6:00 PM, Tetsuo Handa wrote: Cyrill Gorcunov wrote: On Tue, Mar 27, 2018 at 06:10:09AM +0900, Tetsuo Handa wrote: On 2018/03/27 4:21, Cyrill Gorcunov wrote: That said I think using read-lock here would be a bug. If I understand correctly, the caller can't set both fields atomically, for prctl() does not receive both fields at one call. prctl(PR_SET_MM, PR_SET_MM_ARG_START xor PR_SET_MM_ARG_END xor PR_SET_MM_ENV_START xor PR_SET_MM_ENV_END, new value, 0, 0); True, but the key moment is that two/three/four system calls can run simultaneously. And while previously they are ordered by "write", with read lock they are completely unordered and this is really worries me. Yes, we need exclusive lock when updating these fields. To be fair I would prefer to drop this old per-field interface completely. This per-field interface was rather an ugly solution from my side. But this is userspace visible API and thus we cannot change. Then, I wonder whether reading arg_start|end and env_start|end atomically makes sense. Just retry reading if arg_start > env_end or env_start > env_end is fine? Tetsuo, let me re-read this code tomorrow, maybe I miss something obvious. You are not missing my point. What I thought is +retry: - down_read(&mm->mmap_sem); arg_start = mm->arg_start; arg_end = mm->arg_end; env_start = mm->env_start; env_end = mm->env_end; - up_read(&mm->mmap_sem); - BUG_ON(arg_start > arg_end); - BUG_ON(env_start > env_end); + if (unlikely(arg_start > arg_end || env_start > env_end)) { + cond_resched(); + goto retry; Can't it trap into dead loop if the condition is always false? + } for reading these fields. By the way, /proc/pid/ readers are serving as a canary who tells something mm_mmap related problem is happening. On the other hand, it is sad that such canary cannot be terminated by signal due to use of unkillable waits. I wish we can use killable waits. I already proposed patches (https://lkml.org/lkml/2018/2/26/1197) to do this a few weeks ago. In the review, akpm suggested mitigate the mmap_sem contention instead of using killable version workaround. Then the preliminary unmaping by section patches (https://lkml.org/lkml/2018/3/20/786) were proposed. In the discussion, we decided to eliminate the mmap_sem abuse, this is where the patch came from. Yang
Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
Cyrill Gorcunov wrote: > On Tue, Mar 27, 2018 at 06:10:09AM +0900, Tetsuo Handa wrote: > > On 2018/03/27 4:21, Cyrill Gorcunov wrote: > > > That said I think using read-lock here would be a bug. > > > > If I understand correctly, the caller can't set both fields atomically, for > > prctl() does not receive both fields at one call. > > > > prctl(PR_SET_MM, PR_SET_MM_ARG_START xor PR_SET_MM_ARG_END xor > > PR_SET_MM_ENV_START xor PR_SET_MM_ENV_END, new value, 0, 0); > > > > True, but the key moment is that two/three/four system calls can > run simultaneously. And while previously they are ordered by "write", > with read lock they are completely unordered and this is really > worries me. Yes, we need exclusive lock when updating these fields. > To be fair I would prefer to drop this old per-field > interface completely. This per-field interface was rather an ugly > solution from my side. But this is userspace visible API and thus we cannot change. > > > Then, I wonder whether reading arg_start|end and env_start|end atomically > > makes > > sense. Just retry reading if arg_start > env_end or env_start > env_end is > > fine? > > Tetsuo, let me re-read this code tomorrow, maybe I miss something obvious. > You are not missing my point. What I thought is +retry: - down_read(&mm->mmap_sem); arg_start = mm->arg_start; arg_end = mm->arg_end; env_start = mm->env_start; env_end = mm->env_end; - up_read(&mm->mmap_sem); - BUG_ON(arg_start > arg_end); - BUG_ON(env_start > env_end); + if (unlikely(arg_start > arg_end || env_start > env_end)) { + cond_resched(); + goto retry; + } for reading these fields. By the way, /proc/pid/ readers are serving as a canary who tells something mm_mmap related problem is happening. On the other hand, it is sad that such canary cannot be terminated by signal due to use of unkillable waits. I wish we can use killable waits.
Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
On 3/26/18 3:21 PM, Cyrill Gorcunov wrote: On Mon, Mar 26, 2018 at 11:37:25AM -0700, Matthew Wilcox wrote: On Tue, Mar 27, 2018 at 02:20:39AM +0800, Yang Shi wrote: +++ b/kernel/sys.c @@ -1959,7 +1959,7 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data return error; } - down_write(&mm->mmap_sem); + down_read(&mm->mmap_sem); /* * We don't validate if these members are pointing to @@ -1980,10 +1980,13 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data mm->start_brk= prctl_map.start_brk; mm->brk = prctl_map.brk; mm->start_stack = prctl_map.start_stack; + + spin_lock(&mm->arg_lock); mm->arg_start= prctl_map.arg_start; mm->arg_end = prctl_map.arg_end; mm->env_start= prctl_map.env_start; mm->env_end = prctl_map.env_end; + spin_unlock(&mm->arg_lock); /* * Note this update of @saved_auxv is lockless thus I see the argument for the change to a write lock was because of a BUG validating arg_start and arg_end, but more generally, we are updating these values, so a write-lock is probably a good idea, and this is a very rare operation to do, so we don't care about making this more parallel. I would not make this change (but if other more knowledgable people in this area disagree with me, I will withdraw my objection to this part). Say we've two syscalls running prctl_set_mm_map in parallel, and imagine one have @start_brk = 20 @brk = 10 and second caller has @start_brk = 30 and @brk = 20. Since now the call is guarded by _read_ the both calls unlocked and due to OO engine it may happen then when both finish we have @start_brk = 30 and @brk = 10. In turn "write" semaphore has been take to have consistent data on exit, either you have [20;10] or [30;20] assigned not something mixed. That said I think using read-lock here would be a bug. Yes it sounds so. However, it was down_read before ddf1d398e517e660207e2c807f76a90df543a217 ("prctl: take mmap sem for writing to protect against others"). And, that commit is for fixing the concurrent writing to arg_* and env_*. I just checked that commit, but omitted the brk part. The potential issue mentioned by you should exist before that commit, but might be just not discovered or very rare to hit. I will change it back to down_write. Thanks, Yang Cyrill
Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
On Tue, Mar 27, 2018 at 06:10:09AM +0900, Tetsuo Handa wrote: > On 2018/03/27 4:21, Cyrill Gorcunov wrote: > > That said I think using read-lock here would be a bug. > > If I understand correctly, the caller can't set both fields atomically, for > prctl() does not receive both fields at one call. > > prctl(PR_SET_MM, PR_SET_MM_ARG_START xor PR_SET_MM_ARG_END xor > PR_SET_MM_ENV_START xor PR_SET_MM_ENV_END, new value, 0, 0); > True, but the key moment is that two/three/four system calls can run simultaneously. And while previously they are ordered by "write", with read lock they are completely unordered and this is really worries me. To be fair I would prefer to drop this old per-field interface completely. This per-field interface was rather an ugly solution from my side. > Then, I wonder whether reading arg_start|end and env_start|end atomically > makes > sense. Just retry reading if arg_start > env_end or env_start > env_end is > fine? Tetsuo, let me re-read this code tomorrow, maybe I miss something obvious.
Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
On 3/26/18 5:10 PM, Tetsuo Handa wrote: On 2018/03/27 4:21, Cyrill Gorcunov wrote: That said I think using read-lock here would be a bug. If I understand correctly, the caller can't set both fields atomically, for prctl() does not receive both fields at one call. prctl(PR_SET_MM, PR_SET_MM_ARG_START xor PR_SET_MM_ARG_END xor PR_SET_MM_ENV_START xor PR_SET_MM_ENV_END, new value, 0, 0); Then, I wonder whether reading arg_start|end and env_start|end atomically makes sense. Just retry reading if arg_start > env_end or env_start > env_end is fine? It might trap into dead loop if those are set to wrong values, right?
Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
On 3/26/18 3:42 PM, Mateusz Guzik wrote: On Tue, Mar 27, 2018 at 02:20:39AM +0800, Yang Shi wrote: mmap_sem is on the hot path of kernel, and it very contended, but it is abused too. It is used to protect arg_start|end and evn_start|end when reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make sense since those proc files just expect to read 4 values atomically and not related to VM, they could be set to arbitrary values by C/R. They are not arbitrary - there is basic validation performed when setting them. And, the mmap_sem contention may cause unexpected issue like below: INFO: task ps:14018 blocked for more than 120 seconds. Tainted: GE 4.9.79-009.ali3000.alios7.x86_64 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. ps D0 14018 1 0x0004 885582f84000 885e8682f000 880972943000 885ebf499bc0 8828ee12 c900349bfca8 817154d0 0040 00ff812f872a 885ebf499bc0 024000d000948300 880972943000 Call Trace: [] ? __schedule+0x250/0x730 [] schedule+0x36/0x80 [] rwsem_down_read_failed+0xf0/0x150 [] call_rwsem_down_read_failed+0x18/0x30 [] down_read+0x20/0x40 [] proc_pid_cmdline_read+0xd9/0x4e0 [] ? do_filp_open+0xa5/0x100 [] __vfs_read+0x37/0x150 [] ? security_file_permission+0x9b/0xc0 [] vfs_read+0x96/0x130 [] SyS_read+0x55/0xc0 [] entry_SYSCALL_64_fastpath+0x1a/0xc5 Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock for them to mitigate the abuse of mmap_sem. While switching to arg spinlock here will relieve mmap_sem to an extent, it wont help with the problem you are seeing here. proc_pid_cmdline_read -> access_process_vm -> __access_remote_vm and you got yet another down_read(&mm->mmap_sem);. i.e. the issue you ran into is well known and predates my change. The problem does not stem from contention either, but blocking for a long time while holding the lock - the most common example is dealing with dead nfs mount vs mmaped areas. I don't have good ideas how to fix the problem. The least bad I came up with was to trylock with a timeout - after a failure either return an error or resort to returning p_comm. ps/top could be modified to fallback to snatching the name from /status. Since the lock owner is now being stored in the semaphore, perhaps the above routine can happily spin until it grabs the lock or the owner is detected to have gone into uninterruptible sleep and react accordingly. I don't know whether it is feasible to somehow avoid the mmap lock altogether. If it has to be there no matter what the code can be refactored to grab it once and relock only if copyout would fault. This would in particular reduce the number of times it is taken to begin with and still provide the current synchronisation against prctl. But the fundamental problem will remain. That said, refactoring above will have the same effect as your patch and will avoid growing mm_struct. That's my $0,03. MM overlords have to comment on what to do with this. Thanks for the comment. Yes, the spin lock absolutely can't solve all the mmap_sem scalability issue. Actually, I already proposed a preliminary RFC to try to mitigate the mmap_sem issue. It is still under review. Other than that, we also found mmap_sem is abused somewhere, so this patch is proposed to reduce the abuse to mmap_sem. Yang So, introduce a new spinlock in mm_struct to protect the concurrent access to arg_start|end and env_start|end. And, commit ddf1d398e517e660207e2c807f76a90df543a217 ("prctl: take mmap sem for writing to protect against others") changed down_read to down_write to avoid write race condition in prctl_set_mm(). Since we already have dedicated lock to protect them, it is safe to change back to down_read. Signed-off-by: Yang Shi Cc: Alexey Dobriyan Cc: Michal Hocko Cc: Matthew Wilcox Cc: Mateusz Guzik Cc: Cyrill Gorcunov --- v1 --> v2: * Use spinlock instead of rwlock per Mattew's suggestion * Replace down_write to down_read in prctl_set_mm (see commit log for details) fs/proc/base.c | 8 include/linux/mm_types.h | 2 ++ kernel/fork.c| 1 + kernel/sys.c | 14 ++ mm/init-mm.c | 1 + 5 files changed, 18 insertions(+), 8 deletions(-) diff --git a/fs/proc/base.c b/fs/proc/base.c index 9298324..e0282b6 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -242,12 +242,12 @@ static ssize_t proc_pid_cmdline_read(struct file *file, char __user *buf, goto out_mmput; } - down_read(&mm->mmap_sem); + spin_lock(&mm->arg_lock); arg_start = mm->arg_start; arg_end = mm->arg_end; env_start = mm->env_start; env_end = mm->env_end; - up_read(&mm->mmap_sem); + spin_unlock(&mm->arg_lock); BUG_ON(arg_start > arg_end); BUG_ON(env_start > env_end); @@ -929,10 +929
Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
On 2018/03/27 4:21, Cyrill Gorcunov wrote: > That said I think using read-lock here would be a bug. If I understand correctly, the caller can't set both fields atomically, for prctl() does not receive both fields at one call. prctl(PR_SET_MM, PR_SET_MM_ARG_START xor PR_SET_MM_ARG_END xor PR_SET_MM_ENV_START xor PR_SET_MM_ENV_END, new value, 0, 0); Then, I wonder whether reading arg_start|end and env_start|end atomically makes sense. Just retry reading if arg_start > env_end or env_start > env_end is fine?
Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
On Tue, Mar 27, 2018 at 02:20:39AM +0800, Yang Shi wrote: > mmap_sem is on the hot path of kernel, and it very contended, but it is > abused too. It is used to protect arg_start|end and evn_start|end when > reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make > sense since those proc files just expect to read 4 values atomically and > not related to VM, they could be set to arbitrary values by C/R. > They are not arbitrary - there is basic validation performed when setting them. > And, the mmap_sem contention may cause unexpected issue like below: > > INFO: task ps:14018 blocked for more than 120 seconds. >Tainted: GE 4.9.79-009.ali3000.alios7.x86_64 #1 > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this > message. > ps D0 14018 1 0x0004 > 885582f84000 885e8682f000 880972943000 885ebf499bc0 > 8828ee12 c900349bfca8 817154d0 0040 > 00ff812f872a 885ebf499bc0 024000d000948300 880972943000 > Call Trace: > [] ? __schedule+0x250/0x730 > [] schedule+0x36/0x80 > [] rwsem_down_read_failed+0xf0/0x150 > [] call_rwsem_down_read_failed+0x18/0x30 > [] down_read+0x20/0x40 > [] proc_pid_cmdline_read+0xd9/0x4e0 > [] ? do_filp_open+0xa5/0x100 > [] __vfs_read+0x37/0x150 > [] ? security_file_permission+0x9b/0xc0 > [] vfs_read+0x96/0x130 > [] SyS_read+0x55/0xc0 > [] entry_SYSCALL_64_fastpath+0x1a/0xc5 > > Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock > for them to mitigate the abuse of mmap_sem. > While switching to arg spinlock here will relieve mmap_sem to an extent, it wont help with the problem you are seeing here. proc_pid_cmdline_read -> access_process_vm -> __access_remote_vm and you got yet another down_read(&mm->mmap_sem);. i.e. the issue you ran into is well known and predates my change. The problem does not stem from contention either, but blocking for a long time while holding the lock - the most common example is dealing with dead nfs mount vs mmaped areas. I don't have good ideas how to fix the problem. The least bad I came up with was to trylock with a timeout - after a failure either return an error or resort to returning p_comm. ps/top could be modified to fallback to snatching the name from /status. Since the lock owner is now being stored in the semaphore, perhaps the above routine can happily spin until it grabs the lock or the owner is detected to have gone into uninterruptible sleep and react accordingly. I don't know whether it is feasible to somehow avoid the mmap lock altogether. If it has to be there no matter what the code can be refactored to grab it once and relock only if copyout would fault. This would in particular reduce the number of times it is taken to begin with and still provide the current synchronisation against prctl. But the fundamental problem will remain. That said, refactoring above will have the same effect as your patch and will avoid growing mm_struct. That's my $0,03. MM overlords have to comment on what to do with this. > So, introduce a new spinlock in mm_struct to protect the concurrent access > to arg_start|end and env_start|end. > > And, commit ddf1d398e517e660207e2c807f76a90df543a217 ("prctl: take mmap > sem for writing to protect against others") changed down_read to > down_write to avoid write race condition in prctl_set_mm(). Since we > already have dedicated lock to protect them, it is safe to change back > to down_read. > > Signed-off-by: Yang Shi > Cc: Alexey Dobriyan > Cc: Michal Hocko > Cc: Matthew Wilcox > Cc: Mateusz Guzik > Cc: Cyrill Gorcunov > --- > v1 --> v2: > * Use spinlock instead of rwlock per Mattew's suggestion > * Replace down_write to down_read in prctl_set_mm (see commit log for details) > > fs/proc/base.c | 8 > include/linux/mm_types.h | 2 ++ > kernel/fork.c| 1 + > kernel/sys.c | 14 ++ > mm/init-mm.c | 1 + > 5 files changed, 18 insertions(+), 8 deletions(-) > > diff --git a/fs/proc/base.c b/fs/proc/base.c > index 9298324..e0282b6 100644 > --- a/fs/proc/base.c > +++ b/fs/proc/base.c > @@ -242,12 +242,12 @@ static ssize_t proc_pid_cmdline_read(struct file *file, > char __user *buf, > goto out_mmput; > } > > - down_read(&mm->mmap_sem); > + spin_lock(&mm->arg_lock); > arg_start = mm->arg_start; > arg_end = mm->arg_end; > env_start = mm->env_start; > env_end = mm->env_end; > - up_read(&mm->mmap_sem); > + spin_unlock(&mm->arg_lock); > > BUG_ON(arg_start > arg_end); > BUG_ON(env_start > env_end); > @@ -929,10 +929,10 @@ static ssize_t environ_read(struct file *file, char > __user *buf, > if (!mmget_not_zero(mm)) > goto free; > > - down_read(&mm->mmap_sem); > + spin_lock(&mm->arg_lock); > env_start = mm->env_start; > env_end = mm->env_end; > -
Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
On Mon, Mar 26, 2018 at 11:37:25AM -0700, Matthew Wilcox wrote: > On Tue, Mar 27, 2018 at 02:20:39AM +0800, Yang Shi wrote: > > +++ b/kernel/sys.c > > @@ -1959,7 +1959,7 @@ static int prctl_set_mm_map(int opt, const void > > __user *addr, unsigned long data > > return error; > > } > > > > - down_write(&mm->mmap_sem); > > + down_read(&mm->mmap_sem); > > > > /* > > * We don't validate if these members are pointing to > > @@ -1980,10 +1980,13 @@ static int prctl_set_mm_map(int opt, const void > > __user *addr, unsigned long data > > mm->start_brk = prctl_map.start_brk; > > mm->brk = prctl_map.brk; > > mm->start_stack = prctl_map.start_stack; > > + > > + spin_lock(&mm->arg_lock); > > mm->arg_start = prctl_map.arg_start; > > mm->arg_end = prctl_map.arg_end; > > mm->env_start = prctl_map.env_start; > > mm->env_end = prctl_map.env_end; > > + spin_unlock(&mm->arg_lock); > > > > /* > > * Note this update of @saved_auxv is lockless thus > > I see the argument for the change to a write lock was because of a BUG > validating arg_start and arg_end, but more generally, we are updating these > values, so a write-lock is probably a good idea, and this is a very rare > operation to do, so we don't care about making this more parallel. I would > not make this change (but if other more knowledgable people in this area > disagree with me, I will withdraw my objection to this part). Say we've two syscalls running prctl_set_mm_map in parallel, and imagine one have @start_brk = 20 @brk = 10 and second caller has @start_brk = 30 and @brk = 20. Since now the call is guarded by _read_ the both calls unlocked and due to OO engine it may happen then when both finish we have @start_brk = 30 and @brk = 10. In turn "write" semaphore has been take to have consistent data on exit, either you have [20;10] or [30;20] assigned not something mixed. That said I think using read-lock here would be a bug. Cyrill
Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
On Tue, Mar 27, 2018 at 02:20:39AM +0800, Yang Shi wrote: > +++ b/kernel/sys.c > @@ -1959,7 +1959,7 @@ static int prctl_set_mm_map(int opt, const void __user > *addr, unsigned long data > return error; > } > > - down_write(&mm->mmap_sem); > + down_read(&mm->mmap_sem); > > /* >* We don't validate if these members are pointing to > @@ -1980,10 +1980,13 @@ static int prctl_set_mm_map(int opt, const void > __user *addr, unsigned long data > mm->start_brk = prctl_map.start_brk; > mm->brk = prctl_map.brk; > mm->start_stack = prctl_map.start_stack; > + > + spin_lock(&mm->arg_lock); > mm->arg_start = prctl_map.arg_start; > mm->arg_end = prctl_map.arg_end; > mm->env_start = prctl_map.env_start; > mm->env_end = prctl_map.env_end; > + spin_unlock(&mm->arg_lock); > > /* >* Note this update of @saved_auxv is lockless thus I see the argument for the change to a write lock was because of a BUG validating arg_start and arg_end, but more generally, we are updating these values, so a write-lock is probably a good idea, and this is a very rare operation to do, so we don't care about making this more parallel. I would not make this change (but if other more knowledgable people in this area disagree with me, I will withdraw my objection to this part).
[v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
mmap_sem is on the hot path of kernel, and it very contended, but it is abused too. It is used to protect arg_start|end and evn_start|end when reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make sense since those proc files just expect to read 4 values atomically and not related to VM, they could be set to arbitrary values by C/R. And, the mmap_sem contention may cause unexpected issue like below: INFO: task ps:14018 blocked for more than 120 seconds. Tainted: GE 4.9.79-009.ali3000.alios7.x86_64 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. ps D0 14018 1 0x0004 885582f84000 885e8682f000 880972943000 885ebf499bc0 8828ee12 c900349bfca8 817154d0 0040 00ff812f872a 885ebf499bc0 024000d000948300 880972943000 Call Trace: [] ? __schedule+0x250/0x730 [] schedule+0x36/0x80 [] rwsem_down_read_failed+0xf0/0x150 [] call_rwsem_down_read_failed+0x18/0x30 [] down_read+0x20/0x40 [] proc_pid_cmdline_read+0xd9/0x4e0 [] ? do_filp_open+0xa5/0x100 [] __vfs_read+0x37/0x150 [] ? security_file_permission+0x9b/0xc0 [] vfs_read+0x96/0x130 [] SyS_read+0x55/0xc0 [] entry_SYSCALL_64_fastpath+0x1a/0xc5 Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock for them to mitigate the abuse of mmap_sem. So, introduce a new spinlock in mm_struct to protect the concurrent access to arg_start|end and env_start|end. And, commit ddf1d398e517e660207e2c807f76a90df543a217 ("prctl: take mmap sem for writing to protect against others") changed down_read to down_write to avoid write race condition in prctl_set_mm(). Since we already have dedicated lock to protect them, it is safe to change back to down_read. Signed-off-by: Yang Shi Cc: Alexey Dobriyan Cc: Michal Hocko Cc: Matthew Wilcox Cc: Mateusz Guzik Cc: Cyrill Gorcunov --- v1 --> v2: * Use spinlock instead of rwlock per Mattew's suggestion * Replace down_write to down_read in prctl_set_mm (see commit log for details) fs/proc/base.c | 8 include/linux/mm_types.h | 2 ++ kernel/fork.c| 1 + kernel/sys.c | 14 ++ mm/init-mm.c | 1 + 5 files changed, 18 insertions(+), 8 deletions(-) diff --git a/fs/proc/base.c b/fs/proc/base.c index 9298324..e0282b6 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -242,12 +242,12 @@ static ssize_t proc_pid_cmdline_read(struct file *file, char __user *buf, goto out_mmput; } - down_read(&mm->mmap_sem); + spin_lock(&mm->arg_lock); arg_start = mm->arg_start; arg_end = mm->arg_end; env_start = mm->env_start; env_end = mm->env_end; - up_read(&mm->mmap_sem); + spin_unlock(&mm->arg_lock); BUG_ON(arg_start > arg_end); BUG_ON(env_start > env_end); @@ -929,10 +929,10 @@ static ssize_t environ_read(struct file *file, char __user *buf, if (!mmget_not_zero(mm)) goto free; - down_read(&mm->mmap_sem); + spin_lock(&mm->arg_lock); env_start = mm->env_start; env_end = mm->env_end; - up_read(&mm->mmap_sem); + spin_unlock(&mm->arg_lock); while (count > 0) { size_t this_len, max_len; diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index fd1af6b..3be4588 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -413,6 +413,8 @@ struct mm_struct { unsigned long def_flags; unsigned long start_code, end_code, start_data, end_data; unsigned long start_brk, brk, start_stack; + + spinlock_t arg_lock; /* protect concurrent access to arg_* and env_* */ unsigned long arg_start, arg_end, env_start, env_end; unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */ diff --git a/kernel/fork.c b/kernel/fork.c index e5d9d40..6540ae7 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -898,6 +898,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, mm->pinned_vm = 0; memset(&mm->rss_stat, 0, sizeof(mm->rss_stat)); spin_lock_init(&mm->page_table_lock); + spin_lock_init(&mm->arg_lock); mm_init_cpumask(mm); mm_init_aio(mm); mm_init_owner(mm, p); diff --git a/kernel/sys.c b/kernel/sys.c index f2289de..17bddd2 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -1959,7 +1959,7 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data return error; } - down_write(&mm->mmap_sem); + down_read(&mm->mmap_sem); /* * We don't validate if these members are pointing to @@ -1980,10 +1980,13 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data mm->start_brk = prctl_map.start_brk; mm->brk = prctl_map.brk; mm->start_stack = pr