Re: [PATCH] mm: oom: Fix race condition between oom_badness and do_exit of task

2018-03-13 Thread Michal Hocko
[Sorry about the slow response but I was offline for almost two weeks
and catching up with a tsunami in my inbox now]

On Fri 09-03-18 19:48:46, Tetsuo Handa wrote:
> Kohli, Gaurav wrote:
> > > t->alloc_lock is still held when leaving find_lock_task_mm(), which means
> > > that t->mm != NULL. But nothing prevents t from setting t->mm = NULL at
> > > exit_mm() from do_exit() and calling exit_creds() from __put_task_struct(t)
> > > after task_unlock(t) is called. Seems difficult to trigger race window. Maybe
> > > something has preempted because oom_badness() becomes outside of RCU grace
> > > period upon leaving find_lock_task_mm() when called from proc_oom_score().
> > 
> > Hi Tetsuo,
> > 
> > Yes it is not easy to reproduce seen twice till now and i agree with
> > your analysis. But David has already fixing this in different way,
> > So that also looks better to me:
> > 
> > https://patchwork.kernel.org/patch/10265641/
> > 
> 
> Yes, I'm aware of that patch.
> 
> > But if need to keep that code, So we have to bump up the task
> > reference that's only i can think of now.
> 
> I don't think so, for I think it is safe to call
> has_capability_noaudit(p) with p->alloc_lock held.

This however adds a subtle assumption on locking here and we should
rather not do so. The scope of alloc_lock is quite messy already and
adding on top is definitely not an improvement.

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index f2e7dfb..4efcfb8 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -222,7 +222,6 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
>  	 */
>  	points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
>  		mm_pgtables_bytes(p->mm) / PAGE_SIZE;
> -	task_unlock(p);
>  
>  	/*
>  	 * Root processes get 3% bonus, just like the __vm_enough_memory()
> @@ -230,6 +229,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
>  	 */
>  	if (has_capability_noaudit(p, CAP_SYS_ADMIN))
>  		points -= (points * 3) / 100;
> +	task_unlock(p);
>  
>  	/* Normalize to oom_score_adj units */
>  	adj *= totalpages / 1000;

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm: oom: Fix race condition between oom_badness and do_exit of task

2018-03-09 Thread Tetsuo Handa
Kohli, Gaurav wrote:
> On 3/9/2018 4:18 PM, Tetsuo Handa wrote:
> 
> > Kohli, Gaurav wrote:
> >>> t->alloc_lock is still held when leaving find_lock_task_mm(), which means
> >>> that t->mm != NULL. But nothing prevents t from setting t->mm = NULL at
> >>> exit_mm() from do_exit() and calling exit_creds() from __put_task_struct(t)
> >>> after task_unlock(t) is called. Seems difficult to trigger race window. Maybe
> >>> something has preempted because oom_badness() becomes outside of RCU grace
> >>> period upon leaving find_lock_task_mm() when called from proc_oom_score().
> >> Hi Tetsuo,
> >>
> >> Yes it is not easy to reproduce seen twice till now and i agree with
> >> your analysis. But David has already fixing this in different way,
> >> So that also looks better to me:
> >>
> >> https://patchwork.kernel.org/patch/10265641/
> >>
> > Yes, I'm aware of that patch.
> >
> >> But if need to keep that code, So we have to bump up the task
> >> reference that's only i can think of now.
> > I don't think so, for I think it is safe to call
> > has_capability_noaudit(p) with p->alloc_lock held.
> >
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index f2e7dfb..4efcfb8 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -222,7 +222,6 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
> >  	 */
> >  	points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
> >  		mm_pgtables_bytes(p->mm) / PAGE_SIZE;
> > -	task_unlock(p);
> >  
> >  	/*
> >  	 * Root processes get 3% bonus, just like the __vm_enough_memory()
> > @@ -230,6 +229,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
> >  	 */
> >  	if (has_capability_noaudit(p, CAP_SYS_ADMIN))
> >  		points -= (points * 3) / 100;
> > +	task_unlock(p);
> 
> Earlier i have thought the same to post this, but this may create
> problem if there are sleeping calls in has_capability_noaudit ?

has_capability_noaudit() does not sleep. See what has_ns_capability_noaudit() 
is doing.

> 
> >  
> >  	/* Normalize to oom_score_adj units */
> >  	adj *= totalpages / 1000;
> >
> -- 
> Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc. 
> is a member of the Code Aurora Forum,
> a Linux Foundation Collaborative Project.
> 
> 


Re: [PATCH] mm: oom: Fix race condition between oom_badness and do_exit of task

2018-03-09 Thread Kohli, Gaurav

On 3/9/2018 4:18 PM, Tetsuo Handa wrote:

> Kohli, Gaurav wrote:
>>> t->alloc_lock is still held when leaving find_lock_task_mm(), which means
>>> that t->mm != NULL. But nothing prevents t from setting t->mm = NULL at
>>> exit_mm() from do_exit() and calling exit_creds() from __put_task_struct(t)
>>> after task_unlock(t) is called. Seems difficult to trigger race window. Maybe
>>> something has preempted because oom_badness() becomes outside of RCU grace
>>> period upon leaving find_lock_task_mm() when called from proc_oom_score().
>> Hi Tetsuo,
>>
>> Yes it is not easy to reproduce seen twice till now and i agree with
>> your analysis. But David has already fixing this in different way,
>> So that also looks better to me:
>>
>> https://patchwork.kernel.org/patch/10265641/
>>
> Yes, I'm aware of that patch.
>
>> But if need to keep that code, So we have to bump up the task
>> reference that's only i can think of now.
> I don't think so, for I think it is safe to call
> has_capability_noaudit(p) with p->alloc_lock held.
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index f2e7dfb..4efcfb8 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -222,7 +222,6 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
>  	 */
>  	points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
>  		mm_pgtables_bytes(p->mm) / PAGE_SIZE;
> -	task_unlock(p);
>  
>  	/*
>  	 * Root processes get 3% bonus, just like the __vm_enough_memory()
> @@ -230,6 +229,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
>  	 */
>  	if (has_capability_noaudit(p, CAP_SYS_ADMIN))
>  		points -= (points * 3) / 100;
> +	task_unlock(p);

Earlier I thought of posting the same change, but could this create a
problem if there are sleeping calls in has_capability_noaudit()?

>  
>  	/* Normalize to oom_score_adj units */
>  	adj *= totalpages / 1000;


--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc. 
is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.



Re: [PATCH] mm: oom: Fix race condition between oom_badness and do_exit of task

2018-03-09 Thread Tetsuo Handa
Kohli, Gaurav wrote:
> > t->alloc_lock is still held when leaving find_lock_task_mm(), which means
> > that t->mm != NULL. But nothing prevents t from setting t->mm = NULL at
> > exit_mm() from do_exit() and calling exit_creds() from __put_task_struct(t)
> > after task_unlock(t) is called. Seems difficult to trigger race window. Maybe
> > something has preempted because oom_badness() becomes outside of RCU grace
> > period upon leaving find_lock_task_mm() when called from proc_oom_score().
> 
> Hi Tetsuo,
> 
> Yes it is not easy to reproduce seen twice till now and i agree with
> your analysis. But David has already fixing this in different way,
> So that also looks better to me:
> 
> https://patchwork.kernel.org/patch/10265641/
> 

Yes, I'm aware of that patch.

> But if need to keep that code, So we have to bump up the task
> reference that's only i can think of now.

I don't think so, for I think it is safe to call
has_capability_noaudit(p) with p->alloc_lock held.

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f2e7dfb..4efcfb8 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -222,7 +222,6 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
 	 */
 	points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
 		mm_pgtables_bytes(p->mm) / PAGE_SIZE;
-	task_unlock(p);
 
 	/*
 	 * Root processes get 3% bonus, just like the __vm_enough_memory()
@@ -230,6 +229,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
 	 */
 	if (has_capability_noaudit(p, CAP_SYS_ADMIN))
 		points -= (points * 3) / 100;
+	task_unlock(p);
 
 	/* Normalize to oom_score_adj units */
 	adj *= totalpages / 1000;
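
The arithmetic visible in the context lines of this diff can be sketched in
userspace C. This is only a minimal model under stated assumptions, not kernel
code: model_badness() and model_oom_score() are hypothetical names, the page
counts are passed in as plain numbers instead of being read from an mm, and
any clamping the real function applies to its return value is left out.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical userspace model of the scoring arithmetic shown in the
 * diff context. Names and parameters are illustrative, not kernel API.
 */
static long model_badness(long rss_pages, long swap_entries,
			  long pgtable_pages, bool has_sys_admin,
			  long oom_score_adj, long totalpages)
{
	long points = rss_pages + swap_entries + pgtable_pages;
	long adj = oom_score_adj;

	/* Root processes get a 3% bonus. */
	if (has_sys_admin)
		points -= (points * 3) / 100;

	/* Normalize to oom_score_adj units. */
	adj *= totalpages / 1000;
	points += adj;

	return points;
}

/* Models the scaling that proc_oom_score() applies before printing. */
static long model_oom_score(long badness, long totalpages)
{
	return badness * 1000 / totalpages;
}
```

With 1000000 total pages, a task using 100000 pages yields a badness of
100000 without CAP_SYS_ADMIN and 97000 with it, which model_oom_score()
scales to 100 and 97 respectively.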


Re: [PATCH] mm: oom: Fix race condition between oom_badness and do_exit of task

2018-03-08 Thread Tetsuo Handa
On 2018/03/08 13:51, Kohli, Gaurav wrote:
> On 3/8/2018 2:26 AM, David Rientjes wrote:
> 
>> On Wed, 7 Mar 2018, Gaurav Kohli wrote:
>>
>>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>>> index 6fd9773..5f4cc4b 100644
>>> --- a/mm/oom_kill.c
>>> +++ b/mm/oom_kill.c
>>> @@ -114,9 +114,11 @@ struct task_struct *find_lock_task_mm(struct task_struct *p)
>>>  	for_each_thread(p, t) {
>>>  		task_lock(t);
>>> +		get_task_struct(t);
>>>  		if (likely(t->mm))
>>>  			goto found;
>>>  		task_unlock(t);
>>> +		put_task_struct(t);
>>>  	}
>>>  	t = NULL;
>>>  found:
>> We hold rcu_read_lock() here, so perhaps only do get_task_struct() before
>> doing rcu_read_unlock() and we have a non-NULL t?
> 
> Here rcu_read_lock will not help, as our task may change due to below algo:
> 
> for_each_thread(p, t) {
> 	task_lock(t);
> +	get_task_struct(t);
> 	if (likely(t->mm))
> 		goto found;
> 	task_unlock(t);
> +	put_task_struct(t)
> 
> 
> So only we can increase usage counter here only at the current task.

static int proc_single_show(struct seq_file *m, void *v)
{
	struct inode *inode = m->private;
	struct pid_namespace *ns;
	struct pid *pid;
	struct task_struct *task;
	int ret;

	ns = inode->i_sb->s_fs_info;
	pid = proc_pid(inode);
	task = get_pid_task(pid, PIDTYPE_PID); /* get_task_struct() is called upon success. */
	if (!task)
		return -ESRCH;

	ret = PROC_I(inode)->op.proc_show(m, ns, pid, task);

	put_task_struct(task);
	return ret;
}

static int proc_oom_score(struct seq_file *m, struct pid_namespace *ns,
			  struct pid *pid, struct task_struct *task)
{
	unsigned long totalpages = totalram_pages + total_swap_pages;
	unsigned long points = 0;

	/* task->usage > 0 due to proc_single_show() */
	points = oom_badness(task, NULL, NULL, totalpages) *
					1000 / totalpages;
	seq_printf(m, "%lu\n", points);

	return 0;
}

struct task_struct *find_lock_task_mm(struct task_struct *p) /* p->usage > 0 */
{
	struct task_struct *t;

	rcu_read_lock();

	for_each_thread(p, t) {
		task_lock(t);
		if (likely(t->mm))
			goto found;
		task_unlock(t);
	}
	t = NULL;
found:
	rcu_read_unlock();

	return t; /* t->usage > 0 even if t != p because t->mm != NULL */
}

t->alloc_lock is still held when leaving find_lock_task_mm(), which means
that t->mm != NULL. But nothing prevents t from setting t->mm = NULL at
exit_mm() from do_exit() and calling exit_creds() from __put_task_struct(t)
after task_unlock(t) is called. Seems difficult to trigger race window. Maybe
something has preempted because oom_badness() becomes outside of RCU grace
period upon leaving find_lock_task_mm() when called from proc_oom_score().
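
The ordering described above can be replayed in a small single-threaded
userspace model. This is a sketch under loud assumptions, not kernel code:
struct model_task, model_try_exit() and cred_valid_at_check() are hypothetical
names, the exit path is collapsed into one step, and RCU-deferred freeing is
not modelled at all. It only illustrates why the credentials may be gone once
the task lock has been dropped, and why they stay valid while the lock is held.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a thread's task_struct. */
struct model_task {
	int usage;		/* models t->usage */
	bool has_mm;		/* models t->mm != NULL */
	bool cred_valid;	/* cleared when the model releases the credentials */
	bool locked;		/* models t->alloc_lock being held */
};

/*
 * Models the exit path from the analysis: clearing t->mm needs the task
 * lock, and the final put afterwards releases the credentials.
 * Returns false while the lock holder keeps the exiting thread waiting.
 */
static bool model_try_exit(struct model_task *t)
{
	if (t->locked)
		return false;
	t->has_mm = false;
	if (--t->usage == 0)
		t->cred_valid = false;
	return true;
}

/*
 * Replays the worst-case interleaving and reports whether the credentials
 * are still valid at the point where the capability check would read them.
 */
static bool cred_valid_at_check(bool unlock_before_check)
{
	struct model_task t = {
		.usage = 1, .has_mm = true, .cred_valid = true, .locked = false,
	};
	bool exited;
	bool valid;

	t.locked = true;		/* the lookup returned t with its lock held */
	exited = model_try_exit(&t);	/* the exiting thread is held off */
	assert(!exited);

	if (unlock_before_check) {
		t.locked = false;	/* lock dropped before the check */
		model_try_exit(&t);	/* the exiting thread wins the race */
	}

	valid = t.cred_valid;		/* the capability check happens here */
	t.locked = false;
	return valid;
}
```

In this model cred_valid_at_check(true) returns false, which corresponds to the
window described above, while cred_valid_at_check(false) returns true because
the exiting thread cannot get past the lock until the check is done.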


Re: [PATCH] mm: oom: Fix race condition between oom_badness and do_exit of task

2018-03-07 Thread Kohli, Gaurav

On 3/8/2018 2:26 AM, David Rientjes wrote:

> On Wed, 7 Mar 2018, Gaurav Kohli wrote:
>
>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>> index 6fd9773..5f4cc4b 100644
>> --- a/mm/oom_kill.c
>> +++ b/mm/oom_kill.c
>> @@ -114,9 +114,11 @@ struct task_struct *find_lock_task_mm(struct task_struct *p)
>>  
>>  	for_each_thread(p, t) {
>>  		task_lock(t);
>> +		get_task_struct(t);
>>  		if (likely(t->mm))
>>  			goto found;
>>  		task_unlock(t);
>> +		put_task_struct(t);
>>  	}
>>  	t = NULL;
>>  found:
> We hold rcu_read_lock() here, so perhaps only do get_task_struct() before
> doing rcu_read_unlock() and we have a non-NULL t?

Here rcu_read_lock will not help, as our task may change due to the loop below:

	for_each_thread(p, t) {
		task_lock(t);
+		get_task_struct(t);
		if (likely(t->mm))
			goto found;
		task_unlock(t);
+		put_task_struct(t);

So we can only increase the usage counter here, on the current task.

I have seen your new patch; it seems valid to me and it will resolve our issue.
Thanks for the support.

Regards

Gaurav




--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc. 
is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.



Re: [PATCH] mm: oom: Fix race condition between oom_badness and do_exit of task

2018-03-07 Thread David Rientjes
On Wed, 7 Mar 2018, Gaurav Kohli wrote:

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 6fd9773..5f4cc4b 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -114,9 +114,11 @@ struct task_struct *find_lock_task_mm(struct task_struct *p)
>  
>  	for_each_thread(p, t) {
>  		task_lock(t);
> +		get_task_struct(t);
>  		if (likely(t->mm))
>  			goto found;
>  		task_unlock(t);
> +		put_task_struct(t);
>  	}
>  	t = NULL;
>  found:

We hold rcu_read_lock() here, so perhaps only do get_task_struct() before 
doing rcu_read_unlock() and we have a non-NULL t?

> @@ -191,6 +193,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
>  			test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
>  			in_vfork(p)) {
>  		task_unlock(p);
> +		put_task_struct(p);
>  		return 0;
>  	}
>  
> @@ -208,7 +211,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
>  	 */
>  	if (has_capability_noaudit(p, CAP_SYS_ADMIN))
>  		points -= (points * 3) / 100;
> -
> +	put_task_struct(p);
>  	/* Normalize to oom_score_adj units */
>  	adj *= totalpages / 1000;
>  	points += adj;

This fixes up oom_badness(), but there are other users of 
find_lock_task_mm() in the oom killer as well as other subsystems.
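
The alternative raised above, taking a reference only for the thread that is
actually returned, can be sketched as a userspace model. Everything here is an
assumption made for illustration: struct model_thread, model_find_lock_get()
and model_unlock_put() are hypothetical names, a plain array stands in for the
thread list, and RCU is not modelled. The sketch mainly shows the contract
such a change would put on every caller: whoever receives a non-NULL result
has to unlock it and drop the reference.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for one thread of a thread group. */
struct model_thread {
	int usage;	/* models the task reference count */
	bool has_mm;	/* models t->mm != NULL */
	bool locked;	/* models the task lock being held */
};

/*
 * Walks the threads, returning the first one that still has an mm.
 * The returned thread is locked and carries one extra reference;
 * threads that are skipped are left unlocked with their count untouched.
 */
static struct model_thread *model_find_lock_get(struct model_thread *threads,
						size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		struct model_thread *t = &threads[i];

		t->locked = true;
		if (t->has_mm) {
			t->usage++;	/* reference only for the thread we return */
			return t;
		}
		t->locked = false;
	}
	return NULL;
}

/* What each caller would need to do once it is finished with the thread. */
static void model_unlock_put(struct model_thread *t)
{
	t->locked = false;
	t->usage--;
}
```

Compared with taking and dropping a reference on every iteration, this variant
touches the count once per successful lookup, but it only helps if all callers
are converted to pair the lookup with the matching put.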

