Re: [PATCH v3 3/6] memcg: Simplify mem_cgroup_force_empty_list error handling

2012-11-14 Thread Johannes Weiner
On Wed, Nov 14, 2012 at 02:59:30PM +0100, Michal Hocko wrote:
> On Tue 13-11-12 16:10:41, Johannes Weiner wrote:
> > Would it make sense to stick a wait_on_page_locked() in there just so
> > that we don't busy spin on a page under migration/reclaim?
> 
> Hmm, this would also mean that get_page_unless_zero would fail as well
> and so we would schedule in mem_cgroup_force_empty_list. It is true that
> there might be no other runnable task so we can busy loop so yes this
> would help. Care to cook the patch?

Eventually get_page_unless_zero() would fail but we could still spin
on a page while it's off the LRU and migration performs writeback on
it e.g.  cond_resched() does not necessarily schedule just because
there is another runnable task, I think, it's voluntary preemption
when the task needs rescheduling anyway, not yield.

Maybe not worth bothering...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 3/6] memcg: Simplify mem_cgroup_force_empty_list error handling

2012-11-14 Thread Michal Hocko
On Tue 13-11-12 16:10:41, Johannes Weiner wrote:
> On Tue, Oct 30, 2012 at 11:35:59AM +0100, Michal Hocko wrote:
> > On Mon 29-10-12 15:00:22, Andrew Morton wrote:
> > > On Mon, 29 Oct 2012 17:58:45 +0400
> > > Glauber Costa  wrote:
> > > 
> > > > > + * move charges to its parent or the root cgroup if the group has no
> > > > > + * parent (aka use_hierarchy==0).
> > > > > + * Although this might fail (get_page_unless_zero, isolate_lru_page 
> > > > > or
> > > > > + * mem_cgroup_move_account fails) the failure is always temporary and
> > > > > + * it signals a race with a page removal/uncharge or migration. In 
> > > > > the
> > > > > + * first case the page is on the way out and it will vanish from the 
> > > > > LRU
> > > > > + * on the next attempt and the call should be retried later.
> > > > > + * Isolation from the LRU fails only if page has been isolated from
> > > > > + * the LRU since we looked at it and that usually means either global
> > > > > + * reclaim or migration going on. The page will either get back to 
> > > > > the
> > > > > + * LRU or vanish.
> > > > 
> > > > I just wonder for how long can it go in the worst case?
> > > 
> > > If the kernel is uniprocessor and the caller is SCHED_FIFO: ad infinitum!
> > 
> > You are right, if the rmdir (resp. echo > force_empty) at SCHED_FIFO
> > races with put_page (on a shared page) which gets preempted after
> > put_page_testzero and before __page_cache_release then we are screwed:
> > 
> > put_page(page)
> >   put_page_testzero
> >> LRU>
> > mem_cgroup_force_empty_list
> >   page = list_entry(list->prev, struct page, lru);
> >   mem_cgroup_move_parent(page)
> > get_page_unless_zero 
> >   cond_resched() 
> > 
> > The race window is really small but it is definitely possible. I am not
> > happy about this state and it should be probably mentioned in the
> > patch description but I do not see any way around (except for hacks like
> > sched_setscheduler for the current which is, ehm...) and still keep
> > do_not_fail contract here.
> > 
> > Can we consider this as a corner case (it is much easier to kill a
> > machine with SCHED_FIFO than this anyway) or the concern is really
> > strong and we should come with a solution before this can get merged?
> 
> Wouldn't the much bigger race window be reclaim having the page
> isolated and SCHED_FIFO preventing it from putback?

We wouldn't see the page on the LRU then, right?

> I also don't think this is a new class of problem, though.
> 
> Would it make sense to stick a wait_on_page_locked() in there just so
> that we don't busy spin on a page under migration/reclaim?

Hmm, this would also mean that get_page_unless_zero would fail as well
and so we would schedule in mem_cgroup_force_empty_list. It is true that
there might be no other runnable task so we can busy loop so yes this
would help. Care to cook the patch?

Thanks
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 3/6] memcg: Simplify mem_cgroup_force_empty_list error handling

2012-11-14 Thread Michal Hocko
On Tue 13-11-12 16:10:41, Johannes Weiner wrote:
 On Tue, Oct 30, 2012 at 11:35:59AM +0100, Michal Hocko wrote:
  On Mon 29-10-12 15:00:22, Andrew Morton wrote:
   On Mon, 29 Oct 2012 17:58:45 +0400
   Glauber Costa glom...@parallels.com wrote:
   
 + * move charges to its parent or the root cgroup if the group has no
 + * parent (aka use_hierarchy==0).
 + * Although this might fail (get_page_unless_zero, isolate_lru_page 
 or
 + * mem_cgroup_move_account fails) the failure is always temporary and
 + * it signals a race with a page removal/uncharge or migration. In 
 the
 + * first case the page is on the way out and it will vanish from the 
 LRU
 + * on the next attempt and the call should be retried later.
 + * Isolation from the LRU fails only if page has been isolated from
 + * the LRU since we looked at it and that usually means either global
 + * reclaim or migration going on. The page will either get back to 
 the
 + * LRU or vanish.

I just wonder for how long can it go in the worst case?
   
   If the kernel is uniprocessor and the caller is SCHED_FIFO: ad infinitum!
  
  You are right, if the rmdir (resp. echo  force_empty) at SCHED_FIFO
  races with put_page (on a shared page) which gets preempted after
  put_page_testzero and before __page_cache_release then we are screwed:
  
  put_page(page)
put_page_testzero
preempted and page still on 
  LRU
  mem_cgroup_force_empty_list
page = list_entry(list-prev, struct page, lru);
mem_cgroup_move_parent(page)
  get_page_unless_zero fails
cond_resched() scheduled again
  
  The race window is really small but it is definitely possible. I am not
  happy about this state and it should be probably mentioned in the
  patch description but I do not see any way around (except for hacks like
  sched_setscheduler for the current which is, ehm...) and still keep
  do_not_fail contract here.
  
  Can we consider this as a corner case (it is much easier to kill a
  machine with SCHED_FIFO than this anyway) or the concern is really
  strong and we should come with a solution before this can get merged?
 
 Wouldn't the much bigger race window be reclaim having the page
 isolated and SCHED_FIFO preventing it from putback?

We wouldn't see the page on the LRU then, right?

 I also don't think this is a new class of problem, though.
 
 Would it make sense to stick a wait_on_page_locked() in there just so
 that we don't busy spin on a page under migration/reclaim?

Hmm, this would also mean that get_page_unless_zero would fail as well
and so we would schedule in mem_cgroup_force_empty_list. It is true that
there might be no other runnable task so we can busy loop so yes this
would help. Care to cook the patch?

Thanks
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 3/6] memcg: Simplify mem_cgroup_force_empty_list error handling

2012-11-14 Thread Johannes Weiner
On Wed, Nov 14, 2012 at 02:59:30PM +0100, Michal Hocko wrote:
 On Tue 13-11-12 16:10:41, Johannes Weiner wrote:
  Would it make sense to stick a wait_on_page_locked() in there just so
  that we don't busy spin on a page under migration/reclaim?
 
 Hmm, this would also mean that get_page_unless_zero would fail as well
 and so we would schedule in mem_cgroup_force_empty_list. It is true that
 there might be no other runnable task so we can busy loop so yes this
 would help. Care to cook the patch?

Eventually get_page_unless_zero() would fail but we could still spin
on a page while it's off the LRU and migration performs writeback on
it e.g.  cond_resched() does not necessarily schedule just because
there is another runnable task, I think, it's voluntary preemption
when the task needs rescheduling anyway, not yield.

Maybe not worth bothering...
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 3/6] memcg: Simplify mem_cgroup_force_empty_list error handling

2012-11-13 Thread Johannes Weiner
On Tue, Oct 30, 2012 at 11:35:59AM +0100, Michal Hocko wrote:
> On Mon 29-10-12 15:00:22, Andrew Morton wrote:
> > On Mon, 29 Oct 2012 17:58:45 +0400
> > Glauber Costa  wrote:
> > 
> > > > + * move charges to its parent or the root cgroup if the group has no
> > > > + * parent (aka use_hierarchy==0).
> > > > + * Although this might fail (get_page_unless_zero, isolate_lru_page or
> > > > + * mem_cgroup_move_account fails) the failure is always temporary and
> > > > + * it signals a race with a page removal/uncharge or migration. In the
> > > > + * first case the page is on the way out and it will vanish from the 
> > > > LRU
> > > > + * on the next attempt and the call should be retried later.
> > > > + * Isolation from the LRU fails only if page has been isolated from
> > > > + * the LRU since we looked at it and that usually means either global
> > > > + * reclaim or migration going on. The page will either get back to the
> > > > + * LRU or vanish.
> > > 
> > > I just wonder for how long can it go in the worst case?
> > 
> > If the kernel is uniprocessor and the caller is SCHED_FIFO: ad infinitum!
> 
> You are right, if the rmdir (resp. echo > force_empty) at SCHED_FIFO
> races with put_page (on a shared page) which gets preempted after
> put_page_testzero and before __page_cache_release then we are screwed:
> 
>   put_page(page)
> put_page_testzero
>  LRU>
> mem_cgroup_force_empty_list
>   page = list_entry(list->prev, struct page, lru);
>   mem_cgroup_move_parent(page)
> get_page_unless_zero 
>   cond_resched() 
> 
> The race window is really small but it is definitely possible. I am not
> happy about this state and it should be probably mentioned in the
> patch description but I do not see any way around (except for hacks like
> sched_setscheduler for the current which is, ehm...) and still keep
> do_not_fail contract here.
> 
> Can we consider this as a corner case (it is much easier to kill a
> machine with SCHED_FIFO than this anyway) or the concern is really
> strong and we should come with a solution before this can get merged?

Wouldn't the much bigger race window be reclaim having the page
isolated and SCHED_FIFO preventing it from putback?

I also don't think this is a new class of problem, though.

Would it make sense to stick a wait_on_page_locked() in there just so
that we don't busy spin on a page under migration/reclaim?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 3/6] memcg: Simplify mem_cgroup_force_empty_list error handling

2012-11-13 Thread Johannes Weiner
On Tue, Oct 30, 2012 at 11:35:59AM +0100, Michal Hocko wrote:
 On Mon 29-10-12 15:00:22, Andrew Morton wrote:
  On Mon, 29 Oct 2012 17:58:45 +0400
  Glauber Costa glom...@parallels.com wrote:
  
+ * move charges to its parent or the root cgroup if the group has no
+ * parent (aka use_hierarchy==0).
+ * Although this might fail (get_page_unless_zero, isolate_lru_page or
+ * mem_cgroup_move_account fails) the failure is always temporary and
+ * it signals a race with a page removal/uncharge or migration. In the
+ * first case the page is on the way out and it will vanish from the 
LRU
+ * on the next attempt and the call should be retried later.
+ * Isolation from the LRU fails only if page has been isolated from
+ * the LRU since we looked at it and that usually means either global
+ * reclaim or migration going on. The page will either get back to the
+ * LRU or vanish.
   
   I just wonder for how long can it go in the worst case?
  
  If the kernel is uniprocessor and the caller is SCHED_FIFO: ad infinitum!
 
 You are right, if the rmdir (resp. echo  force_empty) at SCHED_FIFO
 races with put_page (on a shared page) which gets preempted after
 put_page_testzero and before __page_cache_release then we are screwed:
 
   put_page(page)
 put_page_testzero
 preempted and page still on 
 LRU
 mem_cgroup_force_empty_list
   page = list_entry(list-prev, struct page, lru);
   mem_cgroup_move_parent(page)
 get_page_unless_zero fails
   cond_resched() scheduled again
 
 The race window is really small but it is definitely possible. I am not
 happy about this state and it should be probably mentioned in the
 patch description but I do not see any way around (except for hacks like
 sched_setscheduler for the current which is, ehm...) and still keep
 do_not_fail contract here.
 
 Can we consider this as a corner case (it is much easier to kill a
 machine with SCHED_FIFO than this anyway) or the concern is really
 strong and we should come with a solution before this can get merged?

Wouldn't the much bigger race window be reclaim having the page
isolated and SCHED_FIFO preventing it from putback?

I also don't think this is a new class of problem, though.

Would it make sense to stick a wait_on_page_locked() in there just so
that we don't busy spin on a page under migration/reclaim?
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 3/6] memcg: Simplify mem_cgroup_force_empty_list error handling

2012-10-31 Thread Andrew Morton
On Tue, 30 Oct 2012 11:35:59 +0100
Michal Hocko  wrote:

> > If the kernel is uniprocessor and the caller is SCHED_FIFO: ad infinitum!
>
> ...
>
> Can we consider this as a corner case (it is much easier to kill a
> machine with SCHED_FIFO than this anyway) or the concern is really
> strong and we should come with a solution before this can get merged?

Sure.  SCHED_FIFO can be used to permanently block all kernel threads
which pretty quickly results in a very sick kernel.  My observation was
just a general moan about the SCHED_FIFO wontfix problem :)

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 3/6] memcg: Simplify mem_cgroup_force_empty_list error handling

2012-10-31 Thread Andrew Morton
On Tue, 30 Oct 2012 11:35:59 +0100
Michal Hocko mho...@suse.cz wrote:

  If the kernel is uniprocessor and the caller is SCHED_FIFO: ad infinitum!

 ...

 Can we consider this as a corner case (it is much easier to kill a
 machine with SCHED_FIFO than this anyway) or the concern is really
 strong and we should come with a solution before this can get merged?

Sure.  SCHED_FIFO can be used to permanently block all kernel threads
which pretty quickly results in a very sick kernel.  My observation was
just a general moan about the SCHED_FIFO wontfix problem :)

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 3/6] memcg: Simplify mem_cgroup_force_empty_list error handling

2012-10-30 Thread Michal Hocko
On Mon 29-10-12 15:00:22, Andrew Morton wrote:
> On Mon, 29 Oct 2012 17:58:45 +0400
> Glauber Costa  wrote:
> 
> > > + * move charges to its parent or the root cgroup if the group has no
> > > + * parent (aka use_hierarchy==0).
> > > + * Although this might fail (get_page_unless_zero, isolate_lru_page or
> > > + * mem_cgroup_move_account fails) the failure is always temporary and
> > > + * it signals a race with a page removal/uncharge or migration. In the
> > > + * first case the page is on the way out and it will vanish from the LRU
> > > + * on the next attempt and the call should be retried later.
> > > + * Isolation from the LRU fails only if page has been isolated from
> > > + * the LRU since we looked at it and that usually means either global
> > > + * reclaim or migration going on. The page will either get back to the
> > > + * LRU or vanish.
> > 
> > I just wonder for how long can it go in the worst case?
> 
> If the kernel is uniprocessor and the caller is SCHED_FIFO: ad infinitum!

You are right, if the rmdir (resp. echo > force_empty) at SCHED_FIFO
races with put_page (on a shared page) which gets preempted after
put_page_testzero and before __page_cache_release then we are screwed:

put_page(page)
  put_page_testzero
  
mem_cgroup_force_empty_list
  page = list_entry(list->prev, struct page, lru);
  mem_cgroup_move_parent(page)
get_page_unless_zero 
  cond_resched() 

The race window is really small but it is definitely possible. I am not
happy about this state and it should be probably mentioned in the
patch description but I do not see any way around (except for hacks like
sched_setscheduler for the current which is, ehm...) and still keep
do_not_fail contract here.

Can we consider this as a corner case (it is much easier to kill a
machine with SCHED_FIFO than this anyway) or the concern is really
strong and we should come with a solution before this can get merged?

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 3/6] memcg: Simplify mem_cgroup_force_empty_list error handling

2012-10-30 Thread Michal Hocko
On Mon 29-10-12 15:00:22, Andrew Morton wrote:
 On Mon, 29 Oct 2012 17:58:45 +0400
 Glauber Costa glom...@parallels.com wrote:
 
   + * move charges to its parent or the root cgroup if the group has no
   + * parent (aka use_hierarchy==0).
   + * Although this might fail (get_page_unless_zero, isolate_lru_page or
   + * mem_cgroup_move_account fails) the failure is always temporary and
   + * it signals a race with a page removal/uncharge or migration. In the
   + * first case the page is on the way out and it will vanish from the LRU
   + * on the next attempt and the call should be retried later.
   + * Isolation from the LRU fails only if page has been isolated from
   + * the LRU since we looked at it and that usually means either global
   + * reclaim or migration going on. The page will either get back to the
   + * LRU or vanish.
  
  I just wonder for how long can it go in the worst case?
 
 If the kernel is uniprocessor and the caller is SCHED_FIFO: ad infinitum!

You are right, if the rmdir (resp. echo  force_empty) at SCHED_FIFO
races with put_page (on a shared page) which gets preempted after
put_page_testzero and before __page_cache_release then we are screwed:

put_page(page)
  put_page_testzero
  preempted and page still on 
LRU
mem_cgroup_force_empty_list
  page = list_entry(list-prev, struct page, lru);
  mem_cgroup_move_parent(page)
get_page_unless_zero fails
  cond_resched() scheduled again

The race window is really small but it is definitely possible. I am not
happy about this state and it should be probably mentioned in the
patch description but I do not see any way around (except for hacks like
sched_setscheduler for the current which is, ehm...) and still keep
do_not_fail contract here.

Can we consider this as a corner case (it is much easier to kill a
machine with SCHED_FIFO than this anyway) or the concern is really
strong and we should come with a solution before this can get merged?

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 3/6] memcg: Simplify mem_cgroup_force_empty_list error handling

2012-10-29 Thread Andrew Morton
On Mon, 29 Oct 2012 17:58:45 +0400
Glauber Costa  wrote:

> > + * move charges to its parent or the root cgroup if the group has no
> > + * parent (aka use_hierarchy==0).
> > + * Although this might fail (get_page_unless_zero, isolate_lru_page or
> > + * mem_cgroup_move_account fails) the failure is always temporary and
> > + * it signals a race with a page removal/uncharge or migration. In the
> > + * first case the page is on the way out and it will vanish from the LRU
> > + * on the next attempt and the call should be retried later.
> > + * Isolation from the LRU fails only if page has been isolated from
> > + * the LRU since we looked at it and that usually means either global
> > + * reclaim or migration going on. The page will either get back to the
> > + * LRU or vanish.
> 
> I just wonder for how long can it go in the worst case?

If the kernel is uniprocessor and the caller is SCHED_FIFO: ad infinitum!
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 3/6] memcg: Simplify mem_cgroup_force_empty_list error handling

2012-10-29 Thread Glauber Costa
>>> + * move charges to its parent or the root cgroup if the group has no
>>> + * parent (aka use_hierarchy==0).
>>> + * Although this might fail (get_page_unless_zero, isolate_lru_page or
>>> + * mem_cgroup_move_account fails) the failure is always temporary and
>>> + * it signals a race with a page removal/uncharge or migration. In the
>>> + * first case the page is on the way out and it will vanish from the LRU
>>> + * on the next attempt and the call should be retried later.
>>> + * Isolation from the LRU fails only if page has been isolated from
>>> + * the LRU since we looked at it and that usually means either global
>>> + * reclaim or migration going on. The page will either get back to the
>>> + * LRU or vanish.
>>
>> I just wonder for how long can it go in the worst case?
>  
> That's a good question and to be honest I have no idea. The point is
> that it will terminate eventually and that the group is on the way out
> so the time to complete the removal is not a big deal IMHO. We had
> basically similar situation previously when we would need to repeat
> rmdir loop on EBUSY. The only change is that we do not have to retry
> anymore.
> 
> So the key point is to check whether my assumption about temporarily is
> correct and that we cannot block the rest of the kernel/userspace to
> proceed even though we are waiting for finalization. I believe this is
> true but... (last famous words?)
> 
At least for me, it seems that this will hold.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 3/6] memcg: Simplify mem_cgroup_force_empty_list error handling

2012-10-29 Thread Michal Hocko
On Mon 29-10-12 17:58:45, Glauber Costa wrote:
> 
> > 
> > Changes since v1
> > - use kerndoc
> > - be more specific about mem_cgroup_move_parent possible failures
> > 
> > Signed-off-by: Michal Hocko 
> > Reviewed-by: Tejun Heo 
> Reviewed-by: Glauber Costa 

Thanks!

> > + * move charges to its parent or the root cgroup if the group has no
> > + * parent (aka use_hierarchy==0).
> > + * Although this might fail (get_page_unless_zero, isolate_lru_page or
> > + * mem_cgroup_move_account fails) the failure is always temporary and
> > + * it signals a race with a page removal/uncharge or migration. In the
> > + * first case the page is on the way out and it will vanish from the LRU
> > + * on the next attempt and the call should be retried later.
> > + * Isolation from the LRU fails only if page has been isolated from
> > + * the LRU since we looked at it and that usually means either global
> > + * reclaim or migration going on. The page will either get back to the
> > + * LRU or vanish.
> 
> I just wonder for how long can it go in the worst case?
 
That's a good question and to be honest I have no idea. The point is
that it will terminate eventually and that the group is on the way out
so the time to complete the removal is not a big deal IMHO. We had
basically similar situation previously when we would need to repeat
rmdir loop on EBUSY. The only change is that we do not have to retry
anymore.

So the key point is to check whether my assumption about temporarily is
correct and that we cannot block the rest of the kernel/userspace to
proceed even though we are waiting for finalization. I believe this is
true but... (last famous words?)

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 3/6] memcg: Simplify mem_cgroup_force_empty_list error handling

2012-10-29 Thread Glauber Costa

> 
> Changes since v1
> - use kerndoc
> - be more specific about mem_cgroup_move_parent possible failures
> 
> Signed-off-by: Michal Hocko 
> Reviewed-by: Tejun Heo 
Reviewed-by: Glauber Costa 

> + * move charges to its parent or the root cgroup if the group has no
> + * parent (aka use_hierarchy==0).
> + * Although this might fail (get_page_unless_zero, isolate_lru_page or
> + * mem_cgroup_move_account fails) the failure is always temporary and
> + * it signals a race with a page removal/uncharge or migration. In the
> + * first case the page is on the way out and it will vanish from the LRU
> + * on the next attempt and the call should be retried later.
> + * Isolation from the LRU fails only if page has been isolated from
> + * the LRU since we looked at it and that usually means either global
> + * reclaim or migration going on. The page will either get back to the
> + * LRU or vanish.

I just wonder for how long can it go in the worst case?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 3/6] memcg: Simplify mem_cgroup_force_empty_list error handling

2012-10-29 Thread Andrew Morton
On Mon, 29 Oct 2012 17:58:45 +0400
Glauber Costa glom...@parallels.com wrote:

  + * move charges to its parent or the root cgroup if the group has no
  + * parent (aka use_hierarchy==0).
  + * Although this might fail (get_page_unless_zero, isolate_lru_page or
  + * mem_cgroup_move_account fails) the failure is always temporary and
  + * it signals a race with a page removal/uncharge or migration. In the
  + * first case the page is on the way out and it will vanish from the LRU
  + * on the next attempt and the call should be retried later.
  + * Isolation from the LRU fails only if page has been isolated from
  + * the LRU since we looked at it and that usually means either global
  + * reclaim or migration going on. The page will either get back to the
  + * LRU or vanish.
 
 I just wonder for how long can it go in the worst case?

If the kernel is uniprocessor and the caller is SCHED_FIFO: ad infinitum!
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 3/6] memcg: Simplify mem_cgroup_force_empty_list error handling

2012-10-29 Thread Glauber Costa

 
 Changes since v1
 - use kerndoc
 - be more specific about mem_cgroup_move_parent possible failures
 
 Signed-off-by: Michal Hocko mho...@suse.cz
 Reviewed-by: Tejun Heo t...@kernel.org
Reviewed-by: Glauber Costa glom...@parallels.com

 + * move charges to its parent or the root cgroup if the group has no
 + * parent (aka use_hierarchy==0).
 + * Although this might fail (get_page_unless_zero, isolate_lru_page or
 + * mem_cgroup_move_account fails) the failure is always temporary and
 + * it signals a race with a page removal/uncharge or migration. In the
 + * first case the page is on the way out and it will vanish from the LRU
 + * on the next attempt and the call should be retried later.
 + * Isolation from the LRU fails only if page has been isolated from
 + * the LRU since we looked at it and that usually means either global
 + * reclaim or migration going on. The page will either get back to the
 + * LRU or vanish.

I just wonder for how long can it go in the worst case?

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 3/6] memcg: Simplify mem_cgroup_force_empty_list error handling

2012-10-29 Thread Michal Hocko
On Mon 29-10-12 17:58:45, Glauber Costa wrote:
 
  
  Changes since v1
  - use kerndoc
  - be more specific about mem_cgroup_move_parent possible failures
  
  Signed-off-by: Michal Hocko mho...@suse.cz
  Reviewed-by: Tejun Heo t...@kernel.org
 Reviewed-by: Glauber Costa glom...@parallels.com

Thanks!

  + * move charges to its parent or the root cgroup if the group has no
  + * parent (aka use_hierarchy==0).
  + * Although this might fail (get_page_unless_zero, isolate_lru_page or
  + * mem_cgroup_move_account fails) the failure is always temporary and
  + * it signals a race with a page removal/uncharge or migration. In the
  + * first case the page is on the way out and it will vanish from the LRU
  + * on the next attempt and the call should be retried later.
  + * Isolation from the LRU fails only if page has been isolated from
  + * the LRU since we looked at it and that usually means either global
  + * reclaim or migration going on. The page will either get back to the
  + * LRU or vanish.
 
 I just wonder for how long can it go in the worst case?
 
That's a good question and to be honest I have no idea. The point is
that it will terminate eventually and that the group is on the way out
so the time to complete the removal is not a big deal IMHO. We had
basically similar situation previously when we would need to repeat
rmdir loop on EBUSY. The only change is that we do not have to retry
anymore.

So the key point is to check whether my assumption about temporarily is
correct and that we cannot block the rest of the kernel/userspace to
proceed even though we are waiting for finalization. I believe this is
true but... (last famous words?)

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 3/6] memcg: Simplify mem_cgroup_force_empty_list error handling

2012-10-29 Thread Glauber Costa
 + * move charges to its parent or the root cgroup if the group has no
 + * parent (aka use_hierarchy==0).
 + * Although this might fail (get_page_unless_zero, isolate_lru_page or
 + * mem_cgroup_move_account fails) the failure is always temporary and
 + * it signals a race with a page removal/uncharge or migration. In the
 + * first case the page is on the way out and it will vanish from the LRU
 + * on the next attempt and the call should be retried later.
 + * Isolation from the LRU fails only if page has been isolated from
 + * the LRU since we looked at it and that usually means either global
 + * reclaim or migration going on. The page will either get back to the
 + * LRU or vanish.

 I just wonder for how long can it go in the worst case?
  
 That's a good question and to be honest I have no idea. The point is
 that it will terminate eventually and that the group is on the way out
 so the time to complete the removal is not a big deal IMHO. We had
 basically similar situation previously when we would need to repeat
 rmdir loop on EBUSY. The only change is that we do not have to retry
 anymore.
 
 So the key point is to check whether my assumption about temporarily is
 correct and that we cannot block the rest of the kernel/userspace to
 proceed even though we are waiting for finalization. I believe this is
 true but... (last famous words?)
 
At least for me, it seems that this will hold.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v3 3/6] memcg: Simplify mem_cgroup_force_empty_list error handling

2012-10-26 Thread Michal Hocko
mem_cgroup_force_empty_list currently tries to remove all pages from
the given LRU. To prevent from temoporary failures (EBUSY returned by
mem_cgroup_move_parent) it uses a margin to the current LRU pages and
returns the true if there are still some pages left on the list.

If we consider that mem_cgroup_move_parent fails only when it is racing
with somebody else removing (uncharging) the page or when the page is
migrated then it is obvious that all those failures are only temporal
and so we can safely retry later.
Let's get rid of the safety margin and make the loop really wait for
the empty LRU. The caller should still make sure that all charges have
been removed from the res_counter because mem_cgroup_replace_page_cache
might add a page to the LRU after the list_empty check (it doesn't touch
res_counter though).
This catches most of the cases except for shmem which might call
mem_cgroup_replace_page_cache with a page which is not charged and on
the LRU yet but this was the case also without this patch. In order to
fix this we need a guarantee that try_get_mem_cgroup_from_page falls
back to the current mm's cgroup so it needs css_tryget to fail. This
will be fixed up in a later patch because it needs a help from cgroup
core (pre_destroy has to be called after css is cleared).

Although mem_cgroup_pre_destroy can still fail (if a new task or a new
sub-group appears) there is no reason to retry pre_destroy callback from
the cgroup core. This means that __DEPRECATED_clear_css_refs has lost
its meaning and it can be removed.

Changes since v2
- remove __DEPRECATED_clear_css_refs

Changes since v1
- use kerndoc
- be more specific about mem_cgroup_move_parent possible failures

Signed-off-by: Michal Hocko 
Reviewed-by: Tejun Heo 
---
 mm/memcontrol.c |   76 +++
 1 file changed, 48 insertions(+), 28 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 916132a..5a1d584 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2702,10 +2702,27 @@ out:
return ret;
 }
 
-/*
- * move charges to its parent.
+/**
+ * mem_cgroup_move_parent - moves page to the parent group
+ * @page: the page to move
+ * @pc: page_cgroup of the page
+ * @child: page's cgroup
+ *
+ * move charges to its parent or the root cgroup if the group has no
+ * parent (aka use_hierarchy==0).
+ * Although this might fail (get_page_unless_zero, isolate_lru_page or
+ * mem_cgroup_move_account fails) the failure is always temporary and
+ * it signals a race with a page removal/uncharge or migration. In the
+ * first case the page is on the way out and it will vanish from the LRU
+ * on the next attempt and the call should be retried later.
+ * Isolation from the LRU fails only if page has been isolated from
+ * the LRU since we looked at it and that usually means either global
+ * reclaim or migration going on. The page will either get back to the
+ * LRU or vanish.
+ * Finaly mem_cgroup_move_account fails only if the page got uncharged
+ * (!PageCgroupUsed) or moved to a different group. The page will
+ * disappear in the next attempt.
  */
-
 static int mem_cgroup_move_parent(struct page *page,
  struct page_cgroup *pc,
  struct mem_cgroup *child)
@@ -2732,8 +2749,10 @@ static int mem_cgroup_move_parent(struct page *page,
if (!parent)
parent = root_mem_cgroup;
 
-   if (nr_pages > 1)
+   if (nr_pages > 1) {
+   VM_BUG_ON(!PageTransHuge(page));
flags = compound_lock_irqsave(page);
+   }
 
ret = mem_cgroup_move_account(page, nr_pages,
pc, child, parent);
@@ -3683,17 +3702,22 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone 
*zone, int order,
return nr_reclaimed;
 }
 
-/*
+/**
+ * mem_cgroup_force_empty_list - clears LRU of a group
+ * @memcg: group to clear
+ * @node: NUMA node
+ * @zid: zone id
+ * @lru: lru to to clear
+ *
  * Traverse a specified page_cgroup list and try to drop them all.  This 
doesn't
- * reclaim the pages page themselves - it just removes the page_cgroups.
- * Returns true if some page_cgroups were not freed, indicating that the caller
- * must retry this operation.
+ * reclaim the pages page themselves - pages are moved to the parent (or root)
+ * group.
  */
-static bool mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
+static void mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
int node, int zid, enum lru_list lru)
 {
struct mem_cgroup_per_zone *mz;
-   unsigned long flags, loop;
+   unsigned long flags;
struct list_head *list;
struct page *busy;
struct zone *zone;
@@ -3702,11 +3726,8 @@ static bool mem_cgroup_force_empty_list(struct 
mem_cgroup *memcg,
mz = mem_cgroup_zoneinfo(memcg, node, zid);
list = >lruvec.lists[lru];
 
-   loop = mz->lru_size[lru];
-   /* give 

[PATCH v3 3/6] memcg: Simplify mem_cgroup_force_empty_list error handling

2012-10-26 Thread Michal Hocko
mem_cgroup_force_empty_list currently tries to remove all pages from
the given LRU. To prevent from temoporary failures (EBUSY returned by
mem_cgroup_move_parent) it uses a margin to the current LRU pages and
returns the true if there are still some pages left on the list.

If we consider that mem_cgroup_move_parent fails only when it is racing
with somebody else removing (uncharging) the page or when the page is
migrated then it is obvious that all those failures are only temporal
and so we can safely retry later.
Let's get rid of the safety margin and make the loop really wait for
the empty LRU. The caller should still make sure that all charges have
been removed from the res_counter because mem_cgroup_replace_page_cache
might add a page to the LRU after the list_empty check (it doesn't touch
res_counter though).
This catches most of the cases except for shmem which might call
mem_cgroup_replace_page_cache with a page which is not charged and on
the LRU yet but this was the case also without this patch. In order to
fix this we need a guarantee that try_get_mem_cgroup_from_page falls
back to the current mm's cgroup so it needs css_tryget to fail. This
will be fixed up in a later patch because it needs a help from cgroup
core (pre_destroy has to be called after css is cleared).

Although mem_cgroup_pre_destroy can still fail (if a new task or a new
sub-group appears) there is no reason to retry pre_destroy callback from
the cgroup core. This means that __DEPRECATED_clear_css_refs has lost
its meaning and it can be removed.

Changes since v2
- remove __DEPRECATED_clear_css_refs

Changes since v1
- use kerndoc
- be more specific about mem_cgroup_move_parent possible failures

Signed-off-by: Michal Hocko mho...@suse.cz
Reviewed-by: Tejun Heo t...@kernel.org
---
 mm/memcontrol.c |   76 +++
 1 file changed, 48 insertions(+), 28 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 916132a..5a1d584 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2702,10 +2702,27 @@ out:
return ret;
 }
 
-/*
- * move charges to its parent.
+/**
+ * mem_cgroup_move_parent - moves page to the parent group
+ * @page: the page to move
+ * @pc: page_cgroup of the page
+ * @child: page's cgroup
+ *
+ * move charges to its parent or the root cgroup if the group has no
+ * parent (aka use_hierarchy==0).
+ * Although this might fail (get_page_unless_zero, isolate_lru_page or
+ * mem_cgroup_move_account fails) the failure is always temporary and
+ * it signals a race with a page removal/uncharge or migration. In the
+ * first case the page is on the way out and it will vanish from the LRU
+ * on the next attempt and the call should be retried later.
+ * Isolation from the LRU fails only if page has been isolated from
+ * the LRU since we looked at it and that usually means either global
+ * reclaim or migration going on. The page will either get back to the
+ * LRU or vanish.
+ * Finaly mem_cgroup_move_account fails only if the page got uncharged
+ * (!PageCgroupUsed) or moved to a different group. The page will
+ * disappear in the next attempt.
  */
-
 static int mem_cgroup_move_parent(struct page *page,
  struct page_cgroup *pc,
  struct mem_cgroup *child)
@@ -2732,8 +2749,10 @@ static int mem_cgroup_move_parent(struct page *page,
if (!parent)
parent = root_mem_cgroup;
 
-   if (nr_pages  1)
+   if (nr_pages  1) {
+   VM_BUG_ON(!PageTransHuge(page));
flags = compound_lock_irqsave(page);
+   }
 
ret = mem_cgroup_move_account(page, nr_pages,
pc, child, parent);
@@ -3683,17 +3702,22 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone 
*zone, int order,
return nr_reclaimed;
 }
 
-/*
+/**
+ * mem_cgroup_force_empty_list - clears LRU of a group
+ * @memcg: group to clear
+ * @node: NUMA node
+ * @zid: zone id
+ * @lru: lru to to clear
+ *
  * Traverse a specified page_cgroup list and try to drop them all.  This 
doesn't
- * reclaim the pages page themselves - it just removes the page_cgroups.
- * Returns true if some page_cgroups were not freed, indicating that the caller
- * must retry this operation.
+ * reclaim the pages page themselves - pages are moved to the parent (or root)
+ * group.
  */
-static bool mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
+static void mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
int node, int zid, enum lru_list lru)
 {
struct mem_cgroup_per_zone *mz;
-   unsigned long flags, loop;
+   unsigned long flags;
struct list_head *list;
struct page *busy;
struct zone *zone;
@@ -3702,11 +3726,8 @@ static bool mem_cgroup_force_empty_list(struct 
mem_cgroup *memcg,
mz = mem_cgroup_zoneinfo(memcg, node, zid);
list = mz-lruvec.lists[lru];
 
-   loop =