Re: [PATCH v6 4/7] cgroup: cgroup v2 freezer

2019-02-14 Thread Roman Gushchin
On Thu, Feb 14, 2019 at 05:26:13PM +0100, Oleg Nesterov wrote:
> On 02/11, Roman Gushchin wrote:
> >
> > On Wed, Jan 30, 2019 at 05:52:01PM +0100, Oleg Nesterov wrote:
> > >
> > > Worse, this looks just wrong. In the latter case, cgroup becomes CGRP_FROZEN
> > > right after a 2nd task migrates to this cgroup, before this new task calls
> > > do_freezer_trap() or cgroup_enter_stopped().
> >
> > You're right. So, it looks like the problem is in the equation
> > nr_tasks_frozen + nr_tasks_stopped == nr_tasks_to_freeze ,
> > because a task can be frozen and stopped simultaneously.
> >
> > So, basically it has to be
> > nr_tasks_frozen + nr_tasks_stopped >= nr_tasks_to_freeze instead.
> 
> It seems you didn't read the paragraph above, or I missed something...
> 
> How can "frozen + stopped >= to_freeze" fix the problem with the falsely true
> CGRP_FROZEN ?

It helps with the problem when cgroup is mistakenly reported as non-frozen,
and adding a task makes it frozen due to increased nr_tasks_to_freeze.

But you're right, it's not correct. I have to think more how to make
the accounting work correctly with stopped tasks.

Roman
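
To make the counter argument concrete, here is a minimal toy model in C of
the two checks under discussion. It is only a sketch, not the patch's code:
struct freezer_counts and all helper names are hypothetical, and it assumes
(as described above) that a task which is frozen and ptrace-stopped at the
same time is counted in both nr_frozen_tasks and nr_stopped_tasks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the three counters; NOT the kernel structure. */
struct freezer_counts {
	int nr_frozen_tasks;
	int nr_stopped_tasks;
	int nr_tasks_to_freeze;
};

/* The check as posted in v6: strict equality. */
bool frozen_eq(const struct freezer_counts *c)
{
	return c->nr_frozen_tasks + c->nr_stopped_tasks ==
	       c->nr_tasks_to_freeze;
}

/* The proposed relaxation: greater-or-equal. */
bool frozen_ge(const struct freezer_counts *c)
{
	return c->nr_frozen_tasks + c->nr_stopped_tasks >=
	       c->nr_tasks_to_freeze;
}

/*
 * Assumed scenario: a single task sleeping in do_freezer_trap() gets
 * PTRACE_INTERRUPT and ends up counted as both frozen and stopped.
 */
struct freezer_counts one_task_frozen_and_stopped(void)
{
	struct freezer_counts c = {
		.nr_frozen_tasks = 1,
		.nr_stopped_tasks = 1,
		.nr_tasks_to_freeze = 1,
	};
	return c;
}

/* A second, still running task migrates in: only to_freeze grows. */
void migrate_running_task_in(struct freezer_counts *c)
{
	assert(c != NULL);
	c->nr_tasks_to_freeze++;
}
```

In this model, with the one doubly counted task the "==" check says "not
frozen" while ">=" says "frozen". Once a still running task migrates in,
both checks say "frozen", which is the falsely true CGRP_FROZEN case.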


Re: [PATCH v6 4/7] cgroup: cgroup v2 freezer

2019-02-14 Thread Oleg Nesterov
On 02/11, Roman Gushchin wrote:
>
> On Wed, Jan 30, 2019 at 05:52:01PM +0100, Oleg Nesterov wrote:
> >
> > Worse, this looks just wrong. In the latter case, cgroup becomes CGRP_FROZEN
> > right after a 2nd task migrates to this cgroup, before this new task calls
> > do_freezer_trap() or cgroup_enter_stopped().
>
> You're right. So, it looks like the problem is in the equation
> nr_tasks_frozen + nr_tasks_stopped == nr_tasks_to_freeze ,
> because a task can be frozen and stopped simultaneously.
>
> So, basically it has to be
> nr_tasks_frozen + nr_tasks_stopped >= nr_tasks_to_freeze instead.

It seems you didn't read the paragraph above, or I missed something...

How can "frozen + stopped >= to_freeze" fix the problem with the falsely true
CGRP_FROZEN ?

Oleg.



Re: [PATCH v6 4/7] cgroup: cgroup v2 freezer

2019-02-11 Thread Roman Gushchin
On Wed, Jan 30, 2019 at 05:52:01PM +0100, Oleg Nesterov wrote:
> Hi Roman,
> 
> On 01/28, Roman Gushchin wrote:
> >
> > Yes, I think you're right: cgroup_exit() should check CGRP_FREEZE bit,
> > not CGRP_FROZEN. Like cgroup_post_fork() does (a one-liner change below).

Hi Oleg!

Sorry for the late reply, I was out of work for some time...
Now I'm fully back.

> 
> but this won't fix all problems? it seems that you missed my other concerns.
> 
> Firstly, this doesn't look consistent. Suppose a cgroup contains a single
> process sleeping in ptrace_stop(). Then it becomes CGRP_FROZEN right after
> "echo 1 > cgroup.freeze".
> 
> OTOH, if this single task sleeps in do_freezer_trap() and gets PTRACE_INTERRUPT,
> it will equally sleep in ptrace_stop() but cgroup won't be CGRP_FROZEN. Never.
> 
> Worse, this looks just wrong. In the latter case, cgroup becomes CGRP_FROZEN
> right after a 2nd task migrates to this cgroup, before this new task calls
> do_freezer_trap() or cgroup_enter_stopped().

You're right. So, it looks like the problem is in the equation
nr_tasks_frozen + nr_tasks_stopped == nr_tasks_to_freeze ,
because a task can be frozen and stopped simultaneously.

So, basically it has to be
nr_tasks_frozen + nr_tasks_stopped >= nr_tasks_to_freeze instead.

I'll cover it with a unit test, and will post v7 soon.

> 
> 
> 
> > About spurious transitions (like frozen->non frozen->frozen on a task
> > being SIGKILLed):
> > in early versions of the patchset I've tried to avoid them, but then
> > following the Tejun's advice
> > switched over to expose them to a user. The logic behind is simple: if
> > the state of the cgroup has been changed (a task is gone, for
> > example), let's notify a user.
> 
> OK, I won't argue...
> 
> actually I can't argue because I do not really understand why do we want
> a "killable" freezer, let alone ptraceable ;)

So the problem with the frozen state as in cgroup v1 is that it's a very
special and "non-natural" task state, which requires special handling
in many places.

Just as an example, we're using oomd (a userspace OOM handling daemon),
which selects and kills one of the workloads when memory pressure is
too high. It should be able to kill all tasks in the selected cgroup,
but we definitely don't want to let it fiddle with cgroup controls
to unfreeze the cgroup. Etc.

In other words, the cgroup freezer has the simple task of preventing
processes in the given cgroup from consuming CPU resources. Unlike the
system-wide freezer, it shouldn't try to make a "snapshot"; that's just
a non-goal.

Thanks!


Re: [PATCH v6 4/7] cgroup: cgroup v2 freezer

2019-01-30 Thread Oleg Nesterov
Hi Roman,

On 01/28, Roman Gushchin wrote:
>
> Yes, I think you're right: cgroup_exit() should check CGRP_FREEZE bit,
> not CGRP_FROZEN. Like cgroup_post_fork() does (a one-liner change below).

but this won't fix all problems? it seems that you missed my other concerns.

Firstly, this doesn't look consistent. Suppose a cgroup contains a single
process sleeping in ptrace_stop(). Then it becomes CGRP_FROZEN right after
"echo 1 > cgroup.freeze".

OTOH, if this single task sleeps in do_freezer_trap() and gets PTRACE_INTERRUPT,
it will equally sleep in ptrace_stop() but cgroup won't be CGRP_FROZEN. Never.

Worse, this looks just wrong. In the latter case, cgroup becomes CGRP_FROZEN
right after a 2nd task migrates to this cgroup, before this new task calls
do_freezer_trap() or cgroup_enter_stopped().



> About spurious transitions (like frozen->non frozen->frozen on a task
> being SIGKILLed):
> in early versions of the patchset I've tried to avoid them, but then
> following the Tejun's advice
> switched over to expose them to a user. The logic behind is simple: if
> the state of the cgroup has been changed (a task is gone, for
> example), let's notify a user.

OK, I won't argue...

actually I can't argue because I do not really understand why do we want
a "killable" freezer, let alone ptraceable ;)

Oleg.



Re: [PATCH v6 4/7] cgroup: cgroup v2 freezer

2019-01-28 Thread Roman Gushchin
Hi Oleg!

Thank you for looking into the patchset!

Yes, I think you're right: cgroup_exit() should check CGRP_FREEZE bit,
not CGRP_FROZEN. Like cgroup_post_fork() does (a one-liner change below).

About spurious transitions (like frozen->non frozen->frozen on a task
being SIGKILLed):
in early versions of the patchset I tried to avoid them, but then,
following Tejun's advice, switched over to exposing them to the user.
The logic behind it is simple: if the state of the cgroup has changed
(a task is gone, for example), let's notify the user.

nr_tasks_to_freeze is simply the number of non-kthread tasks in the
cgroup, which we keep up to date only for freezing cgroups (to avoid
unnecessary overhead).

Thanks!

--

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 3e6a7c19bfaf..08c84d2145eb 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5832,7 +5832,7 @@ void cgroup_exit(struct task_struct *tsk)
css_set_move_task(tsk, cset, NULL, false);
cset->nr_tasks--;

-   if (unlikely(test_bit(CGRP_FROZEN, &cset->dfl_cgrp->flags))) {
+   if (unlikely(cgroup_task_freeze(tsk))) {
/*
 * Task frozen bit should be cleared at this moment,
 * and nr_frozen_task should be decreased.

On Fri, Jan 25, 2019 at 04:27, Oleg Nesterov wrote:
>
> Sorry, this version raced with my vacation, I missed it.
>
> I'll try to read this code carefully but after a quick glance I have some
> concerns,
>
> On 12/21, Roman Gushchin wrote:
> >
> > +static void cgroup_update_frozen(struct cgroup *cgrp)
> > +{
> > + bool frozen;
> > +
> > + lockdep_assert_held(&css_set_lock);
> > +
> > + /*
> > +  * If the cgroup has to be frozen (CGRP_FREEZE bit set),
> > +  * and all tasks are frozen or stopped, let's consider
> > +  * the cgroup frozen. Otherwise it's not frozen.
> > +  */
> > + frozen = test_bit(CGRP_FREEZE, &cgrp->flags) &&
> > + cgrp->freezer.nr_frozen_tasks +
> > + cgrp->freezer.nr_stopped_tasks ==
> > + cgrp->freezer.nr_tasks_to_freeze;
>
> OK. Suppose that cgroup is frozen, CGRP_FROZEN is set, stopped == 0,
> to_freeze = frozen.
>
> One of the tasks is killed, it calls leave_frozen(). If I read this code path
> correctly, only ->nr_frozen_tasks will be decremented, so "frozen" will be
> "false" when cgroup_update_frozen() is called.
>
> Doesn't this mean that this cgroup will no longer be CGRP_FROZEN even after
> the killed task goes away completely?
>
>
> Or. Suppose that another process picks a task from the CGRP_FROZEN cgroup and
> does PTRACE_ATTACH + PTRACE_INTERRUPT. IIUC, the tracee will only increment
> ->nr_stopped_tasks, it won't touch other counters. Again, cgroup won't be FROZEN
> until PTRACE_CONT'ed tracee does cgroup_leave_stopped() ? This looks strange at
> least.
>
>
>
> SIGSTOP. IIUC, a frozen task sleeping in do_freezer_trap() won't stop. However if
> another thread has already called do_signal_stop(), the woken frozen task will
> react to JOBCTL_STOP_PENDING and stop. And do_signal_stop()->cgroup_enter_stopped()
> will "destroy" CGRP_FROZEN too, or I am totally confused.
>
> OTOH, if you freeze a TASK_STOPPED task's cgroup, this task can react to SIGCONT,
> notify its parent, then freeze again. This is fine, but iiuc this cgroup won't be
> FROZEN in between, cgroup_file_notify() will be called twice...
>
> Oleg.
>


Re: [PATCH v6 4/7] cgroup: cgroup v2 freezer

2019-01-25 Thread Oleg Nesterov
On 01/25, Oleg Nesterov wrote:
>
> One of the tasks is killed, it calls leave_frozen(). If I read this code path
> correctly, only ->nr_frozen_tasks will be decremented, so "frozen" will be
> "false" when cgroup_update_frozen() is called.

forgot to mention...

and I do not think cgroup_exit() can "fix" this, see another email I sent a
minute ago...

Oleg.



Re: [PATCH v6 4/7] cgroup: cgroup v2 freezer

2019-01-25 Thread Oleg Nesterov
On 12/21, Roman Gushchin wrote:
>
> +static void cgroup_do_freeze(struct cgroup *cgrp, bool freeze)
> +{
> + struct css_task_iter it;
> + struct task_struct *task;
> +
> + lockdep_assert_held(&cgroup_mutex);
> +
> + spin_lock_irq(&css_set_lock);
> + if (freeze) {
> + cgrp->freezer.nr_tasks_to_freeze = __cgroup_task_count(cgrp);
> + set_bit(CGRP_FREEZE, &cgrp->flags);
> + } else {
> + clear_bit(CGRP_FREEZE, &cgrp->flags);
> + }
> + spin_unlock_irq(&css_set_lock);
> +
> + css_task_iter_start(&cgrp->self, 0, &it);
> + while ((task = css_task_iter_next(&it))) {
> + /*
> +  * Ignore kernel threads here. Freezing cgroups containing
> +  * kthreads isn't supported.
> +  */
> + if (task->flags & PF_KTHREAD)
> + continue;
> + cgroup_freeze_task(task, freeze);
> + }
> + css_task_iter_end(&it);

I don't understand why this can race with exiting task. Or with SIGKILL which
kills a task before it sets current->frozen.

How can we trust nr_tasks_to_freeze at all?

Yes you added cgroup_dec_tasks_to_freeze() into cgroup_exit(). But it won't be
called if CGRP_FROZEN was not set yet, or because of "spurious" transitions
caused by cgroup_inc/dec_frozen_cnt() called by this or other tasks.

it seems that cgroup_exit() should check CGRP_FREEZE instead...

Oleg.
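
The exit-path concern can be illustrated with a small toy model in C. This
is a sketch under stated assumptions, not the kernel implementation: struct
toy_cgroup and the toy_* helpers are hypothetical names, stopped tasks are
left out (stopped == 0), and the exit path is reduced to "decrement
nr_tasks_to_freeze if the gating flag is set".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified view of one freezing cgroup. */
struct toy_cgroup {
	bool freeze;		/* models CGRP_FREEZE: freeze was requested */
	bool frozen;		/* models CGRP_FROZEN: reported as frozen */
	int nr_frozen_tasks;
	int nr_tasks_to_freeze;
};

/* Models the v6 equality check with no stopped tasks. */
void toy_update_frozen(struct toy_cgroup *cg)
{
	cg->frozen = cg->freeze &&
		     cg->nr_frozen_tasks == cg->nr_tasks_to_freeze;
}

/* A killed task leaves the frozen state: only nr_frozen_tasks drops. */
void toy_leave_frozen(struct toy_cgroup *cg)
{
	assert(cg->nr_frozen_tasks > 0);
	cg->nr_frozen_tasks--;
	toy_update_frozen(cg);
}

/*
 * Exit path. check_freeze_bit selects which flag gates the decrement:
 * false models testing CGRP_FROZEN, true models testing CGRP_FREEZE.
 */
void toy_exit(struct toy_cgroup *cg, bool check_freeze_bit)
{
	bool gate = check_freeze_bit ? cg->freeze : cg->frozen;

	if (gate) {
		cg->nr_tasks_to_freeze--;
		toy_update_frozen(cg);
	}
}
```

In this model, starting from three frozen tasks, toy_leave_frozen() clears
the frozen flag. Gating the exit path on the frozen flag then skips the
decrement and the cgroup stays non-frozen with 2 frozen tasks out of 3 to
freeze; gating on the freeze flag brings both counters to 2 and the cgroup
is reported frozen again.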



Re: [PATCH v6 4/7] cgroup: cgroup v2 freezer

2019-01-25 Thread Oleg Nesterov
Sorry, this version raced with my vacation, I missed it.

I'll try to read this code carefully but after a quick glance I have some
concerns,

On 12/21, Roman Gushchin wrote:
>
> +static void cgroup_update_frozen(struct cgroup *cgrp)
> +{
> + bool frozen;
> +
> + lockdep_assert_held(&css_set_lock);
> +
> + /*
> +  * If the cgroup has to be frozen (CGRP_FREEZE bit set),
> +  * and all tasks are frozen or stopped, let's consider
> +  * the cgroup frozen. Otherwise it's not frozen.
> +  */
> + frozen = test_bit(CGRP_FREEZE, &cgrp->flags) &&
> + cgrp->freezer.nr_frozen_tasks +
> + cgrp->freezer.nr_stopped_tasks ==
> + cgrp->freezer.nr_tasks_to_freeze;

OK. Suppose that cgroup is frozen, CGRP_FROZEN is set, stopped == 0,
to_freeze = frozen.

One of the tasks is killed, it calls leave_frozen(). If I read this code path
correctly, only ->nr_frozen_tasks will be decremented, so "frozen" will be
"false" when cgroup_update_frozen() is called.

Doesn't this mean that this cgroup will no longer be CGRP_FROZEN even after
the killed task goes away completely?


Or. Suppose that another process picks a task from the CGRP_FROZEN cgroup and
does PTRACE_ATTACH + PTRACE_INTERRUPT. IIUC, the tracee will only increment
->nr_stopped_tasks, it won't touch other counters. Again, cgroup won't be FROZEN
until PTRACE_CONT'ed tracee does cgroup_leave_stopped() ? This looks strange at
least.



SIGSTOP. IIUC, a frozen task sleeping in do_freezer_trap() won't stop. However if
another thread has already called do_signal_stop(), the woken frozen task will
react to JOBCTL_STOP_PENDING and stop. And do_signal_stop()->cgroup_enter_stopped()
will "destroy" CGRP_FROZEN too, or I am totally confused.

OTOH, if you freeze a TASK_STOPPED task's cgroup, this task can react to SIGCONT,
notify its parent, then freeze again. This is fine, but iiuc this cgroup won't be
FROZEN in between, cgroup_file_notify() will be called twice...

Oleg.



[PATCH v6 4/7] cgroup: cgroup v2 freezer

2018-12-21 Thread Roman Gushchin
Cgroup v1 implements the freezer controller, which provides an ability
to stop the workload in a cgroup and temporarily free up some
resources (cpu, io, network bandwidth and, potentially, memory)
for some other tasks. Cgroup v2 lacks this functionality.

This patch implements freezer for cgroup v2.

Cgroup v2 freezer tries to put tasks into a state similar to jobctl
stop. This means that tasks can be killed, ptraced (using
PTRACE_SEIZE*), and interrupted. It is possible to attach to
a frozen task, get some information (e.g. read registers) and detach.
It's also possible to migrate a frozen task to another cgroup.

This distinguishes the cgroup v2 freezer from the cgroup v1 freezer, which
mostly tried to imitate the system-wide freezer. While uninterruptible
sleep is fine when all tasks are going to be frozen (hibernation case),
it's not an acceptable state for some subset of the system.

Cgroup v2 freezer does not support freezing kthreads.
If a non-root cgroup contains a kthread, the cgroup can still be frozen,
but the kthread will remain running, the cgroup will be shown
as non-frozen, and the notification will not be delivered.

* PTRACE_ATTACH is not working because non-fatal signal delivery
is blocked in frozen state.

There are some interface differences between cgroup v1 and cgroup v2
freezer too, which are required to conform to the cgroup v2 interface
design principles:
1) There is no separate controller, which has to be turned on:
the functionality is always available and is represented by
cgroup.freeze and cgroup.events cgroup control files.
2) The desired state is defined by the cgroup.freeze control file.
Any hierarchical configuration is allowed.
3) The interface is asynchronous. The actual state is available
using cgroup.events control file ("frozen" field). There are no
dedicated transitional states.
4) It's allowed to make any changes with the cgroup hierarchy
(create new cgroups, remove old cgroups, move tasks between cgroups)
no matter if some cgroups are frozen.
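
As an illustration of the asynchronous interface described above, a
userspace tool might request a freeze and then look at the "frozen" field
along these lines. This is a minimal sketch only: the function names are
hypothetical, the cgroup directory is whatever path the caller supplies,
and it assumes cgroup.events holds one "key value" pair per line.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse the text of a cgroup.events file.
 * Returns 1 or 0 for the "frozen" key, or -1 if the key is absent.
 */
int parse_frozen(const char *events)
{
	const char *line = events;

	assert(events != NULL);
	while (line && *line) {
		char key[32];
		int value;

		if (sscanf(line, "%31s %d", key, &value) == 2 &&
		    strcmp(key, "frozen") == 0)
			return value != 0;
		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/*
 * Write the desired state to <cgroup_dir>/cgroup.freeze.
 * Returns 0 on success, -1 on any error. The write only requests the
 * state; the actual state has to be read back from cgroup.events.
 */
int write_freeze(const char *cgroup_dir, int freeze)
{
	char path[4096];
	FILE *f;
	int n;

	n = snprintf(path, sizeof(path), "%s/cgroup.freeze", cgroup_dir);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;
	if (fputs(freeze ? "1" : "0", f) == EOF) {
		fclose(f);
		return -1;
	}
	return fclose(f) == 0 ? 0 : -1;
}
```

A caller would invoke write_freeze() on the target cgroup directory and
then re-read cgroup.events (for example after a poll() notification on
that file), passing its contents to parse_frozen() until it returns 1.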

Signed-off-by: Roman Gushchin 
Cc: Tejun Heo 
Cc: Oleg Nesterov 
Cc: kernel-t...@fb.com
---
 include/linux/cgroup-defs.h  |  34 
 include/linux/cgroup.h   |  46 +
 include/linux/sched.h|   4 +
 include/linux/sched/jobctl.h |   2 +
 kernel/cgroup/Makefile   |   2 +-
 kernel/cgroup/cgroup.c   | 106 ++-
 kernel/cgroup/freezer.c  | 329 +++
 kernel/signal.c  |  93 +-
 8 files changed, 607 insertions(+), 9 deletions(-)
 create mode 100644 kernel/cgroup/freezer.c

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 03355d7008ff..859813ff1ee2 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -64,6 +64,12 @@ enum {
 * specified at mount time and thus is implemented here.
 */
CGRP_CPUSET_CLONE_CHILDREN,
+
+   /* Control group has to be frozen. */
+   CGRP_FREEZE,
+
+   /* Cgroup is frozen. */
+   CGRP_FROZEN,
 };
 
 /* cgroup_root->flags */
@@ -316,6 +322,31 @@ struct cgroup_rstat_cpu {
struct cgroup *updated_next;/* NULL iff not on the list */
 };
 
+struct cgroup_freezer_state {
+   /* Should the cgroup and its descendants be frozen. */
+   bool freeze;
+
+   /* Should the cgroup actually be frozen? */
+   int e_freeze;
+
+   /* Fields below are protected by css_set_lock */
+
+   /* Number of frozen descendant cgroups */
+   int nr_frozen_descendants;
+
+   /* Number of tasks to freeze */
+   int nr_tasks_to_freeze;
+
+   /* Number of frozen tasks */
+   int nr_frozen_tasks;
+
+   /*
+* Number of tasks, which are counted as frozen:
+* SIGSTOPped, and PTRACEd.
+*/
+   int nr_stopped_tasks;
+};
+
 struct cgroup {
/* self css with NULL ->ss, points back to this cgroup */
struct cgroup_subsys_state self;
@@ -452,6 +483,9 @@ struct cgroup {
/* If there is block congestion on this cgroup. */
atomic_t congestion_count;
 
+   /* Used to store internal freezer state */
+   struct cgroup_freezer_state freezer;
+
/* ids of the ancestors at each level including self */
int ancestor_ids[];
 };
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 9d12757a65b0..df51b37928ba 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -877,4 +877,50 @@ static inline void put_cgroup_ns(struct cgroup_namespace *ns)
free_cgroup_ns(ns);
 }
 
+#ifdef CONFIG_CGROUPS
+
+void cgroup_enter_frozen(void);
+void cgroup_leave_frozen(void);
+void cgroup_enter_stopped(void);
+void cgroup_leave_stopped(void);
+void cgroup_dec_tasks_to_freeze(struct cgroup *cgrp);
+void cgroup_freeze(struct cgroup *cgrp, bool freeze);
+void cgroup_freezer_migrate_task(struct task_struct *task, struct cgroup *src,
+struct cgroup *dst);
+static inline bool cgroup_task_freeze(struct task_struct