On Thu, 6 Oct 2011 14:04:01 -0700, Matthieu Hautreux 
<[email protected]> wrote:
>
> In my mind, 3 modifications should be applied to the main branch:
> 
> - Sam's first patch, to ensure the task epilog is added to the
> container before it is actually executed
> - the ESRCH patch, so that adding an already-terminated task to a
> cgroup directory is not treated as an error (should one occur)
> - my second patch (based on your proposition), which prevents the
> cgroup container from being removed when the user's tasks exit, since
> the container can still be needed afterwards (epilogs)

> 
> Moe, Danny, do you agree that these modifications are worthwhile?
> 
> 
> > act more like other proctrack plugins.
> >
> > (In fact, I will have to implement at least #2 at our site, because
> >  we are not willing to have tasks from the last job running when
> >  the next job starts, i.e. we don't want resources released until they
> >  are really free.)
> >
> > I would be interested to get the results of your tests with current
> > implementation + the patch because I do not think you need to modify
> > it to reach your goal.
> 
> Ok, I will actually test before I make claims in the future ;-)
> 
> I hope that your tests will confirm my expectations ;)

I just finished testing a contrived but possible scenario, wherein
the slurmstepd exits and the job step is released, but there are
still user tasks running (or stuck in an unkillable state) on the
system.

This can happen because the slurmstepd does not wait for the step
freezer cgroup to be fully removed during slurm_container_plugin_wait().
Looking at the code, it seems this was meant to be handled, but
slurm_container_plugin_destroy() in proctrack/cgroup always returns
success, so the waiting code is never exercised.

I would also argue that once proctrack/cgroup waits for the jobstep
cgroup to be successfully removed, there is no need to reestablish
the notify_on_release flag in _slurm_cgroup_destroy().

Does that make sense?

BTW, my reproducer is to run something like the following:

 srun sh -c 'exec >/dev/null 2>&1; (sleep 100000 &); sleep 200'

this puts one sleep in the background, one sleep process in the
foreground, and closes stdout and stderr for the job step.

I then make the background sleep unkillable with a systemtap script
such as:

 probe signal.send
 {
   if (task_pid(task) == target()) {
     printf ("resetting sig %d to SIGSTOP for target\n", sig)
     if (@defined($q->info->si_signo))
       $q->info->si_signo = 19
     else
       $sig = 19
   }
 }

which turns SIGKILL, SIGTERM, etc. into SIGSTOP (signal 19), thus
mimicking a process that can't be killed by SLURM.

If I then terminate the job step, it appears to exit successfully,
but the unkillable sleep is still running on the node:

 # hype28 /cgroup/freezer/uid_6885/job_1004385 > squeue -u grondo
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
 # hype28 /cgroup/freezer/uid_6885/job_1004385 > grep . step_15/*
 step_15/cgroup.procs:6388
 step_15/freezer.state:THAWED
 step_15/notify_on_release:1
 step_15/tasks:6388

If the unkillable sleep were instead a large-memory job stuck writing
to a hung filesystem, the next job SLURM runs on this node would get a
big surprise: far less free RAM than it expects.

mark


> Regards,
> Matthieu
> 
> 
> 
> mark
> 
> 
> >
> > Matthieu
> >
> > mark
> >
> >
> > > Mark, as you guessed, putting slurmstepd in the step freezer cgroup is
> > > not an option, as it would result in a chicken-and-egg problem when
> > > resuming jobs after a suspend.
> > >
> > > Sam, I think that with the patch for problem 1), you should no
> > > longer experience the initial problem you mentioned (I did not succeed
> > > in reproducing it, probably because the timing must be just right).
> > >
> > > Let me know if that helps you
> > > Matthieu
> > >
> > > 2011/10/6 Mark A. Grondona <[email protected]>
> > > On Wed, 5 Oct 2011 15:40:16 -0700, Sam Lang <[email protected]>
> > >  wrote:
> > > >
> > > > Hello,
> > > >
> > > > I've run into a minor problem with cgroups when a task epilog script is
> > > > specified in the config file.  The issue appears to be that with the
> > > > ProctrackType=proctrack/cgroup, and with the appropriate release files
> > > > in place so that cgroups are cleaned up properly
> > > > (CgroupReleaseAgentDir=/srv/slurm/sbin), the task epilog script
> > > > occasionally gives an error in a job's output file:
> > > > slurm_container_add:  No such process.
> > > >
> > > > It looks like the _run_script_as_user function forks and then execs the
> > > > epilog script, while the parent tries to add the child process to the
> > > > container, but sometimes (the timing has to be right) the script
> > > > finishes (and the process exits) before the slurm_container_add
> > > > function gets to write the pid to the cgroup tasks file.  This results
> > > > in the ESRCH error.
> > > >
> > > > I've attached a patch that I believe fixes this, by making the child
> > > > process wait (just like in _fork_all_tasks) on a pipe for the parent to
> > > > write a character.  This allows the parent to call slurm_container_add
> > > > before the child process exits, ensuring that the pid will exist.
> > > >
> > > > This doesn't quite fix my problems though.  In
> > > > cgroups-fixes-with-sleep.patch, I've added a sleep(10); just after the
> > > > fork in the parent process.  With my patch and this sleep in place, the
> > > > slurm_container_add now fails with ENOENT, because the entire cgroup
> > > > freezer directory is missing.  I think this is because the release files
> > > > are removing the cgroup when the task completes.  In the patch, I remove
> > > > the check to create the container only if the cont_id == 0, but that
> > > > still doesn't seem to resolve the issue, as the directory only exists
> > > > between when slurm_container_create is called and when the release
> > > > script is called to remove the cgroup.
> > > >
> > > > I'm not sure what the right fix is at this point, and any help is
> > > > greatly appreciated.  In summary, the three patches are basically the
> > > > same, one without the sleep and debugging messages, one with the sleep,
> > > > and one with the sleep and debugging messages.
> > > >
> > > > Thanks,
> > > > -sam
> > > >
> > >
> > > Nice debug work. Clearly there is a race condition when using
> > > proctrack/cgroup and the task epilog. When the last task exits,
> > > there is a race between slurmstepd invoking task epilog, and
> > > the release_agent removing the cgroup. I'm kind of unclear on
> > > the concept of task epilogs, but assuming they are necessary,
> > > this hole needs to be plugged somehow.
> > >
> > > There are a couple solutions I can think of off-hand.
> > >
> > > 1. Put the slurmstepd into the freezer cgroup along with the job
> > >   tasks. This would "hold" the cgroup until slurmstepd exits, thus
> > >   the job step cgroup would not be released before the final task
> > >   exits.
> > >
> > >   This isn't a good solution, though, because slurmstepd would then
> > >   be frozen along with all other tasks when the job step gets
> > >   a SIGSTOP. I believe this is why the slurmstepd was intentionally
> > >   left out of the freezer cgroup in the first place.
> > >
> > > 2. Don't delete freezer cgroups for proctrack/cgroup with a
> > >   release_agent. I think that perhaps the automatic release agent
> > >   for proctrack/cgroups is not strictly necessary. The cgroup
> > >   can be removed explicitly when slurm calls container_destroy().
> > >   This is far simpler than the first solution. In fact, the code
> > >   already tries to rmdir() the cgroup when slurm destroys the
> > >   "container".
> > >
> > > Unfortunately a quick patch implementing #2 is a bit tricky, because
> > > you would still need a release_agent for the user and job
> > > freezer cgroups (since container_destroy() is only applicable to
> > > the job step cgroup), and the code currently only lets you set
> > > a release agent on the namespace as a whole or not at all.
> > >
> > > You could try something like the following (completely untested)
> > >
> > > --- a/src/plugins/proctrack/cgroup/proctrack_cgroup.c
> > > +++ b/src/plugins/proctrack/cgroup/proctrack_cgroup.c
> > > @@ -240,6 +240,8 @@ int _slurm_cgroup_create(slurmd_job_t *job, uint64_t id, uid
> > >                return SLURM_ERROR;
> > >        }
> > >
> > > +       xcgroup_set_param (&step_freezer_cg, "release_agent", "");
> > > +
> > >        return SLURM_SUCCESS;
> > >  }
> > >
> > > and manually clear the release_agent after the step cgroup has been
> > > created. The other thing that would need to be done is to have the
> > > "destroy" code wait indefinitely for the cgroup to be removed,
> > > to ensure all tasks have really exited (this is one of the main
> > > design goals of the proctrack plugin framework, to ensure the
> > > job/job step isn't considered "complete" until all user tasks
> > > have really exited).
> > >
> > >
> > > I'm actually kind of wondering if the user/job/step hierarchy
> > > is necessary for proctrack/cgroup. Since it is only used for
> > > process tracking of job steps, creating those extra nested
> > > cgroups might not be buying us much. That is a separate question
> > > I guess.
> > >
> > > mark
> > >
> > > From 6272876a28bf9dd2a377eb2273dc152e8e05fcfb Mon Sep 17 00:00:00 2001
> > > From: Matthieu Hautreux <[email protected]>
> > > Date: Thu, 6 Oct 2011 11:28:14 +0200
> > > Subject: [PATCH 1/2] xcgroup: no longer treat ESRCH as an error when adding a pid to cgroup
> > >
> > > A delay occurs between a task's creation and its addition to a
> > > cgroup other than the inherited one. In the meantime, the process can
> > > disappear, resulting in an ESRCH during the addition to the second
> > > cgroup. Now treat that event as a warning instead of an error.
> > > ---
> > >  src/common/xcgroup.c |    3 ++-
> > >  1 files changed, 2 insertions(+), 1 deletions(-)
> > >
> > > diff --git a/src/common/xcgroup.c b/src/common/xcgroup.c
> > > index d6caecc..eda75f0 100644
> > > --- a/src/common/xcgroup.c
> > > +++ b/src/common/xcgroup.c
> > > @@ -827,7 +827,8 @@ int _file_write_uint64s(char* file_path, uint64_t* values, int nb)
> > >               if (rc < 1) {
> > >                       debug2("unable to add value '%s' to file '%s' : %m",
> > >                              tstr, file_path);
> > > -                     fstatus = XCGROUP_ERROR;
> > > +                     if ( errno != ESRCH )
> > > +                             fstatus = XCGROUP_ERROR;
> > >               }
> > >
> > >       }
> > > --
> > > 1.7.6.2
> > >
> > > From 1888616114f2a5c136e0c4313ec03867db67bc2f Mon Sep 17 00:00:00 2001
> > > From: Matthieu Hautreux <[email protected]>
> > > Date: Thu, 6 Oct 2011 11:40:49 +0200
> > > Subject: [PATCH 2/2] proctrack/cgroup: inhibit release agent for the step cgroup
> > >
> > > With the release_agent notified at the step cgroup level, the step
> > > cgroup can be removed while slurmstepd has not yet finished its
> > > internal epilog mechanisms. Inhibiting the release agent at the step
> > > level and manually removing the step cgroup ensures that epilog and
> > > cleanup tasks can still be added to the proctrack container.
> > > ---
> > >  src/plugins/proctrack/cgroup/proctrack_cgroup.c |   10 ++++++++++
> > >  1 files changed, 10 insertions(+), 0 deletions(-)
> > >
> > > diff --git a/src/plugins/proctrack/cgroup/proctrack_cgroup.c b/src/plugins/proctrack/cgroup/proctrack_cgroup.c
> > > index 1265326..9278bb2 100644
> > > --- a/src/plugins/proctrack/cgroup/proctrack_cgroup.c
> > > +++ b/src/plugins/proctrack/cgroup/proctrack_cgroup.c
> > > @@ -239,6 +239,10 @@ int _slurm_cgroup_create(slurmd_job_t *job, uint64_t id, uid_t uid, gid_t gid)
> > >               xcgroup_destroy(&step_freezer_cg);
> > >               return SLURM_ERROR;
> > >       }
> > > +     /* inhibit the release agent for the step cgroup, letting
> > > +      * slurmstepd add new pids to the container when the job
> > > +      * ends (TaskEpilog, ...) */
> > > +     xcgroup_set_param(&step_freezer_cg, "notify_on_release", "0");
> > >
> > >       return SLURM_SUCCESS;
> > >  }
> > > @@ -246,6 +250,12 @@ int _slurm_cgroup_create(slurmd_job_t *job, uint64_t id, uid_t uid, gid_t gid)
> > >  int _slurm_cgroup_destroy(void)
> > >  {
> > >       if (jobstep_cgroup_path[0] != '\0') {
> > > +             /* re-enable notification on release in case the
> > > +              * delete fails because of a stuck process; this
> > > +              * lets the system handle the removal when it
> > > +              * becomes possible */
> > > +             xcgroup_set_param(&step_freezer_cg,
> > > +                               "notify_on_release", "1");
> > >               xcgroup_delete(&step_freezer_cg);
> > >               xcgroup_destroy(&step_freezer_cg);
> > >       }
> > > --
> > > 1.7.6.2
> > >
> >
> 
