On Thu, 6 Oct 2011 08:52:46 -0700, Matthieu Hautreux <matthieu.hautr...@gmail.com> wrote:
> 2011/10/6 Mark A. Grondona <mgrond...@llnl.gov>
> On Thu, 6 Oct 2011 04:47:59 -0700, Matthieu Hautreux
> <matthieu.hautr...@gmail.com> wrote:
> > Hi,
> >
> > IMHO, there are two distinct problems:
> >
> > 1) a task can be finished before being added to a cgroup, resulting
> >    in an ESRCH error. That is the initial problem of Sam
>
> Yes, you are right, I kind of glossed over that one.
> IMO, Sam's approach of using the same fork/exec mechanism
> used in fork_all_tasks was the right one. Note that
> ignoring the ESRCH error instead actually opens up a hole in
> the proctrack/cgroup mechanism wherein a user could immediately
> fork again and SLURM will lose track of processes.
>
> I do not think that not treating ESRCH as an error is a problem or
> opens a hole: when you get ESRCH, the task is no longer present on
> the system, so there is nothing left to do. Treating that as an error
> is not necessary because it will not change the outcome of what you
> are doing. However, I agree that using the same fork/exec mechanism
> as in fork_all_tasks is better and will prevent this scenario. On
> that note, this mechanism will also prevent the hole you are talking
> about by ensuring that each task pid is added to the container before
> the task is allowed to run.
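For illustration, a minimal sketch of the ESRCH handling described above might look like the following. The helper name and error reporting are made up for this sketch and are not the actual xcgroup code:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Attach a pid to a freezer cgroup "tasks" file.  If the process has
 * already exited, the kernel fails the write with ESRCH; since a vanished
 * task needs no further tracking, that case is downgraded to a warning. */
static int attach_pid_to_cgroup(const char *tasks_path, pid_t pid)
{
	char buf[32];
	int len = snprintf(buf, sizeof(buf), "%d", (int) pid);
	int fd = open(tasks_path, O_WRONLY);

	if (fd < 0) {
		fprintf(stderr, "open(%s): %s\n", tasks_path, strerror(errno));
		return -1;
	}
	if (write(fd, buf, len) < 0) {
		int rc = (errno == ESRCH) ? 0 : -1;
		fprintf(stderr, "%s: pid %d: %s\n",
			rc ? "error" : "warning (task already exited)",
			(int) pid, strerror(errno));
		close(fd);
		return rc;
	}
	close(fd);
	return 0;
}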
Yeah, sorry, I misspoke. I didn't mean that silently ignoring ESRCH
opens a hole, but rather that without Sam's fix the hole isn't closed.

> act more like other proctrack plugins.
>
> (In fact, I will have to implement at least #2 at our site, because
> we are not willing to have tasks from the last job running when
> the next job starts, i.e. we don't want resources released until they
> are really free.)
>
> I would be interested to get the results of your tests with the current
> implementation + the patch because I do not think you need to modify
> it to reach your goal.

Ok, I will actually test before I make claims in the future ;-)

mark

> Matthieu
>
> mark
>
> > Mark, as you guess, putting slurmstepd in the step freezer cgroup is
> > not an option as it would result in a chicken-and-egg problem to
> > resume the jobs after a suspend.
> >
> > Sam, I think that with the patch for problem 1), you should no
> > longer experience the initial problem you mentioned (I did not succeed
> > in reproducing the problem, probably because of the timing that must
> > be right).
> >
> > Let me know if that helps you.
> > Matthieu
> >
> > 2011/10/6 Mark A. Grondona <mgrond...@llnl.gov>
> > On Wed, 5 Oct 2011 15:40:16 -0700, Sam Lang <saml...@gmail.com> wrote:
> > > Hello,
> > >
> > > I've run into a minor problem with cgroups when a task epilog script is
> > > specified in the config file. The issue appears to be that with
> > > ProctrackType=proctrack/cgroup, and with the appropriate release files
> > > in place so that cgroups are cleaned up properly
> > > (CgroupReleaseAgentDir=/srv/slurm/sbin), the task epilog script
> > > occasionally gives an error in a job's output file:
> > > slurm_container_add: No such process.
> > >
> > > It looks like the _run_script_as_user function forks and then execs the
> > > epilog script, while the parent tries to add the child process to the
> > > container, but sometimes (the timing has to be right) the script
> > > finishes (and the process exits) before the slurm_container_add
> > > function gets to writing the pid to the cgroup tasks file. This results
> > > in the ESRCH error.
> > >
> > > I've attached a patch that I believe fixes this, by making the child
> > > process wait (just like in _fork_all_tasks) on a pipe for the parent to
> > > write a character. This allows the parent to call slurm_container_add
> > > before the child process exits, ensuring that the pid will exist.
> > >
> > > This doesn't quite fix my problems though. In
> > > cgroups-fixes-with-sleep.patch, I've added a sleep(10); just after the
> > > fork in the parent process. With my patch and this sleep in place,
> > > slurm_container_add now fails with ENOENT, because the entire cgroup
> > > freezer directory is missing. I think this is because the release files
> > > are removing the cgroup when the task completes. In the patch, I remove
> > > the check to create the container only if cont_id == 0, but that
> > > still doesn't seem to resolve the issue, as the directory might only
> > > exist between when slurm_container_create is called and when the
> > > release script is called to remove the cgroup.
> > >
> > > I'm not sure what the right fix is at this point, and any help is
> > > greatly appreciated.
> > >
> > > In summary, the three patches are basically the same: one without the
> > > sleep and debugging messages, one with the sleep, and one with the
> > > sleep and debugging messages.
> > >
> > > Thanks,
> > > -sam
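Sam's pipe-based fix described above follows the same handshake pattern as _fork_all_tasks: the child blocks until the parent has recorded its pid, and only then execs. A minimal standalone sketch of that pattern follows; add_pid_to_container() is a placeholder standing in for slurm_container_add(), not a real SLURM symbol:

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Placeholder for slurm_container_add(): here it just logs the pid. */
static int add_pid_to_container(pid_t pid)
{
	fprintf(stderr, "would add pid %d to the step cgroup here\n", (int) pid);
	return 0;
}

/* Fork a script, but hold the child on a pipe until its pid has been
 * added to the container, so the pid cannot vanish before tracking. */
static pid_t run_script_synced(const char *path)
{
	int sync_pipe[2];
	pid_t pid;
	char go = 0;

	if (pipe(sync_pipe) < 0)
		return -1;

	if ((pid = fork()) == 0) {
		/* child: block until the parent writes the "go" byte */
		close(sync_pipe[1]);
		if (read(sync_pipe[0], &go, 1) != 1)
			_exit(127);
		close(sync_pipe[0]);
		execl(path, path, (char *) NULL);
		_exit(127);
	}

	/* parent: track the pid *before* releasing the child */
	close(sync_pipe[0]);
	if (pid > 0)
		add_pid_to_container(pid);
	if (write(sync_pipe[1], &go, 1) != 1)
		fprintf(stderr, "failed to release child\n");
	close(sync_pipe[1]);
	return pid;
}

int main(void)
{
	pid_t pid = run_script_synced("/bin/true");
	if (pid > 0)
		waitpid(pid, NULL, 0);
	return 0;
}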
> > Nice debug work. Clearly there is a race condition when using
> > proctrack/cgroup and the task epilog. When the last task exits,
> > there is a race between slurmstepd invoking the task epilog and
> > the release_agent removing the cgroup. I'm kind of unclear on
> > the concept of task epilogs, but assuming they are necessary,
> > this hole needs to be plugged somehow.
> >
> > There are a couple of solutions I can think of off-hand.
> >
> > 1. Put the slurmstepd into the freezer cgroup along with the job
> >    tasks. This would "hold" the cgroup until slurmstepd exits, thus
> >    the job step cgroup would not be released before the final task
> >    exits.
> >
> >    This isn't a good solution, though, because slurmstepd would then
> >    be frozen along with all other tasks when the job step gets
> >    a SIGSTOP. I believe this is why the slurmstepd was intentionally
> >    left out of the freezer cgroup in the first place.
> >
> > 2. Don't delete freezer cgroups for proctrack/cgroup with a
> >    release_agent. I think that perhaps the automatic release agent
> >    for proctrack/cgroup is not strictly necessary. The cgroup
> >    can be removed explicitly when slurm calls container_destroy().
> >    This is far simpler than the first solution. In fact, the code
> >    already tries to rmdir() the cgroup when slurm destroys the
> >    "container".
> >
> > Unfortunately a quick patch implementing #2 is a bit tricky, because
> > you would still need a release_agent for the user and job
> > freezer cgroups (since container_destroy() is only applicable to
> > the job step cgroup), and the code currently only lets you set
> > a release agent on the namespace as a whole or not at all.
> >
> > You could try something like the following (completely untested)
> >
> > --- a/src/plugins/proctrack/cgroup/proctrack_cgroup.c
> > +++ b/src/plugins/proctrack/cgroup/proctrack_cgroup.c
> > @@ -240,6 +240,8 @@ int _slurm_cgroup_create(slurmd_job_t *job, uint64_t id, uid_t uid, gid_t gid)
> >  		return SLURM_ERROR;
> >  	}
> >
> > +	xcgroup_set_param (&step_freezer_cg, "release_agent", "");
> > +
> >  	return SLURM_SUCCESS;
> >  }
> >
> > and manually clear the release_agent after the step cgroup has been
> > created. The other thing that would need to be done is to have the
> > "destroy" code wait indefinitely for the cgroup to be removed,
> > to ensure all tasks have really exited (this is one of the main
> > design goals of the proctrack plugin framework: to ensure the
> > job/job step isn't considered "complete" until all user tasks
> > have really exited).
> >
> > I'm actually kind of wondering if the user/job/step hierarchy
> > is necessary for proctrack/cgroup. Since it is only used for
> > process tracking of job steps, creating those extra nested
> > cgroups might not be buying us much. That is a separate question,
> > I guess.
> >
> > mark
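A rough sketch of option #2 above, combined with the notify_on_release approach used in the second patch below: disable automatic release on the step freezer cgroup at creation time, then have the destroy path retry rmdir() until every task has exited (rmdir() fails with EBUSY while the cgroup still has tasks). The cgroup path and helper below are illustrative only, not what the plugin actually computes:

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Illustrative path only; the real plugin builds this from the configured
 * cgroup mount point and the uid/job/step ids. */
#define STEP_FREEZER "/sys/fs/cgroup/freezer/slurm/uid_1000/job_42/step_0"

/* Write a single cgroup control file, e.g. notify_on_release. */
static int write_param(const char *cgroup, const char *param, const char *val)
{
	char path[4096];
	FILE *fp;

	snprintf(path, sizeof(path), "%s/%s", cgroup, param);
	if (!(fp = fopen(path, "w")))
		return -1;
	fprintf(fp, "%s\n", val);
	return fclose(fp);
}

int main(void)
{
	/* at step-cgroup creation: keep the release agent out of the way */
	if (write_param(STEP_FREEZER, "notify_on_release", "0") < 0)
		fprintf(stderr, "notify_on_release: %s\n", strerror(errno));

	/* ... the job step runs, task pids are added to the tasks file ... */

	/* at container destroy: rmdir() fails with EBUSY while tasks remain,
	 * so retry until the last one has really exited */
	while (rmdir(STEP_FREEZER) < 0 && errno == EBUSY) {
		fprintf(stderr, "step cgroup still busy, retrying\n");
		sleep(1);
	}
	return 0;
}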
> > From 6272876a28bf9dd2a377eb2273dc152e8e05fcfb Mon Sep 17 00:00:00 2001
> > From: Matthieu Hautreux <matthieu.hautr...@cea.fr>
> > Date: Thu, 6 Oct 2011 11:28:14 +0200
> > Subject: [PATCH 1/2] xcgroup: no longer treat ESRCH as an error when adding a
> >  pid to a cgroup
> >
> > A delay occurs between a task's creation and its addition to a different
> > cgroup than the inherited one. In the meantime, the process can disappear,
> > resulting in an ESRCH during the addition to the second cgroup. Now treat
> > that event as a warning instead of an error.
> > ---
> >  src/common/xcgroup.c | 3 ++-
> >  1 files changed, 2 insertions(+), 1 deletions(-)
> >
> > diff --git a/src/common/xcgroup.c b/src/common/xcgroup.c
> > index d6caecc..eda75f0 100644
> > --- a/src/common/xcgroup.c
> > +++ b/src/common/xcgroup.c
> > @@ -827,7 +827,8 @@ int _file_write_uint64s(char* file_path, uint64_t* values, int nb)
> >  		if (rc < 1) {
> >  			debug2("unable to add value '%s' to file '%s' : %m",
> >  			       tstr, file_path);
> > -			fstatus = XCGROUP_ERROR;
> > +			if ( errno != ESRCH )
> > +				fstatus = XCGROUP_ERROR;
> >  		}
> >
> >  	}
> > --
> > 1.7.6.2
> >
> > From 1888616114f2a5c136e0c4313ec03867db67bc2f Mon Sep 17 00:00:00 2001
> > From: Matthieu Hautreux <matthieu.hautr...@cea.fr>
> > Date: Thu, 6 Oct 2011 11:40:49 +0200
> > Subject: [PATCH 2/2] proctrack/cgroup: inhibit release agent for the step
> >  cgroup
> >
> > With the release_agent notified at the step cgroup level, the step cgroup
> > can be removed while slurmstepd has not yet finished its internal
> > epilog mechanisms. Inhibiting the release agent at the step level and
> > manually removing the step cgroup helps ensure that epilog or
> > cleanup tasks can still be added to the proctrack container.
> > ---
> >  src/plugins/proctrack/cgroup/proctrack_cgroup.c | 10 ++++++++++
> >  1 files changed, 10 insertions(+), 0 deletions(-)
> >
> > diff --git a/src/plugins/proctrack/cgroup/proctrack_cgroup.c b/src/plugins/proctrack/cgroup/proctrack_cgroup.c
> > index 1265326..9278bb2 100644
> > --- a/src/plugins/proctrack/cgroup/proctrack_cgroup.c
> > +++ b/src/plugins/proctrack/cgroup/proctrack_cgroup.c
> > @@ -239,6 +239,10 @@ int _slurm_cgroup_create(slurmd_job_t *job, uint64_t id, uid_t uid, gid_t gid)
> >  		xcgroup_destroy(&step_freezer_cg);
> >  		return SLURM_ERROR;
> >  	}
> > +	/* inhibit the release agent for the step cgroup, thus letting
> > +	 * slurmstepd add new pids to the container when the job ends
> > +	 * (TaskEpilog, ...) */
> > +	xcgroup_set_param(&step_freezer_cg, "notify_on_release", "0");
> >
> >  	return SLURM_SUCCESS;
> >  }
> >
> > @@ -246,6 +250,12 @@ int _slurm_cgroup_destroy(void)
> >  int _slurm_cgroup_destroy(void)
> >  {
> >  	if (jobstep_cgroup_path[0] != '\0') {
> > +		/* enable notification on release again in case
> > +		 * the delete does not work properly because of a
> > +		 * stuck process; this lets the system manage the
> > +		 * removal when possible */
> > +		xcgroup_set_param(&step_freezer_cg,
> > +				  "notify_on_release", "1");
> >  		xcgroup_delete(&step_freezer_cg);
> >  		xcgroup_destroy(&step_freezer_cg);
> >  	}
> > --
> > 1.7.6.2