2011/10/6 Mark A. Grondona <mgrond...@llnl.gov>

> On Thu, 6 Oct 2011 08:52:46 -0700, Matthieu Hautreux <matthieu.hautr...@gmail.com> wrote:
> > 2011/10/6 Mark A. Grondona <mgrond...@llnl.gov>
> > On Thu, 6 Oct 2011 04:47:59 -0700, Matthieu Hautreux <matthieu.hautr...@gmail.com> wrote:
> > > Hi,
> > >
> > > IMHO, there are two distinct problems:
> > >
> > > 1) a task can finish before being added to a cgroup, resulting in
> > > an ESRCH error. That is the initial problem Sam reported.
> >
> > Yes, you are right, I kind of glossed over that one.
> > IMO, I think Sam's approach to use the same fork/exec mechanism
> > used in fork_all_tasks was the right approach. Note that
> > ignoring the ESRCH error instead actually opens up a hole in
> > the proctrack/cgroup mechanism wherein a user could immediately
> > fork again and SLURM would lose track of processes.
> >
> > I do not think that not treating ESRCH as an error is a problem or
> > opens a hole. That is because when you get ESRCH, the task is no
> > longer present on the system, so there is nothing left to do. Treating it
> > as an error is not necessary because it will not change the outcome of
> > what you are doing. However, I agree that using the same fork/exec
> > mechanism as in fork_all_tasks is better and will prevent this
> > scenario. Incidentally, this mechanism will also close the hole
> > you are talking about by ensuring that each task pid is added to the
> > container before the task is allowed to run.
>
> Yeah, sorry, I misspoke. I didn't mean that silently ignoring ESRCH
> opens a hole, but rather that without Sam's fix the hole isn't closed.
>
Mark, I fully agree, and it was great that you reacted to my email because I
did not clearly express the fact that Sam's first patch was necessary to
properly handle the mentioned issue. In my mind, 3 modifications should be
applied to the main branch:

- Sam's first patch, to make sure the task epilog is added to the container
  before it is actually executed
- the ESRCH patch, to avoid treating as an error the addition of an already
  terminated task to a cgroup directory (in case such an error occurs)
- my second patch (based on your proposal), which prevents the cgroup
  container from being removed when the user's tasks exit, as the container
  can still be needed afterwards (epilogs)

Moe, Danny, do you agree on the interest of these modifications?

> > act more like other proctrack plugins.
> >
> > (In fact, I will have to implement at least #2 at our site, because
> > we are not willing to have tasks from the last job running when
> > the next job starts, i.e. we don't want resources released until they
> > are really free.)
> >
> > I would be interested to get the results of your tests with the current
> > implementation + the patch, because I do not think you need to modify
> > it to reach your goal.
>
> Ok, I will actually test before I make claims in the future ;-)
>

I hope that your tests will confirm my expectations ;)

Regards,
Matthieu

> > mark
> >
> > >
> > > Matthieu
> >
> > mark
> >
> >
> > > Mark, as you guessed, putting slurmstepd in the step freezer cgroup is
> > > not an option as it would result in a chicken-and-egg problem when
> > > resuming jobs after a suspend.
> > >
> > > Sam, I think that with the patch for problem 1), you should no
> > > longer experience the initial problem you mentioned (I did not succeed
> > > in reproducing the problem, probably because the timing has to be
> > > just right).
> > >
> > > Let me know if that helps you.
> > > Matthieu
> > >
> > > 2011/10/6 Mark A. Grondona <mgrond...@llnl.gov>
> > > On Wed, 5 Oct 2011 15:40:16 -0700, Sam Lang <saml...@gmail.com> wrote:
> > > >
> > > > Hello,
> > > >
> > > > I've run into a minor problem with cgroups when a task epilog script is
> > > > specified in the config file. The issue appears to be that with
> > > > ProctrackType=proctrack/cgroup, and with the appropriate release files
> > > > in place so that cgroups are cleaned up properly
> > > > (CgroupReleaseAgentDir=/srv/slurm/sbin), the task epilog script
> > > > occasionally gives an error in a job's output file:
> > > > slurm_container_add: No such process.
> > > >
> > > > It looks like the _run_script_as_user function forks and then execs the
> > > > epilog script, while the parent tries to add the child process to the
> > > > container, but sometimes (the timing has to be right) the script
> > > > finishes (and the process exits) before the slurm_container_add
> > > > function gets around to writing the pid to the cgroup tasks file. This
> > > > results in the ESRCH error.
> > > >
> > > > I've attached a patch that I believe fixes this, by making the child
> > > > process wait (just like in _fork_all_tasks) on a pipe for the parent to
> > > > write a character. This allows the parent to call slurm_container_add
> > > > before the child process exits, ensuring that the pid will exist.
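For readers following along, here is a minimal sketch of that kind of
pipe-based handshake between parent and child. It is illustrative only, not
the actual SLURM patch; run_script_synced() and add_pid_to_cgroup() are
hypothetical stand-ins for _run_script_as_user() and slurm_container_add():

#include <errno.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

/* hypothetical stand-in for slurm_container_add(): the real code writes
 * the pid into the step cgroup's tasks file */
static int add_pid_to_cgroup(pid_t pid)
{
	(void) pid;
	return 0;
}

static int run_script_synced(const char *path, char *const argv[])
{
	int sync_pipe[2];
	char c = 0;
	pid_t pid;

	if (pipe(sync_pipe) < 0)
		return -1;

	pid = fork();
	if (pid < 0) {
		return -1;
	} else if (pid == 0) {
		/* child: block until the parent says the pid is tracked */
		close(sync_pipe[1]);
		while (read(sync_pipe[0], &c, 1) < 0 && errno == EINTR)
			;
		close(sync_pipe[0]);
		execv(path, argv);
		_exit(127);
	}

	/* parent: register the pid first, then release the child */
	close(sync_pipe[0]);
	if (add_pid_to_cgroup(pid) < 0)
		fprintf(stderr, "failed to add pid %d to cgroup\n", (int) pid);
	if (write(sync_pipe[1], &c, 1) != 1)
		fprintf(stderr, "failed to release child %d\n", (int) pid);
	close(sync_pipe[1]);

	waitpid(pid, NULL, 0);
	return 0;
}

Because the child does not exec the script until the parent has written to
the pipe, the pid still exists when the parent adds it to the cgroup.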
> > > > This doesn't quite fix my problems though. In
> > > > cgroups-fixes-with-sleep.patch, I've added a sleep(10); just after the
> > > > fork in the parent process. With my patch and this sleep in place,
> > > > slurm_container_add now fails with ENOENT, because the entire cgroup
> > > > freezer directory is missing. I think this is because the release files
> > > > are removing the cgroup when the task completes. In the patch, I remove
> > > > the check to create the container only if the cont_id == 0, but that
> > > > still doesn't seem to resolve the issue, as the directory might exist
> > > > when slurm_container_create is called and then be removed when the
> > > > release script runs.
> > > >
> > > > I'm not sure what the right fix is at this point, and any help is
> > > > greatly appreciated. In summary, the three patches are basically the
> > > > same: one without the sleep and debugging messages, one with the sleep,
> > > > and one with the sleep and debugging messages.
> > > >
> > > > Thanks,
> > > > -sam
> > > >
> > >
> > > Nice debug work. Clearly there is a race condition when using
> > > proctrack/cgroup and the task epilog. When the last task exits,
> > > there is a race between slurmstepd invoking the task epilog and
> > > the release_agent removing the cgroup. I'm kind of unclear on
> > > the concept of task epilogs, but assuming they are necessary,
> > > this hole needs to be plugged somehow.
> > >
> > > There are a couple of solutions I can think of off-hand.
> > >
> > > 1. Put the slurmstepd into the freezer cgroup along with the job
> > > tasks. This would "hold" the cgroup until slurmstepd exits, thus
> > > the job step cgroup would not be released before the final task
> > > exits.
> > >
> > > This isn't a good solution, though, because slurmstepd would then
> > > be frozen along with all the other tasks when the job step gets
> > > a SIGSTOP. I believe this is why slurmstepd was intentionally
> > > left out of the freezer cgroup in the first place.
> > >
> > > 2. Don't delete freezer cgroups for proctrack/cgroup with a
> > > release_agent. I think that perhaps the automatic release agent
> > > for proctrack/cgroup is not strictly necessary. The cgroup
> > > can be removed explicitly when slurm calls container_destroy().
> > > This is far simpler than the first solution. In fact, the code
> > > already tries to rmdir() the cgroup when slurm destroys the
> > > "container".
> > >
> > > Unfortunately a quick patch implementing #2 is a bit tricky, because
> > > you would still need a release_agent for the user and job
> > > freezer cgroups (since container_destroy() is only applicable to
> > > the job step cgroup), and the code currently only lets you set
> > > a release agent on the namespace as a whole or not at all.
> > >
> > > You could try something like the following (completely untested):
> > >
> > > --- a/src/plugins/proctrack/cgroup/proctrack_cgroup.c
> > > +++ b/src/plugins/proctrack/cgroup/proctrack_cgroup.c
> > > @@ -240,6 +240,8 @@ int _slurm_cgroup_create(slurmd_job_t *job, uint64_t id, uid_t uid, gid_t gid)
> > >  		return SLURM_ERROR;
> > >  	}
> > >
> > > +	xcgroup_set_param (&step_freezer_cg, "release_agent", "");
> > > +
> > >  	return SLURM_SUCCESS;
> > >  }
> > >
> > > and manually clear the release_agent after the step cgroup has been
> > > created. The other thing that would need to be done is to have the
> > > "destroy" code wait indefinitely for the cgroup to be removed,
> > > to ensure all tasks have really exited (this is one of the main
> > > design goals of the proctrack plugin framework: to ensure the
> > > job/job step isn't considered "complete" until all user tasks
> > > have really exited).
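A rough sketch of what such a "wait then remove" destroy step could look
like (illustrative only, not the proctrack/cgroup implementation; the path
argument, polling interval and error handling are simplifying assumptions):

#include <limits.h>
#include <stdio.h>
#include <unistd.h>

/* return 1 if any pid is still listed in <cgroup_path>/tasks */
static int cgroup_has_tasks(const char *cgroup_path)
{
	char tasks_file[PATH_MAX];
	char buf[32];
	FILE *fp;
	int busy = 0;

	snprintf(tasks_file, sizeof(tasks_file), "%s/tasks", cgroup_path);
	fp = fopen(tasks_file, "r");
	if (fp == NULL)
		return 0;		/* cgroup already gone */
	if (fgets(buf, sizeof(buf), fp) != NULL)
		busy = 1;		/* at least one task still present */
	fclose(fp);
	return busy;
}

static int destroy_step_cgroup(const char *cgroup_path)
{
	/* wait until every task has really exited ... */
	while (cgroup_has_tasks(cgroup_path))
		sleep(1);

	/* ... then remove the (now empty) step cgroup ourselves,
	 * instead of relying on the kernel's release_agent */
	return rmdir(cgroup_path);
}

Blocking until the tasks file is empty is what preserves the proctrack
guarantee that the step is not reported complete while user tasks remain.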
> > >
> > > I'm actually kind of wondering if the user/job/step hierarchy
> > > is necessary for proctrack/cgroup. Since it is only used for
> > > process tracking of job steps, creating those extra nested
> > > cgroups might not be buying us much. That is a separate question
> > > I guess.
> > >
> > > mark
> >
> > > From 6272876a28bf9dd2a377eb2273dc152e8e05fcfb Mon Sep 17 00:00:00 2001
> > > From: Matthieu Hautreux <matthieu.hautr...@cea.fr>
> > > Date: Thu, 6 Oct 2011 11:28:14 +0200
> > > Subject: [PATCH 1/2] xcgroup: no longer treat ESRCH as an error when adding a
> > >  pid to a cgroup
> > >
> > > A delay occurs between a task's creation and its addition to a cgroup
> > > different from the inherited one. In the meantime, the process can
> > > disappear, resulting in an ESRCH error during the addition to the second
> > > cgroup. Now treat that event as a warning instead of an error.
> > > ---
> > >  src/common/xcgroup.c | 3 ++-
> > >  1 files changed, 2 insertions(+), 1 deletions(-)
> > >
> > > diff --git a/src/common/xcgroup.c b/src/common/xcgroup.c
> > > index d6caecc..eda75f0 100644
> > > --- a/src/common/xcgroup.c
> > > +++ b/src/common/xcgroup.c
> > > @@ -827,7 +827,8 @@ int _file_write_uint64s(char* file_path, uint64_t* values, int nb)
> > >  		if (rc < 1) {
> > >  			debug2("unable to add value '%s' to file '%s' : %m",
> > >  			       tstr, file_path);
> > > -			fstatus = XCGROUP_ERROR;
> > > +			if ( errno != ESRCH )
> > > +				fstatus = XCGROUP_ERROR;
> > >  		}
> > >
> > >  	}
> > > --
> > > 1.7.6.2
> > >
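To see why ESRCH can be downgraded here: adding a pid to a cgroup amounts to
writing the pid into the cgroup's tasks file, and the kernel returns ESRCH
when that pid no longer exists. A simplified sketch of that operation (not
the actual _file_write_uint64s() code; the function name is made up):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int add_pid_to_tasks_file(const char *tasks_file, pid_t pid)
{
	char buf[32];
	int fd, rc = 0;

	fd = open(tasks_file, O_WRONLY);
	if (fd < 0)
		return -1;

	snprintf(buf, sizeof(buf), "%d", (int) pid);
	if (write(fd, buf, strlen(buf)) < 0) {
		if (errno == ESRCH)
			/* the task already exited: nothing left to track */
			fprintf(stderr, "pid %d is already gone\n", (int) pid);
		else
			rc = -1;
	}
	close(fd);
	return rc;
}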
> > > From 1888616114f2a5c136e0c4313ec03867db67bc2f Mon Sep 17 00:00:00 2001
> > > From: Matthieu Hautreux <matthieu.hautr...@cea.fr>
> > > Date: Thu, 6 Oct 2011 11:40:49 +0200
> > > Subject: [PATCH 2/2] proctrack/cgroup: inhibit release agent for the step
> > >  cgroup
> > >
> > > With the release_agent notified at the step cgroup level, the step cgroup
> > > can be removed while slurmstepd has not yet finished its internal
> > > epilog mechanisms. Inhibiting the release agent at the step level and
> > > manually removing the step cgroup helps ensure that epilog or
> > > cleanup tasks can still be added to the proctrack container.
> > > ---
> > >  src/plugins/proctrack/cgroup/proctrack_cgroup.c | 10 ++++++++++
> > >  1 files changed, 10 insertions(+), 0 deletions(-)
> > >
> > > diff --git a/src/plugins/proctrack/cgroup/proctrack_cgroup.c b/src/plugins/proctrack/cgroup/proctrack_cgroup.c
> > > index 1265326..9278bb2 100644
> > > --- a/src/plugins/proctrack/cgroup/proctrack_cgroup.c
> > > +++ b/src/plugins/proctrack/cgroup/proctrack_cgroup.c
> > > @@ -239,6 +239,10 @@ int _slurm_cgroup_create(slurmd_job_t *job, uint64_t id, uid_t uid, gid_t gid)
> > >  		xcgroup_destroy(&step_freezer_cg);
> > >  		return SLURM_ERROR;
> > >  	}
> > > +	/* inhibit the release agent for the step cgroup, thus letting
> > > +	 * slurmstepd add new pids to the container when the job ends
> > > +	 * (TaskEpilog, ...) */
> > > +	xcgroup_set_param(&step_freezer_cg, "notify_on_release", "0");
> > >
> > >  	return SLURM_SUCCESS;
> > >  }
> > > @@ -246,6 +250,12 @@ int _slurm_cgroup_create(slurmd_job_t *job, uint64_t id, uid_t uid, gid_t gid)
> > >  int _slurm_cgroup_destroy(void)
> > >  {
> > >  	if (jobstep_cgroup_path[0] != '\0') {
> > > +		/* enable notification on release again: in case the
> > > +		 * delete does not work properly because of a stuck
> > > +		 * process, this lets the system manage the removal
> > > +		 * when possible */
> > > +		xcgroup_set_param(&step_freezer_cg,
> > > +				  "notify_on_release", "1");
> > >  		xcgroup_delete(&step_freezer_cg);
> > >  		xcgroup_destroy(&step_freezer_cg);
> > >  	}
> > > --
> > > 1.7.6.2
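For reference, a call like xcgroup_set_param(&step_freezer_cg,
"notify_on_release", "0") essentially writes the value into the matching
control file of the cgroup directory. A simplified sketch of that idea (not
the real xcgroup.c helper; the example cgroup path below is made up):

#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int write_cgroup_param(const char *cgroup_path,
			      const char *param, const char *value)
{
	char file_path[PATH_MAX];
	int fd, rc = 0;

	snprintf(file_path, sizeof(file_path), "%s/%s", cgroup_path, param);
	fd = open(file_path, O_WRONLY);
	if (fd < 0)
		return -1;
	if (write(fd, value, strlen(value)) < 0)
		rc = -1;
	close(fd);
	return rc;
}

/* example (hypothetical step cgroup path):
 * write_cgroup_param("/cgroup/freezer/uid_1000/job_42/step_0",
 *                    "notify_on_release", "0"); */

With notify_on_release set to 0, the kernel never invokes the release_agent
for the step cgroup, so its removal is entirely up to _slurm_cgroup_destroy().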