On Thu, 6 Oct 2011 08:52:46 -0700, Matthieu Hautreux <matthieu.hautr...@gmail.com> wrote:
> 2011/10/6 Mark A. Grondona <mgrond...@llnl.gov>
> On Thu, 6 Oct 2011 04:47:59 -0700, Matthieu Hautreux
> <matthieu.hautr...@gmail.com> wrote:
> > Hi,
> >
> > IMHO, there are two distinct problems:
> >
> > 1) a task can be finished before being added to a cgroup, resulting
> >    in an ESRCH error. That is the initial problem of Sam
>
> Yes, you are right, I kind of glossed over that one.
> IMO, Sam's approach of using the same fork/exec mechanism
> used in fork_all_tasks was the right one. Note that
> ignoring the ESRCH error instead actually opens up a hole in
> the proctrack/cgroup mechanism wherein a user could immediately
> fork again and SLURM will lose track of processes.
>
> I do not think that not treating ESRCH as an error is a problem or
> opens a hole: when you get ESRCH, the task is no longer present on
> the system, so there is nothing left to do. Treating that as an error
> is not necessary because it will not change the outcome of what you
> are doing. However, I agree that using the same fork/exec mechanism
> as in fork_all_tasks is better and will prevent this scenario. On
> that note, this mechanism will also prevent the hole you are talking
> about by ensuring that each task pid is added to the container before
> the task is allowed to run.
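For illustration, a minimal sketch of the ESRCH handling described above might look like the following. The helper name and error reporting are made up for this sketch and are not the actual xcgroup code:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Attach a pid to a freezer cgroup "tasks" file.  If the process has
 * already exited, the kernel fails the write with ESRCH; since a vanished
 * task needs no further tracking, that case is downgraded to a warning. */
static int attach_pid_to_cgroup(const char *tasks_path, pid_t pid)
{
	char buf[32];
	int len = snprintf(buf, sizeof(buf), "%d", (int) pid);
	int fd = open(tasks_path, O_WRONLY);

	if (fd < 0) {
		fprintf(stderr, "open(%s): %s\n", tasks_path, strerror(errno));
		return -1;
	}
	if (write(fd, buf, len) < 0) {
		int rc = (errno == ESRCH) ? 0 : -1;
		fprintf(stderr, "%s: pid %d: %s\n",
			rc ? "error" : "warning (task already exited)",
			(int) pid, strerror(errno));
		close(fd);
		return rc;
	}
	close(fd);
	return 0;
}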
Yeah, sorry, I misspoke. I didn't mean that silently ignoring ESRCH
opens a hole, but rather that without Sam's fix the hole isn't closed.

> act more like other proctrack plugins.
>
> (In fact, I will have to implement at least #2 at our site, because
> we are not willing to have tasks from the last job running when
> the next job starts, i.e. we don't want resources released until they
> are really free.)
>
> I would be interested to get the results of your tests with the current
> implementation + the patch because I do not think you need to modify
> it to reach your goal.

Ok, I will actually test before I make claims in the future ;-)

mark

> Matthieu
>
> mark
>
> > Mark, as you guess, putting slurmstepd in the step freezer cgroup is
> > not an option as it would result in a chicken-and-egg problem to
> > resume the jobs after a suspend.
> >
> > Sam, I think that with the patch for problem 1), you should no
> > longer experience the initial problem you mentioned (I did not succeed
> > in reproducing the problem, probably because of the timing that must
> > be right).
> >
> > Let me know if that helps you.
> > Matthieu
> >
> > 2011/10/6 Mark A. Grondona <mgrond...@llnl.gov>
> > On Wed, 5 Oct 2011 15:40:16 -0700, Sam Lang <saml...@gmail.com> wrote:
> > > Hello,
> > >
> > > I've run into a minor problem with cgroups when a task epilog script is
> > > specified in the config file. The issue appears to be that with
> > > ProctrackType=proctrack/cgroup, and with the appropriate release files
> > > in place so that cgroups are cleaned up properly
> > > (CgroupReleaseAgentDir=/srv/slurm/sbin), the task epilog script
> > > occasionally gives an error in a job's output file:
> > > slurm_container_add: No such process.
> > >
> > > It looks like the _run_script_as_user function forks and then execs the
> > > epilog script, while the parent tries to add the child process to the
> > > container, but sometimes (the timing has to be right) the script
> > > finishes (and the process exits) before the slurm_container_add
> > > function gets to writing the pid to the cgroup tasks file. This results
> > > in the ESRCH error.
> > >
> > > I've attached a patch that I believe fixes this, by making the child
> > > process wait (just like in _fork_all_tasks) on a pipe for the parent to
> > > write a character. This allows the parent to call slurm_container_add
> > > before the child process exits, ensuring that the pid will exist.
> > >
> > > This doesn't quite fix my problems though. In
> > > cgroups-fixes-with-sleep.patch, I've added a sleep(10); just after the
> > > fork in the parent process. With my patch and this sleep in place,
> > > slurm_container_add now fails with ENOENT, because the entire cgroup
> > > freezer directory is missing. I think this is because the release files
> > > are removing the cgroup when the task completes. In the patch, I remove
> > > the check to create the container only if cont_id == 0, but that
> > > still doesn't seem to resolve the issue, as the directory might only
> > > exist between when slurm_container_create is called and when the
> > > release script is called to remove the cgroup.
> > >
> > > I'm not sure what the right fix is at this point, and any help is
> > > greatly appreciated.
> > >
> > > In summary, the three patches are basically the same: one without the
> > > sleep and debugging messages, one with the sleep, and one with the
> > > sleep and debugging messages.
> > >
> > > Thanks,
> > > -sam
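Sam's pipe-based fix described above follows the same handshake pattern as _fork_all_tasks: the child blocks until the parent has recorded its pid, and only then execs. A minimal standalone sketch of that pattern follows; add_pid_to_container() is a placeholder standing in for slurm_container_add(), not a real SLURM symbol:

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Placeholder for slurm_container_add(): here it just logs the pid. */
static int add_pid_to_container(pid_t pid)
{
	fprintf(stderr, "would add pid %d to the step cgroup here\n", (int) pid);
	return 0;
}

/* Fork a script, but hold the child on a pipe until its pid has been
 * added to the container, so the pid cannot vanish before tracking. */
static pid_t run_script_synced(const char *path)
{
	int sync_pipe[2];
	pid_t pid;
	char go = 0;

	if (pipe(sync_pipe) < 0)
		return -1;

	if ((pid = fork()) == 0) {
		/* child: block until the parent writes the "go" byte */
		close(sync_pipe[1]);
		if (read(sync_pipe[0], &go, 1) != 1)
			_exit(127);
		close(sync_pipe[0]);
		execl(path, path, (char *) NULL);
		_exit(127);
	}

	/* parent: track the pid *before* releasing the child */
	close(sync_pipe[0]);
	if (pid > 0)
		add_pid_to_container(pid);
	if (write(sync_pipe[1], &go, 1) != 1)
		fprintf(stderr, "failed to release child\n");
	close(sync_pipe[1]);
	return pid;
}

int main(void)
{
	pid_t pid = run_script_synced("/bin/true");
	if (pid > 0)
		waitpid(pid, NULL, 0);
	return 0;
}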
> > Nice debug work. Clearly there is a race condition when using
> > proctrack/cgroup and the task epilog. When the last task exits,
> > there is a race between slurmstepd invoking the task epilog and
> > the release_agent removing the cgroup. I'm kind of unclear on
> > the concept of task epilogs, but assuming they are necessary,
> > this hole needs to be plugged somehow.
> >
> > There are a couple of solutions I can think of off-hand.
> >
> > 1. Put the slurmstepd into the freezer cgroup along with the job
> >    tasks. This would "hold" the cgroup until slurmstepd exits, thus
> >    the job step cgroup would not be released before the final task
> >    exits.
> >
> >    This isn't a good solution, though, because slurmstepd would then
> >    be frozen along with all other tasks when the job step gets
> >    a SIGSTOP. I believe this is why the slurmstepd was intentionally
> >    left out of the freezer cgroup in the first place.
> >
> > 2. Don't delete freezer cgroups for proctrack/cgroup with a
> >    release_agent. I think that perhaps the automatic release agent
> >    for proctrack/cgroup is not strictly necessary. The cgroup
> >    can be removed explicitly when slurm calls container_destroy().
> >    This is far simpler than the first solution. In fact, the code
> >    already tries to rmdir() the cgroup when slurm destroys the
> >    "container".
> >
> > Unfortunately a quick patch implementing #2 is a bit tricky, because
> > you would still need a release_agent for the user and job
> > freezer cgroups (since container_destroy() is only applicable to
> > the job step cgroup), and the code currently only lets you set
> > a release agent on the namespace as a whole or not at all.
> >
> > You could try something like the following (completely untested)
> >
> > --- a/src/plugins/proctrack/cgroup/proctrack_cgroup.c
> > +++ b/src/plugins/proctrack/cgroup/proctrack_cgroup.c
> > @@ -240,6 +240,8 @@ int _slurm_cgroup_create(slurmd_job_t *job, uint64_t id, uid_t uid, gid_t gid)
> >  		return SLURM_ERROR;
> >  	}
> >
> > +	xcgroup_set_param (&step_freezer_cg, "release_agent", "");
> > +
> >  	return SLURM_SUCCESS;
> >  }
> >
> > and manually clear the release_agent after the step cgroup has been
> > created. The other thing that would need to be done is to have the
> > "destroy" code wait indefinitely for the cgroup to be removed,
> > to ensure all tasks have really exited (this is one of the main
> > design goals of the proctrack plugin framework: to ensure the
> > job/job step isn't considered "complete" until all user tasks
> > have really exited).
> >
> > I'm actually kind of wondering if the user/job/step hierarchy
> > is necessary for proctrack/cgroup. Since it is only used for
> > process tracking of job steps, creating those extra nested
> > cgroups might not be buying us much. That is a separate question,
> > I guess.
> >
> > mark
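A rough sketch of option #2 above, combined with the notify_on_release approach used in the second patch below: disable automatic release on the step freezer cgroup at creation time, then have the destroy path retry rmdir() until every task has exited (rmdir() fails with EBUSY while the cgroup still has tasks). The cgroup path and helper below are illustrative only, not what the plugin actually computes:

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Illustrative path only; the real plugin builds this from the configured
 * cgroup mount point and the uid/job/step ids. */
#define STEP_FREEZER "/sys/fs/cgroup/freezer/slurm/uid_1000/job_42/step_0"

/* Write a single cgroup control file, e.g. notify_on_release. */
static int write_param(const char *cgroup, const char *param, const char *val)
{
	char path[4096];
	FILE *fp;

	snprintf(path, sizeof(path), "%s/%s", cgroup, param);
	if (!(fp = fopen(path, "w")))
		return -1;
	fprintf(fp, "%s\n", val);
	return fclose(fp);
}

int main(void)
{
	/* at step-cgroup creation: keep the release agent out of the way */
	if (write_param(STEP_FREEZER, "notify_on_release", "0") < 0)
		fprintf(stderr, "notify_on_release: %s\n", strerror(errno));

	/* ... the job step runs, task pids are added to the tasks file ... */

	/* at container destroy: rmdir() fails with EBUSY while tasks remain,
	 * so retry until the last one has really exited */
	while (rmdir(STEP_FREEZER) < 0 && errno == EBUSY) {
		fprintf(stderr, "step cgroup still busy, retrying\n");
		sleep(1);
	}
	return 0;
}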
> > From 6272876a28bf9dd2a377eb2273dc152e8e05fcfb Mon Sep 17 00:00:00 2001
> > From: Matthieu Hautreux <matthieu.hautr...@cea.fr>
> > Date: Thu, 6 Oct 2011 11:28:14 +0200
> > Subject: [PATCH 1/2] xcgroup: no longer treat ESRCH as an error when adding a
> >  pid to a cgroup
> >
> > A delay occurs between a task's creation and its addition to a different
> > cgroup than the inherited one. In the meantime, the process can disappear,
> > resulting in an ESRCH during the addition to the second cgroup. Now treat
> > that event as a warning instead of an error.
> > ---
> >  src/common/xcgroup.c | 3 ++-
> >  1 files changed, 2 insertions(+), 1 deletions(-)
> >
> > diff --git a/src/common/xcgroup.c b/src/common/xcgroup.c
> > index d6caecc..eda75f0 100644
> > --- a/src/common/xcgroup.c
> > +++ b/src/common/xcgroup.c
> > @@ -827,7 +827,8 @@ int _file_write_uint64s(char* file_path, uint64_t* values, int nb)
> >  		if (rc < 1) {
> >  			debug2("unable to add value '%s' to file '%s' : %m",
> >  			       tstr, file_path);
> > -			fstatus = XCGROUP_ERROR;
> > +			if ( errno != ESRCH )
> > +				fstatus = XCGROUP_ERROR;
> >  		}
> >
> >  	}
> > --
> > 1.7.6.2
> >
> > From 1888616114f2a5c136e0c4313ec03867db67bc2f Mon Sep 17 00:00:00 2001
> > From: Matthieu Hautreux <matthieu.hautr...@cea.fr>
> > Date: Thu, 6 Oct 2011 11:40:49 +0200
> > Subject: [PATCH 2/2] proctrack/cgroup: inhibit release agent for the step
> >  cgroup
> >
> > With the release_agent notified at the step cgroup level, the step cgroup
> > can be removed while slurmstepd has not yet finished its internal
> > epilog mechanisms. Inhibiting the release agent at the step level and
> > manually removing the step cgroup helps ensure that epilog or
> > cleanup tasks can still be added to the proctrack container.
> > ---
> >  src/plugins/proctrack/cgroup/proctrack_cgroup.c | 10 ++++++++++
> >  1 files changed, 10 insertions(+), 0 deletions(-)
> >
> > diff --git a/src/plugins/proctrack/cgroup/proctrack_cgroup.c b/src/plugins/proctrack/cgroup/proctrack_cgroup.c
> > index 1265326..9278bb2 100644
> > --- a/src/plugins/proctrack/cgroup/proctrack_cgroup.c
> > +++ b/src/plugins/proctrack/cgroup/proctrack_cgroup.c
> > @@ -239,6 +239,10 @@ int _slurm_cgroup_create(slurmd_job_t *job, uint64_t id, uid_t uid, gid_t gid)
> >  		xcgroup_destroy(&step_freezer_cg);
> >  		return SLURM_ERROR;
> >  	}
> > +	/* inhibit the release agent for the step cgroup, thus letting
> > +	 * slurmstepd add new pids to the container when the job ends
> > +	 * (TaskEpilog, ...) */
> > +	xcgroup_set_param(&step_freezer_cg, "notify_on_release", "0");
> >
> >  	return SLURM_SUCCESS;
> >  }
> >
> > @@ -246,6 +250,12 @@ int _slurm_cgroup_destroy(void)
> >  int _slurm_cgroup_destroy(void)
> >  {
> >  	if (jobstep_cgroup_path[0] != '\0') {
> > +		/* enable notification on release again in case
> > +		 * the delete does not work properly because of a
> > +		 * stuck process; this lets the system manage the
> > +		 * removal when possible */
> > +		xcgroup_set_param(&step_freezer_cg,
> > +				  "notify_on_release", "1");
> >  		xcgroup_delete(&step_freezer_cg);
> >  		xcgroup_destroy(&step_freezer_cg);
> >  	}
> > --
> > 1.7.6.2