On Thu, 6 Oct 2011 04:47:59 -0700, Matthieu Hautreux <matthieu.hautr...@gmail.com> wrote:
> Hi,
>
> IMHO, there are two distinct problems:
>
> 1) a task can finish before being added to a cgroup, resulting in an
>    ESRCH error. That is Sam's initial problem.
Yes, you are right, I kind of glossed over that one. I think Sam's
approach of using the same fork/exec mechanism used in fork_all_tasks
was the right one. Note that ignoring the ESRCH error instead actually
opens a hole in the proctrack/cgroup mechanism: a user could immediately
fork again and SLURM would lose track of the processes. So I would argue
that for #1 we use Sam's patch, or better yet abstract the
fork/wait-for-read/exec logic into a new function and use it whenever
slurmstepd executes a process on behalf of the user.

> 2) a task can no longer be added to a step cgroup because that cgroup
>    no longer exists. That is a collateral problem that appears with the
>    sleep(10) patch and proves that there is a potential race condition.
>
> Concerning 1), I think that an ESRCH error should be treated as a
> warning and no longer as an error in xcgroup_add_pids.
>
> Concerning 2), there are 2 options:
>
>  - consider that the step cgroup is mostly dedicated to hosting the
>    user's tasks while the job is running. Once the user's tasks have
>    finished, the step cgroup is no longer required and can be removed.
>    The new internal tasks could thus be left in the slurmstepd cgroup.
>
>  - consider that the step cgroup must be manually managed and should
>    not use the release agent, as proposed by Mark.
>
> I have just implemented 2 patches, not much tested yet, that implement
> the proposal for 1) and the second proposal for 2).

Yes, looking at your patches I realize my one-liner was not correct.

However, I still think slurmstepd should wait around for all tasks to
exit the step cgroup, and not release the job/job step until all user
processes are gone from the node. Otherwise, the node may be set back
to idle by SLURM and the next job scheduled while the other user's code
is still running, perhaps using all the RAM or stuck in the kernel
spinning on one or more CPUs. In fact, I believe preventing exactly
that was the main reason we initially developed the proctrack plugin
infrastructure.

Also, I would be interested in what you think of not using the
uid/jobid/stepid hierarchy for proctrack/cgroup. It seems like
unnecessary complexity for the freezer cgroup. I would argue that:

 1. all freezer cgroups go under a slurm/ hierarchy as
    slurm/jobid.stepid

 2. slurmstepd always manually manages cgroups in proctrack/cgroup, and
    keeps a job in completing until all tasks have exited the cgroup.
    After a small timeout, slurmstepd repeatedly tries to kill the
    tasks in the cgroup, as is done in other cgroups.

I think this will simplify proctrack/cgroup, and also make it act more
like the other proctrack plugins. (In fact, I will have to implement at
least #2 at our site, because we are not willing to have tasks from the
last job running when the next job starts, i.e. we don't want resources
released until they are really free.)

mark

> Mark, as you guessed, putting slurmstepd in the step freezer cgroup is
> not an option, as it would result in a chicken-and-egg problem when
> resuming jobs after a suspend.
>
> Sam, I think that with the patch for problem 1), you should no longer
> experience the initial problem you mentioned (I did not succeed in
> reproducing the problem, probably because the timing has to be just
> right).
>
> Let me know if that helps you.
> Matthieu
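To make the fork/wait-for-read/exec pattern described above concrete,
here is a minimal, untested sketch. The helper name is invented, and
slurm_container_add() is declared with a simplified stand-in prototype
(the real function takes a slurmd_job_t pointer); this illustrates the
synchronization only, it is not the actual SLURM code.

#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

extern int slurm_container_add(void *job, pid_t pid); /* stand-in prototype */

static pid_t fork_and_add_to_container(void *job, char *const argv[])
{
        int fds[2];
        pid_t pid;
        char c = 0;

        if (pipe(fds) < 0)
                return -1;

        if ((pid = fork()) == 0) {
                /* child: block until the parent has added us to the
                 * container, then exec -- the pid is guaranteed to be
                 * in the cgroup before the program can run and exit */
                close(fds[1]);
                while (read(fds[0], &c, 1) < 0 && errno == EINTR)
                        ;
                close(fds[0]);
                execv(argv[0], argv);
                _exit(127);
        }

        close(fds[0]);
        if (pid > 0) {
                /* parent: add the child to the container, then release it */
                slurm_container_add(job, pid);
                while (write(fds[1], &c, 1) < 0 && errno == EINTR)
                        ;
        }
        close(fds[1]);
        return pid;
}

This is essentially what Sam's patch does in _run_script_as_user,
modeled on the existing _fork_all_tasks logic, just factored into a
reusable helper.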
> 2011/10/6 Mark A. Grondona <mgrond...@llnl.gov>
> On Wed, 5 Oct 2011 15:40:16 -0700, Sam Lang <saml...@gmail.com> wrote:
> >
> > Hello,
> >
> > I've run into a minor problem with cgroups when a task epilog script
> > is specified in the config file. The issue appears to be that with
> > ProctrackType=proctrack/cgroup, and with the appropriate release
> > files in place so that cgroups are cleaned up properly
> > (CgroupReleaseAgentDir=/srv/slurm/sbin), the task epilog script
> > occasionally gives an error in a job's output file:
> > slurm_container_add: No such process.
> >
> > It looks like the _run_script_as_user function forks and then execs
> > the epilog script, while the parent tries to add the child process
> > to the container, but sometimes (the timing has to be right) the
> > script finishes (and the process exits) before the
> > slurm_container_add function gets to writing the pid to the cgroup
> > tasks file. This results in the ESRCH error.
> >
> > I've attached a patch that I believe fixes this, by making the child
> > process wait (just like in _fork_all_tasks) on a pipe for the parent
> > to write a character. This allows the parent to call
> > slurm_container_add before the child process exits, ensuring that
> > the pid will exist.
> >
> > This doesn't quite fix my problems, though. In
> > cgroups-fixes-with-sleep.patch, I've added a sleep(10); just after
> > the fork in the parent process. With my patch and this sleep in
> > place, slurm_container_add now fails with ENOENT, because the entire
> > cgroup freezer directory is missing. I think this is because the
> > release files are removing the cgroup when the task completes. In
> > the patch, I remove the check to create the container only if the
> > cont_id == 0, but that still doesn't seem to resolve the issue, as
> > the directory might exist only between when slurm_container_create
> > is called and when the release script is called to remove the
> > cgroup.
> >
> > I'm not sure what the right fix is at this point, and any help is
> > greatly appreciated. In summary, the three patches are basically the
> > same: one without the sleep and debugging messages, one with the
> > sleep, and one with the sleep and debugging messages.
> >
> > Thanks,
> > -sam
> >
>
> Nice debugging work. Clearly there is a race condition when using
> proctrack/cgroup and the task epilog. When the last task exits, there
> is a race between slurmstepd invoking the task epilog and the
> release_agent removing the cgroup. I'm kind of unclear on the concept
> of task epilogs, but assuming they are necessary, this hole needs to
> be plugged somehow.
>
> There are a couple of solutions I can think of off-hand.
>
> 1. Put the slurmstepd into the freezer cgroup along with the job
>    tasks. This would "hold" the cgroup until slurmstepd exits, thus
>    the job step cgroup would not be released before the final task
>    exits.
>
>    This isn't a good solution, though, because slurmstepd would then
>    be frozen along with all other tasks when the job step gets a
>    SIGSTOP. I believe this is why slurmstepd was intentionally left
>    out of the freezer cgroup in the first place.
>
> 2. Don't delete freezer cgroups for proctrack/cgroup with a
>    release_agent. I think that perhaps the automatic release agent for
>    proctrack/cgroup is not strictly necessary. The cgroup can be
>    removed explicitly when slurm calls container_destroy(). This is
>    far simpler than the first solution.
>    In fact, the code already tries to rmdir() the cgroup when slurm
>    destroys the "container".
>
> Unfortunately, a quick patch implementing #2 is a bit tricky, because
> you would still need a release_agent for the user and job freezer
> cgroups (since container_destroy() is only applicable to the job step
> cgroup), and the code currently only lets you set a release agent on
> the namespace as a whole or not at all.
>
> You could try something like the following (completely untested)
>
> --- a/src/plugins/proctrack/cgroup/proctrack_cgroup.c
> +++ b/src/plugins/proctrack/cgroup/proctrack_cgroup.c
> @@ -240,6 +240,8 @@ int _slurm_cgroup_create(slurmd_job_t *job, uint64_t id, uid
>                  return SLURM_ERROR;
>          }
>
> +        xcgroup_set_param (&step_freezer_cg, "release_agent", "");
> +
>          return SLURM_SUCCESS;
>  }
>
> and manually clear the release_agent after the step cgroup has been
> created. The other thing that would need to be done is to have the
> "destroy" code wait indefinitely for the cgroup to be removed, to
> ensure all tasks have really exited (this is one of the main design
> goals of the proctrack plugin framework: to ensure the job/job step
> isn't considered "complete" until all user tasks have really exited).
>
> I'm actually kind of wondering if the user/job/step hierarchy is
> necessary for proctrack/cgroup. Since it is only used for process
> tracking of job steps, creating those extra nested cgroups might not
> be buying us much. That is a separate question, I guess.
>
> mark
>
> From 6272876a28bf9dd2a377eb2273dc152e8e05fcfb Mon Sep 17 00:00:00 2001
> From: Matthieu Hautreux <matthieu.hautr...@cea.fr>
> Date: Thu, 6 Oct 2011 11:28:14 +0200
> Subject: [PATCH 1/2] xcgroup: no longer treat ESRCH as an error when adding a
>  pid to cgroup
>
> A delay occurs between a task's creation and its addition to a cgroup
> other than the inherited one. In the meantime, the process can
> disappear, resulting in an ESRCH during the addition to the second
> cgroup. Now treat that event as a warning instead of an error.
> ---
>  src/common/xcgroup.c |    3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/src/common/xcgroup.c b/src/common/xcgroup.c
> index d6caecc..eda75f0 100644
> --- a/src/common/xcgroup.c
> +++ b/src/common/xcgroup.c
> @@ -827,7 +827,8 @@ int _file_write_uint64s(char* file_path, uint64_t* values, int nb)
>          if (rc < 1) {
>                  debug2("unable to add value '%s' to file '%s' : %m",
>                         tstr, file_path);
> -                fstatus = XCGROUP_ERROR;
> +                if ( errno != ESRCH )
> +                        fstatus = XCGROUP_ERROR;
>          }
>
>  }
> --
> 1.7.6.2
>
> From 1888616114f2a5c136e0c4313ec03867db67bc2f Mon Sep 17 00:00:00 2001
> From: Matthieu Hautreux <matthieu.hautr...@cea.fr>
> Date: Thu, 6 Oct 2011 11:40:49 +0200
> Subject: [PATCH 2/2] proctrack/cgroup: inhibit release agent for the step
>  cgroup
>
> With the release_agent notified at the step cgroup level, the step
> cgroup can be removed while slurmstepd has not yet finished its
> internal epilog mechanisms. Inhibiting the release agent at the step
> level and manually removing the step cgroup helps ensure that epilog
> or cleanup tasks can still be added to the proctrack container.
> ---
>  src/plugins/proctrack/cgroup/proctrack_cgroup.c |   10 ++++++++++
>  1 files changed, 10 insertions(+), 0 deletions(-)
>
> diff --git a/src/plugins/proctrack/cgroup/proctrack_cgroup.c b/src/plugins/proctrack/cgroup/proctrack_cgroup.c
> index 1265326..9278bb2 100644
> --- a/src/plugins/proctrack/cgroup/proctrack_cgroup.c
> +++ b/src/plugins/proctrack/cgroup/proctrack_cgroup.c
> @@ -239,6 +239,10 @@ int _slurm_cgroup_create(slurmd_job_t *job, uint64_t id, uid_t uid, gid_t gid)
>                  xcgroup_destroy(&step_freezer_cg);
>                  return SLURM_ERROR;
>          }
> +        /* inhibit the release agent for the step cgroup, letting
> +         * slurmstepd add new pids to the container when the job
> +         * ends (TaskEpilog, ...) */
> +        xcgroup_set_param(&step_freezer_cg, "notify_on_release", "0");
>
>          return SLURM_SUCCESS;
>  }
> @@ -246,6 +250,12 @@ int _slurm_cgroup_destroy(void)
>  {
>          if (jobstep_cgroup_path[0] != '\0') {
> +                /* enable notification on release again in case the
> +                 * delete does not work properly because of a stuck
> +                 * process; this lets the system manage the removal
> +                 * when possible */
> +                xcgroup_set_param(&step_freezer_cg,
> +                                  "notify_on_release", "1");
>                  xcgroup_delete(&step_freezer_cg);
>                  xcgroup_destroy(&step_freezer_cg);
>          }
> --
> 1.7.6.2
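As a footnote to Mark's point about having the "destroy" code wait for
the cgroup to empty out, below is a minimal, untested sketch of what a
manually managed teardown loop could look like. The function name, the
1-second retry interval, and reading the tasks file directly are
illustrative assumptions, not the SLURM implementation (which would
presumably go through the xcgroup helpers), and a real version would
also need to thaw the freezer first so that frozen tasks can actually
act on the signals.

#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Repeatedly SIGKILL whatever is left in the cgroup and retry rmdir()
 * until the cgroup directory is gone, so the step is not considered
 * complete while user tasks remain. */
static int drain_and_remove_cgroup(const char *cg_path)
{
        char tasks_path[4096];
        snprintf(tasks_path, sizeof(tasks_path), "%s/tasks", cg_path);

        for (;;) {
                /* an empty cgroup can be removed; ENOENT means the
                 * release agent (or someone else) already cleaned up */
                if (rmdir(cg_path) == 0 || errno == ENOENT)
                        return 0;

                FILE *fp = fopen(tasks_path, "r");
                if (fp == NULL)
                        return -1;

                int pid;
                while (fscanf(fp, "%d", &pid) == 1)
                        kill((pid_t)pid, SIGKILL);
                fclose(fp);

                sleep(1); /* give the tasks a moment to exit, then retry */
        }
}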