On Wed, 5 Oct 2011 15:40:16 -0700, Sam Lang <[email protected]> wrote:
>
> Hello,
>
> I've run into a minor problem with cgroups when a task epilog script is
> specified in the config file. The issue appears to be that with the
> ProctrackType=proctrack/cgroup, and with the appropriate release files
> in place so that cgroups are cleaned up properly
> (CgroupReleaseAgentDir=/srv/slurm/sbin), the task epilog script
> occasionally gives an error in a job's output file:
> slurm_container_add: No such process.
>
> It looks like the _run_script_as_user function forks and then execs the
> epilog script, while the parent tries to add the child process to the
> container, but sometimes (the timing has to be right), the script
> finishes (and the process exits), before the slurm_container_add
> function gets to writing the pid to the cgroup tasks file. This results
> in the ESRCH error.
>
> I've attached a patch that I believe fixes this, by making the child
> process wait (just like in _fork_all_tasks) on a pipe for the parent to
> write a character. This allows the parent to call slurm_container_add
> before the child process exits, ensuring that the pid will exist.
>
> This doesn't quite fix my problems though. In
> cgroups-fixes-with-sleep.patch, I've added a sleep(10); just after the
> fork in the parent process. With my patch and this sleep in place, the
> slurm_container_add now fails with ENOENT, because the entire cgroup
> freezer directory is missing. I think this is because the release files
> are removing the cgroup when the task completes. In the patch, I remove
> the check to create the container only if the cont_id == 0, but that
> still doesn't seem to resolve the issue, as the directory may only exist
> between when slurm_container_create is called and when the release
> script is called to remove the cgroup.
>
> I'm not sure what the right fix is at this point, and any help is
> greatly appreciated. In summary, the three patches are basically the
> same, one without the sleep and debugging messages, one with the sleep,
> and one with the sleep and debugging messages.
>
> Thanks,
> -sam
>
Nice debug work. Clearly there is a race condition when using
proctrack/cgroup and the task epilog. When the last task exits,
there is a race between slurmstepd invoking task epilog, and
the release_agent removing the cgroup. I'm kind of unclear on
the concept of task epilogs, but assuming they are necessary,
this hole needs to be plugged somehow.
There are a couple of solutions I can think of off-hand.
1. Put the slurmstepd into the freezer cgroup along with the job
tasks. This would "hold" the cgroup until slurmstepd exits, thus
the job step cgroup would not be released before the final task
exits.
This isn't a good solution, though, because slurmstepd would then
be frozen along with all other tasks when the job step gets
a SIGSTOP. I believe this is why the slurmstepd was intentionally
left out of the freezer cgroup in the first place.
2. Don't delete freezer cgroups for proctrack/cgroup with a
release_agent. I think that perhaps the automatic release agent
for proctrack/cgroup is not strictly necessary. The cgroup
can be removed explicitly when slurm calls container_destroy().
This is far simpler than the first solution. In fact, the code
already tries to rmdir() the cgroup when slurm destroys the
"container".
Unfortunately a quick patch implementing #2 is a bit tricky, because
you would still need a release_agent for the user and job
freezer cgroups (since container_destroy() is only applicable to
the job step cgroup), and the code currently only lets you set
a release agent on the namespace as a whole or not at all.
You could try something like the following (completely untested):
--- a/src/plugins/proctrack/cgroup/proctrack_cgroup.c
+++ b/src/plugins/proctrack/cgroup/proctrack_cgroup.c
@@ -240,6 +240,8 @@ int _slurm_cgroup_create(slurmd_job_t *job, uint64_t id, uid
 		return SLURM_ERROR;
 	}
+	xcgroup_set_param (&step_freezer_cg, "release_agent", "");
+
 	return SLURM_SUCCESS;
 }
and manually clear the release_agent after the step cgroup has been
created. The other thing that would need to be done is to have the
"destroy" code wait indefinitely for the cgroup to be removed,
to ensure all tasks have really exited (this is one of the main
design goals of the proctrack plugin framework, to ensure the
job/job step isn't considered "complete" until all user tasks
have really exited).
I'm actually kind of wondering if the user/job/step hierarchy
is necessary for proctrack/cgroup. Since it is only used for
process tracking of job steps, creating those extra nested
cgroups might not be buying us much. That is a separate question
I guess.
mark