Hi, we had a nice debugging session here with batch jobs writing data to the wrong temporary directories, which were then also not cleaned up. It turns out that this is due to Slurm's feature of creating TMPDIR if the environment variable points to a nonexistent directory (and of setting TMPDIR to something in any case).
We got users being unable to run jobs because the temporary directories
created by Slurm filled up their disk quota. The scenario before I
fixed things:
1. We have a mechanism to create a per-session TMPDIR in a script
in /etc/profile.d, with cleanup of the same using a trap on shell exit.
This is meant for normal user logins.
2. We have prolog and epilog of Slurm jobs managing global/local
TMPDIRs for jobs. Epilog cleans that up just fine.
3. We have an environment module that reconstructs temporary directory
paths to match what the prolog created or uses what the profile script
created (separate environment variable), sets TMPDIR to one of these.
4. Slurm sources profile (directly via /etc/profile or even
via /etc/bashrc since it is set up like that on CentOS) in Job
preparation without having exported SLURM_JOBID yet. Modules get loaded
… are confused … a TMPDIR gets set, but not actually created. Yet.
5. Slurm decides to create the false TMPDIR.
6. Job runs, uses the created TMPDIR.
7. Job ends. TMPDIR stays. Fills up quota.
8. Users have a problem.
9. I have a problem.
I have adapted our scripts now to detect the job startup phase (no
SLURM_JOBID set) and avoid emitting a false TMPDIR that prompts Slurm
to create it.
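A rough sketch of that idea, not our actual scripts: only export TMPDIR when the directory really exists, and otherwise leave it unset so that nothing prompts Slurm to create a stray directory. The helper name choose_tmpdir and the variables JOB_TMPDIR_BASE and SESSION_TMPDIR are made up for illustration, as is the job.<id> naming.

```shell
# Hypothetical helper: print the first argument that names an existing,
# writable directory; return non-zero if none qualifies.
choose_tmpdir() {
    for candidate in "$@"; do
        if [ -n "$candidate" ] && [ -d "$candidate" ] && [ -w "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Assumed layout: the prolog created $JOB_TMPDIR_BASE/job.<id>.
# Without SLURM_JOBID (job startup phase) no job path is reconstructed.
if [ -n "${SLURM_JOBID:-}" ]; then
    job_tmp="${JOB_TMPDIR_BASE:-/scratch}/job.${SLURM_JOBID}"
else
    job_tmp=""
fi

# SESSION_TMPDIR stands in for the separate variable set by the
# profile script for normal logins.
if chosen=$(choose_tmpdir "$job_tmp" "${SESSION_TMPDIR:-}"); then
    TMPDIR=$chosen
    export TMPDIR
else
    # Nothing usable: do not emit a path that does not exist.
    unset TMPDIR
fi
```

The point of the design is that a wrong guess results in no TMPDIR from our side rather than a plausible-looking path that gets created behind our back.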
I do have to wonder why Slurmd should try to do that at all. I think it
is overstepping its boundaries here. Generally, I am not a fan of the
default shell environment exporting; we tell users to use --export=NONE
for all jobs, also unset SLURM_EXPORT_ENV to get a consistent state for
job scripts. Batch scripts need to be reasonably complete and
repeatable. Having random environment variables set for them just
because they were present in the shell that invoked sbatch is really
dangerous IMHO. Creating a TMPDIR for the job goes into the same
direction.
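For reference, the top of a batch script following that advice might look roughly like this. It is a minimal sketch; the final logging line is my addition for illustration and not part of what we tell users.

```shell
#!/bin/bash
#SBATCH --export=NONE
# Do not inherit the environment of the shell that invoked sbatch.

# Drop the export setting that is passed along into the job, so that
# the rest of the script starts from a consistent state.
unset SLURM_EXPORT_ENV

# Log the remaining environment so that runs can be compared.
env | sort
```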
We had a misconfiguration where a shell-session TMPDIR name was
exported for the job. Instead of jobs failing to write to the
non-existent directory and thus alerting us of our mistake, jobs
happily started filling up users' disk quota with directories that were
not managed (deleted after use).
Mind that I do not think that Slurm deleting its TMPDIR after the job
ends would be the true fix (while it might still be an idea to
consider). The decision where to put TMPDIR and how to handle its
lifetime is an important site-specific one. When throwing many TiB of
data around, you just should not try to guess what might be the correct
place for that. Too much magic. Keep it simple (unlike this too long
post).
Would the Slurm community mind a switch to slurm.conf to disable TMPDIR
handling altogether? I see
    if (i == 0)
        _make_tmpdir(job);
in exec_task() of src/slurmd/slurmstepd/task.c (version 14.11.8, what
we are using in production). That should get another condition on a
configuration setting. While at that, I might like to deactivate other
automatisms, but let's keep on topic with the TMPDIR stuff. Any reason
why _make_tmpdir() has to be called, other than being a nanny for users
that might have some shell startup script that sets a non-existing
TMPDIR? Even the idea of setting it to /tmp irritates me:
    if (!(tmpdir = getenvp(job->env, "TMPDIR")))
        setenvf(&job->env, "TMPDIR", "/tmp"); /* task may want it set */
    else if (mkdir(tmpdir, 0700) < 0) {
The comment says "task may want it set". May. Maybe not. I would
appreciate the batch system not to guess, only adding environment
variables related to batch job setup (SLURM_* variables) and leave the
environment pretty please alone apart from that. Am I alone with that?
In the end, you do not spare the users any work, at least when they
rely as heavily on environment modules as we do here. You have to start your
job script with something along the lines of
    . /modules/init.sh
to get the module function definition imported into the current shell
anyway (unless relying on `declare -f module` and forcing use
of /bin/bash as shell, using the feature that brought us
https://en.wikipedia.org/wiki/Shellshock_%28software_bug%29
). Such a sourced script can set up the whole environment including TMPDIR,
especially deriving things from the SLURM_* variables, too, without any
additional burden on the end user.
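To illustrate what such a sourced script could do, here is a sketch that derives the job's TMPDIR from the job ID and fails loudly if the directory is missing, instead of creating it. The function names and the slurm_job_<id> naming scheme are assumptions for the example; a real site would match whatever its prolog creates.

```shell
# Hypothetical naming scheme: <base>/slurm_job_<jobid>, assumed to be
# created by the prolog before the job script runs.
site_job_tmpdir() {
    printf '%s/slurm_job_%s\n' "$1" "$2"
}

# Set and export TMPDIR for the given base directory and job ID.
# Refuses to create anything: a missing directory is a site error
# that should surface, not be papered over.
setup_job_tmpdir() {
    base=$1
    jobid=$2
    if [ -z "$jobid" ]; then
        echo "setup_job_tmpdir: no job ID given, leaving TMPDIR alone" >&2
        return 1
    fi
    dir=$(site_job_tmpdir "$base" "$jobid")
    if [ ! -d "$dir" ]; then
        echo "setup_job_tmpdir: $dir does not exist, refusing to create it" >&2
        return 1
    fi
    TMPDIR=$dir
    export TMPDIR
}
```

An init script could then call something like `setup_job_tmpdir /scratch "${SLURM_JOBID:-}" || exit 1` (with /scratch as a placeholder base), so a misconfiguration stops the job right away.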
Any code in Slurm that messes with the environment is no help and
possibly actively works against our efforts to produce reliable job
setups. Since that functionality is present in released versions of
Slurm and folks might depend on it, I at least hope that configuration
options to deactivate the intermediate to higher magic will be welcomed
to be included in the codebase. Not that I have patches ready, but I
might find time to prepare them.
Alrighty then,
Thomas
--
Dr. Thomas Orgis
Universität Hamburg
RRZ / Zentrale Dienste / HPC
Schlüterstr. 70
20146 Hamburg
Tel.: 040/42838 8826
Fax: 040/428 38 6270
