Hello
I defined "TmpDisk=930000" for some machines in slurm 20.02.3 (and TmpFS is
set to a local volume slightly bigger than that) and when I run...
sbatch --tmp=100000 -w node01 --array=1-100 --wrap="sleep 300"
I end up with 36 jobs on the machine at a time, 1 per CPU core. I expected the
--tmp option to limit it to 9 jobs at a time, since the node is defined as
having 930000MB of TmpDisk.
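To spell out the arithmetic behind that expectation, here is a small shell sketch. The variable names are mine, for illustration only; they are not Slurm settings, and the calculation assumes --tmp would be consumed per job out of TmpDisk.

```shell
# Values from my setup; variable names are illustrative, not Slurm parameters.
tmpdisk_mb=930000      # TmpDisk configured for the node, in MB
tmp_per_job_mb=100000  # what each array task requests with --tmp, in MB
cpus=36                # CPU cores on the node

# Integer division: how many jobs would fit if --tmp were a consumable resource.
expected_jobs=$(( tmpdisk_mb / tmp_per_job_mb ))
echo "expected concurrent jobs: ${expected_jobs}, observed: ${cpus}"
```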
If I up the option to "--tmp=1000000" sbatch rejects the job because "Temporary
disk specification cannot be satisfied" so this should not be a typo in the
config or unit conversion issue.
I would expect this to be treated as a managed resource that could help limit
how many jobs land on a machine.
Am I misunderstanding how the "--tmp" option is supposed to work?
And a general question regarding TMPDIR and TmpFS...
I understand that TMPDIR is set to /tmp by slurm regardless of what TmpFS is
set to, and that local sites are expected to define TMPDIR in a prolog or
plugin if they feel it is necessary. I would have expected TmpFS to affect the
value of TMPDIR by default.
What is the reasoning behind the decision not to set TMPDIR to something like
${TmpFS}/${SLURM_JOB_ID}?
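To make the question concrete, this is roughly the kind of site-local workaround I have in mind: a minimal TaskProlog sketch, not something slurm ships. It relies on the TaskProlog convention that lines printed to stdout in the form "export NAME=value" are added to the task's environment. TMPFS_ROOT and the function name are hypothetical, the /local/scratch default is a made-up path standing in for whatever TmpFS points at, and the per-job directory would still have to be created (and cleaned up) elsewhere, for example in a Prolog/Epilog pair.

```shell
#!/bin/sh
# Hypothetical TaskProlog sketch. TMPFS_ROOT is an assumed variable that should
# mirror the TmpFS value in slurm.conf; it is not read from slurm itself.

# Print the "export NAME=value" line that a TaskProlog uses to set a task
# environment variable. Arguments: $1 = TmpFS path, $2 = job ID.
emit_tmpdir_export() {
    echo "export TMPDIR=$1/$2"
}

# SLURM_JOB_ID is set by slurm inside a job; fall back to 0 outside of one.
emit_tmpdir_export "${TMPFS_ROOT:-/local/scratch}" "${SLURM_JOB_ID:-0}"
```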
Is there any documented discussion on slurm's expected use of TmpFS/TMPDIR or
the philosophy behind it that I can read?
Thanks