Hello
I defined "TmpDisk=930000" for some machines in Slurm 20.02.3 (and TmpFS is 
set to a local volume slightly bigger than that), and when I ran

  sbatch --tmp=100000 -w node01 --array=1-100 --wrap="sleep 300"

I ended up with 36 jobs on the machine at a time, 1 per CPU core. I expected the 
--tmp option to limit it to 9 jobs at a time, since the node is defined as 
having 930000 MB of TmpDisk.
If I raise the option to "--tmp=1000000", sbatch rejects the job with "Temporary 
disk specification cannot be satisfied", so this should not be a typo in the 
config or a unit conversion issue.
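
To spell out the arithmetic behind the 9-job expectation, here is a small shell sketch. It assumes that --tmp would be counted against TmpDisk as a consumable resource, which is exactly the behaviour I am asking about and not something I have confirmed. The variable names are only for illustration.

```shell
# Values from my configuration and job submission, both in MB.
tmpdisk_mb=930000       # TmpDisk defined for node01
tmp_per_job_mb=100000   # --tmp requested per array task

# Integer division: how many tasks would fit if --tmp were consumed
# from TmpDisk (an assumption, not confirmed Slurm behaviour).
expected_jobs=$(( tmpdisk_mb / tmp_per_job_mb ))
echo "$expected_jobs"
```

What I actually observe is 36 concurrent tasks, which matches the CPU core count and not this number.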

I would expect this to be treated as a managed resource that could help limit 
how many jobs land on a machine.

Am I misunderstanding how the "--tmp" option is supposed to work?


And a general question regarding TMPDIR and TmpFS...

I understand that TMPDIR is set to /tmp by Slurm regardless of what TmpFS is 
set to, and that local sites are expected to define TMPDIR in a prolog or 
plugin if they feel it is necessary. I would have expected TmpFS to affect the 
value of TMPDIR by default.

What is the reasoning behind the decision not to set TMPDIR to something like 
${TmpFS}/${SLURM_JOB_ID}?
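
To make concrete what I mean by defining it in a prolog, here is a minimal sketch of a site-local TaskProlog script. It relies on the documented TaskProlog behaviour that lines printed in the form "export NAME=value" are added to the task's environment. TMPFS_ROOT is a hypothetical variable standing in for whatever path TmpFS points to, the function name is my own, and the sketch assumes the job's user may create directories under that path. Cleanup would need a matching epilog, which is not shown.

```shell
#!/bin/bash
# Sketch of a TaskProlog script, not a tested site configuration.
# TMPFS_ROOT is a hypothetical stand-in for the site's TmpFS path.

emit_job_tmpdir() {
    local root="${TMPFS_ROOT:-/tmp}"
    local job_tmp="${root}/${SLURM_JOB_ID}"

    # Create the per-job directory; give up quietly if that fails so
    # the task keeps whatever TMPDIR it already had.
    mkdir -p "$job_tmp" || return 1

    # A line of this form on stdout is applied to the task environment.
    echo "export TMPDIR=${job_tmp}"
}

# Only emit the setting when running inside a Slurm job.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    emit_job_tmpdir
fi
```

This is the sort of thing I would have expected Slurm to do by default when TmpFS is set.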

Is there any documented discussion on slurm's expected use of TmpFS/TMPDIR or 
the philosophy behind it that I can read?

Thanks

