On 2014-04-23T17:26:36 EEST, Alfonso Pardo wrote:
> Hi,
> I got some errors from prematurely terminated jobs, with this message:
> slurmd[bd-p14-01]: error: unlink(/tmp/slurmd/job60560/slurm_script): No such file or directory
> slurmd[bd-p14-01]: error: rmdir(/tmp/slurmd/job60560): No such file or directory
> Should the “TmpFS” location be a shared file system?
No. Or maybe it's possible, but why? Typically /tmp is considered a
machine-local directory.
That being said, the error messages you quote have nothing to do with
the slurm.conf TmpFS setting; rather, they indicate that your
SlurmdSpoolDir is set to "/tmp/slurmd". That is likely a bad idea, as
there might be various /tmp cleaner scripts such as tmpwatch emptying
/tmp regularly, leading to errors like the ones you see (been there,
done that). Just leave it at the default value unless you have good
reasons to do otherwise. Note
that it requires some trickery to move the contents of the
SlurmdSpoolDir if you want to do it on the fly without losing track of
running jobs.
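As a rough illustration of the check described above, here is a small
Python sketch that reads slurm.conf text and flags a SlurmdSpoolDir that
lives under /tmp. The function names and the CLEANED_DIRS list are made
up for this example, the parsing is deliberately simplistic (one
"Key=Value" per line, last one wins), and the default of
/var/spool/slurmd is the value I recall from the slurm.conf man page, so
verify it against your version.

```python
import posixpath

# Assumed default, taken from the slurm.conf man page; verify for your version.
DEFAULT_SPOOL_DIR = "/var/spool/slurmd"
# Hypothetical list of directories that cleaner scripts such as tmpwatch empty.
CLEANED_DIRS = ("/tmp",)


def find_spool_dir(conf_text):
    """Return the SlurmdSpoolDir found in slurm.conf text, or the assumed default."""
    spool_dir = DEFAULT_SPOOL_DIR
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition("=")
        # Parameter names in slurm.conf are case insensitive.
        if sep and key.strip().lower() == "slurmdspooldir":
            spool_dir = value.strip()
    return spool_dir


def is_under(path, parent):
    """True if path equals parent or lies somewhere below it."""
    path = posixpath.normpath(path)
    parent = posixpath.normpath(parent)
    return path == parent or path.startswith(parent.rstrip("/") + "/")


def risky_spool_dir(conf_text):
    """True if the configured spool directory sits in a regularly cleaned tree."""
    spool_dir = find_spool_dir(conf_text)
    return any(is_under(spool_dir, d) for d in CLEANED_DIRS)
```

Feeding it the contents of your slurm.conf, for example
risky_spool_dir(open("/etc/slurm/slurm.conf").read()) (the path is site
dependent), should return True for a setup like the one in the error
messages.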
> We don’t have the TmpDisk parameter set (default value). How much
> space is reasonable for this parameter?
Depends on how large the disks on your nodes are, no? However, the
trend seems to be that /tmp is a relatively small space, frequently on
a RAM disk (tmpfs) rather than backed by a real disk [1]. So you might
not want to encourage your users to write code assuming a large /tmp is
available. A large machine-local space is probably better placed at
/var/tmp or somewhere site-specific such as /local.
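If you want a starting number for TmpDisk, one option is to measure what
the nodes actually have. The Python sketch below reports the total and
free size of a directory's file system in megabytes, using only the
standard library. The function name is made up for this example, and it
assumes 1 MB = 1024*1024 bytes; as far as I know TmpDisk describes the
total size of the TmpFS file system, not the free space, so total_mb
would be the figure to compare against.

```python
import shutil

MIB = 1024 * 1024  # assumption: "megabytes" means 1024*1024 bytes here


def scratch_space_mb(path="/tmp"):
    """Return total and free size, in MB, of the file system holding path."""
    usage = shutil.disk_usage(path)
    return {
        "path": path,
        "total_mb": usage.total // MIB,
        "free_mb": usage.free // MIB,
    }


if __name__ == "__main__":
    # Candidate locations mentioned above; /local is site-specific and may not exist.
    for candidate in ("/tmp", "/var/tmp", "/local"):
        try:
            print(scratch_space_mb(candidate))
        except OSError as exc:
            print(f"{candidate}: not available ({exc})")
```

Running this on a compute node shows quickly whether /tmp is a small
tmpfs and whether /var/tmp or a site-specific path has more room.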
[1] http://0pointer.de/blog/projects/tmp.html
--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & BECS
+358503841576 || [email protected]