Hi,
I have seen this on several different clusters running different apps:
users can start a task directly with openmpi and it performs reasonably,
but when started through slurm it either runs very slowly or complains
and dies.
Most recently, jobs died because they hit the open files limit, even
though both the hard and soft limits are high enough on the nodes (the
openmpi job works) and slurmd is set to the same limit (with ulimit -n
in /etc/sysconfig/slurm). Setting the propagate parameter made no
difference, whether from the command line or in slurm.conf (by default
all limits should already be propagated).
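
One way to narrow this down might be to print the limit a process actually sees, once when started directly and once when started through slurm. The following is only a minimal sketch, not something that has been run on the affected clusters; the script name check_nofile.py and the helper names are made up for illustration.

```python
import resource


def describe_limit(value):
    """Render an rlimit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def nofile_limits():
    """Return the (soft, hard) open-files limit of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    soft, hard = nofile_limits()
    # The values printed here are whatever the launching environment
    # (login shell or slurmd/slurmstepd) handed to this process.
    print("RLIMIT_NOFILE soft={0} hard={1}".format(
        describe_limit(soft), describe_limit(hard)))
```

Running it as "python check_nofile.py" in a login shell on a node and as "srun -n1 python check_nofile.py" should show whether the two launch paths really end up with the same soft and hard values.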
The slurm version itself is somewhat old (I think 2.4.5), but it can't
simply be upgraded (any change to the cluster requires a review
process), so answers of the form "upgrade to latest and see if it
still exists" would not be very helpful.
I'll have more data (including access to logs) next week, but for
now, can anyone guess what might be going on?
- [slurm-dev] openmpi misbehaves when started under slurm Sten Wolf