I will have this info next week, including the exact version of slurm, but I
think it is a 1.6.x
On 28/02/2014 01:31, Ralph Castain wrote:
What version of OMPI are you using?
On Feb 27, 2014, at 2:54 PM, Sten Wolf <[email protected]> wrote:
Hi,
I have seen this on several different clusters running different apps - users
can start a task directly with openmpi and it will perform reasonably, but when
started through slurm it either runs very slowly or complains and dies.
Most recently jobs died after hitting the open files limit, even though both the
hard and soft limits are high enough on the nodes (the openmpi job works) and slurmd
is set to that same limit (with ulimit -n in /etc/sysconfig/slurm). The
propagate parameter was tried but made no difference, either from the
command line or in slurm.conf (by default all limits should already be
propagated).
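
One way to narrow this down is to compare the limit a process actually sees in each launch mode. Below is a minimal sketch, assuming Python is available on the compute nodes; the script name check_nofile.py and the helper names are made up for illustration and are not part of slurm or openmpi.

```python
import resource
import socket


def nofile_limits():
    """Return the (soft, hard) open-files limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def describe(value):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


if __name__ == "__main__":
    soft, hard = nofile_limits()
    # One line per process, tagged with the host name so that output from
    # several nodes can be told apart.
    print("{0}: nofile soft={1} hard={2}".format(
        socket.gethostname(), describe(soft), describe(hard)))
```

Running it once from a login shell on a node and once through slurm (for example with srun -N1 python check_nofile.py) should show whether the limit inside the job step differs from the one the directly started openmpi job gets.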
The slurm version itself is somewhat old (I think 2.4.5) but can't simply be upgraded
(any changes to the cluster require a review process), so answers in the form of
"upgrade to latest and see if it still exists" might not be very helpful.
I'll have more data (including access to logs) next week, but for now -
can anyone make a guess as to what might be going on?