Hi,
First of all - thanks for all the responses.
The slurm version is 2.4.4, on CentOS 5.8 nodes (no updates; the cluster is not connected to the internet). It turns out the issue has nothing to do with openmpi: on further testing, we see the same difference when running the app serially as a single process. Running it directly on a single node works well, but when started from slurm, it complains about the number of open files.
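To compare the two environments, a minimal sketch like the one below could be run once directly on a node and once through slurm (for example via srun). It only reports the open-files limits that the process actually sees. The function name nofile_report and the output format are made up for illustration, and the /proc/self/fd count assumes a Linux node.

```python
import os
import resource


def nofile_report():
    """Return the open-files limits and descriptor count seen by this process."""
    # RLIMIT_NOFILE is the limit that "ulimit -n" shows in a shell.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    try:
        # Linux-specific: each entry in /proc/self/fd is one open descriptor.
        open_fds = len(os.listdir("/proc/self/fd"))
    except OSError:
        # /proc is not available; report the limits only.
        open_fds = None
    return {"soft": soft, "hard": hard, "open_fds": open_fds}


if __name__ == "__main__":
    report = nofile_report()
    host = os.uname()[1]
    print("host=%s soft=%s hard=%s open_fds=%s"
          % (host, report["soft"], report["hard"], report["open_fds"]))
```

If the soft or hard value printed under slurm is lower than the one printed when running directly, the limit is being changed somewhere between slurmd and the task rather than inside the app itself.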
Is there any other info I can provide to help figure out what's going on?
If an update is guaranteed to work, we can start the process, but we would rather not spend the time on approval and on re-testing all current apps if in the end nothing changes. On that note - is it possible to set up several versions of slurm on the same cluster and make the slurmctld daemons aware of each other, so that no node gets oversubscribed? Or would I have to drain/down nodes in one version in order to use them in another?

On 28/02/2014 00:54, Sten Wolf wrote:

Hi,
I have seen this in several different clusters running different apps -
users can start a task directly with openmpi and it will perform reasonably,
but when started through slurm it either runs very slowly, or complains
and dies.
Most recently jobs died due to the open files limit, although both the
hard and soft limits are high enough on the nodes (the openmpi job works)
and slurmd is set to that same limit (with ulimit -n  in
/etc/sysconfig/slurm). The propagate parameter was tried but failed to
make a difference either from command line or in slurm.conf (by default
all limits should already be propagated).
The slurm version itself is somewhat old (I think 2.4.5) but can't
simply be upgraded (any changes to the cluster require a review
process), so answers in the form of "upgrade to latest and see if it
still exists" might not be very helpful.
I'll have more data (including access to logs) during next week, but for
now - can anyone make a guess as to what might be going on?
