Hi,
I have seen this on several different clusters running different applications: users can start a task directly with Open MPI and it performs reasonably, but when the same task is started through Slurm it either runs very slowly or complains and dies. Most recently, jobs died because they hit the limit on the number of open files. However, both the hard and the soft limit are high enough on the nodes (the Open MPI job works), and slurmd is set to that same limit (with ulimit -n in /etc/sysconfig/slurm). We tried the propagate parameter, both on the command line and in slurm.conf, but it made no difference (by default all limits should already be propagated anyway).

The Slurm version is somewhat old (I think 2.4.5) but can't simply be upgraded, since any change to the cluster requires a review process, so answers of the form "upgrade to the latest version and see if the problem still exists" might not be very helpful. I'll have more data (including access to logs) next week, but for now: can anyone make a guess as to what might be going on?
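
To make the comparison concrete, here is a minimal sketch of the kind of check that could be run both ways, once launched directly and once through Slurm, to see which open-files limit the task itself actually ends up with. The function name and output format are made up for illustration; only the standard Python `resource` and `socket` modules are assumed to be available on the nodes.

```python
import resource
import socket


def describe_nofile_limit():
    """Return the open-files limits as seen by *this* process.

    Hypothetical helper for illustration: it reports what the running
    task inherited, which may differ from what `ulimit -n` shows in an
    interactive shell on the same node.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return {"host": socket.gethostname(), "soft": soft, "hard": hard}


if __name__ == "__main__":
    info = describe_nofile_limit()
    # A value of resource.RLIM_INFINITY means "unlimited".
    print("{host}: RLIMIT_NOFILE soft={soft} hard={hard}".format(**info))
```

If the values printed under Slurm are lower than those printed by the direct launch, that would suggest the limit is being lowered somewhere between slurmd and the task rather than on the node itself.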
