Maybe you should check your limits (ulimit). Slurm takes them at submit time, so you should check both on the compute nodes and on the master node.
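Something like the following could help with the comparison. It is only a rough sketch: it prints the soft and hard values of the open files and locked memory limits for whatever process runs it, using Python's standard resource module. The function names and the script name used below (check_limits.py) are made up for illustration, and I kept the syntax old-fashioned in the hope that it also runs on an older interpreter.

```python
import resource
import socket

# Limits of interest here; the constants come from the standard resource module.
LIMITS = [
    ("open files", resource.RLIMIT_NOFILE),
    ("locked memory", resource.RLIMIT_MEMLOCK),
]


def fmt(value):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return str(value)


def read_limits():
    """Return {label: (soft, hard)} for the current process."""
    result = {}
    for label, res in LIMITS:
        result[label] = resource.getrlimit(res)
    return result


def report():
    """Build a small text report: hostname first, then one line per limit."""
    limits = read_limits()
    lines = [socket.gethostname()]
    for label, _ in LIMITS:
        soft, hard = limits[label]
        lines.append("%s: soft=%s hard=%s" % (label, fmt(soft), fmt(hard)))
    return "\n".join(lines)


if __name__ == "__main__":
    print(report())
```

Run it once in the shell you submit from, once directly on a compute node, and once through slurm (for example "srun python check_limits.py"). If the values inside the job differ from the ones you get directly on the node, the limits are most likely being changed somewhere on the way into the job.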
I had a problem with the memory lock limit that looks like yours.

On Mon, Mar 3, 2014 at 3:46 PM, Sten Wolf <[email protected]> wrote:
>
> Hi,
> First of all - thanks for all the responses.
> The slurm version is 2.4.4, on centos 5.8 nodes (no updates, the cluster
> is not connected to internet).
> It turns out the issue has nothing to do with openmpi - upon further
> testing, when running serially as a single app - we see the same difference:
> running the app directly on a single node works well, but when started
> from slurm, it complains about number of open files.
> Is there any other info I can provide to help figure out what's going on?
> If an update is guaranteed to work, we can start the process, but would
> rather not spend the time to get approval and testing of all current apps
> if in the end nothing will change.
> On that note - is it possible to set up several versions of slurm on same
> cluster, and make the slurmctld aware of each other, so no node gets
> oversubscribed? or would I have to drain/down nodes in one version in order
> to use them in another?
>
> On 28/02/2014 00:54, Sten Wolf wrote:
>
>>
>> Hi,
>> I have seen this in several different clusters running different apps -
>> users can start a task directly with openmpi, it will perform reasonably,
>> but when started through slurm it either runs very slowly, or complains
>> and dies.
>> Most recently jobs died due to number of open files limit, however both
>> hard and soft limit are high enough on the nodes (the openmpi job works)
>> and slurmd is set to that same limit (with ulimit -n in
>> /etc/sysconfig/slurm). The propagate parameter was tried but failed to
>> make a difference either from command line or in slurm.conf (by default
>> all limits should already be propagated).
>> The slurm version itself is somewhat old (I think 2.4.5) but can't
>> simply be upgraded (any changes to the cluster require a review
>> process), so answers in the form of "upgrade to latest and see if it
>> still exists" might not be very helpful.
>> I'll have more data (including access to logs) during next week, but for
>> now - can anyone make a guess as to what might be going on?
>>
>
