Maybe you should check your limits (ulimit).
Slurm takes them at submit time, so you should check both on the nodes and
on the master node.

I had a problem with the memory lock limit that looked like yours.
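
One way to compare is to print the limits from inside a process, once from a normal shell and once from inside a Slurm job. The following is only a rough sketch, assuming a Python interpreter is available on the master and on the nodes; the function names and the file name show_limits.py are made up for illustration. It sticks to old-style syntax so it should not need a recent Python.

```python
import resource

# The two limits discussed in this thread: open files and locked memory.
LIMITS = {
    "nofile": resource.RLIMIT_NOFILE,
    "memlock": resource.RLIMIT_MEMLOCK,
}


def fmt(value):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return str(value)


def current_limits():
    """Return {name: (soft, hard)} for the calling process."""
    return dict((name, resource.getrlimit(res)) for name, res in LIMITS.items())


if __name__ == "__main__":
    for name, (soft, hard) in sorted(current_limits().items()):
        print("%s: soft=%s hard=%s" % (name, fmt(soft), fmt(hard)))
```

Run it as `python show_limits.py` in a login shell on the master node and on a compute node, then again through Slurm (for example `srun -N1 python show_limits.py`). If the numbers differ, the job is not getting the limits you expect.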


On Mon, Mar 3, 2014 at 3:46 PM, Sten Wolf <[email protected]> wrote:

>
> Hi,
> First of all - thanks for all the responses.
> The slurm version is 2.4.4, on centos 5.8 nodes (no updates, the cluster
> is not connected to internet).
> It turns out the issue has nothing to do with openmpi - upon further
> testing, when running serially as a single app - we see the same difference:
> running the app directly on a single node works well, but when started
> from slurm, it complains about number of open files.
> Is there any other info I can provide to help figure out what's going on?
> If an update is guaranteed to work, we can start the process, but would
> rather not spend the time to get approval and testing of all current apps
> if in the end nothing will change.
> On that note - is it possible to set up several versions of slurm on the
> same cluster, and make the slurmctld daemons aware of each other, so no node
> gets oversubscribed? Or would I have to drain/down nodes in one version in
> order to use them in another?
>
> On 28/02/2014 00:54, Sten Wolf wrote:
>
>>
>> Hi,
>> I have seen this in several different clusters running different apps -
>> users can start a task directly with openmpi and it will perform reasonably,
>> but when started through slurm it either runs very slowly, or complains
>> and dies.
>> Most recently jobs died due to number of open files limit, however both
>> hard and soft limit are high enough on the nodes (the openmpi job works)
>> and slurmd is set to that same limit (with ulimit -n  in
>> /etc/sysconfig/slurm). The propagate parameter was tried but failed to
>> make a difference either from command line or in slurm.conf (by default
>> all limits should already be propagated).
>> The slurm version itself is somewhat old (I think 2.4.5) but can't
>> simply be upgraded (any changes to the cluster require a review
>> process), so answers in the form of "upgrade to latest and see if it
>> still exists" might not be very helpful.
>> I'll have more data (including access to logs) during next week, but for
>> now - can anyone make a guess as to what might be going on?
>>
>
