Assuming this is a GNU/Linux system, try running

$ srun bash -c "ulimit -a"

to see if your compute nodes have unexpected limits. You may need to add something to /etc/sysconfig/slurm so that they match the user environment on your login node. (If Slurm is started by sysvinit or systemd scripts at boot, it does not pick up any of the limit settings from /etc/security/limits.conf, because no PAM modules are invoked.)
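
A minimal sketch of that comparison, assuming bash on both the login node and the compute nodes and that srun is on your PATH. The helper name diff_limits and the two file names are made up for illustration; the helper prints only the limit lines that differ between two saved "ulimit -a" listings.

```shell
#!/usr/bin/env bash
# diff_limits FILE_A FILE_B
# Print the lines that differ between two saved `ulimit -a` listings.
# Lines from FILE_A are prefixed with "<", lines from FILE_B with ">".
# Prints nothing when the two listings are identical.
diff_limits() {
    diff "$1" "$2" | grep '^[<>]'
    return 0
}

# Only attempt the live comparison when srun is actually available.
if command -v srun >/dev/null 2>&1; then
    ulimit -a > login_limits.txt                 # limits in the login shell
    srun bash -c "ulimit -a" > node_limits.txt   # limits inside a job step
    diff_limits login_limits.txt node_limits.txt
fi
```

If a limit such as max locked memory turns out to differ, a commonly suggested remedy is a line like "ulimit -l unlimited" in /etc/sysconfig/slurm. Treat that as something to verify against your own init scripts rather than a guaranteed fix.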

Andy

On 04/09/2015 02:09 PM, Michael Colonno wrote:

Nope – my slurm.conf is very basic (been using it for several versions).

# COMPUTE NODES
NodeName=node[1-8]       Sockets=2         CoresPerSocket=6  ThreadsPerCore=1  State=IDLE
PartitionName=all        Nodes=node[1-8]   Default=YES       MaxTime=INFINITE  State=UP

 

Perhaps a system-level limit or something not set in the slurm init.d script? This all looks pretty normal:

# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 256422
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 256422
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

 

Thanks,
~Mike C.

 

From: Morris Jette [mailto:[email protected]]
Sent: Thursday, April 09, 2015 10:59 AM
To: slurm-dev
Subject: [slurm-dev] Re: default memory limit (14.11.5)?

 

Do you have a DefMemPerCPU or DefMemPerNode configured in slurm.conf?
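
One way to check, sketched under the assumption that the configuration lives in a single readable file. The default path below is a guess that varies by install, and the helper name find_mem_defaults is made up for illustration. It greps for uncommented DefMemPerCPU and DefMemPerNode settings.

```shell
#!/usr/bin/env bash
# find_mem_defaults CONF_FILE
# List uncommented DefMemPerCPU / DefMemPerNode settings in a
# slurm.conf-style file. Prints nothing if neither is set.
find_mem_defaults() {
    grep -iE '^[[:space:]]*DefMemPer(CPU|Node)[[:space:]]*=' "$1"
    return 0
}

# Assumed default location; set SLURM_CONF if your file lives elsewhere.
conf="${SLURM_CONF:-/etc/slurm/slurm.conf}"
if [ -r "$conf" ]; then
    find_mem_defaults "$conf"
fi
```

On a running cluster, "scontrol show config | grep -i MemPer" should show the values the controller is actually using, which is worth comparing against the file after an upgrade.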

On April 9, 2015 10:52:37 AM PDT, Michael Colonno <[email protected]> wrote:

Hi ~

I just upgraded my cluster to SLURM 14.11.5. Everything went smoothly, but when I run a test case it seems there is now a (very small) memory limit on jobs:

 

$ srun -n4 date
slurmstepd: Step 19293.0 exceeded memory limit (3324 > 1024), being killed
srun: Exceeded job memory limit
slurmstepd: *** STEP 19293.0 CANCELLED AT 2015-04-09T10:46:17 *** on node6
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: error: node6: tasks 0-3: Killed

 

How can I disable / fix this?

Thanks,
~Mike C.



