OK, should be easy but I am stumped

I built Slurm 16.05.0 on Ubuntu 14.04 LTS. It worked fine until I decided to 
set up email notifications using smail and seff, then realized that would need 
slurmdbd, which we weren't using because we have no need for accounting.


Fast forward to getting slurmdbd built and job accounting going. Now for some 
reason when I submit any job I get:


slurmstepd: error: Job 3049 exceeded memory limit (1336 > 1024), being killed


And of course I see lots of stuff on slurm-dev about setting /etc/default/slurm 
to have 'ulimit -m unlimited' etc. I also see suggestions to put it in 
/etc/security/limits.conf. I also see suggestions to set ulimit limits at the 
top of my slurmd init scripts.
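
As a rough sketch of what those suggestions look like in practice (the exact lines are assumptions for illustration, not copied from the list posts, and whether /etc/default/slurm is actually sourced depends on the init script in use):

```shell
# In /etc/default/slurm, or near the top of the slurmd init script,
# before the daemon is started:
ulimit -m unlimited

# /etc/security/limits.conf uses its own syntax rather than shell;
# the corresponding entries for max resident set size would be:
#   *  soft  rss  unlimited
#   *  hard  rss  unlimited
```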


I've done all these tricks and restarted services to no avail. Users have 
ulimits of unlimited (when checking with ulimit -a) but running sbatch or srun 
results in a cap on memory size. Memory lock seems to be ok.


[mikec@lunchbox] (34)$ srun bash -c "ulimit -a"
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1030449
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) 1024   <=====!!!!
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 514377
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
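
To reproduce the comparison without reading through the full ulimit -a listing, a minimal sketch like the following prints only the max memory size limit, first from a plain shell and then from inside a job step. The helper name show_rss_limit is hypothetical, and the srun call is skipped on hosts where srun is not installed.

```shell
#!/usr/bin/env bash
# show_rss_limit is a hypothetical helper, not part of Slurm.
# It runs the given command and prints its output with a label.
show_rss_limit() {
    local where="$1"
    shift
    printf '%s: %s\n' "$where" "$("$@")"
}

# Limit seen by a fresh shell on the submit host.
show_rss_limit "login shell" bash -c 'ulimit -m'

# Limit seen inside a job step; only attempted if srun is available.
if command -v srun >/dev/null 2>&1; then
    show_rss_limit "job step" srun bash -c 'ulimit -m'
fi
```

If the two lines differ, the limit is being changed somewhere between submission and task launch rather than in the user's own environment.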

Here is an example error when just trying to sbatch a script to run 
/bin/hostname

[mikec@lunchbox] (38)$ cat slurm-3053.out

slurmstepd: error: Job 3053 exceeded memory limit (1096 > 1024), being killed
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: *** JOB 3053 ON marzano01 CANCELLED AT 2016-10-13T22:17:08 ***




Where can I set ulimit -m so that it will actually take effect?


Here is my slurm.conf. I tried using PropagateResourceLimitsExcept=MEMLOCK,RSS 
but it did not alleviate the max memory size issue.


#
ClusterName=marzano
ControlMachine=lunchbox
ControlAddr=xxx.xxx.xxx.xxx
#BackupController=
#BackupAddr=
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/slurm.state
SlurmdSpoolDir=/tmp/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmdPidFile=/var/run/slurm/slurmd.pid
ProctrackType=proctrack/pgid
#PluginDir=
#FirstJobId=
ReturnToService=2
#MaxJobCount=
#PlugStackConfig=/etc/slurm/plugstack.conf
#PropagatePrioProcess=
#PropagateResourceLimits=
PropagateResourceLimitsExcept=MEMLOCK,RSS
#Prolog=
#Epilog=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
#TaskPlugin=
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
MailProg=/s/slurm/bin/smail
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
FastSchedule=1
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=6
SlurmctldLogFile=/var/log/slurmctld/slurmctld.log
SlurmdDebug=6
SlurmdLogFile=/var/log/slurmd/slurmd.log
JobCompType=jobcomp/none
#JobCompLoc=
#
# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
#
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=lunchbox
AccountingStorageLoc=slurm_acct_db
AccountingStoragePass=auth/munge
AccountingStorageUser=slurm
#
# COMPUTE NODES
NodeName=marzano0[1-6] CPUs=48 Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 State=UNKNOWN

PartitionName=debug Nodes=marzano0[1-6] Default=YES MaxTime=INFINITE State=UP
