Mike, I would suggest that the limit is a SLURM limit rather than a ulimit.
What is the result of scontrol show config | grep Mem? Because you have
set SelectTypeParameters=CR_Core_Memory, memory is an enforced resource:
any job that goes over the default memory limit (1024 MB, going by your
error messages) will be killed by slurmstepd. There are a number of ways
to solve this (sketches for options 2 and 3 follow the list):

1. Change SelectTypeParameters=CR_Core_Memory to
   SelectTypeParameters=CR_Core, noting that memory will no longer be a
   tracked resource when SLURM distributes jobs.

2. Tell everyone that their jobs will fail if they don't request memory
   with --mem=X or --mem-per-cpu=X.

3. Set one of:
   DefMemPerNode = UNLIMITED
   MaxMemPerNode = UNLIMITED
   DefMemPerCPU  = UNLIMITED
   MaxMemPerCPU  = UNLIMITED
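For option 3, a minimal sketch of the relevant slurm.conf lines, assuming
you keep memory as a consumable resource and lift the default with
DefMemPerNode (any of the four parameters above works the same way):

    # slurm.conf (sketch): keep scheduling on cores and memory, but
    # remove the 1024 MB default that is currently killing jobs
    SelectType=select/cons_res
    SelectTypeParameters=CR_Core_Memory
    DefMemPerNode=UNLIMITED

Push the updated slurm.conf to all nodes and restart slurmctld and the
slurmds afterwards (scontrol reconfigure may be enough, but a restart is
the safe bet).

For option 2, a batch script would request its memory explicitly; a
sketch, where the 2048 MB figure is purely illustrative:

    #!/bin/bash
    #SBATCH --mem=2048    # MB per node; substitute a realistic figure
    /bin/hostname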
Cheers
L.

------
The most dangerous phrase in the language is, "We've always done it this
way."
- Grace Hopper


On 14 October 2016 at 14:21, Mike Cammilleri <mi...@stat.wisc.edu> wrote:
> OK, should be easy but I am stumped.
>
> Built Slurm 16.05.0 on Ubuntu-14.04 LTS. Worked fine until I decided I
> wanted to get email notifications going using smail and seff, then
> realized that would need slurmdbd, which we weren't using because we
> have no need for accounting.
>
> Fast forward to getting slurmdbd built and job accounting going. Now,
> for some reason, when I submit any job I get:
>
> slurmstepd: error: Job 3049 exceeded memory limit (1336 > 1024), being
> killed
>
> And of course I see lots of stuff on slurm-dev about setting
> /etc/default/slurm to have 'ulimit -m unlimited' etc. I also see
> suggestions to put it in /etc/security/limits.conf, and to set ulimit
> limits at the top of my slurmd init scripts.
>
> I've done all these tricks and restarted services to no avail. Users
> have ulimits of unlimited (when checking with ulimit -a), but running
> sbatch or srun results in a cap on memory size. Memory lock seems to
> be OK.
>
> [mikec@lunchbox] (34)$ srun bash -c "ulimit -a"
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 1030449
> max locked memory       (kbytes, -l) unlimited
> max memory size         (kbytes, -m) 1024    <=====!!!!
> open files                      (-n) 1024
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> real-time priority              (-r) 0
> stack size              (kbytes, -s) 8192
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 514377
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
>
> Here is an example error when just trying to sbatch a script to run
> /bin/hostname:
>
> [mikec@lunchbox] (38)$ cat slurm-3053.out
> slurmstepd: error: Job 3053 exceeded memory limit (1096 > 1024), being
> killed
> slurmstepd: error: Exceeded job memory limit
> slurmstepd: error: *** JOB 3053 ON marzano01 CANCELLED AT
> 2016-10-13T22:17:08 ***
>
> Where can I set ulimit -m so that it will actually take effect?
>
> Here is my slurm.conf. I tried using
> PropagateResourceLimitsExcept=MEMLOCK,RSS, but that did not alleviate
> the max memory size issue.
>
> #
> ClusterName=marzano
> ControlMachine=lunchbox
> ControlAddr=xxx.xxx.xxx.xxx
> #BackupController=
> #BackupAddr=
> #
> SlurmUser=slurm
> #SlurmdUser=root
> SlurmctldPort=6817
> SlurmdPort=6818
> AuthType=auth/munge
> #JobCredentialPrivateKey=
> #JobCredentialPublicCertificate=
> StateSaveLocation=/slurm.state
> SlurmdSpoolDir=/tmp/slurmd
> SwitchType=switch/none
> MpiDefault=none
> SlurmctldPidFile=/var/run/slurm/slurmctld.pid
> SlurmdPidFile=/var/run/slurm/slurmd.pid
> ProctrackType=proctrack/pgid
> #PluginDir=
> #FirstJobId=
> ReturnToService=2
> #MaxJobCount=
> #PlugStackConfig=/etc/slurm/plugstack.conf
> #PropagatePrioProcess=
> #PropagateResourceLimits=
> PropagateResourceLimitsExcept=MEMLOCK,RSS
> #Prolog=
> #Epilog=
> #SrunProlog=
> #SrunEpilog=
> #TaskProlog=
> #TaskEpilog=
> #TaskPlugin=
> #TrackWCKey=no
> #TreeWidth=50
> #TmpFS=
> #UsePAM=
> MailProg=/s/slurm/bin/smail
> #
> # TIMERS
> SlurmctldTimeout=300
> SlurmdTimeout=300
> InactiveLimit=0
> MinJobAge=300
> KillWait=30
> Waittime=0
> #
> # SCHEDULING
> SchedulerType=sched/backfill
> #SchedulerAuth=
> #SchedulerPort=
> #SchedulerRootFilter=
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core_Memory
> FastSchedule=1
> #PriorityType=priority/multifactor
> #PriorityDecayHalfLife=14-0
> #PriorityUsageResetPeriod=14-0
> #PriorityWeightFairshare=100000
> #PriorityWeightAge=1000
> #PriorityWeightPartition=10000
> #PriorityWeightJobSize=1000
> #PriorityMaxAge=1-0
> #
> # LOGGING
> SlurmctldDebug=6
> SlurmctldLogFile=/var/log/slurmctld/slurmctld.log
> SlurmdDebug=6
> SlurmdLogFile=/var/log/slurmd/slurmd.log
> JobCompType=jobcomp/none
> #JobCompLoc=
> #
> # ACCOUNTING
> JobAcctGatherType=jobacct_gather/linux
> JobAcctGatherFrequency=30
> #
> AccountingStorageType=accounting_storage/slurmdbd
> AccountingStorageHost=lunchbox
> AccountingStorageLoc=slurm_acct_db
> AccountingStoragePass=auth/munge
> AccountingStorageUser=slurm
> #
> # COMPUTE NODES
> NodeName=marzano0[1-6] CPUs=48 Sockets=2 CoresPerSocket=12
> ThreadsPerCore=2 State=UNKNOWN
>
> PartitionName=debug Nodes=marzano0[1-6] Default=YES MaxTime=INFINITE
> State=UP