I think we've seen this problem as well. When using ThreadsPerCore along with DefMemPerCpu or --mem-per-cpu, the worker node doesn't set the memory limit correctly. Setting --mem instead should work OK.
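As a rough illustration of the --mem workaround, the sketch below converts a per-CPU request into a per-node value. The helper name `equivalent_mem_mb` is hypothetical. It assumes a single-node job and that --mem is given in megabytes per node.

```python
def equivalent_mem_mb(mem_per_cpu_mb: int, cpus_on_node: int) -> int:
    """Per-node memory (MB) matching a per-CPU request on a single node.

    Hypothetical helper for illustration only; assumes the whole job
    runs on one node.
    """
    return mem_per_cpu_mb * cpus_on_node


# Example values taken from the report quoted in this thread:
# 8 GB (8192 MB) per CPU on a node configured with CPUs=64.
mem_mb = equivalent_mem_mb(8192, 64)
print(f"--mem={mem_mb}")
```

The result must stay at or below the node's RealMemory for the job to be schedulable.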
I have a lightly tested patch that appears to address the issue, but I've only tried it on a single-node development system, and not yet in production. It's attached to the following bugzilla ticket:

http://bugs.schedmd.com/show_bug.cgi?id=309

Regards,
John

On 05/23/2013 01:00 PM, Paul Edmon wrote:
> I use 2.5.6.
>
> -Paul Edmon-
>
> On 05/23/2013 01:58 PM, S. Aravindan wrote:
>> I guess so. The slurm version I use is 2.5.4. I have attached my
>> slurm.conf with this mail.
>>
>> --Semparithi
>>
>> +++ On 10:34 23 May Paul Edmon wrote:
>>> Hmm, maybe it's the ThreadsPerCore? Perhaps it thinks there are half as
>>> many cores as there really are due to the ThreadsPerCore. Thus if you do
>>> the --mem-per-cpu it will only give you half, as it only counts cores
>>> not threads*cores?
>>>
>>> -Paul Edmon-
>>>
>>> On 05/23/2013 01:31 PM, S. Aravindan wrote:
>>>> I was about to post a similar query. A Gaussian 09 job is killed when the
>>>> memory consumption exceeds half the amount of memory available on a node
>>>> when --mem-per-cpu is used, but the job runs when --mem is used. The
>>>> relevant lines from slurm.conf are below.
>>>>
>>>> NodeName=node[01-15] RealMemory=48228 Sockets=2 CoresPerSocket=6
>>>> ThreadsPerCore=2 CPUs=24 State=UNKNOWN TmpDisk=1850000
>>>> NodeName=node[16-30] RealMemory=96705 Sockets=2 CoresPerSocket=6
>>>> ThreadsPerCore=2 CPUs=24 State=UNKNOWN TmpDisk=1850000 Feature=96g
>>>>
>>>> Any suggestion is welcome.
>>>>
>>>> --Semparithi
>>>>
>>>>
>>>> +++ On 09:41 23 May Paul Edmon wrote:
>>>>> I have a user that is running a problem which uses 512 GB of memory. She
>>>>> requests this from SLURM on a node which has this much. However her code
>>>>> dies:
>>>>>
>>>>> slurmd[holy2b09101]: error: Job 6497 exceeded 268435456 KB memory limit,
>>>>> being killed
>>>>> slurmd[holy2b09101]: error: Exceeded job memory limit
>>>>> slurmd[holy2b09101]: error: *** JOB 6497 CANCELLED AT 2013-05-23T00:53:31
>>>>> ***
>>>>>
>>>>> This is half of the 512 GB which was requested. Is there something I am
>>>>> missing? The nodes in question have:
>>>>>
>>>>> NodeName=DEFAULT CPUs=64 RealMemory=529247 Sockets=4 CoresPerSocket=8
>>>>> ThreadsPerCore=2 State=UNKNOWN
>>>>>
>>>>> These are AMD Abu Dhabi processors with 8 GB per core, so 512 GB total.
>>>>> She is requesting 8 GB per cpu and is asking for 64 cores. Thoughts?
>>>>>
>>>>> -Paul Edmon-
>>>> -- Semparithi Aravindan
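
The numbers in the original report are consistent with the cores-versus-threads guess quoted above. Here is a minimal sketch of the arithmetic. It assumes, and nothing in the thread confirms this, that the limit was computed from physical cores rather than from the 64 configured CPUs. `limit_kb` is a hypothetical helper, not a Slurm function.

```python
KB_PER_GB = 1024 * 1024


def limit_kb(mem_per_cpu_gb: int, cpu_count: int) -> int:
    """Job memory limit in KB for a given per-CPU request and CPU count."""
    return mem_per_cpu_gb * cpu_count * KB_PER_GB


# Node layout from the quoted slurm.conf line:
# Sockets=4 CoresPerSocket=8 ThreadsPerCore=2 (CPUs=64)
sockets, cores_per_socket, threads_per_core = 4, 8, 2
cores = sockets * cores_per_socket      # physical cores
threads = cores * threads_per_core      # logical CPUs, matches CPUs=64

expected = limit_kb(8, threads)  # what the user asked for: 8 GB x 64 CPUs
observed = limit_kb(8, cores)    # assumption: limit derived from cores only

print(f"expected limit: {expected} KB")
print(f"observed limit: {observed} KB")
```

Under that assumption the second value equals the 268435456 KB limit reported in the slurmd log, which is half of the 512 GB requested.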
