I am setting up SLURM to replace a Torque/Maui installation and have nodes that 
use AMD Opteron 6320 processors.  The technical values for these nodes should be 
CPUs=32 Sockets=4 CoresPerSocket=4 ThreadsPerCore=2.  When that configuration is 
used and a job is submitted with 'sbatch --ntasks-per-node=2 --cpus-per-task=1 
--mem-per-cpu=384 --nodes=4', the job fails saying the memory limits were 
exceeded.  If I change the node definitions so that Sockets, CoresPerSocket, and 
ThreadsPerCore are left undefined, the job succeeds.
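
For clarity, the two forms of node definition I am comparing look roughly like 
this (the node names are stand-ins for my actual lines):

# Fails with --mem-per-cpu:
NodeName=node[001-004] CPUs=32 Sockets=4 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=129000 State=UNKNOWN
# Works with --mem-per-cpu:
NodeName=node[001-004] CPUs=32 RealMemory=129000 State=UNKNOWN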

Relevant config options:

SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory,CR_CORE_DEFAULT_DIST_BLOCK

The DEFAULT partition has DefMemPerCPU=1900.  The partition I'm running these 
jobs in has no limits except MaxNodes=5 MinNodes=4 MaxTime=48:00:00.
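
In slurm.conf terms the partition setup is roughly the following (the partition 
and node names here are placeholders, not my exact lines):

PartitionName=DEFAULT DefMemPerCPU=1900
PartitionName=ib Nodes=node[001-010] MinNodes=4 MaxNodes=5 MaxTime=48:00:00 State=UP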

Failure:
Nodes: CPUs=32 Sockets=4 CoresPerSocket=4 ThreadsPerCore=2
sbatch: --ntasks-per-node=2 --cpus-per-task=1 --mem-per-cpu=384 --nodes=4

slurmstepd: Step 16884.0 exceeded memory limit (720464 > 393216), being killed
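
If I am reading the units correctly (the numbers appear to be KB), the 
arithmetic behind that message is:

  393216 KB / 1024  = 384 MB    (one CPU's worth of --mem-per-cpu=384)
  720464 KB / 1024 ~= 703 MB    (less than 2 * 384 MB = 768 MB)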

What's very odd is that these jobs run on a partition of IB-enabled nodes with 
mixed hardware.  Some of those nodes are "CPUs=8 Sockets=2 CoresPerSocket=4 
ThreadsPerCore=1 RealMemory=32100 TmpDisk=16000 
Feature=core8,mem32gb,ib_ddr,k10,shanghai State=UNKNOWN" while others are 
"CPUs=32 Sockets=4 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=129000 
TmpDisk=16000 Feature=core32,mem128gb,ib_ddr,piledriver,abu_dhabi 
State=UNKNOWN".  If I run the same job with "--constraint='core32'", the jobs 
succeed.  Is there some issue with running jobs across nodes that have 
different values for "ThreadsPerCore" while using the --mem-per-cpu option?  If 
the "--mem" option is used instead, the failure does not occur; both working 
variants are shown below.
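
For reference, these are the two submissions that succeed (the job script name 
is a placeholder, and the --mem value shown is simply 2 tasks * 384 MB per 
node, which may differ from the exact value I used):

sbatch --ntasks-per-node=2 --cpus-per-task=1 --mem-per-cpu=384 --nodes=4 --constraint='core32' job.sh
sbatch --ntasks-per-node=2 --cpus-per-task=1 --mem=768 --nodes=4 job.sh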

Please let me know if any additional information would be useful.

Thanks
- Trey

=============================

Trey Dockendorf 
Systems Analyst I 
Texas A&M University 
Academy for Advanced Telecommunications and Learning Technologies 
Phone: (979)458-2396 
Email: [email protected] 
Jabber: [email protected]
