I am setting up SLURM to replace a Torque/Maui installation and have nodes that use the AMD Opteron 6320 processor. The correct hardware values for these nodes are CPUs=32 Sockets=4 CoresPerSocket=4 ThreadsPerCore=2. When that configuration is used and a job is submitted with 'sbatch --ntasks-per-node=2 --cpus-per-task=1 --mem-per-cpu=384 --nodes=4', the job fails with a message that the memory limit was exceeded. If I change the node definitions so that Sockets, CoresPerSocket and ThreadsPerCore are left undefined, the job succeeds.
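For reference, this is roughly what the node definition and the failing submission look like (the node names here are placeholders, not our real hostnames; RealMemory is taken from the 128 GB nodes described below):

    # slurm.conf entry for the Opteron 6320 nodes (node names are placeholders)
    NodeName=node[01-04] CPUs=32 Sockets=4 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=129000 State=UNKNOWN

    # Job submission that triggers the memory-limit failure
    sbatch --ntasks-per-node=2 --cpus-per-task=1 --mem-per-cpu=384 --nodes=4 job.sh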
Relevant config options:
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory,CR_CORE_DEFAULT_DIST_BLOCK

The DEFAULT partition has DefMemPerCPU=1900. The partition I'm running these jobs in has no limits except MaxNodes=5 MinNodes=4 MaxTime=48:00:00.

Failure:
Nodes: CPUs=32 Sockets=4 CoresPerSocket=4 ThreadsPerCore=2
sbatch: --ntasks-per-node=2 --cpus-per-task=1 --mem-per-cpu=384 --nodes=4
slurmstepd: Step 16884.0 exceeded memory limit (720464 > 393216), being killed

What's very odd is that these jobs run on a partition for IB-enabled nodes. Some of those nodes are defined as "CPUs=8 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=32100 TmpDisk=16000 Feature=core8,mem32gb,ib_ddr,k10,shanghai State=UNKNOWN" while others are "CPUs=32 Sockets=4 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=129000 TmpDisk=16000 Feature=core32,mem128gb,ib_ddr,piledriver,abu_dhabi State=UNKNOWN". If I run the same job with "--constraint='core32'", the jobs succeed.

Is there some issue with running jobs across systems that have different values for "ThreadsPerCore" while using the --mem-per-cpu option? If the "--mem" option is used instead, the failure does not occur.

Please let me know if any additional information would be useful.

Thanks,
Trey

=============================
Trey Dockendorf
Systems Analyst I
Texas A&M University
Academy for Advanced Telecommunications and Learning Technologies
Phone: (979)458-2396
Email: [email protected]
Jabber: [email protected]
