We are seeing unexpected behavior with our scheduler. All of our nodes have 24 cores. If we ask for 60 CPUs, we get something less than that. It appears as if the scheduler allocates enough nodes to cover the 60 CPUs, but if any CPUs on those nodes are already allocated, they are subtracted from the total granted to the new job.
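For context, this is the kind of quick check we do to see each node's core count and how many CPUs are already allocated (sketch only, output omitted):

$ sinfo -p debug -N -o "%N %c %C"   # %c = CPUs per node, %C = allocated/idle/other/total CPUs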
In this example I asked for 5 tasks of 12 CPUs each. I was allocated three 24-core nodes, but the first node already had an 8-core job running on it (a quick check for this is sketched after the session output below). Below are the allocation request and the Slurm allocations reported to my job. SLURM_JOB_CPUS_PER_NODE is 16,24,12 instead of the expected 24,24,12; that sums to 52 CPUs, which is exactly the 60 requested minus the 8 cores already in use. We have also confirmed that if we try to use 60 cores, we get an error once the 52 cores are used. If no other jobs are running on the nodes, I do get the full number of CPUs I asked for. Attached is our slurm.conf. We are running Slurm 14.03.0. Is this a bug in the scheduler, or are we missing something? There is also an error about oversubscription in the log file, but I don't know what it means:

[2014-08-11T10:09:37.494] job submit for user cschmid7_local(5199): max node changed 4294967294 -> 16 because of qos limit
[2014-08-11T10:09:37.494] error: cons_res: _compute_c_b_task_dist oversubscribe for job 269189
[2014-08-11T10:09:37.494] sched: _slurm_rpc_allocate_resources JobId=269189 NodeList=bhc[0001-0003] usec=641

Thanks for any enlightenment,
Carl

[cschmid7_local@bh-sn]$ salloc -p debug -n 5 -c 12
salloc: Granted job allocation 269189
08/11/2014 10:09:37 AM [/var/home/cschmid7_local]
[cschmid7_local@]$ env | grep SLURM
SLURM_NODELIST=bhc[0001-0003]
SLURM_NODE_ALIASES=(null)
SLURM_NNODES=3
SLURM_JOBID=269189
SLURM_NTASKS=5
SLURM_TASKS_PER_NODE=2(x2),1
SLURM_CPUS_PER_TASK=12
SLURM_JOB_ID=269189
SLURM_SUBMIT_DIR=/var/home/cschmid7_local
SLURM_NPROCS=5
SLURM_JOB_NODELIST=bhc[0001-0003]
SLURM_JOB_CPUS_PER_NODE=16,24,12
SLURM_SUBMIT_HOST=bh-sn.bluehive.circ.private
SLURM_JOB_PARTITION=debug
SLURM_JOB_NUM_NODES=3

[cschmid7_local@bh-sn]$ scontrol sho part debug
PartitionName=debug
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL
   Default=NO DefaultTime=00:01:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=01:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=bhc[0001-0010]
   Priority=100 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=OFF
   State=UP TotalCPUs=240 TotalNodes=10 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
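For reference, this is roughly how we confirmed that bhc0001 already had an 8-core job on it when the allocation above was granted (sketch only; output omitted, and the squeue format string is just one way to show the relevant columns):

$ squeue -w bhc0001 -o "%.10i %.9u %.4C %N"   # jobs currently holding CPUs on the first granted node
$ scontrol show node bhc0001 | grep CPUAlloc  # line also shows CPUTot, e.g. allocated vs. total CPUs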
Attachment: slurm.conf
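In case it helps while the attachment is being looked at: the log error mentions cons_res, so the consumable-resources plugin is in play. A minimal sketch of the kind of slurm.conf lines involved is below; the values are illustrative assumptions (in particular the socket/core layout and CR_Core), and the real settings are in the attached file.

# Sketch only -- assumed values; the actual configuration is in the attached slurm.conf
SelectType=select/cons_res      # allocate individual cores/CPUs rather than whole nodes
SelectTypeParameters=CR_Core    # track consumable resources at core granularity (assumption)
NodeName=bhc[0001-0010] Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 State=UNKNOWN   # 24 cores per node (layout assumed)
PartitionName=debug Nodes=bhc[0001-0010] Default=NO DefaultTime=00:01:00 MaxTime=01:00:00 Shared=NO State=UP

Shared=NO here matches the scontrol output above; the partition shows SelectTypeParameters=N/A because no partition-level override is set, so the cluster-wide value from slurm.conf applies.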
