All,
I am in the process of transitioning from Torque to Slurm.
So far it is working very well, especially for job arrays.
Now I have one array job whose tasks are running across several nodes, but each
task uses only part of its node's resources. I would like Slurm to share the nodes,
so that more array tasks start wherever there are unused resources.
I ran scontrol update to force sharing, and I can see that the partition did change:
# scontrol show partitions
PartitionName=debug
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO
   MaxCPUsPerNode=UNLIMITED
   Nodes=compute[45-49]
   Priority=1 RootOnly=NO ReqResv=NO Shared=FORCE:4 PreemptMode=OFF
   State=UP TotalCPUs=280 TotalNodes=5 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
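
A command of roughly the following form produces the Shared=FORCE:4 setting shown above. It is a reconstruction from that output, not necessarily the exact invocation used; the partition name and share count are taken from the listing.

```shell
# Sketch only: force sharing on the "debug" partition, up to 4 jobs per resource.
# Newer Slurm releases call this option OverSubscribe instead of Shared.
scontrol update PartitionName=debug Shared=FORCE:4

# Confirm the change took effect.
scontrol show partition debug
```

A change made with scontrol update is not written back to slurm.conf, so it is lost when slurmctld is restarted or reconfigured unless the partition line in the file is edited as well.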
However, Slurm still does not start job 416_37 on any node, although I would expect it to.
# squeue
JOBID            PARTITION  NAME      USER   ST  TIME     NODES  NODELIST(REASON)
416_[37-1013%6]  debug      slurm_ar  user1  PD  0:00     1      (Resources)
416_36           debug      slurm_ar  user1  R   35:46    1      compute49
416_35           debug      slurm_ar  user1  R   1:47:25  1      compute46
416_33           debug      slurm_ar  user1  R   7:30:50  1      compute45
416_32           debug      slurm_ar  user1  R   7:38:39  1      compute47
416_31           debug      slurm_ar  user1  R   8:53:26  1      compute48
In my config, I have:
SelectType = select/cons_res
SelectTypeParameters = CR_CORE_MEMORY
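
For context, the corresponding slurm.conf lines would look roughly like the fragment below. The two SelectType values are the ones quoted above; the partition line is an assumption pieced together from the scontrol output, and the real file will contain more than this.

```
# slurm.conf (fragment; a sketch, not the actual file)
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

# Assumed partition definition, matching the scontrol output above
PartitionName=debug Nodes=compute[45-49] Default=YES MaxTime=INFINITE State=UP Shared=FORCE:4
```

With CR_Core_Memory, Slurm tracks both cores and memory as consumable resources, so a pending reason of (Resources) may refer to either one.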
What am I missing to get more than one job to run on a node?
Thanks in advance,
Brian Andrus