All,

I am in the process of transitioning from Torque to Slurm.
So far it is doing very well, especially handling arrays.

I now have one array job running across several nodes, but each task uses only
part of its node's resources. I would like Slurm to start sharing the nodes so
that more of the array tasks start wherever there are unused resources.

I ran scontrol update to force sharing, and I can see the partition did change:

#scontrol show partitions
PartitionName=debug
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=compute[45-49]
   Priority=1 RootOnly=NO ReqResv=NO Shared=FORCE:4 PreemptMode=OFF
   State=UP TotalCPUs=280 TotalNodes=5 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
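
The update that produces the Shared=FORCE:4 setting above would look roughly
like the following. The exact arguments are an assumption inferred from that
output, not a copy of the command actually typed:

```shell
# Assumed form of the update (needs root or SlurmUser privileges).
# Newer Slurm releases name this option OverSubscribe instead of Shared.
scontrol update PartitionName=debug Shared=FORCE:4

# Confirm the partition picked up the new setting.
scontrol show partition debug | grep -o 'Shared=[^ ]*'
```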

But Slurm is not starting job 416_37 on any node, as I would expect it to.

#squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   416_[37-1013%6]     debug slurm_ar  user1 PD       0:00      1 (Resources)
            416_36     debug slurm_ar  user1  R      35:46      1 compute49
            416_35     debug slurm_ar  user1  R    1:47:25      1 compute46
            416_33     debug slurm_ar  user1  R    7:30:50      1 compute45
            416_32     debug slurm_ar  user1  R    7:38:39      1 compute47
            416_31     debug slurm_ar  user1  R    8:53:26      1 compute48

In my config, I have:
SelectType              = select/cons_res
SelectTypeParameters    = CR_CORE_MEMORY


What am I missing to get more than one job to run on a node?

Thanks in advance,

Brian Andrus