Brian,

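Try setting a default memory per CPU (DefMemPerCPU) in the partition
definition.  With CR_CORE_MEMORY, memory is scheduled as a consumable
resource, and your partition currently shows DefMemPerNode=UNLIMITED.  Later
versions of SLURM (>= 14.11.6?) require a default to be set; otherwise each
job is allocated all of the memory on its node, which leaves nothing for a
second job to share.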
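
For illustration, here is a minimal slurm.conf sketch of that suggestion.  The
2048 MB figure is a placeholder I made up, not a recommendation; pick a value
that fits the memory and core count of your nodes.  The partition line only
repeats settings visible in your scontrol output, and the rest of your
existing definition is left out.

```
# slurm.conf fragment (sketch; 2048 is a hypothetical value, in MB)
SelectType=select/cons_res
SelectTypeParameters=CR_CORE_MEMORY

# DefMemPerCPU gives jobs that request no memory a per-CPU default,
# so a single job no longer claims the whole node's memory.
PartitionName=debug Nodes=compute[45-49] Default=YES Shared=FORCE:4 DefMemPerCPU=2048
```

A job can also request memory itself, for example with
"sbatch --mem-per-cpu=2048", in which case the partition default should not
apply to it.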

HTH,
John DeSantis

2016-01-26 15:20 GMT-05:00 Andrus, Brian Contractor <[email protected]>:

> All,
>
>
>
> I am in the process of transitioning from Torque to Slurm.
>
> So far it is doing very well, especially handling arrays.
>
>
>
> Now I have one array job that is running across several nodes, but only
> using some of the node resources. I would like to have slurm start sharing
> the nodes so some of the array jobs will start where there are unused
> resources.
>
>
>
> I ran a scontrol update to force sharing and see the partition did change:
>
>
>
> # scontrol show partitions
> PartitionName=debug
>    AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
>    AllocNodes=ALL Default=YES QoS=N/A
>    DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
>    MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
>    Nodes=compute[45-49]
>    Priority=1 RootOnly=NO ReqResv=NO Shared=FORCE:4 PreemptMode=OFF
>    State=UP TotalCPUs=280 TotalNodes=5 SelectTypeParameters=N/A
>    DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
>
>
>
> But it is not starting job 416_37 on any node as I would expect.
>
>
>
> # squeue
>              JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>    416_[37-1013%6]     debug slurm_ar  user1 PD       0:00      1 (Resources)
>             416_36     debug slurm_ar  user1  R      35:46      1 compute49
>             416_35     debug slurm_ar  user1  R    1:47:25      1 compute46
>             416_33     debug slurm_ar  user1  R    7:30:50      1 compute45
>             416_32     debug slurm_ar  user1  R    7:38:39      1 compute47
>             416_31     debug slurm_ar  user1  R    8:53:26      1 compute48
>
>
>
> In my config, I have:
>
> SelectType              = select/cons_res
> SelectTypeParameters    = CR_CORE_MEMORY
>
>
>
>
>
> What am I missing to get more than one job to run on a node?
>
>
>
> Thanks in advance,
>
>
>
> Brian Andrus
>