I am using Bright Cluster Manager version 6.0, which uses SLURM v2.3.4.
I'm seeing an odd issue: I have many jobs queued up, but SLURM has decided to power down most of my nodes and mark them as "~idle".

The short version is that I have multiple partitions and multiple different types of servers in my cluster, and I have SLURM's power control enabled to power off my servers when they're not in use. However, I have seen SLURM mark nodes as "~idle" (i.e., idle and powered off) while there are lots of jobs in the queue. For example, this morning I see about 1000 jobs queued up in "defq" (my default partition), but only 4 nodes (out of 32) are powered up and running jobs from that queue. The remaining 28 are marked as "~idle".

Here are some of the details:

-----
[root@savbu-usnic-a ~]# srun --version
slurm 2.3.4
[root@savbu-usnic-a ~]# squeue | head
  JOBID PARTITION     NAME     USER  ST  TIME  NODES NODELIST(REASON)
  82646      defq Run imb-  mpiteam  PD  0:00      2 (Priority)
  82647      defq Run netp  mpiteam  PD  0:00      2 (Priority)
  82649      defq Run triv  mpiteam  PD  0:00      2 (Priority)
  82650      defq Run inte  mpiteam  PD  0:00      2 (Priority)
  82651      defq Run ibm   mpiteam  PD  0:00      2 (Priority)
  82652      defq Run ones  mpiteam  PD  0:00      2 (Priority)
  82653      defq Run mpic  mpiteam  PD  0:00      2 (Priority)
  82654      defq Run mpi-  mpiteam  PD  0:00      2 (Priority)
  82655      defq Run java  mpiteam  PD  0:00      2 (Priority)
[root@savbu-usnic-a ~]# squeue | grep Resour
  82642      defq Run mpic  mpiteam  PD  0:00      2 (Resources)
[root@savbu-usnic-a ~]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
defq*        up   infinite     28  idle~ node[001-011,014-015,018-032]
defq*        up   infinite      4  alloc node[012-013,016-017]
eurompi      up   infinite      0    n/a
infiniban    up   infinite     38  idle~ dell[001-016,022-043]
[root@savbu-usnic-a ~]#
-----

Do you still answer questions about SLURM v2.3.4?  (Upgrading is not really an option, since Bright controls my entire SLURM setup.)

Thanks.

--
Jeff Squyres
[email protected]

For corporate legal information go to:
http://lists.schedmd.com/cgi-bin/dada/mail.cgi/r/slurmdev/314919641674/
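For reference, SLURM's power saving behavior in this version is driven by the suspend/resume settings in slurm.conf. The values and script paths below are illustrative, not my exact configuration, but this is roughly the shape of the relevant block:

```
# slurm.conf power-save sketch (illustrative values and paths)
SuspendTime=300                          # power a node down after 300s idle
SuspendProgram=/path/to/node-poweroff    # site-specific script (hypothetical path)
ResumeProgram=/path/to/node-poweron      # site-specific script (hypothetical path)
SuspendRate=10                           # max nodes suspended per minute
ResumeRate=10                            # max nodes resumed per minute
ResumeTimeout=600                        # seconds to wait for a node to boot
SuspendExcNodes=node[012-013]            # optional: nodes never powered down
```

Resuming more nodes for the queued jobs is supposed to be triggered automatically by the scheduler; my problem is that it isn't doing so for the pending "defq" jobs above.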
