Hi all --
I'm trying to interpret a node state on my cluster and am having trouble
finding the answer anywhere in the docs.
sinfo returns:
# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
all* up infinite 4 unk* n[14,19,27,30]
all* up infinite 27 idle n[0-13,15-18,20-26,28-29]
However, I don't know what the "unk*" state refers to (it was previously
"idle*", and I tried a "service slurm restart" on the controller to
see if that would kick it loose).
A bit more information: I'm running Slurm 14.03.7 on RHEL 5 on a Scyld
Beowulf cluster. I had just rebooted the cluster, and the nodes came
back up in that state. I have "ReturnToService=2" in my slurm.conf file.
The node report from scontrol looks ok (except for the state):
# scontrol show node n14
NodeName=n14 CoresPerSocket=1
CPUAlloc=0 CPUErr=0 CPUTot=20 CPULoad=N/A Features=(null)
Gres=(null)
NodeAddr=n14 NodeHostName=n14 Version=(null)
RealMemory=1 AllocMem=0 Sockets=20 Boards=1
State=UNKNOWN* ThreadsPerCore=1 TmpDisk=0 Weight=1
BootTime=None SlurmdStartTime=None
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
And the slurmd log for that node doesn't provide much help either:
# cat /var/log/slurm/slurmd.log
[2015-06-23T12:58:20.249] Node configuration differs from hardware:
CPUs=20:40(hw) Boards=1:1(hw) SocketsPerBoard=20:2(hw)
CoresPerSocket=1:10(hw) ThreadsPerCore=1:2(hw)
[2015-06-23T12:58:20.250] CPU frequency setting not configured for
this node
[2015-06-23T12:58:20.255] slurmd version 14.03.7 started
[2015-06-23T12:58:20.256] WARNING: We will use a much slower
algorithm with proctrack/pgid, use Proctracktype=proctrack/linuxproc
or some other proctrack when using jobacct_gather/linux
[2015-06-23T12:58:20.257] slurmd started on Tue, 23 Jun 2015
12:58:20 -0600
[2015-06-23T12:58:20.258] CPUs=20 Boards=1 Sockets=20 Cores=1
Threads=1 Memory=193389 TmpDisk=96694 Uptime=22
[2015-06-23T12:58:25.273] Node configuration differs from hardware:
CPUs=20:40(hw) Boards=1:1(hw) SocketsPerBoard=20:2(hw)
CoresPerSocket=1:10(hw) ThreadsPerCore=1:2(hw)
[2015-06-23T12:58:25.273] CPU frequency setting not configured for
this node
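One thing I noticed: the "Node configuration differs from hardware" lines
suggest my slurm.conf node definition (20 single-core sockets) doesn't match
what slurmd actually detects (2 sockets x 10 cores x 2 threads = 40 CPUs).
Could that be related? My guess, purely from the "(hw)" values and the
Memory=193389 figure in the log above (so an assumption, not my actual
config), is that a hardware-matching node definition would look roughly
like:

# Hypothetical slurm.conf NodeName line matching the detected hardware;
# the node range n[0-30] and RealMemory value are assumptions taken from
# the sinfo output and slurmd log quoted above.
NodeName=n[0-30] Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=193389

I'm not sure whether a config/hardware mismatch alone would leave nodes in
UNKNOWN*, though, since the docs say Slurm normally marks mismatched nodes
DOWN with an "invalid" reason.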
Any thoughts or suggestions are definitely appreciated.
As an adjacent question: what is the proper procedure for bringing
down a Slurm-enabled cluster and then bringing it back up (or bringing
just the compute nodes back up)? Should I stop the Slurm controller
while the cluster reboots? Do I need to issue some command across the
compute nodes? (In the past I've found that manually relaunching the
slurmd daemon has helped: "bpsh -a slurmd".)
Thank you,
--
~ Ian Lee
Lawrence Livermore National Laboratory
(W) 925-423-4941