Hi all --

I'm trying to interpret a job state on my cluster and having trouble finding the answer anywhere in the docs.

sinfo returns:

   # sinfo
   PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
   all*         up   infinite      4   unk* n[14,19,27,30]
   all*         up   infinite     27   idle n[0-13,15-18,20-26,28-29]

However, I can't figure out what the "unk*" state refers to (it was previously "idle*", and I tried a "service slurm restart" on the controller to see if that would kick it loose).

A bit more information: I'm running Slurm 14.03.7 on RHEL 5 on a Scyld Beowulf cluster. I had just rebooted the cluster, and the nodes came back up in that state. I have "ReturnToService=2" in my slurm.conf file.

The node report from scontrol looks ok (except for the state):

   # scontrol show node n14
   NodeName=n14 CoresPerSocket=1
       CPUAlloc=0 CPUErr=0 CPUTot=20 CPULoad=N/A Features=(null)
       Gres=(null)
       NodeAddr=n14 NodeHostName=n14 Version=(null)
       RealMemory=1 AllocMem=0 Sockets=20 Boards=1
       State=UNKNOWN* ThreadsPerCore=1 TmpDisk=0 Weight=1
       BootTime=None SlurmdStartTime=None
       CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
       ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

And the slurmd log for that node doesn't provide much help either:

   # cat /var/log/slurm/slurmd.log
   [2015-06-23T12:58:20.249] Node configuration differs from hardware:
   CPUs=20:40(hw) Boards=1:1(hw) SocketsPerBoard=20:2(hw)
   CoresPerSocket=1:10(hw) ThreadsPerCore=1:2(hw)
   [2015-06-23T12:58:20.250] CPU frequency setting not configured for
   this node
   [2015-06-23T12:58:20.255] slurmd version 14.03.7 started
   [2015-06-23T12:58:20.256] WARNING: We will use a much slower
   algorithm with proctrack/pgid, use Proctracktype=proctrack/linuxproc
   or some other proctrack when using jobacct_gather/linux
   [2015-06-23T12:58:20.257] slurmd started on Tue, 23 Jun 2015
   12:58:20 -0600
   [2015-06-23T12:58:20.258] CPUs=20 Boards=1 Sockets=20 Cores=1
   Threads=1 Memory=193389 TmpDisk=96694 Uptime=22
   [2015-06-23T12:58:25.273] Node configuration differs from hardware:
   CPUs=20:40(hw) Boards=1:1(hw) SocketsPerBoard=20:2(hw)
   CoresPerSocket=1:10(hw) ThreadsPerCore=1:2(hw)
   [2015-06-23T12:58:25.273] CPU frequency setting not configured for
   this node

Any thoughts or suggestions are definitely appreciated.

As an adjacent question: what is the proper procedure for bringing down a Slurm-enabled cluster and then bringing it back up (or bringing just the compute nodes back up)? Should I stop the Slurm controller while the cluster reboots? Do I need to issue some command across the compute nodes? In the past I've found that manually relaunching slurmd has helped ("bpsh -a slurmd").
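For concreteness, here is roughly the ad-hoc sequence I've been using after a reboot (the "service slurm" init script name and the bpsh fan-out are specific to our setup, and the final scontrol step is just my guess at how to clear a stuck state; I'd appreciate confirmation on whether any of this is the sanctioned procedure):

```shell
# On the head node, before rebooting the cluster:
service slurm stop            # stop the controller? (or should it stay up?)

# ... reboot the cluster ...

# After the compute nodes come back up:
service slurm start           # restart the controller
bpsh -a slurmd                # manually relaunch slurmd on every compute node (Scyld)

# If nodes still show unk*/down afterwards, I'm guessing at something like:
scontrol update NodeName=n[14,19,27,30] State=RESUME
```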

Thank you,


--
~ Ian Lee
Lawrence Livermore National Laboratory
(W) 925-423-4941