Hello, Slurm development group.

Thank you for the excellent tools you provide. I'm really happy that Slurm 
helps me handle my job tasks gracefully.

However, after I changed my slurm.conf and restarted Slurm, my test2 node 
stayed in the "not responding" state.

I followed the instructions in the Troubleshooting guide under "Nodes are 
getting set to a DOWN state". The output is as follows:

NodeName=test2 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=1 CPULoad=N/A Features=(null)
   Gres=(null)
   NodeAddr=test2 NodeHostName=test2 Version=(null)
   RealMemory=1 AllocMem=0 Sockets=1 Boards=1
   State=DOWN* ThreadsPerCore=1 TmpDisk=0 Weight=1
   BootTime=None SlurmdStartTime=None
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Not responding [root@2015-07-06T18:12:15]
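
In case it helps with checking all four nodes, the State field can be read 
from this kind of output with a short script. This is only a minimal sketch: 
node_state is a name I made up, and it assumes that the trailing asterisk on 
the state marks a node that is not responding.

```python
import re

def node_state(node_text):
    """Return (state, not_responding) from node status text like the above.

    Assumption: a '*' attached to the state means the node is not responding.
    """
    match = re.search(r"\bState=(\S+)", node_text)
    if match is None:
        raise ValueError("no State= field found")
    raw = match.group(1)
    return raw.replace("*", ""), "*" in raw

SAMPLE = "   State=DOWN* ThreadsPerCore=1 TmpDisk=0 Weight=1"
print(node_state(SAMPLE))  # ('DOWN', True)
```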

The output says the node is DOWN and not responding. I tried pinging the 
controller node from test2, the node that is not responding, and the ping 
works fine.
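
Ping only exercises ICMP, so it does not show whether the Slurm daemons can 
reach each other over TCP. A minimal sketch of a TCP check follows. The 
function name tcp_port_open is hypothetical, and the ports mentioned after it 
are the Slurm defaults (6817 for slurmctld, 6818 for slurmd), which is an 
assumption; SlurmctldPort or SlurmdPort in slurm.conf may say otherwise.

```python
import socket

def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Run on the controller, tcp_port_open("test2", 6818) would test the path to 
slurmd on test2. Run on test2 with the controller's hostname and port 6817, 
it would test the opposite direction.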

On test2, ps shows that slurmd is running, so I restarted it to see whether 
that would help.

But test2 still remains "not responding", and its SlurmdLog file only says 
the following (with the debug level set to 5):

[2015-07-06T18:31:40.476] Node configuration differs from hardware: CPUs=1:4(hw) Boards=1:1(hw) SocketsPerBoard=1:4(hw) CoresPerSocket=1:1(hw) ThreadsPerCore=1:1(hw)
[2015-07-06T18:31:40.478] topology NONE plugin loaded
[2015-07-06T18:31:40.478] route default plugin loaded
[2015-07-06T18:31:40.478] CPU frequency setting not configured for this node
[2015-07-06T18:31:40.478] No specialized cores configured by default on this node
[2015-07-06T18:31:40.478] Resource spec: Reserved system memory limit not configured for this node
[2015-07-06T18:31:40.479] debug:  task NONE plugin loaded
[2015-07-06T18:31:40.479] debug:  auth plugin for Munge (http://code.google.com/p/munge/) loaded
[2015-07-06T18:31:40.479] debug:  spank: opening plugin stack /etc/plugstack.conf
[2015-07-06T18:31:40.479] Munge cryptographic signature plugin loaded
[2015-07-06T18:31:40.481] Warning: Core limit is only 0 KB
[2015-07-06T18:31:40.481] slurmd version 14.11.6 started
[2015-07-06T18:31:40.482] killing old slurmd[9087]
[2015-07-06T18:31:40.483] debug:  Job accounting gather NOT_INVOKED plugin loaded
[2015-07-06T18:31:40.483] debug:  job_container none plugin loaded
[2015-07-06T18:31:40.483] debug:  switch NONE plugin loaded
[2015-07-06T18:31:40.483] Slurmd shutdown completing
[2015-07-06T18:31:40.485] slurmd started on Mon, 06 Jul 2015 18:31:40 +0800
[2015-07-06T18:31:40.505] CPUs=1 Boards=1 Sockets=1 Cores=1 Threads=1 Memory=7554 TmpDisk=48580 Uptime=4149 CPUSpecList=(null)
[2015-07-06T18:31:40.506] debug:  AcctGatherEnergy NONE plugin loaded
[2015-07-06T18:31:40.506] debug:  AcctGatherProfile NONE plugin loaded
[2015-07-06T18:31:40.506] debug:  AcctGatherInfiniband NONE plugin loaded
[2015-07-06T18:31:40.507] debug:  AcctGatherFilesystem NONE plugin loaded

Some of the warnings and debug messages also appear on the other, working 
nodes, so I have no idea what has happened to my test2 node. (All of the 
test[1-4] nodes have the same settings and environment.)
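
To compare the first warning across nodes, the "Node configuration differs 
from hardware" line can be split into its fields. The sketch below assumes 
that in each NAME=A:B(hw) pair, A is the value from slurm.conf and B is what 
slurmd detected on the machine. The helper name config_mismatches is made up 
for illustration.

```python
import re

# Matches fields such as "CPUs=1:4(hw)": name, configured value, detected value.
FIELD_RE = re.compile(r"(\w+)=(\d+):(\d+)\(hw\)")

def config_mismatches(log_line):
    """Return {field: (configured, hardware)} for fields whose values differ."""
    mismatches = {}
    for name, configured, hardware in FIELD_RE.findall(log_line):
        configured, hardware = int(configured), int(hardware)
        if configured != hardware:
            mismatches[name] = (configured, hardware)
    return mismatches

LINE = ("Node configuration differs from hardware: CPUs=1:4(hw) Boards=1:1(hw) "
        "SocketsPerBoard=1:4(hw) CoresPerSocket=1:1(hw) ThreadsPerCore=1:1(hw)")

print(config_mismatches(LINE))  # {'CPUs': (1, 4), 'SocketsPerBoard': (1, 4)}
```

For the line logged on test2, this reports CPUs and SocketsPerBoard as 
configured at 1 while the hardware shows 4. Whether that difference is 
related to the "not responding" state is not something I can tell from the 
log alone.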

I would appreciate any help from the development group. Thank you very much 
for any information you can provide.

Sincerely, 
Xiang Du


