Hello, Slurm development group. Thank you for the excellent tools you provide to us. I'm really happy that Slurm helps me do my work in a graceful way.

However, after I reconfigured my slurm.conf and restarted, my test2 node stayed in the "not responding" state. I followed the instructions in the Troubleshooting guide under "Nodes are getting set to a DOWN state". The results are as follows:

    NodeName=test2 CoresPerSocket=1
       CPUAlloc=0 CPUErr=0 CPUTot=1 CPULoad=N/A Features=(null)
       Gres=(null)
       NodeAddr=test2 NodeHostName=test2 Version=(null)
       RealMemory=1 AllocMem=0 Sockets=1 Boards=1
       State=DOWN* ThreadsPerCore=1 TmpDisk=0 Weight=1
       BootTime=None SlurmdStartTime=None
       CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
       ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
       Reason=Not responding [root@2015-07-06T18:12:15]

It says "node is DOWN and Not responding". I tried to ping the controller node from the non-responding node test2, and the ping works well. On test2, ps shows that slurmd is running, so I restarted it to check whether that would help. But the test2 node remains "not responding", and the SlurmdLog file on test2 only says (with the Debug value set to 5):

    [2015-07-06T18:31:40.476] Node configuration differs from hardware: CPUs=1:4(hw) Boards=1:1(hw) SocketsPerBoard=1:4(hw) CoresPerSocket=1:1(hw) ThreadsPerCore=1:1(hw)
    [2015-07-06T18:31:40.478] topology NONE plugin loaded
    [2015-07-06T18:31:40.478] route default plugin loaded
    [2015-07-06T18:31:40.478] CPU frequency setting not configured for this node
    [2015-07-06T18:31:40.478] No specialized cores configured by default on this node
    [2015-07-06T18:31:40.478] Resource spec: Reserved system memory limit not configured for this node
    [2015-07-06T18:31:40.479] debug: task NONE plugin loaded
    [2015-07-06T18:31:40.479] debug: auth plugin for Munge (http://code.google.com/p/munge/) loaded
    [2015-07-06T18:31:40.479] debug: spank: opening plugin stack /etc/plugstack.conf
    [2015-07-06T18:31:40.479] Munge cryptographic signature plugin loaded
    [2015-07-06T18:31:40.481] Warning: Core limit is only 0 KB
    [2015-07-06T18:31:40.481] slurmd version 14.11.6 started
    [2015-07-06T18:31:40.482] killing old slurmd[9087]
    [2015-07-06T18:31:40.483] debug: Job accounting gather NOT_INVOKED plugin loaded
    [2015-07-06T18:31:40.483] debug: job_container none plugin loaded
    [2015-07-06T18:31:40.483] debug: switch NONE plugin loaded
    [2015-07-06T18:31:40.483] Slurmd shutdown completing
    [2015-07-06T18:31:40.485] slurmd started on Mon, 06 Jul 2015 18:31:40 +0800
    [2015-07-06T18:31:40.505] CPUs=1 Boards=1 Sockets=1 Cores=1 Threads=1 Memory=7554 TmpDisk=48580 Uptime=4149 CPUSpecList=(null)
    [2015-07-06T18:31:40.506] debug: AcctGatherEnergy NONE plugin loaded
    [2015-07-06T18:31:40.506] debug: AcctGatherProfile NONE plugin loaded
    [2015-07-06T18:31:40.506] debug: AcctGatherInfiniband NONE plugin loaded
    [2015-07-06T18:31:40.507] debug: AcctGatherFilesystem NONE plugin loaded

The same warnings and debug messages also appear on the other, working nodes, so I have no idea what has happened to my test2 node. (The test[1-4] nodes all have the same settings and environment.)

I would appreciate any help from the development group, and thank you very much for any information you can provide.

Sincerely,
Xiang Du
[email protected]
