Hello Folks,
We're hitting a problem where jobs run for a while and then Slurm
suddenly marks all of the nodes in those jobs as down* and unresponsive.
On the nodes themselves slurmd is still running, and slurmstepd is also
running, managing step 0 of the respective job on each node:
29135 slurmstepd: [46119.0]
The job itself is still running and computing on those nodes, but at the
same time Slurm restarts the job on a new set of nodes that are still
responding.
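Our reading is that slurmctld stops hearing from slurmd, sets the nodes
DOWN once SlurmdTimeout expires, and then requeues the job because batch
job requeue is enabled. For reference, the relevant slurm.conf knobs
would be something like the following (example values, not our actual
settings):

    SlurmdTimeout=300   # seconds without a slurmd response before a node is set DOWN
    JobRequeue=1        # requeue batch jobs automatically when their nodes fail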
The only thing logged in slurmd.log on the first set of nodes is the
following:
[2013-12-04T09:26:06.333] slurmd version 2.6.0 started
[2013-12-04T09:26:06.333] slurmd started on Wed, 04 Dec 2013 09:26:06 -0500
[2013-12-04T09:26:06.333] Procs=16 Boards=1 Sockets=2 Cores=8 Threads=2 Memory=96832 TmpDisk=10 Uptime=678184
[2013-12-04T09:32:38.986] launch task 46119.0 request from [email protected] (port 65476)
[2013-12-04T11:11:34.736] active_threads == MAX_THREADS(130)
We're trying to figure out what could be pushing slurmd's active_threads
count so high.
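If I'm reading the slurmd source right, each incoming RPC gets its own
thread, and slurmd blocks on new connections once active_threads reaches
MAX_THREADS (130 here), which would explain it going unresponsive to
slurmctld while the steps keep computing. To line the log message up
with the actual thread count, here's a minimal sketch that polls slurmd
via /proc (it assumes Linux and SlurmdPidFile=/var/run/slurmd.pid, so
adjust both for your setup):

    #!/usr/bin/env python
    # Poll slurmd's kernel thread count so it can be correlated with the
    # "active_threads == MAX_THREADS" messages in slurmd.log.
    # Assumptions: Linux /proc, and SlurmdPidFile=/var/run/slurmd.pid.
    import time

    PIDFILE = "/var/run/slurmd.pid"

    def thread_count(pid):
        # /proc/<pid>/status carries a "Threads:" line with the count
        with open("/proc/%d/status" % pid) as f:
            for line in f:
                if line.startswith("Threads:"):
                    return int(line.split()[1])
        return -1

    def main():
        with open(PIDFILE) as f:
            pid = int(f.read().strip())
        while True:
            print("%s threads=%d" % (time.strftime("%Y-%m-%dT%H:%M:%S"),
                                     thread_count(pid)))
            time.sleep(5)

    if __name__ == "__main__":
        main()

If the count sits pinned around 130 while the nodes are marked down*,
that would point at RPCs piling up inside slurmd rather than a network
or controller-side problem.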
Any pointers would be much appreciated.
Thanks
--
Marcin Sliwowski | SysAdmin@RENCI | (919) 445-0479