Hello Folks,
We're hitting a problem where jobs run for a while and then Slurm
suddenly marks all of the nodes in those jobs as down* and unresponsive.
On the nodes themselves slurmd is still running, and slurmstepd is also
running, managing step 0 of the respective job on each node:
29135 slurmstepd: [46119.0]
The job itself is still running and computing on those nodes, but at the
same time Slurm restarts the job on a new set of nodes that are still
responding.
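Our reading is that slurmctld stops hearing from slurmd, sets the nodes
DOWN once SlurmdTimeout expires, and then requeues the job because batch
job requeue is enabled. For reference, the relevant slurm.conf knobs
would be something like the following (example values, not our actual
settings):

    SlurmdTimeout=300   # seconds without a slurmd response before a node is set DOWN
    JobRequeue=1        # requeue batch jobs automatically when their nodes fail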
The only thing logged in slurmd.log on the first set of nodes is the
following:
[2013-12-04T09:26:06.333] slurmd version 2.6.0 started
[2013-12-04T09:26:06.333] slurmd started on Wed, 04 Dec 2013 09:26:06 -0500
[2013-12-04T09:26:06.333] Procs=16 Boards=1 Sockets=2 Cores=8 Threads=2 Memory=96832 TmpDisk=10 Uptime=678184
[2013-12-04T09:32:38.986] launch task 46119.0 request from [email protected] (port 65476)
[2013-12-04T11:11:34.736] active_threads == MAX_THREADS(130)
We're trying to figure out what could be pushing slurmd's active_threads
count so high.
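If I'm reading the slurmd source right, each incoming RPC gets its own
thread, and slurmd blocks on new connections once active_threads reaches
MAX_THREADS (130 here), which would explain it going unresponsive to
slurmctld while the steps keep computing. To line the log message up
with the actual thread count, here's a minimal sketch that polls slurmd
via /proc (it assumes Linux and SlurmdPidFile=/var/run/slurmd.pid, so
adjust both for your setup):

    #!/usr/bin/env python
    # Poll slurmd's kernel thread count so it can be correlated with the
    # "active_threads == MAX_THREADS" messages in slurmd.log.
    # Assumptions: Linux /proc, and SlurmdPidFile=/var/run/slurmd.pid.
    import time

    PIDFILE = "/var/run/slurmd.pid"

    def thread_count(pid):
        # /proc/<pid>/status carries a "Threads:" line with the count
        with open("/proc/%d/status" % pid) as f:
            for line in f:
                if line.startswith("Threads:"):
                    return int(line.split()[1])
        return -1

    def main():
        with open(PIDFILE) as f:
            pid = int(f.read().strip())
        while True:
            print("%s threads=%d" % (time.strftime("%Y-%m-%dT%H:%M:%S"),
                                     thread_count(pid)))
            time.sleep(5)

    if __name__ == "__main__":
        main()

If the count sits pinned around 130 while the nodes are marked down*,
that would point at RPCs piling up inside slurmd rather than a network
or controller-side problem.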
Any pointers would be much appreciated.
Thanks
--
Marcin Sliwowski | SysAdmin@RENCI | (919) 445-0479