Hi Mario
Is this some version of 14.03? Do you by any chance have messages like this in
your slurmctld.log:
[2014-05-08T19:19:41.757] error: Batch completion for job 313404 sent from wrong node (nodeA rather
than nodeB), ignored request
If not, you are seeing something different, but we have occasionally seen similar messages heralding
similar symptoms on 14.03.0-0pre5. If you do see the same type of message, it would be useful to attach
a debugger if possible and get a backtrace of the slurmctld threads (this was Moe's suggestion to us,
although we haven't managed to do it yet).
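For example, something like the following should dump a backtrace of every slurmctld thread (a minimal sketch only, assuming gdb is available on the controller and slurmctld runs under that name; note that attaching pauses the daemon briefly):

# attach to the running slurmctld and write all thread backtraces to a file
gdb -batch -ex "thread apply all bt" -p $(pidof slurmctld) > slurmctld-backtrace.txt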
To free it up (for us this condition persists indefinitely), try removing the job-related files (if
any) from the first-mentioned node and restarting slurmd on it. You may then need to restart slurmctld
relatively brutally, e.g.
scontrol abort
and, when that fails to produce a core file or kill the threads, resort to killall -9 slurmctld and
restart slurmctld.
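Roughly, the sequence looks like this (a sketch only; the spool path comes from SlurmdSpoolDir in your slurm.conf, and how you restart the daemons depends on your packaging):

# on the first-mentioned node (nodeA in the example above)
scontrol show config | grep SlurmdSpoolDir   # where slurmd keeps its job-related files
# remove any leftover files for the stuck job under that directory, then restart slurmd

# on the controller
scontrol abort           # ask slurmctld to dump a core file and exit
killall -9 slurmctld     # only if the abort hangs without producing a core
# then start slurmctld again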
Best regards
Stuart
On 13/05/14 12:38, Mario Kadastik wrote:
Hi,
I'm seeing this in the logs:
[2014-05-13T14:24:22.363] server_thread_count over limit (256), waiting
and during that time user commands get:
[root@slurm-1 ~]# squeue -j 73271
squeue: error: slurm_receive_msg: Socket timed out on send/recv operation
slurm_load_jobs error: Socket timed out on send/recv operation
Any ideas how to debug what the 256 threads are in fact doing, to understand the
underlying cause? I doubt it's normal that we're exhausting the thread count
on a 5000-job-slot cluster...
Mario Kadastik, PhD
Senior researcher
---
"Physics is like sex, sure it may have practical reasons, but that's not why we
do it"
-- Richard P. Feynman
--
Dr. Stuart Rankin
Senior System Administrator
High Performance Computing Service
University of Cambridge
Email: [email protected]
Tel: (+)44 1223 763517