Hi,

I'm seeing this in the logs:

[2014-05-13T14:24:22.363] server_thread_count over limit (256), waiting

and during that time user commands fail like this:

[root@slurm-1 ~]# squeue -j 73271
squeue: error: slurm_receive_msg: Socket timed out on send/recv operation
slurm_load_jobs error: Socket timed out on send/recv operation

Any ideas how to debug what the 256 threads are actually doing, so I can 
understand the underlying cause? I doubt it's normal to exhaust the thread 
count on a cluster with 5000 job slots...

Mario Kadastik, PhD
Senior researcher

---
  "Physics is like sex, sure it may have practical reasons, but that's not why 
we do it" 
     -- Richard P. Feynman
