Hi Mario

Is this some version of 14.03? Do you by any chance have messages like this in 
your slurmctld.log:

[2014-05-08T19:19:41.757] error: Batch completion for job 313404 sent from wrong node (nodeA rather than nodeB), ignored request

If not, you are seeing something different, but we have occasionally seen messages like this heralding similar symptoms on 14.03.0-0pre5. If you do see the same type of message, it would be useful to attach a debugger, if possible, and get a backtrace of the slurmctld threads (this was Moe's suggestion to us, although we haven't managed to do it yet).
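
Something along these lines should capture all the thread backtraces in one go (just a sketch; the PID lookup and the output file name are only examples, adjust to your setup):

gdb -batch -ex 'set pagination off' -ex 'thread apply all bt' -p $(pidof slurmctld) > slurmctld-backtrace.txt 2>&1

That attaches briefly, dumps every thread's stack, and detaches, which should be enough to show where the threads are stuck.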

To free it up (for us this condition persists indefinitely), try removing the job-related files (if any) from the first-mentioned node and restarting slurmd on it; then you may need to restart slurmctld relatively brutally, e.g.

scontrol abort

and, when that fails to produce a core file or kill the threads, resort to killall -9 slurmctld and then restart slurmctld.
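
For the node-side step, roughly the following is what I mean (a sketch only: the spool path is whatever SlurmdSpoolDir is set to in your slurm.conf, /var/spool/slurmd is just an example, the job ID is the one from the error message, and you should stop/start slurmd however you normally do on your nodes):

# on the first-mentioned node (nodeA in the example above), as root:
/etc/init.d/slurm stop
rm -rf /var/spool/slurmd/job313404    # leftover job-related files, if any
/etc/init.d/slurm start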

Best regards

Stuart


On 13/05/14 12:38, Mario Kadastik wrote:

Hi,

I'm seeing in logs this:

[2014-05-13T14:24:22.363] server_thread_count over limit (256), waiting

and during that time user commands get:

[root@slurm-1 ~]# squeue -j 73271
squeue: error: slurm_receive_msg: Socket timed out on send/recv operation
slurm_load_jobs error: Socket timed out on send/recv operation

Any ideas how to debug what the 256 threads are actually doing, to understand the 
underlying cause? I doubt it's normal that we're exhausting the thread count 
on a 5000-jobslot cluster...

Mario Kadastik, PhD
Senior researcher

---
   "Physics is like sex, sure it may have practical reasons, but that's not why we 
do it"
      -- Richard P. Feynman


--
Dr. Stuart Rankin

Senior System Administrator
High Performance Computing Service
University of Cambridge
Email: [email protected]
Tel: (+)44 1223 763517
