Hi Mario
Is this some version of 14.03? Do you by any chance have messages like this in
your slurmctld.log:
[2014-05-08T19:19:41.757] error: Batch completion for job 313404 sent from wrong node (nodeA rather
than nodeB), ignored request
If not, you are seeing something different, but we have occasionally seen similar messages heralding
similar symptoms on 14.03.0-0pre5. If you do see the same type of message, it would be useful to attach
a debugger if possible and get a backtrace of the slurmctld threads (this was Moe's suggestion to us,
although we haven't managed to do it yet).
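For example, something like the following should dump a backtrace of every slurmctld thread (a minimal sketch only, assuming gdb is available on the controller and slurmctld runs under that name; note that attaching pauses the daemon briefly):

# attach to the running slurmctld and write all thread backtraces to a file
gdb -batch -ex "thread apply all bt" -p $(pidof slurmctld) > slurmctld-backtrace.txt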
To free it up (for us this condition persists indefinitely), try removing the job-related files (if
any) from the first-mentioned node and restarting slurmd on it. You may then need to restart slurmctld
relatively brutally, e.g.
scontrol abort
and, when that fails to produce a core file or kill the threads, resort to killall -9 slurmctld and
restart slurmctld.
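Roughly, the sequence looks like this (a sketch only; the spool path comes from SlurmdSpoolDir in your slurm.conf, and how you restart the daemons depends on your packaging):

# on the first-mentioned node (nodeA in the example above)
scontrol show config | grep SlurmdSpoolDir   # where slurmd keeps its job-related files
# remove any leftover files for the stuck job under that directory, then restart slurmd

# on the controller
scontrol abort           # ask slurmctld to dump a core file and exit
killall -9 slurmctld     # only if the abort hangs without producing a core
# then start slurmctld again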
Best regards
Stuart
On 13/05/14 12:38, Mario Kadastik wrote:
Hi,
I'm seeing this in the logs:
[2014-05-13T14:24:22.363] server_thread_count over limit (256), waiting
and during that time user commands get:
[root@slurm-1 ~]# squeue -j 73271
squeue: error: slurm_receive_msg: Socket timed out on send/recv operation
slurm_load_jobs error: Socket timed out on send/recv operation
Any ideas how to debug what the 256 threads are in fact doing, to understand the
underlying cause? I doubt it's normal that we're exhausting the thread count
on a 5000-job-slot cluster...
Mario Kadastik, PhD
Senior researcher
---
"Physics is like sex, sure it may have practical reasons, but that's not why we
do it"
-- Richard P. Feynman
--
Dr. Stuart Rankin
Senior System Administrator
High Performance Computing Service
University of Cambridge
Email: [email protected]
Tel: (+)44 1223 763517