Hi, I have SLURM-15.08.1 set up for scheduling multi-threaded jobs on a single computer (56 HT cores). I noticed recently that my slurmctld would run for a while but then die with the following error messages:
...
slurmctld: error: pthread_create error Resource temporarily unavailable
slurmctld: error: pthread_create error Resource temporarily unavailable
slurmctld: error: pthread_create error Resource temporarily unavailable
slurmctld: error: pthread_create error Resource temporarily unavailable
slurmctld: fatal: Can't create pthread

So I started the SLURM controller daemon under GDB to figure out the stacktrace when it crashes. Here's what the stack trace looks like when it fails:

#0 0x00000032f1235cc0 in exit () from /lib64/libc.so.6
#1 0x00000000004fb776 in fatal (fmt=0x63329d "Can't create pthread") at log.c:1147
#2 0x00000000004f0bc5 in _start_msg_tree_internal (hl=0x0, sp_hl=0x7fff6c000bc0, fwd_tree_in=0x7fffecab5ca0, hl_count=1) at forward.c:535
#3 0x00000000004f1258 in start_msg_tree (hl=0x7fff6c0008c0, msg=0x7fffecab5da0, timeout=0) at forward.c:714
#4 0x000000000053108b in slurm_send_recv_msgs (nodelist=0x7ffec0001010 "localhost", msg=0x7fffecab5da0, timeout=0, quiet=true) at slurm_protocol_api.c:4221
#5 0x00000000004369b4 in _thread_per_group_rpc (args=0x7ffec0000920) at agent.c:889
#6 0x00000032f1a07851 in start_thread () from /lib64/libpthread.so.0
#7 0x00000032f12e890d in clone () from /lib64/libc.so.6

It seems like the pthread_create() call in _start_msg_tree_internal() is resulting in an EAGAIN error, but I'm not sure about the root cause. The user process limit (ulimit -u) on the system where slurmctld is running is set to 1024 processes; and increasing the limit to 2048 processes does not help. The other thing I noticed is that the controller daemon does not create a lot of threads (about 5-6 threads are alive at any given point of time), so it's unclear to me what other resource limits to check.

Can anyone more familiar with SLURM help me with this? In particular, I'm trying to puzzle out the context in which these functions -- _thread_per_group_rpc(), slurm_send_recv_msgs(), etc. -- are called; I'm hoping this will point me to the root cause.
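For reference, here is a minimal sketch of how the limits that can make pthread_create() return EAGAIN might be inspected. It assumes Linux with /proc mounted and Python 3. The function names (read_limits, count_threads, daemon_limits) are made up for illustration and are not part of SLURM. The choice of limits is itself an assumption based on the Linux pthread_create(3) documentation: the per-user process limit counts threads across all of that user's processes, and with glibc a failure to allocate a thread stack (stack size or address space limit) may also surface as EAGAIN.

```python
import os
import resource

# Limits that can plausibly make pthread_create() fail on Linux.
LIMITS = {
    "RLIMIT_NPROC": resource.RLIMIT_NPROC,  # processes and threads per real user ID
    "RLIMIT_STACK": resource.RLIMIT_STACK,  # with glibc, soft value sets default thread stack size
    "RLIMIT_AS": resource.RLIMIT_AS,        # address space available for thread stacks
}


def fmt(value):
    """Render a limit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def read_limits():
    """Return {name: (soft, hard)} for the process running this script."""
    return {name: resource.getrlimit(res) for name, res in LIMITS.items()}


def count_threads(pid="self"):
    """Count the threads of one process via /proc/<pid>/task (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/task"))


def daemon_limits(pid):
    """Return the raw limits table of a running process, e.g. slurmctld."""
    with open(f"/proc/{pid}/limits") as fh:
        return fh.read()


if __name__ == "__main__":
    for name, (soft, hard) in read_limits().items():
        print(f"{name}: soft={fmt(soft)} hard={fmt(hard)}")
    print("threads in this process:", count_threads())
    # System-wide ceilings on threads and PIDs.
    for path in ("/proc/sys/kernel/threads-max", "/proc/sys/kernel/pid_max"):
        with open(path) as fh:
            print(path, "=", fh.read().strip())
```

Run as the same user that runs slurmctld, this prints the limits of the script's own process. The daemon may have been started with different limits than an interactive shell, so calling daemon_limits() with the PID of the running slurmctld is likely the more reliable check.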
Thanks, Rohan
