Hello,
I'm trying to debug a crash in the slurmctld process. It happened while
a user was cancelling and resubmitting a batch of jobs (100 or so). I
don't have a simple way to reproduce it, and we've only seen it happen
once so far. The stack trace at the point of the crash is:
(gdb) bt
#0 slurm_xfree (item=0x8, file=0x5408a5 "pack.c", line=127,
func=0x53d483 "") at xmalloc.c:264
#1 0x0000000000494195 in free_buf (my_buf=0x0) at pack.c:127
#2 0x00000000004e9a61 in _handle_mult_rc_ret (x=<value optimized out>)
at slurmdbd_defs.c:1664
#3 _agent (x=<value optimized out>) at slurmdbd_defs.c:2030
#4 0x00007f75aaa84971 in start_thread () from /lib/libpthread.so.0
#5 0x00007f75aa7e092d in clone () from /lib/libc.so.6
#6 0x0000000000000000 in ?? ()
The segfault appears to be caused by my_buf being NULL (frame #1). From
reading the source in slurmdbd_defs.c, it looks as though slurmdbd
returns multiple response codes to slurmctld in a single message, but
the request queue (agent_list) somehow holds fewer requests than that,
so _handle_mult_rc_ret ends up dequeuing from an empty queue and passing
the NULL result to free_buf.
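
One detail in the backtrace is consistent with this reading: frame #0
shows item=0x8 while frame #1 shows my_buf=0x0. That is what you would
expect if free_buf passes the address of a member of my_buf to
slurm_xfree and that member sits 8 bytes into the structure. The sketch
below only illustrates that arithmetic. struct toy_buf and
null_buf_head_addr are hypothetical names, the field layout is an
assumption rather than the real definition from pack.h, and the value 8
assumes a typical 64-bit build.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the structure behind Buf. The assumed layout
 * (a 32-bit magic field followed by a pointer) is NOT taken from pack.h. */
struct toy_buf {
	uint32_t magic;
	char *head;
	uint32_t size;
	uint32_t processed;
};

/* Address that &my_buf->head would evaluate to if my_buf were NULL.
 * offsetof computes it without dereferencing anything. */
size_t null_buf_head_addr(void)
{
	return offsetof(struct toy_buf, head);
}
```

With 4-byte uint32_t and 8-byte pointers, the pointer member is aligned
to offset 8, which would match the item=0x8 argument seen in frame #0.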
I'm not sure how the queue can be empty here, but the following patch
should prevent the crash by checking that the dequeue returned a
non-NULL buffer before freeing it.
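
For readers without the Slurm source at hand, the guard can be sketched
as a small standalone model. Everything here is a hypothetical stand-in
(toy_buf, toy_queue, toy_dequeue, toy_free_buf, toy_drain); none of it
is Slurm's actual API. The only behavior assumed from the real code is
that the dequeue returns NULL on an empty queue and that the free
routine must not be handed NULL.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for Buf and agent_list. */
struct toy_buf { char *head; };

struct toy_queue {
	struct toy_buf **items;
	size_t head;   /* index of the next item to dequeue */
	size_t count;  /* number of items still queued */
};

/* Returns the next queued buffer, or NULL when the queue is empty. */
struct toy_buf *toy_dequeue(struct toy_queue *q)
{
	if (q->count == 0)
		return NULL;
	q->count--;
	return q->items[q->head++];
}

/* Frees one buffer; like free_buf in the backtrace, it must not get NULL. */
void toy_free_buf(struct toy_buf *b)
{
	assert(b != NULL);
	free(b->head);
	free(b);
}

/* Handles rc_count return codes, freeing one queued request per code.
 * Returns how many return codes had no matching queued request. */
size_t toy_drain(struct toy_queue *q, size_t rc_count)
{
	size_t unmatched = 0;

	for (size_t i = 0; i < rc_count; i++) {
		struct toy_buf *b = toy_dequeue(q);
		if (b)
			toy_free_buf(b);   /* same guard as in the patch */
		else
			unmatched++;       /* queue ran dry early */
	}
	return unmatched;
}
```

Counting the unmatched return codes is an addition of this sketch, not
part of the patch. Something like it could be logged so that the
underlying mismatch stays visible instead of being silently skipped.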
Thanks,
-sam
diff --git a/src/common/slurmdbd_defs.c b/src/common/slurmdbd_defs.c
index a29c5b9..d0285e9 100644
--- a/src/common/slurmdbd_defs.c
+++ b/src/common/slurmdbd_defs.c
@@ -1672,7 +1672,8 @@ static int _handle_mult_rc_ret(uint16_t rpc_version, int read_timeout)
!= SLURM_SUCCESS)
break;
- free_buf(list_dequeue(agent_list));
+ Buf b = list_dequeue(agent_list);
+ if(b) free_buf(b);
}
list_iterator_destroy(itr);
}