Sam,
Thanks for reporting this and for the patch. We've never seen this
before, but have applied a variation on your patch to version 2.3.0.
Both should avoid the abort, but in our version the definition of "b"
was moved to the beginning of a block to prevent warning messages from
some compilers, and a log message was added to report this anomaly.
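For illustration only, here is a minimal, self-contained sketch of the
shape of that variation. It is not the code committed to 2.3.0:
toy_buf_t, toy_queue_t, toy_enqueue(), toy_dequeue(), toy_free_buf()
and drain_responses() are hypothetical stand-ins for the request
buffer, the agent list, list_dequeue() and free_buf(), and fprintf()
stands in for the real log call.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define TOY_QUEUE_MAX 8

/* Hypothetical stand-in for a packed request buffer. */
typedef struct toy_buf {
    char *head;
} toy_buf_t;

/* Hypothetical stand-in for the agent queue of pending requests. */
typedef struct toy_queue {
    toy_buf_t *items[TOY_QUEUE_MAX];
    int count;
} toy_queue_t;

/* Append a freshly allocated buffer; returns 0 on success, -1 on failure. */
static int toy_enqueue(toy_queue_t *q)
{
    toy_buf_t *b;

    if (q->count >= TOY_QUEUE_MAX)
        return -1;
    b = malloc(sizeof(*b));
    if (!b)
        return -1;
    b->head = malloc(16);
    if (!b->head) {
        free(b);
        return -1;
    }
    q->items[q->count++] = b;
    return 0;
}

/* Remove and return the oldest buffer, or NULL when the queue is empty. */
static toy_buf_t *toy_dequeue(toy_queue_t *q)
{
    toy_buf_t *b;
    int i;

    if (q->count == 0)
        return NULL;
    b = q->items[0];
    for (i = 1; i < q->count; i++)
        q->items[i - 1] = q->items[i];
    q->count--;
    return b;
}

/* Dereferences its argument unconditionally, so it must not be given NULL. */
static void toy_free_buf(toy_buf_t *b)
{
    free(b->head);
    free(b);
}

/*
 * Release one queued request per response code received.
 * Returns the number of response codes that had no matching request.
 */
static int drain_responses(toy_queue_t *q, int rc_count)
{
    int i, missing = 0;

    assert(q != NULL);
    for (i = 0; i < rc_count; i++) {
        toy_buf_t *b;    /* declared at the start of the block */

        b = toy_dequeue(q);
        if (b) {
            toy_free_buf(b);
        } else {
            /* stand-in for the log message reporting the anomaly */
            fprintf(stderr, "response %d has no queued request\n", i);
            missing++;
        }
    }
    return missing;
}
```

With two requests queued and three response codes to process, the
third iteration finds the queue empty, logs the anomaly and continues
instead of passing NULL to the free routine.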
Moe Jette
SchedMD
Quoting Sam Lang <[email protected]>:
Hello,
I'm trying to debug a problem where the slurmctld process crashed.
The usage scenario was a user trying to cancel and resubmit a bunch
of jobs (100 or so). There's no simple use case that makes this
easily reproducible; we've only seen it happen once so far. The
stack trace where the crash occurred is:
(gdb) bt
#0 slurm_xfree (item=0x8, file=0x5408a5 "pack.c", line=127, func=0x53d483 "") at xmalloc.c:264
#1 0x0000000000494195 in free_buf (my_buf=0x0) at pack.c:127
#2 0x00000000004e9a61 in _handle_mult_rc_ret (x=<value optimized out>) at slurmdbd_defs.c:1664
#3 _agent (x=<value optimized out>) at slurmdbd_defs.c:2030
#4 0x00007f75aaa84971 in start_thread () from /lib/libpthread.so.0
#5 0x00007f75aa7e092d in clone () from /lib/libc.so.6
#6 0x0000000000000000 in ?? ()
The segfault appears to be caused by my_buf being NULL. From reading
the source in slurmdbd_defs.c, it looks as though slurmdbd is
returning multiple response codes to slurmctld in a single message,
but somehow the request queue holds fewer requests than there are
response codes, so it ends up trying to dequeue from an empty queue.
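As a side note, item=0x8 in frame #0 would be consistent with the
address of a struct member being computed from a NULL my_buf. A
minimal sketch, assuming a buffer struct with a 32-bit field ahead of
its data pointer (struct toy_buf below is a hypothetical layout for
illustration, not copied from pack.h):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout; the real buffer struct may differ. */
struct toy_buf {
    uint32_t magic;  /* 4 bytes, followed by padding up to pointer alignment */
    char *head;      /* member whose address would be handed to the free routine */
};

/*
 * Value that &my_buf->head would have if my_buf were NULL.
 * offsetof() is used so that no NULL pointer is actually dereferenced.
 */
static size_t head_addr_for_null_buf(void)
{
    size_t off = offsetof(struct toy_buf, head);

    assert(off >= sizeof(uint32_t));  /* head cannot overlap magic */
    return off;
}
```

On a typical 64-bit Linux build, pointers are 8-byte aligned, so this
offset is 8, which would match the item=0x8 seen in the trace.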
I'm not sure how the queue can be empty here, but the following
patch should prevent the crash by simply checking that the dequeue
returns a non-null buffer to be freed.
Thanks,
-sam
diff --git a/src/common/slurmdbd_defs.c b/src/common/slurmdbd_defs.c
index a29c5b9..d0285e9 100644
--- a/src/common/slurmdbd_defs.c
+++ b/src/common/slurmdbd_defs.c
@@ -1672,7 +1672,8 @@ static int _handle_mult_rc_ret(uint16_t rpc_version, int read_timeout)
!= SLURM_SUCCESS)
break;
- free_buf(list_dequeue(agent_list));
+ Buf b = list_dequeue(agent_list);
+ if(b) free_buf(b);
}
list_iterator_destroy(itr);
}