Sam,

Thanks for reporting this and for the patch. We have never seen this before, but we have applied a variation of your patch to version 2.3.0. Both versions should avoid the abort; in ours, the definition of "b" was moved to the beginning of a block to prevent warnings from some compilers, and a log message was added to report the anomaly.
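
For illustration only, the shape of that variation can be sketched as below. This is not the committed code: toy_buf_t, toy_queue_t, toy_dequeue, toy_free_buf and toy_handle_mult_rc are hypothetical stand-ins for Buf, agent_list, list_dequeue, free_buf and _handle_mult_rc_ret, and the log text is made up.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-ins for Slurm's Buf and agent_list; not the real types. */
typedef struct toy_buf {
	char *head;
} toy_buf_t;

typedef struct toy_queue {
	toy_buf_t **items;	/* backing array of queued requests */
	size_t head;		/* index of the oldest entry */
	size_t count;		/* number of entries still queued */
} toy_queue_t;

/* Return the oldest entry, or NULL when the queue is empty. */
static toy_buf_t *toy_dequeue(toy_queue_t *q)
{
	if (q->count == 0)
		return NULL;
	q->count--;
	return q->items[q->head++];
}

/* Like the unguarded free path: the caller must pass a non-NULL buffer. */
static void toy_free_buf(toy_buf_t *b)
{
	assert(b != NULL);
	free(b->head);
	free(b);
}

/*
 * Consume one queued request per return code.
 * Returns the number of return codes that had no matching request.
 */
static int toy_handle_mult_rc(toy_queue_t *q, int rc_count)
{
	int i, missing = 0;

	for (i = 0; i < rc_count; i++) {
		toy_buf_t *b;	/* declared at the start of the block */

		b = toy_dequeue(q);
		if (b) {
			toy_free_buf(b);
		} else {
			/* Report the anomaly instead of freeing a NULL buffer. */
			fprintf(stderr, "toy: return code %d has no queued request\n", i);
			missing++;
		}
	}
	return missing;
}
```

With one queued request and three return codes, toy_handle_mult_rc frees the one buffer and reports the other two return codes instead of aborting.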

Moe Jette
SchedMD

Quoting Sam Lang <[email protected]>:


Hello,

I'm trying to debug a problem where the slurmctld process crashed. The usage scenario was a user trying to cancel and resubmit a bunch of jobs (100 or so). There is no simple use case that reproduces it reliably; we've only seen it happen once so far. The stack trace at the point of the crash is:

(gdb) bt
#0  slurm_xfree (item=0x8, file=0x5408a5 "pack.c", line=127, func=0x53d483 "") at xmalloc.c:264
#1  0x0000000000494195 in free_buf (my_buf=0x0) at pack.c:127
#2  0x00000000004e9a61 in _handle_mult_rc_ret (x=<value optimized out>) at slurmdbd_defs.c:1664
#3  _agent (x=<value optimized out>) at slurmdbd_defs.c:2030
#4  0x00007f75aaa84971 in start_thread () from /lib/libpthread.so.0
#5  0x00007f75aa7e092d in clone () from /lib/libc.so.6
#6  0x0000000000000000 in ?? ()

The segfault appears to be caused by my_buf being NULL. From reading slurmdbd_defs.c, it looks as though slurmdbd returns multiple response codes to slurmctld in a single message, but the request queue somehow holds fewer requests than that, so slurmctld ends up dequeuing from an empty queue.
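
As a rough illustration of why the trace shows item=0x8 together with my_buf=0x0: assuming the buffer struct starts with a 32-bit magic followed by a pointer, and assuming free_buf passes the address of that pointer field down to slurm_xfree, a NULL my_buf would turn the field's offset into the address that gets dereferenced. The struct below is a hypothetical mirror of such a layout, not copied from the Slurm headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: a 32-bit magic followed by the data pointer. */
struct toy_slurm_buf {
	uint32_t magic;
	char *head;
	uint32_t size;
	uint32_t processed;
};

/*
 * The numeric address that &my_buf->head would have if my_buf were NULL,
 * computed with offsetof so nothing is actually dereferenced.
 */
static size_t toy_head_address_for_null_buf(void)
{
	return offsetof(struct toy_slurm_buf, head);
}
```

On a typical 64-bit build the pointer is aligned to 8 bytes, so the padding after the 4-byte magic puts head at offset 8, which would be consistent with the item=0x8 in frame #0.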

I'm not sure how the queue can be empty here, but the following patch should prevent the crash by checking that the dequeue returned a non-NULL buffer before freeing it.

Thanks,
-sam


diff --git a/src/common/slurmdbd_defs.c b/src/common/slurmdbd_defs.c
index a29c5b9..d0285e9 100644
--- a/src/common/slurmdbd_defs.c
+++ b/src/common/slurmdbd_defs.c
@@ -1672,7 +1672,8 @@ static int _handle_mult_rc_ret(uint16_t rpc_version, int read_timeout)
                                    != SLURM_SUCCESS)
                                        break;

-                               free_buf(list_dequeue(agent_list));
+                               Buf b = list_dequeue(agent_list);
+                               if(b) free_buf(b);
                        }
                        list_iterator_destroy(itr);
                }




