Hello,

You say you can reproduce it when you run load tests, so it would be better to get the output of:

kamctl trap

It writes the gdb "bt full" output for all Kamailio processes to a file that you can attach here.

All the locks you listed in your email can be a side effect of another blocking operation, because at first sight the lock() inside bcast_dmq_message1() has a corresponding unlock().

Cheers,
Daniel

On 26.10.20 23:22, Patrick Wakano wrote:
> Hello list,
> Hope all are doing well!
>
> We are running load tests in our Kamailio server, that is just making
> inbound and outbound calls and eventually (there is no identified
> pattern) Kamailio freezes and of course all calls start to fail. It
> does not crash, it just stops responding and it has to be killed -9.
> When this happens, SIP messages are not processed, dmq keepalive fails
> (so the other node reports as down), dialog KA are not sent, but
> Registrations from UAC seem to still go out (logs from local_route are
> seen).
> We don't have a high amount of cps, it is max 3 or 4 per sec, and it
> gets around 1900 active calls. We are now using Kamailio 5.2.8
> installed from the repo on a CentOS7 server. Dialog has KA active and
> DMQ (with 2 workers) is being used on an active-active instance.
> From investigation using GDB as pasted below, I can see UDP workers
> are stuck on a lock either on a callback from t_relay...
> #0 0x00007ffb74e9bbf9 in syscall () from /lib64/libc.so.6
> #1 0x00007ffb2b1bce08 in futex_get (lock=0x7ffb35217b90) at
> ../../core/futexlock.h:108
> #2 0x00007ffb2b1bec44 in bcast_dmq_message1 (peer=0x7ffb35e8bf38,
> body=0x7fff2e95ffb0, except=0x0, resp_cback=0x7ffb2a8a0ab0
> <dlg_dmq_resp_callback>, max_forwards=1, content_type=0x7ffb2a8a0a70
> <dlg_dmq_content_type>, incl_inactive=0) at dmq_funcs.c:156
> #3 0x00007ffb2b1bf46b in bcast_dmq_message (peer=0x7ffb35e8bf38,
> body=0x7fff2e95ffb0, except=0x0, resp_cback=0x7ffb2a8a0ab0
> <dlg_dmq_resp_callback>, max_forwards=1, content_type=0x7ffb2a8a0a70
> <dlg_dmq_content_type>) at dmq_funcs.c:188
> #4 0x00007ffb2a6448fa in dlg_dmq_send (body=0x7fff2e95ffb0, node=0x0)
> at dlg_dmq.c:88
> #5 0x00007ffb2a64da5d in dlg_dmq_replicate_action
> (action=DLG_DMQ_UPDATE, dlg=0x7ffb362ea3c8, needlock=1, node=0x0) at
> dlg_dmq.c:628
> #6 0x00007ffb2a61f28e in dlg_on_send (t=0x7ffb36c98120, type=16,
> param=0x7fff2e9601e0) at dlg_handlers.c:739
> #7 0x00007ffb2ef285b6 in run_trans_callbacks_internal
> (cb_lst=0x7ffb36c98198, type=16, trans=0x7ffb36c98120,
> params=0x7fff2e9601e0) at t_hooks.c:260
> #8 0x00007ffb2ef286d0 in run_trans_callbacks (type=16,
> trans=0x7ffb36c98120, req=0x7ffb742f27e0, rpl=0x0, code=-1) at
> t_hooks.c:287
> #9 0x00007ffb2ef38ac1 in prepare_new_uac (t=0x7ffb36c98120,
> i_req=0x7ffb742f27e0, branch=0, uri=0x7fff2e9603e0,
> path=0x7fff2e9603c0, next_hop=0x7ffb742f2a58, fsocket=0x7ffb73e3e968,
> snd_flags=..., fproto=0, flags=2, instance=0x7fff2e9603b0,
> ruid=0x7fff2e9603a0, location_ua=0x7fff2e960390) at t_fwd.c:381
> #10 0x00007ffb2ef3d02d in add_uac (t=0x7ffb36c98120,
> request=0x7ffb742f27e0, uri=0x7ffb742f2a58, next_hop=0x7ffb742f2a58,
> path=0x7ffb742f2e20, proxy=0x0, fsocket=0x7ffb73e3e968, snd_flags=...,
> proto=0, flags=2, instance=0x7ffb742f2e30, ruid=0x7ffb742f2e48,
> location_ua=0x7ffb742f2e58) at t_fwd.c:811
> #11 0x00007ffb2ef4535a in t_forward_nonack (t=0x7ffb36c98120,
> p_msg=0x7ffb742f27e0, proxy=0x0, proto=0) at t_fwd.c:1699
> #12 0x00007ffb2ef20505 in t_relay_to (p_msg=0x7ffb742f27e0, proxy=0x0,
> proto=0, replicate=0) at t_funcs.c:334
>
> or loose_route...
> #0 0x00007ffb74e9bbf9 in syscall () from /lib64/libc.so.6
> #1 0x00007ffb2b1bce08 in futex_get (lock=0x7ffb35217b90) at
> ../../core/futexlock.h:108
> #2 0x00007ffb2b1bec44 in bcast_dmq_message1 (peer=0x7ffb35e8bf38,
> body=0x7fff2e9629d0, except=0x0, resp_cback=0x7ffb2a8a0ab0
> <dlg_dmq_resp_callback>, max_forwards=1, content_type=0x7ffb2a8a0a70
> <dlg_dmq_content_type>, incl_inactive=0) at dmq_funcs.c:156
> #3 0x00007ffb2b1bf46b in bcast_dmq_message (peer=0x7ffb35e8bf38,
> body=0x7fff2e9629d0, except=0x0, resp_cback=0x7ffb2a8a0ab0
> <dlg_dmq_resp_callback>, max_forwards=1, content_type=0x7ffb2a8a0a70
> <dlg_dmq_content_type>) at dmq_funcs.c:188
> #4 0x00007ffb2a6448fa in dlg_dmq_send (body=0x7fff2e9629d0, node=0x0)
> at dlg_dmq.c:88
> #5 0x00007ffb2a64da5d in dlg_dmq_replicate_action
> (action=DLG_DMQ_STATE, dlg=0x7ffb363e0c10, needlock=0, node=0x0) at
> dlg_dmq.c:628
> #6 0x00007ffb2a62b3bf in dlg_onroute (req=0x7ffb742f11d0,
> route_params=0x7fff2e962ce0, param=0x0) at dlg_handlers.c:1538
> #7 0x00007ffb2e7db203 in run_rr_callbacks (req=0x7ffb742f11d0,
> rr_param=0x7fff2e962d80) at rr_cb.c:96
> #8 0x00007ffb2e7eb2f9 in after_loose (_m=0x7ffb742f11d0, preloaded=0)
> at loose.c:945
> #9 0x00007ffb2e7eb990 in loose_route (_m=0x7ffb742f11d0) at loose.c:979
>
> or t_check_trans:
> #0 0x00007ffb74e9bbf9 in syscall () from /lib64/libc.so.6
> #1 0x00007ffb2a5ea9c6 in futex_get (lock=0x7ffb35e78804) at
> ../../core/futexlock.h:108
> #2 0x00007ffb2a5f1c46 in dlg_lookup_mode (h_entry=1609, h_id=59882,
> lmode=0) at dlg_hash.c:709
> #3 0x00007ffb2a5f27aa in dlg_get_by_iuid (diuid=0x7ffb36326bd0) at
> dlg_hash.c:777
> #4 0x00007ffb2a61ba1d in dlg_onreply (t=0x7ffb36952988, type=2,
> param=0x7fff2e963bf0) at dlg_handlers.c:437
> #5 0x00007ffb2ef285b6 in run_trans_callbacks_internal
> (cb_lst=0x7ffb36952a00, type=2, trans=0x7ffb36952988,
> params=0x7fff2e963bf0) at t_hooks.c:260
> #6 0x00007ffb2ef286d0 in run_trans_callbacks (type=2,
> trans=0x7ffb36952988, req=0x7ffb3675c360, rpl=0x7ffb742f1930,
> code=200) at t_hooks.c:287
> #7 0x00007ffb2ee7037f in t_reply_matching (p_msg=0x7ffb742f1930,
> p_branch=0x7fff2e963ebc) at t_lookup.c:997
> #8 0x00007ffb2ee725e4 in t_check_msg (p_msg=0x7ffb742f1930,
> param_branch=0x7fff2e963ebc) at t_lookup.c:1101
> #9 0x00007ffb2eee44c7 in t_check_trans (msg=0x7ffb742f1930) at tm.c:2351
>
> And the DMQ workers are here:
> #0 0x00007ffb74e9bbf9 in syscall () from /lib64/libc.so.6
> #1 0x00007ffb2b1d6c81 in futex_get (lock=0x7ffb35217c34) at
> ../../core/futexlock.h:108
> #2 0x00007ffb2b1d7c3a in worker_loop (id=1) at worker.c:86
> #3 0x00007ffb2b1d5d35 in child_init (rank=0) at dmq.c:300
>
> Currently I will not be able to upgrade to latest 5.4 version to try
> to reproduce the error and since 5.2.8 has already reached
> end-of-life, maybe is there anything I can do on the configuration to
> avoid such condition?
> Any ideas are welcome!
>
> Kind regards,
> Patrick Wakano
>
> _______________________________________________
> Kamailio (SER) - Users Mailing List
> [email protected]
> https://lists.kamailio.org/cgi-bin/mailman/listinfo/sr-users

--
Daniel-Constantin Mierla -- www.asipto.com
www.twitter.com/miconda -- www.linkedin.com/in/miconda
Funding: https://www.paypal.me/dcmierla
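Once a full dump of all processes is available, one way to look for the real blocker is to group the waiting processes by the lock address they are stuck on. The following is only a minimal sketch, not part of kamctl or Kamailio: it assumes the dump contains standard gdb frame lines, one frame per line, and the function name waiters_by_lock and the abbreviated SAMPLE text are hypothetical, built from the frames quoted in this thread.

```python
import re
from collections import defaultdict

# Matches a gdb frame that is waiting in futex_get, e.g.
#   #1  0x00007ffb2b1bce08 in futex_get (lock=0x7ffb35217b90) at ...
FUTEX_RE = re.compile(r"in futex_get \(lock=(0x[0-9a-f]+)\)")
# Matches the function name of any gdb frame line.
FRAME_RE = re.compile(r"#\d+\s+0x[0-9a-f]+ in (\w+) \(")


def waiters_by_lock(dump):
    """Map each lock address to the functions that called futex_get on it."""
    waiters = defaultdict(list)
    pending_lock = None
    for line in dump.splitlines():
        futex = FUTEX_RE.search(line)
        if futex:
            # Remember the lock; the next frame line names the caller.
            pending_lock = futex.group(1)
            continue
        frame = FRAME_RE.search(line)
        if frame and pending_lock is not None:
            waiters[pending_lock].append(frame.group(1))
            pending_lock = None
    return dict(waiters)


# Abbreviated frames taken from the backtraces in this thread
# (arguments of bcast_dmq_message1 shortened for the example).
SAMPLE = """\
#0  0x00007ffb74e9bbf9 in syscall () from /lib64/libc.so.6
#1  0x00007ffb2b1bce08 in futex_get (lock=0x7ffb35217b90) at ../../core/futexlock.h:108
#2  0x00007ffb2b1bec44 in bcast_dmq_message1 (peer=0x7ffb35e8bf38) at dmq_funcs.c:156
#0  0x00007ffb74e9bbf9 in syscall () from /lib64/libc.so.6
#1  0x00007ffb2a5ea9c6 in futex_get (lock=0x7ffb35e78804) at ../../core/futexlock.h:108
#2  0x00007ffb2a5f1c46 in dlg_lookup_mode (h_entry=1609, h_id=59882, lmode=0) at dlg_hash.c:709
#0  0x00007ffb74e9bbf9 in syscall () from /lib64/libc.so.6
#1  0x00007ffb2b1d6c81 in futex_get (lock=0x7ffb35217c34) at ../../core/futexlock.h:108
#2  0x00007ffb2b1d7c3a in worker_loop (id=1) at worker.c:86
"""

if __name__ == "__main__":
    for lock, callers in waiters_by_lock(SAMPLE).items():
        print(lock, len(callers), sorted(set(callers)))
```

A lock address with many waiters points at the contended lock; a process whose backtrace is inside code that holds that lock but is not itself in futex_get on it would be the candidate for the blocking operation.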
