Hello, can you get the output of the next gdb commands:
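(For reference, output like this can be captured non-interactively with gdb's batch mode — a sketch only; the kamailio binary and core file paths below are placeholders, substitute the real ones:)

```shell
# Placeholder paths -- replace with the actual kamailio binary and core file.
# -batch runs the -ex commands in order against the core and then exits.
gdb -batch \
    -ex 'frame 2' \
    -ex 'p *p_cell' \
    -ex 'list' \
    /usr/sbin/kamailio /tmp/core.first
```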
- first core file:
    frame 2
    p *p_cell
    list
- second core file:
    frame 8
    p *p_cell
    list

Is there any way that I can get access to the system with the core
files via ssh? It would help speed up investigating the core files. If
yes, I can provide my ssh key.

Cheers,
Daniel

On 19.03.19 16:12, Enrico Bandiera wrote:
> Hi guys, recently we had some issues with kamailio in production: a
> proxy at the edge of the infrastructure was crashing seemingly
> randomly. It happened a couple of times in a timespan of 4-5 days,
> until at one point we found it in an endless loop, unable to process
> SIP packets anymore and printing messages like these over and over:
>
> 50(60) CRITICAL: <core> [core/mem/q_malloc.c:512]: qm_free(): BUG:
> freeing already freed pointer (0x7ff40a359450), called from core:
> core/usr_avp.c: destroy_avp_list_unsafe(626), first free core:
> core/usr_avp.c: destroy_avp_list_unsafe(626) - ignoring
> 50(60) CRITICAL: <core> [core/mem/q_malloc.c:512]: qm_free(): BUG:
> freeing already freed pointer (0x7ff409c4fa58), called from core:
> core/usr_avp.c: destroy_avp_list_unsafe(626), first free core:
> core/usr_avp.c: destroy_avp_list_unsafe(626) - ignoring
>
> We managed to replicate the crash and the loop in a dev environment,
> with a lot of difficulty. It seems to be related not to peak load,
> but to the growing number of processed calls.
>
> The kamailio version is:
> version: kamailio 5.2.2 (x86_64/linux)
> flags: STATS: Off, USE_TCP, USE_TLS, USE_SCTP, TLS_HOOKS,
> USE_RAW_SOCKS, DISABLE_NAGLE, USE_MCAST, DNS_IP_HACK, SHM_MEM,
> SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, TLSF_MALLOC, DBG_SR_MEMORY,
> USE_FUTEX, FAST_LOCK-ADAPTIVE_WAIT, USE_DNS_CACHE, USE_DNS_FAILOVER,
> USE_NAPTR, USE_DST_BLACKLIST, HAVE_RESOLV_RES
> ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_URI_SIZE
> 1024, BUF_SIZE 65535, DEFAULT PKG_SIZE 8MB
> poll method support: poll, epoll_lt, epoll_et, sigio_rt, select.
> id: unknown
> compiled with gcc 5.3.1
>
> Every time this happened it was the timer pid that was crashing or
> looping; we managed to get 2 core dumps of the crash and 1 of the
> loop (got it after killing the timer with SIGQUIT).
>
> The timer process crashes after attempting to free memory that is not
> correctly managed. In particular, we saw that it happens inside
> destroy_avp_list_unsafe(), called by the tm timer, when uri_avps_from
> points to an invalid memory address.
> We don't use dialog, uac, or http_async_client, but we do use topos.
>
> In the crash dumps it seems that there's something wrong inside
> free_cell_helper(): when trying to clean the dead_cell, something goes
> wrong while freeing uri_avps_from. Here's the crash log:
>
> 61(76) CRITICAL: <core> [core/pass_fd.c:277]: receive_fd(): EOF on 66
> 0(15) ALERT: <core> [main.c:740]: handle_sigs(): child process 65
> exited by a signal 11
> 0(15) ALERT: <core> [main.c:743]: handle_sigs(): core was generated
> 0(15) INFO: <core> [main.c:766]: handle_sigs(): terminating due to
> SIGCHLD
> <childs getting the SIGTERM signal>
> 0(15) CRITICAL: <core> [core/mem/q_malloc.c:512]: qm_free(): BUG:
> freeing already freed pointer (0x7f4d60a2e168), called from htable:
> ht_api.c: ht_cell_free(217), first free core: core/usr_avp.c:
> destroy_avp_list_unsafe(626) - ignoring
> 0(15) INFO: <core> [core/sctp_core.c:53]: sctp_core_destroy(): SCTP
> API not initialized
>
> Before the crash we have no warnings or errors about shared memory
> allocation.
>
> backtrace:
>
> (gdb) bt full
> #0  0x00000000006371d7 in destroy_avp_list_unsafe
>     (list=0x7f34f7a411b0) at core/usr_avp.c:625
>         avp = 0x2
>         foo = 0x2
>         __func__ = "destroy_avp_list_unsafe"
> #1  0x00007f351c6909ab in free_cell_helper (dead_cell=0x7f34f7a41018,
>     silent=0, fname=0x7f351c796a02 "timer.c", fline=689) at h_table.c:260
>         b = 0x0
>         i = 1
>         rpl = 0x0
>         tt = 0x0
>         foo = 0x11c68daf2
>         cbs = 0x0
>         cbs_tmp = 0x7f34f80c3790
>         __func__ = "free_cell_helper"
> #2  0x00007f351c7291e6 in wait_handler (ti=1984410416,
>     wait_tl=0x7f34f7a410a0, data=0x7f34f7a41018) at timer.c:689
>         p_cell = 0x7f34f7a41018
>         ret = 1
>         unlinked = 0
>         rcount = 1
>         __func__ = "wait_handler"
> #3  0x00000000004d0f56 in timer_list_expire (t=1984410416,
>     h=0x7f34f6be7850, slow_l=0x7f34f6bea368, slow_mark=2702) at
>     core/timer.c:874
>         tl = 0x7f34f7a410a0
>         ret = 0
> #4  0x00000000004d13bd in timer_handler () at core/timer.c:939
>         saved_ticks = 1984410416
>         run_slow_timer = 0
>         i = 654
>         __func__ = "timer_handler"
> #5  0x00000000004d1845 in timer_main () at core/timer.c:978
>         No locals.
> #6  0x0000000000425cc8 in main_loop () at main.c:1727
>         i = 8
>         pid = 0
>         si = 0x0
>         si_desc = "udp receiver child=7 sock=127.0.0.1:5060
>     \000\000\000\000q\000\000\000@6y\000\000\000\000\000\000\b\206|\210X\322:\032\301\255\000\000\000\000\000\000\000\000
>     \000\000\000\000\000\000\000\002\000\000\000\000\063\275{\000\000\000\000\000>",
>     '\000' <repeats 15 times>,
>     "\320\220\377\226\374\177\000\000\351\022a\000\000\000\000"
>         nrprocs = 8
>         woneinit = 1
>         __func__ = "main_loop"
> #7  0x000000000042ca76 in main (argc=10, argv=0x7ffc96ff9398) at
>     main.c:2696
>         cfg_stream = 0x25ca010
>         c = -1
>         r = 0
>         tmp = 0x7ffc96ffa4c9 ""
>         tmp_len = -1761635728
>         port = 32764
>         proto = -1761635632
>         options = 0x76c598
>     ":f:cm:M:dVIhEeb:l:L:n:vKrRDTN:W:w:t:u:g:P:G:SQ:O:a:A:x:X:Y:"
>         ret = -1
>         seed = 1492539644
>         rfd = 4
>         debug_save = 0
>         debug_flag = 0
>         dont_fork_cnt = 2
>         n_lst = 0x7f351f6f6718
>         p = 0xffffffff <error: Cannot access memory at address 0xffffffff>
>         st = {st_dev = 2049, st_ino = 1282695, st_nlink = 2, st_mode =
>     16895, st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size =
>     4096, st_blksize = 4096, st_blocks = 8, st_atim = {tv_sec =
>     1552931478, tv_nsec = 972761643},
>     st_mtim = {tv_sec = 1552931478, tv_nsec = 972761643},
>     st_ctim = {tv_sec = 1552931478, tv_nsec = 972761643}, __glibc_reserved
>     = {0, 0, 0}}
>         __func__ = "main"
>
> In the loop dump we have a different situation: there's a loop in the
> uri_avps_from structure, as you can see from the next pointers:
>
> (gdb) p *dead_cell->uri_avps_from
> $3 = {id = 66, flags = 275, next = 0x7ff40a359450, d = {p =
>     0x7ff40a9ad540, l = 140686126667072, data = "@\325\232\n\364\177\000"}}
> (gdb) p *dead_cell->uri_avps_from.next
> $4 = {id = 66, flags = 275, next = 0x7ff409c4fa58, d = {p =
>     0x7ff40a359480, l = 140686120031360, data = "\200\224\065\n\364\177\000"}}
> (gdb) p *dead_cell->uri_avps_from.next.next
> $5 = {id = 66, flags = 275, next = 0x7ff40a359450, d = {p =
>     0x7ff409c4fa88, l = 140686112651912, data = "\210\372\304\t\364\177\000"}}
>
> backtrace:
> (gdb) bt full
> #0  0x00007ff43152e67e in _IO_default_xsputn () from
>     /lib/x86_64-linux-gnu/libc.so.6
> No symbol table info available.
> #1  0x00007ff43150150b in vfprintf () from /lib/x86_64-linux-gnu/libc.so.6
> No symbol table info available.
> #2  0x00007ff431502ef1 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> No symbol table info available.
> #3  0x00007ff43150032d in vfprintf () from /lib/x86_64-linux-gnu/libc.so.6
> No symbol table info available.
> #4  0x00007ff4315087f7 in fprintf () from /lib/x86_64-linux-gnu/libc.so.6
> No symbol table info available.
> #5  0x0000000000686e43 in qm_free (qmp=0x7ff409336000,
>     p=0x7ff409c4fa58, file=0x7e0a75 "core: core/usr_avp.c", func=0x7e2d70
>     <__func__.8439> "destroy_avp_list_unsafe", line=626, mname=0x7e0a70
>     "core") at core/mem/q_malloc.c:512
>         __llevel = -3
>         qm = 0x7ff409336000
>         f = 0x7ff409c4fa20
>         size = 72
>         next = 0x0
>         prev = 0x0
>         __func__ = "qm_free"
> #6  0x0000000000637209 in destroy_avp_list_unsafe
>     (list=0x7ff409e1b6c8) at core/usr_avp.c:626
>         avp = 0x7ff40a359450
>         foo = 0x7ff409c4fa58
>         __func__ = "destroy_avp_list_unsafe"
> #7  0x00007ff42ee5d9ab in free_cell_helper (dead_cell=0x7ff409e1b530,
>     silent=0, fname=0x7ff42ef63a02 "timer.c", fline=689) at h_table.c:260
>         b = 0x0
>         i = 1
>         rpl = 0x0
>         tt = 0x0
>         foo = 0x12ee5aaf2
>         cbs = 0x0
>         cbs_tmp = 0x7ff409c5d420
>         __func__ = "free_cell_helper"
> #8  0x00007ff42eef61e6 in wait_handler (ti=89094309,
>     wait_tl=0x7ff409e1b5b8, data=0x7ff409e1b530) at timer.c:689
>         p_cell = 0x7ff409e1b530
>         ret = 1
>         unlinked = 0
>         rcount = 1
>         __func__ = "wait_handler"
> #9  0x00000000004d0f56 in timer_list_expire (t=89094309,
>     h=0x7ff4093b4850, slow_l=0x7ff4093b8058, slow_mark=5981) at
>     core/timer.c:874
>         tl = 0x7ff409e1b5b8
>         ret = 0
> #10 0x00000000004d13bd in timer_handler () at core/timer.c:939
>         saved_ticks = 89094309
>         run_slow_timer = 0
>         i = 861
>         __func__ = "timer_handler"
> #11 0x00000000004d1845 in timer_main () at core/timer.c:978
>         No locals.
> #12 0x0000000000425cc8 in main_loop () at main.c:1727
>         i = 8
>         pid = 0
>         si = 0x0
>         si_desc = "udp receiver child=7 sock=127.0.0.1:5060
>     \000\000\000\000q\000\000\000@6y\000\000\000\000\000\000\032
>     t\220\260\242\024\032\301\255\000\000\000\000\000\000\000\000
>     \000\000\000\000\000\000\000\002\000\000\000\000\063\275{\000\000\000\000\000>",
>     '\000' <repeats 15 times>,
>     "\220\306\336\071\377\177\000\000\351\022a\000\000\000\000"
>         nrprocs = 8
>         woneinit = 1
>         __func__ = "main_loop"
> #13 0x000000000042ca76 in main (argc=10, argv=0x7fff39dec958) at
>     main.c:2696
>         cfg_stream = 0x2292010
>         c = -1
>         r = 0
>         tmp = 0x7fff39dee4c9 ""
>         tmp_len = 970901552
>         port = 32767
>         proto = 970901648
>         options = 0x76c598
>     ":f:cm:M:dVIhEeb:l:L:n:vKrRDTN:W:w:t:u:g:P:G:SQ:O:a:A:x:X:Y:"
>         ret = -1
>         seed = 2710999419
>         rfd = 4
>         debug_save = 0
>         debug_flag = 0
>         dont_fork_cnt = 2
>         n_lst = 0x7ff431ec3718
>         p = 0xffffffff <error: Cannot access memory at address 0xffffffff>
>         st = {st_dev = 2049, st_ino = 1282695, st_nlink = 2, st_mode =
>     16895, st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size =
>     4096, st_blksize = 4096, st_blocks = 8, st_atim = {tv_sec =
>     1552931523, tv_nsec = 918435869},
>     st_mtim = {tv_sec = 1552994274, tv_nsec = 814438687},
>     st_ctim = {tv_sec = 1552994274, tv_nsec = 814438687}, __glibc_reserved
>     = {0, 0, 0}}
>         __func__ = "main"
>
> We are not sure how to keep debugging this issue, which is causing
> some serious trouble in our environment; any help is appreciated.
>
> Here you can find the dumps (beware: a 12M download, but 1.5G once
> uncompressed):
>
> https://drive.google.com/file/d/1U0LZedyona8jJZHq6HaoqXpfnU83p6Sp/view?usp=sharing
>
> Due thanks to Giacomo Vacca, who helped us dig into this issue.
>
> Thanks

-- 
Daniel-Constantin Mierla -- www.asipto.com
www.twitter.com/miconda -- www.linkedin.com/in/miconda
Kamailio World Conference - May 6-8, 2019 -- www.kamailioworld.com
Kamailio Advanced Training - Mar 25-27, 2019, in Washington, DC, USA --
www.asipto.com
_______________________________________________
Kamailio (SER) - Users Mailing List
[email protected]
https://lists.kamailio.org/cgi-bin/mailman/listinfo/sr-users
