It would be good to add some SIP keepalive monitoring (e.g., a cron job with sipsak sending OPTIONS requests) that will alert or restart Kamailio when there is no response. The monit tool can also send SIP keepalives and take action when no response arrives.
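As a rough illustration of such a keepalive probe, here is a minimal Python sketch that sends one SIP OPTIONS request over UDP and treats any SIP response as proof that the stack is still processing messages. It is not sipsak or monit, only an approximation of the same idea. The function names, the `keepalive` user part, the 127.0.0.1:5060 target and the 2-second timeout are all assumptions made for the example.

```python
import socket
import sys
import uuid


def build_options(target_host, target_port, local_host, local_port):
    """Build a minimal SIP OPTIONS request as bytes (header values are illustrative)."""
    branch = "z9hG4bK" + uuid.uuid4().hex  # RFC 3261 branch magic cookie + random part
    lines = [
        f"OPTIONS sip:{target_host}:{target_port} SIP/2.0",
        f"Via: SIP/2.0/UDP {local_host}:{local_port};branch={branch};rport",
        "Max-Forwards: 70",
        f"From: <sip:keepalive@{local_host}>;tag={uuid.uuid4().hex[:8]}",
        f"To: <sip:{target_host}:{target_port}>",
        f"Call-ID: {uuid.uuid4().hex}@{local_host}",
        "CSeq: 1 OPTIONS",
        "Content-Length: 0",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def sip_alive(target_host, target_port=5060, timeout=2.0):
    """Return True if any SIP response arrives within `timeout` seconds."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            # connect() on UDP only fixes the peer; it lets us learn our local address
            sock.connect((target_host, target_port))
            local_host, local_port = sock.getsockname()
            sock.send(build_options(target_host, target_port, local_host, local_port))
            data = sock.recv(4096)
    except OSError:
        # Covers timeouts, ICMP port unreachable and resolution failures
        return False
    # Any status code counts: even an error reply shows the SIP stack is answering
    return data.startswith(b"SIP/2.0 ")


if __name__ == "__main__":
    # Hypothetical target; exit status 1 lets cron or a wrapper script alert/restart
    sys.exit(0 if sip_alive("127.0.0.1", 5060) else 1)
```

Unlike a process-table check, a probe like this fails when the processes are still present but no longer answering SIP traffic.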
On a deadlock, checking the process table is not enough. There should have been high CPU usage, though, if you monitored that.

Cheers,
Daniel

On 27/03/15 12:47, Alex Balashov wrote:
> This was a rather peculiar crash:
>
> From the logs, it would appear that Kamailio simply stopped processing
> messages at some point. There's about 8 minutes of zero log output at
> a time of constantly incoming traffic.
>
>
> At some point, this situation is resolved when all Kamailio processes
> die with a normal SIGTERM, when someone manually restarted it:
>
> Mar 26 20:40:10 Proxy1 /usr/local/sbin/kamailio[27498]: NOTICE: <core>
> [main.c:739]: handle_sigs(): Thank you for flying kamailio!!!
> Mar 26 20:40:10 Proxy1 /usr/local/sbin/kamailio[27535]: INFO: <core>
> [main.c:850]: sig_usr(): signal 15 received.
> ...
>
> But there are a few things here that are difficult to explain from the
> log:
>
> 1. Why was there no SIP stack response for 8 minutes, no logging
> activity, etc?
>
> 2. We have a script that checks if Kamailio processes are running
> every 1 second, and restarts Kamailio if it's not. It sends an e-mail
> informing us of that development also.
>
> It's a rather naive check:
>
> ps aux | grep kamailio | grep -v 'grep kamailio' | wc -l
>
> But in this case, the script was not triggered, which would imply that
> some Kamailio processes--perhaps all--remained running.
>
> There is no indication in the logs that any process died for any
> reason, except for the 'signal 15' received by all processes at the
> time of manual restart.
>
> 3. Why was a core dump generated at the time of the restart, if
> nothing crashed?
>
> #3 is most interesting to me, because if it were some other problem,
> e.g. blocking of SIP worker threads for some reason, then I wouldn't
> expect a core dump upon service shutdown.
>
> There is no other indication of any child process dying with SIGSEGV
> or SIGABRT.
>
> -- Alex
>
> On 03/27/2015 06:17 AM, Alex Balashov wrote:
>
>> Hello,
>>
>> The system experienced another crash yesterday, but unfortunately the
>> core dump is not very insightful, possibly due to being incomplete:
>>
>> BFD: Warning: /tmp/./core.kamailio.500.1427402410.27498 is truncated:
>> expected core file size >= 8602058752, found: 1769852928.
>> [New Thread 27498]
>> Cannot access memory at address 0x7f52891e3168
>> Cannot access memory at address 0x7f52891e3168
>> Cannot access memory at address 0x7f52891e3168
>> Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols
>> found)...done.
>> Loaded symbols for /lib64/ld-linux-x86-64.so.2
>> Failed to read a valid object file image from memory.
>> Core was generated by `/usr/local/sbin/kamailio -P /var/run/kamailio.pid
>> -m 8192 -u evaristesys -g eva'.
>> Program terminated with signal 11, Segmentation fault.
>> #0 0x00007f5286d97e45 in ?? ()
>> Missing separate debuginfos, use: debuginfo-install
>> glibc-2.12-1.149.el6_6.5.x86_64
>> (gdb) where
>> #0 0x00007f5286d97e45 in ?? ()
>> Cannot access memory at address 0x7fffbe32a210
>>
>>
>> That's not much help at all, so I cannot possibly say it is for the same
>> reasons as before.
>>
>>
>>
>
>

--
Daniel-Constantin Mierla
http://twitter.com/#!/miconda - http://www.linkedin.com/in/miconda
Kamailio World Conference, May 27-29, 2015
Berlin, Germany - http://www.kamailioworld.com

_______________________________________________
sr-dev mailing list
[email protected]
http://lists.sip-router.org/cgi-bin/mailman/listinfo/sr-dev
