Re: [SR-Users] crash at 480 reply to INVITE

2019-03-04 Thread Daniel-Constantin Mierla
Short update: I think I found the issue and tried to come up with a
solution in the commit:

  -
https://github.com/kamailio/kamailio/commit/814d5cc1f4f5b1e4b95737108dffc1e7d7bd566f

The tests that reproduced the crash rather quickly before the commit
(done by Yufei Tao) are now running fine for long time.

I will backport to stable branches and likely next week I am considering
to do a new release from 5.2 branch.

Cheers,
Daniel

On 26.02.19 02:58, Juha Heinanen wrote:
> Juha Heinanen writes:
>
>> It was Debian Stretch installed on (if I remember correctly) xen virtual
>> machine.  I'll check to be sure.
> Yes, it was VM on XenServer.

-- 
Daniel-Constantin Mierla -- www.asipto.com
www.twitter.com/miconda -- www.linkedin.com/in/miconda
Kamailio World Conference - May 6-8, 2019 -- www.kamailioworld.com
Kamailio Advanced Training - Mar 4-6, 2019 in Berlin; Mar 25-27, 2019, in 
Washington, DC, USA -- www.asipto.com


___
Kamailio (SER) - Users Mailing List
sr-users@lists.kamailio.org
https://lists.kamailio.org/cgi-bin/mailman/listinfo/sr-users


Re: [SR-Users] crash at 480 reply to INVITE

2019-02-25 Thread Juha Heinanen
Juha Heinanen writes:

> It was Debian Stretch installed on (if I remember correctly) xen virtual
> machine.  I'll check to be sure.

Yes, it was VM on XenServer.

___
Kamailio (SER) - Users Mailing List
sr-users@lists.kamailio.org
https://lists.kamailio.org/cgi-bin/mailman/listinfo/sr-users


Re: [SR-Users] crash at 480 reply to INVITE

2019-02-25 Thread Juha Heinanen
Daniel-Constantin Mierla writes:

> Can you tell what linux distro and version was running in your case?
> Also, can you give the output for:

It was Debian Stretch installed on (if I remember correctly) xen virtual
machine.  I'll check to be sure.

-- Juha

___
Kamailio (SER) - Users Mailing List
sr-users@lists.kamailio.org
https://lists.kamailio.org/cgi-bin/mailman/listinfo/sr-users


Re: [SR-Users] crash at 480 reply to INVITE

2019-02-25 Thread Daniel-Constantin Mierla
Today there was another report with a backtrace like the one you
reported, so I try to figure out if there are any common aspects between
the two systems.

Can you tell what linux distro and version was running in your case?
Also, can you give the output for:

uname -a

Cheers,
Daniel

On 14.02.19 08:54, Juha Heinanen wrote:
> Daniel-Constantin Mierla writes:
>
>> However, that had invalid access to transaction pointer. In this case,
>> accessing transaction is ok, by setting a variable to the address of one
>> of its fields, but then the variable becomes NULL, which is not possible
>> unless the stack got corrupted/overwritten. However, from the moment of
>> setting the variable till the moment of the crash, there are no other
>> set or copy operations, therefore nothing that I can relate to
>> kamailio's c code. Maybe it happens another time and the backtrace will
>> be different.
> I agree, Juha

-- 
Daniel-Constantin Mierla -- www.asipto.com
www.twitter.com/miconda -- www.linkedin.com/in/miconda
Kamailio World Conference - May 6-8, 2019 -- www.kamailioworld.com
Kamailio Advanced Training - Mar 4-6, 2019 in Berlin; Mar 25-27, 2019, in 
Washington, DC, USA -- www.asipto.com


___
Kamailio (SER) - Users Mailing List
sr-users@lists.kamailio.org
https://lists.kamailio.org/cgi-bin/mailman/listinfo/sr-users


Re: [SR-Users] crash at 480 reply to INVITE

2019-02-13 Thread Juha Heinanen
Daniel-Constantin Mierla writes:

> However, that had invalid access to transaction pointer. In this case,
> accessing transaction is ok, by setting a variable to the address of one
> of its fields, but then the variable becomes NULL, which is not possible
> unless the stack got corrupted/overwritten. However, from the moment of
> setting the variable till the moment of the crash, there are no other
> set or copy operations, therefore nothing that I can relate to
> kamailio's c code. Maybe it happens another time and the backtrace will
> be different.

I agree, Juha

___
Kamailio (SER) - Users Mailing List
sr-users@lists.kamailio.org
https://lists.kamailio.org/cgi-bin/mailman/listinfo/sr-users


Re: [SR-Users] crash at 480 reply to INVITE

2019-02-13 Thread Daniel-Constantin Mierla

On 13.02.19 22:24, Juha Heinanen wrote:
> Daniel-Constantin Mierla writes:
>
>> replying to the initial message to have the backtrace easy to look at
>> its content...
>>
>> The info locals in frame 0 show:
>>
>> uac = 0x0
>>
>> However, that is set few lines above as:
>>
>> uac=>uac[branch];
>>
>> An address of a variable (or field in a structure) cannot be null. Some
>> something happened with the stack. Did the OS kept running smooth after
>> this issue?
> The OS is running smoothly all the time.  It is possible to ssh to the
> host, take packet traces, etc. during Kamailio freeze.  This
> freezing/recovery happens happening at the frequency of once in a few
> days.
>
>> So at this moment, either the core file was somehow corrupted/not
>> properly dumped or kernel process supervizer did something wrong on
>> resume after the freeze.
> As I mentioned, the OS is running normally during Kamailio freeze.
>
>> There are no safety checks that can be added. Maybe you can try to
>> reproduce and see if the new corefile gives a different backtrace.
> Crash has happened only once after the freeze.  I'll let you know if
> another crash happends and dump becomes available.
>
> Could this freezing be caused by same kind of thing as the rtpengine
> non-availability freeze that you fixed in master?  If I remember
> correctly, the fix was not backported to stable.


That patch was backported (at least to branch 5.2).

However, that had invalid access to transaction pointer. In this case,
accessing transaction is ok, by setting a variable to the address of one
of its fields, but then the variable becomes NULL, which is not possible
unless the stack got corrupted/overwritten. However, from the moment of
setting the variable till the moment of the crash, there are no other
set or copy operations, therefore nothing that I can relate to
kamailio's c code. Maybe it happens another time and the backtrace will
be different.

Cheers,
Daniel

-- 
Daniel-Constantin Mierla -- www.asipto.com
www.twitter.com/miconda -- www.linkedin.com/in/miconda
Kamailio World Conference - May 6-8, 2019 -- www.kamailioworld.com
Kamailio Advanced Training - Mar 4-6, 2019 in Berlin; Mar 25-27, 2019, in 
Washington, DC, USA -- www.asipto.com


___
Kamailio (SER) - Users Mailing List
sr-users@lists.kamailio.org
https://lists.kamailio.org/cgi-bin/mailman/listinfo/sr-users


Re: [SR-Users] crash at 480 reply to INVITE

2019-02-13 Thread Juha Heinanen
Daniel-Constantin Mierla writes:

> replying to the initial message to have the backtrace easy to look at
> its content...
> 
> The info locals in frame 0 show:
> 
> uac = 0x0
> 
> However, that is set few lines above as:
> 
> uac=>uac[branch];
> 
> An address of a variable (or field in a structure) cannot be null. Some
> something happened with the stack. Did the OS kept running smooth after
> this issue?

The OS is running smoothly all the time.  It is possible to ssh to the
host, take packet traces, etc. during Kamailio freeze.  This
freezing/recovery happens happening at the frequency of once in a few
days.

> So at this moment, either the core file was somehow corrupted/not
> properly dumped or kernel process supervizer did something wrong on
> resume after the freeze.

As I mentioned, the OS is running normally during Kamailio freeze.

> There are no safety checks that can be added. Maybe you can try to
> reproduce and see if the new corefile gives a different backtrace.

Crash has happened only once after the freeze.  I'll let you know if
another crash happends and dump becomes available.

Could this freezing be caused by same kind of thing as the rtpengine
non-availability freeze that you fixed in master?  If I remember
correctly, the fix was not backported to stable.

-- Juha

___
Kamailio (SER) - Users Mailing List
sr-users@lists.kamailio.org
https://lists.kamailio.org/cgi-bin/mailman/listinfo/sr-users


Re: [SR-Users] crash at 480 reply to INVITE

2019-02-13 Thread Daniel-Constantin Mierla
Hello,

replying to the initial message to have the backtrace easy to look at
its content...

The info locals in frame 0 show:

uac = 0x0

However, that is set few lines above as:

uac=>uac[branch];

An address of a variable (or field in a structure) cannot be null. Some
something happened with the stack. Did the OS kept running smooth after
this issue?

uac is a local variable, so it is allocated on the stack of the
respective process. Given the sequence of the C code, there is no option
to overwrite uac since it was set. If the transaction pointer is
invalid, then the crash should have happened at the line:

uac=>uac[branch]; 

So at this moment, either the core file was somehow corrupted/not
properly dumped or kernel process supervizer did something wrong on
resume after the freeze.

There are no safety checks that can be added. Maybe you can try to
reproduce and see if the new corefile gives a different backtrace.

Cheers,
Daniel

On 05.02.19 10:08, Juha Heinanen wrote:
> Kamailio 5.2 crashed when it received 480 reply to INVITE.  Below is
> backtrace from the core file.
>
> The crash happens in t_reply.c on the last line of this block:
>
> uac=>uac[branch];  
>   
> LM_DBG("org. status uas=%d, uac[%d]=%d local=%d is_invite=%d)\n", 
>   
> t->uas.status, branch, uac->last_received,
>   
> is_local(t), is_invite(t));   
>   
> last_uac_status=uac->last_received;
>
> Earlier it was checked that the transaction was found.  Its uac[0]
> seems to be broken.
>
> -- Juha
>
> -
>
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0_  0x7f1073e234c3 in reply_received (p_msg=0x7f1076b605f0) at 
> t_reply.c:2240
> 2240_ _ _  t_reply.c: No such file or directory.
> (gdb) bt full
> #0_  0x7f1073e234c3 in reply_received (p_msg=0x7f1076b605f0) at 
> t_reply.c:2240
>  _ _ _ _ _ _ _  msg_status = 480
>  _ _ _ _ _ _ _  last_uac_status = 1590315756
>  _ _ _ _ _ _ _  ack = 0x50550c4  0x50550c4>
>  _ _ _ _ _ _ _  ack_len = 4
>  _ _ _ _ _ _ _  branch = 0
>  _ _ _ _ _ _ _  reply_status = 29
>  _ _ _ _ _ _ _  onreply_route = 9941216
>  _ _ _ _ _ _ _  cancel_data = {cancel_bitmap = 0, reason = {cause = 0, u = 
> {text = {s = 0x0, len = 1590087991}, e2e_cancel = 0x0, packed_hdrs = {s 
> = 0x0, len = 1590087991
>  _ _ _ _ _ _ _  uac = 0x0
>  _ _ _ _ _ _ _  t = 0x7f105dfe6480
>  _ _ _ _ _ _ _  lack_dst = {send_sock = 0x555b5f02720f , to = {s = 
> {sa_family = 29127, sa_data = "XXX"}, 
> sin = {sin_family = 29127, sin_port = 24322, sin_addr = {s_addr = 
> 21851}, sin_zero = "XXX"}, sin6 = {
>  _ _ _ _ _ _ _ _ _ _ _ _ _  sin6_family = 29127, sin6_port = 24322, 
> sin6_flowinfo = 
> 21851, sin6_addr = {__in6_u = {__u6_addr8 = 
> "XXX", __u6_addr16 = {XXX, XXX, XXX, XXX, XXX, XXX, XXX, XXX}, __u6_addr32 = 
> {XXX, XXX, 
> XXX, _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _  XXX}}}, sin6_scope_id = 
> 1980563656}}, id = 32528, 
> proto = 112 'p', send_flags = {f = 30268, blst_imask = 32528}}
>  _ _ _ _ _ _ _  backup_user_from = 0x0
>  _ _ _ _ _ _ _  backup_user_to = 0xXXX 
>  _ _ _ _ _ _ _  backup_domain_from = 0xXXX
>  _ _ _ _ _ _ _  backup_domain_to = 0xXXX
>  _ _ _ _ _ _ _  backup_uri_from = 0x0
>  _ _ _ _ _ _ _  backup_uri_to = 0xXXX
>  _ _ _ _ _ _ _  backup_xavps = 0x45ed834e3
>  _ _ _ _ _ _ _  replies_locked = 1
>  _ _ _ _ _ _ _  branch_ret = 1593995512
>  _ _ _ _ _ _ _  prev_branch = 21851
>  _ _ _ _ _ _ _  blst_503_timeout = 340003632
>  _ _ _ _ _ _ _  hf = 0x7f1076490810
>  _ _ _ _ _ _ _  onsend_params = {req = 0x7f10763c4898, rpl = 0x7f10763c4888, 
> param = 0x97b5f0, code = 10751248, flags = 0, branch = 0, t_rbuf = 
> 0xaf95c0, dst = 0x7f1076db4fc0 <__syslog>, send_buf = {s = 
> 0x555b5ed834e3 "INFO", len = 134217728}}
>  _ _ _ _ _ _ _  ctx = {rec_lev = 1593995791, run_flags = 21851, last_retcode 
> = 
> 1593995708, jmp_env = {{__jmpbuf = {48, 139708676767760, 93849330384899, 
> -7479270984431321856, 93850924380609, 139708690288576, 93850921612515, 
> 134217728}, __mask_was_saved = 12582912, __saved_mask = {
>  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _  __val = {6, 140720648489936, 139708687844848, 
> 140720648490064, 93850920720905, 93850924380373, 139708676767760, 
> 140720648489904, 139708469727337, 139708679781296, 139708687844848, 
> 139708684105760, 140720648490560, 5888963087, 93849330384896, 11507136}
>  _ _ _ _ _ _ _  bctx = 0x7f10760d0010
>  _ _ _ _ _ _ _  keng = 0x0
>  _ _ _ _ _ _ _  __func__ = "reply_received"
> #1_  0x555b5eadf4dc in do_forward_reply (msg=0x7f1076b605f0, mode=0) 
> at core/forward.c:747
>  _ _ _ _ _ _ _  new_buf = 0x0
>  _ _ _ _ _ _ _  dst = {send_sock = 0x0, to = {s = {sa_family = 0, sa_data = 
> '\000' }, sin = {sin_family = 0, sin_port = 0, 
> sin_addr = {s_addr = 0}, sin_zero = "XXX"}, 
> sin6 = {sin6_family = 0, sin6_port = 0, sin6_flowinfo = 0, sin6_addr = {
>  _ _ _ _ _ _ _ _ _ 

Re: [SR-Users] crash at 480 reply to INVITE

2019-02-08 Thread Juha Heinanen
Daniel-Constantin Mierla writes:

> OK. It happens to be traveling for few days, once I am back, I will dig
> into it and see how to fix this long delayed processing so it does not
> crash.

Thanks for looking into this.  Perhaps the freeze is due to some locking
issue like the one when K didn't do anything for a long time if it was
started with an unreachable rtpengine.

-- Juha

___
Kamailio (SER) - Users Mailing List
sr-users@lists.kamailio.org
https://lists.kamailio.org/cgi-bin/mailman/listinfo/sr-users


Re: [SR-Users] crash at 480 reply to INVITE

2019-02-07 Thread Daniel-Constantin Mierla
OK. It happens to be traveling for few days, once I am back, I will dig
into it and see how to fix this long delayed processing so it does not
crash.

Cheers,
Daniel

On 06.02.19 22:04, Juha Heinanen wrote:
> I got the full syslog and pcap and based on those K indeed totally froze
> for about 14 minutes.  Below is updated summary:
>
> - K receives INVITE at 17:43:28 and forwards it over udp to uas
> - uas immediately responds with 183 followed by 180, which k forwards to UAC
> - at 17:44:29 uas responds again with 180 and K forwards it to UAC
> - at 17:44:44 K's invite timeout timer fires and it sends "408 request
>   timeout" to uac, which responds with ack
> - at that point K should also have send CANCEL to uas, but it didn't
> - K keeps on handling requests normally until 17:46:39
> - after that K freezes totally (no messages to syslog although pcap shows
>   that K receives requests over udp)
> - at 16:00:23, K un-freezes and (according to pcap) sends two cancel
>   requests to the uas
> - K also sends out several replies and request that are related to
>   requests that it had received before the freeze (and in most cases
>   gets back "481 transaction does not exist" replies)
> - after that (during the same second) K starts to process normally new
>   incoming requests
> - during the same second after the un-freeze, these errors are produced
>   to syslog:
>
> Feb  1 16:00:23 /usr/bin/sip-proxy[1327]: INFO: db_mysql [km_dbase.c:85]: 
> db_mysql_submit_query(): driver error on ping: Lost connection to MySQL 
> server during query
> Feb  1 16:00:23 /usr/bin/sip-proxy[1313]: INFO: tm [t_reply.c:480]: 
> _reply_light(): can't generate 408 reply when a final 487 was sent out
> Feb  1 16:00:23 /usr/bin/sip-proxy[1313]: ERROR: sl [sl.c:277]: send_reply(): 
> failed to reply stateful (tm)
> Feb  1 16:00:23 /usr/bin/sip-proxy[1348]: CRITICAL:  
> [core/pass_fd.c:277]: receive_fd(): EOF on 18
>
> - finally K crashes:
>
> Feb  1 16:00:24 /usr/bin/sip-proxy[1309]: ALERT:  [main.c:755]: 
> handle_sigs(): child process 1314 exited by a signal 11
>
> -- Juha
>
-- 
Daniel-Constantin Mierla -- www.asipto.com
www.twitter.com/miconda -- www.linkedin.com/in/miconda
Kamailio World Conference - May 6-8, 2019 -- www.kamailioworld.com
Kamailio Advanced Training - Mar 4-6, 2019 in Berlin; Mar 25-27, 2019, in 
Washington, DC, USA -- www.asipto.com


___
Kamailio (SER) - Users Mailing List
sr-users@lists.kamailio.org
https://lists.kamailio.org/cgi-bin/mailman/listinfo/sr-users


Re: [SR-Users] crash at 480 reply to INVITE

2019-02-06 Thread Juha Heinanen
I got the full syslog and pcap and based on those K indeed totally froze
for about 14 minutes.  Below is updated summary:

- K receives INVITE at 17:43:28 and forwards it over udp to uas
- uas immediately responds with 183 followed by 180, which k forwards to UAC
- at 17:44:29 uas responds again with 180 and K forwards it to UAC
- at 17:44:44 K's invite timeout timer fires and it sends "408 request
  timeout" to uac, which responds with ack
- at that point K should also have send CANCEL to uas, but it didn't
- K keeps on handling requests normally until 17:46:39
- after that K freezes totally (no messages to syslog although pcap shows
  that K receives requests over udp)
- at 16:00:23, K un-freezes and (according to pcap) sends two cancel
  requests to the uas
- K also sends out several replies and request that are related to
  requests that it had received before the freeze (and in most cases
  gets back "481 transaction does not exist" replies)
- after that (during the same second) K starts to process normally new
  incoming requests
- during the same second after the un-freeze, these errors are produced
  to syslog:

Feb  1 16:00:23 /usr/bin/sip-proxy[1327]: INFO: db_mysql [km_dbase.c:85]: 
db_mysql_submit_query(): driver error on ping: Lost connection to MySQL server 
during query
Feb  1 16:00:23 /usr/bin/sip-proxy[1313]: INFO: tm [t_reply.c:480]: 
_reply_light(): can't generate 408 reply when a final 487 was sent out
Feb  1 16:00:23 /usr/bin/sip-proxy[1313]: ERROR: sl [sl.c:277]: send_reply(): 
failed to reply stateful (tm)
Feb  1 16:00:23 /usr/bin/sip-proxy[1348]: CRITICAL:  
[core/pass_fd.c:277]: receive_fd(): EOF on 18

- finally K crashes:

Feb  1 16:00:24 /usr/bin/sip-proxy[1309]: ALERT:  [main.c:755]: 
handle_sigs(): child process 1314 exited by a signal 11

-- Juha


___
Kamailio (SER) - Users Mailing List
sr-users@lists.kamailio.org
https://lists.kamailio.org/cgi-bin/mailman/listinfo/sr-users


Re: [SR-Users] crash at 480 reply to INVITE

2019-02-06 Thread Juha Heinanen
Daniel-Constantin Mierla writes:

> OK. There should be no crash no matter what caused the delay. I wanted
> to sort out the freezing of entire kamailio for 15min, because that is
> rather unusual to happen for all processes and then recover.
> 
> I will look more at the tm for this delay/blocking of a single
> process.

K routed the INVITE by process 1310 and after that the same process was
used to route other requests before K 15 minutes later sent the two
CANCELs to the UAS. Unfortunately the syslog does not show the process
number that sent the CANCELs.

-- Juha

___
Kamailio (SER) - Users Mailing List
sr-users@lists.kamailio.org
https://lists.kamailio.org/cgi-bin/mailman/listinfo/sr-users


Re: [SR-Users] crash at 480 reply to INVITE

2019-02-06 Thread Daniel-Constantin Mierla

On 06.02.19 09:51, Juha Heinanen wrote:
> Juha Heinanen writes:
>
>> Thanks.  I don't think that this is db related issue.  Here is a
>> summary:
>>
>> - K receives INVITE at 17:43:28 and forwards it over udp to uas
>> - uas immediately responds with 183 followed by 180, which k forwards to UAC
>> - at 17:44:29 uas responds again with 180 and K forwards it to UAC
>> - at 17:44:44 K's invite timeout timer fires and it sends "408 request
>>   timeout" to uac, which responds with ack
>> - at that point K should also have send CANCEL to uas, but it didn't
>> - after that k goes to deep sleep, i.e., it does not react to any
>>   new incoming requests from anybody nor to several "480 request
>>   terminated" replies from the uas
>> - about 15 minutes later at 18:00:23, K wakes up and sends two cancel
>>   requests to uas
>> - uas replies to cancels with "481 transaction does not exist"
>> - then K sends several ACKs to the uas and crashes
> I misunderstood the pcap that I got.  It only contained packets of the
> call that caused invite timeout to fire and where CANCEL was not sent.
> So K did properly process other requests during the 15 minute period.
>
> In syslog there is this just before the crash:
>
> Feb  1 16:00:23 /usr/bin/sip-proxy[1348]: CRITICAL:  
> [core/pass_fd.c:277]: receive_fd(): EOF on 18
>
> and before that during the same second:
>
> Feb  1 16:00:23 /usr/bin/sip-proxy[1327]: INFO: db_mysql [km_dbase.c:85]: 
> db_mysql_submit_query(): driver error on ping: Lost connection to MySQL 
> server during query
>
> So the crash may have something to do with db access, but why was cancel
> not sent 15 minutes ago? There is no mysql related messages in syslog at
> that time.


OK. There should be no crash no matter what caused the delay. I wanted
to sort out the freezing of entire kamailio for 15min, because that is
rather unusual to happen for all processes and then recover.

I will look more at the tm for this delay/blocking of a single process.

Cheers,
Daniel

-- 
Daniel-Constantin Mierla -- www.asipto.com
www.twitter.com/miconda -- www.linkedin.com/in/miconda
Kamailio World Conference - May 6-8, 2019 -- www.kamailioworld.com
Kamailio Advanced Training - Mar 4-6, 2019 in Berlin; Mar 25-27, 2019, in 
Washington, DC, USA -- www.asipto.com


___
Kamailio (SER) - Users Mailing List
sr-users@lists.kamailio.org
https://lists.kamailio.org/cgi-bin/mailman/listinfo/sr-users


Re: [SR-Users] crash at 480 reply to INVITE

2019-02-06 Thread Juha Heinanen
Juha Heinanen writes:

> Thanks.  I don't think that this is db related issue.  Here is a
> summary:
> 
> - K receives INVITE at 17:43:28 and forwards it over udp to uas
> - uas immediately responds with 183 followed by 180, which k forwards to UAC
> - at 17:44:29 uas responds again with 180 and K forwards it to UAC
> - at 17:44:44 K's invite timeout timer fires and it sends "408 request
>   timeout" to uac, which responds with ack
> - at that point K should also have send CANCEL to uas, but it didn't
> - after that k goes to deep sleep, i.e., it does not react to any
>   new incoming requests from anybody nor to several "480 request
>   terminated" replies from the uas
> - about 15 minutes later at 18:00:23, K wakes up and sends two cancel
>   requests to uas
> - uas replies to cancels with "481 transaction does not exist"
> - then K sends several ACKs to the uas and crashes

I misunderstood the pcap that I got.  It only contained packets of the
call that caused invite timeout to fire and where CANCEL was not sent.
So K did properly process other requests during the 15 minute period.

In syslog there is this just before the crash:

Feb  1 16:00:23 /usr/bin/sip-proxy[1348]: CRITICAL:  
[core/pass_fd.c:277]: receive_fd(): EOF on 18

and before that during the same second:

Feb  1 16:00:23 /usr/bin/sip-proxy[1327]: INFO: db_mysql [km_dbase.c:85]: 
db_mysql_submit_query(): driver error on ping: Lost connection to MySQL server 
during query

So the crash may have something to do with db access, but why was cancel
not sent 15 minutes ago? There is no mysql related messages in syslog at
that time.

-- Juha

___
Kamailio (SER) - Users Mailing List
sr-users@lists.kamailio.org
https://lists.kamailio.org/cgi-bin/mailman/listinfo/sr-users


Re: [SR-Users] crash at 480 reply to INVITE

2019-02-05 Thread Juha Heinanen
Daniel-Constantin Mierla writes:

> It is rather strange that it got like "frozen" for 15 minutes. Was there
> a high CPU during that interval?

CPU was not high and that top showed nothing unusual during the freeze.
Also, there was nothing special in MySQL logs.

> Typically, if there is a deadlock that block processing, then it does
> not recover until restart. But in this case is no restart, because that
> destroys active trasactions and no CANCEL would have been issued.
> 
> Is the system a virtual machine, container or bare metal?

It is virtual machine.

> I encountered in the past that some VMs got fronzen due to VM manager
> doing some backup or similar activities ...

That may very well be the case here too.  I'll keep on investigating.

I would have been nice though, if K would not had crashed after in
unfroze.

-- Juha

___
Kamailio (SER) - Users Mailing List
sr-users@lists.kamailio.org
https://lists.kamailio.org/cgi-bin/mailman/listinfo/sr-users


Re: [SR-Users] crash at 480 reply to INVITE

2019-02-05 Thread Daniel-Constantin Mierla
It is rather strange that it got like "frozen" for 15 minutes. Was there
a high CPU during that interval?

Typically, if there is a deadlock that block processing, then it does
not recover until restart. But in this case is no restart, because that
destroys active trasactions and no CANCEL would have been issued.

Is the system a virtual machine, container or bare metal?

I encountered in the past that some VMs got fronzen due to VM manager
doing some backup or similar activities ...

Cheers,
Daniel

On 05.02.19 21:33, Juha Heinanen wrote:
> Daniel-Constantin Mierla writes:
>
>> I'll look into it. Was there a possibility that some operation could
>> have taken very long, for example writing accounting database record?
> Thanks.  I don't think that this is db related issue.  Here is a
> summary:
>
> - K receives INVITE at 17:43:28 and forwards it over udp to uas
> - uas immediately responds with 183 followed by 180, which k forwards to UAC
> - at 17:44:29 uas responds again with 180 and K forwards it to UAC
> - at 17:44:44 K's invite timeout timer fires and it sends "408 request
>   timeout" to uac, which responds with ack
> - at that point K should also have send CANCEL to uas, but it didn't
> - after that k goes to deep sleep, i.e., it does not react to any
>   new incoming requests from anybody nor to several "480 request
>   terminated" replies from the uas
> - about 15 minutes later at 18:00:23, K wakes up and sends two cancel
>   requests to uas
> - uas replies to cancels with "481 transaction does not exist"
> - then K sends several ACKs to the uas and crashes
>
> So it seems that crash is side effect of K not sending cancel to the uas
> immediate.  Instead it goes to sleep and doesn't feel well after waking
> up 15 minutes later.
>
> The real questions are why didn't K send the CANCEL immediately and went
> to sleep instead.  How to find out why it happened?
>
> -- Juha

-- 
Daniel-Constantin Mierla -- www.asipto.com
www.twitter.com/miconda -- www.linkedin.com/in/miconda
Kamailio World Conference - May 6-8, 2019 -- www.kamailioworld.com
Kamailio Advanced Training - Mar 4-6, 2019 in Berlin; Mar 25-27, 2019, in 
Washington, DC, USA -- www.asipto.com


___
Kamailio (SER) - Users Mailing List
sr-users@lists.kamailio.org
https://lists.kamailio.org/cgi-bin/mailman/listinfo/sr-users


Re: [SR-Users] crash at 480 reply to INVITE

2019-02-05 Thread Juha Heinanen
Daniel-Constantin Mierla writes:

> I'll look into it. Was there a possibility that some operation could
> have taken very long, for example writing accounting database record?

Thanks.  I don't think that this is db related issue.  Here is a
summary:

- K receives INVITE at 17:43:28 and forwards it over udp to uas
- uas immediately responds with 183 followed by 180, which k forwards to UAC
- at 17:44:29 uas responds again with 180 and K forwards it to UAC
- at 17:44:44 K's invite timeout timer fires and it sends "408 request
  timeout" to uac, which responds with ack
- at that point K should also have send CANCEL to uas, but it didn't
- after that k goes to deep sleep, i.e., it does not react to any
  new incoming requests from anybody nor to several "480 request
  terminated" replies from the uas
- about 15 minutes later at 18:00:23, K wakes up and sends two cancel
  requests to uas
- uas replies to cancels with "481 transaction does not exist"
- then K sends several ACKs to the uas and crashes

So it seems that crash is side effect of K not sending cancel to the uas
immediate.  Instead it goes to sleep and doesn't feel well after waking
up 15 minutes later.

The real questions are why didn't K send the CANCEL immediately and went
to sleep instead.  How to find out why it happened?

-- Juha

___
Kamailio (SER) - Users Mailing List
sr-users@lists.kamailio.org
https://lists.kamailio.org/cgi-bin/mailman/listinfo/sr-users


Re: [SR-Users] crash at 480 reply to INVITE

2019-02-05 Thread Daniel-Constantin Mierla
Hello,

I'll look into it. Was there a possibility that some operation could
have taken very long, for example writing accounting database record?

Cheers,
Daniel

On 05.02.19 11:00, Juha Heinanen wrote:
> Juha Heinanen writes:
>
>> pcap shows that kamailio receives from the uas two "180 ringing" replies
>> 30 secs apart after which fr_inv_timeout hits.  at that point kamailio
>> sends "408 request timeout" to uac and gets back ack.  then 45 secs
>> later, kamailio gets one more "180 ringing" reply from the uas followed
>> by several 480 replies.  one of them crashes kamailio.
> pcap also shows that when fr_inv_timeout fired and kamailio send 408 to
> uac, it did not send cancel to uas.  it send 408 to uac at 17:44:44 and
> cancel to uas at 18:00:23, i.e. about 15 minutes later!
>
> -- juha
>
> ___
> Kamailio (SER) - Users Mailing List
> sr-users@lists.kamailio.org
> https://lists.kamailio.org/cgi-bin/mailman/listinfo/sr-users

-- 
Daniel-Constantin Mierla -- www.asipto.com
www.twitter.com/miconda -- www.linkedin.com/in/miconda
Kamailio World Conference - May 6-8, 2019 -- www.kamailioworld.com
Kamailio Advanced Training - Mar 4-6, 2019 in Berlin; Mar 25-27, 2019, in 
Washington, DC, USA -- www.asipto.com


___
Kamailio (SER) - Users Mailing List
sr-users@lists.kamailio.org
https://lists.kamailio.org/cgi-bin/mailman/listinfo/sr-users


[SR-Users] crash at 480 reply to INVITE

2019-02-05 Thread Juha Heinanen
Juha Heinanen writes:

> pcap shows that kamailio receives from the uas two "180 ringing" replies
> 30 secs apart after which fr_inv_timeout hits.  at that point kamailio
> sends "408 request timeout" to uac and gets back ack.  then 45 secs
> later, kamailio gets one more "180 ringing" reply from the uas followed
> by several 480 replies.  one of them crashes kamailio.

pcap also shows that when fr_inv_timeout fired and kamailio send 408 to
uac, it did not send cancel to uas.  it send 408 to uac at 17:44:44 and
cancel to uas at 18:00:23, i.e. about 15 minutes later!

-- juha

___
Kamailio (SER) - Users Mailing List
sr-users@lists.kamailio.org
https://lists.kamailio.org/cgi-bin/mailman/listinfo/sr-users


[SR-Users] crash at 480 reply to INVITE

2019-02-05 Thread Juha Heinanen
Juha Heinanen writes:

> Kamailio 5.2 crashed when it received 480 reply to INVITE.  Below is
> backtrace from the core file.
> 
> The crash happens in t_reply.c on the last line of this block:
> 
> uac=>uac[branch];  
>   
> LM_DBG("org. status uas=%d, uac[%d]=%d local=%d is_invite=%d)\n", 
>   
> t->uas.status, branch, uac->last_received,
>   
> is_local(t), is_invite(t));   
>   
> last_uac_status=uac->last_received;
> 
> Earlier it was checked that the transaction was found.  Its uac[0]
> seems to be broken.

pcap shows that kamailio receives from the uas two "180 ringing" replies
30 secs apart after which fr_inv_timeout hits.  at that point kamailio
sends "408 request timeout" to uac and gets back ack.  then 45 secs
later, kamailio gets one more "180 ringing" reply from the uas followed
by several 480 replies.  one of them crashes kamailio.

-- juha

___
Kamailio (SER) - Users Mailing List
sr-users@lists.kamailio.org
https://lists.kamailio.org/cgi-bin/mailman/listinfo/sr-users


[SR-Users] crash at 480 reply to INVITE

2019-02-05 Thread Juha Heinanen
Kamailio 5.2 crashed when it received 480 reply to INVITE.  Below is
backtrace from the core file.

The crash happens in t_reply.c on the last line of this block:

uac=>uac[branch];
LM_DBG("org. status uas=%d, uac[%d]=%d local=%d is_invite=%d)\n",   
t->uas.status, branch, uac->last_received,  
is_local(t), is_invite(t)); 
last_uac_status=uac->last_received;

Earlier it was checked that the transaction was found.  Its uac[0]
seems to be broken.

-- Juha

-

Program terminated with signal SIGSEGV, Segmentation fault.
#0_  0x7f1073e234c3 in reply_received (p_msg=0x7f1076b605f0) at 
t_reply.c:2240
2240_ _ _  t_reply.c: No such file or directory.
(gdb) bt full
#0_  0x7f1073e234c3 in reply_received (p_msg=0x7f1076b605f0) at 
t_reply.c:2240
 _ _ _ _ _ _ _  msg_status = 480
 _ _ _ _ _ _ _  last_uac_status = 1590315756
 _ _ _ _ _ _ _  ack = 0x50550c4 
 _ _ _ _ _ _ _  ack_len = 4
 _ _ _ _ _ _ _  branch = 0
 _ _ _ _ _ _ _  reply_status = 29
 _ _ _ _ _ _ _  onreply_route = 9941216
 _ _ _ _ _ _ _  cancel_data = {cancel_bitmap = 0, reason = {cause = 0, u = 
{text = {s = 0x0, len = 1590087991}, e2e_cancel = 0x0, packed_hdrs = {s 
= 0x0, len = 1590087991
 _ _ _ _ _ _ _  uac = 0x0
 _ _ _ _ _ _ _  t = 0x7f105dfe6480
 _ _ _ _ _ _ _  lack_dst = {send_sock = 0x555b5f02720f , to = {s = 
{sa_family = 29127, sa_data = "XXX"}, 
sin = {sin_family = 29127, sin_port = 24322, sin_addr = {s_addr = 
21851}, sin_zero = "XXX"}, sin6 = {
 _ _ _ _ _ _ _ _ _ _ _ _ _  sin6_family = 29127, sin6_port = 24322, 
sin6_flowinfo = 
21851, sin6_addr = {__in6_u = {__u6_addr8 = 
"XXX", __u6_addr16 = {XXX, XXX, XXX, XXX, XXX, XXX, XXX, XXX}, __u6_addr32 = 
{XXX, XXX, 
XXX, _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _  XXX}}}, sin6_scope_id = 
1980563656}}, id = 32528, 
proto = 112 'p', send_flags = {f = 30268, blst_imask = 32528}}
 _ _ _ _ _ _ _  backup_user_from = 0x0
 _ _ _ _ _ _ _  backup_user_to = 0xXXX 
 _ _ _ _ _ _ _  backup_domain_from = 0xXXX
 _ _ _ _ _ _ _  backup_domain_to = 0xXXX
 _ _ _ _ _ _ _  backup_uri_from = 0x0
 _ _ _ _ _ _ _  backup_uri_to = 0xXXX
 _ _ _ _ _ _ _  backup_xavps = 0x45ed834e3
 _ _ _ _ _ _ _  replies_locked = 1
 _ _ _ _ _ _ _  branch_ret = 1593995512
 _ _ _ _ _ _ _  prev_branch = 21851
 _ _ _ _ _ _ _  blst_503_timeout = 340003632
 _ _ _ _ _ _ _  hf = 0x7f1076490810
 _ _ _ _ _ _ _  onsend_params = {req = 0x7f10763c4898, rpl = 0x7f10763c4888, 
param = 0x97b5f0, code = 10751248, flags = 0, branch = 0, t_rbuf = 
0xaf95c0, dst = 0x7f1076db4fc0 <__syslog>, send_buf = {s = 
0x555b5ed834e3 "INFO", len = 134217728}}
 _ _ _ _ _ _ _  ctx = {rec_lev = 1593995791, run_flags = 21851, last_retcode = 
1593995708, jmp_env = {{__jmpbuf = {48, 139708676767760, 93849330384899, 
-7479270984431321856, 93850924380609, 139708690288576, 93850921612515, 
134217728}, __mask_was_saved = 12582912, __saved_mask = {
 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _  __val = {6, 140720648489936, 139708687844848, 
140720648490064, 93850920720905, 93850924380373, 139708676767760, 
140720648489904, 139708469727337, 139708679781296, 139708687844848, 
139708684105760, 140720648490560, 5888963087, 93849330384896, 11507136}
 _ _ _ _ _ _ _  bctx = 0x7f10760d0010
 _ _ _ _ _ _ _  keng = 0x0
 _ _ _ _ _ _ _  __func__ = "reply_received"
#1_  0x555b5eadf4dc in do_forward_reply (msg=0x7f1076b605f0, mode=0) 
at core/forward.c:747
 _ _ _ _ _ _ _  new_buf = 0x0
 _ _ _ _ _ _ _  dst = {send_sock = 0x0, to = {s = {sa_family = 0, sa_data = 
'\000' }, sin = {sin_family = 0, sin_port = 0, 
sin_addr = {s_addr = 0}, sin_zero = "XXX"}, 
sin6 = {sin6_family = 0, sin6_port = 0, sin6_flowinfo = 0, sin6_addr = {
 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _  __in6_u = {__u6_addr8 = '\000' , 
__u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}, 
sin6_scope_id = 0}}, id = 0, proto = 0 '\000', send_flags = {f = 0, 
blst_imask = 0}}
 _ _ _ _ _ _ _  new_len = 0
 _ _ _ _ _ _ _  r = 1
 _ _ _ _ _ _ _  ip = {af = XXX, len = 32528, u = {addrl = {XXX, 
95}, addr32 = {XXX, XXX, XXX, 0}, addr16 = {XXX, XXX, XXX, XXX, XXX, XXX, XXX, 
XXX}, addr = 
"XXX"}}
 _ _ _ _ _ _ _  s = 0x7ffc14440c68 ""
 _ _ _ _ _ _ _  len = 32764
 _ _ _ _ _ _ _  __func__ = "do_forward_reply"
#2_  0x555b5eae12f9 in forward_reply (msg=0x7f1076b605f0) at 
core/forward.c:852
No locals.
#3_  0x555b5eb5b679 in receive_msg (
 _ _ _  buf=0x555b5f027060  "SIP/2.0 480 Request Terminated\r\nVia: 
SIP/2.0/UDP 
XXX;branch=z9hG4bKe951.40cf95b28fe54d0cbda88a8fa4c91d48.0\r\nVia: 
SIP/2.0/UDP XXX:5060;branch=z9hG4bK04B95fa49ac99a7fa91\r\nTo: 
 "SIP/2.0 480 Request 
Terminated\r\nVia: SIP/2.0/UDP 
XXX;branch=z9hG4bKe951.40cf95b28fe54d0cbda88a8fa4c91d48.0\r\nVia: 
SIP/2.0/UDP XXX:5060;branch=z9hG4bK04B95fa49ac99a7fa91\r\nTo: 

 _ _ _ _ _ _ _  from = 0x7f10764b1da0
 _ _ _ _ _ _ _  fromlen = 16
 _ _ _ _ _ _ _  ri = {src_ip = {af = 2,