Bogdan,
I am trying to obtain this information for you but I am having trouble
getting the core files. Is it really necessary to kill every opensips
process? This generates almost 40 core files and each is quite large
(~1GB). I simply don’t have that disk space currently. I can make a
change to get more but it is slowing the process. Would it be
sufficient to get just one core file?
Also, runtime inspection with gdb is possible in this case if you can
provide me with the commands you would want to see. I would need very
specific commands as I am not very familiar with gdb.
Ben Newlin
*From: *Bogdan-Andrei Iancu <[email protected]>
*Date: *Thursday, November 1, 2018 at 1:29 PM
*To: *Ben Newlin <[email protected]>, OpenSIPS users mailling
list <[email protected]>
*Subject: *Re: [OpenSIPS-Users] CPU 100% with TCP
Hi Ben,
First be sure you have the DBG_LOCK option compiled in. Do the
"opensips -V" and see the output flags.
Next step will be to force an SIGSEGV to opensips (killall -11
opensips) when the deadlock occurs - I need a core file to inspect
(assuming that runtime inspection with gdb is not possible).
Regards,
Bogdan-Andrei Iancu
OpenSIPS Founder and Developer
http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
http://opensips.org/training/OpenSIPS_Bootcamp_2018/
On 10/31/2018 09:07 PM, Ben Newlin wrote:
Bogdan,
For the first test I have done as you suggested and disabled only
async operation for HEP, so it is still using TCP. I will send you
the trap info directly as it is too large. I also compiled with
the DBG_LOCK option, but am unsure whether that extra information
will be available in the trap output or do you need something else?
I am now going to switch HEP to use UDP to mirror our production
environment and try to reproduce again. Wish me luck! ☺
Ben Newlin
*From: *Bogdan-Andrei Iancu <[email protected]>
<mailto:[email protected]>
*Date: *Monday, October 29, 2018 at 2:19 PM
*To: *Ben Newlin <[email protected]>
<mailto:[email protected]>, OpenSIPS users mailling list
<[email protected]> <mailto:[email protected]>
*Subject: *Re: [OpenSIPS-Users] CPU 100% with TCP
Hi Ben,
I checked the error trace and it should not leave any dangling
lock (due mishandled error). Before disabling HEP, try to disable
the async support for HEP.
If you claim that the same 100% CPU happens with HEP + UDP, send
me a trap for that too, as in the previous case, the deadlock was
exclusively HEP + TCP related.
Anyhow, as the original trap showed a deadlock, next step will be
to recompile with the DBG_LOCK option - this enables extra code to
debug/troubleshoot locking related issues - are you able to do it?
Regards,
Bogdan-Andrei Iancu
OpenSIPS Founder and Developer
http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
http://opensips.org/training/OpenSIPS_Bootcamp_2018/
On 10/26/2018 04:14 PM, Ben Newlin wrote:
Bogdan,
Actually, yes we do. Looking back I can see these errors just
before the issue occurs:
Oct 24 19:00:36 [5700] ERROR:proto_hep:send_hep_message:
Cannot send hep message!
Oct 24 19:00:36 [5700] ERROR:proto_hep:msg_send: send() to
10.32.163.211:9061 for proto hep_tcp/9 failed
Oct 24 19:00:36 [5700] ERROR:proto_hep:hep_tcp_send: failed to
send
Oct 24 19:00:36 [5700] ERROR:proto_hep:async_tsend_stream:
Failed first TCP async send : (32) Broken pipe
I will try disabling HEP and see if we can reproduce.
Just for information, I have been reproducing the issue in our
testing environment which uses TCP for HEP, however the issue
is occurring in our production environment as well which is
still using UDP for HEP.
Ben Newlin
*From: *Bogdan-Andrei Iancu <[email protected]>
<mailto:[email protected]>
*Date: *Friday, October 26, 2018 at 3:06 AM
*To: *Ben Newlin <[email protected]>
<mailto:[email protected]>, OpenSIPS users mailling list
<[email protected]> <mailto:[email protected]>
*Subject: *Re: [OpenSIPS-Users] CPU 100% with TCP
Hi Ben,
Thank you for the info.
It looks like the processes get stuck into a HEP related
internal lock - do you see any HEP related errors in your
logs, prior to the dead-lock ?
Also, as PoC, could you disabled HEP tracing to see if the
problem goes away ?
Thanks,
Bogdan-Andrei Iancu
OpenSIPS Founder and Developer
http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
http://opensips.org/training/OpenSIPS_Bootcamp_2018/
On 10/24/2018 10:18 PM, Ben Newlin wrote:
Bogdan,
I have run the command but the output was too large for
pastebin so I have sent it to you directly.
Ben Newlin
*From: *Bogdan-Andrei Iancu <[email protected]>
<mailto:[email protected]>
*Date: *Wednesday, October 24, 2018 at 5:17 AM
*To: *OpenSIPS users mailling list
<[email protected]>
<mailto:[email protected]>, Ben Newlin
<[email protected]> <mailto:[email protected]>
*Subject: *Re: [OpenSIPS-Users] CPU 100% with TCP
Hi Ben,
Could you run "opensipsctl trap" ?
Regards,
Bogdan-Andrei Iancu
OpenSIPS Founder and Developer
http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
http://opensips.org/training/OpenSIPS_Bootcamp_2018/
On 10/24/2018 12:56 AM, Ben Newlin wrote:
Hi,
We have implemented TCP recently and are performing
TCP<->UDP translation on one of our proxy types. This
proxy only exists for that purpose; there are no DB
queries, REST calls, or anything like that. It is
designed to be very fast and high throughput.
Recently we have found that when the remote endpoint
of a TCP connection is lost, i.e. the server goes
down, while under moderate load OpenSIPS quickly
reaches 100% CPU and becomes unresponsive. When this
occurs, the “top” command shows that between 30-90%
CPU is in System (kernel) space, and each OpenSIPS TCP
process shows many times the normal CPU. We are
running OpenSIPS 2.4.2 on Amazon Linux.
I obtained as much information as I could using ps,
strace, and gdb here: https://pastebin.com/JP3DnCqs
<https://pastebin.com/JP3DnCqs>. We can reproduce the
failure consistently by removing a server during call
traffic.
A few things I noticed:
* The number of running threads reported by OpenSIPS
doesn’t align with our configuration, copied here:
####### Global Parameters #########
children=32
#// Allow 503 to pass back to Control
disable_503_translation=yes
#// Even though we are not receiving HEP,
#// this listener is required by OpenSIPS
#// in order to use the proto_hep module. :/
listen=hep_tcp:10.32.40.245:9061 use_children 1
#// Configure the listeners
listen=udp:10.32.40.245:5060 as XXX.XXX.XXX.XXX
listen=tcp:10.32.40.245:5060 as XXX.XXX.XXX.XXX
#// Transaction Module
loadmodule "tm.so"
modparam("tm", "restart_fr_on_each_reply", 0)
modparam("tm", "timer_partitions", 8)
modparam("tm", "onreply_avp_mode", 1)
modparam("tm", "wt_timer", 10)
According to the documentation if “tcp_children” is
not set then the value of “children” will be used [1],
but we have set “children” to 32 and only have the
default 8 TCP processes. Also we appear to only have 1
timer process, although we have set the number of
timer partitions to 8.
* The server that is terminated was using TCP
connections exclusively, but all of the CPU seems
to be in the UDP threads. The one I looked at
appeared to be handling a CANCEL to one of the
calls that was active and was attempting to send
it out via TCP. I’m not sure why it would be
trying to relay the CANCEL as no 100 Trying had
been received from the server. I have noticed that
in 2.x OpenSIPS will now send CANCELs for
transactions even when 100 Trying was not
received. Is that intentional? RFC 3261 states
that no CANCEL should be sent unless a provisional
response has been received.
Any assistance with this would be appreciated.
[1] -
http://www.opensips.org/Documentation/Script-CoreParameters-2-4#toc66
Ben Newlin
_______________________________________________
Users mailing list
[email protected] <mailto:[email protected]>
http://lists.opensips.org/cgi-bin/mailman/listinfo/users