Hi Ben,
Could you run "opensipsctl trap"?
Regards,
Bogdan-Andrei Iancu
OpenSIPS Founder and Developer
http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
http://opensips.org/training/OpenSIPS_Bootcamp_2018/
On 10/24/2018 12:56 AM, Ben Newlin wrote:
Hi,
We have implemented TCP recently and are performing TCP<->UDP
translation on one of our proxy types. This proxy only exists for that
purpose; there are no DB queries, REST calls, or anything like that.
It is designed to be very fast and high throughput.
Recently we have found that when the remote endpoint of a TCP
connection is lost, i.e. the server goes down, while under moderate
load, OpenSIPS quickly reaches 100% CPU and becomes unresponsive. When
this occurs, the “top” command shows that between 30% and 90% of CPU
time is in System (kernel) space, and each OpenSIPS TCP process shows
many times its normal CPU usage. We are running OpenSIPS 2.4.2 on
Amazon Linux.
I obtained as much information as I could using ps, strace, and gdb
here: https://pastebin.com/JP3DnCqs. We can reproduce the failure
consistently by removing a server during call traffic.
A few things I noticed:
* The number of running processes reported by OpenSIPS doesn’t align
with our configuration, copied here:
####### Global Parameters #########
children=32
#// Allow 503 to pass back to Control
disable_503_translation=yes
#// Even though we are not receiving HEP,
#// this listener is required by OpenSIPS
#// in order to use the proto_hep module. :/
listen=hep_tcp:10.32.40.245:9061 use_children 1
#// Configure the listeners
listen=udp:10.32.40.245:5060 as XXX.XXX.XXX.XXX
listen=tcp:10.32.40.245:5060 as XXX.XXX.XXX.XXX
#// Transaction Module
loadmodule "tm.so"
modparam("tm", "restart_fr_on_each_reply", 0)
modparam("tm", "timer_partitions", 8)
modparam("tm", "onreply_avp_mode", 1)
modparam("tm", "wt_timer", 10)
According to the documentation [1], if “tcp_children” is not set then
the value of “children” is used, but we have set “children” to 32 and
still have only the default 8 TCP processes. Also, we appear to have
only 1 timer process, although we have set the number of timer
partitions to 8.
* The server that was terminated was using TCP connections
exclusively, but all of the CPU usage seems to be in the UDP
processes. The one I looked at appeared to be handling a CANCEL for
one of the active calls and was attempting to send it out via TCP.
I’m not sure why it would be trying to relay the CANCEL, as no 100
Trying had been received from the server. I have noticed that in
2.x OpenSIPS now sends CANCELs for transactions even when no 100
Trying has been received. Is that intentional? RFC 3261 (Section 9.1)
states that no CANCEL should be sent unless a provisional response
has been received.
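On the process-count mismatch in the first item, here is a minimal
sketch of what we could try: setting the TCP worker count explicitly
rather than relying on the fallback to “children”. This assumes
“tcp_children” is accepted as a global parameter in 2.4.2 as the
documentation [1] describes. The value 32 simply mirrors our
“children” setting and is untested.

```
####### Global Parameters #########
children=32
#// Assumption: set the TCP worker count explicitly instead of
#// relying on the documented fallback to "children"
tcp_children=32
```

If the number of TCP processes still stays at 8 with this in place,
that would suggest the mismatch is not just a documentation issue.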
Any assistance with this would be appreciated.
[1] http://www.opensips.org/Documentation/Script-CoreParameters-2-4#toc66
Ben Newlin
_______________________________________________
Users mailing list
[email protected]
http://lists.opensips.org/cgi-bin/mailman/listinfo/users