Hi Ben,

First, make sure the DBG_LOCK option is compiled in: run "opensips -V" and check the flags in the output.

The next step is to force a SIGSEGV on opensips (killall -11 opensips) when the deadlock occurs. I need a core file to inspect (assuming that runtime inspection with gdb is not possible).
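
The two steps above could be scripted roughly as follows. This is only a sketch: the helper names has_dbg_lock and force_core are made up for illustration, and it assumes the running opensips processes are already allowed to write core files.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of OpenSIPS.

# Succeeds if the text on stdin (expected: output of "opensips -V")
# mentions the DBG_LOCK compile flag.
has_dbg_lock() {
    grep -q 'DBG_LOCK'
}

# Send SIGSEGV (signal 11) to all opensips processes so they dump core.
# Call this only while the deadlock is in progress. Assumes core dumps
# were enabled before opensips was started.
force_core() {
    killall -11 opensips
}

# Report the flag status only when the opensips binary is on PATH.
if command -v opensips >/dev/null 2>&1; then
    if opensips -V 2>&1 | has_dbg_lock; then
        echo "DBG_LOCK is compiled in"
    else
        echo "DBG_LOCK is missing, recompile first"
    fi
fi
```

Once the deadlock shows up, calling force_core from the same shell sends the signal. The function is deliberately not called by the script itself.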

Regards,

Bogdan-Andrei Iancu

OpenSIPS Founder and Developer
  http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
  http://opensips.org/training/OpenSIPS_Bootcamp_2018/

On 10/31/2018 09:07 PM, Ben Newlin wrote:

Bogdan,

For the first test I have done as you suggested and disabled only async operation for HEP, so it is still using TCP. I will send you the trap info directly as it is too large. I also compiled with the DBG_LOCK option, but I am unsure whether that extra information will show up in the trap output or whether you need something else.

I am now going to switch HEP to use UDP to mirror our production environment and try to reproduce again. Wish me luck! ☺

Ben Newlin

*From: *Bogdan-Andrei Iancu <[email protected]>
*Date: *Monday, October 29, 2018 at 2:19 PM
*To: *Ben Newlin <[email protected]>, OpenSIPS users mailling list <[email protected]>
*Subject: *Re: [OpenSIPS-Users] CPU 100% with TCP

Hi Ben,

I checked the error trace and it should not leave any dangling lock (due to a mishandled error). Before disabling HEP, try to disable the async support for HEP.

If the same 100% CPU also happens with HEP + UDP, send me a trap for that case too, because in the previous trap the deadlock was exclusively HEP + TCP related.

Anyhow, as the original trap showed a deadlock, the next step is to recompile with the DBG_LOCK option, which enables extra code to debug/troubleshoot locking-related issues. Are you able to do that?

Regards,

Bogdan-Andrei Iancu

On 10/26/2018 04:14 PM, Ben Newlin wrote:

    Bogdan,

    Actually, yes we do. Looking back I can see these errors just
    before the issue occurs:

    Oct 24 19:00:36 [5700] ERROR:proto_hep:send_hep_message: Cannot
    send hep message!

    Oct 24 19:00:36 [5700] ERROR:proto_hep:msg_send: send() to
    10.32.163.211:9061 for proto hep_tcp/9 failed

    Oct 24 19:00:36 [5700] ERROR:proto_hep:hep_tcp_send: failed to send

    Oct 24 19:00:36 [5700] ERROR:proto_hep:async_tsend_stream: Failed
    first TCP async send : (32) Broken pipe
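
Errors like the ones quoted above can be pulled out of a log with a small filter. This is a sketch only: hep_errors is a hypothetical helper name, the INFO line in the sample is made up for contrast, and the real log location depends on the local syslog setup.

```shell
#!/bin/sh
# Sketch only: hep_errors is a hypothetical helper name.

# Print only the proto_hep error lines from the log text on stdin.
# "|| true" keeps the exit status at 0 when nothing matches.
hep_errors() {
    grep 'ERROR:proto_hep:' || true
}

# Sample input modelled on the log lines quoted above; the INFO line
# is invented purely to show that unrelated entries are dropped.
sample='Oct 24 19:00:36 [5700] ERROR:proto_hep:hep_tcp_send: failed to send
Oct 24 19:00:36 [5700] INFO:core:main: unrelated line
Oct 24 19:00:36 [5700] ERROR:proto_hep:async_tsend_stream: Failed first TCP async send : (32) Broken pipe'

printf '%s\n' "$sample" | hep_errors
```

Against a real log the same helper would be fed the file on stdin, for example hep_errors < path/to/opensips.log, where the path is a placeholder.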

    I will try disabling HEP and see if we can reproduce.

    Just for information, I have been reproducing the issue in our
    testing environment which uses TCP for HEP, however the issue is
    occurring in our production environment as well which is still
    using UDP for HEP.

    Ben Newlin

    *From: *Bogdan-Andrei Iancu <[email protected]>
    *Date: *Friday, October 26, 2018 at 3:06 AM
    *To: *Ben Newlin <[email protected]>, OpenSIPS users mailling list <[email protected]>
    *Subject: *Re: [OpenSIPS-Users] CPU 100% with TCP

    Hi Ben,

    Thank you for the info.

    It looks like the processes get stuck on a HEP-related internal
    lock. Do you see any HEP-related errors in your logs prior to
    the deadlock?

    Also, as a PoC, could you disable HEP tracing to see if the problem
    goes away?

    Thanks,



    Bogdan-Andrei Iancu


    On 10/24/2018 10:18 PM, Ben Newlin wrote:

        Bogdan,

        I have run the command but the output was too large for
        pastebin so I have sent it to you directly.

        Ben Newlin

        *From: *Bogdan-Andrei Iancu <[email protected]>
        *Date: *Wednesday, October 24, 2018 at 5:17 AM
        *To: *OpenSIPS users mailling list <[email protected]>, Ben Newlin <[email protected]>
        *Subject: *Re: [OpenSIPS-Users] CPU 100% with TCP

        Hi Ben,

        Could you run "opensipsctl trap" ?

        Regards,



        Bogdan-Andrei Iancu


        On 10/24/2018 12:56 AM, Ben Newlin wrote:

            Hi,

            We have implemented TCP recently and are performing
            TCP<->UDP translation on one of our proxy types. This
            proxy only exists for that purpose; there are no DB
            queries, REST calls, or anything like that. It is designed
            to be very fast and high throughput.

            Recently we have found that when the remote endpoint of a
            TCP connection is lost, i.e. the server goes down, while
            under moderate load OpenSIPS quickly reaches 100% CPU and
            becomes unresponsive. When this occurs, the “top” command
            shows that between 30% and 90% of CPU is in System (kernel) space,
            and each OpenSIPS TCP process shows many times the normal
            CPU. We are running OpenSIPS 2.4.2 on Amazon Linux.

            I obtained as much information as I could using ps,
            strace, and gdb here: https://pastebin.com/JP3DnCqs. We can
            reproduce the failure consistently by removing a server
            during call traffic.

            A few things I noticed:

              * The number of running threads reported by OpenSIPS
                doesn’t align with our configuration, copied here:

            ####### Global Parameters #########
            children=32

            #// Allow 503 to pass back to Control
            disable_503_translation=yes

            #// Even though we are not receiving HEP,
            #// this listener is required by OpenSIPS
            #// in order to use the proto_hep module. :/
            listen=hep_tcp:10.32.40.245:9061 use_children 1

            #// Configure the listeners
            listen=udp:10.32.40.245:5060 as XXX.XXX.XXX.XXX
            listen=tcp:10.32.40.245:5060 as XXX.XXX.XXX.XXX

            #// Transaction Module
            loadmodule "tm.so"
            modparam("tm", "restart_fr_on_each_reply", 0)
            modparam("tm", "timer_partitions", 8)
            modparam("tm", "onreply_avp_mode", 1)
            modparam("tm", "wt_timer", 10)

            According to the documentation if “tcp_children” is not
            set then the value of “children” will be used [1], but we
            have set “children” to 32 and only have the default 8 TCP
            processes. Also we appear to only have 1 timer process,
            although we have set the number of timer partitions to 8.

              * The server that is terminated was using TCP
                connections exclusively, but all of the CPU seems to
                be in the UDP threads. The one I looked at appeared to
                be handling a CANCEL to one of the calls that was
                active and was attempting to send it out via TCP. I’m
                not sure why it would be trying to relay the CANCEL as
                no 100 Trying had been received from the server. I
                have noticed that in 2.x OpenSIPS will now send
                CANCELs for transactions even when 100 Trying was not
                received. Is that intentional? RFC 3261 states that no
                CANCEL should be sent unless a provisional response
                has been received.

            Any assistance with this would be appreciated.

            [1] - http://www.opensips.org/Documentation/Script-CoreParameters-2-4#toc66

            Ben Newlin






            _______________________________________________

            Users mailing list

            [email protected]

            http://lists.opensips.org/cgi-bin/mailman/listinfo/users










