Re: [OpenSIPS-Users] CPU 100% with TCP

Bogdan-Andrei Iancu Tue, 13 Nov 2018 12:56:21 -0800

Ben,

The dump you send is actually the backtrace from all the procs. What isthe Holly Grail is the corefile to be inspected via gdb(off list, ofcourse).


Thanks and regards,

Bogdan-Andrei Iancu

OpenSIPS Founder and Developer
  http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
  http://opensips.org/training/OpenSIPS_Bootcamp_2018/

On 11/13/2018 07:36 PM, Ben Newlin wrote:


Bogdan,

Can you clarify if you’re saying you need more information beyond thedumps I’ve just provided to you off-list?


Ben Newlin

*From: *Bogdan-Andrei Iancu <[email protected]>
*Date: *Tuesday, November 13, 2018 at 11:11 AM

*To: *Ben Newlin <[email protected]>, OpenSIPS users maillinglist <[email protected]>

*Subject: *Re: [OpenSIPS-Users] CPU 100% with TCP

Hi Ben,

Sorry for not being able to answer you before sending the new set ofBTs. Indeed, getting the corefile of only one process will do it asthe locks (and debug info) are in the shared memory. So, the deadlockhappens again, do the "opensipsctl trap" and get the corefile of oneprocess (ideally an UDP worker - get its pid via "opensipsctl fifo ps").

Keep the core as we will have to dig into it together :).

Many thanks,


Bogdan-Andrei Iancu
OpenSIPS Founder and Developer
   http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
   http://opensips.org/training/OpenSIPS_Bootcamp_2018/

On 11/06/2018 10:14 PM, Ben Newlin wrote:

    Bogdan,

    I am trying to obtain this information for you but I am having
    trouble getting the core files. Is it really necessary to kill
    every opensips process? This generates almost 40 core files and
    each is quite large (~1GB). I simply don’t have that disk space
    currently. I can make a change to get more but it is slowing the
    process. Would it be sufficient to get just one core file?

    Also, runtime inspection with gdb is possible in this case if you
    can provide me with the commands you would want to see. I would
    need very specific commands as I am not very familiar with gdb.

    Ben Newlin

    *From: *Bogdan-Andrei Iancu <[email protected]>
    <mailto:[email protected]>
    *Date: *Thursday, November 1, 2018 at 1:29 PM
    *To: *Ben Newlin <[email protected]>
    <mailto:[email protected]>, OpenSIPS users mailling list
    <[email protected]> <mailto:[email protected]>
    *Subject: *Re: [OpenSIPS-Users] CPU 100% with TCP

    Hi Ben,

    First be sure you have the DBG_LOCK option compiled in. Do the
    "opensips -V" and see the output flags.

    Next step will be to force an SIGSEGV to opensips (killall -11
    opensips) when the deadlock occurs - I need a core file to inspect
    (assuming that runtime inspection with gdb is not possible).

    Regards,


    Bogdan-Andrei Iancu

    OpenSIPS Founder and Developer

       http://www.opensips-solutions.com

    OpenSIPS Bootcamp 2018

       http://opensips.org/training/OpenSIPS_Bootcamp_2018/

    On 10/31/2018 09:07 PM, Ben Newlin wrote:

        Bogdan,

        For the first test I have done as you suggested and disabled
        only async operation for HEP, so it is still using TCP. I will
        send you the trap info directly as it is too large. I also
        compiled with the DBG_LOCK option, but am unsure whether that
        extra information will be available in the trap output or do
        you need something else?

        I am now going to switch HEP to use UDP to mirror our
        production environment and try to reproduce again. Wish me luck! ☺

        Ben Newlin

        *From: *Bogdan-Andrei Iancu <[email protected]>
        <mailto:[email protected]>
        *Date: *Monday, October 29, 2018 at 2:19 PM
        *To: *Ben Newlin <[email protected]>
        <mailto:[email protected]>, OpenSIPS users mailling list
        <[email protected]> <mailto:[email protected]>
        *Subject: *Re: [OpenSIPS-Users] CPU 100% with TCP

        Hi Ben,

        I checked the error trace and it should not leave any dangling
        lock (due mishandled error). Before disabling HEP, try to
        disable the async support for HEP.

        If you claim that the same 100% CPU happens with HEP + UDP,
        send me a trap for that too, as in the previous case, the
        deadlock was exclusively HEP + TCP related.

        Anyhow, as the original trap showed a deadlock, next step will
        be to recompile with the DBG_LOCK option - this enables extra
        code to debug/troubleshoot locking related issues - are you
        able to do it?

        Regards,



        Bogdan-Andrei Iancu

        OpenSIPS Founder and Developer

           http://www.opensips-solutions.com

        OpenSIPS Bootcamp 2018

           http://opensips.org/training/OpenSIPS_Bootcamp_2018/

        On 10/26/2018 04:14 PM, Ben Newlin wrote:

            Bogdan,

            Actually, yes we do. Looking back I can see these errors
            just before the issue occurs:

            Oct 24 19:00:36 [5700] ERROR:proto_hep:send_hep_message:
            Cannot send hep message!

            Oct 24 19:00:36 [5700] ERROR:proto_hep:msg_send: send() to
            10.32.163.211:9061 for proto hep_tcp/9 failed

            Oct 24 19:00:36 [5700] ERROR:proto_hep:hep_tcp_send:
            failed to send

            Oct 24 19:00:36 [5700] ERROR:proto_hep:async_tsend_stream:
            Failed first TCP async send : (32) Broken pipe

            I will try disabling HEP and see if we can reproduce.

            Just for information, I have been reproducing the issue in
            our testing environment which uses TCP for HEP, however
            the issue is occurring in our production environment as
            well which is still using UDP for HEP.

            Ben Newlin

            *From: *Bogdan-Andrei Iancu <[email protected]>
            <mailto:[email protected]>
            *Date: *Friday, October 26, 2018 at 3:06 AM
            *To: *Ben Newlin <[email protected]>
            <mailto:[email protected]>, OpenSIPS users mailling
            list <[email protected]>
            <mailto:[email protected]>
            *Subject: *Re: [OpenSIPS-Users] CPU 100% with TCP

            Hi Ben,

            Thank you for the info.

            It looks like the processes get stuck into a HEP related
            internal lock - do you see any HEP related errors in your
            logs, prior to the dead-lock ?

            Also, as PoC, could you disabled HEP tracing to see if the
            problem goes away ?

            Thanks,





            Bogdan-Andrei Iancu

            OpenSIPS Founder and Developer

               http://www.opensips-solutions.com

            OpenSIPS Bootcamp 2018

               http://opensips.org/training/OpenSIPS_Bootcamp_2018/

            On 10/24/2018 10:18 PM, Ben Newlin wrote:

                Bogdan,

                I have run the command but the output was too large
                for pastebin so I have sent it to you directly.

                Ben Newlin

                *From: *Bogdan-Andrei Iancu <[email protected]>
                <mailto:[email protected]>
                *Date: *Wednesday, October 24, 2018 at 5:17 AM
                *To: *OpenSIPS users mailling list
                <[email protected]>
                <mailto:[email protected]>, Ben Newlin
                <[email protected]> <mailto:[email protected]>
                *Subject: *Re: [OpenSIPS-Users] CPU 100% with TCP

                Hi Ben,

                Could you run "opensipsctl trap" ?

                Regards,





                Bogdan-Andrei Iancu

                OpenSIPS Founder and Developer

                   http://www.opensips-solutions.com

                OpenSIPS Bootcamp 2018

                   http://opensips.org/training/OpenSIPS_Bootcamp_2018/

                On 10/24/2018 12:56 AM, Ben Newlin wrote:

                    Hi,

                    We have implemented TCP recently and are
                    performing TCP<->UDP translation on one of our
                    proxy types. This proxy only exists for that
                    purpose; there are no DB queries, REST calls, or
                    anything like that. It is designed to be very fast
                    and high throughput.

                    Recently we have found that when the remote
                    endpoint of a TCP connection is lost, i.e. the
                    server goes down, while under moderate load
                    OpenSIPS quickly reaches 100% CPU and becomes
                    unresponsive. When this occurs, the “top” command
                    shows that between 30-90% CPU is in System
                    (kernel) space, and each OpenSIPS TCP process
                    shows many times the normal CPU. We are running
                    OpenSIPS 2.4.2 on Amazon Linux.

                    I obtained as much information as I could using
                    ps, strace, and gdb here:
                    https://pastebin.com/JP3DnCqs
                    <https://pastebin.com/JP3DnCqs>. We can reproduce
                    the failure consistently by removing a server
                    during call traffic.

                    A few things I noticed:

                      * The number of running threads reported by
                        OpenSIPS doesn’t align with our configuration,
                        copied here:

                    ####### Global Parameters #########

                    children=32

                    #// Allow 503 to pass back to Control

                    disable_503_translation=yes

                    #// Even though we are not receiving HEP,

                    #// this listener is required by OpenSIPS

                    #// in order to use the proto_hep module. :/

                    listen=hep_tcp:10.32.40.245:9061 use_children 1

                    #// Configure the listeners

                    listen=udp:10.32.40.245:5060 as XXX.XXX.XXX.XXX

                    listen=tcp:10.32.40.245:5060 as XXX.XXX.XXX.XXX

                    #// Transaction Module

                    loadmodule "tm.so"

                    modparam("tm", "restart_fr_on_each_reply", 0)

                    modparam("tm", "timer_partitions", 8)

                    modparam("tm", "onreply_avp_mode", 1)

                    modparam("tm", "wt_timer", 10)

                    According to the documentation if “tcp_children”
                    is not set then the value of “children” will be
                    used [1], but we have set “children” to 32 and
                    only have the default 8 TCP processes. Also we
                    appear to only have 1 timer process, although we
                    have set the number of timer partitions to 8.

                      * The server that is terminated was using TCP
                        connections exclusively, but all of the CPU
                        seems to be in the UDP threads. The one I
                        looked at appeared to be handling a CANCEL to
                        one of the calls that was active and was
                        attempting to send it out via TCP. I’m not
                        sure why it would be trying to relay the
                        CANCEL as no 100 Trying had been received from
                        the server. I have noticed that in 2.x
                        OpenSIPS will now send CANCELs for
                        transactions even when 100 Trying was not
                        received. Is that intentional? RFC 3261 states
                        that no CANCEL should be sent unless a
                        provisional response has been received.

                    Any assistance with this would be appreciated.

                    [1] -
                    
http://www.opensips.org/Documentation/Script-CoreParameters-2-4#toc66

                    Ben Newlin








                    _______________________________________________

                    Users mailing list

                    [email protected]
                    <mailto:[email protected]>

                    http://lists.opensips.org/cgi-bin/mailman/listinfo/users

_______________________________________________
Users mailing list
[email protected]
http://lists.opensips.org/cgi-bin/mailman/listinfo/users

Re: [OpenSIPS-Users] CPU 100% with TCP

Reply via email to