OK. I will leave it in master for a while to see whether any new reports come in, then I will backport it. I will also have to review the tm reply functions that can be used from the config and align them with the new check.

Cheers,
Daniel

On 10/04/14 09:06, Jason Penton wrote:
Oh excellent, I will look at it right away - I was just getting ready to jump in myself ;)

Cheers
Jason


On Thu, Apr 10, 2014 at 9:01 AM, Daniel-Constantin Mierla <[email protected]> wrote:

    Hello Jason,

    I pushed a patch that tries to fix this case; it is only on the git
    master branch. Can you test it? If all goes fine, we can consider
    backporting it.

    Cheers,
    Daniel


    On 09/04/14 23:26, Jason Penton wrote:
    Hey Daniel,

    nothing extraordinary...

    # -- TM params --
    modparam("tm", "fr_timer", 20000)
    modparam("tm", "fr_inv_timer", 10000)
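
For readers unfamiliar with the two tm timers, here is a small Python sketch of how they relate to the scenario in this thread. It is only a model, not Kamailio code: the function name and its arguments are made up for illustration, and the selection rule is my reading of the tm module documentation (fr_inv_timer applies to an INVITE once a provisional reply has been received, fr_timer otherwise). Both values are in milliseconds.

```python
# Values copied from the modparam lines above (milliseconds).
FR_TIMER_MS = 20000
FR_INV_TIMER_MS = 10000


def final_response_timeout_ms(is_invite, provisional_received):
    """Pick the timeout that applies while waiting for a final reply.

    Hypothetical helper: it only models the rule described in the
    lead-in and is not part of Kamailio.
    """
    if is_invite and provisional_received:
        return FR_INV_TIMER_MS
    return FR_TIMER_MS
```

With these settings, an INVITE that has only been answered with a 100 would wait `final_response_timeout_ms(True, True)` milliseconds before the timer process generates the 408.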


    Cheers
    Jason


    On Wed, Apr 9, 2014 at 10:32 PM, Jason Penton
    <[email protected]> wrote:

        Hey Daniel,

        Yes, I did a test with a very basic config file and I am not
        able to re-create it. However, with my *complex* cfg file I can
        re-create it every time. Tomorrow I will compare what is
        different and report back... hopefully with a fix ;)

        Here is the backtrace of the timer process deadlocking itself:

        #0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:39
        #1  0x00007f5009f22004 in futex_get (lock=0x7f4fc55030d8) at
        ../../mem/../futexlock.h:123
        #2  0x00007f5009f223e1 in _lock (s=0x7f4fc55030d8,
        file=0x7f5009f90fd1 "t_cancel.c", function=0x7f5009f91980
        "cancel_branch", line=250) at lock.h:99
        #3  0x00007f5009f23271 in cancel_branch (t=0x7f4fc5501b40,
        branch=0, reason=0x7fff646d03a8, flags=3) at t_cancel.c:250
        #4  0x00007f5009f22c02 in cancel_uacs (t=0x7f4fc5501b40,
        cancel_data=0x7fff646d03a0, flags=1) at t_cancel.c:123
        #5  0x00007f5009f718c4 in _reply_light (trans=0x7f4fc5501b40,
            buf=0x7f500a24dc68 "SIP/2.0 500 Server error on LIR
        select next S-CSCF\r\nVia: SIP/2.0/UDP
        10.0.1.167:6060;branch=z9hG4bKb7.2ae09f29ffbd0034cd6d58483053603b.1\r\nVia:
        SIP/2.0/UDP
        10.0.1.166:4060;branch=z9hG4bKb7.3faa03ddea80"..., len=778,
        code=500, to_tag=0x7f500a1c7ae0
        "c82b15d7f12ef185f95fe4945457d449-8bab", to_tag_len=37,
        lock=0, bm=0x7fff646d0b60) at t_reply.c:660
        #6  0x00007f5009f7244c in _reply (trans=0x7f4fc5501b40,
        p_msg=0x7f500a1c6bc0, code=500, text=0x7f500a249a48 "Server
        error on LIR select next S-CSCF", lock=0) at t_reply.c:795
        #7  0x00007f5009f76436 in t_reply_unsafe (t=0x7f4fc5501b40,
        p_msg=0x7f500a1c6bc0, code=500, text=0x7f500a249a48 "Server
        error on LIR select next S-CSCF") at t_reply.c:1643
        #8  0x00007f5009f57621 in w_t_reply (msg=0x7f500a1c6bc0,
        p1=0x7f500a2497d8 "\340\332$\nP\177", p2=0x7f500a249870
        "h\321$\nP\177") at tm.c:1324
        #9  0x000000000041a700 in do_action (h=0x7fff646d1d30,
        a=0x7f500a24cee8, msg=0x7f500a1c6bc0) at action.c:1119
        #10 0x0000000000423831 in run_actions (h=0x7fff646d1d30,
        a=0x7f500a24cee8, msg=0x7f500a1c6bc0) at action.c:1607
        #11 0x000000000041a5a4 in do_action (h=0x7fff646d1d30,
        a=0x7f500a24d478, msg=0x7f500a1c6bc0) at action.c:1102
        #12 0x0000000000423831 in run_actions (h=0x7fff646d1d30,
        a=0x7f500a249148, msg=0x7f500a1c6bc0) at action.c:1607
        #13 0x000000000041a54e in do_action (h=0x7fff646d1d30,
        a=0x7f500a24c500, msg=0x7f500a1c6bc0) at action.c:1098
        #14 0x0000000000423831 in run_actions (h=0x7fff646d1d30,
        a=0x7f500a247a28, msg=0x7f500a1c6bc0) at action.c:1607
        #15 0x0000000000423fdf in run_top_route (a=0x7f500a247a28,
        msg=0x7f500a1c6bc0, c=0x0) at action.c:1693
        #16 0x00007f5009f73815 in run_failure_handlers
        (t=0x7f4fc5501b40, rpl=0xffffffffffffffff, code=408,
        extra_flags=96) at t_reply.c:1061
        #17 0x00007f5009f7527a in t_should_relay_response
        (Trans=0x7f4fc5501b40, new_code=408, branch=1,
        should_store=0x7fff646d201c, should_relay=0x7fff646d2018,
        cancel_data=0x7fff646d2070,
            reply=0xffffffffffffffff) at t_reply.c:1416
        #18 0x00007f5009f76ede in relay_reply (t=0x7f4fc5501b40,
        p_msg=0xffffffffffffffff, branch=1, msg_status=408,
        cancel_data=0x7fff646d2070, do_put_on_wait=0) at t_reply.c:1819
        #19 0x00007f5009f44c88 in fake_reply (t=0x7f4fc5501b40,
        branch=1, code=408) at timer.c:354
        #20 0x00007f5009f450e7 in final_response_handler
        (r_buf=0x7f4fc5501e60, t=0x7f4fc5501b40) at timer.c:526
        #21 0x00007f5009f4518d in retr_buf_handler (ticks=260027386,
        tl=0x7f4fc5501e80, p=0x3e8) at timer.c:584
        #22 0x0000000000544119 in timer_list_expire (t=260027386,
        h=0x7f4fc527cbe0, slow_l=0x7f4fc527cdf0, slow_mark=0) at
        timer.c:894
        #23 0x0000000000544418 in timer_handler () at timer.c:959
        #24 0x00000000005446b2 in timer_main () at timer.c:998
        #25 0x0000000000471ddf in main_loop () at main.c:1689



        On Wed, Apr 9, 2014 at 9:34 PM, Daniel-Constantin Mierla
        <[email protected]> wrote:

            Hello,

            That should not be a very rare case, and I would have
            expected it to be caught by now. Anyhow, this looks easy to
            reproduce; have you tried it?

            You can have two Kamailio instances, one relaying the INVITE
            to the second, which will reply with 100; then wait for the
            timeout on the first instance. You can add some debug
            messages in the code to see if the lock is taken twice.
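
The "debug messages" idea can be sketched as a lock wrapper that remembers who holds it and reports a second attempt by the same holder. The Python below is only a model of that technique, not a patch for tm: `DebugLock`, its methods and the `where` labels are hypothetical names chosen for the example, and messages are collected in a list instead of being written to a log.

```python
import threading


class DebugLock:
    """Non-re-entrant lock that records its owner for debugging."""

    def __init__(self, name):
        self.name = name
        self._lock = threading.Lock()
        self._owner = None   # thread id of the current holder, if any
        self.messages = []   # debug output collected in place of a log

    def acquire(self, where):
        me = threading.get_ident()
        if self._owner == me:
            # Second lock attempt by the holder: record it and refuse,
            # because a blocking acquire here would never return.
            self.messages.append("%s: locked twice at %s" % (self.name, where))
            return False
        self._lock.acquire()
        self._owner = me
        self.messages.append("%s: locked at %s" % (self.name, where))
        return True

    def release(self, where):
        self._owner = None
        self._lock.release()
        self.messages.append("%s: unlocked at %s" % (self.name, where))
```

Calling `acquire()` twice from the same thread without a `release()` in between leaves a "locked twice" entry in `messages`, which is the situation one would look for on the first instance when the timeout fires.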

            Cheers,
            Daniel


            On 09/04/14 17:51, Jason Penton wrote:
            Hi All,

            I have been experiencing a deadlock when a timeout
            occurs on an INVITE relayed with t_relay(). Going through
            the code I have noticed a possible deadlock (without
            re-entrant locks enabled). Here is my thinking:

            t_should_relay_response() is called with REPLY_LOCK held
            when the timer process fires on the fr_inv_timer (no
            response to the relayed INVITE other than a 100
            provisional) and a 408 is generated. However, from
            within that function there are calls to
            run_failure_handlers(), which in turn *could* try to
            lock the reply (viz. somebody having a t_reply() call in
            the cfg file, in a failure route block). This would
            result in another lock attempt on the same transaction's
            REPLY_LOCK...
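
The scenario described above can be modelled in a few lines. The following Python sketch is only an illustration, not Kamailio's C code: some function names echo the ones mentioned in this thread, but their bodies, the `Transaction` class, `fake_reply_on_timeout()` and the `reentrant` switch are assumptions made for the example. A non-blocking acquire is used so that the sketch reports the would-be deadlock instead of hanging.

```python
import threading


class Transaction:
    """Toy stand-in for a tm transaction; only the reply lock is modelled."""

    def __init__(self, reentrant=False):
        # threading.Lock is not re-entrant; threading.RLock is.
        self.reply_lock = threading.RLock() if reentrant else threading.Lock()
        self.events = []


def t_reply(t, code):
    """Models a t_reply() call made from a failure route in the cfg."""
    # A blocking acquire would hang forever if this thread already held
    # a non-re-entrant lock, so try without blocking and report instead.
    if not t.reply_lock.acquire(blocking=False):
        t.events.append("would deadlock")
        return False
    try:
        t.events.append("sent %d" % code)
        return True
    finally:
        t.reply_lock.release()


def run_failure_handlers(t):
    """Models the failure route being executed; here it sends a 500."""
    return t_reply(t, 500)


def fake_reply_on_timeout(t):
    """Models the timer firing: lock the reply, then run the handlers."""
    with t.reply_lock:
        t.events.append("timer generated 408")
        return run_failure_handlers(t)
```

With `Transaction(reentrant=False)` the inner call cannot get the lock, which is the point where a blocking lock would hang; with `Transaction(reentrant=True)` the same thread is allowed to take the lock again and the 500 is sent.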

            Has anybody else experienced something like this?

            This is on master, btw.

            Cheers
            Jason


            _______________________________________________
            sr-dev mailing list
            [email protected]
            http://lists.sip-router.org/cgi-bin/mailman/listinfo/sr-dev

            --
            Daniel-Constantin Mierla - http://www.asipto.com
            http://twitter.com/#!/miconda - http://www.linkedin.com/in/miconda





--
Daniel-Constantin Mierla - http://www.asipto.com
http://twitter.com/#!/miconda - http://www.linkedin.com/in/miconda

