Re: [SR-Users] DMQ mem leak issues
Hi Julien,

Thanks for checking on this.

I've been working in the background with Charles on this issue and we think we've found a solution, although the cause isn't clear to me yet. Following Charles's advice, we changed the usrloc module parameter db_mode from 1 (Write-Through) to 2 (Write-Back), and there have been no more memory leak incidents since then.

I'll report back if we have any further updates.

Best,
Rogelio

On Wed, Aug 22, 2018 at 11:39 AM Julien Chavanton wrote:

> Hi Rogerio, did you have any luck digging this leak further ?
>
> On Wed, Aug 8, 2018 at 3:37 AM Charles Chance <charles.cha...@sipcentric.com> wrote:
>
>> Hi Rogelio,
>>
>> I have been running master on a three-node lab (one primary, two
>> secondary) for the past 24 hours or so, maintaining 2000 registrations on
>> the primary, replicating to both secondaries, and memory usage has remained
>> constant throughout.
>>
>> I will leave it running for another 24 hours to be sure but in the
>> meantime, you mentioned you are loading records from DB - which mode are
>> you using for writing (write-through or write-back)? Do you experience the
>> same symptoms if you disable the database completely on the secondary nodes
>> (or just one for testing) and instead, enable sync in dmq_usrloc?
>>
>> Cheers,
>>
>> Charles
>>
>> On 7 August 2018 at 16:42, Julien Chavanton wrote:
>>
>>> I wonder if this could be introduced by a regression or if you are
>>> facing a specific edge case
>>>
>>> I briefly looked at the commits of DMQ and DMQ_USRLOC
>>> It seems there was significant work done.
>>> I would give a try with 5.0.0 and then we will at least learn that this
>>> is not a recent regression.
>>>
>>> On Mon, Aug 6, 2018 at 1:43 PM, Rogelio Perez wrote:
>>>
>>>> Charles, Julien, Daniel,
>>>>
>>>> The results are pretty much the same, the mem leak is still there and
>>>> we need to restart Kamailio when it reaches certain threshold.
>>>> https://www.dropbox.com/s/enxx6b7t0c8vl49/Selection_539.png?dl=0
>>>>
>>>> Is there anything else we can try?
>>>> Will a core dump file tell us what's causing it?
>>>>
>>>> Thanks,
>>>> Rogelio
>>>>
>>>> On Thu, Aug 2, 2018 at 2:57 PM Rogelio Perez wrote:
>>>>
>>>>> Thanks Charles, it's working now.
>>>>> I'm deploying to production and confirming results soon.
>>>>>
>>>>> Rogelio

--
Rogelio Perez | engineering | telnyx <https://telnyx.com>
chicago: +1 312 270 8119 | dublin: +353 1 912 6119
___
Kamailio (SER) - Users Mailing List
sr-users@lists.kamailio.org
https://lists.kamailio.org/cgi-bin/mailman/listinfo/sr-users
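For readers following along, the change described in the message above comes down to one usrloc parameter. A minimal sketch, assuming the standard usrloc and dmq_usrloc parameter names. The commented-out lines only illustrate the test Charles suggested for a secondary node (no database, rely on DMQ sync); nobody in the thread posted them as a working configuration.

```
# usrloc: switch from write-through (1) to write-back (2),
# the change reported above to have stopped the leak incidents
modparam("usrloc", "db_mode", 2)

# Illustration only (assumption): the test suggested for a secondary node,
# with the database disabled and contacts synced from peers over DMQ
# modparam("usrloc", "db_mode", 0)
# modparam("dmq_usrloc", "sync", 1)
```

In write-back mode, contact changes are kept in memory and flushed to the database periodically by a timer instead of on every update, so the database is touched far less often per replicated contact.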
Re: [SR-Users] DMQ mem leak issues
Charles, Julien, Daniel,

The results are pretty much the same: the mem leak is still there and we need to restart Kamailio when it reaches a certain threshold.
https://www.dropbox.com/s/enxx6b7t0c8vl49/Selection_539.png?dl=0

Is there anything else we can try?
Will a core dump file tell us what's causing it?

Thanks,
Rogelio

On Thu, Aug 2, 2018 at 2:57 PM Rogelio Perez wrote:

> Thanks Charles, it's working now.
> I'm deploying to production and confirming results soon.
>
> Rogelio
Re: [SR-Users] DMQ mem leak issues
Thanks Charles, it's working now.
I'm deploying to production and will confirm results soon.

Rogelio
Re: [SR-Users] DMQ mem leak issues
Charles, here you go:

    ### Routing Logic

    # Main SIP request routing logic
    # - processing of any incoming SIP request starts with this route
    # - note: this is the same as route { ... }
    request_route {

        # per request initial checks
        route(REQINIT);

    #!ifdef ENABLE_KDMQ
        # Handle Kamailio DMQ messages
        if (is_method("KDMQ")) {
            dmq_handle_message();
        }
    #!endif
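For context, here is a sketch of how a fragment like the one above usually sits next to the module setup. It is an illustration under stated assumptions, not Rogelio's actual file: the addresses are hypothetical placeholders and the parameter values are examples only.

```
# dmq depends on the tm and sl modules; dmq_usrloc depends on dmq and usrloc
# (assumed to be loaded earlier in the file)
loadmodule "dmq.so"
loadmodule "dmq_usrloc.so"

# hypothetical addresses: the local node and one known peer on the bus
modparam("dmq", "server_address", "sip:10.0.0.20:5060")
modparam("dmq", "notification_address", "sip:10.0.0.21:5060")
# example value, not a recommendation
modparam("dmq", "num_workers", 2)

# replicate usrloc contacts over the DMQ bus
modparam("dmq_usrloc", "enable", 1)

request_route {
    # hand KDMQ requests to the dmq module before any other routing
    if (is_method("KDMQ")) {
        dmq_handle_message();
    }

    # ... rest of the routing logic
}
```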
Re: [SR-Users] DMQ mem leak issues
Hello,

We had to roll back the changes, as dmq_usrloc notifications are not working at all with the latest master that includes the DMQ patch.

On the receiving node we see the following errors:

(394) INFO:
Re: [SR-Users] DMQ mem leak issues
Julien,

> Since it seem you are recovering the memory this does not seems like a real "leak"

I forgot to mention that the recoveries are actual manual Kamailio restarts.

> One hypothesis :
> When you restart a node on the DMQ bus, it can trigger memory usage on the other nodes since they will start to do a SYNC and send one DMQ message / contact
> It could be that one node in the DMQ bus is restarted and not answering DMQ messages ?

The mem leak periods do not match the moments when we restart any of the nodes.

> Few ideas :
> You could search you trace, maybe you will find the DMQ sync requests ...

We checked the traces and found nothing unusual at the moment of the mem leak.

> You can also confirm significant increase in active transactions.

Same.

> Verify the state of the bus :
> kamcmd dmq.list_nodes

The primary node state shows the affected secondary node as inactive.

> Verify the amount of contact on each node (confirm that the cluster is healthy)
> kamctl stats | grep usrloc | grep contact

I'll run this check the next time we see the mem leak in action.

Daniel's patch is now in production; I'll confirm results soon.

Thanks,
Rogelio
Re: [SR-Users] DMQ mem leak issues
Thanks Daniel, Charles and Julien.

I confirm we're not getting the error log "running job failed".

The behavior is always the same: either of the two failover instances runs without issues for a day or two and then suddenly starts consuming all available memory in the span of an hour or less. Please check these graphs with some examples for more details:
https://www.dropbox.com/sh/tu0jxi1vlbq81m8/AABhfz9rDumdCu3l0ROH7Lkla?dl=0

I'll try Daniel's patch and confirm results soon.

Rogelio
[SR-Users] DMQ mem leak issues
Hello,

We're running three instances of Kamailio v5.14 as registrars handling registrations from ~2000 SIP clients, with one instance being primary and the other two as backups. The three of them use the dmq and dmq_usrloc modules to synchronize user locations. However, after a couple of days of operation the two failover instances show memory leak behavior, with mem usage assigned to the core taking all available resources.

When this happens we've noticed that:
- The shared memory used by the function "sip_msg_shm_clone" spikes (from 1kb to 1.5GB).
- The shared memory used by the function "dmq:worker.c:job_queue_push" also increases, but not as much (from 1kb to 1MB).
- DMQ requests are not being answered (with a 200 OK) by the affected instance during this memory leak, which makes us think that the DMQ module becomes unresponsive.

A few more notes:
- The failover instances are doing nothing except receiving replicated contacts.
- The shared memory grows at the same rate on both instances, but the critical behavior never happens at the same time.
- We are allocating 1GB of memory on startup to each instance.
- We store the location DB in a psql DB and we load it at startup.
- We didn't find any errors in syslog, even at debug level.

Has anyone experienced a similar issue who can suggest a possible solution?

Thanks,
Rogelio Perez
Telnyx