Re: [SR-Users] DMQ mem leak issues

2018-08-22 Thread Rogelio Perez
Hi Julien,

Thanks for checking on this.
I've been working in the background with Charles on this issue and we think
we've found a solution, although the cause isn't clear to me yet.
Following Charles's advice, we changed the usrloc module parameter db_mode
from 1 (Write-Through) to 2 (Write-Back), and there have been no more memory
leak incidents since then.
I'll report back if we have any further updates.
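
For anyone following along, the change amounts to roughly the following. This
is a minimal sketch rather than our exact config: the db_url value is a
placeholder and not taken from this thread.

```
# Sketch only: credentials, host and database name below are placeholders.
loadmodule "db_postgres.so"
loadmodule "usrloc.so"

modparam("usrloc", "db_url", "postgres://kamailio:PASSWORD@localhost/kamailio")

# Before: 1 = Write-Through (every contact change is written to the DB at once)
# modparam("usrloc", "db_mode", 1)

# After: 2 = Write-Back (changes stay in memory and are flushed to the DB by a timer)
modparam("usrloc", "db_mode", 2)
```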

Best,
Rogelio


On Wed, Aug 22, 2018 at 11:39 AM Julien Chavanton wrote:

> Hi Rogelio, did you have any luck digging into this leak further?
>
> On Wed, Aug 8, 2018 at 3:37 AM Charles Chance <
> charles.cha...@sipcentric.com> wrote:
>
>> Hi Rogelio,
>>
>> I have been running master on a three-node lab (one primary, two
>> secondary) for the past 24 hours or so, maintaining 2000 registrations on
>> the primary, replicating to both secondaries, and memory usage has remained
>> constant throughout.
>>
>> I will leave it running for another 24 hours to be sure but in the
>> meantime, you mentioned you are loading records from DB - which mode are
>> you using for writing (write-through or write-back)? Do you experience the
>> same symptoms if you disable the database completely on the secondary nodes
>> (or just one for testing) and instead, enable sync in dmq_usrloc?
>>
>> Cheers,
>>
>> Charles
>>
>>
>> On 7 August 2018 at 16:42, Julien Chavanton wrote:
>>
>>> I wonder if this could have been introduced by a regression, or if you
>>> are facing a specific edge case.
>>>
>>> I briefly looked at the commits to DMQ and DMQ_USRLOC.
>>> It seems significant work was done there.
>>> I would give 5.0.0 a try; then we will at least learn whether this
>>> is a recent regression.
>>
>

-- 
<https://telnyx.com>
Rogelio Perez | engineering | telnyx <https://telnyx.com>
chicago: +1 312 270 8119 | dublin: +353 1 912 6119
___
Kamailio (SER) - Users Mailing List
sr-users@lists.kamailio.org
https://lists.kamailio.org/cgi-bin/mailman/listinfo/sr-users


Re: [SR-Users] DMQ mem leak issues

2018-08-06 Thread Rogelio Perez
Charles, Julien, Daniel,

The results are pretty much the same: the mem leak is still there and we
need to restart Kamailio when it reaches a certain threshold.
https://www.dropbox.com/s/enxx6b7t0c8vl49/Selection_539.png?dl=0

Is there anything else we can try?
Will a core dump file tell us what's causing it?

Thanks,
Rogelio





Re: [SR-Users] DMQ mem leak issues

2018-08-02 Thread Rogelio Perez
Thanks Charles, it's working now.
I'm deploying to production and will confirm results soon.

Rogelio


Re: [SR-Users] DMQ mem leak issues

2018-08-01 Thread Rogelio Perez
Charles, here you go:

### Routing Logic 

# Main SIP request routing logic
# - processing of any incoming SIP request starts with this route
# - note: this is the same as route { ... }
request_route {

  # per request initial checks
  route(REQINIT);

  #!ifdef ENABLE_KDMQ
  # Handle Kamailio DMQ messages
  if (is_method("KDMQ")) {
    dmq_handle_message();
  }
  #!endif


Re: [SR-Users] DMQ mem leak issues

2018-08-01 Thread Rogelio Perez
Hello,

We had to roll back the changes, as dmq_usrloc notifications are not working
at all with the latest master that includes the DMQ patch.
On the receiving node we see the following errors:

(394) INFO: 

Re: [SR-Users] DMQ mem leak issues

2018-07-31 Thread Rogelio Perez
Julien,

> Since it seems you are recovering the memory, this does not seem like a
> real "leak".
I forgot to mention that the recoveries are actually manual Kamailio restarts.

> One hypothesis:
> When you restart a node on the DMQ bus, it can trigger memory usage on
> the other nodes, since they will start to do a SYNC and send one DMQ
> message per contact.
> It could be that one node in the DMQ bus is restarted and not answering
> DMQ messages?
The mem leak periods do not match the times at which we restart any of the
nodes.

> A few ideas:
> You could search your trace; maybe you will find the DMQ sync requests ...
We checked the traces and found nothing unusual at the moment of the mem
leak.

> You can also confirm a significant increase in active transactions.
Same here: nothing unusual.

> Verify the state of the bus:
> kamcmd dmq.list_nodes
The primary node's list shows the affected secondary node as inactive.

> Verify the number of contacts on each node (confirm that the cluster is
> healthy):
> kamctl stats | grep usrloc | grep contact
I'll run this check the next time we see the mem leak in action.

Daniel's patch is now in production; I'll confirm results soon.

Thanks,
Rogelio


Re: [SR-Users] DMQ mem leak issues

2018-07-31 Thread Rogelio Perez
Thanks Daniel, Charles and Julien.

I confirm we're not getting the error log "running job failed".
The behavior is always the same: either of the two failover instances runs
without issues for a day or two and then suddenly starts consuming all
available memory within an hour or less.
Please check these graphs for some examples and more details:
https://www.dropbox.com/sh/tu0jxi1vlbq81m8/AABhfz9rDumdCu3l0ROH7Lkla?dl=0

I'll try Daniel's patch and confirm results soon.

Rogelio


[SR-Users] DMQ mem leak issues

2018-07-30 Thread Rogelio Perez
Hello,

We're running three instances of Kamailio v5.14 as registrars handling
registrations from ~2000 SIP clients, with one instance acting as primary
and the other two as backups.

All three use the dmq and dmq_usrloc modules to synchronize user locations.
However, after a couple of days of operation the two failover instances
show memory-leak behavior, with the memory usage attributed to the core
taking all available resources.

When this happens we've noticed that:
 - The shared memory used by the function "sip_msg_shm_clone" spikes (from
1 KB to 1.5 GB).
 - The shared memory used by the function "dmq:worker.c:job_queue_push"
also increases, but not as much (from 1 KB to 1 MB).
 - DMQ requests are not being answered (with a 200 OK) by the affected
instance during this memory leak, which makes us think that the DMQ module
becomes unresponsive.

A few more notes:
 - The failover instances are doing nothing except receiving replicated
contacts.
 - The shared memory grows at the same rate on both instances, but the
critical behavior never happens at the same time.
 - We are allocating 1GB memory on startup to each instance.
 - We store the location DB in a PostgreSQL database and load it at startup.
 - We didn't find any errors in syslog, even at debug level.
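
To make the replication setup concrete, here is a rough sketch of the DMQ side
on one of the failover instances. It is illustrative only: the IP addresses,
the port and the num_workers value are assumptions made up for this example
and are not our production values.

```
# Sketch only: addresses, port and worker count are placeholders.
listen=udp:10.0.0.2:5090

loadmodule "sl.so"          # required by dmq
loadmodule "tm.so"          # required by dmq
loadmodule "usrloc.so"
loadmodule "dmq.so"
loadmodule "dmq_usrloc.so"

# Address this node uses on the DMQ bus
modparam("dmq", "server_address", "sip:10.0.0.2:5090")
# Node to contact at startup to learn about the other nodes (here: the primary)
modparam("dmq", "notification_address", "sip:10.0.0.1:5090")
modparam("dmq", "num_workers", 2)

# Replicate usrloc contacts over DMQ (the module is disabled by default)
modparam("dmq_usrloc", "enable", 1)
```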

Has anyone experienced a similar issue who can suggest a possible solution?

Thanks,
Rogelio Perez
Telnyx