Re: [OpenSIPS-Users] Potential memory leak in siptrace module
Hi, In answer to your questions... 1. $ opensips -V version: opensips 2.3.2 (x86_64/linux) flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, F_MALLOC, FAST_LOCK-ADAPTIVE_WAIT ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535 poll method support: poll, epoll_lt, epoll_et, sigio_rt, select. main.c compiled on 15:38:40 Nov 24 2017 with gcc 5.4.0 I built from commit 9e497ea4ab03c67cb073bdf0dccef2b16bff80e2. I've since diff'd with the HEAD and I haven't seen any changes in the siptrace code that would have helped my particular issue. 2. The call to sip_trace is in the main request route branch and it gets called for all requests. input parameters: sip_trace("tid", "d", "sip"); module config: modparam("siptrace", "trace_on", 0) modparam("siptrace", "trace_id", "[tid]uri=mysql://db-user:db-pass@db-host/db-name;table=sip_trace;") From: Bogdan-Andrei Iancu [mailto:bog...@opensips.org] Sent: Thursday, February 1, 2018 10:54 AM To: OpenSIPS users mailling list <users@lists.opensips.org>; Steve Brisson <steve.bris...@librestream.com> Subject: Re: [OpenSIPS-Users] Potential memory leak in siptrace module Hi Steve, Thank you for your report. Could you provide the exact revision of your OpenSIPS (opensips -V) ? Also what requests are you calling the sip_trace() for (INVITES ? PUBLISH? all?) and what are the input paramters ? Thanks, Bogdan-Andrei Iancu OpenSIPS Founder and Developer http://www.opensips-solutions.com OpenSIPS Summit 2018 http://www.opensips.org/events/Summit-2018Amsterdam On 01/31/2018 07:48 PM, Steve Brisson wrote: I've found that when I call the sip_trace method, in the main request routing block, the shmem used_size and fragments continually increases until the process crashes from an out of memory condition. By enabling memory alloc debugging, I confirmed that the leaked allocation is the shm_malloc of a trace_info_t structure in the sip_trace_w method [siptrace.c:1539]. This piece of memory is supposed to be freed in the dialog terminated callback but because of the logic in the sip_trace_w method, trace_flags gets modified to TRACE_TRANSACTION and the dialog callbacks are never registered. I haven't worked out the details of a proper fix yet. My workaround has been to remove the siptrace module from my config. OpenSIPS config details: opensips 2.3.2 (x86_64/linux) #siptrace module config modparam("siptrace", "trace_on", 0) modparam("siptrace", "trace_id", "[tid]uri=mysql://db-user:db-pass@db-host/db-name;table=sip_trace;") #siptrace method call in main request branch sip_trace("tid", "d", "sip"); ___ Users mailing list Users@lists.opensips.org<mailto:Users@lists.opensips.org> http://lists.opensips.org/cgi-bin/mailman/listinfo/users ___ Users mailing list Users@lists.opensips.org http://lists.opensips.org/cgi-bin/mailman/listinfo/users
Re: [OpenSIPS-Users] tcp_async timeouts confusion
Sorry for the late response. Unfortunately I no longer have call access to the vcs server that experienced this error and the issue seemed timing dependent on that particular call path. I will update the bug with DBG logs if I am able to recreate the problem by some other means. Thanks for the help. From: Users [mailto:users-boun...@lists.opensips.org] On Behalf Of Liviu Chircu Sent: Friday, January 19, 2018 7:16 AM To: users@lists.opensips.org Subject: Re: [OpenSIPS-Users] tcp_async timeouts confusion Hi Steve, I've opened a GitHub issue for this report so we can better keep track of it [1], and will likely be next on my priority list. I will try to find some time to attempt to reproduce it asap. In the meanwhile, would you be as kind as to supply some debug logs while the async TCP reads are piling up? You may ship them to li...@opensips.org<mailto:li...@opensips.org>. Best regards, [1]: https://github.com/OpenSIPS/opensips/issues/1259 Liviu Chircu OpenSIPS Developer http://www.opensips-solutions.com On 18.01.2018 18:35, Steve Brisson wrote: Hi, just wondering if this issue is still being considered. Thanks, steve From: Users [mailto:users-boun...@lists.opensips.org] On Behalf Of Steve Brisson Sent: Thursday, January 11, 2018 11:56 AM To: OpenSIPS users mailling list <users@lists.opensips.org><mailto:users@lists.opensips.org> Subject: Re: [OpenSIPS-Users] tcp_async timeouts confusion TCP scenario: Yes that is exactly correct. If I have tcp_connect_timeout=3000 and tcp_async=1 configured then calls to the vcs fail. If I then make the single change to disable tcp_async, calls to my vcs endpoints will work. TLS scenario: Thanks for the clarification, makes sense. Unfortunately the online documentation doesn't describe the parameters and lists an incorrect scaling factor (it's defined in milliseconds, not seconds). steve From: Users [mailto:users-boun...@lists.opensips.org] On Behalf Of Liviu Chircu Sent: Thursday, January 11, 2018 3:39 AM To: users@lists.opensips.org<mailto:users@lists.opensips.org> Subject: Re: [OpenSIPS-Users] tcp_async timeouts confusion Answers below, On 09.01.2018 22:25, Steve Brisson wrote: *** Using TCP *** After the invite is sent to the vcs, tcpdump at the opensips server showed 100, 180, and 200 OK responses from the vcs arriving and ACK'd correctly at the opensips server. The 100 response arrived 185ms after the invite is sent. But, I don't see these responses in the branch's onreply_route, the global onreply_route, or in the log at DBG level. netstat -t shows the connection with the data in the recv-q that never reaches 0. This implies to me that opensips is not polling that connection correctly for recv data. If I disable tcp_async then the call is completed successfully. So in the case that works, I have tcp_connect_timeout=3000 and tcp_async=0. Just to confirm the scenario: So the signaling is broken with "tcp_connect_timeout=3000 and tcp_async=1" (reply routes do not get triggered / recv-q keeps on growing), however once you switch to "tcp_async=0", everything is back to working? *** Using TLS *** Running tcpdump, I see the opensips server send a Client Hello then a FIN packet 100ms later. The vcs responds with a Server Hello 200ms after the Client Hello and this gets RST. To workaround this case, I set tls_handshake_timeout=3000 and tls_send_timeout=1000. Maybe this is the correct behavior, I'm still not 100% sure how the tls parameters function. This time, it's behaving as expected. Maybe there should be a diagram somewhere with how these parameters work together. For example, each TLS connection will roughly follow the below steps along with their corresponding parameterized timeouts: 1. TCP connect (tcp_connect_timeout) 2. TLS connect/accept handshake (tls_handshake_timeout) 3. TLS write (tls_send_timeout) 4. TLS write (tls_send_timeout) Liviu Chircu OpenSIPS Developer http://www.opensips-solutions.com ___ Users mailing list Users@lists.opensips.org<mailto:Users@lists.opensips.org> http://lists.opensips.org/cgi-bin/mailman/listinfo/users ___ Users mailing list Users@lists.opensips.org http://lists.opensips.org/cgi-bin/mailman/listinfo/users
[OpenSIPS-Users] Potential memory leak in siptrace module
I've found that when I call the sip_trace method, in the main request routing block, the shmem used_size and fragments continually increases until the process crashes from an out of memory condition. By enabling memory alloc debugging, I confirmed that the leaked allocation is the shm_malloc of a trace_info_t structure in the sip_trace_w method [siptrace.c:1539]. This piece of memory is supposed to be freed in the dialog terminated callback but because of the logic in the sip_trace_w method, trace_flags gets modified to TRACE_TRANSACTION and the dialog callbacks are never registered. I haven't worked out the details of a proper fix yet. My workaround has been to remove the siptrace module from my config. OpenSIPS config details: opensips 2.3.2 (x86_64/linux) #siptrace module config modparam("siptrace", "trace_on", 0) modparam("siptrace", "trace_id", "[tid]uri=mysql://db-user:db-pass@db-host/db-name;table=sip_trace;") #siptrace method call in main request branch sip_trace("tid", "d", "sip"); ___ Users mailing list Users@lists.opensips.org http://lists.opensips.org/cgi-bin/mailman/listinfo/users
Re: [OpenSIPS-Users] tcp_async timeouts confusion
Hi, just wondering if this issue is still being considered. Thanks, steve From: Users [mailto:users-boun...@lists.opensips.org] On Behalf Of Steve Brisson Sent: Thursday, January 11, 2018 11:56 AM To: OpenSIPS users mailling list <users@lists.opensips.org> Subject: Re: [OpenSIPS-Users] tcp_async timeouts confusion TCP scenario: Yes that is exactly correct. If I have tcp_connect_timeout=3000 and tcp_async=1 configured then calls to the vcs fail. If I then make the single change to disable tcp_async, calls to my vcs endpoints will work. TLS scenario: Thanks for the clarification, makes sense. Unfortunately the online documentation doesn't describe the parameters and lists an incorrect scaling factor (it's defined in milliseconds, not seconds). steve From: Users [mailto:users-boun...@lists.opensips.org] On Behalf Of Liviu Chircu Sent: Thursday, January 11, 2018 3:39 AM To: users@lists.opensips.org<mailto:users@lists.opensips.org> Subject: Re: [OpenSIPS-Users] tcp_async timeouts confusion Answers below, On 09.01.2018 22:25, Steve Brisson wrote: *** Using TCP *** After the invite is sent to the vcs, tcpdump at the opensips server showed 100, 180, and 200 OK responses from the vcs arriving and ACK'd correctly at the opensips server. The 100 response arrived 185ms after the invite is sent. But, I don't see these responses in the branch's onreply_route, the global onreply_route, or in the log at DBG level. netstat -t shows the connection with the data in the recv-q that never reaches 0. This implies to me that opensips is not polling that connection correctly for recv data. If I disable tcp_async then the call is completed successfully. So in the case that works, I have tcp_connect_timeout=3000 and tcp_async=0. Just to confirm the scenario: So the signaling is broken with "tcp_connect_timeout=3000 and tcp_async=1" (reply routes do not get triggered / recv-q keeps on growing), however once you switch to "tcp_async=0", everything is back to working? *** Using TLS *** Running tcpdump, I see the opensips server send a Client Hello then a FIN packet 100ms later. The vcs responds with a Server Hello 200ms after the Client Hello and this gets RST. To workaround this case, I set tls_handshake_timeout=3000 and tls_send_timeout=1000. Maybe this is the correct behavior, I'm still not 100% sure how the tls parameters function. This time, it's behaving as expected. Maybe there should be a diagram somewhere with how these parameters work together. For example, each TLS connection will roughly follow the below steps along with their corresponding parameterized timeouts: 1. TCP connect (tcp_connect_timeout) 2. TLS connect/accept handshake (tls_handshake_timeout) 3. TLS write (tls_send_timeout) 4. TLS write (tls_send_timeout) Liviu Chircu OpenSIPS Developer http://www.opensips-solutions.com ___ Users mailing list Users@lists.opensips.org http://lists.opensips.org/cgi-bin/mailman/listinfo/users
Re: [OpenSIPS-Users] tcp_async timeouts confusion
Thanks for the explanation! That matches what I was expecting to see so I think there is an issue worth examining here. *** Server Details *** opensips -V version: opensips 2.3.2 (x86_64/linux) flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, F_MALLOC, FAST_LOCK-ADAPTIVE_WAIT ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535 poll method support: poll, epoll_lt, epoll_et, sigio_rt, select. main.c compiled on 15:38:40 Nov 24 2017 with gcc 5.4.0 My main problem scenario is establishing a connection between an endpoint local to opensips, and an endpoint registered to a cisco vcs, using either TCP or TLS. Both transports fail in slightly different ways I will describe below. The opensips server is running on an AWS server so I have the advertised_address and listen aliases set to deal with IP translation. The only tcp timeout I had configured was tcp_connect_timeout=3000. *** Using TCP *** After the invite is sent to the vcs, tcpdump at the opensips server showed 100, 180, and 200 OK responses from the vcs arriving and ACK'd correctly at the opensips server. The 100 response arrived 185ms after the invite is sent. But, I don't see these responses in the branch's onreply_route, the global onreply_route, or in the log at DBG level. netstat -t shows the connection with the data in the recv-q that never reaches 0. This implies to me that opensips is not polling that connection correctly for recv data. If I disable tcp_async then the call is completed successfully. So in the case that works, I have tcp_connect_timeout=3000 and tcp_async=0. *** Using TLS *** Running tcpdump, I see the opensips server send a Client Hello then a FIN packet 100ms later. The vcs responds with a Server Hello 200ms after the Client Hello and this gets RST. To workaround this case, I set tls_handshake_timeout=3000 and tls_send_timeout=1000. Maybe this is the correct behavior, I'm still not 100% sure how the tls parameters function. *** Conclusion *** In both the TCP and TLS cases it seems like the tcp_connect_timeout isn't being used as expected. So to workaround this, I went from having only tcp_connect_timeout=3000 to: tcp_connect_timeout=3000 modparam("proto_tcp", "tcp_async", 0) modparam("proto_tcp", "tcp_send_timeout", 1000) modparam("tls_mgm", "tls_handshake_timeout", 3000) modparam("tls_mgm", "tls_send_timeout", 1000) Please let me know if I am unclear in my description of the issue. There is a lot of details to go through. Thanks again for the quick response. From: Users [mailto:users-boun...@lists.opensips.org] On Behalf Of Liviu Chircu Sent: Tuesday, January 9, 2018 3:20 AM To: users@lists.opensips.org Subject: Re: [OpenSIPS-Users] tcp_async timeouts confusion Hi, Steve! The transport layer was heavily refactored roughly three years ago, see [1], [2] and [3] for the relevant commits which, indeed, bumped the default connect timeout down a lot, to a much lower value (10s -> 100ms). Although 100ms might seem unnecessary (it's async! let it sleep as long as it wants!), keep in mind that the TLS support isn't async at all, yet it will also make use of the same, default "tcp_connect_timeout" - a 10s default here is quite bad for high traffic volume TLS proxies which often need to open up lots of TCP/TLS connections. All in all, the "tcp_connect_timeout" should not get ignored at all. The "tcp_async_local_connect_timeout" [4] is the first one that hits, after which the connect waiting will be performed by a non-TCP worker, up to "tcp_connect_timeout" milliseconds. If it doesn't behave like this, let us know, and we'll look into it more. Best regards, [1]: https://github.com/OpenSIPS/opensips/commit/b343ca1c [2]: https://github.com/OpenSIPS/opensips/commit/78c84620 [3]: https://github.com/OpenSIPS/opensips/commit/11aedc6 [4]: http://www.opensips.org/html/docs/modules/2.4.x/proto_tcp.html#idp5544000 Liviu Chircu OpenSIPS Developer http://www.opensips-solutions.com On 08.01.2018 22:38, Steve Brisson wrote: Hi, I've run into some issues related to tcp_async and tcp/tls timeouts since upgrading opensips from v1.8 to v2.3. Based on my v1.8 config, I had the tcp_connect_timeout set to 3 secs but this gets ignored in v2.3 because tcp_async is enabled by default. As a result, calls made from a local opensips endpoint to a remote registered endpoint (through a cisco vcs) were failing. I then noticed that the tcp/tls timeouts were aggressively reduced from 10-30s to 100ms by default with the tcp_async feature. My main questions are: - How is the tcp_async feature supposed to function if the tcp_async_local_connect_timeout expires? The code seems to imply that the socket gets put onto a tcp main thread and handled. - 100ms seems pretty short as a default for these timeouts, especially tls.
[OpenSIPS-Users] tcp_async timeouts confusion
Hi, I've run into some issues related to tcp_async and tcp/tls timeouts since upgrading opensips from v1.8 to v2.3. Based on my v1.8 config, I had the tcp_connect_timeout set to 3 secs but this gets ignored in v2.3 because tcp_async is enabled by default. As a result, calls made from a local opensips endpoint to a remote registered endpoint (through a cisco vcs) were failing. I then noticed that the tcp/tls timeouts were aggressively reduced from 10-30s to 100ms by default with the tcp_async feature. My main questions are: - How is the tcp_async feature supposed to function if the tcp_async_local_connect_timeout expires? The code seems to imply that the socket gets put onto a tcp main thread and handled. - 100ms seems pretty short as a default for these timeouts, especially tls. Does a timeout result in the sip request getting cancelled or is there still some processing that can occur after because of handling on the tcp main thread. In short, I'm a confused about what the tcp_async feature does and how the timeouts should be set. Any explanations would be greatly appreciated. Thanks for your time. steve ___ Users mailing list Users@lists.opensips.org http://lists.opensips.org/cgi-bin/mailman/listinfo/users