Hi all,
For the sake of completeness, here is the commit that fixes the issue:
https://github.com/OpenSIPS/opensips/commit/058cc22cb55dce9b890308b9f83a42a88691f2c8
Thank you Yuval for the report and for investigating this!
Best regards,
Bogdan-Andrei Iancu
OpenSIPS Founder and Developer
http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
http://opensips.org/training/OpenSIPS_Bootcamp_2018/
On 07/12/2018 04:07 PM, Yuval Dinari via Users wrote:
Hi,
I have a scenario in which OpenSIPS gets into an unrecoverable bad state:
some of the TCP child processes are stuck waiting to acquire a lock
that they never get.
The issue occurs in the following load test scenario:
1. About 25K clients register over TCP (it also happens with fewer)
2. All the TCP connections become unresponsive (by blocking outgoing
traffic on the test clients' machine)
3. INVITEs are sent for each of those clients, putting their
connections in retransmit mode
4. After a few minutes OpenSIPS gets into a bad state: some TCP
children run at 90-100% CPU, and no traffic is sent from the
machine (including OPTIONS pings)
5. After all the TCP connections die due to timeouts, OpenSIPS does
not recover; the symptoms above persist
6. After all the registered users are removed from the internal table
there is still no change
When attaching a debugger to the problematic processes (the ones with
high CPU usage) we see that they are all stuck trying to acquire a lock
which they never seem to get. Stack traces:
#0 0x00007fd6b72d1bb7 in sched_yield () at ../sysdeps/unix/syscall-template.S:81
#1 0x0000000000549e65 in get_lock (lock=<optimized out>) at net/proto_tcp/../../net/../fastlock.h:221
#2 _tcp_write_on_socket (len=<optimized out>, buf=<optimized out>, fd=<optimized out>, c=<optimized out>) at net/proto_tcp/proto_tcp.c:724
#3 proto_tcp_send (send_sock=0x7ffd8e12c140, buf=0x0, len=399, to=0x7fd5c7ccdcc0, id=1) at net/proto_tcp/proto_tcp.c:922
#4 0x00007fd5a5cb7b30 in msg_send (msg=<optimized out>, len=<optimized out>, buf=<optimized out>, id=<optimized out>, to=<optimized out>, proto=<optimized out>, send_sock=0x7fd6a7208168) at ../../forward.h:123
#5 send_pr_buffer (rb=0x7fd5c7ccdca0, buf=0x7fd6a76b4a50, len=0, ctx=0xffffffffffffffff) at t_funcs.c:66
And:
#0 0x00007fd6b72d1bb7 in sched_yield () at ../sysdeps/unix/syscall-template.S:81
#1 0x00000000005349b8 in get_lock (lock=<optimized out>) at net/../fastlock.h:221
#2 handle_io (event_type=<optimized out>, idx=<optimized out>, fm=<optimized out>) at net/net_tcp_proc.c:210
#3 io_wait_loop_epoll (repeat=287, t=<optimized out>, h=<optimized out>) at net/../io_wait_loop.h:280
These traces look the same every time we attach.
The machine OpenSIPS runs on has 4 CPUs.
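For readers trying to relate the traces to the CPU usage: both stacks end
in sched_yield() called from get_lock(), which is what one would expect
from a lock that retries in a loop and yields between attempts. The
following is only a minimal sketch of that pattern, assuming a plain
test-and-set lock; it is not the code from fastlock.h, and every demo_*
name is hypothetical. If the holder never releases the lock, the unbounded
variant keeps a core busy indefinitely, which would match the 90-100% CPU
symptom.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a test-and-set lock (not the OpenSIPS type). */
typedef atomic_flag demo_lock_t;

/* Unbounded acquire: retry until the flag is free, yielding the CPU
 * between attempts. If the holder never releases, this never returns
 * and a debugger will keep finding the process inside sched_yield(). */
static void demo_get_lock(demo_lock_t *l)
{
    assert(l != NULL);
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        sched_yield();
}

/* Release the lock so that a waiting acquirer can proceed. */
static void demo_release_lock(demo_lock_t *l)
{
    assert(l != NULL);
    atomic_flag_clear_explicit(l, memory_order_release);
}

/* Bounded acquire, added only so the behaviour can be observed without
 * hanging: gives up and returns false after max_spins failed attempts. */
static bool demo_try_lock_bounded(demo_lock_t *l, unsigned max_spins)
{
    assert(l != NULL);
    for (unsigned i = 0; i < max_spins; i++) {
        if (!atomic_flag_test_and_set_explicit(l, memory_order_acquire))
            return true;   /* flag was clear, we now hold the lock */
        sched_yield();     /* flag was set, let other processes run */
    }
    return false;
}
```

Initializing a demo_lock_t with ATOMIC_FLAG_INIT, taking it once with
demo_get_lock() and then calling demo_try_lock_bounded() on it fails for
any spin count until demo_release_lock() is called, so a lock that is
never released leaves every later acquirer spinning.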
Thanks
_______________________________________________
Users mailing list
[email protected]
http://lists.opensips.org/cgi-bin/mailman/listinfo/users