For those following along, this appears to have been an issue with Auto
Scaling. Once disabled, the problem has not recurred. In our case this is a
perfectly acceptable way to move forward, but maybe at some stage we'll come
back to working out the root cause.
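(For reference, the auto-scaling behaviour lives in opensips.cfg as an
auto_scaling_profile attached to the worker counts. A minimal sketch of the
sort of config involved - profile name, limits and thresholds below are
illustrative, not our production values; check the 3.x core docs for the
exact syntax:

    # hypothetical profile - names and thresholds are made up for the example
    auto_scaling_profile = WORKERS
        scale up to 16 on 70% for 4 cycles within 5
        scale down to 4 on 18% for 10 cycles

    udp_workers = 8 use_auto_scaling_profile WORKERS
    tcp_workers = 8 use_auto_scaling_profile WORKERS

    # "disabling" auto scaling just means dropping the profile reference
    # and running with fixed worker counts:
    # udp_workers = 8
    # tcp_workers = 8

With the profile references removed, the worker set stays fixed and no
processes are forked or terminated at runtime.)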
Andrew

On Sat, 9 Oct 2021 at 01:39, Andrew Yager <[email protected]> wrote:
> Process 24103 (the blocked proc) last logged at 19:02:01, and in doing
> so, it had:
>
> Oct 8 19:02:01 hvprxy osips[24103]: DBG:tm:t_lookup_request: start searching: hash=34811, isACK=0
> Oct 8 19:02:01 hvprxy osips[24103]: DBG:tm:matching_3261: RFC3261 transaction matching failed
> Oct 8 19:02:01 hvprxy osips[24103]: DBG:tm:t_lookup_request: no transaction found
> Oct 8 19:02:01 hvprxy osips[24103]: DBG:statistics:w_update_stat: needed statistic is <cpbx_presence_notify_rcv>
> Oct 8 19:02:01 hvprxy osips[24103]: DBG:statistics:parse_groupname: group: '', name: 'cpbx_presence_notify_rcv'
> Oct 8 19:02:01 hvprxy osips[24103]: DBG:core:select_contacts: ct: sip:[snip]@[snip]:55087;transport=tls
> Oct 8 19:02:01 hvprxy osips[24103]: DBG:core:push_branch: setting msg R-URI <sip:[snip]@[snip]:55087;transport=tls>
> Oct 8 19:02:01 hvprxy osips[24103]: [RELAY] Performing mid-registrar lookup on sip:[snip]:5060
> Oct 8 19:02:01 hvprxy osips[24103]: DBG:core:lookup: '' Not found in usrloc
> Oct 8 19:02:01 hvprxy osips[24103]: DBG:statistics:w_update_stat: needed statistic is <cpbx_contact_lookup_failed>
> Oct 8 19:02:01 hvprxy osips[24103]: DBG:statistics:parse_groupname: group: '', name: 'cpbx_contact_lookup_failed'
> Oct 8 19:02:01 hvprxy osips[24103]: DBG:core:select_contacts: ct: sip:[snip]@[snip]:55087;transport=tls
> Oct 8 19:02:01 hvprxy osips[24103]: DBG:core:push_branch: setting msg R-URI <sip:xxx@xxx:55087;transport=tls>
> Oct 8 19:02:01 hvprxy osips[24103]: [RELAY] Performed mid_registrar_lookup: sip:xxx@xxx:55087;transport=tls
> Oct 8 19:02:01 hvprxy osips[24103]: DBG:tm:t_newtran: transaction on entrance=(nil)
> Oct 8 19:02:01 hvprxy osips[24103]: DBG:core:parse_headers: flags=ffffffffffffffff
> Oct 8 19:02:01 hvprxy osips[24103]: DBG:core:parse_headers: flags=78
> Oct 8 19:02:01 hvprxy osips[24103]: DBG:tm:t_lookup_request: start searching: hash=34811, isACK=0
> Oct 8 19:02:01 hvprxy osips[24103]: DBG:tm:matching_3261: RFC3261 transaction matching failed
> Oct 8 19:02:01 hvprxy osips[24103]: DBG:tm:t_lookup_request: no transaction found
> Oct 8 19:02:01 hvprxy osips[24103]: DBG:tm:run_reqin_callbacks: trans=0x7fafa71d8150, callback type 1, id 0 entered
> Oct 8 19:02:01 hvprxy osips[24103]: DBG:core:mk_proxy: doing DNS lookup...
> Oct 8 19:02:01 hvprxy osips[24103]: DBG:core:parse_headers: flags=2000
> Oct 8 19:02:01 hvprxy osips[24103]: DBG:core:parse_headers: flags=ffffffffffffffff
> Oct 8 19:02:01 hvprxy osips[24103]: DBG:core:tcp_conn_get: con found in state 0
> Oct 8 19:02:01 hvprxy osips[24103]: DBG:core:tcp_conn_get: tcp connection found (0x7fafa84997e8), acquiring fd
> Oct 8 19:02:01 hvprxy osips[24103]: DBG:core:tcp_conn_get: c= 0x7fafa84997e8, n=16, Usock=647
> < no more >
>
> Proc 11468 (tcp main thread) logged:
>
> Oct 8 19:02:01 hvprxy osips[11468]: DBG:core:probe_max_sock_buff: getsockopt: snd is initially 425984
> Oct 8 19:02:01 hvprxy osips[11468]: DBG:core:probe_max_sock_buff: setting snd buf to 851968 had no effect
> Oct 8 19:02:01 hvprxy osips[11468]: DBG:core:probe_max_sock_buff: setting snd buf to 428032 had no effect
> Oct 8 19:02:01 hvprxy osips[11468]: DBG:core:probe_max_sock_buff: using snd buffer of 416 kb
> Oct 8 19:02:01 hvprxy osips[11468]: DBG:core:init_sock_keepalive: TCP keepalive enabled on socket 636
> Oct 8 19:02:01 hvprxy osips[11468]: DBG:core:print_ip: tcpconn_new: new tcp connection to: xxx
> Oct 8 19:02:01 hvprxy osips[11468]: DBG:core:tcpconn_new: on port 11934, proto 3
> Oct 8 19:02:01 hvprxy osips[11468]: DBG:core:tcpconn_add: hashes: 1004, 320
> Oct 8 19:02:01 hvprxy osips[11468]: DBG:core:handle_new_connect: new connection: 0x7fafa7cc0c80 636 flags: 001c
> Oct 8 19:02:01 hvprxy osips[11468]: DBG:core:send2worker: no free tcp receiver, connection passed to the least busy one (proc #1, 3 con)
> Oct 8 19:02:01 hvprxy osips[11468]: DBG:core:send2worker: to tcp worker 1 (0), 0x7fafa7cc0c80 rw 1
> Oct 8 19:02:01 hvprxy osips[11468]: DBG:core:ipc_handle_job: received job type 0[RPC] from process 1
> Oct 8 19:02:01 hvprxy osips[11468]: DBG:core:probe_max_sock_buff: getsockopt: snd is initially 425984
> Oct 8 19:02:01 hvprxy osips[11468]: DBG:core:probe_max_sock_buff: setting snd buf to 851968 had no effect
> Oct 8 19:02:01 hvprxy osips[11468]: DBG:core:probe_max_sock_buff: setting snd buf to 428032 had no effect
> Oct 8 19:02:01 hvprxy osips[11468]: DBG:core:probe_max_sock_buff: using snd buffer of 416 kb
> Oct 8 19:02:01 hvprxy osips[11468]: DBG:core:init_sock_keepalive: TCP keepalive enabled on socket 637
> Oct 8 19:02:01 hvprxy osips[11468]: DBG:core:print_ip: tcpconn_new: new tcp connection to: xxx
> Oct 8 19:02:01 hvprxy osips[11468]: DBG:core:tcpconn_new: on port 11850, proto 3
> Oct 8 19:02:01 hvprxy osips[11468]: DBG:core:tcpconn_add: hashes: 824, 321
> Oct 8 19:02:01 hvprxy osips[11468]: DBG:core:handle_new_connect: new connection: 0x7fafa7e76170 637 flags: 001c
> Oct 8 19:02:01 hvprxy osips[11468]: DBG:core:send2worker: no free tcp receiver, connection passed to the least busy one (proc #5, 3 con)
> Oct 8 19:02:01 hvprxy osips[11468]: DBG:core:send2worker: to tcp worker 5 (0), 0x7fafa7e76170 rw 1
>
> Looking for maybe other interesting things...
>
> Oct 8 19:01:55 hvprxy osips[11397]: DBG:core:tcp_conn_get: tcp connection found (0x7fafa84997e8), acquiring fd
> Oct 8 19:01:55 hvprxy osips[11397]: DBG:core:tcp_conn_get: c= 0x7fafa84997e8, n=16, Usock=555
> Oct 8 19:01:55 hvprxy osips[11397]: DBG:core:tcp_conn_get: after receive_fd: c= 0x7fafa84997e8 n=8 fd=5
> Oct 8 19:01:55 hvprxy osips[11397]: DBG:proto_tls:proto_tls_send: after write: c=0x7fafa84997e8 n=483 fd=5
> Oct 8 19:02:01 hvprxy osips[24103]: DBG:core:tcp_conn_get: tcp connection found (0x7fafa84997e8), acquiring fd
> Oct 8 19:02:01 hvprxy osips[24103]: DBG:core:tcp_conn_get: c= 0x7fafa84997e8, n=16, Usock=647
> Oct 8 19:02:13 hvprxy osips[11398]: DBG:core:tcp_conn_get: tcp connection found (0x7fafa84997e8), acquiring fd
> Oct 8 19:02:13 hvprxy osips[11398]: DBG:core:tcp_conn_get: c= 0x7fafa84997e8, n=16, Usock=557
> Oct 8 19:02:13 hvprxy osips[11398]: DBG:core:tcp_conn_get: after receive_fd: c= 0x7fafa84997e8 n=8 fd=5
> Oct 8 19:02:13 hvprxy osips[11398]: DBG:proto_tls:proto_tls_send: after write: c=0x7fafa84997e8 n=360 fd=5
> Oct 8 19:02:13 hvprxy osips[11460]: DBG:proto_tls:tcp_handle_req: Nothing more to read on TCP conn 0x7fafa84997e8, currently in state 0
> Oct 8 19:02:43 hvprxy osips[11398]: DBG:core:tcp_conn_get: tcp connection found (0x7fafa84997e8), acquiring fd
> Oct 8 19:02:43 hvprxy osips[11398]: DBG:core:tcp_conn_get: c= 0x7fafa84997e8, n=16, Usock=557
> Oct 8 19:02:43 hvprxy osips[11398]: DBG:core:tcp_conn_get: after receive_fd: c= 0x7fafa84997e8 n=8 fd=5
> Oct 8 19:02:43 hvprxy osips[11398]: DBG:proto_tls:proto_tls_send: after write: c=0x7fafa84997e8 n=360 fd=5
> Oct 8 19:02:43 hvprxy osips[11460]: DBG:proto_tls:tcp_handle_req: Nothing more to read on TCP conn 0x7fafa84997e8, currently in state 0
> Oct 8 19:02:44 hvprxy osips[11460]: DBG:proto_tls:tcp_handle_req: Nothing more to read on TCP conn 0x7fafa84997e8, currently in state 0
> Oct 8 19:02:44 hvprxy osips[24113]: DBG:core:tcp_conn_get: tcp connection found (0x7fafa84997e8), acquiring fd
> Oct 8 19:02:44 hvprxy osips[24113]: DBG:core:tcp_conn_get: c= 0x7fafa84997e8, n=16, Usock=649
> Oct 8 19:03:13 hvprxy osips[11458]: DBG:core:tcp_conn_get: tcp connection found (0x7fafa84997e8), acquiring fd
> Oct 8 19:03:13 hvprxy osips[11458]: DBG:core:tcp_conn_get: c= 0x7fafa84997e8, n=16, Usock=625
> Oct 8 19:03:13 hvprxy osips[11458]: DBG:core:tcp_conn_get: after receive_fd: c= 0x7fafa84997e8 n=8 fd=950
> Oct 8 19:03:13 hvprxy osips[11458]: DBG:proto_tls:proto_tls_send: after write: c=0x7fafa84997e8 n=360 fd=950
> Oct 8 19:03:13 hvprxy osips[11460]: DBG:proto_tls:tcp_handle_req: Nothing more to read on TCP conn 0x7fafa84997e8, currently in state 0
>
> All this said, I can't say I'm totally sure I know what I would be
> looking for. Nothing seems to stand out in the logs to me; but there
> are a few gigs of logs, so it's hard to know quite what I'm looking
> for.
>
> Andrew
>
> On Sat, 9 Oct 2021 at 00:19, Bogdan-Andrei Iancu <[email protected]> wrote:
> >
> > Hi Andrew,
> >
> > The second blocked process (doing the TLS/TCP stuff) surprisingly got
> > stuck while waiting for a TCP fd from the TCP Main process.
> >
> > You mentioned that the logs of the UDP worker (doing the TCP send)
> > suddenly stopped - around that time, do you see any errors from that
> > process or from the TCP MAIN processes?
> >
> > Regards,
> >
> > Bogdan-Andrei Iancu
> >
> > OpenSIPS Founder and Developer
> > https://www.opensips-solutions.com
> > OpenSIPS eBootcamp 2021
> > https://opensips.org/training/OpenSIPS_eBootcamp_2021/
> >
> > On 10/8/21 2:43 PM, Andrew Yager wrote:
> > > Hi Bogdan-Andrei,
> > >
> > > Have restarted since the last bt, but have recreated it again and
> > > attached. Earlier today we did also get another bt full on the second
> > > blocked pid, but I didn't save it. In that case it was a UDP reply
> > > from one of our upstream servers that had gone through mid_registrar
> > > and was being relayed to a TCP endpoint. The TCP endpoint did have an
> > > open file descriptor we could see, and it had sent and was blocked on
> > > receive at the same point (I'm getting better at reading backtraces! :D).
> > >
> > > The thing I do note is that every example I have is a UDP message
> > > received from an upstream server being relayed to a client on a
> > > TCP/TLS connection via a UDP worker.
> > >
> > > While we are using WolfSSL on this box, the other box where we have
> > > the same behaviour (but I haven't taken backtraces yet) is running
> > > OpenSSL on 3.1.3; so it's not SSL library specific.
> > >
> > > I'm going to see if I can get a backtrace from the 3.1.3 box shortly.
> > >
> > > Andrew
> > >
> > > On Fri, 8 Oct 2021 at 17:13, Bogdan-Andrei Iancu <[email protected]> wrote:
> > >> Hi Andrew,
> > >>
> > >> OK, interesting progress here. So, the FIFO process blocks as it is
> > >> trying to send an IPC JOB to a UDP process which looks like it is
> > >> also blocked.
> > >>
> > >> Could you attach with GDB to that blocked UDP process too? (you
> > >> have its PID in the printing of the pt[x] in the first gdb)
> > >>
> > >> Regards,
> > >>
> > >> Bogdan-Andrei Iancu
> > >>
> > >> OpenSIPS Founder and Developer
> > >> https://www.opensips-solutions.com
> > >> OpenSIPS eBootcamp 2021
> > >> https://opensips.org/training/OpenSIPS_eBootcamp_2021/
> > >>
> > >> On 10/8/21 1:43 AM, Andrew Yager wrote:
> > >>> Interestingly, where I usually see a range of continued messages from
> > >>> a process continually in the debug log, they appear to stop for this
> > >>> PID at 3:47am, and that process seems blocked on a tcp/tls send:
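PS for anyone retracing the debugging above: the backtraces were taken by
attaching gdb to the stuck PIDs, and the per-process logs were pulled out of
syslog by PID. Roughly, with illustrative PID, table index and log path:

    # attach to the blocked worker (this pauses only that process while attached)
    gdb -p 24103
    (gdb) bt full        # full backtrace of where the process is blocked
    (gdb) print pt[1]    # OpenSIPS process table entry; its fields include
                         # the PID of the other blocked process (index is
                         # illustrative - use the one from your backtrace)
    (gdb) detach
    (gdb) quit

    # pull a single process's lines out of a multi-gigabyte syslog
    grep 'osips\[24103\]' /var/log/syslog > proc-24103.log

Detaching rather than killing the debugger session leaves the process
running, which matters when you're poking at a live proxy.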
