This bug report concerns a Kamailio instance that's functioning as an edge
proxy, accepting requests from clients over TLS and forwarding them to internal
components over UDP.
I've observed occasional "deadlocks" that all take place after there were
connectivity issues between Kamailio and one or more of the clients (logging
messages such as "tcp_send failed", "handle_tcpconn_ev(): connect x failed",
"tcp_read_data(): error reading: Connection reset by peer").
Environment:
* Ubuntu 16.04
* Kamailio 5.2.7 (5.2.7+ubuntu16.04)
Relevant config:
```
# for TLS clients
socket_workers=10
listen=tls:wan.ip:TLS_PORT
# to be able to forward incoming TCP requests to UDP destinations
socket_workers=10
listen=udp:wan.ip:TLS_PORT
# for RPC commands
socket_workers=2
listen=tcp:127.0.0.1:HTTP_PORT
```
## Reproduction
After a lot of trial and error, I've found a way to somewhat reproducibly
trigger the problem using sipp. It involves setting up a large amount of
simultaneous calls dialing into the edge proxy using TLS, to a destination UA
that will answer the call and keep it open for a few seconds, then using
`ctrl-c` to suddenly quit sipp which will ungracefully terminate the TCP/TLS
connections.
It takes me a couple of iterations to trigger the problem, e.g repeat until
Kamailio starts failing:
* start sipp
* terminate sipp with ctrl-c
* (quickly) restart sipp
* terminate sipp
* etc
The amount of simultaneous calls has to be sufficiently large. In my
environment I'm unable to reproduce the problem at all with 10-20 calls, while
50 calls always triggers the problem after a couple of iterations.
Example sipp commandline: `sipp -nd -sf caller.xml -d 2000 -r 50 -l 500 -m 5000
-max_socket 1000 -t ln -tls_cert cert.crt -tls_key key.key
kamailio.edge.proxy:5061`
## Failure mode 1
In one of the failure modes, it appears that all available TCP/TLS workers are
permanently bricked; sipp's calls will simply timeout and not trigger any
dialplan handling.
Interestingly, even though I have dedicated workers for TLS (SIP client) and
TCP (RPC commands), and my reproduction scenario doesn't connect to the
HTTP_PORT at all, this failure scenario bricks the TCP (RPC) listeners as well;
any curl requests to it will simply timeout as well.
Kamailio will hang for a while on (systemd) restart.
## Failure mode 2
In another failure mode, it looks like the TCP workers are unaffected but the
UDP perform forwarding to UDP destinations are broken. Sipp's calls will enter
Kamailio, but forwarding (`relay()`) to the next hop will timeout even though
the destination is functioning just fine. Since this failure mode doesn't break
RPC I was able to enable debug logging:
```
2020-07-07T17:26:58.770411+02:00 kamailio[24056]: INFO: From Customer
2020-07-07T17:26:58.770697+02:00 kamailio[24056]: DEBUG: pv [pv_trans.c:590]:
tr_eval_string(): i=0 j=2
2020-07-07T17:26:58.770971+02:00 kamailio[24056]: DEBUG: dispatcher
[dispatch.c:2055]: ds_manage_routes(): set [4]
2020-07-07T17:26:58.771248+02:00 kamailio[24056]: DEBUG: dispatcher
[dispatch.c:2161]: ds_manage_routes(): using alg [0] hash [250849449]
2020-07-07T17:26:58.771535+02:00 kamailio[24056]: DEBUG: dispatcher
[dispatch.c:2205]: ds_manage_routes(): selected [0-4-0/1] <sip:wan.ip:8065>
2020-07-07T17:26:58.771817+02:00 kamailio[24056]: DEBUG: dispatcher
[dispatch.c:2226]: ds_manage_routes(): using first entry [4/1]
2020-07-07T17:26:58.772090+02:00 kamailio[24056]: DEBUG: dispatcher
[dispatch.c:2269]: ds_manage_routes(): using entry [4/0]
2020-07-07T17:26:58.772362+02:00 kamailio[24056]: DEBUG: dispatcher
[dispatch.c:2016]: ds_select_dst_limit(): selected target destinations: 2
2020-07-07T17:26:58.773001+02:00 kamailio[24056]: DEBUG: pv [pv_core.c:2611]:
pv_set_force_sock(): trying to set send-socket to [udp:wan.ip:5061]
2020-07-07T17:26:58.773353+02:00 kamailio[24056]: DEBUG: <core>
[core/socket_info.c:559]: grep_sock_info(): checking if host==us: 13==13 &&
[wan.ip] == [wan.ip]
2020-07-07T17:26:58.773829+02:00 kamailio[24056]: DEBUG: <core>
[core/socket_info.c:566]: grep_sock_info(): checking if port 5061 (advertise 0)
matches port 5061
2020-07-07T17:26:58.774122+02:00 kamailio[24056]: NOTICE: Relaying (cid:
[email protected]).
2020-07-07T17:26:58.774417+02:00 kamailio[24056]: DEBUG: siputils
[checks.c:120]: has_totag(): no totag
2020-07-07T17:26:58.774708+02:00 kamailio[24056]: DEBUG: tm [t_lookup.c:1344]:
t_newtran(): msg (0x7f69d1cc4980) id=1265/24056 global id=1265/24056 T
start=(nil)
2020-07-07T17:26:58.774989+02:00 kamailio[24056]: DEBUG: tm [t_lookup.c:499]:
t_lookup_request(): start searching: hash=43232, isACK=0
2020-07-07T17:26:58.775315+02:00 kamailio[24056]: DEBUG: tm [t_lookup.c:457]:
matching_3261(): RFC3261 transaction matching failed - via branch
[z9hG4bK-9151-1-1]
2020-07-07T17:26:58.775612+02:00 kamailio[24056]: DEBUG: tm [t_lookup.c:682]:
t_lookup_request(): no transaction found
2020-07-07T17:26:58.775908+02:00 kamailio[24056]: DEBUG: <core>
[core/md5utils.c:67]: MD5StringArray(): MD5 calculated:
0a5d4a8d13f9606523da9c4d193b2261
2020-07-07T17:26:58.776270+02:00 kamailio[24056]: DEBUG: <core>
[core/dns_cache.c:3197]: dns_srv_sip_resolve(): (wan.ip, 0, 0), ip, ret=0
2020-07-07T17:26:58.776701+02:00 kamailio[24056]: DEBUG: <core>
[core/msg_translator.c:1765]: check_boundaries(): no multi-part body
2020-07-07T17:26:58.777012+02:00 kamailio[24056]: DEBUG: <core>
[core/msg_translator.c:2935]: create_via_hf(): id added: <;i=a701>, rcv proto=3
2020-07-07T17:26:58.777596+02:00 kamailio[24056]: DEBUG: tm [t_funcs.c:375]:
t_relay_to(): new transaction forwarded
2020-07-07T17:26:58.777948+02:00 kamailio[24056]: DEBUG: <core>
[core/receive.c:354]: receive_msg(): request-route executed in: 1927 usec
2020-07-07T17:26:58.778243+02:00 kamailio[24056]: DEBUG: <core>
[core/usr_avp.c:636]: destroy_avp_list(): destroying list (nil)
2020-07-07T17:26:58.779265+02:00 kamailio[24056]: message repeated 5 times: [
DEBUG: <core> [core/usr_avp.c:636]: destroy_avp_list(): destroying list (nil)]
2020-07-07T17:26:58.779431+02:00 kamailio[24056]: DEBUG: <core>
[core/xavp.c:507]: xavp_destroy_list(): destroying xavp list (nil)
2020-07-07T17:26:58.779753+02:00 kamailio[24056]: DEBUG: <core>
[core/receive.c:458]: receive_msg(): cleaning up
2020-07-07T17:27:00.705303+02:00 kamailio[24053]: DEBUG: tm [t_reply.c:1266]:
t_should_relay_response(): ->>>>>>>>> T_code=0, new_code=408
2020-07-07T17:27:00.705917+02:00 kamailio[24053]: DEBUG: tm [t_lookup.c:1041]:
t_check_msg(): msg (0x7f69b1017e80) id=1264/24056 global id=1264/24056 T
start=0x7f69b10167c0
2020-07-07T17:27:00.706187+02:00 kamailio[24053]: DEBUG: tm [t_lookup.c:1116]:
t_check_msg(): T (0x7f69b10167c0) already found for msg (0x7f69b1017e80)!
2020-07-07T17:27:00.706533+02:00 kamailio[24053]: DEBUG: tmx [t_var.c:539]:
pv_get_tm_reply_code(): reply code is <408>
2020-07-07T17:27:00.706879+02:00 kamailio[24053]: DEBUG: tm [t_lookup.c:1041]:
t_check_msg(): msg (0x7f69b1017e80) id=1264/24056 global id=1264/24056 T
start=0x7f69b10167c0
2020-07-07T17:27:00.707198+02:00 kamailio[24053]: DEBUG: tm [t_lookup.c:1116]:
t_check_msg(): T (0x7f69b10167c0) already found for msg (0x7f69b1017e80)!
2020-07-07T17:27:00.707522+02:00 kamailio[24053]: DEBUG: tmx [t_var.c:594]:
pv_get_tm_reply_reason(): reply reason is [Request Timeout]
{
"DSC": "tcp receiver (tls:wan.ip:5061) child=0",
"IDX": 14,
"PID": 24056
},
{
"DSC": "slow timer",
"IDX": 11,
"PID": 24053
},
```
Kamailio will restart (systemd) normally and everything will function normally
again afterwards.
--
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/kamailio/kamailio/issues/2395_______________________________________________
Kamailio (SER) - Development Mailing List
[email protected]
https://lists.kamailio.org/cgi-bin/mailman/listinfo/sr-dev