We sometimes observe a 'deadly embrace' type of deadlock occurring
between mutually connected sockets on the same node. It happens when
the one-hour peer supervision timers expire simultaneously in both
sockets.
The scenario is as follows:
CPU 1:                        CPU 2:
--------                      --------
tipc_sk_timeout(sk1)          tipc_sk_timeout(sk2)
lock(sk1.slock)               lock(sk2.slock)
msg_create(probe)             msg_create(probe)
unlock(sk1.slock)             unlock(sk2.slock)
tipc_node_xmit_skb()          tipc_node_xmit_skb()
tipc_node_xmit()              tipc_node_xmit()
tipc_sk_rcv(sk2)              tipc_sk_rcv(sk1)
lock(sk2.slock)               lock(sk1.slock)
filter_rcv()                  filter_rcv()
tipc_sk_proto_rcv()           tipc_sk_proto_rcv()
msg_create(probe_rsp)         msg_create(probe_rsp)
tipc_sk_respond()             tipc_sk_respond()
tipc_node_xmit_skb()          tipc_node_xmit_skb()
tipc_node_xmit()              tipc_node_xmit()
tipc_sk_rcv(sk1)              tipc_sk_rcv(sk2)
lock(sk1.slock)               lock(sk2.slock)
===> DEADLOCK                 ===> DEADLOCK
Further analysis reveals that there are at least three locations in
the socket code where tipc_sk_respond() is called while the socket
lock is held, with an ensuing risk of similar deadlocks.
We solve this by ensuring that messages created by tipc_sk_respond()
are sent directly only if the sk_lock.owned mutex is held. Otherwise
they are queued up on a separate per-socket queue and sent after the
socket lock has been released.
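
In outline, the approach looks like this (a sketch only; the
authoritative code is the diff below). tipc_sk_respond() defers the
send to the new rspq queue whenever the sk_lock.owned mutex is not
held, and tipc_sk_rcv() flushes that queue once it has dropped
sk_lock.slock:

	/* In tipc_sk_respond(): send now only from a lock-owning
	 * (user) context; otherwise defer to the response queue.
	 */
	if (sock_owned_by_user(sk))
		tipc_node_xmit_skb(sock_net(sk), skb, dnode, selector);
	else
		skb_queue_tail(&tipc_sk(sk)->rspq, skb);

	/* In tipc_sk_rcv(), after spin_unlock_bh(&sk->sk_lock.slock):
	 * flush whatever was deferred while the lock was held. The
	 * dequeue result is tested because a concurrent flusher may
	 * have emptied the queue between the two calls.
	 */
	while (!skb_queue_empty(&tsk->rspq) &&
	       (skb = skb_dequeue(&tsk->rspq))) {
		dnode = msg_destnode(buf_msg(skb));
		tipc_node_xmit_skb(net, skb, dnode, dport);
	}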
v2: - Test on the sk_lock.owned mutex instead of sk_lock.slock in
      tipc_sk_respond(). This is safer, since sk_lock.slock may
      occasionally and briefly be held (by concurrent user contexts)
      even if we are in user context.
v3: - By lowering the socket timeout to 36 ms instead of 3,600,000 ms
      and setting up 1000 connections, I could easily reproduce the
      deadlock and verify that my solution works.
    - When killing one of the processes I sometimes got a kernel crash
      in the loop emptying the socket write queue. Realizing that
      there may be concurrent processes emptying the write queue, I
      had to add a test that the dequeue actually returned a buffer.
      This solved the problem.
    - I tried Ying's suggestion of unconditionally adding all
      CONN_MANAGER messages to the backlog queue, and it didn't work.
      This is because we often add the message to the backlog when the
      socket is *not* owned, so there is nothing triggering execution
      of backlog_rcv() within an acceptable time. Apart from that, my
      solution solves the problem at all three locations where this
      deadlock may happen, as already stated above.
v4: - Introduced a separate queue in struct tipc_sock for the purpose
      above, instead of using the socket send queue. The socket send
      queue was used for regular message sending until commit
      f214fc402967e ("tipc: Revert tipc: use existing sk_write_queue
      for outgoing packet chain"), i.e., as recently as kernel 4.5, so
      using that queue would break older kernel versions.
    - Made a small cosmetic improvement to the dequeuing loop.
Reported-by: GUNA <[email protected]>
Signed-off-by: Jon Maloy <[email protected]>
---
net/tipc/socket.c | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 88bfcd7..5ed6d5c 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -90,6 +90,7 @@ struct tipc_sock {
 	struct tipc_msg phdr;
 	struct list_head sock_list;
 	struct list_head publications;
+	struct sk_buff_head rspq;
 	u32 pub_count;
 	u32 probing_state;
 	unsigned long probing_intv;
@@ -278,7 +279,11 @@ static void tipc_sk_respond(struct sock *sk, struct sk_buff *skb, int err)
 
 	dnode = msg_destnode(buf_msg(skb));
 	selector = msg_origport(buf_msg(skb));
-	tipc_node_xmit_skb(sock_net(sk), skb, dnode, selector);
+
+	if (sock_owned_by_user(sk))
+		tipc_node_xmit_skb(sock_net(sk), skb, dnode, selector);
+	else
+		skb_queue_tail(&tipc_sk(sk)->rspq, skb);
 }
 
 /**
@@ -379,6 +384,7 @@ static int tipc_sk_create(struct net *net, struct socket *sock,
 	tsk = tipc_sk(sk);
 	tsk->max_pkt = MAX_PKT_DEFAULT;
 	INIT_LIST_HEAD(&tsk->publications);
+	skb_queue_head_init(&tsk->rspq);
 	msg = &tsk->phdr;
 	tn = net_generic(sock_net(sk), tipc_net_id);
 	tipc_msg_init(tn->own_addr, msg, TIPC_LOW_IMPORTANCE, TIPC_NAMED_MSG,
@@ -1830,6 +1836,12 @@ void tipc_sk_rcv(struct net *net, struct sk_buff_head *inputq)
 				tipc_sk_enqueue(inputq, sk, dport);
 				spin_unlock_bh(&sk->sk_lock.slock);
 			}
+			/* Send pending response/rejected messages, if any */
+			while (!skb_queue_empty(&tsk->rspq) &&
+			       (skb = skb_dequeue(&tsk->rspq))) {
+				dnode = msg_destnode(buf_msg(skb));
+				tipc_node_xmit_skb(net, skb, dnode, dport);
+			}
 			sock_put(sk);
 			continue;
 		}
--
1.9.1