Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue
Rainer Weikusat writes:
> David Miller writes:

[...]

> I'm sorry for this 13th hour request/ suggestion but while thinking
> about a reply to Dmitry, it occurred to me that the restart_locked/
> sk_locked logic could be avoided by moving the test for this condition
> in front of all the others while leaving the 'act on it' code at its
> back, ie, reorganize unix_dgram_sendmsg such that it looks like this:

[...]

Just in case this is unclear on its own: If this was considered an
improvement by someone other than me, I could supply either a "complete"
patch with this re-arrangement or a cleanup delta patch changing the
previous change.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue
On 11/20/2015 05:07 PM, Rainer Weikusat wrote:
> Rainer Weikusat writes:
> An AF_UNIX datagram socket being the client in an n:1 association with
> some server socket is only allowed to send messages to the server if the
> receive queue of this socket contains at most sk_max_ack_backlog
> datagrams. This implies that prospective writers might be forced to go
> to sleep despite none of the message presently enqueued on the server
> receive queue were sent by them. In order to ensure that these will be
> woken up once space becomes again available, the present unix_dgram_poll
> routine does a second sock_poll_wait call with the peer_wait wait queue
> of the server socket as queue argument (unix_dgram_recvmsg does a wake
> up on this queue after a datagram was received). This is inherently
> problematic because the server socket is only guaranteed to remain alive
> for as long as the client still holds a reference to it. In case the
> connection is dissolved via connect or by the dead peer detection logic
> in unix_dgram_sendmsg, the server socket may be freed despite "the
> polling mechanism" (in particular, epoll) still has a pointer to the
> corresponding peer_wait queue. There's no way to forcibly deregister a
> wait queue with epoll.
>
> Based on an idea by Jason Baron, the patch below changes the code such
> that a wait_queue_t belonging to the client socket is enqueued on the
> peer_wait queue of the server whenever the peer receive queue full
> condition is detected by either a sendmsg or a poll. A wake up on the
> peer queue is then relayed to the ordinary wait queue of the client
> socket via wake function. The connection to the peer wait queue is again
> dissolved if either a wake up is about to be relayed or the client
> socket reconnects or a dead peer is detected or the client socket is
> itself closed. This enables removing the second sock_poll_wait from
> unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
> that no blocked writer sleeps forever.
>
> Signed-off-by: Rainer Weikusat
> Fixes: ec0d215f9420 ("af_unix: fix 'poll for write'/connected DGRAM sockets")

Looks good to me.

Reviewed-by: Jason Baron

Thanks,

-Jason
Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue
From: Rainer Weikusat
Date: Fri, 20 Nov 2015 22:07:23 +0000

> Rainer Weikusat writes:
> An AF_UNIX datagram socket being the client in an n:1 association with
> some server socket is only allowed to send messages to the server if the
> receive queue of this socket contains at most sk_max_ack_backlog
> datagrams. This implies that prospective writers might be forced to go
> to sleep despite none of the message presently enqueued on the server
> receive queue were sent by them. In order to ensure that these will be
> woken up once space becomes again available, the present unix_dgram_poll
> routine does a second sock_poll_wait call with the peer_wait wait queue
> of the server socket as queue argument (unix_dgram_recvmsg does a wake
> up on this queue after a datagram was received). This is inherently
> problematic because the server socket is only guaranteed to remain alive
> for as long as the client still holds a reference to it. In case the
> connection is dissolved via connect or by the dead peer detection logic
> in unix_dgram_sendmsg, the server socket may be freed despite "the
> polling mechanism" (in particular, epoll) still has a pointer to the
> corresponding peer_wait queue. There's no way to forcibly deregister a
> wait queue with epoll.
>
> Based on an idea by Jason Baron, the patch below changes the code such
> that a wait_queue_t belonging to the client socket is enqueued on the
> peer_wait queue of the server whenever the peer receive queue full
> condition is detected by either a sendmsg or a poll. A wake up on the
> peer queue is then relayed to the ordinary wait queue of the client
> socket via wake function. The connection to the peer wait queue is again
> dissolved if either a wake up is about to be relayed or the client
> socket reconnects or a dead peer is detected or the client socket is
> itself closed. This enables removing the second sock_poll_wait from
> unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
> that no blocked writer sleeps forever.
>
> Signed-off-by: Rainer Weikusat
> Fixes: ec0d215f9420 ("af_unix: fix 'poll for write'/connected DGRAM sockets")

Applied and queued up for -stable, thanks to you and Jason for all of
your hard work on this fix.
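For readers following the thread, the relay described in the commit message quoted above can be modelled in user space. This is only a sketch under stated assumptions: every name in it (struct waiter, wq_wake_all, client_relay, client_connect_wake and so on) is hypothetical and merely mirrors the roles of wait_queue_t, the peer_wait queue and the wake function. Locking, which most of this thread is about, is left out entirely.

```c
#include <stddef.h>

/* Hypothetical userspace model, not kernel code. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct waiter;
typedef void (*wake_fn)(struct waiter *w);

struct waiter {			/* plays the role of wait_queue_t */
	struct waiter *next;
	wake_fn func;
};

struct wait_queue {		/* plays the role of the server's peer_wait */
	struct waiter *head;
};

struct client {
	struct waiter peer_wake;	/* entry enqueued on the server queue */
	struct wait_queue *peer_wait;	/* queue it sits on, NULL if none */
	unsigned woken;			/* stands in for waking the client's own queue */
};

static void wq_add(struct wait_queue *q, struct waiter *w)
{
	w->next = q->head;
	q->head = w;
}

static void wq_remove(struct wait_queue *q, struct waiter *w)
{
	struct waiter **p;

	for (p = &q->head; *p; p = &(*p)->next)
		if (*p == w) {
			*p = w->next;
			w->next = NULL;
			return;
		}
}

/* Server side: run every wake function; callbacks may unlink themselves. */
static void wq_wake_all(struct wait_queue *q)
{
	struct waiter *w = q->head, *next;

	while (w) {
		next = w->next;
		w->func(w);
		w = next;
	}
}

/* Wake function: dissolve the connection, then relay to the client. */
static void client_relay(struct waiter *w)
{
	struct client *c = container_of(w, struct client, peer_wake);

	wq_remove(c->peer_wait, w);
	c->peer_wait = NULL;
	c->woken++;
}

/* Called when the "peer receive queue full" condition is seen. */
static int client_connect_wake(struct client *c, struct wait_queue *server_q)
{
	if (c->peer_wait)
		return 0;	/* already enqueued */

	c->peer_wake.func = client_relay;
	c->peer_wait = server_q;
	wq_add(server_q, &c->peer_wake);
	return 1;
}

/* Called on reconnect, dead peer detection or close of the client. */
static void client_disconnect_wake(struct client *c)
{
	if (c->peer_wait) {
		wq_remove(c->peer_wait, &c->peer_wake);
		c->peer_wait = NULL;
	}
}
```

The point of the design is visible in client_disconnect_wake: the client can always remove its own entry before it lets go of the server, so nothing outside the two sockets keeps a pointer to the server's queue.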
Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue
David Miller writes:
> From: Rainer Weikusat
>> Rainer Weikusat writes:
>> An AF_UNIX datagram socket being the client in an n:1 association

[...]

> Applied and queued up for -stable,

I'm sorry for this 13th hour request/ suggestion but while thinking
about a reply to Dmitry, it occurred to me that the restart_locked/
sk_locked logic could be avoided by moving the test for this condition
in front of all the others while leaving the 'act on it' code at its
back, ie, reorganize unix_dgram_sendmsg such that it looks like this:

	unix_state_lock(other);

	if (unix_peer(other) != sk && unix_recvq_full(other)) {
		need_wait = 1;

		if (!timeo) {
			unix_state_unlock(other);
			unix_state_double_lock(sk, other);

			if (unix_peer(other) == sk ||
			    (unix_peer(sk) == other &&
			     !unix_dgram_peer_wake_me(sk, other)))
				need_wait = 0;

			unix_state_unlock(sk);
		}
	}

	/* original code here */

	if (need_wait) {
		if (timeo) {
			timeo = unix_wait_for_peer(other, timeo);

			err = sock_intr_errno(timeo);
			if (signal_pending(current))
				goto out_free;

			goto restart;
		}

		err = -EAGAIN;
		goto out_unlock;
	}

	/* original tail here */

This might cause a socket to be enqueued to the peer even though it's
not allowed to send to it but I don't think this matters much.

This is a less conservative modification but one which results in
simpler code overall. The kernel I'm currently running has been modified
like this and 'survived' the usual tests.
alternate queueing mechanism (was: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue)
Rainer Weikusat writes:

[AF_UNIX SOCK_DGRAM throughput]

> It may be possible to improve this by tuning/ changing the flow
> control mechanism. Out of my head, I'd suggest making the queue longer
> (the default value is 10) and delaying wake ups until the server
> actually did catch up, IOW, the receive queue is empty or almost
> empty. But this ought to be done with a different patch.

Because I was curious about the effects, I implemented this using a
design slightly modified from the one I originally suggested, to account
for the different uses of the 'is the receive queue full' check. The
code uses a datagram-specific checking function,

static int unix_dgram_recvq_full(struct sock const *sk)
{
	struct unix_sock *u;

	u = unix_sk(sk);
	if (test_bit(UNIX_DG_FULL, &u->flags))
		return 1;

	if (!unix_recvq_full(sk))
		return 0;

	__set_bit(UNIX_DG_FULL, &u->flags);
	return 1;
}

which gets called instead of the other for the n:1 datagram checks and a

	if (test_bit(UNIX_DG_FULL, &u->flags) &&
	    !skb_queue_len(&sk->sk_receive_queue)) {
		__clear_bit(UNIX_DG_FULL, &u->flags);
		wake_up_interruptible_sync_poll(&u->peer_wait,
						POLLOUT | POLLWRNORM |
						POLLWRBAND);
	}

in unix_dgram_recvmsg to delay wakeups until the queued datagrams have
been consumed if the queue overflowed before. This has the additional,
nice side effect that wakeups won't ever be done for 1:1 connected
datagram sockets (both SOCK_DGRAM and SOCK_SEQPACKET) where they're of
no use, anyway.

Compared to a 'stock' 4.3 running the test program I posted (supposed to
make the overhead noticeable by sending lots of small messages), the
average number of bytes sent per second increased by about 782,961.79
(ca 764.61K), about 5.32% of the 4.3 number (14,714,579.91), with a
fairly simple code change.
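The delayed wake-up scheme described above amounts to a simple hysteresis: once the queue has overflowed, writers stay blocked until the receiver has drained it completely. A minimal userspace sketch of that behaviour follows. All names in it (struct dg_queue, dg_send, dg_recv) are hypothetical stand-ins, the counter merely records where the kernel code would call wake_up_interruptible_sync_poll, and the "more than max" test is assumed to match unix_recvq_full.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model, not kernel code. */
struct dg_queue {
	size_t len;		/* datagrams currently queued */
	size_t max;		/* plays the role of sk_max_ack_backlog */
	bool full;		/* plays the role of the UNIX_DG_FULL bit */
	unsigned wakeups;	/* number of wake ups that would be issued */
};

/* Model of unix_dgram_recvq_full: sticky once the limit was exceeded. */
static bool dg_recvq_full(struct dg_queue *q)
{
	if (q->full)
		return true;

	if (q->len <= q->max)
		return false;

	q->full = true;
	return true;
}

/* Sender side: enqueue unless flow control reports a full queue. */
static bool dg_send(struct dg_queue *q)
{
	if (dg_recvq_full(q))
		return false;	/* a real sender would sleep or get EAGAIN */

	q->len++;
	return true;
}

/* Receiver side: dequeue one datagram, wake writers only when drained. */
static bool dg_recv(struct dg_queue *q)
{
	if (q->len == 0)
		return false;

	q->len--;
	if (q->full && q->len == 0) {
		q->full = false;
		q->wakeups++;
	}

	return true;
}
```

With max set to 2, three sends succeed, the fourth marks the queue as full, and no further send succeeds until three receives have emptied the queue, at which point exactly one wake up is counted.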
Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue (w/ Fixes:)
Jason Baron writes:
> On 11/19/2015 06:52 PM, Rainer Weikusat wrote:
>
> [...]
>
>> @@ -1590,21 +1718,35 @@ restart:
>>  		goto out_unlock;
>>  	}
>>
>> -	if (unix_peer(other) != sk && unix_recvq_full(other)) {
>> -		if (!timeo) {
>> +	if (unlikely(unix_peer(other) != sk && unix_recvq_full(other))) {
>> +		if (timeo) {
>> +			timeo = unix_wait_for_peer(other, timeo);
>> +
>> +			err = sock_intr_errno(timeo);
>> +			if (signal_pending(current))
>> +				goto out_free;
>> +
>> +			goto restart;
>> +		}
>> +
>> +		if (unix_peer(sk) != other ||
>> +		    unix_dgram_peer_wake_me(sk, other)) {
>>  			err = -EAGAIN;
>>  			goto out_unlock;
>>  		}
>
> Hi,
>
> So here we are calling unix_dgram_peer_wake_me() without the sk lock the
> first time through - right?

Yes. And this is obviously wrong. I spent most of the 'evening time'
(some people would call that 'night time') with testing this and didn't
get to read through it again yet. Thank you for pointing this out. I'll
send an updated patch shortly.
Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue (w/ Fixes:)
On 11/19/2015 06:52 PM, Rainer Weikusat wrote:

[...]

> @@ -1590,21 +1718,35 @@ restart:
>  		goto out_unlock;
>  	}
>
> -	if (unix_peer(other) != sk && unix_recvq_full(other)) {
> -		if (!timeo) {
> +	if (unlikely(unix_peer(other) != sk && unix_recvq_full(other))) {
> +		if (timeo) {
> +			timeo = unix_wait_for_peer(other, timeo);
> +
> +			err = sock_intr_errno(timeo);
> +			if (signal_pending(current))
> +				goto out_free;
> +
> +			goto restart;
> +		}
> +
> +		if (unix_peer(sk) != other ||
> +		    unix_dgram_peer_wake_me(sk, other)) {
>  			err = -EAGAIN;
>  			goto out_unlock;
>  		}

Hi,

So here we are calling unix_dgram_peer_wake_me() without the sk lock the
first time through - right? In that case, we can end up registering on
the queue of other for the callback but we might have already connected
to a different remote. In that case, the wakeup will crash if 'sk' has
been freed in the meantime.

Thanks,

-Jason
Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue
Rainer Weikusat writes:

An AF_UNIX datagram socket being the client in an n:1 association with
some server socket is only allowed to send messages to the server if the
receive queue of this socket contains at most sk_max_ack_backlog
datagrams. This implies that prospective writers might be forced to go
to sleep despite none of the message presently enqueued on the server
receive queue were sent by them. In order to ensure that these will be
woken up once space becomes again available, the present unix_dgram_poll
routine does a second sock_poll_wait call with the peer_wait wait queue
of the server socket as queue argument (unix_dgram_recvmsg does a wake
up on this queue after a datagram was received). This is inherently
problematic because the server socket is only guaranteed to remain alive
for as long as the client still holds a reference to it. In case the
connection is dissolved via connect or by the dead peer detection logic
in unix_dgram_sendmsg, the server socket may be freed despite "the
polling mechanism" (in particular, epoll) still has a pointer to the
corresponding peer_wait queue. There's no way to forcibly deregister a
wait queue with epoll.

Based on an idea by Jason Baron, the patch below changes the code such
that a wait_queue_t belonging to the client socket is enqueued on the
peer_wait queue of the server whenever the peer receive queue full
condition is detected by either a sendmsg or a poll. A wake up on the
peer queue is then relayed to the ordinary wait queue of the client
socket via wake function. The connection to the peer wait queue is again
dissolved if either a wake up is about to be relayed or the client
socket reconnects or a dead peer is detected or the client socket is
itself closed. This enables removing the second sock_poll_wait from
unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
that no blocked writer sleeps forever.

Signed-off-by: Rainer Weikusat
Fixes: ec0d215f9420 ("af_unix: fix 'poll for write'/connected DGRAM sockets")
---

- uninvert the lock/ check code in _dgram_sendmsg

- introduce a unix_dgram_peer_wake_disconnect_wakeup helper function
  as there were two calls with a wakeup immediately following and two
  without

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index b36d837..2a91a05 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -62,6 +62,7 @@ struct unix_sock {
 #define UNIX_GC_CANDIDATE	0
 #define UNIX_GC_MAYBE_CYCLE	1
 	struct socket_wq	peer_wq;
+	wait_queue_t		peer_wake;
 };
 
 static inline struct unix_sock *unix_sk(const struct sock *sk)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 94f6582..3d93b0d 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -326,6 +326,118 @@ found:
 	return s;
 }
 
+/* Support code for asymmetrically connected dgram sockets
+ *
+ * If a datagram socket is connected to a socket not itself connected
+ * to the first socket (eg, /dev/log), clients may only enqueue more
+ * messages if the present receive queue of the server socket is not
+ * "too large". This means there's a second writeability condition
+ * poll and sendmsg need to test. The dgram recv code will do a wake
+ * up on the peer_wait wait queue of a socket upon reception of a
+ * datagram which needs to be propagated to sleeping would-be writers
+ * since these might not have sent anything so far. This can't be
+ * accomplished via poll_wait because the lifetime of the server
+ * socket might be less than that of its clients if these break their
+ * association with it or if the server socket is closed while clients
+ * are still connected to it and there's no way to inform "a polling
+ * implementation" that it should let go of a certain wait queue
+ *
+ * In order to propagate a wake up, a wait_queue_t of the client
+ * socket is enqueued on the peer_wait queue of the server socket
+ * whose wake function does a wake_up on the ordinary client socket
+ * wait queue. This connection is established whenever a write (or
+ * poll for write) hit the flow control condition and broken when the
+ * association to the server socket is dissolved or after a wake up
+ * was relayed.
+ */
+
+static int unix_dgram_peer_wake_relay(wait_queue_t *q, unsigned mode, int flags,
+				      void *key)
+{
+	struct unix_sock *u;
+	wait_queue_head_t *u_sleep;
+
+	u = container_of(q, struct unix_sock, peer_wake);
+
+	__remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait,
+			    q);
+	u->peer_wake.private = NULL;
+
+	/* relaying can only happen while the wq still exists */
+	u_sleep = sk_sleep(&u->sk);
+	if (u_sleep)
+		wake_up_interruptible_poll(u_sleep, key);
+
+	return 0;
+}
+
+static int unix_dgram_peer_wake_connect(struct sock *sk, struct sock *other)
Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue (w/ Fixes:)
Rainer Weikusat writes:
> Rainer Weikusat writes:
>
> [...]
>
>> The basic options would be
>>
>> 	- return EAGAIN even if sending became possible (Jason's most
>> 	  recent suggestions)
>>
>> 	- retry sending a limited number of times, eg, once, before
>> 	  returning EAGAIN, on the grounds that this is nicer to the
>> 	  application and that redoing all the stuff up to the _lock in
>> 	  dgram_sendmsg can possibly/ likely be avoided
>
> A third option:

A fourth and even one that's reasonably simple to implement: In case
other became ready during the checks, drop other lock, do a double-lock
sk, other, set a flag variable indicating this and restart the procedure
after the unix_state_lock_other[*], using the value of the flag to lock/
unlock sk as needed. Should other still be ready to receive data,
execution can then continue with the 'queue it' code as the other lock
was held all the time this time.

Combined with a few unlikely annotations in place where they're IMHO
appropriate, this is speed-wise comparable to the stock kernel.
Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue (w/ Fixes:)
An AF_UNIX datagram socket being the client in an n:1 association with some server socket is only allowed to send messages to the server if the receive queue of this socket contains at most sk_max_ack_backlog datagrams. This implies that prospective writers might be forced to go to sleep despite none of the message presently enqueued on the server receive queue were sent by them. In order to ensure that these will be woken up once space becomes again available, the present unix_dgram_poll routine does a second sock_poll_wait call with the peer_wait wait queue of the server socket as queue argument (unix_dgram_recvmsg does a wake up on this queue after a datagram was received). This is inherently problematic because the server socket is only guaranteed to remain alive for as long as the client still holds a reference to it. In case the connection is dissolved via connect or by the dead peer detection logic in unix_dgram_sendmsg, the server socket may be freed despite "the polling mechanism" (in particular, epoll) still has a pointer to the corresponding peer_wait queue. There's no way to forcibly deregister a wait queue with epoll. Based on an idea by Jason Baron, the patch below changes the code such that a wait_queue_t belonging to the client socket is enqueued on the peer_wait queue of the server whenever the peer receive queue full condition is detected by either a sendmsg or a poll. A wake up on the peer queue is then relayed to the ordinary wait queue of the client socket via wake function. The connection to the peer wait queue is again dissolved if either a wake up is about to be relayed or the client socket reconnects or a dead peer is detected or the client socket is itself closed. This enables removing the second sock_poll_wait from unix_dgram_poll, thus avoiding the use-after-free, while still ensuring that no blocked writer sleeps forever. 
Signed-off-by: Rainer Weikusat
Fixes: ec0d215f9420 ("af_unix: fix 'poll for write'/connected DGRAM sockets")
---

This has been created around midnight at the end of a work day. I've
read through it a couple of times and found no errors. But I'm not
"ideally awake and attentive" right now.

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index b36d837..2a91a05 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -62,6 +62,7 @@ struct unix_sock {
 #define UNIX_GC_CANDIDATE	0
 #define UNIX_GC_MAYBE_CYCLE	1
 	struct socket_wq	peer_wq;
+	wait_queue_t		peer_wake;
 };
 
 static inline struct unix_sock *unix_sk(const struct sock *sk)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 94f6582..6962ff1 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -326,6 +326,112 @@ found:
 	return s;
 }
 
+/* Support code for asymmetrically connected dgram sockets
+ *
+ * If a datagram socket is connected to a socket not itself connected
+ * to the first socket (eg, /dev/log), clients may only enqueue more
+ * messages if the present receive queue of the server socket is not
+ * "too large". This means there's a second writeability condition
+ * poll and sendmsg need to test. The dgram recv code will do a wake
+ * up on the peer_wait wait queue of a socket upon reception of a
+ * datagram which needs to be propagated to sleeping would-be writers
+ * since these might not have sent anything so far. This can't be
+ * accomplished via poll_wait because the lifetime of the server
+ * socket might be less than that of its clients if these break their
+ * association with it or if the server socket is closed while clients
+ * are still connected to it and there's no way to inform "a polling
+ * implementation" that it should let go of a certain wait queue
+ *
+ * In order to propagate a wake up, a wait_queue_t of the client
+ * socket is enqueued on the peer_wait queue of the server socket
+ * whose wake function does a wake_up on the ordinary client socket
+ * wait queue. This connection is established whenever a write (or
+ * poll for write) hit the flow control condition and broken when the
+ * association to the server socket is dissolved or after a wake up
+ * was relayed.
+ */
+
+static int unix_dgram_peer_wake_relay(wait_queue_t *q, unsigned mode, int flags,
+				      void *key)
+{
+	struct unix_sock *u;
+	wait_queue_head_t *u_sleep;
+
+	u = container_of(q, struct unix_sock, peer_wake);
+
+	__remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait,
+			    q);
+	u->peer_wake.private = NULL;
+
+	/* relaying can only happen while the wq still exists */
+	u_sleep = sk_sleep(&u->sk);
+	if (u_sleep)
+		wake_up_interruptible_poll(u_sleep, key);
+
+	return 0;
+}
+
+static int unix_dgram_peer_wake_connect(struct sock *sk, struct sock *other)
+{
+	struct unix_sock *u, *u_other;
+	int rc;
+
+	u = unix_sk(sk);
+	u_other =
Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue (w/ Fixes:)
David Miller writes:
> From: Rainer Weikusat
> Date: Mon, 16 Nov 2015 22:28:40 +0000
>
>> An AF_UNIX datagram socket being the client in an n:1

[...]

> So because of a corner case of epoll handling and sender socket release,
> every single datagram sendmsg has to do a double lock now?
>
> I do not dispute the correctness of your fix at this point, but that
> added cost in the fast path is really too high.

Some more information on this: Running the test program included below
on my 'work' system (otherwise idle, after logging in via VT with no GUI
running)/ quadcore AMD A10-5700, 3393.984 for 20 times/ patched 4.3
resulted in the following throughput statistics[*]:

avg		13.617 M/s
median		13.393 M/s
max		17.14 M/s
min		13.047 M/s
deviation	0.85

I'll try to post the results for 'unpatched' later as I'm also working
on a couple of other things.

[*] I do not use my fingers for counting, hence, these are binary and
not decimal units.

#include <stdint.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum {
	MSG_SZ = 16,
	MSGS = 100
};

static char msg[MSG_SZ];

/* convert a struct timeval to microseconds */
static uint64_t tv2u(struct timeval *tv)
{
	uint64_t u;

	u = tv->tv_sec;
	u *= 1000000;
	return u + tv->tv_usec;
}

int main(void)
{
	struct timeval start, stop;
	uint64_t t_diff;
	double rate;
	int sks[2];
	unsigned remain;
	char buf[MSG_SZ];

	socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sks);

	if (fork() == 0) {
		/* child: read until the writer closes its end */
		close(*sks);

		gettimeofday(&start, 0);
		while (read(sks[1], buf, sizeof(buf)) > 0);
		gettimeofday(&stop, 0);

		t_diff = tv2u(&stop);
		t_diff -= tv2u(&start);

		/* bytes per microsecond, scaled to bytes per second */
		rate = MSG_SZ * MSGS;
		rate /= t_diff;
		rate *= 1000000;
		printf("rate %fM/s\n", rate / (1 << 20));

		fflush(stdout);
		_exit(0);
	}

	/* parent: send MSGS small messages */
	close(sks[1]);

	remain = MSGS;
	do write(*sks, msg, sizeof(msg)); while (--remain);
	close(*sks);

	wait(NULL);
	return 0;
}
more statistics (was: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue (w/ Fixes:))
Rainer Weikusat writes:

[...]

> Some more information on this: Running the test program included below
> on my 'work' system (otherwise idle, after logging in via VT with no GUI
> running)/ quadcore AMD A10-5700, 3393.984 for 20 times/ patched 4.3
> resulted in the following throughput statistics[*]:

Since the results were too variable with only 20 runs, I've also tested
this with 100 for three kernels, stock 4.3, 4.3 plus the published
patch, 4.3 plus the published patch plus the "just return EAGAIN"
modification. The 1st and the 3rd perform about identically for the test
program I used (slightly modified version included below), the 2nd is
markedly slower. This is most easily visible when grouping the printed
data rates (B/s) 'by millions':

stock 4.3
---------
1300.000-1399.000	3 (3%)
1400.000-1499.000	82 (82%)
1500.000-1599.000	15 (15%)

4.3 + patch
-----------
1300.000-1399.000	54 (54%)
1400.000-1499.000	35 (35%)
1500.000-1599.000	7 (7%)
1600.000-1699.000	1 (1%)
1800.000-1899.000	1 (1%)
2200.000-2299.000	2 (2%)

4.3 + modified patch
--------------------
1300.000-1399.000	3 (3%)
1400.000-1499.000	82 (82%)
1500.000-1599.000	14 (14%)
2400.000-2499.000	1 (1%)

IMHO, the 3rd option would be the way to go if this was considered an
acceptable option (ie, despite it returns spurious errors in 'rare
cases').

modified test program
=====================

#include <stdint.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum {
	MSG_SZ = 16,
	MSGS = 100
};

static char msg[MSG_SZ];

/* convert a struct timeval to microseconds */
static uint64_t tv2u(struct timeval *tv)
{
	uint64_t u;

	u = tv->tv_sec;
	u *= 1000000;
	return u + tv->tv_usec;
}

int main(void)
{
	struct timeval start, stop;
	uint64_t t_diff;
	double rate;
	int sks[2];
	unsigned remain;
	char buf[MSG_SZ];

	socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sks);

	if (fork() == 0) {
		/* child: read until the writer closes its end */
		close(*sks);

		gettimeofday(&start, 0);
		while (read(sks[1], buf, sizeof(buf)) > 0);
		gettimeofday(&stop, 0);

		t_diff = tv2u(&stop);
		t_diff -= tv2u(&start);

		/* bytes per microsecond, scaled to bytes per second */
		rate = MSG_SZ * MSGS;
		rate /= t_diff;
		rate *= 1000000;
		printf("%f\n", rate);

		fflush(stdout);
		_exit(0);
	}

	/* parent: send MSGS small messages */
	close(sks[1]);

	remain = MSGS;
	do write(*sks, msg, sizeof(msg)); while (--remain);
	close(*sks);

	wait(NULL);
	return 0;
}
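The grouping 'by millions' used for the tables in this message can be sketched as follows. This is only an illustration, not the tool actually used for the numbers above: the function name bucket_by_millions and the NBUCKETS bound are hypothetical, and the input is assumed to be the plain bytes-per-second values printed by the test program.

```c
#include <stddef.h>

/* Hypothetical bound: rates of 64 million B/s and above are ignored. */
enum { NBUCKETS = 64 };

/*
 * Count each rate (bytes per second) in the bucket given by its number
 * of whole millions, so 14714579.91 ends up in counts[14].
 */
static void bucket_by_millions(const double *rates, size_t n,
			       unsigned counts[NBUCKETS])
{
	size_t i, b;

	for (b = 0; b < NBUCKETS; b++)
		counts[b] = 0;

	for (i = 0; i < n; i++) {
		/* skip negative, too large and NaN values */
		if (!(rates[i] >= 0 && rates[i] < NBUCKETS * 1e6))
			continue;

		b = (size_t)(rates[i] / 1e6);
		counts[b]++;
	}
}
```

Dividing each count by the number of runs then gives the percentages; with 100 runs per kernel, count and percentage coincide, which is why both columns in the tables carry the same numbers.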
Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue (w/ Fixes:)
On 11/16/2015 05:28 PM, Rainer Weikusat wrote:
> An AF_UNIX datagram socket being the client in an n:1 association with
> some server socket is only allowed to send messages to the server if the
> receive queue of this socket contains at most sk_max_ack_backlog
> datagrams. This implies that prospective writers might be forced to go
> to sleep despite none of the message presently enqueued on the server
> receive queue were sent by them. In order to ensure that these will be
> woken up once space becomes again available, the present unix_dgram_poll
> routine does a second sock_poll_wait call with the peer_wait wait queue
> of the server socket as queue argument (unix_dgram_recvmsg does a wake
> up on this queue after a datagram was received). This is inherently
> problematic because the server socket is only guaranteed to remain alive
> for as long as the client still holds a reference to it. In case the
> connection is dissolved via connect or by the dead peer detection logic
> in unix_dgram_sendmsg, the server socket may be freed despite "the
> polling mechanism" (in particular, epoll) still has a pointer to the
> corresponding peer_wait queue. There's no way to forcibly deregister a
> wait queue with epoll.
>
> Based on an idea by Jason Baron, the patch below changes the code such
> that a wait_queue_t belonging to the client socket is enqueued on the
> peer_wait queue of the server whenever the peer receive queue full
> condition is detected by either a sendmsg or a poll. A wake up on the
> peer queue is then relayed to the ordinary wait queue of the client
> socket via wake function. The connection to the peer wait queue is again
> dissolved if either a wake up is about to be relayed or the client
> socket reconnects or a dead peer is detected or the client socket is
> itself closed. This enables removing the second sock_poll_wait from
> unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
> that no blocked writer sleeps forever.
>
> Signed-off-by: Rainer Weikusat
> Fixes: ec0d215f9420 ("af_unix: fix 'poll for write'/connected DGRAM sockets")
> ---
>
> Additional remark about "5456f09aaf88/ af_unix: fix unix_dgram_poll()
> behavior for EPOLLOUT event": This shouldn't be an issue anymore with
> this change despite it restores the "only when writable" behaviour as
> the wake up relay will also be set up once _dgram_sendmsg returned
> EAGAIN for a send attempt on a n:1 connected socket.
>
>

Hi,

My only comment was about potentially avoiding the double lock in the
write path, otherwise this looks ok to me.

Thanks,

-Jason
Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue
On 11/15/2015 01:32 PM, Rainer Weikusat wrote:
>
> That was my original idea. The problem with this is that the code
> starting after the _lock and running until the main code path unlock has
> to be executed in one go with the other lock held as the results of the
> tests above this one may become invalid as soon as the other lock is
> released. This means instead of continuing execution with the send code
> proper after the block in case other became receive-ready between the
> first and the second test (possible because _dgram_recvmsg does not
> take the unix state lock), the whole procedure starting with acquiring
> the other lock would need to be restarted. Given sufficiently unfavorable
> circumstances, this could even turn into an endless loop which couldn't
> be interrupted. (unless code for this was added).
>

hmmm - I think we can avoid it by doing the wakeup from the write path
in the rare case that the queue has emptied - and avoid the double lock.
IE:

	unix_state_unlock(other);
	unix_state_lock(sk);
	err = -EAGAIN;
	if (unix_peer(sk) == other) {
		unix_dgram_peer_wake_connect(sk, other);
		if (skb_queue_len(&other->sk_receive_queue) == 0)
			need_wakeup = true;
	}
	unix_state_unlock(sk);
	if (need_wakeup)
		wake_up_interruptible_poll(sk_sleep(sk),
					   POLLOUT | POLLWRNORM | POLLWRBAND);
	goto out_free;

Thanks,

-Jason
Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue
Jason Baronwrites: > On 11/15/2015 01:32 PM, Rainer Weikusat wrote: > >> >> That was my original idea. The problem with this is that the code >> starting after the _lock and running until the main code path unlock has >> to be executed in one go with the other lock held as the results of the >> tests above this one may become invalid as soon as the other lock is >> released. This means instead of continuing execution with the send code >> proper after the block in case other became receive-ready between the >> first and the second test (possible because _dgram_recvmsg does not >> take the unix state lock), the whole procedure starting with acquiring >> the other lock would need to be restarted. Given sufficiently unfavorable >> circumstances, this could even turn into an endless loop which couldn't >> be interrupted. (unless code for this was added). >> > > hmmm - I think we can avoid it by doing the wakeup from the write path > in the rare case that the queue has emptied - and avoid the double lock. IE: > > unix_state_unlock(other); > unix_state_lock(sk); > err = -EAGAIN; > if (unix_peer(sk) == other) { >unix_dgram_peer_wake_connect(sk, other); >if (skb_queue_len(>sk_receive_queue) == 0) >need_wakeup = true; > } > unix_state_unlock(sk); > if (need_wakeup) > wake_up_interruptible_poll(sk_sleep(sk), POLLOUT > | POLLWRNORM | POLLWRBAND); > goto out_free; This should probably rather be if (unix_dgram_peer_wake_connect(sk, other) && skb_queue_len(>sk_receive_queue) == 0) need_wakeup = 1; as there's no need to do the wake up if someone else already connected and then, the double lock could be avoided at the expense of returning a gratuitous EAGAIN to the caller and throwing all of the work _dgram_sendmsg did so far, eg, allocate a skb, copy the data into the kernel, do all the other checks, away. 
This would enable another thread to do one of the following things in
parallel with the 'locked' part of _dgram_sendmsg

	1) connect sk to a socket != other
	2) use sk to send to a socket != other
	3) do a shutdown on sk
	4) determine write-readiness of sk via poll callback

IMHO, the only thing which could possibly matter is 2) and my suggestion
for this would rather be "use a send socket per sending thread if this
matters to you" than "cause something to fail which could as well have
succeeded".
Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue (w/ Fixes:)
Rainer Weikusat writes:

[...]

> This leaves only the option of a somewhat incorrect solution and what is
> or isn't acceptable in this respect is somewhat difficult to decide. The
> basic options would be

[...]

>	- retry sending a limited number of times, eg, once, before
>	  returning EAGAIN, on the grounds that this is nicer to the
>	  application and that redoing all the stuff up to the _lock in
>	  dgram_sendmsg can possibly/ likely be avoided

Since it's better to have a specific example of something: Here's
another 'code sketch' of this option (hopefully with fewer errors this
time, there's an int restart = 0 above):

	if (unix_peer(other) != sk && unix_recvq_full(other)) {
		int need_wakeup;

		[...]

		need_wakeup = 0;
		err = 0;
		unix_state_unlock(other);
		unix_state_lock(sk);

		if (unix_peer(sk) == other) {
			if (++restart == 2) {
				need_wakeup = unix_dgram_peer_wake_connect(sk, other) &&
					sk_receive_queue_len(other) == 0;
				err = -EAGAIN;
			} else if (unix_dgram_peer_wake_me(sk, other))
				err = -EAGAIN;
		} else
			err = -EAGAIN;

		unix_state_unlock(sk);

		if (err || !restart) {
			if (need_wakeup)
				wake_up_interruptible_poll(sk_sleep(sk),
							   POLLOUT |
							   POLLWRNORM |
							   POLLWRBAND);

			goto out_free;
		}

		goto restart;
	}

I don't particularly like that, either, and to me, the best option seems
to be to return the spurious EAGAIN if taking both locks unconditionally
is not an option as that's the simplest choice.
Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue (w/ Fixes:)
From: Rainer Weikusat
Date: Mon, 16 Nov 2015 22:28:40 +

> An AF_UNIX datagram socket being the client in an n:1 association with
> some server socket is only allowed to send messages to the server if the
> receive queue of this socket contains at most sk_max_ack_backlog
> datagrams. This implies that prospective writers might be forced to go
> to sleep despite none of the message presently enqueued on the server
> receive queue were sent by them. In order to ensure that these will be
> woken up once space becomes again available, the present unix_dgram_poll
> routine does a second sock_poll_wait call with the peer_wait wait queue
> of the server socket as queue argument (unix_dgram_recvmsg does a wake
> up on this queue after a datagram was received). This is inherently
> problematic because the server socket is only guaranteed to remain alive
> for as long as the client still holds a reference to it. In case the
> connection is dissolved via connect or by the dead peer detection logic
> in unix_dgram_sendmsg, the server socket may be freed despite "the
> polling mechanism" (in particular, epoll) still has a pointer to the
> corresponding peer_wait queue. There's no way to forcibly deregister a
> wait queue with epoll.
>
> Based on an idea by Jason Baron, the patch below changes the code such
> that a wait_queue_t belonging to the client socket is enqueued on the
> peer_wait queue of the server whenever the peer receive queue full
> condition is detected by either a sendmsg or a poll. A wake up on the
> peer queue is then relayed to the ordinary wait queue of the client
> socket via wake function. The connection to the peer wait queue is again
> dissolved if either a wake up is about to be relayed or the client
> socket reconnects or a dead peer is detected or the client socket is
> itself closed. This enables removing the second sock_poll_wait from
> unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
> that no blocked writer sleeps forever.
>
> Signed-off-by: Rainer Weikusat
> Fixes: ec0d215f9420 ("af_unix: fix 'poll for write'/connected DGRAM sockets")

So because of a corner case of epoll handling and sender socket release,
every single datagram sendmsg has to do a double lock now?

I do not dispute the correctness of your fix at this point, but that
added cost in the fast path is really too high.
Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue (w/ Fixes:)
Rainer Weikusat writes:

[...]

> The basic options would be
>
>	- return EAGAIN even if sending became possible (Jason's most
>	  recent suggestions)
>
>	- retry sending a limited number of times, eg, once, before
>	  returning EAGAIN, on the grounds that this is nicer to the
>	  application and that redoing all the stuff up to the _lock in
>	  dgram_sendmsg can possibly/ likely be avoided

A third option: Use trylock to acquire the sk lock. If this succeeds,
there's no risk of deadlocking anyone even if acquiring the locks in the
wrong order. This could look as follows (NB: I didn't even compile this,
I just wrote the code to get an idea how complicated it would be):

	int need_wakeup;

	[...]

	need_wakeup = 0;
	err = 0;
	if (spin_lock_trylock(unix_sk(sk)->lock)) {
		if (unix_peer(sk) != other ||
		    unix_dgram_peer_wake_me(sk, other))
			err = -EAGAIN;
	} else {
		err = -EAGAIN;

		unix_state_unlock(other);
		unix_state_lock(sk);

		need_wakeup = unix_peer(sk) != other &&
			      unix_dgram_peer_wake_connect(sk, other) &&
			      sk_receive_queue_len(other) == 0;
	}

	unix_state_unlock(sk);

	if (err) {
		if (need_wakeup)
			wake_up_interruptible_poll(sk_sleep(sk),
						   POLLOUT |
						   POLLWRNORM |
						   POLLWRBAND);

		goto out_free;
	}
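The locking pattern behind the trylock option can be tried out in
userspace. What follows is only a minimal sketch under stated
assumptions: the two unix state locks are modelled by a toy spinlock
built on C11 atomic_flag, and every name in it (struct spin_model,
spin_model_trylock, lock_sk_holding_other, ...) is hypothetical, not
kernel API. It is not the sendmsg logic itself, only the "try the second
lock without blocking, otherwise drop the first one" step.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy spinlock standing in for a unix state lock (hypothetical model). */
struct spin_model {
	atomic_flag held;	/* initialise with ATOMIC_FLAG_INIT */
};

/* Never blocks: returns true if the lock was free and is now ours. */
static bool spin_model_trylock(struct spin_model *l)
{
	return !atomic_flag_test_and_set_explicit(&l->held,
						  memory_order_acquire);
}

static void spin_model_lock(struct spin_model *l)
{
	while (!spin_model_trylock(l))
		;	/* busy-wait until the holder releases it */
}

static void spin_model_unlock(struct spin_model *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Caller holds 'other' and wants 'sk', which is the "wrong" order.
 *
 * Returns true if 'sk' was taken while 'other' stayed locked, so
 * everything tested under 'other' is still valid. Returns false if
 * 'other' had to be dropped first; 'sk' is locked in both cases, but
 * in the false case earlier test results must be treated as stale.
 */
static bool lock_sk_holding_other(struct spin_model *other,
				  struct spin_model *sk)
{
	if (spin_model_trylock(sk))
		return true;	/* a trylock cannot deadlock anyone */

	spin_model_unlock(other);	/* slow path: give up 'other' ... */
	spin_model_lock(sk);		/* ... then wait for 'sk' alone */
	return false;
}
```

The design point is that only the blocking acquisition has to respect
the lock order; a failed trylock costs nothing except falling back to
the "unlock other, lock sk, return EAGAIN" path from the earlier
suggestions.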
Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue (w/ Fixes:)
David Miller writes:
> From: Rainer Weikusat
> Date: Mon, 16 Nov 2015 22:28:40 +
>
>> An AF_UNIX datagram socket being the client in an n:1 association with
>> some server socket is only allowed to send messages to the server if the
>> receive queue of this socket contains at most sk_max_ack_backlog
>> datagrams.

[...]

>> Signed-off-by: Rainer Weikusat
>> Fixes: ec0d215f9420 ("af_unix: fix 'poll for write'/connected DGRAM sockets")
>
> So because of a corner case of epoll handling and sender socket release,
> every single datagram sendmsg has to do a double lock now?
>
> I do not dispute the correctness of your fix at this point, but that
> added cost in the fast path is really too high.

This leaves only the option of a somewhat incorrect solution and what is
or isn't acceptable in this respect is somewhat difficult to decide. The
basic options would be

	- return EAGAIN even if sending became possible (Jason's most
	  recent suggestions)

	- retry sending a limited number of times, eg, once, before
	  returning EAGAIN, on the grounds that this is nicer to the
	  application and that redoing all the stuff up to the _lock in
	  dgram_sendmsg can possibly/ likely be avoided

Which one do you prefer?
Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue (w/ Fixes:)
An AF_UNIX datagram socket being the client in an n:1 association with some server socket is only allowed to send messages to the server if the receive queue of this socket contains at most sk_max_ack_backlog datagrams. This implies that prospective writers might be forced to go to sleep despite none of the message presently enqueued on the server receive queue were sent by them. In order to ensure that these will be woken up once space becomes again available, the present unix_dgram_poll routine does a second sock_poll_wait call with the peer_wait wait queue of the server socket as queue argument (unix_dgram_recvmsg does a wake up on this queue after a datagram was received). This is inherently problematic because the server socket is only guaranteed to remain alive for as long as the client still holds a reference to it. In case the connection is dissolved via connect or by the dead peer detection logic in unix_dgram_sendmsg, the server socket may be freed despite "the polling mechanism" (in particular, epoll) still has a pointer to the corresponding peer_wait queue. There's no way to forcibly deregister a wait queue with epoll. Based on an idea by Jason Baron, the patch below changes the code such that a wait_queue_t belonging to the client socket is enqueued on the peer_wait queue of the server whenever the peer receive queue full condition is detected by either a sendmsg or a poll. A wake up on the peer queue is then relayed to the ordinary wait queue of the client socket via wake function. The connection to the peer wait queue is again dissolved if either a wake up is about to be relayed or the client socket reconnects or a dead peer is detected or the client socket is itself closed. This enables removing the second sock_poll_wait from unix_dgram_poll, thus avoiding the use-after-free, while still ensuring that no blocked writer sleeps forever. 
Signed-off-by: Rainer Weikusat
Fixes: ec0d215f9420 ("af_unix: fix 'poll for write'/connected DGRAM sockets")
---

Additional remark about "5456f09aaf88/ af_unix: fix unix_dgram_poll()
behavior for EPOLLOUT event": This shouldn't be an issue anymore with
this change despite it restores the "only when writable" behaviour as
the wake up relay will also be set up once _dgram_sendmsg returned
EAGAIN for a send attempt on a n:1 connected socket.

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index b36d837..2a91a05 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -62,6 +62,7 @@ struct unix_sock {
 #define UNIX_GC_CANDIDATE	0
 #define UNIX_GC_MAYBE_CYCLE	1
 	struct socket_wq	peer_wq;
+	wait_queue_t		peer_wake;
 };
 
 static inline struct unix_sock *unix_sk(const struct sock *sk)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 94f6582..3f4974d 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -326,6 +326,112 @@ found:
 	return s;
 }
 
+/* Support code for asymmetrically connected dgram sockets
+ *
+ * If a datagram socket is connected to a socket not itself connected
+ * to the first socket (eg, /dev/log), clients may only enqueue more
+ * messages if the present receive queue of the server socket is not
+ * "too large". This means there's a second writeability condition
+ * poll and sendmsg need to test. The dgram recv code will do a wake
+ * up on the peer_wait wait queue of a socket upon reception of a
+ * datagram which needs to be propagated to sleeping would-be writers
+ * since these might not have sent anything so far. This can't be
+ * accomplished via poll_wait because the lifetime of the server
+ * socket might be less than that of its clients if these break their
+ * association with it or if the server socket is closed while clients
+ * are still connected to it and there's no way to inform "a polling
+ * implementation" that it should let go of a certain wait queue
+ *
+ * In order to propagate a wake up, a wait_queue_t of the client
+ * socket is enqueued on the peer_wait queue of the server socket
+ * whose wake function does a wake_up on the ordinary client socket
+ * wait queue. This connection is established whenever a write (or
+ * poll for write) hit the flow control condition and broken when the
+ * association to the server socket is dissolved or after a wake up
+ * was relayed.
+ */
+
+static int unix_dgram_peer_wake_relay(wait_queue_t *q, unsigned mode, int flags,
+				      void *key)
+{
+	struct unix_sock *u;
+	wait_queue_head_t *u_sleep;
+
+	u = container_of(q, struct unix_sock, peer_wake);
+
+	__remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait,
+			    q);
+	u->peer_wake.private = NULL;
+
+	/* relaying can only happen while the wq still exists */
+	u_sleep = sk_sleep(&u->sk);
+	if (u_sleep)
+		wake_up_interruptible_poll(u_sleep, key);
+
+	return 0;
+}
+
+static int
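The relay mechanism described in the patch comment can be modelled
outside the kernel. The following is a rough sketch, assuming a
hand-rolled circular list in place of the kernel wait queue and a plain
counter in place of waking the client's own wait queue; struct
model_sock, peer_wake_connect and wake_up_peer_wait are hypothetical
names, and all locking is left out. It only shows the one-shot
"enqueue a callback entry on the server, unlink it when relaying"
behaviour.

```c
#include <stddef.h>

struct wq_entry {
	void (*func)(struct wq_entry *);	/* wake function */
	struct wq_entry *next, *prev;		/* circular list links */
	void *private;				/* peer while enqueued, else NULL */
};

struct model_sock {
	struct wq_entry peer_wait;	/* list head: woken after a receive */
	struct wq_entry peer_wake;	/* our relay entry on a peer's list */
	int wakeups;			/* stands in for waking our own queue */
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Wake function: unlink from the peer's list, then wake the client. */
static void peer_wake_relay(struct wq_entry *e)
{
	struct model_sock *u = container_of(e, struct model_sock, peer_wake);

	e->prev->next = e->next;
	e->next->prev = e->prev;
	e->private = NULL;
	u->wakeups++;
}

static void model_sock_init(struct model_sock *s)
{
	s->peer_wait.func = NULL;
	s->peer_wait.private = NULL;
	s->peer_wait.next = s->peer_wait.prev = &s->peer_wait;
	s->peer_wake.func = peer_wake_relay;
	s->peer_wake.private = NULL;
	s->peer_wake.next = s->peer_wake.prev = NULL;
	s->wakeups = 0;
}

/* Returns 1 if the relay entry was newly enqueued, 0 if already there. */
static int peer_wake_connect(struct model_sock *sk, struct model_sock *other)
{
	struct wq_entry *head = &other->peer_wait, *e = &sk->peer_wake;

	if (e->private)
		return 0;
	e->private = other;
	e->next = head->next;
	e->prev = head;
	head->next->prev = e;
	head->next = e;
	return 1;
}

/* What the receive side does after dequeuing a datagram. */
static void wake_up_peer_wait(struct model_sock *other)
{
	struct wq_entry *head = &other->peer_wait, *e, *next;

	for (e = head->next; e != head; e = next) {
		next = e->next;	/* func unlinks e, so fetch next first */
		e->func(e);
	}
}
```

Because each entry removes itself while relaying, the server's list
never holds a pointer into a client after that client was woken, and a
client that disconnects only has to unlink its own entry. That is the
property the second sock_poll_wait could not provide.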
Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue
An AF_UNIX datagram socket being the client in an n:1 association with some server socket is only allowed to send messages to the server if the receive queue of this socket contains at most sk_max_ack_backlog datagrams. This implies that prospective writers might be forced to go to sleep despite none of the message presently enqueued on the server receive queue were sent by them. In order to ensure that these will be woken up once space becomes again available, the present unix_dgram_poll routine does a second sock_poll_wait call with the peer_wait wait queue of the server socket as queue argument (unix_dgram_recvmsg does a wake up on this queue after a datagram was received). This is inherently problematic because the server socket is only guaranteed to remain alive for as long as the client still holds a reference to it. In case the connection is dissolved via connect or by the dead peer detection logic in unix_dgram_sendmsg, the server socket may be freed despite "the polling mechanism" (in particular, epoll) still has a pointer to the corresponding peer_wait queue. There's no way to forcibly deregister a wait queue with epoll. Based on an idea by Jason Baron, the patch below changes the code such that a wait_queue_t belonging to the client socket is enqueued on the peer_wait queue of the server whenever the peer receive queue full condition is detected by either a sendmsg or a poll. A wake up on the peer queue is then relayed to the ordinary wait queue of the client socket via wake function. The connection to the peer wait queue is again dissolved if either a wake up is about to be relayed or the client socket reconnects or a dead peer is detected or the client socket is itself closed. This enables removing the second sock_poll_wait from unix_dgram_poll, thus avoiding the use-after-free, while still ensuring that no blocked writer sleeps forever. 
Signed-off-by: Rainer Weikusat
---

- fix logic in _dgram_sendmsg: queue limit also needs to be
  enforced for unconnected sends

- drop _recv_ready helper function: I'm usually a big fan of
  functional decomposition but in this case, the abstraction
  seemed to obscure things rather than making them easier to
  understand

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index b36d837..2a91a05 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -62,6 +62,7 @@ struct unix_sock {
 #define UNIX_GC_CANDIDATE	0
 #define UNIX_GC_MAYBE_CYCLE	1
 	struct socket_wq	peer_wq;
+	wait_queue_t		peer_wake;
 };
 
 static inline struct unix_sock *unix_sk(const struct sock *sk)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 94f6582..3f4974d 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -326,6 +326,112 @@ found:
 	return s;
 }
 
+/* Support code for asymmetrically connected dgram sockets
+ *
+ * If a datagram socket is connected to a socket not itself connected
+ * to the first socket (eg, /dev/log), clients may only enqueue more
+ * messages if the present receive queue of the server socket is not
+ * "too large". This means there's a second writeability condition
+ * poll and sendmsg need to test. The dgram recv code will do a wake
+ * up on the peer_wait wait queue of a socket upon reception of a
+ * datagram which needs to be propagated to sleeping would-be writers
+ * since these might not have sent anything so far. This can't be
+ * accomplished via poll_wait because the lifetime of the server
+ * socket might be less than that of its clients if these break their
+ * association with it or if the server socket is closed while clients
+ * are still connected to it and there's no way to inform "a polling
+ * implementation" that it should let go of a certain wait queue
+ *
+ * In order to propagate a wake up, a wait_queue_t of the client
+ * socket is enqueued on the peer_wait queue of the server socket
+ * whose wake function does a wake_up on the ordinary client socket
+ * wait queue. This connection is established whenever a write (or
+ * poll for write) hit the flow control condition and broken when the
+ * association to the server socket is dissolved or after a wake up
+ * was relayed.
+ */
+
+static int unix_dgram_peer_wake_relay(wait_queue_t *q, unsigned mode, int flags,
+				      void *key)
+{
+	struct unix_sock *u;
+	wait_queue_head_t *u_sleep;
+
+	u = container_of(q, struct unix_sock, peer_wake);
+
+	__remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait,
+			    q);
+	u->peer_wake.private = NULL;
+
+	/* relaying can only happen while the wq still exists */
+	u_sleep = sk_sleep(&u->sk);
+	if (u_sleep)
+		wake_up_interruptible_poll(u_sleep, key);
+
+	return 0;
+}
+
+static int unix_dgram_peer_wake_connect(struct sock *sk, struct sock *other)
+{
+	struct unix_sock *u,
Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue
Jason Baron writes:
> On 11/13/2015 01:51 PM, Rainer Weikusat wrote:
>
> [...]
>
>>
>> -	if (unix_peer(other) != sk && unix_recvq_full(other)) {
>> -		if (!timeo) {
>> -			err = -EAGAIN;
>> -			goto out_unlock;
>> -		}
>> +	if (unix_peer(sk) == other && !unix_dgram_peer_recv_ready(sk, other)) {
>
> Remind me why the 'unix_peer(sk) == other' is added here? If the remote
> is not connected we still want to make sure that we don't overflow the
> remote rcv queue, right?

Good point. The check is actually wrong there as the original code would
also check the limit in case of an unconnected send to a socket found
via address lookup. It belongs into the 2nd if (where I originally put
it).

>
> In terms of this added 'double' lock for both sk and other, where
> previously we just held the 'other' lock. I think we could continue to
> just hold the 'other' lock unless the remote queue is full, so something
> like:
>
>	if (unix_peer(other) != sk && unix_recvq_full(other)) {
>		bool need_wakeup = false;
>
>		skipping the blocking case...
>
>		err = -EAGAIN;
>		if (!other_connected)
>			goto out_unlock;
>		unix_state_unlock(other);
>		unix_state_lock(sk);

That was my original idea. The problem with this is that the code
starting after the _lock and running until the main code path unlock has
to be executed in one go with the other lock held as the results of the
tests above this one may become invalid as soon as the other lock is
released. This means instead of continuing execution with the send code
proper after the block in case other became receive-ready between the
first and the second test (possible because _dgram_recvmsg does not
take the unix state lock), the whole procedure starting with acquiring
the other lock would need to be restarted. Given sufficiently unfavorable
circumstances, this could even turn into an endless loop which couldn't
be interrupted. (unless code for this was added).

[...]
> we currently wake the entire queue on every remote read even when we
> have room in the rcv buffer. So this patch will cut down on ctxt
> switching rate dramatically from what we currently have.

In my opinion, this could be improved by making the throttling mechanism
work like a flip flop: If the queue length hits the limit after a
_sendmsg, set a "no more applicants" flag blocking future sends until
cleared (checking the flag would replace the present check). After the
receive queue ran empty (or almost empty), _dgram_sendmsg would clear
the flag and do a wakeup. But this should be an independent patch (if
implemented).
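To make the flip-flop idea concrete, here is a small userspace sketch.
It is an illustration under assumptions, not proposed kernel code: the
names (struct throttled_queue, tq_send, tq_recv), the low_water
threshold standing in for "empty (or almost empty)", and the choice to
clear the flag on the receive side of the model are all made up for
this example, and there is no locking.

```c
#include <stdbool.h>
#include <stddef.h>

struct throttled_queue {
	size_t len;		/* datagrams currently queued */
	size_t limit;		/* reaching this sets the flag */
	size_t low_water;	/* draining to this clears the flag */
	bool closed;		/* the "no more applicants" flag */
};

/* Returns true if the datagram was queued, false if the sender must wait.
 * Only the flag is checked, not the queue length.
 */
static bool tq_send(struct throttled_queue *q)
{
	if (q->closed)
		return false;
	if (++q->len >= q->limit)
		q->closed = true;
	return true;
}

/* Dequeues one datagram if there is one. Returns true only when this
 * receive reopened the queue, ie, when waiting senders should be woken.
 */
static bool tq_recv(struct throttled_queue *q)
{
	if (q->len == 0)
		return false;
	--q->len;
	if (q->closed && q->len <= q->low_water) {
		q->closed = false;
		return true;
	}
	return false;
}
```

With limit 3 and low_water 0, three sends succeed, the fourth is
refused, and only the third receive reports that a wakeup is due. The
hysteresis means one wakeup per drain cycle instead of one per received
datagram.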
Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue
An AF_UNIX datagram socket being the client in an n:1 association with some server socket is only allowed to send messages to the server if the receive queue of this socket contains at most sk_max_ack_backlog datagrams. This implies that prospective writers might be forced to go to sleep despite none of the message presently enqueued on the server receive queue were sent by them. In order to ensure that these will be woken up once space becomes again available, the present unix_dgram_poll routine does a second sock_poll_wait call with the peer_wait wait queue of the server socket as queue argument (unix_dgram_recvmsg does a wake up on this queue after a datagram was received). This is inherently problematic because the server socket is only guaranteed to remain alive for as long as the client still holds a reference to it. In case the connection is dissolved via connect or by the dead peer detection logic in unix_dgram_sendmsg, the server socket may be freed despite "the polling mechanism" (in particular, epoll) still has a pointer to the corresponding peer_wait queue. There's no way to forcibly deregister a wait queue with epoll. Based on an idea by Jason Baron, the patch below changes the code such that a wait_queue_t belonging to the client socket is enqueued on the peer_wait queue of the server whenever the peer receive queue full condition is detected by either a sendmsg or a poll. A wake up on the peer queue is then relayed to the ordinary wait queue of the client socket via wake function. The connection to the peer wait queue is again dissolved if either a wake up is about to be relayed or the client socket reconnects or a dead peer is detected or the client socket is itself closed. This enables removing the second sock_poll_wait from unix_dgram_poll, thus avoiding the use-after-free, while still ensuring that no blocked writer sleeps forever. 
Signed-off-by: Rainer Weikusat
---

"Believed to be least buggy version"

- disconnect from former peer in _dgram_connect

- use unix_state_double_lock in _dgram_sendmsg to ensure
  recv_ready/ wake_me preconditions are met (noted by Jason
  Baron)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index b36d837..2a91a05 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -62,6 +62,7 @@ struct unix_sock {
 #define UNIX_GC_CANDIDATE	0
 #define UNIX_GC_MAYBE_CYCLE	1
 	struct socket_wq	peer_wq;
+	wait_queue_t		peer_wake;
 };
 
 static inline struct unix_sock *unix_sk(const struct sock *sk)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 94f6582..30e7c56 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -326,6 +326,122 @@ found:
 	return s;
 }
 
+/* Support code for asymmetrically connected dgram sockets
+ *
+ * If a datagram socket is connected to a socket not itself connected
+ * to the first socket (eg, /dev/log), clients may only enqueue more
+ * messages if the present receive queue of the server socket is not
+ * "too large". This means there's a second writeability condition
+ * poll and sendmsg need to test. The dgram recv code will do a wake
+ * up on the peer_wait wait queue of a socket upon reception of a
+ * datagram which needs to be propagated to sleeping would-be writers
+ * since these might not have sent anything so far. This can't be
+ * accomplished via poll_wait because the lifetime of the server
+ * socket might be less than that of its clients if these break their
+ * association with it or if the server socket is closed while clients
+ * are still connected to it and there's no way to inform "a polling
+ * implementation" that it should let go of a certain wait queue
+ *
+ * In order to propagate a wake up, a wait_queue_t of the client
+ * socket is enqueued on the peer_wait queue of the server socket
+ * whose wake function does a wake_up on the ordinary client socket
+ * wait queue. This connection is established whenever a write (or
+ * poll for write) hit the flow control condition and broken when the
+ * association to the server socket is dissolved or after a wake up
+ * was relayed.
+ */
+
+static int unix_dgram_peer_wake_relay(wait_queue_t *q, unsigned mode, int flags,
+				      void *key)
+{
+	struct unix_sock *u;
+	wait_queue_head_t *u_sleep;
+
+	u = container_of(q, struct unix_sock, peer_wake);
+
+	__remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait,
+			    q);
+	u->peer_wake.private = NULL;
+
+	/* relaying can only happen while the wq still exists */
+	u_sleep = sk_sleep(&u->sk);
+	if (u_sleep)
+		wake_up_interruptible_poll(u_sleep, key);
+
+	return 0;
+}
+
+static int unix_dgram_peer_wake_connect(struct sock *sk, struct sock *other)
+{
+	struct unix_sock *u, *u_other;
+	int rc;
+
+	u = unix_sk(sk);
+	u_other = unix_sk(other);
+
Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue
Hannes Frederic Sowa writes:
> On Wed, Nov 11, 2015, at 17:12, Rainer Weikusat wrote:
>> Hannes Frederic Sowa writes:
>> > On Tue, Nov 10, 2015, at 22:55, Rainer Weikusat wrote:
>> >> An AF_UNIX datagram socket being the client in an n:1 association with
>> >> some server socket is only allowed to send messages to the server if the
>> >> receive queue of this socket contains at most sk_max_ack_backlog
>> >> datagrams.
>>
>> [...]
>>
>> > This whole patch seems pretty complicated to me.
>> >
>> > Can't we just remove the unix_recvq_full checks altogether and unify
>> > unix_dgram_poll with unix_poll?
>> >
>> > If we want to be cautious we could simply make unix_max_dgram_qlen limit
>> > the number of skbs which are in flight from a sending socket. The skb
>> > destructor can then decrement this. This seems much simpler.
>> >
>> > Would this work?
>>
>> In the way this is intended to work, cf
>>
>> http://marc.info/?t=11562760602=1=2
>
> Oh, I see, we don't limit closed but still referenced sockets. This
> actually makes sense on how fd handling is implemented, just as a range
> check.
>
> Have you checked if we can somehow deregister the socket in the poll
> event framework? You wrote that it does not provide such a function but
> maybe it would be easy to add?

I thought about this but this would amount to adding a general interface
for the sole purpose of enabling the af_unix code to talk to the
eventpoll code and I don't really like this idea: IMHO, there should be
at least two users (preferably three) before creating any kind of
'abstract interface'.

An even more ideal "castle in the air" (hypothetical) solution would be
"change the eventpoll.c code such that it won't be affected if a wait
queue just goes away". That's at least theoretically possible (although
it might not be in practice).
I wouldn't mind doing that (assuming it was possible) if it was just for
the kernels my employer uses because I'm aware of the uses these will be
put to and in control of the corresponding userland code. But for
"general Linux code", changing epoll in order to help the af_unix code
is more potential trouble than it's worth: Exchanging a relatively
unimportant bug in some module for a much more visibly damaging bug in a
central facility would be a bad tradeoff.
Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue
On 11/13/2015 01:51 PM, Rainer Weikusat wrote:

[...]

>
> -	if (unix_peer(other) != sk && unix_recvq_full(other)) {
> -		if (!timeo) {
> -			err = -EAGAIN;
> -			goto out_unlock;
> -		}
> +	if (unix_peer(sk) == other && !unix_dgram_peer_recv_ready(sk, other)) {

Remind me why the 'unix_peer(sk) == other' is added here? If the remote
is not connected we still want to make sure that we don't overflow the
remote rcv queue, right?

In terms of this added 'double' lock for both sk and other, where
previously we just held the 'other' lock. I think we could continue to
just hold the 'other' lock unless the remote queue is full, so something
like:

	if (unix_peer(other) != sk && unix_recvq_full(other)) {
		bool need_wakeup = false;

		skipping the blocking case...

		err = -EAGAIN;
		if (!other_connected)
			goto out_unlock;
		unix_state_unlock(other);
		unix_state_lock(sk);

		/* if remote peer has changed under us, the connect()
		   will wake up any pending waiter, just return -EAGAIN
		*/
		if (unix_peer(sk) == other) {
			/* In case we see there is space available
			   queue the wakeup and we will try again. This
			   should be an unlikely condition */
			if (!unix_dgram_peer_wake_me(sk, other))
				need_wakeup = true;
		}
		unix_state_unlock(sk);
		if (need_wakeup)
			wake_up_interruptible_poll(sk_sleep(sk), POLLOUT | POLLWRNORM | POLLWRBAND);
		goto out_free;
	}

So I'm not sure if the 'double' lock really affects any workload, but
the above might be a way to avoid it.

Also - it might be helpful to add a 'Fixes:' tag referencing where this
issue started, in the changelog.

Worth mentioning too is that this patch should improve the polling case
here dramatically, as we currently wake the entire queue on every remote
read even when we have room in the rcv buffer. So this patch will cut
down on ctxt switching rate dramatically from what we currently have.
Thanks,

-Jason
Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue
Jason Baron writes:
>> +
>> +/* Needs sk unix state lock. After recv_ready indicated not ready,
>> + * establish peer_wait connection if still needed.
>> + */
>> +static int unix_dgram_peer_wake_me(struct sock *sk, struct sock *other)
>> +{
>> +	int connected;
>> +
>> +	connected = unix_dgram_peer_wake_connect(sk, other);
>> +
>> +	if (unix_recvq_full(other))
>> +		return 1;
>> +
>> +	if (connected)
>> +		unix_dgram_peer_wake_disconnect(sk, other);
>> +
>> +	return 0;
>> +}
>> +
>
> So the comment above this function says 'needs unix state lock', however
> the usage in unix_dgram_sendmsg() has the 'other' lock, while the usage
> in unix_dgram_poll() has the 'sk' lock. So this looks racy.

That's one thing which is broken with this patch. Judging from a 'quick'
look at the _dgram_sendmsg code, the unix_state_lock(other) will need to
be turned into a unix_state_double_lock(sk, other) and the remaining
code changed accordingly (since all of the checks must be done without
unlocking other).

There's also something else seriously wrong with the present patch: Some
code in unix_dgram_connect presently (with this change) looks like this:

	/*
	 * If it was connected, reconnect.
	 */
	if (unix_peer(sk)) {
		struct sock *old_peer = unix_peer(sk);
		unix_peer(sk) = other;

		if (unix_dgram_peer_wake_disconnect(sk, other))
			wake_up_interruptible_poll(sk_sleep(sk),
						   POLLOUT |
						   POLLWRNORM |
						   POLLWRBAND);

		unix_state_double_unlock(sk, other);

		if (other != old_peer)
			unix_dgram_disconnected(sk, old_peer);
		sock_put(old_peer);

and trying to disconnect from a peer the socket is just being connected
to is - of course - "flowering tomfoolery" (literal translation of the
German "bluehender Bloedsinn") --- it needs to disconnect from old_peer
instead.

I'll address the suggestion and send an updated patch "later today" (may
become "early tomorrow"). I have some code addressing both issues but
that's part of a release of 'our' kernel fork, ie, 3.2.54-based I'll
need to do 'soon'.
Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue
Hi Rainer,

> +
> +/* Needs sk unix state lock. After recv_ready indicated not ready,
> + * establish peer_wait connection if still needed.
> + */
> +static int unix_dgram_peer_wake_me(struct sock *sk, struct sock *other)
> +{
> +	int connected;
> +
> +	connected = unix_dgram_peer_wake_connect(sk, other);
> +
> +	if (unix_recvq_full(other))
> +		return 1;
> +
> +	if (connected)
> +		unix_dgram_peer_wake_disconnect(sk, other);
> +
> +	return 0;
> +}
> +

So the comment above this function says 'needs unix state lock', however
the usage in unix_dgram_sendmsg() has the 'other' lock, while the usage
in unix_dgram_poll() has the 'sk' lock. So this looks racy.

Also, another tweak on this scheme: Instead of calling
'__remove_wait_queue()' in unix_dgram_peer_wake_relay(), we could
simply mark each item in the queue as 'WQ_FLAG_EXCLUSIVE'. Then, since
'unix_dgram_recvmsg()' does an exclusive wakeup, the queue has
effectively been disabled (minus the first exclusive item in the list,
which can just return if it's marked exclusive). This means that in
dgram_poll(), we add to the list if we have not yet been added, and if
we are on the list, we do a remove and then add, removing the exclusive
flag. Thus, all the waiters that need a wakeup are at the beginning of
the queue, and the disabled ones are at the end with the
'WQ_FLAG_EXCLUSIVE' flag set.

This does make the list potentially long, but if we only walk it to the
point we are doing the wakeup, it has no impact. I like the fact that in
this scheme the wakeup doesn't have to call remove against a long list
of waiters - it's just setting the exclusive flag.
Thanks,

-Jason

>  static inline int unix_writable(struct sock *sk)
>  {
>  	return (atomic_read(&sk->sk_wmem_alloc) << 2) <= sk->sk_sndbuf;
> @@ -430,6 +546,8 @@ static void unix_release_sock(struct sock *sk, int embrion)
>  			skpair->sk_state_change(skpair);
>  			sk_wake_async(skpair, SOCK_WAKE_WAITD, POLL_HUP);
>  		}
> +
> +		unix_dgram_peer_wake_disconnect(sk, skpair);
>  		sock_put(skpair); /* It may now die */
>  		unix_peer(sk) = NULL;
>  	}
> @@ -664,6 +782,7 @@ static struct sock *unix_create1(struct net *net, struct socket *sock, int kern)
>  	INIT_LIST_HEAD(&u->link);
>  	mutex_init(&u->readlock); /* single task reading lock */
>  	init_waitqueue_head(&u->peer_wait);
> +	init_waitqueue_func_entry(&u->peer_wake, unix_dgram_peer_wake_relay);
>  	unix_insert_socket(unix_sockets_unbound(sk), sk);
>  out:
>  	if (sk == NULL)
> @@ -1031,6 +1150,13 @@ restart:
>  	if (unix_peer(sk)) {
>  		struct sock *old_peer = unix_peer(sk);
>  		unix_peer(sk) = other;
> +
> +		if (unix_dgram_peer_wake_disconnect(sk, other))
> +			wake_up_interruptible_poll(sk_sleep(sk),
> +						   POLLOUT |
> +						   POLLWRNORM |
> +						   POLLWRBAND);
> +
>  		unix_state_double_unlock(sk, other);
>
>  		if (other != old_peer)
> @@ -1565,6 +1691,13 @@ restart:
>  		unix_state_lock(sk);
>  		if (unix_peer(sk) == other) {
>  			unix_peer(sk) = NULL;
> +
> +			if (unix_dgram_peer_wake_disconnect(sk, other))
> +				wake_up_interruptible_poll(sk_sleep(sk),
> +							   POLLOUT |
> +							   POLLWRNORM |
> +							   POLLWRBAND);
> +
>  			unix_state_unlock(sk);
>
>  			unix_dgram_disconnected(sk, other);
> @@ -1590,19 +1723,21 @@ restart:
>  		goto out_unlock;
>  	}
>
> -	if (unix_peer(other) != sk && unix_recvq_full(other)) {
> -		if (!timeo) {
> -			err = -EAGAIN;
> -			goto out_unlock;
> -		}
> +	if (!unix_dgram_peer_recv_ready(sk, other)) {
> +		if (timeo) {
> +			timeo = unix_wait_for_peer(other, timeo);
>
> -		timeo = unix_wait_for_peer(other, timeo);
> +			err = sock_intr_errno(timeo);
> +			if (signal_pending(current))
> +				goto out_free;
>
> -		err = sock_intr_errno(timeo);
> -		if (signal_pending(current))
> -			goto out_free;
> +			goto restart;
> +		}
>
> -		goto restart;
> +		if (unix_dgram_peer_wake_me(sk, other)) {
> +			err = -EAGAIN;
> +			goto out_unlock;
> +		}
>  	}
>
>  	if (sock_flag(other, SOCK_RCVTSTAMP))
> @@ -2453,14 +2588,16 @@ static unsigned int unix_dgram_poll(struct
Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue
Hannes Frederic Sowa writes:
> On Tue, Nov 10, 2015, at 22:55, Rainer Weikusat wrote:
>> An AF_UNIX datagram socket being the client in an n:1 association with
>> some server socket is only allowed to send messages to the server if the
>> receive queue of this socket contains at most sk_max_ack_backlog
>> datagrams.

[...]

> This whole patch seems pretty complicated to me.
>
> Can't we just remove the unix_recvq_full checks altogether and unify
> unix_dgram_poll with unix_poll?
>
> If we want to be cautious we could simply make unix_max_dgram_qlen limit
> the number of skbs which are in flight from a sending socket. The skb
> destructor can then decrement this. This seems much simpler.
>
> Would this work?

In the way this is intended to work, cf

http://marc.info/?t=11562760602=1=2

only if the limit would also apply to sockets which didn't send anything
so far. Which means it'll end up in the exact same situation as before:
Sending something using a certain socket may not be possible because of
data sent by other sockets, so either code trying to send using this
socket ends up busy-waiting for "space again available" despite it's
trying to use select/ poll/ epoll/ $whatnot to get notified of this
condition and sleep until then, or this notification needs to be
propagated to sleeping threads which didn't get to send anything yet.
Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue
Hi,

On Wed, Nov 11, 2015, at 17:12, Rainer Weikusat wrote:
> Hannes Frederic Sowa writes:
> > On Tue, Nov 10, 2015, at 22:55, Rainer Weikusat wrote:
> >> An AF_UNIX datagram socket being the client in an n:1 association with
> >> some server socket is only allowed to send messages to the server if the
> >> receive queue of this socket contains at most sk_max_ack_backlog
> >> datagrams.
>
> [...]
>
> > This whole patch seems pretty complicated to me.
> >
> > Can't we just remove the unix_recvq_full checks altogether and unify
> > unix_dgram_poll with unix_poll?
> >
> > If we want to be cautious we could simply make unix_max_dgram_qlen limit
> > the number of skbs which are in flight from a sending socket. The skb
> > destructor can then decrement this. This seems much simpler.
> >
> > Would this work?
>
> In the way this is intended to work, cf
>
> http://marc.info/?t=11562760602=1=2

Oh, I see, we don't limit closed but still referenced sockets. This
actually makes sense given how fd handling is implemented, just as a
range check.

Have you checked if we can somehow deregister the socket in the poll
event framework? You wrote that it does not provide such a function but
maybe it would be easy to add?

Thanks,
Hannes
Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue
Jason Baron writes:
> On 11/09/2015 09:40 AM, Rainer Weikusat wrote:

[...]

>> -	if (unix_peer(other) != sk && unix_recvq_full(other)) {
>> +	if (!unix_dgram_peer_recv_ready(sk, other)) {
>>  		if (!timeo) {
>> -			err = -EAGAIN;
>> -			goto out_unlock;
>> +			if (unix_dgram_peer_wake_me(sk, other)) {
>> +				err = -EAGAIN;
>> +				goto out_unlock;
>> +			}
>> +
>> +			goto restart;
>>  		}
>
> So this will cause 'unix_state_lock(other)' to be called twice in a
> row if we 'goto restart' (and hence will softlock the box). It just
> needs a 'unix_state_unlock(other);' before the 'goto restart'.

The goto restart was nonsense to begin with in this code path:
Restarting something is necessary after sleeping for some time but for
the case above, execution just continues. I've changed that (updated
patch should follow 'soon') to

	if (!unix_dgram_peer_recv_ready(sk, other)) {
		if (timeo) {
			timeo = unix_wait_for_peer(other, timeo);

			err = sock_intr_errno(timeo);
			if (signal_pending(current))
				goto out_free;

			goto restart;
		}

		if (unix_dgram_peer_wake_me(sk, other)) {
			err = -EAGAIN;
			goto out_unlock;
		}
	}

> I also tested this patch with a single unix server and 200 client
> threads doing roughly epoll() followed by write() until -EAGAIN in a
> loop. The throughput for the test was roughly the same as current
> upstream, but the cpu usage was a lot higher. I think it's b/c this patch
> takes the server wait queue lock in the _poll() routine. This causes a
> lot of contention. The previous patch you posted for this where you did
> not clear the wait queue on every wakeup and thus didn't need the queue
> lock in poll() (unless we were adding to it), performed much better.

I'm somewhat unsure what to make of that: The previous patch would also
take the wait queue lock whenever poll was about to return 'not
writable' because of the length of the server receive queue unless
another thread using the same client socket also noticed this and
enqueued this same socket already. And "hundreds of clients using a
single client socket in order to send data to a single server socket"
doesn't seem very realistic to me.

Also, this code shouldn't usually be executed as the server should
usually be capable of keeping up with the data sent by clients. If it's
permanently incapable of that, you're effectively performing a
(successful) DDOS against it. Which should result in "high CPU
utilization" in either case. It may be possible to improve this by
tuning/ changing the flow control mechanism. Off the top of my head, I'd
suggest making the queue longer (the default value is 10) and delaying
wake ups until the server actually did catch up, IOW, the receive queue
is empty or almost empty. But this ought to be done with a different
patch.
Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue
David Miller writes:
> From: Rainer Weikusat
> Date: Mon, 09 Nov 2015 14:40:48 +
>
>> +	__remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait,
>> +			    &u->peer_wake);
>
> This is more simply:
>
> 	__remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait, q);

Thank you for pointing this out.
Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue
An AF_UNIX datagram socket being the client in an n:1 association with
some server socket is only allowed to send messages to the server if the
receive queue of this socket contains at most sk_max_ack_backlog
datagrams. This implies that prospective writers might be forced to go
to sleep despite none of the message presently enqueued on the server
receive queue were sent by them. In order to ensure that these will be
woken up once space becomes again available, the present unix_dgram_poll
routine does a second sock_poll_wait call with the peer_wait wait queue
of the server socket as queue argument (unix_dgram_recvmsg does a wake
up on this queue after a datagram was received). This is inherently
problematic because the server socket is only guaranteed to remain alive
for as long as the client still holds a reference to it. In case the
connection is dissolved via connect or by the dead peer detection logic
in unix_dgram_sendmsg, the server socket may be freed despite "the
polling mechanism" (in particular, epoll) still has a pointer to the
corresponding peer_wait queue. There's no way to forcibly deregister a
wait queue with epoll.

Based on an idea by Jason Baron, the patch below changes the code such
that a wait_queue_t belonging to the client socket is enqueued on the
peer_wait queue of the server whenever the peer receive queue full
condition is detected by either a sendmsg or a poll. A wake up on the
peer queue is then relayed to the ordinary wait queue of the client
socket via wake function. The connection to the peer wait queue is again
dissolved if either a wake up is about to be relayed or the client
socket reconnects or a dead peer is detected or the client socket is
itself closed. This enables removing the second sock_poll_wait from
unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
that no blocked writer sleeps forever.

Signed-off-by: Rainer Weikusat
---

- use wait_queue_t passed as argument to _relay

- fix possible deadlock and logic error in _dgram_sendmsg by
  straightening the control flow ("spaghetti code considered
  confusing")

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index b36d837..2a91a05 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -62,6 +62,7 @@ struct unix_sock {
 #define UNIX_GC_CANDIDATE	0
 #define UNIX_GC_MAYBE_CYCLE	1
 	struct socket_wq	peer_wq;
+	wait_queue_t		peer_wake;
 };

 static inline struct unix_sock *unix_sk(const struct sock *sk)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 94f6582..4297d8e 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -326,6 +326,122 @@ found:
 	return s;
 }

+/* Support code for asymmetrically connected dgram sockets
+ *
+ * If a datagram socket is connected to a socket not itself connected
+ * to the first socket (eg, /dev/log), clients may only enqueue more
+ * messages if the present receive queue of the server socket is not
+ * "too large". This means there's a second writeability condition
+ * poll and sendmsg need to test. The dgram recv code will do a wake
+ * up on the peer_wait wait queue of a socket upon reception of a
+ * datagram which needs to be propagated to sleeping would-be writers
+ * since these might not have sent anything so far. This can't be
+ * accomplished via poll_wait because the lifetime of the server
+ * socket might be less than that of its clients if these break their
+ * association with it or if the server socket is closed while clients
+ * are still connected to it and there's no way to inform "a polling
+ * implementation" that it should let go of a certain wait queue
+ *
+ * In order to propagate a wake up, a wait_queue_t of the client
+ * socket is enqueued on the peer_wait queue of the server socket
+ * whose wake function does a wake_up on the ordinary client socket
+ * wait queue. This connection is established whenever a write (or
+ * poll for write) hit the flow control condition and broken when the
+ * association to the server socket is dissolved or after a wake up
+ * was relayed.
+ */
+
+static int unix_dgram_peer_wake_relay(wait_queue_t *q, unsigned mode, int flags,
+				      void *key)
+{
+	struct unix_sock *u;
+	wait_queue_head_t *u_sleep;
+
+	u = container_of(q, struct unix_sock, peer_wake);
+
+	__remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait,
+			    q);
+	u->peer_wake.private = NULL;
+
+	/* relaying can only happen while the wq still exists */
+	u_sleep = sk_sleep(&u->sk);
+	if (u_sleep)
+		wake_up_interruptible_poll(u_sleep, key);
+
+	return 0;
+}
+
+static int unix_dgram_peer_wake_connect(struct sock *sk, struct sock *other)
+{
+	struct unix_sock *u, *u_other;
+	int rc;
+
+	u = unix_sk(sk);
+	u_other = unix_sk(other);
+	rc = 0;
+
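The relay scheme described in the patch comment can be modelled in plain userspace C, which may help when reasoning about who is enqueued where. The following is only a sketch under simplifying assumptions: it is single-threaded and takes no locks (the kernel code relies on the wait queue lock and the socket state locks), the client's own wait queue is reduced to a `wakeups` counter, and every name (`struct model_sock`, `peer_wake_connect`, `peer_wait_wake_up`, and so on) is hypothetical.

```c
#include <stddef.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Entry of a circular doubly linked wait queue (hypothetical model). */
struct wq_entry {
	struct wq_entry *prev, *next;
	int (*func)(struct wq_entry *e);
	void *private;		/* peer this entry is enqueued on, or NULL */
};

struct model_sock {
	struct wq_entry peer_wait;	/* list head, woken by the receiver */
	struct wq_entry peer_wake;	/* enqueued on a peer's peer_wait */
	int wakeups;			/* stands in for the own wait queue */
};

static void entry_del(struct wq_entry *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
	e->prev = e->next = NULL;
}

/* Wake function: leave the peer queue, then relay the wake up. */
static int peer_wake_relay(struct wq_entry *e)
{
	struct model_sock *u = container_of(e, struct model_sock, peer_wake);

	entry_del(e);
	u->peer_wake.private = NULL;
	++u->wakeups;
	return 0;
}

static void model_sock_init(struct model_sock *s)
{
	s->peer_wait.prev = s->peer_wait.next = &s->peer_wait;
	s->peer_wait.func = NULL;
	s->peer_wait.private = NULL;
	s->peer_wake.prev = s->peer_wake.next = NULL;
	s->peer_wake.func = peer_wake_relay;
	s->peer_wake.private = NULL;
	s->wakeups = 0;
}

/* Returns 1 if a new connection to other's peer_wait was made. */
static int peer_wake_connect(struct model_sock *sk, struct model_sock *other)
{
	struct wq_entry *head = &other->peer_wait;

	if (sk->peer_wake.private)
		return 0;

	sk->peer_wake.private = other;
	sk->peer_wake.next = head->next;
	sk->peer_wake.prev = head;
	head->next->prev = &sk->peer_wake;
	head->next = &sk->peer_wake;
	return 1;
}

/* Returns 1 if sk was enqueued on other's peer_wait and got removed. */
static int peer_wake_disconnect(struct model_sock *sk, struct model_sock *other)
{
	if (!other || sk->peer_wake.private != other)
		return 0;

	entry_del(&sk->peer_wake);
	sk->peer_wake.private = NULL;
	return 1;
}

/* What the receiver does after dequeuing a datagram. */
static void peer_wait_wake_up(struct model_sock *server)
{
	struct wq_entry *e, *next;

	for (e = server->peer_wait.next; e != &server->peer_wait; e = next) {
		next = e->next;	/* e removes itself in its wake function */
		e->func(e);
	}
}
```

In this model a client calls `peer_wake_connect()` when it finds the server queue full, and a later `peer_wait_wake_up()` on the server both notifies and detaches every connected client, so nothing the server owns is referenced once the wake up has been relayed.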
[PATCH] unix: avoid use-after-free in ep_remove_wait_queue
An AF_UNIX datagram socket being the client in an n:1 association with
some server socket is only allowed to send messages to the server if the
receive queue of this socket contains at most sk_max_ack_backlog
datagrams. This implies that prospective writers might be forced to go
to sleep despite none of the message presently enqueued on the server
receive queue were sent by them. In order to ensure that these will be
woken up once space becomes again available, the present unix_dgram_poll
routine does a second sock_poll_wait call with the peer_wait wait queue
of the server socket as queue argument (unix_dgram_recvmsg does a wake
up on this queue after a datagram was received). This is inherently
problematic because the server socket is only guaranteed to remain alive
for as long as the client still holds a reference to it. In case the
connection is dissolved via connect or by the dead peer detection logic
in unix_dgram_sendmsg, the server socket may be freed despite "the
polling mechanism" (in particular, epoll) still has a pointer to the
corresponding peer_wait queue. There's no way to forcibly deregister a
wait queue with epoll.

Based on an idea by Jason Baron, the patch below changes the code such
that a wait_queue_t belonging to the client socket is enqueued on the
peer_wait queue of the server whenever the peer receive queue full
condition is detected by either a sendmsg or a poll. A wake up on the
peer queue is then relayed to the ordinary wait queue of the client
socket via wake function. The connection to the peer wait queue is again
dissolved if either a wake up is about to be relayed or the client
socket reconnects or a dead peer is detected or the client socket is
itself closed. This enables removing the second sock_poll_wait from
unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
that no blocked writer sleeps forever.

Signed-off-by: Rainer Weikusat
---

"Why do things always end up messy and complicated"?

Patch is against 4.3.0.

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index b36d837..2a91a05 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -62,6 +62,7 @@ struct unix_sock {
 #define UNIX_GC_CANDIDATE	0
 #define UNIX_GC_MAYBE_CYCLE	1
 	struct socket_wq	peer_wq;
+	wait_queue_t		peer_wake;
 };

 static inline struct unix_sock *unix_sk(const struct sock *sk)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 94f6582..4f263e3 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -326,6 +326,122 @@ found:
 	return s;
 }

+/* Support code for asymmetrically connected dgram sockets
+ *
+ * If a datagram socket is connected to a socket not itself connected
+ * to the first socket (eg, /dev/log), clients may only enqueue more
+ * messages if the present receive queue of the server socket is not
+ * "too large". This means there's a second writeability condition
+ * poll and sendmsg need to test. The dgram recv code will do a wake
+ * up on the peer_wait wait queue of a socket upon reception of a
+ * datagram which needs to be propagated to sleeping would-be writers
+ * since these might not have sent anything so far. This can't be
+ * accomplished via poll_wait because the lifetime of the server
+ * socket might be less than that of its clients if these break their
+ * association with it or if the server socket is closed while clients
+ * are still connected to it and there's no way to inform "a polling
+ * implementation" that it should let go of a certain wait queue
+ *
+ * In order to propagate a wake up, a wait_queue_t of the client
+ * socket is enqueued on the peer_wait queue of the server socket
+ * whose wake function does a wake_up on the ordinary client socket
+ * wait queue. This connection is established whenever a write (or
+ * poll for write) hit the flow control condition and broken when the
+ * association to the server socket is dissolved or after a wake up
+ * was relayed.
+ */
+
+static int unix_dgram_peer_wake_relay(wait_queue_t *q, unsigned mode, int flags,
+				      void *key)
+{
+	struct unix_sock *u;
+	wait_queue_head_t *u_sleep;
+
+	u = container_of(q, struct unix_sock, peer_wake);
+
+	__remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait,
+			    &u->peer_wake);
+	u->peer_wake.private = NULL;
+
+	/* relaying can only happen while the wq still exists */
+	u_sleep = sk_sleep(&u->sk);
+	if (u_sleep)
+		wake_up_interruptible_poll(u_sleep, key);
+
+	return 0;
+}
+
+static int unix_dgram_peer_wake_connect(struct sock *sk, struct sock *other)
+{
+	struct unix_sock *u, *u_other;
+	int rc;
+
+	u = unix_sk(sk);
+	u_other = unix_sk(other);
+	rc = 0;
+
+	spin_lock(&u_other->peer_wait.lock);
+
+	if (!u->peer_wake.private) {
+		u->peer_wake.private = other;
+
Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue
From: Rainer Weikusat
Date: Mon, 09 Nov 2015 14:40:48 +

> +	__remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait,
> +			    &u->peer_wake);

This is more simply:

	__remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait, q);

> +static inline int unix_dgram_peer_recv_ready(struct sock *sk,
> +					     struct sock *other)

Please do not use the inline keyword in foo.c files, let the compiler
decide.

Thanks.
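The simplification works because the wake function is handed a pointer to the very entry that is embedded in the socket structure, so `q` and `&u->peer_wake` name the same object once `u` has been recovered via `container_of`. A small userspace sketch of that relationship follows; the macro is a simplified version of the well-known idiom, and `struct entry`, `struct owner`, `owner_of` and `same_entry` are illustrative names, not kernel identifiers.

```c
#include <stddef.h>

/* Simplified container_of: step back from a member to its container. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct entry {
	void *private;
};

struct owner {
	int id;
	struct entry peer_wake;	/* embedded, like a wait queue entry */
};

/* A callback only receives the embedded entry and recovers the
 * enclosing object from it.
 */
static struct owner *owner_of(struct entry *q)
{
	return container_of(q, struct owner, peer_wake);
}

/* Returns 1 if q is the entry embedded in the recovered owner. */
static int same_entry(struct entry *q)
{
	struct owner *u = owner_of(q);

	return q == &u->peer_wake;
}
```

For any entry that really is the `peer_wake` member of a `struct owner`, `same_entry()` returns 1, which is why passing `q` directly is equivalent and shorter.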
Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue
On 11/09/2015 09:40 AM, Rainer Weikusat wrote:
> An AF_UNIX datagram socket being the client in an n:1 association with
> some server socket is only allowed to send messages to the server if the
> receive queue of this socket contains at most sk_max_ack_backlog
> datagrams. This implies that prospective writers might be forced to go
> to sleep despite none of the message presently enqueued on the server
> receive queue were sent by them. In order to ensure that these will be
> woken up once space becomes again available, the present unix_dgram_poll
> routine does a second sock_poll_wait call with the peer_wait wait queue
> of the server socket as queue argument (unix_dgram_recvmsg does a wake
> up on this queue after a datagram was received). This is inherently
> problematic because the server socket is only guaranteed to remain alive
> for as long as the client still holds a reference to it. In case the
> connection is dissolved via connect or by the dead peer detection logic
> in unix_dgram_sendmsg, the server socket may be freed despite "the
> polling mechanism" (in particular, epoll) still has a pointer to the
> corresponding peer_wait queue. There's no way to forcibly deregister a
> wait queue with epoll.
>
> Based on an idea by Jason Baron, the patch below changes the code such
> that a wait_queue_t belonging to the client socket is enqueued on the
> peer_wait queue of the server whenever the peer receive queue full
> condition is detected by either a sendmsg or a poll. A wake up on the
> peer queue is then relayed to the ordinary wait queue of the client
> socket via wake function. The connection to the peer wait queue is again
> dissolved if either a wake up is about to be relayed or the client
> socket reconnects or a dead peer is detected or the client socket is
> itself closed. This enables removing the second sock_poll_wait from
> unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
> that no blocked writer sleeps forever.
>
> Signed-off-by: Rainer Weikusat

[...]

> @@ -1590,10 +1723,14 @@ restart:
>  		goto out_unlock;
>  	}
>
> -	if (unix_peer(other) != sk && unix_recvq_full(other)) {
> +	if (!unix_dgram_peer_recv_ready(sk, other)) {
>  		if (!timeo) {
> -			err = -EAGAIN;
> -			goto out_unlock;
> +			if (unix_dgram_peer_wake_me(sk, other)) {
> +				err = -EAGAIN;
> +				goto out_unlock;
> +			}
> +
> +			goto restart;
>  		}

So this will cause 'unix_state_lock(other)' to be called twice in a
row if we 'goto restart' (and hence will softlock the box). It just
needs a 'unix_state_unlock(other);' before the 'goto restart'.

I also tested this patch with a single unix server and 200 client
threads doing roughly epoll() followed by write() until -EAGAIN in a
loop. The throughput for the test was roughly the same as current
upstream, but the cpu usage was a lot higher. I think it's b/c this patch
takes the server wait queue lock in the _poll() routine. This causes a
lot of contention. The previous patch you posted for this where you did
not clear the wait queue on every wakeup and thus didn't need the queue
lock in poll() (unless we were adding to it), performed much better.

However, the previous patch which tested better didn't add to the remote
queue when it was full on sendmsg() - so it wouldn't be correct for
epoll ET. Adding to the remote queue for every sendmsg() that fails does
seem undesirable, if we aren't even doing poll(). So I'm not sure if
just going back to the previous patch is a great option either.

I'm also not sure how realistic the test case I have is. It would be
great if we had some other workloads to test against.

Thanks,

-Jason
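The double-lock problem can be reproduced in miniature in userspace. The sketch below uses a POSIX error-checking mutex as a stand-in for the socket state spinlock; this is an assumption made purely for illustration, since an error-checking mutex reports the second lock attempt as `EDEADLK` whereas a kernel spinlock would simply spin. `relock_result` is a made-up helper name.

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <pthread.h>

/* Take a lock, then try to take it again from the same thread, the
 * way a 'goto restart' without a prior unlock would. Returns the
 * result of the second lock attempt.
 */
static int relock_result(void)
{
	pthread_mutexattr_t attr;
	pthread_mutex_t m;
	int rc;

	pthread_mutexattr_init(&attr);
	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	pthread_mutex_init(&m, &attr);
	pthread_mutexattr_destroy(&attr);

	pthread_mutex_lock(&m);		/* first pass through 'restart:' */
	rc = pthread_mutex_lock(&m);	/* second pass, lock still held */

	pthread_mutex_unlock(&m);
	pthread_mutex_destroy(&m);

	return rc;
}
```

With the error-checking mutex type, the second attempt fails with `EDEADLK` instead of hanging, which makes the missing unlock before the jump easy to see.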