Re: [PATCH v2 1/3] unix: fix use-after-free in unix_dgram_poll()
Rainer Weikusat writes:
> Jason Baron writes:
>> On 10/18/2015 04:58 PM, Rainer Weikusat wrote:

[...]

>> 1)
>>
>> In unix_peer_wake_relay() function, 'sk_wq' is an __rcu pointer and
>> thus it requires proper dereferencing. Something like:
>>
>> struct unix_sock *u;
>> struct socket_wq *wq;
>>
>> u = container_of(wait, struct unix_sock, wait);
>> rcu_read_lock();
>> wq = rcu_dereference(u->sk.sk_wq);
>> if (wq_has_sleeper(wq))
>> 	wake_up_interruptible_sync_poll(&wq->wait, key);
>> rcu_read_unlock();
>
> I think this may be unnecessary

I consider this unnecessary now. Rationale: The wait queue is allocated
and freed in tandem with the socket inode, which means it will remain
allocated until after the protocol release function (unix_release, with
the bulk of the implementation being in unix_release_sock) has
returned. As the connection with the other socket is broken in
unix_release_sock, any relayed wake up must have completed before this
time (since both operations take place while holding the same wait
queue lock).

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 1/3] unix: fix use-after-free in unix_dgram_poll()
Jason Baron writes:
> On 10/18/2015 04:58 PM, Rainer Weikusat wrote:
>
> [...]
>
>> The idea behind 'the wait queue' (insofar I'm aware of it) is that it
>> will be used as list of threads who need to be notified when the
>> associated event occurs. Since you seem to argue that the
>> run-of-the-mill algorithm is too slow for this particular case, is
>> there anything to back this up?
>
> Generally the poll() routines only add to a wait queue once at the
> beginning, and all subsequent calls to poll() simply check the wakeup
> conditions. So here you are proposing to add/remove to the wait queue
> on subsequent invocations of poll(). So the initial patch I did,
> continued in the usual pattern and only added once on registration or
> connect().

The code uses the private member of a wait_queue_t to record if the
add_wait_queue was already executed, so the add/remove will only happen
if the wakeup condition changed in the meantime (which usually ought to
be the case, though). As far as I understand this, this really only
makes a difference for epoll, as only epoll will keep everything on the
wait queues managed by it between 'polling calls'. In order to support
epoll-style wait queue management outside of epoll, the poll management
code would need to execute a cleanup callback instead of just the setup
callback it already executes.

> 1)
>
> In unix_peer_wake_relay() function, 'sk_wq' is an __rcu pointer and
> thus it requires proper dereferencing. Something like:
>
> struct unix_sock *u;
> struct socket_wq *wq;
>
> u = container_of(wait, struct unix_sock, wait);
> rcu_read_lock();
> wq = rcu_dereference(u->sk.sk_wq);
> if (wq_has_sleeper(wq))
> 	wake_up_interruptible_sync_poll(&wq->wait, key);
> rcu_read_unlock();

I think this may be unnecessary, but I need more time to check this
than the odd "half an hour after work after 11pm [UK time]" I could put
into this today.
> 2)
>
> For the case of epoll() in edge triggered mode we need to ensure that
> when we return -EAGAIN from unix_dgram_sendmsg() when
> unix_recvq_full() is true, we need to add a unix_peer_wake_connect()
> call to guarantee a wakeup. Otherwise, we are going to potentially
> hang there.

I consider this necessary.
Re: [PATCH v2 1/3] unix: fix use-after-free in unix_dgram_poll()
On 10/18/2015 04:58 PM, Rainer Weikusat wrote:

[...]

> The idea behind 'the wait queue' (insofar I'm aware of it) is that it
> will be used as list of threads who need to be notified when the
> associated event occurs. Since you seem to argue that the
> run-of-the-mill algorithm is too slow for this particular case, is
> there anything to back this up?

Generally the poll() routines only add to a wait queue once at the
beginning, and all subsequent calls to poll() simply check the wakeup
conditions. So here you are proposing to add/remove to the wait queue
on subsequent invocations of poll(). So the initial patch I did,
continued in the usual pattern and only added once on registration or
connect(). However, I do think that this is a special case since the
registration is on a shared wait queue, and thus having a long list of
registered waiters is going to affect all waiters. So I am fine with
doing the add/removes in the poll() routine, and I agree that the patch
below is more compact than what I initially posted.

A couple of notes on your patch:

1)

In unix_peer_wake_relay() function, 'sk_wq' is an __rcu pointer and
thus it requires proper dereferencing. Something like:

struct unix_sock *u;
struct socket_wq *wq;

u = container_of(wait, struct unix_sock, wait);
rcu_read_lock();
wq = rcu_dereference(u->sk.sk_wq);
if (wq_has_sleeper(wq))
	wake_up_interruptible_sync_poll(&wq->wait, key);
rcu_read_unlock();

2)

For the case of epoll() in edge triggered mode we need to ensure that
when we return -EAGAIN from unix_dgram_sendmsg() when unix_recvq_full()
is true, we need to add a unix_peer_wake_connect() call to guarantee a
wakeup. Otherwise, we are going to potentially hang there.

With these changes (or tell me why they are not needed), I'm happy to
ack this patch.

Thanks,

-Jason
Re: [PATCH v2 1/3] unix: fix use-after-free in unix_dgram_poll()
Jason Baron writes:
>>
>> X-Signed-Off-By: Rainer Weikusat
>>

Sorry for the delayed reply, but I had to do some catching up on what
the people who pay me consider 'useful work' :-).

> So the patches I've posted and yours both use the idea of relaying
> the remote peer wakeup via callbacks that are internal to net/unix,
> such that we avoid exposing the remote peer wakeup to the external
> poll()/select()/epoll(). They differ in when and how those callbacks
> are registered/unregistered.

Yes. In my opinion, that's the next best idea considering that the need
to handle this particular situation is presumably specific to the
af_unix.c code and thus doesn't warrant changes to all I/O multiplexing
implementations.

> So I think your approach here will generally keep the peer wait wakeup
> queue to its absolute minimum, by removing from that queue when
> we set POLLOUT, however it requires taking the peer waitqueue lock on
> every poll() call.
>
> So I think there are tradeoffs here vs. what I've posted. So for
> example, if there are a lot of writers against one 'server' socket,
> there is going to be a lot of lock contention with your approach here.
> So I think the performance is going to depend on the workload that is
> tested.

This approach is really completely run-of-the-mill "possibly sleep
until some event occurs" code, eg, you'll find a description of this
exact procedure on p. 157 of chapter 6 of

https://lwn.net/Kernel/LDD3/

The idea behind 'the wait queue' (insofar I'm aware of it) is that it
will be used as list of threads who need to be notified when the
associated event occurs. Since you seem to argue that the
run-of-the-mill algorithm is too slow for this particular case, is
there anything to back this up?

This is also sort-of a zero-sum game, as the idea to enqueue on a wait
queue because the event could possibly become interesting in future
really just shifts (for n:1 connected sockets) work from the (possibly
many) clients to the single server.
And considering that 'the wait' became necessary (if it became
necessary) because flow control kicked in to stop clients from sending
more data to the server until it had time to catch up with the already
sent data, this doesn't seem like a good idea to me.

Using a flag to signal that at least some of the members of this queue
actually want to be woken up will also only save work if it is assumed
that most threads won't ever really be waiting for this, ie, won't
execute the corresponding unix_dgram_poll code. But if that code isn't
going to be executed most of the time anyway, why optimize it?

It's also not really necessary to take the waitqueue lock on every
poll, eg, the code in unix_dgram_poll could be changed like this:

	need_wakeup = 0;
	unix_state_lock(sk);

	other = unix_peer(sk);
	if (other && unix_peer(other) != sk) {
		if (unix_recvq_full(other)) {
			need_wakeup = !unix_peer_wake_connect(sk, other);

			if (unix_recvq_full(other)) {
				writable = 0;
				need_wakeup = 0;
			} else
				unix_peer_wake_disconnect(sk, other);
		} else
			need_wakeup = unix_peer_wake_disconnect(sk, other);
	}

	unix_state_unlock(sk);

	if (need_wakeup)
		wake_up_interruptible_poll(sk_sleep(sk), POLL_OUT_ALL);

I'm no big fan of adding complications like this based on the
conjecture that they might be useful, but this one seems justified to
me. IMHO, there are two interesting cases here:

1. The current thread won't have to wait. It should detect this quickly
so that it can get back to work, hence, check for this before
'preparing to wait'.

2. Nothing to be done here right now, hence, performing an additional
check for 'still not writeable' shouldn't matter much.
Yet another 3.2.54 patch with the change above

---
--- linux-2-6.b/net/unix/af_unix.c	2015-10-18 20:41:10.829404899 +0100
+++ linux-2-6/net/unix/af_unix.c	2015-10-18 20:17:34.819134482 +0100
@@ -115,6 +115,8 @@
 #include
 #include

+#define POLL_OUT_ALL	(POLLOUT | POLLWRNORM | POLLWRBAND)
+
 static struct hlist_head unix_socket_table[UNIX_HASH_SIZE + 1];
 static DEFINE_SPINLOCK(unix_table_lock);
 static atomic_long_t unix_nr_socks;
@@ -303,6 +305,53 @@ found:
 	return s;
 }

+static int unix_peer_wake_relay(wait_queue_t *q, unsigned mode, int flags,
+				void *key)
+{
+	struct unix_sock *u;
+	wait_queue_head_t *u_sleep;
+
+	u = container_of(q, struct unix_sock, peer_wake);
+
+	u_sleep = sk_sleep(&u->sk);
+	if (u_sleep)
+
Re: [PATCH v2 1/3] unix: fix use-after-free in unix_dgram_poll()
Jason Baron writes:
> On 10/12/2015 04:41 PM, Rainer Weikusat wrote:
>> Jason Baron writes:
>>> On 10/05/2015 12:31 PM, Rainer Weikusat wrote:

[...]

>> OTOH, something I seriously dislike about your relaying
>> implementation is the unconditional add_wait_queue upon connect as
>> this builds up a possibly large wait queue of entities entirely
>> uninterested in the event which will need to be traversed even if
>> peer_wait wakeups will only happen if at least someone actually wants
>> to be notified. This could be changed such that the struct unix_sock
>> member is only put onto the peer's wait queue in poll and only if it
>> hasn't already been put onto it. The connection could then be severed
>> if some kind of 'disconnect' occurs.

[...]

> What about the following race?
>
> 1) thread A does poll() on f, finds the wakeup condition low, and adds
> itself to the remote peer_wait queue.
>
> 2) thread B sets the wake up condition in dgram_recvmsg(), but does
> not execute the wakeup of threads yet.
>
> 3) thread C also does a poll() on f, finds now that the wakeup
> condition is set, and hence removes f from the remote peer_wait queue.
>
> Then, thread A misses the POLLOUT, and hangs.

Indeed. This starts to turn into an interesting little problem. Based
on the code I posted, there are (AFAICT) four possible situations a
thread looking for 'peer will accept data' can find itself in:

A - supposed to mean 'it connected to the peer_wait queue'
B - the peer socket is ready to accept messages

A && B

Since A, no other thread can be waiting for this event and since B, A
was undone. Fine to return.

!A && B

!A means 'was enqueued to peer_wait during an earlier call'. This means
any number of threads may be waiting. Because of this, do an additional
wake up after disconnecting, since the current thread may never have
slept, ie, possibly, no wake up occurred yet.

A && !B

Now connected to receive peer wake ups. Go to sleep.

!A && !B

Someone else connected. Go to sleep.
> I understand your concern about POLLIN only waiters - I think we
> could add the 'relay callback' in dgram_poll() only for those who are
> looking for POLLOUT, and simply avoid the de-registration,

Whether or not the currently running thread wanted to write or to read
can only be determined (AIUI) if a poll_table was passed. If this was
the case and the thread didn't look for 'write space', it will never
execute the code in question, cf

	/* No write status requested, avoid expensive OUT tests. */
	if (wait && !(wait->key & POLL_OUT_ALL))
		return mask;

As to 'concerns', I'll try to word that somewhat differently: The usual
way to handle a 'may sleep' situation is

1. Enqueue on the associated wait_queue.

2. Test the condition

2a) No sleep, dequeue and return

2b) Sleep until woken up

This means the wait queue is used to 'register' threads which want to
be woken up if a certain event happens. It's not used such that a
thread first enqueues itself on all wait queues it could ever need to
be enqueued on and that an associated flag is used to inform the code
on the way to doing a wake up whether or not any of the entries on the
list actually corresponds to someone waiting for one. Even if such a
flag exists, the time needed to do the wake up is still proportional to
the size of the list.

For the given case, take a usual socketpair: It's absolutely pointless
to put these on the peer_wait queues of each other because all
datagrams sent by one go to the other, ie, a situation where a thread
goes to sleep because it has to wait until data written via unrelated
sockets has been consumed can never happen.

I'll include the patch I'm currently using below, including an
X-Signed-Off-By line to signify that this is the result of kernel
development sponsored by my employer and made available under the usual
terms.
Changes since the last version:

- fixed the inverted (rather accidentally uninverted) test in _connect

- simplified the peer-checking logic: Instead of getting the peer with
  the state locked and then doing all kinds of checks which will
  hopefully enable avoiding to lock the state again, then, possibly,
  lock it again, lock the socket, do the peer checks followed by
  whatever else is necessary, then unlock it

- addressed the race described above by means of doing a wake_up on
  sk_sleep(sk) if a thread breaks the connection which didn't also
  establish it

- a comment (yohoho) explaining all of this -- I don't usually use
  comments in my own code because I consider them rather an annoyance
  than helpful

Unrelated change:

- #defined POLL_OUT_ALL (POLLOUT | POLLWRNORM | POLLWRBAND) and use
  that instead of the explicit triple

X-Signed-Off-By: Rainer Weikusat
Re: [PATCH v2 1/3] unix: fix use-after-free in unix_dgram_poll()
>
> X-Signed-Off-By: Rainer Weikusat
>

Hi,

So the patches I've posted and yours both use the idea of relaying the
remote peer wakeup via callbacks that are internal to net/unix, such
that we avoid exposing the remote peer wakeup to the external
poll()/select()/epoll(). They differ in when and how those callbacks
are registered/unregistered.

So I think your approach here will generally keep the peer wait wakeup
queue to its absolute minimum, by removing from that queue when we set
POLLOUT, however it requires taking the peer waitqueue lock on every
poll() call. So I think there are tradeoffs here vs. what I've posted.
So for example, if there are a lot of writers against one 'server'
socket, there is going to be a lot of lock contention with your
approach here. So I think the performance is going to depend on the
workload that is tested.

I don't have a specific workload that I am trying to solve here, but
since you introduced the waiting on the remote peer queue, perhaps you
can post numbers comparing the patches that have been posted for the
workload that this was developed for? I will send you the latest
version of what I have privately - so as not to overly spam the list.

Thanks,

-Jason
Re: [PATCH v2 1/3] unix: fix use-after-free in unix_dgram_poll()
On 10/12/2015 04:41 PM, Rainer Weikusat wrote:
> Jason Baron writes:
>> On 10/05/2015 12:31 PM, Rainer Weikusat wrote:
>
> [...]
>
>>> Here's a more simple idea which _might_ work. The underlying problem
>>> seems to be that the second sock_poll_wait introduces a covert
>>> reference to the peer socket which isn't accounted for. The basic
>>> idea behind this is to execute an additional sock_hold for the peer
>>> whenever the sock_poll_wait is called for it and store it (the
>>> struct sock *) in a new struct unix_sock member.
>
> [...]
>
>> Interesting - will this work for the test case you supplied where we
>> are in epoll_wait() and another thread does a connect() to a new
>> target? In that case, even if we issue a wakeup to the epoll thread,
>> its not going to have a non-NULL poll_table, so it wouldn't be added
>> to the new target. IE the first test case here:
>>
>> https://lkml.org/lkml/2015/10/4/154
>
> "Indeed it would not." I've also meanwhile found the time to check
> what is and isn't locked here and found that Eric's "this looks racy"
> was also justified. In theory, a clean solution could be based on
> modifying the various polling implementations to keep a piece of data
> for a polled something and provide that again on each subsequent poll
> call. This could then be used to keep the peer socket alive for as
> long as necessary were it possible to change the set of wait queues
> with every poll call. Since this also isn't the case, the idea to
> increment the reference count of the peer socket won't fly.
>
> OTOH, something I seriously dislike about your relaying implementation
> is the unconditional add_wait_queue upon connect as this builds up a
> possibly large wait queue of entities entirely uninterested in the
> event which will need to be traversed even if peer_wait wakeups will
> only happen if at least someone actually wants to be notified.
> This could be changed such that the struct unix_sock member is only
> put onto the peer's wait queue in poll and only if it hasn't already
> been put onto it. The connection could then be severed if some kind of
> 'disconnect' occurs.
>
> The code below (again based on 3.2.54) is what I'm currently running
> and it has survived testing during the day (without trying the
> exercise in hexadecimal as that doesn't cause failure for me, anyway).
> The wakeup relaying function checks that a socket wait queue actually
> still exists because I used to get null pointer oopses every now and
> then (I've also tested this with an additional printk printing 'no q'
> in case the pointer was actually null to verify that this really
> occurs here).

Hi,

What about the following race?

1) thread A does poll() on f, finds the wakeup condition low, and adds
itself to the remote peer_wait queue.

2) thread B sets the wake up condition in dgram_recvmsg(), but does not
execute the wakeup of threads yet.

3) thread C also does a poll() on f, finds now that the wakeup
condition is set, and hence removes f from the remote peer_wait queue.

Then, thread A misses the POLLOUT, and hangs.

I understand your concern about POLLIN only waiters - I think we could
add the 'relay callback' in dgram_poll() only for those who are looking
for POLLOUT, and simply avoid the de-registration, as in practice I
think its unlikely we are going to have a socket switching from POLLOUT
to *only* POLLIN. I suspect that will cover most of the cases that
concern you?

Thanks,

-Jason
Re: [PATCH v2 1/3] unix: fix use-after-free in unix_dgram_poll()
Jason Baron writes:
> On 10/05/2015 12:31 PM, Rainer Weikusat wrote:

[...]

>> Here's a more simple idea which _might_ work. The underlying problem
>> seems to be that the second sock_poll_wait introduces a covert
>> reference to the peer socket which isn't accounted for. The basic
>> idea behind this is to execute an additional sock_hold for the peer
>> whenever the sock_poll_wait is called for it and store it (the struct
>> sock *) in a new struct unix_sock member.

[...]

> Interesting - will this work for the test case you supplied where we
> are in epoll_wait() and another thread does a connect() to a new
> target? In that case, even if we issue a wakeup to the epoll thread,
> its not going to have a non-NULL poll_table, so it wouldn't be added
> to the new target. IE the first test case here:
>
> https://lkml.org/lkml/2015/10/4/154

"Indeed it would not." I've also meanwhile found the time to check what
is and isn't locked here and found that Eric's "this looks racy" was
also justified. In theory, a clean solution could be based on modifying
the various polling implementations to keep a piece of data for a
polled something and provide that again on each subsequent poll call.
This could then be used to keep the peer socket alive for as long as
necessary were it possible to change the set of wait queues with every
poll call. Since this also isn't the case, the idea to increment the
reference count of the peer socket won't fly.

OTOH, something I seriously dislike about your relaying implementation
is the unconditional add_wait_queue upon connect as this builds up a
possibly large wait queue of entities entirely uninterested in the
event which will need to be traversed even if peer_wait wakeups will
only happen if at least someone actually wants to be notified. This
could be changed such that the struct unix_sock member is only put onto
the peer's wait queue in poll and only if it hasn't already been put
onto it.
The connection could then be severed if some kind of 'disconnect'
occurs.

The code below (again based on 3.2.54) is what I'm currently running
and it has survived testing during the day (without trying the exercise
in hexadecimal as that doesn't cause failure for me, anyway). The
wakeup relaying function checks that a socket wait queue actually still
exists because I used to get null pointer oopses every now and then
(I've also tested this with an additional printk printing 'no q' in
case the pointer was actually null to verify that this really occurs
here).

--
--- linux-2-6.b/net/unix/af_unix.c	2015-10-12 21:07:27.237102277 +0100
+++ linux-2-6/net/unix/af_unix.c	2015-10-12 21:10:26.319384298 +0100
@@ -303,6 +303,51 @@ found:
 	return s;
 }

+static int unix_peer_wake_relay(wait_queue_t *q, unsigned mode, int flags,
+				void *key)
+{
+	struct unix_sock *u;
+	wait_queue_head_t *u_sleep;
+
+	u = container_of(q, struct unix_sock, peer_wake);
+
+	u_sleep = sk_sleep(&u->sk);
+	if (u_sleep)
+		wake_up_interruptible_poll(u_sleep, key);
+
+	return 0;
+}
+
+static void unix_peer_wake_connect(struct sock *me, struct sock *peer)
+{
+	struct unix_sock *u, *u_peer;
+
+	u = unix_sk(me);
+	u_peer = unix_sk(peer);
+
+	if (!u->peer_wake.private)
+		return;
+
+	u->peer_wake.private = peer;
+	add_wait_queue(&u_peer->peer_wait, &u->peer_wake);
+}
+
+static int unix_peer_wake_disconnect(struct sock *me, struct sock *peer)
+{
+	struct unix_sock *u, *u_peer;
+
+	u = unix_sk(me);
+	u_peer = unix_sk(peer);
+
+	if (u->peer_wake.private != peer)
+		return 0;
+
+	remove_wait_queue(&u_peer->peer_wait, &u->peer_wake);
+	u->peer_wake.private = NULL;
+
+	return 1;
+}
+
 static inline int unix_writable(struct sock *sk)
 {
 	return (atomic_read(&sk->sk_wmem_alloc) << 2) <= sk->sk_sndbuf;
@@ -409,6 +454,8 @@ static void unix_release_sock(struct soc
 			skpair->sk_state_change(skpair);
 			sk_wake_async(skpair, SOCK_WAKE_WAITD, POLL_HUP);
 		}
+
+		unix_peer_wake_disconnect(sk, skpair);
 		sock_put(skpair); /* It may now die */
 		unix_peer(sk) = NULL;
 	}
@@ -630,6 +677,7 @@ static struct sock *unix_create1(struct
 	INIT_LIST_HEAD(&u->link);
 	mutex_init(&u->readlock); /* single task reading lock */
 	init_waitqueue_head(&u->peer_wait);
+	init_waitqueue_func_entry(&u->peer_wake, unix_peer_wake_relay);
 	unix_insert_socket(unix_sockets_unbound, sk);
 out:
 	if (sk == NULL)
@@ -953,7 +1001,7 @@ static int unix_dgram_connect(struct soc
 	struct sockaddr_un *sunaddr = (struct sockaddr_un *)addr;
 	struct sock *other;
 	unsigned hash;
-	int err;
+	int err, disconned;

 	if (addr->sa_family != AF_UNSPEC) {
 		err =
Re: [PATCH v2 1/3] unix: fix use-after-free in unix_dgram_poll()
On 10/05/2015 12:31 PM, Rainer Weikusat wrote:
> Jason Baron writes:
>> The unix_dgram_poll() routine calls sock_poll_wait() not only for the
>> wait queue associated with the socket s that we are poll'ing against,
>> but also calls sock_poll_wait() for a remote peer socket p, if it is
>> connected. Thus, if we call poll()/select()/epoll() for the socket s,
>> there are then a couple of code paths in which the remote peer socket
>> p and its associated peer_wait queue can be freed before
>> poll()/select()/epoll() have a chance to remove themselves from the
>> remote peer socket.
>>
>> The ways that the remote peer socket can be freed are:
>>
>> 1. If s calls connect() to connect to a new socket other than p, it
>> will drop its reference on p, and thus a close() on p will free it.
>>
>> 2. If we call close() on p, then a subsequent sendmsg() from s will
>> drop the final reference to p, allowing it to be freed.
>
> Here's a more simple idea which _might_ work. The underlying problem
> seems to be that the second sock_poll_wait introduces a covert
> reference to the peer socket which isn't accounted for. The basic idea
> behind this is to execute an additional sock_hold for the peer
> whenever the sock_poll_wait is called for it and store it (the struct
> sock *) in a new struct unix_sock member. Upon entering
> unix_dgram_poll, if the member is not NULL, it's cleared and a
> sock_put for its former value is done. The 'poll peer not NULL ->
> sock_put it' code is also added to the destructor, although I'm unsure
> if this is really necessary. The patch below also includes the
> additional SOCK_DEAD test suggested by Martin as that seems generally
> sensible to me.
>
> NB: This has survived both Martin's and my test programs for a number
> of executions/ longer periods of time than was common before without
> generating list corruption warnings.
> The patch below is against 'my' 3.2.54 and is here provided as a
> suggestion in the hope that it will be useful for something, not as a
> patch submission, as I spent less time thinking through this than I
> should ideally have, but despite of this, it's another 2.5 hours of my
> life spent on something completely different than what I should be
> working on at the moment.
>
> --
> diff -pru linux-2-6/include/net/af_unix.h linux-2-6.p/include/net/af_unix.h
> --- linux-2-6/include/net/af_unix.h	2014-01-20 21:52:53.0 +
> +++ linux-2-6.p/include/net/af_unix.h	2015-10-05 15:11:20.270958297 +0100
> @@ -50,6 +50,7 @@ struct unix_sock {
>  	struct vfsmount	*mnt;
>  	struct mutex	readlock;
>  	struct sock	*peer;
> +	struct sock	*poll_peer;
>  	struct sock	*other;
>  	struct list_head	link;
>  	atomic_long_t	inflight;
> diff -pru linux-2-6/net/unix/af_unix.c linux-2-6.p/net/unix/af_unix.c
> --- linux-2-6/net/unix/af_unix.c	2014-01-22 16:51:52.0 +
> +++ linux-2-6.p/net/unix/af_unix.c	2015-10-05 17:05:28.358273567 +0100
> @@ -361,6 +361,9 @@ static void unix_sock_destructor(struct
>  	if (u->addr)
>  		unix_release_addr(u->addr);
>
> +	if (u->poll_peer)
> +		sock_put(u->poll_peer);
> +
>  	atomic_long_dec(&unix_nr_socks);
>  	local_bh_disable();
>  	sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
> @@ -625,6 +628,7 @@ static struct sock *unix_create1(struct
>  	u = unix_sk(sk);
>  	u->dentry = NULL;
>  	u->mnt = NULL;
> +	u->poll_peer = NULL;
>  	spin_lock_init(&u->lock);
>  	atomic_long_set(&u->inflight, 0);
>  	INIT_LIST_HEAD(&u->link);
> @@ -2135,8 +2139,16 @@ static unsigned int unix_poll(struct fil
>  static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
>  				    poll_table *wait)
>  {
> -	struct sock *sk = sock->sk, *other;
> -	unsigned int mask, writable;
> +	struct sock *sk = sock->sk, *other, *pp;
> +	struct unix_sock *u;
> +	unsigned int mask, writable, dead;
> +
> +	u = unix_sk(sk);
> +	pp = u->poll_peer;
> +	if (pp) {
> +		u->poll_peer = NULL;
> +		sock_put(pp);
> +	}
>
> 	sock_poll_wait(file,
sk_sleep(sk), wait);
>  	mask = 0;
> @@ -2170,7 +2182,20 @@ static unsigned int unix_dgram_poll(stru
>  	other = unix_peer_get(sk);
>  	if (other) {
>  		if (unix_peer(other) != sk) {
> -			sock_poll_wait(file, &unix_sk(other)->peer_wait, wait);
> +			unix_state_lock(other);
> +
> +			dead = sock_flag(other, SOCK_DEAD);
> +			if (!dead)
> +				sock_poll_wait(file, &unix_sk(other)->peer_wait,
> +					       wait);
> +
> +			unix_state_unlock(other);
> +
> +			if (!dead) {
> +				u->poll_peer = other;
> +
Re: [PATCH v2 1/3] unix: fix use-after-free in unix_dgram_poll()
On Mon, 2015-10-05 at 17:31 +0100, Rainer Weikusat wrote:

> 	atomic_long_set(&u->inflight, 0);
> 	INIT_LIST_HEAD(&u->link);
> @@ -2135,8 +2139,16 @@ static unsigned int unix_poll(struct fil
>  static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
>  				    poll_table *wait)
>  {
> -	struct sock *sk = sock->sk, *other;
> -	unsigned int mask, writable;
> +	struct sock *sk = sock->sk, *other, *pp;
> +	struct unix_sock *u;
> +	unsigned int mask, writable, dead;
> +
> +	u = unix_sk(sk);
> +	pp = u->poll_peer;
> +	if (pp) {
> +		u->poll_peer = NULL;
> +		sock_put(pp);
> +	}

This looks racy.
Multiple threads could use poll() at the same time, and you would have
too many sock_put()
Re: [PATCH v2 1/3] unix: fix use-after-free in unix_dgram_poll()
Eric Dumazet writes:
> On Mon, 2015-10-05 at 17:31 +0100, Rainer Weikusat wrote:
>
>> 	atomic_long_set(&u->inflight, 0);
>> 	INIT_LIST_HEAD(&u->link);
>> @@ -2135,8 +2139,16 @@ static unsigned int unix_poll(struct fil
>>  static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
>>  				     poll_table *wait)
>>  {
>> -	struct sock *sk = sock->sk, *other;
>> -	unsigned int mask, writable;
>> +	struct sock *sk = sock->sk, *other, *pp;
>> +	struct unix_sock *u;
>> +	unsigned int mask, writable, dead;
>> +
>> +	u = unix_sk(sk);
>> +	pp = u->poll_peer;
>> +	if (pp) {
>> +		u->poll_peer = NULL;
>> +		sock_put(pp);
>> +	}
>
> This looks racy.
> Multiple threads could use poll() at the same time,
> and you would have too many sock_put()

That's one of the reasons why I wrote "might work": The use of a single
structure member without any locking for the sock_poll_wait suggests
that this is taken care of in some other way, as does the absence of
any comment about that in the 'public' LDDs ("Linux Device Drivers"),
however, I don't really know if this is true. If not, this simple idea
can't work.
Re: [PATCH v2 1/3] unix: fix use-after-free in unix_dgram_poll()
Rainer Weikusat writes:
> Mathias Krause writes:
>> On 2 October 2015 at 22:43, Jason Baron wrote:
>>> The unix_dgram_poll() routine calls sock_poll_wait() not only for
>>> the wait queue associated with the socket s that we are poll'ing
>>> against, but also calls
>
> [useless full-quote removed]
>
>> My reproducer runs on this patch for more than 3 days now without
>> triggering anything anymore.
>
> Since the behaviour of your program is random, using it to "test"
> anything doesn't really provide any insight: It could have been
> executing the same codepath which doesn't happen to trigger any
> problems for all of these three days. Nobody can tell.

Since this "strangely" seems to have been lost in the thread: Here's
the test program showing that the reconnect while in epoll actually
causes a problem (at least I think so):

#include <sys/socket.h>
#include <sys/un.h>
#include <sys/epoll.h>
#include <pthread.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <stdlib.h>

static int sk, tg0, tg1;

static void *epoller(void *unused)
{
	struct epoll_event epev;
	int epfd;

	epfd = epoll_create(1);
	if (epfd == -1)
		exit(0);

	epev.events = EPOLLOUT;
	epoll_ctl(epfd, EPOLL_CTL_ADD, sk, &epev);
	epoll_wait(epfd, &epev, 1, 5000);

	close(sk);
	execl("./a.out", "./a.out", (void *)0);

	return NULL;
}

int main(void)
{
	struct sockaddr_un sun;
	pthread_t tid;
	int rc;

	sun.sun_family = AF_UNIX;

	tg0 = socket(AF_UNIX, SOCK_DGRAM, 0);
	strncpy(sun.sun_path, "/tmp/tg0", sizeof(sun.sun_path));
	unlink(sun.sun_path);
	bind(tg0, (struct sockaddr *)&sun, sizeof(sun));

	tg1 = socket(AF_UNIX, SOCK_DGRAM, 0);
	strncpy(sun.sun_path, "/tmp/tg1", sizeof(sun.sun_path));
	unlink(sun.sun_path);
	bind(tg1, (struct sockaddr *)&sun, sizeof(sun));

	sk = socket(AF_UNIX, SOCK_DGRAM, 0);
	connect(sk, (struct sockaddr *)&sun, sizeof(sun));
	fcntl(sk, F_SETFL, fcntl(sk, F_GETFL) | O_NONBLOCK);
	while ((rc = write(sk, "bla", 3)) != -1);

	pthread_create(&tid, NULL, epoller, NULL);

	usleep(5);
	strncpy(sun.sun_path, "/tmp/tg0", sizeof(sun.sun_path));
	connect(sk, (struct sockaddr *)&sun, sizeof(sun));
	close(tg1);
	pause();

	return 0;
}

-- And
here the other, demonstrating the poller not being woken up despite
being able to write something:

--
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	struct sockaddr_un sun;
	struct pollfd pfd;
	int tg, sk0, sk1, rc;
	char buf[16];

	sun.sun_family = AF_UNIX;
	tg = socket(AF_UNIX, SOCK_DGRAM, 0);
	strncpy(sun.sun_path, "/tmp/tg", sizeof(sun.sun_path));
	unlink(sun.sun_path);
	bind(tg, (struct sockaddr *)&sun, sizeof(sun));

	sk0 = socket(AF_UNIX, SOCK_DGRAM, 0);
	connect(sk0, (struct sockaddr *)&sun, sizeof(sun));

	sk1 = socket(AF_UNIX, SOCK_DGRAM, 0);
	connect(sk1, (struct sockaddr *)&sun, sizeof(sun));

	fcntl(sk0, F_SETFL, fcntl(sk0, F_GETFL) | O_NONBLOCK);
	fcntl(sk1, F_SETFL, fcntl(sk1, F_GETFL) | O_NONBLOCK);

	while (write(sk0, "bla", 3) != -1);

	if (fork() == 0) {
		pfd.fd = sk1;
		pfd.events = POLLOUT;
		rc = poll(&pfd, 1, -1);
		_exit(0);
	}

	sleep(3);
	read(tg, buf, sizeof(buf));
	wait(NULL);

	return 0;
}
--
Re: [PATCH v2 1/3] unix: fix use-after-free in unix_dgram_poll()
Mathias Krause writes:
> On 2 October 2015 at 22:43, Jason Baron wrote:
>> The unix_dgram_poll() routine calls sock_poll_wait() not only for the wait
>> queue associated with the socket s that we are poll'ing against, but also
>> calls

[useless full-quote removed]

> My reproducer runs on this patch for more than 3 days now without
> triggering anything anymore.

Since the behaviour of your program is random, using it to "test"
anything doesn't really provide any insight: It could have been
executing the same codepath which doesn't happen to trigger any problems
for all of these three days. Nobody can tell.
[PATCH v2 1/3] unix: fix use-after-free in unix_dgram_poll()
The unix_dgram_poll() routine calls sock_poll_wait() not only for the
wait queue associated with the socket s that we are poll'ing against,
but also calls sock_poll_wait() for a remote peer socket p, if it is
connected. Thus, if we call poll()/select()/epoll() for the socket s,
there are then a couple of code paths in which the remote peer socket p
and its associated peer_wait queue can be freed before
poll()/select()/epoll() have a chance to remove themselves from the
remote peer socket.

The ways that the remote peer socket can be freed are:

1. If s calls connect() to connect to a new socket other than p, it will
drop its reference on p, and thus a close() on p will free it.

2. If we call close() on p, then a subsequent sendmsg() from s will drop
the final reference to p, allowing it to be freed.

Address this issue by reverting unix_dgram_poll() to only register with
the wait queue associated with s, and by registering a callback with the
remote peer socket on connect() that will wake up the wait queue
associated with s. If scenario 1 or 2 above occurs, we then simply
remove the callback from the remote peer. This then presents the
expected semantics to poll()/select()/epoll().

I've implemented this for sock types SOCK_RAW, SOCK_DGRAM, and
SOCK_SEQPACKET, but not for SOCK_STREAM, since SOCK_STREAM does not use
unix_dgram_poll().
Signed-off-by: Jason Baron
---
 include/net/af_unix.h |  1 +
 net/unix/af_unix.c    | 32 +++-
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 4a167b3..9698aff 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -62,6 +62,7 @@ struct unix_sock {
 #define UNIX_GC_CANDIDATE	0
 #define UNIX_GC_MAYBE_CYCLE	1
 	struct socket_wq	peer_wq;
+	wait_queue_t		wait;
 };
 #define unix_sk(__sk) ((struct unix_sock *)__sk)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 03ee4d3..f789423 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -420,6 +420,9 @@ static void unix_release_sock(struct sock *sk, int embrion)
 	skpair = unix_peer(sk);

 	if (skpair != NULL) {
+		if (sk->sk_type != SOCK_STREAM)
+			remove_wait_queue(&unix_sk(skpair)->peer_wait,
+					  &u->wait);
 		if (sk->sk_type == SOCK_STREAM || sk->sk_type == SOCK_SEQPACKET) {
 			unix_state_lock(skpair);
 			/* No more writes */
@@ -636,6 +639,16 @@ static struct proto unix_proto = {
  */
 static struct lock_class_key af_unix_sk_receive_queue_lock_key;

+static int peer_wake(wait_queue_t *wait, unsigned mode, int sync, void *key)
+{
+	struct unix_sock *u;
+
+	u = container_of(wait, struct unix_sock, wait);
+	wake_up_interruptible_sync_poll(sk_sleep(&u->sk), key);
+
+	return 0;
+}
+
 static struct sock *unix_create1(struct net *net, struct socket *sock, int kern)
 {
 	struct sock *sk = NULL;
@@ -664,6 +677,7 @@ static struct sock *unix_create1(struct net *net, struct socket *sock, int kern)
 	INIT_LIST_HEAD(&u->link);
 	mutex_init(&u->readlock); /* single task reading lock */
 	init_waitqueue_head(&u->peer_wait);
+	init_waitqueue_func_entry(&u->wait, peer_wake);
 	unix_insert_socket(unix_sockets_unbound(sk), sk);
 out:
 	if (sk == NULL)
@@ -1030,7 +1044,11 @@ restart:
 	 */
 	if (unix_peer(sk)) {
 		struct sock *old_peer = unix_peer(sk);
+
+		remove_wait_queue(&unix_sk(old_peer)->peer_wait,
+				  &unix_sk(sk)->wait);
 		unix_peer(sk) = other;
+		add_wait_queue(&unix_sk(other)->peer_wait, &unix_sk(sk)->wait);
 		unix_state_double_unlock(sk, other);

 		if (other != old_peer)
@@ -1038,8 +1056,12 @@ restart:
 			sock_put(old_peer);
 	} else {
 		unix_peer(sk) = other;
+		add_wait_queue(&unix_sk(other)->peer_wait, &unix_sk(sk)->wait);
 		unix_state_double_unlock(sk, other);
 	}
+	/* New remote may have created write space for us */
+	wake_up_interruptible_sync_poll(sk_sleep(sk),
+					POLLOUT | POLLWRNORM | POLLWRBAND);
 	return 0;

 out_unlock:
@@ -1194,6 +1216,8 @@ restart:

 	sock_hold(sk);
 	unix_peer(newsk) = sk;
+	if (sk->sk_type == SOCK_SEQPACKET)
+		add_wait_queue(&unix_sk(sk)->peer_wait, &unix_sk(newsk)->wait);
 	newsk->sk_state = TCP_ESTABLISHED;
 	newsk->sk_type  = sk->sk_type;
 	init_peercred(newsk);
@@ -1220,6 +1244,8 @@ restart:
 	smp_mb__after_atomic(); /* sock_hold() does an atomic_inc() */
 	unix_peer(sk) = newsk;
+	if (sk->sk_type == SOCK_SEQPACKET)
+		add_wait_queue(&unix_sk(newsk)->peer_wait, &unix_sk(sk)->wait);
 	unix_state_unlock(sk);
@@ -1254,6 +1280,10 @@ static