Re: [PATCH] fix race in AF_UNIX
Miklos Szeredi [EMAIL PROTECTED] writes: And I think incremental GC algorithms are much too complex for this task. What I've realized, is that in fact we don't require a generic garbage collection algorithm, just a much more specialized cycle collection algorithm, since refcounting in struct file takes care of the rest. This would help with localizing the problem to the problematic sockets (which have an in-flight unix socket), instead of having to blindly traverse _all_ unix sockets in the system. I'll look at reimplementing the GC with such an algorithm. Ok. If you can do it more simply have at it. There are incremental garbage collectors that are essentially just the current algorithm with fine-grained locking. So we don't have to live in a spin-lock the whole time. If your approach fails we can look at something more fine-grained. It appears clear that since we can't stop the world and garbage collect we need an incremental collector. Constraining ourselves to stopping unix sockets from going in flight or coming out of flight during garbage collection should be OK I think. There's still a possibility of a DoS there, but it would only be able to affect _very_ few applications. Yes. Eric - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix race in AF_UNIX
Eric, thanks for looking at this. There are races involving the garbage collector, that can throw away perfectly good packets with AF_UNIX sockets in them. The problems arise when a socket goes from installed to in-flight or vice versa during garbage collection. Since gc is done with a spinlock held, this only shows up on SMP. Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] I'm going to hold off on this one for now. Holding all of the read locks kind of defeats the purpose of using the per-socket lock. Can't you just lock purely around the receive queue operation? That's already protected by the receive queue spinlock. The race however happens _between_ pushing the root set and marking of the in-flight but reachable sockets. If in that space any of the AF_UNIX sockets goes from in-flight to installed into a file descriptor, the garbage collector can miss it. If we want to protect against this using unix_sk(s)-readlock, then we have to hold all of them for the duration of the marking. Al, Alan, you have more experience with this piece of code. Do you have better ideas about how to fix this? I haven't looked at the code closely enough to be confident of changing something in this area. However the classic solution to this kind of gc problem is to mark things that are manipulated during garbage collection as dirty (not orphaned). It should be possible to fix this problem by simply changing gc_tree when we perform a problematic manipulation of a passed socket, such as installing a passed socket into the file descriptors of a process. Essentially the idea is moving the current code in the direction of an incremental gc algorithm. If I understand the race properly. What happens is that we dequeue a socket (which has packets in it passing sockets) before the garbage collector gets to it. Therefore the garbage collector never processes that socket. So it sounds like we just need to call maybe_unmark_and_push or possibly just wait for the garbage collector to complete when we do that and the packet we have pulled out Right. But the devil is in the details, and (as you correctly point out later) to implement this, the whole locking scheme needs to be overhauled. Problems: - Using the queue lock to make the dequeue and the fd detach atomic wrt the GC is difficult, if not impossible: they are are far from each other with various magic in between. It would need thorough understanding of these functions and _big_ changes to implement. - Sleeping on u-readlock in GC is currently not possible, since that could deadlock with unix_dgram_recvmsg(). That function could probably be modified to release u-readlock, while waiting for data, similarly to unix_stream_recvmsg() at the cost of some added complexity. - Sleeping on u-readlock is also impossible, because GC is holding unix_table_lock for the whole operation. We could release unix_table_lock, but then would have to cope with sockets coming and going, making the current socket iterator unworkable. So theoretically it's quite simple, but it needs big changes. And this wouldn't even solve all the problems with the GC, like being a possible DoS vector. Miklos - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix race in AF_UNIX
Miklos Szeredi [EMAIL PROTECTED] writes: Right. But the devil is in the details, and (as you correctly point out later) to implement this, the whole locking scheme needs to be overhauled. Problems: - Using the queue lock to make the dequeue and the fd detach atomic wrt the GC is difficult, if not impossible: they are are far from each other with various magic in between. It would need thorough understanding of these functions and _big_ changes to implement. - Sleeping on u-readlock in GC is currently not possible, since that could deadlock with unix_dgram_recvmsg(). That function could probably be modified to release u-readlock, while waiting for data, similarly to unix_stream_recvmsg() at the cost of some added complexity. - Sleeping on u-readlock is also impossible, because GC is holding unix_table_lock for the whole operation. We could release unix_table_lock, but then would have to cope with sockets coming and going, making the current socket iterator unworkable. So theoretically it's quite simple, but it needs big changes. And this wouldn't even solve all the problems with the GC, like being a possible DoS vector. Making the GC fully incremental will solve the DoS vector problem as well. Basically you do a fixed amount of reclaim in the new socket allocation code. It appears clear that since we can't stop the world and garbage collect we need an incremental collector. Eric - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix race in AF_UNIX
Miklos Szeredi [EMAIL PROTECTED] writes: [CC'd Al Viro and Alan Cox, restored patch] There are races involving the garbage collector, that can throw away perfectly good packets with AF_UNIX sockets in them. The problems arise when a socket goes from installed to in-flight or vice versa during garbage collection. Since gc is done with a spinlock held, this only shows up on SMP. Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] I'm going to hold off on this one for now. Holding all of the read locks kind of defeats the purpose of using the per-socket lock. Can't you just lock purely around the receive queue operation? That's already protected by the receive queue spinlock. The race however happens _between_ pushing the root set and marking of the in-flight but reachable sockets. If in that space any of the AF_UNIX sockets goes from in-flight to installed into a file descriptor, the garbage collector can miss it. If we want to protect against this using unix_sk(s)-readlock, then we have to hold all of them for the duration of the marking. Al, Alan, you have more experience with this piece of code. Do you have better ideas about how to fix this? I haven't looked at the code closely enough to be confident of changing something in this area. However the classic solution to this kind of gc problem is to mark things that are manipulated during garbage collection as dirty (not orphaned). It should be possible to fix this problem by simply changing gc_tree when we perform a problematic manipulation of a passed socket, such as installing a passed socket into the file descriptors of a process. Essentially the idea is moving the current code in the direction of an incremental gc algorithm. If I understand the race properly. What happens is that we dequeue a socket (which has packets in it passing sockets) before the garbage collector gets to it. Therefore the garbage collector never processes that socket. So it sounds like we just need to call maybe_unmark_and_push or possibly just wait for the garbage collector to complete when we do that and the packet we have pulled out So just looking at this quickly. It looks like we need to hold the u-readlock mutex in garbage.c while looking at the receive queue. Otherwise we may be processing a packet and have the file descriptors in limbo. Either that or we need to slightly extend the scope of the receive queue lock on the receive side. Additionally we need to modify unix_notinflight to mark the sockets as inuse, or to at least wait for the garbage collection to complete. While being very careful to not add a deadlock scenario. The require changes to fix this without adding heavy handed locking look a bit nasty to make. I half suspect switching to one of the simpler incremental garbage collection algorithms might not be equally easy to implement and give us more performance at the same time. Having a garbage collector that can block has advantages. The practical problem is you can't modify the list of pushed object aka gc_tree without holding a lock. And we never drop the unix_table_lock. Anyway hopefully that is enough fodder for someone to come up with a light weight fix for this problem. Eric - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix race in AF_UNIX
Ping Dave, Since there doesn't seem to be any new ideas forthcoming, can we please decide on either one of my two sumbitted patches? Thanks, Miklos [CC'd Al Viro and Alan Cox, restored patch] There are races involving the garbage collector, that can throw away perfectly good packets with AF_UNIX sockets in them. The problems arise when a socket goes from installed to in-flight or vice versa during garbage collection. Since gc is done with a spinlock held, this only shows up on SMP. Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] I'm going to hold off on this one for now. Holding all of the read locks kind of defeats the purpose of using the per-socket lock. Can't you just lock purely around the receive queue operation? That's already protected by the receive queue spinlock. The race however happens _between_ pushing the root set and marking of the in-flight but reachable sockets. If in that space any of the AF_UNIX sockets goes from in-flight to installed into a file descriptor, the garbage collector can miss it. If we want to protect against this using unix_sk(s)-readlock, then we have to hold all of them for the duration of the marking. Al, Alan, you have more experience with this piece of code. Do you have better ideas about how to fix this? Thanks, Miklos Index: linux-2.6.22-rc2/net/unix/garbage.c === --- linux-2.6.22-rc2.orig/net/unix/garbage.c2007-06-03 23:58:11.0 +0200 +++ linux-2.6.22-rc2/net/unix/garbage.c 2007-06-06 09:48:36.0 +0200 @@ -186,7 +186,21 @@ void unix_gc(void) forall_unix_sockets(i, s) { - unix_sk(s)-gc_tree = GC_ORPHAN; + struct unix_sock *u = unix_sk(s); + + u-gc_tree = GC_ORPHAN; + + /* +* Hold -readlock to protect against sockets going from +* in-flight to installed +* +* Can't sleep on this, because +* a) we are under spinlock +* b) skb_recv_datagram() could be waiting for a packet that +* is to be sent by this thread +*/ + if (!mutex_trylock(u-readlock)) + goto lock_failed; } /* * Everything is now marked @@ -207,8 +221,6 @@ void unix_gc(void) forall_unix_sockets(i, s) { - int open_count = 0; - /* * If all instances of the descriptor are not * in flight we are in use. @@ -218,10 +230,20 @@ void unix_gc(void) * In this case (see unix_create1()) we set artificial * negative inflight counter to close race window. * It is trick of course and dirty one. +* +* Get the inflight counter first, then the open +* counter. This avoids problems if racing with +* sendmsg +* +* If just created socket is not yet attached to +* a file descriptor, assume open_count of 1 */ + int inflight_count = atomic_read(unix_sk(s)-inflight); + int open_count = 1; + if (s-sk_socket s-sk_socket-file) open_count = file_count(s-sk_socket-file); - if (open_count atomic_read(unix_sk(s)-inflight)) + if (open_count inflight_count) maybe_unmark_and_push(s); } @@ -300,6 +322,7 @@ void unix_gc(void) spin_unlock(s-sk_receive_queue.lock); } u-gc_tree = GC_ORPHAN; + mutex_unlock(u-readlock); } spin_unlock(unix_table_lock); @@ -309,4 +332,19 @@ void unix_gc(void) __skb_queue_purge(hitlist); mutex_unlock(unix_gc_sem); + return; + + lock_failed: + { + struct sock *s1; + forall_unix_sockets(i, s1) { + if (s1 == s) { + spin_unlock(unix_table_lock); + mutex_unlock(unix_gc_sem); + return; + } + mutex_unlock(unix_sk(s1)-readlock); + } + BUG(); + } } - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix race in AF_UNIX
From: Miklos Szeredi [EMAIL PROTECTED] Date: Mon, 18 Jun 2007 09:49:32 +0200 Ping Dave, Since there doesn't seem to be any new ideas forthcoming, can we please decide on either one of my two sumbitted patches? You just need to be patient, we can't just put a bad patch in because a better one has not been suggested yet :-) - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix race in AF_UNIX
From: Miklos Szeredi [EMAIL PROTECTED] Date: Mon, 18 Jun 2007 09:49:32 +0200 Ping Dave, Since there doesn't seem to be any new ideas forthcoming, can we please decide on either one of my two sumbitted patches? You just need to be patient, we can't just put a bad patch in because a better one has not been suggested yet :-) And is anyone working on a better patch? Those patches aren't bad in the correctness sense. So IMO any one of them is better, than having that bug in there. Miklos - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix race in AF_UNIX
From: Miklos Szeredi [EMAIL PROTECTED] Date: Mon, 18 Jun 2007 10:20:23 +0200 And is anyone working on a better patch? I have no idea. Those patches aren't bad in the correctness sense. So IMO any one of them is better, than having that bug in there. You're adding a very serious performance regression, which is about as bad as the bug itself. It can wait for a more appropriate fix. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix race in AF_UNIX
And is anyone working on a better patch? I have no idea. Those patches aren't bad in the correctness sense. So IMO any one of them is better, than having that bug in there. You're adding a very serious performance regression, which is about as bad as the bug itself. No, correctness always trumps performance. Lost packets on an AF_UNIX socket are _unexceptable_, and this is definitely not a theoretical problem. And BTW my second patch does _not_ have the performance problems you are arguing about, it's just plain ugly. But hey, if you can't live with ugly code, go and fix it. It can wait for a more appropriate fix. Now _please_ be a bit more constructive. Do you want me to send the patch to Andrew instead? His attitude towards bugfixes is rather better ;) Miklos - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix race in AF_UNIX
From: Miklos Szeredi [EMAIL PROTECTED] Date: Mon, 18 Jun 2007 11:29:52 +0200 And is anyone working on a better patch? I have no idea. Those patches aren't bad in the correctness sense. So IMO any one of them is better, than having that bug in there. You're adding a very serious performance regression, which is about as bad as the bug itself. No, correctness always trumps performance. To a point. There is no black and white in this world. Lost packets on an AF_UNIX socket are _unexceptable_, and this is definitely not a theoretical problem. A lot of people will consider having all of their AF_UNIX sockets on their 64 cpu system just stop when garbage collection runs to be unacceptable as well. Secondarily, this bug has been around for years and nobody noticed. The world will not explode if this bug takes a few more days or even a week to work out. Let's do it right instead of ramming arbitrary turds into the kernel. Do you want me to send the patch to Andrew instead? His attitude towards bugfixes is rather better ;) When I explain the ramifications of your patch to him, I'm pretty sure he'll agree with me. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix race in AF_UNIX
And is anyone working on a better patch? I have no idea. Those patches aren't bad in the correctness sense. So IMO any one of them is better, than having that bug in there. You're adding a very serious performance regression, which is about as bad as the bug itself. No, correctness always trumps performance. To a point. There is no black and white in this world. Lost packets on an AF_UNIX socket are _unexceptable_, and this is definitely not a theoretical problem. A lot of people will consider having all of their AF_UNIX sockets on their 64 cpu system just stop when garbage collection runs to be unacceptable as well. Garbage collection only ever happens, if the app is sending AF_UNIX sockets over AF_UNIX sockets. Which is a rather rare case. And which is basically why this bug went unnoticed for so long. So my second patch only affects the performance of _exactly_ those apps which might well be bitten by the bug itself. Secondarily, this bug has been around for years and nobody noticed. The world will not explode if this bug takes a few more days or even a week to work out. Let's do it right instead of ramming arbitrary turds into the kernel. Fine, but just wishing a bug to get fixed won't accomplish anything. I've spent a fair amount of time debugging this thing, and I'm out of ideas. Really. So unless somebody steps up to look at this, it won't _ever_ get fixed. Miklos - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix race in AF_UNIX
From: Miklos Szeredi [EMAIL PROTECTED] Date: Mon, 18 Jun 2007 11:44:07 +0200 Secondarily, this bug has been around for years and nobody noticed. The world will not explode if this bug takes a few more days or even a week to work out. Let's do it right instead of ramming arbitrary turds into the kernel. Fine, but just wishing a bug to get fixed won't accomplish anything. I've spent a fair amount of time debugging this thing, and I'm out of ideas. Really. So unless somebody steps up to look at this, it won't _ever_ get fixed. Somone just needs to find a way to only lock the socket as it is being operated upon. The race you are dealing with is rather simple, the queue check and the state check need to be done atomically. The only chore is to find a way to make that happen in the context of what the garbage allocator is trying to do. I'm not even convinced that your most recent attempt is deadlock free. Locking multiple objects the same way all at once like that is something that needs to be seriously audited. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix race in AF_UNIX
Secondarily, this bug has been around for years and nobody noticed. The world will not explode if this bug takes a few more days or even a week to work out. Let's do it right instead of ramming arbitrary turds into the kernel. Fine, but just wishing a bug to get fixed won't accomplish anything. I've spent a fair amount of time debugging this thing, and I'm out of ideas. Really. So unless somebody steps up to look at this, it won't _ever_ get fixed. Somone just needs to find a way to only lock the socket as it is being operated upon. No, you are still not getting it. The problem is that the socket needs to be locked for the _whole_ of the garbage collection. This is because the gc is done in two phases, in the first phase sockets which are installed into file descriptors are collected as a root set, then the set is expanded by iterating over everything it's in there and that's later added. If a socket moves from in-flight to installed _during_ the gc, then it can miss being collected. So the socket must be locked for the duration of _both_ phases. The race you are dealing with is rather simple, the queue check and the state check need to be done atomically. The only chore is to find a way to make that happen in the context of what the garbage allocator is trying to do. I'm not even convinced that your most recent attempt is deadlock free. Locking multiple objects the same way all at once like that is something that needs to be seriously audited. It's doing trylocks and releasing all those locks within the same spin locked region, how the f**k can that deadlock? The only problem I've experienced with this patch (other than being ugly) is that the multitude of locks it acquires makes the lockdep code give up. But I think we can live with that. Miklos - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix race in AF_UNIX
From: Miklos Szeredi [EMAIL PROTECTED] Date: Mon, 18 Jun 2007 11:55:19 +0200 It's doing trylocks and releasing all those locks within the same spin locked region, how the f**k can that deadlock? The case I'm very concerned about is the interactions of your new code with the unix_state_double_lock() stuff I added recently to fix another bug. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix race in AF_UNIX
* Miklos Szeredi [EMAIL PROTECTED] 2007-06-18 11:44 Garbage collection only ever happens, if the app is sending AF_UNIX sockets over AF_UNIX sockets. Which is a rather rare case. And which is basically why this bug went unnoticed for so long. So my second patch only affects the performance of _exactly_ those apps which might well be bitten by the bug itself. That's not entirely the truth. It affects all applications using AF_UNIX sockets while file descriptors are being transfered. I agree that the performance impact is not severe on most systems but if file descriptors are being transfered continously by just a single application it can become rather severe. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix race in AF_UNIX
* Miklos Szeredi [EMAIL PROTECTED] 2007-06-18 11:44 Garbage collection only ever happens, if the app is sending AF_UNIX sockets over AF_UNIX sockets. Which is a rather rare case. And which is basically why this bug went unnoticed for so long. So my second patch only affects the performance of _exactly_ those apps which might well be bitten by the bug itself. That's not entirely the truth. It affects all applications using AF_UNIX sockets while file descriptors are being transfered. I agree that the performance impact is not severe on most systems but if file descriptors are being transfered continously by just a single application it can become rather severe. You are wrong. Look in unix_release_sock(): if (atomic_read(unix_tot_inflight)) unix_gc(); /* Garbage collect fds */ unix_tot_inflight is the number of AF_UNIX sockets currently being transferred over some AF_UNIX sockets. That means that just sending (non-unix socket) fds over unix sockets will never invoke the gc. Miklos - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix race in AF_UNIX
* Thomas Graf [EMAIL PROTECTED] 2007-06-18 12:32 * Miklos Szeredi [EMAIL PROTECTED] 2007-06-18 11:44 Garbage collection only ever happens, if the app is sending AF_UNIX sockets over AF_UNIX sockets. Which is a rather rare case. And which is basically why this bug went unnoticed for so long. So my second patch only affects the performance of _exactly_ those apps which might well be bitten by the bug itself. That's not entirely the truth. It affects all applications using AF_UNIX sockets while file descriptors are being transfered. I agree that the performance impact is not severe on most systems but if file descriptors are being transfered continously by just a single application it can become rather severe. Also think of the scenario where an application, deliberately or not, begins a file descriptor tranfser using sendmsg() and the receiving part never invokes recvmsg() to decrement the inflight counters again. Every unix socket that gets closed would result in a gc call locking all sockets. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix race in AF_UNIX
* Miklos Szeredi [EMAIL PROTECTED] 2007-06-18 12:39 You are wrong. Look in unix_release_sock(): if (atomic_read(unix_tot_inflight)) unix_gc(); /* Garbage collect fds */ unix_tot_inflight is the number of AF_UNIX sockets currently being transferred over some AF_UNIX sockets. That means that just sending (non-unix socket) fds over unix sockets will never invoke the gc. That's what I meant, I'm sorry, I should have written unix socket file descriptor to not leave any room for misinterpretation. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix race in AF_UNIX
* Thomas Graf [EMAIL PROTECTED] 2007-06-18 12:32 * Miklos Szeredi [EMAIL PROTECTED] 2007-06-18 11:44 Garbage collection only ever happens, if the app is sending AF_UNIX sockets over AF_UNIX sockets. Which is a rather rare case. And which is basically why this bug went unnoticed for so long. So my second patch only affects the performance of _exactly_ those apps which might well be bitten by the bug itself. That's not entirely the truth. It affects all applications using AF_UNIX sockets while file descriptors are being transfered. I agree that the performance impact is not severe on most systems but if file descriptors are being transfered continously by just a single application it can become rather severe. Also think of the scenario where an application, deliberately or not, begins a file descriptor tranfser using sendmsg() and the receiving part never invokes recvmsg() to decrement the inflight counters again. Every unix socket that gets closed would result in a gc call locking all sockets. And if some of the sent files were unix sockets, rightly so, since the sent sockets might need to be garbage collected. And BTW the whole gc is done with the unix_table_lock held, so it will stop some socket operations anyway. The fact that it needs to stop some more operations is a necessary thing. But we are talking about a _spinlocked_ region, which for zillions of sockets might run for a long time, but it's not as if it's really going to affect performance in real cases. Miklos - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix race in AF_UNIX
From: Miklos Szeredi [EMAIL PROTECTED] Date: Mon, 18 Jun 2007 12:47:17 +0200 but it's not as if it's really going to affect performance in real cases. Since these circumstances are creatable by any user, we have to consider the cases caused by malicious entities. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix race in AF_UNIX
but it's not as if it's really going to affect performance in real cases. Since these circumstances are creatable by any user, we have to consider the cases caused by malicious entities. OK. But then the whole gc thing is already broken, since a user can DoS socket creation/destruction. I'm all for fixing this gc mess that we have now. But please don't expect me to be the one who's doing it. Miklos - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix race in AF_UNIX
From: Miklos Szeredi [EMAIL PROTECTED] Date: Mon, 18 Jun 2007 12:55:24 +0200 I'm all for fixing this gc mess that we have now. But please don't expect me to be the one who's doing it. Don't worry, I only expect you to make the situation worse :-) - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix race in AF_UNIX
I'm all for fixing this gc mess that we have now. But please don't expect me to be the one who's doing it. Don't worry, I only expect you to make the situation worse :-) That's real nice. Looks like finding and fixing bugs in not appreciated in the networking subsystem :-/ Miklos - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix race in AF_UNIX
From: David Miller [EMAIL PROTECTED] Date: Mon, 18 Jun 2007 04:02:53 -0700 (PDT) From: Miklos Szeredi [EMAIL PROTECTED] Date: Mon, 18 Jun 2007 12:55:24 +0200 I'm all for fixing this gc mess that we have now. But please don't expect me to be the one who's doing it. Don't worry, I only expect you to make the situation worse :-) In any event, I'll try to find some time to look more at your patch. But just like you don't want to be expected to rewrite the AF_UNIX GC code, you can't expect me to go right to your patch right when you want me to. I am one human being, and I depend upon other developers to help me out with patch review and so on. If nobody else wants to review your patch that doesn't work out so well for me. Your behavior also tends to make me (and others) put your patch near the end of the todo list rather than near the top. Fix your attitude and people will enjoy reviewing your patches that much more. Reviewing your work will become something to look forward to rather than a chore. Thanks. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix race in AF_UNIX
No, correctness always trumps performance. Lost packets on an AF_UNIX socket are _unexceptable_, and this is definitely not a theoretical problem. If its so unacceptable why has nobody noticed until now - its a bug clearly, it needs fixing clearly, but unless you can produce some kind of exploit from it (eg a DoS attack or kernel memory leak exploiter) it doesn't appear to be that serious. And BTW my second patch does _not_ have the performance problems you are arguing about, it's just plain ugly. But hey, if you can't live with ugly code, go and fix it. If you put ugly code into the kernel you pay for maintaining it for years to come. If you get it right then you don't Do you want me to send the patch to Andrew instead? His attitude towards bugfixes is rather better ;) And it'll get NAKked and binned. DaveM is (as happens sometimes ;)) right to insist on the code being clean and efficient. Alan - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix race in AF_UNIX
On Jun 18 2007 12:47, Alan Cox wrote: Do you want me to send the patch to Andrew instead? His attitude towards bugfixes is rather better ;) And it'll get NAKked and binned. DaveM is (as happens sometimes ;)) right to insist on the code being clean and efficient. Or see RFC 1925 number 7a. :) Jan -- - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix race in AF_UNIX
I'm all for fixing this gc mess that we have now. But please don't expect me to be the one who's doing it. Don't worry, I only expect you to make the situation worse :-) In any event, I'll try to find some time to look more at your patch. But just like you don't want to be expected to rewrite the AF_UNIX GC code, you can't expect me to go right to your patch right when you want me to. I am one human being, and I depend upon other developers to help me out with patch review and so on. If nobody else wants to review your patch that doesn't work out so well for me. Sure, no problem. I just pinged you after a week of inactivity, to make sure it hasn't been forgotten. Just say if you are busy and I'll wait, but trying to forget the issue is not the way to deal with it. Your behavior also tends to make me (and others) put your patch near the end of the todo list rather than near the top. Fix your attitude and people will enjoy reviewing your patches that much more. Reviewing your work will become something to look forward to rather than a chore. I'll happily admit I'm a fool when it comes to social interactions. Miklos - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix race in AF_UNIX
On Mon, 18 Jun 2007 12:43:40 +0200 Thomas Graf [EMAIL PROTECTED] wrote: * Miklos Szeredi [EMAIL PROTECTED] 2007-06-18 12:39 You are wrong. Look in unix_release_sock(): if (atomic_read(unix_tot_inflight)) unix_gc(); /* Garbage collect fds */ unix_tot_inflight is the number of AF_UNIX sockets currently being transferred over some AF_UNIX sockets. That means that just sending (non-unix socket) fds over unix sockets will never invoke the gc. That's what I meant, I'm sorry, I should have written unix socket file descriptor to not leave any room for misinterpretation. You can bound the worst case on this and I think stay within the specs (as the specs don't say a lot about it). One way would be to make unix_gc() kick off a thread/tasklet and if the last unix_gc was within 5 seconds then use a timer to defer it - that prevents any user driven lets cause a ton of gc cases. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix race in AF_UNIX
No, correctness always trumps performance. Lost packets on an AF_UNIX socket are _unexceptable_, and this is definitely not a theoretical problem. If its so unacceptable why has nobody noticed until now - its a bug clearly, it needs fixing clearly, but unless you can produce some kind of exploit from it (eg a DoS attack or kernel memory leak exploiter) it doesn't appear to be that serious. It's serious in that it affects the operation of one of my applications in a way that is rather hard to work around. Sure, I could resend the lost sockets, but it's something one doesn't do unnecessarily for reliable transports. And it's entirely possible, that it could bite other apps in rare and mysterious ways. All you need is - an application that sends a unix socket over a unix socket - a parallel close() operation on an independent unix socket The two might happen to be from totally unrelated apps. It's all the more serious, because it could happen rarely and unreproducibly, and could well be crashing the app which is not expecting this behavior. And BTW my second patch does _not_ have the performance problems you are arguing about, it's just plain ugly. But hey, if you can't live with ugly code, go and fix it. If you put ugly code into the kernel you pay for maintaining it for years to come. If you get it right then you don't Do you want me to send the patch to Andrew instead? His attitude towards bugfixes is rather better ;) And it'll get NAKked and binned. DaveM is (as happens sometimes ;)) right to insist on the code being clean and efficient. Right, but treating a bug in your subsystem as if it's entirely the responsibility of the reporter is not the right attitude either I think. Miklos - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix race in AF_UNIX
[CC'd Al Viro and Alan Cox, restored patch] There are races involving the garbage collector, that can throw away perfectly good packets with AF_UNIX sockets in them. The problems arise when a socket goes from installed to in-flight or vice versa during garbage collection. Since gc is done with a spinlock held, this only shows up on SMP. Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] I'm going to hold off on this one for now. Holding all of the read locks kind of defeats the purpose of using the per-socket lock. Can't you just lock purely around the receive queue operation? That's already protected by the receive queue spinlock. The race however happens _between_ pushing the root set and marking of the in-flight but reachable sockets. If in that space any of the AF_UNIX sockets goes from in-flight to installed into a file descriptor, the garbage collector can miss it. If we want to protect against this using unix_sk(s)-readlock, then we have to hold all of them for the duration of the marking. Al, Alan, you have more experience with this piece of code. Do you have better ideas about how to fix this? Thanks, Miklos Index: linux-2.6.22-rc2/net/unix/garbage.c === --- linux-2.6.22-rc2.orig/net/unix/garbage.c 2007-06-03 23:58:11.0 +0200 +++ linux-2.6.22-rc2/net/unix/garbage.c 2007-06-06 09:48:36.0 +0200 @@ -186,7 +186,21 @@ void unix_gc(void) forall_unix_sockets(i, s) { - unix_sk(s)-gc_tree = GC_ORPHAN; + struct unix_sock *u = unix_sk(s); + + u-gc_tree = GC_ORPHAN; + + /* + * Hold -readlock to protect against sockets going from + * in-flight to installed + * + * Can't sleep on this, because + * a) we are under spinlock + * b) skb_recv_datagram() could be waiting for a packet that + * is to be sent by this thread + */ + if (!mutex_trylock(u-readlock)) + goto lock_failed; } /* * Everything is now marked @@ -207,8 +221,6 @@ void unix_gc(void) forall_unix_sockets(i, s) { - int open_count = 0; - /* * If all instances of the descriptor are not * in flight we are in use. @@ -218,10 +230,20 @@ void unix_gc(void) * In this case (see unix_create1()) we set artificial * negative inflight counter to close race window. * It is trick of course and dirty one. + * + * Get the inflight counter first, then the open + * counter. This avoids problems if racing with + * sendmsg + * + * If just created socket is not yet attached to + * a file descriptor, assume open_count of 1 */ + int inflight_count = atomic_read(unix_sk(s)-inflight); + int open_count = 1; + if (s-sk_socket s-sk_socket-file) open_count = file_count(s-sk_socket-file); - if (open_count atomic_read(unix_sk(s)-inflight)) + if (open_count inflight_count) maybe_unmark_and_push(s); } @@ -300,6 +322,7 @@ void unix_gc(void) spin_unlock(s-sk_receive_queue.lock); } u-gc_tree = GC_ORPHAN; + mutex_unlock(u-readlock); } spin_unlock(unix_table_lock); @@ -309,4 +332,19 @@ void unix_gc(void) __skb_queue_purge(hitlist); mutex_unlock(unix_gc_sem); + return; + + lock_failed: + { + struct sock *s1; + forall_unix_sockets(i, s1) { + if (s1 == s) { + spin_unlock(unix_table_lock); + mutex_unlock(unix_gc_sem); + return; + } + mutex_unlock(unix_sk(s1)-readlock); + } + BUG(); + } } - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix race in AF_UNIX
From: Miklos Szeredi [EMAIL PROTECTED] Date: Wed, 06 Jun 2007 10:08:29 +0200 There are races involving the garbage collector, that can throw away perfectly good packets with AF_UNIX sockets in them. The problems arise when a socket goes from installed to in-flight or vice versa during garbage collection. Since gc is done with a spinlock held, this only shows up on SMP. Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] I'm going to hold off on this one for now. Holding all of the read locks kind of defeats the purpose of using the per-socket lock. Can't you just lock purely around the receive queue operation? - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix race in AF_UNIX
Holding a global mutex over recvmsg() calls under AF_UNIX is pretty much a non-starter, this will kill performance for multi-threaded apps. That's an rwsem held for read. It's held for write in unix_gc() only for a short duration, and unix_gc() should only rarely be called. So I don't think there's any performance problem here. It pulls a non-local cacheline into the local thread, that's extremely expensive on SMP. OK, here's an updated patch, that uses -readlock, and passes my testing. From: Miklos Szeredi [EMAIL PROTECTED] There are races involving the garbage collector, that can throw away perfectly good packets with AF_UNIX sockets in them. The problems arise when a socket goes from installed to in-flight or vice versa during garbage collection. Since gc is done with a spinlock held, this only shows up on SMP. Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] --- Index: linux-2.6.22-rc2/net/unix/garbage.c === --- linux-2.6.22-rc2.orig/net/unix/garbage.c2007-06-03 23:58:11.0 +0200 +++ linux-2.6.22-rc2/net/unix/garbage.c 2007-06-06 09:48:36.0 +0200 @@ -186,7 +186,21 @@ void unix_gc(void) forall_unix_sockets(i, s) { - unix_sk(s)-gc_tree = GC_ORPHAN; + struct unix_sock *u = unix_sk(s); + + u-gc_tree = GC_ORPHAN; + + /* +* Hold -readlock to protect against sockets going from +* in-flight to installed +* +* Can't sleep on this, because +* a) we are under spinlock +* b) skb_recv_datagram() could be waiting for a packet that +* is to be sent by this thread +*/ + if (!mutex_trylock(u-readlock)) + goto lock_failed; } /* * Everything is now marked @@ -207,8 +221,6 @@ void unix_gc(void) forall_unix_sockets(i, s) { - int open_count = 0; - /* * If all instances of the descriptor are not * in flight we are in use. @@ -218,10 +230,20 @@ void unix_gc(void) * In this case (see unix_create1()) we set artificial * negative inflight counter to close race window. * It is trick of course and dirty one. +* +* Get the inflight counter first, then the open +* counter. This avoids problems if racing with +* sendmsg +* +* If just created socket is not yet attached to +* a file descriptor, assume open_count of 1 */ + int inflight_count = atomic_read(unix_sk(s)-inflight); + int open_count = 1; + if (s-sk_socket s-sk_socket-file) open_count = file_count(s-sk_socket-file); - if (open_count atomic_read(unix_sk(s)-inflight)) + if (open_count inflight_count) maybe_unmark_and_push(s); } @@ -300,6 +322,7 @@ void unix_gc(void) spin_unlock(s-sk_receive_queue.lock); } u-gc_tree = GC_ORPHAN; + mutex_unlock(u-readlock); } spin_unlock(unix_table_lock); @@ -309,4 +332,19 @@ void unix_gc(void) __skb_queue_purge(hitlist); mutex_unlock(unix_gc_sem); + return; + + lock_failed: + { + struct sock *s1; + forall_unix_sockets(i, s1) { + if (s1 == s) { + spin_unlock(unix_table_lock); + mutex_unlock(unix_gc_sem); + return; + } + mutex_unlock(unix_sk(s1)-readlock); + } + BUG(); + } } - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix race in AF_UNIX
From: Miklos Szeredi [EMAIL PROTECTED] Date: Wed, 06 Jun 2007 10:08:29 +0200 Holding a global mutex over recvmsg() calls under AF_UNIX is pretty much a non-starter, this will kill performance for multi-threaded apps. That's an rwsem held for read. It's held for write in unix_gc() only for a short duration, and unix_gc() should only rarely be called. So I don't think there's any performance problem here. It pulls a non-local cacheline into the local thread, that's extremely expensive on SMP. OK, here's an updated patch, that uses -readlock, and passes my testing. Thanks a lot, I'll review this as soon as possible unless someone else beats me to it :) - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix race in AF_UNIX
From: Miklos Szeredi [EMAIL PROTECTED] Date: Mon, 04 Jun 2007 11:45:32 +0200 A recv() on an AF_UNIX, SOCK_STREAM socket can race with a send()+close() on the peer, causing recv() to return zero, even though the sent data should be received. This happens if the send() and the close() is performed between skb_dequeue() and checking sk-sk_shutdown in unix_stream_recvmsg(): process A skb_dequeue() returns NULL, there's no data in the socket queue process B new data is inserted onto the queue by unix_stream_sendmsg() process B sk-sk_shutdown is set to SHUTDOWN_MASK by unix_release_sock() process A sk-sk_shutdown is checked, unix_release_sock() returns zero This is only part of the story. It turns out, there are other races involving the garbage collector, that can throw away perfectly good packets with AF_UNIX sockets in them. The problems arise when a socket goes from installed to in-flight or vica versa during garbage collection. Since gc is done with a spinlock held, this only shows up on SMP. The following patch fixes it for me, but it's possibly the wrong approach. Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] I haven't seen a repost of the first patch, which is necessary because that first patch doesn't apply to the current tree. Please don't ignore Arnaldo's feedback like that, or else I'll ignore you just the same. :-) - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix race in AF_UNIX
A recv() on an AF_UNIX, SOCK_STREAM socket can race with a send()+close() on the peer, causing recv() to return zero, even though the sent data should be received. This happens if the send() and the close() is performed between skb_dequeue() and checking sk-sk_shutdown in unix_stream_recvmsg(): process A skb_dequeue() returns NULL, there's no data in the socket queue process B new data is inserted onto the queue by unix_stream_sendmsg() process B sk-sk_shutdown is set to SHUTDOWN_MASK by unix_release_sock() process A sk-sk_shutdown is checked, unix_release_sock() returns zero This is only part of the story. It turns out, there are other races involving the garbage collector, that can throw away perfectly good packets with AF_UNIX sockets in them. The problems arise when a socket goes from installed to in-flight or vica versa during garbage collection. Since gc is done with a spinlock held, this only shows up on SMP. The following patch fixes it for me, but it's possibly the wrong approach. Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] I haven't seen a repost of the first patch, which is necessary because that first patch doesn't apply to the current tree. Please don't ignore Arnaldo's feedback like that, or else I'll ignore you just the same. :-) I just want to win the who's laziest? league. It would take me about 5 minutes to get the netdev tree and test compile the change. Of which 5 seconds would be actually updating the patch. I was thought it was OK to pass that 5 seconds worth of hard work to you in order to save the rest ;) Anyway here's the updated (but not compile tested) patch. Thanks, Miklos From: Miklos Szeredi [EMAIL PROTECTED] A recv() on an AF_UNIX, SOCK_STREAM socket can race with a send()+close() on the peer, causing recv() to return zero, even though the sent data should be received. This happens if the send() and the close() is performed between skb_dequeue() and checking sk-sk_shutdown in unix_stream_recvmsg(): process A skb_dequeue() returns NULL, there's no data in the socket queue process B new data is inserted onto the queue by unix_stream_sendmsg() process B sk-sk_shutdown is set to SHUTDOWN_MASK by unix_release_sock() process A sk-sk_shutdown is checked, unix_release_sock() returns zero I'm surprised nobody noticed this, it's not hard to trigger. Maybe it's just (un)luck with the timing. It's possible to work around this bug in userspace, by retrying the recv() once in case of a zero return value. Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] --- Index: linux-2.6.22-rc2/net/unix/af_unix.c === --- linux-2.6.22-rc2.orig/net/unix/af_unix.c2007-06-02 23:45:47.0 +0200 +++ linux-2.6.22-rc2/net/unix/af_unix.c 2007-06-02 23:45:49.0 +0200 @@ -1711,20 +1711,23 @@ static int unix_stream_recvmsg(struct ki int chunk; struct sk_buff *skb; + unix_state_lock(sk); skb = skb_dequeue(sk-sk_receive_queue); if (skb==NULL) { if (copied = target) - break; + goto unlock; /* * POSIX 1003.1g mandates this order. */ if ((err = sock_error(sk)) != 0) - break; + goto unlock; if (sk-sk_shutdown RCV_SHUTDOWN) - break; + goto unlock; + + unix_state_unlock(sk); err = -EAGAIN; if (!timeo) break; @@ -1738,7 +1741,11 @@ static int unix_stream_recvmsg(struct ki } mutex_lock(u-readlock); continue; + unlock: + unix_state_unlock(sk); + break; } + unix_state_unlock(sk); if (check_creds) { /* Never glue messages from different writers */ - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix race in AF_UNIX
From: Miklos Szeredi [EMAIL PROTECTED] Date: Tue, 05 Jun 2007 09:42:41 +0200 I just want to win the who's laziest? league. It would take me about 5 minutes to get the netdev tree and test compile the change. Of which 5 seconds would be actually updating the patch. I was thought it was OK to pass that 5 seconds worth of hard work to you in order to save the rest ;) That tradeoff is fine, if, in return you'll do the rest of the networking subsystem maintainership work I need to to. :-) - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix race in AF_UNIX
From: Miklos Szeredi [EMAIL PROTECTED] Date: Tue, 05 Jun 2007 10:11:56 +0200 I just want to win the who's laziest? league. It would take me about 5 minutes to get the netdev tree and test compile the change. Of which 5 seconds would be actually updating the patch. I was thought it was OK to pass that 5 seconds worth of hard work to you in order to save the rest ;) That tradeoff is fine, if, in return you'll do the rest of the networking subsystem maintainership work I need to to. :-) Well, I _did_ save you quite a bit of time by tracking down these bugs. That 5 sec of dedication in exchange would have really made me feel good ;( The only reason I can process as many patches as I can every day is that I depend upon the end-nodes (that's you) doing most of the time intensive work so that I can concentrate on reviewing patches for correctness and proper implementation. In any event, thanks for respinning the patch, it's late here so I'll review it tomorrow. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix race in AF_UNIX
I just want to win the who's laziest? league. It would take me about 5 minutes to get the netdev tree and test compile the change. Of which 5 seconds would be actually updating the patch. I was thought it was OK to pass that 5 seconds worth of hard work to you in order to save the rest ;) That tradeoff is fine, if, in return you'll do the rest of the networking subsystem maintainership work I need to to. :-) Well, I _did_ save you quite a bit of time by tracking down these bugs. That 5 sec of dedication in exchange would have really made me feel good ;( Miklos - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix race in AF_UNIX
From: Miklos Szeredi [EMAIL PROTECTED] Date: Tue, 05 Jun 2007 09:42:41 +0200 From: Miklos Szeredi [EMAIL PROTECTED] A recv() on an AF_UNIX, SOCK_STREAM socket can race with a send()+close() on the peer, causing recv() to return zero, even though the sent data should be received. This happens if the send() and the close() is performed between skb_dequeue() and checking sk-sk_shutdown in unix_stream_recvmsg(): process A skb_dequeue() returns NULL, there's no data in the socket queue process B new data is inserted onto the queue by unix_stream_sendmsg() process B sk-sk_shutdown is set to SHUTDOWN_MASK by unix_release_sock() process A sk-sk_shutdown is checked, unix_release_sock() returns zero I'm surprised nobody noticed this, it's not hard to trigger. Maybe it's just (un)luck with the timing. It's possible to work around this bug in userspace, by retrying the recv() once in case of a zero return value. Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] Looks good, applied. Another way to fix this would have been to take the u-readlock in unix_release() and unix_shutdown(), but that would have hurt unix_dgram_recvmsg() which doesn't need to handle this race. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix race in AF_UNIX
From: David Miller [EMAIL PROTECTED] Date: Tue, 05 Jun 2007 00:02:47 -0700 (PDT) From: Miklos Szeredi [EMAIL PROTECTED] Date: Mon, 04 Jun 2007 11:45:32 +0200 A recv() on an AF_UNIX, SOCK_STREAM socket can race with a send()+close() on the peer, causing recv() to return zero, even though the sent data should be received. This happens if the send() and the close() is performed between skb_dequeue() and checking sk-sk_shutdown in unix_stream_recvmsg(): process A skb_dequeue() returns NULL, there's no data in the socket queue process B new data is inserted onto the queue by unix_stream_sendmsg() process B sk-sk_shutdown is set to SHUTDOWN_MASK by unix_release_sock() process A sk-sk_shutdown is checked, unix_release_sock() returns zero This is only part of the story. It turns out, there are other races involving the garbage collector, that can throw away perfectly good packets with AF_UNIX sockets in them. The problems arise when a socket goes from installed to in-flight or vica versa during garbage collection. Since gc is done with a spinlock held, this only shows up on SMP. The following patch fixes it for me, but it's possibly the wrong approach. Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] Concerning this specific patch I think we need to rethink it a bit. Holding a global mutex over recvmsg() calls under AF_UNIX is pretty much a non-starter, this will kill performance for multi-threaded apps. One possible solution is for the garbage collection code to hold the u-readlock while processing a socket, but be careful about deadlocks. Anyone want to give that a try? - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix race in AF_UNIX
From: Miklos Szeredi [EMAIL PROTECTED] Date: Mon, 04 Jun 2007 11:45:32 +0200 A recv() on an AF_UNIX, SOCK_STREAM socket can race with a send()+close() on the peer, causing recv() to return zero, even though the sent data should be received. This happens if the send() and the close() is performed between skb_dequeue() and checking sk-sk_shutdown in unix_stream_recvmsg(): process A skb_dequeue() returns NULL, there's no data in the socket queue process B new data is inserted onto the queue by unix_stream_sendmsg() process B sk-sk_shutdown is set to SHUTDOWN_MASK by unix_release_sock() process A sk-sk_shutdown is checked, unix_release_sock() returns zero This is only part of the story. It turns out, there are other races involving the garbage collector, that can throw away perfectly good packets with AF_UNIX sockets in them. The problems arise when a socket goes from installed to in-flight or vica versa during garbage collection. Since gc is done with a spinlock held, this only shows up on SMP. The following patch fixes it for me, but it's possibly the wrong approach. Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] Concerning this specific patch I think we need to rethink it a bit. Holding a global mutex over recvmsg() calls under AF_UNIX is pretty much a non-starter, this will kill performance for multi-threaded apps. That's an rwsem held for read. It's held for write in unix_gc() only for a short duration, and unix_gc() should only rarely be called. So I don't think there's any performance problem here. One possible solution is for the garbage collection code to hold the u-readlock while processing a socket, but be careful about deadlocks. That would have exactly the same effect. Only the code would be more complicated. Miklos - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix race in AF_UNIX
From: Miklos Szeredi [EMAIL PROTECTED] Date: Wed, 06 Jun 2007 07:26:52 +0200 Holding a global mutex over recvmsg() calls under AF_UNIX is pretty much a non-starter, this will kill performance for multi-threaded apps. That's an rwsem held for read. It's held for write in unix_gc() only for a short duration, and unix_gc() should only rarely be called. So I don't think there's any performance problem here. It pulls a non-local cacheline into the local thread, that's extremely expensive on SMP. If everyone starts grabbing this thing during recvmsg() it's going to become a really hot lock and kill performance, even if it's a read side lock being taken. That's why I said we need to investigate solutions involving u-readlock, that already has to be taken and is local to the socket. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix race in AF_UNIX
A recv() on an AF_UNIX, SOCK_STREAM socket can race with a send()+close() on the peer, causing recv() to return zero, even though the sent data should be received. This happens if the send() and the close() is performed between skb_dequeue() and checking sk-sk_shutdown in unix_stream_recvmsg(): process A skb_dequeue() returns NULL, there's no data in the socket queue process B new data is inserted onto the queue by unix_stream_sendmsg() process B sk-sk_shutdown is set to SHUTDOWN_MASK by unix_release_sock() process A sk-sk_shutdown is checked, unix_release_sock() returns zero This is only part of the story. It turns out, there are other races involving the garbage collector, that can throw away perfectly good packets with AF_UNIX sockets in them. The problems arise when a socket goes from installed to in-flight or vica versa during garbage collection. Since gc is done with a spinlock held, this only shows up on SMP. The following patch fixes it for me, but it's possibly the wrong approach. Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] --- Index: linux-2.6.22-rc2/net/unix/garbage.c === --- linux-2.6.22-rc2.orig/net/unix/garbage.c2007-06-03 23:58:11.0 +0200 +++ linux-2.6.22-rc2/net/unix/garbage.c 2007-06-04 11:39:42.0 +0200 @@ -90,6 +90,7 @@ static struct sock *gc_current = GC_HEAD; /* stack of objects to mark */ atomic_t unix_tot_inflight = ATOMIC_INIT(0); +DECLARE_RWSEM(unix_gc_sem); static struct sock *unix_get_socket(struct file *filp) @@ -169,7 +170,7 @@ static void maybe_unmark_and_push(struct void unix_gc(void) { - static DEFINE_MUTEX(unix_gc_sem); + static DEFINE_MUTEX(unix_gc_local_lock); int i; struct sock *s; struct sk_buff_head hitlist; @@ -179,9 +180,22 @@ void unix_gc(void) * Avoid a recursive GC. */ - if (!mutex_trylock(unix_gc_sem)) + if (!mutex_trylock(unix_gc_local_lock)) return; + + /* +* unix_gc_sem protects against sockets going from in-flight to +* installed +* +* Can't sleep on this, because skb_recv_datagram could be +* waiting for a packet that is to be sent by the thread which +* invoked the gc +*/ + if (!down_write_trylock(unix_gc_sem)) { + mutex_unlock(unix_gc_local_lock); + return; + } spin_lock(unix_table_lock); forall_unix_sockets(i, s) @@ -207,8 +221,6 @@ void unix_gc(void) forall_unix_sockets(i, s) { - int open_count = 0; - /* * If all instances of the descriptor are not * in flight we are in use. @@ -218,10 +230,20 @@ void unix_gc(void) * In this case (see unix_create1()) we set artificial * negative inflight counter to close race window. * It is trick of course and dirty one. +* +* Get the inflight counter first, then the open +* counter. This avoids problems if racing with +* sendmsg +* +* If just created socket is not yet attached to +* a file descriptor, assume open_count of 1 */ + int inflight_count = atomic_read(unix_sk(s)-inflight); + int open_count = 1; + if (s-sk_socket s-sk_socket-file) open_count = file_count(s-sk_socket-file); - if (open_count atomic_read(unix_sk(s)-inflight)) + if (open_count inflight_count) maybe_unmark_and_push(s); } @@ -302,11 +324,12 @@ void unix_gc(void) u-gc_tree = GC_ORPHAN; } spin_unlock(unix_table_lock); + up_write(unix_gc_sem); /* * Here we are. Hitlist is filled. Die. */ __skb_queue_purge(hitlist); - mutex_unlock(unix_gc_sem); + mutex_unlock(unix_gc_local_lock); } Index: linux-2.6.22-rc2/include/net/af_unix.h === --- linux-2.6.22-rc2.orig/include/net/af_unix.h 2007-04-26 05:08:32.0 +0200 +++ linux-2.6.22-rc2/include/net/af_unix.h 2007-06-04 09:13:56.0 +0200 @@ -14,6 +14,7 @@ extern void unix_gc(void); extern struct hlist_head unix_socket_table[UNIX_HASH_SIZE + 1]; extern spinlock_t unix_table_lock; +extern struct rw_semaphore unix_gc_sem; extern atomic_t unix_tot_inflight; Index: linux-2.6.22-rc2/net/unix/af_unix.c === --- linux-2.6.22-rc2.orig/net/unix/af_unix.c2007-06-03 23:58:11.0 +0200 +++ linux-2.6.22-rc2/net/unix/af_unix.c 2007-06-04 11:04:15.0 +0200 @@ -1572,6 +1572,7 @@ static int
Re: [PATCH] fix race in AF_UNIX
(resend, sorry, fscked up the address list) A recv() on an AF_UNIX, SOCK_STREAM socket can race with a send()+close() on the peer, causing recv() to return zero, even though the sent data should be received. This happens if the send() and the close() is performed between skb_dequeue() and checking sk-sk_shutdown in unix_stream_recvmsg(): process A skb_dequeue() returns NULL, there's no data in the socket queue process B new data is inserted onto the queue by unix_stream_sendmsg() process B sk-sk_shutdown is set to SHUTDOWN_MASK by unix_release_sock() process A sk-sk_shutdown is checked, unix_release_sock() returns zero This is only part of the story. It turns out, there are other races involving the garbage collector, that can throw away perfectly good packets with AF_UNIX sockets in them. The problems arise when a socket goes from installed to in-flight or vica versa during garbage collection. Since gc is done with a spinlock held, this only shows up on SMP. The following patch fixes it for me, but it's possibly the wrong approach. Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] --- Index: linux-2.6.22-rc2/net/unix/garbage.c === --- linux-2.6.22-rc2.orig/net/unix/garbage.c2007-06-03 23:58:11.0 +0200 +++ linux-2.6.22-rc2/net/unix/garbage.c 2007-06-04 11:39:42.0 +0200 @@ -90,6 +90,7 @@ static struct sock *gc_current = GC_HEAD; /* stack of objects to mark */ atomic_t unix_tot_inflight = ATOMIC_INIT(0); +DECLARE_RWSEM(unix_gc_sem); static struct sock *unix_get_socket(struct file *filp) @@ -169,7 +170,7 @@ static void maybe_unmark_and_push(struct void unix_gc(void) { - static DEFINE_MUTEX(unix_gc_sem); + static DEFINE_MUTEX(unix_gc_local_lock); int i; struct sock *s; struct sk_buff_head hitlist; @@ -179,9 +180,22 @@ void unix_gc(void) * Avoid a recursive GC. */ - if (!mutex_trylock(unix_gc_sem)) + if (!mutex_trylock(unix_gc_local_lock)) return; + + /* +* unix_gc_sem protects against sockets going from in-flight to +* installed +* +* Can't sleep on this, because skb_recv_datagram could be +* waiting for a packet that is to be sent by the thread which +* invoked the gc +*/ + if (!down_write_trylock(unix_gc_sem)) { + mutex_unlock(unix_gc_local_lock); + return; + } spin_lock(unix_table_lock); forall_unix_sockets(i, s) @@ -207,8 +221,6 @@ void unix_gc(void) forall_unix_sockets(i, s) { - int open_count = 0; - /* * If all instances of the descriptor are not * in flight we are in use. @@ -218,10 +230,20 @@ void unix_gc(void) * In this case (see unix_create1()) we set artificial * negative inflight counter to close race window. * It is trick of course and dirty one. +* +* Get the inflight counter first, then the open +* counter. This avoids problems if racing with +* sendmsg +* +* If just created socket is not yet attached to +* a file descriptor, assume open_count of 1 */ + int inflight_count = atomic_read(unix_sk(s)-inflight); + int open_count = 1; + if (s-sk_socket s-sk_socket-file) open_count = file_count(s-sk_socket-file); - if (open_count atomic_read(unix_sk(s)-inflight)) + if (open_count inflight_count) maybe_unmark_and_push(s); } @@ -302,11 +324,12 @@ void unix_gc(void) u-gc_tree = GC_ORPHAN; } spin_unlock(unix_table_lock); + up_write(unix_gc_sem); /* * Here we are. Hitlist is filled. Die. */ __skb_queue_purge(hitlist); - mutex_unlock(unix_gc_sem); + mutex_unlock(unix_gc_local_lock); } Index: linux-2.6.22-rc2/include/net/af_unix.h === --- linux-2.6.22-rc2.orig/include/net/af_unix.h 2007-04-26 05:08:32.0 +0200 +++ linux-2.6.22-rc2/include/net/af_unix.h 2007-06-04 09:13:56.0 +0200 @@ -14,6 +14,7 @@ extern void unix_gc(void); extern struct hlist_head unix_socket_table[UNIX_HASH_SIZE + 1]; extern spinlock_t unix_table_lock; +extern struct rw_semaphore unix_gc_sem; extern atomic_t unix_tot_inflight; Index: linux-2.6.22-rc2/net/unix/af_unix.c === --- linux-2.6.22-rc2.orig/net/unix/af_unix.c2007-06-03 23:58:11.0 +0200 +++ linux-2.6.22-rc2/net/unix/af_unix.c 2007-06-04 11:04:15.0
Re: [PATCH] fix race in AF_UNIX
On 6/2/07, Miklos Szeredi [EMAIL PROTECTED] wrote: From: Miklos Szeredi [EMAIL PROTECTED] A recv() on an AF_UNIX, SOCK_STREAM socket can race with a send()+close() on the peer, causing recv() to return zero, even though the sent data should be received. This happens if the send() and the close() is performed between skb_dequeue() and checking sk-sk_shutdown in unix_stream_recvmsg(): process A skb_dequeue() returns NULL, there's no data in the socket queue process B new data is inserted onto the queue by unix_stream_sendmsg() process B sk-sk_shutdown is set to SHUTDOWN_MASK by unix_release_sock() process A sk-sk_shutdown is checked, unix_release_sock() returns zero I'm surprised nobody noticed this, it's not hard to trigger. Maybe it's just (un)luck with the timing. It's possible to work around this bug in userspace, by retrying the recv() once in case of a zero return value. Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] --- Index: linux-2.6.22-rc2/net/unix/af_unix.c === --- linux-2.6.22-rc2.orig/net/unix/af_unix.c2007-06-02 23:45:47.0 +0200 +++ linux-2.6.22-rc2/net/unix/af_unix.c 2007-06-02 23:45:49.0 +0200 @@ -1711,20 +1711,23 @@ static int unix_stream_recvmsg(struct ki int chunk; struct sk_buff *skb; + unix_state_rlock(sk); this function doesn't exist anymore, see: http://git.kernel.org/?p=linux/kernel/git/davem/net-2.6.git;a=commitdiff;h=e97b6936b03dd0a62991f361c048cca38ac00198 There is also another AF_UNIX patch, also related to a race. - Arnaldo - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html