RE: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
From: casper@oracle.com > Sent: 21 October 2015 21:33 .. > >Er... So fd2 = dup(fd);accept(fd)/close(fd) should *not* trigger that > >behaviour, in your opinion? Because fd is sure as hell not the last > >descriptor refering to that socket - fd2 remains alive and well. > > > >Behaviour you describe below matches the _third_ option. > > >> >BTW, for real fun, consider this: > >> >7) > >> >// fd is a socket > >> >fd2 = dup(fd); > >> >in thread A: accept(fd); > >> >in thread B: accept(fd); > >> >in thread C: accept(fd2); > >> >in thread D: close(fd); If D is going to do this, it ought to be using dup2() to clone a copy of (say) /dev/null onto 'fd'. > >> >Which threads (if any), should get hit where it hurts? > >> > >> A & B should return from the accept with an error. C should > >> continue. Which is what happens on Solaris. That seems to completely break the normal *nix file scheme... > >> To this end each thread keeps a list of file descriptors > >> in use by the current active system call. > > Remember, Solaris (and SYSV) has extra levels of multiplexing between userspace and char special drivers (and probably sockets) than Linux does. As well as having multiple fd referencing the same struct FILE, multiple FILE can point to the same inode. If you have two different /dev entries for the same major/minor you also end up with separate inodes - all finally referencing the same driver data (indexed only by minor number). (You actually get two inodes in the chain, one for the disk filesystem and one char-special. The ref counts on both can be greater than 1.) David -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Mon, Nov 02, 2015 at 10:03:58AM +, David Laight wrote: > Remember, Solaris (and SYSV) has extra levels of multiplexing between > userspace and char special drivers (and probably sockets) than Linux does. > As well as having multiple fd referencing the same struct FILE, multiple > FILE can point to the same inode. As they ever could on any Unix. Every open(2) results in a new struct file (BTW, I've never seen that capitalization for kernel structure - not in v7, not in *BSD, etc.; FILE is a userland typedef and I would be rather surprised if any kernel, Solaris included, would've renamed 'struct file' to 'struct FILE'). > If you have two different /dev entries for the same major/minor you > also end up with separate inodes - all finally referencing the same > driver data (indexed only by minor number). Again, the same goes for all Unices, both Linux and Solaris included. And what the devil does that have to do with sockets, anyway? Or with the problem in question, while we are at it - they have different descriptors pointing to the same struct file behave differently; anything sensitive to file type would be past that point. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Wed, Oct 21, 2015 at 03:38:51PM +0100, Alan Burlison wrote: > On 21/10/2015 04:49, Al Viro wrote: > >BTW, for real fun, consider this: > >7) > >// fd is a socket > >fd2 = dup(fd); > >in thread A: accept(fd); > >in thread B: accept(fd); > >in thread C: accept(fd2); > >in thread D: close(fd); > > > >Which threads (if any), should get hit where it hurts? > > A & B should return from the accept with an error. C should continue. Which > is what happens on Solaris. So, I'm coming late to this discussion and I don't have the original context; however, to me this cited behavior seems undesirable and if I ran across it in the wild I would probably describe it as a bug. System call processing for operations on files involves translating a file descriptor (a number) into an open-file object (or "file description"), struct file in BSD and I think also in Linux. The actual system call logic operates on the open-file object, so once the translation happens application monkeyshines involving file descriptor numbers should have no effect on calls in progress. Other behavior would violate the principle of least surprise, as this basic architecture predates POSIX. The behavior Al Viro found in NetBSD's dup2 is a bug. System calls are supposed to be atomic, regardless of what POSIX might or might not say. I did not write that code but I'll pass the report on to those who did. -- David A. Holland dholl...@netbsd.org -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Fri, Oct 23, 2015 at 06:30:25PM +, David Holland wrote: > So, I'm coming late to this discussion and I don't have the original > context; however, to me this cited behavior seems undesirable and if I > ran across it in the wild I would probably describe it as a bug. Unfortunately, that's precisely what NetBSD is trying to implement (and that's what will happen if nothing else reopens fd). See the logics in fd_close(), with ->fo_restart() and waiting for all activity to settle down. As for the missing context, what fd_close() is doing is also unreliable - inducing ERESTART in other threads sitting in accept(2) and things like that and waiting for them to run into EBADF they'll get (barring races) on syscall restart; threads sitting in accept() et.al. on the same struct file, but with different descriptors will hopefully go into restart and continue unaffected. All that machinery relies on nothing having reused the descriptor for socket(2), dup2() target, etc. while those threads had been going through the syscall restart - if that happens, you are SOL, since accept(2) _will_ restart on an unexpected socket. Moreover, if you fix dup2() atomicity, this approach will reliably shit itself for situations when dup2() rather than close() is used to close the socket. It relies upon having at least some window where the victim descriptor would be yielding EBADF. > System call processing for operations on files involves translating a > file descriptor (a number) into an open-file object (or "file > description"), struct file in BSD and I think also in Linux. The > actual system call logic operates on the open-file object, so once the > translation happens application monkeyshines involving file descriptor > numbers should have no effect on calls in progress. Other behavior > would violate the principle of least surprise, as this basic > architecture predates POSIX. Well, to be fair, until '93 there was no way to have descriptor table changed under a syscall in the first place. The old model (everything up to and ncluding 4.4BSD final) simply didn't include anything of that sort - mapping from descriptors to open files was not shared and all changes a syscall might see were ones done by the syscall itself. So this thing isn't covered by the basic architecture - it's something that had been significantly new merely two decades ago. And POSIX still hasn't quite caught up with that newfangled 4.2BSD thing... IMO what you've described above is fine - that's how Linux works, that's how FreeBSD and OpenBSD work and that's how NetBSD used to work until 2008 or so. "Cancel syscall if any of the descriptors got dissociated from opened files by action of another thread, have the dissociating operation wait for all affected syscalls to run down" thing had been introduced then and it is similar to what Solaris is doing. AFAICS, the main issue with that is the memory footprint from hell and/or cacheline clusterfuck. Having accept(2) bugger off with e.g. EINTR in such situation isn't inherently worse or better than having it sit there as if close() or dup2() has not happened - matter of taste, and if there had been a way to do it without inflicting the price on processes that do not pull that kind of crap in the first place... might be worth considering. As it is, the memory footprint seems to be too heavy. I'm not entirely convinced that there's no clever way to avoid that, but right now I don't see anything that would look like a good approach. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Thu, Oct 22, 2015 at 11:55:42AM +0100, Alan Burlison wrote: > On 22/10/2015 05:21, Al Viro wrote: > > >>Most of the work on using a file descriptor is local to the thread. > > > >Using - sure, but what of cacheline dirtied every time you resolve a > >descriptor to file reference? > > Don't you have to do that anyway, to do anything useful with the file? Dirtying the cacheline that contains struct file itself is different, but that's not per-descriptor. > >In case of Linux we have two bitmaps and an array of pointers associated > >with descriptor table. They grow on demand (in parallel) > > * reserving a descriptor is done under ->file_lock (dropped/regained > >around memory allocation if we end up expanding the sucker, actual > >reassignment > >of pointers to array/bitmaps is under that spinlock) > > * installing a pointer is lockless (we wait for ongoing resize to > >settle, RCU takes care of the rest) > > * grabbing a file by index is lockless as well > > * removing a pointer is under ->file_lock, so's replacing it by dup2(). > > Is that table per-process or global? Usually it's per-process, but any thread could ask for a private instance to work with (and then spawn more threads sharing that instance - or getting independent copies). It's common for Plan 9-inspired models - basically, you treat every thread as a machine that consists of * memory * file descriptor table * namespace * signal handlers ... * CPU (i.e. actual thread of execution). The last part can't be shared; anything else can. fork(2) variant used to start new threads (clone(2) in case of Linux, rfork(2) in Plan 9 and *BSD) is told which components should be copies of parent's ones and which should be shared with the parent. fork(2) is simply "copy everything except for the namespace". It's fairly common to have "share everything", but intermediate variants are also possible. There are constraints (e.g. you can't share signal handlers without sharing the memory space), but descriptor table can be shared independently from memory space just fine. There's also a way to say "unshare this, this and that components" - mapped to unshare(2) in Linux and to rfork(2) in Plan 9. Best way to think of that is to consider descriptor table as a first-class object a thread can be connected to. Usually you have one for each process, with all threads belonging to that process connected to the same thing, but that's just the most common use. > I don't think that it's possible to claim that a non-atomic dup2() > is POSIX-compliant. Except that it's in non-normative part of dup2(2), AFAICS. I certainly agree that it would be a standard lawyering beyond reason, but "not possible to claim" is too optimistic. Maybe I'm just more cynical... > ThreadA remains sat in accept on fd1 which is now a plain file, not > a socket. No. accept() is not an operation on file descriptors; it's an operation on file descriptions (pardon for use of that terminology). They are specified by passing descriptors, but there's a hell of a difference between e.g. dup() or fcntl(,F_SETFD,) (operations on descriptors) and read() or lseek() (operations on descriptions). Lookups are done once per syscall; the only exception is F_SETFL{,W}, where we recheck that descriptor is refering to the same thing before granting the lock. Again, POSIX is still underspecifying the semantics of shared descriptor tables; back when the bulk of it had been written there had been no way to have a descriptor -> description mapping changed under a syscall by action of another thread. Hell, they still hadn't picked on some things that happened in early 80s, let alone early-to-mid 90s... Linux and Solaris happen to cover these gaps differently; FreeBSD and OpenBSD are probably closer to Linux variant, NetBSD - to Solaris one. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On 22/10/2015 05:21, Al Viro wrote: Most of the work on using a file descriptor is local to the thread. Using - sure, but what of cacheline dirtied every time you resolve a descriptor to file reference? Don't you have to do that anyway, to do anything useful with the file? How much does it cover and to what degree is that local to thread? When it's a refcount inside struct file - no big deal, we'll be reading the same cacheline anyway and unless several threads are pounding on the same file with syscalls at the same time, that won't be a big deal. But when refcounts are associated with descriptors... There is a refcount in the struct held in the per-process list of open files and the 'slow path' processing is only taken if there's more than one LWP in the process that's accessing the file. In case of Linux we have two bitmaps and an array of pointers associated with descriptor table. They grow on demand (in parallel) * reserving a descriptor is done under ->file_lock (dropped/regained around memory allocation if we end up expanding the sucker, actual reassignment of pointers to array/bitmaps is under that spinlock) * installing a pointer is lockless (we wait for ongoing resize to settle, RCU takes care of the rest) * grabbing a file by index is lockless as well * removing a pointer is under ->file_lock, so's replacing it by dup2(). Is that table per-process or global? The point is, dup2() over _unused_ descriptor is inherently racy, but dup2() over a descriptor we'd opened and kept open should be safe. As it is, their implementation boils down to "userland must do process-wide exclusion between *any* dup2() and all syscalls that might create a new descriptor - open()/pipe()/socket()/accept()/recvmsg()/dup()/etc". At the very least, it's a big QoI problem, especially since such userland exclusion would have to be taken around the operations that can block for a long time. Sure, POSIX wording regarding dup2 is so weak that this behaviour can be argued to be compliant, but... replacement of the opened file associated with newfd really ought to be atomic to be of any use for multithreaded processes. There's existing language in the Issue 7 dup2() description that says dup2() has to be atomic: "the dup2( ) function provides unique services, as no other interface is able to atomically replace an existing file descriptor." And there is some new language in Issue 7 Technical Corrigenda 2 that reinforces that, when it's talking about reassignment of stdin/stdout/stderr: "Furthermore, a close() followed by a reopen operation (e.g. open(), dup() etc) is not atomic; dup2() should be used to change standard file descriptors." I don't think that it's possible to claim that a non-atomic dup2() is POSIX-compliant. IOW, if newfd is an opened descriptor prior to dup2() and no other thread attempts to close it by any means, there should be no window during which it would appear to be not opened. Linux and FreeBSD satisfy that; OpenBSD seems to do the same, from the quick look. NetBSD doesn't, no idea about Solaris. FWIW, older NetBSD implementation (prior to "File descriptor changes, discussed on tech-kern:" back in 2008) used to behave like OpenBSD one; it had fixed a lot of crap, so it's entirely possible that OpenBSD simply has kept the old implementation, with tons of other races in that area, but this dup2() race got introduced in that rewrite. Related to dup2(), there's some rather surprising behaviour on Linux. Here's the scenario: -- ThreadA opens, listens and accepts on socket fd1, waiting for incoming connections. ThreadB waits for a while, then opens normal file fd2 for read/write. ThreadB uses dup2 to make fd1 a clone of fd2. ThreadB closes fd2. ThreadA remains sat in accept on fd1 which is now a plain file, not a socket. ThreadB writes to fd1, the result of which appears in the file, so fd1 is indeed operating as a plain file. ThreadB exits. ThreadA is still sat in accept on fd1. A connection is made to fd1 by another process. The accept call succeeds and returns the incoming connection. fd1 is still operating as a socket, even though it's now actually a plain file. -- I assume this is another consequence of the fact that threads waiting in accept don't get a notification if the fd they are using is closed, either directly by a call to close or by a syscall such as dup2. Not waking up other threads on a fd when it is closed seems like it's undesirable behaviour. I can see the reasoning behind allowing shutdown to be used to do such a wakeup even if that's not POSIX-compliant - it may make it slightly easier for applications avoid fd recycling races. However the current situation is that shutdown is the *only* way to perform such a wakeup - simply closing the fd has no effect on any other threads. That seems incorrect. -- Alan Burlison -- -- To unsubscribe from this list: send the line
Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On 22/10/15 19:16, Al Viro wrote: Don't you have to do that anyway, to do anything useful with the file? Dirtying the cacheline that contains struct file itself is different, but that's not per-descriptor. Yes, true enough. Usually it's per-process, but any thread could ask for a private instance to work with (and then spawn more threads sharing that instance - or getting independent copies). [snip] Thanks again for the info, interesting. Solaris also does things along the same lines. In fact we recently extended posix_spawn so it could be used by the JVM to create subprocesses without wholescale copying, and Casper has done some optimisation work on posix_spawn as well. I don't think that it's possible to claim that a non-atomic dup2() is POSIX-compliant. Except that it's in non-normative part of dup2(2), AFAICS. I certainly agree that it would be a standard lawyering beyond reason, but "not possible to claim" is too optimistic. Maybe I'm just more cynical... Possibly so, and possibly justifiably so as well ;-) ThreadA remains sat in accept on fd1 which is now a plain file, not a socket. No. accept() is not an operation on file descriptors; it's an operation on file descriptions (pardon for use of that terminology). They are specified by passing descriptors, but there's a hell of a difference between e.g. dup() or fcntl(,F_SETFD,) (operations on descriptors) and read() or lseek() (operations on descriptions). Lookups are done once per syscall; the only exception is F_SETFL{,W}, where we recheck that descriptor is refering to the same thing before granting the lock. Yes, but if you believe that dup2() requires an implicit close() within it and that dup2() has to be atomic then expecting that other threads waiting on the same fd in accept() will get a notification seems reasonable enough. Again, POSIX is still underspecifying the semantics of shared descriptor tables; back when the bulk of it had been written there had been no way to have a descriptor -> description mapping changed under a syscall by action of another thread. Hell, they still hadn't picked on some things that happened in early 80s, let alone early-to-mid 90s... That's indisputably true - much of the POSIX behaviour predates threads and it shows, quite badly, in some places. Linux and Solaris happen to cover these gaps differently; FreeBSD and OpenBSD are probably closer to Linux variant, NetBSD - to Solaris one. Yes, true enough. -- Alan Burlison -- -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Wed, 2015-10-21 at 14:03 +0100, Alan Burlison wrote: > On 21/10/2015 12:28, Eric Dumazet wrote: > > > This works for me. Please double check your programs > > I have just done so, it works as you say for AF_INET sockets but if you > switch to AF_UNIX sockets it does the wrong thing in the way I described. > Oh well. Please try the following : diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c index 94f658235fb4..24dec8bb571d 100644 --- a/net/unix/af_unix.c +++ b/net/unix/af_unix.c @@ -328,7 +328,8 @@ found: static inline int unix_writable(struct sock *sk) { - return (atomic_read(>sk_wmem_alloc) << 2) <= sk->sk_sndbuf; + return sk->sk_state != TCP_LISTEN && + (atomic_read(>sk_wmem_alloc) << 2) <= sk->sk_sndbuf; } static void unix_write_space(struct sock *sk) -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On 21/10/2015 04:49, Al Viro wrote: Firstly, thank you for the comprehensive and considered reply. Refcount is an implementation detail, of course. However, in any Unix I know of, there are two separate notions - descriptor losing connection to opened file (be it from close(), exit(), execve(), dup2(), etc.) and opened file getting closed. Yep, it's an implementation detail inside the kernel - Solaris also has a refcount inside its vnodes. However that's really only dimly visible at the process level, where all you have is an integer file ID. The latter cannot happen while there are descriptors connected to the file in question, of course. However, that is not the only thing that might prevent an opened file from getting closed - e.g. sending an SCM_RIGHTS datagram with attached descriptor connected to the opened file in question *at* *the* *moment* *of* *sendmsg(2)* will carry said opened file until it is successfully received or discarded (in the former case recepient will get a new descriptor refering to that opened file, of course). Having the original descriptor closed right after sendmsg(2) does *not* do anything to opened file. On any Unix that implements descriptor-passing. I believe async IO data is another way that a file can remain live after a close(), from the close() section of IEEE Std 1003.1: "An I/O operation that is not canceled completes as if the close() operation had not yet occurred" There's going to be a notion of "last close"; that's what this refcount is about and _that_ is more than implementation detail. Yes, POSIX distinguishes between "file descriptor" and "file description" (ugh!) and the close() page says: "When all file descriptors associated with an open file description have been closed, the open file description shall be freed." In the context of this discussion I believe it's the behaviour of the integer file descriptor that's the issue. Once it's had close() called on it then it's invalid, and any IO on it should fail, even if the underlying file description is still 'live'. In other words, is that destruction of * any descriptor refering to this socket [utterly insane for obvious reasons] * the last descriptor refering to this socket (modulo descriptor passing, etc.) [a bitch to implement, unless we treat a syscall in progress as keeping the opened file open], or * _the_ descriptor used to issue accept(2) [a bitch to implement, with a lot of fun races in an already race-prone area]? From reading the POSIX close() page I believe the second option is the correct one. Additional question is whether it's * just a magical behaviour of close(2) [ugly], or * something that happens when descriptor gets dissociated from opened file [obviously more consistent]? The second, I believe. BTW, for real fun, consider this: 7) // fd is a socket fd2 = dup(fd); in thread A: accept(fd); in thread B: accept(fd); in thread C: accept(fd2); in thread D: close(fd); Which threads (if any), should get hit where it hurts? A & B should return from the accept with an error. C should continue. Which is what happens on Solaris. I have no idea what semantics does Solaris have in that area and how racy their descriptor table handling is. And no, I'm not going to RTFS their kernel, CDDL being what it is. I can answer that for you :-) I've looked through the appropriate bits of the Solaris kernel code and my colleague Casper has written an excellent summary of what happens, so with his permission I've just copied it verbatim below: -- Since at least Solaris 7 (1998), a thread which is sleeping on a file descriptor which is being closed by another thread, will be woken up. To this end each thread keeps a list of file descriptors in use by the current active system call. When a file descriptor is closed and this file descriptor is marked as being in use by other threads, the kernel will search all threads to see which have this file descriptor listed as in use. For each such thread, the kernel tells the thread that its active fds list is now stale and, if possible, makes the thread run. While this algorithm is pretty expensive, it is not often invoked. The thread running close() will NOT return until all other threads using that filedescriptor have released it. When run, the thread will return from its syscall and will in most cases return EBADF. A second thread trying to close this same file descriptor may return earlier with close() returning EBADF. -- -- Alan Burlison -- -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Wed, Oct 21, 2015 at 10:33:04PM +0200, casper@oracle.com wrote: > > >On Wed, Oct 21, 2015 at 03:38:51PM +0100, Alan Burlison wrote: > > > >> >There's going to be a notion of "last close"; that's what this refcount is > >> >about and _that_ is more than implementation detail. > >> > >> Yes, POSIX distinguishes between "file descriptor" and "file > >> description" (ugh!) and the close() page says: > > > >Would've been better if they went for something like "IO channel" for > >the latter ;-/ > > Or at least some other word. A file descriptor is just an index to > a list of file pointers (and wasn't named so?) *nod* There's no less than 3 distinct notions associated with the word "file" - "file as collection of bytes on filesystem", "opened file as IO channel" and "file descriptor", all related ;-/ "File description" vs. "file descriptor" is atrociously bad terminology. > >Unless they want to consider in-flight descriptor-passing datagrams as > >collections of file descriptors, the quoted sentence is seriously misleading. > >And then there's mmap(), which they do kinda-sorta mention... > > Well, a file descriptor really only exists in the context of a process; > in-flight it is no longer a file descriptor as there process context with > a file descriptor table; so pointers to file descriptions are passed > around. Yes. Note, BTW, that descriptor contains a bit more than a pointer - there are properties associated with it (close-on-exec and is-it-already-claimed), which makes abusing it for describing SCM_RIGHTS payloads even more of a stretch. IOW, description of semantics for close() and friends needs fixing - it simply does not match the situation on anything that would be anywhere near POSIX compliance in other areas. > >Sure, but the upkeep of data structures it would need is there > >whether you actually end up triggering it or not. Both in > >memory footprint and in cacheline pingpong... > > Most of the work on using a file descriptor is local to the thread. Using - sure, but what of cacheline dirtied every time you resolve a descriptor to file reference? How much does it cover and to what degree is that local to thread? When it's a refcount inside struct file - no big deal, we'll be reading the same cacheline anyway and unless several threads are pounding on the same file with syscalls at the same time, that won't be a big deal. But when refcounts are associated with descriptors... In case of Linux we have two bitmaps and an array of pointers associated with descriptor table. They grow on demand (in parallel) * reserving a descriptor is done under ->file_lock (dropped/regained around memory allocation if we end up expanding the sucker, actual reassignment of pointers to array/bitmaps is under that spinlock) * installing a pointer is lockless (we wait for ongoing resize to settle, RCU takes care of the rest) * grabbing a file by index is lockless as well * removing a pointer is under ->file_lock, so's replacing it by dup2(). Grabbing a file by descriptor follows pointer from task_struct to descriptor table, from descriptor table to element of array of pointers (embedded when we have few descriptors, but becomes separately allocated when more is needed), and from array element to struct file. In struct file we fetch ->f_mode and (if descriptor table is shared) atomically increment ->f_count. For comparison, NetBSD has an extra level of indirection (with similar tricks for embedding them while there are few descriptors), with a lot fatter structure around the pointer to file - they keep close-on-exec and in-use in there, along with refcount and their equivalent of waitqueue. These structures, once they grow past the embedded set, are allocated one-by-one, so copying the table on fork() costs a _lot_ more. Rather than an array of pointers to files they have an array of pointers to those guys. Reserving a descriptor triggers allocation of new struct fdfile and installing a pointer to it into the array. Allows for slightly simpler installing of pointer to file afterwards - unlike us, they don't have to be careful about array resize happening in parallel. Grabbing a file by index is lockless, so's installing a pointer to file. Reserving a descriptor is under ->fd_lock (mutex rather than a spinlock). Removing a pointer is under ->fd_lock, so's replacing it by dup2(), but dup2() has an unpleasant race (see upthread). They do the same amount of pointer-chasing on lookup proper, but only because they do not look into struct file itself there. Which happens immediately afterwards, since callers *will* look into what they've got. I didn't look into the details of barrier use in there, but it looks generally sane. Cacheline pingpong is probably not a big deal there, but only because these structures are fat and scattered. Another fun issue is that they have in-use bits buried deep, which means that they need to mirror them in a separate
Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Wed, 2015-10-21 at 15:38 +0100, Alan Burlison wrote: > -- > Since at least Solaris 7 (1998), a thread which is sleeping > on a file descriptor which is being closed by another thread, > will be woken up. > > To this end each thread keeps a list of file descriptors > in use by the current active system call. Ouch. > > When a file descriptor is closed and this file descriptor > is marked as being in use by other threads, the kernel > will search all threads to see which have this file descriptor > listed as in use. For each such thread, the kernel tells > the thread that its active fds list is now stale and, if > possible, makes the thread run. > This is what I feared. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Wed, Oct 21, 2015 at 03:38:51PM +0100, Alan Burlison wrote: > >There's going to be a notion of "last close"; that's what this refcount is > >about and _that_ is more than implementation detail. > > Yes, POSIX distinguishes between "file descriptor" and "file > description" (ugh!) and the close() page says: Would've been better if they went for something like "IO channel" for the latter ;-/ > "When all file descriptors associated with an open file description > have been closed, the open file description shall be freed." BTW, is SCM_RIGHTS outside of scope? They do mention it in one place only: | Ancillary data is also possible at the socket level. The | header shall define the following symbolic constant for use as the cmsg_type | value when cmsg_level is SOL_SOCKET: | | SCM_RIGHTS | Indicates that the data array contains the access rights to be sent or | received. with no further details whatsoever. It's been there since at least 4.3-Reno; does anybody still use the older variant (->msg_accrights, that is)? IIRC, there was some crap circa 2.6 when Solaris used to do ->msg_accrights for descriptor-passing, but more or less current versions appear to support SCM_RIGHTS... In any case, descriptor-passing had been there in some form since at least '83 (the old variant is already present in 4.2) and considering it out-of-scope for POSIX is bloody ridiculous, IMO. Unless they want to consider in-flight descriptor-passing datagrams as collections of file descriptors, the quoted sentence is seriously misleading. And then there's mmap(), which they do kinda-sorta mention... > >In other words, is that destruction of > > * any descriptor refering to this socket [utterly insane for obvious > >reasons] > > * the last descriptor refering to this socket (modulo descriptor > >passing, etc.) [a bitch to implement, unless we treat a syscall in progress > >as keeping the opened file open], or > > * _the_ descriptor used to issue accept(2) [a bitch to implement, > >with a lot of fun races in an already race-prone area]? > > From reading the POSIX close() page I believe the second option is > the correct one. Er... So fd2 = dup(fd);accept(fd)/close(fd) should *not* trigger that behaviour, in your opinion? Because fd is sure as hell not the last descriptor refering to that socket - fd2 remains alive and well. Behaviour you describe below matches the _third_ option. > >BTW, for real fun, consider this: > >7) > >// fd is a socket > >fd2 = dup(fd); > >in thread A: accept(fd); > >in thread B: accept(fd); > >in thread C: accept(fd2); > >in thread D: close(fd); > > > >Which threads (if any), should get hit where it hurts? > > A & B should return from the accept with an error. C should > continue. Which is what happens on Solaris. > To this end each thread keeps a list of file descriptors > in use by the current active system call. Yecc... How much cross-CPU traffic does that cause on multithread processes? Not on close(2), on maintaining the descriptor use counts through the normal syscalls. > When a file descriptor is closed and this file descriptor > is marked as being in use by other threads, the kernel > will search all threads to see which have this file descriptor > listed as in use. For each such thread, the kernel tells > the thread that its active fds list is now stale and, if > possible, makes the thread run. > > While this algorithm is pretty expensive, it is not often invoked. Sure, but the upkeep of data structures it would need is there whether you actually end up triggering it or not. Both in memory footprint and in cacheline pingpong... Besides, the list of threads using given descriptor table also needs to be maintained, unless you scan *all* threads in the system (which would be quite a fun wrt latency and affect a lot more than just the process doing something dumb and rare). BTW, speaking of fun races: AFAICS, NetBSD dup2() isn't atomic. It calls fd_close() outside of ->fd_lock (has to, since fd_close() is grabbing that itself), so another thread doing e.g. fcntl(newfd, F_GETFD) in the middle of dup2(oldfd, newfd) might end up with EBADF, even though both before and after dup2() newfd had been open. What's worse, thread A: fd1 = open("foo", ...); fd2 = open("bar", ...); ... dup2(fd1, fd2); thread B: fd = open("baz", ...); might, AFAICS, end up with fd == fd2 and refering to foo instead of baz. All it takes is the last open() managing to grab ->fd_lock just as fd_close() from dup2() has dropped it. Which is an unexpected behaviour, to put it mildly, no matter how much standard lawyering one applies... -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
>On Wed, Oct 21, 2015 at 03:38:51PM +0100, Alan Burlison wrote: > >> >There's going to be a notion of "last close"; that's what this refcount is >> >about and _that_ is more than implementation detail. >> >> Yes, POSIX distinguishes between "file descriptor" and "file >> description" (ugh!) and the close() page says: > >Would've been better if they went for something like "IO channel" for >the latter ;-/ Or at least some other word. A file descriptor is just an index to a list of file pointers (and wasn't named so?) >> "When all file descriptors associated with an open file description >> have been closed, the open file description shall be freed." > >BTW, is SCM_RIGHTS outside of scope? They do mention it in one place >only: >| Ancillary data is also possible at the socket level. The >| header shall define the following symbolic constant for use as the cmsg_type >| value when cmsg_level is SOL_SOCKET: >| >| SCM_RIGHTS >| Indicates that the data array contains the access rights to be sent or >| received. > >with no further details whatsoever. It's been there since at least 4.3-Reno; >does anybody still use the older variant (->msg_accrights, that is)? IIRC, >there was some crap circa 2.6 when Solaris used to do ->msg_accrights for >descriptor-passing, but more or less current versions appear to support >SCM_RIGHTS... In any case, descriptor-passing had been there in some form >since at least '83 (the old variant is already present in 4.2) and considering >it out-of-scope for POSIX is bloody ridiculous, IMO. SCM_RIGHTS was introduced as part of the POSIX standardization of BSD sockets. Looks like they became part of Solaris 2.6, but the default was non-standard sockets so you may easily find msg->accrights but not SCM_RIGHTS. msg_accrights is what was introduced in BSD in likely the first implementation of socket-based file descriptor passing. SysV has its own file descriptor passing on file descriptors passing. But that interface is too much ad-hoc, so SCM_RIGHTS is a standardizations allowing multiple types of messages to be send around >Unless they want to consider in-flight descriptor-passing datagrams as >collections of file descriptors, the quoted sentence is seriously misleading. >And then there's mmap(), which they do kinda-sorta mention... Well, a file descriptor really only exists in the context of a process; in-flight it is no longer a file descriptor as there process context with a file descriptor table; so pointers to file descriptions are passed around. >> >In other words, is that destruction of >> >* any descriptor refering to this socket [utterly insane for obvious >> >reasons] >> >* the last descriptor refering to this socket (modulo descriptor >> >passing, etc.) [a bitch to implement, unless we treat a syscall in progress >> >as keeping the opened file open], or >> >* _the_ descriptor used to issue accept(2) [a bitch to implement, >> >with a lot of fun races in an already race-prone area]? >> >> From reading the POSIX close() page I believe the second option is >> the correct one. > >Er... So fd2 = dup(fd);accept(fd)/close(fd) should *not* trigger that >behaviour, in your opinion? Because fd is sure as hell not the last >descriptor refering to that socket - fd2 remains alive and well. > >Behaviour you describe below matches the _third_ option. >> >BTW, for real fun, consider this: >> >7) >> >// fd is a socket >> >fd2 = dup(fd); >> >in thread A: accept(fd); >> >in thread B: accept(fd); >> >in thread C: accept(fd2); >> >in thread D: close(fd); >> > >> >Which threads (if any), should get hit where it hurts? >> >> A & B should return from the accept with an error. C should >> continue. Which is what happens on Solaris. > >> To this end each thread keeps a list of file descriptors >> in use by the current active system call. > >Yecc... How much cross-CPU traffic does that cause on >multithread processes? Not on close(2), on maintaining the >descriptor use counts through the normal syscalls. In the Solaris implementation is pretty much what we do; but there is no much cross-CPU traffic. Of course, you will need to keep locks in the file descriptor table if only to find the actual file pointer. The work is done only in the case of a badly written application where close is required to hunt down all threads currently using the specific file descriptor. >> When a file descriptor is closed and this file descriptor >> is marked as being in use by other threads, the kernel >> will search all threads to see which have this file descriptor >> listed as in use. For each such thread, the kernel tells >> the thread that its active fds list is now stale and, if >> possible, makes the thread run. >> >> While this algorithm is pretty expensive, it is not often invoked. > >Sure, but the upkeep of data structures it would need is there >whether you actually end up triggering it or not. Both in >memory footprint and in cacheline pingpong... Most of the
Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On 21/10/2015 19:51, Al Viro wrote: BTW, is SCM_RIGHTS outside of scope? They do mention it in one place only: | Ancillary data is also possible at the socket level. The | header shall define the following symbolic constant for use as the cmsg_type | value when cmsg_level is SOL_SOCKET: | | SCM_RIGHTS | Indicates that the data array contains the access rights to be sent or | received. That's still exactly the same in Issue 7, Technical Corrigenda 2 which is in final review at the moment. In other words, is that destruction of * any descriptor refering to this socket [utterly insane for obvious reasons] * the last descriptor refering to this socket (modulo descriptor passing, etc.) [a bitch to implement, unless we treat a syscall in progress as keeping the opened file open], or * _the_ descriptor used to issue accept(2) [a bitch to implement, with a lot of fun races in an already race-prone area]? From reading the POSIX close() page I believe the second option is the correct one. Er... So fd2 = dup(fd);accept(fd)/close(fd) should *not* trigger that behaviour, in your opinion? Because fd is sure as hell not the last descriptor refering to that socket - fd2 remains alive and well. Behaviour you describe below matches the _third_ option. Ah, I wasn't 100% sure what the intended difference between #2 and #3 was TBH, it does sound like I meant #3, yes :-) -- Alan Burlison -- -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On 21/10/2015 12:28, Eric Dumazet wrote: This works for me. Please double check your programs I have just done so, it works as you say for AF_INET sockets but if you switch to AF_UNIX sockets it does the wrong thing in the way I described. -- Alan Burlison -- -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
From: Alan Burlison > Sent: 20 October 2015 19:31 ... > The problem with poll() is that it returns immediately when passed a FD > that is in the listening state. rather than waiting until there's an > incoming connection to handle. As I said, that means you can't use > poll() to multiplex between read/write/accept sockets. That seems to work for me... David -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On 21/10/2015 11:25, David Laight wrote: The problem with poll() is that it returns immediately when passed a FD that is in the listening state. rather than waiting until there's an incoming connection to handle. As I said, that means you can't use poll() to multiplex between read/write/accept sockets. That seems to work for me... In my test case I was setting all the available event bits in pollfd.events to see what came back. With poll() on a listen() socket you get an immediate return with bits set in revents indicating the socket is available for output, which of course it isn't. Indeed an attempt to write to it fails. If you remove the output event bits in pollfd.events then the poll() waits as expected until there's an incoming connection on the socket. I suppose one answer is "Well, don't do that then" but returning an output indication on a socket that's in listen() seems rather odd. With POLLOUT|POLLWRNORM|POLLWRBAND: main: polling #1 [returns immediately] main: poll #1: Success poll fd: 0 revents: POLLOUT POLLWRBAND main: write #1: Transport endpoint is not connected Without POLLOUT|POLLWRNORM|POLLWRBAND: main: polling #1 [waits for connection] main: poll #1: Success poll fd: 0 revents: POLLIN POLLRDNORM -- Alan Burlison -- -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Wed, 2015-10-21 at 11:49 +0100, Alan Burlison wrote: > On 21/10/2015 11:25, David Laight wrote: > > >> The problem with poll() is that it returns immediately when passed a FD > >> that is in the listening state. rather than waiting until there's an > >> incoming connection to handle. As I said, that means you can't use > >> poll() to multiplex between read/write/accept sockets. > > > > That seems to work for me... > > In my test case I was setting all the available event bits in > pollfd.events to see what came back. With poll() on a listen() socket > you get an immediate return with bits set in revents indicating the > socket is available for output, which of course it isn't. Indeed an > attempt to write to it fails. If you remove the output event bits in > pollfd.events then the poll() waits as expected until there's an > incoming connection on the socket. > > I suppose one answer is "Well, don't do that then" but returning an > output indication on a socket that's in listen() seems rather odd. > > With POLLOUT|POLLWRNORM|POLLWRBAND: > > main: polling #1 > [returns immediately] > main: poll #1: Success > poll fd: 0 revents: POLLOUT POLLWRBAND > main: write #1: Transport endpoint is not connected > > Without POLLOUT|POLLWRNORM|POLLWRBAND: > > main: polling #1 > [waits for connection] > main: poll #1: Success > poll fd: 0 revents: POLLIN POLLRDNORM > This works for me. Please double check your programs 242046 poll([{fd=3, events=POLLIN|POLLOUT|POLLWRNORM|POLLWRBAND}], 1, 4294967295 242046 <... poll resumed> ) = 1 ([{fd=3, revents=POLLIN}]) 242046 accept(3, {sa_family=AF_INET6, sin6_port=htons(35888), inet_pton(AF_INET6, ":::10.246.7.151", _addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 4 242046 close(4) = 0 242046 poll([{fd=3, events=POLLIN|POLLOUT|POLLWRNORM|POLLWRBAND}], 1, 4294967295) = 1 ([{fd=3, revents=POLLIN}]) 242046 accept(3, {sa_family=AF_INET6, sin6_port=htons(35890), inet_pton(AF_INET6, ":::10.246.7.151", _addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 4 242046 close(4) = 0 242046 poll([{fd=3, events=POLLIN|POLLOUT|POLLWRNORM|POLLWRBAND}], 1, 4294967295) = 1 ([{fd=3, revents=POLLIN}]) 242046 accept(3, {sa_family=AF_INET6, sin6_port=htons(35892), inet_pton(AF_INET6, ":::10.246.7.151", _addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 4 242046 close(4) = 0 242046 poll([{fd=3, events=POLLIN|POLLOUT|POLLWRNORM|POLLWRBAND}], 1, 4294967295) = 1 ([{fd=3, revents=POLLIN}]) 242046 accept(3, {sa_family=AF_INET6, sin6_port=htons(35894), inet_pton(AF_INET6, ":::10.246.7.151", _addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 4 242046 close(4) = 0 poll() only cares about POLLIN | POLLRDNORM for a listener. static inline unsigned int inet_csk_listen_poll(const struct sock *sk) { return !reqsk_queue_empty(_csk(sk)->icsk_accept_queue) ? (POLLIN | POLLRDNORM) : 0; } if (sk->sk_state == TCP_LISTEN) return inet_csk_listen_poll(sk); -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On 20/10/2015 02:45, Eric Dumazet wrote: On Tue, 2015-10-20 at 02:12 +0100, Alan Burlison wrote: Another problem is that if I call close() on a Linux socket that's in accept() the accept call just sits there until there's an incoming connection, which succeeds even though the socket is supposed to be closed, but then an immediately following accept() on the same socket fails. This is exactly what the comment I pasted documents. Yes, and it's caveated with "temporary solution" and "_not_ a good idea" and says that the problem is that close() needs repairing. In other works, the change in shutdown() behaviour appears to be a workaround for an acknowledged bug, and it's one with consequences. There are two separate things here. The first is the close/reopen race issue with filehandles. As I said, I believe that's an artefact of history, because it wasn't possible for it to happen before threads and was possible after them. There appears to be no good way to avoid this with the current *NIX filehandle semantics when threads are being used. History sucks. The second is how/if you might work around that. As far as I can tell, Linux does so by allowing shutdown() to be called on unconnected sockets and uses that as a signal that threads waiting in accept() should return from the accept with a failure but without the filehandle being actually closed, and therefore not being available for reuse, and therefore not subject to potential races. However by doing so I believe the behaviour of shutdown is then not POSIX-conforming. The Linux manpage for shutdown(2) says "CONFORMING TO POSIX.1-2001", as far as I can tell it isn't. At very least I believe the manpage needs changing. On linux, doing close(listener) on one thread does _not_ wakeup other threads doing accept(listener) Allowing an in-progress accept() to continue and to succeed at some point in the distant future on a filehandle that's closed seems incorrect. Also, the behaviour of poll() that I mentioned - that it returns immediately for a socket in the listen() state - does seem like an out-and-out bug to me, I haven't seen any explanation of why that might be correct behaviour. So I guess allowing shutdown(listener) was a way to somehow propagate some info on the threads stuck in accept() Yes, I think you are right. This is a VFS issue, and a long standing one. That seems to be what the comment you quoted is saying, yes. Think of all cases like dup() and fd passing games, and the close(fd) being able to signal out of band info is racy. Yes, I agree there are potential race conditions, intrinsic to the way *NIX FDs work. As I said earlier, there *may* be a portable way around this using /dev/null & dup2() but I haven't had chance to investigate that yet. close() is literally removing one ref count on a file. Expecting it doing some kind of magical cleanup of a socket is not reasonable/practical. I'm not sure what that means at an application level - no matter how many copies I make of a FD in a process it's still just an integer and calling close on it closes it and causes any future IOs to fail - except for the case of sockets in accept() it seems, which continue and may even eventually succeed. Leaving aside the behaviour of shutdown() on listening sockets, the current behaviour of close() on a socket in accept() seems incorrect. And then of course there's also the poll() issue. On a multi threaded program, each thread doing an accept() increased the refcount on the file. That may be how Linux implements accept(), but I don't see anything about refcounting in the POSIX spec for accept(). Really, I have no idea of how Solaris coped with this, and I do not want to know. The bug goes into quite some detail about how Solaris behaves. The issue here is that we have two implementations, Linux and Solaris, both claiming to be POSIX-conformant but both showing different behaviour. There's a discussion to be had about the whys and wherefores of that difference, but saying that you don't want to know how Solaris behaves isn't really going to help move the conversation along. -- Alan Burlison -- -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On 20/10/2015 16:30, Eric Dumazet wrote: A close() does _not_ wakeup an accept() or a poll() (this is exactly the same problem), or any blocking system call using the same 'file' Not waking up the accept() is one thing, allowing the accept() to successfully complete some indeterminate time after the socket has been closed is something else entirely. You shouldn't have to call shutdown() to terminate an accept() on a socket, close() should suffice. Yes, if you want to use the shutdown() 'feature' to kick the accept() thread out of the call to accept() without closing the FD and you don't care about cross-platform compatibility then you can call shutdown() followed by close(). However that's only ever required on Linux as far as I can tell, and even on Linux applications that deal with the thread race by other means shouldn't be forced to use shutdown() when just close() would suffice. The problem with poll() is that it returns immediately when passed a FD that is in the listening state. rather than waiting until there's an incoming connection to handle. As I said, that means you can't use poll() to multiplex between read/write/accept sockets. close() man page states : NOTES It is probably unwise to close file descriptors while they may be in use by system calls in other threads in the same process. Since a file descriptor may be reused, there are some obscure race conditions that may cause unintended side effects. You are in this grey zone. No, the race issue with file descriptor reuse and the close() behaviour are not the same thing. The manpage comment is correct, but not relevant. -- Alan Burlison -- -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Tue, 2015-10-20 at 19:31 +0100, Alan Burlison wrote: > No, the race issue with file descriptor reuse and the close() behaviour > are not the same thing. The manpage comment is correct, but not relevant. Ok, it seems you know better than me, I will be stop the discussion and wait for your patches. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Tue, 2015-10-20 at 14:45 +0100, Alan Burlison wrote: > And I still haven't seen any reasoning behind the Linux close() and > poll() behaviour on sockets that are in the listen state. Same answer. A close() does _not_ wakeup an accept() or a poll() (this is exactly the same problem), or any blocking system call using the same 'file' This is the choice Linus Torvalds and Al Viro did years ago. It is set in stone, unless someone comes up with a very nice patch set, that does not break existing applications. close() man page states : NOTES It is probably unwise to close file descriptors while they may be in use by system calls in other threads in the same process. Since a file descriptor may be reused, there are some obscure race conditions that may cause unintended side effects. You are in this grey zone. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Mon, Oct 19, 2015 at 06:45:32PM -0700, Eric Dumazet wrote: > On Tue, 2015-10-20 at 02:12 +0100, Alan Burlison wrote: > > > Another problem is that if I call close() on a Linux socket that's in > > accept() the accept call just sits there until there's an incoming > > connection, which succeeds even though the socket is supposed to be > > closed, but then an immediately following accept() on the same socket > > fails. > > This is exactly what the comment I pasted documents. > > On linux, doing close(listener) on one thread does _not_ wakeup other > threads doing accept(listener) > > So I guess allowing shutdown(listener) was a way to somehow propagate > some info on the threads stuck in accept() > > This is a VFS issue, and a long standing one. > > Think of all cases like dup() and fd passing games, and the close(fd) > being able to signal out of band info is racy. > > close() is literally removing one ref count on a file. > Expecting it doing some kind of magical cleanup of a socket is not > reasonable/practical. > > On a multi threaded program, each thread doing an accept() increased the > refcount on the file. Refcount is an implementation detail, of course. However, in any Unix I know of, there are two separate notions - descriptor losing connection to opened file (be it from close(), exit(), execve(), dup2(), etc.) and opened file getting closed. The latter cannot happen while there are descriptors connected to the file in question, of course. However, that is not the only thing that might prevent an opened file from getting closed - e.g. sending an SCM_RIGHTS datagram with attached descriptor connected to the opened file in question *at* *the* *moment* *of* *sendmsg(2)* will carry said opened file until it is successfully received or discarded (in the former case recepient will get a new descriptor refering to that opened file, of course). Having the original descriptor closed right after sendmsg(2) does *not* do anything to opened file. On any Unix that implements descriptor-passing. There's going to be a notion of "last close"; that's what this refcount is about and _that_ is more than implementation detail. The real question is what kind of semantics would one want in the following situations: 1) // fd is a socket fcntl(fd, F_SETFD, FD_CLOEXEC); fork(); in parent: accept(fd); in child: execve() 2) // fd is a socket, 1 is /dev/null fork(); in parent: accept(fd); in child: dup2(1, fd); 3) // fd is a socket fd2 = dup(fd); in thread A: accept(fd); in thread B: close(fd); 4) // fd is a socket, 1 is /dev/null fd2 = dup(fd); in thread A: accept(fd); in thread B: dup2(1, fd); 5) // fd is a socket, 1 is /dev/null fd2 = dup(fd); in thread A: accept(fd); in thread B: close(fd2); 6) // fd is a socket in thread A: accept(fd); in thread B: close(fd); In other words, is that destruction of * any descriptor refering to this socket [utterly insane for obvious reasons] * the last descriptor refering to this socket (modulo descriptor passing, etc.) [a bitch to implement, unless we treat a syscall in progress as keeping the opened file open], or * _the_ descriptor used to issue accept(2) [a bitch to implement, with a lot of fun races in an already race-prone area]? Additional question is whether it's * just a magical behaviour of close(2) [ugly], or * something that happens when descriptor gets dissociated from opened file [obviously more consistent]? BTW, for real fun, consider this: 7) // fd is a socket fd2 = dup(fd); in thread A: accept(fd); in thread B: accept(fd); in thread C: accept(fd2); in thread D: close(fd); Which threads (if any), should get hit where it hurts? I honestly don't know what Solaris does; AFAICS, FreeBSD behaves like Linux these days. NetBSD plays really weird games in their fd_close(); what they are trying to achieve is at least sane - in (7) they'd hit A and B with EBADF and C would restart and continue waiting, in (3,4,6) A gets EBADF, in (1,2,5) accept() is unaffected. The problem is that their solution is racy - they have a separate refcount on _descriptor_, plus a file method (->fo_restart) for triggering an equivalent of signal interrupting anything that might be blocked on that sucker, with syscall restart (and subsequent EBADF on attempt to refetch the sucker. Racy if we reopen or are doing dup2() in the first place - these restarts might get CPU just after we return from dup2() and pick the *new* descriptor just fine. It might be possible to fix their approach (having if (__predict_false(ff->ff_file == NULL)) { /* * Another user of the file is already closing, and is * waiting for other users of the file to drain. Release * our reference, and wake up the closer. */ atomic_dec_uint(>ff_refcnt); cv_broadcast(>ff_closing); path in fd_close() mark the thread as "don't bother restarting,
Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Tue, 2015-10-20 at 10:59 +0100, Alan Burlison wrote: > That may be how Linux implements accept(), but I don't see anything > about refcounting in the POSIX spec for accept(). That's an internal implementation detail. POSIX does not document linux kernel overall design and specific tricks. linux is GPL, while Solaris is proprietary code. There is quite a difference, and we do not want to copy Solaris behavior. We want our own way, practical, and good enough. If POSIX makes sense we try to be compliant. If not, we do not. If you are interested, take a look at fs/* code, and try to implement your proposal and keep good performance. You might find a clever way, without infringing prior art. We did not yet. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On 20/10/2015 14:19, Eric Dumazet wrote: That may be how Linux implements accept(), but I don't see anything about refcounting in the POSIX spec for accept(). That's an internal implementation detail. POSIX does not document linux kernel overall design and specific tricks. No, neither does it document Solaris kernel design, or *BSD design, or Windows design. It does however specify what the externally observable behaviour has to be in order to be POSIX compliant. linux is GPL, while Solaris is proprietary code. There is quite a difference, and we do not want to copy Solaris behavior. We want our own way, practical, and good enough. I don't see what the licensing terms of particular implementations have to do with POSIX, indeed as far as I can tell POSIX says nothing at all about the subject. It's therefore not pertinent to this discussion. I'm not expecting Linux to copy every Solaris behaviour, if I was I might for example be suggesting that Linux dropped support for the SO_RCVTIMEO and SO_SNDTIMEO setsockopt() options on AF_UNIX sockets, because Solaris doesn't currently implement those options. That would clearly be a ridiculous stance for me to take, what I've actually done is logged a bug against Solaris because it's clearly something we need to fix. What I do think is reasonable is that if Linux claims POSIX conformance then it either conforms or documents the variant behaviour and as I've said I don't believe is does conform in the case of shutdown(). If POSIX makes sense we try to be compliant. If not, we do not. That's one possible design option. However in this case the Linux manpage claims the Linux behaviour is POSIX compliant and as far as I can tell it isn't. As I've already said several times, I agree there's probably not much that can be done about it without causing breakage, which is why I suggested simply documenting the behaviour may be the best option. And I still haven't seen any reasoning behind the Linux close() and poll() behaviour on sockets that are in the listen state. If you are interested, take a look at fs/* code, and try to implement your proposal and keep good performance. You might find a clever way, without infringing prior art. We did not yet. I might, but contractually I can't, unfortunately. -- Alan Burlison -- -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Mon, 2015-10-19 at 09:59 -0700, Stephen Hemminger wrote: > This looks like corner case, but worth forwarding. > > Begin forwarded message: > > Date: Mon, 19 Oct 2015 13:21:33 + > From: "bugzilla-dae...@bugzilla.kernel.org" >> To: "shemmin...@linux-foundation.org" > Subject: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for > sockets in accept(3) > > > https://bugzilla.kernel.org/show_bug.cgi?id=106241 > > Bug ID: 106241 >Summary: shutdown(3)/close(3) behaviour is incorrect for > sockets in accept(3) >Product: Networking >Version: 2.5 > Kernel Version: 3.10.0-229.14.1.el7.x86_64 > Hardware: All > OS: Linux > Tree: Mainline > Status: NEW > Severity: normal > Priority: P1 > Component: IPV4 > Assignee: shemmin...@linux-foundation.org > Reporter: alan.burli...@oracle.com > Regression: No > > Created attachment 190501 > --> https://bugzilla.kernel.org/attachment.cgi?id=190501=edit > Test program illustrating the problem > > The Linux behaviour in the current scenario is incorrect: > > 1. ThreadA opens, binds, listens and accepts on a socket, waiting for > connections. > > 2. Some time later ThreadB calls shutdown on the socket ThreadA is waiting in > accept on. > > Here is what happens: > > On Linux, the shutdown call in ThreadB succeeds and the accept call in ThreadA > returns with EINVAL. > > On Solaris, the shutdown call in ThreadB fails and returns ENOTCONN. ThreadA > continues to wait in accept. > > Relevant POSIX manpages: > > http://pubs.opengroup.org/onlinepubs/9699919799/functions/accept.html > http://pubs.opengroup.org/onlinepubs/9699919799/functions/shutdown.html > > The POSIX shutdown manpage says: > > "The shutdown() function shall cause all or part of a full-duplex connection > on > the socket associated with the file descriptor socket to be shut down." > ... > "[ENOTCONN] The socket is not connected." > > Page 229 & 303 of "UNIX System V Network Programming" say: > > "shutdown can only be called on sockets that have been previously connected" > > "The socket [passed to accept that] fd refers to does not participate in the > connection. It remains available to receive further connect indications" > > That is pretty clear, sockets being waited on with accept are not connected by > definition. Nor is it the accept socket connected when a client connects to > it, > it is the socket returned by accept that is connected to the client. Therefore > the Solaris behaviour of failing the shutdown call is correct. > > In order to get the required behaviour of ThreadB causing ThreadA to exit the > accept call with an error, the correct way is for ThreadB to call close on the > socket that ThreadA is waiting on in accept. > > On Solaris, calling close in ThreadB succeeds, and the accept call in ThreadA > fails and returns EBADF. > > On Linux, calling close in ThreadB succeeds but ThreadA continues to wait in > accept until there is an incoming connection. That accept returns > successfully. > However subsequent accept calls on the same socket return EBADF. > > The Linux behaviour is fundamentally broken in three places: > > 1. Allowing shutdown to succeed on an unconnected socket is incorrect. > > 2. Returning a successful accept on a closed file descriptor is incorrect, > especially as future accept calls on the same socket fail. > > 3. Once shutdown has been called on the socket, calling close on the socket > fails with EBADF. That is incorrect, shutdown should just prevent further IO > on > the socket, it should not close it. > It looks it is a long standing problem, right ? inet_shutdown() has this very specific comment from beginning of git tree : switch (sk->sk_state) { ... /* Remaining two branches are temporary solution for missing * close() in multithreaded environment. It is _not_ a good idea, * but we have no choice until close() is repaired at VFS level. */ case TCP_LISTEN: if (!(how & RCV_SHUTDOWN)) break; /* Fall through */ case TCP_SYN_SENT: err = sk->sk_prot->disconnect(sk, O_NONBLOCK); sock->state = err ? SS_DISCONNECTING : SS_UNCONNECTED; break; } Claiming Solaris does it differently is kind of moot. linux is not Solaris. Unless proven a real problem (and not only by trying to backport from Solaris to linux), we'll probably wont change this. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On 20/10/2015 00:33, Eric Dumazet wrote: It looks it is a long standing problem, right ? Yep, seems so. inet_shutdown() has this very specific comment from beginning of git tree : switch (sk->sk_state) { ... /* Remaining two branches are temporary solution for missing * close() in multithreaded environment. It is _not_ a good idea, * but we have no choice until close() is repaired at VFS level. */ case TCP_LISTEN: if (!(how & RCV_SHUTDOWN)) break; /* Fall through */ case TCP_SYN_SENT: err = sk->sk_prot->disconnect(sk, O_NONBLOCK); sock->state = err ? SS_DISCONNECTING : SS_UNCONNECTED; break; } I think it's probably an intrinsic part of the way *NIX file descriptors and their reuse has worked since the dawn of *NIX time - at which time threads didn't exist, so this problem didn't either. The advent of threads made this this hole possible, which is I believe what the comment above is pointing out. The problem people are trying to solve by calling shutdown() on an listen()ing socket is the race in MT programs between a socket being closed and the same file descriptor being recycled by a subsequent open()/socket() etc. Claiming Solaris does it differently is kind of moot. linux is not Solaris. Agreed that Linux != Solaris, but the argument I'm being faced with is that anything that doesn't behave in the same way as Linux is wrong by definition. And the problem with that is that the Linux behaviour of shutdown() on a listen()/accept() socket is I believe incorrect anyway as my read of POSIX says that shutdown() is only valid on connected sockets, and sockets in listen()/accept() aren't connected by definition, and Linux allows shutdown() to succeed when it should probably return ENOTCONN. Yes, there's a potential race with FDs being recycled, but you can get that with vanilla file FDs as well, where shutdown() isn't an option. Another problem is that if I call close() on a Linux socket that's in accept() the accept call just sits there until there's an incoming connection, which succeeds even though the socket is supposed to be closed, but then an immediately following accept() on the same socket fails. And yet another problem is that poll() on a socket that's had listen() called on it returns immediately even if there's no incoming connection on it, which I believe makes multiplexing a set of sockets which includes a socket you want to accept() on impossible. The test program I attached to the bug allows you to play around with the different combinations. Unless proven a real problem (and not only by trying to backport from Solaris to linux), we'll probably wont change this. It's a real problem (with Hadoop, which contains C/C++ to do low-level I/O) and it's the other way around, I am porting that code from Linux to Solaris. I accept that you probably can't change the behaviour of shutdown() in Linux without breaking existing code, for example it seems libmicrohttpd also assumes it's OK to call shutdown() on a listen() socket on Linux, see https://lists.gnu.org/archive/html/libmicrohttpd/2011-09/msg00024.html However, even if the shutdown() behaviour can't be changed the Linux close()/poll() behaviour on a listen()/accept() sockets seems rather odd. There *may* be a way around this that's race-free and cross-platform involving the use of /dev/null and dup2(), see Listing Five on http://www.drdobbs.com/parallel/file-descriptors-and-multithreaded-progr/212001285 but I haven't confirmed it works yet. Thanks, -- Alan Burlison -- -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Tue, 2015-10-20 at 02:12 +0100, Alan Burlison wrote: > Another problem is that if I call close() on a Linux socket that's in > accept() the accept call just sits there until there's an incoming > connection, which succeeds even though the socket is supposed to be > closed, but then an immediately following accept() on the same socket > fails. This is exactly what the comment I pasted documents. On linux, doing close(listener) on one thread does _not_ wakeup other threads doing accept(listener) So I guess allowing shutdown(listener) was a way to somehow propagate some info on the threads stuck in accept() This is a VFS issue, and a long standing one. Think of all cases like dup() and fd passing games, and the close(fd) being able to signal out of band info is racy. close() is literally removing one ref count on a file. Expecting it doing some kind of magical cleanup of a socket is not reasonable/practical. On a multi threaded program, each thread doing an accept() increased the refcount on the file. Really, I have no idea of how Solaris coped with this, and I do not want to know. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html