Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
>And no, I'm not fond of such irregular ways to pass file descriptors, but
>we can't kill ioctl(2) with all weirdness hiding behind it, more's the pity...

Yeah, there are a number of calls which are supposed to work on one file descriptor but have a second argument which is also a file descriptor; mostly part of ioctl().

>> In those specific cases where a system call needs to convert a file
>> descriptor to a file pointer, there is only one routine which can be used.
>
>Obviously, but the problem is deadlock avoidance using it.

The Solaris algorithm is quite different and as such there is no chance of having a deadlock using that function (there is a bunch of functions).

>The memory footprint is really scary. Bitmaps are pretty much noise, but
>blowing it by factor of 8 on normal 64bit (or 16 on something like Itanic -
>or Venus for that matter, which is more relevant for you guys)

Fair enough. I think we have some systems with a larger cache line.

>Said that, what's the point of "close won't return until..."? After all,
>you can't guarantee that thread with cancelled syscall won't lose CPU
>immediately upon return to userland, so it *can't* make any assumptions
>about the descriptor not having been already reused. I don't get it - what
>does that buy for userland code?

Generally I wouldn't see that as a problem, but in the case of a socket blocking on accept indefinitely, I do see it as a problem, especially as the thread actually wants to stop listening.

But in general, this is basically a problem with the application: the file descriptor space is shared between threads, and with one thread sniping at open files you do have a problem. Whatever the kernel does in that case perhaps doesn't matter all that much: the application needs to be fixed anyway.

Casper

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
>Yet another POSIX deficiency.
>
>When a server deals with 10,000,000+ socks, we absolutely do not care of
>this requirement.
>
>O(log(n)) is still crazy if it involves O(log(n)) cache misses.

You miss the fine point of the algorithm; you *always* find an available file descriptor in O(log(N)) (where N is the size of the table).

Does your algorithm guarantee that?

>> Is it a problem that you can "hide" your listening socket with a thread in
>> accept()? I would think so (It would be visible in netstat but you can't
>> easily find out who has it)
>
>Again, netstat -p on a server with 10,000,000 sockets never completes.

This point was not about a 10M sockets server but in general...

>Never try this unless you are desperate and want to avoid a reboot
>maybe.
>
>If you absolutely want to nuke a listener because of untrusted
>applications, we better implement a proper syscall.
>
>Android has such a facility.

Solaris has had such an option too, but that wasn't the point. You really don't want to know which application is doing this?

>Alternative would be to extend netlink (ss command from iproute2
>package) to carry one pid per socket.
>
>ss -atnp state listening

That would be an option too.

Casper
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
>Ho-hum... It could even be made lockless in fast path; the problems I see
>are
>	* descriptor-to-file lookup becomes unsafe in a lot of locking
>conditions. Sure, most of that happens on the entry to some syscall, with
>very light locking environment, but... auditing every sodding ioctl that
>might be doing such lookups is an interesting exercise, and then there are
>->mount() instances doing the same thing. And procfs accesses. Probably
>nothing impossible to deal with, but nothing pleasant either.

In the Solaris kernel code, the ioctl code is generally not handed a file descriptor but instead a file pointer (i.e., the lookup is done early in the system call).

In those specific cases where a system call needs to convert a file descriptor to a file pointer, there is only one routine which can be used.

>	* memory footprint. In case of Linux on amd64 or sparc64,
>main()
>{
>	int i;
>	for (i = 0; i < 1<<24; dup2(0, i++))	// 16M descriptors
>		;
>}
>will chew 132Mb of kernel data (16Mpointer + 32Mbit, assuming sufficient
>ulimit -n,
>of course). How much will Solaris eat on the same?

Yeah, that is a large amount of memory. Of course, the table is only sized when it is extended, and there is a reason why there is a limit on file descriptors. But we're using more data per file descriptor entry.

>	* related to the above - how much cacheline sharing will that involve?
>These per-descriptor use counts are bitch to pack, and giving each a cacheline
>of its own...

As I said, we do actually use a lock, and yes, that means that you really want to have a single cache line for each and every entry. It does make it easy to have non-racy file description updates. You certainly do not want false sharing when there is a lot of contention.

Other data is used to make sure that it only takes O(log(n)) to find the lowest available file descriptor entry. (Where n, I think, is the returned descriptor.)

Uncontended locks aren't expensive. And all is done on a single cache line.

One question about the Linux implementation: what happens when a socket in select is closed? I'm assuming that the kernel waits until "shutdown" is given or until a connection comes in?

Is it a problem that you can "hide" your listening socket with a thread in accept()? I would think so (it would be visible in netstat but you can't easily find out who has it).

Casper
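The O(log(n)) search for the lowest available descriptor mentioned above can be illustrated with a small sketch. This is not the Solaris data structure, which is not shown in this thread; it is one possible approach, assuming a fixed-size table whose size is a power of two. It uses a binary tree of free-slot counts stored in an array. The names `FD_SLOTS`, `fdtree_init`, `fdtree_alloc` and `fdtree_free` are hypothetical.

```c
#include <stddef.h>

/* Hypothetical table size; must be a power of two for this layout. */
#define FD_SLOTS 1024

/*
 * free_count[1] is the root and counts every free slot in the table.
 * Node i has children 2*i and 2*i+1; the leaf for descriptor fd is
 * free_count[FD_SLOTS + fd] and holds 1 (free) or 0 (in use).
 */
unsigned free_count[2 * FD_SLOTS];

void fdtree_init(void)
{
    for (size_t i = FD_SLOTS; i < 2 * FD_SLOTS; i++)
        free_count[i] = 1;
    for (size_t i = FD_SLOTS - 1; i >= 1; i--)
        free_count[i] = free_count[2 * i] + free_count[2 * i + 1];
}

/* Return the lowest free descriptor, or -1 if the table is full. */
int fdtree_alloc(void)
{
    size_t node = 1;

    if (free_count[1] == 0)
        return -1;
    /* Walk down, preferring the left (lower-numbered) subtree. */
    while (node < FD_SLOTS)
        node = free_count[2 * node] ? 2 * node : 2 * node + 1;
    /* Mark the leaf used and fix up the counts on the path to the root. */
    for (size_t i = node; i >= 1; i /= 2)
        free_count[i]--;
    return (int)(node - FD_SLOTS);
}

/* Release a descriptor; the caller must pass one that is in use. */
void fdtree_free(int fd)
{
    for (size_t i = FD_SLOTS + (size_t)fd; i >= 1; i /= 2)
        free_count[i]++;
}
```

Both allocation and release touch one node per tree level, so the cost is O(log(N)) in the table size regardless of how many descriptors are in use. A real kernel would also need locking and table growth, which this sketch leaves out.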
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
>On Thu, Oct 22, 2015 at 08:34:19AM +0200, casper@oracle.com wrote:
>>
>> >And I'm really curious about the things Solaris would do with dup2() there.
>> >Does it take into account the possibility of new accept() coming just as
>> >dup2() is trying to terminate the ongoing ones? Is there a window when
>> >descriptor-to-file lookups would fail? Looks like a race/deadlock
>> >country...
>>
>> Solaris does not "terminate" threads, instead it tells them that the
>> file descriptor information used is stale and wakes up the thread.
>
>Sorry, lousy wording - I meant "terminate syscall in another thread".
>Better yet, make that "what happens if new accept(newfd) comes while dup2()
>waits for affected syscalls in other threads to finish"? Assuming it
>does wait, that is..

No, there is no such window; the accept() call either returns EBADF (dup2() wins the race) or it returns a new file descriptor (and dup2() then closes the listening descriptor).

One or the other.

>While we are at it, what's the relative order of record locks removal
>and switching the meaning of newfd? In our kernel it happens *after*
>the switchover (i.e. if another thread is waiting for a record lock held on
>any alias of newfd and we do dup2(oldfd, newfd), the waiter will not see
>the state with newfd still referring to what it used to; note that waiter
>might be using any descriptor referring to the file newfd used to refer
>to, so it won't be affected by the "wake those who had the meaning of
>their arguments change" side of things).

The external behaviour is atomic; you cannot distinguish the order between the closing of the original file (and waking up other threads waiting for a record lock) and changing the file referenced by that newfd.

But this does not include a global or process-specific lock.

Casper
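As a single-threaded illustration of what dup2(oldfd, newfd) has to achieve (close whatever newfd referred to and make it refer to oldfd's file), here is a minimal sketch using two pipes. It says nothing about the multi-threaded races discussed above; the function name `dup2_switches_file` is hypothetical.

```c
#include <unistd.h>

/*
 * Returns 1 if dup2() behaved as expected, 0 if not, -1 on setup failure.
 * Pipe b's write end is the "newfd"; pipe a's write end is the "oldfd".
 */
int dup2_switches_file(void)
{
    int a[2], b[2];
    char buf[8];
    int newfd;

    if (pipe(a) < 0 || pipe(b) < 0)
        return -1;
    newfd = b[1];

    /* newfd stops referring to pipe b and starts referring to pipe a. */
    if (dup2(a[1], newfd) != newfd)
        return -1;

    /* newfd was the only descriptor for b's write end, so b[0] sees EOF. */
    if (read(b[0], buf, sizeof buf) != 0)
        return 0;

    /* A write through newfd now arrives on pipe a. */
    if (write(newfd, "x", 1) != 1)
        return 0;
    if (read(a[0], buf, sizeof buf) != 1 || buf[0] != 'x')
        return 0;

    close(a[0]);
    close(a[1]);
    close(b[0]);
    close(newfd);
    return 1;
}
```

The EOF on b[0] shows that the implicit close of the old file is a "last close" in this case; with another descriptor still referring to b's write end, the read would block instead.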
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
From: Al Viro

>Except that in this case "correctness" is the matter of rather obscure and
>ill-documented areas in POSIX. Don't get me wrong - this semantics isn't
>inherently bad, but it's nowhere near being an absolute requirement.

It would be more fruitful to have such a discussion on one of the OpenGroup mailing lists; people gathered there have a lot of experience, and it is also possible to fix the standard when it turns out that it is indeed as vague as you claim it is (I don't quite agree with that).

Casper
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
>It's been said that the current mechanisms in Linux & some BSD variants
>can be subject to races, and the behaviour exhibited doesn't conform to
>POSIX, for example requiring the use of shutdown() on unconnected
>sockets because close() doesn't kick off other threads accept()ing on
>the same fd. I'd be interested to hear if there's a better and more
>performant way of handling the situation that doesn't involve doing the
>sort of bookkeeping Casper described.

Of course, the implementation is now around 18 years old; clearly a lot of things have changed since then.

In the particular case of Linux close() on a socket, surely it must be possible to detect at close that it is a listening socket and that you are about to close the last reference; the kernel could then do the shutdown() all by itself.

Casper
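Until the kernel does that by itself, the userland equivalent on Linux is to call shutdown() before close(). The sketch below assumes the Linux behaviour discussed in this thread, where shutdown() on a listening TCP socket takes it out of the listening state so that accept() fails (with EINVAL, as far as I know) instead of waiting. The helper names `close_listener` and `accept_errno_after_shutdown` are hypothetical, and the socket is made non-blocking so the demonstration cannot hang on systems that behave differently.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Stop listening first so that threads in accept() are kicked, then close. */
int close_listener(int fd)
{
    (void)shutdown(fd, SHUT_RDWR);
    return close(fd);
}

/* Returns the errno accept() reports after shutdown(), or -1 on setup failure. */
int accept_errno_after_shutdown(void)
{
    struct sockaddr_in sa;
    int fd, rc, err;

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    memset(&sa, 0, sizeof sa);
    sa.sin_family = AF_INET;
    sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    sa.sin_port = 0;                      /* let the kernel pick a port */
    if (bind(fd, (struct sockaddr *)&sa, sizeof sa) < 0 || listen(fd, 1) < 0) {
        close(fd);
        return -1;
    }

    /* Non-blocking, so accept() returns at once whatever the platform does. */
    (void)fcntl(fd, F_SETFL, O_NONBLOCK);

    (void)shutdown(fd, SHUT_RDWR);
    errno = 0;
    rc = accept(fd, NULL, NULL);
    err = (rc < 0) ? errno : 0;
    if (rc >= 0)
        close(rc);
    close(fd);
    return err;
}
```

In a threaded server, a thread that wants to stop listening would call `close_listener(fd)` rather than plain close(fd); the acceptor threads then see an error and can leave their loop before the descriptor number is reused.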
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
>On Thu, Oct 22, 2015 at 08:24:51PM +0200, casper@oracle.com wrote:
>
>> The external behaviour atomic; you cannot distinguish the order
>> between the closing of the original file (and waking up other threads
>> waiting for a record lock) or changing the file referenced by that newfd.
>>
>> But this not include a global or process specific lock.
>
>Interesting... Do you mean that descriptor-to-file lookup blocks until that
>rundown finishes?

For that particular file descriptor, yes. (I'm assuming you mean the Solaris kernel running down all lwps who have a system call in progress on that particular file descriptor.) All other fd-to-file lookups are not blocked at all by this locking.

It should be clear that any such occurrences are application errors and should hardly ever be seen in practice. It is also known when this is needed, so it is hardly ever attempted.

Casper
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
From: Al Viro

>On Thu, Oct 22, 2015 at 06:39:34PM +0100, Alan Burlison wrote:
>> On 22/10/2015 18:05, Al Viro wrote:
>>
>> >Oh, for... Right in this thread an example of complete BS has been quoted
>> >from POSIX close(2). The part about closing a file when the last descriptor
>> >gets closed. _Nothing_ is POSIX-compliant in that respect (nor should
>> >it be).
>>
>> That's not exactly what it says, we've already discussed, for
>> example in the case of pending async IO on a filehandle.
>
>Sigh... It completely fails to mention descriptor-passing. Which
>	a) is relevant to what "last close" means and
>	b) had been there for nearly the third of a century.

Why is that different? These clearly count as file descriptors.

>> I agree that part could do with some polishing.
>
>google("wire brush of enlightenment") is what comes to mind...

Standardese is similar to legalese; it is not writing that is directly open to interpretation, and those who are not versed in that style of writing may have some problem interpreting what exactly is meant by the wording of the standard.

>> I think "it shall be closed first" makes it pretty clear that what
>> is expected is the same behaviour as any direct invocation of close,
>> and that has to happen before the reassignment. What makes you
>> believe that isn't the case?
>
>So unless I'm misparsing something, you want
>thread A: accept(newfd)
>thread B: dup2(oldfd, newfd)
>have accept() bugger off before the switchover happens?

Well, certainly *before* we return from dup2(). (And clearly only once we have determined that dup2() will return successfully.)

>What should happen if thread C does accept(newfd) right as B has decided that
>there's nothing more to wait? For close(newfd) it would be simple - we are
>going to have lookup by descriptor fail with EBADF anyway, so making it do
>so as soon as we go hunting for those who are currently in accept(newfd)
>would do the trick - no new threads like that shall appear and as long as
>the descriptor is not declared free for taking by descriptor allocation nobody
>is going to be screwed by open() picking that slot of descriptor table too
>early. Trying to do that for dup2() would lose atomicity. I honestly don't
>know how Solaris behaves in that case, BTW - the race (if any) would probably
>be hard to hit, so in case of Linux I would have to go and RTFS before saying
>that there isn't one. I can't do that with Solaris; all I can do here
>is ask you guys...

Solaris dup2() behaves exactly like close().

>Moreover, see above for record locks removal. Should that happen prior to
>switchover? If you have
>
>dup(fd, fd2);
>set a record lock on fd2
>spawn a thread
>in child, try to grab the same lock on fd2
>in parent, do some work and close(fd)
>you are guaranteed that child won't see fd referring to the same file after it
>acquires the lock.

Here you are talking about a lock held by the "parent", and that the "child" will only get the lock once close(fd) is done?

Yes. The final "close" is done *after* the pointer has been removed from the file descriptor table.

>Replace close(fd) with dup(fd3, fd); should the same hold true in that case?

Yes.

>FWIW, Linux behaviour in that area is to have record locks removal done
>between the switchover and return to userland in case of dup2() and between
>the removal from descriptor table and return to userland in case of close().
>
>> Personally I believe the spec is clear enough to allow an
>> unambiguous interpretation of the required behavior in this area. If
>> you think there are areas where the Solaris behaviour is in
>> disagreement with the spec then I'd be interested to hear them.
>
>The spec is so vague that I strongly suspect that *both* Solaris and Linux
>behaviours are not in disagreement with it (modulo shutdown(2) extension
>Linux-side and we are really stuck with that one).

I'm not sure if the standard allows a handful of threads in accept() for a file descriptor which has already been closed *and* can be re-issued for other uses.

Casper
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
>And I'm really curious about the things Solaris would do with dup2() there.
>Does it take into account the possibility of new accept() coming just as
>dup2() is trying to terminate the ongoing ones? Is there a window when
>descriptor-to-file lookups would fail? Looks like a race/deadlock country...

Solaris does not "terminate" threads; instead it tells them that the file descriptor information used is stale and wakes up the thread.

The accept call gets woken up and it checks for incoming connections; it will then either find a new connection and return that particular connection, or it will find nothing and return EINTR. In the post-syscall glue this is checked (the kernel thread has been told to take the expensive post-syscall routine) and if the system call was interrupted, EBADF is returned instead.

It is also possible for the connection to come in late; then the socket will be changed and the already accepted (in TCP terms, not in socket API terms) embryonic connection will be closed too, as is normal when a listening socket is closed with a list of not yet accept()ed connections.

Casper
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
From: David Miller
Date: Wed, 21 Oct 2015 08:30:08 -0700 (PDT) (17:30 CEST)

>From: Alan Burlison
>Date: Wed, 21 Oct 2015 15:38:51 +0100
>
>> While this algorithm is pretty expensive, it is not often invoked.
>
>I bet it can be easily intentionally invoked, by a malicious entity no
>less.

It is only expensive within the process itself. Whether it is run inside the kernel isn't much different in the context of Solaris. If you have an attacker which can run any code, it doesn't really matter what that code is.

It is not really expensive (as in grabbing expensive locks, or holding them for any length of time). It's basically O(n), depending on the number of threads in the process.

If you have an application which can be triggered into doing that, it is still a bug in the application.

Is such a socket still listed with netstat on Linux? I believe it uses /proc and it will not be able to find that socket through the list of opened files.

If we look at our typical problem, we have an accept loop:

	for (;;) {
		newfd = accept(fd, ...);	/* X */
		/* stuff */
	}

while we have a second thread doing a "close(fd);" and possibly opening another file which just happens to return this particular fd.

In Solaris one of the following things will happen, whatever the first thread is doing once close() is called:

	- accept() dies with EBADF (close() before or during the call to accept())
	- accept() returns some other error (new fd you can't accept on)
	- accept() returns a new fd (if it was closed and reopened and the new fd allows accept())

On Linux exactly the same thing happens *except* when we find ourselves in accept(); then we wait until a connection is made or "shutdown()" is called.

I don't think any of the outcomes in the first thread is acceptable; clearly there is no sufficient synchronization between the threads.

At that point Linux cannot find out who owns the socket:

	# netstat -p -a | grep /tmp/unix
	unix 2 [ ACC ] STREAM LISTENING 14743 - /tmp/unix_sock

In Solaris you'd get:

	netstat -u -f unix| grep unix_
	stream-ord casper 5334 shutdown /tmp/unix_sock

Simple synchronization can be done.

Casper
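Two of the outcomes listed above do not need a second thread to be observed: accept() on a descriptor that has already been closed, and accept() on a descriptor number that has since been reused for something that is not a socket. The following is a minimal single-threaded sketch; the function names are hypothetical, and dup2() stands in for the "another open() happens to return this fd" case so that the reuse is deterministic.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <unistd.h>

/* accept() after close(): the descriptor is gone, so EBADF is expected. */
int accept_errno_after_close(void)
{
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    close(fd);
    errno = 0;
    if (accept(fd, NULL, NULL) >= 0)
        return 0;
    return errno;
}

/* accept() after the fd number was reused for a pipe: ENOTSOCK is expected. */
int accept_errno_after_reuse(void)
{
    int p[2];
    int fd, rc, err;

    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0 || pipe(p) < 0)
        return -1;
    /* Closes the socket and makes fd refer to the pipe's read end. */
    if (dup2(p[0], fd) != fd)
        return -1;
    errno = 0;
    rc = accept(fd, NULL, NULL);
    err = (rc < 0) ? errno : 0;
    close(fd);
    close(p[0]);
    close(p[1]);
    return err;
}
```

The third outcome, accept() succeeding on a different listening socket that happens to have the same number, is the dangerous one, because the accept loop cannot tell that anything changed.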
Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
>On Wed, Oct 21, 2015 at 03:38:51PM +0100, Alan Burlison wrote:
>
>> >There's going to be a notion of "last close"; that's what this refcount is
>> >about and _that_ is more than implementation detail.
>>
>> Yes, POSIX distinguishes between "file descriptor" and "file
>> description" (ugh!) and the close() page says:
>
>Would've been better if they went for something like "IO channel" for
>the latter ;-/

Or at least some other word. A file descriptor is just an index to a list of file pointers (and wasn't named so?)

>> "When all file descriptors associated with an open file description
>> have been closed, the open file description shall be freed."
>
>BTW, is SCM_RIGHTS outside of scope? They do mention it in one place
>only:
>| Ancillary data is also possible at the socket level. The <sys/socket.h>
>| header shall define the following symbolic constant for use as the cmsg_type
>| value when cmsg_level is SOL_SOCKET:
>|
>| SCM_RIGHTS
>|     Indicates that the data array contains the access rights to be sent or
>|     received.
>
>with no further details whatsoever. It's been there since at least 4.3-Reno;
>does anybody still use the older variant (->msg_accrights, that is)? IIRC,
>there was some crap circa 2.6 when Solaris used to do ->msg_accrights for
>descriptor-passing, but more or less current versions appear to support
>SCM_RIGHTS... In any case, descriptor-passing had been there in some form
>since at least '83 (the old variant is already present in 4.2) and considering
>it out-of-scope for POSIX is bloody ridiculous, IMO.

SCM_RIGHTS was introduced as part of the POSIX standardization of BSD sockets. Looks like they became part of Solaris 2.6, but the default was non-standard sockets, so you may easily find msg_accrights but not SCM_RIGHTS.

msg_accrights is what was introduced in BSD, in likely the first implementation of socket-based file descriptor passing.

SysV has its own form of file descriptor passing. But that interface is too ad hoc, so SCM_RIGHTS is a standardization allowing multiple types of messages to be sent around.

>Unless they want to consider in-flight descriptor-passing datagrams as
>collections of file descriptors, the quoted sentence is seriously misleading.
>And then there's mmap(), which they do kinda-sorta mention...

Well, a file descriptor really only exists in the context of a process; in flight it is no longer a file descriptor, as there is no process context with a file descriptor table. So pointers to file descriptions are passed around.

>> >In other words, is that destruction of
>> >	* any descriptor referring to this socket [utterly insane for obvious
>> >reasons]
>> >	* the last descriptor referring to this socket (modulo descriptor
>> >passing, etc.) [a bitch to implement, unless we treat a syscall in progress
>> >as keeping the opened file open], or
>> >	* _the_ descriptor used to issue accept(2) [a bitch to implement,
>> >with a lot of fun races in an already race-prone area]?
>>
>> From reading the POSIX close() page I believe the second option is
>> the correct one.
>
>Er... So fd2 = dup(fd);accept(fd)/close(fd) should *not* trigger that
>behaviour, in your opinion? Because fd is sure as hell not the last
>descriptor referring to that socket - fd2 remains alive and well.
>
>Behaviour you describe below matches the _third_ option.

>> >BTW, for real fun, consider this:
>> >7)
>> >// fd is a socket
>> >fd2 = dup(fd);
>> >in thread A: accept(fd);
>> >in thread B: accept(fd);
>> >in thread C: accept(fd2);
>> >in thread D: close(fd);
>> >
>> >Which threads (if any), should get hit where it hurts?
>>
>> A & B should return from the accept with an error. C should
>> continue. Which is what happens on Solaris.
>
>> To this end each thread keeps a list of file descriptors
>> in use by the current active system call.
>
>Yecc... How much cross-CPU traffic does that cause on
>multithread processes? Not on close(2), on maintaining the
>descriptor use counts through the normal syscalls.

That is pretty much what we do in the Solaris implementation, but there is not much cross-CPU traffic. Of course, you will need to keep locks in the file descriptor table, if only to find the actual file pointer.

The work is done only in the case of a badly written application, where close is required to hunt down all threads currently using the specific file descriptor.

>> When a file descriptor is closed and this file descriptor
>> is marked as being in use by other threads, the kernel
>> will search all threads to see which have this file descriptor
>> listed as in use. For each such thread, the kernel tells
>> the thread that its active fds list is now stale and, if
>> possible, makes the thread run.
>>
>> While this algorithm is pretty expensive, it is not often invoked.
>
>Sure, but the upkeep of data structures it would need is there
>whether you actually end up triggering it or not. Both in
>memory footprint and in cacheline pingpong...

Most of the