RE: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
From: Al Viro
> Sent: 04 November 2015 16:28
> On Wed, Nov 04, 2015 at 03:54:09PM +, David Laight wrote:
> > > Sigh... The kernel has no idea when other threads are done with "all
> > > io activities using that fd" - it can wait for them to leave the
> > > kernel mode, but there's fuck-all it can do about e.g. a userland
> > > loop doing write() until there's more data to send. And no, you can't
> > > rely upon them catching EBADF on the next iteration - by the time they
> > > get there, close() might very well have returned and open() from yet
> > > another thread might've grabbed the same descriptor. Welcome to your
> > > data being written to hell knows what...
> >
> > That just means that the application must use dup2() rather than close().
> > It must do that anyway since the thread it is trying to stop might be
> > sleeping in the system call stub in libc at the time a close() and open()
> > happen.
>
> Oh, _lovely_. So instead of continuation of that write(2) going down
> the throat of something opened by unrelated thread, it (starting from a
> pretty arbitrary point) will go into the descriptor the closing thread
> passed to dup2(). Until it, in turn, gets closed, at which point we
> are back to square one. That, of course, makes it so much better -
> whatever had I been thinking about that made me miss that?

Do I detect a note of sarcasm?

You'd open "/dev/null" (lacking a "/dev/error") and use that as the source fd.
So any writes (etc) would be discarded in a safe manner, and nothing will
block.

	David

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Fri, Nov 06, 2015 at 03:07:30PM +, David Laight wrote:
> > Oh, _lovely_. So instead of continuation of that write(2) going down
> > the throat of something opened by unrelated thread, it (starting from a
> > pretty arbitrary point) will go into the descriptor the closing thread
> > passed to dup2(). Until it, in turn, gets closed, at which point we
> > are back to square one. That, of course, makes it so much better -
> > whatever had I been thinking about that made me miss that?
>
> Do I detect a note of sarcasm?
>
> You'd open "/dev/null" (lacking a "/dev/error") and use that as the source fd.
> So any writes (etc) would be discarded in a safe manner, and nothing will
> block.

... and never close that descriptor afterwards? Then why would you care
whether dup2() blocks there or not? Are you seriously saying that there is
any code that would (a) rely on that kind of warranty and (b) not be racy
as hell? Until now everyone had been talking about that thing as an attempt
at mitigation for broken userland code; you seem to be the first one to
imply that this can be used as a deliberate technique...
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Wed, Nov 04, 2015 at 03:54:09PM +, David Laight wrote:
> > Sigh... The kernel has no idea when other threads are done with "all
> > io activities using that fd" - it can wait for them to leave the
> > kernel mode, but there's fuck-all it can do about e.g. a userland
> > loop doing write() until there's more data to send. And no, you can't
> > rely upon them catching EBADF on the next iteration - by the time they
> > get there, close() might very well have returned and open() from yet
> > another thread might've grabbed the same descriptor. Welcome to your
> > data being written to hell knows what...
>
> That just means that the application must use dup2() rather than close().
> It must do that anyway since the thread it is trying to stop might be
> sleeping in the system call stub in libc at the time a close() and open()
> happen.

Oh, _lovely_. So instead of continuation of that write(2) going down
the throat of something opened by unrelated thread, it (starting from a
pretty arbitrary point) will go into the descriptor the closing thread
passed to dup2(). Until it, in turn, gets closed, at which point we
are back to square one. That, of course, makes it so much better -
whatever had I been thinking about that made me miss that?

> The listening (in this case) thread would need to look at its global
> data to determine that it is supposed to exit, and then close the fd itself.

Right until it crosses into the kernel mode and does descriptor-to-file
lookup, presumably? Because prior to that point this kernel-side
"protection" doesn't come into play. In other words, this is inherently
racy, and AFAICS you are the first poster in that thread who disagrees
with that. _Any_ userland code that would be racy without that kludge of
semantics in close()/dup2() is *still* racy with it. If that crap gets
triggered at all, the userland code responsible for that is broken. Said
crap makes the race windows more narrow, but it doesn't really close them.

And IMO it's rather misguided, especially since it's a) quiet and b) costly
as hell.
RE: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
From: Al Viro
> Sent: 30 October 2015 21:10
> On Fri, Oct 30, 2015 at 05:43:21PM +, David Laight wrote:
> > ISTM that the correct call should be listen(fd, 0);
> > Although that doesn't help a thread stuck in recvmsg() for a datagram.
> >
> > It is also tempting to think that close(fd) should sleep until all
> > io activities using that fd have completed - whether or not blocking
> > calls are woken.
>
> Sigh... The kernel has no idea when other threads are done with "all
> io activities using that fd" - it can wait for them to leave the
> kernel mode, but there's fuck-all it can do about e.g. a userland
> loop doing write() until there's more data to send. And no, you can't
> rely upon them catching EBADF on the next iteration - by the time they
> get there, close() might very well have returned and open() from yet
> another thread might've grabbed the same descriptor. Welcome to your
> data being written to hell knows what...

That just means that the application must use dup2() rather than close().
It must do that anyway since the thread it is trying to stop might be
sleeping in the system call stub in libc at the time a close() and open()
happen.

The listening (in this case) thread would need to look at its global
data to determine that it is supposed to exit, and then close the fd itself.

	David
RE: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
From: casper@oracle.com
> Sent: 21 October 2015 21:33
..
> >Er... So fd2 = dup(fd);accept(fd)/close(fd) should *not* trigger that
> >behaviour, in your opinion? Because fd is sure as hell not the last
> >descriptor referring to that socket - fd2 remains alive and well.
> >
> >Behaviour you describe below matches the _third_ option.
>
> >> >BTW, for real fun, consider this:
> >> >7)
> >> >// fd is a socket
> >> >fd2 = dup(fd);
> >> >in thread A: accept(fd);
> >> >in thread B: accept(fd);
> >> >in thread C: accept(fd2);
> >> >in thread D: close(fd);

If D is going to do this, it ought to be using dup2() to clone a copy of
(say) /dev/null onto 'fd'.

> >> >Which threads (if any), should get hit where it hurts?
> >>
> >> A & B should return from the accept with an error. C should
> >> continue. Which is what happens on Solaris.

That seems to completely break the normal *nix file scheme...

> >> To this end each thread keeps a list of file descriptors
> >> in use by the current active system call.
> >

Remember, Solaris (and SYSV) has extra levels of multiplexing between
userspace and char special drivers (and probably sockets) than Linux does.
As well as having multiple fd referencing the same struct FILE, multiple
FILE can point to the same inode. If you have two different /dev entries
for the same major/minor you also end up with separate inodes - all
finally referencing the same driver data (indexed only by minor number).
(You actually get two inodes in the chain, one for the disk filesystem
and one char-special. The ref counts on both can be greater than 1.)

	David
Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Mon, Nov 02, 2015 at 10:03:58AM +, David Laight wrote:
> Remember, Solaris (and SYSV) has extra levels of multiplexing between
> userspace and char special drivers (and probably sockets) than Linux does.
> As well as having multiple fd referencing the same struct FILE, multiple
> FILE can point to the same inode.

As they ever could on any Unix. Every open(2) results in a new struct file
(BTW, I've never seen that capitalization for a kernel structure - not in
v7, not in *BSD, etc.; FILE is a userland typedef and I would be rather
surprised if any kernel, Solaris included, would've renamed 'struct file'
to 'struct FILE').

> If you have two different /dev entries for the same major/minor you
> also end up with separate inodes - all finally referencing the same
> driver data (indexed only by minor number).

Again, the same goes for all Unices, both Linux and Solaris included.
And what the devil does that have to do with sockets, anyway? Or with
the problem in question, while we are at it - there we have different
descriptors pointing to the same struct file behaving differently;
anything sensitive to file type would be past that point.
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Mon, 2015-11-02 at 00:24 +, Al Viro wrote:
> This ought to be a bit cleaner. Eric, could you test the variant below
> on your setup?

Sure!

5 runs of:

lpaa24:~# taskset ff0ff ./opensock -t 16 -n 1000 -l 10

total = 4386311
total = 4560402
total = 4437309
total = 4516227
total = 4478778

With 48 threads:

./opensock -t 48 -n 1000 -l 10
total = 4940245
total = 4848513
total = 4813153
total = 4813946
total = 5127804

Perf output taken on the 16 threads run:

74.71%  opensock  opensock           [.] memset
        |
        --- memset

 5.64%  opensock  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
        |
        --- queued_spin_lock_slowpath
            |
            |--99.89%-- _raw_spin_lock
            |          |
            |          |--52.74%-- __close_fd
            |          |          sys_close
            |          |          entry_SYSCALL_64_fastpath
            |          |          __libc_close
            |          |          |
            |          |           --100.00%-- 0x0
            |          |
            |          |--46.97%-- __alloc_fd
            |          |          get_unused_fd_flags
            |          |          sock_map_fd
            |          |          sys_socket
            |          |          entry_SYSCALL_64_fastpath
            |          |          __socket
            |          |          |
            |          |           --100.00%-- 0x0
            |           --0.30%-- [...]
             --0.11%-- [...]

 1.69%  opensock  [kernel.kallsyms]  [k] _raw_spin_lock
        |
        --- _raw_spin_lock
            |
            |--27.37%-- get_unused_fd_flags
            |          sock_map_fd
            |          sys_socket
            |          entry_SYSCALL_64_fastpath
            |          __socket
            |
            |--22.40%-- sys_close
            |          entry_SYSCALL_64_fastpath
            |          __libc_close
            |
            |--15.79%-- cache_alloc_refill
            |          |
            |          |--99.27%-- kmem_cache_alloc
            |          |          |
            |          |          |--81.25%-- sk_prot_alloc
            |          |          |          sk_alloc
            |          |          |          inet_create
            |          |          |          __sock_create
            |          |          |          sock_create
            |          |          |          sys_socket
            |          |          |          entry_SYSCALL_64_fastpath
            |          |          |          __socket
            |          |          |
            |          |          |--9.08%-- sock_alloc_inode
            |          |          |          alloc_inode
            |          |          |          new_inode_pseudo
            |          |          |          sock_alloc
            |          |          |          __sock_create
            |          |          |          sock_create
            |          |          |          sys_socket
            |          |          |          entry_SYSCALL_64_fastpath
            |          |          |          __socket
            |          |          |
            |          |          |--4.98%-- __d_alloc
            |          |          |          d_alloc_pseudo
            |          |          |          sock_alloc_file
            |          |          |          sock_map_fd
            |          |          |          sys_socket
            |          |          |          entry_SYSCALL_64_fastpath
            |          |          |          __socket
            |          |          |
            |          |           --4.69%-- get_empty_filp
            |          |                     alloc_file
            |          |                     sock_alloc_file
            |          |                     sock_map_fd
            |          |                     sys_socket
            |          |                     entry_SYSCALL_64_fastpath
            |          |                     __socket
            |          |
            |           --0.73%--
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Sun, Nov 1, 2015 at 4:24 PM, Al Viro wrote:
>
> This ought to be a bit cleaner. Eric, could you test the variant below
> on your setup?

I'd love to see the numbers, but at the same time I really can't say I
love your patch.

I've merged my own two patches for now (not actually pushed out - I don't
want to distract people from just testing 4.3 for a while), because I felt
that those had an unreasonably high bang-for-the-buck (ie big speedups for
something that still keeps the code very simple).

I'm definitely open to improving this further, even go as far as your
patch, but just looking at your version of __alloc_fd(), I just don't
get the warm and fuzzies.

	Linus
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Sat, Oct 31, 2015 at 08:29:29PM +, Al Viro wrote:
> On Sat, Oct 31, 2015 at 12:54:50PM -0700, Linus Torvalds wrote:
> > On Sat, Oct 31, 2015 at 12:34 PM, Al Viro wrote:
> > >
> > > ... and here's the current variant of mine.
> >
> > Ugh. I really liked how simple mine ended up being. Yours is definitely not.
>
> Note that it's not the final variant - just something that should be
> testable. There are all kinds of things that still need cleaning/simplifying
> in there - e.g. scan() is definitely more complex than needed (if nothing
> else, the "small bitmap" case is simply find_next_zero_bit(), and the
> rest all have size equal to full cacheline; moreover, I'd overdone the
> "... and check if there are other zero bits left" thing - its callers
> used to use that a lot, and with the exception of two of them it was
> absolutely worthless. So it ended up more generic than necessary and
> I'm going to trim that crap down.
>
> It's still going to end up more complex than yours, obviously, but not as
> badly as it is now. I'm not sure if the speedup will be worth the
> extra complexity, thus asking for testing... On _some_ loads it is
> considerably faster (at least by a factor of 5), but whether those loads
> resemble anything that occurs on real systems...

This ought to be a bit cleaner. Eric, could you test the variant below
on your setup?
diff --git a/fs/file.c b/fs/file.c
index 6c672ad..0144920 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -30,6 +30,9 @@ int sysctl_nr_open_min = BITS_PER_LONG;
 int sysctl_nr_open_max = __const_max(INT_MAX, ~(size_t)0/sizeof(void *)) &
 			 -BITS_PER_LONG;
 
+#define BITS_PER_CHUNK 512
+#define BYTES_PER_CHUNK (BITS_PER_CHUNK / 8)
+
 static void *alloc_fdmem(size_t size)
 {
 	/*
@@ -46,8 +49,10 @@ static void *alloc_fdmem(size_t size)
 
 static void __free_fdtable(struct fdtable *fdt)
 {
+	int i;
 	kvfree(fdt->fd);
-	kvfree(fdt->open_fds);
+	for (i = 0; i <= 3; i++)
+		kvfree(fdt->bitmaps[i]);
 	kfree(fdt);
 }
 
@@ -56,13 +61,23 @@ static void free_fdtable_rcu(struct rcu_head *rcu)
 	__free_fdtable(container_of(rcu, struct fdtable, rcu));
 }
 
+static inline bool cacheline_full(unsigned long *p)
+{
+	int n;
+
+	for (n = 0; n < BITS_PER_CHUNK / BITS_PER_LONG; n++)
+		if (likely(~p[n]))
+			return false;
+	return true;
+}
+
 /*
  * Expand the fdset in the files_struct. Called with the files spinlock
  * held for write.
  */
 static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt)
 {
-	unsigned int cpy, set;
+	unsigned int cpy, set, to, from, level, n;
 
 	BUG_ON(nfdt->max_fds < ofdt->max_fds);
 
@@ -71,18 +86,48 @@ static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt)
 	memcpy(nfdt->fd, ofdt->fd, cpy);
 	memset((char *)(nfdt->fd) + cpy, 0, set);
 
-	cpy = ofdt->max_fds / BITS_PER_BYTE;
-	set = (nfdt->max_fds - ofdt->max_fds) / BITS_PER_BYTE;
-	memcpy(nfdt->open_fds, ofdt->open_fds, cpy);
-	memset((char *)(nfdt->open_fds) + cpy, 0, set);
+	cpy = ofdt->max_fds / 8;
+	set = (nfdt->max_fds - ofdt->max_fds) / 8;
 	memcpy(nfdt->close_on_exec, ofdt->close_on_exec, cpy);
 	memset((char *)(nfdt->close_on_exec) + cpy, 0, set);
+	if (likely(!nfdt->bitmaps[1])) {
+		// flat to flat
+		memcpy(nfdt->bitmaps[0], ofdt->bitmaps[0], cpy);
+		memset((char *)(nfdt->bitmaps[0]) + cpy, 0, set);
+		return;
+	}
+	to = round_up(nfdt->max_fds, BITS_PER_CHUNK);
+	set = (to - ofdt->max_fds) / 8;
+	// copy and pad the primary
+	memcpy(nfdt->bitmaps[0], ofdt->bitmaps[0], ofdt->max_fds / 8);
+	memset((char *)nfdt->bitmaps[0] + ofdt->max_fds / 8, 0, set);
+	// copy and pad the old secondaries
+	from = round_up(ofdt->max_fds, BITS_PER_CHUNK);
+	for (level = 1; level <= 3; level++) {
+		if (!ofdt->bitmaps[level])
+			break;
+		to = round_up(to / BITS_PER_CHUNK, BITS_PER_CHUNK);
+		from = round_up(from / BITS_PER_CHUNK, BITS_PER_CHUNK);
+		memcpy(nfdt->bitmaps[level], ofdt->bitmaps[level], from / 8);
+		memset((char *)nfdt->bitmaps[level] + from / 8, 0, (to - from) / 8);
+	}
+	// zero the new ones (if any)
+	for (n = level; n <= 3; n++) {
+		if (!nfdt->bitmaps[n])
+			break;
+		to = round_up(to / BITS_PER_CHUNK, BITS_PER_CHUNK);
+		memset(nfdt->bitmaps[n], 0, to / 8);
+	}
+	// and maybe adjust bit 0 in the first new one.
+	if (unlikely(n != level && cacheline_full(nfdt->bitmaps[level - 1])))
+		__set_bit(0, nfdt->bitmaps[level]);
 }
 
 static struct fdtable * alloc_fdtable(unsigned int nr)
 {
 	struct fdtable *fdt;
 	void *data;
+
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Sun, Nov 01, 2015 at 06:14:43PM -0800, Eric Dumazet wrote:
> On Mon, 2015-11-02 at 00:24 +, Al Viro wrote:
>
> > This ought to be a bit cleaner. Eric, could you test the variant below
> > on your setup?
>
> Sure !
>
> 5 runs of :
> lpaa24:~# taskset ff0ff ./opensock -t 16 -n 1000 -l 10
>
> total = 4386311
> total = 4560402
> total = 4437309
> total = 4516227
> total = 4478778

Umm... With Linus' variant it was what, around 4M? +10% or so, then...

> With 48 threads :
>
> ./opensock -t 48 -n 1000 -l 10
> total = 4940245
> total = 4848513
> total = 4813153
> total = 4813946
> total = 5127804

And that - +40%? Interesting... And it looks like at 48 threads you are
still seeing arseloads of contention, but apparently less than with
Linus' variant...

What if you throw the __clear_close_on_exec() patch on top of that?
Looks like it's spending less time under ->files_lock... Could you get
information on fs/file.o hotspots?
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Fri, Oct 30, 2015 at 04:52:41PM -0700, Linus Torvalds wrote:
> I really suspect this patch is "good enough" in reality, and I would
> *much* rather do something like this than add a new non-POSIX flag
> that people have to update their binaries for. I agree with Eric that
> *some* people will do so, but it's still the wrong thing to do. Let's
> just make performance with the normal semantics be good enough that we
> don't need to play odd special games.
>
> Eric?

... and here's the current variant of mine. FWIW, it seems to survive LTP
and xfstests + obvious "let's torture the allocator". On the "million dups"
test it seems to be about 25% faster than the one Linus had posted, at ten
millions - about 80%. On opensock results seem to be about 20% better than
with the variant Linus has posted, but I'm not sure if the testbox is
anywhere near the expected, so I'd appreciate it if you'd given it a spin
on your setups.

It obviously needs saner comments, tuning, etc. BTW, another obvious
low-hanging fruit with this data structure is count_open_files() (and that
goes for the 1:64 bitmap Linus uses) - dup2(0, 1000); close(1000); fork();
and count_open_files() is chewing through the damn bitmap from about 16M
down to low tens. While holding ->files_lock, at that... I'm not saying
it's critical, and it's definitely a followup patch fodder in either
approach, but it's easy enough to do.
diff --git a/fs/file.c b/fs/file.c
index 6c672ad..fa43cbe 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -30,6 +30,8 @@ int sysctl_nr_open_min = BITS_PER_LONG;
 int sysctl_nr_open_max = __const_max(INT_MAX, ~(size_t)0/sizeof(void *)) &
 			 -BITS_PER_LONG;
 
+#define BITS_PER_CHUNK 512
+
 static void *alloc_fdmem(size_t size)
 {
 	/*
@@ -46,8 +48,10 @@ static void *alloc_fdmem(size_t size)
 
 static void __free_fdtable(struct fdtable *fdt)
 {
+	int i;
 	kvfree(fdt->fd);
-	kvfree(fdt->open_fds);
+	for (i = 0; i <= 3; i++)
+		kvfree(fdt->bitmaps[i]);
 	kfree(fdt);
 }
 
@@ -62,7 +66,7 @@ static void free_fdtable_rcu(struct rcu_head *rcu)
  */
 static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt)
 {
-	unsigned int cpy, set;
+	unsigned int cpy, set, to, from, level, n;
 
 	BUG_ON(nfdt->max_fds < ofdt->max_fds);
 
@@ -71,18 +75,53 @@ static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt)
 	memcpy(nfdt->fd, ofdt->fd, cpy);
 	memset((char *)(nfdt->fd) + cpy, 0, set);
 
-	cpy = ofdt->max_fds / BITS_PER_BYTE;
-	set = (nfdt->max_fds - ofdt->max_fds) / BITS_PER_BYTE;
-	memcpy(nfdt->open_fds, ofdt->open_fds, cpy);
-	memset((char *)(nfdt->open_fds) + cpy, 0, set);
+	cpy = ofdt->max_fds / 8;
+	set = (nfdt->max_fds - ofdt->max_fds) / 8;
 	memcpy(nfdt->close_on_exec, ofdt->close_on_exec, cpy);
 	memset((char *)(nfdt->close_on_exec) + cpy, 0, set);
+	if (likely(!nfdt->bitmaps[1])) {
+		// flat to flat
+		memcpy(nfdt->bitmaps[0], ofdt->bitmaps[0], cpy);
+		memset((char *)(nfdt->bitmaps[0]) + cpy, 0, set);
+		return;
+	}
+	to = round_up(nfdt->max_fds, BITS_PER_CHUNK);
+	set = (to - ofdt->max_fds) / 8;
+	// copy and pad the primary
+	memcpy(nfdt->bitmaps[0], ofdt->bitmaps[0], ofdt->max_fds / 8);
+	memset((char *)nfdt->bitmaps[0] + ofdt->max_fds / 8, 0, set);
+	// copy and pad the old secondaries
+	from = round_up(ofdt->max_fds, BITS_PER_CHUNK);
+	for (level = 1; level <= 3; level++) {
+		if (!ofdt->bitmaps[level])
+			break;
+		to = round_up(to / BITS_PER_CHUNK, BITS_PER_CHUNK);
+		from = round_up(from / BITS_PER_CHUNK, BITS_PER_CHUNK);
+		memcpy(nfdt->bitmaps[level], ofdt->bitmaps[level], from / 8);
+		memset((char *)nfdt->bitmaps[level] + from / 8, 0, (to - from) / 8);
+	}
+	// zero the new ones (if any)
+	for (n = level; n <= 3; n++) {
+		if (!nfdt->bitmaps[n])
+			break;
+		to = round_up(to / BITS_PER_CHUNK, BITS_PER_CHUNK);
+		memset(nfdt->bitmaps[n], 0, to / 8);
+	}
+	// and maybe adjust bit 0 in the first new one.
+	if (unlikely(n != level)) {
+		unsigned long *p = nfdt->bitmaps[level - 1];
+		for (n = 0; n < BITS_PER_CHUNK / BITS_PER_LONG; n++)
+			if (~p[n])
+				return;
+		__set_bit(0, nfdt->bitmaps[level]);
+	}
 }
 
 static struct fdtable * alloc_fdtable(unsigned int nr)
 {
 	struct fdtable *fdt;
 	void *data;
+	int level = 0;
 
 	/*
 	 * Figure out how many fds we actually want to support in this fdtable.
@@ -114,16 +153,28 @@ static struct fdtable * alloc_fdtable(unsigned int nr)
 		goto out_fdt;
 	fdt->fd = data;
 
+	if (nr >
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Sat, Oct 31, 2015 at 12:54:50PM -0700, Linus Torvalds wrote:
> On Sat, Oct 31, 2015 at 12:34 PM, Al Viro wrote:
> >
> > ... and here's the current variant of mine.
>
> Ugh. I really liked how simple mine ended up being. Yours is definitely not.

Note that it's not the final variant - just something that should be
testable. There are all kinds of things that still need cleaning/simplifying
in there - e.g. scan() is definitely more complex than needed (if nothing
else, the "small bitmap" case is simply find_next_zero_bit(), and the
rest all have size equal to a full cacheline; moreover, I'd overdone the
"... and check if there are other zero bits left" thing - its callers
used to use that a lot, and with the exception of two of them it was
absolutely worthless. So it ended up more generic than necessary and
I'm going to trim that crap down.

It's still going to end up more complex than yours, obviously, but not as
badly as it is now. I'm not sure if the speedup will be worth the
extra complexity, thus asking for testing... On _some_ loads it is
considerably faster (at least by a factor of 5), but whether those loads
resemble anything that occurs on real systems...

BTW, a considerable amount of unpleasantness is due to ragged-right-end
kinds of problems - take /proc/sys/fs/nr-open to something other than a
power of two and a whole lot of fun issues start happening. I went for
"if there are secondary bitmaps at all, pad all bitmaps to a multiple of
cacheline", which at least somewhat mitigates that mess; hell knows, there
might be a clever way to sidestep it entirely...
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Sat, Oct 31, 2015 at 12:34 PM, Al Viro wrote:
>
> ... and here's the current variant of mine.

Ugh. I really liked how simple mine ended up being. Yours is definitely not.

And based on the profiles from Eric, finding the fd is no longer the
problem even with my simpler patch. The problem ends up being the
contention on the file_lock spinlock.

Eric, I assume that's not "expand_fdtable", since your test-program
seems to expand the fd array at the beginning. So it's presumably all
from the __alloc_fd() use, but we should double-check.. Eric, can you
do a callgraph profile and see which caller is the hottest?

	Linus
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Sat, 2015-10-31 at 12:54 -0700, Linus Torvalds wrote:
> On Sat, Oct 31, 2015 at 12:34 PM, Al Viro wrote:
> >
> > ... and here's the current variant of mine.
>
> Ugh. I really liked how simple mine ended up being. Yours is definitely not.
>
> And based on the profiles from Eric, finding the fd is no longer the
> problem even with my simpler patch. The problem ends up being the
> contention on the file_lock spinlock.
>
> Eric, I assume that's not "expand_fdtable", since your test-program
> seems to expand the fd array at the beginning. So it's presumably all
> from the __alloc_fd() use, but we should double-check.. Eric, can you
> do a callgraph profile and see which caller is the hottest?

Sure: profile taken while test runs using 16 threads (since this is
probably not a too biased micro benchmark...)

# hostname : lpaa24
# os release : 4.3.0-smp-DEV
# perf version : 3.12.0-6-GOOGLE
# arch : x86_64
# nrcpus online : 48
# nrcpus avail : 48
# cpudesc : Intel(R) Xeon(R) CPU E5-2696 v2 @ 2.50GHz
# cpuid : GenuineIntel,6,62,4
# total memory : 264126320 kB
# cmdline : /usr/bin/perf record -a -g sleep 4
# event : name = cycles, type = 0, config = 0x0, config1 = 0x0, config2 = 0x0, excl_usr = 0, excl_kern = 0, excl_host = 0, excl_guest = 1, precise_ip = 0, att
# CPU_TOPOLOGY info available, use -I to display
# NUMA_TOPOLOGY info available, use -I to display
# pmu mappings: cpu = 4, msr = 38, uncore_cbox_10 = 35, uncore_cbox_11 = 36, software = 1, power = 7, uncore_irp = 8, uncore_pcu = 37, tracepoint = 2, uncore_
# Samples: 260K of event 'cycles'
# Event count (approx.): 196742182232
#
# Overhead  Command  Shared Object
# ........  .......  .............
#
67.15%  opensock  opensock           [.] memset
        |
        --- memset

13.84%  opensock  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
        |
        --- queued_spin_lock_slowpath
            |
            |--99.97%-- _raw_spin_lock
            |          |
            |          |--53.03%-- __close_fd
            |          |          sys_close
            |          |          entry_SYSCALL_64_fastpath
            |          |          __libc_close
            |          |          |
            |          |           --100.00%-- 0x0
            |          |
            |          |--46.83%-- __alloc_fd
            |          |          get_unused_fd_flags
            |          |          sock_map_fd
            |          |          sys_socket
            |          |          entry_SYSCALL_64_fastpath
            |          |          __socket
            |          |          |
            |          |           --100.00%-- 0x0
            |           --0.13%-- [...]
             --0.03%-- [...]

 1.84%  opensock  [kernel.kallsyms]  [k] _find_next_bit.part.0
        |
        --- _find_next_bit.part.0
            |
            |--65.97%-- find_next_zero_bit
            |          __alloc_fd
            |          get_unused_fd_flags
            |          sock_map_fd
            |          sys_socket
            |          entry_SYSCALL_64_fastpath
            |          __socket
            |
            |--34.01%-- __alloc_fd
            |          get_unused_fd_flags
            |          sock_map_fd
            |          sys_socket
            |          entry_SYSCALL_64_fastpath
            |          __socket
            |          |
            |           --100.00%-- 0x0
             --0.02%-- [...]

 1.59%  opensock  [kernel.kallsyms]  [k] _raw_spin_lock
        |
        --- _raw_spin_lock
            |
            |--28.78%-- get_unused_fd_flags
            |          sock_map_fd
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Sat, Oct 31, 2015 at 1:45 PM, Eric Dumazet wrote:
> 13.84%  opensock  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
>         |
>         --- queued_spin_lock_slowpath
>             |
>             |--99.97%-- _raw_spin_lock
>             |          |
>             |          |--53.03%-- __close_fd
>             |          |
>             |          |--46.83%-- __alloc_fd

Interesting. "__close_fd" actually looks more expensive than allocation.
They presumably get called equally often, so it's probably some cache
effect.

__close_fd() doesn't do anything even remotely interesting as far as I
can tell, but it strikes me that we probably take a *lot* of cache misses
on the stupid "close-on-exec" flags, which are probably always zero anyway.

Mind testing something really stupid, and making the __clear_bit() in
__clear_close_on_exec() conditional, something like this:

  static inline void __clear_close_on_exec(int fd, struct fdtable *fdt)
  {
-	__clear_bit(fd, fdt->close_on_exec);
+	if (test_bit(fd, fdt->close_on_exec))
+		__clear_bit(fd, fdt->close_on_exec);
  }

and see if it makes a difference.

This is the kind of thing that a single-threaded (or even single-socket)
test will never actually show, because it caches well enough. But for
two sockets, I could imagine the unnecessary dirtying of cachelines and
ping-pong being noticeable.

The other stuff we probably can't do all that much about. Unless we
decide to go for some complicated lockless optimistic file descriptor
allocation scheme with retry-on-failure instead of locks. Which I'm
sure is possible, but I'm equally sure is painful.

	Linus
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Sat, Oct 31, 2015 at 02:23:31PM -0700, Linus Torvalds wrote:
> The other stuff we probably can't do all that much about. Unless we
> decide to go for some complicated lockless optimistic file descriptor
> allocation scheme with retry-on-failure instead of locks. Which I'm
> sure is possible, but I'm equally sure is painful.

The interesting part is dup2() - we'd have to do something like

	serialize against other dup2
	was_claimed = atomically set and test bit in bitmap
	if was_claimed
		tofree = fdt->fd[fd];
		if (!tofree)
			fail with EBUSY
	install into ->fd[...]
	end of critical area

in there; __alloc_fd() could be made retry-on-failure, but I don't see
how to cope with dup2 vs. dup2 without an explicit exclusion.
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Fri, 2015-10-30 at 16:52 -0700, Linus Torvalds wrote:

sequential allocations...
>
> I don't think it would matter in real life, since I don't really think
> you have lots of fd's with strictly sequential behavior.
>
> That said, the trivial "open lots of fds" benchmark would show it, so
> I guess we can just keep it. The next_fd logic is not expensive or
> complex, after all.

+1

> Attached is an updated patch that just uses the regular bitmap
> allocator and extends it to also have the bitmap of bitmaps. It
> actually simplifies the patch, so I guess it's better this way.
>
> Anyway, I've tested it all a bit more, and for a trivial worst-case
> stress program that explicitly kills the next_fd logic by doing
>
>	for (i = 0; i < 100; i++) {
>		close(3);
>		dup2(0,3);
>		if (dup(0))
>			break;
>	}
>
> it takes it down from roughly 10s to 0.2s. So the patch is quite
> noticeable on that kind of pattern.
>
> NOTE! You'll obviously need to increase your limits to actually be
> able to do the above with lots of file descriptors.
>
> I ran Eric's test-program too, and find_next_zero_bit() dropped to a
> fraction of a percent. It's not entirely gone, but it's down in the
> noise.
>
> I really suspect this patch is "good enough" in reality, and I would
> *much* rather do something like this than add a new non-POSIX flag
> that people have to update their binaries for. I agree with Eric that
> *some* people will do so, but it's still the wrong thing to do. Let's
> just make performance with the normal semantics be good enough that we
> don't need to play odd special games.
>
> Eric?

I absolutely agree a generic solution is far better, especially when its
performance is on par.

Tested-by: Eric Dumazet
Acked-by: Eric Dumazet

Note that a non-POSIX flag (or a thread personality hint) would still
allow the kernel to do proper NUMA affinity placement: say the fd_array
and bitmaps are split on the 2 nodes (or more, but most servers nowadays
have 2 sockets really). Then at fd allocation time, we can prefer to pick
an fd for which the memory holding the various bits and the file pointer
is in the local node. This speeds up subsequent fd system calls on
programs that constantly blow away cpu caches, saving QPI transactions.

Thanks a lot Linus.

lpaa24:~# taskset ff0ff ./opensock -t 16 -n 1000 -l 10
count=1000 (check/increase ulimit -n)
total = 3992764

lpaa24:~# ./opensock -t 48 -n 1000 -l 10
count=1000 (check/increase ulimit -n)
total = 3545249

Profile with 16 threads :

69.55%  opensock  [.] memset
11.83%  [kernel]  [k] queued_spin_lock_slowpath
 1.91%  [kernel]  [k] _find_next_bit.part.0
 1.68%  [kernel]  [k] _raw_spin_lock
 0.99%  [kernel]  [k] kmem_cache_alloc
 0.99%  [kernel]  [k] memset_erms
 0.95%  [kernel]  [k] get_empty_filp
 0.82%  [kernel]  [k] __close_fd
 0.73%  [kernel]  [k] __alloc_fd
 0.65%  [kernel]  [k] sk_alloc
 0.63%  opensock  [.] child_function
 0.56%  [kernel]  [k] fput
 0.35%  [kernel]  [k] sock_alloc
 0.31%  [kernel]  [k] kmem_cache_free
 0.31%  [kernel]  [k] inode_init_always
 0.28%  [kernel]  [k] d_set_d_op
 0.27%  [kernel]  [k] entry_SYSCALL_64_after_swapgs

Profile with 48 threads :

57.92%  [kernel]  [k] queued_spin_lock_slowpath
32.14%  opensock  [.] memset
 0.81%  [kernel]  [k] _find_next_bit.part.0
 0.51%  [kernel]  [k] _raw_spin_lock
 0.45%  [kernel]  [k] kmem_cache_alloc
 0.38%  [kernel]  [k] kmem_cache_free
 0.34%  [kernel]  [k] __close_fd
 0.32%  [kernel]  [k] memset_erms
 0.25%  [kernel]  [k] __alloc_fd
 0.24%  [kernel]  [k] get_empty_filp
 0.23%  opensock  [.] child_function
 0.18%  [kernel]  [k] __d_alloc
 0.17%  [kernel]  [k] inode_init_always
 0.16%  [kernel]  [k] sock_alloc
 0.16%  [kernel]  [k] del_timer
 0.15%  [kernel]  [k] entry_SYSCALL_64_after_swapgs
 0.15%  perf      [.] 0x0004d924
 0.15%  [kernel]  [k] tcp_close
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Sat, 2015-10-31 at 14:23 -0700, Linus Torvalds wrote:
> Mind testing something really stupid, and making the __clear_bit() in
> __clear_close_on_exec() conditional, something like this:
>
>	static inline void __clear_close_on_exec(int fd, struct fdtable *fdt)
>	{
>	-	__clear_bit(fd, fdt->close_on_exec);
>	+	if (test_bit(fd, fdt->close_on_exec))
>	+		__clear_bit(fd, fdt->close_on_exec);
>	}
>
> and see if it makes a difference.

It does ;) About 4 % qps increase.

3 runs :

lpaa24:~# taskset ff0ff ./opensock -t 16 -n 1000 -l 10
total = 4176651
total = 4178012
total = 4105226

instead of :

total = 3910620
total = 3874567
total = 3971028

Perf profile :

69.12% opensock opensock          [.] memset
        |
        --- memset

12.37% opensock [kernel.kallsyms] [k] queued_spin_lock_slowpath
        |
        --- queued_spin_lock_slowpath
           |
           |--99.99%-- _raw_spin_lock
           |          |
           |          |--51.99%-- __close_fd
           |          |          sys_close
           |          |          entry_SYSCALL_64_fastpath
           |          |          __libc_close
           |          |          |
           |          |           --100.00%-- 0x0
           |          |
           |          |--47.79%-- __alloc_fd
           |          |          get_unused_fd_flags
           |          |          sock_map_fd
           |          |          sys_socket
           |          |          entry_SYSCALL_64_fastpath
           |          |          __socket
           |          |          |
           |          |           --100.00%-- 0x0
           |           --0.21%-- [...]
            --0.01%-- [...]

 1.92% opensock [kernel.kallsyms] [k] _find_next_bit.part.0
        |
        --- _find_next_bit.part.0
           |
           |--66.93%-- find_next_zero_bit
           |          __alloc_fd
           |          get_unused_fd_flags
           |          sock_map_fd
           |          sys_socket
           |          entry_SYSCALL_64_fastpath
           |          __socket
           |
            --33.07%-- __alloc_fd
                      get_unused_fd_flags
                      sock_map_fd
                      sys_socket
                      entry_SYSCALL_64_fastpath
                      __socket
                      |
                       --100.00%-- 0x0

 1.63% opensock [kernel.kallsyms] [k] _raw_spin_lock
        |
        --- _raw_spin_lock
           |
           |--28.66%-- get_unused_fd_flags
           |          sock_map_fd
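The 4 % comes from avoiding the store when the bit is already clear: an unconditional __clear_bit() dirties the cacheline even when nothing changes, forcing it exclusive and generating coherence traffic between CPUs. A minimal userspace sketch of the same test-before-write pattern (non-atomic, names are ours, not the kernel's):

```c
#include <stdint.h>

/* Test-before-clear: only write to the word when the bit is actually
 * set, so an already-clear bit costs a read (cacheline stays shared)
 * instead of a write (cacheline goes exclusive, invalidating copies
 * held by other CPUs). */
static void clear_bit_if_set(unsigned long *word, unsigned int bit)
{
	unsigned long mask = 1UL << bit;

	if (*word & mask)	/* read-only fast path */
		*word &= ~mask;	/* write only when needed */
}
```

Functionally identical to an unconditional clear; the win is purely in the memory traffic, which is why it only shows up under heavy multi-threaded load.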
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Fri, Oct 30, 2015 at 2:23 PM, Linus Torvalds wrote:
> On Fri, Oct 30, 2015 at 2:02 PM, Al Viro wrote:
>>
>> Your variant has 1:64 ratio; obviously better than now, but we can actually
>> do 1:bits-per-cacheline quite easily.
>
> Ok, but in that case you end up needing a counter for each cacheline
> too (to count how many bits, in order to know when to say "cacheline
> is entirely full").

So here's a largely untested version of my "one bit per word" approach.
It seems to work, but looking at it, I'm unhappy with a few things:

 - using kmalloc() for the .full_fds_bits[] array is simple, but
   disgusting, since 99% of all programs just have a single word.

   I know I talked about just adding the allocation to the same one
   that allocates the bitmaps themselves, but I got lazy and didn't do
   it. Especially since that code seems to try fairly hard to make the
   allocations nice powers of two, according to the comments. That may
   actually matter from an allocation standpoint.

 - Maybe we could just use that "full_fds_bits_init" field for when a
   single word is sufficient, and avoid the kmalloc that way?

Anyway. This is a pretty simple patch, and I actually think that we
could just get rid of the "next_fd" logic entirely with this. That
would make this *patch* be more complicated, but it would make the
resulting *code* be simpler.

Hmm? Want to play with this? Eric, what does this do to your test-case?
            Linus

 fs/file.c               | 49 ++---
 include/linux/fdtable.h |  2 ++
 2 files changed, 48 insertions(+), 3 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 6c672ad329e9..0b89a97cabb5 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -48,6 +48,7 @@ static void __free_fdtable(struct fdtable *fdt)
 {
 	kvfree(fdt->fd);
 	kvfree(fdt->open_fds);
+	kfree(fdt->full_fds_bits);
 	kfree(fdt);
 }
 
@@ -56,6 +57,9 @@ static void free_fdtable_rcu(struct rcu_head *rcu)
 	__free_fdtable(container_of(rcu, struct fdtable, rcu));
 }
 
+#define BITBIT_NR(nr)	BITS_TO_LONGS(BITS_TO_LONGS(nr))
+#define BITBIT_SIZE(nr)	(BITBIT_NR(nr) * sizeof(long))
+
 /*
  * Expand the fdset in the files_struct.  Called with the files spinlock
  * held for write.
@@ -77,6 +81,11 @@ static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt)
 	memset((char *)(nfdt->open_fds) + cpy, 0, set);
 	memcpy(nfdt->close_on_exec, ofdt->close_on_exec, cpy);
 	memset((char *)(nfdt->close_on_exec) + cpy, 0, set);
+
+	cpy = BITBIT_SIZE(ofdt->max_fds);
+	set = BITBIT_SIZE(nfdt->max_fds) - cpy;
+	memcpy(nfdt->full_fds_bits, ofdt->full_fds_bits, cpy);
+	memset(cpy+(char *)nfdt->full_fds_bits, 0, set);
 }
 
 static struct fdtable * alloc_fdtable(unsigned int nr)
@@ -122,8 +131,21 @@ static struct fdtable * alloc_fdtable(unsigned int nr)
 	data += nr / BITS_PER_BYTE;
 	fdt->close_on_exec = data;
 
+	/*
+	 * The "bitmap of bitmaps" has a bit set for each word in
+	 * the open_fds array that is full. We initialize it to
+	 * zero both at allocation and at copying, because it is
+	 * important that it never have a bit set for a non-full
+	 * word, but it doesn't have to be exact otherwise.
+	 */
+	fdt->full_fds_bits = kzalloc(BITBIT_SIZE(nr), GFP_KERNEL);
+	if (!fdt->full_fds_bits)
+		goto out_nofull;
+
 	return fdt;
 
+out_nofull:
+	kvfree(fdt->open_fds);
 out_arr:
 	kvfree(fdt->fd);
 out_fdt:
@@ -229,14 +251,18 @@ static inline void __clear_close_on_exec(int fd, struct fdtable *fdt)
 	__clear_bit(fd, fdt->close_on_exec);
 }
 
-static inline void __set_open_fd(int fd, struct fdtable *fdt)
+static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
 {
 	__set_bit(fd, fdt->open_fds);
+	fd /= BITS_PER_LONG;
+	if (!~fdt->open_fds[fd])
+		__set_bit(fd, fdt->full_fds_bits);
 }
 
-static inline void __clear_open_fd(int fd, struct fdtable *fdt)
+static inline void __clear_open_fd(unsigned int fd, struct fdtable *fdt)
 {
 	__clear_bit(fd, fdt->open_fds);
+	__clear_bit(fd / BITS_PER_LONG, fdt->full_fds_bits);
 }
 
 static int count_open_files(struct fdtable *fdt)
@@ -280,6 +306,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 	new_fdt->max_fds = NR_OPEN_DEFAULT;
 	new_fdt->close_on_exec = newf->close_on_exec_init;
 	new_fdt->open_fds = newf->open_fds_init;
+	new_fdt->full_fds_bits = newf->full_fds_bits_init;
 	new_fdt->fd = &newf->fd_array[0];
 
 	spin_lock(&oldf->file_lock);
@@ -323,6 +350,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 
 	memcpy(new_fdt->open_fds, old_fdt->open_fds, open_files / 8);
 	memcpy(new_fdt->close_on_exec, old_fdt->close_on_exec, open_files / 8);
+	memcpy(new_fdt->full_fds_bits,
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Fri, Oct 30, 2015 at 2:02 PM, Al Viro wrote:
>
> Your variant has 1:64 ratio; obviously better than now, but we can actually
> do 1:bits-per-cacheline quite easily.

Ok, but in that case you end up needing a counter for each cacheline
too (to count how many bits, in order to know when to say "cacheline
is entirely full").

Because otherwise you'll end up having to check the whole cacheline
(rather than check just one word, or check the "bit counter") when you
set a bit.

So I suspect a "bitmap of full words" ends up being much simpler. But
hey, if you can make your more complex thing work, go right ahead.

            Linus
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Fri, Oct 30, 2015 at 3:33 PM, Al Viro wrote:
> On Fri, Oct 30, 2015 at 02:50:46PM -0700, Linus Torvalds wrote:
>
>> Anyway. This is a pretty simple patch, and I actually think that we
>> could just get rid of the "next_fd" logic entirely with this. That
>> would make this *patch* be more complicated, but it would make the
>> resulting *code* be simpler.
>
> Dropping next_fd would screw you in case of strictly sequential allocations...

I don't think it would matter in real life, since I don't really think
you have lots of fd's with strictly sequential behavior.

That said, the trivial "open lots of fds" benchmark would show it, so
I guess we can just keep it. The next_fd logic is not expensive or
complex, after all.

> Your point re power-of-two allocations is well-taken, but then I'm not
> sure that kzalloc() is good enough here.

Attached is an updated patch that just uses the regular bitmap
allocator and extends it to also have the bitmap of bitmaps. It
actually simplifies the patch, so I guess it's better this way.

Anyway, I've tested it all a bit more, and for a trivial worst-case
stress program that explicitly kills the next_fd logic by doing

	for (i = 0; i < 100; i++) {
		close(3);
		dup2(0,3);
		if (dup(0))
			break;
	}

it takes it down from roughly 10s to 0.2s. So the patch is quite
noticeable on that kind of pattern.

NOTE! You'll obviously need to increase your limits to actually be
able to do the above with lots of file descriptors.

I ran Eric's test-program too, and find_next_zero_bit() dropped to a
fraction of a percent. It's not entirely gone, but it's down in the
noise.

I really suspect this patch is "good enough" in reality, and I would
*much* rather do something like this than add a new non-POSIX flag
that people have to update their binaries for. I agree with Eric that
*some* people will do so, but it's still the wrong thing to do. Let's
just make performance with the normal semantics be good enough that we
don't need to play odd special games.

Eric?
            Linus

 fs/file.c               | 39 +++
 include/linux/fdtable.h |  2 ++
 2 files changed, 37 insertions(+), 4 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 6c672ad329e9..6f6eb2b03af5 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -56,6 +56,9 @@ static void free_fdtable_rcu(struct rcu_head *rcu)
 	__free_fdtable(container_of(rcu, struct fdtable, rcu));
 }
 
+#define BITBIT_NR(nr)	BITS_TO_LONGS(BITS_TO_LONGS(nr))
+#define BITBIT_SIZE(nr)	(BITBIT_NR(nr) * sizeof(long))
+
 /*
  * Expand the fdset in the files_struct.  Called with the files spinlock
  * held for write.
@@ -77,6 +80,11 @@ static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt)
 	memset((char *)(nfdt->open_fds) + cpy, 0, set);
 	memcpy(nfdt->close_on_exec, ofdt->close_on_exec, cpy);
 	memset((char *)(nfdt->close_on_exec) + cpy, 0, set);
+
+	cpy = BITBIT_SIZE(ofdt->max_fds);
+	set = BITBIT_SIZE(nfdt->max_fds) - cpy;
+	memcpy(nfdt->full_fds_bits, ofdt->full_fds_bits, cpy);
+	memset(cpy+(char *)nfdt->full_fds_bits, 0, set);
 }
 
 static struct fdtable * alloc_fdtable(unsigned int nr)
@@ -115,12 +123,14 @@ static struct fdtable * alloc_fdtable(unsigned int nr)
 	fdt->fd = data;
 
 	data = alloc_fdmem(max_t(size_t,
-				 2 * nr / BITS_PER_BYTE, L1_CACHE_BYTES));
+				 2 * nr / BITS_PER_BYTE + BITBIT_SIZE(nr), L1_CACHE_BYTES));
 	if (!data)
 		goto out_arr;
 	fdt->open_fds = data;
 	data += nr / BITS_PER_BYTE;
 	fdt->close_on_exec = data;
+	data += nr / BITS_PER_BYTE;
+	fdt->full_fds_bits = data;
 
 	return fdt;
 
@@ -229,14 +239,18 @@ static inline void __clear_close_on_exec(int fd, struct fdtable *fdt)
 	__clear_bit(fd, fdt->close_on_exec);
 }
 
-static inline void __set_open_fd(int fd, struct fdtable *fdt)
+static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
 {
 	__set_bit(fd, fdt->open_fds);
+	fd /= BITS_PER_LONG;
+	if (!~fdt->open_fds[fd])
+		__set_bit(fd, fdt->full_fds_bits);
 }
 
-static inline void __clear_open_fd(int fd, struct fdtable *fdt)
+static inline void __clear_open_fd(unsigned int fd, struct fdtable *fdt)
 {
 	__clear_bit(fd, fdt->open_fds);
+	__clear_bit(fd / BITS_PER_LONG, fdt->full_fds_bits);
 }
 
 static int count_open_files(struct fdtable *fdt)
@@ -280,6 +294,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 	new_fdt->max_fds = NR_OPEN_DEFAULT;
 	new_fdt->close_on_exec = newf->close_on_exec_init;
 	new_fdt->open_fds = newf->open_fds_init;
+	new_fdt->full_fds_bits = newf->full_fds_bits_init;
 	new_fdt->fd = &newf->fd_array[0];
 
 	spin_lock(&oldf->file_lock);
@@ -323,6 +338,7 @@ struct files_struct *dup_fd(struct files_struct
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Fri, 2015-10-30 at 14:50 -0700, Linus Torvalds wrote:
> On Fri, Oct 30, 2015 at 2:23 PM, Linus Torvalds wrote:
> > On Fri, Oct 30, 2015 at 2:02 PM, Al Viro wrote:
> >>
> >> Your variant has 1:64 ratio; obviously better than now, but we can actually
> >> do 1:bits-per-cacheline quite easily.
> >
> > Ok, but in that case you end up needing a counter for each cacheline
> > too (to count how many bits, in order to know when to say "cacheline
> > is entirely full").
>
> So here's a largely untested version of my "one bit per word"
> approach. It seems to work, but looking at it, I'm unhappy with a few
> things:
>
> - using kmalloc() for the .full_fds_bits[] array is simple, but
> disgusting, since 99% of all programs just have a single word.
>
>    I know I talked about just adding the allocation to the same one
> that allocates the bitmaps themselves, but I got lazy and didn't do
> it. Especially since that code seems to try fairly hard to make the
> allocations nice powers of two, according to the comments. That may
> actually matter from an allocation standpoint.
>
> - Maybe we could just use that "full_fds_bits_init" field for when a
> single word is sufficient, and avoid the kmalloc that way?

At least make sure the allocation uses a cache line, so that multiple
processes do not share the same cache line for this possibly hot field:

	fdt->full_fds_bits = kzalloc(max_t(size_t, L1_CACHE_BYTES,
					   BITBIT_SIZE(nr)),
				     GFP_KERNEL);

> Anyway. This is a pretty simple patch, and I actually think that we
> could just get rid of the "next_fd" logic entirely with this. That
> would make this *patch* be more complicated, but it would make the
> resulting *code* be simpler.
>
> Hmm? Want to play with this? Eric, what does this do to your test-case?

Excellent results so far Linus, 500 % increase, thanks a lot !

Tested using 16 threads, 8 on Socket0, 8 on Socket1.

Before patch :

# ulimit -n 1200
# taskset ff0ff ./opensock -t 16 -n 1000 -l 10
count=1000 (check/increase ulimit -n)
total = 636870

After patch :

taskset ff0ff ./opensock -t 16 -n 1000 -l 10
count=1000 (check/increase ulimit -n)
total = 3845134   (6 times better)

Your patch out-performs the O_FD_FASTALLOC one on this particular test
by ~ 9 % :

taskset ff0ff ./opensock -t 16 -n 1000 -l 10 -f
count=1000 (check/increase ulimit -n)
total = 3505252

If I raise to 48 threads, the FAST_ALLOC wins by 5 % (3752087 instead
of 354).

Oh, and 48 threads without any patches : 383027 -> the program runs one
order of magnitude faster, congrats !
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Fri, Oct 30, 2015 at 04:52:41PM -0700, Linus Torvalds wrote:
> I really suspect this patch is "good enough" in reality, and I would
> *much* rather do something like this than add a new non-POSIX flag
> that people have to update their binaries for. I agree with Eric that
> *some* people will do so, but it's still the wrong thing to do. Let's
> just make performance with the normal semantics be good enough that we
> don't need to play odd special games.
>
> Eric?

IIRC, at least a part of what Eric used to complain about was that on
seriously multithreaded processes doing a lot of e.g. socket(2) we end
up with a lot of bouncing of the cacheline containing the first free
bits in the bitmap.

But looking at the whole thing, I really wonder if the tons of threads
asking for random bytes won't get at least as bad cacheline bouncing
while getting said bytes, so I'm not sure that rationale has survived.

PS: this problem obviously exists in Linus' variant as well as in mine;
the question is whether Eric's approach manages to avoid it in the
first place.
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Thu, Oct 29, 2015 at 5:35 AM, Eric Dumazet wrote:
>
> Well, I already tested the O_FD_FASTALLOC thing, and I can tell you
> find_next_zero_bit() is nowhere to be found in kernel profiles anymore.
> It also lowers time we hold the fd array spinlock while doing fd alloc.

I do wonder if we couldn't just speed up the bitmap allocator by an
order of magnitude. It would be nicer to be able to make existing
loads faster without any new "don't bother with POSIX semantics" flag.

We could, for example, extend the "open_fds" bitmap with a
"second-level" bitmap that has information on when the first level is
full. We traverse the open_fd's bitmap one word at a time anyway, we
could have a second-level bitmap that has one bit per word to say
whether that word is already full.

So you'd basically have:

 - open_fds: array with the real bitmap (exactly like we do now)

 - full_fds_bits: array of bits that shows which of the "unsigned
   longs" in 'open_fds' are already full.

and then you can basically walk 4096 fd's at a time by just checking
one single word in the full_fds_bits[] array.

So in __alloc_fd(), instead of using find_next_zero_bit(), you'd use
"find_next_fd()", which would do something like

	#define NR_BITS_PER_BIT (64*sizeof(long)*sizeof(long))

	unsigned long find_next_fd(struct fdtable *fdt, unsigned long start)
	{
		while (start < fdt->max_fds) {
			unsigned long startbit = start / BITS_PER_LONG;
			unsigned long bitbit = startbit / BITS_PER_LONG;
			unsigned long bitmax = (bitbit+1) * BITS_PER_LONG * BITS_PER_LONG;

			if (bitmax > fdt->max_fds)
				bitmax = fdt->max_fds;

			// Are all the bits in all the bitbit arrays already set?
			if (!~fdt->full_fds_bits[bitbit]) {
				start = bitmax;
				continue;
			}

			fd = find_next_zero_bit(fdt->open_fds, bitmax, start);
			if (fd < bitmax)
				return fd;
			start = bitmax;
		}
		return fdt->max_fds;
	}

which really should cut down on the bit traversal by a factor of 64.

Of course, then you do have to maintain that bitmap-of-bits in
__set_open_fd() and __clear_open_fd(), but that should be pretty
trivial too:

 - __clear_open_fd() would become just

	__clear_bit(fd, fdt->open_fds);
	__clear_bit(fd / BITS_PER_LONG, fdt->full_fds_bits);

 - and __set_open_fd() would become

	__set_bit(fd, fdt->open_fds);
	fd /= BITS_PER_LONG;
	if (!~fdt->open_fds[fd])
		__set_bit(fd, fdt->full_fds_bits);

or something.

NOTE NOTE NOTE! The above is all very much written without any testing
in the email client, so while I tried to make it look like "real code",
consider the above just pseudo-code.

The advantage of the above is that it should just work for existing
binaries. It may not be quite as optimal as just introducing a new
"don't care about POSIX" feature, but quite frankly, if it cuts down
the bad case of "find_next_zero_bit()" by a factor of 64 (and then
adds a *small* expense factor on top of that), I suspect it should be
"good enough" even for your nasty case.

What do you think? Willing to try the above approach (with any
inevitable bug-fixes) and see how it compares?

Obviously in addition to any fixes to my pseudo-code above you'd need
to add the allocations for the new "full_fds_bits" etc, but I think it
should be easy to make the full_fds_bit allocation be *part* of the
"open_fds" allocation, so you wouldn't need a new allocation in
alloc_fdtable(). We already do that whole "use a single allocation" to
combine open_fds with close_on_exec into one single allocation.

            Linus
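The two-level scheme above can be exercised entirely in userspace. Below is a hedged, self-contained sketch (all names are local to this example, not the kernel's): a summary bit is kept per `open_fds` word and set when that word fills up, and the search skips whole full words via the summary. It simplifies Linus's loop to a per-word summary check rather than the 4096-fds-per-summary-word stride, but maintains the same invariant — a summary bit is set if and only if its word is all ones.

```c
#include <stdint.h>
#include <string.h>

#define XBITS_PER_LONG (8 * (unsigned int)sizeof(unsigned long))

/* Userspace model of the two-level fd bitmap: 1024 fds, one summary
 * bit per word of open_fds. All names are illustrative only. */
struct fdbitmap {
	unsigned long open_fds[16];	/* 16 words = 1024 fd bits */
	unsigned long full_fds_bits[1];	/* 1 summary bit per word above */
	unsigned int max_fds;
};

static void set_open_fd(struct fdbitmap *f, unsigned int fd)
{
	unsigned int w = fd / XBITS_PER_LONG;

	f->open_fds[w] |= 1UL << (fd % XBITS_PER_LONG);
	if (!~f->open_fds[w])	/* word just became all ones: mark it full */
		f->full_fds_bits[w / XBITS_PER_LONG] |= 1UL << (w % XBITS_PER_LONG);
}

static void clear_open_fd(struct fdbitmap *f, unsigned int fd)
{
	unsigned int w = fd / XBITS_PER_LONG;

	f->open_fds[w] &= ~(1UL << (fd % XBITS_PER_LONG));
	/* the word can no longer be full; clearing unconditionally is safe */
	f->full_fds_bits[w / XBITS_PER_LONG] &= ~(1UL << (w % XBITS_PER_LONG));
}

static unsigned int find_next_fd(const struct fdbitmap *f, unsigned int start)
{
	unsigned int fd = start;

	while (fd < f->max_fds) {
		unsigned int w = fd / XBITS_PER_LONG;

		/* summary bit set => the whole word of fds is full: skip it */
		if (f->full_fds_bits[w / XBITS_PER_LONG] & (1UL << (w % XBITS_PER_LONG))) {
			fd = (w + 1) * XBITS_PER_LONG;
			continue;
		}
		if (!(f->open_fds[w] & (1UL << (fd % XBITS_PER_LONG))))
			return fd;
		fd++;
	}
	return f->max_fds;
}
```

Filling fds 0..63 sets one summary bit, and the next search jumps straight to fd 64 without scanning the full word bit by bit — the behavior the patch relies on.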
RE: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
From: David Holland
> Sent: 29 October 2015 14:59
> On Tue, Oct 27, 2015 at 10:52:46AM +, Alan Burlison wrote:
>
> > > But in general, this is basically a problem with the application: the file
> > > descriptor space is shared between threads and having one thread sniping
> > > at open files, you do have a problem and whatever the kernel does in that
> > > case perhaps doesn't matter all that much: the application needs to be
> > > fixed anyway.
> >
> > The scenario in Hadoop is that the FD is being used by a thread that's
> > waiting in accept and another thread wants to shut it down, e.g. because
> > the application is terminating and needs to stop all threads cleanly.
>
> ISTM that the best way to do this is to post a signal to the thread so
> accept bails with EINTR, at which point it can check to see if it's
> supposed to be exiting.

Actually, just send it a connect indication. ISTM that the correct call
should be listen(fd, 0); although that doesn't help a thread stuck in
recvmsg() for a datagram.

It is also tempting to think that close(fd) should sleep until all io
activities using that fd have completed - whether or not blocking calls
are woken.

	David
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Fri, Oct 30, 2015 at 10:18:12AM -0700, Linus Torvalds wrote:
> I do wonder if we couldn't just speed up the bitmap allocator by an
> order of magnitude. It would be nicer to be able to make existing
> loads faster without any new "don't bother with POSIX semantics" flag.
>
> We could, for example, extend the "open_fds" bitmap with a
> "second-level" bitmap that has information on when the first level is
> full. We traverse the open_fd's bitmap one word at a time anyway, we
> could have a second-level bitmap that has one bit per word to say
> whether that word is already full.

Your variant has 1:64 ratio; obviously better than now, but we can
actually do 1:bits-per-cacheline quite easily.

I've been playing with a variant that has more than two bitmaps, and
AFAICS it a) does not increase the number of cachelines pulled and b)
keeps it well-bound even in the absolutely worst case (128M-odd
descriptors filled, followed by close(0);dup2(1,0); - in that case it
ends up accessing the 7 cachelines worth of bitmaps; your variant will
read through 4000-odd cachelines of the summary bitmap alone; the
mainline is even worse).

> The advantage of the above is that it should just work for existing
> binaries. It may not be quite as optimal as just introducing a new
> "don't care about POSIX" feature, but quite frankly, if it cuts down
> the bad case of "find_next_zero_bit()" by a factor of 64 (and then
> adds a *small* expense factor on top of that), I suspect it should be
> "good enough" even for your nasty case.
>
> What do you think? Willing to try the above approach (with any
> inevitable bug-fixes) and see how it compares?
>
> Obviously in addition to any fixes to my pseudo-code above you'd need
> to add the allocations for the new "full_fds_bits" etc, but I think it
> should be easy to make the full_fds_bit allocation be *part* of the
> "open_fds" allocation, so you wouldn't need a new allocation in
> alloc_fdtable(). We already do that whole "use a single allocation" to
> combine open_fds with close_on_exec into one single allocation.

I'll finish testing what I've got and post it; it costs 3 extra
pointers in the files_struct and a slightly fatter bitmap allocation
(less than 0.2% extra). All the arguments regarding the unmodified
binaries apply, of course, and so far it looks fairly compact...
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Fri, Oct 30, 2015 at 05:43:21PM +, David Laight wrote:
> ISTM that the correct call should be listen(fd, 0);
> Although that doesn't help a thread stuck in recvmsg() for a datagram.
>
> It is also tempting to think that close(fd) should sleep until all
> io activities using that fd have completed - whether or not blocking
> calls are woken.

Sigh... The kernel has no idea when other threads are done with "all
io activities using that fd" - it can wait for them to leave kernel
mode, but there's fuck-all it can do about e.g. a userland loop doing
write() until there's more data to send. And no, you can't rely upon
them catching EBADF on the next iteration - by the time they get there,
close() might very well have returned and open() from yet another
thread might've grabbed the same descriptor. Welcome to your data being
written to hell knows what...

That's precisely the reason why "wait in close()" kind of semantics is
worthless - the races are still there, and having them a bit harder to
hit just makes them harder to debug. Worse, it might create an
impression of safety where there's none to be had.
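The danger Al describes rests on POSIX's rule that open() returns the lowest-numbered free descriptor: the instant close() returns, the slot is eligible for reuse by any other thread. A small sketch showing just the deterministic reuse part (single-threaded here; the cross-thread race is the same mechanism with worse timing — names are ours, for illustration):

```c
#include <fcntl.h>
#include <unistd.h>

/* POSIX guarantees open() hands back the lowest-numbered free
 * descriptor, so a slot released by close() is immediately up for
 * grabs. Returns 1 if the freed number was recycled, 0 if not,
 * -1 on open failure. */
static int fd_slot_is_recycled(void)
{
	int a = open("/dev/null", O_WRONLY);
	if (a < 0)
		return -1;
	close(a);				/* slot freed ...          */
	int b = open("/dev/null", O_WRONLY);	/* ... and reused at once  */
	int reused = (a == b);
	if (b >= 0)
		close(b);
	return reused;
}
```

In a threaded program the second open() can come from an unrelated thread between another thread's close() and its next write(), which is exactly how data ends up "written to hell knows what".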
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Fri, Oct 30, 2015 at 02:50:46PM -0700, Linus Torvalds wrote:
> Anyway. This is a pretty simple patch, and I actually think that we
> could just get rid of the "next_fd" logic entirely with this. That
> would make this *patch* be more complicated, but it would make the
> resulting *code* be simpler.

Dropping next_fd would screw you in case of strictly sequential allocations...

Your point re power-of-two allocations is well-taken, but then I'm not
sure that kzalloc() is good enough here. Look: you have a bit for every
64 descriptors, i.e. a byte per 512. In the 10M case Eric had been
talking about, that'll yield 32Kb worth of your secondary bitmap. It's
right on the edge of the range where vmalloc() becomes attractive; for
something bigger it gets even worse... Currently we go for vmalloc (on
bitmaps) once we are past 128K descriptors (two bitmaps packed together
=> 256Kbit = 32Kb). kmalloc() is very sensitive to size being a power
of two, but IIRC vmalloc() isn't...
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Thu, 2015-10-29 at 05:35 -0700, Eric Dumazet wrote:
> Current kernel :
>
> 64.98% [kernel]  [k] queued_spin_lock_slowpath
> 14.88% opensock  [.] memset   // this part simulates user land actual work ;)
> 11.15% [kernel]  [k] _find_next_bit.part.0
>  0.69% [kernel]  [k] _raw_spin_lock
>  0.46% [kernel]  [k] memset_erms
>  0.38% [kernel]  [k] sk_alloc
>  0.37% [kernel]  [k] kmem_cache_alloc
>  0.33% [kernel]  [k] get_empty_filp
>  0.31% [kernel]  [k] kmem_cache_free
>  0.26% [kernel]  [k] __alloc_fd
>  0.26% opensock  [.] child_function
>  0.18% [kernel]  [k] inode_init_always
>  0.17% opensock  [.] __random_r

With the attached prototype patch we get this profile instead. You can
see we no longer hit the spinlock issue and the cache waste in
find_next_bit. Userland can really progress _much_ faster.

76.86% opensock  [.] memset
 1.31% [kernel]  [k] _raw_spin_lock
 1.15% assd      [.] 0x0056f32c
 1.08% [kernel]  [k] kmem_cache_free
 0.97% [kernel]  [k] kmem_cache_alloc
 0.83% [kernel]  [k] sk_alloc
 0.72% [kernel]  [k] memset_erms
 0.70% opensock  [.] child_function
 0.67% [kernel]  [k] get_empty_filp
 0.65% [kernel]  [k] __alloc_fd
 0.58% [kernel]  [k] __close_fd
 0.49% [kernel]  [k] queued_spin_lock_slowpath

diff --git a/fs/file.c b/fs/file.c
index 6c672ad329e9..eabb9a626259 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -22,6 +22,7 @@
 #include
 #include
 #include
+#include
 
 int sysctl_nr_open __read_mostly = 1024*1024;
 int sysctl_nr_open_min = BITS_PER_LONG;
@@ -471,6 +472,19 @@ int __alloc_fd(struct files_struct *files,
 	spin_lock(&files->file_lock);
 repeat:
 	fdt = files_fdtable(files);
+
+	if (unlikely(flags & O_FD_FASTALLOC)) {
+		u32 rnd, limit = min(end, fdt->max_fds);
+
+		/*
+		 * Note: do not bother with files->next_fd,
+		 * this is for POSIX lovers...
+		 */
+		rnd = ((u64)prandom_u32() * limit) >> 32;
+		fd = find_next_zero_bit(fdt->open_fds, limit, rnd);
+		if (fd < limit)
+			goto ok;
+	}
 	fd = start;
 	if (fd < files->next_fd)
 		fd = files->next_fd;
@@ -499,7 +513,7 @@ repeat:
 
 	if (start <= files->next_fd)
 		files->next_fd = fd + 1;
-
+ok:
 	__set_open_fd(fd, fdt);
 	if (flags & O_CLOEXEC)
 		__set_close_on_exec(fd, fdt);
diff --git a/include/linux/net.h b/include/linux/net.h
index 70ac5e28e6b7..3823d082af4c 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -76,6 +76,7 @@ enum sock_type {
 #ifndef SOCK_NONBLOCK
 #define SOCK_NONBLOCK	O_NONBLOCK
 #endif
+#define SOCK_FD_FASTALLOC O_FD_FASTALLOC
 
 #endif /* ARCH_HAS_SOCKET_TYPES */
 
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index e063effe0cc1..badd421dd9f4 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -88,6 +88,10 @@
 #define __O_TMPFILE	02000
 #endif
 
+#ifndef O_FD_FASTALLOC
+#define O_FD_FASTALLOC	0x4000
+#endif
+
 /* a horrid kludge trying to make sure that this will fail on old kernels */
 #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
 #define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)
diff --git a/net/socket.c b/net/socket.c
index 9963a0b53a64..6dde02b2eaf9 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1227,9 +1227,10 @@ SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol)
 	BUILD_BUG_ON((SOCK_MAX | SOCK_TYPE_MASK) != SOCK_TYPE_MASK);
 	BUILD_BUG_ON(SOCK_CLOEXEC & SOCK_TYPE_MASK);
 	BUILD_BUG_ON(SOCK_NONBLOCK & SOCK_TYPE_MASK);
+	BUILD_BUG_ON(SOCK_FD_FASTALLOC & SOCK_TYPE_MASK);
 
 	flags = type & ~SOCK_TYPE_MASK;
-	if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK))
+	if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK | SOCK_FD_FASTALLOC))
 		return -EINVAL;
 	type &= SOCK_TYPE_MASK;
@@ -1240,7 +1241,7 @@ SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol)
 	if (retval < 0)
 		goto out;
 
-	retval = sock_map_fd(sock, flags & (O_CLOEXEC | O_NONBLOCK));
+	retval = sock_map_fd(sock, flags & (O_CLOEXEC | O_NONBLOCK | O_FD_FASTALLOC));
 	if (retval < 0)
 		goto
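The `rnd = ((u64)prandom_u32() * limit) >> 32` line in the patch above is a multiply-shift range reduction: it maps a uniform 32-bit value into [0, limit) without a division or modulo, which is what makes a cheap random starting point practical on the allocation fast path. A userspace sketch of just that expression (the function name is ours, not the patch's):

```c
#include <stdint.h>

/* Map a uniform 32-bit random value onto [0, limit) with a multiply
 * and a shift instead of the usual (slower) modulo. Since rnd * limit
 * is at most (2^32 - 1) * limit, the top 32 bits of the 64-bit
 * product always fall in [0, limit). */
static uint32_t scale_to_limit(uint32_t rnd, uint32_t limit)
{
	return (uint32_t)(((uint64_t)rnd * limit) >> 32);
}
```

The mapping is monotonic, so rnd = 0 lands on 0 and rnd = 0xffffffff lands on limit - 1; the slight non-uniformity when limit does not divide 2^32 is irrelevant for picking a search start point.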
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Thu, 2015-10-29 at 04:16 +, Al Viro wrote:
> Have you tried to experiment with that in userland?  I mean, emulate that
> thing in normal userland code, count the cacheline accesses and drive it
> with the use patterns collected from actual applications.

Sure.

> I can sit down and play with math expectations, but I suspect that it's
> easier to experiment.  It's nothing but an intuition (I hadn't seriously
> done probability theory in quite a while, and my mathematical tastes run
> more to geometry and topology anyway), but...  I would expect it to degrade
> badly when the bitmap is reasonably dense.
>
> Note, BTW, that vmalloc'ed memory gets populated as you read it, and it's
> not cheap - it's done via #PF triggered in kernel mode, with handler
> noticing that the faulting address is in vmalloc range and doing the
> right thing.  IOW, if your bitmap is very sparse, the price of page faults
> needs to be taken into account.

This vmalloc PF is pure noise. This only matters for the very first
allocations. We target programs opening zillions of fd in their lifetime ;)

Not having to expand a 4,000,000 slots fd array while fully loaded also
removes a latency spike that is very often not desirable.

> AFAICS, the only benefit of that thing is keeping dirtied cachelines far
> from each other.  Which might be a win overall, but I'm not convinced that
> the rest won't offset the effect of that...

Well, I already tested the O_FD_FASTALLOC thing, and I can tell you
find_next_zero_bit() is nowhere to be found in kernel profiles anymore. It
also lowers the time we hold the fd array spinlock while doing fd alloc.

User land test program I wrote a few months back.

Current kernel :

    64.98%  [kernel]  [k] queued_spin_lock_slowpath
    14.88%  opensock  [.] memset          // this part simulates user land actual work ;)
    11.15%  [kernel]  [k] _find_next_bit.part.0
     0.69%  [kernel]  [k] _raw_spin_lock
     0.46%  [kernel]  [k] memset_erms
     0.38%  [kernel]  [k] sk_alloc
     0.37%  [kernel]  [k] kmem_cache_alloc
     0.33%  [kernel]  [k] get_empty_filp
     0.31%  [kernel]  [k] kmem_cache_free
     0.26%  [kernel]  [k] __alloc_fd
     0.26%  opensock  [.] child_function
     0.18%  [kernel]  [k] inode_init_always
     0.17%  opensock  [.] __random_r

/*
 * test for b/9072743 : fd scaling on gigantic process (with ~ 10,000,000 TCP sockets)
 * - Size fd arrays in kernel to avoid resizings that kill latencies.
 * - Then launch xx threads doing
 *    - populate the fd array of the process, opening 'max' files.
 *    - Loop : close(randomfd()), socket(AF_INET, SOCK_STREAM, 0);
 *
 * Usage : opensock [ -n fds_count ] [ -t threads_count] [-f]
 */
#include #include #include #include #include #include #include #include #include #include

unsigned int count;
int skflags;

#define NBTHREADS_MAX 4096
pthread_t tid[NBTHREADS_MAX];
int nbthreads;
int nbthreads_req = 24;
int stop_all;

#ifndef O_FD_FASTALLOC
#define O_FD_FASTALLOC 0x4000
#endif

#ifndef SOCK_FD_FASTALLOC
#define SOCK_FD_FASTALLOC O_FD_FASTALLOC
#endif

/* expand kernel fd array for optimal perf.
 * This could be done by doing a loop on dup(),
 * or can be done using dup2()
 */
int expand_fd_array(int max)
{
	int target, res;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd == -1) {
		perror("socket()");
		return -1;
	}
	for (;;) {
		count = max;
		target = count;
		if (skflags & SOCK_FD_FASTALLOC)
			target += count/10;
		res = dup2(fd, target);
		if (res != -1) {
			close(res);
			break;
		}
		max -= max/10;
	}
	printf("count=%u (check/increase ulimit -n)\n", count);
	return 0;
}

static char state[32] = {
	0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
	16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
};

/* each thread is using ~400 KB of data per unit of work */
#define WORKING_SET_SIZE 40

static void *child_function(void *arg)
{
	unsigned int max = count / nbthreads_req;
	struct random_data buf;
	unsigned int idx;
	int *tab;
	unsigned long iter = 0;
	unsigned long *work_set = malloc(WORKING_SET_SIZE);
	int i;

	if (!work_set)
		return NULL;
	tab = malloc(max * sizeof(int));
	if (!tab) {
		free(work_set);
		return NULL;
	}
	memset(tab, 255, max * sizeof(int));
	initstate_r(getpid(), state,
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On 29/10/2015 14:58, David Holland wrote:

> ISTM that the best way to do this is to post a signal to the thread so
> accept bails with EINTR, at which point it can check to see if it's
> supposed to be exiting.

Yes, you could use pthread_kill, but that would require keeping a list of
the tids of all the threads that were using the FD, and that really just
moves the problem elsewhere rather than fixing it.

> Otherwise it sounds like the call you're looking for is not close(2)
> but revoke(2). Last I remember Linux doesn't have revoke because
> there's no way to implement it that isn't a trainwreck.

close(2) as specified by POSIX works just fine on Solaris; if that was the
case everywhere then it wouldn't be an issue. And for cases where it is
necessary to keep the FD assigned because of races, the dup2(2) trick works
fine as well.

-- 
Alan Burlison

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Thu, Oct 29, 2015 at 03:18:40PM +, Alan Burlison wrote:
> On 29/10/2015 14:58, David Holland wrote:
> > ISTM that the best way to do this is to post a signal to the thread so
> > accept bails with EINTR, at which point it can check to see if it's
> > supposed to be exiting.
>
> Yes, you could use pthread_kill, but that would require keeping a list of
> the tids of all the threads that were using the FD, and that really just
> moves the problem elsewhere rather than fixing it.

Hardly; it moves the burden of doing stupid things to the application. If
as you said the goal is to shut down all threads cleanly, then it doesn't
need to keep track in detail anyway; it can just post SIGTERM to every
thread, or SIGUSR1 if SIGTERM is bad for some reason, or whatever.

> > Otherwise it sounds like the call you're looking for is not close(2)
> > but revoke(2). Last I remember Linux doesn't have revoke because
> > there's no way to implement it that isn't a trainwreck.
>
> close(2) as per specified by POSIX works just fine on Solaris, if that was
> the case everywhere then it wouldn't be an issue. And for cases where it is
> necessary to keep the FD assigned because of races, the dup2(2) trick works
> fine as well.

close(2) as specified by POSIX doesn't prohibit this weird revoke-like
behavior, but there's nothing in there that mandates it either. (I thought
this discussion had already clarified that.)

Note that while NetBSD apparently supports this behavior because someone
copied it from Solaris, I'm about to go recommend it be removed.

-- 
David A. Holland
dholl...@netbsd.org
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Tue, Oct 27, 2015 at 10:52:46AM +, Alan Burlison wrote:
> > But in general, this is basically a problem with the application: the file
> > descriptor space is shared between threads and having one thread sniping
> > at open files, you do have a problem and whatever the kernel does in that
> > case perhaps doesn't matter all that much: the application needs to be
> > fixed anyway.
>
> The scenario in Hadoop is that the FD is being used by a thread that's
> waiting in accept and another thread wants to shut it down, e.g. because
> the application is terminating and needs to stop all threads cleanly.

ISTM that the best way to do this is to post a signal to the thread so
accept bails with EINTR, at which point it can check to see if it's
supposed to be exiting.

Otherwise it sounds like the call you're looking for is not close(2) but
revoke(2). Last I remember Linux doesn't have revoke because there's no
way to implement it that isn't a trainwreck.

-- 
David A. Holland
dholl...@netbsd.org
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On 29/10/2015 16:01, David Holland wrote:

> Hardly; it moves the burden of doing stupid things to the application.
> If as you said the goal is to shut down all threads cleanly, then it
> doesn't need to keep track in detail anyway; it can just post SIGTERM to
> every thread, or SIGUSR1 if SIGTERM is bad for some reason, or whatever.

I agree that the root issue is poor application design, but posting a
signal to every thread is not a solution if you only want to shut down a
subset of threads.

> close(2) as specified by POSIX doesn't prohibit this weird revoke-like
> behavior, but there's nothing in there that mandates it either. (I
> thought this discussion had already clarified that.)

There was an attempt to interpret POSIX that way, with which I still
disagree. If a FD is closed or reassigned then any current pending
operations on it should be terminated.

> Note that while NetBSD apparently supports this behavior because someone
> copied it from Solaris, I'm about to go recommend it be removed.

Which behaviour? The abort accept() on close() behaviour?

-- 
Alan Burlison
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
[off list]

On 29/10/2015 17:07, Al Viro wrote:

> Could the esteemed sir possibly be ars^H^H^Hprevailed upon to quote the
> exact place in POSIX that requires such behaviour?

If that's the way the conversation is going to go, sorry, no.

-- 
Alan Burlison
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Thu, Oct 29, 2015 at 04:15:33PM +, Alan Burlison wrote:
> There was an attempt to interpret POSIX that way, with which I still
> disagree. If a FD is closed or reassigned then any current pending
> operations on it should be terminated.

Could the esteemed sir possibly be ars^H^H^Hprevailed upon to quote the
exact place in POSIX that requires such behaviour?  This is getting
ridiculous - if we are talking about POSIX-mandated behaviour of close(),
please show where is it mandated.

Using close(2) on a descriptor that might be used by other threads is a
bloody bad design in userland code - I think everyone in this thread agrees
on that.  Making that a recommended way to do _anything_ is nuts.  Now, no
userland code, however lousy it might be, should be able to screw the
system.  But that isn't the issue - our variant is providing that just
fine.

BTW, "cancel accept(2) because sending a signal is hard" is bogus anyway -
a thread in accept(3) just about to enter the kernel would get buggered if
another thread closes that descriptor and the third one does socket(2).
_IF_ you are doing that kind of "close a descriptor under other threads"
thing, you need to inform the potentially affected threads anyway, and
you'd better not rely on them being currently in kernel mode.
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Thu, Oct 29, 2015 at 04:15:33PM +, Alan Burlison wrote:
> > close(2) as specified by POSIX doesn't prohibit this weird revoke-like
> > behavior, but there's nothing in there that mandates it either. (I
> > thought this discussion had already clarified that.)
>
> There was an attempt to interpret POSIX that way, with which I still
> disagree. If a FD is closed or reassigned then any current pending
> operations on it should be terminated.

C, please.

> > Note that while NetBSD apparently supports this behavior because
> > someone copied it from Solaris, I'm about to go recommend it be
> > removed.
>
> Which behaviour? The abort accept() on close() behaviour?

That, and aborting anything else too. Close isn't revoke.

-- 
David A. Holland
dholl...@netbsd.org
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
From: Alan Burlison
Date: Thu, 29 Oct 2015 17:12:44 +

> On 29/10/2015 17:07, Al Viro wrote:
>
> > Could the esteemed sir possibly be ars^H^H^Hprevailed upon to quote
> > the exact place in POSIX that requires such behaviour?
>
> If that's the way the conversation is going to go, sorry, no.

I find Al's request frankly quite reasonable, as is the frustration
expressed in his tone.

Furthermore, NetBSD's intention to try and get rid of the close() on
accept() behavior shows that the Linux developers are not in the minority
in being against requiring this behavior, or even finding it desirable in
any way.
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
From: Al Viro
Date: Thu, 29 Oct 2015 17:07:48 +

> _IF_ you are doing that kind of "close a descriptor under other threads"
> thing, you need to inform the potentially affected threads anyway, and
> you'd better not rely on them being currently in kernel mode.

+1
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Wed, 2015-10-28 at 12:35 +, Al Viro wrote: > [Linus and Dave added, Solaris and NetBSD folks dropped from Cc] > > On Tue, Oct 27, 2015 at 05:13:56PM -0700, Eric Dumazet wrote: > > On Tue, 2015-10-27 at 23:17 +, Al Viro wrote: > > > > > * [Linux-specific aside] our __alloc_fd() can degrade quite badly > > > with some use patterns. The cacheline pingpong in the bitmap is probably > > > inevitable, unless we accept considerably heavier memory footprint, > > > but we also have a case when alloc_fd() takes O(n) and it's _not_ hard > > > to trigger - close(3);open(...); will have the next open() after that > > > scanning the entire in-use bitmap. I think I see a way to improve it > > > without slowing the normal case down, but I'll need to experiment a > > > bit before I post patches. Anybody with examples of real-world loads > > > that make our descriptor allocator to degrade is very welcome to post > > > the reproducers... > > > > Well, I do have real-world loads, but quite hard to setup in a lab :( > > > > Note that we also hit the 'struct cred'->usage refcount for every > > open()/close()/sock_alloc(), and simply moving uid/gid out of the first > > cache line really helps, as current_fsuid() and current_fsgid() no > > longer forces a pingpong. > > > > I moved seldom used fields on the first cache line, so that overall > > memory usage did not change (192 bytes on 64 bit arches) > > [snip] > > Makes sense, but there's a funny thing about that refcount - the part > coming from ->f_cred is the most frequently changed *and* almost all > places using ->f_cred are just looking at its fields and do not manipulate > its refcount. The only exception (do_process_acct()) is easy to eliminate > just by storing a separate reference to the current creds of acct(2) caller > and using it instead of looking at ->f_cred. 
> What's more, the place where we grab what will be ->f_cred is guaranteed
> to have a non-f_cred reference *and* most of the time such a reference is
> there for dropping ->f_cred (in file_free()/file_free_rcu()).
>
> With that change in kernel/acct.c done, we could do the following:
>	a) split the cred refcount into the normal and percpu parts and
> add a spinlock in there.
>	b) have put_cred() do this:
>	if (atomic_dec_and_test(&cred->usage)) {
>		this_cpu_add(&cred->f_cred_usage, 1);
>		call_rcu(&cred->rcu, put_f_cred_rcu);
>	}
>	c) have get_empty_filp() increment current_cred ->f_cred_usage with
> this_cpu_add()
>	d) have file_free() do
>	percpu_counter_dec(&nr_files);
>	rcu_read_lock();
>	if (likely(atomic_read(&f->f_cred->usage))) {
>		this_cpu_add(&f->f_cred->f_cred_usage, -1);
>		rcu_read_unlock();
>		call_rcu(&f->f_u.fu_rcuhead, file_free_rcu_light);
>	} else {
>		rcu_read_unlock();
>		call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
>	}
> file_free_rcu() being
> static void file_free_rcu(struct rcu_head *head)
> {
>	struct file *f = container_of(head, struct file, f_u.fu_rcuhead);
>	put_f_cred(&f->f_cred->rcu);
>	kmem_cache_free(filp_cachep, f);
> }
> and file_free_rcu_light() - the same sans put_f_cred();
>
> with put_f_cred() doing
>	spin_lock cred->lock
>	this_cpu_add(&cred->f_cred_usage, -1);
>	find the sum of cred->f_cred_usage
>	spin_unlock cred->lock
>	if the sum has not reached 0
>		return
>	current put_cred_rcu(cred)
>
> IOW, let's try to get rid of cross-cpu stores in ->f_cred grabbing and
> (most of) ->f_cred dropping.
>
> Note that there are two paths leading to put_f_cred() in the above - via
> call_rcu() on &cred->rcu and from file_free_rcu() called via call_rcu() on
> &f->f_u.fu_rcuhead.  Both are RCU-delayed and they can happen in parallel -
> different rcu_head are used.
>
> atomic_read() check in file_free() might give false positives if it comes
> just before put_cred() on another CPU kills the last non-f_cred reference.
> It's not a problem, since put_f_cred() from that put_cred() won't be
> executed until we drop rcu_read_lock(), so we can safely decrement the
> cred->f_cred_usage without cred->lock here (and we are guaranteed that we
> won't be dropping the last of that - the same put_cred() would've
> incremented ->f_cred_usage).
>
> Does anybody see problems with that approach?  I'm going to grab some
> sleep (only a couple of hours so far tonight ;-/), will cook an
> incremental to Eric's field-reordering patch when I get up...

Before I take a deep look at your suggestion, are you sure plain use of
include/linux/percpu-refcount.h infra is not possible for struct cred ?

Thanks !
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
[Linus and Dave added, Solaris and NetBSD folks dropped from Cc] On Tue, Oct 27, 2015 at 05:13:56PM -0700, Eric Dumazet wrote: > On Tue, 2015-10-27 at 23:17 +, Al Viro wrote: > > > * [Linux-specific aside] our __alloc_fd() can degrade quite badly > > with some use patterns. The cacheline pingpong in the bitmap is probably > > inevitable, unless we accept considerably heavier memory footprint, > > but we also have a case when alloc_fd() takes O(n) and it's _not_ hard > > to trigger - close(3);open(...); will have the next open() after that > > scanning the entire in-use bitmap. I think I see a way to improve it > > without slowing the normal case down, but I'll need to experiment a > > bit before I post patches. Anybody with examples of real-world loads > > that make our descriptor allocator to degrade is very welcome to post > > the reproducers... > > Well, I do have real-world loads, but quite hard to setup in a lab :( > > Note that we also hit the 'struct cred'->usage refcount for every > open()/close()/sock_alloc(), and simply moving uid/gid out of the first > cache line really helps, as current_fsuid() and current_fsgid() no > longer forces a pingpong. > > I moved seldom used fields on the first cache line, so that overall > memory usage did not change (192 bytes on 64 bit arches) [snip] Makes sense, but there's a funny thing about that refcount - the part coming from ->f_cred is the most frequently changed *and* almost all places using ->f_cred are just looking at its fields and do not manipulate its refcount. The only exception (do_process_acct()) is easy to eliminate just by storing a separate reference to the current creds of acct(2) caller and using it instead of looking at ->f_cred. What's more, the place where we grab what will be ->f_cred is guaranteed to have a non-f_cred reference *and* most of the time such a reference is there for dropping ->f_cred (in file_free()/file_free_rcu()). 
With that change in kernel/acct.c done, we could do the following:
	a) split the cred refcount into the normal and percpu parts and
add a spinlock in there.
	b) have put_cred() do this:
	if (atomic_dec_and_test(&cred->usage)) {
		this_cpu_add(&cred->f_cred_usage, 1);
		call_rcu(&cred->rcu, put_f_cred_rcu);
	}
	c) have get_empty_filp() increment current_cred ->f_cred_usage with
this_cpu_add()
	d) have file_free() do
	percpu_counter_dec(&nr_files);
	rcu_read_lock();
	if (likely(atomic_read(&f->f_cred->usage))) {
		this_cpu_add(&f->f_cred->f_cred_usage, -1);
		rcu_read_unlock();
		call_rcu(&f->f_u.fu_rcuhead, file_free_rcu_light);
	} else {
		rcu_read_unlock();
		call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
	}
file_free_rcu() being
static void file_free_rcu(struct rcu_head *head)
{
	struct file *f = container_of(head, struct file, f_u.fu_rcuhead);
	put_f_cred(&f->f_cred->rcu);
	kmem_cache_free(filp_cachep, f);
}
and file_free_rcu_light() - the same sans put_f_cred();

with put_f_cred() doing
	spin_lock cred->lock
	this_cpu_add(&cred->f_cred_usage, -1);
	find the sum of cred->f_cred_usage
	spin_unlock cred->lock
	if the sum has not reached 0
		return
	current put_cred_rcu(cred)

IOW, let's try to get rid of cross-cpu stores in ->f_cred grabbing and
(most of) ->f_cred dropping.

Note that there are two paths leading to put_f_cred() in the above - via
call_rcu() on &cred->rcu and from file_free_rcu() called via call_rcu() on
&f->f_u.fu_rcuhead.  Both are RCU-delayed and they can happen in parallel -
different rcu_head are used.

atomic_read() check in file_free() might give false positives if it comes
just before put_cred() on another CPU kills the last non-f_cred reference.
It's not a problem, since put_f_cred() from that put_cred() won't be
executed until we drop rcu_read_lock(), so we can safely decrement the
cred->f_cred_usage without cred->lock here (and we are guaranteed that we
won't be dropping the last of that - the same put_cred() would've
incremented ->f_cred_usage).

Does anybody see problems with that approach?  I'm going to grab some
sleep (only a couple of hours so far tonight ;-/), will cook an
incremental to Eric's field-reordering patch when I get up...
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Wed, 2015-10-28 at 06:24 -0700, Eric Dumazet wrote:

> Before I take a deep look at your suggestion, are you sure plain use of
> include/linux/percpu-refcount.h infra is not possible for struct cred ?

BTW, I am not convinced we need to spend so much energy and per-cpu
memory for the struct cred refcount.

The big problem is the fd array spinlock of course, and the bitmap search
for POSIX compliance.

The cache line thrashing in struct cred is a minor one ;)
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On 27/10/2015 23:17, Al Viro wrote:

> Frankly, as far as I'm concerned, the bottom line is
> 	* there are two variants of semantics in that area and there's not
> much that could be done about that.

Yes, that seems to be the case.

> 	* POSIX is vague enough for both variants to comply with it (it's
> also very badly written in the area in question).

On that aspect I disagree, the POSIX semantics seem clear to me, and are
different to the Linux behaviour.

> 	* I don't see any way to implement something similar to Solaris
> behaviour without a huge increase of memory footprint or massive
> cacheline pingpong.  Solaris appears to go for memory footprint from
> hell - cacheline per descriptor (instead of a pointer per descriptor).

Yes, that does seem to be the case. Thanks for the detailed explanation
you've provided as to why that's so.

> 	* the benefits of Solaris-style behaviour are not obvious - all
> things equal it would be interesting, but the things are very much not
> equal.  What's more, if your userland code is such that accept() argument
> could be closed by another thread, the caller *cannot* do anything with
> said argument after accept() returns, no matter which variant of
> semantics is used.

Yes, irrespective of how you terminate the accept, once it returns with an
error it's unsafe to use the FD, with the exception of failures such as
EAGAIN, EINTR etc. However the shutdown() behaviour of Linux is not POSIX
compliant, and allowing an accept to continue on a FD that's been closed
doesn't seem correct either.

> 	* [Linux-specific aside] our __alloc_fd() can degrade quite badly
> with some use patterns.  The cacheline pingpong in the bitmap is probably
> inevitable, unless we accept considerably heavier memory footprint,
> but we also have a case when alloc_fd() takes O(n) and it's _not_ hard
> to trigger - close(3);open(...); will have the next open() after that
> scanning the entire in-use bitmap.  I think I see a way to improve it
> without slowing the normal case down, but I'll need to experiment a
> bit before I post patches.  Anybody with examples of real-world loads
> that make our descriptor allocator degrade is very welcome to post
> the reproducers...

It looks like the remaining discussion is going to be about Linux
implementation details so I'll bow out at this point. Thanks again for all
the helpful explanation.

-- 
Alan Burlison
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Thu, 2015-10-29 at 00:15 +, Al Viro wrote:
> On Wed, Oct 28, 2015 at 04:08:29PM -0700, Eric Dumazet wrote:
> > > > Except for legacy stuff and stdin/stdout/stderr games, I really
> > > > doubt lot of applications absolutely rely on the POSIX thing...
> > >
> > > We obviously can't turn that into default behaviour, though.  BTW,
> > > what distribution do you have in mind for those random descriptors?
> > > Uniform on [0,INT_MAX] is a bad idea for obvious reasons - you'll
> > > blow the memory footprint pretty soon...
> >
> > Simply [0 , fdt->max_fds] is working well in most cases.
>
> Umm...  So first you dup2() to establish the ->max_fds you want, then
> do such opens?

Yes, dup2() is done at program startup, knowing the expected max load (in
terms of concurrent fds) + ~10 % (the actual fd array size can be more than
this because of the power-of-two rounding in alloc_fdtable()).

But this is an optimization : if you do not use the initial dup2(), the fd
array can be automatically expanded if needed (all slots are in use).

> What used/unused ratio do you expect to deal with?
> And what kind of locking are you going to use?  Keep in mind that
> e.g. dup2() is dependent on the lack of allocations while it's working,
> so it's not as simple as "we don't need no stinkin' ->files_lock"...

No locking change. files->file_lock is still taken.

We only want to minimize the time needed to find an empty slot. The trick
is to not start the bitmap search at files->next_fd, but at a random
point. This is a win if we assume there are enough holes.

	low = start;
	if (low < files->next_fd)
		low = files->next_fd;

	res = -1;
	if (flags & O_FD_FASTALLOC) {
		random_point = pick_random_between(low, fdt->max_fds);

		res = find_next_zero_bit(fdt->open_fds, fdt->max_fds,
					 random_point);
		/* No empty slot found, try the other range */
		if (res >= fdt->max_fds) {
			res = find_next_zero_bit(fdt->open_fds,
						 low, random_point);
			if (res >= random_point)
				res = -1;
		}
	}
	...
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Wed, Oct 28, 2015 at 08:29:41PM -0700, Eric Dumazet wrote:
> But this is an optimization : If you do not use the initial dup2(), the
> fd array can be automatically expanded if needed (all slots are in use)

Whee...

> No locking change. files->file_lock is still taken.
>
> We only want to minimize time to find an empty slot.

Then I'd say that my variant is going to win.  It *will* lead to cacheline
pingpong in more cases than yours, but I'm quite sure that it will be a win
as far as the total amount of cachelines accessed.

> The trick is to not start bitmap search at files->next_fd, but a random
> point. This is a win if we assume there are enough holes.
>
> 	low = start;
> 	if (low < files->next_fd)
> 		low = files->next_fd;
>
> 	res = -1;
> 	if (flags & O_FD_FASTALLOC) {
> 		random_point = pick_random_between(low, fdt->max_fds);
>
> 		res = find_next_zero_bit(fdt->open_fds, fdt->max_fds,
> 					 random_point);
> 		/* No empty slot found, try the other range */
> 		if (res >= fdt->max_fds) {
> 			res = find_next_zero_bit(fdt->open_fds,
> 						 low, random_point);
> 			if (res >= random_point)
> 				res = -1;
> 		}
> 	}

Have you tried to experiment with that in userland?  I mean, emulate that
thing in normal userland code, count the cacheline accesses and drive it
with the use patterns collected from actual applications.

I can sit down and play with math expectations, but I suspect that it's
easier to experiment.  It's nothing but an intuition (I hadn't seriously
done probability theory in quite a while, and my mathematical tastes run
more to geometry and topology anyway), but...  I would expect it to degrade
badly when the bitmap is reasonably dense.

Note, BTW, that vmalloc'ed memory gets populated as you read it, and it's
not cheap - it's done via #PF triggered in kernel mode, with the handler
noticing that the faulting address is in vmalloc range and doing the
right thing.  IOW, if your bitmap is very sparse, the price of page faults
needs to be taken into account.

AFAICS, the only benefit of that thing is keeping dirtied cachelines far
from each other.  Which might be a win overall, but I'm not convinced that
the rest won't offset the effect of that...
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Wed, 2015-10-28 at 21:13 +, Al Viro wrote:
> On Wed, Oct 28, 2015 at 07:47:57AM -0700, Eric Dumazet wrote:
> > On Wed, 2015-10-28 at 06:24 -0700, Eric Dumazet wrote:
> >
> > > Before I take a deep look at your suggestion, are you sure plain use
> > > of include/linux/percpu-refcount.h infra is not possible for struct
> > > cred ?
> >
> > BTW, I am not convinced we need to spend so much energy and per-cpu
> > memory for struct cred refcount.
> >
> > The big problem is fd array spinlock of course and bitmap search for
> > POSIX compliance.
> >
> > The cache line trashing in struct cred is a minor one ;)
>
> percpu-refcount isn't convenient - the only such candidate for ref_kill
> in there is "all other references are gone", and that can happen in
> interesting locking environments.  I doubt that it would be a good fit,
> TBH...

OK then ...

> Cacheline pingpong on the descriptors bitmap is probably inevitable, but
> it's not the only problem in the existing implementation - close a small
> descriptor when you've got a lot of them and look for the second open
> after that.  _That_ can lead to thousands of cachelines being read
> through, all under the table spinlock.  It's literally orders of
> magnitude worse.  And if the first open after that close happens to be
> for a short-living descriptor, you'll get the same situation back in
> your face as soon as you close it.
>
> I think we can seriously improve that without screwing the fast path by
> adding "summary" bitmaps once the primary grows past the cacheline worth
> of bits.  With bits in the summary bitmap corresponding to cacheline-sized
> chunks of the primary, being set iff all bits in the corresponding chunk
> are set.  If the summary map grows larger than one cacheline, add the
> second-order one (that happens at quarter million descriptors and serves
> until 128 million; adding the third-order map is probably worthless).
>
> I want to maintain the same kind of "everything below this is known to be
> in use" thing as we do now.  Allocation would start with looking into the
> same place in the primary bitmap where we'd look now, with a similar
> search forward for a zero bit.  _However_, it would stop at the cacheline
> boundary.  If nothing had been found, we look in the corresponding place
> in the summary bitmap and search for a zero bit there.  Again, no more
> than up to the cacheline boundary.  If something is found, we've got a
> chunk in the primary known to contain a zero bit; if not - go to the
> second level and search there, etc.
>
> When a zero bit in the primary had been found, check if it's within the
> rlimit (passed to __alloc_fd() explicitly) and either bugger off or set
> that bit.  If there are zero bits left in the same word - we are done,
> otherwise check the still unread words in the cacheline and see if all
> of them are ~0UL.  If all of them are, set the bit in the summary bitmap,
> etc.
>
> Normal case is exactly the same as now - one cacheline accessed and
> modified.  We might end up touching more than that, but it's going to be
> rare and the cases when it happens are very likely to lead to a much
> worse amount of memory traffic with the current code.
>
> Freeing is done by zeroing the bit in the primary, checking for other
> zero bits nearby and buggering off if there are such.  If the entire
> cacheline used to be all-bits-set, clear the bit in the summary and, if
> there's a second-order summary, get the bit in there clear as well - it's
> probably not worth bothering with checking that all of the cacheline in
> the summary bitmap had been all-bits-set.  Again, the normal case is the
> same as now.
>
> It'll need profiling and tuning, but AFAICS it's doable without making
> things worse than they are now, and it should get rid of those O(N)
> fetches under spinlock cases.  And yes, those are triggerable and visible
> in profiles.  IMO it's worth trying to fix...
Well, all this complexity goes away with an O_FD_FASTALLOC / SOCK_FD_FASTALLOC bit in various fd allocations, which specifically tells the kernel we do not care about getting the lowest possible fd as POSIX mandates. With this bit set, the bitmap search can start at a random point, and we find a slot in O(1): one cache line miss, if you have at least one free bit/slot per 512 bits (64-byte cache line).

#ifndef O_FD_FASTALLOC
#define O_FD_FASTALLOC 0x4000
#endif
#ifndef SOCK_FD_FASTALLOC
#define SOCK_FD_FASTALLOC O_FD_FASTALLOC
#endif
...
// active sockets
socket(AF_INET, SOCK_STREAM | SOCK_FD_FASTALLOC, 0);
...
// passive sockets
accept4(sockfd, ..., SOCK_FD_FASTALLOC);
...

Except for legacy stuff and stdin/stdout/stderr games, I really doubt lots of applications absolutely rely on the POSIX thing...
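Eric's random-start idea can be sketched in userland C. This is my illustration, not a kernel patch: the function name and the flat array-of-longs bitmap are assumptions made for the example. The point is that the scan starts at an arbitrary word and wraps, so concurrent allocators don't all contend on the lowest free bit:

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Find and claim a clear bit, scanning from start_word and wrapping
 * around. Returns the bit index, or -1 if the map is full. */
long find_free_slot(unsigned long *map, size_t nwords, size_t start_word)
{
    for (size_t i = 0; i < nwords; i++) {
        size_t w = (start_word + i) % nwords;
        if (map[w] != ~0UL) {
            /* lowest clear bit of this word */
            size_t b = (size_t)__builtin_ctzl(~map[w]);
            map[w] |= 1UL << b;   /* claim it */
            return (long)(w * BITS_PER_WORD + b);
        }
    }
    return -1;
}
```

With at least one free bit per word, the loop body runs once: one cache line touched, as in Eric's "one cache line miss" claim.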
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Wed, Oct 28, 2015 at 07:47:57AM -0700, Eric Dumazet wrote:
> On Wed, 2015-10-28 at 06:24 -0700, Eric Dumazet wrote:
>
> > Before I take a deep look at your suggestion, are you sure plain use of
> > include/linux/percpu-refcount.h infra is not possible for struct cred ?
>
> BTW, I am not convinced we need to spend so much energy and per-cpu
> memory for struct cred refcount.
>
> The big problem is fd array spinlock of course and bitmap search for
> POSIX compliance.
>
> The cache line trashing in struct cred is a minor one ;)

percpu-refcount isn't convenient - the only such candidate for ref_kill in there is "all other references are gone", and that can happen in interesting locking environments. I doubt that it would be a good fit, TBH...

Cacheline pingpong on the descriptors bitmap is probably inevitable, but it's not the only problem in the existing implementation - close a small descriptor when you've got a lot of them and look for the second open after that. _That_ can lead to thousands of cachelines being read through, all under the table spinlock. It's literally orders of magnitude worse. And if the first open after that close happens to be for a short-living descriptor, you'll get the same situation back in your face as soon as you close it.

I think we can seriously improve that without screwing the fast path by adding "summary" bitmaps once the primary grows past the cacheline worth of bits. With bits in the summary bitmap corresponding to cacheline-sized chunks of the primary, being set iff all bits in the corresponding chunk are set. If the summary map grows larger than one cacheline, add the second-order one (that happens at quarter million descriptors and serves until 128 million; adding the third-order map is probably worthless). I want to maintain the same kind of "everything below this is known to be in use" thing as we do now.

Allocation would start with looking into the same place in primary bitmap where we'd looked now and similar search forward for zero bit. _However_, it would stop at cacheline boundary. If nothing had been found, we look in the corresponding place in the summary bitmap and search for zero bit there. Again, no more than up to the cacheline boundary. If something is found, we've got a chunk in the primary known to contain a zero bit; if not - go to the second-level and search there, etc.

When a zero bit in the primary had been found, check if it's within the rlimit (passed to __alloc_fd() explicitly) and either bugger off or set that bit. If there are zero bits left in the same word - we are done, otherwise check the still unread words in the cacheline and see if all of them are ~0UL. If all of them are, set the bit in summary bitmap, etc.

Normal case is exactly the same as now - one cacheline accessed and modified. We might end up touching more than that, but it's going to be rare and the cases when it happens are very likely to lead to much worse amount of memory traffic with the current code.

Freeing is done by zeroing the bit in primary, checking for other zero bits nearby and buggering off if there are such. If the entire cacheline used to be all-bits-set, clear the bit in summary and, if there's a second-order summary, get the bit in there clear as well - it's probably not worth bothering with checking that all the cacheline in summary bitmap had been all-bits-set. Again, the normal case is the same as now.

It'll need profiling and tuning, but AFAICS it's doable without making the things worse than they are now, and it should get rid of those O(N) fetches under spinlock cases. And yes, those are triggerable and visible in profiles. IMO it's worth trying to fix...
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Wed, Oct 28, 2015 at 02:44:28PM -0700, Eric Dumazet wrote:
> Well, all this complexity goes away with a O_FD_FASTALLOC /
> SOCK_FD_FASTALLOC bit in various fd allocations, which specifically
> tells the kernel we do not care getting the lowest possible fd as POSIX
> mandates.

... which won't do a damn thing for existing userland.

> Except for legacy stuff and stdin/stdout/stderr games, I really doubt
> lot of applications absolutely rely on the POSIX thing...

We obviously can't turn that into default behaviour, though. BTW, what distribution do you have in mind for those random descriptors? Uniform on [0,INT_MAX] is a bad idea for obvious reasons - you'll blow the memory footprint pretty soon...
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Wed, 2015-10-28 at 22:33 +, Al Viro wrote:
> On Wed, Oct 28, 2015 at 02:44:28PM -0700, Eric Dumazet wrote:
>
> > Well, all this complexity goes away with a O_FD_FASTALLOC /
> > SOCK_FD_FASTALLOC bit in various fd allocations, which specifically
> > tells the kernel we do not care getting the lowest possible fd as POSIX
> > mandates.
>
> ... which won't do a damn thing for existing userland.

For the userland that needs 5,000,000+ sockets, I can tell you they will be using this flag as soon as they are aware it exists ;)

> > Except for legacy stuff and stdin/stdout/stderr games, I really doubt
> > lot of applications absolutely rely on the POSIX thing...
>
> We obviously can't turn that into default behaviour, though. BTW, what
> distribution do you have in mind for those random descriptors? Uniform
> on [0,INT_MAX] is a bad idea for obvious reasons - you'll blow the
> memory footprint pretty soon...

Simply [0, fdt->max_fds] works well in most cases.
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Wed, Oct 28, 2015 at 04:08:29PM -0700, Eric Dumazet wrote:
> > > Except for legacy stuff and stdin/stdout/stderr games, I really doubt
> > > lot of applications absolutely rely on the POSIX thing...
> >
> > We obviously can't turn that into default behaviour, though. BTW, what
> > distribution do you have in mind for those random descriptors? Uniform
> > on [0,INT_MAX] is a bad idea for obvious reasons - you'll blow the
> > memory footprint pretty soon...
>
> Simply [0 , fdt->max_fds] is working well in most cases.

Umm... So first you dup2() to establish the ->max_fds you want, then do such opens? What used/unused ratio do you expect to deal with? And what kind of locking are you going to use? Keep in mind that e.g. dup2() is dependent on the lack of allocations while it's working, so it's not as simple as "we don't need no stinkin' ->files_lock"...
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
>And no, I'm not fond of such irregular ways to pass file descriptors, but
>we can't kill ioctl(2) with all weirdness hiding behind it, more's the pity...

Yeah, there are a number of calls which are supposed to work on one fd but have a second argument which is also a file descriptor; mostly part of ioctl().

>> In those specific cases where a system call needs to convert a file
>> descriptor to a file pointer, there is only one routine which can be used.
>
>Obviously, but the problem is deadlock avoidance using it.

The Solaris algorithm is quite different, and as such there is no chance of a deadlock using that function (there are actually a bunch of such functions).

>The memory footprint is really scary. Bitmaps are pretty much noise, but
>blowing it by factor of 8 on normal 64bit (or 16 on something like Itanic -
>or Venus for that matter, which is more relevant for you guys)

Fair enough. I think we have some systems with a larger cache line.

>Said that, what's the point of "close won't return until..."? After all,
>you can't guarantee that thread with cancelled syscall won't lose CPU
>immediately upon return to userland, so it *can't* make any assumptions
>about the descriptor not having been already reused. I don't get it - what
>does that buy for userland code?

Generally I wouldn't see that as a problem, but in the case of a socket blocking in accept() indefinitely I do see it as a problem, especially as the thread actually wants to stop listening.

But in general this is basically a problem with the application: the file descriptor space is shared between threads, and if one thread is sniping at open files you do have a problem. Whatever the kernel does in that case perhaps doesn't matter all that much: the application needs to be fixed anyway.

Casper
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On 27/10/2015 09:08, casper@oracle.com wrote:
> Generally I wouldn't see that as a problem, but in the case of a socket
> blocking in accept() indefinitely I do see it as a problem, especially as
> the thread actually wants to stop listening.
>
> But in general this is basically a problem with the application: the
> file descriptor space is shared between threads, and if one thread is
> sniping at open files you do have a problem. Whatever the kernel does in
> that case perhaps doesn't matter all that much: the application needs to
> be fixed anyway.

The scenario in Hadoop is that the FD is being used by a thread that's waiting in accept() and another thread wants to shut it down, e.g. because the application is terminating and needs to stop all threads cleanly. I agree the use of shutdown()+close() on Linux or dup2() on Solaris is pretty much an application-level hack - the concern in both cases is that the file descriptor that's being used in the accept() might be recycled by another thread.

However, that just raises the question of why the FD isn't properly encapsulated by the application in a singleton object, with the required shutdown semantics provided by a mechanism for invalidating the singleton and its contained FD. There are other mechanisms that could be used to do a clean shutdown that don't require the OS to provide workarounds for arguably broken application behaviour, for example setting a 'shutdown' flag in the object and then doing a dummy connect() to the accepting FD to kick it off the accept(), thereby getting it to re-check the 'shutdown' flag and not re-enter the accept().

If the object encapsulating a FD is invalidated, and that prevents the FD being used any more because the only access is via that object, then it simply doesn't matter if the FD is reused elsewhere: there can be no race, so a complicated, platform-dependent dance isn't needed.
Unfortunately Hadoop isn't the only thing that pulls the shutdown() trick, so I don't think there's a simple fix for this, as discussed earlier in the thread. Having said that, if close() on Linux also did an implicit shutdown() it would mean that well-written applications that handled the scoping, sharing and reuse of FDs properly could just call close() and have it work the same way across *NIX platforms.

-- 
Alan Burlison
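Alan's "shutdown flag plus dummy connect()" scheme can be sketched in plain C. This is a hypothetical illustration, not Hadoop's code: a single process stands in for the two threads, and all names are mine. It shows the mechanism the scheme relies on - a loopback connect() is enough to make a pending accept() return, so the accept loop can re-check the flag:

```c
#include <netinet/in.h>
#include <stdbool.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static volatile bool shutting_down;   /* the 'shutdown' flag in the object */

/* One pass of the accept loop: returns true to keep serving,
 * false once the shutdown flag has been observed. */
static bool serve_one(int listener)
{
    int conn = accept(listener, NULL, NULL);  /* dummy connect() lands here */
    if (conn >= 0)
        close(conn);
    return !shutting_down;     /* re-check the flag after every wakeup */
}

/* Create a loopback listener on an ephemeral port; fills in its address. */
static int make_listener(struct sockaddr_in *addr)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    memset(addr, 0, sizeof(*addr));
    addr->sin_family = AF_INET;
    addr->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr->sin_port = 0;                              /* ephemeral port */
    bind(fd, (struct sockaddr *)addr, sizeof(*addr));
    listen(fd, 1);
    socklen_t len = sizeof(*addr);
    getsockname(fd, (struct sockaddr *)addr, &len);  /* learn the port */
    return fd;
}
```

The terminating thread sets `shutting_down` and then connects to the listener; because access to the FD is only ever through the object holding the flag, no FD-recycling race arises.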
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
From: Alan Burlison
Date: Tue, 27 Oct 2015 10:52:46 +

> an implicit shutdown() it would mean that well-written applications
> that handled the scoping, sharing and reuse of FDs properly could just
> call close() and have it work the same way across *NIX platforms.

This semantic would only exist after Linux version X.Y.Z and vendor kernels that decided to backport the feature. Ergo, this application would ironically be non-portable on Linux machines.

If portable Linux applications have to handle the situation using existing facilities, there is absolutely zero value in adding it now: it will only add more complexity to applications handling things correctly, because they will always have two cases to somehow conditionally handle under Linux.

And if the intention is to just always assume the close() semantic thing is there, then you have given me a disincentive to ever add the facility.
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On 27/10/2015 12:01, Eric Dumazet wrote:
> Are non multi threaded applications considered well written ?
>
> listener = socket(...);
> bind(listener, ...);
> listen(listener, 1);
> Loop 1..10
>     if (fork() == 0)
>         do_accept(listener)
>
> Now if a child does a close(listener), or is killed, you propose that it
> does an implicit shutdown() and all other children no longer can accept() ?

No, of course not. I made it quite clear I was talking about MT programs.

> Surely you did not give all the details on how it really works.

In the case of Hadoop, it works the way I describe.

-- 
Alan Burlison
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Tue, 2015-10-27 at 12:27 +, Alan Burlison wrote:
> On 27/10/2015 12:01, Eric Dumazet wrote:
>
> > Are non multi threaded applications considered well written ?
> >
> > listener = socket(...);
> > bind(listener, ...);
> > listen(listener, 1);
> > Loop 1..10
> >     if (fork() == 0)
> >         do_accept(listener)
> >
> > Now if a child does a close(listener), or is killed, you propose that it
> > does an implicit shutdown() and all other children no longer can
> > accept() ?
>
> No, of course not. I made it quite clear I was talking about MT programs.

Nothing is clear. Sorry.

Now what about programs using both the fork() model and MT, to get one MT process per NUMA node?
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Tue, 2015-10-27 at 10:52 +, Alan Burlison wrote:
> Unfortunately Hadoop isn't the only thing that pulls the shutdown()
> trick, so I don't think there's a simple fix for this, as discussed
> earlier in the thread. Having said that, if close() on Linux also did an
> implicit shutdown() it would mean that well-written applications that
> handled the scoping, sharing and reuse of FDs properly could just call
> close() and have it work the same way across *NIX platforms.

Are non multi threaded applications considered well written ?

listener = socket(...);
bind(listener, ...);
listen(listener, 1);
Loop 1..10
    if (fork() == 0)
        do_accept(listener)

Now if a child does a close(listener), or is killed, you propose that it does an implicit shutdown() and all other children no longer can accept() ?

Surely you did not give all the details on how it really works.
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On 27/10/2015 14:39, David Miller wrote:
> You will never be able to assume it is available everywhere under
> Linux. Ever. This is the fundamental issue that you seem to
> completely not understand.

If that was true in general then Linux would be dead and there would be no point ever adding any new features to it.

> You cannot just assume 5 years from now or whatever that the close()
> thing is there even if I added it to the tree right now.

No, a configure-time feature probe would be needed, as is generally considered to be good practice.

> Your intent is to somewhere down the road assume this, and therefore
> distribute a broken piece of infrastructure that only works on some
> Linux systems. This is not acceptable.

Acceptable to whom? Unless you are claiming to speak for the authors of every piece of software that ever runs on Linux, you can't make that assertion. For example, Hadoop relies on Linux CGroups to provide features you'd want in production deployments, yet CGroups only appeared in Linux 2.6.24.

> The backwards compat code will need to be in your code forever. There
> is no way around it. That is, again, unless you want your code to not
> work on a non-trivial number of Linux systems out there.

That's simply not true. Both Linux and Solaris have dropped support for old features in the past. Yes, it takes a long time to do so, but it's perfectly possible.

> Making this worse is that there isn't going to be a straightforward
> nor reliable way to test for the presence of this at run time.

I attached a test case to the original bug that demonstrates the platform-specific behaviours; it would be easy enough to modify that to use as a configure-time feature probe.

> You _have_ a way to accomplish what you want to do today and it works
> on every Linux system on the planet. Given the constraints, and the
> fact that you're going to have to account for this situation somehow
> in your code forever, I see very little to no value in adding the
> close() thing.
I think your assessment is unduly pessimistic.

> So your cross-platform unified behavior goal is simply unobtainable. So
> please deal with reality rather than wishful impractical things.

If you don't think this is worth discussing because you personally don't believe cross-platform compatibility is worth bothering with, then fine, say so. But then you need to persuade everyone else with a stake in Linux that your viewpoint is the only valid one, and you will also have to persuade them that Linux should no longer concern itself with POSIX compliance. I know *I* couldn't do that for Solaris, and with all due respect, I very much doubt *you* can do so for Linux.

-- 
Alan Burlison
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
From: Alan Burlison
Date: Tue, 27 Oct 2015 14:39:56 +

> On 27/10/2015 14:39, David Miller wrote:
>
>> Making this worse is that there isn't going to be a straightforward
>> nor reliable way to test for the presence of this at run time.
>
> I attached a test case to the original bug that demonstrates the
> platform-specific behaviours, it would be easy enough to modify that
> to use as a configure-time feature probe.

I said "run time". Like when your final code runs on a target system. Not at build tree configure time.
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On 27/10/2015 15:04, David Miller wrote:
> I said "run time". Like when your final code runs on a target system.
> Not at build tree configure time.

Yes, you could do it then as well.

-- 
Alan Burlison
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Tue, 2015-10-27 at 23:17 +, Al Viro wrote: > * [Linux-specific aside] our __alloc_fd() can degrade quite badly > with some use patterns. The cacheline pingpong in the bitmap is probably > inevitable, unless we accept considerably heavier memory footprint, > but we also have a case when alloc_fd() takes O(n) and it's _not_ hard > to trigger - close(3);open(...); will have the next open() after that > scanning the entire in-use bitmap. I think I see a way to improve it > without slowing the normal case down, but I'll need to experiment a > bit before I post patches. Anybody with examples of real-world loads > that make our descriptor allocator to degrade is very welcome to post > the reproducers... Well, I do have real-world loads, but quite hard to setup in a lab :( Note that we also hit the 'struct cred'->usage refcount for every open()/close()/sock_alloc(), and simply moving uid/gid out of the first cache line really helps, as current_fsuid() and current_fsgid() no longer forces a pingpong. 
I moved seldom used fields on the first cache line, so that overall memory usage did not change (192 bytes on 64 bit arches)

diff --git a/include/linux/cred.h b/include/linux/cred.h
index 8d70e1361ecd..460efae83522 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -124,7 +124,17 @@ struct cred {
 #define CRED_MAGIC	0x43736564
 #define CRED_MAGIC_DEAD	0x44656144
 #endif
-	kuid_t		uid;		/* real UID of the task */
+	struct rcu_head	rcu;		/* RCU deletion hook */
+
+	kernel_cap_t	cap_inheritable; /* caps our children can inherit */
+	kernel_cap_t	cap_permitted;	/* caps we're permitted */
+	kernel_cap_t	cap_effective;	/* caps we can actually use */
+	kernel_cap_t	cap_bset;	/* capability bounding set */
+	kernel_cap_t	cap_ambient;	/* Ambient capability set */
+
+	kuid_t		uid ____cacheline_aligned_in_smp;
+					/* real UID of the task */
+
 	kgid_t		gid;		/* real GID of the task */
 	kuid_t		suid;		/* saved UID of the task */
 	kgid_t		sgid;		/* saved GID of the task */
@@ -133,11 +143,6 @@ struct cred {
 	kuid_t		fsuid;		/* UID for VFS ops */
 	kgid_t		fsgid;		/* GID for VFS ops */
 	unsigned	securebits;	/* SUID-less security management */
-	kernel_cap_t	cap_inheritable; /* caps our children can inherit */
-	kernel_cap_t	cap_permitted;	/* caps we're permitted */
-	kernel_cap_t	cap_effective;	/* caps we can actually use */
-	kernel_cap_t	cap_bset;	/* capability bounding set */
-	kernel_cap_t	cap_ambient;	/* Ambient capability set */
 #ifdef CONFIG_KEYS
 	unsigned char	jit_keyring;	/* default keyring to attach requested
 					 * keys to */
@@ -152,7 +157,6 @@ struct cred {
 	struct user_struct *user;	/* real user ID subscription */
 	struct user_namespace *user_ns; /* user_ns the caps and keyrings are relative to. */
 	struct group_info *group_info;	/* supplementary groups for euid/fsgid */
-	struct rcu_head	rcu;		/* RCU deletion hook */
 };
 
 extern void __put_cred(struct cred *);
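The effect Eric is after - keeping the line dirtied by the refcount away from the read-mostly uid/gid fields - can be illustrated in portable C11 with `alignas`, standing in for the kernel's `____cacheline_aligned_in_smp`. The struct below is a hypothetical stand-in for struct cred, not the real layout:

```c
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

#define CACHELINE 64   /* assumed cache line size */

/* Hot refcount shares its line with seldom-read fields; the read-mostly
 * ids start on their own cache line, so bumping the refcount never
 * invalidates the line that current_fsuid()-style readers touch. */
struct cred_like {
    _Atomic long usage;              /* dirtied on every get/put */
    long cold_stuff[7];              /* seldom-used fields pad its line */
    alignas(CACHELINE) unsigned uid; /* read-mostly: own cache line */
    unsigned gid;
};
```

The payoff is the one Eric measured: readers of `uid`/`gid` stop ping-ponging the line that every open()/close() writes.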
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Tue, Oct 27, 2015 at 10:52:46AM +, Alan Burlison wrote:
> Unfortunately Hadoop isn't the only thing that pulls the shutdown()
> trick, so I don't think there's a simple fix for this, as discussed
> earlier in the thread. Having said that, if close() on Linux also
> did an implicit shutdown() it would mean that well-written
> applications that handled the scoping, sharing and reuse of FDs
> properly could just call close() and have it work the same way
> across *NIX platforms.

... except for all Linux, FreeBSD and OpenBSD versions out there, but hey, who's counting those, right? Not to mention the OSX behaviour - I really have no idea what it does; the FreeBSD ancestry in its kernel is distant enough for a lot of changes to have happened in that area.

So... Which Unices other than Solaris and NetBSD actually behave that way? I.e. have close(fd) cancel accept(fd) another thread is sitting in. Note that the NetBSD implementation has known races. Linux, FreeBSD and OpenBSD don't do that at all.

Frankly, as far as I'm concerned, the bottom line is:

* there are two variants of semantics in that area and there's not much that could be done about that.

* POSIX is vague enough for both variants to comply with it (it's also very badly written in the area in question).

* I don't see any way to implement something similar to Solaris behaviour without a huge increase of memory footprint or massive cacheline pingpong. Solaris appears to go for memory footprint from hell - cacheline per descriptor (instead of a pointer per descriptor).

* the benefits of Solaris-style behaviour are not obvious - all things equal it would be interesting, but the things are very much not equal. What's more, if your userland code is such that the accept() argument could be closed by another thread, the caller *cannot* do anything with said argument after accept() returns, no matter which variant of semantics is used.

* [Linux-specific aside] our __alloc_fd() can degrade quite badly with some use patterns. The cacheline pingpong in the bitmap is probably inevitable, unless we accept considerably heavier memory footprint, but we also have a case when alloc_fd() takes O(n) and it's _not_ hard to trigger - close(3);open(...); will have the next open() after that scanning the entire in-use bitmap. I think I see a way to improve it without slowing the normal case down, but I'll need to experiment a bit before I post patches. Anybody with examples of real-world loads that make our descriptor allocator degrade is very welcome to post the reproducers...
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
From: Alan Burlison
Date: Tue, 27 Oct 2015 14:13:31 +

> Ideally there'd be a single way of doing this that worked
> cross-platform, at the moment there isn't. And yes, even if such a
> mechanism was available now it would be some time before it could be
> assumed to be available everywhere.

You will never be able to assume it is available everywhere under Linux. Ever. This is the fundamental issue that you seem to completely not understand.

You cannot just assume 5 years from now or whatever that the close() thing is there even if I added it to the tree right now. Your intent is to somewhere down the road assume this, and therefore distribute a broken piece of infrastructure that only works on some Linux systems. This is not acceptable.

The backwards compat code will need to be in your code forever. There is no way around it. That is, again, unless you want your code to not work on a non-trivial number of Linux systems out there.

Making this worse is that there isn't going to be a straightforward nor reliable way to test for the presence of this at run time.

You _have_ a way to accomplish what you want to do today and it works on every Linux system on the planet. Given the constraints, and the fact that you're going to have to account for this situation somehow in your code forever, I see very little to no value in adding the close() thing.

So your cross-platform unified behavior goal is simply unobtainable. So please deal with reality rather than wishful impractical things.
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
From: Alan Burlison
Date: Tue, 27 Oct 2015 13:37:26 +

> If you took that argument to it's logical extreme they you'd never
> make any changes that made changes to existing behaviour, and that's
> patently not the case.

You know exactly what I mean, and what you're saying here is just a scarecrow distracting the discussion from the real issue.
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On 27/10/2015 13:42, David Miller wrote:
> This semantic would only exist after Linux version X.Y.Z and vendor
> kernels that decided to backport the feature. Ergo, this application
> would ironically be non-portable on Linux machines.

Yes, that's true enough, until nobody was using the old versions any more.

> If portable Linux applications have to handle the situation using
> existing facilities there is absolutely zero value to add it now
> because it only will add more complexity to applications handling
> things correctly because they will always have two cases to somehow
> conditionally handle under Linux. And if the intention is to just
> always assume the close() semantic thing is there, then you have
> given me a disincentive to ever add the facility.

If you took that argument to its logical extreme then you'd never make any changes to existing behaviour, and that's patently not the case.

-- 
Alan Burlison
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On 27/10/2015 13:59, David Miller wrote:
> From: Alan Burlison
> Date: Tue, 27 Oct 2015 13:37:26 +
>
>> If you took that argument to its logical extreme then you'd never
>> make any changes that changed existing behaviour, and that's
>> patently not the case.
>
> You know exactly what I mean, and what you're saying here is just a
> scarecrow distracting the discussion from the real issue.

I think you probably mean "a straw man".

The problematic case is MT applications that don't manage sharing of FDs in a sensible way and need to have a way of terminating any accept() in other threads. On Linux that's currently done with a shutdown()+close(), on other platforms you can use open()+dup2(). However the Linux shutdown() mechanism relies on bending the POSIX semantics and the dup2() mechanism doesn't work on Linux as it also doesn't kick other threads off accept().

At the moment, on Linux you have to explicitly call shutdown() on a socket on which another thread may be sat in accept(). If closing the socket in one thread terminated any accept()s in other threads, in the same way that an explicit shutdown() does, then the explicit shutdown() wouldn't be needed for more sensibly written apps that weren't prone to FD recycling races. As far as I can tell, that would work cross-platform.

Ideally there'd be a single way of doing this that worked cross-platform; at the moment there isn't. And yes, even if such a mechanism was available now it would be some time before it could be assumed to be available everywhere.

I don't know enough about the Linux implementation to know if there is a practical way around this, and of course even if such a change were made, potential breakage of existing code would be a concern. If there's a better, cross-platform way of doing this then I'm all ears.
-- 
Alan Burlison
Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Wed, Oct 21, 2015 at 03:38:51PM +0100, Alan Burlison wrote:
> On 21/10/2015 04:49, Al Viro wrote:
> >BTW, for real fun, consider this:
> >7)
> >// fd is a socket
> >fd2 = dup(fd);
> >in thread A: accept(fd);
> >in thread B: accept(fd);
> >in thread C: accept(fd2);
> >in thread D: close(fd);
> >
> >Which threads (if any), should get hit where it hurts?
>
> A & B should return from the accept with an error. C should continue.
> Which is what happens on Solaris.

So, I'm coming late to this discussion and I don't have the original context; however, to me this cited behavior seems undesirable and if I ran across it in the wild I would probably describe it as a bug.

System call processing for operations on files involves translating a file descriptor (a number) into an open-file object (or "file description"), struct file in BSD and I think also in Linux. The actual system call logic operates on the open-file object, so once the translation happens application monkeyshines involving file descriptor numbers should have no effect on calls in progress. Other behavior would violate the principle of least surprise, as this basic architecture predates POSIX.

The behavior Al Viro found in NetBSD's dup2 is a bug. System calls are supposed to be atomic, regardless of what POSIX might or might not say. I did not write that code but I'll pass the report on to those who did.

-- 
David A. Holland
dholl...@netbsd.org
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On 23/10/2015 17:00, Eric Dumazet wrote:
>> Ermm, you *really* want me to submit a patch removing 'Conforms to
>> POSIX.1-2001' from *every* Linux manpage?
>
> Only on the pages you think there is an error that matters.

If there's consensus that the current shutdown(), dup2(), close() and accept() behaviour is not POSIX-compliant then I can do that, sure.

>>> Have you tested the patch I sent ?
>>
>> The AF_UNIX poll one? No, I don't have the means to do so, and in any
>> case that's not a POSIX issue, just a plain bug. I'm happy to log a
>> bug if that helps.
>
> We submit patches when someone needs a fix. If not, we have more
> urgent issues to solve first.
>
> I wrote the following test case, and confirmed the patch fixes the
> issue. I will submit it formally.

Thanks, works for me - the poll() issue only affects AF_UNIX sockets in the listen state and is easily avoided by simply not setting the output bits in the poll events mask, so it's not exactly high priority.

-- 
Alan Burlison
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Fri, 2015-10-23 at 09:00 -0700, Eric Dumazet wrote:
> on the pages you think there is an error that matters.
>
> >> Have you tested the patch I sent ?
> >
> > The AF_UNIX poll one? No, I don't have the means to do so, and in any
> > case that's not a POSIX issue, just a plain bug. I'm happy to log a bug
> > if that helps.

BTW, there is no kernel bug here. POSIX poll() man page says :

POLLOUT
    Normal data may be written without blocking.

If you attempt to write on a listener, write() does _not_ block and returns -1, which seems correct behavior to me, in accordance with man page.

socket(PF_LOCAL, SOCK_STREAM, 0)        = 3
bind(3, {sa_family=AF_LOCAL, sun_path=@""}, 110) = 0
listen(3, 10)                           = 0
write(3, "test", 4)                     = -1 ENOTCONN (Transport endpoint is not connected)

Could you point out which part of POSIX is mandating that af_unix listener MUST filter out POLLOUT ?

Thanks.
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On 23/10/2015 15:21, Eric Dumazet wrote:
> I claim nothing. If you believe a man page should be fixed, please
> send a patch to man page maintainer.

Ermm, you *really* want me to submit a patch removing 'Conforms to POSIX.1-2001' from *every* Linux manpage?

> Have you tested the patch I sent ?

The AF_UNIX poll one? No, I don't have the means to do so, and in any case that's not a POSIX issue, just a plain bug. I'm happy to log a bug if that helps.

> The goal here is to improve things, not to say you are right or I am
> right. I am very often wrong, then what ? This list is about linux
> kernel development.

Thank you for the clarification.

-- 
Alan Burlison
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Fri, 2015-10-23 at 16:46 +0100, Alan Burlison wrote:
> On 23/10/2015 15:21, Eric Dumazet wrote:
>
> > I claim nothing. If you believe a man page should be fixed, please send
> > a patch to man page maintainer.
>
> Ermm, you *really* want me to submit a patch removing 'Conforms to
> POSIX.1-2001' from *every* Linux manpage?

Only on the pages you think there is an error that matters.

> > Have you tested the patch I sent ?
>
> The AF_UNIX poll one? No, I don't have the means to do so, and in any
> case that's not a POSIX issue, just a plain bug. I'm happy to log a bug
> if that helps.

We submit patches when someone needs a fix. If not, we have more urgent issues to solve first.

I wrote the following test case, and confirmed the patch fixes the issue. I will submit it formally.

Thanks.

[Editor's note: the program below lost its include lines and several '&' address-of operators to HTML scraping; they have been restored from context.]

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <poll.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>

static void fail(const char *str)
{
	perror(str);
	printf("FAIL\n");
	exit(1);
}

int main(int argc, char *argv[])
{
	int listener = socket(AF_UNIX, SOCK_STREAM, 0);
	struct pollfd pfd;
	struct sockaddr_un addr;
	int res;

	if (listener == -1)
		perror("socket()");
	memset(&addr, 0, sizeof(addr));
	addr.sun_family = AF_UNIX;
	if (bind(listener, (struct sockaddr *)&addr, sizeof(addr)) == -1)
		fail("bind()");
	if (listen(listener, 10) == -1)
		fail("listen()");
	pfd.fd = listener;
	pfd.events = -1;
	res = poll(&pfd, 1, 10);
	if (res == -1)
		fail("poll()");
	if (res == 1 && pfd.revents & (POLLOUT|POLLWRNORM|POLLWRBAND)) {
		fprintf(stderr, "poll(af_unix listener) returned a POLLOUT status !\n");
		printf("FAIL\n");
		return 1;
	}
	printf("OK\n");
	return 0;
}
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Fri, 2015-10-23 at 15:20 +0200, casper@oracle.com wrote:
> This point was not about a 10M sockets server but in general...

Hey, we do not design the OS only for handling desktop workloads. If netstat does not work on this typical server workload, then it is a joke.

> Solaris has had such an option too, but that wasn't the point. You
> really don't want to know which application is doing this?

Apparently nobody had this need before you asked.

My point is that instead of carrying ~30 years of legacy, we would better design the system properly so that we can _use_ it in all situations.

ps used to open /dev/kmem a long time ago. fuser and similar commands having to know system internals are really hacks.
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Fri, 2015-10-23 at 15:20 +0200, casper@oracle.com wrote:
> > >Yet another POSIX deficiency.
> > >
> > >When a server deals with 10,000,000+ socks, we absolutely do not care of
> > >this requirement.
> > >
> > >O(log(n)) is still crazy if it involves O(log(n)) cache misses.
>
> You miss the fine point of the algorithm; you *always* find an available
> file descriptor in O(log(N)) (where N is the size of the table).
>
> Does your algorithm guarantee that?

It's a simple bit map. Each cache line contains 64*8 = 512 bits. Lookup is actually quite fast thanks to hardware prefetches.

Main problem we had until recently was that we used to acquire fd table lock 3 times per accept()/close() pair. This patch took care of the issue :

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=8a81252b774b53e628a8a0fe18e2b8fc236d92cc

Then, adding a O_RANDOMFD flag (or O_DO_NOT_REQUIRE_POSIX_FD_RULES) is helping a lot, as it allows us to pick a random starting point during the bitmap search. In practice, finding a slot in fd array is O(1) : one cache line miss exactly.
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Fri, 2015-10-23 at 11:52 +0200, casper@oracle.com wrote: > > >Ho-hum... It could even be made lockless in fast path; the problems I see > >are > > * descriptor-to-file lookup becomes unsafe in a lot of locking > >conditions. Sure, most of that happens on the entry to some syscall, with > >very light locking environment, but... auditing every sodding ioctl that > >might be doing such lookups is an interesting exercise, and then there are > >->mount() instances doing the same thing. And procfs accesses. Probably > >nothing impossible to deal with, but nothing pleasant either. > > In the Solaris kernel code, the ioctl code is generally not handled a file > descriptor but instead a file pointer (i.e., the lookup is done early in > the system call). > > In those specific cases where a system call needs to convert a file > descriptor to a file pointer, there is only one routines which can be used. > > > * memory footprint. In case of Linux on amd64 or sparc64, > >main() > >{ > > int i; > > for (i = 0; i < 1<<24; dup2(0, i++))// 16M descriptors > > ; > >} > >will chew 132Mb of kernel data (16Mpointer + 32Mbit, assuming sufficient > >ulimit -n, > >of course). How much will Solaris eat on the same? > > Yeah, that is a large amount of memory. Of course, the table is only > sized when it is extended and there is a reason why there is a limit on > file descriptors. But we're using more data per file descriptor entry. > > > > * related to the above - how much cacheline sharing will that involve? > >These per-descriptor use counts are bitch to pack, and giving each a > >cacheline > >of its own... > > As I said, we do actually use a lock and yes that means that you really > want to have a single cache line for each and every entry. It does make > it easy to have non-racy file description updates. You certainly do not > want false sharing when there is a lot of contention. 
> > Other data is used to make sure that it only takes O(log(n)) to find the > lowest available file descriptor entry. (Where n, I think, is the returned > descriptor) Yet another POSIX deficiency. When a server deals with 10,000,000+ socks, we absolutely do not care of this requirement. O(log(n)) is still crazy if it involves O(log(n)) cache misses. > > Not contended locks aren't expensive. And all is done on a single cache > line. > > One question about the Linux implementation: what happens when a socket in > select is closed? I'm assuming that the kernel waits until "shutdown" is > given or when a connection comes in? > > Is it a problem that you can "hide" your listening socket with a thread in > accept()? I would think so (It would be visible in netstat but you can't > easily find out why has it) Again, netstat -p on a server with 10,000,000 sockets never completes. Never try this unless you are desperate and want to avoid a reboot maybe. If you absolutely want to nuke a listener because of untrusted applications, we better implement a proper syscall. Android has such a facility. Alternative would be to extend netlink (ss command from iproute2 package) to carry one pid per socket. ss -atnp state listening -> would not have to readlink (/proc/*/fd/*) -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
>Yet another POSIX deficiency.
>
>When a server deals with 10,000,000+ socks, we absolutely do not care of
>this requirement.
>
>O(log(n)) is still crazy if it involves O(log(n)) cache misses.

You miss the fine point of the algorithm; you *always* find an available file descriptor in O(log(N)) (where N is the size of the table).

Does your algorithm guarantee that?

>> Is it a problem that you can "hide" your listening socket with a thread in
>> accept()? I would think so (It would be visible in netstat but you can't
>> easily find out why has it)
>
>Again, netstat -p on a server with 10,000,000 sockets never completes.

This point was not about a 10M sockets server but in general...

>Never try this unless you are desperate and want to avoid a reboot
>maybe.
>
>If you absolutely want to nuke a listener because of untrusted
>applications, we better implement a proper syscall.
>
>Android has such a facility.

Solaris has had such an option too, but that wasn't the point. You really don't want to know which application is doing this?

>Alternative would be to extend netlink (ss command from iproute2
>package) to carry one pid per socket.
>
>ss -atnp state listening

That would be an option too.

Casper
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On 23/10/2015 14:02, Eric Dumazet wrote:
>> Other data is used to make sure that it only takes O(log(n)) to find
>> the lowest available file descriptor entry. (Where n, I think, is the
>> returned descriptor)
>
> Yet another POSIX deficiency.
>
> When a server deals with 10,000,000+ socks, we absolutely do not care
> of this requirement.
>
> O(log(n)) is still crazy if it involves O(log(n)) cache misses.

If you think it's a POSIX deficiency then logging a DR is probably the correct way of addressing that. And as I've said it's fine to decide that you don't care about what POSIX says on the subject but you can't simultaneously claim POSIX conformance. One or the other, not both.

-- 
Alan Burlison
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
>Ho-hum... It could even be made lockless in fast path; the problems I see
>are
> * descriptor-to-file lookup becomes unsafe in a lot of locking
>conditions. Sure, most of that happens on the entry to some syscall, with
>very light locking environment, but... auditing every sodding ioctl that
>might be doing such lookups is an interesting exercise, and then there are
>->mount() instances doing the same thing. And procfs accesses. Probably
>nothing impossible to deal with, but nothing pleasant either.

In the Solaris kernel code, the ioctl code is generally not handed a file descriptor but instead a file pointer (i.e., the lookup is done early in the system call).

In those specific cases where a system call needs to convert a file descriptor to a file pointer, there is only one routine which can be used.

> * memory footprint. In case of Linux on amd64 or sparc64,
>main()
>{
>	int i;
>	for (i = 0; i < 1<<24; dup2(0, i++))	// 16M descriptors
>		;
>}
>will chew 132Mb of kernel data (16Mpointer + 32Mbit, assuming sufficient
>ulimit -n, of course). How much will Solaris eat on the same?

Yeah, that is a large amount of memory. Of course, the table is only sized when it is extended and there is a reason why there is a limit on file descriptors. But we're using more data per file descriptor entry.

> * related to the above - how much cacheline sharing will that involve?
>These per-descriptor use counts are bitch to pack, and giving each a
>cacheline of its own...

As I said, we do actually use a lock and yes that means that you really want to have a single cache line for each and every entry. It does make it easy to have non-racy file description updates. You certainly do not want false sharing when there is a lot of contention.

Other data is used to make sure that it only takes O(log(n)) to find the lowest available file descriptor entry. (Where n, I think, is the returned descriptor)

Not contended locks aren't expensive. And all is done on a single cache line.
One question about the Linux implementation: what happens when a socket in select is closed? I'm assuming that the kernel waits until "shutdown" is given or when a connection comes in?

Is it a problem that you can "hide" your listening socket with a thread in accept()? I would think so (It would be visible in netstat but you can't easily find out why has it)

Casper
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On 23/10/2015 17:19, Eric Dumazet wrote:
>>> The AF_UNIX poll one? No, I don't have the means to do so, and in any
>>> case that's not a POSIX issue, just a plain bug. I'm happy to log a bug
>>> if that helps.
>
> BTW, there is no kernel bug here. POSIX poll() man page says :
>
> POLLOUT
>     Normal data may be written without blocking.
>
> If you attempt to write on a listener, write() does _not_ block and
> returns -1, which seems correct behavior to me, in accordance with man
> page.

Except of course data may not be written, because an attempt to actually do so fails, because the socket is in the listen state, is not connected and therefore no attempt to write to it could ever succeed. The only bit of the required behaviour that the current AF_UNIX poll implementation actually gets right is the "without blocking" bit, and that's only the case because the failure is detected immediately and the write call returns immediately with an error.

> socket(PF_LOCAL, SOCK_STREAM, 0)        = 3
> bind(3, {sa_family=AF_LOCAL, sun_path=@""}, 110) = 0
> listen(3, 10)                           = 0
> write(3, "test", 4)                     = -1 ENOTCONN (Transport endpoint is not connected)
>
> Could you point out which part of POSIX is mandating that af_unix
> listener MUST filter out POLLOUT ?

"A file descriptor for a socket that is listening for connections shall indicate that it is ready for reading, once connections are available. A file descriptor for a socket that is connecting asynchronously shall indicate that it is ready for writing, once a connection has been established."

If POSIX had to explicitly list every possible thing that implementations *should not* do rather than just those that they *should* do then it would be even more unwieldy than it already is.

And if what you are asserting is correct, why isn't the AF_INET behaviour the same as the AF_UNIX behaviour?
-- 
Alan Burlison
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Thu, Oct 22, 2015 at 09:50:10PM +0200, casper@oracle.com wrote:
> >Sigh... It completely fails to mention descriptor-passing. Which
> > a) is relevant to what "last close" means and
> > b) had been there for nearly the third of a century.
>
> Why is that different? These clearly count as file descriptors.

To quote your own posting upthread (Message-ID: <201510212033.t9lkx4g8007...@room101.nl.oracle.com>)

> Well, a file descriptor really only exists in the context of a process;
> in-flight it is no longer a file descriptor as there process context with
                                                ^^^
> a file descriptor table; so pointers to file descriptions are passed
> around.

IMO it shows that the wording is anything but clear. The only way to claim that it's accurate is, indeed, to declare that the contents of in-flight SCM_RIGHTS datagram should be counted as file descriptors and that's too much of a stretch for the reasons you've pointed to.
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Fri, 2015-10-23 at 17:40 +0100, Alan Burlison wrote:
> On 23/10/2015 17:19, Eric Dumazet wrote:
>
> >>> The AF_UNIX poll one? No, I don't have the means to do so, and in any
> >>> case that's not a POSIX issue, just a plain bug. I'm happy to log a bug
> >>> if that helps.
> >
> > BTW, there is no kernel bug here. POSIX poll() man page says :
> >
> > POLLOUT
> >     Normal data may be written without blocking.
> >
> > If you attempt to write on a listener, write() does _not_ block and
> > returns -1, which seems correct behavior to me, in accordance with man
> > page.
>
> Except of course data may not be written, because an attempt to actually
> do so fails, because the socket is in the listen state, is not connected
> and therefore no attempt to write to it could ever succeed. The only bit
> of the required behaviour that the current AF_UNIX poll implementation
> actually gets right is the "without blocking" bit, and that's only the
> case because the failure is detected immediately and the write call
> returns immediately with an error.

Yeah, I know some people use poll(NULL, 0, timeout) to implement msleep(). Because it definitely impresses friends.

So why not poll(&pfd, 1, timeout) to do the same, with a socket listener and POLLOUT in pfd.events.

Go figure. I'll send the fine patch.

Thanks.
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Fri, Oct 23, 2015 at 11:52:34AM +0200, casper@oracle.com wrote: > > > >Ho-hum... It could even be made lockless in fast path; the problems I see > >are > > * descriptor-to-file lookup becomes unsafe in a lot of locking > >conditions. Sure, most of that happens on the entry to some syscall, with > >very light locking environment, but... auditing every sodding ioctl that > >might be doing such lookups is an interesting exercise, and then there are > >->mount() instances doing the same thing. And procfs accesses. Probably > >nothing impossible to deal with, but nothing pleasant either. > > In the Solaris kernel code, the ioctl code is generally not handled a file > descriptor but instead a file pointer (i.e., the lookup is done early in > the system call). The one that comes as the first argument of ioctl(2) - sure, but e.g. BTRFS_IOC_CLONE_RANGE gets a pointer to this: struct btrfs_ioctl_clone_range_args { __s64 src_fd; __u64 src_offset, src_length; __u64 dest_offset; }; as the third argument. VFS sure as hell has no idea of that thing - it's up to btrfs_ioctl() to copy it in and deal with what it had been given. While we are at it, ioctl(fd, BTRFS_IOC_CLONE, src_fd) also gets the second descriptor-to-file lookup in btrfs-specific code; fd, of course, is looked up by VFS code. Now, these two are done in locking-neutral environment, but that's just two of several dozens. And no, I'm not fond of such irregular ways to pass file descriptors, but we can't kill ioctl(2) with all weirdness hiding behind it, more's the pity... > In those specific cases where a system call needs to convert a file > descriptor to a file pointer, there is only one routines which can be used. Obviously, but the problem is deadlock avoidance using it. > As I said, we do actually use a lock and yes that means that you really > want to have a single cache line for each and every entry. It does make > it easy to have non-racy file description updates. 
You certainly do not > want false sharing when there is a lot of contention. > > Other data is used to make sure that it only takes O(log(n)) to find the > lowest available file descriptor entry. (Where n, I think, is the returned > descriptor) TBH, with that kind of memory footprint I would be more interested in constants than in asymptotics - not that O(log(n)) would be hard to arrange (a bunch of bitmaps with something like 1:512 ratio between the levels, to keep the damn thing within a reasonable cacheline size would probably do with not too horrible constant; 3 levels of that would already give 128M descriptors, and that's a gigabyte worth of struct file * alone; with your "cacheline per descriptor" it's what, about 8Gb eaten by descriptor table?) Hell knows, might be worth doing regardless of anything else. Not making it worse than our plain bitmap variant in any situations shouldn't be hard... > Not contended locks aren't expensive. And all is done on a single cache > line. The memory footprint is really scary. Bitmaps are pretty much noise, but blowing it by factor of 8 on normal 64bit (or 16 on something like Itanic - or Venus for that matter, which is more relevant for you guys) Said that, what's the point of "close won't return until..."? After all, you can't guarantee that thread with cancelled syscall won't lose CPU immediately upon return to userland, so it *can't* make any assumptions about the descriptor not having been already reused. I don't get it - what does that buy for userland code? -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Fri, Oct 23, 2015 at 06:30:25PM +, David Holland wrote: > So, I'm coming late to this discussion and I don't have the original > context; however, to me this cited behavior seems undesirable and if I > ran across it in the wild I would probably describe it as a bug. Unfortunately, that's precisely what NetBSD is trying to implement (and that's what will happen if nothing else reopens fd). See the logics in fd_close(), with ->fo_restart() and waiting for all activity to settle down. As for the missing context, what fd_close() is doing is also unreliable - inducing ERESTART in other threads sitting in accept(2) and things like that and waiting for them to run into EBADF they'll get (barring races) on syscall restart; threads sitting in accept() et.al. on the same struct file, but with different descriptors will hopefully go into restart and continue unaffected. All that machinery relies on nothing having reused the descriptor for socket(2), dup2() target, etc. while those threads had been going through the syscall restart - if that happens, you are SOL, since accept(2) _will_ restart on an unexpected socket. Moreover, if you fix dup2() atomicity, this approach will reliably shit itself for situations when dup2() rather than close() is used to close the socket. It relies upon having at least some window where the victim descriptor would be yielding EBADF. > System call processing for operations on files involves translating a > file descriptor (a number) into an open-file object (or "file > description"), struct file in BSD and I think also in Linux. The > actual system call logic operates on the open-file object, so once the > translation happens application monkeyshines involving file descriptor > numbers should have no effect on calls in progress. Other behavior > would violate the principle of least surprise, as this basic > architecture predates POSIX. 
Well, to be fair, until '93 there was no way to have descriptor table changed under a syscall in the first place. The old model (everything up to and ncluding 4.4BSD final) simply didn't include anything of that sort - mapping from descriptors to open files was not shared and all changes a syscall might see were ones done by the syscall itself. So this thing isn't covered by the basic architecture - it's something that had been significantly new merely two decades ago. And POSIX still hasn't quite caught up with that newfangled 4.2BSD thing... IMO what you've described above is fine - that's how Linux works, that's how FreeBSD and OpenBSD work and that's how NetBSD used to work until 2008 or so. "Cancel syscall if any of the descriptors got dissociated from opened files by action of another thread, have the dissociating operation wait for all affected syscalls to run down" thing had been introduced then and it is similar to what Solaris is doing. AFAICS, the main issue with that is the memory footprint from hell and/or cacheline clusterfuck. Having accept(2) bugger off with e.g. EINTR in such situation isn't inherently worse or better than having it sit there as if close() or dup2() has not happened - matter of taste, and if there had been a way to do it without inflicting the price on processes that do not pull that kind of crap in the first place... might be worth considering. As it is, the memory footprint seems to be too heavy. I'm not entirely convinced that there's no clever way to avoid that, but right now I don't see anything that would look like a good approach. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Fri, 2015-10-23 at 14:35 +0100, Alan Burlison wrote:

> If you think it's a POSIX deficiency then logging a DR is probably the
> correct way of addressing that. And as I've said it's fine to decide
> that you don't care about what POSIX says on the subject but you can't
> simultaneously claim POSIX conformance. One or the other, not both.

I claim nothing. If you believe a man page should be fixed, please send a patch to the man page maintainer. Have you tested the patch I sent? The goal here is to improve things, not to say you are right or I am right. I am very often wrong, then what? This list is about Linux kernel development.
Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Thu, Oct 22, 2015 at 11:55:42AM +0100, Alan Burlison wrote: > On 22/10/2015 05:21, Al Viro wrote: > > >>Most of the work on using a file descriptor is local to the thread. > > > >Using - sure, but what of cacheline dirtied every time you resolve a > >descriptor to file reference? > > Don't you have to do that anyway, to do anything useful with the file? Dirtying the cacheline that contains struct file itself is different, but that's not per-descriptor. > >In case of Linux we have two bitmaps and an array of pointers associated > >with descriptor table. They grow on demand (in parallel) > > * reserving a descriptor is done under ->file_lock (dropped/regained > >around memory allocation if we end up expanding the sucker, actual > >reassignment > >of pointers to array/bitmaps is under that spinlock) > > * installing a pointer is lockless (we wait for ongoing resize to > >settle, RCU takes care of the rest) > > * grabbing a file by index is lockless as well > > * removing a pointer is under ->file_lock, so's replacing it by dup2(). > > Is that table per-process or global? Usually it's per-process, but any thread could ask for a private instance to work with (and then spawn more threads sharing that instance - or getting independent copies). It's common for Plan 9-inspired models - basically, you treat every thread as a machine that consists of * memory * file descriptor table * namespace * signal handlers ... * CPU (i.e. actual thread of execution). The last part can't be shared; anything else can. fork(2) variant used to start new threads (clone(2) in case of Linux, rfork(2) in Plan 9 and *BSD) is told which components should be copies of parent's ones and which should be shared with the parent. fork(2) is simply "copy everything except for the namespace". It's fairly common to have "share everything", but intermediate variants are also possible. There are constraints (e.g. 
you can't share signal handlers without sharing the memory space), but the descriptor table can be shared independently from the memory space just fine. There's also a way to say "unshare this, this and that component" - mapped to unshare(2) in Linux and to rfork(2) in Plan 9. The best way to think of it is to consider the descriptor table a first-class object a thread can be connected to. Usually you have one for each process, with all threads belonging to that process connected to the same thing, but that's just the most common use.

> I don't think that it's possible to claim that a non-atomic dup2()
> is POSIX-compliant.

Except that it's in a non-normative part of dup2(2), AFAICS. I certainly agree that it would be standard lawyering beyond reason, but "not possible to claim" is too optimistic. Maybe I'm just more cynical...

> ThreadA remains sat in accept on fd1 which is now a plain file, not
> a socket.

No. accept() is not an operation on file descriptors; it's an operation on file descriptions (pardon the use of that terminology). They are specified by passing descriptors, but there's a hell of a difference between e.g. dup() or fcntl(,F_SETFD,) (operations on descriptors) and read() or lseek() (operations on descriptions). Lookups are done once per syscall; the only exception is F_SETLK{,W}, where we recheck that the descriptor is referring to the same thing before granting the lock. Again, POSIX still underspecifies the semantics of shared descriptor tables; back when the bulk of it was written there was no way to have a descriptor -> description mapping changed under a syscall by action of another thread. Hell, they still haven't picked up on some things that happened in the early 80s, let alone the early-to-mid 90s... Linux and Solaris happen to cover these gaps differently; FreeBSD and OpenBSD are probably closer to the Linux variant, NetBSD - to the Solaris one.
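The descriptor/description distinction drawn above is easy to observe: dup() creates a second descriptor for the same open file description, so state held in the description - here, the file offset - is visible through both numbers. A sketch of the experiment (Python os module; the temp file is just scaffolding):

```python
import os
import tempfile

tfd, path = tempfile.mkstemp()
os.close(tfd)

fd1 = os.open(path, os.O_RDWR)
os.write(fd1, b"hello world")

fd2 = os.dup(fd1)            # new descriptor, SAME file description
os.lseek(fd1, 0, os.SEEK_SET)
assert os.lseek(fd2, 0, os.SEEK_CUR) == 0  # the offset lives in the description

os.read(fd1, 5)              # advancing the offset via fd1...
assert os.lseek(fd2, 0, os.SEEK_CUR) == 5  # ...is visible via fd2

os.close(fd1)
os.close(fd2)
os.unlink(path)
```

This is why accept() is unaffected by games played with descriptor numbers: once the per-syscall lookup has resolved the number to a description, only the description matters.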
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
>On Thu, Oct 22, 2015 at 08:34:19AM +0200, casper@oracle.com wrote:
>>
>> >And I'm really curious about the things Solaris would do with dup2() there.
>> >Does it take into account the possibility of new accept() coming just as
>> >dup2() is trying to terminate the ongoing ones? Is there a window when
>> >descriptor-to-file lookups would fail? Looks like a race/deadlock
>> >country...
>>
>> Solaris does not "terminate" threads, instead it tells them that the
>> file descriptor information used is stale and wakes up the thread.
>
>Sorry, lousy wording - I meant "terminate syscall in another thread".
>Better yet, make that "what happens if new accept(newfd) comes while dup2()
>waits for affected syscalls in other threads to finish"? Assuming it
>does wait, that is...

No, there is no such window; the accept() call either returns EBADF (dup2() wins the race) or it returns a new file descriptor (and dup2() then closes the listening descriptor). One or the other.

>While we are at it, what's the relative order of record locks removal
>and switching the meaning of newfd? In our kernel it happens *after*
>the switchover (i.e. if another thread is waiting for a record lock held on
>any alias of newfd and we do dup2(oldfd, newfd), the waiter will not see
>the state with newfd still referring to what it used to; note that the waiter
>might be using any descriptor referring to the file newfd used to refer
>to, so it won't be affected by the "wake those who had the meaning of
>their arguments change" side of things).

The external behaviour is atomic; you cannot distinguish the order between the closing of the original file (and waking up other threads waiting for a record lock) and the changing of the file referenced by that newfd. But this does not involve a global or process-specific lock.

Casper
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
From: Al Viro

>Except that in this case "correctness" is the matter of rather obscure and
>ill-documented areas in POSIX. Don't get me wrong - this semantics isn't
>inherently bad, but it's nowhere near being an absolute requirement.

It would be more fruitful to have such a discussion in one of the OpenGroup mailing lists; the people gathered there have a lot of experience, and it is also possible to fix the standard when it turns out that it is indeed as vague as you claim it is (I don't quite agree with that).

Casper
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Thu, Oct 22, 2015 at 05:44:58AM +0100, Al Viro wrote:

> Except that in this case "correctness" is the matter of rather obscure and
> ill-documented areas in POSIX. Don't get me wrong - this semantics isn't
> inherently bad, but it's nowhere near being an absolute requirement.

PS: in principle, a fairly ugly trick might suffice for accept(2), but I'm less than happy with going there. Namely, we could
* have ->accept() get the descriptor number
* have ->flush() get the descriptor number in addition to current->files
and have it DTRT for sockets in the middle of accept(2). However, in addition to being ugly as hell, it has the problem with the points where we call ->flush(), specifically do_dup2() and __close_fd(). It's done *after* the replacement/removal from the descriptor table, so another socket might have already gotten the same descriptor and we'd get spurious termination of accept(2).

And I'm really curious about the things Solaris would do with dup2() there. Does it take into account the possibility of new accept() coming just as dup2() is trying to terminate the ongoing ones? Is there a window when descriptor-to-file lookups would fail? Looks like a race/deadlock country...
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
>It's been said that the current mechanisms in Linux & some BSD variants
>can be subject to races, and the behaviour exhibited doesn't conform to
>POSIX, for example requiring the use of shutdown() on unconnected
>sockets because close() doesn't kick off other threads accept()ing on
>the same fd. I'd be interested to hear if there's a better and more
>performant way of handling the situation that doesn't involve doing the
>sort of bookkeeping Casper described.

Of course, the implementation is now around 18 years old; clearly a lot of things have changed since then. In the particular case of Linux close() on a socket, surely it must be possible to detect at close that it is a listening socket and that you are about to close the last reference; the kernel could then do the shutdown() all by itself.

Casper
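The behaviour under debate - shutdown() on a listening socket kicking another thread out of a blocked accept(), where close() would not - can be observed directly on Linux. A sketch (Python; the short sleep is only to let the worker block first, and the EINVAL-after-shutdown outcome is the Linux-specific behaviour this thread is about, not something POSIX promises):

```python
import socket
import threading
import time

lsock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
lsock.bind(("127.0.0.1", 0))
lsock.listen(1)

outcome = {}

def worker():
    try:
        lsock.accept()           # blocks: nobody ever connects
        outcome["r"] = "accepted"
    except OSError:              # on Linux: EINVAL once shutdown() runs
        outcome["r"] = "woken"

t = threading.Thread(target=worker)
t.start()
time.sleep(0.2)                  # let the worker block in accept()

# close() here would NOT wake the worker; shutdown() does (Linux extension).
lsock.shutdown(socket.SHUT_RDWR)
t.join(timeout=5)

assert outcome.get("r") == "woken"
lsock.close()
```

Casper's suggestion above amounts to the kernel performing the shutdown() step implicitly on the last close of a listening socket.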
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On 22/10/2015 18:05, Al Viro wrote:

> Oh, for... Right in this thread an example of complete BS has been quoted
> from POSIX close(2). The part about closing a file when the last descriptor
> gets closed. _Nothing_ is POSIX-compliant in that respect (nor should it be).

That's not exactly what it says, as we've already discussed, for example in the case of pending async IO on a filehandle.

> Semantics around the distinction between file descriptors and file
> descriptions is underspecified, not to mention being very poorly written.

I agree that part could do with some polishing.

> You want to add something along the lines of "if any action by another
> thread changes the mapping from file descriptors to file descriptions for
> any file descriptor passed to syscall, such and such things should happen" -
> go ahead and specify what should happen. As it is, I don't see anything of
> that sort in e.g. accept(2). And no,
>	[EBADF] The socket argument is not a valid file descriptor.
> in there is nowhere near being unambiguous enough - everyone agrees that
> the argument should be a valid descriptor at the time of call, but I would
> be very surprised to find _any_ implementation (including the Solaris one)
> recheck that upon exit to userland.

The scenario I described previously, where dup2() is used to modify a fd that's being used in accept, results in the accept call terminating in the same way as if close had been called on it - Casper gave details earlier.

> For more bullshit from the same source (issue 7, close(2)):
>	If fildes refers to a socket, close() shall cause the socket to be
>	destroyed. If the socket is in connection-mode, and the SO_LINGER
>	option is set for the socket with non-zero linger time, and the
>	socket has untransmitted data, then close() shall block for up to
>	the current linger interval until all data is transmitted.
> I challenge you to find *any* implementation that would have
>	fd = socket(...);
>	close(dup(fd));
> do what this wonder of technical prose clearly requests.
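The challenge above is easy to take up empirically: closing a dup'ed descriptor does not destroy the socket, because the description stays pinned by the remaining descriptor. A sketch of the experiment (Python on a loopback socket):

```python
import os
import socket

lsock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
lsock.bind(("127.0.0.1", 0))
lsock.listen(1)

# The literal "fd = socket(...); close(dup(fd));" from the quoted challenge:
os.close(os.dup(lsock.fileno()))

# The socket was NOT destroyed - it still has its address and still accepts.
port = lsock.getsockname()[1]
client = socket.create_connection(("127.0.0.1", port))
conn, _ = lsock.accept()

conn.close()
client.close()
lsock.close()
```

In other words, every implementation ties socket destruction to the last close of the description, regardless of what a literal reading of that close(2) paragraph says.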
> In the same text we also have
>	When all file descriptors associated with a pipe or FIFO special file
>	are closed, any data remaining in the pipe or FIFO shall be discarded.
> as well as explicit (and underspecified, but perhaps they do it elsewhere)
> "last close" in parts related to sockets and ptys.

Yes, Casper has just reported that to TOG, see http://thread.gmane.org/gmane.comp.standards.posix.austin.general/11573. Our assessment is that sockets should behave the same way as plain files, i.e. 'last close'.

> And that is not to mention the dup2(2) wording in there:
>	If fildes2 is already a valid open file descriptor, it shall be
>	closed first
> which is (a) inviting misinterpretation that would make the damn thing
> non-atomic (the only mentioning of atomicity is in non-normative sections)

I've already explained why I believe atomic behaviour of dup2() is required by POSIX. If you feel it's not clear then we can ask POSIX for clarification.

> and (b) says fsck-all about the effects of closing a descriptor. The latter
> is a problem, since nothing in close(2) bothers making a distinction between
> the effects specific to the particular syscall and those common to all ways
> of closing a descriptor. And no, it's not nitpicking - consider e.g. the
> parts concerning the order of events triggered by close(2) (such and such
> should be completed before close(2) returns); should it be taken as "the
> same events should be completed before newfd is associated with the file
> description referred to by oldfd"? It _is_ user-visible, since close(2)
> removes fcntl locks. Sure, there is (otherwise unexplained)
>	The dup2() function is not intended for use in critical regions
>	as a synchronization mechanism.
> down in informative sections, so one can infer that the event order here
> isn't to be relied upon. With no way to guess whether the event order
> concerning e.g. the effect on an ongoing accept(newfd) is any different
> in that respect.
I think "it shall be closed first" makes it pretty clear that what is expected is the same behaviour as any direct invocation of close, and that has to happen before the reassignment. What makes you believe that isn't the case?

> The entire area in Issue 7 stinks. It might make sense to try and fix it
> up, but let's not pretend that what's in there right now does specify the
> semantics in this kind of situation.

Sorry, I disagree.

> I'm not saying that the Solaris approach yields an inherently bad
> semantics or that it's impossible to implement without a high scalability
> price and/or high memory footprint. But waving the flag of POSIX
> compliance when you are actually talking about the ways your
> implementation plugs the holes in a badly incomplete spec...

Personally I believe the spec is clear enough to allow an unambiguous interpretation of the required behavior in this area. If you think there are areas where the Solaris behaviour is in disagreement with the spec then I'd be interested to hear them.
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
>On Thu, Oct 22, 2015 at 08:24:51PM +0200, casper@oracle.com wrote:
>
>> The external behaviour is atomic; you cannot distinguish the order
>> between the closing of the original file (and waking up other threads
>> waiting for a record lock) or changing the file referenced by that newfd.
>>
>> But this does not involve a global or process specific lock.
>
>Interesting... Do you mean that descriptor-to-file lookup blocks until that
>rundown finishes?

For that particular file descriptor, yes. (I'm assuming you mean the Solaris kernel running down all lwps which have a system call in progress on that particular file descriptor.) All other fd-to-file lookups are not blocked at all by this locking. It should be clear that any such occurrences are application errors and should hardly ever be seen in practice. It is also known when this is needed, so it is hardly ever attempted.

Casper
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Thu, Oct 22, 2015 at 06:39:34PM +0100, Alan Burlison wrote:
> On 22/10/2015 18:05, Al Viro wrote:
>
> >Oh, for... Right in this thread an example of complete BS has been quoted
> >from POSIX close(2). The part about closing a file when the last descriptor
> >gets closed. _Nothing_ is POSIX-compliant in that respect (nor should
> >it be).
>
> That's not exactly what it says, we've already discussed, for
> example in the case of pending async IO on a filehandle.

Sigh... It completely fails to mention descriptor-passing. Which
 a) is relevant to what "last close" means and
 b) had been there for nearly a third of a century.

> I agree that part could do with some polishing.

google("wire brush of enlightenment") is what comes to mind...

> >and (b) says fsck-all about the effects of closing a descriptor. The latter
> >is a problem, since nothing in close(2) bothers making a distinction between
> >the effects specific to the particular syscall and those common to all ways
> >of closing a descriptor. And no, it's not nitpicking - consider e.g. the
> >parts concerning the order of events triggered by close(2) (such and such
> >should be completed before close(2) returns); should it be taken as "the
> >same events should be completed before newfd is associated with the file
> >description referred to by oldfd"? It _is_ user-visible, since close(2)
> >removes fcntl locks. Sure, there is (otherwise unexplained)
> >	The dup2() function is not intended for use in critical regions
> >	as a synchronization mechanism.
> >down in informative sections, so one can infer that the event order here
> >isn't to be relied upon. With no way to guess whether the event order
> >concerning e.g. the effect on an ongoing accept(newfd) is any different
> >in that respect.
>
> I think "it shall be closed first" makes it pretty clear that what
> is expected is the same behaviour as any direct invocation of close,
> and that has to happen before the reassignment. What makes you
> believe that isn't the case?
So unless I'm misparsing something, you want

	thread A: accept(newfd)
	thread B: dup2(oldfd, newfd)

have accept() bugger off before the switchover happens? What should happen if thread C does accept(newfd) right as B has decided that there's nothing more to wait for? For close(newfd) it would be simple - we are going to have lookup by descriptor fail with EBADF anyway, so making it do so as soon as we go hunting for those who are currently in accept(newfd) would do the trick - no new threads like that shall appear, and as long as the descriptor is not declared free for taking by descriptor allocation, nobody is going to be screwed by open() picking that slot of the descriptor table too early. Trying to do that for dup2() would lose atomicity. I honestly don't know how Solaris behaves in that case, BTW - the race (if any) would probably be hard to hit, so in the case of Linux I would have to go and RTFS before saying that there isn't one. I can't do that with Solaris; all I can do here is ask you guys...

Moreover, see above for record locks removal. Should that happen prior to the switchover? If you have

	dup2(fd, fd2);
	set a record lock on fd2
	spawn a thread
	in child, try to grab the same lock on fd2
	in parent, do some work and close(fd)

you are guaranteed that the child won't see fd referring to the same file after it acquires the lock. Replace close(fd) with dup2(fd3, fd); should the same hold true in that case? FWIW, Linux behaviour in that area is to have record locks removal done between the switchover and return to userland in case of dup2(), and between the removal from the descriptor table and return to userland in case of close().

> Personally I believe the spec is clear enough to allow an
> unambiguous interpretation of the required behavior in this area. If
> you think there are areas where the Solaris behaviour is in
> disagreement with the spec then I'd be interested to hear them.
The spec is so vague that I strongly suspect that *both* Solaris and Linux behaviours are not in disagreement with it (modulo shutdown(2) extension Linux-side and we are really stuck with that one).
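The record-lock interaction that makes the ordering question above user-visible is the classic POSIX rule that a process's locks on a file are dropped when *any* of its descriptors for that file is closed - not just the one the lock was taken through. A two-process sketch demonstrating that rule (Python; fcntl.lockf wraps the F_SETLK-style record locking being discussed):

```python
import fcntl
import os
import tempfile

tfd, path = tempfile.mkstemp()
os.close(tfd)

fd1 = os.open(path, os.O_RDWR)   # the lock will be taken via this one
fd2 = os.open(path, os.O_RDWR)   # a second, seemingly unrelated descriptor
fcntl.lockf(fd1, fcntl.LOCK_EX)  # POSIX record lock (per-process ownership)

r1, w1 = os.pipe()
r2, w2 = os.pipe()
pid = os.fork()
if pid == 0:                     # child: probe the lock from another process
    cfd = os.open(path, os.O_RDWR)

    def locked():
        try:
            fcntl.lockf(cfd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            fcntl.lockf(cfd, fcntl.LOCK_UN)
            return False
        except OSError:
            return True

    before = locked()            # parent holds the lock: expect True
    os.write(w1, b"x")           # tell the parent to close fd2
    os.read(r2, 1)               # wait until it has
    after = locked()             # lock gone, even though fd1 is still open
    os._exit(0 if (before and not after) else 1)

os.read(r1, 1)
os.close(fd2)                    # closing ANY descriptor drops our locks
os.write(w2, b"x")
_, status = os.waitpid(pid, 0)
assert os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
os.unlink(path)
```

Given that lock removal is a side effect of closing any alias, whether it happens before or after the dup2() switchover is exactly the kind of ordering the spec leaves open.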
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Thu, Oct 22, 2015 at 08:24:51PM +0200, casper@oracle.com wrote:

> The external behaviour is atomic; you cannot distinguish the order
> between the closing of the original file (and waking up other threads
> waiting for a record lock) or changing the file referenced by that newfd.
>
> But this does not involve a global or process specific lock.

Interesting... Do you mean that descriptor-to-file lookup blocks until that rundown finishes?
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Thu, Oct 22, 2015 at 08:34:19AM +0200, casper@oracle.com wrote:
>
> >And I'm really curious about the things Solaris would do with dup2() there.
> >Does it take into account the possibility of new accept() coming just as
> >dup2() is trying to terminate the ongoing ones? Is there a window when
> >descriptor-to-file lookups would fail? Looks like a race/deadlock country...
>
> Solaris does not "terminate" threads, instead it tells them that the
> file descriptor information used is stale and wakes up the thread.

Sorry, lousy wording - I meant "terminate syscall in another thread". Better yet, make that "what happens if new accept(newfd) comes while dup2() waits for affected syscalls in other threads to finish"? Assuming it does wait, that is...

While we are at it, what's the relative order of record locks removal and switching the meaning of newfd? In our kernel it happens *after* the switchover (i.e. if another thread is waiting for a record lock held on any alias of newfd and we do dup2(oldfd, newfd), the waiter will not see the state with newfd still referring to what it used to; note that the waiter might be using any descriptor referring to the file newfd used to refer to, so it won't be affected by the "wake those who had the meaning of their arguments change" side of things).
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
From: Al Viro

>On Thu, Oct 22, 2015 at 06:39:34PM +0100, Alan Burlison wrote:
>> On 22/10/2015 18:05, Al Viro wrote:
>>
>> >Oh, for... Right in this thread an example of complete BS has been quoted
>> >from POSIX close(2). The part about closing a file when the last descriptor
>> >gets closed. _Nothing_ is POSIX-compliant in that respect (nor should
>> >it be).
>>
>> That's not exactly what it says, we've already discussed, for
>> example in the case of pending async IO on a filehandle.
>
>Sigh... It completely fails to mention descriptor-passing. Which
> a) is relevant to what "last close" means and
> b) had been there for nearly a third of a century.

Why is that different? These clearly count as file descriptors.

>> I agree that part could do with some polishing.
>
>google("wire brush of enlightenment") is what comes to mind...

Standardese is similar to legalese; it avoids writing that is directly open to interpretation, and those who are not inducted in writing it may have some problems interpreting what exactly is meant by the wording of the standard.

>> I think "it shall be closed first" makes it pretty clear that what
>> is expected is the same behaviour as any direct invocation of close,
>> and that has to happen before the reassignment. What makes you
>> believe that isn't the case?
>
>So unless I'm misparsing something, you want
>	thread A: accept(newfd)
>	thread B: dup2(oldfd, newfd)
>have accept() bugger off before the switchover happens?

Well, certainly *before* we return from dup2(). (And clearly only once we have determined that dup2() will return successfully.)

>What should happen if thread C does accept(newfd) right as B has decided that
>there's nothing more to wait?
>For close(newfd) it would be simple - we are going to have lookup by
>descriptor fail with EBADF anyway, so making it do so as soon as we go
>hunting for those who are currently in accept(newfd) would do the trick -
>no new threads like that shall appear and as long as the descriptor is not
>declared free for taking by descriptor allocation nobody is going to be
>screwed by open() picking that slot of descriptor table too early. Trying
>to do that for dup2() would lose atomicity. I honestly don't know how
>Solaris behaves in that case, BTW - the race (if any) would probably be
>hard to hit, so in case of Linux I would have to go and RTFS before saying
>that there isn't one. I can't do that with Solaris; all I can do here
>is ask you guys...

Solaris dup2() behaves exactly like close().

>Moreover, see above for record locks removal. Should that happen prior to
>the switchover? If you have
>
>	dup2(fd, fd2);
>	set a record lock on fd2
>	spawn a thread
>	in child, try to grab the same lock on fd2
>	in parent, do some work and close(fd)
>
>you are guaranteed that the child won't see fd referring to the same file
>after it acquires the lock.

Here you are talking about a lock held by the "parent", and the "child" will only get the lock once close(fd) is done? Yes. The final "close" is done *after* the pointer has been removed from the file descriptor table.

>Replace close(fd) with dup2(fd3, fd); should the same hold true in that case?

Yes.

>FWIW, Linux behaviour in that area is to have record locks removal done
>between the switchover and return to userland in case of dup2() and between
>the removal from descriptor table and return to userland in case of close().
>
>> Personally I believe the spec is clear enough to allow an
>> unambiguous interpretation of the required behavior in this area. If
>> you think there are areas where the Solaris behaviour is in
>> disagreement with the spec then I'd be interested to hear them.
>The spec is so vague that I strongly suspect that *both* Solaris and Linux
>behaviours are not in disagreement with it (modulo shutdown(2) extension
>Linux-side and we are really stuck with that one).

I'm not sure the standard allows a handful of threads in accept() on a file descriptor which has already been closed *and* can be re-issued for other uses.

Casper
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On 22/10/2015 07:51, casper@oracle.com wrote:

> It would be more fruitful to have such a discussion in one of the
> OpenGroup mailing lists; the people gathered there have a lot of
> experience, and it is also possible to fix the standard when it turns out
> that it is indeed as vague as you claim it is (I don't quite agree with
> that).

+1. If there's interest in doing that I'll ask our POSIX rep the best way of initiating such a conversation.

-- Alan Burlison
Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On 22/10/2015 05:21, Al Viro wrote:

>> Most of the work on using a file descriptor is local to the thread.
>
> Using - sure, but what of the cacheline dirtied every time you resolve a
> descriptor to a file reference?

Don't you have to do that anyway, to do anything useful with the file?

> How much does it cover and to what degree is that local to the thread?
> When it's a refcount inside struct file - no big deal, we'll be reading
> the same cacheline anyway and unless several threads are pounding on the
> same file with syscalls at the same time, that won't be a big deal. But
> when refcounts are associated with descriptors...

There is a refcount in the struct held in the per-process list of open files, and the 'slow path' processing is only taken if there's more than one LWP in the process that's accessing the file.

> In case of Linux we have two bitmaps and an array of pointers associated
> with the descriptor table. They grow on demand (in parallel)
> * reserving a descriptor is done under ->file_lock (dropped/regained
> around memory allocation if we end up expanding the sucker, actual
> reassignment of pointers to array/bitmaps is under that spinlock)
> * installing a pointer is lockless (we wait for ongoing resize to
> settle, RCU takes care of the rest)
> * grabbing a file by index is lockless as well
> * removing a pointer is under ->file_lock, so's replacing it by dup2().

Is that table per-process or global?

> The point is, dup2() over an _unused_ descriptor is inherently racy, but
> dup2() over a descriptor we'd opened and kept open should be safe. As it
> is, their implementation boils down to "userland must do process-wide
> exclusion between *any* dup2() and all syscalls that might create a new
> descriptor - open()/pipe()/socket()/accept()/recvmsg()/dup()/etc". At the
> very least, it's a big QoI problem, especially since such userland
> exclusion would have to be taken around operations that can block for a
> long time.
> Sure, POSIX wording regarding dup2 is so weak that this behaviour can be
> argued to be compliant, but... replacement of the opened file associated
> with newfd really ought to be atomic to be of any use for multithreaded
> processes.

There's existing language in the Issue 7 dup2() description that says dup2() has to be atomic: "the dup2( ) function provides unique services, as no other interface is able to atomically replace an existing file descriptor." And there is some new language in Issue 7 Technical Corrigendum 2 that reinforces that, when it's talking about reassignment of stdin/stdout/stderr: "Furthermore, a close() followed by a reopen operation (e.g. open(), dup() etc) is not atomic; dup2() should be used to change standard file descriptors." I don't think that it's possible to claim that a non-atomic dup2() is POSIX-compliant.

> IOW, if newfd is an opened descriptor prior to dup2() and no other thread
> attempts to close it by any means, there should be no window during which
> it would appear to be not opened. Linux and FreeBSD satisfy that; OpenBSD
> seems to do the same, from a quick look. NetBSD doesn't; no idea about
> Solaris. FWIW, the older NetBSD implementation (prior to "File descriptor
> changes, discussed on tech-kern:" back in 2008) used to behave like the
> OpenBSD one; that rewrite had fixed a lot of crap, so it's entirely
> possible that OpenBSD simply has kept the old implementation, with tons
> of other races in that area, but this dup2() race got introduced in that
> rewrite.

Related to dup2(), there's some rather surprising behaviour on Linux. Here's the scenario:

- ThreadA opens, listens and accepts on socket fd1, waiting for incoming connections.
- ThreadB waits for a while, then opens normal file fd2 for read/write.
- ThreadB uses dup2 to make fd1 a clone of fd2.
- ThreadB closes fd2.
- ThreadA remains sat in accept on fd1, which is now a plain file, not a socket.
- ThreadB writes to fd1, the result of which appears in the file, so fd1 is indeed operating as a plain file.
ThreadB exits. ThreadA is still sat in accept on fd1. A connection is made to fd1 by another process. The accept call succeeds and returns the incoming connection. fd1 is still operating as a socket, even though it's now actually a plain file. -- I assume this is another consequence of the fact that threads waiting in accept don't get a notification if the fd they are using is closed, either directly by a call to close or by a syscall such as dup2. Not waking up other threads on a fd when it is closed seems like it's undesirable behaviour. I can see the reasoning behind allowing shutdown to be used to do such a wakeup even if that's not POSIX-compliant - it may make it slightly easier for applications avoid fd recycling races. However the current situation is that shutdown is the *only* way to perform such a wakeup - simply closing the fd has no effect on any other threads. That seems incorrect. -- Alan Burlison -- -- To unsubscribe from this list: send the line
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On 22/10/2015 05:44, Al Viro wrote:
> > It's been said that the current mechanisms in Linux & some BSD
> > variants can be subject to races
>
> You do realize that it goes for the entire area? And the races found
> in this thread are in the BSD variant that tries to do something
> similar to what you guys say Solaris is doing, so I'm not sure which
> way that argument goes. A high-level description with the same level
> of details will be race-free in _all_ of them. The devil is in the
> details, of course, and historically they had been very easy to get
> wrong. And extra complexity in that area doesn't make things better.

Yes, I absolutely agree it's difficult to get right. Modulo undetected
bugs I believe the Solaris implementation is race-free, and if it
isn't I think it's fair to say we'd consider that to be a bug.

> > , and the behaviour exhibited doesn't conform to POSIX, for example
> > requiring the use of shutdown() on unconnected sockets because
> > close() doesn't kick off other threads accept()ing on the same fd.
>
> Umm... The old kernels (and even more - old userland) are not going
> to disappear, so we are stuck with such uses of shutdown(2) anyway,
> POSIX or no POSIX.

Yes, I understand the problem, and in an earlier part of the discussion
I said that I suspected all that could really be done was to document
the behaviour, as changing it would break existing code. I still think
that's probably the only workable option.

> > I'd be interested to hear if there's a better and more performant
> > way of handling the situation that doesn't involve doing the sort
> > of bookkeeping Casper described.
>
> So would a lot of other people. :-)

> > To quote one of my colleague's favourite sayings: performance is a
> > goal, correctness is a constraint.
>
> Except that in this case "correctness" is a matter of rather obscure
> and ill-documented areas in POSIX. Don't get me wrong - this
> semantics isn't inherently bad, but it's nowhere near being an
> absolute requirement.
I don't think it's *that* obscure; when I found the original shutdown()
problem, Google showed up another occurrence pretty quickly:
https://lists.gnu.org/archive/html/libmicrohttpd/2011-09/msg00024.html

If a fd is closed then allowing other uses of it to continue in other
threads seems incorrect to me, as in the dup2() scenario I posted.

-- 
Alan Burlison
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Thu, 2015-10-22 at 08:15 +0200, casper@oracle.com wrote:
> > It's been said that the current mechanisms in Linux & some BSD
> > variants can be subject to races, and the behaviour exhibited
> > doesn't conform to POSIX, for example requiring the use of
> > shutdown() on unconnected sockets because close() doesn't kick off
> > other threads accept()ing on the same fd. I'd be interested to hear
> > if there's a better and more performant way of handling the
> > situation that doesn't involve doing the sort of bookkeeping Casper
> > described.
>
> Of course, the implementation is now around 18 years old; clearly a
> lot of things have changed since then.
>
> In the particular case of Linux close() on a socket, surely it must
> be possible to detect at close that it is a listening socket and that
> you are about to close the last reference; the kernel could then do
> the shutdown() all by itself.

We absolutely do not _want_ to do this just so that linux becomes
slower to the point Solaris can compete, or you guys can avoid some
work.

close(fd) is very far from knowing a file is a 'listener' or even a
'socket' without extra cache line misses.

To force a close of an accept() or whatever blocking socket-related
system call, a shutdown() makes a lot of sense. This would have zero
additional overhead for the fast path.
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On 22/10/2015 12:30, Eric Dumazet wrote:
> We absolutely do not _want_ to do this just so that linux becomes
> slower to the point Solaris can compete, or you guys can avoid some
> work.

Sentiments such as that really have no place in a discussion that's
been focussed primarily on the behaviour of interfaces, albeit with
digressions into the potential performance impacts. The discussion has
been cordial and I for one appreciate Al Viro's posts on the subject,
from which I've learned a lot. Can we please keep it that way? Thanks.

> close(fd) is very far from knowing a file is a 'listener' or even a
> 'socket' without extra cache line misses. To force a close of an
> accept() or whatever blocking socket related system call a shutdown()
> makes a lot of sense. This would have zero additional overhead for
> the fast path.

Yes, that would I believe be a significant improvement.

-- 
Alan Burlison
Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On Thu, 2015-10-22 at 12:58 +0100, Alan Burlison wrote:
> On 22/10/2015 12:30, Eric Dumazet wrote:
>
> > We absolutely do not _want_ to do this just so that linux becomes
> > slower to the point Solaris can compete, or you guys can avoid some
> > work.
>
> Sentiments such as that really have no place in a discussion that's
> been focussed primarily on the behaviour of interfaces, albeit with
> digressions into the potential performance impacts. The discussion
> has been cordial and I for one appreciate Al Viro's posts on the
> subject, from which I've learned a lot. Can we please keep it that
> way? Thanks.

Certainly not.

I am a major linux networking developer and won't accept linux being
hijacked by guys who never contributed to it, just so it meets their
unreasonable expectations.

We absolutely care about performance. And I do not care that you focus
on POSIX crap.
Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
On 22/10/15 19:16, Al Viro wrote:
> > Don't you have to do that anyway, to do anything useful with the
> > file?
>
> Dirtying the cacheline that contains struct file itself is different,
> but that's not per-descriptor.

Yes, true enough.

> Usually it's per-process, but any thread could ask for a private
> instance to work with (and then spawn more threads sharing that
> instance - or getting independent copies).
> [snip]

Thanks again for the info, interesting. Solaris also does things along
the same lines. In fact we recently extended posix_spawn so it could be
used by the JVM to create subprocesses without wholesale copying, and
Casper has done some optimisation work on posix_spawn as well.

> > I don't think that it's possible to claim that a non-atomic dup2()
> > is POSIX-compliant.
>
> Except that it's in the non-normative part of dup2(2), AFAICS. I
> certainly agree that it would be standard-lawyering beyond reason,
> but "not possible to claim" is too optimistic. Maybe I'm just more
> cynical...

Possibly so, and possibly justifiably so as well ;-)

> > ThreadA remains sat in accept on fd1 which is now a plain file, not
> > a socket.
>
> No. accept() is not an operation on file descriptors; it's an
> operation on file descriptions (pardon for the use of that
> terminology). They are specified by passing descriptors, but there's
> a hell of a difference between e.g. dup() or fcntl(,F_SETFD,)
> (operations on descriptors) and read() or lseek() (operations on
> descriptions). Lookups are done once per syscall; the only exception
> is F_SETLK{,W}, where we recheck that the descriptor is referring to
> the same thing before granting the lock.

Yes, but if you believe that dup2() requires an implicit close() within
it, and that dup2() has to be atomic, then expecting that other threads
waiting on the same fd in accept() will get a notification seems
reasonable enough.
> Again, POSIX is still underspecifying the semantics of shared
> descriptor tables; back when the bulk of it had been written there
> had been no way to have a descriptor -> description mapping changed
> under a syscall by the action of another thread. Hell, they still
> haven't picked up on some things that happened in the early 80s, let
> alone the early-to-mid 90s...

That's indisputably true - much of the POSIX behaviour predates threads
and it shows, quite badly, in some places.

> Linux and Solaris happen to cover these gaps differently; FreeBSD and
> OpenBSD are probably closer to the Linux variant, NetBSD to the
> Solaris one.

Yes, true enough.

-- 
Alan Burlison