RE: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-11-02 Thread David Laight
From: casper@oracle.com
> Sent: 21 October 2015 21:33
..
> >Er...  So fd2 = dup(fd);accept(fd)/close(fd) should *not* trigger that
> >behaviour, in your opinion?  Because fd is sure as hell not the last
> >descriptor refering to that socket - fd2 remains alive and well.
> >
> >Behaviour you describe below matches the _third_ option.
> 
> >> >BTW, for real fun, consider this:
> >> >7)
> >> >// fd is a socket
> >> >fd2 = dup(fd);
> >> >in thread A: accept(fd);
> >> >in thread B: accept(fd);
> >> >in thread C: accept(fd2);
> >> >in thread D: close(fd);

If D is going to do this, it ought to be using dup2() to clone
a copy of (say) /dev/null onto 'fd'.

> >> >Which threads (if any), should get hit where it hurts?
> >>
> >> A & B should return from the accept with an error. C should
> >> continue. Which is what happens on Solaris.

That seems to completely break the normal *nix file scheme...

> >> To this end each thread keeps a list of file descriptors
> >> in use by the current active system call.
> >

Remember, Solaris (and SYSV) has extra levels of multiplexing between
userspace and char special drivers (and probably sockets) than Linux does.
As well as having multiple fd referencing the same struct FILE, multiple
FILE can point to the same inode.
If you have two different /dev entries for the same major/minor you
also end up with separate inodes - all finally referencing the same
driver data (indexed only by minor number).
(You actually get two inodes in the chain, one for the disk filesystem
and one char-special. The ref counts on both can be greater than 1.)

David

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-11-02 Thread Al Viro
On Mon, Nov 02, 2015 at 10:03:58AM +, David Laight wrote:
> Remember, Solaris (and SYSV) has extra levels of multiplexing between
> userspace and char special drivers (and probably sockets) than Linux does.
> As well as having multiple fd referencing the same struct FILE, multiple
> FILE can point to the same inode.

As they ever could on any Unix.  Every open(2) results in a new struct file
(BTW, I've never seen that capitalization for kernel structure - not in
v7, not in *BSD, etc.; FILE is a userland typedef and I would be rather
surprised if any kernel, Solaris included, would've renamed 'struct file'
to 'struct FILE').

> If you have two different /dev entries for the same major/minor you
> also end up with separate inodes - all finally referencing the same
> driver data (indexed only by minor number).

Again, the same goes for all Unices, both Linux and Solaris included.
And what the devil does that have to do with sockets, anyway?  Or with
the problem in question, while we are at it - they have different
descriptors pointing to the same struct file behave differently;
anything sensitive to file type would be past that point.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-23 Thread David Holland
On Wed, Oct 21, 2015 at 03:38:51PM +0100, Alan Burlison wrote:
 > On 21/10/2015 04:49, Al Viro wrote:
 > >BTW, for real fun, consider this:
 > >7)
 > >// fd is a socket
 > >fd2 = dup(fd);
 > >in thread A: accept(fd);
 > >in thread B: accept(fd);
 > >in thread C: accept(fd2);
 > >in thread D: close(fd);
 > >
 > >Which threads (if any), should get hit where it hurts?
 > 
 > A & B should return from the accept with an error. C should continue. Which
 > is what happens on Solaris.

So, I'm coming late to this discussion and I don't have the original
context; however, to me this cited behavior seems undesirable and if I
ran across it in the wild I would probably describe it as a bug.

System call processing for operations on files involves translating a
file descriptor (a number) into an open-file object (or "file
description"), struct file in BSD and I think also in Linux. The
actual system call logic operates on the open-file object, so once the
translation happens application monkeyshines involving file descriptor
numbers should have no effect on calls in progress. Other behavior
would violate the principle of least surprise, as this basic
architecture predates POSIX.

The behavior Al Viro found in NetBSD's dup2 is a bug. System calls are
supposed to be atomic, regardless of what POSIX might or might not
say.

I did not write that code but I'll pass the report on to those who
did.

-- 
David A. Holland
dholl...@netbsd.org
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-23 Thread Al Viro
On Fri, Oct 23, 2015 at 06:30:25PM +, David Holland wrote:

> So, I'm coming late to this discussion and I don't have the original
> context; however, to me this cited behavior seems undesirable and if I
> ran across it in the wild I would probably describe it as a bug.

Unfortunately, that's precisely what NetBSD is trying to implement (and
that's what will happen if nothing else reopens fd).  See the logics in
fd_close(), with ->fo_restart() and waiting for all activity to settle
down.  As for the missing context, what fd_close() is doing is also
unreliable - inducing ERESTART in other threads sitting in accept(2) and
things like that and waiting for them to run into EBADF they'll get
(barring races) on syscall restart; threads sitting in accept() et.al.
on the same struct file, but with different descriptors will hopefully
go into restart and continue unaffected.  All that machinery relies on
nothing having reused the descriptor for socket(2), dup2() target, etc. while
those threads had been going through the syscall restart - if that happens,
you are SOL, since accept(2) _will_ restart on an unexpected socket.

Moreover, if you fix dup2() atomicity, this approach will reliably shit
itself for situations when dup2() rather than close() is used to close
the socket.  It relies upon having at least some window where the victim
descriptor would be yielding EBADF.

> System call processing for operations on files involves translating a
> file descriptor (a number) into an open-file object (or "file
> description"), struct file in BSD and I think also in Linux. The
> actual system call logic operates on the open-file object, so once the
> translation happens application monkeyshines involving file descriptor
> numbers should have no effect on calls in progress. Other behavior
> would violate the principle of least surprise, as this basic
> architecture predates POSIX.

Well, to be fair, until '93 there was no way to have descriptor table changed
under a syscall in the first place.  The old model (everything up to and 
ncluding 4.4BSD final) simply didn't include anything of that sort - mapping
from descriptors to open files was not shared and all changes a syscall might
see were ones done by the syscall itself.

So this thing isn't covered by the basic architecture - it's something that
had been significantly new merely two decades ago.  And POSIX still hasn't
quite caught up with that newfangled 4.2BSD thing...

IMO what you've described above is fine - that's how Linux works, that's
how FreeBSD and OpenBSD work and that's how NetBSD used to work until 2008
or so.  "Cancel syscall if any of the descriptors got dissociated from
opened files by action of another thread, have the dissociating operation
wait for all affected syscalls to run down" thing had been introduced then
and it is similar to what Solaris is doing.

AFAICS, the main issue with that is the memory footprint from hell and/or
cacheline clusterfuck.  Having accept(2) bugger off with e.g. EINTR in such
situation isn't inherently worse or better than having it sit there as if
close() or dup2() has not happened - matter of taste, and if there had been a
way to do it without inflicting the price on processes that do not pull that
kind of crap in the first place... might be worth considering.

As it is, the memory footprint seems to be too heavy.  I'm not entirely
convinced that there's no clever way to avoid that, but right now I don't
see anything that would look like a good approach.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-22 Thread Al Viro
On Thu, Oct 22, 2015 at 11:55:42AM +0100, Alan Burlison wrote:
> On 22/10/2015 05:21, Al Viro wrote:
> 
> >>Most of the work on using a file descriptor is local to the thread.
> >
> >Using - sure, but what of cacheline dirtied every time you resolve a
> >descriptor to file reference?
> 
> Don't you have to do that anyway, to do anything useful with the file?

Dirtying the cacheline that contains struct file itself is different, but
that's not per-descriptor.

> >In case of Linux we have two bitmaps and an array of pointers associated
> >with descriptor table.  They grow on demand (in parallel)
> > * reserving a descriptor is done under ->file_lock (dropped/regained
> >around memory allocation if we end up expanding the sucker, actual 
> >reassignment
> >of pointers to array/bitmaps is under that spinlock)
> > * installing a pointer is lockless (we wait for ongoing resize to
> >settle, RCU takes care of the rest)
> > * grabbing a file by index is lockless as well
> > * removing a pointer is under ->file_lock, so's replacing it by dup2().
> 
> Is that table per-process or global?

Usually it's per-process, but any thread could ask for a private instance
to work with (and then spawn more threads sharing that instance - or getting
independent copies).

It's common for Plan 9-inspired models - basically, you treat every thread
as a machine that consists of
* memory
* file descriptor table
* namespace
* signal handlers
...
* CPU (i.e. actual thread of execution).
The last part can't be shared; anything else can.  fork(2) variant used to
start new threads (clone(2) in case of Linux, rfork(2) in Plan 9 and *BSD)
is told which components should be copies of parent's ones and which should
be shared with the parent.  fork(2) is simply "copy everything except for the
namespace".  It's fairly common to have "share everything", but intermediate
variants are also possible.  There are constraints (e.g. you can't share
signal handlers without sharing the memory space), but descriptor table
can be shared independently from memory space just fine.  There's also a
way to say "unshare this, this and that components" - mapped to unshare(2) in
Linux and to rfork(2) in Plan 9.

Best way to think of that is to consider descriptor table as a first-class
object a thread can be connected to.  Usually you have one for each process,
with all threads belonging to that process connected to the same thing,
but that's just the most common use.

> I don't think that it's possible to claim that a non-atomic dup2()
> is POSIX-compliant.

Except that it's in non-normative part of dup2(2), AFAICS.  I certainly
agree that it would be a standard lawyering beyond reason, but "not
possible to claim" is too optimistic.  Maybe I'm just more cynical...

> ThreadA remains sat in accept on fd1 which is now a plain file, not
> a socket.

No.  accept() is not an operation on file descriptors; it's an operation on
file descriptions (pardon for use of that terminology).  They are specified
by passing descriptors, but there's a hell of a difference between e.g.
dup() or fcntl(,F_SETFD,) (operations on descriptors) and read() or lseek()
(operations on descriptions).

Lookups are done once per syscall; the only exception is F_SETFL{,W}, where
we recheck that descriptor is refering to the same thing before granting
the lock.

Again, POSIX is still underspecifying the semantics of shared descriptor
tables; back when the bulk of it had been written there had been no way
to have a descriptor -> description mapping changed under a syscall by
action of another thread.  Hell, they still hadn't picked on some things
that happened in early 80s, let alone early-to-mid 90s...

Linux and Solaris happen to cover these gaps differently; FreeBSD and
OpenBSD are probably closer to Linux variant, NetBSD - to Solaris one.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-22 Thread Alan Burlison

On 22/10/2015 05:21, Al Viro wrote:


Most of the work on using a file descriptor is local to the thread.


Using - sure, but what of cacheline dirtied every time you resolve a
descriptor to file reference?


Don't you have to do that anyway, to do anything useful with the file?


How much does it cover and to what
degree is that local to thread?  When it's a refcount inside struct file -
no big deal, we'll be reading the same cacheline anyway and unless several
threads are pounding on the same file with syscalls at the same time,
that won't be a big deal.  But when refcounts are associated with
descriptors...


There is a refcount in the struct held in the per-process list of open 
files and the 'slow path' processing is only taken if there's more than 
one LWP in the process that's accessing the file.



In case of Linux we have two bitmaps and an array of pointers associated
with descriptor table.  They grow on demand (in parallel)
* reserving a descriptor is done under ->file_lock (dropped/regained
around memory allocation if we end up expanding the sucker, actual reassignment
of pointers to array/bitmaps is under that spinlock)
* installing a pointer is lockless (we wait for ongoing resize to
settle, RCU takes care of the rest)
* grabbing a file by index is lockless as well
* removing a pointer is under ->file_lock, so's replacing it by dup2().


Is that table per-process or global?


The point is, dup2() over _unused_ descriptor is inherently racy, but dup2()
over a descriptor we'd opened and kept open should be safe.  As it is,
their implementation boils down to "userland must do process-wide exclusion
between *any* dup2() and all syscalls that might create a new descriptor -
open()/pipe()/socket()/accept()/recvmsg()/dup()/etc".  At the very least,
it's a big QoI problem, especially since such userland exclusion would have
to be taken around the operations that can block for a long time.  Sure,
POSIX wording regarding dup2 is so weak that this behaviour can be argued
to be compliant, but... replacement of the opened file associated with
newfd really ought to be atomic to be of any use for multithreaded processes.


There's existing language in the Issue 7 dup2() description that says 
dup2() has to be atomic:


"the dup2( ) function provides unique services, as no other
interface is able to atomically replace an existing file descriptor."

And there is some new language in Issue 7 Technical Corrigenda 2 that 
reinforces that, when it's talking about reassignment of 
stdin/stdout/stderr:


"Furthermore, a close() followed by a reopen operation (e.g. open(), 
dup() etc) is not atomic; dup2() should be used to change standard file 
descriptors."


I don't think that it's possible to claim that a non-atomic dup2() is 
POSIX-compliant.



IOW, if newfd is an opened descriptor prior to dup2() and no other thread
attempts to close it by any means, there should be no window during which
it would appear to be not opened.  Linux and FreeBSD satisfy that; OpenBSD
seems to do the same, from the quick look.  NetBSD doesn't, no idea about
Solaris.  FWIW, older NetBSD implementation (prior to "File descriptor changes,
discussed on tech-kern:" back in 2008) used to behave like OpenBSD one; it
had fixed a lot of crap, so it's entirely possible that OpenBSD simply has
kept the old implementation, with tons of other races in that area, but this
dup2() race got introduced in that rewrite.


Related to dup2(), there's some rather surprising behaviour on Linux. 
Here's the scenario:


--
ThreadA opens, listens and accepts on socket fd1, waiting for incoming 
connections.


ThreadB waits for a while, then opens normal file fd2 for read/write.
ThreadB uses dup2 to make fd1 a clone of fd2.
ThreadB closes fd2.

ThreadA remains sat in accept on fd1 which is now a plain file, not a 
socket.


ThreadB writes to fd1, the result of which appears in the file, so fd1 
is indeed operating as a plain file.


ThreadB exits. ThreadA is still sat in accept on fd1.

A connection is made to fd1 by another process. The accept call succeeds 
and returns the incoming  connection. fd1 is still operating as a 
socket, even though it's now actually a plain file.

--

I assume this is another consequence of the fact that threads waiting in 
accept don't get a notification if the fd they are using is closed, 
either directly by a call to close or by a syscall such as dup2. Not 
waking up other threads on a fd when it is closed seems like it's 
undesirable behaviour.


I can see the reasoning behind allowing shutdown to be used to do such a 
wakeup even if that's not POSIX-compliant - it may make it slightly 
easier for applications avoid fd recycling races. However the current 
situation is that shutdown is the *only* way to perform such a wakeup - 
simply closing the fd has no effect on any other threads. That seems 
incorrect.


--
Alan Burlison
--
--
To unsubscribe from this list: send the line 

Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-22 Thread Alan Burlison

On 22/10/15 19:16, Al Viro wrote:


Don't you have to do that anyway, to do anything useful with the file?


Dirtying the cacheline that contains struct file itself is different, but
that's not per-descriptor.


Yes, true enough.


Usually it's per-process, but any thread could ask for a private instance
to work with (and then spawn more threads sharing that instance - or getting
independent copies).


[snip]

Thanks again for the info, interesting. Solaris also does things along 
the same lines. In fact we recently extended posix_spawn so it could be 
used by the JVM to create subprocesses without wholescale copying, and 
Casper has done some optimisation work on posix_spawn as well.



I don't think that it's possible to claim that a non-atomic dup2()
is POSIX-compliant.


Except that it's in non-normative part of dup2(2), AFAICS.  I certainly
agree that it would be a standard lawyering beyond reason, but "not
possible to claim" is too optimistic.  Maybe I'm just more cynical...


Possibly so, and possibly justifiably so as well ;-)


ThreadA remains sat in accept on fd1 which is now a plain file, not
a socket.


No.  accept() is not an operation on file descriptors; it's an operation on
file descriptions (pardon for use of that terminology).  They are specified
by passing descriptors, but there's a hell of a difference between e.g.
dup() or fcntl(,F_SETFD,) (operations on descriptors) and read() or lseek()
(operations on descriptions).

Lookups are done once per syscall; the only exception is F_SETFL{,W}, where
we recheck that descriptor is refering to the same thing before granting
the lock.


Yes, but if you believe that dup2() requires an implicit close() within 
it and that dup2() has to be atomic then expecting that other threads 
waiting on the same fd in accept() will get a notification seems 
reasonable enough.



Again, POSIX is still underspecifying the semantics of shared descriptor
tables; back when the bulk of it had been written there had been no way
to have a descriptor -> description mapping changed under a syscall by
action of another thread.  Hell, they still hadn't picked on some things
that happened in early 80s, let alone early-to-mid 90s...


That's indisputably true - much of the POSIX behaviour predates threads 
and it shows, quite badly, in some places.



Linux and Solaris happen to cover these gaps differently; FreeBSD and
OpenBSD are probably closer to Linux variant, NetBSD - to Solaris one.


Yes, true enough.

--
Alan Burlison
--
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-21 Thread Eric Dumazet
On Wed, 2015-10-21 at 14:03 +0100, Alan Burlison wrote:
> On 21/10/2015 12:28, Eric Dumazet wrote:
> 
> > This works for me. Please double check your programs
> 
> I have just done so, it works as you say for AF_INET sockets but if you 
> switch to AF_UNIX sockets it does the wrong thing in the way I described.
> 

Oh well. Please try the following :


diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 94f658235fb4..24dec8bb571d 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -328,7 +328,8 @@ found:
 
 static inline int unix_writable(struct sock *sk)
 {
-   return (atomic_read(>sk_wmem_alloc) << 2) <= sk->sk_sndbuf;
+   return sk->sk_state != TCP_LISTEN &&
+  (atomic_read(>sk_wmem_alloc) << 2) <= sk->sk_sndbuf;
 }
 
 static void unix_write_space(struct sock *sk)


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-21 Thread Alan Burlison

On 21/10/2015 04:49, Al Viro wrote:

Firstly, thank you for the comprehensive and considered reply.


Refcount is an implementation detail, of course.  However, in any Unix I know
of, there are two separate notions - descriptor losing connection to opened
file (be it from close(), exit(), execve(), dup2(), etc.) and opened file
getting closed.


Yep, it's an implementation detail inside the kernel - Solaris also has 
a refcount inside its vnodes. However that's really only dimly visible 
at the process level, where all you have is an integer file ID.



The latter cannot happen while there are descriptors connected to the
file in question, of course.  However, that is not the only thing
that might prevent an opened file from getting closed - e.g. sending an
SCM_RIGHTS datagram with attached descriptor connected to the opened file
in question *at* *the* *moment* *of* *sendmsg(2)* will carry said opened
file until it is successfully received or discarded (in the former case
recepient will get a new descriptor refering to that opened file, of course).
Having the original descriptor closed right after sendmsg(2) does *not*
do anything to opened file.  On any Unix that implements descriptor-passing.


I believe async IO data is another way that a file can remain live after 
a close(), from the close() section of IEEE Std 1003.1:


"An I/O operation that is not canceled completes as if the close() 
operation had not yet occurred"



There's going to be a notion of "last close"; that's what this refcount is
about and _that_ is more than implementation detail.


Yes, POSIX distinguishes between "file descriptor" and "file 
description" (ugh!) and the close() page says:


"When all file descriptors associated with an open file description have 
been closed, the open file description shall be freed."


In the context of this discussion I believe it's the behaviour of the 
integer file descriptor that's the issue. Once it's had close() called 
on it then it's invalid, and any IO on it should fail, even if the 
underlying file description is still 'live'.



In other words, is that destruction of
* any descriptor refering to this socket [utterly insane for obvious
reasons]
* the last descriptor refering to this socket (modulo descriptor
passing, etc.) [a bitch to implement, unless we treat a syscall in progress
as keeping the opened file open], or
* _the_ descriptor used to issue accept(2) [a bitch to implement,
with a lot of fun races in an already race-prone area]?


From reading the POSIX close() page I believe the second option is the 
correct one.



Additional question is whether it's
* just a magical behaviour of close(2) [ugly], or
* something that happens when descriptor gets dissociated from
opened file [obviously more consistent]?


The second, I believe.


BTW, for real fun, consider this:
7)
// fd is a socket
fd2 = dup(fd);
in thread A: accept(fd);
in thread B: accept(fd);
in thread C: accept(fd2);
in thread D: close(fd);

Which threads (if any), should get hit where it hurts?


A & B should return from the accept with an error. C should continue. 
Which is what happens on Solaris.



I have no idea what semantics does Solaris have in that area and how racy
their descriptor table handling is.  And no, I'm not going to RTFS their
kernel, CDDL being what it is.


I can answer that for you :-) I've looked through the appropriate bits 
of the Solaris kernel code and my colleague Casper has written an 
excellent summary of what happens, so with his permission I've just 
copied it verbatim below:


--
Since at least Solaris 7 (1998), a thread which is sleeping
on a file descriptor which is being closed by another thread,
will be woken up.

To this end each thread keeps a list of file descriptors
in use by the current active system call.

When a file descriptor is closed and this file descriptor
is marked as being in use by other threads, the kernel
will search all threads to see which have this file descriptor
listed as in use. For each such thread, the kernel tells
the thread that its active fds list is now stale and, if
possible, makes the thread run.

While this algorithm is pretty expensive, it is not often invoked.

The thread running close() will NOT return until all other threads
using that filedescriptor have released it.

When run, the thread will return from its syscall and will in most cases
return EBADF. A second thread trying to close this same file descriptor
may return earlier with close() returning EBADF.
--

--
Alan Burlison
--
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-21 Thread Al Viro
On Wed, Oct 21, 2015 at 10:33:04PM +0200, casper@oracle.com wrote:
> 
> >On Wed, Oct 21, 2015 at 03:38:51PM +0100, Alan Burlison wrote:
> >
> >> >There's going to be a notion of "last close"; that's what this refcount is
> >> >about and _that_ is more than implementation detail.
> >> 
> >> Yes, POSIX distinguishes between "file descriptor" and "file
> >> description" (ugh!) and the close() page says:
> >
> >Would've been better if they went for something like "IO channel" for
> >the latter ;-/
> 
> Or at least some other word.  A file descriptor is just an index to
> a list of file pointers (and wasn't named so?)

*nod*

There's no less than 3 distinct notions associated with the word "file" -
"file as collection of bytes on filesystem", "opened file as IO channel" and
"file descriptor", all related ;-/  "File description" vs. "file descriptor"
is atrociously bad terminology.

> >Unless they want to consider in-flight descriptor-passing datagrams as
> >collections of file descriptors, the quoted sentence is seriously misleading.
> >And then there's mmap(), which they do kinda-sorta mention...
> 
> Well, a file descriptor really only exists in the context of a process; 
> in-flight it is no longer a file descriptor as there process context with 
> a file descriptor table; so pointers to file descriptions are passed 
> around.

Yes.  Note, BTW, that descriptor contains a bit more than a pointer - there
are properties associated with it (close-on-exec and is-it-already-claimed),
which makes abusing it for describing SCM_RIGHTS payloads even more of a
stretch.  IOW, description of semantics for close() and friends needs fixing -
it simply does not match the situation on anything that would be anywhere
near POSIX compliance in other areas.
 
> >Sure, but the upkeep of data structures it would need is there
> >whether you actually end up triggering it or not.  Both in
> >memory footprint and in cacheline pingpong...
> 
> Most of the work on using a file descriptor is local to the thread.

Using - sure, but what of cacheline dirtied every time you resolve a
descriptor to file reference?  How much does it cover and to what
degree is that local to thread?  When it's a refcount inside struct file -
no big deal, we'll be reading the same cacheline anyway and unless several
threads are pounding on the same file with syscalls at the same time,
that won't be a big deal.  But when refcounts are associated with
descriptors...

In case of Linux we have two bitmaps and an array of pointers associated
with descriptor table.  They grow on demand (in parallel)
* reserving a descriptor is done under ->file_lock (dropped/regained
around memory allocation if we end up expanding the sucker, actual reassignment
of pointers to array/bitmaps is under that spinlock)
* installing a pointer is lockless (we wait for ongoing resize to
settle, RCU takes care of the rest)
* grabbing a file by index is lockless as well
* removing a pointer is under ->file_lock, so's replacing it by dup2().

Grabbing a file by descriptor follows pointer from task_struct to descriptor
table, from descriptor table to element of array of pointers (embedded when
we have few descriptors, but becomes separately allocated when more is
needed), and from array element to struct file.  In struct file we fetch
->f_mode and (if descriptor table is shared) atomically increment ->f_count.

For comparison, NetBSD has an extra level of indirection (with similar
tricks for embedding them while there are few descriptors), with a lot
fatter structure around the pointer to file - they keep close-on-exec and
in-use in there, along with refcount and their equivalent of waitqueue.
These structures, once they grow past the embedded set, are allocated
one-by-one, so copying the table on fork() costs a _lot_ more.  Rather
than an array of pointers to files they have an array of pointers to
those guys.  Reserving a descriptor triggers allocation of new struct
fdfile and installing a pointer to it into the array.  Allows for slightly
simpler installing of pointer to file afterwards - unlike us, they don't
have to be careful about array resize happening in parallel.

Grabbing a file by index is lockless, so's installing a pointer to file.
Reserving a descriptor is under ->fd_lock (mutex rather than a spinlock).
Removing a pointer is under ->fd_lock, so's replacing it by
dup2(), but dup2() has an unpleasant race (see upthread).

They do the same amount of pointer-chasing on lookup proper, but only because
they do not look into struct file itself there.  Which happens immediately
afterwards, since callers *will* look into what they've got.  I didn't look
into the details of barrier use in there, but it looks generally sane.

Cacheline pingpong is probably not a big deal there, but only because these
structures are fat and scattered.  Another fun issue is that they have
in-use bits buried deep, which means that they need to mirror them in
a separate 

Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-21 Thread Eric Dumazet
On Wed, 2015-10-21 at 15:38 +0100, Alan Burlison wrote:

> --
> Since at least Solaris 7 (1998), a thread which is sleeping
> on a file descriptor which is being closed by another thread,
> will be woken up.
> 
> To this end each thread keeps a list of file descriptors
> in use by the current active system call.

Ouch.

> 
> When a file descriptor is closed and this file descriptor
> is marked as being in use by other threads, the kernel
> will search all threads to see which have this file descriptor
> listed as in use. For each such thread, the kernel tells
> the thread that its active fds list is now stale and, if
> possible, makes the thread run.
> 

This is what I feared.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-21 Thread Al Viro
On Wed, Oct 21, 2015 at 03:38:51PM +0100, Alan Burlison wrote:

> >There's going to be a notion of "last close"; that's what this refcount is
> >about and _that_ is more than implementation detail.
> 
> Yes, POSIX distinguishes between "file descriptor" and "file
> description" (ugh!) and the close() page says:

Would've been better if they went for something like "IO channel" for
the latter ;-/

> "When all file descriptors associated with an open file description
> have been closed, the open file description shall be freed."

BTW, is SCM_RIGHTS outside of scope?  They do mention it in one place
only:
| Ancillary data is also possible at the socket level. The 
| header shall define the following symbolic constant for use as the cmsg_type
| value when cmsg_level is SOL_SOCKET:
|
| SCM_RIGHTS
| Indicates that the data array contains the access rights to be sent or
| received.

with no further details whatsoever.  It's been there since at least 4.3-Reno;
does anybody still use the older variant (->msg_accrights, that is)?  IIRC,
there was some crap circa 2.6 when Solaris used to do ->msg_accrights for
descriptor-passing, but more or less current versions appear to support
SCM_RIGHTS...  In any case, descriptor-passing had been there in some form
since at least '83 (the old variant is already present in 4.2) and considering
it out-of-scope for POSIX is bloody ridiculous, IMO.

Unless they want to consider in-flight descriptor-passing datagrams as
collections of file descriptors, the quoted sentence is seriously misleading.
And then there's mmap(), which they do kinda-sorta mention...

> >In other words, is that destruction of
> > * any descriptor refering to this socket [utterly insane for obvious
> >reasons]
> > * the last descriptor refering to this socket (modulo descriptor
> >passing, etc.) [a bitch to implement, unless we treat a syscall in progress
> >as keeping the opened file open], or
> > * _the_ descriptor used to issue accept(2) [a bitch to implement,
> >with a lot of fun races in an already race-prone area]?
> 
> From reading the POSIX close() page I believe the second option is
> the correct one.

Er...  So fd2 = dup(fd);accept(fd)/close(fd) should *not* trigger that
behaviour, in your opinion?  Because fd is sure as hell not the last
descriptor refering to that socket - fd2 remains alive and well.

Behaviour you describe below matches the _third_ option.

> >BTW, for real fun, consider this:
> >7)
> >// fd is a socket
> >fd2 = dup(fd);
> >in thread A: accept(fd);
> >in thread B: accept(fd);
> >in thread C: accept(fd2);
> >in thread D: close(fd);
> >
> >Which threads (if any), should get hit where it hurts?
> 
> A & B should return from the accept with an error. C should
> continue. Which is what happens on Solaris.

> To this end each thread keeps a list of file descriptors
> in use by the current active system call.

Yecc...  How much cross-CPU traffic does that cause on
multithread processes?  Not on close(2), on maintaining the
descriptor use counts through the normal syscalls.

> When a file descriptor is closed and this file descriptor
> is marked as being in use by other threads, the kernel
> will search all threads to see which have this file descriptor
> listed as in use. For each such thread, the kernel tells
> the thread that its active fds list is now stale and, if
> possible, makes the thread run.
>
> While this algorithm is pretty expensive, it is not often invoked.

Sure, but the upkeep of data structures it would need is there
whether you actually end up triggering it or not.  Both in
memory footprint and in cacheline pingpong...

Besides, the list of threads using given descriptor table also needs
to be maintained, unless you scan *all* threads in the system (which
would be quite a fun wrt latency and affect a lot more than just the
process doing something dumb and rare).

BTW, speaking of fun races: AFAICS, NetBSD dup2() isn't atomic.  It
calls fd_close() outside of ->fd_lock (has to, since fd_close() is
grabbing that itself), so another thread doing e.g.  fcntl(newfd, F_GETFD)
in the middle of dup2(oldfd, newfd) might end up with EBADF, even though
both before and after dup2() newfd had been open.  What's worse,
thread A:
fd1 = open("foo", ...);
fd2 = open("bar", ...);
...
dup2(fd1, fd2);
thread B:
fd = open("baz", ...);
might, AFAICS, end up with fd == fd2 and refering to foo instead of baz.
All it takes is the last open() managing to grab ->fd_lock just as fd_close()
from dup2() has dropped it.  Which is an unexpected behaviour, to put it
mildly, no matter how much standard lawyering one applies...
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-21 Thread Casper . Dik

>On Wed, Oct 21, 2015 at 03:38:51PM +0100, Alan Burlison wrote:
>
>> >There's going to be a notion of "last close"; that's what this refcount is
>> >about and _that_ is more than implementation detail.
>> 
>> Yes, POSIX distinguishes between "file descriptor" and "file
>> description" (ugh!) and the close() page says:
>
>Would've been better if they went for something like "IO channel" for
>the latter ;-/

Or at least some other word.  A file descriptor is just an index to
a list of file pointers (and wasn't named so?)

>> "When all file descriptors associated with an open file description
>> have been closed, the open file description shall be freed."
>
>BTW, is SCM_RIGHTS outside of scope?  They do mention it in one place
>only:
>| Ancillary data is also possible at the socket level. The 
>| header shall define the following symbolic constant for use as the cmsg_type
>| value when cmsg_level is SOL_SOCKET:
>|
>| SCM_RIGHTS
>| Indicates that the data array contains the access rights to be sent or
>| received.
>
>with no further details whatsoever.  It's been there since at least 4.3-Reno;
>does anybody still use the older variant (->msg_accrights, that is)?  IIRC,
>there was some crap circa 2.6 when Solaris used to do ->msg_accrights for
>descriptor-passing, but more or less current versions appear to support
>SCM_RIGHTS...  In any case, descriptor-passing had been there in some form
>since at least '83 (the old variant is already present in 4.2) and considering
>it out-of-scope for POSIX is bloody ridiculous, IMO.

SCM_RIGHTS was introduced as part of the POSIX standardization of BSD 
sockets.  Looks like they became part of Solaris 2.6, but the default
was non-standard sockets so you may easily find msg->accrights but not
SCM_RIGHTS.

msg_accrights is what was introduced in BSD in likely the first 
implementation of socket-based file descriptor passing.

SysV has its own file descriptor passing on file descriptors passing.

But that interface is too much ad-hoc, so SCM_RIGHTS is a standardizations 
allowing multiple types of messages to be send around

>Unless they want to consider in-flight descriptor-passing datagrams as
>collections of file descriptors, the quoted sentence is seriously misleading.
>And then there's mmap(), which they do kinda-sorta mention...

Well, a file descriptor really only exists in the context of a process; 
in-flight it is no longer a file descriptor as there process context with 
a file descriptor table; so pointers to file descriptions are passed 
around.

>> >In other words, is that destruction of
>> >* any descriptor refering to this socket [utterly insane for obvious
>> >reasons]
>> >* the last descriptor refering to this socket (modulo descriptor
>> >passing, etc.) [a bitch to implement, unless we treat a syscall in progress
>> >as keeping the opened file open], or
>> >* _the_ descriptor used to issue accept(2) [a bitch to implement,
>> >with a lot of fun races in an already race-prone area]?
>> 
>> From reading the POSIX close() page I believe the second option is
>> the correct one.
>
>Er...  So fd2 = dup(fd);accept(fd)/close(fd) should *not* trigger that
>behaviour, in your opinion?  Because fd is sure as hell not the last
>descriptor refering to that socket - fd2 remains alive and well.
>
>Behaviour you describe below matches the _third_ option.

>> >BTW, for real fun, consider this:
>> >7)
>> >// fd is a socket
>> >fd2 = dup(fd);
>> >in thread A: accept(fd);
>> >in thread B: accept(fd);
>> >in thread C: accept(fd2);
>> >in thread D: close(fd);
>> >
>> >Which threads (if any), should get hit where it hurts?
>> 
>> A & B should return from the accept with an error. C should
>> continue. Which is what happens on Solaris.
>
>> To this end each thread keeps a list of file descriptors
>> in use by the current active system call.
>
>Yecc...  How much cross-CPU traffic does that cause on
>multithread processes?  Not on close(2), on maintaining the
>descriptor use counts through the normal syscalls.

In the Solaris implementation is pretty much what we do;
but there is no much cross-CPU traffic.  Of course, you will need
to keep locks in the file descriptor table if only to find
the actual file pointer.

The work is done only in the case of a badly written application
where close is required to hunt down all threads currently using
the specific file descriptor.

>> When a file descriptor is closed and this file descriptor
>> is marked as being in use by other threads, the kernel
>> will search all threads to see which have this file descriptor
>> listed as in use. For each such thread, the kernel tells
>> the thread that its active fds list is now stale and, if
>> possible, makes the thread run.
>>
>> While this algorithm is pretty expensive, it is not often invoked.
>
>Sure, but the upkeep of data structures it would need is there
>whether you actually end up triggering it or not.  Both in
>memory footprint and in cacheline pingpong...

Most of the 

Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-21 Thread Alan Burlison

On 21/10/2015 19:51, Al Viro wrote:


BTW, is SCM_RIGHTS outside of scope?  They do mention it in one place
only:
| Ancillary data is also possible at the socket level. The 
| header shall define the following symbolic constant for use as the cmsg_type
| value when cmsg_level is SOL_SOCKET:
|
| SCM_RIGHTS
| Indicates that the data array contains the access rights to be sent or
| received.


That's still exactly the same in Issue 7, Technical Corrigenda 2 which 
is in final review at the moment.



In other words, is that destruction of
* any descriptor refering to this socket [utterly insane for obvious
reasons]
* the last descriptor refering to this socket (modulo descriptor
passing, etc.) [a bitch to implement, unless we treat a syscall in progress
as keeping the opened file open], or
* _the_ descriptor used to issue accept(2) [a bitch to implement,
with a lot of fun races in an already race-prone area]?


 From reading the POSIX close() page I believe the second option is
the correct one.


Er...  So fd2 = dup(fd);accept(fd)/close(fd) should *not* trigger that
behaviour, in your opinion?  Because fd is sure as hell not the last
descriptor refering to that socket - fd2 remains alive and well.

Behaviour you describe below matches the _third_ option.


Ah, I wasn't 100% sure what the intended difference between #2 and #3 
was TBH, it does sound like I meant #3, yes :-)


--
Alan Burlison
--
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-21 Thread Alan Burlison

On 21/10/2015 12:28, Eric Dumazet wrote:


This works for me. Please double check your programs


I have just done so, it works as you say for AF_INET sockets but if you 
switch to AF_UNIX sockets it does the wrong thing in the way I described.


--
Alan Burlison
--
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-21 Thread David Laight
From: Alan Burlison
> Sent: 20 October 2015 19:31
...
> The problem with poll() is that it returns immediately when passed a FD
> that is in the listening state. rather than waiting until there's an
> incoming connection to handle. As I said, that means you can't use
> poll() to multiplex between read/write/accept sockets.

That seems to work for me...

David

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-21 Thread Alan Burlison

On 21/10/2015 11:25, David Laight wrote:


The problem with poll() is that it returns immediately when passed a FD
that is in the listening state. rather than waiting until there's an
incoming connection to handle. As I said, that means you can't use
poll() to multiplex between read/write/accept sockets.


That seems to work for me...


In my test case I was setting all the available event bits in 
pollfd.events to see what came back. With poll() on a listen() socket 
you get an immediate return with bits set in revents indicating the 
socket is available for output, which of course it isn't. Indeed an 
attempt to write to it fails. If you remove the output event bits in 
pollfd.events then the poll() waits as expected until there's an 
incoming connection on the socket.


I suppose one answer is "Well, don't do that then" but returning an 
output indication on a socket that's in listen() seems rather odd.


With POLLOUT|POLLWRNORM|POLLWRBAND:

main: polling #1
[returns immediately]
main: poll #1: Success
poll fd: 0 revents: POLLOUT POLLWRBAND
main: write #1: Transport endpoint is not connected

Without POLLOUT|POLLWRNORM|POLLWRBAND:

main: polling #1
[waits for connection]
main: poll #1: Success
poll fd: 0 revents: POLLIN POLLRDNORM

--
Alan Burlison
--
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-21 Thread Eric Dumazet
On Wed, 2015-10-21 at 11:49 +0100, Alan Burlison wrote:
> On 21/10/2015 11:25, David Laight wrote:
> 
> >> The problem with poll() is that it returns immediately when passed a FD
> >> that is in the listening state. rather than waiting until there's an
> >> incoming connection to handle. As I said, that means you can't use
> >> poll() to multiplex between read/write/accept sockets.
> >
> > That seems to work for me...
> 
> In my test case I was setting all the available event bits in 
> pollfd.events to see what came back. With poll() on a listen() socket 
> you get an immediate return with bits set in revents indicating the 
> socket is available for output, which of course it isn't. Indeed an 
> attempt to write to it fails. If you remove the output event bits in 
> pollfd.events then the poll() waits as expected until there's an 
> incoming connection on the socket.
> 
> I suppose one answer is "Well, don't do that then" but returning an 
> output indication on a socket that's in listen() seems rather odd.
> 
> With POLLOUT|POLLWRNORM|POLLWRBAND:
> 
> main: polling #1
> [returns immediately]
> main: poll #1: Success
> poll fd: 0 revents: POLLOUT POLLWRBAND
> main: write #1: Transport endpoint is not connected
> 
> Without POLLOUT|POLLWRNORM|POLLWRBAND:
> 
> main: polling #1
> [waits for connection]
> main: poll #1: Success
> poll fd: 0 revents: POLLIN POLLRDNORM
> 

This works for me. Please double check your programs

242046 poll([{fd=3, events=POLLIN|POLLOUT|POLLWRNORM|POLLWRBAND}], 1, 
4294967295 
242046 <... poll resumed> ) = 1 ([{fd=3, revents=POLLIN}])
242046 accept(3, {sa_family=AF_INET6, sin6_port=htons(35888), 
inet_pton(AF_INET6, ":::10.246.7.151", _addr), sin6_flowinfo=0, 
sin6_scope_id=0}, [28]) = 4
242046 close(4) = 0
242046 poll([{fd=3, events=POLLIN|POLLOUT|POLLWRNORM|POLLWRBAND}], 1, 
4294967295) = 1 ([{fd=3, revents=POLLIN}])
242046 accept(3, {sa_family=AF_INET6, sin6_port=htons(35890), 
inet_pton(AF_INET6, ":::10.246.7.151", _addr), sin6_flowinfo=0, 
sin6_scope_id=0}, [28]) = 4
242046 close(4) = 0
242046 poll([{fd=3, events=POLLIN|POLLOUT|POLLWRNORM|POLLWRBAND}], 1, 
4294967295) = 1 ([{fd=3, revents=POLLIN}])
242046 accept(3, {sa_family=AF_INET6, sin6_port=htons(35892), 
inet_pton(AF_INET6, ":::10.246.7.151", _addr), sin6_flowinfo=0, 
sin6_scope_id=0}, [28]) = 4
242046 close(4) = 0
242046 poll([{fd=3, events=POLLIN|POLLOUT|POLLWRNORM|POLLWRBAND}], 1, 
4294967295) = 1 ([{fd=3, revents=POLLIN}])
242046 accept(3, {sa_family=AF_INET6, sin6_port=htons(35894), 
inet_pton(AF_INET6, ":::10.246.7.151", _addr), sin6_flowinfo=0, 
sin6_scope_id=0}, [28]) = 4
242046 close(4) = 0

poll() only cares about POLLIN | POLLRDNORM for a listener.


static inline unsigned int inet_csk_listen_poll(const struct sock *sk)
{
return !reqsk_queue_empty(_csk(sk)->icsk_accept_queue) ?
(POLLIN | POLLRDNORM) : 0;
}

if (sk->sk_state == TCP_LISTEN)
return inet_csk_listen_poll(sk);



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-20 Thread Alan Burlison

On 20/10/2015 02:45, Eric Dumazet wrote:


On Tue, 2015-10-20 at 02:12 +0100, Alan Burlison wrote:


Another problem is that if I call close() on a Linux socket that's in
accept() the accept call just sits there until there's an incoming
connection, which succeeds even though the socket is supposed to be
closed, but then an immediately following accept() on the same socket
fails.


This is exactly what the comment I pasted documents.


Yes, and it's caveated with "temporary solution" and "_not_ a good idea" 
and says that the problem is that close() needs repairing. In other 
works, the change in shutdown() behaviour appears to be a workaround for 
an acknowledged bug, and it's one with consequences.


There are two separate things here.

The first is the close/reopen race issue with filehandles. As I said, I 
believe that's an artefact of history, because it wasn't possible for it 
to happen before threads and was possible after them. There appears to 
be no good way to avoid this with the current *NIX filehandle semantics 
when threads are being used. History sucks.


The second is how/if you might work around that. As far as I can tell, 
Linux does so by allowing shutdown() to be called on unconnected sockets 
and uses that as a signal that threads waiting in accept() should return 
from the accept with a failure but without the filehandle being actually 
closed, and therefore not being available for reuse, and therefore not 
subject to potential races.  However by doing so I believe the behaviour 
of shutdown is then not POSIX-conforming. The Linux manpage for 
shutdown(2) says "CONFORMING TO POSIX.1-2001", as far as I can tell it 
isn't. At very least I believe the manpage needs changing.



On linux, doing close(listener) on one thread does _not_ wakeup other
threads doing accept(listener)


Allowing an in-progress accept() to continue and to succeed at some 
point in the distant future on a filehandle that's closed seems incorrect.


Also, the behaviour of poll() that I mentioned - that it returns 
immediately for a socket in the listen() state - does seem like an 
out-and-out bug to me, I haven't seen any explanation of why that might 
be correct behaviour.



So I guess allowing shutdown(listener) was a way to somehow propagate
some info on the threads stuck in accept()


Yes, I think you are right.


This is a VFS issue, and a long standing one.


That seems to be what the comment you quoted is saying, yes.


Think of all cases like dup() and fd passing games, and the close(fd)
being able to signal out of band info is racy.


Yes, I agree there are potential race conditions, intrinsic to the way 
*NIX FDs work. As I said earlier, there *may* be a portable way around 
this using /dev/null & dup2() but I haven't had chance to investigate 
that yet.



close() is literally removing one ref count on a file.
Expecting it doing some kind of magical cleanup of a socket is not
reasonable/practical.


I'm not sure what that means at an application level - no matter how 
many copies I make of a FD in a process it's still just an integer and 
calling close on it closes it and causes any future IOs to fail - except 
for the case of sockets in accept() it seems, which continue and may 
even eventually succeed. Leaving aside the behaviour of shutdown() on 
listening sockets, the current behaviour of close() on a socket in 
accept() seems incorrect. And then of course there's also the poll() issue.



On a multi threaded program, each thread doing an accept() increased the
refcount on the file.


That may be how Linux implements accept(), but I don't see anything 
about refcounting in the POSIX spec for accept().



Really, I have no idea of how Solaris coped with this, and I do not want
to know.


The bug goes into quite some detail about how Solaris behaves. The issue 
here is that we have two implementations, Linux and Solaris, both 
claiming to be POSIX-conformant but both showing different behaviour. 
There's a discussion to be had about the whys and wherefores of that 
difference, but saying that you don't want to know how Solaris behaves 
isn't really going to help move the conversation along.


--
Alan Burlison
--
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-20 Thread Alan Burlison

On 20/10/2015 16:30, Eric Dumazet wrote:


A close() does _not_ wakeup an accept() or a poll() (this is exactly the
same problem), or any blocking system call using the same 'file'


Not waking up the accept() is one thing, allowing the accept() to 
successfully complete some indeterminate time after the socket has been 
closed is something else entirely. You shouldn't have to call shutdown() 
to terminate an accept() on a socket, close() should suffice. Yes, if 
you want to use the shutdown() 'feature' to kick the accept() thread out 
of the call to accept() without closing the FD and you don't care about 
cross-platform compatibility then you can call shutdown() followed by 
close(). However that's only ever required on Linux as far as I can 
tell, and even on Linux applications that deal with the thread race by 
other means shouldn't be forced to use shutdown() when just close() 
would suffice.


The problem with poll() is that it returns immediately when passed a FD 
that is in the listening state. rather than waiting until there's an 
incoming connection to handle. As I said, that means you can't use 
poll() to multiplex between read/write/accept sockets.



close() man page states :

NOTES
   It  is  probably unwise to close file descriptors while they may be in
   use by system calls in other threads in the  same  process.   Since  a
   file  descriptor may be reused, there are some obscure race conditions
   that may cause unintended side effects.

You are in this grey zone.


No, the race issue with file descriptor reuse and the close() behaviour 
are not the same thing. The manpage comment is correct, but not relevant.


--
Alan Burlison
--
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-20 Thread Eric Dumazet
On Tue, 2015-10-20 at 19:31 +0100, Alan Burlison wrote:

> No, the race issue with file descriptor reuse and the close() behaviour 
> are not the same thing. The manpage comment is correct, but not relevant.

Ok, it seems you know better than me, I will be stop the discussion and
wait for your patches.




--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-20 Thread Eric Dumazet
On Tue, 2015-10-20 at 14:45 +0100, Alan Burlison wrote:

> And I still haven't seen any reasoning behind the Linux close() and 
> poll() behaviour on sockets that are in the listen state.

Same answer.

A close() does _not_ wakeup an accept() or a poll() (this is exactly the
same problem), or any blocking system call using the same 'file'

This is the choice Linus Torvalds and Al Viro did years ago.

It is set in stone, unless someone comes up with a very nice patch set,
that does not break existing applications.

close() man page states :

NOTES
  It  is  probably unwise to close file descriptors while they may be in
  use by system calls in other threads in the  same  process.   Since  a
  file  descriptor may be reused, there are some obscure race conditions
  that may cause unintended side effects.

You are in this grey zone.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-20 Thread Al Viro
On Mon, Oct 19, 2015 at 06:45:32PM -0700, Eric Dumazet wrote:
> On Tue, 2015-10-20 at 02:12 +0100, Alan Burlison wrote:
> 
> > Another problem is that if I call close() on a Linux socket that's in 
> > accept() the accept call just sits there until there's an incoming 
> > connection, which succeeds even though the socket is supposed to be 
> > closed, but then an immediately following accept() on the same socket 
> > fails. 
> 
> This is exactly what the comment I pasted documents.
> 
> On linux, doing close(listener) on one thread does _not_ wakeup other
> threads doing accept(listener)
> 
> So I guess allowing shutdown(listener) was a way to somehow propagate
> some info on the threads stuck in accept()
> 
> This is a VFS issue, and a long standing one.
> 
> Think of all cases like dup() and fd passing games, and the close(fd)
> being able to signal out of band info is racy.
> 
> close() is literally removing one ref count on a file.

> Expecting it doing some kind of magical cleanup of a socket is not
> reasonable/practical.
> 
> On a multi threaded program, each thread doing an accept() increased the
> refcount on the file.

Refcount is an implementation detail, of course.  However, in any Unix I know
of, there are two separate notions - descriptor losing connection to opened
file (be it from close(), exit(), execve(), dup2(), etc.) and opened file
getting closed.

The latter cannot happen while there are descriptors connected to the
file in question, of course.  However, that is not the only thing
that might prevent an opened file from getting closed - e.g. sending an
SCM_RIGHTS datagram with attached descriptor connected to the opened file
in question *at* *the* *moment* *of* *sendmsg(2)* will carry said opened
file until it is successfully received or discarded (in the former case
recepient will get a new descriptor refering to that opened file, of course).
Having the original descriptor closed right after sendmsg(2) does *not*
do anything to opened file.  On any Unix that implements descriptor-passing.

There's going to be a notion of "last close"; that's what this refcount is
about and _that_ is more than implementation detail.

The real question is what kind of semantics would one want in the following
situations:

1)
// fd is a socket
fcntl(fd, F_SETFD, FD_CLOEXEC);
fork();
in parent: accept(fd);
in child: execve()

2)
// fd is a socket, 1 is /dev/null
fork();
in parent: accept(fd);
in child: dup2(1, fd);

3)
// fd is a socket
fd2 = dup(fd);
in thread A: accept(fd);
in thread B: close(fd);

4)
// fd is a socket, 1 is /dev/null
fd2 = dup(fd);
in thread A: accept(fd);
in thread B: dup2(1, fd);

5)
// fd is a socket, 1 is /dev/null
fd2 = dup(fd);
in thread A: accept(fd);
in thread B: close(fd2);

6)
// fd is a socket
in thread A: accept(fd);
in thread B: close(fd);

In other words, is that destruction of
* any descriptor refering to this socket [utterly insane for obvious
reasons]
* the last descriptor refering to this socket (modulo descriptor
passing, etc.) [a bitch to implement, unless we treat a syscall in progress
as keeping the opened file open], or
* _the_ descriptor used to issue accept(2) [a bitch to implement,
with a lot of fun races in an already race-prone area]?
Additional question is whether it's
* just a magical behaviour of close(2) [ugly], or
* something that happens when descriptor gets dissociated from
opened file [obviously more consistent]?

BTW, for real fun, consider this:
7)
// fd is a socket
fd2 = dup(fd);
in thread A: accept(fd);
in thread B: accept(fd);
in thread C: accept(fd2);
in thread D: close(fd);

Which threads (if any), should get hit where it hurts?

I honestly don't know what Solaris does; AFAICS, FreeBSD behaves like Linux
these days.  NetBSD plays really weird games in their fd_close(); what
they are trying to achieve is at least sane - in (7) they'd hit A and B with
EBADF and C would restart and continue waiting, in (3,4,6) A gets EBADF,
in (1,2,5) accept() is unaffected.  The problem is that their solution is
racy - they have a separate refcount on _descriptor_, plus a file method
(->fo_restart) for triggering an equivalent of signal interrupting anything
that might be blocked on that sucker, with syscall restart (and subsequent
EBADF on attempt to refetch the sucker.  Racy if we reopen or are doing dup2()
in the first place - these restarts might get CPU just after we return from
dup2() and pick the *new* descriptor just fine.  It might be possible to fix
their approach (having
if (__predict_false(ff->ff_file == NULL)) {
/*
 * Another user of the file is already closing, and is
 * waiting for other users of the file to drain.  Release
 * our reference, and wake up the closer.
 */
atomic_dec_uint(>ff_refcnt);
cv_broadcast(>ff_closing);
path in fd_close() mark the thread as "don't bother restarting, 

Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-20 Thread Eric Dumazet
On Tue, 2015-10-20 at 10:59 +0100, Alan Burlison wrote:

> That may be how Linux implements accept(), but I don't see anything 
> about refcounting in the POSIX spec for accept().

That's an internal implementation detail. POSIX does not document linux
kernel overall design and specific tricks.

linux is GPL, while Solaris is proprietary code. There is quite a
difference, and we do not want to copy Solaris behavior. We want our own
way, practical, and good enough.

If POSIX makes sense we try to be compliant. If not, we do not.

If you are interested, take a look at fs/* code, and try to implement
your proposal and keep good performance.

You might find a clever way, without infringing prior art. We did not
yet.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-20 Thread Alan Burlison

On 20/10/2015 14:19, Eric Dumazet wrote:


That may be how Linux implements accept(), but I don't see anything
about refcounting in the POSIX spec for accept().


That's an internal implementation detail. POSIX does not document linux
kernel overall design and specific tricks.


No, neither does it document Solaris kernel design, or *BSD design, or 
Windows design. It does however specify what the externally observable 
behaviour has to be in order to be POSIX compliant.



linux is GPL, while Solaris is proprietary code. There is quite a
difference, and we do not want to copy Solaris behavior. We want our own
way, practical, and good enough.


I don't see what the licensing terms of particular implementations have 
to do with POSIX, indeed as far as I can tell POSIX says nothing at all 
about the subject. It's therefore not pertinent to this discussion.


I'm not expecting Linux to copy every Solaris behaviour, if I was I 
might for example be suggesting that Linux dropped support for the 
SO_RCVTIMEO and SO_SNDTIMEO setsockopt() options on AF_UNIX sockets, 
because Solaris doesn't currently implement those options. That would 
clearly be a ridiculous stance for me to take, what I've actually done 
is logged a bug against Solaris because it's clearly something we need 
to fix.


What I do think is reasonable is that if Linux claims POSIX conformance 
then it either conforms or documents the variant behaviour and as I've 
said I don't believe is does conform in the case of shutdown().



If POSIX makes sense we try to be compliant. If not, we do not.


That's one possible design option. However in this case the Linux 
manpage claims the Linux behaviour is POSIX compliant and as far as I 
can tell it isn't. As I've already said several times, I agree there's 
probably not much that can be done about it without causing breakage, 
which is why I suggested simply documenting the behaviour may be the 
best option.


And I still haven't seen any reasoning behind the Linux close() and 
poll() behaviour on sockets that are in the listen state.



If you are interested, take a look at fs/* code, and try to implement
your proposal and keep good performance.

You might find a clever way, without infringing prior art. We did not
yet.


I might, but contractually I can't, unfortunately.

--
Alan Burlison
--
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-19 Thread Eric Dumazet
On Mon, 2015-10-19 at 09:59 -0700, Stephen Hemminger wrote:
> This looks like corner case, but worth forwarding.
> 
> Begin forwarded message:
> 
> Date: Mon, 19 Oct 2015 13:21:33 +
> From: "bugzilla-dae...@bugzilla.kernel.org" 
> 
> To: "shemmin...@linux-foundation.org" 
> Subject: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for 
> sockets in accept(3)
> 
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=106241
> 
> Bug ID: 106241
>Summary: shutdown(3)/close(3) behaviour is incorrect for
> sockets in accept(3)
>Product: Networking
>Version: 2.5
> Kernel Version: 3.10.0-229.14.1.el7.x86_64
>   Hardware: All
> OS: Linux
>   Tree: Mainline
> Status: NEW
>   Severity: normal
>   Priority: P1
>  Component: IPV4
>   Assignee: shemmin...@linux-foundation.org
>   Reporter: alan.burli...@oracle.com
> Regression: No
> 
> Created attachment 190501
>   --> https://bugzilla.kernel.org/attachment.cgi?id=190501=edit
> Test program illustrating the problem
> 
> The Linux behaviour in the current scenario is incorrect:
> 
> 1. ThreadA opens, binds, listens and accepts on a socket, waiting for
> connections.
> 
> 2. Some time later ThreadB calls shutdown on the socket ThreadA is waiting in
> accept on.
> 
> Here is what happens:
> 
> On Linux, the shutdown call in ThreadB succeeds and the accept call in ThreadA
> returns with EINVAL.
> 
> On Solaris, the shutdown call in ThreadB fails and returns ENOTCONN. ThreadA
> continues to wait in accept.
> 
> Relevant POSIX manpages:
> 
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/accept.html
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/shutdown.html
> 
> The POSIX shutdown manpage says:
> 
> "The shutdown() function shall cause all or part of a full-duplex connection 
> on
> the socket associated with the file descriptor socket to be shut down."
> ...
> "[ENOTCONN] The socket is not connected."
> 
> Page 229 & 303 of "UNIX System V Network Programming" say:
> 
> "shutdown can only be called on sockets that have been previously connected"
> 
> "The socket [passed to accept that] fd refers to does not participate in the
> connection. It remains available to receive further connect indications"
> 
> That is pretty clear, sockets being waited on with accept are not connected by
> definition. Nor is it the accept socket connected when a client connects to 
> it,
> it is the socket returned by accept that is connected to the client. Therefore
> the Solaris behaviour of failing the shutdown call is correct.
> 
> In order to get the required behaviour of ThreadB causing ThreadA to exit the
> accept call with an error, the correct way is for ThreadB to call close on the
> socket that ThreadA is waiting on in accept.
> 
> On Solaris, calling close in ThreadB succeeds, and the accept call in ThreadA
> fails and returns EBADF.
> 
> On Linux, calling close in ThreadB succeeds but ThreadA continues to wait in
> accept until there is an incoming connection. That accept returns 
> successfully.
> However subsequent accept calls on the same socket return EBADF.
> 
> The Linux behaviour is fundamentally broken in three places:
> 
> 1. Allowing shutdown to succeed on an unconnected socket is incorrect.  
> 
> 2. Returning a successful accept on a closed file descriptor is incorrect,
> especially as future accept calls on the same socket fail.
> 
> 3. Once shutdown has been called on the socket, calling close on the socket
> fails with EBADF. That is incorrect, shutdown should just prevent further IO 
> on
> the socket, it should not close it.
> 

It looks it is a long standing problem, right ?

inet_shutdown() has this very specific comment from beginning of git
tree :

switch (sk->sk_state) {
...
/* Remaining two branches are temporary solution for missing
 * close() in multithreaded environment. It is _not_ a good idea,
 * but we have no choice until close() is repaired at VFS level.
 */
case TCP_LISTEN:
if (!(how & RCV_SHUTDOWN))
break;
/* Fall through */
case TCP_SYN_SENT:
err = sk->sk_prot->disconnect(sk, O_NONBLOCK);
sock->state = err ? SS_DISCONNECTING : SS_UNCONNECTED;
break;
}

Claiming Solaris does it differently is kind of moot.
linux is not Solaris.

Unless proven a real problem (and not only by trying to backport from Solaris 
to linux),
we'll probably wont change this.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-19 Thread Alan Burlison

On 20/10/2015 00:33, Eric Dumazet wrote:


It looks it is a long standing problem, right ?


Yep, seems so.


inet_shutdown() has this very specific comment from beginning of git
tree :

 switch (sk->sk_state) {
...
 /* Remaining two branches are temporary solution for missing
  * close() in multithreaded environment. It is _not_ a good idea,
  * but we have no choice until close() is repaired at VFS level.
  */
 case TCP_LISTEN:
 if (!(how & RCV_SHUTDOWN))
 break;
 /* Fall through */
 case TCP_SYN_SENT:
 err = sk->sk_prot->disconnect(sk, O_NONBLOCK);
 sock->state = err ? SS_DISCONNECTING : SS_UNCONNECTED;
 break;
 }


I think it's probably an intrinsic part of the way *NIX file descriptors 
and their reuse has worked since the dawn of *NIX time - at which time 
threads didn't exist, so this problem didn't either. The advent of 
threads made this this hole possible, which is I believe what the 
comment above is pointing out. The problem people are trying to solve by 
calling shutdown() on an listen()ing socket is the race in MT programs 
between a socket being closed and the same file descriptor being 
recycled by a subsequent open()/socket() etc.



Claiming Solaris does it differently is kind of moot.
linux is not Solaris.


Agreed that Linux != Solaris, but the argument I'm being faced with is 
that anything that doesn't behave in the same way as Linux is wrong by 
definition. And the problem with that is that the Linux behaviour of 
shutdown() on a listen()/accept() socket is I believe incorrect anyway 
as my read of POSIX says that shutdown() is only valid on connected 
sockets, and sockets in listen()/accept() aren't connected by 
definition, and Linux allows shutdown() to succeed when it should 
probably return ENOTCONN. Yes, there's a potential race with FDs being 
recycled, but you can get that with vanilla file FDs as well, where 
shutdown() isn't an option.


Another problem is that if I call close() on a Linux socket that's in 
accept() the accept call just sits there until there's an incoming 
connection, which succeeds even though the socket is supposed to be 
closed, but then an immediately following accept() on the same socket 
fails. And yet another problem is that poll() on a socket that's had 
listen() called on it returns immediately even if there's no incoming 
connection on it, which I believe makes multiplexing a set of sockets 
which includes a socket you want to accept() on impossible. The test 
program I attached to the bug allows you to play around with the 
different combinations.



Unless proven a real problem (and not only by trying to backport from Solaris 
to linux),
we'll probably wont change this.


It's a real problem (with Hadoop, which contains C/C++ to do low-level 
I/O) and it's the other way around, I am porting that code from Linux to 
Solaris.


I accept that you probably can't change the behaviour of shutdown() in 
Linux without breaking existing code, for example it seems libmicrohttpd 
also assumes it's OK to call shutdown() on a listen() socket on Linux, 
see https://lists.gnu.org/archive/html/libmicrohttpd/2011-09/msg00024.html


However, even if the shutdown() behaviour can't be changed the Linux 
close()/poll() behaviour on a listen()/accept() sockets seems rather odd.


There *may* be a way around this that's race-free and cross-platform 
involving the use of /dev/null and dup2(), see Listing Five on 
http://www.drdobbs.com/parallel/file-descriptors-and-multithreaded-progr/212001285 
but I haven't confirmed it works yet.


Thanks,

--
Alan Burlison
--
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-19 Thread Eric Dumazet
On Tue, 2015-10-20 at 02:12 +0100, Alan Burlison wrote:

> Another problem is that if I call close() on a Linux socket that's in 
> accept() the accept call just sits there until there's an incoming 
> connection, which succeeds even though the socket is supposed to be 
> closed, but then an immediately following accept() on the same socket 
> fails. 

This is exactly what the comment I pasted documents.

On linux, doing close(listener) on one thread does _not_ wakeup other
threads doing accept(listener)

So I guess allowing shutdown(listener) was a way to somehow propagate
some info on the threads stuck in accept()

This is a VFS issue, and a long standing one.

Think of all cases like dup() and fd passing games, and the close(fd)
being able to signal out of band info is racy.

close() is literally removing one ref count on a file.
Expecting it doing some kind of magical cleanup of a socket is not
reasonable/practical.

On a multi threaded program, each thread doing an accept() increased the
refcount on the file.

Really, I have no idea of how Solaris coped with this, and I do not want
to know.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html