Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-27 Thread Casper . Dik


>And no, I'm not fond of such irregular ways to pass file descriptors, but
>we can't kill ioctl(2) with all weirdness hiding behind it, more's the pity...

Yeah, there are a number of calls which supposedly work on one file 
descriptor but take a second argument which is also a file descriptor; 
mostly as part of ioctl().

>> In those specific cases where a system call needs to convert a file 
>> descriptor to a file pointer, there is only one routine which can be used.
>
>Obviously, but the problem is deadlock avoidance using it.

The Solaris algorithm is quite different and as such there is no chance of 
having a deadlock using that function (strictly, it is a bunch of functions).


>The memory footprint is really scary.  Bitmaps are pretty much noise, but
>blowing it by factor of 8 on normal 64bit (or 16 on something like Itanic -
>or Venus for that matter, which is more relevant for you guys)

Fair enough.  I think we have some systems with a larger cache line.

>Said that, what's the point of "close won't return until..."?  After all,
>you can't guarantee that thread with cancelled syscall won't lose CPU
>immediately upon return to userland, so it *can't* make any assumptions
>about the descriptor not having been already reused.  I don't get it - what
>does that buy for userland code?

Generally I wouldn't see that as a problem, but in the case of a socket 
blocking on accept indefinitely, I do see it as a problem especially as 
the thread actually wants to stop listening.

But in general, this is basically a problem with the application: the file 
descriptor space is shared between threads, and if one thread is sniping 
at open files you do have a problem.  Whatever the kernel does in that 
case perhaps doesn't matter all that much: the application needs to be 
fixed anyway.

Casper

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-23 Thread Casper . Dik


>Yet another POSIX deficiency.
>
>When a server deals with 10,000,000+ socks, we absolutely do not care of
>this requirement.
>
>O(log(n)) is still crazy if it involves O(log(n)) cache misses.

You miss the finer point of the algorithm; you *always* find an available 
file descriptor in O(log(N)) (where N is the size of the table).

Does your algorithm guarantee that?

>> Is it a problem that you can "hide" your listening socket with a thread in 
>> accept()?  I would think so (It would be visible in netstat but you can't 
>> easily find out who has it)
>
>Again, netstat -p on a server with 10,000,000 sockets never completes.

This point was not about a 10M sockets server but in general...

>Never try this unless you are desperate and want to avoid a reboot
>maybe.
>
>If you absolutely want to nuke a listener because of untrusted
>applications, we better implement a proper syscall.
>
>Android has such a facility.

Solaris has had such an option too, but that wasn't the point.  You really 
don't want to know which application is doing this?

>Alternative would be to extend netlink (ss command from iproute2
>package) to carry one pid per socket.
>
>ss -atnp state listening

That would be an option too.

Casper




Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-23 Thread Casper . Dik


>Ho-hum...  It could even be made lockless in fast path; the problems I see
>are
>   * descriptor-to-file lookup becomes unsafe in a lot of locking
>conditions.  Sure, most of that happens on the entry to some syscall, with
>very light locking environment, but... auditing every sodding ioctl that
>might be doing such lookups is an interesting exercise, and then there are
>->mount() instances doing the same thing.  And procfs accesses.  Probably
>nothing impossible to deal with, but nothing pleasant either.

In the Solaris kernel code, the ioctl code is generally not handed a file 
descriptor but instead a file pointer (i.e., the lookup is done early in 
the system call).

In those specific cases where a system call needs to convert a file 
descriptor to a file pointer, there is only one routine which can be used.

>   * memory footprint.  In case of Linux on amd64 or sparc64,
>main()
>{
>   int i;
>   for (i = 0; i < 1<<24; dup2(0, i++))// 16M descriptors
>   ;
>}
>will chew 132Mb of kernel data (16Mpointer + 32Mbit, assuming sufficient 
>ulimit -n,
>of course).  How much will Solaris eat on the same?

Yeah, that is a large amount of memory.  Of course, the table is only 
sized when it is extended and there is a reason why there is a limit on 
file descriptors.  But we're using more data per file descriptor entry.
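The figures in this exchange can be checked with a little arithmetic. The sketch below is only an illustration: the function names are made up, the two bitmap bits per descriptor are inferred from the quoted "32Mbit" for 16M descriptors, and the 64-byte entry size is an assumption standing in for "one cache line per entry".

```c
#include <assert.h>
#include <stdint.h>

/* Pointer array plus bitmaps: one 8-byte pointer and 2 bitmap bits per
 * descriptor (the 2 bits are inferred from the quoted 32Mbit figure). */
static uint64_t compact_table_bytes(uint64_t nfds)
{
    assert(nfds % 4 == 0);      /* keep the bitmaps a whole number of bytes */
    return nfds * 8 + (nfds * 2) / 8;
}

/* One entry per descriptor, padded to a full cache line.
 * The cache line size is a parameter because it is an assumption here. */
static uint64_t padded_table_bytes(uint64_t nfds, uint64_t cacheline)
{
    return nfds * cacheline;
}
```

For 1<<24 descriptors the first function gives 132 MiB (128 MiB of pointers plus 4 MiB of bitmaps). With 64-byte entries the second gives 1 GiB, which is eight times the pointer array alone.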


>   * related to the above - how much cacheline sharing will that involve?
>These per-descriptor use counts are bitch to pack, and giving each a cacheline
>of its own...  

As I said, we do actually use a lock and yes that means that you really  
want to have a single cache line for each and every entry.  It does make 
it easy to have non-racy file description updates.  You certainly do not 
want false sharing when there is a lot of contention.

Other data is used to make sure that it only takes O(log(n)) to find the 
lowest available file descriptor entry.  (Where n, I think, is the returned
descriptor)
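One way to get a logarithmic bound on "find the lowest free descriptor" is a binary tree of free counts over the table. The sketch below is not the Solaris data structure, which this thread does not describe; the names (fdtree, fdtree_alloc, fdtree_release) are hypothetical, the table size is assumed to be a power of two, and the bound here is O(log N) in the table size rather than in the returned descriptor.

```c
#include <assert.h>
#include <stdlib.h>

/* free_cnt[1] is the root; leaves live at free_cnt[n .. 2n-1].
 * Each node stores the number of free slots in its subtree. */
struct fdtree {
    size_t    n;          /* number of slots, a power of two */
    unsigned *free_cnt;
};

static int fdtree_init(struct fdtree *t, size_t n)
{
    assert(n > 0 && (n & (n - 1)) == 0);
    t->n = n;
    t->free_cnt = malloc(2 * n * sizeof *t->free_cnt);
    if (t->free_cnt == NULL)
        return -1;
    for (size_t i = n; i < 2 * n; i++)
        t->free_cnt[i] = 1;                       /* every slot starts free */
    for (size_t i = n - 1; i >= 1; i--)
        t->free_cnt[i] = t->free_cnt[2 * i] + t->free_cnt[2 * i + 1];
    return 0;
}

/* Returns the lowest free slot, or -1 if the table is full. */
static int fdtree_alloc(struct fdtree *t)
{
    if (t->free_cnt[1] == 0)
        return -1;
    size_t i = 1;
    while (i < t->n) {
        i = 2 * i;                 /* prefer the left (lower-numbered) half */
        if (t->free_cnt[i] == 0)
            i++;                   /* left half is full, go right */
    }
    int fd = (int)(i - t->n);
    for (; i >= 1; i /= 2)
        t->free_cnt[i]--;          /* one fewer free slot on the path to the root */
    return fd;
}

static void fdtree_release(struct fdtree *t, int fd)
{
    size_t i = t->n + (size_t)fd;
    assert(t->free_cnt[i] == 0);   /* slot must currently be in use */
    for (; i >= 1; i /= 2)
        t->free_cnt[i]++;
}
```

Both allocation and release walk one root-to-leaf path, so each touches about log2(N) nodes regardless of how full the table is.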

Uncontended locks aren't expensive.  And all is done on a single cache 
line.

One question about the Linux implementation: what happens when a socket in 
select is closed?  I'm assuming that the kernel waits until shutdown() is 
called or a connection comes in?

Is it a problem that you can "hide" your listening socket with a thread in 
accept()?  I would think so (It would be visible in netstat but you can't 
easily find out who has it)

Casper



Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-22 Thread Casper . Dik

>On Thu, Oct 22, 2015 at 08:34:19AM +0200, casper@oracle.com wrote:
>> 
>> 
>> >And I'm really curious about the things Solaris would do with dup2() there.
>> >Does it take into account the possibility of new accept() coming just as
>> >dup2() is trying to terminate the ongoing ones?  Is there a window when
>> >descriptor-to-file lookups would fail?  Looks like a race/deadlock 
>> >country...
>> 
>> Solaris does not "terminate" threads, instead it tells them that the
>> file descriptor information used is stale and wakes up the thread.
>
>Sorry, lousy wording - I meant "terminate syscall in another thread".
>Better yet, make that "what happens if new accept(newfd) comes while dup2()
>waits for affected syscalls in other threads to finish"?  Assuming it
>does wait, that is..

No, there is no such window; the accept() call either returns EBADF
(dup2() wins the race) or it returns a new file descriptor (and dup2()
then closes the listening descriptor).

One or the other.

>While we are at it, what's the relative order of record locks removal
>and switching the meaning of newfd?  In our kernel it happens *after*
>the switchover (i.e. if another thread is waiting for a record lock held on
>any alias of newfd and we do dup2(oldfd, newfd), the waiter will not see
>the state with newfd still referring to what it used to; note that waiter
>might be using any descriptor referring to the file newfd used to refer
>to, so it won't be affected by the "wake those who had the meaning of
>their arguments change" side of things).

The external behaviour is atomic; you cannot distinguish the order
between the closing of the original file (and waking up other threads
waiting for a record lock) and changing the file referenced by that newfd.

But this does not involve a global or process-specific lock.

Casper



Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-22 Thread Casper . Dik

From: Al Viro 

>Except that in this case "correctness" is the matter of rather obscure and
>ill-documented areas in POSIX.  Don't get me wrong - this semantics isn't
>inherently bad, but it's nowhere near being an absolute requirement.

It would be more fruitful to have such a discussion on one of the OpenGroup 
mailing lists; people gathered there have a lot of experience and it is 
also possible to fix the standard when it turns out that it is indeed as 
vague as you claim it is (I don't quite agree with that).

Casper



Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-22 Thread Casper . Dik

>It's been said that the current mechanisms in Linux & some BSD variants 
>can be subject to races, and the behaviour exhibited doesn't conform to 
>POSIX, for example requiring the use of shutdown() on unconnected 
>sockets because close() doesn't kick off other threads accept()ing on 
>the same fd. I'd be interested to hear if there's a better and more 
>performant way of handling the situation that doesn't involve doing the 
>sort of bookkeeping Casper described.

Of course, the implementation is now around 18 years old; clearly a lot of 
things have changed since then.

In the particular case of Linux close() on a socket, surely it must be 
possible to detect at close that it is a listening socket and that you are 
about to close the last reference; the kernel could then do the shutdown() 
all by itself.
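Until a kernel does that by itself, the workaround mentioned in this thread is for the application to call shutdown() before close(). A minimal sketch follows; stop_listening is a hypothetical helper name, and it relies on the Linux behaviour discussed here, where shutdown() on a listening socket wakes threads blocked in accept(). It is not portable, and it does not remove the race in which another thread reuses the descriptor number afterwards.

```c
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: wake any thread blocked in accept(fd), then close fd.
 * Returns the result of close(). */
static int stop_listening(int fd)
{
    assert(fd >= 0);
    /* The result is deliberately ignored: on some systems shutdown() on an
     * unconnected socket fails with ENOTCONN, and we want to close anyway. */
    (void)shutdown(fd, SHUT_RDWR);
    return close(fd);
}
```

The thread that was blocked in accept() still has to treat the resulting error as "stop listening" rather than retrying on the same descriptor number.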

Casper



Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-22 Thread Casper . Dik

>On Thu, Oct 22, 2015 at 08:24:51PM +0200, casper@oracle.com wrote:
>
>> The external behaviour is atomic; you cannot distinguish the order
>> between the closing of the original file (and waking up other threads
>> waiting for a record lock) and changing the file referenced by that newfd.
>> 
>> But this does not involve a global or process-specific lock.
>
>Interesting...  Do you mean that descriptor-to-file lookup blocks until that
>rundown finishes?

For that particular file descriptor, yes.  (I'm assuming you mean the 
Solaris kernel running down all lwps which have a system call in progress 
on that particular file descriptor.)  All other fd to file lookups are not 
blocked at all by this locking.

It should be clear that any such occurrences are application errors and 
should hardly ever be seen in practice.  It is also known when this is 
needed, so it is hardly ever attempted.

Casper



Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-22 Thread Casper . Dik

From: Al Viro 

>On Thu, Oct 22, 2015 at 06:39:34PM +0100, Alan Burlison wrote:
>> On 22/10/2015 18:05, Al Viro wrote:
>> 
>> >Oh, for...  Right in this thread an example of complete BS has been quoted
>> >from POSIX close(2).  The part about closing a file when the last descriptor
>> >gets closed.  _Nothing_ is POSIX-compliant in that respect (nor should
>> >it be).
>> 
>> That's not exactly what it says, we've already discussed, for
>> example in the case of pending async IO on a filehandle.
>
>Sigh...  It completely fails to mention descriptor-passing.  Which
>   a) is relevant to what "last close" means and
>   b) had been there for nearly the third of a century.

Why is that different?  These clearly count as file descriptors.

>> I agree that part could do with some polishing.
>
>google("wire brush of enlightenment") is what comes to mind...

Standardese is similar to legalese; it is not writing that is directly open 
to interpretation, and those who are not inducted in such writing may have 
some problem interpreting what exactly is meant by the wording of the 
standard.



>> I think "it shall be closed first" makes it pretty clear that what
>> is expected is the same behaviour as any direct invocation of close,
>> and that has to happen before the reassignment. What makes you
>> believe that's isn't the case?
>
>So unless I'm misparsing something, you want
>thread A: accept(newfd)
>thread B: dup2(oldfd, newfd)
>have accept() bugger off before the switchover happens?

Well, certainly *before* we return from dup2().
(and clearly only once we have determined that dup2() will return
successfully)

>What should happen if thread C does accept(newfd) right as B has decided that
>there's nothing more to wait?  For close(newfd) it would be simple - we are
>going to have lookup by descriptor fail with EBADF anyway, so making it do
>so as soon as we go hunting for those who are currently in accept(newfd)
>would do the trick - no new threads like that shall appear and as long as
>the descriptor is not declared free for taking by descriptor allocation nobody
>is going to be screwed by open() picking that slot of descriptor table too
>early.  Trying to do that for dup2() would lose atomicity.  I honestly don't
>know how Solaris behaves in that case, BTW - the race (if any) would probably
>be hard to hit, so in case of Linux I would have to go and RTFS before saying
>that there isn't one.  I can't do that with Solaris; all I can do here
>is ask you guys...

Solaris dup2() behaves exactly like close().

>Moreover, see above for record locks removal.  Should that happen prior to
>switchover?  If you have
>
>dup2(fd, fd2);
>set a record lock on fd2
>spawn a thread
>in child, try to grab the same lock on fd2
>in parent, do some work and close(fd)

>you are guaranteed that child won't see fd referring to the same file after it
>acquires the lock.

Here you are talking about a lock held by the "parent", and that the
"child" will only get the lock once close(fd) is done?

Yes.  The final "close" is done *after* the pointer has been removed from 
the file descriptor table.

>Replace close(fd) with dup2(fd3, fd); should the same hold true in that case?

Yes.

>FWIW, Linux behaviour in that area is to have record locks removal done
>between the switchover and return to userland in case of dup2() and between
>the removal from descriptor table and return to userland in case of close().
>
>> Personally I believe the spec is clear enough to allow an
>> unambiguous interpretation of the required behavior in this area. If
>> you think there are areas where the Solaris behaviour is in
>> disagreement with the spec then I'd be interested to hear them.
>
>The spec is so vague that I strongly suspect that *both* Solaris and Linux
>behaviours are not in disagreement with it (modulo shutdown(2) extension
>Linux-side and we are really stuck with that one).

I'm not sure if the standard allows a handful of threads in accept() for a 
file descriptor which has already been closed *and* can be re-issued for 
other uses.

Casper



Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-22 Thread Casper . Dik


>And I'm really curious about the things Solaris would do with dup2() there.
>Does it take into account the possibility of new accept() coming just as
>dup2() is trying to terminate the ongoing ones?  Is there a window when
>descriptor-to-file lookups would fail?  Looks like a race/deadlock country...

Solaris does not "terminate" threads, instead it tells them that the
file descriptor information used is stale and wakes up the thread.

The accept() call gets woken up and checks for incoming connections; it 
will then either find a new connection and return that particular 
connection, or find nothing and return EINTR.  This is checked in the 
post-syscall glue (the kernel thread has been told to take the expensive 
post-syscall routine), and if the system call was interrupted, EBADF is 
returned instead.

It is also possible for the connection to come in late; the socket will 
then be changed and the already accepted (in TCP terms, not in socket API 
terms) embryonic connection will be closed too, as is normal when a 
listening socket is closed with a list of connections that have not yet 
been accept()ed.
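The error translation described in this message can be modelled in a few lines of user-space C. This is a toy model under stated assumptions, not Solaris kernel code: the struct, its field, and the function name are all hypothetical, and the only rule taken from the text is that an interrupted call in a thread whose descriptor information was marked stale reports EBADF instead of EINTR.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for per-thread state; not a real kernel structure. */
struct model_lwp {
    bool fd_stale;   /* set when another thread closed or dup2()ed over the fd */
};

/* Models the post-syscall check.  syscall_err is 0 on success or an errno
 * value; the return value is what the caller of accept() would see. */
static int model_post_syscall(const struct model_lwp *lwp, int syscall_err)
{
    assert(lwp != NULL);
    if (lwp->fd_stale && syscall_err == EINTR)
        return EBADF;        /* interrupted because the descriptor went away */
    return syscall_err;      /* success or an unrelated error passes through */
}
```

In this model a woken thread that did find a connection (error 0) still returns it, which matches the "either a new descriptor or EBADF" outcome described in the thread.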

Casper



Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-21 Thread Casper . Dik

From: David Miller 
Date: Wed, 21 Oct 2015 08:30:08 -0700 (PDT) (17:30 CEST)

>From: Alan Burlison 
>Date: Wed, 21 Oct 2015 15:38:51 +0100
>
>> While this algorithm is pretty expensive, it is not often invoked.
>
>I bet it can be easily intentionally invoked, by a malicious entity no
>less.

It is only expensive within the process itself.  Whether it is run inside 
the kernel isn't much different in the context of Solaris.  If you have an 
attacker which can run any code, it doesn't really matter what that code 
is.  It is not really expensive (it does not grab expensive locks, nor does 
it hold locks for any length of time).  It's basically O(n) in the number 
of threads in the process.

If you have an application which can be triggered into doing that, it is 
still a bug in the application.  

Is such a socket still listed with netstat on Linux?  I believe it
uses /proc and it will not be able to find that socket through the list of 
opened files.

If we look at our typical problem we have an accept loop:

for (;;) {
    newfd = accept(fd, ...);  /* X */

    /* stuff */
}

While we have a second thread doing a "close(fd);" and possibly opening 
another file which just happens to return this particular fd.

In Solaris one of the following things will happen,
whatever the first thread is doing once close() is called:

- accept() dies with EBADF (close() before or during the call to
  accept())
- accept() returns some other error (new fd you can't accept on)
- accept() returns a new fd (if it was closed and reopened and
  the new fd allows accept())

On Linux exactly the same thing happens *except* when we find ourselves in 
accept(); then we wait until a connection is made or shutdown() is called.

I don't think any of the outcomes in the first thread is acceptable; 
clearly there is not sufficient synchronization between the threads.


At that point Linux cannot find out who owns the socket:

#  netstat -p -a | grep /tmp/unix
unix  2  [ ACC ] STREAM LISTENING 14743  -   /tmp/unix_sock

In Solaris you'd get:

netstat -u -f unix| grep unix_
stream-ord casper 5334 shutdown   /tmp/unix_sock

Simple synchronization can be done.
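One common way to do such synchronization in the application is to have the accepting thread wait in poll() on both the listening socket and the read end of a pipe, and to have the other thread write to the pipe instead of closing the socket itself. The sketch below is one possible approach, not something prescribed in this thread; wait_for_connection and request_stop are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Called by the thread that wants the listener to stop.
 * wake_wr_fd is the write end of the wake-up pipe. */
static int request_stop(int wake_wr_fd)
{
    char c = 0;
    assert(wake_wr_fd >= 0);
    return write(wake_wr_fd, &c, 1) == 1 ? 0 : -1;
}

/* Called by the accepting thread before each accept().
 * Returns 1 if listen_fd is ready, 0 if a stop was requested, -1 on error. */
static int wait_for_connection(int listen_fd, int wake_rd_fd)
{
    struct pollfd pfd[2] = {
        { .fd = listen_fd,  .events = POLLIN },
        { .fd = wake_rd_fd, .events = POLLIN },
    };

    for (;;) {
        if (poll(pfd, 2, -1) < 0) {
            if (errno == EINTR)
                continue;          /* interrupted by a signal, retry */
            return -1;
        }
        if (pfd[1].revents)
            return 0;              /* a stop request takes priority */
        if (pfd[0].revents)
            return 1;
    }
}
```

When wait_for_connection() returns 0, the accepting thread closes the listening socket itself, so no other thread ever closes a descriptor that is in use inside accept().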

Casper



Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-21 Thread Casper . Dik

>On Wed, Oct 21, 2015 at 03:38:51PM +0100, Alan Burlison wrote:
>
>> >There's going to be a notion of "last close"; that's what this refcount is
>> >about and _that_ is more than implementation detail.
>> 
>> Yes, POSIX distinguishes between "file descriptor" and "file
>> description" (ugh!) and the close() page says:
>
>Would've been better if they went for something like "IO channel" for
>the latter ;-/

Or at least some other word.  A file descriptor is just an index to
a list of file pointers (and wasn't named so?)

>> "When all file descriptors associated with an open file description
>> have been closed, the open file description shall be freed."
>
>BTW, is SCM_RIGHTS outside of scope?  They do mention it in one place
>only:
>| Ancillary data is also possible at the socket level. The <sys/socket.h>
>| header shall define the following symbolic constant for use as the cmsg_type
>| value when cmsg_level is SOL_SOCKET:
>|
>| SCM_RIGHTS
>| Indicates that the data array contains the access rights to be sent or
>| received.
>
>with no further details whatsoever.  It's been there since at least 4.3-Reno;
>does anybody still use the older variant (->msg_accrights, that is)?  IIRC,
>there was some crap circa 2.6 when Solaris used to do ->msg_accrights for
>descriptor-passing, but more or less current versions appear to support
>SCM_RIGHTS...  In any case, descriptor-passing had been there in some form
>since at least '83 (the old variant is already present in 4.2) and considering
>it out-of-scope for POSIX is bloody ridiculous, IMO.

SCM_RIGHTS was introduced as part of the POSIX standardization of BSD 
sockets.  It looks like it became part of Solaris 2.6, but the default
was non-standard sockets, so you may easily find msg_accrights but not
SCM_RIGHTS.

msg_accrights is what was introduced in BSD, in what was likely the first 
implementation of socket-based file descriptor passing.

SysV has its own mechanism for passing file descriptors.

But that interface is too ad hoc, so SCM_RIGHTS is a standardization 
allowing multiple types of messages to be sent around.

>Unless they want to consider in-flight descriptor-passing datagrams as
>collections of file descriptors, the quoted sentence is seriously misleading.
>And then there's mmap(), which they do kinda-sorta mention...

Well, a file descriptor really only exists in the context of a process; 
in-flight it is no longer a file descriptor, as there is no process context 
with a file descriptor table; so pointers to file descriptions are passed 
around.
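For readers who have not used it, descriptor passing with SCM_RIGHTS looks roughly like the following. This is a minimal sketch using the standard sendmsg()/recvmsg() and CMSG_* interfaces; send_fd and recv_fd are hypothetical helper names, one descriptor is passed per message, and error handling is reduced to returning -1.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Sends one descriptor over a Unix-domain socket. Returns 0 or -1. */
static int send_fd(int sock, int fd)
{
    char byte = 0;                       /* at least one byte of real data */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { struct cmsghdr align; char buf[CMSG_SPACE(sizeof(int))]; } u;
    struct msghdr msg;

    assert(fd >= 0);
    memset(&msg, 0, sizeof msg);
    memset(&u, 0, sizeof u);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof u.buf;

    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type = SCM_RIGHTS;
    c->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(c), &fd, sizeof fd);

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receives one descriptor. Returns the new descriptor number or -1. */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { struct cmsghdr align; char buf[CMSG_SPACE(sizeof(int))]; } u;
    struct msghdr msg;
    int fd = -1;

    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof u.buf;

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    if (c == NULL || c->cmsg_level != SOL_SOCKET ||
        c->cmsg_type != SCM_RIGHTS || c->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;
    memcpy(&fd, CMSG_DATA(c), sizeof fd);
    return fd;
}
```

The number returned by recv_fd() is allocated in the receiver's descriptor table and generally differs from the sender's number, while both refer to the same open file description; between sendmsg() and recvmsg() there is no descriptor number at all, only the reference held by the queued message.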

>> >In other words, is that destruction of
>> >* any descriptor referring to this socket [utterly insane for obvious
>> >reasons]
>> >* the last descriptor referring to this socket (modulo descriptor
>> >passing, etc.) [a bitch to implement, unless we treat a syscall in progress
>> >as keeping the opened file open], or
>> >* _the_ descriptor used to issue accept(2) [a bitch to implement,
>> >with a lot of fun races in an already race-prone area]?
>> 
>> From reading the POSIX close() page I believe the second option is
>> the correct one.
>
>Er...  So fd2 = dup(fd);accept(fd)/close(fd) should *not* trigger that
>behaviour, in your opinion?  Because fd is sure as hell not the last
>descriptor referring to that socket - fd2 remains alive and well.
>
>Behaviour you describe below matches the _third_ option.

>> >BTW, for real fun, consider this:
>> >7)
>> >// fd is a socket
>> >fd2 = dup(fd);
>> >in thread A: accept(fd);
>> >in thread B: accept(fd);
>> >in thread C: accept(fd2);
>> >in thread D: close(fd);
>> >
>> >Which threads (if any), should get hit where it hurts?
>> 
>> A & B should return from the accept with an error. C should
>> continue. Which is what happens on Solaris.
>
>> To this end each thread keeps a list of file descriptors
>> in use by the current active system call.
>
>Yecc...  How much cross-CPU traffic does that cause on
>multithread processes?  Not on close(2), on maintaining the
>descriptor use counts through the normal syscalls.

That is pretty much what we do in the Solaris implementation,
but there is not much cross-CPU traffic.  Of course, you will need
to keep locks in the file descriptor table if only to find
the actual file pointer.

The work is done only in the case of a badly written application
where close is required to hunt down all threads currently using
the specific file descriptor.
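The hunt itself can be pictured with a small user-space model. It is a sketch under assumptions, not the Solaris implementation: the structure layout, the fixed-size list, and the name model_rundown are all hypothetical. What it takes from the thread is that each thread records the descriptors used by its current system call, and that close() scans the threads and flags those using the closed descriptor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_ACTIVE 4

/* Toy per-thread record; not a real kernel structure. */
struct model_thread {
    int    active_fd[MODEL_MAX_ACTIVE];  /* fds used by the current syscall */
    size_t nactive;
    bool   stale;                        /* "your fd list is stale" flag */
};

/* close(fd) side: scan every thread, flag those using fd.
 * Cost is linear in the number of threads.  Returns how many were flagged. */
static size_t model_rundown(struct model_thread *thr, size_t nthr, int fd)
{
    size_t flagged = 0;

    for (size_t i = 0; i < nthr; i++) {
        assert(thr[i].nactive <= MODEL_MAX_ACTIVE);
        for (size_t j = 0; j < thr[i].nactive; j++) {
            if (thr[i].active_fd[j] == fd) {
                thr[i].stale = true;     /* a real kernel would also wake it */
                flagged++;
                break;
            }
        }
    }
    return flagged;
}
```

With threads A and B in accept(fd) and thread C in accept(fd2), running the scan for fd flags A and B and leaves C alone, which is the outcome quoted for example 7 in this message.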

>> When a file descriptor is closed and this file descriptor
>> is marked as being in use by other threads, the kernel
>> will search all threads to see which have this file descriptor
>> listed as in use. For each such thread, the kernel tells
>> the thread that its active fds list is now stale and, if
>> possible, makes the thread run.
>>
>> While this algorithm is pretty expensive, it is not often invoked.
>
>Sure, but the upkeep of data structures it would need is there
>whether you actually end up triggering it or not.  Both in
>memory footprint and in cacheline pingpong...

Most of the