RE: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-11-06 Thread David Laight
From: Al Viro
> Sent: 04 November 2015 16:28
> On Wed, Nov 04, 2015 at 03:54:09PM +0000, David Laight wrote:
> > > Sigh...  The kernel has no idea when other threads are done with "all
> > > io activities using that fd" - it can wait for them to leave the
> > > kernel mode, but there's fuck-all it can do about e.g. a userland
> > > loop doing write() until there's more data to send.  And no, you can't
> > > rely upon them catching EBADF on the next iteration - by the time they
> > > get there, close() might very well have returned and open() from yet
> > > another thread might've grabbed the same descriptor.  Welcome to your
> > > data being written to hell knows what...
> >
> > That just means that the application must use dup2() rather than close().
> > It must do that anyway since the thread it is trying to stop might be
> > sleeping in the system call stub in libc at the time a close() and open()
> > happen.
> 
> Oh, _lovely_.  So instead of continuation of that write(2) going down
> the throat of something opened by unrelated thread, it (starting from a
> pretty arbitrary point) will go into the descriptor the closing thread
> passed to dup2().  Until it, in turn, gets closed, at which point we
> are back to square one.  That, of course, makes it so much better -
> whatever had I been thinking about that made me miss that?

Do I detect a note of sarcasm?

You'd open "/dev/null" (lacking a "/dev/error") and use that as the source fd.
So any writes (etc) would be discarded in a safe manner, and nothing will block.
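
A minimal sketch of that idiom in C - the helper name and error handling
here are illustrative, not anything from this thread:

#include <fcntl.h>
#include <unistd.h>

/* Atomically replace fd's file with /dev/null, so racing threads
 * hit a harmless descriptor instead of a recycled one. */
static int neuter_fd(int fd)
{
	int nullfd = open("/dev/null", O_RDWR);

	if (nullfd < 0)
		return -1;
	if (dup2(nullfd, fd) < 0) {
		close(nullfd);
		return -1;
	}
	close(nullfd);
	return 0;
}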

David



Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-11-06 Thread Al Viro
On Fri, Nov 06, 2015 at 03:07:30PM +0000, David Laight wrote:

> > Oh, _lovely_.  So instead of continuation of that write(2) going down
> > the throat of something opened by unrelated thread, it (starting from a
> > pretty arbitrary point) will go into the descriptor the closing thread
> > passed to dup2().  Until it, in turn, gets closed, at which point we
> > are back to square one.  That, of course, makes it so much better -
> > whatever had I been thinking about that made me miss that?
> 
> Do I detect a note of sarcasm?
> 
> You'd open "/dev/null" (lacking a "/dev/error") and use that as the source fd.
> So any writes (etc) would be discarded in a safe manner, and nothing will block.

... and never close that descriptor afterwards?  Then why would you care whether
dup2() blocks there or not?  Are you seriously saying that there is any code
that would (a) rely on that kind of warranty and (b) not be racy as hell?

Until now everyone had been talking about that thing as an attempt at
mitigation for broken userland code; you seem to be the first one to imply
that this can be used as a deliberate technique...


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-11-04 Thread Al Viro
On Wed, Nov 04, 2015 at 03:54:09PM +0000, David Laight wrote:
> > Sigh...  The kernel has no idea when other threads are done with "all
> > io activities using that fd" - it can wait for them to leave the
> > kernel mode, but there's fuck-all it can do about e.g. a userland
> > loop doing write() until there's more data to send.  And no, you can't
> > rely upon them catching EBADF on the next iteration - by the time they
> > get there, close() might very well have returned and open() from yet
> > another thread might've grabbed the same descriptor.  Welcome to your
> > data being written to hell knows what...
> 
> That just means that the application must use dup2() rather than close().
> It must do that anyway since the thread it is trying to stop might be
> sleeping in the system call stub in libc at the time a close() and open()
> happen.

Oh, _lovely_.  So instead of continuation of that write(2) going down
the throat of something opened by unrelated thread, it (starting from a
pretty arbitrary point) will go into the descriptor the closing thread
passed to dup2().  Until it, in turn, gets closed, at which point we
are back to square one.  That, of course, makes it so much better -
whatever had I been thinking about that made me miss that?

> The listening (in this case) thread would need to look at its global
> data to determine that it is supposed to exit, and then close the fd itself.

Right until it crosses into the kernel mode and does descriptor-to-file
lookup, presumably?  Because prior to that point this kernel-side
"protection" oesn't come into play.  In other words, this is inherently
racy, and AFAICS you are the first poster in that thread who disagrees
with that.

_Any_ userland code that would be racy without that kludge of semantics
in close()/dup2() is *still* racy with it.  If that crap gets triggered
at all, the userland code responsible for that is broken.  Said crap
makes the race windows more narrow, but it doesn't really close them.
And IMO it's rather misguided, especially since it's a) quiet and b)
costly as hell.


RE: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-11-04 Thread David Laight
From: Al Viro
> Sent: 30 October 2015 21:10
> On Fri, Oct 30, 2015 at 05:43:21PM +0000, David Laight wrote:
> 
> > ISTM that the correct call should be listen(fd, 0);
> > Although that doesn't help a thread stuck in recvmsg() for a datagram.
> >
> > It is also tempting to think that close(fd) should sleep until all
> > io activities using that fd have completed - whether or not blocking
> > calls are woken.
> 
> Sigh...  The kernel has no idea when other threads are done with "all
> io activities using that fd" - it can wait for them to leave the
> kernel mode, but there's fuck-all it can do about e.g. a userland
> loop doing write() until there's more data to send.  And no, you can't
> rely upon them catching EBADF on the next iteration - by the time they
> get there, close() might very well have returned and open() from yet
> another thread might've grabbed the same descriptor.  Welcome to your
> data being written to hell knows what...

That just means that the application must use dup2() rather than close().
It must do that anyway since the thread it is trying to stop might be
sleeping in the system call stub in libc at the time a close() and open()
happen.
The listening (in this case) thread would need to look at its global
data to determine that it is supposed to exit, and then close the fd itself.
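
A rough sketch of that pattern, assuming pthreads and a C11 atomic flag;
all names here are illustrative:

#include <stdatomic.h>
#include <sys/socket.h>
#include <unistd.h>

static atomic_bool stop_requested;	/* set by the controlling thread */

static void accept_loop(int listen_fd)
{
	while (!atomic_load(&stop_requested)) {
		int c = accept(listen_fd, NULL, NULL);

		if (c < 0)
			continue;	/* interrupted, or fd dup2()-ed over */
		/* ... hand the connection off ... */
		close(c);
	}
	close(listen_fd);		/* only this thread closes its own fd */
}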

David




RE: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-11-02 Thread David Laight
From: casper@oracle.com
> Sent: 21 October 2015 21:33
..
> >Er...  So fd2 = dup(fd);accept(fd)/close(fd) should *not* trigger that
> >behaviour, in your opinion?  Because fd is sure as hell not the last
> >descriptor referring to that socket - fd2 remains alive and well.
> >
> >Behaviour you describe below matches the _third_ option.
> 
> >> >BTW, for real fun, consider this:
> >> >7)
> >> >// fd is a socket
> >> >fd2 = dup(fd);
> >> >in thread A: accept(fd);
> >> >in thread B: accept(fd);
> >> >in thread C: accept(fd2);
> >> >in thread D: close(fd);

If D is going to do this, it ought to be using dup2() to clone
a copy of (say) /dev/null onto 'fd'.

> >> >Which threads (if any), should get hit where it hurts?
> >>
> >> A & B should return from the accept with an error. C should
> >> continue. Which is what happens on Solaris.

That seems to completely break the normal *nix file scheme...

> >> To this end each thread keeps a list of file descriptors
> >> in use by the current active system call.
> >

Remember, Solaris (and SYSV) has more levels of multiplexing between
userspace and char special drivers (and probably sockets) than Linux does.
As well as having multiple fds referencing the same struct FILE, multiple
FILEs can point to the same inode.
If you have two different /dev entries for the same major/minor you
also end up with separate inodes - all finally referencing the same
driver data (indexed only by minor number).
(You actually get two inodes in the chain, one for the disk filesystem
and one char-special. The ref counts on both can be greater than 1.)

David



Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-11-02 Thread Al Viro
On Mon, Nov 02, 2015 at 10:03:58AM +0000, David Laight wrote:
> Remember, Solaris (and SYSV) has more levels of multiplexing between
> userspace and char special drivers (and probably sockets) than Linux does.
> As well as having multiple fds referencing the same struct FILE, multiple
> FILEs can point to the same inode.

As they ever could on any Unix.  Every open(2) results in a new struct file
(BTW, I've never seen that capitalization for kernel structure - not in
v7, not in *BSD, etc.; FILE is a userland typedef and I would be rather
surprised if any kernel, Solaris included, would've renamed 'struct file'
to 'struct FILE').

> If you have two different /dev entries for the same major/minor you
> also end up with separate inodes - all finally referencing the same
> driver data (indexed only by minor number).

Again, the same goes for all Unices, both Linux and Solaris included.
And what the devil does that have to do with sockets, anyway?  Or with
the problem in question, while we are at it - there we have different
descriptors pointing to the same struct file behaving differently;
anything sensitive to file type would be past that point.
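
To make that sharing concrete, a small userland demonstration that dup()-ed
descriptors share one struct file (and hence one file offset); the file
name is arbitrary:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/etc/hostname", O_RDONLY);
	int fd2 = dup(fd);	/* same struct file as fd */
	char c;

	if (fd < 0 || fd2 < 0)
		return 1;
	read(fd, &c, 1);	/* advances the shared offset */
	/* prints 1: the offset lives in the shared struct file */
	printf("fd2 offset = %ld\n", (long)lseek(fd2, 0, SEEK_CUR));
	return 0;
}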



Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-11-01 Thread Eric Dumazet
On Mon, 2015-11-02 at 00:24 +0000, Al Viro wrote:

> This ought to be a bit cleaner.  Eric, could you test the variant below on your setup?

Sure !

5 runs of :
lpaa24:~# taskset ff0ff ./opensock -t 16 -n 1000 -l 10

total = 4386311
total = 4560402
total = 4437309
total = 4516227
total = 4478778


With 48 threads :

./opensock -t 48 -n 1000 -l 10
total = 4940245
total = 4848513
total = 4813153
total = 4813946
total = 5127804

Perf output taken on the 16 threads run :

74.71%   opensock  opensock   [.] memset

 |
 --- memset

 5.64%   opensock  [kernel.kallsyms]  [k] queued_spin_lock_slowpath 

 |
 --- queued_spin_lock_slowpath
|  
|--99.89%-- _raw_spin_lock
|  |  
|  |--52.74%-- __close_fd
|  |  sys_close
|  |  entry_SYSCALL_64_fastpath
|  |  __libc_close
|  |  |  
|  |   --100.00%-- 0x0
|  |  
|  |--46.97%-- __alloc_fd
|  |  get_unused_fd_flags
|  |  sock_map_fd
|  |  sys_socket
|  |  entry_SYSCALL_64_fastpath
|  |  __socket
|  |  |  
|  |   --100.00%-- 0x0
|   --0.30%-- [...]
 --0.11%-- [...]

 1.69%   opensock  [kernel.kallsyms]  [k] _raw_spin_lock

 |
 --- _raw_spin_lock
|  
|--27.37%-- get_unused_fd_flags
|  sock_map_fd
|  sys_socket
|  entry_SYSCALL_64_fastpath
|  __socket
|  
|--22.40%-- sys_close
|  entry_SYSCALL_64_fastpath
|  __libc_close
|  
|--15.79%-- cache_alloc_refill
|  |  
|  |--99.27%-- kmem_cache_alloc
|  |  |  
|  |  |--81.25%-- sk_prot_alloc
|  |  |  sk_alloc
|  |  |  inet_create
|  |  |  __sock_create
|  |  |  sock_create
|  |  |  sys_socket
|  |  |  entry_SYSCALL_64_fastpath
|  |  |  __socket
|  |  |  
|  |  |--9.08%-- sock_alloc_inode
|  |  |  alloc_inode
|  |  |  new_inode_pseudo
|  |  |  sock_alloc
|  |  |  __sock_create
|  |  |  sock_create
|  |  |  sys_socket
|  |  |  entry_SYSCALL_64_fastpath
|  |  |  __socket
|  |  |  
|  |  |--4.98%-- __d_alloc
|  |  |  d_alloc_pseudo
|  |  |  sock_alloc_file
|  |  |  sock_map_fd
|  |  |  sys_socket
|  |  |  entry_SYSCALL_64_fastpath
|  |  |  __socket
|  |  |  
|  |   --4.69%-- get_empty_filp
|  | alloc_file
|  | sock_alloc_file
|  | sock_map_fd
|  | sys_socket
|  | entry_SYSCALL_64_fastpath
|  | __socket
|  |  
|   --0.73%-- 

Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-11-01 Thread Linus Torvalds
On Sun, Nov 1, 2015 at 4:24 PM, Al Viro  wrote:
>
> This ought to be a bit cleaner.  Eric, could you test the variant below on your setup?

I'd love to see the numbers, but at the same time I really can't say I
love your patch.

I've merged my own two patches for now (not actually pushed out - I
don't want to distract people from just testing 4.3 for a while),
because I felt that those had an unreasonably high bang-for-the-buck
(ie big speedups for something that still keeps the code very simple).

I'm definitely open to improving this further, even go as far as your
patch, but just looking at your version of __alloc_fd(), I just don't
get the warm and fuzzies.

  Linus


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-11-01 Thread Al Viro
On Sat, Oct 31, 2015 at 08:29:29PM +0000, Al Viro wrote:
> On Sat, Oct 31, 2015 at 12:54:50PM -0700, Linus Torvalds wrote:
> > On Sat, Oct 31, 2015 at 12:34 PM, Al Viro  wrote:
> > >
> > > ... and here's the current variant of mine.
> > 
> > Ugh. I really liked how simple mine ended up being. Yours is definitely not.
> 
> Note that it's not the final variant - just something that should be
> testable.  There are all kinds of things that still need cleaning/simplifying
> in there - e.g. scan() is definitely more complex than needed (if nothing
> else, the "small bitmap" case is simply find_next_zero_bit(), and the
> rest all have size equal to full cacheline; moreover, I'd overdone the
> "... and check if there are other zero bits left" thing - its callers
> used to use that a lot, and with the exception of two of them it was
> absolutely worthless).  So it ended up more generic than necessary and
> I'm going to trim that crap down.
> 
> It's still going to end up more complex than yours, obviously, but not as
> badly as it is now.  I'm not sure if the speedup will be worth the
> extra complexity, thus asking for testing...  On _some_ loads it is
> considerably faster (at least by factor of 5), but whether those loads
> resemble anything that occurs on real systems...

This ought to be a bit cleaner.  Eric, could you test the variant below on your setup?

diff --git a/fs/file.c b/fs/file.c
index 6c672ad..0144920 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -30,6 +30,9 @@ int sysctl_nr_open_min = BITS_PER_LONG;
 int sysctl_nr_open_max = __const_max(INT_MAX, ~(size_t)0/sizeof(void *)) &
 -BITS_PER_LONG;
 
+#define BITS_PER_CHUNK 512
+#define BYTES_PER_CHUNK (BITS_PER_CHUNK / 8)
+
 static void *alloc_fdmem(size_t size)
 {
/*
@@ -46,8 +49,10 @@ static void *alloc_fdmem(size_t size)
 
 static void __free_fdtable(struct fdtable *fdt)
 {
+   int i;
kvfree(fdt->fd);
-   kvfree(fdt->open_fds);
+   for (i = 0; i <= 3; i++)
+   kvfree(fdt->bitmaps[i]);
kfree(fdt);
 }
 
@@ -56,13 +61,23 @@ static void free_fdtable_rcu(struct rcu_head *rcu)
__free_fdtable(container_of(rcu, struct fdtable, rcu));
 }
 
+static inline bool cacheline_full(unsigned long *p)
+{
+   int n;
+
+   for (n = 0; n < BITS_PER_CHUNK / BITS_PER_LONG; n++)
+   if (likely(~p[n]))
+   return false;
+   return true;
+}
+
 /*
  * Expand the fdset in the files_struct.  Called with the files spinlock
  * held for write.
  */
 static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt)
 {
-   unsigned int cpy, set;
+   unsigned int cpy, set, to, from, level, n;
 
BUG_ON(nfdt->max_fds < ofdt->max_fds);
 
@@ -71,18 +86,48 @@ static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt)
memcpy(nfdt->fd, ofdt->fd, cpy);
memset((char *)(nfdt->fd) + cpy, 0, set);
 
-   cpy = ofdt->max_fds / BITS_PER_BYTE;
-   set = (nfdt->max_fds - ofdt->max_fds) / BITS_PER_BYTE;
-   memcpy(nfdt->open_fds, ofdt->open_fds, cpy);
-   memset((char *)(nfdt->open_fds) + cpy, 0, set);
+   cpy = ofdt->max_fds / 8;
+   set = (nfdt->max_fds - ofdt->max_fds) / 8;
memcpy(nfdt->close_on_exec, ofdt->close_on_exec, cpy);
memset((char *)(nfdt->close_on_exec) + cpy, 0, set);
+   if (likely(!nfdt->bitmaps[1])) {
+   // flat to flat
+   memcpy(nfdt->bitmaps[0], ofdt->bitmaps[0], cpy);
+   memset((char *)(nfdt->bitmaps[0]) + cpy, 0, set);
+   return;
+   }
+   to = round_up(nfdt->max_fds, BITS_PER_CHUNK);
+   set = (to - ofdt->max_fds) / 8;
+   // copy and pad the primary
+   memcpy(nfdt->bitmaps[0], ofdt->bitmaps[0], ofdt->max_fds / 8);
+   memset((char *)nfdt->bitmaps[0] + ofdt->max_fds / 8, 0, set);
+   // copy and pad the old secondaries
+   from = round_up(ofdt->max_fds, BITS_PER_CHUNK);
+   for (level = 1; level <= 3; level++) {
+   if (!ofdt->bitmaps[level])
+   break;
+   to = round_up(to / BITS_PER_CHUNK, BITS_PER_CHUNK);
+   from = round_up(from / BITS_PER_CHUNK, BITS_PER_CHUNK);
+   memcpy(nfdt->bitmaps[level], ofdt->bitmaps[level], from / 8);
+   memset((char *)nfdt->bitmaps[level] + from / 8, 0, (to - from) / 8);
+   }
+   // zero the new ones (if any)
+   for (n = level; n <= 3; n++) {
+   if (!nfdt->bitmaps[n])
+   break;
+   to = round_up(to / BITS_PER_CHUNK, BITS_PER_CHUNK);
+   memset(nfdt->bitmaps[n], 0, to / 8);
+   }
+   // and maybe adjust bit 0 in the first new one.
+   if (unlikely(n != level && cacheline_full(nfdt->bitmaps[level - 1])))
+   __set_bit(0, nfdt->bitmaps[level]);
 }
 
 static struct fdtable * alloc_fdtable(unsigned int nr)
 {
struct fdtable *fdt;
void *data;
+  

Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-11-01 Thread Al Viro
On Sun, Nov 01, 2015 at 06:14:43PM -0800, Eric Dumazet wrote:
> On Mon, 2015-11-02 at 00:24 +0000, Al Viro wrote:
> 
> > This ought to be a bit cleaner.  Eric, could you test the variant below on your setup?
> 
> Sure !
> 
> 5 runs of :
> lpaa24:~# taskset ff0ff ./opensock -t 16 -n 1000 -l 10
> 
> total = 4386311
> total = 4560402
> total = 4437309
> total = 4516227
> total = 4478778

Umm...  With Linus' variant it was what, around 400?  +10% or so, then...

> With 48 threads :
> 
> ./opensock -t 48 -n 1000 -l 10
> total = 4940245
> total = 4848513
> total = 4813153
> total = 4813946
> total = 5127804

And that - +40%?  Interesting...  And it looks like at 48 threads you are
still seeing arseloads of contention, but apparently less than with Linus'
variant...  What if you throw the __clear_close_on_exec() patch on
top of that?

Looks like it's spending less time under ->files_lock...  Could you get
information on fs/file.o hotspots?


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-31 Thread Al Viro
On Fri, Oct 30, 2015 at 04:52:41PM -0700, Linus Torvalds wrote:
> I really suspect this patch is "good enough" in reality, and I would
> *much* rather do something like this than add a new non-POSIX flag
> that people have to update their binaries for. I agree with Eric that
> *some* people will do so, but it's still the wrong thing to do. Let's
> just make performance with the normal semantics be good enough that we
> don't need to play odd special games.
> 
> Eric?

... and here's the current variant of mine.  FWIW, it seems to survive
LTP and xfstests + obvious "let's torture the allocator".  On the
"million dups" test it seems to be about 25% faster than the one Linus
had posted, and at ten million - about 80%.  On opensock, results seem to be
about 20% better than with the variant Linus has posted, but I'm not sure
if this testbox is anywhere near what's expected, so I'd appreciate it if
you'd give it a spin on your setups.

It obviously needs saner comments, tuning, etc.  BTW, another obvious
low-hanging fruit with this data structure is count_open_files() (and
that goes for the 1:64 bitmap Linus uses) - dup2(0, 1000); close(1000);
fork(); and count_open_files() is chewing through the damn bitmap from
about 16M down to low tens.  While holding ->files_lock, at that...
I'm not saying it's critical, and it's definitely followup patch fodder
in either approach, but it's easy enough to do.

diff --git a/fs/file.c b/fs/file.c
index 6c672ad..fa43cbe 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -30,6 +30,8 @@ int sysctl_nr_open_min = BITS_PER_LONG;
 int sysctl_nr_open_max = __const_max(INT_MAX, ~(size_t)0/sizeof(void *)) &
 -BITS_PER_LONG;
 
+#define BITS_PER_CHUNK 512
+
 static void *alloc_fdmem(size_t size)
 {
/*
@@ -46,8 +48,10 @@ static void *alloc_fdmem(size_t size)
 
 static void __free_fdtable(struct fdtable *fdt)
 {
+   int i;
kvfree(fdt->fd);
-   kvfree(fdt->open_fds);
+   for (i = 0; i <= 3; i++)
+   kvfree(fdt->bitmaps[i]);
kfree(fdt);
 }
 
@@ -62,7 +66,7 @@ static void free_fdtable_rcu(struct rcu_head *rcu)
  */
 static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt)
 {
-   unsigned int cpy, set;
+   unsigned int cpy, set, to, from, level, n;
 
BUG_ON(nfdt->max_fds < ofdt->max_fds);
 
@@ -71,18 +75,53 @@ static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt)
memcpy(nfdt->fd, ofdt->fd, cpy);
memset((char *)(nfdt->fd) + cpy, 0, set);
 
-   cpy = ofdt->max_fds / BITS_PER_BYTE;
-   set = (nfdt->max_fds - ofdt->max_fds) / BITS_PER_BYTE;
-   memcpy(nfdt->open_fds, ofdt->open_fds, cpy);
-   memset((char *)(nfdt->open_fds) + cpy, 0, set);
+   cpy = ofdt->max_fds / 8;
+   set = (nfdt->max_fds - ofdt->max_fds) / 8;
memcpy(nfdt->close_on_exec, ofdt->close_on_exec, cpy);
memset((char *)(nfdt->close_on_exec) + cpy, 0, set);
+   if (likely(!nfdt->bitmaps[1])) {
+   // flat to flat
+   memcpy(nfdt->bitmaps[0], ofdt->bitmaps[0], cpy);
+   memset((char *)(nfdt->bitmaps[0]) + cpy, 0, set);
+   return;
+   }
+   to = round_up(nfdt->max_fds, BITS_PER_CHUNK);
+   set = (to - ofdt->max_fds) / 8;
+   // copy and pad the primary
+   memcpy(nfdt->bitmaps[0], ofdt->bitmaps[0], ofdt->max_fds / 8);
+   memset((char *)nfdt->bitmaps[0] + ofdt->max_fds / 8, 0, set);
+   // copy and pad the old secondaries
+   from = round_up(ofdt->max_fds, BITS_PER_CHUNK);
+   for (level = 1; level <= 3; level++) {
+   if (!ofdt->bitmaps[level])
+   break;
+   to = round_up(to / BITS_PER_CHUNK, BITS_PER_CHUNK);
+   from = round_up(from / BITS_PER_CHUNK, BITS_PER_CHUNK);
+   memcpy(nfdt->bitmaps[level], ofdt->bitmaps[level], from / 8);
+   memset((char *)nfdt->bitmaps[level] + from / 8, 0, (to - from) / 8);
+   }
+   // zero the new ones (if any)
+   for (n = level; n <= 3; n++) {
+   if (!nfdt->bitmaps[n])
+   break;
+   to = round_up(to / BITS_PER_CHUNK, BITS_PER_CHUNK);
+   memset(nfdt->bitmaps[n], 0, to / 8);
+   }
+   // and maybe adjust bit 0 in the first new one.
+   if (unlikely(n != level)) {
+   unsigned long *p = nfdt->bitmaps[level - 1];
+   for (n = 0; n < BITS_PER_CHUNK / BITS_PER_LONG; n++)
+   if (~p[n])
+   return;
+   __set_bit(0, nfdt->bitmaps[level]);
+   }
 }
 
 static struct fdtable * alloc_fdtable(unsigned int nr)
 {
struct fdtable *fdt;
void *data;
+   int level = 0;
 
/*
 * Figure out how many fds we actually want to support in this fdtable.
@@ -114,16 +153,28 @@ static struct fdtable * alloc_fdtable(unsigned int nr)
goto out_fdt;
fdt->fd = data;
 
+   if (nr > 

Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-31 Thread Al Viro
On Sat, Oct 31, 2015 at 12:54:50PM -0700, Linus Torvalds wrote:
> On Sat, Oct 31, 2015 at 12:34 PM, Al Viro  wrote:
> >
> > ... and here's the current variant of mine.
> 
> Ugh. I really liked how simple mine ended up being. Yours is definitely not.

Note that it's not the final variant - just something that should be
testable.  There are all kinds of things that still need cleaning/simplifying
in there - e.g. scan() is definitely more complex than needed (if nothing
else, the "small bitmap" case is simply find_next_zero_bit(), and the
rest all have size equal to full cacheline; moreover, I'd overdone the
"... and check if there are other zero bits left" thing - its callers
used to use that a lot, and with the exception of two of them it was
absolutely worthless).  So it ended up more generic than necessary and
I'm going to trim that crap down.

It's still going to end up more complex than yours, obviously, but not as
badly as it is now.  I'm not sure if the speedup will be worth the
extra complexity, thus asking for testing...  On _some_ loads it is
considerably faster (at least by factor of 5), but whether those loads
resemble anything that occurs on real systems...

BTW, a considerable amount of unpleasantness is due to ragged-right-end kind
of problems - take /proc/sys/fs/nr-open to something other than a power of
two and a whole lot of fun issues start happening.  I went for "if there
are secondary bitmaps at all, pad all bitmaps to a multiple of cacheline",
which at least somewhat mitigates that mess; hell knows, there might be
a clever way to sidestep it entirely...


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-31 Thread Linus Torvalds
On Sat, Oct 31, 2015 at 12:34 PM, Al Viro  wrote:
>
> ... and here's the current variant of mine.

Ugh. I really liked how simple mine ended up being. Yours is definitely not.

And based on the profiles from Eric, finding the fd is no longer the
problem even with my simpler patch. The problem ends up being the
contention on the file_lock spinlock.

Eric, I assume that's not "expand_fdtable", since your test-program
seems to expand the fd array at the beginning. So it's presumably all
from the __alloc_fd() use, but we should double-check.. Eric, can you
do a callgraph profile and see which caller is the hottest?

 Linus


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-31 Thread Eric Dumazet
On Sat, 2015-10-31 at 12:54 -0700, Linus Torvalds wrote:
> On Sat, Oct 31, 2015 at 12:34 PM, Al Viro  wrote:
> >
> > ... and here's the current variant of mine.
> 
> Ugh. I really liked how simple mine ended up being. Yours is definitely not.
> 
> And based on the profiles from Eric, finding the fd is no longer the
> problem even with my simpler patch. The problem ends up being the
> contention on the file_lock spinlock.
> 
> Eric, I assume that's not "expand_fdtable", since your test-program
> seems to expand the fd array at the beginning. So it's presumably all
> from the __alloc_fd() use, but we should double-check.. Eric, can you
> do a callgraph profile and see which caller is the hottest?

Sure: profile taken while the test runs using 16 threads (since this is
probably not too biased a micro-benchmark...)

# hostname : lpaa24
# os release : 4.3.0-smp-DEV
# perf version : 3.12.0-6-GOOGLE
# arch : x86_64
# nrcpus online : 48
# nrcpus avail : 48
# cpudesc : Intel(R) Xeon(R) CPU E5-2696 v2 @ 2.50GHz
# cpuid : GenuineIntel,6,62,4
# total memory : 264126320 kB
# cmdline : /usr/bin/perf record -a -g sleep 4 
# event : name = cycles, type = 0, config = 0x0, config1 = 0x0, config2 = 0x0, 
excl_usr = 0, excl_kern = 0, excl_host = 0, excl_guest = 1, precise_ip = 0, att
# CPU_TOPOLOGY info available, use -I to display
# NUMA_TOPOLOGY info available, use -I to display
# pmu mappings: cpu = 4, msr = 38, uncore_cbox_10 = 35, uncore_cbox_11 = 36, 
software = 1, power = 7, uncore_irp = 8, uncore_pcu = 37, tracepoint = 2, 
uncore_
# Samples: 260K of event 'cycles'
# Event count (approx.): 196742182232
#
# Overhead  Command   Shared Object       Symbol
# ........  ........  ..................  ......................
#
67.15%   opensock  opensock [.] memset  
  
 |
 --- memset

13.84%   opensock  [kernel.kallsyms][k] queued_spin_lock_slowpath   
  
 |
 --- queued_spin_lock_slowpath
|  
|--99.97%-- _raw_spin_lock
|  |  
|  |--53.03%-- __close_fd
|  |  sys_close
|  |  entry_SYSCALL_64_fastpath
|  |  __libc_close
|  |  |  
|  |   --100.00%-- 0x0
|  |  
|  |--46.83%-- __alloc_fd
|  |  get_unused_fd_flags
|  |  sock_map_fd
|  |  sys_socket
|  |  entry_SYSCALL_64_fastpath
|  |  __socket
|  |  |  
|  |   --100.00%-- 0x0
|   --0.13%-- [...]
 --0.03%-- [...]

 1.84%   opensock  [kernel.kallsyms][k] _find_next_bit.part.0   
  
 |
 --- _find_next_bit.part.0
|  
|--65.97%-- find_next_zero_bit
|  __alloc_fd
|  get_unused_fd_flags
|  sock_map_fd
|  sys_socket
|  entry_SYSCALL_64_fastpath
|  __socket
|  
|--34.01%-- __alloc_fd
|  get_unused_fd_flags
|  sock_map_fd
|  sys_socket
|  entry_SYSCALL_64_fastpath
|  __socket
|  |  
|   --100.00%-- 0x0
 --0.02%-- [...]

 1.59%   opensock  [kernel.kallsyms][k] _raw_spin_lock  
  
 |
 --- _raw_spin_lock
|  
|--28.78%-- get_unused_fd_flags
|  sock_map_fd
|  sys_socket
|  entry_SYSCALL_64_fastpath
|  __socket
|  

Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-31 Thread Linus Torvalds
On Sat, Oct 31, 2015 at 1:45 PM, Eric Dumazet  wrote:
> 13.84%   opensock  [kernel.kallsyms][k] queued_spin_lock_slowpath
>  |
>  --- queued_spin_lock_slowpath
> |
> |--99.97%-- _raw_spin_lock
> |  |
> |  |--53.03%-- __close_fd
> |  |
> |  |--46.83%-- __alloc_fd

Interesting. "__close_fd" actually looks more expensive than
allocation. They presumably get called equally often, so it's probably
some cache effect.

__close_fd() doesn't do anything even remotely interesting as far as I
can tell, but it strikes me that we probably take a *lot* of cache
misses on the stupid "close-on-exec" flags, which are probably always
zero anyway.

Mind testing something really stupid, and making the __clear_bit() in
__clear_close_on_exec() conditional, something like this:

 static inline void __clear_close_on_exec(int fd, struct fdtable *fdt)
 {
-   __clear_bit(fd, fdt->close_on_exec);
+   if (test_bit(fd, fdt->close_on_exec))
+   __clear_bit(fd, fdt->close_on_exec);
 }

and see if it makes a difference.

This is the kind of thing that a single-threaded (or even
single-socket) test will never actually show, because it caches well
enough. But for two sockets, I could imagine the unnecessary dirtying
of cachelines and ping-pong being noticeable.

The other stuff we probably can't do all that much about. Unless we
decide to go for some complicated lockless optimistic file descriptor
allocation scheme with retry-on-failure instead of locks. Which I'm
sure is possible, but I'm equally sure is painful.

Linus


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-31 Thread Al Viro
On Sat, Oct 31, 2015 at 02:23:31PM -0700, Linus Torvalds wrote:

> The other stuff we probably can't do all that much about. Unless we
> decide to go for some complicated lockless optimistic file descriptor
> allocation scheme with retry-on-failure instead of locks. Which I'm
> sure is possible, but I'm equally sure is painful.

The interesting part is dup2() - we'd have to do something like
	serialize against other dup2()
	was_claimed = atomically set and test bit in bitmap
	if (was_claimed) {
		tofree = fdt->fd[fd];
		if (!tofree)
			fail with EBUSY
	}
	install into ->fd[...]
	end of critical area
in there; __alloc_fd() could be made retry-on-failure, but I don't see
how to cope with dup2 vs. dup2 without an explicit exclusion.
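
As a sketch only, the retry-on-failure side of such a scheme might look
like this kernel-style pseudo-C; it ignores table resizing, next_fd and
RCU, and the function name is made up:

static int alloc_fd_retry(struct fdtable *fdt)
{
	for (;;) {
		unsigned int fd = find_next_zero_bit(fdt->open_fds,
						     fdt->max_fds, 0);
		if (fd >= fdt->max_fds)
			return -EMFILE;
		/* atomic claim: exactly one racing thread wins this bit,
		 * the losers simply rescan */
		if (!test_and_set_bit(fd, fdt->open_fds))
			return fd;
	}
}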


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-31 Thread Eric Dumazet
On Fri, 2015-10-30 at 16:52 -0700, Linus Torvalds wrote:
> > Dropping next_fd would screw you in case of strictly sequential
> > allocations...
> 
> I don't think it would matter in real life, since I don't really think
> you have lots of fd's with strictly sequential behavior.
> 
> That said, the trivial "open lots of fds" benchmark would show it, so
> I guess we can just keep it. The next_fd logic is not expensive or
> complex, after all.

+1


> Attached is an updated patch that just uses the regular bitmap
> allocator and extends it to also have the bitmap of bitmaps. It
> actually simplifies the patch, so I guess it's better this way.
> 
> Anyway, I've tested it all a bit more, and for a trivial worst-case
> stress program that explicitly kills the next_fd logic by doing
> 
> 	for (i = 0; i < 100; i++) {
> 		close(3);
> 		dup2(0,3);
> 		if (dup(0) < 0)
> 			break;
> 	}
> 
> it takes it down from roughly 10s to 0.2s. So the patch is quite
> noticeable on that kind of pattern.
> 
> NOTE! You'll obviously need to increase your limits to actually be
> able to do the above with lots of file descriptors.
> 
> I ran Eric's test-program too, and find_next_zero_bit() dropped to a
> fraction of a percent. It's not entirely gone, but it's down in the
> noise.
> 
> I really suspect this patch is "good enough" in reality, and I would
> *much* rather do something like this than add a new non-POSIX flag
> that people have to update their binaries for. I agree with Eric that
> *some* people will do so, but it's still the wrong thing to do. Let's
> just make performance with the normal semantics be good enough that we
> don't need to play odd special games.
> 
> Eric?

I absolutely agree a generic solution is far better, especially when
its performance is on par.

Tested-by: Eric Dumazet 
Acked-by: Eric Dumazet 

Note that a non-POSIX flag (or a thread personality hint)
would still allow the kernel to do proper NUMA affinity placement: say
the fd_array and bitmaps are split over the 2 nodes (or more, but most
servers nowadays really have 2 sockets).

Then at fd allocation time, we can prefer to pick an fd for which the
memory holding the various bits and the file pointer is on the local node.

This speeds up subsequent fd system calls in programs that constantly
blow away cpu caches, saving QPI transactions.
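
As a loose sketch of that placement idea - the split layout and names
below are hypothetical, only kzalloc_node() and numa_node_id() are real
kernel APIs:

struct fdtable_split {			/* hypothetical two-node layout */
	unsigned long *open_fds[2];	/* one bitmap half per node */
	unsigned int half_bits;
};

static int alloc_split_bitmaps(struct fdtable_split *t, unsigned int bits)
{
	int node;

	t->half_bits = bits / 2;
	for (node = 0; node < 2; node++) {
		/* kzalloc_node() places each half on the requested node */
		t->open_fds[node] = kzalloc_node(t->half_bits / 8,
						 GFP_KERNEL, node);
		if (!t->open_fds[node])
			return -ENOMEM;
	}
	return 0;
}

/* the allocator would then scan open_fds[numa_node_id() % 2] first */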

Thanks a lot Linus.

lpaa24:~# taskset ff0ff ./opensock -t 16 -n 1000 -l 10
count=1000 (check/increase ulimit -n)
total = 3992764

lpaa24:~# ./opensock -t 48 -n 1000 -l 10
count=1000 (check/increase ulimit -n)
total = 3545249

Profile with 16 threads :

69.55%  opensock  [.] memset   
11.83%  [kernel]  [k] queued_spin_lock_slowpath
 1.91%  [kernel]  [k] _find_next_bit.part.0
 1.68%  [kernel]  [k] _raw_spin_lock   
 0.99%  [kernel]  [k] kmem_cache_alloc 
 0.99%  [kernel]  [k] memset_erms  
 0.95%  [kernel]  [k] get_empty_filp   
 0.82%  [kernel]  [k] __close_fd   
 0.73%  [kernel]  [k] __alloc_fd   
 0.65%  [kernel]  [k] sk_alloc 
 0.63%  opensock  [.] child_function   
 0.56%  [kernel]  [k] fput 
 0.35%  [kernel]  [k] sock_alloc   
 0.31%  [kernel]  [k] kmem_cache_free  
 0.31%  [kernel]  [k] inode_init_always
 0.28%  [kernel]  [k] d_set_d_op   
 0.27%  [kernel]  [k] entry_SYSCALL_64_after_swapgs

Profile with 48 threads :

57.92%  [kernel]  [k] queued_spin_lock_slowpath
32.14%  opensock  [.] memset   
 0.81%  [kernel]  [k] _find_next_bit.part.0
 0.51%  [kernel]  [k] _raw_spin_lock   
 0.45%  [kernel]  [k] kmem_cache_alloc 
 0.38%  [kernel]  [k] kmem_cache_free  
 0.34%  [kernel]  [k] __close_fd   
 0.32%  [kernel]  [k] memset_erms  
 0.25%  [kernel]  [k] __alloc_fd   
 0.24%  [kernel]  [k] get_empty_filp   
 0.23%  opensock  [.] child_function   
 0.18%  [kernel]  [k] __d_alloc
 0.17%  [kernel]  [k] inode_init_always
 0.16%  [kernel]  [k] sock_alloc   
 0.16%  [kernel]  [k] del_timer
 0.15%  [kernel]  [k] entry_SYSCALL_64_after_swapgs
 0.15%  perf  [.] 0x0004d924   
 0.15%  [kernel]  [k] tcp_close





Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-31 Thread Eric Dumazet
On Sat, 2015-10-31 at 14:23 -0700, Linus Torvalds wrote:

> Mind testing something really stupid, and making the __clear_bit() in
> __clear_close_on_exec() conditional, something like this:
> 
>  static inline void __clear_close_on_exec(int fd, struct fdtable *fdt)
>  {
> -   __clear_bit(fd, fdt->close_on_exec);
> +   if (test_bit(fd, fdt->close_on_exec))
> +   __clear_bit(fd, fdt->close_on_exec);
>  }
> 
> and see if it makes a difference.

It does ;)

About 4 % qps increase

3 runs : 
lpaa24:~# taskset ff0ff ./opensock -t 16 -n 1000 -l 10
total = 4176651
total = 4178012
total = 4105226

instead of :
total = 3910620
total = 3874567
total = 3971028

Perf profile :

69.12%   opensock  opensock [.] memset  
 
 |
 --- memset

12.37%   opensock  [kernel.kallsyms][k] queued_spin_lock_slowpath   
 
 |
 --- queued_spin_lock_slowpath
|  
|--99.99%-- _raw_spin_lock
|  |  
|  |--51.99%-- __close_fd
|  |  sys_close
|  |  entry_SYSCALL_64_fastpath
|  |  __libc_close
|  |  |  
|  |   --100.00%-- 0x0
|  |  
|  |--47.79%-- __alloc_fd
|  |  get_unused_fd_flags
|  |  sock_map_fd
|  |  sys_socket
|  |  entry_SYSCALL_64_fastpath
|  |  __socket
|  |  |  
|  |   --100.00%-- 0x0
|   --0.21%-- [...]
 --0.01%-- [...]

 1.92%   opensock  [kernel.kallsyms][k] _find_next_bit.part.0   
 
 |
 --- _find_next_bit.part.0
|  
|--66.93%-- find_next_zero_bit
|  __alloc_fd
|  get_unused_fd_flags
|  sock_map_fd
|  sys_socket
|  entry_SYSCALL_64_fastpath
|  __socket
|  
 --33.07%-- __alloc_fd
   get_unused_fd_flags
   sock_map_fd
   sys_socket
   entry_SYSCALL_64_fastpath
   __socket
   |  
--100.00%-- 0x0

 1.63%   opensock  [kernel.kallsyms][k] _raw_spin_lock  
 
 |
 --- _raw_spin_lock
|  
|--28.66%-- get_unused_fd_flags
|  sock_map_fd




Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-30 Thread Linus Torvalds
On Fri, Oct 30, 2015 at 2:23 PM, Linus Torvalds wrote:
> On Fri, Oct 30, 2015 at 2:02 PM, Al Viro  wrote:
>>
>> Your variant has 1:64 ratio; obviously better than now, but we can actually
>> do 1:bits-per-cacheline quite easily.
>
> Ok, but in that case you end up needing a counter for each cacheline
> too (to count how many bits, in order to know when to say "cacheline
> is entirely full").

So here's a largely untested version of my "one bit per word"
approach. It seems to work, but looking at it, I'm unhappy with a few
things:

 - using kmalloc() for the .full_fds_bits[] array is simple, but
disgusting, since 99% of all programs just have a single word.

   I know I talked about just adding the allocation to the same one
that allocates the bitmaps themselves, but I got lazy and didn't do
it. Especially since that code seems to try fairly hard to make the
allocations nice powers of two, according to the comments. That may
actually matter from an allocation standpoint.

 - Maybe we could just use that "full_fds_bits_init" field for when a
single word is sufficient, and avoid the kmalloc that way?

Anyway. This is a pretty simple patch, and I actually think that we
could just get rid of the "next_fd" logic entirely with this. That
would make this *patch* be more complicated, but it would make the
resulting *code* be simpler.

Hmm? Want to play with this? Eric, what does this do to your test-case?

Linus
 fs/file.c   | 49 ++---
 include/linux/fdtable.h |  2 ++
 2 files changed, 48 insertions(+), 3 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 6c672ad329e9..0b89a97cabb5 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -48,6 +48,7 @@ static void __free_fdtable(struct fdtable *fdt)
 {
kvfree(fdt->fd);
kvfree(fdt->open_fds);
+   kfree(fdt->full_fds_bits);
kfree(fdt);
 }
 
@@ -56,6 +57,9 @@ static void free_fdtable_rcu(struct rcu_head *rcu)
__free_fdtable(container_of(rcu, struct fdtable, rcu));
 }
 
+#define BITBIT_NR(nr)  BITS_TO_LONGS(BITS_TO_LONGS(nr))
+#define BITBIT_SIZE(nr)(BITBIT_NR(nr) * sizeof(long))
+
 /*
  * Expand the fdset in the files_struct.  Called with the files spinlock
  * held for write.
@@ -77,6 +81,11 @@ static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt)
memset((char *)(nfdt->open_fds) + cpy, 0, set);
memcpy(nfdt->close_on_exec, ofdt->close_on_exec, cpy);
memset((char *)(nfdt->close_on_exec) + cpy, 0, set);
+
+   cpy = BITBIT_SIZE(ofdt->max_fds);
+   set = BITBIT_SIZE(nfdt->max_fds) - cpy;
+   memcpy(nfdt->full_fds_bits, ofdt->full_fds_bits, cpy);
+   memset(cpy+(char *)nfdt->full_fds_bits, 0, set);
 }
 
 static struct fdtable * alloc_fdtable(unsigned int nr)
@@ -122,8 +131,21 @@ static struct fdtable * alloc_fdtable(unsigned int nr)
data += nr / BITS_PER_BYTE;
fdt->close_on_exec = data;
 
+   /*
+* The "bitmap of bitmaps" has a bit set for each word in
+* the open_fds array that is full. We initialize it to
+* zero both at allocation and at copying, because it is
+* important that it never have a bit set for a non-full
+* word, but it doesn't have to be exact otherwise.
+*/
+   fdt->full_fds_bits = kzalloc(BITBIT_SIZE(nr), GFP_KERNEL);
+   if (!fdt->full_fds_bits)
+   goto out_nofull;
+
return fdt;
 
+out_nofull:
+   kvfree(fdt->open_fds);
 out_arr:
kvfree(fdt->fd);
 out_fdt:
@@ -229,14 +251,18 @@ static inline void __clear_close_on_exec(int fd, struct fdtable *fdt)
__clear_bit(fd, fdt->close_on_exec);
 }
 
-static inline void __set_open_fd(int fd, struct fdtable *fdt)
+static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
 {
__set_bit(fd, fdt->open_fds);
+   fd /= BITS_PER_LONG;
+   if (!~fdt->open_fds[fd])
+   __set_bit(fd, fdt->full_fds_bits);
 }
 
-static inline void __clear_open_fd(int fd, struct fdtable *fdt)
+static inline void __clear_open_fd(unsigned int fd, struct fdtable *fdt)
 {
__clear_bit(fd, fdt->open_fds);
+   __clear_bit(fd / BITS_PER_LONG, fdt->full_fds_bits);
 }
 
 static int count_open_files(struct fdtable *fdt)
@@ -280,6 +306,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
new_fdt->max_fds = NR_OPEN_DEFAULT;
new_fdt->close_on_exec = newf->close_on_exec_init;
new_fdt->open_fds = newf->open_fds_init;
+   new_fdt->full_fds_bits = newf->full_fds_bits_init;
new_fdt->fd = &newf->fd_array[0];
 
spin_lock(&oldf->file_lock);
@@ -323,6 +350,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 
memcpy(new_fdt->open_fds, old_fdt->open_fds, open_files / 8);
memcpy(new_fdt->close_on_exec, old_fdt->close_on_exec, open_files / 8);
+   memcpy(new_fdt->full_fds_bits, 

Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-30 Thread Linus Torvalds
On Fri, Oct 30, 2015 at 2:02 PM, Al Viro  wrote:
>
> Your variant has 1:64 ratio; obviously better than now, but we can actually
> do 1:bits-per-cacheline quite easily.

Ok, but in that case you end up needing a counter for each cacheline
too (to count how many bits, in order to know when to say "cacheline
is entirely full").

Because otherwise you'll end up having to check the whole cacheline
(rather than check just one word, or check the "bit counter") when you
set a bit.

So I suspect a "bitmap of full words" ends up being much simpler.  But
hey, if you can make your more complex thing work, go right ahead.

   Linus


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-30 Thread Linus Torvalds
On Fri, Oct 30, 2015 at 3:33 PM, Al Viro  wrote:
> On Fri, Oct 30, 2015 at 02:50:46PM -0700, Linus Torvalds wrote:
>
>> Anyway. This is a pretty simple patch, and I actually think that we
>> could just get rid of the "next_fd" logic entirely with this. That
>> would make this *patch* be more complicated, but it would make the
>> resulting *code* be simpler.
>
> Dropping next_fd would screw you in case of strictly sequential allocations...

I don't think it would matter in real life, since I don't really think
you have lots of fd's with strictly sequential behavior.

That said, the trivial "open lots of fds" benchmark would show it, so
I guess we can just keep it. The next_fd logic is not expensive or
complex, after all.

> Your point re power-of-two allocations is well-taken, but then I'm not
> sure that kzalloc() is good enough here.

Attached is an updated patch that just uses the regular bitmap
allocator and extends it to also have the bitmap of bitmaps. It
actually simplifies the patch, so I guess it's better this way.

Anyway, I've tested it all a bit more, and for a trivial worst-case
stress program that explicitly kills the next_fd logic by doing

	for (i = 0; i < 100; i++) {
		close(3);
		dup2(0,3);
		if (dup(0) < 0)
			break;
	}

it takes it down from roughly 10s to 0.2s. So the patch is quite
noticeable on that kind of pattern.

NOTE! You'll obviously need to increase your limits to actually be
able to do the above with lots of file descriptors.

I ran Eric's test-program too, and find_next_zero_bit() dropped to a
fraction of a percent. It's not entirely gone, but it's down in the
noise.

I really suspect this patch is "good enough" in reality, and I would
*much* rather do something like this than add a new non-POSIX flag
that people have to update their binaries for. I agree with Eric that
*some* people will do so, but it's still the wrong thing to do. Let's
just make performance with the normal semantics be good enough that we
don't need to play odd special games.

Eric?

   Linus
 fs/file.c   | 39 +++
 include/linux/fdtable.h |  2 ++
 2 files changed, 37 insertions(+), 4 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 6c672ad329e9..6f6eb2b03af5 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -56,6 +56,9 @@ static void free_fdtable_rcu(struct rcu_head *rcu)
__free_fdtable(container_of(rcu, struct fdtable, rcu));
 }
 
+#define BITBIT_NR(nr)  BITS_TO_LONGS(BITS_TO_LONGS(nr))
+#define BITBIT_SIZE(nr)(BITBIT_NR(nr) * sizeof(long))
+
 /*
  * Expand the fdset in the files_struct.  Called with the files spinlock
  * held for write.
@@ -77,6 +80,11 @@ static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt)
memset((char *)(nfdt->open_fds) + cpy, 0, set);
memcpy(nfdt->close_on_exec, ofdt->close_on_exec, cpy);
memset((char *)(nfdt->close_on_exec) + cpy, 0, set);
+
+   cpy = BITBIT_SIZE(ofdt->max_fds);
+   set = BITBIT_SIZE(nfdt->max_fds) - cpy;
+   memcpy(nfdt->full_fds_bits, ofdt->full_fds_bits, cpy);
+   memset(cpy+(char *)nfdt->full_fds_bits, 0, set);
 }
 
 static struct fdtable * alloc_fdtable(unsigned int nr)
@@ -115,12 +123,14 @@ static struct fdtable * alloc_fdtable(unsigned int nr)
fdt->fd = data;
 
data = alloc_fdmem(max_t(size_t,
-2 * nr / BITS_PER_BYTE, L1_CACHE_BYTES));
+2 * nr / BITS_PER_BYTE + BITBIT_SIZE(nr), L1_CACHE_BYTES));
if (!data)
goto out_arr;
fdt->open_fds = data;
data += nr / BITS_PER_BYTE;
fdt->close_on_exec = data;
+   data += nr / BITS_PER_BYTE;
+   fdt->full_fds_bits = data;
 
return fdt;
 
@@ -229,14 +239,18 @@ static inline void __clear_close_on_exec(int fd, struct fdtable *fdt)
__clear_bit(fd, fdt->close_on_exec);
 }
 
-static inline void __set_open_fd(int fd, struct fdtable *fdt)
+static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
 {
__set_bit(fd, fdt->open_fds);
+   fd /= BITS_PER_LONG;
+   if (!~fdt->open_fds[fd])
+   __set_bit(fd, fdt->full_fds_bits);
 }
 
-static inline void __clear_open_fd(int fd, struct fdtable *fdt)
+static inline void __clear_open_fd(unsigned int fd, struct fdtable *fdt)
 {
__clear_bit(fd, fdt->open_fds);
+   __clear_bit(fd / BITS_PER_LONG, fdt->full_fds_bits);
 }
 
 static int count_open_files(struct fdtable *fdt)
@@ -280,6 +294,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
new_fdt->max_fds = NR_OPEN_DEFAULT;
new_fdt->close_on_exec = newf->close_on_exec_init;
new_fdt->open_fds = newf->open_fds_init;
+   new_fdt->full_fds_bits = newf->full_fds_bits_init;
new_fdt->fd = &newf->fd_array[0];
 
spin_lock(&oldf->file_lock);
@@ -323,6 +338,7 @@ struct files_struct *dup_fd(struct files_struct 

Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-30 Thread Eric Dumazet
On Fri, 2015-10-30 at 14:50 -0700, Linus Torvalds wrote:
> On Fri, Oct 30, 2015 at 2:23 PM, Linus Torvalds wrote:
> > On Fri, Oct 30, 2015 at 2:02 PM, Al Viro  wrote:
> >>
> >> Your variant has 1:64 ratio; obviously better than now, but we can actually
> >> do 1:bits-per-cacheline quite easily.
> >
> > Ok, but in that case you end up needing a counter for each cacheline
> > too (to count how many bits, in order to know when to say "cacheline
> > is entirely full").
> 
> So here's a largely untested version of my "one bit per word"
> approach. It seems to work, but looking at it, I'm unhappy with a few
> things:
> 
>  - using kmalloc() for the .full_fds_bits[] array is simple, but
> disgusting, since 99% of all programs just have a single word.
> 
>I know I talked about just adding the allocation to the same one
> that allocates the bitmaps themselves, but I got lazy and didn't do
> it. Especially since that code seems to try fairly hard to make the
> allocations nice powers of two, according to the comments. That may
> actually matter from an allocation standpoint.
> 
>  - Maybe we could just use that "full_fds_bits_init" field for when a
> single word is sufficient, and avoid the kmalloc that way?

At least make sure the allocation uses a full cache line, so that multiple
processes do not share the same cache line for this possibly hot field:

fdt->full_fds_bits = kzalloc(max_t(size_t,
   L1_CACHE_BYTES,
   BITBIT_SIZE(nr)),
 GFP_KERNEL);

> 
> Anyway. This is a pretty simple patch, and I actually think that we
> could just get rid of the "next_fd" logic entirely with this. That
> would make this *patch* be more complicated, but it would make the
> resulting *code* be simpler.
> 
> Hmm? Want to play with this? Eric, what does this do to your test-case?

Excellent results so far Linus, 500 % increase, thanks a lot !

Tested using 16 threads, 8 on Socket0, 8 on Socket1 

Before patch :

# ulimit -n 1200
# taskset ff0ff ./opensock -t 16 -n 1000 -l 10   
count=1000 (check/increase ulimit -n)
total = 636870

After patch :

taskset ff0ff ./opensock -t 16 -n 1000 -l 10
count=1000 (check/increase ulimit -n)
total = 3845134   (6 times better)

Your patch out-performs the O_FD_FASTALLOC one on this particular test
by ~ 9 % :

taskset ff0ff ./opensock -t 16 -n 1000 -l 10 -f
count=1000 (check/increase ulimit -n)
total = 3505252

If I raise to 48 threads, the FAST_ALLOC wins by 5 % (3752087 instead of
354).

Oh, and 48 threads without any patches : 383027

-> program runs one order of magnitude faster, congrats !




Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-30 Thread Al Viro
On Fri, Oct 30, 2015 at 04:52:41PM -0700, Linus Torvalds wrote:

> I really suspect this patch is "good enough" in reality, and I would
> *much* rather do something like this than add a new non-POSIX flag
> that people have to update their binaries for. I agree with Eric that
> *some* people will do so, but it's still the wrong thing to do. Let's
> just make performance with the normal semantics be good enough that we
> don't need to play odd special games.
> 
> Eric?

IIRC, at least a part of what Eric used to complain about was that on
seriously multithreaded processes doing a lot of e.g. socket(2) we end up
with a lot of bouncing of the cacheline containing the first free bits in
the bitmap.
But looking at the whole thing, I really wonder if the tons of threads asking
for random bytes won't get at least as bad cacheline bouncing while
getting said bytes, so I'm not sure if that rationale has survived.

PS: this problem obviously exists in Linus' variant as well as in mine;
the question is whether Eric's approach manages to avoid it in the first place.


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-30 Thread Linus Torvalds
On Thu, Oct 29, 2015 at 5:35 AM, Eric Dumazet  wrote:
>
> Well, I already tested the O_FD_FASTALLOC thing, and I can tell you
> find_next_zero_bit() is nowhere to be found in kernel profiles anymore.
> It also lowers time we hold the fd array spinlock while doing fd alloc.

I do wonder if we couldn't just speed up the bitmap allocator by an
order of magnitude. It would be nicer to be able to make existing
loads faster without any new "don't bother with POSIX semantics" flag.

We could, for example, extend the "open_fds" bitmap with a
"second-level" bitmap that has information on when the first level is
full. We traverse the open_fd's bitmap one word at a time anyway, we
could have a second-level bitmap that has one bit per word to say
whether that word is already full.

So you'd basically have:

 - open_fds: array with the real bitmap (exactly like we do now)
 - full_fds_bits: array of bits that shows which of the "unsigned
longs" in 'open_fs' are already full.

and then you can basically walk 4096 fd's at a time by just checking
one single word in the full_fds_bits[] array.

So in __alloc_fd(), instead of using find_next_zero_bit(), you'd use
"find_next_fd()", which woulf do something like

#define NR_BITS_PER_BIT (BITS_PER_LONG * BITS_PER_LONG)

unsigned long find_next_fd(struct fdtable *fdt, unsigned long start)
{
	while (start < fdt->max_fds) {
		unsigned long fd;
		unsigned long startbit = start / BITS_PER_LONG;
		unsigned long bitbit = startbit / BITS_PER_LONG;
		unsigned long bitmax = (bitbit+1) * BITS_PER_LONG * BITS_PER_LONG;

		if (bitmax > fdt->max_fds)
			bitmax = fdt->max_fds;

		// Are all the open_fds words covered by this
		// full_fds_bits word already full?
		if (!~fdt->full_fds_bits[bitbit]) {
			start = bitmax;
			continue;
		}

		fd = find_next_zero_bit(fdt->open_fds, bitmax, start);
		if (fd < bitmax)
			return fd;
		start = bitmax;
	}
	return fdt->max_fds;
}

which really should cut down on the bit traversal by a factor of 64.

Of course, then you do have to maintain that bitmap-of-bits in
__set_open_fd() and __clear_open_fd(), but that should be pretty
trivial too:

 - __clear_open_fd() would become just

__clear_bit(fd, fdt->open_fds);
__clear_bit(fd / BITS_PER_LONG, fdt->full_fds_bits);

 - and __set_open_fd() would become

__set_bit(fd, fdt->open_fds);
fd /= BITS_PER_LONG;
if (!~fdt->open_fds[fd])
__set_bit(fd, fdt->full_fds_bits);

or something.

NOTE NOTE NOTE! The above is all very much written without any testing
in the email client, so while I tried to make it look like "real
code", consider the above just pseudo-code.

The advantage of the above is that it should just work for existing
binaries. It may not be quite as optimal as just introducing a new
"don't care about POSIX" feature, but quite frankly, if it cuts down
the bad case of "find_next_zero_bit()" by a factor of 64 (and then
adds a *small* expense factor on top of that), I suspect it should be
"good enough" even for your nasty case.

What do you think? Willing to try the above approach (with any
inevitable bug-fixes) and see how it compares?

Obviously in addition to any fixes to my pseudo-code above you'd need
to add the allocations for the new "full_fds_bits" etc, but I think it
should be easy to make the full_fds_bit allocation be *part* of the
"open_fds" allocation, so you wouldn't need a new allocation in
alloc_fdtable(). We already do that whole "use a single allocation" to
combine open_fds with close_on_exec into one single allocation.

Linus


RE: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-30 Thread David Laight
From: David Holland
> Sent: 29 October 2015 14:59
> On Tue, Oct 27, 2015 at 10:52:46AM +, Alan Burlison wrote:
>  > >But in general, this is basically a problem with the application: the file
>  > >descriptor space is shared between threads and having one thread sniping
>  > >at open files, you do have a problem and whatever the kernel does in that
>  > >case perhaps doesn't matter all that much: the application needs to be
>  > >fixed anyway.
>  >
>  > The scenario in Hadoop is that the FD is being used by a thread that's
>  > waiting in accept and another thread wants to shut it down, e.g. because
>  > the application is terminating and needs to stop all threads cleanly.
> 
> ISTM that the best way to do this is to post a signal to the thread so
> accept bails with EINTR, at which point it can check to see if it's
> supposed to be exiting.

Actually, just send it a connect indication.

ISTM that the correct call should be listen(fd, 0);
Although that doesn't help a thread stuck in recvmsg() for a datagram.

It is also tempting to think that close(fd) should sleep until all
io activities using that fd have completed - whether or not blocking
calls are woken.

David



Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-30 Thread Al Viro
On Fri, Oct 30, 2015 at 10:18:12AM -0700, Linus Torvalds wrote:

> I do wonder if we couldn't just speed up the bitmap allocator by an
> order of magnitude. It would be nicer to be able to make existing
> loads faster without any new "don't bother with POSIX semantics" flag.
> 
> We could, for example, extend the "open_fds" bitmap with a
> "second-level" bitmap that has information on when the first level is
> full. We traverse the open_fd's bitmap one word at a time anyway, we
> could have a second-level bitmap that has one bit per word to say
> whether that word is already full.

Your variant has 1:64 ratio; obviously better than now, but we can actually
do 1:bits-per-cacheline quite easily.

I've been playing with a variant that has more than two bitmaps, and
AFAICS it
a) does not increase the amount of cacheline pulled and
b) keeps it well-bound even in the absolutely worst case
(128M-odd descriptors filled, followed by close(0);dup2(1,0); -
in that case it ends up accessing the 7 cachelines worth of
bitmaps; your variant will read through 4000-odd cachelines of
the summary bitmap alone; the mainline is even worse).

> The advantage of the above is that it should just work for existing
> binaries. It may not be quite as optimal as just introducing a new
> "don't care about POSIX" feature, but quite frankly, if it cuts down
> the bad case of "find_next_zero_bit()" by a factror of 64 (and then
> adds a *small* expense factor on top of that), I suspect it should be
> "good enough" even for your nasty case.
> 
> What do you think? Willing to try the above approach (with any
> inevitable bug-fixes) and see how it compares?
> 
> Obviously in addition to any fixes to my pseudo-code above you'd need
> to add the allocations for the new "full_fds_bits" etc, but I think it
> should be easy to make the full_fds_bit allocation be *part* of the
> "open_fds" allocation, so you wouldn't need a new allocation in
> alloc_fdtable(). We already do that whole "use a single allocation" to
> combine open_fds with close_on_exec into one single allocation.

I'll finish testing what I've got and post it; it costs 3 extra pointers
in the files_struct and a bit fatter bitmap allocation (less than 0.2%
extra).  All the arguments regarding the unmodified binaries apply, of
course, and so far it looks fairly compact...


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-30 Thread Al Viro
On Fri, Oct 30, 2015 at 05:43:21PM +, David Laight wrote:

> ISTM that the correct call should be listen(fd, 0);
> Although that doesn't help a thread stuck in recvmsg() for a datagram.
> 
> It is also tempting to think that close(fd) should sleep until all
> io activities using that fd have completed - whether or not blocking
> calls are woken.

Sigh...  The kernel has no idea when other threads are done with "all
io activities using that fd" - it can wait for them to leave the
kernel mode, but there's fuck-all it can do about e.g. a userland
loop doing write() until there's more data to send.  And no, you can't
rely upon them catching EBADF on the next iteration - by the time they
get there, close() might very well have returned and open() from yet
another thread might've grabbed the same descriptor.  Welcome to your
data being written to hell knows what...

That's precisely the reason why "wait in close()" kind of semantics
is worthless - the races are still there, and having them a bit
harder to hit just makes them harder to debug.  Worse, it might
create an impression of safety where there's none to be had.


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-30 Thread Al Viro
On Fri, Oct 30, 2015 at 02:50:46PM -0700, Linus Torvalds wrote:

> Anyway. This is a pretty simple patch, and I actually think that we
> could just get rid of the "next_fd" logic entirely with this. That
> would make this *patch* be more complicated, but it would make the
> resulting *code* be simpler.

Dropping next_fd would screw you in case of strictly sequential allocations...

Your point re power-of-two allocations is well-taken, but then I'm not
sure that kzalloc() is good enough here.  Look: you have a bit for every
64 descriptors, i.e. byte per 512.  On 10M case Eric had been talking
about that'll yield 32Kb worth of your secondary bitmap.  It's right on
the edge of the range where vmalloc() becomes attractive; for something
bigger it gets even worse...

Currently we go for vmalloc (on bitmaps) once we are past 128K descriptors
(two bitmaps packed together => 256Kbit = 32Kb).  kmalloc() is very
sensitive to size being a power of two, but IIRC vmalloc() isn't...


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-29 Thread Eric Dumazet
On Thu, 2015-10-29 at 05:35 -0700, Eric Dumazet wrote:

> Current kernel :
> 
> 64.98%  [kernel]  [k] queued_spin_lock_slowpath
> 14.88%  opensock  [.] memset   // this part simulates user land actual work ;)
> 11.15%  [kernel]  [k] _find_next_bit.part.0
>  0.69%  [kernel]  [k] _raw_spin_lock   
>  0.46%  [kernel]  [k] memset_erms  
>  0.38%  [kernel]  [k] sk_alloc 
>  0.37%  [kernel]  [k] kmem_cache_alloc 
>  0.33%  [kernel]  [k] get_empty_filp   
>  0.31%  [kernel]  [k] kmem_cache_free  
>  0.26%  [kernel]  [k] __alloc_fd   
>  0.26%  opensock  [.] child_function   
>  0.18%  [kernel]  [k] inode_init_always
>  0.17%  opensock  [.] __random_r   

With attached prototype patch we get this profile instead :

You can see we no longer hit the spinlock issue and cache waste
in find_next_bit.

Userland can really progress _much_ faster.

76.86%  opensock  [.] memset
 1.31%  [kernel]  [k] _raw_spin_lock
 1.15%  assd  [.] 0x0056f32c
 1.08%  [kernel]  [k] kmem_cache_free   
 0.97%  [kernel]  [k] kmem_cache_alloc  
 0.83%  [kernel]  [k] sk_alloc  
 0.72%  [kernel]  [k] memset_erms   
 0.70%  opensock  [.] child_function
 0.67%  [kernel]  [k] get_empty_filp
 0.65%  [kernel]  [k] __alloc_fd
 0.58%  [kernel]  [k] __close_fd
 0.49%  [kernel]  [k] queued_spin_lock_slowpath 

diff --git a/fs/file.c b/fs/file.c
index 6c672ad329e9..eabb9a626259 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include <linux/random.h>
 
 int sysctl_nr_open __read_mostly = 1024*1024;
 int sysctl_nr_open_min = BITS_PER_LONG;
@@ -471,6 +472,19 @@ int __alloc_fd(struct files_struct *files,
spin_lock(&files->file_lock);
 repeat:
fdt = files_fdtable(files);
+
+   if (unlikely(flags & O_FD_FASTALLOC)) {
+   u32 rnd, limit = min(end, fdt->max_fds);
+
+   /*
+* Note: do not bother with files->next_fd,
+* this is for POSIX lovers...
+*/
+   rnd = ((u64)prandom_u32() * limit) >> 32;
+   fd = find_next_zero_bit(fdt->open_fds, limit, rnd);
+   if (fd < limit)
+   goto ok;
+   }
fd = start;
if (fd < files->next_fd)
fd = files->next_fd;
@@ -499,7 +513,7 @@ repeat:
 
if (start <= files->next_fd)
files->next_fd = fd + 1;
-
+ok:
__set_open_fd(fd, fdt);
if (flags & O_CLOEXEC)
__set_close_on_exec(fd, fdt);
diff --git a/include/linux/net.h b/include/linux/net.h
index 70ac5e28e6b7..3823d082af4c 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -76,6 +76,7 @@ enum sock_type {
 #ifndef SOCK_NONBLOCK
 #define SOCK_NONBLOCK  O_NONBLOCK
 #endif
+#define SOCK_FD_FASTALLOC O_FD_FASTALLOC
 
 #endif /* ARCH_HAS_SOCKET_TYPES */
 
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index e063effe0cc1..badd421dd9f4 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -88,6 +88,10 @@
#define __O_TMPFILE 020000000
 #endif
 
+#ifndef O_FD_FASTALLOC
#define O_FD_FASTALLOC 0x40000000
+#endif
+
 /* a horrid kludge trying to make sure that this will fail on old kernels */
 #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
 #define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)  
diff --git a/net/socket.c b/net/socket.c
index 9963a0b53a64..6dde02b2eaf9 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1227,9 +1227,10 @@ SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol)
BUILD_BUG_ON((SOCK_MAX | SOCK_TYPE_MASK) != SOCK_TYPE_MASK);
BUILD_BUG_ON(SOCK_CLOEXEC & SOCK_TYPE_MASK);
BUILD_BUG_ON(SOCK_NONBLOCK & SOCK_TYPE_MASK);
+   BUILD_BUG_ON(SOCK_FD_FASTALLOC & SOCK_TYPE_MASK);
 
flags = type & ~SOCK_TYPE_MASK;
-   if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK))
+   if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK | SOCK_FD_FASTALLOC))
return -EINVAL;
type &= SOCK_TYPE_MASK;
 
@@ -1240,7 +1241,7 @@ SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol)
if (retval < 0)
goto out;
 
-   retval = sock_map_fd(sock, flags & (O_CLOEXEC | O_NONBLOCK));
+   retval = sock_map_fd(sock, flags & (O_CLOEXEC | O_NONBLOCK | O_FD_FASTALLOC));
if (retval < 0)
goto out_release;

Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-29 Thread Eric Dumazet
On Thu, 2015-10-29 at 04:16 +, Al Viro wrote:

> Have you tried to experiment with that in userland?  I mean, emulate that
> thing in normal userland code, count the cacheline accesses and drive it
> with the use patterns collected from actual applications.

Sure.

> 
> I can sit down and play with math expectations, but I suspect that it's
> easier to experiment.  It's nothing but an intuition (I hadn't seriously
> done probability theory in quite a while, and my mathematical tastes run
> more to geometry and topology anyway), but... I would expect it to degrade
> badly when the bitmap is reasonably dense.
> 
> Note, BTW, that vmalloc'ed memory gets populated as you read it, and it's
> not cheap - it's done via #PF triggered in kernel mode, with handler
> noticing that the faulting address is in vmalloc range and doing the
> right thing.  IOW, if your bitmap is very sparse, the price of page faults
> needs to be taken into account.

This vmalloc PF is pure noise.
This only matters for the very first allocations.

We target programs opening zillions of fd in their lifetime ;)

Not having to expand a 4,000,000 slots fd array while fully loaded also
removes a latency spike that is very often not desirable.


> 
> AFAICS, the only benefit of that thing is keeping dirtied cachelines far
> from each other.  Which might be a win overall, but I'm not convinced that
> the rest won't offset the effect of that...

Well, I already tested the O_FD_FASTALLOC thing, and I can tell you
find_next_zero_bit() is nowhere to be found in kernel profiles anymore.
It also lowers time we hold the fd array spinlock while doing fd alloc.

User land test program I wrote a few months back

Current kernel :

64.98%  [kernel]  [k] queued_spin_lock_slowpath
14.88%  opensock  [.] memset   // this part simulates user land actual work ;)
11.15%  [kernel]  [k] _find_next_bit.part.0
 0.69%  [kernel]  [k] _raw_spin_lock   
 0.46%  [kernel]  [k] memset_erms  
 0.38%  [kernel]  [k] sk_alloc 
 0.37%  [kernel]  [k] kmem_cache_alloc 
 0.33%  [kernel]  [k] get_empty_filp   
 0.31%  [kernel]  [k] kmem_cache_free  
 0.26%  [kernel]  [k] __alloc_fd   
 0.26%  opensock  [.] child_function   
 0.18%  [kernel]  [k] inode_init_always
 0.17%  opensock  [.] __random_r   


/*
 * test for b/9072743 : fd scaling on gigantic process (with ~ 10,000,000 TCP sockets)
 * - Size fd arrays in kernel to avoid resizings that kill latencies.
 * - Then launch xx threads doing
 *populate the fd array of the process, opening 'max' files.
 *
 *- Loop : close(randomfd()), socket(AF_INET, SOCK_STREAM, 0);
 *
 * Usage : opensock [ -n fds_count ] [ -t threads_count] [-f]
 */

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

unsigned int count;
int skflags;

#define NBTHREADS_MAX 4096
pthread_t tid[NBTHREADS_MAX];
int nbthreads;
int nbthreads_req = 24;
int stop_all;

#ifndef O_FD_FASTALLOC
#define O_FD_FASTALLOC 0x40000000
#endif

#ifndef SOCK_FD_FASTALLOC
#define SOCK_FD_FASTALLOC O_FD_FASTALLOC
#endif

/* expand kernel fd array for optimal perf.
 * This could be done by doing a loop on dup(),
 * or can be done using dup2()
 */
int expand_fd_array(int max)
{
int target, res;
int fd = socket(AF_INET, SOCK_STREAM, 0);

if (fd == -1) {
perror("socket()");
return -1;
}
for (;;) {
count = max;
target = count;
if (skflags & SOCK_FD_FASTALLOC)
target += count/10;
res = dup2(fd, target);
if (res != -1) {
close(res);
break;
}
max -= max/10;
}
printf("count=%u (check/increase ulimit -n)\n", count);
return 0;
}

static char state[32] = {
0, 1, 2, 3, 4, 5, 6, 7,
8, 9, 10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20, 21, 22, 23,
24, 25, 26, 27, 28, 29, 30, 31
};

/* each thread is using ~400 KB of data per unit of work */
#define WORKING_SET_SIZE 40

static void *child_function(void *arg)
{
unsigned int max = count / nbthreads_req;
struct random_data buf;
unsigned int idx;
int *tab;
unsigned long iter = 0;
unsigned long *work_set = malloc(WORKING_SET_SIZE);
int i;

if (!work_set)
return NULL;
tab = malloc(max * sizeof(int));
if (!tab) {
free(work_set);
return NULL;
}
memset(tab, 255, max * sizeof(int));

initstate_r(getpid(), state, 

Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-29 Thread Alan Burlison

On 29/10/2015 14:58, David Holland wrote:


> ISTM that the best way to do this is to post a signal to the thread so
> accept bails with EINTR, at which point it can check to see if it's
> supposed to be exiting.


Yes, you could use pthread_kill, but that would require keeping a list 
of the tids of all the threads that were using the FD, and that really 
just moves the problem elsewhere rather than fixing it.



> Otherwise it sounds like the call you're looking for is not close(2)
> but revoke(2). Last I remember Linux doesn't have revoke because
> there's no way to implement it that isn't a trainwreck.


close(2) as per specified by POSIX works just fine on Solaris, if that 
was the case everywhere then it wouldn't be an issue. And for cases 
where it is necessary to keep the FD assigned because of races, the 
dup2(2) trick works fine as well.
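
A sketch of that dup2() trick, purely for illustration (the helper name is
invented and error handling is elided):

#include <fcntl.h>
#include <unistd.h>

static void retire_fd(int fd)
{
        int nullfd = open("/dev/null", O_RDWR);

        if (nullfd >= 0) {
                dup2(nullfd, fd);       /* fd now refers to /dev/null; the slot
                                           stays occupied, so the number cannot
                                           be recycled under another thread */
                close(nullfd);
        }
}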


--
Alan Burlison
--


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-29 Thread David Holland
On Thu, Oct 29, 2015 at 03:18:40PM +, Alan Burlison wrote:
 > On 29/10/2015 14:58, David Holland wrote:
 > >ISTM that the best way to do this is to post a signal to the thread so
 > >accept bails with EINTR, at which point it can check to see if it's
 > >supposed to be exiting.
 > 
 > Yes, you could use pthread_kill, but that would require keeping a list of
 > the tids of all the threads that were using the FD, and that really just
 > moves the problem elsewhere rather than fixing it.

Hardly; it moves the burden of doing stupid things to the
application. If as you said the goal is to shut down all threads
cleanly, then it doesn't need to keep track in detail anyway; it can
just post SIGTERM to every thread, or SIGUSR1 if SIGTERM is bad for
some reason, or whatever.

 > >Otherwise it sounds like the call you're looking for is not close(2)
 > >but revoke(2). Last I remember Linux doesn't have revoke because
 > >there's no way to implement it that isn't a trainwreck.
 > 
 > close(2) as per specified by POSIX works just fine on Solaris, if that was
 > the case everywhere then it wouldn't be an issue. And for cases where it is
 > necessary to keep the FD assigned because of races, the dup2(2) trick works
 > fine as well.

close(2) as specified by POSIX doesn't prohibit this weird revoke-like
behavior, but there's nothing in there that mandates it either. (I
thought this discussion had already clarified that.)

Note that while NetBSD apparently supports this behavior because
someone copied it from Solaris, I'm about to go recommend it be
removed.

-- 
David A. Holland
dholl...@netbsd.org


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-29 Thread David Holland
On Tue, Oct 27, 2015 at 10:52:46AM +, Alan Burlison wrote:
 > >But in general, this is basically a problem with the application: the file
 > >descriptor space is shared between threads and having one thread sniping
 > >at open files, you do have a problem and whatever the kernel does in that
 > >case perhaps doesn't matter all that much: the application needs to be
 > >fixed anyway.
 > 
 > The scenario in Hadoop is that the FD is being used by a thread that's
 > waiting in accept and another thread wants to shut it down, e.g. because
 > the application is terminating and needs to stop all threads cleanly.

ISTM that the best way to do this is to post a signal to the thread so
accept bails with EINTR, at which point it can check to see if it's
supposed to be exiting.
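
A minimal sketch of that pattern (illustrative only; names are invented, and
it assumes the SIGUSR1 handler is installed via sigaction() *without*
SA_RESTART, so accept() really does fail with EINTR):

#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>
#include <sys/socket.h>
#include <unistd.h>

static atomic_int shutting_down;

static void wake_handler(int sig)
{
        (void)sig;      /* exists only so a blocked accept() gets interrupted */
}

static void *acceptor(void *arg)
{
        int lfd = *(int *)arg;

        for (;;) {
                int cfd = accept(lfd, NULL, NULL);

                if (cfd < 0) {
                        if (errno == EINTR && atomic_load(&shutting_down))
                                break;  /* told to exit; lfd left alone */
                        continue;
                }
                /* ... hand cfd off ... */
                close(cfd);
        }
        return NULL;
}

/* shutdown path: atomic_store(&shutting_down, 1); pthread_kill(tid, SIGUSR1); */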

Otherwise it sounds like the call you're looking for is not close(2)
but revoke(2). Last I remember Linux doesn't have revoke because
there's no way to implement it that isn't a trainwreck.

-- 
David A. Holland
dholl...@netbsd.org


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-29 Thread Alan Burlison

On 29/10/2015 16:01, David Holland wrote:


> Hardly; it moves the burden of doing stupid things to the
> application. If as you said the goal is to shut down all threads
> cleanly, then it doesn't need to keep track in detail anyway; it can
> just post SIGTERM to every thread, or SIGUSR1 if SIGTERM is bad for
> some reason, or whatever.


I agree that the root issue is poor application design, but posting a 
signal to every thread is not a solution if you only want to shut down a 
subset of threads.



> close(2) as specified by POSIX doesn't prohibit this weird revoke-like
> behavior, but there's nothing in there that mandates it either. (I
> thought this discussion had already clarified that.)


There was an attempt to interpret POSIX that way, with which I still 
disagree. If a FD is closed or reassigned then any current pending 
operations on it should be terminated.



> Note that while NetBSD apparently supports this behavior because
> someone copied it from Solaris, I'm about to go recommend it be
> removed.


Which behaviour? The abort accept() on close() behaviour?

--
Alan Burlison
--


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-29 Thread Alan Burlison

[off list]

On 29/10/2015 17:07, Al Viro wrote:


> Could the esteemed sir possibly be ars^H^H^Hprevailed upon to quote the exact
> place in POSIX that requires such behaviour?


If that's the way the conversation is going to go, sorry, no.

--
Alan Burlison
--


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-29 Thread Al Viro
On Thu, Oct 29, 2015 at 04:15:33PM +, Alan Burlison wrote:

> There was an attempt to interpret POSIX that way, with which I still
> disagree. If a FD is closed or reassigned then any current pending
> operations on it should be terminated.

Could the esteemed sir possibly be ars^H^H^Hprevailed upon to quote the exact
place in POSIX that requires such behaviour?

This is getting ridiculous - if we are talking about POSIX-mandated behaviour
of close(), please show where is it mandated.  Using close(2) on a descriptor
that might be used by other threads is a bloody bad design in userland code -
I think everyone in this thread agrees on that.  Making that a recommended way
to do _anything_ is nuts.

Now, no userland code, however lousy it might be, should be able to screw
the system.  But that isn't the issue - our variant is providing that just
fine.

BTW, "cancel accept(2) because sending a signal is hard" is bogus anyway -
a thread in accept(3) just about to enter the kernel would get buggered
if another thread closes that descriptor and the third one does socket(2).
_IF_ you are doing that kind of "close a descriptor under other threads"
thing, you need to inform the potentially affected threads anyway, and
you'd better not rely on them being currently in kernel mode.


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-29 Thread David Holland
On Thu, Oct 29, 2015 at 04:15:33PM +, Alan Burlison wrote:
 > >close(2) as specified by POSIX doesn't prohibit this weird revoke-like
 > >behavior, but there's nothing in there that mandates it either. (I
 > >thought this discussion had already clarified that.)
 > 
 > There was an attempt to interpret POSIX that way, with which I still
 > disagree. If a FD is closed or reassigned then any current pending
 > operations on it should be terminated.

C&V, please.

 > >Note that while NetBSD apparently supports this behavior because
 > >someone copied it from Solaris, I'm about to go recommend it be
 > >removed.
 > 
 > Which behaviour? The abort accept() on close() behaviour?

That, and aborting anything else too. Close isn't revoke.

-- 
David A. Holland
dholl...@netbsd.org


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-29 Thread David Miller
From: Alan Burlison 
Date: Thu, 29 Oct 2015 17:12:44 +0000

> On 29/10/2015 17:07, Al Viro wrote:
> 
>> Could the esteemed sir possibly be ars^H^H^Hprevailed upon to quote
>> the exact
>> place in POSIX that requires such behaviour?
> 
> If that's the way the conversation is going to go, sorry, no.

I find Al's request to be frankly quite reasonable, as is his
frustration expressed in his tone as well.

Furthermore, NetBSD's intention to try and get rid of the
close() on accept() behavior shows that the Linux developers
are not in the minority of being against requiring this
behavior or even finding it desirable in any way.


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-29 Thread David Miller
From: Al Viro 
Date: Thu, 29 Oct 2015 17:07:48 +0000

> _IF_ you are doing that kind of "close a descriptor under other threads"
> thing, you need to inform the potentially affected threads anyway, and
> you'd better not rely on them being currently in kernel mode.

+1


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-28 Thread Eric Dumazet
On Wed, 2015-10-28 at 12:35 +, Al Viro wrote:
> [Linus and Dave added, Solaris and NetBSD folks dropped from Cc]
> 
> On Tue, Oct 27, 2015 at 05:13:56PM -0700, Eric Dumazet wrote:
> > On Tue, 2015-10-27 at 23:17 +, Al Viro wrote:
> > 
> > >   * [Linux-specific aside] our __alloc_fd() can degrade quite badly
> > > with some use patterns.  The cacheline pingpong in the bitmap is probably
> > > inevitable, unless we accept considerably heavier memory footprint,
> > > but we also have a case when alloc_fd() takes O(n) and it's _not_ hard
> > > to trigger - close(3);open(...); will have the next open() after that
> > > scanning the entire in-use bitmap.  I think I see a way to improve it
> > > without slowing the normal case down, but I'll need to experiment a
> > > bit before I post patches.  Anybody with examples of real-world loads
> > > that make our descriptor allocator to degrade is very welcome to post
> > > the reproducers...
> > 
> > Well, I do have real-world loads, but quite hard to setup in a lab :(
> > 
> > Note that we also hit the 'struct cred'->usage refcount for every
> > open()/close()/sock_alloc(), and simply moving uid/gid out of the first
> > cache line really helps, as current_fsuid() and current_fsgid() no
> > longer forces a pingpong.
> > 
> > I moved seldom used fields on the first cache line, so that overall
> > memory usage did not change (192 bytes on 64 bit arches)
> 
> [snip]
> 
> Makes sense, but there's a funny thing about that refcount - the part
> coming from ->f_cred is the most frequently changed *and* almost all
> places using ->f_cred are just looking at its fields and do not manipulate
> its refcount.  The only exception (do_process_acct()) is easy to eliminate
> just by storing a separate reference to the current creds of acct(2) caller
> and using it instead of looking at ->f_cred.  What's more, the place where we
> grab what will be ->f_cred is guaranteed to have a non-f_cred reference *and*
> most of the time such a reference is there for dropping ->f_cred (in
> file_free()/file_free_rcu()).
> 
> With that change in kernel/acct.c done, we could do the following:
>   a) split the cred refcount into the normal and percpu parts and
> add a spinlock in there.
>   b) have put_cred() do this:
>   if (atomic_dec_and_test(&cred->usage)) {
>   this_cpu_add(&cred->f_cred_usage, 1);
>   call_rcu(&cred->rcu, put_f_cred_rcu);
>   }
>   c) have get_empty_filp() increment current_cred ->f_cred_usage with
> this_cpu_add()
>   d) have file_free() do
>   percpu_counter_dec(&nr_files);
>   rcu_read_lock();
>   if (likely(atomic_read(&f->f_cred->usage))) {
>   this_cpu_add(&f->f_cred->f_cred_usage, -1);
>   rcu_read_unlock();
>   call_rcu(&f->f_u.fu_rcuhead, file_free_rcu_light);
>   } else {
>   rcu_read_unlock();
>   call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
>   }
> file_free_rcu() being
> static void file_free_rcu(struct rcu_head *head)
> {
> struct file *f = container_of(head, struct file, f_u.fu_rcuhead);
> put_f_cred(&f->f_cred->rcu);
> kmem_cache_free(filp_cachep, f);
> }
> and file_free_rcu_light() - the same sans put_f_cred();
> 
> with put_f_cred() doing
>   spin_lock cred->lock
>   this_cpu_add(&cred->f_cred_usage, -1);
>   find the sum of cred->f_cred_usage
>   spin_unlock cred->lock
>   if the sum has not reached 0
>   return
>   current put_cred_rcu(cred)
> 
> IOW, let's try to get rid of cross-cpu stores in ->f_cred grabbing and
> (most of) ->f_cred dropping.
> 
> Note that there are two paths leading to put_f_cred() in the above - via
> call_rcu() on &cred->rcu and from file_free_rcu() called via call_rcu() on
> &f->f_u.fu_rcuhead.  Both are RCU-delayed and they can happen in parallel -
> different rcu_head are used.
> 
> atomic_read() check in file_free() might give false positives if it comes
> just before put_cred() on another CPU kills the last non-f_cred reference.
> It's not a problem, since put_f_cred() from that put_cred() won't be
> executed until we drop rcu_read_lock(), so we can safely decrement the
> cred->f_cred_usage without cred->lock here (and we are guaranteed that we 
> won't
> be dropping the last of that - the same put_cred() would've incremented
> ->f_cred_usage).
> 
> Does anybody see problems with that approach?  I'm going to grab some sleep
> (only a couple of hours so far tonight ;-/), will cook an incremental to 
> Eric's
> field-reordering patch when I get up...

Before I take a deep look at your suggestion, are you sure plain use of
include/linux/percpu-refcount.h infra is not possible for struct cred ?

Thanks !



Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-28 Thread Al Viro
[Linus and Dave added, Solaris and NetBSD folks dropped from Cc]

On Tue, Oct 27, 2015 at 05:13:56PM -0700, Eric Dumazet wrote:
> On Tue, 2015-10-27 at 23:17 +, Al Viro wrote:
> 
> > * [Linux-specific aside] our __alloc_fd() can degrade quite badly
> > with some use patterns.  The cacheline pingpong in the bitmap is probably
> > inevitable, unless we accept considerably heavier memory footprint,
> > but we also have a case when alloc_fd() takes O(n) and it's _not_ hard
> > to trigger - close(3);open(...); will have the next open() after that
> > scanning the entire in-use bitmap.  I think I see a way to improve it
> > without slowing the normal case down, but I'll need to experiment a
> > bit before I post patches.  Anybody with examples of real-world loads
> > that make our descriptor allocator to degrade is very welcome to post
> > the reproducers...
> 
> Well, I do have real-world loads, but quite hard to setup in a lab :(
> 
> Note that we also hit the 'struct cred'->usage refcount for every
> open()/close()/sock_alloc(), and simply moving uid/gid out of the first
> cache line really helps, as current_fsuid() and current_fsgid() no
> longer forces a pingpong.
> 
> I moved seldom used fields on the first cache line, so that overall
> memory usage did not change (192 bytes on 64 bit arches)

[snip]

Makes sense, but there's a funny thing about that refcount - the part
coming from ->f_cred is the most frequently changed *and* almost all
places using ->f_cred are just looking at its fields and do not manipulate
its refcount.  The only exception (do_process_acct()) is easy to eliminate
just by storing a separate reference to the current creds of acct(2) caller
and using it instead of looking at ->f_cred.  What's more, the place where we
grab what will be ->f_cred is guaranteed to have a non-f_cred reference *and*
most of the time such a reference is there for dropping ->f_cred (in
file_free()/file_free_rcu()).

With that change in kernel/acct.c done, we could do the following:
a) split the cred refcount into the normal and percpu parts and
add a spinlock in there.
b) have put_cred() do this:
if (atomic_dec_and_test(&cred->usage)) {
this_cpu_add(&cred->f_cred_usage, 1);
call_rcu(&cred->rcu, put_f_cred_rcu);
}
c) have get_empty_filp() increment current_cred ->f_cred_usage with
this_cpu_add()
d) have file_free() do
percpu_counter_dec(&nr_files);
rcu_read_lock();
if (likely(atomic_read(&f->f_cred->usage))) {
this_cpu_add(&f->f_cred->f_cred_usage, -1);
rcu_read_unlock();
call_rcu(&f->f_u.fu_rcuhead, file_free_rcu_light);
} else {
rcu_read_unlock();
call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
}
file_free_rcu() being
static void file_free_rcu(struct rcu_head *head)
{
struct file *f = container_of(head, struct file, f_u.fu_rcuhead);
put_f_cred(&f->f_cred->rcu);
kmem_cache_free(filp_cachep, f);
}
and file_free_rcu_light() - the same sans put_f_cred();

with put_f_cred() doing
spin_lock cred->lock
this_cpu_add(&cred->f_cred_usage, -1);
find the sum of cred->f_cred_usage
spin_unlock cred->lock
if the sum has not reached 0
return
current put_cred_rcu(cred)
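
Rendered as code, that might look like the following - purely a sketch, with
the f_cred_usage percpu counter and cred->lock being proposed fields that do
not exist in mainline:

static void put_f_cred(struct rcu_head *rcu)
{
        struct cred *cred = container_of(rcu, struct cred, rcu);
        long sum = 0;
        int cpu;

        spin_lock(&cred->lock);
        this_cpu_dec(*cred->f_cred_usage);
        for_each_possible_cpu(cpu)              /* sum of cred->f_cred_usage */
                sum += *per_cpu_ptr(cred->f_cred_usage, cpu);
        spin_unlock(&cred->lock);
        if (sum)
                return;
        put_cred_rcu(&cred->rcu);       /* the existing final-free path */
}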

IOW, let's try to get rid of cross-cpu stores in ->f_cred grabbing and
(most of) ->f_cred dropping.

Note that there are two paths leading to put_f_cred() in the above - via
call_rcu() on &cred->rcu and from file_free_rcu() called via call_rcu() on
&f->f_u.fu_rcuhead.  Both are RCU-delayed and they can happen in parallel -
different rcu_head are used.

atomic_read() check in file_free() might give false positives if it comes
just before put_cred() on another CPU kills the last non-f_cred reference.
It's not a problem, since put_f_cred() from that put_cred() won't be
executed until we drop rcu_read_lock(), so we can safely decrement the
cred->f_cred_usage without cred->lock here (and we are guaranteed that we won't
be dropping the last of that - the same put_cred() would've incremented
->f_cred_usage).

Does anybody see problems with that approach?  I'm going to grab some sleep
(only a couple of hours so far tonight ;-/), will cook an incremental to Eric's
field-reordering patch when I get up...


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-28 Thread Eric Dumazet
On Wed, 2015-10-28 at 06:24 -0700, Eric Dumazet wrote:

> Before I take a deep look at your suggestion, are you sure plain use of
> include/linux/percpu-refcount.h infra is not possible for struct cred ?

BTW, I am not convinced we need to spend so much energy and per-cpu
memory for struct cred refcount.

The big problem is fd array spinlock of course and bitmap search for
POSIX compliance.

The cache line trashing in struct cred is a minor one ;)





Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-28 Thread Alan Burlison

On 27/10/2015 23:17, Al Viro wrote:


> Frankly, as far as I'm concerned, the bottom line is
> * there are two variants of semantics in that area and there's not
> much that could be done about that.


Yes, that seems to be the case.


> * POSIX is vague enough for both variants to comply with it (it's
> also very badly written in the area in question).


On that aspect I disagree, the POSIX semantics seem clear to me, and are 
different to the Linux behaviour.



> * I don't see any way to implement something similar to Solaris
> behaviour without a huge increase of memory footprint or massive cacheline
> pingpong.  Solaris appears to go for memory footprint from hell - cacheline
> per descriptor (instead of a pointer per descriptor).


Yes, that does seem to be the case. Thanks for the detailed explanation 
you've provided as to why that's so.



> * the benefits of Solaris-style behaviour are not obvious - all things
> equal it would be interesting, but the things are very much not equal.  What's
> more, if your userland code is such that accept() argument could be closed by
> another thread, the caller *cannot* do anything with said argument after
> accept() returns, no matter which variant of semantics is used.


Yes, irrespective of how you terminate the accept, once it returns with 
an error it's unsafe to use the FD, with the exception of failures such 
as EAGAIN, EINTR etc. However the shutdown() behaviour of Linux is not 
POSIX compliant and allowing an accept to continue on a FD that's been 
closed doesn't seem correct either.



> * [Linux-specific aside] our __alloc_fd() can degrade quite badly
> with some use patterns.  The cacheline pingpong in the bitmap is probably
> inevitable, unless we accept considerably heavier memory footprint,
> but we also have a case when alloc_fd() takes O(n) and it's _not_ hard
> to trigger - close(3);open(...); will have the next open() after that
> scanning the entire in-use bitmap.  I think I see a way to improve it
> without slowing the normal case down, but I'll need to experiment a
> bit before I post patches.  Anybody with examples of real-world loads
> that make our descriptor allocator to degrade is very welcome to post
> the reproducers...


It looks like the remaining discussion is going to be about Linux 
implementation details so I'll bow out at this point. Thanks again for 
all the helpful explanation.


--
Alan Burlison
--


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-28 Thread Eric Dumazet
On Thu, 2015-10-29 at 00:15 +, Al Viro wrote:
> On Wed, Oct 28, 2015 at 04:08:29PM -0700, Eric Dumazet wrote:
> > > > Except for legacy stuff and stdin/stdout/stderr games, I really doubt
> > > > lot of applications absolutely rely on the POSIX thing...
> > > 
> > > We obviously can't turn that into default behaviour, though.  BTW, what
> > > distribution do you have in mind for those random descriptors?  Uniform
> > > on [0,INT_MAX] is a bad idea for obvious reasons - you'll blow the
> > > memory footprint pretty soon...
> > 
> > Simply [0 , fdt->max_fds] is working well in most cases.
> 
> Umm...  So first you dup2() to establish the ->max_fds you want, then
> do such opens?

Yes, dup2() is done at program startup, knowing the expected max load
(in term of concurrent fd) + ~10 % (actual fd array size can be more
than this because of power of two rounding in alloc_fdtable() )

But this is an optimization : If you do not use the initial dup2(), the
fd array can be automatically expanded if needed (all slots are in use)

>   What used/unused ratio do you expect to deal with?
> And what kind of locking are you going to use?  Keep in mind that
> e.g. dup2() is dependent on the lack of allocations while it's working,
> so it's not as simple as "we don't need no stinkin' ->files_lock"...

No locking change. files->file_lock is still taken.

We only want to minimize time to find an empty slot.

The trick is to not start bitmap search at files->next_fd, but a random
point. This is a win if we assume there are enough holes.

low = start;
if (low < files->next_fd)
low = files->next_fd;

res = -1;
if (flags & O_FD_FASTALLOC) {
random_point = pick_random_between(low, fdt->max_fds);

res = find_next_zero_bit(fdt->open_fds, fdt->max_fds,
random_point);
/* No empty slot found, try the other range */
if (res >= fdt->max_fds) {
res = find_next_zero_bit(fdt->open_fds,
random_point, low);
if (res >= random_point)
res = -1;
}
}
...






Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-28 Thread Al Viro
On Wed, Oct 28, 2015 at 08:29:41PM -0700, Eric Dumazet wrote:

> But this is an optimization : If you do not use the initial dup2(), the
> fd array can be automatically expanded if needed (all slots are in use)

Whee...

> No locking change. files->file_lock is still taken.
> 
> We only want to minimize time to find an empty slot.

Then I'd say that my variant is going to win.  It *will* lead to
cacheline pingpong in more cases than yours, but I'm quite sure that
it will be a win as far as the total amount of cachelines accessed.

> The trick is to not start bitmap search at files->next_fd, but a random
> point. This is a win if we assume there are enough holes.
> 
> low = start;
> if (low < files->next_fd)
> low = files->next_fd;
> 
> res = -1;
> if (flags & O_FD_FASTALLOC) {
>   random_point = pick_random_between(low, fdt->max_fds);
> 
>   res = find_next_zero_bit(fdt->open_fds, fdt->max_fds,
>   random_point);
>   /* No empty slot found, try the other range */
>   if (res >= fdt->max_fds) {
>   res = find_next_zero_bit(fdt->open_fds,
>   random_point, low);
>   if (res >= random_point)
>   res = -1;
>   }
> }

Have you tried to experiment with that in userland?  I mean, emulate that
thing in normal userland code, count the cacheline accesses and drive it
with the use patterns collected from actual applications.
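
Something along these lines, say - a toy bitmap with a wrap-around search
that counts how many bitmap words each allocation reads (all sizes, names
and densities made up for the experiment):

#include <stdio.h>
#include <stdlib.h>

#define NBITS   (1UL << 20)
#define NWORDS  (NBITS / 64)

static unsigned long map[NWORDS];
static unsigned long words_read;

static long alloc_bit(unsigned long from)   /* start at a word, wrap once */
{
        for (unsigned long n = 0; n < NWORDS; n++) {
                unsigned long i = (from / 64 + n) % NWORDS;

                words_read++;
                if (~map[i]) {
                        int b = __builtin_ctzl(~map[i]);  /* lowest zero bit */

                        map[i] |= 1UL << b;
                        return (long)(i * 64 + b);
                }
        }
        return -1;
}

int main(void)
{
        srandom(1);
        for (unsigned long k = 0; k < NBITS / 10 * 9; k++)  /* ~90% dense */
                alloc_bit((unsigned long)random() % NBITS);
        words_read = 0;
        for (int k = 0; k < 100000; k++)
                alloc_bit((unsigned long)random() % NBITS);
        printf("avg words read per alloc: %.2f\n", words_read / 100000.0);
        return 0;
}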

I can sit down and play with math expectations, but I suspect that it's
easier to experiment.  It's nothing but an intuition (I hadn't seriously
done probability theory in quite a while, and my mathematical tastes run
more to geometry and topology anyway), but... I would expect it to degrade
badly when the bitmap is reasonably dense.

Note, BTW, that vmalloc'ed memory gets populated as you read it, and it's
not cheap - it's done via #PF triggered in kernel mode, with handler
noticing that the faulting address is in vmalloc range and doing the
right thing.  IOW, if your bitmap is very sparse, the price of page faults
needs to be taken into account.

AFAICS, the only benefit of that thing is keeping dirtied cachelines far
from each other.  Which might be a win overall, but I'm not convinced that
the rest won't offset the effect of that...


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-28 Thread Eric Dumazet
On Wed, 2015-10-28 at 21:13 +, Al Viro wrote:
> On Wed, Oct 28, 2015 at 07:47:57AM -0700, Eric Dumazet wrote:
> > On Wed, 2015-10-28 at 06:24 -0700, Eric Dumazet wrote:
> > 
> > > Before I take a deep look at your suggestion, are you sure plain use of
> > > include/linux/percpu-refcount.h infra is not possible for struct cred ?
> > 
> > BTW, I am not convinced we need to spend so much energy and per-cpu
> > memory for struct cred refcount.
> > 
> > The big problem is fd array spinlock of course and bitmap search for
> > POSIX compliance.
> > 
> > The cache line trashing in struct cred is a minor one ;)
> 
> percpu-refcount isn't convenient - the only such candidate for ref_kill in
> there is "all other references are gone", and that can happen in
> interesting locking environments.  I doubt that it would be a good fit, TBH...

OK then ...

> Cacheline pingpong on the descriptors bitmap is probably inevitable, but
> it's not the only problem in the existing implementation - close a small
> descriptor when you've got a lot of them and look for the second open
> after that.  _That_ can lead to thousands of cachelines being read through,
> all under the table spinlock.  It's literally orders of magnitude worse.
> And if the first open after that close happens to be for a short-living
> descriptor, you'll get the same situation back in your face as soon as you
> close it.
> 
> I think we can seriously improve that without screwing the fast path by
> adding "summary" bitmaps once the primary grows past the cacheline worth
> of bits.  With bits in the summary bitmap corresponding to cacheline-sized
> chunks of the primary, being set iff all bits in the corresponding chunk
> are set.  If the summary map grows larger than one cacheline, add the
> second-order one (that happens at quarter million descriptors and serves
> until 128 million; adding the third-order map is probably worthless).
> 
> I want to maintain the same kind of "everything below this is known to be
> in use" thing as we do now.  Allocation would start with looking into the
> same place in primary bitmap where we'd looked now and similar search
> forward for zero bit.  _However_, it would stop at cacheline boundary.
> If nothing had been found, we look in the corresponding place in the
> summary bitmap and search for zero bit there.  Again, no more than up
> to the cacheline boundary.  If something is found, we've got a chunk in
> the primary known to contain a zero bit; if not - go to the second-level
> and search there, etc.
> 
> When a zero bit in the primary had been found, check if it's within the
> rlimit (passed to __alloc_fd() explicitly) and either bugger off or set
> that bit.  If there are zero bits left in the same word - we are done,
> otherwise check the still unread words in the cacheline and see if all
> of them are ~0UL.  If all of them are, set the bit in summary bitmap, etc.
> 
> Normal case is exactly the same as now - one cacheline accessed and modified.
> We might end up touching more than that, but it's going to be rare and
> the cases when it happens are very likely to lead to much worse amount of
> memory traffic with the current code.
> 
> Freeing is done by zeroing the bit in primary, checking for other zero bits
> nearby and buggering off if there are such.  If the entire cacheline used
> to be all-bits-set, clear the bit in summary and, if there's a second-order
> summary, get the bit in there clear as well - it's probably not worth
> bothering with checking that all the cacheline in summary bitmap had been
> all-bits-set.  Again, the normal case is the same as now.
> 
> It'll need profiling and tuning, but AFAICS it's doable without making the
> things worse than they are now, and it should get rid of those O(N) fetches
> under spinlock cases.  And yes, those are triggerable and visible in
> profiles.  IMO it's worth trying to fix...
> 

Well, all this complexity goes away with a O_FD_FASTALLOC /
SOCK_FD_FASTALLOC bit in various fd allocations, which specifically
tells the kernel we do not care getting the lowest possible fd as POSIX
mandates.

With this bit set, the bitmap search can start at a random point, and we
find a lot in O(1) : one cache line miss, if you have at least one free
bit/slot per 512 bits (64 bytes cache line).

#ifndef O_FD_FASTALLOC
#define O_FD_FASTALLOC 0x40000000
#endif

#ifndef SOCK_FD_FASTALLOC
#define SOCK_FD_FASTALLOC O_FD_FASTALLOC
#endif

... // active sockets
socket(AF_INET, SOCK_STREAM | SOCK_FD_FASTALLOC, 0);
... // passive sockets
accept4(sockfd, ..., SOCK_FD_FASTALLOC);
...

Except for legacy stuff and stdin/stdout/stderr games, I really doubt
lot of applications absolutely rely on the POSIX thing...





Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-28 Thread Al Viro
On Wed, Oct 28, 2015 at 07:47:57AM -0700, Eric Dumazet wrote:
> On Wed, 2015-10-28 at 06:24 -0700, Eric Dumazet wrote:
> 
> > Before I take a deep look at your suggestion, are you sure plain use of
> > include/linux/percpu-refcount.h infra is not possible for struct cred ?
> 
> BTW, I am not convinced we need to spend so much energy and per-cpu
> memory for struct cred refcount.
> 
> The big problem is fd array spinlock of course and bitmap search for
> POSIX compliance.
> 
> The cache line trashing in struct cred is a minor one ;)

percpu-refcount isn't convenient - the only such candidate for ref_kill in
there is "all other references are gone", and that can happen in
interesting locking environments.  I doubt that it would be a good fit, TBH...

Cacheline pingpong on the descriptors bitmap is probably inevitable, but
it's not the only problem in the existing implementation - close a small
descriptor when you've got a lot of them and look for the second open
after that.  _That_ can lead to thousands of cachelines being read through,
all under the table spinlock.  It's literally orders of magnitude worse.
And if the first open after that close happens to be for a short-living
descriptor, you'll get the same situation back in your face as soon as you
close it.

I think we can seriously improve that without screwing the fast path by
adding "summary" bitmaps once the primary grows past the cacheline worth
of bits.  With bits in the summary bitmap corresponding to cacheline-sized
chunks of the primary, being set iff all bits in the corresponding chunk
are set.  If the summary map grows larger than one cacheline, add the
second-order one (that happens at quarter million descriptors and serves
until 128 million; adding the third-order map is probably worthless).

I want to maintain the same kind of "everything below this is known to be
in use" thing as we do now.  Allocation would start with looking into the
same place in primary bitmap where we'd looked now and similar search
forward for zero bit.  _However_, it would stop at cacheline boundary.
If nothing had been found, we look in the corresponding place in the
summary bitmap and search for zero bit there.  Again, no more than up
to the cacheline boundary.  If something is found, we've got a chunk in
the primary known to contain a zero bit; if not - go to the second-level
and search there, etc.
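
For what it's worth, a userland toy of that search with a single summary
level (names invented; the real thing would bound the summary scan by
cachelines as described, and add the second-order map):

#define BITS_PER_LINE 512UL     /* one 64-byte cacheline of primary bitmap */

static unsigned long toy_find_fd(const unsigned long *open,
                                 const unsigned long *summary,
                                 unsigned long nbits, unsigned long start)
{
        unsigned long nlines = (nbits + BITS_PER_LINE - 1) / BITS_PER_LINE;

        while (start < nbits) {
                unsigned long line = start / BITS_PER_LINE;
                unsigned long end = (line + 1) * BITS_PER_LINE;
                unsigned long bit;

                if (end > nbits)
                        end = nbits;
                /* probe the primary, but only up to the cacheline boundary */
                for (bit = start; bit < end; bit++)
                        if (!(open[bit / 64] & (1UL << (bit % 64))))
                                return bit;
                /* line exhausted: let the summary skip known-full lines */
                for (line++; line < nlines; line++)
                        if (!(summary[line / 64] & (1UL << (line % 64))))
                                break;
                start = line * BITS_PER_LINE;
        }
        return nbits;   /* nothing free */
}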

When a zero bit in the primary had been found, check if it's within the
rlimit (passed to __alloc_fd() explicitly) and either bugger off or set
that bit.  If there are zero bits left in the same word - we are done,
otherwise check the still unread words in the cacheline and see if all
of them are ~0UL.  If all of them are, set the bit in summary bitmap, etc.

Normal case is exactly the same as now - one cacheline accessed and modified.
We might end up touching more than that, but it's going to be rare and
the cases when it happens are very likely to lead to much worse amount of
memory traffic with the current code.

Freeing is done by zeroing the bit in primary, checking for other zero bits
nearby and buggering off if there are such.  If the entire cacheline used
to be all-bits-set, clear the bit in summary and, if there's a second-order
summary, get the bit in there clear as well - it's probably not worth
bothering with checking that all the cacheline in summary bitmap had been
all-bits-set.  Again, the normal case is the same as now.

It'll need profiling and tuning, but AFAICS it's doable without making the
things worse than they are now, and it should get rid of those O(N) fetches
under spinlock cases.  And yes, those are triggerable and visible in
profiles.  IMO it's worth trying to fix...



Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-28 Thread Al Viro
On Wed, Oct 28, 2015 at 02:44:28PM -0700, Eric Dumazet wrote:

> Well, all this complexity goes away with a O_FD_FASTALLOC /
> SOCK_FD_FASTALLOC bit in various fd allocations, which specifically
> tells the kernel we do not care getting the lowest possible fd as POSIX
> mandates.

... which won't do a damn thing for existing userland.

> Except for legacy stuff and stdin/stdout/stderr games, I really doubt
> lot of applications absolutely rely on the POSIX thing...

We obviously can't turn that into default behaviour, though.  BTW, what
distribution do you have in mind for those random descriptors?  Uniform
on [0,INT_MAX] is a bad idea for obvious reasons - you'll blow the
memory footprint pretty soon...


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-28 Thread Eric Dumazet
On Wed, 2015-10-28 at 22:33 +, Al Viro wrote:
> On Wed, Oct 28, 2015 at 02:44:28PM -0700, Eric Dumazet wrote:
> 
> > Well, all this complexity goes away with a O_FD_FASTALLOC /
> > SOCK_FD_FASTALLOC bit in various fd allocations, which specifically
> > tells the kernel we do not care getting the lowest possible fd as POSIX
> > mandates.
> 
> ... which won't do a damn thing for existing userland.

For the userland that need +5,000,000 socket, I can tell you they are
using this flag as soon they are aware it exists ;)

> 
> > Except for legacy stuff and stdin/stdout/stderr games, I really doubt
> > lot of applications absolutely rely on the POSIX thing...
> 
> We obviously can't turn that into default behaviour, though.  BTW, what
> distribution do you have in mind for those random descriptors?  Uniform
> on [0,INT_MAX] is a bad idea for obvious reasons - you'll blow the
> memory footprint pretty soon...

Simply [0 , fdt->max_fds] is working well in most cases.





Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-28 Thread Al Viro
On Wed, Oct 28, 2015 at 04:08:29PM -0700, Eric Dumazet wrote:
> > > Except for legacy stuff and stdin/stdout/stderr games, I really doubt
> > > lot of applications absolutely rely on the POSIX thing...
> > 
> > We obviously can't turn that into default behaviour, though.  BTW, what
> > distribution do you have in mind for those random descriptors?  Uniform
> > on [0,INT_MAX] is a bad idea for obvious reasons - you'll blow the
> > memory footprint pretty soon...
> 
> Simply [0 , fdt->max_fds] is working well in most cases.

Umm...  So first you dup2() to establish the ->max_fds you want, then
do such opens?  What used/unused ratio do you expect to deal with?
And what kind of locking are you going to use?  Keep in mind that
e.g. dup2() is dependent on the lack of allocations while it's working,
so it's not as simple as "we don't need no stinkin' ->files_lock"...


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-27 Thread Casper . Dik


>And no, I'm not fond of such irregular ways to pass file descriptors, but
>we can't kill ioctl(2) with all weirdness hiding behind it, more's the pity...

Yeah, there are a number of calls which supposedly work on one but have a 
second argument which is also a file descriptor; mostly part of ioctl().

>> In those specific cases where a system call needs to convert a file 
>> descriptor to a file pointer, there is only one routines which can be used.
>
>Obviously, but the problem is deadlock avoidance using it.

The Solaris algorithm is quite different, and as such there is no chance of
a deadlock when using that function (there is actually a bunch of such functions).


>The memory footprint is really scary.  Bitmaps are pretty much noise, but
>blowing it up by a factor of 8 on normal 64bit (or 16 on something like Itanic -
>or Venus for that matter, which is more relevant for you guys)

Fair enough.  I think we have some systems with a larger cache line.

>Said that, what's the point of "close won't return until..."?  After all,
>you can't guarantee that thread with cancelled syscall won't lose CPU
>immediately upon return to userland, so it *can't* make any assumptions
>about the descriptor not having been already reused.  I don't get it - what
>does that buy for userland code?

Generally I wouldn't see that as a problem, but in the case of a socket 
blocking on accept indefinitely, I do see it as a problem especially as 
the thread actually wants to stop listening.

But in general this is basically a problem with the application: the file
descriptor space is shared between threads, and if one thread is sniping
at open files you have a problem. Whatever the kernel does in that
case perhaps doesn't matter all that much: the application needs to be
fixed anyway.

Casper



Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-27 Thread Alan Burlison

On 27/10/2015 09:08, casper@oracle.com wrote:


Generally I wouldn't see that as a problem, but in the case of a socket
blocking on accept indefinitely, I do see it as a problem especially as
the thread actually wants to stop listening.

But in general this is basically a problem with the application: the file
descriptor space is shared between threads, and if one thread is sniping
at open files you have a problem. Whatever the kernel does in that
case perhaps doesn't matter all that much: the application needs to be
fixed anyway.


The scenario in Hadoop is that the FD is being used by a thread that's 
waiting in accept and another thread wants to shut it down, e.g. because 
the application is terminating and needs to stop all threads cleanly. I 
agree the use of shutdown()+close() on Linux or dup2() on Solaris is 
pretty much an application-level hack - the concern in both cases is 
that the file descriptor that's being used in the accept() might be 
recycled by another thread. However that just raises the question of why 
the FD isn't properly encapsulated by the application in a singleton 
object, with the required shut down semantics provided by having a 
mechanism to invalidate the singleton and its contained FD.


There are other mechanisms that could be used to do a clean shutdown 
that don't require the OS to provide workarounds for arguably broken 
application behaviour, for example by setting a 'shutdown' flag in the 
object and then doing a dummy connect() to the accepting FD to kick it 
off the accept() and thereby getting it to re-check the 'shutdown' flag 
and not re-enter the accept().
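
For concreteness, here is a minimal sketch of that flag-plus-dummy-connect
pattern. All names are hypothetical, an AF_INET loopback listener is assumed,
and real code would protect the flag with a mutex or atomics rather than
volatile:

#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>

struct listener {
	int fd;
	struct sockaddr_in addr;	/* the address fd is bound to */
	volatile int shutting_down;	/* the 'shutdown' flag */
};

void *accept_loop(void *arg)
{
	struct listener *l = arg;

	while (!l->shutting_down) {
		int c = accept(l->fd, NULL, NULL);

		if (c < 0)
			continue;
		if (l->shutting_down) {	/* woken by the dummy connect() */
			close(c);
			break;
		}
		/* ... hand c off to a worker ... */
	}
	return NULL;
}

void stop_listener(struct listener *l)
{
	int s = socket(AF_INET, SOCK_STREAM, 0);

	l->shutting_down = 1;	/* set the flag first... */
	connect(s, (struct sockaddr *)&l->addr, sizeof(l->addr));
	close(s);		/* ...then kick accept() awake */
	/* only now is close(l->fd) safe, once the acceptor has exited */
}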


If the object encapsulating a FD is invalidated and that prevents the FD 
being used any more because the only access is via that object, then it 
simply doesn't matter if the FD is reused elsewhere, there can be no 
race so a complicated, platform-dependent dance isn't needed.


Unfortunately Hadoop isn't the only thing that pulls the shutdown() 
trick, so I don't think there's a simple fix for this, as discussed 
earlier in the thread. Having said that, if close() on Linux also did an 
implicit shutdown() it would mean that well-written applications that 
handled the scoping, sharing and reuse of FDs properly could just call 
close() and have it work the same way across *NIX platforms.


--
Alan Burlison


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-27 Thread David Miller
From: Alan Burlison 
Date: Tue, 27 Oct 2015 10:52:46 +

> an implicit shutdown() it would mean that well-written applications
> that handled the scoping, sharing and reuse of FDs properly could just
> call close() and have it work the same way across *NIX platforms.

This semantic would only exist after Linux version X.Y.Z and vendor
kernels that decided to backport the feature.

Ergo, this application would ironically be non-portable on Linux
machines.

If portable Linux applications have to handle the situation using
existing facilities there is absolutely zero value to add it now
because it only will add more complexity to applications handling
things correctly because they will always have two cases to somehow
conditionally handle under Linux.

And if the intention is to just always assume the close() semantic
thing is there, then you have given me a disincentive to ever add the
facility.


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-27 Thread Alan Burlison

On 27/10/2015 12:01, Eric Dumazet wrote:


Are non multi threaded applications considered well written ?

listener = socket(...);
bind(listener, ...);
listen(listener, 1);
Loop 1 .. 10
    if (fork() == 0)
        do_accept(listener)

Now if a child does a close(listener), or is killed, you propose that it
does an implicit shutdown() and all other children no longer can
accept() ?


No, of course not. I made it quite clear I was talking about MT programs.


Surely you did not give all the details of how it really works.


In the case of Hadoop, it works the way I describe.

--
Alan Burlison


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-27 Thread Eric Dumazet
On Tue, 2015-10-27 at 12:27 +, Alan Burlison wrote:
> On 27/10/2015 12:01, Eric Dumazet wrote:
> 
> > Are non multi threaded applications considered well written ?
> >
> > listener = socket(...);
> > bind(listener, ...);
> > listen(listener, 1);
> > Loop 1 .. 10
> >    if (fork() == 0)
> >        do_accept(listener)
> >
> > Now if a child does a close(listener), or is killed, you propose that it
> > does an implicit shutdown() and all other children no longer can
> > accept() ?
> 
> No, of course not. I made it quite clear I was talking about MT programs.

Nothing is clear. Sorry.

Now what about programs using both the fork() model and MT, to get one MT
process per NUMA node?








Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-27 Thread Eric Dumazet
On Tue, 2015-10-27 at 10:52 +, Alan Burlison wrote:

> Unfortunately Hadoop isn't the only thing that pulls the shutdown() 
> trick, so I don't think there's a simple fix for this, as discussed 
> earlier in the thread. Having said that, if close() on Linux also did an 
> implicit shutdown() it would mean that well-written applications that 
> handled the scoping, sharing and reuse of FDs properly could just call 
> close() and have it work the same way across *NIX platforms.


Are non multi threaded applications considered well written ?

listener = socket(...);
bind(listener, ...);
listen(listener, 1);
Loop 1 .. 10
    if (fork() == 0)
        do_accept(listener)

Now if a child does a close(listener), or is killed, you propose that it
does an implicit shutdown() and all other children no longer can
accept() ?

Surely you did not give all the details of how it really works.







Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-27 Thread Alan Burlison

On 27/10/2015 14:39, David Miller wrote:


You will never be able to assume it is available everywhere under
Linux.  Ever.

This is the fundamental issue that you seem to completely not
understand.


If that was true in general then Linux would be dead and there would be 
no point ever adding any new features to it.



You cannot just assume 5 years from now or whatever that the close()
thing is there even if I added it to the tree right now.


No, a configure-time feature probe would be needed, as is generally 
considered to be good practice.



Your intent is to somewhere down the road assume this, and therefore
distribute a broken piece of infrastructure that only works on some
Linux systems.

This is not acceptable.


Acceptable to who? Unless you are claiming to speak for the authors of 
every piece of software that ever runs on Linux you can't make that 
assertion. For example, Hadoop relies on Linux CGroups to provide 
features you'd want in production deployments, yet CGroups only appeared 
in Linux 2.6.24.



The backwards compat code will need to be in your code forever.  There
is no way around it.  That is, again, unless you want your code to not
work on a non-trivial number of Linux systems out there.


That's simply not true. Both Linux and Solaris have dropped support for 
old features in the past. Yes it takes a long time to do so but it's 
perfectly possible.



Making this worse is that there isn't going to be a straightforward
nor reliable way to test for the presence of this at run time.


I attached a test case to the original bug that demonstrates the 
platform-specific behaviours, it would be easy enough to modify that to 
use as a configure-time feature probe.
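
For what it's worth, a rough sketch of what such a probe might look like
(hypothetical and simplified, not the test case from the bug report; compile
with -lpthread; the sleep()s are a crude stand-in for real synchronisation):

#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <netinet/in.h>

static int lfd;
static volatile int accept_returned;

static void *acceptor(void *arg)
{
	accept(lfd, NULL, NULL);	/* blocks; may or may not return */
	accept_returned = 1;
	return NULL;
}

int main(void)
{
	struct sockaddr_in sin;
	pthread_t t;

	lfd = socket(AF_INET, SOCK_STREAM, 0);
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	bind(lfd, (struct sockaddr *)&sin, sizeof(sin));
	listen(lfd, 1);
	pthread_create(&t, NULL, acceptor, NULL);
	sleep(1);	/* let the thread block in accept() */
	close(lfd);	/* Solaris/NetBSD: accept() returns; Linux: it sits */
	sleep(1);
	puts(accept_returned ? "close-kicks-accept" : "accept-still-blocked");
	return !accept_returned;
}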



You _have_ a way to accomplish what you want to do today and it works
on every Linux system on the planet.

Given the constraints, and the fact that you're going to have to
account for this situation somehow in your code forever, I see very
little to no value in adding the close() thing.


I think your assessment is unduly pessimistic.


So your cross-platform unified behavior goal is simply unobtainable.
So please deal with reality rather than wishful, impractical things.


If you don't think this is worth discussing because you personally don't 
believe cross-platform compatibility is worth bothering with then fine, 
say so. But then you need to persuade everyone else with a stake in 
Linux that your viewpoint is the only valid one, and you will have to 
also persuade them that Linux should no longer concern itself with POSIX 
compliance. I know *I* couldn't do that for Solaris, and with all due 
respect, I very much doubt *you* can do so for Linux.


--
Alan Burlison


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-27 Thread David Miller
From: Alan Burlison 
Date: Tue, 27 Oct 2015 14:39:56 +

> On 27/10/2015 14:39, David Miller wrote:
> 
>> Making this worse is that there isn't going to be a straightforward
>> nor reliable way to test for the presence of this at run time.
> 
> I attached a test case to the original bug that demonstrates the
> platform-specific behaviours, it would be easy enough to modify that
> to use as a configure-time feature probe.

I said "run time".  Like when your final code runs on a target system.

Not at build tree configure time.


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-27 Thread Alan Burlison

On 27/10/2015 15:04, David Miller wrote:


I said "run time".  Like when your final code runs on a target system.

Not at build tree configure time.


Yes, you could do it then as well.

--
Alan Burlison


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-27 Thread Eric Dumazet
On Tue, 2015-10-27 at 23:17 +, Al Viro wrote:

>   * [Linux-specific aside] our __alloc_fd() can degrade quite badly
> with some use patterns.  The cacheline pingpong in the bitmap is probably
> inevitable, unless we accept considerably heavier memory footprint,
> but we also have a case when alloc_fd() takes O(n) and it's _not_ hard
> to trigger - close(3);open(...); will have the next open() after that
> scanning the entire in-use bitmap.  I think I see a way to improve it
> without slowing the normal case down, but I'll need to experiment a
> bit before I post patches.  Anybody with examples of real-world loads
> that make our descriptor allocator degrade is very welcome to post
> the reproducers...

Well, I do have real-world loads, but they are quite hard to set up in a lab :(

Note that we also hit the 'struct cred'->usage refcount for every
open()/close()/sock_alloc(), and simply moving uid/gid out of the first
cache line really helps, as current_fsuid() and current_fsgid() no
longer force a pingpong.

I moved seldom-used fields onto the first cache line, so that overall
memory usage did not change (192 bytes on 64-bit arches).


diff --git a/include/linux/cred.h b/include/linux/cred.h
index 8d70e1361ecd..460efae83522 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -124,7 +124,17 @@ struct cred {
 #define CRED_MAGIC 0x43736564
 #define CRED_MAGIC_DEAD 0x44656144
 #endif
-   kuid_t  uid;/* real UID of the task */
+   struct rcu_head rcu;/* RCU deletion hook */
+
+   kernel_cap_tcap_inheritable; /* caps our children can inherit */
+   kernel_cap_tcap_permitted;  /* caps we're permitted */
+   kernel_cap_tcap_effective;  /* caps we can actually use */
+   kernel_cap_tcap_bset;   /* capability bounding set */
+   kernel_cap_tcap_ambient;/* Ambient capability set */
+
+   kuid_t  uid ____cacheline_aligned_in_smp;
+   /* real UID of the task */
+
kgid_t  gid;/* real GID of the task */
kuid_t  suid;   /* saved UID of the task */
kgid_t  sgid;   /* saved GID of the task */
@@ -133,11 +143,6 @@ struct cred {
kuid_t  fsuid;  /* UID for VFS ops */
kgid_t  fsgid;  /* GID for VFS ops */
unsignedsecurebits; /* SUID-less security management */
-   kernel_cap_tcap_inheritable; /* caps our children can inherit */
-   kernel_cap_tcap_permitted;  /* caps we're permitted */
-   kernel_cap_tcap_effective;  /* caps we can actually use */
-   kernel_cap_tcap_bset;   /* capability bounding set */
-   kernel_cap_tcap_ambient;/* Ambient capability set */
 #ifdef CONFIG_KEYS
unsigned char   jit_keyring;/* default keyring to attach requested
 * keys to */
@@ -152,7 +157,6 @@ struct cred {
struct user_struct *user;   /* real user ID subscription */
	struct user_namespace *user_ns; /* user_ns the caps and keyrings are relative to. */
	struct group_info *group_info;	/* supplementary groups for euid/fsgid */
-   struct rcu_head rcu;/* RCU deletion hook */
 };
 
 extern void __put_cred(struct cred *);





Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-27 Thread Al Viro
On Tue, Oct 27, 2015 at 10:52:46AM +, Alan Burlison wrote:
> Unfortunately Hadoop isn't the only thing that pulls the shutdown()
> trick, so I don't think there's a simple fix for this, as discussed
> earlier in the thread. Having said that, if close() on Linux also
> did an implicit shutdown() it would mean that well-written
> applications that handled the scoping, sharing and reuse of FDs
> properly could just call close() and have it work the same way
> across *NIX platforms.

... except for all Linux, FreeBSD and OpenBSD versions out there, but
hey, who's counting those, right?  Not to mention the OSX behaviour -
I really have no idea what it does; the FreeBSD ancestry in its kernel
is distant enough for a lot of changes to have happened in that area.

So...  Which Unices other than Solaris and NetBSD actually behave that
way?  I.e. have close(fd) cancel accept(fd) another thread is sitting
in.  Note that NetBSD implementation has known races.  Linux, FreeBSD
and OpenBSD don't do that at all.

Frankly, as far as I'm concerned, the bottom line is
* there are two variants of semantics in that area and there's not
much that could be done about that.
* POSIX is vague enough for both variants to comply with it (it's
also very badly written in the area in question).
* I don't see any way to implement something similar to Solaris
behaviour without a huge increase of memory footprint or massive cacheline
pingpong.  Solaris appears to go for memory footprint from hell - cacheline
per descriptor (instead of a pointer per descriptor).
* the benefits of Solaris-style behaviour are not obvious - all things
equal it would be interesting, but the things are very much not equal.  What's
more, if your userland code is such that accept() argument could be closed by
another thread, the caller *cannot* do anything with said argument after
accept() returns, no matter which variant of semantics is used.
* [Linux-specific aside] our __alloc_fd() can degrade quite badly
with some use patterns.  The cacheline pingpong in the bitmap is probably
inevitable, unless we accept considerably heavier memory footprint,
but we also have a case when alloc_fd() takes O(n) and it's _not_ hard
to trigger - close(3);open(...); will have the next open() after that
scanning the entire in-use bitmap.  I think I see a way to improve it
without slowing the normal case down, but I'll need to experiment a
bit before I post patches.  Anybody with examples of real-world loads
that make our descriptor allocator degrade is very welcome to post
the reproducers...
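
In that spirit, a synthetic reproducer sketch (assumptions: ulimit -n and
fs.nr_open raised well above 1M; the constants are illustrative, and the
O(n) rescan is the close(3);open() pattern described above):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>

int main(void)
{
	enum { NFDS = 1 << 20, ITERS = 1000 };
	struct timespec t0, t1;
	int i;

	/* fill the table so the in-use bitmap is large and dense */
	for (i = 0; i < NFDS; i++)
		if (open("/dev/null", O_RDONLY) < 0) {
			perror("open (raise ulimit -n?)");
			return 1;
		}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < ITERS; i++) {
		close(3);				/* free a low slot */
		int a = open("/dev/null", O_RDONLY);	/* reuses 3 cheaply */
		int b = open("/dev/null", O_RDONLY);	/* rescans the bitmap */
		close(b);
		(void)a;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("%.1f us/iter\n",
	       ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec))
	       / 1e3 / ITERS);
	return 0;
}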


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-27 Thread David Miller
From: Alan Burlison 
Date: Tue, 27 Oct 2015 14:13:31 +

> Ideally there'd be a single way of doing this that worked
> cross-platform, at the moment there isn't. And yes, even if such a
> mechanism was available now it would be some time before it could be
> assumed to be available everywhere.

You will never be able to assume it is available everywhere under
Linux.  Ever.

This is the fundamental issue that you seem to completely not
understand.

You cannot just assume 5 years from now or whatever that the close()
thing is there even if I added it to the tree right now.

Your intent is to somewhere down the road assume this, and therefore
distribute a broken piece of infrastructure that only works on some
Linux systems.

This is not acceptable.

The backwards compat code will need to be in your code forever.  There
is no way around it.  That is, again, unless you want your code to not
work on a non-trivial number of Linux systems out there.

Making this worse is that there isn't going to be a straightforward
nor reliable way to test for the presence of this at run time.

You _have_ a way to accomplish what you want to do today and it works
on every Linux system on the planet.

Given the constraints, and the fact that you're going to have to
account for this situation somehow in your code forever, I see very
little to no value in adding the close() thing.

So your cross-platform unified behavior goal is simply unobtainable.
So please deal with reality rather than wishful, impractical things.


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-27 Thread David Miller
From: Alan Burlison 
Date: Tue, 27 Oct 2015 13:37:26 +

> If you took that argument to its logical extreme then you'd never
> make any changes to existing behaviour, and that's
> patently not the case.

You know exactly what I mean, and what you're saying here is just a
scarecrow distracting the discussion from the real issue.


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-27 Thread Alan Burlison

On 27/10/2015 13:42, David Miller wrote:


This semantic would only exist after Linux version X.Y.Z and vendor
kernels that decided to backport the feature.

Ergo, this application would ironically be non-portable on Linux
machines.


Yes, that's true enough, at least until nobody is using the old versions any more.


If portable Linux applications have to handle the situation using
existing facilities there is absolutely zero value to add it now
because it only will add more complexity to applications handling
things correctly because they will always have two cases to somehow
conditionally handle under Linux.

And if the intention is to just always assume the close() semantic
thing is there, then you have given me a disincentive to ever add the
facility.


If you took that argument to its logical extreme then you'd never make
any changes to existing behaviour, and that's patently
not the case.


--
Alan Burlison


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-27 Thread Alan Burlison

On 27/10/2015 13:59, David Miller wrote:


From: Alan Burlison 
Date: Tue, 27 Oct 2015 13:37:26 +


If you took that argument to its logical extreme then you'd never
make any changes to existing behaviour, and that's
patently not the case.


You know exactly what I mean, and what you're saying here is just a
scarecrow distracting the discussion from the real issue.


I think you probably mean "a straw man".

The problematic case is MT applications that don't manage sharing of FDs 
in a sensible way and need to have a way of terminating any accept() in 
other threads. On Linux that's currently done with a shutdown()+close(), 
on other platforms you can use open()+dup2(). However the Linux 
shutdown() mechanism relies on bending the POSIX semantics and the 
dup2() mechanism doesn't work on Linux as it also doesn't kick other 
threads off accept().
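
To make the two idioms concrete, a hedged side-by-side sketch; lfd is the
descriptor another thread is blocked on in accept(), and using /dev/null as
the dup2() source follows the suggestion made elsewhere in this thread:

#include <fcntl.h>
#include <unistd.h>
#include <sys/socket.h>

/* Linux idiom: shutdown() wakes sleepers in accept(), then close()
 * frees the slot.  Relies on Linux accepting shutdown() on a listener. */
void stop_linux_style(int lfd)
{
	shutdown(lfd, SHUT_RDWR);
	close(lfd);
}

/* Solaris idiom: dup2() something harmless over the slot; threads blocked
 * in accept() on the old file get kicked off, and the descriptor number
 * is never left free for another thread to recycle. */
void stop_solaris_style(int lfd)
{
	int nullfd = open("/dev/null", O_RDONLY);

	dup2(nullfd, lfd);
	close(nullfd);
}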


At the moment, on Linux you have to explicitly call shutdown() on a 
socket on which another thread may be sat in accept(). If closing the 
socket in one thread terminated any accept()s in other threads, in the 
same way that an explicit shutdown() does, then the explicit shutdown() 
wouldn't be needed for more sensibly written apps that weren't prone to 
FD recycling races. As far as I can tell, that would work cross-platform.


Ideally there'd be a single way of doing this that worked 
cross-platform, at the moment there isn't. And yes, even if such a 
mechanism was available now it would be some time before it could be 
assumed to be available everywhere. I don't know enough about the Linux 
implementation to know if there is a practical way around this, and of 
course even if such a change were made, potential breakage of existing 
code would be a concern. If there's a better, cross-platform way of 
doing this then I'm all ears.


--
Alan Burlison


Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-23 Thread David Holland
On Wed, Oct 21, 2015 at 03:38:51PM +0100, Alan Burlison wrote:
 > On 21/10/2015 04:49, Al Viro wrote:
 > >BTW, for real fun, consider this:
 > >7)
 > >// fd is a socket
 > >fd2 = dup(fd);
 > >in thread A: accept(fd);
 > >in thread B: accept(fd);
 > >in thread C: accept(fd2);
 > >in thread D: close(fd);
 > >
 > >Which threads (if any), should get hit where it hurts?
 > 
 > A & B should return from the accept with an error. C should continue. Which
 > is what happens on Solaris.

So, I'm coming late to this discussion and I don't have the original
context; however, to me this cited behavior seems undesirable and if I
ran across it in the wild I would probably describe it as a bug.

System call processing for operations on files involves translating a
file descriptor (a number) into an open-file object (or "file
description"), struct file in BSD and I think also in Linux. The
actual system call logic operates on the open-file object, so once the
translation happens application monkeyshines involving file descriptor
numbers should have no effect on calls in progress. Other behavior
would violate the principle of least surprise, as this basic
architecture predates POSIX.
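
A toy model of that translation step (illustrative C only, not actual BSD or
Linux source; on Linux the real by-index lookup is lockless and RCU-protected,
as noted elsewhere in this thread):

struct open_file {
	long refcount;
	/* ... offset, flags, ops ... */
};

struct fdtable {
	struct open_file **files;	/* indexed by descriptor number */
	int nfiles;
};

struct open_file *resolve(struct fdtable *t, int fd)
{
	if (fd < 0 || fd >= t->nfiles || !t->files[fd])
		return 0;		/* EBADF */
	t->files[fd]->refcount++;	/* the real thing does this atomically */
	return t->files[fd];		/* the call in progress now holds a
					 * reference, so later fd-number games
					 * cannot affect it */
}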

The behavior Al Viro found in NetBSD's dup2 is a bug. System calls are
supposed to be atomic, regardless of what POSIX might or might not
say.

I did not write that code but I'll pass the report on to those who
did.

-- 
David A. Holland
dholl...@netbsd.org


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-23 Thread Alan Burlison

On 23/10/2015 17:00, Eric Dumazet wrote:


Ermm, you *really* want me to submit a patch removing 'Conforms to
POSIX.1-2001' from *every* Linux manpage?


Only on the pages you think there is an error that matters.


If there's consensus that the current shutdown(), dup2(), close() and 
accept() behaviour is not POSIX-compliant then I can do that, sure.



Have you tested the patch I sent ?


The AF_UNIX poll one? No, I don't have the means to do so, and in any
case that's not a POSIX issue, just a plain bug. I'm happy to log a bug
if that helps.


We submit patches when someone needs a fix.

If not, we have more urgent issues to solve first.

I wrote the following test case, and confirmed the patch fixes the issue.

I will submit it formally.


Thanks, works for me - the poll() issue only affects AF_UNIX sockets in 
the listen state and is easily avoided by simply not setting the output 
bits in the poll events mask, so it's not exactly high priority.


--
Alan Burlison


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-23 Thread Eric Dumazet
On Fri, 2015-10-23 at 09:00 -0700, Eric Dumazet wrote:
> on the pages you think there is an error that matters.
> 
> > 
> > > Have you tested the patch I sent ?
> > 
> > The AF_UNIX poll one? No, I don't have the means to do so, and in any 
> > case that's not a POSIX issue, just a plain bug. I'm happy to log a bug 
> > if that helps.

BTW, there is no kernel bug here. POSIX poll() man page says :

POLLOUT
Normal data may be written without blocking. 

If you attempt to write on a listener, write() does _not_ block and
returns -1, which seems correct behavior to me, in accordance with man
page.

socket(PF_LOCAL, SOCK_STREAM, 0)        = 3
bind(3, {sa_family=AF_LOCAL, sun_path=@""}, 110) = 0
listen(3, 10)                           = 0
write(3, "test", 4)                     = -1 ENOTCONN (Transport endpoint is not connected)

Could you point out which part of POSIX mandates that an af_unix
listener MUST filter out POLLOUT?

Thanks.




Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-23 Thread Alan Burlison

On 23/10/2015 15:21, Eric Dumazet wrote:


I claim nothing. If you believe a man page should be fixed, please send
a patch to man page maintainer.


Ermm, you *really* want me to submit a patch removing 'Conforms to 
POSIX.1-2001' from *every* Linux manpage?



Have you tested the patch I sent ?


The AF_UNIX poll one? No, I don't have the means to do so, and in any 
case that's not a POSIX issue, just a plain bug. I'm happy to log a bug 
if that helps.



The goal here is to improve things, not to say you are right or I am
right. I am very often wrong, then what ?

This list is about linux kernel development.


Thank you for the clarification.

--
Alan Burlison


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-23 Thread Eric Dumazet
On Fri, 2015-10-23 at 16:46 +0100, Alan Burlison wrote:
> On 23/10/2015 15:21, Eric Dumazet wrote:
> 
> > I claim nothing. If you believe a man page should be fixed, please send
> > a patch to man page maintainer.
> 
> Ermm, you *really* want me to submit a patch removing 'Conforms to 
> POSIX.1-2001' from *every* Linux manpage?

Only on the pages you think there is an error that matters.

> 
> > Have you tested the patch I sent ?
> 
> The AF_UNIX poll one? No, I don't have the means to do so, and in any 
> case that's not a POSIX issue, just a plain bug. I'm happy to log a bug 
> if that helps.

We submit patches when someone needs a fix.

If not, we have more urgent issues to solve first.

I wrote the following test case, and confirmed the patch fixes the issue.

I will submit it formally.

Thanks.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
#include <poll.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>

static void fail(const char *str)
{
	perror(str);
	printf("FAIL\n");
	exit(1);
}

int main(int argc, char *argv[])
{
	int listener = socket(AF_UNIX, SOCK_STREAM, 0);
	struct pollfd pfd;
	struct sockaddr_un addr;
	int res;

	if (listener == -1)
		perror("socket()");
	memset(&addr, 0, sizeof(addr));
	addr.sun_family = AF_UNIX;
	if (bind(listener, (struct sockaddr *)&addr, sizeof(addr)) == -1)
		fail("bind()");
	if (listen(listener, 10) == -1)
		fail("listen()");
	pfd.fd = listener;
	pfd.events = -1;	/* ask for every event */
	res = poll(&pfd, 1, 10);
	if (res == -1)
		fail("poll()");
	if (res == 1 && pfd.revents & (POLLOUT|POLLWRNORM|POLLWRBAND)) {
		fprintf(stderr, "poll(af_unix listener) returned a POLLOUT status !\n");
		printf("FAIL\n");
		return 1;
	}
	printf("OK\n");
	return 0;
}




Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-23 Thread Eric Dumazet
On Fri, 2015-10-23 at 15:20 +0200, casper@oracle.com wrote:

> This point was not about a 10M sockets server but in general...
> 

Hey, we do not design the OS only for handling desktop workloads.

If netstat does not work on this typical server workload, then it is a
joke.

> Solaris has had such an option too, but that wasn't the point.  You really 
> don't want to know which application is doing this?


Apparently nobody had this need before you asked.

My point is that instead of carrying ~30 years of legacy, we had
better design the system properly so that we can _use_ it in all
situations.

ps used to open /dev/kmem a long time ago.

fuser and similar commands having to know system internals are really
hacks.





Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-23 Thread Eric Dumazet
On Fri, 2015-10-23 at 15:20 +0200, casper@oracle.com wrote:
> 
> >Yet another POSIX deficiency.
> >
> >When a server deals with 10,000,000+ socks, we absolutely do not care of
> >this requirement.
> >
> >O(log(n)) is still crazy if it involves O(log(n)) cache misses.
> 
> You miss the fine point of the algorithm; you *always* find an available 
> file descriptor in O(log(N)) (where N is the size of the table).
> 
> Does your algorithm guarantee that?

It's a simple bit map. Each cache line contains 64*8 = 512 bits. Lookup
is actually quite fast thanks to hardware prefetches. 

The main problem we had until recently was that we used to acquire the fd
table lock 3 times per accept()/close() pair.

This patch took care of the issue :

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=8a81252b774b53e628a8a0fe18e2b8fc236d92cc

Then, adding an O_RANDOMFD flag (or O_DO_NOT_REQUIRE_POSIX_FD_RULES) helps
a lot, as it allows us to pick a random starting point during
the bitmap search.

In practice, finding a slot in the fd array is O(1): one cache line miss
exactly.
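
A user-space sketch of that random-start probe (illustrative only: O_RANDOMFD
is a proposed flag, and the real allocator runs under the file table lock
described elsewhere in this thread):

#include <stdint.h>
#include <stdlib.h>

int find_slot_random(uint64_t *map, int nwords)
{
	int start = rand() % nwords, i;

	for (i = 0; i < nwords; i++) {
		int w = (start + i) % nwords;

		if (~map[w]) {				/* a zero bit exists */
			int bit = __builtin_ctzll(~map[w]);

			map[w] |= 1ULL << bit;		/* claim it */
			return w * 64 + bit;
		}
	}
	return -1;					/* table full */
}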




Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-23 Thread Eric Dumazet
On Fri, 2015-10-23 at 11:52 +0200, casper@oracle.com wrote:
> 
> >Ho-hum...  It could even be made lockless in fast path; the problems I see
> >are
> > * descriptor-to-file lookup becomes unsafe in a lot of locking
> >conditions.  Sure, most of that happens on the entry to some syscall, with
> >very light locking environment, but... auditing every sodding ioctl that
> >might be doing such lookups is an interesting exercise, and then there are
> >->mount() instances doing the same thing.  And procfs accesses.  Probably
> >nothing impossible to deal with, but nothing pleasant either.
> 
> In the Solaris kernel code, the ioctl code is generally not handed a file 
> descriptor but instead a file pointer (i.e., the lookup is done early in 
> the system call).
> 
> In those specific cases where a system call needs to convert a file 
> descriptor to a file pointer, there is only one routine which can be used.
> 
> > * memory footprint.  In case of Linux on amd64 or sparc64,
> >main()
> >{
> > int i;
> > for (i = 0; i < 1<<24; dup2(0, i++))// 16M descriptors
> > ;
> >}
> >will chew 132Mb of kernel data (16M pointers + 32M bits, assuming sufficient 
> >ulimit -n,
> >of course).  How much will Solaris eat on the same?
> 
> Yeah, that is a large amount of memory.  Of course, the table is only 
> sized when it is extended and there is a reason why there is a limit on 
> file descriptors.  But we're using more data per file descriptor entry.
> 
> 
> > * related to the above - how much cacheline sharing will that involve?
> >These per-descriptor use counts are bitch to pack, and giving each a 
> >cacheline
> >of its own...  
> 
> As I said, we do actually use a lock and yes that means that you really  
> want to have a single cache line for each and every entry.  It does make 
> it easy to have non-racy file description updates.  You certainly do not 
> want false sharing when there is a lot of contention.
> 
> Other data is used to make sure that it only takes O(log(n)) to find the 
> lowest available file descriptor entry.  (Where n, I think, is the returned
> descriptor)

Yet another POSIX deficiency.

When a server deals with 10,000,000+ socks, we absolutely do not care of
this requirement.

O(log(n)) is still crazy if it involves O(log(n)) cache misses.

> 
> Not contended locks aren't expensive.  And all is done on a single cache 
> line.
> 
> One question about the Linux implementation: what happens when a socket in 
> select is closed?  I'm assuming that the kernel waits until "shutdown" is 
> given or when a connection comes in?
> 
> Is it a problem that you can "hide" your listening socket with a thread in 
> accept()?  I would think so (It would be visible in netstat but you can't 
> easily find out who has it)

Again, netstat -p on a server with 10,000,000 sockets never completes.

Never try this unless you are desperate and want to avoid a reboot
maybe.

If you absolutely want to nuke a listener because of untrusted
applications, we better implement a proper syscall.

Android has such a facility.

An alternative would be to extend netlink (ss command from iproute2
package) to carry one pid per socket.

ss -atnp state listening

-> would not have to readlink (/proc/*/fd/*)





Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-23 Thread Casper . Dik


>Yet another POSIX deficiency.
>
>When a server deals with 10,000,000+ socks, we absolutely do not care of
>this requirement.
>
>O(log(n)) is still crazy if it involves O(log(n)) cache misses.

You miss the fine point of the algorithm; you *always* find an available 
file descriptor in O(log(N)) (where N is the size of the table).

Does your algorithm guarantee that?

>> Is it a problem that you can "hide" your listening socket with a thread in 
>> accept()?  I would think so (It would be visible in netstat but you can't 
>> easily find out who has it)
>
>Again, netstat -p on a server with 10,000,000 sockets never completes.

This point was not about a 10M sockets server but in general...

>Never try this unless you are desperate and want to avoid a reboot
>maybe.
>
>If you absolutely want to nuke a listener because of untrusted
>applications, we better implement a proper syscall.
>
>Android has such a facility.

Solaris has had such an option too, but that wasn't the point.  You really 
don't want to know which application is doing this?

>An alternative would be to extend netlink (ss command from iproute2
>package) to carry one pid per socket.
>
>ss -atnp state listening

That would be an option too.

Casper




Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-23 Thread Alan Burlison

On 23/10/2015 14:02, Eric Dumazet wrote:


Other data is used to make sure that it only takes O(log(n)) to find the
lowest available file descriptor entry.  (Where n, I think, is the returned
descriptor)


Yet another POSIX deficiency.

When a server deals with 10,000,000+ socks, we absolutely do not care of
this requirement.

O(log(n)) is still crazy if it involves O(log(n)) cache misses.


If you think it's a POSIX deficiency then logging a DR is probably the 
correct way of addressing that. And as I've said it's fine to decide 
that you don't care about what POSIX says on the subject but you can't 
simultaneously claim POSIX conformance. One or the other, not both.


--
Alan Burlison


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-23 Thread Casper . Dik


>Ho-hum...  It could even be made lockless in fast path; the problems I see
>are
>   * descriptor-to-file lookup becomes unsafe in a lot of locking
>conditions.  Sure, most of that happens on the entry to some syscall, with
>very light locking environment, but... auditing every sodding ioctl that
>might be doing such lookups is an interesting exercise, and then there are
>->mount() instances doing the same thing.  And procfs accesses.  Probably
>nothing impossible to deal with, but nothing pleasant either.

In the Solaris kernel code, the ioctl code is generally not handed a file 
descriptor but instead a file pointer (i.e., the lookup is done early in 
the system call).

In those specific cases where a system call needs to convert a file 
descriptor to a file pointer, there is only one routine which can be used.

>   * memory footprint.  In case of Linux on amd64 or sparc64,
>main()
>{
>   int i;
>   for (i = 0; i < 1<<24; dup2(0, i++))// 16M descriptors
>   ;
>}
>will chew 132Mb of kernel data (16M pointers + 32M bits, assuming sufficient 
>ulimit -n,
>of course).  How much will Solaris eat on the same?

Yeah, that is a large amount of memory.  Of course, the table is only 
sized when it is extended and there is a reason why there is a limit on 
file descriptors.  But we're using more data per file descriptor entry.


>   * related to the above - how much cacheline sharing will that involve?
>These per-descriptor use counts are bitch to pack, and giving each a cacheline
>of its own...  

As I said, we do actually use a lock and yes that means that you really  
want to have a single cache line for each and every entry.  It does make 
it easy to have non-racy file description updates.  You certainly do not 
want false sharing when there is a lot of contention.

Other data is used to make sure that it only takes O(log(n)) to find the 
lowest available file descriptor entry.  (Where n, I think, is the returned
descriptor)

Not contended locks aren't expensive.  And all is done on a single cache 
line.

One question about the Linux implementation: what happens when a socket in 
select is closed?  I'm assuming that the kernel waits until "shutdown" is 
given or when a connection comes in?

Is it a problem that you can "hide" your listening socket with a thread in 
accept()?  I would think so (It would be visible in netstat but you can't 
easily find out who has it)

Casper



Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-23 Thread Alan Burlison

On 23/10/2015 17:19, Eric Dumazet wrote:


The AF_UNIX poll one? No, I don't have the means to do so, and in any
case that's not a POSIX issue, just a plain bug. I'm happy to log a bug
if that helps.


BTW, there is no kernel bug here. POSIX poll() man page says :

POLLOUT
 Normal data may be written without blocking.

If you attempt to write on a listener, write() does _not_ block and
returns -1, which seems correct behavior to me, in accordance with man
page.


Except of course data may not be written, because an attempt to actually 
do so fails, because the socket is in the listen state, is not connected 
and therefore no attempt to write to it could ever succeed. The only bit 
of the required behaviour that the current AF_UNIX poll implementation 
actually gets right is the "without blocking" bit, and that's only the 
case because the failure is detected immediately and the write call 
returns immediately with an error.



socket(PF_LOCAL, SOCK_STREAM, 0)        = 3
bind(3, {sa_family=AF_LOCAL, sun_path=@""}, 110) = 0
listen(3, 10)                           = 0
write(3, "test", 4)                     = -1 ENOTCONN (Transport endpoint is not connected)

Could you point out which part of POSIX is mandating that af_unix
listener MUST filter out POLLOUT ?


"A file descriptor for a socket that is listening for connections shall 
indicate that it is ready for reading, once connections are available. A 
file descriptor for a socket that is connecting asynchronously shall 
indicate that it is ready for writing, once a connection has been 
established."


If POSIX had to explicitly list every possible thing that 
implementations *should not* do rather than just those that they 
*should* do then it would be even more unwieldy than it already is.


And if what you are asserting is correct, why isn't the AF_INET 
behaviour the same as the AF_UNIX behaviour?


--
Alan Burlison


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-23 Thread Al Viro
On Thu, Oct 22, 2015 at 09:50:10PM +0200, casper@oracle.com wrote:
> >Sigh...  It completely fails to mention descriptor-passing.  Which
> > a) is relevant to what "last close" means and
> > b) had been there for nearly the third of a century.
> 
> Why is that different?  These clearly count as file descriptors.

To quote your own posting upthread (Message-ID:
<201510212033.t9lkx4g8007...@room101.nl.oracle.com>)

> Well, a file descriptor really only exists in the context of a process; 
> in-flight it is no longer a file descriptor as there process context with 
  ^^^
> a file descriptor table; so pointers to file descriptions are passed 
> around.

IMO it shows that the wording is anything but clear.  The only way to claim
that it's accurate is, indeed, to declare that the contents of an in-flight
SCM_RIGHTS datagram should be counted as file descriptors and that's too
much of a stretch for the reasons you've pointed to.


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-23 Thread Eric Dumazet
On Fri, 2015-10-23 at 17:40 +0100, Alan Burlison wrote:
> On 23/10/2015 17:19, Eric Dumazet wrote:
> 
> >>> The AF_UNIX poll one? No, I don't have the means to do so, and in any
> >>> case that's not a POSIX issue, just a plain bug. I'm happy to log a bug
> >>> if that helps.
> >
> > BTW, there is no kernel bug here. POSIX poll() man page says :
> >
> > POLLOUT
> >  Normal data may be written without blocking.
> >
> > If you attempt to write on a listener, write() does _not_ block and
> > returns -1, which seems correct behavior to me, in accordance with man
> > page.
> 
> Except of course data may not be written, because an attempt to actually 
> do so fails, because the socket is in the listen state, is not connected 
> and therefore no attempt to write to it could ever succeed. The only bit 
> of the required behaviour that the current AF_UNIX poll implementation 
> actually gets right is the "without blocking" bit, and that's only the 
> case because the failure is detected immediately and the write call 
> returns immediately with an error.

Yeah, I know some people use poll(NULL, 0, timeout) to implement
msleep().

Because it definitely impresses friends.

So why not poll(&pfd, 1, timeout) to do the same, with a socket listener
and POLLOUT in pfd.events.

Go figure. I'll send the fine patch.

Thanks.





Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-23 Thread Al Viro
On Fri, Oct 23, 2015 at 11:52:34AM +0200, casper@oracle.com wrote:
> 
> 
> >Ho-hum...  It could even be made lockless in fast path; the problems I see
> >are
> > * descriptor-to-file lookup becomes unsafe in a lot of locking
> >conditions.  Sure, most of that happens on the entry to some syscall, with
> >very light locking environment, but... auditing every sodding ioctl that
> >might be doing such lookups is an interesting exercise, and then there are
> >->mount() instances doing the same thing.  And procfs accesses.  Probably
> >nothing impossible to deal with, but nothing pleasant either.
> 
> In the Solaris kernel code, the ioctl code is generally not handed a file 
> descriptor but instead a file pointer (i.e., the lookup is done early in 
> the system call).

The one that comes as the first argument of ioctl(2) - sure, but e.g.
BTRFS_IOC_CLONE_RANGE gets a pointer to this:
struct btrfs_ioctl_clone_range_args {
  __s64 src_fd;
  __u64 src_offset, src_length;
  __u64 dest_offset;
};
as the third argument.  VFS sure as hell has no idea of that thing - it's
up to btrfs_ioctl() to copy it in and deal with what it had been given.
While we are at it, ioctl(fd, BTRFS_IOC_CLONE, src_fd) also gets the
second descriptor-to-file lookup in btrfs-specific code; fd, of course,
is looked up by VFS code.  Now, these two are done in locking-neutral
environment, but that's just two of several dozens.

And no, I'm not fond of such irregular ways to pass file descriptors, but
we can't kill ioctl(2) with all weirdness hiding behind it, more's the pity...

> In those specific cases where a system call needs to convert a file 
> descriptor to a file pointer, there is only one routine which can be used.

Obviously, but the problem is deadlock avoidance using it.

> As I said, we do actually use a lock and yes that means that you really  
> want to have a single cache line for each and every entry.  It does make 
> it easy to have non-racy file description updates.  You certainly do not 
> want false sharing when there is a lot of contention.
> 
> Other data is used to make sure that it only takes O(log(n)) to find the 
> lowest available file descriptor entry.  (Where n, I think, is the returned
> descriptor)

TBH, with that kind of memory footprint I would be more interested in
constants than in asymptotics - not that O(log(n)) would be hard to
arrange (a bunch of bitmaps with something like 1:512 ratio between the
levels, to keep the damn thing within a reasonable cacheline size would
probably do with not too horrible constant; 3 levels of that would already
give 128M descriptors, and that's a gigabyte worth of struct file *
alone; with your "cacheline per descriptor" it's what, about 8Gb eaten by
descriptor table?)

Hell knows, might be worth doing regardless of anything else.  Not making
it worse than our plain bitmap variant in any situation shouldn't be hard...
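
For illustration, a two-level toy version of such a bitmap tree, with a 1:64
ratio so plain 64-bit words suffice (a sketch of the idea, not proposed
kernel code):

#include <stdint.h>

struct fdmap2 {
	uint64_t summary;	/* bit w set -> leaf[w] is completely full */
	uint64_t leaf[64];	/* up to 64 * 64 = 4096 descriptors */
};

int alloc_lowest(struct fdmap2 *m)
{
	if (!~m->summary)
		return -1;			/* every leaf is full */

	int w = __builtin_ctzll(~m->summary);	/* lowest non-full leaf */
	int b = __builtin_ctzll(~m->leaf[w]);	/* lowest zero bit in it */

	m->leaf[w] |= 1ULL << b;
	if (!~m->leaf[w])
		m->summary |= 1ULL << w;	/* leaf just became full */
	return w * 64 + b;
}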

> Not contended locks aren't expensive.  And all is done on a single cache 
> line.

The memory footprint is really scary.  Bitmaps are pretty much noise, but
blowing it up by a factor of 8 on normal 64bit (or 16 on something like Itanic -
or Venus for that matter, which is more relevant for you guys)

Said that, what's the point of "close won't return until..."?  After all,
you can't guarantee that thread with cancelled syscall won't lose CPU
immediately upon return to userland, so it *can't* make any assumptions
about the descriptor not having been already reused.  I don't get it - what
does that buy for userland code?


Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-23 Thread Al Viro
On Fri, Oct 23, 2015 at 06:30:25PM +, David Holland wrote:

> So, I'm coming late to this discussion and I don't have the original
> context; however, to me this cited behavior seems undesirable and if I
> ran across it in the wild I would probably describe it as a bug.

Unfortunately, that's precisely what NetBSD is trying to implement (and
that's what will happen if nothing else reopens fd).  See the logics in
fd_close(), with ->fo_restart() and waiting for all activity to settle
down.  As for the missing context, what fd_close() is doing is also
unreliable - inducing ERESTART in other threads sitting in accept(2) and
things like that and waiting for them to run into EBADF they'll get
(barring races) on syscall restart; threads sitting in accept() et.al.
on the same struct file, but with different descriptors will hopefully
go into restart and continue unaffected.  All that machinery relies on
nothing having reused the descriptor for socket(2), dup2() target, etc. while
those threads had been going through the syscall restart - if that happens,
you are SOL, since accept(2) _will_ restart on an unexpected socket.

Moreover, if you fix dup2() atomicity, this approach will reliably shit
itself for situations when dup2() rather than close() is used to close
the socket.  It relies upon having at least some window where the victim
descriptor would be yielding EBADF.

> System call processing for operations on files involves translating a
> file descriptor (a number) into an open-file object (or "file
> description"), struct file in BSD and I think also in Linux. The
> actual system call logic operates on the open-file object, so once the
> translation happens application monkeyshines involving file descriptor
> numbers should have no effect on calls in progress. Other behavior
> would violate the principle of least surprise, as this basic
> architecture predates POSIX.

Well, to be fair, until '93 there was no way to have the descriptor table
changed under a syscall in the first place.  The old model (everything up to
and including 4.4BSD final) simply didn't include anything of that sort - mapping
from descriptors to open files was not shared and all changes a syscall might
see were ones done by the syscall itself.

So this thing isn't covered by the basic architecture - it's something that
had been significantly new merely two decades ago.  And POSIX still hasn't
quite caught up with that newfangled 4.2BSD thing...

IMO what you've described above is fine - that's how Linux works, that's
how FreeBSD and OpenBSD work and that's how NetBSD used to work until 2008
or so.  "Cancel syscall if any of the descriptors got dissociated from
opened files by action of another thread, have the dissociating operation
wait for all affected syscalls to run down" thing had been introduced then
and it is similar to what Solaris is doing.

AFAICS, the main issue with that is the memory footprint from hell and/or
cacheline clusterfuck.  Having accept(2) bugger off with e.g. EINTR in such
situation isn't inherently worse or better than having it sit there as if
close() or dup2() has not happened - matter of taste, and if there had been a
way to do it without inflicting the price on processes that do not pull that
kind of crap in the first place... might be worth considering.

As it is, the memory footprint seems to be too heavy.  I'm not entirely
convinced that there's no clever way to avoid that, but right now I don't
see anything that would look like a good approach.


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-23 Thread Eric Dumazet
On Fri, 2015-10-23 at 14:35 +0100, Alan Burlison wrote:

> If you think it's a POSIX deficiency then logging a DR is probably the 
> correct way of addressing that. And as I've said it's fine to decide 
> that you don't care about what POSIX says on the subject but you can't 
> simultaneously claim POSIX conformance. One or the other, not both.

I claim nothing. If you believe a man page should be fixed, please send
a patch to man page maintainer.

Have you tested the patch I sent ?

The goal here is to improve things, not to say you are right or I am
right. I am very often wrong, then what ?

This list is about linux kernel development.



Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-22 Thread Al Viro
On Thu, Oct 22, 2015 at 11:55:42AM +0100, Alan Burlison wrote:
> On 22/10/2015 05:21, Al Viro wrote:
> 
> >>Most of the work on using a file descriptor is local to the thread.
> >
> >Using - sure, but what of cacheline dirtied every time you resolve a
> >descriptor to file reference?
> 
> Don't you have to do that anyway, to do anything useful with the file?

Dirtying the cacheline that contains struct file itself is different, but
that's not per-descriptor.

> >In case of Linux we have two bitmaps and an array of pointers associated
> >with descriptor table.  They grow on demand (in parallel)
> > * reserving a descriptor is done under ->file_lock (dropped/regained
> >around memory allocation if we end up expanding the sucker, actual 
> >reassignment
> >of pointers to array/bitmaps is under that spinlock)
> > * installing a pointer is lockless (we wait for ongoing resize to
> >settle, RCU takes care of the rest)
> > * grabbing a file by index is lockless as well
> > * removing a pointer is under ->file_lock, so's replacing it by dup2().
> 
> Is that table per-process or global?

Usually it's per-process, but any thread could ask for a private instance
to work with (and then spawn more threads sharing that instance - or getting
independent copies).

It's common for Plan 9-inspired models - basically, you treat every thread
as a machine that consists of
* memory
* file descriptor table
* namespace
* signal handlers
...
* CPU (i.e. actual thread of execution).
The last part can't be shared; anything else can.  fork(2) variant used to
start new threads (clone(2) in case of Linux, rfork(2) in Plan 9 and *BSD)
is told which components should be copies of parent's ones and which should
be shared with the parent.  fork(2) is simply "copy everything except for the
namespace".  It's fairly common to have "share everything", but intermediate
variants are also possible.  There are constraints (e.g. you can't share
signal handlers without sharing the memory space), but descriptor table
can be shared independently from memory space just fine.  There's also a
way to say "unshare this, this and that components" - mapped to unshare(2) in
Linux and to rfork(2) in Plan 9.

Best way to think of that is to consider descriptor table as a first-class
object a thread can be connected to.  Usually you have one for each process,
with all threads belonging to that process connected to the same thing,
but that's just the most common use.
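
As a minimal sketch of that model, assuming Linux's clone(2): CLONE_FILES
picks whether the descriptor-table component is shared or copied (error
handling omitted, the /dev/null open is just illustrative):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static int opened = -1;

static int child(void *unused)
{
    /* lands in the shared descriptor table */
    opened = open("/dev/null", O_RDONLY);
    return 0;
}

int main(void)
{
    char *stack = malloc(64 * 1024);
    /* CLONE_FILES shares the descriptor table, CLONE_VM the memory;
       dropping CLONE_FILES would give the child an independent copy */
    pid_t pid = clone(child, stack + 64 * 1024,
                      CLONE_FILES | CLONE_VM | SIGCHLD, NULL);

    waitpid(pid, NULL, 0);
    /* the descriptor the child opened is valid in the parent too */
    printf("child's fd %d, F_GETFD -> %d\n", opened, fcntl(opened, F_GETFD));
    return 0;
}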

> I don't think that it's possible to claim that a non-atomic dup2()
> is POSIX-compliant.

Except that it's in the non-normative part of dup2(2), AFAICS.  I certainly
agree that it would be a standard lawyering beyond reason, but "not
possible to claim" is too optimistic.  Maybe I'm just more cynical...

> ThreadA remains sat in accept on fd1 which is now a plain file, not
> a socket.

No.  accept() is not an operation on file descriptors; it's an operation on
file descriptions (pardon for use of that terminology).  They are specified
by passing descriptors, but there's a hell of a difference between e.g.
dup() or fcntl(,F_SETFD,) (operations on descriptors) and read() or lseek()
(operations on descriptions).

Lookups are done once per syscall; the only exception is F_SETLK{,W}, where
we recheck that the descriptor is referring to the same thing before granting
the lock.

Again, POSIX is still underspecifying the semantics of shared descriptor
tables; back when the bulk of it had been written there had been no way
to have a descriptor -> description mapping changed under a syscall by
action of another thread.  Hell, they still hadn't picked up on some things
that happened in the early 80s, let alone the early-to-mid 90s...

Linux and Solaris happen to cover these gaps differently; FreeBSD and
OpenBSD are probably closer to Linux variant, NetBSD - to Solaris one.


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-22 Thread Casper . Dik

>On Thu, Oct 22, 2015 at 08:34:19AM +0200, casper@oracle.com wrote:
>> 
>> 
>> >And I'm really curious about the things Solaris would do with dup2() there.
>> >Does it take into account the possibility of new accept() coming just as
>> >dup2() is trying to terminate the ongoing ones?  Is there a window when
>> >descriptor-to-file lookups would fail?  Looks like a race/deadlock 
>> >country...
>> 
>> Solaris does not "terminate" threads; instead it tells them that the
>> file descriptor information used is stale and wakes up the thread.
>
>Sorry, lousy wording - I meant "terminate syscall in another thread".
>Better yet, make that "what happens if new accept(newfd) comes while dup2()
>waits for affected syscalls in other threads to finish"?  Assuming it
>does wait, that is...

No, there is no such window; the accept() call either returns EBADF (dup2()
wins the race) or it returns a new file descriptor (and dup2() then closes
the listening descriptor).

One or the other.

>While we are at it, what's the relative order of record locks removal
>and switching the meaning of newfd?  In our kernel it happens *after*
>the switchover (i.e. if another thread is waiting for a record lock held on
>any alias of newfd and we do dup2(oldfd, newfd), the waiter will not see
>the state with newfd still referring to what it used to; note that the waiter
>might be using any descriptor referring to the file newfd used to refer
>to, so it won't be affected by the "wake those who had the meaning of
>their arguments change" side of things).

The external behaviour is atomic; you cannot distinguish the order
between the closing of the original file (and the waking of other threads
waiting for a record lock) and the changing of the file referenced by that newfd.

But this does not involve a global or process-specific lock.

Casper



Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-22 Thread Casper . Dik

From: Al Viro 

>Except that in this case "correctness" is the matter of rather obscure and
>ill-documented areas in POSIX.  Don't get me wrong - this semantics isn't
>inherently bad, but it's nowhere near being an absolute requirement.

It would be more fruitful to have such a discussion in one of the OpenGroup
mailing lists; the people gathered there have a lot of experience, and it is
also possible to fix the standard when it turns out that it is indeed as
vague as you claim it is (I don't quite agree with that).

Casper



Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-22 Thread Al Viro
On Thu, Oct 22, 2015 at 05:44:58AM +0100, Al Viro wrote:

> Except that in this case "correctness" is the matter of rather obscure and
> ill-documented areas in POSIX.  Don't get me wrong - this semantics isn't
> inherently bad, but it's nowhere near being an absolute requirement.

PS: in principle, a fairly ugly trick might suffice for accept(2), but
I'm less than happy with going there.  Namely, we could
* have ->accept() get descriptor number
* have ->flush() get descriptor number in addition to current->files
and have it DTRT for sockets in the middle of accept(2).

However, in addition to being ugly as hell, it has the problem with the points
where we call ->flush(), specifically do_dup2() and __close_fd().
It's done *after* the replacement/removal from descriptor table, so another
socket might have already gotten the same descriptor and we'd get spurious
termination of accept(2).

And I'm really curious about the things Solaris would do with dup2() there.
Does it take into account the possibility of new accept() coming just as
dup2() is trying to terminate the ongoing ones?  Is there a window when
descriptor-to-file lookups would fail?  Looks like a race/deadlock country...


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-22 Thread Casper . Dik

>It's been said that the current mechanisms in Linux & some BSD variants 
>can be subject to races, and the behaviour exhibited doesn't conform to 
>POSIX, for example requiring the use of shutdown() on unconnected 
>sockets because close() doesn't kick off other threads accept()ing on 
>the same fd. I'd be interested to hear if there's a better and more 
>performant way of handling the situation that doesn't involve doing the 
>sort of bookkeeping Casper described.

Of course, the implementation is now around 18 years old; clearly a lot of 
things have changed since then.

In the particular case of Linux close() on a socket, surely it must be 
possible to detect at close that it is a listening socket and that you are 
about to close the last reference; the kernel could then do the shutdown() 
all by itself.

Casper



Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-22 Thread Alan Burlison

On 22/10/2015 18:05, Al Viro wrote:


Oh, for...  Right in this thread an example of complete BS has been quoted
from POSIX close(2).  The part about closing a file when the last descriptor
gets closed.  _Nothing_ is POSIX-compliant in that respect (nor should
it be).


That's not exactly what it says, as we've already discussed - for example in
the case of pending async IO on a filehandle.



Semantics around the distinction between file descriptors and
 file descriptions is underspecified, not to mention being very poorly
written.


I agree that part could do with some polishing.


You want to add something along the lines of "if any action by another thread
changes the mapping from file descriptors to file descriptions for any
file descriptor passed to a syscall, such and such things should happen" - go
ahead and specify what should happen.  As it is, I don't see anything of
that sort in e.g. accept(2).  And no,
[EBADF]
The socket argument is not a valid file descriptor.
in there is nowhere near being unambiguous enough - everyone agrees that
argument should be a valid descriptor at the time of call, but I would be
very surprised to find _any_ implementation (including Solaris one)
recheck that upon exit to userland.


The scenario I described previously, where dup2() is used to modify a fd
that's being used in accept, results in the accept call terminating in
the same way as if close had been called on it - Casper gave details
earlier.



For more bullshit from the same source (issue 7, close(2)):
If fildes refers to a socket, close() shall cause the socket to be
destroyed. If the socket is in connection-mode, and the SO_LINGER
option is set for the socket with non-zero linger time, and the socket
has untransmitted data, then close() shall block for up to the current
linger interval until all data is transmitted.
I challenge you to find *any* implementation that would have
fd = socket(...);
close(dup(fd));
do what this wonder of technical prose clearly requests.  In the same text we
also have
When all file descriptors associated with a pipe or FIFO special file
are closed, any data remaining in the pipe or FIFO shall be discarded.
as well as explicit (and underspecified, but perhaps they do it elsewhere)
"last close" in parts related to sockets and ptys.


Yes, Casper has just reported that to TOG, see 
http://thread.gmane.org/gmane.comp.standards.posix.austin.general/11573. 
Our assessment is that sockets should behave the same way as plain 
files, i.e. 'last close'.
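
For what it's worth, a quick sketch of that "last close" behaviour as
implementations actually provide it (error handling omitted):

#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int type;
    socklen_t len = sizeof(type);

    close(dup(fd));    /* closes only the duplicate descriptor */
    /* fd still refers to a live socket; the "close() destroys the socket"
       reading of the quoted wording would have this fail with EBADF */
    assert(getsockopt(fd, SOL_SOCKET, SO_TYPE, &type, &len) == 0);
    assert(type == SOCK_STREAM);
    close(fd);         /* this is the actual last close */
    return 0;
}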



And that is not to mention the dup2(2) wording in there:
If fildes2 is already a valid open file descriptor, it shall be
closed first
which is (a) inviting misinterpretation that would make the damn thing
non-atomic (the only mentioning of atomicity is in non-normative sections)


I've already explained why I believe atomic behaviour of dup2() is 
required by POSIX. If you feel it's not clear then we can ask POSIX for 
clarification.



and (b) says fsck-all about the effects of closing descriptor.  The latter
is a problem, since nothing in close(2) bothers making a distinction between
the effects specific to particular syscall and those common to all ways of
closing a descriptor.  And no, it's not a nitpicking - consider e.g. the
parts concerning the order of events triggered by close(2) (such and such
should be completed before close(2) returns); should it be taken as "same
events should be completed before newfd is associated with the file description
referred to by oldfd"?  It _is_ user-visible, since close(2) removes fcntl
locks.  Sure, there is (otherwise unexplained)
The dup2() function is not intended for use in critical regions
as a synchronization mechanism.
down in informative sections, so one can infer that event order here isn't
to be relied upon.  With no way to guess whether the event order concerning
e.g. effect on ongoing accept(newfd) is any different in that respect.


I think "it shall be closed first" makes it pretty clear that what is 
expected is the same behaviour as any direct invocation of close, and 
that has to happen before the reassignment. What makes you believe 
that isn't the case?



The entire area in Issue 7 stinks.  It might make sense to try and fix it
up, but let's not pretend that what's in there right now does specify the
semantics in this kind of situations.


Sorry, I disagree.


I'm not saying that Solaris approach yields an inherently bad semantics or
that it's impossible to implement without high scalability price and/or
high memory footprint.  But waving the flag of POSIX compliance when you
are actually talking about the ways your implementation plugs the holes in
a badly incomplete spec...


Personally I believe the spec is clear enough to allow an unambiguous
interpretation of the required behavior in this area. If you think there
are areas where the Solaris behaviour is in disagreement with the spec
then I'd be interested to hear them.

Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-22 Thread Casper . Dik

>On Thu, Oct 22, 2015 at 08:24:51PM +0200, casper@oracle.com wrote:
>
>> The external behaviour is atomic; you cannot distinguish the order
>> between the closing of the original file (and the waking of other threads
>> waiting for a record lock) and the changing of the file referenced by that newfd.
>> 
>> But this does not involve a global or process-specific lock.
>
>Interesting...  Do you mean that descriptor-to-file lookup blocks until that
>rundown finishes?

For that particular file descriptor, yes.  (I'm assuming you mean the
Solaris kernel running down all LWPs that have a syscall in progress on that
particular file descriptor).  All other fd-to-file lookups are not blocked
at all by this locking.

It should be clear that any such occurrences are application errors and
should hardly ever be seen in practice.  It is also known when this is
needed, so it is hardly ever attempted.

Casper



Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-22 Thread Al Viro
On Thu, Oct 22, 2015 at 06:39:34PM +0100, Alan Burlison wrote:
> On 22/10/2015 18:05, Al Viro wrote:
> 
> >Oh, for...  Right in this thread an example of complete BS has been quoted
> >from POSIX close(2).  The part about closing a file when the last descriptor
> >gets closed.  _Nothing_ is POSIX-compliant in that respect (nor should
> >it be).
> 
> That's not exactly what it says, we've already discussed, for
> example in the case of pending async IO on a filehandle.

Sigh...  It completely fails to mention descriptor-passing.  Which
a) is relevant to what "last close" means and
b) had been there for nearly a third of a century.

> I agree that part could do with some polishing.

google("wire brush of enlightenment") is what comes to mind...

> >and (b) says fsck-all about the effects of closing descriptor.  The latter
> >is a problem, since nothing in close(2) bothers making a distinction between
> >the effects specific to particular syscall and those common to all ways of
> >closing a descriptor.  And no, it's not a nitpicking - consider e.g. the
> >parts concerning the order of events triggered by close(2) (such and such
> >should be completed before close(2) returns); should it be taken as "same
> >events should be completed before newfd is associated with the file
> >description referred to by oldfd"?  It _is_ user-visible, since close(2) removes fcntl
> >locks.  Sure, there is (otherwise unexplained)
> > The dup2() function is not intended for use in critical regions
> > as a synchronization mechanism.
> >down in informative sections, so one can infer that event order here isn't
> >to be relied upon.  With no way to guess whether the event order concerning
> >e.g. effect on ongoing accept(newfd) is any different in that respect.
> 
> I think "it shall be closed first" makes it pretty clear that what
> is expected is the same behaviour as any direct invocation of close,
> and that has to happen before the reassignment. What makes you
> believe that isn't the case?

So unless I'm misparsing something, you want
thread A: accept(newfd)
thread B: dup2(oldfd, newfd)
have accept() bugger off before the switchover happens?

What should happen if thread C does accept(newfd) right as B has decided that
there's nothing more to wait?  For close(newfd) it would be simple - we are
going to have lookup by descriptor fail with EBADF anyway, so making it do
so as soon as we go hunting for those who are currently in accept(newfd)
would do the trick - no new threads like that shall appear and as long as
the descriptor is not declared free for taking by descriptor allocation nobody
is going to be screwed by open() picking that slot of descriptor table too
early.  Trying to do that for dup2() would lose atomicity.  I honestly don't
know how Solaris behaves in that case, BTW - the race (if any) would probably
be hard to hit, so even in the case of Linux I would have to go and RTFS before
saying that there isn't one.  I can't do that with Solaris; all I can do here
is ask you guys...

Moreover, see above for record locks removal.  Should that happen prior to
switchover?  If you have

dup2(fd, fd2);
set a record lock on fd2
spawn a thread
in child, try to grab the same lock on fd2
in parent, do some work and close(fd)

you are guaranteed that the child won't see fd referring to the same file after
it acquires the lock.
Replace close(fd) with dup2(fd3, fd); should the same hold true in that case?

FWIW, Linux behaviour in that area is to have record locks removal done
between the switchover and return to userland in case of dup2() and between
the removal from descriptor table and return to userland in case of close().
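
To make the ordering question concrete, here is a sketch of the lock-release
trap with classic POSIX record locks - note the waiter has to be a forked
child rather than a thread, since record locks are owned per-process (the
file name and the sleep() are illustrative, error handling omitted):

#include <fcntl.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fd = open("lockfile", O_RDWR | O_CREAT, 0644);
    int fd2 = dup(fd);
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };

    fcntl(fd2, F_SETLK, &fl);       /* lock the whole file via fd2 */
    if (fork() == 0) {
        int cfd = open("lockfile", O_RDWR);
        fcntl(cfd, F_SETLKW, &fl);  /* blocks while the parent holds the lock */
        printf("child: got the lock\n");
        _exit(0);
    }
    sleep(1);                       /* crude: let the child block first */
    close(fd);                      /* drops ALL of this process's locks on the
                                       file, even though fd2 is still open... */
    wait(NULL);                     /* ...so the child wakes up here */
    return 0;
}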

> Personally I believe the spec is clear enough to allow an
> unambiguous interpretation of the required behavior in this area. If
> you think there are areas where the Solaris behaviour is in
> disagreement with the spec then I'd be interested to hear them.

The spec is so vague that I strongly suspect that *both* Solaris and Linux
behaviours are not in disagreement with it (modulo shutdown(2) extension
Linux-side and we are really stuck with that one).


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-22 Thread Al Viro
On Thu, Oct 22, 2015 at 08:24:51PM +0200, casper@oracle.com wrote:

> The external behaviour is atomic; you cannot distinguish the order
> between the closing of the original file (and the waking of other threads
> waiting for a record lock) and the changing of the file referenced by that newfd.
> 
> But this does not involve a global or process-specific lock.

Interesting...  Do you mean that descriptor-to-file lookup blocks until that
rundown finishes?


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-22 Thread Al Viro
On Thu, Oct 22, 2015 at 08:34:19AM +0200, casper@oracle.com wrote:
> 
> 
> >And I'm really curious about the things Solaris would do with dup2() there.
> >Does it take into account the possibility of new accept() coming just as
> >dup2() is trying to terminate the ongoing ones?  Is there a window when
> >descriptor-to-file lookups would fail?  Looks like a race/deadlock country...
> 
> Solaris does not "terminate" threads; instead it tells them that the
> file descriptor information used is stale and wakes up the thread.

Sorry, lousy wording - I meant "terminate syscall in another thread".
Better yet, make that "what happens if new accept(newfd) comes while dup2()
waits for affected syscalls in other threads to finish"?  Assuming it
does wait, that is...

While we are at it, what's the relative order of record locks removal
and switching the meaning of newfd?  In our kernel it happens *after*
the switchover (i.e. if another thread is waiting for a record lock held on
any alias of newfd and we do dup2(oldfd, newfd), the waiter will not see
the state with newfd still referring to what it used to; note that the waiter
might be using any descriptor referring to the file newfd used to refer
to, so it won't be affected by the "wake those who had the meaning of
their arguments change" side of things).


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-22 Thread Casper . Dik

From: Al Viro 

>On Thu, Oct 22, 2015 at 06:39:34PM +0100, Alan Burlison wrote:
>> On 22/10/2015 18:05, Al Viro wrote:
>> 
>> >Oh, for...  Right in this thread an example of complete BS has been quoted
>> >from POSIX close(2).  The part about closing a file when the last descriptor
>> >gets closed.  _Nothing_ is POSIX-compliant in that respect (nor should
>> >it be).
>> 
>> That's not exactly what it says, we've already discussed, for
>> example in the case of pending async IO on a filehandle.
>
>Sigh...  It completely fails to mention descriptor-passing.  Which
>   a) is relevant to what "last close" means and
>   b) had been there for nearly a third of a century.

Why is that different?  These clearly count as file descriptors.

>> I agree that part could do with some polishing.
>
>google("wire brush of enlightenment") is what comes to mind...

Standardese is similar to legalese; it is not writing that is directly open
to interpretation, and those who are not inducted in writing it may have some
trouble interpreting what exactly is meant by the wording of the standard.



>> I think "it shall be closed first" makes it pretty clear that what
>> is expected is the same behaviour as any direct invocation of close,
>> and that has to happen before the reassignment. What makes you
>> believe that's isn't the case?
>
>So unless I'm misparsing something, you want
>thread A: accept(newfd)
>thread B: dup2(oldfd, newfd)
>have accept() bugger off before the switchover happens?

Well, certainly *before* we return from dup2().
(and clearly only once we have determined that dup2() will return
successfully)

>What should happen if thread C does accept(newfd) right as B has decided that
>there's nothing more to wait?  For close(newfd) it would be simple - we are
>going to have lookup by descriptor fail with EBADF anyway, so making it do
>so as soon as we go hunting for those who are currently in accept(newfd)
>would do the trick - no new threads like that shall appear and as long as
>the descriptor is not declared free for taking by descriptor allocation nobody
>is going to be screwed by open() picking that slot of descriptor table too
>early.  Trying to do that for dup2() would lose atomicity.  I honestly don't
>know how Solaris behaves in that case, BTW - the race (if any) would probably
>be hard to hit, so even in the case of Linux I would have to go and RTFS before
>saying that there isn't one.  I can't do that with Solaris; all I can do here
>is ask you guys...

Solaris dup2() behaves exactly like close().

>Moreover, see above for record locks removal.  Should that happen prior to
>switchover?  If you have
>
>dup2(fd, fd2);
>set a record lock on fd2
>spawn a thread
>in child, try to grab the same lock on fd2
>in parent, do some work and close(fd)

>you are guaranteed that the child won't see fd referring to the same file
>after it acquires the lock.

Here you are talking about a lock held by the "parent", and the
"child" will only get the lock once close(fd) is done?

Yes.  The final "close" is done *after* the pointer has been removed from 
the file descriptor table.

>Replace close(fd) with dup2(fd3, fd); should the same hold true in that case?

Yes.

>FWIW, Linux behaviour in that area is to have record locks removal done
>between the switchover and return to userland in case of dup2() and between
>the removal from descriptor table and return to userland in case of close().
>
>> Personally I believe the spec is clear enough to allow an
>> unambiguous interpretation of the required behavior in this area. If
>> you think there are areas where the Solaris behaviour is in
>> disagreement with the spec then I'd be interested to hear them.
>
>The spec is so vague that I strongly suspect that *both* Solaris and Linux
>behaviours are not in disagreement with it (modulo shutdown(2) extension
>Linux-side and we are really stuck with that one).

I'm not sure if the standard allows a handful of threads in accept() for a 
file descriptor which has already been closed *and* can be re-issued for 
other uses.

Casper



Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-22 Thread Alan Burlison

On 22/10/2015 07:51, casper@oracle.com wrote:


It would be more fruitful to have such a discussion in one of the OpenGroup
mailing lists; the people gathered there have a lot of experience, and it is
also possible to fix the standard when it turns out that it is indeed as
vague as you claim it is (I don't quite agree with that)


+1. If there's interest in doing that I'll ask our POSIX rep the best 
way of initiating such a conversation.


--
Alan Burlison


Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-22 Thread Alan Burlison

On 22/10/2015 05:21, Al Viro wrote:


Most of the work on using a file descriptor is local to the thread.


Using - sure, but what of cacheline dirtied every time you resolve a
descriptor to file reference?


Don't you have to do that anyway, to do anything useful with the file?


How much does it cover and to what
degree is that local to thread?  When it's a refcount inside struct file -
no big deal, we'll be reading the same cacheline anyway and unless several
threads are pounding on the same file with syscalls at the same time,
that won't be a big deal.  But when refcounts are associated with
descriptors...


There is a refcount in the struct held in the per-process list of open 
files and the 'slow path' processing is only taken if there's more than 
one LWP in the process that's accessing the file.



In case of Linux we have two bitmaps and an array of pointers associated
with descriptor table.  They grow on demand (in parallel)
* reserving a descriptor is done under ->file_lock (dropped/regained
around memory allocation if we end up expanding the sucker, actual reassignment
of pointers to array/bitmaps is under that spinlock)
* installing a pointer is lockless (we wait for ongoing resize to
settle, RCU takes care of the rest)
* grabbing a file by index is lockless as well
* removing a pointer is under ->file_lock, so's replacing it by dup2().


Is that table per-process or global?


The point is, dup2() over _unused_ descriptor is inherently racy, but dup2()
over a descriptor we'd opened and kept open should be safe.  As it is,
their implementation boils down to "userland must do process-wide exclusion
between *any* dup2() and all syscalls that might create a new descriptor -
open()/pipe()/socket()/accept()/recvmsg()/dup()/etc".  At the very least,
it's a big QoI problem, especially since such userland exclusion would have
to be taken around the operations that can block for a long time.  Sure,
POSIX wording regarding dup2 is so weak that this behaviour can be argued
to be compliant, but... replacement of the opened file associated with
newfd really ought to be atomic to be of any use for multithreaded processes.


There's existing language in the Issue 7 dup2() description that says 
dup2() has to be atomic:


"the dup2( ) function provides unique services, as no other
interface is able to atomically replace an existing file descriptor."

And there is some new language in Issue 7 Technical Corrigenda 2 that 
reinforces that, when it's talking about reassignment of 
stdin/stdout/stderr:


"Furthermore, a close() followed by a reopen operation (e.g. open(), 
dup() etc) is not atomic; dup2() should be used to change standard file 
descriptors."


I don't think that it's possible to claim that a non-atomic dup2() is 
POSIX-compliant.
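
For illustration, the atomic idiom that TC2 wording points at might look like
this (the path is illustrative):

#include <fcntl.h>
#include <unistd.h>

int redirect_stderr(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);

    if (fd < 0)
        return -1;
    /* atomic: at no instant is descriptor 2 closed-but-unassigned, so a
       concurrent open() in another thread can never be handed fd 2; the
       close()-then-open() alternative has exactly that window */
    if (dup2(fd, STDERR_FILENO) < 0) {
        close(fd);
        return -1;
    }
    close(fd);
    return 0;
}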



IOW, if newfd is an opened descriptor prior to dup2() and no other thread
attempts to close it by any means, there should be no window during which
it would appear to be not opened.  Linux and FreeBSD satisfy that; OpenBSD
seems to do the same, from the quick look.  NetBSD doesn't, no idea about
Solaris.  FWIW, older NetBSD implementation (prior to "File descriptor changes,
discussed on tech-kern:" back in 2008) used to behave like OpenBSD one; it
had fixed a lot of crap, so it's entirely possible that OpenBSD simply has
kept the old implementation, with tons of other races in that area, but this
dup2() race got introduced in that rewrite.


Related to dup2(), there's some rather surprising behaviour on Linux. 
Here's the scenario:


--
ThreadA opens, listens and accepts on socket fd1, waiting for incoming 
connections.


ThreadB waits for a while, then opens normal file fd2 for read/write.
ThreadB uses dup2 to make fd1 a clone of fd2.
ThreadB closes fd2.

ThreadA remains sat in accept on fd1 which is now a plain file, not a 
socket.


ThreadB writes to fd1, the result of which appears in the file, so fd1 
is indeed operating as a plain file.


ThreadB exits. ThreadA is still sat in accept on fd1.

A connection is made to fd1 by another process. The accept call succeeds
and returns the incoming connection. fd1 is still operating as a
socket, even though it's now actually a plain file.

--
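
A compact repro of the scenario as it behaves on Linux (names are
illustrative, error handling omitted):

#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

static int fd1;

static void *thread_a(void *unused)
{
    /* blocks on the socket's file description, looked up once at entry */
    int c = accept(fd1, NULL, NULL);
    printf("accept() returned %d\n", c);
    return NULL;
}

int main(void)
{
    struct sockaddr_in sin = { .sin_family = AF_INET,
                               .sin_addr.s_addr = htonl(INADDR_LOOPBACK) };
    socklen_t len = sizeof(sin);
    pthread_t t;
    int fd2, c;

    fd1 = socket(AF_INET, SOCK_STREAM, 0);
    bind(fd1, (struct sockaddr *)&sin, sizeof(sin));
    listen(fd1, 1);
    getsockname(fd1, (struct sockaddr *)&sin, &len); /* remember the port */

    pthread_create(&t, NULL, thread_a, NULL);
    sleep(1);                        /* let ThreadA block in accept() */

    fd2 = open("plain.txt", O_RDWR | O_CREAT, 0644);
    dup2(fd2, fd1);                  /* fd1 now names a plain file... */
    close(fd2);
    write(fd1, "x", 1);              /* ...and writes land in the file */

    c = socket(AF_INET, SOCK_STREAM, 0);
    connect(c, (struct sockaddr *)&sin, sizeof(sin));
    pthread_join(t, NULL);           /* the accept() completes regardless */
    return 0;
}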

I assume this is another consequence of the fact that threads waiting in
accept don't get a notification if the fd they are using is closed,
either directly by a call to close or by a syscall such as dup2. Not
waking up other threads on a fd when it is closed seems like undesirable
behaviour.


I can see the reasoning behind allowing shutdown to be used to do such a
wakeup even if that's not POSIX-compliant - it may make it slightly
easier for applications to avoid fd recycling races. However the current
situation is that shutdown is the *only* way to perform such a wakeup -
simply closing the fd has no effect on any other threads. That seems
incorrect.
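
For reference, a sketch of that shutdown()-based wakeup, assuming Linux
semantics (there the blocked accept() fails with EINVAL; that errno is
not portable):

#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static int lfd;

static void *acceptor(void *unused)
{
    int c = accept(lfd, NULL, NULL);
    /* woken by shutdown(); a plain close() would have left us blocked */
    printf("accept: %d (%s)\n", c, c < 0 ? strerror(errno) : "ok");
    return NULL;
}

int main(void)
{
    struct sockaddr_in sin = { .sin_family = AF_INET,
                               .sin_addr.s_addr = htonl(INADDR_LOOPBACK) };
    pthread_t t;

    lfd = socket(AF_INET, SOCK_STREAM, 0);
    bind(lfd, (struct sockaddr *)&sin, sizeof(sin));
    listen(lfd, 1);

    pthread_create(&t, NULL, acceptor, NULL);
    sleep(1);                  /* let the acceptor block */

    shutdown(lfd, SHUT_RDWR);  /* kicks the acceptor out of accept()... */
    pthread_join(t, NULL);
    close(lfd);                /* ...so the descriptor can now be closed safely */
    return 0;
}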


--
Alan Burlison

Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-22 Thread Alan Burlison

On 22/10/2015 05:44, Al Viro wrote:


It's been said that the current mechanisms in Linux & some BSD
variants can be subject to races


You do realize that it goes for the entire area?  And the races found
in this thread are in the BSD variant that tries to do something similar
to what you guys say Solaris is doing, so I'm not sure which way does
that argument go.  A high-level description with the same level of details
will be race-free in _all_ of them.  The devil is in the details, of course,
and historically they had been very easy to get wrong.  And extra complexity
in that area doesn't make things better.


Yes, I absolutely agree it's difficult to get it right. Modulo 
undetected bugs I believe the Solaris implementation is race free, and 
if it isn't I think it's fair to say we'd consider that to be a bug.



, and the behaviour exhibited
doesn't conform to POSIX, for example requiring the use of
shutdown() on unconnected sockets because close() doesn't kick off
other threads accept()ing on the same fd.


Umm...  The old kernels (and even more - old userland) are not going to
disappear, so we are stuck with such uses of shutdown(2) anyway, POSIX or
no POSIX.


Yes, I understand the problem and in an earlier part of the discussion I 
said that I suspected that all that could really be done was to document 
the behaviour as changing it would break existing code. I still think 
that's probably the only workable option.



I'd be interested to hear
if there's a better and more performant way of handling the
situation that doesn't involve doing the sort of bookkeeping Casper
described.


So would a lot of other people.


:-)


To quote one of my colleague's favourite sayings: Performance is a
goal, correctness is a constraint.


Except that in this case "correctness" is the matter of rather obscure and
ill-documented areas in POSIX.  Don't get me wrong - this semantics isn't
inherently bad, but it's nowhere near being an absolute requirement.


I don't think it's *that* obscure; when I found the original shutdown()
problem Google turned up another occurrence pretty quickly -
https://lists.gnu.org/archive/html/libmicrohttpd/2011-09/msg00024.html.
If a fd is closed then allowing other uses of it to continue in other
threads seems incorrect to me, as in the dup2() scenario I posted.


--
Alan Burlison


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-22 Thread Eric Dumazet
On Thu, 2015-10-22 at 08:15 +0200, casper@oracle.com wrote:
> >It's been said that the current mechanisms in Linux & some BSD variants 
> >can be subject to races, and the behaviour exhibited doesn't conform to 
> >POSIX, for example requiring the use of shutdown() on unconnected 
> >sockets because close() doesn't kick off other threads accept()ing on 
> >the same fd. I'd be interested to hear if there's a better and more 
> >performant way of handling the situation that doesn't involve doing the 
> >sort of bookkeeping Casper described.
> 
> Of course, the implementation is now around 18 years old; clearly a lot of 
> things have changed since then.
> 
> In the particular case of Linux close() on a socket, surely it must be 
> possible to detect at close that it is a listening socket and that you are 
> about to close the last reference; the kernel could then do the shutdown() 
> all by itself.

We absolutely do not _want_ to do this just so that linux becomes slower
to the point Solaris can compete, or you guys can avoid some work.

close(fd) is very far from knowing a file is a 'listener' or even a
'socket' without extra cache line misses.

To force a close of an accept() or whatever blocking socket related
system call a shutdown() makes a lot of sense.

This would have zero additional overhead for the fast path.




Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-22 Thread Alan Burlison

On 22/10/2015 12:30, Eric Dumazet wrote:


We absolutely do not _want_ to do this just so that linux becomes slower
to the point Solaris can compete, or you guys can avoid some work.


Sentiments such as that really have no place in a discussion that's been
focussed primarily on the behaviour of interfaces, albeit with
digressions into the potential performance impacts. The discussion has
been cordial and I for one appreciate Al Viro's posts on the subject,
from which I've learned a lot. Can we please keep it that way? Thanks.



close(fd) is very far from knowing a file is a 'listener' or even a
'socket' without extra cache line misses.

To force a close of an accept() or whatever blocking socket related
system call a shutdown() makes a lot of sense.

This would have zero additional overhead for the fast path.


Yes, that would I believe be a significant improvement.

--
Alan Burlison


Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-22 Thread Eric Dumazet
On Thu, 2015-10-22 at 12:58 +0100, Alan Burlison wrote:
> On 22/10/2015 12:30, Eric Dumazet wrote:
> 
> > We absolutely do not _want_ to do this just so that linux becomes slower
> > to the point Solaris can compete, or you guys can avoid some work.
> 
> Sentiments such as that really have no place in a discussion that's been 
> focussed primarily on the behaviour of interfaces, albeit with 
> digressions into the potential performance impacts. The discussion has 
> been cordial and I for one appreciate Al Viro's posts on the subject, 
> from which I've learned a lot. Can we please keep it that way? Thanks.

Certainly not.

I am a major linux networking developer and won't accept linux being
hijacked by guys who never contributed to it, just so it meets their
unreasonable expectations.

We absolutely care about performance. And I do not care that you focus on
POSIX crap.




Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

2015-10-22 Thread Alan Burlison

On 22/10/15 19:16, Al Viro wrote:


Don't you have to do that anyway, to do anything useful with the file?


Dirtying the cacheline that contains struct file itself is different, but
that's not per-descriptor.


Yes, true enough.


Usually it's per-process, but any thread could ask for a private instance
to work with (and then spawn more threads sharing that instance - or getting
independent copies).


[snip]

Thanks again for the info, interesting. Solaris also does things along
the same lines. In fact we recently extended posix_spawn so it could be
used by the JVM to create subprocesses without wholesale copying, and
Casper has done some optimisation work on posix_spawn as well.



I don't think that it's possible to claim that a non-atomic dup2()
is POSIX-compliant.


Except that it's in the non-normative part of dup2(2), AFAICS.  I certainly
agree that it would be a standard lawyering beyond reason, but "not
possible to claim" is too optimistic.  Maybe I'm just more cynical...


Possibly so, and possibly justifiably so as well ;-)


ThreadA remains sat in accept on fd1 which is now a plain file, not
a socket.


No.  accept() is not an operation on file descriptors; it's an operation on
file descriptions (pardon for use of that terminology).  They are specified
by passing descriptors, but there's a hell of a difference between e.g.
dup() or fcntl(,F_SETFD,) (operations on descriptors) and read() or lseek()
(operations on descriptions).

Lookups are done once per syscall; the only exception is F_SETLK{,W}, where
we recheck that the descriptor is referring to the same thing before granting
the lock.


Yes, but if you believe that dup2() requires an implicit close() within 
it and that dup2() has to be atomic then expecting that other threads 
waiting on the same fd in accept() will get a notification seems 
reasonable enough.



Again, POSIX is still underspecifying the semantics of shared descriptor
tables; back when the bulk of it had been written there had been no way
to have a descriptor -> description mapping changed under a syscall by
action of another thread.  Hell, they still hadn't picked up on some things
that happened in the early 80s, let alone the early-to-mid 90s...


That's indisputably true - much of the POSIX behaviour predates threads 
and it shows, quite badly, in some places.



Linux and Solaris happen to cover these gaps differently; FreeBSD and
OpenBSD are probably closer to Linux variant, NetBSD - to Solaris one.


Yes, true enough.

--
Alan Burlison


  1   2   >