Re: dev_lock() contention for fdesc syscalls -- possible fix

2014-11-10 Thread Konstantin Belousov
On Mon, Nov 10, 2014 at 02:49:39AM +0100, Luigi Rizzo wrote:
 It was noticed that there is huge dev_lock() contention when multiple
 processes do a poll() even on independent file descriptors.
 
 Turns out that not just poll but most syscalls on file descriptors
 (as opposed to sockets) in sys/fs/devfs/devfs_vnops.c including
 devfs_poll_f(), devfs_ioctl_f() and read/write share the problem
 as they use the following pattern
 
 devfs_poll_f() {
 ...
 devfs_fp_check(fp, ...) --
 kern/kern_conf.c :: devvn_refthread(fp-f_vnode, ...) --
 dev_lock();
 dev = vp-v_rdev; // lock on vp ?
 ... check that dev != NULL
 atomic_add_long(dev-si_threadcount, 1);
 dev_unlock();
 dsw-d_poll()
 dev_relthread() --
 atomic_subtract_rel_long(dev-si_threadcount, 1);
}
 
 
 I believe that dev_lock() in devvn_refthread() is protecting
 dev = vp-v_rdev
 (the cdev entry 'dev' cannot go away for the reasons stated below).
 
 However looking at places where vp-v_rdev is modified, turns out
 that it only happens when holding VI_LOCK(vp) -- the places are
 devfs_allocv() and devfs_reclaim().
 There is one place in zfs where the vnode is created and v_rdev
 is set without holding a lock, so nobody can dereference it.
 
 As a consequence i believe that if in devfs_fp_check() we replace
 dev_lock() / dev_unlock() with VI_LOCK(vp) / VI_UNLOCK(vp),
 we make sure that we can safely dereference vp-v_rdev, and the
 cdev cannot go away because the vnode has a reference to it.
 The counter uses an atomic instruction (so the release is lockless)
Vnode reference, as well as cdev reference, which is obtained by
dev_ref(), do not provide any protection there.  v_rdev is only
coincidentally protected by the vnode interlock.

If you look at larger part of the code, you would note that dev mutex
is held even after v_rdev is dereferenced.  The real protection it
provides is against race with destroy_dev(l)(), which could invalidate
dev-so_devsw at any moment when either device thread reference is
not held, or dev mutex is not held.  So your patch breaks the
device destruction.

 
 This should be enough to remove the contention.
If you never calls destroy_dev() for the devfs node, you could use
MAKEDEV_ETERNAL flag for make_dev_credf(), which indicates that there
is no risk of race with destroy_dev() for the created node.

Usually, code which could be compiled in kernel statically but also
loaded as module, use MAKEDEV_ETERNAL_KLD flag, which takes care of
module needed to call destroy_dev().


 
 Anyone care to review/test the patch below ?
Yes, there are people who care.

 (dev_refthread() still has the single dev_lock() contention,
 i don't know how to address that yet)
 
   cheers
   luigi
 
 Index: /home/luigi/FreeBSD/head/sys/kern/kern_conf.c
 ===
 --- /home/luigi/FreeBSD/head/sys/kern/kern_conf.c (revision 273452)
 +++ /home/luigi/FreeBSD/head/sys/kern/kern_conf.c (working copy)
 @@ -224,10 +224,11 @@
   }
  
   csw = NULL;
 - dev_lock();
 + ASSERT_VI_UNLOCKED(vp, __func__);
 + VI_LOCK(vp); // dev_lock();
   dev = vp-v_rdev;
   if (dev == NULL) {
 - dev_unlock();
 + VI_UNLOCK(vp); // dev_unlock();
   return (NULL);
   }
   cdp = cdev2priv(dev);
 @@ -236,7 +237,7 @@
   if (csw != NULL)
   atomic_add_long(dev-si_threadcount, 1);
   }
 - dev_unlock();
 + VI_UNLOCK(vp); // dev_unlock();
   if (csw != NULL) {
   *devp = dev;
   *ref = 1;
 ___
 freebsd-current@freebsd.org mailing list
 http://lists.freebsd.org/mailman/listinfo/freebsd-current
 To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: dev_lock() contention for fdesc syscalls -- possible fix

2014-11-10 Thread Luigi Rizzo
On Mon, Nov 10, 2014 at 10:34:57AM +0200, Konstantin Belousov wrote:
 On Mon, Nov 10, 2014 at 02:49:39AM +0100, Luigi Rizzo wrote:
  It was noticed that there is huge dev_lock() contention when multiple
  processes do a poll() even on independent file descriptors.
  
  Turns out that not just poll but most syscalls on file descriptors
  (as opposed to sockets) in sys/fs/devfs/devfs_vnops.c including
  devfs_poll_f(), devfs_ioctl_f() and read/write share the problem
  as they use the following pattern
  
  devfs_poll_f() {
  ...
  devfs_fp_check(fp, ...) --
  kern/kern_conf.c :: devvn_refthread(fp-f_vnode, ...) --
  dev_lock();
  dev = vp-v_rdev; // lock on vp ?
  ... check that dev != NULL
  atomic_add_long(dev-si_threadcount, 1);
  dev_unlock();
  dsw-d_poll()
  dev_relthread() --
  atomic_subtract_rel_long(dev-si_threadcount, 1);
 }
  
  
  I believe that dev_lock() in devvn_refthread() is protecting
  dev = vp-v_rdev
  (the cdev entry 'dev' cannot go away for the reasons stated below).
  
  However looking at places where vp-v_rdev is modified, turns out
  that it only happens when holding VI_LOCK(vp) -- the places are
  devfs_allocv() and devfs_reclaim().
  There is one place in zfs where the vnode is created and v_rdev
  is set without holding a lock, so nobody can dereference it.
  
  As a consequence i believe that if in devfs_fp_check() we replace
  dev_lock() / dev_unlock() with VI_LOCK(vp) / VI_UNLOCK(vp),
  we make sure that we can safely dereference vp-v_rdev, and the
  cdev cannot go away because the vnode has a reference to it.
  The counter uses an atomic instruction (so the release is lockless)
 Vnode reference, as well as cdev reference, which is obtained by
 dev_ref(), do not provide any protection there.  v_rdev is only
 coincidentally protected by the vnode interlock.
 
 If you look at larger part of the code, you would note that dev mutex
 is held even after v_rdev is dereferenced.  The real protection it
 provides is against race with destroy_dev(l)(), which could invalidate
 dev-so_devsw at any moment when either device thread reference is
 not held, or dev mutex is not held.  So your patch breaks the
 device destruction.

I see. Thanks for the clarification.

Would it help to rewrite the part of devvn_refthread as follows:

devvn_refthread() {
// protect vp-v_rdev dereference and dev disappearing
VI_LOCK(vp);
dev = vi-v_rdev;
.. check that dev != NULL
// protect the race on dev-si_devsw
atomic_add_long(dev-si_threadcount, 1);
VI_UNLOCK(vp);
... appropriate memory barrier
csw = dev-si_devsw;
if (csw != NULL) {
// we won the race, even if destroy_devl() runs
// it will keep the cdevsw alive until si_threadcount goes to 0
*devp = dev;
*ref = 1;
} else {
// we lost
*ref = 0;
}
return csw;
}

i.e.  tentatively increment si_threadcount before dereferencing si_devsw
and then restoring it if we lose the race ?
It might be necessary to add a barrier in destroy_devl() between clearing
si_devsw and reading si_threadcount.

  
  This should be enough to remove the contention.
 If you never calls destroy_dev() for the devfs node, you could use
 MAKEDEV_ETERNAL flag for make_dev_credf(), which indicates that there
 is no risk of race with destroy_dev() for the created node.
 
 Usually, code which could be compiled in kernel statically but also
 loaded as module, use MAKEDEV_ETERNAL_KLD flag, which takes care of
 module needed to call destroy_dev().

that I suppose would mean that the module cannot be safely unloaded ?

cheers
luigi
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: dev_lock() contention for fdesc syscalls -- possible fix

2014-11-10 Thread Konstantin Belousov
On Mon, Nov 10, 2014 at 10:44:12AM +0100, Luigi Rizzo wrote:
 On Mon, Nov 10, 2014 at 10:34:57AM +0200, Konstantin Belousov wrote:
  On Mon, Nov 10, 2014 at 02:49:39AM +0100, Luigi Rizzo wrote:
   It was noticed that there is huge dev_lock() contention when multiple
   processes do a poll() even on independent file descriptors.
   
   Turns out that not just poll but most syscalls on file descriptors
   (as opposed to sockets) in sys/fs/devfs/devfs_vnops.c including
   devfs_poll_f(), devfs_ioctl_f() and read/write share the problem
   as they use the following pattern
   
   devfs_poll_f() {
   ...
   devfs_fp_check(fp, ...) --
   kern/kern_conf.c :: devvn_refthread(fp-f_vnode, ...) --
   dev_lock();
   dev = vp-v_rdev; // lock on vp ?
   ... check that dev != NULL
   atomic_add_long(dev-si_threadcount, 1);
   dev_unlock();
   dsw-d_poll()
   dev_relthread() --
   atomic_subtract_rel_long(dev-si_threadcount, 1);
  }
   
   
   I believe that dev_lock() in devvn_refthread() is protecting
   dev = vp-v_rdev
   (the cdev entry 'dev' cannot go away for the reasons stated below).
   
   However looking at places where vp-v_rdev is modified, turns out
   that it only happens when holding VI_LOCK(vp) -- the places are
   devfs_allocv() and devfs_reclaim().
   There is one place in zfs where the vnode is created and v_rdev
   is set without holding a lock, so nobody can dereference it.
   
   As a consequence i believe that if in devfs_fp_check() we replace
   dev_lock() / dev_unlock() with VI_LOCK(vp) / VI_UNLOCK(vp),
   we make sure that we can safely dereference vp-v_rdev, and the
   cdev cannot go away because the vnode has a reference to it.
   The counter uses an atomic instruction (so the release is lockless)
  Vnode reference, as well as cdev reference, which is obtained by
  dev_ref(), do not provide any protection there.  v_rdev is only
  coincidentally protected by the vnode interlock.
  
  If you look at larger part of the code, you would note that dev mutex
  is held even after v_rdev is dereferenced.  The real protection it
  provides is against race with destroy_dev(l)(), which could invalidate
  dev-so_devsw at any moment when either device thread reference is
  not held, or dev mutex is not held.  So your patch breaks the
  device destruction.
 
 I see. Thanks for the clarification.
 
 Would it help to rewrite the part of devvn_refthread as follows:
 
 devvn_refthread() {
   // protect vp-v_rdev dereference and dev disappearing
   VI_LOCK(vp);
   dev = vi-v_rdev;
   .. check that dev != NULL
   // protect the race on dev-si_devsw
   atomic_add_long(dev-si_threadcount, 1);
   VI_UNLOCK(vp);
   ... appropriate memory barrier
   csw = dev-si_devsw;
   if (csw != NULL) {
   // we won the race, even if destroy_devl() runs
   // it will keep the cdevsw alive until si_threadcount goes to 0
   *devp = dev;
   *ref = 1;
   } else {
   // we lost
   *ref = 0;
   }
   return csw;
 }
 
 i.e.  tentatively increment si_threadcount before dereferencing si_devsw
 and then restoring it if we lose the race ?
 It might be necessary to add a barrier in destroy_devl() between clearing
 si_devsw and reading si_threadcount.
Sure it is neccessary to add a barrier in destroy_devl() then.

From the first look, this might work, but how would you handle the
possibility that cdev memory is already freed when you do the increment ?
The vnode interlock does not protect against it; if you mean that
vnode reclamation already happen and vp-v_rdev must be cleared, this
might need some additional argumentation.

Note that the real patch is more involved, since both dev_refthread()
and devvn_refthread() should be handled. Also, there is some ugly hack
in sound/clone.c.

 
   
   This should be enough to remove the contention.
  If you never calls destroy_dev() for the devfs node, you could use
  MAKEDEV_ETERNAL flag for make_dev_credf(), which indicates that there
  is no risk of race with destroy_dev() for the created node.
  
  Usually, code which could be compiled in kernel statically but also
  loaded as module, use MAKEDEV_ETERNAL_KLD flag, which takes care of
  module needed to call destroy_dev().
 
 that I suppose would mean that the module cannot be safely unloaded ?

Well, you are not supposed to use MAKEDEV_ETERNAL for loadable modules
at all.  This is what MAKEDEV_ETERNAL_KLD ensures.
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


dev_lock() contention for fdesc syscalls -- possible fix

2014-11-09 Thread Luigi Rizzo
It was noticed that there is huge dev_lock() contention when multiple
processes do a poll() even on independent file descriptors.

Turns out that not just poll but most syscalls on file descriptors
(as opposed to sockets) in sys/fs/devfs/devfs_vnops.c including
devfs_poll_f(), devfs_ioctl_f() and read/write share the problem
as they use the following pattern

devfs_poll_f() {
...
devfs_fp_check(fp, ...) --
kern/kern_conf.c :: devvn_refthread(fp-f_vnode, ...) --
dev_lock();
dev = vp-v_rdev; // lock on vp ?
... check that dev != NULL
atomic_add_long(dev-si_threadcount, 1);
dev_unlock();
dsw-d_poll()
dev_relthread() --
atomic_subtract_rel_long(dev-si_threadcount, 1);
   }


I believe that dev_lock() in devvn_refthread() is protecting
dev = vp-v_rdev
(the cdev entry 'dev' cannot go away for the reasons stated below).

However looking at places where vp-v_rdev is modified, turns out
that it only happens when holding VI_LOCK(vp) -- the places are
devfs_allocv() and devfs_reclaim().
There is one place in zfs where the vnode is created and v_rdev
is set without holding a lock, so nobody can dereference it.

As a consequence i believe that if in devfs_fp_check() we replace
dev_lock() / dev_unlock() with VI_LOCK(vp) / VI_UNLOCK(vp),
we make sure that we can safely dereference vp-v_rdev, and the
cdev cannot go away because the vnode has a reference to it.
The counter uses an atomic instruction (so the release is lockless)

This should be enough to remove the contention.

Anyone care to review/test the patch below ?
(dev_refthread() still has the single dev_lock() contention,
i don't know how to address that yet)

cheers
luigi

Index: /home/luigi/FreeBSD/head/sys/kern/kern_conf.c
===
--- /home/luigi/FreeBSD/head/sys/kern/kern_conf.c   (revision 273452)
+++ /home/luigi/FreeBSD/head/sys/kern/kern_conf.c   (working copy)
@@ -224,10 +224,11 @@
}
 
csw = NULL;
-   dev_lock();
+   ASSERT_VI_UNLOCKED(vp, __func__);
+   VI_LOCK(vp); // dev_lock();
dev = vp-v_rdev;
if (dev == NULL) {
-   dev_unlock();
+   VI_UNLOCK(vp); // dev_unlock();
return (NULL);
}
cdp = cdev2priv(dev);
@@ -236,7 +237,7 @@
if (csw != NULL)
atomic_add_long(dev-si_threadcount, 1);
}
-   dev_unlock();
+   VI_UNLOCK(vp); // dev_unlock();
if (csw != NULL) {
*devp = dev;
*ref = 1;
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org