Re: dev_lock() contention for fdesc syscalls -- possible fix
On Mon, Nov 10, 2014 at 02:49:39AM +0100, Luigi Rizzo wrote: It was noticed that there is huge dev_lock() contention when multiple processes do a poll() even on independent file descriptors. Turns out that not just poll but most syscalls on file descriptors (as opposed to sockets) in sys/fs/devfs/devfs_vnops.c including devfs_poll_f(), devfs_ioctl_f() and read/write share the problem as they use the following pattern devfs_poll_f() { ... devfs_fp_check(fp, ...) -- kern/kern_conf.c :: devvn_refthread(fp-f_vnode, ...) -- dev_lock(); dev = vp-v_rdev; // lock on vp ? ... check that dev != NULL atomic_add_long(dev-si_threadcount, 1); dev_unlock(); dsw-d_poll() dev_relthread() -- atomic_subtract_rel_long(dev-si_threadcount, 1); } I believe that dev_lock() in devvn_refthread() is protecting dev = vp-v_rdev (the cdev entry 'dev' cannot go away for the reasons stated below). However looking at places where vp-v_rdev is modified, turns out that it only happens when holding VI_LOCK(vp) -- the places are devfs_allocv() and devfs_reclaim(). There is one place in zfs where the vnode is created and v_rdev is set without holding a lock, so nobody can dereference it. As a consequence i believe that if in devfs_fp_check() we replace dev_lock() / dev_unlock() with VI_LOCK(vp) / VI_UNLOCK(vp), we make sure that we can safely dereference vp-v_rdev, and the cdev cannot go away because the vnode has a reference to it. The counter uses an atomic instruction (so the release is lockless) Vnode reference, as well as cdev reference, which is obtained by dev_ref(), do not provide any protection there. v_rdev is only coincidentally protected by the vnode interlock. If you look at larger part of the code, you would note that dev mutex is held even after v_rdev is dereferenced. The real protection it provides is against race with destroy_dev(l)(), which could invalidate dev-so_devsw at any moment when either device thread reference is not held, or dev mutex is not held. So your patch breaks the device destruction. This should be enough to remove the contention. If you never calls destroy_dev() for the devfs node, you could use MAKEDEV_ETERNAL flag for make_dev_credf(), which indicates that there is no risk of race with destroy_dev() for the created node. Usually, code which could be compiled in kernel statically but also loaded as module, use MAKEDEV_ETERNAL_KLD flag, which takes care of module needed to call destroy_dev(). Anyone care to review/test the patch below ? Yes, there are people who care. (dev_refthread() still has the single dev_lock() contention, i don't know how to address that yet) cheers luigi Index: /home/luigi/FreeBSD/head/sys/kern/kern_conf.c === --- /home/luigi/FreeBSD/head/sys/kern/kern_conf.c (revision 273452) +++ /home/luigi/FreeBSD/head/sys/kern/kern_conf.c (working copy) @@ -224,10 +224,11 @@ } csw = NULL; - dev_lock(); + ASSERT_VI_UNLOCKED(vp, __func__); + VI_LOCK(vp); // dev_lock(); dev = vp-v_rdev; if (dev == NULL) { - dev_unlock(); + VI_UNLOCK(vp); // dev_unlock(); return (NULL); } cdp = cdev2priv(dev); @@ -236,7 +237,7 @@ if (csw != NULL) atomic_add_long(dev-si_threadcount, 1); } - dev_unlock(); + VI_UNLOCK(vp); // dev_unlock(); if (csw != NULL) { *devp = dev; *ref = 1; ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: dev_lock() contention for fdesc syscalls -- possible fix
On Mon, Nov 10, 2014 at 10:34:57AM +0200, Konstantin Belousov wrote: On Mon, Nov 10, 2014 at 02:49:39AM +0100, Luigi Rizzo wrote: It was noticed that there is huge dev_lock() contention when multiple processes do a poll() even on independent file descriptors. Turns out that not just poll but most syscalls on file descriptors (as opposed to sockets) in sys/fs/devfs/devfs_vnops.c including devfs_poll_f(), devfs_ioctl_f() and read/write share the problem as they use the following pattern devfs_poll_f() { ... devfs_fp_check(fp, ...) -- kern/kern_conf.c :: devvn_refthread(fp-f_vnode, ...) -- dev_lock(); dev = vp-v_rdev; // lock on vp ? ... check that dev != NULL atomic_add_long(dev-si_threadcount, 1); dev_unlock(); dsw-d_poll() dev_relthread() -- atomic_subtract_rel_long(dev-si_threadcount, 1); } I believe that dev_lock() in devvn_refthread() is protecting dev = vp-v_rdev (the cdev entry 'dev' cannot go away for the reasons stated below). However looking at places where vp-v_rdev is modified, turns out that it only happens when holding VI_LOCK(vp) -- the places are devfs_allocv() and devfs_reclaim(). There is one place in zfs where the vnode is created and v_rdev is set without holding a lock, so nobody can dereference it. As a consequence i believe that if in devfs_fp_check() we replace dev_lock() / dev_unlock() with VI_LOCK(vp) / VI_UNLOCK(vp), we make sure that we can safely dereference vp-v_rdev, and the cdev cannot go away because the vnode has a reference to it. The counter uses an atomic instruction (so the release is lockless) Vnode reference, as well as cdev reference, which is obtained by dev_ref(), do not provide any protection there. v_rdev is only coincidentally protected by the vnode interlock. If you look at larger part of the code, you would note that dev mutex is held even after v_rdev is dereferenced. The real protection it provides is against race with destroy_dev(l)(), which could invalidate dev-so_devsw at any moment when either device thread reference is not held, or dev mutex is not held. So your patch breaks the device destruction. I see. Thanks for the clarification. Would it help to rewrite the part of devvn_refthread as follows: devvn_refthread() { // protect vp-v_rdev dereference and dev disappearing VI_LOCK(vp); dev = vi-v_rdev; .. check that dev != NULL // protect the race on dev-si_devsw atomic_add_long(dev-si_threadcount, 1); VI_UNLOCK(vp); ... appropriate memory barrier csw = dev-si_devsw; if (csw != NULL) { // we won the race, even if destroy_devl() runs // it will keep the cdevsw alive until si_threadcount goes to 0 *devp = dev; *ref = 1; } else { // we lost *ref = 0; } return csw; } i.e. tentatively increment si_threadcount before dereferencing si_devsw and then restoring it if we lose the race ? It might be necessary to add a barrier in destroy_devl() between clearing si_devsw and reading si_threadcount. This should be enough to remove the contention. If you never calls destroy_dev() for the devfs node, you could use MAKEDEV_ETERNAL flag for make_dev_credf(), which indicates that there is no risk of race with destroy_dev() for the created node. Usually, code which could be compiled in kernel statically but also loaded as module, use MAKEDEV_ETERNAL_KLD flag, which takes care of module needed to call destroy_dev(). that I suppose would mean that the module cannot be safely unloaded ? cheers luigi ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: dev_lock() contention for fdesc syscalls -- possible fix
On Mon, Nov 10, 2014 at 10:44:12AM +0100, Luigi Rizzo wrote: On Mon, Nov 10, 2014 at 10:34:57AM +0200, Konstantin Belousov wrote: On Mon, Nov 10, 2014 at 02:49:39AM +0100, Luigi Rizzo wrote: It was noticed that there is huge dev_lock() contention when multiple processes do a poll() even on independent file descriptors. Turns out that not just poll but most syscalls on file descriptors (as opposed to sockets) in sys/fs/devfs/devfs_vnops.c including devfs_poll_f(), devfs_ioctl_f() and read/write share the problem as they use the following pattern devfs_poll_f() { ... devfs_fp_check(fp, ...) -- kern/kern_conf.c :: devvn_refthread(fp-f_vnode, ...) -- dev_lock(); dev = vp-v_rdev; // lock on vp ? ... check that dev != NULL atomic_add_long(dev-si_threadcount, 1); dev_unlock(); dsw-d_poll() dev_relthread() -- atomic_subtract_rel_long(dev-si_threadcount, 1); } I believe that dev_lock() in devvn_refthread() is protecting dev = vp-v_rdev (the cdev entry 'dev' cannot go away for the reasons stated below). However looking at places where vp-v_rdev is modified, turns out that it only happens when holding VI_LOCK(vp) -- the places are devfs_allocv() and devfs_reclaim(). There is one place in zfs where the vnode is created and v_rdev is set without holding a lock, so nobody can dereference it. As a consequence i believe that if in devfs_fp_check() we replace dev_lock() / dev_unlock() with VI_LOCK(vp) / VI_UNLOCK(vp), we make sure that we can safely dereference vp-v_rdev, and the cdev cannot go away because the vnode has a reference to it. The counter uses an atomic instruction (so the release is lockless) Vnode reference, as well as cdev reference, which is obtained by dev_ref(), do not provide any protection there. v_rdev is only coincidentally protected by the vnode interlock. If you look at larger part of the code, you would note that dev mutex is held even after v_rdev is dereferenced. The real protection it provides is against race with destroy_dev(l)(), which could invalidate dev-so_devsw at any moment when either device thread reference is not held, or dev mutex is not held. So your patch breaks the device destruction. I see. Thanks for the clarification. Would it help to rewrite the part of devvn_refthread as follows: devvn_refthread() { // protect vp-v_rdev dereference and dev disappearing VI_LOCK(vp); dev = vi-v_rdev; .. check that dev != NULL // protect the race on dev-si_devsw atomic_add_long(dev-si_threadcount, 1); VI_UNLOCK(vp); ... appropriate memory barrier csw = dev-si_devsw; if (csw != NULL) { // we won the race, even if destroy_devl() runs // it will keep the cdevsw alive until si_threadcount goes to 0 *devp = dev; *ref = 1; } else { // we lost *ref = 0; } return csw; } i.e. tentatively increment si_threadcount before dereferencing si_devsw and then restoring it if we lose the race ? It might be necessary to add a barrier in destroy_devl() between clearing si_devsw and reading si_threadcount. Sure it is neccessary to add a barrier in destroy_devl() then. From the first look, this might work, but how would you handle the possibility that cdev memory is already freed when you do the increment ? The vnode interlock does not protect against it; if you mean that vnode reclamation already happen and vp-v_rdev must be cleared, this might need some additional argumentation. Note that the real patch is more involved, since both dev_refthread() and devvn_refthread() should be handled. Also, there is some ugly hack in sound/clone.c. This should be enough to remove the contention. If you never calls destroy_dev() for the devfs node, you could use MAKEDEV_ETERNAL flag for make_dev_credf(), which indicates that there is no risk of race with destroy_dev() for the created node. Usually, code which could be compiled in kernel statically but also loaded as module, use MAKEDEV_ETERNAL_KLD flag, which takes care of module needed to call destroy_dev(). that I suppose would mean that the module cannot be safely unloaded ? Well, you are not supposed to use MAKEDEV_ETERNAL for loadable modules at all. This is what MAKEDEV_ETERNAL_KLD ensures. ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
dev_lock() contention for fdesc syscalls -- possible fix
It was noticed that there is huge dev_lock() contention when multiple processes do a poll() even on independent file descriptors. Turns out that not just poll but most syscalls on file descriptors (as opposed to sockets) in sys/fs/devfs/devfs_vnops.c including devfs_poll_f(), devfs_ioctl_f() and read/write share the problem as they use the following pattern devfs_poll_f() { ... devfs_fp_check(fp, ...) -- kern/kern_conf.c :: devvn_refthread(fp-f_vnode, ...) -- dev_lock(); dev = vp-v_rdev; // lock on vp ? ... check that dev != NULL atomic_add_long(dev-si_threadcount, 1); dev_unlock(); dsw-d_poll() dev_relthread() -- atomic_subtract_rel_long(dev-si_threadcount, 1); } I believe that dev_lock() in devvn_refthread() is protecting dev = vp-v_rdev (the cdev entry 'dev' cannot go away for the reasons stated below). However looking at places where vp-v_rdev is modified, turns out that it only happens when holding VI_LOCK(vp) -- the places are devfs_allocv() and devfs_reclaim(). There is one place in zfs where the vnode is created and v_rdev is set without holding a lock, so nobody can dereference it. As a consequence i believe that if in devfs_fp_check() we replace dev_lock() / dev_unlock() with VI_LOCK(vp) / VI_UNLOCK(vp), we make sure that we can safely dereference vp-v_rdev, and the cdev cannot go away because the vnode has a reference to it. The counter uses an atomic instruction (so the release is lockless) This should be enough to remove the contention. Anyone care to review/test the patch below ? (dev_refthread() still has the single dev_lock() contention, i don't know how to address that yet) cheers luigi Index: /home/luigi/FreeBSD/head/sys/kern/kern_conf.c === --- /home/luigi/FreeBSD/head/sys/kern/kern_conf.c (revision 273452) +++ /home/luigi/FreeBSD/head/sys/kern/kern_conf.c (working copy) @@ -224,10 +224,11 @@ } csw = NULL; - dev_lock(); + ASSERT_VI_UNLOCKED(vp, __func__); + VI_LOCK(vp); // dev_lock(); dev = vp-v_rdev; if (dev == NULL) { - dev_unlock(); + VI_UNLOCK(vp); // dev_unlock(); return (NULL); } cdp = cdev2priv(dev); @@ -236,7 +237,7 @@ if (csw != NULL) atomic_add_long(dev-si_threadcount, 1); } - dev_unlock(); + VI_UNLOCK(vp); // dev_unlock(); if (csw != NULL) { *devp = dev; *ref = 1; ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org