Re: dev_lock() contention for fdesc syscalls -- possible fix
On Mon, Nov 10, 2014 at 10:44:12AM +0100, Luigi Rizzo wrote: > On Mon, Nov 10, 2014 at 10:34:57AM +0200, Konstantin Belousov wrote: > > On Mon, Nov 10, 2014 at 02:49:39AM +0100, Luigi Rizzo wrote: > > > It was noticed that there is huge dev_lock() contention when multiple > > > processes do a poll() even on independent file descriptors. > > > > > > Turns out that not just poll but most syscalls on file descriptors > > > (as opposed to sockets) in sys/fs/devfs/devfs_vnops.c including > > > devfs_poll_f(), devfs_ioctl_f() and read/write share the problem > > > as they use the following pattern > > > > > > devfs_poll_f() { > > > ... > > > devfs_fp_check(fp, ...) --> > > > kern/kern_conf.c :: devvn_refthread(fp->f_vnode, ...) --> > > > dev_lock(); > > > dev = vp->v_rdev; // lock on vp ? > > > ... check that dev != NULL > > > atomic_add_long(&dev->si_threadcount, 1); > > > dev_unlock(); > > > dsw->d_poll() > > > dev_relthread() --> > > > atomic_subtract_rel_long(&dev->si_threadcount, 1); > > >} > > > > > > > > > I believe that dev_lock() in devvn_refthread() is protecting > > > dev = vp->v_rdev > > > (the cdev entry 'dev' cannot go away for the reasons stated below). > > > > > > However looking at places where vp->v_rdev is modified, turns out > > > that it only happens when holding VI_LOCK(vp) -- the places are > > > devfs_allocv() and devfs_reclaim(). > > > There is one place in zfs where the vnode is created and v_rdev > > > is set without holding a lock, so nobody can dereference it. > > > > > > As a consequence i believe that if in devfs_fp_check() we replace > > > dev_lock() / dev_unlock() with VI_LOCK(vp) / VI_UNLOCK(vp), > > > we make sure that we can safely dereference vp->v_rdev, and the > > > cdev cannot go away because the vnode has a reference to it. > > > The counter uses an atomic instruction (so the release is lockless) > > Vnode reference, as well as cdev reference, which is obtained by > > dev_ref(), do not provide any protection there. v_rdev is only > > coincidentally protected by the vnode interlock. > > > > If you look at larger part of the code, you would note that dev mutex > > is held even after v_rdev is dereferenced. The real protection it > > provides is against race with destroy_dev(l)(), which could invalidate > > dev->so_devsw at any moment when either device thread reference is > > not held, or dev mutex is not held. So your patch breaks the > > device destruction. > > I see. Thanks for the clarification. > > Would it help to rewrite the part of devvn_refthread as follows: > > devvn_refthread() { > // protect vp->v_rdev dereference and dev disappearing > VI_LOCK(vp); > dev = vi->v_rdev; > .. check that dev != NULL > // protect the race on dev->si_devsw > atomic_add_long(&dev->si_threadcount, 1); > VI_UNLOCK(vp); > ... appropriate memory barrier > csw = dev->si_devsw; > if (csw != NULL) { > // we won the race, even if destroy_devl() runs > // it will keep the cdevsw alive until si_threadcount goes to 0 > *devp = dev; > *ref = 1; > } else { > // we lost > *ref = 0; > } > return csw; > } > > i.e. tentatively increment si_threadcount before dereferencing si_devsw > and then restoring it if we lose the race ? > It might be necessary to add a barrier in destroy_devl() between clearing > si_devsw and reading si_threadcount. Sure it is neccessary to add a barrier in destroy_devl() then. >From the first look, this might work, but how would you handle the possibility that cdev memory is already freed when you do the increment ? The vnode interlock does not protect against it; if you mean that vnode reclamation already happen and vp->v_rdev must be cleared, this might need some additional argumentation. Note that the real patch is more involved, since both dev_refthread() and devvn_refthread() should be handled. Also, there is some ugly hack in sound/clone.c. > > > > > > > This should be enough to remove the contention. > > If you never calls destroy_dev() for the devfs node, you could use > > MAKEDEV_ETERNAL flag for make_dev_credf(), which indicates that there > > is no risk of race with destroy_dev() for the created node. > > > > Usually, code which could be compiled in kernel statically but also > > loaded as module, use MAKEDEV_ETERNAL_KLD flag, which takes care of > > module needed to call destroy_dev(). > > that I suppose would mean that the module cannot be safely unloaded ? Well, you are not supposed to use MAKEDEV_ETERNAL for loadable modules at all. This is what MAKEDEV_ETERNAL_KLD ensures. ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To un
Re: dev_lock() contention for fdesc syscalls -- possible fix
On Mon, Nov 10, 2014 at 10:34:57AM +0200, Konstantin Belousov wrote: > On Mon, Nov 10, 2014 at 02:49:39AM +0100, Luigi Rizzo wrote: > > It was noticed that there is huge dev_lock() contention when multiple > > processes do a poll() even on independent file descriptors. > > > > Turns out that not just poll but most syscalls on file descriptors > > (as opposed to sockets) in sys/fs/devfs/devfs_vnops.c including > > devfs_poll_f(), devfs_ioctl_f() and read/write share the problem > > as they use the following pattern > > > > devfs_poll_f() { > > ... > > devfs_fp_check(fp, ...) --> > > kern/kern_conf.c :: devvn_refthread(fp->f_vnode, ...) --> > > dev_lock(); > > dev = vp->v_rdev; // lock on vp ? > > ... check that dev != NULL > > atomic_add_long(&dev->si_threadcount, 1); > > dev_unlock(); > > dsw->d_poll() > > dev_relthread() --> > > atomic_subtract_rel_long(&dev->si_threadcount, 1); > >} > > > > > > I believe that dev_lock() in devvn_refthread() is protecting > > dev = vp->v_rdev > > (the cdev entry 'dev' cannot go away for the reasons stated below). > > > > However looking at places where vp->v_rdev is modified, turns out > > that it only happens when holding VI_LOCK(vp) -- the places are > > devfs_allocv() and devfs_reclaim(). > > There is one place in zfs where the vnode is created and v_rdev > > is set without holding a lock, so nobody can dereference it. > > > > As a consequence i believe that if in devfs_fp_check() we replace > > dev_lock() / dev_unlock() with VI_LOCK(vp) / VI_UNLOCK(vp), > > we make sure that we can safely dereference vp->v_rdev, and the > > cdev cannot go away because the vnode has a reference to it. > > The counter uses an atomic instruction (so the release is lockless) > Vnode reference, as well as cdev reference, which is obtained by > dev_ref(), do not provide any protection there. v_rdev is only > coincidentally protected by the vnode interlock. > > If you look at larger part of the code, you would note that dev mutex > is held even after v_rdev is dereferenced. The real protection it > provides is against race with destroy_dev(l)(), which could invalidate > dev->so_devsw at any moment when either device thread reference is > not held, or dev mutex is not held. So your patch breaks the > device destruction. I see. Thanks for the clarification. Would it help to rewrite the part of devvn_refthread as follows: devvn_refthread() { // protect vp->v_rdev dereference and dev disappearing VI_LOCK(vp); dev = vi->v_rdev; .. check that dev != NULL // protect the race on dev->si_devsw atomic_add_long(&dev->si_threadcount, 1); VI_UNLOCK(vp); ... appropriate memory barrier csw = dev->si_devsw; if (csw != NULL) { // we won the race, even if destroy_devl() runs // it will keep the cdevsw alive until si_threadcount goes to 0 *devp = dev; *ref = 1; } else { // we lost *ref = 0; } return csw; } i.e. tentatively increment si_threadcount before dereferencing si_devsw and then restoring it if we lose the race ? It might be necessary to add a barrier in destroy_devl() between clearing si_devsw and reading si_threadcount. > > > > This should be enough to remove the contention. > If you never calls destroy_dev() for the devfs node, you could use > MAKEDEV_ETERNAL flag for make_dev_credf(), which indicates that there > is no risk of race with destroy_dev() for the created node. > > Usually, code which could be compiled in kernel statically but also > loaded as module, use MAKEDEV_ETERNAL_KLD flag, which takes care of > module needed to call destroy_dev(). that I suppose would mean that the module cannot be safely unloaded ? cheers luigi ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: dev_lock() contention for fdesc syscalls -- possible fix
On Mon, Nov 10, 2014 at 02:49:39AM +0100, Luigi Rizzo wrote: > It was noticed that there is huge dev_lock() contention when multiple > processes do a poll() even on independent file descriptors. > > Turns out that not just poll but most syscalls on file descriptors > (as opposed to sockets) in sys/fs/devfs/devfs_vnops.c including > devfs_poll_f(), devfs_ioctl_f() and read/write share the problem > as they use the following pattern > > devfs_poll_f() { > ... > devfs_fp_check(fp, ...) --> > kern/kern_conf.c :: devvn_refthread(fp->f_vnode, ...) --> > dev_lock(); > dev = vp->v_rdev; // lock on vp ? > ... check that dev != NULL > atomic_add_long(&dev->si_threadcount, 1); > dev_unlock(); > dsw->d_poll() > dev_relthread() --> > atomic_subtract_rel_long(&dev->si_threadcount, 1); >} > > > I believe that dev_lock() in devvn_refthread() is protecting > dev = vp->v_rdev > (the cdev entry 'dev' cannot go away for the reasons stated below). > > However looking at places where vp->v_rdev is modified, turns out > that it only happens when holding VI_LOCK(vp) -- the places are > devfs_allocv() and devfs_reclaim(). > There is one place in zfs where the vnode is created and v_rdev > is set without holding a lock, so nobody can dereference it. > > As a consequence i believe that if in devfs_fp_check() we replace > dev_lock() / dev_unlock() with VI_LOCK(vp) / VI_UNLOCK(vp), > we make sure that we can safely dereference vp->v_rdev, and the > cdev cannot go away because the vnode has a reference to it. > The counter uses an atomic instruction (so the release is lockless) Vnode reference, as well as cdev reference, which is obtained by dev_ref(), do not provide any protection there. v_rdev is only coincidentally protected by the vnode interlock. If you look at larger part of the code, you would note that dev mutex is held even after v_rdev is dereferenced. The real protection it provides is against race with destroy_dev(l)(), which could invalidate dev->so_devsw at any moment when either device thread reference is not held, or dev mutex is not held. So your patch breaks the device destruction. > > This should be enough to remove the contention. If you never calls destroy_dev() for the devfs node, you could use MAKEDEV_ETERNAL flag for make_dev_credf(), which indicates that there is no risk of race with destroy_dev() for the created node. Usually, code which could be compiled in kernel statically but also loaded as module, use MAKEDEV_ETERNAL_KLD flag, which takes care of module needed to call destroy_dev(). > > Anyone care to review/test the patch below ? Yes, there are people who care. > (dev_refthread() still has the single dev_lock() contention, > i don't know how to address that yet) > > cheers > luigi > > Index: /home/luigi/FreeBSD/head/sys/kern/kern_conf.c > === > --- /home/luigi/FreeBSD/head/sys/kern/kern_conf.c (revision 273452) > +++ /home/luigi/FreeBSD/head/sys/kern/kern_conf.c (working copy) > @@ -224,10 +224,11 @@ > } > > csw = NULL; > - dev_lock(); > + ASSERT_VI_UNLOCKED(vp, __func__); > + VI_LOCK(vp); // dev_lock(); > dev = vp->v_rdev; > if (dev == NULL) { > - dev_unlock(); > + VI_UNLOCK(vp); // dev_unlock(); > return (NULL); > } > cdp = cdev2priv(dev); > @@ -236,7 +237,7 @@ > if (csw != NULL) > atomic_add_long(&dev->si_threadcount, 1); > } > - dev_unlock(); > + VI_UNLOCK(vp); // dev_unlock(); > if (csw != NULL) { > *devp = dev; > *ref = 1; > ___ > freebsd-current@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-current > To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org" ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
dev_lock() contention for fdesc syscalls -- possible fix
It was noticed that there is huge dev_lock() contention when multiple processes do a poll() even on independent file descriptors. Turns out that not just poll but most syscalls on file descriptors (as opposed to sockets) in sys/fs/devfs/devfs_vnops.c including devfs_poll_f(), devfs_ioctl_f() and read/write share the problem as they use the following pattern devfs_poll_f() { ... devfs_fp_check(fp, ...) --> kern/kern_conf.c :: devvn_refthread(fp->f_vnode, ...) --> dev_lock(); dev = vp->v_rdev; // lock on vp ? ... check that dev != NULL atomic_add_long(&dev->si_threadcount, 1); dev_unlock(); dsw->d_poll() dev_relthread() --> atomic_subtract_rel_long(&dev->si_threadcount, 1); } I believe that dev_lock() in devvn_refthread() is protecting dev = vp->v_rdev (the cdev entry 'dev' cannot go away for the reasons stated below). However looking at places where vp->v_rdev is modified, turns out that it only happens when holding VI_LOCK(vp) -- the places are devfs_allocv() and devfs_reclaim(). There is one place in zfs where the vnode is created and v_rdev is set without holding a lock, so nobody can dereference it. As a consequence i believe that if in devfs_fp_check() we replace dev_lock() / dev_unlock() with VI_LOCK(vp) / VI_UNLOCK(vp), we make sure that we can safely dereference vp->v_rdev, and the cdev cannot go away because the vnode has a reference to it. The counter uses an atomic instruction (so the release is lockless) This should be enough to remove the contention. Anyone care to review/test the patch below ? (dev_refthread() still has the single dev_lock() contention, i don't know how to address that yet) cheers luigi Index: /home/luigi/FreeBSD/head/sys/kern/kern_conf.c === --- /home/luigi/FreeBSD/head/sys/kern/kern_conf.c (revision 273452) +++ /home/luigi/FreeBSD/head/sys/kern/kern_conf.c (working copy) @@ -224,10 +224,11 @@ } csw = NULL; - dev_lock(); + ASSERT_VI_UNLOCKED(vp, __func__); + VI_LOCK(vp); // dev_lock(); dev = vp->v_rdev; if (dev == NULL) { - dev_unlock(); + VI_UNLOCK(vp); // dev_unlock(); return (NULL); } cdp = cdev2priv(dev); @@ -236,7 +237,7 @@ if (csw != NULL) atomic_add_long(&dev->si_threadcount, 1); } - dev_unlock(); + VI_UNLOCK(vp); // dev_unlock(); if (csw != NULL) { *devp = dev; *ref = 1; ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"