Bryan, I am happy to hear that the diff looks good. Yes, my e-mail address [email protected] can be used.
Thanks again for looking into this issue.

-Youzhong

On Thu, May 28, 2015 at 1:54 PM, Bryan Cantrill <[email protected]> wrote:

> Youzhong,
>
> Sorry for the delay on this. Yes, I think Garrett's suggestion was an
> excellent one -- and those diffs look good! If you don't mind, since
> you've done the work here, I'd like to push this to SmartOS with you as
> the author. What e-mail address do you prefer for that? And thank you
> for doing this work -- and to Garrett for the suggestion!
>
> - Bryan
>
> On Thu, May 28, 2015 at 6:12 AM, Youzhong Yang <[email protected]> wrote:
>
>> Hi Bryan,
>>
>> I tried the following patch and it works well so far. Would you please
>> take a look? Does it look good?
>>
>> By the way, with this patch, the inotify_clean() stack looks like:
>>
>> zfs`dmu_tx_delay+0x147
>> zfs`dmu_tx_wait+0x8e
>> zfs`dmu_tx_assign+0x55
>> zfs`dmu_free_long_range_impl+0x9f
>> zfs`dmu_free_long_range+0x62
>> zfs`zfs_rmnode+0x52
>> zfs`zfs_zinactive+0xe8
>> zfs`zfs_inactive+0x75
>> genunix`fop_inactive+0x76
>> genunix`vn_rele+0x82
>> inotify`inotify_clean+0xfc
>> genunix`periodic_execute+0xc9
>> genunix`taskq_thread+0x2d0
>> unix`thread_start+0x8
>>
>> Thanks,
>>
>> -Youzhong
>>
>> diff --git a/usr/src/uts/common/io/inotify.c b/usr/src/uts/common/io/inotify.c
>> (https://github.com/joyent/illumos-joyent/blob/master/usr/src/uts/common/io/inotify.c)
>> index c5c7b37..594b727 100644
>> --- a/usr/src/uts/common/io/inotify.c
>> +++ b/usr/src/uts/common/io/inotify.c
>> @@ -99,7 +99,7 @@ struct inotify_state {
>>  	pollhead_t ins_pollhd;			/* poll head */
>>  	kcondvar_t ins_cv;			/* condvar for reading */
>>  	list_t ins_orphans;			/* orphan list */
>> -	cyclic_id_t ins_cleaner;		/* cyclic for cleaning */
>> +	ddi_periodic_t ins_cleaner;		/* cyclic for cleaning */
>>  	inotify_watch_t *ins_zombies;		/* zombie watch list */
>>  	cred_t *ins_cred;			/* creator's credentials */
>>  	inotify_state_t *ins_next;		/* next state on global list */
>> @@ -1042,8 +1042,6 @@ inotify_open(dev_t *devp, int flag, int otyp, cred_t *cred_p)
>>  	major_t major = getemajor(*devp);
>>  	minor_t minor = getminor(*devp);
>>  	int instances = 0;
>> -	cyc_handler_t hdlr;
>> -	cyc_time_t when;
>>  	char c[64];
>>
>>  	if (minor != INOTIFYMNRN_INOTIFY)
>> @@ -1100,17 +1098,7 @@ inotify_open(dev_t *devp, int flag, int otyp, cred_t *cred_p)
>>
>>  	mutex_exit(&inotify_lock);
>>
>> -	mutex_enter(&cpu_lock);
>> -
>> -	hdlr.cyh_func = inotify_clean;
>> -	hdlr.cyh_level = CY_LOW_LEVEL;
>> -	hdlr.cyh_arg = state;
>> -
>> -	when.cyt_when = 0;
>> -	when.cyt_interval = NANOSEC;
>> -
>> -	state->ins_cleaner = cyclic_add(&hdlr, &when);
>> -	mutex_exit(&cpu_lock);
>> +	state->ins_cleaner = ddi_periodic_add(inotify_clean, (void *)state, NANOSEC, DDI_IPL_0);
>>
>>  	return (0);
>>  }
>> @@ -1329,9 +1317,10 @@ inotify_close(dev_t dev, int flag, int otyp, cred_t *cred_p)
>>  		inotify_watch_destroy(watch);
>>  	}
>>
>> -	mutex_enter(&cpu_lock);
>> -	cyclic_remove(state->ins_cleaner);
>> -	mutex_exit(&cpu_lock);
>> +	if (state->ins_cleaner != NULL) {
>> +		ddi_periodic_delete(state->ins_cleaner);
>> +		state->ins_cleaner = NULL;
>> +	}
>>
>>  	mutex_enter(&inotify_lock);
>>
>> On Thu, May 21, 2015 at 9:23 PM, Garrett D'Amore <[email protected]> wrote:
>>
>>> A ddi_periodic would probably be simpler and cleaner than the
>>> taskq/cyclic combo. And I think you can define the interrupt level (or
>>> not) that you want to run it at.
>>>
>>> Sent from my iPhone
>>>
>>> On May 21, 2015, at 2:59 PM, Bryan Cantrill <[email protected]> wrote:
>>>
>>> Youzhong,
>>>
>>> Well, today I learned that there's an implicit dependency that I imagine
>>> would be news for most: ZFS doesn't allow for vn_rele()'s from low-level
>>> interrupt context. :| There are a couple of ways to fix this, but the
>>> simplest is to have inotify_clean() be in a task queue that's kicked off
>>> by a cyclic, which I can prototype relatively easily.
>>> Meanwhile, might it be possible for you to upload the dump to Manta
>>> and/or otherwise make it available so I can verify that this is indeed
>>> what's happening? And thank you (as always!) for your diligence in
>>> reproducing and reporting the issue!
>>>
>>> - Bryan
>>>
>>> On Thu, May 21, 2015 at 11:34 AM, Youzhong Yang <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> We're trying to use inotify for change notification in Samba 4. Most of
>>>> the time it works very well, but under certain conditions the server
>>>> can become unresponsive (to client NFS requests), due to some kind of
>>>> deadlock possibly caused by inotify_clean() running as a cyclic.
>>>>
>>>> Here are the steps to reproduce:
>>>>
>>>> 1. Suppose you have a zfs file system with plenty of space, say
>>>> /zpool/temp, created in the zpool named 'zpool'; download the attached
>>>> 2 C programs and compile them.
>>>>
>>>> 2. In terminal window #1, run the following dtrace command to monitor
>>>> the 'dirty_total':
>>>>
>>>> dtrace -n 'txg-syncing
>>>>     /((dsl_pool_t *)arg0)->dp_spa->spa_name == "zpool"/
>>>>     {
>>>>         printf("%4dMB (%2d) of %4dMB used",
>>>>             ((dsl_pool_t *)arg0)->dp_dirty_total / 1024 / 1024,
>>>>             ((dsl_pool_t *)arg0)->dp_dirty_total * 100 / `zfs_dirty_data_max,
>>>>             `zfs_dirty_data_max / 1024 / 1024);
>>>>     }'
>>>>
>>>> Replace the string "zpool" with your actual zpool name.
>>>>
>>>> 3. In terminal window #2, run the following command:
>>>>
>>>> inotify_stress_new -f /zpool/temp -n 4
>>>>
>>>> It forks 4 processes, each of which uses the inotify_* functions to
>>>> monitor the folder /zpool/temp and to create and remove some temporary
>>>> files in that folder.
>>>>
>>>> 4. In terminal window #3, run the following command:
>>>>
>>>> write-files -f /zpool/temp -n 50
>>>>
>>>> It forks 50 processes, each of which creates a subfolder under
>>>> /zpool/temp and keeps creating files of 1GB size.
>>>>
>>>> Depending on your machine's horsepower, you may want to adjust the
>>>> number of processes for the write-files program. Its purpose is
>>>> basically to make the zpool's dirty_total go beyond 60% of
>>>> zfs_dirty_data_max.
>>>>
>>>> Normally, within a few minutes, terminal window #1 will stop printing
>>>> anything, which means the issue is reproduced.
>>>>
>>>> Once the issue is reproduced, I run 'echo ::threadlist -v 10 | mdb -k'
>>>> to dump the kernel threads. So far there are two patterns I have
>>>> observed:
>>>>
>>>> Case No. 1: inotify_clean() stuck at cv_timedwait_hires()
>>>> ---------------------------
>>>> swtch+0x82()
>>>> cv_timedwait_hires+0xec(ffffff01e852ae2e, ffffff01e852ae30, 1047be67f9694, 186a0, 3)
>>>> dmu_tx_delay+0x147(ffffff8114229c00, d80448bb)
>>>> dmu_tx_wait+0x8e(ffffff8114229c00)
>>>> dmu_tx_assign+0x55(ffffff8114229c00, 1)
>>>> dmu_free_long_range_impl+0x9f(ffffff443a1943c0, ffffff4be51507e8, 0, ffffffffffffffff)
>>>> dmu_free_long_range+0x62(ffffff443a1943c0, 179294d, 0, ffffffffffffffff)
>>>> zfs_rmnode+0x52(ffffff44d20891f0)
>>>> zfs_zinactive+0xe8(ffffff44d20891f0)
>>>> zfs_inactive+0x75(ffffff450b8e8500, 0, 0)
>>>> fop_inactive+0x76(ffffff450b8e8500, 0, 0)
>>>> vn_rele+0x82(ffffff450b8e8500)
>>>> inotify_clean+0xfc(ffffff43abf54810)
>>>> cyclic_softint+0xf3(ffffff42eb589a80, 0)
>>>> cbe_low_level+0x14()
>>>> av_dispatch_softvect+0x78(2)
>>>> apix_dispatch_softint+0x35(0, 0)
>>>> switch_sp_and_call+0x13()
>>>> 0xffffff42eb589a80()
>>>> apix_do_interrupt+0xfe(ffffff01e84faad0, 0)
>>>> _interrupt+0xba()
>>>> i86_mwait+0xd()
>>>> cpu_idle_mwait+0x109()
>>>> idle+0xa7()
>>>> thread_start+8()
>>>> ---------------------------
>>>>
>>>> Case No. 2: inotify_clean() waits for a mutex which is held by another
>>>> thread stuck at cv_timedwait_hires()
>>>> ---------------------------
>>>> > ffffff003d29cc40::findstack -v
>>>> stack pointer for thread ffffff003d29cc40: ffffff003d29c960
>>>> [ ffffff003d29c960 resume_from_intr+0xb7() ]
>>>>   ffffff003d29c990 swtch+0x82()
>>>>   ffffff003d29ca30 turnstile_block+0x21a(ffffff090bf5bd58, 0, ffffff096b1d9808, fffffffffbc07aa0, 0, 0)
>>>>   ffffff003d29caa0 mutex_vector_enter+0x3a3(ffffff096b1d9808)
>>>>   ffffff003d29caf0 inotify_clean+0x2f(ffffff096b1d9808)
>>>>   ffffff003d29cb90 cyclic_softint+0xfd(ffffff090b55bb00, 0)
>>>>   ffffff003d29cba0 cbe_low_level+0x14()
>>>>   ffffff003d29cbf0 av_dispatch_softvect+0x78(2)
>>>>   ffffff003d29cc20 dispatch_softint+0x39(0, 0)
>>>>   ffffff003d5a8e60 switch_sp_and_call+0x13()
>>>>   ffffff003d5a8e90 0xffffff0900000001()
>>>>   4136b0 [no mapping for address]
>>>> > ffffff096b1d9808::mutex
>>>>             ADDR  TYPE             HELD MINSPL OLDSPL WAITERS
>>>> ffffff096b1d9808 adapt ffffff0912213000      -      -     yes
>>>> > ffffff0912213000::findstack -v
>>>> stack pointer for thread ffffff0912213000: ffffff003d8372f0
>>>> [ ffffff003d8372f0 _resume_from_idle+0xf4() ]
>>>>   ffffff003d837320 swtch+0x141()
>>>>   ffffff003d8373b0 cv_timedwait_hires+0xec(ffffff09122131ee, ffffff09122131f0, 7d07b0bbc4, 186a0, 3)
>>>>   ffffff003d837410 dmu_tx_delay+0x147(ffffff096f22a900, 87f76900)
>>>>   ffffff003d837460 dmu_tx_wait+0x8e(ffffff096f22a900)
>>>>   ffffff003d8374a0 dmu_tx_assign+0x55(ffffff096f22a900, 1)
>>>>   ffffff003d837520 dmu_free_long_range_impl+0xa7(ffffff0914365780, ffffff096cb94d28, 0, ffffffffffffffff)
>>>>   ffffff003d837590 dmu_free_long_range+0x62(ffffff0914365780, 89e, 0, ffffffffffffffff)
>>>>   ffffff003d8375e0 zfs_rmnode+0x52(ffffff0978376220)
>>>>   ffffff003d837620 zfs_zinactive+0xe8(ffffff0978376220)
>>>>   ffffff003d837680 zfs_inactive+0x75(ffffff09783d0e00, ffffff096ed6abb8, 0)
>>>>   ffffff003d8376e0 fop_inactive+0x76(ffffff09783d0e00, ffffff096ed6abb8, 0)
>>>>   ffffff003d837710 vn_rele+0x82(ffffff09783d0e00)
>>>>   ffffff003d837750 inotify_watch_remove+0x93(ffffff096b1d9808, ffffff097a002540)
>>>>   ffffff003d837830 inotify_rm_watch+0x5b(ffffff096b1d9808, 1)
>>>>   ffffff003d837cc0 inotify_ioctl+0x113(12000000004, 696e7902, 1, 202003, ffffff096ed6abb8, ffffff003d837ea8)
>>>>   ffffff003d837d00 cdev_ioctl+0x39(12000000004, 696e7902, 1, 202003, ffffff096ed6abb8, ffffff003d837ea8)
>>>>   ffffff003d837d50 spec_ioctl+0x60(ffffff0968d52d40, 696e7902, 1, 202003, ffffff096ed6abb8, ffffff003d837ea8, 0)
>>>>   ffffff003d837de0 fop_ioctl+0x55(ffffff0968d52d40, 696e7902, 1, 202003, ffffff096ed6abb8, ffffff003d837ea8, 0)
>>>>   ffffff003d837f00 ioctl+0x9b(3, 696e7902, 1)
>>>>   ffffff003d837f10 sys_syscall+0x196()
>>>> ---------------------------
>>>>
>>>> I've reproduced this issue using the latest image
>>>> joyent_20150514T133314Z, by the way.
>>>>
>>>> Thanks,
>>>>
>>>> -Youzhong
>>>
>>> *smartos-discuss* | Archives
>>> <https://www.listbox.com/member/archive/184463/=now>
>>> <https://www.listbox.com/member/archive/rss/184463/22103350-51080293> |
>>> Modify <https://www.listbox.com/member/?&> Your Subscription
>>> <http://www.listbox.com>
