Bryan,

I'm glad to hear the diff looks good. Yes, my e-mail address
[email protected] can be used.

Thanks again for looking into this issue.

-Youzhong


On Thu, May 28, 2015 at 1:54 PM, Bryan Cantrill <[email protected]>
wrote:

>
> Youzhong,
>
> Sorry for the delay on this.  Yes, I think Garrett's suggestion was an
> excellent one -- and those diffs look good!  If you don't mind, since
> you've done the work here, I'd like to push this to SmartOS with you as the
> author.  What e-mail address do you prefer for that?  And thank you for
> doing this work -- and to Garrett for the suggestion!
>
>         - Bryan
>
>
> On Thu, May 28, 2015 at 6:12 AM, Youzhong Yang <[email protected]> wrote:
>
>> Hi Bryan,
>>
>> I tried the following patch and it works well so far. Would you please
>> take a look and let me know if it looks good?
>>
>> By the way, with this patch, the inotify_clean() stack looks like:
>>
>>               zfs`dmu_tx_delay+0x147
>>               zfs`dmu_tx_wait+0x8e
>>               zfs`dmu_tx_assign+0x55
>>               zfs`dmu_free_long_range_impl+0x9f
>>               zfs`dmu_free_long_range+0x62
>>               zfs`zfs_rmnode+0x52
>>               zfs`zfs_zinactive+0xe8
>>               zfs`zfs_inactive+0x75
>>               genunix`fop_inactive+0x76
>>               genunix`vn_rele+0x82
>>               inotify`inotify_clean+0xfc
>>               genunix`periodic_execute+0xc9
>>               genunix`taskq_thread+0x2d0
>>               unix`thread_start+0x8
>>
>> Thanks,
>>
>> -Youzhong
>>
>> diff --git a/usr/src/uts/common/io/inotify.c
>> b/usr/src/uts/common/io/inotify.c (
>> https://github.com/joyent/illumos-joyent/blob/master/usr/src/uts/common/io/inotify.c
>> )
>> index c5c7b37..594b727 100644
>> --- a/usr/src/uts/common/io/inotify.c
>> +++ b/usr/src/uts/common/io/inotify.c
>> @@ -99,7 +99,7 @@ struct inotify_state {
>>         pollhead_t ins_pollhd;                  /* poll head */
>>         kcondvar_t ins_cv;                      /* condvar for reading */
>>         list_t ins_orphans;                     /* orphan list */
>> -       cyclic_id_t ins_cleaner;                /* cyclic for cleaning */
>> +       ddi_periodic_t ins_cleaner;             /* cyclic for cleaning */
>>         inotify_watch_t *ins_zombies;           /* zombie watch list */
>>         cred_t *ins_cred;                       /* creator's credentials
>> */
>>         inotify_state_t *ins_next;              /* next state on global
>> list */
>> @@ -1042,8 +1042,6 @@ inotify_open(dev_t *devp, int flag, int otyp,
>> cred_t *cred_p)
>>         major_t major = getemajor(*devp);
>>         minor_t minor = getminor(*devp);
>>         int instances = 0;
>> -       cyc_handler_t hdlr;
>> -       cyc_time_t when;
>>         char c[64];
>>
>>         if (minor != INOTIFYMNRN_INOTIFY)
>> @@ -1100,17 +1098,7 @@ inotify_open(dev_t *devp, int flag, int otyp,
>> cred_t *cred_p)
>>
>>         mutex_exit(&inotify_lock);
>>
>> -       mutex_enter(&cpu_lock);
>> -
>> -       hdlr.cyh_func = inotify_clean;
>> -       hdlr.cyh_level = CY_LOW_LEVEL;
>> -       hdlr.cyh_arg = state;
>> -
>> -       when.cyt_when = 0;
>> -       when.cyt_interval = NANOSEC;
>> -
>> -       state->ins_cleaner = cyclic_add(&hdlr, &when);
>> -       mutex_exit(&cpu_lock);
>> +       state->ins_cleaner = ddi_periodic_add(inotify_clean, (void
>> *)state, NANOSEC, DDI_IPL_0);
>>
>>         return (0);
>>  }
>> @@ -1329,9 +1317,10 @@ inotify_close(dev_t dev, int flag, int otyp,
>> cred_t *cred_p)
>>                 inotify_watch_destroy(watch);
>>         }
>>
>> -       mutex_enter(&cpu_lock);
>> -       cyclic_remove(state->ins_cleaner);
>> -       mutex_exit(&cpu_lock);
>> +       if (state->ins_cleaner != NULL) {
>> +               ddi_periodic_delete(state->ins_cleaner);
>> +               state->ins_cleaner = NULL;
>> +       }
>>
>>         mutex_enter(&inotify_lock);
>>
>>
>> On Thu, May 21, 2015 at 9:23 PM, Garrett D'Amore <[email protected]>
>> wrote:
>>
>>> A ddi_periodic would probably be simpler and cleaner than the
>>> taskq/cyclic combo.   And I think you can define the interrupt level (or
>>> not) that you want to run it at.
>>>
>>> Sent from my iPhone
>>>
>>> On May 21, 2015, at 2:59 PM, Bryan Cantrill <[email protected]>
>>> wrote:
>>>
>>>
>>> Youzhong,
>>>
>>> Well, today I learned that there's an implicit dependency that I imagine
>>> would be news to most: ZFS doesn't allow vn_rele() calls from low-level
>>> interrupt context. :|  There are a couple of ways to fix this, but the
>>> simplest is to have inotify_clean() run in a task queue that's kicked off by
>>> a cyclic, which I can prototype relatively easily.  Meanwhile, might it be
>>> possible for you to upload the dump to Manta or otherwise make it
>>> available so I can verify that this is indeed what's happening?  And thank
>>> you (as always!) for your diligence in reproducing and reporting the issue!
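
A pseudocode-level kernel sketch of the taskq/cyclic combo Bryan describes might look like the following. This is illustrative only, not the actual prototype; the names inotify_cleanup_tq, inotify_clean_task and inotify_clean_cyclic are invented here:

```c
/*
 * Sketch (hypothetical names): the cyclic fires in soft-interrupt
 * context, where blocking is not allowed, so it merely dispatches the
 * real work onto a taskq whose threads may safely block in
 * vn_rele()/ZFS.
 */
static taskq_t *inotify_cleanup_tq;	/* created at attach/open time */

static void
inotify_clean_task(void *arg)
{
	/* kernel-thread context: safe to block in vn_rele() -> ZFS */
	inotify_clean(arg);
}

static void
inotify_clean_cyclic(void *arg)
{
	/* CY_LOW_LEVEL handler: must not block, so hand off and return */
	(void) taskq_dispatch(inotify_cleanup_tq, inotify_clean_task,
	    arg, TQ_NOSLEEP);
}
```
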
>>>
>>>          - Bryan
>>>
>>>
>>> On Thu, May 21, 2015 at 11:34 AM, Youzhong Yang <[email protected]>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> We're trying to use inotify for change notification in Samba 4. Most of
>>>> the time it works very well, but under certain conditions the server can
>>>> become unresponsive (to client NFS requests), due to an apparent deadlock
>>>> caused by inotify_clean() running as a cyclic.
>>>>
>>>> Here are the steps to reproduce:
>>>>
>>>> 1. Suppose you have a zfs file system with plenty of space, say
>>>> /zpool/temp, created in the zpool named 'zpool'; download the two
>>>> attached C programs and compile them.
>>>>
>>>> 2. In terminal window #1, run the following dtrace command to monitor
>>>> the pool's 'dirty_total':
>>>>
>>>> dtrace -n 'txg-syncing
>>>> /((dsl_pool_t *)arg0)->dp_spa->spa_name == "zpool"/
>>>> {
>>>>     printf("%4dMB (%2d) of %4dMB used",
>>>>         ((dsl_pool_t *)arg0)->dp_dirty_total / 1024 / 1024,
>>>>         ((dsl_pool_t *)arg0)->dp_dirty_total * 100 / `zfs_dirty_data_max,
>>>>         `zfs_dirty_data_max / 1024 / 1024);
>>>> }'
>>>>
>>>> Replace the string "zpool" with your actual zpool name.
>>>>
>>>> 3. In terminal window #2, run the following command:
>>>>
>>>> inotify_stress_new -f /zpool/temp -n 4
>>>>
>>>> It forks 4 processes, each of which uses the inotify_* functions to
>>>> monitor the folder /zpool/temp while creating and removing temporary
>>>> files in that folder.
>>>>
>>>> 4. In terminal window #3, run the following command:
>>>>
>>>> write-files -f /zpool/temp -n 50
>>>>
>>>> It forks 50 processes, each of which creates a subfolder under
>>>> /zpool/temp and keeps creating 1GB files there.
>>>>
>>>> Depending on your machine's horsepower, you may want to adjust the
>>>> number of processes for the write-files program. Its purpose is basically
>>>> to push the zpool's dirty_total beyond 60% of zfs_dirty_data_max.
>>>>
>>>> Normally, within a few minutes, terminal window #1 will stop printing
>>>> anything, which means the issue has been reproduced.
>>>>
>>>> Now once the issue is reproduced, I run 'echo ::threadlist -v 10 | mdb
>>>> -k' to dump the kernel threads. So far I have observed two patterns:
>>>>
>>>> Case No.1: inotify_clean() stuck at cv_timedwait_hires()
>>>> ---------------------------
>>>>     swtch+0x82()
>>>>     cv_timedwait_hires+0xec(ffffff01e852ae2e, ffffff01e852ae30,
>>>> 1047be67f9694,
>>>>     186a0, 3)
>>>>     dmu_tx_delay+0x147(ffffff8114229c00, d80448bb)
>>>>     dmu_tx_wait+0x8e(ffffff8114229c00)
>>>>     dmu_tx_assign+0x55(ffffff8114229c00, 1)
>>>>     dmu_free_long_range_impl+0x9f(ffffff443a1943c0, ffffff4be51507e8,
>>>> 0, ffffffffffffffff)
>>>>     dmu_free_long_range+0x62(ffffff443a1943c0, 179294d, 0,
>>>> ffffffffffffffff)
>>>>     zfs_rmnode+0x52(ffffff44d20891f0)
>>>>     zfs_zinactive+0xe8(ffffff44d20891f0)
>>>>     zfs_inactive+0x75(ffffff450b8e8500, 0, 0)
>>>>     fop_inactive+0x76(ffffff450b8e8500, 0, 0)
>>>>     vn_rele+0x82(ffffff450b8e8500)
>>>>     inotify_clean+0xfc(ffffff43abf54810)
>>>>     cyclic_softint+0xf3(ffffff42eb589a80, 0)
>>>>     cbe_low_level+0x14()
>>>>     av_dispatch_softvect+0x78(2)
>>>>     apix_dispatch_softint+0x35(0, 0)
>>>>     switch_sp_and_call+0x13()
>>>>     0xffffff42eb589a80()
>>>>     apix_do_interrupt+0xfe(ffffff01e84faad0, 0)
>>>>     _interrupt+0xba()
>>>>     i86_mwait+0xd()
>>>>     cpu_idle_mwait+0x109()
>>>>     idle+0xa7()
>>>>     thread_start+8()
>>>> ---------------------------
>>>> Case No. 2: inotify_clean() waits for a mutex which is held by another
>>>> thread stuck at cv_timedwait_hires()
>>>> ---------------------------
>>>> > ffffff003d29cc40::findstack -v
>>>> stack pointer for thread ffffff003d29cc40: ffffff003d29c960
>>>> [ ffffff003d29c960 resume_from_intr+0xb7() ]
>>>>   ffffff003d29c990 swtch+0x82()
>>>>   ffffff003d29ca30 turnstile_block+0x21a(ffffff090bf5bd58, 0,
>>>> ffffff096b1d9808, fffffffffbc07aa0, 0, 0)
>>>>   ffffff003d29caa0 mutex_vector_enter+0x3a3(ffffff096b1d9808)
>>>>   ffffff003d29caf0 inotify_clean+0x2f(ffffff096b1d9808)
>>>>   ffffff003d29cb90 cyclic_softint+0xfd(ffffff090b55bb00, 0)
>>>>   ffffff003d29cba0 cbe_low_level+0x14()
>>>>   ffffff003d29cbf0 av_dispatch_softvect+0x78(2)
>>>>   ffffff003d29cc20 dispatch_softint+0x39(0, 0)
>>>>   ffffff003d5a8e60 switch_sp_and_call+0x13()
>>>>   ffffff003d5a8e90 0xffffff0900000001()
>>>>   4136b0 [no mapping for address]
>>>> > ffffff096b1d9808::mutex
>>>>             ADDR  TYPE             HELD MINSPL OLDSPL WAITERS
>>>> ffffff096b1d9808 adapt ffffff0912213000      -      -     yes
>>>> > ffffff0912213000::findstack -v
>>>> stack pointer for thread ffffff0912213000: ffffff003d8372f0
>>>> [ ffffff003d8372f0 _resume_from_idle+0xf4() ]
>>>>   ffffff003d837320 swtch+0x141()
>>>>   ffffff003d8373b0 cv_timedwait_hires+0xec(ffffff09122131ee,
>>>> ffffff09122131f0, 7d07b0bbc4, 186a0, 3)
>>>>   ffffff003d837410 dmu_tx_delay+0x147(ffffff096f22a900, 87f76900)
>>>>   ffffff003d837460 dmu_tx_wait+0x8e(ffffff096f22a900)
>>>>   ffffff003d8374a0 dmu_tx_assign+0x55(ffffff096f22a900, 1)
>>>>   ffffff003d837520 dmu_free_long_range_impl+0xa7(ffffff0914365780,
>>>> ffffff096cb94d28, 0, ffffffffffffffff)
>>>>   ffffff003d837590 dmu_free_long_range+0x62(ffffff0914365780, 89e, 0,
>>>> ffffffffffffffff)
>>>>   ffffff003d8375e0 zfs_rmnode+0x52(ffffff0978376220)
>>>>   ffffff003d837620 zfs_zinactive+0xe8(ffffff0978376220)
>>>>   ffffff003d837680 zfs_inactive+0x75(ffffff09783d0e00,
>>>> ffffff096ed6abb8, 0)
>>>>   ffffff003d8376e0 fop_inactive+0x76(ffffff09783d0e00,
>>>> ffffff096ed6abb8, 0)
>>>>   ffffff003d837710 vn_rele+0x82(ffffff09783d0e00)
>>>>   ffffff003d837750 inotify_watch_remove+0x93(ffffff096b1d9808,
>>>> ffffff097a002540)
>>>>   ffffff003d837830 inotify_rm_watch+0x5b(ffffff096b1d9808, 1)
>>>>   ffffff003d837cc0 inotify_ioctl+0x113(12000000004, 696e7902, 1,
>>>> 202003, ffffff096ed6abb8, ffffff003d837ea8)
>>>>   ffffff003d837d00 cdev_ioctl+0x39(12000000004, 696e7902, 1, 202003,
>>>> ffffff096ed6abb8, ffffff003d837ea8)
>>>>   ffffff003d837d50 spec_ioctl+0x60(ffffff0968d52d40, 696e7902, 1,
>>>> 202003, ffffff096ed6abb8, ffffff003d837ea8, 0)
>>>>   ffffff003d837de0 fop_ioctl+0x55(ffffff0968d52d40, 696e7902, 1,
>>>> 202003, ffffff096ed6abb8, ffffff003d837ea8, 0)
>>>>   ffffff003d837f00 ioctl+0x9b(3, 696e7902, 1)
>>>>   ffffff003d837f10 sys_syscall+0x196()
>>>> ---------------------------
>>>>
>>>> By the way, I've reproduced this issue using the latest
>>>> image, joyent_20150514T133314Z.
>>>>
>>>> Thanks,
>>>>
>>>> -Youzhong
>>>>
>>>>
>>>
>>
>



-------------------------------------------
smartos-discuss
Archives: https://www.listbox.com/member/archive/184463/=now