Re: [Nfs-ganesha-devel] deadlock in lru_reap_impl()
I think we need to ensure that the partition lock is taken before the
qlane lock. I have a patch for this, but it introduced a refcount issue,
so I'm debugging.

Daniel

On 08/03/2017 08:52 PM, Pradeep wrote:
> Thanks Frank. I merged your patch and am now hitting another deadlock.
> Here are the two threads:
>
> This thread holds the partition lock in 'read' mode and is trying to
> acquire the queue lock:
>
> Thread 143 (Thread 0x7faf82f72700 (LWP 143573)):
> #0  0x7fafd1c371bd in __lll_lock_wait () from /lib64/libpthread.so.0
> #1  0x7fafd1c32d02 in _L_lock_791 () from /lib64/libpthread.so.0
> #2  0x7fafd1c32c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
> #3  0x005221fd in _mdcache_lru_ref (entry=0x7fae78d19000, flags=2, func=0x58ec80 <__func__.23467> "mdcache_find_keyed", line=881)
>     at /usr/src/debug/nfs-ganesha-2.5.1-0.1.1-Source/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:1813
> #4  0x00532686 in mdcache_find_keyed (key=0x7faf82f70760, entry=0x7faf82f707e8)
>     at /usr/src/debug/nfs-ganesha-2.5.1-0.1.1-Source/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:881
>
> 874         *entry = cih_get_by_key_latch(key, ,
> 875                         CIH_GET_RLOCK | CIH_GET_UNLOCK_ON_MISS,
> 876                         __func__, __LINE__);
> 877         if (likely(*entry)) {
> 878                 fsal_status_t status;
> 879
> 880                 /* Initial Ref on entry */
> 881                 status = mdcache_lru_ref(*entry, LRU_REQ_INITIAL);
>
> This thread is already holding the queue lock and is trying to acquire
> the partition lock in write mode:
>
> Thread 188 (Thread 0x7faf9979f700 (LWP 143528)):
> #0  0x7fafd1c3403e in pthread_rwlock_wrlock () from /lib64/libpthread.so.0
> #1  0x0052fc61 in cih_remove_checked (entry=0x7fad62914e00)
>     at /usr/src/debug/nfs-ganesha-2.5.1-0.1.1-Source/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_hash.h:394
> #2  0x00530b3e in mdc_clean_entry (entry=0x7fad62914e00)
>     at /usr/src/debug/nfs-ganesha-2.5.1-0.1.1-Source/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:272
> #3  0x0051df7e in mdcache_lru_clean (entry=0x7fad62914e00)
>     at /usr/src/debug/nfs-ganesha-2.5.1-0.1.1-Source/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:590
> #4  0x00522cca in _mdcache_lru_unref (entry=0x7fad62914e00, flags=8, func=0x58b700 <__func__.23710> "lru_reap_impl", line=690)
>     at /usr/src/debug/nfs-ganesha-2.5.1-0.1.1-Source/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:1922
> #5  0x0051ea38 in lru_reap_impl (qid=LRU_ENTRY_L1)
>     at /usr/src/debug/nfs-ganesha-2.5.1-0.1.1-Source/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:690
Re: [Nfs-ganesha-devel] deadlock in lru_reap_impl()
Thanks Frank. I merged your patch and am now hitting another deadlock.
Here are the two threads:

This thread holds the partition lock in 'read' mode and is trying to
acquire the queue lock:

Thread 143 (Thread 0x7faf82f72700 (LWP 143573)):
#0  0x7fafd1c371bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x7fafd1c32d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x7fafd1c32c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x005221fd in _mdcache_lru_ref (entry=0x7fae78d19000, flags=2, func=0x58ec80 <__func__.23467> "mdcache_find_keyed", line=881)
    at /usr/src/debug/nfs-ganesha-2.5.1-0.1.1-Source/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:1813
#4  0x00532686 in mdcache_find_keyed (key=0x7faf82f70760, entry=0x7faf82f707e8)
    at /usr/src/debug/nfs-ganesha-2.5.1-0.1.1-Source/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:881

874         *entry = cih_get_by_key_latch(key, ,
875                         CIH_GET_RLOCK | CIH_GET_UNLOCK_ON_MISS,
876                         __func__, __LINE__);
877         if (likely(*entry)) {
878                 fsal_status_t status;
879
880                 /* Initial Ref on entry */
881                 status = mdcache_lru_ref(*entry, LRU_REQ_INITIAL);

This thread is already holding the queue lock and is trying to acquire
the partition lock in write mode:

Thread 188 (Thread 0x7faf9979f700 (LWP 143528)):
#0  0x7fafd1c3403e in pthread_rwlock_wrlock () from /lib64/libpthread.so.0
#1  0x0052fc61 in cih_remove_checked (entry=0x7fad62914e00)
    at /usr/src/debug/nfs-ganesha-2.5.1-0.1.1-Source/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_hash.h:394
#2  0x00530b3e in mdc_clean_entry (entry=0x7fad62914e00)
    at /usr/src/debug/nfs-ganesha-2.5.1-0.1.1-Source/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:272
#3  0x0051df7e in mdcache_lru_clean (entry=0x7fad62914e00)
    at /usr/src/debug/nfs-ganesha-2.5.1-0.1.1-Source/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:590
#4  0x00522cca in _mdcache_lru_unref (entry=0x7fad62914e00, flags=8, func=0x58b700 <__func__.23710> "lru_reap_impl", line=690)
    at /usr/src/debug/nfs-ganesha-2.5.1-0.1.1-Source/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:1922
#5  0x0051ea38 in lru_reap_impl (qid=LRU_ENTRY_L1)
    at /usr/src/debug/nfs-ganesha-2.5.1-0.1.1-Source/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:690

On Fri, Jul 28, 2017 at 1:34 PM, Frank Filz <ffilz...@mindspring.com> wrote:
> Hmm, well, that’s easy to fix…
>
> Instead of:
>
>         mdcache_lru_unref(entry, LRU_UNREF_QLOCKED);
>         goto next_lane;
>
> It could:
>
>         QUNLOCK(qlane);
>         mdcache_put(entry);
>         continue;
>
> Fix posted here:
>
> https://review.gerrithub.io/371764
>
> Frank
Re: [Nfs-ganesha-devel] deadlock in lru_reap_impl()
Hmm, well, that’s easy to fix…

Instead of:

        mdcache_lru_unref(entry, LRU_UNREF_QLOCKED);
        goto next_lane;

It could:

        QUNLOCK(qlane);
        mdcache_put(entry);
        continue;

Fix posted here:

https://review.gerrithub.io/371764

Frank

From: Pradeep [mailto:pradeep.tho...@gmail.com]
Sent: Friday, July 28, 2017 12:44 PM
To: nfs-ganesha-devel@lists.sourceforge.net
Subject: [Nfs-ganesha-devel] deadlock in lru_reap_impl()

I'm hitting another deadlock in mdcache with a 2.5.1 base. In this case
two threads are in different places in lru_reap_impl().

Thread 1:

636                 QLOCK(qlane);
637                 lru = glist_first_entry(>q, mdcache_lru_t, q);
638                 if (!lru)
639                         goto next_lane;
640                 refcnt = atomic_inc_int32_t(>refcnt);
641                 entry = container_of(lru, mdcache_entry_t, lru);
642                 if (unlikely(refcnt != (LRU_SENTINEL_REFCOUNT + 1))) {
643                         /* cant use it. */
644                         mdcache_lru_unref(entry, LRU_UNREF_QLOCKED);

mdcache_lru_unref() could lead to the set of calls below:

mdcache_lru_unref() -> mdcache_lru_clean() -> mdc_clean_entry() -> cih_remove_checked()

This tries to take the partition lock, which is held by 'Thread 2',
which in turn is trying to acquire the queue lane lock.

Thread 2:

650                 if (cih_latch_entry(>fh_hk.key, , CIH_GET_WLOCK,
651                                     __func__, __LINE__)) {
652                         QLOCK(qlane);

Stack traces:

Thread 1:
#0  0x7f571328103e in pthread_rwlock_wrlock () from /lib64/libpthread.so.0
#1  0x0052f928 in cih_remove_checked (entry=0x7f548e86c400)
    at /usr/src/debug/nfs-ganesha-2.5.1-0.1.1-Source/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_hash.h:394
#2  0x00530805 in mdc_clean_entry (entry=0x7f548e86c400)
    at /usr/src/debug/nfs-ganesha-2.5.1-0.1.1-Source/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:272
#3  0x0051df7e in mdcache_lru_clean (entry=0x7f548e86c400)
    at /usr/src/debug/nfs-ganesha-2.5.1-0.1.1-Source/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:590
#4  0x005229c0 in _mdcache_lru_unref (entry=0x7f548e86c400, flags=8, func=0x58b5c0 <__func__.23710> "lru_reap_impl", line=687)
    at /usr/src/debug/nfs-ganesha-2.5.1-0.1.1-Source/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:1918
#5  0x0051e83a in lru_reap_impl (qid=LRU_ENTRY_L1)
    at /usr/src/debug/nfs-ganesha-2.5.1-0.1.1-Source/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:687

Thread 2:
#0  0x7f57132841bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x7f571327fd02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x7f571327fc08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0051e4f5 in lru_reap_impl (qid=LRU_ENTRY_L1)
    at /usr/src/debug/nfs-ganesha-2.5.1-0.1.1-Source/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:652

_______________________________________________
Nfs-ganesha-devel mailing list
Nfs-ganesha-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs-ganesha-devel
[Nfs-ganesha-devel] deadlock in lru_reap_impl()
I'm hitting another deadlock in mdcache with a 2.5.1 base. In this case
two threads are in different places in lru_reap_impl().

Thread 1:

636                 QLOCK(qlane);
637                 lru = glist_first_entry(>q, mdcache_lru_t, q);
638                 if (!lru)
639                         goto next_lane;
640                 refcnt = atomic_inc_int32_t(>refcnt);
641                 entry = container_of(lru, mdcache_entry_t, lru);
642                 if (unlikely(refcnt != (LRU_SENTINEL_REFCOUNT + 1))) {
643                         /* cant use it. */
644                         mdcache_lru_unref(entry, LRU_UNREF_QLOCKED);

mdcache_lru_unref() could lead to the set of calls below:

mdcache_lru_unref() -> mdcache_lru_clean() -> mdc_clean_entry() -> cih_remove_checked()

This tries to take the partition lock, which is held by 'Thread 2',
which in turn is trying to acquire the queue lane lock.

Thread 2:

650                 if (cih_latch_entry(>fh_hk.key, , CIH_GET_WLOCK,
651                                     __func__, __LINE__)) {
652                         QLOCK(qlane);

Stack traces:

Thread 1:
#0  0x7f571328103e in pthread_rwlock_wrlock () from /lib64/libpthread.so.0
#1  0x0052f928 in cih_remove_checked (entry=0x7f548e86c400)
    at /usr/src/debug/nfs-ganesha-2.5.1-0.1.1-Source/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_hash.h:394
#2  0x00530805 in mdc_clean_entry (entry=0x7f548e86c400)
    at /usr/src/debug/nfs-ganesha-2.5.1-0.1.1-Source/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:272
#3  0x0051df7e in mdcache_lru_clean (entry=0x7f548e86c400)
    at /usr/src/debug/nfs-ganesha-2.5.1-0.1.1-Source/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:590
#4  0x005229c0 in _mdcache_lru_unref (entry=0x7f548e86c400, flags=8, func=0x58b5c0 <__func__.23710> "lru_reap_impl", line=687)
    at /usr/src/debug/nfs-ganesha-2.5.1-0.1.1-Source/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:1918
#5  0x0051e83a in lru_reap_impl (qid=LRU_ENTRY_L1)
    at /usr/src/debug/nfs-ganesha-2.5.1-0.1.1-Source/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:687

Thread 2:
#0  0x7f57132841bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x7f571327fd02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x7f571327fc08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0051e4f5 in lru_reap_impl (qid=LRU_ENTRY_L1)
    at /usr/src/debug/nfs-ganesha-2.5.1-0.1.1-Source/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:652