Re: [Nfs-ganesha-devel] crash in makefd_xprt()

2017-08-10 Thread Malahal Naineni
The following confirms that thread 1 (TCP) is trying to use the same "rec"
as thread 42 (UDP); it is easy to reproduce on the customer system! Note
that the rec passed to rpc_dplx_unref() in thread 42 (0x3ffeccc25d90) is the
same pointer that thread 1 prints below in its makefd_xprt() frame.

 (gdb) thread 42
[Switching to thread 42 (Thread 0x3fffa98fe850 (LWP 99483))]
#0  0x3fffb33b1df8 in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt 5
#0  0x3fffb33b1df8 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x3fffb33ab178 in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x3fffb330df8c in rpc_dplx_unref (rec=0x3ffeccc25d90, flags=0)
at
/usr/src/debug/nfs-ganesha-2.3.2-ibm44-0.1.1-Source/libntirpc/src/rpc_dplx.c:350
#3  0x3fffb330226c in clnt_dg_destroy (clnt=0x3ffecc4c4790)
at
/usr/src/debug/nfs-ganesha-2.3.2-ibm44-0.1.1-Source/libntirpc/src/clnt_dg.c:709
#4  0x3fffb331c1d4 in __rpcb_findaddr_timed (program=100024, version=1,
nconf=0x3ffeccc21230,
host=0x102061a8 "localhost", clpp=0x3fffa98fbde8, tp=0x3fffb33603e0
)
at
/usr/src/debug/nfs-ganesha-2.3.2-ibm44-0.1.1-Source/libntirpc/src/rpcb_clnt.c:821
(More stack frames follow...)
(gdb) thread 1
[Switching to thread 1 (Thread 0x3fff2a8fe850 (LWP 100755))]
#0  0x3fffb332ceb0 in makefd_xprt (fd=32039, sendsz=262144,
recvsz=262144, allocated=0x3fff2a8fdb4c)
at
/usr/src/debug/nfs-ganesha-2.3.2-ibm44-0.1.1-Source/libntirpc/src/svc_vc.c:436
436 if (!(xd->flags & X_VC_DATA_FLAG_SVC_DESTROYED)) {
(gdb) bt 5
#0  0x3fffb332ceb0 in makefd_xprt (fd=32039, sendsz=262144,
recvsz=262144, allocated=0x3fff2a8fdb4c)
at
/usr/src/debug/nfs-ganesha-2.3.2-ibm44-0.1.1-Source/libntirpc/src/svc_vc.c:436
#1  0x3fffb332d224 in rendezvous_request (xprt=0x10030fa5b80,
req=0x3fff28f0)
at
/usr/src/debug/nfs-ganesha-2.3.2-ibm44-0.1.1-Source/libntirpc/src/svc_vc.c:549
#2  0x10065104 in thr_decode_rpc_request (context=0x0,
xprt=0x10030fa5b80)
at
/usr/src/debug/nfs-ganesha-2.3.2-ibm44-0.1.1-Source/MainNFSD/nfs_rpc_dispatcher_thread.c:1729
#3  0x100657f4 in thr_decode_rpc_requests (thr_ctx=0x3fff1c0008c0)
at
/usr/src/debug/nfs-ganesha-2.3.2-ibm44-0.1.1-Source/MainNFSD/nfs_rpc_dispatcher_thread.c:1853
#4  0x10195744 in fridgethr_start_routine (arg=0x3fff1c0008c0)
at
/usr/src/debug/nfs-ganesha-2.3.2-ibm44-0.1.1-Source/support/fridgethr.c:561
(More stack frames follow...)
(gdb) frame 0
#0  0x3fffb332ceb0 in makefd_xprt (fd=32039, sendsz=262144,
recvsz=262144, allocated=0x3fff2a8fdb4c)
at
/usr/src/debug/nfs-ganesha-2.3.2-ibm44-0.1.1-Source/libntirpc/src/svc_vc.c:436
436 if (!(xd->flags & X_VC_DATA_FLAG_SVC_DESTROYED)) {
(gdb) p rec
$2 = (struct rpc_dplx_rec *) 0x3ffeccc25d90


On Fri, Aug 11, 2017 at 2:05 AM, Matt Benjamin  wrote:

> discussion in #ganesha :)
>
> On Thu, Aug 10, 2017 at 3:55 PM, Malahal Naineni 
> wrote:
> > Hi All,
> >
> > One of our customers reported the backtrace below. The returned "rec"
> > seems to be corrupted. Based on oflags, rpc_dplx_lookup_rec() didn't
> > allocate the "rec" in this call path. Its refcount is 2. More
> > importantly, rec.hdl.xd is 0x51 (a bogus pointer), which leads to the
> > crash. The GDB data is at the end of this email. Note that this crash
> > was observed on the latest ganesha 2.3 release.
> >
> > Looking at rpc_dplx_lookup_rec() and rpc_dplx_unref(), it appears that
> > a rec's refcnt can drop to 0 and then go back up. Also, rpc_dplx_unref()
> > releases the rec lock and then acquires the hash lock to preserve the
> > lock order. After the lock is dropped at line 359 below, another thread
> > could grab the rec and bump its refcnt to 1. That second thread could
> > then call rpc_dplx_unref() itself, beat the first thread, and free the
> > "rec". The first thread's access to "&rec->node_k" at line 361 is then
> > dangerous, as it might touch freed memory. In any case, this is NOT our
> > backtrace here. :-(
> >
> > Also, looking at the users of this "rec", they seem to close the file
> > descriptor and then call rpc_dplx_unref(). This has very nasty side
> > effects if my understanding is right. Say thread one has fd 100, closes
> > it, and is calling rpc_dplx_unref() to free the "rec"; in the meantime
> > another thread gets fd 100 and calls rpc_dplx_lookup_rec(). At this
> > point the second thread is going to use the same "rec" as the first
> > thread, correct? Can it happen that a "rec" that belonged to UDP is now
> > handed to a thread doing TCP? That is one way I can explain the
> > backtrace: the first thread would be UDP, which doesn't need "xd", and
> > the second thread would be TCP, which finds "xd" uninitialized because
> > the "rec" was allocated by the UDP thread. If you are still reading
> > this email, kudos and a big thank you.
> >
> > 357 if (rec->refcnt == 0) {
> > 358 t = rbtx_partition_of_scalar(&rpc_dplx_rec_set.xt,
> > rec->fd_k);
> > 359 REC_UNLOCK(rec);
> > 360 rwlock_wrlock(&t->lock);
> > 361 nv = opr_rbtree_lookup(&t->t, &rec->node_k);
> > 362 rec = NULL;

Re: [Nfs-ganesha-devel] Proposed backports for 2.5.2

2017-08-10 Thread Soumya Koduri


> commit 7f2d461277521301a417ca368d3c7656edbfc903
>  FSAL_GLUSTER: Reset caller_garray to NULL upon free
>

Yes

On 08/09/2017 08:57 PM, Frank Filz wrote:

39119aa Soumya Koduri FSAL_GLUSTER: Use glfs_xreaddirplus_r for
readdir

Yes? No? It's sort of a new feature, but may be critical for some use cases.
I'd rather it go into stable than end up separately backported for
downstream.



Right - as it is more of a new feature, upstream we wanted it to go into
2.6 onwards only, so as not to destabilize the stable branch (in case there
are minor issues).


But yes, we may end up back-porting it downstream if we do not rebase to
2.6 by then.


Thanks,
Soumya



Re: [Nfs-ganesha-devel] Weekly conference call timing

2017-08-10 Thread Soumya Koduri



On 08/10/2017 01:18 AM, Frank Filz wrote:

My daughter will be starting a new preschool, possibly as early as August
22nd. Unfortunately it's Monday, Tuesday, and Wednesday, and I will need to
drop her off at 9:00 AM Pacific Time, which is right in the middle of our
current time slot...

We could keep the time slot and move to Thursday (or even Friday), or I
could make it work to do it an hour earlier.

I'd like to make this work for the largest number of people, so if you could
give me an idea of what times DON'T work for you that would be helpful.

7:30 AM to 8:30 AM Pacific Time would be:
10:30 AM to 11:30 AM Eastern Time
4:30 PM to 5:30 PM Paris Time
8:00 PM to 9:00 PM Bangalore Time (and 9:00 PM to 10:00 PM when we switch
back to standard time)


An hour earlier (same day) is fine with me as well.

Regards,
Soumya



Re: [Nfs-ganesha-devel] mdcache growing beyond limits.

2017-08-10 Thread Matt Benjamin
I think the particular thresholds for open files and inode count are
interacting in a way we'd like to change.  It might make sense to delegate
the various decision points to a vector of strategy functions, letting more
varied approaches compete?
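
As a rough sketch of that idea (all names below are illustrative, not the
actual mdcache API), one such strategy might look like this:

#include <stdbool.h>
#include <stdint.h>

/* Illustrative policy context; not the real lru_state layout. */
struct lru_policy_ctx {
	uint64_t entries_used;   /* entries currently cached */
	uint64_t entries_hiwat;  /* configured entry high-water mark */
	uint32_t open_fds;       /* file descriptors currently open */
	uint32_t fds_lowat;      /* fd low-water mark */
};

/* A vector of strategy callbacks for the LRU decision points. */
struct lru_strategy {
	bool (*should_demote)(const struct lru_policy_ctx *ctx); /* L1 -> L2? */
	bool (*may_reap)(const struct lru_policy_ctx *ctx);      /* reap now? */
};

/* One candidate strategy: demote on either fd pressure or entry-count
 * pressure, rather than on fd pressure alone. */
static bool demote_on_any_pressure(const struct lru_policy_ctx *ctx)
{
	return ctx->open_fds > ctx->fds_lowat ||
	       ctx->entries_used > ctx->entries_hiwat;
}

Different strategies could then be selected by configuration and compared
without touching the main LRU loop itself.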

Matt

On Thu, Aug 10, 2017 at 7:12 PM, Pradeep  wrote:
> I debugged this a little more. It appears that the entries that can be
> reaped are not at the LRU position (head) of the L1 queue, so they would
> have to be freed later by lru_run(); however, I don't see that happening
> either, for some reason.
>
> (gdb) p LRU[1].L1
> $29 = {q = {next = 0x7fb459e71960, prev = 0x7fb3ec3c0d30}, id =
> LRU_ENTRY_L1, size = 260379}
>
> The head of the list is an entry with refcnt 2, but there are several
> entries with refcnt 1.
>
> (gdb) p *(mdcache_lru_t *)0x7fb459e71960
> $30 = {q = {next = 0x7fb43ddea8a0, prev = 0x7d68a0 }, qid =
> LRU_ENTRY_L1, refcnt = 2, flags = 0, lane = 1, cf = 2}
> (gdb) p *(mdcache_lru_t *)0x7fb43ddea8a0
> $31 = {q = {next = 0x7fb3f041f9a0, prev = 0x7fb459e71960}, qid =
> LRU_ENTRY_L1, refcnt = 1, flags = 0, lane = 1, cf = 0}
> (gdb) p *(mdcache_lru_t *)0x7fb3f041f9a0
> $32 = {q = {next = 0x7fb466960200, prev = 0x7fb43ddea8a0}, qid =
> LRU_ENTRY_L1, refcnt = 1, flags = 0, lane = 1, cf = 0}
> (gdb) p *(mdcache_lru_t *)0x7fb466960200
> $33 = {q = {next = 0x7fb451e20570, prev = 0x7fb3f041f9a0}, qid =
> LRU_ENTRY_L1, refcnt = 2, flags = 0, lane = 1, cf = 1}
>
> The entries with refcnt 1 are moved to L2 by the background thread
> (lru_run). However, it does so only if the open file count is greater than
> the low-water mark. In my case, open_fd_count is not high, so lru_run()
> doesn't call lru_run_lane() to demote those entries to L2. What is the best
> approach to handle this scenario?
>
> Thanks,
> Pradeep
>
>
>
> On Mon, Aug 7, 2017 at 6:08 AM, Daniel Gryniewicz  wrote:
>>
>> It never has been.  In cache_inode, a pin-ref kept it from being
>> reaped; now any ref beyond 1 keeps it.
>>
>> On Fri, Aug 4, 2017 at 1:31 PM, Frank Filz 
>> wrote:
>> >> I'm hitting a case where mdcache keeps growing well beyond the high
>> >> water mark. Here is a snapshot of the lru_state:
>> >>
>> >> 1 = {entries_hiwat = 10, entries_used = 2306063, chunks_hiwat =
>> > 10,
>> >> chunks_used = 16462,
>> >>
>> >> It has grown to 2.3 million entries and each entry is ~1.6K.
>> >>
>> >> I looked at the first entry in lane 0, L1 queue:
>> >>
>> >> (gdb) p LRU[0].L1
>> >> $9 = {q = {next = 0x7fad64256f00, prev = 0x7faf21a1bc00}, id =
>> >> LRU_ENTRY_L1, size = 254628}
>> >> (gdb) p (mdcache_entry_t *)(0x7fad64256f00-1024)
>> >> $10 = (mdcache_entry_t *) 0x7fad64256b00
>> >> (gdb) p $10->lru
>> >> $11 = {q = {next = 0x7fad65ea0f00, prev = 0x7d67c0 }, qid =
>> >> LRU_ENTRY_L1, refcnt = 2, flags = 0, lane = 0, cf = 0}
>> >> (gdb) p $10->fh_hk.inavl
>> >> $13 = true
>> >
>> > The refcount 2 prevents reaping.
>> >
>> > There could be a refcount leak.
>> >
>> > Hmm, though, I thought the entries_hwmark was a hard limit, guess not...
>> >
>> > Frank
>> >
>> >> Lane 1:
>> >> (gdb) p LRU[1].L1
>> >> $18 = {q = {next = 0x7fad625c0300, prev = 0x7faec08c5100}, id =
>> >> LRU_ENTRY_L1, size = 253006}
>> >> (gdb) p (mdcache_entry_t *)(0x7fad625c0300 - 1024)
>> >> $21 = (mdcache_entry_t *) 0x7fad625bff00
>> >> (gdb) p $21->lru
>> >> $22 = {q = {next = 0x7fad66fce600, prev = 0x7d68a0 }, qid =
>> >> LRU_ENTRY_L1, refcnt = 2, flags = 0, lane = 1, cf = 1}
>> >>
>> >> (gdb) p $21->fh_hk.inavl
>> >> $24 = true
>> >>
>> >> As per LRU_ENTRY_RECLAIMABLE(), these entries should be reclaimable.
>> >> Not sure why it is not able to reclaim them. Any ideas?
>> >>
>> >> Thanks,
>> >> Pradeep
>> >>
>> >>
>> >

Re: [Nfs-ganesha-devel] mdcache growing beyond limits.

2017-08-10 Thread Pradeep
I debugged this a little more. It appears that the entries that can be
reaped are not at the LRU position (head) of the L1 queue, so they would
have to be freed later by lru_run(); however, I don't see that happening
either, for some reason.

(gdb) p LRU[1].L1
$29 = {q = {next = 0x7fb459e71960, prev = 0x7fb3ec3c0d30}, id =
LRU_ENTRY_L1, size = 260379}

The head of the list is an entry with refcnt 2, but there are several
entries with refcnt 1.

(gdb) p *(mdcache_lru_t *)0x7fb459e71960
$30 = {q = {next = 0x7fb43ddea8a0, prev = 0x7d68a0 }, qid =
LRU_ENTRY_L1, refcnt = 2, flags = 0, lane = 1, cf = 2}
(gdb) p *(mdcache_lru_t *)0x7fb43ddea8a0
$31 = {q = {next = 0x7fb3f041f9a0, prev = 0x7fb459e71960}, qid =
LRU_ENTRY_L1, refcnt = 1, flags = 0, lane = 1, cf = 0}
(gdb) p *(mdcache_lru_t *)0x7fb3f041f9a0
$32 = {q = {next = 0x7fb466960200, prev = 0x7fb43ddea8a0}, qid =
LRU_ENTRY_L1, refcnt = 1, flags = 0, lane = 1, cf = 0}
(gdb) p *(mdcache_lru_t *)0x7fb466960200
$33 = {q = {next = 0x7fb451e20570, prev = 0x7fb3f041f9a0}, qid =
LRU_ENTRY_L1, refcnt = 2, flags = 0, lane = 1, cf = 1}

The entries with refcnt 1 are moved to L2 by the background thread
(lru_run). However, it does so only if the open file count is greater than
the low-water mark. In my case, open_fd_count is not high, so lru_run()
doesn't call lru_run_lane() to demote those entries to L2. What is the best
approach to handle this scenario?
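
A rough sketch of the gating just described, plus one possible variant that
also reacts to entry-count pressure (illustrative names only, not the actual
lru_run() code):

#include <stdbool.h>
#include <stdint.h>

/* As described above: demotion of refcnt==1 entries from L1 to L2 is gated
 * purely on fd pressure, so with few open fds the L1 queues keep growing
 * even when the entry count is far past its high-water mark. */
static bool should_demote_fd_pressure_only(uint32_t open_fd_count,
					   uint32_t fds_lowat)
{
	return open_fd_count > fds_lowat;
}

/* Possible variant for this scenario: also demote when the entry count is
 * over its high-water mark. */
static bool should_demote_variant(uint32_t open_fd_count, uint32_t fds_lowat,
				  uint64_t entries_used, uint64_t entries_hiwat)
{
	return open_fd_count > fds_lowat || entries_used > entries_hiwat;
}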

Thanks,
Pradeep



On Mon, Aug 7, 2017 at 6:08 AM, Daniel Gryniewicz  wrote:

> It never has been.  In cache_inode, a pin-ref kept it from being
> reaped; now any ref beyond 1 keeps it.
>
> On Fri, Aug 4, 2017 at 1:31 PM, Frank Filz 
> wrote:
> >> I'm hitting a case where mdcache keeps growing well beyond the high
> >> water mark. Here is a snapshot of the lru_state:
> >>
> >> 1 = {entries_hiwat = 10, entries_used = 2306063, chunks_hiwat =
> > 10,
> >> chunks_used = 16462,
> >>
> >> It has grown to 2.3 million entries and each entry is ~1.6K.
> >>
> >> I looked at the first entry in lane 0, L1 queue:
> >>
> >> (gdb) p LRU[0].L1
> >> $9 = {q = {next = 0x7fad64256f00, prev = 0x7faf21a1bc00}, id =
> >> LRU_ENTRY_L1, size = 254628}
> >> (gdb) p (mdcache_entry_t *)(0x7fad64256f00-1024)
> >> $10 = (mdcache_entry_t *) 0x7fad64256b00
> >> (gdb) p $10->lru
> >> $11 = {q = {next = 0x7fad65ea0f00, prev = 0x7d67c0 }, qid =
> >> LRU_ENTRY_L1, refcnt = 2, flags = 0, lane = 0, cf = 0}
> >> (gdb) p $10->fh_hk.inavl
> >> $13 = true
> >
> > The refcount 2 prevents reaping.
> >
> > There could be a refcount leak.
> >
> > Hmm, though, I thought the entries_hwmark was a hard limit, guess not...
> >
> > Frank
> >
> >> Lane 1:
> >> (gdb) p LRU[1].L1
> >> $18 = {q = {next = 0x7fad625c0300, prev = 0x7faec08c5100}, id =
> >> LRU_ENTRY_L1, size = 253006}
> >> (gdb) p (mdcache_entry_t *)(0x7fad625c0300 - 1024)
> >> $21 = (mdcache_entry_t *) 0x7fad625bff00
> >> (gdb) p $21->lru
> >> $22 = {q = {next = 0x7fad66fce600, prev = 0x7d68a0 }, qid =
> >> LRU_ENTRY_L1, refcnt = 2, flags = 0, lane = 1, cf = 1}
> >>
> >> (gdb) p $21->fh_hk.inavl
> >> $24 = true
> >>
> >> As per LRU_ENTRY_RECLAIMABLE(), these entries should be reclaimable.
> >> Not sure why it is not able to reclaim them. Any ideas?
> >>
> >> Thanks,
> >> Pradeep
> >>
> >>
>


Re: [Nfs-ganesha-devel] crash in makefd_xprt()

2017-08-10 Thread Matt Benjamin
discussion in #ganesha :)

On Thu, Aug 10, 2017 at 3:55 PM, Malahal Naineni  wrote:
> Hi All,
>
> One of our customers reported the backtrace below. The returned "rec"
> seems to be corrupted. Based on oflags, rpc_dplx_lookup_rec() didn't
> allocate the "rec" in this call path. Its refcount is 2. More importantly,
> rec.hdl.xd is 0x51 (a bogus pointer), which leads to the crash. The GDB
> data is at the end of this email. Note that this crash was observed on the
> latest ganesha 2.3 release.
>
> Looking at rpc_dplx_lookup_rec() and rpc_dplx_unref(), it appears that a
> rec's refcnt can drop to 0 and then go back up. Also, rpc_dplx_unref()
> releases the rec lock and then acquires the hash lock to preserve the lock
> order. After the lock is dropped at line 359 below, another thread could
> grab the rec and bump its refcnt to 1. That second thread could then call
> rpc_dplx_unref() itself, beat the first thread, and free the "rec". The
> first thread's access to "&rec->node_k" at line 361 is then dangerous, as
> it might touch freed memory. In any case, this is NOT our backtrace
> here. :-(
>
> Also, looking at the users of this "rec", they seem to close the file
> descriptor and then call rpc_dplx_unref(). This has very nasty side
> effects if my understanding is right. Say thread one has fd 100, closes
> it, and is calling rpc_dplx_unref() to free the "rec"; in the meantime
> another thread gets fd 100 and calls rpc_dplx_lookup_rec(). At this point
> the second thread is going to use the same "rec" as the first thread,
> correct? Can it happen that a "rec" that belonged to UDP is now handed to
> a thread doing TCP? That is one way I can explain the backtrace: the first
> thread would be UDP, which doesn't need "xd", and the second thread would
> be TCP, which finds "xd" uninitialized because the "rec" was allocated by
> the UDP thread. If you are still reading this email, kudos and a big
> thank you.
>
> 357 if (rec->refcnt == 0) {
> 358 t = rbtx_partition_of_scalar(&rpc_dplx_rec_set.xt,
> rec->fd_k);
> 359 REC_UNLOCK(rec);
> 360 rwlock_wrlock(&t->lock);
> 361 nv = opr_rbtree_lookup(&t->t, &rec->node_k);
> 362 rec = NULL;
>
>
> BORING GDB STUFF:
>
> (gdb) bt
> #0  0x3fff7aaaceb0 in makefd_xprt (fd=166878, sendsz=262144,
> recvsz=262144, allocated=0x3ffab97fdb4c)
> at
> /usr/src/debug/nfs-ganesha-2.3.2-ibm44-0.1.1-Source/libntirpc/src/svc_vc.c:436
> #1  0x3fff7aaad224 in rendezvous_request (xprt=0x1000b125310,
> req=0x3ffa2c0008f0)
> at
> /usr/src/debug/nfs-ganesha-2.3.2-ibm44-0.1.1-Source/libntirpc/src/svc_vc.c:549
> #2  0x10065104 in thr_decode_rpc_request (context=0x0,
> xprt=0x1000b125310)
> at
> /usr/src/debug/nfs-ganesha-2.3.2-ibm44-0.1.1-Source/MainNFSD/nfs_rpc_dispatcher_thread.c:1729
> #3  0x100657f4 in thr_decode_rpc_requests (thr_ctx=0x3ffedc001280)
> at
> /usr/src/debug/nfs-ganesha-2.3.2-ibm44-0.1.1-Source/MainNFSD/nfs_rpc_dispatcher_thread.c:1853
> #4  0x10195744 in fridgethr_start_routine (arg=0x3ffedc001280)
> at
> /usr/src/debug/nfs-ganesha-2.3.2-ibm44-0.1.1-Source/support/fridgethr.c:561
>
> (gdb) p oflags
> $1 = 0
> (gdb) p rec->hdl.xd
> $2 = (struct x_vc_data *) 0x51
> (gdb) p *rec
> $3 = {fd_k = 166878, locktrace = {mtx = {__data = {__lock = 2, __count = 0,
> __owner = 92274, __nusers = 1, __kind = 3,
> __spins = 0, __list = {__prev = 0x0, __next = 0x0}},
>   __size =
> "\002\000\000\000\000\000\000\000rh\001\000\001\000\000\000\003", '\000'
> ,
>   __align = 2}, func = 0x3fff7aac6ca0 <__func__.8774> "rpc_dplx_ref",
> line = 89}, node_k = {left = 0x0,
> right = 0x0, parent = 0x3ff9c80034f0, red = 1, gen = 639163}, refcnt =
> 2, send = {lock = {we = {mtx = {__data = {
> __lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 3,
> __spins = 0, __list = {__prev = 0x0,
>   __next = 0x0}}, __size = '\000' , "\003",
> '\000' , __align = 0},
> cv = {__data = {__lock = 0, __futex = 0, __total_seq = 0,
> __wakeup_seq = 0, __woken_seq = 0, __mutex = 0x0,
> __nwaiters = 0, __broadcast_seq = 0}, __size = '\000'  47 times>, __align = 0}},
>   lock_flag_value = 0, locktrace = {func = 0x0, line = 0}}}, recv =
> {lock = {we = {mtx = {__data = {__lock = 0,
> __count = 0, __owner = 0, __nusers = 0, __kind = 3, __spins = 0,
> __list = {__prev = 0x0, __next = 0x0}},
>   __size = '\000' , "\003", '\000'  times>, __align = 0}, cv = {__data = {
> __lock = 0, __futex = 0, __total_seq = 0, __wakeup_seq = 0,
> __woken_seq = 0, __mutex = 0x0, __nwaiters = 0,
> __broadcast_seq = 0}, __size = '\000' ,
> __align = 0}}, lock_flag_value = 0, locktrace = {
> func = 0x3ffc00d8 "\300L\001", line = 0}}}, hdl = {xd = 0x51,
> xprt = 0x0}}
> (gdb)
>
>

[Nfs-ganesha-devel] crash in makefd_xprt()

2017-08-10 Thread Malahal Naineni
Hi All,

One of our customers reported the backtrace below. The returned "rec"
seems to be corrupted. Based on oflags, rpc_dplx_lookup_rec() didn't
allocate the "rec" in this call path. Its refcount is 2. More importantly,
rec.hdl.xd is 0x51 (a bogus pointer), which leads to the crash. The GDB data
is at the end of this email. Note that this crash was observed on the latest
ganesha 2.3 release.

Looking at rpc_dplx_lookup_rec() and rpc_dplx_unref(), it appears that a
rec's refcnt can drop to 0 and then go back up. Also, rpc_dplx_unref()
releases the rec lock and then acquires the hash lock to preserve the lock
order. After the lock is dropped at line 359 below, another thread could
grab the rec and bump its refcnt to 1. That second thread could then call
rpc_dplx_unref() itself, beat the first thread, and free the "rec". The
first thread's access to "&rec->node_k" at line 361 is then dangerous, as it
might touch freed memory. In any case, this is NOT our backtrace here. :-(

Also, looking at the users of this "rec", they seem to close the file
descriptor and then call rpc_dplx_unref(). This has very nasty side effects
if my understanding is right. Say thread one has fd 100, closes it, and is
calling rpc_dplx_unref() to free the "rec"; in the meantime another thread
gets fd 100 and calls rpc_dplx_lookup_rec(). At this point the second thread
is going to use the same "rec" as the first thread, correct? Can it happen
that a "rec" that belonged to UDP is now handed to a thread doing TCP? That
is one way I can explain the backtrace: the first thread would be UDP, which
doesn't need "xd", and the second thread would be TCP, which finds "xd"
uninitialized because the "rec" was allocated by the UDP thread. If you are
still reading this email, kudos and a big thank you.
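
To illustrate that ordering problem, here is a sketch with made-up types
and helper names (not the real ntirpc code):

#include <unistd.h>

/* Illustrative stand-ins only. */
struct rec_sketch {
	int fd;
	/* ... refcnt, locks, hdl.xd, ... */
};

void fd_table_remove(int fd);               /* assumed: unhook from fd-keyed table */
void rec_sketch_free(struct rec_sketch *r); /* assumed: final free */

/* The ordering described above: the fd is closed while the rec is still
 * reachable through the fd-keyed table. */
static void release_rec_racy(struct rec_sketch *r)
{
	close(r->fd);             /* fd number is immediately reusable */
	/* window: another thread gets the same fd, calls
	 * rpc_dplx_lookup_rec(), and is handed this stale rec --
	 * possibly a UDP-allocated rec handed to a TCP path */
	fd_table_remove(r->fd);
	rec_sketch_free(r);
}

/* Safer ordering in this sketch: unhook first, close last, so a reused fd
 * always maps to a fresh rec. */
static void release_rec_safer(struct rec_sketch *r)
{
	fd_table_remove(r->fd);
	close(r->fd);
	rec_sketch_free(r);
}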

357 if (rec->refcnt == 0) {
358 t = rbtx_partition_of_scalar(&rpc_dplx_rec_set.xt,
rec->fd_k);
359 REC_UNLOCK(rec);
360 rwlock_wrlock(&t->lock);
361 nv = opr_rbtree_lookup(&t->t, &rec->node_k);
362 rec = NULL;
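
One way to narrow the window between lines 359 and 361 (a simplified sketch
only, not the actual rpc_dplx_unref(); REC_LOCK, opr_rbtree_remove,
rwlock_unlock and rec_final_free are assumed helper names here) would be to
re-take the rec lock and re-check refcnt before unlinking:

if (rec->refcnt == 0) {
	t = rbtx_partition_of_scalar(&rpc_dplx_rec_set.xt, rec->fd_k);
	REC_UNLOCK(rec);
	rwlock_wrlock(&t->lock);
	REC_LOCK(rec);
	if (rec->refcnt == 0) {
		/* nobody resurrected the rec while the locks were juggled */
		opr_rbtree_remove(&t->t, &rec->node_k);
		REC_UNLOCK(rec);
		rec_final_free(rec);
	} else {
		/* another thread took a new reference; leave the rec alone */
		REC_UNLOCK(rec);
	}
	rwlock_unlock(&t->lock);
	rec = NULL;
}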


BORING GDB STUFF:

(gdb) bt
#0  0x3fff7aaaceb0 in makefd_xprt (fd=166878, sendsz=262144,
recvsz=262144, allocated=0x3ffab97fdb4c)
at
/usr/src/debug/nfs-ganesha-2.3.2-ibm44-0.1.1-Source/libntirpc/src/svc_vc.c:436
#1  0x3fff7aaad224 in rendezvous_request (xprt=0x1000b125310,
req=0x3ffa2c0008f0)
at
/usr/src/debug/nfs-ganesha-2.3.2-ibm44-0.1.1-Source/libntirpc/src/svc_vc.c:549
#2  0x10065104 in thr_decode_rpc_request (context=0x0,
xprt=0x1000b125310)
at
/usr/src/debug/nfs-ganesha-2.3.2-ibm44-0.1.1-Source/MainNFSD/nfs_rpc_dispatcher_thread.c:1729
#3  0x100657f4 in thr_decode_rpc_requests (thr_ctx=0x3ffedc001280)
at
/usr/src/debug/nfs-ganesha-2.3.2-ibm44-0.1.1-Source/MainNFSD/nfs_rpc_dispatcher_thread.c:1853
#4  0x10195744 in fridgethr_start_routine (arg=0x3ffedc001280)
at
/usr/src/debug/nfs-ganesha-2.3.2-ibm44-0.1.1-Source/support/fridgethr.c:561

(gdb) p oflags
$1 = 0
(gdb) p rec->hdl.xd
$2 = (struct x_vc_data *) 0x51
(gdb) p *rec
$3 = {fd_k = 166878, locktrace = {mtx = {__data = {__lock = 2, __count = 0,
__owner = 92274, __nusers = 1, __kind = 3,
__spins = 0, __list = {__prev = 0x0, __next = 0x0}},
  __size =
"\002\000\000\000\000\000\000\000rh\001\000\001\000\000\000\003", '\000'
,
  __align = 2}, func = 0x3fff7aac6ca0 <__func__.8774> "rpc_dplx_ref",
line = 89}, node_k = {left = 0x0,
right = 0x0, parent = 0x3ff9c80034f0, red = 1, gen = 639163}, refcnt =
2, send = {lock = {we = {mtx = {__data = {
__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 3,
__spins = 0, __list = {__prev = 0x0,
  __next = 0x0}}, __size = '\000' , "\003",
'\000' , __align = 0},
cv = {__data = {__lock = 0, __futex = 0, __total_seq = 0,
__wakeup_seq = 0, __woken_seq = 0, __mutex = 0x0,
__nwaiters = 0, __broadcast_seq = 0}, __size = '\000' , __align = 0}},
  lock_flag_value = 0, locktrace = {func = 0x0, line = 0}}}, recv =
{lock = {we = {mtx = {__data = {__lock = 0,
__count = 0, __owner = 0, __nusers = 0, __kind = 3, __spins =
0, __list = {__prev = 0x0, __next = 0x0}},
  __size = '\000' , "\003", '\000' , __align = 0}, cv = {__data = {
__lock = 0, __futex = 0, __total_seq = 0, __wakeup_seq = 0,
__woken_seq = 0, __mutex = 0x0, __nwaiters = 0,
__broadcast_seq = 0}, __size = '\000' ,
__align = 0}}, lock_flag_value = 0, locktrace = {
func = 0x3ffc00d8 "\300L\001", line = 0}}}, hdl = {xd = 0x51,
xprt = 0x0}}
(gdb)


[Nfs-ganesha-devel] Change in ffilz/nfs-ganesha[next]: FSAL_MEM - fix UP thread init/cleanup

2017-08-10 Thread GerritHub
From Daniel Gryniewicz :

Daniel Gryniewicz has uploaded this change for review. ( 
https://review.gerrithub.io/373818


Change subject: FSAL_MEM - fix UP thread init/cleanup
..

FSAL_MEM - fix UP thread init/cleanup

Change-Id: I0428d3c316a12fc1cab750f745640a50c03a34cc
Signed-off-by: Daniel Gryniewicz 
---
M src/FSAL/FSAL_MEM/mem_up.c
1 file changed, 10 insertions(+), 0 deletions(-)



  git pull ssh://review.gerrithub.io:29418/ffilz/nfs-ganesha 
refs/changes/18/373818/1
-- 
To view, visit https://review.gerrithub.io/373818
To unsubscribe, visit https://review.gerrithub.io/settings

Gerrit-Project: ffilz/nfs-ganesha
Gerrit-Branch: next
Gerrit-MessageType: newchange
Gerrit-Change-Id: I0428d3c316a12fc1cab750f745640a50c03a34cc
Gerrit-Change-Number: 373818
Gerrit-PatchSet: 1
Gerrit-Owner: Daniel Gryniewicz 


Re: [Nfs-ganesha-devel] Weekly conference call timing

2017-08-10 Thread Malahal Naineni
An hour earlier works well for me on any day except Friday.

On Thu, Aug 10, 2017 at 2:20 PM, Swen Schillig  wrote:

> Hi Frank
>
> I'd prefer to keep it the same day, an hour earlier is fine with me.
> If you need to move to another day, friday would suit me best.
>
> Cheers Swen
>
>


Re: [Nfs-ganesha-devel] Weekly conference call timing

2017-08-10 Thread Swen Schillig
Hi Frank

I'd prefer to keep it the same day, an hour earlier is fine with me.
If you need to move to another day, friday would suit me best.

Cheers Swen




Re: [Nfs-ganesha-devel] Weekly conference call timing

2017-08-10 Thread Supriti Singh
Hi Frank,

For me, Thursday 4:30 PM to 5:30 PM Paris time works, and any time on Friday is good.

Thanks,
Supriti  


--
Supriti Singh, SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
 



>>> "Frank Filz"  08/09/17 9:49 PM >>>
My daughter will be starting a new preschool, possibly as early as August
22nd. Unfortunately it's Monday, Tuesday, and Wednesday, and I will need to
drop her off at 9:00 AM Pacific Time, which is right in the middle of our
current time slot...

We could keep the time slot and move to Thursday (or even Friday), or I
could make it work to do it an hour earlier.

I'd like to make this work for the largest number of people, so if you could
give me an idea of what times DON'T work for you that would be helpful.

7:30 AM to 8:30 AM Pacific Time would be:
10:30 AM to 11:30 AM Eastern Time
4:30 PM to 5:30 PM Paris Time
8:00 PM to 9:00 PM Bangalore Time (and 9:00 PM to 10:00 PM when we switch
back to standard time)

If there are other time zones we have folks joining from, please let me
know.

Thanks

Frank




