[smartos-discuss] Re: [developer] Issue 4965: NLM: Mac OS X client can be stuck at locking a NFS file on an illumos server (SmartOS, Nexenta or OmniOS etc.) running open source lockd

Youzhong Yang via smartos-discuss Fri, 11 Jul 2014 07:58:35 -0700

Thanks for helping us solve this issue.



In our use case, we need the NULL_RPC timeout to be at least lower than the
client LOCK request timeout…otherwise our client Mac processes stall/fail
–that’s the real pain point we need to solve.



We don’t see any use that in which the illumos server must wait/recover
across a Mac reboot. While it certainly seems like a valid use-case, it’s
not one that we expect to encounter. But please note, we’d be more than
happy to test/validate any such scenario if helps.



After thinking more about it, what I’d like to do is make the timeout value
configurable (via mdb) – that way, it can be defaulted to whatever value
you feel serves best for the most affected, and we will be able to adjust
it to our use-case pain points.


Below is our proposed patch, your comments/advices are appreciated.


Thank you,


- Youzhong


-----------------------------------------------------------------------------

diff --git a/usr/src/uts/common/klm/nlm_impl.c
b/usr/src/uts/common/klm/nlm_impl.c

index 7daa30d..c2f7178 100644

--- a/usr/src/uts/common/klm/nlm_impl.c

+++ b/usr/src/uts/common/klm/nlm_impl.c

@@ -124,6 +124,11 @@ krwlock_t lm_lck;

 static const struct timeval nlm_rpctv_zero = { 0,  0 };


 /*

+ * Initial timeout for NLM NULL RPC

+ */

+static volatile struct timeval nlm_nullrpc_wait = { 0, 200000 };

+

+/*

  * List of all Zone globals nlm_globals instences

  * linked together.

  */

@@ -527,6 +532,20 @@ nlm_clnt_call(CLIENT *clnt, rpcproc_t procnum,
xdrproc_t xdr_args,

                wait = nlm_rpctv_zero;


        /*

+        * Default timeout value of 25 seconds can take

+        * nlm_null_rpc() 150 seconds to return RPC_TIMEDOUT

+        * if it uses UDP and the destination port is

+        * unreachable.

+        *

+        * A shorter timeout value, e.g. 200 milliseconds,

+        * will cause nlm_null_rpc() to time out after

+        * 200 * (1 + 2 + 4 + 8 + 16 + 32) = 12.6 seconds

+        * (with retries set to 5)

+        */

+       if (procnum == NLM_NULL)

+               wait = nlm_nullrpc_wait;

+

+       /*

         * We need to block signals in case of NLM_CANCEL RPC

         * in order to prevent interruption of network RPC

         * calls.

diff --git a/usr/src/uts/common/klm/nlm_rpc_handle.c
b/usr/src/uts/common/klm/nlm_rpc_handle.c

index 9ddf568..26397b3 100644

--- a/usr/src/uts/common/klm/nlm_rpc_handle.c

+++ b/usr/src/uts/common/klm/nlm_rpc_handle.c

@@ -55,6 +55,7 @@

        (_status) == RPC_PROGVERSMISMATCH ||    \

        (_status) == RPC_PROCUNAVAIL ||         \

        (_status) == RPC_CANTCONNECT ||         \

+       (_status) == RPC_TIMEDOUT ||            \

        (_status) == RPC_XPRTFAILED)


 static struct kmem_cache *nlm_rpch_cache = NULL;




On Tue, Jul 8, 2014 at 2:44 PM, Marcel Telka <mar...@telka.sk> wrote:

> Hi,
>
> Nice work. Please find my comments inline.
>
> On Tue, Jul 08, 2014 at 01:43:36PM -0400, Youzhong Yang via
> illumos-developer wrote:
> > I made the following two changes and built a new image, the issue goes
> away.
> >
> > --- a/usr/src/uts/common/klm/nlm_impl.c
> > +++ b/usr/src/uts/common/klm/nlm_impl.c
> > @@ -525,6 +525,12 @@ nlm_clnt_call(CLIENT *clnt, rpcproc_t procnum,
> > xdrproc_t xdr_args,
> >          */
> >         if (procnum >= NLM_TEST_RES && procnum <= NLM_GRANTED_RES)
> >                 wait = nlm_rpctv_zero;
> > +       if (procnum == NLM_NULL) {
> > +               wait.tv_sec = 0;
> > +               wait.tv_usec = 25000;
> > +       }
> > +
> >
> > --- a/usr/src/uts/common/klm/nlm_rpc_handle.c
> > +++ b/usr/src/uts/common/klm/nlm_rpc_handle.c
> > @@ -55,6 +55,7 @@
> >         (_status) == RPC_PROGVERSMISMATCH ||    \
> >         (_status) == RPC_PROCUNAVAIL ||         \
> >         (_status) == RPC_CANTCONNECT ||         \
> > +       (_status) == RPC_TIMEDOUT ||            \
> >         (_status) == RPC_XPRTFAILED)
> >
>
> Please add the following (or similar) note as a comment close to your
> modification in nlm_clnt_call() so the explanation why we do need to
> adjust the
> default timeout is retained for future.
>
> > Setting timeout value of NULL rpc to 25 milliseconds instead of the
> default
> > 25 seconds can make nlm_null_rpc() returns RPC_TIMEDOUT after 1575
> > milliseconds when the UDP port is not reachable:
> >    1575 = 25 + 50 + 100 + 200 + 400 + 800 => 5 retries
>
> In general, this should work, but I'm not sure your suggestion of 25 ms
> (for
> the initial timoeut) and 1.5 sec (as the total timeout) is long enough.
> Assuming the client (Mac OS) rebooted.  The reboot is slow operation, so
> there
> is no need to be so fast (1.5 sec) in detecting it.  I think we should
> push the
> total timeout to ca 30 seconds (with the initial timeout at 0.5 sec) or
> maybe
> even more to avoid spurious stale clients detection.
>
>
> Thank you.
>
> --
> +-------------------------------------------+
> | Marcel Telka   e-mail:   mar...@telka.sk  |
> |                homepage: http://telka.sk/ |
> |                jabber:   mar...@jabber.sk |
> +-------------------------------------------+
>



-------------------------------------------
smartos-discuss
Archives: https://www.listbox.com/member/archive/184463/=now
RSS Feed: https://www.listbox.com/member/archive/rss/184463/25769125-55cfbc00
Modify Your Subscription: 
https://www.listbox.com/member/?member_id=25769125&id_secret=25769125-7688e9fb
Powered by Listbox: http://www.listbox.com

[smartos-discuss] Re: [developer] Issue 4965: NLM: Mac OS X client can be stuck at locking a NFS file on an illumos server (SmartOS, Nexenta or OmniOS etc.) running open source lockd

Reply via email to