[smartos-discuss] Re: [developer] Issue 4965: NLM: Mac OS X client can be stuck at locking a NFS file on an illumos server (SmartOS, Nexenta or OmniOS etc.) running open source lockd

2014-07-11 Thread Youzhong Yang via smartos-discuss
Thanks for helping us solve this issue.



In our use case, we need the NULL_RPC timeout to be at least lower than the
client LOCK request timeout…otherwise our client Mac processes stall/fail
–that’s the real pain point we need to solve.



We don’t see any use that in which the illumos server must wait/recover
across a Mac reboot. While it certainly seems like a valid use-case, it’s
not one that we expect to encounter. But please note, we’d be more than
happy to test/validate any such scenario if helps.



After thinking more about it, what I’d like to do is make the timeout value
configurable (via mdb) – that way, it can be defaulted to whatever value
you feel serves best for the most affected, and we will be able to adjust
it to our use-case pain points.


Below is our proposed patch, your comments/advices are appreciated.


Thank you,


- Youzhong


-

diff --git a/usr/src/uts/common/klm/nlm_impl.c
b/usr/src/uts/common/klm/nlm_impl.c

index 7daa30d..c2f7178 100644

--- a/usr/src/uts/common/klm/nlm_impl.c

+++ b/usr/src/uts/common/klm/nlm_impl.c

@@ -124,6 +124,11 @@ krwlock_t lm_lck;

 static const struct timeval nlm_rpctv_zero = { 0,  0 };


 /*

+ * Initial timeout for NLM NULL RPC

+ */

+static volatile struct timeval nlm_nullrpc_wait = { 0, 20 };

+

+/*

  * List of all Zone globals nlm_globals instences

  * linked together.

  */

@@ -527,6 +532,20 @@ nlm_clnt_call(CLIENT *clnt, rpcproc_t procnum,
xdrproc_t xdr_args,

wait = nlm_rpctv_zero;


/*

+* Default timeout value of 25 seconds can take

+* nlm_null_rpc() 150 seconds to return RPC_TIMEDOUT

+* if it uses UDP and the destination port is

+* unreachable.

+*

+* A shorter timeout value, e.g. 200 milliseconds,

+* will cause nlm_null_rpc() to time out after

+* 200 * (1 + 2 + 4 + 8 + 16 + 32) = 12.6 seconds

+* (with retries set to 5)

+*/

+   if (procnum == NLM_NULL)

+   wait = nlm_nullrpc_wait;

+

+   /*

 * We need to block signals in case of NLM_CANCEL RPC

 * in order to prevent interruption of network RPC

 * calls.

diff --git a/usr/src/uts/common/klm/nlm_rpc_handle.c
b/usr/src/uts/common/klm/nlm_rpc_handle.c

index 9ddf568..26397b3 100644

--- a/usr/src/uts/common/klm/nlm_rpc_handle.c

+++ b/usr/src/uts/common/klm/nlm_rpc_handle.c

@@ -55,6 +55,7 @@

(_status) == RPC_PROGVERSMISMATCH ||\

(_status) == RPC_PROCUNAVAIL || \

(_status) == RPC_CANTCONNECT || \

+   (_status) == RPC_TIMEDOUT ||\

(_status) == RPC_XPRTFAILED)


 static struct kmem_cache *nlm_rpch_cache = NULL;




On Tue, Jul 8, 2014 at 2:44 PM, Marcel Telka mar...@telka.sk wrote:

 Hi,

 Nice work. Please find my comments inline.

 On Tue, Jul 08, 2014 at 01:43:36PM -0400, Youzhong Yang via
 illumos-developer wrote:
  I made the following two changes and built a new image, the issue goes
 away.
 
  --- a/usr/src/uts/common/klm/nlm_impl.c
  +++ b/usr/src/uts/common/klm/nlm_impl.c
  @@ -525,6 +525,12 @@ nlm_clnt_call(CLIENT *clnt, rpcproc_t procnum,
  xdrproc_t xdr_args,
   */
  if (procnum = NLM_TEST_RES  procnum = NLM_GRANTED_RES)
  wait = nlm_rpctv_zero;
  +   if (procnum == NLM_NULL) {
  +   wait.tv_sec = 0;
  +   wait.tv_usec = 25000;
  +   }
  +
 
  --- a/usr/src/uts/common/klm/nlm_rpc_handle.c
  +++ b/usr/src/uts/common/klm/nlm_rpc_handle.c
  @@ -55,6 +55,7 @@
  (_status) == RPC_PROGVERSMISMATCH ||\
  (_status) == RPC_PROCUNAVAIL || \
  (_status) == RPC_CANTCONNECT || \
  +   (_status) == RPC_TIMEDOUT ||\
  (_status) == RPC_XPRTFAILED)
 

 Please add the following (or similar) note as a comment close to your
 modification in nlm_clnt_call() so the explanation why we do need to
 adjust the
 default timeout is retained for future.

  Setting timeout value of NULL rpc to 25 milliseconds instead of the
 default
  25 seconds can make nlm_null_rpc() returns RPC_TIMEDOUT after 1575
  milliseconds when the UDP port is not reachable:
 1575 = 25 + 50 + 100 + 200 + 400 + 800 = 5 retries

 In general, this should work, but I'm not sure your suggestion of 25 ms
 (for
 the initial timoeut) and 1.5 sec (as the total timeout) is long enough.
 Assuming the client (Mac OS) rebooted.  The reboot is slow operation, so
 there
 is no need to be so fast (1.5 sec) in detecting it.  I think we should
 push the
 total timeout to ca 30 seconds (with the initial timeout at 0.5 sec) or
 maybe
 even more to avoid spurious stale clients detection.


 Thank you.

 --
 +---+
 | Marcel Telka   e-mail:   mar...@telka.sk  |
 |homepage: http://telka.sk/ |
 |jabber:   mar...@jabber.sk |
 

[smartos-discuss] Re: [developer] Issue 4965: NLM: Mac OS X client can be stuck at locking a NFS file on an illumos server (SmartOS, Nexenta or OmniOS etc.) running open source lockd

2014-07-08 Thread Garrett D'Amore via smartos-discuss
I agree with Marcel.  Error on the side of caution here. :-)


On Tue, Jul 8, 2014 at 11:44 AM, Marcel Telka via illumos-developer 
develo...@lists.illumos.org wrote:

 Hi,

 Nice work. Please find my comments inline.

 On Tue, Jul 08, 2014 at 01:43:36PM -0400, Youzhong Yang via
 illumos-developer wrote:
  I made the following two changes and built a new image, the issue goes
 away.
 
  --- a/usr/src/uts/common/klm/nlm_impl.c
  +++ b/usr/src/uts/common/klm/nlm_impl.c
  @@ -525,6 +525,12 @@ nlm_clnt_call(CLIENT *clnt, rpcproc_t procnum,
  xdrproc_t xdr_args,
   */
  if (procnum = NLM_TEST_RES  procnum = NLM_GRANTED_RES)
  wait = nlm_rpctv_zero;
  +   if (procnum == NLM_NULL) {
  +   wait.tv_sec = 0;
  +   wait.tv_usec = 25000;
  +   }
  +
 
  --- a/usr/src/uts/common/klm/nlm_rpc_handle.c
  +++ b/usr/src/uts/common/klm/nlm_rpc_handle.c
  @@ -55,6 +55,7 @@
  (_status) == RPC_PROGVERSMISMATCH ||\
  (_status) == RPC_PROCUNAVAIL || \
  (_status) == RPC_CANTCONNECT || \
  +   (_status) == RPC_TIMEDOUT ||\
  (_status) == RPC_XPRTFAILED)
 

 Please add the following (or similar) note as a comment close to your
 modification in nlm_clnt_call() so the explanation why we do need to
 adjust the
 default timeout is retained for future.

  Setting timeout value of NULL rpc to 25 milliseconds instead of the
 default
  25 seconds can make nlm_null_rpc() returns RPC_TIMEDOUT after 1575
  milliseconds when the UDP port is not reachable:
 1575 = 25 + 50 + 100 + 200 + 400 + 800 = 5 retries

 In general, this should work, but I'm not sure your suggestion of 25 ms
 (for
 the initial timoeut) and 1.5 sec (as the total timeout) is long enough.
 Assuming the client (Mac OS) rebooted.  The reboot is slow operation, so
 there
 is no need to be so fast (1.5 sec) in detecting it.  I think we should
 push the
 total timeout to ca 30 seconds (with the initial timeout at 0.5 sec) or
 maybe
 even more to avoid spurious stale clients detection.


 Thank you.

 --
 +---+
 | Marcel Telka   e-mail:   mar...@telka.sk  |
 |homepage: http://telka.sk/ |
 |jabber:   mar...@jabber.sk |
 +---+


 ---
 illumos-developer
 Archives: https://www.listbox.com/member/archive/182179/=now
 RSS Feed:
 https://www.listbox.com/member/archive/rss/182179/21239177-3604570e
 Modify Your Subscription:
 https://www.listbox.com/member/?;
 Powered by Listbox: http://www.listbox.com




---
smartos-discuss
Archives: https://www.listbox.com/member/archive/184463/=now
RSS Feed: https://www.listbox.com/member/archive/rss/184463/25769125-55cfbc00
Modify Your Subscription: 
https://www.listbox.com/member/?member_id=25769125id_secret=25769125-7688e9fb
Powered by Listbox: http://www.listbox.com