Thanks for helping us solve this issue.
In our use case, we need the NULL_RPC timeout to be at least lower than the client LOCK request timeout; otherwise our client Mac processes stall or fail. That is the real pain point we need to solve. We don't see any use case in which the illumos server must wait/recover across a Mac reboot. While it certainly seems like a valid use case, it's not one that we expect to encounter. But please note, we'd be more than happy to test/validate any such scenario if it helps.

After thinking more about it, what I'd like to do is make the timeout value configurable (via mdb). That way, it can default to whatever value you feel best serves most affected users, and we will be able to adjust it for our own use case.

Below is our proposed patch; your comments and advice are appreciated.

Thank you,

- Youzhong

-----------------------------------------------------------------------------

diff --git a/usr/src/uts/common/klm/nlm_impl.c b/usr/src/uts/common/klm/nlm_impl.c
index 7daa30d..c2f7178 100644
--- a/usr/src/uts/common/klm/nlm_impl.c
+++ b/usr/src/uts/common/klm/nlm_impl.c
@@ -124,6 +124,11 @@ krwlock_t lm_lck;
 static const struct timeval nlm_rpctv_zero = { 0, 0 };
 
 /*
+ * Initial timeout for NLM NULL RPC
+ */
+static volatile struct timeval nlm_nullrpc_wait = { 0, 200000 };
+
+/*
  * List of all Zone globals nlm_globals instences
  * linked together.
  */
@@ -527,6 +532,20 @@ nlm_clnt_call(CLIENT *clnt, rpcproc_t procnum, xdrproc_t xdr_args,
 		wait = nlm_rpctv_zero;
 
 	/*
+	 * Default timeout value of 25 seconds can take
+	 * nlm_null_rpc() 150 seconds to return RPC_TIMEDOUT
+	 * if it uses UDP and the destination port is
+	 * unreachable.
+	 *
+	 * A shorter timeout value, e.g. 200 milliseconds,
+	 * will cause nlm_null_rpc() to time out after
+	 * 200 * (1 + 2 + 4 + 8 + 16 + 32) = 12.6 seconds
+	 * (with retries set to 5)
+	 */
+	if (procnum == NLM_NULL)
+		wait = nlm_nullrpc_wait;
+
+	/*
 	 * We need to block signals in case of NLM_CANCEL RPC
 	 * in order to prevent interruption of network RPC
 	 * calls.
diff --git a/usr/src/uts/common/klm/nlm_rpc_handle.c b/usr/src/uts/common/klm/nlm_rpc_handle.c
index 9ddf568..26397b3 100644
--- a/usr/src/uts/common/klm/nlm_rpc_handle.c
+++ b/usr/src/uts/common/klm/nlm_rpc_handle.c
@@ -55,6 +55,7 @@
 	(_status) == RPC_PROGVERSMISMATCH || \
 	(_status) == RPC_PROCUNAVAIL || \
 	(_status) == RPC_CANTCONNECT || \
+	(_status) == RPC_TIMEDOUT || \
 	(_status) == RPC_XPRTFAILED)
 
 static struct kmem_cache *nlm_rpch_cache = NULL;

On Tue, Jul 8, 2014 at 2:44 PM, Marcel Telka <mar...@telka.sk> wrote:

> Hi,
>
> Nice work. Please find my comments inline.
>
> On Tue, Jul 08, 2014 at 01:43:36PM -0400, Youzhong Yang via
> illumos-developer wrote:
> > I made the following two changes and built a new image, the issue goes
> > away.
> >
> > --- a/usr/src/uts/common/klm/nlm_impl.c
> > +++ b/usr/src/uts/common/klm/nlm_impl.c
> > @@ -525,6 +525,12 @@ nlm_clnt_call(CLIENT *clnt, rpcproc_t procnum,
> > xdrproc_t xdr_args,
> >          */
> >         if (procnum >= NLM_TEST_RES && procnum <= NLM_GRANTED_RES)
> >                 wait = nlm_rpctv_zero;
> > +       if (procnum == NLM_NULL) {
> > +               wait.tv_sec = 0;
> > +               wait.tv_usec = 25000;
> > +       }
> > +
> >
> > --- a/usr/src/uts/common/klm/nlm_rpc_handle.c
> > +++ b/usr/src/uts/common/klm/nlm_rpc_handle.c
> > @@ -55,6 +55,7 @@
> >         (_status) == RPC_PROGVERSMISMATCH || \
> >         (_status) == RPC_PROCUNAVAIL || \
> >         (_status) == RPC_CANTCONNECT || \
> > +       (_status) == RPC_TIMEDOUT || \
> >         (_status) == RPC_XPRTFAILED)
> >
>
> Please add the following (or similar) note as a comment close to your
> modification in nlm_clnt_call() so the explanation why we do need to
> adjust the default timeout is retained for the future.
>
> > Setting timeout value of NULL rpc to 25 milliseconds instead of the
> > default 25 seconds can make nlm_null_rpc() return RPC_TIMEDOUT after
> > 1575 milliseconds when the UDP port is not reachable:
> > 1575 = 25 + 50 + 100 + 200 + 400 + 800 => 5 retries
>
> In general, this should work, but I'm not sure your suggestion of 25 ms
> (for the initial timeout) and 1.5 sec (as the total timeout) is long
> enough. Assuming the client (Mac OS) rebooted. The reboot is a slow
> operation, so there is no need to be so fast (1.5 sec) in detecting it.
> I think we should push the total timeout to ca 30 seconds (with the
> initial timeout at 0.5 sec) or maybe even more to avoid spurious stale
> clients detection.
>
>
> Thank you.
>
> --
> +-------------------------------------------+
> | Marcel Telka   e-mail:   mar...@telka.sk  |
> |                homepage: http://telka.sk/ |
> |                jabber:   mar...@jabber.sk |
> +-------------------------------------------+
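
For reference, the timeout totals quoted in this thread can be reproduced with a small standalone helper. This is only a sketch, not kernel code: the function name total_wait_ms and the cap_ms parameter are hypothetical and do not appear in the patch. The cap is an assumption added so that the "25 seconds gives 150 seconds" figure (6 attempts of 25 seconds, no doubling) can be modelled alongside the doubling cases.

```c
/*
 * Illustrative only: total time (in milliseconds) before a UDP RPC
 * gives up, assuming the per-attempt timeout doubles after every
 * unanswered attempt and that there are (retries + 1) attempts.
 *
 * cap_ms is a hypothetical knob, not taken from the patch: once the
 * per-attempt timeout has reached cap_ms, it is no longer doubled.
 * Pass 0 to disable the cap.
 */
long
total_wait_ms(long initial_ms, int retries, long cap_ms)
{
	long total = 0;
	long t = initial_ms;
	int attempt;

	for (attempt = 0; attempt <= retries; attempt++) {
		total += t;
		if (cap_ms == 0 || t < cap_ms) {
			t *= 2;
			if (cap_ms != 0 && t > cap_ms)
				t = cap_ms;
		}
	}
	return (total);
}
```

With 5 retries and no cap, an initial timeout of 25 ms gives 1575 ms, 200 ms gives 12600 ms (12.6 seconds), and 500 ms gives 31500 ms, which is roughly the 30 seconds suggested above. With an assumed cap of 20000 ms, an initial timeout of 25000 ms is never doubled and gives 150000 ms.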