Thanks for helping us solve this issue.
In our use case, we need the NULL_RPC timeout to be at least lower than the
client LOCK request timeout…otherwise our client Mac processes stall/fail
–that’s the real pain point we need to solve.
We don’t see any use that in which the illumos server must wait/recover
across a Mac reboot. While it certainly seems like a valid use-case, it’s
not one that we expect to encounter. But please note, we’d be more than
happy to test/validate any such scenario if helps.
After thinking more about it, what I’d like to do is make the timeout value
configurable (via mdb) – that way, it can be defaulted to whatever value
you feel serves best for the most affected, and we will be able to adjust
it to our use-case pain points.
Below is our proposed patch, your comments/advices are appreciated.
Thank you,
- Youzhong
-----------------------------------------------------------------------------
diff --git a/usr/src/uts/common/klm/nlm_impl.c
b/usr/src/uts/common/klm/nlm_impl.c
index 7daa30d..c2f7178 100644
--- a/usr/src/uts/common/klm/nlm_impl.c
+++ b/usr/src/uts/common/klm/nlm_impl.c
@@ -124,6 +124,11 @@ krwlock_t lm_lck;
static const struct timeval nlm_rpctv_zero = { 0, 0 };
/*
+ * Initial timeout for NLM NULL RPC
+ */
+static volatile struct timeval nlm_nullrpc_wait = { 0, 200000 };
+
+/*
* List of all Zone globals nlm_globals instences
* linked together.
*/
@@ -527,6 +532,20 @@ nlm_clnt_call(CLIENT *clnt, rpcproc_t procnum,
xdrproc_t xdr_args,
wait = nlm_rpctv_zero;
/*
+ * Default timeout value of 25 seconds can take
+ * nlm_null_rpc() 150 seconds to return RPC_TIMEDOUT
+ * if it uses UDP and the destination port is
+ * unreachable.
+ *
+ * A shorter timeout value, e.g. 200 milliseconds,
+ * will cause nlm_null_rpc() to time out after
+ * 200 * (1 + 2 + 4 + 8 + 16 + 32) = 12.6 seconds
+ * (with retries set to 5)
+ */
+ if (procnum == NLM_NULL)
+ wait = nlm_nullrpc_wait;
+
+ /*
* We need to block signals in case of NLM_CANCEL RPC
* in order to prevent interruption of network RPC
* calls.
diff --git a/usr/src/uts/common/klm/nlm_rpc_handle.c
b/usr/src/uts/common/klm/nlm_rpc_handle.c
index 9ddf568..26397b3 100644
--- a/usr/src/uts/common/klm/nlm_rpc_handle.c
+++ b/usr/src/uts/common/klm/nlm_rpc_handle.c
@@ -55,6 +55,7 @@
(_status) == RPC_PROGVERSMISMATCH || \
(_status) == RPC_PROCUNAVAIL || \
(_status) == RPC_CANTCONNECT || \
+ (_status) == RPC_TIMEDOUT || \
(_status) == RPC_XPRTFAILED)
static struct kmem_cache *nlm_rpch_cache = NULL;
On Tue, Jul 8, 2014 at 2:44 PM, Marcel Telka <[email protected]> wrote:
> Hi,
>
> Nice work. Please find my comments inline.
>
> On Tue, Jul 08, 2014 at 01:43:36PM -0400, Youzhong Yang via
> illumos-developer wrote:
> > I made the following two changes and built a new image, the issue goes
> away.
> >
> > --- a/usr/src/uts/common/klm/nlm_impl.c
> > +++ b/usr/src/uts/common/klm/nlm_impl.c
> > @@ -525,6 +525,12 @@ nlm_clnt_call(CLIENT *clnt, rpcproc_t procnum,
> > xdrproc_t xdr_args,
> > */
> > if (procnum >= NLM_TEST_RES && procnum <= NLM_GRANTED_RES)
> > wait = nlm_rpctv_zero;
> > + if (procnum == NLM_NULL) {
> > + wait.tv_sec = 0;
> > + wait.tv_usec = 25000;
> > + }
> > +
> >
> > --- a/usr/src/uts/common/klm/nlm_rpc_handle.c
> > +++ b/usr/src/uts/common/klm/nlm_rpc_handle.c
> > @@ -55,6 +55,7 @@
> > (_status) == RPC_PROGVERSMISMATCH || \
> > (_status) == RPC_PROCUNAVAIL || \
> > (_status) == RPC_CANTCONNECT || \
> > + (_status) == RPC_TIMEDOUT || \
> > (_status) == RPC_XPRTFAILED)
> >
>
> Please add the following (or similar) note as a comment close to your
> modification in nlm_clnt_call() so the explanation why we do need to
> adjust the
> default timeout is retained for future.
>
> > Setting timeout value of NULL rpc to 25 milliseconds instead of the
> default
> > 25 seconds can make nlm_null_rpc() returns RPC_TIMEDOUT after 1575
> > milliseconds when the UDP port is not reachable:
> > 1575 = 25 + 50 + 100 + 200 + 400 + 800 => 5 retries
>
> In general, this should work, but I'm not sure your suggestion of 25 ms
> (for
> the initial timoeut) and 1.5 sec (as the total timeout) is long enough.
> Assuming the client (Mac OS) rebooted. The reboot is slow operation, so
> there
> is no need to be so fast (1.5 sec) in detecting it. I think we should
> push the
> total timeout to ca 30 seconds (with the initial timeout at 0.5 sec) or
> maybe
> even more to avoid spurious stale clients detection.
>
>
> Thank you.
>
> --
> +-------------------------------------------+
> | Marcel Telka e-mail: [email protected] |
> | homepage: http://telka.sk/ |
> | jabber: [email protected] |
> +-------------------------------------------+
>
-------------------------------------------
smartos-discuss
Archives: https://www.listbox.com/member/archive/184463/=now
RSS Feed: https://www.listbox.com/member/archive/rss/184463/25769125-55cfbc00
Modify Your Subscription:
https://www.listbox.com/member/?member_id=25769125&id_secret=25769125-7688e9fb
Powered by Listbox: http://www.listbox.com