Running a packet trace while the problem was occurring let me see that the
problem is that the NFS layer isn't even being involved for 3 minutes due
to the TCP layer retrying for 3 minute before notifying the caller of the
error.

The client sends SYN packets, doesn't get an ACK, and finally times out
after 3 minutes.  The documentation for Linux (the client in this case)
indicates the following default:

tcp_syn_retries (integer; default: 5; since Linux 2.2)
              The maximum number of times initial SYNs for an active TCP
              connection attempt will be retransmitted.  This value should
              not be higher than 255.  The default value is 5, which
              corresponds to approximately 180 seconds.

I think this means that it's not possible to get faster failover w/o
modifying the default values for the TCP layer.  This particular error is
always going to take at least 3 minutes to surface.

Jim

Reply via email to