Hi James,

It is already well known that routes come and go.
It is already well known that the way to determine
whether a destination is reachable is to attempt to
contact that destination.
That is NOT the issue that I am raising.

We have seen the following PROBLEM.
Our code has a dependency upon Solaris network routing.
After SMF reports that Solaris network routing initialization
has begun, we attempt to contact the quorum device.
That attempt fails. We wait and retry.
ALL SUBSEQUENT retries fail!!!
If we instead make the code sleep long enough for Solaris routing to
complete initialization before the first attempt, then even after
a failed attempt to connect, retries succeed as soon as the route
becomes available. The problem is that Solaris routing goes into
an error state when we attempt to connect before it is ready.
SMF starts services as soon as the service dependencies are satisfied,
so we can and do attempt our first connection before
Solaris routing is really ready!

We are not asking for an indication as to when a route is present.
We want to know when we can attempt to establish a connection
without Solaris routing going into an error state that
causes all subsequent attempts to connect to fail.

We have found another recovery method for this problem.
We do not just retry the connection.
We destroy all network data structures (the socket) before retrying.
This clears the bad state. Retries then eventually succeed.
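
A minimal sketch of that recovery approach, for illustration only. It
assumes a TCP connection, and the function name, parameters, and timeout
values are all hypothetical; this is not the Sun Cluster code. The point
it shows is that every attempt gets a newly created socket, and a failed
socket is closed rather than reused.

```python
import socket
import time


def connect_with_fresh_sockets(host, port, deadline_s=2.0,
                               attempt_timeout_s=0.5, retry_delay_s=0.1):
    """Try to connect until deadline_s elapses, using a new socket per attempt.

    Returns a connected socket, or None if the deadline passes first.
    All names and default values here are illustrative assumptions.
    """
    deadline = time.monotonic() + deadline_s
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return None
        # Create a brand-new socket for every attempt so that no state
        # left behind by a failed attempt carries over to the next one.
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(min(attempt_timeout_s, remaining))
        try:
            sock.connect((host, port))
            sock.settimeout(None)  # back to blocking mode for the caller
            return sock
        except OSError:
            # Destroy the socket instead of retrying on the same one.
            sock.close()
        # Short pause before the next attempt, never past the deadline.
        time.sleep(min(retry_delay_s, max(0.0, deadline - time.monotonic())))
```

A caller would treat a None result as "quorum device unreachable within
the time limit" and handle it as a failure.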


James Carlson wrote:
> Ellard Roush writes:
>> Thanks for explaining about how the routing situation changes dynamically.
>> However, we have been aware of that for a long time.
>> Sun Cluster (SC) is a High Availability product.
>> We have customers that want recovery to occur in less than 2 seconds.
>> While we have not achieved that goal, we are working in that direction.
>> This means that some operations MUST complete very quickly.
>> A late completion of an operation is a failure.
> Understood.
> As a general principle, though, you cannot "demand" that other systems
> do anything you want at any other time.  When networking is involved,
> other independent systems are involved.
> In other words, I think the focus is on the wrong level here.  The
> whole deployment -- the routers, bridges, and other infrastructure
> included -- must be designed to meet your goal, not _just_ this one
> bit of Solaris software.  (And once that's done, the state of routing
> in Solaris may or may not be at issue.)
>> More specifically, when a quorum device is unreachable for substantial
>> periods of time, the unreachable quorum device is in a failed state
>> as far as we are concerned. This is true even when the device
>> might be reachable 60 seconds from now. The administrator
>> must configure a quorum device that can be reached reliably
>> in a short time period.
> The solution is easy at this level: send a packet.  If you get a
> sensible response, then that system is in fact reachable.  If you
> don't get a sensible response within the time constraint that you've
> set for yourself, then it's not.
> That's really the only information available.
>> The current SMF information does not even tell us when the Solaris
>> routing software can even accept attempts to communicate.
> That's correct.  As I've already outlined *it doesn't know* and (more
> importantly) *it cannot in principle know*.
> Or, if you prefer: it always accepts attempts to communicate.  It just
> won't always be successful in those attempts.
>> We already
>> know that the attempts can fail. Before the routing software in
>> Solaris is ready, all attempts to communicate will fail.
>> We just want to know when it is safe to try.
>> We are not asking for a dependency upon when a specific route is present.
>> We know that is not possible.
>> We have encountered problems when an attempt is made before
>> the routing software is ready.
>> We want to access the quorum device as soon as we can for
>> quicker recovery, but no sooner than can be achieved reliably.
> There's just no general solution to the problem.
> If the only thing you care about is whether routing has established a
> route to "somewhere," then (as I mentioned before) you can listen to a
> routing socket to observe the resulting RTM_ADD.  I don't think
> that'll actually help you in your quest, but it's certainly doable and
> answers the immediate (and I think improperly formed) question of when
> "routing software in Solaris is 'ready'."  For some value of "ready,"
> at least.
> There is simply *NO WAY* that the system can tell you a priori whether
> an attempt to transmit a packet will actually result in that packet
> being sent from the system (ARP can still fail and Spanning Tree can
> disable ports silently) or whether delivery is possible.
> Only sending data can do that, and only then in retrospect.  If you
> get an answer, then it must have worked.
> I strongly disagree that we should be offering any sort of "routing is
> ready" checkpoint or SMF dependency.  It'd be misleading at best, and
> would result in a new class of unsolvable failure modes.
zones-discuss mailing list
