Ellard Roush writes:
> Thanks for explaining about how the routing situation changes dynamically.
> However, we have been aware of that for a long time.
> Sun Cluster (SC) is a High Availability product.
> We have customers that want recovery to occur in less than 2 seconds.
> While we have not achieved that goal, we are working in that direction.
> This means that some operations MUST complete very quickly.
> A late completion of an operation is a failure.


As a general principle, though, you cannot "demand" that other systems
do what you want at any particular time.  When networking is involved,
other independent systems are involved.

In other words, I think the focus is on the wrong level here.  The
whole deployment -- the routers, bridges, and other infrastructure
included -- must be designed to meet your goal, not _just_ this one
bit of Solaris software.  (And once that's done, the state of routing
in Solaris may or may not be at issue.)

> More specifically, when a quorum device is unreachable for substantial
> periods of time, the unreachable quorum device is in a failed state
> as far as we are concerned. This is true even when the device
> might be reachable 60 seconds from now. The administrator
> must configure a quorum device that can be reached reliably
> in a short time period.

The solution is easy at this level: send a packet.  If you get a
sensible response, then that system is in fact reachable.  If you
don't get a sensible response within the time constraint that you've
set for yourself, then it's not.

That's really the only information available.
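
As a rough illustration of that probe, here is a minimal Python sketch.
It assumes the target exposes some TCP service and treats a completed
TCP handshake as the "sensible response."  The function name `probe`,
the address, the port, and the deadline are hypothetical placeholders,
not anything taken from Sun Cluster.

```python
import socket
import time


def probe(host, port, deadline_s):
    """Try one TCP connection; return (reachable, elapsed_seconds).

    Assumption: a completed handshake within deadline_s counts as a
    "sensible response". A refusal, timeout, or any other socket error
    counts as "not reachable within the time constraint".
    """
    start = time.monotonic()
    try:
        # create_connection() applies the timeout to the connect attempt.
        with socket.create_connection((host, port), timeout=deadline_s):
            reachable = True
    except OSError:
        # Covers timeouts, refusals, unreachable networks, lookup errors.
        reachable = False
    return reachable, time.monotonic() - start
```

A caller would use it along the lines of
`reachable, elapsed = probe("192.0.2.10", 3260, 2.0)`.  The result only
describes the attempt that just finished; it says nothing about whether
the next one will succeed.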

> The current SMF information does not even tell us when the Solaris
> routing software can even accept attempts to communicate.

That's correct.  As I've already outlined, *it doesn't know* and (more
importantly) *it cannot in principle know*.

Or, if you prefer: it always accepts attempts to communicate.  It just
won't always be successful in those attempts.

> We already
> know that the attempts can fail. Before the routing software in
> Solaris is ready, all attempts to communicate will fail.
> We just want to know when it is safe to try.
> We are not asking for a dependency upon when a specific route is present.
> We know that is not possible.
> We have encountered problems when an attempt is made before
> the routing software is ready.
> We want to access the quorum device as soon as we can for
> quicker recovery, but no sooner than can be achieved reliably.

There's just no general solution to the problem.

If the only thing you care about is whether routing has established a
route to "somewhere," then (as I mentioned before) you can listen to a
routing socket to observe the resulting RTM_ADD.  I don't think
that'll actually help you in your quest, but it's certainly doable and
answers the immediate (and I think improperly formed) question of when
"routing software in Solaris is 'ready'."  For some value of "ready,"
at least.
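
For what it's worth, watching for RTM_ADD could look roughly like the
Python sketch below.  It assumes a BSD-style routing socket, as on
Solaris, where each message starts with an rt_msghdr whose first fields
are rtm_msglen, rtm_version, and rtm_type; it also assumes the Python
build exposes `socket.AF_ROUTE`, and that the process is allowed to open
the socket.  The function names are hypothetical.  Linux uses netlink
with a different message format, so the listener refuses to run there.
Seeing an RTM_ADD only means some route was added; it does not show that
any particular destination is reachable.

```python
import socket
import struct
import sys
import time

RTM_ADD = 0x1  # "route added" message type in BSD-style <net/route.h>


def is_route_add(msg):
    """Return True if msg begins with an rt_msghdr of type RTM_ADD."""
    if len(msg) < 4:
        return False
    # Header starts: u_short rtm_msglen; u_char rtm_version; u_char rtm_type
    _msglen, _version, rtm_type = struct.unpack_from("=HBB", msg)
    return rtm_type == RTM_ADD


def wait_for_route_add(timeout_s):
    """Wait up to timeout_s for an RTM_ADD; return True if one was seen."""
    if sys.platform.startswith("linux") or not hasattr(socket, "AF_ROUTE"):
        raise OSError("no BSD-style routing socket on this platform")
    deadline = time.monotonic() + timeout_s
    with socket.socket(socket.AF_ROUTE, socket.SOCK_RAW, 0) as rs:
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return False
            rs.settimeout(remaining)
            try:
                msg = rs.recv(2048)
            except socket.timeout:
                return False
            if is_route_add(msg):
                return True
```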

There is simply *NO WAY* that the system can tell you a priori whether
an attempt to transmit a packet will actually result in that packet
being sent from the system (ARP can still fail and Spanning Tree can
disable ports silently) or whether delivery is possible.

Only sending data can do that, and only then in retrospect.  If you
get an answer, then it must have worked.

I strongly disagree that we should be offering any sort of "routing is
ready" checkpoint or SMF dependency.  It'd be misleading at best, and
would result in a new class of unsolvable failure modes.

James Carlson, Solaris Networking              <[EMAIL PROTECTED]>
Sun Microsystems / 35 Network Drive        71.232W   Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757   42.496N   Fax +1 781 442 1677
zones-discuss mailing list
