[SunRay-Users] RE: Not having much luck with failover groups in 3.1b

Ellis, Mike Mon, 25 Jul 2005 22:16:23 -0700

So Otto comes through with some serious info! (see below)

  I'm posting what I've found so far in an effort to assist others going
down this path....

There is good news and bad news (for me) here...

Basically, it looks like what I wanted to do (place Sun Ray servers
geographically close to their customers, and then use a failover group
between them) is going to be a problem, since ut-failover groups have
a "requirement" of a shared local subnet between them...

[ Personally I think that this restriction REALLY limits
large-scale/distributed deployments, but maybe that's just me... I
also see Otto's point as to why this requirement exists (more on
that later). ]

--

What I think is REALLY interesting (interesting being BAD in this case)
is the whole multicast business.... Per Otto's suggestion I upped the
multicast TTL to something higher than the hop count traceroute
reported. That APPEARED to work at first, but then not so much... It is
flaky... I changed my multicast address from ...101 to ...102 to see if
that made a difference... (it didn't).

Then I ran some "snoop 224.101.101.102" captures on both servers, and the
weird thing is that I see BOTH servers send out the multicast data...

(scrubbed example below)

sunray1 -> 224.101.101.102 IP  D=224.101.101.102 S=172.26.22.75 LEN=28,
ID=36394, TOS=0x0, TTL=1
sunray1 -> 224.101.101.102 IP  D=224.101.101.102 S=172.26.22.75 LEN=28,
ID=36395, TOS=0x0, TTL=1
sunray1 -> 224.101.101.102 UDP D=7009 S=7009 LEN=259
sunray1 -> 224.101.101.102 IP  D=224.101.101.102 S=172.26.22.75 LEN=28,
ID=36397, TOS=0x0, TTL=1
sunray1 -> 224.101.101.102 UDP D=7009 S=7009 LEN=259
sunray1 -> 224.101.101.102 UDP D=7009 S=7009 LEN=259
sunray1 -> 224.101.101.102 IP  D=224.101.101.102 S=172.26.22.75 LEN=28,
ID=36400, TOS=0x0, TTL=1
sunray1 -> 224.101.101.102 UDP D=7009 S=7009 LEN=259
sunray1 -> 224.101.101.102 UDP D=7009 S=7009 LEN=259

The problem seems to be that these packets don't ALWAYS make it to the
remote server, and the remote server's packets don't ALWAYS make it to
the local... Very odd... (but sometimes you DO see a packet come
through... )
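
One way to compare what each side sends with what the other side
actually receives is to tally the snoop summary lines per source and
protocol on both servers. A minimal sketch in Python, assuming the line
layout shown in the scrubbed example above (the `tally` helper is
hypothetical and not part of the Sun Ray software):

```python
import re
from collections import Counter

# Matches snoop summary lines such as
#   "sunray1 -> 224.101.101.102 UDP D=7009 S=7009 LEN=259"
# Wrapped continuation lines ("ID=..., TOS=..., TTL=1") do not match
# and are skipped.
LINE_RE = re.compile(r"^\s*(\S+)\s+->\s+(\S+)\s+(\S+)\s")

def tally(snoop_lines, group="224.101.101.102"):
    """Count packets per (source, protocol) sent to the multicast group."""
    counts = Counter()
    for line in snoop_lines:
        m = LINE_RE.match(line)
        if m and m.group(2) == group:
            counts[(m.group(1), m.group(3))] += 1
    return counts

sample = [
    "sunray1 -> 224.101.101.102 IP  D=224.101.101.102 S=172.26.22.75 LEN=28,",
    "ID=36394, TOS=0x0, TTL=1",
    "sunray1 -> 224.101.101.102 UDP D=7009 S=7009 LEN=259",
    "sunray1 -> 224.101.101.102 UDP D=7009 S=7009 LEN=259",
]

for (src, proto), n in sorted(tally(sample).items()):
    print(src, proto, n)
```

Running the same tally over captures taken at the same time on both
servers would show how many of one side's UDP announcements the other
side saw.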

Therefore utgstatus seems more-or-less OK at one point, and confused at
other points. Running utrestart on either side doesn't seem to help
matters...

(mind you the TCP/IP-UDP network between the 2 hosts is just dandy! No
packet-loss or any weirdness like that...)

The multicasts, on the other hand, seem to disappear... Maybe they have
some miserable QoS across the WAN or something? I guess UDP multicasts
can get whacked on busy networks, perhaps if a multicast IP could be
used? unsure.......

--

From the snoop output above, it would seem that only the IP packets
have a TTL of 1... Not sure what that's all about, but per Otto it's the
UDP ones that really count. (For those a TTL isn't listed, although
it must be higher than 1, since otherwise they wouldn't make it to the
remote host.... Ever...)

--

So I'm wondering if I'm running into the "switch-based" multicast
"flakiness" Otto mentioned... If someone knows of a setting/tunable
of any kind to set on our switches/routers, please point me in the
right direction. (Our switch/router vendor's name rhymes with
Crisco.... :-)

--

Lastly, the Sun Ray replication portion of all of this (data between the
various LDAP stores) seems to work great... Any thoughts about using
THAT same framework for heartbeat/status info? Say every 20 seconds or
so, just like the multicast is supposed to work. [ Perhaps as a
fallback in cases where the multicast piece acts up? ]

Thanks, and I'll post a summary in the next few days assuming I get
somewhere.

 -- MikeE

"Ellis, Mike" <Mike.Ellis at fmr.com> wrote:
> Using "lan network" to get to servers, so servers don't provide
> DHCP...
> Servers are on different subnets from each other

Officially that's an unsupported configuration.  Hosts belonging
to the same failover group are supposed to have at least one subnet
in common.

> and from the DTUs.
> Servers don't have/need access to each others IP-ranges or anything
> like that... (again, since they aren't providing dhcp services).
>
> [...]
> If I do a "./utuser -l" on the secondary, I *see* the
> users/smart-cards that were created on the primary (only) which I
> think means that replication (at least in some basic form) is working.

Yes, sounds like replication is fine.  However, replication plays no
role in group membership.  It's true that you'll practically always 
want to configure replication across all of the members of a group
but that's only because it's the easiest way to retain your sanity.

> The problem appears to be that the servers in question do NOT seem to
> like/TRUST each other. ./utgstatus does NOT show anything but the
> server you run the command on...

'utgstatus' shows all servers that can be contacted by the Sun Ray
group membership protocol, regardless of whether those servers are 
considered to be trusted or not.  If no other servers are showing up
in the listing then the servers aren't even seeing each other.  They're 
not even getting a chance to go to the stage beyond that where they 
decide whether they trust each other.

> The "server-selection" login screen also does NOT show the other
> server in failover group....

That dialogue would only show trusted hosts that were visible to
'utgstatus', so its being empty is consistent with the 'utgstatus' 
output being empty.

> In /etc/opt/SUNWut/auth.props there are a couple of parameters that
> might be of interest...
> -- enableGroupManager = true   was still commented out... Is that
> normal?

That's normal.  The commented-out lines in that file indicate the 
default settings.
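
To see which settings are still at their defaults and which have been
overridden, the file can be scanned for both kinds of line. A rough
sketch in Python, assuming (not verified against a real auth.props)
that defaults appear as "# key = value" and overrides as "key = value";
the `effective_settings` helper and the override value 4 are
illustrative only:

```python
import re

# Assumed line shapes, not checked against a real auth.props:
#   "# multicastTTL = 1"   -> commented-out default
#   "multicastTTL = 4"     -> active override
SETTING_RE = re.compile(r"^\s*(#?)\s*([A-Za-z][\w.]*)\s*=\s*(.*?)\s*$")

def effective_settings(text):
    """Return {key: (value, 'default' | 'override')} for a props-style file."""
    settings = {}
    for line in text.splitlines():
        m = SETTING_RE.match(line)
        if not m:
            continue  # blank line or free-form comment
        commented, key, value = m.groups()
        if commented:
            # Keep an override if one was already seen for this key.
            settings.setdefault(key, (value, "default"))
        else:
            settings[key] = (value, "override")
    return settings

sample = """\
# enableGroupManager = true
# multicastTTL = 1
multicastTTL = 4
"""

for key, (value, origin) in sorted(effective_settings(sample).items()):
    print(f"{key} = {value} ({origin})")
```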

> Which log-files might provide insight?

/var/opt/SUNWut/messages, except that if 'utgstatus' shows no other 
servers then it's pretty much guaranteed that no group-membership 
traffic is arriving from the other server so there'll be nothing 
reported in the log.

> (is this a good time to start playing with the unsupported gmDebug
> flag in auth.props? )

Not if there's nothing showing from 'utgstatus'.  All you'll see is
group-membership notifications being sent and none being received.

However, now is an excellent time to play with the 'multicastTTL'
setting in that file.  And perhaps also the 'multicastAddress'
setting too if the default address happens to conflict with any
other multicast traffic already flowing on these subnets.

The group-membership protocol depends (by default) on multicast 
datagrams to announce the presence of a server to any other 
servers in the vicinity.  The default 'multicastTTL' setting of
'1' prevents these datagrams from crossing routers, which probably
explains why your servers aren't seeing each other.  Crank that
setting up to (at least) one greater than the number of routers 
between the servers, then 'utrestart' and see whether 'utgstatus'
starts showing the other machine.  (Before you do that, verify that
'utpolicy' reports '-g' among the policy options.  If that's not
there then no group-membership activities will work at all.)  If
the servers still can't see each other then get out a sniffer and
try to figure out why the multicast traffic isn't making it from
one machine to the other.  One thing to watch for is firewalls:
group-membership traffic is sent via UDP to port 7009.
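
The TTL rule above, and a way to test cross-router multicast
independently of the Sun Ray software, can be sketched in Python. This
is only an illustration under stated assumptions: the helper names are
hypothetical, and a probe like this should use a test group and port of
your own choosing rather than the port the Sun Ray services already
have bound.

```python
import socket
import struct

def required_ttl(router_hops):
    """Smallest TTL that lets a datagram cross `router_hops` routers."""
    if router_hops < 0:
        raise ValueError("router_hops must be non-negative")
    ttl = router_hops + 1
    if ttl > 255:
        raise ValueError("TTL does not fit in the 8-bit IP TTL field")
    return ttl

def make_sender(ttl):
    """UDP socket whose outgoing multicast datagrams carry the given TTL."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL,
                 struct.pack("B", ttl))
    return s

def make_receiver(group, port):
    """UDP socket bound to `port` and joined to multicast `group`."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("", port))
    # Join on the default interface (0.0.0.0 lets the kernel choose).
    mreq = struct.pack("4s4s", socket.inet_aton(group),
                       socket.inet_aton("0.0.0.0"))
    s.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return s
```

On one server, call make_receiver(group, port).recvfrom(1500); on the
other, call make_sender(required_ttl(n)).sendto(b"probe", (group, port))
with n set to the number of routers between the two. If probes arrive
only intermittently, the network gear is a more likely culprit than the
Sun Ray configuration.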

The reason for the requirement that group members share at least
one subnet in common is that sadly it's still not unusual, after 
all this time, for some networking gear to not do the right thing
with multicast traffic.  Sometimes it doesn't work in the first
place, sometimes it works fine for a while and then stops for no
good reason.  If the servers share a common subnet then as a last
resort you can reconfigure them ('enableMulticast' in auth.props)
to send their group-membership announcements as broadcasts rather
than multicasts.  If they don't share a common subnet then you 
have no choice but to depend on cross-subnet multicasts working 
in order for failover groups to work.  That can be a fragile
situation.

Once in a while we discuss supporting point-to-point 
group-membership traffic between configured hosts as another way
of getting around multicast brokenness (and also as a way of
containing traffic propagation in preference to needing large 
multicastTTL settings for distant hosts) but that hasn't made it 
onto the to-do list yet.

OttoM.
__ 
ottomeister

Disclaimer: These are my opinions.  I do not speak for my employer.

-- 
_______________________________________________
SunRay-Users mailing list
[email protected]
http://www.filibeto.org/mailman/listinfo/sunray-users