Re: [SunRay-Users] FOG / failure / load-balancing or no?

Devin Nate Tue, 06 Jul 2010 15:09:29 -0700

Hi Scott;

Thanks for the reply. I completely understand that with no default gateway, all 
the SunRays are going to be disconnected. The problem is, after the default 
gateway became available again (at the same IP and MAC addrs, and after about 30 
seconds), the SunRays not only didn't reconnect, they flat out 
didn't work. utauthd was throwing errors all over. Only a full reboot of both 
SRSS servers cleared the condition. We couldn't log into the web interface or 
run several SRSS commands.


That problem aside, I'm now gun-shy of the FOG... because all that we ended up 
with in our failure scenario was a fleet of SRSS servers not working. Since we 
can handle the load just fine on one server, we're not doing this except to 
accomplish HA, and if in a failure scenario all that happens is all the FOG 
servers equally become unavailable, we will revert to a manual replication.

What I'm asking is whether other users have had a good HA experience with the 
FOG, or whether they've found that once one FOG member fails, the other members 
do as well (as was our experience).

Thanks,
Devin




-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Nishimura, Scott L (IT 
Solutions)
Sent: Tuesday, July 06, 2010 11:14 AM
To: SunRay-Users mailing list
Subject: Re: [SunRay-Users] FOG / failure / load-balancing or no?

Devin,

   Something similar happened to me:  my default gateway definition
pointed to a middle-man machine instead of the GW itself and, we
theorized, when that middle-man went down, my gateway route vanished.

However, NONE of my users could work.  The curious thing is that some of
your TCs were still able to function.

I don't really see how the FOG is the major factor:  whether or not your
SRSs were in a FOG, if the default GW went down, I would think all TCs
would be hosed.

The other difference is once I added the correct route, everything
started working again.  The persistence of the problem you're
experiencing suggests that a cache somewhere did not get
flushed/updated.


Scott

From: [email protected]
[mailto:[email protected]] On Behalf Of Devin Nate
Sent: Monday, July 05, 2010 4:14 PM
To: SunRay-Users mailing list
Subject: EXTERNAL:[SunRay-Users] FOG / failure / load-balancing or no?

Hi Sun Ray Users;

We are currently in an interesting situation. We've had several bad
failures of the SRSS system, which were rooted in the underlying network
(default gateway would temporarily become overworked and stop
transmitting info). Our existing Sun Ray setup is two SRSS 4.2 boxes on
RHEL 5 living on VMware ESXi 4. Our mode of using SRSS is essentially
kiosk mode for all users which then runs uttsc to a farm of terminal
servers.

Under normal load, either SRSS box can handle all users.

On failure of the def gw, lasting about 30 seconds or less, approx 66%
of users would be unable to insert their key card and get any reaction,
including hot desking if there was a session already alive for them, or
starting a new session. The remaining 25%-33% of users appeared to
mostly be able to work. There are about 200 concurrent users. The worst
is, the failure condition lasts for hours afterwards (no citrix or msrdp
reconnect - just pure hell with most users offline the rest of the day -
if we reboot we turf all users, if we don't we have about 66% angry).

We traced the symptoms to utauthd and utsessiond, but the trail went
cold there. And that could be completely wrong anyhow... my problem is
this:

When server 1 went bad, server 2 in the FOG equally went bad - in fact,
having 2 servers was worse because they both were broken and the
complexity of dealing with 2 broken servers is worse than 1. We cannot
tell if SRSS is just really really really bad at handling a small
network outage (not to justify network outages, but MS-RDP and Citrix
handle it way better, to the point most of our users wouldn't have noticed)
or if whatever "badness" persists in the databases got replicated, or if
it's a bug in linux+srss or something caused by vmware. If FOG
replicates errors to the point of making the systems non-usable, we
don't want to use FOG ... but, FOG is the mechanism Sun has developed to
achieve High Availability.

The impact of this is going to be six figures in rebates, compensation,
lost customer base, and damages, so 'experimentation' isn't even
remotely going to be permitted with live customers. The result was
repeatable over 4 instances, although it was very expensive and not
done on purpose.

We opened a 'Priority #1, Impacting Health Care Providers / medical
emergency' call with Sun/Oracle. I received a call number, and an SLA
time of 48 hours to respond. I received 2 calls from a manager (nice
enough fellow) apologizing that nobody had yet looked at the ticket, and
I haven't heard back since. The point being, in spite of being on a
supported platform, when the emergency hit, the support we were able to
get from Sun was not helpful.

We're subsequently finalizing SRSS 4.2 on two Solaris physical boxes,
running on Sun Hardware. But our question remains... enable FOG or not?

We want to, for the high availability, perhaps turning off load
balancing to make it less complex. BUT

I have 2 servers for a reason, and it's not because I need to spread out
the load. I need to know that if one server goes down, the other won't,
and more importantly, that 1 server won't 'pollute' the second (or
third/fourth/etc) with some bug/condition that makes them all stop
working. I can manually sync the servers nightly and have a 'manual'
failover that's more reliable than this.

Thoughts? Insights?

Thanks,
Devin



_______________________________________________
SunRay-Users mailing list
[email protected]
http://www.filibeto.org/mailman/listinfo/sunray-users