Re: [SunRay-Users] FOG / failure / load-balancing or no?

Nishimura, Scott L (IT Solutions) Tue, 06 Jul 2010 10:15:15 -0700

Devin,

   Something similar happened to me:  my default gateway definition
pointed to a middle-man machine instead of the GW itself and, we
theorized, when that middle-man went down, my gateway route vanished.


However, NONE of my users could work.  The curious thing is that some of
your TCs were still able to function.

I don't really see how the FOG is the major factor:  whether or not your
SRSs were in a FOG, if the default GW went down, I would think all TCs
would be hosed.

The other difference is once I added the correct route, everything
started working again.  The persistence of the problem you're
experiencing suggests that a cache somewhere did not get
flushed/updated.
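
As a rough sketch of how one might check which next hop a Linux box (such
as the RHEL 5 servers in this thread) actually uses for its default
route, the Python below parses text laid out like /proc/net/route. The
function name and the sample table are made up for illustration, it
assumes a little-endian host such as x86, and it is not an SRSS tool.

```python
import socket
import struct

RTF_UP = 0x0001       # route is usable
RTF_GATEWAY = 0x0002  # destination is reached via a gateway


def default_gateways(route_table_text):
    """Return (interface, gateway_ip) pairs for the default routes.

    Expects text in the layout of Linux's /proc/net/route: one header
    line, then whitespace-separated columns Iface, Destination, Gateway,
    Flags, RefCnt, Use, Metric, Mask, ... with addresses written as hex
    in host byte order (assumed little-endian here).
    """
    found = []
    for line in route_table_text.strip().splitlines()[1:]:
        fields = line.split()
        if len(fields) < 8:
            continue  # skip malformed or truncated rows
        iface, dest, gateway, mask = fields[0], fields[1], fields[2], fields[7]
        flags = int(fields[3], 16)
        # A default route has destination 0.0.0.0 and netmask 0.0.0.0.
        if dest != "00000000" or mask != "00000000":
            continue
        if not (flags & RTF_UP and flags & RTF_GATEWAY):
            continue
        ip = socket.inet_ntoa(struct.pack("<L", int(gateway, 16)))
        found.append((iface, ip))
    return found


# Hypothetical sample data, used when /proc/net/route is not available.
SAMPLE = """Iface\tDestination\tGateway\tFlags\tRefCnt\tUse\tMetric\tMask\tMTU\tWindow\tIRTT
eth0\t00000000\t0101A8C0\t0003\t0\t0\t0\t00000000\t0\t0\t0
eth0\t0001A8C0\t00000000\t0001\t0\t0\t0\t00FFFFFF\t0\t0\t0
"""

if __name__ == "__main__":
    try:
        with open("/proc/net/route") as f:
            text = f.read()
    except OSError:
        text = SAMPLE
    for iface, gw in default_gateways(text):
        print(iface, gw)
```

If the address printed is not the real gateway, or no default route is
listed at all, that would match the kind of misconfiguration described
above.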


Scott

From: [email protected]
[mailto:[email protected]] On Behalf Of Devin Nate
Sent: Monday, July 05, 2010 4:14 PM
To: SunRay-Users mailing list
Subject: EXTERNAL:[SunRay-Users] FOG / failure / load-balancing or no?

Hi Sun Ray Users;

We are currently in an interesting situation. We've had several bad
failures of the SRSS system, which were rooted in the underlying network
(the default gateway would temporarily become overworked and stop
transmitting info). Our existing Sun Ray setup is two SRSS 4.2 boxes on
RHEL 5 living on VMware ESXi 4. Our mode of using SRSS is essentially
kiosk mode for all users, which then runs uttsc to a farm of terminal
servers.

Under normal load, either SRSS box can handle all users.

On failure of the def gw, lasting about 30 seconds or less, approx 66%
of users would be unable to insert their key card and get any reaction,
including hot desking if there was a session already alive for them, or
starting a new session. The remaining 25%-33% of users appeared to
mostly be able to work. There are about 200 concurrent users. The worst
is, the failure condition lasts for hours afterwards (no citrix or msrdp
reconnect - just pure hell with most users offline the rest of the day -
if we reboot we turf all users, if we don't we have about 66% angry).

We traced the symptoms to utauthd and utsessiond, but the trail went
cold there. And that could be completely wrong anyhow... my problem is
this:

When server 1 went bad, server 2 in the FOG equally went bad - in fact,
having 2 servers was worse because they both were broken and the
complexity of dealing with 2 broken servers is worse than 1. We cannot
tell if SRSS is just really really really bad at handling a small
network outage (not to justify network outages, but MS-RDP and Citrix
handle way better to the point most of our users wouldn't have noticed)
or if whatever "badness" persists in the databases got replicated, or if
it's a bug in linux+srss or something caused by vmware. If FOG
replicates errors to the point of making the systems non-usable, we
don't want to use FOG ... but, FOG is the mechanism Sun has developed to
achieve High Availability.

The impact of this is going to be six figures in rebates, compensation,
lost customer base, and damages, so 'experimentation' isn't even
remotely going to be permitted with live customers. The result was
repeatable over 4 instances, although that was very expensive and not
done on purpose.

We opened a 'Priority #1, Impacting Health Care Providers / medical
emergency' call with Sun/Oracle. I received a call number, and an SLA
response time of 48 hours. I received 2 calls from a manager (nice
enough fellow) apologizing that nobody had yet looked at the ticket, and
I haven't heard back since. The point being, in spite of being on a
supported platform, when the emergency hit, the support we got from Sun
was not helpful.

We're subsequently finalizing SRSS 4.2 on two Solaris physical boxes,
running on Sun Hardware. But our question remains... enable FOG or not?

We want to, for the high availability, and we're thinking of perhaps
turning off load balancing to make it less complex. BUT

I have 2 servers for a reason, and it's not because I need to spread out
the load. I need to know that if one server goes down, the other won't,
and more importantly, that 1 server won't 'pollute' the second (or
third/fourth/etc) with some bug/condition that makes them all stop
working. I can manually sync the servers nightly and have a 'manual'
failover that's more reliable than this.

Thoughts? Insights?

Thanks,
Devin



_______________________________________________
SunRay-Users mailing list
[email protected]
http://www.filibeto.org/mailman/listinfo/sunray-users
