Hi Sun Ray Users,

We are currently in an interesting situation. We've had several bad failures of
the SRSS system, which were rooted in the underlying network (the default
gateway would temporarily become overloaded and stop passing traffic). Our
existing Sun Ray setup is two SRSS 4.2 boxes on RHEL 5 running on VMware ESXi 4.
We use SRSS essentially in kiosk mode for all users, and the kiosk session then
runs uttsc against a farm of terminal servers.

Under normal load, either SRSS box can handle all users.

When the default gateway failed, for about 30 seconds or less, approximately 66%
of users would be unable to insert their key card and get any reaction, whether
that meant hot desking to a session already alive for them or starting a new
session. The remaining 25%-33% of users appeared to mostly be able to work.
There are about 200 concurrent users. Worst of all, the failure condition lasts
for hours afterwards (no Citrix or MS-RDP reconnect, just pure hell with most
users offline for the rest of the day; if we reboot we turf all users, if we
don't we have about 66% angry).

We traced the symptoms to utauthd and utsessiond, but the trail went cold 
there. And that could be completely wrong anyhow... my problem is this:

When server 1 went bad, server 2 in the FOG went equally bad. In fact, having
2 servers was worse, because both were broken and dealing with 2 broken servers
is more complex than dealing with 1. We cannot tell whether SRSS is just really,
really bad at handling a small network outage (not to justify network outages,
but MS-RDP and Citrix handle them far better, to the point that most of our
users wouldn't have noticed), or whether whatever "badness" persists in the
databases got replicated, or whether it's a bug in Linux+SRSS or something
caused by VMware. If FOG replicates errors to the point of making the systems
unusable, we don't want to use FOG ... but FOG is the mechanism Sun has
developed to achieve High Availability.

The impact of this is going to be six figures in rebates, compensation, lost
customer base, and damages, so 'experimentation' with live customers isn't even
remotely going to be permitted. The result was repeatable over 4 instances,
although that was very expensive and not done on purpose.

We opened a 'Priority #1, Impacting Health Care Providers / medical emergency'
call with Sun/Oracle. I received a call number and an SLA response time of 48
hours. I received 2 calls from a manager (nice enough fellow) apologizing that
nobody had yet looked at the ticket, and I haven't heard back since. The point
being that, in spite of our being on a supported platform, when the emergency
hit we could not get useful support from Sun.

We're now finalizing SRSS 4.2 on two physical Solaris boxes, running on Sun
hardware. But our question remains... enable FOG or not?

We want to, for the high availability, and we're thinking of perhaps turning off
load balancing to make it less complex. BUT

I have 2 servers for a reason, and it's not because I need to spread out the
load. I need to know that if one server goes down, the other won't, and more
importantly, that 1 server won't 'pollute' the second (or third/fourth/etc.)
with some bug/condition that makes them all stop working. I can manually sync
the servers nightly and have a 'manual' failover that's more reliable than this.

Thoughts? Insights?

Thanks,
Devin