Devin,

Something similar happened to me: my default gateway definition pointed to a middle-man machine instead of the GW itself and, we theorized, when that middle-man went down, my gateway route vanished.
However, NONE of my users could work. The curious thing is that some of your TCs were still able to function. I don't really see how the FOG is the major factor: whether or not your SRSs were in a FOG, if the default GW went down, I would think all TCs would be hosed.

The other difference is that once I added the correct route, everything started working again. The persistence of the problem you're experiencing suggests that a cache somewhere did not get flushed/updated.

Scott

From: [email protected] [mailto:[email protected]] On Behalf Of Devin Nate
Sent: Monday, July 05, 2010 4:14 PM
To: SunRay-Users mailing list
Subject: EXTERNAL:[SunRay-Users] FOG / failure / load-balancing or no?

Hi Sun Ray Users;

We are currently in an interesting situation. We've had several bad failures of the SRSS system, which were rooted in the underlying network (the default gateway would temporarily become overworked and stop transmitting info).

Our existing Sun Ray setup is two SRSS 4.2 boxes on RHEL 5 living on VMware ESXi 4. Our mode of using SRSS is essentially kiosk mode for all users, which then runs uttsc to a farm of terminal servers. Under normal load, either SRSS box can handle all users.

On failure of the default gateway, lasting about 30 seconds or less, approx 66% of users would be unable to insert their key card and get any reaction, including hot desking if there was a session already alive for them, or starting a new session. The remaining 25%-33% of users appeared to mostly be able to work. There are about 200 concurrent users.

The worst is, the failure condition lasts for hours afterwards (no Citrix or MS-RDP reconnect - just pure hell with most users offline the rest of the day - if we reboot we turf all users, if we don't we have about 66% angry).

We traced the symptoms to utauthd and utsessiond, but the trail went cold there. And that could be completely wrong anyhow...
My problem is this: When server 1 went bad, server 2 in the FOG equally went bad - in fact, having 2 servers was worse because they both were broken, and the complexity of dealing with 2 broken servers is worse than 1. We cannot tell if SRSS is just really really really bad at handling a small network outage (not to justify network outages, but MS-RDP and Citrix handle it way better, to the point most of our users wouldn't have noticed), or if whatever "badness" persists in the databases got replicated, or if it's a bug in Linux+SRSS or something caused by VMware.

If FOG replicates errors to the point of making the systems non-usable, we don't want to use FOG ... but FOG is the mechanism Sun has developed to achieve High Availability. The impact of this is going to be six figures in rebates, compensation, lost customer base, and damages, so 'experimentation' isn't even remotely going to be permitted with live customers. The result was repeatable over 4 instances, although it was very expensive and not done on purpose.

We opened a 'Priority #1, Impacting Health Care Providers / medical emergency' call with Sun/Oracle. I received a call number and an SLA time of 48 hours to respond. I received 2 calls from a manager (nice enough fellow) apologizing that nobody had yet looked at the ticket, and I haven't heard back since. The point being, in spite of being on a supported platform, when the emergency hit, the support we were able to get from Sun was not helpful.

We're subsequently finalizing SRSS 4.2 on two physical Solaris boxes, running on Sun hardware. But our question remains... enable FOG or not? We want to, for the high availability, and are thinking of perhaps turning off load balancing to make it less complex. BUT I have 2 servers for a reason, and it's not because I need to spread out the load.
I need to know that if one server goes down, the other won't, and more importantly, that 1 server won't 'pollute' the second (or third/fourth/etc.) with some bug/condition that makes them all stop working. I can manually sync the servers nightly and have a 'manual' failover that's more reliable than this.

Thoughts? Insights?

Thanks,
Devin

_______________________________________________
SunRay-Users mailing list
[email protected]
http://www.filibeto.org/mailman/listinfo/sunray-users
