Interesting on your SRs. We found that once in the error condition, which was created when the default gw stopped transmitting packets for even 30 seconds, rebooting SunRay DTUs had no impact or, as you wrote, made it worse. We were unable to:
1. Reboot SunRays to re-start correct use.
2. Disconnect user or DTU sessions to re-start correct use.
2a. In many cases, log into the Web Interface.
3. Re-insert cards/de-insert/use pristine, never-used-before cards, etc.
4. Summary: nothing short of rebooting the SRS servers allowed the system to work again.
4a. Repeated efforts of all the above would not work.
4b. utrestart wouldn't do it either.

We did get socket and memory errors out of utauthd. Restarting utauthd had intermittent results, sometimes causing utsessiond to get upset.

But back to the question about the FOG: you like it... do you prefer load-balancing enabled or disabled?

Any thoughts from other members of Sun Ray Users?

Thanks,
Devin

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Nishimura, Scott L (IT Solutions)
Sent: Tuesday, July 06, 2010 4:24 PM
To: SunRay-Users mailing list
Subject: Re: [SunRay-Users] FOG / failure / load-balancing or no?

Devin,

Gotcha. My experience with the SRS FOG has been good. I think you ran into one situation where a FOG didn't help you, but there are many scenarios where a FOG would help. And I don't think the FOG hurt you.

Here are my notes from when a similar problem happened to me:

Crashing utauthd: 2009/04/0...@04:07, utauthd started reporting errors on both shop floor SRSs and an unrelated one. The machines were also not pingable for minutes at a time. When a machine would get into this state, all TCs connected to it would reboot. Rebooting the SRSs fixed the problem. SR 70878188 JC found no switch problems. Blane from Sun is now thinking it could be a network problem [TCP connection between the auth managers on the servers and the DTUs] but there's no conclusive corroborating evidence. #71057938 Problem happened again on 2009/05/18 from 06:27 to 07:58. However, this time it's unclear whether the TCs rebooted multiple times, and also, Pinger did not lose track of the SRSs.
However, I did get the message "socket looping limit exceeded.Close it." Network engineer said it was a bad Edge router. This could cause 1 set of TC reboots as traffic moved from one SRS to the other.

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Devin Nate
Sent: Tuesday, July 06, 2010 3:09 PM
To: SunRay-Users mailing list
Subject: EXTERNAL:Re: [SunRay-Users] FOG / failure / load-balancing or no?

Hi Scott;

Thanks for the reply. I completely understand that with no default gateway, all the SunRays are going to be disconnected. The problem is, after the default gateway became available again (at the same IP and MAC addrs, and after about 30 seconds), the SunRays not only didn't reconnect, they flat out didn't work. utauthd was throwing errors all over. Only a full reboot of both SRSS servers cleared the condition. We couldn't log into the web interface or run several SRSS commands.

That problem aside, I'm now gun-shy of the FOG... because all that we ended up with in our failure scenario was a fleet of SRSS servers not working. Since we can handle the load just fine on one server, we're not doing this except to accomplish HA, and if in a failure scenario all that happens is all the FOG servers equally become unavailable, we will revert to a manual replication.

What I'm seeking is whether other users have had a good HA experience with the FOG, or whether they've found that once one FOG member fails, the other members do as well (as was our experience).

Thanks,
Devin

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Nishimura, Scott L (IT Solutions)
Sent: Tuesday, July 06, 2010 11:14 AM
To: SunRay-Users mailing list
Subject: Re: [SunRay-Users] FOG / failure / load-balancing or no?

Devin,

Something similar happened to me: my default gateway definition pointed to a middle-man machine instead of the GW itself and, we theorized, when that middle-man went down, my gateway route vanished.
However, NONE of my users could work. The curious thing is that some of your TCs were still able to function. I don't really see how the FOG is the major factor: whether or not your SRSs were in a FOG, if the default GW went down, I would think all TCs would be hosed. The other difference is that once I added the correct route, everything started working again. The persistence of the problem you're experiencing suggests that a cache somewhere did not get flushed/updated.

Scott

From: [email protected] [mailto:[email protected]] On Behalf Of Devin Nate
Sent: Monday, July 05, 2010 4:14 PM
To: SunRay-Users mailing list
Subject: EXTERNAL:[SunRay-Users] FOG / failure / load-balancing or no?

Hi Sun Ray Users;

We are currently in an interesting situation. We've had several bad failures of the SRSS system, which were rooted in the underlying network (the default gateway would temporarily become overworked and stop transmitting info).

Our existing Sun Ray setup is two SRSS 4.2 boxes on RHEL 5 living on VMware ESXi 4. Our mode of using SRSS is essentially kiosk mode for all users, which then runs uttsc to a farm of terminal servers. Under normal load, either SRSS box can handle all users.

On failure of the def gw, lasting about 30 seconds or less, approx 66% of users would be unable to insert their key card and get any reaction, including hot desking if there was a session already alive for them, or starting a new session. The remaining 25%-33% of users appeared to mostly be able to work. There are about 200 concurrent users. The worst is, the failure condition lasts for hours afterwards (no Citrix or MS-RDP reconnect - just pure hell with most users offline the rest of the day - if we reboot we turf all users, if we don't we have about 66% angry).

We traced the symptoms to utauthd and utsessiond, but the trail went cold there. And that could be completely wrong anyhow...
my problem is this: When server 1 went bad, server 2 in the FOG equally went bad - in fact, having 2 servers was worse because they both were broken, and the complexity of dealing with 2 broken servers is worse than 1. We cannot tell if SRSS is just really really really bad at handling a small network outage (not to justify network outages, but MS-RDP and Citrix handle it way better, to the point most of our users wouldn't have noticed), or if whatever "badness" persists in the databases got replicated, or if it's a bug in Linux+SRSS or something caused by VMware. If FOG replicates errors to the point of making the systems non-usable, we don't want to use FOG ... but FOG is the mechanism Sun has developed to achieve High Availability.

The impact of this is going to be six figures in rebates, compensation, lost customer base, and damages, so 'experimentation' isn't even remotely going to be permitted with live customers. The result was repeatable over 4 instances.... although it was very expensive and not done on purpose.

We opened a 'Priority #1, Impacting Health Care Providers / medical emergency' call with Sun/Oracle. I received a call number, and an SLA time of 48 hours to respond. I received 2 calls from a manager (nice enough fellow) apologizing that nobody had yet looked at the ticket, and haven't heard back. The point being, in spite of being on a supported platform, when the emergency hit, the support we got from Sun was not helpful.

We're subsequently finalizing SRSS 4.2 on two Solaris physical boxes, running on Sun hardware. But our question remains... enable FOG or not? We want to for the high availability, thinking perhaps of turning off load balancing to make it less complex. BUT I have 2 servers for a reason, and it's not because I need to spread out the load.
I need to know that if one server goes down, the other won't, and more importantly, that 1 server won't 'pollute' the second (or third/fourth/etc) with some bug/condition that makes them all stop working. I can manually sync the servers nightly and have a 'manual' failover that's more reliable than this.

Thoughts? Insights?

Thanks,
Devin

_______________________________________________
SunRay-Users mailing list
[email protected]
http://www.filibeto.org/mailman/listinfo/sunray-users
