Devin,
Gotcha. My experience with the SRS FOG has been good. I think you
ran into one situation where a FOG didn't help you but there are many
scenarios where a FOG would help. And I don't think the FOG hurt you.
Here are my notes from when a similar problem happened to me:
Crashing utauthd: 2009/04/0...@04:07, utauthd started reporting errors on
both shop floor SRSs and an unrelated one. The machines were also not
pingable for minutes at a time. When a machine would get into this
state, all TCs connected to it would reboot. Rebooting the SRSs fixed
the problem.
SR 70878188
JC found no switch problems.
Blane from Sun is now thinking it could be a network
problem [TCP connection between the auth managers on the servers and the
DTUs] but there's no conclusive corroborating evidence.
#71057938
Problem happened again on 2009/05/18 from 06:27 to
07:58. However, this time it's unclear whether the TCs rebooted
multiple times and also, Pinger did not lose track of the SRSs.
However, I did get the message "socket looping limit exceeded.Close it."
Network engineer said it was a bad Edge router. This could cause 1 set
of TC reboots as traffic moved from one SRS to the other.
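To help tell "machine is down" apart from "machine is up but the auth manager isn't answering", a small TCP probe can run alongside ping. This is only a rough sketch, not what Pinger does: it assumes utauthd listens on TCP port 7009 (check that against your own install), and the function names are made up for illustration.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def probe(servers, port: int = 7009, timeout: float = 3.0) -> dict:
    """Check every server once and return {host: reachable}.

    Port 7009 is an assumption for the auth manager; override it if
    your servers are configured differently.
    """
    return {host: tcp_reachable(host, port, timeout) for host in servers}
```

Calling probe() with the names of the SRSs in the FOG once a minute from cron and logging the result with a timestamp would give something to line up against the utauthd errors next time.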
-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Devin Nate
Sent: Tuesday, July 06, 2010 3:09 PM
To: SunRay-Users mailing list
Subject: EXTERNAL:Re: [SunRay-Users] FOG / failure / load-balancing or
no?
Hi Scott;
Thanks for the reply. I completely understand that with no default
gateway, all the SunRays are going to be disconnected. The problem is,
after the default gateway became available again (at same IP and MAC
addrs, and after about 30 seconds), the SunRays not only didn't
reconnect, they flat out didn't work. utauthd was throwing
errors all over. Only a full reboot of both SRSS servers cleared the
condition. We couldn't log into the web interface or run several SRSS
commands.
That problem aside, I'm now gun-shy of the FOG... because all that we
ended up with in our failure scenario was a fleet of SRSS servers not
working. Since we can handle the load just fine on one server, we're not
doing this except to accomplish HA, and if in a failure scenario all
that happens is all the FOG servers equally become unavailable, we will
revert to a manual replication.
What I'm seeking is whether other users have had a good HA experience with
the FOG, or whether they've found that once one FOG member fails, the other
members fail as well (as was our experience).
Thanks,
Devin
-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Nishimura, Scott
L (IT Solutions)
Sent: Tuesday, July 06, 2010 11:14 AM
To: SunRay-Users mailing list
Subject: Re: [SunRay-Users] FOG / failure / load-balancing or no?
Devin,
Something similar happened to me: my default gateway definition
pointed to a middle-man machine instead of the GW itself and, we
theorized, when that middle-man went down, my gateway route vanished.
However, NONE of my users could work. The curious thing is that some of
your TCs were still able to function.
I don't really see how the FOG is the major factor: whether or not your
SRSs were in a FOG, if the default GW went down, I would think all TCs
would be hosed.
The other difference is once I added the correct route, everything
started working again. The persistence of the problem you're
experiencing suggests that a cache somewhere did not get
flushed/updated.
Scott
From: [email protected]
[mailto:[email protected]] On Behalf Of Devin Nate
Sent: Monday, July 05, 2010 4:14 PM
To: SunRay-Users mailing list
Subject: EXTERNAL:[SunRay-Users] FOG / failure / load-balancing or no?
Hi Sun Ray Users;
We are currently in an interesting situation. We've had several bad
failures of the SRSS system, which were rooted in the underlying network
(default gateway would temporarily become overworked and stop
transmitting info). Our existing Sun Ray setup is two SRSS 4.2 boxes on
RHEL 5 living on VMware ESXi 4. Our mode of using SRSS is essentially
kiosk mode for all users which then runs uttsc to a farm of terminal
servers.
Under normal load, either SRSS box can handle all users.
On failure of the def gw, lasting about 30 seconds or less, approx 66%
of users would be unable to insert their key card and get any reaction,
including hot desking if there was a session already alive for them, or
starting a new session. The remaining 25%-33% of users appeared to
mostly be able to work. There are about 200 concurrent users. The worst
is, the failure condition lasts for hours afterwards (no citrix or msrdp
reconnect - just pure hell with most users offline the rest of the day -
if we reboot we turf all users, if we don't we have about 66% angry).
We traced the symptoms to utauthd and utsessiond, but the trail went
cold there. And that could be completely wrong anyhow... my problem is
this:
When server 1 went bad, server 2 in the FOG equally went bad - in fact,
having 2 servers was worse because they both were broken and the
complexity of dealing with 2 broken servers is worse than 1. We cannot
tell if SRSS is just really really really bad at handling a small
network outage (not to justify network outages, but MS-RDP and Citrix
handle it way better, to the point most of our users wouldn't have noticed)
or if whatever "badness" persists in the databases got replicated, or if
it's a bug in Linux+SRSS or something caused by VMware. If FOG
replicates errors to the point of making the systems non-usable, we
don't want to use FOG ... but, FOG is the mechanism Sun has developed to
achieve High Availability.
The impact of this is going to be six figures in rebates, compensation,
lost customer base, and damages, so 'experimentation' isn't even
remotely going to be permitted with live customers. The result was
repeatable over 4 instances, although it was very expensive and not
done on purpose.
We opened a 'Priority #1, Impacting Health Care Providers / medical
emergency' call with Sun/Oracle. I received a call number, and an SLA
time of 48 hours to respond. I received 2 calls from a manager (nice
enough fellow) apologizing that nobody had yet looked at the ticket, and
haven't heard back. The point being, in spite of being on a supported
platform, when the emergency hit, the support we got from Sun was
not helpful.
We're subsequently finalizing SRSS 4.2 on two Solaris physical boxes,
running on Sun Hardware. But our question remains... enable FOG or not?
We want to for the high availability, thinking perhaps turning off load
balancing to make it less complex. BUT
I have 2 servers for a reason, and it's not because I need to spread out
the load. I need to know that if one server goes down, the other won't,
and more importantly, that 1 server won't 'pollute' the second (or
third/fourth/etc) with some bug/condition that makes them all stop
working. I can manually sync the servers nightly and have a 'manual'
failover that's more reliable than this.
Thoughts? Insights?
Thanks,
Devin
_______________________________________________
SunRay-Users mailing list
[email protected]
http://www.filibeto.org/mailman/listinfo/sunray-users