I don't know if this helps you, but I came across a situation that
caused a problem, and I resolved it.
We have a master DB server and a slave that is replicated from the
master. The master is DB1 and the slave is DB3.
We have a load-balancing system that uses variables for the datasource;
I described this in a recent thread.
Our alarms were showing that occasionally Witango was going down
during busy parts of the day. Each alarm hits the same DB on each DB
server sequentially and reports a failure. The alarm reported a
general failure, not which of the DSNs was not working. After
investigating, we found that it was not that Witango was going down,
but that only one of the DSNs was reporting a TCP/IP communication
failure.
So, to track this down better, we split our 6 alarms (6 checks for 6
Witango servers, each checking both DSNs at once) into 12 alarms: 1
alarm per DSN for each of the 6 Witango servers. We did this to help
FIND the problem.
The weird thing is that this helped ELIMINATE the problem. For some
reason, Witango did not like hitting these 2 DSNs sequentially in the
same TAF during heavy traffic periods. Just separating the alarms, so
that one DSN is hit on one request and the other on another, almost
eliminated the problem.
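(Just to make the split concrete, here is a rough sketch of a per-DSN
probe. This is not our actual alarm code, which is a Witango TAF; it
assumes the pyodbc package, and the DSN names are made up.)

    # One probe per DSN, so a failure report names the exact DSN,
    # instead of one probe that hits both DSNs back to back.
    # DSN names below are placeholders.
    import pyodbc

    DSNS = ["db1_dsn", "db3_dsn"]

    def check_dsn(dsn):
        """Open a fresh connection, run a trivial query, report pass/fail."""
        try:
            conn = pyodbc.connect("DSN=" + dsn, timeout=5)
            conn.cursor().execute("SELECT 1").fetchone()
            conn.close()
            return True
        except pyodbc.Error as err:
            print(dsn, "TCP/IP or ODBC failure:", err)
            return False

    if __name__ == "__main__":
        for dsn in DSNS:
            print(dsn, "OK" if check_dsn(dsn) else "FAILED")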
A few more notes about this.
The DB3 DSN was the only one that ever failed. I am not sure why, but
it should also be noted that in these alarms it was always the second
one hit in the sequence.
DB1 and DB3 are identical in every way, hardware and software.
The last note is a bit more complicated, but I am including it to
possibly give you more things to think about (as if you need any more).
When we moved up to Witango 5.5, we did tons of testing first and
found it was much more reliable than 5.0, and it eliminated many
instabilities. I did notice that Witango would still get into a
situation under extremely heavy loads where it would stop serving
requests, but instead of crashing, it would hang for a few seconds
and recover. You can search the archives where I made these
observations and included pictures of the Task Manager showing this
occurring.
In order to further minimize this, we worked heavily with a PrimeBase
engineer to tune the PrimeBase ODBC driver and connection DLLs. We
found the only thing that completely eliminated it was to add debug
code that slowed down the time between simultaneous requests. We used
this for a while, and it didn't affect Witango performance, except on
TAFs that would loop and do sequential inserts or updates.
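(The gist of that debug code, as I understood it, was simply enforcing
a minimum gap between simultaneous requests. A toy sketch of that kind
of throttle, in Python rather than the actual driver/DLL code, with a
made-up interval:)

    # Toy throttle: a shared lock plus a minimum interval so two requests
    # never hit the driver at exactly the same instant. The 50 ms gap is
    # a made-up number, not what the PrimeBase engineer used.
    import threading
    import time

    _lock = threading.Lock()
    _last_start = 0.0
    MIN_GAP_SECONDS = 0.05

    def throttled(run_query):
        """Run a query, enforcing a small gap since the previous one started."""
        global _last_start
        with _lock:
            wait = MIN_GAP_SECONDS - (time.monotonic() - _last_start)
            if wait > 0:
                time.sleep(wait)
            _last_start = time.monotonic()
        return run_query()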
With this setup, there was virtually ZERO downtime due to Witango.
Later, we started the replication system and did some major hardware
upgrades. Performance was great, but we experienced these TCP/IP
disconnects, where all 6 servers would go down at once. We replaced
and reconfigured switches and tried many things; in the end, it was
the NICs in the Witango machines. We were using server-class NICs,
but they used the Marvell Yukon chipset. We replaced them with Intel
NICs, and the problem went away.
Now, on to a month or so ago: PrimeBase came out with a major update
that was supposed to really increase performance. When we applied the
update, we started having the TCP/IP disconnects again, but just on
DB3, which is what I started this email about. It happened only about
once a day and affected 3-5 Witango servers simultaneously. Just
restarting the Witango service to reinitiate the connection would
restore functionality.
By changing the alarms so they do not hit one DB and then the other
sequentially, the problem almost went away.
One last thing: the motherboards in DB1 and DB3 are Socket 939
Gigabyte motherboards with dual gigabit Ethernet. The two Ethernet
chips are not the same, and the one being used primarily was a
Marvell Yukon chip. These servers are running Linux (Fedora Core 4)
with the Ethernet driver compiled into the kernel for serious
performance. We switched the primary Ethernet traffic to go through
the NON-Yukon chip, and now we are back to ZERO problems. The servers
have not burped on a single request since.
I don't know how much of that is going to help you, if at all. I hope
it does; I know how painful it is chasing these things down, and as
in my case, it may be a combination of several things.
One thing I would try: instead of having a single TAF that hits the 2
DSNs in sequence, make 2 TAFs, call them in sequence separately, and
see if you get the same errors.
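(In other words, something along these lines: hit each DSN from its
own request and see which one errors. The URLs below are placeholders,
not your real TAF names.)

    # Call two hypothetical TAFs as separate requests, one per DSN,
    # instead of one TAF that touches both DSNs in sequence.
    import urllib.request

    CHECKS = [
        "http://your-witango-server/check_db1.taf",  # hits only the DB1 DSN
        "http://your-witango-server/check_db3.taf",  # hits only the DB3 DSN
    ]

    for url in CHECKS:
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                print(url, resp.status)
        except Exception as err:
            print(url, "FAILED:", err)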
Anyway, hope that helps.
--
Robert Garcia
President - BigHead Technology
VP Application Development - eventpix.com
13653 West Park Dr
Magalia, Ca 95954
ph: 530.645.4040 x222 fax: 530.645.4040
[EMAIL PROTECTED] - [EMAIL PROTECTED]
http://bighead.net/ - http://eventpix.com/
On Jan 25, 2006, at 10:46 AM, Dave Machin wrote:
Thanks for the feedback.
We're still stumped. We've confirmed that the ODBC variable is set
correctly before and after our two test queries, and yet one of the
two (or sometimes both) executes against the wrong data source.
Queries that happen later or earlier in the request execute against
the correct data source.
I've compiled some notes on one example from this morning; if anyone
has some time, please take a look and see what we're seeing.
You can download the document here:
www.benchmarkportal.com/witango_error_notes.zip
In short, we're showing that queries execute correctly at first; the
user variable is correct before the query in question; the first test
query then executes against the wrong data source; the second test
query executes against the correct data source; and then the next
query executes correctly as well.
This happens in 5.5.009 on three different servers. The current
production server is a new, clean build with the latest updates and
the latest MDAC drivers. This .taf does not use the user reference
argument, and relies on the cookie. What's confusing is that one
query in the middle of a sequence of ten goes wrong and the others
work fine - all in the same request.
We're going to downgrade from 5.5 back to 5.0 to see if the problem
goes away, because we can't find anything else to change.
----- Original Message -----
From: "Customer Support" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Tuesday, January 24, 2006 5:32 PM
Subject: Re: Witango-Talk: Select 1 From xxx where 1=0
Is it unusual that our production server doesn't issue the
heartbeat query before each DB action?
Yes. It may be going to another DB or DB server, as it is based on
the datasource and tables in the DB.
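(For anyone following along: the heartbeat is just the query in the
subject line. It returns zero rows, but it fails fast if the
connection or datasource is wrong. A rough sketch of the pattern,
assuming pyodbc and placeholder DSN/table names:)

    # Sketch of the heartbeat pattern: before the real DB action, issue a
    # cheap query like "SELECT 1 FROM xxx WHERE 1=0". It returns no rows,
    # but it errors immediately if the connection or table is wrong.
    import pyodbc

    conn = pyodbc.connect("DSN=db1_dsn", timeout=5)  # placeholder DSN
    cur = conn.cursor()

    cur.execute("SELECT 1 FROM xxx WHERE 1=0")       # heartbeat: zero rows
    cur.fetchall()

    cur.execute("SELECT COUNT(*) FROM xxx")          # the real DB action
    print(cur.fetchone()[0])
    conn.close()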
On our development machine, I wrote a .taf application that issued
an identical DB action 20 times (just copied and pasted the same
DBMS action over and over). In SQL profiler, when that application
is executed, I see the heartbeat and then the query repeated 20
times in pairs as expected.
But on our production server, we often see cases where there is no
heartbeat query before a DBMS action. Sometimes we see two or
three DB actions execute before we see another heartbeat query.
If you are running the same version of the Witango Server, OS, ODBC,
and DB on all servers and they are exhibiting different behaviors,
then you probably have a difference in the configuration of one or
more of the Witango Server, OS, ODBC, and/or DB processes.
Your problem sounds like TAFs interacting with each other, hard-coded
user references, lost user reference cookies, or a lost user reference
argument. If you do not use user reference arguments and rely on the
user reference cookie, make sure that it is working and has a value,
or the server will fall back to issuing a new user reference cookie.
If you are using iframes or AJAX, check that you do not have one TAF
interacting with another under the same user reference, resetting a
variable when you are not expecting it.
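(Loosely, the failure mode being described works like this. This is
generic illustration code, not Witango internals: if the cookie is
missing, a new user reference is issued and any variables stored under
the old one, such as a datasource override, silently disappear.)

    # Illustration only: sessions keyed by a user reference cookie.
    import uuid

    sessions = {}  # user reference -> user-scoped variables

    def handle_request(cookies):
        ref = cookies.get("userref")
        if ref is None or ref not in sessions:
            ref = uuid.uuid4().hex      # fall back to a new user reference
            sessions[ref] = {}          # fresh, empty variable scope
        dsn = sessions[ref].get("datasource", "default_dsn")
        return ref, dsn

    ref, _ = handle_request({})                 # first request: new reference
    sessions[ref]["datasource"] = "db3_dsn"     # a taf sets an override
    print(handle_request({"userref": ref}))     # cookie sent: override survives
    print(handle_request({}))                   # cookie lost: back to default_dsn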
Witango Support
________________________________________________________________________
TO UNSUBSCRIBE: Go to http://www.witango.com/developer/maillist.taf