I don't know if this helps you, but I came across a situation that caused a problem, and I resolved it.

We have a master db server, and a slave that is replicated from master. Master is DB1, and slave is DB3.

We have a load balancing system using vars for the datasource. I described this in a recent thread.

Our alarms were showing that witango was occasionally going down during busy parts of the day. Each alarm hit the same db on each db server in sequence, and reported only a general failure, not which of the dsns was not working. After investigating, we found that witango was not going down at all; only one of the dsns was reporting a tcp/ip communication failure.

So to track this down better, we split our 6 alarms (one per witango server, each checking both dsns at once) into 12 alarms: one for each dsn on each of the 6 witango servers. We did this to help FIND the problem.

The weird thing is that this helped ELIMINATE the problem. For some reason, witango did not like hitting these 2 dsns sequentially in the same taf during heavy traffic periods. Just separating the alarms, so that one dsn is hit on one request and the other on another, almost eliminated the problem.
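As an illustration of the split described above, here is a minimal Python sketch. It is not the actual alarm code: the server names, the `build_checks` and `run_checks` helpers, and the `probe` callback are all hypothetical. It only shows the idea of one check per server and dsn, each run as its own request, so a failure names the dsn involved.

```python
from itertools import product

# Hypothetical names: six witango servers and the two dsns from this thread.
SERVERS = [f"witango{n}" for n in range(1, 7)]
DSNS = ["DB1", "DB3"]


def build_checks(servers, dsns):
    """Return one (server, dsn) check per combination instead of one per server."""
    return list(product(servers, dsns))


def run_checks(checks, probe):
    """Run every check as a separate call and collect failures by server and dsn.

    `probe(server, dsn)` is supplied by the caller (an assumption here) and
    should return True when that dsn answers through that server.
    """
    failures = []
    for server, dsn in checks:
        try:
            ok = probe(server, dsn)
        except Exception:
            # Treat any error from the probe as a failed check.
            ok = False
        if not ok:
            failures.append((server, dsn))
    return failures
```

With 6 servers and 2 dsns this yields 12 checks, and a result such as `("witango3", "DB3")` points directly at the failing dsn rather than at the server as a whole.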

A few more notes about this.

The db3 dsn was the only one that ever failed. I am not sure why, but it should also be noted that in these alarms, it was always the second one checked in the sequence.

DB1 and DB3 are identical in every way, hardware and software.

The last note is a bit more complicated, but I am including it to possibly give you more things to think about (as if you need any more).

When we moved up to witango 5.5, we did tons of testing first, and found it was much more reliable than 5.0 and eliminated many instabilities. I did notice that witango would still get into a situation during extremely heavy loads where it would stop serving requests, but instead of crashing, it would hang for a few seconds and recover. You can search the archives, where I posted these observations and included pictures of the task manager showing this occurring.

In order to further minimize this, we worked heavily with a PrimeBase engineer to tune the PrimeBase ODBC driver and connection dlls. We found the only thing that completely eliminated it was adding debug code that put a delay between simultaneous requests. We used this for a while, and it didn't affect witango performance, except on tafs that would loop and do sequential inserts or updates.
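The effect of that kind of delay can be sketched in Python. This is not the PrimeBase debug code, which is not shown in this thread; the `MinGapThrottle` class and its parameters are hypothetical. It only illustrates spacing out requests that arrive at the same moment, and why a taf that loops over many inserts pays the gap on every iteration.

```python
import threading
import time


class MinGapThrottle:
    """Keep a minimum gap between the start of consecutive requests."""

    def __init__(self, min_gap_seconds, clock=time.monotonic, sleep=time.sleep):
        self._min_gap = min_gap_seconds
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self):
        """Block until this caller's slot arrives; return the delay that was applied."""
        with self._lock:
            now = self._clock()
            delay = max(0.0, self._next_allowed - now)
            # Reserve the next slot one gap after this caller's slot.
            self._next_allowed = max(now, self._next_allowed) + self._min_gap
        if delay:
            self._sleep(delay)
        return delay
```

A caller would invoke `throttle.wait()` just before sending each request to the database. Requests that are already spaced out see no delay, while a burst of N simultaneous requests is spread over roughly N times the gap.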

With this setup, there was virtually ZERO downtime due to witango.

Later, we started the replication system and did some major hardware upgrades. Performance was great, but we experienced these tcp/ip disconnects, where all 6 servers would go down at once. We replaced and reconfigured switches and tried many things. In the end, it was the NICs in the witango machines. We were using server-class NICs, but they used the Marvell Yukon chipset. We replaced them with Intel NICs, and the problem went away.

Now, on to a month or so ago. PrimeBase came out with a major update that was supposed to really increase performance. When we made the update, we started having the tcp/ip disconnects again, just on db3, which is what I started this email with. It happened only about once a day, and affected 3-5 witango servers simultaneously. Just restarting the witango service to reinitiate the connection would restore functionality.

By fixing the alarms to not hit one then the other db sequentially, the problem almost went away.

One last thing: the motherboards in DB1 and DB3 are socket 939 Gigabyte motherboards with dual gigabit ethernet. The two ethernet chips are not the same, and the one being used primarily was a Marvell Yukon chip. These servers are running Fedora Core 4 Linux, with the ethernet driver compiled into the kernel, for serious performance.

We switched the primary ethernet comm to go through the NON yukon chip, and now we are back to ZERO problems. The servers have not burped on a single request since.

I don't know how much of that is going to help you, if at all. I hope it does. I know how painful it is chasing these things down, and in my case it was a combination of several things.

One thing I would try: instead of having a single taf that hits the 2 dsns in sequence, make 2 tafs, call them in sequence separately, and see if you get the same errors.

Anyway, hope that helps.

--

Robert Garcia
President - BigHead Technology
VP Application Development - eventpix.com
13653 West Park Dr
Magalia, Ca 95954
ph: 530.645.4040 x222 fax: 530.645.4040
[EMAIL PROTECTED] - [EMAIL PROTECTED]
http://bighead.net/ - http://eventpix.com/

On Jan 25, 2006, at 10:46 AM, Dave Machin wrote:

Thanks for the feedback.

We're still stumped.  We've confirmed that the ODBC variable is set
correctly before and after our two test queries, and yet one of the two (or sometimes both) execute against the wrong data source. Queries that happen later or earlier in the request execute against the correct data source.

I've compiled some notes of one example from this morning, if anyone has
some time they could take a look and see what we're seeing.

You can download the document here:
www.benchmarkportal.com/witango_error_notes.zip

In short, we're showing that queries execute correctly at first; the user variable is correct before the query in question; the first test query then
executes against the wrong data source; the second test query executes
against the correct data source; and then the next query executes correctly
as well.

This happens in 5.5.009 on three different servers. The current production server is a new, clean build with the latest updates and the latest MDAC drivers. This .taf does not use the user reference argument, and relies on
the cookie.  What's confusing is that one query in the middle of a
sequence of ten goes wrong and the others work fine - all in the same
request.

We're going to downgrade from 5.5 back to 5.0 to see if the problem goes
away, because we can't find anything else to change.

----- Original Message -----
From: "Customer Support" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Tuesday, January 24, 2006 5:32 PM
Subject: Re: Witango-Talk: Select 1 From xxx where 1=0


Is it unusual that our production server doesn't issue the
heartbeat query before each DB action?

Yes.  It may be going to another DB or DB Server as it is based on
the datasource and tables in the db.

On our development machine, I wrote a .taf application that issued
an identical DB action 20 times (just copied and pasted the same
DBMS action over and over).  In SQL profiler, when that application
is executed, I see the heartbeat and then the query repeated 20
times in pairs as expected.

But on our production server, we often see cases where there is no
heartbeat query before a DBMS action.  Sometimes we see two or
three DB actions execute before we see another heartbeat query.

If you are running the same version of the Witango Server, OS, ODBC
and DB on all servers and they are exhibiting different behaviors
then you probably have a difference in the configuration of one or
more of Witango Server, OS, ODBC and/or DB processes.

Your problem sounds like tafs interacting with each other, hard coded
user references, lost user reference cookies or lost user references
argument.  If you do not use user reference arguments and rely on the
user reference cookie make sure that it is working and has a value or
the server will fall back to issuing a new user reference cookie.

If you are using iframes or AJAX check that you do not have one taf
interacting with another under the same user reference resetting a
variable when you are not expecting it.


Witango Support
________________________________________________________________________
TO UNSUBSCRIBE: Go to http://www.witango.com/developer/maillist.taf


