Thanks so much for the extensive reply. One(!) question: the two queries that were run sequentially, and that you then separated out, were they identical queries except for the dsn?
One suspicion I've been having is that something bad happens if we issue the exact same query twice in a row, either to the same dsn (which would be redundant) or to a different dsn.

Oh, I lied, one more question: how were you detecting the tcp/ip disconnects? We're on Windows; is that an ODBC debug thing?

----- Original Message -----
From: "Robert Garcia" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Wednesday, January 25, 2006 12:16 PM
Subject: Re: Witango-Talk: Select 1 From xxx where 1=0

> I don't know if this helps you, but I have come across a situation
> that caused a problem, and I resolved it.
>
> We have a master db server, and a slave that is replicated from
> master. Master is DB1, and slave is DB3.
>
> We have a load balancing system using vars for the datasource. I
> described this in a recent thread.
>
> We were seeing with our alarms that occasionally witango was going
> down during busy parts of the day. Now, my alarm hits the same db on
> each db server sequentially and reports a failure. The alarm reported
> a general failure, not which of the dsns was not working. After
> investigating, we found that it was not that witango was going down,
> but that only one of the dsns was reporting a tcp/ip communication
> failure.
>
> So to track this down better, we split our 6 alarms (6 checks to 6
> witango servers, each checking both dsns at once) into 12 alarms, 1
> alarm for each dsn for each of the 6 witango servers. We did this to
> help FIND the problem.
>
> The weird thing is that this helped ELIMINATE the problem. For some
> reason, witango did not like hitting these 2 dsns sequentially in the
> same taf during heavy traffic periods. Just separating the alarms, so
> that one dsn is hit on one request and the other on another, almost
> eliminated the problem.
>
> A few more notes about this.
>
> The db3 dsn was the only one that ever failed.
> I am not sure why, but it should also be noted that in these alarms,
> it was always the second one done in the sequence.
>
> DB1 and DB3 are identical in every way, hardware and software.
>
> The last note is a bit more complicated, but I am including it to
> possibly give you more things to think about (as if you need any more).
>
> When we moved up to witango 5.5, we did tons of testing first, and
> found it was much more reliable than 5.0, and eliminated many
> instabilities. I did notice that witango would still get into a
> situation during extremely heavy loads where it would stop serving
> requests, but instead of crashing, it would hang for a few seconds
> and recover. You can search in the archives where I made these
> observations and included pictures of the task manager showing this
> occurring.
>
> In order to further minimize this, we worked heavily with a PrimeBase
> engineer to tune the PrimeBase ODBC driver and connection dlls. We
> found the only thing to completely eliminate it was to add debug code
> that slowed down the time between simultaneous requests. We used this
> for a while, and it didn't affect witango performance, except on tafs
> that would loop and do sequential inserts or updates.
>
> With this setup, there was virtually ZERO downtime due to witango.
>
> Later, we started the replication system, and did some major hardware
> upgrades. Performance was great, but we experienced these tcp/ip
> disconnects, where all 6 servers would go down at once. We replaced
> and reconfigured switches, and tried many things; in the end, it was
> the NICs in the witango machines. We were using server-class NICs,
> but ones that used the Marvell Yukon chipset. We replaced them with
> Intel NICs, and the problem went away.
>
> Now, on to a month or so ago: PrimeBase came out with a major update,
> and it was supposed to really increase performance.
> When we made the update, we started having the tcp/ip disconnects
> again, just on db3, which is what I started this email on. It happened
> only about once a day, and affected 3-5 witango servers
> simultaneously. Just restarting the witango service to reinitiate the
> connection would return functionality.
>
> By fixing the alarms to not hit one then the other db sequentially,
> the problem almost went away.
>
> The last thing: the motherboards in DB1 and DB3 are socket 939
> Gigabyte motherboards with dual gigabit ethernet. The ethernet chips
> are not the same, and the one being used primarily was a Marvell Yukon
> chip. These servers are running Linux Fedora Core 4, with the
> ethernet driver compiled into the kernel, for serious performance.
>
> We switched the primary ethernet comm to go through the NON-Yukon
> chip, and now we are back to ZERO problems. The servers have not
> burped on a single request since.
>
> I don't know how much of that is going to help you, if at all. I hope
> it does. I know how painful it is chasing these things down, but as
> in my case, it may be a combination of several things.
>
> One thing I would try: instead of having a single taf that hits
> the 2 dsns in sequence, make 2 tafs, call them in sequence
> separately, and see if you get the same errors.
>
> Anyway, hope that helps.
>
> --
>
> Robert Garcia
> President - BigHead Technology
> VP Application Development - eventpix.com
> 13653 West Park Dr
> Magalia, Ca 95954
> ph: 530.645.4040 x222 fax: 530.645.4040
> [EMAIL PROTECTED] - [EMAIL PROTECTED]
> http://bighead.net/ - http://eventpix.com/
>
> On Jan 25, 2006, at 10:46 AM, Dave Machin wrote:
>
> > Thanks for the feedback.
> >
> > We're still stumped. We've confirmed that the ODBC variable is set
> > correctly before and after our two test queries, and yet one of the
> > two (or sometimes both) executes against the wrong data source.
> > Queries that happen later or earlier in the request execute against
> > the correct data source.
> >
> > I've compiled some notes on one example from this morning, if
> > anyone has some time to take a look and see what we're seeing.
> >
> > You can download the document here:
> > www.benchmarkportal.com/witango_error_notes.zip
> >
> > In short, we're showing that queries execute correctly at first;
> > the user variable is correct before the query in question; the
> > first test query then executes against the wrong data source; the
> > second test query executes against the correct data source; and
> > then the next query executes correctly as well.
> >
> > This happens in 5.5.009 on three different servers. The current
> > production server is a new, clean build with the latest updates and
> > the latest MDAC drivers. This .taf does not use the user reference
> > argument, and relies on the cookie. What's confusing is that one
> > query in the middle of a sequence of ten goes wrong and the others
> > work fine, all in the same request.
> >
> > We're going to downgrade from 5.5 back to 5.0 to see if the problem
> > goes away, because we can't find anything else to change.
> >
> > ----- Original Message -----
> > From: "Customer Support" <[EMAIL PROTECTED]>
> > To: <[email protected]>
> > Sent: Tuesday, January 24, 2006 5:32 PM
> > Subject: Re: Witango-Talk: Select 1 From xxx where 1=0
> >
> >
> >>> Is it unusual that our production server doesn't issue the
> >>> heartbeat query before each DB action?
> >>
> >> Yes. It may be going to another DB or DB Server as it is based on
> >> the datasource and tables in the db.
> >>
> >>> On our development machine, I wrote a .taf application that issued
> >>> an identical DB action 20 times (just copied and pasted the same
> >>> DBMS action over and over).
> >>> In SQL profiler, when that application is executed, I see the
> >>> heartbeat and then the query repeated 20 times in pairs as
> >>> expected.
> >>>
> >>> But on our production server, we often see cases where there is no
> >>> heartbeat query before a DBMS action. Sometimes we see two or
> >>> three DB actions execute before we see another heartbeat query.
> >>
> >> If you are running the same version of the Witango Server, OS, ODBC
> >> and DB on all servers and they are exhibiting different behaviors
> >> then you probably have a difference in the configuration of one or
> >> more of the Witango Server, OS, ODBC and/or DB processes.
> >>
> >> Your problem sounds like tafs interacting with each other,
> >> hard-coded user references, lost user reference cookies or a lost
> >> user reference argument. If you do not use user reference arguments
> >> and rely on the user reference cookie, make sure that it is working
> >> and has a value, or the server will fall back to issuing a new user
> >> reference cookie.
> >>
> >> If you are using iframes or AJAX, check that you do not have one
> >> taf interacting with another under the same user reference,
> >> resetting a variable when you are not expecting it.
> >>
> >>
> >> Witango Support
> >> ________________________________________________________________________
> >> TO UNSUBSCRIBE: Go to http://www.witango.com/developer/maillist.taf
