nice detective work, Robert!
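
For anyone who wants to check whether they are seeing the same thing Robert describes below, here is a rough Python sketch (not anything from Witango itself) that times a single name lookup and flags answers you did not expect. The function name `check_dns`, the `slow_ms` threshold, and the `expected_ips` set are all made up for illustration; the 200 ms default is an arbitrary guess, not a number from Robert's post.

```python
import socket
import time


def check_dns(hostname, expected_ips=None, slow_ms=200.0,
              resolver=socket.getaddrinfo):
    """Resolve hostname once; report latency and any unexpected addresses.

    expected_ips, slow_ms and resolver are hypothetical knobs for this
    sketch. resolver defaults to the system lookup (socket.getaddrinfo).
    """
    start = time.perf_counter()
    try:
        infos = resolver(hostname, None)
    except socket.gaierror as exc:
        # The lookup failed outright (no answer, or the name does not exist).
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        return {"host": hostname, "ok": False, "error": str(exc),
                "elapsed_ms": elapsed_ms, "addresses": [], "unexpected": []}
    elapsed_ms = (time.perf_counter() - start) * 1000.0

    # Each getaddrinfo entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address as a string.
    addresses = sorted({info[4][0] for info in infos})

    unexpected = []
    if expected_ips is not None:
        unexpected = [ip for ip in addresses if ip not in expected_ips]

    # "ok" means: answered fast enough and returned only addresses we expected.
    ok = elapsed_ms <= slow_ms and not unexpected
    return {"host": hostname, "ok": ok, "error": None,
            "elapsed_ms": elapsed_ms, "addresses": addresses,
            "unexpected": unexpected}
```

Calling `check_dns("db1.mydomain.com", expected_ips={...})` from each app server every minute or so, and logging any result where `ok` is false, should show whether slow or wrong DNS answers line up with the crashes. It only tests the resolver the operating system is configured to use, so it says nothing about what happens inside the ODBC driver.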

> -----Original Message-----
> From: Robert Garcia [mailto:[EMAIL PROTECTED]
> Sent: Friday, August 22, 2003 12:47 AM
> To: [EMAIL PROTECTED]
> Subject: Witango-Talk: **Solved Crashing Issue (knock on wood)**
>
>
> I have a bald spot on my head from pulling my hair out trying to solve
> why my Witango app servers go down a couple of times a day and have to
> be restarted manually. I know there are a couple of other people who
> have had this same problem.
>
> This problem has usually occurred during very heavy traffic periods,
> and the crash happens after a random period of time. The event log
> would sometimes show a fatal exception, and other times would show a
> strange odbc.connection error. I have sent in test applications and
> logs to With Enterprise to reproduce the problem, and I believe there
> are a few engineers with bald spots there also.
>
> I almost don't have this problem at all on my test system, only on my
> deployed servers, which can sustain 50 simultaneous connections and
> serve hundreds of thousands of pages and blobs a day.
>
> This pointed me to the problem being some kind of memory leak in the
> Witango ODBC interface, complicated by my heavy blob serving.
>
> Last night around midnight I woke up to my alarms going off and found
> the app servers had crashed at the same time, which was unusual, and
> when I restarted them, they just kept crashing. I would get them up and
> running, and they would serve tml's, but when I hit a taf with a query,
> they would hang, crash, or give a TCP error that the ODBC source was
> not published on this protocol.
>
> After at least 2 hours of trying to resolve the issue, I realized that
> the app servers could not see the database server (which is two feet
> away on the same subnet). So I went back to basics and did a ping from
> my app server to the database server:
>
>     ping db1.mydomain.com
>
> I was astonished to see the result: the app server started pinging an
> IP address that I had never heard of. The DNS server I was using must
> have gone wacko. I am using a DNS server in the data center, maintained
> by the data center, which is a BIND system running on OS X. I woke the
> data center manager up and told him he had a serious problem, and he
> rebooted the DNS servers, and VOILA, the servers started working again
> and the problem was resolved.
>
> This incident gave me much concern, and I decided that I should not
> rely on someone else's DNS, so I set up a quick forwarding DNS server
> on my database server (Windows 2003 Server, took 2 minutes), and
> pointed all of my machines to it. This DNS server has no zones, but
> caches all DNS queries and only serves this group of machines.
>
> To my amazement my servers chugged through the heaviest traffic periods
> all day, without a single incident. Not a timeout, not a crash,
> nothing. I have been floored all day staring at my live performance
> counters. I have watched the servers go through sustained periods up to
> 4 hours long where they were averaging 50 simultaneous connections.
>
> This has never been possible for me before.
>
> In conclusion, I can only assume that if the app server has to connect
> to a datasource and needs to resolve a domain name, and the DNS server
> doesn't respond quickly enough, and other queries cause several
> threads to hit the same problem, it crashes.
>
> Bringing the DNS to a high level of performance seems to have solved
> the issue. I am now going to add stub zones to my DNS server so that it
> never has to query another DNS server, and the response will be
> instant. I may even install a backup DNS server on each app server,
> pointing to the main one on the database server, and then point the
> primary DNS for each app server to itself, to give it the highest DNS
> performance possible.
>
> This was the last thing I would have ever thought to look at. If you
> are deploying heavy traffic servers, check this issue.
>
> Some may ask why not use IP addresses instead of names. Well,
> periodically an IP address may change, but I can always keep the name
> consistent, and therefore never have to rewrite code or change a bunch
> of DSNs because I just changed an IP.
>
> I debated whether to post this to the list so soon, before having more
> time to watch the issue, but if anyone else is having issues, this
> may help. It has given me back my life, not having to watch the servers
> so closely.
>
> --
>
> Robert Garcia
> President - BigHead Technology
> CTO - eventpix.com
> 2781 N Carlmont Pl
> Simi Valley, Ca 93065
> ph: 805.522.8577 - cell: 805.501.1390 - fax: 805.830.0321
> [EMAIL PROTECTED] - [EMAIL PROTECTED]
> http://bighead.net/ - http://eventpix.com/ - http://theradmac.com/
>
> ________________________________________________________________________
> TO UNSUBSCRIBE: Go to http://www.witango.com/maillist.taf
