Nice detective work, Robert!

> -----Original Message-----
> From: Robert Garcia [mailto:[EMAIL PROTECTED]
> Sent: Friday, August 22, 2003 12:47 AM
> To: [EMAIL PROTECTED]
> Subject: Witango-Talk: **Solved Crashing Issue (knock on wood)**
> 
> 
> I have a bald spot on my head from pulling my hair out trying to solve 
> why my witango app servers go down a couple times a day, and have to be 
> restarted manually. I know there are a couple of other people that have 
> had this same problem.
> 
> This problem has usually occurred during very heavy traffic periods, and 
> the crash happens after a random period of time. The event log would 
> sometimes show a fatal exception, and other times would show a strange 
> odbc.connection error. I have sent in test applications and logs to 
> With Enterprise to reproduce the problem, and I believe there are a few 
> engineers with bald spots there also.
> 
> I almost never have this problem on my test system, only on my 
> deployed servers, which can sustain 50 simultaneous connections and 
> serve hundreds of thousands of pages and blobs a day.
> 
> This pointed me to the problem being some kind of memory leak in the 
> Witango ODBC interface, complicated by my heavy blob serving.
> 
> Last night around midnight I woke up to my alarms going off and found 
> the app servers crashed at the same time, which was unusual, and when I 
> restarted them, they just kept crashing. I would get them up and 
> running, and they would serve TMLs, but when I hit a TAF with a query, 
> they would hang, crash, or give a TCP error that the ODBC source was not 
> published on this protocol.
> 
> After at least 2 hours of trying to resolve the issue, I realized that 
> the app servers could not see the database server (which is two feet 
> away on the same subnet). So I went to the basics and did a ping from 
> my app server to the database server:
> 
> ping db1.mydomain.com
> 
> I was astonished to see the result: the app server started pinging an 
> IP address that I had never heard of. The DNS server I was using must 
> have gone wacko. I am using a DNS server in the data center, maintained 
> by the data center, which is a BIND system running on OS X. I woke the 
> data center manager up and told him he had a serious problem, and he 
> rebooted the DNS servers, and voilà, the servers started working again 
> and the problem was resolved.
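
For anyone who wants to automate that ping sanity check, here is a minimal Python sketch. It is not part of Robert's setup: the function name `check_resolution` and the address 10.0.0.5 are made up for illustration, and only the hostname comes from his message. It asks the system resolver for the name and reports whether every address returned is one you expect.

```python
import socket


def check_resolution(hostname, expected_ips, resolve=socket.gethostbyname_ex):
    """Return (ok, addresses); ok is True only if all resolved addresses are expected.

    `resolve` defaults to the system resolver and can be swapped out for testing.
    """
    try:
        # gethostbyname_ex returns (canonical_name, alias_list, ip_address_list)
        _name, _aliases, addresses = resolve(hostname)
    except OSError as exc:  # socket.gaierror and socket.herror are OSError subclasses
        return False, ["lookup failed: %s" % exc]
    ok = bool(addresses) and set(addresses) <= set(expected_ips)
    return ok, addresses


# Hypothetical usage (10.0.0.5 is an assumed address, not one from the thread):
#   ok, addresses = check_resolution("db1.mydomain.com", {"10.0.0.5"})
#   if not ok:
#       print("DNS returned something unexpected:", addresses)
```

Run from a scheduled job on each app server, a check like this would flag a resolver handing out a wrong address before the datasource connections start failing.
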
> 
> This incident gave me much concern, and I decided that I should not 
> rely on someone else's DNS, so I set up a quick forwarding DNS server on 
> my database server (Windows Server 2003, took 2 minutes), and pointed 
> all of my machines to it. This DNS server has no zones, but caches all 
> DNS queries and serves only this group of machines.
> 
> To my amazement my servers chugged through the heaviest traffic periods 
> all day, without a single incident. Not a timeout, not a crash, 
> nothing. I have been floored all day staring at my live performance 
> counters. I have watched the servers go through sustained periods up to 
> 4 hours long where they were averaging 50 simultaneous connections.
> 
> This had not been possible for me before.
> 
> In conclusion, I can only assume that if the app server has to connect 
> to a datasource and needs to resolve a domain name, and the DNS server 
> doesn't respond quickly enough, and other queries cause several 
> threads to have the same problem, the app server crashes.
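
One way to test that theory on your own network is to measure how the resolver behaves when many threads look up the same name at once. The sketch below is only a measurement aid under stated assumptions: it does not reproduce what the Witango server does internally, the function names `timed_lookup` and `probe` are invented here, and the default of 50 parallel lookups simply mirrors the 50 simultaneous connections Robert mentions.

```python
import socket
import time
from concurrent.futures import ThreadPoolExecutor


def timed_lookup(hostname, resolve=socket.gethostbyname):
    """Resolve hostname once; return (succeeded, elapsed_seconds)."""
    start = time.perf_counter()
    try:
        resolve(hostname)
        ok = True
    except OSError:  # covers socket.gaierror and socket.herror
        ok = False
    return ok, time.perf_counter() - start


def probe(hostname, concurrency=50, resolve=socket.gethostbyname):
    """Run `concurrency` lookups in parallel; return (failure_count, slowest_seconds)."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(
            pool.map(lambda _: timed_lookup(hostname, resolve), range(concurrency))
        )
    failures = sum(1 for ok, _ in results if not ok)
    slowest = max(elapsed for _, elapsed in results)
    return failures, slowest


# Hypothetical usage:
#   failures, slowest = probe("db1.mydomain.com")
#   print("%d failed, slowest lookup took %.3f s" % (failures, slowest))
```

If the slowest lookup climbs into seconds, or failures appear during busy periods, that would be consistent with the explanation above, although it would not prove that slow DNS is what brings the app server down. Note that the operating system may cache answers, so repeated lookups can look faster than a cold one.
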
> 
> Bringing the DNS to a high level of performance seems to have solved 
> the issue. I am now going to add stub zones to my DNS server so that it 
> never has to query another DNS server, and the response will be instant. 
> I may even install a backup DNS server on each app server, pointing to 
> the main one on the database server, and then point the primary DNS for 
> each app server to itself, to give it the highest DNS performance possible.
> 
> This was the last thing I would have ever thought to look at. If you 
> are deploying heavy traffic servers, check this issue.
> 
> Some may ask why I use names rather than IP addresses. Well, periodically 
> an IP address may change, but I can always keep the name consistent, 
> and therefore never have to rewrite code or change a bunch of DSNs 
> because I just changed an IP.
> 
> I debated whether to post this to the list so soon, before having more 
> time to watch the issue, but if anyone else is having issues, this 
> may help. It has given me back my life, not having to watch the servers 
> so closely.
> 
> -- 
> 
> Robert Garcia
> President - BigHead Technology
> CTO - eventpix.com
> 2781 N Carlmont Pl
> Simi Valley, Ca 93065
> ph: 805.522.8577 - cell: 805.501.1390 - fax: 805.830.0321
> [EMAIL PROTECTED] - [EMAIL PROTECTED]
> http://bighead.net/ - http://eventpix.com/ - http://theradmac.com/
> 
> ________________________________________________________________________
> TO UNSUBSCRIBE: Go to http://www.witango.com/maillist.taf
> 