At 14:48 30/03/2007, Bogdan-Andrei Iancu wrote:
>wrong again :)

I wish it would be.
Operational experience shows us that former versions had race conditions
which do cause trouble under hard-to-reproduce conditions. Based on surface
knowledge, it appears that openser inherited those from ser, before ser's
overhaul of them.

>as I mentioned in my previous email, the "detached timer" was more a marker
>that something else was going wrong - there was no amplification.

Lucky are those who haven't been affected by the race conditions. My point,
though, is that this particular warning correlates with nondeterminism.

>and as TR clearly said, the problem was with DB connectivity and had nothing
>to do with TM timers.

Well, as a matter of fact, I have witnessed several failures which coincided
with this warning. Studying the code will reveal to you and anyone else that
this warning is actually just a hack which helps to ignore erroneous
conditions and survive them, but doesn't heal the cause of the problem, which
may still result in dysfunctional service.

Again, I don't mean to demonize it; with this ignore-the-problem hack things
have been running mostly fine.

-jiri

>regards,
>bogdan
>
>Jiri Kuthan wrote:
>>Actually more likely it has been both. The root problem lies in the timer
>>subsystem and may be amplified by other troubles (or amplify those).
>>
>>-jiri
>>
>>At 01:35 30/03/2007, T.R. Missner wrote:
>>
>>>FYI All
>>>
>>>This turned out to be a database write ( acc ) that was blocking due to a
>>>raid card problem.
>>>
>>>T.R. Missner wrote:
>>>
>>>>Is it possible the locked state I am seeing with openser leads to the
>>>>"detached" timer?
>>>>Since the "detached" timer is a race, it would make sense to see the race
>>>>condition after openser locks up and messages buffer up in the stack.
>>>>When a bunch of messages are processed all at once by multiple threads the
>>>>race condition would occur.
>>>>Does this make sense?
>>>>
>>>>Maybe I have been focusing on the wrong place.
>>>>
>>>>Ignoring the "detached" timer what could cause openser to hang for a couple
>>>>seconds then clear every 5 - 10 minutes?
>>>>
>>>>Ideas?
>>>>
>>>>We are seeing this on 3 different production servers.
>>>>
>>>>Thanks
>>>>
>>>>TR
>>>>
>>>>using openser 1.1.1
>>>>
>>>>T.R. Missner wrote:
>>>>
>>>>>Bogdan,
>>>>>
>>>>>I have been chasing this for days and done lots of debugging.
>>>>>using 1.1.1
>>>>>While looking at the network trace at the time of these messages ( I
>>>>>usually see at least 5 in a row with differing hex values ) I see many
>>>>>incoming packets coming into the box and no response from the proxy for
>>>>>somewhere between 5 - 10 seconds, then a flood of responses from the proxy.
>>>>>I can email you a sample pcap file if you like.
>>>>>As part of my debugging I forced a 100 reply at the very top of my cfg
>>>>>file.
>>>>>The forced 100 was not sent during the locked up time leading me to
>>>>>believe openser was not processing incoming packets.
>>>>>I have now seen this on multiple servers in different locations. Likely a
>>>>>particular customer call flow is causing this but I have not been able to
>>>>>pin it down to the exact customer. These proxies run pretty fast during
>>>>>the day so finding a pattern leading up to this issue is difficult. What
>>>>>could I add to the Log output to identify the offending sip-callid? Is
>>>>>sip-callid or branch tag or anything similar easily accessible in any of
>>>>>the data structs in timer.c?
>>>>>
>>>>>TR
>>>>>
>>>>>Bogdan-Andrei Iancu wrote:
>>>>>
>>>>>>Hi TR,
>>>>>>
>>>>>>it is a race between the expire event (from timer) and inserting again
>>>>>>on a timer list.
>>>>>> 1 is the final response timer list (fr_timer)
>>>>>> 3 is the wait timer list (wt_timer)
>>>>>>
>>>>>>I would say there is no way this could lead to any kind of lock.
>>>>>>
>>>>>>what version are you using? what makes you say it locks?
>>>>>>
>>>>>>regards,
>>>>>>bogdan
>>>>>>
>>>>>>T.R. Missner wrote:
>>>>>>
>>>>>>>Does anyone know what causes this?
>>>>>>>
>>>>>>>*/set_timer for 1 list called on a "detached" timer -- ignoring /*
>>>>>>>
>>>>>>>I also see
>>>>>>>
>>>>>>>*/set_timer for 3 list called on a "detached" timer -- ignoring /*
>>>>>>>
>>>>>>>When this happens Openser seems to lock up for 10 seconds or so.
>>>>>>>
>>>>>>>From searching it appears this is caused by a race but I am not sure
>>>>>>>what the race is or why this results in an unresponsive openser
>>>>>>>instance for multiple seconds.
>>>>>>>
>>>>>>>Transaction expiration racing reply?
>>>>>>>
>>>>>>>Desperately need to understand how this could be triggered so I can get
>>>>>>>customer to adjust system.
>>>>>>>
>>>>>>>Any way to adjust?
>>>>>>>
>>>>>>>tried tweaking fr_inv_timer but no joy.
>>>>>>>
>>>>>>>TR
>>>>>>>
>>>>>>
>>>>>>--
>>>>>>Jiri Kuthan http://iptel.org/~jiri/

_______________________________________________
Users mailing list
[email protected]
http://openser.org/cgi-bin/mailman/listinfo/users
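
For readers following the thread, the race Bogdan describes (a timer expiring while another process tries to put it back on a timer list) can be illustrated with a small toy model. This is only a sketch under stated assumptions: the class and function names (`TimerLink`, `TimerList`, `detach_expired`, `run_handlers`) are hypothetical and are not taken from openser's timer.c; only the list id 1 (final response timer list) and the wording of the warning come from the messages above. The interleaving is replayed step by step in a single thread so that the result is deterministic.

```python
# Toy model of the race discussed in this thread: the timer process detaches
# an expired timer from its list while another process tries to re-arm it.
# All names are hypothetical; this is NOT the openser tm/timer.c code.

DETACHED = object()  # sentinel: timer was pulled off its list for expiry handling


class TimerLink:
    def __init__(self, name):
        self.name = name
        self.owner = None       # None (not armed), DETACHED, or a TimerList
        self.expires_at = None


class TimerList:
    def __init__(self, list_id, timeout):
        self.list_id = list_id
        self.timeout = timeout
        self.links = []

    def set_timer(self, link, now, log):
        """Arm `link` on this list unless expiry handling already owns it."""
        if link.owner is DETACHED:
            # The guard: log a warning and skip the request.
            log.append(
                f'set_timer for {self.list_id} list called on a "detached" timer -- ignoring'
            )
            return False
        if isinstance(link.owner, TimerList):
            link.owner.links.remove(link)
        link.owner = self
        link.expires_at = now + self.timeout
        self.links.append(link)
        return True

    def detach_expired(self, now):
        """Step 1 of a timer tick: unlink expired timers and mark them."""
        expired = [t for t in self.links if t.expires_at <= now]
        for link in expired:
            self.links.remove(link)
            link.owner = DETACHED
        return expired


def run_handlers(expired, log):
    """Step 2 of a timer tick: run the expiry handler for each detached timer."""
    for link in expired:
        log.append(f"expired: {link.name}")
        link.owner = None


log = []
fr_list = TimerList(1, timeout=30)   # "1 is the final response timer list"
transaction = TimerLink("transaction-A")
fr_list.set_timer(transaction, now=0, log=log)

# Timer process: the timer has expired and is detached from its list...
expired = fr_list.detach_expired(now=30)
# ...but before the handler runs, a worker handling a late message re-arms it.
rearmed = fr_list.set_timer(transaction, now=30, log=log)
# The timer process then continues with the expiry handler.
run_handlers(expired, log)

for line in log:
    print(line)
```

In this model the re-arm request is dropped and the transaction expires anyway. That matches Jiri's description of the warning as something that survives the erroneous condition without fixing its cause. The model says nothing about why a proxy would stop answering for several seconds; in TR's case that turned out to be the blocking database write.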
