Just out of curiosity, are you by any chance doing a TSContCall() with a TS_EVENT_IMMEDIATE or TS_EVENT_TIMEOUT?
Brian 2012/2/17 Whitaker, James X. -ND <[email protected]> > Hi All, > > Sorry for being long, and babbling, but it's getting late in the day, and > this is needs to get fixed. There is a fair amount of information, and > it's hard to organize it. > > ATS with our plugin is crashing... Because of an unexpected event "2", > evidently created by something inside ATS that appears to be going > berserk, attempting to retry an event, or perhaps to try to demand that > our plug-in clean up after itself. What I need to know is: what is the > meaning of of this "event number 2", where in ATS code is it being > scheduled, and what should we be doing? How do I track down where this is > coming from? > > I have run through about 100 hypothesis now about where the crash might be > coming from, and each time I prove myself wrong. On the other hand, I am > learning a lot about ATS innards, making me wonder if I am soon to be > committer material... I am amazed at how difficult it is to describe this > crash in a post - this is one of several attempts. > > It will sometimes run for hours without incident. Sometimes for maybe a > minute. > > I have been assuming that we were doing something wrong in our plugin. > But for the life of me, I can't seem to see what we are doing to > contribute to this, or what part of ATS code could be objecting to what we > are doing. The more I dig around, the more the nature of the crash just > gets more and more baffling… > > Version is 3.0.1, with our plugin, running on RedHat EL5. > > The stack trace in the core file consists of EThread::execute calling into > EThread::process_event calling into the callback function for > INKContInternal, which calls into the event handling function we > registered with TSContCreate for a continuation used to assist the > HttpSM/TSHttpTxn class, with an event code of TS_EVENT_TIMEOUT. We switch > on the event parameter of the callback, and the default switch case > catches this event #2 (TS_EVENT_TIMEOUT) which does a TSHttpTxnReenable on > the transaction. The crash occurs when a sanity check on the validity of > the transaction fails when we attempt to do the obvious: TSHttpTxnReenable > on what appears to be a valid transaction pointer. ATS asserts and shuts > down. m_deletable is 1, m_deleted is 0.... So it is not the result of an > attempt to destroy the object... Apparently something is merely trying to > give it a nudge and wake it up? The object shows up with a live magic > number, but the HttpSM pointer coming into arg3 appears to point into > garbled memory (with a dead flag magic number). > > In otherwords, TSCont's intended for assisting HttpSM objects are coming > to life on their own somehow, or perhaps as an unintended side effect of > something we are doing? In this case it is happening after a transaction > is over.... Or perhaps after an error? Something put it on the event > cue... And I have no idea how it is getting there, or why. I haven't > found the code that does it... It certainly does not appear to be the > product of anything in our plug-in... and I haven't found documentation > of why TS_EVENT_TIMEOUT is there, either. > > Our code uses TSHttpTxnReenable to bump the transaction along when the > callback is done processing each stage. Nothing in our code puts this > TSCont linked to our transaction state machine callback function directly > on the event cue, it is only used in our code to assist the HttpSM class, > which hooks it with the apihooks. > > I don't see this kind of behavior documented anywhere. > > Best theory so far is that certain error conditions might cause us to leak > TSCont's that we use to assist the HttpSM/TSHttpTxn class, and these leaky > TSCont's are being woken up and put on the event Queue long after the > Transactions disappear from memory and are cleaned up. I can't figure out > where they might be getting woken up, or why, or why a once in a blue-moon > leaked TSCont that was long abandoned and was never on the event queue > would be of interest to anything or anyone. The only evidence that they > might be leaking is that we see these linked to non-operating HttpSM > objects in the core files. There is at least one spot in the code where > the callback function issues a Reenable on the transaction object into the > error state without doing a destroy on the TSCont. > > Comments in the blacklist example say that the TS_EVENT_TIMEOUT will show > up with a transaction pointer of NULL, which we are not seeing. Instead > we are seeing a non-null pointer. (Also, the blacklist example for how to > handle the timeout looks wrong. It looks like the blacklist example > repeats the previous processing step, rather than going to the 'next > state' that should occur after the previous one finished. Why would it > want to repeat?) > > In some of the examples and test code, we see TS_EVENT_TIMEOUT and > handlers that have inconsistent ways of dealing with the event. We have > not found documentation on it. > > The knee jerk reaction was to put in a prototype to pull in the > transaction sanity check code from InkApi.cc by declaring a prototype to > it at the top of our plugin file, and call it to decide whether to call > the TxnReenable. However, the HttpSM objects wind up on freelists, which > means the memory gets re-allocated. If we then get a call on a stale > TSCont that points into the freelist memory of HttpSM's, we would likely > point into a "live" one, in which case we would be operating on the WRONG > HttpSM, wreaking havoc. How do we know whether we are pointing to the > original HttpSM object, and not a new one in it's memory slot? > > If we get a timeout, do we just destroy the continuation? Is that any > kind of answer? What if the transaction is still using it? Do we ignore > the event? What are the repercussions of ignoring the event? Will we > wind up with connections that die in mid-stream? > > In our plugin's main continuation callback handler, when we get a > TS_EVENT_HTTP_TXT_START, we set up a transaction assisting continuation > with TSContCreate, a hook for TS_HTTP_READ_REQUEST_HDR_HOOK into an event > handler/callback, and at the start of the transaction processing when that > event arrives, a series of hooks back into the callback to process various > stages of the transaction. > > I have been digging around for a while now, trying to figure out what line > or lines of code could possibly be generating this event "2". It seems > like knowing where that event code is being generated is critical to > knowing what to do about the anomaly. Looking at code from PluginVC.cc > and other places, it looks to my eyes that a general design principal in > ATS is to follow the practice of re-scheduling an event, but NOT > preserving the original event code when rescheduling. There are many > places that it appears that the practice followed is if the queued up > event somehow does not get a mutex lock, the event will be re-issued but > the event type will be clobbered with the default event type of > EVENT_INTERVAL that is found in the various header files in > iocore/eventsystem for the schedule_* methods of the event and event > processor. I suspect that this type of thing is where the event ID of "2" > (EVENT_INTERVAL) is coming from. > > I remember a server design in the past that would issue events like this > to warn that the timing was bad, to see if the callback wanted to > terminate a sequence of events with an error. But that server allowed > RE-ISSUING the delayed event. The call to TS_EVENT_HTTP_CONTINUE will not > allow the server to re-issue the clobbered event. The code that generates > the transaction events only goes one way -- the code in proxy/HttpSM.cc > HttpSM::state_api_callout seems to my eyes to only allow an incrementing > series of events. > > ...So if we miss the cue/event that tells us that we need to handle some > critical step because the event code was masked by a "TIMEOUT" (whatever > that is?) then it would be impossible for the code to follow a normal path > because the state information would be lost. It is hard for me to believe > that the designers would deliberately destroy state information like this > (the HttpSM class, for example maintains the state that gets passed down, > insulating the callback from the event system), so something else must be > going on. > > Now, under high levels of load, we see a number of errors. I am wondering > if at least some of these errors are from housekeeping events on stale > TSCont's being woken up by some "garbage collector" I have yet to find in > ATS... And these stale TSConts are doing damage because they point into > re-incarnated HttpSM memory slots? > > Also, there is another "sister" crash showing up in the logs that we do > not have a core dump for yet, but we do have this: > > FATAL: InkAPI.cc:968: failed assert `!"Plugin tries to use a continuation > which is deleted"` > /usr/local/bin/traffic_server - STACK TRACE: > /usr/local/lib/trafficserver/libtsutil.so.3(ink_fatal_va+0x9d)[0x2b701f5f92 > ad] > /usr/local/lib/trafficserver/libtsutil.so.3(ink_fatal+0x88)[0x2b701f5f9408] > /usr/local/lib/trafficserver/libtsutil.so.3(_ink_assert+0x85)[0x2b701f5f7b9 > 5] > /usr/local/bin/traffic_server(_ZN15INKContInternal12handle_eventEiPv+0xc7)[ > 0x4aa247] > /usr/local/bin/traffic_server(_ZN7EThread13process_eventEP5Eventi+0x22f)[0x > 6c138f] > /usr/local/bin/traffic_server(_ZN7EThread7executeEv+0x3db)[0x6c1abb] > /usr/local/bin/traffic_server[0x6c06ce] > /lib64/libpthread.so.0[0x344fc0673d] > /lib64/libc.so.6(clone+0x6d)[0x344f4d3d1d] > > I suspect they may have the same cause. > > > What is the proper thing to do here? I don't know if this is an ATS bug, > a plug-in bug, both, many, or what it is. The leaky continuations idea is > wild guess number 100 or thereabouts of what is causing this crash. If I > ignore this, and just return 0 on sight of event==2 is something really > bad going to happen, like a house cleaning function clogging the priority > queues with leaked continuations trying to wake them up to get the owner > to clean them up? > > Server's usually issue things like this for a reason. It would be nice to > have that reason documented, so that people like myself don't find myself > running around on a wild goose chase. > > Thanks, > > James Whitaker > >
