Hi All, Sorry for being long, and babbling, but it's getting late in the day, and this is needs to get fixed. There is a fair amount of information, and it's hard to organize it.
ATS with our plugin is crashing... Because of an unexpected event "2", evidently created by something inside ATS that appears to be going berserk, attempting to retry an event, or perhaps to try to demand that our plug-in clean up after itself. What I need to know is: what is the meaning of of this "event number 2", where in ATS code is it being scheduled, and what should we be doing? How do I track down where this is coming from? I have run through about 100 hypothesis now about where the crash might be coming from, and each time I prove myself wrong. On the other hand, I am learning a lot about ATS innards, making me wonder if I am soon to be committer material... I am amazed at how difficult it is to describe this crash in a post - this is one of several attempts. It will sometimes run for hours without incident. Sometimes for maybe a minute. I have been assuming that we were doing something wrong in our plugin. But for the life of me, I can't seem to see what we are doing to contribute to this, or what part of ATS code could be objecting to what we are doing. The more I dig around, the more the nature of the crash just gets more and more baffling… Version is 3.0.1, with our plugin, running on RedHat EL5. The stack trace in the core file consists of EThread::execute calling into EThread::process_event calling into the callback function for INKContInternal, which calls into the event handling function we registered with TSContCreate for a continuation used to assist the HttpSM/TSHttpTxn class, with an event code of TS_EVENT_TIMEOUT. We switch on the event parameter of the callback, and the default switch case catches this event #2 (TS_EVENT_TIMEOUT) which does a TSHttpTxnReenable on the transaction. The crash occurs when a sanity check on the validity of the transaction fails when we attempt to do the obvious: TSHttpTxnReenable on what appears to be a valid transaction pointer. ATS asserts and shuts down. m_deletable is 1, m_deleted is 0.... So it is not the result of an attempt to destroy the object... Apparently something is merely trying to give it a nudge and wake it up? The object shows up with a live magic number, but the HttpSM pointer coming into arg3 appears to point into garbled memory (with a dead flag magic number). In otherwords, TSCont's intended for assisting HttpSM objects are coming to life on their own somehow, or perhaps as an unintended side effect of something we are doing? In this case it is happening after a transaction is over.... Or perhaps after an error? Something put it on the event cue... And I have no idea how it is getting there, or why. I haven't found the code that does it... It certainly does not appear to be the product of anything in our plug-in... and I haven't found documentation of why TS_EVENT_TIMEOUT is there, either. Our code uses TSHttpTxnReenable to bump the transaction along when the callback is done processing each stage. Nothing in our code puts this TSCont linked to our transaction state machine callback function directly on the event cue, it is only used in our code to assist the HttpSM class, which hooks it with the apihooks. I don't see this kind of behavior documented anywhere. Best theory so far is that certain error conditions might cause us to leak TSCont's that we use to assist the HttpSM/TSHttpTxn class, and these leaky TSCont's are being woken up and put on the event Queue long after the Transactions disappear from memory and are cleaned up. I can't figure out where they might be getting woken up, or why, or why a once in a blue-moon leaked TSCont that was long abandoned and was never on the event queue would be of interest to anything or anyone. The only evidence that they might be leaking is that we see these linked to non-operating HttpSM objects in the core files. There is at least one spot in the code where the callback function issues a Reenable on the transaction object into the error state without doing a destroy on the TSCont. Comments in the blacklist example say that the TS_EVENT_TIMEOUT will show up with a transaction pointer of NULL, which we are not seeing. Instead we are seeing a non-null pointer. (Also, the blacklist example for how to handle the timeout looks wrong. It looks like the blacklist example repeats the previous processing step, rather than going to the 'next state' that should occur after the previous one finished. Why would it want to repeat?) In some of the examples and test code, we see TS_EVENT_TIMEOUT and handlers that have inconsistent ways of dealing with the event. We have not found documentation on it. The knee jerk reaction was to put in a prototype to pull in the transaction sanity check code from InkApi.cc by declaring a prototype to it at the top of our plugin file, and call it to decide whether to call the TxnReenable. However, the HttpSM objects wind up on freelists, which means the memory gets re-allocated. If we then get a call on a stale TSCont that points into the freelist memory of HttpSM's, we would likely point into a "live" one, in which case we would be operating on the WRONG HttpSM, wreaking havoc. How do we know whether we are pointing to the original HttpSM object, and not a new one in it's memory slot? If we get a timeout, do we just destroy the continuation? Is that any kind of answer? What if the transaction is still using it? Do we ignore the event? What are the repercussions of ignoring the event? Will we wind up with connections that die in mid-stream? In our plugin's main continuation callback handler, when we get a TS_EVENT_HTTP_TXT_START, we set up a transaction assisting continuation with TSContCreate, a hook for TS_HTTP_READ_REQUEST_HDR_HOOK into an event handler/callback, and at the start of the transaction processing when that event arrives, a series of hooks back into the callback to process various stages of the transaction. I have been digging around for a while now, trying to figure out what line or lines of code could possibly be generating this event "2". It seems like knowing where that event code is being generated is critical to knowing what to do about the anomaly. Looking at code from PluginVC.cc and other places, it looks to my eyes that a general design principal in ATS is to follow the practice of re-scheduling an event, but NOT preserving the original event code when rescheduling. There are many places that it appears that the practice followed is if the queued up event somehow does not get a mutex lock, the event will be re-issued but the event type will be clobbered with the default event type of EVENT_INTERVAL that is found in the various header files in iocore/eventsystem for the schedule_* methods of the event and event processor. I suspect that this type of thing is where the event ID of "2" (EVENT_INTERVAL) is coming from. I remember a server design in the past that would issue events like this to warn that the timing was bad, to see if the callback wanted to terminate a sequence of events with an error. But that server allowed RE-ISSUING the delayed event. The call to TS_EVENT_HTTP_CONTINUE will not allow the server to re-issue the clobbered event. The code that generates the transaction events only goes one way -- the code in proxy/HttpSM.cc HttpSM::state_api_callout seems to my eyes to only allow an incrementing series of events. ...So if we miss the cue/event that tells us that we need to handle some critical step because the event code was masked by a "TIMEOUT" (whatever that is?) then it would be impossible for the code to follow a normal path because the state information would be lost. It is hard for me to believe that the designers would deliberately destroy state information like this (the HttpSM class, for example maintains the state that gets passed down, insulating the callback from the event system), so something else must be going on. Now, under high levels of load, we see a number of errors. I am wondering if at least some of these errors are from housekeeping events on stale TSCont's being woken up by some "garbage collector" I have yet to find in ATS... And these stale TSConts are doing damage because they point into re-incarnated HttpSM memory slots? Also, there is another "sister" crash showing up in the logs that we do not have a core dump for yet, but we do have this: FATAL: InkAPI.cc:968: failed assert `!"Plugin tries to use a continuation which is deleted"` /usr/local/bin/traffic_server - STACK TRACE: /usr/local/lib/trafficserver/libtsutil.so.3(ink_fatal_va+0x9d)[0x2b701f5f92 ad] /usr/local/lib/trafficserver/libtsutil.so.3(ink_fatal+0x88)[0x2b701f5f9408] /usr/local/lib/trafficserver/libtsutil.so.3(_ink_assert+0x85)[0x2b701f5f7b9 5] /usr/local/bin/traffic_server(_ZN15INKContInternal12handle_eventEiPv+0xc7)[ 0x4aa247] /usr/local/bin/traffic_server(_ZN7EThread13process_eventEP5Eventi+0x22f)[0x 6c138f] /usr/local/bin/traffic_server(_ZN7EThread7executeEv+0x3db)[0x6c1abb] /usr/local/bin/traffic_server[0x6c06ce] /lib64/libpthread.so.0[0x344fc0673d] /lib64/libc.so.6(clone+0x6d)[0x344f4d3d1d] I suspect they may have the same cause. What is the proper thing to do here? I don't know if this is an ATS bug, a plug-in bug, both, many, or what it is. The leaky continuations idea is wild guess number 100 or thereabouts of what is causing this crash. If I ignore this, and just return 0 on sight of event==2 is something really bad going to happen, like a house cleaning function clogging the priority queues with leaked continuations trying to wake them up to get the owner to clean them up? Server's usually issue things like this for a reason. It would be nice to have that reason documented, so that people like myself don't find myself running around on a wild goose chase. Thanks, James Whitaker
