I'm not sure if this will fully resolve your issue, but change your TSContCall()s so that they don't use TS_EVENT_TIMEOUT or TS_EVENT_IMMEDIATE; doing so causes the event count to become negative. This is a known bug (TS-935) that I'll be working on a fix for. Can you try changing these to a new event type and let me know if that helps?
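If it helps, here is a minimal sketch of the kind of change I mean; the plugin-defined event value and the handler name are placeholders for illustration, not taken from your code:

    #include <ts/ts.h>

    /* A plugin-defined event number, chosen well outside the TS_EVENT_*
       range so it cannot be mistaken for TS_EVENT_IMMEDIATE or
       TS_EVENT_TIMEOUT. The exact value is an arbitrary choice. */
    #define MY_PLUGIN_WAKEUP_EVENT 100000

    static int
    my_cont_handler(TSCont contp, TSEvent event, void *edata)
    {
      if (event == (TSEvent)MY_PLUGIN_WAKEUP_EVENT) {
        /* ... do whatever the TS_EVENT_IMMEDIATE / TS_EVENT_TIMEOUT
           TSContCall() used to trigger ... */
      }
      return 0;
    }

    /* Instead of  TSContCall(contp, TS_EVENT_IMMEDIATE, NULL);  call: */
    static void
    wake_continuation(TSCont contp)
    {
      TSContCall(contp, (TSEvent)MY_PLUGIN_WAKEUP_EVENT, NULL);
    }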
Brian

On Fri, Feb 17, 2012 at 10:21 PM, Brian Geffon <[email protected]> wrote:

> Just out of curiosity, are you by any chance doing a TSContCall() with a TS_EVENT_IMMEDIATE or TS_EVENT_TIMEOUT?
>
> Brian
>
> 2012/2/17 Whitaker, James X. -ND <[email protected]>
>
>> Hi All,
>>
>> Sorry for being long, and babbling, but it's getting late in the day, and this needs to get fixed. There is a fair amount of information, and it's hard to organize it.
>>
>> ATS with our plugin is crashing... because of an unexpected event "2", evidently created by something inside ATS that appears to be going berserk, attempting to retry an event, or perhaps trying to demand that our plug-in clean up after itself. What I need to know is: what is the meaning of this "event number 2", where in ATS code is it being scheduled, and what should we be doing? How do I track down where this is coming from?
>>
>> I have run through about 100 hypotheses now about where the crash might be coming from, and each time I prove myself wrong. On the other hand, I am learning a lot about ATS innards, making me wonder if I am soon to be committer material... I am amazed at how difficult it is to describe this crash in a post - this is one of several attempts.
>>
>> It will sometimes run for hours without incident. Sometimes for maybe a minute.
>>
>> I have been assuming that we were doing something wrong in our plugin. But for the life of me, I can't seem to see what we are doing to contribute to this, or what part of ATS code could be objecting to what we are doing. The more I dig around, the more the nature of the crash just gets more and more baffling…
>>
>> Version is 3.0.1, with our plugin, running on RedHat EL5.
>>
>> The stack trace in the core file consists of EThread::execute calling into EThread::process_event, which calls the callback function for INKContInternal, which calls into the event handling function we registered with TSContCreate for a continuation used to assist the HttpSM/TSHttpTxn class, with an event code of TS_EVENT_TIMEOUT. We switch on the event parameter of the callback, and the default switch case catches this event #2 (TS_EVENT_TIMEOUT) and does a TSHttpTxnReenable on the transaction. The crash occurs when a sanity check on the validity of the transaction fails as we attempt to do the obvious: TSHttpTxnReenable on what appears to be a valid transaction pointer. ATS asserts and shuts down. m_deletable is 1, m_deleted is 0, so it is not the result of an attempt to destroy the object... Apparently something is merely trying to give it a nudge and wake it up? The object shows up with a live magic number, but the HttpSM pointer coming in through arg3 appears to point into garbled memory (with a dead-flag magic number).
>>
>> In other words, TSConts intended for assisting HttpSM objects are coming to life on their own somehow, or perhaps as an unintended side effect of something we are doing? In this case it is happening after a transaction is over... or perhaps after an error? Something put it on the event queue, and I have no idea how it is getting there, or why. I haven't found the code that does it. It certainly does not appear to be the product of anything in our plug-in, and I haven't found documentation of why TS_EVENT_TIMEOUT is there, either.
>>
>> Our code uses TSHttpTxnReenable to bump the transaction along when the callback is done processing each stage. Nothing in our code puts this TSCont, which is linked to our transaction state machine callback function, directly on the event queue; it is only used in our code to assist the HttpSM class, which hooks it with the apihooks.
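For reference, the per-stage pattern described above usually has roughly the following shape; the particular hooks and the handler name are assumptions for illustration, not the actual plugin code:

    #include <ts/ts.h>

    /* Sketch of a per-transaction continuation handler that advances the
     * transaction at each hooked stage. The default case mirrors the
     * behavior described above: it is where the unexpected event 2
     * (TS_EVENT_TIMEOUT) lands and gets a TSHttpTxnReenable even though
     * edata may no longer point at a live HttpSM. */
    static int
    txn_stage_handler(TSCont contp, TSEvent event, void *edata)
    {
      TSHttpTxn txnp = (TSHttpTxn)edata;

      switch (event) {
      case TS_EVENT_HTTP_READ_REQUEST_HDR:
        /* ... per-stage work, possibly adding further hooks ... */
        TSHttpTxnReenable(txnp, TS_EVENT_HTTP_CONTINUE);
        break;
      case TS_EVENT_HTTP_SEND_RESPONSE_HDR:
        /* ... per-stage work ... */
        TSHttpTxnReenable(txnp, TS_EVENT_HTTP_CONTINUE);
        break;
      default:
        /* This is the case the crash is hitting. */
        TSHttpTxnReenable(txnp, TS_EVENT_HTTP_CONTINUE);
        break;
      }
      return 0;
    }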
>> I don't see this kind of behavior documented anywhere.
>>
>> The best theory so far is that certain error conditions might cause us to leak the TSConts that we use to assist the HttpSM/TSHttpTxn class, and these leaked TSConts are being woken up and put on the event queue long after the transactions disappear from memory and are cleaned up. I can't figure out where they might be getting woken up, or why, or why a once-in-a-blue-moon leaked TSCont that was long abandoned and was never on the event queue would be of interest to anything or anyone. The only evidence that they might be leaking is that we see them linked to non-operating HttpSM objects in the core files. There is at least one spot in the code where the callback function issues a Reenable on the transaction object into the error state without doing a destroy on the TSCont.
>>
>> Comments in the blacklist example say that TS_EVENT_TIMEOUT will show up with a transaction pointer of NULL, which we are not seeing. Instead we are seeing a non-null pointer. (Also, the blacklist example for how to handle the timeout looks wrong: it repeats the previous processing step, rather than going to the 'next state' that should occur after the previous one finished. Why would it want to repeat?)
>>
>> In some of the examples and test code, we see TS_EVENT_TIMEOUT and handlers that have inconsistent ways of dealing with the event. We have not found documentation on it.
>>
>> The knee-jerk reaction was to pull in the transaction sanity check code from InkAPI.cc by declaring a prototype for it at the top of our plugin file and calling it to decide whether to do the TxnReenable. However, the HttpSM objects wind up on freelists, which means the memory gets re-allocated. If we then get a call on a stale TSCont that points into the freelist memory of HttpSMs, we would likely point into a "live" one, in which case we would be operating on the WRONG HttpSM, wreaking havoc. How do we know whether we are pointing to the original HttpSM object, and not a new one in its memory slot?
>>
>> If we get a timeout, do we just destroy the continuation? Is that any kind of answer? What if the transaction is still using it? Do we ignore the event? What are the repercussions of ignoring the event? Will we wind up with connections that die in mid-stream?
>>
>> In our plugin's main continuation callback handler, when we get a TS_EVENT_HTTP_TXN_START, we set up a transaction-assisting continuation with TSContCreate, add a hook for TS_HTTP_READ_REQUEST_HDR_HOOK into an event handler/callback, and then, at the start of the transaction processing when that event arrives, add a series of hooks back into the callback to process the various stages of the transaction.
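A minimal sketch of that kind of per-transaction setup, reusing the handler from the earlier sketch, might look like this; the TS_HTTP_TXN_CLOSE_HOOK is my addition, since destroying the continuation there is the usual way to keep it from outliving its transaction, which is relevant to the leaked-continuation theory above:

    #include <ts/ts.h>

    static int txn_stage_handler(TSCont contp, TSEvent event, void *edata);

    /* Global continuation handler: on TS_EVENT_HTTP_TXN_START, create a
     * per-transaction continuation and hook it on this transaction only. */
    static int
    global_handler(TSCont contp, TSEvent event, void *edata)
    {
      TSHttpTxn txnp = (TSHttpTxn)edata;

      if (event == TS_EVENT_HTTP_TXN_START) {
        TSCont txn_cont = TSContCreate(txn_stage_handler, TSMutexCreate());
        TSHttpTxnHookAdd(txnp, TS_HTTP_READ_REQUEST_HDR_HOOK, txn_cont);
        /* Hooking TXN_CLOSE gives the handler a guaranteed point at which
         * to TSContDestroy() the continuation, even on error paths. */
        TSHttpTxnHookAdd(txnp, TS_HTTP_TXN_CLOSE_HOOK, txn_cont);
        TSHttpTxnReenable(txnp, TS_EVENT_HTTP_CONTINUE);
      }
      return 0;
    }

    void
    TSPluginInit(int argc, const char *argv[])
    {
      (void)argc;
      (void)argv;
      TSHttpHookAdd(TS_HTTP_TXN_START_HOOK, TSContCreate(global_handler, NULL));
    }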
>> I have been digging around for a while now, trying to figure out what line or lines of code could possibly be generating this event "2". It seems like knowing where that event code is being generated is critical to knowing what to do about the anomaly. Looking at code from PluginVC.cc and other places, it looks to my eyes as though a general design principle in ATS is to re-schedule an event but NOT preserve the original event code when rescheduling. In many places the practice appears to be that if the queued-up event somehow does not get a mutex lock, the event will be re-issued, but the event type will be clobbered with the default event type of EVENT_INTERVAL found in the various header files in iocore/eventsystem for the schedule_* methods of the event and event processor. I suspect that this sort of thing is where the event ID of "2" (EVENT_INTERVAL) is coming from.
>>
>> I remember a server design in the past that would issue events like this to warn that the timing was bad, to see if the callback wanted to terminate a sequence of events with an error. But that server allowed RE-ISSUING the delayed event. The reenable call with TS_EVENT_HTTP_CONTINUE will not allow the server to re-issue the clobbered event. The code that generates the transaction events only goes one way -- the code in proxy/HttpSM.cc HttpSM::state_api_callout seems to my eyes to only allow an incrementing series of events.
>>
>> ...So if we miss the cue/event that tells us that we need to handle some critical step, because the event code was masked by a "TIMEOUT" (whatever that is?), then it would be impossible for the code to follow a normal path, because the state information would be lost. It is hard for me to believe that the designers would deliberately destroy state information like this (the HttpSM class, for example, maintains the state that gets passed down, insulating the callback from the event system), so something else must be going on.
>>
>> Now, under high levels of load, we see a number of errors. I am wondering if at least some of these errors are from housekeeping events on stale TSConts being woken up by some "garbage collector" I have yet to find in ATS... and whether these stale TSConts are doing damage because they point into re-incarnated HttpSM memory slots?
>>
>> Also, there is another "sister" crash showing up in the logs that we do not have a core dump for yet, but we do have this:
>>
>> FATAL: InkAPI.cc:968: failed assert `!"Plugin tries to use a continuation which is deleted"`
>> /usr/local/bin/traffic_server - STACK TRACE:
>> /usr/local/lib/trafficserver/libtsutil.so.3(ink_fatal_va+0x9d)[0x2b701f5f92ad]
>> /usr/local/lib/trafficserver/libtsutil.so.3(ink_fatal+0x88)[0x2b701f5f9408]
>> /usr/local/lib/trafficserver/libtsutil.so.3(_ink_assert+0x85)[0x2b701f5f7b95]
>> /usr/local/bin/traffic_server(_ZN15INKContInternal12handle_eventEiPv+0xc7)[0x4aa247]
>> /usr/local/bin/traffic_server(_ZN7EThread13process_eventEP5Eventi+0x22f)[0x6c138f]
>> /usr/local/bin/traffic_server(_ZN7EThread7executeEv+0x3db)[0x6c1abb]
>> /usr/local/bin/traffic_server[0x6c06ce]
>> /lib64/libpthread.so.0[0x344fc0673d]
>> /lib64/libc.so.6(clone+0x6d)[0x344f4d3d1d]
>>
>> I suspect they may have the same cause.
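For reference, the numbering that makes EVENT_INTERVAL surface to a plugin as TS_EVENT_TIMEOUT looks roughly like this in the 3.x headers (paraphrased from memory, with a placeholder typedef name; worth confirming against ts/ts.h and iocore/eventsystem before relying on it):

    /* Public API side (ts/ts.h), paraphrased: */
    typedef enum {
      TS_EVENT_NONE      = 0,
      TS_EVENT_IMMEDIATE = 1,
      TS_EVENT_TIMEOUT   = 2,  /* the "event 2" seen in the core files */
      TS_EVENT_ERROR     = 3,
      TS_EVENT_CONTINUE  = 4
      /* ... */
    } TSEventSketch;

    /* Event-system side (iocore/eventsystem), paraphrased: */
    #define EVENT_NONE      0
    #define EVENT_IMMEDIATE 1
    #define EVENT_INTERVAL  2  /* default code used by schedule_*() when an
                                  event is (re)scheduled without an explicit
                                  code; a plugin sees it as the value 2,
                                  i.e. TS_EVENT_TIMEOUT */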
>> What is the proper thing to do here? I don't know whether this is an ATS bug, a plug-in bug, both, many, or what it is. The leaked-continuations idea is wild guess number 100 or thereabouts about what is causing this crash. If I ignore this and just return 0 on sight of event == 2, is something really bad going to happen, like a housekeeping function clogging the priority queues with leaked continuations while trying to wake them up to get their owner to clean them up?
>>
>> Servers usually issue things like this for a reason. It would be nice to have that reason documented, so that people like myself don't find ourselves running around on a wild goose chase.
>>
>> Thanks,
>>
>> James Whitaker
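Until the source of the stray event is pinned down, one defensive shape for the per-transaction handler is to touch the transaction only for events the plugin actually hooked, free the continuation at TXN_CLOSE, and log-and-ignore everything else rather than reenabling a pointer that may be stale. This is a sketch under the same assumptions as the earlier ones; the handler name and debug tag are placeholders, not code from either message:

    #include <ts/ts.h>

    static int
    txn_stage_handler(TSCont contp, TSEvent event, void *edata)
    {
      TSHttpTxn txnp = (TSHttpTxn)edata;

      switch (event) {
      case TS_EVENT_HTTP_READ_REQUEST_HDR:
        /* ... per-stage work ... */
        TSHttpTxnReenable(txnp, TS_EVENT_HTTP_CONTINUE);
        break;
      case TS_EVENT_HTTP_TXN_CLOSE:
        /* Guaranteed end-of-transaction point: let the transaction finish,
         * then free the continuation so it cannot outlive its HttpSM. */
        TSHttpTxnReenable(txnp, TS_EVENT_HTTP_CONTINUE);
        TSContDestroy(contp);
        break;
      default:
        /* Unexpected event (e.g. a stray TS_EVENT_TIMEOUT): edata is not
         * known to be a live transaction here, so do not reenable it. */
        TSDebug("example-plugin", "ignoring unexpected event %d", (int)event);
        break;
      }
      return 0;
    }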
