I'm not sure if this will fully resolve your issue, but change your TSContCall()s so that they don't use TS_EVENT_TIMEOUT or TS_EVENT_IMMEDIATE; doing so causes the event count to become negative. This is a known bug (TS-935) that I'll be working on a fix for. Can you try changing these to a new event type and let me know if that helps?
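If it helps, here is a minimal sketch of the kind of change I mean; the plugin-defined event value and the handler name are placeholders for illustration, not taken from your code:

    #include <ts/ts.h>

    /* A plugin-defined event number, chosen well outside the TS_EVENT_*
       range so it cannot be mistaken for TS_EVENT_IMMEDIATE or
       TS_EVENT_TIMEOUT. The exact value is an arbitrary choice. */
    #define MY_PLUGIN_WAKEUP_EVENT 100000

    static int
    my_cont_handler(TSCont contp, TSEvent event, void *edata)
    {
      if (event == (TSEvent)MY_PLUGIN_WAKEUP_EVENT) {
        /* ... do whatever the TS_EVENT_IMMEDIATE / TS_EVENT_TIMEOUT
           TSContCall() used to trigger ... */
      }
      return 0;
    }

    /* Instead of  TSContCall(contp, TS_EVENT_IMMEDIATE, NULL);  call: */
    static void
    wake_continuation(TSCont contp)
    {
      TSContCall(contp, (TSEvent)MY_PLUGIN_WAKEUP_EVENT, NULL);
    }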
Brian

On Fri, Feb 17, 2012 at 10:21 PM, Brian Geffon <[email protected]> wrote:

> Just out of curiosity, are you by any chance doing a TSContCall() with a TS_EVENT_IMMEDIATE or TS_EVENT_TIMEOUT?
>
> Brian
>
> 2012/2/17 Whitaker, James X. -ND <[email protected]>
>
>> Hi All,
>>
>> Sorry for being long, and babbling, but it's getting late in the day, and this needs to get fixed. There is a fair amount of information, and it's hard to organize it.
>>
>> ATS with our plugin is crashing... because of an unexpected event "2", evidently created by something inside ATS that appears to be going berserk, attempting to retry an event, or perhaps trying to demand that our plug-in clean up after itself. What I need to know is: what is the meaning of this "event number 2", where in ATS code is it being scheduled, and what should we be doing? How do I track down where this is coming from?
>>
>> I have run through about 100 hypotheses now about where the crash might be coming from, and each time I prove myself wrong. On the other hand, I am learning a lot about ATS innards, making me wonder if I am soon to be committer material... I am amazed at how difficult it is to describe this crash in a post - this is one of several attempts.
>>
>> It will sometimes run for hours without incident. Sometimes for maybe a minute.
>>
>> I have been assuming that we were doing something wrong in our plugin. But for the life of me, I can't seem to see what we are doing to contribute to this, or what part of ATS code could be objecting to what we are doing. The more I dig around, the more the nature of the crash just gets more and more baffling…
>>
>> Version is 3.0.1, with our plugin, running on RedHat EL5.
>>
>> The stack trace in the core file consists of EThread::execute calling into EThread::process_event, which calls the callback function for INKContInternal, which calls into the event handling function we registered with TSContCreate for a continuation used to assist the HttpSM/TSHttpTxn class, with an event code of TS_EVENT_TIMEOUT. We switch on the event parameter of the callback, and the default switch case catches this event #2 (TS_EVENT_TIMEOUT) and does a TSHttpTxnReenable on the transaction. The crash occurs when a sanity check on the validity of the transaction fails as we attempt to do the obvious: TSHttpTxnReenable on what appears to be a valid transaction pointer. ATS asserts and shuts down. m_deletable is 1, m_deleted is 0, so it is not the result of an attempt to destroy the object... Apparently something is merely trying to give it a nudge and wake it up? The object shows up with a live magic number, but the HttpSM pointer coming in through arg3 appears to point into garbled memory (with a dead-flag magic number).
>>
>> In other words, TSConts intended for assisting HttpSM objects are coming to life on their own somehow, or perhaps as an unintended side effect of something we are doing? In this case it is happening after a transaction is over... or perhaps after an error? Something put it on the event queue, and I have no idea how it is getting there, or why. I haven't found the code that does it. It certainly does not appear to be the product of anything in our plug-in, and I haven't found documentation of why TS_EVENT_TIMEOUT is there, either.
>>
>> Our code uses TSHttpTxnReenable to bump the transaction along when the callback is done processing each stage. Nothing in our code puts this TSCont, which is linked to our transaction state machine callback function, directly on the event queue; it is only used in our code to assist the HttpSM class, which hooks it with the apihooks.
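For reference, the per-stage pattern described above usually has roughly the following shape; the particular hooks and the handler name are assumptions for illustration, not the actual plugin code:

    #include <ts/ts.h>

    /* Sketch of a per-transaction continuation handler that advances the
     * transaction at each hooked stage. The default case mirrors the
     * behavior described above: it is where the unexpected event 2
     * (TS_EVENT_TIMEOUT) lands and gets a TSHttpTxnReenable even though
     * edata may no longer point at a live HttpSM. */
    static int
    txn_stage_handler(TSCont contp, TSEvent event, void *edata)
    {
      TSHttpTxn txnp = (TSHttpTxn)edata;

      switch (event) {
      case TS_EVENT_HTTP_READ_REQUEST_HDR:
        /* ... per-stage work, possibly adding further hooks ... */
        TSHttpTxnReenable(txnp, TS_EVENT_HTTP_CONTINUE);
        break;
      case TS_EVENT_HTTP_SEND_RESPONSE_HDR:
        /* ... per-stage work ... */
        TSHttpTxnReenable(txnp, TS_EVENT_HTTP_CONTINUE);
        break;
      default:
        /* This is the case the crash is hitting. */
        TSHttpTxnReenable(txnp, TS_EVENT_HTTP_CONTINUE);
        break;
      }
      return 0;
    }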
>> I don't see this kind of behavior documented anywhere.
>>
>> The best theory so far is that certain error conditions might cause us to leak the TSConts that we use to assist the HttpSM/TSHttpTxn class, and these leaked TSConts are being woken up and put on the event queue long after the transactions disappear from memory and are cleaned up. I can't figure out where they might be getting woken up, or why, or why a once-in-a-blue-moon leaked TSCont that was long abandoned and was never on the event queue would be of interest to anything or anyone. The only evidence that they might be leaking is that we see them linked to non-operating HttpSM objects in the core files. There is at least one spot in the code where the callback function issues a Reenable on the transaction object into the error state without doing a destroy on the TSCont.
>>
>> Comments in the blacklist example say that TS_EVENT_TIMEOUT will show up with a transaction pointer of NULL, which we are not seeing. Instead we are seeing a non-null pointer. (Also, the blacklist example for how to handle the timeout looks wrong: it repeats the previous processing step, rather than going to the 'next state' that should occur after the previous one finished. Why would it want to repeat?)
>>
>> In some of the examples and test code, we see TS_EVENT_TIMEOUT and handlers that have inconsistent ways of dealing with the event. We have not found documentation on it.
>>
>> The knee-jerk reaction was to pull in the transaction sanity check code from InkAPI.cc by declaring a prototype for it at the top of our plugin file and calling it to decide whether to do the TxnReenable. However, the HttpSM objects wind up on freelists, which means the memory gets re-allocated. If we then get a call on a stale TSCont that points into the freelist memory of HttpSMs, we would likely point into a "live" one, in which case we would be operating on the WRONG HttpSM, wreaking havoc. How do we know whether we are pointing to the original HttpSM object, and not a new one in its memory slot?
>>
>> If we get a timeout, do we just destroy the continuation? Is that any kind of answer? What if the transaction is still using it? Do we ignore the event? What are the repercussions of ignoring the event? Will we wind up with connections that die in mid-stream?
>>
>> In our plugin's main continuation callback handler, when we get a TS_EVENT_HTTP_TXN_START, we set up a transaction-assisting continuation with TSContCreate, add a hook for TS_HTTP_READ_REQUEST_HDR_HOOK into an event handler/callback, and then, at the start of the transaction processing when that event arrives, add a series of hooks back into the callback to process the various stages of the transaction.
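A minimal sketch of that kind of per-transaction setup, reusing the handler from the earlier sketch, might look like this; the TS_HTTP_TXN_CLOSE_HOOK is my addition, since destroying the continuation there is the usual way to keep it from outliving its transaction, which is relevant to the leaked-continuation theory above:

    #include <ts/ts.h>

    static int txn_stage_handler(TSCont contp, TSEvent event, void *edata);

    /* Global continuation handler: on TS_EVENT_HTTP_TXN_START, create a
     * per-transaction continuation and hook it on this transaction only. */
    static int
    global_handler(TSCont contp, TSEvent event, void *edata)
    {
      TSHttpTxn txnp = (TSHttpTxn)edata;

      if (event == TS_EVENT_HTTP_TXN_START) {
        TSCont txn_cont = TSContCreate(txn_stage_handler, TSMutexCreate());
        TSHttpTxnHookAdd(txnp, TS_HTTP_READ_REQUEST_HDR_HOOK, txn_cont);
        /* Hooking TXN_CLOSE gives the handler a guaranteed point at which
         * to TSContDestroy() the continuation, even on error paths. */
        TSHttpTxnHookAdd(txnp, TS_HTTP_TXN_CLOSE_HOOK, txn_cont);
        TSHttpTxnReenable(txnp, TS_EVENT_HTTP_CONTINUE);
      }
      return 0;
    }

    void
    TSPluginInit(int argc, const char *argv[])
    {
      (void)argc;
      (void)argv;
      TSHttpHookAdd(TS_HTTP_TXN_START_HOOK, TSContCreate(global_handler, NULL));
    }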
>> I have been digging around for a while now, trying to figure out what line or lines of code could possibly be generating this event "2". It seems like knowing where that event code is being generated is critical to knowing what to do about the anomaly. Looking at code from PluginVC.cc and other places, it looks to my eyes as though a general design principle in ATS is to re-schedule an event but NOT preserve the original event code when rescheduling. In many places the practice appears to be that if the queued-up event somehow does not get a mutex lock, the event will be re-issued, but the event type will be clobbered with the default event type of EVENT_INTERVAL found in the various header files in iocore/eventsystem for the schedule_* methods of the event and event processor. I suspect that this sort of thing is where the event ID of "2" (EVENT_INTERVAL) is coming from.
>>
>> I remember a server design in the past that would issue events like this to warn that the timing was bad, to see if the callback wanted to terminate a sequence of events with an error. But that server allowed RE-ISSUING the delayed event. The reenable call with TS_EVENT_HTTP_CONTINUE will not allow the server to re-issue the clobbered event. The code that generates the transaction events only goes one way -- the code in proxy/HttpSM.cc HttpSM::state_api_callout seems to my eyes to only allow an incrementing series of events.
>>
>> ...So if we miss the cue/event that tells us that we need to handle some critical step, because the event code was masked by a "TIMEOUT" (whatever that is?), then it would be impossible for the code to follow a normal path, because the state information would be lost. It is hard for me to believe that the designers would deliberately destroy state information like this (the HttpSM class, for example, maintains the state that gets passed down, insulating the callback from the event system), so something else must be going on.
>>
>> Now, under high levels of load, we see a number of errors. I am wondering if at least some of these errors are from housekeeping events on stale TSConts being woken up by some "garbage collector" I have yet to find in ATS... and whether these stale TSConts are doing damage because they point into re-incarnated HttpSM memory slots?
>>
>> Also, there is another "sister" crash showing up in the logs that we do not have a core dump for yet, but we do have this:
>>
>> FATAL: InkAPI.cc:968: failed assert `!"Plugin tries to use a continuation which is deleted"`
>> /usr/local/bin/traffic_server - STACK TRACE:
>> /usr/local/lib/trafficserver/libtsutil.so.3(ink_fatal_va+0x9d)[0x2b701f5f92ad]
>> /usr/local/lib/trafficserver/libtsutil.so.3(ink_fatal+0x88)[0x2b701f5f9408]
>> /usr/local/lib/trafficserver/libtsutil.so.3(_ink_assert+0x85)[0x2b701f5f7b95]
>> /usr/local/bin/traffic_server(_ZN15INKContInternal12handle_eventEiPv+0xc7)[0x4aa247]
>> /usr/local/bin/traffic_server(_ZN7EThread13process_eventEP5Eventi+0x22f)[0x6c138f]
>> /usr/local/bin/traffic_server(_ZN7EThread7executeEv+0x3db)[0x6c1abb]
>> /usr/local/bin/traffic_server[0x6c06ce]
>> /lib64/libpthread.so.0[0x344fc0673d]
>> /lib64/libc.so.6(clone+0x6d)[0x344f4d3d1d]
>>
>> I suspect they may have the same cause.
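For reference, the numbering that makes EVENT_INTERVAL surface to a plugin as TS_EVENT_TIMEOUT looks roughly like this in the 3.x headers (paraphrased from memory, with a placeholder typedef name; worth confirming against ts/ts.h and iocore/eventsystem before relying on it):

    /* Public API side (ts/ts.h), paraphrased: */
    typedef enum {
      TS_EVENT_NONE      = 0,
      TS_EVENT_IMMEDIATE = 1,
      TS_EVENT_TIMEOUT   = 2,  /* the "event 2" seen in the core files */
      TS_EVENT_ERROR     = 3,
      TS_EVENT_CONTINUE  = 4
      /* ... */
    } TSEventSketch;

    /* Event-system side (iocore/eventsystem), paraphrased: */
    #define EVENT_NONE      0
    #define EVENT_IMMEDIATE 1
    #define EVENT_INTERVAL  2  /* default code used by schedule_*() when an
                                  event is (re)scheduled without an explicit
                                  code; a plugin sees it as the value 2,
                                  i.e. TS_EVENT_TIMEOUT */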
>> What is the proper thing to do here? I don't know whether this is an ATS bug, a plug-in bug, both, many, or what it is. The leaked-continuations idea is wild guess number 100 or thereabouts about what is causing this crash. If I ignore this and just return 0 on sight of event == 2, is something really bad going to happen, like a housekeeping function clogging the priority queues with leaked continuations while trying to wake them up to get their owner to clean them up?
>>
>> Servers usually issue things like this for a reason. It would be nice to have that reason documented, so that people like myself don't find ourselves running around on a wild goose chase.
>>
>> Thanks,
>>
>> James Whitaker
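Until the source of the stray event is pinned down, one defensive shape for the per-transaction handler is to touch the transaction only for events the plugin actually hooked, free the continuation at TXN_CLOSE, and log-and-ignore everything else rather than reenabling a pointer that may be stale. This is a sketch under the same assumptions as the earlier ones; the handler name and debug tag are placeholders, not code from either message:

    #include <ts/ts.h>

    static int
    txn_stage_handler(TSCont contp, TSEvent event, void *edata)
    {
      TSHttpTxn txnp = (TSHttpTxn)edata;

      switch (event) {
      case TS_EVENT_HTTP_READ_REQUEST_HDR:
        /* ... per-stage work ... */
        TSHttpTxnReenable(txnp, TS_EVENT_HTTP_CONTINUE);
        break;
      case TS_EVENT_HTTP_TXN_CLOSE:
        /* Guaranteed end-of-transaction point: let the transaction finish,
         * then free the continuation so it cannot outlive its HttpSM. */
        TSHttpTxnReenable(txnp, TS_EVENT_HTTP_CONTINUE);
        TSContDestroy(contp);
        break;
      default:
        /* Unexpected event (e.g. a stray TS_EVENT_TIMEOUT): edata is not
         * known to be a live transaction here, so do not reenable it. */
        TSDebug("example-plugin", "ignoring unexpected event %d", (int)event);
        break;
      }
      return 0;
    }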
