Re: crashing on TS_EVENT_TIMEOUT

Brian Geffon Fri, 17 Feb 2012 22:21:53 -0800

Just out of curiosity, are you by any chance doing a TSContCall() with a
TS_EVENT_IMMEDIATE or TS_EVENT_TIMEOUT?


Brian

2012/2/17 Whitaker, James X. -ND <[email protected]>

> Hi All,
>
> Sorry for being long, and babbling, but it's getting late in the day, and
> this is needs to get fixed.  There is a fair amount of information, and
> it's hard to organize it.
>
> ATS with our plugin is crashing...  Because of an unexpected event "2",
> evidently created by something inside ATS that appears to be going
> berserk, attempting to retry an event, or perhaps to try to demand that
> our plug-in clean up after itself.  What I need to know is: what is the
> meaning of of this "event number 2", where in ATS code is it being
> scheduled, and what should we be doing?  How do I track down where this is
> coming from?
>
> I have run through about 100 hypothesis now about where the crash might be
> coming from, and each time I prove myself wrong.  On the other hand, I am
> learning a lot about ATS innards, making me wonder if I am soon to be
> committer material...  I am amazed at how difficult it is to describe this
> crash in a post - this is one of several attempts.
>
> It will sometimes run for hours without incident.  Sometimes for maybe a
> minute.
>
> I have been assuming that we were doing something wrong in our plugin.
> But for the life of me, I can't seem to see what we are doing to
> contribute to this, or what part of ATS code could be objecting to what we
> are doing.  The more I dig around, the more the nature of the crash just
> gets more and more baffling…
>
> Version is 3.0.1, with our plugin, running on RedHat EL5.
>
> The stack trace in the core file consists of EThread::execute calling into
> EThread::process_event calling into the callback function for
> INKContInternal, which calls into the event handling function we
> registered with TSContCreate for a continuation used to assist the
> HttpSM/TSHttpTxn class, with an event code of TS_EVENT_TIMEOUT.  We switch
> on the event parameter of the callback, and the default switch case
> catches this event #2 (TS_EVENT_TIMEOUT) which does a TSHttpTxnReenable on
> the transaction. The crash occurs when a sanity check on the validity of
> the transaction fails when we attempt to do the obvious: TSHttpTxnReenable
> on what appears to be a valid transaction pointer.  ATS asserts and shuts
> down.  m_deletable is 1, m_deleted is 0.... So it is not the result of an
> attempt to destroy the object...  Apparently something is merely trying to
> give it a nudge and wake it up?  The object shows up with a live magic
> number, but the HttpSM pointer coming into arg3 appears to point into
> garbled memory (with a dead flag magic number).
>
> In otherwords, TSCont's intended for assisting HttpSM objects are coming
> to life on their own somehow, or perhaps as an unintended side effect of
> something we are doing?  In this case it is happening after a transaction
> is over.... Or perhaps after an error?  Something put it on the event
> cue... And I have no idea how it is getting there, or why.  I haven't
> found the code that does it... It certainly does not appear to be the
> product of anything in our plug-in...  and I haven't found documentation
> of why TS_EVENT_TIMEOUT is there, either.
>
> Our code uses TSHttpTxnReenable to bump the transaction along when the
> callback is done processing each stage.  Nothing in our code puts this
> TSCont linked to our transaction state machine callback function directly
> on the event cue, it is only used in our code to assist the HttpSM class,
> which hooks it with the apihooks.
>
> I don't see this kind of behavior documented anywhere.
>
> Best theory so far is that certain error conditions might cause us to leak
> TSCont's that we use to assist the HttpSM/TSHttpTxn class, and these leaky
> TSCont's are being woken up and put on the event Queue long after the
> Transactions disappear from memory and are cleaned up.  I can't figure out
> where they might be getting woken up, or why, or why a once in a blue-moon
> leaked TSCont that was long abandoned and was never on the event queue
> would be of interest to anything or anyone.  The only evidence that they
> might be leaking is that we see these linked to non-operating HttpSM
> objects in the core files.  There is at least one spot in the code where
> the callback function issues a Reenable on the transaction object into the
> error state without doing a destroy on the TSCont.
>
> Comments in the blacklist example say that the TS_EVENT_TIMEOUT will show
> up with a transaction pointer of NULL, which we are not seeing.  Instead
> we are seeing a non-null pointer.  (Also, the blacklist example for how to
> handle the timeout looks wrong. It looks like the blacklist example
> repeats the previous processing step, rather than going to the 'next
> state' that should occur after the previous one finished.  Why would it
> want to repeat?)
>
> In some of the examples and test code, we see TS_EVENT_TIMEOUT and
> handlers that have inconsistent ways of dealing with the event.  We have
> not found documentation on it.
>
> The knee jerk reaction was to put in a prototype to pull in the
> transaction sanity check code from InkApi.cc by declaring a prototype to
> it at the top of our plugin file, and call it to decide whether to call
> the TxnReenable.  However, the HttpSM objects wind up on freelists, which
> means the memory gets re-allocated.  If we then get a call on a stale
> TSCont that points into the freelist memory of HttpSM's, we would likely
> point into a "live" one, in which case we would be operating on the WRONG
> HttpSM, wreaking havoc.  How do we know whether we are pointing to the
> original HttpSM object, and not a new one in it's memory slot?
>
> If we get a timeout, do we just destroy the continuation?  Is that any
> kind of answer?  What if the transaction is still using it?  Do we ignore
> the event?  What are the repercussions of ignoring the event?  Will we
> wind up with connections that die in mid-stream?
>
> In our plugin's main continuation callback handler, when we get a
> TS_EVENT_HTTP_TXT_START, we set up a transaction assisting continuation
> with TSContCreate, a hook for TS_HTTP_READ_REQUEST_HDR_HOOK into an event
> handler/callback, and at the start of the transaction processing when that
> event arrives, a series of hooks back into the callback to process various
> stages of the transaction.
>
> I have been digging around for a while now, trying to figure out what line
> or lines of code could possibly be generating this event "2".  It seems
> like knowing where that event code is being generated is critical to
> knowing what to do about the anomaly.  Looking at code from PluginVC.cc
> and other places, it looks to my eyes that a general design principal in
> ATS is to follow the practice of re-scheduling an event, but NOT
> preserving the original event code when rescheduling.  There are many
> places that it appears that the practice followed is if the queued up
> event somehow does not get a mutex lock, the event will be re-issued but
> the event type will be clobbered with the default event type of
> EVENT_INTERVAL that is found in the various header files in
> iocore/eventsystem for the schedule_* methods of the event and event
> processor.  I suspect that this type of thing is where the event ID of "2"
> (EVENT_INTERVAL) is coming from.
>
> I remember a server design in the past that would issue events like this
> to warn that the timing was bad, to see if the callback wanted to
> terminate a sequence of events with an error.  But that server allowed
> RE-ISSUING the delayed event.  The call to TS_EVENT_HTTP_CONTINUE will not
> allow the server to re-issue the clobbered event.  The code that generates
> the transaction events only goes one way -- the code in proxy/HttpSM.cc
> HttpSM::state_api_callout seems to my eyes to only allow an incrementing
> series of events.
>
> ...So if we miss the cue/event that tells us that we need to handle some
> critical step because the event code was masked by a "TIMEOUT" (whatever
> that is?) then it would be impossible for the code to follow a normal path
> because the state information would be lost.  It is hard for me to believe
> that the designers would deliberately destroy state information like this
> (the HttpSM class, for example maintains the state that gets passed down,
> insulating the callback from the event system), so something else must be
> going on.
>
> Now, under high levels of load, we see a number of errors.  I am wondering
> if at least some of these errors are from housekeeping events on stale
> TSCont's being woken up by some "garbage collector" I have yet to find in
> ATS...  And these stale TSConts are doing damage because they point into
> re-incarnated HttpSM memory slots?
>
> Also, there is another "sister" crash showing up in the logs that we do
> not have a core dump for yet, but we do have this:
>
> FATAL: InkAPI.cc:968: failed assert `!"Plugin tries to use a continuation
> which is deleted"`
> /usr/local/bin/traffic_server - STACK TRACE:
> /usr/local/lib/trafficserver/libtsutil.so.3(ink_fatal_va+0x9d)[0x2b701f5f92
> ad]
> /usr/local/lib/trafficserver/libtsutil.so.3(ink_fatal+0x88)[0x2b701f5f9408]
> /usr/local/lib/trafficserver/libtsutil.so.3(_ink_assert+0x85)[0x2b701f5f7b9
> 5]
> /usr/local/bin/traffic_server(_ZN15INKContInternal12handle_eventEiPv+0xc7)[
> 0x4aa247]
> /usr/local/bin/traffic_server(_ZN7EThread13process_eventEP5Eventi+0x22f)[0x
> 6c138f]
> /usr/local/bin/traffic_server(_ZN7EThread7executeEv+0x3db)[0x6c1abb]
> /usr/local/bin/traffic_server[0x6c06ce]
> /lib64/libpthread.so.0[0x344fc0673d]
> /lib64/libc.so.6(clone+0x6d)[0x344f4d3d1d]
>
> I suspect they may have the same cause.
>
>
> What is the proper thing to do here?  I don't know if this is an ATS bug,
> a plug-in bug, both, many, or what it is.  The leaky continuations idea is
> wild guess number 100 or thereabouts of what is causing this crash.  If I
> ignore this, and just return 0 on sight of event==2 is something really
> bad going to happen, like a house cleaning function clogging the priority
> queues with leaked continuations trying to wake them up to get the owner
> to clean them up?
>
> Server's usually issue things like this for a reason.  It would be nice to
> have that reason documented, so that people like myself don't find myself
> running around on a wild goose chase.
>
> Thanks,
>
> James Whitaker
>
>

Re: crashing on TS_EVENT_TIMEOUT

Reply via email to