FYI, we're using Summingbird in production, not Trident. However, Summingbird does not give you exactly-once semantics; it does give you a higher level of abstraction than the Storm API, though.
On Wed, Apr 9, 2014 at 2:50 PM, Jason Jackson <[email protected]> wrote:

> I have one theory that because reads in ZooKeeper are eventually consistent, this is a necessary condition for the bug to manifest. One way to test this hypothesis is to run a ZooKeeper ensemble with 1 node, or an ensemble configured for 5 nodes with 2 of them taken offline, so that every write operation only succeeds if every member of the ensemble sees the write. This should produce strongly consistent reads. If you run this test, let me know what the results are. (Clearly this isn't a good production setup, since you're trading availability for consistency, but the results could help narrow down the bug.)
>
> On Wed, Apr 9, 2014 at 2:43 PM, Jason Jackson <[email protected]> wrote:
>
>> Yeah, it's probably a bug in Trident. It would be amazing if someone figured out the fix for this. I spent about 6 hours looking into it, but couldn't figure out why it was occurring.
>>
>> Beyond fixing this, one thing you could do to buy yourself time is to disable batch retries in Trident. There's no option for this in the API, but it's only a one- or two-line change to the code. Obviously you lose exactly-once semantics, but at least you would have a system that never falls behind real time.
>>
>> On Wed, Apr 9, 2014 at 1:10 AM, Danijel Schiavuzzi <[email protected]> wrote:
>>
>>> Thanks Jason. However, I don't think that was the case in my stuck topology, otherwise I'd have seen exceptions (thrown by my Trident functions) in the worker logs.
>>>
>>> On Wed, Apr 9, 2014 at 3:02 AM, Jason Jackson <[email protected]> wrote:
>>>
>>>> An example of "corrupted input" that causes a batch to fail would be if you expected the data you read off Kafka (or some other queue) to conform to a schema, and for whatever reason it didn't, and the function you pass to stream.each() throws an exception when this unexpected situation occurs. The batch would then be retried, but since it fails deterministically, it will be retried forever.
>>>>
>>>> On Mon, Apr 7, 2014 at 10:37 AM, Danijel Schiavuzzi <[email protected]> wrote:
>>>>
>>>>> Hi Jason,
>>>>>
>>>>> Could you be more specific -- what do you mean by "corrupted input"? Do you mean that there's a bug in Trident itself that causes the tuples in a batch to somehow become corrupted?
>>>>>
>>>>> Thanks a lot!
>>>>>
>>>>> Danijel
>>>>>
>>>>> On Monday, April 7, 2014, Jason Jackson <[email protected]> wrote:
>>>>>
>>>>>> This could happen if you have corrupted input that always causes a batch to fail and be retried.
>>>>>>
>>>>>> I have seen this behaviour before and I didn't see corrupted input. It might be a bug in Trident, I'm not sure. If you figure it out, please update this thread and/or submit a patch.
>>>>>>
>>>>>> On Mon, Mar 31, 2014 at 7:39 AM, Danijel Schiavuzzi <[email protected]> wrote:
>>>>>>
>>>>>> To (partially) answer my own question -- I still have no idea on the cause of the stuck topology, but re-submitting the topology helps: after re-submitting, my topology is now running normally.
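As a concrete illustration of the "corrupted input" scenario Jason describes above, the following sketch (assuming Storm 0.9.x package names and an invented "userId,amount" message format) shows a function passed to stream.each() that throws on a malformed message. Because the same tuple comes back on every retry, the batch fails deterministically and is replayed forever; catching the error and skipping or dead-lettering the offending tuple is one way to avoid wedging the topology.

    import backtype.storm.tuple.Values;
    import storm.trident.operation.BaseFunction;
    import storm.trident.operation.TridentCollector;
    import storm.trident.tuple.TridentTuple;

    public class ParseEventFunction extends BaseFunction {
        @Override
        public void execute(TridentTuple tuple, TridentCollector collector) {
            // Illustrative schema: each message is expected to be "userId,amount".
            String raw = tuple.getString(0);
            String[] parts = raw.split(",");
            if (parts.length != 2) {
                // Deterministic failure: the batch containing this tuple fails on
                // every attempt, so Trident retries it forever and never advances.
                throw new IllegalArgumentException("Unexpected message format: " + raw);
            }
            collector.emit(new Values(parts[0], Long.parseLong(parts[1])));
        }
    }

Wired in as stream.each(new Fields("str"), new ParseEventFunction(), new Fields("userId", "amount")), a single malformed message is enough to stall the whole topology unless the function handles it gracefully.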
>>>>>>
>>>>>> On Wed, Mar 26, 2014 at 6:04 PM, Danijel Schiavuzzi <[email protected]> wrote:
>>>>>>
>>>>>> Also, I did have multiple cases of my IBackingMap workers dying (because of RuntimeExceptions) but successfully restarting afterwards (I throw RuntimeExceptions in the BackingMap implementation as my strategy in rare SQL database deadlock situations, to force a worker restart and to fail and retry the batch).
>>>>>>
>>>>>> From the logs, one such IBackingMap worker death (and subsequent restart) resulted in the Kafka spout re-emitting the pending tuple:
>>>>>>
>>>>>> 2014-03-22 16:26:43 s.k.t.TridentKafkaEmitter [INFO] re-emitting batch, attempt 29698959:736
>>>>>>
>>>>>> This is of course the normal behavior of a transactional topology, but this is the first time I've encountered a batch retrying indefinitely. It is especially suspicious since the topology had been running fine for 20 days straight, re-emitting batches and restarting IBackingMap workers quite a number of times.
>>>>>>
>>>>>> I can see in my IBackingMap's backing SQL database that the batch with the exact txid value 29698959 has been committed -- but I suspect that could come from another BackingMap instance, since there are two of them running (parallelismHint 2).
>>>>>>
>>>>>> However, I have no idea why the batch is being retried indefinitely now, nor why it hasn't been successfully acked by Trident.
>>>>>>
>>>>>> Any suggestions on the area (topology component) to focus my research on?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> On Wed, Mar 26, 2014 at 5:32 PM, Danijel Schiavuzzi <[email protected]> wrote:
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I'm having problems with my transactional Trident topology. It had been running fine for about 20 days, and suddenly it is stuck processing a single batch, with no tuples being emitted nor persisted by the TridentState (IBackingMap).
>>>>>>
>>>>>> It's a simple topology which consumes messages off a Kafka queue. The spout is an instance of the storm-kafka-0.8-plus TransactionalTridentKafkaSpout, and I use the trident-mssql transactional TridentState implementation to persistentAggregate() data into a SQL database.
>>>>>>
>>>>>> In ZooKeeper I can see Storm is retrying a batch, i.e.
>>>>>>
>>>>>> "/transactional/<myTopologyName>/coordinator/currattempts" is "{"29698959":6487}"
>>>>>>
>>>>>> ... and the attempt count keeps increasing. It seems the batch with txid 29698959 is stuck: it isn't being acked by Trident and I have no idea why, especially since the topology has been running successfully for the last 20 days.
>>>>>>
>>>>>> I did rebalance the topology on one occasion, after which it continued running normally. Other than that, no other modifications were done. Storm is at version 0.9.0.1.
>>>>>>
>>>>>> Any hints on how to debug the stuck topology? Any other useful info I might provide?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> --
>>>>>> Danijel Schiavuzzi
>>>>>>
>>>>>> E: [email protected]
>>>>>> W: www.schiavuzzi.com
>>>>>> T: +385989035562
>>>>>> Skype: danijel.schiavuzzi
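For reference, the worker-restart strategy Danijel describes above -- throwing a RuntimeException from the IBackingMap when the SQL database reports a deadlock, so that Storm restarts the worker and the coordinator re-emits the batch with the same txid -- might look roughly like the following sketch. Package names match Storm 0.9.x; SqlBackingMap and the error-code comment are illustrative assumptions, not part of Storm or trident-mssql.

    import java.sql.SQLException;
    import java.util.List;

    import storm.trident.state.map.IBackingMap;

    public class DeadlockFailingBackingMap<T> implements IBackingMap<T> {

        /** Hypothetical JDBC-backed store, used only for this illustration. */
        public interface SqlBackingMap<V> {
            List<V> multiGet(List<List<Object>> keys) throws SQLException;
            void multiPut(List<List<Object>> keys, List<V> vals) throws SQLException;
        }

        private final SqlBackingMap<T> delegate;

        public DeadlockFailingBackingMap(SqlBackingMap<T> delegate) {
            this.delegate = delegate;
        }

        @Override
        public List<T> multiGet(List<List<Object>> keys) {
            try {
                return delegate.multiGet(keys);
            } catch (SQLException e) {
                throw new RuntimeException("multiGet failed; forcing worker restart", e);
            }
        }

        @Override
        public void multiPut(List<List<Object>> keys, List<T> vals) {
            try {
                delegate.multiPut(keys, vals);
            } catch (SQLException e) {
                // SQL Server flags deadlock victims with error code 1205. For simplicity
                // this sketch rethrows on any SQLException: the unchecked exception kills
                // the worker, Storm restarts it, and the coordinator re-emits the batch
                // with the same txid, so a transactional state keeps its guarantees.
                throw new RuntimeException("multiPut failed (deadlock? errorCode="
                        + e.getErrorCode() + "); forcing batch retry", e);
            }
        }
    }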
