Hey Jason,

Do you know a way to reliably reproduce this? If so, can you share the steps?

-Taylor

> On Apr 9, 2014, at 5:52 PM, Jason Jackson <[email protected]> wrote:
> 
> FYI, we're using Summingbird in production, not Trident. Note that 
> Summingbird does not give you exactly-once semantics; it does give you a 
> higher level of abstraction than the Storm API, though. 
> 
> 
>> On Wed, Apr 9, 2014 at 2:50 PM, Jason Jackson <[email protected]> wrote:
>> One theory I have is that the eventual consistency of ZooKeeper reads is 
>> a necessary condition for the bug to manifest. One way to test this 
>> hypothesis is to run a ZooKeeper ensemble with a single node, or an 
>> ensemble configured for 5 nodes with 2 of them taken offline, so that 
>> every write succeeds only if every live member of the ensemble sees it. 
>> Either setup should give you strongly consistent reads. If you run this 
>> test, let me know what the results are. (Clearly this isn't a good 
>> production configuration, as you're trading availability for greater 
>> consistency, but the results could help narrow down the bug.)
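>> 
>> To make the single-node variant concrete, a minimal standalone zoo.cfg 
>> might look like the sketch below (paths and port are placeholders -- 
>> adjust to your environment):
>> 
>>     # zoo.cfg -- standalone, single-node ensemble (a sketch)
>>     # With one server there are no followers to lag behind, so every
>>     # read observes the latest acknowledged write.
>>     tickTime=2000
>>     dataDir=/var/lib/zookeeper
>>     clientPort=2181
>>     # Note: no server.N lines -- their absence is what makes this a
>>     # standalone server rather than a replicated ensemble.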
>> 
>> 
>>> On Wed, Apr 9, 2014 at 2:43 PM, Jason Jackson <[email protected]> wrote:
>>> Yah, it's probably a bug in Trident. It would be amazing if someone 
>>> figured out the fix for this. I spent about 6 hours looking into it, but 
>>> couldn't figure out why it was occurring. 
>>> 
>>> Beyond fixing this, one thing you could do to buy yourself time is to 
>>> disable batch retries in Trident. There's no option for this in the API, 
>>> but it's a 1- or 2-line change to the code. Obviously you lose 
>>> exactly-once semantics, but at least you'd have a system that never 
>>> falls behind real-time. 
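>>> 
>>> For illustration only -- going from memory, the retry decision lives in 
>>> storm.trident.topology.MasterBatchCoordinator in 0.9.x, but verify 
>>> against your source tree -- the idea is to make a failed batch look 
>>> acked instead of re-syncing it for another attempt:
>>> 
>>>     // MasterBatchCoordinator.java -- a sketch, not a drop-in patch.
>>>     @Override
>>>     public void fail(Object msgId) {
>>>         // Original behavior: reset the attempt and re-sync so the batch
>>>         // gets re-emitted. To drop retries, treat failure like success
>>>         // and let the transaction advance past the bad batch.
>>>         ack(msgId);
>>>     }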
>>> 
>>> 
>>> 
>>>> On Wed, Apr 9, 2014 at 1:10 AM, Danijel Schiavuzzi 
>>>> <[email protected]> wrote:
>>>> Thanks Jason. However, I don't think that was the case in my stuck 
>>>> topology; otherwise I'd have seen exceptions (thrown by my Trident 
>>>> functions) in the worker logs.
>>>> 
>>>> 
>>>>> On Wed, Apr 9, 2014 at 3:02 AM, Jason Jackson <[email protected]> wrote:
>>>>> An example of "corrupted input" that causes a batch to fail: say you 
>>>>> expect the data you read off Kafka (or some other queue) to conform to 
>>>>> a schema, for whatever reason some message doesn't, and the function 
>>>>> you pass to stream.each() throws an exception when it hits that 
>>>>> unexpected input. The batch would then be retried -- but since it 
>>>>> fails deterministically, it will be retried forever. 
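>>>>> 
>>>>> A minimal sketch of that failure mode (the class and the "name,count" 
>>>>> record layout are made up for illustration; only the BaseFunction API 
>>>>> is Trident's):
>>>>> 
>>>>>     import storm.trident.operation.BaseFunction;
>>>>>     import storm.trident.operation.TridentCollector;
>>>>>     import storm.trident.tuple.TridentTuple;
>>>>>     import backtype.storm.tuple.Values;
>>>>> 
>>>>>     // Passed to stream.each(). A message that violates the expected
>>>>>     // "name,count" schema throws, the batch fails, and Trident retries
>>>>>     // the same batch -- forever, since the bad message is still in it.
>>>>>     public class ParseEvent extends BaseFunction {
>>>>>         @Override
>>>>>         public void execute(TridentTuple tuple, TridentCollector collector) {
>>>>>             String raw = tuple.getString(0);
>>>>>             String[] parts = raw.split(",");
>>>>>             if (parts.length != 2) {
>>>>>                 // Deterministic failure: the same input fails on
>>>>>                 // every retry attempt.
>>>>>                 throw new IllegalArgumentException("bad record: " + raw);
>>>>>             }
>>>>>             collector.emit(new Values(parts[0], Long.parseLong(parts[1])));
>>>>>         }
>>>>>     }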
>>>>> 
>>>>> 
>>>>>> On Mon, Apr 7, 2014 at 10:37 AM, Danijel Schiavuzzi 
>>>>>> <[email protected]> wrote:
>>>>>> Hi Jason,
>>>>>> 
>>>>>> Could you be more specific -- what do you mean by "corrupted input"? Do 
>>>>>> you mean that there's a bug in Trident itself that causes the tuples in 
>>>>>> a batch to somehow become corrupted?
>>>>>> 
>>>>>> Thanks a lot!
>>>>>> 
>>>>>> Danijel
>>>>>> 
>>>>>> 
>>>>>>> On Monday, April 7, 2014, Jason Jackson <[email protected]> wrote:
>>>>>>> This could happen if you have corrupted input that always causes a 
>>>>>>> batch to fail and be retried. 
>>>>>>> 
>>>>>>> I have seen this behaviour before without any corrupted input, so it 
>>>>>>> might be a bug in Trident; I'm not sure. If you figure it out, 
>>>>>>> please update this thread and/or submit a patch. 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Mon, Mar 31, 2014 at 7:39 AM, Danijel Schiavuzzi 
>>>>>>> <[email protected]> wrote:
>>>>>>> To (partially) answer my own question -- I still have no idea about 
>>>>>>> the cause of the stuck topology, but re-submitting it helps: after 
>>>>>>> re-submission, my topology is now running normally.
>>>>>>> 
>>>>>>> 
>>>>>>> On Wed, Mar 26, 2014 at 6:04 PM, Danijel Schiavuzzi 
>>>>>>> <[email protected]> wrote:
>>>>>>> Also, I did have multiple cases of my IBackingMap workers dying 
>>>>>>> (because of RuntimeExceptions) and then restarting successfully. I 
>>>>>>> deliberately throw RuntimeExceptions in my IBackingMap 
>>>>>>> implementation as a strategy for handling rare SQL database 
>>>>>>> deadlocks, to force a worker restart and to fail (and thus retry) 
>>>>>>> the batch, as sketched below.
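>>>>>>> 
>>>>>>>     // Rough shape of the pattern (simplified; writeBatchToSql() is
>>>>>>>     // a stand-in for my actual JDBC code). A deadlock surfaces as a
>>>>>>>     // SQLException; rethrowing it as a RuntimeException kills the
>>>>>>>     // worker, Storm restarts it, and Trident fails and retries the
>>>>>>>     // batch.
>>>>>>>     @Override
>>>>>>>     public void multiPut(List<List<Object>> keys, List<T> vals) {
>>>>>>>         try {
>>>>>>>             writeBatchToSql(keys, vals);
>>>>>>>         } catch (SQLException e) {
>>>>>>>             throw new RuntimeException(
>>>>>>>                 "SQL deadlock, forcing worker restart", e);
>>>>>>>         }
>>>>>>>     }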
>>>>>>> 
>>>>>>> From the logs, one such IBackingMap worker death (and subsequent 
>>>>>>> restart) resulted in the Kafka spout re-emitting the pending batch:
>>>>>>> 
>>>>>>>     2014-03-22 16:26:43 s.k.t.TridentKafkaEmitter [INFO] re-emitting 
>>>>>>> batch, attempt 29698959:736
>>>>>>> 
>>>>>>> This is of course the normal behavior of a transactional topology, 
>>>>>>> but it's the first time I've encountered a batch retrying 
>>>>>>> indefinitely. That's especially suspicious since the topology had 
>>>>>>> been running fine for 20 days straight, re-emitting batches and 
>>>>>>> restarting IBackingMap workers quite a number of times.
>>>>>>> 
>>>>>>> I can see in the SQL database backing my IBackingMap that a batch 
>>>>>>> with that exact txid value, 29698959, has been committed -- but I 
>>>>>>> suspect that commit could have come from the other IBackingMap, 
>>>>>>> since two instances are running (parallelismHint 2).
>>>>>>> 
>>>>>>> However, I have no idea why the batch is now being retried 
>>>>>>> indefinitely, nor why it hasn't been successfully acked by Trident.
>>>>>>> 
>>>>>>> Any suggestions on the area (topology component) to focus my research 
>>>>>>> on?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> On Wed, Mar 26, 2014 at 5:32 PM, Danijel Schiavuzzi 
>>>>>>> <[email protected]> wrote:
>>>>>>> Hello,
>>>>>>> 
>>>>>>> I'm having problems with my transactional Trident topology. It had 
>>>>>>> been running fine for about 20 days, but is now suddenly stuck 
>>>>>>> processing a single batch, with no tuples being emitted and none 
>>>>>>> being persisted by the TridentState (IBackingMap).
>>>>>>> 
>>>>>>> It's a simple topology that consumes messages off a Kafka queue. The 
>>>>>>> spout is an instance of storm-kafka-0.8-plus 
>>>>>>> TransactionalTridentKafkaSpout, and I use the trident-mssql 
>>>>>>> transactional TridentState implementation to persistentAggregate() 
>>>>>>> data into a SQL database.
>>>>>>> 
>>>>>>> In ZooKeeper I can see Storm is retrying a batch, i.e.
>>>>>>> 
>>>>>>>     /transactional/<myTopologyName>/coordinator/currattempts is 
>>>>>>>     {"29698959":6487}
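>>>>>>> 
>>>>>>> (I'm reading that node with the stock ZooKeeper CLI -- something 
>>>>>>> along these lines, with host and port adjusted to your setup:
>>>>>>> 
>>>>>>>     zkCli.sh -server localhost:2181 \
>>>>>>>         get /transactional/<myTopologyName>/coordinator/currattempts
>>>>>>> 
>>>>>>> and watching the attempt count in the JSON value grow.)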
>>>>>>> 
>>>>>>> ... and the attempt count keeps increasing. The batch with txid 
>>>>>>> 29698959 seems stuck -- it apparently isn't being acked by Trident, 
>>>>>>> and I have no idea why, especially since the topology had been 
>>>>>>> running successfully for the last 20 days.
>>>>>>> 
>>>>>>> I did rebalance the topology on one occasion, after which it continued 
>>>>>>> running normally. Other than that, no other modifications were done. 
>>>>>>> Storm is at version 0.9.0.1.
>>>>>>> 
>>>>>>> Any hints on how to debug the stuck topology? Any other useful info I 
>>>>>>> might provide?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> -- 
>>>>>>> Danijel Schiavuzzi
>>>>>>> 
>>>>>>> E: [email protected]
>>>>>>> W: www.schiavuzzi.com
>>>>>>> T: +385989035562
>>>>>>> Skype: danijel.schiavuzzi
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> -- 
>>>>>> Danijel Schiavuzzi
>>>>>> 
>>>>>> E: [email protected]
>>>>>> W: www.schiavuzzi.com
>>>>>> T: +385989035562
>>>>>> Skype: danijels7
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> Danijel Schiavuzzi
>>>> 
>>>> E: [email protected]
>>>> W: www.schiavuzzi.com
>>>> T: +385989035562
>>>> Skype: danijels7
> 
