Hey Jason,

Do you know a way to reliably reproduce this? If so, can you share the steps?
-Taylor

> On Apr 9, 2014, at 5:52 PM, Jason Jackson <[email protected]> wrote:
>
> FYI, we're using Summingbird in production, not Trident. However, Summingbird does not give you exactly-once semantics; it does give you a higher level of abstraction than the Storm API, though.
>
>
>> On Wed, Apr 9, 2014 at 2:50 PM, Jason Jackson <[email protected]> wrote:
>> I have one theory that because reads in ZooKeeper are eventually consistent, this is a necessary condition for the bug to manifest. So one way to test this hypothesis is to run a ZooKeeper ensemble with 1 node, or an ensemble configured for 5 nodes but with 2 of them taken offline, so that every write operation only succeeds if every member of the ensemble sees the write. This should produce strongly consistent reads. If you run this test, let me know what the results are. (Clearly this isn't a good production setup, since you're trading away availability for greater consistency, but the results could help narrow down the bug.)
>>
>>
>>> On Wed, Apr 9, 2014 at 2:43 PM, Jason Jackson <[email protected]> wrote:
>>> Yeah, it's probably a bug in Trident. It would be amazing if someone figured out the fix for this. I spent about 6 hours looking into it, but couldn't figure out why it was occurring.
>>>
>>> Beyond fixing this, one thing you could do to buy yourself time is to disable batch retries in Trident. There's no option for this in the API, but it's a one- or two-line change to the code. Obviously you lose exactly-once semantics, but at least you would have a system that never falls behind real time.
>>>
>>>
>>>> On Wed, Apr 9, 2014 at 1:10 AM, Danijel Schiavuzzi <[email protected]> wrote:
>>>> Thanks, Jason. However, I don't think that was the case in my stuck topology, otherwise I'd have seen exceptions (thrown by my Trident functions) in the worker logs.
>>>>
>>>>
>>>>> On Wed, Apr 9, 2014 at 3:02 AM, Jason Jackson <[email protected]> wrote:
>>>>> An example of "corrupted input" that causes a batch to fail would be if you expect the data you read off Kafka (or some other queue) to conform to a schema, and for whatever reason it doesn't, and the function you implement and pass to stream.each() throws an exception when that unexpected situation occurs. This causes the batch to be retried, but since it fails deterministically, the batch will be retried forever.
>>>>>
>>>>>
>>>>>> On Mon, Apr 7, 2014 at 10:37 AM, Danijel Schiavuzzi <[email protected]> wrote:
>>>>>> Hi Jason,
>>>>>>
>>>>>> Could you be more specific -- what do you mean by "corrupted input"? Do you mean that there's a bug in Trident itself that causes the tuples in a batch to somehow become corrupted?
>>>>>>
>>>>>> Thanks a lot!
>>>>>>
>>>>>> Danijel
>>>>>>
>>>>>>
>>>>>>> On Monday, April 7, 2014, Jason Jackson <[email protected]> wrote:
>>>>>>> This could happen if you have corrupted input that always causes a batch to fail and be retried.
>>>>>>>
>>>>>>> I have seen this behaviour before and I didn't see corrupted input. It might be a bug in Trident, I'm not sure. If you figure it out, please update this thread and/or submit a patch.
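For illustration, a minimal sketch of the failure mode Jason describes: an each() function that assumes a fixed record format and throws on anything else, so one malformed record makes its batch fail deterministically and be replayed forever. The class name, field layout, and parsing logic below are hypothetical, not taken from the topology discussed in this thread; only the Trident BaseFunction API is Storm's.

    import backtype.storm.tuple.Values;
    import storm.trident.operation.BaseFunction;
    import storm.trident.operation.TridentCollector;
    import storm.trident.tuple.TridentTuple;

    // Hypothetical parser, used e.g. as
    // stream.each(new Fields("str"), new ParseEvent(), new Fields("key", "count")).
    public class ParseEvent extends BaseFunction {
        @Override
        public void execute(TridentTuple tuple, TridentCollector collector) {
            String raw = tuple.getString(0);
            String[] parts = raw.split(",");
            if (parts.length != 2) {
                // Throwing fails the whole batch; Trident replays it, the same malformed
                // record is read from the queue again, and the batch fails on every attempt.
                throw new RuntimeException("Unexpected record format: " + raw);
            }
            // A more forgiving variant would log and return here (dropping the bad record)
            // instead of throwing, so a single bad record cannot wedge the topology.
            collector.emit(new Values(parts[0], Long.parseLong(parts[1])));
        }
    }

And for Jason's ZooKeeper consistency experiment further up the thread, a single-node "ensemble" is the simplest way to rule eventually consistent reads in or out: with only one server there is no replication lag, so every read reflects every acknowledged write. A minimal standalone zoo.cfg along these lines (paths and port are placeholders) would be enough for the test:

    # Minimal standalone ZooKeeper configuration -- a single-node "ensemble".
    # dataDir and clientPort are placeholders; adjust to your environment.
    tickTime=2000
    dataDir=/var/lib/zookeeper
    clientPort=2181
    # No server.N entries: with a single server there are no stale follower reads.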
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Mar 31, 2014 at 7:39 AM, Danijel Schiavuzzi <[email protected]> wrote:
>>>>>>> To (partially) answer my own question -- I still have no idea about the cause of the stuck topology, but re-submitting the topology helps: after re-submitting, my topology is now running normally.
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Mar 26, 2014 at 6:04 PM, Danijel Schiavuzzi <[email protected]> wrote:
>>>>>>> Also, I did have multiple cases of my IBackingMap workers dying (because of RuntimeExceptions) and then successfully restarting. (I throw RuntimeExceptions in my IBackingMap implementation as a strategy for rare SQL database deadlock situations, to force a worker restart and to fail and retry the batch.)
>>>>>>>
>>>>>>> From the logs, one such IBackingMap worker death (and subsequent restart) resulted in the Kafka spout re-emitting the pending tuple:
>>>>>>>
>>>>>>> 2014-03-22 16:26:43 s.k.t.TridentKafkaEmitter [INFO] re-emitting batch, attempt 29698959:736
>>>>>>>
>>>>>>> This is of course the normal behavior of a transactional topology, but this is the first time I've encountered a batch retrying indefinitely. It is especially suspicious since the topology had been running fine for 20 days straight, re-emitting batches and restarting IBackingMap workers quite a number of times.
>>>>>>>
>>>>>>> I can see in the SQL database backing my IBackingMap that a batch with the exact txid value 29698959 has been committed -- but I suspect that could come from the other BackingMap instance, since there are two of them running (parallelismHint 2).
>>>>>>>
>>>>>>> However, I have no idea why the batch is now being retried indefinitely, nor why it hasn't been successfully acked by Trident.
>>>>>>>
>>>>>>> Any suggestions on which area (topology component) to focus my research on?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> On Wed, Mar 26, 2014 at 5:32 PM, Danijel Schiavuzzi <[email protected]> wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> I'm having problems with my transactional Trident topology. It had been running fine for about 20 days, and suddenly it is stuck processing a single batch, with no tuples being emitted and none being persisted by the TridentState (IBackingMap).
>>>>>>>
>>>>>>> It's a simple topology that consumes messages off a Kafka queue. The spout is an instance of the storm-kafka-0.8-plus TransactionalTridentKafkaSpout, and I use the trident-mssql transactional TridentState implementation to persistentAggregate() data into an SQL database.
>>>>>>>
>>>>>>> In ZooKeeper I can see Storm is retrying a batch, i.e.
>>>>>>>
>>>>>>> "/transactional/<myTopologyName>/coordinator/currattempts" is "{"29698959":6487}"
>>>>>>>
>>>>>>> ... and the attempt count keeps increasing. It seems the batch with txid 29698959 is stuck: the attempt count in ZooKeeper keeps increasing, so the batch isn't being acked by Trident, and I have no idea why, especially since the topology had been running successfully for the previous 20 days.
>>>>>>>
>>>>>>> I did rebalance the topology on one occasion, after which it continued running normally. Other than that, no other modifications were done. Storm is at version 0.9.0.1.
>>>>>>>
>>>>>>> Any hints on how to debug the stuck topology? Any other useful info I might provide?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> --
>>>>>>> Danijel Schiavuzzi
>>>>>>>
>>>>>>> E: [email protected]
>>>>>>> W: www.schiavuzzi.com
>>>>>>> T: +385989035562
>>>>>>> Skype: danijel.schiavuzzi
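On the IBackingMap strategy described above (rethrowing on SQL deadlocks to force a worker restart and a batch replay), a minimal sketch follows. The CounterDao helper, the Long value type, and the method names are hypothetical stand-ins for the real trident-mssql/JDBC code; only the IBackingMap interface itself comes from Storm.

    import java.sql.SQLException;
    import java.util.List;

    import storm.trident.state.map.IBackingMap;

    // Hypothetical JDBC helper standing in for the real SQL access layer.
    interface CounterDao {
        List<Long> selectCounters(List<List<Object>> keys) throws SQLException;
        void upsertCounters(List<List<Object>> keys, List<Long> vals) throws SQLException;
    }

    public class SqlBackingMap implements IBackingMap<Long> {
        private final CounterDao dao;

        public SqlBackingMap(CounterDao dao) {
            this.dao = dao;
        }

        @Override
        public List<Long> multiGet(List<List<Object>> keys) {
            try {
                return dao.selectCounters(keys);
            } catch (SQLException e) {
                throw new RuntimeException("multiGet failed", e);
            }
        }

        @Override
        public void multiPut(List<List<Object>> keys, List<Long> vals) {
            try {
                dao.upsertCounters(keys, vals);
            } catch (SQLException e) {
                // On a deadlock (or any SQL failure) crash the worker: Storm restarts it,
                // Trident fails the batch, and the spout re-emits it -- the "re-emitting
                // batch, attempt ..." log line quoted above. Normally the replay succeeds;
                // the stuck topology is the case where it never does.
                throw new RuntimeException("multiPut failed, forcing worker restart and batch retry", e);
            }
        }
    }

Throwing on any SQLException is deliberately crude; distinguishing deadlocks (SQLSTATE 40001 on many databases) from other errors would avoid unnecessary worker restarts.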
