FYI, we're using Summingbird in production, not Trident. However, Summingbird does not give you exactly-once semantics; it does give you a higher level of abstraction than the Storm API, though.
On Wed, Apr 9, 2014 at 2:50 PM, Jason Jackson <[email protected]> wrote:

> I have one theory that because reads in ZooKeeper are eventually consistent, this is a necessary condition for the bug to manifest. One way to test this hypothesis is to run a ZooKeeper ensemble with 1 node, or an ensemble configured for 5 nodes with 2 of them taken offline, so that every write operation only succeeds if every member of the ensemble sees the write. This should produce strongly consistent reads. If you run this test, let me know what the results are. (Clearly this isn't a good production setup, since you're trading availability for consistency, but the results could help narrow down the bug.)
>
> On Wed, Apr 9, 2014 at 2:43 PM, Jason Jackson <[email protected]> wrote:
>
>> Yeah, it's probably a bug in Trident. It would be amazing if someone figured out the fix for this. I spent about 6 hours looking into it, but couldn't figure out why it was occurring.
>>
>> Beyond fixing this, one thing you could do to buy yourself time is to disable batch retries in Trident. There's no option for this in the API, but it's only a one- or two-line change to the code. Obviously you lose exactly-once semantics, but at least you would have a system that never falls behind real time.
>>
>> On Wed, Apr 9, 2014 at 1:10 AM, Danijel Schiavuzzi <[email protected]> wrote:
>>
>>> Thanks Jason. However, I don't think that was the case in my stuck topology, otherwise I'd have seen exceptions (thrown by my Trident functions) in the worker logs.
>>>
>>> On Wed, Apr 9, 2014 at 3:02 AM, Jason Jackson <[email protected]> wrote:
>>>
>>>> An example of "corrupted input" that causes a batch to fail would be if you expected the data you read off Kafka (or some other queue) to conform to a schema, and for whatever reason it didn't, and the function you pass to stream.each() throws an exception when this unexpected situation occurs. The batch would then be retried, but since it fails deterministically, it will be retried forever.
>>>>
>>>> On Mon, Apr 7, 2014 at 10:37 AM, Danijel Schiavuzzi <[email protected]> wrote:
>>>>
>>>>> Hi Jason,
>>>>>
>>>>> Could you be more specific -- what do you mean by "corrupted input"? Do you mean that there's a bug in Trident itself that causes the tuples in a batch to somehow become corrupted?
>>>>>
>>>>> Thanks a lot!
>>>>>
>>>>> Danijel
>>>>>
>>>>> On Monday, April 7, 2014, Jason Jackson <[email protected]> wrote:
>>>>>
>>>>>> This could happen if you have corrupted input that always causes a batch to fail and be retried.
>>>>>>
>>>>>> I have seen this behaviour before and I didn't see corrupted input. It might be a bug in Trident, I'm not sure. If you figure it out, please update this thread and/or submit a patch.
>>>>>>
>>>>>> On Mon, Mar 31, 2014 at 7:39 AM, Danijel Schiavuzzi <[email protected]> wrote:
>>>>>>
>>>>>> To (partially) answer my own question -- I still have no idea on the cause of the stuck topology, but re-submitting the topology helps: after re-submitting, my topology is now running normally.
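As a concrete illustration of the "corrupted input" scenario Jason describes above, the following sketch (assuming Storm 0.9.x package names and an invented "userId,amount" message format) shows a function passed to stream.each() that throws on a malformed message. Because the same tuple comes back on every retry, the batch fails deterministically and is replayed forever; catching the error and skipping or dead-lettering the offending tuple is one way to avoid wedging the topology.

    import backtype.storm.tuple.Values;
    import storm.trident.operation.BaseFunction;
    import storm.trident.operation.TridentCollector;
    import storm.trident.tuple.TridentTuple;

    public class ParseEventFunction extends BaseFunction {
        @Override
        public void execute(TridentTuple tuple, TridentCollector collector) {
            // Illustrative schema: each message is expected to be "userId,amount".
            String raw = tuple.getString(0);
            String[] parts = raw.split(",");
            if (parts.length != 2) {
                // Deterministic failure: the batch containing this tuple fails on
                // every attempt, so Trident retries it forever and never advances.
                throw new IllegalArgumentException("Unexpected message format: " + raw);
            }
            collector.emit(new Values(parts[0], Long.parseLong(parts[1])));
        }
    }

Wired in as stream.each(new Fields("str"), new ParseEventFunction(), new Fields("userId", "amount")), a single malformed message is enough to stall the whole topology unless the function handles it gracefully.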
>>>>>>
>>>>>> On Wed, Mar 26, 2014 at 6:04 PM, Danijel Schiavuzzi <[email protected]> wrote:
>>>>>>
>>>>>> Also, I did have multiple cases of my IBackingMap workers dying (because of RuntimeExceptions) but successfully restarting afterwards (I throw RuntimeExceptions in the BackingMap implementation as my strategy in rare SQL database deadlock situations, to force a worker restart and to fail and retry the batch).
>>>>>>
>>>>>> From the logs, one such IBackingMap worker death (and subsequent restart) resulted in the Kafka spout re-emitting the pending tuple:
>>>>>>
>>>>>> 2014-03-22 16:26:43 s.k.t.TridentKafkaEmitter [INFO] re-emitting batch, attempt 29698959:736
>>>>>>
>>>>>> This is of course the normal behavior of a transactional topology, but this is the first time I've encountered a batch retrying indefinitely. It is especially suspicious since the topology had been running fine for 20 days straight, re-emitting batches and restarting IBackingMap workers quite a number of times.
>>>>>>
>>>>>> I can see in my IBackingMap's backing SQL database that the batch with the exact txid value 29698959 has been committed -- but I suspect that could come from another BackingMap instance, since there are two of them running (parallelismHint 2).
>>>>>>
>>>>>> However, I have no idea why the batch is being retried indefinitely now, nor why it hasn't been successfully acked by Trident.
>>>>>>
>>>>>> Any suggestions on the area (topology component) to focus my research on?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> On Wed, Mar 26, 2014 at 5:32 PM, Danijel Schiavuzzi <[email protected]> wrote:
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I'm having problems with my transactional Trident topology. It had been running fine for about 20 days, and suddenly it is stuck processing a single batch, with no tuples being emitted nor persisted by the TridentState (IBackingMap).
>>>>>>
>>>>>> It's a simple topology which consumes messages off a Kafka queue. The spout is an instance of the storm-kafka-0.8-plus TransactionalTridentKafkaSpout, and I use the trident-mssql transactional TridentState implementation to persistentAggregate() data into a SQL database.
>>>>>>
>>>>>> In ZooKeeper I can see Storm is retrying a batch, i.e.
>>>>>>
>>>>>> "/transactional/<myTopologyName>/coordinator/currattempts" is "{"29698959":6487}"
>>>>>>
>>>>>> ... and the attempt count keeps increasing. It seems the batch with txid 29698959 is stuck: it isn't being acked by Trident and I have no idea why, especially since the topology has been running successfully for the last 20 days.
>>>>>>
>>>>>> I did rebalance the topology on one occasion, after which it continued running normally. Other than that, no other modifications were done. Storm is at version 0.9.0.1.
>>>>>>
>>>>>> Any hints on how to debug the stuck topology? Any other useful info I might provide?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> --
>>>>>> Danijel Schiavuzzi
>>>>>>
>>>>>> E: [email protected]
>>>>>> W: www.schiavuzzi.com
>>>>>> T: +385989035562
>>>>>> Skype: danijel.schiavuzzi
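For reference, the worker-restart strategy Danijel describes above -- throwing a RuntimeException from the IBackingMap when the SQL database reports a deadlock, so that Storm restarts the worker and the coordinator re-emits the batch with the same txid -- might look roughly like the following sketch. Package names match Storm 0.9.x; SqlBackingMap and the error-code comment are illustrative assumptions, not part of Storm or trident-mssql.

    import java.sql.SQLException;
    import java.util.List;

    import storm.trident.state.map.IBackingMap;

    public class DeadlockFailingBackingMap<T> implements IBackingMap<T> {

        /** Hypothetical JDBC-backed store, used only for this illustration. */
        public interface SqlBackingMap<V> {
            List<V> multiGet(List<List<Object>> keys) throws SQLException;
            void multiPut(List<List<Object>> keys, List<V> vals) throws SQLException;
        }

        private final SqlBackingMap<T> delegate;

        public DeadlockFailingBackingMap(SqlBackingMap<T> delegate) {
            this.delegate = delegate;
        }

        @Override
        public List<T> multiGet(List<List<Object>> keys) {
            try {
                return delegate.multiGet(keys);
            } catch (SQLException e) {
                throw new RuntimeException("multiGet failed; forcing worker restart", e);
            }
        }

        @Override
        public void multiPut(List<List<Object>> keys, List<T> vals) {
            try {
                delegate.multiPut(keys, vals);
            } catch (SQLException e) {
                // SQL Server flags deadlock victims with error code 1205. For simplicity
                // this sketch rethrows on any SQLException: the unchecked exception kills
                // the worker, Storm restarts it, and the coordinator re-emits the batch
                // with the same txid, so a transactional state keeps its guarantees.
                throw new RuntimeException("multiPut failed (deadlock? errorCode="
                        + e.getErrorCode() + "); forcing batch retry", e);
            }
        }
    }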
