Thanks, Jason. However, I don't think that was the case in my stuck topology; otherwise I'd have seen exceptions (thrown by my Trident functions) in the worker logs.
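To illustrate the scenario Jason describes below -- an each() function that throws whenever a tuple doesn't match the expected schema, and therefore fails its batch deterministically -- here is a minimal sketch (the class name, field layout and parsing are purely illustrative, not my actual code):

import backtype.storm.tuple.Values;
import storm.trident.operation.BaseFunction;
import storm.trident.operation.TridentCollector;
import storm.trident.tuple.TridentTuple;

// Hypothetical parser: if a message read off the queue doesn't match the
// expected schema, the function throws, the batch fails, and Trident
// retries it -- forever, if the same malformed message keeps being replayed.
public class ParseMessage extends BaseFunction {
    @Override
    public void execute(TridentTuple tuple, TridentCollector collector) {
        String raw = tuple.getString(0);
        String[] parts = raw.split(",");
        if (parts.length != 3) {
            // Deterministic failure: the same message throws on every retry.
            throw new RuntimeException("Unexpected message format: " + raw);
        }
        collector.emit(new Values(parts[0], parts[1], parts[2]));
    }
}

A function like this fails the same way on every retry, so the batch can never complete -- but each attempt would also leave a stack trace in the worker logs, which is exactly what I'm not seeing here.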
On Wed, Apr 9, 2014 at 3:02 AM, Jason Jackson <[email protected]> wrote:

> An example of "corrupted input" that causes a batch to fail: say you expect
> the data you read off Kafka (or some other queue) to conform to a schema,
> for whatever reason some of it doesn't, and the function you implement and
> pass to stream.each throws an exception when it hits that unexpected data.
> This causes the batch to be retried, but since it fails deterministically,
> the batch will be retried forever.
>
>
> On Mon, Apr 7, 2014 at 10:37 AM, Danijel Schiavuzzi <[email protected]> wrote:
>
>> Hi Jason,
>>
>> Could you be more specific -- what do you mean by "corrupted input"?
>> Do you mean that there's a bug in Trident itself that causes the tuples in
>> a batch to somehow become corrupted?
>>
>> Thanks a lot!
>>
>> Danijel
>>
>>
>> On Monday, April 7, 2014, Jason Jackson <[email protected]> wrote:
>>
>>> This could happen if you have corrupted input that always causes a batch
>>> to fail and be retried.
>>>
>>> I have seen this behaviour before without any corrupted input, though. It
>>> might be a bug in Trident, I'm not sure. If you figure it out, please
>>> update this thread and/or submit a patch.
>>>
>>>
>>> On Mon, Mar 31, 2014 at 7:39 AM, Danijel Schiavuzzi <[email protected]> wrote:
>>>
>>> To (partially) answer my own question -- I still have no idea about the
>>> cause of the stuck topology, but re-submitting it helps -- after
>>> re-submitting, my topology is now running normally.
>>>
>>>
>>> On Wed, Mar 26, 2014 at 6:04 PM, Danijel Schiavuzzi <[email protected]> wrote:
>>>
>>> Also, I did have multiple cases of my IBackingMap workers dying (because
>>> of RuntimeExceptions) but successfully restarting afterwards (I throw
>>> RuntimeExceptions in the BackingMap implementation as my strategy in rare
>>> SQL database deadlock situations, to force a worker restart and to
>>> fail+retry the batch).
>>>
>>> From the logs, one such IBackingMap worker death (and subsequent
>>> restart) resulted in the Kafka spout re-emitting the pending batch:
>>>
>>> 2014-03-22 16:26:43 s.k.t.TridentKafkaEmitter [INFO] re-emitting
>>> batch, attempt 29698959:736
>>>
>>> This is of course the normal behavior of a transactional topology, but
>>> this is the first time I've encountered a batch retrying indefinitely.
>>> It's especially suspicious since the topology had been running fine for
>>> 20 days straight, re-emitting batches and restarting IBackingMap workers
>>> quite a number of times.
>>>
>>> I can see in my IBackingMap's backing SQL database that the batch with
>>> the exact txid value 29698959 has been committed -- but I suspect that
>>> could come from the other BackingMap, since there are two BackingMap
>>> instances running (parallelismHint 2).
>>>
>>> However, I have no idea why the batch is being retried indefinitely now,
>>> nor why it hasn't been successfully acked by Trident.
>>>
>>> Any suggestions on the area (topology component) to focus my research on?
>>>
>>> Thanks,
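To make the deadlock strategy mentioned above concrete, it boils down to the following pattern (a simplified, hypothetical sketch -- class, table and column names are made up, and the txid bookkeeping the real transactional state does is omitted):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Collections;
import java.util.List;

import storm.trident.state.map.IBackingMap;

// Hypothetical SQL-backed map. The real transactional implementation also
// stores the txid next to each value; that part is left out here.
public class SqlBackingMap implements IBackingMap<Long> {

    private final Connection connection;

    public SqlBackingMap(Connection connection) {
        this.connection = connection;
    }

    @Override
    public List<Long> multiGet(List<List<Object>> keys) {
        // Reads omitted in this sketch; a null entry means "no value stored yet".
        return Collections.nCopies(keys.size(), (Long) null);
    }

    @Override
    public void multiPut(List<List<Object>> keys, List<Long> vals) {
        try (PreparedStatement ps = connection.prepareStatement(
                "UPDATE counts SET value = ? WHERE key = ?")) {
            for (int i = 0; i < keys.size(); i++) {
                ps.setLong(1, vals.get(i));
                ps.setString(2, keys.get(i).get(0).toString());
                ps.addBatch();
            }
            ps.executeBatch();
        } catch (SQLException e) {
            // On a rare deadlock (or any other SQL failure), give up: the
            // RuntimeException kills the worker, the worker restarts, and the
            // transactional spout re-emits the whole batch.
            throw new RuntimeException("SQL write failed, failing the batch", e);
        }
    }
}

The only point of the pattern is that a RuntimeException thrown from multiPut() kills the worker; after the worker restarts, the transactional spout re-emits the batch, which is what the "re-emitting batch" log line above shows.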
>>> On Wed, Mar 26, 2014 at 5:32 PM, Danijel Schiavuzzi <[email protected]> wrote:
>>>
>>> Hello,
>>>
>>> I'm having problems with my transactional Trident topology. It had been
>>> running fine for about 20 days, and suddenly it is stuck processing a
>>> single batch, with no tuples being emitted nor persisted by the
>>> TridentState (IBackingMap).
>>>
>>> It's a simple topology which consumes messages off a Kafka queue. The
>>> spout is an instance of the storm-kafka-0.8-plus
>>> TransactionalTridentKafkaSpout, and I use the trident-mssql transactional
>>> TridentState implementation to persistentAggregate() data into a SQL
>>> database.
>>>
>>> In Zookeeper I can see Storm is retrying a batch, i.e.
>>>
>>> "/transactional/<myTopologyName>/coordinator/currattempts" is
>>> "{"29698959":6487}"
>>>
>>> ... and the attempt count keeps increasing. The batch with txid 29698959
>>> seems stuck -- it apparently never gets acked by Trident, and I have no
>>> idea why, especially since the topology had been running successfully for
>>> the previous 20 days.
>>>
>>> I did rebalance the topology on one occasion, after which it continued
>>> running normally. Other than that, no other modifications were done.
>>> Storm is at version 0.9.0.1.
>>>
>>> Any hints on how to debug the stuck topology? Any other useful info I
>>> might provide?
>>>
>>> Thanks,
>>>
>>> --
>>> Danijel Schiavuzzi
>>>
>>> E: [email protected]
>>> W: www.schiavuzzi.com
>>> T: +385989035562
>>> Skype: danijel.schiavuzzi
>>
>> --
>> Danijel Schiavuzzi
>>
>> E: [email protected]
>> W: www.schiavuzzi.com
>> T: +385989035562
>> Skype: danijels7

--
Danijel Schiavuzzi

E: [email protected]
W: www.schiavuzzi.com
T: +385989035562
Skype: danijels7
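P.S. In case it helps anyone hitting the same problem, the topology wiring is essentially the following (a simplified sketch -- the topic, Zookeeper address, field names and the Count aggregator are illustrative, ParseMessage is the hypothetical function sketched earlier in this mail, and sqlStateFactory stands in for the trident-mssql transactional StateFactory):

import backtype.storm.generated.StormTopology;
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.tuple.Fields;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;
import storm.kafka.trident.TransactionalTridentKafkaSpout;
import storm.kafka.trident.TridentKafkaConfig;
import storm.trident.TridentTopology;
import storm.trident.operation.builtin.Count;
import storm.trident.state.StateFactory;

public class KafkaAggregationTopology {

    // sqlStateFactory is the trident-mssql transactional StateFactory
    // (its construction is omitted here).
    public static StormTopology build(StateFactory sqlStateFactory) {
        TridentKafkaConfig spoutConfig =
                new TridentKafkaConfig(new ZkHosts("zookeeper:2181"), "events");
        spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

        TransactionalTridentKafkaSpout spout =
                new TransactionalTridentKafkaSpout(spoutConfig);

        TridentTopology topology = new TridentTopology();
        topology.newStream("kafka-spout", spout)
                // StringScheme emits a single field named "str"
                .each(new Fields("str"), new ParseMessage(), new Fields("key", "a", "b"))
                .groupBy(new Fields("key"))
                .persistentAggregate(sqlStateFactory, new Count(), new Fields("count"))
                .parallelismHint(2); // the two IBackingMap instances mentioned above

        return topology.build();
    }
}

The per-batch attempt counts live in Storm's Zookeeper under "/transactional/<myTopologyName>/coordinator/currattempts" (the node quoted above) and can be read with the standard Zookeeper CLI (zkCli.sh, then a get on that path).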
