To (partially) answer my own question -- I still have no idea what caused the topology to get stuck, but re-submitting it helped: after re-submitting, the topology is now running normally again.
On Wed, Mar 26, 2014 at 6:04 PM, Danijel Schiavuzzi <[email protected]> wrote:

> Also, I did have multiple cases of my IBackingMap workers dying (because
> of RuntimeExceptions) but successfully restarting afterwards (I throw
> RuntimeExceptions in the BackingMap implementation as my strategy in rare
> SQL database deadlock situations, to force a worker restart and to
> fail+retry the batch).
>
> From the logs, one such IBackingMap worker death (and subsequent restart)
> resulted in the Kafka spout re-emitting the pending tuple:
>
> 2014-03-22 16:26:43 s.k.t.TridentKafkaEmitter [INFO] re-emitting
> batch, attempt 29698959:736
>
> This is of course the normal behavior of a transactional topology, but
> this is the first time I've encountered a case of a batch retrying
> indefinitely. This is especially suspicious since the topology has been
> running fine for 20 days straight, re-emitting batches and restarting
> IBackingMap workers quite a number of times.
>
> I can see in my IBackingMap backing SQL database that the batch with the
> exact txid value 29698959 has been committed -- but I suspect that could
> come from another BackingMap, since there are two BackingMap instances
> running (parallelismHint 2).
>
> However, I have no idea why the batch is being retried indefinitely now,
> nor why it hasn't been successfully acked by Trident.
>
> Any suggestions on the area (topology component) to focus my research on?
>
> Thanks,
>
> On Wed, Mar 26, 2014 at 5:32 PM, Danijel Schiavuzzi <[email protected]> wrote:
>
>> Hello,
>>
>> I'm having problems with my transactional Trident topology. It has been
>> running fine for about 20 days, and suddenly it is stuck processing a
>> single batch, with no tuples being emitted nor persisted by the
>> TridentState (IBackingMap).
>>
>> It's a simple topology which consumes messages off a Kafka queue. The
>> spout is an instance of the storm-kafka-0.8-plus
>> TransactionalTridentKafkaSpout, and I use the trident-mssql transactional
>> TridentState implementation to persistentAggregate() data into a SQL
>> database.
>>
>> In Zookeeper I can see Storm is retrying a batch, i.e.
>>
>> "/transactional/<myTopologyName>/coordinator/currattempts" is
>> "{"29698959":6487}"
>>
>> ... and the attempt count keeps increasing. It seems the batch with txid
>> 29698959 is stuck -- it isn't being acked by Trident and I have no idea
>> why, especially since the topology has been running successfully for the
>> last 20 days.
>>
>> I did rebalance the topology on one occasion, after which it continued
>> running normally. Other than that, no other modifications were done.
>> Storm is at version 0.9.0.1.
>>
>> Any hints on how to debug the stuck topology? Any other useful info I
>> might provide?
>>
>> Thanks,
>>
>> --
>> Danijel Schiavuzzi
>>
>> E: [email protected]
>> W: www.schiavuzzi.com
>> T: +385989035562
>> Skype: danijel.schiavuzzi
>
> --
> Danijel Schiavuzzi
>
> E: [email protected]
> W: www.schiavuzzi.com
> T: +385989035562
> Skype: danijel.schiavuzzi

--
Danijel Schiavuzzi

E: [email protected]
W: www.schiavuzzi.com
T: +385989035562
Skype: danijels7
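P.S. For reference, the deadlock-handling strategy described in the quoted mail above boils down to something like the sketch below. This is a minimal, hypothetical example (plain JDBC, placeholder class and field names, database access elided), not the actual trident-mssql state code: any SQLException raised in multiPut() is rethrown as a RuntimeException, so the worker dies and Trident re-emits the batch.

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.util.ArrayList;
    import java.util.List;

    import javax.sql.DataSource;

    import storm.trident.state.map.IBackingMap;

    // Minimal sketch only: placeholder names, no real SQL. The point is the
    // failure handling in multiPut(), not the persistence details.
    public class DeadlockFailingBackingMap implements IBackingMap<Long> {

        private final DataSource dataSource; // hypothetical connection pool

        public DeadlockFailingBackingMap(DataSource dataSource) {
            this.dataSource = dataSource;
        }

        @Override
        public List<Long> multiGet(List<List<Object>> keys) {
            List<Long> values = new ArrayList<Long>(keys.size());
            for (List<Object> key : keys) {
                // ... look up the current value for 'key' in the database ...
                values.add(null); // null = no value stored yet for this key
            }
            return values;
        }

        @Override
        public void multiPut(List<List<Object>> keys, List<Long> vals) {
            try {
                Connection conn = dataSource.getConnection();
                try {
                    // ... upsert all key/value pairs in a single transaction ...
                } finally {
                    conn.close();
                }
            } catch (SQLException e) {
                // On a deadlock (or any other SQL error), fail the whole worker:
                // Trident re-emits and retries the batch with the same txid.
                throw new RuntimeException("multiPut failed, failing the batch", e);
            }
        }
    }

In the real topology this backing map is wrapped into a transactional TridentState (the trident-mssql implementation also stores the txid alongside the value), which is what makes replays of the same batch idempotent.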
