Also, several of my IBackingMap workers did die (because of
RuntimeExceptions) but restarted successfully afterwards. Throwing a
RuntimeException in the BackingMap implementation is my strategy for rare
SQL database deadlock situations: it forces a worker restart and
fails+retries the batch.
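
For illustration, a minimal sketch of that fail-fast strategy, not the
actual trident-mssql code. The class name DeadlockFailFast and the helper
runOrFail are hypothetical; the sketch assumes the JDBC work done inside
the BackingMap's multiGet()/multiPut() is passed in as a Callable, and
that deadlocks surface as SQL Server vendor error code 1205 ("chosen as
deadlock victim").

```java
import java.sql.SQLException;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Storm or trident-mssql.
final class DeadlockFailFast {

    // SQL Server reports a deadlock victim with vendor error code 1205.
    static final int MSSQL_DEADLOCK_ERROR = 1205;

    // Walks the chained SQLExceptions looking for the deadlock code.
    static boolean isDeadlock(SQLException e) {
        for (SQLException cur = e; cur != null; cur = cur.getNextException()) {
            if (cur.getErrorCode() == MSSQL_DEADLOCK_ERROR) {
                return true;
            }
        }
        return false;
    }

    // Runs the database work; any failure is rethrown unchecked so that
    // the exception escapes the BackingMap call and the worker dies,
    // which in turn fails the batch so it gets retried.
    static <T> T runOrFail(Callable<T> dbWork) {
        try {
            return dbWork.call();
        } catch (SQLException e) {
            String msg = isDeadlock(e)
                    ? "SQL deadlock, failing batch for retry"
                    : "SQL error in BackingMap";
            throw new RuntimeException(msg, e);
        } catch (Exception e) {
            throw new RuntimeException("Unexpected error in BackingMap", e);
        }
    }
}
```

In a BackingMap, the body of multiPut() would be wrapped in a call such as
DeadlockFailFast.runOrFail(() -> { ...JDBC work...; return null; }).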
From the logs, one such IBackingMap worker death (and subsequent restart)
resulted in the Kafka spout re-emitting the pending tuple:
2014-03-22 16:26:43 s.k.t.TridentKafkaEmitter [INFO] re-emitting batch,
attempt 29698959:736
This is of course the normal behavior of a transactional topology, but this
is the first time I've encountered a case of a batch retrying indefinitely.
This is especially suspicious since the topology has been running fine for
20 days straight, re-emitting batches and restarting IBackingMap workers
quite a number of times.
I can see in the SQL database backing my IBackingMap that the batch with the
exact txid value 29698959 has been committed -- but I suspect that could
come from another BackingMap, since there are two BackingMap instances
running (parallelismHint 2).
However, I have no idea why the batch is being retried indefinitely now nor
why it hasn't been successfully acked by Trident.
Any suggestions on the area (topology component) to focus my research on?
Thanks,
On Wed, Mar 26, 2014 at 5:32 PM, Danijel Schiavuzzi
<[email protected]> wrote:
> Hello,
>
> I'm having problems with my transactional Trident topology. It has been
> running fine for about 20 days, and suddenly is stuck processing a single
> batch, with no tuples being emitted nor tuples being persisted by the
> TridentState (IBackingMap).
>
> It's a simple topology which consumes messages off a Kafka queue. The
> spout is an instance of storm-kafka-0.8-plus TransactionalTridentKafkaSpout
> and I use the trident-mssql transactional TridentState implementation to
> persistentAggregate() data into a SQL database.
>
> In Zookeeper I can see Storm is re-trying a batch, i.e.
>
> "/transactional/<myTopologyName>/coordinator/currattempts" is
> "{"29698959":6487}"
>
> ... and the attempt count keeps increasing. It seems the batch with txid
> 29698959 is stuck: it isn't being acked by Trident and I have no idea why,
> especially since the topology has been running successfully for the last 20
> days.
>
> I did rebalance the topology on one occasion, after which it continued
> running normally. Other than that, no other modifications were done. Storm
> is at version 0.9.0.1.
>
> Any hints on how to debug the stuck topology? Any other useful info I
> might provide?
>
> Thanks,
>
> --
> Danijel Schiavuzzi
>
> E: [email protected]
> W: www.schiavuzzi.com
> T: +385989035562
> Skype: danijel.schiavuzzi
>
--
Danijel Schiavuzzi
E: [email protected]
W: www.schiavuzzi.com
T: +385989035562
Skype: danijel.schiavuzzi