We have been using a regular Storm topology (spouts and bolts) for a while. The input to Storm comes from a Kafka cluster, and ZooKeeper keeps the metadata (consumer offsets).
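For context, here is a minimal sketch of the kind of setup we have, assuming the storm-kafka module (Storm 1.x package names); the topic name, ZooKeeper hosts, and the bolt are hypothetical placeholders:

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.kafka.KafkaSpout;
import org.apache.storm.kafka.SpoutConfig;
import org.apache.storm.kafka.StringScheme;
import org.apache.storm.kafka.ZkHosts;
import org.apache.storm.spout.SchemeAsMultiScheme;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

public class PlainKafkaTopology {

    // Trivial stand-in for our real processing bolt.
    public static class LogBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            System.out.println(tuple.getString(0)); // "str" field from StringScheme
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // terminal bolt, emits nothing
        }
    }

    public static void main(String[] args) throws Exception {
        // ZooKeeper holds the Kafka broker metadata and the spout's consumer offsets.
        SpoutConfig spoutConf = new SpoutConfig(
                new ZkHosts("zk1:2181,zk2:2181,zk3:2181"),
                "logTopic", "/kafka-offsets", "log-consumer");
        spoutConf.scheme = new SchemeAsMultiScheme(new StringScheme());

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout(spoutConf), 4);
        builder.setBolt("log-bolt", new LogBolt(), 8).shuffleGrouping("kafka-spout");

        StormSubmitter.submitTopology("log-processing", new Config(),
                builder.createTopology());
    }
}
```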
I was looking at Trident for its exactly-once paradigm. We are trying to achieve minimal data loss, which may require replaying the logs (Kafka retains them for a designated amount of time) in some failure scenarios. As I understand it, Trident processes tuples in batches, each with a unique batch (transaction) id. Will it still guarantee the exactly-once protocol if a batch is replayed later in the timeframe? Two scenarios (a sketch of the kind of Trident topology I have in mind follows the list):

1. Batches X1 ... Xk ... Xn, where a failure occurs between batch Xk and Xn. If we are not sure of the exact offset or message at which the failure occurred, we may replay from a message or point in time slightly before where we think it happened, say one corresponding to batch Xk. Suppose we are back to normal a couple of hours later and replay those logs. Xk was already processed, so will Trident still guarantee that it is not processed again?

2. The data analytics team wants to replay some logs that were already processed, say a couple of days earlier. In this case we do want that range to be reprocessed. How will Trident behave in such a scenario?
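For concreteness, here is a minimal sketch of what I understand the Trident version would look like, assuming the storm-kafka Trident spout and the in-memory test state from the Trident tutorial; topic name and ZooKeeper hosts are placeholders:

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.kafka.StringScheme;
import org.apache.storm.kafka.ZkHosts;
import org.apache.storm.kafka.trident.TransactionalTridentKafkaSpout;
import org.apache.storm.kafka.trident.TridentKafkaConfig;
import org.apache.storm.spout.SchemeAsMultiScheme;
import org.apache.storm.trident.TridentTopology;
import org.apache.storm.trident.operation.builtin.Count;
import org.apache.storm.trident.testing.MemoryMapState;
import org.apache.storm.tuple.Fields;

public class TridentLogTopology {
    public static void main(String[] args) throws Exception {
        TridentKafkaConfig kafkaConf = new TridentKafkaConfig(
                new ZkHosts("zk1:2181,zk2:2181,zk3:2181"), "logTopic");
        kafkaConf.scheme = new SchemeAsMultiScheme(new StringScheme());

        TridentTopology topology = new TridentTopology();
        // As I understand it: a transactional spout guarantees that a batch
        // replayed for the same txid contains exactly the same tuples, and a
        // transactional state stores the txid alongside each value, so a
        // replay carrying an already-applied txid is skipped rather than
        // applied twice. My question is whether this still holds in the two
        // replay scenarios above.
        topology.newStream("log-stream", new TransactionalTridentKafkaSpout(kafkaConf))
                .groupBy(new Fields("str"))
                .persistentAggregate(new MemoryMapState.Factory(),
                        new Count(), new Fields("count"));

        StormSubmitter.submitTopology("trident-log-processing", new Config(),
                topology.build());
    }
}
```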
