We have been using a regular Storm topology (spouts and bolts) for a while. The input to Storm comes from a Kafka cluster, and ZooKeeper keeps the metadata (consumer offsets).
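For context, here is a minimal sketch of the kind of setup we have, assuming the storm-kafka module (Storm 1.x package names); the topic name, ZooKeeper hosts, and the bolt are hypothetical placeholders:

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.kafka.KafkaSpout;
import org.apache.storm.kafka.SpoutConfig;
import org.apache.storm.kafka.StringScheme;
import org.apache.storm.kafka.ZkHosts;
import org.apache.storm.spout.SchemeAsMultiScheme;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

public class PlainKafkaTopology {

    // Trivial stand-in for our real processing bolt.
    public static class LogBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            System.out.println(tuple.getString(0)); // "str" field from StringScheme
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // terminal bolt, emits nothing
        }
    }

    public static void main(String[] args) throws Exception {
        // ZooKeeper holds the Kafka broker metadata and the spout's consumer offsets.
        SpoutConfig spoutConf = new SpoutConfig(
                new ZkHosts("zk1:2181,zk2:2181,zk3:2181"),
                "logTopic", "/kafka-offsets", "log-consumer");
        spoutConf.scheme = new SchemeAsMultiScheme(new StringScheme());

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout(spoutConf), 4);
        builder.setBolt("log-bolt", new LogBolt(), 8).shuffleGrouping("kafka-spout");

        StormSubmitter.submitTopology("log-processing", new Config(),
                builder.createTopology());
    }
}
```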
I was looking at Trident for its exactly-once paradigm. We are trying to achieve minimal data loss, which may require replaying the logs (Kafka retains them for a designated amount of time) in some failure scenarios. As I understand it, Trident processes tuples in batches, each with a unique batch (transaction) id. Will it still guarantee the exactly-once protocol if a batch is replayed later in the timeframe? Two scenarios (a sketch of the kind of Trident topology I have in mind follows the list):

1. Batches X1 ... Xk ... Xn, where a failure occurs between batch Xk and Xn. If we are not sure of the exact offset or message at which the failure occurred, we may replay from a message or point in time slightly before where we think it happened, say one corresponding to batch Xk. Suppose we are back to normal a couple of hours later and replay those logs. Xk was already processed, so will Trident still guarantee that it is not processed again?

2. The data analytics team wants to replay some logs that were already processed, say a couple of days earlier. In this case we do want that range to be reprocessed. How will Trident behave in such a scenario?
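For concreteness, here is a minimal sketch of what I understand the Trident version would look like, assuming the storm-kafka Trident spout and the in-memory test state from the Trident tutorial; topic name and ZooKeeper hosts are placeholders:

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.kafka.StringScheme;
import org.apache.storm.kafka.ZkHosts;
import org.apache.storm.kafka.trident.TransactionalTridentKafkaSpout;
import org.apache.storm.kafka.trident.TridentKafkaConfig;
import org.apache.storm.spout.SchemeAsMultiScheme;
import org.apache.storm.trident.TridentTopology;
import org.apache.storm.trident.operation.builtin.Count;
import org.apache.storm.trident.testing.MemoryMapState;
import org.apache.storm.tuple.Fields;

public class TridentLogTopology {
    public static void main(String[] args) throws Exception {
        TridentKafkaConfig kafkaConf = new TridentKafkaConfig(
                new ZkHosts("zk1:2181,zk2:2181,zk3:2181"), "logTopic");
        kafkaConf.scheme = new SchemeAsMultiScheme(new StringScheme());

        TridentTopology topology = new TridentTopology();
        // As I understand it: a transactional spout guarantees that a batch
        // replayed for the same txid contains exactly the same tuples, and a
        // transactional state stores the txid alongside each value, so a
        // replay carrying an already-applied txid is skipped rather than
        // applied twice. My question is whether this still holds in the two
        // replay scenarios above.
        topology.newStream("log-stream", new TransactionalTridentKafkaSpout(kafkaConf))
                .groupBy(new Fields("str"))
                .persistentAggregate(new MemoryMapState.Factory(),
                        new Count(), new Fields("count"));

        StormSubmitter.submitTopology("trident-log-processing", new Config(),
                topology.build());
    }
}
```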
