Hi, I have a theoretical question about the guarantees OpaqueTridentKafkaSpout provides. I would like to walk through an example to illustrate it.
Suppose a batch with txId 10 has tuples t1, t2, t3, t4, which come from the Kafka partitions p1, p2, p3, p4 respectively. When this batch is played for the very first time, it fails during processing: the commit happens for tuple t3 in the database but not for tuples t1, t2, t4. Since the batch failed, the metadata in ZooKeeper is not updated, i.e. the offsets for p1, p2, p3, p4 are not treated as committed.

The batch is expected to be replayed; however, suppose that before it gets replayed the Kafka partition p3 goes down. What happens now? My understanding is that a batch with the same transaction id, now containing only t1, t2, t4, may be replayed, and since p3 is down, t3 is not replayed. Because t3 is not replayed, even if the batch succeeds on replay, the offset for p3 does not get updated in ZooKeeper. That is all fine as far as fault tolerance and opaque behavior are concerned (a sketch of how I read the opaque update rule is in the P.S. below).

My concern is about what happens when partition p3 comes back up and the spout resumes reading p3 from the last offset it committed successfully. Tuple t3 will be read from p3 again, and it will necessarily be part of a batch with some txId > 10 (say 19), so it will be applied to the state a second time. This apparently violates the exactly-once semantics. Is the concern genuine, or am I missing something?

Regards,
Ashok Gupta
(+1) 361-522-2172
San Jose, CA
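P.S. To make sure I am reading the opaque behavior correctly, here is a minimal sketch of the per-key update rule I am assuming the opaque state follows when a txId is replayed with a possibly different tuple set. It is plain Java with made-up names (OpaqueCounter, applyBatch), not the actual Trident or storm-kafka API, and it only models a simple counter.

import java.util.HashMap;
import java.util.Map;

// Sketch of the opaque-transactional update rule as I understand it.
// Not the real Trident state API; names are illustrative only.
public class OpaqueCounter {

    // Per-key state: the txid that last wrote it, the value before that
    // write (prev) and the value after it (curr).
    private static class OpaqueValue {
        long lastTxid = -1;
        long prev;   // value before the batch with lastTxid was applied
        long curr;   // value after the batch with lastTxid was applied
    }

    private final Map<String, OpaqueValue> store = new HashMap<>();

    // Apply a batch delta for a key under transaction txid. If the same
    // txid is seen again (a replay, possibly with a different tuple set,
    // e.g. t3 missing because p3 is down), the update is re-based on prev,
    // so whatever that txid wrote earlier is overwritten, not double-counted.
    public void applyBatch(String key, long txid, long batchDelta) {
        OpaqueValue v = store.computeIfAbsent(key, k -> new OpaqueValue());
        if (txid == v.lastTxid) {
            // Replay of the same txid: discard its earlier partial write.
            v.curr = v.prev + batchDelta;
        } else {
            // New txid: the current value becomes the new rollback point.
            v.prev = v.curr;
            v.curr = v.curr + batchDelta;
            v.lastTxid = txid;
        }
    }

    public long get(String key) {
        OpaqueValue v = store.get(key);
        return v == null ? 0 : v.curr;
    }
}

With a rule like this, a replay of txId 10 without t3 simply overwrites whatever the earlier partial attempt of txId 10 wrote, which is why the replay itself looks fine to me; my question above is only about t3 showing up again later under a new txId such as 19.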
