Hi, I have a theoretical question about the guarantees OpaqueTridentKafkaSpout provides. I would like to walk through an example to illustrate it.
Suppose a batch with txId 10 has tuples t1, t2, t3, t4, which come from the Kafka partitions p1, p2, p3, p4 respectively. When this batch is played for the very first time, it fails during processing: the commit happens for tuple t3 in the database but not for tuples t1, t2, t4. Since the batch failed, the metadata in ZooKeeper is not updated, i.e. the offsets for p1, p2, p3, p4 are not treated as committed.

The batch is expected to be replayed; however, suppose that before it gets replayed the Kafka partition p3 goes down. What happens now? My understanding is that a batch with the same transaction id, now containing only t1, t2, t4, may be replayed, and since p3 is down, t3 is not replayed. Because t3 is not replayed, even if the batch succeeds on replay, the offset for p3 does not get updated in ZooKeeper. That is all fine as far as fault tolerance and opaque behavior are concerned (a sketch of how I read the opaque update rule is in the P.S. below).

My concern is about what happens when partition p3 comes back up and the spout resumes reading p3 from the last offset it committed successfully. Tuple t3 will be read from p3 again, and it will necessarily be part of a batch with some txId > 10 (say 19), so it will be applied to the state a second time. This apparently violates the exactly-once semantics. Is the concern genuine, or am I missing something?

Regards,
Ashok Gupta
(+1) 361-522-2172
San Jose, CA
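P.S. To make sure I am reading the opaque behavior correctly, here is a minimal sketch of the per-key update rule I am assuming the opaque state follows when a txId is replayed with a possibly different tuple set. It is plain Java with made-up names (OpaqueCounter, applyBatch), not the actual Trident or storm-kafka API, and it only models a simple counter.

import java.util.HashMap;
import java.util.Map;

// Sketch of the opaque-transactional update rule as I understand it.
// Not the real Trident state API; names are illustrative only.
public class OpaqueCounter {

    // Per-key state: the txid that last wrote it, the value before that
    // write (prev) and the value after it (curr).
    private static class OpaqueValue {
        long lastTxid = -1;
        long prev;   // value before the batch with lastTxid was applied
        long curr;   // value after the batch with lastTxid was applied
    }

    private final Map<String, OpaqueValue> store = new HashMap<>();

    // Apply a batch delta for a key under transaction txid. If the same
    // txid is seen again (a replay, possibly with a different tuple set,
    // e.g. t3 missing because p3 is down), the update is re-based on prev,
    // so whatever that txid wrote earlier is overwritten, not double-counted.
    public void applyBatch(String key, long txid, long batchDelta) {
        OpaqueValue v = store.computeIfAbsent(key, k -> new OpaqueValue());
        if (txid == v.lastTxid) {
            // Replay of the same txid: discard its earlier partial write.
            v.curr = v.prev + batchDelta;
        } else {
            // New txid: the current value becomes the new rollback point.
            v.prev = v.curr;
            v.curr = v.curr + batchDelta;
            v.lastTxid = txid;
        }
    }

    public long get(String key) {
        OpaqueValue v = store.get(key);
        return v == null ? 0 : v.curr;
    }
}

With a rule like this, a replay of txId 10 without t3 simply overwrites whatever the earlier partial attempt of txId 10 wrote, which is why the replay itself looks fine to me; my question above is only about t3 showing up again later under a new txId such as 19.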
