It all depends on the nature of the spout. With a transactional spout, batches are always the same, even if replayed. With an opaque spout, batches can change. But you have the guarantee that a tuple will only ever be processed successfully in a single batch. If a tuple fails in one batch, it could succeed in another.
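Conceptually this is what the opaque map states give you: each stored value remembers the txid of the batch that last wrote it, plus the value from before that write. A rough sketch of the bookkeeping, assuming a simple count aggregation (simplified, not the exact storm.trident.state.OpaqueValue API):

    // Simplified model of Trident's opaque-state bookkeeping.
    class OpaqueValue {
        long txid;  // txid of the batch that produced 'curr'
        long prev;  // committed value from before that batch
        long curr;  // value after applying that batch
    }

    class OpaqueCountState {
        private final OpaqueValue stored = new OpaqueValue();

        // Apply one batch's partial count for a key.
        long applyBatch(long batchTxid, long batchCount) {
            if (batchTxid == stored.txid) {
                // Same batch replayed, possibly with different tuples:
                // recompute on top of the pre-batch value, discarding the
                // earlier attempt instead of double-counting it.
                stored.curr = stored.prev + batchCount;
            } else {
                // A new batch: whatever the previous batch wrote becomes
                // the committed baseline.
                stored.prev = stored.curr;
                stored.curr = stored.prev + batchCount;
                stored.txid = batchTxid;
            }
            return stored.curr;
        }
    }

Because a replay of the same txid overwrites relative to prev rather than curr, a partially committed attempt is never counted twice for that batch.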
-Taylor

> On May 6, 2014, at 8:19 PM, Ashok Gupta <[email protected]> wrote:
>
> I think it can. That is where the coordinator comes into the picture. The
> coordinator defines the parameters of a batch, and the emitters do the job
> of emitting the sub-portions of the batch.
>
>> On Mon, May 5, 2014 at 12:50 PM, Abhishek Bhattacharjee
>> <[email protected]> wrote:
>> Are you sure that a batch can consist of tuples from different partitions?
>> I am just asking, I am not sure; if it can, then your question seems to be
>> valid, else it is not valid anymore :-)
>>
>>> On Fri, May 2, 2014 at 7:42 AM, Ashok Gupta <[email protected]>
>>> wrote:
>>>
>>> Hi,
>>>
>>> I have a theoretical question about the guarantees
>>> OpaqueTridentKafkaSpout provides. I would like to take an example to
>>> illustrate it.
>>>
>>> Suppose a batch with txId 10 has tuples t1, t2, t3, t4, which come from
>>> the Kafka partitions p1, p2, p3, p4 respectively. When this batch is
>>> played for the very first time, it fails processing; however, the commit
>>> happens for tuple t3 in the database while it does not happen for tuples
>>> t1, t2, t4. Since the batch failed, the metadata in ZooKeeper is not
>>> going to be updated, i.e. the offsets for p1, p2, p3, p4 will not be
>>> treated as committed. The batch is expected to be replayed; however,
>>> suppose that before it gets replayed, the Kafka partition p3 goes down.
>>> What happens now? I understand that another batch with the same
>>> transaction id containing t1, t2, t4 may be replayed, but since p3 is
>>> down, t3 won't be replayed with it. Since t3 is not replayed, even if
>>> the batch succeeds on replay, the offsets for p3 don't get updated in
>>> ZooKeeper. That is all fine as far as fault tolerance and opaque
>>> behavior are concerned.
>>>
>>> My concern is more around what happens when partition p3 comes back up
>>> and the spout starts reading data from the last offset it committed
>>> successfully. Tuple t3 is going to be read again from partition p3, and
>>> since it is certainly going to be in a batch with some txId > 10 (say
>>> 19), it is going to be applied to the state again. This apparently
>>> violates the exactly-once semantics.
>>>
>>> Is the concern genuine, or am I missing something?
>>>
>>> Regards
>>> --
>>> Ashok Gupta,
>>> (+1) 361-522-2172
>>> San Jose, CA
>>
>> --
>> Abhishek Bhattacharjee
>> Pune Institute of Computer Technology
>
> --
> Ashok Gupta,
> (+1) 361-522-2172
> San Jose, CA
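The coordinator/emitter split Ashok mentions mid-thread is roughly the following contract (an abbreviated paraphrase of Trident's partitioned spout interfaces, not the exact storm.trident.spout signatures):

    // Abbreviated sketch: how a partitioned Trident spout divides work.
    interface Coordinator<Partitions> {
        // Defines the parameters of the next batch across all partitions.
        Partitions getPartitionsForBatch();
        boolean isReady(long txid);
    }

    interface Emitter<Partition, Meta> {
        // Emits one partition's sub-portion of the batch, picking up from
        // the per-partition metadata (e.g. a Kafka offset) recorded for the
        // previous batch, and returns updated metadata for this partition.
        Meta emitPartitionBatch(long txid, Partition partition, Meta lastMeta);
    }

This is why a single batch can span tuples from several partitions: the coordinator defines one batch, and each partition's emitter contributes its own slice with its own offset metadata.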
