Flume is an at-least-once system. This means we will never lose data, but you may get duplicate events on errors. In the cases you pointed out, where the events were written but we still return BACKOFF, you will get duplicate events in the channel or in HDFS.
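To see concretely why that gives duplicates rather than data loss, here is a minimal, self-contained sketch. It is plain Java, not the actual Flume or Kafka code; the class name, the two-attempt loop, and the simulated commit failure are all invented for illustration:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    // Toy simulation: a batch reaches the "channel", but the offset commit
    // fails afterwards, so the next attempt re-reads the same records.
    public class AtLeastOnceDemo {
        public static void main(String[] args) {
            List<String> sourceRecords = Arrays.asList("e1", "e2", "e3");
            List<String> channel = new ArrayList<>();   // stands in for the Flume channel
            int committedOffset = 0;                    // last successfully committed position

            for (int attempt = 1; attempt <= 2; attempt++) {
                List<String> batch = sourceRecords.subList(committedOffset, sourceRecords.size());
                channel.addAll(batch);                    // step 1: events are written to the channel
                boolean commitSucceeded = (attempt == 2); // step 2: the first commit "fails"
                if (commitSucceeded) {
                    committedOffset = sourceRecords.size();
                }
                // When step 2 fails after step 1 succeeded, nothing records that the
                // batch was delivered, so the next attempt fetches and writes it again.
            }
            System.out.println(channel);  // [e1, e2, e3, e1, e2, e3] -- duplicated, never lost
        }
    }

The same reasoning applies on the HDFS side: the data is already flushed to the file, but because the transaction commit failed, the channel will hand the same event to the sink again.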
You probably want to write a small script to de-duplicate the data in HDFS, like we do in this example: https://github.com/hadooparchitecturebook/clickstream-tutorial/blob/master/03_processing/01_dedup/pig/dedup.pig (a rough Java sketch of the same idea follows the quoted message below).

Gwen

On Tue, Apr 14, 2015 at 9:17 AM, Tao Li <[email protected]> wrote:
> Hi all:
>
> I have a question about "Transaction". For example, the KafkaSource code looks like
> this:
>
>     try {
>         getChannelProcessor().processEventBatch(eventList);
>         consumer.commitOffsets();
>         return Status.READY;
>     } catch (Exception e) {
>         return Status.BACKOFF;
>     }
>
> If processEventBatch() succeeds but commitOffsets() fails, it will return
> BACKOFF. But the eventList has already been written to the channel.
>
> ----------------------------------
>
> Similarly, the HDFSEventSink code looks like this:
>
>     try {
>         bucketWriter.append(event);
>         bucketWriter.flush();
>         transaction.commit();
>         return Status.READY;
>     } catch (Exception e) {
>         transaction.rollback();
>         return Status.BACKOFF;
>     }
>
> If bucketWriter.flush() succeeds but transaction.commit() fails, it will call
> transaction.rollback() and return BACKOFF. But the event has already been flushed
> to HDFS.
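For reference, the core of that dedup step can be sketched in a few lines of Java. This is not the linked Pig script, and it only works for data small enough to fit in memory; the tab-separated layout with a unique event id in the first field, and the input/output paths taken from the command line, are assumptions for illustration:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Keep only the first occurrence of each event id (assumed to be the
    // first tab-separated field of every line).
    public class Dedup {
        public static void main(String[] args) throws IOException {
            Set<String> seen = new HashSet<>();
            List<String> unique = new ArrayList<>();
            for (String line : Files.readAllLines(Paths.get(args[0]))) {
                String id = line.split("\t", 2)[0];
                if (seen.add(id)) {          // add() returns false for a duplicate id
                    unique.add(line);
                }
            }
            Files.write(Paths.get(args[1]), unique);
        }
    }

On HDFS-sized data the same idea runs as a distributed job (hence the Pig script in the link): group by a key that uniquely identifies an event and keep one record per group.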
