Flume is an at-least-once system. This means we will never lose data, but you may get duplicate events on errors. In the cases you pointed out, where the events were written but we still return BACKOFF, you will get duplicate events in the channel or in HDFS.
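To see concretely why that gives duplicates rather than data loss, here is a minimal, self-contained sketch. It is plain Java, not the actual Flume or Kafka code; the class name, the two-attempt loop, and the simulated commit failure are all invented for illustration:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    // Toy simulation: a batch reaches the "channel", but the offset commit
    // fails afterwards, so the next attempt re-reads the same records.
    public class AtLeastOnceDemo {
        public static void main(String[] args) {
            List<String> sourceRecords = Arrays.asList("e1", "e2", "e3");
            List<String> channel = new ArrayList<>();   // stands in for the Flume channel
            int committedOffset = 0;                    // last successfully committed position

            for (int attempt = 1; attempt <= 2; attempt++) {
                List<String> batch = sourceRecords.subList(committedOffset, sourceRecords.size());
                channel.addAll(batch);                    // step 1: events are written to the channel
                boolean commitSucceeded = (attempt == 2); // step 2: the first commit "fails"
                if (commitSucceeded) {
                    committedOffset = sourceRecords.size();
                }
                // When step 2 fails after step 1 succeeded, nothing records that the
                // batch was delivered, so the next attempt fetches and writes it again.
            }
            System.out.println(channel);  // [e1, e2, e3, e1, e2, e3] -- duplicated, never lost
        }
    }

The same reasoning applies on the HDFS side: the data is already flushed to the file, but because the transaction commit failed, the channel will hand the same event to the sink again.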
You probably want to write a small script to de-duplicate the data in HDFS, like we do in this example: https://github.com/hadooparchitecturebook/clickstream-tutorial/blob/master/03_processing/01_dedup/pig/dedup.pig (a rough Java sketch of the same idea follows the quoted message below).

Gwen

On Tue, Apr 14, 2015 at 9:17 AM, Tao Li <[email protected]> wrote:
> Hi all:
>
> I have a question about "Transaction". For example, the KafkaSource code looks like
> this:
>
>     try {
>         getChannelProcessor().processEventBatch(eventList);
>         consumer.commitOffsets();
>         return Status.READY;
>     } catch (Exception e) {
>         return Status.BACKOFF;
>     }
>
> If processEventBatch() succeeds but commitOffsets() fails, it will return
> BACKOFF. But the eventList has already been written to the channel.
>
> ----------------------------------
>
> Similarly, the HDFSEventSink code looks like this:
>
>     try {
>         bucketWriter.append(event);
>         bucketWriter.flush();
>         transaction.commit();
>         return Status.READY;
>     } catch (Exception e) {
>         transaction.rollback();
>         return Status.BACKOFF;
>     }
>
> If bucketWriter.flush() succeeds but transaction.commit() fails, it will call
> transaction.rollback() and return BACKOFF. But the event has already been flushed
> to HDFS.
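For reference, the core of that dedup step can be sketched in a few lines of Java. This is not the linked Pig script, and it only works for data small enough to fit in memory; the tab-separated layout with a unique event id in the first field, and the input/output paths taken from the command line, are assumptions for illustration:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Keep only the first occurrence of each event id (assumed to be the
    // first tab-separated field of every line).
    public class Dedup {
        public static void main(String[] args) throws IOException {
            Set<String> seen = new HashSet<>();
            List<String> unique = new ArrayList<>();
            for (String line : Files.readAllLines(Paths.get(args[0]))) {
                String id = line.split("\t", 2)[0];
                if (seen.add(id)) {          // add() returns false for a duplicate id
                    unique.add(line);
                }
            }
            Files.write(Paths.get(args[1]), unique);
        }
    }

On HDFS-sized data the same idea runs as a distributed job (hence the Pig script in the link): group by a key that uniquely identifies an event and keep one record per group.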
