A transaction in Flume consists of one or more batches, so the minimum requirement is that your channel's transactionCapacity be >= the batchSize of the source/sink. Since Flume provides "at least once" transaction semantics, all events that are part of the current transaction are stored internally in a Take List that Flume maintains, so that if the transaction fails, the events can be put back into the channel.
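(As a rough illustration of that sizing rule, here is a minimal memory-channel sketch; the names ch1 and s1 are placeholders and not taken from the config discussed in this thread. The channel's capacity bounds its transactionCapacity, which in turn must be at least the sink's batchSize.)

    # capacity >= transactionCapacity >= sink batchSize
    agent1.channels.ch1.type = memory
    agent1.channels.ch1.capacity = 10000
    agent1.channels.ch1.transactionCapacity = 1000
    agent1.sinks.s1.channel = ch1
    agent1.sinks.s1.batchSize = 1000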
Typically, when batchSize > transactionCapacity, the transaction will never succeed and will keep retrying. Since not a single batch went through, there should be no duplicates. But RollingFileSink writes every event taken from the channel immediately, so each time Flume retries the transaction, a partial set of events belonging to the current transaction/batch still makes it to the destination file, and those events are duplicated when the transaction fails, is rolled back, and is retried.

Thanks,
Rufus

On Wed, Jun 17, 2015 at 4:48 PM, Quintana, Cesar (C) <[email protected]> wrote:

> Oh man! Thanks for spotting that. Whoever modified this config must have
> copied and pasted because EVERY Memory Channel has the same typo.
>
> I’ve corrected it. Now, I’m still not understanding how having the
> TransactionCapacity = 100 and the BatchSize = 1000 would cause duplicates.
> Can someone walk me through that logic?
>
> Thanks for all the help so far. And, FYI, I am RTFMing it, as well.
>
> Cesar M. Quintana
>
> *From:* Hari Shreedharan [mailto:[email protected]]
> *Sent:* Wednesday, June 17, 2015 4:15 PM
> *To:* [email protected]
> *Subject:* Re: Flume duplicating a set of events many (hundreds of) times!
>
> On Wed, Jun 17, 2015 at 3:54 PM, Quintana, Cesar (C) <[email protected]> wrote:
>
> agent1.channels.PASPAChannel.transactionCapactiy = 1000
>
> This line has a typo - so the channel is starting up at default capacity.
> Change this to:
>
> agent1.channels.PASPAChannel.transactionCapacity = 1000
>
> Thanks,
>
> Hari
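To make the failure mode above concrete, here is a sketch of the misconfiguration and its fix. It assumes the memory channel named PASPAChannel from Cesar's config; the sink name PASPASink and its batchSize line are assumptions for illustration, since the sink section was not quoted in the thread.

    # The misspelled key is silently ignored, so the memory channel falls back
    # to its default transactionCapacity of 100:
    agent1.channels.PASPAChannel.transactionCapactiy = 1000

    # A sink (name assumed) asking for batches of 1000 then can never commit,
    # and each retry rewrites the partial batch to the rolling file:
    agent1.sinks.PASPASink.channel = PASPAChannel
    agent1.sinks.PASPASink.batchSize = 1000

    # Fix: spell the property correctly so a full batch fits in one transaction
    # (batchSize <= transactionCapacity):
    agent1.channels.PASPAChannel.transactionCapacity = 1000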
