Thanks for the response. I’ve been sitting on this for a few months, as fixing 
the typo resolved the issue. Still, I must have read your response like 40 
times because I just wasn’t getting the logic behind the duplicates. Well, I 
think I’ve almost got it, but I’m still unclear on something.


- Let’s say we’re using a Memory Channel and a Rolling File Sink
- Let’s say Batch Size for the Source and Sink equals 50
- Let’s say Transaction Capacity for the Channel = 25
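
In config terms, I’m picturing something roughly like this (the agent and 
component names, source type, and directories are made up just to pin down 
the numbers):

# made-up agent just to illustrate the numbers above
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# source batch size = 50 (spooldir is an arbitrary choice here)
agent1.sources.src1.type = spooldir
agent1.sources.src1.spoolDir = /tmp/flume-in
agent1.sources.src1.batchSize = 50
agent1.sources.src1.channels = ch1

# memory channel with transaction capacity = 25
agent1.channels.ch1.type = memory
agent1.channels.ch1.transactionCapacity = 25

# rolling file sink with batch size = 50
agent1.sinks.sink1.type = file_roll
agent1.sinks.sink1.sink.directory = /tmp/flume-out
agent1.sinks.sink1.batchSize = 50
agent1.sinks.sink1.channel = ch1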

So, as Data A1 through A20 come in from the source, instead of waiting for the 
Sink Batch Size of 50 to be filled, they are immediately written to the File 
Sink. However, A1..A20 are still added to the Take List, since the Sink’s 
Batch of 50 hasn’t been filled and the Transaction is not yet marked as 
committed. Now, let’s say there is a Traffic Surge, and Data A21..A70 are sent 
to the source. Data A21..A25 are accepted into the Channel, added to the Take 
List, and written to the File Sink. The Sink Batch Size of 50 still isn’t 
reached, and the Transaction still isn’t committed. Now, A26..A70 will not make 
it into the Channel because it’s currently filled by A1..A25, and nothing else 
will ever get into the Channel because the Batch Size of 50 will never be 
reached, so that Transaction will never be marked as complete, nor will the 
Channel ever be emptied.

Where is the part where the Transaction of writing to the Sink fails? Is there 
some Timeout that causes a retry on the Transaction on the Sink side? Will 
transaction A1..A25 be written over and over to the Sink?

Thanks again for all the help!

Cesar M. Quintana

From: Johny Rufus [mailto:[email protected]]
Sent: Wednesday, June 17, 2015 6:52 PM
To: [email protected]
Subject: Re: Flume duplicating a set of events many (hundreds of) times!

A transaction in Flume consists of 1 or more batches. So the minimum 
requirement is your channel's transaction capacity >= batchSize of the src/sink.
Since Flume supports "at least once" transaction semantics, all events part of 
the current transaction are stored internally as part of a Take List that Flume 
maintains, so that in case of transaction failure, the events can be put back 
into the channel.
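
To picture it, a sink's delivery cycle is shaped roughly like this (a 
simplified sketch of the pattern, not the actual Flume source; batchSize, 
getChannel() and writeToDestination() stand in for the real sink's fields and 
output logic):

import org.apache.flume.Channel;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.Sink;
import org.apache.flume.Transaction;

// Simplified per-batch delivery cycle, shaped like a Sink.process() implementation.
public Sink.Status process() throws EventDeliveryException {
    Channel channel = getChannel();
    Transaction tx = channel.getTransaction();
    tx.begin();
    try {
        for (int i = 0; i < batchSize; i++) {
            Event event = channel.take();   // moves the event onto the take list
            if (event == null) {
                break;                      // channel is empty for now
            }
            writeToDestination(event);      // RollingFileSink writes it out right here
        }
        tx.commit();                        // success: the take list is discarded for good
        return Sink.Status.READY;
    } catch (Throwable t) {
        tx.rollback();                      // failure: the take list goes back into the
                                            // channel, so those events will be taken
                                            // (and written) again on the next attempt
        throw new EventDeliveryException("Failed to deliver batch", t);
    } finally {
        tx.close();
    }
}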

Typically, when batchSize > transactionCapacity, the transaction will never 
succeed and will keep on retrying. Since not a single batch went through, there 
should be no duplicates.
But RollingFileSink writes every event taken from the channel immediately, so 
every time Flume retries the transaction, the partial set of events that are 
part of the current transaction/batch still makes it to the destination file, 
and those events are duplicated when the transaction fails, is rolled back, 
and is retried.
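
So the rule of thumb is to keep the channel's transactionCapacity at least as 
large as the biggest batchSize that touches that channel, along these lines 
(the component names and the capacity value are just placeholders, not taken 
from your config):

# transactionCapacity must cover the largest batch taken from (or put into)
# the channel in a single transaction
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 10000
agent1.channels.ch1.transactionCapacity = 1000

# a sink batchSize of 1000 now fits inside a single channel transaction
agent1.sinks.sink1.batchSize = 1000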

Thanks,
Rufus

On Wed, Jun 17, 2015 at 4:48 PM, Quintana, Cesar (C) 
<[email protected]<mailto:[email protected]>> wrote:
Oh man! Thanks for spotting that. Whoever modified this config must have copied 
and pasted because EVERY Memory Channel has the same typo.

I’ve corrected it. Now, I’m still not understanding how having the 
TransactionCapacity = 100 and the BatchSize = 1000 would cause duplicates. Can 
someone walk me through that logic?

Thanks for all the help so far. And, FYI, I am RTFMing it, as well.

Cesar M. Quintana

From: Hari Shreedharan 
[mailto:[email protected]<mailto:[email protected]>]
Sent: Wednesday, June 17, 2015 4:15 PM
To: [email protected]
Subject: Re: Flume duplicating a set of events many (hundreds of) times!


On Wed, Jun 17, 2015 at 3:54 PM, Quintana, Cesar (C) 
<[email protected]<mailto:[email protected]>> wrote:
agent1.channels.PASPAChannel.transactionCapactiy = 1000

This line has a typo - so the channel is starting up with the default 
transaction capacity. Change this to:
agent1.channels.PASPAChannel.transactionCapacity = 1000


Thanks,
Hari
