Hi Henry,

Just to add to Mike's response:

When used with secure channels (mainly the file channel) and with transports that can be rolled back (Avro), message delivery is guaranteed (eventually). The only way you can lose data is for a part of the chain to be permanently removed: a hard disk failure or removal of the physical hardware.

Prevention of data duplication has never been an objective of Flume, though duplication is uncommon in a properly configured setup. The larger your batch sizes are, the more duplication you may get with each partial failure, because a batch that fails partway is retried as a whole. Similarly, ordered arrival of data is not guaranteed. The best way to address these two issues, if they are a concern, is to run a map-reduce task or similar to reduce the data to unique rows and/or reorder it.
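
As a rough sketch of that clean-up step (plain Python here rather than an actual map-reduce job), and assuming each record carries a unique ID and a timestamp assigned at the source: the Record fields and the dedupe_and_reorder function below are hypothetical names for illustration, not part of Flume.

```python
from typing import Dict, Iterable, List, NamedTuple


class Record(NamedTuple):
    event_id: str      # hypothetical unique ID assigned at the source
    timestamp: float   # hypothetical source-side event time
    body: str


def dedupe_and_reorder(records: Iterable[Record]) -> List[Record]:
    """Keep one record per event_id, then sort by source timestamp."""
    unique: Dict[str, Record] = {}
    for rec in records:
        # Redelivered copies share an event_id, so the first copy seen wins.
        unique.setdefault(rec.event_id, rec)
    # Ties on timestamp are broken by event_id to keep the order stable.
    return sorted(unique.values(), key=lambda r: (r.timestamp, r.event_id))


# Example: a replayed batch delivered "b" twice, and "a" arrived late.
delivered = [
    Record("b", 2.0, "second"),
    Record("a", 1.0, "first"),
    Record("b", 2.0, "second"),  # duplicate from the replayed batch
]
cleaned = dedupe_and_reorder(delivered)
```

This only works if the source stamps each record with something that identifies it uniquely; without such a key, duplicates cannot be told apart from genuinely repeated rows.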

On 01/24/2013 12:26 PM, Henry Ma wrote:
Dear Flume developers and users,

I understand that Flume NG uses channel-based transactions to guarantee reliable message delivery between agents. But in some extreme failure scenarios, will Flume remain fully reliable? I have thought of the scenarios below.

1. In transactions between agents, what will happen if the receiving agent process goes down just after it commits its put transaction and before it sends the success indication to the sending agent? Will the sending agent send the same event again when the receiving agent recovers, and cause data duplication?

2. In the communication between the client (the data source, sending data to the first-hop agent) and the first-hop agent, what will happen if the agent process goes down just after it receives the event and before it saves the event to its channel? Will this cause data loss?

3. In the communication between the final-hop agent and the storage system (such as MySQL, HDFS, a file system, etc.), what happens if the agent goes down before it commits the saving transaction but after it has already saved some data in the storage? Will this cause data duplication after the agent recovers?

Thank you very much!
--
Best Regards,
Henry Ma
