Re: About duplicate events and how to deal with them in Flume with interceptors.

Majid Alfifi Fri, 07 Aug 2015 05:26:40 -0700

You may also find some hints in the discussions of FLUME-2173:

https://issues.apache.org/jira/plugins/servlet/mobile#issue/FLUME-2173


-Majid

> On Aug 7, 2015, at 2:33 PM, Guillermo Ortiz <[email protected]> wrote:
> 
> Thanks for the answer. I was talking more about possible failures of a Flume
> agent. There's a small possibility of getting duplicates even when the source
> is not producing them. They should be a really small percentage of the data
> size, but if the agent crashes you could get duplicates when you start the
> agent again.
> 
> I guess that you need a third player if you want to manage this case of
> duplicates, and it's not possible to use a CircularFifoQueue in the same JVM
> as Flume; that's why I thought about Redis or something similar. Ideally,
> that system should be independent of Flume and have HA.
> 
> 
> 
> 2015-08-07 13:20 GMT+02:00 Majid Alfifi <[email protected]>:
>> It's not clear if you are referring to duplicates that result from the 
>> source or duplicates that result from Flume itself trying to maintain the 
>> at-least-once delivery of events.
>> 
>> I had a case where the source was producing duplicates but the network
>> bandwidth was almost fully utilized by the regular de-duplicated stream, so
>> we couldn't afford to have duplicates travel all the way to the final
>> destination (HDFS in our case). We ultimately just used a CircularFifoQueue
>> in a Flume interceptor. It was a good fit because in our case all
>> duplicates arrive within a window of about 30 seconds. We were receiving
>> about 600 events per second, so a CircularFifoQueue of size 18,000, for
>> example, was an easy way to remove duplicates, but at the expense of having
>> a single Flume agent do the de-duplication (SPOF).
>> 
>> However, we still saw duplicates at the final destination, resulting either
>> from Flume's architecture or from occasional duplicates that arrived more
>> than 30 seconds apart from the source, but they were a very small percentage
>> of the data size. We had a MapReduce job that removed those remaining
>> duplicates in HDFS.
>> 
>> -Majid
>> 
>> > On Aug 7, 2015, at 1:23 PM, Guillermo Ortiz <[email protected]> wrote:
>> >
>> > Hi,
>> >
>> > I would like to delete duplicates in Flume with interceptors.
>> > The idea is to calculate an MD5 or similar hash for each event and store
>> > it in Redis or another database. I just want to check the loss of
>> > performance and which is the best solution for dealing with it.
>> >
>> > As I understand it, the maximum number of events that could be duplicates
>> > depends on the batchSize. So you only need to store that number of keys in
>> > your database. I don't know if Redis has a feature like capped collections
>> > in Mongo.
>> >
>> > Has anyone done something similar who knows the loss of performance?
>> > What would be the best place to store the keys for really fast
>> > access? Mongo, Redis, ...? I think that HBase or Cassandra could be worse,
>> > since Redis or similar could run on the same host as Flume, so you
>> > don't lose time to the network.
>> > Any other solution to deal with duplicates in real time?
>> >
>> >
>
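
The windowed de-duplication described in the quoted 13:20 message (remember the last N event keys and drop an event whose key is still in the window) might be sketched roughly as below. This is only an illustration, not the code used in that deployment: the class name DedupWindow and its methods are hypothetical, it uses a plain ArrayDeque plus HashSet from the JDK in place of Commons Collections' CircularFifoQueue, and it takes the MD5 of the event body as the key, as suggested in the original question.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper: remembers the digests of the last `capacity` events. */
class DedupWindow {
    private final int capacity;
    private final ArrayDeque<String> order = new ArrayDeque<>(); // oldest key first
    private final Set<String> seen = new HashSet<>();            // fast membership test

    DedupWindow(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    /** Hex MD5 of the event body; used only as a de-duplication key. */
    static String md5Hex(byte[] body) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(body);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns true if the body is not in the window, i.e. the event should be kept. */
    synchronized boolean firstSeen(byte[] body) {
        String key = md5Hex(body);
        if (seen.contains(key)) {
            return false; // duplicate within the window
        }
        if (order.size() == capacity) {
            seen.remove(order.removeFirst()); // evict the oldest key
        }
        order.addLast(key);
        seen.add(key);
        return true;
    }
}
```

With the figures from the thread (about 600 events per second, duplicates within about 30 seconds), the window would be created as new DedupWindow(18000). Inside a Flume interceptor, intercept(Event) could return null when firstSeen(event.getBody()) is false, since returning null from intercept drops the event. Because the window lives in the agent's memory, it is lost when the agent restarts, so this does not cover the crash-and-replay duplicates discussed above.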
