I want to store them in Kafka. If I use Kafka as the channel, I guess it doesn't fix this case.

On 07/08/2015 14:26, "Majid Alfifi" <[email protected]> wrote:
> You may also find some hints in the discussions of FLUME-2173:
>
> https://issues.apache.org/jira/plugins/servlet/mobile#issue/FLUME-2173
>
> -Majid
>
> On Aug 7, 2015, at 2:33 PM, Guillermo Ortiz <[email protected]> wrote:
>
> Thanks for the answer. I was talking more about possible failures of a
> Flume agent. There's a small possibility of getting duplicates even when the
> source isn't producing any. It's true that they should be a really small
> percentage of the data size, but if the agent crashes you could get
> duplicates when you start the agent again.
>
> I guess you need a third player if you want to handle this kind of
> duplicate, and it's not possible to use a CircularFifoQueue in the same JVM
> as Flume; that's why I thought about Redis or something similar. Ideally,
> that system should be independent of Flume and have HA.
>
>
> 2015-08-07 13:20 GMT+02:00 Majid Alfifi <[email protected]>:
>
>> It's not clear if you are referring to duplicates that come from the
>> source or duplicates that result from Flume itself trying to maintain
>> at-least-once delivery of events.
>>
>> I had a case where the source was producing duplicates but the network
>> bandwidth was almost fully utilized by the regular de-duplicated stream, so
>> we couldn't afford to have duplicates travel all the way to the final
>> destination (HDFS in our case). We ultimately just used a CircularFifoQueue
>> in a Flume interceptor. It was a good fit because in our case all
>> duplicates arrive within about a 30-second window. We were receiving about
>> 600 events per second, so a CircularFifoQueue of size 18,000, for example,
>> was an easy way to remove duplicates, but at the expense of having a single
>> Flume agent do the de-duplication (a SPOF).
>>
>> However, we still see duplicates at the final destination that are a
>> result of the Flume architecture, or from occasional duplicates arriving
>> more than 30 seconds apart at the source, but they were a very small
>> percentage of the data size. We had a MapReduce job that removed those
>> remaining duplicates in HDFS.
>>
>> -Majid
>>
>> > On Aug 7, 2015, at 1:23 PM, Guillermo Ortiz <[email protected]> wrote:
>> >
>> > Hi,
>> >
>> > I would like to delete duplicates in Flume with interceptors.
>> > The idea is to calculate an MD5 (or similar) for each event and store it
>> > in Redis or another database. I just want to check the performance hit
>> > and which solution is best for dealing with it.
>> >
>> > As I understand it, the maximum number of events that could be duplicates
>> > depends on the batchSize, so you only need to store that number of keys
>> > in your database. I don't know if Redis has a feature like capped
>> > collections in Mongo.
>> >
>> > Has anyone done something similar and knows the performance hit?
>> > Which would be the best place to store the keys for really fast access?
>> > Mongo, Redis, ...? I think HBase or Cassandra could be worse, since Redis
>> > or something similar could run on the same host as Flume and you don't
>> > lose time on the network.
>> > Any other solution to deal with duplicates in real time?
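
For reference, a minimal sketch of the interceptor approach Majid describes: hash each event body (MD5, as Guillermo suggests) and drop the event if the digest is already in a bounded CircularFifoQueue (Apache Commons Collections 4). The class name, the windowSize property, and the default of 18,000 are illustrative assumptions, not anything shipped with Flume.

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

import org.apache.commons.collections4.queue.CircularFifoQueue;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

public class DedupInterceptor implements Interceptor {

  private final int windowSize;
  private CircularFifoQueue<String> recentDigests;

  private DedupInterceptor(int windowSize) {
    this.windowSize = windowSize;
  }

  @Override
  public void initialize() {
    // Oldest digests are evicted automatically once capacity is reached.
    recentDigests = new CircularFifoQueue<>(windowSize);
  }

  @Override
  public Event intercept(Event event) {
    String digest = md5(event.getBody());
    // Linear scan over the window; a parallel HashSet would make this O(1).
    if (recentDigests.contains(digest)) {
      return null;  // returning null drops the duplicate event
    }
    recentDigests.add(digest);
    return event;
  }

  @Override
  public List<Event> intercept(List<Event> events) {
    List<Event> out = new ArrayList<>(events.size());
    for (Event e : events) {
      Event kept = intercept(e);
      if (kept != null) {
        out.add(kept);
      }
    }
    return out;
  }

  @Override
  public void close() { }

  private static String md5(byte[] body) {
    try {
      return Base64.getEncoder().encodeToString(
          MessageDigest.getInstance("MD5").digest(body));
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e);
    }
  }

  public static class Builder implements Interceptor.Builder {
    // Roughly 30 s of events at ~600 events/s, per the numbers in the thread.
    private int windowSize = 18000;

    @Override
    public void configure(Context context) {
      windowSize = context.getInteger("windowSize", 18000);
    }

    @Override
    public Interceptor build() {
      return new DedupInterceptor(windowSize);
    }
  }
}

It would be wired in through the usual interceptor configuration, e.g. a1.sources.r1.interceptors = i1 and a1.sources.r1.interceptors.i1.type set to the fully qualified DedupInterceptor$Builder class.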
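
The Redis variant Guillermo asks about could look roughly like the snippet below, using the Jedis client: SET with NX and EX makes the duplicate check atomic, and the TTL plays the role of Mongo's capped collections by letting old digests expire on their own. The class, key prefix, and TTL parameter are assumptions, and the SetParams import path differs between Jedis releases.

import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class RedisDedupCheck {

  private final Jedis jedis;

  public RedisDedupCheck(String host, int port) {
    // Co-locating Redis with the Flume agent, as suggested in the thread,
    // keeps the round trip off the network.
    this.jedis = new Jedis(host, port);
  }

  /** Returns true when this digest has not been seen within the TTL window. */
  public boolean firstSeen(String digest, int ttlSeconds) {
    // SET ... NX EX is atomic: "OK" means the key was created (first sighting),
    // null means it already existed, i.e. the event is a duplicate.
    String reply = jedis.set("dedup:" + digest, "1",
        SetParams.setParams().nx().ex(ttlSeconds));
    return "OK".equals(reply);
  }
}

An interceptor would call firstSeen(md5(event.getBody()), 30) and drop the event when it returns false; unlike the in-JVM queue, the seen set survives an agent restart, which is the failure case Guillermo is worried about.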
