Hi, I would like to remove duplicate events in Flume using an Interceptor. The idea is to calculate an MD5 hash (or similar) of each event and store it in Redis or another database. I mainly want to understand the performance cost and which solution handles this best.
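To make the idea concrete, here is a minimal sketch of what I have in mind, using the Jedis client. The class name, the Redis key prefix, and the TTL are just placeholders, and it assumes Redis is running on localhost:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import redis.clients.jedis.Jedis;

// Drops any event whose MD5 body hash was already seen recently.
public class DedupInterceptor implements Interceptor {

    private static final int TTL_SECONDS = 300; // placeholder: window wide enough to cover a batch
    private Jedis jedis;

    @Override
    public void initialize() {
        // Assumes Redis runs on the same host as the Flume agent
        jedis = new Jedis("localhost", 6379);
    }

    @Override
    public Event intercept(Event event) {
        String key = "flume:dedup:" + md5Hex(event.getBody());
        // SETNX is atomic: it returns 0 if the key already exists
        if (jedis.setnx(key, "1") == 0) {
            return null; // duplicate -> drop the event
        }
        jedis.expire(key, TTL_SECONDS); // keep the key set from growing forever
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        List<Event> kept = new ArrayList<>(events.size());
        for (Event e : events) {
            if (intercept(e) != null) {
                kept.add(e);
            }
        }
        return kept;
    }

    @Override
    public void close() {
        jedis.close();
    }

    private static String md5Hex(byte[] body) {
        try {
            StringBuilder hex = new StringBuilder();
            for (byte b : MessageDigest.getInstance("MD5").digest(body)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e); // MD5 ships with every JVM
        }
    }

    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() { return new DedupInterceptor(); }

        @Override
        public void configure(Context context) {
            // Redis host/port and TTL could be read from the Flume config here
        }
    }
}
```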
As I understand it, the maximum number of events that could be duplicates depends on the batchSize, so you would only need to store that many keys in the database. I don't know whether Redis has anything like Mongo's capped collections for this (a possible workaround is sketched below). Has anyone done something similar and measured the performance loss? What would be the best place to store the keys for really fast access: Mongo, Redis, ...? I think HBase or Cassandra would be worse, since Redis (or something similar) can run on the same host as Flume, so you don't lose time on the network. Is there any other solution for dealing with duplicates in real time?
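For the capped-collection question, one approximation I'm considering is a Redis sorted set scored by insertion time and trimmed to a fixed size. The class and key names below are hypothetical, and MAX_KEYS would have to be at least the batchSize:

```java
import redis.clients.jedis.Jedis;

// Approximates a Mongo capped collection: a sorted set scored by insertion
// time, trimmed so that only the newest MAX_KEYS hashes are kept.
public class CappedHashSet {

    private static final String KEY = "flume:dedup:recent"; // hypothetical key name
    private static final long MAX_KEYS = 10_000;            // placeholder: should be >= batchSize

    private final Jedis jedis;

    public CappedHashSet(Jedis jedis) {
        this.jedis = jedis;
    }

    // Returns true if the hash was already present, i.e. the event is a duplicate.
    public boolean seenBefore(String md5Hex) {
        // ZADD returns 0 when the member already existed (only its score is updated)
        boolean duplicate = jedis.zadd(KEY, System.currentTimeMillis(), md5Hex) == 0;
        // Drop everything except the MAX_KEYS members with the highest (newest) scores
        jedis.zremrangeByRank(KEY, 0, -(MAX_KEYS + 1));
        return duplicate;
    }
}
```

This would give a hard bound on memory instead of a TTL, though the trim adds an extra round trip per event unless the calls are pipelined. I'd be interested to hear if anyone has compared the two approaches.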
