Hi, I would like to remove duplicate events in Flume using an Interceptor. The idea is to calculate an MD5 hash (or similar) of each event and store it in Redis or another database. I mainly want to understand the performance cost and which solution handles this best.
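To make the idea concrete, here is a minimal sketch of what I have in mind, using the Jedis client. The class name, the Redis key prefix, and the TTL are just placeholders, and it assumes Redis is running on localhost:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import redis.clients.jedis.Jedis;

// Drops any event whose MD5 body hash was already seen recently.
public class DedupInterceptor implements Interceptor {

    private static final int TTL_SECONDS = 300; // placeholder: window wide enough to cover a batch
    private Jedis jedis;

    @Override
    public void initialize() {
        // Assumes Redis runs on the same host as the Flume agent
        jedis = new Jedis("localhost", 6379);
    }

    @Override
    public Event intercept(Event event) {
        String key = "flume:dedup:" + md5Hex(event.getBody());
        // SETNX is atomic: it returns 0 if the key already exists
        if (jedis.setnx(key, "1") == 0) {
            return null; // duplicate -> drop the event
        }
        jedis.expire(key, TTL_SECONDS); // keep the key set from growing forever
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        List<Event> kept = new ArrayList<>(events.size());
        for (Event e : events) {
            if (intercept(e) != null) {
                kept.add(e);
            }
        }
        return kept;
    }

    @Override
    public void close() {
        jedis.close();
    }

    private static String md5Hex(byte[] body) {
        try {
            StringBuilder hex = new StringBuilder();
            for (byte b : MessageDigest.getInstance("MD5").digest(body)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e); // MD5 ships with every JVM
        }
    }

    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() { return new DedupInterceptor(); }

        @Override
        public void configure(Context context) {
            // Redis host/port and TTL could be read from the Flume config here
        }
    }
}
```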
As I understand it, the maximum number of events that could be duplicates depends on the batchSize, so you would only need to store that many keys in the database. I don't know whether Redis has anything like Mongo's capped collections for this (a possible workaround is sketched below). Has anyone done something similar and measured the performance loss? What would be the best place to store the keys for really fast access: Mongo, Redis, ...? I think HBase or Cassandra would be worse, since Redis (or something similar) can run on the same host as Flume, so you don't lose time on the network. Is there any other solution for dealing with duplicates in real time?
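For the capped-collection question, one approximation I'm considering is a Redis sorted set scored by insertion time and trimmed to a fixed size. The class and key names below are hypothetical, and MAX_KEYS would have to be at least the batchSize:

```java
import redis.clients.jedis.Jedis;

// Approximates a Mongo capped collection: a sorted set scored by insertion
// time, trimmed so that only the newest MAX_KEYS hashes are kept.
public class CappedHashSet {

    private static final String KEY = "flume:dedup:recent"; // hypothetical key name
    private static final long MAX_KEYS = 10_000;            // placeholder: should be >= batchSize

    private final Jedis jedis;

    public CappedHashSet(Jedis jedis) {
        this.jedis = jedis;
    }

    // Returns true if the hash was already present, i.e. the event is a duplicate.
    public boolean seenBefore(String md5Hex) {
        // ZADD returns 0 when the member already existed (only its score is updated)
        boolean duplicate = jedis.zadd(KEY, System.currentTimeMillis(), md5Hex) == 0;
        // Drop everything except the MAX_KEYS members with the highest (newest) scores
        jedis.zremrangeByRank(KEY, 0, -(MAX_KEYS + 1));
        return duplicate;
    }
}
```

This would give a hard bound on memory instead of a TTL, though the trim adds an extra round trip per event unless the calls are pipelined. I'd be interested to hear if anyone has compared the two approaches.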
