On Sep 02, sagar naik wrote: >We are counting events for our application. >Sometimes, the same event arrives multiple times. >This leads to counting of same event multiple times. >Is there a way I can avoid this ? >(Say timestamp on value or filters ?)
A basic data model question is do you want a set that supports a "count" operation or an accumulator? An accumulator (counter) is a plain integer where you just get to bump up counts. What contributes to the counts is outside the purview of this scheme. A set on the other hand is used for accumulating items that have an idenity. When you say "same event, multiple times", it sounds like the events have some identity and that you wish to de-duplicate the counting activity based on this idenity. The *only* way of implementing this accurately is you keep adding these idenities to a set. When you need a count of how many events have occured, then the cardinality of the set holds the answer. The principal difference is an O(1) solution v/s an O(n) solution with respect to memory requirements. The question to ask is if you would like to bear this cost in hbase or some layer that pre-processes data before it hits hbase. Note that hbase does not have an efficient support for "count (*)" operation in general. An approximate pre-processing solution can be as follows: * maintain a TTL based cache of all identities. * Do an atomic put-if-absent operation on this cache * Increment the counter if the put succeeds. This technique will bounded memory usage and a reasonable level of accuracy if the following assumptions hold good: * The arrival rate of events has a predictable & reasonable upper bound * The duplicate events arrive at roughly the same time
