On Sep 02, sagar naik wrote:
>We are counting events for our application.
>Sometimes, the same event arrives multiple times.
>This leads to counting of same event multiple times.
>Is there a way I can avoid this ?
>(Say timestamp on value or filters ?)

A basic data model question: do you want a set that supports a "count"
operation, or an accumulator?

An accumulator (counter) is a plain integer where you just get to bump
up counts. What contributes to the counts is outside the purview of this
scheme.

A set, on the other hand, is used for accumulating items that have an
identity. When you say "same event, multiple times", it sounds like the
events have some identity and that you wish to de-duplicate the counting
activity based on this identity. The *only* way of implementing this
accurately is to keep adding these identities to a set. When you need a
count of how many events have occurred, the cardinality of the set
holds the answer.
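
To make the distinction concrete, here is a minimal sketch in plain Python
(not hbase code). The event identities are made up for illustration.

```python
# Hypothetical stream of event identities; "e1" and "e2" arrive twice.
events = ["e1", "e2", "e1", "e3", "e2"]

# Accumulator: a single integer. Every arrival bumps the count,
# so duplicates are counted again.
counter = 0
for _ in events:
    counter += 1

# Set: one entry per distinct identity. Adding an identity that is
# already present changes nothing, so duplicates collapse.
seen = set()
for event_id in events:
    seen.add(event_id)

# The de-duplicated count is the cardinality of the set.
distinct_count = len(seen)
```

With this input the accumulator ends at 5 while the set holds 3 identities.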

The principal difference is in memory requirements: the counter is an
O(1) solution, while the set is O(n) in the number of distinct events.
The question to ask is whether you would like to bear this cost in hbase
or in some layer that pre-processes data before it hits hbase. Note that
hbase does not have efficient support for a "count(*)" operation in
general.

An approximate pre-processing solution can be as follows:
* Maintain a TTL-based cache of all identities.
* Do an atomic put-if-absent operation on this cache.
* Increment the counter if the put succeeds.
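
The three steps above could be sketched roughly as follows. This is an
in-process illustration only, assuming a single pre-processing node; the
class name DedupCounter, its methods, and the injectable clock are all
hypothetical, and in practice the cache would more likely be an external
store with native TTL and put-if-absent support.

```python
import threading
import time


class DedupCounter:
    """Counts an event only if its identity is not already in the TTL cache."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so the sketch can be tested
        self._expiry = {}          # identity -> time at which it expires
        self._lock = threading.Lock()
        self.count = 0

    def _evict_expired(self, now):
        # Linear scan for simplicity; a real cache would evict incrementally.
        expired = [k for k, t in self._expiry.items() if t <= now]
        for k in expired:
            del self._expiry[k]

    def record(self, event_id):
        """Return True if the event was counted, False if seen as a duplicate."""
        now = self._clock()
        with self._lock:           # put-if-absent and increment happen atomically
            self._evict_expired(now)
            if event_id in self._expiry:
                return False       # put fails: identity is still cached
            self._expiry[event_id] = now + self._ttl
            self.count += 1        # put succeeded: count the event
            return True
```

A duplicate that arrives after its identity has expired from the cache is
counted again, which is why the result is only approximate.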


This technique will have bounded memory usage and a reasonable level of
accuracy if the following assumptions hold:

* The arrival rate of events has a predictable & reasonable upper bound
* The duplicate events arrive at roughly the same time
