Thanks for the thoughtful response, Arvind. Comments inline:
On Sun, Sep 4, 2011 at 12:17 AM, Arvind Jayaprakash <[email protected]> wrote:

> On Sep 02, sagar naik wrote:
> >We are counting events for our application.
> >Sometimes, the same event arrives multiple times.
> >This leads to counting of the same event multiple times.
> >Is there a way I can avoid this ?
> >(Say timestamp on value or filters ?)
>
> A basic data model question is whether you want a set that supports a
> "count" operation or an accumulator.
>
> An accumulator (counter) is a plain integer where you just get to bump
> up counts. What contributes to the counts is outside the purview of
> this scheme.
>
> A set, on the other hand, is used for accumulating items that have an
> identity. When you say "same event, multiple times", it sounds like the
> events have some identity and that you wish to de-duplicate the
> counting based on this identity. The *only* way of implementing this
> accurately is to keep adding these identities to a set. When you need a
> count of how many events have occurred, the cardinality of the set
> holds the answer.
>
> The principal difference is an O(1) solution vs. an O(n) solution with
> respect to memory requirements. The question to ask is whether you
> would like to bear this cost in HBase or in some layer that
> pre-processes data before it hits HBase. Note that HBase does not have
> efficient support for a "count(*)" operation in general.
>
> An approximate pre-processing solution can be as follows:
> * Maintain a TTL-based cache of all identities.
> * Do an atomic put-if-absent operation on this cache.
> * Increment the counter only if the put succeeds.

For this solution, I would have to maintain synchronization between the
cache and the counter increments (I don't see how to implement that in a
distributed environment).

On the other hand, I could dedup based on the timestamp of the event.
Can I increment the counter value and assign the version as the
timestamp of this event? Is it possible to implement this via a filter?
> This technique will give bounded memory usage and a reasonable level
> of accuracy if the following assumptions hold good:
>
> * The arrival rate of events has a predictable and reasonable upper bound.
> * The duplicate events arrive at roughly the same time.

Thanks again,
-Sagar
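For readers following the thread: a minimal single-JVM sketch of Arvind's approximate scheme (TTL cache of identities, atomic put-if-absent, increment only on a winning put). The class and method names here are hypothetical, and the `ConcurrentHashMap` stands in for what would be a shared cache (e.g. memcached) in a distributed setup. Note that within one process the synchronization Sagar asks about falls out naturally: only the thread whose put-if-absent succeeds performs the increment, so no extra lock is needed.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Sketch only: an in-process stand-in for the TTL-cache + counter scheme.
// In production the map would be a distributed cache with an atomic
// put-if-absent operation, and the counter an HBase increment.
public class DedupCounter {
    private final ConcurrentHashMap<String, Long> seen = new ConcurrentHashMap<>();
    private final AtomicLong count = new AtomicLong();
    private final long ttlMillis;

    public DedupCounter(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    /** Returns true if the event was counted (first arrival within the TTL). */
    public boolean record(String eventId) {
        long now = System.currentTimeMillis();
        // Evict an expired entry so a late re-arrival counts again
        // (this is what makes the scheme approximate, per the thread).
        seen.computeIfPresent(eventId, (id, ts) -> now - ts > ttlMillis ? null : ts);
        // Atomic put-if-absent: only the caller that wins the race increments,
        // so the cache and the counter need no additional synchronization.
        if (seen.putIfAbsent(eventId, now) == null) {
            count.incrementAndGet();
            return true;
        }
        return false;
    }

    public long count() {
        return count.get();
    }

    public static void main(String[] args) {
        DedupCounter c = new DedupCounter(60_000);
        c.record("evt-1");
        c.record("evt-1"); // duplicate within the TTL, not counted
        c.record("evt-2");
        System.out.println(c.count()); // prints 2
    }
}
```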
