I should add that getting an exact count at open time would be expensive and probably not necessary.
On Wednesday, May 30, 2012, Andrew Purtell wrote:

> A common question about HBase is whether statistics on row index cardinality
> are maintained.
>
> The short answer is no, because in some sense each HBase table region is
> its own database, and each region is partly in memory and partly (log
> structured) on disk, including perhaps tombstones, so discovering the count
> of all unique keys in the full table requires the client to iterate over all
> rows in all regions. Only then might all live row keys be found.
>
> However, as others have mentioned, the coprocessor framework can help
> someone implement fast counting. When a region is first opened all data is
> in HFiles, and each HFile knows the number of keys within it (though not
> unique keys at the moment). So a coprocessor could add new metadata (a
> unique row key count) to HFiles when writing them, at flush and compaction
> times, then load and sum those counts at region open time, and then
> maintain a probabilistic count at runtime using the available blooms as new
> entries are stored into the Memstore*. The exact count would be available
> again upon the next open.
>
> *- Though offhand I'm not sure what to do about deletes.
>
> If someone does end up implementing something like this, please consider
> contributing it back, because it comes up in discussion fairly often.
>
> - Andy
>
> On Wednesday, May 30, 2012, Ramkrishna.S.Vasudevan wrote:
>
>> To answer this question:
>> Alternatively, is there a way to trigger an increment in another table
>> (say "count") whenever a row was added to "user"?
>>
>> You can try to use coprocessors here. Once a put is done to the table
>> 'user', the coprocessor hooks can trigger an Increment() operation on
>> the table 'count'.
>> This can be done in one call from the client. The increment() operation
>> also guarantees atomicity.
>>
>> Hope this helps.
>>
>> Regards
>> Ram
>>
>>
>> > -----Original Message-----
>> > From: David Koch [mailto:[email protected]]
>> > Sent: Wednesday, May 30, 2012 12:47 PM
>> > To: [email protected]
>> > Subject: Distinct counters and counting rows
>> >
>> > Hello,
>> >
>> > I am testing HBase for distinct counters - more concretely, counting
>> > unique users from a fairly large stream of user_ids. For some time to
>> > come the volume will be limited enough to use exact counting rather
>> > than approximation, but it is already too big to hold the entire set of
>> > user_ids in memory.
>> >
>> > For now I am basically inserting all elements from the stream into a
>> > "user" table which has row key "user_id" so as to enforce the unique
>> > constraint.
>> >
>> > My questions:
>> > a) Is there a way to get a quick (i.e. with a small delay in a user
>> > interface) count of the size of the user table to return the number of
>> > users? Alternatively, is there a way to trigger an increment in
>> > another table (say "count") whenever a row was added to "user"? I
>> > guess this can be picked up eventually by the client application, but I
>> > don't want this to delay the actual stream processing.
>> > b) I heard about Bloom filters in HBase but failed to understand
>> > whether they are used for row keys as well. Are they? How do I
>> > activate them? I was looking to reduce the workload of checking set
>> > membership for every user_id in the stream. If this is done by HBase
>> > internally, even better.
>> > c) Eventually, I want to store distinct users by day and then do
>> > unions on different days to get the total number of unique users for a
>> > multi-day period. Is this likely to involve a MapReduce job, or is
>> > there a more "light-weight" approach?
>> >
>> > Thank you,
>> >
>> > /David


--
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
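To make Andy's region-open idea concrete, here is a rough, untested sketch against the 0.92/0.94-era RegionObserver API. All it does is sum the entry counts already recorded in each store file's trailer when a region opens; those are total KeyValue counts rather than unique row keys, so the per-HFile unique-row metadata Andy describes would still need to be written at flush and compaction time. The class name and field are illustrative only.

import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.Store;
import org.apache.hadoop.hbase.regionserver.StoreFile;

// Hypothetical observer: when a region opens, sum the entry counts recorded
// in each store file. These are total KeyValue counts, not unique row keys,
// so this only approximates what Andy proposes.
public class KeyCountObserver extends BaseRegionObserver {

  private volatile long approximateKeyCount = 0;

  @Override
  public void postOpen(ObserverContext<RegionCoprocessorEnvironment> ctx) {
    long total = 0;
    for (Store store : ctx.getEnvironment().getRegion().getStores().values()) {
      for (StoreFile sf : store.getStorefiles()) {
        StoreFile.Reader r = sf.getReader();
        if (r != null) {
          // Each HFile records how many KeyValues it holds.
          total += r.getEntries();
        }
      }
    }
    approximateKeyCount = total;
  }
}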
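Ram's suggestion can be sketched the same way: a RegionObserver attached to the "user" table whose postPut hook issues an increment against a separate "count" table. The table, family, and qualifier names below are made up for illustration, and the hook signature matches the 0.92/0.94 API. Note that this counts every Put, so repeated inserts of the same user_id would be over-counted unless the hook first checks whether the row already existed.

import java.io.IOException;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical observer attached to the "user" table: every successful Put
// bumps a single counter cell in a separate "count" table.
public class UserCountObserver extends BaseRegionObserver {

  private static final byte[] COUNT_TABLE = Bytes.toBytes("count");
  private static final byte[] ROW = Bytes.toBytes("users");
  private static final byte[] FAMILY = Bytes.toBytes("f");
  private static final byte[] QUALIFIER = Bytes.toBytes("total");

  @Override
  public void postPut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                      Put put, WALEdit edit, boolean writeToWAL) throws IOException {
    HTableInterface counts = ctx.getEnvironment().getTable(COUNT_TABLE);
    try {
      // Atomic server-side increment, as Ram describes.
      counts.incrementColumnValue(ROW, FAMILY, QUALIFIER, 1L);
    } finally {
      counts.close();
    }
  }
}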
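Regarding David's question (b) and the "available blooms" Andy mentions: Bloom filters in HBase can be keyed on the row key, and in the 0.92/0.94 API they are enabled per column family, roughly as below. The table and family names are just the ones used in this thread; blooms let reads skip HFiles that cannot contain a given row, but they do not by themselves provide a distinct count.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.regionserver.StoreFile;

// Illustrative only: create the "user" table with a row-key Bloom filter
// on family "f".
public class CreateUserTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor user = new HTableDescriptor("user");
    HColumnDescriptor fam = new HColumnDescriptor("f");
    // ROW blooms are keyed on the row key; ROWCOL also includes the qualifier.
    fam.setBloomFilterType(StoreFile.BloomType.ROW);
    user.addFamily(fam);

    admin.createTable(user);
    admin.close();
  }
}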
