Hello, I am testing HBase for distinct counters: specifically, counting unique users from a fairly large stream of user_ids. For the foreseeable future the volume will be small enough to use exact counting rather than approximation, but it is already too big to hold the entire set of user_ids in memory.
For now I am basically inserting all elements from the stream into a "user" table which has "user_id" as its row key, so as to enforce the uniqueness constraint. My questions:

a) Is there a way to get a quick count (i.e. with a small delay in a user interface) of the size of the "user" table, to return the number of users? Alternatively, is there a way to trigger an increment in another table (say "count") whenever a row is added to "user"? I guess this can be picked up eventually by the client application, but I don't want this to delay the actual stream processing.

b) I heard about Bloom filters in HBase but failed to understand whether they are used for row keys as well. Are they? How do I activate them? I am looking to reduce the workload of checking set membership for every user_id in the stream. If this is done by HBase internally, even better.

c) Eventually, I want to store distinct users by day and then do unions over different days to get the total number of unique users for a multi-day period. Is this likely to involve a MapReduce job, or is there a more "light-weight" approach?

Thank you,
/David
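
P.S. In case it helps, here is a rough in-memory Python sketch of the semantics I am after for (a) and (c). It is not HBase code: the class and method names are made up for illustration, a plain set stands in for the "user" table, and an integer stands in for the "count" table.

```python
from collections import defaultdict


class DistinctUserCounter:
    """In-memory stand-in for the "user" and "count" tables (hypothetical names)."""

    def __init__(self):
        self.users_by_day = defaultdict(set)  # day -> user_ids seen on that day
        self.all_users = set()                # plays the role of the "user" table
        self.total = 0                        # plays the role of the "count" table

    def add(self, day, user_id):
        """Record one stream event; bump the counter only for a new row key."""
        self.users_by_day[day].add(user_id)
        if user_id not in self.all_users:
            self.all_users.add(user_id)
            self.total += 1

    def distinct_over(self, days):
        """Question (c): unique users across several days, via set union."""
        per_day = (self.users_by_day.get(d, set()) for d in days)
        return len(set().union(*per_day))


counter = DistinctUserCounter()
events = [("d1", "u1"), ("d1", "u2"), ("d2", "u2"), ("d2", "u3"), ("d1", "u1")]
for day, user_id in events:
    counter.add(day, user_id)
```

With these five events, counter.total is 3, counter.distinct_over(["d1"]) is 2, and counter.distinct_over(["d1", "d2"]) is 3. What I would like to know is how to get the same behaviour out of HBase without holding these sets in client memory.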
