I should add that getting an exact count at open time would be expensive and probably not necessary.
On Wednesday, May 30, 2012, Andrew Purtell wrote:

> A common question about HBase is whether statistics on row index cardinality
> are maintained.
>
> The short answer is no, because in some sense each HBase table region is
> its own database, and each region is partly in memory and partly (log
> structured) on disk, including perhaps tombstones, so discovering the count
> of all unique keys in the full table requires the client to iterate over all
> rows in all regions. Only then might all live row keys be found.
>
> However, as others have mentioned, the coprocessor framework can help
> someone implement fast counting. When a region is first opened all data is
> in HFiles, and each HFile knows the number of keys within it (though not
> unique keys at the moment). So a coprocessor could add new metadata (a
> unique row key count) to HFiles when writing them, at flush and compaction
> times, then load and sum those counts at region open time, and then
> maintain a probabilistic count at runtime using the available blooms as new
> entries are stored into the Memstore*. The exact count would be available
> again upon the next open.
>
> *- Though offhand I'm not sure what to do about deletes.
>
> If someone does end up implementing something like this, please consider
> contributing it back, because it comes up in discussion fairly often.
>
> - Andy
>
> On Wednesday, May 30, 2012, Ramkrishna.S.Vasudevan wrote:
>
>> To answer this question:
>> Alternatively, is there a way to trigger an increment in another table
>> (say "count") whenever a row was added to "user"?
>>
>> You can try to use coprocessors here. Once a put is done to the table
>> 'user', the coprocessor hooks can trigger an Increment() operation on
>> the table 'count'.
>> This can be done in one call from the client. The increment() operation
>> also guarantees atomicity.
>>
>> Hope this helps.
>>
>> Regards
>> Ram
>>
>>
>> > -----Original Message-----
>> > From: David Koch [mailto:[email protected]]
>> > Sent: Wednesday, May 30, 2012 12:47 PM
>> > To: [email protected]
>> > Subject: Distinct counters and counting rows
>> >
>> > Hello,
>> >
>> > I am testing HBase for distinct counters - more concretely, counting
>> > unique users from a fairly large stream of user_ids. For some time to
>> > come the volume will be limited enough to use exact counting rather
>> > than approximation, but it is already too big to hold the entire set of
>> > user_ids in memory.
>> >
>> > For now I am basically inserting all elements from the stream into a
>> > "user" table which has row key "user_id" so as to enforce the unique
>> > constraint.
>> >
>> > My questions:
>> > a) Is there a way to get a quick (i.e. with a small delay in a user
>> > interface) count of the size of the user table to return the number of
>> > users? Alternatively, is there a way to trigger an increment in
>> > another table (say "count") whenever a row was added to "user"? I
>> > guess this can be picked up eventually by the client application, but I
>> > don't want this to delay the actual stream processing.
>> > b) I heard about Bloom filters in HBase but failed to understand
>> > whether they are used for row keys as well. Are they? How do I
>> > activate them? I was looking to reduce the workload of checking set
>> > membership for every user_id in the stream. If this is done by HBase
>> > internally, even better.
>> > c) Eventually, I want to store distinct users by day and then do
>> > unions on different days to get the total number of unique users for a
>> > multi-day period. Is this likely to involve a MapReduce job, or is
>> > there a more "light-weight" approach?
>> >
>> > Thank you,
>> >
>> > /David


--
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
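To make Andy's region-open idea concrete, here is a rough, untested sketch against the 0.92/0.94-era RegionObserver API. All it does is sum the entry counts already recorded in each store file's trailer when a region opens; those are total KeyValue counts rather than unique row keys, so the per-HFile unique-row metadata Andy describes would still need to be written at flush and compaction time. The class name and field are illustrative only.

import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.Store;
import org.apache.hadoop.hbase.regionserver.StoreFile;

// Hypothetical observer: when a region opens, sum the entry counts recorded
// in each store file. These are total KeyValue counts, not unique row keys,
// so this only approximates what Andy proposes.
public class KeyCountObserver extends BaseRegionObserver {

  private volatile long approximateKeyCount = 0;

  @Override
  public void postOpen(ObserverContext<RegionCoprocessorEnvironment> ctx) {
    long total = 0;
    for (Store store : ctx.getEnvironment().getRegion().getStores().values()) {
      for (StoreFile sf : store.getStorefiles()) {
        StoreFile.Reader r = sf.getReader();
        if (r != null) {
          // Each HFile records how many KeyValues it holds.
          total += r.getEntries();
        }
      }
    }
    approximateKeyCount = total;
  }
}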
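Ram's suggestion can be sketched the same way: a RegionObserver attached to the "user" table whose postPut hook issues an increment against a separate "count" table. The table, family, and qualifier names below are made up for illustration, and the hook signature matches the 0.92/0.94 API. Note that this counts every Put, so repeated inserts of the same user_id would be over-counted unless the hook first checks whether the row already existed.

import java.io.IOException;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical observer attached to the "user" table: every successful Put
// bumps a single counter cell in a separate "count" table.
public class UserCountObserver extends BaseRegionObserver {

  private static final byte[] COUNT_TABLE = Bytes.toBytes("count");
  private static final byte[] ROW = Bytes.toBytes("users");
  private static final byte[] FAMILY = Bytes.toBytes("f");
  private static final byte[] QUALIFIER = Bytes.toBytes("total");

  @Override
  public void postPut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                      Put put, WALEdit edit, boolean writeToWAL) throws IOException {
    HTableInterface counts = ctx.getEnvironment().getTable(COUNT_TABLE);
    try {
      // Atomic server-side increment, as Ram describes.
      counts.incrementColumnValue(ROW, FAMILY, QUALIFIER, 1L);
    } finally {
      counts.close();
    }
  }
}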
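Regarding David's question (b) and the "available blooms" Andy mentions: Bloom filters in HBase can be keyed on the row key, and in the 0.92/0.94 API they are enabled per column family, roughly as below. The table and family names are just the ones used in this thread; blooms let reads skip HFiles that cannot contain a given row, but they do not by themselves provide a distinct count.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.regionserver.StoreFile;

// Illustrative only: create the "user" table with a row-key Bloom filter
// on family "f".
public class CreateUserTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor user = new HTableDescriptor("user");
    HColumnDescriptor fam = new HColumnDescriptor("f");
    // ROW blooms are keyed on the row key; ROWCOL also includes the qualifier.
    fam.setBloomFilterType(StoreFile.BloomType.ROW);
    user.addFamily(fam);

    admin.createTable(user);
    admin.close();
  }
}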
