"Each HFile knows how many KV entries there are in it, but this does not map in a general way to the number of rows, or the number of rows with a specific column."
It would be nice to have an index like that; it would solve a lot of issues for people migrating from MySQL. I assume that without a 'count' feature, people resort to storing dataset elements in other engines, which is not great, since you then end up requiring a non-HBase index to be consistent and authoritative for all of your datasets that need counts.

-Jack

On Fri, Jun 3, 2011 at 3:24 PM, Ryan Rawson <[email protected]> wrote:
> This is a commonly requested feature, and it remains unimplemented
> because it is actually quite hard. Each HFile knows how many KV
> entries there are in it, but this does not map in a general way to the
> number of rows, or the number of rows with a specific column. Keeping
> track of the row count as new rows are created is also not as easy as
> it seems - this is because a Put does not know if a row already exists
> or not. Making it aware of that fact would require doing a get before
> a put - not cheap.
>
> -ryan
>
> On Fri, Jun 3, 2011 at 3:20 PM, Jack Levin <[email protected]> wrote:
>> I have a feature request: There should be a native function called
>> 'count', that produces count of rows based on specific family filter,
>> that is internal to HBASE and won't be required to read CELLs off the
>> disk/cache. Just count up the rows in the most efficient way
>> possible. I realize that family definitions are part of the cells, so
>> it would be nice to have an index that somehow can produce low IO/CPU
>> hit to hbase when doing a count (for example enabling an index like
>> that in table schema would be how you turn it on for a specific
>> family).
>>
>> Best,
>>
>> -Jack
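
To make the two points in Ryan's reply concrete, here is a rough toy model in Python. It is not HBase code and uses no HBase API: `ToyStore`, `CountingStore`, and all of their methods are hypothetical names invented for this sketch, and a "file" here is just an in-memory list. It only illustrates that the number of KV entries differs from the number of rows, and that keeping a live row counter forces an existence check (a get) in front of every put.

```python
from collections import namedtuple

# Hypothetical stand-in for a stored cell; not the real HBase KeyValue class.
KeyValue = namedtuple("KeyValue", "row family qualifier timestamp value")


class ToyStore:
    """Toy table: a list of 'files', each an append-only list of KeyValues."""

    def __init__(self):
        self.files = [[]]
        self.reads = 0  # number of existence checks performed

    def put(self, row, family, qualifier, timestamp, value):
        # Blind write: append without checking whether the row already exists.
        self.files[-1].append(KeyValue(row, family, qualifier, timestamp, value))

    def flush(self):
        # Start a new 'file'; the same row may now span several files.
        self.files.append([])

    def kv_count(self):
        # Cheap: each file knows how many entries it holds.
        return sum(len(f) for f in self.files)

    def row_count(self, family=None):
        # Expensive: every entry must be visited to find the distinct rows.
        rows = set()
        for f in self.files:
            for kv in f:
                if family is None or kv.family == family:
                    rows.add(kv.row)
        return len(rows)

    def row_exists(self, row):
        self.reads += 1
        return any(kv.row == row for f in self.files for kv in f)


class CountingStore(ToyStore):
    """Keeps a live row counter, at the price of one read per put."""

    def __init__(self):
        super().__init__()
        self.counter = 0

    def put(self, row, family, qualifier, timestamp, value):
        # The 'get before a put': needed to know whether this row is new.
        if not self.row_exists(row):
            self.counter += 1
        super().put(row, family, qualifier, timestamp, value)


def load(store):
    # Two columns and two versions for r1, split across a flush, plus one r2 cell.
    store.put("r1", "a", "x", 1, "v1")
    store.put("r1", "a", "y", 1, "v2")
    store.flush()
    store.put("r1", "a", "x", 2, "v3")
    store.put("r2", "b", "z", 1, "v4")
    return store
```

With this sample data, `load(ToyStore())` holds four KV entries but only two rows, and only one of those rows has a cell in family `a`. `load(CountingStore())` ends with a correct counter of 2, but it performed one existence check for each of its four puts, which is the cost Ryan describes.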
