All, I'm writing an OLAP cube database and I can implement the storage in one of two schemas, but I don't know if there are any unexpected performance trade-offs I'm not aware of.
Each row represents a unique cell in the cube, with about 5 columns for each row. The row key format is a set of attributes joined by a delimiter. To represent a cube cell that contains rolled-up data, I use fewer attributes in the row key. For example, if a, b, and c are attributes, a single new piece of information with all those attributes could affect rows "a-b-c", "a-b", and "a".

When I perform a scan, I append "$" and "~" onto the desired start and end keys (respectively) so I can be sure to get all the data in between. When I want to get the most specific data, I can scan from "a-b-c" to "a-b-c" and retrieve a single row.

But a problem occurs when rows with different sets of attributes are interleaved once they are sorted. For example, if "a1" and "a2" are possible values for attribute "a", the following rows are in a bad order for efficient scans:

    a1-b-c
    a1-b
    a1
    a2-b-c
    a2-b
    a2

When I want to retrieve data for all rows of "a-b-c", I have to scan from "a1-b-c" until after "a2-b-c", and skip the irrelevant rows.

The two possible solutions are to either put each data style in its own table (e.g. table "a-b-c" would contain "a1-b-c" and "a2-b-c", table "a-b" would contain "a1-b" and "a2-b"), or to prefix each row key with data that forces it into a useful order:

    abc~a1-b-c
    abc~a2-b-c
    ab~a1-b
    ab~a2-b
    a~a1
    a~a2

Placing each style of row into its own table seems to be the ideal solution, but I don't understand the internal HBase architecture well enough to know for sure. Will the overhead of having extra tables (if there is any) outweigh the overhead of adding the prefix to each row key? Is there a better way to structure my solution?

Thanks in advance!

--Tom
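
P.S. In case it makes the ordering problem easier to see, here is a small Python sketch. It does not touch HBase at all: it only imitates lexicographic row-key order with a sorted list, and the scan() helper and the "abc~" / "abc~~" scan bounds are made up for illustration. Python compares these ASCII strings the same way a byte-wise comparison would, which is the only assumption it relies on.

```python
from bisect import bisect_left, bisect_right

def scan(sorted_keys, start, stop):
    """Return every key k with start <= k <= stop from a sorted key list.

    Hypothetical stand-in for a range scan; not an HBase call.
    """
    lo = bisect_left(sorted_keys, start)
    hi = bisect_right(sorted_keys, stop)
    return sorted_keys[lo:hi]

# Layout 1: plain keys, all aggregation levels share one key space.
plain = sorted(["a1-b-c", "a1-b", "a1", "a2-b-c", "a2-b", "a2"])

# Layout 2: each key is prefixed with the attribute set it belongs to.
prefixed = sorted(["abc~a1-b-c", "abc~a2-b-c",
                   "ab~a1-b", "ab~a2-b",
                   "a~a1", "a~a2"])

# All "a-b-c" cells in layout 1: the range also covers rolled-up rows,
# which then have to be filtered out on the client side.
plain_hits = scan(plain, "a1-b-c", "a2-b-c")
wanted = [k for k in plain_hits if k.count("-") == 2]

# All "a-b-c" cells in layout 2: the range is contiguous, nothing to skip.
prefixed_hits = scan(prefixed, "abc~", "abc~~")

print(plain_hits)     # rows touched in layout 1
print(wanted)         # rows actually needed
print(prefixed_hits)  # rows touched in layout 2
```

With only two values of "a" the waste is small, but the number of skipped rows in the first layout grows with every extra attribute value, while the prefixed layout (or one table per attribute set) only ever touches the rows it needs.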
