I need to index my main HBase table on some column values. The available indexing solutions like Lily are a little too heavyweight for my simple requirements, so I decided to roll my own.
Based on my reading, there seem to be two main options:

1) For every column value that needs to be indexed on the main table, add index table records whose rowkey has the following form:

<Optional prefix><column-name><column-value><main-table-rowkey>

The main-table rowkey is appended to the index rowkey to support non-unique indexes and also to avoid a get to check for existence before the put. The index is accessed by creating a scan whose startRow is set to <Optional prefix><column-name><column-value>, with a RowFilter using a BinaryPrefixComparator on the same rowkey prefix to stop the scan. For every record returned by the scan, extract the original table rowkey and do a get. I have glossed over some details, like ensuring that <Optional prefix><column-name> is of a fixed size when the table supports indexes for multiple columns.

2) Use a wide-table approach where the index record rowkey has the form:

<Optional prefix><column-name><column-value>

and each main-table rowkey is added as a column, e.g. "col-family:<main-table-rowkey>". The index is accessed through a simple get with the index rowkey <Optional prefix><column-name><column-value>.

My question: is one of these approaches preferable to the other from a performance perspective? Will a get significantly outperform a scan with a startRow and a BinaryPrefixComparator RowFilter, or are the two forms equivalent?

Thanks,
- Ashwin
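
P.S. To make option 1 concrete, here is a rough sketch of how the index rowkeys and the scan bounds might be built. It uses only the JDK and does not touch the HBase client. The class name IndexRowKeys, the NAME_WIDTH constant and the method names are made up for illustration, and zero-padding the column name is just one assumed way to get a fixed-size <column-name>. The computed stop row is an alternative way to bound the scan, not the RowFilter described above: it is the smallest key that sorts after every key sharing the prefix, so it could be used as the scan's exclusive stop row.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for the option 1 rowkey layout; not part of HBase.
final class IndexRowKeys {
    // Assumed fixed width for <column-name>; shorter names are zero-padded.
    static final int NAME_WIDTH = 16;

    // Builds <column-name><column-value>, i.e. the scan's startRow.
    // Assumes column values are fixed width (or otherwise delimited);
    // with variable-length values, a scan for "ab" would also match "abc".
    static byte[] scanPrefix(String columnName, byte[] columnValue) {
        byte[] name = columnName.getBytes(StandardCharsets.UTF_8);
        if (name.length > NAME_WIDTH) {
            throw new IllegalArgumentException("column name too long");
        }
        byte[] out = new byte[NAME_WIDTH + columnValue.length]; // zero-filled
        System.arraycopy(name, 0, out, 0, name.length);
        System.arraycopy(columnValue, 0, out, NAME_WIDTH, columnValue.length);
        return out;
    }

    // Builds the full index rowkey: the scan prefix followed by the main-table rowkey.
    static byte[] indexRowKey(String columnName, byte[] columnValue, byte[] mainRowKey) {
        byte[] prefix = scanPrefix(columnName, columnValue);
        byte[] out = Arrays.copyOf(prefix, prefix.length + mainRowKey.length);
        System.arraycopy(mainRowKey, 0, out, prefix.length, mainRowKey.length);
        return out;
    }

    // Returns the smallest key greater than every key that starts with prefix:
    // drop trailing 0xFF bytes, then increment the last remaining byte.
    // An empty array means "no upper bound" (the prefix was all 0xFF bytes).
    static byte[] stopRowForPrefix(byte[] prefix) {
        for (int i = prefix.length - 1; i >= 0; i--) {
            if (prefix[i] != (byte) 0xFF) {
                byte[] stop = Arrays.copyOf(prefix, i + 1);
                stop[i]++;
                return stop;
            }
        }
        return new byte[0];
    }

    // Recovers the main-table rowkey from an index rowkey returned by the scan.
    static byte[] mainRowKeyOf(byte[] indexRowKey, int prefixLength) {
        return Arrays.copyOfRange(indexRowKey, prefixLength, indexRowKey.length);
    }
}
```

With this layout, the scan would run from scanPrefix(...) up to stopRowForPrefix(scanPrefix(...)), and mainRowKeyOf(...) would give the key for the follow-up get on the main table.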
