I need to index my main HBase table on some column values. The available 
indexing solutions like Lily are a little too heavyweight for my simple 
requirements, so I decided to roll my own.

Based on my reading, there seem to be two main options:

1) For every column value that needs to be indexed on the main table, add index 
table records where the rowkey is of the following form:
<Optional prefix><column-name><column-value><main-table-rowkey>

The main-table rowkey is appended to the index rowkey to support non-unique 
indexes and also to avoid a get to check for existence before the put.
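
To make the layout in option 1 concrete, here is a minimal sketch of how such an index rowkey could be composed. It uses only the Java standard library; the class name IndexKeySketch, the helper names, and the sample values ("i", "city", "paris", "row-001") are hypothetical and not part of any HBase API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: composes <prefix><column-name><column-value><main-table-rowkey>.
final class IndexKeySketch {

    // Concatenates the given byte arrays in order.
    static byte[] concat(byte[]... parts) {
        int len = 0;
        for (byte[] p : parts) {
            len += p.length;
        }
        byte[] out = new byte[len];
        int pos = 0;
        for (byte[] p : parts) {
            System.arraycopy(p, 0, out, pos, p.length);
            pos += p.length;
        }
        return out;
    }

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Assumption: all parts are plain UTF-8 strings joined without separators.
    static byte[] indexKey(String prefix, String column, String value, String mainRowkey) {
        return concat(utf8(prefix), utf8(column), utf8(value), utf8(mainRowkey));
    }

    public static void main(String[] args) {
        byte[] k1 = indexKey("i", "city", "paris", "row-001");
        byte[] k2 = indexKey("i", "city", "paris", "row-002");
        // Same column value, different main-table rows: the index keys differ,
        // so both entries can coexist (non-unique index).
        System.out.println(Arrays.equals(k1, k2));
        System.out.println(new String(k1, StandardCharsets.UTF_8));
    }
}
```

Because the full key is determined by the main-table row and the indexed value, writing the same index entry twice produces the same rowkey, which is why no existence check is needed before the put.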

The index is accessed by creating a scan where the startRow is initialized to 
<Optional prefix><column-name><column-value> and setting a 
BinaryPrefixComparator RowFilter for the same rowkey prefix to stop the scan. 
For every record returned by the scan, extract the original table rowkey and do a 
get on the main table.
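
The prefix lookup described above can be sketched without an HBase cluster by modelling the index table as a sorted map ordered by unsigned byte comparison, which is how HBase orders rowkeys. This is only an illustration under that assumption: PrefixScanSketch, stopRowFor and lookup are hypothetical names, and the exclusive upper bound computed here is one possible way to delimit the range, not necessarily what the filter-based scan does internally.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical in-memory model of a prefix lookup on the index table.
final class PrefixScanSketch {

    // Lexicographic comparison of byte arrays, treating bytes as unsigned.
    static final Comparator<byte[]> UNSIGNED_LEX = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    };

    // Exclusive upper bound for all keys starting with prefix: drop trailing
    // 0xFF bytes, then increment the last remaining byte. Returns null when
    // no bound exists (empty prefix or all bytes 0xFF).
    static byte[] stopRowFor(byte[] prefix) {
        int i = prefix.length - 1;
        while (i >= 0 && prefix[i] == (byte) 0xff) {
            i--;
        }
        if (i < 0) {
            return null;
        }
        byte[] stop = Arrays.copyOf(prefix, i + 1);
        stop[i]++;
        return stop;
    }

    // Returns the main-table rowkeys stored after the prefix in each matching index key.
    static List<String> lookup(NavigableMap<byte[], byte[]> index, byte[] prefix) {
        byte[] stop = stopRowFor(prefix);
        NavigableMap<byte[], byte[]> range = (stop == null)
                ? index.tailMap(prefix, true)
                : index.subMap(prefix, true, stop, false);
        List<String> mainRowkeys = new ArrayList<>();
        for (byte[] key : range.keySet()) {
            mainRowkeys.add(new String(key, prefix.length,
                    key.length - prefix.length, StandardCharsets.UTF_8));
        }
        return mainRowkeys;
    }

    public static void main(String[] args) {
        NavigableMap<byte[], byte[]> index = new TreeMap<>(UNSIGNED_LEX);
        for (String k : new String[] {"icityparisrow-001", "icityparisrow-002", "icityromerow-003"}) {
            index.put(k.getBytes(StandardCharsets.UTF_8), new byte[0]);
        }
        System.out.println(lookup(index, "icityparis".getBytes(StandardCharsets.UTF_8)));
    }
}
```

Each returned rowkey would then be used for a get on the main table. Note that a plain prefix match also picks up longer values that merely begin with the searched value, so the value boundary needs the same care as the column-name boundary.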

I have glossed over some details like ensuring that <Optional 
prefix><column-name> is of a fixed size when the table supports indexes for 
multiple columns.
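
One possible way to handle the fixed-size requirement is to pad the column-name field with zero bytes to a constant width. The sketch below assumes a width of 16 bytes and the names FixedWidthSketch and fixedWidth, all of which are hypothetical choices for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: pads a field to a fixed number of bytes.
final class FixedWidthSketch {

    // Assumed width of the <column-name> field; pick whatever fits the schema.
    static final int COLUMN_FIELD_WIDTH = 16;

    // Encodes s as UTF-8 and right-pads it with zero bytes up to width.
    // Rejects values that do not fit rather than silently truncating them.
    static byte[] fixedWidth(String s, int width) {
        byte[] raw = s.getBytes(StandardCharsets.UTF_8);
        if (raw.length > width) {
            throw new IllegalArgumentException(
                    "field of " + raw.length + " bytes exceeds width " + width);
        }
        return Arrays.copyOf(raw, width); // copyOf pads with zero bytes
    }

    public static void main(String[] args) {
        System.out.println(fixedWidth("city", COLUMN_FIELD_WIDTH).length);
        System.out.println(fixedWidth("postal_code", COLUMN_FIELD_WIDTH).length);
    }
}
```

With a fixed-width field, a column named "city" with value "paris" can no longer produce the same leading bytes as a column named "cityp" with value "aris", which plain concatenation would allow.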

2) Use a wide table approach where the index record rowkey is of the form 
<Optional prefix><column-name><column-value> and each main-table-rowkey is added 
as a column, e.g. "col-family:<main-table-rowkey>"

The index is accessed through a simple get with the index rowkey <Optional 
prefix><column-name><column-value>.
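
The wide-table layout in option 2 can be modelled in memory as a map from index rowkey to a sorted set of column names, one per main-table rowkey. This is only a sketch of the data shape, using strings instead of bytes for brevity; WideIndexSketch and its methods are hypothetical names and not HBase calls.

```java
import java.util.Map;
import java.util.NavigableSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical in-memory model of the wide index row.
final class WideIndexSketch {

    // index rowkey -> column names, where each column name is a main-table rowkey
    private final Map<String, NavigableSet<String>> rows = new TreeMap<>();

    // Adds one column to the index row; repeating the same put changes nothing.
    void put(String indexRowkey, String mainRowkey) {
        rows.computeIfAbsent(indexRowkey, k -> new TreeSet<>()).add(mainRowkey);
    }

    // A single lookup by index rowkey returns every main-table rowkey for that value.
    NavigableSet<String> get(String indexRowkey) {
        return new TreeSet<>(rows.getOrDefault(indexRowkey, new TreeSet<>()));
    }

    public static void main(String[] args) {
        WideIndexSketch index = new WideIndexSketch();
        index.put("icityparis", "row-001");
        index.put("icityparis", "row-002");
        index.put("icityrome", "row-003");
        System.out.println(index.get("icityparis"));
    }
}
```

In this shape the number of columns in one index row grows with the number of main-table rows sharing the value, so a very common value yields a very wide row.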

My question is, is one of these approaches preferable to the other from a 
performance perspective? Will a get significantly outperform a scan with a 
startRow and a BinaryPrefixComparator RowFilter or are the two forms equivalent?

Thanks,
 - Ashwin
