Hi all,
I have a question regarding the difference between storing a set of data as:
*a) n columns with 1 version each*
*b) 1 column with n versions*

Since the storage unit in HBase is a cell (identified by row key, column
family, column qualifier, and timestamp), is there a difference between these
two storage options in terms of read/write performance, compaction/GC time, etc.?

I know it is not recommended to use a high number of versions if you do not
really need them. However, if all n versions of the data are really needed
for reading, will it cause any problems to store the data in a single
column with n versions? Also, even if max versions is set to 1 for a column
(option a), each new value is still written as a new cell, and the old cell
is only removed at compaction time. So it seems to me that, compaction-wise,
the two options are identical.
I wonder if there is anything that makes one option superior to the other.

*Example*: To clarify, say the data to be stored is the set of URLs
visited in each hour, and we want to keep the URL sets for the last 100
hours:

*a) use one column per hour, each holding one URL set. Column names
are reused cyclically, so the data for hour 101 is written into
column 1.*
column_qualifier: value
---------------------------
urls_hour1: <abc.com, xyz.com, ...>
urls_hour2: <urls>
urls_hour3: <urls>
...
urls_hour100: <urls>
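
To make the cyclic naming in option (a) concrete, here is a minimal sketch in
plain Python (no HBase client involved). It assumes hours are counted from 1;
the function name and the constant are made up for illustration.

```python
NUM_SLOTS = 100  # number of hourly columns kept, as in the example


def qualifier_for_hour(hour: int, num_slots: int = NUM_SLOTS) -> str:
    """Map a 1-based hour counter to a cyclically reused column qualifier."""
    if hour < 1:
        raise ValueError("hour must be >= 1")
    # Shift to 0-based, wrap around, then shift back to 1-based slot numbers.
    slot = (hour - 1) % num_slots + 1
    return f"urls_hour{slot}"
```

With this mapping, hour 100 goes to urls_hour100 and hour 101 overwrites
urls_hour1, so the writer has to track the hour counter itself.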


*b) use a single column with 100 versions, one per hour. Max versions
for the column is set to 100, and HBase drops the older versions
automatically at compaction.*
column_qualifier: value @ timestamp
---------------------------
urls: <abc.com, xyz.com, ...> @ ts_hour1, <urls> @ ts_hour2, <urls> @
ts_hour3, .... , <urls> @ ts_hour100
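
The retention behaviour I have in mind for option (b) can be modelled roughly
as below. This is only a toy in-memory model in plain Python, not HBase code:
the names are hypothetical, and it prunes eagerly on every write, whereas (as
far as I understand) HBase only physically removes excess versions at
compaction and hides them at read time before that.

```python
MAX_VERSIONS = 100  # max versions configured for the column in the example


def put_version(cells: dict, timestamp: int, value, max_versions: int = MAX_VERSIONS) -> dict:
    """Store value under timestamp, then keep only the newest max_versions entries."""
    if max_versions < 1:
        raise ValueError("max_versions must be >= 1")
    cells[timestamp] = value
    # Timestamps sorted ascending; everything before the last max_versions is too old.
    for old_ts in sorted(cells)[:-max_versions]:
        del cells[old_ts]
    return cells


def read_versions(cells: dict, limit: int = MAX_VERSIONS) -> list:
    """Return up to limit (timestamp, value) pairs, newest first."""
    return sorted(cells.items(), reverse=True)[:limit]
```

In this model the writer only supplies a timestamp and the store handles
expiry, which is the main convenience of (b) compared with the manual slot
bookkeeping in (a).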

Thanks,
-Serkan
