Re: HFile V2 vs HFile V3

abhishek1015 Sat, 14 Jun 2014 20:22:26 -0700

Dremel is designed to store nested structures of arbitrary depth. It uses
repetition and definition levels to reconstruct the nested structure.
However, a Bigtable-like system such as HBase or Cassandra is a
multi-dimensional sorted map, which maps (rowkey, column-family, columnkey,
time-stamp) to a value. Therefore, neither repetition nor definition levels
are required to reconstruct a row. This could be one reason that Cassandra is
using a Dremel-inspired format rather than implementing Dremel itself.
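
To make the point concrete, here is a minimal sketch of that sorted map in
Python. It is only an illustration of the data model, not HBase or Cassandra
code: the cell contents and the helper names (sort_key, reconstruct_rows) are
made up. Because every cell carries its full coordinates, a row can be rebuilt
by grouping the sorted keys on the rowkey, with no repetition or definition
levels involved.

```python
from itertools import groupby

# Hypothetical cells: (rowkey, column-family, columnkey, time-stamp) -> value.
cells = {
    ("row1", "cf", "name", 2): b"alice",
    ("row1", "cf", "name", 1): b"al",
    ("row1", "cf", "order_17", 1): b"book",
    ("row2", "cf", "name", 1): b"bob",
}


def sort_key(coords):
    """Order by rowkey, family, columnkey, then newest time-stamp first."""
    row, family, column, ts = coords
    return (row, family, column, -ts)


def reconstruct_rows(cells):
    """Group the sorted cells by rowkey, keeping the newest version per column."""
    rows = {}
    ordered = sorted(cells, key=sort_key)
    for row, group in groupby(ordered, key=lambda coords: coords[0]):
        latest = {}
        for _, family, column, ts in group:
            # The first cell seen for a column is the newest one,
            # because of the sort order above.
            latest.setdefault(f"{family}:{column}", cells[(row, family, column, ts)])
        rows[row] = latest
    return rows


rows = reconstruct_rows(cells)
```

Here rows["row1"] holds the newest "cf:name" value together with "cf:order_17",
recovered purely from the key order.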


We can also visualize this sorted map as a table whose columns are "rowkey"
and "column-family:columnkey" and whose cell values are "time-stamp,value".
The HFile is designed on the assumption that an HBase table is very sparse.
This assumption holds in many cases where the columnkey is itself used to
store some information (e.g. an order_id). However, it does not hold for all
tables. In many cases, we use the columnkey as a traditional column name.

Therefore, it would be good to have two file formats. Based on the sparsity of
a table, the user could choose between the traditional HFile and a columnar
format. As a lot of companies are using HBase, I am wondering whether any
company would be interested in sharing an anonymized production trace so that
I can estimate the sparsity of their tables and validate my argument.
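
As a rough idea of what I would compute from such a trace, here is a small
Python sketch. It assumes the trace can be reduced to (rowkey, column) pairs,
one per stored cell, and it defines sparsity as the fraction of empty slots in
the rows x distinct-columns table view. The function name estimate_sparsity
and the sample data are made up for illustration.

```python
def estimate_sparsity(cell_coords):
    """Return the fraction of empty (row, column) slots in the table view.

    cell_coords: iterable of (rowkey, column) pairs, one per stored cell.
    Repeated pairs (e.g. multiple versions of a cell) are counted once.
    """
    filled = set(cell_coords)
    if not filled:
        return 0.0
    rows = {row for row, _ in filled}
    columns = {col for _, col in filled}
    return 1.0 - len(filled) / (len(rows) * len(columns))


# Made-up example: columnkeys carry data (order ids), so the table is sparse.
sparse_table = [("u1", "cf:order_1"), ("u2", "cf:order_2"), ("u3", "cf:order_3")]

# Made-up example: columnkeys are fixed column names, so the table is dense.
dense_table = [(u, c) for u in ("u1", "u2", "u3") for c in ("cf:name", "cf:email")]

sparse_score = estimate_sparsity(sparse_table)  # 1 - 3/9
dense_score = estimate_sparsity(dense_table)    # 1 - 6/6
```

A table scoring close to 1 would fit the traditional HFile assumption, while a
table scoring close to 0 would be a candidate for a columnar format. Where to
put the cut-off between the two is an open question.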

Thanks
Abhishek  





--
View this message in context: 
http://apache-hbase.679495.n3.nabble.com/HFile-V2-vs-HFile-V3-tp4060405p4060450.html
Sent from the HBase User mailing list archive at Nabble.com.
