Dremel is designed to store nested structures of arbitrary depth. It uses repetition and definition levels to reconstruct the nested structure. However, Bigtable-like systems such as HBase and Cassandra are multi-dimensional sorted maps, which map (rowkey, column-family, columnkey, time-stamp) to a value. Since every cell already carries its full coordinates, neither repetition nor definition levels are required to reconstruct a row. This could be a reason that Cassandra is using a Dremel-inspired format rather than implementing Dremel itself.
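To make the sorted-map point concrete, here is a minimal sketch in Python. It is not HBase or Cassandra code: the cell data and the names `cells`, `sorted_cells` and `reconstruct_row` are made up for illustration. It only shows that a row can be rebuilt from the cell keys alone, with no repetition or definition levels.

```python
from collections import defaultdict

# Hypothetical cells: (rowkey, column_family, columnkey, timestamp) -> value
cells = {
    ("row1", "cf", "name", 2): "alice",
    ("row1", "cf", "name", 1): "alicia",
    ("row2", "cf", "order_17", 5): "shipped",
}

def sorted_cells(cells):
    """Return cells ordered by rowkey, family, columnkey, then newest timestamp first."""
    return sorted(
        cells.items(),
        key=lambda kv: (kv[0][0], kv[0][1], kv[0][2], -kv[0][3]),
    )

def reconstruct_row(cells, rowkey):
    """Rebuild one row as {"family:columnkey": [(timestamp, value), ...]}."""
    row = defaultdict(list)
    for (rk, cf, ck, ts), value in sorted_cells(cells):
        if rk == rowkey:
            row[f"{cf}:{ck}"].append((ts, value))
    return dict(row)

print(reconstruct_row(cells, "row1"))
```

Each cell is self-describing, so grouping by rowkey is enough to get the row back.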
We can also visualize this sorted map as a table whose columns are "rowkey" and "column-family:columnkey" and whose cells hold "time-stamp,value" pairs. HFile is designed on the assumption that an HBase table is very sparse. This assumption holds in the many cases where the columnkey itself stores information (e.g. order_id). However, it does not hold for all tables. In many cases, we use the columnkey as a traditional column name, so most rows have a value for most columns. Therefore, it would be good to have two file formats. Based on the sparsity of a table, the user could choose between the traditional HFile and a columnar format.

As a lot of companies are using HBase, I am wondering whether any company would be interested in sharing an anonymized production trace, so that I can estimate the sparsity of their tables and validate my argument.

Thanks,
Abhishek
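P.S. A rough sketch of how sparsity could be estimated from such a trace, in Python. The definition used here (fraction of empty cells in the rowkey by "column-family:columnkey" grid) is my own assumption, and the function name `sparsity` and the sample data are hypothetical.

```python
def sparsity(rows):
    """Fraction of empty cells in the row x column grid.

    rows: dict mapping rowkey -> set of "family:columnkey" names present
    in that row (timestamps/versions are ignored in this sketch).
    """
    all_columns = set().union(*rows.values())
    total = len(rows) * len(all_columns)
    if total == 0:
        return 0.0
    filled = sum(len(cols) for cols in rows.values())
    return 1.0 - filled / total

# columnkey used as a traditional column name: every row has every column
dense = {"r1": {"cf:name", "cf:city"}, "r2": {"cf:name", "cf:city"}}

# columnkey used to store data (e.g. an order_id): each row has its own columns
sparse = {"r1": {"cf:order_1"}, "r2": {"cf:order_2"}, "r3": {"cf:order_3"}}

print(sparsity(dense), sparsity(sparse))
```

A value near 0 would suggest a columnar format could pay off, while a value near 1 matches the assumption behind HFile.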
