Hi all, I ran some performance tests on Hive+HBase. The table is an ordinary 2D table with about 160M rows (each row has 7 small columns) and 32 regions. There is only one column family, and every region was major-compacted down to a single store file before the test.
The cluster has 11 task trackers, each with 4 map slots and 1 reduce slot; the same servers also act as region servers. On this cluster, a simple Hive query,

    select count(*) from table where column3='Y';

takes ~1700 seconds to finish. After I used a CTAS statement to create an internal table (stored as a sequence file), the same query takes only 43 seconds. So Hive+HBase is about 40X slower than Hive+HDFS (1700 s vs. 43 s).

Hive+HBase does run fewer map tasks (32 vs. 223), but since only 44 map slots are available, I don't think that is the main cause.

I studied the source code of the HBase scan implementation. As far as I can tell, in my case the scan reads the HFile in much the same way a sequence file is read (sequentially, one key/value pair after another). So, in theory, the performance should be quite similar.

Can anyone explain the 40X slowdown?

Thanks,
Weihua
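
P.S. For anyone who wants to check my numbers, here is the back-of-the-envelope arithmetic as a small Python sketch. It treats the row count as exactly 160M, which is only approximate. The variable names are just for illustration. The "waves" figure is simply the number of map tasks divided by the available map slots, rounded up.

```python
import math

# Figures taken from the post above; the row count is approximate.
rows = 160_000_000
hbase_seconds = 1700       # Hive over the HBase table
seqfile_seconds = 43       # Hive over the sequence-file table
map_slots = 11 * 4         # 11 task trackers x 4 map slots each

# Overall slowdown and aggregate scan rates.
slowdown = hbase_seconds / seqfile_seconds
hbase_rows_per_sec = rows / hbase_seconds
seqfile_rows_per_sec = rows / seqfile_seconds

# How many "waves" of map tasks each job needs with 44 slots.
hbase_waves = math.ceil(32 / map_slots)
seqfile_waves = math.ceil(223 / map_slots)

print(f"slowdown: {slowdown:.1f}x")
print(f"HBase:    {hbase_rows_per_sec:,.0f} rows/s in {hbase_waves} wave(s)")
print(f"SeqFile:  {seqfile_rows_per_sec:,.0f} rows/s in {seqfile_waves} wave(s)")
```

So the HBase job fits into a single wave of map tasks while the sequence-file job needs several, which is why I don't think the task count explains the gap.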
