Re: Hive+HBase performance is much poorer than Hive+HDFS

Weihua JIANG Tue, 11 Oct 2011 20:05:13 -0700

Since I am running the query through Hive, I don't know how to set it.
Can you tell me how to do so?


Thanks
Weihua

2011/10/12 Jean-Daniel Cryans <[email protected]>:
> This is one big factor and you didn't mention configuring it:
> http://hbase.apache.org/book.html#perf.hbase.client.caching
>
> J-D
>
> On Tue, Oct 11, 2011 at 7:47 PM, Weihua JIANG <[email protected]> wrote:
>
>> Hi all,
>>
>> I have run some performance tests on Hive+HBase. The table is a normal 2D
>> table with about 160M rows (each row with 7 small columns) and 32
>> regions. There is only one column family, and all regions were
>> major compacted to one store file before the test.
>>
>> On a cluster with 11 task trackers (each with 4 map slots and 1 reduce
>> slot; these servers also act as region servers), a simple SQL query in Hive
>>   select count(*) from table where column3='Y';
>> needs ~1700 seconds to finish.
>>
>> But after using a CTAS statement to create an internal table (stored as
>> a sequence file), the same statement needs only 43 seconds to finish.
>>
>> So Hive+HBase is about 40X slower than Hive+HDFS.
>>
>> Although Hive+HBase has fewer map tasks (32 vs 223), there are
>> only 44 map slots available, so I don't think that is the main cause.
>>
>> I studied the source code of the HBase scan implementation. It seems to
>> me that, in my case, the scan reads the HFile in much the same way as a
>> sequence file is read (sequentially, one key/value pair at a time). So,
>> in theory, the performance should be quite similar.
>>
>> Can anyone explain the 40X slowdown?
>>
>> Thanks
>> Weihua
>>
>
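
To make the scanner caching point concrete, here is a rough sketch of how many scanner round trips each map task would need at different caching values. It uses the row and region counts from the original post and assumes rows are spread evenly across regions, one map task per region, and one RPC per batch of `caching` rows. The function names are hypothetical, and the caching value of 1 reflects the default that HBase releases of that period shipped with (an assumption about this cluster, which the thread does not confirm). For Hive, one approach that may work, assuming Hive forwards session properties into the job configuration seen by the HBase input format, is to run `set hbase.client.scanner.caching=1000;` before the query.

```python
import math

TOTAL_ROWS = 160_000_000  # "about 160M rows" from the original post
REGIONS = 32              # 32 regions, assumed to mean 32 map tasks


def scanner_rpcs(rows, caching):
    """Approximate round trips needed to scan `rows` rows when each
    scanner RPC returns up to `caching` rows."""
    return math.ceil(rows / caching)


def rpcs_per_task(total_rows, regions, caching):
    """Round trips for one map task, assuming rows are evenly spread
    across regions (an assumption, not stated in the thread)."""
    rows_per_region = math.ceil(total_rows / regions)
    return scanner_rpcs(rows_per_region, caching)


if __name__ == "__main__":
    for caching in (1, 100, 1000):
        print(f"caching={caching}: "
              f"{rpcs_per_task(TOTAL_ROWS, REGIONS, caching)} RPCs per map task")
```

Under these assumptions each map task drops from about five million round trips to about five thousand when caching goes from 1 to 1000. This is only a count of round trips, not a prediction of the resulting query time.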
