You need to set scanner caching; otherwise each call to next() will be a network round trip (RTT).
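A minimal sketch of the scanner caching advice above, assuming the setting is applied per Hive session before running the query. `hbase.client.scanner.caching` is the standard HBase client property for the number of rows fetched per scanner RPC. The value 1000 is only an illustrative assumption, not a tuned recommendation, and whether a session-level `set` reaches the HBase scan may depend on the Hive and HBase versions in use.

```sql
-- Assumption: 1000 rows per scanner RPC is an illustrative value only;
-- larger values cut round trips but use more memory per scanner.
set hbase.client.scanner.caching=1000;

-- Re-run the slow query from the original message to compare timings.
select count(*) from hbase_table;
```

If the session-level setting has no effect, the same property can be set in the hbase-site.xml that the Hive client reads.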
________________________________
From: Hao Ren <[email protected]>
To: [email protected]
Sent: Thursday, August 1, 2013 7:45 AM
Subject: Why HBase integration with Hive makes Hive slow

Hi,

I have a cluster (1 master + 3 slaves) on which Hive, HBase, and Hadoop are installed.

In order to run a daily row-level update routine, we need to integrate HBase with Hive, but the performance is not good.

E.g. there are 2 tables in Hive:

hbase_table: an HBase table created via Hive
hive_table: a native Hive table

Both hold the same data set.

When running:

    select count(*) from hbase_table;   ===> takes 500 s
    select count(*) from hive_table;    ===> takes 6 s

I have tried a lot of queries on the two tables, but hbase_table is always very slow.

To be clear, I created hbase_table as below:

    CREATE TABLE hbase_table (
      idvisite string,
      client_list Array<string>,
      nb_client int)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,clients:id_list,clients:nb")
    TBLPROPERTIES ("hbase.table.name" = "table_test");

My HBase is running in pseudo-distributed mode.

I guess that, at the beginning of a Hive query execution, Hive loads the data from HBase, and the SerDe takes a long time.

Could someone tell me how to improve this poor performance?
Is it caused by a wrongly configured integration?
Is a fully-distributed mode needed here?

Thank you in advance for your time.

Hao.

-- 
Hao Ren
ClaraVista
www.claravista.fr
