Hi Lars,
Thank you for your reply, and sorry for being unclear.
Actually, the HBase daemon is running only on the master, just one server. It
uses HDFS as its storage.
The input data is on EBS. It is written to HBase, which runs on top of HDFS
backed by EBS.
The only tuning I did is:
<property>
<name>hbase.client.scanner.caching</name>
<value>10000</value>
</property>
That makes count(*) fast.
When loading to HDFS directly, it finishes in less than 10 minutes.
In addition, when loading other data sets with a different schema, about
700 MB in size, into HBase, it takes only a few minutes.
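For scale, here is a rough back-of-the-envelope comparison of the two load times, using only figures quoted in this thread (the 1318012108-byte table size from `hadoop dfs -dus`, "about 8 hours" for the HBase load, and "less than 10 mins" for the direct HDFS load). It is only a sketch: the constant and function names are illustrative, and the durations are approximate, so the rates are ballpark values.

```python
# Figures quoted in this thread; the durations are approximate.
TABLE_BYTES = 1_318_012_108      # size of the 'test' table per `hadoop dfs -dus`
HBASE_LOAD_SECONDS = 8 * 3600    # "about 8 hours" through the HBase storage handler
HDFS_LOAD_SECONDS = 10 * 60      # "less than 10 mins", taken as an upper bound


def throughput_kib_per_s(num_bytes: int, seconds: float) -> float:
    """Average write rate in KiB/s for num_bytes written over seconds."""
    return num_bytes / seconds / 1024


hbase_rate = throughput_kib_per_s(TABLE_BYTES, HBASE_LOAD_SECONDS)
hdfs_rate = throughput_kib_per_s(TABLE_BYTES, HDFS_LOAD_SECONDS)

print(f"HBase load: {hbase_rate:.1f} KiB/s")
print(f"HDFS load:  {hdfs_rate:.1f} KiB/s (at least)")
print(f"Slowdown:   {hdfs_rate / hbase_rate:.0f}x or more")
```

Under these assumptions the HBase load averages well under 100 KiB/s, which is why the gap looks like a configuration or write-path problem rather than a limit of the storage itself.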
Thank you again.
Hao.
On 20/08/2013 01:51, lars hofhansl wrote:
Hi Hao,
how do you run HBase in pseudo distributed mode, yet with 3 slaves?
Where is the data written in EC2? EBS or local storage?
Did you do any other tuning at the HBase or HDFS level (server side)?
If your replication level is still set to 3 you're seeing somewhat of a worst
case scenario, where each node gets 100% of all writes, and the speed is always
dominated by your slowest machine.
How does Hive perform here when you write to HDFS directly?
Sorry, many questions :)
-- Lars
________________________________
From: Hao Ren <[email protected]>
To: [email protected]
Sent: Monday, August 19, 2013 1:50 AM
Subject: Re: Loading data from Hive to HBase takes too long
Update:
There are 1 master and 3 slaves in my cluster.
They are all m1.medium instances.
Instance Family: General purpose
Instance Type: m1.medium
Processor Arch: 32-bit or 64-bit
vCPU: 1
ECU: 2
Memory (GiB): 3.75
Instance Storage (GB): 1 x 410
EBS-optimized Available: -
Network Performance: Moderate
On 19/08/2013 10:44, Hao Ren wrote:
Update:
I messed up some queries. Here are the right ones:
CREATE TABLE hbase_table (
material_id int,
new_id_client int,
last_purchase_date int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" =
":key,cf1:idclt,cf1:dt_last_purchase")
TBLPROPERTIES("hbase.table.name" = "test");
insert OVERWRITE TABLE hbase_table
select * from test; -- takes a long time (about 8 hours)
# bin/hadoop dfs -dus /user/hive/warehouse/test
hdfs://ec2-54-234-17-36.compute-1.amazonaws.com:9010/user/hive/warehouse/test
1318012108
The table 'test' is only about 1.3 GB.
On 19/08/2013 10:40, Hao Ren wrote:
Hi,
I am running Hive and HBase on the same Amazon EC2 cluster, where
HBase is in pseudo-distributed mode.
After integrating HBase with Hive, I find that it takes a long time
to run an "insert overwrite" query from Hive in order to load
data into the related HBase table.
In fact, the data is only about 1.3 GB, so I don't think this is normal.
Maybe there is something wrong with my configuration.
Here are some queries:
CREATE TABLE hbase_table (
material_id int,
new_id_client int,
last_purchase_date int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" =
":key,cf1:idclt,cf1:dt_last_purchase")
TBLPROPERTIES("hbase.table.name" = "test");
insert OVERWRITE TABLE t_LIGNES_DERN_VENTES
select * from test; -- takes a long time (about 8 hours)
Here are some configuration files for my cluster:
# cat hive/conf/hive-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hbase.zookeeper.quorum</name>
<value>ip-10-159-41-177.ec2.internal</value>
</property>
<property>
<name>hive.aux.jars.path</name>
<value>/root/hive/build/dist/lib/hive-hbase-handler-0.9.0-amplab-4.jar,/root/hive/build/dist/lib/hbase-0.92.0.jar,/root/hive/build/dist/lib/zookeeper-3.4.3.jar,/root/hive/build/dist/lib/guava-r09.jar</value>
</property>
<property>
<name>hbase.client.scanner.caching</name>
<value>10000</value>
</property>
</configuration>
# cat hbase-0.92.0/conf/hbase-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://ec2-54-234-17-36.compute-1.amazonaws.com:9010/hbase</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>ip-10-159-41-177.ec2.internal</value>
</property>
<property>
<name>hbase.client.scanner.caching</name>
<value>10000</value>
</property>
</configuration>
Any help is highly appreciated!
Thank you.
Hao
--
Hao Ren
ClaraVista
www.claravista.fr