HBase-Hive integration performance issues

Hao Ren Tue, 27 Aug 2013 06:53:31 -0700

Hi,

I am running Hive and HBase on Amazon EC2. By following the tutorial at https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration , I managed to create an HBase table from Hive and insert data into it.

It works, but performance is poor. To be specific, inserting 1.3 GB (50 M rows, 3 columns) takes 30 minutes. That is far from what I expected, say 100 s.
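To put those numbers in perspective, here is a rough back-of-the-envelope calculation of the observed and hoped-for throughput. It is only a sketch: it assumes "1.3 GB" means 1.3 * 10^9 bytes and that rows are loaded at a steady rate; the row count and timings are the ones quoted above.

```python
# Back-of-the-envelope throughput for the load described above.
# Assumption: "1.3 GB" is read as 1.3 * 10**9 bytes.
# Row count and durations are taken from the figures quoted in this message.
TOTAL_BYTES = 1.3e9
TOTAL_ROWS = 50_000_000


def throughput(seconds):
    """Return (rows per second, megabytes per second) for a load taking `seconds`."""
    return TOTAL_ROWS / seconds, TOTAL_BYTES / seconds / 1e6


observed_seconds = 30 * 60   # what the insert currently takes
expected_seconds = 100       # what I was hoping for

observed = throughput(observed_seconds)
expected = throughput(expected_seconds)
speedup_needed = observed_seconds / expected_seconds
avg_row_bytes = TOTAL_BYTES / TOTAL_ROWS

print(f"observed: {observed[0]:,.0f} rows/s, {observed[1]:.2f} MB/s")
print(f"expected: {expected[0]:,.0f} rows/s, {expected[1]:.2f} MB/s")
print(f"speedup needed: {speedup_needed:.0f}x, average row size: {avg_row_bytes:.0f} bytes")
```

In other words, the current load runs at well under 1 MB/s, and reaching the 100 s target would need roughly an 18x improvement.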

My EC2 cluster contains 3 slaves and 1 master whose instance type is medium (http://aws.amazon.com/ec2/instance-types/#instance-type).

Hadoop 1.0.4 is installed on my cluster. HBase is in pseudo-distributed mode. A region server is running on the master. HDFS is used as storage.


Here are some configuration files:

*// hive-site.xml*

<configuration>

    <property>
        <name>hbase.zookeeper.quorum</name>
        <value>ip-10-178-13-39.ec2.internal</value>
    </property>

    <property>
        <name>hive.aux.jars.path</name>
        <value>/root/hive/build/dist/lib/hive-hbase-handler-0.9.0-amplab-4.jar,/root/hive/build/dist/lib/hbase-0.92.0.jar,/root/hive/build/dist/lib/zookeeper-3.4.3.jar,/root/hive/build/dist/lib/guava-r09.jar</value>
    </property>

    <property>
        <name>hbase.client.scanner.caching</name>
        <value>10000</value>
    </property>

</configuration>

*// hbase-site.xml*

<configuration>

    <property>
        <name>hbase.rootdir</name>
        <value>hdfs://ec2-54-226-206-28.compute-1.amazonaws.com:9010/hbase</value>
    </property>

    <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
    </property>

    <property>
        <name>hbase.zookeeper.quorum</name>
        <value>ip-10-178-13-39.ec2.internal</value>
    </property>

    <property>
        <name>hbase.client.scanner.caching</name>
        <value>10000</value>
    </property>

</configuration>

*For understanding, I have some questions:*

1) To improve read performance, I have set hbase.client.scanner.caching to 10000, but I don't know how to improve write performance. Is there some basic configuration I should apply?

2) Does the distributed mode matter? Does fully-distributed mode have better write performance than pseudo-distributed mode?

3) If the number of region servers is increased, will write performance improve?

4) In pseudo-distributed mode (one HBase daemon on the master), when writing data from Hive to an HBase table, is the master the only entry point to HBase? I don't think it is efficient for all the data to pass through the master. Is it possible to write data from Hive to HBase directly and in parallel using MapReduce?

5) Will HBase bulk loading help a lot?

I am new to HBase, but I really want to integrate HBase in production.

Any help is highly appreciated ! =)

Hao

--
Hao Ren
ClaraVista
www.claravista.fr
