Hi,

I am running Hive and HBase on Amazon EC2. By following the tutorial at https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration, I managed to create an HBase table from Hive and insert data into it.

It works, but performance is poor. Specifically, inserting 1.3 GB (50 M rows, 3 columns) takes 30 minutes, which is far from the roughly 100 s I expected.
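
To put numbers on that gap, here is a rough back-of-the-envelope calculation. It uses only the figures above and assumes 1 GB = 1024 MB; the variable names are just for illustration.

```python
# Rough throughput estimate for the Hive -> HBase insert described above.
size_mb = 1.3 * 1024      # 1.3 GB, assuming 1 GB = 1024 MB
rows = 50_000_000         # 50 M rows
observed_s = 30 * 60      # observed load time: 30 minutes
expected_s = 100          # load time I was hoping for

observed_rows_per_s = rows / observed_s
observed_mb_per_s = size_mb / observed_s
expected_rows_per_s = rows / expected_s
expected_mb_per_s = size_mb / expected_s
slowdown = observed_s / expected_s   # how many times slower than hoped

print(f"observed: {observed_rows_per_s:,.0f} rows/s, {observed_mb_per_s:.2f} MB/s")
print(f"expected: {expected_rows_per_s:,.0f} rows/s, {expected_mb_per_s:.2f} MB/s")
print(f"slowdown: {slowdown:.0f}x")
```

So the load currently runs at under 1 MB/s, about 18 times slower than what I was hoping for.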

My EC2 cluster contains 3 slaves and 1 master; the instance type is medium (http://aws.amazon.com/ec2/instance-types/#instance-type).

Hadoop 1.0.4 is installed on the cluster. HBase runs in pseudo-distributed mode, with a region server running on the master. HDFS is used as storage.

Here are some configuration files:

*// hive-site.xml*

<configuration>

    <property>
        <name>hbase.zookeeper.quorum</name>
        <value>ip-10-178-13-39.ec2.internal</value>
    </property>

    <property>
        <name>hive.aux.jars.path</name>
        <value>/root/hive/build/dist/lib/hive-hbase-handler-0.9.0-amplab-4.jar,/root/hive/build/dist/lib/hbase-0.92.0.jar,/root/hive/build/dist/lib/zookeeper-3.4.3.jar,/root/hive/build/dist/lib/guava-r09.jar</value>
    </property>

    <property>
        <name>hbase.client.scanner.caching</name>
        <value>10000</value>
    </property>

</configuration>

*// hbase-site.xml*

<configuration>

    <property>
        <name>hbase.rootdir</name>
        <value>hdfs://ec2-54-226-206-28.compute-1.amazonaws.com:9010/hbase</value>
    </property>

    <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
    </property>

    <property>
        <name>hbase.zookeeper.quorum</name>
        <value>ip-10-178-13-39.ec2.internal</value>
    </property>

    <property>
        <name>hbase.client.scanner.caching</name>
        <value>10000</value>
    </property>

</configuration>

*To understand this better, I have some questions:*
1) To improve read performance, I have set hbase.client.scanner.caching to 10000, but I don't know how to improve write performance. Are there any basic settings I should change?
2) Does the distribution mode matter? Does fully-distributed mode give better write performance than pseudo-distributed mode?
3) If the number of region servers is increased, will write performance improve?
4) In pseudo-distributed mode (one HBase daemon on the master), when writing data from Hive to an HBase table, is the master the only entry point to HBase? Passing all the data through the master does not seem efficient to me. Is it possible to write data from Hive to HBase directly and in parallel using MapReduce?
5) Will HBase bulk loading help a lot?

I am new to HBase, but I would really like to use it in production.

Any help is highly appreciated! =)

Hao

--
Hao Ren
ClaraVista
www.claravista.fr
