Hao,

A couple of thoughts here. This could be related to many things.

1. Did you pre-split your regions? If not, you could be hot-spotting on a single server and then waiting for the region to split. If that is the case, you could actually be using only a single server for much of your load (if not all of it, depending on the region size you have configured). While running, did you see one system take the full load (via top, Ganglia, or some other tool)?

2. The memory on each of these systems is quite low: 1.7 or 3.7 GB, depending on whether it is compute or memory. Either way, it is way too low, and I'd expect you to be doing a lot of swapping. You'll need 1 GB for each daemon, which leaves you very little room for the OS (at 3.7 GB). Do you see swapping? What are your JVM parameters?

3. Do these same 4 servers run your Hadoop infrastructure and the Hive query? If so, the system is woefully underpowered if you expect to see production-like speed.

Running a Hive query on top of an HBase cluster with so few resources will just not work out well in the end ;)

-Matt

On Tue, Aug 27, 2013 at 7:51 AM, Hao Ren <[email protected]> wrote:

> Hi,
>
> I am running Hive and HBase on Amazon EC2. By following the tutorial
> https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration,
> I managed to create an HBase table from Hive and insert data into it.
>
> It works, but with low performance. To be specific, inserting 1.3 GB (50
> M rows, 3 columns) takes 30 mins. It is far from what I expected, say 100 s.
>
> Actually, my EC2 cluster contains 3 slaves and 1 master whose instance
> type is medium (http://aws.amazon.com/ec2/instance-types/#instance-type).
>
> Hadoop 1.0.4 is installed on my cluster. HBase is in pseudo-distributed
> mode. A region server is running on the master. HDFS is used as storage.
>
> Here are some configuration files:
>
> *// hive-site.xml*
>
> <configuration>
>
>   <property>
>     <name>hbase.zookeeper.quorum</name>
>     <value>ip-10-178-13-39.ec2.internal</value>
>   </property>
>
>   <property>
>     <name>hive.aux.jars.path</name>
>     <value>/root/hive/build/dist/lib/hive-hbase-handler-0.9.0-amplab-4.jar,/root/hive/build/dist/lib/hbase-0.92.0.jar,/root/hive/build/dist/lib/zookeeper-3.4.3.jar,/root/hive/build/dist/lib/guava-r09.jar</value>
>   </property>
>
>   <property>
>     <name>hbase.client.scanner.caching</name>
>     <value>10000</value>
>   </property>
>
> </configuration>
>
> *// hbase-site.xml*
>
> <configuration>
>
>   <property>
>     <name>hbase.rootdir</name>
>     <value>hdfs://ec2-54-226-206-28.compute-1.amazonaws.com:9010/hbase</value>
>   </property>
>
>   <property>
>     <name>hbase.cluster.distributed</name>
>     <value>true</value>
>   </property>
>
>   <property>
>     <name>hbase.zookeeper.quorum</name>
>     <value>ip-10-178-13-39.ec2.internal</value>
>   </property>
>
>   <property>
>     <name>hbase.client.scanner.caching</name>
>     <value>10000</value>
>   </property>
>
> </configuration>
>
> *For understanding, I have some questions:*
>
> 1) In order to improve read performance, I have set
> hbase.client.scanner.caching to 10000. But I don't know how to improve
> write performance. Is there some basic config to set?
> 2) Does the distributed mode matter? Does fully-distributed mode have
> better write performance than pseudo-distributed mode?
> 3) If the number of region servers is increased, will the write performance
> be improved?
> 4) In pseudo-distributed mode (one HBase daemon on the master), when writing
> data from Hive to an HBase table, is the master the only entry to HBase? I
> don't think passing all data through the master is efficient. I wonder
> whether it is possible to write data in parallel from Hive to HBase directly
> using MapReduce?
> 5) Will the HBase bulk loading help a lot?
>
> I am new to HBase, but I really want to integrate HBase in production.
>
> Any help is highly appreciated! =)
>
> Hao
>
> --
> Hao Ren
> ClaraVista
> www.claravista.fr
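
To put rough numbers on the memory point (item 2 of the reply), here is a minimal sketch of the arithmetic. It uses the 1 GB-per-daemon rule of thumb and the 1.7 / 3.7 GB instance sizes mentioned in the reply. The list of daemons on the master is an assumption based on the described setup (Hadoop 1.0.4, HBase with a region server on the master, ZooKeeper quorum on one host), not something confirmed in the thread, and the helper name `headroom_gb` is purely illustrative.

```python
# Rough memory budget for one node, using the 1 GB-per-daemon rule of thumb.
# ASSUMPTION: the daemon list below is a guess at what runs on the master;
# adjust it to match the actual cluster layout.

def headroom_gb(total_gb, daemons, per_daemon_gb=1.0):
    """Return the memory (in GB) left for the OS after each daemon gets its share."""
    return total_gb - per_daemon_gb * len(daemons)

# Assumed daemons on the master of a Hadoop 1.x + HBase setup.
master_daemons = ["NameNode", "JobTracker", "HMaster", "HRegionServer", "ZooKeeper"]

for total in (1.7, 3.7):
    left = headroom_gb(total, master_daemons)
    status = "over budget, expect swapping" if left < 0 else "fits"
    print(f"{total} GB node: {left:+.1f} GB left for the OS ({status})")
```

If the result is negative for the node in question, the daemons alone would not fit in RAM under this rule of thumb, which is consistent with the swapping concern above. Checking actual heap settings (the JVM parameters) would give a more accurate picture than this estimate.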
