Hello,

We're doing a proof-of-concept study to see whether HBase is a good fit for an application we're planning to build. The application will record a continuous stream of sensor data throughout the day, and the data needs to be available for reads immediately. Our test cluster consists of 16 machines, each with 16 cores, 32GB of RAM, and 8TB of local storage, running CDH3u2. We're writing with the HBase client Put class, with "auto flush" disabled on the table and the write buffer size set to 12MB.

Here are the region server JVM options:
export HBASE_REGIONSERVER_OPTS="-Xmx28g -Xms28g -Xmn128m \
  -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
  -Xloggc:$HBASE_HOME/logs/gc-$(hostname)-hbase.log"

And here are the property settings we're using in the hbase-site.xml file:

hbase.rootdir=hdfs://master:9000/hbase
hbase.regionserver.handler.count=20
hbase.cluster.distributed=true
hbase.zookeeper.quorum=zk01,zk02,zk03
hfile.block.cache.size=0
hbase.hregion.max.filesize=1073741824
hbase.regionserver.global.memstore.upperLimit=0.79
hbase.regionserver.global.memstore.lowerLimit=0.70
hbase.hregion.majorcompaction=0
hbase.hstore.compactionThreshold=15
hbase.hstore.blockingStoreFiles=20
hbase.rpc.timeout=0
zookeeper.session.timeout=3600000

It's taking about 24 hours to load 4TB of data, which isn't fast enough for our application. Is there a better configuration we can use to improve loading performance?

- Amit
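In case the key=value shorthand is ambiguous: each of these is set as a standard `<property>` entry in hbase-site.xml. A sketch of the first few, with names and values copied from the list above (the remaining properties follow the same pattern):

```xml
<!-- hbase-site.xml: <property> form of the settings listed above -->
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://master:9000/hbase</value>
  </property>
  <property>
    <name>hbase.regionserver.handler.count</name>
    <value>20</value>
  </property>
  <property>
    <!-- 0 disables the block cache entirely -->
    <name>hfile.block.cache.size</name>
    <value>0</value>
  </property>
  <property>
    <!-- 1GB region split threshold, in bytes -->
    <name>hbase.hregion.max.filesize</name>
    <value>1073741824</value>
  </property>
</configuration>
```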
