Hi All, I am working on benchmarking different data stores to find the best fit for our use case. I would like the views and suggestions of the HBase user and developer community on some of my findings, as the results I am getting are highly variable.
My HBase setup has two EC2 Large hosts (each with 7.5 GB memory, 4 CPU cores, etc.), on which both the HBase master and slaves reside. The HDFS master/slave and ZooKeeper instances are also split between these two hosts. I have three tables, each with one column family, containing 100 million, 75 million, and 500 million rows respectively. The actual data consists of a String key and Long and String columns.

The usual access pattern is GETs on individual keys plus periodic batch PUTs. I ran my benchmark application against HBase under different scenarios to measure pure GET performance, mixed GET and PUT performance, etc. This was without enabling the HTable API's writeBuffer or any BloomFilters.

The results I got were quite unimpressive compared to similar benchmarking done with MySQL, Cassandra, etc.: performance was anywhere from 40% to 100% worse. So I started using writeBuffers in my code and also enabled BloomFilters at ROW level. However, I then began seeing a lot of variance in the benchmarking results (though I would not be too sure about attributing this to BloomFilters/write buffering). Another fact causing concern was that these results were actually worse than the earlier ones.

Since we are using EC2 Large instances, it seems unlikely that a network or other virtualization-related resource crunch is affecting our performance measurements. What I would like to know is whether this rings a bell for anyone else here. Could I be missing some configuration knob, such that background compaction or a similar process kicks in at the wrong time and skews my benchmarks? Any comments or feedback are welcome.

Thanks,
Aditya
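P.S. For concreteness, the write-buffer path I mentioned looks roughly like the following. This is only a minimal sketch against the HTable client API: the table name, column family, row count, and buffer size are placeholders, and it of course needs a running HBase cluster to execute.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchPutSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable"); // placeholder table name

        // Disable the per-Put RPC and buffer writes client-side instead.
        table.setAutoFlush(false);
        table.setWriteBufferSize(4 * 1024 * 1024); // 4 MB; tune per workload

        for (long i = 0; i < 100000; i++) {
            Put p = new Put(Bytes.toBytes("row-" + i));
            // Placeholder family "cf" and qualifier "val".
            p.add(Bytes.toBytes("cf"), Bytes.toBytes("val"), Bytes.toBytes(i));
            table.put(p); // queued; sent in batches as the buffer fills
        }

        table.flushCommits(); // push any remaining buffered Puts
        table.close();
    }
}
```

One thing worth double-checking when benchmarking with autoFlush disabled: per-put() timings mostly measure client-side buffering rather than server round trips, so the flushCommits() call should be inside the measured interval.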
