On Sep 06, Something Something wrote:
>Anyway, before I spent a lot of time on it, I thought I should check if
>anyone has compared HBase against CitrusLeaf. If you've, I would greatly
>appreciate it if you would share your experiences.

Disclaimer: I was an early evaluator/tester of citrusleaf about a year ago, when it was in its infancy. Though I am not affiliated with them in any manner, I might be more benevolent to them than most readers of this mailing list.

The short answer is that hbase & citrusleaf (called CL in the remainder of this mail) are very different products. CL cares a lot more about predictable latencies than hbase does. This is manifested in two aspects of the design:

* It is heavily optimized for large RAM + SSD usage. While hbase does a fair job of using RAM, I can say for sure that both the throughput and latency trends are much better with CL in cases where spinning disks are not used directly in the read/write path.
* Multiple machines can concurrently/actively handle requests for the same key, so the loss of one server does not mean that a range of keys is temporarily unavailable. An hbase cluster does have a partial, temporary outage when a region server dies. Things don't get back to normal immediately even when a new server takes over, since not all region data may now be local disk reads. Even if they are, the data won't be readily waiting for you in fast memory.
* A third aspect, more of a side effect, is that HDFS still has a SPOF in the form of the namenode, which continues to be a cause for concern wrt overall uptime guarantees.

Here is where hbase would do much better:

* It is designed for much larger data, to the point where it is natural for the entire dataset to be much larger than the total available RAM and for hard disks to be the primary storage medium.
* A bigtable implementation is also designed for both ranged scans and full table scans. Last I recall, CL was more of a DHT, so ranged scans are infeasible and doing full scans would qualify as much more than shooting oneself in the foot.

And here is where hbase has advantages in principle:

* As others mentioned, there are "textbook" advantages of using an open source solution.
* hbase has definitely been run both longer and on larger clusters than CL possibly has.

While generalizations are dangerous, the one place where C++ code could shine over java (the JVM, really) is that one does not have to fight the GC. I'd personally be more comfortable handing off, say, 48GB of memory to good C/C++ code than to the JVM. That being said, the folks working on hbase have been actively addressing this problem to the extent possible in pure java by using unmanaged heap memory. Search for "mslab hbase" to learn more about it.

My conclusion is that the two products address different problem spaces. So I'd urge you to spend time understanding your access patterns and see which one they map to more closely.

Feel free to contact me off list if you feel the need to ask anything that is not appropriate for the mailing list but is relevant to this discussion.
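
P.S. To make the ranged-scan point above concrete, here is a minimal Python sketch. It is purely illustrative and models neither product's actual code: the node count, key format, split points and helper names (hash_node, range_node, nodes_for_scan) are all made up for the example. It contrasts DHT-style placement (node picked by a hash of the key) with bigtable-style placement (each node owns a contiguous, sorted key range) and counts how many nodes a scan over adjacent keys has to touch.

```python
import bisect
import hashlib

NUM_NODES = 4  # hypothetical cluster size

# Hypothetical split points for range placement: node 0 holds keys below
# "user0250", node 1 keys below "user0500", and so on; node 3 holds the rest.
SPLIT_POINTS = ["user0250", "user0500", "user0750"]


def hash_node(key: str) -> int:
    """DHT-style placement: the node is chosen by a hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_NODES


def range_node(key: str) -> int:
    """Range placement: the node is chosen by where the key sorts."""
    return bisect.bisect_right(SPLIT_POINTS, key)


def nodes_for_scan(keys, start, stop, place):
    """Return the set of nodes holding any key in [start, stop)."""
    return {place(k) for k in keys if start <= k < stop}


KEYS = [f"user{i:04d}" for i in range(1000)]

range_nodes = nodes_for_scan(KEYS, "user0100", "user0200", range_node)
hash_nodes = nodes_for_scan(KEYS, "user0100", "user0200", hash_node)

print("range placement touches nodes:", sorted(range_nodes))
print("hash placement touches nodes: ", sorted(hash_nodes))
```

With the sorted layout, the scan over 100 adjacent keys stays on a single node. With hashing, the same keys are scattered, so a ranged scan degenerates into asking several nodes (in practice, all of them) and filtering.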
