On Aug 30, 2011, at 2:47 AM, Andrew Purtell wrote:

> Better to focus on improving HBase than play whack a mole.

Absolutely.  So let's talk about improving HBase.  I'm speaking here as someone 
who has been learning about and experimenting with HBase for more than six 
months.

> HBase supports replication between clusters (i.e. data centers).

That's... debatable.  There's replication support in the code, but several times 
in the recent past when someone asked about it on this mailing list, the 
response was "I don't know of anyone actually using it."  My understanding is 
that replication cannot be bootstrapped from existing data, so unless you 
enabled it on day one, it isn't very useful.  Do I misunderstand?

> Cassandra does not have strong consistency in the sense that HBase provides. 
> It can provide strong consistency, but at the cost of failing any read if 
> there is insufficient quorum. HBase/HDFS does not have that limitation. On 
> the other hand, HBase has its own and different scenarios where data may not 
> be immediately available. The differences between the systems are nuanced and 
> which to use depends on the use case requirements.

That's fair enough, although I think your first two sentences nearly contradict 
each other :-).  With N=3, W=3, R=1 in Cassandra, you get behavior comparable to 
HBase/HDFS on both axes: strong consistency, because every read overlaps every 
acknowledged write (R + W > N), and reads that succeed as long as any one copy 
is available.
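To make the quorum arithmetic concrete, here is a minimal sketch (plain Python, illustrative only, not Cassandra code) of the usual condition: a read is guaranteed to observe the latest acknowledged write whenever the read and write replica sets must overlap, i.e. R + W > N:

```python
# Quorum arithmetic sketch (illustrative only, not Cassandra code).
# N = replicas, W = write acknowledgements required, R = replicas read.
def strongly_consistent(n, w, r):
    # A read overlaps every acknowledged write when R + W > N,
    # so at least one of the R replicas read holds the latest value.
    return r + w > n

# The N=3, W=3, R=1 configuration discussed above:
print(strongly_consistent(3, 3, 1))  # True: 1 + 3 > 3

# By contrast, N=3, W=1, R=1 sacrifices the overlap guarantee:
print(strongly_consistent(3, 1, 1))  # False: 1 + 1 <= 3
```

And since a read only needs R=1 responses, any single live replica suffices to serve it, which is the availability half of the comparison.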

A more important point, I think, is the one about storage.  HBase uses two 
different kinds of files, data files and write-ahead logs, but HDFS doesn't know 
the difference and so cannot, for example, optimize data files for write 
throughput (and random reads) and log files for low-latency sequential writes.  
(How much, for instance, could performance improve by adding solid-state disks?)

> Cassandra's RandomPartitioner / hash based partitioning means efficient 
> MapReduce or table scanning is not possible, whereas HBase's distributed 
> ordered tree is naturally efficient for such use cases, I believe explaining 
> why Hadoop users often prefer it. This may or may not be a problem for any 
> given use case. 

I don't think you can make a blanket statement that random partitioning rules 
out efficient MapReduce (ordered scanning, yes).  Many M/R jobs process entire 
tables, where input order doesn't matter.  Random partitioning has definite 
advantages for some cases, such as spreading write load on monotonically 
increasing keys, and HBase might well benefit from recognizing that and adding 
some support.
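The load-spreading property is easy to illustrate.  This is a toy sketch in Python (hash-bucketing arbitrary row keys with MD5), not Cassandra's actual RandomPartitioner, but the effect is the same: sequential keys, the worst case for range partitioning, land uniformly across partitions, so full-table M/R splits are naturally balanced:

```python
import hashlib

def bucket(key, num_partitions):
    # Hash-based partitioning: even monotonically increasing keys
    # (timestamps, sequential IDs) spread evenly across partitions.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_partitions

# Sequential row keys -- with range partitioning these would all
# hammer the last partition; with hashing they spread out.
keys = [f"row-{i:08d}" for i in range(10000)]
counts = [0] * 8
for k in keys:
    counts[bucket(k, 8)] += 1

# Each of the 8 partitions receives roughly 10000/8 = 1250 keys.
print(counts)
```

A full-table M/R job reads every partition regardless of key order, which is why the ordering guarantee buys nothing in that case, while the balanced splits do.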

> Cassandra is no less complex than HBase. All of this complexity is "hidden" 
> in the sense that with Hadoop/HBase the layering is obvious -- HDFS, HBase, 
> etc. -- but the Cassandra internals are no less layered. 

Operationally, however, HBase is more complex.  Admins have to configure and 
manage ZooKeeper, HDFS, and HBase.  Could this be improved?

> With Cassandra, all RPC is via Thrift with various wrappers, so actually all 
> Cassandra clients are second class in the sense that jbellis means when he 
> states "Non-Java clients are not second-class citizens".

That's disingenuous.  Thrift exposes all of the Cassandra API to all of the 
wrappers, while HBase clients that want to use the full HBase API must use 
Java.  That can be fixed, but it is the status quo.

joe
