Hi Joe,

> > HBase supports replication between clusters (i.e. data centers).
> 
> That’s … debatable.  There's replication support in the code, but
> several times in the recent past when someone asked about it on this
> mailing list, the response was “I don't know of anyone actually
> using it.”


I believe SU (StumbleUpon) uses it.

Anyway, I think this is really the point I was making here:

> > the main difference is HBase provides simple mechanism and the user
> > must build a replication architecture useful for them; while
> > Cassandra attempts to hide some of that complexity

So I don't think either of us is really debating this point, except for this:

> My understanding of replication is that you can't replicate any
> existing data, so unless you activated it on day one, it isn't very
> useful.

That was a design choice. Existing data should be transferred in advance, or in a 
one-shot background pass, by a utility that chooses on an application-specific 
basis what is useful to replicate. A generic utility is also provided as an MR job 
for this purpose.
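As a rough illustration only (a minimal client-side sketch, not the shipped MR 
utility; the table name and peer ZooKeeper quorum are hypothetical, and the old 
HTable client API is assumed), a one-shot backfill amounts to scanning the source 
table and re-writing the cells, original timestamps included, into the peer 
cluster:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;

    public class BackfillCopy {
      public static void main(String[] args) throws Exception {
        // Source cluster: whatever the local hbase-site.xml points at.
        Configuration srcConf = HBaseConfiguration.create();
        // Destination (peer) cluster: hypothetical ZK quorum for the slave.
        Configuration dstConf = HBaseConfiguration.create();
        dstConf.set("hbase.zookeeper.quorum", "peer-zk1,peer-zk2,peer-zk3");

        HTable src = new HTable(srcConf, "mytable");
        HTable dst = new HTable(dstConf, "mytable");

        Scan scan = new Scan();
        scan.setCaching(500);                // fetch rows in batches per RPC
        ResultScanner scanner = src.getScanner(scan);
        for (Result row : scanner) {
          Put put = new Put(row.getRow());
          for (KeyValue kv : row.raw()) {
            put.add(kv);                     // preserve the original timestamps
          }
          dst.put(put);
        }
        scanner.close();
        src.close();
        dst.close();
      }
    }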

> If you use N=3, W=3, R=1 in Cassandra, you should get similar behavior
> to HBase/HDFS with respect to consistency and availability

My understanding is that R=1 does not guarantee that you won't see different 
versions of the data in different reads, in some scenarios. For example, if a W=3 
write reaches only one replica, the client is told the write failed, but the 
replica that accepted it is not rolled back; until read repair or anti-entropy 
reconciles the replicas, an R=1 read may return either version depending on which 
replica it happens to hit. There was an excellent Quora answer in this regard; I 
don't remember it offhand, but perhaps you can find the link to it or someone can 
provide it to you.

> Random partitioning has definite advantages for some cases, and HBase
> might well benefit from recognizing that and adding some support.

Or just use salted keys? 

Random partitioning in a distributed ordered tree sounds like an impedance 
mismatch to me.
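To be concrete about what I mean by salting (a minimal sketch; the bucket count is 
hypothetical and application-specific):

    import org.apache.hadoop.hbase.util.Bytes;

    public class SaltedKeys {
      private static final int BUCKETS = 16;  // hypothetical; pick per expected region count

      // Prefix the row key with a stable one-byte salt so monotonically
      // increasing keys spread across BUCKETS key ranges (regions)
      // instead of hammering a single "hot" region.
      public static byte[] salt(byte[] originalKey) {
        int bucket = (Bytes.hashCode(originalKey) & 0x7fffffff) % BUCKETS;
        byte[] salted = new byte[originalKey.length + 1];
        salted[0] = (byte) bucket;
        System.arraycopy(originalKey, 0, salted, 1, originalKey.length);
        return salted;
      }
    }

A full table scan then fans out into one scan per bucket prefix, so you trade a 
little read-side complexity for write-side distribution.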

> HBase uses two different kinds of files, data files and logs, but
> HDFS doesn't know about that and cannot, for example, optimize data
> files for write throughput

You are assuming that HDFS is a shrinkwrapped static thing here, no?

Anyway, your point is valid: in the past, features that HBase requires of HDFS 
have not received the level of support from the HDFS developer community that we 
would have liked. However, this is now rapidly changing for the better.

> Operationally, however, HBase is more complex.
> Admins have to configure and manage ZooKeeper, HDFS, and HBase.
> Could this be improved?

Sure, there is room for improvement in hiding some of the complexity for 
evaluators, single-system developers, and other users who want, e.g., a 
three-step quickstart.

Personally, I prefer having the ability to tune those layers independently of 
each other.

And, while complexity may be more "hidden" operationally in the Cassandra case 
relative to HBase, when there is a problem on your cluster, I don't know if 
that buys you anything. I suppose it depends on the nature of the problem. I do 
not believe there is a guarantee that operationally Cassandra is really simpler 
than HBase when it's 2 am and there is a bug and nodes are going down.


Best regards,


        - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via 
Tom White)


>________________________________
>From: Joe Pallas <joseph.pal...@oracle.com>
>To: user@hbase.apache.org
>Sent: Wednesday, August 31, 2011 1:42 AM
>Subject: Re: HBase and Cassandra on StackOverflow
>
>
>On Aug 30, 2011, at 2:47 AM, Andrew Purtell wrote:
>
>> Better to focus on improving HBase than play whack a mole.
>
>Absolutely.  So let's talk about improving HBase.  I'm speaking here as 
>someone who has been learning about and experimenting with HBase for more than 
>six months.
>
>> HBase supports replication between clusters (i.e. data centers).
>
>That’s … debatable.  There's replication support in the code, but several 
>times in the recent past when someone asked about it on this mailing list, the 
>response was “I don't know of anyone actually using it.”  My understanding of 
>replication is that you can't replicate any existing data, so unless you 
>activated it on day one, it isn't very useful.  Do I misunderstand?
>
>> Cassandra does not have strong consistency in the sense that HBase provides. 
>> It can provide strong consistency, but at the cost of failing any read if 
>> there is insufficient quorum. HBase/HDFS does not have that limitation. On 
>> the other hand, HBase has its own and different scenarios where data may not 
>> be immediately available. The differences between the systems are nuanced 
>> and which to use depends on the use case requirements.
>
>That's fair enough, although I think your first two sentences nearly 
>contradict each other :-).  If you use N=3, W=3, R=1 in Cassandra, you should 
>get similar behavior to HBase/HDFS with respect to consistency and 
>availability ("strong" consistency and reads do not fail if any one copy is 
>available).
>
>A more important point, I think, is the one about storage.  HBase uses two 
>different kinds of files, data files and logs, but HDFS doesn't know about 
>that and cannot, for example, optimize data files for write throughput (and 
>random reads) and log files for low latency sequential writes.  (For example, 
>how could performance be improved by adding solid-state disk?)
>
>> Cassandra's RandomPartitioner / hash based partitioning means efficient 
>> MapReduce or table scanning is not possible, whereas HBase's distributed 
>> ordered tree is naturally efficient for such use cases, I believe explaining 
>> why Hadoop users often prefer it. This may or may not be a problem for any 
>> given use case. 
>
>I don't think you can make a blanket statement that random partitioning makes 
>efficient MapReduce impossible (scanning, yes).  Many M/R tasks process entire 
>tables.  Random partitioning has definite advantages for some cases, and HBase 
>might well benefit from recognizing that and adding some support.
>
>> Cassandra is no less complex than HBase. All of this complexity is "hidden" 
>> in the sense that with Hadoop/HBase the layering is obvious -- HDFS, HBase, 
>> etc. -- but the Cassandra internals are no less layered. 
>
>Operationally, however, HBase is more complex.  Admins have to configure and 
>manage ZooKeeper, HDFS, and HBase.  Could this be improved?
>
>> With Cassandra, all RPC is via Thrift with various wrappers, so actually all 
>> Cassandra clients are second class in the sense that jbellis means when he 
>> states "Non-Java clients are not second-class citizens".
>
>That's disingenuous.  Thrift exposes all of the Cassandra API to all of the 
>wrappers, while HBase clients who want to use all of the HBase API must use 
>Java.  That can be fixed, but it is the status quo.
>
>joe
>
>
>
>
