On Wed, Aug 31, 2011 at 1:34 AM, Time Less <[email protected]> wrote:

> Most of your points are dead-on.
>
> > Cassandra is no less complex than HBase. All of this complexity is
> > "hidden" in the sense that with Hadoop/HBase the layering is obvious --
> > HDFS, HBase, etc. -- but the Cassandra internals are no less layered.
> >
> > Operationally, however, HBase is more complex.  Admins have to configure
> > and manage ZooKeeper, HDFS, and HBase.  Could this be improved?
> >
>
> I strongly disagree with the premise[1]. Having personally been involved in
> the Digg Cassandra rollout, having been in part-time weekly contact with the
> Digg Cassandra administrator until a couple of months ago, and having very
> close ties to the SimpleGeo Cassandra admin, I know it is a fickle beast.
> Having also spent a good amount of time at StumbleUpon and Mozilla (and now
> Riot Games), I also see first-hand that HBase is far more stable and
> -- dare I say it? -- operationally simpler.
>
> So okay, HBase is "harder to set up" if following a step-by-step guide on a
> wiki is "hard,"[2] but it's FAR easier to administer. Cassandra is rife with
> cascading cluster failure scenarios. I would not recommend running Cassandra
> in a highly-available, high-volume data scenario, but don't hesitate to do
> so for HBase.
>
> I do not know if this is a guaranteed (provable due to architecture) result,
> or just the result of the Cassandra community being... how shall I say...
> hostile to administrators. But then, to me it doesn't matter. Results do.
>
> --
> Tim Ellis
> Data Architect, Riot Games
> [1] That said, the other part of your statement is spot-on, too. It's surely
> possible to improve the HBase architecture or simplify it.
> [2] I went from having never set up HBase nor ever used Chef to having
> functional Chef recipes that installed a functional HBase/HDFS cluster in
> about 2 weeks. From my POV, the biggest stumbling point was that HDFS by
> default stores critical data in the underlying filesystem's /tmp directory,
> which is, for lack of a better word, insane. If I had to suggest how to
> simplify "HBase installation," I'd ask for sane HDFS config files that are
> extremely common and difficult to ignore.
>
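
(Side note on [2]: the /tmp behavior comes from hadoop.tmp.dir, which
dfs.name.dir and dfs.data.dir default under. Pointing those two somewhere
persistent in hdfs-site.xml is all "sane config files" would take -- the
property names below are the 0.20-era ones, the paths are just examples.)

<!-- hdfs-site.xml sketch: keep NameNode metadata and block data out of /tmp -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/srv/hadoop/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/srv/hadoop/dfs/data</value>
  </property>
</configuration>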

Why are you quoting "harder"? What was said was "more complex." Setting up N
things is more complex than setting up a single thing.

First, you have to learn:
1) Linux HA
2) DRBD

Right out of the gate, just to have a redundant NameNode (a sketch of the
configuration involved follows below).

This is not easy, fast, or simple. In fact, it is quite a pain:
http://docs.google.com/viewer?a=v&q=cache:9rnx-eRzi1AJ:files.meetup.com/1228907/Hadoop%2520Namenode%2520High%2520Availability.pptx+linux+ha+namenode&hl=en&gl=us&pid=bl&srcid=ADGEESig5aJNVAXbLgBwyc311sPSd88jUJbKHx4z2PQtDKHnmM1FuCJpg2IUyqi5JrmUL3RbCb8QRYsjHnP74YuKQfOQXoUZxnhrCy6N1kVpiG1jNi4zhqoKlUTmoDaqS1NegCFb6-WM&sig=AHIEtbQbjN1Olwxui5JmywdWzhqv4Hq3tw&pli=1

Doing it properly involves setting up physical wires between servers or link
aggregation groups. You can't script having someone physically run crossover
cables; you need your switching engineer to set up LAGs.
You may also notice that everyone who describes this setup describes it using
Linux-HA v1, which has been deprecated for over two years. That also
demonstrates how complicated this process is: people tend to touch it once and
never touch it again, because of how fragile it is.
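
Concretely, "learn Linux HA and DRBD" means hand-maintaining configuration
like the sketch below on both NameNode boxes (on top of ha.cf and authkeys).
The hostnames, devices, addresses, and init-script name here are made up, and
the haresources file is the deprecated v1 style that all the write-ups use:

# /etc/drbd.conf -- sketch only
resource nn-meta {
  protocol C;
  on nn1 {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   10.0.0.1:7788;
    meta-disk internal;
  }
  on nn2 {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   10.0.0.2:7788;
    meta-disk internal;
  }
}

# /etc/ha.d/haresources -- heartbeat v1 style: VIP + DRBD + mount + namenode init script
nn1 IPaddr::10.0.0.10/24 drbddisk::nn-meta Filesystem::/dev/drbd0::/hadoop/name::ext3 hadoop-namenode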

You are also implying that following the wiki is easy. Personally, I find
that the wiki has fine detail, but it is confusing.
Here is why.

"1.3.1.2. hadoop

This version of HBase will only run on Hadoop 0.20.x. It will not run on
hadoop 0.21.x (nor 0.22.x). HBase will lose data unless it is running on an
HDFS that has a durable sync. Currently only the branch-0.20-append branch
has this attribute[1]. No official releases have been made from this branch
up to now so you will have to build your own Hadoop from the tip of this
branch. Michael Noll has written a detailed blog, Building an Hadoop 0.20.x
version for HBase 0.90.2, on how to build an Hadoop from branch-0.20-append.
Recommended.

Or rather than build your own, you could use Cloudera's CDH3. CDH has the
0.20-append patches needed to add a durable sync (CDH3 betas will suffice;
b2, b3, or b4)."

So the setup starts by recommending either rolling your own Hadoop (a pain in
the ass) or using a beta ( :(  ).

Then, once it gets to HBase itself, it branches into "Standalone HBase" and
Section 1.3.2.2, "Distributed". That branches into "pseudo-distributed" and
"fully distributed", and then the ZooKeeper section offers two more options:
"1.3.2.2.2.2. ZooKeeper" and "1.3.2.2.2.2.1. Using existing ZooKeeper
ensemble".

Not to say this is hard or impossible, but it is a lot of information to
digest, and all the branching decisions are hard to understand for a
first-time user.

Uppercasing the word FAR does not prove to me that HBase is easier to
administer, nor does your employment history or second-hand stories from
unnamed people you know. I can tell you why I think Cassandra is easier to
manage:

1) There is only one log file: /var/log/cassandra/system.log
2) There is only one configuration folder, /usr/local/cassandra/conf, holding
cassandra.yaml and cassandra-env.sh (a sketch follows below this list)
3) I do not need to keep a chart or post-it notes tracking where all the
one-off components live: the ZooKeeper server list, the HBase master server
list, the NameNode, ...
4) There is no need to configure auxiliary stuff such as DRBD or Linux-HA
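
For contrast, here is roughly the entire footprint -- a sketch of
conf/cassandra.yaml using 0.8-era key names, with made-up paths and addresses:

# conf/cassandra.yaml -- sketch; values are illustrative
cluster_name: 'Example Cluster'
data_file_directories:
    - /srv/cassandra/data
commitlog_directory: /srv/cassandra/commitlog
saved_caches_directory: /srv/cassandra/saved_caches
listen_address: 10.0.0.11
rpc_address: 0.0.0.0
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "10.0.0.11,10.0.0.12"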

*FUD ALARM* "Cassandra is rife with cascading cluster failure scenarios."
...and HBase never has issues, apparently. (Remember, I am on both lists.)

Also...

"[2] I went from having never set up HBase nor ever used Chef to having
functional Chef recipes that installed a functional HBase/HDFS cluster in
about 2 weeks."

It took me about one hour to accomplish the same result with Puppet +
Cassandra:
http://www.jointhegrid.com/highperfcassandra/?p=62
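
To give an idea of the shape, a minimal Puppet class along these lines is
about all it takes (illustrative only, not the actual manifest from that post;
it assumes a cassandra package and init script are available on the node):

# Sketch of a minimal Cassandra Puppet class -- not the manifest from the post above.
class cassandra {
  # Install the package (assumes a "cassandra" package exists in a repo you manage).
  package { 'cassandra':
    ensure => installed,
  }

  # Ship the one config file and restart the service when it changes.
  file { '/etc/cassandra/cassandra.yaml':
    source  => 'puppet:///modules/cassandra/cassandra.yaml',
    require => Package['cassandra'],
    notify  => Service['cassandra'],
  }

  # Keep Cassandra running (assumes a "cassandra" init script).
  service { 'cassandra':
    ensure => running,
    enable => true,
  }
}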
