On Thu, Sep 1, 2011 at 10:53 AM, Edward Capriolo <[email protected]> wrote:
> On Wed, Aug 31, 2011 at 1:34 AM, Time Less <[email protected]> wrote:
>
>> Most of your points are dead-on.
>>
>> > Cassandra is no less complex than HBase. All of this complexity is
>> > "hidden" in the sense that with Hadoop/HBase the layering is obvious --
>> > HDFS, HBase, etc. -- but the Cassandra internals are no less layered.
>> >
>> > Operationally, however, HBase is more complex. Admins have to configure
>> > and manage ZooKeeper, HDFS, and HBase. Could this be improved?
>>
>> I strongly disagree with the premise[1]. Having personally been involved
>> in the Digg Cassandra rollout, and spent up until a couple months ago
>> being in part-time weekly contact with the Digg Cassandra administrator,
>> and having very close ties to the SimpleGeo Cassandra admin, I know it is
>> a fickle beast. Having also spent a good amount of time at StumbleUpon
>> and Mozilla (and now Riot Games), I also see first-hand that HBase is far
>> more stable and -- dare I say it? -- operationally simpler.
>>
>> So okay, HBase is "harder to set up" if following a step-by-step guide on
>> a wiki is "hard,"[2] but it's FAR easier to administer. Cassandra is rife
>> with cascading cluster failure scenarios. I would not recommend running
>> Cassandra in a highly-available, high-volume data scenario, but don't
>> hesitate to do so for HBase.
>>
>> I do not know if this is a guaranteed (provable due to architecture)
>> result, or just the result of the Cassandra community being... how shall
>> I say... hostile to administrators. But then, to me it doesn't matter.
>> Results do.
>>
>> --
>> Tim Ellis
>> Data Architect, Riot Games
>> [1] That said, the other part of your statement is spot-on, too. It's
>> surely possible to improve the HBase architecture or simplify it.
>> [2] I went from having never set up HBase nor ever used Chef to having
>> functional Chef recipes that installed a functional HBase/HDFS cluster in
>> about 2 weeks. From my POV, the biggest stumbling point was that HDFS by
>> default stores critical data in the underlying filesystem's /tmp
>> directory, which is, for lack of a better word, insane. If I had to
>> suggest how to simplify "HBase installation," I'd ask for sane HDFS
>> config files that are extremely common and difficult to ignore.
>
> Why are you quoting "harder"? What was said was "more complex". Setting up
> N things is more complex than setting up a single thing.
>
> First, you have to learn:
> 1) Linux HA
> 2) DRBD
>
> Right out of the gate, just to have a redundant name node.
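For anyone who has not been down that road: what Ed is pointing at is roughly
a DRBD volume mirroring the NameNode's dfs.name.dir between two boxes, with
Linux-HA/heartbeat floating a virtual IP and failing the NameNode process
over. As a rough sketch (not a recipe -- the hostnames, devices, and
addresses below are made up), just the DRBD half looks something like:

    # /etc/drbd.d/namenode.res -- illustrative only; nn1/nn2, /dev/sdb1 and
    # the 10.0.0.x addresses are placeholders for your own hosts and devices
    resource nn-meta {
      protocol C;                # synchronous replication
      on nn1 {
        device    /dev/drbd0;
        disk      /dev/sdb1;     # backing volume that will hold dfs.name.dir
        address   10.0.0.1:7788;
        meta-disk internal;
      }
      on nn2 {
        device    /dev/drbd0;
        disk      /dev/sdb1;
        address   10.0.0.2:7788;
        meta-disk internal;
      }
    }

...and that is before any of the heartbeat/Pacemaker resource configuration
that actually performs the failover.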
Eh, no one would do that. If you want a redundant name node your only choice
is to use MapR, which I would definitely recommend since you get a better
NameNode "fail-over" without service interruption and significantly higher
performance than HDFS.

> This is not easy, fast, or simple. In fact this is quite a pain.
> http://docs.google.com/viewer?a=v&q=cache:9rnx-eRzi1AJ:files.meetup.com/1228907/Hadoop%2520Namenode%2520High%2520Availability.pptx+linux+ha+namenode&hl=en&gl=us&pid=bl&srcid=ADGEESig5aJNVAXbLgBwyc311sPSd88jUJbKHx4z2PQtDKHnmM1FuCJpg2IUyqi5JrmUL3RbCb8QRYsjHnP74YuKQfOQXoUZxnhrCy6N1kVpiG1jNi4zhqoKlUTmoDaqS1NegCFb6-WM&sig=AHIEtbQbjN1Olwxui5JmywdWzhqv4Hq3tw&pli=1
>
> Doing it properly involves setting up physical wires between servers or
> link aggregation groups. You can't script having someone physically run
> crossover cables. You need your switching engineer to set up LAGs.
> Also, you may notice that everyone who describes this setup describes it
> using Linux-HA v1, which has been deprecated for over 2 years. That also
> demonstrates how this process is so complicated that people tend to set it
> up once and never touch it again because of how fragile it is.
>
> You are also implying that following the wiki is easy. Personally, I find
> that the wiki has fine detail, but it is confusing. Here is why.
>
> "1.3.1.2. hadoop
>
> This version of HBase will only run on Hadoop 0.20.x. It will not run on
> hadoop 0.21.x (nor 0.22.x). HBase will lose data unless it is running on an
> HDFS that has a durable sync. Currently only the branch-0.20-append branch
> has this attribute[1]. No official releases have been made from this branch
> up to now so you will have to build your own Hadoop from the tip of this
> branch. Michael Noll has written a detailed blog, Building an Hadoop 0.20.x
> version for HBase 0.90.2, on how to build an Hadoop from branch-0.20-append.
> Recommended.
>
> Or rather than build your own, you could use Cloudera's CDH3. CDH has the
> 0.20-append patches needed to add a durable sync (CDH3 betas will suffice;
> b2, b3, or b4)."
>
> So the setup starts by recommending that you roll your own Hadoop (a pain
> in the ass), OR use a beta ( :( ).
>
> Then, when it gets to HBase, it branches into "Standalone HBase" and
> Section 1.3.2.2, "Distributed". Then it branches into "pseudo-distributed"
> and "fully distributed", and then the ZooKeeper section offers you two
> options: "1.3.2.2.2.2. ZooKeeper" and "1.3.2.2.2.2.1. Using existing
> ZooKeeper ensemble".
>
> Not to say this is hard or impossible, but it is a lot of information to
> digest, and all the branching decisions are hard to understand for a
> first-time user.

Moving forward, my plan is to deploy HBase only on top of MapR for real-time
situations, wherever at all possible. HDFS isn't there yet; 2.5 years ago I
was optimistic, and they still have more years to go. In the meantime, with
MapR you get HA, better performance, and hopefully better error recovery.

> Uppercasing the word FAR does not prove to me that HBase is easier to
> administer, nor does your employment history or second-hand stories from
> unnamed people you know. I can tell you why I think Cassandra is easier to
> manage:
>
> 1) There is only one log file: /var/log/cassandra/system.log
> 2) There is only one configuration folder: /usr/local/cassandra/conf/
>    (cassandra.yaml, cassandra-env.sh)
> 3) I do not need to keep a chart or post-it notes of where all these
>    one-off components are:
>    zk server list, hbase master server list, namenode, ...
> 4) No need to configure auxiliary stuff such as DRBD or Linux-HA

Just as an aside, no one does #4. As for #3, what you are really saying is "I
don't want to have good sysadmin/automation practices" -- sure, a lot of
people don't, but if you do, #3 is a non-issue. Chef can help.

> *FUD ALARM* "Cassandra is rife with cascading cluster failure scenarios."
> ...and HBase never has issues, apparently. (Remember, I am on both lists.)

This is not FUD, it's a legitimate concern. The issue isn't whether one
system has failures or not, because they all fail, but HOW they fail. That
also leads to HOW you determine what the root cause is, and HOW you recover.
This sounds like a difference of opinion, but there are practicalities to how
you administer a cluster and deal with 3am failure modes. I think this is the
place where HBase shines, but it's a story you can't tell without people
crying "FUD", since it's complex and thus doesn't translate well.

I would also posit that the HBase master is a _good_ thing. It provides a
management point, it doesn't participate in the query path, and it is not a
major scaling issue. It lets you give definitive answers to things like "how
busy is my cluster", "what is online/offline", "what tables are there", and
so on. It handles failures in a highly explicit manner, which is good.

> Also...
> [2] I went from having never set up HBase nor ever used Chef to having
> functional Chef recipes that installed a functional HBase/HDFS cluster in
> about 2 weeks.
>
> It took me about one hour to accomplish the same result with puppet +
> cassandra.
> http://www.jointhegrid.com/highperfcassandra/?p=62
>
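On the Chef point: the /tmp gotcha Tim describes in [2] mostly comes down to
hadoop.tmp.dir defaulting to /tmp/hadoop-${user.name}, which the NameNode and
DataNode directories inherit unless you set them explicitly. A minimal sketch
of the override for a 0.20-era HDFS (the /data/1 paths below are just
placeholders) looks something like:

    <!-- core-site.xml -->
    <configuration>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/data/1/hadoop-tmp</value>   <!-- anywhere that is not /tmp -->
      </property>
    </configuration>

    <!-- hdfs-site.xml -->
    <configuration>
      <property>
        <name>dfs.name.dir</name>           <!-- NameNode metadata -->
        <value>/data/1/dfs/name</value>
      </property>
      <property>
        <name>dfs.data.dir</name>           <!-- DataNode block storage -->
        <value>/data/1/dfs/data</value>
      </property>
    </configuration>

Bake that into whatever Chef or Puppet recipe lays down the config and the
"critical data in /tmp" failure mode goes away.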

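(For what it's worth, the ZooKeeper branching Ed quotes is less scary than
the section numbering makes it look: the only real decision is whether HBase
manages ZooKeeper for you or you point it at an ensemble you already run.
Roughly -- the zk*.example.com hostnames below are placeholders:

    # conf/hbase-env.sh -- tell HBase not to start its own ZooKeeper
    export HBASE_MANAGES_ZK=false

    <!-- conf/hbase-site.xml -->
    <configuration>
      <property>
        <name>hbase.zookeeper.quorum</name>
        <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
      </property>
    </configuration>

Leave HBASE_MANAGES_ZK at its default of true and HBase starts and stops its
own quorum on those same hosts instead.)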