I'm obviously not in a good position to answer since I've been a committer since 2008, but here is my experience, in case you can relate to it:

At StumbleUpon we have 2 committers on staff (including me, oh and we're looking to hire a third one if anyone is interested). We've been using HBase in production since early 2009, so our clusters have been through multiple upgrades of both software and hardware.

It used to be that our team was responsible for maintaining almost everything related to HBase, except for actual hardware maintenance and OS upgrades. It's hard for me to say exactly how much time we spent maintaining HBase compared to developing it, as the two often overlapped.

Our situation has since evolved a bit. Our ops team is mostly composed of SREs[1], and they have varying levels of experience with HBase; one of them has been trained more thoroughly than the others so that we have a go-to guy.

Once the cluster is running, there's usually nothing to do except keep a dashboard with the important metrics on one screen. Mine has the requests/second, GC activity, and compaction queues for the whole prod cluster.

To make your life easier, you really need to:

 - Have tools to automate cluster maintenance, such as doing rolling upgrades. We use Puppet and Fabric[2].
 - Have good metrics; we use OpenTSDB[3]. It also helps that the author works for us.
 - Have a good alerting system; we use Nagios.
 - Have at least one ops guy who understands/codes in Java. HBase and Hadoop have a lot of Java-isms, so it helps in finding your way around.

Do allocate time for your teams to understand how HBase works (both data model and architecture), as it will make everything much easier. You don't want to end up in the middle of an outage with no understanding of what's going on at all. Distributed systems have different failure modes than systems with a single-machine architecture (even if those are in a cluster, like a master-slave MySQL setup).

Hope this helps,

J-D

1. Site Reliability Engineer, a term that I believe comes from Google; use their search engine to learn more about what that position involves. I think you can say it's close to DevOps.
2. Fabric: http://docs.fabfile.org/en/1.2.0/index.html
3. OpenTSDB: http://opentsdb.net/

On Tue, Aug 16, 2011 at 6:27 PM, Sam Seigal <[email protected]> wrote:
> Hi All,
>
> I had a question about the operational overhead of maintaining HBase in
> production. Would someone care to share their experiences ? We have a team
> of 3 DBAs dedicated to maintaining our Oracle cluster. I am curious to know
> if we would need the same for HBase.
>
> I am talking of a small cluster of 7-8 machines handling around 150 million
> transactions per hour for the initial rollout.
>
> What are some of the common operational/maintenance tasks associated
> with maintaining a cluster of that size ? How much developer time goes into
> this once the cluster is up and running ?
>
> It would be extremely beneficial to hear some thoughts/experiences.
>
> Thank you,
>
> Sam
>
