Re: operational overhead for HBase

Jean-Daniel Cryans Wed, 17 Aug 2011 11:26:35 -0700

I'm obviously not in a good position to answer since I've been a
committer since 2008, but here is my experience, in case you can
relate to it:

At StumbleUpon we have 2 committers on staff (including me, oh and
we're looking to hire a third one if anyone is interested). We've been
using HBase in production since early 2009 so our clusters have been
through multiple upgrades of both software and hardware.

It used to be that our team was responsible for maintaining almost
everything related to HBase, except for actual hardware maintenance
and OS upgrades. It's hard for me to say exactly how much time we
spent maintaining HBase compared to developing it, as often both
overlapped.

Our situation has since evolved a bit. Our ops team is mostly composed
of SREs[1], and many of them have varying experience with HBase; one of
them has been more carefully trained than the others so that we have
a go-to guy.

Once the cluster is running, there's usually nothing to do except
keep a dashboard with the important metrics on one screen. Mine has
the requests/second, GC activity, and compaction queues for the whole
prod cluster.
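
As a rough illustration of feeding such a dashboard, the sketch below
formats data points as OpenTSDB-style "put" lines (metric, timestamp,
value, tags). The metric names, the host tag and the put_line helper are
hypothetical and chosen only for the example; they are not the names used
on the cluster described here.

```python
import time


def put_line(metric, value, tags, timestamp=None):
    """Format one data point as an OpenTSDB telnet-style `put` line."""
    if not tags:
        # OpenTSDB expects every data point to carry at least one tag.
        raise ValueError("at least one tag is required")
    ts = int(time.time()) if timestamp is None else int(timestamp)
    tag_str = " ".join("%s=%s" % (k, v) for k, v in sorted(tags.items()))
    return "put %s %d %s %s" % (metric, ts, value, tag_str)


# Hypothetical metric names and values, for illustration only.
samples = {
    "hbase.regionserver.requests": 1250,
    "hbase.regionserver.compactionQueueSize": 3,
}

lines = [
    put_line(name, value, {"host": "rs1"}, timestamp=1313605595)
    for name, value in sorted(samples.items())
]

for line in lines:
    print(line)
```

In practice the lines would be written to a socket connected to the
time-series daemon rather than printed; that part is left out to keep the
sketch self-contained.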

To make your life easier you really need to:

 - Have tools to automate cluster maintenance, such as doing rolling
upgrades. We use Puppet and Fabric[2].
 - Have good metrics, we use OpenTSDB[3]. It also helps that the
author works for us.
 - Have a good alerting system, we use Nagios.
 - Have at least one ops guy that understands/codes in Java. HBase and
Hadoop have a lot of Java-isms so it helps you find your way around.
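
To make the first point concrete, here is a minimal sketch of the control
flow behind a rolling restart: take one node at a time, wait for it to
come back, and stop at the first node that does not. It is not the actual
Puppet/Fabric setup mentioned above; rolling_restart, restart and
is_healthy are hypothetical names, and the two callables stand in for
whatever remote commands and health checks your tooling provides.

```python
import time


def rolling_restart(hosts, restart, is_healthy,
                    max_checks=5, interval=10, pause=time.sleep):
    """Restart hosts one at a time; abort if a host does not recover.

    restart(host) and is_healthy(host) are caller-supplied callables
    (assumptions of this sketch). Returns the hosts restarted successfully.
    """
    done = []
    for host in hosts:
        restart(host)
        for _ in range(max_checks):
            if is_healthy(host):
                break
            pause(interval)  # give the node time before checking again
        else:
            # Never reported healthy: stop here so the rest of the
            # cluster keeps running the old, known-good state.
            raise RuntimeError(
                "%s did not recover; restarted so far: %s" % (host, done))
        done.append(host)
    return done


if __name__ == "__main__":
    # Dry run with stand-in callables instead of real remote commands.
    restarted = rolling_restart(
        ["rs1", "rs2", "rs3"],
        restart=lambda h: print("restarting", h),
        is_healthy=lambda h: True,
        pause=lambda seconds: None,
    )
    print(restarted)
```

Going one node at a time and aborting early is the point of the design: a
bad release or config change then takes out a single machine instead of
the whole cluster.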

Do allocate time for your teams to understand how HBase works (both
data model and architecture) as it will make everything much easier.
You don't want to end up in the middle of an outage with no
understanding of what's going on at all. Distributed systems have
different failure modes than the ones with single machine architecture
(even if they are in a cluster like a master-slave MySQL setup).

Hope this helps,

J-D

1. Site Reliability Engineer, a term that I believe comes from the
goog, use their search engine to learn more about what that position
involves. I think you can say it's close to DevOps.
2. Fabric: http://docs.fabfile.org/en/1.2.0/index.html
3. OpenTSDB: http://opentsdb.net/

On Tue, Aug 16, 2011 at 6:27 PM, Sam Seigal <[email protected]> wrote:
> Hi All,
>
> I had a question about the operational overhead of maintaining HBase in
> production. Would someone care to share their experiences ? We have a team
> of 3 DBAs dedicated to maintaining our Oracle cluster. I am curious to know
> if we would need the same for HBase.
>
> I am talking of a small cluster of 7-8 machines handling around 150 million
> transactions per hour for the initial rollout.
>
> What are some of the common operational/maintenance tasks associated
> with maintaining a cluster of that size ? How much developer time goes into
> this once the cluster is up and running ?
>
> It would be extremely beneficial to hear some thoughts/experiences.
>
> Thank you,
>
> Sam
>
