Inline...

On Thu, May 5, 2011 at 5:26 PM, Matt Davies <[email protected]> wrote:

> Afternoon everyone,
>
> I am researching what the best practice is for using HBase in user facing
> applications.  I do not know all of the applications that will be ported to
> use HBase, but they do share common characteristics such as
>
> - simple key / value data.  Not serving large files ATM. Perhaps a couple
> columns in a single column family
> - very tall tables
> - hundreds of millions of rows
> - need millisecond access times for a single row
> - random access
> - maintain very, very good query times while loading in new data
>
>
> The quick choice would be to use something like memcache or Redis, but the
> data is growing faster than the memory of a single box or even few boxes.
>  We also have a significant investment in Hadoop technologies so keeping
> HBase prime seems to make a lot of sense.
>
> So, some questions:
>
> 1. do you find that having a single HBase cluster to serve all applications
> vs smaller clusters to serve application specific data is better?
>
In general, you get higher utilization with the former, but you can better
tune for a specific application with the latter. Also consider that a single
20-node cluster running 2 apps will balance load and tolerate node failures
better than two 10-node clusters running one app each. Another consideration
is upgrades and maintenance. You might have one app out of 10 that needs the
shiny new HBase v1.3.37 hotness, but you'll have to bring them all down if
they're served out of the same cluster.
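
To put a rough number on the node-failure point, here is a minimal sketch. It
assumes load is perfectly balanced before the failure and spreads evenly over
the survivors afterwards, which real region assignment only approximates. The
function name is made up for illustration.

```python
def extra_load_per_survivor(nodes: int, failed: int = 1) -> float:
    """Fractional load increase on each surviving node after a failure.

    Assumes load was evenly balanced across `nodes` and is redistributed
    evenly across the remaining ones (an idealization).
    """
    if not 0 <= failed < nodes:
        raise ValueError("failed must be between 0 and nodes - 1")
    survivors = nodes - failed
    # Each survivor goes from 1/nodes of the load to 1/survivors of it.
    return nodes / survivors - 1.0


if __name__ == "__main__":
    for n in (10, 20):
        increase = extra_load_per_survivor(n)
        print(f"{n}-node cluster, one node lost: "
              f"+{increase:.1%} load per surviving node")
```

Under those assumptions, losing one node costs each survivor roughly twice as
much extra load in a 10-node cluster as in a 20-node one.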


> 2. In the real world do people hook API's directly to HBase or is there some
> caching layer that is used?
>
We see both in the real world. Unless you know a priori what to cache, I
think it's better to rely on HBase's block cache. One severe limitation of
the block cache, though, is that you can only make it so big, because large
heaps make you prone to long GC pauses (even with MSLAB, you will GC
eventually). However, the OS compensates pretty well, because it uses
otherwise-free memory to cache the filesystem.
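
A back-of-the-envelope sketch of that trade-off: the block cache is sized as a
fraction of the region server heap (the hfile.block.cache.size property), and
whatever RAM is left outside the heap is the most the OS could use for its
filesystem cache. All the numbers below are hypothetical, and `other_gb` is a
made-up stand-in for the DataNode and anything else running on the box.

```python
def memory_split(ram_gb: float, heap_gb: float,
                 block_cache_fraction: float, other_gb: float = 0.0) -> dict:
    """Split a node's RAM into HBase block cache and potential OS page cache.

    block_cache_fraction plays the role of hfile.block.cache.size
    (a fraction of the region server heap). The page cache figure is an
    upper bound: the OS only caches with memory nothing else is using.
    """
    if not 0.0 <= block_cache_fraction <= 1.0:
        raise ValueError("block_cache_fraction must be between 0 and 1")
    if heap_gb + other_gb > ram_gb:
        raise ValueError("heap plus other processes exceed physical RAM")
    return {
        "block_cache_gb": heap_gb * block_cache_fraction,
        "os_page_cache_gb": ram_gb - heap_gb - other_gb,
    }


if __name__ == "__main__":
    # Hypothetical node: 24 GB RAM, 8 GB heap, 25% of heap for block cache,
    # 2 GB assumed for other processes.
    print(memory_split(ram_gb=24, heap_gb=8,
                       block_cache_fraction=0.25, other_gb=2))
```

With a modest heap, most of the box's memory ends up as potential filesystem
cache rather than block cache, which is why keeping the heap small costs less
than it first appears.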


> 3. I remember hearing people like StumbleUpon use different clusters for
> analytics vs customer apps.  Is this still best practice?

This is good practice in RDBMS-land, and I think it still holds, assuming
you can afford 2 clusters. Analytics is read-mostly and customer-facing
traffic is transactional, and some tuning parameters, like the
memstore/blockcache ratios, are cluster-wide.
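
To illustrate why one cluster-wide pair of ratios is a compromise, here is a
small sketch. The two profiles and their ratios are hypothetical, not
recommendations, and the 0.8 ceiling is an assumption standing in for "leave
some heap for everything else".

```python
HEAP_CEILING = 0.8  # assumed: memstore + block cache may use at most 80% of heap

# Hypothetical heap fractions for the two workloads.
PROFILES = {
    "analytics (read-mostly)": {"memstore": 0.20, "block_cache": 0.50},
    "customer apps (transactional)": {"memstore": 0.45, "block_cache": 0.25},
}


def ratios_fit(memstore: float, block_cache: float,
               ceiling: float = HEAP_CEILING) -> bool:
    """Return True if the two heap fractions together stay under the ceiling."""
    return memstore + block_cache <= ceiling


if __name__ == "__main__":
    for name, p in PROFILES.items():
        ok = ratios_fit(p["memstore"], p["block_cache"])
        print(f"{name}: memstore={p['memstore']}, "
              f"block_cache={p['block_cache']}, fits={ok}")
    # Trying to give both workloads what they want on one cluster:
    print("both maxima on one cluster fits:", ratios_fit(0.45, 0.50))
```

Each profile fits on its own, but taking the larger memstore and the larger
block cache together does not, so a shared cluster has to shortchange one of
the workloads.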


> 4. Anyone using MSLAB's to reduce GC pauses in production? Experiences /
> landmines?
> 5. What other considerations have you found when hooking HBase up for
> user-facing applications?
>
> Thanks in advance and I'd love to hear some bragging!
>
> -Matt
>
