Inline...

On Thu, May 5, 2011 at 5:26 PM, Matt Davies <[email protected]> wrote:
> Afternoon everyone,
>
> I am researching what the best practice is for using HBase in user facing
> applications. I do not know all of the applications that will be ported to
> use HBase, but they do share common characteristics such as
>
> - simple key / value data. Not serving large files ATM. Perhaps a couple
>   columns in a single column family
> - very tall tables
> - hundreds of millions of rows
> - need millisecond access times for a single row
> - random access
> - maintain very, very good query times while loading in new data
>
> The quick choice would be to use something like memcache or Redis, but the
> data is growing faster than the memory of a single box or even few boxes.
> We also have a significant investment in Hadoop technologies so keeping
> HBase prime seems to make a lot of sense.
>
> So, some questions:
>
> 1. do you find that having a single HBase cluster to serve all applications
> vs smaller clusters to serve application specific data is better?
>

In general, you get higher utilization with a single shared cluster, but you
can tune better for a specific application with separate clusters. Also
consider that a single 20-node cluster running 2 apps will balance load and
tolerate node failures better than two 10-node clusters running one app each.

Another consideration is upgrades and maintenance. You might have one app out
of 10 that needs the shiny new HBase v1.3.37 hotness, but you'll have to bring
them all down if they're served out of the same cluster.

> 2. In the real world do people hook API's directly to HBase or is there some
> caching layer that is used?
>

We see both in the real world. I think unless you know a priori what to
cache, it's better to rely on HBase's block cache. One severe limitation of
the block cache, though, is that you can only make it so big, because large
heaps make you prone to long GC pauses (even with MSLAB, you will GC
eventually). However, the OS does a pretty good job of compensating, because
it will use whatever memory is available to cache the filesystem.

> 3. I remember hearing people like StumbleUpon use different clusters for
> analytics vs customer apps. Is this still best practice?

This is good practice in RDBMS-land, and I think it still holds, assuming you
can afford 2 clusters. The analytics workload is read-mostly and the customer
app workload is transactional, and some tuning parameters, like the
memstore/blockcache ratios, are cluster-wide.

> 4. Anyone using MSLAB's to reduce GC pauses in production? Experiences /
> landmines?
> 5. What other considerations have you found when hooking HBase up for
> user-facing applications?
>
> Thanks in advance and I'd love to hear some bragging!
>
> -Matt
>
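
To make the heap trade-off from questions 2 and 3 a bit more concrete, here is a rough back-of-the-envelope sketch in Python. It assumes the block cache and the memstore are each sized as a fraction of the region server heap (in the HBase of this era these are the `hfile.block.cache.size` and `hbase.regionserver.global.memstore.upperLimit` settings). The function name, the 8 GB heap, and the fractions below are made up for illustration and are not recommendations.

```python
def heap_budget(heap_gb, block_cache_fraction, memstore_fraction):
    """Split a region server heap into block cache, memstore, and the rest.

    All names and numbers here are illustrative assumptions, not HBase APIs.
    """
    for fraction in (block_cache_fraction, memstore_fraction):
        if not 0 <= fraction <= 1:
            raise ValueError("each fraction must be between 0 and 1")
    if block_cache_fraction + memstore_fraction > 1:
        raise ValueError("fractions must sum to at most 1")

    block_cache_gb = heap_gb * block_cache_fraction
    memstore_gb = heap_gb * memstore_fraction
    return {
        "block_cache_gb": block_cache_gb,
        "memstore_gb": memstore_gb,
        # Whatever is left serves everything else that lives on the heap.
        "other_gb": heap_gb - block_cache_gb - memstore_gb,
    }


# Hypothetical read-mostly cluster: favour the block cache.
read_heavy = heap_budget(8, block_cache_fraction=0.4, memstore_fraction=0.2)

# Hypothetical write-heavy cluster: favour the memstore.
write_heavy = heap_budget(8, block_cache_fraction=0.2, memstore_fraction=0.4)

print(read_heavy)
print(write_heavy)
```

Since both fractions apply to every region server in a cluster, one cluster cannot be tuned both ways at once, which is part of the argument for separate clusters. Growing the heap grows both caches but also the GC exposure, so RAM left outside the JVM is not wasted: the OS uses it to cache the filesystem.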
