For our setup we went with two clusters: one we call our "hbase cluster" and the other our "analytics cluster". M/R jobs where hbase is the source and/or sink usually run on the "hbase cluster", and so far it's been fine (you definitely want the data locality for these jobs). We also export data from the "hbase cluster" to HDFS on the analytics cluster for M/R jobs that need to join with data that lives outside of hbase.

In my experience, you can run M/R jobs on the same cluster as hbase, but you need to limit the number of tasks that run on that cluster to make sure hbase gets its share of resources. For example, our nodes have 8 cores and we reserve 3 of them for hbase. On the analytics cluster we use all of the cores for M/R tasks.

Given the ad-hoc nature of our analytics workload (lots of hive/pig queries), I sleep a lot better at night knowing that no matter how bad a query someone comes up with, it won't take down hbase, since we keep it on a separate cluster.
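
To make the core reservation above concrete, here is a rough sketch of the arithmetic. The function name `task_slots` and the 60/40 map/reduce split are assumptions for illustration only, not something from our actual configuration. On Hadoop 1.x the resulting numbers would typically go into `mapred.tasktracker.map.tasks.maximum` and `mapred.tasktracker.reduce.tasks.maximum` on each node.

```python
def task_slots(total_cores, reserved_for_hbase, map_fraction=0.6):
    """Split the cores left after the HBase reservation into M/R task slots.

    map_fraction is a hypothetical tuning knob (share of slots given to
    map tasks); pick a value that matches your own job mix.
    """
    if not 0 <= reserved_for_hbase <= total_cores:
        raise ValueError("reserved cores must be between 0 and total_cores")
    available = total_cores - reserved_for_hbase
    map_slots = round(available * map_fraction)
    reduce_slots = available - map_slots
    return {"map": map_slots, "reduce": reduce_slots}


# Shared node: 8 cores, 3 held back for hbase, so 5 are left for M/R.
shared = task_slots(total_cores=8, reserved_for_hbase=3)

# Dedicated M/R node (assuming the same 8-core hardware): nothing reserved.
dedicated = task_slots(total_cores=8, reserved_for_hbase=0)

print(shared, dedicated)
```

The point is only that the slot count on a shared node is derived from what is left after hbase gets its cores, rather than from the total core count.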

On 6/5/12 8:00 PM, "Atif Khan" <[email protected]> wrote:

> During a recent Cloudera course we were told that it is "Best practice"
> to isolate a MapReduce/HDFS cluster from an HBase/HDFS cluster, as the
> two, when sharing the same HDFS cluster, could lead to performance
> problems. I am not sure if this is entirely true given the fact that the
> main concept behind Hadoop is to export computation to the data and not
> import data to the computation. If I were to segregate HBase and
> MapReduce clusters, then when using MapReduce on HBase data would I not
> have to transfer large amounts of data from the HBase/HDFS cluster to the
> MapReduce/HDFS cluster?
>
> Cloudera on their best practice page
> (http://www.cloudera.com/blog/2011/04/hbase-dos-and-donts/) has the
> following:
>
> "Be careful when running mixed workloads on an HBase cluster. When you
> have SLAs on HBase access independent of any MapReduce jobs (for example,
> a transformation in Pig and serving data from HBase) run them on separate
> clusters. HBase is CPU and Memory intensive with sporadic large
> sequential I/O access while MapReduce jobs are primarily I/O bound with
> fixed memory and sporadic CPU. Combined these can lead to unpredictable
> latencies for HBase and CPU contention between the two. A shared cluster
> also requires fewer task slots per node to accommodate for HBase CPU
> requirements (generally half the slots on each node that you would
> allocate without HBase). Also keep an eye on memory swap. If HBase starts
> to swap there is a good chance it will miss a heartbeat and get dropped
> from the cluster. On a busy cluster this may overload another region,
> causing it to swap and a cascade of failures."
>
> All my initial investigation/reading led me to believe that I should
> create a common HDFS cluster and then run MapReduce and HBase against
> that common HDFS cluster. But from the above Cloudera best practice it
> seems like I should create two HDFS clusters, one for MapReduce and one
> for HBase, and then move data around when required. Something does not
> make sense with this best practice recommendation.
>
> Any thoughts and/or feedback will be much appreciated.
>
> --
> View this message in context:
> http://old.nabble.com/Shared-Cluster-between-HBase-and-MapReduce-tp33967219p33967219.html
> Sent from the HBase User mailing list archive at Nabble.com.
