Thanks for the confirmation. There is also a good, detailed discussion thread on this issue at:
http://apache-hbase.679495.n3.nabble.com/Shared-HDFS-for-HBase-and-MapReduce-td4018856.html

Michael Segel-3 wrote:
>
> It depends... There are some reasons to do this; however, in general,
> you don't need to do this...
>
> The course is wrong to suggest this as a best practice.
>
> Sent from my iPhone
>
> On Jun 5, 2012, at 5:00 PM, "Atif Khan" <[email protected]> wrote:
>
>> During a recent Cloudera course we were told that it is "Best practice"
>> to isolate a MapReduce/HDFS cluster from an HBase/HDFS cluster, as the
>> two, when sharing the same HDFS cluster, could lead to performance
>> problems. I am not sure if this is entirely true, given that the main
>> concept behind Hadoop is to export computation to the data and not
>> import data to the computation. If I were to segregate the HBase and
>> MapReduce clusters, then when using MapReduce on HBase data, would I
>> not have to transfer large amounts of data from the HBase/HDFS cluster
>> to the MapReduce/HDFS cluster?
>>
>> Cloudera on their best practice page
>> (http://www.cloudera.com/blog/2011/04/hbase-dos-and-donts/) has the
>> following:
>> "Be careful when running mixed workloads on an HBase cluster. When you
>> have SLAs on HBase access independent of any MapReduce jobs (for
>> example, a transformation in Pig and serving data from HBase) run them
>> on separate clusters. HBase is CPU and Memory intensive with sporadic
>> large sequential I/O access while MapReduce jobs are primarily I/O
>> bound with fixed memory and sporadic CPU. Combined these can lead to
>> unpredictable latencies for HBase and CPU contention between the two.
>> A shared cluster also requires fewer task slots per node to
>> accommodate for HBase CPU requirements (generally half the slots on
>> each node that you would allocate without HBase). Also keep an eye on
>> memory swap. If HBase starts to swap there is a good chance it will
>> miss a heartbeat and get dropped from the cluster. On a busy cluster
>> this may overload another region, causing it to swap and a cascade of
>> failures."
>>
>> All my initial investigation/reading led me to believe that I should
>> create a common HDFS cluster and then run MapReduce and HBase against
>> that common HDFS cluster. But from the above Cloudera best practice it
>> seems like I should create two HDFS clusters, one for MapReduce and
>> one for HBase, and then move data around when required. Something does
>> not make sense with this best practice recommendation.
>>
>> Any thoughts and/or feedback will be much appreciated.
>>
>> --
>> View this message in context:
>> http://old.nabble.com/Shared-Cluster-between-HBase-and-MapReduce-tp33967219p33967219.html
>> Sent from the HBase User mailing list archive at Nabble.com.
>>
>

--
View this message in context: http://old.nabble.com/Shared-Cluster-between-HBase-and-MapReduce-tp33967219p33973918.html
Sent from the HBase User mailing list archive at Nabble.com.
