Mohit,

Adding to what Andy and Vaibhav have listed: you'll need to ensure that the Hadoop versions running on EMR and on your HBase cluster are compatible if you want to run MapReduce jobs from EMR against an external HBase cluster.
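A minimal sketch of that compatibility check (the version strings below are illustrative assumptions; Hadoop of this era effectively required matching versions for RPC compatibility, so the sketch compares the full version string):

```shell
#!/usr/bin/env bash
# Hypothetical sketch: compare the Hadoop version on the EMR cluster with
# the one your external HBase cluster was built against. In practice you
# would capture each with something like:
#   emr_version=$(hadoop version | head -1 | awk '{print $2}')
# Hard-coded here so the sketch is self-contained and runnable.
emr_version="0.20.205"
hbase_hadoop_version="0.20.205"

if [ "$emr_version" = "$hbase_hadoop_version" ]; then
    echo "compatible"
else
    echo "INCOMPATIBLE: EMR=$emr_version HBase cluster=$hbase_hadoop_version"
fi
```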
If you choose to run HBase on your EMR cluster and don't want it to be torn down on job completion, start the cluster with the alive flag. However, the moment the health of your master node goes bad (this does not happen very often, but is not unheard of either; it's more common on EC2 than on physical hardware), the EMR cluster will terminate. Read up on the semantics of the alive flag and termination protection to understand the behavior better.

Another thing to be aware of while running HBase on EMR: you will most likely be limited to keeping your HBase master and ZooKeeper on the node running your NameNode and JobTracker (a.k.a. the EMR master). You can run multiple masters and ZooKeepers, or have separate nodes outside of the existing EMR cluster, but you will need to do extra work (like adding nodes to the same security groups, or spinning up instances separately after the EMR cluster is up).

It comes down to specifying your requirements clearly and then figuring out the right solution. :) You'll get plenty of help on the mailing list.

Hope this helps.

-Amandeep

On Sat, Mar 3, 2012 at 7:12 PM, Vaibhav Puranik <[email protected]> wrote:
> Mohit,
>
> I have written the blog post.
>
> EMR is nothing but MapReduce. HBase provides TableInputFormat. With
> TableInputFormat and the TableMapReduceUtil class (
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapReduceUtil.html
> ), you can specify your source as HBase - hosted anywhere, as long as
> it's accessible over the internet. In doing so, if HBase is not hosted on
> the same Hadoop cluster (which it won't be in the case of an EMR job), you
> will be sacrificing data locality (we are okay with that).
>
> Regards,
> Vaibhav
>
> On Sat, Mar 3, 2012 at 9:21 AM, Andrew Purtell <[email protected]> wrote:
>
> > I think there are a couple of things conflated here. Let me make four
> > brief points, and then feel free to follow up where you would like more
> > information.
> >
> > 1) Many run HBase (and self-hosted Hadoop) on EC2. These clusters have
> > their own HDFS on EBS or instance store volumes.
> >
> > 2) You cannot run HBase backed by S3. Search the other HBase user list
> > emails on the subject. But this of course does not mean you cannot run
> > HBase on EC2. (See point 1.)
> >
> > 3) Your EMR jobs can talk to your other EC2 resources, such as an HBase
> > cluster running off to the side.
> >
> > 4) You can perform custom setup-time actions for your EMR clusters, which
> > can set up HBase to run (using the cluster's HDFS file system). Your EMR
> > job then has a transient HBase for doing things like holding large
> > intermediate representations (sparse matrices or whatever) that require
> > random access. Of course, when the EMR job is complete, everything will
> > be torn down.
> >
> > Best regards,
> >
> >    - Andy
> >
> >
> > On Mar 3, 2012, at 3:45 AM, Mohit Gupta <[email protected]>
> > wrote:
> >
> > > Hi,
> > >
> > > I am a bit confused about using HBase with EMR. In one of the previous
> > > threads (and in the EMR documentation,
> > > http://aws.amazon.com/elasticmapreduce/), it is said that S3 is the
> > > only option available to be used as a source/destination at the moment.
> > > But I have come across a couple of blogs saying that people are
> > > actually using HBase with EMR (one is
> > > http://whynosql.com/why-we-run-our-hbase-on-ec2/ ).
> > > I have a scenario where running EMR with HBase would be really useful.
> > > Please let me know if it's possible or if there is any workaround
> > > available for this (like first transferring the data to S3 and then to
> > > EMR).
> > >
> > >
> > > --
> > > Best Regards,
> > >
> > > Mohit Gupta
> > > Software Engineer at Vdopia Inc.
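Putting Amandeep's alive-flag advice and Andy's point 4 together, a launch along these lines would keep the cluster up after the job finishes. This is a hedged sketch against the elastic-mapreduce Ruby CLI of that era; the bucket name, script name, and instance sizing are made-up assumptions, so check your CLI version's documentation for the exact flags. The sketch only assembles and prints the command (a dry run) rather than launching billable instances:

```shell
#!/usr/bin/env bash
# Sketch: a long-running ("alive") EMR cluster with a custom bootstrap
# action that installs HBase on the cluster's HDFS. The S3 path and names
# below are hypothetical placeholders, not real resources.
cmd="elastic-mapreduce --create --alive \
  --name hbase-on-emr \
  --num-instances 3 --instance-type m1.large \
  --bootstrap-action s3://my-bucket/bootstrap/install-hbase.sh"

# Dry run: print the command instead of executing it.
echo "$cmd"
```

Remember that, as noted above, the alive flag is not the same as termination protection: an unhealthy master node can still take the cluster down.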
