Mohit,

Adding to what Andy and Vaibhav have listed: you'll need to ensure that the Hadoop versions running on EMR and on your HBase cluster are compatible if you want to run MapReduce jobs from EMR against an external HBase cluster.
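A minimal sketch of that compatibility check (the version strings below are illustrative assumptions; Hadoop of this era effectively required matching versions for RPC compatibility, so the sketch compares the full version string):

```shell
#!/usr/bin/env bash
# Hypothetical sketch: compare the Hadoop version on the EMR cluster with
# the one your external HBase cluster was built against. In practice you
# would capture each with something like:
#   emr_version=$(hadoop version | head -1 | awk '{print $2}')
# Hard-coded here so the sketch is self-contained and runnable.
emr_version="0.20.205"
hbase_hadoop_version="0.20.205"

if [ "$emr_version" = "$hbase_hadoop_version" ]; then
    echo "compatible"
else
    echo "INCOMPATIBLE: EMR=$emr_version HBase cluster=$hbase_hadoop_version"
fi
```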
If you choose to run HBase on your EMR cluster and don't want it to be torn down on job completion, start the cluster with the alive flag. However, the moment the health of your master node goes bad (this does not happen very often, but is not unheard of either; it's more common on EC2 than on physical hardware), the EMR cluster will terminate. Read up on the semantics of the alive flag and termination protection to understand the behavior better.

Another thing to be aware of while running HBase on EMR: you will most likely be limited to keeping your HBase master and ZooKeeper on the node running your NameNode and JobTracker (a.k.a. the EMR master). You can run multiple masters and ZooKeepers, or have separate nodes outside of the existing EMR cluster, but you will need to do extra work (like adding nodes to the same security groups, or spinning up instances separately after the EMR cluster is up).

It comes down to specifying your requirements clearly and then figuring out the right solution. :) You'll get plenty of help on the mailing list.

Hope this helps.

-Amandeep

On Sat, Mar 3, 2012 at 7:12 PM, Vaibhav Puranik <[email protected]> wrote:
> Mohit,
>
> I have written the blog post.
>
> EMR is nothing but MapReduce. HBase provides TableInputFormat. With
> TableInputFormat and the TableMapReduceUtil class (
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapReduceUtil.html
> ), you can specify your source as HBase - hosted anywhere, as long as
> it's accessible over the internet. In doing so, if HBase is not hosted on
> the same Hadoop cluster (which it won't be in the case of an EMR job), you
> will be sacrificing data locality (we are okay with that).
>
> Regards,
> Vaibhav
>
> On Sat, Mar 3, 2012 at 9:21 AM, Andrew Purtell <[email protected]> wrote:
>
> > I think there are a couple of things conflated here. Let me make four
> > brief points, and then feel free to follow up where you would like more
> > information.
> >
> > 1) Many run HBase (and self-hosted Hadoop) on EC2. These clusters have
> > their own HDFS on EBS or instance store volumes.
> >
> > 2) You cannot run HBase backed by S3. Search the other HBase user list
> > emails on the subject. But this of course does not mean you cannot run
> > HBase on EC2. (See point 1.)
> >
> > 3) Your EMR jobs can talk to your other EC2 resources, such as an HBase
> > cluster running off to the side.
> >
> > 4) You can perform custom setup-time actions for your EMR clusters, which
> > can set up HBase to run (using the cluster's HDFS file system). Your EMR
> > job then has a transient HBase for doing things like holding large
> > intermediate representations (sparse matrices or whatever) that require
> > random access. Of course, when the EMR job is complete, everything will
> > be torn down.
> >
> > Best regards,
> >
> >    - Andy
> >
> >
> > On Mar 3, 2012, at 3:45 AM, Mohit Gupta <[email protected]>
> > wrote:
> >
> > > Hi,
> > >
> > > I am a bit confused about using HBase with EMR. In one of the previous
> > > threads (and in the EMR documentation,
> > > http://aws.amazon.com/elasticmapreduce/), it is said that S3 is the
> > > only option available to be used as a source/destination at the moment.
> > > But I have come across a couple of blogs saying that people are
> > > actually using HBase with EMR (one is
> > > http://whynosql.com/why-we-run-our-hbase-on-ec2/ ).
> > > I have a scenario where running EMR with HBase would be really useful.
> > > Please let me know if it's possible or if there is any workaround
> > > available for this (like first transferring the data to S3 and then to
> > > EMR).
> > >
> > >
> > > --
> > > Best Regards,
> > >
> > > Mohit Gupta
> > > Software Engineer at Vdopia Inc.
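Putting Amandeep's alive-flag advice and Andy's point 4 together, a launch along these lines would keep the cluster up after the job finishes. This is a hedged sketch against the elastic-mapreduce Ruby CLI of that era; the bucket name, script name, and instance sizing are made-up assumptions, so check your CLI version's documentation for the exact flags. The sketch only assembles and prints the command (a dry run) rather than launching billable instances:

```shell
#!/usr/bin/env bash
# Sketch: a long-running ("alive") EMR cluster with a custom bootstrap
# action that installs HBase on the cluster's HDFS. The S3 path and names
# below are hypothetical placeholders, not real resources.
cmd="elastic-mapreduce --create --alive \
  --name hbase-on-emr \
  --num-instances 3 --instance-type m1.large \
  --bootstrap-action s3://my-bucket/bootstrap/install-hbase.sh"

# Dry run: print the command instead of executing it.
echo "$cmd"
```

Remember that, as noted above, the alive flag is not the same as termination protection: an unhealthy master node can still take the cluster down.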
