Hi all,

Thank you so much. It has been a great help.
As of now, I am exploring the idea of running an HBase cluster on EC2
(EBS-backed) and using EMR to run the heavy ad-hoc jobs.

I got confused by reading in a couple of places (especially this Amazon EMR
forum thread
https://forums.aws.amazon.com/thread.jspa?messageID=238747&#238747 and the
EMR documentation, where it is mentioned in a number of places that 'The
service runs job flows in Amazon EC2 and stores input and output data in
Amazon S3 and/or Amazon DynamoDB.') that HBase can't be used with EMR. But
now, after going through your replies, I understand it this way: to use Hive
on EMR, input and output need to be on S3 (or now DynamoDB as well), and for
other input/output sources (like an EC2 HBase cluster), one needs to write a
custom jar for every single job/query.
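If so, I imagine such a custom jar would look roughly like the sketch below. This is only my guess at the shape of it, built around HBase's TableMapReduceUtil; the ZooKeeper quorum hostname, table name, and mapper logic are placeholders I made up, not real values.

```java
// Sketch: a custom-jar MapReduce driver that reads from an external
// (EC2-hosted) HBase cluster. Hostname and table name are placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RemoteHBaseScanJob {

  // Emits one record per row scanned; replace with real per-row logic.
  static class RowCountMapper extends TableMapper<Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(ImmutableBytesWritable rowKey, Result row, Context ctx)
        throws java.io.IOException, InterruptedException {
      ctx.write(new Text("rows"), ONE);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Point the client at the external HBase cluster's ZooKeeper quorum.
    // (Placeholder hostname; it must be reachable from the EMR nodes.)
    conf.set("hbase.zookeeper.quorum",
        "ec2-xx-xx-xx-xx.compute-1.amazonaws.com");

    Job job = new Job(conf, "scan-remote-hbase");
    job.setJarByClass(RemoteHBaseScanJob.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // larger scan batches cut RPC round trips
    scan.setCacheBlocks(false);  // don't pollute the block cache from MR

    TableMapReduceUtil.initTableMapperJob(
        "mytable", scan, RowCountMapper.class,
        Text.class, IntWritable.class, job);

    // Output could go to S3 so results survive the EMR cluster's teardown.
    FileOutputFormat.setOutputPath(job, new Path(args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```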

Please let me know if I have got this right or am still missing something.

Also, the idea of running a transient HBase alongside the normal cluster is
interesting.
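For the transient-HBase case, would launching it look something like this? Just a sketch: the alive flag came from your reply, but the bootstrap script path, instance count, and instance type below are placeholders of mine, not verified values.

```shell
# Sketch: keep the cluster alive after the job flow finishes, and install
# HBase via a bootstrap action. Script path and sizing are placeholders.
elastic-mapreduce --create --alive \
  --name "transient-hbase-cluster" \
  --num-instances 5 \
  --instance-type m1.large \
  --bootstrap-action s3://my-bucket/bootstrap/install-hbase.sh
```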



On Sun, Mar 4, 2012 at 2:50 AM, Amandeep Khurana <[email protected]> wrote:

> Mohit,
>
> Adding to what Andy and Vaibhav have listed - you'll need to ensure that
> the Hadoop versions running in EMR and your HBase cluster are compatible if
> you want to run MapReduce from EMR onto an external HBase cluster.
>
> If you choose to run HBase on your EMR cluster and don't want it to tear
> down on job completion, start the cluster with the alive flag. However, the
> moment the health of your master node goes bad (it does not happen very
> often, but it is not unheard of either, and is more common in EC2 than on
> physical hardware), the EMR cluster will terminate. Read up on the
> semantics of the alive flag and termination protection to understand the
> behavior better.
>
> Another thing to be aware of while running HBase on EMR: you will most
> likely be limited to keeping your HBase master and ZK on the node running
> your Namenode and Jobtracker (aka the EMR master). You can run multiple
> masters and ZK peers, or have separate nodes outside of the existing EMR
> cluster, but you will need to do extra work (like adding nodes to the same
> security groups, or spinning up instances separately after the EMR cluster
> is up).
>
> It comes down to specifying your requirements clearly and then figuring out
> the right solution. :) You'll get plenty of help on the mailing list.
>
> Hope this helps.
>
> -Amandeep
>
> On Sat, Mar 3, 2012 at 7:12 PM, Vaibhav Puranik <[email protected]>
> wrote:
>
> > Mohit,
> >
> > I have written the blogpost.
> >
> > EMR is nothing but MapReduce. HBase provides TableInputFormat. With the
> > TableInputFormat and TableMapReduceUtil (
> > http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapReduceUtil.html
> > ) classes, you can specify HBase as your source - hosted anywhere, as
> > long as it's accessible over the internet. In doing so, if HBase is not
> > hosted on the same Hadoop cluster (which it won't be in the case of an
> > EMR job), you will be sacrificing data locality (we are okay with that).
> >
> > Regards,
> > Vaibhav
> >
> > On Sat, Mar 3, 2012 at 9:21 AM, Andrew Purtell <[email protected]>
> wrote:
> >
> > > I think there are a couple of things conflated here. Let me make four
> > > brief points and then feel free to follow up where you would like more
> > > information.
> > >
> > > 1) Many run HBase (and self-hosted Hadoop) on EC2. These clusters have
> > > their own HDFS on EBS or instance store volumes.
> > >
> > > 2) You cannot run HBase backed by S3. Search the HBase user list
> > > archives on the subject. But this of course does not mean you cannot
> > > run HBase on EC2. (See point 1.)
> > >
> > > 3) Your EMR jobs can talk to your other EC2 resources, such as an HBase
> > > cluster running off to the side.
> > >
> > > 4) You can perform custom setup-time actions for your EMR clusters,
> > > which can set up HBase to run (using the cluster's HDFS file system).
> > > Then your EMR job has a transient HBase for doing things like holding
> > > large intermediate representations (sparse matrix or whatever) that
> > > require random access. Of course, here, when the EMR job is complete,
> > > everything will be torn down.
> > >
> > > Best regards,
> > >
> > >    - Andy
> > >
> > >
> > > On Mar 3, 2012, at 3:45 AM, Mohit Gupta <[email protected]
> >
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > I am a bit confused about using HBase with EMR. In one of the
> > > > previous threads (and in the EMR documentation
> > > > http://aws.amazon.com/elasticmapreduce/), it is said that S3 is the
> > > > only option available to be used as a source/destination at the
> > > > moment. But I have come across a couple of blogs saying that people
> > > > are actually using HBase with EMR (one is
> > > > http://whynosql.com/why-we-run-our-hbase-on-ec2/ ).
> > > > I have a scenario where running EMR with HBase would be really
> > > > useful. Please let me know if it's possible or if there is any
> > > > workaround available for this (like first transferring the data to
> > > > S3 and then to EMR).
> > > >
> > > >
> > > > --
> > > > Best Regards,
> > > >
> > > > Mohit Gupta
> > > > Software Engineer at Vdopia Inc.
> > >
> >
>



-- 
Best Regards,

Mohit Gupta
