Ok - thank you.

On Tue, Mar 6, 2012 at 12:52 AM, Amandeep Khurana <[email protected]> wrote:
> Correct - you can access any external service by using a custom jar.
>
> On Sun, Mar 4, 2012 at 10:55 PM, Mohit Gupta <[email protected]> wrote:
>
> > Hi All,
> >
> > Thank you so much. It has been a great help.
> > As of now, I am exploring the idea of running an HBase cluster on EC2
> > (EBS backed) and using EMR to run the heavy ad-hoc jobs.
> >
> > I got confused by reading in a couple of places (especially this Amazon
> > EMR forum thread,
> > https://forums.aws.amazon.com/thread.jspa?messageID=238747&#238747 ,
> > and the EMR documentation, which says in a number of places that 'The
> > service runs job flows in Amazon EC2 and stores input and output data in
> > Amazon S3 and/or Amazon DynamoDB.') that HBase can't be used with EMR.
> > But now, after going through your replies, I understand it this way: to
> > use Hive on EMR, input and output need to be on S3 (or now DynamoDB as
> > well), and to use other input/output sources (like an EC2 HBase
> > cluster), I need to write a custom jar for every single job/query.
> >
> > Please let me know if I have got this right or am still missing
> > something.
> >
> > Also, interesting idea of running a transient HBase beside the normal
> > cluster.
> >
> > On Sun, Mar 4, 2012 at 2:50 AM, Amandeep Khurana <[email protected]> wrote:
> >
> > > Mohit,
> > >
> > > Adding to what Andy and Vaibhav have listed - you'll need to ensure
> > > that the Hadoop versions running in EMR and your HBase cluster are
> > > compatible if you want to run MapReduce from EMR onto an external
> > > HBase cluster.
> > >
> > > If you choose to run HBase on your EMR cluster and don't want it to be
> > > torn down on job completion, start the cluster with the alive flag.
> > > However, the moment the health of your master node goes bad (this does
> > > not happen very often, but is not unheard of either; it's more common
> > > on EC2 than on physical hardware), the EMR cluster will terminate.
> > > Read up on the semantics of the alive flag and termination protection
> > > to understand the behavior better.
> > >
> > > Another thing to be aware of while running HBase on EMR: you will most
> > > likely be limited to keeping your HBase master and ZK on the node
> > > running your Namenode and Jobtracker (aka the EMR master). You can run
> > > multiple masters and ZK nodes, and probably also have separate nodes
> > > outside of the existing EMR cluster, but you will need to do extra
> > > work (like adding nodes to the same security groups and spinning up
> > > instances separately after the EMR cluster is up).
> > >
> > > It comes down to specifying your requirements clearly and then
> > > figuring out the right solution. :) You'll get plenty of help on the
> > > mailing list.
> > >
> > > Hope this helps.
> > >
> > > -Amandeep
> > >
> > > On Sat, Mar 3, 2012 at 7:12 PM, Vaibhav Puranik <[email protected]> wrote:
> > >
> > > > Mohit,
> > > >
> > > > I have written the blogpost.
> > > >
> > > > EMR is nothing but MapReduce. HBase provides TableInputFormat. With
> > > > TableInputFormat and the TableMapReduceUtil class
> > > > (http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapReduceUtil.html),
> > > > you can specify your source as HBase, hosted anywhere as long as
> > > > it's accessible over the internet. In doing so, if HBase is not
> > > > hosted on the same Hadoop cluster (which it won't be in the case of
> > > > an EMR job), you will be sacrificing data locality (we are okay with
> > > > that).
> > > >
> > > > Regards,
> > > > Vaibhav
> > > >
> > > > On Sat, Mar 3, 2012 at 9:21 AM, Andrew Purtell <[email protected]> wrote:
> > > >
> > > > > I think there are a couple of things conflated here. Let me make
> > > > > four brief points and then feel free to follow up where you would
> > > > > like more information.
> > > > >
> > > > > 1) Many run HBase (and self-hosted Hadoop) on EC2.
> > > > > These clusters have their own HDFS on EBS or instance store
> > > > > volumes.
> > > > >
> > > > > 2) You cannot run HBase backed by S3. Search for other HBase user
> > > > > list emails on the subject. But this of course does not mean you
> > > > > cannot run HBase on EC2. (See point 1.)
> > > > >
> > > > > 3) Your EMR jobs can talk to your other EC2 resources, such as an
> > > > > HBase cluster running off to the side.
> > > > >
> > > > > 4) You can perform custom setup-time actions for your EMR
> > > > > clusters, which can set up HBase to run (using the cluster's HDFS
> > > > > file system). Then your EMR job has a transient HBase for doing
> > > > > things like holding large intermediate representations (sparse
> > > > > matrix or whatever) that require random access. Of course, when
> > > > > the EMR job is complete, everything will be torn down.
> > > > >
> > > > > Best regards,
> > > > >
> > > > > - Andy
> > > > >
> > > > > On Mar 3, 2012, at 3:45 AM, Mohit Gupta <[email protected]> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I am a bit confused about using HBase with EMR. In one of the
> > > > > > previous threads (and in the EMR documentation,
> > > > > > http://aws.amazon.com/elasticmapreduce/), it is said that S3 is
> > > > > > the only option available to be used as source/destination at
> > > > > > the moment. But I have come across a couple of blogs saying that
> > > > > > people are actually using HBase with EMR (one is
> > > > > > http://whynosql.com/why-we-run-our-hbase-on-ec2/ ).
> > > > > > I have a scenario where running EMR with HBase would be really
> > > > > > useful. Please let me know if it's possible or if there is any
> > > > > > workaround available for this (like first transferring the data
> > > > > > to S3 and then to EMR).
> > > > > >
> > > > > > --
> > > > > > Best Regards,
> > > > > >
> > > > > > Mohit Gupta
> > > > > > Software Engineer at Vdopia Inc.
>
> > --
> > Best Regards,
> >
> > Mohit Gupta

--
Best Regards,

Mohit Gupta
Software Engineer at Vdopia Inc.
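
For anyone following up on the "alive flag" and termination protection mentioned earlier in the thread, here is a minimal sketch of how those two settings might be expressed as a request for the EMR RunJobFlow API. Nothing below comes from the thread itself: the helper name `build_run_job_flow_request`, the cluster name, and the instance types are hypothetical, and the mapping of the "alive flag" to the `KeepJobFlowAliveWhenNoSteps` field is an assumption about the API. The code only builds a dictionary and does not contact AWS.

```python
def build_run_job_flow_request(name, instance_count,
                               keep_alive=True, termination_protected=True):
    """Build keyword arguments for an EMR RunJobFlow call (sketch only)."""
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    return {
        "Name": name,
        "Instances": {
            # Hypothetical instance types; pick ones that suit your workload.
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": instance_count,
            # Assumed equivalent of the "alive" flag: keep the cluster
            # running after the last step finishes.
            "KeepJobFlowAliveWhenNoSteps": keep_alive,
            # Termination protection: guard against accidental termination
            # requests made through the API or console.
            "TerminationProtected": termination_protected,
        },
    }


# Hypothetical cluster name and size, for illustration only.
request = build_run_job_flow_request("hbase-on-emr", instance_count=3)
print(request["Instances"]["KeepJobFlowAliveWhenNoSteps"])
```

With boto3 installed and credentials configured, a dictionary like this could presumably be passed as `boto3.client("emr").run_job_flow(**request)`, together with the role and release settings your account requires. Neither setting changes the behavior Amandeep describes when the master node itself becomes unhealthy.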
