Thanks, everyone, for your answers. I want to ask one more thing: I have written a program (my task) that contains Hive JDBC code and Sqoop commands for importing and exporting the tables. If I create a JAR of my program and put it on EMR, do I need to do anything extra, such as writing mappers/reducers, to execute the program? Or can I simply create the JAR and run it?
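For context, the driver class I have in mind is roughly shaped like the sketch below (a simplified illustration only, not my actual code; the connection strings, table names, query, and Hive JDBC driver class are placeholders that would need to match the Hive/Sqoop versions installed on the cluster):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.Arrays;

public class MyEmrTask {

    public static void main(String[] args) throws Exception {
        // 1) Import the source table from the EC2 database into Hive via Sqoop.
        runCommand(new String[] {
                "sqoop", "import",
                "--connect", "jdbc:mysql://<ec2-host>/<db>",      // placeholder
                "--username", "<user>", "--password", "<pass>",   // placeholder
                "--table", "source_table",
                "--hive-import"
        });

        // 2) Apply the processing logic in Hive over JDBC.
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");  // driver class depends on Hive version
        Connection con = DriverManager.getConnection(
                "jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        // Placeholder query standing in for the actual algorithm.
        stmt.execute("CREATE TABLE result_table AS "
                + "SELECT id, COUNT(*) AS cnt FROM source_table GROUP BY id");
        stmt.close();
        con.close();

        // 3) Export the generated result table back to EC2 via Sqoop.
        runCommand(new String[] {
                "sqoop", "export",
                "--connect", "jdbc:mysql://<ec2-host>/<db>",      // placeholder
                "--username", "<user>", "--password", "<pass>",   // placeholder
                "--table", "result_table",
                "--export-dir", "/user/hive/warehouse/result_table"
        });
    }

    // Shells out to the sqoop CLI and fails fast on a non-zero exit code.
    private static void runCommand(String[] cmd) throws Exception {
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        if (p.waitFor() != 0) {
            throw new RuntimeException("Command failed: " + Arrays.toString(cmd));
        }
    }
}

The whole flow (Sqoop import -> Hive JDBC processing -> Sqoop export) runs from a single main(), so I am hoping it can be submitted as one custom JAR step.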
--
Regards,
Bhavesh Shah

On Tue, Apr 24, 2012 at 7:20 AM, Mark Grover <mgro...@oanda.com> wrote:
> Hi Bhavesh,
>
> To answer your questions:
>
> 1) S3 terminology uses the word "object" and I am sure they have good
> reasons as to why, but for us Hive'ers, an S3 object is the same as a file
> stored on S3. The complete path to the file would be what Amazon calls the
> S3 "key" and the corresponding value would be the contents of the file,
> e.g. s3://my_bucket/tables/log.txt would be the key and the actual content
> of the file would be the S3 object. You can use the AWS web console to
> create a bucket and use tools like s3cmd (http://s3tools.org/s3cmd) to put
> data onto S3.
>
> However, like Kyle said, you don't necessarily need to use S3. S3 is
> typically only used when you want persistent storage of data. Most people
> would store their input logs/files on S3 for Hive processing and also
> store the final aggregations and results on S3 for future retrieval. If
> you are just temporarily loading some data into Hive, processing it and
> exporting it out, you don't have to worry about S3. The nodes that form
> your cluster have ephemeral storage that forms the HDFS. You can just use
> that. The only side effect is that you will lose all your data in HDFS
> once you terminate the cluster. If that's ok, don't worry about S3.
>
> EMR instances are basically EC2 instances with some additional setup done
> on them. Transferring data between EC2 and EMR instances should be simple,
> I'd think. If your data is present in EBS volumes, you could look into
> adding an EMR bootstrap action that mounts that same EBS volume onto your
> EMR instances. It might be easier if you can do it without all the fancy
> mounting business, though.
>
> Also, keep in mind that there might be costs for data transfers across
> Amazon data centers; you would want to keep your S3 buckets, EMR cluster
> and EC2 instances in the same region, if at all possible. Within the same
> region, there shouldn't be any extra transfer costs.
>
> 2) Yeah, EMR supports custom jars. You can specify them at the time you
> create your cluster. This should require minimal porting changes to your
> jar itself, since it runs on Hadoop and Hive which are the same as (well,
> close enough to) what you installed on your local cluster vs. what's
> installed on EMR.
>
> 3) Like Kyle said, Sqoop with EMR should be OK.
>
> Good luck!
> Mark
>
>
> Mark Grover, Business Intelligence Analyst
> OANDA Corporation
>
> www: oanda.com
> www: fxtrade.com
> e: mgro...@oanda.com
>
> "Best Trading Platform" - World Finance's Forex Awards 2009.
> "The One to Watch" - Treasury Today's Adam Smith Awards 2009.
>
>
> ----- Original Message -----
> From: "Kyle Mulka" <kyle.mu...@gmail.com>
> To: user@hive.apache.org
> Cc: user@hive.apache.org, d...@hive.apache.org
> Sent: Monday, April 23, 2012 10:55:36 AM
> Subject: Re: Doubts related to Amazon EMR
>
> It is possible to install Sqoop on AWS EMR. I've got some scripts I can
> publish later. You are not required to use S3 to store files and can use
> the local (temporary) HDFS instead. After you have Sqoop installed, you
> can import your data with it into HDFS, run your calculations in HDFS,
> then export your data back out using Sqoop again.
>
> --
> Kyle Mulka
> http://www.kylemulka.com
>
> On Apr 23, 2012, at 8:42 AM, Bhavesh Shah <bhavesh25s...@gmail.com>
> wrote:
>
> Hello all,
> I want to deploy my task on Amazon EMR.
> But as I am new to Amazon Web Services, I am confused about the concepts.
>
> My use case:
>
> I want to import large data from EC2 through Sqoop into Hive. The
> imported data will be processed in Hive by applying some algorithm and
> will generate some result (in table form, in Hive only). The generated
> result will then be exported back to EC2, again through Sqoop only.
>
> I am new to Amazon Web Services and want to implement this use case with
> the help of AWS EMR. I have implemented it on my local machine.
>
> I have read some links related to AWS EMR about launching an instance,
> what EMR is, how it works, etc. I have some doubts about EMR:
>
> 1) EMR uses S3 buckets, which hold the input and output data of the
> Hadoop processing (in the form of objects). ---> I don't get how to store
> the data in the form of objects on S3 (my data will be files).
>
> 2) As already said, I have implemented a task for my use case in Java. So
> if I create the JAR of my program and create the job flow with a custom
> JAR, will it be possible to implement it like this, or do I need to do
> something extra for that?
>
> 3) As I said in my use case, I want to export my result back to EC2 with
> the help of Sqoop. Does EMR have support for Sqoop?
>
> If you have any kind of idea related to AWS, please reply with your
> answer as soon as possible. I want to do this as early as possible.
>
> Many thanks.
>
> --
> Regards,
> Bhavesh Shah