What I mean by copying EMR jars to the EC2 box is copying them into your own directory. An EC2 box is your own box & account, so there shouldn't be anything objectionable that needs to be checked with Amazon.

-Ayon
See My Photos on Flickr
Also check out my Blog for answers to commonly asked questions.
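For the mechanics of that copy, a minimal sketch follows; it assumes the default hadoop login on the EMR master, the key pair the cluster was launched with, and that the Hadoop install lives under /home/hadoop on the master (all of these are assumptions, so adjust the paths to whatever your AMI actually uses):

    # from the EC2 Pig client; <emr-master> is the master's internal 10.x.x.x address
    rsync -avz -e "ssh -i ~/emr-keypair.pem" \
        hadoop@<emr-master>:/home/hadoop/ /opt/emr-hadoop/

    # Pig will pick this copy up once HADOOP_HOME points at it
    export HADOOP_HOME=/opt/emr-hadoop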
________________________________
From: Aniket Mokashi <[email protected]>
To: [email protected]; Ayon Sinha <[email protected]>
Sent: Friday, December 16, 2011 4:11 PM
Subject: Re: My notes for running Pig from EC2 to EMR

Amazon supports Pig 0.9.1 now. Take a look:
http://aws.amazon.com/releasenotes/Elastic-MapReduce/1044996466833146

Also, I am not very sure about copying EMR jars to EC2. You should check that with Amazon.

Thanks,
Aniket

On Fri, Dec 16, 2011 at 12:02 PM, Ayon Sinha <[email protected]> wrote:

>This might get outdated quickly, as EMR upgrades the Pig version and Pig 0.9.1 is being used by everyone anyway. But here is my write-up for your review:
>
>The main obstacles for running Pig on Elastic MapReduce (EMR) are:
>
> * The Pig version installed on EMR is older than 0.8.1. (By some accounts EMR just upgraded their Pig version to 0.9.1.)
> * The Hadoop version on EMR might not match the one Pig is using.
> * The user you're running Pig as might not have permissions on the HDFS of the EMR cluster.
>
>How to solve each of these issues:
> 1. We will not be using the Pig that is installed on EMR. We will use an EC2 instance as the Pig client, which compiles the Pig scripts and submits MapReduce jobs to the Hadoop on EMR. For this to work, the Hadoop version that Pig is using and the one installed on EMR must match (or at least be backward compatible), i.e. the EMR Hadoop version should be >= Pig's Hadoop version.
> 2. The best way to do this is to copy the Hadoop directory from one of the EMR instances to the Pig client EC2 machine. The next problem is to make Pig use this Hadoop rather than the one it has been using. For Pig 0.8.1 or earlier, the Pig jar has the Hadoop classes bundled within it, so any attempt at making Pig use the jars downloaded from EMR fails. The solution was to use Pig 0.9.1, which ships a "withouthadoop" Pig jar. When you use this, Pig will use whichever Hadoop HADOOP_HOME points to, which in this case will be the directory where you downloaded the EMR classes and configs (sketched below).
> 3. Now that you are using Pig 0.9.1, your version might have a bug in the pig executable script (in <Pig install dir>/bin) where it does not respect HADOOP_HOME. So patch the script.
> 4. Now you want Pig to use the JobTracker and NameNode of the EMR cluster you want the computation to run on. Follow one of the usual ways to do this (sketched below):
>    a. Pass -Dmapred.job.tracker=<jt:port> -Dfs.default.name=<nn:port> on the command line. The JT & NN IP will be the internal 10.xxx.xxx.xxx IP of the master EMR node; the ports are 9000 and 9001 for the NN & JT respectively.
>    b. Set them in the pig.properties file in the conf dir.
>    c. Change core-site.xml & mapred-site.xml in the local $HADOOP_HOME/conf dir.
>    The precedence is a > b > c.
> 5. Now Pig will start, but it will fail if the user you are running Pig as does not match the default EMR user, which is hadoop. So this is what I do on the EMR cluster:
>    a. hadoop dfs -fs hdfs://<EMR internal ip 10.xxx.xxx.xxx>:9000 -mkdir /user/piguser; hadoop dfs -fs hdfs://<EMR internal ip 10.xxx.xxx.xxx>:9000 -chmod -R 777 /
>    b. You can argue that 777 is too generous, but I don't care, as it's only temporary files that are stored there and they are gone once my instance is gone. All my real data is on S3.
>
>Now you should be all set.
>Only steps 4 & 5 need to be done every time you start your new EMR cluster.
>
> -Ayon

--
"...:::Aniket:::... Quetzalco@tl"
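A minimal sketch of the HADOOP_HOME side of steps 2-3 above, assuming Pig 0.9.1 is unpacked under /opt/pig-0.9.1 and the Hadoop copied from the EMR master sits in /opt/emr-hadoop (both paths are placeholders, not anything Pig or EMR mandates):

    # point Pig at the Hadoop copied from the EMR master
    export HADOOP_HOME=/opt/emr-hadoop
    export PATH=/opt/pig-0.9.1/bin:$HADOOP_HOME/bin:$PATH

    # sanity checks: hadoop should report the EMR cluster's version, and pig should start
    hadoop version
    pig -version

    # if Pig jobs still run against a Hadoop other than $HADOOP_HOME, that is the
    # bin/pig bug from step 3: patch the script so it puts the withouthadoop jar,
    # $HADOOP_HOME/conf and the $HADOOP_HOME jars on the classpath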
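And a sketch of step 4, showing the same two properties set each of the three ways listed; the 10.x.x.x address and the script name are placeholders for your own master IP and Pig script:

    # (a) per-invocation overrides (the -D flags go before the script name)
    pig -Dmapred.job.tracker=10.x.x.x:9001 \
        -Dfs.default.name=hdfs://10.x.x.x:9000 \
        myscript.pig

    # (b) the same two lines in <Pig install dir>/conf/pig.properties:
    #     fs.default.name=hdfs://10.x.x.x:9000
    #     mapred.job.tracker=10.x.x.x:9001

    # (c) the same two properties in core-site.xml and mapred-site.xml under the
    #     local $HADOOP_HOME/conf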
