What I mean by copying EMR jars to the EC2 box is copying them into your own directory. An EC2 box is your own box & account, so there shouldn't be anything objectionable that needs to be checked with Amazon.

-Ayon
See My Photos on Flickr
Also check out my Blog for answers to commonly asked questions.
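For the mechanics of that copy, a minimal sketch follows; it assumes the default hadoop login on the EMR master, the key pair the cluster was launched with, and that the Hadoop install lives under /home/hadoop on the master (all of these are assumptions, so adjust the paths to whatever your AMI actually uses):

    # from the EC2 Pig client; <emr-master> is the master's internal 10.x.x.x address
    rsync -avz -e "ssh -i ~/emr-keypair.pem" \
        hadoop@<emr-master>:/home/hadoop/ /opt/emr-hadoop/

    # Pig will pick this copy up once HADOOP_HOME points at it
    export HADOOP_HOME=/opt/emr-hadoop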
________________________________
From: Aniket Mokashi <[email protected]>
To: [email protected]; Ayon Sinha <[email protected]>
Sent: Friday, December 16, 2011 4:11 PM
Subject: Re: My notes for running Pig from EC2 to EMR

Amazon supports Pig 0.9.1 now. Take a look:
http://aws.amazon.com/releasenotes/Elastic-MapReduce/1044996466833146

Also, I am not very sure about copying EMR jars to EC2. You should check that with Amazon.

Thanks,
Aniket

On Fri, Dec 16, 2011 at 12:02 PM, Ayon Sinha <[email protected]> wrote:

>This might get outdated quickly, as EMR upgrades the Pig version and Pig 0.9.1 is being used by everyone anyway. But here is my write-up for your review:
>
>The main obstacles for running Pig on Elastic MapReduce (EMR) are:
>
> * The Pig version installed on EMR is older than 0.8.1. (By some accounts EMR just upgraded their Pig version to 0.9.1.)
> * The Hadoop version on EMR might not match the one Pig is using.
> * The user you're running Pig as might not have permissions on the HDFS of the EMR cluster.
>
>How to solve each of these issues:
> 1. We will not be using the Pig that is installed on EMR. We will use an EC2 instance as the Pig client, which compiles the Pig scripts and submits MapReduce jobs to the Hadoop on EMR. For this to work, the Hadoop version that Pig is using and the one installed on EMR must match (or at least be backward compatible), i.e. the EMR Hadoop version should be >= Pig's Hadoop version.
> 2. The best way to do this is to copy the Hadoop directory from one of the EMR instances to the Pig client EC2 machine. The next problem is to make Pig use this Hadoop rather than the one it has been using. For Pig 0.8.1 or earlier, the Pig jar has the Hadoop classes bundled within it, so any attempt at making Pig use the jars downloaded from EMR fails. The solution was to use Pig 0.9.1, which ships a "withouthadoop" Pig jar. When you use this, Pig will use whichever Hadoop HADOOP_HOME points to, which in this case will be the directory where you downloaded the EMR classes and configs (sketched below).
> 3. Now that you are using Pig 0.9.1, your version might have a bug in the pig executable script (in <Pig install dir>/bin) where it does not respect HADOOP_HOME. So patch the script.
> 4. Now you want Pig to use the JobTracker and NameNode of the EMR cluster you want the computation to run on. Follow one of the usual ways to do this (sketched below):
>    a. Pass -Dmapred.job.tracker=<jt:port> -Dfs.default.name=<nn:port> on the command line. The JT & NN IP will be the internal 10.xxx.xxx.xxx IP of the master EMR node; the ports are 9000 and 9001 for the NN & JT respectively.
>    b. Set them in the pig.properties file in the conf dir.
>    c. Change core-site.xml & mapred-site.xml in the local $HADOOP_HOME/conf dir.
>    The precedence is a > b > c.
> 5. Now Pig will start, but it will fail if the user you are running Pig as does not match the default EMR user, which is hadoop. So this is what I do on the EMR cluster:
>    a. hadoop dfs -fs hdfs://<EMR internal ip 10.xxx.xxx.xxx>:9000 -mkdir /user/piguser; hadoop dfs -fs hdfs://<EMR internal ip 10.xxx.xxx.xxx>:9000 -chmod -R 777 /
>    b. You can argue that 777 is too generous, but I don't care, as it's only temporary files that are stored there and they are gone once my instance is gone. All my real data is on S3.
>
>Now you should be all set.
>Only steps 4 & 5 need to be done every time you start your new EMR cluster.
>
> -Ayon

--
"...:::Aniket:::... Quetzalco@tl"
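A minimal sketch of the HADOOP_HOME side of steps 2-3 above, assuming Pig 0.9.1 is unpacked under /opt/pig-0.9.1 and the Hadoop copied from the EMR master sits in /opt/emr-hadoop (both paths are placeholders, not anything Pig or EMR mandates):

    # point Pig at the Hadoop copied from the EMR master
    export HADOOP_HOME=/opt/emr-hadoop
    export PATH=/opt/pig-0.9.1/bin:$HADOOP_HOME/bin:$PATH

    # sanity checks: hadoop should report the EMR cluster's version, and pig should start
    hadoop version
    pig -version

    # if Pig jobs still run against a Hadoop other than $HADOOP_HOME, that is the
    # bin/pig bug from step 3: patch the script so it puts the withouthadoop jar,
    # $HADOOP_HOME/conf and the $HADOOP_HOME jars on the classpath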
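And a sketch of step 4, showing the same two properties set each of the three ways listed; the 10.x.x.x address and the script name are placeholders for your own master IP and Pig script:

    # (a) per-invocation overrides (the -D flags go before the script name)
    pig -Dmapred.job.tracker=10.x.x.x:9001 \
        -Dfs.default.name=hdfs://10.x.x.x:9000 \
        myscript.pig

    # (b) the same two lines in <Pig install dir>/conf/pig.properties:
    #     fs.default.name=hdfs://10.x.x.x:9000
    #     mapred.job.tracker=10.x.x.x:9001

    # (c) the same two properties in core-site.xml and mapred-site.xml under the
    #     local $HADOOP_HOME/conf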
