This might get outdated quickly, as EMR keeps upgrading its Pig version and
everyone is on Pig 0.9.1 anyway. But here is my write-up for your review:
The main obstacles for running Pig on Elastic MapReduce (EMR) are:
* The Pig version installed on EMR is older than 0.8.1. (By some accounts
EMR has just upgraded its Pig version to 0.9.1.)
* The Hadoop version on EMR might not match the one Pig is using.
* The user you’re running Pig as might not have permissions on the HDFS
on the EMR cluster.
How to solve each one of these issues:
1. We will not be using the Pig that is installed on EMR. We will use an
EC2 instance as the Pig client, which compiles the Pig scripts and submits
MapReduce jobs to the Hadoop on EMR. For this to work, the Hadoop version that
Pig is using and what's installed on EMR must match (or at least be backward
compatible), i.e. the EMR Hadoop version should be >= Pig's Hadoop version.
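A quick way to sanity-check this (purely illustrative, host names are
placeholders):

    # on the EMR master node
    hadoop version
    # on the Pig client EC2 instance, once HADOOP_HOME is set up in step 2
    $HADOOP_HOME/bin/hadoop version
    # the EMR version should be >= the version Pig was built against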
2. The best way to do this is to copy over the Hadoop directory from
one of the EMR instances to the Pig client EC2 machine. The next problem is to
make Pig use this Hadoop rather than the one it's been using. For Pig 0.8.1 or
earlier, the Pig jar has the Hadoop classes bundled within it, so any attempt
at making Pig use the jars downloaded from EMR fails. The solution was to use
Pig 0.9.1, which has a pig-withouthadoop.jar. When you use this, Pig will use
whichever Hadoop HADOOP_HOME points to, which in this case will be the
directory where you downloaded the EMR jars and configs.
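Roughly something like this (the key file, host name and paths are
placeholders; adjust the remote path to wherever Hadoop lives on your EMR
master):

    # copy the Hadoop install + configs from the EMR master to the Pig client
    scp -r -i mykey.pem hadoop@<EMR master public DNS>:/home/hadoop ~/emr-hadoop
    # make Pig (and your shell) use this Hadoop
    export HADOOP_HOME=~/emr-hadoop
    export PATH=$HADOOP_HOME/bin:$PATH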
3. Now that you are using Pig 0.9.1, your copy might have a bug in the
pig executable script (in <Pig install dir>/bin) where it does not respect
HADOOP_HOME. If so, patch the script.
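A quick, purely illustrative way to check what the script is doing:

    # see how bin/pig resolves Hadoop ($PIG_HOME is your Pig install dir)
    grep -n HADOOP $PIG_HOME/bin/pig
    # make sure the hadoop that finally gets run is $HADOOP_HOME/bin/hadoop;
    # if the script overrides HADOOP_HOME, export or hard-code it there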
4. Now you want Pig to use the JobTracker and NameNode of the EMR
cluster you want the computation to run on. Follow one of the usual ways to do
this (a sketch of all three follows the list):
a. -Dmapred.job.tracker=<jt:port> -Dfs.default.name=<nn:port> on the command
line. The JT and NN IP will be the internal 10.xxx.xxx.xxx IP of the EMR
master node; the ports are 9000 and 9001 for the NN and JT respectively.
b. The pig.properties file in Pig's conf dir.
c. core-site.xml and mapred-site.xml in the local $HADOOP_HOME/conf dir.
The precedence is a > b > c.
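A rough sketch of the three options (the 10.xxx.xxx.xxx IP is your EMR
master's internal IP; the script name is made up):

    # (a) per run, on the command line
    pig -Dmapred.job.tracker=10.xxx.xxx.xxx:9001 \
        -Dfs.default.name=hdfs://10.xxx.xxx.xxx:9000 \
        myscript.pig
    # (b) or put these two lines in $PIG_HOME/conf/pig.properties:
    #       fs.default.name=hdfs://10.xxx.xxx.xxx:9000
    #       mapred.job.tracker=10.xxx.xxx.xxx:9001
    # (c) or set fs.default.name in $HADOOP_HOME/conf/core-site.xml and
    #     mapred.job.tracker in $HADOOP_HOME/conf/mapred-site.xml to the
    #     same values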
5. Now Pig will start, but it will fail if the user you are running Pig as
does not match the default EMR user, which is hadoop. So this is what I do on
the EMR cluster:
a. hadoop dfs -fs hdfs://<EMR internal ip 10.xxx.xxx.xxx>:9000 -mkdir /user/piguser
   hadoop dfs -fs hdfs://<EMR internal ip 10.xxx.xxx.xxx>:9000 -chmod -R 777 /
b. You can argue that 777 is too generous, but I don't care, since only
temporary files are stored there and they are gone once my instance is gone.
All my real data is on S3.
Now you should be all set.
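For example, a run from the Pig client ends up looking something like this
(bucket and script names are made up; the JT/NN settings are assumed to be in
pig.properties per step 4):

    pig -param INPUT=s3n://my-bucket/input \
        -param OUTPUT=s3n://my-bucket/output \
        myscript.pig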
Only steps 4 & 5 need to be done every time you start a new EMR cluster.
-Ayon