This might get outdated quickly, as EMR keeps upgrading its Pig version and
everyone is on Pig 0.9.1 anyway. But here is my write-up for your review:
The main obstacles for running Pig on Elastic MapReduce (EMR) are:
* The Pig version installed on EMR is older than 0.8.1. (By some accounts
EMR has just upgraded its Pig version to 0.9.1.)
* The Hadoop version on EMR might not match the one Pig is using.
* The user you’re running Pig as might not have permissions on the HDFS
on the EMR cluster.
How to solve each one of these issues:
1. We will not be using the Pig that is installed on EMR. We will use an
EC2 instance as the Pig client, which compiles the Pig scripts and submits
MapReduce jobs to the Hadoop on EMR. For this to work, the Hadoop version that
Pig is using and what's installed on EMR must match (or at least be backward
compatible), i.e. the EMR Hadoop version should be >= Pig's Hadoop version.
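A quick way to sanity-check this (purely illustrative, host names are
placeholders):

    # on the EMR master node
    hadoop version
    # on the Pig client EC2 instance, once HADOOP_HOME is set up in step 2
    $HADOOP_HOME/bin/hadoop version
    # the EMR version should be >= the version Pig was built against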
2. The best way to do this is to copy over the Hadoop directory from
one of the EMR instances to the Pig client EC2 machine. The next problem is to
make Pig use this Hadoop rather than the one it's been using. For Pig 0.8.1 or
earlier, the Pig jar has the Hadoop classes bundled within it, so any attempt
at making Pig use the jars downloaded from EMR fails. The solution was to use
Pig 0.9.1, which has a pig-withouthadoop.jar. When you use this, Pig will use
whichever Hadoop HADOOP_HOME points to, which in this case will be the
directory where you downloaded the EMR jars and configs.
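Roughly something like this (the key file, host name and paths are
placeholders; adjust the remote path to wherever Hadoop lives on your EMR
master):

    # copy the Hadoop install + configs from the EMR master to the Pig client
    scp -r -i mykey.pem hadoop@<EMR master public DNS>:/home/hadoop ~/emr-hadoop
    # make Pig (and your shell) use this Hadoop
    export HADOOP_HOME=~/emr-hadoop
    export PATH=$HADOOP_HOME/bin:$PATH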
3. Now that you are using Pig 0.9.1, your copy might have a bug in the
pig executable script (in <Pig install dir>/bin) where it does not respect
HADOOP_HOME. If so, patch the script.
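A quick, purely illustrative way to check what the script is doing:

    # see how bin/pig resolves Hadoop ($PIG_HOME is your Pig install dir)
    grep -n HADOOP $PIG_HOME/bin/pig
    # make sure the hadoop that finally gets run is $HADOOP_HOME/bin/hadoop;
    # if the script overrides HADOOP_HOME, export or hard-code it there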
4. Now you want Pig to use the JobTracker and NameNode of the EMR
cluster you want the computation to run on. Follow one of the usual ways to do
this (a sketch of all three follows the list):
a. -Dmapred.job.tracker=<jt:port> -Dfs.default.name=<nn:port> on the command
line. The JT and NN IP will be the internal 10.xxx.xxx.xxx IP of the EMR
master node; the ports are 9000 and 9001 for the NN and JT respectively.
b. The pig.properties file in Pig's conf dir.
c. core-site.xml and mapred-site.xml in the local $HADOOP_HOME/conf dir.
The precedence is a > b > c.
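A rough sketch of the three options (the 10.xxx.xxx.xxx IP is your EMR
master's internal IP; the script name is made up):

    # (a) per run, on the command line
    pig -Dmapred.job.tracker=10.xxx.xxx.xxx:9001 \
        -Dfs.default.name=hdfs://10.xxx.xxx.xxx:9000 \
        myscript.pig
    # (b) or put these two lines in $PIG_HOME/conf/pig.properties:
    #       fs.default.name=hdfs://10.xxx.xxx.xxx:9000
    #       mapred.job.tracker=10.xxx.xxx.xxx:9001
    # (c) or set fs.default.name in $HADOOP_HOME/conf/core-site.xml and
    #     mapred.job.tracker in $HADOOP_HOME/conf/mapred-site.xml to the
    #     same values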
5. Now Pig will start, but it will fail if the user you are running Pig as
does not match the default EMR user, which is hadoop. So this is what I do on
the EMR cluster:
a. hadoop dfs -fs hdfs://<EMR internal ip 10.xxx.xxx.xxx>:9000 -mkdir /user/piguser
   hadoop dfs -fs hdfs://<EMR internal ip 10.xxx.xxx.xxx>:9000 -chmod -R 777 /
b. You can argue that 777 is too generous, but I don't care, since only
temporary files are stored there and they are gone once my instance is gone.
All my real data is on S3.
Now you should be all set.
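For example, a run from the Pig client ends up looking something like this
(bucket and script names are made up; the JT/NN settings are assumed to be in
pig.properties per step 4):

    pig -param INPUT=s3n://my-bucket/input \
        -param OUTPUT=s3n://my-bucket/output \
        myscript.pig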
Only steps 4 & 5 need to be done every time you start a new EMR cluster.
-Ayon