Amazon supports Pig 0.9.1 now. Take a look: http://aws.amazon.com/releasenotes/Elastic-MapReduce/1044996466833146
Also, I am not very sure about copying EMR jars to EC2. You should check that with Amazon.

Thanks,
Aniket

On Fri, Dec 16, 2011 at 12:02 PM, Ayon Sinha <[email protected]> wrote:

This might get outdated quickly as EMR upgrades the Pig version and Pig 0.9.1 is being used by everyone anyway, but here is my write-up for your review.

The main obstacles to running Pig on Elastic MapReduce (EMR) are:

 * The Pig version installed on EMR is older than 0.8.1 (by some accounts EMR just upgraded its Pig version to 0.9.1).
 * The Hadoop version on EMR might not match the one Pig is using.
 * The user you're running Pig as might not have permissions on the HDFS of the EMR cluster.

How to solve each of these issues:

1. We will not use the Pig that is installed on EMR. We will use an EC2 instance as the Pig client, which compiles the Pig scripts and submits MapReduce jobs to the Hadoop on EMR. For this to work, the Hadoop version that Pig uses and the one installed on EMR must match (or at least be backward compatible), i.e. the EMR Hadoop version should be >= Pig's Hadoop version.

2. The best way to do this is to copy the Hadoop directory from one of the EMR instances to the Pig client EC2 machine. The next problem is making Pig use this Hadoop rather than the one it has been using. For Pig 0.8.1 or earlier, the Pig jar has the Hadoop classes bundled within, so any attempt at making Pig use the jars downloaded from EMR fails. The solution was to use Pig 0.9.1, which has a pigwithouthadoop.jar. When you use that jar, Pig uses whichever Hadoop HADOOP_HOME points to, which in this case will be the directory where you downloaded the EMR classes and configs (first sketch below).

3. Now that you are using Pig 0.9.1, your version might have a bug in the pig executable script (in <Pig install dir>/bin) where it does not respect HADOOP_HOME. If so, patch the script.

4. Now you want Pig to use the JobTracker and NameNode of the EMR cluster you want the computation to run on. Follow one of the usual ways to do this (second sketch below):
   a. -Dmapred.job.tracker=<jt:port> -Dfs.default.name=<nn:port>. The JT and NN IP will be the internal 10.xxx.xxx.xxx IP of the EMR master node; the ports are 9000 and 9001 for the NN and JT respectively.
   b. The pig.properties file in the conf dir.
   c. Change core-site.xml and mapred-site.xml in the local $HADOOP_HOME/conf dir.
   The precedence is a > b > c.

5. Now Pig will start, but it will fail if the user you are running Pig as does not match the default EMR user, which is hadoop. So this is what I do on the EMR cluster (third sketch below):
   a. hadoop dfs -fs hdfs://<EMR internal ip 10.xxx.xxx.xxx>:9000 -mkdir /user/piguser; hadoop dfs -fs hdfs://<EMR internal ip 10.xxx.xxx.xxx>:9000 -chmod -R 777 /
   b. You can argue that 777 is too generous, but I don't care, as only temporary files are stored there and they are gone once my instance is gone. All my real data is on S3.

Now you should be all set. Only steps 4 and 5 need to be done every time you start a new EMR cluster.

-Ayon

--
"...:::Aniket:::... Quetzalco@tl"
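A rough sketch of step 2 above: copying the EMR Hadoop install onto the Pig client and pointing HADOOP_HOME at it. The key file, hostname, and destination path are placeholders, and /home/hadoop is only assumed to be where EMR keeps its Hadoop install; adjust for your setup.

    # Copy the Hadoop install and conf from the EMR master to the Pig client EC2 box
    # (placeholder key, hostname, and paths)
    scp -r -i my-key.pem hadoop@<emr-master-public-dns>:/home/hadoop ~/emr-hadoop

    # Make Pig 0.9.1 (the withouthadoop jar) pick up this Hadoop instead of bundled classes
    export HADOOP_HOME=~/emr-hadoop
    export PATH=$HADOOP_HOME/bin:$PATH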
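A sketch of step 4, option (a): pass the EMR master's internal IP and ports on the Pig command line (the -D properties go before any other arguments). Options (b) and (c) set the same two properties in pig.properties or in $HADOOP_HOME/conf instead. The IP and script name are placeholders.

    # Point Pig at the EMR cluster's JobTracker (9001) and NameNode (9000)
    pig -Dmapred.job.tracker=10.xxx.xxx.xxx:9001 \
        -Dfs.default.name=hdfs://10.xxx.xxx.xxx:9000 \
        myscript.pig

    # Equivalent entries for conf/pig.properties (option b):
    #   mapred.job.tracker=10.xxx.xxx.xxx:9001
    #   fs.default.name=hdfs://10.xxx.xxx.xxx:9000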
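The HDFS setup from step 5, split into one command per line. Run these once per new EMR cluster; piguser stands for whatever user you run Pig as on the client.

    # Create a home dir for the Pig user and open up permissions
    # (only transient job data lives on this HDFS; the real data is on S3)
    hadoop dfs -fs hdfs://10.xxx.xxx.xxx:9000 -mkdir /user/piguser
    hadoop dfs -fs hdfs://10.xxx.xxx.xxx:9000 -chmod -R 777 /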
