Amazon supports Pig 0.9.1 now. Take a look: http://aws.amazon.com/releasenotes/Elastic-MapReduce/1044996466833146
Also, I am not very sure about copying EMR jars to EC2. You should check that with Amazon.

Thanks,
Aniket

On Fri, Dec 16, 2011 at 12:02 PM, Ayon Sinha <[email protected]> wrote:

This might get outdated quickly as EMR upgrades the Pig version and Pig 0.9.1 is being used by everyone anyway, but here is my write-up for your review.

The main obstacles to running Pig on Elastic MapReduce (EMR) are:

 * The Pig version installed on EMR is older than 0.8.1 (by some accounts EMR just upgraded its Pig version to 0.9.1).
 * The Hadoop version on EMR might not match the one Pig is using.
 * The user you're running Pig as might not have permissions on the HDFS of the EMR cluster.

How to solve each of these issues:

1. We will not use the Pig that is installed on EMR. We will use an EC2 instance as the Pig client, which compiles the Pig scripts and submits MapReduce jobs to the Hadoop on EMR. For this to work, the Hadoop version that Pig uses and the one installed on EMR must match (or at least be backward compatible), i.e. the EMR Hadoop version should be >= Pig's Hadoop version.

2. The best way to do this is to copy the Hadoop directory from one of the EMR instances to the Pig client EC2 machine. The next problem is making Pig use this Hadoop rather than the one it has been using. For Pig 0.8.1 or earlier, the Pig jar has the Hadoop classes bundled within, so any attempt at making Pig use the jars downloaded from EMR fails. The solution was to use Pig 0.9.1, which has a pigwithouthadoop.jar. When you use that jar, Pig uses whichever Hadoop HADOOP_HOME points to, which in this case will be the directory where you downloaded the EMR classes and configs (first sketch below).

3. Now that you are using Pig 0.9.1, your version might have a bug in the pig executable script (in <Pig install dir>/bin) where it does not respect HADOOP_HOME. If so, patch the script.

4. Now you want Pig to use the JobTracker and NameNode of the EMR cluster you want the computation to run on. Follow one of the usual ways to do this (second sketch below):
   a. -Dmapred.job.tracker=<jt:port> -Dfs.default.name=<nn:port>. The JT and NN IP will be the internal 10.xxx.xxx.xxx IP of the EMR master node; the ports are 9000 and 9001 for the NN and JT respectively.
   b. The pig.properties file in the conf dir.
   c. Change core-site.xml and mapred-site.xml in the local $HADOOP_HOME/conf dir.
   The precedence is a > b > c.

5. Now Pig will start, but it will fail if the user you are running Pig as does not match the default EMR user, which is hadoop. So this is what I do on the EMR cluster (third sketch below):
   a. hadoop dfs -fs hdfs://<EMR internal ip 10.xxx.xxx.xxx>:9000 -mkdir /user/piguser; hadoop dfs -fs hdfs://<EMR internal ip 10.xxx.xxx.xxx>:9000 -chmod -R 777 /
   b. You can argue that 777 is too generous, but I don't care, as only temporary files are stored there and they are gone once my instance is gone. All my real data is on S3.

Now you should be all set. Only steps 4 and 5 need to be done every time you start a new EMR cluster.

-Ayon

--
"...:::Aniket:::... Quetzalco@tl"
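A rough sketch of step 2 above: copying the EMR Hadoop install onto the Pig client and pointing HADOOP_HOME at it. The key file, hostname, and destination path are placeholders, and /home/hadoop is only assumed to be where EMR keeps its Hadoop install; adjust for your setup.

    # Copy the Hadoop install and conf from the EMR master to the Pig client EC2 box
    # (placeholder key, hostname, and paths)
    scp -r -i my-key.pem hadoop@<emr-master-public-dns>:/home/hadoop ~/emr-hadoop

    # Make Pig 0.9.1 (the withouthadoop jar) pick up this Hadoop instead of bundled classes
    export HADOOP_HOME=~/emr-hadoop
    export PATH=$HADOOP_HOME/bin:$PATH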
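A sketch of step 4, option (a): pass the EMR master's internal IP and ports on the Pig command line (the -D properties go before any other arguments). Options (b) and (c) set the same two properties in pig.properties or in $HADOOP_HOME/conf instead. The IP and script name are placeholders.

    # Point Pig at the EMR cluster's JobTracker (9001) and NameNode (9000)
    pig -Dmapred.job.tracker=10.xxx.xxx.xxx:9001 \
        -Dfs.default.name=hdfs://10.xxx.xxx.xxx:9000 \
        myscript.pig

    # Equivalent entries for conf/pig.properties (option b):
    #   mapred.job.tracker=10.xxx.xxx.xxx:9001
    #   fs.default.name=hdfs://10.xxx.xxx.xxx:9000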
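The HDFS setup from step 5, split into one command per line. Run these once per new EMR cluster; piguser stands for whatever user you run Pig as on the client.

    # Create a home dir for the Pig user and open up permissions
    # (only transient job data lives on this HDFS; the real data is on S3)
    hadoop dfs -fs hdfs://10.xxx.xxx.xxx:9000 -mkdir /user/piguser
    hadoop dfs -fs hdfs://10.xxx.xxx.xxx:9000 -chmod -R 777 /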
