Re: running pig on amazon ec2

Daniel Dai Tue, 14 Jun 2011 11:02:44 -0700

Running in local mode versus mapreduce mode makes a huge difference. For a small query, the mapreduce overhead will dominate. For a fair comparison, can you set up a single-node hadoop cluster on your laptop and run Pig on it?
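
As a rough illustration of why per-job overhead can dominate, the sketch below uses only the figures mentioned in this thread (about 10 mapred jobs, 13 minutes on the cluster, 3 minutes in local mode). It assumes, purely for illustration, that the jobs run one after another and that the useful work is the same in both settings; the function names are made up for this example.

```python
# Figures quoted in the thread: ~10 mapred jobs,
# 13 minutes on the EC2 cluster, 3 minutes in local mode.
NUM_JOBS = 10
CLUSTER_SECONDS = 13 * 60
LOCAL_SECONDS = 3 * 60


def per_job_seconds(total_seconds: float, num_jobs: int) -> float:
    """Average wall-clock seconds per job (assumes sequential jobs)."""
    return total_seconds / num_jobs


def implied_overhead_per_job(cluster_total: float,
                             local_total: float,
                             num_jobs: int) -> float:
    """Extra seconds per job on the cluster relative to local mode.

    Assumption (hypothetical): the actual computation costs the same
    in both modes, so the whole difference is per-job overhead.
    """
    return (cluster_total - local_total) / num_jobs


print(per_job_seconds(CLUSTER_SECONDS, NUM_JOBS))   # 78.0
print(per_job_seconds(LOCAL_SECONDS, NUM_JOBS))     # 18.0
print(implied_overhead_per_job(CLUSTER_SECONDS, LOCAL_SECONDS, NUM_JOBS))  # 60.0
```

Under these assumptions each job carries about a minute of extra time on the cluster, which is worth checking against the per-job timings in the hadoop logs.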


Daniel


On 06/14/2011 10:54 AM, Dexin Wang wrote:

Thanks for your feedback. My comments below.
On Tue, Jun 14, 2011 at 10:41 AM, Daniel Dai <[email protected]> wrote:
    Curious, couple of questions:
    1. Are you running in local mode or mapreduce mode?
Local mode (-x local) when I ran it on my laptop, and mapreduce mode when I ran it on the EC2 cluster.
    2. If mapreduce mode, did you look into the hadoop log to see how
    much each mapreduce job slows down?

I'm looking into that.

    3. What kind of query is it?
The input is gzipped JSON files with one event per line. I do some hourly aggregation on the raw events, then a bunch of grouping, joining, and metrics computation (like median and variance) on some fields.
    Daniel
Someone mentioned it's EC2's I/O performance. But I'm sure there are plenty of people using EC2/EMR running big MR jobs, so more likely I have some configuration issues? My jobs can be optimized a bit, but the fact that running on my laptop is faster tells me this is a separate issue.
Thanks!



    On 06/13/2011 11:54 AM, Dexin Wang wrote:

        Hi,

        This is probably not directly a Pig question.

        Anyone running Pig on amazon EC2 instances? Something's not making
        sense to me. I ran a Pig script that has about 10 mapred jobs in it
        on a 16 node cluster using m1.small. It took *13 minutes*. The job
        reads input from S3 and writes output to S3. But from the logs the
        reading and writing part to/from S3 is pretty fast. And all the
        intermediate steps should happen on HDFS.

        Running the same job on my mbp laptop, it only took *3 minutes*.

        Amazon is using Pig 0.6 while I'm using Pig 0.8 on my laptop. I'll
        try Pig 0.6 on my laptop. Some hadoop config is probably also not
        ideal. I tried m1.large instead of m1.small, and it doesn't seem to
        make a huge difference. Anything you would suggest to look for the
        slowness on EC2?

        Dexin
