If the job finishes in 3 minutes in local mode, I would think it is small.

On 06/14/2011 11:07 AM, Dexin Wang wrote:
Good to know. Trying a single-node Hadoop cluster now. The main input is about 1+ million lines of events. After some aggregation, it joins with another input source, which also has about 1+ million rows. Is this considered a small query? Thanks.

On Tue, Jun 14, 2011 at 11:01 AM, Daniel Dai wrote:

    Local mode and mapreduce mode make a huge difference. For a small
    query, the mapreduce overhead will dominate. For a fair
    comparison, can you set up a single-node Hadoop cluster on your
    laptop and run Pig on it?

    Daniel


    On 06/14/2011 10:54 AM, Dexin Wang wrote:
    Thanks for your feedback. My comments below.

    On Tue, Jun 14, 2011 at 10:41 AM, Daniel Dai wrote:

        Curious, couple of questions:
        1. Are you running in local mode or mapreduce mode?

    Local mode (-x local) when I ran it on my laptop, and mapreduce
    mode when I ran it on the EC2 cluster.

        2. If mapreduce mode, did you look into the Hadoop log to see
        how much each mapreduce job slows down?

    I'm looking into that.

        3. What kind of query is it?

    The input is gzipped JSON files which have one event per line.
    Then I do some hourly aggregation on the raw events, then do a
    bunch of grouping, joining and some metrics computing (like
    median, variance) on some fields.

        Daniel

    Someone mentioned it's EC2's I/O performance. But I'm sure there
    are plenty of people using EC2/EMR running big MR jobs so more
    likely I have some configuration issues? My jobs can be optimized
    a bit but the fact that running on my laptop is faster tells me
    this is a separate issue.

    Thanks!



        On 06/13/2011 11:54 AM, Dexin Wang wrote:

            Hi,

            This is probably not directly a Pig question.

            Anyone running Pig on Amazon EC2 instances? Something's not
            making sense to me. I ran a Pig script that has about 10
            mapred jobs in it on a 16-node cluster using m1.small. It
            took *13 minutes*. The job reads input from S3 and writes
            output to S3. But from the logs the reading and writing part
            to/from S3 is pretty fast. And all the intermediate steps
            should happen on HDFS.

            Running the same job on my mbp laptop, it only took *3 minutes*.

            Amazon is using Pig 0.6 while I'm using Pig 0.8 on my laptop.
            I'll try Pig 0.6 on my laptop. Some Hadoop config is probably
            also not ideal. I tried m1.large instead of m1.small, but it
            doesn't seem to make a huge difference. Is there anything you
            would suggest looking at to explain the slowness on EC2?

            Dexin





