You need to add this to your pig.properties:

pig.tmpfilecompression=true
pig.tmpfilecompression.codec=lzo

Make sure that you are running Hadoop 0.20.2 or higher and Pig 0.8.1 or
higher, and that all the LZO libraries are set up -- it's a bit involved.
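
If you'd rather not touch pig.properties, the same properties can also be
set per script with Pig's SET command (0.8+) -- a quick sketch, same names
as above:

  set pig.tmpfilecompression true;
  set pig.tmpfilecompression.codec 'lzo';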

Use replicated joins where possible.
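For example, something along these lines -- relation and field names are
made up, and the replicated relation (listed last) has to fit in memory:

  big   = LOAD 'big_events'   AS (id, val);
  small = LOAD 'small_lookup' AS (id, name);
  jnd   = JOIN big BY id, small BY id USING 'replicated';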

If you are doing a large number of small jobs, scheduling and
provisioning is likely to dominate -- tune your job scheduler to
schedule more tasks per heartbeat and make sure your jar is as small
as you can get it (there's a lot of unjarring going on in Hadoop).
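
With the fair scheduler, for example, that means something like this in
mapred-site.xml (fair-scheduler-specific property; other schedulers have
their own knobs, so treat it as a sketch):

  <property>
    <name>mapred.fairscheduler.assignmultiple</name>
    <value>true</value>
  </property>
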
D

On Wed, Jun 15, 2011 at 11:14 AM, Dexin Wang <[email protected]> wrote:
> Tomas,
>
> What worked well for me is still to be figured out. Right now, it works but
> it's too slow. I think one of the main problems is that my job has many
> JOIN/GROUP BY steps, so lots of intermediate results end up being written
> to disk, which is slow.
>
> On that note, does anyone know how to tell whether LZO compression is
> actually turned on for the intermediate jobs? Referencing this
>
> http://pig.apache.org/docs/r0.8.0/cookbook.html#Compress+the+Results+of+Intermediate+Jobs
>
> and this
>
> http://www.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/
>
> I see I have this in my mapred-site.xml file:
>
>    <property><name>mapred.map.output.compression.codec</name>
> <value>com.hadoop.compression.lzo.LzoCodec</value></property>
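>
> From what I can tell the codec property on its own doesn't switch
> compression on; there's also mapred.compress.map.output, something like
> this (just my understanding of the Hadoop side, so please correct me):
>
>    <property><name>mapred.compress.map.output</name>
>    <value>true</value></property>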
>
> Is that all I need to have map compression turned on? Thanks.
>
> Dexin
>
> On Tue, Jun 14, 2011 at 3:36 PM, Tomas Svarovsky
> <[email protected]> wrote:
>
>> Hi Dexin,
>>
>> Since I am a Pig and MapReduce newbie, your post is very intriguing to
>> me. I am coming from a Talend background and trying to assess whether
>> MapReduce would bring any speed-up and faster turnaround to my projects.
>> My worry is that my data are too small, so the MapReduce overhead would
>> be prohibitive in certain cases.
>>
>> With Talend, a reasonable transformation could process tens of thousands
>> of rows per second, so 1 million rows could be finished well under a
>> minute -- which makes me think your dataset is fairly small. Nevertheless
>> my data are growing, so soon it will be time for Pig.
>>
>> Could you provide some info what worked well for you to run your job on
>> EC2?
>>
>> Thanks in advance,
>>
>> Tomas
>>
>> On Tue, Jun 14, 2011 at 9:16 PM, Daniel Dai <[email protected]>
>> wrote:
>> > If the job finishes in 3 minutes in local mode, I would think it is
>> small.
>> >
>> > On 06/14/2011 11:07 AM, Dexin Wang wrote:
>> >>
>> >> Good to know. Trying a single-node Hadoop cluster now. The main input is
>> >> about 1+ million lines of events. After some aggregation, it joins with
>> >> another input source which also has about 1+ million rows. Is this
>> >> considered a small query? Thanks.
>> >>
>> >> On Tue, Jun 14, 2011 at 11:01 AM, Daniel Dai <[email protected]
>> >> <mailto:[email protected]>> wrote:
>> >>
>> >>    There is a huge difference between local mode and mapreduce mode.
>> >>    For a small query, the mapreduce overhead will dominate. For a fair
>> >>    comparison, can you set up a single-node hadoop cluster on your
>> >>    laptop and run Pig on it?
>> >>
>> >>    Daniel
>> >>
>> >>
>> >>    On 06/14/2011 10:54 AM, Dexin Wang wrote:
>> >>>
>> >>>    Thanks for your feedback. My comments below.
>> >>>
>> >>>    On Tue, Jun 14, 2011 at 10:41 AM, Daniel Dai
>> >>>    <[email protected] <mailto:[email protected]>> wrote:
>> >>>
>> >>>        Curious, couple of questions:
>> >>>        1. Are you running in local mode or mapreduce mode?
>> >>>
>> >>>    Local mode (-x local) when I ran it on my laptop, and mapreduce
>> >>>    mode when I ran it on ec2 cluster.
>> >>>
>> >>>        2. If mapreduce mode, did you look into the hadoop log to see
>> >>>        how much slow down each mapreduce job does?
>> >>>
>> >>>    I'm looking into that.
>> >>>
>> >>>        3. What kind of query is it?
>> >>>
>> >>>    The input is gzipped JSON files with one event per line. I do
>> >>>    some hourly aggregation on the raw events, then a bunch of
>> >>>    grouping, joining and metrics computation (like median and
>> >>>    variance) on some fields.
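>> >>>
>> >>>    Roughly this shape, ignoring the JSON loading details (all names
>> >>>    are made up, just to illustrate):
>> >>>
>> >>>        events = LOAD 'events' AS (ts:long, user:chararray, val:double);
>> >>>        users  = LOAD 'users'  AS (user:chararray, segment:chararray);
>> >>>        grpd   = GROUP events BY (user, ts / 3600);
>> >>>        hourly = FOREACH grpd GENERATE FLATTEN(group) AS (user, hour),
>> >>>                                       SUM(events.val) AS total;
>> >>>        joined = JOIN hourly BY user, users BY user;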
>> >>>
>> >>>        Daniel
>> >>>
>> >>>    Someone mentioned it's EC2's I/O performance. But I'm sure there
>> >>>    are plenty of people running big MR jobs on EC2/EMR, so more
>> >>>    likely I have some configuration issue? My jobs can be optimized
>> >>>    a bit, but the fact that they run faster on my laptop tells me
>> >>>    this is a separate issue.
>> >>>
>> >>>    Thanks!
>> >>>
>> >>>
>> >>>
>> >>>        On 06/13/2011 11:54 AM, Dexin Wang wrote:
>> >>>
>> >>>            Hi,
>> >>>
>> >>>            This is probably not directly a Pig question.
>> >>>
>> >>>            Anyone running Pig on Amazon EC2 instances? Something's
>> >>>            not making sense to
>> >>>            me. I ran a Pig script that has about 10 mapred jobs in
>> >>>            it on a 16 node
>> >>>            cluster using m1.small. It took *13 minutes*. The job
>> >>>            reads input from S3
>> >>>            and writes output to S3. But from the logs the reading
>> >>>            and writing part
>> >>>            to/from S3 is pretty fast. And all the intermediate steps
>> >>>            should happen on
>> >>>            HDFS.
>> >>>
>> >>>            Running the same job on my mbp laptop, it only took *3
>> >>>            minutes*.
>> >>>
>> >>>            Amazon is using Pig 0.6 while I'm using Pig 0.8 on my
>> >>>            laptop; I'll try Pig 0.6 on my laptop as well. Some Hadoop
>> >>>            config is probably also not ideal. I tried m1.large
>> >>>            instead of m1.small and it doesn't seem to make a huge
>> >>>            difference. Is there anything you would suggest looking
>> >>>            at for the slowness on EC2?
>> >>>
>> >>>            Dexin
>> >>>
>> >>>
>> >>>
>> >>
>> >>
>> >
>> >
>>
>
