You need to add this to your pig.properties:

    pig.tmpfilecompression=true
    pig.tmpfilecompression.codec=lzo

Make sure that you are running Hadoop 0.20.2 or higher and Pig 0.8.1 or higher, and that all the LZO stuff is set up -- it's a bit involved. Use replicated joins where possible. If you are doing a large number of small jobs, scheduling and provisioning are likely to dominate -- tune your job scheduler to schedule more tasks per heartbeat, and make sure your jar is as small as you can get it (there's a lot of unjarring going on in Hadoop).

D
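A minimal Pig Latin sketch of the replicated (fragment-replicate) join suggested above -- the relation names, paths, and fields are invented for illustration. The smaller relation is listed last, because every relation after the first is loaded into memory in the map tasks, which lets the join run map-side with no reduce phase:

    -- hypothetical inputs: 'events' is large, 'users' is small enough to fit in memory
    events = LOAD 's3://mybucket/events' AS (user_id:chararray, ts:long, val:double);
    users  = LOAD 's3://mybucket/users'  AS (user_id:chararray, segment:chararray);
    -- map-side join: 'users' is shipped to every map task, so no shuffle is needed
    joined = JOIN events BY user_id, users BY user_id USING 'replicated';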
On Wed, Jun 15, 2011 at 11:14 AM, Dexin Wang <[email protected]> wrote:
> Tomas,
>
> What worked well for me is still to be figured out. Right now it works, but it's too slow. I think one of the main problems is that my job has many JOIN/GROUP BY steps, so lots of intermediate steps end up writing to disk, which is slow.
>
> On that note, does anyone know how to tell whether LZO is turned on for the intermediate jobs? I'm referring to this:
>
> http://pig.apache.org/docs/r0.8.0/cookbook.html#Compress+the+Results+of+Intermediate+Jobs
>
> and this:
>
> http://www.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/
>
> I see I have this in my mapred-site.xml file:
>
> <property><name>mapred.map.output.compression.codec</name>
> <value>com.hadoop.compression.lzo.LzoCodec</value></property>
>
> Is that all I need to have map compression turned on? Thanks.
>
> Dexin
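On the map-output compression question: setting the codec by itself is normally not enough -- compression also has to be switched on explicitly. A minimal mapred-site.xml sketch, assuming the Hadoop 0.20-era property names (worth verifying against your distribution):

    <!-- turn map-output compression on; the codec property only selects which codec is used -->
    <property>
      <name>mapred.compress.map.output</name>
      <value>true</value>
    </property>
    <property>
      <name>mapred.map.output.compression.codec</name>
      <value>com.hadoop.compression.lzo.LzoCodec</value>
    </property>

Note that Pig's own intermediate (temp-file) compression is controlled separately by the pig.tmpfilecompression settings shown at the top of this thread.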
> On Tue, Jun 14, 2011 at 3:36 PM, Tomas Svarovsky <[email protected]> wrote:
>> Hi Dexin,
>>
>> Since I am a Pig and MapReduce newbie, your post is very intriguing to me. I am coming from a Talend background and trying to assess whether map/reduce would bring any speed-up and faster turnaround to my projects. My worry is that my data are too small, so that the map/reduce overhead will be prohibitive in certain cases.
>>
>> When using Talend, if the transformation was reasonable it could process tens of thousands of rows per second. Processing 1 million rows could be finished well under 1 minute, so I think your dataset is fairly small. Nevertheless my data are growing, so soon it will be time for Pig.
>>
>> Could you provide some info on what worked well for you when running your job on EC2?
>>
>> Thanks in advance,
>>
>> Tomas
>>
>> On Tue, Jun 14, 2011 at 9:16 PM, Daniel Dai <[email protected]> wrote:
>> > If the job finishes in 3 minutes in local mode, I would think it is small.
>> >
>> > On 06/14/2011 11:07 AM, Dexin Wang wrote:
>> >> Good to know. Trying a single-node Hadoop cluster now. The main input is about 1+ million lines of events. After some aggregation, it joins with another input source which also has about 1+ million rows. Is this considered a small query? Thanks.
>> >>
>> >> On Tue, Jun 14, 2011 at 11:01 AM, Daniel Dai <[email protected]> wrote:
>> >> Local mode and mapreduce mode make a huge difference. For a small query, the mapreduce overhead will dominate. For a fair comparison, can you set up a single-node Hadoop cluster on your laptop and run Pig on it?
>> >>
>> >> Daniel
>> >>
>> >> On 06/14/2011 10:54 AM, Dexin Wang wrote:
>> >>> Thanks for your feedback. My comments below.
>> >>>
>> >>> On Tue, Jun 14, 2011 at 10:41 AM, Daniel Dai <[email protected]> wrote:
>> >>>   Curious, couple of questions:
>> >>>   1. Are you running in local mode or mapreduce mode?
>> >>>
>> >>> Local mode (-x local) when I ran it on my laptop, and mapreduce mode when I ran it on the EC2 cluster.
>> >>>
>> >>>   2. If mapreduce mode, did you look into the Hadoop logs to see how much slowdown each mapreduce job adds?
>> >>>
>> >>> I'm looking into that.
>> >>>
>> >>>   3. What kind of query is it?
>> >>>
>> >>> The input is gzipped JSON files with one event per line. I do some hourly aggregation on the raw events, then a bunch of grouping, joining, and computing some metrics (like median and variance) on some fields.
>> >>>
>> >>>   Daniel
>> >>>
>> >>> Someone mentioned it's EC2's I/O performance. But I'm sure there are plenty of people using EC2/EMR to run big MR jobs, so more likely I have some configuration issue? My jobs can be optimized a bit, but the fact that running on my laptop is faster tells me this is a separate issue.
>> >>>
>> >>> Thanks!
>> >>>
>> >>> On 06/13/2011 11:54 AM, Dexin Wang wrote:
>> >>>   Hi,
>> >>>
>> >>>   This is probably not directly a Pig question.
>> >>>
>> >>>   Anyone running Pig on Amazon EC2 instances? Something's not making sense to me. I ran a Pig script that has about 10 mapred jobs in it on a 16-node cluster using m1.small instances. It took *13 minutes*. The job reads input from S3 and writes output to S3, but from the logs the reading and writing to/from S3 is pretty fast, and all the intermediate steps should happen on HDFS.
>> >>>
>> >>>   Running the same job on my MacBook Pro laptop, it only took *3 minutes*.
>> >>>
>> >>>   Amazon is using Pig 0.6 while I'm using Pig 0.8 on my laptop; I'll try Pig 0.6 on my laptop. Some Hadoop config is probably also not ideal. I tried m1.large instead of m1.small, which doesn't seem to make a huge difference. Anything you would suggest looking at for the slowness on EC2?
>> >>>
>> >>>   Dexin
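For context on why a script like the one described above fans out into roughly 10 MapReduce jobs: each GROUP or JOIN (and ORDER, DISTINCT) generally compiles into at least one job of its own, and each job pays scheduling and startup overhead. A purely hypothetical sketch of a single hourly-aggregation step -- all names, paths, and fields are invented:

    -- load gzipped, tab-separated event lines (hypothetical layout)
    raw    = LOAD 's3://mybucket/events/*.gz' USING PigStorage('\t')
             AS (ts:long, user_id:chararray, latency:double);
    -- this GROUP BY becomes one MapReduce job; each additional JOIN or GROUP
    -- in the script typically adds another job with its own startup cost
    grp    = GROUP raw BY (user_id, ts / 3600);
    hourly = FOREACH grp GENERATE
                 FLATTEN(group)   AS (user_id, hour),
                 COUNT(raw)       AS events,
                 AVG(raw.latency) AS avg_latency;
    STORE hourly INTO 's3://mybucket/hourly';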
