Thank you for the help, that opened my eyes. I've noticed that with LZO compression, "Map output bytes" is 296,608,592,100 and "HDFS_BYTES_WRITTEN" is 57,941,932,388. Does that mean the reducer output compression ratio is 296,608,592,100 / 57,941,932,388 ≈ 5.12? Why is it so small for the SequenceFile format?
Other statistics:

Counter                               Map           Reduce            Total
FILE_BYTES_READ           121,983,712,033  135,435,145,919  257,418,857,952
HDFS_BYTES_READ            23,721,946,243                0   23,721,946,243
FILE_BYTES_WRITTEN        188,046,014,425  135,437,054,645  323,483,069,070
HDFS_BYTES_WRITTEN                      0   57,941,932,388   57,941,932,388
Reduce input groups                     0    1,895,637,970    1,895,637,970
Combine output records      3,791,275,940      272,362,481    4,063,638,421
Map input records           1,895,637,976                0    1,895,637,976
Reduce shuffle bytes                    0   65,503,257,420   65,503,257,420
Reduce output records                   0    1,895,637,970    1,895,637,970
Spilled Records             5,436,423,030    3,871,926,741    9,308,349,771
Map output bytes          296,608,592,100                0  296,608,592,100
SPLIT_RAW_BYTES                    73,060                0           73,060
Map output records          1,895,637,976                0    1,895,637,976
Combine input records       3,791,275,946      272,362,481    4,063,638,427
Reduce input records                    0    1,895,637,970    1,895,637,970

Thanks,
Marek M.

________________________________________
From: Harsh J [[email protected]]
Sent: Wednesday, February 01, 2012 1:23 PM
To: [email protected]
Subject: Re: Snappy in Mapreduce

Also, if you want finalized outputs in LZO, set "mapred.output.compression.codec" to that codec. You have it set to Snappy presently.
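
As a minimal sketch of that suggestion, the job's output codec property could point at the LZO codec instead. The class name below is the one already listed under io.compression.codec.lzo.class in the quoted configuration. This assumes the hadoop-lzo library and its native bindings are installed on every node, which the thread does not confirm.

```xml
<!-- Final (reducer) output: compress with LZO instead of Snappy. -->
<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <!-- Assumption: same LZO class as io.compression.codec.lzo.class in the quoted job properties -->
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```

The map-output settings (mapred.compress.map.output and mapred.map.output.compression.codec) can stay on Snappy, since they apply only to the intermediate data.
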
On Wed, Feb 1, 2012 at 2:04 PM, Marek Miglinski <[email protected]> wrote:
> Hello guys,
>
> I have Cloudera's CDH3U2 package installed on a 3-node cluster and I've
> added to mapred-site:
> <property>
>   <name>mapred.compress.map.output</name>
>   <value>true</value>
> </property>
>
> <property>
>   <name>mapred.map.output.compression.codec</name>
>   <value>org.apache.hadoop.io.compress.SnappyCodec</value>
> </property>
>
> Also to my pig job properties:
> <property>
>   <name>io.compression.codec.lzo.class</name>
>   <value>com.hadoop.compression.lzo.LzoCodec</value>
> </property>
> <property>
>   <name>pig.tmpfilecompression</name>
>   <value>true</value>
> </property>
> <property>
>   <name>pig.tmpfilecompression.codec</name>
>   <value>lzo</value>
> </property>
> <property>
>   <name>mapred.output.compress</name>
>   <value>true</value>
> </property>
> <property>
>   <name>mapred.output.compression.codec</name>
>   <value>org.apache.hadoop.io.compress.SnappyCodec</value>
> </property>
> <property>
>   <name>mapred.output.compression.type</name>
>   <value>BLOCK</value>
> </property>
> <property>
>   <name>mapred.compress.map.output</name>
>   <value>true</value>
> </property>
> <property>
>   <name>mapred.map.output.compression.codec</name>
>   <value>org.apache.hadoop.io.compress.SnappyCodec</value>
> </property>
> <property>
>   <name>mapreduce.map.output.compress</name>
>   <value>true</value>
> </property>
> <property>
>   <name>mapreduce.map.output.compress.codec</name>
>   <value>org.apache.hadoop.io.compress.SnappyCodec</value>
> </property>
>
> So I want Pig to compress its data with LZO but MapReduce with Snappy, but
> as I see in the tasktracker details (Map Bytes Out), data is not compressed at
> all, which reduces performance a lot (IO is 100% most of the time)... What am
> I doing wrong and how do I fix it?
>
>
> Thanks,
> Marek M.

--
Harsh J
Customer Ops. Engineer
Cloudera | http://tiny.cloudera.com/about
