RE: Snappy in Mapreduce

Marek Miglinski Mon, 06 Feb 2012 05:14:59 -0800

Thank you for the help, that opened my eyes.

I've noticed that with LZO compression "Map output bytes" is 
296,608,592,100 and "HDFS_BYTES_WRITTEN" is 57,941,932,388. Does that mean the 
reducer output compression ratio is 296,608,592,100 / 57,941,932,388 = 5.12? 
Why is it so small for the SequenceFile format?


Other statistics:

Counter                               Map           Reduce            Total
FILE_BYTES_READ           121,983,712,033  135,435,145,919  257,418,857,952
HDFS_BYTES_READ            23,721,946,243                0   23,721,946,243
FILE_BYTES_WRITTEN        188,046,014,425  135,437,054,645  323,483,069,070
HDFS_BYTES_WRITTEN                      0   57,941,932,388   57,941,932,388

Reduce input groups                     0    1,895,637,970    1,895,637,970
Combine output records      3,791,275,940      272,362,481    4,063,638,421
Map input records           1,895,637,976                0    1,895,637,976
Reduce shuffle bytes                    0   65,503,257,420   65,503,257,420
Reduce output records                   0    1,895,637,970    1,895,637,970
Spilled Records             5,436,423,030    3,871,926,741    9,308,349,771
Map output bytes          296,608,592,100                0  296,608,592,100
SPLIT_RAW_BYTES                    73,060                0           73,060
Map output records          1,895,637,976                0    1,895,637,976
Combine input records       3,791,275,946      272,362,481    4,063,638,427
Reduce input records                    0    1,895,637,970    1,895,637,970


Thanks,
Marek M.
________________________________________
From: Harsh J [[email protected]]
Sent: Wednesday, February 01, 2012 1:23 PM
To: [email protected]
Subject: Re: Snappy in Mapreduce

Also, if you want finalized outputs in LZO, set
"mapred.output.compression.codec" to that codec. You have it set to
Snappy presently.
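
For illustration, a minimal sketch of that change in the Pig job properties. It assumes the LZO codec class is the one already named in the quoted properties below (com.hadoop.compression.lzo.LzoCodec) and that the LZO libraries are installed on every node; neither assumption is confirmed in this thread.

```xml
<!-- Sketch only: the codec class name is taken from the quoted job
     properties, not verified against the cluster. -->
<property>
    <name>mapred.output.compress</name>
    <value>true</value>
</property>
<property>
    <name>mapred.output.compression.codec</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```

The map output settings (mapred.compress.map.output and mapred.map.output.compression.codec) can stay on Snappy; only the final output codec changes here.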

On Wed, Feb 1, 2012 at 2:04 PM, Marek Miglinski <[email protected]> wrote:
> Hello guys,
>
> I have Cloudera's CDH3u2 package installed on a 3-node cluster, and I've 
> added to mapred-site:
>    <property>
>        <name>mapred.compress.map.output</name>
>        <value>true</value>
>    </property>
>
>    <property>
>        <name>mapred.map.output.compression.codec</name>
>        <value>org.apache.hadoop.io.compress.SnappyCodec</value>
>    </property>
>
> Also, to my Pig job properties:
>                <property>
>                    <name>io.compression.codec.lzo.class</name>
>                    <value>com.hadoop.compression.lzo.LzoCodec</value>
>                </property>
>                <property>
>                    <name>pig.tmpfilecompression</name>
>                    <value>true</value>
>                </property>
>                <property>
>                    <name>pig.tmpfilecompression.codec</name>
>                    <value>lzo</value>
>                </property>
>                <property>
>                    <name>mapred.output.compress</name>
>                    <value>true</value>
>                </property>
>                <property>
>                    <name>mapred.output.compression.codec</name>
>                    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
>                </property>
>                <property>
>                    <name>mapred.output.compression.type</name>
>                    <value>BLOCK</value>
>                </property>
>                <property>
>                    <name>mapred.compress.map.output</name>
>                    <value>true</value>
>                </property>
>                <property>
>                    <name>mapred.map.output.compression.codec</name>
>                    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
>                </property>
>                <property>
>                    <name>mapreduce.map.output.compress</name>
>                    <value>true</value>
>                </property>
>                <property>
>                    <name>mapreduce.map.output.compress.codec</name>
>                    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
>                </property>
>
> So I want Pig to compress its data with LZO but MapReduce with Snappy. However, 
> as I see in the TaskTracker details (Map Bytes Out), the data is not compressed at 
> all, which reduces performance a lot (IO is at 100% most of the time)... What am 
> I doing wrong and how do I fix it?
>
>
> Thanks,
> Marek M.



--
Harsh J
Customer Ops. Engineer
Cloudera | http://tiny.cloudera.com/about
