Hi Sreenath,

The LZO error occurs because you don't have the LZO libraries in the Hadoop_Home/lib/native folder. You need to build/package LZO for the OS you are using.

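As a rough check from the Hive side, assuming a Hive CLI session, the codec-related properties can be printed to see what the session is configured with. `io.compression.codecs` is the standard Hadoop property that lists registered codec classes. Whether a missing entry there is the cause of this particular error is an assumption: the native libraries and the hadoop-lzo jar also have to be in place on the nodes.

```sql
-- SET with a property name and no value only prints the current value.
-- If LZO is registered, com.hadoop.compression.lzo.LzoCodec should appear in this list.
SET io.compression.codecs;

-- Shows which codec the session will ask for when writing job output.
SET mapred.output.compression.codec;
```
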
In compression, as you mentioned, there is an overhead in decompressing the records during processing. HDFS is used to store large amounts of data, so compression saves a lot of storage space (consider replication as well). However, it is not final output compression that speeds up MapReduce jobs; it is intermediate compression that has this advantage. Intermediate compression means compression of the map output. In a MapReduce job a lot of copying and shuffling happens between the map and reduce phases; when this intermediate data is compressed, that step is faster because it consumes much less I/O.

The following properties enable intermediate compression:

mapred.compress.map.output=true
mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec

Regards
Bejoy KS

________________________________
From: Siddharth Tiwari <siddharth.tiw...@live.com>
To: "user@hive.apache.org" <user@hive.apache.org>
Sent: Wednesday, June 6, 2012 2:58 PM
Subject: RE: Compressed data storage in HDFS - Error

There is something you gain and something you lose. Compression reduces I/O at the cost of increased CPU work. You will also see different effects for different tasks, i.e. HDFS read, HDFS write, shuffle and sort. So whether to go for compression or not depends on your usage.

Sent from my N8

-----Original Message-----
From: Sreenath Menon
Sent: 6/6/2012 8:50:23 AM
To: user@hive.apache.org
Subject: Compressed data storage in HDFS - Error

I would like to compress my data in HDFS using some Hive commands.

Steps followed (data already residing in table sample):

create table rc_lzo like sample;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;
insert overwrite table rc_lzo select * from sample;

Error: Compression codec com.hadoop.compression.lzo.LzoCodec was not found

1) What do I need to do to use LZO as well as other compression methods?
2) I heard somewhere that using compressed data will produce better results than uncompressed data in some cases. How can this be, given that compression methods always incur compression and decompression time? Is there any truth in this, and if so, how? I can understand how there are better results when using compression between mappers and reducers and in between MapReduce jobs.

Thanks and Regards
Sreenath Mullassery
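
For reference, the settings mentioned across this thread could be combined in one Hive session roughly as follows. This is only a sketch under the assumption that the LZO codec is already installed and loadable on the cluster. The table names `sample` and `rc_lzo` are the ones from the original question.

```sql
-- Final output compression (from the original question):
-- compresses the files written into the target table.
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;

-- Intermediate compression (from the reply):
-- compresses map output before it is shuffled to the reducers.
SET mapred.compress.map.output=true;
SET mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;

CREATE TABLE rc_lzo LIKE sample;
INSERT OVERWRITE TABLE rc_lzo SELECT * FROM sample;
```

A plain `SELECT *` copy like this one typically runs as a map-only job, so the map-output settings would mainly matter for queries with a reduce phase, such as joins or `GROUP BY`. For compression between the multiple MapReduce jobs of a single query, Hive also has the `hive.exec.compress.intermediate` property.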