Agree with Mark.

On Fri, Jun 8, 2012 at 5:08 PM, Mark Grover <grover.markgro...@gmail.com> wrote:
> Hi Sreenath,
> All the points made on this thread are very valid. However, I wanted to
> add that you should keep in mind that Gzip compression is not splittable.
> This is due to the very nature of the codec. So, if your input data
> contains files in Gzip format larger than the HDFS block size, Hadoop
> won't be able to split those files, and each entire file will be sent to
> a single mapper. This reduces the performance of the job.
>
> As Vinod mentioned, Snappy is getting some traction. Definitely worth a
> shot!
>
> Good luck!
> Mark
>
> On Wed, Jun 6, 2012 at 2:07 PM, Vinod Singh <vi...@vinodsingh.com> wrote:
>
>> But it may pay off by saving on network IO while copying the data during
>> the reduce phase, though that will vary from case to case. We had good
>> results using the Snappy codec to compress map output. Snappy provides
>> reasonably good compression at a faster rate.
>>
>> Thanks,
>> Vinod
>>
>> http://blog.vinodsingh.com/
>>
>> On Wed, Jun 6, 2012 at 4:03 PM, Debarshi Basak <debarshi.ba...@tcs.com> wrote:
>>
>>> Compression is an overhead when you have a CPU-intensive job.
>>>
>>> Debarshi Basak
>>> Tata Consultancy Services
>>> Mailto: debarshi.ba...@tcs.com
>>> Website: http://www.tcs.com
>>>
>>> -----Bejoy Ks wrote: -----
>>>
>>> To: "user@hive.apache.org" <user@hive.apache.org>
>>> From: Bejoy Ks <bejoy...@yahoo.com>
>>> Date: 06/06/2012 03:37PM
>>> Subject: Re: Compressed data storage in HDFS - Error
>>>
>>> Hi Sreenath,
>>>
>>> Output compression is more useful at the storage level: when a large
>>> file is compressed it occupies fewer HDFS blocks, and the cluster
>>> thereby becomes more scalable in terms of the number of files it can hold.
>>>
>>> Yes, the LZO libraries need to be present on all TaskTracker nodes, as
>>> well as on the node that hosts the Hive client.
>>>
>>> Regards,
>>> Bejoy KS
>>>
>>> ------------------------------
>>> *From:* Sreenath Menon <sreenathmen...@gmail.com>
>>> *To:* user@hive.apache.org; Bejoy Ks <bejoy...@yahoo.com>
>>> *Sent:* Wednesday, June 6, 2012 3:25 PM
>>> *Subject:* Re: Compressed data storage in HDFS - Error
>>>
>>> Hi Bejoy,
>>> I would like to make this clear: there is no gain in processing
>>> throughput/time from compressing the data stored in HDFS (not talking
>>> about intermediate compression) ... right?
>>> And do I need to add the LZO libraries to HADOOP_HOME/lib/native on
>>> all the nodes (including the slave nodes)?

-- 
Raja Thiruvathuru
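[Editor's note] The map-output compression Vinod describes and the job-output compression Bejoy describes are typically enabled from a Hive session with settings along these lines. This is a sketch using the Hadoop 1.x-era property names current at the time of the thread; verify the names against your cluster's version:

```sql
-- Compress intermediate map output with Snappy (saves shuffle/network IO).
SET mapred.compress.map.output=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Compress the final job output written to HDFS (saves blocks/storage).
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
```

Note that the output codec choice circles back to Mark's warning: Gzip-compressed output that later becomes input to another job will not be splittable, whereas Snappy inside a block-oriented container (e.g. SequenceFile) remains split-friendly.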
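[Editor's note] Mark's point about Gzip not being splittable can be demonstrated outside Hadoop. A gzip file has a single stream header at byte 0, so a reader handed an arbitrary middle slice of the file, which is exactly the position a second input split would start from, cannot decompress it. A small illustrative sketch in Python (not from the thread):

```python
import gzip
import zlib

data = b"hello hadoop " * 10000
compressed = gzip.compress(data)

# From the start of the stream, decompression works fine.
assert gzip.decompress(compressed) == data

# From the middle of the stream -- where Hadoop would have to begin a
# second input split -- there is no gzip header and no way to
# resynchronize, so decompression fails. This is why a large .gz file
# is handed whole to a single mapper.
midpoint = len(compressed) // 2
try:
    zlib.decompress(compressed[midpoint:], wbits=31)  # wbits=31: expect gzip framing
    can_split = True
except zlib.error:
    can_split = False

print(can_split)  # False: gzip is not splittable
```

Block-oriented codecs (or container formats such as SequenceFiles with a block codec) avoid this by writing resynchronization points, which is what makes them split-friendly.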