Agree with Mark.

On Fri, Jun 8, 2012 at 5:08 PM, Mark Grover <grover.markgro...@gmail.com> wrote:
> Hi Sreenath,
> All the points made on this thread are very valid. However, I wanted to
> add that you should keep in mind that Gzip compression is not splittable.
> This is due to the very nature of the codec. So, if your input data
> contains files in Gzip format larger than the HDFS block size, Hadoop
> won't be able to split those files, and each entire file will be sent to
> a single mapper. This reduces the performance of the job.
>
> As Vinod mentioned, Snappy is getting some traction. Definitely worth a
> shot!
>
> Good luck!
> Mark
>
> On Wed, Jun 6, 2012 at 2:07 PM, Vinod Singh <vi...@vinodsingh.com> wrote:
>
>> But it may pay off by saving on network IO while copying the data during
>> the reduce phase, though that will vary from case to case. We had good
>> results using the Snappy codec to compress map output. Snappy provides
>> reasonably good compression at a faster rate.
>>
>> Thanks,
>> Vinod
>>
>> http://blog.vinodsingh.com/
>>
>> On Wed, Jun 6, 2012 at 4:03 PM, Debarshi Basak <debarshi.ba...@tcs.com> wrote:
>>
>>> Compression is an overhead when you have a CPU-intensive job.
>>>
>>> Debarshi Basak
>>> Tata Consultancy Services
>>> Mailto: debarshi.ba...@tcs.com
>>> Website: http://www.tcs.com
>>>
>>> -----Bejoy Ks wrote: -----
>>>
>>> To: "user@hive.apache.org" <user@hive.apache.org>
>>> From: Bejoy Ks <bejoy...@yahoo.com>
>>> Date: 06/06/2012 03:37PM
>>> Subject: Re: Compressed data storage in HDFS - Error
>>>
>>> Hi Sreenath,
>>>
>>> Output compression is more useful at the storage level: when a large
>>> file is compressed it occupies fewer HDFS blocks, and the cluster
>>> thereby becomes more scalable in terms of the number of files it can hold.
>>>
>>> Yes, the LZO libraries need to be present on all TaskTracker nodes, as
>>> well as on the node that hosts the Hive client.
>>>
>>> Regards,
>>> Bejoy KS
>>>
>>> ------------------------------
>>> *From:* Sreenath Menon <sreenathmen...@gmail.com>
>>> *To:* user@hive.apache.org; Bejoy Ks <bejoy...@yahoo.com>
>>> *Sent:* Wednesday, June 6, 2012 3:25 PM
>>> *Subject:* Re: Compressed data storage in HDFS - Error
>>>
>>> Hi Bejoy,
>>> I would like to make this clear: there is no gain in processing
>>> throughput/time from compressing the data stored in HDFS (not talking
>>> about intermediate compression) ... right?
>>> And do I need to add the LZO libraries to HADOOP_HOME/lib/native on
>>> all the nodes (including the slave nodes)?

-- 
Raja Thiruvathuru
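[Editor's note] The map-output compression Vinod describes and the job-output compression Bejoy describes are typically enabled from a Hive session with settings along these lines. This is a sketch using the Hadoop 1.x-era property names current at the time of the thread; verify the names against your cluster's version:

```sql
-- Compress intermediate map output with Snappy (saves shuffle/network IO).
SET mapred.compress.map.output=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Compress the final job output written to HDFS (saves blocks/storage).
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
```

Note that the output codec choice circles back to Mark's warning: Gzip-compressed output that later becomes input to another job will not be splittable, whereas Snappy inside a block-oriented container (e.g. SequenceFile) remains split-friendly.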
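[Editor's note] Mark's point about Gzip not being splittable can be demonstrated outside Hadoop. A gzip file has a single stream header at byte 0, so a reader handed an arbitrary middle slice of the file, which is exactly the position a second input split would start from, cannot decompress it. A small illustrative sketch in Python (not from the thread):

```python
import gzip
import zlib

data = b"hello hadoop " * 10000
compressed = gzip.compress(data)

# From the start of the stream, decompression works fine.
assert gzip.decompress(compressed) == data

# From the middle of the stream -- where Hadoop would have to begin a
# second input split -- there is no gzip header and no way to
# resynchronize, so decompression fails. This is why a large .gz file
# is handed whole to a single mapper.
midpoint = len(compressed) // 2
try:
    zlib.decompress(compressed[midpoint:], wbits=31)  # wbits=31: expect gzip framing
    can_split = True
except zlib.error:
    can_split = False

print(can_split)  # False: gzip is not splittable
```

Block-oriented codecs (or container formats such as SequenceFiles with a block codec) avoid this by writing resynchronization points, which is what makes them split-friendly.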