Thanks Scott. I remember reading another one of your threads about the sync interval, but I had forgotten to change it. We will do some experimentation with the compression level and the sync interval.
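For anyone following along, here is a rough sketch of how we plan to tune both knobs through the plain Avro Java API (the class name, schema, file name, and the specific level/interval values are just illustrative, not what we've settled on):

    import java.io.File;
    import java.io.IOException;

    import org.apache.avro.Schema;
    import org.apache.avro.file.CodecFactory;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class DeflateTuningSketch {
        public static void main(String[] args) throws IOException {
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Event\","
                + "\"fields\":[{\"name\":\"msg\",\"type\":\"string\"}]}");

            DataFileWriter<GenericRecord> writer =
                new DataFileWriter<GenericRecord>(
                    new GenericDatumWriter<GenericRecord>(schema));

            // Lower deflate levels trade compression ratio for faster writes.
            writer.setCodec(CodecFactory.deflateCodec(3));

            // Larger blocks between sync markers tend to improve the
            // compression ratio; 1 MB here as an arbitrary example.
            // Must be set before create().
            writer.setSyncInterval(1 << 20);

            writer.create(schema, new File("events.avro"));

            GenericRecord rec = new GenericData.Record(schema);
            rec.put("msg", "hello");
            writer.append(rec);

            writer.close();
        }
    }

We'd then compare file sizes and write times across a few (level, sync interval) combinations on a representative slice of our data.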
On Fri, Sep 30, 2011 at 9:52 PM, Scott Carey <[email protected]> wrote:
> Yes, Avro Data Files are always splittable.
>
> You may want to up the default block size in the files if this is for
> MapReduce. The block size can often have a bigger impact on the
> compression ratio than the compression level setting.
>
> If you are sensitive to the write performance, you might want lower
> deflate compression levels as well. The read performance is relatively
> constant for deflate as the compression level changes (except for
> uncompressed level 0), but the write performance varies quite a bit
> between compression levels 1 and 9 -- typically a factor of 5 or 6.
>
> On 9/30/11 6:42 PM, "Eric Hauser" <[email protected]> wrote:
>
>> A coworker and I were having a conversation today about choosing a
>> compression algorithm for some data we are storing in Hadoop. We have
>> been using avro-utils (https://github.com/tomslabs/avro-utils) for our
>> Map/Reduce jobs and Haivvreo for integration with Hive. By default, the
>> avro-utils OutputFormat uses deflate compression. Even though
>> deflate/zlib/gzip files are not splittable, we decided that Avro data
>> files are always splittable because individual blocks within the file
>> are compressed instead of the entire file.
>>
>> Is this accurate? Thanks.
