Thanks Scott. I remember reading another one of your threads about the sync interval, but I had forgotten to change it. We will do some experimentation with the compression level and the sync interval.
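For anyone following along, here is a rough sketch of how we plan to tune both knobs through the plain Avro Java API (the class name, schema, file name, and the specific level/interval values are just illustrative, not what we've settled on):

    import java.io.File;
    import java.io.IOException;

    import org.apache.avro.Schema;
    import org.apache.avro.file.CodecFactory;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class DeflateTuningSketch {
        public static void main(String[] args) throws IOException {
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Event\","
                + "\"fields\":[{\"name\":\"msg\",\"type\":\"string\"}]}");

            DataFileWriter<GenericRecord> writer =
                new DataFileWriter<GenericRecord>(
                    new GenericDatumWriter<GenericRecord>(schema));

            // Lower deflate levels trade compression ratio for faster writes.
            writer.setCodec(CodecFactory.deflateCodec(3));

            // Larger blocks between sync markers tend to improve the
            // compression ratio; 1 MB here as an arbitrary example.
            // Must be set before create().
            writer.setSyncInterval(1 << 20);

            writer.create(schema, new File("events.avro"));

            GenericRecord rec = new GenericData.Record(schema);
            rec.put("msg", "hello");
            writer.append(rec);

            writer.close();
        }
    }

We'd then compare file sizes and write times across a few (level, sync interval) combinations on a representative slice of our data.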
On Fri, Sep 30, 2011 at 9:52 PM, Scott Carey <[email protected]> wrote:
> Yes, Avro Data Files are always splittable.
>
> You may want to up the default block size in the files if this is for
> MapReduce. The block size can often have a bigger impact on the
> compression ratio than the compression level setting.
>
> If you are sensitive to the write performance, you might want lower
> deflate compression levels as well. The read performance is relatively
> constant for deflate as the compression level changes (except for
> uncompressed level 0), but the write performance varies quite a bit
> between compression levels 1 and 9 -- typically a factor of 5 or 6.
>
> On 9/30/11 6:42 PM, "Eric Hauser" <[email protected]> wrote:
>
>> A coworker and I were having a conversation today about choosing a
>> compression algorithm for some data we are storing in Hadoop. We have
>> been using avro-utils (https://github.com/tomslabs/avro-utils) for our
>> Map/Reduce jobs and Haivvreo for integration with Hive. By default, the
>> avro-utils OutputFormat uses deflate compression. Even though
>> deflate/zlib/gzip files are not splittable, we decided that Avro data
>> files are always splittable because individual blocks within the file
>> are compressed instead of the entire file.
>>
>> Is this accurate? Thanks.
