A coworker and I were having a conversation today about choosing a compression algorithm for some data we are storing in Hadoop. We have been using avro-utils (https://github.com/tomslabs/avro-utils) for our Map/Reduce jobs and Haivvreo for integration with Hive. By default, the avro-utils OutputFormat uses deflate compression. Even though deflate/zlib/gzip-compressed files are not splittable on their own, we concluded that Avro data files remain splittable regardless of codec, because compression is applied to individual blocks within the file rather than to the file as a whole.
Is this accurate? Thanks.
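
For reference, here is a minimal sketch of the kind of setup we mean, using the core Avro Java API directly rather than the avro-utils wrapper (the schema and file name are made up for illustration):

```java
import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroDeflateExample {
    // Hypothetical schema, for illustration only.
    private static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
      + "{\"name\":\"id\",\"type\":\"long\"},"
      + "{\"name\":\"payload\",\"type\":\"string\"}]}";

    public static void main(String[] args) throws IOException {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
        File file = new File("events.avro");

        // Write an Avro data file with the deflate codec. Compression is
        // applied per block inside the container, and blocks are separated
        // by sync markers, which is what keeps the file splittable.
        DataFileWriter<GenericRecord> writer =
            new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
        writer.setCodec(CodecFactory.deflateCodec(6)); // deflate, level 6
        writer.create(schema, file);
        for (long i = 0; i < 100_000; i++) {
            GenericRecord record = new GenericData.Record(schema);
            record.put("id", i);
            record.put("payload", "row-" + i);
            writer.append(record);
        }
        writer.close();

        // Read it back; the reader locates block boundaries via the sync
        // markers and decompresses each block independently.
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(file, new GenericDatumReader<GenericRecord>(schema))) {
            System.out.println("codec: " + reader.getMetaString("avro.codec"));
        }
    }
}
```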
