Thanks for the reply, Joshua. I tried to compress my data with deflate and didn't see much improvement beyond a sync interval of 1MB. Using bzip2 and xz, I got about a 30% and 50% space improvement, respectively, compared to deflate.
I'm interested in trying zstandard compression next. It's included in Hadoop <https://issues.apache.org/jira/browse/HADOOP-13578>, but not supported by default in CodecFactory <https://avro.apache.org/docs/1.8.1/api/java/org/apache/avro/file/CodecFactory.html#fromString(java.lang.String)>.

*How can I add a non-default compression algorithm to Avro?*

This is the code that I'm currently using:

*DataFileWriter.setCodec(CodecFactory.deflateCodec(9));*

(I've pasted a rough sketch of what I'm considering below the quoted thread.)

On Sun, Jun 24, 2018 at 12:37 PM, Joshua Martell <[email protected]> wrote:

> Reading a split that doesn’t start at the beginning of the file must seek
> to the next block boundary to start reading. Compression should improve
> some with larger blocks, but you’ll pay for it in the extra seek time.
>
> It’s always best to run tests with your specific use case though.
>
> Joshua
>
> On Sat, Jun 23, 2018 at 11:37 PM Benson Qiu <[email protected]> wrote:
>
>> Hi,
>>
>> I have Avro files compressed with deflate (compression level 9). I am
>> wondering if increasing the sync interval, which to my understanding
>> implies increasing the size of each Avro block, would lead to better
>> compression ratios.
>>
>> I see that suggested values for the sync interval
>> <https://avro.apache.org/docs/1.7.6/api/java/org/apache/avro/file/DataFileWriter.html#setSyncInterval(int)>
>> are between 2KB and 2MB. However, I have been unable to find any
>> explanation *why* those are the optimal intervals. Given that my HDFS
>> block size is something around 128MB, why is the max suggested sync
>> interval only 2MB?
>>
>> Thanks,
>> Ben
>>
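
Here is the rough, untested sketch I mentioned above, in case it helps. It assumes the zstd-jni library (com.github.luben:zstd-jni) is on the classpath; the class name ZstandardCodec, the codec name "zstandard", and the nested Option factory are just placeholders I made up. I believe org.apache.avro.file.Codec is package-private in 1.8.1, so the class may need to be declared in that package (or you may need a newer Avro that exposes it).

// Rough sketch of a custom codec, not tested. Declared in org.apache.avro.file
// because Codec appears to be package-private in Avro 1.8.x.
package org.apache.avro.file;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

import com.github.luben.zstd.ZstdInputStream;
import com.github.luben.zstd.ZstdOutputStream;

public class ZstandardCodec extends Codec {

  // Factory to pass to DataFileWriter.setCodec(...) or register by name.
  public static class Option extends CodecFactory {
    @Override
    protected Codec createInstance() {
      return new ZstandardCodec();
    }
  }

  @Override
  public String getName() {
    return "zstandard"; // written into the file metadata (avro.codec)
  }

  @Override
  public ByteBuffer compress(ByteBuffer data) throws IOException {
    // Assumes an array-backed ByteBuffer, as Avro's built-in codecs do.
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    try (ZstdOutputStream out = new ZstdOutputStream(baos)) {
      out.write(data.array(), data.arrayOffset() + data.position(), data.remaining());
    }
    return ByteBuffer.wrap(baos.toByteArray());
  }

  @Override
  public ByteBuffer decompress(ByteBuffer data) throws IOException {
    ByteArrayInputStream bais = new ByteArrayInputStream(
        data.array(), data.arrayOffset() + data.position(), data.remaining());
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    try (ZstdInputStream in = new ZstdInputStream(bais)) {
      byte[] buffer = new byte[8192];
      int n;
      while ((n = in.read(buffer)) > 0) {
        baos.write(buffer, 0, n);
      }
    }
    return ByteBuffer.wrap(baos.toByteArray());
  }

  @Override
  public int hashCode() {
    return getName().hashCode();
  }

  @Override
  public boolean equals(Object other) {
    return this == other || (other != null && other.getClass() == getClass());
  }
}

On the writer side I would then register and select it like this:

// Register under a name, then use it for writing.
CodecFactory.addCodec("zstandard", new ZstandardCodec.Option());
dataFileWriter.setCodec(CodecFactory.fromString("zstandard"));

As far as I can tell, the reader side would need the same CodecFactory.addCodec(...) registration in its JVM, since DataFileStream resolves the codec by the name stored in the file's metadata; without it, files written this way couldn't be decoded. Does this look like a reasonable approach, or is there a supported way to plug in a codec?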
