Took a best-effort stab at it: https://issues.apache.org/jira/browse/AVRO-2195
I would appreciate any comments!

On Thu, Jun 28, 2018 at 9:41 PM, Benson Qiu <[email protected]> wrote:

> Thanks for the reply, Joshua.
>
> I tried to compress my data with deflate and didn't see much improvement
> beyond a sync interval of 1MB. Using bzip2 and xz, I got about a 30% and
> 50% space improvement, respectively, compared to deflate.
>
> I'm interested in trying zstandard compression next. It's included in
> Hadoop <https://issues.apache.org/jira/browse/HADOOP-13578>, but not
> supported by default in CodecFactory
> <https://avro.apache.org/docs/1.8.1/api/java/org/apache/avro/file/CodecFactory.html#fromString(java.lang.String)>.
> *How can I add a non-default compression algorithm to Avro?*
>
> This is the code that I'm currently using:
> *DataFileWriter.setCodec(CodecFactory.deflateCodec(9));*
>
> On Sun, Jun 24, 2018 at 12:37 PM, Joshua Martell <[email protected]> wrote:
>
>> Reading a split that doesn’t start at the beginning of the file must seek
>> to the next block boundary to start reading. Compression should improve
>> some with larger blocks, but you’ll pay for it in the extra seek time.
>>
>> It’s always best to run tests with your specific use case though.
>>
>> Joshua
>>
>> On Sat, Jun 23, 2018 at 11:37 PM Benson Qiu <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I have Avro files compressed with deflate (compression level 9). I am
>>> wondering if increasing the sync interval, which to my understanding
>>> implies increasing the size of each Avro block, would lead to better
>>> compression ratios.
>>>
>>> I see that suggested values for the sync interval
>>> <https://avro.apache.org/docs/1.7.6/api/java/org/apache/avro/file/DataFileWriter.html#setSyncInterval(int)>
>>> are between 2KB and 2MB. However, I have been unable to find any
>>> explanation *why* those are the optimal intervals. Given that my HDFS
>>> block size is something around 128MB, why is the max suggested sync
>>> interval only 2MB?
>>>
>>> Thanks,
>>> Ben
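
For anyone who wants to reproduce the block-size effect discussed in the quoted thread without an Avro dependency, here is a rough sketch. It is not Avro code: it only uses java.util.zip.Deflater at level 9 and compresses fixed-size chunks independently, which I am assuming is a fair stand-in for per-block compression. The class name, helper names, and sample records are all made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Hypothetical sketch: NOT Avro's writer, just independent per-block deflate.
class BlockCompressionSketch {

    // Deflate one block on its own (level 9) and return the compressed length.
    static int deflatedLength(byte[] data, int offset, int length) {
        Deflater deflater = new Deflater(9);
        deflater.setInput(data, offset, length);
        deflater.finish();
        byte[] buffer = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    // Split the data into blockSize chunks, compress each separately,
    // and sum the compressed sizes.
    static long totalCompressedSize(byte[] data, int blockSize) {
        long total = 0;
        for (int offset = 0; offset < data.length; offset += blockSize) {
            int length = Math.min(blockSize, data.length - offset);
            total += deflatedLength(data, offset, length);
        }
        return total;
    }

    // Made-up sample data: repetitive text records, roughly 2.5MB for 50,000 records.
    static byte[] sampleData(int records) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < records; i++) {
            sb.append("{\"id\":").append(i)
              .append(",\"status\":\"ok\",\"region\":\"us-east\"}\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = sampleData(50_000);
        int[] blockSizes = {2 * 1024, 64 * 1024, 1024 * 1024, 2 * 1024 * 1024};
        for (int blockSize : blockSizes) {
            System.out.printf("block=%d bytes -> compressed=%d bytes%n",
                    blockSize, totalCompressedSize(data, blockSize));
        }
    }
}
```

Deflate only looks back over a 32KB window, which may be part of why little changed beyond 1MB in the deflate test above. Codecs with larger windows could behave differently, so it is worth measuring on real data.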
