Thanks for the reply, Joshua.

I tried compressing my data with deflate and didn't see much improvement in
compression ratio beyond a sync interval of 1MB. Using bzip2 and xz, I got about a 30% and
50% space improvement, respectively, compared to deflate.

I'm interested in trying zstandard compression next. It's included in Hadoop
<https://issues.apache.org/jira/browse/HADOOP-13578>, but not supported by
default in CodecFactory
<https://avro.apache.org/docs/1.8.1/api/java/org/apache/avro/file/CodecFactory.html#fromString(java.lang.String)>.
*How can I add a non-default compression algorithm to Avro?*

This is the code that I'm currently using:
*dataFileWriter.setCodec(CodecFactory.deflateCodec(9));*
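
From what I can tell, CodecFactory also exposes a public addCodec(String,
CodecFactory) method for registering codecs beyond the built-in ones, so one
route might be to wrap a zstd library in a custom Codec / CodecFactory pair
and register it under a new name. Below is a rough sketch of what I have in
mind. It assumes the zstd-jni library (com.github.luben:zstd-jni), and the
class names and the "zstandard" codec string are placeholders I made up, so
please correct me if this isn't the intended extension point:

    import java.io.IOException;
    import java.nio.ByteBuffer;

    import org.apache.avro.file.Codec;
    import org.apache.avro.file.CodecFactory;

    import com.github.luben.zstd.Zstd;

    // Hypothetical codec that compresses/decompresses whole Avro blocks with zstd-jni.
    public class ZstandardCodec extends Codec {
      private final int level;

      public ZstandardCodec(int level) {
        this.level = level;
      }

      @Override
      public String getName() {
        // This string is written into the file metadata under "avro.codec".
        return "zstandard";
      }

      @Override
      public ByteBuffer compress(ByteBuffer data) throws IOException {
        return ByteBuffer.wrap(Zstd.compress(toBytes(data), level));
      }

      @Override
      public ByteBuffer decompress(ByteBuffer data) throws IOException {
        byte[] compressed = toBytes(data);
        // zstd records the uncompressed size in the frame header by default.
        int uncompressedSize = (int) Zstd.decompressedSize(compressed);
        return ByteBuffer.wrap(Zstd.decompress(compressed, uncompressedSize));
      }

      // Copies the remaining bytes out of the buffer without disturbing its position.
      private static byte[] toBytes(ByteBuffer buffer) {
        byte[] bytes = new byte[buffer.remaining()];
        buffer.duplicate().get(bytes);
        return bytes;
      }

      @Override
      public boolean equals(Object other) {
        return this == other || other instanceof ZstandardCodec;
      }

      @Override
      public int hashCode() {
        return getName().hashCode();
      }
    }

    // Hypothetical factory (separate file) so the codec can be registered by name.
    public class ZstandardCodecFactory extends CodecFactory {
      private final int level;

      public ZstandardCodecFactory(int level) {
        this.level = level;
      }

      @Override
      protected Codec createInstance() {
        return new ZstandardCodec(level);
      }
    }

and then, at startup:

    // Register once, then select it like any built-in codec.
    CodecFactory.addCodec("zstandard", new ZstandardCodecFactory(3));
    dataFileWriter.setCodec(CodecFactory.fromString("zstandard"));

My understanding is that the reader resolves the codec name from the file
metadata through CodecFactory.fromString, so the same registration would also
need to happen in whatever process reads these files. Is that right?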


On Sun, Jun 24, 2018 at 12:37 PM, Joshua Martell <[email protected]>
wrote:

> A reader of a split that doesn't start at the beginning of the file must seek
> to the next block boundary before it can start reading. Compression should
> improve somewhat with larger blocks, but you'll pay for it in extra seek time.
>
> It’s always best to run tests with your specific use case though.
>
> Joshua
>
> On Sat, Jun 23, 2018 at 11:37 PM Benson Qiu <[email protected]> wrote:
>
>> Hi,
>>
>> I have Avro files compressed with deflate (compression level 9). I am
>> wondering if increasing the sync interval, which to my understanding
>> implies increasing the size of each Avro block, would lead to better
>> compression ratios.
>>
>> I see that suggested values for the sync interval
>> <https://avro.apache.org/docs/1.7.6/api/java/org/apache/avro/file/DataFileWriter.html#setSyncInterval(int)>
>> are between 2KB and 2MB. However, I have been unable to find any
>> explanation *why* those are the optimal intervals. Given that my HDFS
>> block size is something around 128MB, why is the max suggested sync
>> interval only 2MB?
>>
>> Thanks,
>> Ben
>>
>
