Took a best-effort stab at it:
https://issues.apache.org/jira/browse/AVRO-2195

I would appreciate any comments!

On Thu, Jun 28, 2018 at 9:41 PM, Benson Qiu <[email protected]> wrote:

> Thanks for the reply, Joshua.
>
> I tried to compress my data with deflate and didn't see much improvement
> beyond a sync interval of 1MB. Using bzip2 and xz, I got about a 30% and
> 50% space improvement, respectively, compared to deflate.
>
> I'm interested in trying zstandard compression next. It's included in
> Hadoop <https://issues.apache.org/jira/browse/HADOOP-13578>, but not
> supported by default in CodecFactory
> <https://avro.apache.org/docs/1.8.1/api/java/org/apache/avro/file/CodecFactory.html#fromString(java.lang.String)>.
> *How can I add a non-default compression algorithm to Avro?*
>
> This is the code that I'm currently using:
> *DataFileWriter.setCodec(CodecFactory.deflateCodec(9));*
>
>
> On Sun, Jun 24, 2018 at 12:37 PM, Joshua Martell <[email protected]> wrote:
>
>> Reading a split that doesn’t start at the beginning of the file must seek
>> to the next block boundary to start reading. Compression should improve
>> some with larger blocks, but you’ll pay for it in the extra seek time.
>>
>> It’s always best to run tests with your specific use case though.
>>
>> Joshua
>>
>> On Sat, Jun 23, 2018 at 11:37 PM Benson Qiu <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I have Avro files compressed with deflate (compression level 9). I am
>>> wondering if increasing the sync interval, which to my understanding
>>> implies increasing the size of each Avro block, would lead to better
>>> compression ratios.
>>>
>>> I see that suggested values for the sync interval
>>> <https://avro.apache.org/docs/1.7.6/api/java/org/apache/avro/file/DataFileWriter.html#setSyncInterval(int)>
>>> are between 2KB and 2MB. However, I have been unable to find any
>>> explanation *why* those are the optimal intervals. Given that my HDFS
>>> block size is something around 128MB, why is the max suggested sync
>>> interval only 2MB?
>>>
>>> Thanks,
>>> Ben
>>>
>>
>
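
For anyone who wants to reproduce the block-size effect discussed above without an Avro dependency, one rough approach is to deflate the same bytes in independent chunks of different sizes and compare the totals. The sketch below only approximates per-block compression in a container file. The class name, the synthetic JSON-like records, and the chosen block sizes are all made up for illustration, and real data will behave differently. It uses only java.util.zip from the JDK.

```java
import java.nio.charset.StandardCharsets;
import java.util.Random;
import java.util.zip.Deflater;

// Hypothetical helper, not part of Avro: compresses a buffer in independent
// chunks to mimic "one compressed block per sync interval".
public class BlockSizeCompression {

    /** Deflates data in independent chunks of blockSize bytes; returns total compressed bytes. */
    public static long compressedSize(byte[] data, int blockSize, int level) {
        long total = 0;
        byte[] out = new byte[64 * 1024]; // scratch buffer; the output itself is discarded
        for (int off = 0; off < data.length; off += blockSize) {
            int len = Math.min(blockSize, data.length - off);
            // A fresh Deflater per chunk means no dictionary is shared across chunks.
            Deflater deflater = new Deflater(level, true); // true = raw deflate, no zlib header
            deflater.setInput(data, off, len);
            deflater.finish();
            while (!deflater.finished()) {
                total += deflater.deflate(out);
            }
            deflater.end();
        }
        return total;
    }

    /** Builds at least size bytes of synthetic, record-like text (an assumption, not real data). */
    public static byte[] sampleData(int size) {
        Random rnd = new Random(42); // fixed seed so runs are repeatable
        String[] cities = {"Austin", "Boston", "Chicago", "Denver", "Seattle"};
        StringBuilder sb = new StringBuilder(size + 64);
        while (sb.length() < size) {
            sb.append("{\"id\":").append(rnd.nextInt(1_000_000))
              .append(",\"city\":\"").append(cities[rnd.nextInt(cities.length)])
              .append("\",\"score\":").append(rnd.nextInt(100))
              .append("}\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = sampleData(4 << 20); // roughly 4 MB of input
        int[] blockSizes = {2 << 10, 16 << 10, 64 << 10, 1 << 20, 2 << 20};
        for (int block : blockSizes) {
            long compressed = compressedSize(data, block, 9);
            System.out.printf("block=%8d bytes  compressed=%8d bytes  ratio=%.3f%n",
                    block, compressed, (double) compressed / data.length);
        }
    }
}
```

One thing to keep in mind when reading the numbers: deflate can only look back 32 KB for matches, so once a block is much larger than that, the remaining gain comes mostly from amortizing per-block overhead. That is probably one reason the improvement flattens out at larger sync intervals, while codecs with larger windows (bzip2, xz) keep benefiting.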
