A reader whose split doesn’t start at the beginning of the file must first seek forward to the next block boundary before it can start reading. Compression should improve somewhat with larger blocks, but you’ll pay for it in extra seek time.
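As a rough way to see the compression side of that trade-off, here is a minimal sketch that deflates the same data in independently compressed blocks of different sizes and compares the ratios. It does not read or write Avro: it uses Python's zlib at level 9 as a stand-in for the deflate codec, and the record generator and both function names are hypothetical, made up for illustration.

```python
import json
import random
import zlib


def make_records(n, seed=0):
    """Generate n JSON-encoded records (hypothetical stand-in data)."""
    rng = random.Random(seed)
    cities = ["Austin", "Boston", "Chicago", "Denver"]
    return [
        json.dumps(
            {"id": i, "city": rng.choice(cities), "score": rng.randint(0, 1000)}
        ).encode()
        for i in range(n)
    ]


def compressed_size(records, block_bytes, level=9):
    """Total compressed size when records are grouped into blocks of
    roughly block_bytes and each block is compressed on its own,
    loosely mimicking one block per sync interval."""
    total = 0
    block = bytearray()
    for rec in records:
        block += rec
        if len(block) >= block_bytes:
            total += len(zlib.compress(bytes(block), level))
            block.clear()
    if block:  # flush the final partial block
        total += len(zlib.compress(bytes(block), level))
    return total


if __name__ == "__main__":
    records = make_records(50_000)
    raw = sum(len(r) for r in records)
    for size in (2 * 1024, 16 * 1024, 64 * 1024, 2 * 1024 * 1024):
        ratio = raw / compressed_size(records, size)
        print(f"{size:>8} byte blocks: compression ratio {ratio:.2f}")
```

This only measures compressed size. It says nothing about the extra seek cost for readers, and the numbers will depend heavily on how repetitive your real records are.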
It’s always best to run tests with your specific use case though.

Joshua

On Sat, Jun 23, 2018 at 11:37 PM Benson Qiu <[email protected]> wrote:

> Hi,
>
> I have Avro files compressed with deflate (compression level 9). I am
> wondering if increasing the sync interval, which to my understanding
> implies increasing the size of each Avro block, would lead to better
> compression ratios.
>
> I see that suggested values for the sync interval
> <https://avro.apache.org/docs/1.7.6/api/java/org/apache/avro/file/DataFileWriter.html#setSyncInterval(int)>
> are between 2KB and 2MB. However, I have been unable to find any
> explanation *why* those are the optimal intervals. Given that my HDFS
> block size is something around 128MB, why is the max suggested sync
> interval only 2MB?
>
> Thanks,
> Ben
