Thanks, Dongjoon.

I'm just speaking hypothetically. I'm more curious whether there are performance gains in reading ORC data into a DataFrame compared to Avro. Would it operate any faster due to the compression, etc.?
On Tue, Nov 9, 2021 at 10:39 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

> Hi, Ryan.
>
> I don't think you have one 100GB Avro file in production. :)
> If you have one million 1MB or one thousand 1GB Avro files, it becomes a
> completely different story.
>
> Most big data compute engines like Spark/Hive/Trino/Impala support both of
> them because the use cases are different.
> I'd like to recommend you to test both of them simply in your use case. :)
>
> BTW, ORC has more advanced features like encryption and bloom filters
> while Avro doesn't.
>
> Dongjoon.
>
>
> On Tue, Nov 9, 2021 at 8:35 AM Ryan Schachte <coderyanschac...@gmail.com>
> wrote:
>
>> Hi everyone, I'm looking for a better understanding of ORC compared to
>> Avro when leveraging a big data compute engine like Spark.
>>
>> If I have 100GB dataset of Avro and the same dataset in ORC which
>> consumes 10GB, would the ORC dataset be more performant and consume less
>> memory than the Avro counterpart?
>>
>> My initial assumption was no because the data would both be deserialized
>> and I'm consuming the entire dataset for both, but wanted to have the
>> conversation to see if I'm thinking about that correctly.
>>
>> Cheers,
>> Ryan S.