Hi, Ryan. I don't think you have one 100GB Avro file in production. :) If you have one million 1MB or one thousand 1GB Avro files, it becomes a completely different story.
Most big data compute engines like Spark/Hive/Trino/Impala support both of them because the use cases are different. I'd recommend simply testing both of them in your use case. :) BTW, ORC has more advanced features, like encryption and bloom filters, which Avro doesn't.

Dongjoon

On Tue, Nov 9, 2021 at 8:35 AM Ryan Schachte <coderyanschac...@gmail.com> wrote:
> Hi everyone, I'm looking for a better understanding of ORC compared to
> Avro when leveraging a big data compute engine like Spark.
>
> If I have a 100GB dataset of Avro and the same dataset in ORC which consumes
> 10GB, would the ORC dataset be more performant and consume less memory than
> the Avro counterpart?
>
> My initial assumption was no, because the data would both be deserialized
> and I'm consuming the entire dataset for both, but wanted to have the
> conversation to see if I'm thinking about that correctly.
>
> Cheers,
> Ryan S.