As Dongjoon says, it depends on the use case. ORC is column-oriented, so it only reads the columns you request, which often saves a significant amount of I/O. It also uses run-length and dictionary encoding for many column types, so again it reads less data from storage. However, if you're reading every column of a wide record, row-oriented storage like Avro can be better, because the cost of stitching all the columns back together to rebuild a wide row is high.
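For example (a minimal sketch; the paths and column names are hypothetical, and reading Avro assumes the spark-avro package is on the classpath):

    // ORC: Spark pushes the projection into the reader, so only the two
    // requested column streams are read from storage.
    val orcDf = spark.read.orc("/data/events_orc")
      .select("user_id", "event_time")

    // Avro: the reader still scans every record end to end, because the
    // format is row-oriented; unused fields are dropped only after decode.
    val avroDf = spark.read.format("avro").load("/data/events_avro")
      .select("user_id", "event_time")

    orcDf.explain()  // the FileScan's ReadSchema shows just the two columns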
Alan.

On Tue, Nov 9, 2021 at 11:12 AM Ryan Schachte <coderyanschac...@gmail.com> wrote:

> Thanks Dongjoon,
> Just speaking hypothetically. More curious whether there are performance
> gains in reading ORC data into a dataframe compared to Avro. Would it
> operate any faster due to the compression, etc.?
>
> On Tue, Nov 9, 2021 at 10:39 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
> wrote:
>
>> Hi, Ryan.
>>
>> I don't think you have one 100GB Avro file in production. :)
>> If you have one million 1MB files or one thousand 1GB files, it becomes
>> a completely different story.
>>
>> Most big data compute engines like Spark/Hive/Trino/Impala support both
>> formats because the use cases are different.
>> I'd recommend simply testing both of them in your use case. :)
>>
>> BTW, ORC has more advanced features, like encryption and bloom filters,
>> that Avro doesn't.
>>
>> Dongjoon.
>>
>>
>> On Tue, Nov 9, 2021 at 8:35 AM Ryan Schachte <coderyanschac...@gmail.com>
>> wrote:
>>
>>> Hi everyone, I'm looking for a better understanding of ORC compared to
>>> Avro when leveraging a big data compute engine like Spark.
>>>
>>> If I have a 100GB dataset in Avro and the same dataset in ORC consuming
>>> 10GB, would the ORC dataset be more performant and consume less memory
>>> than the Avro counterpart?
>>>
>>> My initial assumption was no, because the data would be deserialized in
>>> both cases and I'm consuming the entire dataset either way, but I wanted
>>> to have the conversation to see if I'm thinking about that correctly.
>>>
>>> Cheers,
>>> Ryan S.
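P.S. A quick sketch of Dongjoon's "test both" suggestion, runnable from
spark-shell (the paths and column name are again hypothetical; do a
throwaway run first to warm the file cache before trusting the numbers):

    def time[T](label: String)(body: => T): T = {
      val start  = System.nanoTime()
      val result = body
      println(f"$label: ${(System.nanoTime() - start) / 1e9}%.2f s")
      result
    }

    // A narrow projection favors ORC; add more columns (or select *)
    // to see the gap shrink for wide reads.
    time("orc")  { spark.read.orc("/data/events_orc").select("user_id").count() }
    time("avro") { spark.read.format("avro").load("/data/events_avro").select("user_id").count() }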