As Dongjoon says, it depends on the use case.  ORC is column-oriented, so
a reader only has to fetch the columns you request, which often saves a
significant amount of I/O.  It also uses run-length and dictionary encoding
for many column types, so again less data is read from storage.
However, if you're reading every column of a wide record, row-oriented
storage like Avro can be better, because the cost of stitching all the
columns back together to rebuild a wide row is high.
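
To make the trade-off concrete, here is a rough toy sketch in plain Python.
It is not how the ORC or Avro readers are implemented; the table, the column
names, and every helper function are made up for illustration. It only shows
the three ideas above: run-length encoding, dictionary encoding, and the
stitching step needed to turn columns back into rows.

```python
from itertools import groupby

# Hypothetical toy table: 6 rows, 3 columns (names and values are invented).
ROWS = [
    {"id": 1, "country": "US", "status": "ok"},
    {"id": 2, "country": "US", "status": "ok"},
    {"id": 3, "country": "US", "status": "ok"},
    {"id": 4, "country": "DE", "status": "ok"},
    {"id": 5, "country": "DE", "status": "err"},
    {"id": 6, "country": "DE", "status": "err"},
]


def to_columns(rows):
    """Pivot row records into one list of values per column."""
    return {name: [r[name] for r in rows] for name in rows[0]}


def rle_encode(values):
    """Run-length encode: collapse each run of equal values to (value, count)."""
    return [(v, len(list(group))) for v, group in groupby(values)]


def rle_decode(runs):
    """Expand (value, count) pairs back into the original list."""
    return [v for v, count in runs for _ in range(count)]


def dict_encode(values):
    """Dictionary encode: keep each distinct value once, plus integer indexes."""
    dictionary = sorted(set(values))
    index = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [index[v] for v in values]


def fields_decoded_row_layout(rows):
    """A row-oriented reader decodes every field of every record,
    even when only a few columns are kept afterwards."""
    return sum(len(r) for r in rows)


def runs_read_column_layout(columns, wanted):
    """A column-oriented reader only touches the requested columns;
    here each one is counted by its number of run-length runs."""
    return sum(len(rle_encode(columns[name])) for name in wanted)


def stitch_rows(columns):
    """Rebuild full rows from columns: the extra work a columnar reader
    does when every column of a wide record is requested."""
    names = list(columns)
    return [dict(zip(names, vals)) for vals in zip(*(columns[n] for n in names))]


if __name__ == "__main__":
    cols = to_columns(ROWS)
    print(rle_encode(cols["country"]))
    print(dict_encode(cols["status"]))
    print(fields_decoded_row_layout(ROWS), runs_read_column_layout(cols, ["country"]))
    print(stitch_rows(cols) == ROWS)
```

Asking for only `country` touches a couple of runs in the column layout,
while the row layout still decodes every field of every record.  Asking for
all columns adds the `stitch_rows` step, and that cost grows with the width
of the record.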

Alan.

On Tue, Nov 9, 2021 at 11:12 AM Ryan Schachte <coderyanschac...@gmail.com>
wrote:

> Thanks Dongjoon,
> Just speaking hypothetically. More curious if there are performance gains
> in reading ORC data into a dataframe compared to Avro. Would it operate any
> faster due to the compression, etc?
>
> On Tue, Nov 9, 2021 at 10:39 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
> wrote:
>
>> Hi, Ryan.
>>
>> I don't think you have one 100GB Avro file in production. :)
>> If you have one million 1MB or one thousand 1GB Avro files, it becomes a
>> completely different story.
>>
>> Most big data compute engines like Spark/Hive/Trino/Impala support both
>> of them because the use cases are different.
>> I'd recommend simply testing both of them in your use case. :)
>>
>> BTW, ORC has more advanced features like encryption and bloom filters
>> while Avro doesn't.
>>
>> Dongjoon.
>>
>>
>> On Tue, Nov 9, 2021 at 8:35 AM Ryan Schachte <coderyanschac...@gmail.com>
>> wrote:
>>
>>> Hi everyone, I'm looking for a better understanding of ORC compared to
>>> Avro when leveraging a big data compute engine like Spark.
>>>
>>> If I have a 100GB dataset in Avro and the same dataset in ORC, where it
>>> takes up 10GB, would the ORC dataset be more performant and consume less
>>> memory than the Avro counterpart?
>>>
>>> My initial assumption was no, because both datasets would be deserialized
>>> and I'm consuming the entire dataset in both cases, but I wanted to have
>>> the conversation to see if I'm thinking about that correctly.
>>>
>>> Cheers,
>>> Ryan S.
>>>
>>>
>>>
