Re: Avro vs ORC in Spark

Ryan Schachte Tue, 09 Nov 2021 11:12:39 -0800

Thanks Dongjoon,
Just speaking hypothetically. I'm more curious whether there are performance
gains when reading ORC data into a DataFrame compared to Avro. Would it run any
faster because of the compression, etc.?
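
For context on the question, here is a toy sketch of why file layout can matter when reading. It does not use the real Avro or ORC encodings: JSON plus zlib stand in for them, and the records and field names (`id`, `country`, `amount`) are made up for illustration. Avro stores data row by row, while ORC stores it column by column, so a reader that needs only one column can skip the rest.

```python
import json
import zlib

# Hypothetical toy records; JSON + zlib stand in for the real formats.
rows = [
    {"id": i, "country": "US" if i % 2 else "DE", "amount": i % 100}
    for i in range(10_000)
]

# Row-oriented layout (Avro-like): the fields of each record sit together.
row_bytes = "\n".join(json.dumps(r) for r in rows).encode()
row_compressed = len(zlib.compress(row_bytes))

# Column-oriented layout (ORC-like): each column's values sit together
# and are compressed independently.
columns = {
    name: json.dumps([r[name] for r in rows]).encode()
    for name in rows[0]
}
col_compressed = {
    name: len(zlib.compress(data)) for name, data in columns.items()
}

# To read only "amount", the row layout has to decompress every record,
# while the columnar layout touches just that one column.
bytes_read_row = row_compressed
bytes_read_col = col_compressed["amount"]

print("row layout, bytes read for 'amount':     ", bytes_read_row)
print("columnar layout, bytes read for 'amount':", bytes_read_col)
print("columnar total, all columns:             ", sum(col_compressed.values()))
```

If every column of every row is consumed, the column-skipping advantage goes away and any remaining difference would come mostly from reading fewer bytes off disk, so measuring on the real data is still the way to find out.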


On Tue, Nov 9, 2021 at 10:39 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
wrote:

> Hi, Ryan.
>
> I don't think you have one 100GB Avro file in production. :)
> If you have one million 1MB or one thousand 1GB Avro files, it becomes a
> completely different story.
>
> Most big data compute engines like Spark/Hive/Trino/Impala support both of
> them because the use cases are different.
> I'd recommend simply testing both of them in your use case. :)
>
> BTW, ORC has more advanced features like encryption and bloom filters,
> which Avro doesn't have.
>
> Dongjoon.
>
>
> On Tue, Nov 9, 2021 at 8:35 AM Ryan Schachte <coderyanschac...@gmail.com>
> wrote:
>
>> Hi everyone, I'm looking for a better understanding of ORC compared to
>> Avro when leveraging a big data compute engine like Spark.
>>
>> If I have a 100GB dataset in Avro and the same dataset in ORC, which
>> consumes 10GB, would the ORC dataset be more performant and consume less
>> memory than the Avro counterpart?
>>
>> My initial assumption was no, because both datasets would be deserialized
>> and I'm consuming the entire dataset in both cases, but I wanted to have
>> the conversation to see if I'm thinking about that correctly.
>>
>> Cheers,
>> Ryan S.
>>
>>
>>
