Hi, Ryan. I don't think you have one 100GB Avro file in production. :) If you have one million 1MB or one thousand 1GB Avro files, it becomes a completely different story.
Most big data compute engines like Spark/Hive/Trino/Impala support both of them because the use cases are different. I'd recommend simply testing both of them in your use case. :) BTW, ORC has more advanced features, like encryption and bloom filters, which Avro doesn't.

Dongjoon

On Tue, Nov 9, 2021 at 8:35 AM Ryan Schachte <coderyanschac...@gmail.com> wrote:
> Hi everyone, I'm looking for a better understanding of ORC compared to
> Avro when leveraging a big data compute engine like Spark.
>
> If I have a 100GB dataset of Avro and the same dataset in ORC which consumes
> 10GB, would the ORC dataset be more performant and consume less memory than
> the Avro counterpart?
>
> My initial assumption was no, because the data would both be deserialized
> and I'm consuming the entire dataset for both, but wanted to have the
> conversation to see if I'm thinking about that correctly.
>
> Cheers,
> Ryan S.