When Parquet came out, it was developed by a community of companies and was designed as a library to be supported by multiple big data projects. nice
ORC, on the other hand, initially only supported Hive. It wasn't even designed as a library that could be reused; even today it brings in the kitchen sink of transitive dependencies. yikes

On Jul 26, 2016 5:09 AM, "Jörn Franke" <jornfra...@gmail.com> wrote:

> I think both are very similar, but with slightly different goals. While
> they work transparently for each Hadoop application, you need to enable
> specific support in the application for predicate push down.
> In the end you have to check which application you are using and do some
> tests (with correct predicate push down configuration). Keep in mind that
> both formats work best if they are sorted on filter columns (which is your
> responsibility) and if their optimizations are correctly configured (min/
> max index, bloom filter, compression etc.).
>
> If you need to ingest sensor data you may want to store it first in HBase
> and then batch process it into large files in ORC or Parquet format.
>
> On 26 Jul 2016, at 04:09, janardhan shetty <janardhan...@gmail.com> wrote:
>
> Just wondering about the advantages and disadvantages of converting data into
> ORC or Parquet.
>
> In the documentation of Spark there are numerous examples of Parquet format.
>
> Any strong reasons to choose Parquet over ORC file format?
>
> Also: current data compression is bzip2.
>
> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
> This seems biased.
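For concreteness, here is a minimal Spark (Scala) sketch of the knobs Jörn mentions: enabling predicate pushdown, sorting on the filter column, and setting compression / bloom filter options when writing Parquet or ORC. The DataFrame, paths, and the `sensor_id` / `event_time` columns are placeholders, and the ORC pushdown default varies by Spark version, so treat this as a starting point rather than a recipe.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-vs-orc-sketch")
  // predicate pushdown configuration; Parquet pushdown is usually on by default,
  // ORC pushdown has to be enabled explicitly on older Spark versions
  .config("spark.sql.parquet.filterPushdown", "true")
  .config("spark.sql.orc.filterPushdown", "true")
  .getOrCreate()

import spark.implicits._

// hypothetical sensor data; sort on the column you filter by most,
// so min/max statistics per row group / stripe can actually skip data
val df = spark.read.json("/data/raw/sensors")
val sorted = df.sortWithinPartitions("sensor_id", "event_time")

// Parquet with snappy compression
sorted.write
  .option("compression", "snappy")
  .parquet("/data/curated/sensors_parquet")

// ORC with zlib compression and a bloom filter on the filter column
sorted.write
  .format("orc")
  .option("compression", "zlib")
  .option("orc.bloom.filter.columns", "sensor_id")
  .save("/data/curated/sensors_orc")

// reads can now push this filter down into the file format
spark.read.parquet("/data/curated/sensors_parquet")
  .filter($"sensor_id" === "s-42")
  .show()
```

Note the use of sortWithinPartitions rather than a global sort: each output file stays sorted on the filter column, which is what keeps the min/max indexes (and the ORC bloom filter) selective without an expensive full shuffle.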