+1 Paul. Both have some pros and cons.

Hope this helps.

Avro:

Pros: 
1) Plays nicely with other tools, third-party or otherwise. It's also the choice 
when you specifically need a data type Avro has (like binary), though gladly that 
list is shrinking all the time (yay nested types in Impala).
2) Good for event data that changes over time. Can be used with the Kafka Schema 
Registry.
3) Schema evolution - supports adding, removing, renaming, and reordering columns.


Cons:
1) Cannot write Avro data from Impala (Avro tables are effectively read-only there).
2) The timestamp data type is not supported (Hive & Impala) - we can still store 
timestamps as longs and convert them in the end-user tables.
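
A minimal sketch of that workaround from Spark (table and column names are 
hypothetical; assume the Avro-backed table stores the timestamp as epoch 
milliseconds in a bigint column):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.functions.col

val sc = new SparkContext(new SparkConf().setAppName("avro-ts-workaround"))
val sqlContext = new HiveContext(sc)

// Hypothetical Avro-backed table; event_ts holds epoch milliseconds as a bigint.
val events = sqlContext.table("events_avro")

// Convert the long to a real timestamp only in the end-user-facing view.
events
  .withColumn("event_time", (col("event_ts") / 1000).cast("timestamp"))
  .registerTempTable("events_v")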

Parquet:

Pros:
1) High performance during both reads and writes - for both wide and narrow 
tables (though mileage may vary based on the dataset's cardinality); there's a 
sketch of the read path after this list.
2) Great for analytics - queries that touch only a subset of columns read only 
those columns.
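
Here's what that looks like from Spark (a minimal sketch; the path and column 
names are made up):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("parquet-column-pruning"))
val sqlContext = new SQLContext(sc)

// Hypothetical wide table stored as Parquet.
val df = sqlContext.read.parquet("hdfs:///warehouse/events_parquet")

// Only the c1 and c2 column chunks are read from disk; the rest of the
// (possibly very wide) schema is never scanned.
df.select("c1", "c2").groupBy("c1").count().show()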

Cons:
1) High memory and CPU usage. This is because Parquet files can't be streamed 
into as records arrive; rows have to be buffered into row groups before they can 
be written out. I have seen a lot of OOMs in reasonably sized MR/Spark containers 
that write out Parquet. Dynamic partitioning, where many writers are open at 
once, is the worst case - we've seen customers struggle to make it work, and it 
has made for some very confused ETL developers (one mitigation is sketched after 
this list).
2) Parquet lags well behind Avro in schema evolution semantics: you can only add 
columns at the end, deleting columns from the end is not recommended if you plan 
to add any columns in the future, and reordering is not supported in the current 
release.
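
For the dynamic-partitioning OOMs in 1), one common mitigation is to repartition 
by the partition column before the write, so a given partition's rows all land in 
the same task instead of every task holding an open Parquet writer (and its 
buffered row group) for every partition value it sees. A rough sketch (made-up 
paths and column names):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("parquet-dynamic-partitions"))
val sqlContext = new SQLContext(sc)

val events = sqlContext.read.parquet("hdfs:///staging/events")

// Repartitioning by event_date keeps the number of concurrently open Parquet
// writers per task down, which is where a lot of the memory pressure comes from.
events
  .repartition(events("event_date"))
  .write
  .partitionBy("event_date")
  .parquet("hdfs:///warehouse/events_by_date")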

Here is schema evolution in Parquet and Avro, and the associated best practices, 
in more detail.

With Parquet you can add columns to the end of the table definition and they'll 
be populated with NULL if missing in the file (you can technically delete columns 
from the end too, but then Impala will try to read the old, deleted column if you 
add a new column to the end afterwards).

Here is an example of how this works.

At t0 - we have a Parquet table with the columns c1, c2, c3, c4
     t1 - we delete the columns c3 and c4; the table now has the columns c1, c2
     t2 - we add a new column c5; the table now has the columns c1, c2, c5

So now when we query the table for columns c1, c2, and c5, Impala will actually 
try to read column c3 from the old files.

Impala currently matches the file schema to the table schema by column order, 
so at t2, if you select c5, Impala will try to read c3 from the old files, 
thinking it's c5 since it's the third column in the file. This will work if the 
types of c3 and c5 are the same (but it will probably not give the right 
results) or fail if the types don't match. It will ignore the old c4. Files with 
the new schema (c1, c2, c5) will be read correctly.
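
For what it's worth, Spark itself matches Parquet columns by name rather than 
position (AFAIK), so from SparkSQL you mostly just see the NULL-filling side of 
this. A small sketch (made-up paths and column names):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("parquet-schema-merge"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Old files, written before c5 existed.
Seq((1, "a"), (2, "b")).toDF("c1", "c2").write.parquet("/tmp/evolve/old")

// New files, with c5 added at the end.
Seq((3, "c", 9L)).toDF("c1", "c2", "c5").write.parquet("/tmp/evolve/new")

// Spark merges the two schemas and fills c5 with NULL for the old files.
sqlContext.read.option("mergeSchema", "true")
  .parquet("/tmp/evolve/old", "/tmp/evolve/new")
  .show()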

You can also rename columns, since Impala only looks at the column ordering and 
not the names. The flip side is that reordering is unsafe: if you have columns c1 
and c2 and reorder them as c2, c1, Impala will read c1's data when you query c2 
and vice versa.

In the next release, there might be support for optionally matching Parquet 
file columns by name instead of order, like Hive does. Under that scheme you 
cannot rename columns (since the files will retain the old name and will no 
longer be matched), but you can reorder them. (This is all regarding Impala.)

In Avro we follow the rules described here: 
http://avro.apache.org/docs/1.8.0/spec.html#Schema+Resolution

This includes rearranging columns, renaming columns via aliasing, and certain 
data type differences (e.g. you can "promote" an int to a bigint in the table 
schema, but not vice versa). AFAIK you should always modify the Avro schema 
directly in the table metadata; I think weird stuff happens if you try to rely on 
the usual ALTER TABLE commands alone. The preferred method is to modify the JSON 
schema file on HDFS and then use ALTER TABLE to add/remove/reorder/rename the 
columns so the table metadata matches.

I think in some cases using ALTER TABLE without changing the schema could work 
by chance, but in general it seems dangerous to change the column metadata in 
the metastore and not have that be reflected in the Avro schema as well.
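
To illustrate those resolution rules at the Avro library level (just a sketch - 
the record and field names are made up, and this is the plain Avro API rather 
than the Hive/Impala table workflow):

import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

// Writer schema: how the old data was written.
val writerSchema = new Schema.Parser().parse(
  """{"type":"record","name":"Event","fields":[
    |{"name":"id","type":"int"},
    |{"name":"user","type":"string"}]}""".stripMargin)

// Reader schema: fields reordered, "user" renamed via an alias, "id" promoted
// from int to long, and a new field added with a default.
val readerSchema = new Schema.Parser().parse(
  """{"type":"record","name":"Event","fields":[
    |{"name":"username","type":"string","aliases":["user"]},
    |{"name":"id","type":"long"},
    |{"name":"source","type":"string","default":"unknown"}]}""".stripMargin)

// Write one record with the old schema.
val record = new GenericData.Record(writerSchema)
record.put("id", 42)
record.put("user", "paul")
val out = new ByteArrayOutputStream()
val encoder = EncoderFactory.get().binaryEncoder(out, null)
new GenericDatumWriter[GenericRecord](writerSchema).write(record, encoder)
encoder.flush()

// Read it back with the new schema; Avro's resolution rules do the mapping.
val decoder = DecoderFactory.get().binaryDecoder(out.toByteArray, null)
val evolved = new GenericDatumReader[GenericRecord](writerSchema, readerSchema)
  .read(null, decoder)
println(evolved)  // {"username": "paul", "id": 42, "source": "unknown"}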

Guru Medasani
gdm...@gmail.com



> On Mar 4, 2016, at 7:36 AM, Paul Leclercq <paul.lecle...@tabmo.io> wrote:
> 
> 
> 
> Nice article about Parquet with Avro : 
> https://dzone.com/articles/understanding-how-parquet 
> http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/
> Nice video from the good folks of Cloudera for the differences between 
> "Avrow" and Parquet
> https://www.youtube.com/watch?v=AY1dEfyFeHc
> 
> 2016-03-04 7:12 GMT+01:00 Koert Kuipers <ko...@tresata.com>:
> well can you use orc without bringing in the kitchen sink of dependencies 
> also known as hive?
> 
> On Thu, Mar 3, 2016 at 11:48 PM, Jong Wook Kim <ilike...@gmail.com> wrote:
> How about ORC? I have experimented briefly with Parquet and ORC, and I liked 
> the fact that ORC has its schema within the file, which makes it handy to work 
> with other tools.
> 
> Jong Wook
> 
> On 3 March 2016 at 23:29, Don Drake <dondr...@gmail.com> wrote:
> My tests show Parquet has better performance than Avro in just about every 
> test.  It really shines when you are querying a subset of columns in a wide 
> table.
> 
> -Don
> 
> On Wed, Mar 2, 2016 at 3:49 PM, Timothy Spann <tim.sp...@airisdata.com> wrote:
> Which format is the best format for SparkSQL adhoc queries and general data 
> storage?
> 
> There are lots of specialized cases, but generally it's accessing some but not 
> all of the available columns, over a reasonable subset of the data.
> 
> I am leaning towards Parquet as it has great support in Spark.
> 
> I also have to consider any file on HDFS may be accessed from other tools 
> like Hive, Impala, HAWQ.
> 
> Suggestions?
> — 
> airis.DATA
> Timothy Spann, Senior Solutions Architect
> C: 609-250-5894
> http://airisdata.com/
> http://meetup.com/nj-datascience
> 
> 
> 
> 
> 
> -- 
> Donald Drake
> Drake Consulting
> http://www.drakeconsulting.com/
> https://twitter.com/dondrake
> 800-733-2143
> 
> 
> 
> 
> -- 
> Paul Leclercq | Data engineer
> 
> 
> 
> paul.lecle...@tabmo.io | http://www.tabmo.fr/
