There is a lot of talk about how, in order to really benefit from fast queries over Parquet on HDFS, the data needs to be stored in a compression-friendly manner.
Unfortunately, I have not found any specific guidelines or tips online describing the dos and don'ts of designing a Parquet schema. I am wondering whether someone here can share such material, or their own experience with this.

For example, I have the following logical structure that I want to store:

    {
      root: [
        [int, int, float, float],
        [int, int, float, float],
        [int, int, float, float],
        ...
      ]
    }

This is of course a list of lists. All the sublists are vectors of the same length, where the coordinates match in meaning and type. If I understand correctly, the best way to *store* this structure is to follow the columnar paradigm: 4 very long vectors, one per coordinate, rather than many short vectors (see the sketch below). What other considerations should I apply?
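To make the two layouts concrete, here is a minimal sketch using pyarrow (my assumption; the original question does not name a library, and the column names a/b/x/y and the sample values are made up). It writes the same data once as four flat columns and once as a single list-of-structs column:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Flat layout: four long columns, one per coordinate.
flat = pa.table({
    "a": pa.array([1, 2, 3], type=pa.int32()),
    "b": pa.array([10, 20, 30], type=pa.int32()),
    "x": pa.array([0.1, 0.2, 0.3], type=pa.float32()),
    "y": pa.array([1.5, 2.5, 3.5], type=pa.float32()),
})

# Nested layout: one column holding a list of structs per row,
# mirroring the list-of-lists structure above.
nested_type = pa.list_(pa.struct([
    ("a", pa.int32()), ("b", pa.int32()),
    ("x", pa.float32()), ("y", pa.float32()),
]))
nested = pa.table({
    "root": pa.array(
        [[{"a": 1, "b": 10, "x": 0.1, "y": 1.5},
          {"a": 2, "b": 20, "x": 0.2, "y": 2.5}]],
        type=nested_type,
    )
})

# Write both with an explicit compression codec and dictionary encoding.
pq.write_table(flat, "flat.parquet", compression="snappy", use_dictionary=True)
pq.write_table(nested, "nested.parquet", compression="snappy")
```

My understanding is that the flat layout keeps values of the same type and meaning adjacent on disk, which is what lets Parquet's per-column encodings (dictionary, run-length) and the compression codec do their work, and it also allows a query to read only the coordinates it needs.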