There is a lot of talk about how, in order to really benefit from fast queries over Parquet on HDFS, the data needs to be stored in a compression-friendly manner.
Unfortunately, I have not found any specific guidelines or tips online describing the dos and don'ts of designing a Parquet schema. I am wondering whether someone here can share such material, or their own experience with this.

For example, I have the following logical structure that I want to store:

    {
      root: [
        [int, int, float, float],
        [int, int, float, float],
        [int, int, float, float],
        ...
      ]
    }

This is of course a list of lists. All the sublists are vectors of the same length, where the coordinates match in meaning and type. If I understand correctly, the best way to *store* this structure is to follow the columnar paradigm: 4 very long vectors, one per coordinate, rather than many short vectors (see the sketch below). What other considerations should I apply?
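To make the two layouts concrete, here is a minimal sketch using pyarrow (my assumption; the original question does not name a library, and the column names a/b/x/y and the sample values are made up). It writes the same data once as four flat columns and once as a single list-of-structs column:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Flat layout: four long columns, one per coordinate.
flat = pa.table({
    "a": pa.array([1, 2, 3], type=pa.int32()),
    "b": pa.array([10, 20, 30], type=pa.int32()),
    "x": pa.array([0.1, 0.2, 0.3], type=pa.float32()),
    "y": pa.array([1.5, 2.5, 3.5], type=pa.float32()),
})

# Nested layout: one column holding a list of structs per row,
# mirroring the list-of-lists structure above.
nested_type = pa.list_(pa.struct([
    ("a", pa.int32()), ("b", pa.int32()),
    ("x", pa.float32()), ("y", pa.float32()),
]))
nested = pa.table({
    "root": pa.array(
        [[{"a": 1, "b": 10, "x": 0.1, "y": 1.5},
          {"a": 2, "b": 20, "x": 0.2, "y": 2.5}]],
        type=nested_type,
    )
})

# Write both with an explicit compression codec and dictionary encoding.
pq.write_table(flat, "flat.parquet", compression="snappy", use_dictionary=True)
pq.write_table(nested, "nested.parquet", compression="snappy")
```

My understanding is that the flat layout keeps values of the same type and meaning adjacent on disk, which is what lets Parquet's per-column encodings (dictionary, run-length) and the compression codec do their work, and it also allows a query to read only the coordinates it needs.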