Hi Tomas,
Parquet tuning time!
I strongly recommend reading CERN's blog posts on Spark and Parquet tuning:
https://db-blog.web.cern.ch/blog/luca-canali/2017-06-diving-spark-and-parquet-workloads-example
You should check the row group size in your Parquet files and maybe
tweak it a little.
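If you want to experiment with that, the row group size can be set when writing the file. A minimal sketch, assuming an existing DataFrame `df` (the output path and the 8 MB value are just illustrative, not a recommendation):

```scala
// Sketch: write Parquet with a smaller row group size.
// "parquet.block.size" is the Parquet row group size in bytes
// (default 128 MB); 8 MB here is only an example value to test with.
df.write
  .option("parquet.block.size", 8 * 1024 * 1024)
  .parquet("/tmp/parquet-tuned")
```

Smaller row groups give the reader finer-grained skipping at the cost of more metadata, so it's worth measuring both directions.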
I forgot to mention an important detail: I'm issuing the same query
against both parquets, selecting only one column:
df.select(sum('amount))
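For completeness, the comparison looks roughly like this. This is a sketch; the paths are hypothetical, and it assumes a live `spark` session:

```scala
import org.apache.spark.sql.functions.sum
import spark.implicits._  // needed for the 'amount symbol-to-Column syntax

// Hypothetical paths standing in for the two files from the mail
val wide   = spark.read.parquet("/data/parquet-wide")
val narrow = spark.read.parquet("/data/parquet-narrow")

// Same aggregation against both; because Parquet is columnar,
// only the "amount" column chunks should be scanned in each case.
wide.select(sum('amount)).show()
narrow.select(sum('amount)).show()
```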
BR,
Tomas
On Thu, 19 Sep 2019 at 18:10, Tomas Bartalos wrote:
> Hello,
>
> I have 2 parquets (each containing 1 file):
>
> - parquet-wide - schema has 25 top level cols + 1 array
> - parquet-narrow - schema has 3 top level cols
>
> Both files contain the same data for the shared columns.
>
> When I read from parquet-wide, Spark reports *read 52.6 KB*; from
> parquet-narrow, *only 2.6 KB*.