Hi Tomas,
Parquet tuning time!
I strongly recommend reading CERN's blog posts on Spark and Parquet tuning:
https://db-blog.web.cern.ch/blog/luca-canali/2017-06-diving-spark-and-parquet-workloads-example
You should check the row group size in your Parquet files and maybe
tweak it a little.
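If you want to experiment with that, the row group size can be set when writing the file. A minimal sketch, assuming an existing DataFrame `df` (the output path and the 8 MB value are just illustrative, not a recommendation):

```scala
// Sketch: write Parquet with a smaller row group size.
// "parquet.block.size" is the Parquet row group size in bytes
// (default 128 MB); 8 MB here is only an example value to test with.
df.write
  .option("parquet.block.size", 8 * 1024 * 1024)
  .parquet("/tmp/parquet-tuned")
```

Smaller row groups give the reader finer-grained skipping at the cost of more metadata, so it's worth measuring both directions.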
I forgot to mention an important detail: I'm issuing the same query
against both parquets, selecting only one column:
df.select(sum('amount))
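For completeness, the comparison looks roughly like this. This is a sketch; the paths are hypothetical, and it assumes a live `spark` session:

```scala
import org.apache.spark.sql.functions.sum
import spark.implicits._  // needed for the 'amount symbol-to-Column syntax

// Hypothetical paths standing in for the two files from the mail
val wide   = spark.read.parquet("/data/parquet-wide")
val narrow = spark.read.parquet("/data/parquet-narrow")

// Same aggregation against both; because Parquet is columnar,
// only the "amount" column chunks should be scanned in each case.
wide.select(sum('amount)).show()
narrow.select(sum('amount)).show()
```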
BR,
Tomas
On Thu, 19 Sep 2019 at 18:10, Tomas Bartalos wrote:
> Hello,
>
> I have 2 parquets (each containing 1 file):
>
> - parquet-wide - schema has 25 top level cols + 1 array
> - parquet-narrow - schema has 3 top level cols
>
> Both files contain the same data for the shared columns.
>
> When I read from parquet-wide, Spark reports *read 52.6 KB*; from
> parquet-narrow, *only 2.6 KB*.