Hello, I have two Parquet datasets (each containing one file):
- parquet-wide: schema has 25 top-level columns + 1 array
- parquet-narrow: schema has 3 top-level columns

Both files have the same data for the given columns. When I read from parquet-wide, Spark reports *52.6 KB* read; from parquet-narrow, *only 2.6 KB*. For a bigger dataset the difference is *413 MB vs 961 MB*. Needless to say, reading the narrow parquet is much faster.

Since schema pruning is applied, I *expected to get similar results* for both scenarios (timing and amount of data read). What do you think is the reason for such a big difference? Is there any tuning I can do?

Thank you,
Tomas