Hi Spark users and developers,

I have been trying to understand how Spark SQL works with Parquet for the
past couple of days, and I have run into an unexpected performance problem
with column pruning. Here is a dummy example:

The Parquet file has these 3 fields:
|-- customer_id: string (nullable =
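For reference, here is a minimal sketch of the kind of pruned read I mean,
as run in spark-shell (Spark 2.x style); the file path is a placeholder,
and only customer_id is taken from the schema above:

    // spark-shell already provides `spark` (a SparkSession)
    val df = spark.read.parquet("/path/to/customers.parquet")
    df.printSchema()

    // With column pruning, only the customer_id column chunks should be
    // read from the Parquet file, so the reported input size should be
    // much smaller than a full scan of all columns.
    val ids = df.select("customer_id")
    ids.count()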
Hi Jerry,
Thanks for the detailed report! I haven't investigated this issue in
detail yet. As for the input size issue, though, I believe it is due to a
limitation of the HDFS API: it seems that the Hadoop FileSystem adds the
size of a whole block to the metrics even if you only touch a fraction of
that block.
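If it helps, one way to see what is being counted is to sum the per-task
input metrics with a SparkListener and compare a pruned scan against a
full scan. This is only a sketch against the Spark 2.x listener API; the
class name and file path are my own placeholders:

    import java.util.concurrent.atomic.AtomicLong
    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    // Accumulates bytesRead across finished tasks. AtomicLong is used
    // because the listener runs on the event bus thread, not the thread
    // that reads the total.
    class InputSizeListener extends SparkListener {
      val totalBytesRead = new AtomicLong(0L)
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val m = taskEnd.taskMetrics
        if (m != null) totalBytesRead.addAndGet(m.inputMetrics.bytesRead)
      }
    }

    val listener = new InputSizeListener
    spark.sparkContext.addSparkListener(listener)
    spark.read.parquet("/path/to/customers.parquet")
      .select("customer_id").count()
    // Listener events are delivered asynchronously, so in a quick
    // interactive test give the bus a moment before reading the total.
    Thread.sleep(1000)
    println(s"bytesRead = ${listener.totalBytesRead.get}")

Running the same thing with select("*") instead of select("customer_id")
should show how much of the reported input size comes from whole-block
accounting rather than from the bytes actually scanned.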