Unexpected performance issues with Spark SQL using Parquet

2015-07-27 Thread Jerry Lam
Hi Spark users and developers, I have been trying to understand how Spark SQL works with Parquet for the past couple of days. I am seeing an unexpected performance problem when using column pruning. Here is a dummy example: The Parquet file has 3 fields: |-- customer_id: string (nullable =

Re: Unexpected performance issues with Spark SQL using Parquet

2015-07-27 Thread Cheng Lian
Hi Jerry, Thanks for the detailed report! I haven't investigated this issue in detail yet. But for the input size issue, I believe this is due to a limitation of the HDFS API. It seems that Hadoop FileSystem adds the size of a whole block to the metrics even if you only touch a fraction of that
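
[Editor's sketch] The block-granular accounting suggested in the reply above can be illustrated with a small toy model. This is not Hadoop or Spark code: the function names, the 128 MiB block size, and the rule "charge one full block for every block a read touches" are assumptions made only to show why a reported input size could far exceed the bytes a column-pruned scan actually reads.

```python
MIB = 1024 * 1024


def actual_bytes(touched_ranges):
    """Sum of the bytes really read, given (offset, length) pairs."""
    return sum(length for _, length in touched_ranges if length > 0)


def reported_input_bytes(touched_ranges, block_size):
    """Hypothetical metric: every block touched by any read counts in full."""
    touched_blocks = set()
    for offset, length in touched_ranges:
        if length <= 0:
            continue  # an empty read touches no block
        first_block = offset // block_size
        last_block = (offset + length - 1) // block_size
        touched_blocks.update(range(first_block, last_block + 1))
    return len(touched_blocks) * block_size


# Two small reads of 1 KiB each, landing in two different 128 MiB blocks
# (block size is an assumed value, not taken from the thread).
reads = [(0, 1024), (200 * MIB, 1024)]
print(actual_bytes(reads))                     # bytes actually read
print(reported_input_bytes(reads, 128 * MIB))  # bytes the toy metric reports
```

Under these assumptions, reading a couple of kilobytes from two blocks is reported as two full blocks, so the input-size metric alone would not show whether column pruning took effect.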