Re: Performance regression for partitioned parquet data

2017-06-15 Thread Bertrand Bossy
Hi, I created https://issues.apache.org/jira/browse/SPARK-21056 and proposed an implementation here: https://github.com/apache/spark/pull/18269 (I'll try to address cloud-fan's comment ASAP). Any input is welcome. Regards, Bertrand On Thu, Jun 15, 2017 at 1:27 AM, Mike Wheeler

Re: Performance regression for partitioned parquet data

2017-06-14 Thread Mike Wheeler
I might have a similar problem. In the spark-shell, I run: val data = spark.read.parquet("...") After hitting enter, it takes more than 30 seconds for the read to complete and return to the prompt. I am running Spark 2.1.1, but I have also tested it on 2.0.2 and encountered the same issue.
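
To put a number on the delay described in this message, one option is to wrap the call in a small timing helper pasted into the spark-shell. This is only a minimal sketch: `timed` is a hypothetical helper written for illustration, not a Spark API, and the dataset path stays elided as in the message.

```scala
// Hypothetical helper (not part of Spark): runs a block once and
// returns its result together with the elapsed wall-clock seconds.
def timed[A](label: String)(block: => A): (A, Double) = {
  val start = System.nanoTime()
  val result = block
  val seconds = (System.nanoTime() - start) / 1e9
  println(f"$label took $seconds%.2f s")
  (result, seconds)
}

// In spark-shell, where `spark` is already defined, the read could be
// wrapped like this (path elided as in the original message):
// val (data, secs) = timed("read.parquet") { spark.read.parquet("...") }
```

Running the same wrapped call on each Spark version being compared would give a like-for-like figure for the time spent before the prompt returns.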

Re: Performance regression for partitioned parquet data

2017-06-13 Thread Michael Allman
Hi Bertrand, I encourage you to create a ticket for this and submit a PR if you have time. Please add me as a listener, and I'll try to contribute/review. Michael
> On Jun 6, 2017, at 5:18 AM, Bertrand Bossy wrote:
> Hi, since moving to spark 2.1 from

Performance regression for partitioned parquet data

2017-06-06 Thread Bertrand Bossy
Hi, since moving to Spark 2.1 from 2.0, we have experienced a performance regression when reading a large, partitioned Parquet dataset: we observe many (hundreds of) very short jobs executing before the job that reads the data starts. I looked into this issue and pinned it down to