I might have a similar problem:

in the spark-shell:
val data = spark.read.parquet("...")

after hitting enter, it takes more than 30 seconds for the "read" to
complete and return the command line. I am running Spark 2.1.1. But I have
also tested it on 2.0.2 and encountered the same issue.

thanks,

Mike



On Tue, Jun 13, 2017 at 10:05 AM, Michael Allman <mich...@videoamp.com>
wrote:

> Hi Bertrand,
>
> I encourage you to create a ticket for this and submit a PR if you have
> time. Please add me as a listener, and I'll try to contribute/review.
>
> Michael
>
> On Jun 6, 2017, at 5:18 AM, Bertrand Bossy <bertrand.bo...@teralytics.ch>
> wrote:
>
> Hi,
>
> since moving to spark 2.1 from 2.0, we experience a performance regression
> when reading a large, partitioned parquet dataset:
>
> We observe many (hundreds) very short jobs executing before the job that
> reads the data is starting. I looked into this issue and pinned it down to
> PartitioningAwareFileIndex: While recursively listing the directories, if a
> directory contains more than "spark.sql.sources.
> parallelPartitionDiscovery.threshold" (default: 32) paths, the children
> are listed using a spark job. Because the tree is listed serially, this can
> result in a lot of small spark jobs executed one after the other and the
> overhead dominates. Performance can be improved by tuning
> "spark.sql.sources.parallelPartitionDiscovery.threshold". However, this
> is not a satisfactory solution.
>
> I think that the current behaviour could be improved by walking the
> directory tree in breadth first search order and only launching one spark
> job to list files in parallel if the number of paths to be listed at some
> level exceeds spark.sql.sources.parallelPartitionDiscovery.threshold .
>
> Does this approach make sense? I have found "Regression in file listing
> performance" ( https://issues.apache.org/jira/browse/SPARK-18679 ) as the
> most closely related ticket.
>
> Unless there is a reason for the current behaviour, I will create a ticket
> on this soon. I might have some time in the coming days to work on this.
>
> Regards,
> Bertrand
>
> --
>
> Bertrand Bossy | TERALYTICS
>
> *software engineer*
>
> Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
> www.teralytics.net
>
> Company registration number: CH-020.3.037.709-7 | Trade register Canton
> Zurich
> Board of directors: Georg Polzer, Luciano Franceschina, Mark Schmitz, Yann
> de Vries
>
> This e-mail message contains confidential information which is for the
> sole attention and use of the intended recipient. Please notify us at once
> if you think that it may not be intended for you and delete it immediately.
>
>
>
>

Reply via email to