We've fixed this issue in 1.5 (https://github.com/apache/spark/pull/7396).
Could you give it a shot to see whether it helps in your case? We've
observed a ~50x performance boost with schema merging turned on.
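
If it's still slow on 1.5, you can also toggle schema merging explicitly to
see how much it contributes. Just a sketch, assuming the "mergeSchema"
Parquet option and the spark.sql.parquet.mergeSchema setting available in 1.5:

// Per read: skip merging per-partition schemas
val df = sqlContext.read.option("mergeSchema", "false").parquet("asdf")

// Or set it for the whole session
sqlContext.setConf("spark.sql.parquet.mergeSchema", "false")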
Cheng
On 8/6/15 8:26 AM, Philip Weaver wrote:
I have a Parquet directory that was produced by partitioning by two
keys, like this:
df.write.partitionBy("a", "b").parquet("asdf")
There are 35 values of "a", and about 1100-1200 values of "b" for each
value of "a", for a total of over 40,000 partitions.
Before running any transformations or actions on the DataFrame, just
initializing it like this takes *2 minutes*:
val df = sqlContext.read.parquet("asdf")
Is this normal? Is this because it is doing some bookkeeping to
discover all the partitions? Is it perhaps having to merge the schema
from each partition? Would you expect it to get better or worse if I
subpartition by another key?
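
To narrow it down, I could compare initializing over the whole directory
versus a single partition directory, e.g. (just a sketch; "a=1" is a
hypothetical partition value):

// Time how long just building the DataFrame takes, before any action
def time[T](body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(s"took ${(System.nanoTime() - start) / 1e9} s")
  result
}

val full = time { sqlContext.read.parquet("asdf") }
val one  = time { sqlContext.read.parquet("asdf/a=1") }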
- Philip