Absolutely, thanks!

On Wed, Aug 5, 2015 at 9:07 PM, Cheng Lian <lian.cs....@gmail.com> wrote:
> We've fixed this issue in 1.5: https://github.com/apache/spark/pull/7396
>
> Could you give it a shot to see whether it helps in your case? We've
> observed a ~50x performance boost with schema merging turned on.
>
> Cheng
>
> On 8/6/15 8:26 AM, Philip Weaver wrote:
>
> I have a parquet directory that was produced by partitioning by two keys,
> e.g. like this:
>
>     df.write.partitionBy("a", "b").parquet("asdf")
>
> There are 35 values of "a", and about 1100-1200 values of "b" for each
> value of "a", for a total of over 40,000 partitions.
>
> Before running any transformations or actions on the DataFrame, just
> initializing it like this takes *2 minutes*:
>
>     val df = sqlContext.read.parquet("asdf")
>
> Is this normal? Is this because it is doing some bookkeeping to discover
> all the partitions? Is it perhaps having to merge the schema from each
> partition? Would you expect it to get better or worse if I subpartition by
> another key?
>
> - Philip
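
For reference, a minimal sketch of how schema merging can be toggled explicitly on the read path, assuming the "mergeSchema" Parquet data source option documented for Spark 1.5. The SparkContext/SQLContext setup and the variable names below are illustrative, not from the thread; "asdf" is just the example path used above.

    // Illustrative sketch (Spark 1.x APIs), not taken from the thread:
    // read the same partitioned directory with Parquet schema merging
    // toggled per-read via the data source option.
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("mergeSchema-sketch"))
    val sqlContext = new SQLContext(sc)

    // Schema merging disabled: the reader can infer the schema from a single
    // footer, so initialization cost is dominated by partition discovery.
    val dfNoMerge = sqlContext.read
      .option("mergeSchema", "false")
      .parquet("asdf")

    // Schema merging enabled: footers across the part-files are read and
    // reconciled; this is the case the thread reports as slow before the
    // 1.5 fix linked above.
    val dfMerged = sqlContext.read
      .option("mergeSchema", "true")
      .parquet("asdf")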