Would you mind providing the driver log?

On 8/6/15 3:58 PM, Philip Weaver wrote:
I built Spark from the v1.5.0-snapshot-20150803 tag in the repo and tried again.

The initialization time is about 1 minute now, which is still pretty terrible.

On Wed, Aug 5, 2015 at 9:08 PM, Philip Weaver <philip.wea...@gmail.com> wrote:

    Absolutely, thanks!

    On Wed, Aug 5, 2015 at 9:07 PM, Cheng Lian <lian.cs....@gmail.com> wrote:

        We've fixed this issue in 1.5
        https://github.com/apache/spark/pull/7396

        Could you give it a shot to see whether it helps in your case?
        We've observed a ~50x performance boost with schema merging
        turned on.
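
        In case it helps while testing, here is a rough sketch of how schema
        merging can be toggled when reading; the "mergeSchema" option and the
        spark.sql.parquet.mergeSchema setting below are what I'd expect in a
        1.5 build, but please double-check against yours:

            // Sketch (assumes Spark 1.5): request schema merging for one read
            val merged = sqlContext.read
              .option("mergeSchema", "true")
              .parquet("asdf")

            // or set it for the whole session instead
            sqlContext.setConf("spark.sql.parquet.mergeSchema", "true")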

        Cheng


        On 8/6/15 8:26 AM, Philip Weaver wrote:
        I have a Parquet directory that was produced by partitioning
        by two keys, like this:

            df.write.partitionBy("a", "b").parquet("asdf")


        There are 35 values of "a", and about 1100-1200 values of "b"
        for each value of "a", for a total of over 40,000 partitions.

        Before running any transformations or actions on the
        DataFrame, just initializing it like this takes *2 minutes*:

            val df = sqlContext.read.parquet("asdf")


        Is this normal? Is this because it is doing some bookkeeping
        to discover all the partitions? Is it perhaps having to merge
        the schema from each partition? Would you expect it to get
        better or worse if I subpartition by another key?
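
        In case it helps narrow this down, here is a rough sketch of what I
        could try in order to isolate the discovery cost by pointing the
        reader at a single leaf directory; the a=.../b=... values are just
        placeholders for real partition values:

            // Sketch: read one leaf partition directly, so Spark does not
            // have to discover the other ~40,000 directories.
            // "a=1/b=2" is a placeholder path.
            val one = sqlContext.read.parquet("asdf/a=1/b=2")
            one.printSchema()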

        - Philip





