Absolutely, thanks!

On Wed, Aug 5, 2015 at 9:07 PM, Cheng Lian <lian.cs....@gmail.com> wrote:
> We've fixed this issue in 1.5: https://github.com/apache/spark/pull/7396
>
> Could you give it a shot to see whether it helps in your case? We've
> observed a ~50x performance boost with schema merging turned on.
>
> Cheng
>
> On 8/6/15 8:26 AM, Philip Weaver wrote:
>
> I have a parquet directory that was produced by partitioning by two keys,
> e.g. like this:
>
>     df.write.partitionBy("a", "b").parquet("asdf")
>
> There are 35 values of "a", and about 1100-1200 values of "b" for each
> value of "a", for a total of over 40,000 partitions.
>
> Before running any transformations or actions on the DataFrame, just
> initializing it like this takes *2 minutes*:
>
>     val df = sqlContext.read.parquet("asdf")
>
> Is this normal? Is this because it is doing some bookkeeping to discover
> all the partitions? Is it perhaps having to merge the schema from each
> partition? Would you expect it to get better or worse if I subpartition by
> another key?
>
> - Philip
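
For reference, a minimal sketch of how schema merging can be toggled explicitly on the read path, assuming the "mergeSchema" Parquet data source option documented for Spark 1.5. The SparkContext/SQLContext setup and the variable names below are illustrative, not from the thread; "asdf" is just the example path used above.

    // Illustrative sketch (Spark 1.x APIs), not taken from the thread:
    // read the same partitioned directory with Parquet schema merging
    // toggled per-read via the data source option.
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("mergeSchema-sketch"))
    val sqlContext = new SQLContext(sc)

    // Schema merging disabled: the reader can infer the schema from a single
    // footer, so initialization cost is dominated by partition discovery.
    val dfNoMerge = sqlContext.read
      .option("mergeSchema", "false")
      .parquet("asdf")

    // Schema merging enabled: footers across the part-files are read and
    // reconciled; this is the case the thread reports as slow before the
    // 1.5 fix linked above.
    val dfMerged = sqlContext.read
      .option("mergeSchema", "true")
      .parquet("asdf")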