Spark 2.0.0 introduced "Automatic file coalescing for native data sources" (http://spark.apache.org/releases/spark-release-2-0-0.html#performance-and-runtime). Perhaps that is the cause?
I'm not sure if this feature is mentioned anywhere in the documentation or if there's any way to disable it.

--
Daniel Siegmann
Senior Software Engineer
*SecurityScorecard Inc.*
214 W 29th Street, 5th Floor
New York, NY 10001

On Thu, Dec 22, 2016 at 11:09 AM, Kristina Rogale Plazonic <kpl...@gmail.com> wrote:

> Hi,
>
> I write a randomly generated 30,000-row dataframe to parquet. I verify
> that it has 200 partitions (both in Spark and inspecting the parquet file
> in hdfs).
>
> When I read it back in, it has 23 partitions?! Is there some optimization
> going on? (This doesn't happen in Spark 1.5.)
>
> *How can I force it to read back the same partitions, i.e. 200?* I'm
> trying to reproduce a problem that depends on partitioning and can't,
> because the number of partitions goes way down.
>
> Thanks for any insights!
> Kristina
>
> Here is the code and output:
>
> scala> spark.version
> res13: String = 2.0.2
>
> scala> df.show(2)
> +---+-----------+----------+----------+----------+----------+--------+--------+
> | id|        id2|  strfeat0|  strfeat1|  strfeat2|  strfeat3|binfeat0|binfeat1|
> +---+-----------+----------+----------+----------+----------+--------+--------+
> |  0|12345678901|fcvEmHTZte|      null|fnuAQdnBkJ|aU3puFMq5h|       1|       1|
> |  1|12345678902|      null|rtcrPaAVNX|fnuAQdnBkJ|x6NyoX662X|       0|       0|
> +---+-----------+----------+----------+----------+----------+--------+--------+
> only showing top 2 rows
>
> scala> df.count
> res15: Long = 30001
>
> scala> df.rdd.partitions.size
> res16: Int = 200
>
> scala> df.write.parquet("/tmp/df")
>
> scala> val newdf = spark.read.parquet("/tmp/df")
> newdf: org.apache.spark.sql.DataFrame = [id: int, id2: bigint ... 6 more fields]
>
> scala> newdf.rdd.partitions.size
> res18: Int = 23
>
> [kris@airisdata195 ~]$ hdfs dfs -ls /tmp/df
> Found 201 items
> -rw-r--r--   3 kris supergroup       0 2016-12-22 11:01 /tmp/df/_SUCCESS
> -rw-r--r--   3 kris supergroup    4974 2016-12-22 11:01 /tmp/df/part-r-00000-84584688-612f-49a3-a023-4a5c6d784d96.snappy.parquet
> -rw-r--r--   3 kris supergroup    4914 2016-12-22 11:01 /tmp/df/part-r-00001-84584688-612f-49a3-a023-4a5c6d784d96.snappy.parquet
> .
> . (omitted output)
> .
> -rw-r--r--   3 kris supergroup    4893 2016-12-22 11:01 /tmp/df/part-r-00198-84584688-612f-49a3-a023-4a5c6d784d96.snappy.parquet
> -rw-r--r--   3 kris supergroup    4981 2016-12-22 11:01 /tmp/df/part-r-00199-84584688-612f-49a3-a023-4a5c6d784d96.snappy.parquet
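
[Editor's note, not part of the original thread: a hedged sketch of two possible workarounds. In Spark 2.x, the number of read partitions for file-based sources is governed by the `spark.sql.files.maxPartitionBytes` and `spark.sql.files.openCostInBytes` settings; with 200 tiny (~5 KB) files, the default 4 MB open cost lets Spark pack many files into each partition, which would explain the drop from 200 to 23. The exact values below are illustrative, and the resulting partition counts depend on file sizes, so they are not guaranteed.]

```scala
// Sketch for a spark-shell session (Spark 2.x assumed); values are illustrative.

// Option 1: discourage file packing by shrinking the per-partition byte target
// and the assumed per-file open cost, so small files stop being coalesced.
spark.conf.set("spark.sql.files.maxPartitionBytes", 8192L) // ~one small file per partition
spark.conf.set("spark.sql.files.openCostInBytes", 0L)      // don't pad small files to 4 MB

val df1 = spark.read.parquet("/tmp/df")
println(df1.rdd.partitions.size) // far more partitions than the coalesced 23

// Option 2: don't fight the coalescing; force the count after reading.
// This is deterministic but incurs a shuffle.
val df2 = spark.read.parquet("/tmp/df").repartition(200)
println(df2.rdd.partitions.size) // 200
```

Option 2 is the more reliable way to reproduce a partitioning-dependent problem, since it yields exactly 200 partitions regardless of how the reader groups files; Option 1 only approximates the original layout.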