Hi, I write a randomly generated 30,000-row dataframe to Parquet. I verify that it has 200 partitions (both in Spark and by inspecting the Parquet files in HDFS).
When I read it back in, it has 23 partitions?! Is there some optimization going on? (This doesn't happen in Spark 1.5.) *How can I force it to read back the same number of partitions, i.e. 200?* I'm trying to reproduce a problem that depends on partitioning, and I can't because the number of partitions goes way down.

Thanks for any insights!
Kristina

Here is the code and output:

scala> spark.version
res13: String = 2.0.2

scala> df.show(2)
+---+-----------+----------+----------+----------+----------+--------+--------+
| id|        id2|  strfeat0|  strfeat1|  strfeat2|  strfeat3|binfeat0|binfeat1|
+---+-----------+----------+----------+----------+----------+--------+--------+
|  0|12345678901|fcvEmHTZte|      null|fnuAQdnBkJ|aU3puFMq5h|       1|       1|
|  1|12345678902|      null|rtcrPaAVNX|fnuAQdnBkJ|x6NyoX662X|       0|       0|
+---+-----------+----------+----------+----------+----------+--------+--------+
only showing top 2 rows

scala> df.count
res15: Long = 30001

scala> df.rdd.partitions.size
res16: Int = 200

scala> df.write.parquet("/tmp/df")

scala> val newdf = spark.read.parquet("/tmp/df")
newdf: org.apache.spark.sql.DataFrame = [id: int, id2: bigint ... 6 more fields]

scala> newdf.rdd.partitions.size
res18: Int = 23

[kris@airisdata195 ~]$ hdfs dfs -ls /tmp/df
Found 201 items
-rw-r--r--   3 kris supergroup       0 2016-12-22 11:01 /tmp/df/_SUCCESS
-rw-r--r--   3 kris supergroup    4974 2016-12-22 11:01 /tmp/df/part-r-00000-84584688-612f-49a3-a023-4a5c6d784d96.snappy.parquet
-rw-r--r--   3 kris supergroup    4914 2016-12-22 11:01 /tmp/df/part-r-00001-84584688-612f-49a3-a023-4a5c6d784d96.snappy.parquet
.
.
(omitted output)
.
-rw-r--r--   3 kris supergroup    4893 2016-12-22 11:01 /tmp/df/part-r-00198-84584688-612f-49a3-a023-4a5c6d784d96.snappy.parquet
-rw-r--r--   3 kris supergroup    4981 2016-12-22 11:01 /tmp/df/part-r-00199-84584688-612f-49a3-a023-4a5c6d784d96.snappy.parquet
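For context on what I suspect is happening: Spark 2.x's file-based data sources pack many small files into each read partition, governed by the real settings `spark.sql.files.maxPartitionBytes` (default 128 MB) and `spark.sql.files.openCostInBytes` (default 4 MB), so 200 files of ~5 KB each collapse into far fewer partitions. The following is only my sketch of that packing heuristic in plain Scala (no Spark dependency); `estimatePartitions` and its parameters are my own illustrative names, not Spark API, and the exact numbers Spark produces can differ (e.g. it depends on the cluster's default parallelism).

```scala
// Illustrative sketch (not Spark code) of how small files get packed
// into read partitions in Spark 2.x file sources.
object PartitionPacking {
  def estimatePartitions(fileSizes: Seq[Long],
                         maxPartitionBytes: Long = 128L * 1024 * 1024, // spark.sql.files.maxPartitionBytes default
                         openCostInBytes: Long = 4L * 1024 * 1024,     // spark.sql.files.openCostInBytes default
                         defaultParallelism: Int = 8): Int = {
    // Each file is charged its size plus a fixed "cost" for opening it.
    val costs = fileSizes.map(_ + openCostInBytes)
    val bytesPerCore = costs.sum / defaultParallelism
    // Target size of one read partition.
    val maxSplitBytes = math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))
    // Greedily pack files into bins of at most maxSplitBytes.
    var partitions = 0
    var currentSize = 0L
    for (cost <- costs) {
      if (currentSize > 0 && currentSize + cost > maxSplitBytes) {
        partitions += 1
        currentSize = 0L
      }
      currentSize += cost
    }
    if (currentSize > 0) partitions += 1
    partitions
  }
}
```

With 200 files of roughly 5 KB each, the open cost (4 MB) dominates, so dozens of files fit in one partition and the count drops into the single or low double digits. As a workaround to get back to exactly 200 partitions after the read, `newdf.repartition(200)` works (at the cost of a shuffle), or lowering `spark.sql.files.maxPartitionBytes` before reading should keep the files from being coalesced.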