Spark 2.0.0 introduced "Automatic file coalescing for native data sources" (
http://spark.apache.org/releases/spark-release-2-0-0.html#performance-and-runtime).
Perhaps that is the cause?
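
If that's what's happening: as far as I understand, the 2.x file sources
bin-pack the input files into read partitions based on a couple of byte-size
settings, roughly like this (paraphrased, not verbatim from the Spark source;
the file sizes and parallelism below are made-up numbers chosen to mirror your
200 small parquet files):

val maxPartitionBytes  = 128L * 1024 * 1024   // spark.sql.files.maxPartitionBytes default
val openCostInBytes    = 4L * 1024 * 1024     // spark.sql.files.openCostInBytes default
val fileSizes          = Seq.fill(200)(5000L) // hypothetical: 200 files of ~5 KB each
val defaultParallelism = 24                   // hypothetical cluster parallelism
val totalBytes    = fileSizes.map(_ + openCostInBytes).sum
val bytesPerCore  = totalBytes / defaultParallelism
val maxSplitBytes = math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))
// Files are bin-packed into read partitions of up to maxSplitBytes bytes each,
// so many tiny files end up sharing a partition: roughly 25 partitions with
// these made-up numbers, in the same ballpark as the 23 you're seeing.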

I'm not sure if this feature is mentioned anywhere in the documentation or
if there's any way to disable it.
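
That said, a couple of things you could try to get 200 read partitions back
(untested on my end, and these are just the generic file-source settings
rather than an official switch for that feature):

// Option 1: shrink the split size so each small file lands in its own partition.
spark.conf.set("spark.sql.files.maxPartitionBytes", "8192") // default is 128 MB
spark.conf.set("spark.sql.files.openCostInBytes", "0")      // default is 4 MB
val reread = spark.read.parquet("/tmp/df")

// Option 2: repartition after reading; this adds a shuffle, but the partition
// count is then exact rather than a byproduct of file sizes.
val reread200 = spark.read.parquet("/tmp/df").repartition(200)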


--
Daniel Siegmann
Senior Software Engineer
*SecurityScorecard Inc.*
214 W 29th Street, 5th Floor
New York, NY 10001


On Thu, Dec 22, 2016 at 11:09 AM, Kristina Rogale Plazonic <kpl...@gmail.com> wrote:

> Hi,
>
> I write a randomly generated 30,000-row dataframe to parquet. I verify
> that it has 200 partitions (both in Spark and by inspecting the parquet
> files in HDFS).
>
> When I read it back in, it has 23 partitions?! Is there some optimization
> going on? (This doesn't happen in Spark 1.5)
>
> *How can I force it to read back with the same number of partitions, i.e.
> 200?* I'm trying to reproduce a problem that depends on partitioning and
> can't, because the number of partitions goes way down.
>
> Thanks for any insights!
> Kristina
>
> Here is the code and output:
>
> scala> spark.version
> res13: String = 2.0.2
>
> scala> df.show(2)
> +---+-----------+----------+----------+----------+----------+--------+--------+
> | id|        id2|  strfeat0|  strfeat1|  strfeat2|  strfeat3|binfeat0|binfeat1|
> +---+-----------+----------+----------+----------+----------+--------+--------+
> |  0|12345678901|fcvEmHTZte|      null|fnuAQdnBkJ|aU3puFMq5h|       1|       1|
> |  1|12345678902|      null|rtcrPaAVNX|fnuAQdnBkJ|x6NyoX662X|       0|       0|
> +---+-----------+----------+----------+----------+----------+--------+--------+
> only showing top 2 rows
>
>
> scala> df.count
> res15: Long = 30001
>
> scala> df.rdd.partitions.size
> res16: Int = 200
>
> scala> df.write.parquet("/tmp/df")
>
>
> scala> val newdf = spark.read.parquet("/tmp/df")
> newdf: org.apache.spark.sql.DataFrame = [id: int, id2: bigint ... 6 more fields]
>
> scala> newdf.rdd.partitions.size
> res18: Int = 23
>
>
> [kris@airisdata195 ~]$ hdfs dfs -ls /tmp/df
> Found 201 items
> -rw-r--r--   3 kris supergroup          0 2016-12-22 11:01 /tmp/df/_SUCCESS
> -rw-r--r--   3 kris supergroup       4974 2016-12-22 11:01 /tmp/df/part-r-00000-84584688-612f-49a3-a023-4a5c6d784d96.snappy.parquet
> -rw-r--r--   3 kris supergroup       4914 2016-12-22 11:01 /tmp/df/part-r-00001-84584688-612f-49a3-a023-4a5c6d784d96.snappy.parquet
> .
> . (omitted output)
> .
> -rw-r--r--   3 kris supergroup       4893 2016-12-22 11:01 /tmp/df/part-r-00198-84584688-612f-49a3-a023-4a5c6d784d96.snappy.parquet
> -rw-r--r--   3 kris supergroup       4981 2016-12-22 11:01 /tmp/df/part-r-00199-84584688-612f-49a3-a023-4a5c6d784d96.snappy.parquet
>
>
