partitionBy produces wrong number of tasks

Daniel Haviv Wed, 19 Oct 2016 22:32:56 -0700

Hi,
I have a case where I use partitionBy to write my DF using a calculated
column, so it looks something like this:


val df = spark.sql(
  "select *, from_unixtime(ts, 'yyyyMMddH') partition_key from mytable")

df.write.partitionBy("partition_key").orc("/partitioned_table")


df has 8 partitions (spark.sql.shuffle.partitions is set to 8) and
partition_key usually has 1 or 2 distinct values.
When the write action begins, it is split into 330 tasks and takes much
longer than it should. If I switch to the following code instead, it
works as expected with 8 tasks:

df.createTempView("tab")
spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("insert into partitioned_table select * from tab")
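
For anyone trying to reproduce this, here is a minimal sketch of how the
partition count could be checked right before the write, and how the DF could
be shuffled on partition_key first. It is not a confirmed fix. The object
name, the app name, and the use of enableHiveSupport() are assumptions for
illustration (the last one supposes mytable is a Hive table); the query and
the output path are the ones from above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical wrapper object; in spark-shell the body can be pasted directly.
object PartitionCountCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partitionBy-task-count-check") // hypothetical name
      .enableHiveSupport()                     // assumption: mytable lives in Hive
      .getOrCreate()

    val df = spark.sql(
      "select *, from_unixtime(ts, 'yyyyMMddH') partition_key from mytable")

    // The write runs one task per partition of the DF it is called on,
    // so this shows how many tasks to expect.
    println(s"partitions before write: ${df.rdd.getNumPartitions}")

    // Assumption: shuffling on the partition column first. This yields
    // spark.sql.shuffle.partitions partitions, with all rows for a given
    // partition_key value landing in the same partition.
    val byKey = df.repartition(col("partition_key"))
    println(s"partitions after repartition: ${byKey.rdd.getNumPartitions}")

    byKey.write.partitionBy("partition_key").orc("/partitioned_table")

    spark.stop()
  }
}
```

Note that with only 1 or 2 distinct partition_key values, repartitioning on
that column leaves most of the resulting partitions empty, so the write itself
has little parallelism. Whether that trade-off is acceptable depends on the
data volume.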



Any idea why this is happening?
How does partitionBy decide to repartition the DF?


Thank you,
Daniel
