Hi, I have a case where I use partitionBy to write my DF using a calculated column, so it looks somethings like this:
val df = spark.sql("select *, from_unixtime(ts, 'yyyyMMddH') partition_key from mytable") df.write.partitionBy("partition_key").orc("/partitioned_table") df is 8 partitions in size (spark.sql.shuffle.partitions is set to 8) and partition_key usually has 1 or 2 distinct values. When the write action begins it's split into 330 tasks and takes much longer than it should but if I switch to the following code instead it works as expected with 8 tasks: df.createTempView("tab") spark.sql("set hive.exec.dynamic.partition.mode=nonstrict") spark.sql("insert into partitioned_table select * from tab") Any idea why is this happening ? How does partitionBy decide to repartition the DF ? Thank you, Daniel