Hi,
When renaming a DataFrame column, it looks like Spark is forgetting the
partition information:
Seq((1, 2))
.toDF("a", "b")
.repartition($"b")
.withColumnRenamed("b", "c")
.repartition($"c")
.explain()
Gives the following plan:
== Physical Plan ==
Exchange hashpartitioning(c#40, 10)
+- *(1) Project [a#36, b#37 AS c#40]
+- Exchange hashpartitioning(b#37, 10)
+- LocalTableScan [a#36, b#37]
As you can see, two shuffles are done, but the second one is unnecessary.
Is there a reason I don't know for this behavior ? Is there a way to work
around it (other than not renaming my columns) ?
I'm using Spark 2.4.3.
Thanks for your help,
Antoine