Renaming a DataFrame column makes Spark lose partitioning information

Antoine Wendlinger Tue, 04 Aug 2020 06:26:47 -0700

Hi,

When I rename a DataFrame column, Spark seems to forget the existing
partitioning information:


    Seq((1, 2))
      .toDF("a", "b")
      .repartition($"b")
      .withColumnRenamed("b", "c")
      .repartition($"c")
      .explain()

This gives the following plan:

    == Physical Plan ==
    Exchange hashpartitioning(c#40, 10)
    +- *(1) Project [a#36, b#37 AS c#40]
       +- Exchange hashpartitioning(b#37, 10)
          +- LocalTableScan [a#36, b#37]

As you can see, two shuffles are done, but the second one is unnecessary:
the data is already hash-partitioned on b, and c is only an alias for b.
Is there a reason for this behavior that I'm not aware of? Is there a way
to work around it (other than not renaming my columns)?

I'm using Spark 2.4.3.
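
In case it helps with reproducing this, here is a minimal sketch that counts
the shuffles in the plan instead of reading them off the explain() output. It
assumes Spark 2.4.x; the local master and the app name are only placeholders
for illustration. I have not tried it on newer versions, where adaptive
execution may wrap the executed plan differently, so the count could differ
there.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.exchange.ShuffleExchangeExec

// Placeholder session settings (assumption); in spark-shell the existing
// session is simply reused by getOrCreate().
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("rename-shuffle-check")
  .getOrCreate()
import spark.implicits._

// Same pipeline as in the example above.
val df = Seq((1, 2))
  .toDF("a", "b")
  .repartition($"b")
  .withColumnRenamed("b", "c")
  .repartition($"c")

// Collect every shuffle exchange node from the physical plan.
val shuffles = df.queryExecution.executedPlan.collect {
  case e: ShuffleExchangeExec => e
}
println(s"shuffle exchanges in plan: ${shuffles.size}")
```

On 2.4.3 the count should match the two Exchange nodes in the plan above.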


Thanks for your help,

Antoine
