Hi, We have some PySpark code that joins a table table_a, twice to another table table_b using the following code. After joining the table, we drop the key_hash column from the output DataFrame. This code was working fine in spark version 3.0.1. Since upgrading to spark version 3.2.2, the behaviour has changed and during the first transform operation the key_hash field gets dropped from the output DataFrame but when the 2nd transform operation gets executed then the key_hash field still stays in the output_df. Can someone please guide what has changed in Spark behaviour that is causing this issue? def tr_join_sac_user(self, df_a):
def inner(df_b): return ( df_b.join(df_a, on=df_b["sac_key_hash"] == df_a["key_hash"], how="left") .drop(df_a.key_hash) .drop(df_b.sac_key_hash) ) return inner def tr_join_sec_user(self, df_a): def inner(df_b): return ( df_b.join(df_a, on=df_b["sec_key_hash"] == df_a["key_hash"], how="left") .drop(df_a.key_hash) .drop(df_b.sec_key_hash) ) return inner table_a_df = spark.read.format("delta").load("/path/to/table_a") table_b_df = spark.read.format("delta").load("/path/to/table_b") output_df = table_b_df.transform(tr_join_sac_user(table_a_df)) output_df = output_df.transform(tr_join_sec_user(table_a_df)) If we use .drop('key_hash') instead of .drop(df_a.key_hash) that seems to work and the column does get dropped in 2nd transform as well. I would like to understand what has changed in Spark behaviour between these versions (or if it’s a bug) as this might have an impact in other places in our codebase as well. Regards, Shahban Riaz