[Spark Core] Joining Same DataFrame Multiple Times Results in Column not getting dropped

Shahban Riaz Fri, 16 Sep 2022 05:02:34 -0700

Hi,

We have some PySpark code that joins a table table_a, twice to another table 
table_b using the following code. After joining the table, we drop the key_hash 
column from the output DataFrame.
This code was working fine in spark version 3.0.1. Since upgrading to spark 
version 3.2.2, the behaviour has changed and during the first transform 
operation the key_hash field gets dropped from the output DataFrame but when 
the 2nd transform operation gets executed then the key_hash field still stays 
in the output_df. Can someone please guide what has changed in Spark behaviour 
that is causing this issue?
def tr_join_sac_user(self, df_a):


def inner(df_b):
    return (
        df_b.join(df_a, on=df_b["sac_key_hash"] == df_a["key_hash"], how="left")
        .drop(df_a.key_hash)
        .drop(df_b.sac_key_hash)
    )

return inner

def tr_join_sec_user(self, df_a):

    def inner(df_b):
        return (
            df_b.join(df_a, on=df_b["sec_key_hash"] == df_a["key_hash"], 
how="left")
            .drop(df_a.key_hash)
            .drop(df_b.sec_key_hash)
        )

    return inner

table_a_df = spark.read.format("delta").load("/path/to/table_a")
table_b_df = spark.read.format("delta").load("/path/to/table_b")

output_df = table_b_df.transform(tr_join_sac_user(table_a_df))
output_df = output_df.transform(tr_join_sec_user(table_a_df))


If we use .drop('key_hash') instead of .drop(df_a.key_hash) that seems to work 
and the column does get dropped in 2nd transform as well. I would like to 
understand what has changed in Spark behaviour between these versions (or if 
it’s a bug) as this might have an impact in other places in our codebase as 
well.

Regards,
Shahban Riaz

[Spark Core] Joining Same DataFrame Multiple Times Results in Column not getting dropped

Reply via email to