Hi Spark Users,

The following code snippet raises an "attribute missing" error even though the attribute exists. The bug is triggered by a particular sequence of "select", "groupby", and "join" calls. Note that if I remove the "select" on the line marked #line B, the code runs without error. However, that "select" includes all columns of the DataFrame and hence should not affect the final result.
import pyspark.sql.functions as F

df = spark.createDataFrame([
    {'score': 1.0, 'ID': 'abc', 'LABEL': True, 'k': 2},
    {'score': 1.0, 'ID': 'abc', 'LABEL': False, 'k': 3},
])
df = df.withColumnRenamed("k", "kk") \
       .select("ID", "score", "LABEL", "kk")  # line B
df_t = df.groupby("ID").agg(F.countDistinct("LABEL").alias("nL")).filter(F.col("nL") > 1)
df = df.join(df_t.select("ID"), ["ID"])
df_sw = df.groupby(["ID", "kk"]).count().withColumnRenamed("count", "cnt1")
df = df.join(df_sw, ["ID", "kk"])