I am trying to join df1 with df2, and then join the result with df2 again. df2 is the common DataFrame used in both joins.
    val df3 = df1
      .join(df2, df1("PARTICIPANT_ID") === df2("PARTICIPANT_ID") and df1("BUSINESS_ID") === df2("BUSINESS_ID"))
      .drop(df1("BUSINESS_ID"))    // dropping duplicates
      .drop(df1("PARTICIPANT_ID")) // dropping duplicates
      .select("EMPLOYEE_ID", ...)

    val df4 = df3
      .join(df2, df3("EMPLOYEE_ID") === df2("EMPLOYEE_ID") and df3("BUSINESS_ID") === df2("BUSINESS_ID"))
      .drop(df2("BUSINESS_ID"))    // dropping duplicates
      .drop(df2("EMPLOYEE_ID"))    // dropping duplicates
      .select(...)

I am getting the following warning, and it most likely indicates a Cartesian join, which is not what I want:

    14:30:32.193 12262 [main] WARN org.apache.spark.sql.Column - Constructing trivially true equals predicate, 'EMPLOYEE_ID#83 = EMPLOYEE_ID#83'. Perhaps you need to use aliases.
    14:30:32.195 12264 [main] WARN org.apache.spark.sql.Column - Constructing trivially true equals predicate, 'BUSINESS_ID#312 = BUSINESS_ID#312'. Perhaps you need to use aliases.

As you can see, one of my join predicates is converted to "(EMPLOYEE_ID#83 = EMPLOYEE_ID#83)". I think this should be okay, because they should still be columns from different DataFrames (df3 and df2). I just want to confirm that this warning is harmless in this scenario.

The problem is similar to this one: https://stackoverflow.com/questions/32190828/spark-sql-performing-carthesian-join-instead-of-inner-join
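The warning itself suggests the usual workaround: because df3 is derived from df2, expressions like `df3("EMPLOYEE_ID")` and `df2("EMPLOYEE_ID")` can resolve to the same underlying attribute, so the predicate degenerates to a trivially true comparison. A minimal sketch of the alias-based fix for the second join (column names taken from the post; the sample data and the simplified definition of df3 are my own assumptions, not the real tables):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object AliasJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("alias-join-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for df2; the real schema has more columns.
    val df2 = Seq(
      (1, 10, 100),
      (2, 20, 200)
    ).toDF("EMPLOYEE_ID", "BUSINESS_ID", "PARTICIPANT_ID")

    // Simplified stand-in for df3: something derived from df2,
    // so it shares lineage with df2 (the condition that triggers the warning).
    val df3 = df2.select("EMPLOYEE_ID", "BUSINESS_ID")

    // Aliasing both sides lets the analyzer tell the two lineages apart,
    // so the join condition no longer collapses to a trivially true predicate.
    val df4 = df3.alias("l")
      .join(df2.alias("r"),
            col("l.EMPLOYEE_ID") === col("r.EMPLOYEE_ID") and
            col("l.BUSINESS_ID") === col("r.BUSINESS_ID"))
      .drop(col("r.BUSINESS_ID"))  // dropping duplicates
      .drop(col("r.EMPLOYEE_ID")) // dropping duplicates

    df4.show()
    spark.stop()
  }
}
```

With the aliases in place, each join key is qualified by the side it comes from, so the "Constructing trivially true equals predicate" warning should no longer appear and the join stays an inner equi-join rather than falling back toward a Cartesian product.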