wuchang created SPARK-19860:
-------------------------------

             Summary: DataFrame join raises a conflicting-references error when 
the two frames have a column with the same name.
                 Key: SPARK-19860
                 URL: https://issues.apache.org/jira/browse/SPARK-19860
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.1.0
            Reporter: wuchang


>>> print df1.collect()
[Row(fdate=u'20170223', in_amount1=7758588), Row(fdate=u'20170302', 
in_amount1=7656414), Row(fdate=u'20170207', in_amount1=7836305), 
Row(fdate=u'20170208', in_amount1=14887432), Row(fdate=u'20170224', 
in_amount1=16506043), Row(fdate=u'20170201', in_amount1=7339381), 
Row(fdate=u'20170221', in_amount1=7490447), Row(fdate=u'20170303', 
in_amount1=11142114), Row(fdate=u'20170202', in_amount1=7882746), 
Row(fdate=u'20170306', in_amount1=12977822), Row(fdate=u'20170227', 
in_amount1=15480688), Row(fdate=u'20170206', in_amount1=11370812), 
Row(fdate=u'20170217', in_amount1=8208985), Row(fdate=u'20170203', 
in_amount1=8175477), Row(fdate=u'20170222', in_amount1=11032303), 
Row(fdate=u'20170216', in_amount1=11986702), Row(fdate=u'20170209', 
in_amount1=9082380), Row(fdate=u'20170214', in_amount1=8142569), 
Row(fdate=u'20170307', in_amount1=11092829), Row(fdate=u'20170213', 
in_amount1=12341887), Row(fdate=u'20170228', in_amount1=13966203), 
Row(fdate=u'20170220', in_amount1=9397558), Row(fdate=u'20170210', 
in_amount1=8205431), Row(fdate=u'20170215', in_amount1=7070829), 
Row(fdate=u'20170301', in_amount1=10159653)]
>>> print df2.collect()
[Row(fdate=u'20170223', in_amount2=7072120), Row(fdate=u'20170302', 
in_amount2=5548515), Row(fdate=u'20170207', in_amount2=5451110), 
Row(fdate=u'20170208', in_amount2=4483131), Row(fdate=u'20170224', 
in_amount2=9674888), Row(fdate=u'20170201', in_amount2=3227502), 
Row(fdate=u'20170221', in_amount2=5084800), Row(fdate=u'20170303', 
in_amount2=20577801), Row(fdate=u'20170202', in_amount2=4024218), 
Row(fdate=u'20170306', in_amount2=8581773), Row(fdate=u'20170227', 
in_amount2=5748035), Row(fdate=u'20170206', in_amount2=7330154), 
Row(fdate=u'20170217', in_amount2=6838105), Row(fdate=u'20170203', 
in_amount2=9390262), Row(fdate=u'20170222', in_amount2=3800662), 
Row(fdate=u'20170216', in_amount2=4338891), Row(fdate=u'20170209', 
in_amount2=4024611), Row(fdate=u'20170214', in_amount2=4030389), 
Row(fdate=u'20170307', in_amount2=5504936), Row(fdate=u'20170213', 
in_amount2=7142428), Row(fdate=u'20170228', in_amount2=8618951), 
Row(fdate=u'20170220', in_amount2=8172290), Row(fdate=u'20170210', 
in_amount2=8411312), Row(fdate=u'20170215', in_amount2=5302422), 
Row(fdate=u'20170301', in_amount2=9475418)]
>>> ht_net_in_df = df1.join(df2,df1.fdate == df2.fdate,'inner')
2017-03-08 10:27:34,357 WARN  [Thread-2] sql.Column: Constructing trivially 
true equals predicate, 'fdate#42 = fdate#42'. Perhaps you need to use aliases.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/spark/python/pyspark/sql/dataframe.py", line 652, in join
    jdf = self._jdf.join(other._jdf, on._jc, how)
  File "/home/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 
933, in __call__
  File "/home/spark/python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"
Failure when resolving conflicting references in Join:
'Join Inner, (fdate#42 = fdate#42)
:- Aggregate [fdate#42], [fdate#42, cast(sum(cast(inoutmoney#47 as double)) as 
int) AS in_amount1#97]
:  +- Filter (inorout#44 = A)
:     +- Project [firm_id#40, partnerid#45, inorout#44, inoutmoney#47, fdate#42]
:        +- Filter (((partnerid#45 = pmec) && NOT (firm_id#40 = NULL)) && (NOT 
(firm_id#40 = -1) && (fdate#42 >= 20170201)))
:           +- SubqueryAlias history_transfer_v
:              +- Project [md5(cast(firmid#41 as binary)) AS FIRM_ID#40, 
fdate#42, ftime#43, inorout#44, partnerid#45, realdate#46, inoutmoney#47, 
bankwaterid#48, waterid#49, waterstate#50, source#51]
:                 +- SubqueryAlias history_transfer
:                    +- 
Relation[firmid#41,fdate#42,ftime#43,inorout#44,partnerid#45,realdate#46,inoutmoney#47,bankwaterid#48,waterid#49,waterstate#50,source#51]
 parquet
+- Aggregate [fdate#42], [fdate#42, cast(sum(cast(inoutmoney#47 as double)) as 
int) AS in_amount2#145]
   +- Filter (inorout#44 = B)
      +- Project [firm_id#40, partnerid#45, inorout#44, inoutmoney#47, fdate#42]
         +- Filter (((partnerid#45 = pmec) && NOT (firm_id#40 = NULL)) && (NOT 
(firm_id#40 = -1) && (fdate#42 >= 20170201)))
            +- SubqueryAlias history_transfer_v
               +- Project [md5(cast(firmid#41 as binary)) AS FIRM_ID#40, 
fdate#42, ftime#43, inorout#44, partnerid#45, realdate#46, inoutmoney#47, 
bankwaterid#48, waterid#49, waterstate#50, source#51]
                  +- SubqueryAlias history_transfer
                     +- 
Relation[firmid#41,fdate#42,ftime#43,inorout#44,partnerid#45,realdate#46,inoutmoney#47,bankwaterid#48,waterid#49,waterstate#50,source#51]
 parquet

Conflicting attributes: fdate#42


The join only succeeds when I use the .withColumnRenamed('fdate', 'fdate1') / 
.withColumnRenamed('fdate', 'fdate2') method to rename df1's fdate column to 
fdate1 and df2's fdate column to fdate2.
So my question is: why does this conflict happen?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
