Option 1 is probably not advisable, though: it breaks when `df` and `df2` are the same DataFrame, which can easily happen when you’ve written a function that performs such a join internally on its arguments.
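To make that case concrete, here is a minimal sketch of such a function, written with option 2 (aliases) rather than option 1. `AliasJoin` and `joinOnKey` are hypothetical names, not part of Spark, and I am assuming that the qualified names `l.<key>` and `r.<key>` resolve per side even when both arguments are the same DataFrame:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper, not a Spark API: an inner join of two DataFrames on a
// column that has the same name on both sides.
object AliasJoin {
  def joinOnKey(left: DataFrame, right: DataFrame, key: String): DataFrame = {
    // Alias each side so the join condition can name the columns as
    // "l.<key>" and "r.<key>" instead of going through left(key) / right(key),
    // which refer to the same attribute when left and right are one DataFrame.
    val l = left.as("l")
    val r = right.as("r")
    l.join(r, col(s"l.$key") === col(s"r.$key"))
  }
}
```

With the `df` from the original question, `AliasJoin.joinOnKey(df, df, "_1")` is the call that corresponds to the `df = df2` case. The result still carries both `_1` columns, so selecting one afterwards needs the `l.` or `r.` qualifier.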
The self-join case could be solved by an identity-like function that returns the DataFrame unchanged but with a different identity. `.as` would be such a candidate, but that doesn’t work. Thoughts?

> On 16 May 2015, at 00:55, Michael Armbrust <[email protected]> wrote:
>
> There are several ways to solve this ambiguity:
>
> 1. Use the DataFrames to get the attribute so it's already "resolved" and not
> just a string we need to map to a DataFrame.
>
>     df.join(df2, df("_1") === df2("_1"))
>
> 2. Use aliases.
>
>     df.as('a).join(df2.as('b), $"a._1" === $"b._1")
>
> 3. Rename the columns as you suggested.
>
>     df.join(df2.withColumnRenamed("_1", "right_key"), $"_1" === $"right_key").printSchema
>
> 4. (Spark 1.4 only) Use def join(right: DataFrame, usingColumn: String): DataFrame
>
>     df.join(df2, "_1")
>
> This has the added benefit of only outputting a single _1 column.
>
> On Fri, May 15, 2015 at 3:44 PM, Justin Yip <[email protected]> wrote:
> Hello,
>
> I would like to know if there are recommended ways of preventing ambiguous
> columns when joining DataFrames. When we join DataFrames, we usually join
> on columns with identical names. I could rename the columns on the right
> DataFrame, as in the following code. Is there a better way to achieve this?
>
> scala> val df = sqlContext.createDataFrame(Seq((1, "a"), (2, "b"), (3, "b"), (4, "b")))
> df: org.apache.spark.sql.DataFrame = [_1: int, _2: string]
>
> scala> val df2 = sqlContext.createDataFrame(Seq((1, 10), (2, 20), (3, 30), (4, 40)))
> df2: org.apache.spark.sql.DataFrame = [_1: int, _2: int]
>
> scala> df.join(df2.withColumnRenamed("_1", "right_key"), $"_1" === $"right_key").printSchema
>
> Thanks.
>
> Justin
>
> View this message in context: Best practice to avoid ambiguous columns in DataFrame.join
> <http://apache-spark-user-list.1001560.n3.nabble.com/Best-practice-to-avoid-ambiguous-columns-in-DataFrame-join-tp22907.html>
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
