Re: Best practice to avoid ambiguous columns in DataFrame.join

Jan-Paul Bultmann Sun, 17 May 2015 12:47:12 -0700

It’s probably not advisable to use option 1, though, since it will break when
`df` and `df2` are the same DataFrame (`df = df2`), which can easily happen
when you’ve written a function that does such a join internally.
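
As an illustration of the function case, here is a minimal sketch that uses
option 3 from the quoted message (renaming the right-hand key) inside a helper.
The helper name `joinWithRenamedKey` and its parameters are hypothetical, and
the sketch assumes a Spark version that provides
`org.apache.spark.sql.functions.col`. Because the condition is resolved by
column name after the rename, rather than through `df(...)` references, it
should not depend on the two arguments being distinct DataFrames.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper (not part of Spark): joins two DataFrames on `key`,
// renaming the right-hand key to `rightKey` first so that the join condition
// refers to two distinct column names.
def joinWithRenamedKey(
    left: DataFrame,
    right: DataFrame,
    key: String,
    rightKey: String): DataFrame = {
  val renamedRight = right.withColumnRenamed(key, rightKey)
  // Both sides of the condition are looked up by name in the join output:
  // `key` now exists only on the left, `rightKey` only on the right.
  left.join(renamedRight, col(key) === col(rightKey))
}
```

Other columns that share a name on both sides (such as `_2` in the example
from the original question) still appear twice in the result and would need
the same renaming before they can be selected by name.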


This could be solved by an identity-like function that returns the DataFrame
unchanged but with a different identity.
`.as` would be such a candidate, but that doesn’t work.

Thoughts?

> On 16 May 2015, at 00:55, Michael Armbrust <[email protected]> wrote:
> 
> There are several ways to solve this ambiguity:
> 
> 1. Use the DataFrames to get the attribute, so it's already "resolved" and
> not just a string that we need to map to a DataFrame.
> 
> df.join(df2, df("_1") === df2("_1"))
> 
> 2. Use aliases
> 
> df.as('a).join(df2.as('b), $"a._1" === $"b._1")
> 
> 3. Rename the columns as you suggested.
> 
> df.join(df2.withColumnRenamed("_1", "right_key"), $"_1" === 
> $"right_key").printSchema
> 
> 4. (Spark 1.4 only) Use def join(right: DataFrame, usingColumn: String): 
> DataFrame
> 
> df.join(df2, "_1")
> 
> This has the added benefit of only outputting a single _1 column.
> 
> On Fri, May 15, 2015 at 3:44 PM, Justin Yip <[email protected]> wrote:
> Hello,
> 
> I would like to know if there are recommended ways of preventing ambiguous 
> columns when joining DataFrames. When we join DataFrames, we often join on 
> columns with identical names. I could rename the columns on the right 
> DataFrame, as shown in the following code. Is there a better way to achieve 
> this?
> 
> scala> val df = sqlContext.createDataFrame(Seq((1, "a"), (2, "b"), (3, "b"), 
> (4, "b")))
> df: org.apache.spark.sql.DataFrame = [_1: int, _2: string]
> 
> scala> val df2 = sqlContext.createDataFrame(Seq((1, 10), (2, 20), (3, 30), 
> (4, 40)))
> df2: org.apache.spark.sql.DataFrame = [_1: int, _2: int]
> 
> scala> df.join(df2.withColumnRenamed("_1", "right_key"), $"_1" === 
> $"right_key").printSchema
> 
> Thanks.
> 
> Justin
> 
