Hi,

When working with DataFrames and using explain() to debug, I observed that Spark assigns a different attribute ID (the #N tag) to the same DataFrame's columns each time a new DataFrame is derived from them. In this case:

val df1 = df2.join(df3, "Column1")
// the line below throws the "missing columns" error
val df4 = df1.join(df3, "Column2")
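For anyone hitting the same error, here is a minimal sketch of the renaming idea Takeshi suggests in the thread below, done with withColumnRenamed instead of select/as. The data, column names, and variable names are placeholders rather than Divya's actual schema, and the snippet assumes a spark-shell session (Spark 1.5.x) so that toDF is in scope via sqlContext.implicits._:

import sqlContext.implicits._  // already in scope in spark-shell

val dfA = Seq((1, 10), (2, 20), (3, 30)).toDF("a", "b")
val dfB = Seq((1, 100), (2, 200), (3, 300)).toDF("a", "b")

// First join on "a": rename dfB's "b" so the joined result carries unique names.
val step1 = dfA.join(dfB.withColumnRenamed("b", "b_first"), "a")
// step1 columns: a, b, b_first

// Second join against dfB, this time on "b": rename dfB's columns again so nothing
// collides with what step1 already carries, and use an explicit join condition.
val dfB2 = dfB.withColumnRenamed("a", "a_second").withColumnRenamed("b", "b_second")
val step2 = step1.join(dfB2, step1("b") === dfB2("b_second"))
// step2 columns: a, b, b_first, a_second, b_second -- every reference resolves unambiguously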
For instance, df2 has 2 columns, tagged df2Col1#4 and df2Col2#5, and df3 has 4 columns, tagged df3Col1#6, df3Col2#7, df3Col3#8, df3Col4#9.
After the first join, df1's columns are tagged df2Col1#10, df2Col2#11, df3Col1#12, df3Col2#13, df3Col3#14, df3Col4#15.
When df1 is joined with df3 again, the tags change once more: df2Col1#16, df2Col2#17, df3Col1#18, df3Col2#19, df3Col3#20, df3Col4#21, df3Col2#23, df3Col3#24, df3Col4#25. The join condition, however, still refers to df3Col1#12 from the previous DataFrame, and that stale reference is what causes the issue.

Thanks,
Divya

On 27 April 2016 at 23:55, Ted Yu <yuzhih...@gmail.com> wrote:

> I wonder if Spark can provide better support for this case.
>
> The following schema is not user friendly (shown previously):
>
> StructField(b,IntegerType,false), StructField(b,IntegerType,false)
>
> Except for 'select *', there is no way for the user to query either of the two fields.
>
> On Tue, Apr 26, 2016 at 10:17 PM, Takeshi Yamamuro <linguin....@gmail.com> wrote:
>
>> Based on my example, how about renaming the columns?
>>
>> val df1 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
>> val df2 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
>> val df3 = df1.join(df2, "a").select($"a", df1("b").as("1-b"), df2("b").as("2-b"))
>> val df4 = df3.join(df2, df3("2-b") === df2("b"))
>>
>> // maropu
>>
>> On Wed, Apr 27, 2016 at 1:58 PM, Divya Gehlot <divya.htco...@gmail.com> wrote:
>>
>>> Correct, Takeshi.
>>> I am facing the same issue as well.
>>>
>>> How can I avoid the ambiguity?
>>>
>>> On 27 April 2016 at 11:54, Takeshi Yamamuro <linguin....@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I tried:
>>>> val df1 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
>>>> val df2 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
>>>> val df3 = df1.join(df2, "a")
>>>> val df4 = df3.join(df2, "b")
>>>>
>>>> And I got:
>>>> org.apache.spark.sql.AnalysisException: Reference 'b' is ambiguous, could be: b#6, b#14.;
>>>>
>>>> If it is the same case, this message makes sense and is clear.
>>>>
>>>> Thoughts?
>>>>
>>>> // maropu
>>>>
>>>> On Wed, Apr 27, 2016 at 6:09 AM, Prasad Ravilla <pras...@slalom.com> wrote:
>>>>
>>>>> Also, check the column names of df1 (after joining df2 and df3).
>>>>>
>>>>> Prasad.
>>>>>
>>>>> From: Ted Yu
>>>>> Date: Monday, April 25, 2016 at 8:35 PM
>>>>> To: Divya Gehlot
>>>>> Cc: "user @spark"
>>>>> Subject: Re: Cant join same dataframe twice ?
>>>>>
>>>>> Can you show us the structure of df2 and df3?
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Mon, Apr 25, 2016 at 8:23 PM, Divya Gehlot <divya.htco...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>> I am using Spark 1.5.2.
>>>>>> I have a use case where I need to join the same dataframe twice on two different columns,
>>>>>> and I am getting a "missing columns" error.
>>>>>>
>>>>>> For instance:
>>>>>> val df1 = df2.join(df3, "Column1")
>>>>>> // the line below throws the "missing columns" error
>>>>>> val df4 = df1.join(df3, "Column2")
>>>>>>
>>>>>> Is this a bug or a valid scenario?
>>>>>>
>>>>>> Thanks,
>>>>>> Divya
>>>>>
>>>>
>>>> --
>>>> ---
>>>> Takeshi Yamamuro
>>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>
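Another route, offered only as a sketch with placeholder names and not taken from the thread, is to keep the duplicate column names and qualify them through DataFrame aliases instead, so each use of the re-joined DataFrame gets its own prefix:

import org.apache.spark.sql.functions.col
import sqlContext.implicits._  // assumes a spark-shell session

val base  = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
val other = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")

// Each use of `other` gets its own alias, so col("o1.b") and col("o2.b")
// point at distinct attributes even though the column names repeat.
val once  = base.as("base").join(other.as("o1"), col("base.a") === col("o1.a"))
val twice = once.join(other.as("o2"), col("base.b") === col("o2.b"))

// Ambiguously named columns are then selected through their qualifiers:
twice.select(col("base.a"), col("o1.b"), col("o2.b")).show()

This should also help with Ted's point above: with qualifiers, the duplicated field names are reachable individually rather than only through select *.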