Hi,

When working with DataFrames and using explain() to debug, I observed that Spark assigns a different attribute ID (the #N tag) to the same DataFrame's columns each time a new DataFrame is derived from them. In this case:

val df1 = df2.join(df3, "Column1")
// the line below throws the "missing columns" error
val df4 = df1.join(df3, "Column2")
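For anyone hitting the same error, here is a minimal sketch of the renaming idea Takeshi suggests in the thread below, done with withColumnRenamed instead of select/as. The data, column names, and variable names are placeholders rather than Divya's actual schema, and the snippet assumes a spark-shell session (Spark 1.5.x) so that toDF is in scope via sqlContext.implicits._:

import sqlContext.implicits._  // already in scope in spark-shell

val dfA = Seq((1, 10), (2, 20), (3, 30)).toDF("a", "b")
val dfB = Seq((1, 100), (2, 200), (3, 300)).toDF("a", "b")

// First join on "a": rename dfB's "b" so the joined result carries unique names.
val step1 = dfA.join(dfB.withColumnRenamed("b", "b_first"), "a")
// step1 columns: a, b, b_first

// Second join against dfB, this time on "b": rename dfB's columns again so nothing
// collides with what step1 already carries, and use an explicit join condition.
val dfB2 = dfB.withColumnRenamed("a", "a_second").withColumnRenamed("b", "b_second")
val step2 = step1.join(dfB2, step1("b") === dfB2("b_second"))
// step2 columns: a, b, b_first, a_second, b_second -- every reference resolves unambiguously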
For instance, df2 has 2 columns, tagged df2Col1#4 and df2Col2#5, and df3 has 4 columns, tagged df3Col1#6, df3Col2#7, df3Col3#8, df3Col4#9.
After the first join, df1's columns are tagged df2Col1#10, df2Col2#11, df3Col1#12, df3Col2#13, df3Col3#14, df3Col4#15.
When df1 is joined with df3 again, the tags change once more: df2Col1#16, df2Col2#17, df3Col1#18, df3Col2#19, df3Col3#20, df3Col4#21, df3Col2#23, df3Col3#24, df3Col4#25. The join condition, however, still refers to df3Col1#12 from the previous DataFrame, and that stale reference is what causes the issue.

Thanks,
Divya

On 27 April 2016 at 23:55, Ted Yu <yuzhih...@gmail.com> wrote:

> I wonder if Spark can provide better support for this case.
>
> The following schema is not user friendly (shown previously):
>
> StructField(b,IntegerType,false), StructField(b,IntegerType,false)
>
> Except for 'select *', there is no way for the user to query either of the two fields.
>
> On Tue, Apr 26, 2016 at 10:17 PM, Takeshi Yamamuro <linguin....@gmail.com> wrote:
>
>> Based on my example, how about renaming the columns?
>>
>> val df1 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
>> val df2 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
>> val df3 = df1.join(df2, "a").select($"a", df1("b").as("1-b"), df2("b").as("2-b"))
>> val df4 = df3.join(df2, df3("2-b") === df2("b"))
>>
>> // maropu
>>
>> On Wed, Apr 27, 2016 at 1:58 PM, Divya Gehlot <divya.htco...@gmail.com> wrote:
>>
>>> Correct, Takeshi.
>>> I am facing the same issue as well.
>>>
>>> How can I avoid the ambiguity?
>>>
>>> On 27 April 2016 at 11:54, Takeshi Yamamuro <linguin....@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I tried:
>>>> val df1 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
>>>> val df2 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
>>>> val df3 = df1.join(df2, "a")
>>>> val df4 = df3.join(df2, "b")
>>>>
>>>> And I got:
>>>> org.apache.spark.sql.AnalysisException: Reference 'b' is ambiguous, could be: b#6, b#14.;
>>>>
>>>> If it is the same case, this message makes sense and is clear.
>>>>
>>>> Thoughts?
>>>>
>>>> // maropu
>>>>
>>>> On Wed, Apr 27, 2016 at 6:09 AM, Prasad Ravilla <pras...@slalom.com> wrote:
>>>>
>>>>> Also, check the column names of df1 (after joining df2 and df3).
>>>>>
>>>>> Prasad.
>>>>>
>>>>> From: Ted Yu
>>>>> Date: Monday, April 25, 2016 at 8:35 PM
>>>>> To: Divya Gehlot
>>>>> Cc: "user @spark"
>>>>> Subject: Re: Cant join same dataframe twice ?
>>>>>
>>>>> Can you show us the structure of df2 and df3?
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Mon, Apr 25, 2016 at 8:23 PM, Divya Gehlot <divya.htco...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>> I am using Spark 1.5.2.
>>>>>> I have a use case where I need to join the same dataframe twice on two different columns,
>>>>>> and I am getting a "missing columns" error.
>>>>>>
>>>>>> For instance:
>>>>>> val df1 = df2.join(df3, "Column1")
>>>>>> // the line below throws the "missing columns" error
>>>>>> val df4 = df1.join(df3, "Column2")
>>>>>>
>>>>>> Is this a bug or a valid scenario?
>>>>>>
>>>>>> Thanks,
>>>>>> Divya
>>>>>
>>>>
>>>> --
>>>> ---
>>>> Takeshi Yamamuro
>>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>
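Another route, offered only as a sketch with placeholder names and not taken from the thread, is to keep the duplicate column names and qualify them through DataFrame aliases instead, so each use of the re-joined DataFrame gets its own prefix:

import org.apache.spark.sql.functions.col
import sqlContext.implicits._  // assumes a spark-shell session

val base  = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
val other = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")

// Each use of `other` gets its own alias, so col("o1.b") and col("o2.b")
// point at distinct attributes even though the column names repeat.
val once  = base.as("base").join(other.as("o1"), col("base.a") === col("o1.a"))
val twice = once.join(other.as("o2"), col("base.b") === col("o2.b"))

// Ambiguously named columns are then selected through their qualifiers:
twice.select(col("base.a"), col("o1.b"), col("o2.b")).show()

This should also help with Ted's point above: with qualifiers, the duplicated field names are reachable individually rather than only through select *.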