Re: Can't join same dataframe twice?

2016-04-27 Thread Divya Gehlot
When working with DataFrames and using explain() to debug, I observed that
Spark assigns a different tag number to the same DataFrame's columns.
Like in this case:
val df1 = df2.join(df3, "Column1")
The line below throws a "missing columns" error:
val df4 = df1.join(df3, "Column2")

For instance, df2 has 2 columns, which get tagged as df2Col1#4, df2Col2#5;
df3 has 4 columns, tagged as df3Col1#6, df3Col2#7, df3Col3#8, df3Col4#9.
After the first join, df1's columns are tagged as
df2Col1#10, df2Col2#11, df3Col1#12, df3Col2#13, df3Col3#14, df3Col4#15.

Now when df1 is joined with df3 again, df3's column tags change:
df2Col1#16, df2Col2#17, df3Col1#18, df3Col2#19, df3Col3#20, df3Col4#21,
df3Col2#23, df3Col3#24, df3Col4#25.

But a reference such as df3Col1#12 still points to the previous DataFrame's
plan, and that causes the issue.
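A minimal sketch of the behavior described above (an illustration only, not taken from the thread; the schemas are hypothetical and the #N tags are exprIds that Spark's analyzer assigns, so they differ on every real run):

```scala
// Assumes a spark-shell with sqlContext.implicits._ in scope (Spark 1.5.x).
// Hypothetical schemas: df2 and df3 deliberately share column names.
val df2 = Seq((1, "a"), (2, "b")).toDF("Column1", "Column2")
val df3 = Seq((1, "x"), (2, "y")).toDF("Column1", "Column2")

val df1 = df2.join(df3, "Column1")
df1.explain(true)  // the analyzed plan shows each attribute with an
                   // exprId tag, e.g. Column2#4 vs Column2#7

// Joining again re-resolves df3's attributes under fresh exprIds, and df1
// already carries two columns named "Column2", so this second join fails
// with an ambiguity / missing-columns analysis error.
val df4 = df1.join(df3, "Column2")
```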

Thanks,
Divya



Re: Can't join same dataframe twice?

2016-04-27 Thread Ted Yu
I wonder if Spark can provide better support for this case.

The following schema is not user friendly (shown previously):

StructField(b,IntegerType,false), StructField(b,IntegerType,false)

Except for 'select *', there is no way for the user to query either of the
two fields.



Re: Can't join same dataframe twice?

2016-04-26 Thread Takeshi Yamamuro
Based on my example, how about renaming the columns?

val df1 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
val df2 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
val df3 = df1.join(df2, "a")
  .select($"a", df1("b").as("1-b"), df2("b").as("2-b"))
val df4 = df3.join(df2, df3("2-b") === df2("b"))

// maropu
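The same disambiguation can also be sketched with DataFrame aliases and qualified column references instead of renaming (a sketch under the assumption that the `as` alias API, available since the early DataFrame releases, behaves this way here; not part of the original reply):

```scala
val df1 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
val df2 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")

// Alias each occurrence of df2 so the analyzer can tell the two sets of
// attributes apart by qualifier rather than by bare column name.
val df3 = df1.as("l").join(df2.as("r"), $"l.a" === $"r.a")
val df4 = df3.join(df2.as("r2"), $"r.b" === $"r2.b")

df4.select($"l.a", $"r.b", $"r2.b")
```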



-- 
---
Takeshi Yamamuro


Re: Can't join same dataframe twice?

2016-04-26 Thread Divya Gehlot
Correct, Takeshi.
I am facing the same issue.

How can I avoid the ambiguity?




Re: Can't join same dataframe twice?

2016-04-26 Thread Takeshi Yamamuro
Yeah, I think so. This is a common mistake.

// maropu



-- 
---
Takeshi Yamamuro


Re: Can't join same dataframe twice?

2016-04-26 Thread Ted Yu
The ambiguity came from:

scala> df3.schema
res0: org.apache.spark.sql.types.StructType =
StructType(StructField(a,IntegerType,false),
StructField(b,IntegerType,false), StructField(b,IntegerType,false))
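One way to make both fields queryable again after such a join (a sketch, not something proposed in the thread) is to re-label every column positionally with toDF, which takes a fresh name per position:

```scala
val df1 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
val df2 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")

// The join produces schema (a, b, b); toDF assigns one new name per
// position, so both former "b" columns become addressable by name again.
val df3 = df1.join(df2, "a").toDF("a", "b1", "b2")
df3.select("b1", "b2")  // no longer ambiguous
```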



Re: Can't join same dataframe twice?

2016-04-26 Thread Takeshi Yamamuro
Hi,

I tried:
val df1 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
val df2 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
val df3 = df1.join(df2, "a")
val df4 = df3.join(df2, "b")

and got:
org.apache.spark.sql.AnalysisException: Reference 'b' is ambiguous, could be: b#6, b#14.
If this is the same case, the message makes sense and the cause is clear.

Thoughts?

// maropu
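In the reproduction above, renaming the conflicting column before the second join avoids the ambiguity entirely (a sketch; the name "b2" is illustrative, not from the thread):

```scala
val df1 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
val df2 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")

// Rename df2's "b" before the first join, so the joined result carries
// only one column literally named "b" (the one from df1).
val df3 = df1.join(df2.withColumnRenamed("b", "b2"), "a")

// "b" now occurs exactly once in df3, so the second join resolves cleanly.
val df4 = df3.join(df2, "b")
```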


-- 
---
Takeshi Yamamuro


Re: Can't join same dataframe twice?

2016-04-26 Thread Prasad Ravilla
Also, check the column names of df1 (after joining df2 and df3).

Prasad.




Re: Can't join same dataframe twice?

2016-04-25 Thread Ted Yu
Can you show us the structure of df2 and df3 ?

Thanks

On Mon, Apr 25, 2016 at 8:23 PM, Divya Gehlot 
wrote:

> Hi,
> I am using Spark 1.5.2.
> I have a use case where I need to join the same dataframe twice on two
> different columns.
> I am getting a "missing columns" error.
>
> For instance:
> val df1 = df2.join(df3,"Column1")
> The line below throws the missing-columns error:
> val df4 = df1.join(df3,"Column2")
>
> Is this a bug or a valid scenario?
>
>
>
>
> Thanks,
> Divya
>