Hi,

Thanks for your response. I am not clear on why the query is ambiguous.

val both = df_2.join(df_1, df_2("country")===df_1("country"), "left_outer")

I thought df_2("country") === df_1("country") indicates that the country
field in the two dataframes should match, with df_2("country") being the
equivalent of df_2.country in SQL and df_1("country") the equivalent of
df_1.country, so I am not sure why it is ambiguous. In Spark 1.2.0 I used
the same logic with Spark SQL and tables (e.g. "WHERE tab1.country =
tab2.country") and had no problem getting the correct result.
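
If I understand the alias suggestion correctly, this is roughly what I
plan to try next (just a sketch; I am assuming $ comes from importing
sqlContext.implicits._, and that the ambiguity comes from df_1 and df_2
both being derived from the same parent df):

// assuming: import sqlContext.implicits._ is in scope for the $ syntax
val df_1 = df.filter(df("event") === 0)
             .select("country", "cnt").as("a")
val df_2 = df.filter(df("event") === 3)
             .select("country", "cnt").as("b")
// join on the aliased column names instead of df_1("country")/df_2("country")
val both = df_2.join(df_1, $"a.country" === $"b.country", "left_outer")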

thanks

On Wed, Mar 25, 2015 at 11:05 AM, Michael Armbrust <mich...@databricks.com>
wrote:

> Unfortunately you are now hitting a bug (that is fixed in master and will
> be released in 1.3.1 hopefully next week).  However, even with that your
> query is still ambiguous and you will need to use aliases:
>
> val df_1 = df.filter(df("event") === 0)
>                   .select("country", "cnt").as("a")
> val df_2 = df.filter(df("event") === 3)
>                   .select("country", "cnt").as("b")
> val both = df_2.join(df_1, $"a.country" === $"b.country", "left_outer")
>
>
>
> On Tue, Mar 24, 2015 at 11:57 PM, S Krishna <skrishna...@gmail.com> wrote:
>
>> Hi,
>>
>> Thanks for your response. I modified my code as per your suggestion, but
>> now I am getting a runtime error. Here's my code:
>>
>> val df_1 = df.filter(df("event") === 0)
>>                   .select("country", "cnt")
>>
>> val df_2 = df.filter(df("event") === 3)
>>                   .select("country", "cnt")
>>
>> df_1.show()
>> //produces the following output :
>> // country    cnt
>> //   tw           3000
>> //   uk           2000
>> //   us           1000
>>
>> df_2.show()
>> //produces the following output :
>> // country    cnt
>> //   tw           25
>> //   uk           200
>> //   us           95
>>
>> val both = df_2.join(df_1, df_2("country")===df_1("country"),
>> "left_outer")
>>
>> I am getting the following error when executing the join statement:
>>
>> java.util.NoSuchElementException: next on empty iterator.
>>
>> This error seems to be originating at DataFrame.join (line 133 in
>> DataFrame.scala).
>>
>> The show() results confirm that both dataframes have a column named
>> "country" and that they are non-empty. I also tried the simpler join
>> (i.e. df_2.join(df_1)) and got the same error stated above.
>>
>> I would like to know what is wrong with the join statement above.
>>
>> thanks
>>
>> On Tue, Mar 24, 2015 at 6:08 PM, Michael Armbrust <mich...@databricks.com
>> > wrote:
>>
>>> You need to use `===` so that you are constructing a column expression
>>> instead of evaluating the standard Scala equality method. Calling
>>> methods to access columns (i.e. df.country) is only supported in Python.
>>>
>>> val join_df =  df1.join( df2, df1("country") === df2("country"),
>>> "left_outer")
>>>
>>> On Tue, Mar 24, 2015 at 5:50 PM, SK <skrishna...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am trying to port some code that was working in Spark 1.2.0 to the
>>>> latest version, Spark 1.3.0. This code involves a left outer join
>>>> between two SchemaRDDs, which I am now trying to change to a left outer
>>>> join between two DataFrames. I followed the example for a left outer
>>>> join of DataFrames at
>>>>
>>>> https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
>>>>
>>>> Here's my code, where df1 and df2 are the two dataframes I am joining
>>>> on the "country" field:
>>>>
>>>>  val join_df =  df1.join( df2,  df1.country == df2.country,
>>>> "left_outer")
>>>>
>>>> But I got a compilation error that value country is not a member of
>>>> sql.DataFrame.
>>>>
>>>> I also tried the following:
>>>>  val join_df =  df1.join( df2, df1("country") == df2("country"),
>>>> "left_outer")
>>>>
>>>> I got a compilation error that it is a Boolean whereas a Column is
>>>> required.
>>>>
>>>> So what is the correct Column expression I need to provide for joining
>>>> the two dataframes on a specific field?
>>>>
>>>> thanks
>>>>
>>>>
>>>
>>
>
