Re: Need help for RDD/DF transformation.
Unfortunately, I don't think there is any optimized way to do this. Maybe someone else can correct me, but in theory, there is no way other than a cartesian product of your 2 sides if you can not change the data. Think about it, if you want to join between 2 different types (Array and Int in your case), Spark cannot do HashJoin, nor SortMergeJoin. In the RelationDB world, you have to do a NestedLoop, which is cartesian join in BigData world. After you create a cartesian product of both, then check if Array column contains the other column. I saw a similar question answered in stackoverflow<http://stackoverflow.com/questions/41595099/spark-join-dataframe-column-with-an-array>, but the answer is NOT correct for serious case. Assume the key never contains any duplicate strings: scala> sc.version res1: String = 1.6.3 scala> val df1 = Seq((1, "a"), (3, "a"), (5, "b")).toDF("key", "value") df1: org.apache.spark.sql.DataFrame = [key: int, value: string] scala> val df2 = Seq((Array(1,2,3), "other1"),(Array(4,5), "other2")).toDF("keys", "other") df2: org.apache.spark.sql.DataFrame = [keys: array, other: string] scala> df1.show +---+-+ |key|value| +---+-+ | 1|a| | 3|a| | 5|b| +---+-+ scala> df2.show +-+--+ | keys| other| +-+--+ |[1, 2, 3]|other1| | [4, 5]|other2| +-+--+ scala> df1.join(df2, df2("keys").cast("string").contains(df1("key").cast("string"))).show +---+-+-+--+ |key|value| keys| other| +---+-+-+--+ | 1|a|[1, 2, 3]|other1| | 3|a|[1, 2, 3]|other1| | 5|b| [4, 5]|other2| +---+-+-+--+ This code looks like working in Spark 1.6.x, but in fact, it has serious issue. As it assumes that the key will never have any conflict string in it, but this won't be true for any serious data. So [11,2,3] contains "1" will return true, but is broken in your logic. And it won't work in Spark 2.x, as the cast("string") logic changes for Array in Spark 2.x. The idea behind whole thing is to transfer your Array field to a String type, and use contains method to check if it contains another field (In String type too). But this is impossible to match the Array.contains(element) logic in most cases. You need to know your data, then try to see if you can find any optimized way to avoid cartesian product. For example, maybe make sure "key" in DF1, always guarantee presenting the first element of the Array in a logic order, so you can just pick the first element out from the Array "keys" of DF2, to join. Otherwise, I don't see any way to avoid a cartesian join. Yong From: Mungeol Heo Sent: Thursday, March 30, 2017 3:05 AM To: ayan guha Cc: Yong Zhang; user@spark.apache.org Subject: Re: Need help for RDD/DF transformation. Hello ayan, Same key will not exists in different lists. Which means, If "1" exists in a list, then it will not be presented in another list. Thank you. On Thu, Mar 30, 2017 at 3:56 PM, ayan guha wrote: > Is it possible for one key in 2 groups in rdd2? > > [1,2,3] > [1,4,5] > > ? > > On Thu, 30 Mar 2017 at 12:23 pm, Mungeol Heo wrote: >> >> Hello Yong, >> >> First of all, thank your attention. >> Note that the values of elements, which have values at RDD/DF1, in the >> same list will be always same. >> Therefore, the "1" and "3", which from RDD/DF 1, will always have the >> same value which is "a". >> >> The goal here is assigning same value to elements of the list which >> does not exist in RDD/DF 1. >> So, all the elements in the same list can have same value. >> >> Or, the final RDD/DF also can be like this, >> >> [1, 2, 3], a >> [4, 5], b >> >> Thank you again. >> >> - Mungeol >> >> >> On Wed, Mar 29, 2017 at 9:03 PM, Yong Zhang wrote: >> > What is the desired result for >> > >> > >> > RDD/DF 1 >> > >> > 1, a >> > 3, c >> > 5, b >> > >> > RDD/DF 2 >> > >> > [1, 2, 3] >> > [4, 5] >> > >> > >> > Yong >> > >> > >> > From: Mungeol Heo >> > Sent: Wednesday, March 29, 2017 5:37 AM >> > To: user@spark.apache.org >> > Subject: Need help for RDD/DF transformation. >> > >> > Hello, >> > >> > Suppose, I have two RDD or data frame like addressed below. >> > >> > RDD/DF 1 >> > >> > 1, a >> > 3, a >> > 5, b >> > >> > RDD/DF 2 >> > >> > [1, 2, 3] >> > [4, 5] >> > >> > I need to create a new RDD/DF like below from RDD/DF 1 and 2. >> > >> > 1, a >> > 2, a >> > 3, a >> > 4, b >> > 5, b >> > >> > Is there an efficient way to do this? >> > Any help will be great. >> > >> > Thank you. >> > >> > - >> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org >> > >> >> - >> To unsubscribe e-mail: user-unsubscr...@spark.apache.org >> > -- > Best Regards, > Ayan Guha
Re: Need help for RDD/DF transformation.
Hello ayan, Same key will not exists in different lists. Which means, If "1" exists in a list, then it will not be presented in another list. Thank you. On Thu, Mar 30, 2017 at 3:56 PM, ayan guha wrote: > Is it possible for one key in 2 groups in rdd2? > > [1,2,3] > [1,4,5] > > ? > > On Thu, 30 Mar 2017 at 12:23 pm, Mungeol Heo wrote: >> >> Hello Yong, >> >> First of all, thank your attention. >> Note that the values of elements, which have values at RDD/DF1, in the >> same list will be always same. >> Therefore, the "1" and "3", which from RDD/DF 1, will always have the >> same value which is "a". >> >> The goal here is assigning same value to elements of the list which >> does not exist in RDD/DF 1. >> So, all the elements in the same list can have same value. >> >> Or, the final RDD/DF also can be like this, >> >> [1, 2, 3], a >> [4, 5], b >> >> Thank you again. >> >> - Mungeol >> >> >> On Wed, Mar 29, 2017 at 9:03 PM, Yong Zhang wrote: >> > What is the desired result for >> > >> > >> > RDD/DF 1 >> > >> > 1, a >> > 3, c >> > 5, b >> > >> > RDD/DF 2 >> > >> > [1, 2, 3] >> > [4, 5] >> > >> > >> > Yong >> > >> > >> > From: Mungeol Heo >> > Sent: Wednesday, March 29, 2017 5:37 AM >> > To: user@spark.apache.org >> > Subject: Need help for RDD/DF transformation. >> > >> > Hello, >> > >> > Suppose, I have two RDD or data frame like addressed below. >> > >> > RDD/DF 1 >> > >> > 1, a >> > 3, a >> > 5, b >> > >> > RDD/DF 2 >> > >> > [1, 2, 3] >> > [4, 5] >> > >> > I need to create a new RDD/DF like below from RDD/DF 1 and 2. >> > >> > 1, a >> > 2, a >> > 3, a >> > 4, b >> > 5, b >> > >> > Is there an efficient way to do this? >> > Any help will be great. >> > >> > Thank you. >> > >> > - >> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org >> > >> >> - >> To unsubscribe e-mail: user-unsubscr...@spark.apache.org >> > -- > Best Regards, > Ayan Guha - To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Re: Need help for RDD/DF transformation.
Is it possible for one key in 2 groups in rdd2? [1,2,3] [1,4,5] ? On Thu, 30 Mar 2017 at 12:23 pm, Mungeol Heo wrote: > Hello Yong, > > First of all, thank your attention. > Note that the values of elements, which have values at RDD/DF1, in the > same list will be always same. > Therefore, the "1" and "3", which from RDD/DF 1, will always have the > same value which is "a". > > The goal here is assigning same value to elements of the list which > does not exist in RDD/DF 1. > So, all the elements in the same list can have same value. > > Or, the final RDD/DF also can be like this, > > [1, 2, 3], a > [4, 5], b > > Thank you again. > > - Mungeol > > > On Wed, Mar 29, 2017 at 9:03 PM, Yong Zhang wrote: > > What is the desired result for > > > > > > RDD/DF 1 > > > > 1, a > > 3, c > > 5, b > > > > RDD/DF 2 > > > > [1, 2, 3] > > [4, 5] > > > > > > Yong > > > > > > From: Mungeol Heo > > Sent: Wednesday, March 29, 2017 5:37 AM > > To: user@spark.apache.org > > Subject: Need help for RDD/DF transformation. > > > > Hello, > > > > Suppose, I have two RDD or data frame like addressed below. > > > > RDD/DF 1 > > > > 1, a > > 3, a > > 5, b > > > > RDD/DF 2 > > > > [1, 2, 3] > > [4, 5] > > > > I need to create a new RDD/DF like below from RDD/DF 1 and 2. > > > > 1, a > > 2, a > > 3, a > > 4, b > > 5, b > > > > Is there an efficient way to do this? > > Any help will be great. > > > > Thank you. > > > > - > > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > -- Best Regards, Ayan Guha
Re: Need help for RDD/DF transformation.
Hello Yong, First of all, thank your attention. Note that the values of elements, which have values at RDD/DF1, in the same list will be always same. Therefore, the "1" and "3", which from RDD/DF 1, will always have the same value which is "a". The goal here is assigning same value to elements of the list which does not exist in RDD/DF 1. So, all the elements in the same list can have same value. Or, the final RDD/DF also can be like this, [1, 2, 3], a [4, 5], b Thank you again. - Mungeol On Wed, Mar 29, 2017 at 9:03 PM, Yong Zhang wrote: > What is the desired result for > > > RDD/DF 1 > > 1, a > 3, c > 5, b > > RDD/DF 2 > > [1, 2, 3] > [4, 5] > > > Yong > > > From: Mungeol Heo > Sent: Wednesday, March 29, 2017 5:37 AM > To: user@spark.apache.org > Subject: Need help for RDD/DF transformation. > > Hello, > > Suppose, I have two RDD or data frame like addressed below. > > RDD/DF 1 > > 1, a > 3, a > 5, b > > RDD/DF 2 > > [1, 2, 3] > [4, 5] > > I need to create a new RDD/DF like below from RDD/DF 1 and 2. > > 1, a > 2, a > 3, a > 4, b > 5, b > > Is there an efficient way to do this? > Any help will be great. > > Thank you. > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > - To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Re: Need help for RDD/DF transformation.
What is the desired result for RDD/DF 1 1, a 3, c 5, b RDD/DF 2 [1, 2, 3] [4, 5] Yong From: Mungeol Heo Sent: Wednesday, March 29, 2017 5:37 AM To: user@spark.apache.org Subject: Need help for RDD/DF transformation. Hello, Suppose, I have two RDD or data frame like addressed below. RDD/DF 1 1, a 3, a 5, b RDD/DF 2 [1, 2, 3] [4, 5] I need to create a new RDD/DF like below from RDD/DF 1 and 2. 1, a 2, a 3, a 4, b 5, b Is there an efficient way to do this? Any help will be great. Thank you. - To unsubscribe e-mail: user-unsubscr...@spark.apache.org