Re: Spark DataFrame GroupBy into List
Thanks, Michael and java8964! Does HiveContext also provide a UDF for combining existing lists into one flattened (not nested) list? (list -> list of lists -[flatten]-> list)

On Thu, Oct 15, 2015 at 1:16 AM Michael Armbrust wrote:
> That's correct. It is a Hive UDAF.
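The flattening step being asked about is, in plain Scala terms, just `Seq#flatten`. A minimal sketch of the semantics on ordinary collections (the Spark-side wrapping as a UDF is an illustrative assumption, not a confirmed built-in for this Spark/Hive version):

```scala
// Each group already holds a list; combining groups yields a list of
// lists, and the goal is one flat list. On plain Scala collections:
val nested: Seq[Seq[Int]] = Seq(Seq(1, 2), Seq(3), Seq(4, 5))
val flat: Seq[Int] = nested.flatten // Seq(1, 2, 3, 4, 5)

// Hypothetical Spark-side wrapping (sketch only):
//   val flattenUdf = udf((xs: Seq[Seq[Int]]) => xs.flatten)
```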
Re: Spark DataFrame GroupBy into List
That's correct. It is a Hive UDAF.

On Wed, Oct 14, 2015 at 6:45 AM, java8964 wrote:
> My guess is the same as the collect_set UDAF in Hive.
RE: Spark DataFrame GroupBy into List
My guess is that it is the same as the collect_set UDAF in Hive:

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inAggregateFunctions(UDAF)

Yong
Re: Spark DataFrame GroupBy into List
collect_set and collect_list are built-in Hive aggregate functions (UDAFs); see
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

On 14 October 2015 at 03:45, SLiZn Liu wrote:
> Can you be more specific on `collect_set`? Is it a built-in function or,
> if it is a UDF, how is it defined?
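The practical difference between the two aggregates, mirrored on plain Scala collections (a sketch of their semantics, not Spark code):

```scala
// collect_list-like: keep every occurrence, in arrival order.
// collect_set-like:  drop duplicates.
val ids = Seq(1, 2, 2, 3)
val asList = ids          // Seq(1, 2, 2, 3) -- what collect_list would keep
val asSet  = ids.distinct // Seq(1, 2, 3)    -- what collect_set would keep
```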
Re: Spark DataFrame GroupBy into List
Hi Michael,

Can you be more specific on `collect_set`? Is it a built-in function or, if it is a UDF, how is it defined?

BR,
Todd Leo

On Wed, Oct 14, 2015 at 2:12 AM Michael Armbrust wrote:
> import org.apache.spark.sql.functions._
>
> df.groupBy("category")
>   .agg(callUDF("collect_set", df("id")).as("id_list"))
Re: Spark DataFrame GroupBy into List
import org.apache.spark.sql.functions._

df.groupBy("category")
  .agg(callUDF("collect_set", df("id")).as("id_list"))

On Mon, Oct 12, 2015 at 11:08 PM, SLiZn Liu wrote:
> I'm trying to group a DataFrame, collecting the occurrences of each
> category into a list instead of counting them. Any tricks to achieve
> that? The Scala Spark API is preferred.
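For readers unfamiliar with what that aggregation computes, here is the same grouping expressed on plain Scala collections (an illustrative sketch of the semantics, not the DataFrame code itself):

```scala
// Group rows by category and gather the ids of each group into a list --
// the collection-level equivalent of groupBy + collect_set on the example
// data, where ids are already distinct per category.
val rows = Seq(("A", 1), ("A", 2), ("B", 3), ("B", 4), ("C", 5))
val idLists: Map[String, Seq[Int]] =
  rows.groupBy { case (category, _) => category }
      .map { case (category, pairs) => category -> pairs.map(_._2) }
```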
Re: Spark DataFrame GroupBy into List
Hi Rishitesh,

I did it with combineByKey, but your solution is clearer and more readable; at least there are no three lambda functions to get confused by. I will definitely try it out tomorrow, thanks. 😁

Plus, OutOfMemoryError keeps bothering me as I read a massive number of JSON files, whereas the RDD yielded by combineByKey is rather small. Anyway, I'll write another mail to describe this.

BR,
Todd Leo

Rishitesh Mishra wrote on Tue, 13 Oct 2015 at 19:05:
> Hi Liu,
> I could not see any operator on DataFrame which will give the desired
> result. However, you can achieve the desired result by accessing the
> internal RDD, as in the code below.
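For reference, the three functions `combineByKey` needs for this kind of grouping can be sketched as plain Scala values (the names and the `Seq[Int]` accumulator type are illustrative assumptions, not the poster's actual code):

```scala
// The three functions combineByKey takes, for "collect ids into a list":
val createCombiner = (v: Int) => Seq(v)                   // first value seen for a key
val mergeValue     = (acc: Seq[Int], v: Int) => acc :+ v  // fold a value into an accumulator
val mergeCombiners = (a: Seq[Int], b: Seq[Int]) => a ++ b // merge accumulators across partitions

// On a pair RDD this would be:
//   rdd.combineByKey(createCombiner, mergeValue, mergeCombiners)
// The same logic folded over plain pairs, for illustration:
val pairs = Seq(("A", 1), ("A", 2), ("B", 3))
val combined = pairs.groupBy(_._1).map { case (k, vs) =>
  k -> vs.map(_._2).foldLeft(Seq.empty[Int])(mergeValue)
}
```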
Re: Spark DataFrame GroupBy into List
Hi Liu,

I could not see any operator on DataFrame which will give the desired result. The DataFrame API, as expected, works on the Row format with a fixed set of operators on it. However, you can achieve the desired result by accessing the internal RDD, as below:

val s = Seq(Test("A", 1), Test("A", 2), Test("B", 1), Test("B", 2))
val rdd = testSparkContext.parallelize(s)
val df = snc.createDataFrame(rdd)
val rdd1 = df.rdd.map(p => (Seq(p.getString(0)), Seq(p.getInt(1))))

val reduceF = (p: Seq[Int], q: Seq[Int]) => p ++ q

val rdd3 = rdd1.reduceByKey(reduceF)
rdd3.foreach(r => println(r))

You can always reconvert the obtained RDD back to a DataFrame after the transformation and reduce.

Regards,
Rishitesh Mishra,
SnappyData (http://www.snappydata.io/)

https://www.linkedin.com/profile/view?id=AAIAAAIFdkMB_v-nolCrFH6_pKf9oH6tZD8Qlgo&trk=nav_responsive_tab_profile

On Tue, Oct 13, 2015 at 11:38 AM, SLiZn Liu wrote:
> I'm trying to group a DataFrame, collecting the occurrences of each
> category into a list instead of counting them. Any tricks to achieve
> that? The Scala Spark API is preferred.
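One detail worth checking in any reduce-based approach: the reducer must concatenate the partial sequences, or keys with three or more rows silently lose ids (a reducer that keeps only the two heads retains at most two elements). A plain-Scala check of the concatenating variant:

```scala
// Concatenation keeps every id, even when one key contributes three or
// more partial sequences during the reduce.
val reduceF = (p: Seq[Int], q: Seq[Int]) => p ++ q
val merged = Seq(Seq(1), Seq(2), Seq(3)).reduce(reduceF) // Seq(1, 2, 3)
```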
Spark DataFrame GroupBy into List
Hey Spark users,

I'm trying to group a DataFrame by a column, collecting the occurrences into a list instead of counting them.

Let's say we have a DataFrame as shown below:

| category | id |
|----------|:--:|
| A        | 1  |
| A        | 2  |
| B        | 3  |
| B        | 4  |
| C        | 5  |

Ideally, after some magic group-by (a reverse explode?):

| category | id_list |
|----------|---------|
| A        | 1,2     |
| B        | 3,4     |
| C        | 5       |

Any tricks to achieve that? The Scala Spark API is preferred. =D

BR,
Todd Leo