Re: Spark DataFrame GroupBy into List

2015-10-14 Thread SLiZn Liu
Thanks, Michael and java8964!

Does HiveContext also provide a UDF for combining existing lists into a
flattened (not nested) list? (list -> list of lists -[flatten]-> list)
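
Roughly what I have in mind, as a sketch (the flattenUdf name and the
nested_ids column are hypothetical; Spark 1.5 has no built-in flatten
function, so a plain Scala UDF seems to be needed):

import org.apache.spark.sql.functions._

// Hypothetical: flatten a Seq[Seq[Int]] column with a Scala UDF
val flattenUdf = udf((xs: Seq[Seq[Int]]) => xs.flatten)
val flat = df.withColumn("id_list", flattenUdf(df("nested_ids")))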

On Thu, Oct 15, 2015 at 1:16 AM Michael Armbrust 
wrote:

> That's correct. It is a Hive UDAF.
>
> On Wed, Oct 14, 2015 at 6:45 AM, java8964  wrote:
>
>> My guess is that it is the same as the collect_set UDAF in Hive.
>>
>>
>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inAggregateFunctions(UDAF)
>>
>> Yong
>>
>> --
>> From: sliznmail...@gmail.com
>> Date: Wed, 14 Oct 2015 02:45:48 +
>> Subject: Re: Spark DataFrame GroupBy into List
>> To: mich...@databricks.com
>> CC: user@spark.apache.org
>>
>>
>> Hi Michael,
>>
>> Can you be more specific on `collect_set`? Is it a built-in function or,
>> if it is a UDF, how is it defined?
>>
>> BR,
>> Todd Leo
>>
>> On Wed, Oct 14, 2015 at 2:12 AM Michael Armbrust 
>> wrote:
>>
>> import org.apache.spark.sql.functions._
>>
>> df.groupBy("category")
>>   .agg(callUDF("collect_set", df("id")).as("id_list"))
>>
>> On Mon, Oct 12, 2015 at 11:08 PM, SLiZn Liu 
>> wrote:
>>
>> Hey Spark users,
>>
>> I'm trying to group a DataFrame, collecting the occurrences into a list
>> instead of counting them.
>>
>> Let's say we have a DataFrame as shown below:
>>
>> | category | id |
>> | -------- |:--:|
>> | A        | 1  |
>> | A        | 2  |
>> | B        | 3  |
>> | B        | 4  |
>> | C        | 5  |
>>
>> Ideally, after some magic group-by (a reverse explode?):
>>
>> | category | id_list |
>> | -------- | ------- |
>> | A        | 1,2     |
>> | B        | 3,4     |
>> | C        | 5       |
>>
>> Any tricks to achieve that? The Scala Spark API is preferred. =D
>>
>> BR,
>> Todd Leo
>>
>>
>>
>>
>>
>


Re: Spark DataFrame GroupBy into List

2015-10-14 Thread Michael Armbrust
That's correct. It is a Hive UDAF.

On Wed, Oct 14, 2015 at 6:45 AM, java8964  wrote:

> My guess is that it is the same as the collect_set UDAF in Hive.
>
>
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inAggregateFunctions(UDAF)
>
> Yong
>
> --
> From: sliznmail...@gmail.com
> Date: Wed, 14 Oct 2015 02:45:48 +
> Subject: Re: Spark DataFrame GroupBy into List
> To: mich...@databricks.com
> CC: user@spark.apache.org
>
>
> Hi Michael,
>
> Can you be more specific on `collect_set`? Is it a built-in function or,
> if it is a UDF, how is it defined?
>
> BR,
> Todd Leo
>
> On Wed, Oct 14, 2015 at 2:12 AM Michael Armbrust 
> wrote:
>
> import org.apache.spark.sql.functions._
>
> df.groupBy("category")
>   .agg(callUDF("collect_set", df("id")).as("id_list"))
>
> On Mon, Oct 12, 2015 at 11:08 PM, SLiZn Liu 
> wrote:
>
> Hey Spark users,
>
> I'm trying to group a DataFrame, collecting the occurrences into a list
> instead of counting them.
>
> Let's say we have a DataFrame as shown below:
>
> | category | id |
> | -------- |:--:|
> | A        | 1  |
> | A        | 2  |
> | B        | 3  |
> | B        | 4  |
> | C        | 5  |
>
> Ideally, after some magic group-by (a reverse explode?):
>
> | category | id_list |
> | -------- | ------- |
> | A        | 1,2     |
> | B        | 3,4     |
> | C        | 5       |
>
> Any tricks to achieve that? The Scala Spark API is preferred. =D
>
> BR,
> Todd Leo
>
>
>
>
>


RE: Spark DataFrame GroupBy into List

2015-10-14 Thread java8964
My guess is that it is the same as the collect_set UDAF in Hive.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inAggregateFunctions(UDAF)
Yong

From: sliznmail...@gmail.com
Date: Wed, 14 Oct 2015 02:45:48 +
Subject: Re: Spark DataFrame GroupBy into List
To: mich...@databricks.com
CC: user@spark.apache.org

Hi Michael,

Can you be more specific on `collect_set`? Is it a built-in function or,
if it is a UDF, how is it defined?

BR,
Todd Leo
On Wed, Oct 14, 2015 at 2:12 AM Michael Armbrust  wrote:
import org.apache.spark.sql.functions._

df.groupBy("category")
  .agg(callUDF("collect_set", df("id")).as("id_list"))
On Mon, Oct 12, 2015 at 11:08 PM, SLiZn Liu  wrote:
Hey Spark users,
I'm trying to group a DataFrame, collecting the occurrences into a list
instead of counting them.

Let's say we have a DataFrame as shown below:

| category | id |
| -------- |:--:|
| A        | 1  |
| A        | 2  |
| B        | 3  |
| B        | 4  |
| C        | 5  |

Ideally, after some magic group-by (a reverse explode?):

| category | id_list |
| -------- | ------- |
| A        | 1,2     |
| B        | 3,4     |
| C        | 5       |

Any tricks to achieve that? The Scala Spark API is preferred. =D

BR,
Todd Leo

Re: Spark DataFrame GroupBy into List

2015-10-13 Thread Deenar Toraskar
collect_set and collect_list are built-in Hive user-defined aggregate
functions (UDAFs); see
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
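
For example, a sketch (both can be called by name via callUDF from the
DataFrame API when using a HiveContext; collect_list keeps duplicates,
collect_set drops them):

import org.apache.spark.sql.functions._

df.groupBy("category")
  .agg(callUDF("collect_list", df("id")).as("id_list"),
       callUDF("collect_set", df("id")).as("id_set"))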

On 14 October 2015 at 03:45, SLiZn Liu  wrote:

> Hi Michael,
>
> Can you be more specific on `collect_set`? Is it a built-in function or,
> if it is a UDF, how is it defined?
>
> BR,
> Todd Leo
>
> On Wed, Oct 14, 2015 at 2:12 AM Michael Armbrust 
> wrote:
>
>> import org.apache.spark.sql.functions._
>>
>> df.groupBy("category")
>>   .agg(callUDF("collect_set", df("id")).as("id_list"))
>>
>> On Mon, Oct 12, 2015 at 11:08 PM, SLiZn Liu 
>> wrote:
>>
>>> Hey Spark users,
>>>
>>> I'm trying to group a DataFrame, collecting the occurrences into a list
>>> instead of counting them.
>>>
>>> Let's say we have a DataFrame as shown below:
>>>
>>> | category | id |
>>> | -------- |:--:|
>>> | A        | 1  |
>>> | A        | 2  |
>>> | B        | 3  |
>>> | B        | 4  |
>>> | C        | 5  |
>>>
>>> Ideally, after some magic group-by (a reverse explode?):
>>>
>>> | category | id_list |
>>> | -------- | ------- |
>>> | A        | 1,2     |
>>> | B        | 3,4     |
>>> | C        | 5       |
>>>
>>> Any tricks to achieve that? The Scala Spark API is preferred. =D
>>>
>>> BR,
>>> Todd Leo
>>>
>>>
>>>
>>>
>>


Re: Spark DataFrame GroupBy into List

2015-10-13 Thread SLiZn Liu
Hi Michael,

Can you be more specific on `collect_set`? Is it a built-in function or,
if it is a UDF, how is it defined?

BR,
Todd Leo

On Wed, Oct 14, 2015 at 2:12 AM Michael Armbrust 
wrote:

> import org.apache.spark.sql.functions._
>
> df.groupBy("category")
>   .agg(callUDF("collect_set", df("id")).as("id_list"))
>
> On Mon, Oct 12, 2015 at 11:08 PM, SLiZn Liu 
> wrote:
>
>> Hey Spark users,
>>
>> I'm trying to group a DataFrame, collecting the occurrences into a list
>> instead of counting them.
>>
>> Let's say we have a DataFrame as shown below:
>>
>> | category | id |
>> | -------- |:--:|
>> | A        | 1  |
>> | A        | 2  |
>> | B        | 3  |
>> | B        | 4  |
>> | C        | 5  |
>>
>> Ideally, after some magic group-by (a reverse explode?):
>>
>> | category | id_list |
>> | -------- | ------- |
>> | A        | 1,2     |
>> | B        | 3,4     |
>> | C        | 5       |
>>
>> Any tricks to achieve that? The Scala Spark API is preferred. =D
>>
>> BR,
>> Todd Leo
>>
>>
>>
>>
>


Re: Spark DataFrame GroupBy into List

2015-10-13 Thread Michael Armbrust
import org.apache.spark.sql.functions._

df.groupBy("category")
  .agg(callUDF("collect_set", df("id")).as("id_list"))
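
A minimal end-to-end sketch, assuming a HiveContext (collect_set is a Hive
UDAF, so a plain SQLContext will not resolve it) and a hypothetical input
file:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
val df = hiveContext.read.json("categories.json") // hypothetical input
df.groupBy("category")
  .agg(callUDF("collect_set", df("id")).as("id_list"))
  .show()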

On Mon, Oct 12, 2015 at 11:08 PM, SLiZn Liu  wrote:

> Hey Spark users,
>
> I'm trying to group a DataFrame, collecting the occurrences into a list
> instead of counting them.
>
> Let's say we have a DataFrame as shown below:
>
> | category | id |
> | -------- |:--:|
> | A        | 1  |
> | A        | 2  |
> | B        | 3  |
> | B        | 4  |
> | C        | 5  |
>
> Ideally, after some magic group-by (a reverse explode?):
>
> | category | id_list |
> | -------- | ------- |
> | A        | 1,2     |
> | B        | 3,4     |
> | C        | 5       |
>
> Any tricks to achieve that? The Scala Spark API is preferred. =D
>
> BR,
> Todd Leo
>
>
>
>


Re: Spark DataFrame GroupBy into List

2015-10-13 Thread SLiZn Liu
Hi Rishitesh,

I did it with combineByKey, but your solution is clearer and more readable;
at least it doesn't require three lambda functions to get confused by. I
will definitely try it out tomorrow, thanks. 😁
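
For reference, a rough sketch of the combineByKey version (the pair
extraction and names are my own; the three lambdas are combineByKey's
createCombiner, mergeValue and mergeCombiners arguments):

val pairs = df.rdd.map(r => (r.getString(0), r.getInt(1)))
val idLists = pairs.combineByKey(
  (v: Int) => List(v),                    // createCombiner: start a list per key
  (acc: List[Int], v: Int) => v :: acc,   // mergeValue: add within a partition
  (a: List[Int], b: List[Int]) => a ::: b // mergeCombiners: join across partitions
)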

Plus, an OutOfMemoryError keeps bothering me as I read a massive number of
JSON files, even though the RDD yielded by combineByKey is rather small.
Anyway, I'll send another mail to describe this.

BR,
Todd Leo

Rishitesh Mishra 于2015年10月13日 周二19:05写道:

> Hi Liu,
> I could not see any operator on DataFrame that gives the desired
> result. The DataFrame API, as expected, works on the Row format with a
> fixed set of operators. However, you can achieve the desired result by
> going through the underlying RDD, as below:
>
> case class Test(category: String, id: Int) // assumed definition, not shown in the mail
>
> val s = Seq(Test("A",1), Test("A",2), Test("B",1), Test("B",2))
> val rdd = testSparkContext.parallelize(s)
> val df = snc.createDataFrame(rdd)
> val rdd1 = df.rdd.map(p => (Seq(p.getString(0)), Seq(p.getInt(1))))
>
> // Concatenate the per-row Seqs so keys with more than two rows keep all values
> val reduceF = (p: Seq[Int], q: Seq[Int]) => p ++ q
>
> val rdd3 = rdd1.reduceByKey(reduceF)
> rdd3.foreach(r => println(r))
>
>
>
> You can always convert the RDD you obtain back to a DataFrame after the
> transformation and reduce.
>
>
> Regards,
> Rishitesh Mishra,
> SnappyData . (http://www.snappydata.io/)
>
>
> https://www.linkedin.com/profile/view?id=AAIAAAIFdkMB_v-nolCrFH6_pKf9oH6tZD8Qlgo&trk=nav_responsive_tab_profile
>
> On Tue, Oct 13, 2015 at 11:38 AM, SLiZn Liu 
> wrote:
>
>> Hey Spark users,
>>
>> I'm trying to group a DataFrame, collecting the occurrences into a list
>> instead of counting them.
>>
>> Let's say we have a DataFrame as shown below:
>>
>> | category | id |
>> | -------- |:--:|
>> | A        | 1  |
>> | A        | 2  |
>> | B        | 3  |
>> | B        | 4  |
>> | C        | 5  |
>>
>> Ideally, after some magic group-by (a reverse explode?):
>>
>> | category | id_list |
>> | -------- | ------- |
>> | A        | 1,2     |
>> | B        | 3,4     |
>> | C        | 5       |
>>
>> Any tricks to achieve that? The Scala Spark API is preferred. =D
>>
>> BR,
>> Todd Leo
>>
>>
>>
>>
>
>
> --
>
>
>


Re: Spark DataFrame GroupBy into List

2015-10-13 Thread Rishitesh Mishra
Hi Liu,
I could not see any operator on DataFrame that gives the desired
result. The DataFrame API, as expected, works on the Row format with a
fixed set of operators. However, you can achieve the desired result by
going through the underlying RDD, as below:

case class Test(category: String, id: Int) // assumed definition, not shown in the mail

val s = Seq(Test("A",1), Test("A",2), Test("B",1), Test("B",2))
val rdd = testSparkContext.parallelize(s)
val df = snc.createDataFrame(rdd)
val rdd1 = df.rdd.map(p => (Seq(p.getString(0)), Seq(p.getInt(1))))

// Concatenate the per-row Seqs so keys with more than two rows keep all values
val reduceF = (p: Seq[Int], q: Seq[Int]) => p ++ q

val rdd3 = rdd1.reduceByKey(reduceF)
rdd3.foreach(r => println(r))



You can always convert the RDD you obtain back to a DataFrame after the
transformation and reduce.
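
A minimal sketch of that reconversion (the column names are my own; k.head
unwraps the single-element Seq key used above):

val grouped = snc.createDataFrame(rdd3.map { case (k, v) => (k.head, v) })
  .toDF("category", "id_list")
grouped.show()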


Regards,
Rishitesh Mishra,
SnappyData . (http://www.snappydata.io/)

https://www.linkedin.com/profile/view?id=AAIAAAIFdkMB_v-nolCrFH6_pKf9oH6tZD8Qlgo&trk=nav_responsive_tab_profile

On Tue, Oct 13, 2015 at 11:38 AM, SLiZn Liu  wrote:

> Hey Spark users,
>
> I'm trying to group a DataFrame, collecting the occurrences into a list
> instead of counting them.
>
> Let's say we have a DataFrame as shown below:
>
> | category | id |
> | -------- |:--:|
> | A        | 1  |
> | A        | 2  |
> | B        | 3  |
> | B        | 4  |
> | C        | 5  |
>
> Ideally, after some magic group-by (a reverse explode?):
>
> | category | id_list |
> | -------- | ------- |
> | A        | 1,2     |
> | B        | 3,4     |
> | C        | 5       |
>
> Any tricks to achieve that? The Scala Spark API is preferred. =D
>
> BR,
> Todd Leo
>
>
>
>


--


Spark DataFrame GroupBy into List

2015-10-12 Thread SLiZn Liu
Hey Spark users,

I'm trying to group a DataFrame, collecting the occurrences into a list
instead of counting them.

Let's say we have a DataFrame as shown below:

| category | id |
| -------- |:--:|
| A        | 1  |
| A        | 2  |
| B        | 3  |
| B        | 4  |
| C        | 5  |

Ideally, after some magic group-by (a reverse explode?):

| category | id_list |
| -------- | ------- |
| A        | 1,2     |
| B        | 3,4     |
| C        | 5       |

Any tricks to achieve that? The Scala Spark API is preferred. =D

BR,
Todd Leo