Thanks to all of you guys for the helpful suggestions. I'll try these first 
thing tomorrow morning.

Stefan Panayotov
________________________________
From: java8964<mailto:java8...@hotmail.com>
Sent: 10/15/2015 4:30 PM
To: Michael Armbrust<mailto:mich...@databricks.com>; Deenar 
Toraskar<mailto:deenar.toras...@gmail.com>
Cc: Stefan Panayotov<mailto:spanayo...@msn.com>; 
user@spark.apache.org<mailto:user@spark.apache.org>
Subject: RE: Spark SQL running totals

My mistake. I didn't notice that "UNBOUNDED PRECEDING" is already supported.
So a cumulative sum should work then.
Thanks
Yong

From: java8...@hotmail.com
To: mich...@databricks.com; deenar.toras...@gmail.com
CC: spanayo...@msn.com; user@spark.apache.org
Subject: RE: Spark SQL running totals
Date: Thu, 15 Oct 2015 16:24:39 -0400




I'm not sure window functions can work for his case.
If you do a "sum() over (partition by ...)", that will return the total sum 
per partition, instead of the cumulative sum wanted in this case.
I saw there is a "cume_dist", but no "cume_sum".
Do we really have a "cume_sum" among the Spark window functions, or am I 
totally misunderstanding "sum() over (partition by ...)"?
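The difference asked about here can be sketched as follows. This is only an illustration under stated assumptions: SQLite (3.25 or newer, via Python's sqlite3 module) stands in for Spark SQL so the query can run locally, and the `grp` column is a hypothetical partition key that is not in the original data. The frame clause uses UNBOUNDED PRECEDING, which the thread reports as supported in Spark SQL, but the query has not been run against Spark here.

```python
import sqlite3

# Assumption: SQLite >= 3.25 stands in for Spark SQL so this runs locally.
conn = sqlite3.connect(":memory:")
# "grp" is a hypothetical partition key; every row is put in one partition.
conn.execute("CREATE TABLE tablea (grp TEXT, col_1 INTEGER, col_2 INTEGER)")
conn.executemany(
    "INSERT INTO tablea VALUES ('a', ?, ?)",
    [(1, 10), (2, 30), (3, 15), (4, 20), (5, 25)],
)

query = """
SELECT col_1,
       col_2,
       -- No ORDER BY in the window: every row sees the whole partition.
       SUM(col_2) OVER (PARTITION BY grp) AS partition_total,
       -- ORDER BY plus a frame ending at the current row: a running total.
       SUM(col_2) OVER (
           PARTITION BY grp
           ORDER BY col_1
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS running_total
FROM tablea
ORDER BY col_1
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

With the sample data from the original question, partition_total is 100 on every row, while running_total goes 10, 40, 55, 75, 100.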
Yong

From: mich...@databricks.com
Date: Thu, 15 Oct 2015 11:51:59 -0700
Subject: Re: Spark SQL running totals
To: deenar.toras...@gmail.com
CC: spanayo...@msn.com; user@spark.apache.org

Check out: 
https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
On Thu, Oct 15, 2015 at 11:35 AM, Deenar Toraskar <deenar.toras...@gmail.com> 
wrote:
You can do a self join of the table with itself, with the join clause being 
a.col1 >= b.col1:

    select a.col1, a.col2, sum(b.col2)
    from tablea as a left outer join tablea as b on (a.col1 >= b.col1)
    group by a.col1, a.col2

I haven't tried it, but I can't see why it wouldn't work. Doing it with RDDs 
might be more efficient, though; see 
https://bzhangusc.wordpress.com/2014/06/21/calculate-running-sums/
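
A quick way to sanity-check the self-join idea is to run the same query on the sample data from the original question. The sketch below assumes SQLite (through Python's sqlite3 module) as a local stand-in for Spark SQL; the `col3` alias and the final ORDER BY are additions for readability and are not part of the suggestion above.

```python
import sqlite3

# Assumption: SQLite stands in for Spark SQL; the table name "tablea" and the
# column names col1/col2 follow the query suggested in the thread.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tablea (col1 INTEGER, col2 INTEGER)")
db.executemany(
    "INSERT INTO tablea VALUES (?, ?)",
    [(1, 10), (2, 30), (3, 15), (4, 20), (5, 25)],
)

# Each row of "a" is joined to every row of "b" at or before it, so summing
# b.col2 per row of "a" yields the running total.
self_join = """
SELECT a.col1, a.col2, SUM(b.col2) AS col3
FROM tablea AS a
LEFT OUTER JOIN tablea AS b
  ON a.col1 >= b.col1
GROUP BY a.col1, a.col2
ORDER BY a.col1
"""
running = db.execute(self_join).fetchall()
for row in running:
    print(row)
```

Note that this relies on col1 being unique, and the join pairs each row with all earlier rows, so the intermediate result grows roughly quadratically with the table size.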
On 15 October 2015 at 18:48, Stefan Panayotov <spanayo...@msn.com> wrote:



Hi,

I need help with Spark SQL. I need to achieve something like the following.
If I have data like:

col_1  col_2
1      10
2      30
3      15
4      20
5      25

I need col_3 to be the running total of col_2, i.e. the sum of col_2 over the 
current row and all previous rows, e.g.

col_1  col_2  col_3
1      10     10
2      30     40
3      15     55
4      20     75
5      25     100

Is there a way to achieve this in Spark SQL, or maybe with DataFrame 
transformations?

Thanks in advance,


Stefan Panayotov, PhD
Home: 610-355-0919
Cell: 610-517-5586
email: spanayo...@msn.com
spanayo...@outlook.com
spanayo...@comcast.net




