dataframe cumulative sum

2015-05-29 Thread Cesar Flores
What would be the most appropriate method to add a cumulative sum column to
a data frame? For example, assuming that I have the following data frame:

flag | price
-----|-----------------
1    | 47.808764653746
1    | 47.808764653746
1    | 31.9869279512204


How can I create a data frame with an extra cumsum column like the following one:

flag | price            | cumsum_price
-----|------------------|----------------
1    | 47.808764653746  | 47.808764653746
1    | 47.808764653746  | 95.6175293075
1    | 31.9869279512204 | 127.604457259


Thanks
-- 
Cesar Flores


Re: dataframe cumulative sum

2015-05-29 Thread Yin Huai
Hi Cesar,

We just added it in Spark 1.4.

In Spark 1.4, you can use a window function in HiveContext to do it. Assuming
you want to calculate the cumulative sum for every flag:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

df.select(
  $"flag",
  $"price",
  sum($"price")
    .over(
      Window.partitionBy($"flag")
        .orderBy($"price")
        .rowsBetween(Long.MinValue, 0))
    .as("cumsum_price"))

In the code, over lets Spark SQL know that you want to use the window function
sum. partitionBy(flag) will partition the table by the value of flag, and
the sum's scope is a single partition. orderBy(price) will sort rows in a
partition based on the value of price (this probably does not really matter
for your case, but using orderBy makes the result deterministic).
Finally, rowsBetween(Long.MinValue, 0) means that the sum value for every
row is calculated from the price values of the first row in the partition to
the current row (so, you get the cumulative sum).
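The frame semantics of rowsBetween(Long.MinValue, 0) can be mimicked in plain
Scala with scanLeft, which may help you sanity-check the expected column values
without a Spark cluster. This is just a sketch of the running-sum logic for a
single partition, not Spark's implementation:

```scala
object CumSumSketch {
  // Running sum over one ordered partition: element i is the sum of
  // prices(0) through prices(i), i.e. "first row of the partition up
  // to the current row", matching rowsBetween(Long.MinValue, 0).
  def cumsum(prices: Seq[Double]): Seq[Double] =
    prices.scanLeft(0.0)(_ + _).tail

  def main(args: Array[String]): Unit = {
    val prices = Seq(47.808764653746, 47.808764653746, 31.9869279512204)
    cumsum(prices).foreach(println)
  }
}
```

With the sample prices above this prints the three cumulative values from the
expected table (47.808764653746, ~95.6175293075, ~127.604457259).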

Thanks,

Yin

On Fri, May 29, 2015 at 8:09 AM, Cesar Flores ces...@gmail.com wrote:

 What will be the more appropriate method to add a cumulative sum column to
 a data frame. For example, assuming that I have the next data frame:

 flag | price
 --
 1|47.808764653746
 1|47.808764653746
 1|31.9869279512204


 How can I create a data frame with an extra cumsum column as the next one:

 flag | price  | cumsum_price
 --|---
 1|47.808764653746 | 47.808764653746
 1|47.808764653746 | 95.6175293075
 1|31.9869279512204| 127.604457259


 Thanks
 --
 Cesar Flores