This is not a bug, but the intended behavior of the window function.

Using max together with rowsBetween is a somewhat unusual requirement.


rowsBetween is more commonly used to calculate a moving sum or avg, which 
simply skip null values.


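But in your case, you want your window to be 2 rows before + 2 rows after the 
current row, plus the max function. rowsBetween(2, 2), however, sets both the 
start and the end of the frame to 2 rows after the current row, so the frame 
contains only that single row. If the current row is one of the last two rows 
of its category (the ones with the highest revenue), there is no row 2 
positions after it, so the frame is empty and max returns NULL. A frame from 
2 rows before to 2 rows after the current row is rowsBetween(-2, 2).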
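
To see how the frame offsets play out on the revenue values from the question, 
here is a minimal pure-Python sketch. It only imitates row-based frames on an 
already sorted list; max_over_rows_between is a hypothetical helper written for 
this illustration, not a Spark API, and it assumes the values contain no nulls.

```python
def max_over_rows_between(values, start, end):
    """Imitate max() over a row-based frame on one sorted partition.

    start and end are offsets relative to the current row: negative means
    rows before it, positive means rows after it, 0 is the current row.
    Hypothetical helper for illustration only, not part of Spark.
    """
    result = []
    n = len(values)
    for i in range(n):
        lo = max(i + start, 0)       # clip the frame at the partition start
        hi = min(i + end, n - 1)     # clip the frame at the partition end
        frame = values[lo:hi + 1] if lo <= hi else []
        # An empty frame has no maximum, which corresponds to NULL.
        result.append(max(frame) if frame else None)
    return result


# "Cell phone" revenues from the question, ordered by revenue.
phone = [3000, 3000, 5000, 6000, 6000]

print(max_over_rows_between(phone, 2, 2))
# [5000, 6000, 6000, None, None]
print(max_over_rows_between(phone, -2, 2))
# [5000, 6000, 6000, 6000, 6000]
```

The first call reproduces the rowbetween2and2 column for the Cell phone 
partition, including the two trailing nulls. The second call corresponds to 
what rowsBetween(-2, 2) would describe in the window spec.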


Yong


________________________________
From: Han-Cheol Cho <hancheol....@nhn-techorus.com>
Sent: Monday, November 28, 2016 10:57 PM
To: user@spark.apache.org
Subject: null values returned by max() over a window function


Hello,

I am trying to test Spark's SQL window functions in the following blog,
  
https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html

and I am facing the following problem:

# testing rowsBetween()
winSpec2 = window.Window.partitionBy(data["category"]).orderBy(data["revenue"]).rowsBetween(2,2)
tmp4 = functions.max(data["revenue"]).over(winSpec2)
data.select(["product","category","revenue", tmp4.alias("rowbetween2and2")]).orderBy(["category","revenue"]).show()

+----------+----------+-------+---------------+
|   product|  category|revenue|rowbetween2and2|
+----------+----------+-------+---------------+
|  Bendable|Cell phone|   3000|           5000|
|  Foldable|Cell phone|   3000|           6000|
|Ultra thin|Cell phone|   5000|           6000|
|      Thin|Cell phone|   6000|           null| --> ???
| Very thin|Cell phone|   6000|           null|
|    Normal|    Tablet|   1500|           4500|
|       Big|    Tablet|   2500|           5500|
|       Pro|    Tablet|   4500|           6500|
|      Mini|    Tablet|   5500|           null|
|      Pro2|    Tablet|   6500|           null|
+----------+----------+-------+---------------+

As you can see, the last column is meant to calculate the max value among the 
current row, the two rows before it, and the two rows after it, within each 
category partition.
However, the result for the last two rows in each category partition is null.

Is there something that I missed or is this a bug?



Han-Cheol Cho Data Laboratory / Data Scientist
Email  hancheol....@nhn-techorus.com <mailto:hancheol....@nhn-techorus.com>

