This is not a bug, but the intended behavior of window functions.
Combining max with rowsBetween is a somewhat unusual requirement. rowsBetween is more commonly used to calculate a moving sum or avg. In your case, you want the frame to be the 2 rows before plus the 2 rows after the current row, and to take the max over it. Note, however, that rowsBetween(2,2) starts and ends the frame 2 rows after the current row, so the frame holds only that single row. If the current row is already the last or second-to-last row of its partition (the highest revenues per category), there is no row 2 positions after it. In that case the frame is empty, max has nothing to aggregate, and it returns NULL. A frame of 2 rows before through 2 rows after would be rowsBetween(-2,2).

Yong

________________________________
From: Han-Cheol Cho <hancheol....@nhn-techorus.com>
Sent: Monday, November 28, 2016 10:57 PM
To: user@spark.apache.org
Subject: null values returned by max() over a window function

Hello,

I am trying to test Spark's SQL window functions described in the following blog post,

https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html

and am facing the following problem:

    # testing rowsBetween()
    winSpec2 = window.Window.partitionBy(data["category"]).orderBy(data["revenue"]).rowsBetween(2,2)
    tmp4 = functions.max(data["revenue"]).over(winSpec2)
    data.select(["product","category","revenue", tmp4.alias("rowbetween2and2")]).orderBy(["category","revenue"]).show()

    +----------+----------+-------+---------------+
    |   product|  category|revenue|rowbetween2and2|
    +----------+----------+-------+---------------+
    |  Bendable|Cell phone|   3000|           5000|
    |  Foldable|Cell phone|   3000|           6000|
    |Ultra thin|Cell phone|   5000|           6000|
    |      Thin|Cell phone|   6000|           null|  --> ???
    | Very thin|Cell phone|   6000|           null|
    |    Normal|    Tablet|   1500|           4500|
    |       Big|    Tablet|   2500|           5500|
    |       Pro|    Tablet|   4500|           6500|
    |      Mini|    Tablet|   5500|           null|
    |      Pro2|    Tablet|   6500|           null|
    +----------+----------+-------+---------------+

As you can see, the last column is meant to hold the max value among the current row, the two rows before it and the two rows after it, within each category partition. However, the result for the last two rows in each category partition is null.

Is there something that I missed or is this a bug?

Han-Cheol Cho
Data Laboratory / Data Scientist
Email hancheol....@nhn-techorus.com
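
For readers following the thread, the frame behavior can be checked without a Spark cluster. The sketch below is plain Python, not Spark code: it emulates how a row-based frame from `start` to `end` (offsets relative to the current row) is resolved inside each category partition ordered by revenue. The helper name `max_over_frame` is hypothetical and exists only for this illustration, and the rows are copied from the table in the question. It assumes that an empty frame yields null, which matches the output shown in the question.

```python
from itertools import groupby

# Sample rows from the question: (product, category, revenue).
rows = [
    ("Bendable", "Cell phone", 3000),
    ("Foldable", "Cell phone", 3000),
    ("Ultra thin", "Cell phone", 5000),
    ("Thin", "Cell phone", 6000),
    ("Very thin", "Cell phone", 6000),
    ("Normal", "Tablet", 1500),
    ("Big", "Tablet", 2500),
    ("Pro", "Tablet", 4500),
    ("Mini", "Tablet", 5500),
    ("Pro2", "Tablet", 6500),
]


def max_over_frame(rows, start, end):
    """Hypothetical helper: max(revenue) over rows [i + start, i + end]
    of each category partition, ordered by revenue.
    Returns None when the frame contains no rows."""
    result = []
    ordered = sorted(rows, key=lambda r: (r[1], r[2]))
    for _, group in groupby(ordered, key=lambda r: r[1]):
        part = list(group)
        for i, (product, category, revenue) in enumerate(part):
            # Clip the frame to the partition boundaries.
            lo = max(i + start, 0)
            hi = min(i + end, len(part) - 1)
            frame = [part[j][2] for j in range(lo, hi + 1)]
            result.append(
                (product, category, revenue, max(frame) if frame else None)
            )
    return result


# Frame (2, 2): only the row two positions ahead, so the last two rows
# of each partition get an empty frame.
for row in max_over_frame(rows, 2, 2):
    print(row)

# Frame (-2, 2): two rows before through two rows after the current row.
for row in max_over_frame(rows, -2, 2):
    print(row)
```

With offsets (2, 2) the emulation yields None for the last two rows of each category, as in the question. With offsets (-2, 2) every frame includes the current row, so no row comes back empty. In PySpark the corresponding change would be `rowsBetween(-2, 2)` on the window specification.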