Re: Clarify window behavior in Spark SQL

2018-04-09 Thread Sandor Murakozi
Hi Li,
You might find my pending PR useful:
https://github.com/apache/spark/pull/20045/files

It contains a large set of test cases covering the windowing functionality,
exercising and checking the behavior of a number of special cases.
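For a flavor of the behavior such tests pin down, here is a minimal,
self-contained PySpark sketch (my own, not taken from the PR) that asserts
the two default frames discussed downthread:

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import mean

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')

# Without ORDER BY, the frame is the whole partition: every row sees 2.0.
w1 = Window.partitionBy('id')
r1 = {row.v: row.v2 for row in df.withColumn('v2', mean(df.v).over(w1)).collect()}
assert r1 == {1: 2.0, 2: 2.0, 3: 2.0}

# With ORDER BY but no explicit frame, the default is (unboundedPreceding,
# currentRow), so the mean becomes a running mean.
w2 = Window.partitionBy('id').orderBy('v')
r2 = {row.v: row.v2 for row in df.withColumn('v2', mean(df.v).over(w2)).collect()}
assert r2 == {1: 1.0, 2: 1.5, 3: 2.0}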


Re: Clarify window behavior in Spark SQL

2018-04-03 Thread Reynold Xin
Thanks Li!


Re: Clarify window behavior in Spark SQL

2018-04-03 Thread Li Jin
Thanks all for the explanation. I am happy to update the API doc.

https://issues.apache.org/jira/browse/SPARK-23861


Re: Clarify window behavior in Spark SQL

2018-04-03 Thread Reynold Xin
Ah ok. Thanks for commenting. Every day I learn something new about SQL.

For others to follow, SQL Server has a good explanation of the behavior:
https://docs.microsoft.com/en-us/sql/t-sql/queries/select-over-clause-transact-sql


Can somebody (Li?) update the API documentation to specify the gotchas, in
case users are not familiar with SQL window function semantics?



General Remarks


More than one window function can be used in a single query with a single
FROM clause. The OVER clause for each function can differ in partitioning
and ordering.

If PARTITION BY is not specified, the function treats all rows of the query
result set as a single group.
Important!


If ROWS/RANGE is specified and <window frame preceding> is used for
<window frame extent> (short syntax), then this specification is used for
the window frame boundary starting point, and CURRENT ROW is used for the
boundary ending point. For example, “ROWS 5 PRECEDING” is equivalent to
“ROWS BETWEEN 5 PRECEDING AND CURRENT ROW”.

Note

If ORDER BY is not specified, the entire partition is used for the window
frame. This applies only to functions that do not require an ORDER BY
clause. If ROWS/RANGE is not specified but ORDER BY is specified, RANGE
BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is used as the default window
frame. This applies only to functions that can accept an optional
ROWS/RANGE specification. For example, ranking functions cannot accept
ROWS/RANGE, therefore this window frame is not applied even though ORDER BY
is present and ROWS/RANGE is not.
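To make the short-syntax rule above concrete in Spark, a quick sketch
(plain PySpark; the table t and its columns are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.createDataFrame([[0, v] for v in range(1, 7)]).toDF('id', 'v') \
    .createOrReplaceTempView('t')

# Short syntax: the given bound is the frame start, CURRENT ROW the end.
short = spark.sql(
    "SELECT id, v, avg(v) OVER "
    "(PARTITION BY id ORDER BY v ROWS 5 PRECEDING) AS a FROM t").collect()

# The fully spelled-out equivalent.
full = spark.sql(
    "SELECT id, v, avg(v) OVER (PARTITION BY id ORDER BY v "
    "ROWS BETWEEN 5 PRECEDING AND CURRENT ROW) AS a FROM t").collect()

assert sorted(short) == sorted(full)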






Re: Clarify window behavior in Spark SQL

2018-04-03 Thread Xingbo Jiang
This is actually by design: without an `ORDER BY` clause, all rows are
considered peers of the current row, which means that the frame is
effectively the entire partition. This behavior follows the window syntax
of PGSQL.
You can refer to the comment by yhuai:
https://github.com/apache/spark/pull/5604#discussion_r157931911
:)
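If you need the ordering but still want the whole-partition frame, you can
spell the frame out explicitly instead of relying on the default; a minimal
sketch:

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import mean

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')

# Ordered window with an explicit frame covering the entire partition, so
# the implicit (unboundedPreceding, currentRow) default does not apply.
w = (Window.partitionBy('id').orderBy('v')
     .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

df.withColumn('v2', mean(df.v).over(w)).show()
# All three rows now show v2 = 2.0, matching the unordered window.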



Re: Clarify window behavior in Spark SQL

2018-04-03 Thread Reynold Xin
Do other (non-Hive) SQL systems do the same thing?

On Tue, Apr 3, 2018 at 3:16 PM, Herman van Hövell tot Westerflier
<her...@databricks.com> wrote:

> This is something we inherited from Hive:
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
>
> When ORDER BY is specified with missing WINDOW clause, the WINDOW
>> specification defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT
>> ROW.
>
> When both ORDER BY and WINDOW clauses are missing, the WINDOW
>> specification defaults to ROW BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED
>> FOLLOWING.
>
>
> It sort of makes sense if you think about it. If there is no ordering,
> there is no way to have a bounded frame. If there is an ordering, we
> default to the most commonly used deterministic frame.
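For what it's worth, one way to check which default was picked is to look
at the resolved window spec in the query plan; a small sketch (the exact
plan rendering varies across Spark versions):

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import mean

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')

# ORDER BY present, frame omitted: Spark fills in the default frame.
w = Window.partitionBy('id').orderBy('v')
df.withColumn('v2', mean(df.v).over(w)).explain()
# The printed plan shows the resolved window spec with a RANGE frame from
# UNBOUNDED PRECEDING to CURRENT ROW.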


Re: Clarify window behavior in Spark SQL

2018-04-03 Thread Li Jin
Here is the original code and comments:
https://github.com/apache/spark/commit/b6b50efc854f298d5b3e11c05dca995a85bec962#diff-4a8f00ca33a80744965463dcc6662c75L277

This seems intentional, although I am not really sure why - maybe to match
other SQL systems' behavior?



Re: Clarify window behavior in Spark SQL

2018-04-03 Thread Reynold Xin
Seems like a bug.



On Tue, Apr 3, 2018 at 1:26 PM, Li Jin  wrote:

> Hi Devs,
>
> I am seeing some behavior with window functions that is a bit unintuitive
> and would like to get some clarification.
>
> When using an aggregation function with a window, the frame boundary seems
> to change depending on the ordering of the window.
>
> Example:
> (1)
>
> df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')
>
> w1 = Window.partitionBy('id')
>
> df.withColumn('v2', mean(df.v).over(w1)).show()
>
> +---+---+---+
> | id|  v| v2|
> +---+---+---+
> |  0|  1|2.0|
> |  0|  2|2.0|
> |  0|  3|2.0|
> +---+---+---+
>
> (2)
> df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')
>
> w2 = Window.partitionBy('id').orderBy('v')
>
> df.withColumn('v2', mean(df.v).over(w2)).show()
>
> +---+---+---+
> | id|  v| v2|
> +---+---+---+
> |  0|  1|1.0|
> |  0|  2|1.5|
> |  0|  3|2.0|
> +---+---+---+
>
> Seems like orderBy('v') in example (2) also changes the frame boundaries
> from (unboundedPreceding, unboundedFollowing) to (unboundedPreceding,
> currentRow).
>
>
> I found this behavior a bit unintuitive. I wonder if this behavior is by
> design and, if so, what is the specific rule by which orderBy() interacts
> with frame boundaries?
>
>
> Thanks,
>
> Li