[jira] [Commented] (SPARK-33116) Spark SQL window function with order by cause result incorrect

2020-10-12 Thread Will Du (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212359#comment-17212359
 ] 

Will Du commented on SPARK-33116:
-

[~maropu], the statement is comparing query with PARTITION BY and without 
PARTITION BY. But if you look at the query I provided, both of them have 
PARTITION BY clause. The only difference is the ORDER BY clause added or not. 
The expected result I think should be the same on both queries except the 
orders of rows (by price).

> Spark SQL window function with order by cause result incorrect
> --
>
> Key: SPARK-33116
> URL: https://issues.apache.org/jira/browse/SPARK-33116
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Will Du
>Priority: Major
>
> Prepare the data
> CREATE TABLE IF NOT EXISTS product_catalog (
> name STRING,category STRING,location STRING,price DECIMAL(10,2));
> INSERT OVERWRITE product_catalog VALUES 
> ('Nest Coffee', 'drink', 'Toronto', 15.5),
> ('Pepesi', 'drink', 'Toronto', 9.99),
> ('Hasimal', 'toy', 'Toronto', 5.9),
> ('Fire War', 'game', 'Toronto', 70.0),
> ('Final Fantasy', 'game', 'Montreal', 79.99),
> ('Lego Friends 15005', 'toy', 'Montreal', 12.99),
> ('Nesion Milk', 'drink', 'Montreal', 8.9);
> 1. Query without ORDER BY after PARTITION BY col,  the result is correct.
> SELECT
> category, price,
> max(price) over(PARTITION BY category) as max_p,
> min(price) over(PARTITION BY category) as min_p,
> sum(price) over(PARTITION BY category) as sum_p,
> avg(price) over(PARTITION BY category) as avg_p,
> count(*) over(PARTITION BY category) as count_w
> FROM
> product_catalog;
> || category    || price      || max_p  || min_p    || sum_p    || avg_p       
>     || count_w   ||
> | drink           | 8.90      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
> | drink           | 9.99      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
> | drink           | 15.50    | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
> | game          | 79.99    | 79.99    | 70.00 | 149.99 | 74.995000 | 2 |
> | game          | 70.00    | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
> | toy              | 12.99    | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
> | toy              | 5.90      | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
> 7 rows selected (0.442 seconds)
> 2 Query with ORDER BY after PARTITION BY col,  the result is NOT correct. Min 
> result is ok. Why other results are like that?
> SELECT
> category, price,
> max(price) over(PARTITION BY category ORDER BY price) as max_p,
> min(price) over(PARTITION BY category ORDER BY price) as min_p,
> sum(price) over(PARTITION BY category ORDER BY price) as sum_p,
> avg(price) over(PARTITION BY category ORDER BY price) as avg_p,
> count(*)   over(PARTITION BY category ORDER BY price) as count_w
> FROM
> product_catalog;
> || category    || price      || max_p  || min_p    || sum_p    || avg_p       
>     || count_w   ||
> | drink | 8.90   | 8.90   | 8.90   | 8.90| 8.90   | 1|
> | drink | 9.99   | 9.99   | 8.90   | 18.89   | 9.445000   | 2|
> | drink | 15.50  | 15.50  | 8.90   | 34.39   | 11.46  | 3|
> | game  | 70.00  | 70.00  | 70.00  | 70.00   | 70.00  | 1|
> | game  | 79.99  | 79.99  | 70.00  | 149.99  | 74.995000  | 2|
> | toy   | 5.90   | 5.90   | 5.90   | 5.90| 5.90   | 1|
> | toy   | 12.99  | 12.99  | 5.90   | 18.89   | 9.445000   | 2|
> 7 rows selected (0.436 seconds)
> Does it seem that we can only order by the columns after partition by clause?
> I do not think there are such limitation in standard SQL.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33116) Spark SQL window function with order by cause result incorrect

2020-10-11 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212130#comment-17212130
 ] 

Takeshi Yamamuro commented on SPARK-33116:
--

In SQL, they are different, I think. The statement below was cited from the 
PostgreSQL doc:
{code:java}
since there is no ORDER BY in the OVER clause, the window frame is the same as 
the partition, which for lack of PARTITION BY is the whole table; in other 
words each sum is taken over the whole table and so we get the same result for 
each output row. But if we add an ORDER BY clause, we get very different 
results:...
{code}
[https://www.postgresql.org/docs/current/tutorial-window.html]

> Spark SQL window function with order by cause result incorrect
> --
>
> Key: SPARK-33116
> URL: https://issues.apache.org/jira/browse/SPARK-33116
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Will Du
>Priority: Major
>
> Prepare the data
> CREATE TABLE IF NOT EXISTS product_catalog (
> name STRING,category STRING,location STRING,price DECIMAL(10,2));
> INSERT OVERWRITE product_catalog VALUES 
> ('Nest Coffee', 'drink', 'Toronto', 15.5),
> ('Pepesi', 'drink', 'Toronto', 9.99),
> ('Hasimal', 'toy', 'Toronto', 5.9),
> ('Fire War', 'game', 'Toronto', 70.0),
> ('Final Fantasy', 'game', 'Montreal', 79.99),
> ('Lego Friends 15005', 'toy', 'Montreal', 12.99),
> ('Nesion Milk', 'drink', 'Montreal', 8.9);
> 1. Query without ORDER BY after PARTITION BY col,  the result is correct.
> SELECT
> category, price,
> max(price) over(PARTITION BY category) as max_p,
> min(price) over(PARTITION BY category) as min_p,
> sum(price) over(PARTITION BY category) as sum_p,
> avg(price) over(PARTITION BY category) as avg_p,
> count(*) over(PARTITION BY category) as count_w
> FROM
> product_catalog;
> || category    || price      || max_p  || min_p    || sum_p    || avg_p       
>     || count_w   ||
> | drink           | 8.90      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
> | drink           | 9.99      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
> | drink           | 15.50    | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
> | game          | 79.99    | 79.99    | 70.00 | 149.99 | 74.995000 | 2 |
> | game          | 70.00    | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
> | toy              | 12.99    | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
> | toy              | 5.90      | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
> 7 rows selected (0.442 seconds)
> 2 Query with ORDER BY after PARTITION BY col,  the result is NOT correct. Min 
> result is ok. Why other results are like that?
> SELECT
> category, price,
> max(price) over(PARTITION BY category ORDER BY price) as max_p,
> min(price) over(PARTITION BY category ORDER BY price) as min_p,
> sum(price) over(PARTITION BY category ORDER BY price) as sum_p,
> avg(price) over(PARTITION BY category ORDER BY price) as avg_p,
> count(*)   over(PARTITION BY category ORDER BY price) as count_w
> FROM
> product_catalog;
> || category    || price      || max_p  || min_p    || sum_p    || avg_p       
>     || count_w   ||
> | drink | 8.90   | 8.90   | 8.90   | 8.90| 8.90   | 1|
> | drink | 9.99   | 9.99   | 8.90   | 18.89   | 9.445000   | 2|
> | drink | 15.50  | 15.50  | 8.90   | 34.39   | 11.46  | 3|
> | game  | 70.00  | 70.00  | 70.00  | 70.00   | 70.00  | 1|
> | game  | 79.99  | 79.99  | 70.00  | 149.99  | 74.995000  | 2|
> | toy   | 5.90   | 5.90   | 5.90   | 5.90| 5.90   | 1|
> | toy   | 12.99  | 12.99  | 5.90   | 18.89   | 9.445000   | 2|
> 7 rows selected (0.436 seconds)
> Does it seem that we can only order by the columns after partition by clause?
> I do not think there are such limitation in standard SQL.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org