[ https://issues.apache.org/jira/browse/SPARK-33116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Will Du closed SPARK-33116. --------------------------- > Spark SQL window function with order by cause result incorrect > -------------------------------------------------------------- > > Key: SPARK-33116 > URL: https://issues.apache.org/jira/browse/SPARK-33116 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.0.1 > Reporter: Will Du > Priority: Major > > Prepare the data > CREATE TABLE IF NOT EXISTS product_catalog ( > name STRING,category STRING,location STRING,price DECIMAL(10,2)); > INSERT OVERWRITE product_catalog VALUES > ('Nest Coffee', 'drink', 'Toronto', 15.5), > ('Pepesi', 'drink', 'Toronto', 9.99), > ('Hasimal', 'toy', 'Toronto', 5.9), > ('Fire War', 'game', 'Toronto', 70.0), > ('Final Fantasy', 'game', 'Montreal', 79.99), > ('Lego Friends 15005', 'toy', 'Montreal', 12.99), > ('Nesion Milk', 'drink', 'Montreal', 8.9); > 1. Query without ORDER BY after PARTITION BY col, the result is correct. > SELECT > category, price, > max(price) over(PARTITION BY category) as max_p, > min(price) over(PARTITION BY category) as min_p, > sum(price) over(PARTITION BY category) as sum_p, > avg(price) over(PARTITION BY category) as avg_p, > count(*) over(PARTITION BY category) as count_w > FROM > product_catalog; > || category || price || max_p || min_p || sum_p || avg_p > || count_w || > | drink | 8.90 | 15.50 | 8.90 | 34.39 | 11.463333 | 3 | > | drink | 9.99 | 15.50 | 8.90 | 34.39 | 11.463333 | 3 | > | drink | 15.50 | 15.50 | 8.90 | 34.39 | 11.463333 | 3 | > | game | 79.99 | 79.99 | 70.00 | 149.99 | 74.995000 | 2 | > | game | 70.00 | 79.99 | 70.00 | 149.99 | 74.995000 | 2 | > | toy | 12.99 | 12.99 | 5.90 | 18.89 | 9.445000 | 2 | > | toy | 5.90 | 12.99 | 5.90 | 18.89 | 9.445000 | 2 | > 7 rows selected (0.442 seconds) > 2 Query with ORDER BY after PARTITION BY col, the result is NOT correct. Min > result is ok. Why other results are like that? > SELECT > category, price, > max(price) over(PARTITION BY category ORDER BY price) as max_p, > min(price) over(PARTITION BY category ORDER BY price) as min_p, > sum(price) over(PARTITION BY category ORDER BY price) as sum_p, > avg(price) over(PARTITION BY category ORDER BY price) as avg_p, > count(*) over(PARTITION BY category ORDER BY price) as count_w > FROM > product_catalog; > || category || price || max_p || min_p || sum_p || avg_p > || count_w || > | drink | 8.90 | 8.90 | 8.90 | 8.90 | 8.900000 | 1 | > | drink | 9.99 | 9.99 | 8.90 | 18.89 | 9.445000 | 2 | > | drink | 15.50 | 15.50 | 8.90 | 34.39 | 11.463333 | 3 | > | game | 70.00 | 70.00 | 70.00 | 70.00 | 70.000000 | 1 | > | game | 79.99 | 79.99 | 70.00 | 149.99 | 74.995000 | 2 | > | toy | 5.90 | 5.90 | 5.90 | 5.90 | 5.900000 | 1 | > | toy | 12.99 | 12.99 | 5.90 | 18.89 | 9.445000 | 2 | > 7 rows selected (0.436 seconds) > Does it seem that we can only order by the columns after partition by clause? > I do not think there are such limitation in standard SQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org