[jira] [Updated] (SPARK-38592) Column name contains back tick `
[ https://issues.apache.org/jira/browse/SPARK-38592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Du updated SPARK-38592:
------------------------------
Description:

Trying to modify the data frame to ensure the column names have no special characters:

{code:java}
df.columns.map { columnName =>
  df.col(surroundingBackTickedName(columnName)).as(normalizeName(columnName))
}
{code}

*surroundingBackTickedName()* encloses the column name in backticks. However, col() keeps failing on column names that contain a backtick, because {*}parseAttributeName{*}() only accepts backticks that appear in pairs. I am wondering if there is a workaround.

{code:java}
org.apache.spark.sql.AnalysisException: Cannot resolve column name "`COLNAME`2`" among (COLID, COLNAME!4, COLNAME#6, COLNAME$7, COLNAME%8, COLNAME'25, COLNAME(11, COLNAME)12, COLNAME*10, COLNAME+16, COLNAME,26, COLNAME-13, COLNAME/30, COLNAME:24, COLNAME;23, COLNAME<27, COLNAME=15, COLNAME>29, COLNAME?31, COLNAME@5, COLNAME`2)
{code}

was:

Trying to modify the data frame to ensure the column names have no special characters:

{code:java}
val newdf = df.select(
  df.columns.map { columnName =>
    df.col(surroundingBackTickedName(columnName)).as(normalizeName(columnName))
  }: _*
)
{code}

*surroundingBackTickedName()* encloses the column name in backticks. However, col() keeps failing on column names that contain a backtick, because {*}parseAttributeName{*}() only accepts backticks that appear in pairs. I am wondering if there is a workaround (the same AnalysisException as above).

> Column name contains back tick `
> --------------------------------
>
>                 Key: SPARK-38592
>                 URL: https://issues.apache.org/jira/browse/SPARK-38592
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.1
>            Reporter: Dennis Du
>            Priority: Major
>              Labels: bulk-closed

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
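The normalizeName() helper is not shown in the ticket; the sketch below is a hypothetical Python implementation (the regex and the underscore replacement are assumptions for illustration, not taken from the reporter's code) of what such a normalizer might do with the column names listed in the exception:

```python
import re

def normalize_name(name: str) -> str:
    # Hypothetical normalizer: replace every character outside
    # [A-Za-z0-9_] with an underscore, yielding a plain identifier
    # that needs no backtick quoting.
    return re.sub(r"[^A-Za-z0-9_]", "_", name)

print(normalize_name("COLNAME`2"))  # COLNAME_2
print(normalize_name("COLNAME@5"))  # COLNAME_5
```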
[jira] [Updated] (SPARK-38592) Column name contains back tick `
[ https://issues.apache.org/jira/browse/SPARK-38592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Du updated SPARK-38592:
------------------------------
Description:

Trying to modify the data frame to ensure the column names have no special characters:

{code:java}
val newdf = df.select(
  df.columns.map { columnName =>
    df.col(surroundingBackTickedName(columnName)).as(normalizeName(columnName))
  }: _*
)
{code}

*surroundingBackTickedName()* encloses the column name in backticks. However, col() keeps failing on column names that contain a backtick, because {*}parseAttributeName{*}() only accepts backticks that appear in pairs. I am wondering if there is a workaround.

{code:java}
org.apache.spark.sql.AnalysisException: Cannot resolve column name "`COLNAME`2`" among (COLID, COLNAME!4, COLNAME#6, COLNAME$7, COLNAME%8, COLNAME'25, COLNAME(11, COLNAME)12, COLNAME*10, COLNAME+16, COLNAME,26, COLNAME-13, COLNAME/30, COLNAME:24, COLNAME;23, COLNAME<27, COLNAME=15, COLNAME>29, COLNAME?31, COLNAME@5, COLNAME`2)
{code}

was:

Trying to modify the data frame to ensure the column names have no special characters:

{code:java}
val newdf = df.select(
  df.columns.map { columnName =>
    df.col(surroundingBackTickedName(columnName)).as(normalizeName(columnName))
  }: _*
)
{code}

col() kept failing on column names containing a backtick (the same AnalysisException as above).
[jira] [Created] (SPARK-38592) Column name contains back tick `
Dennis Du created SPARK-38592:
---------------------------------

             Summary: Column name contains back tick `
                 Key: SPARK-38592
                 URL: https://issues.apache.org/jira/browse/SPARK-38592
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.0.1
            Reporter: Dennis Du

Trying to modify the data frame to ensure the column names have no special characters:

{code:java}
val newdf = df.select(
  df.columns.map { columnName =>
    df.col(surroundingBackTickedName(columnName)).as(normalizeName(columnName))
  }: _*
)
{code}

col() keeps failing on column names that contain a backtick:

{code:java}
org.apache.spark.sql.AnalysisException: Cannot resolve column name "`COLNAME`2`" among (COLID, COLNAME!4, COLNAME#6, COLNAME$7, COLNAME%8, COLNAME'25, COLNAME(11, COLNAME)12, COLNAME*10, COLNAME+16, COLNAME,26, COLNAME-13, COLNAME/30, COLNAME:24, COLNAME;23, COLNAME<27, COLNAME=15, COLNAME>29, COLNAME?31, COLNAME@5, COLNAME`2)
{code}
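One workaround consistent with Spark SQL's quoted-identifier syntax (where a literal backtick inside a quoted name is escaped by doubling it) is to double embedded backticks before wrapping the name, so parseAttributeName() only ever sees paired backticks. Below is a minimal sketch of that escaping rule, in Python for illustration; the function name mirrors the ticket's surroundingBackTickedName(), and the doubling convention is an assumption based on Spark's documented identifier quoting rather than anything confirmed in this ticket:

```python
def surrounding_back_ticked_name(name: str) -> str:
    # Double each embedded backtick, then wrap the whole name in
    # backticks: "COLNAME`2" becomes "`COLNAME``2`" instead of the
    # "`COLNAME`2`" form that the AnalysisException complains about.
    return "`" + name.replace("`", "``") + "`"

print(surrounding_back_ticked_name("COLNAME`2"))  # `COLNAME``2`
```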
[jira] [Closed] (SPARK-33116) Spark SQL window function with order by cause result incorrect
[ https://issues.apache.org/jira/browse/SPARK-33116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Will Du closed SPARK-33116.
---------------------------

> Spark SQL window function with order by cause result incorrect
> --------------------------------------------------------------
>
>                 Key: SPARK-33116
>                 URL: https://issues.apache.org/jira/browse/SPARK-33116
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.1
>            Reporter: Will Du
>            Priority: Major
>
> Prepare the data:
>
> CREATE TABLE IF NOT EXISTS product_catalog (
>   name STRING, category STRING, location STRING, price DECIMAL(10,2));
>
> INSERT OVERWRITE product_catalog VALUES
>   ('Nest Coffee', 'drink', 'Toronto', 15.5),
>   ('Pepesi', 'drink', 'Toronto', 9.99),
>   ('Hasimal', 'toy', 'Toronto', 5.9),
>   ('Fire War', 'game', 'Toronto', 70.0),
>   ('Final Fantasy', 'game', 'Montreal', 79.99),
>   ('Lego Friends 15005', 'toy', 'Montreal', 12.99),
>   ('Nesion Milk', 'drink', 'Montreal', 8.9);
>
> 1. Query without ORDER BY after the PARTITION BY column: the result is correct.
>
> SELECT
>   category, price,
>   max(price) over(PARTITION BY category) as max_p,
>   min(price) over(PARTITION BY category) as min_p,
>   sum(price) over(PARTITION BY category) as sum_p,
>   avg(price) over(PARTITION BY category) as avg_p,
>   count(*) over(PARTITION BY category) as count_w
> FROM
>   product_catalog;
>
> || category || price || max_p || min_p || sum_p || avg_p || count_w ||
> | drink | 8.90 | 15.50 | 8.90 | 34.39 | 11.46 | 3 |
> | drink | 9.99 | 15.50 | 8.90 | 34.39 | 11.46 | 3 |
> | drink | 15.50 | 15.50 | 8.90 | 34.39 | 11.46 | 3 |
> | game | 79.99 | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
> | game | 70.00 | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
> | toy | 12.99 | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
> | toy | 5.90 | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
>
> 7 rows selected (0.442 seconds)
>
> 2. Query with ORDER BY after the PARTITION BY column: the result is NOT correct. The min result is OK. Why are the other results like that?
>
> SELECT
>   category, price,
>   max(price) over(PARTITION BY category ORDER BY price) as max_p,
>   min(price) over(PARTITION BY category ORDER BY price) as min_p,
>   sum(price) over(PARTITION BY category ORDER BY price) as sum_p,
>   avg(price) over(PARTITION BY category ORDER BY price) as avg_p,
>   count(*) over(PARTITION BY category ORDER BY price) as count_w
> FROM
>   product_catalog;
>
> || category || price || max_p || min_p || sum_p || avg_p || count_w ||
> | drink | 8.90 | 8.90 | 8.90 | 8.90 | 8.90 | 1 |
> | drink | 9.99 | 9.99 | 8.90 | 18.89 | 9.445000 | 2 |
> | drink | 15.50 | 15.50 | 8.90 | 34.39 | 11.46 | 3 |
> | game | 70.00 | 70.00 | 70.00 | 70.00 | 70.00 | 1 |
> | game | 79.99 | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
> | toy | 5.90 | 5.90 | 5.90 | 5.90 | 5.90 | 1 |
> | toy | 12.99 | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
>
> 7 rows selected (0.436 seconds)
>
> Does this mean we can only ORDER BY the columns that appear in the PARTITION BY clause? I do not think there is such a limitation in standard SQL.
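The results in query 2 are in fact the standard-SQL behavior: when a window specification has an ORDER BY but no explicit frame clause, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so every aggregate becomes a running aggregate over the rows seen so far. min_p only looks correct because the running minimum of an ascending sequence equals the partition minimum. A small Python sketch of that cumulative-frame semantics, applied to the "drink" partition from the data above:

```python
# Emulate the default window frame (RANGE BETWEEN UNBOUNDED PRECEDING
# AND CURRENT ROW) that applies when a window spec has ORDER BY but no
# explicit frame: each row aggregates only the rows up to and including
# its own position in the sort order.
prices = sorted([15.5, 9.99, 8.9])  # the "drink" partition, ORDER BY price

running_sum = [round(sum(prices[: i + 1]), 2) for i in range(len(prices))]
running_max = [max(prices[: i + 1]) for i in range(len(prices))]
running_min = [min(prices[: i + 1]) for i in range(len(prices))]

print(running_sum)  # [8.9, 18.89, 34.39] -- matches sum_p in query 2
print(running_max)  # [8.9, 9.99, 15.5]   -- max_p grows row by row
print(running_min)  # [8.9, 8.9, 8.9]     -- min_p matches query 1's result
```

Adding an explicit frame such as ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING makes query 2 return the same whole-partition aggregates as query 1.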
[jira] [Comment Edited] (SPARK-33116) Spark SQL window function with order by cause result incorrect
[ https://issues.apache.org/jira/browse/SPARK-33116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17212359#comment-17212359 ]

Will Du edited comment on SPARK-33116 at 10/12/20, 1:41 PM:
-----------------------------------------------------------

[~maropu], the statement you mentioned does not have PARTITION BY in its example. But I am able to reproduce the same behavior in SQL Server, so I think this can be closed.

was (Author: willddy):

[~maropu], the statement you mentioned compares queries with and without PARTITION BY. But both of the queries I provided have a PARTITION BY clause; the only difference is whether the ORDER BY clause is present. I would expect both queries to return the same result, apart from the ordering of the rows (by price).
[jira] [Comment Edited] (SPARK-33116) Spark SQL window function with order by cause result incorrect
[ https://issues.apache.org/jira/browse/SPARK-33116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17212359#comment-17212359 ]

Will Du edited comment on SPARK-33116 at 10/12/20, 1:19 PM:
-----------------------------------------------------------

[~maropu], the statement you mentioned compares queries with and without PARTITION BY. But both of the queries I provided have a PARTITION BY clause; the only difference is whether the ORDER BY clause is present. I would expect both queries to return the same result, apart from the ordering of the rows (by price).
[jira] [Comment Edited] (SPARK-33116) Spark SQL window function with order by cause result incorrect
[ https://issues.apache.org/jira/browse/SPARK-33116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17212359#comment-17212359 ]

Will Du edited comment on SPARK-33116 at 10/12/20, 1:18 PM:
-----------------------------------------------------------

[~maropu], the statement you mentioned compares queries with and without PARTITION BY. But both of the queries I provided have a PARTITION BY clause; the only difference is whether the ORDER BY clause is present. I would expect both queries to return the same result, apart from the ordering of the rows (by price).

was (Author: willddy):

[~maropu], the statement compares a query with PARTITION BY and one without. But both of the queries I provided have a PARTITION BY clause; the only difference is whether the ORDER BY clause is present. I would expect both queries to return the same result, apart from the ordering of the rows (by price).
[jira] [Commented] (SPARK-33116) Spark SQL window function with order by cause result incorrect
[ https://issues.apache.org/jira/browse/SPARK-33116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17212359#comment-17212359 ]

Will Du commented on SPARK-33116:
---------------------------------

[~maropu], the statement compares a query with PARTITION BY and one without. But both of the queries I provided have a PARTITION BY clause; the only difference is whether the ORDER BY clause is present. I would expect both queries to return the same result, apart from the ordering of the rows (by price).
[jira] [Updated] (SPARK-33116) Spark SQL window function with order by cause result incorrect
[ https://issues.apache.org/jira/browse/SPARK-33116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Will Du updated SPARK-33116:
----------------------------
[jira] [Updated] (SPARK-33116) Spark SQL window function with order by cause result incorrect
[ https://issues.apache.org/jira/browse/SPARK-33116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Du updated SPARK-33116: Description:

Prepare the data:

CREATE TABLE IF NOT EXISTS product_catalog (
    name STRING,
    category STRING,
    location STRING,
    price DECIMAL(10,2));

INSERT OVERWRITE product_catalog VALUES
    ('Nest Coffee', 'drink', 'Toronto', 15.5),
    ('Pepesi', 'drink', 'Toronto', 9.99),
    ('Hasimal', 'toy', 'Toronto', 5.9),
    ('Fire War', 'game', 'Toronto', 70.0),
    ('Final Fantasy', 'game', 'Montreal', 79.99),
    ('Lego Friends 15005', 'toy', 'Montreal', 12.99),
    ('Nesion Milk', 'drink', 'Montreal', 8.9);

1. Query without ORDER BY after the PARTITION BY column: the result is correct.

SELECT category, price,
    max(price) over(PARTITION BY category) as max_p,
    min(price) over(PARTITION BY category) as min_p,
    sum(price) over(PARTITION BY category) as sum_p,
    avg(price) over(PARTITION BY category) as avg_p,
    count(*) over(PARTITION BY category) as count_w
FROM product_catalog;

|| category || price || max_p || min_p || sum_p || avg_p || count_w ||
| drink | 8.90 | 15.50 | 8.90 | 34.39 | 11.46 | 3 |
| drink | 9.99 | 15.50 | 8.90 | 34.39 | 11.46 | 3 |
| drink | 15.50 | 15.50 | 8.90 | 34.39 | 11.46 | 3 |
| game | 79.99 | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
| game | 70.00 | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
| toy | 12.99 | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
| toy | 5.90 | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
7 rows selected (0.442 seconds)

2. Query with ORDER BY after the PARTITION BY column: the result is NOT correct. The min result is OK. Why are the other results like that?

SELECT category, price,
    max(price) over(PARTITION BY category ORDER BY price) as max_p,
    min(price) over(PARTITION BY category ORDER BY price) as min_p,
    sum(price) over(PARTITION BY category ORDER BY price) as sum_p,
    avg(price) over(PARTITION BY category ORDER BY price) as avg_p,
    count(*) over(PARTITION BY category ORDER BY price) as count_w
FROM product_catalog;

|| category || price || max_p || min_p || sum_p || avg_p || count_w ||
| drink | 8.90 | 8.90 | 8.90 | 8.90 | 8.90 | 1 |
| drink | 9.99 | 9.99 | 8.90 | 18.89 | 9.445000 | 2 |
| drink | 15.50 | 15.50 | 8.90 | 34.39 | 11.46 | 3 |
| game | 70.00 | 70.00 | 70.00 | 70.00 | 70.00 | 1 |
| game | 79.99 | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
| toy | 5.90 | 5.90 | 5.90 | 5.90 | 5.90 | 1 |
| toy | 12.99 | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
7 rows selected (0.436 seconds)

Does it seem that we can only ORDER BY the columns named in the PARTITION BY clause? I do not think there is such a limitation in standard SQL.
[jira] [Created] (SPARK-33116) Spark SQL window function with order by cause result incorrect
Will Du created SPARK-33116:
---
Summary: Spark SQL window function with order by cause result incorrect
Key: SPARK-33116
URL: https://issues.apache.org/jira/browse/SPARK-33116
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.0.1
Reporter: Will Du

Prepare the data:

CREATE TABLE IF NOT EXISTS product_catalog (
    name STRING,
    category STRING,
    location STRING,
    price DECIMAL(10,2));

INSERT OVERWRITE product_catalog VALUES
    ('Nest Coffee', 'drink', 'Toronto', 15.5),
    ('Pepesi', 'drink', 'Toronto', 9.99),
    ('Hasimal', 'toy', 'Toronto', 5.9),
    ('Fire War', 'game', 'Toronto', 70.0),
    ('Final Fantasy', 'game', 'Montreal', 79.99),
    ('Lego Friends 15005', 'toy', 'Montreal', 12.99),
    ('Nesion Milk', 'drink', 'Montreal', 8.9);

1. Query without ORDER BY after the PARTITION BY column: the result is correct.

SELECT category, price,
    max(price) over(PARTITION BY category) as max_p,
    min(price) over(PARTITION BY category) as min_p,
    sum(price) over(PARTITION BY category) as sum_p,
    avg(price) over(PARTITION BY category) as avg_p,
    count(*) over(PARTITION BY category) as count_w
FROM product_catalog;

|| category || price || max_p || min_p || sum_p || avg_p || count_w ||
| drink | 8.90 | 15.50 | 8.90 | 34.39 | 11.46 | 3 |
| drink | 9.99 | 15.50 | 8.90 | 34.39 | 11.46 | 3 |
| drink | 15.50 | 15.50 | 8.90 | 34.39 | 11.46 | 3 |
| game | 79.99 | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
| game | 70.00 | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
| toy | 12.99 | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
| toy | 5.90 | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
7 rows selected (0.442 seconds)

2. Query with ORDER BY after the PARTITION BY column: the result is NOT correct.

SELECT category, price,
    max(price) over(PARTITION BY category ORDER BY price) as max_p,
    min(price) over(PARTITION BY category ORDER BY price) as min_p,
    sum(price) over(PARTITION BY category ORDER BY price) as sum_p,
    avg(price) over(PARTITION BY category ORDER BY price) as avg_p,
    count(*) over(PARTITION BY category ORDER BY price) as count_w
FROM product_catalog;

|| category || price || max_p || min_p || sum_p || avg_p || count_w ||
| drink | 8.90 | 8.90 | 8.90 | 8.90 | 8.90 | 1 |
| drink | 9.99 | 9.99 | 8.90 | 18.89 | 9.445000 | 2 |
| drink | 15.50 | 15.50 | 8.90 | 34.39 | 11.46 | 3 |
| game | 70.00 | 70.00 | 70.00 | 70.00 | 70.00 | 1 |
| game | 79.99 | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
| toy | 5.90 | 5.90 | 5.90 | 5.90 | 5.90 | 1 |
| toy | 12.99 | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
7 rows selected (0.436 seconds)

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
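For context on the two result sets in this report: when an OVER clause has ORDER BY but no explicit frame, the SQL-standard default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so every aggregate becomes a running aggregate within the partition; that is exactly the second result set, and spelling out a full frame restores whole-partition values. A minimal sketch of both behaviors, using Python's bundled sqlite3 (SQLite >= 3.25, whose window functions follow the same standard default) as a stand-in for Spark SQL:

```python
# Demonstrates the default window frame vs. an explicit full-partition frame,
# with a small subset of the product_catalog data from the report.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE product_catalog(name TEXT, category TEXT, location TEXT, price REAL);
    INSERT INTO product_catalog VALUES
        ('Nest Coffee', 'drink', 'Toronto', 15.5),
        ('Pepesi',      'drink', 'Toronto', 9.99),
        ('Nesion Milk', 'drink', 'Montreal', 8.9);
""")

# ORDER BY with no explicit frame: the frame defaults to
# RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, i.e. a running sum.
running = con.execute("""
    SELECT price, sum(price) OVER (PARTITION BY category ORDER BY price)
    FROM product_catalog ORDER BY price
""").fetchall()

# An explicit full frame restores the whole-partition aggregate on every row.
full = con.execute("""
    SELECT price, sum(price) OVER (PARTITION BY category ORDER BY price
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
    FROM product_catalog ORDER BY price
""").fetchall()
```

The `running` sums come out as 8.90, 18.89, 34.39 (the pattern in the "incorrect" table), while `full` shows 34.39 on every row (the pattern in the first table), so this is standard behavior rather than a Spark bug.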
[jira] [Resolved] (SPARK-32983) Spark SQL INTERSECT ALL does not keep all rows.
[ https://issues.apache.org/jira/browse/SPARK-32983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Du resolved SPARK-32983. - Resolution: Not A Problem

[~petertoth], you are correct. I am closing this issue as Not A Problem.

> Spark SQL INTERSECT ALL does not keep all rows.
> ---
>
> Key: SPARK-32983
> URL: https://issues.apache.org/jira/browse/SPARK-32983
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.6, 3.0.0, 3.0.1
> Reporter: Will Du
> Priority: Major
>
> Spark SQL INTERSECT ALL should keep all rows, but it actually removes
> duplicates, just like Spark SQL INTERSECT.
>
> with base as (
>   select 1 as id union all select 2 as id
> ), a as (
>   select 1 as id union all select 3 as id)
> select * from a INTERSECT ALL select * from base;
>
> with base as (
>   select 1 as id union all select 2 as id
> ), a as (
>   select 1 as id union all select 3 as id)
> select * from a INTERSECT select * from base;
>
> Both of the above queries return one record, which is 1.
> I think the 1st query should return
> 1
> 1

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
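The resolution follows from INTERSECT ALL's multiset semantics: each value appears in the output min(left count, right count) times, not left-count times. Since the value 1 occurs exactly once on each side of the queries in the ticket, a single row of 1 is the correct answer. A small illustrative sketch of that rule (plain Python, not Spark code; `intersect_all` is a hypothetical helper name):

```python
# Multiset (bag) intersection: each value is kept min(n_left, n_right) times,
# which is the SQL-standard INTERSECT ALL rule the ticket was closed under.
from collections import Counter

def intersect_all(left, right):
    """Return the multiset intersection of two sequences."""
    lc, rc = Counter(left), Counter(right)
    out = []
    for value, n in lc.items():
        out.extend([value] * min(n, rc.get(value, 0)))
    return out
```

With `a = [1, 3]` and `base = [1, 2]` this yields `[1]`, matching what Spark returns; two output rows of 1 would require the value 1 to appear twice on *both* sides.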
[jira] [Created] (SPARK-32983) Spark SQL INTERSECT ALL does not keep all rows.
Will Du created SPARK-32983:
---
Summary: Spark SQL INTERSECT ALL does not keep all rows.
Key: SPARK-32983
URL: https://issues.apache.org/jira/browse/SPARK-32983
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.0.1, 3.0.0, 2.4.6
Reporter: Will Du

Spark SQL INTERSECT ALL should keep all rows, but it actually removes duplicates, just like Spark SQL INTERSECT.

with base as (
  select 1 as id union all select 2 as id
), a as (
  select 1 as id union all select 3 as id)
select * from a INTERSECT ALL select * from base;

with base as (
  select 1 as id union all select 2 as id
), a as (
  select 1 as id union all select 3 as id)
select * from a INTERSECT select * from base;

Both of the above queries return one record, which is 1. I think the 1st query should return
1
1

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14423) Handle jar conflict issue when uploading to distributed cache
[ https://issues.apache.org/jira/browse/SPARK-14423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898256#comment-15898256 ] Junping Du commented on SPARK-14423: Thanks [~jerryshao] for reporting this issue. I think YARN should fix this problem as well: if the same jars are added to the distributed cache, it should detect that and fail fast with an informative message. YARN-5306 has already been filed to track this issue.

> Handle jar conflict issue when uploading to distributed cache
> -
>
> Key: SPARK-14423
> URL: https://issues.apache.org/jira/browse/SPARK-14423
> Project: Spark
> Issue Type: Bug
> Components: YARN
> Affects Versions: 2.0.0
> Reporter: Saisai Shao
> Assignee: Saisai Shao
> Fix For: 2.0.0
>
> Currently, with the introduction of assembly-free deployment of Spark, by
> default yarn#client will upload all the jars in the assembly to the HDFS staging
> folder. If the jars in the assembly and those specified with \--jars have the same
> name, this will introduce an exception while downloading these jars and make the
> application fail to run.
> Here is the exception when running an example with {{run-example}}:
> {noformat}
> 16/04/06 10:29:48 INFO Client: Application report for application_1459907402325_0004 (state: FAILED)
> 16/04/06 10:29:48 INFO Client:
> client token: N/A
> diagnostics: Application application_1459907402325_0004 failed 2 times due to AM Container for appattempt_1459907402325_0004_02 exited with exitCode: -1000
> For more detailed output, check application tracking page:http://hw12100.local:8088/proxy/application_1459907402325_0004/Then, click on links to logs of each attempt.
> Diagnostics: Resource > hdfs://localhost:8020/user/sshao/.sparkStaging/application_1459907402325_0004/avro-mapred-1.7.7-hadoop2.jar > changed on src filesystem (expected 1459909780508, was 1459909782590 > java.io.IOException: Resource > hdfs://localhost:8020/user/sshao/.sparkStaging/application_1459907402325_0004/avro-mapred-1.7.7-hadoop2.jar > changed on src filesystem (expected 1459909780508, was 1459909782590 > at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253) > at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:61) > at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359) > at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:357) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) > at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:356) > at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} > The problem is that this jar {{avro-mapred-1.7.7-hadoop2.jar}} both existed > in assembly and example folder. > We should handle this situation, since now spark example is failed to run > under yarn mode. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
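The fail-fast check suggested in the comment amounts to detecting file-name collisions among the jars before uploading them to the distributed cache, since localization keys on the file name. A hypothetical sketch of that detection step (the function name is illustrative, not an actual Spark or YARN API):

```python
# Flag jars whose base file names collide so a job can fail fast at submit
# time instead of dying during container localization, as in the report above.
from collections import Counter
from os.path import basename

def conflicting_jar_names(jar_paths):
    """Return the sorted base names that appear more than once in jar_paths."""
    counts = Counter(basename(p) for p in jar_paths)
    return sorted(name for name, n in counts.items() if n > 1)
```

For the case in the ticket, passing the assembly jars plus the example jars would report `avro-mapred-1.7.7-hadoop2.jar` as a conflict before any upload happens.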
[jira] [Commented] (SPARK-6707) Mesos Scheduler should allow the user to specify constraints based on slave attributes
[ https://issues.apache.org/jira/browse/SPARK-6707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15326823#comment-15326823 ] Fan Du commented on SPARK-6707: --- Thanks a lot! [~ankurcha]

> Mesos Scheduler should allow the user to specify constraints based on slave
> attributes
> --
>
> Key: SPARK-6707
> URL: https://issues.apache.org/jira/browse/SPARK-6707
> Project: Spark
> Issue Type: Improvement
> Components: Mesos, Scheduler
> Affects Versions: 1.3.0
> Reporter: Ankur Chauhan
> Assignee: Ankur Chauhan
> Labels: mesos, scheduler
> Fix For: 1.5.0
>
> Currently, the mesos scheduler only looks at the 'cpu' and 'mem' resources
> when trying to determine the usability of a resource offer from a mesos
> slave node. It may be preferable for the user to be able to ensure that the
> spark jobs are only started on a certain set of nodes (based on attributes).
> For example, if the user sets a property, let's say
> {code}spark.mesos.constraints{code} is set to
> {code}tachyon=true;us-east-1=false{code}, then the resource offers will be
> checked to see if they meet both these constraints and only then will be
> accepted to start new executors.

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6707) Mesos Scheduler should allow the user to specify constraints based on slave attributes
[ https://issues.apache.org/jira/browse/SPARK-6707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15326297#comment-15326297 ] Fan Du commented on SPARK-6707: --- Hi [~ankurcha], I have a question about the example {{tachyon=true;us-east-1=false}} in the ticket description. I assume Tachyon is the memory-centric distributed storage system that has since been renamed to Alluxio; Alluxio can run on Mesos too. So my question is: does {{tachyon=true;us-east-1=false}} indicate that an application prefers nodes that already run Tachyon? In other words, do applications have affinity for each other? I am asking because I am working on a similar feature that would give Mesos frameworks the ability to update attributes ([MESOS-5545|https://issues.apache.org/jira/browse/MESOS-5545#]); I think your comments could help justify that requirement.

> Mesos Scheduler should allow the user to specify constraints based on slave
> attributes
> --
>
> Key: SPARK-6707
> URL: https://issues.apache.org/jira/browse/SPARK-6707
> Project: Spark
> Issue Type: Improvement
> Components: Mesos, Scheduler
> Affects Versions: 1.3.0
> Reporter: Ankur Chauhan
> Assignee: Ankur Chauhan
> Labels: mesos, scheduler
> Fix For: 1.5.0
>
> Currently, the mesos scheduler only looks at the 'cpu' and 'mem' resources
> when trying to determine the usability of a resource offer from a mesos
> slave node. It may be preferable for the user to be able to ensure that the
> spark jobs are only started on a certain set of nodes (based on attributes).
> For example, if the user sets a property, let's say
> {code}spark.mesos.constraints{code} is set to
> {code}tachyon=true;us-east-1=false{code}, then the resource offers will be
> checked to see if they meet both these constraints and only then will be
> accepted to start new executors.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
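The constraint expression discussed in this thread ({{tachyon=true;us-east-1=false}}) is a semicolon-separated list of attribute=value pairs that a resource offer must satisfy. A hypothetical sketch of the parse-and-match step, assuming well-formed input (the function names are illustrative, not taken from Spark's Mesos scheduler):

```python
# Parse a "k1=v1;k2=v2" constraint string and test an offer's attributes
# against it, mirroring the matching described in SPARK-6707.
def parse_constraints(spec):
    """'k1=v1;k2=v2' -> {'k1': 'v1', 'k2': 'v2'}; empty spec -> {}."""
    if not spec:
        return {}
    pairs = (item.split("=", 1) for item in spec.split(";") if item)
    return {k.strip(): v.strip() for k, v in pairs}

def offer_matches(constraints, attributes):
    """An offer is usable only if every constraint matches its attributes."""
    return all(attributes.get(k) == v for k, v in constraints.items())
```

Under this reading, the constraint names nodes by their advertised attributes only; whether an attribute like `tachyon=true` tracks a co-located service (the affinity question raised above) depends on the operator keeping slave attributes up to date.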
[jira] [Updated] (SPARK-10179) LogisticRegressionWithSGD does not multiclass
[ https://issues.apache.org/jira/browse/SPARK-10179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiqiao Du updated SPARK-10179:
---
Description:
LogisticRegressionWithSGD does not support multi-class input, while LogisticRegressionWithLBFGS handles it. I found this line in the error log:
```
15/08/24 14:46:57 ERROR DataValidators: Classification labels should be 0 or 1. Found 50 invalid labels
```
LogisticRegressionWithLBFGS has a setNumClasses method, but LogisticRegressionWithSGD does not.

LogisticRegressionWithSGD does not multiclass
--
Key: SPARK-10179
URL: https://issues.apache.org/jira/browse/SPARK-10179
Project: Spark
Issue Type: Bug
Components: MLlib
Environment: Ubuntu 14.04, spark 1.4.1
Reporter: Shiqiao Du

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10179) LogisticRegressionWithSGD does not multiclass
[ https://issues.apache.org/jira/browse/SPARK-10179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14710429#comment-14710429 ] Shiqiao Du commented on SPARK-10179: Thank you. I got it. LogisticRegressionWithSGD does not multiclass -- Key: SPARK-10179 URL: https://issues.apache.org/jira/browse/SPARK-10179 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Shiqiao Du Priority: Minor LogisticRegressionWithSGD does not support Multi-Class input while LogisticRegressionWithLBFGS is OK. I found this line in error ``` 15/08/24 14:46:57 ERROR DataValidators: Classification labels should be 0 or 1. Found 50 invalid labels ``` In LogisticRegressionWithLBFGS there is setNumClasses method but LogisticRegressionWithSGD does not have this method. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10179) LogisticRegressionWithSGD does not multiclass
Shiqiao Du created SPARK-10179: -- Summary: LogisticRegressionWithSGD does not multiclass Key: SPARK-10179 URL: https://issues.apache.org/jira/browse/SPARK-10179 Project: Spark Issue Type: Bug Components: MLlib Environment: Ubuntu 14.04, spark 1.4.1 Reporter: Shiqiao Du LogisticRegressionWithSGD does not support Multi-Class input while LogisticRegressionWithLBFGS is OK. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
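The usual workaround when only a binary classifier is available, as with LogisticRegressionWithSGD in this issue, is one-vs-rest: train one binary model per class and predict the class whose model scores highest. A toy sketch of that reduction (plain Python with a made-up centroid scorer standing in for any binary learner; this is not the MLlib API):

```python
# One-vs-rest reduction: wrap a binary trainer so it handles k classes by
# fitting k binary models (class c vs. everything else) and taking argmax.
def one_vs_rest(train_binary, xs, ys):
    """Fit one binary scorer per class; predict the class with the top score."""
    classes = sorted(set(ys))
    models = {c: train_binary(xs, [1 if y == c else 0 for y in ys])
              for c in classes}
    def predict(x):
        return max(classes, key=lambda c: models[c](x))
    return predict

# Toy binary "trainer" for illustration only: score an input by its
# closeness to the mean of the positive examples.
def centroid_trainer(xs, labels):
    pos = [x for x, lab in zip(xs, labels) if lab == 1]
    mu = sum(pos) / len(pos)
    return lambda x: -abs(x - mu)

xs, ys = [0, 1, 10, 11, 20, 21], [0, 0, 1, 1, 2, 2]
predict = one_vs_rest(centroid_trainer, xs, ys)
```

The same wrapping idea applies to any binary learner that exposes a score; LogisticRegressionWithLBFGS avoids the need for it by supporting setNumClasses directly.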
[jira] [Commented] (SPARK-5393) Flood of util.RackResolver log messages after SPARK-1714
[ https://issues.apache.org/jira/browse/SPARK-5393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14363451#comment-14363451 ] Junping Du commented on SPARK-5393: --- Thanks [~sandyr] for confirming this! I will go ahead and commit that patch in YARN (after it is updated to address my comments).

Flood of util.RackResolver log messages after SPARK-1714
Key: SPARK-5393
URL: https://issues.apache.org/jira/browse/SPARK-5393
Project: Spark
Issue Type: Bug
Affects Versions: 1.3.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza
Priority: Critical
Fix For: 1.3.0

I thought I fixed this while working on the patch, but [~laserson] seems to have encountered it when running on master.

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org