[jira] [Updated] (SPARK-38592) Column name contains back tick `

2022-03-17 Thread Dennis Du (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Du updated SPARK-38592:
--
Description: 
I am trying to modify the data frame to ensure that column names contain no special characters.
{code:java}
df.columns.map { columnName =>
   df.col(surroundingBackTickedName(columnName)).as(normalizeName(columnName))
}
{code}
*surroundingBackTickedName()* encloses the column name in backticks.

However, col() keeps failing for column names that contain a back tick, because 
{*}parseAttributeName{*}() only accepts backticks that appear in pairs. I am 
wondering if there is a workaround.
{code:java}
org.apache.spark.sql.AnalysisException: Cannot resolve column name 
"`COLNAME`2`" among (COLID, COLNAME!4, COLNAME#6, COLNAME$7, COLNAME%8, 
COLNAME'25, COLNAME(11, COLNAME)12, COLNAME*10, COLNAME+16, COLNAME,26, 
COLNAME-13, COLNAME/30, COLNAME:24, COLNAME;23, COLNAME<27, COLNAME=15, 
COLNAME>29, COLNAME?31, COLNAME@5, COLNAME`2){code}
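
A likely workaround: Spark's quoted-identifier syntax treats a doubled back
tick inside a backquoted name as an escaped literal back tick, so the quoting
helper can double any embedded back ticks before wrapping the name. A minimal
sketch under that assumption (the body of *normalizeName* below is
illustrative only, not the reporter's actual helper):
{code:scala}
// Escape embedded back ticks by doubling them, then wrap the name.
// "COLNAME`2" becomes `COLNAME``2`, which parseAttributeName reads
// back as the literal name COLNAME`2.
def surroundingBackTickedName(columnName: String): String =
  "`" + columnName.replace("`", "``") + "`"

// Hypothetical normalizer: replace anything outside [A-Za-z0-9_].
def normalizeName(columnName: String): String =
  columnName.replaceAll("[^A-Za-z0-9_]", "_")

val newdf = df.select(
  df.columns.map { columnName =>
    df.col(surroundingBackTickedName(columnName)).as(normalizeName(columnName))
  }: _*
)
{code}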

  was:
I am trying to modify the data frame to ensure that column names contain no special characters.
{code:java}
val newdf = df.select(
df.columns.map { columnName =>
df.col(surroundingBackTickedName(columnName)).as(normalizeName(columnName))
}: _*
){code}
*surroundingBackTickedName()* encloses the column name in backticks.

However, col() keeps failing for column names that contain a back tick, because 
{*}parseAttributeName{*}() only accepts backticks that appear in pairs. I am 
wondering if there is a workaround.
{code:java}
org.apache.spark.sql.AnalysisException: Cannot resolve column name 
"`COLNAME`2`" among (COLID, COLNAME!4, COLNAME#6, COLNAME$7, COLNAME%8, 
COLNAME'25, COLNAME(11, COLNAME)12, COLNAME*10, COLNAME+16, COLNAME,26, 
COLNAME-13, COLNAME/30, COLNAME:24, COLNAME;23, COLNAME<27, COLNAME=15, 
COLNAME>29, COLNAME?31, COLNAME@5, COLNAME`2){code}


> Column name contains back tick `
> 
>
> Key: SPARK-38592
> URL: https://issues.apache.org/jira/browse/SPARK-38592
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Dennis Du
>Priority: Major
>  Labels: bulk-closed
>
> I am trying to modify the data frame to ensure that column names contain no 
> special characters.
> {code:java}
> df.columns.map { columnName =>
>df.col(surroundingBackTickedName(columnName)).as(normalizeName(columnName))
> }
> {code}
> *surroundingBackTickedName()* encloses the column name in backticks.
> However, col() keeps failing for column names that contain a back tick, because 
> {*}parseAttributeName{*}() only accepts backticks that appear in pairs. I am 
> wondering if there is a workaround.
> {code:java}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name 
> "`COLNAME`2`" among (COLID, COLNAME!4, COLNAME#6, COLNAME$7, COLNAME%8, 
> COLNAME'25, COLNAME(11, COLNAME)12, COLNAME*10, COLNAME+16, COLNAME,26, 
> COLNAME-13, COLNAME/30, COLNAME:24, COLNAME;23, COLNAME<27, COLNAME=15, 
> COLNAME>29, COLNAME?31, COLNAME@5, COLNAME`2){code}






[jira] [Updated] (SPARK-38592) Column name contains back tick `

2022-03-17 Thread Dennis Du (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Du updated SPARK-38592:
--
Description: 
I am trying to modify the data frame to ensure that column names contain no special characters.
{code:java}
val newdf = df.select(
df.columns.map { columnName =>
df.col(surroundingBackTickedName(columnName)).as(normalizeName(columnName))
}: _*
){code}
*surroundingBackTickedName()* encloses the column name in backticks.

However, col() keeps failing for column names that contain a back tick, because 
{*}parseAttributeName{*}() only accepts backticks that appear in pairs. I am 
wondering if there is a workaround.
{code:java}
org.apache.spark.sql.AnalysisException: Cannot resolve column name 
"`COLNAME`2`" among (COLID, COLNAME!4, COLNAME#6, COLNAME$7, COLNAME%8, 
COLNAME'25, COLNAME(11, COLNAME)12, COLNAME*10, COLNAME+16, COLNAME,26, 
COLNAME-13, COLNAME/30, COLNAME:24, COLNAME;23, COLNAME<27, COLNAME=15, 
COLNAME>29, COLNAME?31, COLNAME@5, COLNAME`2){code}

  was:
I am trying to modify the data frame to ensure that column names contain no special characters.


{code:java}
val newdf = df.select(
df.columns.map { columnName =>
df.col(surroundingBackTickedName(columnName)).as(normalizeName(columnName))
}: _*
){code}
col() keeps failing for column names that contain a back tick
{code:java}
org.apache.spark.sql.AnalysisException: Cannot resolve column name 
"`COLNAME`2`" among (COLID, COLNAME!4, COLNAME#6, COLNAME$7, COLNAME%8, 
COLNAME'25, COLNAME(11, COLNAME)12, COLNAME*10, COLNAME+16, COLNAME,26, 
COLNAME-13, COLNAME/30, COLNAME:24, COLNAME;23, COLNAME<27, COLNAME=15, 
COLNAME>29, COLNAME?31, COLNAME@5, COLNAME`2){code}


> Column name contains back tick `
> 
>
> Key: SPARK-38592
> URL: https://issues.apache.org/jira/browse/SPARK-38592
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Dennis Du
>Priority: Major
>  Labels: bulk-closed
>
> I am trying to modify the data frame to ensure that column names contain no 
> special characters.
> {code:java}
> val newdf = df.select(
> df.columns.map { columnName =>
> df.col(surroundingBackTickedName(columnName)).as(normalizeName(columnName))
> }: _*
> ){code}
> *surroundingBackTickedName()* encloses the column name in backticks.
> However, col() keeps failing for column names that contain a back tick, because 
> {*}parseAttributeName{*}() only accepts backticks that appear in pairs. I am 
> wondering if there is a workaround.
> {code:java}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name 
> "`COLNAME`2`" among (COLID, COLNAME!4, COLNAME#6, COLNAME$7, COLNAME%8, 
> COLNAME'25, COLNAME(11, COLNAME)12, COLNAME*10, COLNAME+16, COLNAME,26, 
> COLNAME-13, COLNAME/30, COLNAME:24, COLNAME;23, COLNAME<27, COLNAME=15, 
> COLNAME>29, COLNAME?31, COLNAME@5, COLNAME`2){code}






[jira] [Created] (SPARK-38592) Column name contains back tick `

2022-03-17 Thread Dennis Du (Jira)
Dennis Du created SPARK-38592:
-

 Summary: Column name contains back tick `
 Key: SPARK-38592
 URL: https://issues.apache.org/jira/browse/SPARK-38592
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.1
Reporter: Dennis Du


I am trying to modify the data frame to ensure that column names contain no special characters.


{code:java}
val newdf = df.select(
df.columns.map { columnName =>
df.col(surroundingBackTickedName(columnName)).as(normalizeName(columnName))
}: _*
){code}
col() keeps failing for column names that contain a back tick
{code:java}
org.apache.spark.sql.AnalysisException: Cannot resolve column name 
"`COLNAME`2`" among (COLID, COLNAME!4, COLNAME#6, COLNAME$7, COLNAME%8, 
COLNAME'25, COLNAME(11, COLNAME)12, COLNAME*10, COLNAME+16, COLNAME,26, 
COLNAME-13, COLNAME/30, COLNAME:24, COLNAME;23, COLNAME<27, COLNAME=15, 
COLNAME>29, COLNAME?31, COLNAME@5, COLNAME`2){code}






[jira] [Closed] (SPARK-33116) Spark SQL window function with order by cause result incorrect

2020-10-12 Thread Will Du (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Du closed SPARK-33116.
---

> Spark SQL window function with order by cause result incorrect
> --
>
> Key: SPARK-33116
> URL: https://issues.apache.org/jira/browse/SPARK-33116
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Will Du
>Priority: Major
>
> Prepare the data
> CREATE TABLE IF NOT EXISTS product_catalog (
> name STRING,category STRING,location STRING,price DECIMAL(10,2));
> INSERT OVERWRITE product_catalog VALUES 
> ('Nest Coffee', 'drink', 'Toronto', 15.5),
> ('Pepesi', 'drink', 'Toronto', 9.99),
> ('Hasimal', 'toy', 'Toronto', 5.9),
> ('Fire War', 'game', 'Toronto', 70.0),
> ('Final Fantasy', 'game', 'Montreal', 79.99),
> ('Lego Friends 15005', 'toy', 'Montreal', 12.99),
> ('Nesion Milk', 'drink', 'Montreal', 8.9);
> 1. Query without ORDER BY after PARTITION BY col,  the result is correct.
> SELECT
> category, price,
> max(price) over(PARTITION BY category) as max_p,
> min(price) over(PARTITION BY category) as min_p,
> sum(price) over(PARTITION BY category) as sum_p,
> avg(price) over(PARTITION BY category) as avg_p,
> count(*) over(PARTITION BY category) as count_w
> FROM
> product_catalog;
> || category || price || max_p || min_p || sum_p || avg_p || count_w ||
> | drink           | 8.90      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
> | drink           | 9.99      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
> | drink           | 15.50    | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
> | game          | 79.99    | 79.99    | 70.00 | 149.99 | 74.995000 | 2 |
> | game          | 70.00    | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
> | toy              | 12.99    | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
> | toy              | 5.90      | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
> 7 rows selected (0.442 seconds)
> 2. Query with ORDER BY after PARTITION BY col, the result is NOT correct. The 
> min result is OK. Why are the other results like that?
> SELECT
> category, price,
> max(price) over(PARTITION BY category ORDER BY price) as max_p,
> min(price) over(PARTITION BY category ORDER BY price) as min_p,
> sum(price) over(PARTITION BY category ORDER BY price) as sum_p,
> avg(price) over(PARTITION BY category ORDER BY price) as avg_p,
> count(*)   over(PARTITION BY category ORDER BY price) as count_w
> FROM
> product_catalog;
> || category || price || max_p || min_p || sum_p || avg_p || count_w ||
> | drink | 8.90   | 8.90   | 8.90   | 8.90| 8.90   | 1|
> | drink | 9.99   | 9.99   | 8.90   | 18.89   | 9.445000   | 2|
> | drink | 15.50  | 15.50  | 8.90   | 34.39   | 11.46  | 3|
> | game  | 70.00  | 70.00  | 70.00  | 70.00   | 70.00  | 1|
> | game  | 79.99  | 79.99  | 70.00  | 149.99  | 74.995000  | 2|
> | toy   | 5.90   | 5.90   | 5.90   | 5.90| 5.90   | 1|
> | toy   | 12.99  | 12.99  | 5.90   | 18.89   | 9.445000   | 2|
> 7 rows selected (0.436 seconds)
> Does this mean that we can only ORDER BY the columns in the PARTITION BY clause?
> I do not think there is such a limitation in standard SQL.
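
For context: when a window specification includes ORDER BY, the SQL-standard
default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so max_p,
sum_p, avg_p and count_w in the second query are running aggregates over the
ordered partition; min_p only looks correct because the minimum of every
ascending-ordered prefix equals the partition minimum. A minimal Spark sketch
(assuming a DataFrame df read from product_catalog) that keeps the ORDER BY
but makes the frame span the whole partition:
{code:scala}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Explicit frame: the whole partition, instead of the default
// RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
val fullPartition = Window
  .partitionBy("category")
  .orderBy("price")
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

val result = df.select(
  col("category"), col("price"),
  max("price").over(fullPartition).as("max_p"),
  min("price").over(fullPartition).as("min_p"),
  sum("price").over(fullPartition).as("sum_p"),
  avg("price").over(fullPartition).as("avg_p"),
  count(lit(1)).over(fullPartition).as("count_w"))
{code}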






[jira] [Comment Edited] (SPARK-33116) Spark SQL window function with order by cause result incorrect

2020-10-12 Thread Will Du (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212359#comment-17212359
 ] 

Will Du edited comment on SPARK-33116 at 10/12/20, 1:41 PM:


[~maropu], the statement you mentioned does not have PARTITION BY in the 
example. But I am able to reproduce the same behavior in SQL Server. I think 
this can be closed.


was (Author: willddy):
[~maropu], the statement you mentioned is comparing queries with and without 
PARTITION BY. But if you look at the queries I provided, both of them have a 
PARTITION BY clause; the only difference is whether the ORDER BY clause is 
added. I think the expected result should be the same for both queries except 
for the order of rows (by price).

> Spark SQL window function with order by cause result incorrect
> --
>
> Key: SPARK-33116
> URL: https://issues.apache.org/jira/browse/SPARK-33116
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Will Du
>Priority: Major
>
> Prepare the data
> CREATE TABLE IF NOT EXISTS product_catalog (
> name STRING,category STRING,location STRING,price DECIMAL(10,2));
> INSERT OVERWRITE product_catalog VALUES 
> ('Nest Coffee', 'drink', 'Toronto', 15.5),
> ('Pepesi', 'drink', 'Toronto', 9.99),
> ('Hasimal', 'toy', 'Toronto', 5.9),
> ('Fire War', 'game', 'Toronto', 70.0),
> ('Final Fantasy', 'game', 'Montreal', 79.99),
> ('Lego Friends 15005', 'toy', 'Montreal', 12.99),
> ('Nesion Milk', 'drink', 'Montreal', 8.9);
> 1. Query without ORDER BY after PARTITION BY col,  the result is correct.
> SELECT
> category, price,
> max(price) over(PARTITION BY category) as max_p,
> min(price) over(PARTITION BY category) as min_p,
> sum(price) over(PARTITION BY category) as sum_p,
> avg(price) over(PARTITION BY category) as avg_p,
> count(*) over(PARTITION BY category) as count_w
> FROM
> product_catalog;
> || category || price || max_p || min_p || sum_p || avg_p || count_w ||
> | drink           | 8.90      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
> | drink           | 9.99      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
> | drink           | 15.50    | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
> | game          | 79.99    | 79.99    | 70.00 | 149.99 | 74.995000 | 2 |
> | game          | 70.00    | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
> | toy              | 12.99    | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
> | toy              | 5.90      | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
> 7 rows selected (0.442 seconds)
> 2. Query with ORDER BY after PARTITION BY col, the result is NOT correct. The 
> min result is OK. Why are the other results like that?
> SELECT
> category, price,
> max(price) over(PARTITION BY category ORDER BY price) as max_p,
> min(price) over(PARTITION BY category ORDER BY price) as min_p,
> sum(price) over(PARTITION BY category ORDER BY price) as sum_p,
> avg(price) over(PARTITION BY category ORDER BY price) as avg_p,
> count(*)   over(PARTITION BY category ORDER BY price) as count_w
> FROM
> product_catalog;
> || category || price || max_p || min_p || sum_p || avg_p || count_w ||
> | drink | 8.90   | 8.90   | 8.90   | 8.90| 8.90   | 1|
> | drink | 9.99   | 9.99   | 8.90   | 18.89   | 9.445000   | 2|
> | drink | 15.50  | 15.50  | 8.90   | 34.39   | 11.46  | 3|
> | game  | 70.00  | 70.00  | 70.00  | 70.00   | 70.00  | 1|
> | game  | 79.99  | 79.99  | 70.00  | 149.99  | 74.995000  | 2|
> | toy   | 5.90   | 5.90   | 5.90   | 5.90| 5.90   | 1|
> | toy   | 12.99  | 12.99  | 5.90   | 18.89   | 9.445000   | 2|
> 7 rows selected (0.436 seconds)
> Does this mean that we can only ORDER BY the columns in the PARTITION BY clause?
> I do not think there is such a limitation in standard SQL.






[jira] [Comment Edited] (SPARK-33116) Spark SQL window function with order by cause result incorrect

2020-10-12 Thread Will Du (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212359#comment-17212359
 ] 

Will Du edited comment on SPARK-33116 at 10/12/20, 1:19 PM:


[~maropu], the statement you mentioned is comparing queries with and without 
PARTITION BY. But if you look at the queries I provided, both of them have a 
PARTITION BY clause; the only difference is whether the ORDER BY clause is 
added. I think the expected result should be the same for both queries except 
for the order of rows (by price).


was (Author: willddy):
[~maropu], the statement you mentioned is comparing queries with and without 
PARTITION BY. But if you look at the queries I provided, both of them have a 
PARTITION BY clause; the only difference is whether the ORDER BY clause is 
added. I think the expected result should be the same for both queries except 
for the order of rows (by price).

> Spark SQL window function with order by cause result incorrect
> --
>
> Key: SPARK-33116
> URL: https://issues.apache.org/jira/browse/SPARK-33116
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Will Du
>Priority: Major
>
> Prepare the data
> CREATE TABLE IF NOT EXISTS product_catalog (
> name STRING,category STRING,location STRING,price DECIMAL(10,2));
> INSERT OVERWRITE product_catalog VALUES 
> ('Nest Coffee', 'drink', 'Toronto', 15.5),
> ('Pepesi', 'drink', 'Toronto', 9.99),
> ('Hasimal', 'toy', 'Toronto', 5.9),
> ('Fire War', 'game', 'Toronto', 70.0),
> ('Final Fantasy', 'game', 'Montreal', 79.99),
> ('Lego Friends 15005', 'toy', 'Montreal', 12.99),
> ('Nesion Milk', 'drink', 'Montreal', 8.9);
> 1. Query without ORDER BY after PARTITION BY col,  the result is correct.
> SELECT
> category, price,
> max(price) over(PARTITION BY category) as max_p,
> min(price) over(PARTITION BY category) as min_p,
> sum(price) over(PARTITION BY category) as sum_p,
> avg(price) over(PARTITION BY category) as avg_p,
> count(*) over(PARTITION BY category) as count_w
> FROM
> product_catalog;
> || category || price || max_p || min_p || sum_p || avg_p || count_w ||
> | drink           | 8.90      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
> | drink           | 9.99      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
> | drink           | 15.50    | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
> | game          | 79.99    | 79.99    | 70.00 | 149.99 | 74.995000 | 2 |
> | game          | 70.00    | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
> | toy              | 12.99    | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
> | toy              | 5.90      | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
> 7 rows selected (0.442 seconds)
> 2. Query with ORDER BY after PARTITION BY col, the result is NOT correct. The 
> min result is OK. Why are the other results like that?
> SELECT
> category, price,
> max(price) over(PARTITION BY category ORDER BY price) as max_p,
> min(price) over(PARTITION BY category ORDER BY price) as min_p,
> sum(price) over(PARTITION BY category ORDER BY price) as sum_p,
> avg(price) over(PARTITION BY category ORDER BY price) as avg_p,
> count(*)   over(PARTITION BY category ORDER BY price) as count_w
> FROM
> product_catalog;
> || category || price || max_p || min_p || sum_p || avg_p || count_w ||
> | drink | 8.90   | 8.90   | 8.90   | 8.90| 8.90   | 1|
> | drink | 9.99   | 9.99   | 8.90   | 18.89   | 9.445000   | 2|
> | drink | 15.50  | 15.50  | 8.90   | 34.39   | 11.46  | 3|
> | game  | 70.00  | 70.00  | 70.00  | 70.00   | 70.00  | 1|
> | game  | 79.99  | 79.99  | 70.00  | 149.99  | 74.995000  | 2|
> | toy   | 5.90   | 5.90   | 5.90   | 5.90| 5.90   | 1|
> | toy   | 12.99  | 12.99  | 5.90   | 18.89   | 9.445000   | 2|
> 7 rows selected (0.436 seconds)
> Does this mean that we can only ORDER BY the columns in the PARTITION BY clause?
> I do not think there is such a limitation in standard SQL.






[jira] [Comment Edited] (SPARK-33116) Spark SQL window function with order by cause result incorrect

2020-10-12 Thread Will Du (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212359#comment-17212359
 ] 

Will Du edited comment on SPARK-33116 at 10/12/20, 1:18 PM:


[~maropu], the statement you mentioned is comparing queries with and without 
PARTITION BY. But if you look at the queries I provided, both of them have a 
PARTITION BY clause; the only difference is whether the ORDER BY clause is 
added. I think the expected result should be the same for both queries except 
for the order of rows (by price).


was (Author: willddy):
[~maropu], the statement is comparing a query with PARTITION BY and one 
without. But if you look at the queries I provided, both of them have a 
PARTITION BY clause; the only difference is whether the ORDER BY clause is 
added. I think the expected result should be the same for both queries except 
for the order of rows (by price).

> Spark SQL window function with order by cause result incorrect
> --
>
> Key: SPARK-33116
> URL: https://issues.apache.org/jira/browse/SPARK-33116
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Will Du
>Priority: Major
>
> Prepare the data
> CREATE TABLE IF NOT EXISTS product_catalog (
> name STRING,category STRING,location STRING,price DECIMAL(10,2));
> INSERT OVERWRITE product_catalog VALUES 
> ('Nest Coffee', 'drink', 'Toronto', 15.5),
> ('Pepesi', 'drink', 'Toronto', 9.99),
> ('Hasimal', 'toy', 'Toronto', 5.9),
> ('Fire War', 'game', 'Toronto', 70.0),
> ('Final Fantasy', 'game', 'Montreal', 79.99),
> ('Lego Friends 15005', 'toy', 'Montreal', 12.99),
> ('Nesion Milk', 'drink', 'Montreal', 8.9);
> 1. Query without ORDER BY after PARTITION BY col,  the result is correct.
> SELECT
> category, price,
> max(price) over(PARTITION BY category) as max_p,
> min(price) over(PARTITION BY category) as min_p,
> sum(price) over(PARTITION BY category) as sum_p,
> avg(price) over(PARTITION BY category) as avg_p,
> count(*) over(PARTITION BY category) as count_w
> FROM
> product_catalog;
> || category || price || max_p || min_p || sum_p || avg_p || count_w ||
> | drink           | 8.90      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
> | drink           | 9.99      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
> | drink           | 15.50    | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
> | game          | 79.99    | 79.99    | 70.00 | 149.99 | 74.995000 | 2 |
> | game          | 70.00    | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
> | toy              | 12.99    | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
> | toy              | 5.90      | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
> 7 rows selected (0.442 seconds)
> 2. Query with ORDER BY after PARTITION BY col, the result is NOT correct. The 
> min result is OK. Why are the other results like that?
> SELECT
> category, price,
> max(price) over(PARTITION BY category ORDER BY price) as max_p,
> min(price) over(PARTITION BY category ORDER BY price) as min_p,
> sum(price) over(PARTITION BY category ORDER BY price) as sum_p,
> avg(price) over(PARTITION BY category ORDER BY price) as avg_p,
> count(*)   over(PARTITION BY category ORDER BY price) as count_w
> FROM
> product_catalog;
> || category || price || max_p || min_p || sum_p || avg_p || count_w ||
> | drink | 8.90   | 8.90   | 8.90   | 8.90| 8.90   | 1|
> | drink | 9.99   | 9.99   | 8.90   | 18.89   | 9.445000   | 2|
> | drink | 15.50  | 15.50  | 8.90   | 34.39   | 11.46  | 3|
> | game  | 70.00  | 70.00  | 70.00  | 70.00   | 70.00  | 1|
> | game  | 79.99  | 79.99  | 70.00  | 149.99  | 74.995000  | 2|
> | toy   | 5.90   | 5.90   | 5.90   | 5.90| 5.90   | 1|
> | toy   | 12.99  | 12.99  | 5.90   | 18.89   | 9.445000   | 2|
> 7 rows selected (0.436 seconds)
> Does this mean that we can only ORDER BY the columns in the PARTITION BY clause?
> I do not think there is such a limitation in standard SQL.






[jira] [Commented] (SPARK-33116) Spark SQL window function with order by cause result incorrect

2020-10-12 Thread Will Du (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212359#comment-17212359
 ] 

Will Du commented on SPARK-33116:
-

[~maropu], the statement is comparing a query with PARTITION BY and one 
without. But if you look at the queries I provided, both of them have a 
PARTITION BY clause; the only difference is whether the ORDER BY clause is 
added. I think the expected result should be the same for both queries except 
for the order of rows (by price).

> Spark SQL window function with order by cause result incorrect
> --
>
> Key: SPARK-33116
> URL: https://issues.apache.org/jira/browse/SPARK-33116
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Will Du
>Priority: Major
>
> Prepare the data
> CREATE TABLE IF NOT EXISTS product_catalog (
> name STRING,category STRING,location STRING,price DECIMAL(10,2));
> INSERT OVERWRITE product_catalog VALUES 
> ('Nest Coffee', 'drink', 'Toronto', 15.5),
> ('Pepesi', 'drink', 'Toronto', 9.99),
> ('Hasimal', 'toy', 'Toronto', 5.9),
> ('Fire War', 'game', 'Toronto', 70.0),
> ('Final Fantasy', 'game', 'Montreal', 79.99),
> ('Lego Friends 15005', 'toy', 'Montreal', 12.99),
> ('Nesion Milk', 'drink', 'Montreal', 8.9);
> 1. Query without ORDER BY after PARTITION BY col,  the result is correct.
> SELECT
> category, price,
> max(price) over(PARTITION BY category) as max_p,
> min(price) over(PARTITION BY category) as min_p,
> sum(price) over(PARTITION BY category) as sum_p,
> avg(price) over(PARTITION BY category) as avg_p,
> count(*) over(PARTITION BY category) as count_w
> FROM
> product_catalog;
> || category || price || max_p || min_p || sum_p || avg_p || count_w ||
> | drink           | 8.90      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
> | drink           | 9.99      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
> | drink           | 15.50    | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
> | game          | 79.99    | 79.99    | 70.00 | 149.99 | 74.995000 | 2 |
> | game          | 70.00    | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
> | toy              | 12.99    | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
> | toy              | 5.90      | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
> 7 rows selected (0.442 seconds)
> 2. Query with ORDER BY after PARTITION BY col, the result is NOT correct. The 
> min result is OK. Why are the other results like that?
> SELECT
> category, price,
> max(price) over(PARTITION BY category ORDER BY price) as max_p,
> min(price) over(PARTITION BY category ORDER BY price) as min_p,
> sum(price) over(PARTITION BY category ORDER BY price) as sum_p,
> avg(price) over(PARTITION BY category ORDER BY price) as avg_p,
> count(*)   over(PARTITION BY category ORDER BY price) as count_w
> FROM
> product_catalog;
> || category || price || max_p || min_p || sum_p || avg_p || count_w ||
> | drink | 8.90   | 8.90   | 8.90   | 8.90| 8.90   | 1|
> | drink | 9.99   | 9.99   | 8.90   | 18.89   | 9.445000   | 2|
> | drink | 15.50  | 15.50  | 8.90   | 34.39   | 11.46  | 3|
> | game  | 70.00  | 70.00  | 70.00  | 70.00   | 70.00  | 1|
> | game  | 79.99  | 79.99  | 70.00  | 149.99  | 74.995000  | 2|
> | toy   | 5.90   | 5.90   | 5.90   | 5.90| 5.90   | 1|
> | toy   | 12.99  | 12.99  | 5.90   | 18.89   | 9.445000   | 2|
> 7 rows selected (0.436 seconds)
> Does this mean that we can only ORDER BY the columns in the PARTITION BY clause?
> I do not think there is such a limitation in standard SQL.






[jira] [Updated] (SPARK-33116) Spark SQL window function with order by cause result incorrect

2020-10-11 Thread Will Du (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Du updated SPARK-33116:

Description: 
Prepare the data

CREATE TABLE IF NOT EXISTS product_catalog (
name STRING,category STRING,location STRING,price DECIMAL(10,2));

INSERT OVERWRITE product_catalog VALUES 
('Nest Coffee', 'drink', 'Toronto', 15.5),
('Pepesi', 'drink', 'Toronto', 9.99),
('Hasimal', 'toy', 'Toronto', 5.9),
('Fire War', 'game', 'Toronto', 70.0),
('Final Fantasy', 'game', 'Montreal', 79.99),
('Lego Friends 15005', 'toy', 'Montreal', 12.99),
('Nesion Milk', 'drink', 'Montreal', 8.9);

1. Query without ORDER BY after PARTITION BY col,  the result is correct.

SELECT
category, price,
max(price) over(PARTITION BY category) as max_p,
min(price) over(PARTITION BY category) as min_p,
sum(price) over(PARTITION BY category) as sum_p,
avg(price) over(PARTITION BY category) as avg_p,
count(*) over(PARTITION BY category) as count_w
FROM
product_catalog;

|| category || price || max_p || min_p || sum_p || avg_p || count_w ||
| drink           | 8.90      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
| drink           | 9.99      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
| drink           | 15.50    | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
| game          | 79.99    | 79.99    | 70.00 | 149.99 | 74.995000 | 2 |
| game          | 70.00    | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
| toy              | 12.99    | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
| toy              | 5.90      | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
7 rows selected (0.442 seconds)

2. Query with ORDER BY after PARTITION BY col, the result is NOT correct. The 
min result is OK. Why are the other results like that?

SELECT
category, price,
max(price) over(PARTITION BY category ORDER BY price) as max_p,
min(price) over(PARTITION BY category ORDER BY price) as min_p,
sum(price) over(PARTITION BY category ORDER BY price) as sum_p,
avg(price) over(PARTITION BY category ORDER BY price) as avg_p,
count(*)   over(PARTITION BY category ORDER BY price) as count_w
FROM
product_catalog;

|| category || price || max_p || min_p || sum_p || avg_p || count_w ||
| drink | 8.90   | 8.90   | 8.90   | 8.90| 8.90   | 1|
| drink | 9.99   | 9.99   | 8.90   | 18.89   | 9.445000   | 2|
| drink | 15.50  | 15.50  | 8.90   | 34.39   | 11.46  | 3|
| game  | 70.00  | 70.00  | 70.00  | 70.00   | 70.00  | 1|
| game  | 79.99  | 79.99  | 70.00  | 149.99  | 74.995000  | 2|
| toy   | 5.90   | 5.90   | 5.90   | 5.90| 5.90   | 1|
| toy   | 12.99  | 12.99  | 5.90   | 18.89   | 9.445000   | 2|
7 rows selected (0.436 seconds)

Does this mean that we can only ORDER BY the columns in the PARTITION BY clause?
I do not think there is such a limitation in standard SQL.


  was:
Prepare the data

CREATE TABLE IF NOT EXISTS product_catalog (
name STRING,category STRING,location STRING,price DECIMAL(10,2));

INSERT OVERWRITE product_catalog VALUES 
('Nest Coffee', 'drink', 'Toronto', 15.5),
('Pepesi', 'drink', 'Toronto', 9.99),
('Hasimal', 'toy', 'Toronto', 5.9),
('Fire War', 'game', 'Toronto', 70.0),
('Final Fantasy', 'game', 'Montreal', 79.99),
('Lego Friends 15005', 'toy', 'Montreal', 12.99),
('Nesion Milk', 'drink', 'Montreal', 8.9);

1. Query without ORDER BY after PARTITION BY col,  the result is correct.

SELECT
category, price,
max(price) over(PARTITION BY category) as max_p,
min(price) over(PARTITION BY category) as min_p,
sum(price) over(PARTITION BY category) as sum_p,
avg(price) over(PARTITION BY category) as avg_p,
count(*) over(PARTITION BY category) as count_w
FROM
product_catalog;

|| category || price || max_p || min_p || sum_p || avg_p || count_w ||
| drink           | 8.90      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
| drink           | 9.99      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
| drink           | 15.50    | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
| game          | 79.99    | 79.99    | 70.00 | 149.99 | 74.995000 | 2 |
| game          | 70.00    | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
| toy              | 12.99    | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
| toy              | 5.90      | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
7 rows selected (0.442 seconds)

2. Query with ORDER BY after PARTITION BY col, the result is NOT correct. The 
min result is OK. Why are the other results like that?

SELECT
category, price,
max(price) over(PARTITION BY category ORDER BY price) as max_p,
min(price) over(PARTITION BY category ORDER BY price) as min_p,
sum(price) over(PARTITION BY category ORDER BY price) as sum_p,
avg(price) over(PARTITION BY category ORDER BY price) as avg_p,
count(*)   over(PARTITION BY category ORDER BY price) as count_w
FROM
product_catalog;

|| category    || price      || max_p  || min_p    || sum_p 

[jira] [Updated] (SPARK-33116) Spark SQL window function with order by cause result incorrect

2020-10-11 Thread Will Du (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Du updated SPARK-33116:

Description: 
Prepare the data

CREATE TABLE IF NOT EXISTS product_catalog (
name STRING,category STRING,location STRING,price DECIMAL(10,2));

INSERT OVERWRITE product_catalog VALUES 
('Nest Coffee', 'drink', 'Toronto', 15.5),
('Pepesi', 'drink', 'Toronto', 9.99),
('Hasimal', 'toy', 'Toronto', 5.9),
('Fire War', 'game', 'Toronto', 70.0),
('Final Fantasy', 'game', 'Montreal', 79.99),
('Lego Friends 15005', 'toy', 'Montreal', 12.99),
('Nesion Milk', 'drink', 'Montreal', 8.9);

1. Query without ORDER BY after PARTITION BY col,  the result is correct.

SELECT
category, price,
max(price) over(PARTITION BY category) as max_p,
min(price) over(PARTITION BY category) as min_p,
sum(price) over(PARTITION BY category) as sum_p,
avg(price) over(PARTITION BY category) as avg_p,
count(*) over(PARTITION BY category) as count_w
FROM
product_catalog;

|| category || price || max_p || min_p || sum_p || avg_p || count_w ||
| drink           | 8.90      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
| drink           | 9.99      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
| drink           | 15.50    | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
| game          | 79.99    | 79.99    | 70.00 | 149.99 | 74.995000 | 2 |
| game          | 70.00    | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
| toy              | 12.99    | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
| toy              | 5.90      | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
7 rows selected (0.442 seconds)

2. Query with ORDER BY after PARTITION BY col, the result is NOT correct. The 
min result is OK. Why are the other results like that?

SELECT
category, price,
max(price) over(PARTITION BY category ORDER BY price) as max_p,
min(price) over(PARTITION BY category ORDER BY price) as min_p,
sum(price) over(PARTITION BY category ORDER BY price) as sum_p,
avg(price) over(PARTITION BY category ORDER BY price) as avg_p,
count(*)   over(PARTITION BY category ORDER BY price) as count_w
FROM
product_catalog;

|| category || price || max_p || min_p || sum_p || avg_p || count_w ||
| drink | 8.90   | 8.90   | 8.90   | 8.90| 8.90   | 1|
| drink | 9.99   | 9.99   | 8.90   | 18.89   | 9.445000   | 2|
| drink | 15.50  | 15.50  | 8.90   | 34.39   | 11.46  | 3|
| game  | 70.00  | 70.00  | 70.00  | 70.00   | 70.00  | 1|
| game  | 79.99  | 79.99  | 70.00  | 149.99  | 74.995000  | 2|
| toy   | 5.90   | 5.90   | 5.90   | 5.90| 5.90   | 1|
| toy   | 12.99  | 12.99  | 5.90   | 18.89   | 9.445000   | 2|
7 rows selected (0.436 seconds)

Does this mean that we can only ORDER BY the columns in the PARTITION BY clause?
I do not think there is such a limitation in standard SQL.


  was:
Prepare the data

CREATE TABLE IF NOT EXISTS product_catalog (
name STRING,category STRING,location STRING,price DECIMAL(10,2));

INSERT OVERWRITE product_catalog VALUES 
('Nest Coffee', 'drink', 'Toronto', 15.5),
('Pepesi', 'drink', 'Toronto', 9.99),
('Hasimal', 'toy', 'Toronto', 5.9),
('Fire War', 'game', 'Toronto', 70.0),
('Final Fantasy', 'game', 'Montreal', 79.99),
('Lego Friends 15005', 'toy', 'Montreal', 12.99),
('Nesion Milk', 'drink', 'Montreal', 8.9);

1. Query without ORDER BY after PARTITION BY col,  the result is correct.

SELECT
category, price,
max(price) over(PARTITION BY category) as max_p,
min(price) over(PARTITION BY category) as min_p,
sum(price) over(PARTITION BY category) as sum_p,
avg(price) over(PARTITION BY category) as avg_p,
count(*) over(PARTITION BY category) as count_w
FROM
product_catalog;

|| category || price || max_p || min_p || sum_p || avg_p || count_w ||
| drink           | 8.90      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
| drink           | 9.99      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
| drink           | 15.50    | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
| game          | 79.99    | 79.99    | 70.00 | 149.99 | 74.995000 | 2 |
| game          | 70.00    | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
| toy              | 12.99    | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
| toy              | 5.90      | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
7 rows selected (0.442 seconds)

2. Query with ORDER BY after PARTITION BY col, the result is NOT correct. The 
min result is OK. Why are the other results like that?

SELECT
category, price,
max(price) over(PARTITION BY category ORDER BY price) as max_p,
min(price) over(PARTITION BY category ORDER BY price) as min_p,
sum(price) over(PARTITION BY category ORDER BY price) as sum_p,
avg(price) over(PARTITION BY category ORDER BY price) as avg_p,
count(*)   over(PARTITION BY category ORDER BY price) as count_w
FROM
product_catalog;

|| category    || price      || max_p  || min_p    || 

[jira] [Updated] (SPARK-33116) Spark SQL window function with order by cause result incorrect

2020-10-11 Thread Will Du (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Du updated SPARK-33116:

Description: 
Prepare the data

CREATE TABLE IF NOT EXISTS product_catalog (
name STRING,category STRING,location STRING,price DECIMAL(10,2));

INSERT OVERWRITE product_catalog VALUES 
('Nest Coffee', 'drink', 'Toronto', 15.5),
('Pepesi', 'drink', 'Toronto', 9.99),
('Hasimal', 'toy', 'Toronto', 5.9),
('Fire War', 'game', 'Toronto', 70.0),
('Final Fantasy', 'game', 'Montreal', 79.99),
('Lego Friends 15005', 'toy', 'Montreal', 12.99),
('Nesion Milk', 'drink', 'Montreal', 8.9);

1. Query without ORDER BY after PARTITION BY col,  the result is correct.

SELECT
category, price,
max(price) over(PARTITION BY category) as max_p,
min(price) over(PARTITION BY category) as min_p,
sum(price) over(PARTITION BY category) as sum_p,
avg(price) over(PARTITION BY category) as avg_p,
count(*) over(PARTITION BY category) as count_w
FROM
product_catalog;

|| category || price || max_p || min_p || sum_p || avg_p || count_w ||
| drink           | 8.90      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
| drink           | 9.99      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
| drink           | 15.50    | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
| game          | 79.99    | 79.99    | 70.00 | 149.99 | 74.995000 | 2 |
| game          | 70.00    | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
| toy              | 12.99    | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
| toy              | 5.90      | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
7 rows selected (0.442 seconds)

2. Query with ORDER BY after PARTITION BY col, the result is NOT correct. The 
min result is OK. Why are the other results like that?

SELECT
category, price,
max(price) over(PARTITION BY category ORDER BY price) as max_p,
min(price) over(PARTITION BY category ORDER BY price) as min_p,
sum(price) over(PARTITION BY category ORDER BY price) as sum_p,
avg(price) over(PARTITION BY category ORDER BY price) as avg_p,
count(*)   over(PARTITION BY category ORDER BY price) as count_w
FROM
product_catalog;

|| category || price || max_p || min_p || sum_p || avg_p || count_w ||
| drink | 8.90   | 8.90   | 8.90   | 8.90| 8.90   | 1|
| drink | 9.99   | 9.99   | 8.90   | 18.89   | 9.445000   | 2|
| drink | 15.50  | 15.50  | 8.90   | 34.39   | 11.46  | 3|
| game  | 70.00  | 70.00  | 70.00  | 70.00   | 70.00  | 1|
| game  | 79.99  | 79.99  | 70.00  | 149.99  | 74.995000  | 2|
| toy   | 5.90   | 5.90   | 5.90   | 5.90| 5.90   | 1|
| toy   | 12.99  | 12.99  | 5.90   | 18.89   | 9.445000   | 2|
7 rows selected (0.436 seconds)

  was:
Prepare the data

CREATE TABLE IF NOT EXISTS product_catalog (
name STRING,category STRING,location STRING,price DECIMAL(10,2));

INSERT OVERWRITE product_catalog VALUES 
('Nest Coffee', 'drink', 'Toronto', 15.5),
('Pepesi', 'drink', 'Toronto', 9.99),
('Hasimal', 'toy', 'Toronto', 5.9),
('Fire War', 'game', 'Toronto', 70.0),
('Final Fantasy', 'game', 'Montreal', 79.99),
('Lego Friends 15005', 'toy', 'Montreal', 12.99),
('Nesion Milk', 'drink', 'Montreal', 8.9);

1. Query without ORDER BY after PARTITION BY col,  the result is correct.

SELECT
category, price,
max(price) over(PARTITION BY category) as max_p,
min(price) over(PARTITION BY category) as min_p,
sum(price) over(PARTITION BY category) as sum_p,
avg(price) over(PARTITION BY category) as avg_p,
count(*) over(PARTITION BY category) as count_w
FROM
product_catalog;

|| category || price || max_p || min_p || sum_p || avg_p || count_w ||
| drink           | 8.90      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
| drink           | 9.99      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
| drink           | 15.50    | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
| game          | 79.99    | 79.99    | 70.00 | 149.99 | 74.995000 | 2 |
| game          | 70.00    | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
| toy              | 12.99    | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
| toy              | 5.90      | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
7 rows selected (0.442 seconds)

2. Query with ORDER BY after PARTITION BY col, the result is NOT correct.
SELECT
category, price,
max(price) over(PARTITION BY category ORDER BY price) as max_p,
min(price) over(PARTITION BY category ORDER BY price) as min_p,
sum(price) over(PARTITION BY category ORDER BY price) as sum_p,
avg(price) over(PARTITION BY category ORDER BY price) as avg_p,
count(*)   over(PARTITION BY category ORDER BY price) as count_w
FROM
product_catalog;

|| category || price || max_p || min_p || sum_p || avg_p || count_w ||
| drink | 8.90   | 8.90   | 8.90   | 8.90| 8.90   | 1|
| drink | 9.99   | 9.99   | 8.90   | 18.89   | 9.445000   | 2|

[jira] [Created] (SPARK-33116) Spark SQL window function with order by cause result incorrect

2020-10-11 Thread Will Du (Jira)
Will Du created SPARK-33116:
---

 Summary: Spark SQL window function with order by cause result 
incorrect
 Key: SPARK-33116
 URL: https://issues.apache.org/jira/browse/SPARK-33116
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.1
Reporter: Will Du


Prepare the data

CREATE TABLE IF NOT EXISTS product_catalog (
name STRING,category STRING,location STRING,price DECIMAL(10,2));

INSERT OVERWRITE product_catalog VALUES 
('Nest Coffee', 'drink', 'Toronto', 15.5),
('Pepesi', 'drink', 'Toronto', 9.99),
('Hasimal', 'toy', 'Toronto', 5.9),
('Fire War', 'game', 'Toronto', 70.0),
('Final Fantasy', 'game', 'Montreal', 79.99),
('Lego Friends 15005', 'toy', 'Montreal', 12.99),
('Nesion Milk', 'drink', 'Montreal', 8.9);

1. Query without ORDER BY after PARTITION BY col,  the result is correct.

SELECT
category, price,
max(price) over(PARTITION BY category) as max_p,
min(price) over(PARTITION BY category) as min_p,
sum(price) over(PARTITION BY category) as sum_p,
avg(price) over(PARTITION BY category) as avg_p,
count(*) over(PARTITION BY category) as count_w
FROM
product_catalog;

|| category || price || max_p || min_p || sum_p || avg_p || count_w ||
| drink           | 8.90      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
| drink           | 9.99      | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
| drink           | 15.50    | 15.50    | 8.90 | 34.39 | 11.46 | 3 |
| game          | 79.99    | 79.99    | 70.00 | 149.99 | 74.995000 | 2 |
| game          | 70.00    | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
| toy              | 12.99    | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
| toy              | 5.90      | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
7 rows selected (0.442 seconds)

2. Query with ORDER BY after PARTITION BY col, the result is NOT correct.
SELECT
category, price,
max(price) over(PARTITION BY category ORDER BY price) as max_p,
min(price) over(PARTITION BY category ORDER BY price) as min_p,
sum(price) over(PARTITION BY category ORDER BY price) as sum_p,
avg(price) over(PARTITION BY category ORDER BY price) as avg_p,
count(*)   over(PARTITION BY category ORDER BY price) as count_w
FROM
product_catalog;

|| category || price || max_p || min_p || sum_p || avg_p || count_w ||
| drink | 8.90   | 8.90   | 8.90   | 8.90| 8.90   | 1|
| drink | 9.99   | 9.99   | 8.90   | 18.89   | 9.445000   | 2|
| drink | 15.50  | 15.50  | 8.90   | 34.39   | 11.46  | 3|
| game  | 70.00  | 70.00  | 70.00  | 70.00   | 70.00  | 1|
| game  | 79.99  | 79.99  | 70.00  | 149.99  | 74.995000  | 2|
| toy   | 5.90   | 5.90   | 5.90   | 5.90| 5.90   | 1|
| toy   | 12.99  | 12.99  | 5.90   | 18.89   | 9.445000   | 2|
7 rows selected (0.436 seconds)






[jira] [Resolved] (SPARK-32983) Spark SQL INTERSECT ALL does not keep all rows.

2020-09-24 Thread Will Du (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Du resolved SPARK-32983.
-
Resolution: Not A Problem

[~petertoth], you are correct. I am closing this issue as not a problem.

> Spark SQL INTERSECT ALL does not keep all rows.
> ---
>
> Key: SPARK-32983
> URL: https://issues.apache.org/jira/browse/SPARK-32983
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0, 3.0.1
>Reporter: Will Du
>Priority: Major
>
> Spark SQL INTERSECT ALL should keep all rows. But it actually removes 
> duplicates, just like Spark SQL INTERSECT.
> with base as (
> select 1 as id union all select 2 as id
> ), a as (
> select 1 as id union all select 3 as id)
> select * from a INTERSECT ALL select * from base;
>  
> with base as (
> select 1 as id union all select 2 as id
> ), a as (
> select 1 as id union all select 3 as id)
> select * from a INTERSECT select * from base;
>  
> Both of the above queries return one record, which is 1.
> I think the 1st query should return
> 1
> 1
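
For context: INTERSECT ALL follows multiset semantics, so a value that occurs
m times on the left and n times on the right appears min(m, n) times in the
result. Both inputs above contain the value 1 exactly once, which is why a
single row is the expected answer. A minimal sketch of that behavior (assuming
the usual spark session):
{code:scala}
// "a" contains two 1s, "base" contains one 1:
// INTERSECT ALL keeps min(2, 1) = 1 copy of the value 1.
val base = spark.sql("SELECT 1 AS id UNION ALL SELECT 2 AS id")
val a = spark.sql(
  "SELECT 1 AS id UNION ALL SELECT 1 AS id UNION ALL SELECT 3 AS id")

a.intersectAll(base).show()  // one row: 1
{code}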






[jira] [Created] (SPARK-32983) Spark SQL INTERSECT ALL does not keep all rows.

2020-09-23 Thread Will Du (Jira)
Will Du created SPARK-32983:
---

 Summary: Spark SQL INTERSECT ALL does not keep all rows.
 Key: SPARK-32983
 URL: https://issues.apache.org/jira/browse/SPARK-32983
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.1, 3.0.0, 2.4.6
Reporter: Will Du


Spark SQL INTERSECT ALL should keep all rows. But it actually removes 
duplicates, just like Spark SQL INTERSECT.

with base as (
select 1 as id union all select 2 as id
), a as (
select 1 as id union all select 3 as id)
select * from a INTERSECT ALL select * from base;

with base as (
select 1 as id union all select 2 as id
), a as (
select 1 as id union all select 3 as id)
select * from a INTERSECT select * from base;

Both of the above queries return one record, which is 1.

I think the 1st query should return
1
1






[jira] [Commented] (SPARK-14423) Handle jar conflict issue when uploading to distributed cache

2017-03-06 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15898256#comment-15898256
 ] 

Junping Du commented on SPARK-14423:


Thanks [~jerryshao] for reporting this issue. I think YARN should fix this 
problem as well. If the same jars are added to the distributed cache, it should 
detect this and fail fast with an informative message; YARN-5306 has already 
been filed to track this issue.

> Handle jar conflict issue when uploading to distributed cache
> -
>
> Key: SPARK-14423
> URL: https://issues.apache.org/jira/browse/SPARK-14423
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
> Fix For: 2.0.0
>
>
> Currently, with the introduction of assembly-free deployment of Spark, by 
> default yarn#client uploads all the jars in the assembly to the HDFS staging 
> folder. If a jar in the assembly and one specified with \--jars have the same 
> name, this causes an exception while downloading these jars and makes the 
> application fail to run.
> Here is the exception when running example with {{run-example}}:
> {noformat}
> 16/04/06 10:29:48 INFO Client: Application report for 
> application_1459907402325_0004 (state: FAILED)
> 16/04/06 10:29:48 INFO Client:
>client token: N/A
>diagnostics: Application application_1459907402325_0004 failed 2 times 
> due to AM Container for appattempt_1459907402325_0004_02 exited with  
> exitCode: -1000
> For more detailed output, check application tracking 
> page:http://hw12100.local:8088/proxy/application_1459907402325_0004/Then, 
> click on links to logs of each attempt.
> Diagnostics: Resource 
> hdfs://localhost:8020/user/sshao/.sparkStaging/application_1459907402325_0004/avro-mapred-1.7.7-hadoop2.jar
>  changed on src filesystem (expected 1459909780508, was 1459909782590
> java.io.IOException: Resource 
> hdfs://localhost:8020/user/sshao/.sparkStaging/application_1459907402325_0004/avro-mapred-1.7.7-hadoop2.jar
>  changed on src filesystem (expected 1459909780508, was 1459909782590
>   at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
>   at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:61)
>   at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
>   at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:357)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:356)
>   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The problem is that the jar {{avro-mapred-1.7.7-hadoop2.jar}} exists in both 
> the assembly and the example folder.
> We should handle this situation, since the Spark examples currently fail to 
> run in YARN mode.






[jira] [Commented] (SPARK-6707) Mesos Scheduler should allow the user to specify constraints based on slave attributes

2016-06-12 Thread Fan Du (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326823#comment-15326823
 ] 

Fan Du commented on SPARK-6707:
---

Thanks a lot, [~ankurcha]!

> Mesos Scheduler should allow the user to specify constraints based on slave 
> attributes
> --
>
> Key: SPARK-6707
> URL: https://issues.apache.org/jira/browse/SPARK-6707
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Scheduler
>Affects Versions: 1.3.0
>Reporter: Ankur Chauhan
>Assignee: Ankur Chauhan
>  Labels: mesos, scheduler
> Fix For: 1.5.0
>
>
> Currently, the Mesos scheduler only looks at the 'cpu' and 'mem' resources 
> when trying to determine the usability of a resource offer from a Mesos 
> slave node. It may be preferable for the user to be able to ensure that 
> Spark jobs are only started on a certain set of nodes (based on attributes). 
> For example, if the user sets a property, let's say 
> {code}spark.mesos.constraints{code}, to 
> {code}tachyon=true;us-east-1=false{code}, then resource offers will be 
> checked to see if they meet both of these constraints, and only then will 
> they be accepted to start new executors.
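
A minimal sketch of how such a property might be set (the key=value;... string
below follows the syntax proposed in this ticket, which is not necessarily the
syntax that finally shipped):
{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

// Offers would be accepted only from slaves whose attributes
// satisfy every listed constraint.
val conf = new SparkConf()
  .setAppName("constrained-job")
  .set("spark.mesos.constraints", "tachyon=true;us-east-1=false")

val sc = new SparkContext(conf)
{code}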






[jira] [Commented] (SPARK-6707) Mesos Scheduler should allow the user to specify constraints based on slave attributes

2016-06-12 Thread Fan Du (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326297#comment-15326297
 ] 

Fan Du commented on SPARK-6707:
---

Hi [~ankurcha]

I have one question about the example {{tachyon=true;us-east-1=false}} in the 
ticket message. I assume Tachyon is the memory-centric distributed storage 
system that has since been renamed to Alluxio, and Alluxio can run on Mesos 
too. So back to my question: does {{tachyon=true;us-east-1=false}} indicate 
that the application prefers nodes that already run Tachyon? In other words, 
do applications have affinity with each other?

I'm asking because I'm working on a similar feature that would give Mesos 
frameworks the ability to update attributes 
([MESOS-5545|https://issues.apache.org/jira/browse/MESOS-5545#]); I think your 
comments could help justify such a requirement.


> Mesos Scheduler should allow the user to specify constraints based on slave 
> attributes
> --
>
> Key: SPARK-6707
> URL: https://issues.apache.org/jira/browse/SPARK-6707
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Scheduler
>Affects Versions: 1.3.0
>Reporter: Ankur Chauhan
>Assignee: Ankur Chauhan
>  Labels: mesos, scheduler
> Fix For: 1.5.0
>
>
> Currently, the Mesos scheduler only looks at the 'cpu' and 'mem' resources 
> when trying to determine the usability of a resource offer from a Mesos 
> slave node. It may be preferable for the user to be able to ensure that 
> Spark jobs are only started on a certain set of nodes (based on attributes). 
> For example, if the user sets a property, let's say 
> {code}spark.mesos.constraints{code}, to 
> {code}tachyon=true;us-east-1=false{code}, then resource offers will be 
> checked to see if they meet both of these constraints, and only then will 
> they be accepted to start new executors.






[jira] [Updated] (SPARK-10179) LogisticRegressionWithSGD does not multiclass

2015-08-24 Thread Shiqiao Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiqiao Du updated SPARK-10179:
---
Description: 
LogisticRegressionWithSGD does not support multi-class input, while 
LogisticRegressionWithLBFGS does.

I found this line in the error log:
```
15/08/24 14:46:57 ERROR DataValidators: Classification labels should be 0 or 1. 
Found 50 invalid labels
```

LogisticRegressionWithLBFGS has a setNumClasses method, but 
LogisticRegressionWithSGD does not.
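
A minimal sketch of the LBFGS-based workaround (assuming an
RDD[LabeledPoint] named training with, say, 10 classes):
{code:scala}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

// Multinomial logistic regression: setNumClasses is the method
// that LogisticRegressionWithSGD lacks.
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(10)
  .run(training)
{code}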

  was:
LogisticRegressionWithSGD does not support multi-class input, while 
LogisticRegressionWithLBFGS does.

I found this line in the error log:
```
15/08/24 14:46:57 ERROR DataValidators: Classification labels should be 0 or 1. 
Found 50 invalid labels
```


 LogisticRegressionWithSGD does not multiclass 
 --

 Key: SPARK-10179
 URL: https://issues.apache.org/jira/browse/SPARK-10179
 Project: Spark
  Issue Type: Bug
  Components: MLlib
 Environment: Ubuntu 14.04, spark 1.4.1
Reporter: Shiqiao Du

 LogisticRegressionWithSGD does not support multi-class input, while 
 LogisticRegressionWithLBFGS does.
 I found this line in the error log:
 ```
 15/08/24 14:46:57 ERROR DataValidators: Classification labels should be 0 or 
 1. Found 50 invalid labels
 ```
 LogisticRegressionWithLBFGS has a setNumClasses method, but 
 LogisticRegressionWithSGD does not.






[jira] [Updated] (SPARK-10179) LogisticRegressionWithSGD does not multiclass

2015-08-24 Thread Shiqiao Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiqiao Du updated SPARK-10179:
---
Description: 
LogisticRegressionWithSGD does not support multi-class input, while 
LogisticRegressionWithLBFGS does.

I found this line in the error log:
```
15/08/24 14:46:57 ERROR DataValidators: Classification labels should be 0 or 1. 
Found 50 invalid labels
```

  was:LogisticRegressionWithSGD does not support multi-class input, while 
LogisticRegressionWithLBFGS does.


 LogisticRegressionWithSGD does not multiclass 
 --

 Key: SPARK-10179
 URL: https://issues.apache.org/jira/browse/SPARK-10179
 Project: Spark
  Issue Type: Bug
  Components: MLlib
 Environment: Ubuntu 14.04, spark 1.4.1
Reporter: Shiqiao Du

 LogisticRegressionWithSGD does not support multi-class input, while 
 LogisticRegressionWithLBFGS does.
 I found this line in the error log:
 ```
 15/08/24 14:46:57 ERROR DataValidators: Classification labels should be 0 or 
 1. Found 50 invalid labels
 ```






[jira] [Commented] (SPARK-10179) LogisticRegressionWithSGD does not multiclass

2015-08-24 Thread Shiqiao Du (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14710429#comment-14710429
 ] 

Shiqiao Du commented on SPARK-10179:


Thank you. I got it.

 LogisticRegressionWithSGD does not multiclass 
 --

 Key: SPARK-10179
 URL: https://issues.apache.org/jira/browse/SPARK-10179
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Shiqiao Du
Priority: Minor

 LogisticRegressionWithSGD does not support multi-class input, while 
 LogisticRegressionWithLBFGS does.
 I found this line in the error log:
 ```
 15/08/24 14:46:57 ERROR DataValidators: Classification labels should be 0 or 
 1. Found 50 invalid labels
 ```
 LogisticRegressionWithLBFGS has a setNumClasses method, but 
 LogisticRegressionWithSGD does not.






[jira] [Created] (SPARK-10179) LogisticRegressionWithSGD does not multiclass

2015-08-23 Thread Shiqiao Du (JIRA)
Shiqiao Du created SPARK-10179:
--

 Summary: LogisticRegressionWithSGD does not multiclass 
 Key: SPARK-10179
 URL: https://issues.apache.org/jira/browse/SPARK-10179
 Project: Spark
  Issue Type: Bug
  Components: MLlib
 Environment: Ubuntu 14.04, spark 1.4.1
Reporter: Shiqiao Du


LogisticRegressionWithSGD does not support multi-class input, while 
LogisticRegressionWithLBFGS does.






[jira] [Commented] (SPARK-5393) Flood of util.RackResolver log messages after SPARK-1714

2015-03-16 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14363451#comment-14363451
 ] 

Junping Du commented on SPARK-5393:
---

Thanks [~sandyr] for confirming this! I will go ahead and commit that patch 
in YARN (after it is updated to address my comments).

 Flood of util.RackResolver log messages after SPARK-1714
 

 Key: SPARK-5393
 URL: https://issues.apache.org/jira/browse/SPARK-5393
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.3.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza
Priority: Critical
 Fix For: 1.3.0


 I thought I fixed this while working on the patch, but [~laserson] seems to 
 have encountered it when running on master.


