[jira] [Commented] (SPARK-34993) from_json() acts differently on created and literal strings with backslashes

2021-04-12 Thread Kousuke Saruta (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319958#comment-17319958
 ] 

Kousuke Saruta commented on SPARK-34993:


Hi [~laurikoobas].
 I think all of the results are correct.
 About the following two expressions, '\{"msg": "\\"}' is a string literal and 
"\\" is regarded as "\".
{code:java}
 from_json('{"msg":"\\"}', schema_of_json(to_json(named_struct('msg', '\\'))))
 from_json('{"msg":"\\"}', 'msg string')
{code}
So, if you need to give an escaped character like "\\", you should write "\\\\".
 In this case, those expressions should be:
{code:java}
 from_json('{"msg":"\\\\"}', schema_of_json(to_json(named_struct('msg', '\\'))))
 from_json('{"msg":"\\\\"}', 'msg string')
{code}
 

About the following expression,
from_json(to_json(named_struct('msg', '\\')), schema_of_json(to_json(named_struct('msg', '\\'))))
The evaluated result is \{"msg":"\"} in my Spark 3.1.1 environment. It seems to 
be different from the result you get (\{"msg":"\\"}), but it should be \{"msg":"\"}.

> from_json() acts differently on created and literal strings with backslashes
> 
>
> Key: SPARK-34993
> URL: https://issues.apache.org/jira/browse/SPARK-34993
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
> Environment: Databricks DBR 8.1
>Reporter: Lauri Koobas
>Priority: Major
> Attachments: image-2021-04-11-07-21-02-750.png
>
>
> A JSON string whose value contains backslashes fails to be recovered by 
> `from_json()`.
> I found that if the same string is created with `to_json(named_struct())` 
> then it actually does work.
>  
> The following code reproduces the issue. I would expect all of these methods to 
> return the same (correct) result:
> {code:java}
> select to_json(named_struct('msg', '\\'))
>  , schema_of_json(to_json(named_struct('msg', '\\')))
>  , from_json(to_json(named_struct('msg', '\\')), schema_of_json(to_json(named_struct('msg', '\\'))))
>  , from_json('{"msg":"\\"}', schema_of_json(to_json(named_struct('msg', '\\'))))
>  , from_json('{"msg":"\\"}', 'msg string')
>  
> {code}
>  
> !image-2021-04-11-07-21-02-750.png!
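Given the escaping rules explained in the comment above, here is a minimal spark-shell sketch of the corrected call (my own illustration, assuming the default SQL string-literal escaping; the exact rendering of the output is not asserted):

{code:java}
// Scala, spark-shell. The triple-quoted string passes the backslashes through
// untouched, so the SQL parser sees the literal '{"msg":"\\\\"}'. With default
// escaping, \\\\ becomes the two characters \\ inside the JSON text, and the
// JSON reader decodes "\\" to a single backslash in the msg field.
val parsed = spark.sql("""select from_json('{"msg":"\\\\"}', 'msg string') as parsed""")
parsed.show(false)
{code}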



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35042) Support traversal pruning in transform/resolve functions and their call sites

2021-04-12 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-35042:
---
Affects Version/s: (was: 3.1.0)
   3.2.0

> Support traversal pruning in transform/resolve functions and their call sites
> -
>
> Key: SPARK-35042
> URL: https://issues.apache.org/jira/browse/SPARK-35042
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.2.0
>Reporter: Yingyi Bu
>Priority: Major
>
> Transform/resolve functions are called ~280k times on average for a TPC-DS 
> query, which is far more than necessary. We can reduce those calls with 
> early exit information and conditions. This 
> [doc|https://docs.google.com/document/d/1SEUhkbo8X-0cYAJFYFDQhxUnKJBz4lLn3u4xR2qfWqk/edit]
>  has some evaluation numbers with a prototype.
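For readers unfamiliar with the idea, below is a self-contained toy sketch of such early-exit pruning (plain Scala, not Spark's actual TreeNode API; every name in it is invented for illustration):

{code:java}
// Each node caches a cheap "can a rule possibly match below me?" bit, and the
// transform skips whole subtrees whose bit says no, which is the early exit
// described in the ticket.
sealed trait Expr { def hasLiteral: Boolean }
case class Lit(v: Int) extends Expr { val hasLiteral = true }
case class Ref(name: String) extends Expr { val hasLiteral = false }
case class Add(l: Expr, r: Expr) extends Expr {
  val hasLiteral: Boolean = l.hasLiteral || r.hasLiteral
}

def transformWithPruning(e: Expr)(cond: Expr => Boolean)(
    rule: PartialFunction[Expr, Expr]): Expr = {
  if (!cond(e)) e  // prune: nothing below can match, skip the whole subtree
  else {
    val newChildren = e match {
      case Add(l, r) =>
        Add(transformWithPruning(l)(cond)(rule), transformWithPruning(r)(cond)(rule))
      case leaf => leaf
    }
    rule.applyOrElse(newChildren, (x: Expr) => x)
  }
}

// Example: constant-fold Add(Lit, Lit), never descending into literal-free subtrees.
val folded = transformWithPruning(Add(Ref("a"), Add(Lit(1), Lit(2))))(_.hasLiteral) {
  case Add(Lit(x), Lit(y)) => Lit(x + y)
}
// folded == Add(Ref("a"), Lit(3)); the Ref("a") subtree is returned without being rewritten.
{code}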



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35045) Add an internal option to control input buffer in univocity

2021-04-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35045:


Assignee: (was: Apache Spark)

> Add an internal option to control input buffer in univocity
> ---
>
> Key: SPARK-35045
> URL: https://issues.apache.org/jira/browse/SPARK-35045
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> https://github.com/apache/spark/pull/31858 changed Spark to respect Univocity's 
> default input buffer size because:
> - Firstly, it's best to trust their judgement on the default values. Also, 128 
> is too low.
> - Default values arguably have more test coverage in Univocity.
> - It will also fix uniVocity/univocity-parsers#449, which is a regression 
> compared to Spark 2.4.
> To mitigate related side effects, we should have a workaround to change the 
> buffer size.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35045) Add an internal option to control input buffer in univocity

2021-04-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319956#comment-17319956
 ] 

Apache Spark commented on SPARK-35045:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/32145

> Add an internal option to control input buffer in univocity
> ---
>
> Key: SPARK-35045
> URL: https://issues.apache.org/jira/browse/SPARK-35045
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> https://github.com/apache/spark/pull/31858 changed Spark to respect Univocity's 
> default input buffer size because:
> - Firstly, it's best to trust their judgement on the default values. Also, 128 
> is too low.
> - Default values arguably have more test coverage in Univocity.
> - It will also fix uniVocity/univocity-parsers#449, which is a regression 
> compared to Spark 2.4.
> To mitigate related side effects, we should have a workaround to change the 
> buffer size.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35045) Add an internal option to control input buffer in univocity

2021-04-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35045:


Assignee: Apache Spark

> Add an internal option to control input buffer in univocity
> ---
>
> Key: SPARK-35045
> URL: https://issues.apache.org/jira/browse/SPARK-35045
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> https://github.com/apache/spark/pull/31858 changed Spark to respect Univocity's 
> default input buffer size because:
> - Firstly, it's best to trust their judgement on the default values. Also, 128 
> is too low.
> - Default values arguably have more test coverage in Univocity.
> - It will also fix uniVocity/univocity-parsers#449, which is a regression 
> compared to Spark 2.4.
> To mitigate related side effects, we should have a workaround to change the 
> buffer size.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35042) Support traversal pruning in transform/resolve functions and their call sites

2021-04-12 Thread Yingyi Bu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319950#comment-17319950
 ] 

Yingyi Bu commented on SPARK-35042:
---

>> [~buyingyi] Shall we use a bigger title and include 
>>https://issues.apache.org/jira/browse/SPARK-34916 in this umbrella JIRA?

Done.  Thanks, Gengliang!

> Support traversal pruning in transform/resolve functions and their call sites
> -
>
> Key: SPARK-35042
> URL: https://issues.apache.org/jira/browse/SPARK-35042
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.1.0
>Reporter: Yingyi Bu
>Priority: Major
>
> Transform/resolve functions are called ~280k times on average for a TPC-DS 
> query, which is far more than necessary. We can reduce those calls with 
> early exit information and conditions. This 
> [doc|https://docs.google.com/document/d/1SEUhkbo8X-0cYAJFYFDQhxUnKJBz4lLn3u4xR2qfWqk/edit]
>  has some evaluation numbers with a prototype.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-35042) Support traversal pruning in transform/resolve functions and their call sites

2021-04-12 Thread Yingyi Bu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319950#comment-17319950
 ] 

Yingyi Bu edited comment on SPARK-35042 at 4/13/21, 6:42 AM:
-

>> [~buyingyi] Shall we use a bigger title and include 
>>https://issues.apache.org/jira/browse/SPARK-34916 in this umbrella JIRA?

Done.  Thanks, [~Gengliang.Wang]!


was (Author: buyingyi):
>> [~buyingyi] Shall we use a bigger title and include 
>>https://issues.apache.org/jira/browse/SPARK-34916 in this umbrella JIRA?

Done.  Thanks, Gengliang!

> Support traversal pruning in transform/resolve functions and their call sites
> -
>
> Key: SPARK-35042
> URL: https://issues.apache.org/jira/browse/SPARK-35042
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.1.0
>Reporter: Yingyi Bu
>Priority: Major
>
> Transform/resolve functions are called ~280k times on average for a TPC-DS 
> query, which is far more than necessary. We can reduce those calls with 
> early exit information and conditions. This 
> [doc|https://docs.google.com/document/d/1SEUhkbo8X-0cYAJFYFDQhxUnKJBz4lLn3u4xR2qfWqk/edit]
>  has some evaluation numbers with a prototype.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34916) Support traversal pruning in the transform function family

2021-04-12 Thread Yingyi Bu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yingyi Bu updated SPARK-34916:
--
Description: 
Add variants that support tree traversal pruning to the transform function 
family.
 

  was:Transform/resolve functions are called ~280k times per query on average 
for a TPC-DS query, which are way more than necessary. We can reduce those 
calls with early exit information and conditions.

Summary: Support traversal pruning in the transform function family  
(was: Reduce tree traversals in transform/resolve function families)

> Support traversal pruning in the transform function family
> --
>
> Key: SPARK-34916
> URL: https://issues.apache.org/jira/browse/SPARK-34916
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Yingyi Bu
>Assignee: Yingyi Bu
>Priority: Major
> Fix For: 3.2.0
>
>
> Add variants that support tree traversal pruning to the transform function 
> family.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34916) Reduce tree traversals in transform/resolve function families

2021-04-12 Thread Yingyi Bu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yingyi Bu updated SPARK-34916:
--
Parent: SPARK-35042
Issue Type: Sub-task  (was: Improvement)

> Reduce tree traversals in transform/resolve function families
> -
>
> Key: SPARK-34916
> URL: https://issues.apache.org/jira/browse/SPARK-34916
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Yingyi Bu
>Assignee: Yingyi Bu
>Priority: Major
> Fix For: 3.2.0
>
>
> Transform/resolve functions are called ~280k times per query on average for a 
> TPC-DS query, which are way more than necessary. We can reduce those calls 
> with early exit information and conditions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35042) Support traversal pruning in transform/resolve functions and their call sites

2021-04-12 Thread Yingyi Bu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yingyi Bu updated SPARK-35042:
--
Description: Transform/resolve functions are called ~280k times on average for 
a TPC-DS query, which is far more than necessary. We can reduce those calls 
with early exit information and conditions. This 
[doc|https://docs.google.com/document/d/1SEUhkbo8X-0cYAJFYFDQhxUnKJBz4lLn3u4xR2qfWqk/edit]
 has some evaluation numbers with a prototype.  (was: Transform/resolve functions 
are called ~280k times per query on average for a TPC-DS query, which are way 
more than necessary. We can reduce those calls with early exit information and 
conditions. Here are some evaluation numbers with a prototype 
[doc|https://docs.google.com/document/d/1SEUhkbo8X-0cYAJFYFDQhxUnKJBz4lLn3u4xR2qfWqk/edit].)

> Support traversal pruning in transform/resolve functions and their call sites
> -
>
> Key: SPARK-35042
> URL: https://issues.apache.org/jira/browse/SPARK-35042
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.1.0
>Reporter: Yingyi Bu
>Priority: Major
>
> Transform/resolve functions are called ~280k times on average for a TPC-DS 
> query, which is far more than necessary. We can reduce those calls with 
> early exit information and conditions. This 
> [doc|https://docs.google.com/document/d/1SEUhkbo8X-0cYAJFYFDQhxUnKJBz4lLn3u4xR2qfWqk/edit]
>  has some evaluation numbers with a prototype.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35042) Support traversal pruning in transform/resolve functions and their call sites

2021-04-12 Thread Yingyi Bu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yingyi Bu updated SPARK-35042:
--
Description: Transform/resolve functions are called ~280k times per query 
on average for a TPC-DS query, which are way more than necessary. We can reduce 
those calls with early exit information and conditions. Here are some 
evaluation numbers with a prototype 
[doc|https://docs.google.com/document/d/1SEUhkbo8X-0cYAJFYFDQhxUnKJBz4lLn3u4xR2qfWqk/edit].
  (was: It's an umbrella JIRA issue for migrating eligible transform/resolve 
call sites to the version of transform/resolve functions with tree traversal 
pruning support)
Summary: Support traversal pruning in transform/resolve functions and 
their call sites  (was: Migrate eligible transform/resolve call sites to the 
version with traversal pruning)

> Support traversal pruning in transform/resolve functions and their call sites
> -
>
> Key: SPARK-35042
> URL: https://issues.apache.org/jira/browse/SPARK-35042
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.1.0
>Reporter: Yingyi Bu
>Priority: Major
>
> Transform/resolve functions are called ~280k times per query on average for a 
> TPC-DS query, which are way more than necessary. We can reduce those calls 
> with early exit information and conditions. Here are some evaluation numbers 
> with a prototype 
> [doc|https://docs.google.com/document/d/1SEUhkbo8X-0cYAJFYFDQhxUnKJBz4lLn3u4xR2qfWqk/edit].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35042) Migrate eligible transform/resolve call sites to the version with traversal pruning

2021-04-12 Thread Gengliang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319935#comment-17319935
 ] 

Gengliang Wang commented on SPARK-35042:


[~buyingyi] Shall we use a bigger title and include 
https://issues.apache.org/jira/browse/SPARK-34916 in this umbrella JIRA?

> Migrate eligible transform/resolve call sites to the version with traversal 
> pruning
> ---
>
> Key: SPARK-35042
> URL: https://issues.apache.org/jira/browse/SPARK-35042
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.1.0
>Reporter: Yingyi Bu
>Priority: Major
>
> It's an umbrella JIRA issue for migrating eligible transform/resolve call 
> sites to the versions of the transform/resolve functions that support tree 
> traversal pruning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33411) Cardinality estimation of union, range and sort logical operators

2021-04-12 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-33411.
--
Fix Version/s: 3.2.0
 Assignee: Ayushi Agarwal
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/30334

> Cardinality estimation of union, range and sort logical operators
> -
>
> Key: SPARK-33411
> URL: https://issues.apache.org/jira/browse/SPARK-33411
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Ayushi Agarwal
>Assignee: Ayushi Agarwal
>Priority: Major
> Fix For: 3.2.0
>
>
> Support cardinality estimation for union, sort and range operators to enhance 
> https://issues.apache.org/jira/browse/SPARK-16026



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35045) Add an internal option to control input buffer in univocity

2021-04-12 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-35045:


 Summary: Add an internal option to control input buffer in 
univocity
 Key: SPARK-35045
 URL: https://issues.apache.org/jira/browse/SPARK-35045
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Hyukjin Kwon


https://github.com/apache/spark/pull/31858 changed Spark to respect Univocity's 
default input buffer size because:
- Firstly, it's best to trust their judgement on the default values. Also, 128 
is too low.
- Default values arguably have more test coverage in Univocity.
- It will also fix uniVocity/univocity-parsers#449, which is a regression 
compared to Spark 2.4.

To mitigate related side effects, we should have a workaround to change the 
buffer size.
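For illustration only, such a workaround could look roughly like the following; the option name used here is an assumption, not a confirmed API:

{code:java}
// Scala sketch. "inputBufferSize" is an assumed name for the internal escape
// hatch described above; the real option name is whatever the PR introduces.
val df = spark.read
  .option("header", "true")
  .option("inputBufferSize", "1048576")  // hypothetically override the parser's input buffer
  .csv("/path/to/data.csv")
{code}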



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35044) Support retrieve hadoop configurations via SET syntax

2021-04-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35044:


Assignee: (was: Apache Spark)

> Support retrieve hadoop configurations via SET syntax
> -
>
> Key: SPARK-35044
> URL: https://issues.apache.org/jira/browse/SPARK-35044
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kent Yao
>Priority: Major
>
> Currently, pure SQL users have no good way to see the Hadoop configurations, 
> which can significantly affect their jobs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35044) Support retrieve hadoop configurations via SET syntax

2021-04-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319922#comment-17319922
 ] 

Apache Spark commented on SPARK-35044:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/32144

> Support retrieve hadoop configurations via SET syntax
> -
>
> Key: SPARK-35044
> URL: https://issues.apache.org/jira/browse/SPARK-35044
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kent Yao
>Priority: Major
>
> Currently, pure SQL users have no good way to see the Hadoop configurations, 
> which can significantly affect their jobs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35044) Support retrieve hadoop configurations via SET syntax

2021-04-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35044:


Assignee: Apache Spark

> Support retrieve hadoop configurations via SET syntax
> -
>
> Key: SPARK-35044
> URL: https://issues.apache.org/jira/browse/SPARK-35044
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Major
>
> Currently, pure SQL users have no good way to see the Hadoop configurations, 
> which can significantly affect their jobs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35043) Support traversal pruning in resolve functions in AnalysisHelper

2021-04-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319916#comment-17319916
 ] 

Apache Spark commented on SPARK-35043:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/32143

> Support traversal pruning in resolve functions in AnalysisHelper
> 
>
> Key: SPARK-35043
> URL: https://issues.apache.org/jira/browse/SPARK-35043
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 3.1.0
>Reporter: Yingyi Bu
>Priority: Major
>
> Add variants that support tree traversal pruning to the resolve function 
> family.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35043) Support traversal pruning in resolve functions in AnalysisHelper

2021-04-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319915#comment-17319915
 ] 

Apache Spark commented on SPARK-35043:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/32143

> Support traversal pruning in resolve functions in AnalysisHelper
> 
>
> Key: SPARK-35043
> URL: https://issues.apache.org/jira/browse/SPARK-35043
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 3.1.0
>Reporter: Yingyi Bu
>Priority: Major
>
> Add variants that support tree traversal pruning to the resolve function 
> family.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35037) Recognize sign before the interval string in literals

2021-04-12 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-35037.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32134
[https://github.com/apache/spark/pull/32134]

> Recognize sign before the interval string in literals
> -
>
> Key: SPARK-35037
> URL: https://issues.apache.org/jira/browse/SPARK-35037
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> According to the SQL standard:
> {code:java}
> <interval literal> ::=
>   INTERVAL [ <sign> ] <interval string> <interval qualifier>
> <interval string> ::=
>   <quote> <unquoted interval string> <quote>
> <unquoted interval string> ::=
>   [ <sign> ] { <year-month literal> | <day-time interval literal> }
> <sign> ::=
>   <plus sign>
>   | <minus sign>
> {code}
> but the parsing fails:
> {code:java}
> spark-sql> select interval -'1-1' year to month;
> Error in query:
> mismatched input 'to' expecting {<EOF>, ';'}(line 1, pos 28)
> == SQL ==
> select interval -'1-1' year to month
> ^^^
> {code}
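A hedged illustration of the expected behaviour once the sign is recognized (my expectation based on the standard text quoted above, not verified output):

{code:java}
// Scala, spark-shell. After the fix the signed form should parse, and it should
// presumably behave like negating the interval literal.
spark.sql("select interval -'1-1' year to month").show(false)
spark.sql("select -interval '1-1' year to month").show(false)  // expected to match
{code}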



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35044) Support retrieve hadoop configurations via SET syntax

2021-04-12 Thread Kent Yao (Jira)
Kent Yao created SPARK-35044:


 Summary: Support retrieve hadoop configurations via SET syntax
 Key: SPARK-35044
 URL: https://issues.apache.org/jira/browse/SPARK-35044
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Kent Yao


Currently, pure SQL users have no good way to see the Hadoop configurations, 
which can significantly affect their jobs.
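Hypothetically, the requested capability would let a SQL-only user do something like the following (syntax and behaviour are assumptions, not confirmed by this ticket):

{code:java}
// Scala sketch; assumes SET falls through to the Hadoop configuration for keys
// it does not recognize as Spark SQL confs.
spark.sql("SET mapreduce.input.fileinputformat.split.maxsize").show(false)
{code}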



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35043) Support traversal pruning in resolve functions in AnalysisHelper

2021-04-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319903#comment-17319903
 ] 

Apache Spark commented on SPARK-35043:
--

User 'sigmod' has created a pull request for this issue:
https://github.com/apache/spark/pull/32135

> Support traversal pruning in resolve functions in AnalysisHelper
> 
>
> Key: SPARK-35043
> URL: https://issues.apache.org/jira/browse/SPARK-35043
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 3.1.0
>Reporter: Yingyi Bu
>Priority: Major
>
> Add variants that support tree traversal pruning to the resolve function 
> family.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35043) Support traversal pruning in resolve functions in AnalysisHelper

2021-04-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35043:


Assignee: Apache Spark

> Support traversal pruning in resolve functions in AnalysisHelper
> 
>
> Key: SPARK-35043
> URL: https://issues.apache.org/jira/browse/SPARK-35043
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 3.1.0
>Reporter: Yingyi Bu
>Assignee: Apache Spark
>Priority: Major
>
> Add variants that support tree traversal pruning to the resolve function 
> family.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35043) Support traversal pruning in resolve functions in AnalysisHelper

2021-04-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35043:


Assignee: (was: Apache Spark)

> Support traversal pruning in resolve functions in AnalysisHelper
> 
>
> Key: SPARK-35043
> URL: https://issues.apache.org/jira/browse/SPARK-35043
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 3.1.0
>Reporter: Yingyi Bu
>Priority: Major
>
> Add variants that support tree traversal pruning to the resolve function 
> family.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35043) Support traversal pruning in resolve functions in AnalysisHelper

2021-04-12 Thread Yingyi Bu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yingyi Bu updated SPARK-35043:
--
Shepherd: Gengliang Wang

> Support traversal pruning in resolve functions in AnalysisHelper
> 
>
> Key: SPARK-35043
> URL: https://issues.apache.org/jira/browse/SPARK-35043
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 3.1.0
>Reporter: Yingyi Bu
>Priority: Major
>
> Add variants that support tree traversal pruning to the resolve function 
> family.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35043) Support traversal pruning in resolve functions in AnalysisHelper

2021-04-12 Thread Yingyi Bu (Jira)
Yingyi Bu created SPARK-35043:
-

 Summary: Support traversal pruning in resolve functions in 
AnalysisHelper
 Key: SPARK-35043
 URL: https://issues.apache.org/jira/browse/SPARK-35043
 Project: Spark
  Issue Type: Sub-task
  Components: Optimizer
Affects Versions: 3.1.0
Reporter: Yingyi Bu


Add variants that support tree traversal pruning to the resolve function family.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35042) Migrate eligible transform/resolve call sites to the version with traversal pruning

2021-04-12 Thread Yingyi Bu (Jira)
Yingyi Bu created SPARK-35042:
-

 Summary: Migrate eligible transform/resolve call sites to the 
version with traversal pruning
 Key: SPARK-35042
 URL: https://issues.apache.org/jira/browse/SPARK-35042
 Project: Spark
  Issue Type: Improvement
  Components: Optimizer
Affects Versions: 3.1.0
Reporter: Yingyi Bu


It's an umbrella JIRA issue for migrating eligible transform/resolve call sites 
to the versions of the transform/resolve functions that support tree traversal 
pruning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35041) Revise the overflow in UTF8String

2021-04-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35041:


Assignee: (was: Apache Spark)

> Revise the overflow in UTF8String
> -
>
> Key: SPARK-35041
> URL: https://issues.apache.org/jira/browse/SPARK-35041
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Priority: Minor
>
> Revise the code of `UTF8String` and add checks for overflow.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35041) Revise the overflow in UTF8String

2021-04-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319878#comment-17319878
 ] 

Apache Spark commented on SPARK-35041:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/32142

> Revise the overflow in UTF8String
> -
>
> Key: SPARK-35041
> URL: https://issues.apache.org/jira/browse/SPARK-35041
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Priority: Minor
>
> Revise the code of `UTF8String` and add checks for overflow.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35041) Revise the overflow in UTF8String

2021-04-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35041:


Assignee: Apache Spark

> Revise the overflow in UTF8String
> -
>
> Key: SPARK-35041
> URL: https://issues.apache.org/jira/browse/SPARK-35041
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Assignee: Apache Spark
>Priority: Minor
>
> Revise the code of `UTF8String` and add checks for overflow.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34996) Port Koalas Series related unit tests into PySpark

2021-04-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-34996.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32117
[https://github.com/apache/spark/pull/32117]

> Port Koalas Series related unit tests into PySpark
> --
>
> Key: SPARK-34996
> URL: https://issues.apache.org/jira/browse/SPARK-34996
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.2.0
>
>
> This JIRA aims to port Koalas Series related unit tests to [PySpark 
> tests|https://github.com/apache/spark/tree/master/python/pyspark/tests].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34996) Port Koalas Series related unit tests into PySpark

2021-04-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-34996:


Assignee: Xinrong Meng

> Port Koalas Series related unit tests into PySpark
> --
>
> Key: SPARK-34996
> URL: https://issues.apache.org/jira/browse/SPARK-34996
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>
> This JIRA aims to port Koalas Series related unit tests to [PySpark 
> tests|https://github.com/apache/spark/tree/master/python/pyspark/tests].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35029) Extract a new method to eliminate duplicate code in `BufferReleasingInputStream`

2021-04-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-35029.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32130
[https://github.com/apache/spark/pull/32130]

> Extract a new method to eliminate duplicate code in 
> `BufferReleasingInputStream`
> 
>
> Key: SPARK-35029
> URL: https://issues.apache.org/jira/browse/SPARK-35029
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.2.0
>
>
> There are some duplicate code patterns in `BufferReleasingInputStream`, such 
> as 
>  
> {code:java}
> override def read(): Int = { 
>   try { 
> delegate.read() 
>   } catch { 
> case e: IOException if detectCorruption => 
>   IOUtils.closeQuietly(this) 
>   iterator.throwFetchFailedException(blockId, mapIndex, address, e) 
>   } 
> }
> {code}
> , 
>  
> {code:java}
> override def read(b: Array[Byte]): Int = {
>   try {  
> delegate.read(b)
>   } catch {  
> case e: IOException if detectCorruption =>
>   IOUtils.closeQuietly(this)
>   iterator.throwFetchFailedException(blockId, mapIndex, address, e)
>   }  
> }
> {code}
>  
> and
>  
> {code:java}
> override def read(b: Array[Byte], off: Int, len: Int): Int = {
>   try {  
> delegate.read(b, off, len)
>   } catch {  
> case e: IOException if detectCorruption =>
>   IOUtils.closeQuietly(this)
>   iterator.throwFetchFailedException(blockId, mapIndex, address, e)
>   }  
> }
> {code}
>  
>  
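A minimal sketch of the kind of extraction meant, reusing the members that appear in the snippets above (the helper's name and exact shape are assumptions, not necessarily what the PR does):

{code:java}
// Scala sketch inside BufferReleasingInputStream; relies on the same members the
// duplicated snippets above use (delegate, detectCorruption, iterator, blockId,
// mapIndex, address).
private def tryOrFetchFailedException[T](block: => T): T = {
  try {
    block
  } catch {
    case e: IOException if detectCorruption =>
      IOUtils.closeQuietly(this)
      iterator.throwFetchFailedException(blockId, mapIndex, address, e)
  }
}

// Each read variant then collapses to a one-liner, for example:
override def read(b: Array[Byte], off: Int, len: Int): Int =
  tryOrFetchFailedException(delegate.read(b, off, len))
{code}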



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35029) Extract a new method to eliminate duplicate code in `BufferReleasingInputStream`

2021-04-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-35029:


Assignee: Yang Jie

> Extract a new method to eliminate duplicate code in 
> `BufferReleasingInputStream`
> 
>
> Key: SPARK-35029
> URL: https://issues.apache.org/jira/browse/SPARK-35029
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> There are some duplicate code patterns in `BufferReleasingInputStream`, such 
> as 
>  
> {code:java}
> override def read(): Int = { 
>   try { 
> delegate.read() 
>   } catch { 
> case e: IOException if detectCorruption => 
>   IOUtils.closeQuietly(this) 
>   iterator.throwFetchFailedException(blockId, mapIndex, address, e) 
>   } 
> }
> {code}
> , 
>  
> {code:java}
> override def read(b: Array[Byte]): Int = {
>   try {  
> delegate.read(b)
>   } catch {  
> case e: IOException if detectCorruption =>
>   IOUtils.closeQuietly(this)
>   iterator.throwFetchFailedException(blockId, mapIndex, address, e)
>   }  
> }
> {code}
>  
> and
>  
> {code:java}
> override def read(b: Array[Byte], off: Int, len: Int): Int = {
>   try {  
> delegate.read(b, off, len)
>   } catch {  
> case e: IOException if detectCorruption =>
>   IOUtils.closeQuietly(this)
>   iterator.throwFetchFailedException(blockId, mapIndex, address, e)
>   }  
> }
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35041) Revise the overflow in UTF8String

2021-04-12 Thread ulysses you (Jira)
ulysses you created SPARK-35041:
---

 Summary: Revise the overflow in UTF8String
 Key: SPARK-35041
 URL: https://issues.apache.org/jira/browse/SPARK-35041
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: ulysses you


Revise the code of `UTF8String` and add checks for overflow.
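As a rough illustration of the kind of guard meant (a hedged sketch, not the actual patch):

{code:java}
// Scala sketch: combine two byte lengths with an explicit overflow check instead
// of letting Int arithmetic silently wrap around.
def checkedTotalLength(a: Int, b: Int): Int = {
  val total = a.toLong + b.toLong
  if (total > Int.MaxValue) {
    throw new ArithmeticException(s"Requested length $total exceeds Int.MaxValue")
  }
  total.toInt
}
{code}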



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32922) Add support for ShuffleBlockFetcherIterator to read from merged shuffle partitions and to fallback to original shuffle blocks if encountering failures

2021-04-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319866#comment-17319866
 ] 

Apache Spark commented on SPARK-32922:
--

User 'otterc' has created a pull request for this issue:
https://github.com/apache/spark/pull/32140

> Add support for ShuffleBlockFetcherIterator to read from merged shuffle 
> partitions and to fallback to original shuffle blocks if encountering failures
> --
>
> Key: SPARK-32922
> URL: https://issues.apache.org/jira/browse/SPARK-32922
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Priority: Major
>
> With the extended MapOutputTracker, the reducers can now get the task input 
> data from the merged shuffle partitions for more efficient shuffle data 
> fetch. The reducers should also be able to fall back to fetching the original 
> unmerged blocks if they encounter failures when fetching the merged shuffle 
> partitions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32922) Add support for ShuffleBlockFetcherIterator to read from merged shuffle partitions and to fallback to original shuffle blocks if encountering failures

2021-04-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319865#comment-17319865
 ] 

Apache Spark commented on SPARK-32922:
--

User 'otterc' has created a pull request for this issue:
https://github.com/apache/spark/pull/32140

> Add support for ShuffleBlockFetcherIterator to read from merged shuffle 
> partitions and to fallback to original shuffle blocks if encountering failures
> --
>
> Key: SPARK-32922
> URL: https://issues.apache.org/jira/browse/SPARK-32922
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Priority: Major
>
> With the extended MapOutputTracker, the reducers can now get the task input 
> data from the merged shuffle partitions for more efficient shuffle data 
> fetch. The reducers should also be able to fall back to fetching the original 
> unmerged blocks if they encounter failures when fetching the merged shuffle 
> partitions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34944) Employ correct data type for web_returns and store_returns in TPCDS tests

2021-04-12 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-34944:


Assignee: Kent Yao

> Employ correct data type for web_returns and store_returns in TPCDS tests
> -
>
> Key: SPARK-34944
> URL: https://issues.apache.org/jira/browse/SPARK-34944
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>
> {noformat}
> 2.2.2 Datatype
> 2.2.2.1 Each column employs one of the following datatypes:
> a) Identifier means that the column shall be able to hold any key value 
> generated for that column.
> b) Integer means that the column shall be able to exactly represent integer 
> values (i.e., values in increments of
> 1) in the range of at least (−2^(n−1)) to (2^(n−1) − 1), where n is 64.
> c) Decimal(d, f) means that the column shall be able to represent decimal 
> values up to and including d digits,
> of which f shall occur to the right of the decimal place; the values can be 
> either represented exactly or
> interpreted to be in this range.
> d) Char(N) means that the column shall be able to hold any string of 
> characters of a fixed length of N.
> Comment: If the string that a column of datatype char(N) holds is shorter 
> than N characters, then trailing
> spaces shall be stored in the database or the database shall automatically 
> pad with spaces upon retrieval such
> that a CHAR_LENGTH() function will return N.
> e) Varchar(N) means that the column shall be able to hold any string of 
> characters of a variable length with a
> maximum length of N. Columns defined as "varchar(N)" may optionally be 
> implemented as "char(N)".
> f) Date means that the column shall be able to express any calendar day 
> between January 1, 1900 and
> December 31, 2199.
> 2.2.2.2 The datatypes do not correspond to any specific SQL-standard 
> datatype. The definitions are provided to
> highlight the properties that are required for a particular column. The 
> benchmark implementer may employ any internal representation or SQL datatype 
> that meets those requirements.
> {noformat}
> One thing seems clear: we should replace the bigint type now used in 
> web_returns and store_returns with the int type.
> Another thing that might need further discussion: shall we use bigint to 
> meet 2.2.2.1 b)?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34944) Employ correct data type for web_returns and store_returns in TPCDS tests

2021-04-12 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-34944.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32037
[https://github.com/apache/spark/pull/32037]

> Employ correct data type for web_returns and store_returns in TPCDS tests
> -
>
> Key: SPARK-34944
> URL: https://issues.apache.org/jira/browse/SPARK-34944
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.2.0
>
>
> {noformat}
> 2.2.2 Datatype
> 2.2.2.1 Each column employs one of the following datatypes:
> a) Identifier means that the column shall be able to hold any key value 
> generated for that column.
> b) Integer means that the column shall be able to exactly represent integer 
> values (i.e., values in increments of
> 1) in the range of at least (−2^(n−1)) to (2^(n−1) − 1), where n is 64.
> c) Decimal(d, f) means that the column shall be able to represent decimal 
> values up to and including d digits,
> of which f shall occur to the right of the decimal place; the values can be 
> either represented exactly or
> interpreted to be in this range.
> d) Char(N) means that the column shall be able to hold any string of 
> characters of a fixed length of N.
> Comment: If the string that a column of datatype char(N) holds is shorter 
> than N characters, then trailing
> spaces shall be stored in the database or the database shall automatically 
> pad with spaces upon retrieval such
> that a CHAR_LENGTH() function will return N.
> e) Varchar(N) means that the column shall be able to hold any string of 
> characters of a variable length with a
> maximum length of N. Columns defined as "varchar(N)" may optionally be 
> implemented as "char(N)".
> f) Date means that the column shall be able to express any calendar day 
> between January 1, 1900 and
> December 31, 2199.
> 2.2.2.2 The datatypes do not correspond to any specific SQL-standard 
> datatype. The definitions are provided to
> highlight the properties that are required for a particular column. The 
> benchmark implementer may employ any internal representation or SQL datatype 
> that meets those requirements.
> {noformat}
> One thing seems clear: we should replace the bigint type now used in 
> web_returns and store_returns with the int type.
> Another thing that might need further discussion: shall we use bigint to 
> meet 2.2.2.1 b)?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34916) Reduce tree traversals in transform/resolve function families

2021-04-12 Thread Gengliang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319839#comment-17319839
 ] 

Gengliang Wang commented on SPARK-34916:


[~buyingyi] could you create an umbrella Jira for this?

> Reduce tree traversals in transform/resolve function families
> -
>
> Key: SPARK-34916
> URL: https://issues.apache.org/jira/browse/SPARK-34916
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Yingyi Bu
>Assignee: Yingyi Bu
>Priority: Major
> Fix For: 3.2.0
>
>
> Transform/resolve functions are called ~280k times per query on average for a 
> TPC-DS query, which are way more than necessary. We can reduce those calls 
> with early exit information and conditions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35032) Port Koalas Index unit tests into PySpark

2021-04-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319837#comment-17319837
 ] 

Apache Spark commented on SPARK-35032:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/32139

> Port Koalas Index unit tests into PySpark
> -
>
> Key: SPARK-35032
> URL: https://issues.apache.org/jira/browse/SPARK-35032
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> This JIRA aims to port Koalas Index unit tests to [PySpark 
> tests|https://github.com/apache/spark/tree/master/python/pyspark/tests].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35032) Port Koalas Index unit tests into PySpark

2021-04-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35032:


Assignee: Apache Spark

> Port Koalas Index unit tests into PySpark
> -
>
> Key: SPARK-35032
> URL: https://issues.apache.org/jira/browse/SPARK-35032
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>
> This JIRA aims to port Koalas Index unit tests to [PySpark 
> tests|https://github.com/apache/spark/tree/master/python/pyspark/tests].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35032) Port Koalas Index unit tests into PySpark

2021-04-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35032:


Assignee: (was: Apache Spark)

> Port Koalas Index unit tests into PySpark
> -
>
> Key: SPARK-35032
> URL: https://issues.apache.org/jira/browse/SPARK-35032
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> This JIRA aims to port Koalas Index unit tests to [PySpark 
> tests|https://github.com/apache/spark/tree/master/python/pyspark/tests].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35040) Remove Spark-version related codes from test codes.

2021-04-12 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin updated SPARK-35040:
--
Description: 
There are several places that check the PySpark version and switch the tests 
accordingly, but those checks are no longer necessary.

We should remove them.

> Remove Spark-version related codes from test codes.
> ---
>
> Key: SPARK-35040
> URL: https://issues.apache.org/jira/browse/SPARK-35040
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> There are several places that check the PySpark version and switch the tests 
> accordingly, but those checks are no longer necessary.
> We should remove them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35040) Remove Spark-version related codes from test codes.

2021-04-12 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-35040:
-

 Summary: Remove Spark-version related codes from test codes.
 Key: SPARK-35040
 URL: https://issues.apache.org/jira/browse/SPARK-35040
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Takuya Ueshin






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35033) Port Koalas plot unit tests into PySpark

2021-04-12 Thread Xinrong Meng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319834#comment-17319834
 ] 

Xinrong Meng commented on SPARK-35033:
--

I am working on this ticket.

> Port Koalas plot unit tests into PySpark
> 
>
> Key: SPARK-35033
> URL: https://issues.apache.org/jira/browse/SPARK-35033
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> This JIRA aims to port Koalas plot unit tests to [PySpark 
> tests|https://github.com/apache/spark/tree/master/python/pyspark/tests].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35028) ANSI mode: disallow group by aliases

2021-04-12 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-35028.

Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32129
[https://github.com/apache/spark/pull/32129]

> ANSI mode: disallow group by aliases
> 
>
> Key: SPARK-35028
> URL: https://issues.apache.org/jira/browse/SPARK-35028
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> As per the ANSI SQL standard, section 7.12:
> bq. Each <grouping column reference> shall unambiguously reference a column 
> of the table resulting from the <from clause>. A column referenced in a 
> <group by clause> is a grouping column.
> By forbidding it, we can avoid ambiguous SQL queries like:
> SELECT col + 1 as col FROM t GROUP BY col



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35034) Port Koalas miscellaneous unit tests into PySpark

2021-04-12 Thread Xinrong Meng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319833#comment-17319833
 ] 

Xinrong Meng commented on SPARK-35034:
--

I would like to work on this ticket.

> Port Koalas miscellaneous unit tests into PySpark
> -
>
> Key: SPARK-35034
> URL: https://issues.apache.org/jira/browse/SPARK-35034
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> This JIRA aims to port Koalas miscellaneous unit tests to [PySpark 
> tests|https://github.com/apache/spark/tree/master/python/pyspark/tests].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34995) Port/integrate Koalas remaining codes into PySpark

2021-04-12 Thread Haejoon Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319830#comment-17319830
 ] 

Haejoon Lee commented on SPARK-34995:
-

I'm working on this

> Port/integrate Koalas remaining codes into PySpark
> --
>
> Key: SPARK-34995
> URL: https://issues.apache.org/jira/browse/SPARK-34995
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> There are some more commits remaining after the main codes were ported.
> - 
> [https://github.com/databricks/koalas/commit/c8f803d6becb3accd767afdb3774c8656d0d0b47]
> - 
> [https://github.com/databricks/koalas/commit/913d68868d38ee7158c640aceb837484f417267e]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28227) Spark can’t support TRANSFORM with aggregation

2021-04-12 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-28227.
--
Fix Version/s: 3.2.0
 Assignee: angerszhu
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/29087

> Spark can’t  support TRANSFORM with aggregation
> ---
>
> Key: SPARK-28227
> URL: https://issues.apache.org/jira/browse/SPARK-28227
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.2.0
>
>
> Spark can't support using TRANSFORM with aggregation, such as:
> {code:java}
> SELECT TRANSFORM(T.A, SUM(T.B))
> USING 'func' AS (X STRING Y STRING)
> FROM DEFAULT.TEST T
> WHERE T.C > 0
> GROUP BY T.A{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35039) Remove Spark-version related codes from main codes.

2021-04-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319826#comment-17319826
 ] 

Apache Spark commented on SPARK-35039:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/32138

> Remove Spark-version related codes from main codes.
> ---
>
> Key: SPARK-35039
> URL: https://issues.apache.org/jira/browse/SPARK-35039
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> There are several places to check the PySpark version and switch the 
> behavior, but now those are not necessary.
> We should remove them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35039) Remove Spark-version related codes from main codes.

2021-04-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35039:


Assignee: (was: Apache Spark)

> Remove Spark-version related codes from main codes.
> ---
>
> Key: SPARK-35039
> URL: https://issues.apache.org/jira/browse/SPARK-35039
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> There are several places to check the PySpark version and switch the 
> behavior, but now those are not necessary.
> We should remove them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35039) Remove Spark-version related codes from main codes.

2021-04-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319822#comment-17319822
 ] 

Apache Spark commented on SPARK-35039:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/32138

> Remove Spark-version related codes from main codes.
> ---
>
> Key: SPARK-35039
> URL: https://issues.apache.org/jira/browse/SPARK-35039
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> There are several places to check the PySpark version and switch the 
> behavior, but now those are not necessary.
> We should remove them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35039) Remove Spark-version related codes from main codes.

2021-04-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35039:


Assignee: Apache Spark

> Remove Spark-version related codes from main codes.
> ---
>
> Key: SPARK-35039
> URL: https://issues.apache.org/jira/browse/SPARK-35039
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Assignee: Apache Spark
>Priority: Major
>
> There are several places to check the PySpark version and switch the 
> behavior, but now those are not necessary.
> We should remove them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35031) Port Koalas operations on different frames tests into PySpark

2021-04-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-35031.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32133
[https://github.com/apache/spark/pull/32133]

> Port Koalas operations on different frames tests into PySpark
> -
>
> Key: SPARK-35031
> URL: https://issues.apache.org/jira/browse/SPARK-35031
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.2.0
>
>
> This JIRA aims to port Koalas operations on different frames related unit 
> tests to [PySpark 
> tests|https://github.com/apache/spark/tree/master/python/pyspark/tests].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35031) Port Koalas operations on different frames tests into PySpark

2021-04-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-35031:


Assignee: Xinrong Meng

> Port Koalas operations on different frames tests into PySpark
> -
>
> Key: SPARK-35031
> URL: https://issues.apache.org/jira/browse/SPARK-35031
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>
> This JIRA aims to port Koalas operations on different frames related unit 
> tests to [PySpark 
> tests|https://github.com/apache/spark/tree/master/python/pyspark/tests].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35019) Improve type hints on pyspark.sql.*

2021-04-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-35019.
--
Fix Version/s: 3.1.2
   3.2.0
   Resolution: Fixed

Issue resolved by pull request 32122
[https://github.com/apache/spark/pull/32122]

> Improve type hints on pyspark.sql.*
> ---
>
> Key: SPARK-35019
> URL: https://issues.apache.org/jira/browse/SPARK-35019
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.2.0, 3.1.2
>
>
> Fix the mismatches in pyspark.sql.*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35019) Improve type hints on pyspark.sql.*

2021-04-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-35019:


Assignee: Yikun Jiang

> Improve type hints on pyspark.sql.*
> ---
>
> Key: SPARK-35019
> URL: https://issues.apache.org/jira/browse/SPARK-35019
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
>
> Fix the mismatches in pyspark.sql.*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35035) Port Koalas internal implementation unit tests into PySpark

2021-04-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319816#comment-17319816
 ] 

Apache Spark commented on SPARK-35035:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/32137

> Port Koalas internal implementation unit tests into PySpark
> ---
>
> Key: SPARK-35035
> URL: https://issues.apache.org/jira/browse/SPARK-35035
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> This JIRA aims to port Koalas internal implementation related unit tests to 
> [PySpark 
> tests|https://github.com/apache/spark/tree/master/python/pyspark/tests].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35035) Port Koalas internal implementation unit tests into PySpark

2021-04-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35035:


Assignee: Apache Spark

> Port Koalas internal implementation unit tests into PySpark
> ---
>
> Key: SPARK-35035
> URL: https://issues.apache.org/jira/browse/SPARK-35035
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>
> This JIRA aims to port Koalas internal implementation related unit tests to 
> [PySpark 
> tests|https://github.com/apache/spark/tree/master/python/pyspark/tests].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35035) Port Koalas internal implementation unit tests into PySpark

2021-04-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35035:


Assignee: (was: Apache Spark)

> Port Koalas internal implementation unit tests into PySpark
> ---
>
> Key: SPARK-35035
> URL: https://issues.apache.org/jira/browse/SPARK-35035
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> This JIRA aims to port Koalas internal implementation related unit tests to 
> [PySpark 
> tests|https://github.com/apache/spark/tree/master/python/pyspark/tests].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35039) Remove Spark-version related codes from main codes.

2021-04-12 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-35039:
-

 Summary: Remove Spark-version related codes from main codes.
 Key: SPARK-35039
 URL: https://issues.apache.org/jira/browse/SPARK-35039
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Takuya Ueshin


There are several places to check the PySpark version and switch the behavior, 
but now those are not necessary.

We should remove them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35022) Task Scheduling Plugin in Spark

2021-04-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35022:


Assignee: L. C. Hsieh  (was: Apache Spark)

> Task Scheduling Plugin in Spark
> ---
>
> Key: SPARK-35022
> URL: https://issues.apache.org/jira/browse/SPARK-35022
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> Spark scheduler schedules tasks to executors in an arbitrary way. The 
> scheduler schedules the tasks by itself. Although there is locality 
> configuration, the configuration is used for data locality purposes. 
> Generally we cannot suggest to the scheduler where a task should be 
> scheduled. Normally it is not a problem because the general task is 
> executor-agnostic. But for special tasks, for example stateful tasks in 
> Structured Streaming, the state store is maintained on the executor side. 
> Changing task location means reloading checkpoint data from the last batch. 
> It has disadvantages from the performance perspective and also imposes some 
> limitations when we want to implement advanced features in Structured 
> Streaming.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35022) Task Scheduling Plugin in Spark

2021-04-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319760#comment-17319760
 ] 

Apache Spark commented on SPARK-35022:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/32136

> Task Scheduling Plugin in Spark
> ---
>
> Key: SPARK-35022
> URL: https://issues.apache.org/jira/browse/SPARK-35022
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> Spark scheduler schedules tasks to executors in an arbitrary way. The 
> scheduler schedules the tasks by itself. Although there is locality 
> configuration, the configuration is used for data locality purposes. 
> Generally we cannot suggest to the scheduler where a task should be 
> scheduled. Normally it is not a problem because the general task is 
> executor-agnostic. But for special tasks, for example stateful tasks in 
> Structured Streaming, the state store is maintained on the executor side. 
> Changing task location means reloading checkpoint data from the last batch. 
> It has disadvantages from the performance perspective and also imposes some 
> limitations when we want to implement advanced features in Structured 
> Streaming.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35022) Task Scheduling Plugin in Spark

2021-04-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35022:


Assignee: Apache Spark  (was: L. C. Hsieh)

> Task Scheduling Plugin in Spark
> ---
>
> Key: SPARK-35022
> URL: https://issues.apache.org/jira/browse/SPARK-35022
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Major
>
> Spark scheduler schedules tasks to executors in an arbitrary way. The 
> scheduler schedules the tasks by itself. Although there is locality 
> configuration, the configuration is used for data locality purposes. 
> Generally we cannot suggest to the scheduler where a task should be 
> scheduled. Normally it is not a problem because the general task is 
> executor-agnostic. But for special tasks, for example stateful tasks in 
> Structured Streaming, the state store is maintained on the executor side. 
> Changing task location means reloading checkpoint data from the last batch. 
> It has disadvantages from the performance perspective and also imposes some 
> limitations when we want to implement advanced features in Structured 
> Streaming.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35022) Task Scheduling Plugin in Spark

2021-04-12 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh updated SPARK-35022:

Description: Spark scheduler schedules tasks to executors in an arbitrary 
way. The scheduler schedules the tasks by itself. Although there is locality 
configuration, the configuration is used for data locality purposes. Generally 
we cannot suggest to the scheduler where a task should be scheduled. Normally 
it is not a problem because the general task is executor-agnostic. But for 
special tasks, for example stateful tasks in Structured Streaming, the state 
store is maintained on the executor side. Changing task location means reloading 
checkpoint data from the last batch. It has disadvantages from the performance 
perspective and also imposes some limitations when we want to implement advanced 
features in Structured Streaming.  (was: Spark scheduler schedules tasks to 
executors in an indeterminate way. Although there is locality configuration, 
the configuration is used for data locality purposes. Generally we cannot 
suggest the scheduler where a task should be scheduled to. Normally it is not a 
problem because the general task is executor-agnostic. But for special tasks, 
for example stateful tasks in Structured Streaming, state store is maintained 
at the executor side. Changing task location means reloading checkpoint data 
from the last batch. It has disadvantages from the performance perspective and 
also casts some limitations when we want to implement advanced features in 
Structured Streaming.)

> Task Scheduling Plugin in Spark
> ---
>
> Key: SPARK-35022
> URL: https://issues.apache.org/jira/browse/SPARK-35022
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> Spark scheduler schedules tasks to executors in an arbitrary way. The 
> scheduler schedules the tasks by itself. Although there is locality 
> configuration, the configuration is used for data locality purposes. 
> Generally we cannot suggest to the scheduler where a task should be 
> scheduled. Normally it is not a problem because the general task is 
> executor-agnostic. But for special tasks, for example stateful tasks in 
> Structured Streaming, the state store is maintained on the executor side. 
> Changing task location means reloading checkpoint data from the last batch. 
> It has disadvantages from the performance perspective and also imposes some 
> limitations when we want to implement advanced features in Structured 
> Streaming.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34916) Reduce tree traversals in transform/resolve function families

2021-04-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319730#comment-17319730
 ] 

Apache Spark commented on SPARK-34916:
--

User 'sigmod' has created a pull request for this issue:
https://github.com/apache/spark/pull/32135

> Reduce tree traversals in transform/resolve function families
> -
>
> Key: SPARK-34916
> URL: https://issues.apache.org/jira/browse/SPARK-34916
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Yingyi Bu
>Assignee: Yingyi Bu
>Priority: Major
> Fix For: 3.2.0
>
>
> Transform/resolve functions are called ~280k times per query on average for a 
> TPC-DS query, which is far more than necessary. We can reduce those calls 
> with early exit information and conditions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34916) Reduce tree traversals in transform/resolve function families

2021-04-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319729#comment-17319729
 ] 

Apache Spark commented on SPARK-34916:
--

User 'sigmod' has created a pull request for this issue:
https://github.com/apache/spark/pull/32135

> Reduce tree traversals in transform/resolve function families
> -
>
> Key: SPARK-34916
> URL: https://issues.apache.org/jira/browse/SPARK-34916
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Yingyi Bu
>Assignee: Yingyi Bu
>Priority: Major
> Fix For: 3.2.0
>
>
> Transform/resolve functions are called ~280k times per query on average for a 
> TPC-DS query, which is far more than necessary. We can reduce those calls 
> with early exit information and conditions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35038) org.apache.spark.sql.AnalysisException: Resolved attribute(s) colName#390 missing from listofcolumns

2021-04-12 Thread unical1988 (Jira)
unical1988 created SPARK-35038:
--

 Summary: org.apache.spark.sql.AnalysisException: Resolved 
attribute(s) colName#390 missing from listofcolumns
 Key: SPARK-35038
 URL: https://issues.apache.org/jira/browse/SPARK-35038
 Project: Spark
  Issue Type: Bug
  Components: Build, SQL
Affects Versions: 3.1.1
Reporter: unical1988


I create a column to replace an existing one in my dataframe using
{code:java}
withColumn
{code}
like so:
{noformat}
Dataset df2 = spark.createDataset(data, Encoders.DOUBLE()).toDF("colName1");
datasetRow.withColumn("colName1", df2.col("colName1"));
{noformat}
But the last line of code yields the following:

org.apache.spark.sql.AnalysisException: Resolved attribute(s) colName1#390 
missing from ...

Why ?
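
For context, this exception is the usual symptom of referencing a column that belongs to a different Dataset: withColumn generally needs a column that resolves against the Dataset it is called on. A minimal sketch with assumed data (not taken from the report):

{code:java}
import org.apache.spark.sql.functions.col
import spark.implicits._

// Assumed example data, not from the report.
val df  = Seq(1.0, 2.0, 3.0).toDF("colName1")
val df2 = Seq(10.0, 20.0, 30.0).toDF("colName1")

// Fails with "Resolved attribute(s) ... missing from ...":
// df2.col("colName1") does not belong to df's plan.
// df.withColumn("colName1", df2.col("colName1"))

// Works: the replacement column is derived from df itself.
val replaced = df.withColumn("colName1", col("colName1") * 2)
{code}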



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35031) Port Koalas operations on different frames tests into PySpark

2021-04-12 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-35031:
-
Summary: Port Koalas operations on different frames tests into PySpark  
(was: Port Koalas operations on different DataFrames tests into PySpark)

> Port Koalas operations on different frames tests into PySpark
> -
>
> Key: SPARK-35031
> URL: https://issues.apache.org/jira/browse/SPARK-35031
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> This JIRA aims to port Koalas operations on different DataFrames related unit 
> tests to [PySpark 
> tests|https://github.com/apache/spark/tree/master/python/pyspark/tests].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35031) Port Koalas operations on different frames tests into PySpark

2021-04-12 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-35031:
-
Description: This JIRA aims to port Koalas operations on different frames 
related unit tests to [PySpark 
tests|https://github.com/apache/spark/tree/master/python/pyspark/tests].  (was: 
This JIRA aims to port Koalas operations on different DataFrames related unit 
tests to [PySpark 
tests|https://github.com/apache/spark/tree/master/python/pyspark/tests].)

> Port Koalas operations on different frames tests into PySpark
> -
>
> Key: SPARK-35031
> URL: https://issues.apache.org/jira/browse/SPARK-35031
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> This JIRA aims to port Koalas operations on different frames related unit 
> tests to [PySpark 
> tests|https://github.com/apache/spark/tree/master/python/pyspark/tests].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34989) Improve the performance of mapChildren and withNewChildren methods

2021-04-12 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-34989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319700#comment-17319700
 ] 

Herman van Hövell commented on SPARK-34989:
---

We are seeing some pretty serious query compilation overhead (5-10s) for 
non-trivial (massive) queries. This change tends to cut this in half. It also 
works well for smaller queries. On top of this we are working on a couple of 
other changes that should reduce things even more.

It is a pretty big change to the TreeNode code base. We opted to make it a 
breaking change, so Spark devs won't get lazy and use the old code path. 
However, I do understand that this is a bit annoying for folks who have 
implemented their own TreeNode. In that case they have an escape hatch in the 
form of the TreeNode.legacyWithNewChildren function. Let me know if this 
addresses your concerns.
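
For readers who maintain their own nodes, here is a toy sketch (not Spark's actual TreeNode code) of the template-method split the issue describes: a shared withNewChildren that preserves referential equality, and a per-node withNewChildrenInternal:

{code:java}
// Toy sketch of the pattern only; names mirror the issue, not the real hierarchy.
abstract class Node {
  def children: Seq[Node]

  // Shared wrapper: keep the same instance when children are unchanged,
  // delegate construction to the concrete node otherwise.
  final def withNewChildren(newChildren: Seq[Node]): Node =
    if (newChildren.length == children.length &&
        children.zip(newChildren).forall { case (a, b) => a eq b }) this
    else withNewChildrenInternal(newChildren)

  protected def withNewChildrenInternal(newChildren: Seq[Node]): Node
}

case class Leaf(value: Int) extends Node {
  override def children: Seq[Node] = Nil
  override protected def withNewChildrenInternal(ns: Seq[Node]): Node = this
}

case class Add(left: Node, right: Node) extends Node {
  override def children: Seq[Node] = Seq(left, right)
  // One-liner, analogous to `copy(children = newChildren)` in the issue text.
  override protected def withNewChildrenInternal(ns: Seq[Node]): Node =
    copy(left = ns(0), right = ns(1))
}
{code}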

> Improve the performance of mapChildren and withNewChildren methods
> --
>
> Key: SPARK-34989
> URL: https://issues.apache.org/jira/browse/SPARK-34989
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Ali Afroozeh
>Assignee: Ali Afroozeh
>Priority: Major
> Fix For: 3.2.0
>
>
> One of the main performance bottlenecks in query compilation is 
> overly-generic tree transformation methods, namely {{mapChildren}} and 
> {{withNewChildren}} (defined in {{TreeNode}}). These methods have an 
> overly-generic implementation to iterate over the children and rely on 
> reflection to create new instances. We have observed that, especially for 
> queries with large query plans, a significant amount of CPU cycles are wasted 
> in these methods. In this PR we make these methods more efficient, by 
> delegating the iteration and instantiation to concrete node types. The 
> benchmarks show that we can expect significant performance improvement in 
> total query compilation time in queries with large query plans (from 30-80%) 
> and about 20% on average.
> h4. Problem detail
> The {{mapChildren}} method in {{TreeNode}} is overly generic and costly. To 
> be more specific, this method:
>  * iterates over all the fields of a node using Scala’s product iterator. 
> While the iteration is not reflection-based, thanks to the Scala compiler 
> generating code for {{Product}}, we create many anonymous functions and visit 
> many nested structures (recursive calls).
>  The anonymous functions (presumably compiled to Java anonymous inner 
> classes) also show up quite high on the list in the object allocation 
> profiles, so we are putting unnecessary pressure on GC here.
>  * does a lot of comparisons. Basically for each element returned from the 
> product iterator, we check if it is a child (contained in the list of 
> children) and then transform it. We can avoid that by just iterating over 
> children, but in the current implementation, we need to gather all the fields 
> (only transform the children) so that we can instantiate the object using the 
> reflection.
>  * creates objects using reflection, by delegating to the {{makeCopy}} 
> method, which is several orders of magnitude slower than using the 
> constructor.
> h4. Solution
> The proposed solution in this PR is rather straightforward: we rewrite the 
> {{mapChildren}} method using the {{children}} and {{withNewChildren}} 
> methods. The default {{withNewChildren}} method suffers from the same 
> problems as {{mapChildren}} and we need to make it more efficient by 
> specializing it in concrete classes. Similar to how each concrete query plan 
> node already defines its children, it should also define how they can be 
> constructed given a new list of children. Actually, the implementation is 
> quite simple in most cases and is a one-liner thanks to the copy method 
> present in Scala case classes. Note that we cannot abstract over the copy 
> method, it’s generated by the compiler for case classes if no other type 
> higher in the hierarchy defines it. For most concrete nodes, the 
> implementation of {{withNewChildren}} looks like this:
>  
> {{override def withNewChildren(newChildren: Seq[LogicalPlan]): LogicalPlan = 
> copy(children = newChildren)}}
> The current {{withNewChildren}} method has two properties that we should 
> preserve:
>  * It returns the same instance if the provided children are the same as its 
> children, i.e., it preserves referential equality.
>  * It copies tags and maintains the origin links when a new copy is created.
> These properties are hard to enforce in the concrete node type 
> implementation. Therefore, we propose a template method 
> {{withNewChildrenInternal}} that should be rewritten by the concrete classes 
> and let the {{withNewChildren}} method take care of referential 

[jira] [Commented] (SPARK-35037) Recognize sign before the interval string in literals

2021-04-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319698#comment-17319698
 ] 

Apache Spark commented on SPARK-35037:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/32134

> Recognize sign before the interval string in literals
> -
>
> Key: SPARK-35037
> URL: https://issues.apache.org/jira/browse/SPARK-35037
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> According to the SQL standard:
> {code:java}
> <interval literal> ::=
>   INTERVAL [ <sign> ] <interval string> <interval qualifier>
> <interval string> ::=
>   <quote> <unquoted interval string> <quote>
> <unquoted interval string> ::=
>   [ <sign> ] { <year-month literal> | <day-time literal> }
> <sign> ::=
>   <plus sign>
>   | <minus sign>
> {code}
> but the parsing fails:
> {code:java}
> spark-sql> select interval -'1-1' year to month;
> Error in query:
> mismatched input 'to' expecting {<EOF>, ';'}(line 1, pos 28)
> == SQL ==
> select interval -'1-1' year to month
> ^^^
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35037) Recognize sign before the interval string in literals

2021-04-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319696#comment-17319696
 ] 

Apache Spark commented on SPARK-35037:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/32134

> Recognize sign before the interval string in literals
> -
>
> Key: SPARK-35037
> URL: https://issues.apache.org/jira/browse/SPARK-35037
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> According to the SQL standard:
> {code:java}
> <interval literal> ::=
>   INTERVAL [ <sign> ] <interval string> <interval qualifier>
> <interval string> ::=
>   <quote> <unquoted interval string> <quote>
> <unquoted interval string> ::=
>   [ <sign> ] { <year-month literal> | <day-time literal> }
> <sign> ::=
>   <plus sign>
>   | <minus sign>
> {code}
> but the parsing fails:
> {code:java}
> spark-sql> select interval -'1-1' year to month;
> Error in query:
> mismatched input 'to' expecting {<EOF>, ';'}(line 1, pos 28)
> == SQL ==
> select interval -'1-1' year to month
> ^^^
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35037) Recognize sign before the interval string in literals

2021-04-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35037:


Assignee: Max Gekk  (was: Apache Spark)

> Recognize sign before the interval string in literals
> -
>
> Key: SPARK-35037
> URL: https://issues.apache.org/jira/browse/SPARK-35037
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> According to the SQL standard:
> {code:java}
> <interval literal> ::=
>   INTERVAL [ <sign> ] <interval string> <interval qualifier>
> <interval string> ::=
>   <quote> <unquoted interval string> <quote>
> <unquoted interval string> ::=
>   [ <sign> ] { <year-month literal> | <day-time literal> }
> <sign> ::=
>   <plus sign>
>   | <minus sign>
> {code}
> but the parsing fails:
> {code:java}
> spark-sql> select interval -'1-1' year to month;
> Error in query:
> mismatched input 'to' expecting {<EOF>, ';'}(line 1, pos 28)
> == SQL ==
> select interval -'1-1' year to month
> ^^^
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35037) Recognize sign before the interval string in literals

2021-04-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35037:


Assignee: Apache Spark  (was: Max Gekk)

> Recognize sign before the interval string in literals
> -
>
> Key: SPARK-35037
> URL: https://issues.apache.org/jira/browse/SPARK-35037
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> According to the SQL standard:
> {code:java}
> <interval literal> ::=
>   INTERVAL [ <sign> ] <interval string> <interval qualifier>
> <interval string> ::=
>   <quote> <unquoted interval string> <quote>
> <unquoted interval string> ::=
>   [ <sign> ] { <year-month literal> | <day-time literal> }
> <sign> ::=
>   <plus sign>
>   | <minus sign>
> {code}
> but the parsing fails:
> {code:java}
> spark-sql> select interval -'1-1' year to month;
> Error in query:
> mismatched input 'to' expecting {<EOF>, ';'}(line 1, pos 28)
> == SQL ==
> select interval -'1-1' year to month
> ^^^
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35037) Recognize sign before the interval string in literals

2021-04-12 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-35037:
-
Summary: Recognize sign before the interval string in literals  (was: 
Recognize '-' before the interval string in literals)

> Recognize sign before the interval string in literals
> -
>
> Key: SPARK-35037
> URL: https://issues.apache.org/jira/browse/SPARK-35037
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> According to the SQL standard:
> {code:java}
> <interval literal> ::=
>   INTERVAL [ <sign> ] <interval string> <interval qualifier>
> <interval string> ::=
>   <quote> <unquoted interval string> <quote>
> <unquoted interval string> ::=
>   [ <sign> ] { <year-month literal> | <day-time literal> }
> <sign> ::=
>   <plus sign>
>   | <minus sign>
> {code}
> but the parsing fails:
> {code:java}
> spark-sql> select interval -'1-1' year to month;
> Error in query:
> mismatched input 'to' expecting {<EOF>, ';'}(line 1, pos 28)
> == SQL ==
> select interval -'1-1' year to month
> ^^^
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35037) Recognize '-' before the interval string in literals

2021-04-12 Thread Max Gekk (Jira)
Max Gekk created SPARK-35037:


 Summary: Recognize '-' before the interval string in literals
 Key: SPARK-35037
 URL: https://issues.apache.org/jira/browse/SPARK-35037
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Max Gekk
Assignee: Max Gekk


According to the SQL standard:

{code:java}
<interval literal> ::=
  INTERVAL [ <sign> ] <interval string> <interval qualifier>
<interval string> ::=
  <quote> <unquoted interval string> <quote>
<unquoted interval string> ::=
  [ <sign> ] { <year-month literal> | <day-time literal> }
<sign> ::=
  <plus sign>
  | <minus sign>
{code}
but the parsing fails:

{code:java}
spark-sql> select interval -'1-1' year to month;
Error in query:
mismatched input 'to' expecting {<EOF>, ';'}(line 1, pos 28)

== SQL ==
select interval -'1-1' year to month
^^^
{code}
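
As a side note grounded in the grammar quoted above, the sign is already accepted inside the interval string; the ticket is about the sign written before the string. A small sketch, assuming a spark-shell session:

{code:java}
// Sign inside the quoted string (per <unquoted interval string>) parses today;
// the failing form is the sign written before the string, e.g. INTERVAL -'1-1'.
spark.sql("SELECT INTERVAL '-1-1' YEAR TO MONTH").show()
{code}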





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35031) Port Koalas operations on different DataFrames tests into PySpark

2021-04-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35031:


Assignee: (was: Apache Spark)

> Port Koalas operations on different DataFrames tests into PySpark
> -
>
> Key: SPARK-35031
> URL: https://issues.apache.org/jira/browse/SPARK-35031
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> This JIRA aims to port Koalas operations on different DataFrames related unit 
> tests to [PySpark 
> tests|https://github.com/apache/spark/tree/master/python/pyspark/tests].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35031) Port Koalas operations on different DataFrames tests into PySpark

2021-04-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35031:


Assignee: Apache Spark

> Port Koalas operations on different DataFrames tests into PySpark
> -
>
> Key: SPARK-35031
> URL: https://issues.apache.org/jira/browse/SPARK-35031
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>
> This JIRA aims to port Koalas operations on different DataFrames related unit 
> tests to [PySpark 
> tests|https://github.com/apache/spark/tree/master/python/pyspark/tests].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35031) Port Koalas operations on different DataFrames tests into PySpark

2021-04-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319685#comment-17319685
 ] 

Apache Spark commented on SPARK-35031:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/32133

> Port Koalas operations on different DataFrames tests into PySpark
> -
>
> Key: SPARK-35031
> URL: https://issues.apache.org/jira/browse/SPARK-35031
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> This JIRA aims to port Koalas operations on different DataFrames related unit 
> tests to [PySpark 
> tests|https://github.com/apache/spark/tree/master/python/pyspark/tests].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35036) Improve push based shuffle to work with AQE by fetching partial map indexes for a reduce partition

2021-04-12 Thread Venkata krishnan Sowrirajan (Jira)
Venkata krishnan Sowrirajan created SPARK-35036:
---

 Summary: Improve push based shuffle to work with AQE by fetching 
partial map indexes for a reduce partition
 Key: SPARK-35036
 URL: https://issues.apache.org/jira/browse/SPARK-35036
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.1.1
Reporter: Venkata krishnan Sowrirajan


Currently, when both push-based shuffle and AQE are enabled and a partial set 
of map indexes is requested from MapOutputTracker, the request is delegated to 
regular shuffle block reading instead of push-based merged-block reading. This 
is because blocks from mappers in push-based shuffle are merged out of order, 
which makes it hard to fetch only the blocks of the reduce partition that match 
the requested start and end map indexes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34989) Improve the performance of mapChildren and withNewChildren methods

2021-04-12 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319679#comment-17319679
 ] 

Thomas Graves commented on SPARK-34989:
---

I just saw this go in; all the performance numbers here are in % improvement.  
What kind of raw times are you seeing for query compilation?

Sounds like a nice improvement. I haven't had any issues with query compilation 
times, so I'm curious what the numbers are and I assume people are seeing issues 
with this? It's a pretty major change to the base TreeNode class, so anyone 
extending it is now broken.

> Improve the performance of mapChildren and withNewChildren methods
> --
>
> Key: SPARK-34989
> URL: https://issues.apache.org/jira/browse/SPARK-34989
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Ali Afroozeh
>Assignee: Ali Afroozeh
>Priority: Major
> Fix For: 3.2.0
>
>
> One of the main performance bottlenecks in query compilation is 
> overly-generic tree transformation methods, namely {{mapChildren}} and 
> {{withNewChildren}} (defined in {{TreeNode}}). These methods have an 
> overly-generic implementation to iterate over the children and rely on 
> reflection to create new instances. We have observed that, especially for 
> queries with large query plans, a significant amount of CPU cycles are wasted 
> in these methods. In this PR we make these methods more efficient, by 
> delegating the iteration and instantiation to concrete node types. The 
> benchmarks show that we can expect significant performance improvement in 
> total query compilation time in queries with large query plans (from 30-80%) 
> and about 20% on average.
> h4. Problem detail
> The {{mapChildren}} method in {{TreeNode}} is overly generic and costly. To 
> be more specific, this method:
>  * iterates over all the fields of a node using Scala’s product iterator. 
> While the iteration is not reflection-based, thanks to the Scala compiler 
> generating code for {{Product}}, we create many anonymous functions and visit 
> many nested structures (recursive calls).
>  The anonymous functions (presumably compiled to Java anonymous inner 
> classes) also show up quite high on the list in the object allocation 
> profiles, so we are putting unnecessary pressure on GC here.
>  * does a lot of comparisons. Basically for each element returned from the 
> product iterator, we check if it is a child (contained in the list of 
> children) and then transform it. We can avoid that by just iterating over 
> children, but in the current implementation, we need to gather all the fields 
> (only transform the children) so that we can instantiate the object using the 
> reflection.
>  * creates objects using reflection, by delegating to the {{makeCopy}} 
> method, which is several orders of magnitude slower than using the 
> constructor.
> h4. Solution
> The proposed solution in this PR is rather straightforward: we rewrite the 
> {{mapChildren}} method using the {{children}} and {{withNewChildren}} 
> methods. The default {{withNewChildren}} method suffers from the same 
> problems as {{mapChildren}} and we need to make it more efficient by 
> specializing it in concrete classes. Similar to how each concrete query plan 
> node already defines its children, it should also define how they can be 
> constructed given a new list of children. Actually, the implementation is 
> quite simple in most cases and is a one-liner thanks to the copy method 
> present in Scala case classes. Note that we cannot abstract over the copy 
> method, it’s generated by the compiler for case classes if no other type 
> higher in the hierarchy defines it. For most concrete nodes, the 
> implementation of {{withNewChildren}} looks like this:
>  
> {{override def withNewChildren(newChildren: Seq[LogicalPlan]): LogicalPlan = 
> copy(children = newChildren)}}
> The current {{withNewChildren}} method has two properties that we should 
> preserve:
>  * It returns the same instance if the provided children are the same as its 
> children, i.e., it preserves referential equality.
>  * It copies tags and maintains the origin links when a new copy is created.
> These properties are hard to enforce in the concrete node type 
> implementation. Therefore, we propose a template method 
> {{withNewChildrenInternal}} that should be rewritten by the concrete classes 
> and let the {{withNewChildren}} method take care of referential equality and 
> copying:
> {{override def withNewChildren(newChildren: Seq[LogicalPlan]): LogicalPlan = 
> {}}
>  {{  if (childrenFastEquals(children, newChildren)) {}}
>  {{    this}}
>  {{  } else {}}
>  {{    CurrentOrigin.withOrigin(origin) {}}
>  {{      val res = withNewChildr

[jira] [Commented] (SPARK-31923) Event log cannot be generated when some internal accumulators use unexpected types

2021-04-12 Thread Jean-Yves STEPHAN (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319652#comment-17319652
 ] 

Jean-Yves STEPHAN commented on SPARK-31923:
---

After further investigation, I confirmed this is a red herring - our end user 
was using Spark 3.0.0 after all. Thanks for checking. 

> Event log cannot be generated when some internal accumulators use unexpected 
> types
> --
>
> Key: SPARK-31923
> URL: https://issues.apache.org/jira/browse/SPARK-31923
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, 
> 2.4.3, 2.4.4, 2.4.5, 2.4.6
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.4.7, 3.0.1, 3.1.0
>
>
> A user may use internal accumulators by adding the "internal.metrics." prefix 
> to the accumulator name to hide sensitive information from UI (Accumulators 
> except internal ones will be shown in Spark UI).
> However, *org.apache.spark.util.JsonProtocol.accumValueToJson* assumes an 
> internal accumulator has only 3 possible types: int, long, and 
> java.util.List[(BlockId, BlockStatus)]. When an internal accumulator uses an 
> unexpected type, it will crash.
> An event log that contains such accumulator will be dropped because it cannot 
> be converted to JSON, and it will cause weird UI issue when rendering in 
> Spark History Server. For example, if `SparkListenerTaskEnd` is dropped 
> because of this issue, the user will see the task is still running even if it 
> was finished.
> It's better to make *accumValueToJson* more robust.
> 
> How to reproduce it:
> - Enable Spark event log
> - Run the following command:
> {code}
> scala> val accu = sc.doubleAccumulator("internal.metrics.foo")
> accu: org.apache.spark.util.DoubleAccumulator = DoubleAccumulator(id: 0, 
> name: Some(internal.metrics.foo), value: 0.0)
> scala> sc.parallelize(1 to 1, 1).foreach { _ => accu.add(1.0) }
> 20/06/06 16:11:27 ERROR AsyncEventQueue: Listener EventLoggingListener threw 
> an exception
> java.lang.ClassCastException: java.lang.Double cannot be cast to 
> java.util.List
>   at 
> org.apache.spark.util.JsonProtocol$.accumValueToJson(JsonProtocol.scala:330)
>   at 
> org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$3.apply(JsonProtocol.scala:306)
>   at 
> org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$3.apply(JsonProtocol.scala:306)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.util.JsonProtocol$.accumulableInfoToJson(JsonProtocol.scala:306)
>   at 
> org.apache.spark.util.JsonProtocol$$anonfun$accumulablesToJson$2.apply(JsonProtocol.scala:299)
>   at 
> org.apache.spark.util.JsonProtocol$$anonfun$accumulablesToJson$2.apply(JsonProtocol.scala:299)
>   at scala.collection.immutable.List.map(List.scala:284)
>   at 
> org.apache.spark.util.JsonProtocol$.accumulablesToJson(JsonProtocol.scala:299)
>   at 
> org.apache.spark.util.JsonProtocol$.taskInfoToJson(JsonProtocol.scala:291)
>   at 
> org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:145)
>   at 
> org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:76)
>   at 
> org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:138)
>   at 
> org.apache.spark.scheduler.EventLoggingListener.onTaskEnd(EventLoggingListener.scala:158)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
>   at 
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$super$postToAll(AsyncEventQueue.scala:92)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply$mcJ$sp(AsyncEventQueue.scala:92)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:87)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.ap

[jira] [Commented] (SPARK-31923) Event log cannot be generated when some internal accumulators use unexpected types

2021-04-12 Thread Shixiong Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319631#comment-17319631
 ] 

Shixiong Zhu commented on SPARK-31923:
--

Do you have a reproduction? It's weird to see 
`java.util.Collections$SynchronizedSet cannot be cast to java.util.List` since 
before calling `v.asScala.toList`, the pattern match `case v: 
java.util.List[_]` should not accept `java.util.Collections$SynchronizedSet`.
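
A minimal sketch (not the actual JsonProtocol code) of that pattern match, just to illustrate that a synchronized Set should fall through to the default case rather than reach the List branch:

{code:java}
import java.util.{ArrayList, Collections, HashSet}

// Simplified shape of the match in accumValueToJson: only a real
// java.util.List should reach the list branch.
def branch(v: Any): String = v match {
  case _: java.lang.Integer => "int"
  case _: java.lang.Long    => "long"
  case _: java.util.List[_] => "list"
  case _                    => "other"
}

branch(new ArrayList[Int]())                                // "list"
branch(Collections.synchronizedSet(new HashSet[String]()))  // "other"
{code}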

> Event log cannot be generated when some internal accumulators use unexpected 
> types
> --
>
> Key: SPARK-31923
> URL: https://issues.apache.org/jira/browse/SPARK-31923
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, 
> 2.4.3, 2.4.4, 2.4.5, 2.4.6
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.4.7, 3.0.1, 3.1.0
>
>
> A user may use internal accumulators by adding the "internal.metrics." prefix 
> to the accumulator name to hide sensitive information from UI (Accumulators 
> except internal ones will be shown in Spark UI).
> However, *org.apache.spark.util.JsonProtocol.accumValueToJson* assumes an 
> internal accumulator has only 3 possible types: int, long, and 
> java.util.List[(BlockId, BlockStatus)]. When an internal accumulator uses an 
> unexpected type, it will crash.
> An event log that contains such accumulator will be dropped because it cannot 
> be converted to JSON, and it will cause weird UI issue when rendering in 
> Spark History Server. For example, if `SparkListenerTaskEnd` is dropped 
> because of this issue, the user will see the task is still running even if it 
> was finished.
> It's better to make *accumValueToJson* more robust.
> 
> How to reproduce it:
> - Enable Spark event log
> - Run the following command:
> {code}
> scala> val accu = sc.doubleAccumulator("internal.metrics.foo")
> accu: org.apache.spark.util.DoubleAccumulator = DoubleAccumulator(id: 0, 
> name: Some(internal.metrics.foo), value: 0.0)
> scala> sc.parallelize(1 to 1, 1).foreach { _ => accu.add(1.0) }
> 20/06/06 16:11:27 ERROR AsyncEventQueue: Listener EventLoggingListener threw 
> an exception
> java.lang.ClassCastException: java.lang.Double cannot be cast to 
> java.util.List
>   at 
> org.apache.spark.util.JsonProtocol$.accumValueToJson(JsonProtocol.scala:330)
>   at 
> org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$3.apply(JsonProtocol.scala:306)
>   at 
> org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$3.apply(JsonProtocol.scala:306)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.util.JsonProtocol$.accumulableInfoToJson(JsonProtocol.scala:306)
>   at 
> org.apache.spark.util.JsonProtocol$$anonfun$accumulablesToJson$2.apply(JsonProtocol.scala:299)
>   at 
> org.apache.spark.util.JsonProtocol$$anonfun$accumulablesToJson$2.apply(JsonProtocol.scala:299)
>   at scala.collection.immutable.List.map(List.scala:284)
>   at 
> org.apache.spark.util.JsonProtocol$.accumulablesToJson(JsonProtocol.scala:299)
>   at 
> org.apache.spark.util.JsonProtocol$.taskInfoToJson(JsonProtocol.scala:291)
>   at 
> org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:145)
>   at 
> org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:76)
>   at 
> org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:138)
>   at 
> org.apache.spark.scheduler.EventLoggingListener.onTaskEnd(EventLoggingListener.scala:158)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
>   at 
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$super$postToAll(AsyncEventQueue.scala:92)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply$mcJ$sp(AsyncEventQueue.scala:92)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEvent

[jira] [Comment Edited] (SPARK-31923) Event log cannot be generated when some internal accumulators use unexpected types

2021-04-12 Thread Jean-Yves STEPHAN (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319622#comment-17319622
 ] 

Jean-Yves STEPHAN edited comment on SPARK-31923 at 4/12/21, 6:01 PM:
-

Hi [~zsxwing] :) 

Despite your patch, we're running into the same issue on Spark 3.0.1. The 
stack trace is informative (for the line numbers, refer to this file: 
[https://github.com/apache/spark/blob/v3.0.1/core/src/main/scala/org/apache/spark/util/JsonProtocol.scala#L354]).

The problem is that we're handed a value of class 
java.util.Collections$SynchronizedSet that enters the branch
{code:java}
case v: java.util.List[_] =>
{code}
but the cast v.asScala.toList on the next line fails.

 
Question: is there a workaround available through Spark configuration, e.g. a 
flag to disable collecting these metrics here?



{code}
21/04/12 15:11:40 ERROR AsyncEventQueue: Listener EventLoggingListener threw an exception
java.lang.ClassCastException: java.util.Collections$SynchronizedSet cannot be cast to java.util.List
  at org.apache.spark.util.JsonProtocol$.accumValueToJson(JsonProtocol.scala:355)
  at org.apache.spark.util.JsonProtocol$.$anonfun$accumulableInfoToJson$4(JsonProtocol.scala:331)
  at scala.Option.map(Option.scala:230)
  at org.apache.spark.util.JsonProtocol$.accumulableInfoToJson(JsonProtocol.scala:331)
  at org.apache.spark.util.JsonProtocol$.$anonfun$accumulablesToJson$3(JsonProtocol.scala:324)
  at scala.collection.immutable.List.map(List.scala:290)
  at org.apache.spark.util.JsonProtocol$.accumulablesToJson(JsonProtocol.scala:324)
  at org.apache.spark.util.JsonProtocol$.taskInfoToJson(JsonProtocol.scala:316)
  at org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:151)
  at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:79)
  at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:97)
  at org.apache.spark.scheduler.EventLoggingListener.onTaskEnd(EventLoggingListener.scala:119)
  at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:45)
  at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
{code}
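
A minimal sketch of two blunt workarounds, assuming no targeted flag exists; 
neither is specific to this bug, both simply avoid the failing serialization 
path (and option 2 only applies if the accumulator is registered by your own 
code):
{code:java}
import org.apache.spark.sql.SparkSession

// Option 1: disable the event log so JsonProtocol is never asked to serialize
// the problematic accumulator value (at the cost of losing History Server data).
val spark = SparkSession.builder()
  .appName("event-log-workaround-sketch")
  .config("spark.eventLog.enabled", "false")
  .getOrCreate()

// Option 2: if the accumulator is under your control, avoid the
// "internal.metrics." name prefix so it is serialized as a regular accumulator.
val acc = spark.sparkContext.doubleAccumulator("myapp.metrics.foo")
{code}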

 


was (Author: jystephan):
Hi [~zsxwing] :) 

Despite your patch, we're running in the same issue while using Spark 3.0.1. 
The stack trace is informative (for the line numbers we need to refer to this 
file 
[https://github.com/apache/spark/blob/v3.0.1/core/src/main/scala/org/apache/spark/util/JsonProtocol.scala#L354]).

The problem is that we're given a class (java.util.Collections$SynchronizedSet) 
that enters the branch
{code:java}
case v: java.util.List[_] =>
{code}
but on the next line the cast v.asScala.toList fails. 

 

```

21/04/12 15:11:40 ERROR AsyncEventQueue: Listener EventLoggingListener threw an 
exception

java.lang.ClassCastException: java.util.Collections$SynchronizedSet cannot be 
cast to java.util.List at 
org.apache.spark.util.JsonProtocol$.accumValueToJson(JsonProtocol.scala:355) at 
org.apache.spark.util.JsonProtocol$.$anonfun$accumulableInfoToJson$4(JsonProtocol.scala:331)
 at scala.Option.map(Option.scala:230) at 
org.apache.spark.util.JsonProtocol$.accumulableInfoToJson(JsonProtocol.scala:331)
 at 
org.apache.spark.util.JsonProtocol$.$anonfun$accumulablesToJson$3(JsonProtocol.scala:324)
 at scala.collection.immutable.List.map(List.scala:290) at 
org.apache.spark.util.JsonProtocol$.accumulablesToJson(JsonProtocol.scala:324) 
at org.apache.spark.util.JsonProtocol$.taskInfoToJson(JsonProtocol.scala:316) 
at org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:151) at 
org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:79) at 
org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:97)
 at 
org.apache.spark.scheduler.EventLoggingListener.onTaskEnd(EventLoggingListener.scala:119)
 at 
org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:45)
 at 
org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
 

```

 

> Event log cannot be generated when some internal accumulators use unexpected 
> types
> --
>
> Key: SPARK-31923
> URL: https://issues.apache.org/jira/browse/SPARK-31923
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, 
> 2.4.3, 2.4.4, 2.4.5, 2.4.6
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.4.7, 3.0.1, 3.1.0
>
>
> A user may use internal accumulators by adding the "internal.metrics." prefix 
> to the accumulator name to hide sensitive information from UI (Accumulators 
> except internal ones

[jira] [Comment Edited] (SPARK-31923) Event log cannot be generated when some internal accumulators use unexpected types

2021-04-12 Thread Jean-Yves STEPHAN (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319622#comment-17319622
 ] 

Jean-Yves STEPHAN edited comment on SPARK-31923 at 4/12/21, 5:45 PM:
-

Hi [~zsxwing] :) 

Despite your patch, we're running into the same issue on Spark 3.0.1. The 
stack trace is informative (for the line numbers, refer to this file: 
[https://github.com/apache/spark/blob/v3.0.1/core/src/main/scala/org/apache/spark/util/JsonProtocol.scala#L354]).

The problem is that we're handed a value of class 
java.util.Collections$SynchronizedSet that enters the branch
{code:java}
case v: java.util.List[_] =>
{code}
but the cast v.asScala.toList on the next line fails.

 

{code}
21/04/12 15:11:40 ERROR AsyncEventQueue: Listener EventLoggingListener threw an exception
java.lang.ClassCastException: java.util.Collections$SynchronizedSet cannot be cast to java.util.List
  at org.apache.spark.util.JsonProtocol$.accumValueToJson(JsonProtocol.scala:355)
  at org.apache.spark.util.JsonProtocol$.$anonfun$accumulableInfoToJson$4(JsonProtocol.scala:331)
  at scala.Option.map(Option.scala:230)
  at org.apache.spark.util.JsonProtocol$.accumulableInfoToJson(JsonProtocol.scala:331)
  at org.apache.spark.util.JsonProtocol$.$anonfun$accumulablesToJson$3(JsonProtocol.scala:324)
  at scala.collection.immutable.List.map(List.scala:290)
  at org.apache.spark.util.JsonProtocol$.accumulablesToJson(JsonProtocol.scala:324)
  at org.apache.spark.util.JsonProtocol$.taskInfoToJson(JsonProtocol.scala:316)
  at org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:151)
  at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:79)
  at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:97)
  at org.apache.spark.scheduler.EventLoggingListener.onTaskEnd(EventLoggingListener.scala:119)
  at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:45)
  at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
{code}

 


was (Author: jystephan):
Hi [~zsxwing] :) 



Despite your patch, we're running in the same issue while using Spark 3.0.1. 
The stack trace is informative (for the line numbers we need to refer to this 
file 
https://github.com/apache/spark/blob/v3.0.1/core/src/main/scala/org/apache/spark/util/JsonProtocol.scala#L354).

The problem is that we're given a class (java.util.Collections$SynchronizedSet) 
that enters the branch
{code:java}
case v: java.util.List[_] =>
{code}
but on the next line the cast v.asScala.toList fails. 

 

```

21/04/12 15:11:40 ERROR AsyncEventQueue: Listener EventLoggingListener threw an 
exception21/04/12 15:11:40 ERROR AsyncEventQueue: Listener EventLoggingListener 
threw an exceptionjava.lang.ClassCastException: 
java.util.Collections$SynchronizedSet cannot be cast to java.util.List at 
org.apache.spark.util.JsonProtocol$.accumValueToJson(JsonProtocol.scala:355) at 
org.apache.spark.util.JsonProtocol$.$anonfun$accumulableInfoToJson$4(JsonProtocol.scala:331)
 at scala.Option.map(Option.scala:230) at 
org.apache.spark.util.JsonProtocol$.accumulableInfoToJson(JsonProtocol.scala:331)
 at 
org.apache.spark.util.JsonProtocol$.$anonfun$accumulablesToJson$3(JsonProtocol.scala:324)
 at scala.collection.immutable.List.map(List.scala:290) at 
org.apache.spark.util.JsonProtocol$.accumulablesToJson(JsonProtocol.scala:324) 
at org.apache.spark.util.JsonProtocol$.taskInfoToJson(JsonProtocol.scala:316) 
at org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:151) at 
org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:79) at 
org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:97)
 at 
org.apache.spark.scheduler.EventLoggingListener.onTaskEnd(EventLoggingListener.scala:119)
 at 
org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:45)
 at 
org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
 at 
org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
 at 
org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
 at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:115) at 
org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:99) at 
org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
 at 
org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
 at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) at 
scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) at 
org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
 at 
org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
 a

[jira] [Commented] (SPARK-31923) Event log cannot be generated when some internal accumulators use unexpected types

2021-04-12 Thread Jean-Yves STEPHAN (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319622#comment-17319622
 ] 

Jean-Yves STEPHAN commented on SPARK-31923:
---

Hi [~zsxwing] :) 



Despite your patch, we're running into the same issue on Spark 3.0.1. The 
stack trace is informative (for the line numbers, refer to this file: 
https://github.com/apache/spark/blob/v3.0.1/core/src/main/scala/org/apache/spark/util/JsonProtocol.scala#L354).

The problem is that we're handed a value of class 
java.util.Collections$SynchronizedSet that enters the branch
{code:java}
case v: java.util.List[_] =>
{code}
but the cast v.asScala.toList on the next line fails.

 

{code}
21/04/12 15:11:40 ERROR AsyncEventQueue: Listener EventLoggingListener threw an exception
java.lang.ClassCastException: java.util.Collections$SynchronizedSet cannot be cast to java.util.List
  at org.apache.spark.util.JsonProtocol$.accumValueToJson(JsonProtocol.scala:355)
  at org.apache.spark.util.JsonProtocol$.$anonfun$accumulableInfoToJson$4(JsonProtocol.scala:331)
  at scala.Option.map(Option.scala:230)
  at org.apache.spark.util.JsonProtocol$.accumulableInfoToJson(JsonProtocol.scala:331)
  at org.apache.spark.util.JsonProtocol$.$anonfun$accumulablesToJson$3(JsonProtocol.scala:324)
  at scala.collection.immutable.List.map(List.scala:290)
  at org.apache.spark.util.JsonProtocol$.accumulablesToJson(JsonProtocol.scala:324)
  at org.apache.spark.util.JsonProtocol$.taskInfoToJson(JsonProtocol.scala:316)
  at org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:151)
  at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:79)
  at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:97)
  at org.apache.spark.scheduler.EventLoggingListener.onTaskEnd(EventLoggingListener.scala:119)
  at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:45)
  at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
  at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
  at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
  at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:115)
  at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:99)
  at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
  at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
  at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
  at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
  at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
  at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
  at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1319)
  at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
{code}

 

> Event log cannot be generated when some internal accumulators use unexpected 
> types
> --
>
> Key: SPARK-31923
> URL: https://issues.apache.org/jira/browse/SPARK-31923
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, 
> 2.4.3, 2.4.4, 2.4.5, 2.4.6
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.4.7, 3.0.1, 3.1.0
>
>
> A user may use internal accumulators by adding the "internal.metrics." prefix 
> to the accumulator name to hide sensitive information from UI (Accumulators 
> except internal ones will be shown in Spark UI).
> However, *org.apache.spark.util.JsonProtocol.accumValueToJson* assumes an 
> internal accumulator has only 3 possible types: int, long, and 
> java.util.List[(BlockId, BlockStatus)]. When an internal accumulator uses an 
> unexpected type, it will crash.
> An event log that contains such accumulator will be dropped because it cannot 
> be converted to JSON, and it will cause weird UI issue when rendering in 
> Spark History Server. For example, if `SparkListenerTaskEnd` is dropped 
> because of this issue, the user will see the task is still running even if it 
> was finished.
> It's better to make *accumValueToJson* more robust.
> 
> How to reproduce it:
> - Enable Spark event log
> - Run the following command:
> {code}
> scala> val accu = sc.doubleAccumulator("internal.metrics.foo")
> accu: org.apache.spark.util.DoubleAccumulator = DoubleAccumulator(id: 0, 
> name: Some(internal.metrics.foo), value: 0.0)
>

[jira] [Created] (SPARK-35035) Port Koalas internal implementation unit tests into PySpark

2021-04-12 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-35035:


 Summary: Port Koalas internal implementation unit tests into 
PySpark
 Key: SPARK-35035
 URL: https://issues.apache.org/jira/browse/SPARK-35035
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Xinrong Meng


This JIRA aims to port Koalas internal implementation related unit tests to 
[PySpark 
tests|https://github.com/apache/spark/tree/master/python/pyspark/tests].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35034) Port Koalas miscellaneous unit tests into PySpark

2021-04-12 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-35034:


 Summary: Port Koalas miscellaneous unit tests into PySpark
 Key: SPARK-35034
 URL: https://issues.apache.org/jira/browse/SPARK-35034
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Xinrong Meng


This JIRA aims to port Koalas miscellaneous unit tests to [PySpark 
tests|https://github.com/apache/spark/tree/master/python/pyspark/tests].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35033) Port Koalas plot unit tests into PySpark

2021-04-12 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-35033:


 Summary: Port Koalas plot unit tests into PySpark
 Key: SPARK-35033
 URL: https://issues.apache.org/jira/browse/SPARK-35033
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Xinrong Meng


This JIRA aims to port Koalas plot unit tests to [PySpark 
tests|https://github.com/apache/spark/tree/master/python/pyspark/tests].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35032) Port Koalas Index unit tests into PySpark

2021-04-12 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-35032:


 Summary: Port Koalas Index unit tests into PySpark
 Key: SPARK-35032
 URL: https://issues.apache.org/jira/browse/SPARK-35032
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Xinrong Meng


This JIRA aims to port Koalas Index unit tests to [PySpark 
tests|https://github.com/apache/spark/tree/master/python/pyspark/tests].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34593) Preserve broadcast nested loop join output partitioning and ordering

2021-04-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319554#comment-17319554
 ] 

Apache Spark commented on SPARK-34593:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/32132

> Preserve broadcast nested loop join output partitioning and ordering
> 
>
> Key: SPARK-34593
> URL: https://issues.apache.org/jira/browse/SPARK-34593
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Minor
> Fix For: 3.2.0
>
>
> `BroadcastNestedLoopJoinExec` does not preserve `outputPartitioning` and 
> `outputOrdering` right now. But it can preserve the streamed side 
> partitioning and ordering when possible. This can help avoid shuffle and sort 
> in later stage, if there's join and aggregation in the query.
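
A hypothetical query shape that this would help, sketched below (the data and 
names are made up, not taken from the ticket): a non-equi broadcast join whose 
streamed side is already partitioned by the grouping key, followed by an 
aggregation on that key.
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().master("local[*]").appName("bnlj-sketch").getOrCreate()
import spark.implicits._

// Streamed side, pre-partitioned by the key we will aggregate on.
val streamed = spark.range(0, 1000000).toDF("id").repartition($"id")
val small = Seq(10L, 100L).toDF("threshold")

// A non-equi condition plus the broadcast hint yields BroadcastNestedLoopJoinExec.
val joined = streamed.join(broadcast(small), $"id" > $"threshold")

// If the join preserved the streamed side's outputPartitioning, this groupBy
// could avoid an extra exchange; compare the plans before and after the change.
joined.groupBy($"id").count().explain()
{code}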



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35031) Port Koalas operations on different DataFrames tests into PySpark

2021-04-12 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-35031:


 Summary: Port Koalas operations on different DataFrames tests into 
PySpark
 Key: SPARK-35031
 URL: https://issues.apache.org/jira/browse/SPARK-35031
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Xinrong Meng


This JIRA aims to port Koalas operations on different DataFrames related unit 
tests to [PySpark 
tests|https://github.com/apache/spark/tree/master/python/pyspark/tests].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34593) Preserve broadcast nested loop join output partitioning and ordering

2021-04-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319553#comment-17319553
 ] 

Apache Spark commented on SPARK-34593:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/32132

> Preserve broadcast nested loop join output partitioning and ordering
> 
>
> Key: SPARK-34593
> URL: https://issues.apache.org/jira/browse/SPARK-34593
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Minor
> Fix For: 3.2.0
>
>
> `BroadcastNestedLoopJoinExec` does not preserve `outputPartitioning` and 
> `outputOrdering` right now. But it can preserve the streamed side 
> partitioning and ordering when possible. This can help avoid shuffle and sort 
> in later stage, if there's join and aggregation in the query.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35012) Port Koalas DataFrame related unit tests into PySpark

2021-04-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319544#comment-17319544
 ] 

Apache Spark commented on SPARK-35012:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/32131

> Port Koalas DataFrame related unit tests into PySpark
> -
>
> Key: SPARK-35012
> URL: https://issues.apache.org/jira/browse/SPARK-35012
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> This JIRA aims to port Koalas DataFrame related unit tests to [PySpark 
> tests|https://github.com/apache/spark/tree/master/python/pyspark/tests].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35012) Port Koalas DataFrame related unit tests into PySpark

2021-04-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35012:


Assignee: (was: Apache Spark)

> Port Koalas DataFrame related unit tests into PySpark
> -
>
> Key: SPARK-35012
> URL: https://issues.apache.org/jira/browse/SPARK-35012
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> This JIRA aims to port Koalas DataFrame related unit tests to [PySpark 
> tests|https://github.com/apache/spark/tree/master/python/pyspark/tests].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35012) Port Koalas DataFrame related unit tests into PySpark

2021-04-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319542#comment-17319542
 ] 

Apache Spark commented on SPARK-35012:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/32131

> Port Koalas DataFrame related unit tests into PySpark
> -
>
> Key: SPARK-35012
> URL: https://issues.apache.org/jira/browse/SPARK-35012
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> This JIRA aims to port Koalas DataFrame related unit tests to [PySpark 
> tests|https://github.com/apache/spark/tree/master/python/pyspark/tests].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35012) Port Koalas DataFrame related unit tests into PySpark

2021-04-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35012:


Assignee: Apache Spark

> Port Koalas DataFrame related unit tests into PySpark
> -
>
> Key: SPARK-35012
> URL: https://issues.apache.org/jira/browse/SPARK-35012
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>
> This JIRA aims to port Koalas DataFrame related unit tests to [PySpark 
> tests|https://github.com/apache/spark/tree/master/python/pyspark/tests].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11844) can not read class org.apache.parquet.format.PageHeader: don't know what type: 13

2021-04-12 Thread Nick Hryhoriev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-11844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17317066#comment-17317066
 ] 

Nick Hryhoriev edited comment on SPARK-11844 at 4/12/21, 3:05 PM:
--

I have experienced the same issue with Spark 2.4.7 and 3.1.1, with Spark 
Structured Streaming FileSink and DeltaSink, writing to S3 via the S3A API.
 In this specific case the writer used S3 multipart upload, which is hidden 
from the Spark developer inside the S3A API (hadoop-aws).
 Configuration for s3a:

.set("spark.hadoop.fs.s3a.threads.max", "128")
 .set("spark.hadoop.fs.s3a.connection.maximum", "500")
 .set("spark.hadoop.fs.s3a.max.total.tasks", "2500")
 .set("spark.hadoop.fs.s3a.multipart.threshold", "104857600")
 .set("spark.hadoop.fs.s3a.multipart.size", "104857600")

"spark.hadoop.fs.s3a.fast.upload": "true",
 "spark.hadoop.fs.s3a.fast.upload.buffer": "bytebuffer",
 "spark.hadoop.fs.s3a.fast.upload.active.blocks": "4"

It has happened twice in production over 5 months.
 I have not been able to reproduce it: we write near 1 files every hour, files 
are added every 10 minutes, and only two were damaged.

One file fails with PageHeader: Null exception.
 The other fails with PageHeader: unknown type -15.
 The issue is always reproducible on these files.
 The footer of each file is OK, which means I can read the schema and run 
operations that only need statistics, such as count.

The data itself does not affect this: we have a backup of the data and 
successfully re-ETLed it, and the new files are fine.

My personal *intuition* is that the problem lies somewhere in the exception 
handling of ParquetWriter + S3aOutputStream.
 Why?
 1. The S3 SDK can either complete or abort a multipart upload, because the 
upload is atomic.
 2. The Hadoop FileSystem API, at the FsDataOutputStream level, can only close 
the stream.
 So in the case of a high-level exception in Spark or Parquet, it simply does 
not close the stream.
 But because these are java.io-style decorator streams that wrap each other, 
maybe somewhere there is a `try ... finally` block that calls `close`, which 
would commit the wrong file and hide the underlying exception.
 It's only a guess; I was not able to confirm it in the code or reproduce it.
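
A minimal sketch of the guessed failure mode, using hypothetical stand-in 
classes rather than the real Hadoop/S3A or Parquet writers:
{code:java}
// Hypothetical illustration only; CommittingStream is a stand-in for any
// stream whose close() effectively "commits" the data written so far.
class CommittingStream {
  def write(b: Array[Byte]): Unit =
    if (b.isEmpty) throw new RuntimeException("mid-write failure")
  def close(): Unit = println("close(): multipart upload completed -> partial file committed")
  def abort(): Unit = println("abort(): multipart upload aborted -> nothing committed")
}

val out = new CommittingStream
try {
  out.write(Array[Byte](1, 2, 3))
  out.write(Array.empty[Byte]) // simulated high-level failure during the write
} finally {
  // A bare `finally { close() }` commits whatever was buffered so far, and if
  // close() itself throws it can mask the original exception. Safer would be
  // to abort on failure and close only on success.
  out.close()
}
{code}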

FYI [~hyukjin.kwon]


was (Author: hryhoriev.nick):
I have experienced the same issue with spark 2.4.7 and 3.1.1. with Spark 
Structure Stream FileSink and DeltaSink.  writing to s3 using S3a API.
 This specific was a writer with the s3 multipart upload, which hidden for 
spark developer in S3a API(Hadoop-AWS).
 Configuration for s3a

.set("spark.hadoop.fs.s3a.threads.max", "128")
 .set("spark.hadoop.fs.s3a.connection.maximum", "500")
 .set("spark.hadoop.fs.s3a.max.total.tasks", "2500")
 .set("spark.hadoop.fs.s3a.multipart.threshold", "104857600")
 .set("spark.hadoop.fs.s3a.multipart.size", "104857600")

"spark.hadoop.fs.s3a.fast.upload": "true",
 "spark.hadoop.fs.s3a.fast.upload.buffer": "bytebuffer",
 "spark.hadoop.fs.s3a.fast.upload.active.blocks": "4"

It happens twice in production for 5 months.
 I was not lucky to reproduce it. 
 Because we write near 1 files every hour. files add every 10 minutes,
 And only two damaged.

One file has PageHeader: Null exception.
 One file has PageHeader: unknown type -15
 And this issue always reproduced on these files.
 The Footer of the file is OK, which means I can read schema and do an 
operation that requires only statistics like count.

Data itself do not affect this. We have a backup of data and successfully re 
ETL this data. New files are ok.

My personal *intuition* suggests that problem somewhere in reception handling 
in ParquetWriter + S3aOutputStream.
 Why? 
 1. S3 SDK can confirm or abort upload. because it's atomic
 2. While Hadoop Files system API on FsDataOutputStream level, can only close 
the stream.
 So in case of high-level exception in spark or parquet, it just does not close 
stream.
 But because it's Stream, which is part of the java.io decorator pattern. 
Multiple streams wrap each other. So maybe somewhere there is a `try finally` 
block which calls `close`.
 Which will commit the wrong file and hide the underlying exception.
 It's only a guess, I was not able to confirm it with code or reproduce it.

the most Strange place which I find is `SparkHadoopWriter` -> line 129.

 
{code:java}
// Write all rows in RDD partition.
try {
 val ret = Utils.tryWithSafeFinallyAndFailureCallbacks {
 while (iterator.hasNext) {
 val pair = iterator.next()
 config.write(pair)

 // Update bytes written metric every few records
 maybeUpdateOutputMetrics(outputMetrics, callback, recordsWritten)
 recordsWritten += 1
 }

 config.closeWriter(taskContext)
 committer.commitTask(taskContext)
 }{code}
And here is the java-doc for tryWithSafeFinallyAndFailureCallbacks.
{code:java}
/**
 * Execute a block of code and call the failure callbacks in the catch block. 
If exceptions occur
 * in either the catch or the finally block, they are appended 
