[jira] [Commented] (SPARK-35383) Improve s3a magic committer support by inferring missing configs

2021-05-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17343017#comment-17343017
 ] 

Apache Spark commented on SPARK-35383:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/32518

> Improve s3a magic committer support by inferring missing configs
> 
>
> Key: SPARK-35383
> URL: https://issues.apache.org/jira/browse/SPARK-35383
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> {code}
> Exception in thread "main" 
> org.apache.hadoop.fs.s3a.commit.PathCommitException: `s3a://my-spark-bucket`: 
> Filesystem does not have support for 'magic' committer enabled in 
> configuration option fs.s3a.committer.magic.enabled
>   at 
> org.apache.hadoop.fs.s3a.commit.CommitUtils.verifyIsMagicCommitFS(CommitUtils.java:74)
>   at 
> org.apache.hadoop.fs.s3a.commit.CommitUtils.getS3AFileSystem(CommitUtils.java:109)
> {code}
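The stack trace above is what the Hadoop S3A committer raises when the magic committer has not been enabled for the target filesystem. For context, below is a rough sketch of the configuration a job currently has to set by hand and which this issue proposes to infer when missing; it assumes the spark-hadoop-cloud module is on the classpath, and the bucket name is only an example:

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch of the settings normally required to use the S3A "magic" committer.
// SPARK-35383 aims to infer the missing pieces instead of failing as above.
val spark = SparkSession.builder()
  .appName("s3a-magic-committer-sketch")
  .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
  .config("spark.hadoop.fs.s3a.committer.name", "magic")
  .config("spark.sql.sources.commitProtocolClass",
    "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
  .config("spark.sql.parquet.output.committer.class",
    "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
  .getOrCreate()

// With the committer enabled, writes to s3a:// paths no longer hit the
// PathCommitException shown above.
spark.range(10).write.mode("overwrite").parquet("s3a://my-spark-bucket/tmp/demo")
{code}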






[jira] [Assigned] (SPARK-35383) Improve s3a magic committer support by inferring missing configs

2021-05-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35383:


Assignee: Apache Spark

> Improve s3a magic committer support by inferring missing configs
> 
>
> Key: SPARK-35383
> URL: https://issues.apache.org/jira/browse/SPARK-35383
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>
> {code}
> Exception in thread "main" 
> org.apache.hadoop.fs.s3a.commit.PathCommitException: `s3a://my-spark-bucket`: 
> Filesystem does not have support for 'magic' committer enabled in 
> configuration option fs.s3a.committer.magic.enabled
>   at 
> org.apache.hadoop.fs.s3a.commit.CommitUtils.verifyIsMagicCommitFS(CommitUtils.java:74)
>   at 
> org.apache.hadoop.fs.s3a.commit.CommitUtils.getS3AFileSystem(CommitUtils.java:109)
> {code}






[jira] [Assigned] (SPARK-35383) Improve s3a magic committer support by inferring missing configs

2021-05-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35383:


Assignee: (was: Apache Spark)

> Improve s3a magic committer support by inferring missing configs
> 
>
> Key: SPARK-35383
> URL: https://issues.apache.org/jira/browse/SPARK-35383
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> {code}
> Exception in thread "main" 
> org.apache.hadoop.fs.s3a.commit.PathCommitException: `s3a://my-spark-bucket`: 
> Filesystem does not have support for 'magic' committer enabled in 
> configuration option fs.s3a.committer.magic.enabled
>   at 
> org.apache.hadoop.fs.s3a.commit.CommitUtils.verifyIsMagicCommitFS(CommitUtils.java:74)
>   at 
> org.apache.hadoop.fs.s3a.commit.CommitUtils.getS3AFileSystem(CommitUtils.java:109)
> {code}






[jira] [Updated] (SPARK-35383) Improve s3a magic committer support by inferring missing configs

2021-05-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-35383:
--
Description: 
{code}
Exception in thread "main" org.apache.hadoop.fs.s3a.commit.PathCommitException: 
`s3a://my-spark-bucket`: Filesystem does not have support for 'magic' committer 
enabled in configuration option fs.s3a.committer.magic.enabled
at 
org.apache.hadoop.fs.s3a.commit.CommitUtils.verifyIsMagicCommitFS(CommitUtils.java:74)
at 
org.apache.hadoop.fs.s3a.commit.CommitUtils.getS3AFileSystem(CommitUtils.java:109)
{code}

> Improve s3a magic committer support by inferring missing configs
> 
>
> Key: SPARK-35383
> URL: https://issues.apache.org/jira/browse/SPARK-35383
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> {code}
> Exception in thread "main" 
> org.apache.hadoop.fs.s3a.commit.PathCommitException: `s3a://my-spark-bucket`: 
> Filesystem does not have support for 'magic' committer enabled in 
> configuration option fs.s3a.committer.magic.enabled
>   at 
> org.apache.hadoop.fs.s3a.commit.CommitUtils.verifyIsMagicCommitFS(CommitUtils.java:74)
>   at 
> org.apache.hadoop.fs.s3a.commit.CommitUtils.getS3AFileSystem(CommitUtils.java:109)
> {code}






[jira] [Created] (SPARK-35383) Improve s3a magic committer support by inferring missing configs

2021-05-11 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-35383:
-

 Summary: Improve s3a magic committer support by inferring missing 
configs
 Key: SPARK-35383
 URL: https://issues.apache.org/jira/browse/SPARK-35383
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.2.0
Reporter: Dongjoon Hyun









[jira] [Commented] (SPARK-35381) Fix lambda variable name issues in nested DataFrame functions in R APIs

2021-05-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17343007#comment-17343007
 ] 

Apache Spark commented on SPARK-35381:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/32517

> Fix lambda variable name issues in nested DataFrame functions in R APIs
> ---
>
> Key: SPARK-35381
> URL: https://issues.apache.org/jira/browse/SPARK-35381
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 3.1.1
>Reporter: Hyukjin Kwon
>Priority: Critical
>  Labels: correctness
>
> R's higher-order functions also have the same problem as SPARK-34794:
> {code}
> df <- sql("SELECT array(1, 2, 3) as numbers, array('a', 'b', 'c') as letters")
> collect(select(
>   df,
>   array_transform("numbers", function(number) {
> array_transform("letters", function(latter) {
>   struct(alias(number, "n"), alias(latter, "l"))
> })
>   })
> ))
> {code}
> {code}
>   transform(numbers, lambdafunction(transform(letters, 
> lambdafunction(struct(namedlambdavariable() AS n, namedlambdavariable() AS 
> l), namedlambdavariable())), namedlambdavariable()))
> 1 
> a, a, b, b, c, c, a, a, 
> b, b, c, c, a, a, b, b, c, c
> {code}






[jira] [Assigned] (SPARK-35381) Fix lambda variable name issues in nested DataFrame functions in R APIs

2021-05-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35381:


Assignee: (was: Apache Spark)

> Fix lambda variable name issues in nested DataFrame functions in R APIs
> ---
>
> Key: SPARK-35381
> URL: https://issues.apache.org/jira/browse/SPARK-35381
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 3.1.1
>Reporter: Hyukjin Kwon
>Priority: Critical
>  Labels: correctness
>
> R's higher-order functions also have the same problem as SPARK-34794:
> {code}
> df <- sql("SELECT array(1, 2, 3) as numbers, array('a', 'b', 'c') as letters")
> collect(select(
>   df,
>   array_transform("numbers", function(number) {
> array_transform("letters", function(latter) {
>   struct(alias(number, "n"), alias(latter, "l"))
> })
>   })
> ))
> {code}
> {code}
>   transform(numbers, lambdafunction(transform(letters, 
> lambdafunction(struct(namedlambdavariable() AS n, namedlambdavariable() AS 
> l), namedlambdavariable())), namedlambdavariable()))
> 1 
> a, a, b, b, c, c, a, a, 
> b, b, c, c, a, a, b, b, c, c
> {code}






[jira] [Commented] (SPARK-35381) Fix lambda variable name issues in nested DataFrame functions in R APIs

2021-05-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17343006#comment-17343006
 ] 

Apache Spark commented on SPARK-35381:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/32517

> Fix lambda variable name issues in nested DataFrame functions in R APIs
> ---
>
> Key: SPARK-35381
> URL: https://issues.apache.org/jira/browse/SPARK-35381
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 3.1.1
>Reporter: Hyukjin Kwon
>Priority: Critical
>  Labels: correctness
>
> R's higher-order functions also have the same problem as SPARK-34794:
> {code}
> df <- sql("SELECT array(1, 2, 3) as numbers, array('a', 'b', 'c') as letters")
> collect(select(
>   df,
>   array_transform("numbers", function(number) {
> array_transform("letters", function(latter) {
>   struct(alias(number, "n"), alias(latter, "l"))
> })
>   })
> ))
> {code}
> {code}
>   transform(numbers, lambdafunction(transform(letters, 
> lambdafunction(struct(namedlambdavariable() AS n, namedlambdavariable() AS 
> l), namedlambdavariable())), namedlambdavariable()))
> 1 
> a, a, b, b, c, c, a, a, 
> b, b, c, c, a, a, b, b, c, c
> {code}






[jira] [Assigned] (SPARK-35381) Fix lambda variable name issues in nested DataFrame functions in R APIs

2021-05-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35381:


Assignee: Apache Spark

> Fix lambda variable name issues in nested DataFrame functions in R APIs
> ---
>
> Key: SPARK-35381
> URL: https://issues.apache.org/jira/browse/SPARK-35381
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 3.1.1
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Critical
>  Labels: correctness
>
> R's higher-order functions also have the same problem as SPARK-34794:
> {code}
> df <- sql("SELECT array(1, 2, 3) as numbers, array('a', 'b', 'c') as letters")
> collect(select(
>   df,
>   array_transform("numbers", function(number) {
> array_transform("letters", function(latter) {
>   struct(alias(number, "n"), alias(latter, "l"))
> })
>   })
> ))
> {code}
> {code}
>   transform(numbers, lambdafunction(transform(letters, 
> lambdafunction(struct(namedlambdavariable() AS n, namedlambdavariable() AS 
> l), namedlambdavariable())), namedlambdavariable()))
> 1 
> a, a, b, b, c, c, a, a, 
> b, b, c, c, a, a, b, b, c, c
> {code}






[jira] [Commented] (SPARK-35364) Renaming the existing Koalas related codes.

2021-05-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17343003#comment-17343003
 ] 

Apache Spark commented on SPARK-35364:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/32516

> Renaming the existing Koalas related codes.
> ---
>
> Key: SPARK-35364
> URL: https://issues.apache.org/jira/browse/SPARK-35364
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We should rename the existing Koalas-related names in the pandas APIs on
> Spark:
>  * kdf -> psdf
>  * kser -> psser
>  * kidx -> psidx
>  * kmidx -> psmidx
>  * sdf.to_koalas() -> sdf.to_pandas_on_spark()






[jira] [Assigned] (SPARK-35364) Renaming the existing Koalas related codes.

2021-05-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35364:


Assignee: Apache Spark

> Renaming the existing Koalas related codes.
> ---
>
> Key: SPARK-35364
> URL: https://issues.apache.org/jira/browse/SPARK-35364
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Assignee: Apache Spark
>Priority: Major
>
> We should rename the existing Koalas-related names in the pandas APIs on
> Spark:
>  * kdf -> psdf
>  * kser -> psser
>  * kidx -> psidx
>  * kmidx -> psmidx
>  * sdf.to_koalas() -> sdf.to_pandas_on_spark()






[jira] [Assigned] (SPARK-35364) Renaming the existing Koalas related codes.

2021-05-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35364:


Assignee: (was: Apache Spark)

> Renaming the existing Koalas related codes.
> ---
>
> Key: SPARK-35364
> URL: https://issues.apache.org/jira/browse/SPARK-35364
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We should rename the existing Koalas-related names in the pandas APIs on
> Spark:
>  * kdf -> psdf
>  * kser -> psser
>  * kidx -> psidx
>  * kmidx -> psmidx
>  * sdf.to_koalas() -> sdf.to_pandas_on_spark()






[jira] [Commented] (SPARK-35364) Renaming the existing Koalas related codes.

2021-05-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17343002#comment-17343002
 ] 

Apache Spark commented on SPARK-35364:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/32516

> Renaming the existing Koalas related codes.
> ---
>
> Key: SPARK-35364
> URL: https://issues.apache.org/jira/browse/SPARK-35364
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We should rename the existing Koalas-related names in the pandas APIs on
> Spark:
>  * kdf -> psdf
>  * kser -> psser
>  * kidx -> psidx
>  * kmidx -> psmidx
>  * sdf.to_koalas() -> sdf.to_pandas_on_spark()






[jira] [Commented] (SPARK-35380) Support loading SparkSessionExtensions from ServiceLoader

2021-05-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342998#comment-17342998
 ] 

Apache Spark commented on SPARK-35380:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/32515

> Support loading SparkSessionExtensions from ServiceLoader
> -
>
> Key: SPARK-35380
> URL: https://issues.apache.org/jira/browse/SPARK-35380
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kent Yao
>Priority: Major
>
> With this, extensions can be loaded automatically without having to set
> `spark.sql.extensions`.
> Previous discussion can be found at:
> https://github.com/yaooqinn/itachi/issues/8






[jira] [Commented] (SPARK-35380) Support loading SparkSessionExtensions from ServiceLoader

2021-05-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342996#comment-17342996
 ] 

Apache Spark commented on SPARK-35380:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/32515

> Support loading SparkSessionExtensions from ServiceLoader
> -
>
> Key: SPARK-35380
> URL: https://issues.apache.org/jira/browse/SPARK-35380
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kent Yao
>Priority: Major
>
> With this, extensions can be loaded automatically without having to set
> `spark.sql.extensions`.
> Previous discussion can be found at:
> https://github.com/yaooqinn/itachi/issues/8






[jira] [Assigned] (SPARK-35380) Support loading SparkSessionExtensions from ServiceLoader

2021-05-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35380:


Assignee: (was: Apache Spark)

> Support loading SparkSessionExtensions from ServiceLoader
> -
>
> Key: SPARK-35380
> URL: https://issues.apache.org/jira/browse/SPARK-35380
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kent Yao
>Priority: Major
>
> With this, extensions can be loaded automatically without having to set
> `spark.sql.extensions`.
> Previous discussion can be found at:
> https://github.com/yaooqinn/itachi/issues/8






[jira] [Assigned] (SPARK-35380) Support loading SparkSessionExtensions from ServiceLoader

2021-05-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35380:


Assignee: Apache Spark

> Support loading SparkSessionExtensions from ServiceLoader
> -
>
> Key: SPARK-35380
> URL: https://issues.apache.org/jira/browse/SPARK-35380
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Major
>
> With this, extensions can be loaded automatically without having to set
> `spark.sql.extensions`.
> Previous discussion can be found at:
> https://github.com/yaooqinn/itachi/issues/8






[jira] [Created] (SPARK-35382) Fix lambda variable name issues in nested DataFrame functions in Python APIs

2021-05-11 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-35382:


 Summary: Fix lambda variable name issues in nested DataFrame 
functions in Python APIs
 Key: SPARK-35382
 URL: https://issues.apache.org/jira/browse/SPARK-35382
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.1.1
Reporter: Hyukjin Kwon


The Python side also has the same issue as SPARK-34794:

{code}
from pyspark.sql.functions import *
df = sql("SELECT array(1, 2, 3) as numbers, array('a', 'b', 'c') as letters")
df.select(
transform(
"numbers",
lambda n: transform("letters", lambda l: struct(n.alias("n"), 
l.alias("l")))
)
).show()
{code}

{code}
++
|transform(numbers, lambdafunction(transform(letters, lambdafunction(struct(namedlambdavariable() AS n, namedlambdavariable() AS l), namedlambdavariable())), namedlambdavariable()))|
++
|[[{a, a}, {b, b},...|
++
{code}






[jira] [Created] (SPARK-35381) Fix lambda variable name issues in nested DataFrame functions in R APIs

2021-05-11 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-35381:


 Summary: Fix lambda variable name issues in nested DataFrame 
functions in R APIs
 Key: SPARK-35381
 URL: https://issues.apache.org/jira/browse/SPARK-35381
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 3.1.1
Reporter: Hyukjin Kwon


R's higher-order functions also have the same problem as SPARK-34794:

{code}
df <- sql("SELECT array(1, 2, 3) as numbers, array('a', 'b', 'c') as letters")
collect(select(
  df,
  array_transform("numbers", function(number) {
array_transform("letters", function(latter) {
  struct(alias(number, "n"), alias(latter, "l"))
})
  })
))
{code}

{code}
  transform(numbers, lambdafunction(transform(letters, 
lambdafunction(struct(namedlambdavariable() AS n, namedlambdavariable() AS l), 
namedlambdavariable())), namedlambdavariable()))
1   
  a, a, b, b, c, c, a, a, b, b, 
c, c, a, a, b, b, c, c
{code}






[jira] [Created] (SPARK-35380) Support loading SparkSessionExtensions from ServiceLoader

2021-05-11 Thread Kent Yao (Jira)
Kent Yao created SPARK-35380:


 Summary: Support loading SparkSessionExtensions from ServiceLoader
 Key: SPARK-35380
 URL: https://issues.apache.org/jira/browse/SPARK-35380
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.2.0
Reporter: Kent Yao



With this, extensions can be loaded automatically without having to set
`spark.sql.extensions`.

Previous discussion can be found at:

https://github.com/yaooqinn/itachi/issues/8
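
For context, an extension currently has to be named explicitly through spark.sql.extensions; with ServiceLoader discovery the JAR could register itself. A rough sketch of both styles follows; the class name org.example.MyExtensions is hypothetical, and the provider interface used for discovery is an assumption about the proposal, not a confirmed API:

{code:scala}
import org.apache.spark.sql.SparkSessionExtensions

// Today: an extension is a Function1[SparkSessionExtensions, Unit] that must be
// listed explicitly, e.g. --conf spark.sql.extensions=org.example.MyExtensions
class MyExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(ext: SparkSessionExtensions): Unit = {
    // e.g. ext.injectFunction(...), ext.injectOptimizerRule(...)
  }
}

// With ServiceLoader-based loading, the extension JAR would instead ship a
// service registration file, e.g.
//   META-INF/services/org.apache.spark.sql.SparkSessionExtensionsProvider
// naming the implementation class, so it is picked up automatically whenever
// the JAR is on the classpath.
{code}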






[jira] [Assigned] (SPARK-35379) Improve InferFiltersFromConstraints rule performance when parsing spark sql

2021-05-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35379:


Assignee: (was: Apache Spark)

> Improve InferFiltersFromConstraints rule performance when parsing spark sql
> ---
>
> Key: SPARK-35379
> URL: https://issues.apache.org/jira/browse/SPARK-35379
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Wan Kun
>Priority: Major
>
> The *InferFiltersFromConstraints* rule generates too many constraints when
> there are many aliases in a Project. For example:
>  
> {code:java}
> test("Expression explosion when analyze test") {
> RuleExecutor.resetMetrics()
> Seq((1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
>   .toDF("a", "b", "c", "d", "e", "f", "g", "h", "i", "j",
> "k", "l", "m", "n")
>   .write.saveAsTable("test")
> val df = spark.table("test")
> val df2 = df.filter("a+b+c+d+e+f+g+h+i+j+k+l+m+n > 100")
> val df3 = df2.select('a as 'a1, 'b as 'b1,
>   'c as 'c1, 'd as 'd1, 'e as 'e1, 'f as 'f1,
>   'g as 'g1, 'h as 'h1, 'i as 'i1, 'j as 'j1,
>   'k as 'k1, 'l as 'l1, 'm as 'm1, 'n as 'n1)
> val df4 = df3.join(df2, df3("a1") === df2("a"))
> df4.explain(true)
> logWarning(RuleExecutor.dumpTimeSpent())
>   }
> {code}
> So we need to optimize the *InferFiltersFromConstraints* rule to improve
> performance.






[jira] [Assigned] (SPARK-35379) Improve InferFiltersFromConstraints rule performance when parsing spark sql

2021-05-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35379:


Assignee: Apache Spark

> Improve InferFiltersFromConstraints rule performance when parsing spark sql
> ---
>
> Key: SPARK-35379
> URL: https://issues.apache.org/jira/browse/SPARK-35379
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Wan Kun
>Assignee: Apache Spark
>Priority: Major
>
> The *InferFiltersFromConstraints* rule generates too many constraints when
> there are many aliases in a Project. For example:
>  
> {code:java}
> test("Expression explosion when analyze test") {
> RuleExecutor.resetMetrics()
> Seq((1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
>   .toDF("a", "b", "c", "d", "e", "f", "g", "h", "i", "j",
> "k", "l", "m", "n")
>   .write.saveAsTable("test")
> val df = spark.table("test")
> val df2 = df.filter("a+b+c+d+e+f+g+h+i+j+k+l+m+n > 100")
> val df3 = df2.select('a as 'a1, 'b as 'b1,
>   'c as 'c1, 'd as 'd1, 'e as 'e1, 'f as 'f1,
>   'g as 'g1, 'h as 'h1, 'i as 'i1, 'j as 'j1,
>   'k as 'k1, 'l as 'l1, 'm as 'm1, 'n as 'n1)
> val df4 = df3.join(df2, df3("a1") === df2("a"))
> df4.explain(true)
> logWarning(RuleExecutor.dumpTimeSpent())
>   }
> {code}
> So we need to optimize the *InferFiltersFromConstraints* rule to improve
> performance.






[jira] [Commented] (SPARK-35379) Improve InferFiltersFromConstraints rule performance when parsing spark sql

2021-05-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342992#comment-17342992
 ] 

Apache Spark commented on SPARK-35379:
--

User 'wankunde' has created a pull request for this issue:
https://github.com/apache/spark/pull/32514

> Improve InferFiltersFromConstraints rule performance when parsing spark sql
> ---
>
> Key: SPARK-35379
> URL: https://issues.apache.org/jira/browse/SPARK-35379
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Wan Kun
>Priority: Major
>
> The *InferFiltersFromConstraints* rule generates too many constraints when
> there are many aliases in a Project. For example:
>  
> {code:java}
> test("Expression explosion when analyze test") {
> RuleExecutor.resetMetrics()
> Seq((1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
>   .toDF("a", "b", "c", "d", "e", "f", "g", "h", "i", "j",
> "k", "l", "m", "n")
>   .write.saveAsTable("test")
> val df = spark.table("test")
> val df2 = df.filter("a+b+c+d+e+f+g+h+i+j+k+l+m+n > 100")
> val df3 = df2.select('a as 'a1, 'b as 'b1,
>   'c as 'c1, 'd as 'd1, 'e as 'e1, 'f as 'f1,
>   'g as 'g1, 'h as 'h1, 'i as 'i1, 'j as 'j1,
>   'k as 'k1, 'l as 'l1, 'm as 'm1, 'n as 'n1)
> val df4 = df3.join(df2, df3("a1") === df2("a"))
> df4.explain(true)
> logWarning(RuleExecutor.dumpTimeSpent())
>   }
> {code}
> So we need to optimize the *InferFiltersFromConstraints* rule to improve
> performance.






[jira] [Created] (SPARK-35379) Improve InferFiltersFromConstraints rule performance when parsing spark sql

2021-05-11 Thread Wan Kun (Jira)
Wan Kun created SPARK-35379:
---

 Summary: Improve InferFiltersFromConstraints rule performance when 
parsing spark sql
 Key: SPARK-35379
 URL: https://issues.apache.org/jira/browse/SPARK-35379
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.1
Reporter: Wan Kun


The *InferFiltersFromConstraints* rule generates too many constraints when
there are many aliases in a Project. For example:

 
{code:java}

test("Expression explosion when analyze test") {
RuleExecutor.resetMetrics()
Seq((1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
  .toDF("a", "b", "c", "d", "e", "f", "g", "h", "i", "j",
"k", "l", "m", "n")
  .write.saveAsTable("test")
val df = spark.table("test")
val df2 = df.filter("a+b+c+d+e+f+g+h+i+j+k+l+m+n > 100")
val df3 = df2.select('a as 'a1, 'b as 'b1,
  'c as 'c1, 'd as 'd1, 'e as 'e1, 'f as 'f1,
  'g as 'g1, 'h as 'h1, 'i as 'i1, 'j as 'j1,
  'k as 'k1, 'l as 'l1, 'm as 'm1, 'n as 'n1)
val df4 = df3.join(df2, df3("a1") === df2("a"))
df4.explain(true)
logWarning(RuleExecutor.dumpTimeSpent())
  }

{code}


So we need to optimize the *InferFiltersFromConstraints* rule to improve
performance.






[jira] [Commented] (SPARK-35378) Convert LeafRunnableCommand to LocalRelation when query with CTE

2021-05-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342984#comment-17342984
 ] 

Apache Spark commented on SPARK-35378:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/32513

> Convert LeafRunnableCommand to LocalRelation when query with CTE
> 
>
> Key: SPARK-35378
> URL: https://issues.apache.org/jira/browse/SPARK-35378
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, Spark doesn't support LeafRunnableCommand as a subquery, because
> LeafRunnableCommand always outputs GenericInternalRow while some nodes (e.g.
> SortExec, AdaptiveSparkPlanExec, WholeStageCodegenExec) cast rows to UnsafeRow,
> which causes an error as follows:
> {code:java}
> java.lang.ClassCastException
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast 
> to org.apache.spark.sql.catalyst.expressions.UnsafeRow
> {code}






[jira] [Assigned] (SPARK-35378) Convert LeafRunnableCommand to LocalRelation when query with CTE

2021-05-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35378:


Assignee: (was: Apache Spark)

> Convert LeafRunnableCommand to LocalRelation when query with CTE
> 
>
> Key: SPARK-35378
> URL: https://issues.apache.org/jira/browse/SPARK-35378
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, Spark doesn't support LeafRunnableCommand as a subquery, because
> LeafRunnableCommand always outputs GenericInternalRow while some nodes (e.g.
> SortExec, AdaptiveSparkPlanExec, WholeStageCodegenExec) cast rows to UnsafeRow,
> which causes an error as follows:
> {code:java}
> java.lang.ClassCastException
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast 
> to org.apache.spark.sql.catalyst.expressions.UnsafeRow
> {code}






[jira] [Assigned] (SPARK-35378) Convert LeafRunnableCommand to LocalRelation when query with CTE

2021-05-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35378:


Assignee: Apache Spark

> Convert LeafRunnableCommand to LocalRelation when query with CTE
> 
>
> Key: SPARK-35378
> URL: https://issues.apache.org/jira/browse/SPARK-35378
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>
> Currently, Spark doesn't support LeafRunnableCommand as a subquery, because
> LeafRunnableCommand always outputs GenericInternalRow while some nodes (e.g.
> SortExec, AdaptiveSparkPlanExec, WholeStageCodegenExec) cast rows to UnsafeRow,
> which causes an error as follows:
> {code:java}
> java.lang.ClassCastException
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast 
> to org.apache.spark.sql.catalyst.expressions.UnsafeRow
> {code}






[jira] [Commented] (SPARK-35378) Convert LeafRunnableCommand to LocalRelation when query with CTE

2021-05-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342983#comment-17342983
 ] 

Apache Spark commented on SPARK-35378:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/32513

> Convert LeafRunnableCommand to LocalRelation when query with CTE
> 
>
> Key: SPARK-35378
> URL: https://issues.apache.org/jira/browse/SPARK-35378
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, Spark doesn't support LeafRunnableCommand as a subquery, because
> LeafRunnableCommand always outputs GenericInternalRow while some nodes (e.g.
> SortExec, AdaptiveSparkPlanExec, WholeStageCodegenExec) cast rows to UnsafeRow,
> which causes an error as follows:
> {code:java}
> java.lang.ClassCastException
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast 
> to org.apache.spark.sql.catalyst.expressions.UnsafeRow
> {code}






[jira] [Updated] (SPARK-35378) Convert LeafRunnableCommand to LocalRelation when query with CTE

2021-05-11 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-35378:
---
Summary: Convert LeafRunnableCommand to LocalRelation when query with CTE  
(was: Support LeafRunnableCommand as sub query)

> Convert LeafRunnableCommand to LocalRelation when query with CTE
> 
>
> Key: SPARK-35378
> URL: https://issues.apache.org/jira/browse/SPARK-35378
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, Spark doesn't support LeafRunnableCommand as a subquery, because
> LeafRunnableCommand always outputs GenericInternalRow while some nodes (e.g.
> SortExec, AdaptiveSparkPlanExec, WholeStageCodegenExec) cast rows to UnsafeRow,
> which causes an error as follows:
> {code:java}
> java.lang.ClassCastException
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast 
> to org.apache.spark.sql.catalyst.expressions.UnsafeRow
> {code}






[jira] [Commented] (SPARK-35365) spark3.1.1 use too long time to analyze table fields

2021-05-11 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342981#comment-17342981
 ] 

Yuming Wang commented on SPARK-35365:
-

{noformat}
-- 2.4
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences
  46101271113 / 47150195238     1415 / 2368
-- 3.1
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences
  113500482546 / 115500482546   2300 / 4124
{noformat}
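
For reference, per-rule timings like the ones above come from Catalyst's RuleExecutor metrics. A minimal sketch of how to reproduce them, assuming a SparkSession named spark and the slow query available as a DataFrame named df:

{code:scala}
import org.apache.spark.sql.catalyst.rules.RuleExecutor

// Clear the accumulated rule metrics, force only the analysis phase to run,
// then print the time spent per rule (the source of the numbers above).
RuleExecutor.resetMetrics()
df.queryExecution.analyzed
println(RuleExecutor.dumpTimeSpent())
{code}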

> spark3.1.1 use too long time to analyze table fields
> 
>
> Key: SPARK-35365
> URL: https://issues.apache.org/jira/browse/SPARK-35365
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: yao
>Priority: Major
> Attachments: spark2.4report, spark3.1.1_report_originalsql, 
> spark3.11report
>
>
> I have a big SQL query that joins a few wide tables with complex logic. When I
> run it on Spark 2.4, the analysis phase takes about 20 minutes; on Spark 3.1.1
> it takes about 40 minutes.
> I need to set spark.sql.analyzer.maxIterations=1000 in Spark 3.1.1
> (or spark.sql.optimizer.maxIterations=1000 in Spark 2.4).
> There are no other special settings.
> On the Spark UI I see that no job is generated and no executor has active
> tasks, and when I set the log level to debug I can see that the query is still
> in the analysis phase, resolving field references. This phase takes too long.






[jira] [Updated] (SPARK-35365) spark3.1.1 use too long time to analyze table fields

2021-05-11 Thread yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yao updated SPARK-35365:

Attachment: spark3.1.1_report_originalsql

> spark3.1.1 use too long time to analyze table fields
> 
>
> Key: SPARK-35365
> URL: https://issues.apache.org/jira/browse/SPARK-35365
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: yao
>Priority: Major
> Attachments: spark2.4report, spark3.1.1_report_originalsql, 
> spark3.11report
>
>
> I have a big SQL query that joins a few wide tables with complex logic. When I
> run it on Spark 2.4, the analysis phase takes about 20 minutes; on Spark 3.1.1
> it takes about 40 minutes.
> I need to set spark.sql.analyzer.maxIterations=1000 in Spark 3.1.1
> (or spark.sql.optimizer.maxIterations=1000 in Spark 2.4).
> There are no other special settings.
> On the Spark UI I see that no job is generated and no executor has active
> tasks, and when I set the log level to debug I can see that the query is still
> in the analysis phase, resolving field references. This phase takes too long.






[jira] [Updated] (SPARK-35378) Support LeafRunnableCommand as sub query

2021-05-11 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-35378:
---
Description: 
Currently, Spark doesn't support LeafRunnableCommand as sub query.
Because the LeafRunnableCommand always output GenericInternalRow and some 
node(e.g. SortExec, AdaptiveExecutionExec, WholeCodegenExec) will convert 
GenericInternalRow to UnsafeRow. So will causes error as follows:

{code:java}
java.lang.ClassCastException
org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast to 
org.apache.spark.sql.catalyst.expressions.UnsafeRow
{code}


  was:
Currently, Spark doesn't support LeafRunnableCommand as sub query.
Because the LeafRunnableCommand always output GenericInternalRow and some 
node(e.g. SortExec, AdaptiveExecutionExec, WholeCodegenExce) will convert 
GenericInternalRow to UnsafeRow. So will causes error as follows:

{code:java}
java.lang.ClassCastException
org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast to 
org.apache.spark.sql.catalyst.expressions.UnsafeRow
{code}



> Support LeafRunnableCommand as sub query
> 
>
> Key: SPARK-35378
> URL: https://issues.apache.org/jira/browse/SPARK-35378
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, Spark doesn't support LeafRunnableCommand as sub query.
> Because the LeafRunnableCommand always output GenericInternalRow and some 
> node(e.g. SortExec, AdaptiveExecutionExec, WholeCodegenExec) will convert 
> GenericInternalRow to UnsafeRow. So will causes error as follows:
> {code:java}
> java.lang.ClassCastException
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast 
> to org.apache.spark.sql.catalyst.expressions.UnsafeRow
> {code}






[jira] [Commented] (SPARK-35365) spark3.1.1 use too long time to analyze table fields

2021-05-11 Thread yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342978#comment-17342978
 ] 

yao commented on SPARK-35365:
-

[~yumwang] I have uploaded two files. I removed some columns and adjusted the
SQL; Spark 3.1 now takes around 20-30 minutes. Please help me check whether
there is something I can do to improve the performance on Spark 3.1.1. Thanks.

> spark3.1.1 use too long time to analyze table fields
> 
>
> Key: SPARK-35365
> URL: https://issues.apache.org/jira/browse/SPARK-35365
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: yao
>Priority: Major
> Attachments: spark2.4report, spark3.11report
>
>
> I have a big SQL query that joins a few wide tables with complex logic. When I
> run it on Spark 2.4, the analysis phase takes about 20 minutes; on Spark 3.1.1
> it takes about 40 minutes.
> I need to set spark.sql.analyzer.maxIterations=1000 in Spark 3.1.1
> (or spark.sql.optimizer.maxIterations=1000 in Spark 2.4).
> There are no other special settings.
> On the Spark UI I see that no job is generated and no executor has active
> tasks, and when I set the log level to debug I can see that the query is still
> in the analysis phase, resolving field references. This phase takes too long.






[jira] [Comment Edited] (SPARK-35365) spark3.1.1 use too long time to analyze table fields

2021-05-11 Thread yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342978#comment-17342978
 ] 

yao edited comment on SPARK-35365 at 5/12/21, 3:04 AM:
---

[~yumwang] I have uploaded two files,  I removed some columns and adjust the 
sql, now spark 3.1 use 20-30 minues around. thanks ,please help me check, is 
there something I can do to improve the performance in spark3.1.1, thanks.


was (Author: xiaohua):
[~yumwang] I have upload two files,  I removed some columns and adjust the sql, 
now spark 3.1 use 20-30 minues around. thanks ,please help me check, is there 
something I can do to improve the performance in spark3.1.1, thanks.

> spark3.1.1 use too long time to analyze table fields
> 
>
> Key: SPARK-35365
> URL: https://issues.apache.org/jira/browse/SPARK-35365
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: yao
>Priority: Major
> Attachments: spark2.4report, spark3.11report
>
>
> I have a big SQL query that joins a few wide tables with complex logic. When I
> run it on Spark 2.4, the analysis phase takes about 20 minutes; on Spark 3.1.1
> it takes about 40 minutes.
> I need to set spark.sql.analyzer.maxIterations=1000 in Spark 3.1.1
> (or spark.sql.optimizer.maxIterations=1000 in Spark 2.4).
> There are no other special settings.
> On the Spark UI I see that no job is generated and no executor has active
> tasks, and when I set the log level to debug I can see that the query is still
> in the analysis phase, resolving field references. This phase takes too long.






[jira] [Updated] (SPARK-35378) Support LeafRunnableCommand as sub query

2021-05-11 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-35378:
---
Summary: Support LeafRunnableCommand as sub query  (was: Support 
LeafRunnableCommand as sub query so as supports query command of CTE)

> Support LeafRunnableCommand as sub query
> 
>
> Key: SPARK-35378
> URL: https://issues.apache.org/jira/browse/SPARK-35378
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, Spark doesn't support LeafRunnableCommand as a subquery, because
> LeafRunnableCommand always outputs GenericInternalRow while some nodes (e.g.
> SortExec, AdaptiveSparkPlanExec, WholeStageCodegenExec) cast rows to UnsafeRow,
> which causes an error as follows:
> {code:java}
> java.lang.ClassCastException
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast 
> to org.apache.spark.sql.catalyst.expressions.UnsafeRow
> {code}






[jira] [Updated] (SPARK-35365) spark3.1.1 use too long time to analyze table fields

2021-05-11 Thread yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yao updated SPARK-35365:

Attachment: spark3.11report

> spark3.1.1 use too long time to analyze table fields
> 
>
> Key: SPARK-35365
> URL: https://issues.apache.org/jira/browse/SPARK-35365
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: yao
>Priority: Major
> Attachments: spark2.4report, spark3.11report
>
>
> I have a big SQL query that joins a few wide tables with complex logic. When I
> run it on Spark 2.4, the analysis phase takes about 20 minutes; on Spark 3.1.1
> it takes about 40 minutes.
> I need to set spark.sql.analyzer.maxIterations=1000 in Spark 3.1.1
> (or spark.sql.optimizer.maxIterations=1000 in Spark 2.4).
> There are no other special settings.
> On the Spark UI I see that no job is generated and no executor has active
> tasks, and when I set the log level to debug I can see that the query is still
> in the analysis phase, resolving field references. This phase takes too long.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35378) Support LeafRunnableCommand as sub query so as supports query command of CTE

2021-05-11 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-35378:
---
Summary: Support LeafRunnableCommand as sub query so as supports query 
command of CTE  (was: Support LeafRunnableCommand as subQuery so as support 
query command with CTE)

> Support LeafRunnableCommand as sub query so as supports query command of CTE
> 
>
> Key: SPARK-35378
> URL: https://issues.apache.org/jira/browse/SPARK-35378
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, Spark doesn't support LeafRunnableCommand as a subquery, because
> LeafRunnableCommand always outputs GenericInternalRow while some nodes (e.g.
> SortExec, AdaptiveSparkPlanExec, WholeStageCodegenExec) cast rows to UnsafeRow,
> which causes an error as follows:
> {code:java}
> java.lang.ClassCastException
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast 
> to org.apache.spark.sql.catalyst.expressions.UnsafeRow
> {code}






[jira] [Updated] (SPARK-35365) spark3.1.1 use too long time to analyze table fields

2021-05-11 Thread yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yao updated SPARK-35365:

Attachment: spark2.4report

> spark3.1.1 use too long time to analyze table fields
> 
>
> Key: SPARK-35365
> URL: https://issues.apache.org/jira/browse/SPARK-35365
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: yao
>Priority: Major
> Attachments: spark2.4report, spark3.11report
>
>
> I have a big SQL query that joins a few wide tables with complex logic. When I
> run it on Spark 2.4, the analysis phase takes about 20 minutes; on Spark 3.1.1
> it takes about 40 minutes.
> I need to set spark.sql.analyzer.maxIterations=1000 in Spark 3.1.1
> (or spark.sql.optimizer.maxIterations=1000 in Spark 2.4).
> There are no other special settings.
> On the Spark UI I see that no job is generated and no executor has active
> tasks, and when I set the log level to debug I can see that the query is still
> in the analysis phase, resolving field references. This phase takes too long.






[jira] [Created] (SPARK-35378) Support LeafRunnableCommand as subQuery so as support query command with CTE

2021-05-11 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-35378:
--

 Summary: Support LeafRunnableCommand as subQuery so as support 
query command with CTE
 Key: SPARK-35378
 URL: https://issues.apache.org/jira/browse/SPARK-35378
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: jiaan.geng


Currently, Spark doesn't support LeafRunnableCommand as a subquery, because
LeafRunnableCommand always outputs GenericInternalRow while some nodes (e.g.
SortExec, AdaptiveSparkPlanExec, WholeStageCodegenExec) cast rows to UnsafeRow,
which causes an error as follows:

{code:java}
java.lang.ClassCastException
org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast to 
org.apache.spark.sql.catalyst.expressions.UnsafeRow
{code}
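
To make the failing shape concrete: the scenario is a runnable command used as the body of a CTE and then queried through operators that expect UnsafeRow. A hypothetical reproduction along these lines (assuming a SparkSession named spark; the exact query needed to trigger the cast may differ):

{code:scala}
// SHOW TABLES is a runnable command; using it as a CTE relation and then
// sorting routes its GenericInternalRow output into SortExec, which expects
// UnsafeRow and fails with the ClassCastException above.
spark.sql(
  """WITH t AS (SHOW TABLES)
    |SELECT * FROM t ORDER BY tableName
    |""".stripMargin
).show()
{code}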







[jira] [Commented] (SPARK-35377) Add JS linter to GA

2021-05-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342975#comment-17342975
 ] 

Apache Spark commented on SPARK-35377:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/32512

> Add JS linter to GA
> ---
>
> Key: SPARK-35377
> URL: https://issues.apache.org/jira/browse/SPARK-35377
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> SPARK-35175 added a linter for JS so let's add it to GA.






[jira] [Assigned] (SPARK-35377) Add JS linter to GA

2021-05-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35377:


Assignee: Apache Spark  (was: Kousuke Saruta)

> Add JS linter to GA
> ---
>
> Key: SPARK-35377
> URL: https://issues.apache.org/jira/browse/SPARK-35377
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Minor
>
> SPARK-35175 added a linter for JS so let's add it to GA.






[jira] [Assigned] (SPARK-35377) Add JS linter to GA

2021-05-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35377:


Assignee: Kousuke Saruta  (was: Apache Spark)

> Add JS linter to GA
> ---
>
> Key: SPARK-35377
> URL: https://issues.apache.org/jira/browse/SPARK-35377
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> SPARK-35175 added a linter for JS so let's add it to GA.






[jira] [Commented] (SPARK-35377) Add JS linter to GA

2021-05-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342974#comment-17342974
 ] 

Apache Spark commented on SPARK-35377:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/32512

> Add JS linter to GA
> ---
>
> Key: SPARK-35377
> URL: https://issues.apache.org/jira/browse/SPARK-35377
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> SPARK-35175 added a linter for JS so let's add it to GA.






[jira] [Updated] (SPARK-35377) Add JS linter to GA

2021-05-11 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-35377:
---
Issue Type: Improvement  (was: Bug)

> Add JS linter to GA
> ---
>
> Key: SPARK-35377
> URL: https://issues.apache.org/jira/browse/SPARK-35377
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> SPARK-35175 added a linter for JS so let's add it to GA.






[jira] [Created] (SPARK-35377) Add JS linter to GA

2021-05-11 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-35377:
--

 Summary: Add JS linter to GA
 Key: SPARK-35377
 URL: https://issues.apache.org/jira/browse/SPARK-35377
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Affects Versions: 3.2.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


SPARK-35175 added a linter for JS so let's add it to GA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35365) spark3.1.1 use too long time to analyze table fields

2021-05-11 Thread yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yao updated SPARK-35365:

Summary: spark3.1.1 use too long time to analyze table fields  (was: 
spark3.1.1 use too long to analyze table fields)

> spark3.1.1 use too long time to analyze table fields
> 
>
> Key: SPARK-35365
> URL: https://issues.apache.org/jira/browse/SPARK-35365
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: yao
>Priority: Major
>
> I have a big SQL query that joins a few wide tables with complex logic. When I run 
> it on Spark 2.4 the analyze phase takes about 20 minutes; on Spark 
> 3.1.1 it takes about 40 minutes.
> I need to set spark.sql.analyzer.maxIterations=1000 in Spark 3.1.1,
> or spark.sql.optimizer.maxIterations=1000 in Spark 2.4.
> There is no other special setting for this.
> Checking the Spark UI, I find that no job is generated and no executor 
> has active tasks. When I set the log level to debug, I find that the job 
> is still in the analyze phase, resolving the field references.
> This phase takes too long.
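A minimal sketch (added here for illustration; not part of the report above) of how the iteration limits mentioned in the description could be raised when building a session. Whether these keys can be set at session level is an assumption that depends on the Spark version in use:

{code:scala}
// Hypothetical example: raising the fixed-point iteration limit discussed above.
// Assumes Spark 3.1.x; on Spark 2.4.x the optimizer key would be used instead.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("analyzer-iteration-limit-demo")
  .master("local[*]")
  .config("spark.sql.analyzer.maxIterations", "1000")     // Spark 3.1.x
  // .config("spark.sql.optimizer.maxIterations", "1000") // Spark 2.4.x equivalent
  .getOrCreate()
{code}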



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35365) spark3.1.1 use too long to analyze table fields

2021-05-11 Thread yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342962#comment-17342962
 ] 

yao commented on SPARK-35365:
-

[~yumwang] thanks, I will give it a try and then report the results.

> spark3.1.1 use too long to analyze table fields
> ---
>
> Key: SPARK-35365
> URL: https://issues.apache.org/jira/browse/SPARK-35365
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: yao
>Priority: Major
>
> I have a big SQL query that joins a few wide tables with complex logic. When I run 
> it on Spark 2.4 the analyze phase takes about 20 minutes; on Spark 
> 3.1.1 it takes about 40 minutes.
> I need to set spark.sql.analyzer.maxIterations=1000 in Spark 3.1.1,
> or spark.sql.optimizer.maxIterations=1000 in Spark 2.4.
> There is no other special setting for this.
> Checking the Spark UI, I find that no job is generated and no executor 
> has active tasks. When I set the log level to debug, I find that the job 
> is still in the analyze phase, resolving the field references.
> This phase takes too long.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35376) Fallback config should override defaultValue

2021-05-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35376:


Assignee: Apache Spark

> Fallback config should override defaultValue
> 
>
> Key: SPARK-35376
> URL: https://issues.apache.org/jira/browse/SPARK-35376
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2
>Reporter: XiDuo You
>Assignee: Apache Spark
>Priority: Minor
>
> Some configs use `.fallbackConf` to reference another config but lose the 
> defaultValue.
>  
> A negative case: `SQLConf.ADVISORY_PARTITION_SIZE_IN_BYTES.defaultValue` 
> returns `None`, which could mislead users.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35376) Fallback config should override defaultValue

2021-05-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35376:


Assignee: (was: Apache Spark)

> Fallback config should override defaultValue
> 
>
> Key: SPARK-35376
> URL: https://issues.apache.org/jira/browse/SPARK-35376
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2
>Reporter: XiDuo You
>Priority: Minor
>
> Some configs use `.fallbackConf` to reference another config but lose the 
> defaultValue.
>  
> A negative case: `SQLConf.ADVISORY_PARTITION_SIZE_IN_BYTES.defaultValue` 
> returns `None`, which could mislead users.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35376) Fallback config should override defaultValue

2021-05-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342959#comment-17342959
 ] 

Apache Spark commented on SPARK-35376:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/32510

> Fallback config should override defaultValue
> 
>
> Key: SPARK-35376
> URL: https://issues.apache.org/jira/browse/SPARK-35376
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2
>Reporter: XiDuo You
>Priority: Minor
>
> Some configs use `.fallbackConf` to reference another config but lose the 
> defaultValue.
>  
> A negative case: `SQLConf.ADVISORY_PARTITION_SIZE_IN_BYTES.defaultValue` 
> returns `None`, which could mislead users.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35361) Improve performance for ApplyFunctionExpression

2021-05-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-35361.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32507
[https://github.com/apache/spark/pull/32507]

> Improve performance for ApplyFunctionExpression
> ---
>
> Key: SPARK-35361
> URL: https://issues.apache.org/jira/browse/SPARK-35361
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Minor
> Fix For: 3.2.0
>
>
> When the `ScalarFunction` itself is trivial, `ApplyFunctionExpression` could 
> incur significant runtime cost from the `zipWithIndex` call. This proposes to 
> move the call outside the loop over each input row.
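For context, a small self-contained sketch (illustrative only, not the actual Spark implementation) of the pattern described above: hoisting a `zipWithIndex` call out of the per-row loop so it is computed once rather than once per input row.

{code:scala}
// Illustrative sketch of the optimization described above (not Spark's source code).
object ZipWithIndexHoistDemo {
  type InputRow = Seq[Any]

  // Before: children.zipWithIndex is rebuilt for every input row.
  def evalSlow(rows: Seq[InputRow], children: Seq[InputRow => Any]): Seq[Seq[Any]] =
    rows.map { row =>
      children.zipWithIndex.map { case (child, _) => child(row) }
    }

  // After: the indexed sequence is built once and reused for every row.
  def evalFast(rows: Seq[InputRow], children: Seq[InputRow => Any]): Seq[Seq[Any]] = {
    val indexedChildren = children.zipWithIndex
    rows.map { row =>
      indexedChildren.map { case (child, _) => child(row) }
    }
  }
}
{code}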



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35361) Improve performance for ApplyFunctionExpression

2021-05-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-35361:


Assignee: Chao Sun

> Improve performance for ApplyFunctionExpression
> ---
>
> Key: SPARK-35361
> URL: https://issues.apache.org/jira/browse/SPARK-35361
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Minor
>
> When the `ScalarFunction` itself is trivial, `ApplyFunctionExpression` could 
> incur significant runtime cost from the `zipWithIndex` call. This proposes to 
> move the call outside the loop over each input row.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35376) Fallback config should override defaultValue

2021-05-11 Thread XiDuo You (Jira)
XiDuo You created SPARK-35376:
-

 Summary: Fallback config should override defaultValue
 Key: SPARK-35376
 URL: https://issues.apache.org/jira/browse/SPARK-35376
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.2
Reporter: XiDuo You


Some configs use `.fallbackConf` to reference another config but lose the 
defaultValue.

 

A negative case: `SQLConf.ADVISORY_PARTITION_SIZE_IN_BYTES.defaultValue` 
returns `None`, which could mislead users.
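As a rough illustration of the problem (a simplified model, not Spark's actual ConfigBuilder/SQLConf API; the keys and types below are made up), a fallback entry that delegates to a parent at lookup time can still report `None` as its own default:

{code:scala}
// Simplified model of the behaviour described above; names are illustrative only.
object FallbackConfDemo extends App {
  case class ConfEntry[T](key: String,
                          defaultValue: Option[T],
                          fallback: Option[ConfEntry[T]] = None) {
    // Effective default resolves through the fallback chain.
    def resolveDefault: Option[T] = defaultValue.orElse(fallback.flatMap(_.resolveDefault))
  }

  val parent   = ConfEntry("parent.size.conf", Some(64L * 1024 * 1024))
  val fallback = ConfEntry[Long]("child.size.conf", None, Some(parent))

  // The effective default comes from the parent...
  assert(fallback.resolveDefault.contains(64L * 1024 * 1024))
  // ...but the entry's own defaultValue is None, which is the misleading part.
  assert(fallback.defaultValue.isEmpty)
}
{code}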

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35356) Fix issue of the createTable when externalCatalog is InMemoryCatalog

2021-05-11 Thread yikf (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yikf updated SPARK-35356:
-
Description: 
When the externalCatalog is an InMemoryCatalog, creating the same table from a 
different SparkSession reports the following error:
{code:java}
sparkSession.sql("create table if not exists t1 (id Int) using orc")
sparkSession.sql("insert into t1 values(1)")

// will report error as follow
sparkSession2.sql("create table if not exists t1 (id Int) using orc") {code}
{code:java}
Can not create the managed table('`default`.`t1`'). The associated 
location('file:***/t1') already exists.
{code}

  was:
When the externalCatalog is an InMemoryCatalog, creating the same table from a 
different SparkSession reports the following error:
{code:java}
sparkSession.sql("create table if not exists t1 (id Int) using orc")
sparkSession.sql("insert into t1 values(1)")

sparkSession2.sql("create table if not exists t1 (id Int) using orc") // will 
report error as follow
{code}
 
{code:java}
Can not create the managed table('`default`.`t1`'). The associated 
location('file:***/t1') already exists.
{code}


> Fix issue of the createTable when externalCatalog is InMemoryCatalog
> 
>
> Key: SPARK-35356
> URL: https://issues.apache.org/jira/browse/SPARK-35356
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: yikf
>Priority: Minor
>
> When the externalCatalog is an InMemoryCatalog, creating the same table from a 
> different SparkSession reports the following error:
> {code:java}
> sparkSession.sql("create table if not exists t1 (id Int) using orc")
> sparkSession.sql("insert into t1 values(1)")
> // will report error as follow
> sparkSession2.sql("create table if not exists t1 (id Int) using orc") {code}
> {code:java}
> Can not create the managed table('`default`.`t1`'). The associated 
> location('file:***/t1') already exists.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35356) Fix issue of the createTable when externalCatalog is InMemoryCatalog

2021-05-11 Thread yikf (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yikf updated SPARK-35356:
-
Description: 
When the externalCatalog is an InMemoryCatalog, creating the same table from a 
different SparkSession reports the following error:
{code:java}
sparkSession.sql("create table if not exists t1 (id Int) using orc")
sparkSession.sql("insert into t1 values(1)")

sparkSession2.sql("create table if not exists t1 (id Int) using orc") // will 
report error as follow
{code}
 
{code:java}
Can not create the managed table('`default`.`t1`'). The associated 
location('file:***/t1') already exists.
{code}

  was:
When the externalCatalog is an InMemoryCatalog, creating the same table from a 
different SparkSession reports the following error:
{code:java}
Can not create the managed table('`default`.`t1`'). The associated 
location('file:***/t1') already exists.
{code}


> Fix issue of the createTable when externalCatalog is InMemoryCatalog
> 
>
> Key: SPARK-35356
> URL: https://issues.apache.org/jira/browse/SPARK-35356
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: yikf
>Priority: Minor
>
> When the externalCatalog is an InMemoryCatalog, creating the same table from a 
> different SparkSession reports the following error:
> {code:java}
> sparkSession.sql("create table if not exists t1 (id Int) using orc")
> sparkSession.sql("insert into t1 values(1)")
> sparkSession2.sql("create table if not exists t1 (id Int) using orc") // will 
> report error as follow
> {code}
>  
> {code:java}
> Can not create the managed table('`default`.`t1`'). The associated 
> location('file:***/t1') already exists.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35356) Fix issue of the createTable when externalCatalog is InMemoryCatalog

2021-05-11 Thread yikf (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342949#comment-17342949
 ] 

yikf commented on SPARK-35356:
--

Sorry, my version of Spark is 3.0.0. Please visit 
https://github.com/apache/spark/pull/32486 for more details.

> Fix issue of the createTable when externalCatalog is InMemoryCatalog
> 
>
> Key: SPARK-35356
> URL: https://issues.apache.org/jira/browse/SPARK-35356
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: yikf
>Priority: Minor
>
> When the externalCatalog is an InMemoryCatalog, creating the same table from a 
> different SparkSession reports the following error:
> {code:java}
> Can not create the managed table('`default`.`t1`'). The associated 
> location('file:***/t1') already exists.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35375) Use Jinja2 < 3.0.0 for Python linter dependency in GA

2021-05-11 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342947#comment-17342947
 ] 

Hyukjin Kwon commented on SPARK-35375:
--

Fixed in https://github.com/apache/spark/pull/32509

> Use Jinja2 < 3.0.0 for Python linter dependency in GA
> -
>
> Key: SPARK-35375
> URL: https://issues.apache.org/jira/browse/SPARK-35375
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> From a few hours ago, Python linter fails in GA.
> The latest Jinja 3.0.0 seems to cause this failure.
> https://pypi.org/project/Jinja2/
> {code}
> Run ./dev/lint-python
> starting python compilation test...
> python compilation succeeded.
> starting pycodestyle test...
> pycodestyle checks passed.
> starting flake8 test...
> flake8 checks passed.
> starting mypy test...
> mypy checks passed.
> starting sphinx-build tests...
> sphinx-build checks failed:
> Running Sphinx v3.0.4
> making output directory... done
> [autosummary] generating autosummary for: development/contributing.rst, 
> development/debugging.rst, development/index.rst, 
> development/setting_ide.rst, development/testing.rst, 
> getting_started/index.rst, getting_started/install.rst, 
> getting_started/quickstart.ipynb, index.rst, migration_guide/index.rst, ..., 
> reference/pyspark.ml.rst, reference/pyspark.mllib.rst, 
> reference/pyspark.resource.rst, reference/pyspark.rst, 
> reference/pyspark.sql.rst, reference/pyspark.ss.rst, 
> reference/pyspark.streaming.rst, user_guide/arrow_pandas.rst, 
> user_guide/index.rst, user_guide/python_packaging.rst
> Exception occurred:
>   File 
> "/__w/spark/spark/python/docs/source/_templates/autosummary/class_with_docs.rst",
>  line 26, in top-level template code
> {% if '__init__' in methods %}
> jinja2.exceptions.UndefinedError: 'methods' is undefined
> The full traceback has been saved in /tmp/sphinx-err-ypgyi75y.log, if you 
> want to report the issue to the developers.
> Please also report this if it was a user error, so that a better error 
> message can be provided next time.
> A bug report can be filed in the tracker at 
> . Thanks!
> make: *** [Makefile:20: html] Error 2
> re-running make html to print full warning list:
> Running Sphinx v3.0.4
> making output directory... done
> [autosummary] generating autosummary for: development/contributing.rst, 
> development/debugging.rst, development/index.rst, 
> development/setting_ide.rst, development/testing.rst, 
> getting_started/index.rst, getting_started/install.rst, 
> getting_started/quickstart.ipynb, index.rst, migration_guide/index.rst, ..., 
> reference/pyspark.ml.rst, reference/pyspark.mllib.rst, 
> reference/pyspark.resource.rst, reference/pyspark.rst, 
> reference/pyspark.sql.rst, reference/pyspark.ss.rst, 
> reference/pyspark.streaming.rst, user_guide/arrow_pandas.rst, 
> user_guide/index.rst, user_guide/python_packaging.rst
> Exception occurred:
>   File 
> "/__w/spark/spark/python/docs/source/_templates/autosummary/class_with_docs.rst",
>  line 26, in top-level template code
> {% if '__init__' in methods %}
> jinja2.exceptions.UndefinedError: 'methods' is undefined
> The full traceback has been saved in /tmp/sphinx-err-fvtmvvwv.log, if you 
> want to report the issue to the developers.
> Please also report this if it was a user error, so that a better error 
> message can be provided next time.
> A bug report can be filed in the tracker at 
> . Thanks!
> make: *** [Makefile:20: html] Error 2
> Error: Process completed with exit code 2.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35375) Use Jinja2 < 3.0.0 for Python linter dependency in GA

2021-05-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-35375.
--
Fix Version/s: 3.2.0
   3.1.2
   Resolution: Fixed

> Use Jinja2 < 3.0.0 for Python linter dependency in GA
> -
>
> Key: SPARK-35375
> URL: https://issues.apache.org/jira/browse/SPARK-35375
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.1.2, 3.2.0
>
>
> From a few hours ago, Python linter fails in GA.
> The latest Jinja 3.0.0 seems to cause this failure.
> https://pypi.org/project/Jinja2/
> {code}
> Run ./dev/lint-python
> starting python compilation test...
> python compilation succeeded.
> starting pycodestyle test...
> pycodestyle checks passed.
> starting flake8 test...
> flake8 checks passed.
> starting mypy test...
> mypy checks passed.
> starting sphinx-build tests...
> sphinx-build checks failed:
> Running Sphinx v3.0.4
> making output directory... done
> [autosummary] generating autosummary for: development/contributing.rst, 
> development/debugging.rst, development/index.rst, 
> development/setting_ide.rst, development/testing.rst, 
> getting_started/index.rst, getting_started/install.rst, 
> getting_started/quickstart.ipynb, index.rst, migration_guide/index.rst, ..., 
> reference/pyspark.ml.rst, reference/pyspark.mllib.rst, 
> reference/pyspark.resource.rst, reference/pyspark.rst, 
> reference/pyspark.sql.rst, reference/pyspark.ss.rst, 
> reference/pyspark.streaming.rst, user_guide/arrow_pandas.rst, 
> user_guide/index.rst, user_guide/python_packaging.rst
> Exception occurred:
>   File 
> "/__w/spark/spark/python/docs/source/_templates/autosummary/class_with_docs.rst",
>  line 26, in top-level template code
> {% if '__init__' in methods %}
> jinja2.exceptions.UndefinedError: 'methods' is undefined
> The full traceback has been saved in /tmp/sphinx-err-ypgyi75y.log, if you 
> want to report the issue to the developers.
> Please also report this if it was a user error, so that a better error 
> message can be provided next time.
> A bug report can be filed in the tracker at 
> . Thanks!
> make: *** [Makefile:20: html] Error 2
> re-running make html to print full warning list:
> Running Sphinx v3.0.4
> making output directory... done
> [autosummary] generating autosummary for: development/contributing.rst, 
> development/debugging.rst, development/index.rst, 
> development/setting_ide.rst, development/testing.rst, 
> getting_started/index.rst, getting_started/install.rst, 
> getting_started/quickstart.ipynb, index.rst, migration_guide/index.rst, ..., 
> reference/pyspark.ml.rst, reference/pyspark.mllib.rst, 
> reference/pyspark.resource.rst, reference/pyspark.rst, 
> reference/pyspark.sql.rst, reference/pyspark.ss.rst, 
> reference/pyspark.streaming.rst, user_guide/arrow_pandas.rst, 
> user_guide/index.rst, user_guide/python_packaging.rst
> Exception occurred:
>   File 
> "/__w/spark/spark/python/docs/source/_templates/autosummary/class_with_docs.rst",
>  line 26, in top-level template code
> {% if '__init__' in methods %}
> jinja2.exceptions.UndefinedError: 'methods' is undefined
> The full traceback has been saved in /tmp/sphinx-err-fvtmvvwv.log, if you 
> want to report the issue to the developers.
> Please also report this if it was a user error, so that a better error 
> message can be provided next time.
> A bug report can be filed in the tracker at 
> . Thanks!
> make: *** [Makefile:20: html] Error 2
> Error: Process completed with exit code 2.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35356) Fix issue of the createTable when externalCatalog is InMemoryCatalog

2021-05-11 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342938#comment-17342938
 ] 

L. C. Hsieh commented on SPARK-35356:
-

Is the Affects Version wrong? Do you mean 3.0.2?

> Fix issue of the createTable when externalCatalog is InMemoryCatalog
> 
>
> Key: SPARK-35356
> URL: https://issues.apache.org/jira/browse/SPARK-35356
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: yikf
>Priority: Minor
>
> When the externalCatalog is an InMemoryCatalog, creating the same table from a 
> different SparkSession reports the following error:
> {code:java}
> Can not create the managed table('`default`.`t1`'). The associated 
> location('file:***/t1') already exists.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35356) Fix issue of the createTable when externalCatalog is InMemoryCatalog

2021-05-11 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh updated SPARK-35356:

Fix Version/s: (was: 3.0.0)

> Fix issue of the createTable when externalCatalog is InMemoryCatalog
> 
>
> Key: SPARK-35356
> URL: https://issues.apache.org/jira/browse/SPARK-35356
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: yikf
>Priority: Minor
>
> When the externalCatalog is an InMemoryCatalog, creating the same table from a 
> different SparkSession reports the following error:
> {code:java}
> Can not create the managed table('`default`.`t1`'). The associated 
> location('file:***/t1') already exists.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35371) Scala UDF returning string or complex type applied to array members returns wrong data

2021-05-11 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342937#comment-17342937
 ] 

L. C. Hsieh commented on SPARK-35371:
-

Oh, I think it was fixed by SPARK-34829.

> Scala UDF returning string or complex type applied to array members returns 
> wrong data
> --
>
> Key: SPARK-35371
> URL: https://issues.apache.org/jira/browse/SPARK-35371
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: David Benedeki
>Priority: Major
>
> When using a UDF that returns a string or a complex type (Struct) on array members, 
> the resulting array consists of the last array member's UDF result.
> h3. *Example code:*
> {code:scala}
> import org.apache.spark.sql.{Column, SparkSession}
> import org.apache.spark.sql.functions.{callUDF, col, transform, udf}
> val sparkBuilder: SparkSession.Builder = SparkSession.builder()
>   .master("local[*]")
>   .appName(s"Udf Bug Demo")
>   .config("spark.ui.enabled", "false")
>   .config("spark.debug.maxToStringFields", 100)
> val spark: SparkSession = sparkBuilder
>   .config("spark.driver.bindAddress", "127.0.0.1")
>   .config("spark.driver.host", "127.0.0.1")
>   .getOrCreate()
> import spark.implicits._
> case class Foo(num: Int, s: String)
> val src  = Seq(
>   (1, 2, Array(1, 2, 3)),
>   (2, 2, Array(2, 2, 2)),
>   (3, 4, Array(3, 4, 3, 4))
> ).toDF("A", "B", "C")
> val udfStringName = "UdfString"
> val udfIntName = "UdfInt"
> val udfStructName = "UdfStruct"
> val udfString = udf((num: Int) => {
>   (num + 1).toString
> })
> spark.udf.register(udfStringName, udfString)
> val udfInt = udf((num: Int) => {
>   num + 1
> })
> spark.udf.register(udfIntName, udfInt)
> val udfStruct = udf((num: Int) => {
>   Foo(num + 1, (num + 1).toString)
> })
> spark.udf.register(udfStructName, udfStruct)
> val lambdaString = (forCol: Column) => callUDF(udfStringName, forCol)
> val lambdaInt = (forCol: Column) => callUDF(udfIntName, forCol)
> val lambdaStruct = (forCol: Column) => callUDF(udfStructName, forCol)
> val cA = callUDF(udfStringName, col("A"))
> val cB = callUDF(udfStringName, col("B"))
> val cCString: Column = transform(col("C"), lambdaString)
> val cCInt: Column = transform(col("C"), lambdaInt)
> val cCStruc: Column = transform(col("C"), lambdaStruct)
> val dest = src.withColumn("AStr", cA)
>   .withColumn("BStr", cB)
>   .withColumn("CString (Wrong)", cCString)
>   .withColumn("CInt (OK)", cCInt)
>   .withColumn("CStruct (Wrong)", cCStruc)
> dest.show(false)
> dest.printSchema()
> {code}
> h3. *Expected:*
> {noformat}
> +---+---++++---+++
> |A  |B  |C   |AStr|BStr|CString|CInt|CStruct  
> |
> +---+---++++---+++
> |1  |2  |[1, 2, 3]   |2   |3   |[2, 3, 4]  |[2, 3, 4]   |[{2, 2}, {3, 3}, 
> {4, 4}]|
> |2  |2  |[2, 2, 2]   |3   |3   |[3, 3, 3]  |[3, 3, 3]   |[{3, 3}, {3, 3}, 
> {3, 3}]|
> |3  |4  |[3, 4, 3, 4]|4   |5   |[4, 5, 4, 5]   |[4, 5, 4, 5]|[{4, 4}, {5, 5}, 
> {4, 4}, {5, 5}]|
> +---+---++++---+++
> {noformat}
> h3. *Got:*
> {noformat}
> +---+---++++---+++
> |A  |B  |C   |AStr|BStr|CString (Wrong)|CInt (Ok)   |CStruct (Wrong)  
>|
> +---+---++++---+++
> |1  |2  |[1, 2, 3]   |2   |3   |[4, 4, 4]  |[2, 3, 4]   |[{4, 4}, {4, 4}, 
> {4, 4}]|
> |2  |2  |[2, 2, 2]   |3   |3   |[3, 3, 3]  |[3, 3, 3]   |[{3, 3}, {3, 3}, 
> {3, 3}]|
> |3  |4  |[3, 4, 3, 4]|4   |5   |[5, 5, 5, 5]   |[4, 5, 4, 5]|[{5, 5}, {5, 5}, 
> {5, 5}, {5, 5}]|
> +---+---++++---+++
> {noformat}
> h3. *Observation*
>  * Works correctly on Spark 3.0.2
>  * When the UDF is registered as a Java UDF, it works as expected
>  * The UDF is called the appropriate number of times (regardless of whether the 
> UDF is marked as deterministic or non-deterministic).
>  * When debugged, the correct value is actually saved into the result array 
> at first, but processing each subsequent item overwrites the previous result 
> values as well. Therefore the last item's values, filling the whole array, are 
> the final result.
>  * When the UDF returns NULL/None it does not "overwrite" the prior array 
> values, nor is it "overwritten" by subsequent non-NULL values. See the following 
> UDF implementation:
> {code:scala}
> val udfString = udf((num: Int) => {
>   if (num == 3) {
> None
>   } else {
>  

[jira] [Commented] (SPARK-35371) Scala UDF returning string or complex type applied to array members returns wrong data

2021-05-11 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342936#comment-17342936
 ] 

L. C. Hsieh commented on SPARK-35371:
-

I just ran the example on both the current master branch and branch-3.1. Both 
produced the correct results. Would you like to test it too?

> Scala UDF returning string or complex type applied to array members returns 
> wrong data
> --
>
> Key: SPARK-35371
> URL: https://issues.apache.org/jira/browse/SPARK-35371
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: David Benedeki
>Priority: Major
>
> When using a UDF that returns a string or a complex type (Struct) on array members, 
> the resulting array consists of the last array member's UDF result.
> h3. *Example code:*
> {code:scala}
> import org.apache.spark.sql.{Column, SparkSession}
> import org.apache.spark.sql.functions.{callUDF, col, transform, udf}
> val sparkBuilder: SparkSession.Builder = SparkSession.builder()
>   .master("local[*]")
>   .appName(s"Udf Bug Demo")
>   .config("spark.ui.enabled", "false")
>   .config("spark.debug.maxToStringFields", 100)
> val spark: SparkSession = sparkBuilder
>   .config("spark.driver.bindAddress", "127.0.0.1")
>   .config("spark.driver.host", "127.0.0.1")
>   .getOrCreate()
> import spark.implicits._
> case class Foo(num: Int, s: String)
> val src  = Seq(
>   (1, 2, Array(1, 2, 3)),
>   (2, 2, Array(2, 2, 2)),
>   (3, 4, Array(3, 4, 3, 4))
> ).toDF("A", "B", "C")
> val udfStringName = "UdfString"
> val udfIntName = "UdfInt"
> val udfStructName = "UdfStruct"
> val udfString = udf((num: Int) => {
>   (num + 1).toString
> })
> spark.udf.register(udfStringName, udfString)
> val udfInt = udf((num: Int) => {
>   num + 1
> })
> spark.udf.register(udfIntName, udfInt)
> val udfStruct = udf((num: Int) => {
>   Foo(num + 1, (num + 1).toString)
> })
> spark.udf.register(udfStructName, udfStruct)
> val lambdaString = (forCol: Column) => callUDF(udfStringName, forCol)
> val lambdaInt = (forCol: Column) => callUDF(udfIntName, forCol)
> val lambdaStruct = (forCol: Column) => callUDF(udfStructName, forCol)
> val cA = callUDF(udfStringName, col("A"))
> val cB = callUDF(udfStringName, col("B"))
> val cCString: Column = transform(col("C"), lambdaString)
> val cCInt: Column = transform(col("C"), lambdaInt)
> val cCStruc: Column = transform(col("C"), lambdaStruct)
> val dest = src.withColumn("AStr", cA)
>   .withColumn("BStr", cB)
>   .withColumn("CString (Wrong)", cCString)
>   .withColumn("CInt (OK)", cCInt)
>   .withColumn("CStruct (Wrong)", cCStruc)
> dest.show(false)
> dest.printSchema()
> {code}
> h3. *Expected:*
> {noformat}
> +---+---++++---+++
> |A  |B  |C   |AStr|BStr|CString|CInt|CStruct  
> |
> +---+---++++---+++
> |1  |2  |[1, 2, 3]   |2   |3   |[2, 3, 4]  |[2, 3, 4]   |[{2, 2}, {3, 3}, 
> {4, 4}]|
> |2  |2  |[2, 2, 2]   |3   |3   |[3, 3, 3]  |[3, 3, 3]   |[{3, 3}, {3, 3}, 
> {3, 3}]|
> |3  |4  |[3, 4, 3, 4]|4   |5   |[4, 5, 4, 5]   |[4, 5, 4, 5]|[{4, 4}, {5, 5}, 
> {4, 4}, {5, 5}]|
> +---+---++++---+++
> {noformat}
> h3. *Got:*
> {noformat}
> +---+---++++---+++
> |A  |B  |C   |AStr|BStr|CString (Wrong)|CInt (Ok)   |CStruct (Wrong)  
>|
> +---+---++++---+++
> |1  |2  |[1, 2, 3]   |2   |3   |[4, 4, 4]  |[2, 3, 4]   |[{4, 4}, {4, 4}, 
> {4, 4}]|
> |2  |2  |[2, 2, 2]   |3   |3   |[3, 3, 3]  |[3, 3, 3]   |[{3, 3}, {3, 3}, 
> {3, 3}]|
> |3  |4  |[3, 4, 3, 4]|4   |5   |[5, 5, 5, 5]   |[4, 5, 4, 5]|[{5, 5}, {5, 5}, 
> {5, 5}, {5, 5}]|
> +---+---++++---+++
> {noformat}
> h3. *Observation*
>  * Works correctly on Spark 3.0.2
>  * When the UDF is registered as a Java UDF, it works as expected
>  * The UDF is called the appropriate number of times (regardless of whether the 
> UDF is marked as deterministic or non-deterministic).
>  * When debugged, the correct value is actually saved into the result array 
> at first, but processing each subsequent item overwrites the previous result 
> values as well. Therefore the last item's values, filling the whole array, are 
> the final result.
>  * When the UDF returns NULL/None it does not "overwrite" the prior array 
> values, nor is it "overwritten" by subsequent non-NULL values. See the following 
> UDF implementation:
> 

[jira] [Updated] (SPARK-35375) Use Jinja2 < 3.0.0 for Python linter dependency in GA

2021-05-11 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-35375:
---
Summary: Use Jinja2 < 3.0.0 for Python linter dependency in GA  (was: Use 
Jinja2 < 3.0.0 for linter dependency in GA)

> Use Jinja2 < 3.0.0 for Python linter dependency in GA
> -
>
> Key: SPARK-35375
> URL: https://issues.apache.org/jira/browse/SPARK-35375
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> From a few hours ago, Python linter fails in GA.
> The latest Jinja 3.0.0 seems to cause this failure.
> https://pypi.org/project/Jinja2/
> {code}
> Run ./dev/lint-python
> starting python compilation test...
> python compilation succeeded.
> starting pycodestyle test...
> pycodestyle checks passed.
> starting flake8 test...
> flake8 checks passed.
> starting mypy test...
> mypy checks passed.
> starting sphinx-build tests...
> sphinx-build checks failed:
> Running Sphinx v3.0.4
> making output directory... done
> [autosummary] generating autosummary for: development/contributing.rst, 
> development/debugging.rst, development/index.rst, 
> development/setting_ide.rst, development/testing.rst, 
> getting_started/index.rst, getting_started/install.rst, 
> getting_started/quickstart.ipynb, index.rst, migration_guide/index.rst, ..., 
> reference/pyspark.ml.rst, reference/pyspark.mllib.rst, 
> reference/pyspark.resource.rst, reference/pyspark.rst, 
> reference/pyspark.sql.rst, reference/pyspark.ss.rst, 
> reference/pyspark.streaming.rst, user_guide/arrow_pandas.rst, 
> user_guide/index.rst, user_guide/python_packaging.rst
> Exception occurred:
>   File 
> "/__w/spark/spark/python/docs/source/_templates/autosummary/class_with_docs.rst",
>  line 26, in top-level template code
> {% if '__init__' in methods %}
> jinja2.exceptions.UndefinedError: 'methods' is undefined
> The full traceback has been saved in /tmp/sphinx-err-ypgyi75y.log, if you 
> want to report the issue to the developers.
> Please also report this if it was a user error, so that a better error 
> message can be provided next time.
> A bug report can be filed in the tracker at 
> . Thanks!
> make: *** [Makefile:20: html] Error 2
> re-running make html to print full warning list:
> Running Sphinx v3.0.4
> making output directory... done
> [autosummary] generating autosummary for: development/contributing.rst, 
> development/debugging.rst, development/index.rst, 
> development/setting_ide.rst, development/testing.rst, 
> getting_started/index.rst, getting_started/install.rst, 
> getting_started/quickstart.ipynb, index.rst, migration_guide/index.rst, ..., 
> reference/pyspark.ml.rst, reference/pyspark.mllib.rst, 
> reference/pyspark.resource.rst, reference/pyspark.rst, 
> reference/pyspark.sql.rst, reference/pyspark.ss.rst, 
> reference/pyspark.streaming.rst, user_guide/arrow_pandas.rst, 
> user_guide/index.rst, user_guide/python_packaging.rst
> Exception occurred:
>   File 
> "/__w/spark/spark/python/docs/source/_templates/autosummary/class_with_docs.rst",
>  line 26, in top-level template code
> {% if '__init__' in methods %}
> jinja2.exceptions.UndefinedError: 'methods' is undefined
> The full traceback has been saved in /tmp/sphinx-err-fvtmvvwv.log, if you 
> want to report the issue to the developers.
> Please also report this if it was a user error, so that a better error 
> message can be provided next time.
> A bug report can be filed in the tracker at 
> . Thanks!
> make: *** [Makefile:20: html] Error 2
> Error: Process completed with exit code 2.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35375) Use Jinja2 < 3.0.0 for linter dependency in GA

2021-05-11 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-35375:
---
Description: 
From a few hours ago, Python linter fails in GA.
The latest Jinja 3.0.0 seems to cause this failure.
https://pypi.org/project/Jinja2/

{code}
Run ./dev/lint-python
starting python compilation test...
python compilation succeeded.

starting pycodestyle test...
pycodestyle checks passed.

starting flake8 test...
flake8 checks passed.

starting mypy test...
mypy checks passed.

starting sphinx-build tests...
sphinx-build checks failed:
Running Sphinx v3.0.4
making output directory... done
[autosummary] generating autosummary for: development/contributing.rst, 
development/debugging.rst, development/index.rst, development/setting_ide.rst, 
development/testing.rst, getting_started/index.rst, 
getting_started/install.rst, getting_started/quickstart.ipynb, index.rst, 
migration_guide/index.rst, ..., reference/pyspark.ml.rst, 
reference/pyspark.mllib.rst, reference/pyspark.resource.rst, 
reference/pyspark.rst, reference/pyspark.sql.rst, reference/pyspark.ss.rst, 
reference/pyspark.streaming.rst, user_guide/arrow_pandas.rst, 
user_guide/index.rst, user_guide/python_packaging.rst

Exception occurred:
  File 
"/__w/spark/spark/python/docs/source/_templates/autosummary/class_with_docs.rst",
 line 26, in top-level template code
{% if '__init__' in methods %}
jinja2.exceptions.UndefinedError: 'methods' is undefined
The full traceback has been saved in /tmp/sphinx-err-ypgyi75y.log, if you want 
to report the issue to the developers.
Please also report this if it was a user error, so that a better error message 
can be provided next time.
A bug report can be filed in the tracker at 
. Thanks!
make: *** [Makefile:20: html] Error 2

re-running make html to print full warning list:
Running Sphinx v3.0.4
making output directory... done
[autosummary] generating autosummary for: development/contributing.rst, 
development/debugging.rst, development/index.rst, development/setting_ide.rst, 
development/testing.rst, getting_started/index.rst, 
getting_started/install.rst, getting_started/quickstart.ipynb, index.rst, 
migration_guide/index.rst, ..., reference/pyspark.ml.rst, 
reference/pyspark.mllib.rst, reference/pyspark.resource.rst, 
reference/pyspark.rst, reference/pyspark.sql.rst, reference/pyspark.ss.rst, 
reference/pyspark.streaming.rst, user_guide/arrow_pandas.rst, 
user_guide/index.rst, user_guide/python_packaging.rst

Exception occurred:
  File 
"/__w/spark/spark/python/docs/source/_templates/autosummary/class_with_docs.rst",
 line 26, in top-level template code
{% if '__init__' in methods %}
jinja2.exceptions.UndefinedError: 'methods' is undefined
The full traceback has been saved in /tmp/sphinx-err-fvtmvvwv.log, if you want 
to report the issue to the developers.
Please also report this if it was a user error, so that a better error message 
can be provided next time.
A bug report can be filed in the tracker at 
. Thanks!
make: *** [Makefile:20: html] Error 2
Error: Process completed with exit code 2.
{code}

  was:
From a few hours ago, Python linter fails in GA.
The latest Jinja 3.0.0 seems to cause this failure.

{code}
Run ./dev/lint-python
starting python compilation test...
python compilation succeeded.

starting pycodestyle test...
pycodestyle checks passed.

starting flake8 test...
flake8 checks passed.

starting mypy test...
mypy checks passed.

starting sphinx-build tests...
sphinx-build checks failed:
Running Sphinx v3.0.4
making output directory... done
[autosummary] generating autosummary for: development/contributing.rst, 
development/debugging.rst, development/index.rst, development/setting_ide.rst, 
development/testing.rst, getting_started/index.rst, 
getting_started/install.rst, getting_started/quickstart.ipynb, index.rst, 
migration_guide/index.rst, ..., reference/pyspark.ml.rst, 
reference/pyspark.mllib.rst, reference/pyspark.resource.rst, 
reference/pyspark.rst, reference/pyspark.sql.rst, reference/pyspark.ss.rst, 
reference/pyspark.streaming.rst, user_guide/arrow_pandas.rst, 
user_guide/index.rst, user_guide/python_packaging.rst

Exception occurred:
  File 
"/__w/spark/spark/python/docs/source/_templates/autosummary/class_with_docs.rst",
 line 26, in top-level template code
{% if '__init__' in methods %}
jinja2.exceptions.UndefinedError: 'methods' is undefined
The full traceback has been saved in /tmp/sphinx-err-ypgyi75y.log, if you want 
to report the issue to the developers.
Please also report this if it was a user error, so that a better error message 
can be provided next time.
A bug report can be filed in the tracker at 
. Thanks!
make: *** [Makefile:20: html] Error 2

re-running make html to print full warning list:
Running Sphinx 

[jira] [Assigned] (SPARK-35375) Use Jinja2 < 3.0.0 for linter dependency in GA

2021-05-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35375:


Assignee: Kousuke Saruta  (was: Apache Spark)

> Use Jinja2 < 3.0.0 for linter dependency in GA
> --
>
> Key: SPARK-35375
> URL: https://issues.apache.org/jira/browse/SPARK-35375
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> From a few hours ago, Python linter fails in GA.
> The latest Jinja 3.0.0 seems to cause this failure.
> {code}
> Run ./dev/lint-python
> starting python compilation test...
> python compilation succeeded.
> starting pycodestyle test...
> pycodestyle checks passed.
> starting flake8 test...
> flake8 checks passed.
> starting mypy test...
> mypy checks passed.
> starting sphinx-build tests...
> sphinx-build checks failed:
> Running Sphinx v3.0.4
> making output directory... done
> [autosummary] generating autosummary for: development/contributing.rst, 
> development/debugging.rst, development/index.rst, 
> development/setting_ide.rst, development/testing.rst, 
> getting_started/index.rst, getting_started/install.rst, 
> getting_started/quickstart.ipynb, index.rst, migration_guide/index.rst, ..., 
> reference/pyspark.ml.rst, reference/pyspark.mllib.rst, 
> reference/pyspark.resource.rst, reference/pyspark.rst, 
> reference/pyspark.sql.rst, reference/pyspark.ss.rst, 
> reference/pyspark.streaming.rst, user_guide/arrow_pandas.rst, 
> user_guide/index.rst, user_guide/python_packaging.rst
> Exception occurred:
>   File 
> "/__w/spark/spark/python/docs/source/_templates/autosummary/class_with_docs.rst",
>  line 26, in top-level template code
> {% if '__init__' in methods %}
> jinja2.exceptions.UndefinedError: 'methods' is undefined
> The full traceback has been saved in /tmp/sphinx-err-ypgyi75y.log, if you 
> want to report the issue to the developers.
> Please also report this if it was a user error, so that a better error 
> message can be provided next time.
> A bug report can be filed in the tracker at 
> . Thanks!
> make: *** [Makefile:20: html] Error 2
> re-running make html to print full warning list:
> Running Sphinx v3.0.4
> making output directory... done
> [autosummary] generating autosummary for: development/contributing.rst, 
> development/debugging.rst, development/index.rst, 
> development/setting_ide.rst, development/testing.rst, 
> getting_started/index.rst, getting_started/install.rst, 
> getting_started/quickstart.ipynb, index.rst, migration_guide/index.rst, ..., 
> reference/pyspark.ml.rst, reference/pyspark.mllib.rst, 
> reference/pyspark.resource.rst, reference/pyspark.rst, 
> reference/pyspark.sql.rst, reference/pyspark.ss.rst, 
> reference/pyspark.streaming.rst, user_guide/arrow_pandas.rst, 
> user_guide/index.rst, user_guide/python_packaging.rst
> Exception occurred:
>   File 
> "/__w/spark/spark/python/docs/source/_templates/autosummary/class_with_docs.rst",
>  line 26, in top-level template code
> {% if '__init__' in methods %}
> jinja2.exceptions.UndefinedError: 'methods' is undefined
> The full traceback has been saved in /tmp/sphinx-err-fvtmvvwv.log, if you 
> want to report the issue to the developers.
> Please also report this if it was a user error, so that a better error 
> message can be provided next time.
> A bug report can be filed in the tracker at 
> . Thanks!
> make: *** [Makefile:20: html] Error 2
> Error: Process completed with exit code 2.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35375) Use Jinja2 < 3.0.0 for linter dependency in GA

2021-05-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35375:


Assignee: Apache Spark  (was: Kousuke Saruta)

> Use Jinja2 < 3.0.0 for linter dependency in GA
> --
>
> Key: SPARK-35375
> URL: https://issues.apache.org/jira/browse/SPARK-35375
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Major
>
> From a few hours ago, Python linter fails in GA.
> The latest Jinja 3.0.0 seems to cause this failure.
> {code}
> Run ./dev/lint-python
> starting python compilation test...
> python compilation succeeded.
> starting pycodestyle test...
> pycodestyle checks passed.
> starting flake8 test...
> flake8 checks passed.
> starting mypy test...
> mypy checks passed.
> starting sphinx-build tests...
> sphinx-build checks failed:
> Running Sphinx v3.0.4
> making output directory... done
> [autosummary] generating autosummary for: development/contributing.rst, 
> development/debugging.rst, development/index.rst, 
> development/setting_ide.rst, development/testing.rst, 
> getting_started/index.rst, getting_started/install.rst, 
> getting_started/quickstart.ipynb, index.rst, migration_guide/index.rst, ..., 
> reference/pyspark.ml.rst, reference/pyspark.mllib.rst, 
> reference/pyspark.resource.rst, reference/pyspark.rst, 
> reference/pyspark.sql.rst, reference/pyspark.ss.rst, 
> reference/pyspark.streaming.rst, user_guide/arrow_pandas.rst, 
> user_guide/index.rst, user_guide/python_packaging.rst
> Exception occurred:
>   File 
> "/__w/spark/spark/python/docs/source/_templates/autosummary/class_with_docs.rst",
>  line 26, in top-level template code
> {% if '__init__' in methods %}
> jinja2.exceptions.UndefinedError: 'methods' is undefined
> The full traceback has been saved in /tmp/sphinx-err-ypgyi75y.log, if you 
> want to report the issue to the developers.
> Please also report this if it was a user error, so that a better error 
> message can be provided next time.
> A bug report can be filed in the tracker at 
> . Thanks!
> make: *** [Makefile:20: html] Error 2
> re-running make html to print full warning list:
> Running Sphinx v3.0.4
> making output directory... done
> [autosummary] generating autosummary for: development/contributing.rst, 
> development/debugging.rst, development/index.rst, 
> development/setting_ide.rst, development/testing.rst, 
> getting_started/index.rst, getting_started/install.rst, 
> getting_started/quickstart.ipynb, index.rst, migration_guide/index.rst, ..., 
> reference/pyspark.ml.rst, reference/pyspark.mllib.rst, 
> reference/pyspark.resource.rst, reference/pyspark.rst, 
> reference/pyspark.sql.rst, reference/pyspark.ss.rst, 
> reference/pyspark.streaming.rst, user_guide/arrow_pandas.rst, 
> user_guide/index.rst, user_guide/python_packaging.rst
> Exception occurred:
>   File 
> "/__w/spark/spark/python/docs/source/_templates/autosummary/class_with_docs.rst",
>  line 26, in top-level template code
> {% if '__init__' in methods %}
> jinja2.exceptions.UndefinedError: 'methods' is undefined
> The full traceback has been saved in /tmp/sphinx-err-fvtmvvwv.log, if you 
> want to report the issue to the developers.
> Please also report this if it was a user error, so that a better error 
> message can be provided next time.
> A bug report can be filed in the tracker at 
> . Thanks!
> make: *** [Makefile:20: html] Error 2
> Error: Process completed with exit code 2.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35375) Use Jinja2 < 3.0.0 for linter dependency in GA

2021-05-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342926#comment-17342926
 ] 

Apache Spark commented on SPARK-35375:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/32509

> Use Jinja2 < 3.0.0 for linter dependency in GA
> --
>
> Key: SPARK-35375
> URL: https://issues.apache.org/jira/browse/SPARK-35375
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> From a few hours ago, Python linter fails in GA.
> The latest Jinja 3.0.0 seems to cause this failure.
> {code}
> Run ./dev/lint-python
> starting python compilation test...
> python compilation succeeded.
> starting pycodestyle test...
> pycodestyle checks passed.
> starting flake8 test...
> flake8 checks passed.
> starting mypy test...
> mypy checks passed.
> starting sphinx-build tests...
> sphinx-build checks failed:
> Running Sphinx v3.0.4
> making output directory... done
> [autosummary] generating autosummary for: development/contributing.rst, 
> development/debugging.rst, development/index.rst, 
> development/setting_ide.rst, development/testing.rst, 
> getting_started/index.rst, getting_started/install.rst, 
> getting_started/quickstart.ipynb, index.rst, migration_guide/index.rst, ..., 
> reference/pyspark.ml.rst, reference/pyspark.mllib.rst, 
> reference/pyspark.resource.rst, reference/pyspark.rst, 
> reference/pyspark.sql.rst, reference/pyspark.ss.rst, 
> reference/pyspark.streaming.rst, user_guide/arrow_pandas.rst, 
> user_guide/index.rst, user_guide/python_packaging.rst
> Exception occurred:
>   File 
> "/__w/spark/spark/python/docs/source/_templates/autosummary/class_with_docs.rst",
>  line 26, in top-level template code
> {% if '__init__' in methods %}
> jinja2.exceptions.UndefinedError: 'methods' is undefined
> The full traceback has been saved in /tmp/sphinx-err-ypgyi75y.log, if you 
> want to report the issue to the developers.
> Please also report this if it was a user error, so that a better error 
> message can be provided next time.
> A bug report can be filed in the tracker at 
> . Thanks!
> make: *** [Makefile:20: html] Error 2
> re-running make html to print full warning list:
> Running Sphinx v3.0.4
> making output directory... done
> [autosummary] generating autosummary for: development/contributing.rst, 
> development/debugging.rst, development/index.rst, 
> development/setting_ide.rst, development/testing.rst, 
> getting_started/index.rst, getting_started/install.rst, 
> getting_started/quickstart.ipynb, index.rst, migration_guide/index.rst, ..., 
> reference/pyspark.ml.rst, reference/pyspark.mllib.rst, 
> reference/pyspark.resource.rst, reference/pyspark.rst, 
> reference/pyspark.sql.rst, reference/pyspark.ss.rst, 
> reference/pyspark.streaming.rst, user_guide/arrow_pandas.rst, 
> user_guide/index.rst, user_guide/python_packaging.rst
> Exception occurred:
>   File 
> "/__w/spark/spark/python/docs/source/_templates/autosummary/class_with_docs.rst",
>  line 26, in top-level template code
> {% if '__init__' in methods %}
> jinja2.exceptions.UndefinedError: 'methods' is undefined
> The full traceback has been saved in /tmp/sphinx-err-fvtmvvwv.log, if you 
> want to report the issue to the developers.
> Please also report this if it was a user error, so that a better error 
> message can be provided next time.
> A bug report can be filed in the tracker at 
> . Thanks!
> make: *** [Makefile:20: html] Error 2
> Error: Process completed with exit code 2.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35375) Use Jinja2 < 3.0.0 for linter dependency in GA

2021-05-11 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-35375:
---
Description: 
From a few hours ago, Python linter fails in GA.
The latest Jinja 3.0.0 seems to cause this failure.

{code}
Run ./dev/lint-python
starting python compilation test...
python compilation succeeded.

starting pycodestyle test...
pycodestyle checks passed.

starting flake8 test...
flake8 checks passed.

starting mypy test...
mypy checks passed.

starting sphinx-build tests...
sphinx-build checks failed:
Running Sphinx v3.0.4
making output directory... done
[autosummary] generating autosummary for: development/contributing.rst, 
development/debugging.rst, development/index.rst, development/setting_ide.rst, 
development/testing.rst, getting_started/index.rst, 
getting_started/install.rst, getting_started/quickstart.ipynb, index.rst, 
migration_guide/index.rst, ..., reference/pyspark.ml.rst, 
reference/pyspark.mllib.rst, reference/pyspark.resource.rst, 
reference/pyspark.rst, reference/pyspark.sql.rst, reference/pyspark.ss.rst, 
reference/pyspark.streaming.rst, user_guide/arrow_pandas.rst, 
user_guide/index.rst, user_guide/python_packaging.rst

Exception occurred:
  File 
"/__w/spark/spark/python/docs/source/_templates/autosummary/class_with_docs.rst",
 line 26, in top-level template code
{% if '__init__' in methods %}
jinja2.exceptions.UndefinedError: 'methods' is undefined
The full traceback has been saved in /tmp/sphinx-err-ypgyi75y.log, if you want 
to report the issue to the developers.
Please also report this if it was a user error, so that a better error message 
can be provided next time.
A bug report can be filed in the tracker at 
. Thanks!
make: *** [Makefile:20: html] Error 2

re-running make html to print full warning list:
Running Sphinx v3.0.4
making output directory... done
[autosummary] generating autosummary for: development/contributing.rst, 
development/debugging.rst, development/index.rst, development/setting_ide.rst, 
development/testing.rst, getting_started/index.rst, 
getting_started/install.rst, getting_started/quickstart.ipynb, index.rst, 
migration_guide/index.rst, ..., reference/pyspark.ml.rst, 
reference/pyspark.mllib.rst, reference/pyspark.resource.rst, 
reference/pyspark.rst, reference/pyspark.sql.rst, reference/pyspark.ss.rst, 
reference/pyspark.streaming.rst, user_guide/arrow_pandas.rst, 
user_guide/index.rst, user_guide/python_packaging.rst

Exception occurred:
  File 
"/__w/spark/spark/python/docs/source/_templates/autosummary/class_with_docs.rst",
 line 26, in top-level template code
{% if '__init__' in methods %}
jinja2.exceptions.UndefinedError: 'methods' is undefined
The full traceback has been saved in /tmp/sphinx-err-fvtmvvwv.log, if you want 
to report the issue to the developers.
Please also report this if it was a user error, so that a better error message 
can be provided next time.
A bug report can be filed in the tracker at 
. Thanks!
make: *** [Makefile:20: html] Error 2
Error: Process completed with exit code 2.
{code}

  was:
From a few hours ago, Python linter fails in GA.

The latest Jinja 3.0.0 seems to cause this failure.


> Use Jinja2 < 3.0.0 for linter dependency in GA
> --
>
> Key: SPARK-35375
> URL: https://issues.apache.org/jira/browse/SPARK-35375
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> Since a few hours ago, the Python linter has been failing in GA.
> The latest Jinja2 release, 3.0.0, seems to cause this failure.
> {code}
> Run ./dev/lint-python
> starting python compilation test...
> python compilation succeeded.
> starting pycodestyle test...
> pycodestyle checks passed.
> starting flake8 test...
> flake8 checks passed.
> starting mypy test...
> mypy checks passed.
> starting sphinx-build tests...
> sphinx-build checks failed:
> Running Sphinx v3.0.4
> making output directory... done
> [autosummary] generating autosummary for: development/contributing.rst, 
> development/debugging.rst, development/index.rst, 
> development/setting_ide.rst, development/testing.rst, 
> getting_started/index.rst, getting_started/install.rst, 
> getting_started/quickstart.ipynb, index.rst, migration_guide/index.rst, ..., 
> reference/pyspark.ml.rst, reference/pyspark.mllib.rst, 
> reference/pyspark.resource.rst, reference/pyspark.rst, 
> reference/pyspark.sql.rst, reference/pyspark.ss.rst, 
> reference/pyspark.streaming.rst, user_guide/arrow_pandas.rst, 
> user_guide/index.rst, user_guide/python_packaging.rst
> Exception occurred:
>   File 
> 

[jira] [Created] (SPARK-35375) Use Jinja2 < 3.0.0 for linter dependency in GA

2021-05-11 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-35375:
--

 Summary: Use Jinja2 < 3.0.0 for linter dependency in GA
 Key: SPARK-35375
 URL: https://issues.apache.org/jira/browse/SPARK-35375
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Affects Versions: 3.2.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


Since a few hours ago, the Python linter has been failing in GA.

The latest Jinja2 release, 3.0.0, seems to cause this failure.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35361) Improve performance for ApplyFunctionExpression

2021-05-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342793#comment-17342793
 ] 

Apache Spark commented on SPARK-35361:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/32507

> Improve performance for ApplyFunctionExpression
> ---
>
> Key: SPARK-35361
> URL: https://issues.apache.org/jira/browse/SPARK-35361
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Minor
>
> When the `ScalarFunction` itself is trivial, `ApplyFunctionExpression` could 
> incur significant runtime cost from the `zipWithIndex` call. This issue proposes 
> moving the call outside the loop over each input row.
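
The snippet below is a hypothetical Scala illustration of the hoisting idea 
described above, not the actual `ApplyFunctionExpression` code: the 
`zipWithIndex` pairing of the children is computed once up front instead of 
being rebuilt for every input row. `Child`, `evalChild`, and the row 
representation are made-up stand-ins for the real expression machinery.

{code:scala}
// Hypothetical sketch: hoist zipWithIndex out of the per-row evaluation path.
case class Child(name: String)
def evalChild(c: Child, row: Map[String, Int]): Int = row(c.name)

val children: Seq[Child] = Seq(Child("a"), Child("b"))

// Before: the (child, index) pairing is rebuilt for every row.
def applySlow(row: Map[String, Int]): Seq[Int] =
  children.zipWithIndex.map { case (c, _) => evalChild(c, row) }

// After: the pairing is computed once and reused for each row.
val indexedChildren = children.zipWithIndex
def applyFast(row: Map[String, Int]): Seq[Int] =
  indexedChildren.map { case (c, _) => evalChild(c, row) }

// Both variants produce the same result; only the per-row allocation differs.
val rows = Seq(Map("a" -> 1, "b" -> 2), Map("a" -> 3, "b" -> 4))
rows.foreach(r => assert(applySlow(r) == applyFast(r)))
{code}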



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35361) Improve performance for ApplyFunctionExpression

2021-05-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35361:


Assignee: (was: Apache Spark)

> Improve performance for ApplyFunctionExpression
> ---
>
> Key: SPARK-35361
> URL: https://issues.apache.org/jira/browse/SPARK-35361
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Minor
>
> When the `ScalarFunction` itself is trivial, `ApplyFunctionExpression` could 
> incur significant runtime cost from the `zipWithIndex` call. This issue proposes 
> moving the call outside the loop over each input row.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35361) Improve performance for ApplyFunctionExpression

2021-05-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35361:


Assignee: Apache Spark

> Improve performance for ApplyFunctionExpression
> ---
>
> Key: SPARK-35361
> URL: https://issues.apache.org/jira/browse/SPARK-35361
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Assignee: Apache Spark
>Priority: Minor
>
> When the `ScalarFunction` itself is trivial, `ApplyFunctionExpression` could 
> incur significant runtime cost from the `zipWithIndex` call. This issue proposes 
> moving the call outside the loop over each input row.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35361) Improve performance for ApplyFunctionExpression

2021-05-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342792#comment-17342792
 ] 

Apache Spark commented on SPARK-35361:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/32507

> Improve performance for ApplyFunctionExpression
> ---
>
> Key: SPARK-35361
> URL: https://issues.apache.org/jira/browse/SPARK-35361
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Minor
>
> When the `ScalarFunction` itself is trivial, `ApplyFunctionExpression` could 
> incur significant runtime cost from the `zipWithIndex` call. This issue proposes 
> moving the call outside the loop over each input row.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34205) Add pipe API to Dataset

2021-05-11 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342788#comment-17342788
 ] 

L. C. Hsieh commented on SPARK-34205:
-

This was basically driven by our internal customer use case. The customer has now 
turned to implementing the external program as a Java library.

I think this is a useful feature and also a missing piece in Spark Structured 
Streaming. But given the community hurdles in pushing it forward, and since the 
internal requirement for the feature has gone away, I am going to close this for 
now.


> Add pipe API to Dataset
> ---
>
> Key: SPARK-34205
> URL: https://issues.apache.org/jira/browse/SPARK-34205
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL, Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> Dataset doesn't have a pipe API but RDD has one. Although for a normal Dataset a 
> user can convert the Dataset to an RDD and call RDD.pipe, for a streaming Dataset 
> that is not possible. So the requirement actually comes from Structured 
> Streaming, but we need to add a pipe API to Dataset to enable it in 
> Structured Streaming.
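
As a hedged illustration of the batch-only workaround mentioned in the 
description (converting the Dataset to an RDD and calling RDD.pipe), assuming a 
POSIX `cat` command is available where the tasks run:

{code:scala}
// Minimal sketch of the batch workaround: Dataset -> RDD -> RDD.pipe.
// "cat" is just a placeholder external command that echoes its stdin.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("pipe-sketch").getOrCreate()
import spark.implicits._

val ds = Seq("a", "b", "c").toDS()
// RDD.pipe writes each element to the process's stdin and returns its stdout lines.
val piped = ds.rdd.pipe("cat").collect()
piped.foreach(println)
spark.stop()
{code}

This path is not available for a streaming Dataset, which is exactly the gap the 
issue describes.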



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34205) Add pipe API to Dataset

2021-05-11 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh resolved SPARK-34205.
-
Resolution: Won't Fix

> Add pipe API to Dataset
> ---
>
> Key: SPARK-34205
> URL: https://issues.apache.org/jira/browse/SPARK-34205
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL, Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> Dataset doesn't have a pipe API but RDD has one. Although for a normal Dataset a 
> user can convert the Dataset to an RDD and call RDD.pipe, for a streaming Dataset 
> that is not possible. So the requirement actually comes from Structured 
> Streaming, but we need to add a pipe API to Dataset to enable it in 
> Structured Streaming.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35372) JDK 11 compilation failure due to StackOverflowError

2021-05-11 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta resolved SPARK-35372.

Fix Version/s: 3.2.0
 Assignee: Hyukjin Kwon
   Resolution: Fixed

This issue was resolved in https://github.com/apache/spark/pull/32502.

> JDK 11 compilation failure due to StackOverflowError
> 
>
> Key: SPARK-35372
> URL: https://issues.apache.org/jira/browse/SPARK-35372
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.2.0
>
>
> See https://github.com/apache/spark/runs/2554067779
> {code}
> java.lang.StackOverflowError
> scala.reflect.internal.Trees$UnderConstructionTransformer.transform(Trees.scala:1741)
> scala.reflect.internal.Trees$UnderConstructionTransformer.transform$(Trees.scala:1740)
> scala.tools.nsc.transform.ExplicitOuter$OuterPathTransformer.transform(ExplicitOuter.scala:289)
> scala.tools.nsc.transform.ExplicitOuter$ExplicitOuterTransformer.transform(ExplicitOuter.scala:477)
> scala.tools.nsc.transform.ExplicitOuter$ExplicitOuterTransformer.transform(ExplicitOuter.scala:330)
> scala.reflect.api.Trees$Transformer.$anonfun$transformStats$1(Trees.scala:2597)
> scala.reflect.api.Trees$Transformer.transformStats(Trees.scala:2595)
> scala.reflect.internal.Trees.itransform(Trees.scala:1404)
> scala.reflect.internal.Trees.itransform$(Trees.scala:1374)
> scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
> scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
> scala.reflect.api.Trees$Transformer.transform(Trees.scala:2563)
> scala.tools.nsc.transform.TypingTransformers$TypingTransformer.transform(TypingTransformers.scala:51)
> scala.tools.nsc.transform.ExplicitOuter$OuterPathTransformer.scala$reflect$internal$Trees$UnderConstructionTransformer$$super$transform(ExplicitOuter.scala:212)
> scala.reflect.internal.Trees$UnderConstructionTransformer.transform(Trees.scala:1745)
> scala.reflect.internal.Trees$UnderConstructionTransformer.transform$(Trees.scala:1740)
> scala.tools.nsc.transform.ExplicitOuter$OuterPathTransformer.transform(ExplicitOuter.scala:289)
> scala.tools.nsc.transform.ExplicitOuter$ExplicitOuterTransformer.transform(ExplicitOuter.scala:477)
> scala.tools.nsc.transform.ExplicitOuter$ExplicitOuterTransformer.transform(ExplicitOuter.scala:330)
> scala.reflect.internal.Trees.itransform(Trees.scala:1383)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35374) Add string-to-number conversion support to JacksonParser

2021-05-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342733#comment-17342733
 ] 

Apache Spark commented on SPARK-35374:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/32506

> Add string-to-number conversion support to JacksonParser
> 
>
> Key: SPARK-35374
> URL: https://issues.apache.org/jira/browse/SPARK-35374
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.7, 3.1.1, 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> In the current implementation, spark.read.json doesn't convert numbers 
> represented as strings even when the types are specified by the schema.
> Here is an example.
> {code}
> $ cat test.json
> { "value": "foo" }
> { "value": "100" }
> { "value": "257" }
> { "value": "32768" }
> { "value": "2147483648" }
> { "value": "2.71f" }
> { "value": "1.0E100" }
> $ bin/spark-shell
> import org.apache.spark.sql.types._
> val df = spark.read.
>   schema(StructType(StructField("value", DoubleType)::Nil)).json("test.json")
> df.show
> +-----+
> |value|
> +-----+
> | null|
> | null|
> | null|
> | null|
> | null|
> | null|
> | null|
> +-----+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35374) Add string-to-number conversion support to JacksonParser

2021-05-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35374:


Assignee: Kousuke Saruta  (was: Apache Spark)

> Add string-to-number conversion support to JacksonParser
> 
>
> Key: SPARK-35374
> URL: https://issues.apache.org/jira/browse/SPARK-35374
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.7, 3.1.1, 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> In the current implementation, spark.read.json doesn't convert numbers 
> represented as strings even when the types are specified by the schema.
> Here is an example.
> {code}
> $ cat test.json
> { "value": "foo" }
> { "value": "100" }
> { "value": "257" }
> { "value": "32768" }
> { "value": "2147483648" }
> { "value": "2.71f" }
> { "value": "1.0E100" }
> $ bin/spark-shell
> import org.apache.spark.sql.types._
> val df = spark.read.
>   schema(StructType(StructField("value", DoubleType)::Nil)).json("test.json")
> df.show
> +-----+
> |value|
> +-----+
> | null|
> | null|
> | null|
> | null|
> | null|
> | null|
> | null|
> +-----+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35374) Add string-to-number conversion support to JacksonParser

2021-05-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35374:


Assignee: Apache Spark  (was: Kousuke Saruta)

> Add string-to-number conversion support to JacksonParser
> 
>
> Key: SPARK-35374
> URL: https://issues.apache.org/jira/browse/SPARK-35374
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.7, 3.1.1, 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Minor
>
> In the current implementation, spark.read.json doesn't convert numbers 
> represented as strings even when the types are specified by the schema.
> Here is an example.
> {code}
> $ cat test.json
> { "value": "foo" }
> { "value": "100" }
> { "value": "257" }
> { "value": "32768" }
> { "value": "2147483648" }
> { "value": "2.71f" }
> { "value": "1.0E100" }
> $ bin/spark-shell
> import org.apache.spark.sql.types._
> val df = spark.read.
>   schema(StructType(StructField("value", DoubleType)::Nil)).json("test.json")
> df.show
> +-----+
> |value|
> +-----+
> | null|
> | null|
> | null|
> | null|
> | null|
> | null|
> | null|
> +-----+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35374) Add string-to-number conversion support to JacksonParser

2021-05-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342732#comment-17342732
 ] 

Apache Spark commented on SPARK-35374:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/32506

> Add string-to-number conversion support to JacksonParser
> 
>
> Key: SPARK-35374
> URL: https://issues.apache.org/jira/browse/SPARK-35374
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.7, 3.1.1, 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> In the current implementation, spark.read.json doesn't convert numbers 
> represented as strings even when the types are specified by the schema.
> Here is an example.
> {code}
> $ cat test.json
> { "value": "foo" }
> { "value": "100" }
> { "value": "257" }
> { "value": "32768" }
> { "value": "2147483648" }
> { "value": "2.71f" }
> { "value": "1.0E100" }
> $ bin/spark-shell
> import org.apache.spark.sql.types._
> val df = spark.read.
>   schema(StructType(StructField("value", DoubleType)::Nil)).json("test.json")
> df.show
> +-----+
> |value|
> +-----+
> | null|
> | null|
> | null|
> | null|
> | null|
> | null|
> | null|
> +-----+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35374) Add string-to-number conversion support to JacksonParser

2021-05-11 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-35374:
--

 Summary: Add string-to-number conversion support to JacksonParser
 Key: SPARK-35374
 URL: https://issues.apache.org/jira/browse/SPARK-35374
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.1, 2.4.7, 3.2.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


In the current implementation, spark.read.json doesn't convert numbers 
represented as strings even when the types are specified by the schema.
Here is an example.
{code}
$ cat test.json
{ "value": "foo" }
{ "value": "100" }
{ "value": "257" }
{ "value": "32768" }
{ "value": "2147483648" }
{ "value": "2.71f" }
{ "value": "1.0E100" }
$ bin/spark-shell
import org.apache.spark.sql.types._
val df = spark.read.
  schema(StructType(StructField("value", DoubleType)::Nil)).json("test.json")
df.show
+-----+
|value|
+-----+
| null|
| null|
| null|
| null|
| null|
| null|
| null|
+-----+
{code}
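
A hedged sketch of the current workaround, not the proposed JacksonParser change: 
read the field as a string and cast it afterwards; the proposal would make the 
explicit cast unnecessary. The file name simply reuses `test.json` from the 
example above.

{code:scala}
// Read the field as StringType first, then cast it to DoubleType explicitly.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").appName("json-cast-sketch").getOrCreate()

val asString = spark.read.
  schema(StructType(StructField("value", StringType) :: Nil)).json("test.json")
// cast turns numeric strings into doubles; non-numeric strings such as "foo" become null.
val asDouble = asString.withColumn("value", col("value").cast(DoubleType))
asDouble.show()
{code}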



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35373) Verify checksums of downloaded artifacts in build/mvn

2021-05-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35373:


Assignee: Apache Spark  (was: Sean R. Owen)

> Verify checksums of downloaded artifacts in build/mvn
> -
>
> Key: SPARK-35373
> URL: https://issues.apache.org/jira/browse/SPARK-35373
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.7, 3.0.2, 3.1.1
>Reporter: Sean R. Owen
>Assignee: Apache Spark
>Priority: Minor
>
> build/mvn is a convenience script that will automatically download Maven (and 
> Scala) if not already present. While it downloads from official ASF mirrors, 
> it does not check the checksum of the artifact, which is available as a 
> .sha512 file from ASF servers.
> The risk of a supply chain attack is a bit less theoretical here than usual, 
> because artifacts are downloaded from any of several mirrors worldwide, and 
> injecting a malicious copy of Maven in any one of them might be simpler and 
> less noticeable than injecting it into ASF servers.
> (Note, Scala's download site does not seem to provide a checksum. They do all 
> come from Lightbend, at least, not N mirrors. Not much we can do there.)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35373) Verify checksums of downloaded artifacts in build/mvn

2021-05-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342709#comment-17342709
 ] 

Apache Spark commented on SPARK-35373:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/32505

> Verify checksums of downloaded artifacts in build/mvn
> -
>
> Key: SPARK-35373
> URL: https://issues.apache.org/jira/browse/SPARK-35373
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.7, 3.0.2, 3.1.1
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Minor
>
> build/mvn is a convenience script that will automatically download Maven (and 
> Scala) if not already present. While it downloads from official ASF mirrors, 
> it does not check the checksum of the artifact, which is available as a 
> .sha512 file from ASF servers.
> The risk of a supply chain attack is a bit less theoretical here than usual, 
> because artifacts are downloaded from any of several mirrors worldwide, and 
> injecting a malicious copy of Maven in any one of them might be simpler and 
> less noticeable than injecting it into ASF servers.
> (Note, Scala's download site does not seem to provide a checksum. They do all 
> come from Lightbend, at least, not N mirrors. Not much we can do there.)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35373) Verify checksums of downloaded artifacts in build/mvn

2021-05-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35373:


Assignee: Apache Spark  (was: Sean R. Owen)

> Verify checksums of downloaded artifacts in build/mvn
> -
>
> Key: SPARK-35373
> URL: https://issues.apache.org/jira/browse/SPARK-35373
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.7, 3.0.2, 3.1.1
>Reporter: Sean R. Owen
>Assignee: Apache Spark
>Priority: Minor
>
> build/mvn is a convenience script that will automatically download Maven (and 
> Scala) if not already present. While it downloads from official ASF mirrors, 
> it does not check the checksum of the artifact, which is available as a 
> .sha512 file from ASF servers.
> The risk of a supply chain attack is a bit less theoretical here than usual, 
> because artifacts are downloaded from any of several mirrors worldwide, and 
> injecting a malicious copy of Maven in any one of them might be simpler and 
> less noticeable than injecting it into ASF servers.
> (Note, Scala's download site does not seem to provide a checksum. They do all 
> come from Lightbend, at least, not N mirrors. Not much we can do there.)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35373) Verify checksums of downloaded artifacts in build/mvn

2021-05-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35373:


Assignee: Sean R. Owen  (was: Apache Spark)

> Verify checksums of downloaded artifacts in build/mvn
> -
>
> Key: SPARK-35373
> URL: https://issues.apache.org/jira/browse/SPARK-35373
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.7, 3.0.2, 3.1.1
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Minor
>
> build/mvn is a convenience script that will automatically download Maven (and 
> Scala) if not already present. While it downloads from official ASF mirrors, 
> it does not check the checksum of the artifact, which is available as a 
> .sha512 file from ASF servers.
> The risk of a supply chain attack is a bit less theoretical here than usual, 
> because artifacts are downloaded from any of several mirrors worldwide, and 
> injecting a malicious copy of Maven in any one of them might be simpler and 
> less noticeable than injecting it into ASF servers.
> (Note, Scala's download site does not seem to provide a checksum. They do all 
> come from Lightbend, at least, not N mirrors. Not much we can do there.)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35373) Verify checksums of downloaded artifacts in build/mvn

2021-05-11 Thread Sean R. Owen (Jira)
Sean R. Owen created SPARK-35373:


 Summary: Verify checksums of downloaded artifacts in build/mvn
 Key: SPARK-35373
 URL: https://issues.apache.org/jira/browse/SPARK-35373
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.1.1, 3.0.2, 2.4.7
Reporter: Sean R. Owen
Assignee: Sean R. Owen


build/mvn is a convenience script that will automatically download Maven (and 
Scala) if not already present. While it downloads from official ASF mirrors, it 
does not check the checksum of the artifact, which is available as a .sha512 
file from ASF servers.

The risk of a supply chain attack is a bit less theoretical here than usual, 
because artifacts are downloaded from any of several mirrors worldwide, and 
injecting a malicious copy of Maven in any one of them might be simpler and 
less noticeable than injecting it into ASF servers.

(Note, Scala's download site does not seem to provide a checksum. They do all 
come from Lightbend, at least, not N mirrors. Not much we can do there.)
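
The sketch below illustrates the verification idea in Scala rather than in the 
build/mvn shell script itself: compute the SHA-512 digest of a downloaded file 
and compare it with the published .sha512 value. The file names are placeholders, 
and the assumption that the .sha512 file begins with the hex digest is only for 
illustration.

{code:scala}
import java.nio.file.{Files, Paths}
import java.security.MessageDigest

// Compute the SHA-512 digest of a file as a lowercase hex string.
def sha512Hex(path: String): String = {
  val bytes = Files.readAllBytes(Paths.get(path))
  MessageDigest.getInstance("SHA-512").digest(bytes).map("%02x".format(_)).mkString
}

val published = new String(Files.readAllBytes(Paths.get("apache-maven-bin.tar.gz.sha512")))
  .trim.split("\\s+").head               // keep only the digest if a file name follows
val actual = sha512Hex("apache-maven-bin.tar.gz")
require(actual.equalsIgnoreCase(published), s"Checksum mismatch: expected $published, got $actual")
{code}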



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35013) Spark allows to set spark.driver.cores=0

2021-05-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342645#comment-17342645
 ] 

Apache Spark commented on SPARK-35013:
--

User 'shahidki31' has created a pull request for this issue:
https://github.com/apache/spark/pull/32504

> Spark allows to set spark.driver.cores=0
> 
>
> Key: SPARK-35013
> URL: https://issues.apache.org/jira/browse/SPARK-35013
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.7, 3.1.1
>Reporter: Oleg Lypkan
>Priority: Minor
>
> I found an inconsistency in [validation logic of Spark submit arguments 
> |https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L248-L258]that
>  allows *spark.driver.cores* value to be set to 0 but requires 
> *spark.driver.memory,* *spark.executor.cores, spark.executor.memory* to be 
> positive numbers:
> {quote}Exception in thread "main" org.apache.spark.SparkException: Driver 
> memory must be a positive number
>  Exception in thread "main" org.apache.spark.SparkException: Executor cores 
> must be a positive number
>  Exception in thread "main" org.apache.spark.SparkException: Executor memory 
> must be a positive number
> {quote}
> I would like to understand whether there is a reason for this inconsistency in 
> the validation logic or whether it is a bug.
> Thank you
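
A hypothetical sketch of the kind of uniform check the reporter is asking about, 
applying the same positivity rule to spark.driver.cores that is already applied 
to the other settings. This is illustrative code only, not the actual 
SparkSubmitArguments logic, and it uses IllegalArgumentException rather than 
Spark's own exception type.

{code:scala}
// Illustrative only: reject non-positive values for a numeric setting.
def requirePositive(name: String, value: String): Unit = {
  if (value != null && value.trim.toInt <= 0) {
    throw new IllegalArgumentException(s"$name must be a positive number")
  }
}

requirePositive("spark.executor.cores", "2")   // passes
requirePositive("spark.driver.cores", "0")     // would now fail instead of being accepted
{code}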



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35013) Spark allows to set spark.driver.cores=0

2021-05-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35013:


Assignee: (was: Apache Spark)

> Spark allows to set spark.driver.cores=0
> 
>
> Key: SPARK-35013
> URL: https://issues.apache.org/jira/browse/SPARK-35013
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.7, 3.1.1
>Reporter: Oleg Lypkan
>Priority: Minor
>
> I found an inconsistency in [validation logic of Spark submit arguments 
> |https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L248-L258]that
>  allows *spark.driver.cores* value to be set to 0 but requires 
> *spark.driver.memory,* *spark.executor.cores, spark.executor.memory* to be 
> positive numbers:
> {quote}Exception in thread "main" org.apache.spark.SparkException: Driver 
> memory must be a positive number
>  Exception in thread "main" org.apache.spark.SparkException: Executor cores 
> must be a positive number
>  Exception in thread "main" org.apache.spark.SparkException: Executor memory 
> must be a positive number
> {quote}
> I would like to understand whether there is a reason for this inconsistency in 
> the validation logic or whether it is a bug.
> Thank you



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35013) Spark allows to set spark.driver.cores=0

2021-05-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35013:


Assignee: Apache Spark

> Spark allows to set spark.driver.cores=0
> 
>
> Key: SPARK-35013
> URL: https://issues.apache.org/jira/browse/SPARK-35013
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.7, 3.1.1
>Reporter: Oleg Lypkan
>Assignee: Apache Spark
>Priority: Minor
>
> I found an inconsistency in [validation logic of Spark submit arguments 
> |https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L248-L258]that
>  allows *spark.driver.cores* value to be set to 0 but requires 
> *spark.driver.memory,* *spark.executor.cores, spark.executor.memory* to be 
> positive numbers:
> {quote}Exception in thread "main" org.apache.spark.SparkException: Driver 
> memory must be a positive number
>  Exception in thread "main" org.apache.spark.SparkException: Executor cores 
> must be a positive number
>  Exception in thread "main" org.apache.spark.SparkException: Executor memory 
> must be a positive number
> {quote}
> I would like to understand whether there is a reason for this inconsistency in 
> the validation logic or whether it is a bug.
> Thank you



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35370) IllegalArgumentException when loading a PipelineModel with Spark 3

2021-05-11 Thread Avenash Kabeera (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342559#comment-17342559
 ] 

Avenash Kabeera commented on SPARK-35370:
-

Some additional details. My model was saved in Parquet format, and after some 
research I found that Parquet columns should be treated as case-insensitive. I 
confirmed this by trying a workaround: load my model, rename the column to 
"nodeData", and resave it, but everything I tried ended up saving the model with 
the column "nodedata". Given this case-insensitivity, doesn't it make more sense 
that the fix mentioned above to support loading Spark 2 models should be checking 
for "nodedata", not "nodeData"?

> IllegalArgumentException when loading a PipelineModel with Spark 3
> --
>
> Key: SPARK-35370
> URL: https://issues.apache.org/jira/browse/SPARK-35370
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 3.1.0, 3.1.1
> Environment: spark 3.1.1
>Reporter: Avenash Kabeera
>Priority: Minor
>  Labels: V3, decisiontree, scala, treemodels
>
> Hi, 
> This is a followup of this issue, 
> https://issues.apache.org/jira/browse/SPARK-33398, which fixed an exception 
> when loading a model in Spark 3 that was trained in Spark 2. After incorporating 
> that fix in my project, I ran into another issue which was introduced by the 
> fix [https://github.com/apache/spark/pull/30889/files].
> While loading my random forest model, which was trained in Spark 2.2, I ran 
> into the following exception:
> {code:java}
> 16:03:34 ERROR Instrumentation:73 - java.lang.IllegalArgumentException: 
> nodeData does not exist. Available: treeid, nodedata
>  at 
> org.apache.spark.sql.types.StructType.$anonfun$apply$1(StructType.scala:278)
>  at scala.collection.immutable.Map$Map2.getOrElse(Map.scala:147)
>  at org.apache.spark.sql.types.StructType.apply(StructType.scala:277)
>  at 
> org.apache.spark.ml.tree.EnsembleModelReadWrite$.loadImpl(treeModels.scala:522)
>  at 
> org.apache.spark.ml.classification.RandomForestClassificationModel$RandomForestClassificationModelReader.load(RandomForestClassifier.scala:420)
>  at 
> org.apache.spark.ml.classification.RandomForestClassificationModel$RandomForestClassificationModelReader.load(RandomForestClassifier.scala:410)
>  at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$5(Pipeline.scala:277)
>  at org.apache.spark.ml.MLEvents.withLoadInstanceEvent(events.scala:160)
>  at org.apache.spark.ml.MLEvents.withLoadInstanceEvent$(events.scala:155)
>  at 
> org.apache.spark.ml.util.Instrumentation.withLoadInstanceEvent(Instrumentation.scala:42)
>  at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$4(Pipeline.scala:277)
>  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>  at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>  at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>  at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>  at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>  at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
>  at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$3(Pipeline.scala:274)
>  at 
> org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
>  at scala.util.Try$.apply(Try.scala:213)
>  at 
> org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
>  at org.apache.spark.ml.Pipeline$SharedReadWrite$.load(Pipeline.scala:268)
>  at 
> org.apache.spark.ml.PipelineModel$PipelineModelReader.$anonfun$load$7(Pipeline.scala:356)
>  at org.apache.spark.ml.MLEvents.withLoadInstanceEvent(events.scala:160)
>  at org.apache.spark.ml.MLEvents.withLoadInstanceEvent$(events.scala:155)
>  at 
> org.apache.spark.ml.util.Instrumentation.withLoadInstanceEvent(Instrumentation.scala:42)
>  at 
> org.apache.spark.ml.PipelineModel$PipelineModelReader.$anonfun$load$6(Pipeline.scala:355)
>  at 
> org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
>  at scala.util.Try$.apply(Try.scala:213)
>  at 
> org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
>  at 
> org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:355)
>  at 
> org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:349)
>  at org.apache.spark.ml.util.MLReadable.load(ReadWrite.scala:355)
>  at org.apache.spark.ml.util.MLReadable.load$(ReadWrite.scala:355)
>  at org.apache.spark.ml.PipelineModel$.load(Pipeline.scala:337){code}
> When I looked at the data for the model, I see the 

[jira] [Resolved] (SPARK-35229) Spark Job web page is extremely slow while there are more than 1500 events in timeline

2021-05-11 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-35229.

Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32381
[https://github.com/apache/spark/pull/32381]

> Spark Job web page is extremely slow while there are more than 1500 events in 
> timeline
> --
>
> Key: SPARK-35229
> URL: https://issues.apache.org/jira/browse/SPARK-35229
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: Wang Yuan
>Assignee: Kousuke Saruta
>Priority: Blocker
> Fix For: 3.2.0
>
>
> In a Spark streaming application with 1000+ executors and more than 2000 
> events (executor events, job events) generated, the jobs/job web pages of 
> Spark are unable to render. The browser (Chrome, Firefox, Safari) freezes. 
> I had to open another window to open the stages and executors pages by their 
> addresses manually.
> The jobs page is the home page, so this is rather annoying. The problem is 
> that the vis-timeline rendering is too slow.
> Here are some suggestions:
> 1) the page should not render the timeline while the page is loading unless the 
> user clicks the link.
> 2) the executor group and job group should be separated, and the user can choose 
> to show one, both or neither.
> 3) the executor group should display executor events horizontally by time. 
> Currently executors are displayed one line per executor. If there are more than 
> 100 executors, the page does not look good.
> 4) the vis-timeline library has not been maintained since 2017 and should be 
> replaced with a newer one like [https://github.com/visjs/vis-timeline]
> 5) it is also good to fetch only recent events, e.g. 500, and load more if the 
> user wants to see more; alternatively the data could all be loaded at once.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35229) Spark Job web page is extremely slow while there are more than 1500 events in timeline

2021-05-11 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-35229:
--

Assignee: Kousuke Saruta

> Spark Job web page is extremely slow while there are more than 1500 events in 
> timeline
> --
>
> Key: SPARK-35229
> URL: https://issues.apache.org/jira/browse/SPARK-35229
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: Wang Yuan
>Assignee: Kousuke Saruta
>Priority: Blocker
>
> In a Spark streaming application with 1000+ executors and more than 2000 
> events (executor events, job events) generated, the jobs/job web pages of 
> Spark are unable to render. The browser (Chrome, Firefox, Safari) freezes. 
> I had to open another window to open the stages and executors pages by their 
> addresses manually.
> The jobs page is the home page, so this is rather annoying. The problem is 
> that the vis-timeline rendering is too slow.
> Here are some suggestions:
> 1) the page should not render the timeline while the page is loading unless the 
> user clicks the link.
> 2) the executor group and job group should be separated, and the user can choose 
> to show one, both or neither.
> 3) the executor group should display executor events horizontally by time. 
> Currently executors are displayed one line per executor. If there are more than 
> 100 executors, the page does not look good.
> 4) the vis-timeline library has not been maintained since 2017 and should be 
> replaced with a newer one like [https://github.com/visjs/vis-timeline]
> 5) it is also good to fetch only recent events, e.g. 500, and load more if the 
> user wants to see more; alternatively the data could all be loaded at once.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35372) JDK 11 compilation failure due to StackOverflowError

2021-05-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35372:


Assignee: (was: Apache Spark)

> JDK 11 compilation failure due to StackOverflowError
> 
>
> Key: SPARK-35372
> URL: https://issues.apache.org/jira/browse/SPARK-35372
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> See https://github.com/apache/spark/runs/2554067779
> {code}
> java.lang.StackOverflowError
> scala.reflect.internal.Trees$UnderConstructionTransformer.transform(Trees.scala:1741)
> scala.reflect.internal.Trees$UnderConstructionTransformer.transform$(Trees.scala:1740)
> scala.tools.nsc.transform.ExplicitOuter$OuterPathTransformer.transform(ExplicitOuter.scala:289)
> scala.tools.nsc.transform.ExplicitOuter$ExplicitOuterTransformer.transform(ExplicitOuter.scala:477)
> scala.tools.nsc.transform.ExplicitOuter$ExplicitOuterTransformer.transform(ExplicitOuter.scala:330)
> scala.reflect.api.Trees$Transformer.$anonfun$transformStats$1(Trees.scala:2597)
> scala.reflect.api.Trees$Transformer.transformStats(Trees.scala:2595)
> scala.reflect.internal.Trees.itransform(Trees.scala:1404)
> scala.reflect.internal.Trees.itransform$(Trees.scala:1374)
> scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
> scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
> scala.reflect.api.Trees$Transformer.transform(Trees.scala:2563)
> scala.tools.nsc.transform.TypingTransformers$TypingTransformer.transform(TypingTransformers.scala:51)
> scala.tools.nsc.transform.ExplicitOuter$OuterPathTransformer.scala$reflect$internal$Trees$UnderConstructionTransformer$$super$transform(ExplicitOuter.scala:212)
> scala.reflect.internal.Trees$UnderConstructionTransformer.transform(Trees.scala:1745)
> scala.reflect.internal.Trees$UnderConstructionTransformer.transform$(Trees.scala:1740)
> scala.tools.nsc.transform.ExplicitOuter$OuterPathTransformer.transform(ExplicitOuter.scala:289)
> scala.tools.nsc.transform.ExplicitOuter$ExplicitOuterTransformer.transform(ExplicitOuter.scala:477)
> scala.tools.nsc.transform.ExplicitOuter$ExplicitOuterTransformer.transform(ExplicitOuter.scala:330)
> scala.reflect.internal.Trees.itransform(Trees.scala:1383)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35372) JDK 11 compilation failure due to StackOverflowError

2021-05-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342503#comment-17342503
 ] 

Apache Spark commented on SPARK-35372:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/32502

> JDK 11 compilation failure due to StackOverflowError
> 
>
> Key: SPARK-35372
> URL: https://issues.apache.org/jira/browse/SPARK-35372
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> See https://github.com/apache/spark/runs/2554067779
> {code}
> java.lang.StackOverflowError
> scala.reflect.internal.Trees$UnderConstructionTransformer.transform(Trees.scala:1741)
> scala.reflect.internal.Trees$UnderConstructionTransformer.transform$(Trees.scala:1740)
> scala.tools.nsc.transform.ExplicitOuter$OuterPathTransformer.transform(ExplicitOuter.scala:289)
> scala.tools.nsc.transform.ExplicitOuter$ExplicitOuterTransformer.transform(ExplicitOuter.scala:477)
> scala.tools.nsc.transform.ExplicitOuter$ExplicitOuterTransformer.transform(ExplicitOuter.scala:330)
> scala.reflect.api.Trees$Transformer.$anonfun$transformStats$1(Trees.scala:2597)
> scala.reflect.api.Trees$Transformer.transformStats(Trees.scala:2595)
> scala.reflect.internal.Trees.itransform(Trees.scala:1404)
> scala.reflect.internal.Trees.itransform$(Trees.scala:1374)
> scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
> scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
> scala.reflect.api.Trees$Transformer.transform(Trees.scala:2563)
> scala.tools.nsc.transform.TypingTransformers$TypingTransformer.transform(TypingTransformers.scala:51)
> scala.tools.nsc.transform.ExplicitOuter$OuterPathTransformer.scala$reflect$internal$Trees$UnderConstructionTransformer$$super$transform(ExplicitOuter.scala:212)
> scala.reflect.internal.Trees$UnderConstructionTransformer.transform(Trees.scala:1745)
> scala.reflect.internal.Trees$UnderConstructionTransformer.transform$(Trees.scala:1740)
> scala.tools.nsc.transform.ExplicitOuter$OuterPathTransformer.transform(ExplicitOuter.scala:289)
> scala.tools.nsc.transform.ExplicitOuter$ExplicitOuterTransformer.transform(ExplicitOuter.scala:477)
> scala.tools.nsc.transform.ExplicitOuter$ExplicitOuterTransformer.transform(ExplicitOuter.scala:330)
> scala.reflect.internal.Trees.itransform(Trees.scala:1383)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35372) JDK 11 compilation failure due to StackOverflowError

2021-05-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342504#comment-17342504
 ] 

Apache Spark commented on SPARK-35372:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/32502

> JDK 11 compilation failure due to StackOverflowError
> 
>
> Key: SPARK-35372
> URL: https://issues.apache.org/jira/browse/SPARK-35372
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> See https://github.com/apache/spark/runs/2554067779
> {code}
> java.lang.StackOverflowError
> scala.reflect.internal.Trees$UnderConstructionTransformer.transform(Trees.scala:1741)
> scala.reflect.internal.Trees$UnderConstructionTransformer.transform$(Trees.scala:1740)
> scala.tools.nsc.transform.ExplicitOuter$OuterPathTransformer.transform(ExplicitOuter.scala:289)
> scala.tools.nsc.transform.ExplicitOuter$ExplicitOuterTransformer.transform(ExplicitOuter.scala:477)
> scala.tools.nsc.transform.ExplicitOuter$ExplicitOuterTransformer.transform(ExplicitOuter.scala:330)
> scala.reflect.api.Trees$Transformer.$anonfun$transformStats$1(Trees.scala:2597)
> scala.reflect.api.Trees$Transformer.transformStats(Trees.scala:2595)
> scala.reflect.internal.Trees.itransform(Trees.scala:1404)
> scala.reflect.internal.Trees.itransform$(Trees.scala:1374)
> scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
> scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
> scala.reflect.api.Trees$Transformer.transform(Trees.scala:2563)
> scala.tools.nsc.transform.TypingTransformers$TypingTransformer.transform(TypingTransformers.scala:51)
> scala.tools.nsc.transform.ExplicitOuter$OuterPathTransformer.scala$reflect$internal$Trees$UnderConstructionTransformer$$super$transform(ExplicitOuter.scala:212)
> scala.reflect.internal.Trees$UnderConstructionTransformer.transform(Trees.scala:1745)
> scala.reflect.internal.Trees$UnderConstructionTransformer.transform$(Trees.scala:1740)
> scala.tools.nsc.transform.ExplicitOuter$OuterPathTransformer.transform(ExplicitOuter.scala:289)
> scala.tools.nsc.transform.ExplicitOuter$ExplicitOuterTransformer.transform(ExplicitOuter.scala:477)
> scala.tools.nsc.transform.ExplicitOuter$ExplicitOuterTransformer.transform(ExplicitOuter.scala:330)
> scala.reflect.internal.Trees.itransform(Trees.scala:1383)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35372) JDK 11 compilation failure due to StackOverflowError

2021-05-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35372:


Assignee: Apache Spark

> JDK 11 compilation failure due to StackOverflowError
> 
>
> Key: SPARK-35372
> URL: https://issues.apache.org/jira/browse/SPARK-35372
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> See https://github.com/apache/spark/runs/2554067779
> {code}
> java.lang.StackOverflowError
> scala.reflect.internal.Trees$UnderConstructionTransformer.transform(Trees.scala:1741)
> scala.reflect.internal.Trees$UnderConstructionTransformer.transform$(Trees.scala:1740)
> scala.tools.nsc.transform.ExplicitOuter$OuterPathTransformer.transform(ExplicitOuter.scala:289)
> scala.tools.nsc.transform.ExplicitOuter$ExplicitOuterTransformer.transform(ExplicitOuter.scala:477)
> scala.tools.nsc.transform.ExplicitOuter$ExplicitOuterTransformer.transform(ExplicitOuter.scala:330)
> scala.reflect.api.Trees$Transformer.$anonfun$transformStats$1(Trees.scala:2597)
> scala.reflect.api.Trees$Transformer.transformStats(Trees.scala:2595)
> scala.reflect.internal.Trees.itransform(Trees.scala:1404)
> scala.reflect.internal.Trees.itransform$(Trees.scala:1374)
> scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
> scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
> scala.reflect.api.Trees$Transformer.transform(Trees.scala:2563)
> scala.tools.nsc.transform.TypingTransformers$TypingTransformer.transform(TypingTransformers.scala:51)
> scala.tools.nsc.transform.ExplicitOuter$OuterPathTransformer.scala$reflect$internal$Trees$UnderConstructionTransformer$$super$transform(ExplicitOuter.scala:212)
> scala.reflect.internal.Trees$UnderConstructionTransformer.transform(Trees.scala:1745)
> scala.reflect.internal.Trees$UnderConstructionTransformer.transform$(Trees.scala:1740)
> scala.tools.nsc.transform.ExplicitOuter$OuterPathTransformer.transform(ExplicitOuter.scala:289)
> scala.tools.nsc.transform.ExplicitOuter$ExplicitOuterTransformer.transform(ExplicitOuter.scala:477)
> scala.tools.nsc.transform.ExplicitOuter$ExplicitOuterTransformer.transform(ExplicitOuter.scala:330)
> scala.reflect.internal.Trees.itransform(Trees.scala:1383)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35372) JDK 11 compilation failure due to StackOverflowError

2021-05-11 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-35372:


 Summary: JDK 11 compilation failure due to StackOverflowError
 Key: SPARK-35372
 URL: https://issues.apache.org/jira/browse/SPARK-35372
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 3.2.0
Reporter: Hyukjin Kwon


See https://github.com/apache/spark/runs/2554067779

{code}
java.lang.StackOverflowError
scala.reflect.internal.Trees$UnderConstructionTransformer.transform(Trees.scala:1741)
scala.reflect.internal.Trees$UnderConstructionTransformer.transform$(Trees.scala:1740)
scala.tools.nsc.transform.ExplicitOuter$OuterPathTransformer.transform(ExplicitOuter.scala:289)
scala.tools.nsc.transform.ExplicitOuter$ExplicitOuterTransformer.transform(ExplicitOuter.scala:477)
scala.tools.nsc.transform.ExplicitOuter$ExplicitOuterTransformer.transform(ExplicitOuter.scala:330)
scala.reflect.api.Trees$Transformer.$anonfun$transformStats$1(Trees.scala:2597)
scala.reflect.api.Trees$Transformer.transformStats(Trees.scala:2595)
scala.reflect.internal.Trees.itransform(Trees.scala:1404)
scala.reflect.internal.Trees.itransform$(Trees.scala:1374)
scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
scala.reflect.api.Trees$Transformer.transform(Trees.scala:2563)
scala.tools.nsc.transform.TypingTransformers$TypingTransformer.transform(TypingTransformers.scala:51)
scala.tools.nsc.transform.ExplicitOuter$OuterPathTransformer.scala$reflect$internal$Trees$UnderConstructionTransformer$$super$transform(ExplicitOuter.scala:212)
scala.reflect.internal.Trees$UnderConstructionTransformer.transform(Trees.scala:1745)
scala.reflect.internal.Trees$UnderConstructionTransformer.transform$(Trees.scala:1740)
scala.tools.nsc.transform.ExplicitOuter$OuterPathTransformer.transform(ExplicitOuter.scala:289)
scala.tools.nsc.transform.ExplicitOuter$ExplicitOuterTransformer.transform(ExplicitOuter.scala:477)
scala.tools.nsc.transform.ExplicitOuter$ExplicitOuterTransformer.transform(ExplicitOuter.scala:330)
scala.reflect.internal.Trees.itransform(Trees.scala:1383)
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35371) Scala UDF returning string or complex type applied to array members returns wrong data

2021-05-11 Thread David Benedeki (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Benedeki updated SPARK-35371:
---
Description: 
When using a UDF that returns a string or a complex type (Struct) on array 
members, the resulting array consists of the last array member's UDF result.
h3. *Example code:*
{code:scala}
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{callUDF, col, transform, udf}

val sparkBuilder: SparkSession.Builder = SparkSession.builder()
  .master("local[*]")
  .appName(s"Udf Bug Demo")
  .config("spark.ui.enabled", "false")
  .config("spark.debug.maxToStringFields", 100)

val spark: SparkSession = sparkBuilder
  .config("spark.driver.bindAddress", "127.0.0.1")
  .config("spark.driver.host", "127.0.0.1")
  .getOrCreate()

import spark.implicits._

case class Foo(num: Int, s: String)

val src  = Seq(
  (1, 2, Array(1, 2, 3)),
  (2, 2, Array(2, 2, 2)),
  (3, 4, Array(3, 4, 3, 4))
).toDF("A", "B", "C")

val udfStringName = "UdfString"
val udfIntName = "UdfInt"
val udfStructName = "UdfStruct"

val udfString = udf((num: Int) => {
  (num + 1).toString
})
spark.udf.register(udfStringName, udfString)

val udfInt = udf((num: Int) => {
  num + 1
})
spark.udf.register(udfIntName, udfInt)

val udfStruct = udf((num: Int) => {
  Foo(num + 1, (num + 1).toString)
})
spark.udf.register(udfStructName, udfStruct)


val lambdaString = (forCol: Column) => callUDF(udfStringName, forCol)
val lambdaInt = (forCol: Column) => callUDF(udfIntName, forCol)
val lambdaStruct = (forCol: Column) => callUDF(udfStructName, forCol)

val cA = callUDF(udfStringName, col("A"))
val cB = callUDF(udfStringName, col("B"))
val cCString: Column = transform(col("C"), lambdaString)
val cCInt: Column = transform(col("C"), lambdaInt)
val cCStruc: Column = transform(col("C"), lambdaStruct)
val dest = src.withColumn("AStr", cA)
  .withColumn("BStr", cB)
  .withColumn("CString (Wrong)", cCString)
  .withColumn("CInt (OK)", cCInt)
  .withColumn("CStruct (Wrong)", cCStruc)

dest.show(false)
dest.printSchema()
{code}
h3. *Expected:*
{noformat}
+---+---+------------+----+----+------------+------------+--------------------------------+
|A  |B  |C           |AStr|BStr|CString     |CInt        |CStruct                         |
+---+---+------------+----+----+------------+------------+--------------------------------+
|1  |2  |[1, 2, 3]   |2   |3   |[2, 3, 4]   |[2, 3, 4]   |[{2, 2}, {3, 3}, {4, 4}]        |
|2  |2  |[2, 2, 2]   |3   |3   |[3, 3, 3]   |[3, 3, 3]   |[{3, 3}, {3, 3}, {3, 3}]        |
|3  |4  |[3, 4, 3, 4]|4   |5   |[4, 5, 4, 5]|[4, 5, 4, 5]|[{4, 4}, {5, 5}, {4, 4}, {5, 5}]|
+---+---+------------+----+----+------------+------------+--------------------------------+
{noformat}
h3. *Got:*
{noformat}
+---+---+------------+----+----+---------------+------------+--------------------------------+
|A  |B  |C           |AStr|BStr|CString (Wrong)|CInt (Ok)   |CStruct (Wrong)                 |
+---+---+------------+----+----+---------------+------------+--------------------------------+
|1  |2  |[1, 2, 3]   |2   |3   |[4, 4, 4]      |[2, 3, 4]   |[{4, 4}, {4, 4}, {4, 4}]        |
|2  |2  |[2, 2, 2]   |3   |3   |[3, 3, 3]      |[3, 3, 3]   |[{3, 3}, {3, 3}, {3, 3}]        |
|3  |4  |[3, 4, 3, 4]|4   |5   |[5, 5, 5, 5]   |[4, 5, 4, 5]|[{5, 5}, {5, 5}, {5, 5}, {5, 5}]|
+---+---+------------+----+----+---------------+------------+--------------------------------+
{noformat}
h3. *Observation*
 * Works correctly on Spark 3.0.2
 * When the UDF is registered as a Java UDF, it works as expected (see the 
registration sketch after the code block below)
 * The UDF is called the appropriate number of times (regardless of whether the 
UDF is marked as deterministic or non-deterministic).
 * When debugged, the correct value is initially saved into the result array, 
but processing each subsequent item overwrites the previously stored values as 
well. The final result is therefore the array filled with the last item's 
values.
 * When the UDF returns NULL/None, it neither "overwrites" the prior array 
values nor is "overwritten" by subsequent non-NULL values. See the following 
UDF implementation:
{code:scala}
val udfString = udf((num: Int) => {
  if (num == 3) {
None
  } else {
Some((num + 1).toString)
  }
})
{code}
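
For reference, below is a minimal sketch of what the "registered as Java UDF" 
observation above could look like for the string case. It is an illustration 
added here rather than part of the original report; the name UdfStringJava is 
invented for the example, and the snippet reuses the {{spark}} session and 
{{src}} DataFrame from the code above.
{code:scala}
import org.apache.spark.sql.api.java.UDF1
import org.apache.spark.sql.functions.{callUDF, col, transform}
import org.apache.spark.sql.types.StringType

// Register the same increment-and-stringify logic through the Java UDF API;
// per the observation above, this path produced the expected per-element values.
spark.udf.register("UdfStringJava", new UDF1[Int, String] {
  def call(num: Int): String = (num + 1).toString
}, StringType)

// Apply it to the array column the same way as the Scala UDF version.
val cCStringJava = transform(col("C"), c => callUDF("UdfStringJava", c))
src.withColumn("CStringJava", cCStringJava).show(false)
{code}
Comparing the CStringJava column with the Scala-UDF-based CString column makes 
the difference in behaviour directly visible.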
 

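A possible interim workaround, added here as an editor's sketch and not 
verified against this report: assuming the problem is specific to the Scala 
UDF path, expressing the per-element logic directly as a built-in SQL lambda 
may avoid it.
{code:scala}
import org.apache.spark.sql.functions.expr

// Assumption: the per-element logic is simple enough to express as a SQL lambda,
// so no Scala UDF is involved. Column and DataFrame names reuse the example above.
val cCStringExpr = expr("transform(C, x -> cast(x + 1 as string))")
src.withColumn("CStringExpr", cCStringExpr).show(false)
{code}
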
  was:
When using an UDF returning string or complex type (Struct) on array members 
the resulting array consists of the last array member UDF result.
h3. *Example code:*

{code:spark}
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{callUDF, col, transform, udf}


val sparkBuilder: SparkSession.Builder = SparkSession.builder()
  .master("local[*]")
  .appName(s"Udf Bug Demo")
  .config("spark.ui.enabled", "false")
  .config("spark.debug.maxToStringFields", 100)

val spark: SparkSession = sparkBuilder
  .config("spark.driver.bindAddress", "127.0.0.1")
  .config("spark.driver.host", "127.0.0.1")
  .getOrCreate()

import 

[jira] [Updated] (SPARK-35371) Scala UDF returning string or complex type applied to array members returns wrong data

2021-05-11 Thread David Benedeki (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Benedeki updated SPARK-35371:
---
Description: 
When a UDF returning a string or a complex type (Struct) is applied to array 
members, every element of the resulting array ends up holding the UDF result 
for the last array member.
h3. *Example code:*

{code:spark}
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{callUDF, col, transform, udf}


val sparkBuilder: SparkSession.Builder = SparkSession.builder()
  .master("local[*]")
  .appName(s"Udf Bug Demo")
  .config("spark.ui.enabled", "false")
  .config("spark.debug.maxToStringFields", 100)

val spark: SparkSession = sparkBuilder
  .config("spark.driver.bindAddress", "127.0.0.1")
  .config("spark.driver.host", "127.0.0.1")
  .getOrCreate()

import spark.implicits._

case class Foo(num: Int, s: String)

val src  = Seq(
  (1, 2, Array(1, 2, 3)),
  (2, 2, Array(2, 2, 2)),
  (3, 4, Array(3, 4, 3, 4))
).toDF("A", "B", "C")

val udfStringName = "UdfString"
val udfIntName = "UdfInt"
val udfStructName = "UdfStruct"

val udfString = udf((num: Int) => {
  (num + 1).toString
})
spark.udf.register(udfStringName, udfString)

val udfInt = udf((num: Int) => {
  num + 1
})
spark.udf.register(udfIntName, udfInt)

val udfStruct = udf((num: Int) => {
  Foo(num + 1, (num + 1).toString)
})
spark.udf.register(udfStructName, udfStruct)


val lambdaString = (forCol: Column) => callUDF(udfStringName, forCol)
val lambdaInt = (forCol: Column) => callUDF(udfIntName, forCol)
val lambdaStruct = (forCol: Column) => callUDF(udfStructName, forCol)

val cA = callUDF(udfStringName, col("A"))
val cB = callUDF(udfStringName, col("B"))
val cCString: Column = transform(col("C"), lambdaString)
val cCInt: Column = transform(col("C"), lambdaInt)
val cCStruc: Column = transform(col("C"), lambdaStruct)
val dest = src.withColumn("AStr", cA)
  .withColumn("BStr", cB)
  .withColumn("CString (Wrong)", cCString)
  .withColumn("CInt (OK)", cCInt)
  .withColumn("CStruct (Wrong)", cCStruc)

dest.show(false)
dest.printSchema()
{code}

h3. *Expected:*
{noformat}
+---+---+------------+----+----+------------+------------+--------------------------------+
|A  |B  |C           |AStr|BStr|CString     |CInt        |CStruct                         |
+---+---+------------+----+----+------------+------------+--------------------------------+
|1  |2  |[1, 2, 3]   |2   |3   |[2, 3, 4]   |[2, 3, 4]   |[{2, 2}, {3, 3}, {4, 4}]        |
|2  |2  |[2, 2, 2]   |3   |3   |[3, 3, 3]   |[3, 3, 3]   |[{3, 3}, {3, 3}, {3, 3}]        |
|3  |4  |[3, 4, 3, 4]|4   |5   |[4, 5, 4, 5]|[4, 5, 4, 5]|[{4, 4}, {5, 5}, {4, 4}, {5, 5}]|
+---+---+------------+----+----+------------+------------+--------------------------------+
{noformat}
h3. *Got:*
{noformat}
+---+---+------------+----+----+---------------+------------+--------------------------------+
|A  |B  |C           |AStr|BStr|CString (Wrong)|CInt (Ok)   |CStruct (Wrong)                 |
+---+---+------------+----+----+---------------+------------+--------------------------------+
|1  |2  |[1, 2, 3]   |2   |3   |[4, 4, 4]      |[2, 3, 4]   |[{4, 4}, {4, 4}, {4, 4}]        |
|2  |2  |[2, 2, 2]   |3   |3   |[3, 3, 3]      |[3, 3, 3]   |[{3, 3}, {3, 3}, {3, 3}]        |
|3  |4  |[3, 4, 3, 4]|4   |5   |[5, 5, 5, 5]   |[4, 5, 4, 5]|[{5, 5}, {5, 5}, {5, 5}, {5, 5}]|
+---+---+------------+----+----+---------------+------------+--------------------------------+
{noformat}
h3. *Observation*
 * Works correctly on Spark 3.0.2
 * When the UDF is registered as a Java UDF, it works as expected
 * The UDF is called the appropriate number of times (regardless of whether the 
UDF is marked as deterministic or non-deterministic).
 * When debugged, the correct value is initially saved into the result array, 
but processing each subsequent item overwrites the previously stored values as 
well. The final result is therefore the array filled with the last item's 
values.
 * When the UDF returns NULL/None, it neither "overwrites" the prior array 
values nor is "overwritten" by subsequent non-NULL values. See the following 
UDF implementation:

val udfString = udf((num: Int) => {
  if (num == 3) {
    None
  } else {
    Some((num + 1).toString)
  }
})


  was:
When using an UDF returning string or complex type (Struct) on array members 
the resulting array consists of the last array member UDF result.
h3. *Example code:*

import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{callUDF, col, transform, udf}

val sparkBuilder: SparkSession.Builder = SparkSession.builder()
  .master("local[*]")
   

[jira] [Updated] (SPARK-35371) Scala UDF returning string or complex type applied to array members returns wrong data

2021-05-11 Thread David Benedeki (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Benedeki updated SPARK-35371:
---
Description: 
When a UDF returning a string or a complex type (Struct) is applied to array 
members, every element of the resulting array ends up holding the UDF result 
for the last array member.
h3. *Example code:*

import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{callUDF, col, transform, udf}

val sparkBuilder: SparkSession.Builder = SparkSession.builder()
  .master("local[*]")
  .appName(s"Udf Bug Demo")
  .config("spark.ui.enabled", "false")
  .config("spark.debug.maxToStringFields", 100)

val spark: SparkSession = sparkBuilder
  .config("spark.driver.bindAddress", "127.0.0.1")
  .config("spark.driver.host", "127.0.0.1")
  .getOrCreate()

import spark.implicits._

case class Foo(num: Int, s: String)

val src = Seq(
  (1, 2, Array(1, 2, 3)),
  (2, 2, Array(2, 2, 2)),
  (3, 4, Array(3, 4, 3, 4))
).toDF("A", "B", "C")

val udfStringName = "UdfString"
val udfIntName = "UdfInt"
val udfStructName = "UdfStruct"

val udfString = udf((num: Int) => {
  (num + 1).toString
})
spark.udf.register(udfStringName, udfString)

val udfInt = udf((num: Int) => {
  num + 1
})
spark.udf.register(udfIntName, udfInt)

val udfStruct = udf((num: Int) => {
  Foo(num + 1, (num + 1).toString)
})
spark.udf.register(udfStructName, udfStruct)

val lambdaString = (forCol: Column) => callUDF(udfStringName, forCol)
val lambdaInt = (forCol: Column) => callUDF(udfIntName, forCol)
val lambdaStruct = (forCol: Column) => callUDF(udfStructName, forCol)

val cA = callUDF(udfStringName, col("A"))
val cB = callUDF(udfStringName, col("B"))
val cCString: Column = transform(col("C"), lambdaString)
val cCInt: Column = transform(col("C"), lambdaInt)
val cCStruc: Column = transform(col("C"), lambdaStruct)
val dest = src.withColumn("AStr", cA)
  .withColumn("BStr", cB)
  .withColumn("CString (Wrong)", cCString)
  .withColumn("CInt (OK)", cCInt)
  .withColumn("CStruct (Wrong)", cCStruc)

dest.show(false)
dest.printSchema()
h3. *Expected:*
{noformat}
