[jira] [Commented] (SPARK-40241) Correct the link of GenericUDTF

2022-08-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585669#comment-17585669
 ] 

Apache Spark commented on SPARK-40241:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37685

> Correct the link of GenericUDTF
> ---
>
> Key: SPARK-40241
> URL: https://issues.apache.org/jira/browse/SPARK-40241
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Trivial
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40241) Correct the link of GenericUDTF

2022-08-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40241:


Assignee: (was: Apache Spark)

> Correct the link of GenericUDTF
> ---
>
> Key: SPARK-40241
> URL: https://issues.apache.org/jira/browse/SPARK-40241
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Trivial
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40241) Correct the link of GenericUDTF

2022-08-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40241:


Assignee: Apache Spark

> Correct the link of GenericUDTF
> ---
>
> Key: SPARK-40241
> URL: https://issues.apache.org/jira/browse/SPARK-40241
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Trivial
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40241) Correct the link of GenericUDTF

2022-08-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585668#comment-17585668
 ] 

Apache Spark commented on SPARK-40241:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37685

> Correct the link of GenericUDTF
> ---
>
> Key: SPARK-40241
> URL: https://issues.apache.org/jira/browse/SPARK-40241
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Trivial
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40241) Correct the link of GenericUDTF

2022-08-26 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-40241:
-

 Summary: Correct the link of GenericUDTF
 Key: SPARK-40241
 URL: https://issues.apache.org/jira/browse/SPARK-40241
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40039) Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

2022-08-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40039:


Assignee: (was: Apache Spark)

> Introducing a streaming checkpoint file manager based on Hadoop's Abortable 
> interface
> -
>
> Key: SPARK-40039
> URL: https://issues.apache.org/jira/browse/SPARK-40039
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Currently on S3 the checkpoint file manager (called 
> FileContextBasedCheckpointFileManager) is based on rename: when a file is 
> opened for an atomic stream, a temporary file is used instead, and when the 
> stream is committed the file is renamed.
> But on S3 a rename is actually a file copy, so this has serious performance 
> implications.
> On Hadoop 3 there is a new interface called *Abortable*, and *S3AFileSystem* 
> implements this capability on top of S3's multipart upload. When the file is 
> committed a POST is sent 
> ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html])
> and when it is aborted a DELETE is sent 
> ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_AbortMultipartUpload.html])
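For illustration only, here is a minimal Scala sketch of the abort-instead-of-rename 
idea on Hadoop 3.3+. This is not Spark's actual checkpoint file manager; the bucket, 
path, and fallback handling are assumptions:

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FSDataOutputStream, FileSystem, Path}

object AbortableStreamSketch {
  // documented stream capability key for abortable output streams (Hadoop 3.3+)
  val AbortableCapability = "fs.capability.outputstream.abortable"

  def main(args: Array[String]): Unit = {
    val path = new Path("s3a://some-bucket/checkpoint/offsets/42") // hypothetical path
    val fs = FileSystem.get(path.toUri, new Configuration())

    val out: FSDataOutputStream = fs.createFile(path).build()
    if (!out.hasCapability(AbortableCapability)) {
      // fall back to the rename-based FileContextBasedCheckpointFileManager
    }

    out.write("offsets...".getBytes("UTF-8"))

    // commit: close() completes the multipart upload (the POST mentioned above)
    out.close()
    // cancel instead: out.abort() would abort the multipart upload (the DELETE above)
    // without ever materializing the object, so no rename/copy is needed
  }
}
{code}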



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40039) Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

2022-08-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40039:


Assignee: Apache Spark

> Introducing a streaming checkpoint file manager based on Hadoop's Abortable 
> interface
> -
>
> Key: SPARK-40039
> URL: https://issues.apache.org/jira/browse/SPARK-40039
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Attila Zsolt Piros
>Assignee: Apache Spark
>Priority: Major
>
> Currently on S3 the checkpoint file manager (called 
> FileContextBasedCheckpointFileManager) is based on rename: when a file is 
> opened for an atomic stream, a temporary file is used instead, and when the 
> stream is committed the file is renamed.
> But on S3 a rename is actually a file copy, so this has serious performance 
> implications.
> On Hadoop 3 there is a new interface called *Abortable*, and *S3AFileSystem* 
> implements this capability on top of S3's multipart upload. When the file is 
> committed a POST is sent 
> ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html])
> and when it is aborted a DELETE is sent 
> ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_AbortMultipartUpload.html])



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40152) Codegen compilation error when using split_part

2022-08-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585665#comment-17585665
 ] 

Apache Spark commented on SPARK-40152:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/37684

> Codegen compilation error when using split_part
> ---
>
> Key: SPARK-40152
> URL: https://issues.apache.org/jira/browse/SPARK-40152
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Bruce Robbins
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
>
> The following query throws an error:
> {noformat}
> create or replace temp view v1 as
> select * from values
> ('11.12.13', '.', 3)
> as v1(col1, col2, col3);
> cache table v1;
> SELECT split_part(col1, col2, col3)
> from v1;
> {noformat}
> The error is:
> {noformat}
> 22/08/19 14:25:14 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 42, Column 1: Expression "project_isNull_0 = false" is not a type
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 42, Column 1: Expression "project_isNull_0 = false" is not a type
>   at 
> org.codehaus.janino.Java$Atom.toTypeOrCompileException(Java.java:3934)
>   at org.codehaus.janino.Parser.parseBlockStatement(Parser.java:1887)
>   at org.codehaus.janino.Parser.parseBlockStatements(Parser.java:1811)
>   at org.codehaus.janino.Parser.parseBlock(Parser.java:1792)
>   at 
> {noformat}
> In the end, {{split_part}} does successfully execute, although in interpreted 
> mode.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40152) Codegen compilation error when using split_part

2022-08-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585666#comment-17585666
 ] 

Apache Spark commented on SPARK-40152:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/37684

> Codegen compilation error when using split_part
> ---
>
> Key: SPARK-40152
> URL: https://issues.apache.org/jira/browse/SPARK-40152
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Bruce Robbins
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
>
> The following query throws an error:
> {noformat}
> create or replace temp view v1 as
> select * from values
> ('11.12.13', '.', 3)
> as v1(col1, col2, col3);
> cache table v1;
> SELECT split_part(col1, col2, col3)
> from v1;
> {noformat}
> The error is:
> {noformat}
> 22/08/19 14:25:14 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 42, Column 1: Expression "project_isNull_0 = false" is not a type
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 42, Column 1: Expression "project_isNull_0 = false" is not a type
>   at 
> org.codehaus.janino.Java$Atom.toTypeOrCompileException(Java.java:3934)
>   at org.codehaus.janino.Parser.parseBlockStatement(Parser.java:1887)
>   at org.codehaus.janino.Parser.parseBlockStatements(Parser.java:1811)
>   at org.codehaus.janino.Parser.parseBlock(Parser.java:1792)
>   at 
> {noformat}
> In the end, {{split_part}} does successfully execute, although in interpreted 
> mode.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40156) url_decode() exposes a Java error

2022-08-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40156:


Assignee: Apache Spark

> url_decode() exposes a Java error
> -
>
> Key: SPARK-40156
> URL: https://issues.apache.org/jira/browse/SPARK-40156
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Assignee: Apache Spark
>Priority: Major
>
> Given a badly encoded string, Spark returns a raw Java error.
> It should instead return an ERROR_CLASS.
> spark-sql> SELECT url_decode('http%3A%2F%2spark.apache.org');
> 22/08/20 17:17:20 ERROR SparkSQLDriver: Failed in [SELECT 
> url_decode('http%3A%2F%2spark.apache.org')]
> java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in 
> escape (%) pattern - Error at index 1 in: "2s"
>  at java.base/java.net.URLDecoder.decode(URLDecoder.java:232)
>  at java.base/java.net.URLDecoder.decode(URLDecoder.java:142)
>  at 
> org.apache.spark.sql.catalyst.expressions.UrlCodec$.decode(urlExpressions.scala:113)
>  at 
> org.apache.spark.sql.catalyst.expressions.UrlCodec.decode(urlExpressions.scala)
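As a rough illustration of "return an ERROR_CLASS" (not the actual fix; the error 
class name CANNOT_DECODE_URL and the message format are made up for this sketch), 
the decode call could be wrapped so the Java exception is translated into an 
error-class-style message:

{code:scala}
import java.net.URLDecoder
import java.nio.charset.StandardCharsets

object UrlDecodeSketch {
  def decode(url: String): String = {
    try {
      URLDecoder.decode(url, StandardCharsets.UTF_8.name())
    } catch {
      case e: IllegalArgumentException =>
        // re-raise with an error-class-tagged, user-facing message instead of
        // leaking the raw java.lang.IllegalArgumentException text
        throw new IllegalArgumentException(
          s"[CANNOT_DECODE_URL] The input '$url' is not a valid percent-encoded string.", e)
    }
  }
}
{code}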



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40156) url_decode() exposes a Java error

2022-08-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40156:


Assignee: (was: Apache Spark)

> url_decode() exposes a Java error
> -
>
> Key: SPARK-40156
> URL: https://issues.apache.org/jira/browse/SPARK-40156
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Priority: Major
>
> Given a badly encoded string, Spark returns a raw Java error.
> It should instead return an ERROR_CLASS.
> spark-sql> SELECT url_decode('http%3A%2F%2spark.apache.org');
> 22/08/20 17:17:20 ERROR SparkSQLDriver: Failed in [SELECT 
> url_decode('http%3A%2F%2spark.apache.org')]
> java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in 
> escape (%) pattern - Error at index 1 in: "2s"
>  at java.base/java.net.URLDecoder.decode(URLDecoder.java:232)
>  at java.base/java.net.URLDecoder.decode(URLDecoder.java:142)
>  at 
> org.apache.spark.sql.catalyst.expressions.UrlCodec$.decode(urlExpressions.scala:113)
>  at 
> org.apache.spark.sql.catalyst.expressions.UrlCodec.decode(urlExpressions.scala)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40156) url_decode() exposes a Java error

2022-08-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40156:
-
Fix Version/s: (was: 3.4.0)

> url_decode() exposes a Java error
> -
>
> Key: SPARK-40156
> URL: https://issues.apache.org/jira/browse/SPARK-40156
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Priority: Major
>
> Given a badly encoded string, Spark returns a raw Java error.
> It should instead return an ERROR_CLASS.
> spark-sql> SELECT url_decode('http%3A%2F%2spark.apache.org');
> 22/08/20 17:17:20 ERROR SparkSQLDriver: Failed in [SELECT 
> url_decode('http%3A%2F%2spark.apache.org')]
> java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in 
> escape (%) pattern - Error at index 1 in: "2s"
>  at java.base/java.net.URLDecoder.decode(URLDecoder.java:232)
>  at java.base/java.net.URLDecoder.decode(URLDecoder.java:142)
>  at 
> org.apache.spark.sql.catalyst.expressions.UrlCodec$.decode(urlExpressions.scala:113)
>  at 
> org.apache.spark.sql.catalyst.expressions.UrlCodec.decode(urlExpressions.scala)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-40156) url_decode() exposes a Java error

2022-08-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-40156:
--
  Assignee: (was: ming95)

Reverted at 
https://github.com/apache/spark/commit/4a19e05563b655216cf6b551d0ab8a64447cd49a

> url_decode() exposes a Java error
> -
>
> Key: SPARK-40156
> URL: https://issues.apache.org/jira/browse/SPARK-40156
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Priority: Major
> Fix For: 3.4.0
>
>
> Given a badly encoded string, Spark returns a raw Java error.
> It should instead return an ERROR_CLASS.
> spark-sql> SELECT url_decode('http%3A%2F%2spark.apache.org');
> 22/08/20 17:17:20 ERROR SparkSQLDriver: Failed in [SELECT 
> url_decode('http%3A%2F%2spark.apache.org')]
> java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in 
> escape (%) pattern - Error at index 1 in: "2s"
>  at java.base/java.net.URLDecoder.decode(URLDecoder.java:232)
>  at java.base/java.net.URLDecoder.decode(URLDecoder.java:142)
>  at 
> org.apache.spark.sql.catalyst.expressions.UrlCodec$.decode(urlExpressions.scala:113)
>  at 
> org.apache.spark.sql.catalyst.expressions.UrlCodec.decode(urlExpressions.scala)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40039) Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

2022-08-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40039:
-
Fix Version/s: (was: 3.4.0)

> Introducing a streaming checkpoint file manager based on Hadoop's Abortable 
> interface
> -
>
> Key: SPARK-40039
> URL: https://issues.apache.org/jira/browse/SPARK-40039
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Currently on S3 the checkpoint file manager (called 
> FileContextBasedCheckpointFileManager) is based on rename: when a file is 
> opened for an atomic stream, a temporary file is used instead, and when the 
> stream is committed the file is renamed.
> But on S3 a rename is actually a file copy, so this has serious performance 
> implications.
> On Hadoop 3 there is a new interface called *Abortable*, and *S3AFileSystem* 
> implements this capability on top of S3's multipart upload. When the file is 
> committed a POST is sent 
> ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html])
> and when it is aborted a DELETE is sent 
> ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_AbortMultipartUpload.html])



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-40039) Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

2022-08-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-40039:
--
  Assignee: (was: Attila Zsolt Piros)

Reverted at 
https://github.com/apache/spark/commit/fb4dba1413dac860c9bec5bae422b2f78d212948 
and 
https://github.com/apache/spark/commit/8fb853218bed792b3f66c8401691dd16262e1f9e

> Introducing a streaming checkpoint file manager based on Hadoop's Abortable 
> interface
> -
>
> Key: SPARK-40039
> URL: https://issues.apache.org/jira/browse/SPARK-40039
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Attila Zsolt Piros
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently on S3 the checkpoint file manager (called 
> FileContextBasedCheckpointFileManager) is based on rename: when a file is 
> opened for an atomic stream, a temporary file is used instead, and when the 
> stream is committed the file is renamed.
> But on S3 a rename is actually a file copy, so this has serious performance 
> implications.
> On Hadoop 3 there is a new interface called *Abortable*, and *S3AFileSystem* 
> implements this capability on top of S3's multipart upload. When the file is 
> committed a POST is sent 
> ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html])
> and when it is aborted a DELETE is sent 
> ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_AbortMultipartUpload.html])



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40240) PySpark rdd.takeSample should validate `num > maxSampleSize` at first

2022-08-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40240:


Assignee: (was: Apache Spark)

> PySpark rdd.takeSample should validate `num > maxSampleSize` at first
> -
>
> Key: SPARK-40240
> URL: https://issues.apache.org/jira/browse/SPARK-40240
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40240) PySpark rdd.takeSample should validate `num > maxSampleSize` at first

2022-08-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40240:


Assignee: Apache Spark

> PySpark rdd.takeSample should validate `num > maxSampleSize` at first
> -
>
> Key: SPARK-40240
> URL: https://issues.apache.org/jira/browse/SPARK-40240
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40240) PySpark rdd.takeSample should validate `num > maxSampleSize` at first

2022-08-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585656#comment-17585656
 ] 

Apache Spark commented on SPARK-40240:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37683

> PySpark rdd.takeSample should validate `num > maxSampleSize` at first
> -
>
> Key: SPARK-40240
> URL: https://issues.apache.org/jira/browse/SPARK-40240
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40240) PySpark rdd.takeSample should validate `num > maxSampleSize` at first

2022-08-26 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-40240:
-

 Summary: PySpark rdd.takeSample should validate `num > 
maxSampleSize` at first
 Key: SPARK-40240
 URL: https://issues.apache.org/jira/browse/SPARK-40240
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40239) Remove duplicated 'fraction' validation in RDD.sample

2022-08-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40239:


Assignee: Apache Spark

> Remove duplicated 'fraction' validation in RDD.sample
> -
>
> Key: SPARK-40239
> URL: https://issues.apache.org/jira/browse/SPARK-40239
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Trivial
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40239) Remove duplicated 'fraction' validation in RDD.sample

2022-08-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585654#comment-17585654
 ] 

Apache Spark commented on SPARK-40239:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37682

> Remove duplicated 'fraction' validation in RDD.sample
> -
>
> Key: SPARK-40239
> URL: https://issues.apache.org/jira/browse/SPARK-40239
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Trivial
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40239) Remove duplicated 'fraction' validation in RDD.sample

2022-08-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40239:


Assignee: (was: Apache Spark)

> Remove duplicated 'fraction' validation in RDD.sample
> -
>
> Key: SPARK-40239
> URL: https://issues.apache.org/jira/browse/SPARK-40239
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Trivial
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40239) Remove duplicated 'fraction' validation in RDD.sample

2022-08-26 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-40239:
-

 Summary: Remove duplicated 'fraction' validation in RDD.sample
 Key: SPARK-40239
 URL: https://issues.apache.org/jira/browse/SPARK-40239
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40238) support scaleUpFactor and initialNumPartition in pyspark rdd API

2022-08-26 Thread Ziqi Liu (Jira)
Ziqi Liu created SPARK-40238:


 Summary: support scaleUpFactor and initialNumPartition in pyspark 
rdd API
 Key: SPARK-40238
 URL: https://issues.apache.org/jira/browse/SPARK-40238
 Project: Spark
  Issue Type: Story
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Ziqi Liu


This is a followup on https://issues.apache.org/jira/browse/SPARK-40211

The `scaleUpFactor` and `initialNumPartition` configs are not supported yet in the 
pyspark rdd take API

(see [https://github.com/apache/spark/blob/master/python/pyspark/rdd.py#L2799])

Basically, it hardcodes `scaleUpFactor` as 1 and `initialNumPartition` as 4, so 
the pyspark rdd take API is inconsistent with the scala API.

Could anyone familiar with pyspark help support this (referring to the [scala 
implementation|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1448])?
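For reference, a simplified Scala paraphrase of the scaling logic in RDD.take() 
that the pyspark code would need to mirror (details such as empty-partition 
handling are omitted; parameter names are illustrative):

{code:scala}
object TakeScalingSketch {
  // Decide how many additional partitions to scan on the next pass of take(num).
  def numPartsToTry(partsScanned: Int, bufSize: Int, num: Int,
                    initialNumPartition: Int, scaleUpFactor: Int): Int = {
    if (partsScanned == 0) {
      initialNumPartition          // pyspark currently behaves as if this were fixed
    } else if (bufSize == 0) {
      // nothing found so far: grow aggressively
      partsScanned * scaleUpFactor
    } else {
      // estimate how many partitions are needed to reach num rows, capped by scaleUpFactor
      val estimate = math.max((1.5 * num * partsScanned / bufSize).toInt - partsScanned, 1)
      math.min(estimate, partsScanned * scaleUpFactor)
    }
  }
}
{code}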



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40211) Allow executeTake() / collectLimit's number of starting partitions to be customized

2022-08-26 Thread Josh Rosen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-40211.

Fix Version/s: 3.4.0
 Assignee: Ziqi Liu
   Resolution: Fixed

Resolved by [https://github.com/apache/spark/pull/37661] 

> Allow executeTake() / collectLimit's number of starting partitions to be 
> customized
> ---
>
> Key: SPARK-40211
> URL: https://issues.apache.org/jira/browse/SPARK-40211
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core, SQL
>Affects Versions: 3.4.0
>Reporter: Ziqi Liu
>Assignee: Ziqi Liu
>Priority: Major
> Fix For: 3.4.0
>
>
> Today, Spark’s executeTake() code allows for the limitScaleUpFactor to be 
> customized but does not allow for the initial number of partitions to be 
> customized: it’s currently hardcoded to {{{}1{}}}.
> We should add a configuration so that the initial partition count can be 
> customized. By setting this new configuration to a high value we could 
> effectively mitigate the “run multiple jobs” overhead in {{take}} behavior. 
> We could also set it to higher-than-1-but-still-small values (like, say, 
> {{{}10{}}}) to achieve a middle-ground trade-off.
>  
> Essentially, we need to make {{numPartsToTry = 1L}} 
> ([code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L481])
>  customizable. We should do this via a new SQL conf, similar to the 
> {{limitScaleUpFactor}} conf.
>  
> Spark has several near-duplicate versions of this code ([see code 
> search|https://github.com/apache/spark/search?q=numPartsToTry+%3D+1]) in:
>  * SparkPlan
>  * RDD
>  * pyspark rdd
> Also, in pyspark {{limitScaleUpFactor}} is not supported either. So for 
> now, I will focus on the scala side first, leaving the python side untouched, 
> and meanwhile sync with the pyspark members. Depending on the progress we can 
> do them all in one PR, or make the scala-side change first and leave the 
> pyspark change as a follow-up.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40235) Use interruptible lock instead of synchronized in Executor.updateDependencies()

2022-08-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585633#comment-17585633
 ] 

Apache Spark commented on SPARK-40235:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/37681

> Use interruptible lock instead of synchronized in 
> Executor.updateDependencies()
> ---
>
> Key: SPARK-40235
> URL: https://issues.apache.org/jira/browse/SPARK-40235
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Major
>
> This patch modifies the synchronization in {{Executor.updateDependencies()}} 
> in order to allow tasks to be interrupted while they are blocked and waiting 
> on other tasks to finish downloading dependencies.
> This synchronization was added years ago in 
> [mesos/spark@{{{}7b9e96c{}}}|https://github.com/mesos/spark/commit/7b9e96c99206c0679d9925e0161fde738a5c7c3a]
>  in order to prevent concurrently-launching tasks from performing concurrent 
> dependency updates. If one task is downloading dependencies, all other 
> newly-launched tasks will block until the original dependency download is 
> complete.
> Let's say that a Spark task launches, becomes blocked on a 
> {{updateDependencies()}} call, then is cancelled while it is blocked. 
> Although Spark will send a Thread.interrupt() to the canceled task, the task 
> will continue waiting because threads blocked on a {{synchronized}} won't 
> throw an InterruptedException in response to the interrupt. As a result, the 
> blocked thread will continue to wait until the other thread exits the 
> synchronized block. 
> In the wild, we saw a case where this happened and the thread remained 
> blocked for over 1 minute, causing the TaskReaper to kick in and 
> self-destruct the executor.
> This PR aims to fix this problem by replacing the {{synchronized}} with a 
> ReentrantLock, which has a {{lockInterruptibly}} method.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40235) Use interruptible lock instead of synchronized in Executor.updateDependencies()

2022-08-26 Thread Josh Rosen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-40235:
---
Description: 
This patch modifies the synchronization in {{Executor.updateDependencies()}} in 
order to allow tasks to be interrupted while they are blocked and waiting on 
other tasks to finish downloading dependencies.

This synchronization was added years ago in 
[mesos/spark@{{{}7b9e96c{}}}|https://github.com/mesos/spark/commit/7b9e96c99206c0679d9925e0161fde738a5c7c3a]
 in order to prevent concurrently-launching tasks from performing concurrent 
dependency updates. If one task is downloading dependencies, all other 
newly-launched tasks will block until the original dependency download is 
complete.

Let's say that a Spark task launches, becomes blocked on a 
{{updateDependencies()}} call, then is cancelled while it is blocked. Although 
Spark will send a Thread.interrupt() to the canceled task, the task will 
continue waiting because threads blocked on a {{synchronized}} won't throw an 
InterruptedException in response to the interrupt. As a result, the blocked 
thread will continue to wait until the other thread exits the synchronized 
block. 

In the wild, we saw a case where this happened and the thread remained blocked 
for over 1 minute, causing the TaskReaper to kick in and self-destruct the 
executor.

This PR aims to fix this problem by replacing the {{synchronized}} with a 
ReentrantLock, which has a {{lockInterruptibly}} method.

  was:
This patch modifies the synchronization in {{Executor.updateDependencies()}} in 
order to allow tasks to be interrupted while they are blocked and waiting on 
other tasks to finish downloading dependencies.

This synchronization was added years ago in 
[mesos/spark@{{{}7b9e96c{}}}|https://github.com/mesos/spark/commit/7b9e96c99206c0679d9925e0161fde738a5c7c3a]
 in order to prevent concurrently-launching tasks from performing concurrent 
dependency updates (file downloads, and, later, library installation). If one 
task is downloading dependencies, all other newly-launched tasks will block 
until the original dependency download is complete.

Let's say that a Spark task launches, becomes blocked on a 
{{updateDependencies()}} call, then is cancelled while it is blocked. Although 
Spark will send a Thread.interrupt() to the canceled task, the task will 
continue waiting because threads blocked on a {{synchronized}} won't throw an 
InterruptedException in response to the interrupt. As a result, the blocked 
thread will continue to wait until the other thread exits the synchronized 
block. 

In the wild, we saw a case where this happened and the thread remained blocked 
for over 1 minute, causing the TaskReaper to kick in and self-destruct the 
executor.

This PR aims to fix this problem by replacing the {{synchronized}} with a 
ReentrantLock, which has a {{lockInterruptibly}} method.


> Use interruptible lock instead of synchronized in 
> Executor.updateDependencies()
> ---
>
> Key: SPARK-40235
> URL: https://issues.apache.org/jira/browse/SPARK-40235
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Major
>
> This patch modifies the synchronization in {{Executor.updateDependencies()}} 
> in order to allow tasks to be interrupted while they are blocked and waiting 
> on other tasks to finish downloading dependencies.
> This synchronization was added years ago in 
> [mesos/spark@{{{}7b9e96c{}}}|https://github.com/mesos/spark/commit/7b9e96c99206c0679d9925e0161fde738a5c7c3a]
>  in order to prevent concurrently-launching tasks from performing concurrent 
> dependency updates. If one task is downloading dependencies, all other 
> newly-launched tasks will block until the original dependency download is 
> complete.
> Let's say that a Spark task launches, becomes blocked on a 
> {{updateDependencies()}} call, then is cancelled while it is blocked. 
> Although Spark will send a Thread.interrupt() to the canceled task, the task 
> will continue waiting because threads blocked on a {{synchronized}} won't 
> throw an InterruptedException in response to the interrupt. As a result, the 
> blocked thread will continue to wait until the other thread exits the 
> synchronized block. 
> In the wild, we saw a case where this happened and the thread remained 
> blocked for over 1 minute, causing the TaskReaper to kick in and 
> self-destruct the executor.
> This PR aims to fix this problem by replacing the {{synchronized}} with a 
> ReentrantLock, which has a {{lockInterruptibly}} method.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (SPARK-40237) Can't get JDBC type for map in Spark 3.3.0 and PostgreSQL

2022-08-26 Thread Igor Suhorukov (Jira)
Igor Suhorukov created SPARK-40237:
--

 Summary: Can't get JDBC type for map in Spark 3.3.0 
and PostgreSQL
 Key: SPARK-40237
 URL: https://issues.apache.org/jira/browse/SPARK-40237
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0
 Environment: Linux 5.15.0-46-generic #49~20.04.1-Ubuntu SMP Thu Aug 4 
19:15:44 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

22/08/27 00:30:01 INFO SparkContext: Running Spark version 3.3.0
Reporter: Igor Suhorukov


 

The exception 'Can't get JDBC type for map' happens when I try to 
save a dataset into PostgreSQL via JDBC:
{code:java}
Dataset ds = ...;
ds.write().mode(SaveMode.Overwrite)
.option("truncate", true).format("jdbc")
.option("url", "jdbc:postgresql://127.0.0.1:5432/???")
.option("dbtable", "t1")
.option("isolationLevel", "NONE")
.option("user", "???")
.option("password", "???")
.save();
{code}
 

This issue is related to the unimplemented PostgresDialect#getJDBCType, which is 
strange since PostgreSQL supports hstore/json/jsonb types suitable for storing a 
map. The PostgreSQL JDBC driver supports writing hstore via 
statement.setObject(parameterIndex, map), and JSON via 
statement.setString(parameterIndex, map) with cast(? as JSON).
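As a sketch of where such a mapping could live (a workaround under assumptions, not 
a complete fix: it only covers the CREATE TABLE type mapping, and writing the actual 
map values would still need a corresponding setter in JdbcUtils), one could register 
a custom dialect that maps MapType to a PostgreSQL type such as JSONB:

{code:scala}
import java.sql.Types

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types.{DataType, MapType}

object PostgresMapDialectSketch extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:postgresql")

  // map Spark MapType columns to JSONB (or HSTORE) in the generated DDL
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case _: MapType => Some(JdbcType("JSONB", Types.OTHER))
    case _          => None
  }
}

// JdbcDialects.registerDialect(PostgresMapDialectSketch)
{code}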

 

 

 
{code:java}
java.lang.IllegalArgumentException: Can't get JDBC type for map
    at 
org.apache.spark.sql.errors.QueryExecutionErrors$.cannotGetJdbcTypeError(QueryExecutionErrors.scala:810)
    at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$getJdbcType$2(JdbcUtils.scala:162)
    at scala.Option.getOrElse(Option.scala:201)
    at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getJdbcType(JdbcUtils.scala:162)
    at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$schemaString$4(JdbcUtils.scala:782)
    at scala.collection.immutable.Map$EmptyMap$.getOrElse(Map.scala:228)
    at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$schemaString$3(JdbcUtils.scala:782)
    at scala.collection.ArrayOps$.foreach$extension(ArrayOps.scala:1328)
    at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.schemaString(JdbcUtils.scala:779)
    at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createTable(JdbcUtils.scala:883)
    at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:81)
    at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
    at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
    at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
    at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
    at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
    at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
    at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
    at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
    at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
    at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
    at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94)
    at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
    at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
    at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
    at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
    at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
    at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
    at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
    at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
    at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560)
    at 
org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:94)
    at 
org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecutio

[jira] [Resolved] (SPARK-40222) Numeric try_add/try_divide/try_subtract/try_multiply should throw error from their children

2022-08-26 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-40222.

Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37663
[https://github.com/apache/spark/pull/37663]

> Numeric try_add/try_divide/try_subtract/try_multiply should throw error from 
> their children
> ---
>
> Key: SPARK-40222
> URL: https://issues.apache.org/jira/browse/SPARK-40222
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.4.0
>
>
> Similar to https://issues.apache.org/jira/browse/SPARK-40054, we should 
> refactor the 
> {{{}try_add{}}}/{{{}try_subtract{}}}/{{{}try_multiply{}}}/{{{}try_divide{}}} 
> functions so that the errors from their children will be shown instead of 
> ignored.
>  Spark SQL allows arithmetic operations between 
> Number/Date/Timestamp/CalendarInterval/AnsiInterval (see the rule 
> [ResolveBinaryArithmetic|https://github.com/databricks/runtime/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L501]
>  for details). Some of these combinations can throw exceptions too:
>  * Date + CalendarInterval
>  * Date + AnsiInterval
>  * Timestamp + AnsiInterval
>  * Date - CalendarInterval
>  * Date - AnsiInterval
>  * Timestamp - AnsiInterval
>  * Number * CalendarInterval
>  * Number * AnsiInterval
>  * CalendarInterval / Number
>  * AnsiInterval / Number
> This Jira is for the cases when both input data types are numbers.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40236) Error messages produced from error-classes.json should not have hard-coded sentences as parameters

2022-08-26 Thread Vitalii Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitalii Li updated SPARK-40236:
---
Description: 
Relevant comment: 
[https://github.com/apache/spark/pull/37621#discussion_r955101102].

Example of hard coded string: 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala#L512-L513

> Error messages produced from error-classes.json should not have hard-coded 
> sentences as parameters
> --
>
> Key: SPARK-40236
> URL: https://issues.apache.org/jira/browse/SPARK-40236
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Vitalii Li
>Priority: Major
>
> Relevant comment: 
> [https://github.com/apache/spark/pull/37621#discussion_r955101102].
> Example of hard coded string: 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala#L512-L513



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40236) Error messages produced from error-classes.json should not have hard-coded sentences as parameters

2022-08-26 Thread Vitalii Li (Jira)
Vitalii Li created SPARK-40236:
--

 Summary: Error messages produced from error-classes.json should 
not have hard-coded sentences as parameters
 Key: SPARK-40236
 URL: https://issues.apache.org/jira/browse/SPARK-40236
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Vitalii Li






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40235) Use interruptible lock instead of synchronized in Executor.updateDependencies()

2022-08-26 Thread Josh Rosen (Jira)
Josh Rosen created SPARK-40235:
--

 Summary: Use interruptible lock instead of synchronized in 
Executor.updateDependencies()
 Key: SPARK-40235
 URL: https://issues.apache.org/jira/browse/SPARK-40235
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Josh Rosen
Assignee: Josh Rosen


This patch modifies the synchronization in {{Executor.updateDependencies()}} in 
order to allow tasks to be interrupted while they are blocked and waiting on 
other tasks to finish downloading dependencies.

This synchronization was added years ago in 
[mesos/spark@{{{}7b9e96c{}}}|https://github.com/mesos/spark/commit/7b9e96c99206c0679d9925e0161fde738a5c7c3a]
 in order to prevent concurrently-launching tasks from performing concurrent 
dependency updates (file downloads, and, later, library installation). If one 
task is downloading dependencies, all other newly-launched tasks will block 
until the original dependency download is complete.

Let's say that a Spark task launches, becomes blocked on a 
{{updateDependencies()}} call, then is cancelled while it is blocked. Although 
Spark will send a Thread.interrupt() to the canceled task, the task will 
continue waiting because threads blocked on a {{synchronized}} won't throw an 
InterruptedException in response to the interrupt. As a result, the blocked 
thread will continue to wait until the other thread exits the synchronized 
block. 

In the wild, we saw a case where this happened and the thread remained blocked 
for over 1 minute, causing the TaskReaper to kick in and self-destruct the 
executor.

This PR aims to fix this problem by replacing the {{synchronized}} with a 
ReentrantLock, which has a {{lockInterruptibly}} method.
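
A minimal sketch of that replacement (illustration only, not the actual patch):

{code:scala}
import java.util.concurrent.locks.ReentrantLock

class DependencyUpdaterSketch {
  private val updateLock = new ReentrantLock()

  def updateDependencies(): Unit = {
    // Unlike entering a synchronized block, lockInterruptibly() throws
    // InterruptedException if the waiting (e.g. cancelled) task is interrupted,
    // so the task does not stay blocked behind another task's download.
    updateLock.lockInterruptibly()
    try {
      // fetch missing files / jars / archives here
    } finally {
      updateLock.unlock()
    }
  }
}
{code}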



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40234) Clean only MDC items set by Spark

2022-08-26 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh reassigned SPARK-40234:
---

Assignee: L. C. Hsieh

> Clean only MDC items set by Spark
> -
>
> Key: SPARK-40234
> URL: https://issues.apache.org/jira/browse/SPARK-40234
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> Since SPARK-8981, the Spark executor has MDC support. Before setting MDC items, 
> the executor cleans up all MDC items. But this causes an issue for MDC items 
> set not by Spark but by users elsewhere: these custom MDC items are not shown 
> in the executor log.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40234) Clean only MDC items set by Spark

2022-08-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40234:


Assignee: (was: Apache Spark)

> Clean only MDC items set by Spark
> -
>
> Key: SPARK-40234
> URL: https://issues.apache.org/jira/browse/SPARK-40234
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: L. C. Hsieh
>Priority: Major
>
> Since SPARK-8981, the Spark executor has MDC support. Before setting MDC items, 
> the executor cleans up all MDC items. But this causes an issue for MDC items 
> set not by Spark but by users elsewhere: these custom MDC items are not shown 
> in the executor log.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40234) Clean only MDC items set by Spark

2022-08-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40234:


Assignee: Apache Spark

> Clean only MDC items set by Spark
> -
>
> Key: SPARK-40234
> URL: https://issues.apache.org/jira/browse/SPARK-40234
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Major
>
> Since SPARK-8981, the Spark executor has MDC support. Before setting MDC items, 
> the executor cleans up all MDC items. But this causes an issue for MDC items 
> set not by Spark but by users elsewhere: these custom MDC items are not shown 
> in the executor log.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40234) Clean only MDC items set by Spark

2022-08-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585609#comment-17585609
 ] 

Apache Spark commented on SPARK-40234:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/37680

> Clean only MDC items set by Spark
> -
>
> Key: SPARK-40234
> URL: https://issues.apache.org/jira/browse/SPARK-40234
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: L. C. Hsieh
>Priority: Major
>
> Since SPARK-8981, the Spark executor has MDC support. Before setting MDC items, 
> the executor cleans up all MDC items. But this causes an issue for MDC items 
> set not by Spark but by users elsewhere: these custom MDC items are not shown 
> in the executor log.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40234) Clean only MDC items set by Spark

2022-08-26 Thread L. C. Hsieh (Jira)
L. C. Hsieh created SPARK-40234:
---

 Summary: Clean only MDC items set by Spark
 Key: SPARK-40234
 URL: https://issues.apache.org/jira/browse/SPARK-40234
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: L. C. Hsieh


Since SPARK-8981, the Spark executor has MDC support. Before setting MDC items, 
the executor cleans up all MDC items. But this causes an issue for MDC items 
set not by Spark but by users elsewhere: these custom MDC items are not shown 
in the executor log.
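
A small sketch of the intended behavior (illustrative only; the helper and key 
tracking are assumptions, not the actual Spark code):

{code:scala}
import scala.collection.mutable

import org.slf4j.MDC

object TaskMdcSketch {
  // remember only the keys Spark itself set for this task
  private val sparkSetKeys = mutable.Set[String]()

  def setSparkMdc(key: String, value: String): Unit = {
    MDC.put(key, value)
    sparkSetKeys += key
  }

  def cleanSparkMdc(): Unit = {
    // previously the executor called MDC.clear(), wiping user-set items as well;
    // removing only Spark's own keys keeps custom MDC items in the executor log
    sparkSetKeys.foreach(k => MDC.remove(k))
    sparkSetKeys.clear()
  }
}
{code}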





--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35242) Support change catalog default database for spark

2022-08-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585601#comment-17585601
 ] 

Apache Spark commented on SPARK-35242:
--

User 'roczei' has created a pull request for this issue:
https://github.com/apache/spark/pull/37679

> Support change catalog default database for spark
> -
>
> Key: SPARK-35242
> URL: https://issues.apache.org/jira/browse/SPARK-35242
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: hong dongdong
>Priority: Major
>
> The Spark catalog default database can only be 'default'. When we cannot 
> access 'default', we get a 'Permission denied' exception. We should support 
> changing the default database for the catalog, as 'jdbc/thrift' does.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35242) Support change catalog default database for spark

2022-08-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585602#comment-17585602
 ] 

Apache Spark commented on SPARK-35242:
--

User 'roczei' has created a pull request for this issue:
https://github.com/apache/spark/pull/37679

> Support change catalog default database for spark
> -
>
> Key: SPARK-35242
> URL: https://issues.apache.org/jira/browse/SPARK-35242
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: hong dongdong
>Priority: Major
>
> The Spark catalog default database can only be 'default'. When we cannot 
> access 'default', we get a 'Permission denied' exception. We should support 
> changing the default database for the catalog, as 'jdbc/thrift' does.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40233) Unable to load large pandas dataframe to pyspark

2022-08-26 Thread Niranda Perera (Jira)
Niranda Perera created SPARK-40233:
--

 Summary: Unable to load large pandas dataframe to pyspark
 Key: SPARK-40233
 URL: https://issues.apache.org/jira/browse/SPARK-40233
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Niranda Perera


I've been trying to join two large pandas dataframes with pyspark using the 
following code. I'm trying to vary the executor cores allocated to the application 
and measure the scalability of pyspark (strong scaling).
{code:java}
r = 10 # 1Bn rows 
it = 10
w = 256
unique = 0.9

TOTAL_MEM = 240
TOTAL_NODES = 14

max_val = r * unique
rng = default_rng()
frame_data = rng.integers(0, max_val, size=(r, 2)) 
frame_data1 = rng.integers(0, max_val, size=(r, 2)) 
print(f"data generated", flush=True)

df_l = pd.DataFrame(frame_data).add_prefix("col")
df_r = pd.DataFrame(frame_data1).add_prefix("col")
print(f"data loaded", flush=True)


procs = int(math.ceil(w / TOTAL_NODES))
mem = int(TOTAL_MEM*0.9)
print(f"world sz {w} procs per worker {procs} mem {mem} iter {it}", flush=True)

spark = SparkSession\
.builder\
.appName(f'join {r} {w}')\
.master('spark://node:7077')\
.config('spark.executor.memory', f'{int(mem*0.6)}g')\
.config('spark.executor.pyspark.memory', f'{int(mem*0.4)}g')\
.config('spark.cores.max', w)\
.config('spark.driver.memory', '100g')\
.config('spark.sql.execution.arrow.pyspark.enabled', 'true')\
.getOrCreate()

sdf0 = spark.createDataFrame(df_l).repartition(w).cache()
sdf1 = spark.createDataFrame(df_r).repartition(w).cache()
print(f"data loaded to spark", flush=True)

try:   
for i in range(it):
t1 = time.time()
out = sdf0.join(sdf1, on='col0', how='inner')
count = out.count()
t2 = time.time()
print(f"timings {r} 1 {i} {(t2 - t1) * 1000:.0f} ms, {count}", 
flush=True)

del out
del count
gc.collect()
finally:
spark.stop() {code}
{*}Cluster{*}: I am using a standalone Spark cluster of 15 nodes, each with 48 
cores and 240GB RAM. The master and the driver code run on node1, while the other 
14 nodes run workers allocated the maximum memory. In the Spark context, I reserve 
90% of the total memory for the executor, split 60% to the JVM and 40% to pyspark.

{*}Issue{*}: When I run the above program, I can see that the executors are being 
assigned to the app, but it doesn't move forward even after 60 minutes. For a 
smaller row count (10M), this worked without a problem. Driver output:
{code:java}
world sz 256 procs per worker 19 mem 216 iter 8
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
22/08/26 14:52:22 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
/N/u/d/dnperera/.conda/envs/env/lib/python3.8/site-packages/pyspark/sql/pandas/conversion.py:425:
 UserWarning: createDataFrame attempted Arrow optimization because 
'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by 
the reason below:
  Negative initial size: -589934400
Attempting non-optimization as 
'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
  warn(msg) {code}
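Not part of the report: the negative initial size in the Arrow warning looks like 
an allocation for a single very large conversion; below is a sketch of one way to 
feed the data to Spark in smaller pieces while that is investigated (the chunk 
count is arbitrary).
{code:python}
from functools import reduce

import numpy as np

def create_spark_df_in_chunks(spark, pdf, n_chunks=64):
    # Convert a large pandas DataFrame in smaller pieces and union the results;
    # n_chunks is arbitrary and should keep each piece comfortably small.
    parts = (spark.createDataFrame(chunk) for chunk in np.array_split(pdf, n_chunks))
    return reduce(lambda a, b: a.unionByName(b), parts)

# usage sketch:
# sdf0 = create_spark_df_in_chunks(spark, df_l).repartition(w).cache()
{code}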



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40232) KMeans: high variability in results despite high initSteps parameter value

2022-08-26 Thread Patryk Piekarski (Jira)
Patryk Piekarski created SPARK-40232:


 Summary: KMeans: high variability in results despite high 
initSteps parameter value
 Key: SPARK-40232
 URL: https://issues.apache.org/jira/browse/SPARK-40232
 Project: Spark
  Issue Type: Bug
  Components: ML, PySpark
Affects Versions: 3.3.0
Reporter: Patryk Piekarski
 Attachments: sample_data.csv

I'm running KMeans on a sample dataset using PySpark. I want the results to be 
fairly stable, so I play with the _initSteps_ parameter. My understanding is that 
the higher the number of steps for the k-means|| initialization mode, the more 
iterations the algorithm runs, and in the end it selects the best model out of 
all iterations. That's the behavior I observe when running the sklearn 
implementation with _n_init_ >= 10. However, when running the PySpark 
implementation, regardless of the number of partitions of the underlying data 
frame (tested with 1, 4, and 8 partitions), and even when setting _initSteps_ to 
10, 50, or 500, the results I get with different seeds differ, and the 
trainingCost value I observe is sometimes far from the lowest.

As a workaround, to force the algorithm to iterate and select the best model, I 
used a loop with a dynamic seed (a sketch of that loop follows the code below).

Sklearn gets a trainingCost near 276655 in each iteration.

The PySpark implementation of KMeans gets there in the 2nd, 5th, and 6th 
iterations, but all the remaining iterations yield higher values.

Does the _initSteps_ parameter work as expected? My findings suggest that 
something might be off here.

Let me know where I could upload this sample dataset (2MB).

 
{code:java}
import pandas as pd
from sklearn.cluster import KMeans as KMeansSKlearn
df = pd.read_csv('sample_data.csv')

minimum = 
for i in range(1,10):
    kmeans = KMeansSKlearn(init="k-means++", n_clusters=5, n_init=10, 
random_state=i)
    model = kmeans.fit(df)
    print(f'Sklearn iteration {i}: {round(model.inertia_)}')

from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("kmeans-test") \
    .config('spark.driver.memory', '2g') \
    .master("local[2]") \
    .getOrCreate()

df1 = spark.createDataFrame(df)

from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler
assemble=VectorAssembler(inputCols=df1.columns, outputCol='features')
assembled_data=assemble.transform(df1)

minimum = 
for i in range(1,10):
    kmeans = KMeans(featuresCol='features', k=5, initSteps=100, maxIter=300, 
seed=i, tol=0.0001)
    model = kmeans.fit(assembled_data)
    summary = model.summary
    print(f'PySpark iteration {i}: {round(summary.trainingCost)}'){code}
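Not part of the report: a minimal sketch of the workaround mentioned above 
(selecting the best of several fits over different seeds). It assumes the 
{{assembled_data}} built in the snippet above, and the number of seeds is 
arbitrary.
{code:python}
from pyspark.ml.clustering import KMeans

# "Loop with a dynamic seed" workaround: keep the fit with the lowest trainingCost.
best_cost, best_model = float("inf"), None
for seed in range(1, 11):                      # number of attempts is arbitrary
    kmeans = KMeans(featuresCol='features', k=5, initSteps=100,
                    maxIter=300, seed=seed, tol=0.0001)
    model = kmeans.fit(assembled_data)
    cost = model.summary.trainingCost
    if cost < best_cost:                       # retain the lowest-cost model
        best_cost, best_model = cost, model

print(f'best trainingCost over 10 seeds: {round(best_cost)}')
{code}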
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40232) KMeans: high variability in results despite high initSteps parameter value

2022-08-26 Thread Patryk Piekarski (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patryk Piekarski updated SPARK-40232:
-
Attachment: sample_data.csv

> KMeans: high variability in results despite high initSteps parameter value
> --
>
> Key: SPARK-40232
> URL: https://issues.apache.org/jira/browse/SPARK-40232
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Patryk Piekarski
>Priority: Major
> Attachments: sample_data.csv
>
>
> I'm running KMeans on a sample dataset using PySpark. I want the results to be 
> fairly stable, so I play with the _initSteps_ parameter. My understanding is 
> that the higher the number of steps for the k-means|| initialization mode, the 
> more iterations the algorithm runs, and in the end it selects the best model 
> out of all iterations. That's the behavior I observe when running the sklearn 
> implementation with _n_init_ >= 10. However, when running the PySpark 
> implementation, regardless of the number of partitions of the underlying data 
> frame (tested with 1, 4, and 8 partitions), and even when setting _initSteps_ 
> to 10, 50, or 500, the results I get with different seeds differ, and the 
> trainingCost value I observe is sometimes far from the lowest.
> As a workaround, to force the algorithm to iterate and select the best model, 
> I used a loop with a dynamic seed.
> Sklearn gets a trainingCost near 276655 in each iteration.
> The PySpark implementation of KMeans gets there in the 2nd, 5th, and 6th 
> iterations, but all the remaining iterations yield higher values.
> Does the _initSteps_ parameter work as expected? My findings suggest that 
> something might be off here.
> Let me know where I could upload this sample dataset (2MB).
>  
> {code:java}
> import pandas as pd
> from sklearn.cluster import KMeans as KMeansSKlearn
> df = pd.read_csv('sample_data.csv')
> minimum = 
> for i in range(1,10):
>     kmeans = KMeansSKlearn(init="k-means++", n_clusters=5, n_init=10, 
> random_state=i)
>     model = kmeans.fit(df)
>     print(f'Sklearn iteration {i}: {round(model.inertia_)}')
> from pyspark.sql import SparkSession
> spark = SparkSession.builder \
>     .appName("kmeans-test") \
>     .config('spark.driver.memory', '2g') \
>     .master("local[2]") \
>     .getOrCreate()
> df1 = spark.createDataFrame(df)
> from pyspark.ml.clustering import KMeans
> from pyspark.ml.feature import VectorAssembler
> assemble=VectorAssembler(inputCols=df1.columns, outputCol='features')
> assembled_data=assemble.transform(df1)
> minimum = 
> for i in range(1,10):
>     kmeans = KMeans(featuresCol='features', k=5, initSteps=100, maxIter=300, 
> seed=i, tol=0.0001)
>     model = kmeans.fit(assembled_data)
>     summary = model.summary
>     print(f'PySpark iteration {i}: {round(summary.trainingCost)}'){code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40124) Update TPCDS v1.4 q32 for Plan Stability tests

2022-08-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585318#comment-17585318
 ] 

Apache Spark commented on SPARK-40124:
--

User 'mskapilks' has created a pull request for this issue:
https://github.com/apache/spark/pull/37678

> Update TPCDS v1.4 q32 for Plan Stability tests
> --
>
> Key: SPARK-40124
> URL: https://issues.apache.org/jira/browse/SPARK-40124
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Kapil Singh
>Assignee: Kapil Singh
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39931) Improve performance of applyInPandas for very small groups

2022-08-26 Thread Enrico Minack (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enrico Minack updated SPARK-39931:
--
Description: 
Calling {{DataFrame.groupby(...).applyInPandas(...)}} for very small groups in 
PySpark is very slow. The reason is that for each group, PySpark creates a 
Pandas DataFrame and calls into the Python code. For very small groups, the 
overhead is huge; for large groups, it is smaller.

Here is a benchmark (seconds to {{groupBy(...).applyInPandas(...)}} 10m rows):
||groupSize||Scala||pyspark.sql||pyspark.pandas||
|1024|8.9|16.2|7.8|
|512|9.4|26.7|9.8|
|256|9.3|44.5|20.2|
|128|9.5|82.7|48.8|
|64|9.5|158.2|91.9|
|32|9.6|319.8|207.3|
|16|9.6|652.6|261.5|
|8|9.5|1,376|663.0|
|4|9.8|2,656|1,168|
|2|10.4|5,412|2,456|
|1|11.3|9,491|4,642|

*Idea to overcome this* is to call into the Python side with a Pandas DataFrame 
that potentially contains multiple groups, then perform a Pandas 
{{DataFrame.groupBy(...).apply(...)}} or provide the {{DataFrameGroupBy}} to 
the Python method. With large groups, that Pandas DataFrame has all rows of a 
single group; with small groups, it contains many groups. This should improve 
efficiency.

  was:
Calling {{DataFrame.groupby(...).applyInPandas(...)}} for very small groups in 
PySpark is very slow. The reason is that for each group, PySpark creates a 
Pandas DataFrame and calls into the Python code. For very small groups, the 
overhead is huge; for large groups, it is smaller.

Here is a benchmark (seconds to {{groupBy(...).applyInPandas(...)}} 10m rows):
||groupSize||Scala||pyspark.sql||pyspark.pandas||
|1024|8.9|16.2|7.8|
|512|9.4|26.7|9.8|
|256|9.3|44.5|20.2|
|128|9.5|82.7|48.8|
|64|9.5|158.2|91.9|
|32|9.6|319.8|207.3|
|16|9.6|652.6|261.5|
|8|9.5|1,376|663.0|
|4|9.8|2,656|1,168|
|2|10.4|5,412|2,456|
|1|11.3|8,162|4,642|

*Idea to overcome this* is to call into the Python side with a Pandas DataFrame 
that potentially contains multiple groups, then perform a Pandas 
{{DataFrame.groupBy(...).apply(...)}} or provide the {{DataFrameGroupBy}} to 
the Python method. With large groups, that Pandas DataFrame has all rows of a 
single group; with small groups, it contains many groups. This should improve 
efficiency.


> Improve performance of applyInPandas for very small groups
> --
>
> Key: SPARK-39931
> URL: https://issues.apache.org/jira/browse/SPARK-39931
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Enrico Minack
>Priority: Major
>
> Calling {{DataFrame.groupby(...).applyInPandas(...)}} for very small groups 
> in PySpark is very slow. The reason is that for each group, PySpark creates a 
> Pandas DataFrame and calls into the Python code. For very small groups, the 
> overhead is huge; for large groups, it is smaller.
> Here is a benchmark (seconds to {{groupBy(...).applyInPandas(...)}} 10m 
> rows):
> ||groupSize||Scala||pyspark.sql||pyspark.pandas||
> |1024|8.9|16.2|7.8|
> |512|9.4|26.7|9.8|
> |256|9.3|44.5|20.2|
> |128|9.5|82.7|48.8|
> |64|9.5|158.2|91.9|
> |32|9.6|319.8|207.3|
> |16|9.6|652.6|261.5|
> |8|9.5|1,376|663.0|
> |4|9.8|2,656|1,168|
> |2|10.4|5,412|2,456|
> |1|11.3|9,491|4,642|
> *Idea to overcome this* is to call into the Python side with a Pandas DataFrame 
> that potentially contains multiple groups, then perform a Pandas 
> {{DataFrame.groupBy(...).apply(...)}} or provide the {{DataFrameGroupBy}} to 
> the Python method. With large groups, that Pandas DataFrame has all rows of 
> a single group; with small groups, it contains many groups. This should 
> improve efficiency.
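
Not part of the ticket: a minimal sketch contrasting the current per-group call 
with the pandas-side grouping described above; the column names, schema, and 
example function are illustrative only.
{code:python}
import pandas as pd

# Current API: Spark calls the function once per group, each group arriving
# as its own small pandas DataFrame (expensive when groups are tiny).
def per_group(pdf: pd.DataFrame) -> pd.DataFrame:
    return pdf.assign(n=len(pdf))

# sdf.groupby("id").applyInPandas(per_group, schema="id long, v double, n long")

# Direction sketched in the description: hand Python a DataFrame that may hold
# many groups and group it on the pandas side, amortizing the per-call overhead.
def many_groups(pdf: pd.DataFrame) -> pd.DataFrame:
    return pdf.groupby("id", group_keys=False).apply(per_group)
{code}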



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40039) Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

2022-08-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585315#comment-17585315
 ] 

Apache Spark commented on SPARK-40039:
--

User 'roczei' has created a pull request for this issue:
https://github.com/apache/spark/pull/37677

> Introducing a streaming checkpoint file manager based on Hadoop's Abortable 
> interface
> -
>
> Key: SPARK-40039
> URL: https://issues.apache.org/jira/browse/SPARK-40039
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently on S3 the checkpoint file manager (called 
> FileContextBasedCheckpointFileManager) is based on rename: when a file is 
> opened for an atomic stream, a temporary file is used instead, and when the 
> stream is committed the file is renamed.
> But on S3 a rename is a file copy, so it has serious performance 
> implications.
> On Hadoop 3 there is a new interface called *Abortable*, and 
> *S3AFileSystem* implements this capability on top of S3's 
> multipart upload. So when the file is committed a POST is sent 
> ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html])
>  and when it is aborted a DELETE is sent 
> ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_AbortMultipartUpload.html])



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40039) Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

2022-08-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585314#comment-17585314
 ] 

Apache Spark commented on SPARK-40039:
--

User 'roczei' has created a pull request for this issue:
https://github.com/apache/spark/pull/37677

> Introducing a streaming checkpoint file manager based on Hadoop's Abortable 
> interface
> -
>
> Key: SPARK-40039
> URL: https://issues.apache.org/jira/browse/SPARK-40039
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently on S3 the checkpoint file manager (called 
> FileContextBasedCheckpointFileManager) is based on rename: when a file is 
> opened for an atomic stream, a temporary file is used instead, and when the 
> stream is committed the file is renamed.
> But on S3 a rename is a file copy, so it has serious performance 
> implications.
> On Hadoop 3 there is a new interface called *Abortable*, and 
> *S3AFileSystem* implements this capability on top of S3's 
> multipart upload. So when the file is committed a POST is sent 
> ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html])
>  and when it is aborted a DELETE is sent 
> ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_AbortMultipartUpload.html])



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34265) Instrument Python UDF execution using SQL Metrics

2022-08-26 Thread Luca Canali (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Canali updated SPARK-34265:

Affects Version/s: 3.3.0
   (was: 3.2.0)

> Instrument Python UDF execution using SQL Metrics
> -
>
> Key: SPARK-34265
> URL: https://issues.apache.org/jira/browse/SPARK-34265
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Luca Canali
>Priority: Minor
> Attachments: PandasUDF_ArrowEvalPython_Metrics.png, 
> PythonSQLMetrics_Jira_Picture.png, proposed_Python_SQLmetrics_v20210128.png
>
>
> This proposes to add SQLMetrics instrumentation for Python UDF. 
> This is aimed at improving monitoring and performance troubleshooting of 
> Python UDFs, Pandas UDF, including also the use of MapPartittion, and 
> MapInArrow.
> The introduced metrics are exposed to the end users via the metrics system 
> and are visible through the WebUI interface, in the SQL/DataFrame tab for 
> execution steps related to Python UDF execution. See also the attached 
> screenshots.
> This intrumentation is lightweight and can be used in production and for 
> monitoring. It is complementary to the Python/Pandas UDF Profiler introduced 
> in Spark 3.3 
> [https://spark.apache.org/docs/latest/api/python/development/debugging.html#python-pandas-udf]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34265) Instrument Python UDF execution using SQL Metrics

2022-08-26 Thread Luca Canali (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Canali updated SPARK-34265:

Description: 
This proposes to add SQLMetrics instrumentation for Python UDFs. 
It is aimed at improving monitoring and performance troubleshooting of Python 
UDFs and Pandas UDFs, including the use of MapPartitions and MapInArrow.
The introduced metrics are exposed to end users via the metrics system and 
are visible through the WebUI, in the SQL/DataFrame tab for execution 
steps related to Python UDF execution. See also the attached screenshots.

This instrumentation is lightweight and can be used in production and for 
monitoring. It is complementary to the Python/Pandas UDF Profiler introduced in 
Spark 3.3 
[https://spark.apache.org/docs/latest/api/python/development/debugging.html#python-pandas-udf]

  was:
This proposes to add SQLMetrics instrumentation for Python UDFs. This is aimed 
at improving monitoring and performance troubleshooting of Python code called 
by Spark, via UDFs, Pandas UDFs, or MapPartitions.
The introduced metrics are exposed to end users via the WebUI, in 
the SQL tab for execution steps related to Python UDF execution. 
The scope of this has been limited to Pandas UDFs and related operations, namely: 
ArrowEvalPython, AggregateInPandas, FlatMapGroupsInPandas, MapInPandas, 
FlatMapCoGroupsInPandas, PythonMapInArrow, WindowInPandas.
See also the attached screenshot.


> Instrument Python UDF execution using SQL Metrics
> -
>
> Key: SPARK-34265
> URL: https://issues.apache.org/jira/browse/SPARK-34265
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.2.0
>Reporter: Luca Canali
>Priority: Minor
> Attachments: PandasUDF_ArrowEvalPython_Metrics.png, 
> PythonSQLMetrics_Jira_Picture.png, proposed_Python_SQLmetrics_v20210128.png
>
>
> This proposes to add SQLMetrics instrumentation for Python UDFs. 
> This is aimed at improving monitoring and performance troubleshooting of 
> Python UDFs and Pandas UDFs, including the use of MapPartitions and 
> MapInArrow.
> The introduced metrics are exposed to end users via the metrics system 
> and are visible through the WebUI, in the SQL/DataFrame tab for 
> execution steps related to Python UDF execution. See also the attached 
> screenshots.
> This instrumentation is lightweight and can be used in production and for 
> monitoring. It is complementary to the Python/Pandas UDF Profiler introduced 
> in Spark 3.3 
> [https://spark.apache.org/docs/latest/api/python/development/debugging.html#python-pandas-udf]
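
Not part of the ticket: a minimal example of the kind of Pandas UDF whose 
execution steps these metrics would cover; the session, column, and function 
names are illustrative only.
{code:python}
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("pandas-udf-metrics-demo").getOrCreate()

@pandas_udf("double")
def plus_one(v: pd.Series) -> pd.Series:
    # Runs in the Python worker; its plan node (ArrowEvalPython) is where the
    # proposed SQL metrics would appear in the SQL/DataFrame tab.
    return v + 1.0

df = spark.range(1_000_000).selectExpr("cast(id as double) as v")
df.select(plus_one("v")).count()
{code}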



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40124) Update TPCDS v1.4 q32 for Plan Stability tests

2022-08-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585297#comment-17585297
 ] 

Apache Spark commented on SPARK-40124:
--

User 'mskapilks' has created a pull request for this issue:
https://github.com/apache/spark/pull/37675

> Update TPCDS v1.4 q32 for Plan Stability tests
> --
>
> Key: SPARK-40124
> URL: https://issues.apache.org/jira/browse/SPARK-40124
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Kapil Singh
>Assignee: Kapil Singh
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40124) Update TPCDS v1.4 q32 for Plan Stability tests

2022-08-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585296#comment-17585296
 ] 

Apache Spark commented on SPARK-40124:
--

User 'mskapilks' has created a pull request for this issue:
https://github.com/apache/spark/pull/37675

> Update TPCDS v1.4 q32 for Plan Stability tests
> --
>
> Key: SPARK-40124
> URL: https://issues.apache.org/jira/browse/SPARK-40124
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Kapil Singh
>Assignee: Kapil Singh
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40156) url_decode() exposes a Java error

2022-08-26 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-40156:


Assignee: ming95

> url_decode() exposes a Java error
> -
>
> Key: SPARK-40156
> URL: https://issues.apache.org/jira/browse/SPARK-40156
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Assignee: ming95
>Priority: Major
>
> Given a badly encoded string, Spark returns a Java error.
> It should instead return an ERROR_CLASS.
> spark-sql> SELECT url_decode('http%3A%2F%2spark.apache.org');
> 22/08/20 17:17:20 ERROR SparkSQLDriver: Failed in [SELECT 
> url_decode('http%3A%2F%2spark.apache.org')]
> java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in 
> escape (%) pattern - Error at index 1 in: "2s"
>  at java.base/java.net.URLDecoder.decode(URLDecoder.java:232)
>  at java.base/java.net.URLDecoder.decode(URLDecoder.java:142)
>  at 
> org.apache.spark.sql.catalyst.expressions.UrlCodec$.decode(urlExpressions.scala:113)
>  at 
> org.apache.spark.sql.catalyst.expressions.UrlCodec.decode(urlExpressions.scala)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40156) url_decode() exposes a Java error

2022-08-26 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-40156.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37636
[https://github.com/apache/spark/pull/37636]

> url_decode() exposes a Java error
> -
>
> Key: SPARK-40156
> URL: https://issues.apache.org/jira/browse/SPARK-40156
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Assignee: ming95
>Priority: Major
> Fix For: 3.4.0
>
>
> Given a badly encoded string, Spark returns a Java error.
> It should instead return an ERROR_CLASS.
> spark-sql> SELECT url_decode('http%3A%2F%2spark.apache.org');
> 22/08/20 17:17:20 ERROR SparkSQLDriver: Failed in [SELECT 
> url_decode('http%3A%2F%2spark.apache.org')]
> java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in 
> escape (%) pattern - Error at index 1 in: "2s"
>  at java.base/java.net.URLDecoder.decode(URLDecoder.java:232)
>  at java.base/java.net.URLDecoder.decode(URLDecoder.java:142)
>  at 
> org.apache.spark.sql.catalyst.expressions.UrlCodec$.decode(urlExpressions.scala:113)
>  at 
> org.apache.spark.sql.catalyst.expressions.UrlCodec.decode(urlExpressions.scala)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39931) Improve performance of applyInPandas for very small groups

2022-08-26 Thread Enrico Minack (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enrico Minack updated SPARK-39931:
--
Description: 
Calling {{DataFrame.groupby(...).applyInPandas(...)}} for very small groups in 
PySpark is very slow. The reason is that for each group, PySpark creates a 
Pandas DataFrame and calls into the Python code. For very small groups, the 
overhead is huge; for large groups, it is smaller.

Here is a benchmark (seconds to {{groupBy(...).applyInPandas(...)}} 10m rows):
||groupSize||Scala||pyspark.sql||pyspark.pandas||
|1024|8.9|16.2|7.8|
|512|9.4|26.7|9.8|
|256|9.3|44.5|20.2|
|128|9.5|82.7|48.8|
|64|9.5|158.2|91.9|
|32|9.6|319.8|207.3|
|16|9.6|652.6|261.5|
|8|9.5|1,376|663.0|
|4|9.8|2,656|1,168|
|2|10.4|5,412|2,456|
|1|11.3|8,162|4,642|

*Idea to overcome this* is to call into the Python side with a Pandas DataFrame 
that potentially contains multiple groups, then perform a Pandas 
{{DataFrame.groupBy(...).apply(...)}} or provide the {{DataFrameGroupBy}} to 
the Python method. With large groups, that Pandas DataFrame has all rows of a 
single group; with small groups, it contains many groups. This should improve 
efficiency.

  was:
Calling {{DataFrame.groupby(...).applyInPandas(...)}} for very small groups in 
PySpark is very slow. The reason is that for each group, PySpark creates a 
Pandas DataFrame and calls into the Python code. For very small groups, the 
overhead is huge; for large groups, it is smaller.

Here is a benchmark (seconds to {{groupBy(...).applyInPandas(...)}} 10m rows):
||groupSize||Scala||pyspark.sql||pyspark.pandas||
|1024|8.9|20.9|7.8|
|512|9.4|31.8|9.8|
|256|9.3|47.0|20.2|
|128|9.5|83.3|48.8|
|64|9.5|137.8|91.9|
|32|9.6|263.6|207.3|
|16|9.6|525.9|261.5|
|8|9.5|1,043|663.0|
|4|9.8|2,073|1,168|
|2|10.4|4,132|2,456|
|1|11.3|8,162|4,642|

*Idea to overcome this* is to call into the Python side with a Pandas DataFrame 
that potentially contains multiple groups, then perform a Pandas 
{{DataFrame.groupBy(...).apply(...)}} or provide the {{DataFrameGroupBy}} to 
the Python method. With large groups, that Pandas DataFrame has all rows of a 
single group; with small groups, it contains many groups. This should improve 
efficiency.


> Improve performance of applyInPandas for very small groups
> --
>
> Key: SPARK-39931
> URL: https://issues.apache.org/jira/browse/SPARK-39931
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Enrico Minack
>Priority: Major
>
> Calling {{DataFrame.groupby(...).applyInPandas(...)}} for very small groups 
> in PySpark is very slow. The reason is that for each group, PySpark creates a 
> Pandas DataFrame and calls into the Python code. For very small groups, the 
> overhead is huge; for large groups, it is smaller.
> Here is a benchmark (seconds to {{groupBy(...).applyInPandas(...)}} 10m 
> rows):
> ||groupSize||Scala||pyspark.sql||pyspark.pandas||
> |1024|8.9|16.2|7.8|
> |512|9.4|26.7|9.8|
> |256|9.3|44.5|20.2|
> |128|9.5|82.7|48.8|
> |64|9.5|158.2|91.9|
> |32|9.6|319.8|207.3|
> |16|9.6|652.6|261.5|
> |8|9.5|1,376|663.0|
> |4|9.8|2,656|1,168|
> |2|10.4|5,412|2,456|
> |1|11.3|8,162|4,642|
> *Idea to overcome this* is to call into the Python side with a Pandas DataFrame 
> that potentially contains multiple groups, then perform a Pandas 
> {{DataFrame.groupBy(...).apply(...)}} or provide the {{DataFrameGroupBy}} to 
> the Python method. With large groups, that Pandas DataFrame has all rows of 
> a single group; with small groups, it contains many groups. This should 
> improve efficiency.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40231) Add 1TB TPCDS Plan stability tests

2022-08-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40231:


Assignee: (was: Apache Spark)

> Add 1TB TPCDS Plan stability tests
> --
>
> Key: SPARK-40231
> URL: https://issues.apache.org/jira/browse/SPARK-40231
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Kapil Singh
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40231) Add 1TB TPCDS Plan stability tests

2022-08-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585279#comment-17585279
 ] 

Apache Spark commented on SPARK-40231:
--

User 'mskapilks' has created a pull request for this issue:
https://github.com/apache/spark/pull/37674

> Add 1TB TPCDS Plan stability tests
> --
>
> Key: SPARK-40231
> URL: https://issues.apache.org/jira/browse/SPARK-40231
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Kapil Singh
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40231) Add 1TB TPCDS Plan stability tests

2022-08-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585277#comment-17585277
 ] 

Apache Spark commented on SPARK-40231:
--

User 'mskapilks' has created a pull request for this issue:
https://github.com/apache/spark/pull/37674

> Add 1TB TPCDS Plan stability tests
> --
>
> Key: SPARK-40231
> URL: https://issues.apache.org/jira/browse/SPARK-40231
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Kapil Singh
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40231) Add 1TB TPCDS Plan stability tests

2022-08-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40231:


Assignee: Apache Spark

> Add 1TB TPCDS Plan stability tests
> --
>
> Key: SPARK-40231
> URL: https://issues.apache.org/jira/browse/SPARK-40231
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Kapil Singh
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38934) Provider TemporaryAWSCredentialsProvider has no credentials

2022-08-26 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585275#comment-17585275
 ] 

Steve Loughran commented on SPARK-38934:


[~graceee318] try explicitly setting the AWS secrets as per-bucket options. If 
the bucket is called "my-bucket", set them in your config using these options:

{code}

spark.hadoop.fs.s3a.bucket.my-bucket.access.key
spark.hadoop.fs.s3a.bucket.my-bucket.secret.key
spark.hadoop.fs.s3a.bucket.my-bucket.session.token

{code}

This will stop any env var propagation from interfering with the values for the 
bucket.
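
Not part of the comment: a minimal PySpark sketch of passing those per-bucket 
options at session build time; the bucket name and placeholder values are 
illustrative only.
{code:python}
from pyspark.sql import SparkSession

# Per-bucket S3A credentials, handed to Hadoop via the spark.hadoop. prefix;
# "my-bucket" and the placeholder values are illustrative only.
spark = (
    SparkSession.builder
    .appName("per-bucket-credentials")
    .config("spark.hadoop.fs.s3a.bucket.my-bucket.aws.credentials.provider",
            "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
    .config("spark.hadoop.fs.s3a.bucket.my-bucket.access.key", "<access key>")
    .config("spark.hadoop.fs.s3a.bucket.my-bucket.secret.key", "<secret key>")
    .config("spark.hadoop.fs.s3a.bucket.my-bucket.session.token", "<session token>")
    .getOrCreate()
)

df = spark.read.parquet("s3a://my-bucket/some/path/")
{code}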

> Provider TemporaryAWSCredentialsProvider has no credentials
> ---
>
> Key: SPARK-38934
> URL: https://issues.apache.org/jira/browse/SPARK-38934
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.2.1
>Reporter: Lily
>Priority: Major
>
>  
> We are using Jupyter Hub on K8s as a notebook-based development environment 
> and Spark on K8s as its backend cluster, with Spark 3.2.1 
> and Hadoop 3.3.1.
> When we run code like the one below in Jupyter Hub on K8s,
>  
> {code:java}
> val perm = ... // get AWS temporary credential by AWS STS from AWS assumed 
> role
> // set AWS temporary credential
> spark.sparkContext.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", 
> "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
> spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", 
> perm.credential.accessKeyID)
> spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", 
> perm.credential.secretAccessKey)
> spark.sparkContext.hadoopConfiguration.set("fs.s3a.session.token", 
> perm.credential.sessionToken)
> // execute simple Spark action
> spark.read.format("parquet").load("s3a:///*").show(1) {code}
>  
>  
> the first few executors left a warning like the one below in the first code 
> execution, but we were able to get the proper result thanks to the Spark task 
> retry function. 
> {code:java}
> 22/04/18 09:13:50 WARN TaskSetManager: Lost task 2.0 in stage 0.0 (TID 2) 
> (10.197.5.15 executor 1): java.nio.file.AccessDeniedException: 
> s3a:///.parquet: 
> org.apache.hadoop.fs.s3a.CredentialInitializationException: Provider 
> TemporaryAWSCredentialsProvider has no credentials
>   at 
> org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:206)
>   at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:117)
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.listStatus(S3AFileSystem.java:2810)
>   at 
> org.apache.spark.util.HadoopFSUtils$.listLeafFiles(HadoopFSUtils.scala:225)
>   at 
> org.apache.spark.util.HadoopFSUtils$.$anonfun$parallelListLeafFilesInternal$6(HadoopFSUtils.scala:136)
>   at scala.collection.immutable.Stream.map(Stream.scala:418)
>   at 
> org.apache.spark.util.HadoopFSUtils$.$anonfun$parallelListLeafFilesInternal$4(HadoopFSUtils.scala:126)
>   at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:863)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:863)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:131)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: org.apache.hadoop.fs.s3a.CredentialInitializationException: 
> Provider TemporaryAWSCredentialsProvider has no credentials
>   at 
> org.apache.hadoop.fs.s3a.auth.AbstractSessionCredentialsProvider.getCredentials(AbstractSessionCredentialsProvider.java:130)
>   at 
> org.apache.hadoop.fs.s3a.AWSCredentialProviderList.getCredentials(AWSCredentialProviderList.java:177)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.getCredentialsFromContext(AmazonHttpClient.java:1266)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.runBeforeRe

[jira] [Created] (SPARK-40231) Add 1TB TPCDS Plan stability tests

2022-08-26 Thread Kapil Singh (Jira)
Kapil Singh created SPARK-40231:
---

 Summary: Add 1TB TPCDS Plan stability tests
 Key: SPARK-40231
 URL: https://issues.apache.org/jira/browse/SPARK-40231
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Kapil Singh






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39931) Improve performance of applyInPandas for very small groups

2022-08-26 Thread Enrico Minack (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enrico Minack updated SPARK-39931:
--
Description: 
Calling {{DataFrame.groupby(...).applyInPandas(...)}} for very small groups in 
PySpark is very slow. The reason is that for each group, PySpark creates a 
Pandas DataFrame and calls into the Python code. For very small groups, the 
overhead is huge; for large groups, it is smaller.

Here is a benchmark (seconds to {{groupBy(...).applyInPandas(...)}} 10m rows):
||groupSize||Scala||pyspark.sql||pyspark.pandas||
|1024|8.9|20.9|7.8|
|512|9.4|31.8|9.8|
|256|9.3|47.0|20.2|
|128|9.5|83.3|48.8|
|64|9.5|137.8|91.9|
|32|9.6|263.6|207.3|
|16|9.6|525.9|261.5|
|8|9.5|1,043|663.0|
|4|9.8|2,073|1,168|
|2|10.4|4,132|2,456|
|1|11.3|8,162|4,642|

*Idea to overcome this* is to call into the Python side with a Pandas DataFrame 
that potentially contains multiple groups, then perform a Pandas 
{{DataFrame.groupBy(...).apply(...)}} or provide the {{DataFrameGroupBy}} to 
the Python method. With large groups, that Pandas DataFrame has all rows of a 
single group; with small groups, it contains many groups. This should improve 
efficiency.

  was:
Calling `DataFrame.groupby(...).applyInPandas(...)` for very small groups in 
PySpark is very slow. The reason is that for each group, PySpark creates a 
Pandas DataFrame and calls into the Python code. For very small groups, the 
overhead is huge; for large groups, it is smaller.

Here is a benchmark (seconds to groupBy(...).applyInPandas(...) 10m rows):
||groupSize||Scala||pyspark.sql||pyspark.pandas||
|1024|8.9|20.9|7.8|
|512|9.4|31.8|9.8|
|256|9.3|47.0|20.2|
|128|9.5|83.3|48.8|
|64|9.5|137.8|91.9|
|32|9.6|263.6|207.3|
|16|9.6|525.9|261.5|
|8|9.5|1,043|663.0|
|4|9.8|2,073|1,168|
|2|10.4|4,132|2,456|
|1|11.3|8,162|4,642|

*Idea to overcome this* is to call into the Python side with a Pandas DataFrame 
that potentially contains multiple groups, then perform a Pandas 
{{DataFrame.groupBy(...).apply(...)}} or provide the {{DataFrameGroupBy}} to 
the Python method. With large groups, that Pandas DataFrame has all rows of a 
single group; with small groups, it contains many groups. This should improve 
efficiency.


> Improve performance of applyInPandas for very small groups
> --
>
> Key: SPARK-39931
> URL: https://issues.apache.org/jira/browse/SPARK-39931
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Enrico Minack
>Priority: Major
>
> Calling {{DataFrame.groupby(...).applyInPandas(...)}} for very small groups 
> in PySpark is very slow. The reason is that for each group, PySpark creates a 
> Pandas DataFrame and calls into the Python code. For very small groups, the 
> overhead is huge; for large groups, it is smaller.
> Here is a benchmark (seconds to {{groupBy(...).applyInPandas(...)}} 10m 
> rows):
> ||groupSize||Scala||pyspark.sql||pyspark.pandas||
> |1024|8.9|20.9|7.8|
> |512|9.4|31.8|9.8|
> |256|9.3|47.0|20.2|
> |128|9.5|83.3|48.8|
> |64|9.5|137.8|91.9|
> |32|9.6|263.6|207.3|
> |16|9.6|525.9|261.5|
> |8|9.5|1,043|663.0|
> |4|9.8|2,073|1,168|
> |2|10.4|4,132|2,456|
> |1|11.3|8,162|4,642|
> *Idea to overcome this* is to call into the Python side with a Pandas DataFrame 
> that potentially contains multiple groups, then perform a Pandas 
> {{DataFrame.groupBy(...).apply(...)}} or provide the {{DataFrameGroupBy}} to 
> the Python method. With large groups, that Pandas DataFrame has all rows of 
> a single group; with small groups, it contains many groups. This should 
> improve efficiency.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39931) Improve performance of applyInPandas for very small groups

2022-08-26 Thread Enrico Minack (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enrico Minack updated SPARK-39931:
--
Description: 
Calling `DataFrame.groupby(...).applyInPandas(...)` for very small groups in 
PySpark is very slow. The reason is that for each group, PySpark creates a 
Pandas DataFrame and calls into the Python code. For very small groups, the 
overhead is huge; for large groups, it is smaller.

Here is a benchmark (seconds to groupBy(...).applyInPandas(...) 10m rows):
||groupSize||Scala||pyspark.sql||pyspark.pandas||
|1024|8.9|20.9|7.8|
|512|9.4|31.8|9.8|
|256|9.3|47.0|20.2|
|128|9.5|83.3|48.8|
|64|9.5|137.8|91.9|
|32|9.6|263.6|207.3|
|16|9.6|525.9|261.5|
|8|9.5|1,043|663.0|
|4|9.8|2,073|1,168|
|2|10.4|4,132|2,456|
|1|11.3|8,162|4,642|

*Idea to overcome this* is to call into the Python side with a Pandas DataFrame 
that potentially contains multiple groups, then perform a Pandas 
{{DataFrame.groupBy(...).apply(...)}} or provide the {{DataFrameGroupBy}} to 
the Python method. With large groups, that Pandas DataFrame has all rows of a 
single group; with small groups, it contains many groups. This should improve 
efficiency.

  was:
Calling `DataFrame.groupby(...).applyInPandas(...)` for very small groups in 
PySpark is very slow. The reason is that for each group, PySpark creates a 
Pandas DataFrame and calls into the Python code. For very small groups, the 
overhead is huge; for large groups, it is smaller.

Here is a benchmark (seconds to groupBy(...).applyInPandas(...) 10m rows):
||groupSize||Scala||pyspark.sql||pyspark.pandas||
|1024|8.9|20.9|7.8|
|512|9.4|31.8|9.8|
|256|9.3|47.0|20.2|
|128|9.5|83.3|48.8|
|64|9.5|137.8|91.9|
|32|9.6|263.6|207.3|
|16|9.6|525.9|261.5|
|8|9.5|1,043|663.0|
|4|9.8|2,073|1,168|
|2|10.4|4,132|2,456|
|1|11.3|8,162|4,642|

*Idea to overcome this* is to call into the Python side with a Pandas DataFrame 
that potentially contains multiple groups, then perform a Pandas 
DataFrame.groupBy(...).apply(...). With large groups, that Pandas DataFrame 
has all rows of a single group; with small groups, it contains many groups. This 
should improve efficiency.

I have prepared a PoC to benchmark that idea but am struggling to massage the 
internal rows before sending them to Python. I have the key and the grouped 
values of each row as {{InternalRow}}s and want to turn them into a single 
{{InternalRow}} in such a way that in Python I can easily group by the key and 
get the same grouped values. For this, I think adding the key as a single 
column (struct) while leaving the grouped values as is should work best. But I 
have not found a way to get this working.


> Improve performance of applyInPandas for very small groups
> --
>
> Key: SPARK-39931
> URL: https://issues.apache.org/jira/browse/SPARK-39931
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Enrico Minack
>Priority: Major
>
> Calling `DataFrame.groupby(...).applyInPandas(...)` for very small groups in 
> PySpark is very slow. The reason is that for each group, PySpark creates a 
> Pandas DataFrame and calls into the Python code. For very small groups, the 
> overhead is huge; for large groups, it is smaller.
> Here is a benchmark (seconds to groupBy(...).applyInPandas(...) 10m rows):
> ||groupSize||Scala||pyspark.sql||pyspark.pandas||
> |1024|8.9|20.9|7.8|
> |512|9.4|31.8|9.8|
> |256|9.3|47.0|20.2|
> |128|9.5|83.3|48.8|
> |64|9.5|137.8|91.9|
> |32|9.6|263.6|207.3|
> |16|9.6|525.9|261.5|
> |8|9.5|1,043|663.0|
> |4|9.8|2,073|1,168|
> |2|10.4|4,132|2,456|
> |1|11.3|8,162|4,642|
> *Idea to overcome this* is to call into the Python side with a Pandas DataFrame 
> that potentially contains multiple groups, then perform a Pandas 
> {{DataFrame.groupBy(...).apply(...)}} or provide the {{DataFrameGroupBy}} to 
> the Python method. With large groups, that Pandas DataFrame has all rows of 
> a single group; with small groups, it contains many groups. This should 
> improve efficiency.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-38934) Provider TemporaryAWSCredentialsProvider has no credentials

2022-08-26 Thread Steve Loughran (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran reopened SPARK-38934:


> Provider TemporaryAWSCredentialsProvider has no credentials
> ---
>
> Key: SPARK-38934
> URL: https://issues.apache.org/jira/browse/SPARK-38934
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.2.1
>Reporter: Lily
>Priority: Major
>
>  
> We are using Jupyter Hub on K8s as a notebook-based development environment 
> and Spark on K8s as its backend cluster, with Spark 3.2.1 
> and Hadoop 3.3.1.
> When we run code like the one below in Jupyter Hub on K8s,
>  
> {code:java}
> val perm = ... // get AWS temporary credential by AWS STS from AWS assumed 
> role
> // set AWS temporary credential
> spark.sparkContext.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", 
> "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
> spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", 
> perm.credential.accessKeyID)
> spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", 
> perm.credential.secretAccessKey)
> spark.sparkContext.hadoopConfiguration.set("fs.s3a.session.token", 
> perm.credential.sessionToken)
> // execute simple Spark action
> spark.read.format("parquet").load("s3a:///*").show(1) {code}
>  
>  
> the first few executors left a warning like the one below in the first code 
> execution, but we were able to get the proper result thanks to the Spark task 
> retry function. 
> {code:java}
> 22/04/18 09:13:50 WARN TaskSetManager: Lost task 2.0 in stage 0.0 (TID 2) 
> (10.197.5.15 executor 1): java.nio.file.AccessDeniedException: 
> s3a:///.parquet: 
> org.apache.hadoop.fs.s3a.CredentialInitializationException: Provider 
> TemporaryAWSCredentialsProvider has no credentials
>   at 
> org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:206)
>   at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:117)
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.listStatus(S3AFileSystem.java:2810)
>   at 
> org.apache.spark.util.HadoopFSUtils$.listLeafFiles(HadoopFSUtils.scala:225)
>   at 
> org.apache.spark.util.HadoopFSUtils$.$anonfun$parallelListLeafFilesInternal$6(HadoopFSUtils.scala:136)
>   at scala.collection.immutable.Stream.map(Stream.scala:418)
>   at 
> org.apache.spark.util.HadoopFSUtils$.$anonfun$parallelListLeafFilesInternal$4(HadoopFSUtils.scala:126)
>   at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:863)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:863)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:131)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: org.apache.hadoop.fs.s3a.CredentialInitializationException: 
> Provider TemporaryAWSCredentialsProvider has no credentials
>   at 
> org.apache.hadoop.fs.s3a.auth.AbstractSessionCredentialsProvider.getCredentials(AbstractSessionCredentialsProvider.java:130)
>   at 
> org.apache.hadoop.fs.s3a.AWSCredentialProviderList.getCredentials(AWSCredentialProviderList.java:177)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.getCredentialsFromContext(AmazonHttpClient.java:1266)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.runBeforeRequestHandlers(AmazonHttpClient.java:842)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:792)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:779)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:753)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:713)
>   a

[jira] [Commented] (SPARK-38934) Provider TemporaryAWSCredentialsProvider has no credentials

2022-08-26 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585268#comment-17585268
 ] 

Steve Loughran commented on SPARK-38934:


Staring at this some more: there are enough occurrences of this with Spark 
alone that maybe there is a problem, not with the s3a code, but with how Spark 
passes configurations around, including env var adoption.

> Provider TemporaryAWSCredentialsProvider has no credentials
> ---
>
> Key: SPARK-38934
> URL: https://issues.apache.org/jira/browse/SPARK-38934
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.2.1
>Reporter: Lily
>Priority: Major
>
>  
> We are using Jupyter Hub on K8s as a notebook-based development environment 
> and Spark on K8s as its backend cluster, with Spark 3.2.1 
> and Hadoop 3.3.1.
> When we run code like the one below in Jupyter Hub on K8s,
>  
> {code:java}
> val perm = ... // get AWS temporary credential by AWS STS from AWS assumed 
> role
> // set AWS temporary credential
> spark.sparkContext.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", 
> "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
> spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", 
> perm.credential.accessKeyID)
> spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", 
> perm.credential.secretAccessKey)
> spark.sparkContext.hadoopConfiguration.set("fs.s3a.session.token", 
> perm.credential.sessionToken)
> // execute simple Spark action
> spark.read.format("parquet").load("s3a:///*").show(1) {code}
>  
>  
> the first few executors left a warning like the one below in the first code 
> execution, but we were able to get the proper result thanks to the Spark task 
> retry function. 
> {code:java}
> 22/04/18 09:13:50 WARN TaskSetManager: Lost task 2.0 in stage 0.0 (TID 2) 
> (10.197.5.15 executor 1): java.nio.file.AccessDeniedException: 
> s3a:///.parquet: 
> org.apache.hadoop.fs.s3a.CredentialInitializationException: Provider 
> TemporaryAWSCredentialsProvider has no credentials
>   at 
> org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:206)
>   at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:117)
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.listStatus(S3AFileSystem.java:2810)
>   at 
> org.apache.spark.util.HadoopFSUtils$.listLeafFiles(HadoopFSUtils.scala:225)
>   at 
> org.apache.spark.util.HadoopFSUtils$.$anonfun$parallelListLeafFilesInternal$6(HadoopFSUtils.scala:136)
>   at scala.collection.immutable.Stream.map(Stream.scala:418)
>   at 
> org.apache.spark.util.HadoopFSUtils$.$anonfun$parallelListLeafFilesInternal$4(HadoopFSUtils.scala:126)
>   at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:863)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:863)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:131)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: org.apache.hadoop.fs.s3a.CredentialInitializationException: 
> Provider TemporaryAWSCredentialsProvider has no credentials
>   at 
> org.apache.hadoop.fs.s3a.auth.AbstractSessionCredentialsProvider.getCredentials(AbstractSessionCredentialsProvider.java:130)
>   at 
> org.apache.hadoop.fs.s3a.AWSCredentialProviderList.getCredentials(AWSCredentialProviderList.java:177)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.getCredentialsFromContext(AmazonHttpClient.java:1266)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.runBeforeRequestHandlers(AmazonHttpClient.java:842)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:792)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestE

[jira] [Updated] (SPARK-40230) Executor connection issue in hybrid cloud deployment

2022-08-26 Thread Gleb Abroskin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gleb Abroskin updated SPARK-40230:
--
Environment: 
About the k8s setup:
 * 6+ nodes in AWS
 * 4 nodes in DC

Spark 3.2.1 + spark-hadoop-cloud 3.2.1
{code:java}
JAVA_HOME=/Users/gleb.abroskin/Library/Java/JavaVirtualMachines/corretto-11.0.13/Contents/Home spark-submit \
  --master k8s://https://kubemaster:6443 \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:///opt/spark/jars/log4j-trace.properties" \
  --conf spark.submit.deployMode=cluster \
  --conf spark.kubernetes.namespace=ml \
  --conf spark.kubernetes.container.image=SPARK_WITH_TRACE_LOGS_BAKED_IN \
  --conf spark.kubernetes.container.image.pullSecrets=aws-ecr \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-submitter \
  --conf spark.kubernetes.authenticate.submission.oauthToken=XXX \
  --conf spark.kubernetes.executor.podNamePrefix=ds-224-executor-test \
  --conf spark.kubernetes.file.upload.path=s3a://ifunny-ml-data/dev/spark \
  --conf "spark.hadoop.fs.s3a.access.key=XXX" \
  --conf "spark.hadoop.fs.s3a.secret.key=XXX" \
  --conf spark.hadoop.fs.s3a.endpoint=s3.us-east-1.amazonaws.com \
  --conf "spark.kubernetes.driverEnv.AWS_ACCESS_KEY_ID=XXX" \
  --conf "spark.kubernetes.driverEnv.AWS_SECRET_ACCESS_KEY=XXX" \
  --conf spark.sql.shuffle.partitions=500 \
  --num-executors 100 \
  --driver-java-options="-Dlog4j.configuration=file:///opt/spark/jars/log4j-trace.properties" \
  --name k8s-pyspark-test \
  main.py{code}
main.py is just pi.py from the examples, modified to work on 100 machines (this 
is a way to make sure executors are deployed in both AWS & DC)
{code:java}
import sys
from random import random
from operator import add
from pyspark.sql import SparkSession


if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("PythonPi") \
        .getOrCreate()

    spark.sparkContext.setLogLevel("TRACE")

    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 100
    n = 1000 * partitions

    def f(_: int) -> float:
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 <= 1 else 0

    count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n)) {code}

  was:
About the k8s setup:
 * 6+ nodes in AWS
 * 4 nodes in DC

Spark 3.2.1 + spark-hadoop-cloud 3.2.1
{code:java}
JAVA_HOME=/Users/gleb.abroskin/Library/Java/JavaVirtualMachines/corretto-11.0.13/Contents/Home spark-submit \
  --master k8s://https://ifunny-ml-kubemaster.ash1.fun.co:6443 \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:///opt/spark/jars/log4j-trace.properties" \
  --conf spark.submit.deployMode=cluster \
  --conf spark.kubernetes.namespace=ml \
  --conf spark.kubernetes.container.image=SPARK_WITH_TRACE_LOGS_BAKED_IN \
  --conf spark.kubernetes.container.image.pullSecrets=aws-ecr \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-submitter \
  --conf spark.kubernetes.authenticate.submission.oauthToken=XXX \
  --conf spark.kubernetes.executor.podNamePrefix=ds-224-executor-test \
  --conf spark.kubernetes.file.upload.path=s3a://ifunny-ml-data/dev/spark \
  --conf "spark.hadoop.fs.s3a.access.key=XXX" \
  --conf "spark.hadoop.fs.s3a.secret.key=XXX" \
  --conf spark.hadoop.fs.s3a.endpoint=s3.us-east-1.amazonaws.com \
  --conf "spark.kubernetes.driverEnv.AWS_ACCESS_KEY_ID=XXX" \
  --conf "spark.kubernetes.driverEnv.AWS_SECRET_ACCESS_KEY=XXX" \
  --conf spark.sql.shuffle.partitions=500 \
  --num-executors 100 \
  --driver-java-options="-Dlog4j.configuration=file:///opt/spark/jars/log4j-trace.properties" \
  --name k8s-pyspark-test \
  main.py{code}
main.py is just pi.py from the examples, modified to work on 100 machines (this 
is a way to make sure executors are deployed in both AWS & DC)
{code:java}
import sys
from random import random
from operator import add
from pyspark.sql import SparkSession


if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("PythonPi") \
        .getOrCreate()

    spark.sparkContext.setLogLevel("TRACE")

    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 100
    n = 1000 * partitions

    def f(_: int) -> float:
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 <= 1 else 0

    count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n)) {code}


> Executor connection issue in hybrid cloud deployment
> 
>
> Key: SPARK-40230
> URL: https://issues.apache.org/jira/browse/SPARK-40230
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Kubernetes
>   

[jira] [Created] (SPARK-40230) Executor connection issue in hybrid cloud deployment

2022-08-26 Thread Gleb Abroskin (Jira)
Gleb Abroskin created SPARK-40230:
-

 Summary: Executor connection issue in hybrid cloud deployment
 Key: SPARK-40230
 URL: https://issues.apache.org/jira/browse/SPARK-40230
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Kubernetes
Affects Versions: 3.2.1
 Environment: About the k8s setup:
 * 6+ nodes in AWS
 * 4 nodes in DC

Spark 3.2.1 + spark-hadoop-cloud 3.2.1
{code:java}
JAVA_HOME=/Users/gleb.abroskin/Library/Java/JavaVirtualMachines/corretto-11.0.13/Contents/Home spark-submit \
  --master k8s://https://ifunny-ml-kubemaster.ash1.fun.co:6443 \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:///opt/spark/jars/log4j-trace.properties" \
  --conf spark.submit.deployMode=cluster \
  --conf spark.kubernetes.namespace=ml \
  --conf spark.kubernetes.container.image=SPARK_WITH_TRACE_LOGS_BAKED_IN \
  --conf spark.kubernetes.container.image.pullSecrets=aws-ecr \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-submitter \
  --conf spark.kubernetes.authenticate.submission.oauthToken=XXX \
  --conf spark.kubernetes.executor.podNamePrefix=ds-224-executor-test \
  --conf spark.kubernetes.file.upload.path=s3a://ifunny-ml-data/dev/spark \
  --conf "spark.hadoop.fs.s3a.access.key=XXX" \
  --conf "spark.hadoop.fs.s3a.secret.key=XXX" \
  --conf spark.hadoop.fs.s3a.endpoint=s3.us-east-1.amazonaws.com \
  --conf "spark.kubernetes.driverEnv.AWS_ACCESS_KEY_ID=XXX" \
  --conf "spark.kubernetes.driverEnv.AWS_SECRET_ACCESS_KEY=XXX" \
  --conf spark.sql.shuffle.partitions=500 \
  --num-executors 100 \
  --driver-java-options="-Dlog4j.configuration=file:///opt/spark/jars/log4j-trace.properties" \
  --name k8s-pyspark-test \
  main.py{code}
main.py is just pi.py from the examples, modified to work on 100 machines (this 
is a way to make sure executors are deployed in both AWS & DC)
{code:java}
import sys
from random import random
from operator import add
from pyspark.sql import SparkSession


if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("PythonPi") \
        .getOrCreate()

    spark.sparkContext.setLogLevel("TRACE")

    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 100
    n = 1000 * partitions

    def f(_: int) -> float:
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 <= 1 else 0

    count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n)) {code}
Reporter: Gleb Abroskin


I understand that the issue is quite subtle and might be hard to debug; still, I 
was not able to find an issue with our infra, so I guess it is something inside 
Spark.

We deploy a Spark application on k8s and everything works well if all the driver 
& executor pods are either in AWS or in our DC, but when they are split between 
datacenters something strange happens. For example, here are the logs of one of 
the executors inside the DC:
{code:java}
22/08/26 07:55:35 INFO TransportClientFactory: Successfully created connection 
to /172.19.149.92:39414 after 50 ms (1 ms spent in bootstraps)
22/08/26 07:55:35 TRACE TransportClient: Sending RPC to /172.19.149.92:39414
22/08/26 07:55:35 TRACE TransportClient: Sending request RPC 
4860401977118244334 to /172.19.149.92:39414 took 3 ms
22/08/26 07:55:35 DEBUG TransportClient: Sending fetch chunk request 0 to 
/172.19.149.92:39414
22/08/26 07:55:35 TRACE TransportClient: Sending request 
StreamChunkId[streamId=1644979023003,chunkIndex=0] to /172.19.149.92:39414 took 
0 ms
22/08/26 07:57:35 ERROR TransportChannelHandler: Connection to 
/172.19.149.92:39414 has been quiet for 12 ms while there are outstanding 
requests. Assuming connection is dead; please adjust 
spark.shuffle.io.connectionTimeout if this is wrong. {code}
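As an aside, the warning above already names the relevant knob. A minimal, hedged 
sketch of raising it when building the session follows; this only widens the 
timeout and is not a fix for whatever is dropping the replies between the 
datacenters, and the 600s value is an arbitrary example:
{code:scala}
import org.apache.spark.sql.SparkSession

// Both settings accept duration strings; spark.shuffle.io.connectionTimeout
// falls back to spark.network.timeout when it is not set explicitly.
val spark = SparkSession.builder()
  .appName("PythonPi")
  .config("spark.network.timeout", "600s")
  .config("spark.shuffle.io.connectionTimeout", "600s")
  .getOrCreate()
{code}
The same values can of course be passed as --conf flags to the spark-submit 
command shown in the environment section.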
The executor successfully creates the connection & sends the request, but the 
connection is then assumed dead. Even stranger, the executor at IP 172.19.149.92 
has sent the response back, which I can confirm with the following logs:
{code:java}
22/08/26 07:55:35 TRACE MessageDecoder: Received message ChunkFetchRequest: 
ChunkFetchRequest[streamChunkId=StreamChunkId[streamId=1644979023003,chunkIndex=0]]
22/08/26 07:55:35 TRACE ChunkFetchRequestHandler: Received req from 
/172.19.123.197:37626 to fetch block 
StreamChunkId[streamId=1644979023003,chunkIndex=0]
22/08/26 07:55:35 TRACE OneForOneStreamManager: Removing stream id 1644979023003
22/08/26 07:55:35 TRACE BlockInfoManager: Task -1024 releasing lock for 
broadcast_0_piece0
--
22/08/26 07:55:35 TRACE BlockInfoManager: Task -1024 releasing lock for 
broadcast_0_piece0
22/08/26 07:55:35 TRACE ChunkFetchRequestHandler: Sent result 
ChunkFetchSuccess[streamChunkId=StreamChunkId[streamId=1644979023003,chunkIndex=0],buffer=org.apache.spark.storage.BlockManagerManagedBuffer@79b43e2a]
 to client /

[jira] [Resolved] (SPARK-38749) Test the error class: RENAME_SRC_PATH_NOT_FOUND

2022-08-26 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-38749.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37611
[https://github.com/apache/spark/pull/37611]

> Test the error class: RENAME_SRC_PATH_NOT_FOUND
> ---
>
> Key: SPARK-38749
> URL: https://issues.apache.org/jira/browse/SPARK-38749
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Minor
>  Labels: starter
> Fix For: 3.4.0
>
>
> Add a test for the error class *RENAME_SRC_PATH_NOT_FOUND* to 
> QueryExecutionErrorsSuite. The test should cover the exception thrown in 
> QueryExecutionErrors:
> {code:scala}
>   def renameSrcPathNotFoundError(srcPath: Path): Throwable = {
> new SparkFileNotFoundException(errorClass = "RENAME_SRC_PATH_NOT_FOUND",
>   Array(srcPath.toString))
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must have a check of:+
> # the entire error message
> # sqlState if it is defined in the error-classes.json file
> # the error class
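
A minimal sketch of what such a test could look like inside 
QueryExecutionErrorsSuite. This is hedged: it only exercises the error builder 
named above directly rather than provoking a real failed rename, and it assumes 
the suite's usual test helpers plus org.apache.hadoop.fs.Path on the classpath; 
the path literal is purely illustrative.
{code:scala}
  test("RENAME_SRC_PATH_NOT_FOUND: error class, message and source path") {
    // Hypothetical source path used only for illustration.
    val srcPath = new Path("/tmp/rename-src-does-not-exist")
    val e = intercept[SparkFileNotFoundException] {
      throw QueryExecutionErrors.renameSrcPathNotFoundError(srcPath)
    }
    // Check the error class and that the message carries the source path.
    assert(e.getErrorClass === "RENAME_SRC_PATH_NOT_FOUND")
    assert(e.getMessage.contains(srcPath.toString))
  }
{code}
Checking sqlState as well, as the description asks, would follow the same 
pattern via e.getSqlState if the class defines one in error-classes.json.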



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40228) Don't simplify multiLike if child is not attribute

2022-08-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40228:


Assignee: (was: Apache Spark)

> Don't simplify multiLike if child is not attribute
> --
>
> Key: SPARK-40228
> URL: https://issues.apache.org/jira/browse/SPARK-40228
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:scala}
> sql("create table t1(name string) using parquet")
> sql("select * from t1 where substr(name, 1, 5) like any('%a', 'b%', 
> '%c%')").explain(true)
> {code}
> {noformat}
> == Physical Plan ==
> *(1) Filter ((EndsWith(substr(name#0, 1, 5), a) OR StartsWith(substr(name#0, 
> 1, 5), b)) OR Contains(substr(name#0, 1, 5), c))
> +- *(1) ColumnarToRow
>+- FileScan parquet default.t1[name#0] Batched: true, DataFilters: 
> [((EndsWith(substr(name#0, 1, 5), a) OR StartsWith(substr(name#0, 1, 5), b)) 
> OR Contains(substr(n..., Format: Parquet, PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40228) Don't simplify multiLike if child is not attribute

2022-08-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40228:


Assignee: Apache Spark

> Don't simplify multiLike if child is not attribute
> --
>
> Key: SPARK-40228
> URL: https://issues.apache.org/jira/browse/SPARK-40228
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> {code:scala}
> sql("create table t1(name string) using parquet")
> sql("select * from t1 where substr(name, 1, 5) like any('%a', 'b%', 
> '%c%')").explain(true)
> {code}
> {noformat}
> == Physical Plan ==
> *(1) Filter ((EndsWith(substr(name#0, 1, 5), a) OR StartsWith(substr(name#0, 
> 1, 5), b)) OR Contains(substr(name#0, 1, 5), c))
> +- *(1) ColumnarToRow
>+- FileScan parquet default.t1[name#0] Batched: true, DataFilters: 
> [((EndsWith(substr(name#0, 1, 5), a) OR StartsWith(substr(name#0, 1, 5), b)) 
> OR Contains(substr(n..., Format: Parquet, PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40228) Don't simplify multiLike if child is not attribute

2022-08-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585217#comment-17585217
 ] 

Apache Spark commented on SPARK-40228:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/37672

> Don't simplify multiLike if child is not attribute
> --
>
> Key: SPARK-40228
> URL: https://issues.apache.org/jira/browse/SPARK-40228
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:scala}
> sql("create table t1(name string) using parquet")
> sql("select * from t1 where substr(name, 1, 5) like any('%a', 'b%', 
> '%c%')").explain(true)
> {code}
> {noformat}
> == Physical Plan ==
> *(1) Filter ((EndsWith(substr(name#0, 1, 5), a) OR StartsWith(substr(name#0, 
> 1, 5), b)) OR Contains(substr(name#0, 1, 5), c))
> +- *(1) ColumnarToRow
>+- FileScan parquet default.t1[name#0] Batched: true, DataFilters: 
> [((EndsWith(substr(name#0, 1, 5), a) OR StartsWith(substr(name#0, 1, 5), b)) 
> OR Contains(substr(n..., Format: Parquet, PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40229) Re-enable excel I/O test for pandas API on Spark.

2022-08-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585210#comment-17585210
 ] 

Apache Spark commented on SPARK-40229:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/37671

> Re-enable excel I/O test for pandas API on Spark.
> -
>
> Key: SPARK-40229
> URL: https://issues.apache.org/jira/browse/SPARK-40229
> Project: Spark
>  Issue Type: Test
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Currently we're skipping the `read_excel` and `to_excel` tests for pandas API 
> on Spark, since the `openpyxl` is not installed in our test environments.
> We should enable this test for better test coverage.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40229) Re-enable excel I/O test for pandas API on Spark.

2022-08-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40229:


Assignee: Apache Spark

> Re-enable excel I/O test for pandas API on Spark.
> -
>
> Key: SPARK-40229
> URL: https://issues.apache.org/jira/browse/SPARK-40229
> Project: Spark
>  Issue Type: Test
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Apache Spark
>Priority: Major
>
> Currently we're skipping the `read_excel` and `to_excel` tests for pandas API 
> on Spark, since the `openpyxl` is not installed in our test environments.
> We should enable this test for better test coverage.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40229) Re-enable excel I/O test for pandas API on Spark.

2022-08-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40229:


Assignee: (was: Apache Spark)

> Re-enable excel I/O test for pandas API on Spark.
> -
>
> Key: SPARK-40229
> URL: https://issues.apache.org/jira/browse/SPARK-40229
> Project: Spark
>  Issue Type: Test
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Currently we're skipping the `read_excel` and `to_excel` tests for pandas API 
> on Spark, since the `openpyxl` is not installed in our test environments.
> We should enable this test for better test coverage.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40229) Re-enable excel I/O test for pandas API on Spark.

2022-08-26 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-40229:

Description: 
Currently we're skipping the `read_excel` and `to_excel` test for pandas API on 
Spark, since the `openpyxl` is not installed in our test environments.
We should enable this test for better test coverage.

  was:
Currently we're skipping the read_excel test for pandas API on Spark, since the 
`openpyxl` is not installed in our test environments.
We should enable this test for better test coverage.


> Re-enable excel I/O test for pandas API on Spark.
> -
>
> Key: SPARK-40229
> URL: https://issues.apache.org/jira/browse/SPARK-40229
> Project: Spark
>  Issue Type: Test
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Currently we're skipping the `read_excel` and `to_excel` test for pandas API 
> on Spark, since the `openpyxl` is not installed in our test environments.
> We should enable this test for better test coverage.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40229) Re-enable excel I/O test for pandas API on Spark.

2022-08-26 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-40229:

Summary: Re-enable excel I/O test for pandas API on Spark.  (was: Re-enable 
read_excel test for pandas API on Spark.)

> Re-enable excel I/O test for pandas API on Spark.
> -
>
> Key: SPARK-40229
> URL: https://issues.apache.org/jira/browse/SPARK-40229
> Project: Spark
>  Issue Type: Test
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Currently we're skipping the read_excel test for pandas API on Spark, since 
> the `openpyxl` is not installed in our test environments.
> We should enable this test for better test coverage.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40229) Re-enable excel I/O test for pandas API on Spark.

2022-08-26 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-40229:

Description: 
Currently we're skipping the `read_excel` and `to_excel` tests for pandas API 
on Spark, since the `openpyxl` is not installed in our test environments.
We should enable this test for better test coverage.

  was:
Currently we're skipping the `read_excel` and `to_excel` test for pandas API on 
Spark, since the `openpyxl` is not installed in our test environments.
We should enable this test for better test coverage.


> Re-enable excel I/O test for pandas API on Spark.
> -
>
> Key: SPARK-40229
> URL: https://issues.apache.org/jira/browse/SPARK-40229
> Project: Spark
>  Issue Type: Test
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Currently we're skipping the `read_excel` and `to_excel` tests for pandas API 
> on Spark, since the `openpyxl` is not installed in our test environments.
> We should enable this test for better test coverage.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40229) Re-enable read_excel test for pandas API on Spark.

2022-08-26 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-40229:
---

 Summary: Re-enable read_excel test for pandas API on Spark.
 Key: SPARK-40229
 URL: https://issues.apache.org/jira/browse/SPARK-40229
 Project: Spark
  Issue Type: Test
  Components: Pandas API on Spark
Affects Versions: 3.4.0
Reporter: Haejoon Lee


Currently we're skipping the read_excel test for pandas API on Spark, since the 
`openpyxl` is not installed in our test environments.
We should enable this test for better test coverage.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40228) Don't simplify multiLike if child is not attribute

2022-08-26 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-40228:
---

 Summary: Don't simplify multiLike if child is not attribute
 Key: SPARK-40228
 URL: https://issues.apache.org/jira/browse/SPARK-40228
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Yuming Wang


{code:scala}
sql("create table t1(name string) using parquet")
sql("select * from t1 where substr(name, 1, 5) like any('%a', 'b%', 
'%c%')").explain(true)
{code}

{noformat}
== Physical Plan ==
*(1) Filter ((EndsWith(substr(name#0, 1, 5), a) OR StartsWith(substr(name#0, 1, 
5), b)) OR Contains(substr(name#0, 1, 5), c))
+- *(1) ColumnarToRow
   +- FileScan parquet default.t1[name#0] Batched: true, DataFilters: 
[((EndsWith(substr(name#0, 1, 5), a) OR StartsWith(substr(name#0, 1, 5), b)) OR 
Contains(substr(n..., Format: Parquet, PartitionFilters: [], PushedFilters: [], 
ReadSchema: struct

{noformat}
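
For contrast, a hedged sketch against the same table where the LIKE ANY child is 
a plain attribute rather than an expression:
{code:scala}
// Assumes the t1 table created in the example above.
// Here the child of LIKE ANY is the attribute `name`, not substr(name, 1, 5),
// so the rewrite into StartsWith/EndsWith/Contains only references the column
// itself and does not re-evaluate a nested expression once per pattern.
sql("select * from t1 where name like any('%a', 'b%', '%c%')").explain(true)
{code}
The concern in this ticket appears to be the non-attribute case shown above, 
where each simplified predicate repeats the substr(name, 1, 5) call.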





--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40197) Replace query plan with context for MULTI_VALUE_SUBQUERY_ERROR

2022-08-26 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-40197.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37632
[https://github.com/apache/spark/pull/37632]

> Replace query plan with context for MULTI_VALUE_SUBQUERY_ERROR
> --
>
> Key: SPARK-40197
> URL: https://issues.apache.org/jira/browse/SPARK-40197
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Vitalii Li
>Assignee: Vitalii Li
>Priority: Major
> Fix For: 3.4.0
>
>
> Instead of a query plan - output subquery context.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40197) Replace query plan with context for MULTI_VALUE_SUBQUERY_ERROR

2022-08-26 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-40197:


Assignee: Vitalii Li

> Replace query plan with context for MULTI_VALUE_SUBQUERY_ERROR
> --
>
> Key: SPARK-40197
> URL: https://issues.apache.org/jira/browse/SPARK-40197
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Vitalii Li
>Assignee: Vitalii Li
>Priority: Major
>
> Instead of a query plan - output subquery context.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40220) Don't output the empty map of error message parameters

2022-08-26 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-40220.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37660
[https://github.com/apache/spark/pull/37660]

> Don't output the empty map of error message parameters
> --
>
> Key: SPARK-40220
> URL: https://issues.apache.org/jira/browse/SPARK-40220
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0
>
>
> In the current implementation, Spark outputs the empty map of message 
> parameters in the MINIMAL and STANDARD formats:
> {code:json}
>  org.apache.spark.SparkRuntimeException
>  {
>   "errorClass" : "ELEMENT_AT_BY_INDEX_ZERO",
>   "messageParameters" : { }
>  }
> {code}
> which contradicts the approach for other JSON fields.
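
A hedged sketch of one way to hit this particular error class from the SQL side 
and inspect it, assuming element_at with index 0 still maps to 
ELEMENT_AT_BY_INDEX_ZERO as the quoted JSON suggests:
{code:scala}
import org.apache.spark.SparkRuntimeException

try {
  spark.sql("SELECT element_at(array(1, 2), 0)").collect()
} catch {
  case e: SparkRuntimeException =>
    // An error class without message parameters; after this change the MINIMAL
    // and STANDARD JSON renderings should simply omit "messageParameters".
    println(e.getErrorClass)
}
{code}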



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40218) GROUPING SETS should preserve the grouping columns

2022-08-26 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-40218.
-
Fix Version/s: 3.3.1
   3.2.3
   3.4.0
   Resolution: Fixed

Issue resolved by pull request 37655
[https://github.com/apache/spark/pull/37655]

> GROUPING SETS should preserve the grouping columns
> --
>
> Key: SPARK-40218
> URL: https://issues.apache.org/jira/browse/SPARK-40218
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.3.1, 3.2.3, 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40221) Not able to format using scalafmt

2022-08-26 Thread Ziqi Liu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585191#comment-17585191
 ] 

Ziqi Liu commented on SPARK-40221:
--

[~hyukjin.kwon] I think running it on master should always be fine, since it 
only formats the git diff from master, so running on master actually didn't 
format anything, if I understand the [dev tools 
guidance|https://spark.apache.org/developer-tools.html] correctly...

The issue happened in another branch, but a very new branch 
(https://github.com/apache/spark/pull/37661), and I believe I didn't modify 
anything related to scalafmt.

It tried to format the files that changed since master, and then the error 
happened:

 
{code:java}
[INFO] Checking for files changed from master
[INFO] Changed from master:
/Users/ziqi.liu/code/spark/core/src/main/scala/org/apache/spark/internal/config/package.scala
/Users/ziqi.liu/code/spark/core/src/main/scala/org/apache/spark/rdd/AsyncRDDActions.scala
/Users/ziqi.liu/code/spark/core/src/main/scala/org/apache/spark/rdd/RDD.scala
/Users/ziqi.liu/code/spark/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala
/Users/ziqi.liu/code/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
/Users/ziqi.liu/code/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala
/Users/ziqi.liu/code/spark/sql/core/src/test/scala/org/apache/spark/sql/ConfigBehaviorSuite.scala
[ERROR]
org.scalafmt.dynamic.exceptions.ScalafmtException: missing setting 'version'. 
To fix this problem, add the following line to .scalafmt.conf: 'version=3.2.1'. 
{code}
 

 

> Not able to format using scalafmt
> -
>
> Key: SPARK-40221
> URL: https://issues.apache.org/jira/browse/SPARK-40221
> Project: Spark
>  Issue Type: Question
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Ziqi Liu
>Priority: Major
>
> I'm following the guidance in [https://spark.apache.org/developer-tools.html] 
> using 
> {code:java}
> ./dev/scalafmt{code}
> to format the code, but getting this error:
> {code:java}
> [ERROR] Failed to execute goal 
> org.antipathy:mvn-scalafmt_2.12:1.1.1640084764.9f463a9:format (default-cli) 
> on project spark-parent_2.12: Error formatting Scala files: missing setting 
> 'version'. To fix this problem, add the following line to .scalafmt.conf: 
> 'version=3.2.1'. -> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
> switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions, please 
> read the following articles:
> [ERROR] [Help 1] 
> http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40215) Add SQL configs to control CSV/JSON date and timestamp parsing behaviour

2022-08-26 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-40215:


Assignee: Ivan Sadikov

> Add SQL configs to control CSV/JSON date and timestamp parsing behaviour
> 
>
> Key: SPARK-40215
> URL: https://issues.apache.org/jira/browse/SPARK-40215
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Ivan Sadikov
>Assignee: Ivan Sadikov
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40215) Add SQL configs to control CSV/JSON date and timestamp parsing behaviour

2022-08-26 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-40215.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37653
[https://github.com/apache/spark/pull/37653]

> Add SQL configs to control CSV/JSON date and timestamp parsing behaviour
> 
>
> Key: SPARK-40215
> URL: https://issues.apache.org/jira/browse/SPARK-40215
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Ivan Sadikov
>Assignee: Ivan Sadikov
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40227) Data Source V2: Support creating table with the duplicate transform with different arguments

2022-08-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585186#comment-17585186
 ] 

Apache Spark commented on SPARK-40227:
--

User 'ConeyLiu' has created a pull request for this issue:
https://github.com/apache/spark/pull/37670

> Data Source V2: Support creating table with the duplicate transform with 
> different arguments
> 
>
> Key: SPARK-40227
> URL: https://issues.apache.org/jira/browse/SPARK-40227
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.2
>Reporter: Xianyang Liu
>Priority: Major
>
> This patch adds support for creating a table with duplicate transforms that 
> have different arguments. For example, the following statement would fail 
> before:
> {code:sql}
> CREATE TABLE testcat.t (id int, `a.b` string) USING foo PARTITIONED BY 
> (bucket(2, id), bucket(4, id))
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40227) Data Source V2: Support creating table with the duplicate transform with different arguments

2022-08-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40227:


Assignee: (was: Apache Spark)

> Data Source V2: Support creating table with the duplicate transform with 
> different arguments
> 
>
> Key: SPARK-40227
> URL: https://issues.apache.org/jira/browse/SPARK-40227
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.2
>Reporter: Xianyang Liu
>Priority: Major
>
> This patch adds support for creating a table with duplicate transforms that 
> have different arguments. For example, the following statement would fail 
> before:
> {code:sql}
> CREATE TABLE testcat.t (id int, `a.b` string) USING foo PARTITIONED BY 
> (bucket(2, id), bucket(4, id))
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40227) Data Source V2: Support creating table with the duplicate transform with different arguments

2022-08-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40227:


Assignee: Apache Spark

> Data Source V2: Support creating table with the duplicate transform with 
> different arguments
> 
>
> Key: SPARK-40227
> URL: https://issues.apache.org/jira/browse/SPARK-40227
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.2
>Reporter: Xianyang Liu
>Assignee: Apache Spark
>Priority: Major
>
> This patch adds support for creating a table with duplicate transforms that 
> have different arguments. For example, the following statement would fail 
> before:
> {code:sql}
> CREATE TABLE testcat.t (id int, `a.b` string) USING foo PARTITIONED BY 
> (bucket(2, id), bucket(4, id))
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40227) Data Source V2: Support creating table with the duplicate transform with different arguments

2022-08-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585185#comment-17585185
 ] 

Apache Spark commented on SPARK-40227:
--

User 'ConeyLiu' has created a pull request for this issue:
https://github.com/apache/spark/pull/37670

> Data Source V2: Support creating table with the duplicate transform with 
> different arguments
> 
>
> Key: SPARK-40227
> URL: https://issues.apache.org/jira/browse/SPARK-40227
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.2
>Reporter: Xianyang Liu
>Priority: Major
>
> This patch adds support for creating a table with duplicate transforms that 
> have different arguments. For example, the following statement would fail 
> before:
> {code:sql}
> CREATE TABLE testcat.t (id int, `a.b` string) USING foo PARTITIONED BY 
> (bucket(2, id), bucket(4, id))
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40227) Data Source V2: Support creating table with the duplicate transform with different arguments

2022-08-26 Thread Xianyang Liu (Jira)
Xianyang Liu created SPARK-40227:


 Summary: Data Source V2: Support creating table with the duplicate 
transform with different arguments
 Key: SPARK-40227
 URL: https://issues.apache.org/jira/browse/SPARK-40227
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.2
Reporter: Xianyang Liu


This patch adds support for creating a table with duplicate transforms that have 
different arguments. For example, the following statement would fail before:
{code:sql}
CREATE TABLE testcat.t (id int, `a.b` string) USING foo PARTITIONED BY 
(bucket(2, id), bucket(4, id))
{code}
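
Presumably the same restriction applies when the partitioning transforms are 
supplied through the DataFrameWriterV2 API; a hedged sketch of the equivalent 
call, assuming a catalog registered as testcat and a source DataFrame df with an 
id column:
{code:scala}
import org.apache.spark.sql.functions.bucket

// Two bucket transforms over the same column with different bucket counts,
// mirroring the PARTITIONED BY clause in the SQL above.
df.writeTo("testcat.t")
  .using("foo")
  .partitionedBy(bucket(2, df("id")), bucket(4, df("id")))
  .create()
{code}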




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org