[jira] [Commented] (SPARK-40241) Correct the link of GenericUDTF
[ https://issues.apache.org/jira/browse/SPARK-40241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585669#comment-17585669 ] Apache Spark commented on SPARK-40241: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/37685 > Correct the link of GenericUDTF > --- > > Key: SPARK-40241 > URL: https://issues.apache.org/jira/browse/SPARK-40241 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Trivial > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40241) Correct the link of GenericUDTF
[ https://issues.apache.org/jira/browse/SPARK-40241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40241: Assignee: (was: Apache Spark) > Correct the link of GenericUDTF > --- > > Key: SPARK-40241 > URL: https://issues.apache.org/jira/browse/SPARK-40241 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Trivial > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40241) Correct the link of GenericUDTF
[ https://issues.apache.org/jira/browse/SPARK-40241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40241: Assignee: Apache Spark > Correct the link of GenericUDTF > --- > > Key: SPARK-40241 > URL: https://issues.apache.org/jira/browse/SPARK-40241 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Trivial > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40241) Correct the link of GenericUDTF
[ https://issues.apache.org/jira/browse/SPARK-40241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585668#comment-17585668 ] Apache Spark commented on SPARK-40241: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/37685 > Correct the link of GenericUDTF > --- > > Key: SPARK-40241 > URL: https://issues.apache.org/jira/browse/SPARK-40241 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Trivial > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40241) Correct the link of GenericUDTF
Ruifeng Zheng created SPARK-40241: - Summary: Correct the link of GenericUDTF Key: SPARK-40241 URL: https://issues.apache.org/jira/browse/SPARK-40241 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 3.4.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40039) Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface
[ https://issues.apache.org/jira/browse/SPARK-40039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40039: Assignee: (was: Apache Spark) > Introducing a streaming checkpoint file manager based on Hadoop's Abortable > interface > - > > Key: SPARK-40039 > URL: https://issues.apache.org/jira/browse/SPARK-40039 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Attila Zsolt Piros >Priority: Major > > Currently on S3 the checkpoint file manager (called > FileContextBasedCheckpointFileManager) is based on rename: when a file is > opened for an atomic stream, a temporary file is used instead, and when the > stream is committed the file is renamed. > But on S3 a rename is a file copy, so it has serious performance > implications. > On Hadoop 3, however, there is a new interface called *Abortable*, and > *S3AFileSystem* has this capability, implemented on top of S3's > multipart upload: when the file is committed a POST is sent > ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html]) > and when aborted a DELETE is sent > ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_AbortMultipartUpload.html]) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40039) Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface
[ https://issues.apache.org/jira/browse/SPARK-40039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40039: Assignee: Apache Spark > Introducing a streaming checkpoint file manager based on Hadoop's Abortable > interface > - > > Key: SPARK-40039 > URL: https://issues.apache.org/jira/browse/SPARK-40039 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Attila Zsolt Piros >Assignee: Apache Spark >Priority: Major > > Currently on S3 the checkpoint file manager (called > FileContextBasedCheckpointFileManager) is based on rename: when a file is > opened for an atomic stream, a temporary file is used instead, and when the > stream is committed the file is renamed. > But on S3 a rename is a file copy, so it has serious performance > implications. > On Hadoop 3, however, there is a new interface called *Abortable*, and > *S3AFileSystem* has this capability, implemented on top of S3's > multipart upload: when the file is committed a POST is sent > ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html]) > and when aborted a DELETE is sent > ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_AbortMultipartUpload.html]) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
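The rename-based commit scheme that SPARK-40039 wants to avoid on S3 can be sketched as follows. This is a conceptual illustration in Python using local files, not Spark's actual CheckpointFileManager API; the class and method names are hypothetical. The point is the protocol: data goes to a temporary file, commit() renames it into place (cheap on HDFS, a full object copy on S3), abort() deletes it — whereas an Abortable multipart upload commits with a single POST and aborts with a DELETE.

```python
import os


class RenameBasedAtomicStream:
    """Hypothetical sketch of a rename-based 'atomic' write:
    bytes go to <path>.tmp, commit() renames the temp file into
    place, abort() deletes it. On S3 the rename step is really a
    full copy, which is the cost an Abortable/multipart-upload
    based manager avoids."""

    def __init__(self, final_path):
        self.final_path = final_path
        self.tmp_path = final_path + ".tmp"
        self._f = open(self.tmp_path, "wb")

    def write(self, data):
        self._f.write(data)

    def commit(self):
        # Atomic on a POSIX filesystem; a copy + delete on S3.
        self._f.close()
        os.rename(self.tmp_path, self.final_path)

    def abort(self):
        # Nothing ever becomes visible at final_path.
        self._f.close()
        os.remove(self.tmp_path)
```

An aborted stream leaves no file behind at the final path, which is the same observable guarantee the multipart-upload DELETE gives.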
[jira] [Commented] (SPARK-40152) Codegen compilation error when using split_part
[ https://issues.apache.org/jira/browse/SPARK-40152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585665#comment-17585665 ] Apache Spark commented on SPARK-40152: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/37684 > Codegen compilation error when using split_part > --- > > Key: SPARK-40152 > URL: https://issues.apache.org/jira/browse/SPARK-40152 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Bruce Robbins >Assignee: Yuming Wang >Priority: Major > Fix For: 3.4.0, 3.3.1 > > > The following query throws an error: > {noformat} > create or replace temp view v1 as > select * from values > ('11.12.13', '.', 3) > as v1(col1, col2, col3); > cache table v1; > SELECT split_part(col1, col2, col3) > from v1; > {noformat} > The error is: > {noformat} > 22/08/19 14:25:14 ERROR CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 42, Column 1: Expression "project_isNull_0 = false" is not a type > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 42, Column 1: Expression "project_isNull_0 = false" is not a type > at > org.codehaus.janino.Java$Atom.toTypeOrCompileException(Java.java:3934) > at org.codehaus.janino.Parser.parseBlockStatement(Parser.java:1887) > at org.codehaus.janino.Parser.parseBlockStatements(Parser.java:1811) > at org.codehaus.janino.Parser.parseBlock(Parser.java:1792) > at > {noformat} > In the end, {{split_part}} does successfully execute, although in interpreted > mode. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40152) Codegen compilation error when using split_part
[ https://issues.apache.org/jira/browse/SPARK-40152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585666#comment-17585666 ] Apache Spark commented on SPARK-40152: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/37684 > Codegen compilation error when using split_part > --- > > Key: SPARK-40152 > URL: https://issues.apache.org/jira/browse/SPARK-40152 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Bruce Robbins >Assignee: Yuming Wang >Priority: Major > Fix For: 3.4.0, 3.3.1 > > > The following query throws an error: > {noformat} > create or replace temp view v1 as > select * from values > ('11.12.13', '.', 3) > as v1(col1, col2, col3); > cache table v1; > SELECT split_part(col1, col2, col3) > from v1; > {noformat} > The error is: > {noformat} > 22/08/19 14:25:14 ERROR CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 42, Column 1: Expression "project_isNull_0 = false" is not a type > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 42, Column 1: Expression "project_isNull_0 = false" is not a type > at > org.codehaus.janino.Java$Atom.toTypeOrCompileException(Java.java:3934) > at org.codehaus.janino.Parser.parseBlockStatement(Parser.java:1887) > at org.codehaus.janino.Parser.parseBlockStatements(Parser.java:1811) > at org.codehaus.janino.Parser.parseBlock(Parser.java:1792) > at > {noformat} > In the end, {{split_part}} does successfully execute, although in interpreted > mode. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40156) url_decode() exposes a Java error
[ https://issues.apache.org/jira/browse/SPARK-40156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40156: Assignee: Apache Spark > url_decode() exposes a Java error > - > > Key: SPARK-40156 > URL: https://issues.apache.org/jira/browse/SPARK-40156 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Serge Rielau >Assignee: Apache Spark >Priority: Major > > Given a badly encoded string, Spark returns a raw Java error. > It should instead return an ERROR_CLASS. > spark-sql> SELECT url_decode('http%3A%2F%2spark.apache.org'); > 22/08/20 17:17:20 ERROR SparkSQLDriver: Failed in [SELECT > url_decode('http%3A%2F%2spark.apache.org')] > java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in > escape (%) pattern - Error at index 1 in: "2s" > at java.base/java.net.URLDecoder.decode(URLDecoder.java:232) > at java.base/java.net.URLDecoder.decode(URLDecoder.java:142) > at > org.apache.spark.sql.catalyst.expressions.UrlCodec$.decode(urlExpressions.scala:113) > at > org.apache.spark.sql.catalyst.expressions.UrlCodec.decode(urlExpressions.scala) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40156) url_decode() exposes a Java error
[ https://issues.apache.org/jira/browse/SPARK-40156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40156: Assignee: (was: Apache Spark) > url_decode() exposes a Java error > - > > Key: SPARK-40156 > URL: https://issues.apache.org/jira/browse/SPARK-40156 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Serge Rielau >Priority: Major > > Given a badly encoded string, Spark returns a raw Java error. > It should instead return an ERROR_CLASS. > spark-sql> SELECT url_decode('http%3A%2F%2spark.apache.org'); > 22/08/20 17:17:20 ERROR SparkSQLDriver: Failed in [SELECT > url_decode('http%3A%2F%2spark.apache.org')] > java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in > escape (%) pattern - Error at index 1 in: "2s" > at java.base/java.net.URLDecoder.decode(URLDecoder.java:232) > at java.base/java.net.URLDecoder.decode(URLDecoder.java:142) > at > org.apache.spark.sql.catalyst.expressions.UrlCodec$.decode(urlExpressions.scala:113) > at > org.apache.spark.sql.catalyst.expressions.UrlCodec.decode(urlExpressions.scala) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40156) url_decode() exposes a Java error
[ https://issues.apache.org/jira/browse/SPARK-40156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-40156: - Fix Version/s: (was: 3.4.0) > url_decode() exposes a Java error > - > > Key: SPARK-40156 > URL: https://issues.apache.org/jira/browse/SPARK-40156 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Serge Rielau >Priority: Major > > Given a badly encoded string, Spark returns a raw Java error. > It should instead return an ERROR_CLASS. > spark-sql> SELECT url_decode('http%3A%2F%2spark.apache.org'); > 22/08/20 17:17:20 ERROR SparkSQLDriver: Failed in [SELECT > url_decode('http%3A%2F%2spark.apache.org')] > java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in > escape (%) pattern - Error at index 1 in: "2s" > at java.base/java.net.URLDecoder.decode(URLDecoder.java:232) > at java.base/java.net.URLDecoder.decode(URLDecoder.java:142) > at > org.apache.spark.sql.catalyst.expressions.UrlCodec$.decode(urlExpressions.scala:113) > at > org.apache.spark.sql.catalyst.expressions.UrlCodec.decode(urlExpressions.scala) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-40156) url_decode() exposes a Java error
[ https://issues.apache.org/jira/browse/SPARK-40156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-40156: -- Assignee: (was: ming95) Reverted at https://github.com/apache/spark/commit/4a19e05563b655216cf6b551d0ab8a64447cd49a > url_decode() exposes a Java error > - > > Key: SPARK-40156 > URL: https://issues.apache.org/jira/browse/SPARK-40156 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Serge Rielau >Priority: Major > Fix For: 3.4.0 > > > Given a badly encoded string, Spark returns a raw Java error. > It should instead return an ERROR_CLASS. > spark-sql> SELECT url_decode('http%3A%2F%2spark.apache.org'); > 22/08/20 17:17:20 ERROR SparkSQLDriver: Failed in [SELECT > url_decode('http%3A%2F%2spark.apache.org')] > java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in > escape (%) pattern - Error at index 1 in: "2s" > at java.base/java.net.URLDecoder.decode(URLDecoder.java:232) > at java.base/java.net.URLDecoder.decode(URLDecoder.java:142) > at > org.apache.spark.sql.catalyst.expressions.UrlCodec$.decode(urlExpressions.scala:113) > at > org.apache.spark.sql.catalyst.expressions.UrlCodec.decode(urlExpressions.scala) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
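The failure mode in SPARK-40156 — a malformed escape like `%2s` surfacing as a raw java.lang.IllegalArgumentException — can be illustrated in Python. The sketch below is not Spark's UrlCodec and the error text is made up for illustration; it only shows the shape of the proposed fix: validate the percent escapes up front and raise a clean, user-facing error instead of letting the decoder's internal exception leak out.

```python
import re
from urllib.parse import unquote

# A '%' that is not followed by exactly two hex digits is malformed.
_BAD_ESCAPE = re.compile(r"%(?![0-9A-Fa-f]{2})")


def strict_url_decode(s: str) -> str:
    """Decode a percent-encoded string, rejecting malformed escapes
    with a clean error (hypothetical error text, standing in for a
    proper ERROR_CLASS) instead of a raw decoder exception."""
    m = _BAD_ESCAPE.search(s)
    if m is not None:
        raise ValueError(
            f"INVALID_URL: malformed percent escape at index {m.start()} in {s!r}"
        )
    return unquote(s, errors="strict")
```

With the bad input from the ticket, `strict_url_decode('http%3A%2F%2spark.apache.org')` raises the clean ValueError; the correctly encoded `'http%3A%2F%2Fspark.apache.org'` decodes normally.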
[jira] [Updated] (SPARK-40039) Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface
[ https://issues.apache.org/jira/browse/SPARK-40039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-40039: - Fix Version/s: (was: 3.4.0) > Introducing a streaming checkpoint file manager based on Hadoop's Abortable > interface > - > > Key: SPARK-40039 > URL: https://issues.apache.org/jira/browse/SPARK-40039 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Attila Zsolt Piros >Priority: Major > > Currently on S3 the checkpoint file manager (called > FileContextBasedCheckpointFileManager) is based on rename. So when a file is > opened for an atomic stream a temporary file used instead and when the stream > is committed the file is renamed. > But on S3 a rename will be a file copy. So it has some serious performance > implication. > But on Hadoop 3 there is new interface introduce called *Abortable* and > *S3AFileSystem* has this capability which is implemented by on top S3's > multipart upload. So when the file is committed a POST is sent > ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html]) > and when aborted a DELETE will be send > ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_AbortMultipartUpload.html]) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-40039) Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface
[ https://issues.apache.org/jira/browse/SPARK-40039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-40039: -- Assignee: (was: Attila Zsolt Piros) Reverted at https://github.com/apache/spark/commit/fb4dba1413dac860c9bec5bae422b2f78d212948 and https://github.com/apache/spark/commit/8fb853218bed792b3f66c8401691dd16262e1f9e > Introducing a streaming checkpoint file manager based on Hadoop's Abortable > interface > - > > Key: SPARK-40039 > URL: https://issues.apache.org/jira/browse/SPARK-40039 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Attila Zsolt Piros >Priority: Major > Fix For: 3.4.0 > > > Currently on S3 the checkpoint file manager (called > FileContextBasedCheckpointFileManager) is based on rename. So when a file is > opened for an atomic stream a temporary file used instead and when the stream > is committed the file is renamed. > But on S3 a rename will be a file copy. So it has some serious performance > implication. > But on Hadoop 3 there is new interface introduce called *Abortable* and > *S3AFileSystem* has this capability which is implemented by on top S3's > multipart upload. So when the file is committed a POST is sent > ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html]) > and when aborted a DELETE will be send > ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_AbortMultipartUpload.html]) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40240) PySpark rdd.takeSample should validate `num > maxSampleSize` at first
[ https://issues.apache.org/jira/browse/SPARK-40240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40240: Assignee: (was: Apache Spark) > PySpark rdd.takeSample should validate `num > maxSampleSize` at first > - > > Key: SPARK-40240 > URL: https://issues.apache.org/jira/browse/SPARK-40240 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40240) PySpark rdd.takeSample should validate `num > maxSampleSize` at first
[ https://issues.apache.org/jira/browse/SPARK-40240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40240: Assignee: Apache Spark > PySpark rdd.takeSample should validate `num > maxSampleSize` at first > - > > Key: SPARK-40240 > URL: https://issues.apache.org/jira/browse/SPARK-40240 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40240) PySpark rdd.takeSample should validate `num > maxSampleSize` at first
[ https://issues.apache.org/jira/browse/SPARK-40240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585656#comment-17585656 ] Apache Spark commented on SPARK-40240: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/37683 > PySpark rdd.takeSample should validate `num > maxSampleSize` at first > - > > Key: SPARK-40240 > URL: https://issues.apache.org/jira/browse/SPARK-40240 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40240) PySpark rdd.takeSample should validate `num > maxSampleSize` at first
Ruifeng Zheng created SPARK-40240: - Summary: PySpark rdd.takeSample should validate `num > maxSampleSize` at first Key: SPARK-40240 URL: https://issues.apache.org/jira/browse/SPARK-40240 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.4.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
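The point of SPARK-40240 — validate `num > maxSampleSize` before doing any work — can be sketched with a plain-Python stand-in. The function below is hypothetical, not pyspark's actual `RDD.takeSample`; it just shows the ordering the ticket asks for: all argument validation happens first, so an oversized request fails fast instead of after an expensive count/collect.

```python
import random


def take_sample(population, num, max_sample_size=10_000):
    """Hypothetical sketch of takeSample-style validation ordering:
    reject bad arguments up front, before touching the data (in the
    real RDD case, before triggering a count() job)."""
    if num < 0:
        raise ValueError("num must be non-negative")
    if num > max_sample_size:
        raise ValueError(f"num cannot be greater than {max_sample_size}")
    # Only now do the (potentially expensive) sampling work.
    if num == 0 or not population:
        return []
    return random.sample(population, min(num, len(population)))
```

Moving the `num > max_sample_size` check above the data access is the whole improvement: the error is the same, it just arrives before any job runs.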
[jira] [Assigned] (SPARK-40239) Remove duplicated 'fraction' validation in RDD.sample
[ https://issues.apache.org/jira/browse/SPARK-40239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40239: Assignee: Apache Spark > Remove duplicated 'fraction' validation in RDD.sample > - > > Key: SPARK-40239 > URL: https://issues.apache.org/jira/browse/SPARK-40239 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Trivial > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40239) Remove duplicated 'fraction' validation in RDD.sample
[ https://issues.apache.org/jira/browse/SPARK-40239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585654#comment-17585654 ] Apache Spark commented on SPARK-40239: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/37682 > Remove duplicated 'fraction' validation in RDD.sample > - > > Key: SPARK-40239 > URL: https://issues.apache.org/jira/browse/SPARK-40239 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Trivial > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40239) Remove duplicated 'fraction' validation in RDD.sample
[ https://issues.apache.org/jira/browse/SPARK-40239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40239: Assignee: (was: Apache Spark) > Remove duplicated 'fraction' validation in RDD.sample > - > > Key: SPARK-40239 > URL: https://issues.apache.org/jira/browse/SPARK-40239 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Trivial > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40239) Remove duplicated 'fraction' validation in RDD.sample
Ruifeng Zheng created SPARK-40239: - Summary: Remove duplicated 'fraction' validation in RDD.sample Key: SPARK-40239 URL: https://issues.apache.org/jira/browse/SPARK-40239 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.4.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40238) support scaleUpFactor and initialNumPartition in pyspark rdd API
Ziqi Liu created SPARK-40238: Summary: support scaleUpFactor and initialNumPartition in pyspark rdd API Key: SPARK-40238 URL: https://issues.apache.org/jira/browse/SPARK-40238 Project: Spark Issue Type: Story Components: PySpark Affects Versions: 3.4.0 Reporter: Ziqi Liu This is a followup on https://issues.apache.org/jira/browse/SPARK-40211 The `scaleUpFactor` and `initialNumPartition` configs are not yet supported in the pyspark rdd take API (see [https://github.com/apache/spark/blob/master/python/pyspark/rdd.py#L2799]): it hardcodes `scaleUpFactor` as 1 and `initialNumPartition` as 4, so the pyspark rdd take API is inconsistent with the scala API. Anyone familiar with pyspark can help support this (referring to the [scala implementation|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1448])? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40211) Allow executeTake() / collectLimit's number of starting partitions to be customized
[ https://issues.apache.org/jira/browse/SPARK-40211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-40211. Fix Version/s: 3.4.0 Assignee: Ziqi Liu Resolution: Fixed Resolved by [https://github.com/apache/spark/pull/37661] > Allow executeTake() / collectLimit's number of starting partitions to be > customized > --- > > Key: SPARK-40211 > URL: https://issues.apache.org/jira/browse/SPARK-40211 > Project: Spark > Issue Type: Story > Components: Spark Core, SQL >Affects Versions: 3.4.0 >Reporter: Ziqi Liu >Assignee: Ziqi Liu >Priority: Major > Fix For: 3.4.0 > > > Today, Spark’s executeTake() code allows the limitScaleUpFactor to be > customized but does not allow the initial number of partitions to be > customized: it’s currently hardcoded to {{{}1{}}}. > We should add a configuration so that the initial partition count can be > customized. By setting this new configuration to a high value we could > effectively mitigate the “run multiple jobs” overhead in {{take}} behavior. > We could also set it to higher-than-1-but-still-small values (like, say, > {{{}10{}}}) to achieve a middle-ground trade-off. > > Essentially, we need to make {{numPartsToTry = 1L}} > ([code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L481]) > customizable. We should do this via a new SQL conf, similar to the > {{limitScaleUpFactor}} conf. > > Spark has several near-duplicate versions of this code ([see code > search|https://github.com/apache/spark/search?q=numPartsToTry+%3D+1]) in: > * SparkPlan > * RDD > * pyspark rdd > Also, {{limitScaleUpFactor}} is not supported in pyspark either. So for > now, I will focus on the scala side first, leaving the python side untouched, and > meanwhile sync with pyspark members. Depending on the progress we can do them > all in one PR or make the scala side change first and leave the pyspark change as a > follow-up.
> -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
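The executeTake()/take() mechanism that SPARK-40211 and SPARK-40238 discuss can be sketched in plain Python over a list of in-memory "partitions". This is a simplified stand-in, not Spark's actual implementation (which also estimates rows per partition from earlier jobs); the parameter names mirror the knobs the tickets describe: `initial_num_partitions` is what was previously hardcoded to 1, and `scale_up_factor` grows each subsequent scan.

```python
def take(partitions, num, initial_num_partitions=1, scale_up_factor=4):
    """Sketch of the incremental take() scan: run a 'job' over a few
    partitions, and if that did not yield enough rows, scale up how
    many partitions the next job scans by scale_up_factor."""
    results = []
    scanned = 0
    num_parts_to_try = initial_num_partitions
    while scanned < len(partitions) and len(results) < num:
        upto = min(scanned + num_parts_to_try, len(partitions))
        # One "job": collect rows from the next batch of partitions.
        for part in partitions[scanned:upto]:
            results.extend(part)
        scanned = upto
        num_parts_to_try *= scale_up_factor
    return results[:num]
```

With the default `initial_num_partitions=1`, a take over many small partitions needs several rounds; setting it higher front-loads the first job and mitigates the "run multiple jobs" overhead the ticket mentions, at the cost of possibly scanning more data than needed.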
[jira] [Commented] (SPARK-40235) Use interruptible lock instead of synchronized in Executor.updateDependencies()
[ https://issues.apache.org/jira/browse/SPARK-40235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585633#comment-17585633 ] Apache Spark commented on SPARK-40235: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/37681 > Use interruptible lock instead of synchronized in > Executor.updateDependencies() > --- > > Key: SPARK-40235 > URL: https://issues.apache.org/jira/browse/SPARK-40235 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Major > > This patch modifies the synchronization in {{Executor.updateDependencies()}} > in order to allow tasks to be interrupted while they are blocked and waiting > on other tasks to finish downloading dependencies. > This synchronization was added years ago in > [mesos/spark@{{{}7b9e96c{}}}|https://github.com/mesos/spark/commit/7b9e96c99206c0679d9925e0161fde738a5c7c3a] > in order to prevent concurrently-launching tasks from performing concurrent > dependency updates. If one task is downloading dependencies, all other > newly-launched tasks will block until the original dependency download is > complete. > Let's say that a Spark task launches, becomes blocked on a > {{updateDependencies()}} call, then is cancelled while it is blocked. > Although Spark will send a Thread.interrupt() to the canceled task, the task > will continue waiting because threads blocked on a {{synchronized}} won't > throw an InterruptedException in response to the interrupt. As a result, the > blocked thread will continue to wait until the other thread exits the > synchronized block. > In the wild, we saw a case where this happened and the thread remained > blocked for over 1 minute, causing the TaskReaper to kick in and > self-destruct the executor. 
> This PR aims to fix this problem by replacing the {{synchronized}} with a > ReentrantLock, which has a {{lockInterruptibly}} method. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40235) Use interruptible lock instead of synchronized in Executor.updateDependencies()
[ https://issues.apache.org/jira/browse/SPARK-40235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-40235: --- Description: This patch modifies the synchronization in {{Executor.updateDependencies()}} in order to allow tasks to be interrupted while they are blocked and waiting on other tasks to finish downloading dependencies. This synchronization was added years ago in [mesos/spark@{{{}7b9e96c{}}}|https://github.com/mesos/spark/commit/7b9e96c99206c0679d9925e0161fde738a5c7c3a] in order to prevent concurrently-launching tasks from performing concurrent dependency updates. If one task is downloading dependencies, all other newly-launched tasks will block until the original dependency download is complete. Let's say that a Spark task launches, becomes blocked on a {{updateDependencies()}} call, then is cancelled while it is blocked. Although Spark will send a Thread.interrupt() to the canceled task, the task will continue waiting because threads blocked on a {{synchronized}} won't throw an InterruptedException in response to the interrupt. As a result, the blocked thread will continue to wait until the other thread exits the synchronized block. In the wild, we saw a case where this happened and the thread remained blocked for over 1 minute, causing the TaskReaper to kick in and self-destruct the executor. This PR aims to fix this problem by replacing the {{synchronized}} with a ReentrantLock, which has a {{lockInterruptibly}} method. was: This patch modifies the synchronization in {{Executor.updateDependencies()}} in order to allow tasks to be interrupted while they are blocked and waiting on other tasks to finish downloading dependencies. 
This synchronization was added years ago in [mesos/spark@{{{}7b9e96c{}}}|https://github.com/mesos/spark/commit/7b9e96c99206c0679d9925e0161fde738a5c7c3a] in order to prevent concurrently-launching tasks from performing concurrent dependency updates (file downloads, and, later, library installation). If one task is downloading dependencies, all other newly-launched tasks will block until the original dependency download is complete. Let's say that a Spark task launches, becomes blocked on a {{updateDependencies()}} call, then is cancelled while it is blocked. Although Spark will send a Thread.interrupt() to the canceled task, the task will continue waiting because threads blocked on a {{synchronized}} won't throw an InterruptedException in response to the interrupt. As a result, the blocked thread will continue to wait until the other thread exits the synchronized block. In the wild, we saw a case where this happened and the thread remained blocked for over 1 minute, causing the TaskReaper to kick in and self-destruct the executor. This PR aims to fix this problem by replacing the {{synchronized}} with a ReentrantLock, which has a {{lockInterruptibly}} method. > Use interruptible lock instead of synchronized in > Executor.updateDependencies() > --- > > Key: SPARK-40235 > URL: https://issues.apache.org/jira/browse/SPARK-40235 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Major > > This patch modifies the synchronization in {{Executor.updateDependencies()}} > in order to allow tasks to be interrupted while they are blocked and waiting > on other tasks to finish downloading dependencies. > This synchronization was added years ago in > [mesos/spark@{{{}7b9e96c{}}}|https://github.com/mesos/spark/commit/7b9e96c99206c0679d9925e0161fde738a5c7c3a] > in order to prevent concurrently-launching tasks from performing concurrent > dependency updates. 
If one task is downloading dependencies, all other > newly-launched tasks will block until the original dependency download is > complete. > Let's say that a Spark task launches, becomes blocked on a > {{updateDependencies()}} call, then is cancelled while it is blocked. > Although Spark will send a Thread.interrupt() to the canceled task, the task > will continue waiting because threads blocked on a {{synchronized}} won't > throw an InterruptedException in response to the interrupt. As a result, the > blocked thread will continue to wait until the other thread exits the > synchronized block. > In the wild, we saw a case where this happened and the thread remained > blocked for over 1 minute, causing the TaskReaper to kick in and > self-destruct the executor. > This PR aims to fix this problem by replacing the {{synchronized}} with a > ReentrantLock, which has a {{lockInterruptibly}} method. -- This message was sent by Atlassian Jira (v8.20.10#820010)
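The behavioral difference the ticket describes can be shown with a minimal standalone sketch (this is illustrative, not Spark's actual code; the class and method names are made up). A thread blocked in {{lockInterruptibly()}} responds to {{Thread.interrupt()}} by throwing {{InterruptedException}}, whereas a thread blocked entering a {{synchronized}} block keeps waiting until the lock is released:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of the fix direction: "task B" blocks on a ReentrantLock held by
// "task A"; an interrupt (simulating task cancellation) unblocks it via
// InterruptedException, which a synchronized block would never do.
public class InterruptibleLockDemo {
    // Returns true iff the blocked thread was unblocked by the interrupt.
    static boolean interruptBlockedTask() throws InterruptedException {
        ReentrantLock lock = new ReentrantLock();
        CountDownLatch started = new CountDownLatch(1);
        AtomicBoolean sawInterrupt = new AtomicBoolean(false);

        lock.lock(); // main thread plays "task A", holding the update lock
        Thread taskB = new Thread(() -> {
            started.countDown();
            try {
                lock.lockInterruptibly(); // blocks: task A never releases
                lock.unlock();
            } catch (InterruptedException e) {
                sawInterrupt.set(true);   // cancellation takes effect here
            }
        });
        taskB.start();
        started.await();
        Thread.sleep(200);  // give taskB time to block on the lock
        taskB.interrupt();  // simulate task cancellation
        taskB.join(5000);
        lock.unlock();
        return sawInterrupt.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("unblocked by interrupt: " + interruptBlockedTask());
    }
}
```

Note that {{lockInterruptibly()}} also throws immediately if the thread's interrupt flag is already set on entry, so the cancellation signal is not lost even if it arrives before the task reaches the lock.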
[jira] [Created] (SPARK-40237) Can't get JDBC type for map in Spark 3.3.0 and PostgreSQL
Igor Suhorukov created SPARK-40237:
--
Summary: Can't get JDBC type for map in Spark 3.3.0 and PostgreSQL
Key: SPARK-40237
URL: https://issues.apache.org/jira/browse/SPARK-40237
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.3.0
Environment: Linux 5.15.0-46-generic #49~20.04.1-Ubuntu SMP Thu Aug 4 19:15:44 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux 22/08/27 00:30:01 INFO SparkContext: Running Spark version 3.3.0
Reporter: Igor Suhorukov

The exception 'Can't get JDBC type for map' happens when I try to save a dataset into PostgreSQL via JDBC:
{code:java}
Dataset ds = ...;
ds.write().mode(SaveMode.Overwrite)
    .option("truncate", true).format("jdbc")
    .option("url", "jdbc:postgresql://127.0.0.1:5432/???")
    .option("dbtable", "t1")
    .option("isolationLevel", "NONE")
    .option("user", "???")
    .option("password", "???")
    .save();
{code}
This issue is related to the unimplemented PostgresDialect#getJDBCType, which is strange, as PostgreSQL supports the hstore/json/jsonb types, all suitable for storing a map. The PostgreSQL JDBC driver supports writing hstore via statement.setObject(parameterIndex, map), and JSON via statement.setString(parameterIndex, map) with cast(? as JSON).
{code:java}
java.lang.IllegalArgumentException: Can't get JDBC type for map
  at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotGetJdbcTypeError(QueryExecutionErrors.scala:810)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$getJdbcType$2(JdbcUtils.scala:162)
  at scala.Option.getOrElse(Option.scala:201)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getJdbcType(JdbcUtils.scala:162)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$schemaString$4(JdbcUtils.scala:782)
  at scala.collection.immutable.Map$EmptyMap$.getOrElse(Map.scala:228)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$schemaString$3(JdbcUtils.scala:782)
  at scala.collection.ArrayOps$.foreach$extension(ArrayOps.scala:1328)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.schemaString(JdbcUtils.scala:779)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createTable(JdbcUtils.scala:883)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:81)
  at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
  at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
  at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560)
  at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:94)
  at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecutio
[jira] [Resolved] (SPARK-40222) Numeric try_add/try_divide/try_subtract/try_multiply should throw error from their children
[ https://issues.apache.org/jira/browse/SPARK-40222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-40222. Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37663 [https://github.com/apache/spark/pull/37663] > Numeric try_add/try_divide/try_subtract/try_multiply should throw error from > their children > --- > > Key: SPARK-40222 > URL: https://issues.apache.org/jira/browse/SPARK-40222 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.4.0 > > > Similar to https://issues.apache.org/jira/browse/SPARK-40054, we should > refactor the > {{{}try_add{}}}/{{{}try_subtract{}}}/{{{}try_multiply{}}}/{{{}try_divide{}}} > functions so that the errors from their children will be shown instead of > ignored. > Spark SQL allows arithmetic operations between > Number/Date/Timestamp/CalendarInterval/AnsiInterval (see the rule > [ResolveBinaryArithmetic|https://github.com/databricks/runtime/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L501] > for details). Some of these combinations can throw exceptions too: * Date + > CalendarInterval > * Date + AnsiInterval > * Timestamp + AnsiInterval > * Date - CalendarInterval > * Date - AnsiInterval > * Timestamp - AnsiInterval > * Number * CalendarInterval > * Number * AnsiInterval > * CalendarInterval / Number > * AnsiInterval / Number > This Jira is for the cases when both input data types are numbers. > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
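The intended semantics — suppress only the operator's own failure while letting errors from child expressions propagate — can be sketched outside Spark with plain Java (the helper below is hypothetical, not Spark source; Spark's real implementation works on Catalyst expressions):

```java
import java.util.function.Supplier;

// Illustrative semantics for try_divide: return null only when the division
// itself fails (e.g. divide-by-zero), while exceptions thrown by evaluating
// the child expressions are NOT caught and reach the caller.
public class TryDivideSketch {
    static Long tryDivide(Supplier<Long> left, Supplier<Long> right) {
        long l = left.get();   // child errors propagate from here...
        long r = right.get();  // ...and here, deliberately uncaught
        try {
            return Math.floorDiv(l, r);  // the operator's own evaluation
        } catch (ArithmeticException e) {
            return null;                 // only this error is suppressed
        }
    }
}
```

Before the refactoring described here, the try_* functions effectively wrapped the whole evaluation (children included) in the catch, hiding errors such as overflow raised inside a child expression.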
[jira] [Updated] (SPARK-40236) Error messages produced from error-classes.json should not have hard-coded sentences as parameters
[ https://issues.apache.org/jira/browse/SPARK-40236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vitalii Li updated SPARK-40236: --- Description: Relevant comment: [https://github.com/apache/spark/pull/37621#discussion_r955101102.] Example of hard coded string: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala#L512-L513 > Error messages produced from error-classes.json should not have hard-coded > sentences as parameters > -- > > Key: SPARK-40236 > URL: https://issues.apache.org/jira/browse/SPARK-40236 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Vitalii Li >Priority: Major > > Relevant comment: > [https://github.com/apache/spark/pull/37621#discussion_r955101102.] > Example of hard coded string: > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala#L512-L513 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
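The guideline can be illustrated with a small sketch (class, method, and template names below are invented for illustration; they are not Spark's actual error framework). The point is that a message parameter should be a small variable value substituted into a template, never a pre-built sentence, so the full wording lives in error-classes.json:

```java
import java.util.Map;

// Hypothetical sketch contrasting the anti-pattern with the intended style.
public class ErrorTemplateSketch {
    // Anti-pattern: the "reason" parameter is itself a hard-coded sentence,
    // so the template alone no longer determines the user-visible message.
    static String badMessage(String value) {
        String reason = "the value is out of range and cannot be cast safely.";
        return String.format("Cannot process %s because %s", value, reason);
    }

    // Intended style: the template carries the complete sentence; parameters
    // are only the variable pieces (here substituted with simple <key> tags).
    static String goodMessage(Map<String, String> params) {
        String template = "Cannot safely cast <value> to <type>: out of range.";
        for (Map.Entry<String, String> e : params.entrySet()) {
            template = template.replace("<" + e.getKey() + ">", e.getValue());
        }
        return template;
    }
}
```

With the second form, every sentence the user can see is auditable (and translatable) from the error-class definition itself.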
[jira] [Created] (SPARK-40236) Error messages produced from error-classes.json should not have hard-coded sentences as parameters
Vitalii Li created SPARK-40236: -- Summary: Error messages produced from error-classes.json should not have hard-coded sentences as parameters Key: SPARK-40236 URL: https://issues.apache.org/jira/browse/SPARK-40236 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Vitalii Li -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40235) Use interruptible lock instead of synchronized in Executor.updateDependencies()
Josh Rosen created SPARK-40235: -- Summary: Use interruptible lock instead of synchronized in Executor.updateDependencies() Key: SPARK-40235 URL: https://issues.apache.org/jira/browse/SPARK-40235 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.4.0 Reporter: Josh Rosen Assignee: Josh Rosen This patch modifies the synchronization in {{Executor.updateDependencies()}} in order to allow tasks to be interrupted while they are blocked and waiting on other tasks to finish downloading dependencies. This synchronization was added years ago in [mesos/spark@{{{}7b9e96c{}}}|https://github.com/mesos/spark/commit/7b9e96c99206c0679d9925e0161fde738a5c7c3a] in order to prevent concurrently-launching tasks from performing concurrent dependency updates (file downloads, and, later, library installation). If one task is downloading dependencies, all other newly-launched tasks will block until the original dependency download is complete. Let's say that a Spark task launches, becomes blocked on a {{updateDependencies()}} call, then is cancelled while it is blocked. Although Spark will send a Thread.interrupt() to the canceled task, the task will continue waiting because threads blocked on a {{synchronized}} won't throw an InterruptedException in response to the interrupt. As a result, the blocked thread will continue to wait until the other thread exits the synchronized block. In the wild, we saw a case where this happened and the thread remained blocked for over 1 minute, causing the TaskReaper to kick in and self-destruct the executor. This PR aims to fix this problem by replacing the {{synchronized}} with a ReentrantLock, which has a {{lockInterruptibly}} method. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40234) Clean only MDC items set by Spark
[ https://issues.apache.org/jira/browse/SPARK-40234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh reassigned SPARK-40234: --- Assignee: L. C. Hsieh > Clean only MDC items set by Spark > - > > Key: SPARK-40234 > URL: https://issues.apache.org/jira/browse/SPARK-40234 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > Since SPARK-8981, Spark executor adds MDC support. Before setting MDC items, > the executor cleans up all MDC items. But it causes an issue for other MDC > items not set by Spark but from users at other places. It causes these custom > MDC items not shown in executor log. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40234) Clean only MDC items set by Spark
[ https://issues.apache.org/jira/browse/SPARK-40234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40234: Assignee: (was: Apache Spark) > Clean only MDC items set by Spark > - > > Key: SPARK-40234 > URL: https://issues.apache.org/jira/browse/SPARK-40234 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: L. C. Hsieh >Priority: Major > > Since SPARK-8981, Spark executor adds MDC support. Before setting MDC items, > the executor cleans up all MDC items. But it causes an issue for other MDC > items not set by Spark but from users at other places. It causes these custom > MDC items not shown in executor log. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40234) Clean only MDC items set by Spark
[ https://issues.apache.org/jira/browse/SPARK-40234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40234: Assignee: Apache Spark > Clean only MDC items set by Spark > - > > Key: SPARK-40234 > URL: https://issues.apache.org/jira/browse/SPARK-40234 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: L. C. Hsieh >Assignee: Apache Spark >Priority: Major > > Since SPARK-8981, Spark executor adds MDC support. Before setting MDC items, > the executor cleans up all MDC items. But it causes an issue for other MDC > items not set by Spark but from users at other places. It causes these custom > MDC items not shown in executor log. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40234) Clean only MDC items set by Spark
[ https://issues.apache.org/jira/browse/SPARK-40234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585609#comment-17585609 ] Apache Spark commented on SPARK-40234: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/37680 > Clean only MDC items set by Spark > - > > Key: SPARK-40234 > URL: https://issues.apache.org/jira/browse/SPARK-40234 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: L. C. Hsieh >Priority: Major > > Since SPARK-8981, Spark executor adds MDC support. Before setting MDC items, > the executor cleans up all MDC items. But it causes an issue for other MDC > items not set by Spark but from users at other places. It causes these custom > MDC items not shown in executor log. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40234) Clean only MDC items set by Spark
L. C. Hsieh created SPARK-40234: --- Summary: Clean only MDC items set by Spark Key: SPARK-40234 URL: https://issues.apache.org/jira/browse/SPARK-40234 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.4.0 Reporter: L. C. Hsieh Since SPARK-8981, the Spark executor has MDC support. Before setting its MDC items, the executor cleans up all MDC items. But this causes an issue for MDC items set not by Spark but by users elsewhere: these custom MDC items are not shown in the executor log. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
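The fix direction — remove or restore only the keys Spark itself set, instead of clearing the whole context — can be sketched with a plain map standing in for the logging framework's MDC (this is an illustrative stand-in, not Spark's code or the SLF4J API):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: run a task with Spark-owned MDC items installed, then restore the
// previous state of exactly those keys. Entries installed by users are never
// touched, so custom MDC items survive in the executor log.
public class MdcScopedUpdate {
    static void withSparkMdc(Map<String, String> mdc,
                             Map<String, String> sparkItems,
                             Runnable task) {
        // Snapshot the prior value of each Spark-owned key (null = absent).
        Map<String, String> saved = new HashMap<>();
        for (String key : sparkItems.keySet()) {
            saved.put(key, mdc.get(key));
        }
        mdc.putAll(sparkItems);
        try {
            task.run();
        } finally {
            // Restore only the keys we overwrote; leave everything else alone.
            for (Map.Entry<String, String> e : saved.entrySet()) {
                if (e.getValue() == null) mdc.remove(e.getKey());
                else mdc.put(e.getKey(), e.getValue());
            }
        }
    }
}
```

Contrast this with a blanket {{MDC.clear()}} before each task, which is what wipes out user-installed items.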
[jira] [Commented] (SPARK-35242) Support change catalog default database for spark
[ https://issues.apache.org/jira/browse/SPARK-35242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585601#comment-17585601 ] Apache Spark commented on SPARK-35242: -- User 'roczei' has created a pull request for this issue: https://github.com/apache/spark/pull/37679 > Support change catalog default database for spark > - > > Key: SPARK-35242 > URL: https://issues.apache.org/jira/browse/SPARK-35242 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: hong dongdong >Priority: Major > > The Spark catalog default database can only be 'default'. When we cannot access > 'default', we will get the exception 'Permission denied:'. We should support > changing the default database for the catalog like 'jdbc/thrift' does. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35242) Support change catalog default database for spark
[ https://issues.apache.org/jira/browse/SPARK-35242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585602#comment-17585602 ] Apache Spark commented on SPARK-35242: -- User 'roczei' has created a pull request for this issue: https://github.com/apache/spark/pull/37679 > Support change catalog default database for spark > - > > Key: SPARK-35242 > URL: https://issues.apache.org/jira/browse/SPARK-35242 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: hong dongdong >Priority: Major > > The Spark catalog default database can only be 'default'. When we cannot access > 'default', we will get the exception 'Permission denied:'. We should support > changing the default database for the catalog like 'jdbc/thrift' does. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40233) Unable to load large pandas dataframe to pyspark
Niranda Perera created SPARK-40233:
--
Summary: Unable to load large pandas dataframe to pyspark
Key: SPARK-40233
URL: https://issues.apache.org/jira/browse/SPARK-40233
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 3.3.0
Reporter: Niranda Perera

I've been trying to join two large pandas dataframes with PySpark using the following code. I'm trying to vary the executor cores allocated for the application and measure the scalability of PySpark (strong scaling).
{code:java}
r = 10 # 1Bn rows
it = 10
w = 256
unique = 0.9
TOTAL_MEM = 240
TOTAL_NODES = 14

max_val = r * unique
rng = default_rng()
frame_data = rng.integers(0, max_val, size=(r, 2))
frame_data1 = rng.integers(0, max_val, size=(r, 2))
print(f"data generated", flush=True)

df_l = pd.DataFrame(frame_data).add_prefix("col")
df_r = pd.DataFrame(frame_data1).add_prefix("col")
print(f"data loaded", flush=True)

procs = int(math.ceil(w / TOTAL_NODES))
mem = int(TOTAL_MEM*0.9)
print(f"world sz {w} procs per worker {procs} mem {mem} iter {it}", flush=True)

spark = SparkSession\
    .builder\
    .appName(f'join {r} {w}')\
    .master('spark://node:7077')\
    .config('spark.executor.memory', f'{int(mem*0.6)}g')\
    .config('spark.executor.pyspark.memory', f'{int(mem*0.4)}g')\
    .config('spark.cores.max', w)\
    .config('spark.driver.memory', '100g')\
    .config('spark.sql.execution.arrow.pyspark.enabled', 'true')\
    .getOrCreate()

sdf0 = spark.createDataFrame(df_l).repartition(w).cache()
sdf1 = spark.createDataFrame(df_r).repartition(w).cache()
print(f"data loaded to spark", flush=True)

try:
    for i in range(it):
        t1 = time.time()
        out = sdf0.join(sdf1, on='col0', how='inner')
        count = out.count()
        t2 = time.time()
        print(f"timings {r} 1 {i} {(t2 - t1) * 1000:.0f} ms, {count}", flush=True)
        del out
        del count
        gc.collect()
finally:
    spark.stop()
{code}
{*}Cluster{*}: I am using a standalone Spark cluster on a 15-node cluster with 48 cores and 240GB RAM each.
I've spawned the master and the driver code on node1, while the other 14 nodes have spawned workers allocating maximum memory. In the Spark context, I am reserving 90% of total memory for the executor, splitting 60% to the JVM and 40% to PySpark.

{*}Issue{*}: When I run the above program, I can see that the executors are being assigned to the app, but it doesn't move forward, even after 60 mins. For a smaller row count (10M), this was working without a problem.

Driver output:
{code:java}
world sz 256 procs per worker 19 mem 216 iter 8
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/08/26 14:52:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
/N/u/d/dnperera/.conda/envs/env/lib/python3.8/site-packages/pyspark/sql/pandas/conversion.py:425: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
  Negative initial size: -589934400
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
  warn(msg)
{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40232) KMeans: high variability in results despite high initSteps parameter value
Patryk Piekarski created SPARK-40232:
--
Summary: KMeans: high variability in results despite high initSteps parameter value
Key: SPARK-40232
URL: https://issues.apache.org/jira/browse/SPARK-40232
Project: Spark
Issue Type: Bug
Components: ML, PySpark
Affects Versions: 3.3.0
Reporter: Patryk Piekarski
Attachments: sample_data.csv

I'm running KMeans on a sample dataset using PySpark. I want the results to be fairly stable, so I play with the _initSteps_ parameter. My understanding is that the higher the number of steps for the k-means|| initialization mode, the higher the number of iterations the algorithm runs, and in the end it selects the best model out of all iterations. And that's the behavior I observe when running the sklearn implementation with _n_init_ >= 10. However, when running the PySpark implementation, regardless of the number of partitions of the underlying data frame (tested with 1, 4, and 8 partitions), even when setting _initSteps_ to 10, 50, or let's say 500, the results I get with different seeds are different, and the trainingCost value I observe is sometimes far from the lowest. As a workaround, to force the algorithm to iterate and select the best model, I used a loop with a dynamic seed. Sklearn in each iteration gets a trainingCost near 276655. The PySpark implementation of KMeans gets there in the 2nd, 5th, and 6th iterations, but all the remaining iterations yield higher values. Does the _initSteps_ parameter work as expected? Because my findings suggest that something might be off here.
Let me know where I could upload this sample dataset (2MB)
{code:java}
import pandas as pd
from sklearn.cluster import KMeans as KMeansSKlearn

df = pd.read_csv('sample_data.csv')
minimum =
for i in range(1,10):
    kmeans = KMeansSKlearn(init="k-means++", n_clusters=5, n_init=10, random_state=i)
    model = kmeans.fit(df)
    print(f'Sklearn iteration {i}: {round(model.inertia_)}')

from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("kmeans-test") \
    .config('spark.driver.memory', '2g') \
    .master("local[2]") \
    .getOrCreate()

df1 = spark.createDataFrame(df)
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler
assemble = VectorAssembler(inputCols=df1.columns, outputCol='features')
assembled_data = assemble.transform(df1)

minimum =
for i in range(1,10):
    kmeans = KMeans(featuresCol='features', k=5, initSteps=100, maxIter=300, seed=i, tol=0.0001)
    model = kmeans.fit(assembled_data)
    summary = model.summary
    print(f'PySpark iteration {i}: {round(summary.trainingCost)}')
{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40232) KMeans: high variability in results despite high initSteps parameter value
[ https://issues.apache.org/jira/browse/SPARK-40232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patryk Piekarski updated SPARK-40232: - Attachment: sample_data.csv > KMeans: high variability in results despite high initSteps parameter value > -- > > Key: SPARK-40232 > URL: https://issues.apache.org/jira/browse/SPARK-40232 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 3.3.0 >Reporter: Patryk Piekarski >Priority: Major > Attachments: sample_data.csv > > > I'm running KMeans on a sample dataset using PySpark. I want the results to > be fairly stable, so I play with the _initSteps_ parameter. My understanding > is that the higher the number of steps for k-means|| initialization mode, the > higher the number of iterations the algorithm runs and in the end selects the > best model out of all iterations. And that's the behavior I observe when > running sklearn implementation with _n_init_ >= 10. However, when running > PySpark implementation, regardless of the number of partitions of underlying > data frame (tested on 1, 4, 8 number of partitions), even when setting > _initSteps_ to 10, 50, or let's say 500, the results I get with different > seeds are different and trainingCost value I observe is sometimes far from > the lowest. > As a workaround, to force the algorithm to iterate and select the best model > I used a loop with dynamic seed. > SKlearn in each iteration gets the trainingCost near 276655. > PySpark implementation of KMeans gets there in the 2nd, 5th and 6th > iteration, but all the remaining iterations yield higher values. > Does the _initSteps_ parameter work as expected? Because my findings suggest > that something might be off here. 
> Let me know where I could upload this sample dataset (2MB) > > {code:java} > import pandas as pd > from sklearn.cluster import KMeans as KMeansSKlearn > df = pd.read_csv('sample_data.csv') > minimum = > for i in range(1,10): > kmeans = KMeansSKlearn(init="k-means++", n_clusters=5, n_init=10, > random_state=i) > model = kmeans.fit(df) > print(f'Sklearn iteration {i}: {round(model.inertia_)}')from pyspark.sql > import SparkSession > spark= SparkSession.builder \ > .appName("kmeans-test") \ > .config('spark.driver.memory', '2g') \ > .master("local[2]") \ > .getOrCreate()df1 = spark.createDataFrame(df) > from pyspark.ml.clustering import KMeans > from pyspark.ml.feature import VectorAssembler > assemble=VectorAssembler(inputCols=df1.columns, outputCol='features') > assembled_data=assemble.transform(df1) > minimum = > for i in range(1,10): > kmeans = KMeans(featuresCol='features', k=5, initSteps=100, maxIter=300, > seed=i, tol=0.0001) > model = kmeans.fit(assembled_data) > summary = model.summary > print(f'PySpark iteration {i}: {round(summary.trainingCost)}'){code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40124) Update TPCDS v1.4 q32 for Plan Stability tests
[ https://issues.apache.org/jira/browse/SPARK-40124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585318#comment-17585318 ] Apache Spark commented on SPARK-40124: -- User 'mskapilks' has created a pull request for this issue: https://github.com/apache/spark/pull/37678 > Update TPCDS v1.4 q32 for Plan Stability tests > -- > > Key: SPARK-40124 > URL: https://issues.apache.org/jira/browse/SPARK-40124 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Kapil Singh >Assignee: Kapil Singh >Priority: Major > Fix For: 3.4.0, 3.3.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39931) Improve performance of applyInPandas for very small groups
[ https://issues.apache.org/jira/browse/SPARK-39931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enrico Minack updated SPARK-39931: -- Description: Calling {{DataFrame.groupby(...).applyInPandas(...)}} for very small groups in PySpark is very slow. The reason is that for each group, PySpark creates a Pandas DataFrame and calls into the Python code. For very small groups, the overhead is huge, for large groups, it is smaller. Here is a benchmarks (seconds to {{groupBy(...).applyInPandas(...)}} 10m rows): ||groupSize||Scala||pyspark.sql||pyspark.pandas|| |1024|8.9|16.2|7.8| |512|9.4|26.7|9.8| |256|9.3|44.5|20.2| |128|9.5|82.7|48.8| |64|9.5|158.2|91.9| |32|9.6|319.8|207.3| |16|9.6|652.6|261.5| |8|9.5|1,376|663.0| |4|9.8|2,656|1,168| |2|10.4|5,412|2,456| |1|11.3|9,491|4,642| *Idea to overcome this* is to call into Python side with a Pandas DataFrame that contains potentially multiple groups, then perform a Pandas {{DataFrame.groupBy(...).apply(...)}} or provide the {{DataFrameGroupBy}} to the Python method. With large groups, that Panadas DataFrame has all rows of a single group, with small groups it contains many groups. This should improve efficiency. was: Calling {{DataFrame.groupby(...).applyInPandas(...)}} for very small groups in PySpark is very slow. The reason is that for each group, PySpark creates a Pandas DataFrame and calls into the Python code. For very small groups, the overhead is huge, for large groups, it is smaller. 
Here is a benchmark (seconds to {{groupBy(...).applyInPandas(...)}} 10m rows): ||groupSize||Scala||pyspark.sql||pyspark.pandas|| |1024|8.9|16.2|7.8| |512|9.4|26.7|9.8| |256|9.3|44.5|20.2| |128|9.5|82.7|48.8| |64|9.5|158.2|91.9| |32|9.6|319.8|207.3| |16|9.6|652.6|261.5| |8|9.5|1,376|663.0| |4|9.8|2,656|1,168| |2|10.4|5,412|2,456| |1|11.3|8,162|4,642| *Idea to overcome this* is to call into the Python side with a Pandas DataFrame that contains potentially multiple groups, then perform a Pandas {{DataFrame.groupBy(...).apply(...)}} or provide the {{DataFrameGroupBy}} to the Python method. With large groups, that Pandas DataFrame has all rows of a single group; with small groups, it contains many groups. This should improve efficiency. > Improve performance of applyInPandas for very small groups > -- > > Key: SPARK-39931 > URL: https://issues.apache.org/jira/browse/SPARK-39931 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Enrico Minack >Priority: Major > > Calling {{DataFrame.groupby(...).applyInPandas(...)}} for very small groups > in PySpark is very slow. The reason is that for each group, PySpark creates a > Pandas DataFrame and calls into the Python code. For very small groups, the > overhead is huge; for large groups, it is smaller. > Here is a benchmark (seconds to {{groupBy(...).applyInPandas(...)}} 10m > rows): > ||groupSize||Scala||pyspark.sql||pyspark.pandas|| > |1024|8.9|16.2|7.8| > |512|9.4|26.7|9.8| > |256|9.3|44.5|20.2| > |128|9.5|82.7|48.8| > |64|9.5|158.2|91.9| > |32|9.6|319.8|207.3| > |16|9.6|652.6|261.5| > |8|9.5|1,376|663.0| > |4|9.8|2,656|1,168| > |2|10.4|5,412|2,456| > |1|11.3|9,491|4,642| > *Idea to overcome this* is to call into the Python side with a Pandas DataFrame > that contains potentially multiple groups, then perform a Pandas > {{DataFrame.groupBy(...).apply(...)}} or provide the {{DataFrameGroupBy}} to > the Python method.
With large groups, that Pandas DataFrame has all rows of > a single group; with small groups, it contains many groups. This should > improve efficiency. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
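The per-group call overhead described in SPARK-39931 can be illustrated without Spark at all. Below is a minimal pure-Python sketch (a hypothetical helper, not PySpark's actual implementation): the user's function is invoked once per group, so halving the group size doubles the number of Python-level calls, which is exactly where the slowdown for tiny groups comes from.

```python
# Toy model of groupBy(...).applyInPandas(...): one Python call per group.
# apply_in_groups is a hypothetical helper, not Spark's implementation.
def apply_in_groups(rows, key, func):
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row)
    out, calls = [], 0
    for group in groups.values():
        out.extend(func(group))  # per-group invocation: the overhead source
        calls += 1
    return out, calls

rows = [{"id": i, "grp": i // 4} for i in range(1000)]        # groups of 4
_, small_calls = apply_in_groups(rows, "grp", lambda g: g)    # 250 invocations
rows = [{"id": i, "grp": i // 256} for i in range(1000)]      # groups of ~256
_, large_calls = apply_in_groups(rows, "grp", lambda g: g)    # 4 invocations
```

Batching several groups into a single call to the Python side, as the issue proposes, amortizes this per-call cost across many groups.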
[jira] [Commented] (SPARK-40039) Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface
[ https://issues.apache.org/jira/browse/SPARK-40039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585315#comment-17585315 ] Apache Spark commented on SPARK-40039: -- User 'roczei' has created a pull request for this issue: https://github.com/apache/spark/pull/37677 > Introducing a streaming checkpoint file manager based on Hadoop's Abortable > interface > - > > Key: SPARK-40039 > URL: https://issues.apache.org/jira/browse/SPARK-40039 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > Fix For: 3.4.0 > > > Currently on S3 the checkpoint file manager (called > FileContextBasedCheckpointFileManager) is based on rename. So when a file is > opened for an atomic stream, a temporary file is used instead, and when the stream > is committed, the file is renamed. > But on S3 a rename is a file copy, so it has serious performance > implications. > On Hadoop 3 there is a new interface called *Abortable*, and > *S3AFileSystem* has this capability, implemented on top of S3's > multipart upload. So when the file is committed, a POST is sent > ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html]), > and when aborted, a DELETE is sent > ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_AbortMultipartUpload.html]) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40039) Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface
[ https://issues.apache.org/jira/browse/SPARK-40039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585314#comment-17585314 ] Apache Spark commented on SPARK-40039: -- User 'roczei' has created a pull request for this issue: https://github.com/apache/spark/pull/37677 > Introducing a streaming checkpoint file manager based on Hadoop's Abortable > interface > - > > Key: SPARK-40039 > URL: https://issues.apache.org/jira/browse/SPARK-40039 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > Fix For: 3.4.0 > > > Currently on S3 the checkpoint file manager (called > FileContextBasedCheckpointFileManager) is based on rename. So when a file is > opened for an atomic stream, a temporary file is used instead, and when the stream > is committed, the file is renamed. > But on S3 a rename is a file copy, so it has serious performance > implications. > On Hadoop 3 there is a new interface called *Abortable*, and > *S3AFileSystem* has this capability, implemented on top of S3's > multipart upload. So when the file is committed, a POST is sent > ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html]), > and when aborted, a DELETE is sent > ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_AbortMultipartUpload.html]) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
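The commit/abort semantics SPARK-40039 describes can be sketched abstractly. This is a toy in-memory model, not Hadoop's actual `Abortable` API (the class and the dict-based store are assumptions for illustration): buffered writes become visible atomically on commit, and abort makes it as if the file was never created, with no temporary file and no rename/copy step.

```python
class AbortableStreamSketch:
    """Toy model of rename-free commit, loosely mirroring Hadoop's Abortable.
    Hypothetical sketch; the real interface lives in hadoop-common."""

    def __init__(self, store, path):
        self.store = store   # dict standing in for the object store (e.g. S3)
        self.path = path
        self.parts = []      # buffered "multipart" writes

    def write(self, data: bytes):
        self.parts.append(data)

    def commit(self):
        # like CompleteMultipartUpload: file appears atomically at its final path
        self.store[self.path] = b"".join(self.parts)

    def abort(self):
        # like AbortMultipartUpload: buffered parts dropped, nothing appears
        self.parts.clear()

store = {}
ok = AbortableStreamSketch(store, "checkpoint/commits/42")
ok.write(b"v1\n")
ok.write(b"{}")
ok.commit()                 # visible only now, no rename needed

failed = AbortableStreamSketch(store, "checkpoint/commits/43")
failed.write(b"partial")
failed.abort()              # never visible; no temp file to clean up
```

The design point is that the object either appears complete at its final path or not at all, which is what makes it attractive for streaming checkpoints on S3, where rename is a copy.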
[jira] [Updated] (SPARK-34265) Instrument Python UDF execution using SQL Metrics
[ https://issues.apache.org/jira/browse/SPARK-34265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luca Canali updated SPARK-34265: Affects Version/s: 3.3.0 (was: 3.2.0) > Instrument Python UDF execution using SQL Metrics > - > > Key: SPARK-34265 > URL: https://issues.apache.org/jira/browse/SPARK-34265 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Luca Canali >Priority: Minor > Attachments: PandasUDF_ArrowEvalPython_Metrics.png, > PythonSQLMetrics_Jira_Picture.png, proposed_Python_SQLmetrics_v20210128.png > > > This proposes to add SQLMetrics instrumentation for Python UDF. > This is aimed at improving monitoring and performance troubleshooting of > Python UDFs and Pandas UDFs, including the use of MapPartitions and > MapInArrow. > The introduced metrics are exposed to the end users via the metrics system > and are visible through the WebUI interface, in the SQL/DataFrame tab for > execution steps related to Python UDF execution. See also the attached > screenshots. > This instrumentation is lightweight and can be used in production and for > monitoring. It is complementary to the Python/Pandas UDF Profiler introduced > in Spark 3.3: > [https://spark.apache.org/docs/latest/api/python/development/debugging.html#python-pandas-udf] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34265) Instrument Python UDF execution using SQL Metrics
[ https://issues.apache.org/jira/browse/SPARK-34265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luca Canali updated SPARK-34265: Description: This proposes to add SQLMetrics instrumentation for Python UDF. This is aimed at improving monitoring and performance troubleshooting of Python UDFs and Pandas UDFs, including the use of MapPartitions and MapInArrow. The introduced metrics are exposed to the end users via the metrics system and are visible through the WebUI interface, in the SQL/DataFrame tab for execution steps related to Python UDF execution. See also the attached screenshots. This instrumentation is lightweight and can be used in production and for monitoring. It is complementary to the Python/Pandas UDF Profiler introduced in Spark 3.3: [https://spark.apache.org/docs/latest/api/python/development/debugging.html#python-pandas-udf] was: This proposes to add SQLMetrics instrumentation for Python UDF. This is aimed at improving monitoring and performance troubleshooting of Python code called by Spark, via UDF, Pandas UDF or with MapPartitions. The introduced metrics are exposed to the end users via the WebUI interface, in the SQL tab for execution steps related to Python UDF execution. The scope of this has been limited to Pandas UDFs and related operations, namely: ArrowEvalPython, AggregateInPandas, FlatMapGroupsInPandas, MapInPandas, FlatMapCoGroupsInPandas, PythonMapInArrow, WindowInPandas. See also the attached screenshot. > Instrument Python UDF execution using SQL Metrics > - > > Key: SPARK-34265 > URL: https://issues.apache.org/jira/browse/SPARK-34265 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.2.0 >Reporter: Luca Canali >Priority: Minor > Attachments: PandasUDF_ArrowEvalPython_Metrics.png, > PythonSQLMetrics_Jira_Picture.png, proposed_Python_SQLmetrics_v20210128.png > > > This proposes to add SQLMetrics instrumentation for Python UDF.
> This is aimed at improving monitoring and performance troubleshooting of > Python UDFs and Pandas UDFs, including the use of MapPartitions and > MapInArrow. > The introduced metrics are exposed to the end users via the metrics system > and are visible through the WebUI interface, in the SQL/DataFrame tab for > execution steps related to Python UDF execution. See also the attached > screenshots. > This instrumentation is lightweight and can be used in production and for > monitoring. It is complementary to the Python/Pandas UDF Profiler introduced > in Spark 3.3: > [https://spark.apache.org/docs/latest/api/python/development/debugging.html#python-pandas-udf] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
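The kind of lightweight instrumentation SPARK-34265 proposes can be sketched generically. The wrapper below is a hypothetical stand-in, not Spark's SQLMetrics API: it accumulates the two quantities such a metric would typically surface for a row-level UDF, rows processed and elapsed execution time.

```python
import time

def with_metrics(udf, metrics):
    """Hypothetical sketch: wrap a row-level UDF and record rows + elapsed time.
    Not Spark's SQLMetrics; just the accounting idea in plain Python."""
    def wrapper(value):
        start = time.perf_counter()
        result = udf(value)
        metrics["rows"] += 1                              # rows processed
        metrics["time_s"] += time.perf_counter() - start  # time in Python code
        return result
    return wrapper

metrics = {"rows": 0, "time_s": 0.0}
double = with_metrics(lambda v: v * 2, metrics)
results = [double(v) for v in range(5)]   # [0, 2, 4, 6, 8]
```

In Spark, the analogous counters are maintained on the JVM side around the Python worker round-trip, which is what keeps the overhead low enough for production use.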
[jira] [Commented] (SPARK-40124) Update TPCDS v1.4 q32 for Plan Stability tests
[ https://issues.apache.org/jira/browse/SPARK-40124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585297#comment-17585297 ] Apache Spark commented on SPARK-40124: -- User 'mskapilks' has created a pull request for this issue: https://github.com/apache/spark/pull/37675 > Update TPCDS v1.4 q32 for Plan Stability tests > -- > > Key: SPARK-40124 > URL: https://issues.apache.org/jira/browse/SPARK-40124 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Kapil Singh >Assignee: Kapil Singh >Priority: Major > Fix For: 3.4.0, 3.3.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40124) Update TPCDS v1.4 q32 for Plan Stability tests
[ https://issues.apache.org/jira/browse/SPARK-40124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585296#comment-17585296 ] Apache Spark commented on SPARK-40124: -- User 'mskapilks' has created a pull request for this issue: https://github.com/apache/spark/pull/37675 > Update TPCDS v1.4 q32 for Plan Stability tests > -- > > Key: SPARK-40124 > URL: https://issues.apache.org/jira/browse/SPARK-40124 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Kapil Singh >Assignee: Kapil Singh >Priority: Major > Fix For: 3.4.0, 3.3.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40156) url_decode() exposes a Java error
[ https://issues.apache.org/jira/browse/SPARK-40156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-40156: Assignee: ming95 > url_decode() exposes a Java error > - > > Key: SPARK-40156 > URL: https://issues.apache.org/jira/browse/SPARK-40156 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Serge Rielau >Assignee: ming95 >Priority: Major > > Given a badly encoded string, Spark returns a Java error. > It should instead return an ERROR_CLASS: > spark-sql> SELECT url_decode('http%3A%2F%2spark.apache.org'); > 22/08/20 17:17:20 ERROR SparkSQLDriver: Failed in [SELECT > url_decode('http%3A%2F%2spark.apache.org')] > java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in > escape (%) pattern - Error at index 1 in: "2s" > at java.base/java.net.URLDecoder.decode(URLDecoder.java:232) > at java.base/java.net.URLDecoder.decode(URLDecoder.java:142) > at > org.apache.spark.sql.catalyst.expressions.UrlCodec$.decode(urlExpressions.scala:113) > at > org.apache.spark.sql.catalyst.expressions.UrlCodec.decode(urlExpressions.scala) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40156) url_decode() exposes a Java error
[ https://issues.apache.org/jira/browse/SPARK-40156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-40156. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37636 [https://github.com/apache/spark/pull/37636] > url_decode() exposes a Java error > - > > Key: SPARK-40156 > URL: https://issues.apache.org/jira/browse/SPARK-40156 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Serge Rielau >Assignee: ming95 >Priority: Major > Fix For: 3.4.0 > > > Given a badly encoded string, Spark returns a Java error. > It should instead return an ERROR_CLASS: > spark-sql> SELECT url_decode('http%3A%2F%2spark.apache.org'); > 22/08/20 17:17:20 ERROR SparkSQLDriver: Failed in [SELECT > url_decode('http%3A%2F%2spark.apache.org')] > java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in > escape (%) pattern - Error at index 1 in: "2s" > at java.base/java.net.URLDecoder.decode(URLDecoder.java:232) > at java.base/java.net.URLDecoder.decode(URLDecoder.java:142) > at > org.apache.spark.sql.catalyst.expressions.UrlCodec$.decode(urlExpressions.scala:113) > at > org.apache.spark.sql.catalyst.expressions.UrlCodec.decode(urlExpressions.scala) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
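The failure mode in SPARK-40156 is easy to reproduce outside the JVM. Below is a strict decoder sketch in pure Python (a hypothetical stand-in; Spark's `url_decode` actually delegates to `java.net.URLDecoder`): a `%` must be followed by exactly two hex digits, so the `%2s` produced by the malformed `%2F%2spark...` input is rejected, which is the condition that surfaces as the `IllegalArgumentException` above.

```python
import re

HEX2 = re.compile(r"[0-9A-Fa-f]{2}")

def strict_url_decode(s: str) -> str:
    """Hypothetical strict percent-decoder mirroring URLDecoder's validation."""
    out, i = bytearray(), 0
    while i < len(s):
        if s[i] == "%":
            pair = s[i + 1:i + 3]
            if len(pair) != 2 or not HEX2.fullmatch(pair):
                # java.net.URLDecoder raises IllegalArgumentException here
                raise ValueError(
                    f"Illegal hex characters in escape (%) pattern: {pair!r}")
            out.append(int(pair, 16))
            i += 3
        else:
            out.extend(s[i].encode("utf-8"))
            i += 1
    return out.decode("utf-8")

decoded = strict_url_decode("http%3A%2F%2Fspark.apache.org")  # well-formed input
try:
    strict_url_decode("http%3A%2F%2spark.apache.org")  # bad escape "%2s"
    bad_escape_rejected = False
except ValueError:
    bad_escape_rejected = True
```

The fix for the ticket is not to change this validation but to wrap the Java exception in a proper Spark error class so users see a SQL-level error instead of a raw stack trace.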
[jira] [Updated] (SPARK-39931) Improve performance of applyInPandas for very small groups
[ https://issues.apache.org/jira/browse/SPARK-39931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enrico Minack updated SPARK-39931: -- Description: Calling {{DataFrame.groupby(...).applyInPandas(...)}} for very small groups in PySpark is very slow. The reason is that for each group, PySpark creates a Pandas DataFrame and calls into the Python code. For very small groups, the overhead is huge; for large groups, it is smaller. Here is a benchmark (seconds to {{groupBy(...).applyInPandas(...)}} 10m rows): ||groupSize||Scala||pyspark.sql||pyspark.pandas|| |1024|8.9|16.2|7.8| |512|9.4|26.7|9.8| |256|9.3|44.5|20.2| |128|9.5|82.7|48.8| |64|9.5|158.2|91.9| |32|9.6|319.8|207.3| |16|9.6|652.6|261.5| |8|9.5|1,376|663.0| |4|9.8|2,656|1,168| |2|10.4|5,412|2,456| |1|11.3|8,162|4,642| *Idea to overcome this* is to call into the Python side with a Pandas DataFrame that contains potentially multiple groups, then perform a Pandas {{DataFrame.groupBy(...).apply(...)}} or provide the {{DataFrameGroupBy}} to the Python method. With large groups, that Pandas DataFrame has all rows of a single group; with small groups, it contains many groups. This should improve efficiency. was: Calling {{DataFrame.groupby(...).applyInPandas(...)}} for very small groups in PySpark is very slow. The reason is that for each group, PySpark creates a Pandas DataFrame and calls into the Python code. For very small groups, the overhead is huge; for large groups, it is smaller.
Here is a benchmark (seconds to {{groupBy(...).applyInPandas(...)}} 10m rows): ||groupSize||Scala||pyspark.sql||pyspark.pandas|| |1024|8.9|20.9|7.8| |512|9.4|31.8|9.8| |256|9.3|47.0|20.2| |128|9.5|83.3|48.8| |64|9.5|137.8|91.9| |32|9.6|263.6|207.3| |16|9.6|525.9|261.5| |8|9.5|1,043|663.0| |4|9.8|2,073|1,168| |2|10.4|4,132|2,456| |1|11.3|8,162|4,642| *Idea to overcome this* is to call into the Python side with a Pandas DataFrame that contains potentially multiple groups, then perform a Pandas {{DataFrame.groupBy(...).apply(...)}} or provide the {{DataFrameGroupBy}} to the Python method. With large groups, that Pandas DataFrame has all rows of a single group; with small groups, it contains many groups. This should improve efficiency. > Improve performance of applyInPandas for very small groups > -- > > Key: SPARK-39931 > URL: https://issues.apache.org/jira/browse/SPARK-39931 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Enrico Minack >Priority: Major > > Calling {{DataFrame.groupby(...).applyInPandas(...)}} for very small groups > in PySpark is very slow. The reason is that for each group, PySpark creates a > Pandas DataFrame and calls into the Python code. For very small groups, the > overhead is huge; for large groups, it is smaller. > Here is a benchmark (seconds to {{groupBy(...).applyInPandas(...)}} 10m > rows): > ||groupSize||Scala||pyspark.sql||pyspark.pandas|| > |1024|8.9|16.2|7.8| > |512|9.4|26.7|9.8| > |256|9.3|44.5|20.2| > |128|9.5|82.7|48.8| > |64|9.5|158.2|91.9| > |32|9.6|319.8|207.3| > |16|9.6|652.6|261.5| > |8|9.5|1,376|663.0| > |4|9.8|2,656|1,168| > |2|10.4|5,412|2,456| > |1|11.3|8,162|4,642| > *Idea to overcome this* is to call into the Python side with a Pandas DataFrame > that contains potentially multiple groups, then perform a Pandas > {{DataFrame.groupBy(...).apply(...)}} or provide the {{DataFrameGroupBy}} to > the Python method.
With large groups, that Pandas DataFrame has all rows of > a single group; with small groups, it contains many groups. This should > improve efficiency. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40231) Add 1TB TPCDS Plan stability tests
[ https://issues.apache.org/jira/browse/SPARK-40231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40231: Assignee: (was: Apache Spark) > Add 1TB TPCDS Plan stability tests > -- > > Key: SPARK-40231 > URL: https://issues.apache.org/jira/browse/SPARK-40231 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Kapil Singh >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40231) Add 1TB TPCDS Plan stability tests
[ https://issues.apache.org/jira/browse/SPARK-40231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585279#comment-17585279 ] Apache Spark commented on SPARK-40231: -- User 'mskapilks' has created a pull request for this issue: https://github.com/apache/spark/pull/37674 > Add 1TB TPCDS Plan stability tests > -- > > Key: SPARK-40231 > URL: https://issues.apache.org/jira/browse/SPARK-40231 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Kapil Singh >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40231) Add 1TB TPCDS Plan stability tests
[ https://issues.apache.org/jira/browse/SPARK-40231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585277#comment-17585277 ] Apache Spark commented on SPARK-40231: -- User 'mskapilks' has created a pull request for this issue: https://github.com/apache/spark/pull/37674 > Add 1TB TPCDS Plan stability tests > -- > > Key: SPARK-40231 > URL: https://issues.apache.org/jira/browse/SPARK-40231 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Kapil Singh >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40231) Add 1TB TPCDS Plan stability tests
[ https://issues.apache.org/jira/browse/SPARK-40231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40231: Assignee: Apache Spark > Add 1TB TPCDS Plan stability tests > -- > > Key: SPARK-40231 > URL: https://issues.apache.org/jira/browse/SPARK-40231 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Kapil Singh >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38934) Provider TemporaryAWSCredentialsProvider has no credentials
[ https://issues.apache.org/jira/browse/SPARK-38934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585275#comment-17585275 ] Steve Loughran commented on SPARK-38934: [~graceee318] try explicitly setting the aws secrets as per-bucket options. If the bucket is called "my-bucket", set them in your config via these options {code} spark.hadoop.fs.s3a.bucket.my-bucket.access.key spark.hadoop.fs.s3a.bucket.my-bucket.secret.key spark.hadoop.fs.s3a.bucket.my-bucket.session.token {code} This will stop any env var propagation from interfering with the values for the bucket. > Provider TemporaryAWSCredentialsProvider has no credentials > --- > > Key: SPARK-38934 > URL: https://issues.apache.org/jira/browse/SPARK-38934 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core >Affects Versions: 3.2.1 >Reporter: Lily >Priority: Major > > > We are using Jupyter Hub on K8s as a notebook-based development environment > and Spark on K8s as a backend cluster of Jupyter Hub on K8s with Spark 3.2.1 > and Hadoop 3.3.1. > When we run code like the one below in the Jupyter Hub on K8s, > > {code:java} > val perm = ... // get AWS temporary credential by AWS STS from AWS assumed > role > // set AWS temporary credential > spark.sparkContext.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", > "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider") > spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", > perm.credential.accessKeyID) > spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", > perm.credential.secretAccessKey) > spark.sparkContext.hadoopConfiguration.set("fs.s3a.session.token", > perm.credential.sessionToken) > // execute simple Spark action > spark.read.format("parquet").load("s3a:///*").show(1) {code} > > > the first few executors left a warning like the one below in the first code > execution, but we were able to get the proper result thanks to Spark's task > retry function.
> {code:java} > 22/04/18 09:13:50 WARN TaskSetManager: Lost task 2.0 in stage 0.0 (TID 2) > (10.197.5.15 executor 1): java.nio.file.AccessDeniedException: > s3a:///.parquet: > org.apache.hadoop.fs.s3a.CredentialInitializationException: Provider > TemporaryAWSCredentialsProvider has no credentials > at > org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:206) > at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:117) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.listStatus(S3AFileSystem.java:2810) > at > org.apache.spark.util.HadoopFSUtils$.listLeafFiles(HadoopFSUtils.scala:225) > at > org.apache.spark.util.HadoopFSUtils$.$anonfun$parallelListLeafFilesInternal$6(HadoopFSUtils.scala:136) > at scala.collection.immutable.Stream.map(Stream.scala:418) > at > org.apache.spark.util.HadoopFSUtils$.$anonfun$parallelListLeafFilesInternal$4(HadoopFSUtils.scala:126) > at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:863) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:863) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:131) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > 
at java.base/java.lang.Thread.run(Thread.java:829) > Caused by: org.apache.hadoop.fs.s3a.CredentialInitializationException: > Provider TemporaryAWSCredentialsProvider has no credentials > at > org.apache.hadoop.fs.s3a.auth.AbstractSessionCredentialsProvider.getCredentials(AbstractSessionCredentialsProvider.java:130) > at > org.apache.hadoop.fs.s3a.AWSCredentialProviderList.getCredentials(AWSCredentialProviderList.java:177) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.getCredentialsFromContext(AmazonHttpClient.java:1266) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.runBeforeRe
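Steve's per-bucket suggestion above can be expressed as a small config-builder sketch. The helper function is hypothetical, but the `spark.hadoop.fs.s3a.bucket.<name>.*` key pattern is the real S3A per-bucket configuration mechanism; scoping the credential provider per bucket is shown as an additional assumption.

```python
def per_bucket_s3a_options(bucket, access_key, secret_key, session_token):
    """Hypothetical helper: build the per-bucket S3A credential options
    Steve suggests, keyed so env-var propagation cannot override them."""
    prefix = f"spark.hadoop.fs.s3a.bucket.{bucket}."
    return {
        prefix + "access.key": access_key,
        prefix + "secret.key": secret_key,
        prefix + "session.token": session_token,
        # scoping the provider per bucket as well (assumption, not from the thread)
        prefix + "aws.credentials.provider":
            "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
    }

# placeholder values, not real credentials
opts = per_bucket_s3a_options("my-bucket", "ACCESS", "SECRET", "TOKEN")
```

Setting these keys on the Spark config at session build time (rather than mutating `hadoopConfiguration` afterwards) means they reach every executor unchanged, which is the point of the per-bucket form.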
[jira] [Created] (SPARK-40231) Add 1TB TPCDS Plan stability tests
Kapil Singh created SPARK-40231: --- Summary: Add 1TB TPCDS Plan stability tests Key: SPARK-40231 URL: https://issues.apache.org/jira/browse/SPARK-40231 Project: Spark Issue Type: Task Components: SQL Affects Versions: 3.4.0 Reporter: Kapil Singh -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39931) Improve performance of applyInPandas for very small groups
[ https://issues.apache.org/jira/browse/SPARK-39931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enrico Minack updated SPARK-39931: -- Description: Calling {{DataFrame.groupby(...).applyInPandas(...)}} for very small groups in PySpark is very slow. The reason is that for each group, PySpark creates a Pandas DataFrame and calls into the Python code. For very small groups, the overhead is huge; for large groups, it is smaller. Here is a benchmark (seconds to {{groupBy(...).applyInPandas(...)}} 10m rows): ||groupSize||Scala||pyspark.sql||pyspark.pandas|| |1024|8.9|20.9|7.8| |512|9.4|31.8|9.8| |256|9.3|47.0|20.2| |128|9.5|83.3|48.8| |64|9.5|137.8|91.9| |32|9.6|263.6|207.3| |16|9.6|525.9|261.5| |8|9.5|1,043|663.0| |4|9.8|2,073|1,168| |2|10.4|4,132|2,456| |1|11.3|8,162|4,642| *Idea to overcome this* is to call into the Python side with a Pandas DataFrame that contains potentially multiple groups, then perform a Pandas {{DataFrame.groupBy(...).apply(...)}} or provide the {{DataFrameGroupBy}} to the Python method. With large groups, that Pandas DataFrame has all rows of a single group; with small groups, it contains many groups. This should improve efficiency. was: Calling `DataFrame.groupby(...).applyInPandas(...)` for very small groups in PySpark is very slow. The reason is that for each group, PySpark creates a Pandas DataFrame and calls into the Python code. For very small groups, the overhead is huge; for large groups, it is smaller. Here is a benchmark (seconds to groupBy(...).applyInPandas(...)
10m rows): ||groupSize||Scala||pyspark.sql||pyspark.pandas|| |1024|8.9|20.9|7.8| |512|9.4|31.8|9.8| |256|9.3|47.0|20.2| |128|9.5|83.3|48.8| |64|9.5|137.8|91.9| |32|9.6|263.6|207.3| |16|9.6|525.9|261.5| |8|9.5|1,043|663.0| |4|9.8|2,073|1,168| |2|10.4|4,132|2,456| |1|11.3|8,162|4,642| *Idea to overcome this* is to call into the Python side with a Pandas DataFrame that contains potentially multiple groups, then perform a Pandas {{DataFrame.groupBy(...).apply(...)}} or provide the {{DataFrameGroupBy}} to the Python method. With large groups, that Pandas DataFrame has all rows of a single group; with small groups, it contains many groups. This should improve efficiency. > Improve performance of applyInPandas for very small groups > -- > > Key: SPARK-39931 > URL: https://issues.apache.org/jira/browse/SPARK-39931 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Enrico Minack >Priority: Major > > Calling {{DataFrame.groupby(...).applyInPandas(...)}} for very small groups > in PySpark is very slow. The reason is that for each group, PySpark creates a > Pandas DataFrame and calls into the Python code. For very small groups, the > overhead is huge; for large groups, it is smaller. > Here is a benchmark (seconds to {{groupBy(...).applyInPandas(...)}} 10m > rows): > ||groupSize||Scala||pyspark.sql||pyspark.pandas|| > |1024|8.9|20.9|7.8| > |512|9.4|31.8|9.8| > |256|9.3|47.0|20.2| > |128|9.5|83.3|48.8| > |64|9.5|137.8|91.9| > |32|9.6|263.6|207.3| > |16|9.6|525.9|261.5| > |8|9.5|1,043|663.0| > |4|9.8|2,073|1,168| > |2|10.4|4,132|2,456| > |1|11.3|8,162|4,642| > *Idea to overcome this* is to call into the Python side with a Pandas DataFrame > that contains potentially multiple groups, then perform a Pandas > {{DataFrame.groupBy(...).apply(...)}} or provide the {{DataFrameGroupBy}} to > the Python method. With large groups, that Pandas DataFrame has all rows of > a single group; with small groups, it contains many groups.
This should > improve efficiency. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39931) Improve performance of applyInPandas for very small groups
[ https://issues.apache.org/jira/browse/SPARK-39931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enrico Minack updated SPARK-39931: -- Description: Calling `DataFrame.groupby(...).applyInPandas(...)` for very small groups in PySpark is very slow. The reason is that for each group, PySpark creates a Pandas DataFrame and calls into the Python code. For very small groups, the overhead is huge; for large groups, it is smaller. Here is a benchmark (seconds to groupBy(...).applyInPandas(...) 10m rows): ||groupSize||Scala||pyspark.sql||pyspark.pandas|| |1024|8.9|20.9|7.8| |512|9.4|31.8|9.8| |256|9.3|47.0|20.2| |128|9.5|83.3|48.8| |64|9.5|137.8|91.9| |32|9.6|263.6|207.3| |16|9.6|525.9|261.5| |8|9.5|1,043|663.0| |4|9.8|2,073|1,168| |2|10.4|4,132|2,456| |1|11.3|8,162|4,642| *Idea to overcome this* is to call into the Python side with a Pandas DataFrame that contains potentially multiple groups, then perform a Pandas {{DataFrame.groupBy(...).apply(...)}} or provide the {{DataFrameGroupBy}} to the Python method. With large groups, that Pandas DataFrame has all rows of a single group; with small groups, it contains many groups. This should improve efficiency. was: Calling `DataFrame.groupby(...).applyInPandas(...)` for very small groups in PySpark is very slow. The reason is that for each group, PySpark creates a Pandas DataFrame and calls into the Python code. For very small groups, the overhead is huge; for large groups, it is smaller. Here is a benchmark (seconds to groupBy(...).applyInPandas(...) 10m rows): ||groupSize||Scala||pyspark.sql||pyspark.pandas|| |1024|8.9|20.9|7.8| |512|9.4|31.8|9.8| |256|9.3|47.0|20.2| |128|9.5|83.3|48.8| |64|9.5|137.8|91.9| |32|9.6|263.6|207.3| |16|9.6|525.9|261.5| |8|9.5|1,043|663.0| |4|9.8|2,073|1,168| |2|10.4|4,132|2,456| |1|11.3|8,162|4,642| *Idea to overcome this* is to call into the Python side with a Pandas DataFrame that contains potentially multiple groups, then perform a Pandas DataFrame.groupBy(...).apply(...).
With large groups, that Pandas DataFrame has all rows of a single group; with small groups it contains many groups. This should improve efficiency. I have prepared a PoC to benchmark that idea but am struggling to massage the internal rows before sending them to Python. I have the key and the grouped values of each row as {{InternalRow}}s and want to turn them into a single {{InternalRow}} in such a way that in Python I can easily group by the key and get the same grouped values. For this, I think adding the key as a single column (struct) while leaving the grouped values as is should work best. But I have not found a way to get this working. > Improve performance of applyInPandas for very small groups > -- > > Key: SPARK-39931 > URL: https://issues.apache.org/jira/browse/SPARK-39931 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Enrico Minack >Priority: Major > > Calling `DataFrame.groupby(...).applyInPandas(...)` for very small groups in > PySpark is very slow. The reason is that for each group, PySpark creates a > Pandas DataFrame and calls into the Python code. For very small groups, the > overhead is huge; for large groups, it is smaller. > Here are benchmarks (seconds to groupBy(...).applyInPandas(...) 10m rows): > ||groupSize||Scala||pyspark.sql||pyspark.pandas|| > |1024|8.9|20.9|7.8| > |512|9.4|31.8|9.8| > |256|9.3|47.0|20.2| > |128|9.5|83.3|48.8| > |64|9.5|137.8|91.9| > |32|9.6|263.6|207.3| > |16|9.6|525.9|261.5| > |8|9.5|1,043|663.0| > |4|9.8|2,073|1,168| > |2|10.4|4,132|2,456| > |1|11.3|8,162|4,642| > *Idea to overcome this* is to call into the Python side with a Pandas DataFrame > that contains potentially multiple groups, then perform a Pandas > {{DataFrame.groupBy(...).apply(...)}} or provide the {{DataFrameGroupBy}} to > the Python method. With large groups, that Pandas DataFrame has all rows of > a single group; with small groups it contains many groups. This should > improve efficiency. 
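The batching idea can be sketched on the Python side with plain pandas (a hypothetical illustration of the proposal, not the actual PySpark internals; column names `k` and `v` and the function `per_group` are made up for the example):

```python
# Sketch of the proposed batching idea: instead of shipping one tiny
# pandas DataFrame per group to Python, ship one DataFrame that holds
# many groups and group it on the Python side.
import pandas as pd

def per_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # user function, still called once per group
    return pdf.assign(total=pdf["v"].sum())

# one batch containing several small groups
batch = pd.DataFrame({"k": [1, 1, 2, 3], "v": [10, 20, 30, 40]})

# a single Python round-trip handles every group in the batch
out = batch.groupby("k", group_keys=False).apply(per_group)
```

The per-call overhead (serialization, Python invocation) is then paid once per batch rather than once per group, which is exactly where the small-group benchmark numbers above suffer.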
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-38934) Provider TemporaryAWSCredentialsProvider has no credentials
[ https://issues.apache.org/jira/browse/SPARK-38934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran reopened SPARK-38934: > Provider TemporaryAWSCredentialsProvider has no credentials > --- > > Key: SPARK-38934 > URL: https://issues.apache.org/jira/browse/SPARK-38934 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core >Affects Versions: 3.2.1 >Reporter: Lily >Priority: Major > > > We are using Jupyter Hub on K8s as a notebook-based development environment > and Spark on K8s as a backend cluster of Jupyter Hub on K8s, with Spark 3.2.1 > and Hadoop 3.3.1. > When we run code like the one below in Jupyter Hub on K8s, > > {code:java} > val perm = ... // get AWS temporary credential by AWS STS from AWS assumed > role > // set AWS temporary credential > spark.sparkContext.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", > "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider") > spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", > perm.credential.accessKeyID) > spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", > perm.credential.secretAccessKey) > spark.sparkContext.hadoopConfiguration.set("fs.s3a.session.token", > perm.credential.sessionToken) > // execute simple Spark action > spark.read.format("parquet").load("s3a:///*").show(1) {code} > > > the first few executors left a warning like the one below on the first code > execution, but we were able to get the proper result thanks to Spark's task > retry function. 
> {code:java} > 22/04/18 09:13:50 WARN TaskSetManager: Lost task 2.0 in stage 0.0 (TID 2) > (10.197.5.15 executor 1): java.nio.file.AccessDeniedException: > s3a:///.parquet: > org.apache.hadoop.fs.s3a.CredentialInitializationException: Provider > TemporaryAWSCredentialsProvider has no credentials > at > org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:206) > at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:117) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.listStatus(S3AFileSystem.java:2810) > at > org.apache.spark.util.HadoopFSUtils$.listLeafFiles(HadoopFSUtils.scala:225) > at > org.apache.spark.util.HadoopFSUtils$.$anonfun$parallelListLeafFilesInternal$6(HadoopFSUtils.scala:136) > at scala.collection.immutable.Stream.map(Stream.scala:418) > at > org.apache.spark.util.HadoopFSUtils$.$anonfun$parallelListLeafFilesInternal$4(HadoopFSUtils.scala:126) > at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:863) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:863) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:131) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > 
at java.base/java.lang.Thread.run(Thread.java:829) > Caused by: org.apache.hadoop.fs.s3a.CredentialInitializationException: > Provider TemporaryAWSCredentialsProvider has no credentials > at > org.apache.hadoop.fs.s3a.auth.AbstractSessionCredentialsProvider.getCredentials(AbstractSessionCredentialsProvider.java:130) > at > org.apache.hadoop.fs.s3a.AWSCredentialProviderList.getCredentials(AWSCredentialProviderList.java:177) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.getCredentialsFromContext(AmazonHttpClient.java:1266) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.runBeforeRequestHandlers(AmazonHttpClient.java:842) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:792) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:779) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:753) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:713) > a
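A mitigation often suggested for this class of report is to pass the credentials as `spark.hadoop.*` properties at submit or session-build time, so executors see them from the start instead of relying on driver-side `hadoopConfiguration.set(...)` calls made after startup. A hedged sketch of the equivalent settings as plain configuration key/value pairs (placeholder credential values; whether this avoids the warning in this particular environment is unverified):

```python
# Hedged sketch: the same fs.s3a.* options expressed as spark.hadoop.*
# properties, as one would pass them via --conf or SparkSession.builder.
# Credential values are placeholders; the real ones come from AWS STS.
conf = {
    "spark.hadoop.fs.s3a.aws.credentials.provider":
        "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
    "spark.hadoop.fs.s3a.access.key": "<access-key-id>",        # placeholder
    "spark.hadoop.fs.s3a.secret.key": "<secret-access-key>",    # placeholder
    "spark.hadoop.fs.s3a.session.token": "<session-token>",     # placeholder
}
```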
[jira] [Commented] (SPARK-38934) Provider TemporaryAWSCredentialsProvider has no credentials
[ https://issues.apache.org/jira/browse/SPARK-38934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585268#comment-17585268 ] Steve Loughran commented on SPARK-38934: Staring at this some more: there are enough occurrences of this with Spark alone that maybe there is a problem, not with the s3a code, but with how Spark passes configurations around, including env var adoption.
[jira] [Updated] (SPARK-40230) Executor connection issue in hybrid cloud deployment
[ https://issues.apache.org/jira/browse/SPARK-40230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gleb Abroskin updated SPARK-40230: -- Environment: About the k8s setup: * 6+ nodes in AWS * 4 nodes in DC Spark 3.2.1 + spark-hadoop-cloud 3.2.1 {code:java} JAVA_HOME=/Users/gleb.abroskin/Library/Java/JavaVirtualMachines/corretto-11.0.13/Contents/Home spark-submit \ --master k8s://https://kubemaster:6443 \ --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:///opt/spark/jars/log4j-trace.properties" \ --conf spark.submit.deployMode=cluster \ --conf spark.kubernetes.namespace=ml \ --conf spark.kubernetes.container.image=SPARK_WITH_TRACE_LOGS_BAKED_IN \ --conf spark.kubernetes.container.image.pullSecrets=aws-ecr \ --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-submitter \ --conf spark.kubernetes.authenticate.submission.oauthToken=XXX \ --conf spark.kubernetes.executor.podNamePrefix=ds-224-executor-test \ --conf spark.kubernetes.file.upload.path=s3a://ifunny-ml-data/dev/spark \ --conf "spark.hadoop.fs.s3a.access.key=XXX" \ --conf "spark.hadoop.fs.s3a.secret.key=XXX" \ --conf spark.hadoop.fs.s3a.endpoint=s3.us-east-1.amazonaws.com \ --conf "spark.kubernetes.driverEnv.AWS_ACCESS_KEY_ID=XXX" \ --conf "spark.kubernetes.driverEnv.AWS_SECRET_ACCESS_KEY=XXX" \ --conf spark.sql.shuffle.partitions=500 \ --num-executors 100 \ --driver-java-options="-Dlog4j.configuration=file:///opt/spark/jars/log4j-trace.properties" \ --name k8s-pyspark-test \ main.py{code} main.py is just pi.py from the examples, modified to work on 100 machines (this is a way to make sure executors are deployed in both AWS & DC) {code:java} import sys from random import random from operator import add from pyspark.sql import SparkSession if __name__ == "__main__": spark = SparkSession \ .builder \ .appName("PythonPi") \ .getOrCreate() spark.sparkContext.setLogLevel("TRACE") partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 100 n = 1000 * partitions 
def f(_: int) -> float: x = random() * 2 - 1 y = random() * 2 - 1 return 1 if x ** 2 + y ** 2 <= 1 else 0 count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add) print("Pi is roughly %f" % (4.0 * count / n)) {code} was: About the k8s setup: * 6+ nodes in AWS * 4 nodes in DC Spark 3.2.1 + spark-hadoop-cloud 3.2.1 {code:java} JAVA_HOME=/Users/gleb.abroskin/Library/Java/JavaVirtualMachines/corretto-11.0.13/Contents/Home spark-submit \ --master k8s://https://ifunny-ml-kubemaster.ash1.fun.co:6443 \ --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:///opt/spark/jars/log4j-trace.properties" \ --conf spark.submit.deployMode=cluster \ --conf spark.kubernetes.namespace=ml \ --conf spark.kubernetes.container.image=SPARK_WITH_TRACE_LOGS_BAKED_IN \ --conf spark.kubernetes.container.image.pullSecrets=aws-ecr \ --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-submitter \ --conf spark.kubernetes.authenticate.submission.oauthToken=XXX \ --conf spark.kubernetes.executor.podNamePrefix=ds-224-executor-test \ --conf spark.kubernetes.file.upload.path=s3a://ifunny-ml-data/dev/spark \ --conf "spark.hadoop.fs.s3a.access.key=XXX" \ --conf "spark.hadoop.fs.s3a.secret.key=XXX" \ --conf spark.hadoop.fs.s3a.endpoint=s3.us-east-1.amazonaws.com \ --conf "spark.kubernetes.driverEnv.AWS_ACCESS_KEY_ID=XXX" \ --conf "spark.kubernetes.driverEnv.AWS_SECRET_ACCESS_KEY=XXX" \ --conf spark.sql.shuffle.partitions=500 \ --num-executors 100 \ --driver-java-options="-Dlog4j.configuration=file:///opt/spark/jars/log4j-trace.properties" \ --name k8s-pyspark-test \ main.py{code} main.py is just pi.py from the examples, modified to work on 100 machines (this is a way to make sure executors are deployed in both AWS & DC) {code:java} import sys from random import random from operator import add from pyspark.sql import SparkSession if __name__ == "__main__": spark = SparkSession \ .builder \ .appName("PythonPi") \ .getOrCreate() 
spark.sparkContext.setLogLevel("TRACE") partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 100 n = 1000 * partitions def f(_: int) -> float: x = random() * 2 - 1 y = random() * 2 - 1 return 1 if x ** 2 + y ** 2 <= 1 else 0 count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add) print("Pi is roughly %f" % (4.0 * count / n)) {code} > Executor connection issue in hybrid cloud deployment > > > Key: SPARK-40230 > URL: https://issues.apache.org/jira/browse/SPARK-40230 > Project: Spark > Issue Type: Bug > Components: Block Manager, Kubernetes >
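The flattened `main.py` above loses its line breaks in the email rendering; its Monte-Carlo kernel, restored as a standalone runnable sketch (the Spark driver boilerplate and the `parallelize(...).map(f).reduce(add)` call are omitted):

```python
# Monte-Carlo pi kernel from the main.py snippet above, restored to
# runnable form; the Spark-specific driver code is left out.
from random import random

def f(_: int) -> float:
    # sample a point in [-1, 1]^2 and count it if it lands in the unit circle
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 <= 1 else 0

def estimate_pi(n: int) -> float:
    # what parallelize(range(1, n + 1), partitions).map(f).reduce(add) computes
    count = sum(f(i) for i in range(1, n + 1))
    return 4.0 * count / n
```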
[jira] [Created] (SPARK-40230) Executor connection issue in hybrid cloud deployment
Gleb Abroskin created SPARK-40230: - Summary: Executor connection issue in hybrid cloud deployment Key: SPARK-40230 URL: https://issues.apache.org/jira/browse/SPARK-40230 Project: Spark Issue Type: Bug Components: Block Manager, Kubernetes Affects Versions: 3.2.1 Environment: About the k8s setup: * 6+ nodes in AWS * 4 nodes in DC Spark 3.2.1 + spark-hadoop-cloud 3.2.1 {code:java} JAVA_HOME=/Users/gleb.abroskin/Library/Java/JavaVirtualMachines/corretto-11.0.13/Contents/Home spark-submit \ --master k8s://https://ifunny-ml-kubemaster.ash1.fun.co:6443 \ --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:///opt/spark/jars/log4j-trace.properties" \ --conf spark.submit.deployMode=cluster \ --conf spark.kubernetes.namespace=ml \ --conf spark.kubernetes.container.image=SPARK_WITH_TRACE_LOGS_BAKED_IN \ --conf spark.kubernetes.container.image.pullSecrets=aws-ecr \ --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-submitter \ --conf spark.kubernetes.authenticate.submission.oauthToken=XXX \ --conf spark.kubernetes.executor.podNamePrefix=ds-224-executor-test \ --conf spark.kubernetes.file.upload.path=s3a://ifunny-ml-data/dev/spark \ --conf "spark.hadoop.fs.s3a.access.key=XXX" \ --conf "spark.hadoop.fs.s3a.secret.key=XXX" \ --conf spark.hadoop.fs.s3a.endpoint=s3.us-east-1.amazonaws.com \ --conf "spark.kubernetes.driverEnv.AWS_ACCESS_KEY_ID=XXX" \ --conf "spark.kubernetes.driverEnv.AWS_SECRET_ACCESS_KEY=XXX" \ --conf spark.sql.shuffle.partitions=500 \ --num-executors 100 \ --driver-java-options="-Dlog4j.configuration=file:///opt/spark/jars/log4j-trace.properties" \ --name k8s-pyspark-test \ main.py{code} main.py is just pi.py from the examples, modified to work on 100 machines (this is a way to make sure executors are deployed in both AWS & DC) {code:java} import sys from random import random from operator import add from pyspark.sql import SparkSession if __name__ == "__main__": spark = SparkSession \ .builder \ .appName("PythonPi") \ 
.getOrCreate() spark.sparkContext.setLogLevel("TRACE") partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 100 n = 1000 * partitions def f(_: int) -> float: x = random() * 2 - 1 y = random() * 2 - 1 return 1 if x ** 2 + y ** 2 <= 1 else 0 count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add) print("Pi is roughly %f" % (4.0 * count / n)) {code} Reporter: Gleb Abroskin I understand that the issue is quite subtle and might be hard to debug; still, I was not able to find an issue with our infra, so I guess it is something inside Spark. We deploy the Spark application in k8s and everything works well if all the driver & executor pods are either in AWS or in our DC, but when they are split between datacenters something strange happens. For example, the logs of one of the executors inside the DC: {code:java} 22/08/26 07:55:35 INFO TransportClientFactory: Successfully created connection to /172.19.149.92:39414 after 50 ms (1 ms spent in bootstraps) 22/08/26 07:55:35 TRACE TransportClient: Sending RPC to /172.19.149.92:39414 22/08/26 07:55:35 TRACE TransportClient: Sending request RPC 4860401977118244334 to /172.19.149.92:39414 took 3 ms 22/08/26 07:55:35 DEBUG TransportClient: Sending fetch chunk request 0 to /172.19.149.92:39414 22/08/26 07:55:35 TRACE TransportClient: Sending request StreamChunkId[streamId=1644979023003,chunkIndex=0] to /172.19.149.92:39414 took 0 ms 22/08/26 07:57:35 ERROR TransportChannelHandler: Connection to /172.19.149.92:39414 has been quiet for 12 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.shuffle.io.connectionTimeout if this is wrong. {code} The executor successfully creates the connection & sends the request, but the connection was assumed dead. 
Even stranger, the executor on IP 172.19.149.92 has sent the response back, which I can confirm with the following logs {code:java} 22/08/26 07:55:35 TRACE MessageDecoder: Received message ChunkFetchRequest: ChunkFetchRequest[streamChunkId=StreamChunkId[streamId=1644979023003,chunkIndex=0]] 22/08/26 07:55:35 TRACE ChunkFetchRequestHandler: Received req from /172.19.123.197:37626 to fetch block StreamChunkId[streamId=1644979023003,chunkIndex=0] 22/08/26 07:55:35 TRACE OneForOneStreamManager: Removing stream id 1644979023003 22/08/26 07:55:35 TRACE BlockInfoManager: Task -1024 releasing lock for broadcast_0_piece0 -- 22/08/26 07:55:35 TRACE BlockInfoManager: Task -1024 releasing lock for broadcast_0_piece0 22/08/26 07:55:35 TRACE ChunkFetchRequestHandler: Sent result ChunkFetchSuccess[streamChunkId=StreamChunkId[streamId=1644979023003,chunkIndex=0],buffer=org.apache.spark.storage.BlockManagerManagedBuffer@79b43e2a] to client /
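The timeouts the error message points at are ordinary Spark configuration keys. A hedged sketch of settings one might raise while debugging slow cross-datacenter links (illustrative values; this addresses the symptom, not whatever is dropping the response on the wire):

```python
# Hedged sketch: conf keys related to the "connection is dead" watchdog.
# Values are illustrative, not recommendations.
conf = {
    # the quiet-connection watchdog named in the error message
    "spark.shuffle.io.connectionTimeout": "600s",
    # retry shuffle fetches instead of failing the task outright
    "spark.shuffle.io.maxRetries": "10",
    "spark.shuffle.io.retryWait": "30s",
}
```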
[jira] [Resolved] (SPARK-38749) Test the error class: RENAME_SRC_PATH_NOT_FOUND
[ https://issues.apache.org/jira/browse/SPARK-38749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-38749. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37611 [https://github.com/apache/spark/pull/37611] > Test the error class: RENAME_SRC_PATH_NOT_FOUND > --- > > Key: SPARK-38749 > URL: https://issues.apache.org/jira/browse/SPARK-38749 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Minor > Labels: starter > Fix For: 3.4.0 > > > Add a test for the error class *RENAME_SRC_PATH_NOT_FOUND* to > QueryExecutionErrorsSuite. The test should cover the exception thrown in > QueryExecutionErrors: > {code:scala} > def renameSrcPathNotFoundError(srcPath: Path): Throwable = { > new SparkFileNotFoundException(errorClass = "RENAME_SRC_PATH_NOT_FOUND", > Array(srcPath.toString)) > } > {code} > For example, here is a test for the error class *UNSUPPORTED_FEATURE*: > https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170 > +The test must check:+ > # the entire error message > # sqlState if it is defined in the error-classes.json file > # the error class -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40228) Don't simplify multiLike if child is not attribute
[ https://issues.apache.org/jira/browse/SPARK-40228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40228: Assignee: (was: Apache Spark) > Don't simplify multiLike if child is not attribute > -- > > Key: SPARK-40228 > URL: https://issues.apache.org/jira/browse/SPARK-40228 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > > {code:scala} > sql("create table t1(name string) using parquet") > sql("select * from t1 where substr(name, 1, 5) like any('%a', 'b%', > '%c%')").explain(true) > {code} > {noformat} > == Physical Plan == > *(1) Filter ((EndsWith(substr(name#0, 1, 5), a) OR StartsWith(substr(name#0, > 1, 5), b)) OR Contains(substr(name#0, 1, 5), c)) > +- *(1) ColumnarToRow >+- FileScan parquet default.t1[name#0] Batched: true, DataFilters: > [((EndsWith(substr(name#0, 1, 5), a) OR StartsWith(substr(name#0, 1, 5), b)) > OR Contains(substr(n..., Format: Parquet, PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
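What the rewrite costs when the child is not an attribute can be sketched in plain Python (a semantic illustration, not the optimizer code): after simplification, the `substr(name, 1, 5)` expression is duplicated into every branch of the predicate, so it is evaluated once per pattern instead of once per row.

```python
# Plain-Python sketch of the rewritten predicate's semantics from the
# explain plan above; name[:5] stands in for substr(name, 1, 5).
def like_any_rewritten(name: str) -> bool:
    # the non-attribute child is computed three times, mirroring
    # EndsWith(...) OR StartsWith(...) OR Contains(...) in the plan
    return (name[:5].endswith("a")
            or name[:5].startswith("b")
            or "c" in name[:5])
```

Keeping the original `LIKE ANY` when the child is an arbitrary expression avoids this duplicated evaluation, which is the motivation stated in the title.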
[jira] [Assigned] (SPARK-40228) Don't simplify multiLike if child is not attribute
[ https://issues.apache.org/jira/browse/SPARK-40228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40228: Assignee: Apache Spark
[jira] [Commented] (SPARK-40228) Don't simplify multiLike if child is not attribute
[ https://issues.apache.org/jira/browse/SPARK-40228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585217#comment-17585217 ] Apache Spark commented on SPARK-40228: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/37672
[jira] [Commented] (SPARK-40229) Re-enable excel I/O test for pandas API on Spark.
[ https://issues.apache.org/jira/browse/SPARK-40229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585210#comment-17585210 ] Apache Spark commented on SPARK-40229: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/37671 > Re-enable excel I/O test for pandas API on Spark. > - > > Key: SPARK-40229 > URL: https://issues.apache.org/jira/browse/SPARK-40229 > Project: Spark > Issue Type: Test > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > Currently we're skipping the `read_excel` and `to_excel` tests for pandas API > on Spark, since `openpyxl` is not installed in our test environments. > We should enable these tests for better test coverage. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
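The skip-when-missing pattern such test suites typically use can be sketched as follows (hypothetical test name; the point is that installing `openpyxl` in the test environment makes the condition true, so the excel I/O tests stop being skipped):

```python
# Sketch of an "optional dependency" test guard: the excel tests run
# only when openpyxl is importable, and are skipped otherwise.
import importlib.util
import unittest

have_openpyxl = importlib.util.find_spec("openpyxl") is not None

class ExcelIOTest(unittest.TestCase):
    @unittest.skipUnless(have_openpyxl, "openpyxl not installed")
    def test_excel_roundtrip(self):
        ...  # a to_excel / read_excel round-trip would go here
```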
[jira] [Assigned] (SPARK-40229) Re-enable excel I/O test for pandas API on Spark.
[ https://issues.apache.org/jira/browse/SPARK-40229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40229: Assignee: Apache Spark
[jira] [Assigned] (SPARK-40229) Re-enable excel I/O test for pandas API on Spark.
[ https://issues.apache.org/jira/browse/SPARK-40229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40229: Assignee: (was: Apache Spark)
[jira] [Updated] (SPARK-40229) Re-enable excel I/O test for pandas API on Spark.
[ https://issues.apache.org/jira/browse/SPARK-40229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-40229: Description: Currently we're skipping the `read_excel` and `to_excel` test for pandas API on Spark, since the `openpyxl` is not installed in our test environments. We should enable this test for better test coverage. was: Currently we're skipping the read_excel test for pandas API on Spark, since the `openpyxl` is not installed in our test environments. We should enable this test for better test coverage.
[jira] [Updated] (SPARK-40229) Re-enable excel I/O test for pandas API on Spark.
[ https://issues.apache.org/jira/browse/SPARK-40229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-40229: Summary: Re-enable excel I/O test for pandas API on Spark. (was: Re-enable read_excel test for pandas API on Spark.)
[jira] [Updated] (SPARK-40229) Re-enable excel I/O test for pandas API on Spark.
[ https://issues.apache.org/jira/browse/SPARK-40229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-40229: Description: Currently we're skipping the `read_excel` and `to_excel` tests for pandas API on Spark, since `openpyxl` is not installed in our test environments. We should enable these tests for better test coverage. was: Currently we're skipping the `read_excel` and `to_excel` test for pandas API on Spark, since `openpyxl` is not installed in our test environments. We should enable this test for better test coverage. > Re-enable excel I/O test for pandas API on Spark. > - > > Key: SPARK-40229 > URL: https://issues.apache.org/jira/browse/SPARK-40229 > Project: Spark > Issue Type: Test > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > Currently we're skipping the `read_excel` and `to_excel` tests for pandas API > on Spark, since `openpyxl` is not installed in our test environments. > We should enable these tests for better test coverage.
[jira] [Created] (SPARK-40229) Re-enable read_excel test for pandas API on Spark.
Haejoon Lee created SPARK-40229: --- Summary: Re-enable read_excel test for pandas API on Spark. Key: SPARK-40229 URL: https://issues.apache.org/jira/browse/SPARK-40229 Project: Spark Issue Type: Test Components: Pandas API on Spark Affects Versions: 3.4.0 Reporter: Haejoon Lee Currently we're skipping the read_excel test for pandas API on Spark, since `openpyxl` is not installed in our test environments. We should enable this test for better test coverage.
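Enabling the test is mostly a matter of making the optional dependency available in the test environment; `openpyxl` is the engine pandas uses to read and write `.xlsx` files. For example, in a pip-based environment (a sketch; the actual Spark CI change may pin a specific version):

{code}
pip install openpyxl
{code}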
[jira] [Created] (SPARK-40228) Don't simplify multiLike if child is not attribute
Yuming Wang created SPARK-40228: --- Summary: Don't simplify multiLike if child is not attribute Key: SPARK-40228 URL: https://issues.apache.org/jira/browse/SPARK-40228 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Yuming Wang {code:scala} sql("create table t1(name string) using parquet") sql("select * from t1 where substr(name, 1, 5) like any('%a', 'b%', '%c%')").explain(true) {code} {noformat} == Physical Plan == *(1) Filter ((EndsWith(substr(name#0, 1, 5), a) OR StartsWith(substr(name#0, 1, 5), b)) OR Contains(substr(name#0, 1, 5), c)) +- *(1) ColumnarToRow +- FileScan parquet default.t1[name#0] Batched: true, DataFilters: [((EndsWith(substr(name#0, 1, 5), a) OR StartsWith(substr(name#0, 1, 5), b)) OR Contains(substr(n..., Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct {noformat}
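The physical plan above shows the problem: rewriting {{like any}} into EndsWith/StartsWith/Contains duplicates the non-attribute child, so {{substr(name, 1, 5)}} is evaluated once per pattern instead of once per row. A minimal sketch of the proposed guard, in the shape of Catalyst's LikeSimplification rule (hypothetical illustration, not the actual Spark code):

{code:scala}
// Hypothetical sketch: only simplify multiLike when the child is a bare
// attribute reference, so the rewrite cannot duplicate an arbitrary
// (possibly expensive) expression such as substr(name, 1, 5).
case l @ LikeAny(child, patterns) if child.isInstanceOf[Attribute] =>
  simplifyMultiLike(child, patterns, l)
// Non-attribute children are left as-is and evaluated once per row.
{code}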
[jira] [Resolved] (SPARK-40197) Replace query plan with context for MULTI_VALUE_SUBQUERY_ERROR
[ https://issues.apache.org/jira/browse/SPARK-40197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-40197. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37632 [https://github.com/apache/spark/pull/37632] > Replace query plan with context for MULTI_VALUE_SUBQUERY_ERROR > -- > > Key: SPARK-40197 > URL: https://issues.apache.org/jira/browse/SPARK-40197 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Vitalii Li >Assignee: Vitalii Li >Priority: Major > Fix For: 3.4.0 > > > Instead of a query plan, output the subquery context.
[jira] [Assigned] (SPARK-40197) Replace query plan with context for MULTI_VALUE_SUBQUERY_ERROR
[ https://issues.apache.org/jira/browse/SPARK-40197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-40197: Assignee: Vitalii Li > Replace query plan with context for MULTI_VALUE_SUBQUERY_ERROR > -- > > Key: SPARK-40197 > URL: https://issues.apache.org/jira/browse/SPARK-40197 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Vitalii Li >Assignee: Vitalii Li >Priority: Major > > Instead of a query plan, output the subquery context.
[jira] [Resolved] (SPARK-40220) Don't output the empty map of error message parameters
[ https://issues.apache.org/jira/browse/SPARK-40220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-40220. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37660 [https://github.com/apache/spark/pull/37660] > Don't output the empty map of error message parameters > -- > > Key: SPARK-40220 > URL: https://issues.apache.org/jira/browse/SPARK-40220 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > > In the current implementation, Spark outputs an empty map of message parameters in the > MINIMAL and STANDARD formats: > {code:json} > org.apache.spark.SparkRuntimeException > { > "errorClass" : "ELEMENT_AT_BY_INDEX_ZERO", > "messageParameters" : { } > } > {code} > which contradicts the approach used for other JSON fields.
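With the fix, the empty map should simply be omitted, so the same error in the MINIMAL format would be expected to look like this (an illustration of the intended output, not verbatim from the patch):

{code:json}
org.apache.spark.SparkRuntimeException
{
  "errorClass" : "ELEMENT_AT_BY_INDEX_ZERO"
}
{code}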
[jira] [Resolved] (SPARK-40218) GROUPING SETS should preserve the grouping columns
[ https://issues.apache.org/jira/browse/SPARK-40218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-40218. - Fix Version/s: 3.3.1 3.2.3 3.4.0 Resolution: Fixed Issue resolved by pull request 37655 [https://github.com/apache/spark/pull/37655] > GROUPING SETS should preserve the grouping columns > -- > > Key: SPARK-40218 > URL: https://issues.apache.org/jira/browse/SPARK-40218 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.3.1, 3.2.3, 3.4.0 > >
[jira] [Commented] (SPARK-40221) Not able to format using scalafmt
[ https://issues.apache.org/jira/browse/SPARK-40221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585191#comment-17585191 ] Ziqi Liu commented on SPARK-40221: -- [~hyukjin.kwon] I think running it on master should always be fine, since it only formats the git diff against master, so on master it doesn't actually format anything, if I understand the [dev tools guidance|https://spark.apache.org/developer-tools.html] correctly... The issue happened on another, very new branch (https://github.com/apache/spark/pull/37661), and I believe I didn't modify anything related to scalafmt. It tried to format the files changed since master, and then the error happened: {code:java} [INFO] Checking for files changed from master [INFO] Changed from master: /Users/ziqi.liu/code/spark/core/src/main/scala/org/apache/spark/internal/config/package.scala /Users/ziqi.liu/code/spark/core/src/main/scala/org/apache/spark/rdd/AsyncRDDActions.scala /Users/ziqi.liu/code/spark/core/src/main/scala/org/apache/spark/rdd/RDD.scala /Users/ziqi.liu/code/spark/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala /Users/ziqi.liu/code/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala /Users/ziqi.liu/code/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala /Users/ziqi.liu/code/spark/sql/core/src/test/scala/org/apache/spark/sql/ConfigBehaviorSuite.scala [ERROR] org.scalafmt.dynamic.exceptions.ScalafmtException: missing setting 'version'. To fix this problem, add the following line to .scalafmt.conf: 'version=3.2.1'. 
{code} > Not able to format using scalafmt > - > > Key: SPARK-40221 > URL: https://issues.apache.org/jira/browse/SPARK-40221 > Project: Spark > Issue Type: Question > Components: Build >Affects Versions: 3.4.0 >Reporter: Ziqi Liu >Priority: Major > > I'm following the guidance in [https://spark.apache.org/developer-tools.html] > using > {code:java} > ./dev/scalafmt{code} > to format the code, but getting this error: > {code:java} > [ERROR] Failed to execute goal > org.antipathy:mvn-scalafmt_2.12:1.1.1640084764.9f463a9:format (default-cli) > on project spark-parent_2.12: Error formatting Scala files: missing setting > 'version'. To fix this problem, add the following line to .scalafmt.conf: > 'version=3.2.1'. -> [Help 1] > [ERROR] > [ERROR] To see the full stack trace of the errors, re-run Maven with the -e > switch. > [ERROR] Re-run Maven using the -X switch to enable full debug logging. > [ERROR] > [ERROR] For more information about the errors and possible solutions, please > read the following articles: > [ERROR] [Help 1] > http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException {code} >
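The error message itself states the workaround: declare the scalafmt version in the config file the mvn-scalafmt plugin reads. In the Spark repo this should be {{dev/.scalafmt.conf}} (path assumed from the {{./dev/scalafmt}} wrapper; adjust if your checkout differs):

{code}
version=3.2.1
{code}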
[jira] [Assigned] (SPARK-40215) Add SQL configs to control CSV/JSON date and timestamp parsing behaviour
[ https://issues.apache.org/jira/browse/SPARK-40215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-40215: Assignee: Ivan Sadikov > Add SQL configs to control CSV/JSON date and timestamp parsing behaviour > > > Key: SPARK-40215 > URL: https://issues.apache.org/jira/browse/SPARK-40215 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Ivan Sadikov >Assignee: Ivan Sadikov >Priority: Major >
[jira] [Resolved] (SPARK-40215) Add SQL configs to control CSV/JSON date and timestamp parsing behaviour
[ https://issues.apache.org/jira/browse/SPARK-40215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-40215. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37653 [https://github.com/apache/spark/pull/37653] > Add SQL configs to control CSV/JSON date and timestamp parsing behaviour > > > Key: SPARK-40215 > URL: https://issues.apache.org/jira/browse/SPARK-40215 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Ivan Sadikov >Assignee: Ivan Sadikov >Priority: Major > Fix For: 3.4.0 > >
[jira] [Commented] (SPARK-40227) Data Source V2: Support creating table with the duplicate transform with different arguments
[ https://issues.apache.org/jira/browse/SPARK-40227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585186#comment-17585186 ] Apache Spark commented on SPARK-40227: -- User 'ConeyLiu' has created a pull request for this issue: https://github.com/apache/spark/pull/37670 > Data Source V2: Support creating table with the duplicate transform with > different arguments > > > Key: SPARK-40227 > URL: https://issues.apache.org/jira/browse/SPARK-40227 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.2 >Reporter: Xianyang Liu >Priority: Major > > This patch adds support for creating a table with duplicate transforms that > have different arguments. For example, the following statement would fail > before: > ```scala > CREATE TABLE testcat.t (id int, `a.b` string) USING foo PARTITIONED BY > (bucket(2, id), bucket(4, id)) > ```
[jira] [Assigned] (SPARK-40227) Data Source V2: Support creating table with the duplicate transform with different arguments
[ https://issues.apache.org/jira/browse/SPARK-40227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40227: Assignee: (was: Apache Spark) > Data Source V2: Support creating table with the duplicate transform with > different arguments > > > Key: SPARK-40227 > URL: https://issues.apache.org/jira/browse/SPARK-40227 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.2 >Reporter: Xianyang Liu >Priority: Major > > This patch adds support for creating a table with duplicate transforms that > have different arguments. For example, the following statement would fail > before: > ```scala > CREATE TABLE testcat.t (id int, `a.b` string) USING foo PARTITIONED BY > (bucket(2, id), bucket(4, id)) > ```
[jira] [Assigned] (SPARK-40227) Data Source V2: Support creating table with the duplicate transform with different arguments
[ https://issues.apache.org/jira/browse/SPARK-40227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40227: Assignee: Apache Spark > Data Source V2: Support creating table with the duplicate transform with > different arguments > > > Key: SPARK-40227 > URL: https://issues.apache.org/jira/browse/SPARK-40227 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.2 >Reporter: Xianyang Liu >Assignee: Apache Spark >Priority: Major > > This patch adds support for creating a table with duplicate transforms that > have different arguments. For example, the following statement would fail > before: > ```scala > CREATE TABLE testcat.t (id int, `a.b` string) USING foo PARTITIONED BY > (bucket(2, id), bucket(4, id)) > ```
[jira] [Commented] (SPARK-40227) Data Source V2: Support creating table with the duplicate transform with different arguments
[ https://issues.apache.org/jira/browse/SPARK-40227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585185#comment-17585185 ] Apache Spark commented on SPARK-40227: -- User 'ConeyLiu' has created a pull request for this issue: https://github.com/apache/spark/pull/37670 > Data Source V2: Support creating table with the duplicate transform with > different arguments > > > Key: SPARK-40227 > URL: https://issues.apache.org/jira/browse/SPARK-40227 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.2 >Reporter: Xianyang Liu >Priority: Major > > This patch adds support for creating a table with duplicate transforms that > have different arguments. For example, the following statement would fail > before: > ```scala > CREATE TABLE testcat.t (id int, `a.b` string) USING foo PARTITIONED BY > (bucket(2, id), bucket(4, id)) > ```
[jira] [Created] (SPARK-40227) Data Source V2: Support creating table with the duplicate transform with different arguments
Xianyang Liu created SPARK-40227: Summary: Data Source V2: Support creating table with the duplicate transform with different arguments Key: SPARK-40227 URL: https://issues.apache.org/jira/browse/SPARK-40227 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.2 Reporter: Xianyang Liu This patch adds support for creating a table with duplicate transforms that have different arguments. For example, the following statement would fail before: ```scala CREATE TABLE testcat.t (id int, `a.b` string) USING foo PARTITIONED BY (bucket(2, id), bucket(4, id)) ```