[jira] [Commented] (SPARK-38218) Looks like the wrong package is available on the spark downloads page. The name reads pre built for hadoop3.3 but the tgz file is marked as hadoop3.2
[ https://issues.apache.org/jira/browse/SPARK-38218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17493042#comment-17493042 ] Mehul Batra commented on SPARK-38218: - When can we expect the next release? > Looks like the wrong package is available on the spark downloads page. The > name reads pre built for hadoop3.3 but the tgz file is marked as hadoop3.2 > - > > Key: SPARK-38218 > URL: https://issues.apache.org/jira/browse/SPARK-38218 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 3.2.1 >Reporter: Mehul Batra >Priority: Major > Attachments: Screenshot_20220214-013156.jpg, > image-2022-02-16-12-26-32-871.png > > > !https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg! > !https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg! > Does the tgz actually contain Hadoop 3.3 with the name written wrong, or is > it the Hadoop 3.2 version only? > If yes, does this Hadoop build come with S3 magic committer support? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38218) Looks like the wrong package is available on the spark downloads page. The name reads pre built for hadoop3.3 but the tgz file is marked as hadoop3.2
[ https://issues.apache.org/jira/browse/SPARK-38218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17493041#comment-17493041 ] Hyukjin Kwon commented on SPARK-38218: -- That'd be fixed in the next release, since we fixed it in the dev branch. > Looks like the wrong package is available on the spark downloads page. The > name reads pre built for hadoop3.3 but the tgz file is marked as hadoop3.2 > - > > Key: SPARK-38218 > URL: https://issues.apache.org/jira/browse/SPARK-38218 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 3.2.1 >Reporter: Mehul Batra >Priority: Major > Attachments: Screenshot_20220214-013156.jpg, > image-2022-02-16-12-26-32-871.png > > > !https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg! > !https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg! > Does the tgz actually contain Hadoop 3.3 with the name written wrong, or is > it the Hadoop 3.2 version only? > If yes, does this Hadoop build come with S3 magic committer support? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38227) Apply strict nullability of nested column in time window / session window
Jungtaek Lim created SPARK-38227: Summary: Apply strict nullability of nested column in time window / session window Key: SPARK-38227 URL: https://issues.apache.org/jira/browse/SPARK-38227 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 3.2.1, 3.3.0 Reporter: Jungtaek Lim In TimeWindow and SessionWindow, we define the dataType of these function expressions as a StructType having two nested columns, "start" and "end", both of which are nullable. We replace these expressions in the analyzer via the corresponding rules: TimeWindowing for TimeWindow, and SessionWindowing for SessionWindow. The rules replace the function expressions with an Alias referring to a CreateNamedStruct. For the value side of CreateNamedStruct, we don't specify anything about nullability, which leads to a risk that the value side may be interpreted (or optimized) as non-nullable, which would differ from what Spark expects. We should make sure the nullability of the columns in CreateNamedStruct remains consistent with the dataType definition of these function expressions. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
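For context on the direction described above, here is a minimal Scala sketch of one way to pin the nullability, assuming a simple pass-through tagging expression (the helper name and shape are illustrative, not necessarily the shipped fix): wrap each value expression fed into CreateNamedStruct so it reports nullable = true, keeping the rewritten plan consistent with the declared window schema.

{code:scala}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Expression, UnaryExpression}
import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, ExprCode}
import org.apache.spark.sql.types.DataType

// Illustrative wrapper: evaluation and codegen pass straight through to the
// child, but nullable is forced to true to match the declared "start"/"end"
// schema of TimeWindow/SessionWindow.
case class ForceNullable(child: Expression) extends UnaryExpression {
  override def nullable: Boolean = true
  override def dataType: DataType = child.dataType
  override def eval(input: InternalRow): Any = child.eval(input)
  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode =
    child.genCode(ctx)
  override protected def withNewChildInternal(newChild: Expression): Expression =
    copy(child = newChild)
}
{code}

The TimeWindowing/SessionWindowing rules could then emit CreateNamedStruct(Seq(Literal("start"), ForceNullable(windowStart), Literal("end"), ForceNullable(windowEnd))), so the optimizer cannot legally narrow the nullability of either nested column.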
[jira] [Comment Edited] (SPARK-38218) Looks like the wrong package is available on the spark downloads page. The name reads pre built for hadoop3.3 but the tgz file is marked as hadoop3.2
[ https://issues.apache.org/jira/browse/SPARK-38218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17493029#comment-17493029 ] Mehul Batra edited comment on SPARK-38218 at 2/16/22, 6:57 AM: --- Hi [~hyukjin.kwon], so if I download the tgz from the Spark downloads page, will it contain Hadoop 3.3.1? It still shows Hadoop 3.2 in that section; attaching a screenshot of the same. !image-2022-02-16-12-26-32-871.png|width=736,height=121! was (Author: me_bat): So if I download the tgz from the Spark downloads page, will it contain Hadoop 3.3.1? It still shows Hadoop 3.2 in that section; attaching a screenshot of the same. !image-2022-02-16-12-26-32-871.png|width=736,height=121! > Looks like the wrong package is available on the spark downloads page. The > name reads pre built for hadoop3.3 but the tgz file is marked as hadoop3.2 > - > > Key: SPARK-38218 > URL: https://issues.apache.org/jira/browse/SPARK-38218 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 3.2.1 >Reporter: Mehul Batra >Priority: Major > Attachments: Screenshot_20220214-013156.jpg, > image-2022-02-16-12-26-32-871.png > > > !https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg! > !https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg! > Does the tgz actually contain Hadoop 3.3 with the name written wrong, or is > it the Hadoop 3.2 version only? > If yes, does this Hadoop build come with S3 magic committer support? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38218) Looks like the wrong package is available on the spark downloads page. The name reads pre built for hadoop3.3 but the tgz file is marked as hadoop3.2
[ https://issues.apache.org/jira/browse/SPARK-38218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17493029#comment-17493029 ] Mehul Batra commented on SPARK-38218: - So if I download the tgz from the Spark downloads page, will it contain Hadoop 3.3.1? It still shows Hadoop 3.2 in that section; attaching a screenshot of the same. !image-2022-02-16-12-26-32-871.png|width=736,height=121! > Looks like the wrong package is available on the spark downloads page. The > name reads pre built for hadoop3.3 but the tgz file is marked as hadoop3.2 > - > > Key: SPARK-38218 > URL: https://issues.apache.org/jira/browse/SPARK-38218 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 3.2.1 >Reporter: Mehul Batra >Priority: Major > Attachments: Screenshot_20220214-013156.jpg, > image-2022-02-16-12-26-32-871.png > > > !https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg! > !https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg! > Does the tgz actually contain Hadoop 3.3 with the name written wrong, or is > it the Hadoop 3.2 version only? > If yes, does this Hadoop build come with S3 magic committer support? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38218) Looks like the wrong package is available on the spark downloads page. The name reads pre built for hadoop3.3 but the tgz file is marked as hadoop3.2
[ https://issues.apache.org/jira/browse/SPARK-38218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mehul Batra updated SPARK-38218: Attachment: image-2022-02-16-12-26-32-871.png > Looks like the wrong package is available on the spark downloads page. The > name reads pre built for hadoop3.3 but the tgz file is marked as hadoop3.2 > - > > Key: SPARK-38218 > URL: https://issues.apache.org/jira/browse/SPARK-38218 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 3.2.1 >Reporter: Mehul Batra >Priority: Major > Attachments: Screenshot_20220214-013156.jpg, > image-2022-02-16-12-26-32-871.png > > > !https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg! > !https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg! > Does the tgz actually contain Hadoop 3.3 with the name written wrong, or is > it the Hadoop 3.2 version only? > If yes, does this Hadoop build come with S3 magic committer support? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38226) Fix HiveCompatibilitySuite under ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-38226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17493007#comment-17493007 ] Apache Spark commented on SPARK-38226: -- User 'anchovYu' has created a pull request for this issue: https://github.com/apache/spark/pull/35538 > Fix HiveCompatibilitySuite under ANSI mode > -- > > Key: SPARK-38226 > URL: https://issues.apache.org/jira/browse/SPARK-38226 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Xinyi Yu >Priority: Major > > Fix > sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala > under ANSI mode. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38226) Fix HiveCompatibilitySuite under ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-38226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38226: Assignee: (was: Apache Spark) > Fix HiveCompatibilitySuite under ANSI mode > -- > > Key: SPARK-38226 > URL: https://issues.apache.org/jira/browse/SPARK-38226 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Xinyi Yu >Priority: Major > > Fix > sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala > under ANSI mode. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38226) Fix HiveCompatibilitySuite under ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-38226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38226: Assignee: Apache Spark > Fix HiveCompatibilitySuite under ANSI mode > -- > > Key: SPARK-38226 > URL: https://issues.apache.org/jira/browse/SPARK-38226 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Xinyi Yu >Assignee: Apache Spark >Priority: Major > > Fix > sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala > under ANSI mode. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38226) Fix HiveCompatibilitySuite under ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-38226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17493005#comment-17493005 ] Apache Spark commented on SPARK-38226: -- User 'anchovYu' has created a pull request for this issue: https://github.com/apache/spark/pull/35538 > Fix HiveCompatibilitySuite under ANSI mode > -- > > Key: SPARK-38226 > URL: https://issues.apache.org/jira/browse/SPARK-38226 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Xinyi Yu >Priority: Major > > Fix > sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala > under ANSI mode. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38226) Fix HiveCompatibilitySuite under ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-38226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinyi Yu updated SPARK-38226: - Description: Fix sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala under ANSI mode. > Fix HiveCompatibilitySuite under ANSI mode > -- > > Key: SPARK-38226 > URL: https://issues.apache.org/jira/browse/SPARK-38226 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Xinyi Yu >Priority: Major > > Fix > sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala > under ANSI mode. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38226) Fix HiveCompatibilitySuite under ANSI mode
Xinyi Yu created SPARK-38226: Summary: Fix HiveCompatibilitySuite under ANSI mode Key: SPARK-38226 URL: https://issues.apache.org/jira/browse/SPARK-38226 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Xinyi Yu -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38173) Quoted column cannot be recognized correctly when quotedRegexColumnNames is true
[ https://issues.apache.org/jira/browse/SPARK-38173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-38173: --- Assignee: Tongwei > Quoted column cannot be recognized correctly when quotedRegexColumnNames is > true > > > Key: SPARK-38173 > URL: https://issues.apache.org/jira/browse/SPARK-38173 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: Tongwei >Assignee: Tongwei >Priority: Major > Fix For: 3.3.0 > > > When spark.sql.parser.quotedRegexColumnNames=true > {code:java} > SELECT `(C3)?+.+`,`C1` * C2 FROM (SELECT 3 AS C1,2 AS C2,1 AS C3) T;{code} > The above query will throw an exception > {code:java} > Error: org.apache.hive.service.cli.HiveSQLException: Error running query: > org.apache.spark.sql.AnalysisException: Invalid usage of '*' in expression > 'multiply' > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:370) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:266) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at > org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:78) > at > org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:62) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:44) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:266) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:261) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:275) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.apache.spark.sql.AnalysisException: Invalid usage of '*' in > expression 'multiply' > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:50) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:155) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$expandStarExpression$1.applyOrElse(Analyzer.scala:1700) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$expandStarExpression$1.applyOrElse(Analyzer.scala:1671) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$2(TreeNode.scala:342) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:342) > 
at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:339) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:408) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:244) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:406) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:359) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:339) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.expandStarExpression(Analyzer.scala:1671) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.$anonfun$buildExpandedProjectList$1(Analyzer.scala:1656) > {code} > It works fine in hive > {code:java} > 0: jdbc:hive2://hiveserver-inc.> set hive.support.quoted.identifiers=none;
[jira] [Resolved] (SPARK-38173) Quoted column cannot be recognized correctly when quotedRegexColumnNames is true
[ https://issues.apache.org/jira/browse/SPARK-38173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-38173. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35476 [https://github.com/apache/spark/pull/35476] > Quoted column cannot be recognized correctly when quotedRegexColumnNames is > true > > > Key: SPARK-38173 > URL: https://issues.apache.org/jira/browse/SPARK-38173 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: Tongwei >Priority: Major > Fix For: 3.3.0 > > > When spark.sql.parser.quotedRegexColumnNames=true > {code:java} > SELECT `(C3)?+.+`,`C1` * C2 FROM (SELECT 3 AS C1,2 AS C2,1 AS C3) T;{code} > The above query will throw an exception > {code:java} > Error: org.apache.hive.service.cli.HiveSQLException: Error running query: > org.apache.spark.sql.AnalysisException: Invalid usage of '*' in expression > 'multiply' > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:370) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:266) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at > org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:78) > at > org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:62) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:44) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:266) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:261) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:275) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.apache.spark.sql.AnalysisException: Invalid usage of '*' in > expression 'multiply' > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:50) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:155) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$expandStarExpression$1.applyOrElse(Analyzer.scala:1700) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$expandStarExpression$1.applyOrElse(Analyzer.scala:1671) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$2(TreeNode.scala:342) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:342) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:339) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:408) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:244) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:406) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:359) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:339) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.expandStarExpression(Analyzer.scala:1671) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.$anonfun$buildExpandedProjectList$1(Analyzer.scala:1656) > {code} > It works fine in hive > {code:java} >
[jira] [Resolved] (SPARK-38201) Fix KubernetesUtils#uploadFileToHadoopCompatibleFS use passed in `delSrc` and `overwrite`
[ https://issues.apache.org/jira/browse/SPARK-38201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-38201. --- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35509 [https://github.com/apache/spark/pull/35509] > Fix KubernetesUtils#uploadFileToHadoopCompatibleFS use passed in `delSrc` and > `overwrite` > - > > Key: SPARK-38201 > URL: https://issues.apache.org/jira/browse/SPARK-38201 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Trivial > Fix For: 3.3.0 > > > KubernetesUtils#uploadFileToHadoopCompatibleFS defines the input parameters > `delSrc` and `overwrite`, but constants (false and true) are used when > calling the `FileSystem.copyFromLocalFile(boolean delSrc, boolean overwrite, > Path src, Path dst)` method. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
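As a reference for the change described above, a minimal sketch (the method shape is assumed from the report, not copied from the merged patch): the flags supplied by the caller should be forwarded instead of the hard-coded constants.

{code:scala}
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch only; the real method lives in KubernetesUtils.
def uploadFileToHadoopCompatibleFS(
    src: Path,
    dest: Path,
    fs: FileSystem,
    delSrc: Boolean,
    overwrite: Boolean): Unit = {
  // Before (buggy): fs.copyFromLocalFile(false, true, src, dest)
  fs.copyFromLocalFile(delSrc, overwrite, src, dest)
}
{code}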
[jira] [Assigned] (SPARK-38201) Fix KubernetesUtils#uploadFileToHadoopCompatibleFS use passed in `delSrc` and `overwrite`
[ https://issues.apache.org/jira/browse/SPARK-38201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-38201: - Assignee: Yang Jie > Fix KubernetesUtils#uploadFileToHadoopCompatibleFS use passed in `delSrc` and > `overwrite` > - > > Key: SPARK-38201 > URL: https://issues.apache.org/jira/browse/SPARK-38201 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Trivial > > KubernetesUtils#uploadFileToHadoopCompatibleFS defines the input parameters > `delSrc` and `overwrite`, but constants (false and true) are used when > calling the `FileSystem.copyFromLocalFile(boolean delSrc, boolean overwrite, > Path src, Path dst)` method. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38225) Complete input validation of function to_binary
[ https://issues.apache.org/jira/browse/SPARK-38225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38225: Assignee: (was: Apache Spark) > Complete input validation of function to_binary > --- > > Key: SPARK-38225 > URL: https://issues.apache.org/jira/browse/SPARK-38225 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Priority: Major > > Currently, the function to_binary doesn't handle a non-string {{format}} > parameter properly. > For example, {{spark.sql("select to_binary('abc', 1)")}} raises a casting > error rather than hinting that the encoding format is unsupported. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38225) Complete input validation of function to_binary
[ https://issues.apache.org/jira/browse/SPARK-38225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38225: Assignee: Apache Spark > Complete input validation of function to_binary > --- > > Key: SPARK-38225 > URL: https://issues.apache.org/jira/browse/SPARK-38225 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Apache Spark >Priority: Major > > Currently, the function to_binary doesn't handle a non-string {{format}} > parameter properly. > For example, {{spark.sql("select to_binary('abc', 1)")}} raises a casting > error rather than hinting that the encoding format is unsupported. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38225) Complete input validation of function to_binary
[ https://issues.apache.org/jira/browse/SPARK-38225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492985#comment-17492985 ] Apache Spark commented on SPARK-38225: -- User 'xinrong-databricks' has created a pull request for this issue: https://github.com/apache/spark/pull/35533 > Complete input validation of function to_binary > --- > > Key: SPARK-38225 > URL: https://issues.apache.org/jira/browse/SPARK-38225 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Priority: Major > > Currently, the function to_binary doesn't handle a non-string {{format}} > parameter properly. > For example, {{spark.sql("select to_binary('abc', 1)")}} raises a casting > error rather than hinting that the encoding format is unsupported. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38225) Complete input validation of function to_binary
[ https://issues.apache.org/jira/browse/SPARK-38225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492984#comment-17492984 ] Xinrong Meng commented on SPARK-38225: -- I am working on that. > Complete input validation of function to_binary > --- > > Key: SPARK-38225 > URL: https://issues.apache.org/jira/browse/SPARK-38225 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Priority: Major > > Currently, the function to_binary doesn't handle a non-string {{format}} > parameter properly. > For example, {{spark.sql("select to_binary('abc', 1)")}} raises a casting > error rather than hinting that the encoding format is unsupported. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38225) Complete input validation of function to_binary
Xinrong Meng created SPARK-38225: Summary: Complete input validation of function to_binary Key: SPARK-38225 URL: https://issues.apache.org/jira/browse/SPARK-38225 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: Xinrong Meng Currently, the function to_binary doesn't handle a non-string {{format}} parameter properly. For example, {{spark.sql("select to_binary('abc', 1)")}} raises a casting error rather than hinting that the encoding format is unsupported. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
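To make the intended behavior concrete, a hedged sketch of the kind of up-front check implied by the report (the helper name and the supported-format list are assumptions, not the actual patch): validate the {{format}} argument before evaluation so the user sees an "unsupported format" message instead of a cast failure.

{code:scala}
import java.util.Locale
import org.apache.spark.sql.catalyst.expressions.{Expression, Literal}
import org.apache.spark.sql.types.StringType
import org.apache.spark.unsafe.types.UTF8String

// Hypothetical helper: returns Some(errorMessage) when `format` is invalid.
def validateFormat(format: Expression): Option[String] = format match {
  case Literal(s: UTF8String, StringType) =>
    val fmt = s.toString.toLowerCase(Locale.ROOT)
    if (Set("hex", "utf-8", "base64").contains(fmt)) None
    else Some(s"Unsupported encoding format: $fmt")
  case _ =>
    Some("The 'format' parameter of to_binary must be a string literal")
}
{code}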
[jira] [Updated] (SPARK-38224) How do I get a lot of results in KDE
[ https://issues.apache.org/jira/browse/SPARK-38224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Wan updated SPARK-38224: Priority: Trivial (was: Major) > How do I get a lot of results in KDE > > > Key: SPARK-38224 > URL: https://issues.apache.org/jira/browse/SPARK-38224 > Project: Spark > Issue Type: Question > Components: ML >Affects Versions: 2.4.5 >Reporter: Ben Wan >Priority: Trivial > > I have a pyspark.DataFrame. I have converted one of the columns to an RDD and > performed KDE. I need to get the KDE estimates for all values of the column > and add them as a new column in the DataFrame for subsequent work. How can I > do this with Spark? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38224) How do I get a lot of results in KDE
Ben Wan created SPARK-38224: --- Summary: How do I get a lot of results in KDE Key: SPARK-38224 URL: https://issues.apache.org/jira/browse/SPARK-38224 Project: Spark Issue Type: Question Components: ML Affects Versions: 2.4.5 Reporter: Ben Wan I have a pyspark.DataFrame. I have converted one of the columns to an RDD and performed KDE. I need to get the KDE estimates for all values of the column and add them as a new column in the DataFrame for subsequent work. How can I do this with Spark? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
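For what it's worth, one possible approach in Scala (the same idea applies in PySpark via pyspark.mllib.stat.KernelDensity); the column name and bandwidth are made up, and it assumes the column's distinct-value count is small enough to collect: fit MLlib's KernelDensity on the column, evaluate it once, and join the estimates back as a new column.

{code:scala}
import org.apache.spark.mllib.stat.KernelDensity

// Assumes a SparkSession `spark` and a DataFrame `df` with a Double column "value".
import spark.implicits._
val sample = df.select("value").rdd.map(_.getDouble(0))
val kd = new KernelDensity().setSample(sample).setBandwidth(1.0)

// Evaluate the density at each distinct value (feasible for modest cardinality),
// then join the estimates back onto the original DataFrame as a "density" column.
val points = sample.distinct().collect()
val densities = kd.estimate(points)
val lookup = points.zip(densities).toSeq.toDF("value", "density")
val result = df.join(lookup, "value")
{code}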
[jira] [Commented] (SPARK-38223) PersistentVolumeClaim does not work in clusters with multiple nodes
[ https://issues.apache.org/jira/browse/SPARK-38223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492960#comment-17492960 ] Zimo Li commented on SPARK-38223: - [~hyukjin.kwon] Sorry I missed that. I just added a title. > PersistentVolumeClaim does not work in clusters with multiple nodes > --- > > Key: SPARK-38223 > URL: https://issues.apache.org/jira/browse/SPARK-38223 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.1 > Environment: > [https://spark.apache.org/docs/latest/running-on-kubernetes.html#how-it-works] > [https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes] > [https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes] > >Reporter: Zimo Li >Priority: Minor > > We are using {{spark-submit}} to establish a ThriftServer warehouse on Google > Kubernetes Engine. The Spark documentation on running on Kubernetes suggests > that we can use > [persistentVolumeClaim|https://kubernetes.io/docs/concepts/storage/volumes/#persistentvolumeclaim] > for Spark applications. > {code:bash} > spark-submit \ > --master k8s://$KUBERNETES_SERVICE_HOST \ > --deploy-mode cluster \ > --class $THRIFTSERVER \ > --conf spark.sql.catalogImplementation=hive \ > --conf spark.sql.hive.metastore.sharedPrefixes=org.postgresql \ > --conf spark.hadoop.hive.metastore.schema.verification=false \ > --conf spark.hadoop.datanucleus.schema.autoCreateTables=true \ > --conf spark.hadoop.datanucleus.autoCreateSchema=false \ > --conf spark.sql.parquet.int96RebaseModeInWrite=CORRECTED \ > --conf > spark.hadoop.javax.jdo.option.ConnectionDriverName=org.postgresql.Driver \ > --conf spark.hadoop.javax.jdo.option.ConnectionUserName=spark \ > --conf spark.hadoop.javax.jdo.option.ConnectionPassword=Password1! \ > --conf spark.sql.warehouse.dir=$MOUNT_PATH \ > --conf spark.kubernetes.driver.pod.name=spark-hive-thriftserver-driver \ > --conf spark.kubernetes.driver.label.app.kubernetes.io/name=thriftserver \ > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.$VOLUME_NAME.options.claimName=$CLAIM_NAME > \ > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.$VOLUME_NAME.mount.path=$MOUNT_PATH > \ > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.$VOLUME_NAME.mount.readOnly=false > \ > --conf > spark.kubernetes.driver.volumes.persistentVolumeClaim.$VOLUME_NAME.options.claimName=$CLAIM_NAME > \ > --conf > spark.kubernetes.driver.volumes.persistentVolumeClaim.$VOLUME_NAME.mount.path=$MOUNT_PATH > \ > --conf > spark.kubernetes.driver.volumes.persistentVolumeClaim.$VOLUME_NAME.mount.readOnly=false > \ > --conf spark.kubernetes.executor.deleteOnTermination=true \ > --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-kube \ > --conf spark.kubernetes.container.image=$IMAGE \ > --conf spark.kubernetes.container.image.pullPolicy=Always \ > --conf spark.executor.memory=2g \ > --conf spark.driver.memory=2g \ > local:///$JAR {code} > When it ran, it created one driver and two executors. Each of these wanted to > use the same pvc. Unfortunately, at least one of these pods was scheduled on > a different node from the rest. 
As GKE mounts pvs to nodes in order to honor > pvcs for pods, that odd pod out was unable to attach the pv: > {code:java} > FailedMount > Unable to attach or mount volumes: unmounted volumes=[spark-warehouse], > unattached volumes=[kube-api-access-grfld spark-conf-volume-exec > spark-warehouse spark-local-dir-1]: timed out waiting for the condition {code} > This is because GKE like many cloud providers does not support > {{ReadWriteMany}} for pvcs/pvs. > > I suggest changing the documentation not to suggest using pvcs for > ThriftServers. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38223) PersistentVolumeClaim does not work in clusters with multiple nodes
[ https://issues.apache.org/jira/browse/SPARK-38223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zimo Li updated SPARK-38223: Summary: PersistentVolumeClaim does not work in clusters with multiple nodes (was: Spark) > PersistentVolumeClaim does not work in clusters with multiple nodes > --- > > Key: SPARK-38223 > URL: https://issues.apache.org/jira/browse/SPARK-38223 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.1 > Environment: > [https://spark.apache.org/docs/latest/running-on-kubernetes.html#how-it-works] > [https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes] > [https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes] > >Reporter: Zimo Li >Priority: Minor > > We are using {{spark-submit}} to establish a ThriftServer warehouse on Google > Kubernetes Engine. The Spark documentation on running on Kubernetes suggests > that we can use > [persistentVolumeClaim|https://kubernetes.io/docs/concepts/storage/volumes/#persistentvolumeclaim] > for Spark applications. > {code:bash} > spark-submit \ > --master k8s://$KUBERNETES_SERVICE_HOST \ > --deploy-mode cluster \ > --class $THRIFTSERVER \ > --conf spark.sql.catalogImplementation=hive \ > --conf spark.sql.hive.metastore.sharedPrefixes=org.postgresql \ > --conf spark.hadoop.hive.metastore.schema.verification=false \ > --conf spark.hadoop.datanucleus.schema.autoCreateTables=true \ > --conf spark.hadoop.datanucleus.autoCreateSchema=false \ > --conf spark.sql.parquet.int96RebaseModeInWrite=CORRECTED \ > --conf > spark.hadoop.javax.jdo.option.ConnectionDriverName=org.postgresql.Driver \ > --conf spark.hadoop.javax.jdo.option.ConnectionUserName=spark \ > --conf spark.hadoop.javax.jdo.option.ConnectionPassword=Password1! \ > --conf spark.sql.warehouse.dir=$MOUNT_PATH \ > --conf spark.kubernetes.driver.pod.name=spark-hive-thriftserver-driver \ > --conf spark.kubernetes.driver.label.app.kubernetes.io/name=thriftserver \ > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.$VOLUME_NAME.options.claimName=$CLAIM_NAME > \ > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.$VOLUME_NAME.mount.path=$MOUNT_PATH > \ > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.$VOLUME_NAME.mount.readOnly=false > \ > --conf > spark.kubernetes.driver.volumes.persistentVolumeClaim.$VOLUME_NAME.options.claimName=$CLAIM_NAME > \ > --conf > spark.kubernetes.driver.volumes.persistentVolumeClaim.$VOLUME_NAME.mount.path=$MOUNT_PATH > \ > --conf > spark.kubernetes.driver.volumes.persistentVolumeClaim.$VOLUME_NAME.mount.readOnly=false > \ > --conf spark.kubernetes.executor.deleteOnTermination=true \ > --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-kube \ > --conf spark.kubernetes.container.image=$IMAGE \ > --conf spark.kubernetes.container.image.pullPolicy=Always \ > --conf spark.executor.memory=2g \ > --conf spark.driver.memory=2g \ > local:///$JAR {code} > When it ran, it created one driver and two executors. Each of these wanted to > use the same pvc. Unfortunately, at least one of these pods was scheduled on > a different node from the rest. 
As GKE mounts pvs to nodes in order to honor > pvcs for pods, that odd pod out was unable to attach the pv: > {code:java} > FailedMount > Unable to attach or mount volumes: unmounted volumes=[spark-warehouse], > unattached volumes=[kube-api-access-grfld spark-conf-volume-exec > spark-warehouse spark-local-dir-1]: timed out waiting for the condition {code} > This is because GKE like many cloud providers does not support > {{ReadWriteMany}} for pvcs/pvs. > > I suggest changing the documentation not to suggest using pvcs for > ThriftServers. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38221) Group by a stream of complex expressions fails
[ https://issues.apache.org/jira/browse/SPARK-38221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-38221. -- Resolution: Fixed Fixed in https://github.com/apache/spark/pull/35537 > Group by a stream of complex expressions fails > -- > > Key: SPARK-38221 > URL: https://issues.apache.org/jira/browse/SPARK-38221 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1, 3.3.0 >Reporter: Bruce Robbins >Priority: Major > Fix For: 3.3.0, 3.2.2 > > > This query fails: > {noformat} > scala> Seq(1).toDF("id").groupBy(Stream($"id" + 1, $"id" + 2): > _*).sum("id").show(false) > java.lang.IllegalStateException: Couldn't find _groupingexpression#24 in > [id#4,_groupingexpression#23] > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:83) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:425) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:73) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:94) > at scala.collection.immutable.Stream.$anonfun$map$1(Stream.scala:418) > at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1173) > at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1163) > at scala.collection.immutable.Stream.$anonfun$map$1(Stream.scala:418) > at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1173) > at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1163) > at scala.collection.immutable.Stream.foreach(Stream.scala:534) > at scala.collection.TraversableOnce.count(TraversableOnce.scala:152) > at scala.collection.TraversableOnce.count$(TraversableOnce.scala:145) > at scala.collection.AbstractTraversable.count(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.createCode(GenerateUnsafeProjection.scala:293) > at > org.apache.spark.sql.execution.aggregate.HashAggregateExec.doConsumeWithKeys(HashAggregateExec.scala:623) > {noformat} > However, replace {{Stream}} with {{Seq}} and it works: > {noformat} > scala> Seq(1).toDF("id").groupBy(Seq($"id" + 1, $"id" + 2): > _*).sum("id").show(false) > +--------+--------+-------+ > |(id + 1)|(id + 2)|sum(id)| > +--------+--------+-------+ > |2       |3       |1      | > +--------+--------+-------+ > scala> > {noformat} > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
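Until a release containing the fix is available, a workaround sketch (assuming spark.implicits._ is in scope, as in spark-shell): materialize the lazy Stream before handing it to groupBy, which sidesteps the lazy-map evaluation during codegen.

{code:scala}
// Force the Stream into a strict List before passing it to groupBy.
val exprs = Stream($"id" + 1, $"id" + 2).toList
Seq(1).toDF("id").groupBy(exprs: _*).sum("id").show(false)
{code}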
[jira] [Updated] (SPARK-38221) Group by a stream of complex expressions fails
[ https://issues.apache.org/jira/browse/SPARK-38221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-38221: - Fix Version/s: 3.3.0 3.2.2 > Group by a stream of complex expressions fails > -- > > Key: SPARK-38221 > URL: https://issues.apache.org/jira/browse/SPARK-38221 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1, 3.3.0 >Reporter: Bruce Robbins >Priority: Major > Fix For: 3.3.0, 3.2.2 > > > This query fails: > {noformat} > scala> Seq(1).toDF("id").groupBy(Stream($"id" + 1, $"id" + 2): > _*).sum("id").show(false) > java.lang.IllegalStateException: Couldn't find _groupingexpression#24 in > [id#4,_groupingexpression#23] > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:83) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:425) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:73) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:94) > at scala.collection.immutable.Stream.$anonfun$map$1(Stream.scala:418) > at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1173) > at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1163) > at scala.collection.immutable.Stream.$anonfun$map$1(Stream.scala:418) > at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1173) > at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1163) > at scala.collection.immutable.Stream.foreach(Stream.scala:534) > at scala.collection.TraversableOnce.count(TraversableOnce.scala:152) > at scala.collection.TraversableOnce.count$(TraversableOnce.scala:145) > at scala.collection.AbstractTraversable.count(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.createCode(GenerateUnsafeProjection.scala:293) > at > org.apache.spark.sql.execution.aggregate.HashAggregateExec.doConsumeWithKeys(HashAggregateExec.scala:623) > {noformat} > However, replace {{Stream}} with {{Seq}} and it works: > {noformat} > scala> Seq(1).toDF("id").groupBy(Seq($"id" + 1, $"id" + 2): > _*).sum("id").show(false) > +--------+--------+-------+ > |(id + 1)|(id + 2)|sum(id)| > +--------+--------+-------+ > |2       |3       |1      | > +--------+--------+-------+ > scala> > {noformat} > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38218) Looks like the wrong package is available on the spark downloads page. The name reads pre built for hadoop3.3 but the tgz file is marked as hadoop3.2
[ https://issues.apache.org/jira/browse/SPARK-38218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-38218. -- Resolution: Duplicate Probably it's a duplicate of SPARK-37445. > Looks like the wrong package is available on the spark downloads page. The > name reads pre built for hadoop3.3 but the tgz file is marked as hadoop3.2 > - > > Key: SPARK-38218 > URL: https://issues.apache.org/jira/browse/SPARK-38218 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 3.2.1 >Reporter: Mehul Batra >Priority: Major > Attachments: Screenshot_20220214-013156.jpg > > > !https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg! > !https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg! > Does the tgz actually contain Hadoop 3.3 with the name written wrong, or is > it the Hadoop 3.2 version only? > If yes, does this Hadoop build come with S3 magic committer support? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38218) Looks like the wrong package is available on the spark downloads page. The name reads pre built for hadoop3.3 but the tgz file is marked as hadoop3.2
[ https://issues.apache.org/jira/browse/SPARK-38218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492952#comment-17492952 ] Hyukjin Kwon commented on SPARK-38218: -- [~ME_BAT] the images are broken. Mind checking this, please? > Looks like the wrong package is available on the spark downloads page. The > name reads pre built for hadoop3.3 but the tgz file is marked as hadoop3.2 > - > > Key: SPARK-38218 > URL: https://issues.apache.org/jira/browse/SPARK-38218 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 3.2.1 >Reporter: Mehul Batra >Priority: Major > Attachments: Screenshot_20220214-013156.jpg > > > !https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg! > !https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg! > Does the tgz actually contain Hadoop 3.3 with the name written wrong, or is > it the Hadoop 3.2 version only? > If yes, does this Hadoop build come with S3 magic committer support? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38223) Spark
[ https://issues.apache.org/jira/browse/SPARK-38223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492951#comment-17492951 ] Hyukjin Kwon commented on SPARK-38223: -- [~lizimo] mind fixing the JIRA title to summarize the issue? > Spark > - > > Key: SPARK-38223 > URL: https://issues.apache.org/jira/browse/SPARK-38223 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.1 > Environment: > [https://spark.apache.org/docs/latest/running-on-kubernetes.html#how-it-works] > [https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes] > [https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes] > >Reporter: Zimo Li >Priority: Minor > > We are using {{spark-submit}} to establish a ThriftServer warehouse on Google > Kubernetes Engine. The Spark documentation on running on Kubernetes suggests > that we can use > [persistentVolumeClaim|https://kubernetes.io/docs/concepts/storage/volumes/#persistentvolumeclaim] > for Spark applications. > {code:bash} > spark-submit \ > --master k8s://$KUBERNETES_SERVICE_HOST \ > --deploy-mode cluster \ > --class $THRIFTSERVER \ > --conf spark.sql.catalogImplementation=hive \ > --conf spark.sql.hive.metastore.sharedPrefixes=org.postgresql \ > --conf spark.hadoop.hive.metastore.schema.verification=false \ > --conf spark.hadoop.datanucleus.schema.autoCreateTables=true \ > --conf spark.hadoop.datanucleus.autoCreateSchema=false \ > --conf spark.sql.parquet.int96RebaseModeInWrite=CORRECTED \ > --conf > spark.hadoop.javax.jdo.option.ConnectionDriverName=org.postgresql.Driver \ > --conf spark.hadoop.javax.jdo.option.ConnectionUserName=spark \ > --conf spark.hadoop.javax.jdo.option.ConnectionPassword=Password1! \ > --conf spark.sql.warehouse.dir=$MOUNT_PATH \ > --conf spark.kubernetes.driver.pod.name=spark-hive-thriftserver-driver \ > --conf spark.kubernetes.driver.label.app.kubernetes.io/name=thriftserver \ > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.$VOLUME_NAME.options.claimName=$CLAIM_NAME > \ > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.$VOLUME_NAME.mount.path=$MOUNT_PATH > \ > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.$VOLUME_NAME.mount.readOnly=false > \ > --conf > spark.kubernetes.driver.volumes.persistentVolumeClaim.$VOLUME_NAME.options.claimName=$CLAIM_NAME > \ > --conf > spark.kubernetes.driver.volumes.persistentVolumeClaim.$VOLUME_NAME.mount.path=$MOUNT_PATH > \ > --conf > spark.kubernetes.driver.volumes.persistentVolumeClaim.$VOLUME_NAME.mount.readOnly=false > \ > --conf spark.kubernetes.executor.deleteOnTermination=true \ > --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-kube \ > --conf spark.kubernetes.container.image=$IMAGE \ > --conf spark.kubernetes.container.image.pullPolicy=Always \ > --conf spark.executor.memory=2g \ > --conf spark.driver.memory=2g \ > local:///$JAR {code} > When it ran, it created one driver and two executors. Each of these wanted to > use the same pvc. Unfortunately, at least one of these pods was scheduled on > a different node from the rest. 
As GKE mounts pvs to nodes in order to honor > pvcs for pods, that odd pod out was unable to attach the pv: > {code:java} > FailedMount > Unable to attach or mount volumes: unmounted volumes=[spark-warehouse], > unattached volumes=[kube-api-access-grfld spark-conf-volume-exec > spark-warehouse spark-local-dir-1]: timed out waiting for the condition {code} > This is because GKE like many cloud providers does not support > {{ReadWriteMany}} for pvcs/pvs. > > I suggest changing the documentation not to suggest using pvcs for > ThriftServers. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38221) Group by a stream of complex expressions fails
[ https://issues.apache.org/jira/browse/SPARK-38221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38221: Assignee: Apache Spark > Group by a stream of complex expressions fails > -- > > Key: SPARK-38221 > URL: https://issues.apache.org/jira/browse/SPARK-38221 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1, 3.3.0 >Reporter: Bruce Robbins >Assignee: Apache Spark >Priority: Major > > This query fails: > {noformat} > scala> Seq(1).toDF("id").groupBy(Stream($"id" + 1, $"id" + 2): > _*).sum("id").show(false) > java.lang.IllegalStateException: Couldn't find _groupingexpression#24 in > [id#4,_groupingexpression#23] > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:83) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:425) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:73) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:94) > at scala.collection.immutable.Stream.$anonfun$map$1(Stream.scala:418) > at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1173) > at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1163) > at scala.collection.immutable.Stream.$anonfun$map$1(Stream.scala:418) > at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1173) > at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1163) > at scala.collection.immutable.Stream.foreach(Stream.scala:534) > at scala.collection.TraversableOnce.count(TraversableOnce.scala:152) > at scala.collection.TraversableOnce.count$(TraversableOnce.scala:145) > at scala.collection.AbstractTraversable.count(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.createCode(GenerateUnsafeProjection.scala:293) > at > org.apache.spark.sql.execution.aggregate.HashAggregateExec.doConsumeWithKeys(HashAggregateExec.scala:623) > {noformat} > However, replace {{Stream}} with {{Seq}} and it works: > {noformat} > scala> Seq(1).toDF("id").groupBy(Seq($"id" + 1, $"id" + 2): > _*).sum("id").show(false) > +--------+--------+-------+ > |(id + 1)|(id + 2)|sum(id)| > +--------+--------+-------+ > |2       |3       |1      | > +--------+--------+-------+ > scala> > {noformat} > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38221) Group by a stream of complex expressions fails
[ https://issues.apache.org/jira/browse/SPARK-38221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492944#comment-17492944 ] Apache Spark commented on SPARK-38221: -- User 'bersprockets' has created a pull request for this issue: https://github.com/apache/spark/pull/35537 > Group by a stream of complex expressions fails > -- > > Key: SPARK-38221 > URL: https://issues.apache.org/jira/browse/SPARK-38221 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1, 3.3.0 >Reporter: Bruce Robbins >Priority: Major > > This query fails: > {noformat} > scala> Seq(1).toDF("id").groupBy(Stream($"id" + 1, $"id" + 2): > _*).sum("id").show(false) > java.lang.IllegalStateException: Couldn't find _groupingexpression#24 in > [id#4,_groupingexpression#23] > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:83) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:425) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:73) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:94) > at scala.collection.immutable.Stream.$anonfun$map$1(Stream.scala:418) > at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1173) > at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1163) > at scala.collection.immutable.Stream.$anonfun$map$1(Stream.scala:418) > at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1173) > at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1163) > at scala.collection.immutable.Stream.foreach(Stream.scala:534) > at scala.collection.TraversableOnce.count(TraversableOnce.scala:152) > at scala.collection.TraversableOnce.count$(TraversableOnce.scala:145) > at scala.collection.AbstractTraversable.count(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.createCode(GenerateUnsafeProjection.scala:293) > at > org.apache.spark.sql.execution.aggregate.HashAggregateExec.doConsumeWithKeys(HashAggregateExec.scala:623) > {noformat} > However, replace {{Stream}} with {{Seq}} and it works: > {noformat} > scala> Seq(1).toDF("id").groupBy(Seq($"id" + 1, $"id" + 2): > _*).sum("id").show(false) > +--------+--------+-------+ > |(id + 1)|(id + 2)|sum(id)| > +--------+--------+-------+ > |2       |3       |1      | > +--------+--------+-------+ > scala> > {noformat} > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38221) Group by a stream of complex expressions fails
[ https://issues.apache.org/jira/browse/SPARK-38221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38221: Assignee: (was: Apache Spark) > Group by a stream of complex expressions fails > -- > > Key: SPARK-38221 > URL: https://issues.apache.org/jira/browse/SPARK-38221 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1, 3.3.0 >Reporter: Bruce Robbins >Priority: Major > > This query fails: > {noformat} > scala> Seq(1).toDF("id").groupBy(Stream($"id" + 1, $"id" + 2): > _*).sum("id").show(false) > java.lang.IllegalStateException: Couldn't find _groupingexpression#24 in > [id#4,_groupingexpression#23] > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:83) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:425) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:73) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:94) > at scala.collection.immutable.Stream.$anonfun$map$1(Stream.scala:418) > at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1173) > at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1163) > at scala.collection.immutable.Stream.$anonfun$map$1(Stream.scala:418) > at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1173) > at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1163) > at scala.collection.immutable.Stream.foreach(Stream.scala:534) > at scala.collection.TraversableOnce.count(TraversableOnce.scala:152) > at scala.collection.TraversableOnce.count$(TraversableOnce.scala:145) > at scala.collection.AbstractTraversable.count(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.createCode(GenerateUnsafeProjection.scala:293) > at > org.apache.spark.sql.execution.aggregate.HashAggregateExec.doConsumeWithKeys(HashAggregateExec.scala:623) > {noformat} > However, replace {{Stream}} with {{Seq}} and it works: > {noformat} > scala> Seq(1).toDF("id").groupBy(Seq($"id" + 1, $"id" + 2): > _*).sum("id").show(false) > +++---+ > |(id + 1)|(id + 2)|sum(id)| > +++---+ > |2 |3 |1 | > +++---+ > scala> > {noformat} > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38222) Expose Node Description attribute in SQL Rest API
[ https://issues.apache.org/jira/browse/SPARK-38222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38222: Assignee: (was: Apache Spark) > Expose Node Description attribute in SQL Rest API > - > > Key: SPARK-38222 > URL: https://issues.apache.org/jira/browse/SPARK-38222 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Eren Avsarogullari >Priority: Major > > Currently, SQL public Rest API does not expose node description and it is > useful to have nodeDesc attribute at query level to have more details such as: > {code:java} > - Join Operators(BHJ, SMJ, SHJ) => when correlating join operator with join > type and which leg is built for BHJ. > - HashAggregate => aggregated keys and agg functions > - List can be extended for other physical operators.{code} > *Current Sample Json Result:* > {code:java} > { > "nodeId" : 14, > "nodeName" : "BroadcastHashJoin", > "wholeStageCodegenId" : 3, > "stageIds" : [ 5 ], > "metrics" : [ { > "name" : "number of output rows", > "value" : { > "amount" : "2" > } > } > }, > ... > { > "nodeId" : 8, > "nodeName" : "HashAggregate", > "wholeStageCodegenId" : 4, > "stageIds" : [ 8 ], > "metrics" : [ { > "name" : "spill size", > "value" : { > "amount" : "0.0" > } > } > } {code} > *New* {*}Sample Json Result{*}{*}:{*} > {code:java} > { > "nodeId" : 14, > "nodeName" : "BroadcastHashJoin", > "nodeDesc" : "BroadcastHashJoin [id#4], [id#24], Inner, BuildLeft, false", > "wholeStageCodegenId" : 3, > "stageIds" : [ 5 ], > "metrics" : [ { > "name" : "number of output rows", > "value" : { > "amount" : "2" > } > } > }, > ... > { > "nodeId" : 8, > "nodeName" : "HashAggregate", > "nodeDesc" : "HashAggregate(keys=[name#5, age#6, salary#18], > functions=[avg(cast(age#6 as bigint)), avg(salary#18)])", > "wholeStageCodegenId" : 4, > "stageIds" : [ 8 ], > "metrics" : [ { > "name" : "spill size", > "value" : { > "amount" : "0.0" > } > } > } {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38222) Expose Node Description attribute in SQL Rest API
[ https://issues.apache.org/jira/browse/SPARK-38222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492927#comment-17492927 ] Apache Spark commented on SPARK-38222: -- User 'erenavsarogullari' has created a pull request for this issue: https://github.com/apache/spark/pull/35536 > Expose Node Description attribute in SQL Rest API > - > > Key: SPARK-38222 > URL: https://issues.apache.org/jira/browse/SPARK-38222 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Eren Avsarogullari >Priority: Major > > Currently, SQL public Rest API does not expose node description and it is > useful to have nodeDesc attribute at query level to have more details such as: > {code:java} > - Join Operators(BHJ, SMJ, SHJ) => when correlating join operator with join > type and which leg is built for BHJ. > - HashAggregate => aggregated keys and agg functions > - List can be extended for other physical operators.{code} > *Current Sample Json Result:* > {code:java} > { > "nodeId" : 14, > "nodeName" : "BroadcastHashJoin", > "wholeStageCodegenId" : 3, > "stageIds" : [ 5 ], > "metrics" : [ { > "name" : "number of output rows", > "value" : { > "amount" : "2" > } > } > }, > ... > { > "nodeId" : 8, > "nodeName" : "HashAggregate", > "wholeStageCodegenId" : 4, > "stageIds" : [ 8 ], > "metrics" : [ { > "name" : "spill size", > "value" : { > "amount" : "0.0" > } > } > } {code} > *New* {*}Sample Json Result{*}{*}:{*} > {code:java} > { > "nodeId" : 14, > "nodeName" : "BroadcastHashJoin", > "nodeDesc" : "BroadcastHashJoin [id#4], [id#24], Inner, BuildLeft, false", > "wholeStageCodegenId" : 3, > "stageIds" : [ 5 ], > "metrics" : [ { > "name" : "number of output rows", > "value" : { > "amount" : "2" > } > } > }, > ... > { > "nodeId" : 8, > "nodeName" : "HashAggregate", > "nodeDesc" : "HashAggregate(keys=[name#5, age#6, salary#18], > functions=[avg(cast(age#6 as bigint)), avg(salary#18)])", > "wholeStageCodegenId" : 4, > "stageIds" : [ 8 ], > "metrics" : [ { > "name" : "spill size", > "value" : { > "amount" : "0.0" > } > } > } {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38222) Expose Node Description attribute in SQL Rest API
[ https://issues.apache.org/jira/browse/SPARK-38222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38222: Assignee: Apache Spark > Expose Node Description attribute in SQL Rest API > - > > Key: SPARK-38222 > URL: https://issues.apache.org/jira/browse/SPARK-38222 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Eren Avsarogullari >Assignee: Apache Spark >Priority: Major > > Currently, SQL public Rest API does not expose node description and it is > useful to have nodeDesc attribute at query level to have more details such as: > {code:java} > - Join Operators(BHJ, SMJ, SHJ) => when correlating join operator with join > type and which leg is built for BHJ. > - HashAggregate => aggregated keys and agg functions > - List can be extended for other physical operators.{code} > *Current Sample Json Result:* > {code:java} > { > "nodeId" : 14, > "nodeName" : "BroadcastHashJoin", > "wholeStageCodegenId" : 3, > "stageIds" : [ 5 ], > "metrics" : [ { > "name" : "number of output rows", > "value" : { > "amount" : "2" > } > } > }, > ... > { > "nodeId" : 8, > "nodeName" : "HashAggregate", > "wholeStageCodegenId" : 4, > "stageIds" : [ 8 ], > "metrics" : [ { > "name" : "spill size", > "value" : { > "amount" : "0.0" > } > } > } {code} > *New* {*}Sample Json Result{*}{*}:{*} > {code:java} > { > "nodeId" : 14, > "nodeName" : "BroadcastHashJoin", > "nodeDesc" : "BroadcastHashJoin [id#4], [id#24], Inner, BuildLeft, false", > "wholeStageCodegenId" : 3, > "stageIds" : [ 5 ], > "metrics" : [ { > "name" : "number of output rows", > "value" : { > "amount" : "2" > } > } > }, > ... > { > "nodeId" : 8, > "nodeName" : "HashAggregate", > "nodeDesc" : "HashAggregate(keys=[name#5, age#6, salary#18], > functions=[avg(cast(age#6 as bigint)), avg(salary#18)])", > "wholeStageCodegenId" : 4, > "stageIds" : [ 8 ], > "metrics" : [ { > "name" : "spill size", > "value" : { > "amount" : "0.0" > } > } > } {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38220) Upgrade `commons-math3` to 3.6.1
[ https://issues.apache.org/jira/browse/SPARK-38220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-38220. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35535 [https://github.com/apache/spark/pull/35535] > Upgrade `commons-math3` to 3.6.1 > > > Key: SPARK-38220 > URL: https://issues.apache.org/jira/browse/SPARK-38220 > Project: Spark > Issue Type: Improvement > Components: Build, MLlib >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38220) Upgrade `commons-math3` to 3.6.1
[ https://issues.apache.org/jira/browse/SPARK-38220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-38220: Assignee: Dongjoon Hyun > Upgrade `commons-math3` to 3.6.1 > > > Key: SPARK-38220 > URL: https://issues.apache.org/jira/browse/SPARK-38220 > Project: Spark > Issue Type: Improvement > Components: Build, MLlib >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38223) Spark
Zimo Li created SPARK-38223: --- Summary: Spark Key: SPARK-38223 URL: https://issues.apache.org/jira/browse/SPARK-38223 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 3.2.1 Environment: [https://spark.apache.org/docs/latest/running-on-kubernetes.html#how-it-works] [https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes] [https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes] Reporter: Zimo Li We are using {{spark-submit}} to establish a ThriftServer warehouse on Google Kubernetes Engine. The Spark documentation on running on Kubernetes suggests that we can use [persistentVolumeClaim|https://kubernetes.io/docs/concepts/storage/volumes/#persistentvolumeclaim] for Spark applications. {code:bash} spark-submit \ --master k8s://$KUBERNETES_SERVICE_HOST \ --deploy-mode cluster \ --class $THRIFTSERVER \ --conf spark.sql.catalogImplementation=hive \ --conf spark.sql.hive.metastore.sharedPrefixes=org.postgresql \ --conf spark.hadoop.hive.metastore.schema.verification=false \ --conf spark.hadoop.datanucleus.schema.autoCreateTables=true \ --conf spark.hadoop.datanucleus.autoCreateSchema=false \ --conf spark.sql.parquet.int96RebaseModeInWrite=CORRECTED \ --conf spark.hadoop.javax.jdo.option.ConnectionDriverName=org.postgresql.Driver \ --conf spark.hadoop.javax.jdo.option.ConnectionUserName=spark \ --conf spark.hadoop.javax.jdo.option.ConnectionPassword=Password1! \ --conf spark.sql.warehouse.dir=$MOUNT_PATH \ --conf spark.kubernetes.driver.pod.name=spark-hive-thriftserver-driver \ --conf spark.kubernetes.driver.label.app.kubernetes.io/name=thriftserver \ --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.$VOLUME_NAME.options.claimName=$CLAIM_NAME \ --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.$VOLUME_NAME.mount.path=$MOUNT_PATH \ --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.$VOLUME_NAME.mount.readOnly=false \ --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.$VOLUME_NAME.options.claimName=$CLAIM_NAME \ --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.$VOLUME_NAME.mount.path=$MOUNT_PATH \ --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.$VOLUME_NAME.mount.readOnly=false \ --conf spark.kubernetes.executor.deleteOnTermination=true \ --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-kube \ --conf spark.kubernetes.container.image=$IMAGE \ --conf spark.kubernetes.container.image.pullPolicy=Always \ --conf spark.executor.memory=2g \ --conf spark.driver.memory=2g \ local:///$JAR {code} When it ran, it created one driver and two executors. Each of these wanted to use the same PVC. Unfortunately, at least one of these pods was scheduled on a different node from the rest. Since GKE mounts PVs to nodes in order to honor PVCs for pods, the odd pod out was unable to attach the PV: {code:java} FailedMount Unable to attach or mount volumes: unmounted volumes=[spark-warehouse], unattached volumes=[kube-api-access-grfld spark-conf-volume-exec spark-warehouse spark-local-dir-1]: timed out waiting for the condition {code} This is because GKE, like many cloud providers, does not support {{ReadWriteMany}} for PVCs/PVs. I suggest changing the documentation so that it does not recommend PVCs for ThriftServers. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38222) Expose nodeDesc attribute in SQL Rest API
Eren Avsarogullari created SPARK-38222: -- Summary: Expose nodeDesc attribute in SQL Rest API Key: SPARK-38222 URL: https://issues.apache.org/jira/browse/SPARK-38222 Project: Spark Issue Type: Task Components: SQL Affects Versions: 3.2.0 Reporter: Eren Avsarogullari Currently, the SQL public Rest API does not expose the node description, and it is useful to have a nodeDesc attribute at query level to provide more details, such as: {code:java} - Join Operators(BHJ, SMJ, SHJ) => when correlating join operator with join type and which leg is built for BHJ. - HashAggregate => aggregated keys and agg functions - List can be extended for other physical operators.{code} *Current Sample Json Result:* {code:java} { "nodeId" : 14, "nodeName" : "BroadcastHashJoin", "wholeStageCodegenId" : 3, "stageIds" : [ 5 ], "metrics" : [ { "name" : "number of output rows", "value" : { "amount" : "2" } } }, ... { "nodeId" : 8, "nodeName" : "HashAggregate", "wholeStageCodegenId" : 4, "stageIds" : [ 8 ], "metrics" : [ { "name" : "spill size", "value" : { "amount" : "0.0" } } } {code} *New Sample Json Result:* {code:java} { "nodeId" : 14, "nodeName" : "BroadcastHashJoin", "nodeDesc" : "BroadcastHashJoin [id#4], [id#24], Inner, BuildLeft, false", "wholeStageCodegenId" : 3, "stageIds" : [ 5 ], "metrics" : [ { "name" : "number of output rows", "value" : { "amount" : "2" } } }, ... { "nodeId" : 8, "nodeName" : "HashAggregate", "nodeDesc" : "HashAggregate(keys=[name#5, age#6, salary#18], functions=[avg(cast(age#6 as bigint)), avg(salary#18)])", "wholeStageCodegenId" : 4, "stageIds" : [ 8 ], "metrics" : [ { "name" : "spill size", "value" : { "amount" : "0.0" } } } {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38222) Expose Node Description attribute in SQL Rest API
[ https://issues.apache.org/jira/browse/SPARK-38222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eren Avsarogullari updated SPARK-38222: --- Summary: Expose Node Description attribute in SQL Rest API (was: Expose nodeDesc attribute in SQL Rest API) > Expose Node Description attribute in SQL Rest API > - > > Key: SPARK-38222 > URL: https://issues.apache.org/jira/browse/SPARK-38222 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Eren Avsarogullari >Priority: Major > > Currently, SQL public Rest API does not expose node description and it is > useful to have nodeDesc attribute at query level to have more details such as: > {code:java} > - Join Operators(BHJ, SMJ, SHJ) => when correlating join operator with join > type and which leg is built for BHJ. > - HashAggregate => aggregated keys and agg functions > - List can be extended for other physical operators.{code} > *Current Sample Json Result:* > {code:java} > { > "nodeId" : 14, > "nodeName" : "BroadcastHashJoin", > "wholeStageCodegenId" : 3, > "stageIds" : [ 5 ], > "metrics" : [ { > "name" : "number of output rows", > "value" : { > "amount" : "2" > } > } > }, > ... > { > "nodeId" : 8, > "nodeName" : "HashAggregate", > "wholeStageCodegenId" : 4, > "stageIds" : [ 8 ], > "metrics" : [ { > "name" : "spill size", > "value" : { > "amount" : "0.0" > } > } > } {code} > *New* {*}Sample Json Result{*}{*}:{*} > {code:java} > { > "nodeId" : 14, > "nodeName" : "BroadcastHashJoin", > "nodeDesc" : "BroadcastHashJoin [id#4], [id#24], Inner, BuildLeft, false", > "wholeStageCodegenId" : 3, > "stageIds" : [ 5 ], > "metrics" : [ { > "name" : "number of output rows", > "value" : { > "amount" : "2" > } > } > }, > ... > { > "nodeId" : 8, > "nodeName" : "HashAggregate", > "nodeDesc" : "HashAggregate(keys=[name#5, age#6, salary#18], > functions=[avg(cast(age#6 as bigint)), avg(salary#18)])", > "wholeStageCodegenId" : 4, > "stageIds" : [ 8 ], > "metrics" : [ { > "name" : "spill size", > "value" : { > "amount" : "0.0" > } > } > } {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
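As a usage note on the ticket above: the payload those JSON samples come from can be fetched from a live application's REST API. A sketch, assuming the driver UI listens on its default port 4040 and that the SQL route is {{/api/v1/applications/<app-id>/sql}}; the application id below is a placeholder:

{code:scala}
import scala.io.Source

// Sketch: dump the SQL REST API payload that the proposed nodeDesc field
// would be added to. The application id is a placeholder; look it up on the
// driver UI or via /api/v1/applications.
object FetchSqlApi {
  def main(args: Array[String]): Unit = {
    val appId = "app-20220216120000-0000"
    val url = s"http://localhost:4040/api/v1/applications/$appId/sql?details=true"
    println(Source.fromURL(url).mkString)
  }
}
{code}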
[jira] [Commented] (SPARK-38205) The columns in state schema should be relaxed to be nullable
[ https://issues.apache.org/jira/browse/SPARK-38205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492895#comment-17492895 ] Jungtaek Lim commented on SPARK-38205: -- I realized the output schema should also be nullable (since the operator will produce output from state), and I am now puzzled whether there may be cases where I'm going to break an existing query (a DSv2 sink may check the nullability when writing). I guess another way is to never change the nullability in the optimizer and keep the nullability check in state. I would rely on the less invasive way if there is one, since the lifetime of a streaming query is long, spanning Spark versions, and compatibility is a major concern. > The columns in state schema should be relaxed to be nullable > > > Key: SPARK-38205 > URL: https://issues.apache.org/jira/browse/SPARK-38205 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.1.2, 3.2.1, 3.3.0 >Reporter: Jungtaek Lim >Priority: Major > > Starting from SPARK-27237, Spark validates the schema of state across query > runs to make sure it doesn't fall into weirder issues like a SIGSEGV at > runtime. > The comparison logic is reasonable in terms of nullability; it follows the > matrix below: > ||existing schema||new schema||allowed|| > |nullable|nullable|O| > |nullable|non-nullable|O| > |non-nullable|nullable|X| > |non-nullable|non-nullable|O| > What we miss here is that the nullability of a column can be changed by the > optimizer (mostly nullable to non-nullable), and that optimization could be > applied differently after any simple change. > So this scenario is hypothetically possible: > 1. At the first run of the query, the optimizer marks some columns as > non-nullable, and that goes into the schema of the state. (The state schema > has a non-nullable column.) > 2. At the second run of the query (possibly with code modification or an > upgraded Spark version), the optimizer no longer marks such columns as > non-nullable, and the schema comparison of the state (existing vs new) pits > non-nullable (existing) against nullable (new), which is NOT allowed. > In terms of the storage view of the state store, there is no need to mark a > column as non-nullable vs nullable. Interface-wise, the state store has no > concept of schema, so it is safe to relax this constraint and leave the > optimizer free to do whatever it wants without breaking stateful operators. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
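For illustration, the compatibility matrix quoted above reduces to one predicate per column: an existing nullable column accepts any newcomer, while an existing non-nullable column rejects a nullable one. A hypothetical sketch of that check (not Spark's actual state-schema validator):

{code:scala}
import org.apache.spark.sql.types.StructType

// Hypothetical helper mirroring the matrix in SPARK-38205's description:
// existing nullable      -> compatible with any new nullability;
// existing non-nullable  -> compatible only if the new column is non-nullable.
object StateSchemaCheck {
  def nullabilityCompatible(existing: StructType, incoming: StructType): Boolean =
    existing.fields.zip(incoming.fields).forall { case (oldField, newField) =>
      oldField.nullable || !newField.nullable
    }
}
{code}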
[jira] [Commented] (SPARK-38221) Group by a stream of complex expressions fails
[ https://issues.apache.org/jira/browse/SPARK-38221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492869#comment-17492869 ] Bruce Robbins commented on SPARK-38221: --- I think I have an idea what's going on. I will submit a PR soon. > Group by a stream of complex expressions fails > -- > > Key: SPARK-38221 > URL: https://issues.apache.org/jira/browse/SPARK-38221 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1, 3.3.0 >Reporter: Bruce Robbins >Priority: Major > > This query fails: > {noformat} > scala> Seq(1).toDF("id").groupBy(Stream($"id" + 1, $"id" + 2): > _*).sum("id").show(false) > java.lang.IllegalStateException: Couldn't find _groupingexpression#24 in > [id#4,_groupingexpression#23] > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:83) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:425) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:73) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:94) > at scala.collection.immutable.Stream.$anonfun$map$1(Stream.scala:418) > at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1173) > at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1163) > at scala.collection.immutable.Stream.$anonfun$map$1(Stream.scala:418) > at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1173) > at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1163) > at scala.collection.immutable.Stream.foreach(Stream.scala:534) > at scala.collection.TraversableOnce.count(TraversableOnce.scala:152) > at scala.collection.TraversableOnce.count$(TraversableOnce.scala:145) > at scala.collection.AbstractTraversable.count(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.createCode(GenerateUnsafeProjection.scala:293) > at > org.apache.spark.sql.execution.aggregate.HashAggregateExec.doConsumeWithKeys(HashAggregateExec.scala:623) > {noformat} > However, replace {{Stream}} with {{Seq}} and it works: > {noformat} > scala> Seq(1).toDF("id").groupBy(Seq($"id" + 1, $"id" + 2): > _*).sum("id").show(false) > +++---+ > |(id + 1)|(id + 2)|sum(id)| > +++---+ > |2 |3 |1 | > +++---+ > scala> > {noformat} > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38221) Group by a stream of complex expressions fails
Bruce Robbins created SPARK-38221: - Summary: Group by a stream of complex expressions fails Key: SPARK-38221 URL: https://issues.apache.org/jira/browse/SPARK-38221 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.1, 3.3.0 Reporter: Bruce Robbins This query fails: {noformat} scala> Seq(1).toDF("id").groupBy(Stream($"id" + 1, $"id" + 2): _*).sum("id").show(false) java.lang.IllegalStateException: Couldn't find _groupingexpression#24 in [id#4,_groupingexpression#23] at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:83) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:425) at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:73) at org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:94) at scala.collection.immutable.Stream.$anonfun$map$1(Stream.scala:418) at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1173) at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1163) at scala.collection.immutable.Stream.$anonfun$map$1(Stream.scala:418) at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1173) at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1163) at scala.collection.immutable.Stream.foreach(Stream.scala:534) at scala.collection.TraversableOnce.count(TraversableOnce.scala:152) at scala.collection.TraversableOnce.count$(TraversableOnce.scala:145) at scala.collection.AbstractTraversable.count(Traversable.scala:108) at org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.createCode(GenerateUnsafeProjection.scala:293) at org.apache.spark.sql.execution.aggregate.HashAggregateExec.doConsumeWithKeys(HashAggregateExec.scala:623) {noformat} However, replace {{Stream}} with {{Seq}} and it works: {noformat} scala> Seq(1).toDF("id").groupBy(Seq($"id" + 1, $"id" + 2): _*).sum("id").show(false) +--------+--------+-------+ |(id + 1)|(id + 2)|sum(id)| +--------+--------+-------+ |2       |3       |1      | +--------+--------+-------+ scala> {noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
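The stack trace points at Scala's lazy {{Stream}}: {{Stream.map}} defers evaluation, so some grouping expressions are still unevaluated while Catalyst rewrites the plan. Until the fix in the pull request above lands, one user-side workaround sketch (using only the public API from the report) is to force the stream into a strict collection first:

{code:scala}
import org.apache.spark.sql.SparkSession

// Workaround sketch, not the eventual fix: materialize the lazy Stream into
// a strict List so all grouping expressions are evaluated before Catalyst
// rewrites them.
object StrictGroupBy {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("strict-groupby").getOrCreate()
    import spark.implicits._

    val exprs = Stream($"id" + 1, $"id" + 2)
    // .toList forces every element up front, sidestepping the deferred
    // Stream.map calls visible in the stack trace above.
    Seq(1).toDF("id").groupBy(exprs.toList: _*).sum("id").show(false)

    spark.stop()
  }
}
{code}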
[jira] [Commented] (SPARK-38115) No spark conf to control the path of _temporary when writing to target filesystem
[ https://issues.apache.org/jira/browse/SPARK-38115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492824#comment-17492824 ] kk commented on SPARK-38115: Is there any config to stop using FileOutputCommitter? We didn't set any conf explicitly to choose a committer. Moreover, when overwriting on s3:// I don't have a problem with _temporary; the problem only comes when the path uses s3a://. I am just looking for a conf/option that lets me use a staging location for temporary data and keep the target path as the primary output. > No spark conf to control the path of _temporary when writing to target > filesystem > - > > Key: SPARK-38115 > URL: https://issues.apache.org/jira/browse/SPARK-38115 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.8, 3.2.1 >Reporter: kk >Priority: Minor > Labels: spark, spark-conf, spark-sql, spark-submit > > No default spark conf or param to control the '_temporary' path when writing > to filesystem. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38220) Upgrade `commons-math3` to 3.6.1
[ https://issues.apache.org/jira/browse/SPARK-38220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38220: Assignee: Apache Spark > Upgrade `commons-math3` to 3.6.1 > > > Key: SPARK-38220 > URL: https://issues.apache.org/jira/browse/SPARK-38220 > Project: Spark > Issue Type: Improvement > Components: Build, MLlib >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38220) Upgrade `commons-math3` to 3.6.1
[ https://issues.apache.org/jira/browse/SPARK-38220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38220: Assignee: (was: Apache Spark) > Upgrade `commons-math3` to 3.6.1 > > > Key: SPARK-38220 > URL: https://issues.apache.org/jira/browse/SPARK-38220 > Project: Spark > Issue Type: Improvement > Components: Build, MLlib >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38220) Upgrade `commons-math3` to 3.6.1
[ https://issues.apache.org/jira/browse/SPARK-38220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492823#comment-17492823 ] Apache Spark commented on SPARK-38220: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/35535 > Upgrade `commons-math3` to 3.6.1 > > > Key: SPARK-38220 > URL: https://issues.apache.org/jira/browse/SPARK-38220 > Project: Spark > Issue Type: Improvement > Components: Build, MLlib >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38220) Upgrade `commons-math3` to 3.6.1
Dongjoon Hyun created SPARK-38220: - Summary: Upgrade `commons-math3` to 3.6.1 Key: SPARK-38220 URL: https://issues.apache.org/jira/browse/SPARK-38220 Project: Spark Issue Type: Improvement Components: Build, MLlib Affects Versions: 3.3.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38115) No spark conf to control the path of _temporary when writing to target filesystem
[ https://issues.apache.org/jira/browse/SPARK-38115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492810#comment-17492810 ] Steve Loughran commented on SPARK-38115: * Stop using the classic FileOutputCommitter for your work, unless you like waiting a long time for your jobs to complete, along with a risk of corrupt data in the presence of worker failures. * The choice of where temporary paths go is a function of the committer, not the Spark codebase; the s3a staging committer, for example, uses the local fs. * The magic committer does work under _temporary, but it doesn't write the final data there. It's "magic", after all. > No spark conf to control the path of _temporary when writing to target > filesystem > - > > Key: SPARK-38115 > URL: https://issues.apache.org/jira/browse/SPARK-38115 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.8, 3.2.1 >Reporter: kk >Priority: Minor > Labels: spark, spark-conf, spark-sql, spark-submit > > No default spark conf or param to control the '_temporary' path when writing > to filesystem. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
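To make the committer switch above concrete: it is a configuration change, not a code change. A sketch, assuming a Spark build that bundles the {{hadoop-cloud}} module (which supplies {{PathOutputCommitProtocol}} and the Parquet binding); the bucket path is a placeholder:

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch: route writes through the S3A "magic" committer instead of the
// classic FileOutputCommitter. Assumes the hadoop-cloud module is on the
// classpath; s3a://my-bucket/table is a placeholder path.
object MagicCommitterExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("s3a-magic-committer")
      .config("spark.hadoop.fs.s3a.committer.name", "magic")
      .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
      .config("spark.sql.sources.commitProtocolClass",
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
      .config("spark.sql.parquet.output.committer.class",
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
      .getOrCreate()

    spark.range(10).write.mode("overwrite").parquet("s3a://my-bucket/table")
    spark.stop()
  }
}
{code}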
[jira] [Commented] (SPARK-38115) No spark conf to control the path of _temporary when writing to target filesystem
[ https://issues.apache.org/jira/browse/SPARK-38115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492785#comment-17492785 ] kk commented on SPARK-38115: Hello [~hyukjin.kwon], did you get a chance to look into this? > No spark conf to control the path of _temporary when writing to target > filesystem > - > > Key: SPARK-38115 > URL: https://issues.apache.org/jira/browse/SPARK-38115 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.8, 3.2.1 >Reporter: kk >Priority: Minor > Labels: spark, spark-conf, spark-sql, spark-submit > > No default spark conf or param to control the '_temporary' path when writing > to filesystem. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38130) array_sort does not allow non-orderable datatypes
[ https://issues.apache.org/jira/browse/SPARK-38130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-38130. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35426 [https://github.com/apache/spark/pull/35426] > array_sort does not allow non-orderable datatypes > - > > Key: SPARK-38130 > URL: https://issues.apache.org/jira/browse/SPARK-38130 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 > Environment: >Reporter: Steven Aerts >Assignee: Steven Aerts >Priority: Major > Fix For: 3.3.0 > > > {{array_sort}} has a check to see whether the entries it has to sort are > orderable. I think this check should be removed, because even entries which > are not orderable can be made orderable by a lambda function. > {code:java} > Seq((Array[Map[String, Int]](Map("a" -> 1), Map()), "x")).toDF("a", > "b").selectExpr("array_sort(a, (x,y) -> cardinality(x) - > cardinality(y))"){code} > fails with: > {code:java} > org.apache.spark.sql.AnalysisException: cannot resolve 'array_sort(`a`, > lambdafunction((cardinality(namedlambdavariable()) - > cardinality(namedlambdavariable())), namedlambdavariable(), > namedlambdavariable()))' due to data type mismatch: array_sort does not > support sorting array of type map which is not orderable {code} > Meanwhile, the case where this check is relevant fails with a different > error, which is triggered earlier in the code path: > {code:java} > > Seq((Array[Map[String, Int]](Map("a" -> 1), Map()), "x")).toDF("a", > > "b").selectExpr("array_sort(a)"){code} > Fails with: > {code:java} > org.apache.spark.sql.AnalysisException: cannot resolve > '(namedlambdavariable() < namedlambdavariable())' due to data type mismatch: > LessThan does not support ordering on type map; line 1 pos 0; > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38130) array_sort does not allow non-orderable datatypes
[ https://issues.apache.org/jira/browse/SPARK-38130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-38130: --- Assignee: Steven Aerts > array_sort does not allow non-orderable datatypes > - > > Key: SPARK-38130 > URL: https://issues.apache.org/jira/browse/SPARK-38130 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 > Environment: >Reporter: Steven Aerts >Assignee: Steven Aerts >Priority: Major > > {{array_sort}} has a check to see whether the entries it has to sort are > orderable. I think this check should be removed, because even entries which > are not orderable can be made orderable by a lambda function. > {code:java} > Seq((Array[Map[String, Int]](Map("a" -> 1), Map()), "x")).toDF("a", > "b").selectExpr("array_sort(a, (x,y) -> cardinality(x) - > cardinality(y))"){code} > fails with: > {code:java} > org.apache.spark.sql.AnalysisException: cannot resolve 'array_sort(`a`, > lambdafunction((cardinality(namedlambdavariable()) - > cardinality(namedlambdavariable())), namedlambdavariable(), > namedlambdavariable()))' due to data type mismatch: array_sort does not > support sorting array of type map which is not orderable {code} > Meanwhile, the case where this check is relevant fails with a different > error, which is triggered earlier in the code path: > {code:java} > > Seq((Array[Map[String, Int]](Map("a" -> 1), Map()), "x")).toDF("a", > > "b").selectExpr("array_sort(a)"){code} > Fails with: > {code:java} > org.apache.spark.sql.AnalysisException: cannot resolve > '(namedlambdavariable() < namedlambdavariable())' due to data type mismatch: > LessThan does not support ordering on type map; line 1 pos 0; > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
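With the fix above (slated for 3.3.0), the comparator form from the report should work even though maps are not orderable, since the lambda alone supplies the ordering. A brief sketch assembled from the report's own example:

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch built from the report's example: sort an array of maps (a
// non-orderable element type) by cardinality via the comparator overload
// of array_sort. Expected to work once the SPARK-38130 fix ships in 3.3.0.
object ArraySortByCardinality {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("array-sort").getOrCreate()
    import spark.implicits._

    Seq((Array[Map[String, Int]](Map("a" -> 1), Map()), "x"))
      .toDF("a", "b")
      .selectExpr("array_sort(a, (x, y) -> cardinality(x) - cardinality(y))")
      .show(false)

    spark.stop()
  }
}
{code}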
[jira] [Updated] (SPARK-36808) Upgrade Kafka to 2.8.1
[ https://issues.apache.org/jira/browse/SPARK-36808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-36808: --- Fix Version/s: 3.2.2 > Upgrade Kafka to 2.8.1 > -- > > Key: SPARK-36808 > URL: https://issues.apache.org/jira/browse/SPARK-36808 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.1, 3.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.3.0, 3.2.2 > > > A few hours ago, Kafka 2.8.1 was released, which includes a bunch of bug > fixes. > https://downloads.apache.org/kafka/2.8.1/RELEASE_NOTES.html -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36808) Upgrade Kafka to 2.8.1
[ https://issues.apache.org/jira/browse/SPARK-36808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-36808: --- Affects Version/s: 3.2.1 > Upgrade Kafka to 2.8.1 > -- > > Key: SPARK-36808 > URL: https://issues.apache.org/jira/browse/SPARK-36808 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.1, 3.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.3.0 > > > A few hours ago, Kafka 2.8.1 was released, which includes a bunch of bug > fixes. > https://downloads.apache.org/kafka/2.8.1/RELEASE_NOTES.html -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37425) Inline type hints for python/pyspark/mllib/recommendation.py
[ https://issues.apache.org/jira/browse/SPARK-37425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492616#comment-17492616 ] Maciej Szymkiewicz commented on SPARK-37425: Hi [~amirkdv], just FYI ‒ most of the blockers are already resolved. It should be possible to pick up the pending changes from SPARK-37428 and SPARK-37154 and complete this, or any of the remaining ones in mllib. > Inline type hints for python/pyspark/mllib/recommendation.py > > > Key: SPARK-37425 > URL: https://issues.apache.org/jira/browse/SPARK-37425 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > Inline type hints from python/pyspark/mllib/recommendation.pyi to > python/pyspark/mllib/recommendation.py -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37413) Inline type hints for python/pyspark/ml/tree.py
[ https://issues.apache.org/jira/browse/SPARK-37413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz reassigned SPARK-37413: -- Assignee: dch nguyen > Inline type hints for python/pyspark/ml/tree.py > --- > > Key: SPARK-37413 > URL: https://issues.apache.org/jira/browse/SPARK-37413 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: dch nguyen >Priority: Major > > Inline type hints from python/pyspark/ml/tree.pyi to > python/pyspark/ml/tree.py. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37413) Inline type hints for python/pyspark/ml/tree.py
[ https://issues.apache.org/jira/browse/SPARK-37413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz resolved SPARK-37413. Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35420 [https://github.com/apache/spark/pull/35420] > Inline type hints for python/pyspark/ml/tree.py > --- > > Key: SPARK-37413 > URL: https://issues.apache.org/jira/browse/SPARK-37413 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: dch nguyen >Priority: Major > Fix For: 3.3.0 > > > Inline type hints from python/pyspark/ml/tree.pyi to > python/pyspark/ml/tree.py. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37428) Inline type hints for python/pyspark/mllib/util.py
[ https://issues.apache.org/jira/browse/SPARK-37428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37428: Assignee: (was: Apache Spark) > Inline type hints for python/pyspark/mllib/util.py > -- > > Key: SPARK-37428 > URL: https://issues.apache.org/jira/browse/SPARK-37428 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > Inline type hints from python/pyspark/mllib/util.pyi to > python/pyspark/mllib/util.py -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37428) Inline type hints for python/pyspark/mllib/util.py
[ https://issues.apache.org/jira/browse/SPARK-37428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37428: Assignee: Apache Spark > Inline type hints for python/pyspark/mllib/util.py > -- > > Key: SPARK-37428 > URL: https://issues.apache.org/jira/browse/SPARK-37428 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Apache Spark >Priority: Major > > Inline type hints from python/pyspark/mllib/util.pyi to > python/pyspark/mllib/util.py -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37428) Inline type hints for python/pyspark/mllib/util.py
[ https://issues.apache.org/jira/browse/SPARK-37428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492599#comment-17492599 ] Apache Spark commented on SPARK-37428: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/35532 > Inline type hints for python/pyspark/mllib/util.py > -- > > Key: SPARK-37428 > URL: https://issues.apache.org/jira/browse/SPARK-37428 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > Inline type hints from python/pyspark/mllib/util.pyi to > python/pyspark/mllib/util.py -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38199) Delete the unused `dataType` specified in the definition of `IntervalColumnAccessor`
[ https://issues.apache.org/jira/browse/SPARK-38199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-38199: Assignee: Yang Jie > Delete the unused `dataType` specified in the definition of > `IntervalColumnAccessor` > > > Key: SPARK-38199 > URL: https://issues.apache.org/jira/browse/SPARK-38199 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > SPARK-30066 introduced `IntervalColumnAccessor`, which accepts 2 constructor > parameters, `buffer` and `dataType`; the `dataType` is unused because the > parameter passed to `BasicColumnAccessor` is the constant `CALENDAR_INTERVAL` -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38199) Delete the unused `dataType` specified in the definition of `IntervalColumnAccessor`
[ https://issues.apache.org/jira/browse/SPARK-38199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-38199. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35507 [https://github.com/apache/spark/pull/35507] > Delete the unused `dataType` specified in the definition of > `IntervalColumnAccessor` > > > Key: SPARK-38199 > URL: https://issues.apache.org/jira/browse/SPARK-38199 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.3.0 > > > SPARK-30066 introduced `IntervalColumnAccessor`, which accepts 2 constructor > parameters, `buffer` and `dataType`; the `dataType` is unused because the > parameter passed to `BasicColumnAccessor` is the constant `CALENDAR_INTERVAL` -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
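The cleanup above is the standard remove-an-unused-parameter refactor: the superclass call hardwires the column type, so the constructor argument never matters. A simplified, hypothetical mirror of the shape (these are not the actual Spark classes):

{code:scala}
import org.apache.spark.sql.types.{CalendarIntervalType, DataType}

// Hypothetical, simplified mirror of the pattern described in the ticket.
class BasicAccessor(buffer: Array[Byte], tpe: DataType)

// Before: dataType is accepted but ignored, because the superclass call
// always passes the CalendarIntervalType constant.
class IntervalAccessorBefore(buffer: Array[Byte], dataType: DataType)
  extends BasicAccessor(buffer, CalendarIntervalType)

// After: the dead parameter is removed.
class IntervalAccessorAfter(buffer: Array[Byte])
  extends BasicAccessor(buffer, CalendarIntervalType)
{code}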
[jira] [Assigned] (SPARK-38219) Support ANSI aggregation function percentile_cont as window function
[ https://issues.apache.org/jira/browse/SPARK-38219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38219: Assignee: Apache Spark > Support ANSI aggregation function percentile_cont as window function > > > Key: SPARK-38219 > URL: https://issues.apache.org/jira/browse/SPARK-38219 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Assignee: Apache Spark >Priority: Major > > percentile_cont is an aggregate function; some databases support it as a > window function. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38219) Support ANSI aggregation function percentile_cont as window function
[ https://issues.apache.org/jira/browse/SPARK-38219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38219: Assignee: (was: Apache Spark) > Support ANSI aggregation function percentile_cont as window function > > > Key: SPARK-38219 > URL: https://issues.apache.org/jira/browse/SPARK-38219 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Priority: Major > > percentile_cont is an aggregate function; some databases support it as a > window function. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38219) Support ANSI aggregation function percentile_cont as window function
[ https://issues.apache.org/jira/browse/SPARK-38219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492571#comment-17492571 ] Apache Spark commented on SPARK-38219: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/35531 > Support ANSI aggregation function percentile_cont as window function > > > Key: SPARK-38219 > URL: https://issues.apache.org/jira/browse/SPARK-38219 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Priority: Major > > percentile_cont is an aggregate function; some databases support it as a > window function. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38219) Support ANSI aggregation function percentile_cont as window function
[ https://issues.apache.org/jira/browse/SPARK-38219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492572#comment-17492572 ] Apache Spark commented on SPARK-38219: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/35531 > Support ANSI aggregation function percentile_cont as window function > > > Key: SPARK-38219 > URL: https://issues.apache.org/jira/browse/SPARK-38219 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Priority: Major > > percentile_cont is an aggregate function; some databases support it as a > window function. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38219) Support ANSI aggregation function percentile_cont as window function
jiaan.geng created SPARK-38219: -- Summary: Support ANSI aggregation function percentile_cont as window function Key: SPARK-38219 URL: https://issues.apache.org/jira/browse/SPARK-38219 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: jiaan.geng percentile_cont is an aggregate function; some databases support it as a window function. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
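The ANSI form pairs {{WITHIN GROUP (ORDER BY ...)}} with an {{OVER}} clause. A hypothetical usage sketch of the syntax this ticket targets, as it exists in other engines; the table and column names are placeholders, and released Spark versions at the time of this ticket do not accept it yet:

{code:scala}
import org.apache.spark.sql.SparkSession

// Hypothetical sketch of ANSI percentile_cont used as a window function,
// the shape this ticket proposes. "employees", "dept" and "salary" are
// placeholder names.
object PercentileContWindow {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("percentile-cont").getOrCreate()

    spark.sql(
      """SELECT dept, salary,
        |       percentile_cont(0.5) WITHIN GROUP (ORDER BY salary)
        |         OVER (PARTITION BY dept) AS median_salary
        |FROM employees""".stripMargin).show(false)

    spark.stop()
  }
}
{code}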
[jira] [Commented] (SPARK-37405) Inline type hints for python/pyspark/ml/feature.py
[ https://issues.apache.org/jira/browse/SPARK-37405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492558#comment-17492558 ] Apache Spark commented on SPARK-37405: -- User 'dchvn' has created a pull request for this issue: https://github.com/apache/spark/pull/35530 > Inline type hints for python/pyspark/ml/feature.py > -- > > Key: SPARK-37405 > URL: https://issues.apache.org/jira/browse/SPARK-37405 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > Inline type hints from python/pyspark/ml/feature.pyi to > python/pyspark/ml/feature.py -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37405) Inline type hints for python/pyspark/ml/feature.py
[ https://issues.apache.org/jira/browse/SPARK-37405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492559#comment-17492559 ] Apache Spark commented on SPARK-37405: -- User 'dchvn' has created a pull request for this issue: https://github.com/apache/spark/pull/35530 > Inline type hints for python/pyspark/ml/feature.py > -- > > Key: SPARK-37405 > URL: https://issues.apache.org/jira/browse/SPARK-37405 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > Inline type hints from python/pyspark/ml/feature.pyi to > python/pyspark/ml/feature.py -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37405) Inline type hints for python/pyspark/ml/feature.py
[ https://issues.apache.org/jira/browse/SPARK-37405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37405: Assignee: Apache Spark > Inline type hints for python/pyspark/ml/feature.py > -- > > Key: SPARK-37405 > URL: https://issues.apache.org/jira/browse/SPARK-37405 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Apache Spark >Priority: Major > > Inline type hints from python/pyspark/ml/feature.pyi to > python/pyspark/ml/feature.py -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37405) Inline type hints for python/pyspark/ml/feature.py
[ https://issues.apache.org/jira/browse/SPARK-37405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37405: Assignee: (was: Apache Spark) > Inline type hints for python/pyspark/ml/feature.py > -- > > Key: SPARK-37405 > URL: https://issues.apache.org/jira/browse/SPARK-37405 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > Inline type hints from python/pyspark/ml/feature.pyi to > python/pyspark/ml/feature.py -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38218) Looks like the wrong package is available on the spark downloads page. The name reads pre built for hadoop3.3 but the tgz file is marked as hadoop3.2
[ https://issues.apache.org/jira/browse/SPARK-38218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mehul Batra updated SPARK-38218: Attachment: Screenshot_20220214-013156.jpg > Looks like the wrong package is available on the spark downloads page. The > name reads pre built for hadoop3.3 but the tgz file is marked as hadoop3.2 > - > > Key: SPARK-38218 > URL: https://issues.apache.org/jira/browse/SPARK-38218 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 3.2.1 >Reporter: Mehul Batra >Priority: Major > Attachments: Screenshot_20220214-013156.jpg > > > !https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg! > !https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg! > Does the tgz have Hadoop 3.3 but it was written wrong or it is 3.2 Hadoop > version only? > if yes is hadoop comes with the S3 magic commitor support. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38218) Looks like the wrong package is available on the spark downloads page. The name reads pre built for hadoop3.3 but the tgz file is marked as hadoop3.2
[ https://issues.apache.org/jira/browse/SPARK-38218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mehul Batra updated SPARK-38218: Description: !https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg! !https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg! Does the tgz contain Hadoop 3.3 and the name is just wrong, or is it really the Hadoop 3.2 version? If so, does that Hadoop build come with S3 magic committer support? was: !https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg! Does the tgz contain Hadoop 3.3 and the name is just wrong, or is it really the Hadoop 3.2 version? If so, does that Hadoop build come with S3 magic committer support? > Looks like the wrong package is available on the spark downloads page. The > name reads pre built for hadoop3.3 but the tgz file is marked as hadoop3.2 > - > > Key: SPARK-38218 > URL: https://issues.apache.org/jira/browse/SPARK-38218 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 3.2.1 >Reporter: Mehul Batra >Priority: Major > > !https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg! > !https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg! > Does the tgz contain Hadoop 3.3 and the name is just wrong, or is it really the > Hadoop 3.2 version? > If so, does that Hadoop build come with S3 magic committer support? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38218) Looks like the wrong package is available on the spark downloads page. The name reads pre built for hadoop3.3 but the tgz file is marked as hadoop3.2
Mehul Batra created SPARK-38218: --- Summary: Looks like the wrong package is available on the spark downloads page. The name reads pre built for hadoop3.3 but the tgz file is marked as hadoop3.2 Key: SPARK-38218 URL: https://issues.apache.org/jira/browse/SPARK-38218 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 3.2.1 Reporter: Mehul Batra !https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg! Does the tgz contain Hadoop 3.3 and the name is just wrong, or is it really the Hadoop 3.2 version? If so, does that Hadoop build come with S3 magic committer support? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37913) Null Pointer Exception when Loading ML Pipeline Model with Custom Transformer
[ https://issues.apache.org/jira/browse/SPARK-37913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492517#comment-17492517 ] zhengruifeng commented on SPARK-37913: -- does the `MyTransformer` in the example work? > Null Pointer Exception when Loading ML Pipeline Model with Custom Transformer > - > > Key: SPARK-37913 > URL: https://issues.apache.org/jira/browse/SPARK-37913 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2 > Environment: Spark 3.1.2, Scala 2.12, Java 11 >Reporter: Alana Young >Priority: Critical > Labels: MLPipelineModels, MLPipelines > > I am trying to create and persist an ML pipeline model using a custom Spark > transformer that I created based on the [Unary Transformer > example|https://github.com/apache/spark/blob/v3.1.2/examples/src/main/scala/org/apache/spark/examples/ml/UnaryTransformerExample.scala] > provided by Spark. I am able to save and load the transformer. When I > include the custom transformer as a stage in a pipeline model, I can save the > model, but am unable to load it. Here is the stack trace of the exception: > > {code:java} > 01-14-2022 03:49:52 PM ERROR Instrumentation: java.lang.NullPointerException > at java.base/java.lang.reflect.Method.invoke(Method.java:559) at > org.apache.spark.ml.util.DefaultParamsReader$.loadParamsInstanceReader(ReadWrite.scala:631) > at > org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$4(Pipeline.scala:276) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at > scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at > scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at > scala.collection.TraversableLike.map(TraversableLike.scala:238) at > scala.collection.TraversableLike.map$(TraversableLike.scala:231) at > scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198) at > org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$3(Pipeline.scala:274) > at > org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191) > at scala.util.Try$.apply(Try.scala:213) at > org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191) > at org.apache.spark.ml.Pipeline$SharedReadWrite$.load(Pipeline.scala:268) at > org.apache.spark.ml.PipelineModel$PipelineModelReader.$anonfun$load$7(Pipeline.scala:356) > at org.apache.spark.ml.MLEvents.withLoadInstanceEvent(events.scala:160) at > org.apache.spark.ml.MLEvents.withLoadInstanceEvent$(events.scala:155) at > org.apache.spark.ml.util.Instrumentation.withLoadInstanceEvent(Instrumentation.scala:42) > at > org.apache.spark.ml.PipelineModel$PipelineModelReader.$anonfun$load$6(Pipeline.scala:355) > at > org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191) > at scala.util.Try$.apply(Try.scala:213) at > org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191) > at > org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:355) > at > org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:349) > at org.apache.spark.ml.util.MLReadable.load(ReadWrite.scala:355) at > org.apache.spark.ml.util.MLReadable.load$(ReadWrite.scala:355) at > org.apache.spark.ml.PipelineModel$.load(Pipeline.scala:337) at > com.dtech.scala.pipeline.PipelineProcess.process(PipelineProcess.scala:122) > at 
com.dtech.scala.pipeline.PipelineProcess$.main(PipelineProcess.scala:448) > at com.dtech.scala.pipeline.PipelineProcess.main(PipelineProcess.scala) at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native > Method) at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.base/java.lang.reflect.Method.invoke(Method.java:566) at > org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:65) at > org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala){code} > > *Source Code* > [Unary > Transformer|https://gist.github.com/ally1221/ff10ec50f7ef98fb6cd365172195bfd5] > [Persist Unary Transformer & Pipeline > Model|https://gist.github.com/ally1221/42473cdc818a8cf795ac78d65d48ee14] -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
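[Editorial note] A pattern worth checking here, since `loadParamsInstanceReader` resolves the reader reflectively: custom transformers that mix in DefaultParamsWritable also need a companion object extending DefaultParamsReadable for PipelineModel.load to find. Whether that is the cause of this particular NPE is an assumption; the sketch below only shows the shape, with illustrative names (the reporter's real code is in the linked gists).
{code:scala}
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable}
import org.apache.spark.sql.types.{DataType, DataTypes}

// Minimal unary transformer, loosely following UnaryTransformerExample.
class MyDoubler(override val uid: String)
  extends UnaryTransformer[Double, Double, MyDoubler] with DefaultParamsWritable {

  def this() = this(Identifiable.randomUID("myDoubler"))

  override protected def createTransformFunc: Double => Double = _ * 2.0

  override protected def outputDataType: DataType = DataTypes.DoubleType
}

// The companion object is what DefaultParamsReader looks up reflectively when
// a saved Pipeline/PipelineModel stage is loaded; without it there is nothing
// for the load path to invoke.
object MyDoubler extends DefaultParamsReadable[MyDoubler]
{code}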
[jira] [Commented] (SPARK-38112) Use error classes in the execution errors of date/timestamp handling
[ https://issues.apache.org/jira/browse/SPARK-38112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492514#comment-17492514 ] huangtengfei commented on SPARK-38112: -- I will work on this. Thanks [~maxgekk] > Use error classes in the execution errors of date/timestamp handling > > > Key: SPARK-38112 > URL: https://issues.apache.org/jira/browse/SPARK-38112 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Major > > Migrate the following errors in QueryExecutionErrors: > * sparkUpgradeInReadingDatesError > * sparkUpgradeInWritingDatesError > * timeZoneIdNotSpecifiedForTimestampTypeError > * cannotConvertOrcTimestampToTimestampNTZError > to use error classes. Throw an implementation of SparkThrowable. Also write > a test for every error in QueryExecutionErrorsSuite. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
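[Editorial note] For context on what "use error classes" means here: each migrated error gets a stable error-class identifier and a parameterized message rather than a free-form string. The snippet below is a self-contained sketch of that shape, not Spark's actual internals; the exception class, error-class id, and parameters are all assumptions.
{code:scala}
// Self-contained sketch of the error-class pattern (illustrative, not Spark's API).
trait HasErrorClass { def getErrorClass: String }

class DateTimeRebaseException(val errorClass: String, params: Map[String, String])
  extends RuntimeException(
    s"[$errorClass] " + params.map { case (k, v) => s"$k=$v" }.mkString(", "))
  with HasErrorClass {
  override def getErrorClass: String = errorClass
}

// What a migrated sparkUpgradeInReadingDatesError might roughly map onto:
def sparkUpgradeInReadingDatesError(format: String, config: String): Throwable =
  new DateTimeRebaseException(
    "INCONSISTENT_BEHAVIOR_CROSS_VERSION", // hypothetical error-class id
    Map("format" -> format, "config" -> config))
{code}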
[jira] [Commented] (SPARK-38124) Revive HashClusteredDistribution and apply to stream-stream join
[ https://issues.apache.org/jira/browse/SPARK-38124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492506#comment-17492506 ] Apache Spark commented on SPARK-38124: -- User 'c21' has created a pull request for this issue: https://github.com/apache/spark/pull/35529 > Revive HashClusteredDistribution and apply to stream-stream join > > > Key: SPARK-38124 > URL: https://issues.apache.org/jira/browse/SPARK-38124 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Blocker > Fix For: 3.3.0 > > > SPARK-35703 removed HashClusteredDistribution and replaced its usages with > ClusteredDistribution. > While this works great for non-stateful operators, we still need a > separate distribution requirement for stateful operators, because the > requirement of ClusteredDistribution is too relaxed while the requirement of > physical partitioning on stateful operators is quite strict. > In most cases, stateful operators must require child distribution as > HashClusteredDistribution, under the major assumptions below: > # HashClusteredDistribution creates HashPartitioning, and we will never > change that in the future. > # We will never change the implementation of {{partitionIdExpression}} > in HashPartitioning in the future, so that the Partitioner behaves > consistently across Spark versions. > # No partitioning except HashPartitioning can satisfy > HashClusteredDistribution. > > We should revive HashClusteredDistribution (probably renamed to something > specific to stateful operators) and apply the distribution to all > stateful operators. > SPARK-35703 only touched stream-stream join, which means stream-stream join > hasn't been broken in actual releases. Let's aim for a partial revert of > SPARK-35703 in this ticket, and open another ticket to deal with the other > stateful operators, which have been broken since their introduction (2.2+). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38124) Revive HashClusteredDistribution and apply to stream-stream join
[ https://issues.apache.org/jira/browse/SPARK-38124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492505#comment-17492505 ] Apache Spark commented on SPARK-38124: -- User 'c21' has created a pull request for this issue: https://github.com/apache/spark/pull/35529 > Revive HashClusteredDistribution and apply to stream-stream join > > > Key: SPARK-38124 > URL: https://issues.apache.org/jira/browse/SPARK-38124 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Blocker > Fix For: 3.3.0 > > > SPARK-35703 removed HashClusteredDistribution and replaced its usages with > ClusteredDistribution. > While this works great for non-stateful operators, we still need a > separate distribution requirement for stateful operators, because the > requirement of ClusteredDistribution is too relaxed while the requirement of > physical partitioning on stateful operators is quite strict. > In most cases, stateful operators must require child distribution as > HashClusteredDistribution, under the major assumptions below: > # HashClusteredDistribution creates HashPartitioning, and we will never > change that in the future. > # We will never change the implementation of {{partitionIdExpression}} > in HashPartitioning in the future, so that the Partitioner behaves > consistently across Spark versions. > # No partitioning except HashPartitioning can satisfy > HashClusteredDistribution. > > We should revive HashClusteredDistribution (probably renamed to something > specific to stateful operators) and apply the distribution to all > stateful operators. > SPARK-35703 only touched stream-stream join, which means stream-stream join > hasn't been broken in actual releases. Let's aim for a partial revert of > SPARK-35703 in this ticket, and open another ticket to deal with the other > stateful operators, which have been broken since their introduction (2.2+). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
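[Editorial note] To make assumption 2 concrete: streaming state in a checkpoint is keyed by partition id, so the key-to-partition mapping has to stay frozen across releases. A plain-Scala sketch of that contract follows; the hash here is a stand-in, not Spark's Murmur3-based {{partitionIdExpression}}.
{code:scala}
// Sketch of why the partition-id function must never change across versions.
// HashPartitioning computes pmod(hash(keys), numPartitions); hashCode below is
// only a stand-in for the real hash expression.
def partitionId(key: Any, numPartitions: Int): Int = {
  val h = key.hashCode()
  ((h % numPartitions) + numPartitions) % numPartitions // pmod: non-negative result
}

// If a release changed the hash or the modulo, rows for an existing key would
// be routed to a partition other than the one holding that key's saved state,
// silently corrupting stream-stream join results after a restart.
{code}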
[jira] [Updated] (SPARK-38217) insert overwrite failed for external table with dynamic partition table
[ https://issues.apache.org/jira/browse/SPARK-38217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YuanGuanhu updated SPARK-38217: --- Affects Version/s: (was: 3.3.0) > insert overwrite failed for external table with dynamic partition table > --- > > Key: SPARK-38217 > URL: https://issues.apache.org/jira/browse/SPARK-38217 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 >Reporter: YuanGuanhu >Priority: Major > > Insert overwrite into a dynamic partition table fails. Steps to reproduce with > Spark 3.2.1 / Hadoop 3.2: > sql("CREATE EXTERNAL TABLE exttb01(id int) PARTITIONED BY (p1 string, p2 > string) STORED AS PARQUET LOCATION '/tmp/exttb01'") > sql("set spark.sql.hive.convertMetastoreParquet=false") > sql("set hive.exec.dynamic.partition.mode=nonstrict") > val insertsql = "INSERT OVERWRITE TABLE exttb01 PARTITION(p1='n1', p2) SELECT > * FROM VALUES (1, 'n2'), (2, 'n3'), (3, 'n4') AS t(id, p2)" > sql(insertsql) > sql(insertsql) > When the insert overwrite is executed the second time, it fails: > > WARN Hive: Directory file:/tmp/exttb01/p1=n1/p2=n4 cannot be cleaned: > java.io.FileNotFoundException: File file:/tmp/exttb01/p1=n1/p2=n4 does not > exist > java.io.FileNotFoundException: File file:/tmp/exttb01/p1=n1/p2=n4 does not > exist > at > org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:597) > at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972) > at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014) > at > org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:761) > at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972) > at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014) > at > org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:3440) > at > org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1657) > at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1929) > at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1920) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 22/02/15 17:59:19 WARN Hive: Directory file:/tmp/exttb01/p1=n1/p2=n3 cannot > be cleaned: java.io.FileNotFoundException: File file:/tmp/exttb01/p1=n1/p2=n3 > does not exist > java.io.FileNotFoundException: File file:/tmp/exttb01/p1=n1/p2=n3 does not > exist > at > org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:597) > at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972) > at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014) > at > org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:761) > at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972) > at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014) > at > org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:3440) > at > org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1657) > at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1929) > at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1920) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at 
java.lang.Thread.run(Thread.java:748) > 22/02/15 17:59:19 WARN Hive: Directory file:/tmp/exttb01/p1=n1/p2=n2 cannot > be cleaned: java.io.FileNotFoundException: File file:/tmp/exttb01/p1=n1/p2=n2 > does not exist > java.io.FileNotFoundException: File file:/tmp/exttb01/p1=n1/p2=n2 does not > exist > at > org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:597) > at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972) > at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014) > at > org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:761) > at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972) > at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014) > at > org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:3440) > at > org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1657) > at
[jira] [Updated] (SPARK-38217) insert overwrite failed for external table with dynamic partition table
[ https://issues.apache.org/jira/browse/SPARK-38217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YuanGuanhu updated SPARK-38217: --- Description: Insert overwrite into a dynamic partition table fails. Steps to reproduce with Spark 3.2.1 / Hadoop 3.2: sql("CREATE EXTERNAL TABLE exttb01(id int) PARTITIONED BY (p1 string, p2 string) STORED AS PARQUET LOCATION '/tmp/exttb01'") sql("set spark.sql.hive.convertMetastoreParquet=false") sql("set hive.exec.dynamic.partition.mode=nonstrict") val insertsql = "INSERT OVERWRITE TABLE exttb01 PARTITION(p1='n1', p2) SELECT * FROM VALUES (1, 'n2'), (2, 'n3'), (3, 'n4') AS t(id, p2)" sql(insertsql) sql(insertsql) When the insert overwrite is executed the second time, it fails: WARN Hive: Directory file:/tmp/exttb01/p1=n1/p2=n4 cannot be cleaned: java.io.FileNotFoundException: File file:/tmp/exttb01/p1=n1/p2=n4 does not exist java.io.FileNotFoundException: File file:/tmp/exttb01/p1=n1/p2=n4 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:597) at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972) at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014) at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:761) at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972) at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014) at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:3440) at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1657) at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1929) at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1920) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 22/02/15 17:59:19 WARN Hive: Directory file:/tmp/exttb01/p1=n1/p2=n3 cannot be cleaned: java.io.FileNotFoundException: File file:/tmp/exttb01/p1=n1/p2=n3 does not exist java.io.FileNotFoundException: File file:/tmp/exttb01/p1=n1/p2=n3 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:597) at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972) at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014) at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:761) at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972) at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014) at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:3440) at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1657) at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1929) at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1920) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 22/02/15 17:59:19 WARN Hive: Directory file:/tmp/exttb01/p1=n1/p2=n2 cannot be cleaned: java.io.FileNotFoundException: File file:/tmp/exttb01/p1=n1/p2=n2 does not exist java.io.FileNotFoundException: File file:/tmp/exttb01/p1=n1/p2=n2 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:597) at 
org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972) at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014) at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:761) at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972) at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014) at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:3440) at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1657) at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1929) at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1920) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) was:can't insert overwrite dynamic partition table, reproduce step: > insert overwrite failed for external table with dynamic partition table >
[jira] [Created] (SPARK-38217) insert overwrite failed for external table with dynamic partition table
YuanGuanhu created SPARK-38217: -- Summary: insert overwrite failed for external table with dynamic partition table Key: SPARK-38217 URL: https://issues.apache.org/jira/browse/SPARK-38217 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.1, 3.3.0 Reporter: YuanGuanhu Insert overwrite into a dynamic partition table fails; steps to reproduce: -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
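[Editorial note] For readability, the reproduction from the updated description above, laid out as spark-shell statements (table name, location, and settings exactly as reported):
{code:scala}
// Spark 3.2.1 / Hadoop 3.2, run in spark-shell (where sql() is in scope).
sql("CREATE EXTERNAL TABLE exttb01(id int) PARTITIONED BY (p1 string, p2 string) " +
  "STORED AS PARQUET LOCATION '/tmp/exttb01'")
sql("set spark.sql.hive.convertMetastoreParquet=false")
sql("set hive.exec.dynamic.partition.mode=nonstrict")

val insertsql = "INSERT OVERWRITE TABLE exttb01 PARTITION(p1='n1', p2) " +
  "SELECT * FROM VALUES (1, 'n2'), (2, 'n3'), (3, 'n4') AS t(id, p2)"
sql(insertsql) // first run succeeds
sql(insertsql) // second run logs FileNotFoundException while cleaning old partition dirs
{code}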
[jira] [Commented] (SPARK-38215) InsertIntoHiveDir support convert metadata
[ https://issues.apache.org/jira/browse/SPARK-38215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492490#comment-17492490 ] Apache Spark commented on SPARK-38215: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/35528 > InsertIntoHiveDir support convert metadata > -- > > Key: SPARK-38215 > URL: https://issues.apache.org/jira/browse/SPARK-38215 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.2.1 >Reporter: angerszhu >Priority: Major > > Currently the InsertIntoHiveDir command uses the Hive SerDe to write data and > doesn't support conversion, so such SQL can't write Parquet with zstd. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38215) InsertIntoHiveDir support convert metadata
[ https://issues.apache.org/jira/browse/SPARK-38215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38215: Assignee: Apache Spark > InsertIntoHiveDir support convert metadata > -- > > Key: SPARK-38215 > URL: https://issues.apache.org/jira/browse/SPARK-38215 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.2.1 >Reporter: angerszhu >Assignee: Apache Spark >Priority: Major > > Currently the InsertIntoHiveDir command uses the Hive SerDe to write data and > doesn't support conversion, so such SQL can't write Parquet with zstd. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38215) InsertIntoHiveDir support convert metadata
[ https://issues.apache.org/jira/browse/SPARK-38215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38215: Assignee: (was: Apache Spark) > InsertIntoHiveDir support convert metadata > -- > > Key: SPARK-38215 > URL: https://issues.apache.org/jira/browse/SPARK-38215 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.2.1 >Reporter: angerszhu >Priority: Major > > Currently the InsertIntoHiveDir command uses the Hive SerDe to write data and > doesn't support conversion, so such SQL can't write Parquet with zstd. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38216) When creating a Hive table, fail early if all the columns are partitioned columns
[ https://issues.apache.org/jira/browse/SPARK-38216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492489#comment-17492489 ] Apache Spark commented on SPARK-38216: -- User 'Yikf' has created a pull request for this issue: https://github.com/apache/spark/pull/35527 > When creating a Hive table, fail early if all the columns are partitioned > columns > - > > Key: SPARK-38216 > URL: https://issues.apache.org/jira/browse/SPARK-38216 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: yikf >Priority: Minor > Fix For: 3.3.0 > > > In Hive the schema and partition columns must be disjoint sets. If a Hive table > has all of its columns as partition columns, the remaining data schema is empty > and table creation fails inside Hive with the error: > ` > throw new HiveException( > "at least one column must be specified for the table") > ` > So when creating a Hive table, we should fail early if all the columns are > partition columns. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38216) When creating a Hive table, fail early if all the columns are partitioned columns
[ https://issues.apache.org/jira/browse/SPARK-38216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492488#comment-17492488 ] Apache Spark commented on SPARK-38216: -- User 'Yikf' has created a pull request for this issue: https://github.com/apache/spark/pull/35527 > When creating a Hive table, fail early if all the columns are partitioned > columns > - > > Key: SPARK-38216 > URL: https://issues.apache.org/jira/browse/SPARK-38216 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: yikf >Priority: Minor > Fix For: 3.3.0 > > > In Hive the schema and partition columns must be disjoint sets. If a Hive table > has all of its columns as partition columns, the remaining data schema is empty > and table creation fails inside Hive with the error: > ` > throw new HiveException( > "at least one column must be specified for the table") > ` > So when creating a Hive table, we should fail early if all the columns are > partition columns. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38216) When creating a Hive table, fail early if all the columns are partitioned columns
[ https://issues.apache.org/jira/browse/SPARK-38216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38216: Assignee: (was: Apache Spark) > When creating a Hive table, fail early if all the columns are partitioned > columns > - > > Key: SPARK-38216 > URL: https://issues.apache.org/jira/browse/SPARK-38216 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: yikf >Priority: Minor > Fix For: 3.3.0 > > > In Hive the schema and partition columns must be disjoint sets. If a Hive table > has all of its columns as partition columns, the remaining data schema is empty > and table creation fails inside Hive with the error: > ` > throw new HiveException( > "at least one column must be specified for the table") > ` > So when creating a Hive table, we should fail early if all the columns are > partition columns. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38216) When creating a Hive table, fail early if all the columns are partitioned columns
[ https://issues.apache.org/jira/browse/SPARK-38216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38216: Assignee: Apache Spark > When creating a Hive table, fail early if all the columns are partitioned > columns > - > > Key: SPARK-38216 > URL: https://issues.apache.org/jira/browse/SPARK-38216 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: yikf >Assignee: Apache Spark >Priority: Minor > Fix For: 3.3.0 > > > In Hive the schema and partition columns must be disjoint sets. If a Hive table > has all of its columns as partition columns, the remaining data schema is empty > and table creation fails inside Hive with the error: > ` > throw new HiveException( > "at least one column must be specified for the table") > ` > So when creating a Hive table, we should fail early if all the columns are > partition columns. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38216) When creating a Hive table, fail early if all the columns are partitioned columns
yikf created SPARK-38216: Summary: When creating a Hive table, fail early if all the columns are partitioned columns Key: SPARK-38216 URL: https://issues.apache.org/jira/browse/SPARK-38216 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: yikf Fix For: 3.3.0 In Hive the schema and partition columns must be disjoint sets. If a Hive table has all of its columns as partition columns, the remaining data schema is empty and table creation fails inside Hive, such as ` throw new HiveException( "at least one column must be specified for the table") ` So when creating a Hive table, we should fail early if all the columns are partition columns. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38216) When creating a Hive table, fail early if all the columns are partitioned columns
[ https://issues.apache.org/jira/browse/SPARK-38216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yikf updated SPARK-38216: - Description: In Hive the schema and partition columns must be disjoint sets. If a Hive table has all of its columns as partition columns, the remaining data schema is empty and table creation fails inside Hive with the error: ` throw new HiveException( "at least one column must be specified for the table") ` So when creating a Hive table, we should fail early if all the columns are partition columns. was: In Hive the schema and partition columns must be disjoint sets. If a Hive table has all of its columns as partition columns, the remaining data schema is empty and table creation fails inside Hive, such as ` throw new HiveException( "at least one column must be specified for the table") ` So when creating a Hive table, we should fail early if all the columns are partition columns. > When creating a Hive table, fail early if all the columns are partitioned > columns > - > > Key: SPARK-38216 > URL: https://issues.apache.org/jira/browse/SPARK-38216 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: yikf >Priority: Minor > Fix For: 3.3.0 > > > In Hive the schema and partition columns must be disjoint sets. If a Hive table > has all of its columns as partition columns, the remaining data schema is empty > and table creation fails inside Hive with the error: > ` > throw new HiveException( > "at least one column must be specified for the table") > ` > So when creating a Hive table, we should fail early if all the columns are > partition columns. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
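[Editorial note] For illustration, the degenerate DDL this check would reject up front; the exact syntax that reaches the Hive client depends on the table provider, so treat the statement below as an assumed example:
{code:scala}
// Hypothetical example: every column is also a partition column, leaving an
// empty non-partition schema. Today this is only rejected deep inside the
// Hive client at creation time; the proposal is to fail fast in Spark instead.
sql("""
  CREATE TABLE all_partitioned (p1 STRING, p2 STRING)
  USING hive
  PARTITIONED BY (p1, p2)
""")
{code}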
[jira] [Created] (SPARK-38215) InsertIntoHiveDir support convert metadata
angerszhu created SPARK-38215: - Summary: InsertIntoHiveDir support convert metadata Key: SPARK-38215 URL: https://issues.apache.org/jira/browse/SPARK-38215 Project: Spark Issue Type: Task Components: SQL Affects Versions: 3.2.1 Reporter: angerszhu Currently the InsertIntoHiveDir command uses the Hive SerDe to write data and doesn't support conversion, so such SQL can't write Parquet with zstd. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
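[Editorial note] As I read the report, the affected statement shape is a Hive-syntax directory write: it goes through the Hive SerDe rather than Spark's native Parquet writer, so the native writer's compression setting is not honored. The path and source table below are illustrative.
{code:scala}
// The compression codec is requested via Spark's native Parquet writer config...
sql("SET spark.sql.parquet.compression.codec=zstd")

// ...but a Hive-syntax directory insert is written through the Hive SerDe,
// which ignores that setting, so the output is not zstd-compressed Parquet.
sql("""
  INSERT OVERWRITE DIRECTORY '/tmp/out'
  STORED AS PARQUET
  SELECT * FROM src
""")
{code}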
[jira] [Assigned] (SPARK-38214) No need to filter data when the sliding window length is not redundant
[ https://issues.apache.org/jira/browse/SPARK-38214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38214: Assignee: (was: Apache Spark) > No need to filter data when the sliding window length is not redundant > -- > > Key: SPARK-38214 > URL: https://issues.apache.org/jira/browse/SPARK-38214 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.2.1 >Reporter: nyingping >Priority: Minor > > At present, the sliding window is implemented as expand + filter, but in > some cases the filter is not necessary. > The filter is required only if the sliding window is irregular. When the window > length is evenly divisible by the slide length (I believe this covers most > practical sliding-window workloads), there is no need to filter, which saves > compute resources and improves performance. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38214) No need to filter data when the sliding window length is not redundant
[ https://issues.apache.org/jira/browse/SPARK-38214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492460#comment-17492460 ] Apache Spark commented on SPARK-38214: -- User 'nyingping' has created a pull request for this issue: https://github.com/apache/spark/pull/35526 > No need to filter data when the sliding window length is not redundant > -- > > Key: SPARK-38214 > URL: https://issues.apache.org/jira/browse/SPARK-38214 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.2.1 >Reporter: nyingping >Priority: Minor > > At present, the sliding window is implemented as expand + filter, but in > some cases the filter is not necessary. > The filter is required only if the sliding window is irregular. When the window > length is evenly divisible by the slide length (I believe this covers most > practical sliding-window workloads), there is no need to filter, which saves > compute resources and improves performance. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38214) No need to filter data when the sliding window length is not redundant
[ https://issues.apache.org/jira/browse/SPARK-38214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38214: Assignee: Apache Spark > No need to filter data when the sliding window length is not redundant > -- > > Key: SPARK-38214 > URL: https://issues.apache.org/jira/browse/SPARK-38214 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.2.1 >Reporter: nyingping >Assignee: Apache Spark >Priority: Minor > > At present, the sliding window is implemented as expand + filter, but in > some cases the filter is not necessary. > The filter is required only if the sliding window is irregular. When the window > length is evenly divisible by the slide length (I believe this covers most > practical sliding-window workloads), there is no need to filter, which saves > compute resources and improves performance. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36808) Upgrade Kafka to 2.8.1
[ https://issues.apache.org/jira/browse/SPARK-36808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492452#comment-17492452 ] Kousuke Saruta commented on SPARK-36808: Ah, O.K. I misunderstood. I'll withdraw the PRs. > Upgrade Kafka to 2.8.1 > -- > > Key: SPARK-36808 > URL: https://issues.apache.org/jira/browse/SPARK-36808 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.3.0 > > > A few hours ago, Kafka 2.8.1 was released, which includes a bunch of bug fixes. > https://downloads.apache.org/kafka/2.8.1/RELEASE_NOTES.html -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38214) No need to filter data when the sliding window length is not redundant
nyingping created SPARK-38214: - Summary: No need to filter data when the sliding window length is not redundant Key: SPARK-38214 URL: https://issues.apache.org/jira/browse/SPARK-38214 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.2.1 Reporter: nyingping At present, the sliding window is implemented as expand + filter, but in some cases the filter is not necessary. The filter is required only if the sliding window is irregular. When the window length is evenly divisible by the slide length (I believe this covers most practical sliding-window workloads), there is no need to filter, which saves compute resources and improves performance. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
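[Editorial note] To see why the filter is redundant exactly when the window length is a multiple of the slide: the expand step emits a fixed ceil(window/slide) candidate windows per row, and the filter keeps those that actually contain the event timestamp. A plain-Scala sketch follows (millisecond longs, window origin at the epoch; it mirrors the idea, not Spark's generated code):
{code:scala}
// Enumerate the sliding windows that actually contain ts. Spark expands each
// row into ceil(window/slide) candidates and filters; when window % slide == 0
// this always yields exactly window/slide windows, so the filter drops nothing
// and can be skipped.
def windowsFor(ts: Long, window: Long, slide: Long): Seq[(Long, Long)] = {
  require(window > 0 && slide > 0)
  val lastStart = ts - (ts % slide + slide) % slide // latest aligned start <= ts
  Iterator.iterate(lastStart)(_ - slide)
    .takeWhile(_ > ts - window)                     // window must still cover ts
    .map(start => (start, start + window))
    .toSeq
}

// window = 10, slide = 5 -> always 2 windows per event: filter is redundant.
// window = 10, slide = 4 -> 2 or 3 windows per event: filter is required.
{code}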
[jira] [Commented] (SPARK-36808) Upgrade Kafka to 2.8.1
[ https://issues.apache.org/jira/browse/SPARK-36808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492440#comment-17492440 ] Dongjoon Hyun commented on SPARK-36808: --- I approved the first one and added comments for the other two PRs. > Upgrade Kafka to 2.8.1 > -- > > Key: SPARK-36808 > URL: https://issues.apache.org/jira/browse/SPARK-36808 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.3.0 > > > A few hours ago, Kafka 2.8.1 was released, which includes a bunch of bug fixes. > https://downloads.apache.org/kafka/2.8.1/RELEASE_NOTES.html -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36808) Upgrade Kafka to 2.8.1
[ https://issues.apache.org/jira/browse/SPARK-36808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492437#comment-17492437 ] Dongjoon Hyun commented on SPARK-36808: --- Please consider `branch-3.2` only. > Upgrade Kafka to 2.8.1 > -- > > Key: SPARK-36808 > URL: https://issues.apache.org/jira/browse/SPARK-36808 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.3.0 > > > A few hours ago, Kafka 2.8.1 was released, which includes a bunch of bug fixes. > https://downloads.apache.org/kafka/2.8.1/RELEASE_NOTES.html -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org